First of all, thank you for the great learning material!
In the chapter on the LeNet architecture, you mention that your implementation matches the historical definition of LeNet-5 (Gradient-Based Learning Applied to Document Recognition) except for the last layer, but I found two other inconsistencies when comparing it against subsection B, LeNet-5, of the paper.
The LeNet paper does not describe the pooling layer as an average pooling layer. Instead, it describes a layer that sums each 2x2 neighborhood of the input feature map, multiplies the sum by a trainable weight, adds a trainable bias, and finally passes the result through a sigmoidal function.
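To make the difference concrete, here is a minimal pure-Python sketch of the subsampling layer as I read it from the paper, next to plain average pooling. The names `lenet_subsample`, `average_pool`, `coeff`, `bias`, and `squash` are my own, and which sigmoidal function to pass as `squash` is left open here; this is an illustration, not the book's or the paper's code.

```python
def lenet_subsample(fmap, coeff, bias, squash):
    """Sketch of a LeNet-5 style subsampling layer for one feature map.

    fmap   : list of rows (height and width assumed even)
    coeff  : trainable scalar coefficient (one per feature map)
    bias   : trainable scalar bias (one per feature map)
    squash : the sigmoidal function applied at the end (assumption: any callable)
    """
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            # Sum over the non-overlapping 2x2 neighborhood.
            s = fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]
            # Scale, shift, then squash.
            row.append(squash(coeff * s + bias))
        out.append(row)
    return out


def average_pool(fmap):
    """Plain 2x2 average pooling with stride 2, for comparison."""
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            s = fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]
            row.append(s / 4.0)
        out.append(row)
    return out


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

# With coeff = 0.25, bias = 0 and an identity "squash", the layer reduces
# to average pooling; the trainable parameters and the nonlinearity are
# what make the two differ in general.
same = lenet_subsample(x, 0.25, 0.0, lambda v: v) == average_pool(x)
print(same)
```

So average pooling is the special case with a fixed coefficient of 1/4, zero bias, and no nonlinearity, whereas the layer described in the paper learns the coefficient and bias.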
According to the LeNet paper, the activation function used in both the convolutional and fully connected layers is a scaled hyperbolic tangent, not the sigmoid used in the code. The two functions look similar but have different output ranges (http://m.wolframalpha.com/input/?i=tanh(a)%2C+sigmoid(a))
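For a quick numeric comparison, here is a small sketch of the two functions. It assumes the scaled tanh has the form f(a) = A * tanh(S * a) with A = 1.7159 and S = 2/3, which is how I recall it from the paper's appendix; please check the constants against the paper. The function names are my own.

```python
import math

# Assumed constants for the scaled tanh, f(a) = A * tanh(S * a).
A = 1.7159
S = 2.0 / 3.0


def scaled_tanh(a):
    """Scaled hyperbolic tangent; output lies in (-A, A)."""
    return A * math.tanh(S * a)


def sigmoid(a):
    """Logistic sigmoid; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


# Print both activations side by side for a few inputs.
for a in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"a={a:+.1f}  scaled_tanh={scaled_tanh(a):+.4f}  sigmoid={sigmoid(a):.4f}")
```

With these constants the scaled tanh is zero-centered and approximately satisfies f(1) = 1 and f(-1) = -1, while the sigmoid is always positive and equals 0.5 at zero.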
If there is something I missed and your implementation of LeNet5 is correct, please let me know.
Pooling was called sub-sampling in the original paper. According to page 6 of the paper:
"This can be achieved with a so-called subsampling layers which performs a local averaging and a subsampling reducing the resolution of the feature map and reducing the sensitivity of the output to shifts and distortions"
Also, regarding tanh vs. sigmoid: it seems that tanh converges faster than sigmoid, which was especially useful 20 years ago, when compute power was much more limited.
Hopefully it helps!