As an attempt to understand Convolutional Neural Networks (CNN/ConvNet) better, I was advised to read the section about LeNet-5 in the original paper and figure out where every number comes from.
Input layer
The input of this neural network is an image of size 32*32 pixels, where each pixel is represented by an input neuron. Pixel values 0..255 are normalized to -0.1..1.175 so that the mean is 0 and the variance is roughly 1.
The hidden layers (C1 up to F6) use the hyperbolic tangent (tanh) as their activation function.
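As a quick sketch of that normalization (the linear mapping is my assumption, derived from the two endpoints; `normalize` is a hypothetical helper name):

```python
# Map a raw pixel value 0..255 linearly onto -0.1..1.175.
# The endpoints come from the text above; the linear form is an assumption.
def normalize(pixel: int) -> float:
    return pixel / 255.0 * (1.175 - (-0.1)) + (-0.1)

print(normalize(0), normalize(255))  # -0.1 1.175 (up to float rounding)
```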
Convolution layer 1 (C1)
Since a local receptive field of size 5*5 is chosen, the shared weight (kernel) size is also 5*5 for each feature map. Since each kernel has 1 bias and 6 feature maps are required, the number of trainable parameters (weights & biases) is
trainable params = (weight * input maps + bias) * feature maps = (5 * 5 * 1 + 1) * 6 = 156
Since the size of each feature map is 28*28,
connections = (input + bias) * feature maps * feature map size = (5 * 5 + 1) * 6 * 28 * 28 = 122304
The feature map size 28*28 follows from sliding the 5*5 kernel one pixel at a time over the 32*32 input (32 - 5 + 1 = 28), so neighboring receptive fields intentionally overlap by 4 columns (and rows).
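To sanity-check these numbers, here is a small Python sketch (`conv_stats` is my own helper name, not something from the paper):

```python
def conv_stats(kernel: int, in_maps: int, out_maps: int, out_size: int):
    # Weights are shared across each feature map, so only the kernel weights,
    # the input maps, and one bias per output map count as parameters.
    params = (kernel * kernel * in_maps + 1) * out_maps
    # Every output pixel reuses those shared parameters once.
    connections = params * out_size * out_size
    return params, connections

print(conv_stats(kernel=5, in_maps=1, out_maps=6, out_size=28))  # (156, 122304)
```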
Subsampling layer 2 (S2)
In this layer the kernel size is 2*2 and the weights are shared. The differences from C1 are that there is no pixel overlap, and only 1 weight and 1 bias are used per feature map. Since the output of this layer is 6 feature maps (the same as the input maps),
trainable params = (weight + bias) * feature maps = (1 + 1) * 6 = 12
Since the size of each feature map is 14*14 (28/2 in each dimension),
connections = (input + bias) * feature maps * feature map size = (2 * 2 + 1) * 6 * 14 * 14 = 5880
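Again as a sketch (the helper name is mine), the subsampling arithmetic checks out:

```python
def subsample_stats(pool: int, maps: int, out_size: int):
    params = (1 + 1) * maps  # one trainable coefficient and one bias per map
    connections = (pool * pool + 1) * maps * out_size * out_size
    return params, connections

print(subsample_stats(pool=2, maps=6, out_size=14))  # (12, 5880)
```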
Convolution layer 3 (C3)
The C3 layer is similar to C1 except that there is more than one input map and each (output) feature map is connected to a different subset of input maps. This is the arrangement of the 16 feature maps of size 10*10:
- The first 6 feature maps are connected to 3 contiguous input maps each (overlapping 2 maps)
- The second 6 feature maps are connected to 4 contiguous input maps each (overlapping 3 maps)
- The next 3 feature maps are connected to 4 discontinuous input maps each (overlapping 1 map)
- The last 1 feature map is connected to all 6 input maps
Hence,
trainable params = (weight * input maps + bias) * feature maps
1st group = (5 * 5 * 3 + 1) * 6 = 456
2nd group = (5 * 5 * 4 + 1) * 6 = 606
3rd group = (5 * 5 * 4 + 1) * 3 = 303
4th group = (5 * 5 * 6 + 1) * 1 = 151
all groups = 456 + 606 + 303 + 151 = 1516
then,
connections = (input + bias) * feature maps * feature map size = trainable params * feature map size = 1516 * 10 * 10 = 151600
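The same bookkeeping in Python (the group table below just restates the list above):

```python
# (input maps per feature map, number of feature maps) for the four groups
groups = [(3, 6), (4, 6), (4, 3), (6, 1)]

params = sum((5 * 5 * n_in + 1) * n_maps for n_in, n_maps in groups)
connections = params * 10 * 10
print(params, connections)  # 1516 151600
```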
Subsampling layer 4 (S4)
Similar to S2 except that the number of feature maps is 16 (the same as the input maps), and each of them is 5*5 pixels. Hence,
trainable params = (weight + bias) * feature maps = (1 + 1) * 16 = 32
and,
connections = (input + bias) * feature maps * feature map size = (2 * 2 + 1) * 16 * 5 * 5 = 2000
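A one-off check with the same subsampling arithmetic:

```python
params = (1 + 1) * 16
connections = (2 * 2 + 1) * 16 * 5 * 5
print(params, connections)  # 32 2000
```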
Convolution layer 5 (C5)
The last convolution layer is similar to C3 except that the number of feature maps is 120 and each of them is connected to all input maps. Since the 5*5 maps of S4 are convolved with a 5*5 kernel, each C5 feature map is 1*1. Hence,
trainable params = (weight * input maps + bias) * feature maps = (5 * 5 * 16 + 1) * 120 = 48120
then,
connections = (input + bias) * feature maps * feature map size = trainable params * feature map size = 48120 * 1 * 1 = 48120
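Checking in Python (the 1*1 output size is what makes the connection count equal the parameter count):

```python
params = (5 * 5 * 16 + 1) * 120
out_size = 5 - 5 + 1  # a 5x5 input convolved with a 5x5 kernel -> 1x1
connections = params * out_size * out_size
print(params, connections)  # 48120 48120
```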
Fully-connected layer (F6)
This layer is just a plain fully-connected layer with 84 output neurons. Hence,
trainable params = connections = (input + bias) * output = (120 + 1) * 84 = 10164
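And the corresponding check:

```python
params = (120 + 1) * 84  # every C5 output plus one bias feeds each F6 neuron
print(params)  # 10164
```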
Output layer
Finally, the output layer consists of 10 Euclidean Radial Basis Function (RBF) units, one per class; each unit computes the Euclidean distance between its 84-dimensional input vector and a parameter vector.
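As a final sanity check, summing the per-layer counts gives 60,000 trainable parameters in total (the RBF parameter vectors are excluded, since the paper fixes them by hand rather than training them):

```python
layer_params = {"C1": 156, "S2": 12, "C3": 1516, "S4": 32, "C5": 48120, "F6": 10164}
print(sum(layer_params.values()))  # 60000
```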
And we are done!