Understanding LeNet (LeCun, 1998)

As an attempt to better understand Convolutional Neural Networks (CNN/ConvNet), I was advised to read the section about LeNet-5 in the original paper and figure out where every number comes from.

Figure: LeNet-5 architecture (LeCun, 1998)
Input layer

The input of this neural network is an image of size 32*32 pixels, where each pixel is represented by an input neuron. Pixel values 0..255 are normalized to -0.1..1.175 so that the mean is roughly 0 and the variance is around 1.

The hidden layers (C1 up to F6) use the hyperbolic tangent (tanh) as their activation function.
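
As a quick illustration (a minimal sketch in Python with NumPy; the function name is mine), the normalization is just a linear rescaling of the pixel range:

    import numpy as np

    def normalize(pixels):
        # Map raw pixel values 0..255 linearly onto -0.1..1.175,
        # which makes the mean roughly 0 and the variance around 1.
        return -0.1 + (pixels / 255.0) * (1.175 + 0.1)

    image = np.random.randint(0, 256, size=(32, 32))
    normalized = normalize(image)
    print(normalized.min(), normalized.max())  # stays within [-0.1, 1.175]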

Convolution layer 1 (C1)

Since local receptive fields of size 5*5 are chosen, the shared weight size (kernel) is also 5*5 for each feature map. Since each kernel has 1 bias and 6 feature maps are required, the number of trainable parameters (weights & biases) is

trainable params = (weight * input maps + bias) * feature maps = (5 * 5 * 1 + 1) * 6 = 156 

Since the size of each feature map is 28*28,

connections = (input + bias) * feature maps * feature map size = (5 * 5 + 1) * 6 * 28 * 28 = 122304 

The feature map size of 28*28 is the consequence of sliding the 5*5 kernel one pixel at a time over the 32*32 input (32 - 5 + 1 = 28), so neighboring receptive fields intentionally overlap by 4 columns.
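
The arithmetic is easy to check in code. Here is a minimal sketch (plain Python; the helper name and signature are my own) that reproduces both numbers:

    def conv_layer_stats(kernel_hw, in_maps, out_maps, out_hw):
        # One kernel of size kernel_hw per (input map, feature map) pair,
        # plus one bias per feature map; every output pixel reuses the
        # same shared weights, so connections = params * output pixels.
        params = (kernel_hw * in_maps + 1) * out_maps
        connections = params * out_hw
        return params, connections

    print(conv_layer_stats(5 * 5, 1, 6, 28 * 28))  # (156, 122304)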

Subsampling layer 2 (S2)

In this layer the kernel size is 2*2 and weights are shared. The differences from C1 are that the receptive fields do not overlap and that only 1 weight (coefficient) and 1 bias are used per feature map. Since the output of this layer is 6 feature maps (the same as the input maps),

trainable params = (weight + bias) * feature maps = (1 + 1) * 6 = 12 

Since the size of each feature map is 14*14,

connections = (input + bias) * feature maps * feature map size = (2 * 2 + 1) * 6 * 14 * 14 = 5880 
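
The same kind of check works for subsampling layers; a small sketch (plain Python, helper name mine):

    def subsample_layer_stats(pool_hw, maps, out_hw):
        # One trainable coefficient and one bias per feature map; each
        # output pixel is connected to pool_hw inputs plus the bias.
        params = (1 + 1) * maps
        connections = (pool_hw + 1) * maps * out_hw
        return params, connections

    print(subsample_layer_stats(2 * 2, 6, 14 * 14))  # (12, 5880)
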
Convolution layer 3 (C3)

The C3 layer is similar to C1 except that there is more than one input map and each (output) feature map is connected to a different subset of input maps. This is the arrangement of the 16 feature maps of size 10*10:

  • The first 6 feature maps are each connected to 3 contiguous input maps (neighboring subsets overlap by 2 maps)
  • The next 6 feature maps are each connected to 4 contiguous input maps (overlapping by 3 maps)
  • The next 3 feature maps are each connected to 4 discontinuous input maps (overlapping by 1 map)
  • The last feature map is connected to all 6 input maps

Hence,

trainable params = (weight * input maps + bias) * feature maps
1st group = (5 * 5 * 3 + 1) * 6 = 456
2nd group = (5 * 5 * 4 + 1) * 6 = 606
3rd group = (5 * 5 * 4 + 1) * 3 = 303
4th group = (5 * 5 * 6 + 1) * 1 = 151
all groups = 456 + 606 + 303 + 151 = 1516

then,

connections = (input + bias) * feature maps * feature map size = trainable params * feature map size = 1516 * 10 * 10 = 151600 
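
Reusing the conv_layer_stats sketch from C1, the four groups can be summed directly (the group list mirrors the arrangement above):

    groups = [(3, 6), (4, 6), (4, 3), (6, 1)]  # (input maps per feature map, feature maps)
    params = sum(conv_layer_stats(5 * 5, n_in, n_maps, 10 * 10)[0]
                 for n_in, n_maps in groups)
    print(params, params * 10 * 10)  # 1516 151600
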
Subsampling layer 4 (S4)

Similar to S2 except that the number of feature maps is 16 (the same as the input maps), and each of them is 5*5 pixels. Hence,

trainable params = (weight + bias) * feature maps = (1 + 1) * 16 = 32 

and,

connections = (input + bias) * feature maps * feature map size = (2 * 2 + 1) * 16 * 5 * 5 = 2000 
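
The subsample_layer_stats sketch from S2 confirms these numbers as well:

    print(subsample_layer_stats(2 * 2, 16, 5 * 5))  # (32, 2000)
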
Convolution layer 5 (C5)

The last convolution layer is similar to C3 except that the number of feature maps is 120 and each of them is connected to all 16 input maps. Since the input maps are 5*5 and the kernel is also 5*5, each output feature map is a single pixel (1*1). Hence,

trainable params = (weight * input maps + bias) * feature maps = (5 * 5 * 16 + 1) * 120 = 48120 

then,

connections = (input + bias) * feature maps * feature map size = trainable params * feature map size = 48120 * 1 * 1 = 48120 
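
And the conv_layer_stats sketch from C1 covers C5, where parameters and connections coincide because each output map is a single pixel:

    print(conv_layer_stats(5 * 5, 16, 120, 1 * 1))  # (48120, 48120)
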
Fully-connected layer (F6)

This layer is just a plain fully-connected layer with 84 output neurons. Hence,

trainable params = connections = (input + bias) * output = (120 + 1) * 84 = 10164 
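
For a fully-connected layer, parameters and connections coincide since no weights are shared (plain Python, helper name mine):

    def fc_layer_stats(inputs, outputs):
        # Every output neuron has one weight per input plus a bias.
        params = (inputs + 1) * outputs
        return params, params  # connections == params

    print(fc_layer_stats(120, 84))  # (10164, 10164)
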
Output layer

Finally, the output layer consists of 10 Euclidean Radial Basis Function (RBF) units, one per class. Each RBF unit computes the squared Euclidean distance between its 84-dimensional input vector and a fixed parameter vector, so the unit with the smallest output indicates the predicted class.
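
A minimal sketch of such an RBF unit (Python with NumPy; in the paper the parameter vectors are fixed -1/+1 codes drawn from stylized 7*12 bitmaps of the characters, so the random values here are placeholders only):

    import numpy as np

    W = np.random.choice([-1.0, 1.0], size=(10, 84))  # placeholder for the fixed class codes

    def rbf_output(x):
        # y_i = sum_j (x_j - w_ij)^2 : squared Euclidean distance per class.
        return ((x[None, :] - W) ** 2).sum(axis=1)

    x = np.random.randn(84)
    print(rbf_output(x).argmin())  # index of the closest class code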

And we are done!
