Convolutional Neural Networks — Part 4: The Pooling and Fully Connected Layer
This is the fourth part of my blog post series on convolutional neural networks. Here are the pre-requisite parts for this post:
- Part 1: Edge Detection
- Part 2: Padding and Strided Convolutions
- Part 3: Convolutions Over Volume and The Convolutional Layer
This final part of the series covers the pooling layer and the fully connected layer:
1. Pooling Layer
In addition to convolutional layers, ConvNets often use pooling layers to reduce the size of the representation, to speed up computation, and to make some of the detected features a bit more robust.
1.1 Max Pooling
Suppose you have a 4 by 4 input and you want to apply a type of pooling called max pooling. The output of this particular implementation of max pooling will be 2 by 2. The procedure is quite simple: take your 4 by 4 input and break it into four regions (colored as shown in figure 1). Then each element of the 2 by 2 output is just the max of the corresponding shaded region. Notice that in the lower left green region the biggest number is 6, and in the red lower right region the biggest number is 3. So to compute each number of the output on the right, you take the max over a 2 by 2 region. This is as if you applied a filter of size 2 (f = 2), because you're taking 2 by 2 regions, with a stride of 2 (s = 2), because you move the filter 2 steps to reach the next colored region.
As we can see from the GIF illustration, the filter size f and the stride s are the hyperparameters of max pooling: we start with the upper left 2 by 2 filter (indigo colored) and get a value of 9. Then we step over by 2 to the upper right filter (light blue) and get 2 (since 2 is the max value in that light blue region). For the next row, we step the filter down 2 steps to get 6, and then take 2 steps to the right to get 3. A minimal sketch of this computation is shown below.
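Here is a minimal NumPy sketch (my own illustration, not from the course) of 2 by 2 max pooling with a stride of 2. The input values are chosen so that the four quadrant maxima come out to 9, 2, 6 and 3, matching the example above:

```python
import numpy as np

def max_pool_2d(x, f=2, s=2):
    """Max pooling on a 2D array with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the max over the f-by-f region the filter currently covers.
            out[i, j] = np.max(x[i * s:i * s + f, j * s:j * s + f])
    return out

# A 4-by-4 input whose quadrant maxima are 9, 2, 6 and 3, as in the example.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool_2d(x, f=2, s=2))  # -> [[9. 2.]
                                 #     [6. 3.]]
```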
If you think of the 4 by 4 region as some set of features, the activations in some layer of the neural network, then a large number means the network has maybe detected a particular feature (if the input image is a cat, a feature could be a cat's eye, whisker, nose, etc.). What the max operation does is say: as long as a feature is detected anywhere in one of the colored quadrants, it is preserved in the output of max pooling. In other words, if a feature is detected anywhere within the filter, keep a high number. If a feature is not detected, so maybe that particular feature doesn't exist in the upper right-hand quadrant (light blue region), then the max of those numbers will be small. We saw in GIF 1 that the max of the upper right quadrant is 2, which is much smaller than the 9 of the upper left quadrant, so if you're detecting a particular feature such as a cat's eye, it's more likely to be in the upper left quadrant than in the upper right quadrant.
I'll paraphrase what Andrew Ng said: I think the main reason people use max pooling is that it's been found to work well in a lot of experiments, and the intuition (that a much bigger max in a particular quadrant corresponds to a detected feature), despite being often cited, is something I don't know anyone fully knows to be the real underlying reason.
So far, we've seen max pooling on a 2D input. If you have a 3D input, the pooling computation is applied independently to each channel, so the output has the same number of channels as the input. For example, a 5 by 5 by 2 input (with, say, f = 3 and s = 1) produces a 3 by 3 by 2 output.
1.2 Average Pooling
Average pooling does pretty much what you'd expect: instead of taking the max within each filter region, you take the average, as we can see in figure 1. If you understood the process of max pooling, then average pooling is self-explanatory (see the sketch below). In a very deep neural network, you might use average pooling to collapse your representation, say from a 7 by 7 by 1000 volume down to 1 by 1 by 1000. Average pooling is not used very often, though.
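Reusing the same sliding-window idea as above, a sketch of average pooling (again my own illustration) only needs the max replaced by a mean:

```python
import numpy as np

def avg_pool_2d(x, f=2, s=2):
    """Average pooling: same sliding-window loop as max pooling, with mean instead of max."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.mean(x[i * s:i * s + f, j * s:j * s + f])
    return out

print(avg_pool_2d(np.array([[1.0, 3.0], [2.0, 9.0]]), f=2, s=2))  # -> [[3.75]]
```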
1.3 Additional Notes About Pooling Layers
One thing to note about pooling is that there are no parameters to learn. So when you implement backpropagation, you find that there are no parameters for backpropagation to adapt in a max pooling layer. There are just the hyperparameters (filter size f and stride s), which you set once, either by hand or using cross-validation.
Also notice that pooling is usually done without padding, which is why there's no 2p term in the output height and width formula we saw in the previous parts of this blog post series: along each of the height and width dimensions, the output size is simply floor((n − f) / s) + 1.
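As a quick sanity check of that formula, here is a small helper (my own illustration). The f = 3, s = 1 setting for the 5 by 5 example is an assumption chosen so the numbers work out:

```python
def pool_output_size(n, f, s):
    # floor((n - f) / s) + 1 -- no "+ 2p" term, since pooling typically uses no padding
    return (n - f) // s + 1

# The 4 by 4 example with f = 2, s = 2:
print(pool_output_size(4, f=2, s=2))  # -> 2
# The 5 by 5 by 2 example: with f = 3, s = 1 each channel goes from 5 by 5 to 3 by 3
print(pool_output_size(5, f=3, s=1))  # -> 3
```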
2. Fully Connected Layer
Fully connected layers usually appear at the end of deep convolutional neural networks. The typical example below shows two fully connected layers (FC3 and FC4). Let me outline what is happening.
Let's say you input an image that is 32 by 32 by 3, so it's an RGB image, and you're trying to do handwritten digit recognition: you have a number like 7 in a 32 by 32 RGB image and want to recognize which of the 10 digits from 0 to 9 it is. Also note that, by convention, what counts as one layer is a layer that has weights, i.e. that has parameters. Because the pooling layer has no weights, no parameters, only a few hyperparameters, Andrew Ng used the convention that convolutional layer 1 (Conv 1) and pooling layer 1 (Pool 1) together form one layer, which he called 'Layer 1'. When you read articles online or research papers, though, you will sometimes see the conv layer and the pooling layer treated as two separate layers. The same convention applies to 'Layer 2'.
Notice that the first fully connected layer (FC3) is the set of 120 units connected to the 400 units. This fully connected layer is just like a single standard neural network layer: you have a weight matrix W^[3] of dimension 120 by 400. It is called fully connected because each of the 400 units is connected to each of the 120 units, and you also have a bias parameter that is 120 dimensional (one per output unit). Lastly, we have fully connected layer 4 (FC4) with 84 units, where each of the 120 units is connected to each of the 84 units. These 84 real numbers can then be fed to a softmax unit, and if you're trying to do handwritten digit recognition, i.e. to recognize which digit (0, 1, 2, 3, 4, 5, 6, 7, 8 or 9) the handwritten input image was, the softmax has 10 outputs. Andrew Ng said that one common guideline is not to try to invent your own settings of hyperparameters (filter size f, stride s, etc.), but to look in the literature to see what hyperparameters have worked for others and to choose an architecture that has worked well for someone else; there's a good chance it will work for your application as well.
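To make the shapes concrete, here is a minimal NumPy sketch of the two fully connected layers followed by the softmax. Only the dimensions (400 → 120 → 84 → 10) come from the example above; the ReLU activations and the random initialization are my own assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

# The last pooling output is flattened into a 400-unit vector feeding FC3.
a2 = np.random.randn(400, 1)

W3, b3 = np.random.randn(120, 400) * 0.01, np.zeros((120, 1))   # FC3: 400 -> 120
W4, b4 = np.random.randn(84, 120) * 0.01, np.zeros((84, 1))     # FC4: 120 -> 84
W5, b5 = np.random.randn(10, 84) * 0.01, np.zeros((10, 1))      # softmax: 84 -> 10 digits

a3 = relu(W3 @ a2 + b3)                # fully connected layer 3
a4 = relu(W4 @ a3 + b4)                # fully connected layer 4
y_hat = softmax(W5 @ a4 + b5)          # probabilities for the 10 digits
print(y_hat.shape)                     # (10, 1)
```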
As always, thank you for your attention. Clap and share if you liked this post. Feel free to comment if you have feedback or questions.
REFERENCES: