In the last section, we introduced the two-dimensional convolutional layer, which can help us detect the edges of objects in images. The convolution kernel operates pixel by pixel, on both the original image and the intermediate features. In real images, the objects we are interested in do not always appear at fixed pixel positions: even if we fix the camera on a tripod and take consecutive shots of the same object, the pixel positions will most likely be offset. As a result, the output corresponding to the edge of the same object may appear at different positions in the convolutional output, which complicates the pattern recognition that follows. In addition, the ultimate goal of most computer vision tasks is to identify the object in the image, so it is not necessary to examine every pixel, only to find the outline of the object. The pooling layer alleviates the excessive sensitivity of the convolutional layer to position.
Like the convolutional layer, the pooling layer computes each output element from a fixed-shape window (also known as the pooling window) of the input data. Unlike the convolutional layer, which calculates the cross-correlation between the input and the kernel, the pooling layer directly calculates the maximum or the average of the elements in the pooling window. These operations are called maximum pooling and average pooling, respectively. In two-dimensional maximum pooling, the pooling window starts at the top left of the input array and slides from left to right and from top to bottom. When the pooling window slides to a given position, the maximum value of the input subarray inside the window becomes the element at the corresponding position in the output array.
The figure above shows maximum pooling with a pooling window of shape 2×2. The shaded portions are the first output element and the input elements used to compute it. The output array is 2 in height and 2 in width, and each of its four elements is the maximum of the corresponding window. For example, the first output element is 4 = max(0, 2, 3, 4).
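The sliding-window procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `max_pool2d` and the 3×3 input holding the values 0 through 8 are assumptions made for this example and are not necessarily the values shown in the figure. The window moves one element at a time and no padding is applied.

```python
import numpy as np

def max_pool2d(x, pool_size):
    """Maximum pooling over a 2-D array (stride 1, no padding)."""
    p_h, p_w = pool_size
    out = np.zeros((x.shape[0] - p_h + 1, x.shape[1] - p_w + 1))
    # Slide the window from left to right and from top to bottom.
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The output element is the maximum of the current window.
            out[i, j] = x[i:i + p_h, j:j + p_w].max()
    return out

# Hypothetical 3x3 input chosen for illustration.
x = np.arange(9, dtype=float).reshape(3, 3)
y = max_pool2d(x, (2, 2))
print(y)
# [[4. 5.]
#  [7. 8.]]
```

With this input, a 2×2 window yields a 2×2 output, matching the output shape described in the text.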
Two-dimensional average pooling works like two-dimensional maximum pooling, but replaces the maximum operator with the average operator. A pooling layer whose pooling window has shape p × q is called a p × q pooling layer, and the corresponding operation is called p × q pooling.
| Type | Max pooling | Average pooling |
|---|---|---|
| Calculation | Take the maximum value in the current window | Compute the average value of the current window |
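To make the comparison concrete, the sketch below applies both operators to the same windows. It assumes NumPy 1.20 or later for `sliding_window_view`. The helper name `pool2d`, its `mode` argument, and the 3×3 input are hypothetical choices for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pool2d(x, p, q, mode="max"):
    """p x q pooling over a 2-D array (stride 1, no padding)."""
    # windows has shape (out_h, out_w, p, q): one p x q window per output element.
    windows = sliding_window_view(x, (p, q))
    if mode == "max":
        return windows.max(axis=(-2, -1))
    if mode == "avg":
        return windows.mean(axis=(-2, -1))
    raise ValueError("mode must be 'max' or 'avg'")

# Hypothetical 3x3 input chosen for illustration.
x = np.arange(9, dtype=float).reshape(3, 3)
max_out = pool2d(x, 2, 2, mode="max")
avg_out = pool2d(x, 2, 2, mode="avg")
print(max_out)  # [[4. 5.] [7. 8.]]
print(avg_out)  # [[2. 3.] [5. 6.]]
```

The two modes differ only in the reduction applied to each window, which is the distinction the table summarizes.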
Consider the object edge detection example again. Suppose we feed the output of the convolutional layer into 2×2 maximum pooling. Let X be the input of the convolutional layer and Y the output of the pooling layer. Suppose the object edge in sample 1 lies on X[i, j] and X[i, j+1], while in sample 2 it lies on X[i, j+1] and X[i, j+2]. With a window that moves one element at a time, the 2×2 pooling window that produces Y[i, j] covers the edge response in both samples. In other words, as long as the pattern recognized by the convolutional layer moves by no more than one element in height or width, we can still detect it.
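The tolerance to a one-element shift can be checked numerically. The sketch below skips the convolution step and starts from two hypothetical feature maps, `a` and `b`, that stand in for the convolutional outputs of the two samples. Both hold an edge response of 1 in a single column, and the column in `b` is shifted one element to the right. The helper `max_pool2d` is an assumption for this example and uses stride 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool2d(x, pool_size):
    """Maximum pooling over a 2-D array (stride 1, no padding)."""
    return sliding_window_view(x, pool_size).max(axis=(-2, -1))

# Hypothetical edge responses: column 1 in sample a, column 2 in sample b.
a = np.zeros((4, 4))
a[:, 1] = 1.0
b = np.zeros((4, 4))
b[:, 2] = 1.0

y_a = max_pool2d(a, (2, 2))
y_b = max_pool2d(b, (2, 2))
print(y_a)  # rows of [1. 1. 0.]
print(y_b)  # rows of [0. 1. 1.]

# Output column 1 responds in both samples although the edge moved.
print(np.array_equal(y_a[:, 1], y_b[:, 1]))  # True
```

The raw feature maps `a` and `b` share no nonzero position, yet their pooled outputs agree in column 1.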
Like the convolutional layer, the pooling layer can change its output shape by padding both sides of the input height and width and by adjusting the stride of the window. Padding and stride work the same way in the pooling layer as in the convolutional layer.
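As a rough sketch of how padding and stride affect the output shape, the example below pads a 4×4 input by one element on every side and moves a 3×3 window two elements at a time. The function names, the 4×4 input, and the choice to pad with negative infinity (so that padded cells never win the maximum) are assumptions made for this illustration. The size helper assumes the usual formula floor((n + 2·padding − k) / stride) + 1.

```python
import numpy as np

def pooled_size(n, k, padding, stride):
    """Output length along one axis for window size k."""
    return (n + 2 * padding - k) // stride + 1

def max_pool2d(x, pool_size, padding=0, stride=1):
    """Maximum pooling over a 2-D array with symmetric padding and a stride."""
    p_h, p_w = pool_size
    x = np.asarray(x, dtype=float)
    # Assumption: pad with -inf so padding never becomes the maximum.
    x = np.pad(x, padding, mode="constant", constant_values=-np.inf)
    out_h = (x.shape[0] - p_h) // stride + 1
    out_w = (x.shape[1] - p_w) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + p_h, c:c + p_w].max()
    return out

# Hypothetical 4x4 input chosen for illustration.
x = np.arange(16, dtype=float).reshape(4, 4)
y = max_pool2d(x, (3, 3), padding=1, stride=2)
print(y)
# [[ 5.  7.]
#  [13. 15.]]
print(pooled_size(4, 3, padding=1, stride=2))  # 2
```

Without padding and with stride 1, the same 3×3 window would also give a 2×2 output on this input. The two settings reach that size differently, which is why the formula is worth checking for each configuration.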
When processing multi-channel input data, the pooling layer pools each input channel separately, rather than summing the results over channels as the convolutional layer does. This means that the number of output channels of the pooling layer equals the number of input channels.
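The per-channel behavior can be illustrated with a small sketch. It assumes a channels-first layout of shape (channels, height, width). The function names and the two-channel input are hypothetical and chosen only for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool2d(x, pool_size):
    """Maximum pooling over a single 2-D channel (stride 1, no padding)."""
    return sliding_window_view(x, pool_size).max(axis=(-2, -1))

def max_pool2d_multi(x, pool_size):
    """Pool every channel independently and stack the results."""
    return np.stack([max_pool2d(channel, pool_size) for channel in x])

# Hypothetical input with 2 channels; the second is the first plus 1.
base = np.arange(9, dtype=float).reshape(3, 3)
x = np.stack([base, base + 1])

y = max_pool2d_multi(x, (2, 2))
print(x.shape, y.shape)  # (2, 3, 3) (2, 2, 2)
print(y[0])  # [[4. 5.] [7. 8.]]
print(y[1])  # [[5. 6.] [8. 9.]]
```

The channel dimension passes through unchanged because no reduction is taken across channels.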
In convolutional neural networks, convolutional layers and pooling layers usually appear in pairs.