Why do convolutional neural networks have translation invariance?

Simply put, convolution + max pooling is approximately equivalent to translation invariance.

Convolution: simply put, when the input image is translated, its representation on the feature map is translated correspondingly. Strictly speaking, this property is translation equivariance: the feature moves with the input rather than staying fixed.

The figure below is just an example to illustrate this. There is a face in the lower left corner of the input image. After convolution, the features of the face (eyes, nose) are also located in the lower left corner of the feature map.

If the face is instead in the upper left corner of the image, then the corresponding features after convolution are also in the upper left corner of the feature map.
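This equivariance can be sketched with plain NumPy. The 2x2 kernel, the 8x8 image, and the (3, 2) shift below are arbitrary assumptions chosen for illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation ('valid' padding), as used in CNN layers."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# The same small pattern placed at two different positions in an 8x8 image.
kernel = np.array([[1., -1.], [1., -1.]])
img_a = np.zeros((8, 8)); img_a[1:3, 1:3] = 1.0   # pattern near the top-left
img_b = np.zeros((8, 8)); img_b[4:6, 3:5] = 1.0   # same pattern shifted by (3, 2)

feat_a = conv2d_valid(img_a, kernel)
feat_b = conv2d_valid(img_b, kernel)

# Equivariance: the response map of the shifted image is the shifted response map.
assert np.allclose(np.roll(np.roll(feat_a, 3, axis=0), 2, axis=1), feat_b)
```

The responses are identical in value; only their location on the feature map moves, which is exactly what the face example above describes.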

In a neural network, a convolution kernel acts as the same feature detector applied at every position: no matter where a target appears in the image, the kernel detects the same features and outputs the same response. For example, if a face is moved to the lower left corner of the image, the kernel detects its features when it slides over the lower left corner.

Pooling: max pooling, for example, returns the maximum value within its pooling window. If that maximum moves but stays inside the same window, the pooling layer still outputs the same value. That provides a small amount of translation invariance.
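A minimal sketch of this window-level invariance, assuming non-overlapping 2x2 max pooling over a small hand-made feature map:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = feature_map.shape
    return feature_map[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A strong activation at (0, 0) versus the same activation nudged to (1, 1):
# both positions fall inside the same 2x2 pooling window.
fmap_a = np.zeros((4, 4)); fmap_a[0, 0] = 5.0
fmap_b = np.zeros((4, 4)); fmap_b[1, 1] = 5.0

assert np.array_equal(max_pool(fmap_a), max_pool(fmap_b))  # identical pooled output
```

If the activation moved out of its 2x2 window, the pooled output would change, which is why the invariance is only approximate and only for small shifts.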

Together, these two operations provide a degree of translation invariance: even if the image is translated, convolution still detects its features, and pooling keeps the output representation as consistent as possible.
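Combining the two steps, a small shift of the input can leave the pooled response unchanged. The sketch below (with a made-up 2x2 kernel, a 9x9 image, and a one-pixel shift) illustrates this:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2-D cross-correlation with 'valid' padding."""
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

def max_pool(fm, size=2):
    """Non-overlapping max pooling."""
    h, w = fm.shape
    return fm[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

kernel = np.array([[0., 1.], [1., 0.]])
img = np.zeros((9, 9)); img[2:4, 2:4] = 1.0   # pattern at one position
img_shift = np.roll(img, 1, axis=1)            # same pattern shifted by 1 pixel

out = max_pool(conv2d_valid(img, kernel))
out_shift = max_pool(conv2d_valid(img_shift, kernel))

# The strongest response lands in the same pooling cell at the same strength:
assert out[1, 1] == out_shift[1, 1]
assert out.max() == out_shift.max()
```

The peak response survives convolution + pooling unchanged, even though the input pattern moved, which is the "approximate translation invariance" described above.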

Conclusion

The translation invariance of a CNN means that after convolution + pooling, a feature can still be detected and passed on no matter where it has moved in the image. Since the fully connected layer computes a weighted sum over all its inputs, the features activated by the CNN are transmitted to the next layer regardless of their position.
