- Transposed Convolution Demystified
- Originally written by Divyanshu Mishra
- Translation from: The Gold Project
- Permalink: github.com/xitu/gold-m…
- Translator: PingHGao
- Proofreader: CHzh9311, SAMyu2000
Demystifying Transposed Convolution
Transposed convolution is a revolutionary concept in applications such as image segmentation and super-resolution, but it can be somewhat obscure. In this article, I will demystify it and make it easier for you to understand.
Preface
Since convolutional neural networks (CNNs) became popular, computer vision has been going through a period of transition. Ever since AlexNet won the ImageNet competition in 2012, CNNs have been the primary technology for image classification, object detection, image segmentation, and other image/video-related tasks.
As the network gets deeper, convolution operations shrink the spatial dimensions and create an abstract representation of the input image. This property of CNNs is very useful for tasks such as image classification, which only need to predict whether a particular object is present in the input image. However, it causes problems for tasks such as object localization and segmentation, where the spatial size of the object in the original image is essential for predicting the output bounding box or segmenting the object.
To solve this problem, a variety of techniques emerged, including fully convolutional neural networks, in which the spatial dimensions are kept consistent with the input by using "same" padding. Although this technique largely solves the problem, it also increases the computational cost, because every convolution in the network is performed at the original input size.
Another method used in image segmentation is to divide the network into two parts: a downsampling network and an upsampling network. The downsampling network uses an ordinary CNN architecture to produce an abstract representation of the input image. The upsampling network then uses various techniques to upsample this abstract representation until its spatial size equals that of the input image. This architecture is known as an encoder-decoder network.
Upsampling Techniques
Downsampling networks are intuitive and widely used, but the corresponding upsampling techniques receive far less discussion.
The most widely used upsampling techniques in encoder-decoder networks are:
- Nearest neighbor: As the name implies, in nearest-neighbor upsampling we copy an input pixel value to its K nearest-neighbor positions in the output, where K depends on the desired output size.
- Bilinear interpolation: We take the four nearest pixel values of an input pixel and produce a smooth output as a weighted average based on the distances to those four pixels.
- Bed of nails: We copy each input pixel value to the corresponding position in the output image and fill the remaining positions with zeros.
- Max unpooling: The max pooling layer in a CNN selects the maximum value within the kernel window as its output. To perform max unpooling, we first save the index of the maximum value for each max pooling layer during the encoding step. The saved indices are then used during the decoding step to map each input pixel back to the position given by its saved index, filling the other positions with zeros.
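As a rough illustration, these techniques can be sketched in NumPy (the function names and the 2×2 example are my own, not from the original article):

```python
import numpy as np

def nearest_neighbor_upsample(x, k=2):
    # Repeat each pixel k times along both axes.
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def bed_of_nails_upsample(x, k=2):
    # Place each input pixel at the top-left of a k*k block, zeros elsewhere.
    out = np.zeros((x.shape[0] * k, x.shape[1] * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

def max_unpool(pooled, indices, out_shape):
    # Scatter the pooled values back to the flat positions recorded
    # during max pooling; all other positions stay zero.
    out = np.zeros(out_shape, dtype=pooled.dtype)
    np.put(out, indices, pooled)
    return out

x = np.array([[1, 2],
              [3, 4]])
print(nearest_neighbor_upsample(x))  # each value becomes a 2x2 block
print(bed_of_nails_upsample(x))      # values at even positions, zeros elsewhere
```

Note that none of these sketches has any learnable parameter, which is exactly the limitation the article raises next.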
All of the techniques above are predefined and do not depend on the data, which makes them task-specific. They cannot learn from data, so they are not general-purpose techniques.
Transposed Convolution
Transposed convolution upsamples an input feature map using learnable parameters to obtain the desired output feature map. Its basic operation proceeds as follows:
- Take a 2×2 encoded feature map as an example and upsample it to size 3×3.
- We use a kernel of size 2×2, with stride 1 and zero padding.
- We multiply the top-left element of the input feature map by every element of the kernel, as shown in Figure 10.
- We repeat this for all the remaining elements of the input feature map, as shown in Figure 11.
- As you can see, some elements of the resulting upsampled feature map overlap. To resolve this, we simply sum the overlapping elements.
- The final output is a 3×3 feature map with the desired spatial size after upsampling.
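The steps above can be sketched as a naive NumPy loop (a minimal sketch; the kernel values here are illustrative placeholders, since a real network would learn them):

```python
import numpy as np

def transposed_conv2d(x, w, stride=1):
    # Scatter-add: each input element multiplies the whole kernel, and the
    # resulting patches are summed wherever they overlap in the output.
    h, wd = x.shape
    kh, kw = w.shape
    out = np.zeros(((h - 1) * stride + kh, (wd - 1) * stride + kw))
    for i in range(h):
        for j in range(wd):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * w
    return out

x = np.array([[1., 2.],
              [3., 4.]])   # 2x2 input feature map
w = np.array([[1., 0.],
              [0., 1.]])   # 2x2 kernel (made-up values for illustration)
y = transposed_conv2d(x, w, stride=1)
print(y.shape)  # (3, 3) -- the desired upsampled size
```

With stride 1 and no padding, the output size is (i − 1) × stride + k = (2 − 1) × 1 + 2 = 3, matching the 3×3 target in the steps above.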
Transposed convolution is also called deconvolution, but that is not an appropriate name: deconvolution means undoing the effect of a convolution operation, which is not our goal here.
It is also called upsampling convolution, which intuitively reflects the task it performs, namely upsampling the input feature map.
Since a regular strided convolution on the output is equivalent to a fractionally strided convolution on the input, transposed convolution is also called fractionally strided convolution. For example, a stride-2 convolution on the output is equivalent to a stride-1/2 convolution on the input.
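One way to see this equivalence is in 1-D: inserting stride − 1 zeros between the input samples and then applying an ordinary full convolution gives the same result as the scatter-add transposed convolution (a minimal sketch; the input and kernel values are made up for illustration):

```python
import numpy as np

def transposed_conv1d(x, w, stride):
    # Scatter-add form of a 1-D transposed convolution.
    out = np.zeros((len(x) - 1) * stride + len(w))
    for i, v in enumerate(x):
        out[i*stride:i*stride + len(w)] += v * w
    return out

x = np.array([1., 2., 3.])
w = np.array([1., 2.])
stride = 2

# "Fractional stride": dilate the input by inserting stride-1 zeros
# between samples, then run a plain full convolution over it.
dilated = np.zeros((len(x) - 1) * stride + 1)
dilated[::stride] = x                      # [1, 0, 2, 0, 3]
assert np.allclose(np.convolve(dilated, w),
                   transposed_conv1d(x, w, stride))
```

The zero-insertion is what makes the effective stride "0.5": the kernel advances one position over the dilated signal for every half-step over the original input.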
Finally, it is also referred to as backward convolution, because the forward pass of a transposed convolution is equivalent to the backward pass of a regular convolution.
Problems with transposed convolution:
Transposed convolution suffers from checkerboard artifacts, as shown below.
The main cause is uneven overlap in some parts of the output, which produces artifacts. This can be avoided or mitigated by using a kernel size that is divisible by the stride, for example a 2×2 or 4×4 kernel with stride 2.
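The uneven overlap is easy to verify by counting, for each output position, how many kernel patches land on it (a toy 1-D sketch; the sizes are made up for illustration):

```python
import numpy as np

def overlap_counts(n_in, k, stride):
    # Count how many scattered kernel patches contribute to each output
    # position of a 1-D transposed convolution.
    counts = np.zeros((n_in - 1) * stride + k, dtype=int)
    for i in range(n_in):
        counts[i*stride:i*stride + k] += 1
    return counts

# Kernel size 3 with stride 2: interior counts alternate between 2 and 1,
# producing the checkerboard pattern.
print(overlap_counts(4, k=3, stride=2))
# Kernel size 4 (divisible by stride 2): interior counts are uniform.
print(overlap_counts(4, k=4, stride=2))
```

When the kernel size divides evenly by the stride, every interior output position receives the same number of contributions, so no periodic bright/dark pattern appears.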
Applications of transposed convolution:
- Super resolution:
- Semantic segmentation:
Conclusion:
Transposed convolution is the basis of semantic segmentation and super-resolution algorithms. It provides the best and most general way to upsample abstract representations. In this article, we surveyed several commonly used upsampling techniques and then worked toward a more intuitive, in-depth understanding of transposed convolution. I hope you enjoyed this post; if you have any doubts, questions, or comments, feel free to contact me on Twitter or LinkedIn.