FCN, a paper that played an important role in semantic segmentation, was published in 2014. It was the SOTA model of its year and the pioneering fully convolutional network, realizing end-to-end semantic segmentation purely with convolutions. Although it has since left the stage of semantic segmentation, whenever people discuss the field's development, this milestone network, FCN, must be mentioned.
FCN extends convolutional networks to spatially dense prediction tasks, that is, pixel-level classification.
Today we will cover:
- From image classification to semantic segmentation
- Transfer learning
- Transposed convolution
- Output fusion
We have discussed the path from image classification to semantic segmentation before; if you are starting with this post, you may want to first go back to the two earlier posts I shared on transposed convolution.
From image classification to semantic segmentation
Both image classification and semantic segmentation take an image as input, but a classification network usually requires a fixed input size, because its last few layers are fully connected. A segmentation network must accept inputs of varying sizes, so FCN replaces the fully connected layers with convolutional layers, removing the restriction on the input image size.
In the later layers of a classification network, the feature map output by the convolutional layers is fed into fully connected layers. After several fully connected layers, the output is a vector whose dimension equals the number of categories, and Softmax then turns it into a probability distribution over the categories.
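The final Softmax step can be sketched in a few lines of numpy; the 5-class logits below are hypothetical values standing in for the output of the last fully connected layer.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of class scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a 5-class problem
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])
probs = softmax(logits)
print(int(probs.argmax()))       # index of the predicted class -> 0
print(round(float(probs.sum()), 6))  # probabilities sum to 1 -> 1.0
```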
The limitation of fully connected layers is that the input must have a fixed size and spatial coordinates are discarded. However, a fully connected layer can also be viewed as a convolution whose kernel covers its entire input region.
Turning fully connected layers into convolutional layers
For example, for the ImageNet classification problem, if the input is 224×224×3, a series of convolution and pooling layers produces a 7×7×512 feature map. This is flattened and connected to fully connected layers, giving a 4096-dimensional output, which in convolutional terms is a 1×1×4096 feature map. Finally, the number of channels is transformed to 1×1×1000 by a convolution with a 1×1 kernel, one channel per class.
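The equivalence between a fully connected layer and a convolution whose kernel covers the whole input can be checked numerically. This is a minimal sketch with toy sizes standing in for the 7×7×512 feature map; the weights and shapes are illustrative, not the actual FCN/VGG parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (small stand-ins for 7x7x512 -> 4096)
H, W, C_in, C_out = 4, 4, 3, 6
fmap = rng.standard_normal((H, W, C_in))

# Fully connected layer: weight matrix of shape (C_out, H*W*C_in)
W_fc = rng.standard_normal((C_out, H * W * C_in))
fc_out = W_fc @ fmap.reshape(-1)

# The same weights viewed as C_out convolution kernels of size HxWxC_in,
# applied at the single valid position -> a 1x1xC_out output
W_conv = W_fc.reshape(C_out, H, W, C_in)
conv_out = np.array([(kernel * fmap).sum() for kernel in W_conv])

print(np.allclose(fc_out, conv_out))  # True: the two layers compute the same thing
```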
Transfer learning
Transfer learning here means first training a model on a large dataset such as ImageNet or COCO, then continuing training on a smaller, task-specific dataset to obtain the desired network.
Output fusion
As the image is repeatedly pooled for down-sampling, a relatively small feature map is obtained, 1/32 the size of the original image. Up-sampling this map then produces an output the same size as the input image. However, a label map obtained this way is rather coarse, because 32× down-sampling has been applied; the network designed this way is FCN-32s.
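The shape arithmetic of FCN-32s can be sketched as below. FCN itself up-samples with a transposed convolution initialized as bilinear interpolation; this sketch uses nearest-neighbour up-sampling (via `np.kron`) only to keep the example short.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour up-sampling: each value becomes a factor x factor block.
    (FCN uses a learnable transposed convolution; this is just a shape sketch.)"""
    return np.kron(fmap, np.ones((factor, factor)))

# A 7x7 score map, as obtained after 32x down-sampling of a 224x224 input
score = np.arange(49, dtype=float).reshape(7, 7)
full = upsample_nearest(score, 32)
print(full.shape)  # (224, 224): back to the input resolution, but coarse
```

Each of the 7×7 scores is simply repeated over a 32×32 block, which is exactly why the FCN-32s output looks blocky.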
As the number of layers increases, that is, as the network gets deeper, richer semantic features are captured but spatial information is lost. In other words, spatial information lives mostly in the shallow layers of the network. So can we incorporate shallow-layer information into the output? That idea is output fusion.
The specific fusion method is a point-by-point sum. For example, the final score map is first up-sampled 2× and summed with the pool4 scores, and the result is up-sampled 16× to full resolution, giving FCN-16s. Going one step further, that fused map is up-sampled 2× again, summed with the pool3 scores, and then up-sampled 8×, giving FCN-8s.
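The fusion steps above can be sketched as follows. The score maps are hypothetical single-channel constants chosen just to make the point-by-point sums easy to follow, and nearest-neighbour up-sampling again stands in for FCN's learned transposed convolution.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour stand-in for FCN's transposed-convolution up-sampling."""
    return np.kron(fmap, np.ones((factor, factor)))

# Hypothetical single-channel score maps for a 224x224 input
deep_scores  = np.ones((7, 7))           # 1/32 resolution (after pool5)
pool4_scores = np.full((14, 14), 0.5)    # 1/16 resolution
pool3_scores = np.full((28, 28), 0.25)   # 1/8 resolution

# FCN-16s style fusion: up-sample the deeper map 2x, then add point by point
fused_16 = upsample_nearest(deep_scores, 2) + pool4_scores   # 14x14

# FCN-8s: fuse once more with pool3, then up-sample 8x to full size
fused_8 = upsample_nearest(fused_16, 2) + pool3_scores       # 28x28
output = upsample_nearest(fused_8, 8)                        # 224x224
print(output.shape)  # (224, 224)
```

The sums require the spatial sizes to match, which is exactly why the deeper map must be up-sampled 2× before each fusion step.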
Because of the loss of spatial information, the results of FCN-32s are relatively coarse, while FCN-8s produces noticeably finer results.
References
- Review: FCN — Fully Convolutional Network (Semantic Segmentation) by Sik-Ho Tsang