I. Overview
FCN is a representative application of deep learning to image segmentation. It is an end-to-end segmentation method: the network makes pixel-level predictions and directly produces the label map. Because every layer in FCN is a convolutional layer, it is called a fully convolutional network.
A fully convolutional network mainly relies on three techniques:
- Convolutionalization (replacing fully connected layers with convolutions)
- Upsampling (deconvolution)
- Skip layers
II. Network structure
FCN classifies images at the pixel level, thereby solving the semantic segmentation problem. Unlike a classic CNN, which uses fully connected layers after the convolutional layers to obtain a fixed-length feature vector for classification (fully connected layers + softmax output), FCN accepts input images of arbitrary size. A deconvolution layer upsamples the feature map of the last convolutional layer to restore it to the size of the input image, so that a prediction is generated for each pixel while the spatial information of the original input image is preserved. Finally, pixel-by-pixel classification is performed on the upsampled feature map.
Fully convolutional networks in detail
1. Interchangeability of fully connected and convolutional layers
The only difference between a fully connected layer and a convolutional layer is that the neurons in a convolutional layer are connected only to a local region of the input, and neurons within the same convolutional column share parameters. In both layers, however, the neurons compute dot products, so their functional form is identical. The two can therefore be converted into each other:
- For any convolutional layer there exists a fully connected layer that implements the same forward propagation. Its weight matrix is a huge matrix that is zero except in certain blocks, and within most of those blocks the elements are equal (because of weight sharing).
- Any fully connected layer can be converted into a convolutional layer. A fully connected layer is in fact a convolution whose kernel size equals the size of the previous layer's feature map, so that each kernel produces a single output node, corresponding to one node of the fully connected layer. In other words, if the feature map feeding the fully connected layer has size N_H × N_W × N_C and the fully connected output has size 1 × 1 × K, then the fully connected layer can be replaced by a convolution with K kernels, each of size N_H × N_W × N_C.
- Implementation of converting a convolutional layer to a fully connected layer: the conversion can be completed with a single weight matrix W and bias b, with W of shape [K, N] (N being the flattened input size) and b of shape [K, 1]; a small sketch follows below.
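As an illustrative sketch (assumed sizes, not the paper's configuration), the following PyTorch snippet shows that a fully connected layer and a convolution whose kernel covers the whole feature map compute the same result once the weights are reshaped:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: the last feature map is N_H x N_W with N_C channels
N_H, N_W, N_C, K = 7, 7, 512, 4096

# A fully connected layer over the flattened feature map
fc = nn.Linear(N_H * N_W * N_C, K)

# The equivalent convolution: K kernels of size N_H x N_W over N_C channels
conv = nn.Conv2d(N_C, K, kernel_size=(N_H, N_W))

# Reuse the same parameters, merely reshaped into convolution kernels
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(K, N_C, N_H, N_W))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, N_C, N_H, N_W)
out_fc = fc(x.flatten(1))          # shape (1, K)
out_conv = conv(x).flatten(1)      # shape (1, K)
print(torch.allclose(out_fc, out_conv, atol=1e-4))  # True: same computation up to rounding
```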
2. Why must a traditional CNN's input image have a fixed size?
For a CNN, the convolutional and pooling layers do not care about the size of the input image. For a convolutional layer, output_size = (input_size - kernel_size) / stride + 1: whatever the input_size of the incoming feature map, the kernel slides over it and produces a feature map of the corresponding output_size (and the number of kernels K can be chosen freely). The same holds for pooling layers. When the data reaches the fully connected layer, however, the feature map (say of size n×n) must be flattened into a vector, and each of its n×n elements is connected to every node of the next layer (say 4096 nodes), which requires 4096×n×n weights. Once the network structure is fixed, the number of weights is fixed, so n cannot change. Since n is the output size of conv5, working backwards every output size must be fixed, therefore every input size must be fixed, and therefore the input image size must be fixed.
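A minimal numerical check of this formula, using AlexNet's first convolution (11×11 kernel, stride 4 on a 227×227 input) as the example; the helper function is purely illustrative:

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    # output_size = (input_size + 2*padding - kernel_size) / stride + 1
    return (input_size + 2 * padding - kernel_size) // stride + 1

# AlexNet's first convolution: 227x227 input, 11x11 kernel, stride 4 -> 55x55
print(conv_output_size(227, 11, 4))   # 55
# The formula never needs to know the input size in advance:
print(conv_output_size(384, 11, 4))   # 94 -- a larger input simply gives a larger output
```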
3. What is the benefit of reshaping the fully connected layer's weights W into convolutional filters?
The reason a given CNN architecture must fix its input image size is that the weights of the fully connected layers are fixed and tied to the size of the feature map. FCN, building on the CNN, replaces the fully connected layer with 1000 nodes by a convolutional layer with 1000 1×1 convolution kernels. The output of this layer is still a two-dimensional feature map; once again we do not care about its size, so there is no longer any restriction on the size of the input image.
As shown in the figure below, FCN converts the fully connected layers of the traditional CNN into convolutional layers; concretely, FCN turns the network's last three fully connected layers into three convolutional layers:
- Converting fully connected layers into convolutional layers: in the traditional CNN structure, the first five layers are convolutional, the sixth and seventh layers are each a one-dimensional vector of length 4096, and the eighth layer is a one-dimensional vector of length 1000 whose entries correspond to the probabilities of 1000 classes. FCN expresses these three layers as convolutional layers whose kernel sizes (number of channels, height, width) are (4096, 1, 1), (4096, 1, 1) and (1000, 1, 1) respectively. The numbers look unchanged, but convolution and full connection differ in concept and in how they are computed. The weights and biases trained by the CNN are reused; the difference is that each weight and bias now has its own receptive field and belongs to a convolution kernel.
- The CNN input image size is fixed at 227×227: the feature map is 55×55 after the first convolutional layer, 27×27 after the second, and 13×13 after the fifth. FCN's input is H×W: after the first pooling the feature map is 1/2 of the original image size, after the second 1/4, after the fifth 1/8, and after the eighth layer 1/16 of the original size.
- After repeated convolution and pooling, the feature maps become smaller and smaller, with ever lower resolution. The smallest feature map obtained is called the heatmap; it is the high-dimensional feature map we care most about. Once this high-dimensional heatmap is obtained, the most important and final step is to upsample it, enlarging it back to the size of the original image (see the sketch below).
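A hedged sketch of this "convolutionalization" of the classifier head (the channel counts and feature-map size below are assumptions for illustration, not the paper's exact configuration): replacing the fully connected layers with 1×1 convolutions keeps the output as a spatial heatmap rather than a single vector.

```python
import torch
import torch.nn as nn

# Assumed backbone output: a 512-channel feature map whose size depends on the input image
features = torch.randn(1, 512, 16, 16)

# Classifier head written as convolutions instead of fully connected layers
head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),   # one score map per class
)

heatmap = head(features)
print(heatmap.shape)  # torch.Size([1, 1000, 16, 16]) -- a heatmap, not a fixed-length vector
```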
Compared with iterating the original, untransformed convolutional network over all 36 positions and predicting each of the 36 positions separately, a single forward pass through the transformed convolutional network is far more efficient, because the 36 computations share intermediate results. This technique is often used in practice: an image is enlarged, the transformed convolutional network is used to evaluate many different spatial locations and obtain classification scores, and these scores are then averaged.
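A tiny sketch of this averaging step (the 6×6 heatmap below is an assumed example giving 36 positions):

```python
import torch

# Class scores at 36 spatial positions, produced by one forward pass of the converted network
heatmap = torch.randn(1, 1000, 6, 6)

# Average the 36 position-wise scores into a single classification score per class
scores = heatmap.mean(dim=(2, 3))
print(scores.shape)   # torch.Size([1, 1000])
```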
Deconvolution
The upsampling operation can be regarded as deconvolution. The parameters of this convolution, like those of a CNN, are learned by backpropagation while training the FCN model. The deconvolution layer is itself a convolutional layer: it does not care about the input size, and it produces its output by sliding a window over the input. "Deconv" is not true deconvolution; it is usually referred to as transposed convolution, and the forward pass of a deconvolution is the backward pass of a convolution. Deconvolution parameters: the transpose of the filter used in the convolution (in practice, the filter flipped horizontally and vertically) is applied to the feature map, just as in an ordinary convolution. The deconvolution operation is illustrated below, and a code sketch follows the figure caption:
(Figure: blue is the input of the deconvolution layer, green is the output; full padding, transposed convolution.)
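A hedged sketch of upsampling with a learned transposed convolution (kernel size, stride and channel counts are assumptions chosen to give a 32× enlargement, not values taken from the paper):

```python
import torch
import torch.nn as nn

num_classes = 21   # 20 object classes + background, as in the example later in this article

# 1x1 scoring layer followed by a transposed convolution that enlarges the map 32x
score = nn.Conv2d(4096, num_classes, kernel_size=1)
upsample = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=64, stride=32, padding=16)

coarse = torch.randn(1, 4096, 10, 12)    # heatmap from an arbitrary-size input
out = upsample(score(coarse))
print(out.shape)                         # torch.Size([1, 21, 320, 384]) -- 32x larger
```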
4. Skip structure
Processing the CNN result in this way yields a dense prediction. However, the authors found in their experiments that the resulting segmentation was relatively coarse, so they considered adding detail from earlier layers: the output of a shallower layer is fused with the output of the final layer, and the fusion is simply an element-wise sum:
Experiments show that the fused segmentation is more detailed and accurate. During this layer-by-layer fusion, the results began to get worse after the third fusion, so the authors stopped at that point. A minimal sketch of one such fusion step is given below.
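A minimal sketch of a single skip fusion (FCN-16s style): the coarse, deep prediction is upsampled 2× and summed element-wise with a prediction made from a shallower feature map. The layer names and channel counts are assumptions for illustration:

```python
import torch
import torch.nn as nn

num_classes = 21

# Hypothetical feature maps from the backbone: pool4 has twice the resolution of conv7
pool4 = torch.randn(1, 512, 20, 20)
conv7 = torch.randn(1, 4096, 10, 10)

score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
score_conv7 = nn.Conv2d(4096, num_classes, kernel_size=1)
up2x = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1)

# Upsample the deep (coarse but robust) prediction and add the shallow (finer) prediction
fused = up2x(score_conv7(conv7)) + score_pool4(pool4)
print(fused.shape)   # torch.Size([1, 21, 20, 20])
```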
5. Model training
The model is initialized from a network trained as AlexNet, VGG16 or GoogLeNet and fine-tuned from there; all layers are fine-tuned. Only the upsampling is added at the end, and parameters are learned with the ordinary backpropagation of a CNN. Whole images are used for training, without patchwise sampling; experiments showed that using full images directly is both effective and efficient. The class-score convolutional layer is initialized with all zeros; random initialization brings no advantage in performance or convergence.
FCN example: the input can be a color image of arbitrary size; the output has the same size as the input, with depth 20 object classes + background = 21. The model is based on AlexNet.
Blue: the convolution layer
Green: Max Pooling layer
Yellow: summation, which uses element-wise addition to fuse predictions from three different depths: the shallower result is more refined, the deeper result is more robust
Grey: cropping; before fusion, a crop layer is used to bring the two maps to the same size, and at the end the output is cropped to the same size as the input
For input images of different sizes, the size (height, width) of each layer's data changes correspondingly, while the depth (number of channels) remains unchanged
Feature extraction is performed in the fully convolutional part, and the outputs of the convolutional layers (the 3 blue layers) are taken to predict scores for the 21 classes
The dotted lines in the figure denote the deconvolution operations. The deconvolution layers (the 3 orange layers) enlarge the input data; as with the convolutional layers, the upsampling parameters are learned during training
1. Initialize with the classic AlexNet classification network; the last two stages are fully connected (red) and their parameters are discarded
2. A coarse prediction and segmentation is produced from the small feature map, then directly upsampled to a full-size image.
The deconvolution (orange) has stride 32, and this network is called FCN-32s
3. Upsampling is carried out in two steps (orange ×2). Before the second upsampling, the prediction result (blue) from the fourth pooling layer (green) is fused in. Skip structures are used to improve accuracy
The second deconvolution has stride 16, and this network is called FCN-16s
4. Upsampling is carried out in three steps (orange ×3), further fusing in the prediction result from the third pooling layer; this network is called FCN-8s. A structural sketch of this fusion follows.
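Putting steps 1-4 together, a hedged structural sketch of the FCN-8s style fusion (the stand-in backbone stages, channel counts and kernel sizes below are assumptions for illustration only, not the figure's AlexNet configuration):

```python
import torch
import torch.nn as nn

class FCN8sSketch(nn.Module):
    """Structural sketch only: the 'stage' blocks stand in for the real convolutional stack."""
    def __init__(self, num_classes=21):
        super().__init__()
        # Each stage reduces resolution: pool3 -> 1/8, pool4 -> 1/16, conv7 -> 1/32
        self.to_pool3 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(8))
        self.to_pool4 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.to_conv7 = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # 1x1 scoring layers and learned upsampling layers
        self.score3 = nn.Conv2d(64, num_classes, 1)
        self.score4 = nn.Conv2d(128, num_classes, 1)
        self.score7 = nn.Conv2d(256, num_classes, 1)
        self.up2_a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2_b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, x):
        p3 = self.to_pool3(x)                                   # 1/8 resolution
        p4 = self.to_pool4(p3)                                  # 1/16 resolution
        c7 = self.to_conv7(p4)                                  # 1/32 resolution
        fuse4 = self.up2_a(self.score7(c7)) + self.score4(p4)   # fuse at 1/16
        fuse3 = self.up2_b(fuse4) + self.score3(p3)             # fuse at 1/8
        return self.up8(fuse3)                                  # back to full resolution

out = FCN8sSketch()(torch.randn(1, 3, 256, 256))
print(out.shape)   # torch.Size([1, 21, 256, 256])
```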
Other parameters:
- Minibatch: 20 images
- Learning rate: 0.001
- Initialization: the convolutional layers outside the classification network are initialized to 0; the deconvolution parameters are initialized to bilinear interpolation, and the last deconvolution layer is fixed to bilinear interpolation and not learned (a sketch of this initialization follows)
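A commonly used way to build such a bilinear-interpolation kernel for a transposed convolution (a sketch, not the authors' exact code; the kernel size, stride and 21-channel count are assumptions for a 32× upsampling layer):

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Weight tensor that makes a ConvTranspose2d perform bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    weight[range(channels), range(channels), :, :] = filt
    return torch.from_numpy(weight)

# Last deconvolution: fixed to bilinear interpolation, not learned
up = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, padding=16, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(21, 64))
up.weight.requires_grad_(False)
```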
FCN Summary:
To accurately predict the segmentation result for each pixel, two processes are needed: from large to small (downsampling) and then from small to large (upsampling). During upsampling, enlarging in stages gives better results than enlarging in a single step, and at each upsampling stage the features of the corresponding downsampling layer are used to assist. Disadvantages:
- The results are still not precise enough. Although the 8x upsampling result is much better than the 32x one, the upsampled output remains blurry and smooth, and is not sensitive to details in the image
- The classification of each pixel does not fully take the relationships between pixels into account. The spatial regularization step used in conventional pixel-classification-based segmentation methods is omitted, so the result lacks spatial consistency