
Preface

There are four papers in the DeepLab series: DeepLab V1, DeepLab V2, DeepLab V3, and DeepLab V3+.

DeepLab V1

Paper title: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Open source code: TheLegendAli/DeepLab-Context

DeepLab V2

Paper title: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Open source: DrSleep/tensorflow-deeplab-resnet

DeepLab V3

Paper title: Rethinking Atrous Convolution for Semantic Image Segmentation

Open source: Leonndong/Deeplabv3-tensorFlow

DeepLab V3+

Paper title: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Open source: jfzhang95/pytorch-deeplab-xception

Core ideas of the DeepLab series

CNNs for image segmentation are adapted from classification networks, which learn high-level semantics, but such networks are not accurate enough for semantic segmentation. The fundamental reason is the translation invariance of high-level DCNN features: repeated pooling and downsampling discard localization information, so the precise position of each pixel's low-level semantics cannot be recovered.

To counter the resolution loss caused by downsampling and pooling, DeepLab uses atrous (dilated) convolution instead of pooling to enlarge the receptive field and capture more contextual information. DeepLab V1/V2 combine deep convolutional neural networks (DCNNs) with a probabilistic graphical model (dense CRF). DeepLab V2 introduces the ASPP module, which makes segmentation more robust across object scales: atrous convolutions with different sampling rates extract input features over different receptive fields, capturing objects and their context at multiple scales while greatly enlarging the effective kernel. However, as the dilation rate grows and the receptive field approaches the size of the feature map, the 3×3 kernel degenerates into a 1×1 convolution.

To address this problem, DeepLab V3 improves the ASPP (atrous spatial pyramid pooling) layer: atrous convolutions with different dilation rates run in parallel, and their outputs are concatenated and normalized to a common size before fusion. Borrowing the idea of PSPNet, the ASPP module captures image context at different scales through this parallel sampling with different dilation rates.
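The parallel ASPP idea described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact implementation: the channel counts and rates (6, 12, 18) follow the common DeepLab V3 setting, and the global-pooling branch is omitted for brevity.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel atrous branches, concatenated and fused by a 1x1 conv + BN."""

    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +  # 1x1 branch
            # padding=r keeps the spatial size of each dilated 3x3 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (1 + len(rates)), out_ch, 1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 256, 33, 33)  # e.g. a 513x513 input at output_stride 16
y = ASPP()(x)
```

Because each dilated branch uses `padding=r`, all branches preserve the spatial size, so they can be concatenated along the channel axis directly.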

DeepLab V3+ extends DeepLab V3 by adding a simple yet effective decoder module to refine the segmentation results, achieving 89.0% and 82.1% mIoU on the PASCAL VOC 2012 and Cityscapes datasets, respectively.

Major challenges for semantic segmentation

Resolution loss: continuous pooling or subsampling sharply reduces the resolution of the feature maps, losing original detail that is difficult to recover during upsampling. As a result, more and more networks try to reduce resolution loss, for example by using atrous convolution, or by replacing pooling with stride-2 convolutions.

Multi-scale features: objects of different sizes in the same image are segmented with different accuracy, because convolutions at different scales favor objects of different sizes. At low resolution, the location information of small objects is often lost. By using convolution or pooling layers with different parameters, feature maps at different scales can be extracted and fed into the network for fusion, which greatly improves overall performance. However, multi-scale input via an image pyramid requires saving a large number of gradients during computation, placing high demands on hardware.

1 DeepLab V1

DeepLab V1 combines deep convolutional neural networks (DCNNs) with a probabilistic graphical model (dense CRF).

Deep convolutional neural network (DCNN):

• Following the FCN idea, VGG16 is modified to produce a coarse score map, which is interpolated back to the original image size.

• Atrous convolution is used to obtain denser feature maps while keeping the receptive field unchanged.

• A fully connected CRF refines the segmentation results produced by the DCNN.

DeepLab V1: VGG16 + atrous convolution + CRF post-processing of segmentation boundaries. To counter the resolution loss from downsampling and pooling, DeepLab uses atrous convolution to enlarge the receptive field and capture more contextual information. At the same time, a fully connected conditional random field (CRF) improves the model's ability to capture fine details.

Correspondence between receptive field, stride, and convolution kernel size:

Network structure of DeepLab V1

1. Convert the fully connected layers (fc6, fc7, fc8) into convolution layers (enabling end-to-end training).
2. Change the stride of the last two pooling layers (pool4 and pool5) from 2 to 1, so that feature resolution drops only to 1/8 of the original image.
3. Set the dilation rate of the last three convolution layers (conv5_1, conv5_2, conv5_3) to 2, and that of the first converted fully connected layer to 4 (to maintain the receptive field).
4. Change the number of output channels of the last layer, fc8, from 1000 to 21 (the number of classes is 21).
5. For the first converted fully connected layer, fc6, reduce the number of channels from 4096 to 1024 and the kernel size from 7×7 to 3×3. Later experiments found that a dilation rate of 12 here (LargeFOV) works best.
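The converted fc6/fc7/fc8 head described in steps 1 and 3–5 can be sketched as below, assuming the LargeFOV setting (fc6: 1024 channels, 3×3 kernel, dilation 12). Only the head is shown; the modified VGG16 body is omitted.

```python
import torch
import torch.nn as nn

num_classes = 21
head = nn.Sequential(
    # fc6 -> atrous conv: 512 -> 1024 channels, 3x3 kernel, dilation rate 12
    nn.Conv2d(512, 1024, kernel_size=3, padding=12, dilation=12),
    nn.ReLU(inplace=True),
    # fc7 -> 1x1 conv, 1024 channels
    nn.Conv2d(1024, 1024, kernel_size=1),
    nn.ReLU(inplace=True),
    # fc8 -> 1x1 conv producing the 21-class coarse score map
    nn.Conv2d(1024, num_classes, kernel_size=1),
)

# With pool4/pool5 strides set to 1, a 224x224 input yields a 28x28 (1/8) map.
x = torch.randn(1, 512, 28, 28)
scores = head(x)
```

The coarse score map is then bilinearly interpolated back to the input size, as described above.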

The experimental setup

Network variants:
- DeepLab-MSC: similar to FCN, adds multi-scale feature fusion.
- DeepLab-7×7: the converted fully connected layer uses a 7×7 kernel.
- DeepLab-4×4: the converted fully connected layer uses a 4×4 kernel.
- DeepLab-LargeFOV: the converted fully connected layer uses a 3×3 kernel with dilation rate 12.

Loss function: cross entropy + softmax. Optimizer: SGD with momentum 0.9. Batch size: 20. Learning rate: 10^−3 (multiplied by 0.1 every 2000 iterations).

2 DeepLab V2

DeepLab V2: VGG16/ResNet + serial ASPP module + CRF post-processing of segmentation boundaries. The ASPP (atrous spatial pyramid pooling) layer was added: atrous convolutions with different dilation rates replace the pooling operations that caused the loss of shallow features, greatly enlarging the receptive field.

Background knowledge

Atrous Convolution

Dense feature maps: with a standard 3×3 convolution, a 3×3 input region corresponds to one output value; with an atrous convolution (rate = 2), a 5×5 input region corresponds to one output value.

Standard 3×3 convolution (rate 1): receptive field 3. Atrous convolution (rate 2): effective kernel size 5×5, receptive field 7. Atrous convolution (rate 4): effective kernel size 9×9, receptive field 15.
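The numbers above follow from two simple formulas, checked here in plain Python: the effective kernel size of a dilated convolution is k_eff = k + (k − 1)(r − 1), and stacking stride-1 layers grows the receptive field by k_eff − 1 per layer (the receptive fields 3, 7, 15 assume the rate-1, rate-2, rate-4 layers are stacked in sequence).

```python
def effective_kernel(k: int, rate: int) -> int:
    """Effective kernel size of a k x k convolution with the given dilation rate."""
    return k + (k - 1) * (rate - 1)

def stacked_receptive_field(layers) -> int:
    """Receptive field after stacking stride-1 convolutions (kernel, rate) pairs."""
    rf = 1
    for k, r in layers:
        rf += effective_kernel(k, r) - 1
    return rf

# Effective kernel sizes quoted in the text:
assert effective_kernel(3, 1) == 3
assert effective_kernel(3, 2) == 5
assert effective_kernel(3, 4) == 9

# Receptive fields 3, 7, 15 after stacking rates 1, 2, 4:
assert stacked_receptive_field([(3, 1)]) == 3
assert stacked_receptive_field([(3, 1), (3, 2)]) == 7
assert stacked_receptive_field([(3, 1), (3, 2), (3, 4)]) == 15
```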

ResNet

Rate of change: introducing residuals removes the identity component of the mapping, thus highlighting small changes.

Main idea: fitting an identity mapping like y = x with a neural network is harder than fitting a zero mapping like y = 0, because to fit y = 0 the weights and biases just need to approach 0.
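A minimal sketch of this y = F(x) + x idea as a basic residual block; the layer sizes here are illustrative assumptions, not taken from the ResNet paper.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The branch only has to learn the residual F(x); if the best mapping
        # is identity, driving the conv weights toward 0 yields y = x for free.
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)

x = torch.randn(1, 64, 32, 32)
y = BasicResidualBlock(64)(x)
```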

Network&ASPP

DeepLab V2 is an improved version of DeepLab V1, still based on VGG16 as the main network.

The experimental setup

Loss function: cross entropy + softmax. Optimizer: SGD with momentum 0.9. Batch size: 20. Learning rate policy, step: 10^−3 (multiplied by 0.1 every 2000 iterations).

Poly policy: lr = base_lr × (1 − iter / max_iter)^power, with power = 0.9.

ASPP-S: r = {2, 4, 8, 12}; ASPP-L: r = {6, 12, 18, 24}.
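The poly learning-rate policy above is simple enough to write out directly; this pure-Python sketch uses the paper's power = 0.9 as the default.

```python
def poly_lr(base_lr: float, it: int, max_iter: int, power: float = 0.9) -> float:
    """Poly policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - it / max_iter) ** power

base_lr = 1e-3
assert poly_lr(base_lr, 0, 20000) == base_lr      # starts at base_lr
assert poly_lr(base_lr, 20000, 20000) == 0.0      # decays to 0 at max_iter
assert poly_lr(base_lr, 10000, 20000) < base_lr   # monotonically decreasing
```

Unlike the step policy, the rate decays smoothly every iteration rather than dropping in discrete jumps.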

3 DeepLab V3

DeepLab V3: ResNet + improved parallel ASPP module. As the sampling rate increases, the number of effective filter weights (weights applied to actual feature values rather than zero padding) decreases; in the extreme case where the dilation rate approaches the size of the feature map, the 3×3 filter no longer captures whole-image context but degenerates into a simple 1×1 convolution (only the center weight remains effective). Therefore, V3 uses a parallel ASPP module whose last branch concatenates a global pooling branch to capture global context information.

1. Image pyramid: starting from the input image, images at different scales are each fed to the network for feature extraction and fused later.
2. Encoder-decoder structure: the encoder extracts features by downsampling; the decoder restores the size of the feature map by upsampling.
3. Deeper networks with atrous convolution: classical classification algorithms extract features by continuous subsampling, while atrous convolution allows different sampling rates instead.
4. Spatial pyramid structure: besides ASPP, other networks also use this idea, such as SPPNet and PSPNet.

The network structure

Classical classification network architecture (e.g. ResNet) → DeepLab V3 serial atrous-convolution architecture → DeepLab V3 parallel atrous-convolution architecture (the adjusted ASPP module)

The experimental setup

Crop size: images are cropped to 513×513 (to better match the dilation rates). Learning rate policy: poly, same as V2.

BN layer strategy: with output_stride = 16 and batch size 16, the BN layer uses parameter decay = 0.9997. After 30K iterations of training on the augmented dataset with an initial learning rate of 0.007, the BN parameters are frozen; then, with output_stride = 8 and batch size 8, training continues for another 30K iterations with an initial learning rate of 0.001.

4 DeepLab V3+

DeepLab V3+: at its core, extends DeepLab V3 by adding a simple and effective decoder module to recover object boundaries (refining the segmentation results along object boundaries). Xception/ResNet serves as the backbone, depthwise separable convolution is adopted in the encoder, and a simple decoder module follows the multi-scale feature extraction of the ASPP module.

Background knowledge

Depthwise separable convolution

Standard convolution: given a 12×12×3 input, convolving with one 5×5×3 kernel yields an 8×8×1 output; with 256 such 5×5×3 kernels, the output is 8×8×256. Parameter count: 256 × 5 × 5 × 3 = 19,200.

Group convolution: the input feature maps are divided into groups, and each group is convolved separately. Suppose the input feature map has size C×H×W (12×5×5) and the number of output feature maps is N (6). If the channels are divided into G (3) groups, each group has C/G (4) input feature maps and N/G (2) output feature maps. Each convolution kernel has size (C/G)×K×K (4×5×5); the total number of kernels is still N (6), with N/G (2) kernels per group. Each kernel convolves only with the input feature maps of its own group, so the total parameter count is N×(C/G)×K×K. The total number of parameters is thus reduced to 1/G of the original.
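The 1/G parameter reduction can be verified in plain Python with the example numbers above (C = 12, N = 6, K = 5, G = 3):

```python
def conv_params(c_in: int, n_out: int, k: int, groups: int = 1) -> int:
    """Weight count of a (grouped) k x k convolution, biases ignored."""
    assert c_in % groups == 0 and n_out % groups == 0
    return n_out * (c_in // groups) * k * k

C, N, K, G = 12, 6, 5, 3
standard = conv_params(C, N, K)            # N * C * K * K
grouped = conv_params(C, N, K, groups=G)   # N * (C/G) * K * K

assert standard == 1800   # 6 * 12 * 5 * 5
assert grouped == 600     # 6 * 4 * 5 * 5
assert grouped == standard // G   # reduced to 1/G of the original
```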

Depthwise separable convolution is the extreme case of group convolution in which the number of groups equals the number of input channels, i.e. each input channel forms its own group.

Depthwise separable convolution = depthwise convolution + pointwise convolution

**Depthwise convolution:** each 5×5×1 kernel is applied to one channel of the input image, giving three 8×8×1 outputs, which are stacked into an 8×8×3 result.

**Pointwise convolution:** 256 kernels of size 1×1×3 convolve the output of the depthwise step, finally producing an 8×8×256 output.

Parameter calculation: depthwise parameters = 5×5×3 = 75; pointwise parameters = 256×1×1×3 = 768; total = 75 + 768 = 843 ≪ 19,200.
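The parameter counts above check out in plain Python (12×12×3 input, 5×5 kernels, 256 output channels, biases ignored):

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    return c_out * c_in * k * k

def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    depthwise = c_in * k * k          # one k x k filter per input channel
    pointwise = c_out * c_in * 1 * 1  # 1x1 convs to mix channels
    return depthwise + pointwise

assert standard_conv_params(3, 256, 5) == 19200
assert separable_conv_params(3, 256, 5) == 843   # 75 + 768
```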

The network structure

The encoder:

  1. DeepLab V3 is used as the encoder structure, and the ratio of output to input size is 16(output_stride = 16).
  2. ASPP: one 1×1 convolution + three 3×3 convolutions (rate = {6, 12, 18}) + global average pooling.

Decoder:

  1. First, the encoder output is upsampled by a factor of 4 (bilinear interpolation), then concatenated with the low-level feature map of corresponding size from the encoder, followed by a 3×3 convolution; the final result is obtained by upsampling by another factor of 4.
  2. Before fusion of low-level information, 1×1 convolution is performed to reduce the number of channels.
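The two decoder steps above can be sketched as follows. This is a hedged sketch, not the paper's exact code: the channel counts (256 from the encoder, 48 after the 1×1 reduction) follow the paper's common defaults and should be treated as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    def __init__(self, encoder_ch=256, low_level_ch=256, num_classes=21):
        super().__init__()
        # 1x1 conv to shrink the low-level channels before fusion (step 2)
        self.reduce = nn.Conv2d(low_level_ch, 48, 1)
        self.fuse = nn.Sequential(
            nn.Conv2d(encoder_ch + 48, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, enc, low):
        # x4 bilinear upsample of the encoder output to the low-level size (step 1)
        enc = F.interpolate(enc, size=low.shape[-2:], mode="bilinear",
                            align_corners=False)
        x = self.fuse(torch.cat([enc, self.reduce(low)], dim=1))
        # final x4 upsample back toward the input resolution
        return F.interpolate(x, scale_factor=4, mode="bilinear",
                             align_corners=False)

enc = torch.randn(1, 256, 32, 32)    # encoder output at output_stride 16
low = torch.randn(1, 256, 128, 128)  # low-level feature at output_stride 4
out = DeepLabV3PlusDecoder()(enc, low)
```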

DeepLab V3+ modifies Xception as follows:

1. Deeper Xception structure: the middle flow is repeated 16 times instead of the original 8.
2. All max-pooling layers are replaced by depthwise separable convolutions with stride 2.
3. Every 3×3 depthwise convolution is followed by BN and ReLU.

The experimental setup

Crop size: images are cropped to 513×513. Learning rate policy: poly, same as V2 and V3.

Summary

History of DeepLab series

V1: modifies a classical classification network (VGG16), applies atrous convolution to the model to tackle the problems of low resolution and multi-scale feature extraction, and uses a CRF for post-processing (VGG16 + atrous convolution + CRF post-processing of segmentation boundaries).

V2: designs the ASPP module to maximize the benefit of atrous convolution. VGG16 is the main network, ResNet-101 is tried for comparison experiments, and a CRF is used for post-processing (VGG16/ResNet + serial ASPP module + CRF post-processing of segmentation boundaries).

V3: designs both a serial and a parallel DCNN architecture with ResNet as the main network, fine-tunes the ASPP module, and drops CRF post-processing (ResNet + improved parallel ASPP module).

V3+: designs a new model with ResNet or Xception as the main network, combined with an encoder-decoder structure. V3 serves as the encoder and a new decoder module is designed; CRF post-processing is dropped (ResNet/Xception + parallel ASPP module + decoder structure).