Background of Unet

Unet was published in 2015 as a variant of FCN. If you want to learn about FCN, see my other article on reading and implementing the FCN (fully convolutional network) paper. Unet was originally designed to solve problems in biomedical imaging. Because it works so well, it has since been widely applied across semantic segmentation tasks, such as satellite image segmentation and industrial defect detection.

Both Unet and FCN are encoder-decoder structures: simple but effective. The encoder is responsible for feature extraction, and you can plug in any feature extraction network you are familiar with. Since samples are difficult to collect in the medical field, the authors applied data augmentation to address this, and obtained good accuracy with a limited dataset.

Unet network structure and details

  • Encoder

As shown in the figure above, the network is called Unet because its structure is symmetric and its shape resembles the letter U. The whole diagram is composed of blue/white boxes and arrows of various colors. The blue/white boxes represent feature maps; the blue arrows represent 3×3 convolutions, used for feature extraction; the gray arrows represent skip-connections, used for feature fusion; the red arrows represent pooling, used to reduce spatial dimensions; the green arrows represent upsampling, used to restore spatial dimensions; and the cyan arrow represents a 1×1 convolution, used to produce the output.

You may ask why there are 5 levels rather than 4 or 6. Honestly, only the authors could say; this depth probably performed best on the dataset they had at the time, but that does not mean the same structure suits every dataset. What deserves our attention is the Encoder-Decoder design idea; the specific implementation should vary with the dataset.

The encoder is composed of convolution and downsampling operations. The convolutions used in the paper are uniformly 3×3 kernels with padding 0 and stride 1. Because there is no padding, the H and W of the feature map shrink after each convolution, so pay attention to the feature map dimensions at the skip-connections (in practice, the padding can also be set to 1 to avoid the dimension mismatch). Pytorch code:

nn.Sequential(nn.Conv2d(in_channels, out_channels, 3),  # 3x3 kernel, padding=0, stride=1
              nn.BatchNorm2d(out_channels),
              nn.ReLU(inplace=True))

After two such convolutions, a max pooling with stride 2 is performed, halving the output size from (H, W) to (H/2, W/2):

Pytorch code:

nn.MaxPool2d(kernel_size=2, stride=2)

These steps are repeated 5 times in total; the last block omits the max pooling, and its feature map is sent directly into the Decoder, as in the sketch below.
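Putting the pieces together, here is a minimal sketch of such an encoder (the names conv_block and Encoder are mine, not from the paper; the channel widths 64 to 1024 follow the original Unet figure, and padding defaults to 0 as in the paper):

import torch.nn as nn

def conv_block(in_channels, out_channels, padding=0):
    # Two 3x3 conv + BN + ReLU layers, as described above
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=padding),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, 3, padding=padding),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True))

class Encoder(nn.Module):
    # 5 levels, max pooling between them (none after the last)
    def __init__(self, in_channels=1, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.blocks = nn.ModuleList()
        c = in_channels
        for w in widths:
            self.blocks.append(conv_block(c, w))
            c = w
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i < len(self.blocks) - 1:
                skips.append(x)   # saved for the skip-connections
                x = self.pool(x)  # halve H and W
        return x, skips           # bottleneck features plus skips for the Decoder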

  • Decoder

The feature map is restored to the original resolution by the Decoder. Besides the convolutions, the key steps in this process are upsampling and the skip-connections.

There are two common methods of upsampling: 1. the deconvolution (transposed convolution) introduced in FCN; 2. interpolation. Interpolation is the method used here. Among interpolation methods, bilinear interpolation offers a good balance of quality and cost and is the most common.

The calculation of bilinear interpolation has no parameters to learn; it is just a set of formulas. Here is an example to aid understanding (the example describes the case where the parameter align_corners is False).

In the example, a 2×2 matrix is interpolated to obtain a 4×4 matrix; the 2×2 matrix is called the source matrix and the 4×4 matrix the target matrix. In bilinear interpolation, the value of a target point is calculated from the values of the four nearest source points. We first show how to find the four points around a target point, taking P2 as an example.

The first formula maps coordinates from the target matrix back to the source matrix (for align_corners=False, applied to each axis):

src = (dst + 0.5) * (src_size / dst_size) - 0.5

To find the four points, first find the target point's relative position in the source matrix, which is what the formula above computes. The coordinates of P2 in the target matrix are (0, 1), and the corresponding coordinates in the source matrix are (-0.25, 0.25). The coordinates contain decimals and negative numbers, so let's deal with them one by one. We know that bilinear interpolation computes the value at a coordinate from the 4 points around it; for (-0.25, 0.25), those 4 points are (-1, 0), (-1, 1), (0, 0), and (0, 1). To handle the negative coordinates, we extend the source matrix to the following form, with the red part in the middle being the source matrix.
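As a quick check of the mapping (a tiny sketch; the helper name to_source is mine):

# align_corners=False mapping; here scale = 2 / 4 = 0.5
def to_source(dst, scale):
    return (dst + 0.5) * scale - 0.5

print(to_source(0, 0.5), to_source(1, 0.5))  # -0.25 0.25, P2's source coordinates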

Let f(i, j) denote the pixel value at coordinate (i, j), and write the computed source coordinates uniformly in the form (i+u, j+v). Then i = -1, u = 0.75, j = 0, v = 0.25. Drawing the four points individually, you can see the relative position in the source matrix that target point P2 corresponds to.

The second formula, and the last one:

f(i+u, j+v) = (1-u)(1-v) f(i, j) + (1-u)v f(i, j+1) + u(1-v) f(i+1, j) + uv f(i+1, j+1)

The pixel value of the target point is a weighted sum of the pixel values of the four surrounding points. As you can see, the nearest point, (0, 0), gets the largest weight, 0.75×0.75, while a far point such as (-1, 1) gets the weight 0.25×0.25, which is reasonable. Plugging the values into the formula gives the value of P2: 12.5, nice.
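To make the arithmetic concrete, here is the same computation in plain Python. The source values [[10, 20], [30, 40]] are my assumption for this illustration (they are consistent with the P2 = 12.5 result), and the extended rows/columns are filled by replicating the edges of the source matrix:

src = [[10.0, 20.0],
       [30.0, 40.0]]

def f(i, j):
    # out-of-range indices replicate the nearest edge of the source matrix
    i = min(max(i, 0), 1)
    j = min(max(j, 0), 1)
    return src[i][j]

i, u = -1, 0.75
j, v = 0, 0.25
p2 = ((1 - u) * (1 - v) * f(i, j) + (1 - u) * v * f(i, j + 1)
      + u * (1 - v) * f(i + 1, j) + u * v * f(i + 1, j + 1))
print(p2)  # 12.5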

Bilinear interpolation in Pytorch:

nn.Upsample(scale_factor=2, mode='bilinear')
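We can verify the hand computation with Pytorch directly (again assuming the [[10, 20], [30, 40]] source matrix; align_corners is set to False explicitly, matching the example above):

import torch
import torch.nn.functional as F

src = torch.tensor([[[[10., 20.],
                      [30., 40.]]]])  # shape (N, C, H, W) = (1, 1, 2, 2)
out = F.interpolate(src, scale_factor=2, mode='bilinear', align_corners=False)
print(out[0, 0, 0, 1])  # tensor(12.5000), the value of P2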

For a CNN to achieve good results, skip-connections are all but essential. This key step in Unet fuses the location information of the shallow features with the semantic information of the deep features. The Pytorch code:

torch.cat([low_layer_features, deep_layer_features], dim=1)
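Putting upsampling and splicing together, here is a minimal sketch of one decoder level (the class name DecoderBlock is mine; I use padding=1 convolutions, as mentioned earlier, so the skip and upsampled feature maps match in size without cropping):

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, deep, skip):
        deep = self.up(deep)                # restore H and W
        x = torch.cat([skip, deep], dim=1)  # splice shallow and deep features
        return self.conv(x)                 # fuse them with convolutions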

It should be noted that FCN fuses deep and shallow information by adding the corresponding pixel values, while Unet does it by splicing (concatenation).

So how does this relate to the difference between ResNet and DenseNet? In fact, ResNet sums the corresponding values, while DenseNet concatenates them. In my personal understanding, addition does not change the dimensionality of the feature map, but each dimension carries more information; this is an efficient choice for ordinary classification tasks, which do not need to recover the original resolution from the feature map. Splicing, on the other hand, retains more dimension/position information, which lets the following layers choose freely between shallow and deep features. That is more advantageous for semantic segmentation tasks.
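A quick illustration of the two fusion styles and their effect on shape (a sketch with arbitrary sizes):

import torch

shallow = torch.randn(1, 64, 32, 32)
deep = torch.randn(1, 64, 32, 32)

added = shallow + deep                       # ResNet/FCN style: shape unchanged
spliced = torch.cat([shallow, deep], dim=1)  # DenseNet/Unet style: channels stack

print(added.shape)    # torch.Size([1, 64, 32, 32])
print(spliced.shape)  # torch.Size([1, 128, 32, 32])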

Summary

Unet is based on the Encoder-Decoder structure and achieves feature fusion by splicing. The structure is simple and stable. If you face a semantic segmentation problem, especially with a small sample dataset, it is worth trying.

PS: Welcome to follow my personal WeChat public account [MachineLearning Learning road], where I post an interpretation of a CV paper every week!