Application scenarios

In automobile transaction scenarios, it is often necessary to recognize the user's driving license. This recognition task requires combining multiple models to obtain a good result, and page detection, as the first stage of the pipeline, is a particularly important part. Separating side A and side B of the driving license largely avoids detecting irrelevant text and improves the accuracy of the overall recognition task. This article introduces the application of U2Net to ID detection tasks.

  • Link to the paper: arxiv.org/abs/2005.09…
  • Github: github.com/xuebinqin/U…

One. Background of U2Net

Salient object detection aims to segment the most visually salient content in an image. It can be applied in many fields such as image segmentation, tracking, and so on. Current mainstream networks mainly have the following problems:

  • They focus on local details and lack global contrast information.
  • They reuse backbones originally designed for image classification, such as ResNet and VGG, which are not tailored to saliency detection.
  • Extracting feature maps at high resolution consumes a large amount of computing resources.

Two. U2Net network structure

1. Basic structure: RSU-L

Inspired by U-Net, the authors propose the Residual U-block (RSU-L), where L denotes the number of layers in the encoder.

Figure 1: RSU-L structure (image source: the authors' paper)

In Figure 1, the green blocks represent Conv+BN+ReLU, the blue blocks represent Downsample+Conv+BN+ReLU, and the red blocks represent Upsample+Conv+BN+ReLU. It can be seen that this basic block is essentially a small U-Net.

Aiming at the shortcoming that the commonly used 3×3 convolution cannot effectively extract global information, the paper points out that this block can effectively obtain multi-scale information from high-resolution shallow feature maps and enlarge the receptive field. Meanwhile, L can be set to 3, 5, or 7 depending on the task; in general, 7 is selected.
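To make the structure concrete, below is a minimal PyTorch sketch of an RSU-style block with L=4. The class names, channel sizes, and the dilated bottom layer follow the general pattern of the paper but are illustrative, not the authors' code; the official repository uses deeper variants such as RSU-7.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBNReLU(nn.Module):
    """Conv + BN + ReLU, the green block in Figure 1."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


class RSU4(nn.Module):
    """Simplified RSU block with L=4: a small U-Net whose output is added
    back to its own input projection (the residual connection)."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv_in = ConvBNReLU(in_ch, out_ch)              # input projection F1(x)
        self.enc1 = ConvBNReLU(out_ch, mid_ch)
        self.enc2 = ConvBNReLU(mid_ch, mid_ch)
        self.enc3 = ConvBNReLU(mid_ch, mid_ch)
        self.bottom = ConvBNReLU(mid_ch, mid_ch, dilation=2)  # dilated conv instead of further pooling
        self.dec3 = ConvBNReLU(mid_ch * 2, mid_ch)
        self.dec2 = ConvBNReLU(mid_ch * 2, mid_ch)
        self.dec1 = ConvBNReLU(mid_ch * 2, out_ch)
        self.pool = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    def forward(self, x):
        fx = self.conv_in(x)                                  # local feature F1(x)
        e1 = self.enc1(fx)
        e2 = self.enc2(self.pool(e1))                         # blue blocks: downsample + conv
        e3 = self.enc3(self.pool(e2))
        b = self.bottom(e3)
        d3 = self.dec3(torch.cat([b, e3], dim=1))
        d3 = F.interpolate(d3, size=e2.shape[2:], mode="bilinear", align_corners=False)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))            # red blocks: upsample + conv
        d2 = F.interpolate(d2, size=e1.shape[2:], mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return d1 + fx                                        # U(F1(x)) + F1(x)


if __name__ == "__main__":
    y = RSU4(3, 16, 64)(torch.randn(1, 3, 320, 320))
    print(y.shape)  # torch.Size([1, 64, 320, 320])
```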

Compared with the commonly used ResNet block, as shown in Figure 2, the features produced by this structure are a combination of multi-scale features and local features, while a ResNet block can only provide local features plus the original input features. More features inevitably increase the computational overhead, so the authors insert a number of max-pooling layers to downsample the intermediate feature maps and keep the cost under control.

Figure 2: Comparison with ResNet
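Expressed with the residual formulations used in the paper, a plain residual block computes

$$\mathcal{H}(x) = \mathcal{F}_2\big(\mathcal{F}_1(x)\big) + x,$$

i.e., the identity is added to features extracted by stacked convolutions, while an RSU computes

$$\mathcal{H}_{RSU}(x) = \mathcal{U}\big(\mathcal{F}_1(x)\big) + \mathcal{F}_1(x),$$

where $\mathcal{F}_1$ is the input convolution and $\mathcal{U}$ is the U-Net-like structure that extracts multi-scale features, so multi-scale features are added to local features rather than to the raw input.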

2. U2Net network structure

Figure 3: U2Net Network

The overall network structure is shown in Figure 3. On the left is an encoder composed of six RSU-L blocks, on the right is a decoder composed of five RSU-L blocks, and at the bottom is a saliency map fusion module that fuses the side outputs of the decoder stages (and the deepest encoder stage) into the final saliency map.
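A hedged sketch of the fusion step is shown below. The module name, the number of side outputs, and the use of bilinear upsampling are illustrative assumptions; the official repository implements the equivalent logic inside its network class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyFusion(nn.Module):
    """Fuse the 1-channel side-output maps into the final saliency map:
    upsample each to the input size, concatenate, and apply a 1x1 conv."""

    def __init__(self, num_side_outputs=6):
        super().__init__()
        self.fuse = nn.Conv2d(num_side_outputs, 1, kernel_size=1)

    def forward(self, side_outputs, out_size):
        # side_outputs: list of (N, 1, h_i, w_i) logits at different scales
        ups = [F.interpolate(s, size=out_size, mode="bilinear", align_corners=False)
               for s in side_outputs]
        fused = self.fuse(torch.cat(ups, dim=1))  # (N, 1, H, W) logits
        # the fused map and each upsampled side output are supervised separately
        return torch.sigmoid(fused), [torch.sigmoid(u) for u in ups]
```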

3. Loss function

The loss function corresponds to the deep supervision shown in Figure 3: the authors compute a loss on the output of each decoder stage, and another loss on the fused result obtained after concatenating all the side outputs. The overall loss is:

$$\mathcal{L} = \sum_{m=1}^{M} w_{side}^{(m)} \ell_{side}^{(m)} + w_{fuse}\, \ell_{fuse}$$

where $w_{side}^{(m)}$ and $w_{fuse}$ are the weights of each loss term, corresponding to the side-output losses and the fusion loss in Figure 3. Each term $\ell$ is the standard binary cross-entropy loss:

$$\ell = -\sum_{(r,c)}^{(H,W)} \left[ P_{G(r,c)} \log P_{S(r,c)} + \left(1 - P_{G(r,c)}\right) \log\left(1 - P_{S(r,c)}\right) \right]$$

where $(r, c)$ denotes the pixel coordinates, $(H, W)$ is the image size, and $P_{G(r,c)}$ and $P_{S(r,c)}$ denote the ground-truth pixel value and the pixel value of the predicted probability map, respectively.
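A minimal PyTorch sketch of this deep-supervision loss follows. The function name is hypothetical, and it assumes the network outputs probability maps in [0, 1]; following the paper, all weights default to 1.

```python
import torch.nn.functional as F


def u2net_loss(side_probs, fused_prob, target, side_weights=None, fuse_weight=1.0):
    """BCE on every side-output probability map plus BCE on the fused map.

    side_probs : list of (N, 1, H, W) probability maps, one per side output
    fused_prob : (N, 1, H, W) fused saliency probability map
    target     : (N, 1, H, W) ground-truth mask with values in [0, 1]
    """
    if side_weights is None:
        side_weights = [1.0] * len(side_probs)   # the paper sets every weight to 1
    loss = fuse_weight * F.binary_cross_entropy(fused_prob, target)
    for w, prob in zip(side_weights, side_probs):
        loss = loss + w * F.binary_cross_entropy(prob, target)
    return loss
```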

Three. Model performance

The authors show the precision-recall curves of U2Net and other networks on public datasets, as shown in Figure 4.

Figure 4: Performance on public datasets (image source: the authors' paper)

The authors also present qualitative test results, as shown in Figure 5.

Figure 5: Qualitative results (image source: the authors' paper)

Four. Training and comparison

This article compares U2Net with DeepLabv3 only.
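For reference, here is a hedged sketch of how a DeepLabv3-ResNet101 baseline with a single output channel for the binary (license vs. background) mask can be instantiated. Whether the article used torchvision's implementation is an assumption.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# DeepLabv3 with a ResNet101 backbone and one output channel for the mask
deeplab = deeplabv3_resnet101(pretrained=False, num_classes=1)
deeplab.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 320, 320)        # dummy input; the resolution is illustrative
    logits = deeplab(x)["out"]             # (1, 1, 320, 320)
    mask = torch.sigmoid(logits) > 0.5     # binary mask
```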

1. Composition of driving license data

The data was randomly split 8:2 into 3,760 training images and 939 validation images, and 100 test images were collected from Baidu image search. Each input sample consists of the original image plus its mask.

2. How to train

For U2Net, the raw images and their masks are placed in separate train and mask directories, and the dataset is organized as follows:

    |--dataset  
    |----train  
    |----train_mask  
    |----val  
    |----val_mask  
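A minimal Dataset sketch for this layout is shown below. The class name, the 320×320 resize, and the assumption that each mask shares its image's file name are mine, not from the article.

```python
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class LicenseSegDataset(Dataset):
    """Load (image, mask) pairs from the layout above, e.g. dataset/train and
    dataset/train_mask, assuming a mask shares its image's file name."""

    def __init__(self, image_dir, mask_dir, size=(320, 320)):
        self.image_dir = image_dir
        self.mask_dir = mask_dir
        self.size = size
        self.names = sorted(os.listdir(image_dir))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB").resize(self.size)
        mask = Image.open(os.path.join(self.mask_dir, name)).convert("L").resize(self.size)
        image = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0  # (3, H, W)
        mask = torch.from_numpy(np.array(mask)).unsqueeze(0).float() / 255.0        # (1, H, W)
        return image, mask
```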

3. Comparison with DeepLabv3

To compare the models fairly, we disable all data augmentation for both models and train for only 100 epochs with a batch size of 8. The training environment is Python 3.6.12 and PyTorch 1.6, and each model is trained on a single 2080Ti.
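Under these settings (100 epochs, batch size 8, no augmentation), a hedged training-loop sketch might look as follows, assuming the U2NET class is importable from the official repository and reusing the dataset and loss sketches above; the optimizer settings are illustrative.

```python
import torch
from torch.utils.data import DataLoader

from model import U2NET   # assumed import path of the official repository

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = U2NET(3, 1).to(device)                              # 3-channel input, 1-channel mask
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # illustrative optimizer settings

# LicenseSegDataset and u2net_loss are the sketches defined earlier in this article
train_loader = DataLoader(LicenseSegDataset("dataset/train", "dataset/train_mask"),
                          batch_size=8, shuffle=True, num_workers=4)

for epoch in range(100):                             # 100 epochs, no augmentation
    model.train()
    for images, masks in train_loader:
        images, masks = images.to(device), masks.to(device)
        d0, d1, d2, d3, d4, d5, d6 = model(images)   # fused map + 6 side outputs (probability maps)
        loss = u2net_loss([d1, d2, d3, d4, d5, d6], d0, masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```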

| Model | Size | Param size | GFLOPs | mIoU | Infer time |
| --- | --- | --- | --- | --- | --- |
| U2Net | 168.27 M | 44.01 M | 150.67 G | 0.937 | 0.43 s |
| DeepLabv3-ResNet101 | 226.85 M | 58.63 M | 249.42 G | 0.898 | 0.31 s |

Note: GFLOPs and Param size are calculated using the ptflops library.
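A hedged sketch of that measurement with ptflops is shown below; the 320×320 input resolution is an assumption, and ptflops reports multiply-accumulate counts.

```python
import torch
from ptflops import get_model_complexity_info

from model import U2NET   # assumed import path of the official repository

model = U2NET(3, 1)
with torch.no_grad():
    macs, params = get_model_complexity_info(
        model, (3, 320, 320),                  # input resolution is an assumption
        as_strings=True, print_per_layer_stat=False)
print(f"MACs: {macs}, Params: {params}")        # ptflops counts multiply-accumulate operations
```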

4. Test case

Figure 6: Test sample (from a public Internet image search; sensitive information blurred)

From left to right: the real image, the mask generated by DeepLabv3, and the mask generated by U2Net. It can be clearly seen that U2Net performs better and produces a more accurate segmentation mask.


Nanjing S300 Cloud Information Technology Co., Ltd. (CH300) was founded on March 27, 2014. It is a mobile Internet enterprise rooted in Nanjing, with offices in Nanjing and Beijing. After 7 years of development, its cumulative valuation has reached 5.2 billion, and it has won the favor of many high-quality investment institutions at home and abroad, such as Sequoia Capital and SAIC Industrial Fund. S300 Cloud is an excellent domestic independent third-party SaaS provider for auto transactions and auto finance, built on artificial intelligence, with standardized auto transaction pricing and auto financial risk control as its core products.

Welcome to join S300 Cloud and witness the booming development of the automobile industry. We look forward to working hand in hand with you! Website: www.sanbaiyun.com/ Email: [email protected]