Recently, PaddleSeg, an industrial-grade image segmentation model library, was officially released, bringing developers a triple surprise: (1) 15 mainstream image segmentation models open-sourced in one go, a gift package of great satisfaction; (2) multi-card training up to twice as fast as comparable products, plus industrial deployment capability, saving plenty of time; (3) the key techniques behind the ACE2P prediction model, triple-crown winner of the CVPR 2019 LIP Challenge human parsing tasks, letting you experience world-leading results in one step.
1. PaddleSeg hits the big time
PaddleSeg, a new offering from the PaddlePaddle family, is newly launched. It focuses on the field of image segmentation and provides developers with a complete, easy-to-use, industrial-grade segmentation model library.
Yes, you read that right: a truly battle-tested, industrial-grade segmentation model library.
According to reports, PaddleSeg has already been applied or trialed in Baidu's autonomous vehicles, portrait segmentation on the AI open platform, photo-editing tools, and Baidu Maps, and has achieved good results in the industrial quality inspection industry. The official PaddleSeg panorama is shown below:
2. What is image segmentation?
Image semantic segmentation is a key step from image processing to image analysis: by assigning a label to every pixel in an image, it achieves pixel-level semantic partitioning.
As you can see in the image below, instances of vehicles, roads, sidewalks, and so on can be partitioned and marked!
Compared with the traditional image classification task, image segmentation is clearly more difficult and complex, but it is an important cornerstone of image understanding and plays a pivotal role in applications such as autonomous driving, drones, and industrial quality inspection.
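To make "a label for every pixel" concrete, here is a toy example in plain numpy; the class IDs are made up purely for illustration:

```python
import numpy as np

# A toy 4x4 "segmentation mask": every pixel carries a class ID
# (hypothetical IDs: 0 = background, 1 = road, 2 = vehicle).
mask = np.array([
    [0, 0, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
])

# Unlike classification (one label per image), segmentation yields one
# label per pixel, so the mask has the same spatial shape as the image.
print(mask.shape)         # (4, 4)
print((mask == 1).sum())  # number of "road" pixels: 7
```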
3. PaddleSeg's triple surprise
3.1. Fifteen mainstream image segmentation models open-sourced at once, a gift package of great satisfaction
PaddleSeg provides pre-trained models on public datasets for every built-in segmentation model, comprehensively covering mainstream implementations in the field such as DeepLabv3+, ICNet, and U-Net. It ships with 15 pre-trained models on datasets such as ImageNet, COCO, and Cityscapes, meeting the accuracy and performance requirements of different scenarios. For the full list of 15 pre-trained models, see github.com/PaddlePaddl…
Among them, the three most important models are introduced as follows:
(1) U-Net: a lightweight model with few parameters and fast computation
U-Net originated in medical image segmentation. The whole network is a standard encoder-decoder network, characterized by few parameters, fast computation, and strong applicability, adapting well to general scenes. The U-Net network structure is as follows:
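The encoder-decoder data flow with skip connections can be sketched in plain numpy, using max pooling and nearest-neighbor upsampling as stand-ins for the convolution blocks; this illustrates the shapes only and is not PaddleSeg's implementation:

```python
import numpy as np

def downsample(x):
    """2x2 max pooling: halves the spatial resolution (encoder step)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbor 2x upsampling (decoder step)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.rand(16, 64, 64)  # (channels, H, W) feature map

# Encoder: keep each resolution's features for the skip connections.
e1 = x                # 64x64
e2 = downsample(e1)   # 32x32
e3 = downsample(e2)   # 16x16 (bottleneck)

# Decoder: upsample and concatenate the matching encoder features
# (the concatenation along the channel axis is the "skip connection").
d2 = np.concatenate([upsample(e3), e2], axis=0)  # (32, 32, 32)
d1 = np.concatenate([upsample(d2), e1], axis=0)  # (48, 64, 64)

print(d1.shape)
```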
(2) DeepLabv3+: SOTA results on PASCAL VOC, with multiple backbones supported
DeepLabv3+ is the latest work in the DeepLab series, preceded by DeepLabv1, DeepLabv2, and DeepLabv3. In this latest work, the DeepLab authors use an encoder-decoder structure to fuse multi-scale information while retaining the original atrous (dilated) convolutions and the ASPP layer. With an Xception backbone, it improves the robustness and speed of semantic segmentation, achieving a new state-of-the-art result of 89.0 mIoU on the PASCAL VOC 2012 dataset. The DeepLabv3+ network structure is as follows:
In PaddleSeg, two types of backbone network can be switched between:
• MobileNetv2: suitable for mobile deployment or scenarios that demand fast segmentation prediction; PaddleSeg also provides models with different depth multipliers from 0.5x to 2.0x.
• Xception: the backbone of the original DeepLabv3+ implementation, balancing accuracy and performance, suitable for server-side deployment. PaddleSeg offers pre-trained models at three depths: Xception41, Xception65, and Xception71.
(3) ICNet: real-time semantic segmentation, suitable for high-performance prediction scenarios
Image Cascade Network (ICNet) is designed for real-time image semantic segmentation. Compared with other methods that compress computation, ICNet takes both speed and accuracy into account. Its main idea is to transform the input image into different resolutions, process each resolution with a subnetwork of matching computational complexity, and then combine the results. ICNet consists of three subnetworks: the computationally heavy network processes the low-resolution input, while the lightweight networks process the high-resolution inputs. In this way it balances the accuracy obtainable from high-resolution images against the efficiency of low-complexity networks. ICNet's network structure is as follows:
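The cascade idea can be sketched in numpy; the "branches" below are trivial placeholders chosen only to show the resolution/complexity trade-off and the final fusion, not ICNet's real layers:

```python
import numpy as np

def resize(img, scale):
    """Nearest-neighbor resize by an integer up/down factor."""
    if scale >= 1:
        return img.repeat(int(scale), axis=0).repeat(int(scale), axis=1)
    step = int(1 / scale)
    return img[::step, ::step]

def heavy_branch(x):   # deep subnetwork: runs on the LOW-resolution input
    return x * 0.5     # placeholder for many conv layers

def light_branch(x):   # shallow subnetwork: runs on the HIGH-resolution input
    return x * 0.9     # placeholder for a few conv layers

img = np.random.rand(32, 32)

low  = heavy_branch(resize(img, 1 / 4))   # 8x8: expensive per pixel, but few pixels
mid  = light_branch(resize(img, 1 / 2))   # 16x16
high = light_branch(img)                  # 32x32: many pixels, cheap per pixel

# Cascade fusion: upsample the coarse results and merge with the finer ones.
fused = (high + resize(mid, 2) + resize(low, 4)) / 3
print(fused.shape)  # (32, 32)
```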
3.2. Multi-card training up to twice as fast as comparable products, plus industrial deployment capability

In terms of speed, PaddleSeg provides multi-process I/O and an excellent GPU memory optimization strategy, which greatly improve performance. PaddleSeg's single-card training is 2.3 times faster than comparable products, and its multi-card training is 3.1 times faster. PaddleSeg also holds significant advantages in GPU utilization, memory overhead, and maximum batch size. Detailed comparison data is shown in the figure below:
Test environment and model:
• GPU: NVIDIA Tesla V100 16G × 8
• CPU: Intel(R) Xeon(R) Gold 6148
• Model: DeepLabv3+ with Xception65 backbone
• High-performance C++ prediction library: cross-platform support including Windows, with graph optimizations such as operator fusion, TensorRT acceleration, and MKL-DNN.
• Paddle Serving deployment: supports high-concurrency prediction, multiple models per service, model hot updates, and A/B tests.
The framework diagram of Paddle Serving is as follows:
3.3. ACE2P, triple-crown winner of the CVPR 2019 LIP Challenge human parsing tasks, takes you one step to world-leading results
In the CVPR 2019 LIP Challenge, Baidu's ACE2P model won first place in all three human parsing tracks, a grand slam in every sense. Let's take a closer look:
What is LIP? LIP (Look Into Person) is an important benchmark in the field of human parsing. Human parsing is a fine-grained semantic segmentation task that aims to partition the human body in an image into multiple regions, each corresponding to a specific category, such as body parts (e.g., the face) or clothing categories (e.g., jackets). Due to the diversity and complexity of the categories, it is more challenging than simply segmenting out the human body.
Specifically, LIP is divided into three tracks:
• Single-person Human Parsing Track
• Multi-person Human Parsing Track
• Video Multi-person Human Parsing Track
ACE2P (Augmented Context Embedding with Edge Perceiving) is a human parsing model that aims to segment human body parts and clothing in an image. The model fuses low-level features, global context information, and edge details to learn the human parsing task. Its backbone is a single ResNet101 model. The network structure diagram is as follows:
The champion ACE2P prediction model can be experienced directly from the command line via its PaddleHub version:
Paddlepaddle.org.cn/hubdetail?n…
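Assuming PaddleHub is installed, the published `ace2p` module can be tried from the command line roughly like this (`demo.jpg` stands for any local test image):

```shell
# Install the ACE2P human-parsing module, then run it on a local image
hub install ace2p
hub run ace2p --input_path demo.jpg
```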
4. How does it perform in real applications?
Having said all that, how does PaddleSeg actually perform? Let's look at some case studies.
4.1. Application Scenario 1: Industrial inspection
PaddlePaddle cooperated with leading domestic enterprises in the quality inspection of rare-earth permanent-magnet parts, using the PaddleSeg model library to upgrade precision-parts quality inspection with the power of AI.
Under the traditional working mode, quality inspection workers visually inspect parts less than 45 mm in diameter under bright light for 8 to 12 hours a day. The work intensity is very high and causes great damage to eyesight. The intelligent precision-parts sorting system based on PaddleSeg's built-in ICNet model now has an error rate of less than 0.1%. For color images at 1K×1K resolution, prediction takes as little as 25 ms on a 1080Ti, and single-part sorting is 20% faster than what was achieved with other frameworks. PaddleSeg has helped factories cut production costs by an average of 15% and raise efficiency by an average of 15%. Delivery quality has also improved markedly, with the complaint rate falling by 30% on average.
4.2. Application Scenario 2: Block segmentation
Segmentation technology is also widely used in agriculture, and plot segmentation is one such scenario. The traditional approach works on satellite remote-sensing imagery and relies on large numbers of technicians with remote-sensing backgrounds using professional software for analysis. Satellite imagery suffers from enormous image sizes and low visual resolution, which places high demands on the technicians' expertise; moreover, manual annotation involves a great deal of repetitive, time-consuming, and tedious work. An intelligent segmentation system that quickly and automatically determines the boundaries and area of agricultural land would make crop yield prediction and crop classification more effective and assist agricultural decision-making.
4.3. Application Scenario 3: Lane line segmentation

Lane line segmentation is an important application of image segmentation in the field of autonomous driving. It faces two difficulties:
• Accuracy: because vehicle safety is at stake, lane line segmentation must be highly accurate.
• Real-time performance: while the vehicle travels at high speed, segmentation results must be delivered quickly, in real time.
Accurate, fast lane line segmentation can provide vehicles with real-time navigation and lane positioning guidance and improve safety. It is currently being applied in Baidu's autonomous vehicle practice.
PaddleSeg measured effect:
4.4. Application Scenario 4: Portrait Segmentation
Portrait segmentation is needed not only in industrial scenarios but also in consumer entertainment: short-video portrait effects, intelligent ID-photo matting, film and television post-production, and other scenarios all rely on it.
5. Technical deep dive: the key techniques behind LIP human parsing
5.1. Modified the network structure and introduced dilated convolution: +1.7 points
• Replaced ResNet's 7×7 convolution layer with three 3×3 convolution layers to increase network depth and strengthen the network's low-level features.
• Replaced all pooling layers in the network with stride-2 convolution layers, making the down-sampling process learnable.
• Added dilation to stage 5 of the ResNet structure to enlarge the network's receptive field and increase its effective area of action.
• Implemented a pyramid pooling structure to ensure that global context is extracted.
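The receptive-field gain from dilation can be checked with the standard formula; the three-layer stack below is illustrative, not the exact ACE2P configuration:

```python
def receptive_field(layers):
    """Compute the receptive field of a stack of conv layers.

    Each layer is (kernel_size, stride, dilation). A dilated kernel
    covers k_eff = kernel + (kernel - 1) * (dilation - 1) input positions.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        k_eff = kernel + (kernel - 1) * (dilation - 1)
        rf += (k_eff - 1) * jump  # growth scaled by the accumulated stride
        jump *= stride
    return rf

# Three 3x3 convs at stride 1: dilation=2 nearly doubles the receptive field
plain   = receptive_field([(3, 1, 1)] * 3)   # 7
dilated = receptive_field([(3, 1, 2)] * 3)   # 13
print(plain, dilated)
```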
5.2. Introduced Lovasz Loss: +1.3 points
• Lovasz Loss is a multi-class IoU-style loss designed specifically around the segmentation evaluation metric, making it better suited to segmentation tasks.
• Used together with Cross Entropy Loss, it improved the overall result by 1.3 points.
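As a rough sketch, here is a simplified binary Lovasz hinge in numpy combined with cross entropy; the equal 0.5/0.5 weighting and the toy logits are illustrative assumptions, not the competition settings:

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard (IoU) loss."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_hinge(logits, labels):
    """Binary Lovasz hinge over flattened pixels (labels in {0, 1})."""
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs
    order = np.argsort(-errors)  # sort hinge errors descending
    grad = lovasz_grad(labels[order])
    return np.maximum(errors[order], 0.0) @ grad

def bce(logits, labels):
    """Binary cross entropy on raw logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

logits = np.array([2.0, -1.0, 0.5, -2.0])  # toy per-pixel scores
labels = np.array([1.0, 0.0, 1.0, 0.0])

# Combine the two losses (weights here are an arbitrary illustration).
total = 0.5 * bce(logits, labels) + 0.5 * lovasz_hinge(logits, labels)
print(total)
```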
5.3. Customized the learning-rate strategy: +0.8 points
In practice we found that the learning-rate strategy also had a great influence on the final result, so we customized it for this task.
• A warmup strategy was used at the start of training, making model optimization easier to converge in the early stage.
• The common poly schedule was replaced with cosine decay, so that the learning rate at the end of training does not become so small that the network fails to converge to the optimum.
• The learning-rate curve over the whole process is visualized as follows:
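The schedule described above, linear warmup followed by cosine decay, can be written as a small function; the base rate and step counts here are illustrative assumptions:

```python
import math

def lr_at(step, total_steps, base_lr=0.01, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))    # tiny value at the very start
print(lr_at(99, total))   # reaches base_lr at the end of warmup
print(lr_at(550, total))  # mid-training, already decaying
print(lr_at(999, total))  # close to zero at the end
```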
5.4. Added an edge module: +1.4 points
• An edge detection module is added to sharpen the boundary features between different parts and reduce mis-segmentation between classes.
• The features of the edge module are fused with the segmentation features so that the two tasks improve each other. Details are as follows:
6. Code experience
To better experience the segmentation library and avoid problems caused by the software and hardware environment, we use the AIStudio one-stop training and development platform as the experience environment. This tutorial uses the DeepLabv3+/Xception network structure for portrait segmentation to get you familiar with PaddleSeg. DeepLabv3+ is the latest work in the DeepLab semantic segmentation series, preceded by DeepLabv1, DeepLabv2, and DeepLabv3. In it, the DeepLab authors fuse multi-scale information through an encoder-decoder structure while retaining the original atrous convolutions and ASPP layer; the backbone uses the Xception model to improve the robustness and running speed of semantic segmentation, achieving a new state-of-the-art result of 89.0 mIoU on the PASCAL VOC 2012 dataset.
The entire network structure is as follows:
Aistudio.baidu.com/aistudio/pr…
Aistudio.baidu.com/aistudio/pr…
The project code has been carefully optimized and wrapped with top-level logic by the R&D staff so that developers can experience PaddleSeg as quickly as possible. The code below illustrates the core process and ideas for reference; developers are advised to fork the complete project and run it end to end.
6.1. Model training
Step 1: Decompress the pre-trained model.
6.2. Model prediction and visualization
The prediction visualization parameter `--vis_dir` specifies the directory where predicted result images are saved.
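For reference, a prediction/visualization invocation in the PaddleSeg v1 repository looks roughly like this; the config filename below is a placeholder for whichever config the project actually uses:

```shell
# Run prediction and write visualized masks into ./visual_results
python pdseg/vis.py --cfg configs/deeplabv3p_xception65_humanseg.yaml \
                    --vis_dir visual_results \
                    --use_gpu
```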
6.3. Practical effect
Display data before and after segmentation. You can either select data from the test set or upload your own to test the actual segmentation results.
The results look great, so go ahead and try it!