Abstract: Semantic segmentation datasets are relatively large, so training requires very powerful hardware support.
This article is shared from the Huawei Cloud community post "[Cloud Resident Co-creation] Semantic Segmentation Algorithm Sharing Based on Transfer Learning"; original author: Qiming.
This post shares two semantic segmentation algorithms based on transfer learning: 1. Learning to Adapt Structured Output Space for Semantic Segmentation; 2. ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation.
Part 1: Background on transfer segmentation
Semantic segmentation, detection, and classification are the three main directions in machine vision. However, compared with detection and classification, semantic segmentation faces two particularly difficult problems:

One is the lack of datasets. A classification dataset is labeled with a category per image and a detection dataset with bounding boxes, but segmentation predicts a semantic label for every pixel, which means its annotations must also be pixel-level. Labeling a pixel-level dataset is time-consuming and labor-intensive: for example, annotating a single image in Cityscapes, the autonomous-driving dataset, takes about 1.5 hours. The time and labor cost of building semantic segmentation datasets is therefore very high.

The other problem is that semantic segmentation datasets need to cover the real world, but in practice it is difficult to cover every situation: different weather, different places, different architectural styles, and so on. These are the problems semantic segmentation faces.

Facing these two situations, how do researchers solve them?

Rather than building datasets by hand, they found that annotation cost can be reduced by synthesizing simulated datasets, using techniques such as computer graphics, in place of real-world ones.

Take the familiar game GTA5 as an example: one approach is to collect simulated data inside the GTA5 game and obtain annotations almost for free from the game engine, thereby reducing annotation cost. But there is a problem: a model trained on such simulated data degrades in the real world, because traditional machine learning presupposes that the test set and the training set are identically distributed, while a simulated dataset and a real dataset are necessarily distributed differently.

Therefore, the current goal is to use transfer (domain adaptation) algorithms to solve the problem that a model trained on the source domain degrades on the target domain.
Main contributions and related work of the two papers

Main contributions
Article 1: Learning to Adapt Structured Output Space for Semantic Segmentation
1. A transfer segmentation algorithm based on adversarial learning is proposed;
2. It is verified that the scene layout and context information of the two domains can be effectively aligned through adversarial training in output space;
3. The transfer performance of the model is further improved by adversarial training on multiple levels of output.
Article 2: ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation
1. An entropy-based loss function is used to prevent the network from making low-confidence predictions on the target domain;
2. An adversarial learning approach based on entropy is proposed, which considers both entropy reduction and structural alignment of the two domains;
3. A constraint method based on a class-distribution prior is proposed.
Related work
Before explaining the two papers, let's briefly introduce another article: FCNs in the Wild. That paper was the first to apply transfer learning to semantic segmentation. It proposes feeding the features produced by the segmentation network's feature extractor into a discriminator, and transferring the segmentation task by aligning global information.
First, let's introduce what a common semantic segmentation network looks like. It generally consists of two parts. The first is the feature extractor: for example, a ResNet-series or VGG-series backbone can be used to extract features from images. The second is the classifier, which takes the previously extracted features as input (common classifiers include PSP, and ASPP from DeepLab V2, which is the most commonly used in DA segmentation). Feeding the features from the feature extractor into a discriminator completes the whole DA setup.
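To make this structure concrete, here is a minimal sketch of the feature-extractor + classifier layout in PyTorch. It is illustrative only: the ASPP/PSP head is simplified to a single 1×1 convolution, and `SimpleSegNet` and all shapes are our own assumptions, not the papers' code.

```python
# Minimal sketch of a segmentation network: ResNet feature extractor +
# classifier head. The real papers use ASPP (DeepLab V2); a 1x1 conv
# stands in for it here to keep the sketch short.
import torch.nn as nn
from torchvision import models

class SimpleSegNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        resnet = models.resnet101(weights=None)
        # Feature extractor: all ResNet stages, dropping avgpool and fc.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Classifier: maps 2048-d features to per-class scores.
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feat = self.backbone(x)          # (N, 2048, h/32, w/32)
        logits = self.classifier(feat)   # (N, C, h/32, w/32)
        # Upsample to input resolution for per-pixel prediction.
        return nn.functional.interpolate(
            logits, size=(h, w), mode="bilinear", align_corners=False)
```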
Why does feeding features into a discriminator accomplish DA? We can understand this from what a discriminator does.

In a GAN, the discriminator is trained to tell whether an input image is real or fake. Here, a discriminator is needed to tell whether the input features come from the source domain or the target domain. After obtaining a discriminator that can distinguish the source domain from the target domain, its parameters are fixed and the feature extractor of the segmentation network is trained. How is it trained? By getting the feature extractor to confuse the discriminator.
So how does the feature extractor confuse the discriminator? Whether it extracts features from the source domain or the target domain, it must align the two feature distributions so that the discriminator cannot tell the two domains apart, completing the "confusion" task. Once the "confusion" task is completed, the feature extractor has learned to extract "domain-invariant" information.

Extracting "domain-invariant" information is essentially the transfer process: because the network can extract domain-invariant information, it extracts good features from both the source domain and the target domain.
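As a concrete illustration of this alternating scheme, here is a hedged sketch of one adversarial training step in PyTorch. All names (`extractor`, `discriminator`, `opt_f`, `opt_d`, the image batches) are illustrative assumptions, not code from either paper:

```python
# One adversarial step: (1) train the discriminator to separate source
# features from target features; (2) freeze it and train the feature
# extractor to "confuse" it, aligning the two feature distributions.
import torch
import torch.nn.functional as F

SOURCE, TARGET = 1.0, 0.0  # domain labels for the discriminator

def bce(pred, label):
    return F.binary_cross_entropy_with_logits(
        pred, torch.full_like(pred, label))

def adversarial_step(extractor, discriminator, opt_f, opt_d,
                     src_images, tgt_images):
    # 1) Discriminator step (feature extractor frozen via no_grad).
    with torch.no_grad():
        f_src = extractor(src_images)
        f_tgt = extractor(tgt_images)
    d_loss = bce(discriminator(f_src), SOURCE) + bce(discriminator(f_tgt), TARGET)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Extractor step: make target features look like source features
    #    to the (frozen) discriminator.
    for p in discriminator.parameters():
        p.requires_grad_(False)
    g_loss = bce(discriminator(extractor(tgt_images)), SOURCE)
    opt_f.zero_grad(); g_loss.backward(); opt_f.step()
    for p in discriminator.parameters():
        p.requires_grad_(True)
```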
The following two papers are both guided by this idea of adversarial training with a discriminator; the difference is what gets fed into the discriminator, which will be described in more detail below.
Algorithm analysis of the first paper
Learning to Adapt Structured Output Space for Semantic Segmentation
Like the related work above, this paper consists of a segmentation network and a discriminator. From the figure above, or simply from the title, its contribution lies in the output space. So what is output space?

Here, output space refers to the probabilities produced by the semantic segmentation network after Softmax; we call these probabilities the output space.
The authors argue that directly using features for adversarial training is not ideal, and that it is better to use output-space probabilities. Why? In tasks such as classification, adversarial adaptation is usually done on features, but segmentation is different: the high-dimensional feature just before the classifier is a very long vector. For example, the last layer of ResNet-101 outputs 2048-dimensional features, and such high-dimensional features naturally encode very complex information. For semantic segmentation, this complex information may not be useful. That is the authors' first observation.

The authors' second observation is that although the segmentation output is low-dimensional (only the category dimension: with C categories, each pixel's probability is a C×1 vector), the output for an entire image still contains rich information about scene, layout, and context. The authors believe that whether an image comes from the source domain or the target domain, the segmentation results should be spatially very similar, because both the simulated data and the real data serve the same segmentation task. As shown above, both the source and target domains depict autonomous driving: intuitively, most of the middle of the image is road, the sky is usually at the top, and buildings sit on the left and right. The scene layouts are very similar, so directly using the low-dimensional probabilities, i.e. the softmax output, for adversarial training can work very well.
Based on these two insights, the authors feed the probability directly into the discriminator. The training process is essentially the same as in a GAN, except that instead of features, the final output probability is passed to the discriminator.
Returning to the figure above, you can see that the two DA branches on the left are multi-scale. Recall the semantic segmentation network we described at the beginning: it is divided into a feature extractor and a classifier, and the classifier's input is the feature produced by the feature extractor.

As you know, ResNet has several stages. Based on this, the authors propose carrying out the output-space adversarial training on the last stage and the penultimate stage respectively: each of the two features is passed through its own classifier, and each resulting probability map is fed into its own discriminator, as sketched below.
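Here is a hedged sketch of the generator-side, multi-level adversarial loss on a target batch. `backbone4`/`backbone5` (penultimate/last ResNet stages), `cls4`/`cls5` (two classifiers), `D1`/`D2` (two discriminators), and the loss weights are illustrative assumptions, not AdaptSegNet's exact code:

```python
# Multi-level output-space adversarial loss (generator side): make the
# target domain's softmax outputs at two levels look like "source"
# (label 1) to the two discriminators.
import torch
import torch.nn.functional as F

def adv_target_loss(D, prob, source_label=1.0):
    out = D(prob)
    return F.binary_cross_entropy_with_logits(
        out, torch.full_like(out, source_label))

def multi_level_adv_loss(tgt_images, backbone4, backbone5, cls4, cls5,
                         D1, D2, lam1=0.001, lam2=0.0002):
    feat4 = backbone4(tgt_images)        # penultimate-stage features
    feat5 = backbone5(feat4)             # last-stage features
    p4 = F.softmax(cls4(feat4), dim=1)   # output-space probability maps
    p5 = F.softmax(cls5(feat5), dim=1)
    # The final level is weighted more heavily than the auxiliary level.
    return lam1 * adv_target_loss(D1, p5) + lam2 * adv_target_loss(D2, p4)
```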
To sum up, the algorithmic innovations of this paper are:
1. Adversarial training is carried out in output space, exploiting the structural information in the network's predictions;
2. The model is further improved by adversarial training on multiple levels of output.
So how well does this work in practice?

The table above shows results for GTA5 → Cityscapes. The first row, the baseline (ResNet), trains a model on the source domain and tests it directly on the target domain. The second row is the result of adversarial training on the feature dimension, 39.3: an improvement over the source-only model, but lower than the two output-space models below it. In the single-level variant, features from the last layer of ResNet are fed into the classifier and the resulting output is used for adversarial training. The multi-level variant applies adversarial training to both the last and the penultimate stages of ResNet, and as the results show, it performs better.
Algorithm analysis of the second paper
ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation
Next, we discuss the second paper: a transfer segmentation method based on entropy minimization and entropy adversarial training.
To understand this paper, we need the concept of entropy. The authors use Shannon information entropy, built from the p·log p terms in the formula below.

In semantic segmentation, the network makes a prediction for every pixel of an image. For each pixel the final result is a C×1 vector, where C is the number of possible categories, so each category contributes its probability times the log of that probability. Summing these terms over the categories gives the entropy of that pixel, and summing over the height and width of the image, i.e. over every pixel, gives the entropy of the whole image.
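In symbols, a standard formulation consistent with the description above (ADVENT additionally normalizes the per-pixel entropy by log C, a detail omitted here); P_x^(h,w,c) denotes the softmax probability of class c at pixel (h,w):

```latex
% Entropy of pixel (h, w): sum over the C classes
E_x^{(h,w)} = -\sum_{c=1}^{C} P_x^{(h,w,c)} \log P_x^{(h,w,c)}

% Entropy loss for image x: sum over all pixels
\mathcal{L}_{ent}(x) = \sum_{h=1}^{H} \sum_{w=1}^{W} E_x^{(h,w)}
```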
Looking at the figure above, the authors observed a phenomenon: in the entropy maps of source-domain predictions, entropy is high only at the edges of categories (the darker the color, the lower the entropy; the lighter the color, the higher the entropy). For target-domain images, by contrast, the prediction over the whole image contains many light-colored, high-entropy parts. The authors therefore argue that the gap between the source domain and the target domain can be narrowed by reducing the entropy of the target domain, since the target predictions contain too many useless high-entropy values (a certain amount of noise).

So how do we reduce the entropy of the target domain? The authors propose two methods, which are the algorithmic innovations of this paper:
1. An entropy-based loss function is used to prevent the network from making low-confidence predictions on the target domain;
2. An entropy-based adversarial learning approach is proposed, which considers both entropy reduction and structural alignment of the two domains.
The first method minimizes the entropy value directly: the overall entropy of an image is computed and optimized directly by gradient back-propagation. However, the authors argue that directly reducing entropy ignores a lot of information, such as the semantic structure of the image itself. Therefore, drawing on the output-space adversarial approach of the first paper, they propose reducing entropy through adversarial learning, as sketched below.
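For reference, here is a minimal sketch of the direct entropy-minimization loss on a target batch. `model`, `tgt_images`, the eps guard, and the use of `mean()` rather than a sum are our own illustrative choices, not ADVENT's exact code:

```python
# Direct entropy minimization: compute per-pixel entropy from the softmax
# output and backpropagate its mean over the image.
import torch
import torch.nn.functional as F

def entropy_min_loss(model, tgt_images):
    logits = model(tgt_images)                    # (N, C, H, W)
    p = F.softmax(logits, dim=1)
    ent = -(p * torch.log(p + 1e-30)).sum(dim=1)  # (N, H, W) per-pixel entropy
    return ent.mean()

# usage: entropy_min_loss(model, tgt_images).backward(); optimizer.step()
```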
As the earlier figure shows, the entropy of the source domain is very low. The authors therefore reason that if a discriminator is used to distinguish the source domain from the target domain, so that the final output entropy maps of the two domains become very similar, the entropy of the target domain will be reduced. The procedure is just like the first paper's, except that the first paper feeds the probability directly into the discriminator, while this paper feeds the entropy map into the discriminator.
On the one hand, this approach carries out entropy reduction; on the other, it exploits structural information. The experimental results show that on GTA5 → Cityscapes, direct entropy minimization improves significantly over FCNs in the Wild and the output-space method, and the entropy-adversarial variant is slightly better still.

Moreover, the authors found that if the probabilities predicted by the direct entropy-minimization model and the entropy-adversarial model are summed and the argmax is taken, the result improves by a few more points. For semantic segmentation tasks, this improvement is considerable.
Code reproduction
Let's move on to the code reproduction.

The principle of reproducing a paper is to stay consistent with the specific methods, parameters, data augmentation, and so on described in the paper.

For this reproduction, we first searched for the open-source code on GitHub. Based on the open-source code, we implemented the two papers in the same framework, based on PyTorch. If you can read the code of one paper, the code of the other is very easy to understand.

Below are two QR codes, one for the code of each paper; you can scan them to view the code.
Introduction to ModelArts

Both papers were reproduced on Huawei Cloud ModelArts. Let's start with a brief introduction to ModelArts.

ModelArts is a one-stop AI development platform for developers. It provides massive data preprocessing and semi-automatic annotation, large-scale distributed training, automatic model generation, and on-demand model deployment across device, edge, and cloud, helping users quickly create and deploy models and manage the full-cycle AI workflow. It has the following core functions:
Data management, which can save up to 80% of manual data-processing cost: it covers 4 data formats (image, sound, text, video) with 9 annotation tools, and provides intelligent annotation and team annotation to greatly improve annotation efficiency; it supports common data-processing capabilities such as data cleaning, data augmentation, and data inspection; and it offers flexible, visual management of multiple dataset versions, with dataset import and export, to facilitate model development and training in ModelArts.

Development management, which lets a local development environment (IDE) connect to cloud services: ModelArts can be used in the cloud through the management-console interface, and also provides a Python SDK, so you can access ModelArts from any local IDE through the SDK, including creating and training models and deploying services, staying closer to your development habits.

Training management, for faster training of high-precision models: the general AI modeling workflow centered on EI-Backbone has three advantages:

1. Training high-precision models from small-sample data, greatly saving data-annotation cost;

2. Full-space network architecture search and automatic hyperparameter optimization, which automatically and rapidly improve model accuracy;

3. After loading an EI-Backbone integrated pre-trained model, the process from model training to deployment can be shortened from several weeks to several minutes, greatly reducing training cost.

Model management, which supports unified management of all iterations and debugging: AI model development and tuning require many rounds of iteration and debugging, and changes to training datasets, code, or parameters can all affect model quality; without unified metadata management of the development process, the optimal model may become irreproducible. ModelArts supports importing models from four sources: from training, from templates, from container images, and from OBS.

Deployment management, with one-click deployment to device, edge, and cloud: ModelArts supports online inference, batch inference, and edge inference. High-concurrency online inference meets the demands of large online business volume, high-throughput batch inference quickly handles inference over accumulated data, and highly flexible edge deployment completes inference in the local environment.

Image management, with custom images supporting custom runtime engines: ModelArts uses container technology at the bottom layer, so you can build a container image and run it on ModelArts. The custom-image function supports free-text command-line parameters and environment variables, providing high flexibility and supporting the job-startup requirements of any compute engine.
Code interpretation
Next, let's explain the code in detail.
AdaptSegNet: multi-level output-space adversarial training. The code is as follows:
As mentioned earlier, the probability obtained from SoftMax is fed into the discriminator, which corresponds to the red box in the figure above.

Why are there a D1 and a D2? As mentioned earlier, the features of the last and penultimate layers of ResNet-101 can both be used, forming a multi-level adversarial process. For the specific loss, bce_loss can be used for the adversarial training, as in the sketch below.
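As a hedged sketch of the discriminator side of that bce_loss update (the generator side appeared earlier), `p4_src`, `p5_src`, `p4_tgt`, and `p5_tgt` are assumed to be the softmax outputs of the two levels on a source batch and a target batch; the names and labels are illustrative, not the repository's exact code:

```python
# Discriminator update for D1 (last level) and D2 (penultimate level):
# label 1 = source, 0 = target. detach() keeps this step from updating
# the segmentation network.
import torch
import torch.nn.functional as F

def d_loss(D, prob, label):
    out = D(prob.detach())
    return F.binary_cross_entropy_with_logits(
        out, torch.full_like(out, label))

def discriminator_losses(D1, D2, p5_src, p5_tgt, p4_src, p4_tgt):
    loss_D1 = d_loss(D1, p5_src, 1.0) + d_loss(D1, p5_tgt, 0.0)
    loss_D2 = d_loss(D2, p4_src, 1.0) + d_loss(D2, p4_tgt, 0.0)
    return loss_D1, loss_D2
```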
Minimizing entropy with adversarial learning
This paper requires computing entropy. So how is the entropy computed?

First obtain the probability: run the network output through SoftMax, then use P*logP to compute the entropy value, and feed the entropy into the discriminator.
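A minimal sketch of that conversion, assuming `logits` is the segmentation network's raw output; the eps guard is ours, and ADVENT's released code may differ in details such as normalizing by log(C):

```python
# Turn softmax probabilities into a per-class self-information map
# (-p * log p); this map, rather than the probability itself, is what
# gets fed to the discriminator in the second paper.
import torch
import torch.nn.functional as F

def to_entropy_map(logits, eps=1e-30):
    p = F.softmax(logits, dim=1)    # (N, C, H, W) probabilities
    return -p * torch.log(p + eps)  # (N, C, H, W) weighted self-information

# First paper:  discriminator(F.softmax(logits, dim=1))
# Second paper: discriminator(to_entropy_map(logits))
```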
The only difference between the two codebases is that one feeds the Softmax output directly into the discriminator, while the other converts the Softmax probability into entropy first and then feeds that in. That is the one change made to the code, so if you can understand the flow of the first codebase, it is easy to get started with the second.
Conclusion
Semantic segmentation datasets are relatively large, so training requires very powerful hardware support. Generally speaking, a lab may only have 10/11/12 GB GPUs, but if you use Huawei Cloud ModelArts (introduced in the ModelArts section above), you can obtain better results.
If you are interested, you can click >>>AI development platform ModelArts and try it out.
Learning to Adapt Structured Output Space for Semantic Segmentation and ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation are the two papers discussed. The full text of each can be obtained through the QR codes mentioned above.