Editor’s note: “A light rain moistens the heavenly street, soft as cream; the grass, green from afar, vanishes when seen up close.” (Han Yu)

From these two lines of Han Yu’s poem we can see that people’s semantic understanding of image content does not rely on fine-grained supervisory information.

In contrast, semantic segmentation in machine learning currently relies on large amounts of finely annotated data. The Internet, as the most abundant source of data, naturally attracts the attention of practitioners, but exploiting these data comes with an enormous annotation burden.

This raises two questions. First, can we learn directly from the Web with the help of keyword information, without the need for elaborate manual annotation? Second, can category-independent cues, trained on a dataset labeled with only a few categories, generalize to objects of all other categories?

In this article, Professor Cheng Mingming from Nankai University introduces the current research progress on these two questions.

At the end of this article, the algorithm code and a download link for the references are provided.



Traditional pixel-level semantic understanding methods usually require training on a large number of finely annotated images. The image above is an example from the ADE20K dataset, which contains 210,000 finely annotated object instances that Professor Antonio Torralba’s mother spent a great deal of time annotating.


“My mother annotated such a good dataset; I wish I had more mothers,” Antonio joked at CVML 2012. It is a joke, but it also illustrates how important building a dataset is, and how much time and effort it takes.


When we were growing up, our parents never gave us such detailed annotations to help us recognize and perceive the world around us. The usual way of learning is that a parent points at a flower and tells us it is a flower, and we can then easily figure out which regions, which pixels, correspond to that flower. So how do we use this kind of information to learn the semantics that each pixel represents? And can such information help us better understand image content and carry out fine-grained semantic understanding of images?


Our research focuses on how to use a similar mechanism to remove the dependence on fine annotation. In daily life, when we want to understand an unfamiliar object, such as a fruit, we usually only need to search the Internet and look at a few pictures to gain a full understanding of it and easily identify the corresponding target and its region. Can computers be given the same ability to learn directly from the Web, without elaborate manual annotation?


Many related lines of work can help with pixel-level semantic understanding, such as salient object detection: given an image, find the salient objects in it. When we retrieve images from the Internet with a keyword, the retrieved images usually correlate strongly with that keyword, so through salient object detection we can assume that the semantic label of the detected salient region is the keyword. Of course, this assumption is noisy and sometimes wrong.


Besides saliency detection, image edge detection and over-segmentation are also useful. This information is class-independent, and a good general model can be trained from a small dataset. For edge detection, a good model can be trained from the BSDS dataset of only 500 images. Edges describe object boundaries well and thus reduce the dependence on fine annotation; over-segmentation and saliency detection play a similar role. A natural idea, then, is to take these class-independent cues, trained on datasets labeled with a small number of categories, and generalize them to objects of all other categories: even if we have never seen an object and do not know its category, we can still find its region.


This line of thought led to our first work, on salient object detection, published at CVPR 2017 and in TPAMI 2018. Let’s take a look at this work.


The core idea of this work is deep supervision at multiple scales: information from different scales is integrated to detect the salient object region. High-quality segmentation is hard to obtain because the low-level features of a CNN are good at describing detail while the high-level features are good at global localization, and neither alone is sufficient. By passing top-level information down to enrich the bottom layers, the model achieves both good localization and fine detail.
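As a rough illustration of this top-down, multi-scale fusion idea (a minimal sketch, not the authors’ exact DSS architecture; the channel sizes and module names are assumptions), the snippet below predicts a side output at each backbone stage and propagates the deepest prediction downward:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Minimal sketch: multi-scale side outputs fused top-down.

    Each side output predicts a saliency map at its own scale; deeper
    (coarser) predictions are upsampled and added to shallower ones, so
    low-level detail is combined with high-level localization.
    Channel sizes are illustrative, not those of the original model.
    """
    def __init__(self, feat_channels=(64, 128, 256, 512)):
        super().__init__()
        # One 1x1 conv per backbone stage, producing a single-channel side output.
        self.side_convs = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in feat_channels]
        )

    def forward(self, feats):
        # feats: list of backbone feature maps, ordered shallow -> deep.
        sides = [conv(f) for conv, f in zip(self.side_convs, feats)]
        fused = sides[-1]                      # start from the deepest prediction
        for side in reversed(sides[:-1]):      # pass information top-down
            fused = F.interpolate(fused, size=side.shape[-2:],
                                  mode="bilinear", align_corners=False)
            fused = fused + side               # enrich shallow maps with deep context
        return torch.sigmoid(fused)            # saliency probability map
```

In practice the list `feats` would come from, e.g., a VGG-style backbone, one feature map per stage.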


Here are some sample results. Our point is not to show how to do salient object detection, but to convey an important message: with salient object detection we can segment the salient objects in an image well. This finding can help machines learn pixel-level semantic segmentation directly from the Web.


The figure above shows the detection results of our method in different scenarios. Our salient object detection method can still find the object region well, even with low contrast and complex objects.


On common datasets, the Fβ score of our algorithm exceeds 90%. To verify its generalization ability, we performed cross-validation between different datasets. The experiments show that our salient object detection method can learn a class-agnostic tool from data labeled with only a small number of categories (for example, around 1,000 images), and can segment objects from images well even without knowing their classes.
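For reference, the Fβ score used here is the standard weighted harmonic mean of precision and recall from the salient object detection literature (β² is commonly set to 0.3 to emphasize precision; the exact value used in the paper is not stated here):

```latex
F_{\beta} = \frac{(1+\beta^{2})\,\mathrm{Precision}\cdot\mathrm{Recall}}
                 {\beta^{2}\,\mathrm{Precision} + \mathrm{Recall}},
\qquad \beta^{2} = 0.3 \ \text{(common choice)}
```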


This approach has some drawbacks: for example, it fails when the scene is particularly complex (like the motorcycle example) or when the salient object is ambiguous (like the cat on the right).


As mentioned above, the Fβ score of our method exceeds 90% on multiple datasets, so it can locate salient objects well. One application of our work, shown above, is the smart camera on a Huawei phone: it automatically finds the foreground object while shooting, enabling the camera to take wide-aperture photos. Traditional large-aperture photography requires a bulky SLR camera to obtain the artistic effect of a sharp subject against a blurred background.


Another important piece of category-independent information is edge detection. Edges help locate objects. As the figure above shows, without knowing the specific category of the animal, we can find its region as long as we know there is an animal in the image (a keyword-level label). Here we present our work published at CVPR 2017 (RCF).


The core idea of RCF is to use rich multi-scale features to detect edges in natural images. In early classification networks the intermediate layers were often neglected; later work exploited them through 1×1 convolution layers, but only used the last convolution layer of each stage. In fact, every convolution layer is useful for the final result, so RCF fuses all convolution layers through 1×1 convolutions. This fusion effectively improves edge detection.
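A minimal sketch of this “fuse every conv layer in a stage” idea follows (simplified from RCF; the channel numbers and module names are assumptions, not the released implementation):

```python
import torch.nn as nn

class StageFusion(nn.Module):
    """Fuse all conv layers of one backbone stage into a single edge map.

    Unlike approaches that only tap the last conv layer of a stage, every
    layer is reduced with a 1x1 conv and the reductions are summed before
    predicting the stage's edge response (RCF-style, simplified).
    """
    def __init__(self, in_channels, mid_channels=21, num_layers=3):
        super().__init__()
        self.reducers = nn.ModuleList(
            [nn.Conv2d(in_channels, mid_channels, kernel_size=1)
             for _ in range(num_layers)]
        )
        self.predict = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, layer_feats):
        # layer_feats: feature maps from every conv layer of the stage.
        fused = sum(r(f) for r, f in zip(self.reducers, layer_feats))
        return self.predict(fused)   # single-channel edge response for this stage
```

In the full model, one such block per stage would produce a multi-scale set of edge responses that are then combined into the final edge map.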


For example, in the straw regions of the image, traditional methods such as the Canny operator produce very strong responses, while RCF suppresses them well. Conversely, in regions where edges are hard for people to see, such as the sofa and the tea table, RCF detects edges robustly; the result is even clearer than the structure visible in the original image. This gives us a foundation for learning directly from the Web.


Edge detection is one of the earliest problems studied in computer vision and has been developing for more than 50 years, but RCF is the first method to run in real time while outperforming manual annotation on the Berkeley dataset. Of course, this does not mean RCF is better than humans, who can do better given enough time to think, but it is certainly a breakthrough. And training such a powerful edge detection model from a dataset of only 500 images is very encouraging for learning directly from the Web.


Good over-segmentation results can effectively assist pixel-level semantic understanding, especially when little manually annotated data is available. Over-segmentation is another important piece of category-independent information. Although the over-segmentation results in the figure above look similar to semantic segmentation, there is an essential difference. In semantic segmentation, each pixel has a definite semantic label, so a neural network can learn the specific semantics of every pixel. Over-segmentation merely divides the image into many regions, each with a label that carries no definite semantic meaning; given an image, we cannot know in advance how many regions, or how many labels, it will produce (100, 1,000, or 10,000?). This poses great difficulties for learning. Here is our work published at IJCAI 2018.


Our method does not directly match pixels with annotations. It first over-segments the image into superpixels to speed up computation, then extracts convolutional features for the superpixels, pools the features of each superpixel into a fixed-length vector, and finally learns the distance between each pair of superpixels; two superpixels are merged when the distance between them is below a threshold. Compared to traditional methods, ours is simple and efficient, achieves good results, and runs in real time (50 fps). This also supports learning pixel-level semantic understanding directly from the Internet.
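The sketch below illustrates the general recipe just described (pool convolutional features per superpixel, compare adjacent pairs, merge below a threshold). The function names, the source of the features, and the threshold are assumptions for illustration, not the published IJCAI 2018 implementation:

```python
import numpy as np

def pool_superpixel_features(conv_feats, sp_labels):
    """Average-pool a (H, W, C) conv feature map into one vector per superpixel."""
    n_sp = sp_labels.max() + 1
    feats = np.zeros((n_sp, conv_feats.shape[-1]), dtype=np.float32)
    for sp in range(n_sp):
        mask = sp_labels == sp
        feats[sp] = conv_feats[mask].mean(axis=0)
    return feats

def merge_superpixels(sp_feats, adjacency, threshold=0.5):
    """Greedily merge adjacent superpixels whose feature distance is below threshold."""
    parent = list(range(len(sp_feats)))

    def find(i):                        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in adjacency:              # adjacency: iterable of neighboring pairs (i, j)
        dist = np.linalg.norm(sp_feats[i] - sp_feats[j])
        if dist < threshold:
            parent[find(i)] = find(j)   # merge the two regions
    return np.array([find(i) for i in range(len(sp_feats))])
```

In the actual method the pairwise distance is learned rather than being a plain Euclidean distance; the sketch only shows the pooling-and-merging structure.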


With the class-independent low-level visual knowledge described above, we can do a lot of interesting analysis on images. For example, we retrieve images from the Internet with a keyword; through salient object detection we find the approximate position of the object, and then refine its region with edges, over-segmentation, and other cues, eventually generating a proxy ground truth (GT). This GT is not manually annotated but is an automatic guess at the GT of Internet images. The guess is likely to cover the regions corresponding to the keyword, but of course it contains many errors. For the bicycle in the image above, for instance, our method often marks the person as well, because bicycles usually appear together with people.

So how do you get rid of these errors?


The whole method proceeds in four steps (a code-style outline follows the list):

1. Retrieve a large number of images by keyword;
2. Obtain a proxy GT for each image using the low-level visual knowledge;
3. Use the NFM to remove the influence of noisy regions in the proxy GT on training;
4. Finally, obtain the semantic segmentation results through the SSM.
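As a rough outline of these four steps, the sketch below strings the components together; the function and parameter names (`retrieve`, `low_level_cues`, `noise_filter`, `segmentation_model`) are placeholders for the modules discussed in this article, not a real API:

```python
def webly_supervised_training(keywords, retrieve, low_level_cues,
                              noise_filter, segmentation_model):
    """Hypothetical outline of the four-step pipeline described above."""
    # 1. Retrieve a large number of images for each keyword.
    images = [(img, kw) for kw in keywords for img in retrieve(kw)]

    for img, keyword in images:
        # 2. Build a proxy ground truth from class-agnostic cues
        #    (saliency, edges, over-segmentation).
        proxy_gt = low_level_cues(img, keyword)

        # 3. Mark noisy regions in the proxy GT so they do not
        #    contribute to the training loss (NFM).
        noise_mask = noise_filter(img, proxy_gt, keyword)

        # 4. Train the semantic segmentation module (SSM) on the
        #    remaining, trusted pixels.
        segmentation_model.train_step(img, proxy_gt, ignore=noise_mask)

    return segmentation_model
```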


NFM (Noise Filtering Module): a module that, given an input image, uses its image-level annotations and a heuristic map to filter out noisy regions in the image’s proxy GT.
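To make the role of this module concrete, here is a crude, rule-based stand-in for what such filtering might do; the actual NFM is a learned module, and the two rules below (class outside the image-level tags, low heuristic confidence) are assumptions for illustration only:

```python
import numpy as np

def filter_noise_regions(proxy_gt, image_labels, heuristic_map, conf_thresh=0.3):
    """Rule-based approximation of noise filtering (the real NFM is learned).

    proxy_gt:      (H, W) array of class indices guessed from low-level cues
    image_labels:  set of class indices known to appear in the image
    heuristic_map: (H, W) confidence map in [0, 1] (e.g. attention or saliency)
    Returns a boolean mask of pixels to ignore during training.
    """
    wrong_class = ~np.isin(proxy_gt, list(image_labels))   # class not in image-level tags
    low_confidence = heuristic_map < conf_thresh           # weakly supported regions
    return wrong_class | low_confidence
```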


The red area in the figure above is the identified noise area.


The NFM is used only as an aid during training; it is not used at test time.


We verified the importance of the low-level visual knowledge through experiments. The experiments fall into two settings: Weak means each image has only one keyword-level annotation, while WebSeg means the images have no manual annotation at all. There are many kinds of low-level visual knowledge; we show only three here, namely salient object detection (SAL), edges (Edge), and attention (ATT). Attention is top-down information that requires keyword-level labels, so, since WebSeg uses no manual labels, Attention is absent from the WebSeg experiments.


Similarly, we verified the effectiveness of the NFM. It can be seen that the NFM improves the IoU.
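For completeness, the IoU reported per class is the standard intersection-over-union, and the mean IoU (mIoU) quoted later is its average over classes, where P_c are the pixels predicted as class c and G_c the ground-truth pixels of class c:

```latex
\mathrm{IoU}_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|},
\qquad
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c
```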


During training, the data can be divided into three categories:

D(S): the image content is simple, and each image has one manually verified image-level label;
D(C): the image content is complex, and each image has multiple verified image-level annotations;
D(W): the image content is varied, and each image has one unverified label.

The table above lists the performance of different combinations of training sets.


Using a CRF (conditional random field) as post-processing can further improve the accuracy of the results.
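One common choice for this kind of post-processing is a fully connected (dense) CRF. The sketch below uses the pydensecrf library with typical parameter values; the parameters and the use of this particular library are assumptions, not necessarily what was used in the paper:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, softmax_probs, iters=5):
    """Refine per-pixel class probabilities with a fully connected CRF.

    image:         (H, W, 3) uint8 RGB image
    softmax_probs: (C, H, W) float32 class probabilities from the network
    """
    C, H, W = softmax_probs.shape
    d = dcrf.DenseCRF2D(W, H, C)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))
    # Smoothness kernel: nearby pixels prefer the same label.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: nearby pixels with similar color prefer the same label.
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    Q = d.inference(iters)
    return np.argmax(np.array(Q).reshape(C, H, W), axis=0)  # (H, W) label map
```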


The figure above shows our experimental results. From left to right we can see the contribution of the NFM and the CRF, respectively. Overall, our method can learn directly from Web images and obtain good semantic segmentation results.


The table above shows the experimental results on PASCAL VOC 2012. Using a large amount of low-level visual knowledge, our method achieves a mean IoU of 63%, a large improvement over the best result at last year’s CVPR, which was 58%.


Another significant result is that, without any manual annotation at all, we can still reach 57%. This actually exceeds many of the weakly supervised methods at CVPR 2017. In fact, annotating weakly supervised information also takes a lot of time and effort; in contrast, our method requires no manual annotation whatsoever. We have only taken a few steps toward letting machines learn pixel-level semantic segmentation directly from the Web, but it is exciting to see it outperform most of the CVPR 2017 weakly supervised results on a dataset of PASCAL VOC scale. In the long run, this is a very interesting research direction.


In conclusion, we presented a significant and challenging vision problem: how to learn semantic segmentation directly from the Web without human annotation. We also proposed an online noise filtering mechanism that teaches the CNN to eliminate noisy regions in the Web-derived supervision. The purpose of this work is to reduce or remove the dependence of pixel-level semantic understanding tasks on finely annotated data.


We have only scratched the surface of purely Web-supervised learning, and there is still much work to be done, for example:

1. Making more effective use of images. Our current work uses Web images directly, regardless of their quality, without any further selection or learning.

2. Associating the low-level visual knowledge with the corresponding keywords. For example, salient object detection has so far not been linked to the corresponding labels; could such an association further improve the results?

3. Improving the performance of class-independent low-level visual knowledge, such as edge detection and over-segmentation;

4. Exploring other purely Web-supervised tasks.


We have also done a lot of other work related to low-level visual knowledge, such as over-segmentation.


Salient-instance segmentation is also category-independent information: even without knowing the object categories, it can segment the salient instances.

http://mmcheng.net/

In addition, source code is available for our work; you are welcome to use it.


References:

https://pan.baidu.com/s/1Qtb11J4lfAw6wscPMMiOGg

Password: d7f3


Editor: Yuan Jirui, Editor: Cheng Yi

Collation: Feng First-class, Yang Ruyin, Gao Ke, Gao Liming


About the author:


Mingming Cheng graduated from Tsinghua University in 2012, then carried out computer vision research at Oxford, UK, and returned to China to teach in 2014. He is currently a professor at Nankai University, a Young Talent of the national “Ten Thousand Talents Plan,” and was selected for the Young Talent Promotion Project of the China Association for Science and Technology. His research interests include computer graphics, computer vision, and image processing. He has published more than 30 papers in CCF-A international conferences and journals such as IEEE TPAMI and ACM TOG. His research results have been widely recognized by peers at home and abroad: his papers have been cited more than 7,000 times, with the most cited single paper receiving more than 2,000 citations. His work has been reported by the BBC, the Daily Telegraph, Der Spiegel, the Huffington Post, and other authoritative international media.


This article is original content from Deep Learning Lecture Hall. To reprint, please contact Ruyin.

Pixel-level semantic recognition in Internet images

Welcome to follow our WeChat official account; search for the WeChat name: Deep Learning Lecture Hall.