This article introduces the basics of computer vision through examples and reviews the progress and applications of its fundamental tasks: image classification, object detection, and image segmentation.





Preface

Here are a few examples of computer vision applications:

At the 23rd St. Petersburg International Economic Forum (June 6-8), Xinhua News Agency, the Russian news agency TASS, and Sogou jointly launched the world's first Russian-language AI news anchor, which will be used in TASS news reports in the future. TASS is the state news agency of Russia and one of the top five news agencies in the world; it provides news and information to 115 countries and regions and has extensive influence worldwide.

The MAGIC intelligent short-video production platform was independently developed by Xinhua Zhiyun Technology Co., Ltd., a joint venture of Xinhua News Agency and Alibaba. During the World Cup, MAGIC produced 37,581 short videos, averaging 50.7 seconds each, which were viewed 116,604,975 times online. One of the fastest, "Russia leads Egypt 2-0", took just six seconds to produce!

The faces above are not real people: they are fake faces generated by Nvidia's GAN (generative adversarial network) model.

Computer Vision

The goal of computer vision research is to enable computer programs to interpret and understand images: not only the colors in an image, but also its semantics and higher-level characteristics. In other words, it lets the computer open its eyes and "see" the world. About 70% of the activity of the human cerebral cortex is devoted to processing visual information, so from the perspective of perception, vision is an essential sensory function.

Several factors have driven the progress of computer vision:

1. Breakthroughs in deep learning. Deep learning is built on neural networks, a concept derived from researchers' study and simulation of the human nervous system in the 1950s. Neural network theory has therefore existed since the 1950s, but for decades it was applied only at shallow depths, and few anticipated the changes that adding many layers would bring.

2. GPUs developed by Nvidia have continuously increased available computing power. Their natural aptitude for parallel computation and matrix operations greatly accelerates both image processing and neural-network computation. In 2012, training AlexNet required two GPUs and took six days; today a single modern GPU can do the same job in about ten minutes.

3. Stanford University professor Fei-Fei Li created ImageNet by posting millions of photos online and asking people to tag them. What really attracted attention was Stanford's 2012 experiment. Earlier experiments mostly used image samples on the order of tens of thousands, but Stanford used ten million and applied a multi-layer neural network. The recognition rate of this model improved by roughly 7-10 percentage points on three image categories: human faces, human bodies, and cat faces. This was a big shock, because normally it takes great effort to raise the recognition rate by even 1%; now, simply increasing the number of layers produced two big changes: the recognition rate jumped dramatically, and the system could handle such large data. These results were very encouraging, given that AI had not solved many real problems before 2012. In December 2015, using a 152-layer deep network, Microsoft reduced the image-recognition error rate on ImageNet to 3.57%, below the human error rate of 5.1%.

Below is the progress of image classification on ImageNet. The lower the bar, the lower the error rate:

Deep Learning

We are now in the third rise of artificial intelligence. The first two waves, in the 1950s-1960s and the 1980s-1990s, had considerable impact but eventually cooled off, because neural networks neither delivered the promised performance improvements nor helped us understand biological visual systems. Unlike the first two waves, deep learning has far outstripped its biological inspirations in many benchmarks and real-world applications since the early 2000s.

Deep learning generally refers to deep neural networks (DNNs). Neural networks were proposed in the 1950s, but inherent problems such as vanishing gradients, and the overfitting and heavy computation caused by large numbers of parameters, made them perform poorly in practice. As a result, machine learning was for a long time dominated by SVMs.

Deep learning was proposed by Hinton et al. in 2006, but its real rise, and the emergence of significantly influential work, came after 2012: for example, Krizhevsky et al. used deep learning to greatly improve image-classification accuracy with AlexNet.

The Convolutional Neural Network (CNN) is the main deep-learning technology applied in the image domain. One of the biggest reasons CNNs are so successful in computer vision is that they largely abandon traditional machine-learning practice: feature design for image data, i.e. hand-crafted feature description, had always been a headache in computer vision. In the decade before the deep-learning breakthrough, the successful hand-crafted features were SIFT and the famous BoW (bag of visual words), which took a long time to design and required very specialized domain knowledge. These high-cost model iterations made the development of vision algorithms very slow. The following flow chart shows a traditional machine-learning pipeline:

The hot application fields of deep learning can be seen in the figure below (based on PapersWithCode statistics from 2018).

The areas where deep learning has been most successful are computer vision, speech recognition, and natural language processing. With the success of AlphaGo and OpenAI, reinforcement learning is also slowly emerging.

Basic tasks

Computer vision covers many tasks, but the fundamental ones are image classification, object detection/localization, and image segmentation. These tasks have been developed for many years, and because of their foundational role they underpin other areas (such as face recognition and OCR). The following sections briefly introduce recent progress in each task.

Progress in image classification

Image classification means that, given an input image, the machine judges which category it belongs to; in plain terms, the machine is asked what the image is or what it contains (a cat, a dog, etc.). Image classification is a fundamental task in computer vision and the task on which almost all reference models are compared. From the relatively simple 10-class MNIST task of recognizing handwritten digits in grayscale images, to the larger 10-class CIFAR-10 and 100-class CIFAR-100 tasks, and then to ImageNet, image classification models have advanced step by step along with the growth of datasets. Today, on datasets like ImageNet with more than 10 million images and more than 20,000 categories, computers are better at classifying images than humans.
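For readers who want to experiment with one of these benchmarks, here is a minimal sketch of loading CIFAR-10 with torchvision (assuming PyTorch and torchvision are installed; the data path, batch size, and normalization statistics are illustrative):

```python
# Minimal sketch: loading the CIFAR-10 benchmark with torchvision (illustrative values).
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    # Approximate channel-wise mean/std commonly used for CIFAR-10.
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# Downloads the dataset on first run; "./data" is just an example path.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=2)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([128, 3, 32, 32]) -- 10-class, 32x32 colour images
```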

By image content, classification can be divided into object classification, scene classification, and behavior/event classification.

By the fineness of the categories, it can be divided into coarse-grained and fine-grained classification.

By the correlation of the labels, it can be divided into single-label and multi-label classification.

Difficulties and challenges of image classification include rigid vs. non-rigid deformations, multiple viewpoints, scale changes, occlusion, lighting conditions, and intra-class variation; see the figure below:

Single-label classification


Single-label classification is a relatively simple task: the content of each picture is simple and contains only one object or scene. ImageNet is a single-label classification dataset. Below, the progress of single-label classification is introduced along the timeline of the ImageNet (ILSVRC) competition.

AlexNet: The AlexNet network proposed in 2012 triggered a surge of neural-network applications and won the 2012 ILSVRC image-classification competition, making CNNs the core algorithmic model for image classification.

ZFNet: The winner of the 2013 ILSVRC classification task was Clarifai, but it is better known as ZFNet. Hinton's students Zeiler and Fergus used deconvolution to visualize the intermediate feature layers of the network, making it possible for researchers to examine the activation of different features and their relationship with the input space. Guided by these visualizations, they made simple improvements to AlexNet, including using a smaller convolution kernel and stride: the first-layer kernel was changed from 11×11 to 7×7 and the stride from 4 to 2. The resulting performance exceeded the original AlexNet.

VGGNet: The runner-up in 2014. VGGNet comes in 16-layer and 19-layer versions, with a model size of about 550 MB. It simplifies the structure of the convolutional network by using only 3×3 convolution kernels and 2×2 max-pooling kernels. VGGNet is a good example of how performance can be improved simply by increasing the number and depth of layers on top of earlier architectures. Simple but surprisingly effective, VGGNet is still a benchmark model of choice for many tasks today.
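A minimal PyTorch sketch of the VGG idea (names and channel counts are illustrative): stacking two 3×3 convolutions covers the same receptive field as one 5×5 convolution, with fewer weights and an extra non-linearity.

```python
import torch.nn as nn

# Two stacked 3x3 convolutions cover a 5x5 receptive field, with
# 2 * (3*3) = 18 weights per channel pair instead of 25, plus an extra ReLU.
def vgg_block(in_ch, out_ch, num_convs=2):
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # 2x2 max pooling halves the resolution
    return nn.Sequential(*layers)
```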

GoogLeNet: A 22-layer network proposed by Christian Szegedy et al. at Google, with a top-5 classification error rate of only 6.7%. At the heart of GoogLeNet is the Inception module, which takes a parallel approach: a classic Inception structure has four branches, 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 pooling, whose outputs are then concatenated along the channel dimension. The core idea is that extracting information at different scales with multiple convolution kernels and then fusing it yields a better representation of the image. Around this time, the classification accuracy of deep-learning models approached human level (an error rate of roughly 5-10 percent).
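A simplified Inception-style module in PyTorch is sketched below (channel counts are placeholders, and the 1×1 reduction layers of the real GoogLeNet are omitted for brevity); it shows the four parallel branches and the channel-wise concatenation described above.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified Inception module: four parallel branches, concatenated on channels."""
    def __init__(self, in_ch, c1, c3, c5, cp):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, cp, kernel_size=1),
        )

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)  # fuse multi-scale information on the channel axis
```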

ResNet: Won the classification task in 2015. It surpassed human-level recognition with a 3.57% error rate and set a new record with a 152-layer network architecture. Because ResNet adopts cross-layer (skip) connections, it successfully alleviates the vanishing-gradient problem in deep neural networks and makes it feasible to train networks with hundreds or even a thousand layers.
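The cross-layer connection is easy to see in code; below is a minimal residual block sketch (PyTorch, fixed channel count for simplicity): the input is added back to the transformed output, so gradients can flow directly across layers.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection lets gradients flow across layers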

ResNeXt: 2016 also produced many classic models, including ResNeXt, which took second place in the classification contest. A 101-layer ResNeXt can reach the accuracy of ResNet-152 with only about half the complexity. Its core idea is grouped convolution: the input channels are divided into groups, transformed in parallel branches by several non-linear operations, and then merged.
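In PyTorch, grouped convolution is expressed directly through the groups argument of nn.Conv2d; the channel counts below are illustrative rather than ResNeXt's actual configuration.

```python
import torch.nn as nn

# A ResNeXt-style bottleneck relies on grouped convolution: the input channels are split
# into `groups` parallel branches that are transformed independently and then re-combined.
# With 32 groups ("cardinality"), this 3x3 convolution has 1/32 of the parameters of an
# ungrouped convolution with the same input/output channel counts.
grouped_conv = nn.Conv2d(in_channels=128, out_channels=128,
                         kernel_size=3, padding=1, groups=32)
```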

DenseNet: Building on ResNet, DenseNet connects each layer to every other layer in a feed-forward fashion. Each network layer takes the feature maps of all preceding layers as input, and its own feature maps are in turn used as input by all subsequent layers. Compared to ResNet, DenseNet also mitigates the vanishing-gradient problem, strengthens feature propagation and feature reuse, and reduces the number of parameters. DenseNet requires less memory and computation than ResNet while achieving better performance.
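A minimal sketch of this dense connectivity (PyTorch; layer counts and the BN-ReLU-Conv ordering follow the usual DenseNet layer recipe, other details are simplified): each layer consumes the concatenation of all earlier feature maps.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: each layer receives the concatenation of all previous feature maps."""
    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier features
            features.append(out)
        return torch.cat(features, dim=1)
```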

SENet: 2017 was the final year of the ILSVRC image-classification competition, and SENet won the title. Its structure processes features with a "feature recalibration" strategy: the importance of each feature channel is learned, and the weight of each channel is then increased or decreased according to that importance.
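The recalibration idea fits in a few lines; here is a hedged PyTorch sketch of a squeeze-and-excitation block (the reduction ratio of 16 is the commonly used default):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: learn a per-channel weight and rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # "squeeze": global spatial average per channel
        self.fc = nn.Sequential(                   # "excitation": learn channel importance in [0, 1]
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # recalibrate: strengthen informative channels, suppress the rest
```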

At this point, the ImageNet classification competition had essentially ended, and results were approaching the limits of the approach. In practical applications, however, we face problems that are more complex and realistic than the competition, which requires continuously accumulating experience.

At present, with the rise of neural architecture search (NAS), the best results basically come from searched networks such as NASNet, PNASNet, and AmoebaNet; Google's EfficientNet in particular has been a runaway improvement over other networks. See the figure below:

Fine-grained image classification


Fine-grained image categorization refers to classifying images that belong to the same basic category (cars, dogs, flowers, birds, etc.) into finer subcategories (for example, distinguishing a Samoyed from a husky). Fine-grained classification has many practical applications, such as distinguishing different vehicle models in traffic monitoring.

Because the classification granularity is very fine, the differences between subcategories are subtle and often confined to a small local region (for example, a dog's eyes), and in some categories even experts have difficulty telling subclasses apart. In addition, intra-class variation is large, with differences in pose, viewing angle, background, and occlusion. Fine-grained image classification is therefore harder than coarse-grained classification and remains a popular research field.

Since deep convolutional networks can learn very robust image feature representations, most fine-grained image classification methods are based on deep convolutional networks. These methods can be roughly divided into four directions:

1. Fine-tuning methods based on conventional image-classification networks

Most of these methods directly apply common deep convolutional networks, such as ResNet, DenseNet, or SENet, to fine-grained image classification. Because these networks have strong feature-representation ability, they achieve good results in conventional image classification. In fine-grained classification, however, the differences between subcategories are very subtle, so directly applying a conventional classification network is not ideal. Inspired by transfer learning, one approach is to transfer networks trained on large-scale data to the fine-grained classification task. The common solution is to use weights pre-trained on ImageNet as initial weights, and then fine-tune the network on the fine-grained classification dataset to obtain the final classifier (a minimal sketch follows).
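A hedged sketch of this fine-tuning recipe with torchvision: the backbone choice, the learning rates, and the 200-class head (e.g. CUB-200-2011 birds) are illustrative, not a prescribed configuration.

```python
import torch
import torch.nn as nn
import torchvision

# Start from ImageNet-pretrained weights, replace the classifier head,
# and fine-tune on the fine-grained dataset (e.g. 200 bird classes for CUB-200-2011).
num_fine_grained_classes = 200
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_fine_grained_classes)

# A common recipe: a smaller learning rate for pretrained layers, a larger one for the new head.
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")], "lr": 1e-3},
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)
```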

2. Methods based on network ensembles

A representative method is the Bilinear Convolutional Neural Network (Bilinear CNN). It uses VGG-D and VGG-M as the two base networks, fuses their features into a single vector with bilinear pooling, and then classifies on that vector. On the CUB200-2011 dataset it achieves 84.1% classification accuracy without using bounding-box annotations,

and 85.1% when bounding boxes are used.
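A minimal sketch of the bilinear pooling step (PyTorch; shapes, the averaging, and the signed-square-root/L2 normalisation follow common practice, but details such as which layers the features come from are omitted):

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling sketch: outer product of two feature maps, pooled over locations.

    feat_a: (B, Ca, H, W) features from one network (e.g. VGG-D)
    feat_b: (B, Cb, H, W) features from another network (e.g. VGG-M)
    Returns a (B, Ca*Cb) descriptor used for fine-grained classification.
    """
    B, Ca, H, W = feat_a.shape
    Cb = feat_b.shape[1]
    a = feat_a.view(B, Ca, H * W)
    b = feat_b.view(B, Cb, H * W)
    x = torch.bmm(a, b.transpose(1, 2)) / (H * W)          # average of outer products over locations
    x = x.view(B, Ca * Cb)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-12)   # signed square-root normalisation
    return F.normalize(x)                                  # L2 normalisation
```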

3. Part detection and alignment methods based on object parts

Part-detection methods work as follows: first detect the position of the target in the image, then detect the positions of discriminative regions within the target, and finally feed both the target image (foreground) and the discriminative region blocks into a deep convolutional network for classification. However, these methods usually require bounding-box annotations of the target, and often annotations of key points within the target, during training, and such annotations are very difficult to obtain in practice. A representative method is Part-RCNN, proposed at ECCV 2014.

4. Methods based on the visual attention mechanism

The visual attention mechanism is a special signal-processing mechanism of human vision: when looking at a scene, the visual system first scans the global image quickly to find the regions that need attention, and then suppresses useless information to focus on the target of interest. In deep convolutional networks, attention models can likewise be used to find regions of interest or discriminative regions, and the network attends to different regions for different tasks. Because attention-based methods can locate discriminative regions without extra annotation (such as bounding boxes or part-location annotations), they have been widely applied to fine-grained image classification in recent years. A representative work is the Recurrent Attention Convolutional Neural Network (RA-CNN), proposed at CVPR 2017.

At present, fine-grained image recognition still relies on large or even massive amounts of annotated data, and for fine-grained images the cost of collection and annotation is huge, which limits both research and real-world applications. Humans, on the other hand, can learn new concepts with minimal supervision: an average adult can learn to recognize a new bird species from only a few images. To give fine-grained recognition models the same ability to learn from a small number of training samples, researchers are also studying few-shot learning for fine-grained image recognition, which may be a future trend.

Multi-label classification


All of the classification problems above are single-label problems, where each image corresponds to exactly one category. In many tasks, however, an image can carry multiple labels, which makes it a multi-label classification problem. Compared with multi-class (single-label) image classification, the multi-label task is harder, because the output space grows exponentially with the number of categories. Multi-label classification problems are usually handled with the following strategies (a minimal sketch of the first strategy follows the list):

First-order strategy: a naive approach that ignores correlations between labels and treats each label separately, for example by splitting the problem into independent binary classifications (simple and efficient).

Second-order strategy: consider pairwise associations between labels, such as ranking related versus unrelated labels.

Higher-order strategy: consider associations among multiple labels, such as the effect of each label on all other labels (the most effective).
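Here is a minimal sketch of the first-order strategy from the list above: each label becomes an independent binary classification with a sigmoid output and binary cross-entropy loss (the label count, batch size, and threshold are illustrative).

```python
import torch
import torch.nn as nn

num_labels = 20                      # e.g. the 20 VOC categories (illustrative)
logits = torch.randn(8, num_labels)  # raw model outputs for a batch of 8 images
targets = torch.randint(0, 2, (8, num_labels)).float()  # multi-hot ground truth

# First-order strategy: one independent sigmoid / binary cross-entropy per label.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# At inference, each label is predicted independently by thresholding the sigmoid.
predictions = (torch.sigmoid(logits) > 0.5).int()
```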

Moving to higher-order strategies: since many objects usually appear together in the real world, modeling the correlation between labels becomes the key to multi-label image recognition, as shown in the figure below:

There are roughly two directions for modeling correlations between labels. One is to explicitly model label dependencies with probabilistic graphical models or recurrent neural networks (RNNs). The other is to implicitly model label relevance through an attention mechanism, which considers the relationships between attended regions in the image (a kind of local correlation); even so, it ignores the global correlation between labels, which can only be inferred from knowledge beyond a single image.

For example, ML-GCN uses a graph to model the interdependencies between labels. It can flexibly capture the topological structure of the label space, and it has obtained good results on the MS-COCO and VOC2007 test sets.

Progress in object detection

The goal of object detection is, given an image or a video frame, to have the computer find the positions of all targets in it and give the specific category of each one. It combines the two tasks of classification and localization: telling the machine what is in the picture and where. Detection is the basis of many computer-vision applications, such as instance segmentation, human keypoint extraction, and face recognition. Most modern detectors use a two-stage framework in which detection is defined as a multi-task learning problem:

(1) Distinguish foreground object boxes from background and assign them appropriate category labels;

(2) Regress a set of coefficients to maximize the intersection over union (IoU), or other metrics, between the detection box and the ground-truth box. Redundant bounding boxes (repeated detections of the same target) are then removed by a non-maximum suppression (NMS) procedure, sketched below.
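To make these two steps concrete, here is a hedged sketch of IoU computation and greedy NMS; real detectors normally use optimized library implementations such as torchvision.ops.nms, and the box format and threshold here are illustrative.

```python
import torch

def iou(box, boxes):
    """IoU between one box and a set of boxes; boxes are (x1, y1, x2, y2)."""
    x1 = torch.maximum(box[0], boxes[:, 0])
    y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2])
    y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # discard duplicates of the kept box
    return keep
```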



Anchor-based methods

Traditional anchor-based methods first propose candidate boxes (prior boxes or anchor boxes) according to some strategy, and then classify and regress these candidates: the feature-map vectors corresponding to the boxes are classified (e.g. with softmax) and regressed (e.g. with linear regression) to obtain each box's category and refined position.

One-stage algorithms extract features with the network and directly predict object category and position; two-stage algorithms first generate proposals (e.g. with an RPN) and then perform fine-grained detection on them.


The main technology roadmap of object detection is shown in the figure below:

Milestone detectors in the figure: VJ Det., HOG Det., DPM, RCNN, SPPNet, Fast RCNN, Faster RCNN, YOLO, SSD, Pyramid Networks, RetinaNet.

The following are the detection results of each model on the VOC07, VOC12, and MS-COCO datasets:

Due to space limitations, detailed explanations of each detector are left for later articles.

Anchor-free methods

Since CornerNet appeared in August last year, anchor-free object-detection models have emerged one after another. "Anchor-free" means that detection does not preset reference boxes (anchors); instead the model directly predicts the location and category of objects, for example by means of keypoints.

In fact, anchor-free is not a new concept. It can be traced back to Baidu's DenseBox model (proposed in 2015), and the popular YOLO can also be regarded as an anchor-free model. The shadow of DenseBox can be seen in recent anchor-free models such as FSAF, FCOS, and FoveaBox. Representative anchor-free models include DenseBox, YOLO, CornerNet, ExtremeNet, FSAF, FCOS, and FoveaBox.

Although anchor-free methods have not yet completely outperformed traditional anchor-based methods, they provide a feasible new detection pipeline. The main question is whether the bounding box is a reasonable representation for detection; as anchor-free models evolve, better representations of targets may emerge.

Progress in image segmentation

Image segmentation is the technology and process of dividing an image into several specific regions with distinct properties and extracting objects of interest. It can be regarded as a pixel-wise image classification problem. Segmentation tasks can be divided into semantic segmentation, instance segmentation, and panoptic segmentation, a new category that has emerged this year. The figure above shows the differences among these segmentation tasks.

To expand a bit on the different segmentation tasks:

Semantic segmentation: semantic segmentation focuses on distinguishing categories. It separates foreground classes such as people from background classes such as trees, sky, and grass, but it does not distinguish separate individuals: in the figure, all people are marked in red, and within the yellow box on the right it is impossible to tell whether the pixels belong to one person or several. Main models include U-Net, SegNet, the DeepLab series, FCN, ENet, ICNet, ShelfNet, BiSeNet, DFN, and CCNet.

Instance segmentation: focuses more on distinguishing individuals. Instance segmentation has grown in recent years largely driven by the COCO dataset and its contests; methods from MNC and FCIS to PANet have successively won first place on the COCO instance-segmentation track. Main models include FCIS, DeepMask, Mask R-CNN, Hybrid Task Cascade (HTC), and PANet.

Panoptic segmentation: a new sub-task, first proposed by FAIR and the University of Heidelberg, Germany, that can be seen as a combination of semantic segmentation and instance segmentation. In panoptic segmentation, each pixel of the image has both a semantic label and an instance label, so that the entire image can be understood to a large extent. Main models include JSIS-Net and TASCNet.

Image segmentation models


The general framework or pipeline of image segmentation is as follows (a minimal sketch follows the list):

Downsampling + upsampling: convolution + deconvolution/resize.

Multi-scale feature fusion: point-wise feature addition, or concatenation along the channel dimension.

Obtaining a pixel-level segmentation map: predicting a category for each pixel.
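Here is a toy PyTorch sketch of that downsample / upsample / fuse / per-pixel-classify pipeline; the layer sizes, the 21-class head, and the network name are illustrative, not any particular published model.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder segmentation net: downsample, upsample, fuse, classify per pixel."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))                        # downsampling
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 16, 2, stride=2),  # upsampling (deconvolution)
                                 nn.ReLU(inplace=True))
        self.skip = nn.Conv2d(3, 16, 1)                                  # full-resolution feature to fuse
        self.head = nn.Conv2d(16, num_classes, 1)                        # per-pixel classifier

    def forward(self, x):
        feat = self.dec(self.enc(x)) + self.skip(x)   # multi-scale feature fusion (point-wise addition)
        return self.head(feat)                        # (B, num_classes, H, W) score map

x = torch.randn(1, 3, 224, 224)
print(TinySegNet()(x).shape)  # torch.Size([1, 21, 224, 224]) -- one class score per pixel
```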

The following figure shows the technical spectrum of the progress of image segmentation:

1. Fully Convolutional Networks (FCN): the pioneering work on semantic segmentation with neural networks. It replaces the fully connected layers with convolutional layers, so the network can accept images of any size and output a segmentation map of the same size as the original image, allowing each pixel to be classified. Deconvolution layers are used to upsample the feature maps.

2. SegNet: adds a decoder on the basis of FCN, forming the encoder-decoder structure that is now popular in segmentation tasks, and analyzes how different decoders affect the results and why.

3. DeepLab v1/v2/v3: introduced dilated (atrous) convolution.

4. PSPNet: its core contribution is global pyramid pooling, which pools the feature map to several different sizes so that the features carry better global and multi-scale information.

5. Mask R-CNN: combines object detection and semantic segmentation, and proposes RoIAlign to replace RoIPooling, eliminating the offsets caused by rounding and improving detection accuracy.

6. U-Net: adopts an encoder-decoder structure. The encoder creates a new scale after each pooling layer, giving five scales in total including the original. In the decoder, each upsampling step is fused with the encoder feature maps of the same scale along the channel dimension, so richer context information is obtained. During decoding, this multi-scale fusion enriches details and improves segmentation accuracy.

Image Matting


Matting is also a kind of foreground-background segmentation problem, but it is soft segmentation rather than hard segmentation. For foreground regions such as glass or hair, the color of a pixel is not determined by the foreground alone; it is the result of blending the foreground and background colors. Matting must recover the foreground and background colors and the degree to which they blend.

Image matting divides the picture into only foreground and background, with the aim of extracting the foreground; a good matting algorithm handles details such as hair accurately. The important difference between matting and segmentation is that segmentation returns a pixel-wise classification result, which is an integer, whereas matting returns the probability P of belonging to the foreground (or background). This produces a gradual transition in the regions where foreground and background interact, making the result more natural.

The core of image matting is solving the compositing equation I = αF + (1-α)B, where I is the observed pixel of the image and is known; α (the transparency), F (the foreground pixel), and B (the background pixel) are all unknown. Intuitively, the original image can be thought of as the foreground and background superimposed with a certain weight (the transparency α). For pixels that are definitely foreground, α = 1; for pixels that are definitely background, α = 0; for pixels that may be either, α is a floating-point number between 0 and 1.
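A tiny numerical illustration of the compositing equation (the RGB values are made up for the example):

```python
import numpy as np

# Compositing equation I = alpha * F + (1 - alpha) * B, applied per pixel (illustrative values).
F = np.array([200, 180, 160], dtype=np.float32)   # foreground colour (RGB), e.g. hair
B = np.array([ 40,  60,  80], dtype=np.float32)   # background colour (RGB)

for alpha in (1.0, 0.0, 0.3):                     # pure foreground, pure background, mixed
    I = alpha * F + (1 - alpha) * B
    print(alpha, I)
# alpha = 1.0 -> [200. 180. 160.]  (pixel is entirely foreground)
# alpha = 0.0 -> [ 40.  60.  80.]  (pixel is entirely background)
# alpha = 0.3 -> [ 88.  96. 104.]  (soft boundary pixel, e.g. a strand of hair)
```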

An excellent matting algorithm can extract even very fine hair details in the foreground, which traditional image-segmentation techniques cannot do.

Deep learning is now gradually being introduced to image matting, basically using an encoder-decoder framework, with a trimap as an additional input and the alpha matte as the ground truth of the training data. A typical example is Adobe's end-to-end Deep Image Matting.

Because its application scenarios are not as broad as general segmentation, and it lacks datasets and benchmarks, matting is not as popular as other segmentation techniques.

Follow-up

Of course, computer vision involves more than these tasks. Classification, detection, and segmentation are only its most basic tasks, and because they are fundamental and universal they are used within other tasks: the face-recognition field also uses detection and classification, and special-effects applications use segmentation. The basic network architectures described in this article, such as ResNet and GoogLeNet, are also used in other tasks.

There are many other areas of deep vision not covered here, such as keypoint detection, video classification, video detection and tracking, generative adversarial networks (GANs), automated machine learning (AutoML), vertical domains such as face recognition, optical character recognition (OCR), and person re-identification, as well as commonly used deep-learning frameworks such as TensorFlow and PyTorch, and long-studied topics such as unsupervised/weakly supervised learning, self-supervised learning, and reinforcement learning. Each sub-area would take considerable space to explain, and progress in these directions will be introduced in later articles.

References:

1, https://blog.csdn.net/xys430381_1/article/details/89640699

2, https://medium.com/atlas-ml/state-of-deep-learning-h2-2018-review-cc3e490f1679

3, https://zhuanlan.zhihu.com/p/57643009

4, https://zhuanlan.zhihu.com/p/62212910

5, https://cloud.tencent.com/developer/article/1428956


This article was first published on the public account "Miui Cloud Technology". Please indicate the source when reprinting.