Man is a sensory animal.
Our brain, like a highly complex CPU, receives various forms of data every day and performs endless computations. We touch the world through our various senses and extract information from each of them in order to understand it. Images, as the medium carrying the richest information, have always occupied an important position in humanity's exploration of intelligence. How do people recognize different types of images with nothing but the naked eye? How can we perform semantic segmentation and object detection in images? How can we recover super-resolution detail from blurry images, and how can we synthesize images without constraint? These are all hot topics in machine vision and image processing. Researchers around the world hope that one day computers, in place of human eyes, will be able to read images and discover the codes hidden within them.
Image classification
Image classification is an important task in image processing. In traditional machine learning, the standard pipeline for image recognition and classification is feature extraction, then feature selection, and finally feeding the feature vector into a suitable classifier to complete the classification. It was not until 2012 that Alex Krizhevsky proposed the breakthrough AlexNet architecture, which, with the help of deep learning, merged the three modules of feature extraction, feature selection and classification into one. He designed a deep convolutional neural network with five convolutional layers and three fully connected layers, mining and extracting image information layer by layer in different directions. For example, shallow convolutions usually capture general features such as image edges, while deep convolutions capture distribution features specific to a particular dataset. AlexNet won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) with a record-low error rate of 15.4%, while the runner-up scored a 26.2% error rate. AlexNet's resounding victory over traditional machine learning is recognized as a landmark event in the history of deep learning, sounding the horn for its explosive development in computing.
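As a rough illustration (a minimal sketch, not the authors' original code), the following PyTorch module mirrors AlexNet's shape of five convolutional layers followed by three fully connected layers; details such as local response normalization and the original two-GPU split are omitted.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """A minimal AlexNet-style network: 5 conv layers + 3 fully connected layers."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        # Shallow layers pick up general features (edges); deeper layers
        # pick up dataset-specific patterns, as described above.
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

print(AlexNetSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```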
(The picture shows Dr. Fei-Fei Li and her ImageNet dataset.) In 2014, GoogLeNet was born. By this time, deep learning had been further refined by ZF-Net and VGG-Net, and technical details such as network depth, convolution kernel size, and the vanishing-gradient problem in backpropagation had been discussed in detail. Building on these techniques, Google introduced the Inception unit, which broke the traditional paradigm of sequentially arranged computing units in deep neural networks, namely convolutional layer -> activation layer -> pooling layer -> next convolutional layer, and reduced the ImageNet classification error rate to 6.7%.
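A minimal sketch of the idea behind the Inception unit, assuming PyTorch: instead of a single sequential path, several convolutional branches with different kernel sizes run in parallel on the same input, and their outputs are concatenated along the channel dimension. The branch widths below follow GoogLeNet's first Inception stage, but the module as a whole is illustrative.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions plus pooling, concatenated channel-wise."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Every branch sees the same input; no fixed conv -> activation -> pool order.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = InceptionSketch(192, 64, 96, 128, 16, 32, 32)  # widths of GoogLeNet's "3a" stage
print(block(torch.randn(1, 192, 28, 28)).shape)        # torch.Size([1, 256, 28, 28])
```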
As networks grow deeper and their structures more complex, training deep neural networks becomes more and more difficult. In 2015, Kaiming He, then Microsoft's deep learning ace (now at Facebook AI Research), introduced the concept of residual learning into deep learning to solve the problem of training accuracy first saturating and then degrading as depth increases. The core idea is that once a neural network reaches saturation at a certain layer, the goal of all subsequent layers should be to map the identity function f(x) = x. Because of the nonlinear parts in the activation layers, this goal is almost impossible to achieve directly.
In ResNet, however, some convolutional layers are bypassed with shortcut connections, so that once training saturates, the goal of each subsequent layer becomes mapping the function f(x) = 0 instead; to achieve this, the trainable weights merely need to converge to 0 during training. The emergence of residual learning ensures stable training while deepening the network and improving model performance. In 2015, ResNet won that year's ImageNet challenge with an ultra-low error rate of 3.6%. This result also exceeded the average human recognition level, signaling the rise of artificial intelligence in an arena long dominated by humans.
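A minimal sketch of a residual block in PyTorch (illustrative, not the paper's exact block): the shortcut adds the input x to the block's output F(x), so driving the weights toward 0 makes the block collapse to the identity mapping described above.

```python
import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """y = F(x) + x: the shortcut lets a layer fall back to identity by learning F = 0."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # If all weights converge to 0, the block reduces to the identity mapping.
        return self.relu(residual + x)

print(ResidualBlockSketch(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```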
Object detection in images
Image classification tells us roughly what type of object an image contains, but not where the object is located in the image or any further details about it. In application scenarios such as license plate recognition, traffic violation detection, face recognition and motion capture, pure image classification cannot fully meet our requirements. Here we need to introduce another important task in the image field: object detection and recognition. In traditional machine learning, a typical approach is to use HOG (Histogram of Oriented Gradients) features to build a corresponding "filter" for each kind of object. The HOG filter records the edge and contour information of the object; sliding this filter over different positions of different images, when the magnitude of the response exceeds a certain threshold, the filter is considered a strong match with the object at that position, thus completing the detection. This line of work, led by Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan, was published as "Object Detection with Discriminatively Trained Part-Based Models" in IEEE Transactions on Pattern Analysis and Machine Intelligence in September 2010.
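To make the filtering idea concrete, here is a simplified sketch using scikit-image: a HOG descriptor of a template patch is scored against sliding windows of an image by a dot product, and a response above a threshold counts as a match. This is only a toy version of template matching with HOG, not the deformable part-based model of the paper; the patch coordinates and threshold are arbitrary placeholders.

```python
import numpy as np
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())
template = image[30:158, 150:278]            # a 128x128 patch acting as the "filter"
template_hog = hog(template, orientations=9,
                   pixels_per_cell=(8, 8), cells_per_block=(2, 2))

best_score, best_pos = -np.inf, None
for top in range(0, image.shape[0] - 128, 16):        # coarse sliding-window scan
    for left in range(0, image.shape[1] - 128, 16):
        window = image[top:top + 128, left:left + 128]
        window_hog = hog(window, orientations=9,
                         pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        score = np.dot(template_hog, window_hog)      # filter response
        if score > best_score:
            best_score, best_pos = score, (top, left)

THRESHOLD = 0.5  # illustrative threshold, not taken from the paper
print(best_pos, best_score, best_score > THRESHOLD)
```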
Over the following four years, Ross B. Girshick grew from an IEEE student member standing on the shoulders of giants into a leading figure in the AI industry, carrying forward the work of the deep learning pioneers. At the CVPR 2014 conference he published a paper entitled Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, and R-CNN became known to the world.
The core idea of R-CNN is to transform an object detection task into a classification task. The input to R-CNN is a series of image patches, called region proposals, extracted from the image using the selective search algorithm. After warping, the region proposals are standardized to the same size and fed into a pre-trained, fine-tuned convolutional neural network to extract CNN features. Given the CNN features of each proposal, a binary classifier is trained for each object category to judge whether the proposal belongs to that category. In 2015, to avoid extracting CNN features separately for every proposal, Girshick borrowed the pooling technique from the Spatial Pyramid Pooling network (SPP-net): a CNN feature map is extracted once from the whole image, proposals at different locations are then cropped from this feature map to obtain feature patches of different sizes, and finally these feature patches are standardized to the same size by spatial pooling and classified. This improvement removes R-CNN's per-proposal feature extraction, completing feature extraction over the whole image in a single pass and greatly shortening the model's running time; hence the name "Fast R-CNN". The paper of the same name was published at the ICCV 2015 conference.
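A minimal sketch of the shared-feature-map idea, assuming PyTorch and torchvision's roi_pool (a fixed-size pooling in the spirit of SPP-net's spatial pooling): proposals of different sizes are cropped from one feature map and pooled to the same 7x7 shape. The feature map, stride and boxes are illustrative stand-ins, not values from the paper.

```python
import torch
from torchvision.ops import roi_pool

# One forward pass over the whole image yields a shared feature map;
# each region proposal is then cropped from it and pooled to a fixed size.
feature_map = torch.randn(1, 256, 50, 50)                 # stand-in for a backbone's output
proposals = torch.tensor([[0,  40.,  40., 200., 160.],    # (batch_idx, x1, y1, x2, y2)
                          [0, 120.,  80., 360., 300.]])   # in input-image coordinates

# spatial_scale maps image coordinates onto the 50x50 feature map (stride 8 here).
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 8)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- same size regardless of proposal size
```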
In 2015, Girshick went a step further and defined an RPN (Region Proposal Network) layer to replace the traditional region proposal algorithm, embedding proposal generation into the deep neural network itself. This further improved the efficiency of the Fast R-CNN model, hence the name "Faster R-CNN". At NIPS 2015, Girshick and his collaborators published a paper entitled "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", completing a triple jump in the R-CNN line of research.
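A minimal sketch of an RPN-style head in PyTorch (illustrative, not the released implementation): a small convolution slides over the shared feature map and, for each of k anchor boxes per location, emits an objectness score and four box-regression offsets, from which proposals are then generated inside the network.

```python
import torch
import torch.nn as nn

class RPNHeadSketch(nn.Module):
    """Per feature-map location, predicts k objectness scores and 4k box offsets."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.objectness(h), self.bbox_deltas(h)

features = torch.randn(1, 256, 50, 50)
scores, deltas = RPNHeadSketch()(features)
print(scores.shape, deltas.shape)  # torch.Size([1, 9, 50, 50]) torch.Size([1, 36, 50, 50])
```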
Image generation
As times have changed, scientists have become not only researchers of technology but also creators of art. In 2014, Ian Goodfellow proposed the Generative Adversarial Network (GAN), which accomplishes the task of image generation by defining a generator and a discriminator. The principle is as follows: the generator's mission is to create, from random noise, "fake images" close to the target images in order to deceive the discriminator, while the discriminator's task is to identify which images come from the real dataset and which come from the generator. Generator and discriminator compete against each other, and training is completed through a suitably designed loss function. After the model finally converges, the discriminator's probability output is a constant 0.5: an image is equally likely to have come from the generator as from the real dataset, meaning the distribution of images produced by the generator is infinitely close to that of the real data.
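A minimal sketch of one adversarial training step, assuming PyTorch and flattened 28x28 images; the architectures, data and hyperparameters are placeholders, not Goodfellow's original setup.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.rand(32, 784) * 2 - 1          # stand-in for a batch of real images
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: label real images 1 and generated images 0.
fake = G(torch.randn(32, 64))
loss_d = bce(D(real), ones) + bce(D(fake.detach()), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator output 1 for fakes.
loss_g = bce(D(fake), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
# At equilibrium D(x) -> 0.5 for all x: fakes and real images are indistinguishable.
```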
GAN technology became a hot field of deep learning research in 2015 and 2016, achieving excellent performance in image restoration, denoising, super-resolution reconstruction and other directions. A series of variants such as WGAN, InfoGAN, DCGAN and conditional GAN have been derived from it, leading a wave of new trends.
(The picture shows CycleGAN technology generating oil paintings in the styles of Monet, Van Gogh and others from an ordinary photograph.)
The story of images has just begun.
When we string frames of images together into a flow of light and shadow, the problems we study extend from the spatial dimension to the temporal dimension. We no longer care only about the position, category, contour, shape and semantic information of objects in a single image; we care even more about the relationships between frames across time: capturing and identifying the motion of objects, extracting and analyzing the meaning of a video, considering sound and text annotations in addition to images, and processing the natural language that comes with them. Step by step, our research marches toward a broader sea of stars.
Images and videos are virtual strings of numbers and bytes, but they make the world more real. (Search Tucodec on WeChat to contact us.)