Abstract: To help you better understand convolutional neural networks, this article summarizes a number of important recent advances and the papers behind them in the field of computer vision and convolutional neural networks.

Following the article "Tools for Understanding Convolutional Neural Networks: 9 Important Deep Learning Papers (Part 1)", this article continues to introduce important papers published in the past five years and to discuss their significance. Papers 1-5 deal with the development of general network architectures, while papers 6-9 cover other kinds of architectures. Click through to the original article for more details.

5. Microsoft ResNet (2015)

Take a deep convolutional neural network, double its number of layers, add a few more, and it still probably would not be as deep as the ResNet architecture proposed by Microsoft Research Asia at the end of 2015. ResNet is a 152-layer network architecture that set records in classification, detection, and localization. Beyond the record-setting depth, ResNet won the 2015 ImageNet Large-Scale Visual Recognition Challenge with an error rate of 3.6% (for comparison, human error rates are usually quoted at around 5-10%).

Residual block

The principle of the residual block is that the input x passes through a convolution–ReLU–convolution series, producing F(x); this result is then added to the original input x, written H(x) = F(x) + x. In a traditional convolutional neural network, H(x) = F(x). So instead of learning only the transformation from x to F(x), the block learns H(x) = F(x) + x. In other words, the module in the figure below computes a "delta", a slight change to the original input x, to obtain a slightly altered representation. The authors argue that "it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping."
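To make this concrete, here is a minimal sketch of an identity-shortcut residual block in PyTorch. It follows the H(x) = F(x) + x idea above, but it is my own illustration rather than the exact block from the paper (which also uses batch normalization and projection shortcuts when the dimensions change):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: H(x) = F(x) + x,
    where F is a convolution-ReLU-convolution stack (identity shortcut only)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                # keep the original input
        out = self.conv1(x)
        out = self.relu(out)
        out = self.conv2(out)       # F(x): the "delta" to be learned
        out = out + residual        # H(x) = F(x) + x
        return self.relu(out)
```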



Another reason residual blocks may be more effective is that, during the backward pass of backpropagation, the gradient flows more easily through the block: the addition operation passes the gradient straight through to earlier layers.

Main points

1. “Extreme Depth” – Yann LeCun.

2. Contains 152 layers

3. Interestingly, after only the first two layers, the spatial size is compressed from an input volume of 224×224 down to 56×56.

4. In plain (non-residual) networks, simply increasing the number of layers leads to higher training and test error (see the paper for details).

5. The authors also tried a 1202-layer network, but it achieved lower test accuracy, presumably due to overfitting.

Why it's important

A 3.6% error rate: that alone should be convincing. The ResNet model is the best convolutional neural network architecture we currently have and is a major innovation in the idea of residual learning. I suspect that simply stacking ever more layers on top of each other will not yield much further improvement, but creative new architectures, like those of the last two years, will certainly keep appearing.

6. Region-based Convolutional Neural Network: R-CNN (2013); Fast R-CNN (2015); Faster R-CNN (2015)

Some would argue that the arrival of R-CNN has been more impactful than any of the previous papers on new network architectures. With the first R-CNN paper cited more than 1,600 times, Ross Girshick and his team at the University of California, Berkeley created one of the most influential advances in computer vision; as their titles suggest, Fast R-CNN and Faster R-CNN then made the model faster and better suited to modern object detection tasks.

The goal of the R-CNN architecture is to solve the object detection problem: given an image, we want to draw bounding boxes around all of the objects it contains. The process can be split into two steps: region proposal and classification.

The authors note that any class-agnostic region proposal method should work. Selective search is used for R-CNN; it generates roughly 2,000 different regions that are most likely to contain an object. After the candidate regions are generated, they are "warped" to a fixed image size and fed into a trained convolutional neural network (AlexNet in this case), which extracts a feature vector for each region. These vectors then serve as input to a set of linear support vector machines, one trained per class, which output a classification. The vectors are also fed into a bounding box regressor to obtain more accurate box coordinates. Finally, non-maximum suppression is used to remove bounding boxes with significant overlap.
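The pipeline described above can be sketched schematically as follows. All of the helper names (`selective_search`, `warp`, `cnn_features`, `svms`, `bbox_regressor`, `non_max_suppression`) are hypothetical placeholders for the components named in the paper, not real library calls:

```python
# Schematic sketch of the R-CNN pipeline; helper functions are hypothetical.
def rcnn_detect(image):
    proposals = selective_search(image, max_regions=2000)    # ~2000 class-agnostic regions

    detections = []
    for region in proposals:
        patch = warp(image.crop(region), size=(227, 227))     # warp to the CNN's fixed input size
        features = cnn_features(patch)                         # e.g. AlexNet fully connected features
        scores = {cls: svm.score(features) for cls, svm in svms.items()}  # one linear SVM per class
        cls, score = max(scores.items(), key=lambda kv: kv[1])
        refined_box = bbox_regressor.refine(region, features)  # tighten the box coordinates
        detections.append((cls, score, refined_box))

    return non_max_suppression(detections)                     # drop heavily overlapping boxes
```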

Fast R-CNN

Improvements were made to the original model for three main reasons: training required multiple stages (ConvNet → support vector machines → bounding box regressors), it was computationally expensive, and it was slow (R-CNN took about 53 seconds per image). To improve speed, Fast R-CNN shares the convolutional computation across the different region proposals and swaps the order of generating proposals and running the convolutional network. In this model, the image is first fed through the convolutional network, the features of each region proposal are then extracted from the network's last feature map (region-of-interest pooling), and finally those features are fed into fully connected layers and the regression and classification heads.
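A rough sketch of that forward pass, assuming a hypothetical `backbone` network and `fc_layers`/`cls_head`/`bbox_head` modules, and using torchvision's RoI pooling operator to stand in for the region-of-interest pooling step:

```python
import torchvision.ops as ops

# Sketch of the Fast R-CNN forward pass: conv features are computed once per
# image and shared across all proposals via RoI pooling. `backbone`, `fc_layers`,
# `cls_head`, and `bbox_head` are hypothetical modules.
def fast_rcnn_forward(image, proposals):
    feature_map = backbone(image)                        # one conv pass for the whole image
    rois = ops.roi_pool(feature_map, [proposals],        # fixed-size window per proposal
                        output_size=(7, 7),
                        spatial_scale=1 / 16)            # feature-map stride relative to the image
    flat = rois.flatten(start_dim=1)
    fc = fc_layers(flat)                                 # shared fully connected layers
    return cls_head(fc), bbox_head(fc)                   # classification and box-regression heads
```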

Faster R-CNN

Faster R-CNN addresses the complex training pipeline that both R-CNN and Fast R-CNN required. The authors insert a region proposal network (RPN) after the last convolutional layer; it looks at the last convolutional feature map and generates region proposals from it. The same pipeline is then used: region-of-interest pooling, fully connected layers, and the classification and regression heads.
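A minimal, illustrative sketch of what a region proposal network head might look like; the layer sizes and the single objectness score per anchor are simplifications of mine, not the paper's exact design:

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Slides over the last conv feature map and, for each of `num_anchors`
    anchor boxes at every position, predicts an objectness score and box offsets."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, num_anchors, kernel_size=1)      # object vs. background
        self.box_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # per-anchor box offsets

    def forward(self, feature_map):
        x = self.conv(feature_map).relu()
        return self.objectness(x), self.box_deltas(x)
```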

Why it's important

Being able not only to determine that a specific object is in an image, but also to locate it accurately, is a qualitative leap. Faster R-CNN has become the standard for object detection today.

7. Generative Adversarial Networks (2014)

According to Yann LeCun, these networks could be the next big advance. Before introducing the paper, let's look at an adversarial example: take an image, run it through a convolutional neural network that is already trained and performs well on the ImageNet dataset, and perturb the image so as to maximize the prediction error. The predicted object category changes, yet the perturbed image looks the same as the original to a human. In a sense, adversarial examples fool convolutional networks with images.

Adversarial examples surprised many researchers and quickly became a topic of interest. Now let's talk about generative adversarial networks, which consist of two models: a generative model (the generator) and a discriminative model (the discriminator). The discriminator determines whether a given image truly comes from the dataset or has been artificially created; the generator creates images that try to fool the discriminator into producing the wrong output. This can be thought of as a game: the generative model is like "a team of counterfeiters, trying to produce and use fake currency", while the discriminative model is like "the police, trying to detect the counterfeit currency". The generator tries to fool the discriminator, and the discriminator tries not to be fooled. As the models are trained, both improve until "the counterfeits are indistinguishable from the genuine currency."
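The training game can be compressed into a short sketch like the one below, assuming a `generator`, a `discriminator` that outputs a probability in (0, 1), and their optimizers `g_opt` and `d_opt` are already defined; this illustrates the idea rather than reproducing the paper's code:

```python
import torch
import torch.nn.functional as F

def gan_step(real_images, noise_dim=100):
    batch = real_images.size(0)
    noise = torch.randn(batch, noise_dim)
    fake_images = generator(noise)

    # Discriminator: label real images 1 ("real money") and generated images 0 ("counterfeit").
    d_loss = (F.binary_cross_entropy(discriminator(real_images), torch.ones(batch, 1)) +
              F.binary_cross_entropy(discriminator(fake_images.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to make the discriminator output 1 for fakes.
    g_loss = F.binary_cross_entropy(discriminator(fake_images), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```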

Why it's important

It may seem simple, but why do we care so much about these networks? As Yann LeCun points out on Quora, the discriminator is now aware of the "internal representation of the data", because it has been trained to tell the difference between real images from the dataset and artificially created ones. It can therefore be used as a feature extractor, just like a convolutional neural network. Alternatively, you can use the generator to create very realistic artificial images (link).

8. Generating Image Descriptions (2014)

What happens when you combine convolutional neural networks with recurrent neural networks? Andrej Karpathy's team studied the combination of convolutional neural networks and bidirectional recurrent neural networks and wrote a paper on generating natural-language descriptions of different regions of an image. Basically, the model's output for an image looks like this:

This is incredible! Let's see how it differs from a normal convolutional neural network. In a traditional convolutional neural network, each image in the training data has a single, clear label. The model described in this paper, however, is trained on examples where a sentence of text is associated with each image. This type of label is called a weak label: segments of the sentence refer to unknown parts of the image. Using this training data, a deep neural network is able to "infer the latent alignment between segments of the sentences and the regions that they describe" (as quoted from the paper). Another neural network then takes the image as input and generates a textual description. Let's look at the two parts separately: the alignment model and the generation model.

Alignment model

The goal of the alignment model is to align the visual and textual data: the model takes an image and a sentence and outputs a score for how well they match.

First, the image is fed into an R-CNN model (trained on the ImageNet dataset) to detect individual objects. The top 19 object regions, plus the original image, are embedded into a 500-dimensional space. For each image we now have 20 different 500-dimensional vectors (denoted v) that describe the image. We also need information about the text: the words are embedded into the same multi-dimensional space by a bidirectional recurrent neural network. At a high level, this captures contextual information about each word in the sentence. Since the image and text representations live in the same space, we can compute inner products to produce a similarity measure.
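As a rough illustration of that last step, an image-sentence score can be built from region-word inner products roughly as follows; this is a simplified version of the paper's scoring idea, and the shapes and names are mine:

```python
import torch

def image_sentence_score(region_vectors, word_vectors):
    # region_vectors: (20, 500) region embeddings; word_vectors: (num_words, 500) word embeddings
    similarities = word_vectors @ region_vectors.T         # inner product of every word with every region
    best_region_per_word = similarities.max(dim=1).values  # each word aligns to its best-matching region
    return best_region_per_word.sum()                      # sum over words gives the image-sentence score
```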

Generation model

The main purpose of the alignment model is to create a dataset containing image regions and their corresponding text. The generation model learns from this dataset to generate descriptions of a given image. The model feeds the image into a convolutional neural network; the softmax layer is discarded, because the outputs of the fully connected layer become the inputs to a recurrent neural network. For those unfamiliar with recurrent neural networks (which need to be trained just like convolutional neural networks), their role here is to produce a probability distribution over the different words in a sentence.
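A minimal sketch of that structure, with illustrative sizes (`cnn_feature_dim`, `hidden_dim`, `vocab_size`) and a plain RNN standing in for the recurrent module used in the paper:

```python
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """CNN fully connected features (softmax skipped) condition an RNN that outputs
    a probability distribution over the next word at each step."""

    def __init__(self, cnn_feature_dim=4096, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.image_proj = nn.Linear(cnn_feature_dim, hidden_dim)  # map image features into the RNN space
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.word_logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, caption_tokens):
        h0 = self.image_proj(image_features).unsqueeze(0)  # image features set the initial hidden state
        words = self.embed(caption_tokens)                  # (batch, seq_len, hidden_dim)
        out, _ = self.rnn(words, h0)
        return self.word_logits(out)                        # per-step distribution over the vocabulary
```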

Why it's important

What is innovative about Generating Image Descriptions is that it uses seemingly different models, a recurrent neural network and a convolutional neural network, to create a very practical application that combines the fields of computer vision and natural language processing. It opens the door to new ideas about how to make computers and models smarter at tasks that span different domains.

9. Spatial Transformer Networks (2015)

Finally, we introduce an equally important paper. The main highlight of this model is the introduction of a transformer module that transforms the input image so that subsequent layers can classify it more easily. Instead of modifying the main architecture of the convolutional neural network, the authors transform the image before it is fed into a specific convolutional layer. The module aims to correct for pose (scenes where objects are tilted or scaled) and to provide spatial attention (focusing on the object to be classified in a cluttered image). In a traditional convolutional neural network, if the model is expected to handle images at different scales and rotations, a large number of training samples is needed for it to learn correctly. How does this transformer module solve the problem?

In traditional convolutional neural network models, the component that deals with spatial invariance is the max-pooling layer: once we know that a particular feature is present in the original input volume (wherever the activation values are high), its exact location matters less than its position relative to other features. The new spatial transformer is dynamic and produces a different transformation for each input image, rather than being simple and predefined like max pooling. Let's look at how this transformer module works. It consists of three parts (a code sketch follows the list):

1. A localization network, which takes the input volume and outputs the parameters of the spatial transformation that should be applied. For an affine transformation, the parameters, or θ, are 6-dimensional.

2. A sampling grid, obtained by warping a regular grid with the affine transformation (θ) produced by the localization network.

3. A sampler, which warps the input feature map accordingly.
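Putting the three pieces together, here is a minimal sketch using PyTorch's `affine_grid` and `grid_sample`; the tiny localization network and its layer sizes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * height * width, 32),
            nn.ReLU(),
            nn.Linear(32, 6),                      # 6-dimensional theta for an affine transform
        )
        # Initialize to the identity transform so training starts from "no warp".
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.localization(x).view(-1, 2, 3)                   # 1. localization network
        grid = F.affine_grid(theta, x.size(), align_corners=False)    # 2. sampling grid
        return F.grid_sample(x, grid, align_corners=False)            # 3. sampler warps the input
```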

This module can be placed at any point in a convolutional neural network and basically helps the network learn how to transform feature maps in a way that minimizes the cost function during training.

Why it's important

The main reason this paper caught my attention is that improvements to convolutional neural networks do not necessarily require huge changes to the overall network architecture; we do not need to create the next ResNet or Inception. Here, a simple affine transformation applied to the input image makes the model robust to translation, scaling, and rotation.

The above is the translation.

A Beginner’s Guide to Understanding Convolutional Neural Networks


Translator: Mags, edited by Yuan Hu.

The original link