
Convolutional Neural Networks (CNN)

Image recognition problems and classical datasets: the CIFAR dataset and the ILSVRC2012 dataset.

1. Introduction to convolutional neural networks

Compared with a fully connected neural network, the key difference is in how adjacent layers are connected. In a fully connected network, every node in one layer is connected by an edge to every node in the next layer, so the nodes in each fully connected layer are generally organized as a column. In a convolutional neural network, only a subset of the nodes in adjacent layers are connected. To show the dimensions of the neurons at each layer, the nodes of each convolutional layer are generally organized as a three-dimensional matrix. The structure of a convolutional neural network:

Input layer –> Convolution layer 1 –> Pooling layer 1 –> Convolution layer 2 –> Pooling layer 2 –> Fully connected layer 1 –> Fully connected layer 2 –> Softmax layer –> Classification result

In the first several layers of a convolutional neural network, the nodes of each layer are organized into a three-dimensional matrix:

  1. Input layer: the input of the whole neural network. In a convolutional neural network for image processing, it generally represents the pixel matrix of an image. Starting from the input layer, the network transforms the three-dimensional matrix of one layer into the three-dimensional matrix of the next layer through different network structures, until it finally reaches the fully connected layers.
  2. Convolutional layer: the most important layer in a convolutional neural network. The input of each node in a convolutional layer is only a small block of the previous layer, usually of size 3×3 or 5×5. The convolutional layer analyzes each small block in depth to obtain features of a higher level of abstraction.
  3. Pooling layer: the pooling layer does not change the depth of the 3D matrix, but it reduces its length and width, for example by converting a higher-resolution image into a lower-resolution one. The pooling layer further reduces the number of nodes in the final fully connected layers, and thus the number of parameters in the whole network.
  4. Fully connected layer: mainly completes the classification task.
  5. Softmax layer: produces the probability distribution of the current sample over the different classes.

2. Common structures of convolutional neural networks

2.1 The convolutional layer

The most important part of a convolutional layer is called a filter or a kernel.

A filter transforms a sub node matrix of the current layer into a unit node matrix of the next layer. A unit node matrix is a node matrix with length and width of 1 but unlimited depth.

The length and width of the node matrix processed by a filter are specified manually; this size is also called the filter size, commonly 3×3 or 5×5. Because the depth of the matrix processed by the filter is always the same as the depth of the current layer's node matrix, the filter size only needs to specify two dimensions, even though the node matrix is three-dimensional. The other setting that must be specified manually is the depth of the resulting unit node matrix, called the depth of the filter.

The forward propagation of a convolutional layer is the process of computing the unit node matrix through the filter. Suppose w^i_{x,y,z} denotes the filter weight connecting input node (x, y, z) to the i-th node of the output unit node matrix, and b^i denotes the bias of the i-th output node; then the value of the i-th node in the unit matrix is:


$$g(i)=f\left(\sum_{x=1}^{2}\sum_{y=1}^{2}\sum_{z=1}^{3} a_{x,y,z} \times w^i_{x,y,z} + b^i\right)$$

where a_{x,y,z} is the value of the filter's input node (x, y, z) and f is the activation function.

The forward propagation of the convolutional layer structure is obtained by moving the filter from the upper-left corner to the lower-right corner of the current layer's node matrix, computing the corresponding unit node matrix at each position.
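To make the formula concrete, here is a minimal NumPy sketch (an illustration, not from the original text) that computes the i-th output node for a single 2×2×3 input block, choosing ReLU as the activation function f:

import numpy as np

a = np.random.rand(2, 2, 3)    # the 2x2x3 input block a_{x,y,z}
w_i = np.random.rand(2, 2, 3)  # the weights w^i_{x,y,z} of the i-th output node
b_i = 0.1                      # the bias b^i of the i-th output node

# g(i) = f(sum over x, y, z of a * w + b), with f chosen as ReLU here
g_i = np.maximum(0.0, np.sum(a * w_i) + b_i)
print(g_i)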

The following formulas give the size of the output matrix when all-zero padding is used:


$$out_{length}=\left\lceil \frac{in_{length}}{stride_{length}} \right\rceil$$

$$out_{width}=\left\lceil \frac{in_{width}}{stride_{width}} \right\rceil$$

Here out_length is the length of the output matrix, equal to the length of the input matrix divided by the stride in the length direction, rounded up. Similarly, out_width is the width of the output matrix, equal to the width of the input matrix divided by the stride in the width direction, rounded up.

When all-zero padding is not used, the size of the resulting matrix is:


$$out_{length}=\left\lceil \frac{in_{length}-filter_{length}+1}{stride_{length}} \right\rceil$$

$$out_{width}=\left\lceil \frac{in_{width}-filter_{width}+1}{stride_{width}} \right\rceil$$
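As a quick illustration of the two padding modes, here is a minimal Python sketch (the helper name conv_output_size is made up for this example):

import math

def conv_output_size(in_size, filter_size, stride, all_zero_padding):
    # Output length (or width) of a convolutional layer
    if all_zero_padding:  # 'SAME' padding
        return math.ceil(in_size / stride)
    return math.ceil((in_size - filter_size + 1) / stride)  # 'VALID' padding

# A 32x32 input with a 5x5 filter and stride 1:
print(conv_output_size(32, 5, 1, all_zero_padding=True))   # 32
print(conv_output_size(32, 5, 1, all_zero_padding=False))  # 28 = 32 - 5 + 1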

In a convolutional neural network, every position in a convolutional layer uses the same filter parameters. Sharing the filter parameters within each convolutional layer greatly reduces the number of parameters in the network. Moreover, the number of parameters of a convolutional layer has nothing to do with the image size; it depends only on the filter size, the filter depth, and the depth of the current layer's node matrix. For example, a 5×5 filter with depth 16 over an input of depth 3 has 5×5×3×16+16 = 1216 parameters, whether the image is 32×32 or 1024×1024. This property lets convolutional neural networks scale well to larger image data.

import tensorflow as tf

# Create the filter's weight and bias variables with tf.get_variable. As explained
# above, the number of parameters of a convolutional layer depends only on the
# filter size, the filter depth, and the depth of the current layer's node matrix,
# so the weight variable declared here is a four-dimensional matrix. The first two
# dimensions are the filter size, the third dimension is the depth of the current
# layer, and the fourth dimension is the depth of the filter.
filter_weight = tf.get_variable('weights', [5, 5, 3, 16],
                                initializer=tf.truncated_normal_initializer(stddev=0.1))

# Like the weights, the bias terms at different positions of the current layer
# matrix are also shared, so there is one bias per slice of the next layer's depth.
# The 16 here is the depth of the filter, which is also the depth of the next
# layer's node matrix.
biases = tf.get_variable('biases', [16], initializer=tf.constant_initializer(0.1))

# tf.nn.conv2d conveniently implements the forward propagation of the convolutional
# layer. Its first argument is the node matrix of the current layer (assumed defined
# earlier): a four-dimensional tensor whose last three dimensions are a node matrix
# and whose first dimension is the batch. For example, input[0, :, :, :] is the
# first image, input[1, :, :, :] the second, and so on. The second argument is the
# convolutional layer's weights, and the third argument gives the stride in each
# dimension. Although the stride is an array of length 4, its first and last
# entries must be 1, because the stride only applies to the length and width of
# the matrix. The last argument is the padding: TensorFlow provides two options,
# SAME for all-zero padding and VALID for no padding.
conv = tf.nn.conv2d(input, filter_weight, strides=[1, 1, 1, 1], padding='SAME')

# tf.nn.bias_add conveniently adds the bias to every node. The bias cannot simply
# be added directly, because nodes at different positions share the same bias term.
bias = tf.nn.bias_add(conv, biases)

# De-linearize the result with the ReLU activation function
actived_conv = tf.nn.relu(bias)

2.2 pooling layer

  • A pooling layer is often added between the convolution layers. The pooling layer can very effectively reduce the size of the matrix, thereby reducing the parameters in the final fully connected layer. Pooling layer can not only speed up the calculation, but also prevent the over-fitting problem.
  • The forward propagation of the pooling layer is similar to the convolution layer, which is also achieved by moving a filter. However, instead of a weighted sum of nodes, the calculation in the pooling layer filter takes the simpler maximum or average calculation. The pooling layer that uses the maximum operation is called the maximum pooling layer, which is the most used pooling layer structure. The pooling layer that uses the average operation is called the average pooling layer;
  • Filters in the convolutional layer and the pooled layer move in a similar way. The only difference is that the filters used by the convolutional layer span the entire depth, while the filters used by the pooled layer only affect nodes at one depth. So the filter of the pool layer needs to move in depth as well as in length and width.

The following TensorFlow program implements the forward propagation of a max pooling layer:

# tf.nn.max_pool implements the forward propagation of the max pooling layer;
# its arguments are similar to those of tf.nn.conv2d.
# tf.nn.avg_pool implements the average pooling layer.
# ksize gives the filter size, strides gives the stride information, and padding
# determines whether to use all-zero padding.
pool = tf.nn.max_pool(actived_conv, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME')
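To make the size reduction concrete, here is a minimal NumPy sketch (an illustration, not part of the original code) of 2×2 max pooling with stride 2 on a 4×4 single-depth matrix:

import numpy as np

x = np.arange(16).reshape(4, 4)  # a 4x4 single-depth input
# 2x2 max pooling with stride 2 shrinks it to 2x2:
# split into 2x2 blocks, then take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[ 5  7]
               #  [13 15]]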

3. Classical convolutional network models

3.1 The LeNet-5 model

The LeNet-5 model structure has seven layers in total; the parameter arithmetic below is re-computed in the sketch after this list:

  1. Convolutional layer: the input of this layer is the original image pixels; the input layer accepted by the LeNet-5 model has size 32×32×1. The filter size is 5×5, the depth is 6, no all-zero padding is used, and the stride is 1. Because no all-zero padding is used, the output size is 32-5+1 = 28 and the depth is 6. This convolutional layer has 5×5×1×6+6 = 156 parameters in total, of which 6 are biases. Because the next layer's node matrix has 28×28×6 = 4704 nodes, each connected to 5×5 = 25 nodes of the current layer, this layer has 4704×(25+1) = 122304 connections in total.
  2. Pooling layer: the input of this layer is the output of the first layer, a 28×28×6 node matrix. The filter size is 2×2, with stride 2 in both the length and width directions, so the output matrix of this layer is 14×14×6.
  3. Convolutional layer: the input matrix of this layer is 14×14×6; the filter size is 5×5 with depth 16; no all-zero padding is used and the stride is 1. The output matrix is 10×10×16. As a standard convolutional layer, this layer has 5×5×6×16+16 = 2416 parameters and 10×10×16×(25+1) = 41600 connections.
  4. Pooling layer: input matrix 10×10×16, filter size 2×2, stride 2. The output matrix is 5×5×16.
  5. Fully connected layer: the input matrix is 5×5×16 and the filter size is 5×5; in LeNet-5 this layer is called a convolutional layer, but since the filter is the same size as the input matrix it can be regarded as a fully connected layer. The number of output nodes is 120, for 5×5×16×120+120 = 48120 parameters.
  6. Fully connected layer: 120 input nodes, 84 output nodes, 120×84+84 = 10164 parameters in total.
  7. Fully connected layer: 84 input nodes, 10 output nodes, 84×10+10 = 850 parameters in total.
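As a quick check on the arithmetic above, the following Python sketch (an illustration; the helper names are made up) re-computes the parameter count of each LeNet-5 layer:

def conv_params(fh, fw, in_depth, out_depth):
    # weights plus one bias per output-depth slice
    return fh * fw * in_depth * out_depth + out_depth

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

print(conv_params(5, 5, 1, 6))      # layer 1: 156
print(conv_params(5, 5, 6, 16))     # layer 3: 2416
print(fc_params(5 * 5 * 16, 120))   # layer 5: 48120
print(fc_params(120, 84))           # layer 6: 10164
print(fc_params(84, 10))            # layer 7: 850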

Notes on implementing the LeNet-5 model:

The pool.get_shape function obtains the dimensions of the output matrix of the last pooling layer, and tf.reshape then turns that output into a batch of vectors for the fully connected layers. The dropout concept: dropout randomly sets the outputs of some nodes to 0 during training, which mitigates overfitting and makes the model perform better on test data. Dropout is typically used only in the fully connected layers, not in the convolutional or pooling layers.
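A minimal sketch of this flatten-plus-dropout step, assuming TensorFlow 1.x and that the tensors pool, fc_weights, and fc_biases are already defined elsewhere:

pool_shape = pool.get_shape().as_list()              # [batch, length, width, depth]
nodes = pool_shape[1] * pool_shape[2] * pool_shape[3]
reshaped = tf.reshape(pool, [pool_shape[0], nodes])  # one vector per example in the batch

# fc_weights and fc_biases are assumed to be declared for the fully connected layer
fc = tf.nn.relu(tf.matmul(reshaped, fc_weights) + fc_biases)
fc = tf.nn.dropout(fc, keep_prob=0.5)                # only applied during training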

The following regular expression summarizes some classical convolutional neural network architectures for image classification problems:

Input layer –> (Convolution layer+ –> Pooling layer?)+ –> Fully connected layer+

In the formula above, "Convolution layer+" means one or more convolutional layers, and "Pooling layer?" means zero or one pooling layer. After the stacked groups of convolutional and pooling layers, a convolutional neural network generally passes through one or two fully connected layers before the output. For example, the LeNet-5 model can be expressed in this formula as:

Input layer –> Convolution layer –> Pooling layer –> Convolution layer –> Pooling layer –> Fully connected layer –> Fully connected layer –> Output layer

Summary of convolutional neural network parameter configuration:

  • Generally, the side length of a convolutional filter is no more than 5, but some convolutional neural networks use filters with side length 7 or even 11 on the input layer.
  • For the filter depth, most convolutional neural networks increase it layer by layer.
  • The stride of a convolutional layer is generally 1, although some models also use strides of 2 or 3.
  • The configuration of the pooling layer is relatively simple; a max pooling layer is generally used.
  • The filter side length of the pooling layer is usually 2 or 3, and the stride is also usually 2 or 3.

3.2 The Inception-v3 model

The Inception-v3 model has a convolutional network structure completely different from LeNet-5. The Inception structure in the Inception-v3 model combines different convolutional layers in parallel: it uses filters of several side lengths, such as 1, 3, and 5, at the same time, and then joins the resulting matrices together.

An Inception module first processes the input matrix with filters of different sizes; each branch is one computational path in the module. Although the filter sizes differ, if every filter uses all-zero padding and a stride of 1, the resulting matrices all have the same length and width as the input matrix, so the matrices produced by the different filters can be spliced along the depth dimension into a deeper matrix.

The Inception-v3 model has 46 layers in total, composed of 11 Inception modules, with 96 convolutional layers. Implementing each convolutional layer with the five lines of code from the previous section would take 480 lines, which is undoubtedly too cumbersome. The TensorFlow-Slim tool implements a convolutional layer much more concisely:

# Implement a convolutional layer directly with the raw TensorFlow API
with tf.variable_scope(scope_name):
    weights = tf.get_variable("weights", ...)
    biases = tf.get_variable("bias", ...)
    conv = tf.nn.conv2d(...)
    relu = tf.nn.relu(tf.nn.bias_add(conv, biases))

# Implement the same convolutional layer with TensorFlow-Slim.
# The slim.conv2d function has three mandatory arguments: the first is the input
# node matrix, the second is the depth of the convolutional layer's filter, and
# the third is the filter size. Optional arguments include the stride of the
# filter, whether to use all-zero padding, the choice of activation function,
# the variable namespace, and so on.
net = slim.conv2d(input, 32, [3, 3])

The following code implements an Inception module:

import tensorflow as tf

# Load the Slim library
slim = tf.contrib.slim

# The slim.arg_scope function can be used to set default argument values. Its
# first argument is a list of functions; the functions in this list will use the
# default values. For example, with the definition below, a call such as
# slim.conv2d(net, 320, [1, 1]) automatically gets stride=1 and padding='SAME'.
# If the stride is specified at call time, the default set here is no longer
# used. Code redundancy is reduced in this way.
with slim.arg_scope([slim.conv2d, slim.max_pool2d, slim.avg_pool2d],
                    stride=1, padding='SAME'):
    # The other network structures of the Inception-v3 model are omitted here;
    # only one Inception module is implemented directly. The result of forward-
    # propagating the input image through the preceding layers is assumed to be
    # stored in the variable net (the string below is just a placeholder).
    net = 'Output node matrix of the previous layer'
    # Declare a unified variable namespace for the Inception module
    with tf.variable_scope('Mixed_7c'):
        # Declare a namespace for each path in the Inception module
        with tf.variable_scope('Branch_0'):
            # Implement a convolutional layer with filter side length 1 and depth 320
            branch_0 = slim.conv2d(net, 320, [1, 1], scope='Conv2d_0a_1x1')

        # The second path in the Inception module; the structure on this
        # computational path is itself an Inception structure
        with tf.variable_scope('Branch_1'):
            branch_1 = slim.conv2d(net, 384, [1, 1], scope='Conv2d_0a_1x1')
            # The tf.concat function concatenates multiple matrices. Its first
            # argument specifies the concatenation dimension; the 3 given here
            # means the matrices are concatenated along the depth dimension.
            branch_1 = tf.concat(3,
                                 [slim.conv2d(branch_1, 384, [1, 3], scope='Conv2d_0b_1x3'),
                                  slim.conv2d(branch_1, 384, [3, 1], scope='Conv2d_0c_3x1')])

        # The third path in the Inception module, which is also itself an
        # Inception structure
        with tf.variable_scope('Branch_2'):
            branch_2 = slim.conv2d(net, 448, [1, 1], scope='Conv2d_0a_1x1')
            # Note that the input of the second convolutional layer here is
            # branch_2 rather than net
            branch_2 = slim.conv2d(branch_2, 384, [3, 3], scope='Conv2d_0b_3x3')
            branch_2 = tf.concat(3,
                                 [slim.conv2d(branch_2, 384, [1, 3], scope='Conv2d_0c_1x3'),
                                  slim.conv2d(branch_2, 384, [3, 1], scope='Conv2d_0d_3x1')])

        # The fourth path in the Inception module
        with tf.variable_scope('Branch_3'):
            branch_3 = slim.avg_pool2d(net, [3, 3], scope='AvgPool_0a_3x3')
            branch_3 = slim.conv2d(branch_3, 192, [1, 1], scope='Conv2d_0b_1x1')

        # The final output of the current Inception module is the depth-wise
        # concatenation of the four paths above
        net = tf.concat(3, [branch_0, branch_1, branch_2, branch_3])

4. Convolutional neural network transfer learning

So-called transfer learning applies a model trained on one problem to a new problem with only simple adjustments. Following the conclusions in DeCAF, the parameters of all convolutional layers in a trained Inception-v3 model can be kept, replacing only the last fully connected layer. The network layers before that last fully connected layer are called the bottleneck layer. Propagating a new image through the trained convolutional network up to the bottleneck layer can be regarded as extracting features from the image. In a trained Inception-v3 model, the output of the bottleneck layer can already distinguish 1000 image classes through a single-layer fully connected network, so it is reasonable to assume that the node vector output by the bottleneck layer can be used directly as a feature vector for any image. The extracted feature vectors are then used as input to train a new single-layer fully connected network for the new classification problem.
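As a minimal sketch of this idea (assuming TensorFlow 1.x, the 2048-dimensional Inception-v3 bottleneck vector, and a made-up number of new classes), the new single-layer network could look like this:

import tensorflow as tf

BOTTLENECK_SIZE = 2048   # size of the Inception-v3 bottleneck vector
n_classes = 5            # hypothetical number of classes in the new problem

# The bottleneck vectors are treated as precomputed inputs; only the new layer trains
bottleneck_input = tf.placeholder(tf.float32, [None, BOTTLENECK_SIZE])
labels = tf.placeholder(tf.float32, [None, n_classes])

# A new single-layer fully connected network for the new classification problem
weights = tf.Variable(tf.truncated_normal([BOTTLENECK_SIZE, n_classes], stddev=0.001))
biases = tf.Variable(tf.zeros([n_classes]))
logits = tf.matmul(bottleneck_input, weights) + biases

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)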

Note: Natural language processing

Language model

Assuming that all possible sentences in a language obey a probability distribution, with the probabilities of all sentences summing to 1, the task of a language model is to predict the probability of each sentence in the language. A good language model should assign relatively high probabilities to common sentences and probabilities close to zero to illegal sentences. So how is the probability of a sentence calculated? First, a sentence can be regarded as a sequence of words:


$$S = (w_1, w_2, w_3, w_4, \dots, w_m)$$

where m is the length of the sentence. Its probability can then be expressed as:


$$p(S)=p(w_1)p(w_2|w_1)p(w_3|w_1,w_2)\dots p(w_m|w_1,w_2,w_3,\dots,w_{m-1})$$

Here p(w_m|w_1,w_2,w_3,\dots,w_{m-1}) is the probability that the m-th word is w_m given the first m-1 words. If this term can be modeled, the probability of a sentence is calculated simply by multiplying the conditional probabilities at each position. Common models include the n-gram model, decision trees, maximum entropy models, conditional random fields, and neural network language models.
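For instance, a bigram (2-gram) model approximates each term by p(w_i|w_{i-1}). Here is a minimal Python sketch with a made-up corpus (an illustration, not from the original text):

from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # p(word | prev) estimated by counting, without smoothing
    return bigrams[(prev, word)] / unigrams[prev]

# p("the cat sat") ~= p(cat|the) * p(sat|cat), ignoring the sentence-start term
sentence = ["the", "cat", "sat"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, word)
print(p)  # 2/3 * 1/2 = 0.333...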

Evaluation methods for language models

The commonly used metric for evaluating a language model is perplexity: the lower the perplexity on a test set, the better the model fits it.

The formula for calculating perplexity is as follows:


$$perplexity(S)=p(w_1,w_2,w_3,\dots,w_m)^{-\frac{1}{m}}$$

Simply put, perplexity describes how well a language model predicts a language sample. For example, if we already know that the sentence (w_1, w_2, w_3, \dots, w_m) appears in the corpus, then the higher the probability the language model computes for it, the better the model fits the corpus.

In training a language model, the logarithmic form of perplexity is usually used:


$$\log(perplexity(S))=-\frac{1}{m}\sum_{i=1}^{m}\log p(w_i|w_1,\dots,w_{i-1})$$

so the product of probabilities becomes a sum of logarithms.
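A minimal NumPy sketch (with made-up per-word conditional probabilities) of computing log perplexity and recovering perplexity from it:

import numpy as np

# hypothetical conditional probabilities p(w_i | w_1, ..., w_{i-1}) for one sentence
word_probs = np.array([0.2, 0.5, 0.1, 0.4])
m = len(word_probs)

log_ppl = -np.sum(np.log(word_probs)) / m  # the sum form above
ppl = np.exp(log_ppl)                      # equals p(S)^(-1/m)
print(log_ppl, ppl)

# sanity check against the product form
assert np.isclose(ppl, np.prod(word_probs) ** (-1.0 / m))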

Mathematically, log perplexity can be regarded as the cross entropy between the true distribution and the predicted distribution, which describes a kind of distance between two probability distributions. Assuming x is a discrete variable and u(x) and v(x) are two probability distributions over x, the cross entropy between u and v is defined as the expected value of -log v(x) under the distribution u:


$$H(u,v)=E_u[-\log v(x)]=-\sum_{x} u(x)\log v(x)$$

Treating x as a word, with u(x) the true distribution of the word at each position and v(x) the model's predicted distribution p(w_i|w_1,w_2,\dots,w_{i-1}), it can be seen that log perplexity and cross entropy are equivalent.
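A short NumPy sketch (with a made-up predicted distribution) of this equivalence: when u(x) is the one-hot true distribution, the cross entropy at one position reduces to -log v(actual word), and averaging that over all positions is exactly the log perplexity:

import numpy as np

v = np.array([0.1, 0.7, 0.2])  # model's predicted distribution over a 3-word vocabulary
u = np.array([0.0, 1.0, 0.0])  # one-hot true distribution: the actual word is word 1

cross_entropy = -np.sum(u * np.log(v))
print(cross_entropy)                             # 0.3567... = -log(0.7)
print(np.isclose(cross_entropy, -np.log(v[1])))  # True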