In the first article, we introduced the basic concepts of neural networks and the basics of TensorFlow. This second article in the series introduces convolutional neural networks: the classical convolutional neural network, the basic concepts and building blocks of the fully convolutional network, and the similarities and differences between convolutional neural networks and ordinary neural networks. Finally, we show how to build a convolutional neural network with TensorFlow by implementing a face key point detection algorithm that is widely used in practice.

The history of convolutional neural networks

The Convolutional Neural Network (CNN) can be traced back to the 1960s. Biological studies showed that visual information is transmitted from the retina to the brain through multiple levels of receptive field stimulation, and early models such as the Neocognitron were proposed based on this insight. In 1998, LeCun, one of the three giants of deep learning, and his collaborators formally proposed the CNN and designed the LeNet-5 model shown in the figure below. The model achieved good results in handwritten character recognition.

Due to limited computing resources and other reasons, CNNs were then neglected for a long time. More than a decade later, in the 2012 ImageNet competition, the CNN-based AlexNet won by a large margin and led the revival of CNNs. After that, CNN research entered a period of rapid development. At present, the development of convolutional neural networks has two main directions:

  • How to improve model performance. A major focus in this direction is how to train wider and deeper networks. Many classic models emerged along this line, including GoogLeNet, VGG, ResNet, and ResNeXt.
  • How to speed up the model. Speed is critical for deploying CNNs on mobile devices. Through methods such as replacing max pooling with strided convolution, using group convolution, and fixed-point quantization, CNN applications such as face detection and background segmentation have been deployed on mobile phones at scale.

At present, the CNN is the most important algorithm in the field of computer vision and has achieved good results on many problems. Due to space limitations, this article will mainly introduce the basics of convolutional neural networks.

Neural networks vs convolutional neural networks

In the previous article we introduced neural networks. Neural networks are widely used in big data processing, speech recognition, and other fields, but they run into several problems when applied to images:

  • Parameter explosion. Take a 200×200×3 image as an example: if the first hidden layer after the input layer has 100 neurons, the number of parameters reaches 200×200×3×100 = 12 million. A model with so many parameters is obviously difficult to train and easy to overfit.
  • Translation invariance. For many image problems, we expect the model to satisfy some form of translation invariance. For example, in image classification, we want the model to correctly identify an object wherever it appears in the picture.
  • Local correlation. In big data problems and the like, there is no explicit topological relationship between input dimensions, so a neural network (fully connected layer) is a suitable model. But in computer vision, adjacent pixels of the input image have a natural topological relationship. For example, when judging whether there is an object at a certain position in the picture, we only need to consider the pixels around that position, instead of feeding in the information of all pixels as a traditional neural network does.

To overcome these problems, we need a more reasonable network structure for the visual domain. By design, the convolutional neural network overcomes them through local connectivity and parameter sharing, thus achieving remarkable results on image tasks. Next, we introduce the principles of convolutional neural networks in detail.

Convolutional neural network

The network structure

The overall structure of a convolutional neural network is roughly the same as that of a traditional neural network. As shown in the figure below, both a traditional neural network with two fully connected layers and a convolutional neural network with two convolutional layers are stacks of basic units, where the output of one layer serves as the input of the next, and the output of the final layer is the model's prediction. The main difference lies in the basic unit: convolutional neural networks use convolutional layers in place of the fully connected layers of ordinary neural networks.

Like the fully connected layer, the convolutional layer contains learnable parameters: a weight and a bias. The model's parameters can be defined within the supervised learning framework introduced in the previous article and optimized by back propagation.

Convolution

The convolutional layer is the foundation of the whole convolutional neural network. The 2D convolution operation can be regarded as a process similar to template matching. As shown in the figure below, a template of size h × w × d is matched against the input with a sliding window. At each position, the inner product of the input patch and the template weights, plus a bias b, gives the value of the output position. h and w are the height and width of the template, collectively known as the kernel size; in CNNs they usually take the same value. d is the number of channels of the template, which must equal that of the input, e.g., 3 for RGB images.

The template is usually called the convolution kernel (K) or filter in a convolutional neural network. In the standard convolution, the value at output position (x, y) can be expressed as:

Conv(I, K)_{x,y} = ∑_{i=1}^{h} ∑_{j=1}^{w} ∑_{k=1}^{d} K_{i,j,k} · I_{x+i−1, y+j−1, k} + b
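
To make the formula concrete, here is a minimal NumPy sketch of this single-filter 2D convolution (a naive reference implementation for illustration; the function and argument names are ours, not from any library):

import numpy as np

def conv2d_single(image, kernel, b=0.0):
  # naive single-filter "valid" convolution: image is H x W x d,
  # kernel is h x w x d, output is (H-h+1) x (W-w+1)
  H, W, _ = image.shape
  h, w, _ = kernel.shape
  out = np.zeros((H - h + 1, W - w + 1))
  for x in range(out.shape[0]):
    for y in range(out.shape[1]):
      # inner product of the kernel with the patch under it, plus bias
      out[x, y] = np.sum(image[x:x + h, y:y + w, :] * kernel) + b
  return out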

In a CNN, in addition to the h, w, and d parameters of a single filter, there are three more important parameters: depth, stride, and padding:

  • Depth refers to the number of output channels, which corresponds to the number of filters in the convolutional layer.
  • Stride refers to the step size by which the filter slides at each move.
  • Padding refers to the width of zeros padded around the input. Padding is used primarily to control the size of the output: without it, a filter with kernel size greater than 1 makes the output smaller than the input. In practice, padding is often added so that the input and output have the same size.

As shown in the figure below, for the 1D case, if the input size is W, the filter size is F, the stride is S, and the padding is P, then the output size is (W−F+2P)/S+1. By setting P=(F−1)/2, the input and output sizes stay the same when S=1. The 2D case is computed analogously along each spatial dimension.
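
As a quick sanity check, the following small helper (illustrative, not part of any library) evaluates this formula:

def conv_output_size(w, f, s, p):
  # output size of a convolution along one dimension: (W - F + 2P) / S + 1
  return (w - f + 2 * p) // s + 1

# P = (F - 1) / 2 keeps the size unchanged when S = 1
assert conv_output_size(96, 3, 1, 1) == 96
# a stride of 2 roughly halves the spatial size
assert conv_output_size(96, 3, 2, 1) == 48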

Compared with the fully connected layer of a traditional neural network, the convolutional layer can actually be regarded as a special case of the fully connected layer. First, local connectivity: by exploiting the spatial topology of the input, the convolutional layer only connects each output node to the input nodes that lie within the filter's spatial range, with the weights of all other edges fixed at 0. Second, parameter sharing: we force the filter parameters to be exactly the same for every output node. Through this local connectivity and parameter sharing, the convolutional layer makes better use of the topological relationships within the image and its translation invariance, greatly reduces the number of parameters, and thus reaches a better local optimum and performs better on image problems.

Implementing a convolutional layer in TensorFlow is very simple and can be done by calling tf.nn.conv2d directly:
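
A minimal sketch, assuming a batch of 96×96 RGB images and a bank of 3×3 filters with 32 output channels (shapes and names here are illustrative):

import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 96, 96, 3])
weight = tf.get_variable(
    'weight', shape=[3, 3, 3, 32],  # kernel h, kernel w, input channels, depth
    initializer=tf.contrib.layers.xavier_initializer())
bias = tf.get_variable(
    'bias', shape=[32], initializer=tf.constant_initializer(0.0))
# strides are given per input dimension [batch, height, width, channel];
# 'SAME' padding keeps the spatial size unchanged when the stride is 1
conv = tf.nn.conv2d(images, weight, strides=[1, 1, 1, 1], padding='SAME')
out = tf.nn.relu(conv + bias)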

Pooling

In a CNN, in addition to a large number of convolutional layers, we also insert pooling layers where appropriate. The pooling layer reduces the size of its input, thereby reducing the parameters and computation of the subsequent network. Common pooling operations (such as max pooling and average pooling) also provide some translation invariance.

For example, max pooling takes the maximum over all values within the kernel window as the output of the corresponding position. Pooling is usually performed separately for each channel, so the number of output channels is the same as that of the input. The pooling layer is similar to the convolutional layer: a pooling operation can also be understood as a sliding window, so the concepts of stride and padding carry over from convolution. The following figure shows a max pooling operation with kernel size 2 and stride 2:

In practice, there are two common configurations for the pooling layer. In one, the kernel size and stride are both 2, so the pooled regions do not overlap. In the other, overlapping pooling, the kernel size is 3 and the stride is 2. Implementing a pooling layer in TensorFlow is also very simple:
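
A minimal sketch of the non-overlapping configuration, assuming conv_out is the output of a previous convolutional layer with shape [batch, H, W, C]:

# 2x2 max pooling with stride 2 halves the spatial size and keeps the
# channel count; ksize and strides follow [batch, height, width, channel]
pooled = tf.nn.max_pool(
    conv_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')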

Classical network structure of convolutional neural networks

Having introduced the basic modules of convolutional neural networks, we now introduce the classical network structure. From LeNet-5 in 1998, to the AlexNet model in ImageNet 2012, to VGG and a series of later classic models, all basically follow this classical structure.

For clarity, we omit the nonlinear activation functions after the convolutional and fully connected layers. As shown in the figure above, the classical convolutional neural network can be divided into three parts:

  • A series of cascaded conv + pooling layers (the pooling layers are sometimes omitted). During the cascade, the spatial size of the input gradually shrinks while the number of output channels grows, completing the abstraction of information from low level to high level.
  • A series of cascaded fully connected layers. At the junction between the convolutional layers and the fully connected layers, the output of the convolutional layer is flattened into a one-dimensional input and fed into the fully connected layer. A series of fully connected layers is then cascaded, depending on the complexity of the task.
  • A final output layer whose form is determined by the needs of the task. For multi-class classification problems, it is followed by a softmax layer.

The classical convolutional neural network can be regarded as a nonlinear function with a fixed output size: it converts an input image of size H×W×3 into a final fixed-length vector of dimension D. Classical convolutional neural networks have achieved great success in image classification and regression. We will walk through a regression example later in this article.

Fully Convolutional Network

Due to the fully connected layers, the classical convolutional neural network can only accept images of fixed size as input and produce output of fixed size. Although adaptive pooling can be used to accept variable-size input, such processing still only generates output of fixed size. To overcome this shortcoming in application scenarios with variable output size, such as object segmentation, we drop the fully connected layers entirely. A network whose main computational units are all convolutional layers is called a fully convolutional network (FCN).

As shown in the figure above, since the convolution operation places no limit on the input size and the output size is determined by the input, the fully convolutional network handles problems of variable size, such as segmentation, very well. The fully convolutional network can be regarded as a nonlinear function whose output size changes linearly with the input size: it converts an input image of size H×W×3 into an output of size H/S×W/S×D. Supervised learning problems that can be cast in this form can basically be solved within the fully convolutional framework.

Deconvolution

In a fully convolutional network, the standard convolution + pooling operations shrink the output. For many problems we need the output to be the same size as the input image, so we need an operation that can enlarge its input. The most common such operation is deconvolution.

Deconvolution can be understood as the reverse of convolution. Here we mainly discuss deconvolution with an integer stride greater than 1, which can be understood as a generalized interpolation. In the figure below, the input is the green 3×3 square, the deconvolution stride is 2, the kernel size is 3, and the padding is 1. During sliding, each input cell contributes to the corresponding 3×3 shaded output region, with each contribution being the product of the input value and the corresponding kernel value. The final output at each position is the sum of the contributions it receives during sliding. This can be regarded as interpolation weighted by the values of the 3×3 kernel. The outermost white region does not receive a complete set of contributions, so with padding set to 1 the surrounding white area is removed and the final output size is 5×5.

Following this description, for a deconvolution with integer stride greater than 1, if the deconvolution kernel is fixed to a bilinear interpolation kernel, the deconvolution is equivalent to bilinear interpolation. A learned deconvolution kernel, however, can adapt to different problems better than a kernel with fixed parameters, so deconvolution can be regarded as an extension of traditional interpolation. As with convolution, TensorFlow already implements a deconvolution module, tf.layers.conv2d_transpose.
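
A minimal sketch, assuming features is a feature map of shape [batch, H, W, 64] that we want to upsample spatially by a factor of 2 (the filter count and kernel size here are illustrative):

# a learned 3x3 deconvolution kernel with stride 2 doubles the spatial
# size; with padding='same' the output shape is [batch, 2H, 2W, 32]
upsampled = tf.layers.conv2d_transpose(
    features, filters=32, kernel_size=3, strides=2, padding='same')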

Applications of convolutional neural networks in visual recognition

CNNs are widely used in visual recognition. Next, we take three classic problems in visual recognition, namely classification/regression, detection, and segmentation, as examples to introduce how to use CNNs to solve practical problems.

Classification/regression

Image classification refers to identifying which pre-specified category (or categories) an image belongs to, while image regression refers to estimating the value of some image attribute from the image content. Classification and regression are widely used in practice: object classification, face recognition, and even 12306 CAPTCHA recognition can all be abstracted into standard classification problems. Similarly, face key point location prediction and face attribute prediction (such as age or attractiveness) can be abstracted into standard regression problems. At present, if an application in the visual domain can be abstracted into a classification or regression problem with fixed output length, it can usually be solved with the classical convolutional neural network framework introduced above, given a large amount of training data.

Detection

The detection problem usually refers to determining whether there are objects in a picture and where they are. There are one-stage and two-stage methods for detection; due to limited space, we focus on the one-stage method under the FCN framework. As introduced above, an FCN can be regarded as a nonlinear function that converts an input picture of H×W×3 into an output of H/S×W/S×D. To solve detection under the FCN framework, at each output position we predict whether there is an object, plus the offsets of the object's top-left and bottom-right corners relative to the current position. Each output position therefore needs a 5-dimensional vector, i.e., D=5. Once the network's output is defined, the corresponding ground truth is constructed accordingly, and the parameters are learned under the supervised learning framework by defining a loss function (e.g., L2 loss) and applying back propagation.
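
A minimal sketch of such a detection head, assuming features is the H/S×W/S feature map produced by an FCN backbone (all names are illustrative):

# a 1x1 convolution produces D=5 channels per output position:
# 1 objectness score plus 4 corner offsets
det_out = tf.layers.conv2d(features, filters=5, kernel_size=1)
objectness = det_out[..., 0]     # is there an object at this position?
box_offsets = det_out[..., 1:5]  # top-left / bottom-right corner offsets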

Segmentation

The segmentation problem refers to classifying every pixel in the image. The FCN-based segmentation method is very similar to the one-stage detection method above: for an N-way segmentation problem, at each output position we predict the category it belongs to, so the output is H/S×W/S×N, and the network is trained by back propagation. One difference from detection is that we sometimes need output of the same size as the input image (H×W×N), yet pooling layers are usually added to shrink the intermediate feature maps and speed up the network. As shown in the figure below, to ensure the output size meets the requirement, we can add a deconvolution layer at the end of the network to compensate, thereby obtaining a larger output.
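
A minimal sketch of such a segmentation head, assuming features has spatial size H/S×W/S with S=8 (the stride and class count are illustrative):

num_classes = 21  # illustrative
# per-position class scores at reduced resolution
logits = tf.layers.conv2d(features, filters=num_classes, kernel_size=1)
# a deconvolution with stride 8 restores the full H x W resolution,
# giving an output of shape [batch, H, W, num_classes]
logits = tf.layers.conv2d_transpose(
    logits, filters=num_classes, kernel_size=16, strides=8, padding='same')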

In practice: face key point detection

Face key point detection is now a fairly mature application in the visual domain and is the basis of advanced applications such as liveness detection, face beautification, and face recognition. Finally, this article shows how to use TensorFlow to build an image regression application through a face key point detection example. The experiment uses the Facial Keypoints Detection dataset from the Kaggle competition (www.kaggle.com/c/facial-ke…). The dataset consists of 7,094 training images and 1,783 test images. Each face in the dataset is labeled with 15 key points, and each image is 96×96 in size.

L2 distance regression

The goal of the Kaggle competition is to predict the coordinates of the 15 key points on a face, 30 float values in total, which makes it a standard regression problem. We choose the most common L2 distance as the optimization target. Following the same code structure as the neural network model in the first article, we divide the code into three main modules: the Dataset module, the Net module, and the Solver module.

Model structure

  • inference

    We define the main structure of the network in the inference function. Because the model uses fully connected and convolutional layers repeatedly, we encapsulate them as the functions linear_relu and conv_relu to make the code easy to reuse. For the network structure, we adopt a relatively simple design with three convolutional layers and two fully connected layers. The output of the convolutional layers passes through tf.reshape into a format acceptable to the fully connected layer. Since this is a regression problem, we directly output the result of the last fully connected layer.
  • loss

    For simplicity, we use the MSE of the standard regression problem as the loss function: tf.reduce_mean(tf.square(predictions - labels), name='mse').
  • metric

    During testing, we again use the tf.metrics module provided by TensorFlow, which automatically evaluates each batch and aggregates the evaluations. Since we are solving a regression problem, we can use tf.metrics.mean_squared_error to calculate the mean squared error.
def linear(x, output_size, wd=0):
  input_size = x.get_shape()[1].value
  weight = tf.get_variable(
      name='weight',
      shape=[input_size, output_size],
      initializer=tf.contrib.layers.xavier_initializer())
  bias = tf.get_variable(
      'bias', shape=[output_size],
      initializer=tf.constant_initializer(0.0))
  out = tf.matmul(x, weight) + bias
  if wd != 0:
    weight_decay = tf.multiply(tf.nn.l2_loss(weight), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return out

def linear_relu(x, output_size, wd=0):
  return tf.nn.relu(
      linear(x, output_size, wd),
      name=tf.get_default_graph().get_name_scope())

def conv_relu(x, kernel_size, width, wd=0):
  input_size = x.get_shape()[3]
  weight = tf.get_variable(
      name='weight',
      shape=[kernel_size, kernel_size, input_size, width],
      initializer=tf.contrib.layers.xavier_initializer())
  bias = tf.get_variable(
      'bias', shape=[width],
      initializer=tf.constant_initializer(0.0))
  conv = tf.nn.conv2d(x, weight, strides=[1, 1, 1, 1], padding='SAME')
  if wd != 0:
    weight_decay = tf.multiply(tf.nn.l2_loss(weight), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  out = tf.nn.relu(conv + bias, name=tf.get_default_graph().get_name_scope())
  return out

def pool(x, size):
  return tf.nn.max_pool(
      x, ksize=[1, size, size, 1],
      strides=[1, size, size, 1],
      padding='SAME')
class BasicCNN(Net):

  def __init__(self, **kwargs):
    self.output_size = kwargs.get('output_size', 1)
    return

  def inference(self, data):

    with tf.variable_scope('conv1'):
      conv1 = conv_relu(data, kernel_size=3, width=32)
      pool1 = pool(conv1, size=2)

    with tf.variable_scope('conv2'):
      conv2 = conv_relu(pool1, kernel_size=2, width=64)
      pool2 = pool(conv2, size=2)

    with tf.variable_scope('conv3'):
      conv3 = conv_relu(pool2, kernel_size=2, width=128)
      pool3 = pool(conv3, size=2)

    # Flatten convolutional layers output
    shape = pool3.get_shape().as_list()
    flattened = tf.reshape(pool3, [-1, shape[1] * shape[2] * shape[3]])

    # Fully connected layers
    with tf.variable_scope('fc4'):
      fc4 = linear_relu(flattened, output_size=100)

    with tf.variable_scope('fc5'):
      fc5 = linear_relu(fc4, output_size=100)

    with tf.variable_scope('out'):
      prediction = linear(fc5, output_size=self.output_size)

    return {"predictions": prediction, 'data': data}

  def loss(self, layers, labels):
    predictions = layers['predictions']
    with tf.variable_scope('losses'):
      loss = tf.reduce_mean(tf.square(predictions - labels), name='mse')
    return loss

  def metric(self, layers, labels):
    predictions = layers['predictions']
    with tf.variable_scope('metrics'):
      metrics = {
        "mse": tf.metrics.mean_squared_error(
          labels=labels, predictions=predictions)}
    return metrics

Dataset

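The preprocessing below assumes the raw data has been loaded into a pandas DataFrame df, with one flattened image per row in the Image column and the key point coordinates in the remaining columns. Pixel values are scaled to [0, 1] and coordinates to [-1, 1]:
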
images = np.vstack(df['Image'].values) / 255.  # scale pixel values to [0, 1]
images = images.astype(np.float32)

label = df[df.columns[:-1]].values
label = (label - 48) / 48  # scale target coordinates to [-1, 1]
label = label.astype(np.float32)
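
Since the images are 96×96 and the coordinates range over [0, 96], the transform (label - 48) / 48 centers them at 0; at evaluation time, predictions can be mapped back to pixel coordinates with prediction * 48 + 48.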
def parse_example(example_proto):
  features = {
      "data": tf.FixedLenFeature((9216), tf.float32),
      "label": tf.FixedLenFeature((30), tf.float32, default_value=[0.0] * 30),
  }
  parsed_features = tf.parse_single_example(example_proto, features)
  image = tf.reshape(parsed_features["data"], (96, 96, -1))
  return image, parsed_features["label"]

dataset = tf.contrib.data.TFRecordDataset(files)
dataset = dataset.map(self.parse_function)

In the Dataset module, we use the TFRecord format recommended by TensorFlow. The TFRecord file is read by TFRecordDataset, and each record is converted into the model's input format by parse_example. As a fixed-length binary format, TFRecord greatly speeds up reading and feeding data, which prevents data IO from becoming a performance bottleneck, especially when training on a GPU.
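
For completeness, here is a minimal sketch of how such a TFRecord file could be written (this writer loop is an assumption for illustration, not part of the original code); each example stores the 9216 pixel values of one flattened 96×96 image and its 30 label values:

writer = tf.python_io.TFRecordWriter('train.tfrecords')
for image, label in zip(images, labels):  # hypothetical preprocessed arrays
  example = tf.train.Example(features=tf.train.Features(feature={
      'data': tf.train.Feature(
          float_list=tf.train.FloatList(value=image.flatten())),
      'label': tf.train.Feature(
          float_list=tf.train.FloatList(value=label)),
  }))
  writer.write(example.SerializeToString())
writer.close()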

Solver

Thanks to the modular design, we can reuse the Solver code from the first article completely, without any modification.

The experimental results

file_dict = {
    'train': os.path.join(args.data_dir, 'train.tfrecords'),
    'eval': os.path.join(args.data_dir, 'test.tfrecords')
}

with tf.Graph().as_default():
  dataset = Dataset(
      file_dict=file_dict,
      split='train',
      parse_function=parse_example,
      batch_size=50)
  net = Net(output_size=30)
  solver = Solver(dataset, net, max_steps=200, summary_iter=10)
  solver.train()

With these interfaces encapsulated, we can train the model with the short script above. The figure below shows the network structure visualized in TensorBoard, the loss statistics, and the model's results on test images:

step 10: loss = 0.0756 (136.2 examples/sec)
step 20: loss = 0.0230 (155.2 examples/sec)
step 30: loss = 0.0102 (149.1 examples/sec)
step 40: loss = 0.0071 (125.1 examples/sec)
step 50: loss = 0.0065 (160.9 examples/sec)
step 60: loss = 0.0081 (171.9 examples/sec)
step 70: loss = 0.0058 (148.4 examples/sec)
step 80: loss = 0.0060 (169.4 examples/sec)
step 90: loss = 0.0069 (185.4 examples/sec)
step 100: loss = 0.0057 (186.1 examples/sec)
step 110: loss = 0.0062 (183.8 examples/sec)
step 120: loss = 0.0080 (170.3 examples/sec)
step 130: loss = 0.0052 (185.8 examples/sec)
step 140: loss = 0.0071 (184.3 examples/sec)
step 150: loss = 0.0049 (170.7 examples/sec)
step 160: loss = 0.0056 (178.7 examples/sec)
step 170: loss = 0.0053 (173.2 examples/sec)
step 180: loss = 0.0058 (172.6 examples/sec)
step 190: loss = 0.0053 (172.5 examples/sec)
step 200: loss = 0.0056 (188.1 examples/sec)
MSE: 0.140243709087

As can be seen, a classical convolutional neural network with three convolutional layers and two fully connected layers solves the face key point detection problem quite well. In practice, we can use more complex networks and additional tricks to further improve model performance.

Full code download: github.com/Dong–Jian/…