1. Introduction

Artificial intelligence has been a hot topic ever since AlphaGo defeated Lee Sedol, and my recent work requires exploring how AI can be applied on mobile devices. I took that as an opportunity to get started with deep learning, which is what this series is about.

The learning curve for deep learning is still quite steep. After reading a lot of material and asking friends who work in the field, I finally got a feel for it. In this article I will share what I saw and learned along the way; it is by no means a professional treatment.

This introduction is divided into the following parts:

  • The principles of CNNs
  • Using a CNN in TensorFlow to solve the MNIST problem
  • Adapting the demo to my own problem: recognizing stock watchlist screenshots
  • Deploying the trained model on iOS with the TensorFlow Mobile framework
  • Deploying the trained model on iOS with the TensorFlow Lite framework

The first thing I did was clarify some concepts. I knew nothing about artificial intelligence and wasn't even sure what it really was. Before learning, we should be clear about what artificial intelligence, machine learning and deep learning are, and where to start. After asking people who work in the field and looking up some material, it becomes easy to understand. A reliable answer on Zhihu sums it up in these two sentences:

Machine learning: an approach to artificial intelligence

Deep learning: A technique for implementing machine learning

The relationship between the three is shown below:

As you can see, artificial intelligence is a very broad concept: any idea of using machines to solve problems can be considered artificial intelligence, and the concept was proposed as early as the 1950s. Machine learning is one approach to artificial intelligence; any scenario that uses data analysis to support decisions can be called machine learning. That concept dates from the 1980s, so it has been developing for nearly 40 years, and it is not mysterious at all; it is common in modern applications such as website recommendation algorithms and spam filtering. The concept that has been so hot lately is actually a small part of all this, called deep learning. Simply put, deep learning is the approach of using deep neural networks to solve learning problems. If machine learning in the broad sense is humans defining rules for the computer to execute, deep learning is letting the computer learn the rules itself. Riding on the rapid growth of computing power, deep learning has had a huge impact in fields such as image recognition and speech recognition. After reading this article, you should have a basic understanding of deep learning.

Given how many algorithms there are in machine learning, learning them all takes a long time (as I understand it, machine learning is essentially a large chunk of statistics applied to computing, and it is huge). Deep learning is comparatively simpler. As far as I know, the popular deep learning architectures today are mainly CNNs (convolutional neural networks), RNNs (recurrent neural networks) and DNNs (deep neural networks). The DNN is the foundation of deep learning; CNNs and RNNs are built on top of it. CNNs are good at extracting image features and handling image problems. An RNN feeds the output of the previous step back in as input to the next step, which gives the network a notion of sequence and causality, so RNNs are good at time-series problems such as speech recognition and semantic analysis.

Following the outline above, this article first introduces the concepts behind CNNs and then walks through a very useful TensorFlow example that uses a CNN to solve the MNIST problem. The remaining parts of the outline will be covered in upcoming articles.

2. CNN Neural Network

CNN, short for convolutional neural network, is currently the most common and widely used network in deep learning. It is well suited to image recognition, image classification and image prediction problems.

Here is a good article about CNNs; many well-known writers cite it when they talk about CNNs. Take a look if you are interested.

An Intuitive Explanation of Convolutional Neural Networks

Original: What are convolutional neural networks? Why are they important?

Setting CNNs aside for a moment, think about how you would implement an application that classifies images. I wanted a program that determines whether an image is a stock watchlist screenshot, i.e. a picture like the following.

The first thing I would do is look at the features of such an image. Obviously, a watchlist screenshot contains regular red and green blocks, while other images do not. So the program would first extract the pixel values of the image and then search for the red and green blocks; a simple algorithm would be to check whether the coordinates of the red pixels form a rectangular block, or whether the coordinates of the green pixels do.

As we can see, a typical image classification algorithm extracts features and then compares them. A CNN simply automates this process: the developer does not need to tell the network what the features of the images are; the network finds features in the images by itself and records them. So how does that work? Here's how.

Input an image and the network extracts its features through a series of operations. As shown below:

Of course, there are algorithms and parameters for this sequence of calculations. During training, we will label each image with a corresponding label. After CNN calculates the features through the above series, each feature will correspond to a label. Such as

Feature 1 -> Label A
Feature 2 -> Label B
Feature 3 -> Label A

When the next picture enters training, the CNN extracts features using the parameters computed in the previous round. If feature 2 is extracted and the picture's label is B, the parameters are considered correct and need no adjustment. If feature 2 is extracted but the picture's label is A, the parameters are inaccurate and need adjusting. After the adjustment, training continues with the next picture, and so on, until the parameters are very likely accurate.

That, very simply, is how a CNN works. The real process is of course much more complex: feature extraction involves several algorithms, such as convolution, pooling and activation, and there is not just one parameter but millions. The rest of this article shows how a CNN works in practice, based on the very simple MNIST example from TensorFlow.

3. The MNIST problem

The MNIST problem is the image processing equivalent of the Hello World program, which has a full Demo in the official Tensorflow tutorial.

MNIST problem handling

MNIST is a very common image classification problem. The training set consists of images of handwritten digits 0-9, each labeled with the digit it shows. After training, the model takes an image as input and outputs the digit 0-9 it contains.

3.1 The input set

First, let's take a look at the MNIST input set; the official page describes the data set in detail.

The entire dataset consists of four files:

// Training set image data
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
// Training set label data
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
// Test set image data
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
// Test set label data
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

The format of the training set image data is as follows:

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

The format of the training set-label data is as follows

[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000801(2049) magic number (MSB first) 
0004     32 bit integer  60000            number of items 
0008     unsigned byte   ??               label 
0009     unsigned byte   ??               label 
........ 
xxxx     unsigned byte   ??               label
The labels values are 0 to 9.

The data format of the test set is the same as that of the training set.

First let’s write a program that tries to parse the data:

# coding=utf-8
import os
import struct
import numpy as np
import matplotlib.pyplot as plt

def load_mnist(path, kind='train'):
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
    # Parse the label file: an 8-byte header (magic number, item count),
    # followed by one unsigned byte per label.
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        print('label magic : ', magic)
        labels = np.fromfile(lbpath, dtype=np.uint8)
    # Parse the image file: a 16-byte header (magic number, image count, rows,
    # columns), followed by the raw pixel bytes.
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        print('image magic : ', magic)
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(num, rows * cols)
    return images, labels, rows, cols

def show_image():
    images, labels, rows, cols = load_mnist('/tmp/tensorflow/mnist/input_data/')
    fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True)
    ax = ax.flatten()
    for i in range(10):
        img = images[i].reshape(rows, cols)
        print(labels[i])
        ax[i].imshow(img, cmap='Greys', interpolation='nearest')
    ax[0].set_xticks([])
    ax[0].set_yticks([])
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    show_image()

After running the program, the first 10 images of the input set are displayed, as shown below:

It also prints the labels to the console:

label magic :  2049
image magic :  2051
[5] [0] [4] [1] [9] [2] [1] [3] [1] [4]

This is a typical input for a CNN model: a collection of data where each item has a label. We will use the same format later for our own project.

Because the MNIST problem is so typical, TensorFlow even ships with helpers that encapsulate the parsing.

3.2 MNIST problem code parsing

The complete code lives in the TensorFlow GitHub repository at tensorflow/examples/tutorials/mnist/mnist_deep.py.

Let’s look at it in the order in which it runs.

3.2.1 Reading Data

if __name__ == '__main__':
  # Parse command line arguments
  parser = argparse.ArgumentParser()
  parser.add_argument('--data_dir', type=str,
                      default='/tmp/tensorflow/mnist/input_data',
                      help='Directory for storing input data')
  FLAGS, unparsed = parser.parse_known_args()
  # Run
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Then, inside main(), the data is read:

  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir)

This call encapsulates the whole MNIST download and parsing process. Let's look at the read_data_sets method:

def read_data_sets(train_dir,
                   fake_data=False,
                   one_hot=False,
                   dtype=dtypes.float32,
                   reshape=True,
                   validation_size=5000,
                   seed=None,
                   source_url=DEFAULT_SOURCE_URL):
  ...
  return base.Datasets(train=train, validation=validation, test=test)

This method returns a Datasets object in which train, validation and test are already parsed into a format similar to what we saw in the previous section.

Note that validation does not come from the MNIST files themselves; the data set contains only train and test data, and the validation set is split off from the training data. The network runs the validation set after each round of training so the developer can see the current accuracy, and runs the test set once all training is done to measure the accuracy of the final model. Since the validation data is only for the developer's reference, taking it from the training data is fine. The test data, however, must not overlap with the training data: if the two sets overlapped, we could not accurately measure the model's accuracy.
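
As a quick sanity check, here is a small sketch (assuming the default validation_size of 5000) showing how the returned data sets are split:

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/tensorflow/mnist/input_data')
print(mnist.train.images.shape)       # (55000, 784) -- 60000 training images minus the 5000 validation images
print(mnist.validation.images.shape)  # (5000, 784)  -- split off from the training data
print(mnist.test.images.shape)        # (10000, 784)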

3.2.2 Defining input and output

  # Create the model
  x = tf.placeholder(tf.float32, [None, 784], name="x")

  # Define loss and optimizer
  y_ = tf.placeholder(tf.int64, [None])

tf.placeholder can be thought of as declaring a variable that will be fed with data later. Two are defined here. The first is the input x, of type float and shape [None, 784]: None means an unspecified number, and 784 = 28*28, the size of one flattened MNIST image. In other words, the input layer x can hold any number of images at once, so the network can process a batch of images in a single step. Such a variable is a TensorFlow tensor, and every tensor can be given a name; here we only name the representative ones, such as x for the input node.

The variable y_ holds the labels and is of integer type; the labels are the digits 0-9, and their count is also unspecified, because the number of labels must match the number of images fed into x.

Why is declaring a variable so complicated? Why can't we just write float x = 5 as in an ordinary program? This is where TensorFlow's computation graph, a static graph, comes in. Training a deep network requires a huge amount of computation, and to speed it up the framework hands the whole computation over to the CPU or GPU and collects the result when it is done. Take a simple example: to compute 3 * 5 + 2, an ordinary program computes 3 * 5, gets the result, then adds 2. If TensorFlow worked that way, it would ask the CPU to compute 3 * 5, get the result back, then send that result and 2 to the CPU again for the addition. But every round trip to the CPU or GPU is expensive, so training a neural network like an ordinary program would be extremely slow. Frameworks like TensorFlow therefore use a static graph: first define the whole computation ("multiply, then add" in this example), then hand the numbers and the graph over together; the CPU computes the final result and sends it back, so the framework and the CPU interact only once for the whole computation. This speeds things up, but the fatal drawback is that it is hard to debug: you cannot set a breakpoint after 3 * 5 to check the intermediate result; you can only see whether the final result is correct.

So tf.placeholder simply defines a tensor, a node in the computation graph, and that is why it has to be declared with TensorFlow's own syntax rather than as an ordinary variable.
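
To make the static-graph idea concrete, here is a minimal sketch of the 3 * 5 + 2 example in TensorFlow 1.x style (the names a, b and c are just for illustration):

import tensorflow as tf

# Build the graph first: nothing is computed at this point.
a = tf.placeholder(tf.float32, name="a")
b = a * 5.0        # 3 * 5
c = b + 2.0        # (3 * 5) + 2

# Only when the session runs is the whole graph handed over for computation.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0}))  # 17.0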

3.2.3 Defining the computation graph

# Build the graph for the deep net
  y_conv, keep_prob = deepnn(x)

The deepnn method is the core of the whole program; it defines the entire computation graph. Let's go through it layer by layer. The CNN used here is the LeNet architecture, which looks like this:

There are two convolution layers, two pooling layers, and the last output layer is two fully connected layers. The code is as follows:

def deepnn(x):
  """deepnn builds the graph for a deep net for classifying digits.

  Args:
    x: an input tensor with the dimensions (N_examples, 784), where 784 is the
    number of pixels in a standard MNIST image.

  Returns:
    A tuple (y, keep_prob). y is a tensor of shape (N_examples, 10), with values
    equal to the logits of classifying the digit into one of 10 classes (the
    digits 0-9). keep_prob is a scalar placeholder for the probability of
    dropout.
  """
  # Reshape to use within a convolutional neural net.
  # Last dimension is for "features" - there is only one here, since images are
  # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
  with tf.name_scope('reshape'):
    x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.add(tf.matmul(h_fc1_drop, W_fc2), b_fc2, name="output")

  return y_conv, keep_prob

Let’s take a look layer by layer at how the network works.

Convolution layer

CNN stands for convolutional neural network, so convolution is clearly the core of this network. The convolution layer is used to extract image features. The convolution operation slides a small matrix, called the convolution kernel, across the input matrix and, at each position, multiplies the overlapping elements and sums them; the results form a feature map of the input. This is not easy to grasp in words, so here is the process in pictures.

Let’s say the input matrix looks like this

Take the following convolution kernel

Scanning the input matrix with the kernel and summing the element-wise products at each position gives you the features of the input matrix as extracted by that kernel.

In terms of CNN, the 3×3 matrix is called “filter” or “kernel” or “feature detector”. The matrix obtained by sliding the filter across the image and calculating the dot product is called a “Convolved Feature” or an “Activation Map” or a “Feature Map”. Remember that the filter acts as a feature detector on the original input image.
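
If you prefer code to pictures, here is a minimal NumPy sketch of that scan-and-multiply process (the 5x5 input and 3x3 kernel follow the classic example from the article referenced above; conv2d_valid is a hypothetical helper written just for illustration):

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image (stride 1, no padding) and sum the
    # element-wise products at each position.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d_valid(image, kernel))  # the 3x3 feature map extracted by this kernel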

In case you’re wondering if this operation can actually extract features, let’s look at an example of a convolution operation on an actual picture.

The input image is as follows:

The results of convolution operation with different convolution kernels are as follows:

As you can see, different kernels extract different feature maps from the same original image. Some kernels extract edge information, some extract color information, some extract light and dark features, and so on. Doesn't this feel like a filter in Photoshop? The convolution operation is indeed very similar to applying a filter: different kernels behave like different filters, each sensitive to a different kind of feature. The convolution layer in the code is as follows:

# First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

First, a convolution kernel variable W_conv1 is defined. Since variables are defined several times later on, the variable-creation code is extracted into helper methods:

def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

tf.Variable generates a tensor, with initial as its initialization value. Here initial is produced by truncated_normal: shape specifies the dimensions of the variable, a four-dimensional [5, 5, 1, 32], and the initial values are drawn from a truncated normal distribution with a standard deviation of 0.1. See the documentation for exactly what truncated_normal does.

The bias b_conv1 is generated in a similar but simpler way than the kernel:

def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

Its initial value is fixed at 0.1. The initial values of these two variables should not be 0; using small non-zero values breaks symmetry and avoids zero gradients, which makes training more efficient.

The next part of the code performs the convolution itself:

conv2d(x_image, W_conv1)

def conv2d(x, W):
  """conv2d returns a 2d convolution layer with full stride."""
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

This is the TensorFlow function for convolution. The "2d" means it produces two-dimensional feature maps; what that means exactly is explained below. Besides conv2d there are also conv1d and conv3d functions.

Take a look at conv2d's function signature:

tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, name=None)

Excluding the name argument, which specifies the name of the operation, there are five arguments related to the method:

The first parameter, input: the input data, a 4-dimensional tensor of shape [batch, in_height, in_width, in_channels]. Its type must be float32 or float64.

The second parameter, filter: the convolution kernel, a tensor of shape [filter_height, filter_width, in_channels, out_channels], i.e. [kernel height, kernel width, number of input channels, number of kernels]. Its type must match input. Note that its third dimension, in_channels, must equal the fourth dimension of input.

The third parameter, strides: the stride of the sliding window in each dimension of the input during convolution. It is a one-dimensional vector of length 4, matching the number of dimensions of the tensors above.

The fourth parameter, padding: a string that can only be "SAME" or "VALID", selecting between two different convolution modes (explained below).

The fifth parameter, use_cudnn_on_gpu: a bool indicating whether to use cuDNN for acceleration; it defaults to true. cuDNN is NVIDIA's GPU-accelerated library for deep neural networks.

Two-dimensional feature maps

Why is the function called conv2d? It produces two-dimensional feature maps. conv2d needs two main arguments, input and filter. Our input is an image: three-dimensional data of [width, height, channels]. For each kernel's output to be a two-dimensional map, the third dimension of filter must equal input's channel count, in_channels; the kernel then collapses the channel dimension, so each kernel outputs a single two-dimensional feature map. That is why the function is called conv2d. This part is a little harder to grasp, so give it some thought.

The padding parameter

The padding parameter has two possible values, SAME and VALID. It controls how the convolution kernel treats the edges of the input matrix and determines the size of the output feature map.

The kernel scans across the input matrix step by step according to the stride, multiplying and summing at each position. At the edges of the input, however, the remaining region may be smaller than the kernel, so the kernel cannot be applied there directly; padding decides how this leftover border is handled.

If padding is SAME, the input matrix is padded with zeros around the border so that the kernel can be applied at every position; with a stride of 1, the output feature map then has the same size as the original input.

If padding is VALID, the border positions are simply discarded, so the output feature map is smaller than the input matrix.
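
A rough sketch of the difference, assuming a 28x28 single-channel input and a 5x5 kernel with stride 1 (TensorFlow 1.x):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])
k = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))

same = tf.nn.conv2d(x, k, strides=[1, 1, 1, 1], padding='SAME')
valid = tf.nn.conv2d(x, k, strides=[1, 1, 1, 1], padding='VALID')

print(same.shape)   # (?, 28, 28, 32) -- zero padding keeps the spatial size
print(valid.shape)  # (?, 24, 24, 32) -- 28 - 5 + 1 = 24, border positions dropped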

Let’s go back to the code

  W_conv1 = weight_variable([5, 5, 1, 32])
  conv2d(x_image, W_conv1)

The kernels are 5*5*1 and there are 32 of them, so after processing the image we get 32 feature maps. Since the conv2d helper uses a stride of 1 and SAME padding, each output feature map is 28*28, i.e. the output is 28*28*32.

  # First convolutional layer: extract 32 features from the image
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    # y = wX + b
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

After the convolution, a ReLU activation is applied.

ReLU is an element-level operation (applied to each pixel) that sets every value less than 0 in the feature map to zero. Its purpose is to introduce nonlinearity into the ConvNet: most real-world data we want the ConvNet to learn is nonlinear, while convolution itself is a linear operation (element-wise multiplication and addition), so we introduce nonlinearity with the nonlinear function ReLU.

Introducing nonlinearity makes the neural network work better; in particular it helps avoid the vanishing gradient problem during backpropagation. The deeper theory behind this is beyond the scope of this article.

So the first convolution operation is done.

Pooling layer

The convolution operation is followed by a pooling layer.

# Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

Pooling is a kind of compression: it reduces the spatial size of the input while preserving its characteristics, for example by taking the maximum, minimum or average of each 2x2 block of pixels. In practice, max pooling has been found to preserve the original features best.
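
A tiny sketch of 2x2 max pooling (TensorFlow 1.x) on a made-up 4x4 single-channel input: each 2x2 block is reduced to its maximum value.

import numpy as np
import tensorflow as tf

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 2, 0, 1],
              [3, 1, 9, 4]], dtype=np.float32).reshape(1, 4, 4, 1)

pooled = tf.nn.max_pool(tf.constant(x), ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    print(sess.run(pooled).reshape(2, 2))
    # [[6. 7.]
    #  [8. 9.]]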

  # 2x2 max pooling layer
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

def max_pool_2x2(x):
  """max_pool_2x2 downsamples a feature map by 2X."""
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

The pooling operation also calls into a TensorFlow function; the strides and padding parameters work the same way as in the conv2d function above. ksize defines the pooling window size in each dimension. The output of the previous convolution is [batch, 28, 28, 32], so pooling is applied over the two middle (spatial) dimensions.

# Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

Next come another convolution and another pooling operation, just like before. With SAME padding, convolution does not change the spatial size of its input but does change its depth; pooling does not change the depth but does shrink the spatial size. After two convolution layers and two 2x2 poolings, each spatial dimension is a quarter of the original (28 becomes 7) and the depth has become 64, so the output is 7*7*64.
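
Putting it together, this is how the shape of the data changes through the layers described so far (N is the batch size):

input x      : [N, 784]
reshape      : [N, 28, 28, 1]
conv1 (SAME) : [N, 28, 28, 32]
pool1 (2x2)  : [N, 14, 14, 32]
conv2 (SAME) : [N, 14, 14, 64]
pool2 (2x2)  : [N, 7, 7, 64]
flatten      : [N, 7*7*64] = [N, 3136]
fc1          : [N, 1024]
fc2          : [N, 10]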

The layers described above are also called hidden layers: processing layers that are invisible to the user.

Fully connected layer

Next comes the output part, which consists of two fully connected layers (the dropout step between them is covered a little later).

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.add(tf.matmul(h_fc1_drop, W_fc2), b_fc2, name="output")

A fully connected layer, as the name implies, is one in which every node is connected to all the nodes of the previous layer.

For example, a1 is connected to every node of the previous layer, x1, x2 and x3, and so is a2. Drawn as a diagram, a fully connected layer looks like the following network.

What does "connected" actually mean here? As shown in the figure, if x1, x2 and x3 are the inputs to the fully connected layer, then node a1 can be expressed in the following form.
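
In other words, a1 is a weighted sum of all of its inputs plus a bias, roughly: a1 = w11*x1 + w12*x2 + w13*x3 + b1.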

It can be seen from this formula that a1 is related to x1, x2 and x3, only with different weights for each input node. Similarly, a2 and a3 can be expressed in the same form.

That is the mathematical view of a fully connected layer, but what does it mean in practice? Simply put, it acts as a classifier inside the network. In the figure above, the input layer has three values and the fully connected layer has three nodes, which means the layer classifies the features from the previous layer into three categories. Of course, the input layer and the fully connected layer do not need to have the same number of nodes, as in the following structure.

Here the last fully connected layer has 10 nodes while the previous layer produces 15 feature values, so the 15 features are classified into 10 outputs. For example, if the previous layer outputs 15 features T1-T15, then features T1, T3 and T5 might be considered to belong to output O1, features T2 and T6 to output O2, and so on, until the 15 features have been grouped into 10 outputs.

In the MNIST problem there are two fully connected layers: the first has 1024 nodes and the second has 10. Normally the layer closest to the user has as many nodes as there are result categories; in MNIST the expected outputs are the digits 0-9, ten categories, so the second fully connected layer has 10 nodes. Looking one step further back, the convolution layers extract 64 feature values at each position, so the whole image yields height * width * 64 feature values, i.e. 7*7*64 features. The first fully connected layer has 1024 nodes, meaning the network classifies those 7*7*64 features into 1024 categories. Why 1024? That is an empirical value and the number can be tuned; we will come back to it when discussing how the fully connected layers affect model size. So how does a fully connected layer actually classify? Let's see.

Review the representation formula for the full connection layer.

What exactly do the input values x1, x2, x3..., the output value a1, the weight w and the bias b refer to?

Consider the following case for MNIST. Look at the figure below. For the digit 0, after training we believe that if a picture has pixel values in the red region in the middle, there is some probability that the picture is not a 0, and the more pixels fall in the red region, the higher that probability. If pixels appear in the blue ring around it, the picture has some probability of being a 0, and the more pixels fall in the blue region, the higher the probability that it is a 0.

How is this expressed mathematically? Give the blue region positive weights and the red region negative weights, multiply every pixel of the input image by its weight and sum the results: the more pixels fall in the red region, the smaller the sum; the more fall in the blue region, the larger the sum. We call this final sum the evidence that the input picture x belongs to this class (class 0 here). In this way, for an input image, the evidence on each output node can be expressed by the following formula.
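
This is the standard evidence formula from the TensorFlow MNIST tutorial: evidence_i = Σj W(i, j) * x_j + b_i, where W(i, j) is the weight of pixel j for class i and b_i is the bias for class i.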

Compare this with the fully connected layer formula above: it is exactly the same thing. The fully connected layer may seem convoluted, but it is the basic principle behind deep learning networks and is worth understanding carefully.

Now let's look back at the code.

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

First, the weights W_fc1 and bias b_fc1 of the fully connected layer are created, reusing the two helper methods used earlier for the convolution kernel and bias; the concepts differ but the computation is similar, so they can be shared. tf.reshape(h_pool2, [-1, 7 * 7 * 64]) flattens the 4-dimensional [batch, 7, 7, 64] output of the previous layer into one vector per example, because the fully connected layer operates on flat vectors. tf.matmul implements the fully connected layer as a matrix multiplication, and tf.nn.relu is, as before, the activation function.

Dropout

After the first fully connected layer, a dropout operation is added:

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

Dropout can be illustrated by the following figure:

Simply put, h_fc1 has 1024 nodes; with a keep_prob of 50%, dropout keeps roughly 512 of them and zeroes out the rest (a small sketch follows the list below). Dropout has two advantages:

  • It mitigates overfitting
  • It introduces randomness into training
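
A tiny sketch of what tf.nn.dropout does (TensorFlow 1.x), using a made-up vector of ones:

import tensorflow as tf

h = tf.ones([1, 8])
h_drop = tf.nn.dropout(h, 0.5)  # keep_prob = 0.5

with tf.Session() as sess:
    # Roughly half of the values are zeroed at random; the survivors are scaled
    # by 1/keep_prob so the expected sum stays the same.
    print(sess.run(h_drop))  # e.g. [[2. 0. 2. 2. 0. 0. 2. 0.]] -- the pattern is random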

After dropout comes the second fully connected layer.

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.add(tf.matmul(h_fc1_drop, W_fc2), b_fc2, name="output")

The second fully connected layer classifies the 1024 nodes produced by the previous layer into 10 nodes, giving the final output. Even after dropout the layer still has 1024 nodes; some of them simply no longer participate in the computation. The final output of the network is the evidence value on each of the 10 nodes.

3.2.4 Softmax regression processing

At this point the computation graph itself is fully defined. Next we define the pieces needed for training, which tell the framework how to train the network.

  with tf.name_scope('loss'):
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(
        labels=y_, logits=y_conv)
  cross_entropy = tf.reduce_mean(cross_entropy)

The loss function is defined using softmax cross entropy. Here is a brief explanation of softmax and cross entropy.

The raw output of the neural network is not a probability; it is just a value obtained from the input through weighted sums and nonlinear processing. How do we turn that output into a probability distribution?

That is what the softmax layer does. Assuming the raw outputs of the network are y1, y2, ..., yn, the output after softmax is:
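
In standard form, softmax(yi) = e^(yi) / Σj e^(yj).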

After softmax, the outputs of all nodes sum to 1, and each node's raw value has been turned into that node's probability.

The output of a single node becomes a probability value, and the result is processed by Softmax as the final output of the neural network.
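
A quick numeric check in NumPy (the values are made up for illustration):

import numpy as np

y = np.array([2.0, 1.0, 0.1])             # raw network outputs y1, y2, y3
softmax = np.exp(y) / np.sum(np.exp(y))   # softmax(yi) = e^yi / sum_j e^yj
print(softmax)        # approx [0.659 0.242 0.099]
print(softmax.sum())  # 1.0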

3.2.5 Cross entropy

Cross entropy describes the distance between the actual output (probability) and the expected output (probability), that is, the smaller the cross entropy value is, the closer the two probability distributions will be. Assuming that the probability distribution P is the expected output, the probability distribution Q is the actual output, and H(p,q) is the cross entropy, then:
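
That is, H(p, q) = -Σx p(x) * log q(x).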

For example: assuming N=3, expected output p=(1,0,0), actual output q1=(0.5,0.2,0.3), q2=(0.8,0.1,0.1), then:
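
Working these out (with base-10 logarithms, which is what makes the batch average below come out to 0.2): H(p, q1) = -(1*log 0.5 + 0*log 0.2 + 0*log 0.3) ≈ 0.30, and H(p, q2) = -(1*log 0.8 + 0*log 0.1 + 0*log 0.1) ≈ 0.10.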

Obviously, q2 is closer to P, and its cross entropy is lower. In addition, there is another expression of cross entropy, again using the above assumptions:

The results are:

Everything above concerns a single sample. In actual training, samples are combined into batches, so the network output is an m*n two-dimensional matrix, where m is the batch size and n is the number of classes, and the corresponding labels form a two-dimensional matrix as well. Combining the data above into a batch of 2:

The cross entropy result is then a column vector (using the first formula):

For a batch, the average value is 0.2.

tf.losses.sparse_softmax_cross_entropy is TensorFlow's encapsulation of the two steps above. It produces one cross-entropy value per sample in the batch, and tf.reduce_mean takes their average. Together, this is the loss function of the whole network.

3.2.6 Training method – Gradient descent

Next, we define the gradient descent training step:

  with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

This minimizes the loss function cross_entropy using the Adam optimizer with a learning rate of 1e-4. In other words, the goal of training is to make the loss smaller and smaller, and the learning rate controls how large each parameter update is.

3.2.7 Accuracy

Define a method for calculating accuracy

  with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y_)
    correct_prediction = tf.cast(correct_prediction, tf.float32)
  accuracy = tf.reduce_mean(correct_prediction)

y_conv is the raw output of the network: the weights of the image on the final 10 nodes, for example [314, -423, 342, ...]. The node with the largest value is the predicted class; for example, if the third node has the largest value, the picture is probably the digit 2. tf.argmax extracts the index of the largest value, y_ is the label we supplied, and comparing the two tells us whether the prediction is correct. Finally, tf.reduce_mean computes the average accuracy over the batch.
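
A small sketch of this accuracy check with made-up numbers (TensorFlow 1.x):

import tensorflow as tf

logits = tf.constant([[0.1, 5.0, 0.2],   # predicted class 1
                      [3.0, 0.1, 0.4],   # predicted class 0
                      [0.2, 0.3, 7.0]])  # predicted class 2
labels = tf.constant([1, 0, 1], dtype=tf.int64)

correct = tf.cast(tf.equal(tf.argmax(logits, 1), labels), tf.float32)
accuracy = tf.reduce_mean(correct)

with tf.Session() as sess:
    print(sess.run(accuracy))  # 0.6666667 -- 2 of the 3 predictions match the labels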

3.2.8 Visualization of computational graphs

  graph_location = tempfile.mkdtemp()
  print('Saving graph to: %s' % graph_location)
  train_writer = tf.summary.FileWriter(graph_location)
  train_writer.add_graph(tf.get_default_graph())

This is the graph visualization support that TensorFlow provides (for TensorBoard); we won't expand on it here.

3.2.9 Running the computation graph

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
      train_accuracy = accuracy.eval(feed_dict={
          x: batch[0], y_: batch[1], keep_prob: 1.0})
      print('step %d, training accuracy %g' % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

  print('test accuracy %g' % accuracy.eval(feed_dict={
      x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

TensorFlow wraps an interaction with the CPU or GPU in a session; once the session starts, the computation runs.

sess.run(tf.global_variables_initializer())

First, all of the variables defined above are initialized. Note that when defining the graph we only defined how the variables would be initialized; the actual initialization happens here.

The loop then runs 20,000 times, taking 50 images from the training set as a batch each time.

train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

This runs one training step. train_step is the gradient descent operation defined earlier, and training proceeds by running it repeatedly. The fed values are the image data, the correct labels and the dropout keep probability.

if i % 100 == 0:
  train_accuracy = accuracy.eval(feed_dict={
      x: batch[0], y_: batch[1], keep_prob: 1.0})
  print('step %d, training accuracy %g' % (i, train_accuracy))

Every 100 training steps, the current accuracy is printed so the developer can see how training is going. keep_prob is 1.0 here because no dropout is needed during evaluation.

print('test accuracy %g' % accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

After the 20,000 training steps, the final accuracy on the test set is printed.

The output looks something like this

step 0, training accuracy 0.16
step 100, training accuracy 0.9
step 200, training accuracy 0.94
step 300, training accuracy 0.9
step 400, training accuracy 0.96
...
test accuracy 0.9446

Conclusion

At this point, the whole model training process is complete. Through this example we have seen how a CNN works and how to implement one with TensorFlow. In my next article, I'll describe how to modify this example to solve a real problem of my own.