Preface
When we start learning to program, the first thing we do is print “Hello World”. MNIST is to machine learning beginners what “Hello World” is to programming beginners.
MNIST is an entry-level computer vision dataset that contains a variety of images of handwritten digits:
It also provides a label for each image telling us which digit it is. For example, the four images above are labeled 5, 0, 4, 1.
The code for training a simple handwritten digit recognition model is actually very short: the sample code is about 50 lines in total, and with the comments removed probably under 30 lines. But the design ideas the code embodies are important, so in these notes I will record my understanding of each piece of it.
Reference:
Introduction to MNIST machine learning
Machine learning-loss function
MNIST data set
The MNIST dataset is available on Yann LeCun’s website. Although Python provides code to download the dataset directly, because of network restrictions in mainland China it is recommended to download the dataset here and place it in the project root directory.
As mentioned earlier, each MNIST data unit has two parts: a picture of a handwritten digit and a corresponding label. Let’s call the images xs and the labels ys. Both the training set and the test set contain xs and ys. For example, the training-set images are mnist.train.images, and the training-set labels are mnist.train.labels.
Each image is 28×28 pixels. We can represent an image as an array of numbers:
Let’s expand this array into a vector of length 28×28 = 784. It doesn’t matter how the array is expanded (the order between the numbers), as long as the images are expanded the same way.
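As a quick sanity check of this flattening (a NumPy sketch; NumPy is not used in the tutorial code itself):

```python
import numpy as np

# A toy 28x28 "image": pixels 0..783 laid out row by row
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Flatten into a 784-dimensional vector, row-major order
flat = image.reshape(-1)

print(flat.shape)  # (784,)
```

Any fixed order works, as noted above; row-major is simply NumPy’s default.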
Therefore, in the MNIST training set, mnist.train.images is a tensor of shape [60000, 784]. The first dimension indexes the images and the second indexes the pixels within each image. Each element of the tensor is the intensity of one pixel of one image, a value between 0 and 1.
The labels of the MNIST dataset are numbers between 0 and 9 describing the digit in a given image. For the purposes of this tutorial, we make the label data “one-hot vectors”. A one-hot vector is 0 in every dimension except one. So in this tutorial, the digit n is represented as a 10-dimensional vector that is 1 only in the nth dimension (counting from 0). For example, the label 0 is represented as [1,0,0,0,0,0,0,0,0,0]. Therefore, mnist.train.labels is a [60000, 10] matrix of floats.
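The one-hot encoding can be sketched with a small helper (hypothetical, for illustration only; input_data builds these vectors for you when one_hot=True):

```python
import numpy as np

def one_hot(n, num_classes=10):
    """Return a vector that is 1 in dimension n and 0 everywhere else."""
    v = np.zeros(num_classes, dtype=np.float32)
    v[n] = 1.0
    return v

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```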
Now we are ready to start building our model!
Softmax regression
(This paragraph is dry and I can’t explain it better, so I copied it more or less directly from the TensorFlow website. If you don’t want to read it, skip ahead to the model implementation; when writing the code, you only need to know how to use the softmax function.)
We know that each MNIST image represents a digit from zero to nine, and we want the probability that a given picture represents each digit. For example, our model might judge that a picture containing a 9 has an 80% probability of being a 9, a 5% probability of being an 8 (because both 8 and 9 have a small circle in the upper half), and even smaller probabilities of being the other digits.
This is a classic case for a softmax regression model. Softmax can be used to assign probabilities to different candidates, and even later, when we train more sophisticated models, the final step will still be a softmax layer assigning probabilities.
Softmax regression has two steps.

Step 1
To get the evidence that a given image belongs to a particular digit class, we take a weighted sum of the image’s pixel values. If a pixel is strong evidence that the image does not belong to the class, the corresponding weight is negative; conversely, if the pixel is evidence in favor, the weight is positive.
The following image shows the weights learned by a model for each pixel on the image for a particular numeric class. Red represents negative weights and blue represents positive weights.
We also need to add an extra bias, because the input often carries interference unrelated to the evidence. So for a given input picture x, the evidence that it represents the digit i can be expressed as

evidence_i = Σ_j W_{i,j} x_j + b_i

where W_{i,j} is the weight, b_i is the bias of class i, and j indexes the pixels of the given image x for the summation. This evidence can then be converted into probabilities y using the softmax function:

y = softmax(evidence)
The softmax here can be thought of as an activation or link function that converts the output of our linear function into the desired format: a probability distribution over the 10 digit classes. So, given a picture, its degree of match with each digit can be converted into a probability value by the softmax function, which can be defined as:

softmax(x) = normalize(exp(x))
Expanding the subformula on the right side of the equation, we get:

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
But the softmax function is more often written in the former form: exponentiate the inputs, then normalize the results. The exponentiation means that larger evidence corresponds to a multiplicatively larger weight in the hypothesis, and conversely, less evidence means a smaller multiplicative coefficient; no hypothesis ever gets a zero or negative weight. Softmax then normalizes these values so that they sum to 1, forming a valid probability distribution. (For more about the softmax function, see the corresponding section of Michael Nielsen’s book, which has an interactive visual explanation.)
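The definition above is easy to mirror in plain NumPy (a sketch for intuition; the tutorial itself just calls tf.nn.softmax):

```python
import numpy as np

def softmax(evidence):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting distribution.
    exps = np.exp(evidence - np.max(evidence))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs sums to 1 (up to floating-point error), and larger evidence gets a
# disproportionately larger share because of the exponentiation
print(probs)
```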
The softmax regression model can be illustrated by the following figure: the inputs xs are weighted and summed, a bias is added for each class, and the result is fed into the softmax function:
If we write this as an equation, we can get:
We can also express this computation with vectors: matrix multiplication and vector addition. This improves computational efficiency (and is a more useful way to think about it).
Further, it can be written more compactly:
Implementing the model
Before using TensorFlow, import it first:
import tensorflow as tf
Then import and load the dataset:
from tensorflow.examples.tutorials.mnist import input_data
# Load data
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
We describe these interacting operations by manipulating symbolic variables. We can create one as follows:
x = tf.placeholder(tf.float32, [None, 784])
x is not a specific value but a placeholder that we fill in when we ask TensorFlow to run a computation. We want to be able to input any number of MNIST images, each flattened into a 784-dimensional vector, so we represent them as a 2-dimensional floating-point tensor of shape [None, 784]. (None means the first dimension can be of any length.)
Our model also needs weights and biases. We could treat them as additional inputs (placeholders), but TensorFlow has a better way to represent them: Variable. A Variable is a modifiable tensor that lives in TensorFlow’s graph of interacting operations. It can be used in computations and modified by them. For machine learning applications, the model parameters are generally represented as Variables.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
We create variables by giving tf.Variable different initial values: here we initialize both W and b with all-zero tensors. Since we are going to learn the values of W and b, their initial values can be set fairly arbitrarily.
Note that W has shape [784, 10] because we want to multiply the 784-dimensional image vector by it to produce a 10-dimensional vector of evidence values, one per digit class. b has shape [10], so we can add it directly to the output.
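The shape arithmetic can be verified outside TensorFlow; here is the same computation sketched in NumPy with a hypothetical batch of 5 images:

```python
import numpy as np

batch = np.zeros((5, 784), dtype=np.float32)  # 5 flattened images
W = np.zeros((784, 10), dtype=np.float32)     # weights
b = np.zeros((10,), dtype=np.float32)         # biases

# [5, 784] @ [784, 10] -> [5, 10]; b broadcasts across the batch dimension
evidence = batch @ W + b
print(evidence.shape)  # (5, 10)
```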
Now we can implement our model. It only takes one line of code!
prediction = tf.nn.softmax(tf.matmul(x, W)+b)
First, tf.matmul(x, W) expresses x multiplied by W, corresponding to Wx in the earlier equation, where x is a 2-dimensional tensor holding multiple inputs. We then add b and feed the sum into the tf.nn.softmax function.
Training model
To train our model, we first need to define a metric to evaluate how good the model is. Actually, in machine learning we usually define a metric for how bad a model is, called the cost or loss, and then try to minimize it; the two approaches are equivalent. There are many kinds of loss functions. Here we use the squared loss, with the mean squared error (MSE) as the metric:

MSE = (1/n) Σ_i (y_i − ŷ_i)²
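Concretely, for a single one-hot label and a predicted distribution, the MSE is computed like this (NumPy sketch with made-up numbers):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])  # one-hot label
y_pred = np.array([0.8, 0.1, 0.1])  # model's predicted probabilities

# Mean of the squared element-wise differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # ~0.02: (0.2^2 + 0.1^2 + 0.1^2) / 3
```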
To calculate the loss function, we first need to add a new placeholder for entering the correct value:
y = tf.placeholder(tf.float32, [None, 10])
Then define the loss function (loss):
# Quadratic cost function
loss = tf.reduce_mean(tf.square(y - prediction))
(See also: TensorFlow (2) – Using TensorFlow to train a simple univariate linear model.)
Then an optimization algorithm is used to continually adjust the variables to reduce the loss:
# Use gradient descent
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
Here we ask TensorFlow to minimize the loss using the gradient descent algorithm with a learning rate of 0.1. Gradient descent is a simple procedure: TensorFlow just shifts each variable a little in the direction that reduces the cost. Of course, TensorFlow also offers many other optimization algorithms; you can switch to another by adjusting a single line of code.
What TensorFlow actually does here is add, behind the scenes, a new set of operations to the graph describing your computation, implementing backpropagation and gradient descent. What it hands back to you is a single operation that, when run, trains your model one step with gradient descent, fine-tuning your variables and steadily reducing the cost.
Now, we have set up our model. Before running the calculation, we need to add an operation to initialize the variable we created:
# Initialize variables
init = tf.global_variables_initializer()
Now we can define a session and initialize the variable in the session:
with tf.Session() as sess:
    sess.run(init)
Then we begin training the model. We need to define a batch size, batch_size, because during training we cannot feed the network just one image at a time (that would be too slow). A batch size of 100 means we feed 100 images into the network at once. We also need to compute how many batches there are in total:
# Size of each batch
batch_size = 100
# Count how many batches there are
n_batch = mnist.train.num_examples // batch_size
Then we train the model for 30 epochs:
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(30):
        for batch in range(n_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step, feed_dict={x: batch_xs, y: batch_ys})
At each step of the loop, we randomly grab batch_size data points from the training set, then run train_step, feeding these points in place of the placeholders.
Training with small random samples of data is called stochastic training, in this case stochastic gradient descent. Ideally we would use all of the data at every step of training, since that gives better results, but it is computationally expensive. So instead we use a different subset each time: this keeps the computational cost down while still learning the overall characteristics of the dataset.
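Stripped of TensorFlow, the batching logic amounts to slicing the dataset into fixed-size chunks (a pure-Python sketch with a toy dataset of 10 examples; note that mnist.train.next_batch additionally shuffles, which this sketch omits):

```python
data = list(range(10))  # toy "dataset"
batch_size = 3
n_batch = len(data) // batch_size  # 3 full batches; the remainder is dropped

batches = [data[i * batch_size:(i + 1) * batch_size] for i in range(n_batch)]
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```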
Evaluate our model
So how does our model perform?
First, let’s find the labels we predicted correctly. tf.argmax is a very useful function that gives the index of the largest value of a tensor along a given axis. Since a label vector consists of 0s and a single 1, the index of the 1 is the class label. For example, tf.argmax(prediction, 1) is the label our model predicts for each input, while tf.argmax(y, 1) is the correct label. We can use tf.equal to check whether the prediction matches the true label (the index positions are equal).
# The results are stored in a Boolean list
# argmax returns the index of the largest value along the given axis
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(prediction, 1))
This line of code gives us a list of booleans. To determine the fraction of correct predictions, we convert the booleans to floating-point numbers and take the mean. For example, [True, False, True, True] becomes [1, 0, 1, 1], whose mean is 0.75.
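The boolean-to-float conversion is easy to check in NumPy (illustration only; the tutorial does it with tf.cast and tf.reduce_mean):

```python
import numpy as np

correct = np.array([True, False, True, True])
accuracy = np.mean(correct.astype(np.float32))
print(accuracy)  # 0.75
```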
# Compute the accuracy
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Finally, we calculate the accuracy of the learned model on the test data set.
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(30):
        for batch in range(n_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step, feed_dict={x: batch_xs, y: batch_ys})
        acc = sess.run(accuracy, feed_dict={
            x: mnist.test.images, y: mnist.test.labels})
        print("Iter " + str(epoch) + ", Testing Accuracy " + str(acc))
The final result is shown below, with an accuracy of about 90%.
The complete code
I added the datetime package to measure the code’s execution time; it does not affect reading the rest.
import datetime
# 3.2 Simple version of MNIST dataset classification
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

start = datetime.datetime.now()
# Load data
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
# Size of each batch
batch_size = 100
# Count how many batches there are
n_batch = mnist.train.num_examples // batch_size
# Define two placeholders
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
# Create a simple neural network
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
prediction = tf.nn.softmax(tf.matmul(x, W) + b)
# Quadratic cost function
loss = tf.reduce_mean(tf.square(y - prediction))
# Use gradient descent
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# Initialize variables
init = tf.global_variables_initializer()
# The results are stored in a Boolean list
# argmax returns the index of the largest value along the given axis
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(prediction, 1))
# Compute the accuracy
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(30):
        for batch in range(n_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step, feed_dict={x: batch_xs, y: batch_ys})
        acc = sess.run(accuracy, feed_dict={
            x: mnist.test.images, y: mnist.test.labels})
        print("Iter " + str(epoch) + ", Testing Accuracy " + str(acc))

end = datetime.datetime.now()
print((end - start).seconds)