A note up front: the complete code can be obtained by following my official account [thumb notes] and replying to "softmax_py" in the background.

Recognition results: [figure: sample images with their predicted labels]

1. Softmax regression

This chapter is divided into four parts: the concept of the softmax regression model, the image classification dataset, the from-scratch implementation of the softmax regression model, and the implementation of the softmax regression model with the PyTorch framework.

For discrete-value prediction problems, we can use a classification model such as softmax regression. A softmax regression model has multiple output units. This chapter takes softmax regression as an example to introduce classification models in neural networks.

1.1 Classification Problems

Consider a simple image classification problem: the input images are 2 pixels high and 2 pixels wide, in grayscale (each pixel value of a grayscale image can be represented by a scalar). Let's call the four pixel values of an image x1, x2, x3, x4. Assume that the true labels of the images in the training set are dog, cat, and chicken, and that these labels correspond to the discrete values y1, y2, and y3 respectively.

We usually use discrete values to represent categories, such as y1=1, y2=2, y3=3, so an image is labeled with one of the values 1, 2, and 3. For this kind of problem, we generally use a model better suited to discrete outputs to solve the classification problem.

1.2 Softmax regression model

The softmax regression model computes a linear combination of the input features and the weights. The main difference from linear regression is that the number of output values of softmax regression equals the number of categories in the label set.

In the example above, each image has four pixels, so each image has four feature values (x). There are three possible animal categories, so there are three discrete-value outputs (o). Hence there are 12 weights (w) and 3 biases (b), and each output is computed as

$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1 \\
o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2 \\
o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3
\end{aligned}
$$
Softmax regression is also a single-layer neural network, and since the computation of each output $o_i$ depends on all of the inputs $x_1, \ldots, x_4$, the output layer of softmax regression is also a fully connected layer.

Generally, the output value $o_i$ is taken as the confidence that the image belongs to category $i$, and the category whose output is largest is taken as the prediction, i.e.

$$\underset{i}{\arg\max}\; o_i$$
For example, if $o_1, o_2, o_3$ are 0.1, 10, 0.1, then since $o_2$ is the largest, the predicted category is 2.

But there are two problems with this approach:

  1. The range of the output layer's values is uncertain, so it is hard to interpret the values on their own.

    For example, if the three output values are 0.1, 10, 0.1, then the value 10 signals high confidence; but if the three values are 1000, 10, 1000, the same 10 now signals low confidence.

  2. Since real labels are also discrete values, the error between these discrete values and the output values in an uncertain range is difficult to measure.

The softmax operator solves both of these problems. It converts the output values into a probability distribution with positive entries summing to 1:

$$\hat{y}_1, \hat{y}_2, \hat{y}_3 = \text{softmax}(o_1, o_2, o_3)$$
where

$$\hat{y}_i = \frac{\exp(o_i)}{\sum_{j=1}^{3} \exp(o_j)}, \qquad i = 1, 2, 3$$
It is easy to see that

$$\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1 \qquad \text{and} \qquad 0 \le \hat{y}_1, \hat{y}_2, \hat{y}_3 \le 1$$
Based on these two equations, $\hat{y}_1$, $\hat{y}_2$ and $\hat{y}_3$ form a valid probability distribution. For example, if $\hat{y}_2 = 0.8$, then whatever $\hat{y}_1$ and $\hat{y}_3$ are, we know that the probability of the second category is 80%.

Furthermore, since

$$\underset{i}{\arg\max}\; o_i = \underset{i}{\arg\max}\; \hat{y}_i$$
the softmax operation does not change the predicted category.
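As a quick sanity check, here is a small sketch (plain PyTorch; the variable names are mine) that applies the softmax formula to the example values above and confirms that the argmax is unchanged:

```python
import torch

o = torch.tensor([0.1, 10.0, 0.1])      # the example outputs o1, o2, o3
y_hat = o.exp() / o.exp().sum()         # softmax: exp(o_i) / sum_j exp(o_j)

print(y_hat)                            # roughly [5.0e-05, 0.9999, 5.0e-05]
print(y_hat.sum())                      # 1.0 -- a valid probability distribution
print(o.argmax(), y_hat.argmax())       # both 1: the predicted category is unchanged
```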

1.3 Vector calculation expression for single-sample classification

To improve computational efficiency, vector calculations are used. For the image classification problem above, the weight and bias parameters in vector form are

$$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix}$$
Let the features of image sample $i$, whose height and width are both 2 pixels, be

$$x^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix}$$
The output of the output layer is

$$o^{(i)} = \begin{bmatrix} o_1^{(i)} & o_2^{(i)} & o_3^{(i)} \end{bmatrix}$$
The predicted probability distribution is

$$\hat{y}^{(i)} = \begin{bmatrix} \hat{y}_1^{(i)} & \hat{y}_2^{(i)} & \hat{y}_3^{(i)} \end{bmatrix}$$
Finally, the vector expression for classifying sample $i$ with softmax regression is

$$o^{(i)} = x^{(i)} W + b, \qquad \hat{y}^{(i)} = \text{softmax}\left(o^{(i)}\right)$$
For a given mini-batch of $n$ samples, this becomes

$$O = XW + b, \qquad \hat{Y} = \text{softmax}(O)$$

where $X \in \mathbb{R}^{n \times 4}$, $O, \hat{Y} \in \mathbb{R}^{n \times 3}$, the softmax is applied row by row, and the bias $b$ is broadcast during the addition.
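A minimal sketch of these two equations with random numbers (the shapes follow the example above: $n$ samples, 4 features, 3 categories; all variable names are mine):

```python
import torch

n = 5                                   # mini-batch size
X = torch.rand(n, 4)                    # n samples with 4 features each
W = torch.rand(4, 3)                    # the 12 weights
b = torch.rand(1, 3)                    # the 3 biases, broadcast over the batch

O = X.mm(W) + b                         # O = XW + b, shape (n, 3)
Y_hat = O.exp() / O.exp().sum(dim=1, keepdim=True)   # row-wise softmax

print(Y_hat.sum(dim=1))                 # every row sums to 1
```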
1.4 Cross entropy loss function

After the softmax computation, it is more convenient to measure the error against discrete labels. The true label can also be expressed as a valid probability distribution: for a sample (an image) whose true category is $y_i$, set the $y_i$-th element to 1 and the rest to 0. If the image is a cat (the second category), then $y = [0\ 1\ 0]$. Training should make $\hat{y}$ as close to $y$ as possible.
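For illustration, the conversion from a discrete label to such a one-hot distribution can be sketched as follows (this uses torch.nn.functional.one_hot, available in recent PyTorch versions; older versions would use scatter_ instead):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1])                 # true category: cat (the second class)
print(F.one_hot(y, num_classes=3))    # tensor([[0, 1, 0]])
```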

In an image classification problem, to predict the correct result we do not need the predicted probabilities to exactly equal the label probabilities (cats come in different poses and colors); we only need the probability of the true category to be larger than the probabilities of the other categories. So there is no need to use the squared loss function from the linear regression model.

We use the cross-entropy function to compute the loss:

$$H\left(y^{(i)}, \hat{y}^{(i)}\right) = -\sum_{j=1}^{q} y_j^{(i)} \log \hat{y}_j^{(i)}$$

where $q$ is the number of categories.
In this formula, $y_j^{(i)}$ is the $j$-th element of the true label distribution (1 for the true category, 0 elsewhere), and $\hat{y}_j^{(i)}$ is the predicted probability of category $j$.

Since $y^{(i)}$ contains only one label, all elements of $y^{(i)}$ other than the one at the true category are 0, which simplifies the formula to

$$H\left(y^{(i)}, \hat{y}^{(i)}\right) = -\log \hat{y}_{y^{(i)}}^{(i)}$$
In other words, the cross-entropy loss depends only on the predicted probability of the true category. As long as that probability is large enough, the classification result will be correct.

For the whole training set of $n$ samples, the cross-entropy loss function is defined as

$$\ell(\Theta) = \frac{1}{n} \sum_{i=1}^{n} H\left(y^{(i)}, \hat{y}^{(i)}\right)$$
where $\Theta$ represents the model parameters. If every sample has exactly one label, the formula simplifies to

$$\ell(\Theta) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{y}_{y^{(i)}}^{(i)}$$
Minimizing the cross-entropy loss function is equivalent to maximizing the joint predicted probability of all the true labels in the training set.
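The simplified loss can be written directly with gather, which picks out the predicted probability of each sample's true class. A minimal from-scratch sketch (variable names are mine; the PyTorch implementation used later combines this with softmax):

```python
import torch

# Predicted distributions for 2 samples over 3 classes
y_hat = torch.tensor([[0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
y = torch.tensor([1, 2])              # true categories

def cross_entropy(y_hat, y):
    # -log of the predicted probability of each true class, averaged over samples
    return -torch.log(y_hat.gather(1, y.view(-1, 1))).mean()

print(cross_entropy(y_hat, y))        # -(log 0.8 + log 0.4) / 2 ≈ 0.5697
```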

2. The image classification dataset (Fashion-MNIST)

This chapter requires the torchvision package, which I reinstalled for this purpose.

This dataset is the image classification dataset that we will use repeatedly in later studies. Its images are more complex than those of MNIST, the handwritten digit recognition dataset, which makes it easier to observe the differences between algorithms directly.

This section uses the torchvision package, which is mainly used for building computer vision models.

The main components of the torchvision package:

- torchvision.datasets: functions for loading data and interfaces to common datasets
- torchvision.models: common model architectures (including pre-trained models)
- torchvision.transforms: common image transformations (cropping, rotation, etc.)
- torchvision.utils: other utilities

2.1 Obtaining data sets

Import the required packages first

```python
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import time
import sys
sys.path.append("..")   # so that the d2l library can be found on the search path
# import d2lzh_Pytorch as d2l
from IPython import display
# In this section the d2l library would only be used for drawing, so IPython's display is used instead
```

**Download the dataset by calling torchvision.datasets.** The first call automatically downloads the data from the web.

The training set or the test set is selected with the train parameter (the test set is used to evaluate model performance and is not used for training).

Setting transform=transforms.ToTensor() converts all the data into Tensors; without this transform, a PIL image is returned instead.

transforms.ToTensor() converts a PIL image of shape (H×W×C) with pixel values in [0, 255], or a NumPy array of dtype np.uint8, into a Tensor of shape (C×H×W) with dtype torch.float32 and values in [0.0, 1.0].

C is the number of channels; a grayscale image has 1 channel.

PIL (Python Imaging Library) is the standard Python library for handling images.

Note: transforms.ToTensor() assumes by default that the input is of type uint8.
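A small sketch of this conversion (the 2×2 array here is made up purely for illustration):

```python
import numpy as np
import torchvision.transforms as transforms

img = np.array([[0, 255], [128, 64]], dtype=np.uint8)[:, :, None]   # shape (2, 2, 1): H x W x C
tensor = transforms.ToTensor()(img)

print(tensor.shape)   # torch.Size([1, 2, 2]) -- channel dimension first
print(tensor.dtype)   # torch.float32
print(tensor)         # values scaled into [0.0, 1.0], e.g. 255 -> 1.0
```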

```python
# Get the training set
mnist_train = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=True, download=True, transform=transforms.ToTensor())
# Get the test set (note train=False here)
mnist_test = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=False, download=True, transform=transforms.ToTensor())
```

Both mnist_train and mnist_test support len() to get the size of the dataset, and indexing to access a specific sample.

Both the training set and the test set contain 10 categories. Each category has 6,000 images in the training set and 1,000 images in the test set, i.e. 60,000 training samples and 10,000 test samples in total.

```python
len(mnist_train)   # the number of samples in the training set
mnist_train[0]     # index any sample; returns a tuple of a feature Tensor and a label
```
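The per-class counts quoted above can be verified with a quick sketch (recent torchvision versions expose the labels as a targets tensor; the attribute name is an assumption about your installed version):

```python
from collections import Counter

print(Counter(mnist_train.targets.tolist()))   # expect 6000 samples per class
print(len(mnist_train), len(mnist_test))       # expect 60000 and 10000
```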

There are ten categories in the Fashion-MNIST dataset: t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot.

To convert between these numeric and text labels, you can use the following function.

```python
def get_fashion_mnist_labels(labels):
    # labels is a list of numeric labels
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]   # convert numeric labels to text labels
```

Here is a function that draws multiple images and their labels in a single row.

```python
def show_fashion_mnist(images, labels):
    display.set_matplotlib_formats('svg')
    # Draw vector (SVG) figures; equivalent to d2l.use_svg_display()
    _, figs = plt.subplots(1, len(images), figsize=(12, 12))
    # Create a single row of len(images) subplots, 12 by 12 inches overall
    for f, img, lbl in zip(figs, images, labels):
        # zip combines the subplots, images and labels into tuples
        f.imshow(img.view((28, 28)).numpy())
        # Reshape img to 28*28, then convert it to a NumPy array
        f.set_title(lbl)
        # Set the title of each subplot to its label
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
        # Hide the x-axis and y-axis
    plt.show()
```

Using the functions above:

```python
X, y = [], []
# Initialize two lists
for i in range(10):
    X.append(mnist_train[i][0])   # add each image to the X list
    y.append(mnist_train[i][1])   # add each label to the y list
show_fashion_mnist(X, get_fashion_mnist_labels(y))
# Display the images with their labels
```

2.2 Reading Small Batches

With our experience reading mini-batches in linear regression, we know that mini-batch reading can be done with the built-in DataLoader class in torch.utils.data.

DataLoader also supports reading data with multiple worker processes via its num_workers parameter.

```python
batch_size = 256
# mini-batch size
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=0)
# num_workers=0: do not use extra worker processes to read the data
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=0)
```

3. Use PyTorch to implement the softmax regression model

Using PyTorch makes implementing the softmax regression model much easier.

3.1 Obtaining and reading data

Methods for reading small batches of data:

  1. The first step is to obtain the data. PyTorch can easily get the Fashion-MNIST dataset with the following code.

    ```python
    mnist_train = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=True, download=True, transform=transforms.ToTensor())

    mnist_test = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=False, download=True, transform=transforms.ToTensor())

    # Parameters:
    # root: home directory for processed/training.pt and processed/test.pt
    # train: True = training set, False = test set
    # download: True = download the dataset from the internet and place it under root;
    #   if it was downloaded before, the processed data is placed in the processed folder
    # transform=transforms.ToTensor(): convert all the data into Tensors
    ```
  2. The next step is to generate an iterator that reads the data

    ```python
    # Generate the iterators
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=0)

    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=0)

    # Parameters:
    # dataset: the dataset from which to load the data
    # batch_size: int, how many samples are loaded per batch
    # shuffle: bool, whether the data is reshuffled every epoch
    # num_workers: int, how many subprocesses are used to load data (default 0)
    # collate_fn: defines how samples are merged into a batch; you can pass your own function
    # pin_memory: use pinned (page-locked) memory
    # drop_last: bool, True = drop the last incomplete batch, False = keep it
    ```

3.2 Define and initialize the model

According to its definition, the softmax regression model has only weight parameters and bias parameters, so we can use the Linear module from PyTorch's neural network module torch.nn.


  1. Define the network first. Softmax regression is a two-layer network, so we only need to define the input layer and the output layer.
```python
from torch import nn
from torch.nn import init

num_inputs = 784
num_outputs = 10

class LinearNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(LinearNet, self).__init__()
        self.linear = nn.Linear(num_inputs, num_outputs)
        # Define the linear (output) layer

    # Define forward propagation (in this two-layer network, it is also the output layer)
    def forward(self, x):
        y = self.linear(x.view(x.shape[0], -1))
        # Flatten each image into a vector, then apply the linear layer
        return y

net = LinearNet(num_inputs, num_outputs)
```
  2. Initialize the parameters

init in torch.nn can be used to quickly initialize parameters. We initialize the weight parameters from a normal distribution with mean 0 and standard deviation 0.01, and set the biases to 0.

```python
init.normal_(net.linear.weight, mean=0, std=0.01)
init.constant_(net.linear.bias, val=0)
```

3.3 Softmax operation and cross entropy loss function

Defining the softmax operation and the cross-entropy loss function separately can cause numerical instability. PyTorch therefore provides a single function with good numerical stability that combines the softmax computation and the cross-entropy calculation.

```python
loss = nn.CrossEntropyLoss()
```
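To see the kind of instability this avoids, consider a naive softmax on large outputs: exp overflows, while shifting by the row maximum (the log-sum-exp trick that PyTorch applies internally) stays finite. A minimal sketch, purely illustrative:

```python
import torch

o = torch.tensor([[1000.0, 10.0, 1000.0]])
naive = o.exp() / o.exp().sum(dim=1, keepdim=True)
print(naive)    # tensor([[nan, 0., nan]]) -- exp(1000) overflows to inf

# Subtracting the row maximum first gives the same mathematical result, safely
stable = (o - o.max(dim=1, keepdim=True)[0]).softmax(dim=1)
print(stable)   # tensor([[0.5, 0.0, 0.5]])
```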

3.4 Define the optimization algorithm

Mini-batch stochastic gradient descent is again used as the optimization algorithm, with the learning rate set to 0.1.

```python
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
```

3.5 Calculate classification accuracy

Principle of the accuracy calculation:

We take the category with the highest predicted probability as the output category. If it matches the true category y, the prediction is correct. Classification accuracy is the ratio of the number of correct predictions to the total number of predictions.

First we need to get the predicted results.

Find the index corresponding to the highest probability in a set of predicted probabilities (the variable y_hat):

```python
# argmax(f(x)) returns the point x at which f(x) is maximized;
# with dim=1, argmax finds the index of the maximum value in each row
A = y_hat.argmax(dim=1)
# The result is a vector with the same number of rows as y_hat
```

Then we compare the categories with the highest probability against the true categories (y) to determine whether each prediction is correct:

```python
B = (y_hat.argmax(dim=1) == y).float()
# The comparison yields a 0/1 tensor; .float() converts it to a floating-point tensor
```

Finally, we calculate the classification accuracy.

Since the number of rows of y_hat equals the total number of samples, the mean of B is the classification accuracy:

```python
(y_hat.argmax(dim=1) == y).float().mean()
```

The result of the previous step has the form tensor(x), so one more step is needed to get a plain Python number:

```python
(y_hat.argmax(dim=1) == y).float().mean().item()
# .item() extracts a Python number from the tensor
```

This gives the classification accuracy function:

```python
def accuracy(y_hat, y):
    return (y_hat.argmax(dim=1) == y).float().mean().item()
```

As a generalization, we can also write a function that evaluates the accuracy of a model net on a dataset data_iter.

```python
def net_accuracy(data_iter, net):
    right_sum, n = 0.0, 0
    for X, y in data_iter:
        # Draw X and y from the iterator data_iter
        right_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        # Count the number of correct predictions
        n += y.shape[0]
        # shape[0] gives the number of samples in this batch
    return right_sum / n
```

3.6 Training model

```python
num_epochs = 5
# Train for five epochs

def train_softmax(net, train_iter, test_iter, loss, num_epochs, batch_size, optimizer, net_accuracy):
    for epoch in range(num_epochs):
        # Initialize the loss sum, the number of correct predictions, and the sample count
        train_l_sum, train_right_sum, n = 0.0, 0.0, 0

        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            # The loss over the batch (nn.CrossEntropyLoss averages over the batch
            # by default, so .sum() on the resulting scalar is a no-op)
            optimizer.zero_grad()   # Clear the gradients held by the optimizer
            l.backward()            # Compute the gradients of the loss
            optimizer.step()        # Update the parameters

            train_l_sum += l.item()
            train_right_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]

        test_acc = net_accuracy(test_iter, net)   # Accuracy on the test set
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f' % (epoch + 1, train_l_sum / n, train_right_sum / n, test_acc))

train_softmax(net, train_iter, test_iter, loss, num_epochs, batch_size, optimizer, net_accuracy)
```

Training results: [screenshot of the per-epoch loss and accuracy omitted]

3.7 Image Classification

Use the trained model to predict on the test set.

The ultimate goal of building a model is not training but prediction, so let's try classifying some images.

# Convert the sample's category number to text
def get_Fashion_MNIST_labels(labels):
    text_labels = ['t-shirt'.'trouser'.'pullover'.'dress'.'coat'.'sandal'.'shirt'.'sneaker'.'bag'.'ankle boot']
    return [text_labels[int(i)] for i in labels]
    #labels is a list, so we have a for loop to retrieve the list of text

# display image
def show_fashion_mnist(images,labels):
    display.set_matplotlib_formats('svg')
    # Draw vector diagrams
    _,figs = plt.subplots(1,len(images),figsize=(12.12))
    # set the number and size of subgraphs to be added
    for f,img,lbl in zip(figs,images,labels):
        f.imshow(img.view(28.28).numpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
    plt.show()

Get samples and tags from test sets
X, y = iter(test_iter).next()

true_labels = get_Fashion_MNIST_labels(y.numpy())
pred_labels = get_Fashion_MNIST_labels(net(X).argmax(dim=1).numpy())

# Add real tags and predicted tags to the image
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]

show_fashion_mnist(X[0:9], titles[0:9])

Copy the code

Recognition results: [screenshot omitted]

In each title, the first line is the true label and the second line is the predicted label.