We know that linear regression is generally used to solve regression problems, such as housing price prediction or temperature prediction. In fact, with a technique like Softmax, we can also use linear regression to solve multi-class classification problems. Softmax is a transformation applied to the output layer of the network; its schematic diagram is as follows:

Softmax technical details

In the figure above, x1, x2 and x3 form the input layer, and two linear regression models generate y1 and y2 from them respectively (the weight and bias names below are generic placeholders for the ones in the figure):

y1 = x1·w11 + x2·w21 + x3·w31 + b1
y2 = x1·w12 + x2·w22 + x3·w32 + b2
Then y1 and y2 are fed into the Softmax module, which outputs y1′ and y2′. After the Softmax processing, the result is a discrete probability distribution, that is, y1′ + y2′ = 1. Thus, we can use y1′ and y2′ to represent the predicted probabilities of the different categories.

The details of Softmax are as follows:

  1. First, raise e to the power of each output: z1 = e^y1, z2 = e^y2.
  2. Sum the results of step 1: sum = z1 + z2.
  3. Finally, divide each result of step 1 by sum: y1′ = z1 / sum, y2′ = z2 / sum.

Because it raises e to the power of each output, Softmax amplifies larger numbers: the larger outputs end up with most of the probability without completely suppressing the smaller ones, which is why it is called Softmax instead of Max.
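
As a quick illustration of this amplification effect, here is a minimal sketch in Python (the numbers are purely illustrative):

import numpy as np

y = np.array([1.0, 2.0, 3.0])   # raw outputs of the linear models
z = np.exp(y)                   # step 1: raise e to the power of each output
probs = z / z.sum()             # steps 2 and 3: normalize by the sum
print(probs)                    # ≈ [0.09, 0.245, 0.665]

A gap of only 1 between the raw outputs already gives the largest output roughly seven times the probability of the smallest one.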

Loss function for Softmax classification

Softmax classification uses a cross entropy loss function, which on closer inspection is really log likelihood (see the related article "Deep understanding of logistic regression algorithms"); in both views the goal is to maximize the predicted probability of the correct class.

In the figure above, y1′, y2′ and y3′ are the predicted probabilities of the different categories, i.e. the outputs of Softmax; y1, y2 and y3 are the real class labels. In a classification task, exactly one of these three numbers is 1 and the other two are 0, so y1, y2 and y3 also form a probability distribution.

So our machine learning task becomes an optimization problem: make the predicted probability distribution get closer and closer to the real probability distribution. We use cross entropy to measure the difference between two probability distributions: the smaller the cross entropy, the closer the two distributions are. For one sample, the cross entropy is expressed as follows:

H(y, y′) = −Σ_j y_j · log(y′_j)
In the above formula, q is the number of categories, the sum runs over j = 1 … q, y_j is the real probability of the j-th category and y′_j is the predicted one. Adding up the cross entropy of all samples gives the loss function we want:

Loss = Σ_i H(y(i), y′(i)), where i runs over the n samples
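
For intuition, suppose a batch of two samples whose predicted probabilities for their correct classes are 0.2 and 0.9. Because the real label vectors are one-hot, only those two terms survive, and the loss is

Loss = −log(0.2) − log(0.9) ≈ 1.609 + 0.105 ≈ 1.71

(the same numbers show up again in the cross_entropy example further below).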


As I said, cross entropy is the same thing as log likelihood, and you can think of it in those terms. In addition, you can also understand it from the perspective of information theory:

In information theory, the entropy H(p) = −Σ_x p(x) · log p(x) describes the expected amount of information contained in a system, where −log p(x) is the self-information: the amount of information produced when an event with probability p(x) occurs. Self-information can be understood as follows: low-probability events produce a large amount of information, while high-probability events produce little.

The cross entropy H(p, q) = −Σ_x p(x) · log q(x) describes the effort required to eliminate the uncertainty of a system whose real distribution is p when we describe it with the distribution q. When the two distributions are equal, the effort is the smallest and the cross entropy reduces to the entropy H(p). That is why machine learning uses cross entropy as a loss function.
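
A minimal numeric check of this property (the distributions are purely illustrative):

import numpy as np

def ce(p, q):
    # cross entropy between two discrete distributions
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])            # the real distribution
print(ce(p, p))                          # ≈ 0.80, equal to the entropy H(p)
print(ce(p, np.array([0.1, 0.2, 0.7])))  # ≈ 1.97, a poor approximation costs more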

See the reference "How to understand cross entropy and relative entropy" for a more detailed explanation.

Implement Softmax from scratch

Now we use TF2.0 to implement Softmax from scratch, in three steps:

  1. Define the Softmax multi-classification model
  2. Define the loss function
  3. Train and evaluate the model

Define Softmax multi-classification model

The linear-regression-based Softmax model mainly does two things:

1. Multiply the input by the parameter matrix. The number of rows of the input data is the number of samples n, and each row holds the features of one sample. If the number of features is d, the shape of the input data is n*d, so that multiple samples can be processed at once (batch computation). The parameter matrix has shape d*q, where d is again the number of features and q is the number of categories, so the product has shape n*q, i.e. the output of each of the n samples for each of the q categories.

2. Apply Softmax to the result of the first step to get the predicted probability of each sample for each category.

def net(X):
    """Linear model followed by Softmax.
    Args: X, an n*d matrix (n samples with d features each); uses the global parameters W (a d*q matrix, q = number of categories) and b (a bias vector of length q).
    Returns: the n*q multi-classification result after Softmax."""
    return softmax(tf.matmul(X, W) + b)

def softmax(y):
    """Apply Softmax to an n*q matrix y (n samples, q categories); returns an n*q matrix whose rows sum to 1.
    Example:
    >>> softmax(np.array([[0.1, 0.2, 0.8], [0.8, 0.2, 0.1]]))
    <tf.Tensor: shape=(2, 3), dtype=float64, numpy=array([[0.24278187, 0.26831547, 0.48890266],
                                                          [0.48890266, 0.26831547, 0.24278187]])>"""
    return tf.exp(y) / tf.reduce_sum(tf.exp(y), axis=1, keepdims=True)
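
One practical caveat: tf.exp can overflow when the raw outputs are large. A common trick, not used in the original code above, is to subtract the per-row maximum before exponentiating; the constant cancels in the division, so the result is unchanged. A sketch:

def stable_softmax(y):
    # subtracting the row-wise maximum does not change the softmax result,
    # but keeps the exponentials in a safe numeric range
    y = y - tf.reduce_max(y, axis=1, keepdims=True)
    return tf.exp(y) / tf.reduce_sum(tf.exp(y), axis=1, keepdims=True)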

Define the loss function

Looking at the formula of the cross entropy loss function, to implement it we first need to get the predicted probability corresponding to the correct class of each sample.

Here, one-hot encoding is used to transform the target vector into a matrix with the same shape as the prediction. For example, if the prediction is an n*q matrix (n samples predicted at a time, q categories), one-hot encoding turns the length-n target vector into an n*q matrix.

Then the one-hot target matrix is used as a mask on the prediction matrix (the "and operation"), so the predicted value corresponding to the correct class of each sample can be taken out (a step-by-step demo follows the function below);

Finally, take -log of each of those predicted values and sum them, which gives the loss of this batch of samples. The code is as follows:

def cross_entropy(y, y_hat):
    """Cross entropy loss.
    Args: y, the target values of n samples, an n*1 vector; y_hat, the predicted distributions (Softmax output), an n*q matrix.
    Returns the sum of -log(y_hat) over the correct classes of the n samples, e.g.
    cross_entropy(np.array([[2], [1]]), np.array([[0.1, 0.2, 0.2], [0.3, 0.9, 0.2]])) -> 1.7147984280919266"""
    y_obs = tf.one_hot(y, depth=y_hat.shape[-1])
    y_obs = tf.reshape(y_obs, y_hat.shape)
    # use the one-hot targets as a boolean mask to keep only the correct-class probabilities
    y_hat = tf.boolean_mask(y_hat, tf.cast(y_obs, dtype=tf.bool))
    return tf.reduce_sum(-tf.math.log(y_hat))
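
To make the intermediate steps concrete, here is a small walk-through with the same illustrative numbers:

import numpy as np
import tensorflow as tf

y = np.array([[2], [1]])                                   # correct classes of 2 samples
y_hat = np.array([[0.1, 0.2, 0.2], [0.3, 0.9, 0.2]])       # predicted distributions, shape 2*3

y_obs = tf.reshape(tf.one_hot(y, depth=3), (2, 3))         # [[0, 0, 1], [0, 1, 0]]
picked = tf.boolean_mask(y_hat, tf.cast(y_obs, tf.bool))   # [0.2, 0.9]
loss = tf.reduce_sum(-tf.math.log(picked))                 # -log(0.2) - log(0.9) ≈ 1.71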

Evaluate the model

This time we use accuracy to evaluate the effectiveness of the model, that is, the proportion of samples whose class is predicted correctly.

When evaluating the model, we first run the prediction, then compare the predicted result (the class with the highest probability) with the correct class, and count how many predictions are correct. The tf.argmax function used here picks, among the predicted categories, the one with the largest value as the prediction:

def accuracy(x, y, num_inputs, batch_size):
    """Args: x, the dataset features; y, the dataset targets, an n*1 vector; num_inputs, the feature dimension (number of inputs); batch_size, the batch size of each prediction."""
    test_iter = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)
    acc, n = 0.0, 0
    for X, y in test_iter:
        X = tf.reshape(X, (-1, num_inputs))
        y = tf.cast(y, dtype=tf.int64)
        # count the samples whose most probable class equals the true class
        acc += np.sum(tf.argmax(net(X), axis=1) == y)
        n += y.shape[0]
    return acc / n

Training

The above are the general methods needed for training, prediction and evaluation, so now we can train the model. This time we use the fashion_mnist dataset, in which each sample is a 28*28-pixel image and the target value (label) is the category number of the image. There are 10 categories in the dataset: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag and Ankle boot.
Our task is to read such an image and predict its category. The input of the model is the value of every pixel of the image: there are 28*28 = 784 pixels, each with a value from 0 to 255, so the input layer of our model has 784 nodes. Because there are 10 categories in the dataset, the parameter W of the model is a 784*10 matrix, the bias b is a vector of length 10, and the output layer also has 10 nodes.
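
As a quick sanity check of these shapes, you could load the dataset and print them (a sketch using the same Keras loader as the training code below):

from tensorflow.keras.datasets import fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
print(x_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)     # (10000, 28, 28) (10000,)
print(y_train[:5])                    # category numbers from 0 to 9, e.g. [9 0 0 3 0]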

Once the parameters are clear, what remains is the model iteration, which is roughly the same as the training code in the previous linear regression article; the code is as follows:

import tensorflow as tf
from tensorflow import keras
import numpy as np
import time
import sys
from tensorflow.keras.datasets import fashion_mnist

def train(W, b, lr, num_inputs):
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    # Data normalization
    x_train = tf.cast(x_train, tf.float32) / 255
    x_test = tf.cast(x_test, tf.float32) / 255
    
    batch_size = 256
    num_epochs = 5
    for i in range(num_epochs):
        # Small batch iteration
        train_iter = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
        loss_sum, n = 0.0, 0  # cumulative loss and number of samples in this epoch
        for X, y in train_iter:
            X = tf.reshape(X, (-1, num_inputs))
            y = tf.reshape(y, (-1, 1))
            # Calculate loss and gradient
            with tf.GradientTape() as tape:
                l = cross_entropy(y, net(X))
            grads = tape.gradient(l, [W, b])
            # Adjust parameters according to the gradient
            W.assign_sub(lr * grads[0])
            b.assign_sub(lr * grads[1])

            loss_sum += l.numpy() # accumulative loss
            n += y.shape[0] # Cumulative number of training samples

        print("epoch %s, loss %s, train accuracy %s, test accuracy %s" 
              % (i+1, loss_sum/n, 
                 accuracy(x_train, y_train, num_inputs, batch_size), 
                 accuracy(x_test, y_test, num_inputs, batch_size)))

num_inputs = 784
num_outputs = 10
lr = 0.001
# Initialize model parameters
W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs), 
                                 mean=0, stddev=0.01, dtype=tf.float32))
b = tf.Variable(tf.zeros(num_outputs, dtype=tf.float32))
train(W, b, lr, num_inputs)

The following is the output of the training. It can be seen that even a model as simple as linear regression reaches an accuracy of about 0.85 on the fashion_mnist image classification task.

epoch 1, loss 0.8956155544281006, train accuracy 0.82518, test accuracy 0.8144
epoch 2, loss 0.6048591234842936, train accuracy 0.83978, test accuracy 0.8272
epoch 3, loss 0.5516327695210774, train accuracy 0.84506, test accuracy 0.8306
epoch 4, loss 0.5295544961929322, train accuracy 0.84906, test accuracy 0.8343
epoch 5, loss 0.5141636388142904, train accuracy 0.85125, test accuracy 0.8348

Simple implementation

As usual, we’ll also look at a simple implementation of Softmax:

# Load the dataset
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train = x_train / 255
x_test = x_test / 255

# Config model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),  # input layer
    keras.layers.Dense(10, activation=tf.nn.softmax)  # Output layer, activation function is softmax
])
# Configure the cross entropy loss function
loss = 'sparse_categorical_crossentropy'
# Use SGD as the optimizer, with a learning rate of 0.1
optimizer = tf.keras.optimizers.SGD(0.1)
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['accuracy'])  # Use accuracy to evaluate the model

model.fit(x_train, y_train, epochs=5, batch_size=256)

It's still just configuration; you don't have to write a single line of training logic yourself.
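
If you also want to check the test accuracy of this Keras model, a quick evaluation (using the standard Keras API) looks like this:

test_loss, test_acc = model.evaluate(x_test, y_test, batch_size=256)
print(test_acc)   # should land in the same ballpark as the ~0.83-0.85 seen above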

Summary

In this article, we learned how to use linear regression together with Softmax to build a multi-classification model, ran an actual experiment on the fashion_mnist dataset, and obtained a decent accuracy of about 0.85. There are two details in this article that need to be mastered:

  1. Softmax implementation details
  2. Principle of cross entropy loss function

Reference:

  • Hands-on Deep Learning – Softmax Regression (zh.gluon.ai/)
  • Hands-on Deep Learning, TF2.0 version – Softmax regression
  • How to understand cross entropy and relative entropy

Related articles:

  • Deep understanding of logistic regression algorithms
  • Implementing linear regression with TF2.0