The wave of Artificial Intelligence is sweeping the world, and we hear terms like Artificial Intelligence, Machine Learning, and Deep Learning everywhere. This article organizes my notes on Li Hongyi's course content; the reference links are given at the end of the article.

3 – Regression

The definition of linear regression

Linear regression is defined as: the target value is expected to be a linear combination of the input variables. Simply put, it’s choosing a linear function that fits the known data well and predicts the unknown.

When a regression analysis involves only one independent variable and one dependent variable, and the relationship between them can be approximated by a straight line, it is called unary (simple) linear regression. When the analysis involves two or more independent variables and the dependent variable is linearly related to them, it is called multiple linear regression.

Example applications:

  • Stock Market Forecast
    • Input: stock movements over the past 10 years, news, company merger and acquisition information, etc.
    • Output: forecast of tomorrow's average stock market value
  • Self-driving Car
    • Input: data from each sensor of the self-driving car, such as road conditions and the measured distance between vehicles
    • Output: steering angle
  • Recommendation of products
    • Input: characteristics of commodity A, characteristics of commodity B
    • Output: the probability of buying item B
  • Combat Power (CP) of a Pokemon:
    • Input: pre-evolution CP value, species (e.g. Bulbasaur), health points (HP), weight, height
    • Output: CP value after evolution

Explanation of the modeling steps

  • Step 1: Model assumption - select a model framework (linear model)
  • Step 2: Model evaluation - how to judge the quality of the many candidate models (loss function)
  • Step 3: Model optimization - how to select the best model (gradient descent)

These three steps alone do not complete a machine learning task. The way to understand them is: select a model framework, then start training, and use the loss function to judge how good each candidate model is. To complete a full machine learning task, the following steps are needed:

  1. Data collection: This step is critical, because the quality and quantity of the collected data largely determine how accurate the prediction model can be.
  2. Data preparation: Once the data is collected, it needs to be loaded into the system and prepared for training. The collected data may contain useless or interfering items, so depending on the actual situation and requirements, features or samples are added or removed.
  3. Choosing the right model: Select a suitable model based on our understanding of the task and our experience.
  4. Training the model: Use the data to train the model and gradually improve its performance.
  5. Evaluating the model: Check whether the model has been trained effectively and can actually perform the task. The model is tested with data that was not present during training, to see how it responds to data it has not encountered before; this evaluation analyzes how well the model generalizes.
  6. Hyperparameter tuning: Check whether the trained model still has room for improvement by adjusting certain parameters (for example the learning rate, or the number of times the model is trained). There are many parameters to consider during training; for each one, you need to know what role it plays in training the model, or you may find yourself wasting time or making training take longer after tuning.
  7. Prediction: The final step. Once the steps above are done, the model can be used to make predictions.

Step 1: Model hypothesis – Linear model

Unary linear model (single feature)

Definition: Also called unary (simple) linear regression, this is a regression analysis with only one independent variable: a single input is used to predict a single output, and the input/output relationship is a linear function.

The linear model assumes y = b + w·x_cp, so different choices of w and b give many candidate models.

Multiple linear model (multiple features)

Definition: In a regression analysis, when there are two or more independent variables, it is called multiple regression. When the description of the sample involves multiple attributes, this type of problem is called multiple linear regression.

In practice, x_cp is certainly not the only input feature. For example, the Pokemon data in the course has many features, such as the pre-evolution CP value, species (Bulbasaur), HP, weight, and height.

Therefore, we assume the linear model:

y = b + \sum_i w_i x_i

  • x_i: the various features, e.g. x_cp, x_hp, x_w, x_h, …
  • w_i: the weight of each feature, e.g. w_cp, w_hp, w_w, w_h, …
  • b: the bias (offset)
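As a minimal sketch in Python (the feature values and weights below are invented for illustration, not taken from the course), the multi-feature linear model is just a dot product plus a bias:

import numpy as np

# Hypothetical feature vector for one Pokemon: [cp, hp, weight, height]
x = np.array([300.0, 45.0, 6.9, 0.7])
# Hypothetical weights w_i and bias b (in practice these are learned)
w = np.array([1.5, 0.2, 0.05, 0.1])
b = 10.0

# y = b + sum_i w_i * x_i
y = b + np.dot(w, x)
print(y)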

Step 2: Model evaluation – loss function

Definition: The loss function, also known as the cost function or error function, measures the deviation between the predicted value and the actual value. Generally speaking, every algorithm used in a machine learning task has an objective function, and the algorithm works by optimizing that objective function. In classification and regression tasks in particular, the loss function is used as the objective function. The goal of machine learning is a small deviation between the predicted and actual values, which means we want the loss function to be small, i.e. we minimize the loss function.

The loss function evaluates the degree of inconsistency between the model's predicted value ŷ = f(X) and the true value Y. It is a non-negative real-valued function, generally written L(Y, f(X)). The smaller the loss, the better the performance of the model.

The role of the loss function: to measure the quality of the model's predictions.

Loss functions commonly used in statistical learning are as follows:

  1. 0-1 loss function:


L(Y, f(X)) = \begin{cases} 1, & Y \ne f(X) \\ 0, & Y = f(X) \end{cases}

The meaning of this loss function is that when the prediction is wrong, the loss is 1, and when the prediction is correct, the loss is 0. It does not take into account how far the predicted value is from the true value: a prediction that is only slightly off and one that is far off are penalized the same.

  2. Quadratic loss function:


L(Y, f(X)) = (Y - f(X))^2

The meaning of this loss function is also very simple: take the square of the prediction gap, i.e. the squared difference between the actual value and the predicted value (summed over the samples). It is commonly used in linear regression and can be understood as the least squares method.

  3. Absolute loss function:


L(Y, f(X)) = |Y - f(X)|

This loss function is basically the same idea, except that you take the absolute value instead of the square, so the difference is not magnified by squaring.

  4. Logarithmic loss function (log-likelihood loss function):


L(Y, P(Y|X)) = -\log P(Y|X)

This loss function is a little harder to understand. It uses the idea of maximum likelihood estimation. In plain terms, P(Y|X) is: given the current model and a sample X, the probability that the model predicts the correct value Y. Since probabilities across samples are multiplied together, we take the logarithm to turn the product into a sum. Finally, since this is a loss function, the higher the probability of a correct prediction, the smaller the loss should be, so we add a negative sign to invert it.
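As a small illustrative sketch (not from the course material), the four loss functions above can be written directly in Python:

import numpy as np

def zero_one_loss(y, y_pred):
    # 1 when the prediction is wrong, 0 when it is correct
    return float(y != y_pred)

def quadratic_loss(y, y_pred):
    # squared difference between the true and predicted values
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    # absolute difference, not magnified by squaring
    return abs(y - y_pred)

def log_loss(p_correct):
    # p_correct is the model's probability of the correct label
    return -np.log(p_correct)

print(quadratic_loss(3.0, 2.5), absolute_loss(3.0, 2.5), log_loss(0.8))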

Step 3: Model optimization – gradient descent

Gradient descent is a kind of iterative method, which is one of the methods to solve linear and nonlinear least squares problems. It is often used to solve the minimum value of the loss function, and the minimum loss function and model parameter values are obtained step by step through the gradient descent method.

The update rule can be stated simply as: a_{n+1} = a_n - η·∇f(a_n), i.e. move from the current point in the direction opposite to the gradient.

η: the step size, also called the learning rate (lr). It controls how far we move in each step.

One common question: why multiply by the negative of the slope (-k)?

Answer: we are looking for the minimum of the loss, in other words the lowest point of the function. Thinking about it mathematically, differentiate at the current point: if the derivative is less than zero, the lowest point lies to the right of the current point; the derivative is negative, so multiplying by -k increases the value of w. If the derivative is greater than zero, the lowest point lies to the left of the current point; the derivative is positive, so multiplying by -k decreases the value of w.

  • Step 1: Pick a starting point at random
  • Step 2: Compute the derivative, i.e. the current slope, and determine the direction of movement from the slope
    • If the slope is greater than 0, move to the left.
    • If the slope is less than 0, move to the right.
  • Step 3: Move according to the learning rate
  • Repeat steps 2 and 3 until the lowest point is found

In practice, during learning we should gradually decrease the learning rate as the number of iterations increases.
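Here is a minimal sketch of this loop for a one-dimensional function (the function f and all the numbers are invented for illustration), including a simple learning-rate decay:

# Minimal 1-D gradient descent sketch (hypothetical function and settings)
def f(w):              # function to minimize: f(w) = (w - 3)^2
    return (w - 3.0) ** 2

def grad(w):           # its derivative: f'(w) = 2(w - 3)
    return 2.0 * (w - 3.0)

w = -5.0               # Step 1: start from an arbitrary point
lr0 = 0.1              # initial learning rate

for i in range(100):
    lr = lr0 / (1 + 0.01 * i)   # decay the learning rate as iterations increase
    w = w - lr * grad(w)        # Steps 2 and 3: move against the slope

print(w, f(w))         # w ends up close to 3, the minimum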

However, there is sometimes a problem: the point you converge to may not be the global optimum, but only a local optimum, as shown in the figure below:

In real tasks, people often use the following strategies to try to “jump out” of the local minimum to reach the global minimum:

  • Initialize multiple neural networks with several different sets of parameter values. After training each with the standard method, take the solution with the smallest error as the final parameters. This is equivalent to starting the search from several different initialization points, so as to have a better chance of finding the global optimum (a small sketch of this restart strategy follows this list).
  • Use simulated annealing: at each step, simulated annealing accepts a result worse than the current solution with a certain probability, which helps it "jump out" of a local minimum. As the iterations proceed, the probability of accepting a "suboptimal solution" gradually decreases over time, which ensures the stability of the algorithm.
  • Use stochastic gradient descent: unlike standard gradient descent, which computes the gradient exactly, stochastic gradient descent adds a random factor to the gradient computation. Even when it falls into a local minimum, the computed gradient may therefore still be non-zero, giving it a chance to jump out of the local minimum and continue searching.
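A rough sketch of the first strategy (random restarts) in Python; the loss function, its derivative, and all the settings below are made up for illustration:

import numpy as np

def f(w):                      # hypothetical loss with several local minima
    return np.sin(3 * w) + 0.1 * (w - 2.0) ** 2

def grad(w):                   # its derivative
    return 3 * np.cos(3 * w) + 0.2 * (w - 2.0)

best_w, best_loss = None, float("inf")
rng = np.random.default_rng(0)

for restart in range(10):      # several different initialization points
    w = rng.uniform(-5, 5)
    for _ in range(500):       # plain gradient descent from this starting point
        w -= 0.01 * grad(w)
    if f(w) < best_loss:       # keep the solution with the smallest error
        best_w, best_loss = w, f(w)

print(best_w, best_loss)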

How to validate the model

  1. Split the data into a training set and a test set
  2. Evaluating classification results: accuracy, confusion matrix, precision, recall, F1 score, ROC curve, etc.
  3. Evaluating regression results: MSE, RMSE, MAE, R squared (a small evaluation sketch follows this list)
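As a minimal sketch (the data is randomly generated here purely for illustration), splitting the data and computing the regression metrics by hand could look like this:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, 100)   # hypothetical noisy linear data

# 1. Split into a training set (80%) and a test set (20%)
idx = rng.permutation(100)
train, test = idx[:80], idx[80:]

# Fit a simple linear model on the training set (closed-form least squares)
w, b = np.polyfit(x[train], y[train], 1)
y_pred = w * x[test] + b

# 3. Regression metrics on the test set
mse = np.mean((y[test] - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y[test] - y_pred))
r2 = 1 - np.sum((y[test] - y_pred) ** 2) / np.sum((y[test] - np.mean(y[test])) ** 2)
print(mse, rmse, mae, r2)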

The overfitting problem

We can further optimize the model by using a higher-degree (higher-power) model. But why does a model that performs better on the training set do worse on the test set? This is the problem of the model overfitting the training set.

Plotting the error rate shows that overfitting occurs for models of degree 3 or higher:

As training proceeds, the complexity of the model grows and the error on the training data gradually decreases. However, the error on the validation set gradually increases, because the trained network overfits the training set and does not generalize to data outside it.

In machine learning algorithms, we often divide the raw data set into three parts: training data, validation data, and testing data.

Question: What is a validation set?

It is designed precisely to avoid overfitting. During training, we often use the validation data to determine certain hyperparameters (for example, deciding the epoch for early stopping based on the accuracy on the validation data, or choosing the learning rate based on the validation data). Why not do this directly with the testing data? Because if we did, then as training goes on the network would gradually overfit the testing data, and the final test accuracy would lose its meaning as a reference. So the role of the training data is to compute the gradient updates for the weights, the validation data is used to pick hyperparameters, and the testing data gives an accuracy figure from which to judge the quality of the network.
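A minimal sketch of this three-way split; the dataset size, split ratios, and candidate learning rates below are all invented for illustration, and the training function is only a placeholder:

import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(1000)        # indices of a hypothetical dataset

# Three-way split: train / validation / test
train_idx = indices[:700]              # used to compute gradient updates
val_idx = indices[700:850]             # used to choose hyperparameters (lr, early-stopping epoch, ...)
test_idx = indices[850:]               # used only once, to report the final accuracy

def train_and_validate(lr, train_idx, val_idx):
    # placeholder for a real training loop; returns a made-up validation score
    return -abs(lr - 0.01)             # pretend lr = 0.01 happens to be best

# Choose the learning rate that scores best on the validation set, never on the test set
candidate_lrs = [0.1, 0.01, 0.001]
best_lr = max(candidate_lrs, key=lambda lr: train_and_validate(lr, train_idx, val_idx))
print(best_lr)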

The main methods to prevent overfitting are:

  1. Regularization (L1 and L2)
  2. Data augmentation, i.e. increasing the number of training samples
  3. Dropout
  4. Early stopping

For details, see: blog.csdn.net/u010899985/…

Optimization steps

Step 1 optimization: merge the four two-input linear models into one linear model

Step 2 optimization: make the model more powerful and perform better (more parameters, more inputs)

At the beginning we had many features; after analyzing the features graphically, HP, weight, and height are added to the model.

Step 3 optimization: add regularization

With more features, some weights w may give certain features too much influence and still cause overfitting, so regularization is added (a small sketch follows the list below).

  • The smaller w is, the smoother the function: the output changes less when the input changes.
  • In many application scenarios it is not literally true that the smaller w is, the better the model; the rule of thumb is simply that a smaller w works well in most cases.
  • Whether b is close to 0 or not has no effect on the smoothness of the curve, so b is not included in the regularization term.
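A small sketch of what adding L2 regularization to the squared-error loss looks like in code; the data, lambda, learning rate, and iteration count below are all invented for illustration. The regularized loss adds lam * sum(w**2), so the gradient with respect to each weight gains a 2*lam*w term, while b is left unregularized:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (50, 3))               # hypothetical data: 50 samples, 3 features
y = x @ np.array([2.0, -1.0, 0.5]) + 3.0      # hypothetical true relationship

w = np.zeros(3)
b = 0.0
lam, lr = 0.1, 0.001                          # regularization strength and learning rate

for _ in range(5000):
    y_hat = x @ w + b
    err = y - y_hat
    # loss = mean squared error + lam * sum(w ** 2); note b is not regularized
    grad_w = -2.0 * (x.T @ err) / len(y) + 2.0 * lam * w
    grad_b = -2.0 * np.sum(err) / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)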

4 – Regression demonstration

import numpy as np
import matplotlib.pyplot as plt
from pylab import mpl

# matplotlib has no Chinese font, dynamic resolution
plt.rcParams['font.sans-serif'] = ['Simhei']  # Display Chinese
mpl.rcParams['axes.unicode_minus'] = False  # Fixed save image where the negative sign '-' is displayed as a square

x_data = [338., 333., 328., 207., 226., 25., 179., 60., 208., 606.]
y_data = [640., 633., 619., 393., 428., 27., 193., 66., 226., 1591.]
x_d = np.asarray(x_data)
y_d = np.asarray(y_data)

x = np.arange(-200, -100, 1)  # candidate values of the bias b
y = np.arange(-5, 5, 0.1)  # candidate values of the weight w
Z = np.zeros((len(x), len(y)))
X, Y = np.meshgrid(x, y)

# loss
for i in range(len(x)):
    for j in range(len(y)):
        b = x[i]
        w = y[j]
        Z[j][i] = 0  # accumulate the average loss for this (b, w) grid point
        for n in range(len(x_data)):
            Z[j][i] += (y_data[n] - b - w * x_data[n]) ** 2
        Z[j][i] /= len(x_data)

# linear regression
# b = -120
# w = -4
b = -2
w = 0.01
lr = 0.000005
iteration = 1400000

b_history = [b]
w_history = [w]
loss_history = []
import time

start = time.time()
for i in range(iteration):
    m = float(len(x_d))
    y_hat = w * x_d + b
    loss = np.dot(y_d - y_hat, y_d - y_hat) / m
    grad_b = -2.0 * np.sum(y_d - y_hat) / m
    grad_w = -2.0 * np.dot(y_d - y_hat, x_d) / m
    # update param
    b -= lr * grad_b
    w -= lr * grad_w

    b_history.append(b)
    w_history.append(w)
    loss_history.append(loss)
    if i % 10000 == 0:
        print("Step %i, w: %.4f, b: %.4f, loss: %.4f" % (i, w, b, loss))
end = time.time()
print("About time:", end - start)

# plot the figure
plt.contourf(x, y, Z, 50, alpha=0.5, cmap=plt.get_cmap('jet'))  # Fill contour
plt.plot([-188.4], [2.67], 'x', ms=12, mew=3, color="orange")
plt.plot(b_history, w_history, 'o-', ms=3, lw=1.5, color='black')
plt.xlim(-200, -100)
plt.ylim(-5, 5)
plt.xlabel(r'$b$')
plt.ylabel(r'$w$')
plt.title("Linear regression")
plt.show()

# linear regression
b = -120
w = -4
lr = 1
iteration = 100000

b_history = [b]
w_history = [w]

lr_b = 0
lr_w = 0
import time

start = time.time()
for i in range(iteration):
    b_grad = 0.0
    w_grad = 0.0
    for n in range(len(x_data)):
        b_grad = b_grad - 2.0 * (y_data[n] - b - w * x_data[n]) * 1.0
        w_grad = w_grad - 2.0 * (y_data[n] - b - w * x_data[n]) * x_data[n]

    lr_b = lr_b + b_grad ** 2
    lr_w = lr_w + w_grad ** 2
    # update param
    b -= lr / np.sqrt(lr_b) * b_grad
    w -= lr / np.sqrt(lr_w) * w_grad

    b_history.append(b)
    w_history.append(w)
# plot the figure
plt.contourf(x, y, Z, 50, alpha=0.5, cmap=plt.get_cmap('jet'))  # Fill contour
plt.plot([-188.4], [2.67], 'x', ms=12, mew=3, color="orange")
plt.plot(b_history, w_history, 'o-', ms=3, lw=1.5, color='black')
plt.xlim(-200, -100)
plt.ylim(-5, 5)
plt.xlabel(r'$b$')
plt.ylabel(r'$w$')
plt.title("Linear regression")
plt.show()
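Note on the second run: it uses a much larger base learning rate (lr = 1), but for each parameter it divides that rate by the square root of the parameter's accumulated squared gradients (lr_b, lr_w). This per-parameter adaptive step size is essentially an Adagrad-style update, and it is what lets this run approach the point marked with the orange 'x' within far fewer iterations, whereas the plain fixed-learning-rate run above needs a tiny learning rate and many more iterations to make comparable progress.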

Reference links:

  • Link: zhuanlan.zhihu.com/p/112692430
  • Link: zhuanlan.zhihu.com/p/72589970
  • Link: blog.csdn.net/qq547276542…
  • Link: zhuanlan.zhihu.com/p/136438005