Prediction begins with a shot in the dark

As mentioned in the previous article, machine learning is the process of applying mathematical methods to discover patterns in data. Since mathematics describes the real world, let's go back to the real world and build an analogy.

Imagine a styrofoam whiteboard with blue thumbtacks arranged on it. Vaguely, there seems to be a pattern, and we try to figure it out.

Is there a way (a mathematical algorithm) to find the pattern (a model that explains it) behind the pins (the data) arranged on the whiteboard above? Since we don't know where to start, let's take a shot in the dark!

I picked up two sticks and laid them across the whiteboard, trying to capture the pattern of the data. Placed casually, they looked as shown below:

Both seem to capture the pattern of the blue pins to some extent, so the question is: which is better, the green (dashed) line or the red (solid) one?

Loss function (cost function)

"Good" and "bad" are subjective judgments, and subjective feelings are not reliable, so we need an objective measure. It is natural to say that the line with the least error is the best. So we introduce a method to quantify the error: the least squares method.

Least squares method: a way of measuring error statistically by minimizing the sum of the squared errors ("least" refers to minimizing, "squares" to squaring each error):

SE = \sum (y_{pred} - y_{true})^2

The idea behind the least squares method is that the difference between the predicted value and the actual value represents the error at a single point, and the sum of the squares of these differences represents the overall error. (Squaring has the advantage of handling negative differences; summing absolute values would also work.) This final value represents the loss (cost), and a function that computes the loss (cost) is called a loss function (cost function).
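As a minimal sketch (the data points here are made-up placeholders for the pins, not the article's actual data), the loss of a candidate line can be computed like this:

import numpy as np

def squared_error_loss(y_pred, y_true):
    # sum of the squared differences between predictions and actual values
    return np.sum(np.square(y_pred - y_true))

# made-up "pins" and the candidate line y = 3x + 2
x = np.array([0.5, 1.0, 1.5])
y_true = np.array([5.4, 7.1, 8.2])
print(squared_error_loss(3 * x + 2, y_true))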

As we can see in the figure above, the distances from the blue dots to the line are the errors that we plug into the formula. Although the two lines look similar, the calculation shows that the loss of the red solid line (y = 3x + 2) is 27.03, while the loss of the green dashed line (y = 4x + 4) is 29.54, so the red model is clearly better than the green one.

So, is there a model that represents the data even better than the red solid line? And is there a way to find it?

Gradient descent

Let's express the stick (the solid line, our model) mathematically: since we can use 3 and 4 as the coefficient of x, we can certainly try other numbers as well. We can write this relationship as the following formula:


y = wx + b

Here x and y are known; we keep adjusting w (the weight) and b (the bias), plug each candidate into the loss function, and look for the minimum. This search is gradient descent.

We let w range from -50 to 50, fix b at a random value, and plug each candidate into the loss function to compute the error between our predictions and the actual values. This produces the following curve:

It is important to note that this curve plots loss against the weight; the model itself is still a straight line. In the figure above we can find the minimum, at a weight of around 5, where the loss is smallest and our model best represents the pattern in the data.
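A minimal sketch of how such a loss-versus-weight curve could be drawn follows. The data generation mirrors the code practice section below, and fixing b at a random value is an assumption for illustration, so the exact position of the minimum depends on that choice and on the random data:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = 2 * np.random.rand(10)
y = 4 + 3 * X + np.random.randn(10)

b = np.random.randn()    # fix the bias at a random value
ws = np.arange(-50, 51)  # try every integer weight from -50 to 50
losses = [np.sum(np.square(w * X + b - y)) for w in ws]

plt.plot(ws, losses)
plt.xlabel("w")
plt.ylabel("loss")
plt.show()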

A gradient can be understood as a derivative, and gradient descent is the process of repeatedly taking the derivative and moving in the direction that lowers the loss.

Learning rate (step size)

The process of constantly adjusting the weight and bias to minimize the loss function is the process of using gradient descent to fit the data and find the best model. Now that we have a solution, isn't it time to think about efficiency: how do we find the lowest point quickly?

Imagine being lost in mountain fog, where all you can feel is the slope of the ground beneath your feet. One strategy for reaching the bottom of the hill quickly is to head downhill in the steepest direction. An important parameter in gradient descent is the size of each step, the learning rate. If the step size is too small, the algorithm needs many iterations to converge. If the step size is too large, you may step straight across the valley, causing the algorithm to diverge and the loss to grow larger and larger.

Step size set too small:

Step size set too large:

Step size set appropriately:

The step size cannot be learned by the algorithm itself; it has to be specified from outside. Parameters like this, which the algorithm cannot learn and which must be set by hand, are called hyperparameters.
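To make the role of the learning rate concrete, here is a minimal gradient descent sketch for y = wx + b with the squared-error loss. The learning rate and iteration count below are arbitrary choices for illustration, not values taken from the article:

import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(10)
y = 4 + 3 * X + np.random.randn(10)

w, b = 0.0, 0.0
learning_rate = 0.01  # the step size, a hyperparameter chosen by hand
n_iterations = 1000

for _ in range(n_iterations):
    y_pred = w * X + b
    # gradients of the summed squared error with respect to w and b
    grad_w = 2 * np.sum((y_pred - y) * X)
    grad_b = 2 * np.sum(y_pred - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should land close to the best-fit line

Raising the learning rate well beyond this value makes the updates overshoot the valley and the loss blow up, which is exactly the divergence described above.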

Linear regression

Finally, we have found a linear model that explains the relationship between the independent variable X and the dependent variable y; this is linear regression. The name comes from the observation that things tend to move back toward some kind of "average". This tendency is called regression, which is why regression is often used for prediction.

In the figure above, the red line is the best model we fit; on it we can read off the predicted values at x = 2.2, 2.6, and 2.8, corresponding to the three red dots in the figure.

That’s what linear regression is all about.

Code practice

Data preparation:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = 2 * np.random.rand(10)
y = 4 + 3 * X + np.random.randn(10)

plt.plot(X, y, "bo")
plt.xlabel("$X$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()

Draw two lines y=3x+2 and y=4x+4:

plt.plot(X, y, "bo")
plt.plot(X, 3*X+2, "r-", lw="5", label = "y=3x+2")
plt.plot(X, 4*X+4, "g:", lw="5", label = "y=4x+4")
plt.xlabel("$X$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.legend(loc="upper left")
plt.show()

Calculate the loss and compare the two lines y=3x+2 and y=4x+4:

fig, ax_list = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))

# left panel: the line y = 3x + 2
ax_list[0].plot(X, y, "bo")
ax_list[0].plot(X, 3*X+2, "r-", lw="5", label="y=3x+2")
loss = 0
for i in range(10):
    # grey segments mark the error at each point
    ax_list[0].plot([X[i], X[i]], [y[i], 3*X[i]+2], color='grey')
    loss = loss + np.square(3*X[i]+2 - y[i])
ax_list[0].axis([0, 2, 0, 15])
ax_list[0].legend(loc="upper left")
ax_list[0].title.set_text('loss=%s' % loss)

# right panel: the line y = 4x + 4
ax_list[1].plot(X, y, "bo")
ax_list[1].plot(X, 4*X+4, "g:", lw="5", label="y=4x+4")
loss = 0
for i in range(10):
    ax_list[1].plot([X[i], X[i]], [y[i], 4*X[i]+4], color='grey')
    loss = loss + np.square(4*X[i]+4 - y[i])
ax_list[1].axis([0, 2, 0, 15])
ax_list[1].legend(loc="upper left")
ax_list[1].title.set_text('loss=%s' % loss)

fig.subplots_adjust(wspace=0.1, hspace=0.5)
fig.suptitle("Calculate loss", fontsize=16)
plt.show()

Train the model and predict:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X.reshape(-1, 1), y.reshape(-1, 1))

# predict at three new x values
X_test = [[2.2], [2.6], [2.8]]
y_test = lr.predict(X_test)

# predict over a range of x to draw the fitted line
X_pred = 3 * np.random.rand(100, 1)
y_pred = lr.predict(X_pred)

plt.scatter(X, y, c='b', label='real')
plt.plot(X_test, y_test, 'r', label='predicted point', marker=".", ms=20)
plt.plot(X_pred, y_pred, 'r-', label='predicted')
plt.xlabel("$X$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 3, 0, 15])
plt.legend(loc="upper left")

# recompute the loss of the fitted line on the 10 training points
loss = 0
for i in range(10):
    loss = loss + np.square(y[i] - lr.predict([[X[i]]]))
plt.title("loss=%s" % loss)
plt.show()

Other regression

How should we really understand regression? Statistics gathered over large samples show that small beans tend to produce larger offspring, while large beans tend to produce smaller offspring: newly produced individuals drift back toward the average bean size. This tendency toward the mean is regression. Linear regression, as described in this article, is one technique used for prediction, and in this sense regression (predicting continuous values) is often contrasted with classification (predicting discrete categories).

Linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, Lasso regression, and ElasticNet regression are the most commonly used regression techniques. Below is a brief overview of each, so that you can keep the threads straight and dig deeper when you actually need to; trying to exhaust every detail only leads to fatigue. A short scikit-learn sketch follows the table.

| Name | Explanation | Formula |
| --- | --- | --- |
| Linear Regression | Uses a linear model to describe the relationship between the independent and dependent variables | y = wx + b |
| Logistic Regression | Models the probability of belonging to a specific category; used for binary classification | y = \frac{1}{1+e^{-x}} |
| Polynomial Regression | Models the relationship between the independent variable x and the dependent variable y as an nth-degree polynomial in x | y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_m x^m + \varepsilon |
| Stepwise Regression | Introduces variables into the model one by one to find those with the greatest influence on the model | |
| Lasso Regression | Produces sparse weights and eliminates unimportant features; MSE plus the L1 norm. The larger α is, the smaller the model weights become | J(\theta) = MSE(\theta) + \alpha \sum \mid \theta \mid |
| Ridge Regression | Regularized linear regression that constrains the model's weights to prevent overfitting; MSE plus the L2 norm. The larger α is, the smaller the model weights become | J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum \theta^2 |
| ElasticNet | A compromise between ridge regression and Lasso regression; γ lies between 0 and 1, where values close to 0 behave more like ridge regression and values close to 1 more like Lasso | J(\theta) = MSE(\theta) + \gamma \alpha \sum \mid \theta \mid + \alpha \frac{1-\gamma}{2} \sum \theta^2 |
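As a rough sketch of how these regularized variants can be tried in practice (the alpha and l1_ratio values below are arbitrary, not recommendations), scikit-learn provides them alongside LinearRegression:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

np.random.seed(42)
X = 2 * np.random.rand(10, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(10)

# Ridge: squared-error loss plus an L2 penalty on the weights
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso: squared-error loss plus an L1 penalty, which can zero out weights
lasso = Lasso(alpha=0.1).fit(X, y)

# ElasticNet: a mix of L1 and L2; l1_ratio plays the role of gamma in the table
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(ridge.coef_, lasso.coef_, enet.coef_)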