
Scikit-learn machine learning linear regression model

Machine learning practice — linear regression

Ideally, anything non-linear can be approximated by something linear (see Taylor's formula). Moreover, linear models are essentially predictions of the mean, and most quantities only fluctuate around their mean value (see the law of large numbers), so linear models can simulate most phenomena.


Basic form

The basic expression

Suppose a sample $x$ is described by $d$ attributes, $x = (x_1, x_2, \ldots, x_d)$, where $x_i$ is the value of $x$ on the $i$-th attribute. A linear model learns a prediction function formed by a combination of the attributes, which can be written as:


$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$

In vector form:


$f(x) = w^T x + b$

$w = (w_1, w_2, \ldots, w_d)$ is called the weight vector and $b$ is called the bias.
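As a concrete illustration, here is a minimal NumPy sketch of this prediction function; the weights, bias and sample values below are made-up numbers, not fitted values:

import numpy as np

w = np.array([0.5, -1.2, 3.0])   # Made-up weight vector for d = 3 attributes
b = 0.7                          # Made-up bias
x = np.array([1.0, 2.0, 0.5])    # One sample with d = 3 attribute values

f_x = w @ x + b                  # f(x) = w^T x + b
print(f_x)                       # 0.5*1.0 - 1.2*2.0 + 3.0*0.5 + 0.7 ≈ 0.3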


The different forms

Depending on the form taken by the linear function f(x), different models arise:

  • f(x) takes discrete values: linear multi-class classification model
  • f(x) takes real values over the real field: linear regression model
  • the logarithm of the output is modelled by f(x): log-linear model
  • f(x) is passed through a sigmoid nonlinear transformation: logistic (log-odds) regression

This article focuses on the linear regression problem. References: Zhou Zhihua, Machine Learning; Li Hang, Statistical Learning Methods; and the formula derivations in Datawhale's Pumpkin Book.

Linear regression

Given a data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $y_i \in \mathbb{R}$.

Linear regression attempts to learn a linear model that makes the predicted value $f(x_i)$ as close as possible to the actual value $y_i$.

Performance measure

The mean squared error (MSE) is used as the measure: we look for the $w$ and $b$ that satisfy $\min_{(w,b)} \sum_{i=1}^{m} (f(x_i) - y_i)^2$.

The mean squared error corresponds to the familiar Euclidean distance. The method of solving a model by minimizing the mean squared error is called the least squares method.

In a linear regression model, the least squares method tries to find a line that minimizes the sum of the Euclidean distances from all samples to the line.

Solving the parameters

Finding the $w$ and $b$ that minimize $E_{(w,b)} = \sum_{i=1}^{m} (y_i - w x_i - b)^2$ is called least squares parameter estimation.

Take the partial derivatives of $E_{(w,b)}$ with respect to $w$ and $b$ separately; for the detailed derivation see: datawhalechina.github.io/pumpkin-boo…

Setting both partial derivatives equal to 0, we obtain the closed-form solutions for $w$ and $b$:
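For reference, the standard closed-form result for the single-attribute case (the full derivation is in the Pumpkin Book linked above) is:

$w = \dfrac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}$, $\quad b = \dfrac{1}{m}\sum_{i=1}^{m} (y_i - w x_i)$, where $\bar{x} = \dfrac{1}{m}\sum_{i=1}^{m} x_i$.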

Hands-on linear regression with sklearn

We use a public data set from the Internet to show how to use the machine learning library sklearn to solve a linear regression problem.

  • Data source

  • Data exploration

  • Data processing

  • Fitting the sklearn linear model

  • Model evaluation

  • Model optimization

  • Visualization of results

Data source

1. Data set description: archive.ics.uci.edu/ml/datasets…

2. Data download address: archive.ics.uci.edu/ml/machine-…

The data has four feature fields describing the operating conditions of the power plant, plus a final field containing the actual power output.

What we want to do now is use a linear regression model to fit the influence of the four factors on the output power, so that we can predict future power generation.

Data exploration

import pandas as pd
import numpy  as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression  # Import linear regression model

import matplotlib.pyplot as plt
%matplotlib inline 

1. Read the data

2. View the data information

3. View the data statistics (a sketch of these three steps follows below)
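A rough sketch of these three steps (the CSV file name here is an assumption; substitute whatever name the downloaded file actually has):

data = pd.read_csv("ccpp.csv")   # Assumed file name for the downloaded data set

data.head()       # Preview the first few rows
data.info()       # Field names, dtypes and non-null counts
data.describe()   # Count, mean, std, min/max and quartiles for each field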

Data processing

1. Select the sample feature data X, i.e. the first four fields

2. Select the sample output label data y, i.e. the last attribute field (a sketch of both steps follows below)
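A sketch of this selection, assuming the column order described above (first four columns are the features, the last column is the output):

X = data.iloc[:, 0:4]   # First four fields as the feature matrix X
y = data.iloc[:, [4]]   # Last field as the output label y (kept two-dimensional, which matches the [[...]] outputs later)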

Data set partitioning

Partitioning the data set: one part becomes the training set, the other part the test set.

from sklearn.model_selection import train_test_split  # Import the train/test split function from the model selection module

X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=1)

X_train   # 7176 rows of training data

The test set contains 2392 rows:

The rows of the output labels match the corresponding rows of X one-to-one. We also see that in this case 75% of the data was used as the training set and 25% as the test set (the train_test_split default).
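We can confirm the sizes of both splits directly:

print(len(X_train), len(X_test))   # 7176 2392
print(len(y_train), len(y_test))   # 7176 2392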

Instantiate and fit the model

line = LinearRegression()  # Instantiate the linear regression model
line.fit(X_train,y_train)
LinearRegression()

Check the fitted coefficients:

print(line.intercept_)  # The intercept, i.e. the value of b
print(line.coef_)       # The coefficients, i.e. the value of each w
[460.05727267]
[[-1.96865472 -0.2392946   0.0568509  -0.15861467]]

So far, we have obtained the fitted linear regression expression, i.e. the relationship between PE and the four preceding variables:


$PE = -1.96865472 \times AT - 0.2392946 \times V + 0.0568509 \times AP - 0.15861467 \times RH + 460.05727267$
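As a quick sanity check, a small sketch (assuming y was kept as a one-column DataFrame, as the [[...]] outputs above suggest) that recomputes the prediction for one test sample from the coefficients and compares it with line.predict:

sample = X_test.iloc[[0]]   # First test sample, kept as a one-row DataFrame
manual = np.dot(sample.values[0], line.coef_[0]) + line.intercept_[0]
print(manual, line.predict(sample)[0][0])   # The two values should agree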

Model evaluation

We obtained the linear regression expression from the training data; now we use it to predict on the test set we split off earlier.

To evaluate a linear regression model, we usually compute the Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) mentioned above on the test set; the code below does exactly that.
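For reference, with $m$ test samples, predicted values $\hat{y}_i$ and true values $y_i$, these metrics are defined as:

$MSE = \dfrac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2$, $\quad RMSE = \sqrt{MSE}$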

y_pred = line.predict(X_test)  # Make predictions for the test set data

y_pred
array([[457.26722361], [466.70748375], [440.33763981], ..., [457.39596168], [429.37990249], [438.16837983]])
len(y_pred)
2392

After predicting the result, we compare the predicted value with the actual value:

from sklearn import metrics  # Evaluation metrics module
# output MSE

print("MSE:",metrics.mean_squared_error(y_test,y_pred)) 
MSE: 20.837191547220346
RMSE: the square root of MSE

print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
RMSE:  4.564777272465804

Optimize the model using cross validation

Cross validation: divide the full data set D into k mutually exclusive subsets of similar size, each of which tries to keep the data distribution consistent; then use the union of k-1 subsets as the training set and the remaining subset as the test set.

In this way, k training/test splits are obtained, k rounds of training and testing are performed, and finally the mean of the k test results is returned.
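As an illustration of this per-fold training and testing, here is a minimal sketch using sklearn's cross_val_score with negative-MSE scoring (separate from the article's own code below, which uses cross_val_predict):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(line, X, y, cv=10, scoring="neg_mean_squared_error")
print(-scores)         # MSE on each of the 10 test folds
print(-scores.mean())  # Mean MSE over the 10 folds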

We can use sklearn's built-in cross validation to improve the model we obtained. The following example uses 10-fold cross validation:

line
LinearRegression()
from sklearn.model_selection import cross_val_predict  # Import the cross-validation prediction function

y_predicted = cross_val_predict(line,X,y,cv=10) 
y_predicted  # Predicted values obtained by cross validation
array([[467.24487977], [444.06421837], [483.53893768], ..., [432.47556666], [443.07355843], [449.74395838]])
len(y_predicted)
9568

MSE and RMSE can then be calculated from the predicted and true values obtained through cross validation:

# output MSE

print("MSE:",metrics.mean_squared_error(y,y_predicted)) 
MSE: 20.79367250985753
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y,y_predicted)))
RMSE:  4.560007950635342

We find that the mean squared error obtained with cross validation (MSE ≈ 20.79) is slightly better than the error obtained without it (MSE ≈ 20.84).

Plotting the results

Finally, we plot the true sample values against the predicted values. The closer a point lies to the diagonal line y = x, the smaller the prediction error.

Here is the plotting code using Matplotlib:

fig, ax = plt.subplots()

ax.scatter(y, y_predicted)

ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3)  # Reference line y = x

ax.set_xlabel('Measured')   # True values
ax.set_ylabel('Predicted')  # Predicted values

plt.show()

I redrew the data using Plotly_express to see what it looked like:

data = pd.concat([pd.DataFrame(y),pd.DataFrame(y_predicted)],axis=1)
data.columns = ["Measured", "Predicted"]
data

import plotly_express as px

fig = px.scatter(data,
                 x="Measured",     # True values
                 y="Predicted",    # Predicted values
                 trendline="ols",  # Add an ordinary least squares trend line
                 trendline_color_override="red"  # Trend line color
                )

fig.show()