Hello, I’m Peter
Scikit-learn machine learning: the linear regression model
Machine learning in practice: linear regression
In principle, almost anything non-linear can be approximated by something linear (think of Taylor's formula). Moreover, a linear model is essentially a prediction of the mean, and most quantities only fluctuate around their mean (compare the law of large numbers), so linear models can simulate a wide range of phenomena.
Contents
Basic form
Basic expression
Suppose a sample $x$ is described by $d$ attributes: $x = (x_1, x_2, \ldots, x_d)$, where $x_i$ is the value of $x$ on the $i$-th attribute. A linear model makes its prediction through a combination of the attributes, which can be expressed as:

$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
Expressed with vectors:

$$f(x) = w^T x + b$$

where $w = (w_1, w_2, \ldots, w_d)$ is called the weight vector and $b$ is called the bias.
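As a quick illustration, here is a minimal NumPy sketch of this expression; the weights, bias and sample values below are made up purely for demonstration:

```python
import numpy as np

# made-up parameters for a model with d = 3 attributes
w = np.array([0.5, -1.2, 2.0])  # weight vector w = (w_1, w_2, w_3)
b = 0.7                         # bias b

x = np.array([1.0, 2.0, 3.0])   # one sample described by 3 attributes

f_x = np.dot(w, x) + b          # f(x) = w^T x + b
print(f_x)                      # 4.8
```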
Different forms
Depending on how the linear function $f(x)$ is used, we obtain different models:
- $f(x)$ takes discrete values: a linear multi-class classification model
- $f(x)$ is a real-valued function over the reals: a linear regression model
- the logarithm of the output is fitted by $f(x)$, i.e. $\ln y = f(x)$: a log-linear model
- $f(x)$ is passed through a sigmoid (non-linear) transformation: logistic regression (log-odds regression)
This article focuses on the linear regression problem. References: Zhou Zhihua, Machine Learning; Li Hang, Statistical Learning Methods; and the formula derivations in Datawhale's Pumpkin Book.
Linear regression
Given a data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $y_i \in \mathbb{R}$.
Linear regression attempts to learn a linear model that makes the predicted value $f(x_i)$ as close as possible to the actual value $y_i$.
Measuring the error
We measure the error with the mean squared error (MSE): find the $w$ and $b$ that satisfy

$$\min_{(w,b)} \sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2$$
The mean squared error corresponds to the familiar Euclidean distance, and the method of solving a model by minimizing the mean squared error is called the least squares method. In a linear regression model, least squares tries to find the line that minimizes the sum of the Euclidean distances from all samples to the line.
Solving for the parameters
Finding the $w$ and $b$ that minimize

$$E_{(w,b)} = \sum_{i=1}^{m} \left( y_i - w x_i - b \right)^2$$

is the process called least squares parameter estimation.
Take the partial derivative of $E_{(w,b)}$ with respect to $w$ and $b$ separately; for the detailed derivation see: datawhalechina.github.io/pumpkin-boo…
Setting these two derivatives equal to 0 yields the closed-form solutions for $w$ and $b$:

$$w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m} \left( \sum_{i=1}^{m} x_i \right)^2}, \qquad b = \frac{1}{m} \sum_{i=1}^{m} (y_i - w x_i)$$

where $\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$.
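As a sanity check, here is a minimal NumPy sketch of these two closed-form formulas on toy data (the numbers are invented purely for illustration):

```python
import numpy as np

# toy 1-D data: y is roughly 2x + 1 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

m = len(x)
x_bar = x.mean()

# closed-form least squares solutions for w and b
w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
b = np.mean(y - w * x)

print(w, b)  # approximately 1.95 and 1.15, close to the true 2 and 1
```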
Hands-on linear regression with scikit-learn
We will use an open data set from the Internet to show how to solve a linear regression problem with the machine learning library scikit-learn, covering the following steps:
- Data source
- Data exploration
- Data processing
- Running scikit-learn's linear model
- Model evaluation
- Model optimization
- Visualization of results
Data source
1. An introduction to the data can be found here: archive.ics.uci.edu/ml/datasets…
2. The data can be downloaded here: archive.ics.uci.edu/ml/machine-…
The data contains four attribute fields that describe the operating conditions of a power plant, plus a final field holding the actual power output.
What we want to do is use a linear regression model to learn how the four factors influence the output power, so that we can predict the plant's power generation later on.
Data exploration
```python
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression  # import the linear regression model
import matplotlib.pyplot as plt
%matplotlib inline
```
1. Read the data
2. View the data information
3. View the data statistics (a sketch of all three steps follows below)
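A minimal pandas sketch of these three steps. The file name ccpp.csv is a placeholder for whatever your local file is called (the UCI download ships as an Excel file, which pandas can also read with read_excel):

```python
# 1. read the data ("ccpp.csv" is a placeholder file name)
data = pd.read_csv("ccpp.csv")

# 2. view the data information: columns, dtypes, non-null counts
data.info()

# 3. view the data statistics: count, mean, std, quartiles, min/max
print(data.describe())
```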
Data processing
1. Select the sample feature data X, i.e. the information of the first four fields
2. Select the sample output label data y, i.e. the data of the last attribute field (a selection sketch follows below)
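A sketch of this selection, assuming the column names of this UCI data set (AT, V, AP, RH for the four attributes and PE for the output; adjust the names if your file differs):

```python
# the first four fields are the features
X = data[["AT", "V", "AP", "RH"]]

# the last field, PE (power output), is the label
y = data[["PE"]]
```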
Data set partitioning
We split the data set into two parts: one becomes the training set, the other the test set.
```python
from sklearn.model_selection import train_test_split  # import the train/test split helper

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
X_train  # 7176 rows of training data
```
The test set contains 2392 rows.
The rows of the output labels y correspond one-to-one with the rows of X. We also notice that in this case 75% of the data went into the training set and 25% into the test set, which is train_test_split's default ratio.
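If you want to control the ratio yourself, train_test_split takes a test_size parameter; the call below reproduces the default split explicitly:

```python
# explicitly request the default 75% / 25% split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
```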
Instantiate and fit the model
```python
line = LinearRegression()   # instantiate the model object
line.fit(X_train, y_train)  # fit it on the training data
```
```
LinearRegression()
```
Check out the fitted coefficients:
```python
print(line.intercept_)  # corresponds to b, the intercept
print(line.coef_)       # corresponds to the values of each w
```
```
[460.05727267]
[[-1.96865472 -0.2392946   0.0568509  -0.15861467]]
```
So far we have the fitted linear regression expression, i.e. the relationship between PE and the preceding four variables (using the coefficients printed above and the column names AT, V, AP, RH):

$$PE \approx 460.05 - 1.97\,AT - 0.24\,V + 0.06\,AP - 0.16\,RH$$
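As a quick check of this expression, we can compute one prediction by hand from intercept_ and coef_ and compare it with line.predict (a small sanity-check sketch, using the first test sample):

```python
# manual prediction for the first test sample: b + w . x
x0 = X_test.iloc[0].values
manual = line.intercept_ + np.dot(line.coef_[0], x0)

print(manual)                          # hand-computed prediction
print(line.predict(X_test.iloc[[0]]))  # sklearn's prediction, should match
```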
Model evaluation
We obtained the linear regression expression from the training set data; now we use this expression on the test set generated earlier.
For linear regression, a model is usually evaluated with the above-mentioned Mean Squared Error (MSE) or the Root Mean Squared Error (RMSE) on the test set. Here is the code:
```python
y_pred = line.predict(X_test)  # make predictions for the test set data
y_pred
```
```
array([[457.26722361],
       [466.70748375],
       [440.33763981],
       ...,
       [457.39596168],
       [429.37990249],
       [438.16837983]])
```
```python
len(y_pred)
```

```
2392
```
After making the predictions, we compare the predicted values with the actual values:
```python
from sklearn import metrics  # metrics module for comparing predictions with actual values
```
```python
# output the MSE
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
```
```
MSE: 20.837191547220346
```
RMSE is the square root of MSE:
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
Copy the code
```
RMSE: 4.564777272465804
```
Optimize the model using cross validation
Cross-validation: divide the full data set D into k mutually exclusive subsets of similar size, with each subset keeping the data distribution as consistent as possible. In each round, k-1 of the subsets are used as the training set and the remaining one as the test set.
This yields k training/test pairs, so we can train and test k times and finally return the mean of the k results.
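This procedure can be reproduced with scikit-learn's cross_val_score, which returns one score per fold so that the mean can be taken at the end (a sketch reusing the line, X and y defined above; scikit-learn reports the MSE as a negative score, hence the sign flips):

```python
from sklearn.model_selection import cross_val_score

# 10 folds -> 10 scores, one per train/test round
scores = cross_val_score(line, X, y, cv=10,
                         scoring="neg_mean_squared_error")

print(-scores)         # the 10 per-fold MSE values
print(-scores.mean())  # their mean: the cross-validated MSE
```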
We can use scikit-learn's built-in cross-validation to optimize our resulting model; below, cross_val_predict is used with 10-fold cross-validation:
```python
line
```

```
LinearRegression()
```
```python
from sklearn.model_selection import cross_val_predict  # import the cross-validation module

y_predicted = cross_val_predict(line, X, y, cv=10)
y_predicted  # predicted values obtained by cross-validation
```
```
array([[467.24487977],
       [444.06421837],
       [483.53893768],
       ...,
       [432.47556666],
       [443.07355843],
       [449.74395838]])
```
```python
len(y_predicted)
```

```
9568
```
MSE and RMSE can now be calculated from the cross-validated predictions and the real values:
```python
# output the MSE
print("MSE:", metrics.mean_squared_error(y, y_predicted))
```
```
MSE: 20.79367250985753
```
print("RMSE: ",np.sqrt(metrics.mean_squared_error(y,y_predicted)))
Copy the code
```
RMSE: 4.560007950635342
```
We find that the mean squared error with cross-validation (20.79) is slightly better than the error without it (20.84).
Visualizing the results
Finally, we plot the real sample values against the predicted values: the closer a point lies to the diagonal y = x, the smaller the prediction loss.
Here is the Matplotlib code and its effect:
```python
fig, ax = plt.subplots()
ax.scatter(y, y_predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3)  # the diagonal y = x
ax.set_xlabel('Measured')   # real values
ax.set_ylabel('Predicted')  # predicted values
plt.show()
```
I redrew the data with plotly_express to see how it looks:
```python
data = pd.concat([pd.DataFrame(y), pd.DataFrame(y_predicted)], axis=1)
data.columns = ["Measured", "Predicted"]
data
```
```python
import plotly_express as px

fig = px.scatter(data,
                 x="Measured",                    # real values
                 y="Predicted",                   # predicted values
                 trendline="ols",                 # add an OLS trend line
                 trendline_color_override="red")  # trend line color
fig.show()
```