Enzyme activity prediction task (nonlinear curve prediction):
1. Build a linear regression model on the T-R-train.csv data, compute its R2 score on the T-R-test.csv data, and visualize the model's predictions.
2. [...]
4. Visualize the predictions of the polynomial regression models and determine which model predicts more accurately.
# load data
import pandas as pd
import numpy as np
data_train = pd.read_csv('T-R-train.csv')
data_train.head()  # Display the first rows of the data
| | T | rate |
|---|---|---|
| 0 | 45.376344 | 2.334559 |
| 1 | 52.186380 | 2.775735 |
| 2 | 61.863799 | 2.930147 |
| 3 | 73.154122 | 2.488971 |
| 4 | 78.888889 | 1.981618 |
| 5 | 82.473118 | 1.518382 |
| 6 | 43.046595 | 2.080882 |
# Define training data
X_train = data_train.loc[:,'T']
y_train = data_train.loc[:,'rate']
# Train data visualization
%matplotlib inline
from matplotlib import pyplot as plt
fig1 = plt.figure(figsize=(5,5))
plt.scatter(X_train,y_train)
plt.title('raw data')
plt.xlabel('temperature')
plt.ylabel('rate')
plt.show()
# scikit-learn expects a 2-D feature array of shape (n_samples, n_features); passing the 1-D column raises an error
X_train = np.array(X_train).reshape(-1,1)
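The reshape step matters because scikit-learn estimators expect the feature matrix to be 2-D, with one row per sample and one column per feature. A minimal sketch of the shape change, using three made-up temperature values (illustrative only, not rows from T-R-train.csv):

```python
import numpy as np

# Three made-up temperatures, for illustration only
temps = np.array([45.0, 52.0, 61.0])
print(temps.shape)   # (3,)  one-dimensional

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column
X = temps.reshape(-1, 1)
print(X.shape)       # (3, 1)  three samples, one feature
```

The target `y_train` can stay 1-D; only the feature array needs the extra dimension.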
# Model prediction (first attempt with linear regression model)
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()  # Create the linear regression model lr1
lr1.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
# Load the test data set
data_test = pd.read_csv('T-R-test.csv')
X_test = data_test.loc[:,'T']
y_test = data_test.loc[:,'rate']
X_test = np.array(X_test).reshape(-1 ,1)
# Prediction with test data sets
y_train_predict = lr1.predict(X_train)
y_test_predict = lr1.predict(X_test)
# R2 (coefficient of determination) measures goodness of fit:
# the closer to 1, the better the model; values near or below 0 mean a poor fit
from sklearn.metrics import r2_score
r2_train = r2_score(y_train,y_train_predict)
r2_test = r2_score(y_test,y_test_predict)
print('training r2:',r2_train)
print('test r2:',r2_test)
# The linear model clearly fits poorly
training r2: 0.016665703886981964
test r2: -0.7583363437351314
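For reference, R2 compares the model's squared error with that of a baseline that always predicts the mean of the true values: R2 = 1 - SS_res / SS_tot. A model that does worse than this baseline gets a negative score, which is what the test score above shows. The sketch below implements the formula in plain Python on made-up numbers; the helper name `r2_manual` is hypothetical and is not part of scikit-learn.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # mean-baseline error
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
print(r2_manual(y_true, [1.0, 2.0, 3.0]))  # 1.0   perfect predictions
print(r2_manual(y_true, [2.0, 2.0, 2.0]))  # 0.0   no better than the mean
print(r2_manual(y_true, [3.0, 2.0, 1.0]))  # -3.0  worse than the mean
```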
# Visualize the fitted line: the linear model is not suited to this data
X_range = np.linspace(40,90,300).reshape(-1,1)
y_range_predict = lr1.predict(X_range)
fig2 = plt.figure(figsize=(5,5))
plt.plot(X_range,y_range_predict)
plt.scatter(X_train,y_train)
plt.title('prediction data')
plt.xlabel('temperature')
plt.ylabel('rate')
plt.show()
# Second attempt - using polynomial regression model
from sklearn.preprocessing import PolynomialFeatures
poly2 = PolynomialFeatures(degree=2)  # Feature generator for a second-order (degree-2) polynomial model
X_2_train = poly2.fit_transform(X_train)  # Fit on the training data and expand it to degree-2 features
X_2_test = poly2.transform(X_test)  # Already fitted, so the test data only needs transform
poly5 = PolynomialFeatures(degree=5)
X_5_train = poly5.fit_transform(X_train)
X_5_test = poly5.transform(X_test)
print(X_5_train.shape)
(18, 6)
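The 18 rows are the training samples, and the 6 columns come from expanding the single temperature feature into the powers 0 through 5; with its default settings, `PolynomialFeatures` includes the constant column. A rough NumPy sketch of that expansion for one input feature (the helper `expand_powers` is hypothetical, written only to illustrate the column layout):

```python
import numpy as np

def expand_powers(x_column, degree):
    """Return columns [x**0, x**1, ..., x**degree] for one feature."""
    x = np.asarray(x_column, dtype=float).reshape(-1)
    return np.column_stack([x ** p for p in range(degree + 1)])

X = np.array([[2.0], [3.0]])      # two made-up samples, one feature
X_5 = expand_powers(X, degree=5)
print(X_5.shape)                  # (2, 6)
print(X_5[0])                     # powers of 2: 1, 2, 4, 8, 16, 32
```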
lr2 = LinearRegression()
lr2.fit(X_2_train,y_train)
y_2_train_predict = lr2.predict(X_2_train)
y_2_test_predict = lr2.predict(X_2_test)
r2_2_train = r2_score(y_train,y_2_train_predict)
r2_2_test = r2_score(y_test,y_2_test_predict)
lr5 = LinearRegression()
lr5.fit(X_5_train,y_train)
y_5_train_predict = lr5.predict(X_5_train)
y_5_test_predict = lr5.predict(X_5_test)
r2_5_train = r2_score(y_train,y_5_train_predict)
r2_5_test = r2_score(y_test,y_5_test_predict)
print('training r2_2:',r2_2_train)
print('test r2_2:',r2_2_test)
print('training r2_5:',r2_5_train)
print('test r2_5:',r2_5_test)
# Both models fit the training data well, but the fifth-order model over-fits and predicts the test data poorly
training r2_2: 0.9700515400689422
test r2_2: 0.9963954556468684
training r2_5: 0.9978527267187657
test r2_5: 0.5437837627381457
X_2_range = np.linspace(40,90,300).reshape(-1,1)
X_2_range = poly2.transform(X_2_range)
y_2_range_predict = lr2.predict(X_2_range)
X_5_range = np.linspace(40,90,300).reshape(-1,1)
X_5_range = poly5.transform(X_5_range)
y_5_range_predict = lr5.predict(X_5_range)
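Transforming the inputs by hand before every `predict` call is easy to get wrong. One alternative, not used in the original code, is scikit-learn's `make_pipeline`, which chains the feature expansion and the regression into a single estimator. A sketch on synthetic data (the parabola below is made up and merely stands in for the T-R CSV files; the names `X_demo`, `y_demo`, `model2`, `X_grid`, and `y_grid` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the temperature/rate data: an exact parabola
X_demo = np.linspace(40, 90, 20).reshape(-1, 1)
y_demo = 3.0 - 0.002 * (X_demo.ravel() - 62.0) ** 2

# The pipeline applies PolynomialFeatures first, then LinearRegression
model2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model2.fit(X_demo, y_demo)

# predict() takes raw temperatures; no manual transform step is needed
X_grid = np.linspace(40, 90, 300).reshape(-1, 1)
y_grid = model2.predict(X_grid)

print(y_grid.shape)                  # (300,)
print(model2.score(X_demo, y_demo))  # R2 on the demo data
```

Because the pipeline owns the fitted transformer, the same object can be reused for the training data, the test data, and the plotting grid without keeping separate `X_2_*` arrays in sync.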
fig3 = plt.figure(figsize=(5,5))
plt.plot(X_range,y_2_range_predict)  # Draw the curve predicted by the second-order model
plt.scatter(X_train,y_train)
plt.scatter(X_test,y_test)
plt.title('polynomial prediction result (2)')
plt.xlabel('temperature')
plt.ylabel('rate')
plt.show()
fig4 = plt.figure(figsize=(5,5))
plt.plot(X_range,y_5_range_predict)  # Draw the curve predicted by the fifth-order model
plt.scatter(X_train,y_train)
plt.scatter(X_test,y_test)
plt.title('polynomial prediction result (5)')
plt.xlabel('temperature')
plt.ylabel('rate')
plt.show()
1. A second-order polynomial regression model predicts enzyme activity well: its R2 score is high on both the training and the test data.
2. The linear regression model under-fits (its R2 score is low on both data sets), while the fifth-order polynomial model over-fits: its R2 score is high on the training data (accurate fit) but low on the test data (inaccurate prediction).
3. The second-order polynomial regression model is the best of the three, judged by both the R2 scores and the plotted prediction curves.
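The under-fitting and over-fitting pattern described above can be checked by sweeping the polynomial degree and comparing the training and test R2 scores. The sketch below uses `np.polyfit` as a stand-in for the `PolynomialFeatures` plus `LinearRegression` combination (both are least-squares polynomial fits) and runs on synthetic data. The curve, the perturbation, and the helpers `r2`, `true_rate`, and `scale` are assumptions made for illustration, not the contents of the T-R CSV files.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def true_rate(t):
    # Made-up activity curve that peaks at 62 degrees
    return 3.0 - 0.002 * (t - 62.0) ** 2

def scale(t):
    # Rescale temperatures so high powers stay numerically well behaved
    return (t - 62.0) / 25.0

# Training points carry a small alternating perturbation; test points do not
t_train = np.linspace(40, 85, 10)
rate_train = true_rate(t_train) + 0.05 * (-1.0) ** np.arange(10)
t_test = np.linspace(42, 83, 7)
rate_test = true_rate(t_test)

scores = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(scale(t_train), rate_train, degree)
    train_score = r2(rate_train, np.polyval(coeffs, scale(t_train)))
    test_score = r2(rate_test, np.polyval(coeffs, scale(t_test)))
    scores[degree] = (train_score, test_score)
    print(degree, round(train_score, 3), round(test_score, 3))
```

On this synthetic curve the degree-1 fit scores poorly on both sets and the degree-2 fit scores well on both. The training score can only stay level or rise as the degree grows, so the test score is the one to watch when choosing a degree; how far the degree-5 test score drops depends on the data, so read the printed numbers rather than assuming a fixed outcome.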