Preface

Thermal power generation burns fuel to heat water into steam, the steam drives a turbine, and the turbine drives a generator that produces electricity. In this process, the combustion efficiency of the boiler is central to overall generation efficiency. Many factors affect combustion efficiency, including the boiler's own operating inputs, such as the fuel feed and the secondary and return air flows, as well as its working conditions, such as bed temperature, furnace temperature, and bed pressure. This article uses an industrial steam dataset to predict the steam generated and, through that, analyze the efficiency of thermal power generation.

1.1 Jupyter setup, package imports and dataset loading

Import related modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
import sklearn
from sklearn.exceptions import ConvergenceWarning
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action ='ignore',category=ConvergenceWarning)   

To prevent garbled Chinese characters in plots, set a Chinese font for Matplotlib and Seaborn.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus'] =False
sns.set(font='SimHei')

Set the number of rows and columns displayed in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 30)

Load data.

df_train = pd.read_csv('zhengqi_train.txt',sep='\t',encoding='utf-8')
df_test = pd.read_csv('zhengqi_test.txt',sep='\t',encoding='utf-8')

1.2 Exploratory analysis

1.2.1 Analysis of data sets

1.2.1.1 Preview the dataset
  • Preview the data set.
df_train.head(5).append(df_train.tail(5))
--------------------------------
(Output: the first and last 5 rows of the training set; 10 rows × 39 columns, features V0-V37 plus target.)
1.2.1.2 Preview summary statistics
df_train.describe()
--------------------------------
(Output: count, mean, std, min, 25%, 50%, 75%, and max for each column; 8 rows × 39 columns. All columns have 2888 non-null values.)
1.2.1.3 Previewing data types
  • All 39 columns are of type float64.
df_train.info()
--------------------------------
dtypes: float64(39)
memory usage: 880.1 KB
1.2.1.4 Preview training set and test set dimensions
df_train.shape, df_test.shape
--------------------------------
((2888, 39), (1925, 38))
1.2.1.5 Number and distribution of missing values
  • No missing values were found in the training set
df_train.isnull().sum()
# missing_pct = df_all.isnull().sum() * 100 / len(df_all)
# missing = pd.DataFrame({
#     'name': df_all.columns,
#     'missing_pct': missing_pct,
# })
# missing.sort_values(by='missing_pct', ascending=False).head()
--------------------------------
      name  missing_pct
V0      V0          0.0
V29    V29          0.0
V23    V23          0.0
V24    V24          0.0
1.2.1.6 Distribution of the target value
  • The mean, median, maximum, and minimum of the target value
df_train['target'].mean(), df_train['target'].median(), df_train['target'].max(), df_train['target'].min()
--------------------------------
(0.12635283933517938, 0.313, 2.5380000000000003, -3.0439999999999996)
  • Plot the target value curve.
plt.figure()
df_train['target'].plot()
plt.ylabel('target')
plt.xlabel('id')
plt.show()

1.2.1.7 Generating a data report
pfr = pandas_profiling.ProfileReport(df_train)
pfr.to_file("./example.html")
--------------------------------
Summarize dataset: 100% [02:42<00:00, 3.06s/it, Completed]
Generate report structure: 100% 1/1 [00:11<00:00, 11.36s/it]
Render HTML: 100% 1/1 [00:32<00:00, 32.81s/it]
Export report to file: 100% 1/1 [00:32<00:00, 3.85it/s]

1.2.2 Feature distribution curves

  • The plots show whether each feature is approximately normally distributed (a quick numeric check follows the plotting code below).
  • The values of V9, V17, V18, V22, V23, V24, V28, and V35 are unevenly distributed.
  • The plots of V14, V17, V19, V22, V24, and V28 show multiple extreme values.
df = pd.melt(df_train,value_vars=df_train.columns)
sp=sns.FacetGrid(df,col='variable',col_wrap=5,sharex=False,sharey=False)
sp=sp.map(sns.distplot,'value',color='m',rug=True)
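The normality observations above can also be checked numerically. As a minimal sketch (an addition, not part of the original notebook), the skewness of each training feature can be printed; values far from zero indicate a clearly non-normal distribution.

# skewness of each training feature, most skewed first
print(df_train.drop('target', axis=1).skew().abs().sort_values(ascending=False).head(10))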

1.3 Feature Engineering

  • Append the test set to the training set (with the target column removed) so that feature engineering is applied consistently to both.
target = df_train['target']
combined = df_train.drop('target', axis=1).append(df_test)
combined.head()
--------------------------------
(Output: the first 5 rows of combined; 5 rows × 38 columns, features V0-V37.)

1.3.1 Feature correlation analysis

  • The pairplot below shows the pairwise relationships of all features; by inspection they fall into several groups (a numeric check follows the plotting code).
  • Features linearly related to V0: V1, V4, V8, V12, V27, V31, target
  • Features linearly related to V2: V6, V7, V16
  • Features linearly related to V5: V11
  • Features linearly related to V10: V36
  • Features linearly related to V15: V29
  • Features linearly related to V33: V34
features = ['V0','V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12',
        'V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24',
         'V25','V26','V27','V28','V29','V30','V31','V32','V33','V34','V35','V36',
        'V37','target']
sns.pairplot(df_train[features])
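As a rough numeric counterpart to the groups read off the pairplot, the correlation of every column with a chosen reference feature can be printed. A minimal sketch (not part of the original notebook), using V0 as an example:

# absolute correlation of every column (including target) with V0, strongest first
print(df_train.corr()['V0'].abs().sort_values(ascending=False).head(8))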

  • After this preliminary analysis of feature relationships on the training set, the results are now used to remove highly correlated features from combined.
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Process V0, V1, V4, V8, V12, V27, V31
features = ['V0','V1','V4','V8','V12','V27','V31','V16']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Process V2, V6, V7, and V16
features = ['V2','V6','V7','V16']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Delete the features whose Pearson correlation coefficient with a retained feature exceeds 0.75: V1, V4, V5, V6, V8, V16, V29, V36 (a programmatic cross-check follows the code below).

features = ['V0','V2','V7','V10','V11','V12','V15','V27','V31','V33','V34']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

rv_features = ['V1','V4','V5','V6','V8','V16','V29','V36']
combined.drop(rv_features, axis=1, inplace=True)
combined.head()
--------------------------------
(Output: the first 5 rows of combined after the drop; 5 rows × 30 columns.)
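The pairs above the 0.75 threshold can also be listed programmatically instead of being read off the heatmaps. A minimal sketch (an addition to the original workflow), run on the raw training features:

# upper triangle of the absolute correlation matrix, so each pair appears once
corr = df_train.drop('target', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high = upper.stack()
print(high[high > 0.75].sort_values(ascending=False))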

1.3.2 Consistency analysis of feature distribution

  • Features whose distributions differ between the training set and the test set hurt the model's generalization ability, so such features are removed; here V17 and V22 are dropped (a quantitative check follows the plotting code below).
# train/test here refer to the training and test slices of combined (see the split in section 1.4)
plt.figure(figsize=(42, 36))
i = 1
for feature in test.columns:
    ax = plt.subplot(6, 6, i)
    ax = sns.kdeplot(train[feature], color='r', shade=True)
    ax = sns.kdeplot(test[feature], color='k', shade=True)
    ax = ax.legend(['train', 'test'])
    i = i + 1
plt.show()
combined.drop(['V17','V22'], axis=1, inplace=True)
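The visual KDE comparison can be backed up with a two-sample Kolmogorov-Smirnov test per feature: a small p-value suggests that the feature's training and test distributions differ. A minimal sketch, not part of the original notebook, applied here to the remaining columns of combined as a sanity check:

from scipy import stats

# compare each feature's training slice against its test slice
for feature in combined.columns:
    stat, p = stats.ks_2samp(combined[:2888][feature], combined[2888:][feature])
    if p < 0.01:
        print(feature, round(stat, 3), p)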

1.3.3 Handling outliers

  • Values within three standard deviations of the mean are treated as normal; values outside this range are treated as outliers.
def find_outliers_by_3segama(data, feature):
    data_std = np.std(data[feature])
    data_mean = np.mean(data[feature])
    # 3 standard deviations
    outliers_cut_off = data_std * 3
    # lower boundary
    lower_rule = data_mean - outliers_cut_off
    # upper boundary
    upper_rule = data_mean + outliers_cut_off
    # flag outliers
    data[feature + '_outliers'] = data[feature].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

for feature in df_train.columns:
    df_train = find_outliers_by_3segama(df_train, feature)
    print(df_train[feature + '_outliers'].value_counts())
    print('=' * 50)

  • Deleting outliers (Optional)
# 'train' here is the training-feature slice of combined (see the split in section 1.4)
for fea in train.columns:
    df_train = df_train[df_train[fea + '_outliers'] == 'normal']
df_train = df_train.iloc[:, :29]
df_train.shape
--------------------------------
(2447, 29)

1.4 Model training

  • Import related modules

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer,mean_squared_error
from sklearn.preprocessing import MinMaxScaler,StandardScaler,Normalizer,PolynomialFeatures
from sklearn.pipeline import Pipeline
import time
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score,StratifiedKFold
  • Partition data set
train = combined[:2888]
test = combined[2888:]
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=42)

1.4.1 Baseline models and scoring

  • Create the regression models
lr = LinearRegression()
rgcv=RidgeCV()
eltcv=ElasticNetCV()
lasso=LassoCV()
rf =RandomForestRegressor()
gbdt=GradientBoostingRegressor()
xgb =XGBRegressor()
lgbm = LGBMRegressor()
models =[lr,rgcv,eltcv, lasso,rf,gbdt ,xgb,lgbm]
  • Data transformation: standardization, normalization, and polynomial expansion were all tried; polynomial expansion was chosen after experimentation.
poly = PolynomialFeatures()  # transformer instantiations assumed; not shown in the original snippet
# ss = StandardScaler()
# x_train = ss.fit_transform(x_train)
# x_val = ss.transform(x_val)
x_train = poly.fit_transform(x_train)
x_val = poly.transform(x_val)
# norm = Normalizer()
# x_train = norm.fit_transform(x_train)
# x_val = norm.fit_transform(x_val)
# mms = MinMaxScaler()
# x_train = mms.fit_transform(x_train)
# x_val = mms.transform(x_val)

for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val r2_score :', metrics.r2_score(y_val, predict_val))
    print('val mean_squared_error :', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')
--------------------------------
LinearRegression()
val r2_score : 0.8255887645766965
val mean_squared_error : 0.10567861144493183
**********************************
RidgeCV(alphas=array([ 0.1,  1., 10.]))
val r2_score : 0.8255973558177241
val mean_squared_error : 0.10567340587190682
**********************************
ElasticNetCV()
val r2_score : 0.826007552533511
val mean_squared_error : 0.10542486099325592
**********************************
LassoCV()
val r2_score : 0.8260362646728213
val mean_squared_error : 0.1054074638398755
**********************************
RandomForestRegressor()
val r2_score : 0.8142584766091285
val mean_squared_error : 0.11254381767306122
**********************************
GradientBoostingRegressor()
val r2_score : 0.8214384543762159
val mean_squared_error : 0.10819335206922724
**********************************
[00:29:40] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
XGBRegressor()
val r2_score : 0.8221341278990157
val mean_squared_error : 0.10777183213830037
**********************************
LGBMRegressor()
val r2_score : 0.8315534168942726
val mean_squared_error : 0.10206453134772146
**********************************

1.4.2 Adjusting hyperparameters

1.4.2.1 Ridge and Lasso model tuning
  • Based on the R² and mean squared error of the baseline models, several models were selected for further tuning. Using a pipeline of standardization, polynomial expansion, and the regression model, a grid search was run over the pipeline's parameters.
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures()),
              ('linear', RidgeCV(alphas=np.logspace(-3, 1, 10)))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures()),
              ('linear', LassoCV(alphas=np.logspace(-3, 1, 10)))]),
]
parameters = {
    "poly__degree": [3, 2, 1],
    "poly__interaction_only": [True, False],
    "poly__include_bias": [True, False],
    "linear__fit_intercept": [True, False],
}
for mode in models:
    model = GridSearchCV(mode, param_grid=parameters, cv=5, scoring='neg_mean_squared_error')
    model.fit(x_train, y_train)
    print(mode[2])
    print(model.best_params_)
    print(-model.best_score_)
    print('**************************************************************')
--------------------------------
RidgeCV(alphas=array([1.00000000e-03, 2.78255940e-03, 7.74263683e-03, 2.15443469e-02, 5.99484250e-02, 1.66810054e-01, 4.64158883e-01, 1.29154967e+00, 3.59381366e+00, 1.00000000e+01]))
{'linear__fit_intercept': True, 'poly__degree': 1, 'poly__include_bias': True, 'poly__interaction_only': True}
0.10341087538163592
**************************************************************
LassoCV(alphas=array([1.00000000e-03, 2.78255940e-03, 7.74263683e-03, 2.15443469e-02, 5.99484250e-02, 1.66810054e-01, 4.64158883e-01, 1.29154967e+00, 3.59381366e+00, 1.00000000e+01]))
{'linear__fit_intercept': False, 'poly__degree': 2, 'poly__include_bias': True, 'poly__interaction_only': False}
0.10087217338993411
**************************************************************
  • The optimal parameters are plugged into the models for training and evaluation.
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=1, include_bias=True, interaction_only=True)),
              ('linear', RidgeCV(alphas=np.logspace(-3, 1, 10), fit_intercept=True))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
              ('linear', LassoCV(alphas=np.logspace(-3, 1, 10), fit_intercept=False))]),
]
for mode in models:
    model = mode.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val mean_squared_error :', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')
--------------------------------
val mean_squared_error : 0.10561696740544291
**********************************
val mean_squared_error : 0.10668960584119808
**********************************
1.4.2.2 LightGBM parameter tuning
model_lgb = LGBMRegressor(random_state=2021)
params_dic = dict(learning_rate=[0.01, 0.1, 1],
                  n_estimators=[20, 50, 120, 300],
                  num_leaves=[10, 30],
                  max_depth=[-1, 4, 10])
grid_search = GridSearchCV(model_lgb, cv=5, param_grid=params_dic, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print(f'best params: {grid_search.best_params_}')
print(f'best score: {-grid_search.best_score_}')
--------------------------------
best params: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300, 'num_leaves': 30}
  • Evaluate the final LightGBM model on the validation set; the mean_squared_error decreases.
lgb_final = LGBMRegressor(random_state=2021, learning_rate=0.1, max_depth=5, n_estimators=200, num_leaves=50)
lgb_final.fit(x_train, y_train)
val_pred = lgb_final.predict(x_val)
print(f'mean_squared_error:{metrics.mean_squared_error(y_val, val_pred)}')
--------------------------------
mean_squared_error:0.10652795234920218
1.4.2.3 XGBoost parameter tuning
xgb_re = XGBRegressor(seed=27, learning_rate=0.1, n_estimators=300, silent=0,
                      objective='reg:linear', gamma=0, subsample=0.8,
                      colsample_bytree=0.8, nthread=4, scale_pos_weight=1)
xgb_params = {'n_estimators': [50, 100, 120],
              'min_child_weight': list(range(1, 4, 2))}
best_model = GridSearchCV(xgb_re, param_grid=xgb_params, refit=True, cv=5,
                          scoring='neg_mean_squared_error')
best_model.fit(x_train, y_train)
print('best_parameters:', best_model.best_params_)
print(f'best score: {-best_model.best_score_}')
--------------------------------
best_parameters: {'min_child_weight': 3, 'n_estimators': 120}
  • Evaluate the final XGBoost model on the validation set
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear', gamma=0.2,
                         subsample=0.5, colsample_bytree=0.8, nthread=1,
                         scale_pos_weight=1, min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
val_pred = xgb_final.predict(x_val)
print(f'mean_squared_error:{metrics.mean_squared_error(y_val, val_pred)}')
--------------------------------
mean_squared_error:0.10342130385311427

1.4.3 Model prediction and output of the result file

  • Predict on the test set and write the results to a TXT file
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear', gamma=0.2,
                         subsample=0.5, colsample_bytree=0.8, nthread=1,
                         scale_pos_weight=1, min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
# note: if x_train was polynomial-expanded above, the test features must be
# transformed the same way (e.g. poly.transform(test)) before predicting
pre_test = xgb_final.predict(test)
pred = pd.Series(pre_test)
pred.to_csv('submit.txt', sep='\t', index=False)