Preface

Thermal power generation burns fuel to heat water into steam, the steam drives a turbine, and the turbine drives a generator that produces electricity. In this process, the combustion efficiency of the boiler is central to overall generation efficiency. Many factors affect combustion efficiency, including the boiler's own operating inputs, such as the fuel feed and the secondary and return air flows, as well as its working conditions, such as bed temperature, furnace temperature, and bed pressure. This article uses an industrial steam dataset to predict the steam generated and, through that, analyze the efficiency of thermal power generation.

1.1 Jupyter setup, package imports and dataset loading

Import related modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
import sklearn
from sklearn.exceptions import ConvergenceWarning
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action ='ignore',category=ConvergenceWarning)   

To prevent garbled Chinese characters in plots, set a Chinese font for Matplotlib and Seaborn.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus'] =False
sns.set(font='SimHei')

Set the number of rows and columns displayed in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 30)

Load data.

df_train = pd.read_csv('zhengqi_train.txt',sep='\t',encoding='utf-8')
df_test = pd.read_csv('zhengqi_test.txt',sep='\t',encoding='utf-8')

1.2 Exploratory analysis

1.2.1 Analysis of data sets

1.2.1.1 Preview the dataset
  • Preview the data set.
df_train.head(5).append(df_train.tail(5))
--------------------------------
(Output: the first and last 5 rows of the training set; 10 rows × 39 columns, features V0-V37 plus target.)
1.2.1.2 Preview summary statistics
df_train.describe()
--------------------------------
(Output: count, mean, std, min, 25%, 50%, 75%, and max for each column; 8 rows × 39 columns. All columns have 2888 non-null values.)
1.2.1.3 Previewing data types
  • All 39 columns are of type float64.
df_train.info()
--------------------------------
dtypes: float64(39)
memory usage: 880.1 KB
1.2.1.4 Preview training set and test set dimensions
df_train.shape, df_test.shape
--------------------------------
((2888, 39), (1925, 38))
1.2.1.5 Number and distribution of missing values
  • No missing values were found in the training set
df_train.isnull().sum()
# missing_pct = df_all.isnull().sum() * 100 / len(df_all)
# missing = pd.DataFrame({
#     'name': df_all.columns,
#     'missing_pct': missing_pct,
# })
# missing.sort_values(by='missing_pct', ascending=False).head()
--------------------------------
      name  missing_pct
V0      V0          0.0
V29    V29          0.0
V23    V23          0.0
V24    V24          0.0
1.2.1.6 Distribution of the target value
  • The mean, median, maximum, and minimum of the target value
df_train['target'].mean(), df_train['target'].median(), df_train['target'].max(), df_train['target'].min()
--------------------------------
(0.12635283933517938, 0.313, 2.5380000000000003, -3.0439999999999996)
  • Plot the target value curve.
plt.figure()
df_train['target'].plot()
plt.ylabel('target')
plt.xlabel('id')
plt.show()

1.2.1.7 Generating a data report
pfr = pandas_profiling.ProfileReport(df_train)
pfr.to_file("./example.html")
--------------------------------
Summarize dataset: 100% [02:42<00:00, 3.06s/it, Completed]
Generate report structure: 100% 1/1 [00:11<00:00, 11.36s/it]
Render HTML: 100% 1/1 [00:32<00:00, 32.81s/it]
Export report to file: 100% 1/1 [00:32<00:00, 3.85it/s]

1.2.2 Feature distribution curves

  • The plots show whether each feature is approximately normally distributed (a quick numeric check follows the plotting code below).
  • The values of V9, V17, V18, V22, V23, V24, V28, and V35 are unevenly distributed.
  • The plots of V14, V17, V19, V22, V24, and V28 show multiple extreme values.
df = pd.melt(df_train,value_vars=df_train.columns)
sp=sns.FacetGrid(df,col='variable',col_wrap=5,sharex=False,sharey=False)
sp=sp.map(sns.distplot,'value',color='m',rug=True)
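The normality observations above can also be checked numerically. As a minimal sketch (an addition, not part of the original notebook), the skewness of each training feature can be printed; values far from zero indicate a clearly non-normal distribution.

# skewness of each training feature, most skewed first
print(df_train.drop('target', axis=1).skew().abs().sort_values(ascending=False).head(10))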

1.3 Feature Engineering

  • Append the test set to the training set (with the target column removed) so that feature engineering is applied consistently to both.
target = df_train['target']
combined = df_train.drop('target', axis=1).append(df_test)
combined.head()
--------------------------------
(Output: the first 5 rows of combined; 5 rows × 38 columns, features V0-V37.)

1.3.1 Feature correlation analysis

  • The pairplot below shows the pairwise relationships of all features; by inspection they fall into several groups (a numeric check follows the plotting code).
  • Features linearly related to V0: V1, V4, V8, V12, V27, V31, target
  • Features linearly related to V2: V6, V7, V16
  • Features linearly related to V5: V11
  • Features linearly related to V10: V36
  • Features linearly related to V15: V29
  • Features linearly related to V33: V34
features = ['V0','V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12',
        'V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24',
         'V25','V26','V27','V28','V29','V30','V31','V32','V33','V34','V35','V36',
        'V37','target']
sns.pairplot(df_train[features])
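As a rough numeric counterpart to the groups read off the pairplot, the correlation of every column with a chosen reference feature can be printed. A minimal sketch (not part of the original notebook), using V0 as an example:

# absolute correlation of every column (including target) with V0, strongest first
print(df_train.corr()['V0'].abs().sort_values(ascending=False).head(8))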

  • After this preliminary analysis of feature relationships on the training set, the results are now used to remove highly correlated features from combined.
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Process V0, V1, V4, V8, V12, V27, V31
features = ['V0','V1','V4','V8','V12','V27','V31','V16']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Process V2, V6, V7, and V16
features = ['V2','V6','V7','V16']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Delete the features whose Pearson correlation coefficient with a retained feature exceeds 0.75: V1, V4, V5, V6, V8, V16, V29, V36 (a programmatic cross-check follows the code below).

features = ['V0','V2','V7','V10','V11','V12','V15','V27','V31','V33','V34']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

rv_features = ['V1','V4','V5','V6','V8','V16','V29','V36']
combined.drop(rv_features, axis=1, inplace=True)
combined.head()
--------------------------------
(Output: the first 5 rows of combined after the drop; 5 rows × 30 columns.)
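The pairs above the 0.75 threshold can also be listed programmatically instead of being read off the heatmaps. A minimal sketch (an addition to the original workflow), run on the raw training features:

# upper triangle of the absolute correlation matrix, so each pair appears once
corr = df_train.drop('target', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high = upper.stack()
print(high[high > 0.75].sort_values(ascending=False))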

1.3.2 Consistency analysis of feature distribution

  • Features whose distributions differ between the training set and the test set hurt the model's generalization ability, so such features are removed; here V17 and V22 are dropped (a quantitative check follows the plotting code below).
# train/test here refer to the training and test slices of combined (see the split in section 1.4)
plt.figure(figsize=(42, 36))
i = 1
for feature in test.columns:
    ax = plt.subplot(6, 6, i)
    ax = sns.kdeplot(train[feature], color='r', shade=True)
    ax = sns.kdeplot(test[feature], color='k', shade=True)
    ax = ax.legend(['train', 'test'])
    i = i + 1
plt.show()
combined.drop(['V17','V22'], axis=1, inplace=True)
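The visual KDE comparison can be backed up with a two-sample Kolmogorov-Smirnov test per feature: a small p-value suggests that the feature's training and test distributions differ. A minimal sketch, not part of the original notebook, applied here to the remaining columns of combined as a sanity check:

from scipy import stats

# compare each feature's training slice against its test slice
for feature in combined.columns:
    stat, p = stats.ks_2samp(combined[:2888][feature], combined[2888:][feature])
    if p < 0.01:
        print(feature, round(stat, 3), p)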

1.3.3 Handling outliers

  • Values within three standard deviations of the mean are treated as normal; values outside this range are treated as outliers.
def find_outliers_by_3segama(data, feature):
    data_std = np.std(data[feature])
    data_mean = np.mean(data[feature])
    # 3 standard deviations
    outliers_cut_off = data_std * 3
    # lower boundary
    lower_rule = data_mean - outliers_cut_off
    # upper boundary
    upper_rule = data_mean + outliers_cut_off
    # flag outliers
    data[feature + '_outliers'] = data[feature].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

for feature in df_train.columns:
    df_train = find_outliers_by_3segama(df_train, feature)
    print(df_train[feature + '_outliers'].value_counts())
    print('=' * 50)

  • Deleting outliers (Optional)
# 'train' here is the training-feature slice of combined (see the split in section 1.4)
for fea in train.columns:
    df_train = df_train[df_train[fea + '_outliers'] == 'normal']
df_train = df_train.iloc[:, :29]
df_train.shape
--------------------------------
(2447, 29)

1.4 Model training

  • Import related modules

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer,mean_squared_error
from sklearn.preprocessing import MinMaxScaler,StandardScaler,Normalizer,PolynomialFeatures
from sklearn.pipeline import Pipeline
import time
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score,StratifiedKFold
  • Partition data set
train = combined[:2888]
test = combined[2888:]
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=42)

1.4.1 Baseline models and scoring

  • Create the regression models
lr = LinearRegression()
rgcv=RidgeCV()
eltcv=ElasticNetCV()
lasso=LassoCV()
rf =RandomForestRegressor()
gbdt=GradientBoostingRegressor()
xgb =XGBRegressor()
lgbm = LGBMRegressor()
models =[lr,rgcv,eltcv, lasso,rf,gbdt ,xgb,lgbm]
  • Data transformation: standardization, normalization, and polynomial expansion were all tried; polynomial expansion was chosen after experimentation.
poly = PolynomialFeatures()  # transformer instantiations assumed; not shown in the original snippet
# ss = StandardScaler()
# x_train = ss.fit_transform(x_train)
# x_val = ss.transform(x_val)
x_train = poly.fit_transform(x_train)
x_val = poly.transform(x_val)
# norm = Normalizer()
# x_train = norm.fit_transform(x_train)
# x_val = norm.fit_transform(x_val)
# mms = MinMaxScaler()
# x_train = mms.fit_transform(x_train)
# x_val = mms.transform(x_val)

for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val r2_score :', metrics.r2_score(y_val, predict_val))
    print('val mean_squared_error :', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')
--------------------------------
LinearRegression()
val r2_score : 0.8255887645766965
val mean_squared_error : 0.10567861144493183
**********************************
RidgeCV(alphas=array([ 0.1,  1., 10.]))
val r2_score : 0.8255973558177241
val mean_squared_error : 0.10567340587190682
**********************************
ElasticNetCV()
val r2_score : 0.826007552533511
val mean_squared_error : 0.10542486099325592
**********************************
LassoCV()
val r2_score : 0.8260362646728213
val mean_squared_error : 0.1054074638398755
**********************************
RandomForestRegressor()
val r2_score : 0.8142584766091285
val mean_squared_error : 0.11254381767306122
**********************************
GradientBoostingRegressor()
val r2_score : 0.8214384543762159
val mean_squared_error : 0.10819335206922724
**********************************
[00:29:40] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
XGBRegressor()
val r2_score : 0.8221341278990157
val mean_squared_error : 0.10777183213830037
**********************************
LGBMRegressor()
val r2_score : 0.8315534168942726
val mean_squared_error : 0.10206453134772146
**********************************

1.4.2 Adjusting hyperparameters

1.4.2.1 Ridge and Lasso model tuning
  • Based on the R² and mean squared error of the baseline models, several models were selected for further tuning. Using a pipeline of standardization, polynomial expansion, and the regression model, a grid search was run over the pipeline's parameters.
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures()),
              ('linear', RidgeCV(alphas=np.logspace(-3, 1, 10)))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures()),
              ('linear', LassoCV(alphas=np.logspace(-3, 1, 10)))]),
]
parameters = {
    "poly__degree": [3, 2, 1],
    "poly__interaction_only": [True, False],
    "poly__include_bias": [True, False],
    "linear__fit_intercept": [True, False],
}
for mode in models:
    model = GridSearchCV(mode, param_grid=parameters, cv=5, scoring='neg_mean_squared_error')
    model.fit(x_train, y_train)
    print(mode[2])
    print(model.best_params_)
    print(-model.best_score_)
    print('**************************************************************')
--------------------------------
RidgeCV(alphas=array([1.00000000e-03, 2.78255940e-03, 7.74263683e-03, 2.15443469e-02, 5.99484250e-02, 1.66810054e-01, 4.64158883e-01, 1.29154967e+00, 3.59381366e+00, 1.00000000e+01]))
{'linear__fit_intercept': True, 'poly__degree': 1, 'poly__include_bias': True, 'poly__interaction_only': True}
0.10341087538163592
**************************************************************
LassoCV(alphas=array([1.00000000e-03, 2.78255940e-03, 7.74263683e-03, 2.15443469e-02, 5.99484250e-02, 1.66810054e-01, 4.64158883e-01, 1.29154967e+00, 3.59381366e+00, 1.00000000e+01]))
{'linear__fit_intercept': False, 'poly__degree': 2, 'poly__include_bias': True, 'poly__interaction_only': False}
0.10087217338993411
**************************************************************
  • The optimal parameters are plugged into the models for training and evaluation.
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=1, include_bias=True, interaction_only=True)),
              ('linear', RidgeCV(alphas=np.logspace(-3, 1, 10), fit_intercept=True))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
              ('linear', LassoCV(alphas=np.logspace(-3, 1, 10), fit_intercept=False))]),
]
for mode in models:
    model = mode.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val mean_squared_error :', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')
--------------------------------
val mean_squared_error : 0.10561696740544291
**********************************
val mean_squared_error : 0.10668960584119808
**********************************
1.4.2.2 LightGBM parameter tuning
model_lgb = LGBMRegressor(random_state=2021)
params_dic = dict(learning_rate=[0.01, 0.1, 1],
                  n_estimators=[20, 50, 120, 300],
                  num_leaves=[10, 30],
                  max_depth=[-1, 4, 10])
grid_search = GridSearchCV(model_lgb, cv=5, param_grid=params_dic, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print(f'best params: {grid_search.best_params_}')
print(f'best score: {-grid_search.best_score_}')
--------------------------------
best params: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300, 'num_leaves': 30}
  • Evaluate the final LightGBM model on the validation set; the mean_squared_error decreases.
lgb_final = LGBMRegressor(random_state=2021, learning_rate=0.1, max_depth=5, n_estimators=200, num_leaves=50)
lgb_final.fit(x_train, y_train)
val_pred = lgb_final.predict(x_val)
print(f'mean_squared_error:{metrics.mean_squared_error(y_val, val_pred)}')
--------------------------------
mean_squared_error:0.10652795234920218
1.4.2.3 XGBoost parameter tuning
xgb_re = XGBRegressor(seed=27, learning_rate=0.1, n_estimators=300, silent=0,
                      objective='reg:linear', gamma=0, subsample=0.8,
                      colsample_bytree=0.8, nthread=4, scale_pos_weight=1)
xgb_params = {'n_estimators': [50, 100, 120],
              'min_child_weight': list(range(1, 4, 2))}
best_model = GridSearchCV(xgb_re, param_grid=xgb_params, refit=True, cv=5,
                          scoring='neg_mean_squared_error')
best_model.fit(x_train, y_train)
print('best_parameters:', best_model.best_params_)
print(f'best score: {-best_model.best_score_}')
--------------------------------
best_parameters: {'min_child_weight': 3, 'n_estimators': 120}
  • Evaluate the final XGBoost model on the validation set
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear', gamma=0.2,
                         subsample=0.5, colsample_bytree=0.8, nthread=1,
                         scale_pos_weight=1, min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
val_pred = xgb_final.predict(x_val)
print(f'mean_squared_error:{metrics.mean_squared_error(y_val, val_pred)}')
--------------------------------
mean_squared_error:0.10342130385311427

1.4.3 Model prediction and output of the result file

  • Predict on the test set and write the results to a TXT file
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear', gamma=0.2,
                         subsample=0.5, colsample_bytree=0.8, nthread=1,
                         scale_pos_weight=1, min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
# note: if x_train was polynomial-expanded above, the test features must be
# transformed the same way (e.g. poly.transform(test)) before predicting
pre_test = xgb_final.predict(test)
pred = pd.Series(pre_test)
pred.to_csv('submit.txt', sep='\t', index=False)