Introduction to Data Mining — A Case Study of Used Car Price Forecasting
Author: Zhang Jie
Steps for data mining
- Data Analysis
- Feature Engineering
- Feature Selection
- Model Building
- Model Deployment
1. Data Analysis
For data analysis, the following points need to be explored:
- Missing values
- All the numerical variables
- Distribution of the numerical variables
- Categorical variables
- Cardinality of categorical variables
- Outliers
- Relationship between the independent features and the dependent feature (price)
2. Feature Engineering
Data and features determine the upper limit of machine learning, and models and algorithms only approximate this upper limit. So what exactly is feature engineering? As the name implies, it is essentially an engineering activity designed to extract features from raw data to the maximum extent possible for use by algorithms and models.
The following problems may occur in feature engineering:
- Features are not on the same scale: different features have different units and value ranges, so they cannot be compared directly;
- Qualitative features cannot be used directly: some machine learning algorithms and models only accept quantitative inputs, so qualitative features need to be converted into quantitative ones. The simplest approach is to assign a numeric value to each qualitative value, but this introduces too much arbitrariness and increases the tuning effort. Qualitative features are therefore usually converted with dummy coding: if a feature has N possible qualitative values, it is expanded into N binary features; when the original value is the i-th qualitative value, the i-th expanded feature is 1 and the others are 0 (see the sketch below). Compared with directly assigning numbers, dummy coding requires no extra tuning, and for linear models it can even introduce nonlinear effects.
- Missing values: Missing values need to be supplemented.
- Low information utilization: Different machine learning algorithms and models make different use of information in data. As mentioned above, in linear models, the use of dummy coding for qualitative features can achieve nonlinear effects. Similarly, polynomials of quantitative variables, or other transformations, can achieve nonlinear effects.
In particular, attention needs to be paid to key issues such as missing-value handling, outlier handling, data normalization and data encoding.
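As a minimal sketch of dummy coding with pandas (toy data, not from the competition dataset; the column name fuel is hypothetical):

```python
import pandas as pd

# A toy qualitative feature with three possible values
df = pd.DataFrame({'fuel': ['gasoline', 'diesel', 'gasoline', 'electric']})

# get_dummies expands the single column into one binary column per value;
# each row has exactly one of the expanded columns switched on
dummies = pd.get_dummies(df['fuel'], prefix='fuel')
print(dummies.columns.tolist())  # ['fuel_diesel', 'fuel_electric', 'fuel_gasoline']
```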
3. Feature Selection
Feature selection means choosing the feature variables that actually improve model performance. Machine learning and statistical methods can be used to pick out the most relevant features and discard the rest; a minimal sketch follows.
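For example, a simple filter-style selection with scikit-learn might look like the sketch below; the toy data and the choice of k are purely illustrative and are not what is used later in this post:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data standing in for the real feature matrix and price target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 10)),
                 columns=['feat_{}'.format(i) for i in range(10)])
y = 3 * X['feat_0'] - 2 * X['feat_3'] + rng.normal(size=100)

# Keep the k features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()].tolist())  # should include feat_0 and feat_3
```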
4. Model Building
In the modeling stage, you can usually choose between classical machine learning models and deep learning models. In particular, model ensembling often brings unexpectedly good results; a simple averaging sketch is shown below.
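As an illustration only (the model actually trained later in this post is a neural network), two scikit-learn regressors can be blended by averaging their predictions on toy data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy regression data standing in for the real features and price
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_trn, y_trn)
gbdt = GradientBoostingRegressor(random_state=0).fit(X_trn, y_trn)

# A simple equal-weight blend of the two models' predictions
blend = 0.5 * rf.predict(X_val) + 0.5 * gbdt.predict(X_val)
print('RF MAE:   ', mean_absolute_error(y_val, rf.predict(X_val)))
print('GBDT MAE: ', mean_absolute_error(y_val, gbdt.predict(X_val)))
print('Blend MAE:', mean_absolute_error(y_val, blend))
```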
Problem analysis
The data
The task of the competition is to predict the transaction price of used cars. The data comes from used-car transaction records of a trading platform, with more than 400,000 records in total and 31 columns of variables, 15 of which are anonymous. To ensure fairness, 150,000 records are taken as the training set and 50,000 as the test set, and fields such as name, model, brand and regionCode are desensitized. Data link: [tianchi.aliyun.com/competition…].
Evaluation metric
The evaluation metric is MAE (Mean Absolute Error).
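MAE is the average absolute difference between predicted and true prices, MAE = (1/n) * Σ|y_i - ŷ_i|, and sklearn computes it directly; the numbers below are made up for illustration:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3000, 5200, 14000]   # hypothetical true prices
y_pred = [2800, 5000, 15000]   # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # (200 + 200 + 1000) / 3 ≈ 466.67
```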
Import basic modules
# Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
from tqdm import tqdm
import itertools
warnings.filterwarnings('ignore')
%matplotlib inline
## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
## Data dimension reduction processing
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
## parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import scipy.signal as signal
Data analysis and feature engineering
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type to reduce memory usage."""
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
Train_data = reduce_mem_usage(pd.read_csv('used_car_train_20200313.csv', sep=' '))
Test_data = reduce_mem_usage(pd.read_csv('used_car_testB_20200421.csv', sep=' '))
Output data size information
print('Train data shape:',Train_data.shape)
print('TestA data shape:',Test_data.shape)
# merge datasets
concat_data = pd.concat([Train_data,Test_data])
concat_data.isnull().sum()
Here we find that bodyType, fuelType and gearbox have many missing values, model is missing only one row, and price is the target (missing only for the test rows), so no extra processing is done here.
Analyze the v-series anonymous features and the non-v-series features
For the anonymous variables only numerical information is available, so it deserves extra attention. The variables are split into anonymous and non-anonymous groups and analyzed separately.
concat_data.columns
First, the non-anonymous variables are extracted for analysis, and 10 rows are randomly sampled.
concat_data[['bodyType', 'brand', 'creatDate', 'fuelType', 'gearbox', 'kilometer', 'model', 'name', 'notRepairedDamage', 'offerType', 'power', 'regDate', 'regionCode', 'seller']].sample(10)
concat_data[['bodyType', 'brand', 'creatDate', 'fuelType', 'gearbox', 'kilometer', 'model', 'name', 'notRepairedDamage', 'offerType', 'power', 'regDate', 'regionCode', 'seller']].describe()
The column notRepairedDamage is found to contain the abnormal value "-", which is replaced with the mode (0).
concat_data['notRepairedDamage'].value_counts()
concat_data['notRepairedDamage'] = concat_data['notRepairedDamage'].replace('-', 0).astype('float16')
Then we continue to analyze anonymous variables
concat_data[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']].sample(10)
concat_data[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']].describe()
For missing values, simply fill them with mode first. After filling, the data no longer contains missing values.
concat_data = concat_data.fillna(concat_data.mode().iloc[0,:])
print('concat_data shape:',concat_data.shape)
concat_data.isnull().sum()
One-hot encoding of discrete values
For each feature with m possible values, one-hot encoding turns it into m binary features (for example, a feature taking the values good / medium / poor becomes 100, 010, 001). These binary features are mutually exclusive, with only one active at a time, so the data becomes sparse. The main benefits are:
- It solves the problem that classifiers cannot handle categorical attributes well
- To a certain extent, it also expands the feature space
Reference links [www.cnblogs.com/zongfa/p/93]…
The distribution of values can be plotted using df.value_counts().plot.bar
def plot_discrete_bar(data):
    cnt = data.value_counts()
    p1 = plt.bar(cnt.index, height=list(cnt), width=0.8)
    for x, y in zip(cnt.index, list(cnt)):
        plt.text(x + 0.05, y + 0.05, '%.2f' % y, ha='center', va='bottom')
clo_list = ['bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
i = 1
fig = plt.figure(figsize=(8, 8))
for col in clo_list:
    plt.subplot(2, 2, i)
    plot_discrete_bar(concat_data[col])
    i = i + 1
One-hot encoding was used for features with few categories, and the number of features after encoding was changed from 31 to 50.
one_hot_list = ['gearbox', 'notRepairedDamage', 'bodyType', 'fuelType']
for col in one_hot_list:
    one_hot = pd.get_dummies(concat_data[col])
    one_hot.columns = [col + '_' + str(i) for i in range(len(one_hot.columns))]
    concat_data = pd.concat([concat_data, one_hot], axis=1)
It is found that seller and offerType should be binary variables, but their distributions are almost entirely concentrated on a single value, so they can be dropped directly.
concat_data['seller'].value_counts()
concat_data['offerType'].value_counts()
concat_data.drop(['offerType', 'seller'], axis=1, inplace=True)
For the anonymous variables, we want to exploit more of their numerical information, so the feature space is expanded by adding pairs of anonymous variables together and multiplying selected non-anonymous variables with anonymous variables.
for i in ['v_' + str(t) for t in range(14)]:
    for j in ['v_' + str(k) for k in range(int(i[2:]) + 1, 15)]:
        concat_data[str(i) + '+' + str(j)] = concat_data[str(i)] + concat_data[str(j)]

for i in ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode']:
    for j in ['v_' + str(t) for t in range(14)]:
        concat_data[str(i) + '*' + str(j)] = concat_data[i] * concat_data[j]
concat_data.shape
Date data processing
Date fields are also important data with concrete practical meaning. Here we first parse the dates and extract their components, and then analyze each component.
# set the date format to 2016-04-04, where the month ranges from 1 to 12
def date_proc(x):
    m = int(x[4:6])
    if m == 0:
        m = 1
    return x[:4] + '-' + str(m) + '-' + x[6:]

# define date extraction function
def date_transform(df, fea_col):
    for f in tqdm(fea_col):
        df[f] = pd.to_datetime(df[f].astype('str').apply(date_proc))
        df[f + '_year'] = df[f].dt.year
        df[f + '_month'] = df[f].dt.month
        df[f + '_day'] = df[f].dt.day
        df[f + '_dayofweek'] = df[f].dt.dayofweek
    return df
Extract date information
date_cols = ['regDate', 'creatDate']
concat_data = date_transform(concat_data,date_cols)
Next, we use the date fields to construct additional features, for example the usage time: data['creatDate'] - data['regDate']. Some dates in the data are malformed, so errors='coerce' is needed when parsing.
data = concat_data.copy()
# Count usage days
data['used_time1'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data['used_time2'] = (pd.Timestamp.now() - pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data['used_time3'] = (pd.Timestamp.now() - pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce')).dt.days
# Bucketing: split values into intervals
def cut_group(df, cols, num_bins=50):
    for col in cols:
        all_range = int(df[col].max() - df[col].min())
        # print(all_range)
        bin = [i * all_range / num_bins for i in range(all_range)]
        df[col + '_bin'] = pd.cut(df[col], bin, labels=False)  # use pd.cut to bin the column
    return df
# Bucketing
cut_cols = ['used_time1', 'used_time2', 'used_time3']
data = cut_group(data, cut_cols, 50)
# Bucketing kilometer
data = cut_group(data, ['kilometer'], 10)
The year and month components are then processed, again with one-hot encoding.
data['creatDate_year'].value_counts()
data['creatDate_month'].value_counts()
# data['regDate_year'].value_counts()
# One-hot encoding for features with fewer categories
one_hot_list = ['creatDate_year', 'creatDate_month', 'regDate_month', 'regDate_year']
for col in one_hot_list:
    one_hot = pd.get_dummies(data[col])
    one_hot.columns = [col + '_' + str(i) for i in range(len(one_hot.columns))]
    data = pd.concat([data, one_hot], axis=1)
Delete unusable SaleID
data.drop(['SaleID'],axis=1,inplace=True)
Increasing the number of features
The number of features can be increased further by computing statistical characteristics of selected variables, which adds new dimensions to the data.
# count coding
def count_coding(df, fea_col):
    for f in fea_col:
        df[f + '_count'] = df[f].map(df[f].value_counts())
    return df
# count coding
count_list = ['model', 'brand', 'regionCode', 'bodyType', 'fuelType', 'name', 'regDate_year', 'regDate_month', 'regDate_day', 'regDate_dayofweek', 'creatDate_month', 'creatDate_day', 'creatDate_dayofweek', 'kilometer']
data = count_coding(data,count_list)
A heatmap is drawn to analyze the correlation between the anonymous variables and price; v_0, v_8 and v_12 show the highest correlations.
temp = Train_data[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'price']]
# Zoomed heatmap, correlation matrix
sns.set(rc={'figure.figsize': (8, 6)})
correlation_matrix = temp.corr()
k = 8  # number of variables for heatmap
cols = correlation_matrix.nlargest(k, 'price')['price'].index
cm = np.corrcoef(temp[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
# Define cross feature statistics
def cross_cat_num(df, num_col, cat_col):
    for f1 in tqdm(cat_col):  # iterate over the categorical features
        g = df.groupby(f1, as_index=False)
        for f2 in tqdm(num_col):  # iterate over the numerical features
            feat = g[f2].agg({
                '{}_{}_max'.format(f1, f2): 'max',
                '{}_{}_min'.format(f1, f2): 'min',
                '{}_{}_median'.format(f1, f2): 'median',
            })
            df = df.merge(feat, on=f1, how='left')
    return df
# Use numerical features to conduct statistical characterization of category features and select several anonymous features with the highest correlation with price
cross_cat = ['model', 'brand', 'regDate_year']
cross_num = ['v_0', 'v_3', 'v_4', 'v_8', 'v_12', 'power']
data = cross_cat_num(data, cross_num, cross_cat)  # first-order crosses
Partition data set
## Select the feature column
numerical_cols = data.columns
feature_cols = [col for col in numerical_cols if col not in ['price']]
## Extract the feature columns and the label column to construct training and test samples
X_data = data.iloc[:len(Train_data),:][feature_cols]
Y_data = Train_data['price']
X_test = data.iloc[len(Train_data):,:][feature_cols]
print("X_data: ",X_data.shape)
print("X_test: ",X_test.shape)
Mean encoding: data preprocessing for high-cardinality qualitative (categorical) features
Cardinality refers to the number of distinct values a qualitative feature can take. For high-cardinality qualitative features, the usual preprocessing methods (such as label encoding and one-hot encoding) often fail to give satisfactory results.
Examples of high-cardinality qualitative features: IP address, email domain name, city name, home address, street, product number.
Main reasons:
- LabelEncoder encodes a high-cardinality qualitative feature into a single column, but each natural number it assigns carries a different meaning, and with respect to y the result is linearly inseparable. A simple model is prone to underfitting and cannot fully capture the differences between categories, while a complex model easily overfits elsewhere.
- OneHotEncoder applied to high-cardinality qualitative features inevitably produces a sparse matrix with tens of thousands of columns, which consumes a lot of memory and training time, unless the algorithm has specific optimizations for sparse input (e.g. SVM).
Therefore, we can try mean encoding, which uses the target variable within a Bayesian framework to determine the most suitable numeric encoding for the qualitative feature. It is also a common way to boost scores in Kaggle competitions. Reference link: [blog.csdn.net/juzexia/art…].
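The core idea, as implemented by the MeanEncoder class below, is to shrink each category's target mean toward the global (prior) mean, with a weight that grows as the category gets rarer. The constants k and f mirror the class's default prior_weight_func and are only illustrative:

```python
import numpy as np

def smoothed_target_mean(category_mean, category_size, global_mean, k=2, f=1):
    # lam -> 1 for rare categories (trust the prior), lam -> 0 for frequent ones
    lam = 1 / (1 + np.exp((category_size - k) / f))
    return lam * global_mean + (1 - lam) * category_mean

# A category seen only 3 times is pulled noticeably toward the global mean
print(smoothed_target_mean(category_mean=9000, category_size=3, global_mean=5900))
```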
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold,KFold
from itertools import product
class MeanEncoder:
def __init__(self, categorical_features, n_splits=10, target_type='classification', prior_weight_func=None):
""" :param categorical_features: list of str, the name of the categorical columns to encode :param n_splits: the number of splits used in mean encoding :param target_type: str, 'regression' or 'classification' :param prior_weight_func: a function that takes in the number of observations, and outputs prior weight when a dict is passed, the default exponential decay function will be used: k: the number of observations needed for the posterior to be weighted equally as the prior f: larger f --> smaller slope """
self.categorical_features = categorical_features
self.n_splits = n_splits
self.learned_stats = {}
if target_type == 'classification':
self.target_type = target_type
self.target_values = []
else:
self.target_type = 'regression'
self.target_values = None
if isinstance(prior_weight_func, dict):
self.prior_weight_func = eval('lambda x: 1 / (1 + np.exp((x - k) / f))', dict(prior_weight_func, np=np))
elif callable(prior_weight_func):
self.prior_weight_func = prior_weight_func
else:
self.prior_weight_func = lambda x: 1 / (1 + np.exp((x - 2) / 1))
@staticmethod
def mean_encode_subroutine(X_train, y_train, X_test, variable, target, prior_weight_func):
X_train = X_train[[variable]].copy()
X_test = X_test[[variable]].copy()
if target is not None:
nf_name = '{}_pred_{}'.format(variable, target)
X_train['pred_temp'] = (y_train == target).astype(int) # classification
else:
nf_name = '{}_pred'.format(variable)
X_train['pred_temp'] = y_train # regression
prior = X_train['pred_temp'].mean()
col_avg_y = X_train.groupby(by=variable, axis=0)['pred_temp'].agg({'mean': 'mean', 'beta': 'size'})
col_avg_y['beta'] = prior_weight_func(col_avg_y['beta'])
col_avg_y[nf_name] = col_avg_y['beta'] * prior + (1 - col_avg_y['beta']) * col_avg_y['mean']
col_avg_y.drop(['beta', 'mean'], axis=1, inplace=True)
nf_train = X_train.join(col_avg_y, on=variable)[nf_name].values
nf_test = X_test.join(col_avg_y, on=variable).fillna(prior, inplace=False)[nf_name].values
return nf_train, nf_test, prior, col_avg_y
def fit_transform(self, X, y):
""" :param X: pandas DataFrame, n_samples * n_features :param y: pandas Series or numpy array, n_samples :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features """
X_new = X.copy()
if self.target_type == 'classification':
skf = StratifiedKFold(self.n_splits)
else:
skf = KFold(self.n_splits)
if self.target_type == 'classification':
self.target_values = sorted(set(y))
self.learned_stats = {'{}_pred_{}'.format(variable, target): [] for variable, target in
product(self.categorical_features, self.target_values)}
for variable, target in product(self.categorical_features, self.target_values):
nf_name = '{}_pred_{}'.format(variable, target)
X_new.loc[:, nf_name] = np.nan
for large_ind, small_ind in skf.split(y, y):
nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, target, self.prior_weight_func)
X_new.iloc[small_ind, - 1] = nf_small
self.learned_stats[nf_name].append((prior, col_avg_y))
else:
self.learned_stats = {'{}_pred'.format(variable): [] for variable in self.categorical_features}
for variable in self.categorical_features:
nf_name = '{}_pred'.format(variable)
X_new.loc[:, nf_name] = np.nan
for large_ind, small_ind in skf.split(y, y):
nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, None, self.prior_weight_func)
X_new.iloc[small_ind, - 1] = nf_small
self.learned_stats[nf_name].append((prior, col_avg_y))
return X_new
def transform(self, X):
""" :param X: pandas DataFrame, n_samples * n_features :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features """
X_new = X.copy()
if self.target_type == 'classification':
for variable, target in product(self.categorical_features, self.target_values):
nf_name = '{}_pred_{}'.format(variable, target)
X_new[nf_name] = 0
for prior, col_avg_y in self.learned_stats[nf_name]:
X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
nf_name]
X_new[nf_name] /= self.n_splits
else:
for variable in self.categorical_features:
nf_name = '{}_pred'.format(variable)
X_new[nf_name] = 0
for prior, col_avg_y in self.learned_stats[nf_name]:
X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
nf_name]
X_new[nf_name] /= self.n_splits
return X_new
# High-cardinality qualitative features: name (car transaction name), brand (car brand), regionCode
class_list = ['model', 'brand', 'name', 'regionCode'] + date_cols  # date_cols = ['regDate', 'creatDate']
MeanEnocodeFeature = class_list  # declare the features that require mean encoding
ME = MeanEncoder(MeanEnocodeFeature, target_type='regression')  # declare the mean encoding class
X_data = ME.fit_transform(X_data, Y_data)  # fit on the training set's X and y
X_test = ME.transform(X_test)  # encode the test set
X_data['price'] = Train_data['price']
from sklearn.model_selection import KFold
For regression problems there are relatively more options for target encoding: besides the mean, statistics such as the standard deviation and the median can also be encoded.
enc_cols = []
stats_default_dict = {
'max': X_data['price'].max(),
'min': X_data['price'].min(),
'median': X_data['price'].median(),
'mean': X_data['price'].mean(),
'sum': X_data['price'].sum(),
'std': X_data['price'].std(),
'skew': X_data['price'].skew(),
'kurt': X_data['price'].kurt(),
'mad': X_data['price'].mad()
}
### Select these three encodings for the moment
enc_stats = ['max', 'min', 'mean']
skf = KFold(n_splits=10, shuffle=True, random_state=42)
for f in tqdm(['regionCode', 'brand', 'regDate_year', 'creatDate_year', 'kilometer', 'model']):
enc_dict = {}
for stat in enc_stats:
enc_dict['{}_target_{}'.format(f, stat)] = stat
X_data['{}_target_{}'.format(f, stat)] = 0
X_test['{}_target_{}'.format(f, stat)] = 0
enc_cols.append('{}_target_{}'.format(f, stat))
for i, (trn_idx, val_idx) in enumerate(skf.split(X_data, Y_data)):
trn_x, val_x = X_data.iloc[trn_idx].reset_index(drop=True), X_data.iloc[val_idx].reset_index(drop=True)
enc_df = trn_x.groupby(f, as_index=False) ['price'].agg(enc_dict)
val_x = val_x[[f]].merge(enc_df, on=f, how='left')
test_x = X_test[[f]].merge(enc_df, on=f, how='left')
for stat in enc_stats:
val_x['{}_target_{}'.format(f, stat)] = val_x['{}_target_{}'.format(f, stat)].fillna(stats_default_dict[stat])
test_x['{}_target_{}'.format(f, stat)] = test_x['{}_target_{}'.format(f, stat)].fillna(stats_default_dict[stat])
X_data.loc[val_idx, '{}_target_{}'.format(f, stat)] = val_x['{}_target_{}'.format(f, stat)].values
X_test['{}_target_{}'.format(f, stat)] += test_x['{}_target_{}'.format(f, stat)].values / skf.n_splits
drop_list = ['regDate', 'creatDate', 'brand_power_min', 'regDate_year_power_min']
x_train = X_data.drop(drop_list+['price'],axis=1)
x_test = X_test.drop(drop_list,axis=1)
x_train.shape
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
The data is normalized with MinMaxScaler and then reduced in dimensionality with PCA.
from sklearn.preprocessing import MinMaxScaler
# Feature normalization
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(pd.concat([x_train,x_test]).values)
all_data = min_max_scaler.transform(pd.concat([x_train,x_test]).values)
print(all_data.shape)
from sklearn import decomposition
pca = decomposition.PCA(n_components=400)
all_pca = pca.fit_transform(all_data)
X_pca = all_pca[:len(x_train)]
test = all_pca[len(x_train):]
y = Train_data['price'].values
print(all_pca.shape)
Model selection
Here, Keras is used to build a basic neural network model; the architecture is a fully connected network.
import keras
from keras.layers import Conv1D, Activation, MaxPool1D, Flatten, Dense
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout, merge, Add

def NN_model(input_dim):
    init = keras.initializers.glorot_uniform(seed=1)
    model = keras.models.Sequential()
    model.add(Dense(units=300, input_dim=input_dim, kernel_initializer=init, activation='softplus'))
    # model.add(Dropout(0.2))
    model.add(Dense(units=300, kernel_initializer=init, activation='softplus'))
    # model.add(Dropout(0.2))
    model.add(Dense(units=64, kernel_initializer=init, activation='softplus'))
    model.add(Dense(units=32, kernel_initializer=init, activation='softplus'))
    model.add(Dense(units=8, kernel_initializer=init, activation='softplus'))
    model.add(Dense(units=1))
    return model
from keras.callbacks import Callback, EarlyStopping
class Metric(Callback):
def __init__(self, model, callbacks, data):
super().__init__()
self.model = model
self.callbacks = callbacks
self.data = data
def on_train_begin(self, logs=None):
for callback in self.callbacks:
callback.on_train_begin(logs)
def on_train_end(self, logs=None):
for callback in self.callbacks:
callback.on_train_end(logs)
def on_epoch_end(self, batch, logs=None):
X_train, y_train = self.data[0][0], self.data[0][1]
y_pred3 = self.model.predict(X_train)
y_pred = np.zeros((len(y_pred3), ))
y_true = np.zeros((len(y_pred3), ))
for i in range(len(y_pred3)):
y_pred[i] = y_pred3[i]
for i in range(len(y_pred3)):
y_true[i] = y_train[i]
trn_s = mean_absolute_error(y_true, y_pred)
logs['trn_score'] = trn_s
X_val, y_val = self.data[1][0], self.data[1][1]
y_pred3 = self.model.predict(X_val)
y_pred = np.zeros((len(y_pred3), ))
y_true = np.zeros((len(y_pred3), ))
for i in range(len(y_pred3)):
y_pred[i] = y_pred3[i]
for i in range(len(y_pred3)):
y_true[i] = y_val[i]
val_s = mean_absolute_error(y_true, y_pred)
logs['val_score'] = val_s
print('trn_score', trn_s, 'val_score', val_s)
for callback in self.callbacks:
callback.on_epoch_end(batch, logs)
import keras.backend as K
from keras.callbacks import LearningRateScheduler
def scheduler(epoch):
    # Every 20 epochs, halve the learning rate
    if epoch % 20 == 0 and epoch != 0:
        lr = K.get_value(model.optimizer.lr)
        K.set_value(model.optimizer.lr, lr * 0.5)
        print("lr changed to {}".format(lr * 0.5))
    return K.get_value(model.optimizer.lr)
reduce_lr = LearningRateScheduler(scheduler)
#model.fit(train_x, train_y, batch_size=32, epochs=5, callbacks=[reduce_lr])
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
import keras
b_size = 2000
max_epochs = 145
oof_pred = np.zeros((len(X_pca), ))
sub = pd.read_csv('used_car_testB_20200421.csv',sep = ' ') [['SaleID']].copy()
sub['price'] = 0
avg_mae = 0
for fold, (trn_idx, val_idx) in enumerate(kf.split(X_pca, y)):
print('fold:', fold)
X_train, y_train = X_pca[trn_idx], y[trn_idx]
X_val, y_val = X_pca[val_idx], y[val_idx]
model = NN_model(X_train.shape[1])
simple_adam = keras.optimizers.Adam(lr = 0.01)
model.compile(loss='mae', optimizer=simple_adam,metrics=['mae'])
es = EarlyStopping(monitor='val_score', patience=10, verbose=0, mode='min', restore_best_weights=True,)
es.set_model(model)
metric = Metric(model, [es], [(X_train, y_train), (X_val, y_val)])
model.fit(X_train, y_train, batch_size=b_size, epochs=max_epochs,
validation_data = [X_val, y_val],
callbacks=[reduce_lr], shuffle=True, verbose=0)
y_pred3 = model.predict(X_val)
y_pred = np.zeros((len(y_pred3), ))
sub['price'] += model.predict(test).reshape(- 1,)/n_splits
for i in range(len(y_pred3)):
y_pred[i] = y_pred3[i]
oof_pred[val_idx] = y_pred
val_mae = mean_absolute_error(y[val_idx], y_pred)
avg_mae += val_mae/n_splits
print()
print('val_mae is:{}'.format(val_mae))
print()
mean_absolute_error(y, oof_pred)