Introduction to Data Mining — A Case Study of Used Car Price Forecasting

Author: Zhang Jie

Steps for data mining

  • Data Analysis
  • Feature Engineering
  • Feature Selection
  • Model Building
  • Model Deployment

1. Data Analysis

For data analysis, the following points need to be explored:

1) Missing values
2) All the numerical variables
3) Distribution of the numerical variables
4) Categorical variables
5) Cardinality of the categorical variables
6) Outliers
7) Relationship between the independent features and the dependent feature (price)

2. Feature Engineering

Data and features determine the upper limit of machine learning, and models and algorithms only approximate this upper limit. So what exactly is feature engineering? As the name implies, it is essentially an engineering activity designed to extract features from raw data to the maximum extent possible for use by algorithms and models.

The following problems may occur in feature engineering:

  • Features are not on the same scale: the specifications of different features differ, so they cannot be compared directly;
  • Qualitative features cannot be used directly: some machine learning algorithms and models only accept quantitative inputs, so qualitative features must be converted to quantitative ones. The simplest approach is to assign a quantitative value to each qualitative value, but this is too arbitrary and increases the tuning effort. Usually, qualitative features are converted by dummy coding: if a feature has N distinct qualitative values, it is expanded into N binary features; when the original value is the i-th qualitative value, the i-th expanded feature is 1 and the others are 0 (see the sketch after this list). Compared with directly assigned values, dummy coding requires no extra tuning work, and for a linear model the dummy-coded features can capture nonlinear effects.
  • Missing values: Missing values need to be supplemented.
  • Low information utilization: Different machine learning algorithms and models make different use of information in data. As mentioned above, in linear models, the use of dummy coding for qualitative features can achieve nonlinear effects. Similarly, polynomials of quantitative variables, or other transformations, can achieve nonlinear effects.
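As a minimal sketch of the dummy-coding idea above (the column name and values here are made up for illustration, not taken from the competition data), pandas.get_dummies expands one qualitative column into one binary column per value:

import pandas as pd

# A toy qualitative feature with three possible values (hypothetical data)
toy = pd.DataFrame({'fuel': ['gasoline', 'diesel', 'electric', 'gasoline']})

# Dummy coding: one binary column per qualitative value; each row has exactly one 1
dummies = pd.get_dummies(toy['fuel'], prefix='fuel')
print(dummies)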

In particular, attention must be paid to key issues such as missing-value handling, outlier handling, data normalization and data encoding.

3. Feature Selection

Feature selection means choosing the feature variables that improve model performance. Machine learning and statistical methods can be used to keep only the most relevant features.
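As a hedged illustration (not part of the pipeline below), scikit-learn offers simple statistical selectors; here SelectKBest with f_regression keeps the k features with the highest univariate F-score with respect to the target. The helper name select_top_k and its arguments are made up for this sketch:

from sklearn.feature_selection import SelectKBest, f_regression

def select_top_k(X, y, k=20):
    """Keep the k features most related to the target; X is a pandas DataFrame, y the target."""
    selector = SelectKBest(score_func=f_regression, k=k)
    X_selected = selector.fit_transform(X, y)
    kept_columns = X.columns[selector.get_support()]
    return X_selected, kept_columns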

4. Model Building

For the model itself, you can usually choose between classical machine learning models and deep learning models. In particular, model ensembling often brings unexpectedly good results (see the sketch below).
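A minimal hedged sketch of such an ensemble is simply averaging the predictions of two regressors; the estimators and the 50/50 weighting here are illustrative only, not the models trained later in this article:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def blend_predict(X_train, y_train, X_test, w=0.5):
    # Fit two different regressors and average their predictions
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    return w * rf.predict(X_test) + (1 - w) * gbr.predict(X_test)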

Problem analysis

The data

The task of the competition is to predict the transaction price of second-hand cars. The data comes from the used-car transaction records of a trading platform: more than 400,000 records in total, with 31 columns of variables, 15 of which are anonymous. To ensure fairness, 150,000 records are used as the training set and 50,000 as the test set, and fields such as name, model, brand and regionCode are desensitized. Data link: [tianchi.aliyun.com/competition…].

Evaluation metric

The evaluation metric is MAE (Mean Absolute Error).
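For n samples with true prices y_i and predictions \hat{y}_i, the metric is

MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|

which is exactly what sklearn.metrics.mean_absolute_error (imported below) computes.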

Import basic modules

# Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
from tqdm import tqdm
import itertools

warnings.filterwarnings('ignore')
%matplotlib inline

## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## Data dimension reduction processing
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

## parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

import scipy.signal as signal

Data analysis and feature engineering

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type to reduce memory usage. """
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
Train_data = reduce_mem_usage(pd.read_csv('used_car_train_20200313.csv', sep=' '))
Test_data = reduce_mem_usage(pd.read_csv('used_car_testB_20200421.csv', sep=' '))
# Output data size information
print('Train data shape:',Train_data.shape)
print('TestA data shape:',Test_data.shape)

# merge datasets
concat_data = pd.concat([Train_data,Test_data])
concat_data.isnull().sum()

Here we find that bodyType, fuelType and gearbox have many missing values, model is missing only one row, and price is the target column, so no extra processing is done at this point.

Analyze the V-series anonymous features and the non-V-series features

For the anonymous variables, only their numerical information is available, so it deserves closer attention. The variables are split into anonymous and non-anonymous groups and analyzed separately.

concat_data.columns

First, the non-anonymous variables are extracted for analysis, and 10 rows are randomly sampled.

concat_data[['bodyType', 'brand', 'creatDate', 'fuelType', 'gearbox', 'kilometer', 'model', 'name', 'notRepairedDamage', 'offerType', 'power', 'regDate', 'regionCode', 'seller']].sample(10)

concat_data[['bodyType', 'brand', 'creatDate', 'fuelType', 'gearbox', 'kilometer', 'model', 'name', 'notRepairedDamage', 'offerType', 'power', 'regDate', 'regionCode', 'seller']].describe()

The notRepairedDamage column is found to contain the placeholder value "-", which is replaced with the mode (0).

concat_data['notRepairedDamage'].value_counts()

concat_data['notRepairedDamage'] = concat_data['notRepairedDamage'].replace('-', 0).astype('float16')

Then we continue by analyzing the anonymous variables.

concat_data[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']].sample(10)

concat_data[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']].describe()

The remaining missing values are simply filled with the mode for now; after filling, the data contains no missing values.

concat_data = concat_data.fillna(concat_data.mode().iloc[0,:])
print('concat_data shape:',concat_data.shape)
concat_data.isnull().sum()

One-hot encoding of discrete features

For each feature with m possible values, one-hot encoding turns it into m binary features (for example, a feature with the values good, medium and poor becomes 100, 010 and 001). These expanded features are mutually exclusive, with only one active at a time, so the data becomes sparse. The main benefits of doing this are:

  1. It solves the problem that many classifiers cannot handle categorical attribute data directly.

  2. To a certain extent, it also expands the feature space.

Reference link: [www.cnblogs.com/zongfa/p/93]…

The distribution of values can be plotted with df.value_counts().plot.bar(); the helper below uses plt.bar so that the counts can be annotated on the bars.

def plot_discrete_bar(data):
    cnt = data.value_counts()
    p1 = plt.bar(cnt.index, height=list(cnt), width=0.8)
    for x, y in zip(cnt.index, list(cnt)):
        plt.text(x + 0.05, y + 0.05, '%.2f' % y, ha='center', va='bottom')
clo_list = ['bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
i = 1
fig = plt.figure(figsize=(8, 8))
for col in clo_list:
    plt.subplot(2, 2, i)
    plot_discrete_bar(concat_data[col])
    i = i + 1

One-hot encoding was used for features with few categories, and the number of features after encoding was changed from 31 to 50.

one_hot_list = ['gearbox', 'notRepairedDamage', 'bodyType', 'fuelType']
for col in one_hot_list:
    one_hot = pd.get_dummies(concat_data[col])
    one_hot.columns = [col + '_' + str(i) for i in range(len(one_hot.columns))]
    concat_data = pd.concat([concat_data, one_hot], axis=1)

It is also found that seller and offerType should be binary-valued, but their distributions are almost entirely concentrated on a single value, so they can be dropped directly.

concat_data['seller'].value_counts()

concat_data['offerType'].value_counts()

concat_data.drop(['offerType', 'seller'], axis=1, inplace=True)

For the anonymous variables, we hope to exploit more of their numerical information, so the feature set is expanded by adding pairs of anonymous variables and by multiplying selected non-anonymous variables with the anonymous variables.

# Pairwise sums of the anonymous variables
for i in ['v_' + str(t) for t in range(14)]:
    for j in ['v_' + str(k) for k in range(int(i[2:]) + 1, 15)]:
        concat_data[str(i) + '+' + str(j)] = concat_data[str(i)] + concat_data[str(j)]

# Products of selected non-anonymous variables with the anonymous variables
for i in ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode']:
    for j in ['v_' + str(k) for k in range(14)]:
        concat_data[str(i) + '*' + str(j)] = concat_data[i] * concat_data[j]
concat_data.shape

Date data processing

Dates are also important data with concrete practical meaning. We first parse the date fields and extract their components (year, month, day, day of week), and then analyze each of them.

# set the date format to 2016-04-04, where the month ranges from 1 to 12
def date_proc(x):
    m = int(x[4:6])
    if m == 0:
        m = 1
    return x[:4] + '-' + str(m) + '-' + x[6:]
# define date extraction function
def date_transform(df,fea_col):
    for f in tqdm(fea_col):
        df[f] = pd.to_datetime(df[f].astype('str').apply(date_proc))
        df[f + '_year'] = df[f].dt.year
        df[f + '_month'] = df[f].dt.month
        df[f + '_day'] = df[f].dt.day
        df[f + '_dayofweek'] = df[f].dt.dayofweek
    return (df)
# Extract date information
date_cols = ['regDate', 'creatDate']
concat_data = date_transform(concat_data, date_cols)

Next, the date fields are used to construct additional features, such as the usage time var = data['creatDate'] - data['regDate']. Some dates in the data are malformed, so errors='coerce' is needed when parsing them.

data = concat_data.copy()
# Count usage days
data['used_time1'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') - 
                            pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data['used_time2'] = (pd.Timestamp.now() - pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data['used_time3'] = (pd.Timestamp.now() - pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce')).dt.days
# Bucketing: split values into num_bins equal-width intervals
def cut_group(df, cols, num_bins=50):
    for col in cols:
        # Equal-width bin edges spanning the column's full range
        bins = np.linspace(df[col].min(), df[col].max(), num_bins + 1)
        df[col + '_bin'] = pd.cut(df[col], bins, labels=False, include_lowest=True)
    return df

# Bucket the usage-time features
cut_cols = ['used_time1', 'used_time2', 'used_time3']
data = cut_group(data, cut_cols, 50)

# Bucket kilometer
data = cut_group(data, ['kilometer'], 10)

The year and month fields are processed next, again using one-hot encoding.

data['creatDate_year'].value_counts()

data['creatDate_month'].value_counts()
# data['regDate_year'].value_counts()

# One-hot encoding for features with fewer categories
one_hot_list = ['creatDate_year', 'creatDate_month', 'regDate_month', 'regDate_year']
for col in one_hot_list:
    one_hot = pd.get_dummies(data[col])
    one_hot.columns = [col+'_'+str(i) for i in range(len(one_hot.columns))]
    data = pd.concat([data,one_hot],axis=1)
# Delete the unusable SaleID column
data.drop(['SaleID'], axis=1, inplace=True)

Increasing the number of features

The number of features can also be increased by computing statistical characteristics of selected variables, which raises the dimensionality of the data.

# count coding
def count_coding(df,fea_col):
    for f in fea_col:
        df[f + '_count'] = df[f].map(df[f].value_counts())
    return(df)
# count coding
count_list = ['model', 'brand', 'regionCode', 'bodyType', 'fuelType', 'name', 'regDate_year', 'regDate_month', 'regDate_day', 'regDate_dayofweek', 'creatDate_month', 'creatDate_day', 'creatDate_dayofweek', 'kilometer']
data = count_coding(data, count_list)

A heatmap is drawn to analyze the correlation between the anonymous variables and price; v_0, v_8 and v_12 show the highest correlations.

temp = Train_data[['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'price']]
# Zoomed heatmap, correlation matrix
sns.set(rc={'figure.figsize': (8, 6)})
correlation_matrix = temp.corr()

k = 8           #number of variables for heatmap
cols = correlation_matrix.nlargest(k, 'price')['price'].index
cm = np.corrcoef(temp[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

# Define cross feature statistics
def cross_cat_num(df,num_col,cat_col):
    for f1 in tqdm(cat_col):  # iterate over the categorical features
        g = df.groupby(f1, as_index=False)
        for f2 in tqdm(num_col):  # iterate over the numerical features
            feat = g[f2].agg({
                '{}_{}_max'.format(f1, f2): 'max', '{}_{}_min'.format(f1, f2): 'min', '{}_{}_median'.format(f1, f2): 'median',
            })
            df = df.merge(feat, on=f1, how='left')
    return(df)
# Use numerical features to conduct statistical characterization of category features and select several anonymous features with the highest correlation with price
cross_cat = ['model', 'brand', 'regDate_year']
cross_num = ['v_0', 'v_3', 'v_4', 'v_8', 'v_12', 'power']
data = cross_cat_num(data, cross_num, cross_cat)  # first-order crossing

Partition data set

## Select the feature column
numerical_cols = data.columns
feature_cols = [col for col in numerical_cols if col not in ['price']]

## Extract the feature columns and the label column to construct the training and test samples
X_data = data.iloc[:len(Train_data),:][feature_cols]
Y_data = Train_data['price']
X_test  = data.iloc[len(Train_data):,:][feature_cols]
print("X_data: ",X_data.shape)
print("X_test: ",X_test.shape)

Mean encoding: data preprocessing for high-cardinality qualitative (categorical) features

Cardinality is the number of distinct values a qualitative feature can take. For high-cardinality qualitative features, the usual preprocessing methods (such as label encoding or one-hot encoding) often do not give satisfactory results.

Examples of high cardinality qualitative characteristics: IP address, email domain name, city name, home address, street, product number.
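A quick way to inspect the cardinality of the categorical columns in this data set (a one-line check, assuming the concat_data frame built in the steps above is still available) is pandas' nunique:

# Number of distinct values per potentially high-cardinality column
print(concat_data[['name', 'model', 'brand', 'regionCode']].nunique())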

The main reasons why the usual encodings struggle:

  • LabelEncoder encodes a high-cardinality qualitative feature into a single column of natural numbers, but each number carries its own arbitrary meaning and the result is not linearly separable with respect to y. Simple models tend to underfit and cannot fully capture the differences between categories, while complex models easily overfit elsewhere.
  • OneHotEncoder applied to a high-cardinality qualitative feature inevitably produces a sparse matrix with tens of thousands of columns, which consumes a lot of memory and training time unless the algorithm has specific optimizations for sparse data (e.g. SVM).

Therefore, we can try mean encoding, which uses the target variable to be predicted within a Bayesian framework to determine a suitable numeric encoding for the qualitative feature. It is also a common way to boost scores in Kaggle data competitions. Reference link: [blog.csdn.net/juzexia/art…].
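Concretely, for a category c that appears n_c times in the training fold, the MeanEncoder class below blends the global prior mean of the target with the per-category mean \bar{y}_c, using a weight \lambda(n_c) produced by prior_weight_func:

\text{encoded}(c) = \lambda(n_c)\cdot \text{prior} + \bigl(1 - \lambda(n_c)\bigr)\cdot \bar{y}_c, \qquad \lambda(n) = \frac{1}{1 + e^{(n - k)/f}}

Rare categories are thus pulled toward the global prior, frequent categories keep their own mean, and the out-of-fold splitting (KFold / StratifiedKFold) limits target leakage.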

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold,KFold
from itertools import product
class MeanEncoder:
    def __init__(self, categorical_features, n_splits=10, target_type='classification', prior_weight_func=None):
        """ :param categorical_features: list of str, the name of the categorical columns to encode :param n_splits: the number of splits used in mean encoding :param target_type: str, 'regression' or 'classification' :param prior_weight_func: a function that takes in the number of observations, and outputs prior weight when a dict is passed, the default exponential decay function will be used: k: the number of observations needed for the posterior to be weighted equally as the prior f: larger f --> smaller slope """
 
        self.categorical_features = categorical_features
        self.n_splits = n_splits
        self.learned_stats = {}
 
        if target_type == 'classification':
            self.target_type = target_type
            self.target_values = []
        else:
            self.target_type = 'regression'
            self.target_values = None
 
        if isinstance(prior_weight_func, dict):
            self.prior_weight_func = eval('lambda x: 1 / (1 + np.exp((x - k) / f))', dict(prior_weight_func, np=np))
        elif callable(prior_weight_func):
            self.prior_weight_func = prior_weight_func
        else:
            self.prior_weight_func = lambda x: 1 / (1 + np.exp((x - 2) / 1))
 
    @staticmethod
    def mean_encode_subroutine(X_train, y_train, X_test, variable, target, prior_weight_func):
        X_train = X_train[[variable]].copy()
        X_test = X_test[[variable]].copy()
 
        if target is not None:
            nf_name = '{}_pred_{}'.format(variable, target)
            X_train['pred_temp'] = (y_train == target).astype(int)  # classification
        else:
            nf_name = '{}_pred'.format(variable)
            X_train['pred_temp'] = y_train  # regression
        prior = X_train['pred_temp'].mean()
 
        col_avg_y = X_train.groupby(by=variable, axis=0)['pred_temp'].agg({'mean': 'mean', 'beta': 'size'})
        col_avg_y['beta'] = prior_weight_func(col_avg_y['beta'])
        col_avg_y[nf_name] = col_avg_y['beta'] * prior + (1 - col_avg_y['beta']) * col_avg_y['mean']
        col_avg_y.drop(['beta', 'mean'], axis=1, inplace=True)
 
        nf_train = X_train.join(col_avg_y, on=variable)[nf_name].values
        nf_test = X_test.join(col_avg_y, on=variable).fillna(prior, inplace=False)[nf_name].values
 
        return nf_train, nf_test, prior, col_avg_y
 
    def fit_transform(self, X, y):
        """ :param X: pandas DataFrame, n_samples * n_features :param y: pandas Series or numpy array, n_samples :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features """
        X_new = X.copy()
        if self.target_type == 'classification':
            skf = StratifiedKFold(self.n_splits)
        else:
            skf = KFold(self.n_splits)
 
        if self.target_type == 'classification':
            self.target_values = sorted(set(y))
            self.learned_stats = {'{}_pred_{}'.format(variable, target): [] for variable, target in
                                  product(self.categorical_features, self.target_values)}
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = '{}_pred_{}'.format(variable, target)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, target, self.prior_weight_func)
                    X_new.iloc[small_ind, - 1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        else:
            self.learned_stats = {'{}_pred'.format(variable): [] for variable in self.categorical_features}
            for variable in self.categorical_features:
                nf_name = '{}_pred'.format(variable)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, None, self.prior_weight_func)
                    X_new.iloc[small_ind, - 1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        return X_new
 
    def transform(self, X):
        """ :param X: pandas DataFrame, n_samples * n_features :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features """
        X_new = X.copy()
 
        if self.target_type == 'classification':
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = '{}_pred_{}'.format(variable, target)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits
        else:
            for variable in self.categorical_features:
                nf_name = '{}_pred'.format(variable)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits
 
        return X_new
# High-cardinality qualitative features: name (transaction name), brand (car brand), regionCode (region code)
class_list = ['model', 'brand', 'name', 'regionCode'] + date_cols  # date_cols = ['regDate', 'creatDate']
MeanEnocodeFeature = class_list  # declare the features that require mean encoding
ME = MeanEncoder(MeanEnocodeFeature, target_type='regression')  # instantiate the mean encoder
X_data = ME.fit_transform(X_data, Y_data)  # fit and transform the training set (X and y)
X_test = ME.transform(X_test)  # encode the test set
X_data['price'] = Train_data['price']
from sklearn.model_selection import KFold
# For regression targets there are more target-encoding choices: besides the mean, statistics such as the standard deviation or the median can also be encoded.
enc_cols = []
stats_default_dict = {
    'max': X_data['price'].max(),
    'min': X_data['price'].min(),
    'median': X_data['price'].median(),
    'mean': X_data['price'].mean(),
    'sum': X_data['price'].sum(),
    'std': X_data['price'].std(),
    'skew': X_data['price'].skew(),
    'kurt': X_data['price'].kurt(),
    'mad': X_data['price'].mad()
}
### Select these three encodings for the moment
enc_stats = ['max', 'min', 'mean']
skf = KFold(n_splits=10, shuffle=True, random_state=42)
for f in tqdm(['regionCode', 'brand', 'regDate_year', 'creatDate_year', 'kilometer', 'model']):
    enc_dict = {}
    for stat in enc_stats:
        enc_dict['{}_target_{}'.format(f, stat)] = stat
        X_data['{}_target_{}'.format(f, stat)] = 0
        X_test['{}_target_{}'.format(f, stat)] = 0
        enc_cols.append('{}_target_{}'.format(f, stat))
    for i, (trn_idx, val_idx) in enumerate(skf.split(X_data, Y_data)):
        trn_x, val_x = X_data.iloc[trn_idx].reset_index(drop=True), X_data.iloc[val_idx].reset_index(drop=True)
        enc_df = trn_x.groupby(f, as_index=False)['price'].agg(enc_dict)
        val_x = val_x[[f]].merge(enc_df, on=f, how='left')
        test_x = X_test[[f]].merge(enc_df, on=f, how='left')
        for stat in enc_stats:
            val_x['{}_target_{}'.format(f, stat)] = val_x['{}_target_{}'.format(f, stat)].fillna(stats_default_dict[stat])
            test_x['{}_target_{}'.format(f, stat)] = test_x['{}_target_{}'.format(f, stat)].fillna(stats_default_dict[stat])
            X_data.loc[val_idx, '{}_target_{}'.format(f, stat)] = val_x['{}_target_{}'.format(f, stat)].values 
            X_test['{}_target_{}'.format(f, stat)] += test_x['{}_target_{}'.format(f, stat)].values / skf.n_splits

drop_list = ['regDate', 'creatDate', 'brand_power_min', 'regDate_year_power_min']
x_train = X_data.drop(drop_list+['price'],axis=1)
x_test = X_test.drop(drop_list,axis=1)
x_train.shape

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

Data is processed using MinMaxScaler, and dimensionality is reduced using PCA

from sklearn.preprocessing import MinMaxScaler
# Feature normalization
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(pd.concat([x_train,x_test]).values)
all_data = min_max_scaler.transform(pd.concat([x_train,x_test]).values)
print(all_data.shape)
from sklearn import decomposition
pca = decomposition.PCA(n_components=400)
all_pca = pca.fit_transform(all_data)
X_pca = all_pca[:len(x_train)]
test = all_pca[len(x_train):]
y = Train_data['price'].values
print(all_pca.shape)

Model selection

Here, Keras is used to build a basic neural network model; the architecture is a fully connected network.

import keras
from keras.layers import Conv1D, Activation, MaxPool1D, Flatten, Dense
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout, Add
def NN_model(input_dim):
    init = keras.initializers.glorot_uniform(seed=1)
    model = keras.models.Sequential()
    model.add(Dense(units=300, input_dim=input_dim, kernel_initializer=init, activation='softplus'))
    # model.add(Dropout(0.2))
    model.add(Dense(units=300, kernel_initializer=init, activation='softplus'))
    # model.add(Dropout(0.2))
    model.add(Dense(units=64, kernel_initializer=init, activation='softplus'))
    model.add(Dense(units=32, kernel_initializer=init, activation='softplus'))
    model.add(Dense(units=8, kernel_initializer=init, activation='softplus'))
    model.add(Dense(units=1))
    return model
from keras.callbacks import Callback, EarlyStopping
class Metric(Callback):
    def __init__(self, model, callbacks, data):
        super().__init__()
        self.model = model
        self.callbacks = callbacks
        self.data = data

    def on_train_begin(self, logs=None):
        for callback in self.callbacks:
            callback.on_train_begin(logs)

    def on_train_end(self, logs=None):
        for callback in self.callbacks:
            callback.on_train_end(logs)

    def on_epoch_end(self, batch, logs=None):
        X_train, y_train = self.data[0][0], self.data[0][1]
        y_pred3 = self.model.predict(X_train)
        y_pred = np.zeros((len(y_pred3), ))
        y_true = np.zeros((len(y_pred3), ))
        for i in range(len(y_pred3)):
            y_pred[i] = y_pred3[i]
        for i in range(len(y_pred3)):
            y_true[i] = y_train[i]
        trn_s = mean_absolute_error(y_true, y_pred)
        logs['trn_score'] = trn_s
        
        X_val, y_val = self.data[1][0], self.data[1][1]
        y_pred3 = self.model.predict(X_val)
        y_pred = np.zeros((len(y_pred3), ))
        y_true = np.zeros((len(y_pred3), ))
        for i in range(len(y_pred3)):
            y_pred[i] = y_pred3[i]
        for i in range(len(y_pred3)):
            y_true[i] = y_val[i]
        val_s = mean_absolute_error(y_true, y_pred)
        logs['val_score'] = val_s
        print('trn_score', trn_s, 'val_score', val_s)

        for callback in self.callbacks:
            callback.on_epoch_end(batch, logs)
import keras.backend as K
from keras.callbacks import LearningRateScheduler
  
def scheduler(epoch):
    # Every 20 epochs, the learning rate is halved
    if epoch % 20 == 0 and epoch != 0:
        lr = K.get_value(model.optimizer.lr)
        K.set_value(model.optimizer.lr, lr * 0.5)
        print("lr changed to {}".format(lr * 0.5))
    return K.get_value(model.optimizer.lr)
reduce_lr = LearningRateScheduler(scheduler)
#model.fit(train_x, train_y, batch_size=32, epochs=5, callbacks=[reduce_lr])
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)

import keras 

b_size = 2000
max_epochs = 145
oof_pred = np.zeros((len(X_pca), ))

sub = pd.read_csv('used_car_testB_20200421.csv', sep=' ')[['SaleID']].copy()
sub['price'] = 0

avg_mae = 0
for fold, (trn_idx, val_idx) in enumerate(kf.split(X_pca, y)):
    print('fold:', fold)
    X_train, y_train = X_pca[trn_idx], y[trn_idx]
    X_val, y_val = X_pca[val_idx], y[val_idx]
    
    model = NN_model(X_train.shape[1])
    simple_adam = keras.optimizers.Adam(lr = 0.01)
    
    model.compile(loss='mae', optimizer=simple_adam,metrics=['mae'])
    es = EarlyStopping(monitor='val_score', patience=10, verbose=0, mode='min', restore_best_weights=True,)
    es.set_model(model)
    metric = Metric(model, [es], [(X_train, y_train), (X_val, y_val)])
    model.fit(X_train, y_train, batch_size=b_size, epochs=max_epochs, 
              validation_data = [X_val, y_val],
              callbacks=[reduce_lr], shuffle=True, verbose=0)
    y_pred3 = model.predict(X_val)
    y_pred = np.zeros((len(y_pred3), ))
    sub['price'] += model.predict(test).reshape(-1,) / n_splits
    for i in range(len(y_pred3)):
        y_pred[i] = y_pred3[i]
        
    oof_pred[val_idx] = y_pred
    val_mae = mean_absolute_error(y[val_idx], y_pred)
    avg_mae += val_mae/n_splits
    print()
    print('val_mae is:{}'.format(val_mae))
    print()
mean_absolute_error(y, oof_pred)