Authors: Chen Yingxiang, Yang Zihan


Feature Engineering Handbook Based on Jupyter: Data Preprocessing

Column | Feature Engineering Handbook Based on Jupyter: Data Preprocessing (1)

Column | Feature Engineering Handbook Based on Jupyter: Data Preprocessing (2)

Project Address:

Github.com/YC-Coder-Ch…

This project explores the data preprocessing part of feature engineering: how to use scikit-learn to handle static continuous variables, category_encoders to handle static categorical variables, and Featuretools to handle common time series variables.

Contents

The data preprocessing part of feature engineering is divided into three sections:

  • Static continuous variables
  • Static categorical variables
  • Time series variables

This article covers data preprocessing for 1.2 static categorical variables (the second half, i.e. 1.2.7-1.2.11), with detailed explanations in Jupyter using the category_encoders package.

1.2 Static Categorical Variables

Real-world datasets also tend to contain categorical features. However, since the models in scikit-learn can only handle numerical features, we need to encode categorical features as numerical ones. Many newer models, such as LightGBM and CatBoost, are starting to support categorical variables directly, but here we use the category_encoders package because it covers more encoding methods.

1.2.7 M-estimate Encoding

M-estimate encoding is a simplified version of target encoding. Compared to the target encoder, M-estimate encoding has only one tunable parameter (m), while the target encoder has two (min_samples_leaf and smoothing).

Formula:

$$\hat{x}_k = \frac{n_k^+ + \text{prior} \times m}{y^+ + m}$$

where $\hat{x}_k$ is the encoded value of category $k$ in categorical feature $X$; $\text{prior}$ is the prior probability / expected value of the target variable; $n_k^+$ is the number of training samples with category $k$ on feature $X$ and a positive dependent-variable label; $y^+$ is the number of training samples with a positive dependent-variable label; and $m$ is a user-defined non-negative parameter: the larger $m$, the greater the weight given to the prior probability.

References: Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1), 27-32.
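Before running the encoder, here is a minimal numeric sketch of the formula in plain Python (not library code). It uses the counts from the toy training set constructed below and shows how a larger m pulls the encoding toward the prior:

# Sketch: counts taken from the toy training set below, category k = 'male'
prior = 0.4       # mean of the dependent variable over the training set
n_positive = 1    # 'male' samples with a positive label
y_positive = 2    # training samples with a positive label

for m in [0.1, 1.0, 10.0, 100.0]:
    encoding = (n_positive + prior * m) / (y_positive + m)
    print(f"m={m:>5}: {encoding:.4f}")  # approaches the prior (0.4) as m grows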

import numpy as np
import pandas as pd
from category_encoders.m_estimate import MEstimateEncoder
# category_encoders directly supports dataframe


# Randomly generate a training set
train_set = pd.DataFrame(np.array([['male', 10], ['female', 20], ['male', 10], 
                       ['female', 20], ['female', 15]]),
             columns = ['Sex', 'Type'])
train_y = np.array([False, True, True, False, False])


# Randomly generate a test set, intentionally including a category and a missing value that do not appear in the training set
test_set = pd.DataFrame(np.array([['female', 20], ['male', 20], ['others', 15], 
                       ['male', 20], ['female', 40], ['male', 25]]),
             columns = ['Sex', 'Type'])
test_set.loc[4, 'Type'] = np.nan
train_set # Original training set

test_set # Raw test set

encoder = MEstimateEncoder(cols=['Sex','Type'], 
                           handle_unknown='value',  
                           handle_missing='value').fit(train_set,train_y) # Train on the training set
encoded_train = encoder.transform(train_set) # Convert training set
encoded_test = encoder.transform(test_set) # Transform the test set


# handle_unknown and handle_missing are set to 'value'
# As in target encoding, handle_unknown and handle_missing only accept the 'error', 'return_nan' and 'value' settings
# The default for both is 'value', i.e. unknown categories or missing values are filled with the training-set mean of the dependent variable


encoded_test # The number of encoded variables is consistent with the number of original category variables
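As a quick sanity check of the 'value' setting (a sketch assuming the encoded_test and train_y objects from the cells above): per the comments above, the unseen 'others' category and the injected missing value should both be filled with the training-set mean of the dependent variable, train_y.mean() = 0.4.

# Sketch: compare the encodings of the unseen category and the missing value against the prior
prior = train_y.mean()
print(encoded_test.loc[2, 'Sex'], prior)   # row 2 holds the unseen 'others' category
print(encoded_test.loc[4, 'Type'], prior)  # row 4 holds the injected NaN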

# In the test set, the encoded value of the 'male' category is 0.466667


y_positive = 2 # in the training set, two samples have a positive dependent-variable label
n_positive = 1 # two training samples have the 'male' label on 'Sex', and only one of them has a positive dependent-variable label
prior = train_y.mean() # prior probability of the dependent variable over the training set
m = 1.0 # the default value
male_encode = (n_positive + prior * m)/(y_positive + m)
male_encode # returns 0.4666666666666666, matching the value to be verified

0.4666666666666666

encoded_train # Result of training set

1.2.8 James-Stein Encoding

James-Stein encoding is also based on target encoding. Like M-estimate encoding, the James-Stein encoder attempts to balance the prior probability against the observed conditional probability, via a parameter B. However, unlike target encoding and M-estimate encoding, the James-Stein encoder balances the two using a variance ratio rather than the sample size.

James-Stein encoding can estimate the parameter B with the independent method, the pooled method, and other approaches. For more information, see the category_encoders documentation:

Contrib.scikit-learn.org/categorical…

James-Stein encoding assumes a normal distribution, so to satisfy this assumption, category_encoders by default transforms binary targets using the log-odds ratio.

Independent method formula:

$$\hat{x}_k = (1 - B) \times \frac{n_k^+}{n_k} + B \times \text{prior}, \qquad B = \frac{\mathrm{var}(y_k)}{\mathrm{var}(y_k) + \mathrm{var}(y)}$$

where $\hat{x}_k$ is the encoded value of category $k$ in categorical feature $X$; $\text{prior}$ is the prior probability / expected value of the target variable; $n_k^+$ is the number of training samples with label $k$ on feature $X$ and a positive dependent-variable label; $n_k$ is the number of training samples with label $k$ on feature $X$; $\mathrm{var}(y_k)$ is the variance of the dependent variable among the samples with label $k$ on feature $X$; and $\mathrm{var}(y)$ is the variance of the population dependent variable. Both variances should be estimated from sample statistics.

Intuitively, B balances the prior probability against the observed conditional probability. If the conditional mean is unreliable (y_k has high variance), we should give more weight to the prior probability.
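To make the variance-ratio intuition concrete, here is a minimal numeric sketch in plain NumPy (not the encoder itself; category_encoders additionally applies the log-odds transform mentioned above for binary targets, so these numbers will not match the library's exact output):

import numpy as np

# Sketch: toy numbers mirroring the training data below, category k = 'male'
y = np.array([0., 1., 1., 0., 0.])  # all training labels
y_k = np.array([0., 1.])            # labels of the two 'male' samples

B = y_k.var() / (y_k.var() + y.var())          # weight given to the prior
encoding = (1 - B) * y_k.mean() + B * y.mean() # blend of conditional mean and prior
print(B, encoding)  # a noisy (high-variance) category pushes B toward 1, i.e. toward the prior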

import numpy as np
import pandas as pd
from category_encoders.james_stein import JamesSteinEncoder
# category_encoders directly supports dataframe


# Randomly generate a training set
train_set = pd.DataFrame(np.array([['male', 10], ['female', 20], ['male', 10], 
                       ['female', 20], ['female', 15]]),
             columns = ['Sex', 'Type'])
train_y = np.array([False, True, True, False, False])


# Randomly generate a test set, intentionally including a category and a missing value that do not appear in the training set
test_set = pd.DataFrame(np.array([['female', 20], ['male', 20], ['others', 15], 
                       ['male', 20], ['female', 40], ['male', 25]]),
             columns = ['Sex', 'Type'])
test_set.loc[4, 'Type'] = np.nan
train_set # Original training set

test_set # Raw test set

encoder = JamesSteinEncoder(cols=['Sex', 'Type'], 
                           handle_unknown='value', 
                           model='independent',  
                           handle_missing='value').fit(train_set,train_y) # Train on the training set
encoded_train = encoder.transform(train_set) # Convert training set
encoded_test = encoder.transform(test_set) # Transform the test set


# handle_unknown and handle_missing are set to 'value'
# As in target encoding, handle_unknown and handle_missing only accept the 'error', 'return_nan' and 'value' settings
# The default for both is 'value', i.e. unknown categories or missing values are filled with the training-set mean of the dependent variable


encoded_test # The number of encoded variables is consistent with the number of original category variables


# Because category_encoders makes some changes to the formula described above, the results are not verified further here

encoded_train # Result of training set

1.2.9 Weight of Evidence Encoder

Similar to the methods above, the weight of evidence (WoE) encoder encodes categorical variables based on the relationship between the categories and the dependent variable.

Formula:

$$\text{WoE}_k = \ln\left(\frac{\text{distribution of positives}_k}{\text{distribution of negatives}_k}\right)$$

That is the original definition of WoE. In category_encoders, a regularization term is additionally added to guard against overfitting, and the regularized $\text{distribution of positives}_k$ and $\text{distribution of negatives}_k$ are computed as follows:

$$\text{distribution of positives}_k = \frac{n_k^+ + \text{regularization}}{y^+ + 2 \times \text{regularization}}, \qquad \text{distribution of negatives}_k = \frac{n_k - n_k^+ + \text{regularization}}{y - y^+ + 2 \times \text{regularization}}$$

where $n_k$ is the number of training samples with category $k$; $n_k^+$ is the number of those with a positive dependent-variable label; $y$ is the total number of training samples; and $y^+$ is the number of training samples with a positive label.

import numpy as np
import pandas as pd
from category_encoders.woe import WOEEncoder
# category_encoders directly supports dataframe


# Randomly generate a training set
train_set = pd.DataFrame(np.array([['male', 10], ['female', 20], ['male', 10], 
                       ['female', 20], ['female', 15]]),
             columns = ['Sex', 'Type'])
train_y = np.array([False, True, True, False, False])


# Randomly generate a test set, intentionally including a category and a missing value that do not appear in the training set
test_set = pd.DataFrame(np.array([['female', 20], ['male', 20], ['others', 15], 
                       ['male', 20], ['female', 40], ['male', 25]]),
             columns = ['Sex', 'Type'])
test_set.loc[4, 'Type'] = np.nan
train_set # Original training set

test_set # Raw test set

encoder = WOEEncoder(cols=['Sex','Type'], 
                     handle_unknown='value',  
                     handle_missing='value').fit(train_set,train_y)  # Train on the training set
encoded_train = encoder.transform(train_set) # Convert training set
encoded_test = encoder.transform(test_set) # Transform the test set


# handle_unknown and handle_missing are set to 'value'
# As in target encoding, handle_unknown and handle_missing only accept the 'error', 'return_nan' and 'value' settings
# The default for both is 'value', i.e. unknown categories or missing values are filled with the training-set mean of the dependent variable


encoded_test # The number of encoded variables is consistent with the number of original category variables

# In the test set, the encoded value of the 'male' category is 0.223144


y = 5 # there are 5 samples in the training set
y_positive = 2 # 2 samples in the training set have a positive label


n = 2 # 2 samples in the training set have the 'male' label on the 'Sex' variable
n_positive = 1 # only one of the two samples has a positive label
regularization = 1.0 # the default value


dis_positive = (n_positive + regularization) / (y_positive + 2 * regularization)
dis_negative = (n - n_positive + regularization) / (y - y_positive + 2 * regularization)
male_encode = np.log(dis_positive / dis_negative)
male_encode # returns 0.22314355131420976, matching the value to be verified

0.22314355131420976

encoded_train # Result of training set

1.2.10 Leave One Out Encoder

The leave-one-out encoder encodes each group with the mean of the group's dependent variable. Groups here are the different categories of a categorical variable.

Leave-one-out encoding also addresses overfitting: the encoded value of each training sample is the mean of the group's dependent variable after removing that sample. As a result, samples in the same group can be encoded to different values in the training set.

The test set is encoded differently: each test sample is encoded with the group mean computed over the training set, without removing any sample (see the sketch after the formula below).

Formula:

$$\hat{x}_i = \frac{\sum_{j \neq i} (x_j = k) \cdot y_j}{\sum_{j \neq i} (x_j = k)}$$

where $(x_j = k)$ returns 1 if sample $j$ has label $k$ and 0 otherwise, and $\hat{x}_i$ is the encoded value of sample $i$, whose label is $k$.
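The train/test asymmetry can be reproduced with a few lines of pandas (a sketch independent of category_encoders, using the same toy 'Sex' column and labels as below):

import pandas as pd

# Sketch: manual leave-one-out encoding of the 'Sex' column
sex = pd.Series(['male', 'female', 'male', 'female', 'female'])
y = pd.Series([0., 1., 1., 0., 0.])

grp_sum = y.groupby(sex).transform('sum')
grp_cnt = y.groupby(sex).transform('count')

# Training set: remove each sample's own label before averaging
loo_train = (grp_sum - y) / (grp_cnt - 1)
print(loo_train)  # the two 'male' rows get different values (1.0 and 0.0)

# Test set: plain training-set group mean, the sample itself is not removed
print(y.groupby(sex).mean())  # 'male' -> 0.5, 'female' -> 1/3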

import numpy as np
import pandas as pd
from category_encoders.leave_one_out import LeaveOneOutEncoder
# category_encoders directly supports dataframe


# Randomly generate a training set
train_set = pd.DataFrame(np.array([['male', 10], ['female', 20], ['male', 10], 
                       ['female', 20], ['female', 15]]),
             columns = ['Sex', 'Type'])
train_y = np.array([False, True, True, False, False])


# Randomly generate a test set, intentionally including a category and a missing value that do not appear in the training set
test_set = pd.DataFrame(np.array([['female', 20], ['male', 20], ['others', 15], 
                       ['male', 20], ['female', 40], ['male', 25]]),
             columns = ['Sex', 'Type'])
test_set.loc[4, 'Type'] = np.nan
train_set # Original training set

test_set # Raw test set

encoder = LeaveOneOutEncoder(cols=['Sex','Type'], 
                             handle_unknown='value',  
                             handle_missing='value').fit(train_set,train_y)  # Train on the training set
encoded_train = encoder.transform(train_set) # Convert training set
encoded_test = encoder.transform(test_set) # Transform the test set


# handle_unknown and handle_missing are set to 'value'
# As in target encoding, handle_unknown and handle_missing only accept the 'error', 'return_nan' and 'value' settings
# The default for both is 'value', i.e. unknown categories or missing values are filled with the training-set mean of the dependent variable


encoded_test # The number of encoded variables is consistent with the number of original category variables
# The results show that every category is encoded with the training-set mean of that category's dependent variable

# Result of training set
LeaveOneOutEncoder(cols=['Sex', 'Type'],
                   handle_unknown='value',  
                   handle_missing='value').fit_transform(train_set,train_y)


# A small check:
# For the first sample, the label on the 'Sex' variable is 'male'
# After removing this sample, the mean dependent variable of the remaining 'male' samples is 1.0 (only the third sample still has the 'male' label, and its dependent-variable label is positive)
# Similarly, for the third sample (also 'male'), the mean becomes 0.0 after it is removed

1.2.11 CatBoost Encoding

CatBoost is a tree-based gradient boosting model with excellent results on datasets containing many categorical features. The CatBoost encoder is a new encoding scheme for categorical features based on the leave-one-out encoder. Before using the CatBoost encoder, the training data must be randomly permuted, because in CatBoost the encoding relies on the notion of "time": the order of the observations in the dataset. A permutation step is sketched below.
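With real data, the permutation could look like the following sketch (assuming the train_set and train_y objects defined in the cells below; the random seed is arbitrary):

import numpy as np

# Sketch: shuffle rows and labels consistently before fitting the encoder
perm = np.random.RandomState(42).permutation(len(train_set))
train_set_shuffled = train_set.iloc[perm].reset_index(drop=True)
train_y_shuffled = train_y[perm]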

Formula:

$$\hat{x}_i = \frac{\sum_{j < i} (x_j = k) \cdot y_j + a \times \text{prior}}{\sum_{j < i} (x_j = k) + a}$$

where $(x_j = k)$ returns 1 if sample $j$ has label $k$ and 0 otherwise; $\hat{x}_i$ is the encoded value of sample $i$ (whose label is $k$), computed only from the samples that precede it; $\text{prior}$ is the prior probability / expected value of the dependent variable; and $a$ is the regularization coefficient.
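The role of "time" can be reproduced with a short loop in plain Python (a sketch of the formula above, not the library implementation, using the same toy 'Sex' column and labels as below):

import numpy as np

# Sketch: ordered (CatBoost-style) encoding, each sample sees only earlier samples
sex = ['male', 'female', 'male', 'female', 'female']
y = [0, 1, 1, 0, 0]
prior, a = np.mean(y), 1.0  # prior probability and regularization coefficient

codes = []
for i in range(len(sex)):
    earlier = [y[j] for j in range(i) if sex[j] == sex[i]]  # same category, before i
    codes.append((sum(earlier) + a * prior) / (len(earlier) + a))
print(codes)  # the first 'male' is encoded 0.4 (pure prior); the second 0.2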

import numpy as np
import pandas as pd
from category_encoders.cat_boost import CatBoostEncoder
# category_encoders directly supports dataframe


# Randomly generate a training set
train_set = pd.DataFrame(np.array([['male', 10], ['female', 20], ['male', 10], 
                       ['female', 20], ['female', 15]]),
             columns = ['Sex', 'Type'])
train_y = np.array([False, True, True, False, False])


# Randomly generate a test set, intentionally including a category and a missing value that do not appear in the training set
test_set = pd.DataFrame(np.array([['female', 20], ['male', 20], ['others', 15], 
                       ['male', 20], ['female', 40], ['male', 25]]),
             columns = ['Sex', 'Type'])
test_set.loc[4, 'Type'] = np.nan
train_set # Original training set

test_set # Raw test set

# In practice we should shuffle the data before applying CatBoost encoding
# But since our toy data is already randomly generated, there is no need to shuffle it here


encoder = CatBoostEncoder(cols=['Sex','Type'], 
                          handle_unknown='value',  
                          handle_missing='value').fit(train_set,train_y)   # Train on the training set
encoded_train = encoder.transform(train_set) # Convert training set
encoded_test = encoder.transform(test_set) # Transform the test set


# handle_unknown and handle_missing are set to 'value'
# As in target encoding, handle_unknown and handle_missing only accept the 'error', 'return_nan' and 'value' settings
# The default for both is 'value', i.e. unknown categories or missing values are filled with the training-set mean of the dependent variable


encoded_test # The number of encoded variables is consistent with the number of original category variables

# In the test set, the encoded value of the 'male' category is 0.466667


prior = train_y.mean() # prior probability
n = 2 # in the training set, two samples have the 'male' label on the 'Sex' variable
n_positive = 1 # of those two samples, only one has a positive label
a = 1 # regularization coefficient, default is 1


encoded_male = (n_positive + a * prior) / (n + a) 
encoded_male # returns 0.4666666666666666, matching the value to be verified

0.4666666666666666

# Verify the results of the training set
CatBoostEncoder(cols=['Sex', 'Type'],
                handle_unknown='value',  
                handle_missing='value').fit_transform(train_set,train_y)

# The third sample in the training set has the 'male' label on 'Sex', with an encoded value of 0.2
prior = train_y.mean() # prior probability
n = 1 # only one sample has the 'male' label before the third sample
n_positive = 0 # and that single sample does not have a positive label
a = 1 # regularization coefficient


encoded_male = (n_positive + a * prior) / (n + a)
encoded_male # returns 0.2

0.2

That concludes the introduction to data preprocessing for static categorical variables (second half). Readers are encouraged to run the code themselves in Jupyter alongside the text.

A complete Chinese version of this project is currently in production; stay tuned.

Chinese Jupyter version:

Github.com/YC-Coder-Ch…
