Edited by Yingxiang Chen & Zihan Yang: Red Stone

Feature engineering plays an important role in machine learning. Appropriate feature engineering can significantly improve the performance of machine learning models. We have compiled a systematic feature engineering tutorial on Github for your reference.

Project Address:

Github.com/YC-Coder-Ch…

This article explores how to use scikit-learn to handle static continuous variables, Category Encoders to handle static categorical variables, and Featuretools to handle common time series variables.

Contents

The data preprocessing part of feature engineering is divided into three sections:

  • Static continuous variables
  • Static categorical variables
  • Time series variables

This article covers section 1.1, the preprocessing of static continuous variables, explained in detail with Jupyter and sklearn.

1.1 Static continuous variables

1.1.1 Discretization

Discretizing continuous variables makes the model more robust. For example, when predicting a customer’s purchase behavior, a customer who has made 30 purchases may behave very similarly to a customer who has made 32 purchases. Sometimes excessive precision in a feature is just noise, which is why LightGBM uses a histogram algorithm to prevent over-fitting. There are two ways to discretize a continuous variable.

1.1.1.1 Binarization

Binarization of numerical features.

# load the sample data
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
X, y = dataset.data, dataset.target # we will take the first column as the example later
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots()
sns.distplot(X[:,0], hist = True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

from sklearn.preprocessing import Binarizer

sample_columns = X[0:10,0] # select the top 10 samples
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


model = Binarizer(threshold=6) # set 6 to be the threshold
# if value <= 6, then return 0, else return 1
result = model.fit_transform(sample_columns.reshape(-1, 1)).reshape(-1)
# return array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])

1.1.1.2 Binning

Divide numerical features into bins.

Equal-width binning:

from sklearn.preprocessing import KBinsDiscretizer


# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # set 5 bins
# return ordinal bin number, set all bins to have identical widths
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])
bin_edge = model.bin_edges_[0]
# return array([0.4999, 3.39994, 6.29998, 9.20002, 12.10006, 15.0001]), the bin edges
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)


for edge in bin_edge: # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);

Quantile-based binning:

from sklearn.preprocessing import KBinsDiscretizer


# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') # set 5 bins
# return ordinal bin number, set all bins based on quantiles
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])
bin_edge = model.bin_edges_[0]
# return array([0.4999, 2.3523, 3.1406, 3.9667, 5.10824, 15.0001]), the bin edges
# 2.3523 is the 20% quantile
# 3.1406 is the 40% quantile, etc.
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)


for edge in bin_edge: # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantiles Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);

1.1.2 Scaling

It is difficult to compare features at different scales, especially in linear models such as linear regression and logistic regression. In k-means clustering or KNN models based on Euclidean distance, feature scaling is needed; otherwise the distance measure is meaningless. Scaling also speeds up convergence for any algorithm that uses gradient descent.
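
For instance, here is a minimal sketch (using the California housing data loaded above) of how a scaler is typically combined with a distance-based model in a sklearn Pipeline, so that the scaling parameters are learned from the training rows only:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor


# scale first, then fit the distance-based model;
# the pipeline learns the scaling parameters from the training rows only
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=5)),
])
knn_pipeline.fit(X[10:, :], y[10:]) # train on the rest of the samples
predictions = knn_pipeline.predict(X[0:10, :]) # predict on the first ten samples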

Some common models, such as linear regression, logistic regression, KNN, k-means, SVM, and PCA, benefit from feature scaling, while tree-based models generally do not.

Note: Skewness affects the PCA model, so it is best to use power transforms to eliminate skewness.
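
As a quick illustration of this note, a power transform can be chained with PCA in a Pipeline so that skewness is removed before the components are computed (a minimal sketch on the same data):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA


# remove skewness first, then compute the principal components
pca_pipeline = Pipeline([
    ('power', PowerTransformer(method='yeo-johnson', standardize=True)),
    ('pca', PCA(n_components=2)),
])
components = pca_pipeline.fit_transform(X) # X loaded from fetch_california_housing above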

1.1.2.1 Standard scaling (Z score standardization)

Formula:

X' = (X - μ) / σ

where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers also affect μ and σ.

from sklearn.preprocessing import StandardScaler


# in order to mimic the operation in real-world, we shall fit the StandardScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = StandardScaler()
model.fit(train_set.reshape(-1, 1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([2.34539745, 2.33286782, 1.78324852, 0.93339178, -0.0125957,
# 0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# the result is the same as (X[0:10,0] - X[10:,0].mean()) / X[10:,0].std()
# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2, 1, figsize = (13, 9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = StandardScaler()
model.fit(X[:,0].reshape(-1, 1)) 
result = model.transform(X[:,0].reshape(-1, 1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

1.1.2.2 MinMaxScaler (scale by value range)

Suppose we want to scale a feature to the value range (a, b).

Formula:

X' = a + (X - Min) * (b - a) / (Max - Min)

where Min is the minimum value of X and Max is the maximum value of X. This method is also sensitive to outliers, because outliers affect both Min and Max.

from sklearn.preprocessing import MinMaxScaler


# in order to mimic the operation in real-world, we shall fit the MinMaxScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = MinMaxScaler(feature_range=(0,1)) # set the range to be (0,1)
model.fit(train_set.reshape(-1, 1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
# 0.24392077, 0.21787286, 0.18069406, 0.1089985, 0.22008662])
# the result is the same as (X[0:10,0] - X[10:,0].min()) / (X[10:,0].max() - X[10:,0].min())
# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2, 1, figsize = (13, 9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = MinMaxScaler(feature_range=(0, 1))
model.fit(X[:,0].reshape(-1, 1)) 
result = model.transform(X[:,0].reshape(-1, 1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout() # now the scale changes to [0,1]

1.1.2.3 RobustScaler

Scale features using statistics (quantiles) that are robust to outliers. Suppose we scale the feature using its quantile range (a, b).

Formula:

X' = (X - median(X)) / (Q_b(X) - Q_a(X))

where Q_a(X) and Q_b(X) are the a-th and b-th quantiles of X (by default the 25th and 75th percentiles). This method is more robust to outliers.

import numpy as np
from sklearn.preprocessing import RobustScaler


# in order to mimic the operation in real-world, we shall fit the RobustScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = RobustScaler(with_centering = True, with_scaling = True, 
                     quantile_range = (25.0, 75.0))
# with_centering = True => recenter the feature by setting X' = X - X.median()
# with_scaling = True => rescale the feature by the quantile range set by the user
# set the quantile range to (25%, 75%)
model.fit(train_set.reshape(-1, 1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([2.19755974, 2.18664281, 1.7077657, 0.96729508, 0.14306683,
# 0.23049401, 0.05724508, -0.19003715, -0.66689601, 0.07196918])
# the result is the same as (X[0:10,0] - np.quantile(X[10:,0], 0.5)) /
# (np.quantile(X[10:,0], 0.75) - np.quantile(X[10:,0], 0.25))
# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2, 1, figsize = (13, 9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
model.fit(X[:,0].reshape(-1, 1)) 
result = model.transform(X[:,0].reshape(-1, 1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

1.1.2.4 Power Transformation (Nonlinear Transformation)

All of the scaling methods described above preserve the shape of the original distribution. But normality is an important assumption in many statistical models, so we can use a power transform to map the original distribution to an approximately normal one.

The Box-Cox transformation:

The Box-Cox transformation applies only to positive values and takes the following form:

X' = (X^λ - 1) / λ   if λ ≠ 0
X' = ln(X)           if λ = 0

Every λ value is considered, and the value that best stabilizes the variance and minimizes skewness is selected by maximum likelihood estimation.

from sklearn.preprocessing import PowerTransformer


# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = PowerTransformer(method='box-cox', standardize=True)
# apply box-cox transformation
model.fit(train_set.reshape(-1, 1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([1.91669292, 1.91009687, 1.60235867, 1.0363095, 0.19831579,
# 0.30244247, 0.09143411, -0.24694006, -1.08558469, 0.11011933])
# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2, 1, figsize = (13, 9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:,0].reshape(-1, 1)) 
result = model.transform(X[:,0].reshape(-1, 1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

The Yeo-Johnson transformation:

The Yeo-Johnson transformation applies to both positive and negative values and takes the following form:

X' = ((X + 1)^λ - 1) / λ                  if λ ≠ 0, X ≥ 0
X' = ln(X + 1)                            if λ = 0, X ≥ 0
X' = -((-X + 1)^(2-λ) - 1) / (2 - λ)      if λ ≠ 2, X < 0
X' = -ln(-X + 1)                          if λ = 2, X < 0

Every λ value is considered, and the value that best stabilizes the variance and minimizes skewness is selected by maximum likelihood estimation.

from sklearn.preprocessing import PowerTransformer


# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply yeo-johnson transformation
model.fit(train_set.reshape(-1, 1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([1.90367888, 1.89747091, 1.604735, 1.05166306, 0.20617221,
# 0.31245176, 0.09685566, -0.25011726, -1.10512438, 0.11598074])
# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2, 1, figsize = (13, 9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:,0].reshape(-1, 1)) 
result = model.transform(X[:,0].reshape(-1, 1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout() 

1.1.3 Normalization

All of the above scaling methods operate column by column. Normalization, in contrast, works on each row: it scales every sample so that it has unit norm. Because it operates row by row, it distorts the relationships between features, so it is not commonly used. However, normalization can be very useful in the context of text classification and clustering.

Suppose X[i][j] represents the value of feature j in sample i.

L1 normalization formula:

X'[i][j] = X[i][j] / Σ_k |X[i][k]|

L2 normalization formula:

X'[i][j] = X[i][j] / sqrt(Σ_k X[i][k]²)

L1 normalization:

from sklearn.preprocessing import Normalizer


# Normalizer performs operation on each row independently
# So train set and test set are processed independently


###### for L1 Norm
sample_columns = X[0:2, 0:3] # select the first two samples, and the first three features
# return array([[8.3252, 41., 6.98412698],
# [8.3014, 21., 6.23813708]]


model = Normalizer(norm='l1')
# use L1 norm to normalize each sample


model.fit(sample_columns) 


result = model.transform(sample_columns) # test set are processed similarly
# return array([[0.14784762, 0.72812094, 0.12403144],
# [0.23358211, 0.59089121, 0.17552668]])
# result = sample_columns / np.sum(np.abs(sample_columns), axis=1, keepdims=True)

L2 normalization:

###### for L2 Norm
sample_columns = X[0:2, 0:3] # select the first two samples, and the first three features
# return array([[8.3252, 41., 6.98412698],
# [8.3014, 21., 6.23813708]]


model = Normalizer(norm='l2')
# use L2 norm to normalize each sample


model.fit(sample_columns) 


result = model.transform(sample_columns)
# return array([[0.19627663, 0.96662445, 0.16465922],
# [0.35435076, 0.89639892, 0.26627902]])
# result = sample_columns / np.sqrt(np.sum(sample_columns**2, axis=1, keepdims=True))
# visualize the difference in the distribution after Normalization
# compare it with the distribution after RobustScaling
# fit and transform the entire first & second feature


import seaborn as sns
import matplotlib.pyplot as plt


# RobustScaler
fig, ax = plt.subplots(2, 1, figsize = (13, 9))


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
model.fit(X[:,0:2]) 
result = model.transform(X[:,0:2])


sns.scatterplot(result[:,0], result[:,1], ax=ax[0])
ax[0].set_title('Scatter Plot of RobustScaling result', fontsize=12)
ax[0].set_xlabel('Feature 1', fontsize=12)
ax[0].set_ylabel('Feature 2', fontsize=12);


model = Normalizer(norm='l2')


model.fit(X[:,0:2]) 
result = model.transform(X[:,0:2])


sns.scatterplot(result[:,0], result[:,1], ax=ax[1])
ax[1].set_title('Scatter Plot of Normalization result', fontsize=12)
ax[1].set_xlabel('Feature 1', fontsize=12)
ax[1].set_ylabel('Feature 2', fontsize=12);
fig.tight_layout()  # Normalization distorts the original distribution
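
To illustrate the text use case mentioned above, here is a minimal sketch on a toy corpus (the documents below are made up for illustration): each document vector is scaled to unit L2 norm, so k-means groups documents by direction (cosine similarity) rather than by document length.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans


docs = ["cheap flights to paris", "cheap cheap cheap flights",
        "machine learning with python", "python machine learning tutorial"]
counts = CountVectorizer().fit_transform(docs) # raw term counts; longer documents get larger rows
unit_rows = Normalizer(norm='l2').transform(counts) # scale each row to unit L2 norm
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(unit_rows)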

1.1.4 Missing value imputation

In practice, values may be missing from the dataset. However, such incomplete datasets are incompatible with most scikit-learn estimators, which assume that all features are numeric and that no values are missing. So before applying a scikit-learn model, we need to impute the missing values.

However, some newer models, such as XGBoost, LightGBM, and CatBoost (implemented in other packages), provide native support for missing values, so when using these models we no longer need to fill in the missing values in the dataset.
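
For example, here is a minimal sketch with LightGBM (assuming the lightgbm package is installed), where rows containing NaN can be passed to the model directly:

import numpy as np
import lightgbm as lgb


X_train = X[10:, :].copy()
X_train[3, 0] = np.nan # LightGBM treats NaN as a missing value by default
model = lgb.LGBMRegressor(n_estimators=50)
model.fit(X_train, y[10:]) # no imputation step is needed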

1.1.4.1 Univariate feature imputation

Suppose column i contains missing values; we impute them with either a constant or a statistic of column i (its mean, median, or mode).

from sklearn.impute import SimpleImputer

test_set = X[0:10,0].copy() # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now test_set becomes 
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on the train set and transform the test set.
train_set = X[10:,0].copy()
train_set[3] = np.nan
train_set[6] = np.nan


imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', or 'constant'
imputer.fit(train_set.reshape(-1, 1))
result = imputer.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([8.3252, 8.3014, 7.2574, 3.87023658, 3.8462,
# 4.0368, 3.87023658, 3.12, 2.0804, 3.6912])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set)
# which is the mean of the train set ignoring missing values

1.1.4.2 Multivariate feature imputation

Multivariate feature imputation uses the information in the entire dataset to estimate and impute missing values. In scikit-learn, it is implemented in an iterative, round-robin fashion.

At each step, one feature column is designated as the output y and the other feature columns are treated as the input X. A regressor is fit on (X, y) using the samples where y is known, and is then used to predict the missing values of y. This is done for each feature in turn and repeated for up to the maximum number of imputation rounds.

Using a linear model (BayesianRidge as an example) :

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on the train set and transform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 4.6237195, 3.8462,
# 4.0368, 4.00258149, 3.12, 2.0804, 3.6912])

Using a tree-based model (ExtraTrees as an example) :

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on the train set and transform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be tuned in CV through sklearn pipeline
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 4.63813, 3.8462, 4.0368, 3.24721,
# 3.12, 2.0804, 3.6912])

Using K Nearest Neighbors (KNN) :

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on the train set and transform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = KNeighborsRegressor(n_neighbors=10, p=1) # set p=1 to use manhattan distance
# use manhattan distance to reduce effect from outliers


# parameters can be turned in CV though sklearn pipeline
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052, 3.12,
# 2.0804, 3.6912])

1.1.4.3 Marking imputed values

Sometimes, the fact that a value was missing can itself be useful information. Therefore, scikit-learn also provides MissingIndicator, which transforms a dataset with missing values into a binary matrix indicating where the missing values are located.

from sklearn.impute import MissingIndicator


# illustrate this function on the train set only
# since the process is independent in train set and test set
train_set = X[10:,:].copy() # select all features
train_set[3, 0] = np.nan # manually create some missing values
train_set[6, 0] = np.nan
train_set[3, 1] = np.nan


indicator = MissingIndicator(missing_values=np.nan, features='all') 
# show the results on all the features
result = indicator.fit_transform(train_set) # result have the same shape with train_set
# contains only True & False, True corresponds with missing value


result[:,0].sum() # should return 2, the first column has two missing values
result[:,1].sum(); # should return 1, the second column has one missing value
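
In practice, the indicator columns are often appended to the imputed features in a single step; here is a minimal sketch using SimpleImputer's add_indicator option (reusing the train_set with manually created missing values from above):

from sklearn.impute import SimpleImputer


# impute the missing values and append one binary indicator column
# for every feature that contained missing values during fit
imputer = SimpleImputer(strategy='mean', add_indicator=True)
result = imputer.fit_transform(train_set)
# result now contains the 8 imputed feature columns plus 2 indicator columns
# (features 0 and 1 contain missing values in train_set)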

1.1.5 Feature transformation

1.1.5.1 Polynomial transformation

Sometimes we want to introduce nonlinear features into the model to increase its complexity. For simple linear models, this adds a lot of expressive power. More complex models, such as tree-based ML models, already capture nonlinear relationships through their non-parametric tree structure, so this feature transformation may not be very helpful for them.

For example, with two features (X1, X2) and the degree set to 3, the transformed features are: 1, X1, X2, X1², X1·X2, X2², X1³, X1²·X2, X1·X2², X2³.

from sklearn.preprocessing import PolynomialFeatures


# illustrate this function on one synthesized sample
train_set = np.array([2, 3]).reshape(1, -1) # shape (1, 2)
# return array([[2, 3]])


poly = PolynomialFeatures(degree = 3, interaction_only = False)
# the highest degree is set to 3, and we want more than just interaction terms


result = poly.fit_transform(train_set) # have shape (1, 10)
# array([[ 1., 2., 3., 4., 6., 9., 8., 12., 18., 27.]])

1.1.5.2 User-defined transformation

FunctionTransformer applies an arbitrary user-defined function (for example, a log transform) to the features.

from sklearn.preprocessing import FunctionTransformer


# illustrate this function on one synthesized sample
train_set = np.array([2, 3]).reshape(1, -1) # shape (1, 2)
# return array([[2, 3]])


transformer = FunctionTransformer(func = np.log1p, validate=True)
# perform log transformation: X' = log(1 + X)
# func can be any numpy function such as np.exp
result = transformer.transform(train_set)
# return array([[1.09861229, 1.38629436]]), the same as np.log1p(train_set)

That concludes the introduction to data preprocessing for static continuous variables. We suggest that readers work through the code themselves in Jupyter.
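
As a closing sketch, the fit-on-the-train-set / transform-the-test-set discipline used throughout this section is easiest to enforce by wrapping the preprocessing steps and the model in a single Pipeline:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


# every preprocessing step is fit on the train split only,
# then automatically applied to the test split inside score()
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('power', PowerTransformer(method='yeo-johnson', standardize=True)),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)
r2 = pipe.score(X_test, y_test)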
