
The two most important steps in developing machine learning models are feature engineering and preprocessing. Feature engineering involves feature design while preprocessing involves data cleaning.

We often spend a lot of time refining data for modeling purposes. To make this process more efficient, this article will share 4 tips to help you with feature design and preprocessing.

These techniques can be used to create new features, detect outliers, deal with unbalanced data, and estimate missing values.

Domain knowledge is probably one of the most important assets during feature design. A better understanding of the features you use can prevent both under-fitting and over-fitting.

1. Resample unbalanced data

In practice, you will encounter imbalanced data more often than not. If your target is only slightly imbalanced, this is usually not a problem and can be handled by evaluating your validation data with appropriate metrics, such as balanced accuracy, precision-recall curves, or the F1 score.
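For a slight imbalance, switching to such a metric is often enough. A minimal sketch with toy labels (unrelated to the dataset used later) might look like this:

from sklearn.metrics import balanced_accuracy_score, f1_score

# Toy validation labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(balanced_accuracy_score(y_true, y_pred))  # accuracy corrected for class sizes
print(f1_score(y_true, y_pred))                 # harmonic mean of precision and recall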

Unfortunately, this is not always the case, and your target variable can be highly imbalanced (for example, 10:1). In that case, you can oversample the minority class to restore balance using a technique called SMOTE.

SMOTE

SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique used to increase the number of samples in the minority class.

It generates new samples by looking at the feature space of the minority class and finding nearest neighbors. For a given minority sample, one of its neighbors is selected and a synthetic sample is created by randomly shifting the sample toward that neighbor in feature space.
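To make the interpolation idea concrete, here is a toy sketch (illustration only, not the imbalanced-learn implementation), assuming two minority samples in a two-dimensional feature space:

import numpy as np

# A synthetic sample lies somewhere on the segment between a minority
# sample and one of its nearest neighbors
sample = np.array([1.0, 5.0])
neighbor = np.array([2.0, 3.0])
synthetic = sample + np.random.rand() * (neighbor - sample)
print(synthetic)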

SMOTE is implemented in the imbalanced-learn package. You simply import the package and apply fit_resample:

import pandas as pd
from imblearn.over_sampling import SMOTE

# Import data and create X, y
df = pd.read_csv('creditcard_small.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1].map({1: 'Fraud', 0: 'No Fraud'})

# Resample data so that the 'Fraud' class has 1000 samples
X_resampled, y_resampled = SMOTE(sampling_strategy={'Fraud': 1000}).fit_resample(X, y)
X_resampled = pd.DataFrame(X_resampled, columns=X.columns)

Original data (LEFT) versus oversampled data (RIGHT).

As shown in the figure above, SMOTE successfully oversampled the target variable. When oversampling with SMOTE, the following sampling strategies can be used:

· ‘minority’: resample only the minority class;

· ‘not minority’: resample all classes except the minority class;

· ‘not majority’: resample all classes except the majority class;

· ‘all’: resample all classes;

· When passing a dict, the keys correspond to the target classes and the values to the desired number of samples for each class.

Here, we chose a dictionary to specify the extent to which the data is oversampled.

Bonus tip 1: If you have categorical variables in your dataset, SMOTE may create values for them that cannot occur. For example, if you have a variable called isMale, which can only take the values 0 or 1, SMOTE might create 0.365 as a value.

To prevent this, you can use SMOTENC, which takes categorical variables into account. This variant is also available in the imbalanced-learn package.
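A hedged sketch of how SMOTENC could be applied, reusing the X and y created earlier and assuming (purely for illustration) that columns 0 and 3 are the categorical ones:

from imblearn.over_sampling import SMOTENC

# SMOTENC needs the indices of the categorical columns so it does not
# interpolate them; the indices below are assumptions for illustration
smote_nc = SMOTENC(categorical_features=[0, 3], sampling_strategy='minority')
X_resampled, y_resampled = smote_nc.fit_resample(X, y)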

Bonus tip 2: Be sure to oversample after creating the training/test split, so that only the training data is oversampled. You usually do not want to test your model on synthetic data.
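A minimal sketch of that order of operations, reusing the X and y from the code above (the split parameters are just an example):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first, then oversample only the training data so the test set
# keeps its original, real class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)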

2. Create new features

To improve the quality and predictive power of a model, new features are often created from existing variables. We could create interactions (for example, multiplication or division) between every pair of variables and look for interesting new features, but this requires a lot of coding and is a lengthy process. Fortunately, Deep Feature Synthesis automates this.

Deep Feature Synthesis

Deep Feature Synthesis (DFS) is an algorithm that quickly creates new variables at varying depths. For example, you can multiply pairs of columns, but you can also choose to multiply column A by column B and then add column C.

Let's start with the data for the example. This article uses human resources analytics data because its features are easy to understand:

By intuition alone, we could identify average_monthly_hours divided by number_project as an interesting new variable. However, if we only follow our intuition, we may miss many more relationships.
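If you were to follow only that intuition, the feature could of course be created by hand; a quick sketch (assuming the column names referenced above exist in the file):

import pandas as pd

# Manually create the single ratio feature suggested by intuition
turnover_df = pd.read_csv('turnover.csv')
turnover_df['hours_per_project'] = (turnover_df['average_monthly_hours'] /
                                    turnover_df['number_project'])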

The featuretools package does require some understanding of how Entities are used. However, if you are working with a single table, you can simply follow the code below:

import featuretools as ft
import pandas as pd

# Create Entity
turnover_df = pd.read_csv('turnover.csv')
es = ft.EntitySet(id='Turnover')
es.entity_from_dataframe(entity_id='hr', dataframe=turnover_df, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='hr',
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

The first step is to create an Entity, which can establish relationships with other tables if necessary. Next, simply run ft.dfs to create the new variables. The parameter trans_primitives determines how variables are created; here, we chose to add and multiply numeric variables.

The output of DFS with verbose=True

As shown above, an additional 668 features were created with just a few lines of code. Some examples of the created features:

· Last_Evaluation multiplied by Satisfaction_level

· Left multiplied by promotion_last_5years

· Average_monthly_hours times satisfaction_level plus time_spend_company

Additional tip 1: Note that the application here is relatively basic. The real strength of DFS is that it can create new variables from aggregations between tables (for example, facts and dimensions).
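A hedged sketch of what that could look like across two related tables; all file, table, and column names here are made up purely for illustration:

import featuretools as ft
import pandas as pd

# Two hypothetical tables: customers (parent) and orders (child)
customers_df = pd.read_csv('customers.csv')
orders_df = pd.read_csv('orders.csv')

es = ft.EntitySet(id='Shop')
es.entity_from_dataframe(entity_id='customers', dataframe=customers_df,
                         index='customer_id')
es.entity_from_dataframe(entity_id='orders', dataframe=orders_df,
                         index='order_id')

# Relate the tables on customer_id so DFS can aggregate orders per customer
es.add_relationship(ft.Relationship(es['customers']['customer_id'],
                                    es['orders']['customer_id']))

feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='customers',
                                      agg_primitives=['mean', 'sum', 'count'])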

Additional tip 2: Run ft.list_primitives() to see the complete list of aggregation and transformation primitives that can be applied. It can even handle timestamps, null values, and lat/long information.
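For example (a small sketch; in the versions we used, list_primitives returns a dataframe with name, type, and description columns):

import featuretools as ft

# Inspect the available primitives and filter on the transformation ones
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'transform'].head())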

3. Handle missing values

Again, there is no single best way to handle missing values. A simple approach is to fill them with the mean or mode of certain groups. However, there are also more advanced techniques that estimate missing values from the data that is known.
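As an example of the simple approach, missing ages in the Titanic data used below could be filled with the mean age per passenger class; a small sketch (column names follow that dataset):

import pandas as pd

# Fill missing Age values with the mean Age of each passenger class
titanic = pd.read_csv('titanic.csv')
titanic['Age'] = titanic['Age'].fillna(
    titanic.groupby('Pclass')['Age'].transform('mean'))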

One such advanced method, called IterativeImputer, is a new feature in scikit-learn, based on the popular R package for imputing missing variables, mice (Multivariate Imputation by Chained Equations).

Iterative Imputer

While Python is a great language for developing machine learning models, there are still quite a few methods that work better in R. Examples are the well-established imputation packages in R: missForest, mi, mice, etc.

The IterativeImputer developed by scikit-learn models each feature with missing values as a function of the other features, and uses that model to estimate the missing entries. At each step, one feature is selected as the output y and all other features are treated as inputs X. A regressor is then fitted on X and y and used to predict the missing values of y. This is repeated for each feature.

Let's look at an example. The data used here is the well-known Titanic dataset. In this dataset, the column Age contains missing values that need to be filled. The code is as straightforward as ever:

# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer
# now you can import normally from sklearn.impute
from sklearn.impute import IterativeImputer

from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Load data and keep the numerical columns of interest
titanic = pd.read_csv("titanic.csv")
titanic = titanic.loc[:, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
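The imputation itself then takes only two more lines; a minimal sketch (the parameter values here are illustrative, not prescriptive):

# Model the missing Age values with a random forest and fill them in
imp = IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=0)
titanic = pd.DataFrame(imp.fit_transform(titanic), columns=titanic.columns)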

The advantage of this method is that it allows you to use an estimator of your choice. Here, a RandomForestRegressor was used to mimic the missForest approach commonly used in R.

Additional tip 1: If you have enough data, it may be an attractive option to simply drop the samples with missing values. However, keep in mind that this can introduce bias into your data; perhaps the missing data follows a pattern you are missing.

Additional tip 2: The Iterative Imputer allows different estimators to be used. After some testing, we found that you can even use CatBoost as an estimator! Unfortunately, LightGBM and XGBoost do not work, since their random-state parameter names differ.

4. Outlier Detection

Without a good understanding of the data, it is difficult to detect outliers. If you know the data well, you can more easily determine the thresholds beyond which the data stops making sense.

Sometimes this is impossible, because a perfect understanding of the data is hard to achieve. Instead, you can use an anomaly detection algorithm, such as the popular Isolation Forest.

Isolation Forest

In the Isolation Forest algorithm, the key word is isolation. In essence, the algorithm checks how easily a sample can be isolated. This results in an isolation number, which is calculated as the number of splits in a random decision tree needed to isolate a sample, averaged across all trees.

Isolation Forest Procedure. Retrieved from: https://donghwa-kim.github.io/iforest.html

If the algorithm only needs a few splits to isolate a sample, that sample is more likely to be an outlier. The splits themselves are chosen at random, which results in shorter paths for anomalies. Therefore, when the isolation number across all trees is low, the sample is very likely an outlier.

To demonstrate, we reuse the credit card dataset from before:

from sklearn.ensemble import IsolationForest
import pandas as pd
import seaborn as sns

# Predict and visualize outliers
credit_card = pd.read_csv('creditcard_small.csv').drop("Class", axis=1)
clf = IsolationForest(contamination=0.01, behaviour='new')
outliers = clf.fit_predict(credit_card)
sns.scatterplot(credit_card.V4, credit_card.V2, outliers,
                palette='Set1', legend=False)
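The output of fit_predict encodes inliers as 1 and outliers as -1, so the flagged samples can be inspected directly; a short illustrative follow-up:

# Select the rows that the Isolation Forest flagged as outliers
outlier_rows = credit_card[outliers == -1]
print(f"{len(outlier_rows)} samples flagged as outliers")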


Bonus tip 1: There is an extended version of Isolation Forest (Extended Isolation Forest) that addresses some shortcomings of the original algorithm, but reviews of it have been mixed.

Notebook with code: https://github.com/MaartenGr/feature-engineering/blob/master/Engineering%20Tips.ipynb
