- By Han Xinzi @Showmeai
- Tutorial address: www.showmeai.tech/tutorials/4…
- Article address: www.showmeai.tech/article-det…
- Statement: All rights reserved. For reproduction, please contact the platform and the author and cite the source.
- Check out ShowMeAI for more highlights
Introduction
The figure above is the familiar machine learning modeling flowchart. In ShowMeAI's earlier practical articles on Python machine learning applications, we walked through the full modeling pipeline with you and noted that one critically important step — data preprocessing and feature engineering — largely determines how good the final model will be. In this article, we give a comprehensive, practical walkthrough of data preprocessing and feature engineering.
Feature engineering
First let’s take a look at “feature engineering”. In fact, you’ve already seen that we did feature engineering in ShowMeAI’s series on Python Machine Learning and E-commerce sales Forecasting.
If we define feature engineering, it refers to: using domain knowledge and existing data to create new features for machine learning algorithms; It can be manual or automated.
- Feature: Information extracted from data that is useful for predicting results.
- Feature engineering: The process of using professional background knowledge and skills to process data so that features can be used better in machine learning algorithms.
There’s a popular saying in the industry:
The upper limit of the model is determined by data and feature engineering, and the improved algorithm only approximates this upper limit.
This is because, in data modeling, there is a difference between “ideal state” and “real scene”, and many times the raw data is not a clean and well-defined form with sufficient meaning:
Feature engineering processing is equivalent to sorting out data, extracting meaningful information based on business, and organizing it in a clean and tidy form:
Feature engineering has very important significance:
- The better the features, the more flexibility. Even generic models (or algorithms) can perform well as long as the features are well chosen. The flexibility of good features is that it allows you to choose models that are not complex, but are also faster, easier to understand and maintain.
- The better the feature, the simpler the model. With good features, your model will still perform well even if your parameters are not optimal, so you don’t have to spend too much time looking for optimal parameters, which greatly reduces the complexity of the model and makes the model simpler.
- The better the characteristics, the better the performance of the model. Obviously, there is no dispute that the ultimate goal of feature engineering is to improve the performance of the model.
In this article, ShowMeAI will take you to systematically learn feature engineering, including “feature type”, “data cleaning”, “feature construction”, “feature transformation”, “feature selection” and other sections.
Here we use the most simple and commonly used Titanic data set to explain to you.
The Titanic dataset is a great dataset for beginners in data science and machine learning. It contains personal information and survival status of passengers from the 1912 Titanic sinking. We can train suitable models on it and predict survival on new data (the test set).
The Titanic dataset can be loaded directly from the Seaborn library, as shown in the following code:
import pandas as pd
import numpy as np
import seaborn as sns
df_titanic = sns.load_dataset('titanic')
The data field description of the data set is as follows:
1. Feature types
Before the concrete demonstration of Titanic data preprocessing and feature engineering, ShowMeAI will give you some basic knowledge about data.
1.1 Structured vs. unstructured data
Data can be divided into "structured data" and "unstructured data". For example, in the Internet field, most of the tabular business data stored in databases is structured data, while text, voice, images and video are unstructured data.
1.2 Quantitative vs qualitative data
For the data we record, we can usually distinguish between "quantitative data" and "qualitative data", where:
- Quantitative data: Refers to numbers used to measure quantity and size.
- Such as height, length, volume, area, humidity, temperature and other measurements.
- Qualitative data: Refers to the categories used to describe the nature of things.
- Such as texture, taste, smell, color, etc.
The following figure summarizes two examples of data and their common processing and analysis methods:
2. Data cleaning
Before actual data mining or modeling, we go through a "data preprocessing" stage that cleans the raw data. Real-world data is generally incomplete and inconsistent "dirty data" that cannot be mined directly, or would yield unsatisfactory results.
The main causes of “dirty data” include:
- Tampering with the data
- Incomplete data
- Data inconsistency
- Data duplication
- Abnormal data
Data cleaning process includes data alignment, missing value processing, outlier value processing, data transformation and other data processing methods, as shown in the figure below:
Let's go through each of the processing methods mentioned above.
2.1 Data Alignment
The collected original data has different formats and forms. We will conduct data alignment processing for time, fields and related dimensions, etc. Data alignment and normalization will produce neat and consistent data, which is more suitable for modeling. Some processing examples are shown below:
(1) time
- Inconsistent date formats: 2022-02-20, 20220220, 2022/02/20, 20/02/2022.
- Timestamps expressed in different units: some in seconds, some in milliseconds.
- Invalid times: a timestamp of 0, or an end timestamp of FFFF.
(2) Fields
- A name written into the gender field, an ID number written into the mobile phone field, and so on.
(3) Dimensions
- Unify value types [for example 1, 2.0, 3.21E3, 4].
- Unify units [for example 180cm, 1.80m].
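As an illustration, here is a minimal pandas sketch of this kind of alignment; the raw 'date' and 'height' columns below are hypothetical and only serve to show the idea:
import pandas as pd

# Hypothetical raw records with inconsistent date formats and units
raw = pd.DataFrame({
    'date': ['2022-02-20', '20220220', '2022/02/20', '20/02/2022'],
    'height': ['180cm', '1.80m', '175cm', '1.65m'],
})

# Align dates: try each known format explicitly and keep the first successful parse
formats = ['%Y-%m-%d', '%Y%m%d', '%Y/%m/%d', '%d/%m/%Y']
parsed = [pd.to_datetime(raw['date'], format=f, errors='coerce') for f in formats]
raw['date'] = pd.concat(parsed, axis=1).bfill(axis=1).iloc[:, 0]

# Align units: convert every height to centimetres
def to_cm(s):
    if s.endswith('cm'):
        return float(s[:-2])
    if s.endswith('m'):
        return float(s[:-1]) * 100
    return float(s)

raw['height_cm'] = raw['height'].map(to_cm)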
2.2 Missing value processing
Data missing is a common problem in real data. For various reasons, the data we collect is not necessarily complete. We have some common processing methods for missing values:
- No processing (some models such as XGBoost/LightGBM can handle missing values).
- Delete missing data (by sample dimension or field dimension).
- Fill with the mean, median, mode, the mean of a comparable group, or a model estimate.
The specific processing method can be expanded into the following figure:
Back to our Titanic dataset, we demonstrate the various methods:
Let's first get an overview of the missing values in the dataset (count per column):
df_titanic.isnull().sum()
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
(1) Deletion
The most straightforward (and crude) treatment is to remove missing values, i.e. delete the objects (fields, or samples/records) with missing attribute values, producing a complete information table. Advantages and disadvantages:
- Advantages: simple and effective when an object is missing several attribute values and the deleted objects are very few compared with the size of the initial dataset.
- Disadvantages: when the proportion of missing data is large, and especially when the missing values are not randomly distributed, this method may bias the data and lead to incorrect conclusions.
In our Titanic case, the embark_town field has 2 null values, so we consider deleting those missing rows.
df_titanic[df_titanic["embark_town"].isnull()]
df_titanic.dropna(axis=0,how='any',subset=['embark_town'],inplace=True)
(2) Data filling
The second category of approaches fills in the missing values, for example with statistical methods, model-based methods, or business-informed rules.
① Manual filling
Perform manual filling based on business knowledge.
② Fill in special values
Treat the null value as a special attribute value, different from any other value. For example, fill all null values with "unknown". This is generally used as a temporary fill or an intermediate step.
Code implementation
df_titanic['embark_town'].fillna('unknown', inplace=True)
③ Statistical filling
If the missing rate is low, it can be filled according to the data distribution. Common padding statistics are as follows:
- Median: In the case of skewed distribution of data, the median is used to fill in the missing value.
- Mode: Missing values can be filled with mode for discrete features.
- Mean: For data consistent with uniform distribution, the missing value is filled in with the mean of the variable.
Median filling — fare: fill the missing values with the median.
df_titanic['fare'].fillna(df_titanic['fare'].median(), inplace=True)
Mode filling — embarked: only two values are missing, so we fill with the mode.
df_titanic['embarked'].isnull().sum()  # Result: 2
df_titanic['embarked'].fillna(df_titanic['embarked'].mode()[0], inplace=True)
df_titanic['embarked'].value_counts()
# Execution result:
# S 64
Same-group mean filling
Age: group by sex, pclass and who; for a missing age, fill with the mean (or median) of the group the passenger falls into.
df_titanic.groupby(['sex', 'pclass', 'who'])['age'].mean()
age_group_mean = df_titanic.groupby(['sex', 'pclass', 'who'])['age'].mean().reset_index()
def select_group_age_median(row):
    condition = ((row['sex'] == age_group_mean['sex']) &
                 (row['pclass'] == age_group_mean['pclass']) &
                 (row['who'] == age_group_mean['who']))
    return age_group_mean[condition]['age'].values[0]

df_titanic['age'] = df_titanic.apply(lambda x: select_group_age_median(x) if np.isnan(x['age']) else x['age'], axis=1)
④ Model-based prediction filling
If the other, non-missing fields carry enough information, we can also fill missing values with a predictive model: treat the field to be filled as the label, use the rows without missing values as training data, build a classification/regression model, and predict the missing entries with it.
Nearest Neighbor method (KNN)
- First, determine the K samples closest to the sample with the missing value, by Euclidean distance or correlation analysis; then estimate the missing value by a weighted average / vote over those K samples' values.
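As a reference sketch of this idea, scikit-learn's KNNImputer averages the values of the K nearest rows; the numerical feature subset used below is our own choice for illustration:
from sklearn.impute import KNNImputer

# KNN imputation on numerical columns only: missing ages are replaced by the mean age of the 5 nearest rows
knn_cols = ['age', 'pclass', 'sibsp', 'parch', 'fare']
knn_imputer = KNNImputer(n_neighbors=5)
df_titanic_knn = pd.DataFrame(knn_imputer.fit_transform(df_titanic[knn_cols]), columns=knn_cols)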
Regression (Regression)
- Based on the complete data, build a regression equation. For an object with a missing value, substitute its known attribute values into the equation to estimate the unknown value, and fill with that estimate. Linear regression is often used; when the variables are not actually linearly related, this leads to biased estimates.
Let’s take the Age field in the Titanic case as an example to explain:
- Many age values are missing. Here we use sex, pclass, who, fare, parch and sibsp to build a random forest model and fill in the missing ages.
df_titanic_age = df_titanic[['age', 'pclass', 'sex', 'who', 'fare', 'parch', 'sibsp']]
df_titanic_age = pd.get_dummies(df_titanic_age)
df_titanic_age.head()
# Passengers are divided into known ages and unknown ages
known_age = df_titanic_age[df_titanic_age.age.notnull()]
unknown_age = df_titanic_age[df_titanic_age.age.isnull()]
# y is the target age
y_for_age = known_age['age']
# X is the value of the characteristic attribute
X_train_for_age = known_age.drop(['age'], axis=1)
X_test_for_age = unknown_age.drop(['age'], axis=1)
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X_train_for_age, y_for_age)
# Use the model to predict unknown age outcomes
y_pred_age = rfr.predict(X_test_for_age)
# Fill in the original missing data with the predicted results
df_titanic.loc[df_titanic.age.isnull(), 'age'] = y_pred_age
sns.distplot(df_titanic.age)
⑤ Interpolation method filling
Interpolation can also be used to fill in missing data; methods include linear interpolation, multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation, etc.
Linear interpolation
An estimate of a missing value can be computed by interpolating between two known points (x0, y0) and (x1, y1): assuming y = f(x) is the straight line through the two points, we determine f(x) and then evaluate it at the known x to estimate the missing y.
The .interpolate(method='linear', axis=...) method fills NaN values by linear interpolation along the given axis, i.e. each missing value is interpolated from the values before and after it.
df_titanic['fare'].interpolate(method = 'linear', axis = 0)
Interpolation across columns (axis=1) of a DataFrame is also possible:
df_titanic[['age', 'fare']].interpolate(method='linear', axis=1)  # shown only to illustrate the axis parameter
Multiple Imputation
The idea of multiple imputation comes from Bayesian estimation: the value to be imputed is regarded as random and drawn from the observed values. In practice, candidate values are estimated and different noise is added to form several sets of candidate imputations; the most suitable one is then chosen according to some selection criterion.
The multiple imputation method has three steps:
- ① For each missing value, generate a set of possible imputation values that reflect the uncertainty of the non-response model; each set can be used to impute the dataset, producing several complete datasets.
- ② Analyze each imputed dataset with the statistical methods intended for complete data.
- ③ Select results from the imputed datasets according to a scoring function to produce the final imputation values.
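As a related sketch (not the exact procedure above), scikit-learn's IterativeImputer, which is still marked experimental, follows a chained-equations idea: each feature with missing values is repeatedly modeled on the other features. The column list below is our own choice:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates the experimental API
from sklearn.impute import IterativeImputer

num_cols = ['age', 'pclass', 'sibsp', 'parch', 'fare']
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
df_titanic_iter = pd.DataFrame(iter_imputer.fit_transform(df_titanic[num_cols]), columns=num_cols)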
⑥ Dummy variable padding
There is another very interesting filling method called "dummy variable filling", which can be used when the variable is discrete and has few distinct values. Taking the Titanic data as an example:
- The SEX variable has three distinct values: male, female and NA (missing). The column can be converted into three indicator columns IS_SEX_MALE, IS_SEX_FEMALE, IS_SEX_NA.
- If a variable has a dozen or more distinct values, the low-frequency values can be grouped into an "other" category according to each value's frequency, reducing the dimensionality. This practice preserves as much of the variable's information as possible.
The following is a reference code example:
sex_list = ['MALE', 'FEMALE', np.NaN, 'FEMALE', 'FEMALE', np.NaN, 'MALE']
df = pd.DataFrame({'SEX': sex_list})
display(df)
df.fillna('NA', inplace=True)
df = pd.get_dummies(df['SEX'], prefix='IS_SEX')
display(df)
# Raw data
   SEX
0  MALE
1  FEMALE
2  NaN
3  FEMALE
4  FEMALE
5  NaN
6  MALE

# After filling
   IS_SEX_FEMALE  IS_SEX_MALE  IS_SEX_NA
0              0            1          0
1              1            0          0
2              0            0          1
3              1            0          0
4              1            0          0
5              0            0          1
6              0            1          0
When more than 80% of a feature's values are missing, it is recommended to delete the feature (or keep only a "present / missing" indicator), since it can easily hurt the model:
df_titanic = df_titanic.drop(["deck"], axis=1)
2.3 Handling outliers
Data quality greatly affects how well machine learning works. Erroneous values or outliers may result from measurement errors or abnormal system conditions and can cause serious problems for model learning. In practice we often need to detect and handle outliers; here is a summary.
(1) Abnormal detection methods
① Based on statistical analysis
Usually we model the data points with some assumed statistical distribution and then decide whether a point is anomalous according to how it sits under that distribution.
For example, by analyzing indicators of dispersion (data variation), we can understand how the data are distributed and then use these indicators to find anomalous points.
Commonly used variation indicators include the range, interquartile range, mean deviation, standard deviation, coefficient of variation and so on. A large variation indicator means wide dispersion; a small one means the data are tightly clustered.
For example, the maximum and minimum can be used to judge whether a variable's value exceeds a reasonable range: a customer age of -20 or 200 is an outlier.
② 3σ principle
If the data are approximately normally distributed, the 3σ principle treats as outliers any values that deviate from the mean by more than 3 standard deviations.
- Under a normal distribution, the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, a rare small-probability event.
- If the data do not follow a normal distribution, outliers can still be described by how many standard deviations they lie from the mean.
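A minimal sketch of the 3σ rule applied to the fare field (fare is actually long-tailed rather than normal, so this is only for illustration):
fare_mean, fare_std = df_titanic['fare'].mean(), df_titanic['fare'].std()
lower, upper = fare_mean - 3 * fare_std, fare_mean + 3 * fare_std
outliers_3sigma = df_titanic[(df_titanic['fare'] < lower) | (df_titanic['fare'] > upper)]
print(len(outliers_3sigma))  # number of fares flagged by the 3σ rule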
③ Boxplot analysis
You will recall from the data analysis series that the boxplot is a handy tool. It provides a standard for identifying outliers: a value is called an outlier if it is smaller than Q1 − 1.5·IQR or larger than Q3 + 1.5·IQR.
- Q1 is the lower quartile: a quarter of the observations are smaller than it.
- Q3 is the upper quartile: a quarter of the observations are larger than it.
- IQR is the interquartile range, the difference between the upper quartile Q3 and the lower quartile Q1; it contains half of all observations.
The boxplot criterion is based on quartiles and the IQR, which are robust: up to 25% of the data can be moved arbitrarily far without disturbing the quartiles, so outliers cannot distort the criterion. This makes it a fairly objective and advantageous way to identify outliers.
sns.catplot(y="fare",x="survived", kind="box", data=df_titanic,palette="Set2")
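The boxplot criterion can also be computed directly; a minimal sketch on the fare field:
# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = df_titanic['fare'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers_iqr = df_titanic[(df_titanic['fare'] < q1 - 1.5 * iqr) | (df_titanic['fare'] > q3 + 1.5 * iqr)]
print(len(outliers_iqr))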
④ Based on model detection
We can also detect outliers based on the model. The basic idea is to establish a data model first, and those objects that do not fit perfectly with the model are regarded as anomalies.
- If the model is a collection of clusters, the exception is an object that does not significantly belong to any cluster.
- When using regression models, anomalies are relatively distant objects from predicted values.
Advantages: built on solid statistical theory, these tests can be very effective when there is enough data and knowledge of the distribution being tested.
Disadvantages: for multivariate data fewer options are available, and for high-dimensional data these tests work poorly.
⑤ Based on distance
We also have distance-based methods for exception detection. Such methods are based on the assumption that a data object is an exception if it is far from most points. By defining the proximity measurement between objects, it can judge whether abnormal objects are far away from other objects according to the distance. The main distance measurement methods are absolute distance (Manhattan distance), Euclidean distance and Mahalanobis distance.
Advantages:
- Distance-based methods are much simpler than statistics-based methods: it is much easier to define a distance measure on a dataset than to determine its distribution.
Disadvantages:
- Proximity-based methods require O(m²) time and are not suitable for large datasets.
- The method is also sensitive to the choice of parameters.
- Datasets with regions of different densities cannot be handled, because a global threshold cannot account for such density variation.
⑥ Based on density
A very direct idea for anomaly detection is based on density: looking at the density around the current point, a local outlier has a local density significantly lower than most of its neighboring points. Such methods are suitable for datasets with non-uniform density.
Advantages:
- Gives a quantitative measure of how much of an outlier an object is, and handles data with regions of different densities well.
Disadvantages:
- As with distance-based methods, these methods naturally have O(m²) time complexity.
- For low-dimensional data, O(m log m) can be achieved with an appropriate data structure.
- Parameter selection is difficult: although the algorithm copes by trying different k values and taking the maximum outlier score, upper and lower bounds for these values still have to be chosen.
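A commonly used density-based detector is the Local Outlier Factor (LOF) in scikit-learn; a minimal sketch on two numerical Titanic columns (the choice of columns is ours, for illustration only):
from sklearn.neighbors import LocalOutlierFactor

X_lof = df_titanic[['age', 'fare']].dropna()
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X_lof)   # -1 marks points flagged as local outliers
print((lof_labels == -1).sum())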
⑦ Based on clustering
We can also detect outliers based on clustering: samples far from any cluster are more likely to be outliers.
However, this method is affected by the chosen number of clusters K. One strategy is to repeat the analysis for different numbers of clusters; another is to find lots of small clusters, the idea being:
- Smaller clusters tend to be more condensed;
- If an object is an outlier when there are lots of small clusters, it is probably a true outlier.
- The downside is that a group of anomalies can form small clusters and evade detection.
Advantages:
- Clustering techniques with linear or near-linear complexity (such as K-Means) make finding outliers potentially very efficient.
- Clusters are usually defined as the complement of the outliers, so clusters and outliers can be found at the same time.
Disadvantages:
- The set of outliers produced, and their scores, may depend heavily on the number of clusters used and on the presence of outliers in the data.
- The quality of the clusters produced by the clustering algorithm strongly affects the quality of the outliers it finds.
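A minimal sketch of the clustering idea just described, using K-Means: each point is scored by its distance to its own cluster centre and the farthest points are flagged (the cluster count and the 99% threshold are arbitrary choices for illustration):
import numpy as np
from sklearn.cluster import KMeans

X_km = df_titanic[['age', 'fare']].dropna().values
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_km)
# Distance of every sample to the centre of the cluster it was assigned to
dist = np.linalg.norm(X_km - kmeans.cluster_centers_[kmeans.labels_], axis=1)
outlier_mask = dist > np.percentile(dist, 99)   # flag the farthest 1% of points
print(outlier_mask.sum())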
⑧ Anomaly detection based on proximity
Similarly, there is an approach to anomaly detection based on nearest neighbors: we consider outliers to be points far from most other points. This approach is more general and easier to use than statistical methods, because it is easier to define a meaningful proximity measure on a dataset than to determine its statistical distribution. An object's outlier score is given by the distance to its k-nearest neighbor, so the score is highly sensitive to the value of k:
- If K is too small (for example, 1), a small number of neighboring outliers may result in a lower outlier score.
- If K is too large, then all objects in clusters that contain fewer than K members may become outliers.
In order to make the scheme more robust to the selection of K, the average distance of K nearest neighbors can be used.
Advantages:
- Simple.
Disadvantages:
- Proximity-based methods require O(m²) time and are not suitable for large datasets.
- The method is also sensitive to the choice of parameters.
- Datasets with regions of different densities cannot be handled, because a global threshold cannot account for such density variation.
At the data-processing stage, we treat outliers as values that hurt data quality rather than as targets of anomaly detection in its own right. Usually simple and intuitive methods are used: judge a variable's outliers by combining boxplots with the MAD statistic, and draw a scatter plot to judge directly from the distribution, as below.
sns.scatterplot(x="fare", y="age", hue="survived",data=df_titanic,palette="Set1")
(2) Outlier handling methods
The handling of outliers needs to be decided case by case. Commonly used methods include:
- Delete records containing outliers;
- For samples filtered out as abnormal, it is best to confirm with the business whether they are genuinely unneeded abnormal samples, to avoid filtering out normal ones;
- Treat outliers as missing values and hand them to the missing value processing method.
- Use mean/median/mode to correct;
- Do not handle.
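As one illustration of the "treat outliers as missing values" strategy above, the sketch below applies the IQR rule to fare and then reuses median filling; it works on a copy so the working dataset is untouched:
# Mark fares outside the IQR bounds as missing, then fill with the median (done on a copy)
fare_fixed = df_titanic['fare'].copy()
q1, q3 = fare_fixed.quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (fare_fixed < q1 - 1.5 * iqr) | (fare_fixed > q3 + 1.5 * iqr)
fare_fixed[outlier_mask] = np.nan
fare_fixed.fillna(fare_fixed.median(), inplace=True)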
3. Feature construction
The preprocessing steps give us clean, tidy and accurate data, but that data is not necessarily the most effective for modeling. As a next step we usually perform feature construction, generating derived variables based on the business scenario to improve the expressiveness of the data and the modeling result.
3.1 Statistical feature construction
Statistical features are a kind of very effective features, especially in sequence problem scenarios. The following are some thinking dimensions and methods for constructing statistical features:
- ① Build new features based on business rules and prior knowledge.
- ② Quartiles, median, mean, standard deviation, deviation, skewness, kurtosis, coefficient of dispersion (a small group-statistics sketch follows this list).
- ③ Construct long- and short-term statistics (e.g. weekly, monthly).
- ④ Time decay (the more recent the observation, the higher its weight).
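For example, group statistics such as the mean fare within each passenger class can be broadcast back to every row as new columns; a minimal sketch (the new column names are our own choice):
# Mean and standard deviation of fare within each pclass, attached to every row
df_titanic['fare_mean_by_pclass'] = df_titanic.groupby('pclass')['fare'].transform('mean')
df_titanic['fare_std_by_pclass'] = df_titanic.groupby('pclass')['fare'].transform('std')
# Relative position of a passenger's fare inside their class
df_titanic['fare_ratio_to_class_mean'] = df_titanic['fare'] / df_titanic['fare_mean_by_pclass']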
Going back to the Titanic data set, let’s see what new features we can do with business understanding:
Age processing
We further process the age field. Considering that different age groups may have had different chances of rescue, we segment passengers by age into child, young, midlife and old.
def age_bin(x):
if x <= 18:
return 'child'
elif x <= 30:
return 'young'
elif x <= 55:
return 'midlife'
else:
return 'old'
df_titanic['age_bin'] = df_titanic['age'].map(age_bin)
df_titanic['age_bin'].unique()  # Result: array(['young', 'midlife', 'child', 'old'], dtype=object)
Extract "title" features
In the name field we can see various titles such as "Mr", "Master" and "Dr", which reflect a passenger's identity and other information. We can extract them to construct a new feature.
# Extract the title from the name field
df_titanic['title'] = df_titanic['name'].map(lambda x: x.split(', ')[1].split('.')[0].strip())
df_titanic['title'].value_counts()
The result is as follows:
Mr 757
Miss 260
Mrs 197
Master 61
Rev 8
Dr 8
Col 4
Ms 2
Major 2
Mlle 2
Dona 1
Sir 1
Capt 1
Don 1
Lady 1
Mme 1
the Countess 1
Jonkheer 1
Let's do a simple tally of these titles:
Are they officer titles, royalty titles, or ordinary ones such as Mr, Mrs, Miss?
df_titanic['title'].unique()
Execution Result:
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
'Jonkheer', 'Dona'], dtype=object)
Below we standardize and unify these titles.
title_dictionary = {
"Mr": "Mr",
"Mrs": "Mrs",
"Miss": "Miss",
"Master": "Master",
"Don": "Royalty",
"Rev": "Officer",
"Dr": "Officer",
"Mme": "Mrs",
"Ms": "Mrs",
"Major": "Officer",
"Lady": "Royalty",
"Sir": "Royalty",
"Mlle": "Miss",
"Col": "Officer",
"Capt": "Officer",
"the Countess": "Royalty",
"Jonkheer": "Royalty",
"Dona": 'Mrs'
}
df_titanic['title'] = df_titanic['title'].map(title_dictionary)
df_titanic['title'].value_counts()
The result is as follows:
Mr 757
Miss 262
Mrs 201
Master 61
Officer 23
Royalty 5
Extracting family size
On the Titanic, some passengers were related to each other. Since family size may also have influenced the chance of rescue, we construct a family_size feature to represent it.
df_titanic['family_size'] = df_titanic['sibsp'] + df_titanic['parch'] + 1
df_titanic['family_size'].head()
The result is as follows:
0    2
1    2
2    1
3    2
4    1
3.2 Periodic values
In e-commerce and similar scenarios, data often follow periodic patterns, and we can extract periodic statistics as useful information.
Some dimensions to consider for the timing cycle are as follows:
- ① Values over the last n periods/days/months/years, e.g. the count and mean over the last 5 days.
- ② Year-over-year / month-over-month ratios.
3.3 Data buckets
Data bucketing (binning) is a common way to handle continuous-valued attributes: the continuous range is cut into segments and each value is assigned to its segment. Data bucketing is also called binning or discretization.
(1) Equal-frequency and equal-width binning
(a) Custom binning
Intervals are defined manually based on business experience or common sense, and the raw data are assigned to these intervals.
(b) Equal-width binning
The data are divided into N intervals of the same width.
From the minimum to the maximum, the range is divided into N equal parts. If A and B are the minimum and maximum, the width of each interval is W = (B − A)/N, and the interval boundaries are A+W, A+2W, …, A+(N−1)W.
Equal-width binning only considers the boundaries, so the number of instances per bin can vary widely; its drawback is that it is strongly affected by outliers.
(c) Equal-frequency binning
The data are divided into bins that each contain the same number of samples.
In equal-frequency binning the interval boundaries are chosen so that each interval ends up with roughly the same number of instances; for example, with N = 5 each interval contains about 20% of the instances.
- Binning numerical variables
Let's first do an equal-frequency split of the ticket fare (if you plot the fare distribution you will find it is very long-tailed and therefore not well suited to equal-width binning) and look at the resulting segments.
# qcut: equal-frequency binning
df_titanic['fare_bin'], bins = pd.qcut(df_titanic['fare'], 5, retbins=True)
df_titanic['fare_bin'].value_counts()
The resulting bin boundaries are:
bins  # array([0., 7.8542, 10.5, 21.6792, 39.6875, 512.3292])
The fare is then segmented according to these interval boundaries:
# Bucket fare by the interval boundaries found above
def fare_cut(fare):
if fare <= 7.8958:
return 0
if fare <= 10.5:
return 1
if fare <= 21.6792:
return 2
if fare <= 39.6875:
return 3
return 4
df_titanic['fare_bin'] = df_titanic['fare'].map(fare_cut)
Compared with the fare, the age field has a more concentrated distribution and relatively clear interval boundaries, so we bin it with explicit cut points. The code is as follows:
# cut: binning with explicit boundaries
bins = [0, 12, 18, 65, 100]
pd.cut(df_titanic['age'], bins).value_counts()
(2) Best-KS binning
- 1. Sort the feature values from small to large.
- 2. Find the point with the maximum KS value; this becomes the cut point, denoted D, and the data are split into two parts at D.
- 3. Repeat step 2 recursively on the parts on either side of D, until the number of bins reaches the preset threshold.
- 4. For continuous variables: the KS value of each bin after Best-KS binning is no greater than the KS value before binning.
- 5. During binning, the KS value after a split is determined by a single cut point rather than by several cut points jointly; the cut point is the position where the original KS value is largest.
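A minimal sketch of finding a single Best-KS cut point for a continuous feature against the binary survived label; a full Best-KS binner would recurse on both sides of the cut, which is omitted here:
import numpy as np

def best_ks_cut(feature, target):
    # Sort by feature value, then compute the KS curve |cumulative positive rate - cumulative negative rate|
    data = pd.DataFrame({'x': feature, 'y': target}).dropna().sort_values('x')
    cum_pos = (data['y'] == 1).cumsum() / (data['y'] == 1).sum()
    cum_neg = (data['y'] == 0).cumsum() / (data['y'] == 0).sum()
    ks = (cum_pos - cum_neg).abs()
    best_idx = ks.values.argmax()
    return data['x'].iloc[best_idx], ks.iloc[best_idx]

cut_point, ks_value = best_ks_cut(df_titanic['fare'], df_titanic['survived'])
print(cut_point, ks_value)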
(3) Chi-square binning
This bottom-up (merge-based) discretization method relies on the chi-square test: adjacent intervals with the smallest chi-square value are merged, until a predefined stopping criterion is met.
Basic idea:
If two adjacent intervals have very similar class distributions, they can be merged; otherwise they should stay separate. A low chi-square value indicates similar class distributions.
Implementation steps:
- ① Define a chi square threshold in advance
- ② Initialization; Instances are sorted by attributes to be discretized, and each instance belongs to an interval
- ③ Merging interval
- Calculate the chi-square value of each pair of adjacent intervals
- Combine the pair of intervals with the lowest chi-square value
Code implementation: github.com/Lantianzz/S…
(4) Minimum entropy method
There is also minimum-entropy binning, which minimizes the total entropy of the bins, i.e. chooses the binning that best discriminates the dependent variable.
Entropy is the information-theoretic measure of disorder in data. The basic purpose of information entropy is to quantify the relationship between the amount of information and the redundancy of a symbol system, so that data can be stored, managed and transmitted with the highest efficiency at the lowest cost.
The lower the entropy of a dataset, the smaller the differences within it. Minimum-entropy partitioning makes the data inside each bin as similar as possible. Given the number of bins, if all possible binnings are considered, the one produced by the minimum-entropy method has the smallest total entropy.
3.4 Feature Combination
In some scenarios, we will consider feature combination to construct strong features. The commonly used feature combination construction methods are as follows:
- Discrete + discrete: build the Cartesian product (i.e. pairwise combined categories); a small sketch appears at the end of this subsection.
- Discrete + continuous: bucket the continuous feature and then take the Cartesian product, or build statistical features with a group-by over the categorical feature.
- Continuous + continuous: addition, subtraction, multiplication, division, polynomial features, second-order differences, etc.
- Polynomial features
For continuous-valued features, we can construct polynomial features over several columns to combine features and add higher-order terms.
In the example of Titanic, the numerical characteristics are as follows:
df_titanic_numerical = df_titanic[['age', 'sibsp', 'parch', 'fare', 'family_size']]
df_titanic_numerical.head()
We can build polynomial features by referring to the following code
# Extend numerical features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
df_titanic_numerical_poly = poly.fit_transform(df_titanic_numerical)
pd.DataFrame(df_titanic_numerical_poly, columns=poly.get_feature_names()).head()
After constructing the features, we checked the correlation of the derived new feature variables. In the following heatmap, the darker the color, the greater the correlation:
sns.heatmap(pd.DataFrame(df_titanic_numerical_poly, columns=poly.get_feature_names()).corr())
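For the "discrete + discrete" case mentioned at the start of this subsection, a simple way to build the Cartesian-product feature is to concatenate the two category values into a single string and then encode it; a minimal sketch (the new column names are our own choice):
# Cross 'sex' and 'pclass' into one combined categorical feature, then one-hot encode it
df_titanic['sex_pclass'] = df_titanic['sex'].astype(str) + '_' + df_titanic['pclass'].astype(str)
sex_pclass_dummies = pd.get_dummies(df_titanic['sex_pclass'], prefix='sex_pclass')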
4. Feature transformation
We will do some “feature transformation” operations for the constructed features to adapt to different models and better complete the modeling.
4.1 Standardization
The standardization operation, also known as the z-score transform, makes a numerical feature column have an arithmetic mean of 0 and a variance (and standard deviation) of 1, as shown in the figure below.
Note: if the numerical feature column contains outliers with very large or very small values (found via EDA), more robust statistics should be used: the median instead of the arithmetic mean, and a quantile range instead of the variance. This robust variant has important parameters (the lower and upper quantile limits), which are best determined with EDA visualization; it is immune to outliers.
Reference codes for standardized operations are as follows:
from sklearn.preprocessing import StandardScaler
import joblib

# Fit the standardization model
Stan_scaler = StandardScaler()
Stan_scaler.fit(x)
x_zscore = Stan_scaler.transform(x)
x_test_zscore = Stan_scaler.transform(x_test)
joblib.dump(Stan_scaler, 'zscore.m')  # Save the fitted scaler to a file
4.2 Normalization
Normalization rescales the data according to the length (norm) of each sample vector; it changes magnitudes but not the relative order of the original data. As shown below:
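A reference sketch with scikit-learn's Normalizer, which rescales each sample (row) to unit L2 norm; x is assumed to be the same feature matrix as in the neighbouring examples:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')          # each row is divided by its own L2 norm
x_normalized = normalizer.fit_transform(x)  # x: the feature matrix used in the other scaling examples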
4.3 Scaling
Amplitude scaling brings the values of different features into roughly the same order of magnitude and range. The most commonly used method is min-max scaling, as shown in the figure below:
The following is the reference code for the zoom operation:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(x)
x_minmax = min_max_scaler.transform(x)
x_test_minmax = min_max_scaler.transform(x_test)
joblib.dump(min_max_scaler, 'min_max_scaler.m')  # Save the fitted scaler to a file
4.4 Normalization VS standardization
Normalization and normalization are two very common feature transformation operations. Let’s compare normalization and normalization below:
- They serve different purposes: normalization removes the effect of units and compresses values into the interval [0,1]; standardization only adjusts the overall distribution of the feature.
- Normalization is related to maximum and minimum; Standardization has to do with mean and standard deviation.
- Normalized output is between [0,1]; Standardization is unlimited.
Their application scenarios can be summarized as follows:
- Z-score standardization performs better when distance is used to measure similarity (e.g. SVM, KNN) or when PCA is used for dimensionality reduction.
- Normalization (or other scaling methods) can be used when no distance measure or covariance calculation is involved, or when the data do not follow a normal distribution. For example, in image processing, RGB values are constrained to the range [0, 255] after conversion to a grayscale image.
- Tree-based models (e.g. Random Forest, GBDT, XGBoost, LightGBM) do not require feature normalization; parameter-based or distance-based models (logistic regression, K-Means clustering, neural networks, etc.) do, because parameters or distances have to be computed.
4.5 Nonlinear Transformation
In some scenarios we also adjust or correct the distribution of numerical fields, using statistical or mathematical transformations to reduce the impact of skewed distributions: values in originally dense intervals are spread out as much as possible, and values in originally sparse intervals are pulled together.
Most of these transformations belong to the power-transform family; their main purposes are to stabilize variance, keep the distribution close to normal, and make the data independent of the mean of the distribution.
Let’s look at some typical nonlinear statistical transformations.
(1) Log transformation
The log transform is often used to create a monotonic transformation of the data. Its main purposes are to stabilize variance, keep the distribution close to normal, and make the data independent of the mean of the distribution.
- The log transform tends to stretch the range of values of the independent variable that fall in the lower range and to compress the range of values that fall in the higher range, bringing a skewed distribution as close to normal as possible.
- For numerical continuous features with unstable variance, we can use the log to adjust the variance of the whole distribution; this is a variance-stabilizing transformation.
The log transform belongs to the power-transform family; mathematically it is y = log_b(x), the base-b logarithm of x.
Below, we perform log1P transformation for the ticket price field in Titanic data set, and the sample code is as follows:
sns.distplot(df_titanic.fare,kde=False)
df_titanic['fare_log'] = np.log((1+df_titanic['fare']))
sns.distplot(df_titanic.fare_log,kde=False)
(2) Box-Cox transformation
The Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964 and commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After the Box-Cox transformation, the correlation between the (unobservable) error and the predictors can be reduced to some extent.
The main characteristic of the Box-Cox transformation is that it introduces a parameter λ, estimates that parameter from the data themselves and then determines the form of the transformation. It can noticeably improve the normality, symmetry and variance homogeneity of the data, and it works well on much real data.
The mathematical expression of the Box-Cox transformation is y(λ) = (x^λ − 1)/λ for λ ≠ 0, and y(λ) = ln(x) for λ = 0.
The transformed output y is a function of the input x and the transformation parameter λ; when λ = 0 the transformation is exactly the natural-log transform mentioned above. The best value of λ is usually determined by maximum likelihood or maximum log-likelihood.
Below, we perform Box-Cox transformation on the ticket price field in Titanic data set, and the sample code is as follows:
# Keep only non-null, strictly positive fare values (Box-Cox requires positive inputs)
fare_positive_value = df_titanic[(~df_titanic['fare'].isnull()) & (df_titanic['fare'] > 0)]['fare']
import scipy.stats as spstats
# Compute the optimal λ value
l, opt_lambda = spstats.boxcox(fare_positive_value)
print('Optimal lambda value:', opt_lambda)  # 0.5239075895755266
# Apply the Box-Cox transformation
fare_boxcox_lambda_opt = spstats.boxcox(df_titanic[df_titanic['fare'] > 0]['fare'], lmbda=opt_lambda)
sns.distplot(fare_boxcox_lambda_opt, kde=False)
4.6 Discrete variable processing
For categorical fields (such as color, type, quality grade), many models cannot handle them directly. Encoding them presents the information in a usable form and supports model learning. Common encoding methods for categorical variables include:
(1) Label Encoding
Label encoding is one of the most common encoding methods for categorical data; it maps each category to an integer between 0 and n_classes − 1.
For example, [dog, cat, dog, mouse, rabbit] is converted to [0, 1, 0, 2, 3].
- Advantages: compared with one-hot encoding, label encoding takes less memory and supports encoding of text features.
- Disadvantages: the encoding imposes an artificial size ordering on the categories, which distorts some computational models (such as logistic regression); it is fine for tree models.
The reference code for label coding is as follows:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["super first-tier", "first-tier", "second-tier", "third-tier"])
print('Classes: {}'.format(list(le.classes_)))
# Classes: ['first-tier', 'second-tier', 'super first-tier', 'third-tier']
print('Transformed labels: {}'.format(le.transform(["super first-tier", "first-tier", "second-tier"])))
# Transformed labels: [2 0 1]
print('Inverse transform: {}'.format(list(le.inverse_transform([2, 2, 1]))))
# Inverse transform: ['super first-tier', 'super first-tier', 'second-tier']
(2) One hot Encoding
One-hot encoding is usually used for features whose categories have no ordinal relationship.
For example, the blood-type feature has four categories (A, B, AB, O); one-hot encoding turns it into a 4-dimensional sparse vector:
- A is [1,0,0,0]
- B represents [0,1,0,0]
- AB is represented by [0,0,1,0]
- O stands for [0,0,0,1]
The dimensions of the generated sparse vector are the same as the number of categories.
- Advantages: one-hot encoding solves the problem that classifiers cannot handle categorical attributes directly, and to some extent it also expands the feature set. Its values are only 0 and 1, with the different categories stored along separate dimensions.
- Disadvantages: it can only binarize numeric variables and cannot directly encode string-valued categorical variables; when the number of categories is large, the feature space becomes very large. In such cases PCA can generally be used to reduce the dimensionality, and the one-hot encoding + PCA combination is useful in practice.
Using the pandas library (see the pandas quick guide in ShowMeAI's data analysis series for details), a reference Python example of one-hot encoding is as follows:
sex_list = ['MALE', 'FEMALE', np.NaN, 'FEMALE', 'FEMALE', np.NaN, 'MALE']
df = pd.DataFrame({'SEX': sex_list})
display(df)
df.fillna('NA', inplace=True)
df = pd.get_dummies(df['SEX'], prefix='IS_SEX')
display(df)
The final results before and after transformation are as follows:
# Before
   SEX
0  MALE
1  FEMALE
2  NaN
3  FEMALE
4  FEMALE
5  NaN
6  MALE

# After
   IS_SEX_FEMALE  IS_SEX_MALE  IS_SEX_NA
0              0            1          0
1              1            0          0
2              0            0          1
3              1            0          0
4              1            0          0
5              0            0          1
6              0            1          0
We one-hot encode the fields 'sex', 'class', 'pclass', 'embarked', 'who', 'family_size' and 'age_bin':
pd.get_dummies(df_titanic, columns=['sex', 'class', 'pclass', 'embarked', 'who', 'family_size', 'age_bin'], drop_first=True)
Of course, we can also implement one-hot encoding with SKLearn (see ShowMeAI's SKLearn application quick guide and the Scikit-learn usage guide for a detailed walkthrough):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# List of labels represented by non-negative integers
labels = [0, 1, 0, 2]
# Reshape the row vector into a column vector
labels = np.array(labels).reshape(len(labels), -1)
# One-hot encoding
enc = OneHotEncoder()
enc.fit(labels)
targets = enc.transform(labels).toarray()
# Without toarray() the output is a sparse matrix in index form; the same dense result can be obtained by specifying sparse=False
The following output is displayed:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 1., 0., 0.],
[ 0., 0., 1.]])
(3) LabelBinarizer
LabelBinarizer has the same function as OneHotEncoder, but OneHotEncoder can only binarize numeric variables and cannot directly encode string-valued categorical variables, whereas LabelBinarizer can binarize character variables directly.
Example code is as follows:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
labelList = ['yes', 'no', 'no', 'yes', 'no2']
# Binarize the labels
dummY = lb.fit_transform(labelList)
print("dummY:", dummY)
# Inverse transform
yesORno = lb.inverse_transform(dummY)
print("yesOrno:", yesORno)
The output is as follows:
dummY: [[0 0 1]
[1 0 0]
[1 0 0]
[0 0 1]
[0 1 0]]
yesOrno: ['yes' 'no' 'no' 'yes' 'no2']
4.7 Dimensionality reduction
In actual machine learning projects we may also apply dimensionality reduction, mainly because the data can suffer from the following problems:
- Multicollinearity: feature attributes are correlated with each other. Multicollinearity makes the solution space unstable, which weakens the model's generalization ability.
- High-dimensional spaces are sparsely sampled, which makes it hard for the model to find patterns in the data.
- Too many variables can prevent the model from finding patterns.
- Considering only the effect of a single variable on a target attribute may ignore the underlying relationships between variables.
Objectives to be achieved through feature dimension reduction:
- Reduce the number of characteristic attributes
- Ensure that feature attributes are independent of each other
Commonly used dimensionality reduction methods are:
- PCA
- SVD
- LDA
- t-SNE and other nonlinear dimensionality-reduction methods
Below we illustrate dimensionality reduction on the Iris dataset:
from sklearn import datasets
import matplotlib.pyplot as plt

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

def draw_result(X, y):
    plt.figure()
    # Extract Iris-setosa
    setosa = X[y == 0]
    # Arguments: x values, then y values
    plt.scatter(setosa[:, 0], setosa[:, 1], color="red", label="Iris-setosa")
    versicolor = X[y == 1]
    plt.scatter(versicolor[:, 0], versicolor[:, 1], color="orange", label="Iris-versicolor")
    virginica = X[y == 2]
    plt.scatter(virginica[:, 0], virginica[:, 1], color="blue", label="Iris-virginica")
    plt.legend()
    plt.show()

draw_result(X, y)
(1) PCA(Principal Component Analysis)
For details of the PCA dimensionality-reduction algorithm, see ShowMeAI's illustrated machine learning article on dimensionality-reduction algorithms.
The reference code for PCA dimension reduction is as follows:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
newX = pca.fit_transform(X)
draw_result(newX, y)
(2) SVD(Singular Value Decomposition)
The main steps of SVD method are as follows:
So VVV is the orthogonal matrix of the eigenvectors of the eigenvalue decomposition of ATAA^{T} AATA in columns, σ 2\Sigma^{2} σ 2 is the diagonal matrix of the eigenvalues of ATAA^{T} AATA, It can also be seen that the singular value σ I \sigma_{I}σ I of Am×nA_{m \times n} Am×n is the square root of the eigenvalue λ I \lambda_{I}λ I of ATAA^{T} A ATA.
If the eigenvector of ATAA^{T} AATA is viv_{I}vi, the corresponding UIu_ {I} UI in UUU can be obtained from the following equation:
The key of singular value decomposition is eigenvalue decomposition of ATAA^{T} AATA.
The corresponding code reference implementation is as follows:
from sklearn.decomposition import TruncatedSVD
iris_2d = TruncatedSVD(2).fit_transform(X)
draw_result(iris_2d, y)
PCA vs SVD
The key to solving PCA is the eigen-decomposition of the covariance matrix $C = \frac{1}{m} X X^{T}$.
The key to SVD is the eigen-decomposition of $A^{T}A$.
Clearly, the two solve very similar problems: both are eigen-decompositions of a real symmetric matrix. If we take
$A = \frac{X^{T}}{\sqrt{m}}$, then
$A^{T}A = \frac{1}{m} X X^{T} = C$.
At this point SVD is equivalent to PCA, and the PCA problem can be converted into an SVD problem to solve.
(3) LDA(Linear Discriminant Analysis)
LDA is a supervised dimensionality-reduction method: it minimizes within-class scatter and maximizes between-class scatter to obtain the optimal feature subspace.
Interpretation of the figure above: LD1 separates the two normally distributed classes well with a linear decision boundary. LD2 preserves a large variance of the dataset, but it carries no class information, so LD2 is not a good linear discriminant direction.
The corresponding dimension reduction reference code is as follows:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)
iris_2d = lda.fit_transform(X, y)
draw_result(iris_2d, y)
LDA vs PCA
PCA tries to find the orthogonal principal-component axes with the largest variance, while LDA looks for the feature subspace that best separates the classes. Both LDA and PCA are linear transformation techniques that can be used to reduce the dimensionality of a dataset; PCA is an unsupervised algorithm, while LDA is supervised.
(4) T-SNE
t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear dimensionality-reduction method. Reference code:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
iris_2d = tsne.fit_transform(X)
draw_result(iris_2d, y)
5. Feature selection
Feature selection is a step frequently used in modeling, and it matters because:
- Feature redundancy: some features are too strongly correlated with each other and waste computing resources;
- Noisy features have a negative effect on the model's results;
- Some features easily cause overfitting.
In general, there are two main considerations for feature selection:
- Degree of divergence of features: If a feature does not diverge, for example, the variance is close to 0, that is to say, the sample has basically no difference in this feature, and this feature is of no use for sample differentiation.
- Relevance of features to goals: It is also easy to understand that features should be retained when they are more relevant to goals.
The methods of feature selection can be classified into the following three categories:
- Filter: select features by applying a threshold on a score, or by keeping a fixed number of top-scoring features.
- Wrapper: select (or exclude) several features at a time according to an objective function (usually a predictive-performance score).
- Embedded: first train a machine learning algorithm or model to obtain weight coefficients for each feature, then select features by those coefficients from large to small. It is similar to the filter method, except that the quality of the features is determined through training. We use the feature_selection module of SKLearn for feature selection.
5.1 Filter Type
(1) Variance filtering
This is a class that filters features by their own variance.
For example, the variance of a feature itself is very small, which means that the samples have basically no difference in this feature. Most of the values in the feature are the same, or even the value of the whole feature is the same, so this feature has no effect on sample differentiation.
We will eliminate the field features with very small variance, and the reference code is implemented as follows:
from sklearn.feature_selection import VarianceThreshold

variancethreshold = VarianceThreshold()  # By default, features with variance <= 0 are removed
df_titanic_numerical = df_titanic[['age', 'sibsp', 'parch', 'fare', 'family_size']]
X_var = variancethreshold.fit_transform(df_titanic_numerical)  # New feature matrix with the unqualified features removed
del_list = df_titanic_numerical.columns[variancethreshold.get_support() == 0].to_list()  # Features that were removed
(2) Chi-square filtration
The chi-square test is dedicated to classification problems: it captures the correlation between each feature and the label and keeps the features whose p-value is below the significance level. Chi-square filtering is a correlation filter for discrete labels.
The p-value reflects how likely it is to obtain such a statistic under the null hypothesis: a large p-value means the observation is consistent with the null hypothesis, while a small p-value means it is unlikely under the null hypothesis, in which case the null hypothesis should be rejected and the alternative accepted.
The following is a reference code example for Chi-square filtering:
df_titanic_categorical = df_titanic[['sex', 'class', 'embarked', 'who', 'age_bin', 'adult_male', 'alone', 'fare_bin']]
df_titanic_numerical = df_titanic[['age', 'sibsp', 'parch', 'fare', 'family_size', 'pclass']]
df_titanic_categorical_one_hot = pd.get_dummies(
    df_titanic_categorical,
    columns=['sex', 'class', 'embarked', 'who', 'age_bin', 'adult_male', 'alone', 'fare_bin'],
    drop_first=True)
df_titanic_combined = pd.concat([df_titanic_numerical, df_titanic_categorical_one_hot], axis=1)
y = df_titanic['survived']
X = df_titanic_combined.iloc[:, 1:]
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
chi_value, p_value = chi2(X, y)
# Determine k from the p-values
k = chi_value.shape[0] - (p_value > 0.05).sum()  # Number of features to keep: 14
# Select the top-k features by chi-square value and filter out the rest
X_chi = SelectKBest(chi2, k=14).fit_transform(X, y)
(3) the F test
The F-test captures linear correlation; it requires the data to follow a normal distribution and keeps the features whose p-value is below the significance level.
The reference codes for feature selection are as follows:
from sklearn.feature_selection import f_classif

f_value, p_value = f_classif(X, y)
# Determine k from the p-values
k = f_value.shape[0] - (p_value > 0.05).sum()
# Select the top-k features and filter out the rest
X_classif = SelectKBest(f_classif, k=14).fit_transform(X, y)
(4) Mutual information method
The mutual information method is a filtering method used to capture the arbitrary relationship (including linear and nonlinear relationship) between each feature and the tag.
The reference codes for feature selection are as follows:
from sklearn.feature_selection import mutual_info_classif as MIC
# Mutual information method
mic_result = MIC(X,y) # Mutual information estimation
k = mic_result.shape[0] - sum(mic_result <= 0) # 16
X_mic = SelectKBest(MIC, k=16).fit_transform(X, y)
5.2 Wrapper methods
(1) Recursive feature elimination
Recursive feature elimination uses a base model for multiple rounds of training: after each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining feature set. Using the RFE class of the feature_selection module, the feature-selection code is as follows:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Recursive feature elimination method, return the data after feature selection
# Parameter Estimator is the base model
# parameter n_features_to_select indicates the number of selected features
X_ref = RFE(estimator=LogisticRegression(), n_features_to_select=10).fit_transform(X, y)
(2) Feature importance assessment
Some models (such as tree-based models) expose feature importances after fitting, which can then be used to screen features. A reference example with ExtraTreesClassifier:
from sklearn.ensemble import ExtraTreesClassifier
# Modeling and acquiring feature importance
model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)
# Feature importance ranking
feature = list(zip(X.columns, model.feature_importances_))
feature = pd.DataFrame(feature, columns=['feature', 'importances'])
feature.sort_values(by='importances', ascending=False).head(20)
(3) Permutation importance assessment
Another method for assessing feature importance and then screening features is permutation importance.
Principle: permutation importance is computed after the model has been trained. It estimates how much the model's accuracy on the validation set would drop if the values of a single feature column were randomly shuffled while the target and all other columns are kept unchanged. The more important a feature is, the more such a random shuffle damages the model's predictive accuracy.
Advantages: fast to compute; easy to use and understand; provides a consistent and relatively stable measure of feature importance.
The reference code is implemented as follows:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import eli5
from eli5.sklearn import PermutationImportance
# Split the X and y built earlier into train / validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
5.3 Embedded
(1) Feature selection based on penalty terms
Using a base model with a penalty term not only screens out features but also reduces dimensionality at the same time.
Using the SelectFromModel class from sklearn.feature_selection together with a logistic regression model with an L1 penalty, the feature selection code is as follows:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# Logistic regression with an L1 penalty as the base model for feature selection
lr = LogisticRegression(solver='liblinear', penalty="l1", C=0.1)
X_sfm = SelectFromModel(lr).fit_transform(X, y)
X_sfm.shape
(891, 7)
Using the SelectFromModel class from sklearn.feature_selection together with an L1-penalized linear SVM, the feature selection code is as follows:
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01,penalty='l1',dual=False).fit(X, y)
model = SelectFromModel(lsvc,prefit=True)
X_sfm_svm = model.transform(X)
X_sfm_svm.shape
(891, 7)
(2) Feature selection based on tree models
Among tree models, GBDT can also serve as the base model for feature selection. Using the SelectFromModel class from sklearn.feature_selection with a GBDT model, the feature selection code is as follows:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
# GBDT as the base model for feature selection
gbdt = GradientBoostingClassifier()
X_sfm_gbdt = SelectFromModel(gbdt).fit_transform(X, y)
5.4 Summary of feature selection
Some practical experience on feature selection is summarized below (a minimal pipeline sketch follows the list):
- ① For categorical feature variables, you can start with SelectKBest, using a chi-square or tree-based selector to pick variables;
- ② For quantitative (numerical) feature variables, linear models and correlation-based selectors can be used directly;
- ③ For binary classification problems, SelectFromModel combined with an SVC can be considered;
- ④ Before doing feature selection, make sure you understand the data well; exploratory data analysis (EDA) is generally needed first.
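A minimal sketch of wrapping filter-style selection and a model into one pipeline, assuming the X and y built earlier in this article (the chi-square scorer and k=10 are illustrative choices, not fixed recommendations):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
# Keep the selector and the model together so that, under cross-validation,
# the selection step is re-fit on each training fold
pipe = Pipeline([
    ('select', SelectKBest(chi2, k=10)),   # chi2 requires non-negative feature values
    ('clf', LogisticRegression(solver='liblinear')),
])
pipe.fit(X, y)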
6. Practical suggestions of feature engineering
Finally, ShowMeAI summarizes some practical feature engineering tips drawn from industrial application experience:
6.1 Data Understanding
The effectiveness of constructed features is strongly tied to the business and to the data distribution. It is therefore recommended to do exploratory data analysis (EDA) before this step to fully understand the data (see the ShowMeAI articles Python Machine Learning Comprehensive Project – E-commerce Sales Estimates and its "Advanced" sequel for the basic EDA process and methodology).
6.2 Data Preprocessing
Some common data preprocessing and feature processing steps are listed below (a minimal sketch follows the list):
- Discretization of continuous features
  - In essence, this limits the precision of floating-point features, makes the model robust to abnormal values, and leads to a more stable model.
  - Tree models do not need this step.
- Numerical truncation
  - Limit feature values to a fixed range (helpful for removing outliers).
  - For a pandas DataFrame, the .clip(lower, upper) method can be used.
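A minimal sketch of these two steps on the Titanic dataframe used throughout this article (the new column names fare_qbin and age_capped, the bin count, and the cap range are illustrative):
# Equal-frequency discretization of a continuous feature into 5 bins
df_titanic['fare_qbin'] = pd.qcut(df_titanic['fare'], q=5, labels=False)
# Numerical truncation: cap extreme values into a fixed range
df_titanic['age_capped'] = df_titanic['age'].clip(lower=0, upper=65)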
6.3 Data Cleaning
Combined with business scenarios and data distribution, reasonable missing values and outliers are processed.
6.4 Feature construction and transformation
It is recommended not to start with PCA or LDA dimensionality reduction; build and screen features first.
- Linear combinations of features
  - Suitable for decision trees and ensembles of decision trees (e.g. gradient boosting, random forest), because their common axis-aligned split functions are not good at capturing correlations between different features;
  - Not applicable to SVM, linear regression, neural networks, etc.
- Combinations of categorical and numerical features
  - Let N1 and N2 denote numerical features and C1 and C2 denote categorical features; pandas groupby operations can then create meaningful new features (C2 can also be a discretized N1), for example:
    - median(N1)_by(C1): median
    - mean(N1)_by(C1): arithmetic mean
    - mode(N1)_by(C1): mode
    - min(N1)_by(C1): minimum
    - max(N1)_by(C1): maximum
    - std(N1)_by(C1): standard deviation
    - var(N1)_by(C1): variance
    - freq(C2)_by(C1): frequency
- Statistical features plus linear combinations
  - The statistical features above can be combined with basic operations such as linear combinations (only useful for decision trees) to obtain more meaningful features, for example (a short pandas sketch follows):
N1 - median(N1)_by(C1)
N1 - mean(N1)_by(C1)
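A minimal pandas sketch of one group-by statistic and its difference feature, taking N1 = fare and C1 = pclass on the Titanic dataframe for illustration (the new column names are hypothetical):
# mean(N1)_by(C1): group-level mean broadcast back onto each row
mean_fare_by_pclass = df_titanic.groupby('pclass')['fare'].transform('mean')
df_titanic['fare_mean_by_pclass'] = mean_fare_by_pclass
# N1 - mean(N1)_by(C1): deviation of each sample from its group mean
df_titanic['fare_minus_group_mean'] = df_titanic['fare'] - mean_fare_by_pclass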
- Creating new features from tree models
  - In tree-based algorithms (decision tree, GBDT, random forest), each sample is mapped to a leaf of every tree.
  - The leaf index (a natural number) or its one-hot encoding (a sparse vector) from each tree can be added to the model as a new feature.
  - In scikit-learn and XGBoost this can be implemented with methods such as apply() and decision_path(); a sketch is given below.
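A minimal scikit-learn sketch of this idea, assuming the X and y built earlier in this article (the number of trees is illustrative):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
# Fit a GBDT, then map every sample to the leaf it falls into in each tree
gbdt = GradientBoostingClassifier(n_estimators=50)
gbdt.fit(X, y)
leaf_index = gbdt.apply(X)[:, :, 0]            # shape: (n_samples, n_estimators)
# One-hot encode the leaf indices into a sparse new-feature matrix
leaf_features = OneHotEncoder().fit_transform(leaf_index)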
6.5 Model
Different types of models also call for different feature engineering treatments (a scaling sketch follows the list):
- Tree models
  - Insensitive to the magnitude of feature values, so dimensionless scaling and statistical transformations can be skipped.
  - Tree models do not learn from sample distances, so categorical feature encoding can be omitted (but string features cannot be used as input directly, so at least label encoding is required).
  - LightGBM and XGBoost treat missing values as part of the data and learn from them, so missing values do not need to be filled; for other models they do.
- Models that depend on sample distance
  - Linear regression, SVM, deep learning models and so on belong to this category.
  - Numerical features need dimensionless processing (scaling).
  - For features with long-tailed distributions, statistical transformations can help the model optimize better.
  - For linear models, feature binning (segmentation) can improve the model's expressive power.
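A minimal sketch contrasting the two cases, assuming the X and y built earlier in this article:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
# Distance/margin-based model: standardize the numerical features first
svm_model = make_pipeline(StandardScaler(), LinearSVC(dual=False))
svm_model.fit(X, y)
# Tree model: insensitive to feature magnitude, can be fit on the raw values
rf_model = RandomForestClassifier().fit(X, y)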
References
- Diagram of machine learning algorithm | from entry to master series
- Data analysis series tutorial
- Quick data science tools | Pandas use guide
ShowMeAI recommended series of tutorials
- Illustrated Python programming: From beginner to Master series of tutorials
- Illustrated Data Analysis: From beginner to master series of tutorials
- The mathematical Basics of AI: From beginner to Master series of tutorials
- Illustrated Big Data Technology: From beginner to master
- Illustrated Machine learning algorithms: Beginner to Master series of tutorials
- Machine learning: Teach you how to play machine learning series
Related articles recommended
- Application practice of Python machine learning algorithm
- SKLearn introduction and simple application cases
- SKLearn most complete application guide
- XGBoost modeling applications in detail
- LightGBM modeling applications in detail
- Python Machine Learning Integrated Project – E-commerce sales estimates
- Python Machine Learning Integrated Project — E-commerce Sales Estimation
- Machine learning feature engineering most complete interpretation
- Application of Featuretools
- AutoML Automatic machine learning modeling