What is feature engineering? Feature engineering prepares data for model training by applying various kinds of preprocessing to the raw data.

There is a saying in the industry that data and features determine the upper limit of machine learning, and that models and algorithms only approximate that limit. So what is feature engineering? As the name suggests, it is essentially an engineering activity that aims to extract features from raw data, to the greatest extent possible, for use by algorithms and models. Summing up, feature engineering covers the aspects described below.

Sklearn provides a fairly complete set of feature processing methods. This article uses the IRIS data set bundled with sklearn to illustrate the feature processing functions, following sklearn's own examples. The IRIS data set was compiled by Fisher in 1936 and contains 4 features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all positive floating-point numbers measured in centimeters. The target values are Iris Setosa, Iris Versicolour, and Iris Virginica. The code for importing the IRIS data set is as follows:

from sklearn.datasets import load_iris

# Import the IRIS data set
iris = load_iris()

# Feature matrix
iris.data

# Target vector
iris.target

Problems commonly encountered when preparing features:

Features are not on the same scale: the features have different specifications and cannot be compared directly. Scaling (making the features dimensionless) solves this problem.

Information redundancy: for some quantitative features, the useful information is which interval the value falls into. For example, if academic performance only matters as "pass" or "fail", the quantitative test score needs to be converted into "1" and "0". Binarization solves this problem.

Qualitative features cannot be used directly: some machine learning algorithms and models only accept quantitative features as input, so qualitative features must be converted into quantitative ones. The simplest approach is to assign a quantitative value to each qualitative value, but this is too arbitrary and increases the tuning effort. Dummy coding is generally used instead: if a feature has N qualitative values, it is expanded into N features; when the original value is the i-th qualitative value, the i-th expanded feature is set to 1 and the others to 0. Compared with assigning values directly, dummy coding requires no extra parameter tuning, and for a linear model the dummy-coded features can achieve a nonlinear effect. This method is also known as one-hot encoding (a small illustration follows below).

Missing values: missing values need to be filled in.

Low information utilization: different machine learning algorithms and models exploit the information in the data differently. As mentioned above, in linear models, dummy coding of qualitative features can achieve a nonlinear effect. Similarly, polynomial transformations of quantitative variables, or other transformations, can also achieve nonlinear effects.

We use sklearn's preprocessing library to preprocess the data. All of these problems can be solved in a stand-alone environment, and they can also be handled with numpy and pandas.
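As a small illustration of the dummy coding described above (a minimal sketch; the grade values here are made up for the example and are not part of the IRIS data):

import pandas as pd

# A hypothetical qualitative feature with three possible values
grades = pd.Series(["pass", "fail", "pass", "excellent"])

# Dummy coding expands the single column into one indicator column per distinct value
pd.get_dummies(grades)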

1. Standardization requires computing the mean and standard deviation of each feature, and can be expressed as follows:

x = (x - mean) / std

The code for standardizing the data using the StandardScaler class of the preprocessing library is as follows:

from sklearn.preprocessing import StandardScaler

# Standardization; returns the standardized feature matrix
StandardScaler().fit_transform(iris.data)
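As a quick sanity check of the formula above (a minimal sketch, assuming numpy is available), the same result can be computed by hand:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize manually with (x - mean) / std, column by column
manual = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0)

# The manual result should match StandardScaler's output
np.allclose(manual, StandardScaler().fit_transform(iris.data))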

2. There are several approaches to interval scaling; a common one uses the minimum and maximum values of each feature, expressed as follows:

x = (x - min) / (max - min)

The code for interval scaling using the MinMaxScaler class of the preprocessing library is as follows:

from sklearn.preprocessing import MinMaxScaler

# Interval scaling; returns data scaled to the [0, 1] interval
MinMaxScaler().fit_transform(iris.data)
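MinMaxScaler also accepts a feature_range parameter if a target interval other than the default [0, 1] is needed; a minimal sketch:

from sklearn.preprocessing import MinMaxScaler

# Scale each feature into the interval [-1, 1] instead of the default [0, 1]
MinMaxScaler(feature_range=(-1, 1)).fit_transform(iris.data)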

3. Normalization: unlike standardization and interval scaling, which operate on columns (features), normalization rescales each sample (row) to unit norm. The code using the Normalizer class of the preprocessing library is as follows:

from sklearn.preprocessing import Normalizer

# Normalization; returns the data with each sample rescaled to unit norm
Normalizer().fit_transform(iris.data)
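To see the row-wise effect (a minimal check, assuming numpy), the L2 norm of every transformed sample should come out as 1:

import numpy as np
from sklearn.preprocessing import Normalizer

normalized = Normalizer().fit_transform(iris.data)

# Each row now has unit L2 norm
np.linalg.norm(normalized, axis=1)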

4. The core of binarizing a quantitative feature is to set a threshold: values greater than the threshold are assigned 1, and values less than or equal to the threshold are assigned 0. The code using the Binarizer class of the preprocessing library is as follows:

from sklearn.preprocessing import Binarizer

# Binarization; the parameter threshold sets the cut-off value
Binarizer(threshold=3).fit_transform(iris.data)

5. One-hot encoding: the code for dummy coding the target values using the OneHotEncoder class of the preprocessing library is as follows:

from sklearn.preprocessing import OneHotEncoder

# One-hot (dummy) encoding; the encoder expects a 2-D array, hence the reshape
OneHotEncoder().fit_transform(iris.target.reshape((-1, 1)))
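Note that OneHotEncoder returns a sparse matrix by default; a small usage note for inspecting the dense 0/1 matrix:

from sklearn.preprocessing import OneHotEncoder

# Convert the sparse result to a dense array for inspection
OneHotEncoder().fit_transform(iris.target.reshape((-1, 1))).toarray()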

6. Missing value imputation

from numpy import vstack, array, nan
from sklearn.preprocessing import Imputer

# Missing value imputation; returns the data with missing values filled in
# The parameter missing_values is the representation of a missing value, NaN by default
# The parameter strategy is the fill strategy, the mean by default
Imputer().fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))
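In newer versions of sklearn the Imputer class has been moved and renamed; an equivalent sketch with sklearn.impute.SimpleImputer (assuming a recent sklearn version):

from numpy import vstack, array, nan
from sklearn.impute import SimpleImputer

# Fill missing values with the column mean (the default strategy)
SimpleImputer(missing_values=nan, strategy="mean").fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))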

7. Data transformation. A common example is the polynomial transformation; the code using the PolynomialFeatures class of the preprocessing library is as follows:

from sklearn.preprocessing import PolynomialFeatures

# Polynomial transformation
# The parameter degree sets the degree of the polynomial; the default is 2
PolynomialFeatures().fit_transform(iris.data)
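To get a feel for how quickly the feature space grows (a small sketch): with 4 input features and degree 2, the transform produces 15 columns, i.e. 1 bias term, 4 linear terms, and 10 quadratic terms.

from sklearn.preprocessing import PolynomialFeatures

# The (150, 4) input becomes a (150, 15) output with the default degree of 2
PolynomialFeatures(degree=2).fit_transform(iris.data).shape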

Data transformations based on single-argument functions can be handled in a unified way. The following code applies a logarithmic transformation to the data using the FunctionTransformer class of the preprocessing library:

from numpy import log1p
from sklearn.preprocessing import FunctionTransformer

# Custom transformation with the log function
# The first argument is a single-argument function
FunctionTransformer(log1p).fit_transform(iris.data)

Feature selection

After data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm and model for training. In general, features are selected from two angles:

Whether the feature diverges: if a feature does not diverge, for example its variance is close to 0, the samples barely differ on that feature, so it is of little use for distinguishing them.

Correlation between the feature and the target: this is obvious; features that are highly correlated with the target should be preferred. Apart from the variance method, all the other methods introduced in this article are based on correlation.

According to the form of feature selection, feature selection methods can be divided into three types:

Filter: each feature is scored according to divergence or correlation, a threshold (or the number of features to keep) is set, and the features are selected accordingly.

Wrapper: features are selected or excluded a few at a time according to an objective function (usually a predictive performance score).

Embedded: a machine learning algorithm or model is trained first to obtain a weight coefficient for each feature, and the features are then selected from the largest coefficients downward. This is similar to the Filter method, but the quality of the features is determined through training.

We use the feature_selection library in sklearn for feature selection.

8. Filter: variance selection method

To use the variance selection method, first compute the variance of each feature, then keep the features whose variance exceeds a threshold. The code for selecting features using the VarianceThreshold class of the feature_selection library is as follows:

from sklearn.feature_selection import VarianceThreshold

# Variance selection; the parameter threshold sets the variance threshold
VarianceThreshold(threshold=3).fit_transform(iris.data)
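To check which features survived the threshold, a small usage sketch with the selector's get_support method:

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=3)
selector.fit(iris.data)

# Boolean mask indicating which of the original features were kept
selector.get_support()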

9. Filter: correlation coefficient method

To use the correlation coefficient method, first compute each feature's correlation coefficient with the target value and the corresponding p-value. The code for selecting features using the SelectKBest class of the feature_selection library combined with the correlation coefficient is as follows:

from numpy import array
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

# Select the K best features and return the selected data
# The first parameter is a scoring function that takes the feature matrix and target vector
# and outputs an array of (score, p-value) pairs; the i-th entry is the score and p-value of the i-th feature
# The parameter k sets the number of features to select
SelectKBest(lambda X, Y: array(list(map(lambda x: pearsonr(x, Y), X.T))).T, k=2).fit_transform(iris.data, iris.target)
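sklearn also ships a built-in correlation-based scorer, f_regression, which directly returns the (scores, p-values) pair that SelectKBest expects; a minimal alternative sketch:

from sklearn.feature_selection import SelectKBest, f_regression

# f_regression scores each feature by the F-statistic of a univariate linear fit,
# which ranks features the same way as the squared Pearson correlation
SelectKBest(f_regression, k=2).fit_transform(iris.data, iris.target)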

10. Filter: chi-square test

The classical chi-square test checks the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable has N possible values and the dependent variable has M possible values; consider the difference between the observed and expected frequencies of the samples where the independent variable equals i and the dependent variable equals j, and construct the statistic from these differences.
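For reference (the standard textbook form, which is not needed in order to use the API), the statistic is:

chi2 = sum over all (i, j) of (A_ij - E_ij)^2 / E_ij

where A_ij is the observed count of samples with independent variable value i and dependent variable value j, and E_ij is the count expected if the two variables were independent. In practice we go straight to the code: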

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Select the two features with the highest chi-square scores
SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)

The code is much simpler and friendlier to use than the formula; the formula just looks intimidating.

11. Filter: mutual information method

Classical mutual information evaluates the correlation between qualitative variables; to handle quantitative data, the maximal information coefficient (MIC) method is used instead. The code for selecting features using the SelectKBest class of the feature_selection library combined with the maximal information coefficient is as follows:

from numpy import array
from minepy import MINE
from sklearn.feature_selection import SelectKBest

# MINE's interface is not functional-style, so define a mic wrapper that returns a
# (score, p-value) pair, with the p-value fixed at 0.5 because MINE does not provide one
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)

# Select the K best features using the maximal information coefficient
SelectKBest(lambda X, Y: array(list(map(lambda x: mic(x, Y), X.T))).T, k=2).fit_transform(iris.data, iris.target)
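minepy is a third-party package; if it is not available, newer versions of sklearn provide a built-in mutual information estimator that can be used with SelectKBest in the same way (a minimal alternative sketch, not the author's MIC-based code):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Estimate the mutual information between each feature and the class labels,
# then keep the two highest-scoring features
SelectKBest(mutual_info_classif, k=2).fit_transform(iris.data, iris.target)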

12. Wrapper: recursive feature elimination

Recursive feature elimination uses a base model to run several rounds of training. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining feature set. The code using the RFE class of the feature_selection library is as follows:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursive feature elimination
# The parameter estimator is the base model; n_features_to_select is the number of features to keep
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(iris.data, iris.target)

Any other estimator that exposes feature weights can be used as the base model here, as in the sketch below.
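For example (an illustrative sketch, not from the original article), a decision tree can serve as the base model because it exposes feature_importances_:

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# RFE works with any estimator that provides coef_ or feature_importances_
RFE(estimator=DecisionTreeClassifier(), n_features_to_select=2).fit_transform(iris.data, iris.target)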

13. Embedded: a base model with a penalty term is used to screen features and reduce dimensionality at the same time. The code for selecting features using the SelectFromModel class of the feature_selection library combined with a logistic regression model with an L1 penalty is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Feature selection using a logistic regression model with an L1 penalty as the base model
SelectFromModel(LogisticRegression(penalty="l1", C=0.1)).fit_transform(iris.data, iris.target)
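Note that in recent versions of sklearn the default solver does not support the L1 penalty, so a compatible solver has to be passed explicitly (a version note, not part of the original code):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# liblinear (or saga) supports penalty="l1"; the default lbfgs solver does not
SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit_transform(iris.data, iris.target)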

The SelectFromModel class of the feature_selection library can also be combined with a logistic regression model that carries both L1 and L2 penalty terms to select features:

# LR here is a custom logistic regression class that combines the L1 and L2 penalties;
# its definition is not included in this article
SelectFromModel(LR(threshold=0.5, C=0.1)).fit_transform(iris.data, iris.target)
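If the custom LR class is not available, a rough substitute (not the author's code) is to let sklearn's own LogisticRegression combine the two penalties through an elastic-net term, which newer versions support:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Elastic net mixes the L1 and L2 penalties; l1_ratio controls the balance between them
model = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000)
SelectFromModel(model).fit_transform(iris.data, iris.target)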

GBDT can also be used as the base model for feature selection. The code for feature selection using the SelectFromModel class of the feature_selection library together with a GBDT model is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

# GBDT as the base model for feature selection
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
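For a tree-based base model, SelectFromModel relies on the fitted model's feature_importances_; a small usage sketch for looking at those importances directly:

from sklearn.ensemble import GradientBoostingClassifier

# Fit the GBDT model and inspect the importance assigned to each of the four features
gbdt = GradientBoostingClassifier().fit(iris.data, iris.target)
gbdt.feature_importances_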

14. PCA dimensionality reduction: PCA is an unsupervised method that projects the samples onto the directions of maximum variance. The code using the PCA class of the decomposition library is as follows:

from sklearn.decomposition import PCA

# PCA; the parameter n_components sets the number of dimensions to keep
PCA(n_components=2).fit_transform(iris.data)
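When choosing n_components, the explained_variance_ratio_ attribute of a fitted PCA object shows how much variance each retained component captures (a small usage sketch):

from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(iris.data)

# Fraction of the total variance explained by each of the two retained components
pca.explained_variance_ratio_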

SVD-based dimensionality reduction works the same way: simply replace PCA with an SVD-based class such as TruncatedSVD from sklearn.decomposition.

15. LDA dimensionality reduction: unlike PCA, LDA is supervised and projects the samples so that the classes are as separable as possible. The code is as follows:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# LDA; the parameter n_components sets the number of dimensions to keep
LDA(n_components=2).fit_transform(iris.data, iris.target)

As you can see, the API is always simpler to use than the formula, so as a data scientist you should not get too caught up in the formulas and should focus on engineering practice instead.