Selected from GitHub, compiled by Heart of the Machine.

Given the importance of feature selection in machine learning, data scientist William Koehrsen recently released a FeatureSelector Python class on GitHub to help researchers perform feature selection more efficiently. This article is Koehrsen's introduction to the project, together with a worked example.

Project address: github.com/WillKoehrse…

Feature selection is a process of finding and selecting the most useful features in data sets, which is a key step in machine learning. Unnecessary features can reduce training speed, model interpretability, and most importantly, generalization performance on test sets.

There are dedicated feature selection methods, but I found myself applying them over and over again to every machine learning problem, which is frustrating. So I built a feature selection class in Python and open-sourced it on GitHub. The FeatureSelector includes some of the most commonly used feature selection methods:

1. Features with a high percentage of missing values

2. Collinear (highly correlated) features

3. Features with zero importance in a tree-based model

4. Low-importance features

5. Features with a single unique value

In this article, we’ll walk you through the process of using FeatureSelector on a sample machine learning dataset. We’ll see how to quickly implement these methods for a more efficient workflow.

The full code is available on GitHub and anyone is welcome to contribute. This feature selector is a work in progress and will continue to improve based on community needs!


Sample data set

To illustrate, we'll use a data sample from Kaggle's "Home Credit Default Risk" machine learning competition. For background on the competition, see: towardsdatascience.com/machine-lea… ; the full dataset can be downloaded here: www.kaggle.com/c/home-cred… . Here we use a small sample of the data for the demonstration.

Example of the data; TARGET is the label for classification.

The competition is a supervised classification problem, and it is a very suitable example dataset because it has many missing values, a large number of highly correlated (collinear) features, and several irrelevant features that do not help a machine learning model.


Create an instance

To create an instance of the FeatureSelector class, we need to pass in a structured dataset with observations in rows and features in columns. Some methods work on features alone, but the importance-based methods also require training labels. Since this is a supervised classification task, we will use a set of features and a set of labels.

(Make sure to run this code in the same directory as feature_selector.py.)

from feature_selector import FeatureSelector
# Features are in train and labels are in train_labels
fs = FeatureSelector(data = train, labels = train_labels)


Methods

The feature selector has five ways to find features to remove. You can access any identified features and manually remove them from the data, or you can use the remove function in FeatureSelector.

Here we introduce each of these identification methods and show how to run all five simultaneously. In addition, FeatureSelector has several graphing features, because visually examining data is a key part of machine learning.


Missing values

The first method for finding features to remove is simple: look for features where the fraction of missing values exceeds a given threshold. The call below identifies features with more than 60% missing values (the output follows the call).

fs.identify_missing(missing_threshold = 0.6)

17 features with greater than 0.60 missing values.

We can look at the percentage of missing values for each column in a dataframe:

fs.missing_stats.head()

To look up the features to remove, we can read the ops attribute of FeatureSelector, a Python dictionary whose values are lists of features.

missing_features = fs.ops['missing']
missing_features[:5]

['OWN_CAR_AGE', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'FLOORSMIN_AVG', 'LIVINGAPARTMENTS_AVG']

Finally, we can plot the distribution of missing-value fractions across all features:

fs.plot_missing()

Collinear features

Collinear features are features that are highly correlated with one another. In machine learning, they lead to high variance and low model interpretability, which reduces generalization performance on the test set.

The identify_collinear method finds collinear features based on a specified correlation coefficient value. For each pair of correlated features, it identifies the one to be removed (since we only need to remove one of them):

fs.identify_collinear(correlation_threshold = 0.98)

21 features with a correlation magnitude greater than 0.98.

Collinearity is easy to visualize with a heatmap. The figure below shows every feature that has at least one correlation above the threshold:

fs.plot_collinear()

As before, we can access the full list of correlated features that will be removed, or view the highly correlated feature pairs in a dataframe.

# list of collinear features to remove
collinear_features = fs.ops['collinear']
# dataframe of collinear features
fs.record_collinear.head()

If we want to examine the dataset in full, we can also plot all of the correlations in the data by passing plot_all = True into the call:
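A minimal sketch of that call (assuming plot_all is a keyword argument of plot_collinear, as the text implies):

# plot all correlations in the data, not just those above the threshold
fs.plot_collinear(plot_all = True)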

Zero-importance features

The first two methods can be applied to any structured dataset, and their results are deterministic: for a given threshold, they are the same every time. The next method is designed for supervised machine learning problems, where we have labels for training a model, and its results are nondeterministic. The identify_zero_importance function finds features with zero importance according to a gradient boosting machine (GBM) learning model.

Tree-based machine learning models (such as boosting ensembles) can be used to compute feature importances. The absolute value of an importance is less meaningful than the relative values, which we can use to determine which features are most relevant to a task. We can also use feature importances for feature selection by removing zero-importance features. In a tree-based model, a feature with zero importance is never used to split any node, so we can remove it without affecting model performance.

FeatureSelector measures feature importance using a gradient boosting machine from the LightGBM library. To reduce variance, the feature importances are averaged over 10 training runs of the GBM. In addition, the model is trained with early stopping (which can also be turned off) to prevent overfitting to the training data.

LightGBM library: lightgbm.readthedocs.io/

The following code calls this method and extracts the zero-importance features:

# Pass in the appropriate parameters
fs.identify_zero_importance(task = 'classification', 
 eval_metric = 'auc', 
 n_iterations = 10, 
 early_stopping = True)
# list of zero importance features
zero_importance_features = fs.ops['zero_importance']
63 features with zero importance after one-hot encoding.

The arguments we pass in are explained as follows:

  • task: either "classification" or "regression", depending on the problem
  • eval_metric: the metric used for early stopping (not needed if early stopping is disabled)
  • n_iterations: the number of training runs; the final result is averaged over the runs
  • early_stopping: whether to use early stopping when training the model

At this point we can use plot_feature_importances to draw two diagrams:

# plot the feature importances
fs.plot_feature_importances(threshold = 0.99, plot_n = 12)

124 features required for 0.99 of cumulative importance.

The figure on the left shows the plot_n most important features (importances are normalized so they sum to 1). The figure on the right shows cumulative importance versus the number of features. The blue vertical line marks the threshold of 99% cumulative importance.

There are two things to keep in mind about the importance-based approach:

  • Training the gradient boosting machine is stochastic, which means the feature importances change every time the model is run.

This shouldn’t make much difference (the most important features don’t suddenly become the least important), but it changes the ranking of certain features and also affects the number of zero-importance features identified. Don’t be surprised if the feature importance changes each time!

  • To train the machine learning model, features are first one-hot encoded. This means that some of the features identified as having zero importance may be one-hot encoded features added during the modeling process.

When we reach the feature removal stage, there is an option to also remove any one-hot encoded features that were added. However, if we are going to do machine learning after feature selection, we will have to one-hot encode the features again anyway.
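To see what the encoding added, here is a minimal sketch; the attribute names one_hot_features and base_features are assumptions about the class's public attributes, so check the repository if they have changed:

# Inspect which features were added by one-hot encoding (attribute names assumed)
one_hot_features = fs.one_hot_features
base_features = fs.base_features
print('There are %d original features' % len(base_features))
print('There are %d one-hot encoded features' % len(one_hot_features))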


Low-importance features

The next method builds on the zero-importance function, using the feature importances from the model for further selection. The identify_low_importance function finds the lowest-importance features that are not needed to reach a specified fraction of the total importance.

For example, the following call finds the least important features, those that are not needed to reach 99% of the total importance.

fs.identify_low_importance(cumulative_importance = 0.99)

123 features required for cumulative importance of 0.99 after one hot encoding.
116 features do not contribute to cumulative importance of 0.99.

Based on the cumulative importance plot above and this information, the gradient boosting machine considers many of the features irrelevant to learning. Again, the results of this method differ on every training run.

We can also view all feature importance in a dataframe:

fs.feature_importances.head(10)

The low_importance method borrows an idea from principal component analysis (PCA), where it is common to keep only the principal components needed to retain a certain proportion of the variance (say, 95%). The fraction of total importance to account for is based on the same idea.
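As a rough illustration of the analogy, here is a sketch only; it assumes the feature_importances dataframe shown above contains a cumulative_importance column:

# Keep the features needed to reach 99% of cumulative importance,
# much like keeping PCA components up to a variance threshold
importance_df = fs.feature_importances
kept = importance_df[importance_df['cumulative_importance'] <= 0.99]
print('Keeping %d of %d features' % (len(kept), len(importance_df)))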

The feature-importance-based methods are only really applicable when we will use a tree-based model to make predictions. Besides being stochastic, the importance-based methods are a black-box approach: we don't really know why the model considers certain features irrelevant. If you use these methods, run them several times to see how the results change, and perhaps create multiple datasets with different parameters to test!


Single unique value features

The last method is fairly basic: find any column that has a single unique value. A feature with a single unique value cannot be used for machine learning because its variance is 0. For example, a tree-based model can never split on a feature with only one value (since there is nothing to split on).

Unlike the other methods, this method has no optional arguments:

fs.identify_single_unique()

4 features with a single unique value.

We can plot a histogram of the number of unique values in each column:

fs.plot_unique()

It is also important to remember that, by default, pandas drops NaNs before counting unique values.
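Here is a small standalone illustration of that pandas behavior (not part of FeatureSelector):

import pandas as pd
import numpy as np

# A column whose only non-missing value is 1.0
s = pd.Series([1.0, np.nan, np.nan])
print(s.nunique())                # 1 -> NaN is dropped by default
print(s.nunique(dropna = False))  # 2 -> NaN counted as a value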


Removing features

After identifying the features to remove, we have two options for removing them. All of the features to remove are stored in FeatureSelector's ops dictionary, and we can use those lists to remove features manually (sketched below), or we can use the built-in remove function.
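For the manual route, a minimal sketch might look like this (my own illustration; after one-hot encoding, some identified columns may not exist in the original train dataframe, hence the filter):

# Collect every feature identified by any method, then drop those
# that are actually columns of the original dataframe
all_to_remove = set()
for feature_list in fs.ops.values():
    all_to_remove.update(feature_list)
train_manual = train.drop(columns = [c for c in all_to_remove if c in train.columns])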

For the built-in remove function, we need to pass in the methods whose identified features we want to remove. To use all of the implemented methods, we simply pass methods = 'all':

# Remove the features from all methods (returns a df)
train_removed = fs.remove(methods = 'all')

['missing', 'single_unique', 'collinear', 'zero_importance', 'low_importance'] methods have been run

Removed 140 features.

This method returns a dataframe with the features removed. To also remove the one-hot encoded features created during the machine learning step:

train_removed_all = fs.remove(methods = 'all', keep_one_hot=False)

Removed 187 features including one-hot features.

It is a good idea to check which features will be removed before performing the operation! The original dataset is kept as a backup in FeatureSelector's data attribute.

Run all methods at once

In addition to using each method individually, we can run all of the methods at once with identify_all. We pass in a dictionary with the parameters for each method:

fs.identify_all(selection_params = {'missing_threshold': 0.6, 'correlation_threshold': 0.98,
                                    'task': 'classification', 'eval_metric': 'auc',
                                    'cumulative_importance': 0.99})

151 total features out of 255 identified for removal after one-hot encoding.

Note that because the model is rerun, the total number of identified features may vary slightly between runs. The remove function can then be called to drop these features.
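For example, reusing the remove function from the previous section (this mirrors the earlier calls; keep_one_hot = True keeps the encoded columns):

# Remove everything identified by identify_all, keeping the one-hot columns
train_removed_all_once = fs.remove(methods = 'all', keep_one_hot = True)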


Conclusion

The FeatureSelector class implements several common operations for removing features before training a machine learning model. It provides functions for identifying features to remove as well as visualization functions. The methods can be run individually or all at once for an efficient workflow.

The missing, collinear, and single_unique methods are deterministic, while the feature-importance-based methods change with each run. Much like the field of machine learning itself, feature selection is largely empirical and requires testing multiple combinations to find the best answer. It is best to try several configurations in a pipeline, and FeatureSelector offers a way to rapidly evaluate feature selection parameters.

Original link: towardsdatascience.com/a-feature-s…