Model selection in many machine learning projects is beginning to shift toward automation, while feature engineering remains predominantly manual. Feature engineering may be even more important than model selection, and manually constructed features are inherently limited. In this article, the author introduces how to use the Featuretools Python library to automate feature engineering. The project is open source.

  • Documentation: docs.featuretools.com/

  • Code address: github.com/WillKoehrse…


Machine learning is increasingly moving from hand-designed models to pipelines that are automatically optimized with tools such as H2O, TPOT, and auto-sklearn. These libraries, along with methods such as random search (see Random Search for Hyper-Parameter Optimization), aim to simplify model selection and tuning by finding the best model for a dataset with almost no human intervention. However, feature engineering, perhaps the most valuable part of the machine learning process, remains almost entirely manual.

Feature engineering, also known as feature construction, is the process of building new features from existing data to train a machine learning model. This step may matter more than the model itself, because an algorithm can only learn from the data it is given; constructing features relevant to the task is therefore crucial (see the excellent paper A Few Useful Things to Know about Machine Learning).

Typically, feature engineering is a lengthy manual process that relies on domain knowledge, intuition, and data manipulation. It can be extremely tedious, and the resulting features are limited by human subjectivity and time. Automated feature engineering aims to help data scientists by automatically constructing candidate features from a dataset and selecting the best ones for training.

In this article, we present an example of automated feature engineering using the Featuretools Python library. We'll use a sample dataset to illustrate the basic concepts (stay tuned for future articles using real-world data). The full code can be found on GitHub.

Basic concepts of feature engineering

Feature engineering means constructing additional features from existing data, which is often distributed across multiple related tables. It requires extracting the relevant information from that data and consolidating it into a single table that can then be used to train a machine learning model.

Constructing features is very time consuming, because each new feature usually takes several steps to build, especially when it uses information from more than one table. We can group feature-construction operations into two categories: "transformations" and "aggregations." Let's look at a few examples to see these concepts in action.

A transformation acts on a single table (in Python, a Pandas DataFrame), constructing new features from one or more of its existing columns. For example, if we have the following table of clients:

We can construct new features by extracting the month from the joined column or taking the natural log of the income column. These are both transformations because they use information from only one table.
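As a minimal sketch in Pandas (assuming the clients dataframe has the joined and income columns shown above), these two transformations might look like this:

import numpy as np
import pandas as pd

# Extract the month each client joined and take the natural log of income
# (column names assume the clients table shown above)
clients['join_month'] = pd.to_datetime(clients['joined']).dt.month
clients['log_income'] = np.log(clients['income'])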

Aggregations, on the other hand, are implemented across tables and use one-to-many associations to group observations and then compute statistics. For example, if we have another table with customer loan information, where each customer may have multiple loans, we can calculate statistics such as the average, maximum, and minimum of each customer’s loans.

This process involves grouping loan tables by customer and calculating aggregated statistics, then integrating the results into customer data. Here’s how we do this in Python using the Pandas library.

import pandas as pd

# Group loans by client id and calculate mean, max, min of loans
stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
stats.columns = ['mean_loan_amount', 'max_loan_amount', 'min_loan_amount']

# Merge with the clients dataframe
stats = clients.merge(stats, left_on = 'client_id', right_index=True, how = 'left')

stats.head(10)


These operations are not difficult by themselves, but with hundreds of variables spread across dozens of tables, the process cannot be done by hand. Ideally, we want a solution that automatically performs transformations and aggregations across different tables and consolidates the results into a single table. Although Pandas is a great resource, there is still a lot of data manipulation we would have to do manually! For more on manual feature engineering, see the Python Data Science Handbook.

Featuretools

Fortunately, Featuretools is exactly the solution we are looking for. This open-source Python library automatically constructs features from a set of related tables. Featuretools is based on a method called Deep Feature Synthesis (see Deep Feature Synthesis: Towards Automating Data Science Endeavors), which sounds grander than it is (the name comes from stacking multiple features, not from using deep learning!).

Deep feature synthesis stacks multiple transformation and aggregation operations, known as feature primitives in Featuretools' vocabulary, to construct new features from data spread across multiple tables. Like most approaches in machine learning, it is a sophisticated method built on simple concepts, and it can be understood well by learning one building block at a time.

First, let's look at the sample data. We've already seen some of the dataset above; the complete set of tables looks like this:

  • Clients: basic information about credit union clients. Each client corresponds to exactly one row in this dataframe.

  • Loans: loans made to clients. Each loan corresponds to exactly one row in this dataframe, but a client may have multiple loans.

  • Payments: payments made on the loans. Each payment corresponds to exactly one row, but each loan may have multiple payments.

If we have a machine learning task, such as predicting whether a client will repay a future loan, we want to consolidate all the information about that client into a single table. The tables are related (through the client_id and loan_id variables), and we could implement this process manually through a series of transformation and aggregation operations. However, we can automate it with Featuretools.

Entities and entity sets

The first two concepts in Featuretools are "entities" and "entity sets." An entity is simply a table (or, in Pandas, a DataFrame). An entity set is a collection of tables and the relationships between them. Think of an entity set as just another Python data structure, with its own methods and attributes.

We can create an empty entity set in Featuretools as follows:

import featuretools as ft
# Create new entityset
es = ft.EntitySet(id = 'clients')

Now we need to add the entities. Each entity must have an index, a column whose values are all unique. That is, each value in the index can appear only once in the table. The index of the clients dataframe is client_id, because each client corresponds to only one row in this dataframe. We add an entity that already has an index to the entity set using the following syntax:

# Create an entity from the client dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'clients', dataframe = clients, 
                              index = 'client_id', time_index = 'joined')

The loans dataframe also has a unique index, loan_id; a sketch of adding it with the same syntax as clients follows below. The payments dataframe, however, does not have a unique index. When we add it to the entity set, we need to pass make_index = True and specify a name for the index. In addition, although Featuretools can automatically infer the data type of each column in an entity, we can override the inferred types by passing a dictionary of column types to the variable_types parameter.
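For reference, adding the loans entity might look like the following sketch, which uses the same API as above (no time index is specified here, since the relevant column name isn't shown in this example):

# Create an entity from the loans dataframe, which already has a unique index
es = es.entity_from_dataframe(entity_id = 'loans', dataframe = loans,
                              index = 'loan_id')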

# Create an entity from the payments dataframe
# This does not yet have a unique index
es = es.entity_from_dataframe(entity_id = 'payments', 
                              dataframe = payments,
                              variable_types = {'missed': ft.variable_types.Categorical},
                              make_index = True,
                              index = 'payment_id',
                              time_index = 'payment_date')

In this dataframe, missed is an integer, but it is not a numeric variable, since it can take only two discrete values; we therefore tell Featuretools to treat it as a categorical variable. After adding the dataframe to the entity set, we inspect the entire entity set:

The column data types have been correctly inferred, along with the override we specified. Next, we need to specify how the tables in the entity set are related.

Relationships between tables

The best way to think of a "relationship" between two tables is as a parent-to-child analogy. It's a one-to-many relationship: each parent can have multiple children. In terms of tables, a parent corresponds to one row in the parent table, while there may be multiple rows in the child table corresponding to children of the same parent.

For example, in our dataset the clients dataframe is a parent of the loans dataframe: each client corresponds to only one row in clients but may correspond to multiple rows in loans. Likewise, loans is a parent of payments, because each loan may have multiple payments. A parent is linked to its children by a shared variable. When we perform aggregations, we group the child table by the parent variable and compute statistics over each parent's children.

To formalize a relationship in Featuretools, we simply specify the variable that links the two tables. The clients and loans tables are linked by the client_id variable, and loans and payments are linked by loan_id. The syntax for creating the relationships and adding them to the entity set is as follows:

# Relationship between clients and previous loans
r_client_previous = ft.Relationship(es['clients']['client_id'],
                                    es['loans']['client_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_client_previous)

# Relationship between previous loans and previous payments
r_payments = ft.Relationship(es['loans']['loan_id'],
                             es['payments']['loan_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_payments)

es

The entity set now contains three entities (tables), along with the relationships that link them together. After adding the entities and formalizing the relationships, the entity set is complete and we can start constructing new features from it.

Feature primitives

Before diving into deep feature synthesis, we need to understand feature primitives. We actually already know what they are; we just called them by different names! They are simply the basic operations we use to construct new features:

  • Aggregation: an operation performed across a parent-child (one-to-many) relationship, grouping by the parent and computing statistics over the children. An example is grouping the loans table by client_id and finding the maximum loan amount for each client.

  • Transformation: an operation applied to one or more columns of a single table. Examples are taking the difference between two columns of a table or taking the absolute value of a column.

New features are constructed in Featuretools using these primitives, either on their own or stacked together. Below is a list of some of the feature primitives in Featuretools; you can also define your own custom primitives (a sketch follows the list).

A sample of the feature primitives available in Featuretools
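As an illustration of customization, here is a hedged sketch of a custom transformation primitive; the helper names (make_trans_primitive, Numeric) follow the Featuretools 0.x API and may differ in other versions:

import numpy as np
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Numeric

# A hypothetical custom transform primitive: log of a numeric column plus one
def log_plus_one(column):
    return np.log(column + 1)

LogPlusOne = make_trans_primitive(function = log_plus_one,
                                  input_types = [Numeric],
                                  return_type = Numeric)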

These primitives can be used individually or combined to construct new features. To construct features from specified primitives, we use the ft.dfs function (dfs stands for deep feature synthesis). We pass in the entityset and the target_entity (the table we want to add the features to), along with the chosen trans_primitives and agg_primitives.

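A minimal sketch of such a call; the primitives chosen here ('mean' and 'month') match the examples discussed below and are only a subset of what a real run would use:

# Run deep feature synthesis with a handful of specified primitives
features, feature_names = ft.dfs(entityset = es, target_entity = 'clients',
                                 agg_primitives = ['mean'],
                                 trans_primitives = ['month'])

features.head()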

The result is a dataframe containing new features for each client (because we made clients the target_entity). For example, we get the month each client joined, which is a feature produced by a transformation primitive:

We also get many features from aggregation primitives, such as the average payment amount per client:
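As a quick check, a generated column can be inspected by name; feature names follow the PRIMITIVE(child_table.column) convention, so the exact column used below is an assumption about this dataset:

# Look at one of the aggregation features built by dfs
features['MEAN(payments.payment_amount)'].head()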

Even though we specified only a few feature primitives, Featuretools constructs many new features by combining and stacking these primitives.

The complete dataframe contains 793 columns of new features!

Deep feature synthesis

We now have everything we need to understand deep feature synthesis (DFS). In fact, we already performed DFS in the previous function call! A deep feature is simply a feature built by stacking multiple primitives, and DFS is just the name of the process that constructs these features. The depth of a deep feature is the number of primitives needed to construct it.

For example, the MEAN(payments.payment_amount) column is a depth-1 feature because it is built with a single aggregation. LAST(loans.MEAN(payments.payment_amount)) is a depth-2 feature built by stacking two aggregations: a LAST (most recent) on top of a MEAN. It represents the average payment amount on each client's most recent loan.

We can stack features to any depth we want, but in practice I never go beyond a depth of 2. Beyond that, features become hard to interpret, but I encourage anyone interested to try "going deeper."

We don't have to specify the feature primitives manually; instead, we can let Featuretools automatically choose them for us. To do so, we use the same ft.dfs call but don't pass in any feature primitives:

# Perform deep feature synthesis without specifying primitives
features, feature_names = ft.dfs(entityset=es, target_entity='clients', 
                                 max_depth = 2)

features.head()

Featuretools builds many new features for us to consider. While the process does construct features automatically, it won't replace the data scientist, because we still have to decide what to do with all of them. For example, if our goal is to predict whether a client will repay a loan, we could look for the features most correlated with that outcome. Moreover, if we have domain knowledge, we can use it to choose specific feature primitives, or to seed deep feature synthesis with candidate features.
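For instance, here is a sketch of ranking the generated features by correlation with a binary repayment label (the labels Series is hypothetical and would come from the training data):

# Rank generated numeric features by absolute correlation with the target
numeric_features = features.select_dtypes('number')
correlations = numeric_features.corrwith(labels).abs().sort_values(ascending = False)
print(correlations.head(15))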

Next steps

Automated feature engineering solves one problem but creates another: too many features. Although it is hard to say which features will matter before fitting a model, it is likely that not all of them are relevant to the task we want to train a model on. Moreover, having too many features (see Irrelevant Features and the Subset Selection Problem) can degrade model performance, because the less useful features drown out the more important ones.

The problem of too many features is known as the curse of dimensionality. As the number of features grows (that is, as the dimensionality of the data grows), it becomes harder and harder for a model to learn the mapping between features and targets. In fact, the amount of data needed for a model to perform well scales exponentially with the number of features.

The curse of dimensionality is countered by feature reduction (also known as feature selection), the process of removing irrelevant features. This can take many forms: principal component analysis (PCA), SelectKBest, using a model's feature importances, or autoencoders built with deep neural networks. However, feature reduction is a topic for another article. For now, we know that we can use Featuretools to construct a large number of features from many tables with minimal effort!
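As a small sketch of one of these options, SelectKBest from scikit-learn keeps only the top-scoring features; this assumes a numeric, imputed feature matrix X and a label vector y, neither of which is constructed in this article:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 100 features with the highest ANOVA F-scores against the target
selector = SelectKBest(score_func = f_classif, k = 100)
X_selected = selector.fit_transform(X, y)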

Conclusion

Like many topics in machine learning, automated feature engineering with Featuretools is a complicated concept built on simple ideas. Using the notions of entity sets, entities, and relationships, Featuretools performs deep feature synthesis to construct new features. Deep feature synthesis stacks feature primitives in sequence: "aggregations," which act across one-to-many relationships between tables, and "transformations," which are functions applied to one or more columns of a single table, to build new features from multiple tables.

In a later post, I'll show how to apply this technique to a real-world problem, the Home Credit Default Risk competition on Kaggle (www.kaggle.com/c/home-cred…). Stay tuned for that post; in the meantime, read this introduction to get started with the competition (towardsdatascience.com/machine-lea…)! I hope you'll use automated feature engineering as an aid in your data science work. Our models are only as good as the data we give them, and automated feature engineering can make the feature construction process more efficient.

For more information about Featuretools, including advanced usage, check out the online documentation (docs.featuretools.com/). To see how Featuretools is used in practice, read about the work of Feature Labs, the company behind the open-source library (www.featurelabs.com/).

Original link:
https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219