• By Han Xinzi @ShowMeAI
  • Tutorial address: www.showmeai.tech/tutorials/4…
  • Article address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author, and cite the source.

Introduction

In ShowMeAI’s article on machine learning feature engineering, we described feature engineering operations in detail. In practice, however, there are many tools that can help us do feature engineering faster. In this article, ShowMeAI introduces Featuretools, a Python library for automated feature engineering, and demonstrates it on the BigMart Sales dataset.

1. Machine learning and features

In the context of machine learning, attributes are individual properties or groups of properties used to explain the occurrence of phenomena. When these attributes are translated into some measurable form, they are called features.

2. Introduction to feature engineering

Feature engineering: using domain knowledge and existing data to create new features for machine learning algorithms; it can be done manually or automatically.

The upper limit of a model's performance is determined by the data and the feature engineering; improved algorithms only approach this upper limit.

3. Significance of feature engineering

4. Automated feature engineering

Think of car manufacturing: in the early 20th century, cars were assembled by teams of workers, while today robots do the same job. Automating a process makes it more efficient and economical. The same is true of feature engineering, and in machine learning, the engineering of common features has already been automated.

A great tool to help with the process of automating feature engineering is a Python library called Featuretools.

5. Introduction to Featuretools

Featuretools is a library of tools for automated feature engineering in Python. It allows you to quickly build rich data features, leaving more time to focus on building other aspects of machine learning models. To learn how to use Featuretools, you need to understand its three main components:


  • Entities

    • An entity can be thought of as the representation of a Pandas DataFrame, and a collection of entities is called an EntitySet.
  • Relationships

    • Relationships define the join keys that associate tables with one another.
  • Feature Primitives

    • DFS (Deep Feature Synthesis) constructs new features by applying feature primitives to the entity relationships in an EntitySet. Primitives are basic feature-engineering operators, such as groupby, mean, max, min, and so on.
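
To make these three concepts concrete, here is a minimal self-contained sketch. The two toy tables (customers and orders) are invented for illustration: it registers both DataFrames in an EntitySet, declares the relationship between them, and lets DFS aggregate order information up to the customer level.

import featuretools as ft
import pandas as pd

# Hypothetical toy tables: one row per customer, several rows per order
customers = pd.DataFrame({'customer_id': [1, 2], 'join_year': [2018, 2020]})
orders = pd.DataFrame({'order_id': [10, 11, 12],
                       'customer_id': [1, 1, 2],
                       'amount': [25.0, 40.0, 15.0]})

# An EntitySet is a collection of entities (DataFrames) plus their relationships
es = ft.EntitySet(id='toy')
es = es.add_dataframe(dataframe_name='customers', dataframe=customers, index='customer_id')
es = es.add_dataframe(dataframe_name='orders', dataframe=orders, index='order_id')

# Relationship: customers (parent) -> orders (child), joined on customer_id
es = es.add_relationship('customers', 'customer_id', 'orders', 'customer_id')

# DFS applies aggregation primitives across the relationship to build per-customer features
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name='customers',
                                      agg_primitives=['sum', 'mean', 'count'],
                                      max_depth=1)
print(feature_matrix.head())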

Featuretools provides a framework that, with minimal code, quickly and easily transforms a single table or joins and aggregates across multiple tables. Below, we practice using Featuretools on the BigMart Sales dataset.

6. Featuretools practice

In this scenario, BigMart Sales poses a sales-forecasting problem in retail e-commerce. We want to build a model that estimates the sales of each product in a particular store, which will help BigMart's decision makers identify the important attributes of each product and store; these play a key role in improving overall sales. Note that the given dataset contains 1,559 products across 10 stores.

The dataset can be downloaded here: pan.baidu.com/s/1qjJZjY56… Extraction code: show

The following table describes the data fields:

Variable                    Description
Item_Identifier             Product ID
Item_Weight                 Product weight
Item_Fat_Content            Whether the product is low-fat
Item_Visibility             Share of the store's total display area given to this product
Item_Type                   Product category
Item_MRP                    Maximum retail price of the product
Outlet_Identifier           Store ID
Outlet_Establishment_Year   Year the store was established
Outlet_Size                 Store floor area
Outlet_Location_Type        Type of city where the store is located
Outlet_Type                 Store type (grocery store or supermarket)
Item_Outlet_Sales           Product sales in the store (the output variable to be predicted)

6.1 Featuretools Installation

You can easily install Featuretools from the command line using pip.

pip install featuretools

6.2 Importing Dependencies and Data

import featuretools as ft
import numpy as np
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

6.3 Data Preparation

We first save the identifier fields and split the target field out of the data, as follows:

# saving identifiers
test_Item_Identifier = test['Item_Identifier']
test_Outlet_Identifier = test['Outlet_Identifier']
sales = train['Item_Outlet_Sales']
train.drop(['Item_Outlet_Sales'], axis=1, inplace=True)

Next, we concatenate the training and test sets so that all data processing and transformations are applied uniformly and consistently.

combi = pd.concat([train, test], ignore_index=True)  # DataFrame.append is deprecated in newer pandas

Let’s look at the missing values in the data set.

combi.isnull().sum()

Item_Weight and Outlet_Size have many missing values. Let's do a quick imputation:

# Missing value handling
combi['Item_Weight'].fillna(combi['Item_Weight'].mean(), inplace = True)
combi['Outlet_Size'].fillna("missing", inplace = True)

6.4 Data Preprocessing

Let's do just a little data preprocessing, so that we can demonstrate more clearly what Featuretools can do.

combi['Item_Fat_Content'].value_counts()

We find that Item_Fat_Content contains only two categories, "low fat" and "regular" (the field takes several distinct values, but they differ only in formatting). Here we binarize it.

# Binary encoding
fat_content_dict = {'Low Fat': 0, 'Regular': 1, 'LF': 0, 'reg': 1, 'low fat': 0}

combi['Item_Fat_Content'] = combi['Item_Fat_Content'].replace(fat_content_dict, regex=True)

6.5 Feature Engineering with Featuretools

Now let's use Featuretools for automated feature engineering. First, we combine the product and store identifiers to construct a unique ID for each row.

combi['id'] = combi['Item_Identifier'] + combi['Outlet_Identifier']
combi.drop(['Item_Identifier'], axis=1, inplace=True)

Since the attribute Item_Identifier is no longer needed, we remove it. We keep the Outlet_Identifier feature, which will be used later.

Next we create an EntitySet, a structure that holds multiple DataFrames and the relationships between them.

# Create the entity set es
es = ft.EntitySet(id = 'sales')

# Add the dataframe to the entity set
es.add_dataframe(dataframe_name = 'bigmart', dataframe = combi, index = 'id')

Next we will use Deep Feature Synthesis to automatically create new features.

trans_primitives = ['add_numeric', 'subtract_numeric', 'multiply_numeric', 'divide_numeric']  # add, subtract, multiply and divide pairs of columns to generate new features
agg_primitives = ['sum', 'median', 'mean']

feature_matrix, feature_names = ft.dfs(entityset=es, 
                                       target_dataframe_name = 'bigmart', 
                                       max_depth = 1, 
                                       verbose = 1,
                                       agg_primitives=agg_primitives,
                                       trans_primitives=trans_primitives,
                                       n_jobs = 8)

In the above code:

  • max_depth controls the complexity of the features generated by stacking feature primitives.
  • agg_primitives defines the statistical aggregation operators.
  • trans_primitives defines the transformation operators.
  • n_jobs sets the number of cores used for parallel feature computation.
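
As a side note, Featuretools ships many more primitives than the handful used above. A small sketch to browse them:

# List all built-in primitives (a DataFrame with name, type and description columns)
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'aggregation'].head())
print(primitives[primitives['type'] == 'transform'].head())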

By doing this, Featuretools constructs many new features on its own.

Let's take a look at these newly constructed features:

feature_matrix.columns

You'll find that DFS builds a large number of new features quickly. It's far more efficient than constructing features by hand!

Take a look at the first few rows of feature_matrix.

feature_matrix.head()

We make a small adjustment to the DataFrame, reordering its rows by the id variable of the combi DataFrame.

feature_matrix = feature_matrix.reindex(index=combi['id'])
feature_matrix = feature_matrix.reset_index()

6.6 Feature Description

We can also use the following code to visualize how a generated feature is built; for example, here is the derivation of the 20th feature.

ft.graph_feature(feature_names[20])
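
Note that graph_feature relies on the graphviz package being installed. If it is not available, a plain-text alternative is ft.describe_feature, which returns an English description of the same feature:

# Plain-English description of how the 20th feature is derived
print(ft.describe_feature(feature_names[20]))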

6.7 Building a Model

Now we can use the constructed features to model Item_Outlet_Sales. Since the final data (feature_matrix) contains many categorical features, we use the LightGBM model here: it can consume categorical features directly and scales well.

You can read ShowMeAI's Illustrated Machine Learning: LightGBM Model Explained and LightGBM Modeling Applications in Detail to understand the principles and usage of the LightGBM model.

import lightgbm as lgb
import pandas as pd

LightGBM handles categorical variables natively, but they must have the pandas category dtype (plain object/string columns are rejected). Therefore, we first convert the object columns in the data to categorical:

# Convert every object (string) column to the pandas category dtype
categorical_features = feature_matrix.columns[feature_matrix.dtypes == 'object']

for col in categorical_features:
    feature_matrix[col] = feature_matrix[col].astype('category')

Then feature_matrix is split back into the training and test sets.

feature_matrix.drop(['id'], axis=1, inplace=True)
train = feature_matrix[:8523]
test = feature_matrix[8523:]
# removing unnecessary variables
train.drop(['Outlet_Identifier'], axis=1, inplace=True)
test.drop(['Outlet_Identifier'], axis=1, inplace=True)

The training set is further split into training and validation subsets so we can evaluate the model's performance locally.

from sklearn.model_selection import train_test_split

# splitting train data into training and validation set
xtrain, xvalid, ytrain, yvalid = train_test_split(train, sales, test_size=0.25, random_state=11)

Finally, we train the model, using RMSE (Root Mean Squared Error) as the evaluation metric.

# Initialize the LGBMRegressor
model_lgb = lgb.LGBMRegressor(n_estimators=5000, learning_rate=0.05, max_depth=6, random_state=7)
# Train the model with early stopping on the validation set
model_lgb.fit(xtrain, ytrain, eval_set=[(xvalid, yvalid)], eval_metric='rmse',
              callbacks=[lgb.early_stopping(1000)])

from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(yvalid, model_lgb.predict(xvalid)))

The RMSE score on the validation set is 1094.7984.

Without any feature engineering, the validation score was 1163. The features constructed by Featuretools are therefore not just noise; they are genuinely valuable. Most importantly, automation saves a great deal of feature engineering time.
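
As a final optional step, the identifiers saved during data preparation can be used to generate predictions for the test set and assemble a result file. This is a minimal sketch; the file name submission.csv and the column layout are assumptions based on the competition's usual submission format.

# Predict on the test features and assemble a submission file
preds = model_lgb.predict(test)

submission = pd.DataFrame({'Item_Identifier': test_Item_Identifier.values,
                           'Outlet_Identifier': test_Outlet_Identifier.values,
                           'Item_Outlet_Sales': preds})
submission.to_csv("submission.csv", index=False)  # assumed output file name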

References

  • Illustrated Machine Learning Algorithms: From Beginner to Master series
  • Illustrated Machine Learning: LightGBM Model Explained
  • LightGBM Modeling Applications in Detail
  • The Most Complete Guide to Machine Learning Feature Engineering

ShowMeAI Recommended Tutorial Series

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • Mathematical Foundations of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials
  • Illustrated Machine Learning Algorithms: From Beginner to Master series of tutorials
  • Machine Learning in Practice: A Hands-on Guide series of tutorials

Recommended Related Articles

  • Application Practice of Python Machine Learning Algorithms
  • SKLearn Introduction and Simple Application Cases
  • The Most Complete SKLearn Application Guide
  • XGBoost Modeling Applications in Detail
  • LightGBM Modeling Applications in Detail
  • Python Machine Learning Integrated Project: E-commerce Sales Estimation
  • The Most Complete Guide to Machine Learning Feature Engineering
  • Applications of Featuretools
  • AutoML Automated Machine Learning Modeling