Coming up with features is difficult, time-consuming, and requires expert knowledge. Applied machine learning is basically feature engineering.
– Andrew Ng
It is often said in the industry that the data determines the upper limit of a model's performance, and machine learning algorithms make predictions based on the features of the data. Good features can significantly improve a model's performance, which makes feature generation (that is, designing model-usable features from the raw data) a critical step in feature engineering. This article explains the role of feature generation and its methods (manual design and automatic feature generation), with code attached.
1 The role of feature generation
Feature generation is an important step in feature engineering, and it serves two purposes:
- Increasing the expressive power of features, which improves model performance. (For example, the ratio of weight to height is a meaningful health indicator, while height or weight alone says much less.)
- Allowing business understanding to be built into the feature design, which increases model interpretability.
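As a minimal sketch of the first point, a ratio feature can be built from two columns (the toy data and column names here are hypothetical, not from the article's data set):

```python
import pandas as pd

# Hypothetical health data: weight (kg) and height (m)
df = pd.DataFrame({'weight': [60.0, 85.0, 72.0],
                   'height': [1.70, 1.75, 1.80]})

# A ratio of the two columns (similar in spirit to BMI) is often more
# informative than either column on its own
df['weight_height_ratio'] = df['weight'] / df['height']
print(df['weight_height_ratio'].round(2).tolist())  # [35.29, 48.57, 40.0]
```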
2 Data analysis
The sample data set used in this article records changes in customers' funds, described by the following data dictionary:
- Cust_no: customer number
- I1: gender; I2: age
- E1: account-opening date; B6: most recent transfer date
- C1: deposits (the suffix _fir marks last month's value); C2: number of deposit products
- X1: financial deposits; X2: structured deposits
- Label: whether the customer's funds increased or decreased
The pandas-profiling Python library is super useful for one-click data analysis (summary statistics, missing values, correlations, outliers, etc.), and its report is a handy companion when designing features.
```python
import pandas_profiling

pandas_profiling.ProfileReport(df)
```
3 Methods of feature generation
Feature generation methods fall into two categories: aggregation and transformation.
3.1 Aggregation Mode
In aggregation mode, the multiple records behind a one-to-many key are aggregated into statistics such as the mean, count, or maximum. In the data set above, for example, the same cust_no corresponds to multiple records; grouping by cust_no (customer number) and aggregating C1 yields the count, number of unique values, mean, median, standard deviation, sum, maximum, and minimum of C1 per customer.
```python
# Group by cust_no and compute the count, nunique, mean, median,
# std, sum, max, and min of C1 per customer
df.groupby('cust_no').C1.agg(['count', 'nunique', 'mean', 'median', 'std', 'sum', 'max', 'min'])
```
In addition, pandas supports custom aggregation functions, such as the sum of squares of the elements in each group:
```python
# Custom aggregation: sum of squares within each group
def x2_sum(group):
    return sum(group ** 2)

df.groupby('cust_no').C1.apply(x2_sum)
```
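The aggregated statistics live at the customer level, so in practice they usually need to be joined back onto the record-level table before modeling. A small sketch with toy data (cust_no and C1 follow the article's schema; the values are made up):

```python
import pandas as pd

# Toy record-level table: one customer can have several records
df = pd.DataFrame({'cust_no': ['A', 'A', 'B', 'B', 'B'],
                   'C1': [10.0, 20.0, 5.0, 5.0, 20.0]})

# Customer-level aggregates of C1, prefixed to get readable feature names
agg = df.groupby('cust_no').C1.agg(['mean', 'max']).add_prefix('C1_')

# Join the aggregates back onto every record of the customer
df = df.merge(agg, left_on='cust_no', right_index=True, how='left')
print(df[['C1_mean', 'C1_max']].head())
```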
3.2 Transformation Mode
Transformation refers to generating new features from one or more fields, for example by adding, subtracting, multiplying, or dividing them. Different field types call for different transformations.
3.2.1 Numeric types
- Arithmetic operations
New features are computed from combinations of several fields; finding the best feature set usually requires combining business understanding with the data distribution.
```python
# Total funds over the two months
df['C1+C1_fir'] = df['C1'] + df['C1_fir']
# Change in funds between the two months
df['C1-C1_fir'] = df['C1'] - df['C1_fir']
# Number of products times amount of funds
df['C1*C2'] = df['C1'] * df['C2']
# Growth rate of funds between the two months
df['C1/C1_fir'] = df['C1'] / df['C1_fir'] - 1
df.head()
```
- Multi-column statistics
NumPy functions can be used to compute statistics such as the sum, variance, maximum, and minimum across multiple columns:
```python
import numpy as np

# Row-wise statistics across the two monthly columns
df['C1_sum'] = np.sum(df[['C1_fir', 'C1']], axis=1)
df['C1_var'] = np.var(df[['C1_fir', 'C1']], axis=1)
df['C1_max'] = np.max(df[['C1_fir', 'C1']], axis=1)
df['C1_min'] = np.min(df[['C1_fir', 'C1']], axis=1)
# Absolute month-over-month change (the 'C1-C1_fir' column was created above)
df['C1-C1_fir_abs'] = np.abs(df['C1-C1_fir'])
df.head()
```
- Rank encoding
All samples are sorted by the feature value and the rank is used as the new feature value. Rank features are insensitive to outliers and are unlikely to produce conflicting values.
```python
# Dense rank of C1 in descending order
df['C1_rank'] = df['C1'].rank(ascending=0, method='dense')
```
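A quick sketch of why rank features are robust: an extreme outlier changes the raw value dramatically but only occupies one ordinal position in the ranking (toy data, not from the article's data set):

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0, 1000.0])  # 1000.0 is an extreme outlier

# Descending dense rank, same parameters as the snippet above
ranks = s.rank(ascending=False, method='dense')
print(ranks.tolist())  # [2.0, 4.0, 3.0, 1.0] - the outlier is simply rank 1
```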
3.2.2 String types
- Truncation
When a string field takes too many distinct values, truncating it can reduce model overfitting. A full home address, for instance, can be truncated to city-level granularity.
- String length
The length of the string can be a feature. In a transfer scenario, for example, the word count of the transfer message describes the type of transfer to some extent.
- Frequency
The number of occurrences of each value can be a feature. In a fraud scenario, for example, the more customers share one address, the more likely it indicates organized fraud.
Since the sample data has no suitable string field, the code below only illustrates the logic; the resulting fields carry no real meaning.

```python
# Truncation: keep the first character of the string
df['I1_0'] = df['I1'].map(lambda x: str(x)[:1])
# String length
df['I1_len'] = df['I1'].apply(lambda x: len(str(x)))
# Frequency of each value
df['I1'].value_counts()
```
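To use frequency as a model feature rather than just inspecting it, the counts can be mapped back onto each row. A sketch with toy data; `I1_freq` is a hypothetical column name:

```python
import pandas as pd

df = pd.DataFrame({'I1': ['M', 'F', 'M', 'M', 'F']})

# Frequency encoding: map each value to its number of occurrences
df['I1_freq'] = df['I1'].map(df['I1'].value_counts())
print(df['I1_freq'].tolist())  # [3, 2, 3, 3, 2]
```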
3.2.3 Date types
Date fields are commonly transformed into intervals between dates, day of week, hour of day, and so on.
```python
# Days between account opening (E1) and the most recent transfer (B6)
df['E1_B6_interval'] = (df.E1.astype('datetime64[ns]') - df.B6.astype('datetime64[ns]')).map(lambda x: x.days)
# Whether E1 falls on the last day of a month
df['E1_is_month_end'] = pd.to_datetime(df.E1).map(lambda x: x.is_month_end)
# Day of week of E1
df['E1_dayofweek'] = df.E1.astype('datetime64[ns]').dt.dayofweek
# Hour of day of B6
df['B6_hour'] = df.B6.astype('datetime64[ns]').dt.hour
df.head()
```
4 Automatic feature generation
Traditional feature engineering relies on hand-crafting features, which is tedious, time-consuming, and error-prone. Automated feature engineering uses tools such as Featuretools to generate useful features automatically from a set of related data tables. Compared with manual feature design, it is more efficient, more reproducible, and lets you build models faster.
4.1 Featuretools in practice
Featuretools is an open-source library for automated feature engineering. It has three basic concepts:
1) Feature primitives: the basic operations used to generate features, divided into aggregation primitives (agg_primitives) and transformation primitives (trans_primitives). The following code lists the primitives Featuretools provides, with descriptions.
```python
import featuretools as ft

ft.list_primitives()
```
2) An Entity can be thought of as a pandas DataFrame, and a collection of entities is called an EntitySet. Relationships can be added between entities through their association keys.
```python
# Detail table: all columns except the label
df1 = df.drop('label', axis=1)
# Customer table: unique customer numbers
df2 = df[['cust_no']].drop_duplicates()
df2.head()

# Build the entity set and register both tables as entities
# (make_index=True creates the surrogate 'id' key, which df1 lacks)
es = ft.EntitySet(id='es')
es = es.entity_from_dataframe(entity_id='df1', dataframe=df1, index='id', make_index=True)
es = es.entity_from_dataframe(entity_id='df2', dataframe=df2, index='cust_no')
# One-to-many relationship from df2 (one customer) to df1 (many records)
relation1 = ft.Relationship(es['df2']['cust_no'], es['df1']['cust_no'])
es = es.add_relationship(relation1)
```
3) DFS (Deep Feature Synthesis): the process of creating new features from multiple related tables. The complexity of the generated features can be controlled by setting the search depth max_depth.
```python
# Synthesize customer-level features on df2 from the detail table
feature_matrix, feature_names = ft.dfs(
    entityset=es,
    target_entity='df2',
    trans_primitives=['divide_numeric', 'multiply_numeric', 'subtract_numeric'],
    agg_primitives=['sum'],
    max_depth=2,
    n_jobs=1,
    verbose=1,
)
```
4.2 Notes on Featuretools
4.2.1 Memory overflow. Featuretools generates every candidate feature by brute force, so with large data sets memory can easily overflow. Besides upgrading server memory and reducing n_jobs, a common mitigation is to restrict the brute-force search to the important features and primitives only.
4.2.2 Feature-dimension explosion. When there are many raw features, or max_depth and the number of feature primitives are set high, Featuretools generates a huge number of features and the dimensionality easily explodes. In that case, consider feature selection or dimensionality reduction; common feature selection methods are covered in the earlier article: Python feature selection.
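As one simple selection step after DFS, near-constant columns can be dropped with scikit-learn's VarianceThreshold (a generic sketch on a made-up feature matrix, not tied to Featuretools output):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: f2 is constant and carries no information
X = pd.DataFrame({'f1': [1.0, 2.0, 3.0, 4.0],
                  'f2': [0.0, 0.0, 0.0, 0.0],
                  'f3': [5.0, 1.0, 2.0, 9.0]})

# Drop features whose variance is below the threshold
selector = VarianceThreshold(threshold=0.01)
X_sel = selector.fit_transform(X)
kept = X.columns[selector.get_support()].tolist()
print(kept)  # ['f1', 'f3']
```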
Note: the source code for this article is available on Github.
This article was first published on the public account "Algorithm Advanced". Welcome to follow.