Luo Yuanfei, a senior researcher at 4Paradigm, introduced how to efficiently generate and select features in the exponential search space of high-dimensional sparse data in recommendation systems, and how to combine this with large-scale distributed machine learning systems to quickly screen effective combination features from data while significantly reducing computation, storage, and communication costs.

Luo shared the following:

Hello everyone! I am Luo Yuanfei from 4Paradigm.

I’m glad to have the opportunity to talk with you about some of our work on automatic machine learning. Most of my work at 4Paradigm is related to automatic machine learning, with my previous focus on automatic feature engineering. Model improvements can bring steady returns, but they are harder to achieve. Therefore, if you are starting on a new business, you can try to start with features: feature engineering often brings more obvious benefits.

Background of AutoCross

The automatic machine learning mentioned in this report is automatic machine learning for tabular data. Tabular data is a classical data format: it generally contains multiple columns, each of which may correspond to a discrete or continuous feature. We cannot simply reuse the models used for images, speech, or NLP; we need to make optimizations specific to tabular data.

The feature combination mentioned in this report specifically refers to feature crossing, which is the Cartesian product of two discrete features. Take “restaurants visited” as an example: I often go to McDonald’s, so “me” and “McDonald’s” can form a combined feature; likewise, if I go to KFC, then “me” and “KFC” can also form a combined feature.
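As a minimal illustration (not 4Paradigm’s implementation), crossing two discrete columns can be thought of as concatenating their values into one new token per row:

```python
# A minimal sketch of feature crossing: the new feature is the Cartesian
# product of the two original discrete values, one crossed token per row.
user = ["alice", "alice", "bob"]
restaurant = ["McDonald's", "KFC", "KFC"]

crossed = [f"{u}_x_{r}" for u, r in zip(user, restaurant)]
print(crossed)  # ["alice_x_McDonald's", "alice_x_KFC", "bob_x_KFC"]
```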

Automatic feature engineering, as referred to in this report, means automatically detecting these valid combination features from the tabular data described above. For example, “I am a software engineer” is one feature, and “I work at 4Paradigm” is another. These two features are stored in two columns, and we can cross the two columns into a new feature that is more indicative and personalized.

Why automatic feature engineering?

First of all, features play a very important role in the modeling effect. Secondly, there are far more customer scenarios than modeling experts. For example, our recommendation business serves thousands of media outlets, so we cannot assign an expert to each business and manually model each scenario. Finally, even within a single business, the data is changeable and the scenarios keep shifting. Therefore, we need automatic feature engineering so that the manpower required does not have to grow in proportion to our business volume.

Work related to AutoCross

Automatic feature engineering is mainly divided into two categories: explicit feature combination and implicit feature combination.

① Explicit feature combination

Explicit feature combination has two representative works, RMI [2] and CMI [3]. The letters “MI” stand for Mutual Information, a classic feature selection criterion.



MI is calculated by counting the occurrence frequency and co-occurrence frequency of two columns of features in the same data. RMI, however, counts one part of the information from the training set and another part from reference data, which is the source of the “R”. The figure above, from the RMI paper [2], shows that the AUC increases gradually as different combination features are added. CMI is another classic work; it calculates the importance of each feature by analyzing the logistic loss function and combining it with Newton’s method.
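As a rough illustration of the underlying mutual-information idea (not the exact RMI or CMI procedure), the MI between a candidate feature column and the label can be estimated from occurrence and co-occurrence counts:

```python
import math
from collections import Counter

def mutual_information(feature, label):
    """Estimate MI(feature; label) from empirical counts (illustrative sketch)."""
    n = len(feature)
    pf = Counter(feature)               # feature value counts
    pl = Counter(label)                 # label counts
    pfl = Counter(zip(feature, label))  # co-occurrence counts
    mi = 0.0
    for (f, l), c in pfl.items():
        p_fl = c / n
        mi += p_fl * math.log(p_fl / ((pf[f] / n) * (pl[l] / n)))
    return mi

# Example: a crossed feature that correlates with the label gets a higher score.
feat = ["a", "a", "b", "b", "a", "b"]
y    = [ 1,   1,   0,   0,   1,   0 ]
print(mutual_information(feat, y))
```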

Both work well. However, on the one hand, they only consider second-order feature combinations; on the other hand, they are serial algorithms: after each combination feature is selected, the remaining candidates need to be retrained, giving O(n^2) complexity, where n is the number of features. In addition, MI itself does not allow a feature to take multiple values.

② Implicit feature combination



The other category is implicit feature combination, which you may be more familiar with. FM [4] and FFM [5] enumerate all second-order feature combinations, and their combination method is to represent the combination of two features by an inner product in a low-dimensional space, which has achieved good results. With the rise of deep learning, it is now more popular to do implicit feature combination based on DNNs, such as DeepFM [6]. However, the interpretability of these methods is not strong, which has drawn criticism.
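For readers unfamiliar with FM, here is a minimal sketch of the idea: each feature gets a low-dimensional latent vector, and the “combination” of two features is modeled by the inner product of their vectors rather than by an explicit crossed column. The variable names are mine, and the pairwise term uses the standard O(nk) rewriting.

```python
import numpy as np

# Minimal FM sketch: each feature i gets a latent vector v_i; the interaction
# of features i and j is <v_i, v_j> x_i x_j, never an explicit crossed column.
rng = np.random.default_rng(0)
n_features, k = 5, 3
V = rng.normal(size=(n_features, k))      # latent vectors, one per feature
w = rng.normal(size=n_features)           # first-order weights
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # one sparse input row

# sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
score = w @ x + pairwise
print(score)
```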

We proposed AutoCross [1], which has strong interpretability, can combine high-order features, and has high inference efficiency.

AutoCross overall structure



From left to right, the input of AutoCross is the data and the corresponding feature types. Through the AutoCross flow, a feature generator is produced, which can apply the learned feature processing methods to new data.

The flow mainly consists of three parts: preprocessing, followed by an iterative process of combination feature generation and combination feature selection. For data preprocessing, multi-granularity discretization is proposed. Beam search is used to efficiently generate combination features from the exponential space. For efficient, low-cost feature selection, two methods are proposed: field-wise LR and successive mini-batch gradient descent.
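A high-level sketch of this flow is given below; all callbacks are hypothetical placeholders for the stages named above, not the real AutoCross API.

```python
def autocross_flow(data, feature_types, preprocess, generate_candidates,
                   select_best, should_stop):
    """Illustrative outline of the AutoCross flow described in the text."""
    # 1. Preprocessing: fill missing values, multi-granularity discretization.
    features = preprocess(data, feature_types)
    selected = set(features)                   # start from the original features
    # 2/3. Iterate: generate candidate crosses, select the best one, repeat.
    while not should_stop(selected):
        candidates = generate_candidates(selected)   # beam search over crosses
        best = select_best(candidates, data)         # field-wise LR + successive mini-batch GD
        selected.add(best)
    # The result defines a feature generator that replays these crosses on new data.
    return selected
```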

AutoCross algorithm

Let’s take a look at the algorithms involved in each process.

The first step is data preprocessing, which fills in missing values and discretizes continuous features. We observed that for continuous features, different discretization granularities can lead to very different results; on one data set we even observed a 10-percentage-point difference in AUC. Manually setting the optimal discretization granularity for each data set would be costly and unrealistic.

Based on this, we proposed the multi-granularity discretization method, which discretizes the same feature with several granularities at the same time. Take the feature “age” as an example: we discretize it with an interval of 5, again with an interval of 10, and again with an interval of 20, generating multiple discretized features simultaneously, and let the model automatically choose the ones that fit it best.
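A minimal sketch of this idea, using the “age” example with interval widths of 5, 10, and 20:

```python
import numpy as np

# Multi-granularity discretization sketch: the same continuous column is
# bucketed with several interval widths at once, and all resulting discrete
# columns are kept so the model can pick whichever granularity works best.
ages = np.array([3, 17, 24, 38, 45, 62])

granularities = [5, 10, 20]                       # interval widths
discretized = {g: ages // g for g in granularities}

for g, buckets in discretized.items():
    print(f"interval={g:2d} ->", buckets.tolist())
# interval= 5 -> [0, 3, 4, 7, 9, 12]
# interval=10 -> [0, 1, 2, 3, 4, 6]
# interval=20 -> [0, 0, 1, 1, 2, 3]
```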



① Beam Search

As mentioned above, assuming n original features, there are O(n^k) possible k-order combination features, so the search space is exponential. How do you effectively search for, generate, and combine features in this space? Generating all of them is not feasible in terms of computation or storage.

We use beam search to solve this problem. The idea is to generate only part of the second-order combination features, and then use the second-order combinations that work well to derive third-order combinations, rather than generating all third-order combinations. This is essentially a greedy search.
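The following sketch illustrates this greedy, beam-style search; `evaluate` is a hypothetical scoring callback (for example, validation AUC gain), and the beam width and maximum order are illustrative.

```python
def beam_search_crosses(base_features, evaluate, beam_width=5, max_order=3):
    """Greedy beam search over crossed features, as described above."""
    beam = [frozenset([f]) for f in base_features]   # start from single features
    results = []
    for order in range(2, max_order + 1):
        # Extend only the surviving crosses by one more base feature.
        candidates = {c | {f} for c in beam for f in base_features if f not in c}
        scored = sorted(candidates, key=evaluate, reverse=True)
        beam = scored[:beam_width]                   # keep only the best few
        results.extend(beam)
    return results
```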

② Logistic regression (Field-wise LR)

We preprocess the data with multi-granularity discretization, and then reduce the search space with beam search.

However, a large number of candidate features are still generated, so how can we quickly and cheaply select the effective ones among them? For this we proposed the field-wise LR algorithm: the model parameters corresponding to the already selected features are fixed, and we then evaluate which candidate feature, when added, improves the model the most. Doing so significantly saves computation, communication, and storage overhead.
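A greatly simplified sketch of this idea follows. In AutoCross each candidate cross is a one-hot field with many weights; here it is reduced to a single column with one scalar weight, trained while the weights of the already selected features stay fixed.

```python
import numpy as np

def field_wise_score(X_selected, w_selected, x_candidate, y, epochs=5, lr=0.1):
    """Sketch of field-wise LR: keep the already-learned weights fixed and only
    train a weight for the candidate column, then score it by log-loss."""
    base = X_selected @ w_selected              # fixed contribution, computed once
    w_c = 0.0                                   # only this parameter is trained
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(base + w_c * x_candidate)))
        grad = np.mean((p - y) * x_candidate)   # gradient of mean log-loss w.r.t. w_c
        w_c -= lr * grad
    p = 1.0 / (1.0 + np.exp(-(base + w_c * x_candidate)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
```

Each candidate is scored this way, and the candidate with the lowest loss (largest improvement) is added to the selected set.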



③ Successive mini-batch gradient descent (successive mini-batch GD)

To further reduce the cost of feature evaluation, the successive mini-batch GD method is proposed. During mini-batch gradient descent, less promising candidate features are gradually eliminated, and more batches of data are given to the more important candidates to make their evaluation more accurate.
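A sketch of this successive-elimination idea, with hypothetical `update` and `score` callbacks and an illustrative keep ratio:

```python
def successive_mini_batch_gd(candidates, data_batches, update, score, keep_ratio=0.5):
    """Sketch of successive mini-batch GD: after each batch, drop the weaker
    candidates so that only promising ones receive more data."""
    survivors = list(candidates)
    for batch in data_batches:
        for c in survivors:
            update(c, batch)                      # one mini-batch GD step per survivor
        survivors.sort(key=score, reverse=True)   # rank by current estimated quality
        keep = max(1, int(len(survivors) * keep_ratio))
        survivors = survivors[:keep]              # eliminate less promising candidates
        if len(survivors) == 1:
            break
    return survivors[0]
```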

AutoCross – System optimization

Here are some of the optimizations we made on the system.

① Caching feature weights

From the algorithmic point of view, our system solves a search problem over an exponential space; even with the reduced complexity, the cost of running it is still very large. So we sample the data and serialize it for compressed storage.

On top of that, when field-wise LR is run, the system caches the feature weights that have already been computed. With the previous approach, we would first need to fetch the weights of the generated features from the parameter server, which brings network overhead; after fetching, we would need to generate the features and run prediction, which incurs computation overhead; and once features are generated, they are stored on disk, which incurs storage costs. By caching the weights of the previously selected features, all of this can be replaced with a direct table lookup, reducing the network, computation, and storage overhead.
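A minimal sketch of the caching idea, under the assumption that the cached quantity is each sample’s fixed contribution from the already selected features (`fetch_weight` stands in for a parameter-server lookup and is hypothetical):

```python
# Weights of already-selected features do not change during a field-wise LR run,
# so each sample's fixed contribution can be computed once and then looked up,
# avoiding repeated parameter-server fetches, recomputation, and re-storage.
weight_cache = {}   # sample_id -> cached sum of weights of the selected features

def base_logit(sample_id, selected_features, fetch_weight):
    """Return the cached fixed contribution for one sample, computing it once."""
    if sample_id not in weight_cache:
        weight_cache[sample_id] = sum(fetch_weight(f) for f in selected_features)
    return weight_cache[sample_id]
```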

② Online calculation

In addition to caching feature weights, we also overlap computation: while the main flow is running, separate threads serialize data and generate features in parallel.

③ Data parallelism

In addition, data parallelism is a common system optimization. Each process in the system maintains its own map, and the master node, or parameter server, keeps the operations among them in order.

AutoCross experiments

Here are our results.



There are two baselines here. Let’s start by looking at how the features generated by AutoCross help LR: when we feed AutoCross features into LR, the effect improves dramatically (lines 1 and 2). We also compared AutoCross with the CMI approach (lines 2 and 4); AutoCross consistently outperformed CMI.

To verify whether the features generated by AutoCross also help deep models, we combined AutoCross features with the W&D (Wide & Deep) model (line 3). When we gave the features to W&D, it also achieved very good results, comparable to the best current deep learning models on all 10 data sets.

References

[1] Yuanfei Luo, Mengshuo Wang, Hao Zhou, Quanming Yao, Wei-Wei Tu, Yuqiang Chen, Qiang Yang, and Wenyuan Dai. 2019. AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD.

[2] Rómer Rosales, Haibin Cheng, and Eren Manavoglu. 2012. Post-click conversion modeling and analysis for non-guaranteed delivery display advertising. WSDM.

[3] Olivier Chapelle, Eren Manavoglu, and Rómer Rosales. 2015. Simple and scalable response prediction for display advertising. TIST.

[4] Steffen Rendle. 2010. Factorization Machines. ICDM.

[5] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. ACM Conference on Recommender Systems (RecSys).

[6] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. IJCAI.