Algorithm competitions, with their fast feedback loops and fierce competition, are an important way for algorithm practitioners to improve their technical level. Because competition problems are abstracted from core industry problems, they have strong practical significance. Based on the author's experience of winning Kaggle/KDD Cup competitions seven times, this paper introduces modeling and optimization across multiple domains, an AutoML technical framework, and how to analyze and model a new problem. We hope readers can take away general, efficient modeling methods and ways of understanding problems from these competitions.

1 Background and introduction

Algorithm competitions with fast feedback and fierce competition are an important way for algorithm practitioners to improve their skills. Competition problems abstracted from core industry problems have strong practical significance, the real-time leaderboard encourages participants to keep improving in an attempt to surpass current best practice, and winning solutions are a strong catalyst for industry and academia. For example, the Field-aware Factorization Machine (FFM) algorithm that emerged from the KDD Cup [1] and the ResNet model that emerged from the ImageNet competition [2] are both widely used in industry.

Meituan's in-store advertising quality estimation team won first prize in the MDD Cup, Meituan's internal algorithm contest, and was invited by the organizing committee to share some general competition experience. This article draws on my experience of winning seven Kaggle/KDD Cup competitions (shown in Figure 1 below), and I hope it helps more readers.

As is well known, Kaggle and the KDD Cup are top international competitions with great influence in both the competition community and industry. Kaggle is the world's largest data mining platform, with hundreds of thousands of users worldwide. Through high prize money and a culture of sharing, it has produced a large number of excellent algorithm solutions, such as the $3 million Heritage Health Prize. Kaggle, which was acquired by Google, is known for results in areas such as HIV research, chess ratings, and traffic forecasting.

ACM SIGKDD (the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD for short) is the premier international conference in the field of data mining. The KDD Cup, hosted by SIGKDD, is the premier international data mining competition. It has been held annually since 1997 and is the most influential event in the field. The competition is open to both industry and academia, gathering top experts, scholars, engineers, and students in data mining from around the world, and providing a platform for practitioners to exchange ideas and showcase their research.

Looking back, it is easy to see that for more than twenty years the KDD Cup has stayed closely tied to frontier and hot topics in industry, and its evolution can be divided into three stages. The first stage, starting around 2002, focused on Internet recommendation systems, including recommendation, advertising, and behavior prediction. The second stage turned to traditional industries, focusing on fields such as education, the environment, and healthcare. The third stage, since 2019, has focused on frontier problems such as AutoML, debiasing, and reinforcement learning; their common feature is that these new problems are difficult to solve with previous methods. These three trends reflect, to some extent, the current difficulties and emphases in industry and academia: in problem formulation, methods, and dimensions, they show an evolution from narrow to broad and from standard to non-standard.

This paper first introduces the solutions and understanding behind the author's seven winning KDD Cup/Kaggle entries, which span recommendation, advertising, transportation, the environment, AI fairness, and other fields. It then introduces the AutoML framework that played a key role in these competitions, including automated feature engineering, automated model optimization, and automated model fusion, and how to systematically model different problems with this framework. Finally, it presents the general methodology distilled from these competitions, that is, how to analyze, understand, and model a new problem so as to optimize it in depth.

This article is aimed at the following two types of readers; others who are interested are also welcome.

  • Algorithm competition enthusiasts who want to understand the methods and logic behind champion solutions of top international data mining competitions in order to achieve better rankings.
  • Industrial engineers and researchers who want to borrow competition methods and apply them to practical work to achieve better results.

2. Multi-domain modeling optimization

In this part, we group the above competitions into three categories. The first category is recommendation system problems. The second is time series problems; the important difference from the first is that the prediction target is a multi-point sequence in the future rather than the single-point prediction of a recommender system. The third is automated machine learning problems, where the input is not a single data set but multiple data sets from multiple problems, and the data sets used in the final evaluation remain unknown to the contestants, so the requirements on the robustness of the solution are very high. As shown in Table 1, the winning solutions for the seven tracks are detailed later, condensed into five core solutions.

2.1 Recommendation System Problems

This section mainly introduces the Kaggle Outbrain Ads Click Prediction and KDD Cup 2020 Debiasing competitions. Both tasks are oriented toward predicting the user's next click, but because the application scenarios and backgrounds differ, they pose different challenges: the former has a huge data scale, involving billions of browsing records of hundreds of millions of users on thousands of heterogeneous sites, and places strict demands on model optimization and fusion; the latter pays particular attention to bias in recommendation systems and requires contestants to propose effective solutions to alleviate selection bias and popularity bias, so as to improve the fairness of the recommendation system. The two competitions are introduced separately below.

Kaggle Outbrain Ads Click Prediction: A Multi-level, Multi-factor Model Fusion Scheme

Problem and challenges: The competition asks contestants to predict which web ad a user will click next on the Outbrain web content discovery platform; for details, see the Kaggle Outbrain competition page [26]. Contestants face two important challenges:

  • Heterogeneity: The platform provides demand-side platform (DSP) advertising services, which involves characterizing user behavior across thousands of heterogeneous sites.
  • Ultra-high-dimensional sparsity: Features are high-dimensional and sparse, and the data scale is huge, with 700 million users and 2 billion browsing records.

Multi-level, multi-factor model fusion scheme: Facing these challenges, our team adopted a multi-level, multi-factor model fusion scheme. On the one hand, a single model has difficulty comprehensively describing heterogeneous site behavior; on the other hand, the data scale of billions of records leaves ample room for optimizing multiple models. Because FFM has strong feature-crossing and generalization ability and handles high-dimensional sparse features well, we chose it as the main base model for fusion. Model fusion learns different content from different models and thus effectively mines users' heterogeneous behavior across sites. The key to model fusion is to produce and combine models that are "good but different" [3][4]. The scheme first builds differences between models from the perspectives of model differences and feature differences, and then fuses the models at multiple levels using multiple feature factors (model pCTR predictions and hidden-layer representations) as inputs to the base learners:

Specifically, as shown in Figure 3, the first level aims to build single models with differences, mainly by training different types of models on users' recent behavior, on all behavior data, and on different feature sets. The second level further increases diversity by combining single models; the differences come from two aspects, namely the combination of models (different models over the outputs of the single models) and the feature factors used in the combination, which include each model's predicted click-through rate and its hidden-layer parameters. The third level considers how to combine the different fusion results. Because the validation data set is small, a complex nonlinear model would easily overfit, so a constrained linear model is used to obtain the fusion weights of the second-level models.
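As an illustration of the third-level constrained linear fusion, the sketch below fits non-negative blending weights that sum to one on validation predictions; the function names, data shapes, and optimizer choice are assumptions for illustration, not the exact competition implementation.

```python
# Hedged sketch: constrained linear blending of second-level model predictions.
# Weights are constrained to be non-negative and to sum to 1.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

def fit_constrained_blend(preds, y_true):
    """preds: (n_samples, n_models) pCTR predictions on the validation set."""
    n_models = preds.shape[1]

    def loss(w):
        blended = np.clip(preds @ w, 1e-7, 1 - 1e-7)
        return log_loss(y_true, blended)

    w0 = np.full(n_models, 1.0 / n_models)               # start from uniform weights
    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * n_models                      # non-negative, bounded weights
    res = minimize(loss, w0, method="SLSQP", bounds=bounds, constraints=constraints)
    return res.x

# Usage: weights = fit_constrained_blend(val_preds, val_labels); test_pred = test_preds @ weights
```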

Compared with the models in our production business, the above scheme uses more model fusion, achieving higher accuracy at higher cost. In actual business, more attention must be paid to the balance between effect and efficiency.

KDD Cup 2020 Debiasing: Debiasing Scheme Based on i2i Multi-hop Walk

Problem and challenges: The competition is set against the background of an e-commerce platform and asks contestants to predict the user's next click on a product, with particular attention to how to alleviate selection bias and popularity bias in the recommendation system; for details, see the KDD Cup 2020 Debiasing competition page [27]. There are many kinds of bias in recommendation systems; besides the two above, there are also exposure bias, position bias, and so on [5][6], and our team has previously studied position bias [7]. To better measure how well a system recommends historically unpopular products, contestants are mainly ranked by the NDCG@50_half metric, which is computed on the half of the clicked items in the evaluation set with less historical exposure; because these items are low-popularity yet clicked, they better reflect the bias problem. The competition contains the following challenges:

  • The problem only provides click data, and selection bias must be considered when constructing the candidate set.
  • The popularity of different products varies greatly, and historical clicks follow a long-tailed distribution; the data suffer from a serious popularity bias, and the evaluation metric NDCG@50_half examines the ranking quality of low-popularity products.

Debiasing ranking scheme based on i2i walks: Our scheme is a ranking framework based on i2i modeling. As shown in the figure, the overall process consists of four stages: i2i graph construction and multi-hop walk, i2i sample construction, i2i modeling, and u2i ranking. The first two stages address selection bias, while the last two focus on popularity bias.

In the first stage, an i2i graph is constructed from user behavior data and multimodal product data, and candidate samples are generated by multi-hop walks on the graph. This expands the candidate set of products, better approximates the system's true candidate set, and alleviates selection bias.

In the second stage, the similarity of i2i candidate samples is computed under different i2i relations, which determines how many candidates each relation contributes to the final candidate set. Exploring more diverse candidate products through different candidate construction methods further alleviates selection bias.

The third stage includes automated feature engineering on the i2i sample set and modeling with a popularity-weighted loss function to counter popularity bias. The automated feature engineering includes representations of the products' multimodal information, which reflect competitive relations between products beyond popularity and alleviate popularity bias to a certain extent. The popularity-weighted loss function is defined as follows:
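A plausible form, consistent with the roles of α and β described below, is a weighted cross-entropy (shown here as an illustrative assumption rather than the official definition):

$$\mathcal{L} = -\sum_{(u,i)} \alpha_i \left[\, \beta\, y_{ui} \log \hat{y}_{ui} + \left(1 - y_{ui}\right) \log\left(1 - \hat{y}_{ui}\right) \right]$$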

Here the parameter α is inversely proportional to popularity, weakening the weight of popular products and countering popularity bias; the parameter β is the positive-sample weight, used to address sample imbalance.

In the fourth stage, i2i scores are aggregated with a max operation to highlight the high-score signals of low-popularity products within the score set, further alleviating popularity bias. The scores of the product list are then adjusted according to product popularity.

For more details about the competition, see "KDD Cup 2020 Debiasing Competition Champion Technical Solution and Its Practice in Meituan" [8].

2.2 Time series problems

Time series problems: Time series problems differ considerably from recommendation problems. In terms of the task, a recommendation system predicts a single point in the future, while a time series problem predicts multiple future points. In terms of data, recommendation systems usually contain multi-dimensional information about users, items, and context, while time series usually contain numerical sequences that vary across time and space.

Time series competitions: This paper mainly introduces KDD Cup 2018 Fresh Air and KDD Cup 2017 Highway Tollgates Traffic Flow Prediction. Both are time series problems: the former predicts pollutant concentrations and their changes over the next two days, the latter predicts highway traffic conditions over the next few hours. They share several features: they are traditional-industry problems with strong practical significance; they contain many abrupt changes and low stability; and they both involve multiple regions and locations, requiring joint spatio-temporal modeling. They differ in that abrupt changes in pollutant concentration develop over a relatively short time and show a certain regularity, whereas traffic mutations are highly accidental, since roads are susceptible to accidents and geological disasters, so the data show no obvious regularity.

KDD Cup 2018 Fresh Air: Air Quality Prediction Scheme Based on a Spatio-temporal Gated DNN and Seq2Seq

Problem and challenges: The objective is to predict the PM2.5/PM10/O3 concentrations at 48 stations in Beijing and London over the next 48 hours; for details, see the KDD Cup 2018 competition page [28]. Contestants must solve two challenges:

  • Temporal: Pollutant concentrations over the next 48 hours must be predicted, and the actual concentrations contain abrupt changes. As shown in Figure 5, station 2 shows many fluctuations and abrupt changes between months 05 and 07.
  • Spatial: Pollutant concentrations vary significantly across stations and are related to the topological structure between stations. As shown in the figure, the waveforms of stations 1 and 2 differ considerably, yet both show the same bulge from month 05 to 07.

Model fusion scheme based on a spatio-temporal gated DNN and Seq2Seq [9]: To strengthen the modeling of temporal sequences and spatial topology, we introduced two models, a spatio-temporal gated DNN and a Seq2Seq model, and combined them with LightGBM into a model fusion scheme, as shown below.

(1) Spatio-temporal gated DNN: For the temporal problem, the statistical features of nearby future time points differ only slightly, so directly using a DNN makes the predictions for different hours and stations too similar. We therefore introduce a spatio-temporal gate into the DNN to highlight temporal and spatial information. As shown in Figure 6 below, the spatio-temporal gated DNN adopts a two-tower structure, separating spatio-temporal information from other information and using a gate function to control and emphasize it, ultimately improving the model's sensitivity to time and space. We also found that introducing the Swish activation function f(x) = x · sigmoid(x) improves model accuracy.
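The sketch below shows one way such a two-tower gated network could look in PyTorch; the layer sizes, feature split, and exact gate form are illustrative assumptions, not the competition model itself.

```python
# Hedged sketch of a two-tower spatio-temporal gated DNN.
import torch
import torch.nn as nn

class SpatioTemporalGatedDNN(nn.Module):
    def __init__(self, st_dim, other_dim, hidden=128):
        super().__init__()
        self.st_tower = nn.Sequential(nn.Linear(st_dim, hidden), nn.SiLU())        # spatio-temporal tower (SiLU == Swish)
        self.other_tower = nn.Sequential(nn.Linear(other_dim, hidden), nn.SiLU())  # tower for the remaining features
        self.gate = nn.Sequential(nn.Linear(st_dim, hidden), nn.Sigmoid())         # gate driven by spatio-temporal features
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x_st, x_other):
        # The gate emphasizes or suppresses the generic features according to time/space context.
        h = self.other_tower(x_other) * self.gate(x_st) + self.st_tower(x_st)
        return self.head(h).squeeze(-1)

# Usage: model = SpatioTemporalGatedDNN(st_dim=16, other_dim=64); y_hat = model(x_st, x_other)
```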

(2) Seq2Seq: Although the spatio-temporal gated DNN strengthens spatio-temporal information compared with a plain DNN, both model the data by replicating the historical features of a sample 48 times and attaching the label of each of the next 48 hours, effectively predicting each hour independently. This departs somewhat from the time series prediction task and loses temporal continuity. Seq2Seq modeling solves this problem naturally and achieved good results. Figure 7 below shows the Seq2Seq structure we adopted. For the temporal challenge, historical weather features are organized as a time-ordered sequence and fed into the encoder; the decoder decodes from the encoder state and future weather-forecast features to produce the 48-hour pollutant concentration sequence. Because the weather forecast for each hour is aligned to the corresponding decoder step, the decoder can effectively predict abrupt changes from forecast information (such as wind level and pressure). For the spatial challenge, the scheme adds station embeddings and spatial topological features, concatenated and normalized with the weather information, realizing joint spatio-temporal modeling.
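A compact encoder-decoder sketch of this idea is given below; the GRU backbone, dimensions, and variable names are assumptions for illustration rather than the exact architecture in Figure 7.

```python
# Hedged sketch of the Seq2Seq idea: encode historical weather, then decode 48 hourly
# concentrations conditioned on the future weather forecast and a station embedding.
import torch
import torch.nn as nn

class AirSeq2Seq(nn.Module):
    def __init__(self, hist_dim, forecast_dim, n_stations, emb_dim=16, hidden=128, horizon=48):
        super().__init__()
        self.horizon = horizon
        self.station_emb = nn.Embedding(n_stations, emb_dim)
        self.encoder = nn.GRU(hist_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(forecast_dim + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 3)  # PM2.5 / PM10 / O3

    def forward(self, hist_seq, forecast_seq, station_id):
        # hist_seq: (B, T_hist, hist_dim); forecast_seq: (B, 48, forecast_dim); station_id: (B,)
        _, h = self.encoder(hist_seq)                                   # summarize the history
        emb = self.station_emb(station_id).unsqueeze(1).expand(-1, self.horizon, -1)
        dec_in = torch.cat([forecast_seq, emb], dim=-1)                 # align hourly forecasts to decoder steps
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)                                        # (B, 48, 3)
```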

(3) Model ensembling: Our team adopted stacking, where single learners build differences through different models, data, and modeling methods. The LightGBM model uses weather quality, historical statistics, spatial topology, and other features; the spatio-temporal gated DNN introduces the gate structure to strengthen spatio-temporal information; and Seq2Seq describes the continuity and volatility of the sequence with sequence-to-sequence modeling. Finally, a constrained linear model fuses the different single learners.

For more details, please refer to the SIGKDD paper AccuAir: Winning Solution to Air Quality Prediction for KDD Cup 2018 [9].

KDD Cup 2017 Traffic Flow Prediction: A High-stability Traffic Prediction Solution Based on Cross-validation Denoising and Multi-loss Fusion

Problem and challenges: The objective is to predict driving conditions over the next 2 hours in 20-minute windows, given the conditions from the expressway entrance to the tollgate over the previous 2 hours; for details, see the KDD Cup 2017 competition page [29]. The competition is divided into two tracks, travel time prediction and traffic volume prediction. Contestants must solve two challenges:

  • Small data with much noise: As shown in Figure 8 below, the value distribution of the boxed time period differs significantly from other time periods.

  • Extreme values have a great influence on the results: MAPE is used as the evaluation metric, $$\text{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|,$$ where \(A_t\) is the actual value and \(F_t\) is the predicted value. When the actual value is small (especially very small), that term carries a large weight in the overall sum.

Model fusion scheme with extreme-value optimization based on cross-validation denoising:

(1) Denoising based on cross-validation: Online submissions are limited to once a day, and the final evaluation switches from the A (public) test set to the B (private) test set; since the A test set is small and the online metric is unstable, an offline iterative validation method is particularly important. To make offline iteration trustworthy, we used two validation methods. The first is same-period next-day validation: for each of the last M days of the training set we take the data in the online prediction time period, obtaining M validation sets. The second is N-fold day-level validation, similar to N-fold cross-validation: the data of each of the last N days serves in turn as a validation set, giving N validation sets. Together these guide the offline iteration of the model and ensure robustness on the B leaderboard.
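A minimal pandas sketch of these two validation schemes is shown below; the column names ("date", "hour") and window values are illustrative assumptions, not the competition data schema.

```python
# Hedged sketch of the two offline validation schemes described above.
import pandas as pd

def same_period_validation_sets(df, m_days, pred_hours):
    """One validation set per day for the last m_days, restricted to the online prediction hours."""
    last_days = sorted(df["date"].unique())[-m_days:]
    return [df[(df["date"] == d) & (df["hour"].isin(pred_hours))] for d in last_days]

def day_level_nfold_sets(df, n_days):
    """N-fold day-level validation: each of the last n_days is a held-out validation day."""
    last_days = sorted(df["date"].unique())[-n_days:]
    return [(df[df["date"] != d], df[df["date"] == d]) for d in last_days]

# Usage: val_sets = same_period_validation_sets(train_df, m_days=5, pred_hours=[8, 9, 17, 18])
```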

(2) Extreme-value optimization and model fusion: Because MAPE is sensitive to extreme values, we applied different treatments to labels, losses, and sample weights, such as log and Box-Cox transformations of the labels. The log transformation fits the model on log-transformed labels and restores the estimates afterwards, which helps the model focus on small values and makes it more robust. MAE and MSE are used as losses, and labels are used to weight samples. We introduced these treatments into XGBoost, LightGBM, and DNN models to generate multiple different models for fusion, optimizing the extreme-value problem and achieving a robust result.
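As a concrete example of the label log-transform trick, the sketch below trains a LightGBM regressor on log1p-transformed labels and inverts the transform at prediction time; the hyperparameters and function names are assumptions for illustration.

```python
# Hedged sketch of the label log-transform with LightGBM.
import numpy as np
import lightgbm as lgb

def fit_log_target(X_train, y_train, X_val, y_val):
    model = lgb.LGBMRegressor(objective="regression_l1", learning_rate=0.05, n_estimators=2000)
    model.fit(
        X_train, np.log1p(y_train),                  # train on log-transformed labels
        eval_set=[(X_val, np.log1p(y_val))],
        callbacks=[lgb.early_stopping(100)],
    )
    return model

def predict_original_scale(model, X):
    return np.expm1(model.predict(X))                # invert the log transform
```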

Note: Special thanks to Chen Huan, Yan Peng, Huang Pan and other students who participated in KDD Cup 2017.

2.3 Automated machine learning problems

Automated machine learning problems [10] include the KDD Cup 2019 AutoML and KDD Cup 2020 AutoGraph competitions. This type of problem has the following characteristics:

  • Strong data diversity: 15+ data sets come from problems in different fields, and the data sources are not identified; the automated machine learning framework designed by contestants must be compatible with data from multiple fields and adapt to each of them.
  • Robustness of automation: public and private leaderboards differ, and the final score is the average ranking/score over multiple data sets, requiring robust results on previously unseen data sets.
  • Performance limits: matching real-world constraints, the search space must be explored within limited time and memory.

KDD Cup 2020 AutoGraph: Automatic Multi-level Graph Learning Optimization Based on a Proxy Model

Problem and challenges: The Automated Graph Representation Learning Challenge is the first AutoML challenge applied to graph-structured data; see the KDD Cup 2020 AutoGraph competition page [30]. The competition uses graph node multi-classification tasks to evaluate the quality of representation learning, and contestants need to design automated graph representation learning [11-13] solutions that efficiently learn high-quality representations of each node from the given node features, neighborhoods, and structural information. The data were collected from real businesses, including social networks, paper networks, and knowledge graphs. Five data sets could be downloaded, five feedback data sets were used to score solutions on the public leaderboard, and the remaining five data sets determined the final ranking of the last submission.

Each data set provides graph node IDs and node features, edge and edge-weight information, as well as a time budget (100-200 seconds) and a memory limit (30 GB). Within each data set, 40% of the nodes are randomly assigned to the training set and 60% to the test set. Contestants design automated graph learning solutions to classify the test nodes. Each data set is scored by accuracy, and the final ranking is the average ranking over the last five data sets. In short, the competition requires an automated graph learning scheme to run directly on five previously unseen data sets, and contestants face the following challenges:

  • Graph models have high variance and low stability.
  • Each data set has a strict time budget and memory limit.

Automatic multi-level model optimization based on a proxy model [14]

Multi-category hierarchical graph model optimization:

(1) Candidate graph model generation: Real-world graphs are usually combinations of multiple properties, and it is hard to capture all of them with a single method. We therefore use different types of models based on the spectral domain, the spatial domain, and attention mechanisms to capture multiple relations. The effects of different models vary greatly across data sets, so to prevent poor models from entering the subsequent fusion, candidate models such as GCN, GAT, APPNP, TAGConv, DNA, GraphSAGE, GraphMix, GRAND, and GCNII are rapidly screened to obtain a model pool.

(2) Hierarchical model integration: this part includes integration along two dimensions. The first layer is self-ensembling of the same model: graph models are particularly sensitive to initialization, with accuracy fluctuations of around ±1% for the same model, so we train multiple copies of the same model and average their predictions as that model's output, successfully reducing its variance and improving its stability across data sets. The second layer is the integration of different models: to effectively use information from local and global neighborhoods and fully capture different properties of the graph, we use a weighted ensemble of different types of graph models to further improve performance. In the parameter search stage, the models' internal parameters α and the weighted ensemble parameters β must be optimized simultaneously; they are solved by alternating gradient descent, which effectively improves speed.
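The sketch below shows one side of that alternation, learning softmax-normalized ensemble weights over fixed per-model class probabilities by gradient descent; the tensor shapes and step counts are illustrative assumptions, not the exact competition code.

```python
# Hedged sketch of the second-layer weighted ensemble of graph models.
import torch
import torch.nn.functional as F

def learn_ensemble_weights(model_probs, y_val, steps=200, lr=0.1):
    """model_probs: list of (n_val, n_classes) probability tensors from the candidate models."""
    stacked = torch.stack(model_probs, dim=0)                 # (n_models, n_val, n_classes)
    logits = torch.zeros(len(model_probs), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits, dim=0)                      # non-negative weights summing to 1
        blended = (w[:, None, None] * stacked).sum(dim=0)
        loss = F.nll_loss(torch.log(blended.clamp_min(1e-9)), y_val)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=0)
```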

Two-stage optimization based on a proxy model and the final model: The data set is sampled, with subgraphs stratified by label, to reduce model validation time; a small proxy model plus bagging (averaging multiple models with smaller hidden layers) quickly evaluates each class of model. Kendall rank correlation and the speed-up ratio are used to balance accuracy against acceleration and to choose a suitable proxy model. Finally, the optimal hyperparameters are found with the proxy model, and the final large model is trained with the searched parameters.
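As a small illustration of the proxy-model check, Kendall's tau can be computed between the proxy and full-model scores of the same candidate configurations; the scores and timings below are made-up numbers for illustration only.

```python
# Hedged sketch: does a cheap proxy model rank candidate configurations consistently
# with the full model? High tau plus a large speed-up justifies using the proxy.
from scipy.stats import kendalltau

proxy_scores = [0.71, 0.69, 0.74, 0.66, 0.72]   # accuracy of candidates under the proxy model (illustrative)
full_scores  = [0.80, 0.78, 0.83, 0.74, 0.81]   # accuracy of the same candidates under the full model (illustrative)

tau, p_value = kendalltau(proxy_scores, full_scores)
speedup = 210.0 / 35.0                           # hypothetical full vs. proxy evaluation time in seconds
print(f"Kendall tau={tau:.2f}, p={p_value:.3f}, speed-up={speedup:.1f}x")
```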

For details, please refer to the team's ICDE 2022 paper AutoHEnsGNN: Winning Solution to AutoGraph Challenge for KDD Cup 2020 [14].

3. AutoML technical framework

3.1 Overview of automation framework

After the above competitions, the team summarized and refined its multi-domain modeling, abstracted the more general modules, and distilled a relatively general solution to data mining problems: an AutoML framework. The framework includes three parts: data preprocessing, automated feature engineering [15], and automated model optimization [16-20]. The data preprocessing part mainly handles common basic operations such as feature classification, data encoding, and missing-value handling, and is not expanded on here. This section describes the automated feature engineering and automated model optimization parts in detail.

3.2 Automatic feature engineering

Feature engineering is a very important part of machine learning, and the quality of the features largely determines the upper bound of model accuracy. The common approach is to combine and transform features manually, but manual feature mining is slow and cannot be exhaustive. Automated feature engineering designed for comprehensive mining solves these problems. It mainly includes three parts:

  • First- and second-order feature operators: more complex higher-order features can be obtained from basic operations on the data. There are three kinds of feature operators. Frequency encoding computes statistics such as occurrence counts and nunique values of categorical features over the samples; target encoding applies operations such as mean, sum, max, min, and percentiles to numerical features; time series differencing applies differencing to temporal features. A first-order operator uses one entity, while a second-order operator uses two; for example, the number of orders placed by a user in a category uses two entities, user and category.
  • Fast feature selection: Automated feature engineering takes the Cartesian product of all entities under different feature operators, which produces a large number of invalid features, so fast feature selection is needed. A LightGBM model is used to quickly identify effective and useless features; from the perspective of metric improvement and feature importance, useless features are pruned and important features are selected for higher-order combination with other features.
  • High-order feature operators: new features constructed from combinations of first- and second-order operators are further combined with other features. Iterating from order K (K >= 1) to order K+1 produces a large number of high-order features that manual work cannot cover.

High-order feature operators come in two modes, depending on whether the results of multiple entities must match exactly: Match mode, which matches all entities, and All mode, which matches some entities and aggregates over all values of another entity. As shown in the following figure, Match mode matches both the user and the time segment to obtain the user's average order price within that time segment, while All mode matches only the user to obtain the user's average order price over all time segments.
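A small pandas sketch of these operators is given below; the column names (user_id, category, hour_bucket, price) are illustrative assumptions, not the framework's actual API.

```python
# Hedged sketch of frequency encoding, Match mode, and All mode aggregations.
import pandas as pd

orders = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "category":    ["a", "a", "b", "a", "c"],
    "hour_bucket": ["noon", "noon", "night", "noon", "night"],
    "price":       [20.0, 35.0, 12.0, 50.0, 18.0],
})

# Second-order frequency encoding: number of orders per (user, category).
orders["user_cat_cnt"] = orders.groupby(["user_id", "category"])["price"].transform("count")

# Match mode: average order price per (user, hour_bucket) -- both entities must match.
orders["user_hour_avg_price"] = orders.groupby(["user_id", "hour_bucket"])["price"].transform("mean")

# All mode: average order price per user over all hour buckets -- only the user is matched.
orders["user_all_avg_price"] = orders.groupby("user_id")["price"].transform("mean")

print(orders)
```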

Compared with algorithms such as DeepFM and DeepFFM, automated feature engineering has three advantages. First, with multi-table information it can easily use data beyond the training data; for example, in advertising scenarios, information from organic (non-ad) data can be brought in through features, which is less likely to cause distribution inconsistency than training directly on organic data. Second, some strong feature crosses are not fully learned by the model's automatic crossing alone; many cross features, such as a user's click-through rate on a product, carry strong business meaning, and letting the model perceive such combined features directly is easier than having it learn the relations automatically. Third, for high-dimensional sparse ID features, such as the hundreds of millions of IDs in recommendation or advertising scenarios, DeepFM and DeepFFM cannot learn them sufficiently, while automated feature engineering can construct strong feature representations for these sparse IDs.

3.3 Automatic model optimization

Importance-based grid search: In our framework, hyperparameters are first searched greedily in order of their global importance to speed things up; the resulting optimum is then refined by a finer grid search in a small neighborhood to mitigate the local optima caused by the greedy strategy (a small search sketch follows the list below). Based on past competition experience, the importance ordering of hyperparameters for different models is summarized as follows:

  • LightGBM: learning rate > sample imbalance rate > number of leaves > row sampling, etc.
  • DNN: learning rate > embedding dimension > number and size of fully connected layers. It is worth mentioning that hyperparameter search is performed several times during the overall iteration, and the search strategy differs between early and late stages: early on, smaller embedding dimensions and fewer fully connected layers are generally chosen to lower model capacity and speed up iteration, while later more parameters are chosen to obtain better results.
  • Model fusion: The key to model fusion lies in the differences between the constructed models. LightGBM and DNN models differ greatly from each other, while differences among models of the same type mainly come from data differences, feature differences, and hyperparameter differences. Data differences are realized mainly by automatic row sampling, which generates models trained on different data samples; feature-sampled models are generated by automatic column sampling; hyperparameter differences are generated by perturbing the searched optimal parameters locally around the optimum. Fusion methods are generally Blending, Stacking, or simple mean pooling, and model-level pruning (removing poorly performing models so they do not hurt the fusion) and regularization are required before fusion.
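The sketch below illustrates the importance-ordered greedy search followed by a local refinement; the parameter grids and the evaluate() function are illustrative assumptions, not the framework's actual interface.

```python
# Hedged sketch of importance-ordered greedy hyperparameter search with a refinement pass.
def greedy_then_refine(evaluate, grids, refine_grids):
    """grids: {param: candidate values}, ordered from most to least important (Python 3.7+ dicts keep order)."""
    best = {p: vals[0] for p, vals in grids.items()}
    for param, values in grids.items():                  # greedy pass, one parameter at a time
        scores = {v: evaluate({**best, param: v}) for v in values}
        best[param] = max(scores, key=scores.get)
    for param, values in refine_grids.items():           # finer grid around the greedy optimum
        scores = {v: evaluate({**best, param: v}) for v in values}
        best[param] = max(scores, key=scores.get)
    return best

# Usage (illustrative): grids = {"learning_rate": [0.1, 0.05, 0.02], "num_leaves": [31, 63, 127]}
```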

3.4 Recent application of the AutoML framework: champion solution of the MDD Cup 2021 Meituan Waimai graph recommendation competition

In the MDD Cup 2021, Meituan's in-store advertising quality estimation team applied the AutoML framework and won the championship. The following uses this competition to introduce how the framework is applied to a specific problem.

The MDD Cup 2021 asks contestants to predict which merchant a user will order from next, based on users, merchant attributes in the graph, and users' historical clicks, real-time clicks, and ordering behavior. The data cover 1.35 million ordering actions over four weeks, involving 200,000 users, 29,000 merchants, and 179,000 dishes, plus 4.38 million order-dish association records, forming a knowledge graph. Hitrate@5 is used as the evaluation metric.

Data preprocessing stage: feature classification, outlier handling, unified encoding, and similar operations. The data mainly involve three entities, users (user profile features, etc.), merchants (category, grade, brand, etc.), and dishes (taste, price, ingredients, etc.), plus two types of interaction data, clicks and purchases (LBS, price, time, etc.). The raw data are classified, encoded, filled for missing values, and given other common preprocessing operations.

Automated feature engineering: For first- and second-order feature operators, the raw data are first grouped into four types (categorical, numerical, temporal, and label), and first- and second-order crosses are built over the three entities and two types of interaction data above, using the frequency encoding, target encoding, and time-difference operators, with statistics computed over multiple time windows. For example, frequency encoding can compute the number of times a user clicks a merchant, the nunique of merchant categories the user has purchased, and the number of orders the user placed in a scene; target encoding can compute the user's average order price or the merchant category the user clicks most often; time series differencing can compute, for example, the average time gap between a user's purchases of dishes of a certain taste. Multi-window statistics means these features can be computed over different time periods.

Fast feature selection: The first- and second-order statistical features generated above number over 1,000, and many of them are invalid, so a LightGBM model is used to screen features from the perspective of metric improvement and feature importance. For example, a certain user taste feature had no effect and was screened out, while the price range a user most often buys in was effective and was kept as an important feature for higher-order combination.

High-order feature operators: new features constructed from combinations of first- and second-order operators serve as inputs for high-order combination. It is worth mentioning that high-order combination takes two forms. The first is high-order combination of original features, such as a user's favorite dish taste at a particular merchant, which combines three entities and needs no extra operation. The second uses new first- or second-order features, where frequency-encoding results can be used directly, while target-encoding and time-difference results must first be discretized by bucketing, for example a joint count over the mode of a user's order price range and the bucketed average order price of a merchant. The final feature set is obtained after combination and screening.

Automated model optimization: The model part uses a fusion of LightGBM and DIN. Automated hyperparameter search is run several times during iteration; multiple differing models are constructed by automatic row and column sampling and local perturbation of the optimal parameters, and the final result is obtained by fusion.

4 General modeling methods and understanding

This section will introduce the general modeling method of the competition, that is, how to design the overall scheme quickly and efficiently in the face of a new problem.

4.1 Modeling framework and methods

Facing a new problem, we divide the work into three stages: exploratory modeling, key modeling, and automated modeling. The three stages deepen progressively and complement one another.

Exploratory modeling: In the early stage of the competition, we first understand the problem, including the evaluation metric and the data tables, then build a basic model and submit it online to verify consistency. Finding an evaluation method consistent with the online metric often takes multiple submissions. The core goal of exploratory modeling is to find iteration ideas and methods, so the problem needs to be explored from many angles to find the right direction.

In general, for non-temporal problems the N-fold method is used to construct multiple validation sets, and different sets can be obtained by varying the random seed. For temporal problems, a sliding-window method is generally adopted to construct validation sets matching the online submission time, and k validation sets can be built by sliding the window back k days. Evaluation over multiple validation sets can use the mean, variance, extreme values, and other reference statistics, checked for consistency with the online results.

Key modeling: In the middle of the competition, we dig into the key problems to reach the top of the leaderboard. In terms of problem understanding, we try our best to tailor the loss function to the evaluation method.

For classification problems, Logloss, AUC loss [21], NDCG loss, and other losses can be combined into a mixed loss. The loss design for regression problems is more complex: on the one hand, squared error and absolute error can be combined; on the other hand, log and Box-Cox transformations can be combined to handle regression outliers and similar problems.
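One simple way to mix a calibration objective with a ranking objective is sketched below, combining logloss with a pairwise hinge surrogate for AUC; the weighting and margin are illustrative assumptions rather than a recipe from any specific competition.

```python
# Hedged sketch of a mixed classification loss: logloss + pairwise AUC surrogate (PyTorch).
import torch
import torch.nn.functional as F

def mix_loss(logits, labels, alpha=0.7, margin=1.0):
    ll = F.binary_cross_entropy_with_logits(logits, labels.float())   # calibration term (logloss)
    pos = logits[labels == 1]
    neg = logits[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return ll
    # Pairwise hinge over all positive/negative score pairs approximates 1 - AUC.
    auc_surrogate = F.relu(margin - (pos[:, None] - neg[None, :])).mean()
    return alpha * ll + (1 - alpha) * auc_surrogate
```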

Automated modeling: In the late stage of the competition, human understanding inevitably has blind spots in details and angles, and abstract relational modeling is difficult by hand, so we use automated modeling as a supplement. As shown in Figure 18, the framework first takes multiple relational tables as input and joins them automatically, then builds a large number of features through automated feature engineering, iterating with feature selection; next, automated hyperparameter search and selection are performed on the model inputs, and finally automated multi-model fusion selects and weights the diverse models that have been generated.

In other words, automated modeling generally adopts the framework shown in Figure 18: first multi-table joins, then feature construction and selection following the logic of expand-then-filter, and then hyperparameter search over the selected features and multiple hyperparameter ranges.

4.2 Connection with industrial methods

Compared with real industrial settings, an important difference of algorithm competitions is that industry involves online systems, which pose greater engineering performance challenges and more online/offline consistency issues for the algorithm. Algorithm competitions can therefore push model complexity and accuracy further. Models such as ResNet, Field-aware Factorization Machines (FFM), and XGBoost [22] also emerged from algorithm competitions and are widely used in practical industrial systems.

In air quality prediction we used the spatio-temporal gated DNN and Seq2Seq networks for effective modeling, and Meituan's actual business faces similar spatio-temporal modeling problems. Take user behavior sequence modeling as an example: we fully model and interact users' historical and current spatio-temporal information [24], distinguishing the triple spatio-temporal information of user behavior, that is, the time of a user's click, the geographic location of the user's request, and the geographic location of the merchant the user clicked.

Based on this triple spatio-temporal information, we proposed a Spatiotemporal Activator Layer (shown in Figure 19): a trilateral spatio-temporal attention neural network that models the user's historical behavior, specifically by learning the interaction of the request's latitude and longitude, the merchant's latitude and longitude, and the request time. For spatial crossing, we further combine geographic location hash encoding with spherical distance; for temporal crossing, we combine absolute and relative time, so that the user behavior sequence is effectively expressed on all three sides under different spatio-temporal conditions. Finally, the spatio-temporal information encoded by this network is fused through an attention network to obtain a personalized representation of the user's ultra-long behavior sequence for different request candidates in LBS scenarios.

By comparison, the spatio-temporal gated DNN in the competition focuses more on how the fused spatio-temporal information affects the predicted values: since a whole future sequence must be predicted, the emphasis is on fully modeling the differences between different times and locations. The spatio-temporal network in Meituan's business, by contrast, focuses on fine-grained spatial information, since different spherical distances and block positions matter greatly and require deeper modeling of multiple signals. For more details, see the team's CIKM paper Trilateral Spatiotemporal Attention Network for User Behavior Modeling in Location-based Search [23].

Actual modeling also involves more online components than competitions, which focus on extreme accuracy on offline data sets. Compared with the Debiasing competition, real online systems have even more bias problems. Taking position bias as an example, the click-through rate at high positions in logged data is naturally higher than at low positions, but part of this is due to differences in users' browsing habits between high and low positions, so modeling the data directly cannot properly characterize the CTR and quality of ads at different positions. In Meituan's advertising system we designed a position-combination prediction framework for this and achieved good results; details are omitted here, and can be found in the team's SIGIR paper Deep Position-wise Interaction Network for CTR Prediction [7].

4.3 Key understanding of modeling

Consistent assessment is the key to determining the model’s generalization ability

Under the competition mechanism, the Private Data used for the final evaluation usually differs from the Public Data used on the leaderboard; when the data switch, rankings sometimes shake by dozens of places, affecting the final result. Avoiding overfitting to the Public Data during regular iteration is therefore key to winning. So how do we construct a validation set consistent with the online distribution? From the perspective of consistency, validation sets covering the same time interval as the online data are generally constructed; when the data are noisy, multiple validation sets can be built with a dynamic sliding window. A consistent validation set determines the direction of subsequent iterations.

Big data focuses on model deepening, while small data focuses on model robustness

Different data sets call for different emphases. When data are sufficient, the core problem is deepening the model to solve complex issues such as feature crossing and combination. When data are small, noise is heavy and stability is poor, so the core problem is model robustness. Sensitivity to the data is the key to solution design.

Balancing bias and variance is the key guide to optimization in the later stages

From the perspective of error decomposition, the squared error can be decomposed into bias and variance [25]. In the early and middle stages, when model complexity is low, increasing complexity effectively reduces bias. In the later stage, when bias has already been highly optimized, reducing variance becomes the key, so ensemble methods are adopted to integrate results while keeping the complexity of single models constant.
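As a reminder, the standard decomposition behind this reasoning (not specific to any competition) is

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \sigma^2,$$

where σ² is the irreducible noise: ensembling reduces the variance term without needing to change the bias of any single model.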

The key to AutoML is the continual reduction of human priors

When an AutoML framework is applied, hidden human priors such as hyperparameters remain, and viewed from a model perspective, AutoML also suffers from the problem that higher complexity overfits more easily. A key question in iteration is therefore not only the evaluation score but whether the scheme still contains unnecessary hyperparameters and other prior information; continually stripping out such priors lets the AutoML modeling become ever more automated and adapt to more kinds of problems.

Finally, special thanks to Convolution Team, Nomo Team, Getmax Team, Aister Team and other teams.

Conclusion

Based on the author's experience of winning seven algorithm competitions, this paper shares experience across recommendation systems, time series, and automated machine learning, then introduces the AutoML technical framework through concrete problems, and finally summarizes general modeling methods from the competitions and their connection with industrial solutions. We hope the competition experience in this article helps algorithm enthusiasts participate better, provides useful ideas, and inspires more engineers and researchers to achieve better results in practical work. In the future, our team will continue to follow international algorithm competitions and actively combine competition ideas with industrial solutions. We also welcome everyone to join our team; the recruitment information is attached at the end of the article, and we look forward to your email.

Author’s brief introduction

Hu Ke, Xingyuan, Mingjian and Strong are all from the quality estimation team of Meituan advertising platform.

References

  • [1] Juan Y , Zhuang Y , Chin W S , et al. Field-aware Factorization Machines for CTR Prediction[C]// the 10th ACM Conference. ACM, 2016.
  • [2] He K , Zhang X , Ren S , et al. Identity Mappings in Deep Residual Networks[J]. Springer, Cham, 2016.
  • [3] Ali, Jehad & Khan, Rehanullah & Ahmad, Nasir & Maqsood, Imran. (2012). Random Forests and Decision Trees. International Journal of Computer Science Issues(IJCSI). 9.
  • [4] Robi Polikar. 2006. Ensemble based systems in decision making. IEEE Circuits and systems magazine 6, 3 (2006), 21–45.
  • [5] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. 2020. Bias and Debias in Recommender System: A Survey and Future Directions. arXiv preprint arXiv:2010.03240 (2020).
  • [6] H. Abdollahpouri and M. Mansoury, “Multi-sided Exposure bias in Recommendation,” ArXiv preprint arXiv: 2006.15772, 2020.
  • [7] Huang J, Hu K, Tang Q, Deep position-wise Interaction Network for CTR Prediction[J]. ArXiv Preprint arXiv:2106.05482, 2021.
  • [8] KDD Cup 2020 Debiasing competition champion technical scheme and its practice in Meituan.
  • [9] Luo Z, Huang J, Hu K, et al. AccuAir: Winning solution to air quality prediction for KDD Cup 2018[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019: 1842-1850.
  • [10] He Y, Lin J, Liu Z, et al. Amc: Automl for model compression and acceleration on mobile devices[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 784-800.
  • [11] Yang Gao, Hong Yang, Peng Zhang, Chuan Zhou, and Yue Hu. 2020. Graph neural architecture search. In IJCAI, Vol. 20. 1403–1409.
  • [12] Matheus Nunes and Gisele L Pappa. 2020. Neural Architecture Search in Graph Neural Networks. In Brazilian Conference on Intelligent Systems. Springer, 302-317.
  • [13] Huan Zhao, Lanning Wei, and Quanming Yao. 2020. Simplifying Architecture Search for Graph Neural Network. arXiv preprint arXiv:2008.11652 (2020).
  • [14] Jin Xu, Mingjian Chen, Jianqiang Huang, Xingyuan Tang, Ke Hu, Jian Li, Jia Cheng, Jun Lei: Winning Solution to AutoGraph Challenge for KDD Cup 2020, 2021; ArXiv: 2111.12952.
  • [15] Selsaas L R, Agrawal B, Rong C, et al. AFFM: auto feature engineering in field-aware factorization machines for predictive analytics[C]//2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 2015: 1705-1709.
  • [16] Yao Shu, Wei Wang, and Shaofeng Cai. 2019. Understanding Architectures Learnt by Cell-based Neural Architecture Search. In International Conference on Learning Representations.
  • [17] Kaicheng Yu, Rene Ranftl, and Mathieu Salzmann. 2020. How to Train Your Super-Net: A Heuristics of Training Heuristics in weight-sharing NAS. ArXiv Preprint arXiv:2003.04276 (2020).
  • [18] Haixun Wang, Wei Fan, Philip S Yu, and Jiawei Han. 2003. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 226 — 235.
  • [19] Robi Polikar. 2006. Ensemble based systems in decision making. IEEE Circuits and systems magazine 6, 3 (2006), 21–45.
  • [20] Chengshuai Zhao, Yang Qiu, Shuang Zhou, Shichao Liu, Wen Zhang, and Yanqing Niu. 2020. Graph embedding ensemble methods based on the heterogeneous network for lncRNA-miRNA interaction Prediction. BMC Genomics 21, 13 (2020), 1 — 12.
  • [21] Rosenfeld N , Meshi O , Tarlow D , et al. Learning Structured Models with the AUC Loss and Its Generalizations.
  • [22] Chen T , Tong H , Benesty M . xgboost: Extreme Gradient Boosting[J]. 2016.
  • [23] Qi, Yi, et al. “Trilateral Spatiotemporal Attention Network for User Behavior Modeling in Location-based Search”, CIKM 2021.
  • [24] The breakthrough and imagination of advertising depth prediction technology in the Meituan in-store scenario.
  • [25] Geurts P . Bias vs Variance Decomposition for Regression and Classification[J]. Springer US, 2005
  • [26] Kaggle Outbrain competition link: www.kaggle.com/c/outbrain-… .
  • [27] KDD Cup 2020 Debiasing competition link: tianchi.aliyun.com/competition… .
  • [28] KDD Cup 2018 competition link: www.biendata.xyz/competition… .
  • [29] KDD Cup 2017 competition link: tianchi.aliyun.com/competition… .
  • [30] KDD Cup 2020 AutoGraph competition link: www.automl.ai/competition…

Recruitment information

Rooted in advertising scenarios, the algorithm team of Meituan's in-store advertising platform explores cutting-edge technologies in deep learning, reinforcement learning, artificial intelligence, big data, knowledge graphs, NLP, and computer vision, and explores the value of local life-service e-commerce. The main work directions include:

  • Trigger strategies: user intent recognition, ad merchant data understanding, query rewriting, deep matching, relevance modeling.
  • Quality estimation: advertising quality modeling; estimation of click-through rate, conversion rate, average order value, and transaction volume.
  • Mechanism design: advertising sorting mechanism, bidding mechanism, bidding suggestions, traffic estimation, budget allocation.
  • Creative optimization: intelligent creative design; display optimization of ad images, text, group deals, promotional information, and other creatives.

Job Requirements:

  • At least 3 years relevant working experience in at least one aspect of CTR/CVR prediction, NLP, image understanding and mechanism design.
  • Familiar with common machine learning, deep learning, reinforcement learning models.
  • Excellent logical thinking skills, passion for solving challenging problems, data sensitivity, good analytical/problem solving skills.
  • Master degree or above in computer science, mathematics or related field.

The following requirements are preferred:

  • Advertising/search/recommendation related business experience.
  • Experience with large-scale machine learning.

If you are interested, please send your resume to [email protected] (please mark the email title: Guangping Algorithm Team).


This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit "Content reprinted from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.