By Luo Zhipeng
Company: Shenlan Beijing AI R&D Center
The KDD Cup 2019 AutoML Track, the fifth AutoML Challenge, was jointly organized by Fourth Paradigm, ChaLearn and Microsoft and focused on automated machine learning for temporal relational data.
With nearly 800 teams registered, it drew the largest field of any AutoML competition in recent years. The DeepBlueAI team from Shenlan Technology's AI R&D center in Beijing, all of whose members graduated from or studied at Peking University, won the title. The NUS-Xtra-Lab team from the National University of Singapore took second place, while the Admin team from Alibaba Group and Georgia Institute of Technology took third place. Teams from Tsinghua University, Nanjing University, Microsoft Research Asia, Hikvision and Meituan-Dianping also made the top 10.
This article shares the technical details of the team's solution; a link to the open-source code on GitHub is given at the end.
Background
ACM SIGKDD, ACM's Special Interest Group on Knowledge Discovery and Data Mining, runs the international academic conference with the highest standing in the field of data mining. The KDD Cup is a top international data mining competition hosted by SIGKDD, held every year since 1997, a history of 22 years.
As the most influential top-level international competition in data mining, the KDD Cup attracts leading experts, scholars and engineers from around the world every year, and is therefore also known as the "Big Data Olympics". The 25th ACM SIGKDD conference was held from August 4 to 8, 2019 in Alaska, USA.
Team performance
In the KDD Cup 2019 AutoML Track, the DeepBlueAI team ranked first in the Feedback phase, placing first on four of the five datasets and second on the remaining one. In the AutoML phase, the team won the championship by a clear margin, placing first on three datasets and leading the runner-up by 0.3 in the average ranking metric. The final standings are as follows:
Figure 1. KDD Cup 2019 AutoML Track final list and open source address
Team member award records
Problem description
Participants were asked to use temporal relational data to design an AutoML program capable of carrying out supervised learning on its own, without human intervention. The competition focuses on binary classification problems, and the temporal relational data comes from real business scenarios. Reflecting the time-dependent nature of most practical applications, each dataset is split into a training set and a test set in chronological order. Both the training and test sets consist of a main table, a set of related tables, and a relation graph:
- The main table contains the instances, each with a label for the binary classification task, some features, and timestamps;
- Related tables contain important auxiliary information about the instances in the main table and can be used to improve prediction. Fields in related tables may also carry timestamps, meaning that the information in those tables is time dependent;
- The relationships between the tables are described by a relation graph. Any two tables (main or related) can have at most one relationship, and the organizers guarantee that the relation graphs of the training and test sets are identical.
Participants are required to submit AutoML solutions that automatically build machine learning models from the main table, related tables and relation graph. Once trained, the model takes as input the test main table (without labels), the related tables and the relation graph, and predicts the labels of the test set. Submissions are evaluated under limited computing resources and time.
To help participants develop and evaluate their solutions, the organizers provided 10 temporal relational datasets: 5 public and 5 private.
Competition phases
Feedback phase
In the Feedback phase, participants develop their AutoML solutions on the five public datasets. They can make a limited number of submissions and receive the test performance on all five public datasets as feedback. Labeled training data and unlabeled test data can be downloaded, so contestants can also prepare their code offline and then submit it. The code submitted at the end of this phase is carried over for the blind test in the subsequent phases.
The Check phase
This is the verification phase. The code last submitted in the Feedback phase is blind-tested on the five private datasets to make sure it runs smoothly, with no timeouts or memory problems, for example. Contestants cannot see concrete results, but every team has a chance to update its code to ensure it runs correctly in the final phase.
AutoML phase
This is the blind-test phase, which evaluates the performance of the solutions on the private datasets. The participants' code completes training and prediction without any human intervention. AUC is the evaluation metric, and the final score is based on the average ranking over the five private datasets. In the event of a tie, the solution with better interpretability is given priority, with interpretability reviewed by a panel of experts.
Computation and memory resources are limited in all three phases, so a solution must balance effectiveness and efficiency.
Competition time
April 1, 2019: Competition begins, public data set released. Participants can start submitting code and get instant feedback on the leaderboards.
June 27, 2019: The Feedback phase ends and the Feedback-phase code is automatically migrated to the Check phase.
July 7, 2019: The Check phase ends and the sponsor begins code verification.
July 11, 2019: Deadline for submission of reports.
July 16, 2019: AutoML phase ends and the review process begins.
July 20, 2019: KDD Cup champions announced.
August 4, 2019: Awards ceremony will be held at KDD.
Evaluation indicators
AUC is used as the evaluation metric on each dataset, and the final score is the average ranking over the five private datasets.
Problem characteristics
This competition presented the following main difficulties:
1. Mining effective features
Unlike traditional data mining competitions, in an AutoML competition the contestants only know the type of each field (numerical, categorical, temporal, multi-value categorical, etc.) and not its meaning. This undoubtedly makes feature engineering harder, and mining effective, general features becomes a key difficulty.
2. The data is temporal
Mining temporal information is hard. In traditional machine learning applications, experienced experts are needed to extract effective temporal information from temporal relational data and use it to improve model performance. Even with deep domain knowledge, experts must build valuable temporal features through trial and error and exploit multiple related tables to boost the performance of machine learning models.
3. The data spans multiple tables
Because the data is spread across multiple tables, contestants must build an automated machine learning system that can handle the various joins between them. Multi-table data also raises the bar for system stability: one careless merge can produce an oversized intermediate table and cause a timeout or memory overflow, leaving no final result at all.
4. Time and memory are strictly limited
The competition code runs in a Docker environment with a 4-core CPU and 16 GB of memory. For datasets of unknown size, some operations can easily push the memory peak above 16 GB and crash the program, so certain operations have to be strictly optimized or replaced with sampling. In addition, the organizers strictly limit the execution time for each dataset; exceeding it likewise results in failure.
The solution
Based on the given data, our team implemented an AutoML framework that supports multiple tables, covering automatic multi-table merging, automatic feature engineering, automatic feature selection, automatic model tuning and automatic model fusion. We also did a great deal of optimization work on time and memory control.
Data preprocessing
Based on an analysis of the table structures and their attributes, we apply different preprocessing schemes to different data types. First, among the join keys shared by the tables, the key with the largest number of unique values in the main table is identified as the user column. Based on the identified user column, we then try to identify session columns among the categorical columns of the main table.
In addition, we try to identify binary columns, i.e. categorical columns with only two valid values. We re-encode the category, user, session and key columns, try to convert numerical columns to types that occupy less memory, and convert time columns to the datetime type, which is easier to manipulate.
# recognize the user and session columns
from joblib import Parallel, delayed  # CONSTANT.JOBS is a project-level constant for the number of parallel jobs

def recognize_user_col(self, data, key_cols):
    # the key with the most unique values in the main table is treated as the user column
    user_col = None
    nunique = -1
    for col in key_cols:
        nnum = data[col].nunique()
        if nnum > nunique:
            user_col = col
            nunique = nnum
    return user_col

def recognize_session_col(self, data, cat_cols, user_col):
    if user_col is None:
        return []
    user_nunique = data[user_col].nunique()
    session_cols = []

    def func(df, user_nunique):
        cat_col = df.columns[0]
        user_col = df.columns[1]
        cat_nunique = df[cat_col].nunique()
        # a session column should have more distinct values than the user column,
        # but not be (almost) unique per row
        if (cat_nunique <= user_nunique) or (cat_nunique >= df.shape[0] - 10):
            return False
        # a session should (almost) never be shared by more than one user
        if (df.groupby(cat_col)[user_col].nunique() > 1).sum() > 10:
            return False
        return True

    res = Parallel(n_jobs=CONSTANT.JOBS, require='sharedmem')(
        delayed(func)(data[[col, user_col]], user_nunique) for col in cat_cols)
    for col, is_session in zip(cat_cols, res):
        if is_session:
            session_cols.append(col)
    return session_cols
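The type conversions described above can be sketched roughly as follows. This is a minimal illustration assuming pandas DataFrames; the function name and column lists are ours, not the team's actual code.

import pandas as pd

def compress_and_parse(df, num_cols, time_cols):
    # downcast numeric columns to smaller dtypes to reduce the memory footprint
    for col in num_cols:
        df[col] = pd.to_numeric(df[col], downcast='float')
    # parse time columns into datetime, which is easier to bucket, diff and compare later
    for col in time_cols:
        df[col] = pd.to_datetime(df[col])
    return df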
Multiple table joins
The data structure given in the competition is shown in the figure above. The connections between tables fall into four types: 1-1, 1-M, M-1 and M-M. Because of the time and memory constraints, we need to keep as much information as possible while preventing the merged tables from growing too large, and the way multi-table joins are handled directly affects all subsequent results, so we treat each connection type differently.
First, the four connection types are grouped into two kinds: type 1 contains 1-1 and M-1, and type 2 contains 1-M and M-M. For type 1, we can merge the secondary-table data into the main table directly by key. For type 2, we first perform aggregation operations on the secondary table; the aggregated result can then be regarded as having a type-1 relationship with the main table.
Next, we simply apply the type-1 operation to the aggregated result and merge it directly into the main table. In addition, when both the main table and the secondary table carry timestamps, for each main-table row we merge in the secondary-table row with the same key whose timestamp is the latest one not later than the main-table timestamp, so as to preserve as much information as possible. A minimal sketch of these join strategies is given below.
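The following sketch illustrates the three join strategies with pandas; the function names, aggregation choices and arguments are illustrative assumptions rather than the team's actual implementation.

import pandas as pd

def join_type1(main_df, side_df, key):
    # 1-1 / M-1: merge the side table into the main table directly by key
    return main_df.merge(side_df, on=key, how='left')

def join_type2(main_df, side_df, key, num_cols):
    # 1-M / M-M: aggregate the side table per key first, then merge as type 1
    agg = side_df.groupby(key)[num_cols].agg(['mean', 'max', 'min'])
    agg.columns = ['_'.join(col) for col in agg.columns]
    return main_df.merge(agg.reset_index(), on=key, how='left')

def join_temporal(main_df, side_df, key, time_col):
    # when both tables carry timestamps, attach to each main-table row the
    # side-table row with the same key that is earlier and closest in time
    main_sorted = main_df.sort_values(time_col)
    side_sorted = side_df.sort_values(time_col)
    return pd.merge_asof(main_sorted, side_sorted, on=time_col, by=key,
                         direction='backward', suffixes=('', '_side'))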
Sampling
Since the size of the datasets provided by the competition is unknown in advance, we have to judge, before any processing, whether the environment can support running feature engineering and model training on the entire dataset. Concretely, before reading the data into preprocessing we check that the total number of samples in the training and test sets does not exceed an acceptable threshold; if it does, we downsample. Under the given 16 GB of memory, we estimate that about 4 million rows is a good threshold.
In addition, sampling is also used in the feature combination module of feature engineering. Combination features are numerous, take a long time to generate, have a high memory peak, and only a small fraction of them turn out to be useful. Therefore, to avoid memory overflow, we first run feature engineering on a small sample of the data, keep only the combination features that prove effective, and then generate just those features on the full dataset. This reduces the running time of the system and avoids the risk of a memory blow-up. A rough sketch of the sampling decision follows.
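The snippet below sketches the sampling decision under the assumptions stated above (16 GB of memory, a threshold of roughly 4 million rows); the names and the concrete sampling strategy are illustrative.

MAX_SAMPLE_NUM = 4_000_000  # empirical threshold under the 16 GB memory budget

def maybe_downsample(train_df, test_df, max_num=MAX_SAMPLE_NUM):
    # if train + test together exceed the threshold, downsample the training set
    total = len(train_df) + len(test_df)
    if total <= max_num:
        return train_df
    keep_num = max(max_num - len(test_df), 1)
    # random sampling here; keeping the most recent rows is another natural
    # choice for temporal data
    return train_df.sample(n=keep_num, random_state=2019)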
Automatic feature engineering
def main_init(self):
    self.order1s = [
        PreMcToNumpy, McCatRank,
        OriginSession, ApartCatRecognize,
        KeysCountDIY, UserKeyCntDIY, SessionKeyCntDIY,
        KeysTimeDiffAndFuture,
        KeysNuniqueDIY, KeysCntDivNuniqueDIY,
        KeysCumCntRateAndReverse, UserKeyCumCntRateAndReverse,
        KeyTimeDate, KeyTimeBin, KeysBinCntDIY,
        CatCountDIY,
        LGBFeatureSelection,
    ]
    self.keys_order2s = [
        KeysNumMeanOrder2MinusSelfNew,
        KeysNumMaxMinOrder2MinusSelfNew,
        KeysNumStd,
        KeysCatCntOrder2New,
        LGBFeatureSelectionWait,
    ]
    self.all_order2s = [
        BinsCatCntOrder2DIYNew,
        BinsNumMeanOrder2DIYNew,
        CatNumMeanOrder2DIYNew,
        CatCntOrder2DIYNew,
        LGBFeatureSelectionWait,
    ]
    self.post_order1s = [
        TimeNum,
    ]
    self.merge_order1s = [
        CatSegCtrOrigin,
        CatMeanEncoding,
        LGBFeatureSelectionLast,
    ]
Feature engineering is often the core of a data mining competition, and it was also the main source of our team's advantage in this one. We use a LightGBM model to verify the effect of each feature, and we divide feature engineering into several modules.
The first module is the basic features, mainly statistical features built around users, keys and sessions. It produces a small number of new features, but they work well, so it comes first.
The second module is first-order combination features. We try to combine the more important user, key and session columns in the main table with selected numerical or categorical columns, which works very well on some datasets.
The third module is large-scale combination features. We bucket the time column and try to combine categorical and numerical columns with the time buckets. In addition, depending on the size of each dataset, a certain number of categorical or numerical columns are selected and combined into new features, so that useful features can be mined while keeping the risk of memory overflow as low as possible. A rough example of a time-bucket combination feature is sketched below.
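The snippet below illustrates one possible time-bucket combination feature: the count of each (time bucket, category) pair mapped back onto every row. The bucket size and column names are assumptions, not the team's exact operators.

import pandas as pd

def time_bucket_cat_count(df, time_col, cat_col, freq='1H'):
    # floor timestamps into buckets, pair each bucket with the category value,
    # and count how often each pair occurs
    bucket = df[time_col].dt.floor(freq)
    combo = bucket.astype(str) + '_' + df[cat_col].astype(str)
    return combo.map(combo.value_counts()).astype('int32')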
The last module consists of the supervised CTR and mean (target) encoding features. They are placed last for two reasons: on the one hand, this stage generates many features, which can easily lead to a memory blow-up; on the other hand, we believe that combining these features with others has no practical meaning, so they come last and do not participate in further combinations.
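Below is a minimal sketch of out-of-fold mean (target) encoding, one common way to build such supervised features; the fold scheme and names are our assumptions, and the team's CatSegCtrOrigin / CatMeanEncoding operators may differ in detail.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def mean_encoding(train, test, cat_col, target_col, n_splits=5):
    # encode each category by the out-of-fold mean of the target
    train, test = train.copy(), test.copy()
    new_col = f'{cat_col}_mean_enc'
    global_mean = train[target_col].mean()
    train[new_col] = np.nan
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=2019).split(train):
        fold_means = train.iloc[tr_idx].groupby(cat_col)[target_col].mean()
        train.loc[train.index[val_idx], new_col] = train.iloc[val_idx][cat_col].map(fold_means)
    train[new_col] = train[new_col].fillna(global_mean)
    # the test set is encoded with statistics from the full training set
    test[new_col] = test[cat_col].map(train.groupby(cat_col)[target_col].mean()).fillna(global_mean)
    return train, test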
At the same time, because time and memory are strictly controlled in this competition, with millions of rows almost every feature has to be generated within a few seconds, and our code contains many optimizations to meet this requirement. For example, computing the position of a categorical value inside a multi-value categorical field takes several hours with a plain pandas implementation; adding multithreading improves the situation, but it is still far too slow.
As a fallback we implemented the feature with NumPy, which brought the generation time down to the level of tens of seconds, but that still did not meet our requirements. In the end, we rewrote and tuned the code with Cython so that the feature is generated in a matter of seconds.
▲ Figure 4. More important features after screening
Automatic feature selection
In the automatic feature engineering stage, feature engineering is divided into several modules, and at the end of each module we run a feature selection step that screens out the ineffective features generated in that module, which avoids memory overflow and speeds up the final model training.
By combining feature importance with sequential backward selection, we set a threshold that filters low-importance features out of model training while minimizing the loss of accuracy. We also tried selection based on information gain, as well as combining the two methods, but could not find a better cut-off point, so in the end we used the feature-importance-based method. A minimal sketch of this idea is given after the figure below.
▲ Figure 5. Feature selection mechanism
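The sketch below shows importance-based selection with LightGBM's scikit-learn interface; the threshold and hyper-parameters are illustrative, and the team's actual selection operators (e.g. LGBFeatureSelection in the pipeline above) are more elaborate.

import lightgbm as lgb

def select_by_importance(X, y, threshold=5):
    # fit a quick LightGBM model and keep features whose importance exceeds the threshold
    model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1,
                               num_leaves=31, n_jobs=4)
    model.fit(X, y)
    return [col for col, imp in zip(X.columns, model.feature_importances_)
            if imp >= threshold]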
Class imbalance problem handling
During training we explicitly handle class-imbalanced data. When the ratio of positive to negative samples exceeds 1:3, we undersample the negative class to mitigate the imbalance. We also tried increasing the weight of positive samples. In the model fusion part, each model keeps the (smaller) set of original positive samples but is trained on a different batch of negative samples, which preserves as much of the original information as possible while alleviating the imbalance. A rough sketch of the undersampling step is given below.
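This sketch caps negatives at three times the positives, per the ratio mentioned above; the function name and shuffling details are assumptions. In the fusion stage, each model would draw a different batch of negatives.

import pandas as pd

def undersample(train_df, target_col, max_neg_ratio=3, seed=2019):
    # keep all positives; cap negatives at max_neg_ratio times the number of positives
    pos = train_df[train_df[target_col] == 1]
    neg = train_df[train_df[target_col] == 0]
    if len(neg) > max_neg_ratio * len(pos):
        neg = neg.sample(n=max_neg_ratio * len(pos), random_state=seed)
    # recombine and shuffle
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)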
Model fusion
Because of the strict time and memory limits of the competition environment, we considered bagging, blending and stacking for model fusion, and finally chose bagging. We run a small demo that simulates training and prediction on the real dataset in order to estimate the time the full dataset will need. If time is short, early stopping is used during training, trading some accuracy to make sure the code finishes within the time limit.
If time is sufficient, the number of models allowed in the fusion is computed from the remaining time. To make sure the code passes, we estimate conservatively: after computing how many models can be fused, we use one fewer.
# X, model, len_test, model_use_time (training time per model) and config come from the surrounding code
start_time = time.time()
temp = X.iloc[:100000]
temp = temp.astype(np.float32)
gc.collect()
temp = temp.values
gc.collect()
model.predict(temp)
end_time = time.time()
# time to predict on 100,000 rows
model_test_use_time = end_time - start_time
# extrapolate to the full test set (len_test rows)
model_test_use_time = len_test / temp.shape[0] * model_test_use_time
# total time per model = training time + prediction time
model_use_time = model_use_time + model_test_use_time
del temp, model
# conservatively use 90% of the time budget to compute the remaining free time
rest_time = config.budget / 10 * 9 - (end_time - config.start_time)
if rest_time <= 0:
    rest_model_num = 0
else:
    rest_model_num = int(rest_time / model_use_time)
if rest_model_num >= 50:
    rest_model_num = 50
if rest_model_num >= 1:
    # fuse one model fewer than the estimate, to be safe
    rest_model_num -= 1
Runtime optimization
Our time control runs through every stage of the pipeline.
In automatic data processing and feature engineering, we use Cython to speed up encoding and the features that are slow to generate. Take one feature as an example: given two columns, one categorical and one multi-value categorical, whose item sets are known to intersect, we need to compute the position of each categorical value within the item set of the corresponding multi-value categorical entry.
For example, given the row [2137, (134, 2137, 576, 816)], the value 2137 appears in the second position of the multi-value set, so the feature value for this row is 2; if the value does not appear at all, the feature is 0. Implementing this feature with pandas' apply interface takes roughly several hours at the data sizes of this competition.
Given the poor performance of row-wise DataFrame traversal and the overhead of pandas' generalized interfaces, we implemented the traversal with NumPy instead, which brought feature generation down to the minute level. Because time and memory are strictly controlled in this competition, a feature that takes more than ten seconds to generate already counts as expensive.
We then turned to Cython, with its ahead-of-time compilation, static typing and so on, and managed to get this feature generated in under ten seconds. There are some details worth noting in the process: for example, if you keep using Python native types inside Cython, traversal is still slow, yet it is hard to store multi-value categorical data without Python's native container types.
Since generating this feature mainly involves traversing the multi-value categorical column, we store all of its items in one flat array and use an extra array to record the length of each item set. We can then traverse efficiently using just the length array and the data array.
In our tests, the pure Python code merely compiled with Cython still took about 60 seconds; after the flat-array optimization it easily runs within 10 seconds (measured on our own machine; the online environment takes somewhat longer). A NumPy-level sketch of the flat-array idea is given below.
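The sketch below expresses the flat-array idea at the NumPy level; the actual implementation compiles an equivalent statically-typed loop with Cython, and the function and variable names here are ours.

import numpy as np

def cat_rank_in_multicat(cat_values, mc_values, mc_lengths):
    # for each row, return the 1-based position of the categorical value inside
    # the row's multi-value item set, or 0 if it does not appear
    # cat_values : encoded categorical value per row
    # mc_values  : all multi-value items, rows concatenated into one flat array
    # mc_lengths : number of items belonging to each row
    result = np.zeros(cat_values.shape[0], dtype=np.int32)
    offset = 0
    for i in range(cat_values.shape[0]):
        length = mc_lengths[i]
        items = mc_values[offset:offset + length]
        pos = np.where(items == cat_values[i])[0]
        if pos.size:
            result[i] = pos[0] + 1
        offset += length
    return result

# e.g. cat_rank_in_multicat(np.array([2137]), np.array([134, 2137, 576, 816]), np.array([4]))
# -> array([2], dtype=int32)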
In the model fusion part, we make estimates in advance and record the current time. By training the model for a few rounds, we measure the model start-up time and the time consumed per round of training; from these two numbers we can predict the time needed for the subsequent parameter tuning and model training, and thus determine the final number of models to fuse.
▲ Time optimization before and after comparison
Memory optimization
For memory control, we first implemented a memory monitor. We run one complete round of the system, record the memory usage, and analyze the memory peaks on different datasets. The peaks tend to occur in a few typical places, for example when tables are merged, when model training starts, and when certain features are generated. A minimal sketch of such a monitor is shown below.
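This is a minimal sketch of a process-memory monitor, assuming the psutil package is available in the environment; the team's actual monitor may record more detail.

import psutil

def used_memory_gb():
    # resident memory of the current process, in GB
    return psutil.Process().memory_info().rss / 1024 ** 3

def log_memory(tag):
    # called around table merging, feature generation and model training
    print(f'[memory] {tag}: {used_memory_gb():.2f} GB')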
Merging data with pandas' interface, e.g. pd.concat, temporarily uses roughly twice the memory of the current data. This is expected, because the merge is not in place: a third block of memory is created for the result. We therefore changed the merging step to column-by-column assignment, which produces almost no memory spikes, although it is slower. So in the early stages of the system, while memory is still relatively loose, we keep using pandas' interface to merge data; a sketch of the column-assignment variant follows.
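A sketch of the column-assignment merge, assuming the two frames share the same row order; it avoids the temporary copy that pd.concat along the column axis would create.

def merge_by_column_assignment(main_df, feat_df):
    # append feat_df's columns to main_df one by one, in place,
    # instead of pd.concat([main_df, feat_df], axis=1)
    for col in feat_df.columns:
        main_df[col] = feat_df[col].values
    return main_df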
In addition, we also estimate memory usage in advance during training and prediction. In the final feature selection step, we compute and simulate how much data can be generated while still completing the rest of the pipeline, and use that to bound the number of features that are finally kept. Likewise, before the last feature selection, when generating features we first run a simulation on a small sample to measure memory usage throughout the process and to limit the number of features generated.
Finally, we do some careful memory management and free a variable's memory as soon as its lifetime ends. Below is a before-and-after comparison of our memory optimization on the five datasets given in the competition.
▲ Figure 6. Comparison before and after memory optimization
System testing
To verify the stability and robustness of the system, we constructed and tested a wide range of cases:
- Missing values for different data types;
- Columns of different data types whose values are all null;
- Table structures that are either very simple or very complex;
- For the 5 datasets of this competition, we expanded each dataset by factors of 2, 3 and 6, and the system still ran within the specified time and memory;
- We doubled the number of fields in the five datasets, and the system still ran correctly;
- We constructed specific datasets to check whether some combination of characteristics could crash the system; in the end, dozens of such extreme datasets all ran successfully;
- We reduced the allowed running time of the datasets to see whether the system could adapt its own running time. For dataset A, whose time budget was relatively tight, and dataset B, which has a large amount of data, we cut the allowed running time to 1/2, 1/3 and even 1/4 of the original, and the system still ran smoothly in every case.
Conclusion
The KDD Cup AutoML track further improved the competition format. Compared with the NeurIPS 2018 AutoML competition, the number of submissions in the AutoML phase was increased so that participants could validate against the blind-test data as much as possible; compared with the PAKDD 2019 AutoML competition, the scoring mechanism was improved so that the final score depends only on the highest score of each task. The improved mechanism gives participants a better competition experience and more room to show their techniques, thanks to the organizers' hard work.
In this competition, our work revolved around the challenges described above and covers several important processes: automatic multi-table data processing, automatic multi-table joining, automatic feature engineering, and automatic model building, selection and fusion. To meet the time and memory requirements, we also made many optimizations in the code, such as multithreading, Cython, preprocessing and advance estimation. In the end our solution performed very well, leading by a relatively large margin on multiple task sets on both the A and B leaderboards.
Temporal relational data is very common in online advertising, recommendation systems, financial market analysis and other application scenarios. AutoML's focus on temporal relational data poses a challenge for participants and offers new ideas for the development of AutoML. In recent years, thanks to its adaptability, efficiency, scalability and availability, AutoML has greatly lowered the threshold for industrial applications, proved useful in many different scenarios, and considerably shortened project development cycles. Finally, congratulations to all the top teams, and we wish everyone satisfying results in the future!
The authors introduce
Luo Zhipeng, head of the Shenlan AI R&D Center in Beijing, is a machine learning scientist at Shenlan Technology.
He holds a master's degree from Peking University and has won championships in competitions at top conferences such as PAKDD, KDD, NeurIPS, CIKM, CVPR and SIGIR. He has co-authored a KDD oral paper and a WWW paper, and has many years of hands-on experience in machine learning.
Open source link
https://github.com/DeepBlueAI/AutoSmart