**Author: Wang He**
This article presents the complete solution and the complete code of the runner-up team in the 2019 JDATA competition. Learning directly from the code is about as “hardcore” as it gets.
Here’s the code:
Github.com/anzhizh/201…
Background of the problem
Jdata.jd.com/html/detail…
JD Retail Group adheres to the business philosophy of “creating value based on trust, with the customer at the center”, providing the most suitable products and services to more than 300 million active users at the right place and time across different consumption scenarios and connected terminals. Currently, more than 210,000 merchants have signed contracts on JD Retail Group's third-party platform, covering all categories. To maintain a prosperous, diversified and orderly merchant ecosystem and fully meet consumers’ one-stop shopping needs, more accurate analysis and prediction of users’ purchasing behavior is needed. To this end, the competition provides data on users, merchants, products and other aspects, including content and review information for merchants and products as well as the rich interactions between users and them. Teams need to use data mining techniques and machine learning algorithms to build a model that predicts which categories users will purchase from which merchants, output the matching users, stores and categories, and provide high-quality target groups for precision marketing. It is also hoped that participating teams will use the competition to explore the meaning behind the data and provide win-win intelligent solutions for the merchants and users of the e-commerce platform.
Evaluation metric
The results file submitted by each entrant contains the predicted purchase intentions of all users. The prediction for each user has two parts:
(1) Whether the user purchases the category between April 16, 2018 and April 22, 2018. The submitted results file only includes the users and categories predicted to place orders (users and categories predicted not to place orders need not appear). During evaluation, duplicate “user-category” pairs in the submission are removed. If the prediction is correct, the evaluation algorithm sets label=1; otherwise it sets label=0.
(2) If the user purchases the category, the store in which the purchase is made must also be predicted. If the store prediction is correct, the evaluation algorithm sets pred=1; otherwise it sets pred=0.
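For the results submitted by participants, the score is calculated according to the following formula:

$$\text{Score} = 0.4\,F_{11} + 0.6\,F_{12}$$

Here the F1 values are defined as weighted F values, with β^2 = 0.5 for F11 and β^2 = 1.5 for F12 (the values discussed in the model fusion section):

$$F_{11} = \frac{3 \cdot \text{Precise} \cdot \text{Recall}}{2 \cdot \text{Recall} + \text{Precise}}, \qquad F_{12} = \frac{5 \cdot \text{Precise} \cdot \text{Recall}}{2 \cdot \text{Recall} + 3 \cdot \text{Precise}}$$

Precise is the precision and Recall is the recall. F11 is the F1 value computed on label (1 or 0), and F12 is the F1 value computed on pred (1 or 0).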
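As a rough illustration of how this metric behaves, the sketch below computes the score from predicted and true (user, category, store) triples. It assumes the weighted F form with β^2 = 0.5 for F11 and β^2 = 1.5 for F12 mentioned in the model fusion section, and it assumes precision and recall are taken over de-duplicated key sets. The function names and toy data are hypothetical, and this is not the official evaluation script.

```python
def f_beta(precision, recall, beta_sq):
    """Weighted F value: (1 + b^2) * P * R / (b^2 * P + R)."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom > 0 else 0.0


def precision_recall(predicted, actual):
    """Precision and recall of a predicted key set against the true key set."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


def competition_score(pred_triples, true_triples):
    """Return (score, f11, f12) with score = 0.4 * F11 + 0.6 * F12."""
    # F11 is evaluated on de-duplicated (user, cate) pairs (assumption).
    pred_pairs = {(u, c) for u, c, _ in pred_triples}
    true_pairs = {(u, c) for u, c, _ in true_triples}
    f11 = f_beta(*precision_recall(pred_pairs, true_pairs), beta_sq=0.5)
    # F12 additionally requires the store to match (assumption).
    f12 = f_beta(*precision_recall(set(pred_triples), set(true_triples)), beta_sq=1.5)
    return 0.4 * f11 + 0.6 * f12, f11, f12


# Toy data: one fully correct triple, one with the wrong store, one wrong pair.
truth = [(1, "A", 10), (2, "B", 20), (3, "C", 30)]
preds = [(1, "A", 10), (2, "B", 99), (4, "D", 40)]
score, f11, f12 = competition_score(preds, truth)
print(round(score, 4), round(f11, 4), round(f12, 4))
```

Because F12 carries the larger weight, a wrong store costs more than the 0.4 weight on F11 alone would suggest.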
Problem definition
Question 1: “Predict whether the user purchases the category between April 16, 2018 and April 22, 2018.”
Question 2: “Predict which store the user buys this category from.”
We define question 1 as a binary classification problem: predict whether a purchase occurs between April 16, 2018 and April 22, 2018 for an F11ID, which is the combination of user ID and category ID.
We define question 2 as a binary classification problem: predict whether a purchase occurs between April 16, 2018 and April 22, 2018 for an F12ID, which is the combination of user ID, category ID and store ID.
Solution: We build separate models for question 1 and question 2. Because the two questions are strongly correlated, the feature engineering ideas are essentially the same for both. Finally, the two sets of results are fused to obtain the final prediction.
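The two composite keys can be sketched as follows. This is a minimal illustration that assumes each behavior record references a product by a `sku_id`, and that the product table supplies `cate` and `shop_id`; the record layouts and the function name are hypothetical simplifications, not the official schema.

```python
from collections import namedtuple

# Hypothetical, simplified record layouts (assumptions for illustration).
Action = namedtuple("Action", "user_id sku_id")
Product = namedtuple("Product", "sku_id cate shop_id")


def build_keys(actions, products):
    """Derive the F11ID and F12ID sets from behavior and product records."""
    sku_info = {p.sku_id: p for p in products}
    f11_ids, f12_ids = set(), set()
    for a in actions:
        info = sku_info.get(a.sku_id)
        if info is None:
            continue  # record that cannot be associated with a product
        f11_ids.add((a.user_id, info.cate))                # user + category
        f12_ids.add((a.user_id, info.cate, info.shop_id))  # user + category + store
    return f11_ids, f12_ids


actions = [Action(1, 100), Action(1, 101), Action(2, 999)]
products = [Product(100, "A", 10), Product(101, "A", 11)]
f11_ids, f12_ids = build_keys(actions, products)
print(f11_ids, f12_ids)
```

Each F11ID can map to several F12IDs, which is why the two models can later be joined on the F11ID part of the key.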
Exploratory data analysis
Some important findings from the EDA:
The cate field needed for prediction is in the product table. The shop table also has a cate field (shop_cate), but it is a different feature that merely shares its name with cate in the product table, and it cannot be used for the F11 prediction.
In the shop table, only vender_id==3666 (merchant ID) is associated with multiple shop IDs. Further exploration shows that none of the behavior records associated with these shop IDs (about 40% of all data) is a purchase record.
Records with type=5 in the user behavior table (add to shopping cart) only appear from April 8 onward.
Some records cannot be associated across tables and need to be deleted.
Handling abnormal data:
1. The data distribution around the Spring Festival is abnormal and is likely to differ from the distribution of the test set. Therefore, data before February 22 should be avoided when constructing the training set.
2. Avoid using data from around the Spring Festival to construct time-dependent features (such as behavior statistics over the previous N days). Cumulative features need no special handling.
3. As for the abnormal sampling on March 27 and 28, the distribution of purchasing behavior was basically consistent, so we only avoided those days when constructing some of the time-dependent features and did not treat the rest specially.
Training set construction
The F11IDs and F12IDs that appear in the behavior table from 2018-03-10 to 2018-04-11 form the F11IDs and F12IDs of the training set.
The F11IDs and F12IDs that appear in the behavior table from 2018-04-11 to 2018-04-16 and also appear in the training set are used as the label set.
Features for the F11IDs and F12IDs in the training set are extracted from 2018-02-01 to 2018-04-11 (all data in this range).
Test set construction
The F11IDs and F12IDs that appear in the behavior table from 2018-03-15 to 2018-04-16 form the F11IDs and F12IDs of the test set.
Features for the F11IDs and F12IDs in the test set are extracted from 2018-02-01 to 2018-04-16 (all data in this range).
In addition, another similar model was developed to make good use of the add-to-cart records that exist only in the last eight days; it is described later.
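The window-based construction described in this section can be sketched roughly as below. This is an illustration only: it treats every interval as half-open (start inclusive, end exclusive), uses only the F11ID key for brevity, and assumes that a type code of 2 marks a purchase. None of these three choices is stated in the article, and the function name and record layout are hypothetical.

```python
from datetime import date

PURCHASE = 2  # assumed type code for a purchase; the article only states that 5 is add-to-cart


def build_window(records, feat_start, cand_start, split, label_end=None):
    """Split behavior records into feature rows, candidate keys and labels.

    records: iterable of (user_id, cate, shop_id, type, day) tuples.
    Intervals are half-open: [start, end).
    """
    feature_rows = [r for r in records if feat_start <= r[4] < split]
    candidates = {(r[0], r[1]) for r in records if cand_start <= r[4] < split}
    labels = None
    if label_end is not None:  # the test window has no labels
        bought = {(r[0], r[1]) for r in records
                  if split <= r[4] < label_end and r[3] == PURCHASE}
        # Only keys that already appear in the candidate set receive a label.
        labels = {key: int(key in bought) for key in candidates}
    return feature_rows, candidates, labels


records = [
    (1, "A", 10, 1, date(2018, 3, 20)),
    (1, "A", 10, 2, date(2018, 4, 12)),
    (2, "B", 20, 1, date(2018, 4, 1)),
    (3, "C", 30, 2, date(2018, 4, 13)),  # not seen in the training candidate window
]
train = build_window(records, date(2018, 2, 1), date(2018, 3, 10),
                     date(2018, 4, 11), date(2018, 4, 16))
test = build_window(records, date(2018, 2, 1), date(2018, 3, 15), date(2018, 4, 16))
print(train[2], test[1])
```

Keys that purchase in the label window but never appear in the candidate window are dropped, which mirrors the "also appear in the training set" condition above.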
Basic idea of feature engineering
Starting from basic statistical features, we use the users’ behavior records over three time scales (long, medium and short term) to build representations of users, categories, shops, F11IDs (user and category) and F12IDs (user, category and store), and from these derive time-dependent behavior interest features. In addition, the last eight days of type==5 data were used to build extra features and a new model around type==5.
Following these ideas, our features fall into three groups:
Basic statistical features
Time-dependent behavior interest features
Time-dependent behavior interest features (supplement for user_id)
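A minimal sketch of the long, medium and short term idea is shown below: the same count is computed over several windows ending at the split date, for whichever entity a key function selects. The window lengths (30, 7 and 1 days), the record layout and the function name are assumptions for illustration; the article does not list the exact windows or statistics used.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical window lengths in days; the article does not give the exact values.
WINDOWS = {"short": 1, "medium": 7, "long": 30}


def window_counts(records, end_day, key_fn):
    """Count actions per key in each window ending just before end_day.

    records: iterable of (user_id, cate, shop_id, type, day) tuples.
    key_fn: maps a record to the entity being described (user, F11ID, ...).
    """
    features = {}
    for name, days in WINDOWS.items():
        start = end_day - timedelta(days=days)
        counts = Counter(key_fn(r) for r in records if start <= r[4] < end_day)
        for key, n in counts.items():
            features.setdefault(key, {w: 0 for w in WINDOWS})[name] = n
    return features


records = [
    (1, "A", 10, 1, date(2018, 4, 10)),
    (1, "A", 10, 1, date(2018, 4, 5)),
    (1, "A", 10, 1, date(2018, 3, 20)),
    (2, "B", 20, 1, date(2018, 4, 10)),
]
f11_features = window_counts(records, date(2018, 4, 11), lambda r: (r[0], r[1]))
user_features = window_counts(records, date(2018, 4, 11), lambda r: r[0])
print(f11_features)
```

Swapping the key function gives the same statistics at the user, category, shop, F11ID or F12ID level without changing the counting logic.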
Exploration of shopping cart behavior
As mentioned in the EDA section, records with type==5 (add to shopping cart) only appear from midnight on April 8 onward, that is, only in the last eight days.
EDA showed that the distribution of type==5 over these eight days was relatively stable, which is probably the result of manual sampling.
As a result, the training set (3.10-4.09) contained only one day of shopping cart records, while the test set (3.15-4.16) contained eight days, so the two data sets had seriously different distributions. Therefore, at the start of the competition, all shopping cart records were temporarily removed from feature engineering.
Building shopping cart behavior features
The first idea was to keep the original model, which uses seven days as the label window, and to process the shopping cart records so that the distributions stay consistent.
Specifically, only the type==5 records from April 8 are kept in the training set;
only the type==5 records from April 15 are kept in the test set.
For the shopping cart records on this day (which is also the last day of each data set), the following features are extracted:
1. The number of add-to-cart actions in each of the 5 dimensions;
2. The conversion rate of add-to-cart actions in each of the 5 dimensions (the ratio to the number of other actions);
3. The number of add-to-cart actions that were not followed by a purchase;
4. The number of distinct products that were added to the cart and not purchased.
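The four feature types listed above might be computed along the following lines. This sketch assumes the five dimensions are user, category, shop, F11ID and F12ID (selected here through a key function), that "not purchased" means the same user never bought the same product anywhere in the records passed in, and that type code 2 marks a purchase. The record layout and names are hypothetical.

```python
from collections import namedtuple
from datetime import date

Rec = namedtuple("Rec", "user_id sku_id cate type day")  # hypothetical layout
CART, PURCHASE = 5, 2  # 5 is from the article; 2 for purchase is an assumption


def cart_features(records, cart_day, key_fn):
    """Cart features per key, using add-to-cart records from cart_day only."""
    feats = {}
    purchased = {(r.user_id, r.sku_id) for r in records if r.type == PURCHASE}
    for r in records:
        f = feats.setdefault(
            key_fn(r), {"cart": 0, "other": 0, "cart_no_buy": 0, "skus_no_buy": set()})
        if r.type == CART and r.day == cart_day:
            f["cart"] += 1                            # feature 1: cart count
            if (r.user_id, r.sku_id) not in purchased:
                f["cart_no_buy"] += 1                 # feature 3: carted, not bought
                f["skus_no_buy"].add(r.sku_id)        # feature 4: distinct products
        elif r.type != CART:
            f["other"] += 1                           # denominator for feature 2
    for f in feats.values():
        f["cart_ratio"] = f["cart"] / f["other"] if f["other"] else 0.0
        f["skus_no_buy"] = len(f["skus_no_buy"])
    return feats


day = date(2018, 4, 8)
records = [
    Rec(1, 100, "A", CART, day),
    Rec(1, 101, "A", CART, day),
    Rec(1, 100, "A", PURCHASE, day),
    Rec(1, 100, "A", 1, date(2018, 4, 7)),
]
feats = cart_features(records, day, lambda r: (r.user_id, r.cate))
print(feats)
```

Cart records from any other day are ignored entirely, which keeps the training and test sets on the same one-day footing.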
Building a new model around shopping cart data
Building on the previous step, we considered how to make use of more shopping cart information. To keep the distributions consistent, we changed the prediction window from seven days to five days, which increases the shopping cart information in the training set from one day to three days, and we deleted all shopping cart records in the test set except those from its last three days. Finally, a similar feature engineering method is used to extract features from the type==5 records.
The specific construction is as follows:
Training set construction:
The F11IDs and F12IDs that appear in the behavior table from 2018-03-10 to 2018-04-11 form the F11IDs and F12IDs of the training set.
The F11IDs and F12IDs that appear in the behavior table from 2018-04-11 to 2018-04-16 and also appear in the training set are used as the label set.
Features for the F11IDs and F12IDs in the training set are extracted from 2018-02-01 to 2018-04-11 (all data in this range).
Test set construction:
The F11IDs and F12IDs that appear in the behavior table from 2018-03-15 to 2018-04-16 form the F11IDs and F12IDs of the test set.
Features for the F11IDs and F12IDs in the test set are extracted from 2018-02-01 to 2018-04-16 (all data in this range).
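The distribution-matching step for the five-day model can be sketched as a filter that keeps add-to-cart records only from the last three days of each window. The sketch assumes half-open windows, so a window ending on 2018-04-11 keeps April 8 to 10 and one ending on 2018-04-16 keeps April 13 to 15; the function name and record layout are hypothetical.

```python
from datetime import date, timedelta

CART = 5  # add-to-cart type code, as stated in the article


def keep_recent_cart(records, window_end, keep_days=3):
    """Drop cart records older than the last keep_days days before window_end.

    records: iterable of (user_id, cate, shop_id, type, day) tuples.
    Non-cart records are always kept.
    """
    cutoff = window_end - timedelta(days=keep_days)
    return [r for r in records if r[3] != CART or cutoff <= r[4] < window_end]


# Toy test window: one cart record per day from April 8 to April 15, plus a browse.
records = [(1, "A", 10, CART, date(2018, 4, d)) for d in range(8, 16)]
records.append((1, "A", 10, 1, date(2018, 4, 9)))
kept = keep_recent_cart(records, date(2018, 4, 16))
cart_days = [r[4].day for r in kept if r[3] == CART]
print(cart_days)
```

Applied to the training window the filter removes nothing, because the cart records there already span only three days.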
Model training and fusion
Model training:
1. A LightGBM model was trained to minimize the log loss of binary classification.
2. To speed up training, all positive samples were used and about 30 times as many negative samples were randomly selected (about 26,000 positive samples and 800,000 negative samples).
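The negative downsampling in step 2 can be sketched with the standard library as below. The 30:1 ratio comes from the article; the function name, the fixed seed and the sample layout are assumptions, and the resulting list would then be passed to whatever binary classifier is used (LightGBM in the article).

```python
import random


def downsample_negatives(samples, ratio=30, seed=0):
    """Keep all positives and a random subset of negatives.

    samples: list of (features, label) pairs with label in {0, 1}.
    ratio: negatives kept per positive (about 30 in the article).
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    data = positives + rng.sample(negatives, n_keep)
    rng.shuffle(data)
    return data


# Toy data: 10 positives followed by 990 negatives.
samples = [({"x": i}, 1 if i < 10 else 0) for i in range(1000)]
train_samples = downsample_negatives(samples)
print(len(train_samples), sum(label for _, label in train_samples))
```

Downsampling changes the class prior, so predicted probabilities are no longer calibrated; this is not a problem when only the ranking of predictions is used, as in the TOP-N fusion described in this write-up.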
Model fusion:
According to the evaluation function, the β^2 of F11 is 0.5 and the β^2 of F12 is 1.5. In the general form F_β = (1 + β^2)·Precise·Recall / (β^2·Precise + Recall), a β^2 below 1 weights precision more heavily and a β^2 above 1 weights recall more heavily, so the F11 score is more sensitive to precision and the F12 score is more sensitive to recall.
The fusion method we adopted is as follows:
1. Take the TOP 26,000 predictions of the F11 model and the TOP 32,000 predictions of the F12 model, ranked by predicted probability. Inner-join the two on F11ID to get their intersection, then take the TOP 15,000 predictions of the F12 model and merge them in.
2. Fusing the F11/F12 models in this way for each of the two groups (the seven-day model and the five-day model) gives a result of about 22,000 rows for each group. The TOP 15,000 of the five-day model are then taken and fused with the seven-day model to obtain the final result.
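Step 1 of the fusion can be sketched as below. The sketch reads "merge them" as a de-duplicated union that preserves order, which is an interpretation rather than something the article states; the function names and toy probabilities are hypothetical. With the article's settings the call would use k11=26000, k12=32000 and k12_extra=15000.

```python
def top_k(probs, k):
    """Keys of the k highest-probability predictions."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [key for key, _ in ranked[:k]]


def fuse(f11_probs, f12_probs, k11, k12, k12_extra):
    """Fuse F11 and F12 predictions into (user, cate, shop) triples.

    f11_probs: {(user, cate): prob}; f12_probs: {(user, cate, shop): prob}.
    """
    top11 = set(top_k(f11_probs, k11))
    # Inner join on F11ID: keep F12 triples whose (user, cate) is in the F11 top list.
    joined = [t for t in top_k(f12_probs, k12) if t[:2] in top11]
    # Then add the most confident F12 triples regardless of the F11 model.
    result, seen = [], set()
    for t in joined + top_k(f12_probs, k12_extra):
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


f11_probs = {(1, "A"): 0.9, (2, "B"): 0.8, (3, "C"): 0.1}
f12_probs = {(1, "A", 10): 0.7, (2, "B", 20): 0.6, (3, "C", 30): 0.95, (1, "A", 11): 0.2}
fused = fuse(f11_probs, f12_probs, k11=2, k12=3, k12_extra=1)
print(fused)
```

The intersection favors precision on the user-category part, while the extra F12 triples recover confident store-level predictions that the F11 model ranked low.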
Lessons learned after the competition
1. EDA is the basis for understanding the problem, processing the data and constructing features. The quality of this step determines whether the overall direction of the work is correct and sets the upper limit of model performance;
2. Once EDA has been done thoroughly, quickly implementing intuitively feasible ideas is an effective way to improve both efficiency and quality;
3. Overly complex derivation and feature engineering that is detached from the data itself are often ineffective and waste a lot of time. It is more efficient to implement an idea quickly and then update and improve it according to the results.
Shortcomings of this work
1. Because the competition was short and the computing power available was limited, many intuitively feasible ideas were never put into practice;
2. We did not understand the score and the definitions of F11 and F12 well enough, and many problems were only discovered in the post-competition review;
3. The reliability of the offline validation set we built was limited, and the final fusion method still does not feel satisfactory;
4. The feature engineering is complicated and may contain redundant features, which affects the runtime performance of the model to some extent.
The author’s Zhihu column:
zhuanlan.zhihu.com/c_152307828