
As a technical means of alleviating information overload and surfacing users' latent needs, recommendation plays an important role at Meituan-Dianping, a life-service e-commerce platform covering a wide range of businesses. In the Meituan App, recommendation powers many important scenarios, including “Guess You Like” on the home page, the operations area, and hotel and travel recommendations.

Deep learning models have achieved major breakthroughs in many fields thanks to their powerful expressive ability and flexible network structures. The Meituan platform has massive user and transaction data along with rich product usage scenarios, which provide the necessary conditions for applying deep learning. This article introduces our application and exploration of deep learning models in the recommendation ranking scenarios of the Meituan platform.

2. Application and exploration of deep learning models

Tens of millions of users are active in Meituan's recommendation scenarios every day. Their interactions with the product generate massive amounts of real behavioral data, providing on the order of a billion effective training samples per day. To handle such large-scale training data and improve training efficiency, we built distributed DNN training on top of PS-Lite and made many optimizations on this framework, achieving significant improvements in ranking scenarios.

As shown in the figure above, the model ranking pipeline includes log collection, training data generation, model training, and online scoring. When the recommendation system serves a user browsing a recommendation scenario, it records the item features, user state, and context at that moment, and collects the user's behavioral feedback for that recommendation. After label matching and feature processing, the final training data is generated. We use the PS-Lite framework to train the multi-task DNN model offline in a distributed manner, select the model that performs best on offline evaluation metrics, and load it online to serve the ranking service.

Below we focus on the optimizations and experiments we made in feature processing and model structure.

Feature processing

The “Guess You Like” scenario on Meituan covers many businesses, including food, hotels, travel, takeout, homestays, and transportation. Each business has its own rich characteristics, and supply, demand, and conditions such as weather, time, and location interweave across businesses, creating the diversity and complexity unique to O2O life-service scenarios. This places higher demands on how to organize ranking results effectively. Building more comprehensive features and using samples more accurately and efficiently have always been key directions of our optimization.

Feature categories

  • User features: age, gender, marital status, parental status, etc.
  • Item features: price, discount, category- and brand-related features, short-term and long-term statistical features, etc.
  • Context features: weather, time, location, temperature, etc.
  • User behavior: the user's clicked Item sequence, ordered Item sequence, etc.

In addition to the features listed above, we also crossed some features based on our accumulated knowledge in the O2O domain, and further processed features according to how well they learned. The specific sample and feature processing pipeline is as follows:

Label matching

The recommendation service logs record the User, Item, and Context features of each sample, while the label logs capture the user's feedback on the recommended items. We join the two by a unique ID to generate the raw training log.
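A minimal sketch of this join, assuming pandas DataFrames and a shared log ID column (the column names are illustrative):

```python
import pandas as pd

# Recommendation logs: one row per impression, with the features logged at serving time.
rec_log = pd.DataFrame({
    "log_id": [101, 102, 103],
    "user_age": [25, 31, 47],
    "item_price": [58.0, 120.0, 33.5],
})

# Label logs: the user feedback (click / order) collected afterwards.
label_log = pd.DataFrame({
    "log_id": [101, 103],
    "clicked": [1, 1],
    "ordered": [0, 1],
})

# Left-join on the unique ID; impressions without feedback become negative samples.
train_log = rec_log.merge(label_log, on="log_id", how="left")
train_log[["clicked", "ordered"]] = train_log[["clicked", "ordered"]].fillna(0).astype(int)
print(train_log)
```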

Equal-frequency normalization

Analyzing the training data, we found that the value distributions of features in different dimensions, and the ranges of values within the same dimension, differ greatly. Features such as distance and price follow long-tailed distributions: most samples have relatively small feature values, while a few samples have very large ones. Conventional normalization methods (such as min-max and Z-score) only translate and rescale the data, so the resulting distribution is still long-tailed. As a result, the feature values of most samples are squeezed into a very small range, which reduces the discriminability of the feature, while the few very large values can cause fluctuations during training and slow down convergence. A logarithmic transform of the feature values is another option, but because features in different dimensions follow different distributions, this kind of transform is not necessarily appropriate for every dimension.

In practice, we follow the treatment of continuous features in Google's Wide & Deep model [6] and normalize according to the position of a feature value in its cumulative distribution function. That is, each feature is divided into equal-frequency buckets so that the number of samples in each bucket is roughly equal. Suppose there are n buckets in total and feature value xi falls into the Bi-th bucket (Bi ∈ {0, …, n-1}); then xi is normalized to Bi / n. This maps features with different distributions to an approximately uniform distribution, preserving the discriminability of features across samples while keeping the values numerically stable.
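A minimal sketch of this equal-frequency normalization, assuming NumPy; bucket boundaries are fit on the training data and then reused at serving time (the function and variable names are illustrative):

```python
import numpy as np

def fit_equal_freq_buckets(values, n_buckets=10):
    """Estimate bucket boundaries so each bucket holds roughly the same number of samples."""
    quantiles = np.linspace(0.0, 1.0, n_buckets + 1)[1:-1]          # interior cut points
    return np.quantile(values, quantiles)                            # (n_buckets - 1,) boundaries

def equal_freq_normalize(values, boundaries):
    """Map each value to bucket_index / n_buckets, i.e. its approximate CDF position."""
    n_buckets = len(boundaries) + 1
    bucket_idx = np.searchsorted(boundaries, values, side="right")   # Bi in {0, ..., n-1}
    return bucket_idx / n_buckets

# Long-tailed toy feature (e.g. price): most values small, a few very large.
prices = np.concatenate([np.random.exponential(30, 10000), [5000.0, 8000.0]])
bounds = fit_equal_freq_buckets(prices, n_buckets=10)
normalized = equal_freq_normalize(prices, bounds)   # values are now roughly uniform in [0, 1)
```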

Low-frequency filtering

Overly numerous and sparse discrete features cause over-fitting during training and inflate the number of parameters to store. To avoid this, we apply low-frequency filtering to discrete features and discard features whose occurrence count falls below a threshold.
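A minimal sketch of this filtering step, assuming Python (the threshold value is illustrative):

```python
from collections import Counter

def build_vocab(feature_values, min_count=10):
    """Keep only discrete feature values that appear at least min_count times."""
    counts = Counter(feature_values)
    return {v for v, c in counts.items() if c >= min_count}

# Values below the frequency threshold are dropped, or mapped to a shared "unknown" id.
raw_values = ["brand_42", "brand_42", "brand_999", "brand_42", "brand_7"]
vocab = build_vocab(raw_values, min_count=2)
filtered = [v if v in vocab else "<unk>" for v in raw_values]
```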

After feature extraction, label matching, and feature processing, we assign each feature to a field, hash the discrete features, and finally generate data in LIBFFM format as the training samples for the multi-task DNN. Below are some of the model optimizations we attempted for our business goals.
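A minimal sketch of this last step, assuming Python; each LIBFFM line has the form `label field:index:value`, and the hash space size and feature names below are illustrative:

```python
import zlib

def to_libffm_line(label, features, hash_space=2**20):
    """features: list of (field_id, raw_feature_string, value) triples."""
    parts = [str(label)]
    for field_id, raw_feature, value in features:
        # Hash the discrete feature string into a fixed-size index space.
        index = zlib.crc32(raw_feature.encode()) % hash_space
        parts.append(f"{field_id}:{index}:{value}")
    return " ".join(parts)

line = to_libffm_line(1, [(0, "user_gender=f", 1), (1, "item_brand=42", 1), (2, "price_norm", 0.7)])
# -> "1 0:<idx>:1 1:<idx>:1 2:<idx>:0.7"  (indices depend on the hash)
```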

Model optimization attempts

On the model side, we drew on successful experience from industry and optimized the model structure for our recommendation scenarios on top of an MLP model. Many mechanisms in deep learning are general: the Attention mechanism, for example, has brought significant gains in machine translation and image captioning. However, a concrete model structure cannot always be transplanted directly; the network structure has to be adjusted for the actual business problem in order to improve the model's effect in the specific scenario.

Multi-task DNN

The optimization objective of the recommendation scenario should take both the user's click-through rate and order rate into account. Previously, when training a single-objective XGBoost model, we balanced CTR and order rate by treating both clicked and ordered samples as positives and up-sampling or re-weighting the ordered samples. This weighting approach has drawbacks: tuning the weight or sampling rate is costly because every adjustment requires retraining, and it is hard for the model to express the two mixed sample distributions with a single set of parameters. To solve these problems, we exploit the flexible network structure of a DNN and introduce multi-task training.

According to our business objectives, we split CTR and order rate into two independent training objectives, each with its own loss function to supervise and guide model training. The first several layers of the DNN serve as shared layers: the click task and the order task share these representations, and the parameters are updated jointly during back-propagation using the gradients computed from both tasks. The network splits at the last fully connected layer so that each task learns the parameters for its own loss, allowing each head to better fit the distribution of its own labels.

The network structure of the multi-task DNN is shown in the figure above. For online prediction, we linearly fuse the click output and the order (pay) output.
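A minimal sketch of this shared-bottom, two-head structure, assuming PyTorch; the layer sizes and fusion weight are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    def __init__(self, input_dim, hidden_dims=(256, 128)):
        super().__init__()
        layers, prev = [], input_dim
        for h in hidden_dims:                       # shared bottom layers
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.shared = nn.Sequential(*layers)
        self.click_head = nn.Linear(prev, 1)        # task-specific heads split at the last layer
        self.order_head = nn.Linear(prev, 1)

    def forward(self, x):
        h = self.shared(x)
        p_click = torch.sigmoid(self.click_head(h))
        p_order = torch.sigmoid(self.order_head(h))
        return p_click, p_order

model = MultiTaskDNN(input_dim=64)
x = torch.randn(32, 64)
p_click, p_order = model(x)

# Training: each task has its own loss; gradients from both update the shared layers.
loss = nn.functional.binary_cross_entropy(p_click, torch.randint(0, 2, (32, 1)).float()) \
     + nn.functional.binary_cross_entropy(p_order, torch.randint(0, 2, (32, 1)).float())

# Online scoring: linear fusion of the two outputs (alpha is a tuned business weight).
alpha = 0.5
score = alpha * p_click + (1 - alpha) * p_order
```

Because the two heads share the bottom layers, gradients from both objectives shape the shared representation, while each head separately fits its own label distribution.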

On top of this structure, we made further optimizations based on the characteristics of our data distribution and business objectives. To handle widespread missing feature values, we proposed a Missing Value Layer that fits the online data distribution in a more reasonable way. To relate the physical meaning of different tasks, we proposed a KL-divergence Bound that mitigates the noise of individual objectives. These two pieces of work are introduced below.

Missing Value Layer

In general, some continuous features inevitably have missing values in the training samples, and handling them well helps both convergence and the final result. Missing values of continuous features are commonly filled with zero or with the mean of that feature dimension. With zeros, the corresponding weights are not updated and convergence slows down; the mean is somewhat arbitrary, since a missing feature can mean different things. Some non-neural models handle missing values more reasonably: XGBoost, for example, adaptively decides during its loss computation whether samples with a missing feature are better routed to the left or right subtree. Inspired by this, we want the neural network to learn how to handle missing values adaptively as well, instead of relying on manually chosen defaults. We therefore designed the following layer, which adaptively learns weights for missing values:
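A minimal sketch of one plausible implementation of such a layer, assuming PyTorch and assuming it replaces each missing entry with a per-feature learnable default while passing observed values through (the exact formulation in the article may differ):

```python
import torch
import torch.nn as nn

class MissingValueLayer(nn.Module):
    """Replace missing continuous features with learnable per-feature defaults."""
    def __init__(self, num_features):
        super().__init__()
        self.default = nn.Parameter(torch.zeros(num_features))  # learned "fill" value per feature

    def forward(self, x, missing_mask):
        # x: (batch, num_features); missing_mask: 1.0 where the value is missing, else 0.0
        return x * (1.0 - missing_mask) + self.default * missing_mask

layer = MissingValueLayer(num_features=4)
x = torch.tensor([[0.3, 0.0, 1.2, 0.0]])
mask = torch.tensor([[0.0, 1.0, 0.0, 1.0]])   # second and fourth features are missing
out = layer(x, mask)                           # missing slots now carry learnable values
```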

Through the layer above, a missing feature can adaptively learn a reasonable value according to the distribution of the corresponding feature.

Offline experiments show that adaptively learning values for missing features improves training far more than filling in zeros or the mean. The change of the model's offline AUC over training rounds is shown in the figure below:

The relative improvement in AUC is shown in the table below:

KL-divergence Bound

We also considered that different labels carry different amounts of noise. If related labels can be tied together through their physical meaning, the robustness of learning can be improved and the impact of noise from any single label reduced. For example, multi-task learning lets us predict the click-through rate, conversion rate (order/click), and order rate of a sample simultaneously, and these quantities should satisfy p(click) · p(convert) = p(order). We therefore added a KL-divergence bound to push the predicted p(click) · p(convert) closer to p(order). Because KL divergence is asymmetric, i.e. KL(p||q) ≠ KL(q||p), what we actually optimize is the symmetric form KL(p||q) + KL(q||p).
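A minimal sketch of such a bound term, assuming PyTorch; here the two distributions are the Bernoulli distributions given by p(order) and p(click) · p(convert), and the weight of the term is illustrative:

```python
import torch

def bernoulli_kl(p, q, eps=1e-7):
    """KL divergence between Bernoulli(p) and Bernoulli(q), elementwise."""
    p = p.clamp(eps, 1 - eps)
    q = q.clamp(eps, 1 - eps)
    return p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))

def kl_bound_loss(p_click, p_convert, p_order):
    """Symmetric KL between p(order) and p(click) * p(convert)."""
    p_prod = p_click * p_convert
    return (bernoulli_kl(p_order, p_prod) + bernoulli_kl(p_prod, p_order)).mean()

# Added to the per-task losses with a small weight as a soft constraint, e.g.:
# total_loss = click_loss + order_loss + convert_loss + 0.1 * kl_bound_loss(p_click, p_convert, p_order)
```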

After the work above, the multi-task DNN model performs more stably than the XGBoost model. It has now been fully launched in the “Guess You Like” scenario on the Meituan home page, and the online click-through rate has improved:

The relative improvement in online CTR is shown in the table below:

Besides improving online metrics, multi-task training also greatly improves the extensibility of the DNN model: multiple business objectives can be considered simultaneously during training, which makes it convenient to add business constraints.

Further exploration

After the multi-task DNN model went online, we made several further optimization attempts by taking advantage of the flexibility of the DNN network structure. Below we introduce our exploration of NFM and user interest vectors in detail.

NFM

To introduce low-order feature combinations, we tried adding NFM on top of the multi-task DNN. The discrete features of each field are first passed through an Embedding layer to obtain vector representations, which serve as the input to NFM. NFM performs second-order feature combination on each dimension of the input vectors through Bi-Interaction Pooling and outputs a vector of the same dimension as the input. We concatenate the vector learned by NFM with a hidden layer of the DNN as the sample representation for subsequent learning.
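A minimal sketch of Bi-Interaction Pooling as described in the NFM paper [3], assuming PyTorch and assuming the embeddings of a sample's feature fields are stacked into one tensor:

```python
import torch

def bi_interaction_pooling(embeddings):
    """embeddings: (batch, num_fields, embed_dim) -> (batch, embed_dim).

    Computes 0.5 * ((sum_i v_i)^2 - sum_i v_i^2) elementwise, i.e. the sum of all
    pairwise element-wise products of the field embeddings.
    """
    sum_then_square = embeddings.sum(dim=1).pow(2)      # (batch, embed_dim)
    square_then_sum = embeddings.pow(2).sum(dim=1)      # (batch, embed_dim)
    return 0.5 * (sum_then_square - square_then_sum)

emb = torch.randn(32, 8, 16)            # 32 samples, 8 feature fields, 16-dim embeddings
pooled = bi_interaction_pooling(emb)    # (32, 16), concatenated with a DNN hidden layer downstream
```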

Because the NFM output is a vector, it is easy to fuse with the DNN's hidden layers. We also found that NFM accelerates training convergence, which helps Embedding learning: the DNN part has many layers, so gradients tend to vanish by the time they reach the bottom layers during back-propagation, whereas the NFM part is shallow, so gradients propagate more easily and the Embeddings learn faster.

Offline experiments show that although adding NFM speeds up training convergence, the AUC does not improve significantly. The reason is that the features fed into the NFM part are still limited, which constrains what it can learn. In the future we will add more feature fields to give NFM enough information to learn useful representations and realize its potential.

User interest vector

User interest is an important feature and is usually reflected in the user's historical behavior. By introducing the sequence of the user's historical behavior, we tried several ways of representing user interest as a vector.

  1. Vector representation of Items: the Items in the user behavior sequences logged online exist only as IDs, so we first need a vector representation of each Item. Initially we tried randomly initializing the Item Embedding vectors and updating them during training, but because Item IDs are sparse, this random initialization easily over-fits. We later initialized with Item Embedding vectors generated in advance and fine-tuned them during training.

  2. Vector representation of user interest: to generate the user interest vector, we fuse the Item vectors in the user's behavior sequence using Average Pooling, Max Pooling, and Weighted Pooling. Weighted Pooling follows the idea of DIN [4]: given the user's behavior sequence, a nonlinear Attention Net learns an alignment weight between each behavior Item and the candidate Item to be predicted, and the behavior sequence is then pooled with these weights to produce the user interest vector. The calculation process is shown in the figure below:
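A minimal sketch of the three pooling variants, assuming PyTorch; the attention network below is a simplified stand-in for the DIN-style Attention Net, not the exact structure used online:

```python
import torch
import torch.nn as nn

def average_pooling(behavior_emb):                 # (batch, seq_len, dim) -> (batch, dim)
    return behavior_emb.mean(dim=1)

def max_pooling(behavior_emb):
    return behavior_emb.max(dim=1).values

class WeightedPooling(nn.Module):
    """DIN-style weighted pooling: weights depend on the candidate item to be predicted."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        # Score each (behavior item, candidate item) pair with a small MLP.
        self.att = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, behavior_emb, candidate_emb):
        # behavior_emb: (batch, seq_len, dim); candidate_emb: (batch, dim)
        seq_len = behavior_emb.size(1)
        cand = candidate_emb.unsqueeze(1).expand(-1, seq_len, -1)
        weights = torch.softmax(self.att(torch.cat([behavior_emb, cand], dim=-1)), dim=1)
        return (weights * behavior_emb).sum(dim=1)  # (batch, dim) user interest vector

behaviors = torch.randn(32, 20, 16)                # 20 clicked items, 16-dim embeddings
candidate = torch.randn(32, 16)
interest = WeightedPooling(16)(behaviors, candidate)
```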

Comparing offline AUC, Average Pooling works best on the current training data. The comparison is shown below:

The above is our experience with model-structure optimization. Next we introduce the framework performance optimizations made to improve model training efficiency.

Optimization of training efficiency

After extensive investigation of open-source frameworks, we chose PS-Lite as the training framework for our DNN model. PS-Lite is DMLC's open-source Parameter Server implementation. It contains two main roles: the Server, which stores and updates the model parameters, and the Worker, which reads training data, builds the network structure, and computes gradients. Its major advantages over other open-source frameworks are:

  • PS architecture: the design of PS-Lite makes good use of feature sparsity, which suits recommendation scenarios with large numbers of discrete features.
  • Reasonable encapsulation: the communication framework is decoupled from the algorithm, and the API is powerful and clear, making it easy to integrate.

During development we also encountered and solved several performance issues:

  1. To save Worker memory, it is common not to load all data into memory but to pre-fetch it from disk batch by batch. However, this involves a lot of data parsing, and repeatedly computing some metadata (e.g. large-scale key sorting and deduplication) also accumulates considerable cost. To address this, we changed the data-reading path: the computed metadata is serialized to disk, and data is pre-fetched into the corresponding data structures ahead of time by multiple threads, avoiding the time wasted on repeated computation (see the first sketch after this list).

  2. During training, the compute efficiency of each Worker depends on the real-time load and hardware of its host, so different Workers progress at different speeds (in one experimental run, most Workers finished a training round within 700 seconds while the slowest took 900 seconds, as shown in the figure below). After each epoch, synchronized steps such as model checkpointing and evaluation-metric computation must run, so the slowest node drags down the whole training process. Observing that Worker efficiency roughly follows a Gaussian distribution and only a few Workers are extremely slow, we added an interruption mechanism: when most machines have finished the current epoch, the remaining Workers are interrupted, sacrificing part of the training data on a small number of Workers to keep training from blocking for a long time. When an interrupted Worker starts the next epoch, it resumes from the batch where it was interrupted, so slow nodes still make use of all their training data (see the second sketch after this list).
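A minimal sketch of the pre-fetching pattern from item 1, assuming Python; the serialized-metadata format and file names are illustrative:

```python
import pickle, queue, threading

def prefetch_batches(batch_files, prefetch_depth=4):
    """Parse batches on a background thread and hand them to the trainer via a bounded queue."""
    q = queue.Queue(maxsize=prefetch_depth)

    def worker():
        for path in batch_files:
            with open(path, "rb") as f:            # metadata (sorted, deduplicated keys, etc.)
                batch = pickle.load(f)             # was serialized once and is simply reloaded here
            q.put(batch)
        q.put(None)                                # end-of-data sentinel

    threading.Thread(target=worker, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

# for batch in prefetch_batches(["part-0.pkl", "part-1.pkl"]):
#     train_step(batch)   # hypothetical training step
```

And a minimal sketch of the interruption mechanism from item 2, also in Python; `most_workers_finished()` stands in for whatever barrier or counter the parameter server exposes and is purely illustrative:

```python
def run_epoch(batches, start_index, most_workers_finished, train_step):
    """Train from start_index; stop early if most other workers already finished this epoch.

    Returns the index to resume from at the beginning of the next epoch.
    """
    for i in range(start_index, len(batches)):
        if most_workers_finished():        # interruption signal from the coordinator
            return i                       # resume here next epoch, so the data is not lost forever
        train_step(batches[i])
    return 0                               # finished the whole epoch; start from the top next time

# resume = 0
# for epoch in range(num_epochs):
#     resume = run_epoch(batches, resume, most_workers_finished, train_step)
```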

3. Summary and outlook

Applying deep learning models in our recommendation scenarios has significantly improved business metrics. Going forward, we will deepen our understanding of the business scenarios and make further optimization attempts.

On the business side, we will try to abstract more business rules into the model as learning objectives. Business rules often arise when solving business problems in the short term, but they tend not to be smooth and do not adapt as scenarios change. Under the multi-task framework, business biases can be abstracted into learning objectives that guide model training, so business problems can be solved elegantly through the model.

On the feature side, we will continue in-depth work on mining and using features. Unlike other recommendation scenarios, Context features play a significant role in O2O services: time, location, weather, and other factors affect users' decisions. In the future we will keep mining various Context features and combine features via feature engineering or models to improve the sample representation.

On the model side, we will continue to explore network structures, try new model designs, and adapt them to the characteristics of our scenarios. The successful experience of academia and industry is very valuable and provides new ideas and methods, but because the business problems and accumulated data differ across scenarios, adaptation to the specific scenario is still necessary to improve business goals.

References

[1] Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. Communication Efficient Distributed Machine Learning with the Parameter Server. NIPS, 2014.
[2] Rich Caruana. Multitask Learning. Machine Learning, 28(1), 1997.
[3] Xiangnan He and Tat-Seng Chua. Neural Factorization Machines for Sparse Predictive Analytics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.
[4] Guorui Zhou, Chengru Song, et al. Deep Interest Network for Click-Through Rate Prediction. arXiv preprint arXiv:1706.06978, 2017.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2015.
[6] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. Wide & Deep Learning for Recommender Systems. arXiv preprint arXiv:1606.07792, 2016.

About the authors

Shao Zhe joined Meituan in 2015 and works mainly on recommendation ranking models. Rui Liu previously worked on recommendation at Baidu and Alibaba and is now with the Recommendation Technology Center of the Meituan platform, focusing on the research and development of deep learning models.

If you are interested in our team, you can follow our column. The Recommendation Technology Center of the Meituan platform welcomes talented people to join us. Please send your CV to: caohao#meituan.com

Original article: tech.meituan.com/recommend_d…