Overview of ranking models

A modern recommendation system is generally divided into two stages: recall and ranking. In the recall stage, low-cost, fast models roughly screen a candidate set of hundreds of thousands or millions of items down to a few thousand or a few hundred. Then, in the ranking stage, richer features and more complex models perform fine-grained ranking, and the top-K items are finally kept.

Over the past decade, ranking models in industry have come a long way: from the classic LR, to FM in 2010, to Facebook's GBDT tree model in 2014. These years can be regarded as the first half of the modern recommendation system.

The rapid development since 2015 can be regarded as the second half. Within just a few years, deep learning models represented by the DNN sprang up; all kinds of model architectures and feature-crossing schemes emerged one after another, along with a dazzling stream of new ideas, and deep learning gradually became the mainstream method in the CTR and recommendation fields.

Along this line, this article will briefly introduce, organize and summarize these ranking models.

I. Traditional models

1. LR

Before the rise of deep learning, logistic regression (LR) almost monopolized the early CTR and recommendation fields thanks to its simplicity, speed and interpretability.


  • Intuitively, the LR model takes a weighted sum of all features and then maps the result into (0, 1) with the sigmoid function to express the probability that a user clicks on an item (see the formula after this list). It is easy to understand and can be parallelized for high speed.

  • In addition, by inspecting the learned feature weights, we can easily tell which features are “important”. When a prediction is off, it is also easy to see which factors affected the result, which is why LR is highly interpretable.

  • However, the disadvantages of LR are also obvious: as a simple linear model, it cannot capture nonlinear relationships between features and the target. Moreover, features are not completely independent, and some of them only become informative when crossed with each other. Therefore, to give the model some nonlinearity, data scientists at the time had to do a lot of manual feature engineering, such as discretizing continuous features and crossing features. Extracting effective cross features, however, requires a deep understanding of the data and the business scenario, so the labor cost is high, and even experienced engineers cannot exhaust all feature cross combinations.
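In the usual notation, the LR model mentioned in the first bullet can be written as ($x_i$ are the features, $w_i$ the learned weights, $\sigma$ the sigmoid function):

$$
\hat{y} = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i\Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$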

Since this is hard to do by hand, can we find feature cross combinations automatically, with the help of a model? In the following years, two automatic feature-crossing approaches appeared, represented by FM and GBDT.

2. FM/FFM

To address the feature-crossing problem of LR, the following degree-2 polynomial model was proposed:
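The formula in question, reconstructed in its standard form (each feature pair gets its own weight $w_{ij}$):

$$
\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij}\, x_i x_j
$$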


As the formula shows, the model crosses all features pairwise and assigns a separate weight to every feature combination.

But there are two problems with this brute-force approach:

  • Modern recommendation systems usually contain a large number of sparse features (such as IDs). The number of cross features is the product of the original feature dimensions, and the weight $w_{ij}$ of a combination is only updated when both feature values are non-zero. As a result, most cross-feature weights lack enough effective data to converge, while the number of weight parameters jumps to $O(n^2)$, greatly increasing training complexity.
  • It cannot generalize to feature combinations that never appeared in the training samples.

To address these two problems, Steffen Rendle of the University of Konstanz in Germany proposed FM (Factorization Machines) in 2010.
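In the notation of Rendle's paper, FM replaces each pairwise weight with the inner product of two latent vectors ($\mathbf{v}_i \in \mathbb{R}^k$ is the latent vector of feature $i$):

$$
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
$$

The second-order term can be rewritten so that it is computable in $O(kn)$ time:

$$
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\Big[\Big(\sum_{i=1}^{n} v_{i,f}\, x_i\Big)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2\Big]
$$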


Similar to the idea of matrix factorization (MF), FM learns a latent vector for each feature. For feature crossing, the inner product of two features' latent vectors is used as the weight of the cross feature instead of an independent weight.

  • By introducing feature latent vectors, the number of cross weights drops from the original $O(n^2)$ level to $O(nk)$ (where $k$ is the latent vector dimension and $k \ll n$). During training, the computation of the second-order term can be reorganized as shown above to further reduce FM's training complexity to the $O(kn)$ level, greatly reducing training overhead.

  • FM solves the data sparsity problem of the polynomial model above. As the formula shows, every sample that contains a non-zero combination involving feature $x_i$ (i.e., any $x_i x_j$ with $x_j \neq 0$) can be used to learn the latent vector $\mathbf{v}_i$, which largely avoids the impact of data sparsity.

  • The latent vectors also allow the model to generalize to feature combinations that have never appeared before, balancing the memorization and generalization of the model.

  • From an engineering point of view, FM can also be learned with gradient descent, so it loses none of its real-time capability and flexibility. Compared with the complex network structures of later deep learning models, FM's inference process is much easier to implement, so it is free of serving problems and is still commonly used in the recall stage today (a minimal scoring sketch follows below).
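To make the $O(kn)$ trick concrete, here is a minimal NumPy sketch of FM scoring (not the author's code; the parameter names w0, w, V and the random example are illustrative):

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Score one sample with a Factorization Machine.

    x  : (n,)   feature vector
    w0 : scalar global bias
    w  : (n,)   first-order weights
    V  : (n, k) latent vectors, one row per feature
    """
    linear = w0 + w @ x
    # O(kn) reformulation of the pairwise term:
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    s = V.T @ x                       # (k,)
    s_sq = (V ** 2).T @ (x ** 2)      # (k,)
    pairwise = 0.5 * np.sum(s ** 2 - s_sq)
    return linear + pairwise

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
n, k = 8, 4
x = rng.integers(0, 2, size=n).astype(float)   # one-hot style features
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
print(fm_score(x, w0, w, V))
```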

3. GBDT + LR

Although FM has good overall performance, it only does second-order feature crossing. If we keep raising the crossing order, combinatorial explosion and prohibitive computational complexity are inevitable. So is there another method that can effectively handle high-order feature combination and selection?

In 2014, Facebook proposed a cascaded tree model in a paper to solve this problem.

The idea is simple:

Train a GBDT model first. In each tree, going from top to bottom, every node split is a natural feature-selection step, and stacking several levels naturally produces effective feature combinations. Each leaf node corresponds to one path through the tree, so different leaves represent different feature cross combinations. We can therefore number all the leaf nodes and treat them as new features, combine them with the original features, and feed them into LR to train the final model (see the sketch below).
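A minimal sketch of this cascade with scikit-learn; the synthetic data and hyperparameters are illustrative, not the paper's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic data standing in for impression logs
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=42)

# 1) Train GBDT; each leaf index encodes one feature-cross "path"
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=4, random_state=42)
gbdt.fit(X_gbdt, y_gbdt)

# 2) Map samples to leaf indices and one-hot encode them as new features
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(gbdt.apply(X_gbdt)[:, :, 0])           # (n_samples, n_trees) leaf ids
leaves_lr = gbdt.apply(X_lr)[:, :, 0]

# 3) Train LR on the one-hot leaf features (the original features
#    could be concatenated here as well)
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(leaves_lr), y_lr)
print("Number of leaf features:", enc.transform(leaves_lr).shape[1])
```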

The feature combinations from a tree model are not limited to second-order crosses as in FM: for example, if the depth of each tree is 5, a leaf is reached after four node splits, so it effectively represents a fourth-order feature combination. That does not mean GBDT necessarily works better than FM, however, because tree models have their own drawbacks, such as easily overfitting high-dimensional sparse data, being hard to parallelize, and generalizing poorly.

But the significance of GBDT+LR lies in the following:

  • It put forward the idea of using a model to do feature engineering and feature crossing automatically. In a sense, the rise of deep learning and the wide application of embeddings and various network structures are a continuation of this idea.
  • It also proposed a cascading structure in which each model can be updated at its own pace: the tree model can be retrained once a day or even once a week, while the LR model can be updated online in near real time, at the level of minutes or even seconds. This also provided a serving pattern for later models.

Summary

From the earliest hand-crafted ranking rules, to the LR model with manual feature combinations, to the FM model with automatic second-order feature combinations, to the GBDT+LR model with automatic high-order feature combinations: this is basically the main line of early ranking models in recommendation systems.

Later, the introduction of DNN models marked the rise of deep-learning-based ranking models. A plain DNN model essentially adds MLP hidden layers on top of the feature embeddings of the FM model, performing implicit, nonlinear automatic feature combination.

Below, we focus on the deep learning models that have led the field in recent years.

II. Deep models

After 2015, a series of deep learning models represented by the DNN gradually emerged. In my view, they can be divided into two categories:

  • A series of models represented by Wide&Deep. Their common trait is automatically learning to combine original features into new high-order features; they differ in how the wide or deep part is modified.
  • Jointly trained models based on multi-task learning, represented by ESMM, MMOE and similar models.

1. Wide&Deep-style models

This kind of model is characterized by a two-tower structure: on one side, a wide part represented by LR learns low-order feature expressions, emphasizing “memorization”; on the other side, a deep MLP part emphasizes “generalization”. The deep part can be further divided into the following modules:

  • raw input->embedding: maps sparse features into low-dimensional dense embedding vectors.
  • input_layer: This layer usually does some aggregation for the features.
  • input_layer->output: several fully connected MLP layers, typically followed by a softmax output layer (a minimal sketch of this deep part is given after this list).
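A minimal PyTorch sketch of the three modules above; the field sizes, layer dimensions and the sigmoid CTR head are illustrative assumptions, not taken from the text:

```python
import torch
import torch.nn as nn

class DeepPart(nn.Module):
    """raw input -> embedding -> input_layer (concat) -> MLP -> output."""

    def __init__(self, field_vocab_sizes, emb_dim=16, hidden=(128, 64)):
        super().__init__()
        # One embedding table per sparse feature field
        self.embeddings = nn.ModuleList(
            nn.Embedding(vocab, emb_dim) for vocab in field_vocab_sizes
        )
        layers, in_dim = [], emb_dim * len(field_vocab_sizes)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]          # logit for CTR
        self.mlp = nn.Sequential(*layers)

    def forward(self, sparse_ids):                # sparse_ids: (batch, n_fields)
        embs = [emb(sparse_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embs, dim=1)                # input_layer: simple concatenation
        return torch.sigmoid(self.mlp(x))         # click probability

# Usage: 3 sparse fields with vocab sizes 1000/500/100, batch of 4
model = DeepPart([1000, 500, 100])
ids = torch.randint(0, 100, (4, 3))
print(model(ids).shape)                           # torch.Size([4, 1])
```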

Most models differ in the deep part only at the input_layer: in the way features cross (implicit/explicit, bit-wise/vector-wise), in the way features are connected (concatenate / weighted sum / product / bi-interaction / attention, etc.), or in the order of explicit feature crossing (second-order / higher-order). Taking Wide&Deep, (x)DeepFM, DCN and DIN as examples, these models are briefly introduced below.

Here is a brief introduction to the two modes of feature interaction:

  1. One is MLP-like: its structure naturally has the ability to learn high-order feature combinations and introduces some nonlinearity. However, we do not know how the interactions actually happen or to what order the features cross. Moreover, this modeling is bit-wise, meaning that individual elements of embedding vectors, even within the same field, interact with each other. So we say this kind of feature crossing is “implicit, bit-wise”.
  2. The other, corresponding approach is that of DeepFM and xDeepFM: the model structure explicitly contains sub-networks or sub-structures designed to represent high-order feature combinations. Take FM as an example: it explicitly models second-order feature combinations at the vector-wise level, which we call “explicit, vector-wise”.

Wide&Deep

  • The advantage of the wide part lies in learning the high-frequency part of the samples; it has good “memorization” and can use few parameters to memorize high-frequency, low-order features that have appeared in the samples. However, because it is an LR model, it still requires manual feature crossing.
  • The deep part of the model learns high-order cross relationships between features and brings in “generalization”. It performs implicit, bit-wise feature crossing.
  • The proposed two-tower (wide plus deep) framework greatly promoted the development of subsequent models.

DeepFM

  • The Wide&Deep framework is powerful, but because the wide part is an LR model, it still requires manual feature engineering. DeepFM replaces the wide part with FM, which automatically performs second-order feature crossing. Its limitation is that there are only second-order crosses, not higher-order ones.
  • FM and the deep part share the embedding layer, and both sides update the embedding weights. This shared-bottom idea is also common in later models.

DCN

We want arbitrary high-order feature crosses, not just second-order ones, while avoiding the combinatorial explosion and curse of dimensionality that would make the network parameters too large to learn and produce many useless cross features. This calls for a network structure with a certain “compression” ability that is still sufficiently expressive and efficient. DCN (Deep & Cross Network) is one such model:

The cross layer operates on vectors as follows:
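Reconstructed from the DCN paper's definition of the cross layer ($\mathbf{x}_0$ is the input embedding vector, $\mathbf{x}_l$ the output of layer $l$, $\mathbf{w}_l$ and $\mathbf{b}_l$ learned parameters):

$$
\mathbf{x}_{l+1} = \mathbf{x}_0\, \mathbf{x}_l^{\top} \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l
$$

Each layer raises the maximum crossing order by one while keeping the number of parameters linear in the input dimension.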

  • On top of the retained MLP structure, a cross network is introduced that can in theory express arbitrary high-order combinations while each layer still retains the low-order ones. The vectorized parameters and the special network structure keep the model complexity growing linearly, preventing a dimension explosion as the crossing order increases.
  • However, the xDeepFM paper points out that DCN's feature crossing is limited to a special form (each layer's output is a scalar multiple of $\mathbf{x}_0$) and is constructed in a bit-wise manner.

xDeepFM

The xDeepFM model is a culmination of automatic cross-feature construction and end-to-end learning. It addresses the problems pointed out in DCN, achieving automatic, explicit learning of high-order feature interactions at the vector-wise level.

The model designs a unique CIN (Compressed Interaction Network) structure that extracts feature crosses through operations such as outer products and convolution-like compression.
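For reference, the CIN layer as defined in the xDeepFM paper (my transcription; $\mathbf{X}^0 \in \mathbb{R}^{m \times D}$ holds the $m$ field embeddings, $\circ$ is the Hadamard product, and $H_k$ is the number of feature maps in layer $k$):

$$
\mathbf{X}^k_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij}\,\big(\mathbf{X}^{k-1}_{i,*} \circ \mathbf{X}^{0}_{j,*}\big), \qquad 1 \le h \le H_k
$$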

Moreover, the order of feature crossing can be explicitly controlled by the number of CIN layers, and the time and space complexity of the computation grows linearly, so there is no combinatorial explosion that blows up the parameters and makes training impossible.

  • The combination of the CIN and DNN modules lets the model learn arbitrary high-order feature interactions both explicitly and implicitly, both bit-wise and vector-wise, while maintaining linear complexity.

DIN

  • An attention network assigns a different weight to each behavior in the user's history with respect to the current candidate item, and a weighted sum produces a summary embedding as the user's interest representation for that candidate (see the sketch after this list). This is equivalent to adding an automatic feature-selection step.
  • It also reflects the idea that the model should serve the scenario: first the multi-peaked distribution of user interests and the corresponding data characteristics were observed, and then a model suited to those local conditions was proposed to fit them.
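A minimal sketch of this attention-weighted pooling in PyTorch (the scoring MLP, dimensions and softmax normalization are illustrative; the original DIN uses a more elaborate activation unit and masking for variable-length histories):

```python
import torch
import torch.nn as nn

class InterestPooling(nn.Module):
    """Weight each history item by its relevance to the candidate, then sum."""

    def __init__(self, emb_dim=16, hidden=32):
        super().__init__()
        # Small MLP that scores (history item, candidate) pairs
        self.score = nn.Sequential(
            nn.Linear(emb_dim * 3, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, hist, cand):
        # hist: (batch, seq_len, emb_dim), cand: (batch, emb_dim)
        cand_exp = cand.unsqueeze(1).expand_as(hist)
        pair = torch.cat([hist, cand_exp, hist * cand_exp], dim=-1)
        w = torch.softmax(self.score(pair).squeeze(-1), dim=1)   # (batch, seq_len)
        return (w.unsqueeze(-1) * hist).sum(dim=1)               # (batch, emb_dim)

# Usage: batch of 2 users, history length 5, embedding dim 16
pool = InterestPooling()
hist = torch.randn(2, 5, 16)
cand = torch.randn(2, 16)
print(pool(hist, cand).shape)    # torch.Size([2, 16])
```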

Summary

2. Multi-task models

Multi-objective optimization is one of the mainstream directions in industrial recommendation systems and reflects the current state of R&D at many companies. Taking our Huajiao live streaming as an example, the targets to optimize include clicking, watching, gifting, commenting, following, forwarding and so on.

The multi-task model aims to balance the mutual influence of different objectives and, ideally, to lift all metrics simultaneously. Failing that, it should at least reach a globally optimal trade-off, improving one target's metric while degrading the others as little as possible.

ESMM and MMOE models are mainly introduced here.

ESMM

According to the paper, the complete user behavior path is impression → click → conversion.


The sample space is shown in the figure below:

Traditional CVR tasks only model the process from click to conversion, i.e. $pCVR = p(\mathrm{conversion} \mid \mathrm{click}, \mathrm{impression})$.


The model also takes the click step into account and introduces the post-view click-through & conversion rate (pCTCVR): the probability that an impression leads to both a click and a conversion, that is $pCTCVR = p(\mathrm{click}, \mathrm{conversion} \mid \mathrm{impression})$,


which gives the following decomposition:
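Reconstructed in the ESMM paper's notation ($x$: impression features, $y$: click, $z$: conversion):

$$
\underbrace{p(y=1, z=1 \mid x)}_{pCTCVR} \;=\; \underbrace{p(y=1 \mid x)}_{pCTR} \;\times\; \underbrace{p(z=1 \mid y=1, x)}_{pCVR}
$$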

The model designed according to this formula is shown below:

  • The two tasks share the low-level embeddings; their two output probabilities (pCTR and pCVR) are then multiplied to fit pCTCVR.

Sample construction is as follows:

| Task | Positive samples | Negative samples |
| --- | --- | --- |
| pCTR | Clicked | Not clicked |
| pCTCVR | Clicked and converted | Not converted |


Model features:

  • Modeling in the full sample space avoids the “sample selection bias” problem and makes full use of the business data.
  • The low-level embedding vectors are shared. Because conversions are rare in recommendation and the corresponding data is very small, this shared feature representation lets the CVR network update its embeddings even from samples that were exposed but not clicked, which helps alleviate the sparsity of CVR training data.
  • The sub-networks in ESMM can be replaced by models other than the MLP used in the paper, which means the paper provides an extensible multi-task model architecture. Alibaba later published a follow-up model with a similar pattern.

MMOE

The paper states that the general multi-task model structure is as shown in figure (a) above: the parameters of the bottom network are shared across tasks, and separate upper networks produce each task's output. The drawback is that the model's effectiveness depends on how related the tasks are; if the tasks are only weakly related, this structure can even cause them to drag each other down.

Therefore, the paper proposes two structures: OMOE in figure (b) and MMOE in figure (c). The main idea is to keep a set of shared expert sub-networks at the bottom and give each task its own gating network that acts like a “switch”: through learning, different tasks extract differently weighted features from the same low-level embeddings instead of fully sharing the bottom layers, so that “everyone gets what they need”, similar in spirit to the attention mechanism mentioned above (see the formula below).
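As written in the MMoE paper ($f_i$ are the $n$ expert networks, $g^k$ the gate of task $k$, $h^k$ the task tower):

$$
y_k = h^k\Big(\sum_{i=1}^{n} g^k(x)_i\, f_i(x)\Big), \qquad g^k(x) = \mathrm{softmax}\big(W_{g^k}\, x\big)
$$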

Each task then connects to its own tower model to produce its logits, and its loss is computed against its label. The per-task losses can then be combined into the total loss, for example by a weighted sum.

Summary

Although multi-task models are a development trend for ranking models in recommendation systems, multi-objective learning is genuinely difficult: the sample proportions of the different targets differ, and how to combine the losses during training, when to stop training, how to combine the per-target scores online, and how to measure the overall effect in A/B tests all involve complex trade-offs. There is still plenty of room for experimentation.

Practice of ranking models in Huajiao live streaming

In the past two years, Huajiao Live has followed the industry trend and made various attempts at the ranking stage, such as (GBDT+)LR, Wide&Deep, (x)DeepFM, DIN, ESMM, MMOE and so on. Taking the Wide&Deep model as an example, we briefly introduce our whole ranking system.

The first part is the offline part. We mainly use Spark/HDFS to process and store data, including user data, anchor data, real-time data, behavior sequences and so on. Here are some of the features we used:

| Category | Features |
| --- | --- |
| User profile | Gender, age, device model, region, watch/gift/comment/barrage/forward/follow statistics, behavior sequence… |
| Anchor profile | Gender, age, region, level, category, tags, channel, talent show, stream duration, gifts, fans, various statistics and category rankings… |
| Real-time features | Real-time viewers, gifting, barrage, popularity, whether singing, whether dancing, game highlights… |


Then we join the label set with the profiles to form one day's dataset; multi-day datasets are concatenated to form the final overall dataset, meeting the requirements on data volume and coverage. Care must be taken here to avoid data leakage across time. The resulting dataset is at the terabyte level and is stored on HDFS. In the training stage, a single machine with multiple GPUs could not meet the speed requirements, so we used HBox, the distributed training platform of 360's private cloud, to complete daily training of the deep models.

Below is the structure diagram of our model:

Here are some of our results:

Offline:

| Model | AUC*100 |
| --- | --- |
| FM | 78.9 |
| Wide&Deep | 84.5 |
| DeepFM | 84.7 |


Online: the average viewing time on the popular channel increased by more than 80% after personalized recommendation was introduced.

Afterword

This article is just a brief introduction to and summary of the models commonly used in industry in recent years. In fact, besides each model's typical structure, there are many valuable details, such as formula derivations, parameter choices and engineering tricks, so it is recommended to read the original papers of the related models carefully.

Also note that there is no “best model”, only the “most suitable model”; it is not the case that the fancier and more complex the model, the better it performs online. Alibaba, for example, came up with the DIN model because its engineers first noticed something in the data:

Users' interests while browsing an e-commerce site are very diverse, and only part of the historical behavior affects whether a recommended item is clicked, rather than the entire history, i.e., “multi-peaked distribution” and “partial activation”.

It was this specific scenario requirement that led Alibaba to develop the DIN model to capture these characteristics of user interest and achieve breakthrough results.

Therefore, the correct order is to first have a specific “scenario”, and then, based on the characteristics of user behavior and data in that scenario, develop a model suited to it, rather than building a model first and then trying it out on the data, which gets things backwards.