The ranking process includes offline ranking and online ranking:

Offline ranking

User behavior data up to the day before yesterday (day T-2) is read as the training set to train the offline model. After training, yesterday's user behavior data (day T-1) is read as the validation set for prediction, and the offline model is evaluated against the prediction results. If the evaluation passes, the offline model is deployed to the scheduled task on the same day (day T), and the prediction task runs on schedule. Tomorrow (day T+1) we can observe the prediction performance of the updated offline model on today's user behavior data. (Note: there is a one-day lag in data production; the data of day T-1 is generated on day T.)

Online ranking

User behavior data up to the day before yesterday (day T-2) is read as the training set to train the online model. After training, yesterday's user behavior data (day T-1) is read as the validation set for prediction, and the online model is evaluated against the prediction results. If the evaluation passes, the online model is deployed online on the same day (day T) and performs the ranking task in real time. Tomorrow (day T+1) we can observe the prediction performance of the updated online model on today's user behavior data.

Here is another small trick for dataset partitioning: partition horizontally, either at random, by user, or by some other sample-selection strategy; or partition vertically by time span. For example, with one week of data, Monday to Thursday can be the training set, Friday and Saturday the test set, and Sunday the validation set. A minimal sketch of both strategies is given below.
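As a rough illustration, here is a minimal pandas sketch of the two strategies; the DataFrame and its columns (`user_id`, `dt`) are hypothetical placeholders for whatever your behavior log actually looks like.

```python
import pandas as pd

# df is assumed to be a user-behavior log with hypothetical columns
# user_id and dt (event date), plus feature/label columns.

def horizontal_split(df, frac=0.8, seed=42):
    """Horizontal partition: randomly assign whole users to train vs. test."""
    users = df["user_id"].drop_duplicates().sample(frac=frac, random_state=seed)
    train = df[df["user_id"].isin(users)]
    test = df[~df["user_id"].isin(users)]
    return train, test

def vertical_split(df):
    """Vertical partition by time span: Mon-Thu train, Fri-Sat test, Sun validation."""
    dow = pd.to_datetime(df["dt"]).dt.dayofweek  # Monday=0 ... Sunday=6
    train = df[dow <= 3]
    test = df[dow.isin([4, 5])]
    valid = df[dow == 6]
    return train, test, valid
```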

A ranking model can be used for rating prediction or user behavior prediction. Recommendation systems generally use the ranking model to predict user behavior, such as CTR, and then sort items by predicted CTR. The CTR prediction models commonly used in industry today are:

  • Wide model + feature engineering: LR/MLR + non-ID features (manual discretization / GBDT / FM)
  • Wide model + deep model: Wide&Deep, DeepFM
  • Deep model: DNN + feature embedding

The wide model here refers to the linear model. The advantages of linear models include:

  • Relatively simple; the computational complexity of training and prediction is low
  • Effort can be focused on finding new, effective features, and this work can be parallelized across people
  • Predictions can be explained through the feature weights

Let’s focus on the first one: wide model + feature engineering

Advantages of LR + discrete features:

In industry, continuous values are rarely fed directly to a logistic regression model as features. Instead, continuous features are discretized into a series of 0/1 features and then given to the logistic regression model, which has the following advantages (see the sketch after this list):

  1. Inner products of sparse vectors are fast to compute, and the results are easy to store and scale.

  2. Discretized features are robust to abnormal data: for example, a feature "age > 30" is 1, otherwise 0. Without discretization, an abnormal value such as "age of 300 years" would cause a large disturbance to the model.

  3. Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N variables, each variable gets its own weight, which is equivalent to introducing nonlinearity into the model; this improves the model's expressive power and fit.

  4. After discretization, feature crossing can be performed, going from M+N variables to M*N variables, which further introduces nonlinearity and improves expressive power.

  5. After discretization, the model is more stable. For example, if user age is discretized into the bucket 20-30, a user does not become a completely different sample just by turning one year older. Of course, samples near the bucket boundaries behave in the opposite way, so how to choose the intervals is an art.
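To make the recipe concrete, here is a minimal sketch of the discretize / one-hot / cross / LR pipeline using scikit-learn. The data, bin counts, and the "age > 30" threshold are made-up placeholders, not anything prescribed by the text.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: two continuous features (say, age and price) and a noisy binary label.
rng = np.random.default_rng(0)
X = rng.uniform([18, 1], [80, 500], size=(1000, 2))
y = ((X[:, 0] > 30) ^ (rng.random(1000) < 0.1)).astype(int)  # noisy "age > 30" rule

model = make_pipeline(
    # Discretize each continuous feature into 10 one-hot buckets (0/1 features).
    KBinsDiscretizer(n_bins=10, encode="onehot", strategy="quantile"),
    # Pairwise crossings of the bucketed features (the M+N -> M*N step).
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
```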

Li Mu pointed out that whether a model uses discrete or continuous features is really a trade-off between "a large number of discrete features + a simple model" and "a small number of continuous features + a complex model". You can either discretize and use a linear model, or keep continuous features and use deep learning. It comes down to whether you prefer to tinker with features or with models. Generally speaking, the former is easier and can be parallelized across n people with proven experience; the latter looks impressive so far, but how far it will go remains to be seen.

General understanding:

1) Simpler computation

2) A simpler model

3) Better generalization, less easily affected by noise

Why LR + discrete features

Hand-crafted features vs. machine-learned features

First, massive discrete features + LR is a common practice in industry, but it is not the holy grail. In fact, it is largely a pragmatic choice: LR's optimization algorithms are mature and can exploit sparse features for efficient computation, not because the approach is inherently superior.

It has been shown that adding GBDT or deep-learning features is genuinely helpful for CTR prediction.

Thinking about this a little more deeply: the last layer of a modern deep network, for binary classification, is essentially an LR. So hand-crafted or semi-hand-crafted features play the same role as the representations a deep network (whatever its internal structure) eventually produces. The difference is that deep learning generally produces a dense representation, while feature engineering produces a pile of sparse representations.

In some cases, hand-crafted features are actually very similar to what a neural network produces after several nonlinear layers. Machines are certainly better than humans at brute-force extraction of higher-order and nonlinear features, but even the best machine intelligence is sometimes no match for human "common sense". In particular, some business logic can be viewed as features pre-trained by the human brain on a much larger dataset, and therefore contains information beyond the dataset you are using for prediction.

In such cases, carefully crafted manual features will often outperform brute-force machine methods. So discretized features are not only an effective method mathematically; more importantly, they turn the data into a form that humans can understand and validate. Human business logic, of course, is not perfect.

Human business insight is a powerful complement (in many cases even the most important part) until machine intelligence conquers the realm of "common sense". Once machines fully master a domain, as with Go, proud human intuition can no longer resist brute-force machine computation, so in the future we will see machine intelligence invade more and more areas traditionally thought to rely on human "sense".

Advertising, of course, is not immune to this general trend.

Why LR suits sparse features

I thought about this question for a long time and ran into many such cases in daily projects. Indeed, GBDT easily overfits high-dimensional sparse features.

But I still did not know why. Later, thinking carefully about the characteristics of the two models, I found something interesting.

Suppose there are 10,000 samples with binary labels (0 and 1) and 100-dimensional features, of which 10 samples belong to class 1. Feature f1 takes values 0 or 1, and it happens that exactly those 10 positive samples have f1 = 1, while the other 9,990 samples have f1 = 0 (this situation is common with high-dimensional sparse features). As we all know, a tree model in this situation will easily grow a tree whose split node uses f1 and separates the data perfectly, but at test time the result turns out to be poor, because this feature only coincidentally matches y; this is what we call overfitting. At that time, though, I still did not understand why a linear model could handle this case well. Logically, after optimization a linear model would also end up with a formula like y = w1*f1 + w2*f2 + ..., where w1 is large enough to fit those 10 samples; since f1 is 1 only for those samples anyway, an overly large w1 has no effect on the other 9,990 samples.

After reflection, I realized the reason is that today's models generally carry regularization terms. The regularization term of a linear model like LR penalizes the weights: once w1 becomes too large, the penalty becomes large and pushes w1 back down. The penalty term of a tree model, by contrast, is usually the number of leaf nodes, the depth, and so on; as we all know, for the case above the tree needs only one split node to perfectly separate the 9,990 and 10 samples, so its penalty is tiny.

This is why linear models beat nonlinear models on high-dimensional sparse features: a regularized linear model is much less likely to overfit a sparse feature. A quick synthetic check is sketched below.
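The argument can be checked with a small synthetic experiment in the spirit of the example above; everything here (sample counts, sparsity rate, hyperparameters) is made up for illustration. The tree tends to split on f1 and mislabel the f1 = 1 rows at test time, while the L2 penalty keeps LR's weight on f1 in check.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n=10_000, d=100, n_pos=10, f1_matches_label=True):
    """High-dimensional sparse 0/1 features; f1 (column 0) equals the label
    exactly in the training set, but carries no signal in the test set."""
    X = (rng.random((n, d)) < 0.01).astype(float)
    y = np.zeros(n, dtype=int)
    y[:n_pos] = 1
    X[:, 0] = 0.0
    if f1_matches_label:
        X[:n_pos, 0] = 1.0                               # f1 == y by coincidence
    else:
        X[rng.choice(n, n_pos, replace=False), 0] = 1.0  # f1 unrelated to y
    return X, y

X_tr, y_tr = make_data(f1_matches_label=True)
X_te, y_te = make_data(f1_matches_label=False)

tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
lr = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_tr, y_tr)

# The tree usually puts its first split on f1 and predicts the f1 = 1 test rows
# as positive; the regularized LR keeps a moderate weight on f1 instead.
print("tree positive predictions on test:", int(tree.predict(X_te).sum()))
print("LR weight on f1:", lr.coef_[0, 0])
```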

Why is LR only suitable for discrete features

LR is a very simple linear model. Let's look at the formula again: y = w*x + b, which is a linear function. As we said before, linear functions have limited expressive power, so an activation function (the sigmoid) is applied on top to add nonlinearity to LR and turn the straight line into a curve, giving a better fit.
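Written out in full (the standard textbook form, not anything specific to this article):

$$\hat{y} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$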

The biggest difference between discretization and keeping a feature continuous is this: a continuous field remains a single feature, whereas after discretization the number of features equals the number of distinct keys (possible values) the field can take.

  1. The first point: after a single variable is discretized into N variables, each variable has its own weight; together with the activation function, this is equivalent to adding nonlinearity to the model, which improves its expressive power and fit.
  2. Second, discretized features are very robust to abnormal data: for example, a feature "age > 30" is 1, otherwise 0. Without discretization, an abnormal value such as "age of 300 years" would seriously interfere with the model, because the abnormal feature value leads to an abnormal weight, i.e. the value of w.
  3. Third, discrete features are easy to add and remove, which makes rapid model iteration easier.
  4. Fourth, some readers may worry that too many features make the computation slow, but LR is a linear model and the internal computation is vectorized rather than looped. Inner products of sparse vectors are fast, and the results are easy to store and scale, so there is no need to worry that LR cannot cope with a large number of features. (People say GBDT cannot use discrete features not because it cannot process them, but because after discretization there are too many features, the decision trees get too many leaves, and traversal becomes too slow.)

Therefore, massive discrete features + LR is one common practice in industry; a few continuous features + a complex model, such as GBDT, is another.

Converting between continuous and discrete features

Now we know the difference between discrete and continuous features and where each applies. But real data does not come as discrete as we would like.

If you discretize an asset field so that each distinct value is a feature, you end up with a huge number of features: 999, 1000, and 1001 become three different features, which is not what we want. Is a one-dollar difference really that important? What we would prefer is for values in a certain interval to be mapped to the same feature. For example, people with assets below 100W (here W means 万, i.e. 10,000) are considered poor, those between 100W and 1000W middle class, and those above 1000W rich.

Those might be exactly the three features we want to extract from this field. So we use bucketing to convert continuous features into discrete ones: the range of the continuous feature is divided into different buckets, and each value is mapped to its bucket. Discrete features can also be converted into continuous ones, for example by sorting the data by a time field and turning the discrete values within a time window into a numeric value; I don't know the details well, so look it up if you are interested. A minimal bucketing sketch follows.
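A minimal sketch of the bucketing idea with pandas; the `assets` column and the 100W / 1000W cut points simply mirror the example above.

```python
import pandas as pd

# Hypothetical "assets" column, in units of 10,000 (so 100 means 100W).
df = pd.DataFrame({"assets": [35, 250, 999, 1001, 5000]})

# Below 100W -> poor, 100W to 1000W -> middle class, above 1000W -> rich.
df["assets_bucket"] = pd.cut(
    df["assets"],
    bins=[-float("inf"), 100, 1000, float("inf")],
    labels=["poor", "middle_class", "rich"],
)

# One-hot encode the buckets so they become 0/1 features for LR.
features = pd.get_dummies(df["assets_bucket"], prefix="assets")
```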

Normalization

Normalization is another important step. After extracting features, we find that their value ranges differ, especially for continuous features: feature 1 lies in 0 to 1, while feature 2 lies in 0 to 1000. Now consider gradient descent, as shown below:


On the left is the un-normalized case, where the loss surface for gradient descent looks like a flat, elongated bowl, so more iterations are needed to converge. On the right is the normalized case, a rounder bowl, so gradient descent converges faster. So what exactly is normalization? It simply compresses our feature values into the range 0~1 so that all features are on a comparable scale, as in the diagram below:


The features start out distributed as on the far left; through normalization, their distribution gradually shifts toward the one on the far right.

So normalization is useful when our features follow different distributions. Deep learning has batch norm as well: the output of each layer is normalized before being handed to the next layer for computation. A minimal scaling sketch:
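A minimal min-max scaling sketch (the numbers are placeholders that mimic the 0~1 vs. 0~1000 ranges above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Feature 1 lies in [0, 1], feature 2 in [0, 1000]; min-max scaling
# compresses both into [0, 1] so gradient descent sees a rounder bowl.
X = np.array([[0.2, 150.0],
              [0.9, 980.0],
              [0.5,  10.0]])
X_scaled = MinMaxScaler().fit_transform(X)
```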

GBDT coding, LR modeling

When using LR for CTR estimation, a lot of feature engineering is required: continuous features are discretized, the discretized features are one-hot encoded, and finally second-order or third-order feature crosses are built to obtain nonlinear features. This feature engineering raises several problems:

  • How do you choose the split points for a continuous variable?
  • How many bins is it reasonable to discretize into?
  • Which features should be crossed?
  • How many orders of crossing: second, third, or more?

Usually this is done by experience: keep trying combinations and pick suitable parameters based on offline evaluation.

Using GBDT for encoding, however, solves all of the above problems at a stroke. Split points are not chosen by subjective experience; instead, the split points and the number of bins are selected objectively according to information gain. The path from the root of each decision tree to a leaf passes through different features; this path is itself a feature combination, and it contains second-order, third-order, and even higher-order crosses.

Why not just use GBDT instead of GBDT+LR? Because GBDT is hard to serve for online prediction, and its training time complexity is higher than LR's. So in practice you can train the GBDT offline and then use the model as part of an online ETL step. A sketch of the idea follows.
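Here is a minimal sketch of the GBDT-encoding + LR idea with scikit-learn; the dataset, split, and hyperparameters are placeholders, and this is only one common way to wire it up, not necessarily the exact setup from the Facebook paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# 1) Train the GBDT offline: shallow trees, a few hundred of them.
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)

# 2) "GBDT coding": each sample becomes the vector of leaf indices it falls
#    into across all trees, which is then one-hot encoded.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(gbdt.apply(X_gbdt)[:, :, 0])        # apply() -> (n_samples, n_trees, 1)

# 3) Train LR on the encoded sparse features; online, only steps 2 and 3 run.
leaves_lr = enc.transform(gbdt.apply(X_lr)[:, :, 0])
lr = LogisticRegression(max_iter=1000).fit(leaves_lr, y_lr)
```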

The Facebook paper reports that GBDT + LR performs better than GBDT alone, and even that LR performs better than GBDT. I was skeptical, so I ran a local experiment on five datasets from MLbench (Diabetes, Satellite, Sonar, Vehicle, and Vowel). On all metrics the ordering was GBDT > GBDT + LR > LR, which matches my prior expectation.

Of course, my datasets are too limited to generalize from. Still, judging from the experimental numbers, there is no qualitative difference between these algorithms on the various metrics. So in practical work, finding important features is the first priority; as for the algorithm, pick one that can get online quickly and is good enough, then iterate and optimize.

Summary of other conclusions:

Below is a brief record of other takeaways from the paper, for later review:

  • Online learning: LR vs. BOPR. BOPR performs slightly better than LR, but LR is simpler, so LR was chosen in the end.
  • Number of GBDT iterations: most of the gains are achieved within about 500 trees.
  • GBDT trees do not need to be deep; a depth of 2 or 3 generally meets the requirements.
  • More features are not always better: just the top 400 features by importance already perform well.
  • Historical features are more important than the user's current context features and are more stable over the prediction horizon. However, context data is effective for the cold-start problem.
  • There is no need to use all the samples; training on 10% of the data performs almost as well as using 100%.
  • If negative samples are downsampled, the probability predicted by the model must be recalibrated. The corrected formula is q = p / (p + (1 - p) / w), where q is the calibrated probability, p is the probability predicted by the model, and w is the negative downsampling rate. A quick numeric check of this formula follows.
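A tiny numeric check of the calibration formula, assuming w is the fraction of negatives kept (the values are made up):

```python
def recalibrate(p, w):
    """Correct a probability p predicted by a model trained on data where
    negatives were downsampled at rate w (e.g. w = 0.1 keeps 10% of negatives)."""
    return p / (p + (1.0 - p) / w)

# The model predicts 0.3 on the downsampled data with w = 0.1; the calibrated
# probability on the original distribution is much smaller.
print(recalibrate(0.3, 0.1))  # ~0.041
```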
