This article is from the OPPO Internet Basic Technology team. Please credit the author when reprinting. You are also welcome to follow our official account, OPPO_tech, where we share OPPO's latest Internet technologies and activities.
Some people can derive machine learning formulas by hand, understand an algorithm's implementation deeply at the source-code level, or rank well in Kaggle competitions, yet still stumble in real projects. The root cause is an incomplete machine learning knowledge system.
For example, in our user interest label production system we need to answer questions such as: how to do feature engineering well, how to obtain high-quality positive and negative samples, how to handle imbalanced data sets, how to choose the evaluation method, which algorithm to select, how to tune its parameters, and how to interpret the model.
Experience directly related to the algorithms is the easiest to obtain from books or online. The rest is closely tied to one's own business, is rarely shared publicly, and is therefore relatively hard to acquire; it is generally accumulated through project practice. Unfortunately, this hard-to-get empirical knowledge is what matters most at this stage of the project: optimizing the modeling data set yields far more gain than algorithm selection and tuning.
This paper shares practical experience from our label project in two parts. The first part covers the general modules: feature engineering, building evaluation sets, and choosing evaluation metrics. The second part covers the algorithm-related modules: algorithm selection, feature processing, hyperparameter tuning (on Spark and scikit-learn), and model interpretation.
1. Feature and Sample Engineering
1.1 Feature Engineering
Feature engineering mainly includes:
Feature usage scheme
Based on the business, extract as many candidate features as possible from the relevant data sources, and assess their availability (coverage, accuracy, and so on).
Feature acquisition scheme
The main concern is how to capture and store features.
Feature monitoring
Monitor important features so that a drop in feature quality does not silently degrade the model.
Feature processing
This is the core part of feature engineering, covering feature cleaning, feature transformation, feature selection and other operations.
Clearly, feature engineering is large and complex, and in big-data scenarios even routine operations can run into performance problems. Being stuck on engineering issues as the deadline approaches is not acceptable.
Our feature engineering platform OFeature was created for exactly this reason. It is committed to solving the consistency, usability and performance problems of feature engineering, freeing users from the engineering "swamp" and making it very simple to obtain a modeling data set.
The figure below shows that a high-quality modeling data set is mainly composed of a rich feature set, appropriate feature processing and high-quality labels. Feature processing depends on the chosen algorithm and is covered in the algorithm chapter of this paper; this chapter only shares experience in acquiring features and labels.
Figure 1. Process of obtaining modeling data set using OFeature
1.1.1 Feature Set
The available feature set is closely tied to the industry and even to the specific project. Taking Internet advertising as an example, features can be roughly divided into the following categories:
User attributes
Age, sex, education, occupation, region, etc.
Behavioral biases
These are usually derived from statistics on historical behavior (or predicted by a model), for example using a user's payments in the last month to infer his payment preference.
Behavioral intention
A user who has never been exposed to or clicked maternal-and-infant product ads, but who has recently started searching for maternal-and-infant content in the browser, is very likely in the target group of those advertisers. Our behavioral intention features mainly fall into two categories: information-feed content exposure/click behavior and browser queries.
Behavior patterns
A behavior pattern is a behavior with consistent meaning, summarized by analyzing the relationship of consumer behavior to time and space and the temporal/spatial ordering of a series of behaviors; related behaviors are then predicted from these consistent patterns [1]. As shown in the figure below: assume there are three apps, count the number of times user USER1 pays for each app on a weekly basis (exposure, click, download and registration are handled similarly), and finally splice multiple weeks and multiple apps together into one sequence (a small pandas sketch follows the figure).
Figure 2. User game behavior sequence
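The construction behind Figure 2 can be reproduced with a simple pivot. The sketch below is a minimal pandas illustration; the log schema (`user_id`, `app`, `week`, `pay_cnt`) and the column naming are assumptions made for this example, not our actual production tables.

```python
import pandas as pd

# Hypothetical weekly payment log: one row per (user, app, week).
log = pd.DataFrame({
    "user_id": ["USER1", "USER1", "USER1", "USER1"],
    "app":     ["A", "A", "B", "C"],
    "week":    ["2020w01", "2020w02", "2020w01", "2020w02"],
    "pay_cnt": [2, 0, 5, 1],
})

# One row per user; every (app, week) pair becomes a column, so several weeks
# and several apps are spliced into a single wide behavior-sequence vector.
seq = log.pivot_table(index="user_id", columns=["app", "week"],
                      values="pay_cnt", fill_value=0, aggfunc="sum")
seq.columns = [f"{app}_pay_{week}" for app, week in seq.columns]
print(seq)
```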
1.1.2 Label
How labels are obtained depends on the specific business:
Custom label service
Modelers select positive and negative samples according to business requirements; for example, users converted by a certain app are the positive samples and unconverted users are the negative samples.
LookAlike
This is a typical PU learning (Positive and Unlabeled learning) problem: the seed pack (the positive samples) may be provided by a third party, so the corresponding negative samples cannot be determined directly. We therefore try various candidate sets (all users, active feed users, active game users, their intersections, and so on) as negatives, use PU-learning techniques such as the Spy Technique to extract higher-quality negatives from them, and finally compare the schemes to pick the best negative samples (a minimal sketch is given below).
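As an illustration of the Spy Technique, the sketch below hides a fraction of the positives ("spies") inside the unlabeled candidate set, trains an ordinary classifier, and keeps only the unlabeled users whose scores fall below almost all of the spies. It is a minimal sketch under assumptions: the inputs are dense NumPy feature matrices, and the spy fraction and quantile are illustrative values, not our production settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def spy_reliable_negatives(X_pos, X_unlabeled, spy_frac=0.1, quantile=0.05, seed=42):
    """Spy Technique sketch: return unlabeled samples that look like reliable negatives."""
    rng = np.random.RandomState(seed)
    n_pos = len(X_pos)
    spy_idx = rng.choice(n_pos, size=max(1, int(spy_frac * n_pos)), replace=False)
    spy_mask = np.zeros(n_pos, dtype=bool)
    spy_mask[spy_idx] = True

    # Positives without spies are labeled 1; the unlabeled set plus the spies are labeled 0.
    X_train = np.vstack([X_pos[~spy_mask], X_unlabeled, X_pos[spy_mask]])
    y_train = np.hstack([np.ones((~spy_mask).sum()),
                         np.zeros(len(X_unlabeled) + spy_mask.sum())])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Threshold at a low quantile of the spies' scores, so most spies stay above it;
    # unlabeled users scoring below it are treated as reliable negative candidates.
    threshold = np.quantile(clf.predict_proba(X_pos[spy_mask])[:, 1], quantile)
    scores_u = clf.predict_proba(X_unlabeled)[:, 1]
    return X_unlabeled[scores_u < threshold]
```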
Given the typical ratios between exposure and click (or download, registration, payment), the positive and negative samples are often extremely imbalanced. In that case the test set must still follow the true distribution (imbalanced data), while the training set needs some technique to handle the imbalance.
For example, if the positive samples are plentiful, simply downsample the negatives; otherwise use SMOTE to synthesize additional positives. The validation set keeps the real distribution, which makes k-fold cross-validation awkward, so we instead hold out a real-distribution validation set and rely on the early-stopping parameter (a sketch follows).
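A minimal sketch of this resampling strategy, assuming a feature matrix `X` and 0/1 labels `y`, using the `imbalanced-learn` package for SMOTE and undersampling; the positive-count threshold is purely illustrative:

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Hold out the validation set first so it keeps the true (imbalanced) distribution.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

if (y_train == 1).sum() >= 50_000:                   # illustrative: "enough" positives
    sampler = RandomUnderSampler(random_state=42)    # downsample the negatives
else:
    sampler = SMOTE(random_state=42)                 # otherwise synthesize positives
X_res, y_res = sampler.fit_resample(X_train, y_train)

# Train on (X_res, y_res); early-stop and evaluate on the untouched (X_val, y_val).
```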
Finally, note that some features are statistics over historical data, so the label time must come after the feature window to prevent feature leakage (see the sketch below).
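For example, with event logs indexed by time, the feature window must close before the label window opens. The `log`/`conversions` frames, the `event_time` column and the window lengths below are assumptions for illustration:

```python
import pandas as pd

cutoff = pd.Timestamp("2020-01-01")

# Features: statistics over the 30 days strictly before the cutoff.
feature_log = log[(log["event_time"] >= cutoff - pd.Timedelta(days=30)) &
                  (log["event_time"] < cutoff)]

# Labels: conversions observed in the 7 days starting at the cutoff,
# so every label strictly follows the feature window and nothing leaks.
label_log = conversions[(conversions["event_time"] >= cutoff) &
                        (conversions["event_time"] < cutoff + pd.Timedelta(days=7))]
```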
1.2 Construct evaluation sets and determine evaluation indicators
1.2.1 Evaluation sets
To evaluate the effect of feature engineering or algorithm tuning on the model, we need representative evaluation sets; only if the metrics improve on most of them do we consider the change a real optimization. Representative evaluation sets are chosen along two dimensions:
Business classification
Split into "label business" and "LookAlike business"; within the label business we further separate game and non-game evaluation sets.
Categorize by magnitude
This mainly refers to the seed-pack size in LookAlike. Packages provided by third parties vary greatly in size, so representative evaluation sets should be selected with the actual size distribution in mind.
1.2.2 Evaluation indicators
The feature engineering, label selection, resampling ratio and algorithm tuning discussed below all need to be judged by a unified set of metrics.
The most convincing metrics are of course the online ones, but online evaluation is expensive, so we cannot run an online test for every update; the usual strategy is to go online only after the offline metrics improve significantly. This paper therefore only covers the offline evaluation metrics, which fall into two categories, as shown in the figure below.
Figure 3. Offline evaluation metrics
Model-intrinsic metrics
Precision, recall, F1 and similar metrics require a threshold to be fixed first and are then computed from the resulting confusion matrix. AUC and AUPR, by contrast, need no manual threshold: every predicted score is used as a threshold in turn, each threshold gives one point on the curve, and the area under the curve is computed. The choice should follow the business. Our LookAlike task, for example, mostly cares about ranking (conversions should concentrate at the top of the predicted probabilities), so AUC and AUPR are the right metrics (AUPR is recommended for imbalanced data sets).
The tagging task, by contrast, aims to find people who are genuinely interested in "XX game": we set a threshold to decide whether a given user is interested, so the model's predicted probability values themselves matter. In addition, logloss is used as the early-stopping criterion during tuning (a short scikit-learn sketch follows).
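A minimal scikit-learn sketch of these offline metrics; `y_true` (0/1 labels) and `y_score` (predicted probabilities) are assumed to come from the evaluation set, and the 0.5 threshold is only an example:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             log_loss, precision_score, recall_score, f1_score)

auc = roc_auc_score(y_true, y_score)             # ranking quality (ROC AUC)
aupr = average_precision_score(y_true, y_score)  # area under the PR curve; preferred when imbalanced
ll = log_loss(y_true, y_score)                   # probability quality; our early-stopping criterion

# Threshold-based metrics for the tag task: fix a cut-off, then read the confusion matrix.
y_pred = (np.asarray(y_score) >= 0.5).astype(int)
print(auc, aupr, ll,
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```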
Business indicators
- Intersection metric:
A common approach is to intersect the tag package with the population actually reached by the tag's ads, and compare the intersection's metrics with the historical metrics. For example, the model scores the whole population as of 2020-01-01, the scores are sorted in descending order, and the top-K users form the "XX game" tag package. That package is then intersected with the users actually exposed to "XX game" ads on 2020-01-01, and the metrics of the intersection are computed. The advantage of the intersection metric is that it is intuitive and comparable with history; the disadvantage is that a small intersection makes the statistics meaningless.
- X@N:
Another metric, X@N, evolved from P@N. Unlike the intersection metric, we take all users exposed to "XX game" on 2020-01-01, sort them by predicted probability in descending order, and split them into segments by quantile (N is the number of quantile buckets). For each segment we then compute exposure, click, download, registration, payment, CTR, DTR, conversion rate and other indicators (the X). Its guiding principle is that the business indicators should increase monotonically with the predicted score. The advantage is a sufficient sample size; the disadvantage is that it cannot be compared with historical metrics (a sketch follows).
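A pandas sketch of X@N: the exposed population is sorted by predicted score, cut into N equal-sized quantile buckets, and the business indicators are aggregated per bucket. The column names (`pred_prob`, `exposed`, `clicked`, `downloaded`) are assumptions for illustration:

```python
import pandas as pd

def x_at_n(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    # Rank by predicted probability (bucket 0 holds the highest scores),
    # then split into n equal-sized buckets by quantile of the rank.
    ranks = df["pred_prob"].rank(method="first", ascending=False)
    buckets = pd.qcut(ranks, q=n, labels=False)
    agg = df.assign(bucket=buckets).groupby("bucket").agg(
        exposures=("exposed", "sum"),
        clicks=("clicked", "sum"),
        downloads=("downloaded", "sum"),
    )
    agg["ctr"] = agg["clicks"] / agg["exposures"]
    # For a good model these indicators should fall monotonically as the
    # bucket index (i.e. lower predicted score) increases.
    return agg
```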
Besides the business scenario, we also consider changes in the data distribution. When the label distribution changes (for example when different seeds or negative samples are tried), AUPR is often no longer instructive, while X@N and the intersection metric remain usable because they are based on historical data, which does not change. Consequently, metric comparisons between models are only meaningful when the test-set distribution is held fixed, for example when adding or removing features, tuning parameters, or adjusting the positive/negative ratio of the training set.
2. Algorithm
The "No Free Lunch" (NFL) theorem says that no single algorithm fits all situations, and the choice must be made according to the actual case [2]. However, the NFL talks about applicability only in terms of effectiveness; in a real project we must judge applicability by both the algorithm's effectiveness and its engineering cost.
For example, an algorithm may perform very well in experiments but be impossible to put into the project pipeline at this stage (perhaps it lacks cross-platform support or has performance problems); such an algorithm is clearly not applicable. Or suppose algorithms A and B can both be deployed, A is simple and fast while B is complex and time-consuming, and B brings only a small gain; in that case B is not applicable either.
For these reasons, our LookAlike project uses Logistic Regression at the current stage. LR's effectiveness is acceptable, its computation and resource overhead are small, the model is interpretable and easy to parallelize, and it suits big-data, high-dimensional scenarios. The rest of this article covers the issues that deserve attention when using LR in practice.
2.1 Feature Processing
LR places high demands on feature engineering, and one key step in feature processing is discretizing continuous values and one-hot encoding them. After adopting discretization and one-hot encoding in the LookAlike project, the memory footprint and training time of the feature data dropped significantly and AUPR improved noticeably (a small sketch of this step follows the list below). The reasons are as follows:
- One-hot data is sparse; sparse storage saves memory and sparse vector inner products are faster.
- Discretizing continuous values reduces the interference of outliers and makes the model more robust. The reason for further one-hot encoding the discrete values is that the N discrete values of a single variable become N separate features, each with its own weight, which improves the model's nonlinear capacity.
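A minimal scikit-learn sketch of this step: quantile binning of continuous columns with a sparse one-hot output. `X_continuous` is an assumed numeric matrix and the bin count is illustrative; on Spark the equivalent building blocks would be QuantileDiscretizer plus OneHotEncoder.

```python
from sklearn.preprocessing import KBinsDiscretizer

# Each continuous column is cut into 10 quantile bins, and every bin becomes its
# own 0/1 feature with its own weight; the output is a scipy sparse matrix.
disc = KBinsDiscretizer(n_bins=10, encode="onehot", strategy="quantile")
X_sparse = disc.fit_transform(X_continuous)
print(X_sparse.shape, type(X_sparse))
```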
2.2 Hyperparameters
LR has few hyperparameters, and there is little to recommend beyond the usual ordering of regularizers, elastic net > L1 > L2 ('>' meaning better). However, when we used the LR implementations of scikit-learn and Spark ML, we got different models from the same data. This section therefore focuses on the important differences between the two platforms and on how to make their models consistent.
Using L2 regularization and the L-BFGS optimizer on both platforms, we summarize their differences as follows:
2.2.1 Differences in objective functions
As shown in the formulas below, the objective functions of scikit-learn and Spark differ by a factor of n (the number of samples). Since scaling the objective by n does not change the optimum, we only need to keep the regularization factors in the relation λ = 1/(nC).
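The original formula image is not reproduced here; under the binary L2-regularized setting, and with labels taken as ±1, the two objectives can be written as follows (a reconstruction based on the two libraries' documented objectives, consistent with the 0.3/10 = 0.03 conversion used in the code below):

$$
\min_{w,c}\; \frac{1}{2}\lVert w\rVert_2^2 \;+\; C\sum_{i=1}^{n}\log\!\left(1+e^{-y_i\,(w^\top x_i + c)}\right) \qquad \text{(scikit-learn)}
$$

$$
\min_{w,c}\; \frac{\lambda}{2}\lVert w\rVert_2^2 \;+\; \frac{1}{n}\sum_{i=1}^{n}\log\!\left(1+e^{-y_i\,(w^\top x_i + c)}\right) \qquad \text{(Spark ML)}
$$

Multiplying the Spark objective by nC makes its loss term equal to scikit-learn's, so the two optima coincide exactly when nCλ = 1, i.e. λ = 1/(nC).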
2.2.2 Differences in data standardization
scikit-learn does not standardize the data by default, while Spark standardizes it inside the algorithm by default, and in a slightly different way. Below is an example from StackOverflow [3], extended into a cross experiment:
Python:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = np.array([
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, 0.4232874171873786],
    [0.1863463229359709, 0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, 0.8972778242305388],
    [0.7006904841584055, 0.5607635619919824]])
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# sqrt(n-1)/sqrt(n) factor for getting the same standardization as Spark
# (Spark uses the sample std, sklearn's StandardScaler the population std)
Xsc = StandardScaler().fit_transform(X) * np.sqrt(9.0) / np.sqrt(10.0)

l = 0.3
e = LogisticRegression(fit_intercept=True,
                       penalty='l2',
                       C=1 / l,
                       max_iter=100,
                       tol=1e-11,
                       solver='lbfgs',
                       verbose=1)
e.fit(Xsc, y)
print(e.coef_, e.intercept_)
```
SparkML:
```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors

// Run in spark-shell; toDF needs the implicits of the active SparkSession.
import spark.implicits._

val sparkTrainingData_orig = Seq(
  (0.0, Vectors.dense(-0.7306653538519616, 0.0)),
  (1.0, Vectors.dense(0.6750417712898752, 0.4232874171873786)),
  (1.0, Vectors.dense(0.1863463229359709, 0.8163423997075965)),
  (0.0, Vectors.dense(-0.6719842051493347, 0.0)),
  (1.0, Vectors.dense(0.9699938346531928, 0.0)),
  (1.0, Vectors.dense(0.22759406190283604, 0.0)),
  (1.0, Vectors.dense(0.9688721028330911, 0.0)),
  (0.0, Vectors.dense(0.5993795346650845, 0.0)),
  (0.0, Vectors.dense(0.9219423508390701, 0.8972778242305388)),
  (0.0, Vectors.dense(0.7006904841584055, 0.5607635619919824))
).toDF("label", "features_orig")

val sparkTrainingData = new StandardScaler().
  setWithMean(true).
  setInputCol("features_orig").
  setOutputCol("features").
  fit(sparkTrainingData_orig).
  transform(sparkTrainingData_orig)

// Make regularization 0.3/10 = 0.03
val logisticModel = new LogisticRegression().
  setRegParam(0.03).
  setLabelCol("label").
  setFeaturesCol("features").
  setTol(1e-12).
  setMaxIter(100).
  fit(sparkTrainingData)

println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
```
On top of this example, we ran a cross experiment over the standardization options to verify which combinations give consistent results.
As shown in the table below: when neither side standardizes, both the coefficients and the intercepts match. When both sides standardize and Spark ML standardizes in the preprocessing stage, the coefficients match but the intercepts differ.
Note: the intercept is fitted in all runs and the L2 penalty factor has been converted | sklearn without standardization | sklearn with standardization |
---|---|---|
Spark without standardization | coefficients equal, intercepts equal | coefficients and intercepts differ |
Spark standardized in preprocessing and inside the algorithm | coefficients and intercepts differ | coefficients equal, intercepts differ |
Spark standardized in preprocessing, not inside the algorithm | coefficients and intercepts differ | coefficients equal, intercepts differ |
Spark not standardized in preprocessing, standardized inside the algorithm | coefficients and intercepts differ | coefficients and intercepts differ |
2.2.3 Differences in how tol is used and in the L-BFGS approximation of the inverse Hessian
tol is the convergence criterion of the optimizer. In scikit-learn, tol applies only to the gradient, i.e. it decides whether the gradient has converged; the ftol used to decide whether the objective has converged keeps its default value and cannot be changed by the user.
L-BFGS approximates the inverse Hessian with the most recent m updates. scikit-learn and Spark use different values of m, 10 and 7 respectively, and this parameter is not exposed to users.
So we need to check whether the different convergence criteria and the different inverse-Hessian approximations have a significant impact on the model. The earlier example shows no impact, but its data set is simple and converges in only 6-8 iterations. We therefore tested on part of the Criteo samples: 3.6 million training samples with 1 million feature dimensions, using the no-standardization strategy according to the results above.
As shown in the table below (only the first few coefficient dimensions are listed), after 800 iterations there is no significant difference between the two sides.
Train size: 3.6M samples × 1M features, 800 iterations | w1 | w2 | w3 | w4 | w5 | w6 |
---|---|---|---|---|---|---|
sklearn | 0.0000 | 0.0668 | 0.0123 | 0.1432 | 0.0000 | 0.0070 |
spark | 0.0000 | 0.0687 | 0.0131 | 0.1505 | 0.0000 | 0.0079 |
In summary, when building LR models on different platforms, pay attention to the differences in standardization and in the penalty factor.
2.3 Model Interpretation
The LR model is highly interpretable: after sorting the model coefficients, the meaning of each feature is clear. However, if you also look at the coverage of the top-ranked features in a real project, you will often find it is very small. This is the small-sample bias problem inherent in maximum likelihood estimation [4].
Here is an extreme example to illustrate the problem. Assume the coverage of feature Xj is as follows:
After discretization + one-hot, all features are 0/1 | Xj = 0 coverage | Xj = 1 coverage | Total samples |
---|---|---|---|
Positive samples (y = 1) | 99 | 1 | 100 |
Negative samples (y = -1) | 10000 | 0 | 10000 |
Xj total coverage | 10099 | 1 | |
The maximum-likelihood objective is then optimized with gradient descent, and for each feature j the weight update is the standard LR gradient step (h(x) is the sigmoid prediction and α the learning rate):

$$w_j \leftarrow w_j + \alpha \sum_{i}\left(y_i - h(x_i)\right) x_{ij}$$

Since x_ij = 1 for only the single positive sample (y = 1) and x_ij = 0 for every other sample, the sum reduces to the term of that one sample, α(1 − h(x)). Because h(x) is a sigmoid, 0 < h(x) < 1, so this term is always positive: every iteration moves w_j in the same direction, and the weight of such a rare feature ends up disproportionately large.
Therefore, when interpreting the model we should not just rank the coefficients, but also take the coverage of each feature into account (a small sketch follows).
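A small sketch of such a combined view, assuming a fitted LR `model`, a `feature_names` list and per-feature `coverage` counts; the flagging thresholds are purely illustrative:

```python
import pandas as pd

report = pd.DataFrame({
    "feature": feature_names,
    "weight": model.coef_.ravel(),
    "coverage": coverage,   # e.g. number of samples where the one-hot feature equals 1
}).sort_values("weight", ascending=False)

# Flag large weights that are supported by almost no samples (small-sample bias).
report["suspect"] = (report["weight"].abs() > 1.0) & (report["coverage"] < 100)
print(report.head(20))
```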
3. Summary
Based on actual projects, this paper summarizes practical machine learning experience covering sample engineering, feature engineering, choice of evaluation metrics, hyperparameters and model interpretation. Given the author's limited knowledge, some of the views may be partial or even wrong; corrections from readers are welcome.
4. References
1. www.jianshu.com/p/c7957ac16…
2. blog.csdn.net/qq_28739605…
3. stackoverflow.com/questions/4…
4. statisticalhorizons.com/logistic-re…