Author: Yi Yuntian
1 Background
In the early machine learning recommendation scenarios of Cloud Music, most models relied on offline machine learning, with the model updated at the day level (T+1). Because users, anchors, and user-generated content change frequently, and the external environment shifts abruptly (hot topics, product-form changes, and so on), the offline mode suffers from serious lag. A real-time model, by contrast, can capture rapid global changes as they happen, improving traffic efficiency, reducing wasted exposure, and producing more accurate real-time personalized recommendations. For more background, see: mp.weixin.qq.com/s/uuogZ3aPM…
The real-time pipeline of the Cloud Music model consists of three stages: real-time sample generation, real-time model training, and real-time model deployment. The first stage is real-time sample generation, also known as online sample generation. Only when samples are generated in real time, at second-level latency, can the model be trained in real time and finally pushed online in real time for prediction and ranking.
This article focuses on real-time sample generation; real-time model training and real-time deployment will be covered in follow-up articles. Before we get to that, let's define what a sample is.
2 What is a sample
A sample is a data record required for training a machine learning model. It is usually an annotated record composed of various attributes of the user and item objects.
Here, the object attributes are features, which can be used directly for model learning and training after processing; they are not the raw features. The label indicates which outcome each record belongs to, which leads to the distinction between positive and negative samples.
- Positive sample: a sample judged positive according to the training objective. For example, if the learning objective is CTR, an exposure that the user clicks on is a positive sample, with label = 1.
- Negative sample: a sample judged negative according to the training objective. For example, if the learning objective is CTR in the home-page live scene, an exposure that the user does not click on is a negative sample, with label = 0.
Samples also differ in how and how often they are generated, and can be divided into offline samples and real-time samples.
- Offline samples: generated mainly for offline machine learning, using Spark batch processing; the resulting samples have a T+1 cycle.
- Real-time samples: generated for real-time (incremental) machine learning, using Flink to produce samples in real time; the resulting samples have a second-level cycle.
The two key variables of a sample are therefore the features and the label, so the sample generation process breaks into three parts: feature generation, label generation, and feature-label association. Now that we know what a sample is, let's look at the problems we face in actual sample production.
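The feature + label structure described above can be illustrated with a minimal sketch (the field names and values here are hypothetical, for illustration only):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Sample:
    """One training sample: processed features plus a label."""
    user_id: str
    item_id: str
    features: Dict[str, List[float]]  # extracted (not raw) feature values
    label: int                        # 1 = positive (e.g. click), 0 = negative

# A positive sample for a CTR objective: the user clicked after exposure.
pos = Sample(user_id="50001", item_id="10001",
             features={"fea1": [0.0, 1.0, 3.0]}, label=1)
# A negative sample: exposure without a click.
neg = Sample(user_id="50001", item_id="10002",
             features={"fea1": [0.0, 1.0, 3.0]}, label=0)
```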
3 Inconsistency
The online model prediction process (predict) and the offline model training process (train) are causally related and mirror each other: their logic and data structures are the same. This can be understood with the following schematic diagram:
Feature1…N here refers to the extracted model features, which are obtained from the raw features through a series of feature-extraction computations. The schematic diagram is as follows:
From the two diagrams above, we can see that what train and predict share is Feature1…N, so that is the variable through which online-offline discrepancies arise. Inconsistency directly degrades the model's effectiveness. The most direct symptom is that the AUC evaluated offline looks good, but the actual A/B results after the model goes online are often unsatisfactory, as shown in the figure below:
Feature inconsistency mainly comes from two sources: inconsistent feature input (the raw feature X in the figure above) and inconsistent feature-extraction computation (F(X) in the figure above). Feature crossing is the typical case of inconsistent feature input.
3.1 Feature crossing
When splicing sample features, a sample at time T should be associated with the features as of time T, but in the original practice feature crossing could not be avoided. Let's look at how feature crossing occurs:
There are two main situations in which crossing occurs:
- Real-time feature crossing: real-time features, usually statistical or sequence features, change constantly and are hard to pin down: a feature may hold one value this minute and another the next. For example, suppose user U1's click-sequence feature at prediction time t-1 is u_click_seq: [1001, 1002, 1005]. The model recommends items 1006 and 1008; after exposure, U1 clicks item 1008. By the time the UA (user action) log flows back and the sample is spliced, the real-time feature task may already have updated the feature, so the click sequence fetched at splicing time is u_click_seq: [1001, 1002, 1005, 1008], whereas the correct value for the sample is still u_click_seq: [1001, 1002, 1005]. This is one form of feature crossing.
- System-delay crossing: offline features are generally processed by day, and given the various data pipelines involved, processing is usually delayed. Once the offline features are computed, they are pushed online to serve model-prediction requests. For example, on March 18 online prediction requests use feature data from March 17. At 00:00 on March 19 the feature pipeline starts processing; at 05:00 it finishes and the data is stored online. So between 00:00 and 05:00 on March 19, online requests still use the old feature data from March 17; from 05:00 to 24:00 on March 19, they use data from March 18. If offline samples are assembled by day, then all samples for March 19 will use the features of March 18. This is also a form of feature crossing.
Model training should learn F(x) from the features available at prediction time T-1 superimposed with the outcomes observed at T+1, not from features that were refreshed at T+1; otherwise a "causality inversion" occurs.
3.2 Inconsistent calculations
Besides feature crossing, inconsistency between online and offline feature calculation also leads to good offline AUC but poor online experimental results. Let's look at how calculation inconsistencies arise:
Both online and offline paths need feature lookup and feature calculation, but the two paths are unrelated and independent of each other. Because two separate codebases implement the calculation, inconsistent logic creeps in easily:
- Online calculation: features are computed in C++ inside the prediction engine.
- Offline calculation: features are computed in Scala or Java inside the Flink and Spark engines.
So how do we solve the inconsistency problems in sample production? First, feature crossing. There are generally two approaches in the industry: one is point-in-time features; a typical example is Tecton's time travel. The other is prediction-snapshot backflow.
Let's look at the first method: point-in-time features.
This approach requires every feature to carry a timestamp; each change does not overwrite the value in place but is kept as one of multiple versions. When the offline association is performed, the feature version to join is selected according to the timestamp. The method is highly accurate and easy to trace. However, due to historical baggage, Cloud Music's data and feature systems are still immature and lack timestamp capability, and retrofitting it from the underlying data warehouse upward would be very costly. For this reason, we start from the online prediction side instead.
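The point-in-time idea can be sketched in a few lines (the feature name and timestamps are illustrative; a real feature store indexes versions per key at scale):

```python
import bisect

# feature_history: feature name -> list of (timestamp, value), sorted by timestamp.
feature_history = {
    "u_click_seq": [
        (100, [1001, 1002]),
        (200, [1001, 1002, 1005]),
        (260, [1001, 1002, 1005, 1008]),  # version written after the click at t=250
    ]
}

def point_in_time(feature: str, ts: int):
    """Return the latest feature version whose timestamp is <= ts."""
    versions = feature_history[feature]
    idx = bisect.bisect_right([t for t, _ in versions], ts) - 1
    if idx < 0:
        return None  # no version existed yet at ts
    return versions[idx][1]

# A sample for a prediction made at t=250 must see the value as of t=250,
# not the version written afterwards -- this is exactly what prevents crossing.
print(point_in_time("u_click_seq", 250))  # [1001, 1002, 1005]
```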
So we use the second approach: prediction snapshots.
Considering both the inconsistency problem and the demands of real-time models, we carried out a complete real-time model engineering practice based on snapshots to realize real-time sample generation.
4 Real-time samples
Below is the basic architecture diagram of the real-time model engineering. Online prediction and offline training form a closed data-flow loop.
Real-time sample generation includes three stages: real-time feature snapshot, real-time sample attribution, and real-time sample splicing.
4.1 Real-time Feature Snapshot
A real-time feature snapshot means dumping the features used in online prediction, writing them into KV storage in real time after collection and processing, and then associating them with the generated labels in real time. This is the feature-generation part of the sample. In practice, however, several problems arise:
1. Feature snapshots are oversized
Initially, the features of the hundreds or thousands of items and users in each online prediction request were dumped locally as logs, collected in real time by a collector, sent to Kafka, processed by Flink, and written into KV storage in real time. But Cloud Music's traffic is very large: more than 50,000 writes per second, with single records over 50 KB, reaching terabytes of data per hour. Local disk and network I/O could not bear this, KV consumption was also heavy, and normal online prediction requests were affected.
To solve this problem, we proposed a bypass TopN scheme, as follows:
Since the only items that can ultimately be associated with a label are the TopN items actually recommended to the user, the last link of our online delivery system asynchronously constructs an identical request for the winning TopN items and forwards it to a ranking-and-prediction system in a bypass environment. The only difference from the production ranking system is that the bypass does not perform the model inference computation. In this way, the original hundreds or thousands of items are reduced to the TopN (around 10), a 50- to 100-fold reduction; and because the bypass is decoupled from production requests, it no longer affects them. In addition, local logs were removed in favor of writing directly to Kafka, reducing I/O pressure, and the prediction snapshots are serialized with Protobuf; combined with Snappy compression, the per-record size dropped by more than 50%.
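The size reduction from serialization plus compression can be illustrated with a stdlib-only sketch (the article uses Protobuf + Snappy; JSON + zlib stand in here purely to show the effect, and the snapshot contents are hypothetical):

```python
import json
import zlib

# A hypothetical feature snapshot for one (recId, userId, itemId).
snapshot = {
    "recId": "r-0001", "userId": "50001", "itemId": "10001",
    "features": {f"fea{i}": [0.0] * 32 for i in range(50)},
}

raw = json.dumps(snapshot).encode("utf-8")
compressed = zlib.compress(raw)

# Feature payloads are highly repetitive, so compression shrinks them a lot.
print(len(compressed) < len(raw))  # True
```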
For KV storage, we also built a customized Tair-FIO-RDB on top of RocksDB, tailored to the characteristics of real-time features. Its advantages are obvious; a detailed introduction will be shared in a future article.
2. Diversity of feature selection
Different scenes select different features: some need raw features, others need extracted features, and some need both. In addition, some features do not need to land offline at all. If code had to be developed every time to choose which features to dump into the snapshot, development efficiency would drop sharply. We therefore made feature selection configurable through a DSL, as follows:
As shown, users can write such an XML configuration to flexibly choose which features to dump, which largely satisfies the need for quick online changes without code development.
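The article's actual DSL is not reproduced here, so the schema below is purely hypothetical, but it sketches the idea of driving feature selection from an XML config instead of code:

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: tag and attribute names are illustrative only.
config_xml = """
<snapshot-features scene="live_homepage">
    <feature name="u_click_seq" type="raw"/>
    <feature name="u_age_bucket" type="extracted"/>
    <feature name="debug_trace"  type="raw" dump="false"/>
</snapshot-features>
"""

def selected_features(xml_text: str):
    """Return the names of features that should be dumped into the snapshot."""
    root = ET.fromstring(xml_text)
    return [f.get("name") for f in root.findall("feature")
            if f.get("dump", "true") == "true"]

print(selected_features(config_xml))  # ['u_click_seq', 'u_age_bucket']
```

Changing which features land in the snapshot then only requires editing the config, not redeploying code.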
3. recId
recId is the key mechanism for accurately associating the real-time feature snapshot with the label. If only userId+itemId were used, duplicates or mis-associations could occur, greatly undermining confidence in sample accuracy.
recId is generated by the online delivery system and uniquely identifies each user request. It is generated in the final re-ranking layer of the recommendation pipeline. On one hand, the recId is filled into the reserved tracking field alg and forwarded to the upstream client; the client reports each item's exposure and the user's clicks, plays, and other behaviors, carrying the recId in the UA logs. On the other hand, the recId is passed to the snapshot bypass prediction system, dumped into Kafka along with the feature snapshot, and then written into KV storage under the key recId_userId_itemId.
In the real-time sample-splicing stage, the label and the snapshot can then be associated successfully through key = recId_userId_itemId.
4.2 Real-time Sample Attribution (Label)
Sample attribution is the process of labeling samples: a sample is judged positive or negative according to the user's behavior toward the item. Attribution is critical to the authenticity and accuracy of samples, and directly determines whether the learned model is biased.
There are two common attribution methods in the industry:
- Negative-sample cache: an attribution method proposed by Facebook and adopted domestically by Mogujie. Negative samples are cached while waiting for a potential positive signal before the final choice is made. Attribution is accurate and training cost is low, but there is a waiting window, so it is only quasi-real-time.
- Twice fast-train: a sample-correction method (FN calibration / PU loss) proposed by Twitter and adopted domestically by iQiyi. Both positive and negative samples trigger rapid update training, so samples need no explicit attribution and there is no waiting window, making it more real-time. However, it places high demands on streaming training, depends strongly on the correction strategy, and its accuracy is hard to guarantee.
Combining our existing engineering and business characteristics, we struck a compromise between real-time performance and accuracy and adopted a negative-sample cache plus incremental-window correction, as follows:
Take the home-page live scene as an example. The learning objective here is CTCVR; ctr_label and cvr_label are the sample labels for whether there was a click and whether there was an effective view, respectively.
ctr_label | cvr_label | feature{1…n} | Sample meaning |
---|---|---|---|
0 | 0 | {fea1: [0,1,3], fea2: [100]} | Exposure, no click |
1 | 0 | {fea1: [0,1,3], fea2: [100]} | Exposure with click, but no effective view |
1 | 1 | {fea1: [0,1,3], fea2: [100]} | Exposure with click and effective view |
The same user may generate multiple UA logs for the same item. For example, for anchor itemId=10001, an impress UA is generated after exposure; if user userId=50001 clicks it, a click UA is generated; if the user then plays it, a play UA is generated. But only one UA can survive in the final sample, i.e. only one of the three label rows above can hold. For offline samples, attribution is usually done in Spark by GroupBy(recId_userId_itemId) and keeping the last UA in the behavior chain. But how do we attribute real-time samples?
The attribution runs in a delayed-processing link. The delay is initially set according to the target conversion time, i.e. the window is chosen so that about 95% of conversions complete within it, and it can be adjusted later through monitoring or offline statistics; for the home-page live scene it was evaluated at 10 minutes. The rules are:
- Pre-store KV: only post-exposure action flags such as click and play-end are stored, with key: recId_userId_itemId_action, value: 1.
- When an impress UA arrives, look up the pre-store KV by recId_userId_itemId_click. If a click flag is found, discard the current UA; otherwise mark ctr_label = 0, cvr_label = 0.
- When a click UA arrives, mark ctr_label = 1, cvr_label = 0, and write the click flag recId_userId_itemId_click = 1 into the pre-store KV.
- When a play UA arrives, mark the labels according to the play duration: if it is shorter than the effective play duration, ctr_label = 1, cvr_label = 0; if it is greater than or equal to the effective play duration, ctr_label = 1, cvr_label = 1.
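The final labels produced by rules like these can be sketched as a pure function over all UAs collected for one recId_userId_itemId within the window (the effective-play threshold here is hypothetical):

```python
VALID_PLAY_SECONDS = 30  # hypothetical effective-play threshold

def attribute(events):
    """Derive (ctr_label, cvr_label) from the UAs of one recId_userId_itemId
    collected within the delay window; only the strongest behavior survives."""
    clicked = any(e["action"] == "click" for e in events)
    played_valid = any(
        e["action"] == "play" and e.get("duration", 0) >= VALID_PLAY_SECONDS
        for e in events)
    if played_valid:
        return (1, 1)   # exposed, clicked, and effectively viewed
    if clicked:
        return (1, 0)   # exposed and clicked, but not effectively viewed
    return (0, 0)       # exposed without a click

print(attribute([{"action": "impress"},
                 {"action": "click"},
                 {"action": "play", "duration": 45}]))  # (1, 1)
```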
In the technical implementation, Flink's keyBy + state + timer realizes the attribution process:
- Perform a keyBy on the input stream using the join key recId_userId_itemId_ts. If the keyBy causes data skew, a random number can be appended to the key first.
- Apply a KeyedProcessFunction to the keyed stream, define a ValueState inside it, and override processElement: if the ValueState is empty, create the state, write the record into it, and register a timer for it (Flink automatically deduplicates timers by key + timestamp).
- Override onTimer, which defines the logic executed when the timer fires: namely, the attribution logic described above.
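These keyed-state and timer mechanics can be imitated outside Flink with a toy single-threaded simulator (all names are illustrative; a real job would use Flink's KeyedProcessFunction, ValueState, and TimerService):

```python
import heapq

DELAY_MS = 10 * 60 * 1000  # 10-minute attribution window (home-page live scene)

class AttributionSimulator:
    """A toy, single-threaded stand-in for Flink's keyed ValueState + timers."""
    def __init__(self):
        self.state = {}    # key -> buffered UA events (the ValueState)
        self.timers = []   # min-heap of (fire_time, key)
        self.results = {}  # key -> labels emitted when the timer fires

    def process_element(self, key, event, ts):
        if key not in self.state:                          # first UA for this key:
            self.state[key] = []                           # create the state...
            heapq.heappush(self.timers, (ts + DELAY_MS, key))  # ...register a timer
        self.state[key].append(event)

    def on_timer(self, now):
        """Fire all due timers: run the attribution logic over buffered events."""
        while self.timers and self.timers[0][0] <= now:
            _, key = heapq.heappop(self.timers)
            events = self.state.pop(key)
            clicked = any(e == "click" for e in events)
            self.results[key] = (1, 0) if clicked else (0, 0)

sim = AttributionSimulator()
sim.process_element("r1_u1_i1", "impress", ts=0)
sim.process_element("r1_u1_i1", "click", ts=5_000)
sim.on_timer(now=DELAY_MS)
print(sim.results)  # {'r1_u1_i1': (1, 0)}
```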
Of course, there are special cases. Some users act on an item across windows, or, because of how tracking events are reported (for example, when the home page is reopened, the client may not request fresh recommendations but still display the previous results), real-time attribution can go wrong: the same user-item pair might be attributed as a negative sample in one window and a positive sample in the next. We therefore usually run a small offline batch correction afterwards:
- Check whether groupBy(recId, userId, itemId) yields more than one record; if so, keep only the record with the maximum sum(ctr_label, cvr_label).
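This offline correction can be sketched as follows (the row layout is hypothetical; in production this would be a Spark groupBy over the day's samples):

```python
# Cross-window duplicates: the same (recId, userId, itemId) attributed twice.
rows = [
    ("r1", "u1", "i1", 0, 0),  # first window: attributed negative
    ("r1", "u1", "i1", 1, 1),  # next window: the click/play finally arrived
    ("r2", "u1", "i2", 1, 0),
]

def correct(rows):
    """Group by (recId, userId, itemId); keep the row with the largest
    ctr_label + cvr_label, i.e. the strongest observed behavior."""
    best = {}
    for rec, user, item, ctr, cvr in rows:
        key = (rec, user, item)
        if key not in best or ctr + cvr > sum(best[key]):
            best[key] = (ctr, cvr)
    return best

print(correct(rows))
```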
4.3 Real-time Sample Stitching (Sample)
With the real-time feature snapshot and real-time label in hand, the next step is real-time sample splicing, which mainly includes the join association, feature extraction, and sample output.
1. Join association
In the Flink task, after receiving a label record, query KV with key = recId_userId_itemId, find the matching snapshot, and assemble them into one wide record.
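The join step can be sketched with a toy KV store (the record fields mirror the keys described above; a real job would query the Tair/RocksDB store from inside Flink):

```python
# Toy KV store filled by the snapshot pipeline: key = recId_userId_itemId.
snapshot_kv = {
    "r1_u1_i1": {"fea1": [0.0, 1.0, 3.0], "fea2": [100.0]},
}

def join(label_record):
    """Look up the feature snapshot for a label record; emit one wide record."""
    key = "{recId}_{userId}_{itemId}".format(**label_record)
    snapshot = snapshot_kv.get(key)
    if snapshot is None:
        return None  # unjoined labels feed the splicing-rate monitor
    return {**label_record, "features": snapshot}

wide = join({"recId": "r1", "userId": "u1", "itemId": "i1",
             "ctr_label": 1, "cvr_label": 0})
print(wide["features"]["fea2"])  # [100.0]
```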
2. Feature extraction
After the join produces a wide record, some snapshot features are still raw features, so feature-extraction computation must be applied, and its logic must match online prediction exactly. To guarantee this, the online feature-extraction code is packaged into JARs and reused in the offline Flink job, ensuring the online-offline calculation consistency discussed above.
3. Sample output
Format the records according to the training format required, such as TFRecord or Parquet, and write them to HDFS.
4.4 Flink task development
Because real-time sample generation involves many processes and tasks, we also invested in task-development tooling so that real-time samples can land quickly and cover more business scenarios.
1. Template-based development and CI/CD
The tasks are abstracted into four kinds: the snapshot collection task, the label task, the join task, and the (optional) sampling task. Each is encapsulated behind an interface to enable template-based development, which simplifies the development process and automates project creation, compilation, packaging, and task creation, providing continuous-integration (CI/CD) capability.
2. Task lineage
Within the same scene, tasks are chained into a DAG, making the task lineage clearer.
4.5 Sample Monitoring
The real-time sample pipeline involves many links, and both the stability of sample generation and the content of the samples are critical to the final model quality. Sample monitoring exists to proactively discover sample anomalies, and falls into two categories: system monitoring and content monitoring. The emphasis here is on content monitoring.
- System monitoring: monitoring of the online bypass system, KV storage, and the Flink jobs.
- Content monitoring: splicing-rate monitoring, feature-distribution monitoring, and time-span monitoring.
  - Splicing-rate monitoring: the success rate of associating labels with snapshots through recId, which reflects how genuinely usable the samples are.
  - Feature-distribution monitoring: for example, monitoring gender and age distributions reflects the distribution of feature content and helps judge whether the current samples are trustworthy.
  - Time-span monitoring: the time difference between an item being recommended and its UA flowing back, and between bypass snapshot collection and the KV write.
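The splicing-rate metric above is a simple ratio; a minimal sketch of the monitor (the counts and alert threshold here are hypothetical):

```python
def splicing_rate(total_labels: int, joined_labels: int) -> float:
    """Fraction of label records successfully joined with a snapshot via recId."""
    if total_labels == 0:
        return 0.0
    return joined_labels / total_labels

# Hypothetical alert rule: a sudden drop usually means snapshot loss,
# KV expiry, or a broken recId somewhere in the link.
rate = splicing_rate(total_labels=10_000, joined_labels=9_650)
print(rate)          # 0.965
assert rate >= 0.95  # example alert threshold
```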
5 Online Effects
Our full real-time model scheme is now online, and the results are quite encouraging. Combining the sample-attribution processing above with warm-start model restarts and the feature access scheme, in the home-page live module's recommendation scenes we achieved a conversion-rate lift of +5.219% (24-day average) and a click-through-rate lift of +6.575% (24-day average).
We also tested several model-update frequencies, as shown in the figure below. In the A/B test, group T1 used an offline daily-update model whose model file was replaced once a day; group T2 used a 2-hour update, incrementally training the model every 2 hours; group T8 used a 15-minute update, incrementally training every 15 minutes. Across repeated tests we found that the faster the model updates, the better and more stable the effect.
6 Summary and Outlook
The live-streaming recommendation business has scene characteristics that differ from other businesses: what is recommended is not only an item but also a state. Live recommendation therefore needs faster, better, and stronger recommendation algorithms to support the business. This article has described the real-time sample production process within the real-time model from the perspective of an engineering landing, and shared our experience with problems encountered in Cloud Music's engineering practice. Of course, different companies have different backgrounds, so the right route is the one that fits your own situation. Next, we will cover more business scenarios, continue to pursue business breakthroughs, and provide more engineering capabilities for the algorithm-modeling process.
References
- Cheng Li, Yue Lu, Qiaozhu Mei, Dong Wang, and Sandeep Pandey. 2015. Click-through Prediction for Advertising in Twitter Timeline. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15).
- Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. 2014. Practical Lessons from Predicting Clicks on Ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (ADKDD '14). ACM, New York, NY, USA.
- How does the Taobao search model become fully real-time? First applied to Double 11 [1]
- Ant Financial core technology: real-time recommendation algorithm with tens of billions of features revealed [2]
- Exploration and practice of online learning in iQiyi's information-feed recommendation business [3]
- Video streaming: incremental learning and Wide&DeepFM practice (engineering + algorithms) [4]
Links
[1] developer.aliyun.com/article/741…
[2] zhuanlan.zhihu.com/p/53530167
[3] www.infoq.cn/article/lTH…
[4] zhuanlan.zhihu.com/p/212647751
This article is published by the NetEase Cloud Music technology team. Unauthorized reprinting is prohibited. We recruit for all kinds of technical positions year-round; if you are ready for a change and you like Cloud Music, join us at [email protected]