Takeaway
Industrial recommendation systems usually consist of four stages: recall, coarse ranking (pre-ranking), fine ranking, and re-ranking. As shown in Figure 1, each stage acts as a funnel, filtering the items most likely to interest the user out of a large candidate pool. The main role of the coarse ranking model is to score and filter the merged recall results in one place, reducing the computational pressure on the fine ranking model while preserving recommendation accuracy as much as possible. This article introduces a series of practical optimizations that iQiyi's basic recommendation team has applied to the coarse ranking model of its short video recommendation business.
Figure 1: Overall recommendation pipeline architecture
Background
In industry, serving performance is usually a major consideration when choosing a coarse ranking model. Over their development history, coarse ranking models can be roughly divided into the following categories:
1. The earliest and simplest coarse filtering method truncates directly by the scores computed during recall, to control the number of candidates fed to the fine ranking model, or truncates uniformly by global statistics such as CTR.
2. Machine learning models with relatively simple structure but some capacity for personalization, represented by LR and decision trees, which score and truncate the merged recall candidate set.
3. The most widely used coarse ranking model in industry today: a two-tower DNN based on vector inner products. User features and item features feed two separate towers; a deep network on each side produces a user vector and an item vector, and the ranking score is computed from their vector similarity.
The coarse ranking model originally used by iQiyi's short video recommendation business falls into the second category above: a GBDT model built on statistical features along several dimensions, mainly the following:
1. Consumption statistics of user groups with different attributes over different types of videos (tags, creators, the videos themselves, etc.).
2. Cumulative consumption statistics along the video dimension, such as a video's click-through rate and the median and mean watch time, plus consumption statistics of creators and video tags.
3. Statistics of the video content a user has consumed historically, such as the categories, tags, and creators of videos in the user's watch history.
After the business's fine ranking model was optimized and upgraded to a Wide&Deep model, we ran detailed statistics comparing the predictions of the coarse and fine ranking models and found large discrepancies on top (head) videos. We attribute this mainly to two factors:
1. Differences in feature sets: the coarse GBDT model mainly uses dense statistical features, while in the fine Wide&Deep model the sparse features of the video itself, such as the video ID, video tags, and creator (UP) ID, play the dominant role.
2. Differences in model structure: a tree model and a DNN differ greatly in which parts of the data they emphasize and fit.
Besides diverging from the fine Wide&Deep model's predictions, the GBDT model also demanded a great deal of manual feature processing and mining. Based on this analysis, in order to close the gap between the coarse and fine ranking models as far as possible, narrow the difference between their predictions, and save the heavy labor cost of feature statistics and mining, we carried out a series of upgrades and optimizations of the coarse ranking model.
Two-tower DNN coarse ranking model
Weighing computational performance against experimental results, we ultimately chose the now-mainstream two-tower DNN as our coarse ranking model. Its structure is shown in Figure 2 below: the user side and the item side each build a three-layer fully connected DNN, and each side outputs a 512-dimensional embedding vector as a low-dimensional representation of the user and the video, respectively.
To control the parameter complexity of the coarse ranking model, we heavily pruned its feature set: both the user side and the video side use only a small subset of the fine ranking model's features. The user-side features mainly include:
1. Basic user profile features and contextual features such as phone OS, device model, and region.
2. User behavior features, such as the IDs of videos the user has watched, the creator (UP) IDs and keyword tags of those videos, and the user's in-session behavior features.
The video side keeps only three feature dimensions:
1. Video ID
2. Creator (UP) ID
3. Video tags
Figure 2: Structure of the two-tower DNN coarse ranking model
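As a rough illustration of this structure, the sketch below builds the two towers in TensorFlow. Only the three-layer structure and the 512-dimensional output come from the description above; the hidden layer widths and activations are illustrative assumptions, and the cosine scoring matches the similarity computation mentioned later in this article.

```python
import tensorflow as tf

EMB_DIM = 512  # output embedding size described above

def build_tower(name):
    # Three fully connected layers per side, as in Figure 2;
    # the hidden widths here are illustrative assumptions.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(EMB_DIM),  # low-dimensional representation
    ], name=name)

user_tower = build_tower("user_tower")  # consumes user profile/behavior features
item_tower = build_tower("item_tower")  # consumes video ID / UP ID / tag features

def coarse_score(user_feats, item_feats):
    # Ranking score = similarity of the two tower outputs (cosine here).
    u = tf.math.l2_normalize(user_tower(user_feats), axis=-1)
    v = tf.math.l2_normalize(item_tower(item_feats), axis=-1)
    return tf.reduce_sum(u * v, axis=-1)
```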
Moving from GBDT to a two-tower DNN, the model's complexity and parameter count grew explosively. To keep the coarse ranking model accurate while still meeting online performance targets, we did a great deal of optimization work in the following areas:
1. Knowledge distillation
To compensate for the loss caused by feature pruning and keep the pruned coarse ranking model accurate, we train it with knowledge distillation, a common model compression technique. In the teacher-student framework, the knowledge learned by a complex network with strong capacity is distilled and transferred to a network with fewer parameters and weaker capacity, yielding a network that is both fast and expressive.
We place the coarse and fine ranking models into a teacher-student distillation framework: the fine ranking model serves as the teacher guiding the training of the coarse ranking model, yielding a coarse model that is structurally simple and small in parameter count, yet not weak in expressive power. The distillation training setup is shown in Figure 3.
Figure 3: Distillation training of the coarse ranking model
During distillation training, to align the output logit distribution of the coarse ranking model with that of the fine ranking model as closely as possible, the training objective is changed from the coarse model's single logloss to the sum of three losses, as shown in Formula 1: a student loss (the coarse model's loss), a teacher loss (the fine model's loss), and a distillation loss.
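Written out from this description, Formula 1 takes roughly the following form (a reconstruction, with $y$ the label, $z_s$ and $z_t$ the coarse and fine model outputs, and $\ell$ the logloss):

$$
L \;=\; \underbrace{\ell(y, z_s)}_{\text{student loss}} \;+\; \underbrace{\ell(y, z_t)}_{\text{teacher loss}} \;+\; \lambda \cdot \underbrace{(z_s - z_t)^2}_{\text{distillation loss}}
$$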
For the distillation loss we use the mean squared error between the coarse model's output and the fine model's output. To adjust its influence, we multiply the distillation loss by a scalar hyperparameter lambda, scheduled to grow gradually with the training step so that the distillation loss gains weight over time and, late in training, the coarse model aligns to the fine model as closely as possible. The trend of lambda over training steps is shown in Figure 4.
Figure 4: Schedule of the distillation loss hyperparameter lambda
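A minimal sketch of this combined loss with a step-dependent lambda, assuming a linear ramp (the ramp boundaries and maximum value are illustrative, not the values we used):

```python
import tensorflow as tf

def distill_lambda(step, ramp_start=100_000, ramp_end=1_000_000, max_val=1.0):
    # Lambda stays near zero early and grows with the training step, so the
    # distillation loss dominates late in training (cf. Figure 4).
    frac = (tf.cast(step, tf.float32) - ramp_start) / (ramp_end - ramp_start)
    return max_val * tf.clip_by_value(frac, 0.0, 1.0)

def total_loss(labels, student_logits, teacher_logits, step):
    student = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=student_logits))  # coarse-model logloss
    teacher = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=teacher_logits))  # fine-model logloss
    distill = tf.reduce_mean(tf.square(student_logits - teacher_logits))  # MSE
    return student + teacher + distill_lambda(step) * distill
```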
2. Embedding parameter optimization
To further reduce the parameter count, compress the model, speed up model transfer, and cut memory use when loading the model, we switched the optimizer for the embedding parameters to FTRL, which produces sparse solutions, while the other layers keep AdaGrad. This adjustment not only slightly improved offline AUC; after training, 49.7% of all embedding parameters were exactly zero. Pruning these all-zero embedding rows at export time shrank the coarse ranking model by 46.8%, nearly doubled its transfer speed, and cut the memory consumed when loading the model online accordingly.
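A minimal sketch of this two-optimizer setup in TensorFlow, with a toy model standing in for the real towers (the table size, learning rates, and L1 strength are assumptions):

```python
import numpy as np
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=10_000, output_dim=32)  # toy table
dense = tf.keras.layers.Dense(1)                                  # toy dense layer

# FTRL's L1 regularization drives weak embedding rows exactly to zero;
# the non-embedding layers keep AdaGrad as before.
ftrl = tf.keras.optimizers.Ftrl(learning_rate=0.05, l1_regularization_strength=0.01)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.05)

def train_step(ids, labels):
    with tf.GradientTape() as tape:
        logits = tf.squeeze(dense(tf.reduce_mean(emb(ids), axis=1)), axis=-1)
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    emb_vars, dense_vars = emb.trainable_variables, dense.trainable_variables
    grads = tape.gradient(loss, emb_vars + dense_vars)
    ftrl.apply_gradients(zip(grads[:len(emb_vars)], emb_vars))      # sparse solver
    adagrad.apply_gradients(zip(grads[len(emb_vars):], dense_vars)) # dense layers
    return loss

train_step(tf.constant([[1, 2, 3]]), tf.constant([1.0]))

# At export time, the rows FTRL zeroed out can be dropped to shrink the model.
table = emb.embeddings.numpy()
zero_rows = np.flatnonzero(np.abs(table).sum(axis=1) == 0)
```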
3. Online inference optimization
Beyond these offline training optimizations, we also did substantial pruning and optimization of the two-tower coarse ranking model's online inference. A request pairs one user with on the order of 1,000 candidate videos, so we factor out the user-side features and run user-side NN inference only once per request. This optimization alone cut p99 computation time by about 19 ms.
In addition, exploiting the long-tail distribution of video recommendation and the fact that all features on the video side of the coarse ranking model are static (once the video ID is fixed, its features are fixed), we cache the embeddings of high-frequency videos. The video-side embedding is looked up in the cache first, and inference runs only on a cache miss. The optimized coarse scoring service architecture is shown in Figure 5 below:
Figure 5: Coarse scoring service architecture
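A simplified sketch of this serving path (the stand-in towers, feature store, and class below are illustrative, not the production service):

```python
import numpy as np

rng = np.random.default_rng(0)
W_USER, W_ITEM = rng.normal(size=(64, 512)), rng.normal(size=(32, 512))
user_tower = lambda f: f @ W_USER   # stand-in for the exported user-side NN
item_tower = lambda f: f @ W_ITEM   # stand-in for the exported item-side NN

class CoarseScorer:
    def __init__(self, item_feature_store):
        self.item_feature_store = item_feature_store  # video_id -> static features
        self.cache = {}                               # video_id -> embedding

    def item_embedding(self, video_id):
        emb = self.cache.get(video_id)
        if emb is None:                               # cache miss: run NN inference
            emb = item_tower(self.item_feature_store[video_id])
            # Item features are static (fixed video ID => fixed features),
            # so a cached embedding never goes stale.
            self.cache[video_id] = emb
        return emb

    def score(self, user_features, video_ids):
        u = user_tower(user_features)                 # user-side NN runs only once
        v = np.stack([self.item_embedding(i) for i in video_ids])
        u = u / np.linalg.norm(u)
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        return v @ u                                  # cosine score per candidate

store = {i: rng.normal(size=32) for i in range(1000)}
scores = CoarseScorer(store).score(rng.normal(size=64), list(range(1000)))
```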
Through the above series of optimizations, the online computation cost of the two-tower DNN coarse ranking model ended up roughly on par with the previous GBDT model. We deployed it to both the iQiyi popular channel and the homepage recommendation scenario, and online metrics rose significantly in both: per-capita consumption time in the homepage recommendation feed rose by about 3%, while in the popular channel per-capita consumption time rose by 1% and both CTR and per-capita videos consumed rose by 2%.
Cascade model
As the business iterated and deepened, the learning objectives of the fine ranking model kept being adjusted. To make it easy to add new objectives, we upgraded the online fine ranking model to the MMoE multi-task model proposed by Google, which is widely used in industry. To solve the consistency problem between the coarse and fine ranking models once and for all, we iterated on the coarse ranking model again, upgrading it to a cascade model. This lets the coarse ranking model adaptively track changes in the fine ranking model's objectives, and it removes the distillation training step, greatly reducing training resource consumption.
In practical terms, the cascade model changes neither the model structure nor the input feature set; it only adjusts how the coarse ranking model's training samples are generated. Instead of learning from real online exposure click/play samples, the upgraded coarse ranking model directly learns the fine ranking model's predictions: the fine model's top-N predicted results are taken as the coarse model's positive samples. The cascade model's sample generation is shown in Figure 6.
Figure 6: Training sample data flow of the cascade model
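A minimal sketch of this sample generation, assuming negatives are drawn from the candidates the fine model ranked below the top N (top_n, the negative ratio, and the function itself are illustrative):

```python
import numpy as np

def cascade_samples(video_ids, fine_scores, top_n=50, neg_per_pos=4, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(-np.asarray(fine_scores))      # sort by fine-model score
    positives = np.asarray(video_ids)[order[:top_n]]  # fine model's top-N => label 1
    tail = np.asarray(video_ids)[order[top_n:]]       # remaining candidates
    n_neg = min(len(tail), neg_per_pos * top_n)
    negatives = rng.choice(tail, size=n_neg, replace=False)
    return [(v, 1) for v in positives] + [(v, 0) for v in negatives]
```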
This cascade upgrade therefore only adjusted the model's training samples, yet it brought significant benefits. Besides removing the distillation step, which greatly reduced the resource and time cost of training the coarse ranking model, online metrics also improved markedly: in the video recommendation scenario, exposure click-through rate and the per-capita number of effectively watched videos both rose by about 3%, and interaction metrics improved as well, including a 12% increase in per-capita comments.
Future plans
The above is the series of optimizations we have recently tried on the coarse ranking model for short video recommendation. Practice shows that consistency between the coarse and fine ranking models has a large impact on online results. Going forward, we will continue optimizing the coarse ranking model along the following lines:
1. Try COLD, a next-generation coarse ranking (pre-ranking) system.
2. Keep optimizing the online computing performance of the coarse ranking model; where performance allows, expand the number of recalled videos and add more features already proven effective in the fine ranking model, to improve the coarse model's accuracy.
3. Optimize the similarity computation between user and video embeddings, for example by adding a shallow network to compute user-item similarity in place of the simple cosine similarity, as sketched below.
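A minimal sketch of the third idea, assuming the shallow network scores the concatenation of the user vector, the item vector, and their elementwise product (the interaction features and layer sizes are assumptions):

```python
import tensorflow as tf

sim_net = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),                 # scalar similarity score
])

def learned_similarity(u, v):
    # Replace plain cosine similarity with a small learned interaction network.
    x = tf.concat([u, v, u * v], axis=-1)
    return tf.squeeze(sim_net(x), axis=-1)
```

Note the trade-off: a learned interaction layer must run per user-item pair, so cached item embeddings still help, but the final scoring is no longer a pure dot product.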
References
1. www.kdd.org/kdd2018/acc…
2. H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In AISTATS, 2011.
3. arxiv.org/abs/1503.02…
4. arxiv.org/abs/2007.16…
About the team
The basic recommendation team of iQiyi's Suke business division is responsible for optimizing the recommendation strategy for the short video feed on the Suke app homepage and the iQiyi Suke popular feed.
PS: We have many open positions and welcome all kinds of talent. If you are interested, send your resume to [email protected] and talk to our recruiters directly.