(Original article on Hornet’s Nest Technology, wechat ID: MFWtech)
Part.1 Hornet's Nest Recommendation System Architecture
The Hornet's Nest recommendation system consists mainly of three stages: recall (Match), ranking (Rank), and re-ranking (Rerank). The overall architecture is shown below:
In the recall stage, the system screens out a candidate set matching the user's preferences (on the order of hundreds to thousands of items) from the massive content library. In the ranking stage, the candidate content is then scored precisely against specific optimization objectives (such as click-through rate), and a small amount of high-quality content that the user is most interested in is selected from the hundreds of candidates.
This article focuses on one core component of the Hornet's Nest recommendation system, the ranking algorithm platform: its overall architecture, the roles it plays in supporting rapid and efficient model iteration so that users see more accurate recommendations, and our experience in practice.
Part.2 Evolution of the Ranking Algorithm Platform
2.1 Overall Architecture
At present, the online model ranking platform of the Hornet's Nest ranking algorithm consists mainly of three parts: the general data processing module, the replaceable model production module, and the monitoring and analysis module. The structure of each module and the overall workflow of the platform are shown below:
2.1.1 Module Functions
(1) General data processing module
Its core functions are feature construction and training sample construction, which are also the most fundamental and critical parts of the whole ranking algorithm. Data sources include click/exposure logs, user profiles, content profiles, etc. The underlying data processing relies on Spark offline batch processing and Flink real-time stream processing.
(2) Replaceable model production module
Mainly responsible for training set construction, model training, and generation of the online configuration, so that trained models can be brought online seamlessly and in sync.
(3) Monitoring and analysis module
It mainly includes upstream data-dependency monitoring, recommendation pool monitoring, feature monitoring and analysis, model visualization and analysis, and other functions.
The functions of each module and the interactions between them are tied together with JSON configuration files, so that training a model and bringing it online only require modifying configuration. This greatly improves development efficiency and lays a solid foundation for rapid iteration of the ranking algorithm.
2.1.2 Main Configuration File Types
The configuration files fall mainly into four types: TrainConfig, MergeConfig, OnlineConfig, and CtrConfig. Their functions are as follows:
(1) TrainConfig
Training configuration, including training set configuration and model configuration:
- The training set configuration specifies which features to use for training, which time ranges of training data to use, and the scenes, pages, channels, etc.
- The model configuration includes the model parameters, the training set path, the test set path, the model save path, etc.
(2) MergeConfig
Refers to the feature configuration, including the selection of context features, user features, item features, and cross features.
Here we also configure how cross features are computed. For example, the user features and the content features each include some vector features. When we want to use the cosine similarity or Euclidean distance between two such vectors as a cross feature for the model, the selection and computation of that cross feature can be specified directly through configuration, and carried over into the synchronized online configuration.
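As a minimal sketch of this idea (the config keys and feature names here are hypothetical, not the platform's actual schema), cosine and Euclidean cross features could be computed from a configuration like so:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_cross_features(user_feats, item_feats, config):
    """Compute the cross features named in the config from the
    user-side and item-side vector features."""
    ops = {"cosine": cosine_similarity, "euclidean": euclidean_distance}
    cross = {}
    for spec in config["cross_features"]:
        u = user_feats[spec["user_vector"]]
        v = item_feats[spec["item_vector"]]
        cross[spec["name"]] = ops[spec["op"]](u, v)
    return cross
```

With this shape, adding a new cross feature is purely a config change, e.g. `{"name": "ui_tag_cosine", "user_vector": "tag_vec", "item_vector": "tag_vec", "op": "cosine"}`, which is exactly the property the platform relies on for fast iteration.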
(3) OnlineConfig
Refers to the online configuration. It is generated automatically during training data construction for online use, and includes the feature configuration (context features, user features, content features, cross features), the model path, and the feature version.
(4) CtrConfig
Refers to the default CTR configuration, used to smooth the CTR features of users and content.
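One common way to smooth a CTR feature toward a configured default is a Bayesian-style prior; this sketch assumes that approach (the source does not specify the exact formula, and `prior_strength` is a hypothetical parameter):

```python
def smoothed_ctr(clicks, impressions, default_ctr=0.05, prior_strength=100):
    """Smooth raw CTR toward a default value so that users/items with
    few impressions don't get extreme (0 or 1) CTR features. The
    default_ctr would come from something like CtrConfig; with many
    impressions the smoothed value converges to the raw CTR."""
    return (clicks + prior_strength * default_ctr) / (impressions + prior_strength)
```

For example, an item with 0 impressions gets exactly the default CTR instead of an undefined or zero value, while an item with 100,000 impressions is essentially unaffected by the prior.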
2.1.3 Feature engineering
From the application perspective, features mainly include User Features, Article Features, and Context Features.
According to how they are obtained, features can be divided into:
- Statistical Features: clicks/exposures/CTR of users and content within specific time windows, etc.
- Vector Features: vector features trained with Word2Vec on tag, destination, and other information, using users' click histories.
- Cross Features: user vectors or item vectors constructed from tag or destination vectors, yielding similarity features between users and items.
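The list above says a user vector is constructed from tag or destination vectors; one simple way to do that (an assumption here, since the source does not state the pooling method, and the tag names are invented) is to average the vectors of the tags the user has clicked:

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors; empty list -> empty vector."""
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def user_vector_from_clicks(clicked_tags, tag_vectors):
    """Build a user vector by mean-pooling the Word2Vec-style vectors of
    the tags in the user's click history; tags without a trained vector
    are skipped."""
    vecs = [tag_vectors[t] for t in clicked_tags if t in tag_vectors]
    return mean_pool(vecs)
```

The resulting user vector can then be compared against an item's tag vector (e.g. by cosine similarity) to produce the user-item similarity cross features described above.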
2.2 Ranking Algorithm Platform V1
In the V1 stage of the ranking algorithm platform, simple JSON file configuration allowed the platform to handle feature selection, training set selection, multi-scene XGBoost model training, offline AUC evaluation of XGBoost models, automatic generation of the synchronized online configuration files, and other functions.
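To make this concrete, a TrainConfig in this style might look like the fragment below. The field names and values are hypothetical (the article does not publish the real schema); only the feature names reuse ones that appear elsewhere in this article:

```json
{
  "train_config": {
    "scene": "double_flow",
    "features": ["ui_cosine_70", "doubleFlow_article_ctr_7_v1"],
    "date_range": {"start": "2019-06-01", "end": "2019-06-07"},
    "model": {
      "type": "xgboost",
      "params": {"max_depth": 6, "eta": 0.1, "objective": "binary:logistic"},
      "train_path": "...",
      "test_path": "...",
      "model_save_path": "..."
    }
  }
}
```

Under this scheme, training a new model variant or changing the feature set is a pure configuration edit, with no code change needed.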
2.3 Ranking Algorithm Platform V2
To address the problems described below, in V2 we added data verification and model explanation functions to the monitoring and analysis module of the ranking algorithm platform, which give us a more scientific and accurate basis for the continuous, iterative optimization of our models.
2.3.1 Data Verification
In the V1 stage of the platform, when a model's offline performance (AUC) was good but its online performance fell short of expectations, it was difficult for us to troubleshoot and locate the problem, which slowed model iteration.
By investigating and analyzing these cases, we found that an important reason for online results falling short of expectations could be that the model's training set is built from a click/exposure table collected by the data warehouse once a day. Due to data reporting delays and other causes, some contextual features in the offline click/exposure table may be inconsistent with the real-time click/exposure behavior, causing inconsistencies between offline and online features.
To handle this situation, we added a data verification function that compares and analyzes, across all dimensions, the training set constructed offline against the real-time feature logs printed online.
This is done by assigning a unique ID to each record in the online real-time click/exposure log (which contains the model used, its features, and the model's prediction score) and preserving that unique ID in the offline click/exposure table. In this way, for any click/exposure record, we can join the features of the offline training set with the features actually used online, and then compare online and offline model AUC, online and offline prediction scores, and the features themselves to surface problems.
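A minimal sketch of this join-and-compare step, assuming each record is a dict carrying the shared `record_id` (the field names and record layout are hypothetical; the AUC here is the standard rank-sum formulation, implemented inline to keep the sketch self-contained):

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation, with average
    ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        return 0.5
    rank_sum, i, rank = 0.0, 0, 1
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (2 * rank + (j - i) - 1) / 2.0  # average rank of the tie group
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank += j - i
        i = j
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)

def compare_online_offline(online_log, offline_log):
    """Join online feature-log records with offline training-set records
    by the shared unique record id, then compare the AUC of each side
    and rank records by online/offline prediction-score disagreement."""
    offline = {r["record_id"]: r for r in offline_log}
    joined = [(o, offline[o["record_id"]]) for o in online_log
              if o["record_id"] in offline]
    labels = [on["label"] for on, _ in joined]
    online_auc = auc(labels, [on["score"] for on, _ in joined])
    offline_auc = auc(labels, [off["score"] for _, off in joined])
    diffs = sorted(joined, key=lambda p: abs(p[0]["score"] - p[1]["score"]),
                   reverse=True)
    return online_auc, offline_auc, diffs  # diffs[:k] = top-K mismatched samples
```

Inspecting `diffs[:k]` is exactly the "top-K samples with the largest score difference" step used in the case study below.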
For example, during a previous model iteration, the model's AUC was high offline but unsatisfactory online. Through data verification, we first compared the online and offline AUC and confirmed the discrepancy. We then compared the online and offline prediction scores, found the top-K samples with the largest score differences, and compared their offline and online features. We finally found that data reporting delays had made some context features inconsistent between online and offline, and that the missing-value parameter chosen when constructing the online XGBoost DMatrix was problematic, causing online and offline prediction scores to diverge. After fixing these problems, the online UV click-through rate increased by 16.79% and the PV click-through rate by 19.10%.
With the data verification function and the corresponding fixes, we can quickly locate the causes of such problems, accelerating the iterative development of the algorithm models and improving their online performance.
2.3.2 Model Explanation
Model explanation can open the black box of machine learning models, increase our trust in model decisions, help us understand those decisions, and inspire model improvements. Two papers are recommended for the underlying concepts: "Why Should I Trust You?": Explaining the Predictions of Any Classifier, and A Unified Approach to Interpreting Model Predictions.
In real development there is always a trade-off between model accuracy and interpretability: simple models are easy to explain but less accurate, while complex models improve accuracy at the expense of interpretability. Using simple models to explain complex ones is one of the core methods of current model explanation.
At present, the online ranking model is XGBoost. For XGBoost, however, the traditional explanation method based on feature importance can only give a global measure of each feature's importance; it does not support local explanation of the model, i.e., explaining the output for a single sample. Against this background, our model explanation module adopts the newer explanation methods SHAP and LIME, which support not only feature importance but also local explanation: for a single sample, we can see how much positive or negative effect a particular feature value has on the model's output.
The following simplified example from a real-world scenario illustrates the core functionality of model explanation. First, the meanings of several features:
For single samples, our model explanation gives the following analysis:
- U0–I1
- U0–I2
- U0–I3
As shown in the figure, the model's predicted values for the single samples U0–I1, U0–I2, and U0–I3 are 0.094930, 0.073473, and 0.066176 respectively. For a single-sample prediction, the positive or negative effect of each feature value can be read from the length of its feature bar in the figure: red represents a positive effect and blue a negative effect. These effects are determined by the shap_value in the following table:
Here, logit_output_value = 1.0 / (1 + np.exp(-margin_output_value)) and logit_base_value = 1.0 / (1 + np.exp(-margin_base_value)). margin_output_value is the XGBoost model's raw (margin) output; margin_base_value is the model's expected output, approximately equal to the mean of the model's predictions over the whole training set; shap_value measures the positive or negative effect of a feature on the prediction.
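These quantities fit together through SHAP's additivity property: the margin output equals the base value plus the sum of the per-feature shap_values. A minimal sketch (the numbers in any call are illustrative, not taken from the table above):

```python
import math

def sigmoid(margin):
    """Map an XGBoost margin (log-odds) output to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def explain_prediction(margin_base_value, shap_values):
    """SHAP additivity: margin_output_value = margin_base_value +
    sum(shap_values). Positive shap_values push the prediction above
    the base value, negative ones push it below."""
    margin_output_value = margin_base_value + sum(shap_values)
    return {
        "margin_output_value": margin_output_value,
        "logit_output_value": sigmoid(margin_output_value),
        "logit_base_value": sigmoid(margin_base_value),
    }
```

For instance, a sample whose shap_values are all positive will always have logit_output_value above logit_base_value, which is what the red bars in the figure express.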
Since the predicted values satisfy logit_output_value 0.094930 > 0.073473 > 0.066176, the ranking result is I1 > I2 > I3. For U0–I1, with predicted value 0.094930, the feature doubleFlow_article_ctr_7_v1 = I1_ctr contributes a positive effect of 0.062029, raising the predicted value above the base value. Similarly, ui_cosine_70 = 0.894006 contributes a positive effect of 0.188769.
Intuitively, the higher the CTR and the user-content similarity, the higher the model's predicted value, which matches expectations. In real scenarios there are many more features.
The core strength of SHAP-based model explanation is its support for local, single-sample analysis. It also supports global analysis, such as feature importance, positive/negative feature effects, and feature interactions. The figure below analyzes doubleFlow_article_ctr_7_v1: a 7-day content click-through rate below the threshold has a negative effect on the model's prediction, while one above the threshold has a positive effect.
Part.3 Recent Planning
Going forward, the ranking algorithm platform will continue to improve the online performance of trained models, with real-time features as a focus of work so that changes are reflected online quickly.
The advantage of the XGBoost model currently used by the ranking algorithm platform is that it does not require much feature engineering, such as handling missing feature values, discretizing continuous features, and constructing cross features. However, it also has notable shortcomings:
- It has difficulty handling high-dimensional sparse features.
- The complete data set must be loaded into memory for model training, and online learning is not supported, so real-time model updates are difficult to achieve.
To address these problems, we will build deep models such as Wide&Deep and DeepFM in a later stage, as shown in the figure below:
In addition, the current model predicts a score for each single item and then sorts by score to produce one refresh of results (pointwise learning to rank). In a later stage we hope to move to listwise learning to rank, which can bring users more timely and accurate recommendation results.
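The pointwise scheme described above can be sketched in a few lines (the function and parameter names are illustrative, not the platform's actual API): each candidate is scored independently, and the scores alone determine the order.

```python
def pointwise_rank(candidates, score_fn, k):
    """Pointwise Learning-to-Rank: score each candidate item
    independently with the model, then sort by score descending and
    keep the top-k as one 'refresh' of recommendation results."""
    scored = [(score_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Because every item is scored in isolation, the model never sees the candidate list as a whole; that is the limitation a listwise approach removes by optimizing the quality of the ranked list directly.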
Authors: Xia Dingxin and Wang Lei, R&D engineers on the Hornet's Nest recommendation algorithm platform.