Refer to https://zhuanlan.zhihu.com/p/115690499?from_voters_page=true
0/ Background
With "recall" as the focus, this article introduces recall algorithms from four angles: content-based recall, collaborative filtering, FM model-based recall, and deep learning-based methods.
Recall over the video corpus is the first stage of a recommendation system. Based mainly on user features (the user profile) and item features (labels), it quickly retrieves a small set of items the user is likely to be interested in from a massive item corpus, and then hands them to the ranking stage. This stage has to process a very large amount of data and must respond quickly, so the strategies, models, and features it uses should not be too complex. The following four common recall methods are introduced:
<1> Content-based recall: uses the similarity between items to recommend items similar to the ones a user likes. For example, if user A watches the movie "Embroidered Spring Knife 2" starring Yang Mi, other movies or TV dramas starring Yang Mi are recommended to user A. Likewise, if a user views ASICS men's shoes, similar men's shoes are recommended to that user.
<2> Collaborative filtering: uses similarities between queries (users) and items at the same time to make recommendations. For example, if user A is similar to user B and user B likes video 1, the system can recommend video 1 to user A (even if user A has not seen any similar video).
<3> Recall based on the FM model: FM (factorization machine) is a recommendation model that builds on matrix factorization; its core is modeling second-order feature combinations.
<4> Deep neural network-based methods: the candidate set is generated by a deep neural network.
1/ Content-based recall
Content-based recall (CB recall) is also commonly known as label recall. CB can sound very simple: recall by tag or category, and there seems to be nothing more to do. In fact, CB is more than an inverted index over tags and categories. The core idea of this kind of recall is to work from the attributes of the item (corpus) itself, which can be expressed as tag, category, user ID, user type, and so on. Recall can also be done on a content vector, that is, by expressing the content as a continuous vector through some learned representation.

Let's take a closer look at content-based filtering. In practical applications such as movie recommendation, CB starts from the user's historical behavior (clicks, comments, views, and so on) and uses item features to recommend items similar to the ones the user interacted with before.

To visualize CB, imagine an app store that recommends apps to users. The corresponding feature matrix is shown below, where each row represents an app and each column represents a feature. Features include the different categories, publisher information, and so on. For simplicity, assume the feature matrix is Boolean: a non-zero value indicates that the app has that feature.
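To make the app-store example concrete, here is a minimal sketch of a Boolean feature matrix and an item-to-item similarity lookup. The app names, feature names, and values are all hypothetical, and cosine similarity is used as one common choice of similarity measure.

```python
import math

# Hypothetical Boolean feature matrix: one row per app, one column per feature.
# Column order: education, casual_game, health, publisher_x (all made up).
APPS = {
    "app_a": [1, 0, 0, 1],
    "app_b": [1, 0, 0, 0],
    "app_c": [0, 1, 0, 0],
    "app_d": [0, 0, 1, 1],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_items(seed, k=2):
    """Return the k apps most similar to `seed`, as (app, score) pairs."""
    seed_vec = APPS[seed]
    scored = [
        (other, cosine(seed_vec, vec))
        for other, vec in APPS.items()
        if other != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Apps sharing more features with app_a rank higher.
    print(similar_items("app_a"))
```

In this toy data, `app_b` shares the category of `app_a` and `app_d` shares only its publisher, so both rank above `app_c`, which shares nothing.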
The matrix above is sparse, and the similarity between items can be computed from it with cosine similarity.

<1> Advantages of content-based recall
- The model does not need data from other users, because recommendations are specific to the current user. This makes it easier to scale to a large number of users.
- The model can capture a user's specific interests.

<2> Disadvantages of content-based recall
- Since the feature representation of an item is to some extent designed by hand, this technique requires a lot of domain knowledge. The model therefore depends heavily on the quality of the hand-designed features.
- The model can only make recommendations based on the user's existing interests. In other words, its ability to expand beyond those interests is limited.

Content-based recall looks relatively easy. But as items gain more and more attributes (a video may have several parallel tags plus other attributes), we need multi-term retrieval to use all of this information together and improve the effect of CB. The common optimization points of CB are discussed in detail below.

<3> Inverted index optimization
The main purpose of optimizing the inverted index is to improve the recommendation quality of CB recall. The ordering of an inverted list is usually kept consistent with the online ranking metric: if the ranking metric is click-through rate, the inverted lists are also sorted by click-through rate. There is a small problem with this. The inverted list is sorted by a posterior value, and usually by a single metric, so that metric is easily hacked. For example, when the list is sorted by click-through rate, the head of the list fills up with clickbait. This problem needs extra attention. The bias of a metric should also be considered. For example, sorting by completion rate pushes short videos to the head of the list.
This problem can be mitigated by normalization. There is a remaining risk, however: the distribution of a resource's posterior metrics tends to depend on the type of the resource itself, so normalization may not remove the bias completely.

<4> Trigger key optimization
The requirement for trigger keys is simple: each time keys are selected, they should be the keys the user is most likely to click. The usual approach is to aggregate the user's click history by attribute and take the top entries. For example, sort categories by click count and use the few most-clicked categories as trigger keys. This is a simple process without many optimization points, so it is not discussed further.

What deserves discussion is how to weight different user behaviors. Selecting keys is really a process of judging which types of content the user is interested in, and the degree of interest is inferred from behavior. In a browse-only product, clicking is the only behavior that expresses interest. But products are usually designed with many kinds of interaction so that users can express their interest effectively, and this should be taken into account when selecting trigger keys. How exactly to do so depends on the business, and is not expanded here.

<5> Multi-dimensional content attributes
Finally, let's talk about a higher-order problem in CB: the multi-term problem. Consider a video that carries multiple tags, where the tags are parallel and not unique. Online, we have obtained the user's high-frequency click tags by aggregating click counts. In general, each tag has its own inverted list, and each tag is recalled separately.
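The trigger-key selection and per-tag inverted recall described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the behavior weights, item data, and function names are hypothetical, and click-through rate is used as the posterior value that orders each inverted list.

```python
from collections import defaultdict

# Hypothetical behavior weights: stronger interactions count more than a click.
BEHAVIOR_WEIGHTS = {"click": 1.0, "like": 2.0, "share": 3.0}

# Hypothetical items: item id -> (tags, posterior click-through rate).
ITEMS = {
    "v1": (["pets"], 0.12),
    "v2": (["pets", "comedy"], 0.30),
    "v3": (["sports"], 0.25),
    "v4": (["comedy"], 0.10),
    "v5": (["pets"], 0.05),
}

# Hypothetical user history as (tag, behavior) pairs.
HISTORY = [("pets", "click"), ("pets", "click"),
           ("comedy", "share"), ("sports", "click")]


def select_trigger_keys(history, top_n=2):
    """Aggregate weighted behaviors per tag and keep the top_n tags."""
    scores = defaultdict(float)
    for tag, behavior in history:
        scores[tag] += BEHAVIOR_WEIGHTS.get(behavior, 0.0)
    # Sort by score descending; break ties by tag name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]


def build_inverted_index(items):
    """Map each tag to its item ids, ordered by click-through rate descending."""
    postings = defaultdict(list)
    for item_id, (tags, ctr) in items.items():
        for tag in tags:
            postings[tag].append((ctr, item_id))
    return {
        tag: [item_id for _, item_id in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for tag, entries in postings.items()
    }


def recall(history, index, per_key=2):
    """Recall the head of each trigger key's inverted list, de-duplicated."""
    seen, result = set(), []
    for key in select_trigger_keys(history):
        for item_id in index.get(key, [])[:per_key]:
            if item_id not in seen:
                seen.add(item_id)
                result.append(item_id)
    return result


if __name__ == "__main__":
    index = build_inverted_index(ITEMS)
    print(select_trigger_keys(HISTORY))
    print(recall(HISTORY, index))
```

Note that a single share outweighs two clicks here, so "comedy" becomes the first trigger key. Whether that is the right trade-off depends on the product, which is why the weights are kept in one table.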
However, it is easy to see the limitation. When a user has both the "handsome guy" and "cute pet" tags, recalling a "beautiful woman + cute pet" video is clearly less attractive than recalling a "handsome guy + cute pet" video. In other words, recall works better if the relationship between multiple terms is considered at the same time.

Traditional search has an established way to implement multi-term recall: by setting a weight for each term, the returned results can contain multiple terms. This belongs to search architecture and is not expanded here.

Let's talk about an approach that is closer to matching in recommendation. The user has multiple labels, and the content also has multiple labels. Multi-term matching is then a match from one list of labels to another list of labels, with the goal of maximizing the probability of a click. Seen this way, the problem turns into a CTR modeling problem.

More simply, train word2vec directly with user tags and content tags as sentences, then sum the vectors of a content's tags offline to get the content attribute vector, and recall content by vector online. Of course, this can be done in more complicated ways to improve accuracy, but the idea is much the same, so it is not expanded here. The author and other dimensions can be added in the same way; that is left for discussion.
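The tag-vector approach can be sketched as follows. To stay self-contained, this sketch does not train word2vec; the 3-dimensional tag vectors are made-up toy values standing in for trained embeddings. Representing the user as the sum of their tag vectors and scoring with cosine similarity are assumptions of this sketch, not something the text prescribes.

```python
import math

# Toy tag vectors standing in for word2vec output (values are made up).
TAG_VECTORS = {
    "handsome_guy": [1.0, 0.0, 0.0],
    "beautiful_woman": [0.0, 1.0, 0.0],
    "cute_pet": [0.0, 0.0, 1.0],
}

# Hypothetical content and its parallel tags.
CONTENT_TAGS = {
    "video_1": ["beautiful_woman", "cute_pet"],
    "video_2": ["handsome_guy", "cute_pet"],
    "video_3": ["beautiful_woman"],
}


def sum_vectors(tags):
    """Sum the vectors of the known tags; unknown tags are skipped."""
    dim = len(next(iter(TAG_VECTORS.values())))
    total = [0.0] * dim
    for tag in tags:
        vec = TAG_VECTORS.get(tag)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Offline step: one attribute vector per content, the sum of its tag vectors.
CONTENT_VECTORS = {cid: sum_vectors(tags) for cid, tags in CONTENT_TAGS.items()}


def recall(user_tags, k=2):
    """Online step: rank content by similarity to the summed user tag vector."""
    user_vec = sum_vectors(user_tags)
    ranked = sorted(
        CONTENT_VECTORS,
        key=lambda cid: cosine(user_vec, CONTENT_VECTORS[cid]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(recall(["handsome_guy", "cute_pet"]))
```

With these toy vectors, a user tagged "handsome guy" and "cute pet" gets the "handsome guy + cute pet" video ahead of the "beautiful woman + cute pet" one, which is the behavior the multi-term discussion asks for. In a real system the linear scan would be replaced by an approximate nearest-neighbor index.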