Author: Zuo Feng Yi
I. Project background
Relying on the music host business, Cloud music has derived a lot of innovative businesses, such as live broadcasting, podcasting and cloud village. Whether it’s music, live streaming, or podcasting, there are two cold-start problems:
- How do YOU distribute content to new (or inactive) users
- How to distribute new (or unpopular) content to users
Each business has some solutions to cold start. A common method is to introduce the user’s behavior data in other scenes. For example, in the live broadcast cold start, we will introduce the user’s behavior in music, such as the user’s hearts, favorites, comments and sharing of songs.
With the continuous development of the business, more and more scenarios hope to have a good effect, low development cost, strong versatility of the scheme to solve the cold start of cloud music business in the early and middle stages.
We investigated various solutions for cold start, such as cross-domain recommendation, new and old user layering, meta-learning, and relationship chain delivery. Finally, it is found that the delivery based on the relationship chain is more suitable for making a general basic scheme, and many Internet products (such as wechat Video number, Pinduoduo, etc.) also take it as an important cold start method.
But cloud music doesn’t have an explicit chain of friends, so what do you do? In this context, the chain of implicit relationships was born.
So how do you build the implicit chain of users?
Cloud music takes content (such as songs, podcasts, live broadcasts and videos, etc.) as the core, connecting people with content. Users comment, share and like a series of behaviors based on content. Therefore, we can learn the vector representation of users by comprehensively using the behavioral data of users in each core business scene of cloud music, and then describe the strength of the implicit relationship of users based on the similarity of the vector.
Based on the above background, the main goal of this project is to learn the vector representation of users by integrating the whole ecological behavior data of users in cloud music, so as to build the implicit relationship chain of users. Based on the implicit relationship chain, realize different downstream recommendation tasks, such as user cold start, content cold start, similar recall and seed user Lookalike, etc.
Ii. Project challenges
The main challenges encountered in the project include:
First, large data scale
The large data scale is mainly reflected in three aspects, with multiple services: cloud music business includes music, search, live broadcasting and other 10+ services; Multiple scenarios: Each service has multiple scenarios, such as 20+ scenarios on the home page and comment page. Multiple behaviors: users have 20+ behaviors such as clicking, playing and following.
In view of this problem, we have made a bottom of each business data situation, the total amount of user behavior data is very large. Considering the efficiency of model training and timeliness of user interest, it is not necessary to use full data. Therefore, we treat business data in two ways:
- Use the core data of the business. In the music business, for example, we use hearts, favorites, comments and shares
- More data cleaning, limiting and filtering, effectively reducing the magnitude of data
Through some of the above processing, we get a relatively high quality of user behavior data.
Second, how to model
The implicit relationship chain of users is not a specific business scene, lacks a clear label and a clear optimization goal, so we cannot directly use the ordering model in the business for modeling. In addition, the behavior of users is highly heterogeneous. The graph composed of users and content (songs, live broadcasts, podcasts, etc.) is heterogeneous, and contains not only heterogeneous nodes (users and various contents), but also heterogeneous edges (various interactive behaviors of users and contents, such as clicking and playing, etc.).
We’ll focus on how to model this in Part 3.
Third, how to evaluate
As mentioned above, we cannot directly use the sorting model, which means we cannot directly use auC and other sorting indicators. Through analysis and investigation, we comprehensively use the following methods for evaluation:
- Qualitative assessment
- Case analysis. This method analyzes the intersection number of user behavior sequence or Jaccard coefficient to judge whether the relationship chain is reliable. This method is intuitive and explicable to some extent. However, there is a question: if the model determines that the similarity of two users is 0.8, does their sequence of behaviors overlap highly? Not necessarily. , for example, A user of song song S1 and S2 has the behavior of hearts, and user B S3 and S4 has the behavior of hearts, to the song and the song is more related to songs (such as this A few songs often appear in many other users’ behavior sequence, or derived from the same artist, etc.), their behavior, though, have not overlap, We can also assume that the interests of the two users are similar.
- Visual analysis. After obtaining the vector representation of the user, we can visualize the Embedding through tools such as the Embedding Projector of TensorFlow, and then observe whether the users with the same tag get together and whether the users with less tag overlap are separated.
- Quantitative evaluation
- On-line experiment. Recall online through U2U, U2U2I, etc., and evaluate according to online revenue. This method is relatively direct and has high confidence, but the experimental cost is also high. It should be evaluated offline before deciding whether to conduct online experiments.
- Offline evaluation. The user vector is input into the sorting model as a feature to evaluate the income of off-line AUC or Loss and other indicators. Or offline generation of u2I recommendation list, and then evaluate recall accuracy and recall rate, etc. These two offline evaluation methods have lower costs than online evaluation methods. Although they are not direct evaluation methods for U2U, they are also feasible methods.
Fourth, how to provide external services
The implicit relationship chain of users is an infrastructure construction. In order to facilitate the rapid access of various business scenarios, we need to make an online service, which involves a problem: how to perform millisecond level similarity retrieval with the user vector of hundreds of millions of levels?
By referring to some vector retrieval engine frameworks in the cloud music industry (such as Faiss, Milvus and Proxima, etc.), Nsearch has been developed to realize high performance similarity retrieval of large-scale vectors, support high concurrency and low latency, complex dynamic expansion of algorithms and incremental import, etc.
In addition, in order to support the ability to customize requirements of different business scenarios, we also designed a set of good service architecture to provide unified interface for business and support the ability of fast access and iteration.
We will focus on the issue of providing external services in Part IV.
3. Technological evolution
In order to construct the implicit relationship chain of users, we first need to generate the vector representation of users. To this end, we have investigated and implemented a variety of algorithms, carried out good practice in many business scenarios, but also carried out many beneficial attempts.
As shown in the figure, we divided the investigation process into five stages according to technical depth and modeling method:
The first stage is the exploratory stage, where we investigate SimHash algorithms. The SimHash algorithm is actually used to compute the similarity between texts, which in our case is to treat a sequence of user actions as a single piece of text.
The second stage is the initial stage. We investigated the Item2VEC model. Item2vec comes from word2vec, and its basic principle is to maximize the probability of co-occurrence of those contextual words that occur near the center word.
The third stage is the exploration stage. We made some optimizations on Item2vec, such as changing global negative sampling to constrained negative sampling and using user_id as global context, which further enhanced vector representation.
The fourth stage is the development stage. We investigated the heterogeneous graph model Metapath2VEC, which can model a variety of entity relations and has strong generalization.
The fifth stage is the innovation stage. The original Metapath2VEc did not add side Information and the generalization performance was not strong enough, so we are making some improvements and optimizations.
Next, let’s focus on SimHash, Item2vec, and MetaPath2vec.
SimHash
The SimHash algorithm is used to calculate the similarity between texts. It is an algorithm used by Google to remove large amounts of text and is also a Locality Sensitive Hashing algorithm: The two strings have some similarity that can be preserved after hash.
The basic principle of SimHash is to map the original text to an N-bit binary string, and then measure the similarity of the original content by comparing the Hamming distance of the binary string. Its basic steps are as follows:
- Keyword extraction. Split Doc, remove the stop word, and assign a weight to each word (such as frequency of occurrence or TF-IDF value, etc.).
- The Hash. The hash algorithm maps each word to a hash value (binary string).
- Weighted. The binary string of words is weighted according to their weight, multiplying the position of 1 by the weight and the position of 0 by the weight and taking the negative.
- A merger. The weighted sequence values calculated by each word are added together to form a sequence.
- Dimension reduction. Converts the combined sequence string to 01, which is the final SimHash value. Conversion rule: If the value of this bit is greater than 0, it takes 1, and if it is less than 0, it takes 0.
Here is a simple example:
Therefore, after SimHash algorithm processing, a text string becomes a numeric string with only 0 and 1. Then, the similarity of two texts is judged by hamming distance. It is generally considered that two texts with hamming distance less than 3 are similar.
So how do we use SimHash to build implicit chains of relationships for users? In fact, by aggregating the sequence of behaviors of each user in each business scenario with some rules, we get a sequence of ids for each user, treat each ID as a word, and give each ID a weight, The SimHash signature (a string of zeros and ones) for each user can be obtained according to the SimHash algorithm flow.
It looks like we’ve solved our problem. However, in the actual application process, we need to retrieve which users are more similar to a user, if compared with the whole database data, it is very inefficient. Here, we generate a 64-bit 01 string for each user, so the problem of building implicit relationship chains can be abstracted as:
Suppose there are 1 billion unique 64 bit 01 strings, given a 64 bit 01 string, how to quickly find out from the 1 billion strings the hamming distance from the given string is less than or equal to 3 (relatively similar).
How to achieve efficient retrieval? We can divide the 64-bit 01 string into four 16-bit segments. According to the drawer principle, if two strings are similar (hamming distance is within 3), then one of them must be equal.
Based on the above ideas, we can design the retrieval process as follows:
-
Import: Traverses all SimHash fingerprints. Perform the following steps for each SimHash fingerprint:
1) Split the 64-bit SimHash fingerprint into 4 16-bit segments 2) Store each segment through KV database or inverted index, for example, segment 1 is stored in library 1, segment 2 is stored in library 2, and the key is the 16-bit string 01. Value is the set of fingerprints with the same key
-
Search: Perform the following steps to retrieve the SimHash fingerprint:
1) Split the SimHash fingerprint into four 16-bit segments; 2) perform equivalent query for each segment respectively in the corresponding database. According to the drawer principle above, the queried fingerprint is suspected to be similar; 3) Compare the fingerprint to be retrieved with the suspected similar fingerprint by hamming distance to determine whether the fingerprint is similar
The overall process can be represented as the following figure:
Assuming the full database has 2^30 (about 1 billion) pieces of data, and assuming that the data is evenly distributed, the maximum number of inversion returns for each 16-bit (random combination of 16 01 numbers has 2^16) is 2^30/2^16= 16,384 candidate results, 4 16-bits, Then there are 4*16384=65536 candidates in total. Therefore, through the above database entry retrieval process, it used to need about 1 billion times of comparison, but now it only needs 65536 times at most to get results, greatly improving the retrieval efficiency.
SimHash, the first algorithm we investigated, is simple, fast, and untrainable, but has some obvious drawbacks:
- SimHash results have certain randomness, and the probability of hamming distance between two random sequences is close to a certain extent. Ensemble is required to reduce the probability, which is relatively expensive
- SimHash’s representation of user similarity is coarse-grained (computes hamming distance, which is an integer)
- SimHash cannot learn the relationship between user behavior context sequences and cannot model i2i, U2I, and so on
Item2Vec
Microsoft published a paper in 2016, Item2Vec:Neural Item Embedding for Collaborative Filtering. Inspired by NLP’s embedding algorithm to learn Word latent representation, the author applies Item2vec to the i2I similarity calculation of recommendation scenes by referring to Google word2vec method. Its main idea is: item is regarded as word in Word2vec, user behavior sequence is regarded as sentence, co-occurrence between items is positive sample, and negative sampling is carried out according to item frequency distribution.
This paper is a very useful article on Microsoft’s application of Word2vec to the recommendation world. The Item2vec method is simple and easy to use, and greatly expands the use of Word2vec from the NLP domain directly to recommendations, advertising, search, and any domain where sequences can be generated.
There is a problem with negative sampling for Item2Vec: the negative sample is too simple. For example, if a user listens to a Cantonese song and uses global negative sampling, it may be an English song, which will weaken the vector representation ability. Therefore, we can carry out constrained negative sampling to improve the ability of vector representation. As shown in figure:
Notice that Item2Vec generates the item vector, so how do WE get the user vector? In fact, after the generation of item vector, we can calculate the vector expression of mean pooling users according to their historical interaction with item. With the user’s vector expression, we can use high-dimensional vector search engines (such as Cloud Music’s own NSearch, Facebook’s Faiss, etc.) to quickly retrieve similar vectors.
In addition to obtaining the user vector indirectly through the above method, we can also refer to Doc2Vec. When we build the training sequence through the interaction history of user and item, we can add the user ID into the training as a global context. Learn the vectors of both user and item.
MetaPath2Vec
Item2Vec is a method to deal with isomorphic network. In practice, we model and fuse it separately by business. MetaPath2Vec is a model for learning node representation of Heterogeneous Information Network (HIN) proposed by YuXiao Dong in 2017.
In this paper, heterogeneous network is defined as graph G=(V,E,T), where V represents the set of nodes,E represents the set of edges, and T represents the set of node types and edge types. For each node v, each edge e has a corresponding mapping function, f (v) : v – > T_V, g (e) : e – > T_E, T_V and T_E said the types of nodes and edges respectively, at the same time | T_V | | + T_E | > 2, namely diagram node type and type of the category of the total is greater than 2.
MetaPath2Vec can be used to learn low-dimensional representations of graph nodes with different node types and different edge types. There are two main ideas at the core of MetaPath2Vec:
- Yuan path
- Heterogeneous Skip – “gramm
Next, we illustrate the user-song-artist network. As shown in the figure, the network has three types of nodes, which are distinguished by different colors.
First, let’s look at what a meta-path is. A meta-path is a combination of node types selected from the figure and has certain service meanings. For example, “user-song-user” indicates that two users have a common behavior for a particular song. We usually design the meta-path in a symmetric form so that we can repeat the loop for many random walks. For example, a meta-path based on “user-song-user” can sample a sequence of “user-1-song 1-user-2-song 2-user 3”.
Compared with the general random walk, the random walk based on meta-path has the following advantages:
- Avoid wandering in favor of node types with high frequency
- Avoid wandering towards relatively concentrated nodes (i.e. nodes with high degree)
- The ability to capture connections between different node types enables the semantic relationships of different types of nodes to be fused into a heterogeneous Skip-Gram model
Next, we look at the heterogeneous Skip-gram, whose optimization goal is to maximize the co-occurrence probability of a node and its heterogeneous context.
The objective function of the heterogeneous Skip-Gram is as follows. The main difference between the heterogeneous Skip-Gram model and the ordinary Skip-Gram model is that a summation symbol is added in the middle, and the relationship between nodes and their heterogeneous neighbors is modeled respectively.
Service deployment
With the vector representation of the user, we can build the implicit relationship chain service. The overall architecture of services is shown in the figure, from bottom to top are algorithm layer, vector engine layer and service layer.
At the algorithm layer, we learn vector representation of users through simHash, Item2vec and Metapath2vec models. In this process, vector representation of content is also produced, that is, Embedding.
In vector engine layer, we import user and content Embedding into vector engine for index construction. The vector engine here uses the nSearch developed by Cloud Music to realize the query requirements of high dimensional vector similarity retrieval, high concurrency and low delay.
In the service layer, the service framework RTRS developed by Cloud Music is adopted to realize the engineering requirements such as dynamic publishing, multi-level caching and asynchronous processing. When the request comes in, the framework parses the protocol parameters, and the recall module goes to the configuration center to load the corresponding service configuration, and then performs the service recall for the scenario.
Through the above framework, we can support a variety of recall methods of implicit relationship chain, including U2U, U2I and I2I, etc. In addition, each recall method can support the ability to customize requirements of different business scenarios.
V. Project achievements
The implicit relationship chain comes from the business and should eventually return to the business, serving the business and creating value for the business.
First, the implicit relationship chain from nothing, provides the ability of implicit relationship service. In this process, we not only build the implicit relationship between users, but also build the implicit relationship chain between content and artists. As shown in the figure, the implicit relationship chain of the artist is displayed.
Second, at present, implicit relationship chain provides implicit relationship services for 13 application scenarios of 9 businesses, such as music, podcast and dynamic, and has achieved good results. When strangers were listening together, the effect improved significantly (+9.4% per person).
Third, the current implicit relationship service, peak QPS is more than 5000, the average time consumption is 10 milliseconds.
Sixth, summary and prospect
Implicit relationship chain is an infrastructure project. Although it is not directly engaged in business, it has the same goal as doing business, and both need to create value for business. We have achieved some results in some business scenarios, but there are many areas that need further improvement:
First, the implicit relationship chain is based on the neural network model. The black-box feature of the neural network makes many models unexplicable, making the implicit relationship chain unable to be applied in some businesses requiring explicit relationships, such as providing recommendation reasons in user recommendation scenarios. For this problem, we will introduce a graph database to aid modeling.
Secondly, the data value of the implicit relationship chain has not been fully explored, such as KOL mining and community discovery.
Thirdly, the generalization ability of the model is not strong enough, and more side information needs to be added.
Fourth, the model is not robust enough. It is easy to be biased by active users and popular content, leading to insufficient learning of inactive users and long-tail content. For this problem, we will introduce contrastive learning (contrastive learning) for modeling.
reference
- Charikar, Moses S. (2002), “Similarity estimation techniques from rounding algorithms”, Proceedings of the 34 th Annual ACM Symposium on going of Computing, p. 380, doi: 10.1145/509907.509965, ISBN 978-1581134957.
- Gurmeet Singh, Manku; Jain, Arvind; Das Sarma, Anish (2007), “Detecting near-duplicates for web crawling”, Proceedings of the 16 th International Conference on World Wide Web (PDF), p. 141, doi: 10.1145/1242572.1242592, ISBN 9781595936547.
- Barkan O, Koenigstein N. Item2vec: neural item embedding for collaborative filtering[C]//2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016: 1-6.
- Dong Y, Chawla N V, Swami A. metapath2vec: Scalable representation learning for heterogeneous networks[A]. Proceedings of the 23rd ACM SIGKDD international Conference on Knowledge Discovery and Data Mining [C]. 2017: 135 — 144.
- Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
- Xiangnan He, Kuan Deng ,Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang(2020). LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation
This article is published by NetEase Cloud Music Technology team. Any unauthorized reprinting of this article is prohibited. We recruit all kinds of technical positions all year round, if you are ready to change jobs, and you like cloud music, then join us at [email protected].