Item2Vec: Generalization of the Word2vec method
After the birth of Word2vec, the idea of Embedding spread rapidly from natural language processing to almost every field of machine learning, and recommendation systems are no exception. Since Word2vec can embed words that appear in a "sequence", there should be a corresponding Embedding method for items that users purchase in a "sequence" or movies that users watch in a "sequence".
Therefore, Microsoft proposed Item2Vec in 2015 as a generalization of the Word2vec method, making the Embedding approach applicable to almost all sequence data. The technical details of Item2Vec are almost exactly the same as those of Word2vec: if we can express the objects we care about as sequence data and feed that sequence data to Word2vec, we can obtain the Embedding of any object. Item2vec is of course important for recommendation systems, because it makes "Embedding everything" possible. For a recommendation system, Item2vec can use item Embeddings to compute item similarity directly, or feed the Embeddings into the recommendation model as important features for training, both of which help improve the effectiveness of the recommendation system.
What graph structured data is available on the Internet?
As we know, any object that can be represented as sequence data can be embedded using Item2vec. But data on the Internet is more than just sequences; more and more of it is presented as graphs. In this case, Embedding methods based on sequence data alone are not enough. And it would be a pity to give up graph-structured data in a recommendation system, because graph data contains a great deal of very valuable structural information. So how can we generate Embeddings based on graph-structured data? Embedding methods based on graph structure are also called Graph Embedding.
You may not yet know what information graph structures contain, or why we would want to make use of them and generate Embeddings from them. So let me take you through some of the most typical graph-structured data on the Internet. Figure 1:
In fact, graph-structured data is almost everywhere on the Internet. The most typical example is the social network we use every day (see Figure 1-a). From a social network we can find opinion leaders and communities, and make social recommendations based on these "social" features. If we can encode the nodes of a social network as Embeddings, the process of social recommendation becomes very convenient.
The knowledge graph is also a hot research and application direction recently. As shown in Figure 1-b, a knowledge graph contains different types of knowledge entities (such as people and places), attributes attached to those entities (such as person descriptions and object characteristics), as well as relations between entities and between entities and attributes. If we can embed the entities in a knowledge graph, we can discover the potential relationships between them, which is very helpful for content-based and knowledge-based recommendation systems.
Another very important type of graph data is the behavior relation graph. Such data, found in almost all Internet applications, is essentially a "bipartite graph" (as shown in Figure 1-c) composed of users and items, where the interactions between users and items generate the edges of the behavior graph. With the help of such a relation graph, we can naturally use Embedding technology to explore the relationships between items, between users, and between users and items, and then apply them to further steps of the recommendation system.
There is no doubt that graph data is of great value. If the nodes in a graph can be embedded, the resulting vectors will be very valuable features for the recommendation system. So let's get to the topic and learn Graph Embedding methods based on graph data.
Random Walk based Graph Embedding method: DeepWalk
We will first learn a Graph Embedding method that is highly influential and widely used in industry: DeepWalk, proposed by researchers at Stony Brook University in 2014. Its main idea is to perform random walks on the graph structure built from items, generate a large number of item sequences, then feed these sequences into Word2vec as training samples, and finally obtain the item Embeddings. DeepWalk can therefore be regarded as a transitional method connecting sequence Embedding and Graph Embedding. Figure 2 shows the execution flow of the DeepWalk method.
Next, I will refer to the four diagrams in Figure 2 to explain the algorithm flow of DeepWalk in detail. Figure 2:
- First, we construct the item relation graph (Figure 2-b) from the original sequences of user behavior (Figure 2-a), such as sequences of purchases or video views. Because user Ui purchased item A and then item B, a directed edge from A to B is generated; if the same directed edge is produced more than once, its weight is strengthened. After all user behavior sequences have been converted into edges of the item relation graph, the global item relation graph is established.
- Then, we use random walks to randomly choose starting points and regenerate item sequences (Figure 2-c). The number and length of the random walks are hyperparameters that need to be tuned for the specific application.
- Finally, we feed these randomly walked item sequences into the Word2vec model of Figure 2-d to generate the final item Embedding vectors.
In the DeepWalk algorithm above, the only thing that needs to be formally defined is the jump probability of the random walk, that is, the probability of traversing to an adjacent node $v_j$ after reaching node $v_i$. If the item graph is a directed weighted graph, the probability of jumping from node $v_i$ to node $v_j$ is defined as follows:

$$
P(v_j \mid v_i)=
\begin{cases}
\dfrac{M_{ij}}{\sum_{j \in N_+(v_i)} M_{ij}}, & v_j \in N_+(v_i) \\
0, & \text{otherwise}
\end{cases}
$$

Here, $N_+(v_i)$ is the set of nodes reachable through the outgoing edges of node $v_i$, and $M_{ij}$ is the weight of the edge from node $v_i$ to node $v_j$. In other words, the jump probability of DeepWalk is the weight of the chosen outgoing edge divided by the sum of the weights of all relevant outgoing edges. If the item relation graph is undirected and unweighted, the jump probability is a special case of the formula above: the weight $M_{ij}$ becomes a constant 1, and $N_+(v_i)$ becomes the set of all nodes adjacent to $v_i$ through any of its edges, not just its outgoing edges.
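To make the formula concrete, here is a minimal sketch, not part of the Spark implementation later in this article, that computes a node's jump distribution from its weighted out-edges; the edge weights used are hypothetical example data.

```scala
// Minimal sketch: DeepWalk jump probabilities from a node's weighted out-edges.
// The edge weights below are hypothetical example data.
object JumpProbabilitySketch {
  def jumpProbabilities(outEdges: Map[String, Double]): Map[String, Double] = {
    val totalWeight = outEdges.values.sum
    // P(v_j | v_i) = M_ij / sum of all outgoing edge weights of v_i
    outEdges.map { case (neighbor, weight) => neighbor -> weight / totalWeight }
  }

  def main(args: Array[String]): Unit = {
    // Out-edges of item "A": the edge A->B was generated 3 times, A->C once
    val outEdgesOfA = Map("B" -> 3.0, "C" -> 1.0)
    println(jumpProbabilities(outEdgesOfA)) // Map(B -> 0.75, C -> 0.25)
  }
}
```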
Once the random walks have produced the new item sequences, we can feed them into Word2vec and generate the item Embeddings in the classical Word2vec way.
A tradeoff between homogeneity and structure: Node2vec
In 2016, researchers at Stanford University took DeepWalk one step further and proposed the Node2vec model. Node2vec adjusts the jump probabilities of the random walk so that the Graph Embedding results can be balanced between the network's Homophily and its Structural Equivalence. The different Embeddings can then be fed into the recommendation model so that the recommendation system learns different network structure characteristics.
The "homogeneity" of a network means that the Embeddings of nodes that are close to each other should be similar. As shown in Figure 3, the Embedding of node U should be close to those of its directly connected nodes S1, S2, S3 and S4; this is what network homogeneity means. On an e-commerce site, homogeneous items are likely to belong to the same category, share the same attributes, or often be bought together.
"Structure", on the other hand, means that nodes with similar structural roles should have similar Embeddings. For example, node U and node S6 in Figure 3 are each the central node of their own local network. They are structurally similar, so their Embeddings should also be similar; this is what "structure" means. On an e-commerce site, structurally similar items are typically the hot items of each category, the best add-on items of each category, and other items with similar trends or structural attributes.
After understanding these basic concepts, a question arises: how do the results of Graph Embedding express "structure" and "homogeneity"?
First, in order for the Graph Embedding results to express the network's "structure", the random walk process needs to lean toward BFS (Breadth-First Search). Because BFS travels more within the neighborhood of the current node, it effectively performs a "microscopic scan" of the network structure around that node. Whether the current node is a "local central node", an "edge node", or a "connecting node" determines that the generated sequences contain different numbers and orderings of nodes, so the final Embedding can capture more structural information.
In order to express "homogeneity", the random walk should instead lean toward DFS (Depth-First Search), because DFS is more likely to reach distant nodes through multiple jumps. DFS wandering, however, tends to stay inside one large cluster, which makes the Embeddings of nodes within the same cluster or community more similar, thereby expressing the network's "homogeneity".
So how does the Node2vec algorithm control the preference for BFS versus DFS? Mainly through the jump probabilities between nodes. Figure 4 shows the jump probabilities in Node2vec when the walk has just moved from node t to node v and must next jump from node v to one of its neighbors. Pay attention to the roles of these nodes: node t is the node visited in the previous step of the random walk, node v is the node currently being visited, and nodes x1, x2, x3 are the neighbors of v other than t, with x1 also being connected to t. These different roles determine the probability of the next jump in the random walk.
Node2vec defines the (unnormalized) probability of jumping from node v to node x as:

$$
\pi_{vx} = \alpha_{pq}(t,x) \cdot w_{vx}
$$

where $w_{vx}$ is the original weight of the edge $vx$, and $\alpha_{pq}(t,x)$ is a jump weight defined by Node2vec. Whether the walk prefers DFS or BFS depends on the definition of this jump weight. Let's take a look at its exact definition and then explain it further:
$$
\alpha_{pq}(t,x) =
\begin{cases}
\dfrac{1}{p}, & d_{tx}=0 \\
1, & d_{tx}=1 \\
\dfrac{1}{q}, & d_{tx}=2
\end{cases}
$$

Here, $d_{tx}$ is the distance from node t to node x. For example, x1 is directly connected to t, so its $d_{tx}$ is 1; the distance from t to itself, $d_{tt}$, is 0; and for x2 and x3, which are not connected to t, $d_{tx}$ is 2.
The parameters p and q in $\alpha_{pq}(t,x)$ jointly control the tendency of the random walk. Parameter p is called the return parameter: the smaller p is, the more likely the walk is to return to node t, and the more Node2vec focuses on expressing the structure of the network. Parameter q is called the in-out parameter: the smaller q is, the more likely the walk is to move to distant nodes, and the more Node2vec focuses on expressing the homogeneity of the network; conversely, the larger q is, the more the walk stays around nearby nodes. You can try different values of p and q yourself and compute the probabilities of jumping from v to t, x1, x2 and x3; that should make the walk biases I described easy to understand.
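If you want to experiment with p and q, here is a minimal, self-contained sketch; it is an illustration of the formula above, not code from the article's Spark project, and the edge weights and the p, q values are hypothetical.

```scala
// Minimal sketch of the Node2vec jump weights for the situation in Figure 4.
// Edge weights and p, q values below are hypothetical example data.
object Node2vecBiasSketch {
  // alpha_pq(t, x) depends on the distance d_tx between the previous node t and candidate x
  def alpha(p: Double, q: Double, dTX: Int): Double = dTX match {
    case 0 => 1.0 / p   // jump back to t
    case 1 => 1.0       // x is also connected to t (e.g. x1)
    case _ => 1.0 / q   // x is not connected to t (e.g. x2, x3)
  }

  def main(args: Array[String]): Unit = {
    val (p, q) = (0.5, 2.0)   // small p, large q: the walk tends to stay near t (BFS-like)
    // candidate next nodes from v: (name, d_tx, edge weight w_vx)
    val candidates = Seq(("t", 0, 1.0), ("x1", 1, 1.0), ("x2", 2, 1.0), ("x3", 2, 1.0))
    val pi = candidates.map { case (name, d, w) => name -> alpha(p, q, d) * w }
    val z = pi.map(_._2).sum
    // normalized jump probabilities from v
    pi.foreach { case (name, weight) => println(f"P(v -> $name) = ${weight / z}%.3f") }
  }
}
```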
Experiments verify that Node2vec can flexibly express homogeneity and structure: by adjusting the parameters p and q we can make it produce different Embedding results. Figure 5 shows both cases. When Node2vec emphasizes homogeneity, nodes that are close to each other have more similar colors; when it emphasizes structure, nodes with similar structural characteristics have more similar colors.
There is no doubt that the homogeneity and the structure of the network captured by Node2vec are both very important feature expressions for a recommendation system. Because Node2vec is flexible enough to explore different graph characteristics, we can even feed the "structural" Embeddings and the "homogeneity" Embeddings generated by differently parameterized Node2vec runs into the subsequent deep learning network together, so as to preserve the item's different graph characteristics.
How is Embedding applied in feature engineering of recommendation system?
So far we have covered Word2vec and Item2vec, the Embedding methods for sequence data, and DeepWalk and Node2vec, the Embedding methods for graph data. You may now have two questions: does Embedding itself belong to the feature engineering module, and how exactly is Embedding applied in a recommendation system? Let me give a general answer here.
- The first question is easy to answer. Since the output of Embedding is a numerical feature vector, Embedding technology itself can be regarded as a feature processing method. Unlike simple one-hot encoding, however, Embedding is a higher-order feature processing method: it can condense sequence structure, network structure and even other features into a single feature vector.
- There are three answers to the second question, because there are roughly three ways Embedding is applied in a recommendation system: "direct application", "pre-training application" and "End2End application".
Among them, "direct application" is the simplest: after we obtain the Embedding vectors, we directly use their similarity to implement certain functions of the recommendation system. Typical uses include using the similarity between item Embeddings to recommend similar items, using the similarity between item Embeddings and the user Embedding to implement classic functions like "guess you like", and using Embeddings to implement the recall layer of the recommendation system.
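As a small illustration of "direct application", here is a minimal sketch assuming the item Embedding vectors have already been trained and loaded into memory; the vectors and IDs below are hypothetical.

```scala
// Minimal sketch: ranking items by cosine similarity of their Embedding vectors.
// The Embedding values below are hypothetical; in practice they would be loaded
// from the embedding.txt file produced by the training code later in this article.
object DirectUseSketch {
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    val itemEmb = Map(
      "592" -> Array(0.12, -0.30, 0.55),
      "588" -> Array(0.10, -0.25, 0.50),
      "296" -> Array(-0.40, 0.20, 0.05)
    )
    // Recommend the items most similar to item "592"
    val target = itemEmb("592")
    itemEmb.collect { case (id, v) if id != "592" => id -> cosineSimilarity(target, v) }
      .toSeq.sortBy(-_._2)
      .foreach { case (id, sim) => println(s"$id $sim") }
  }
}
```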
"Pre-training application" means that after we have pre-trained the Embeddings of items and users, we do not apply the Embedding vectors directly; instead, we make them part of the feature vector, together with other features, as the input of the recommendation model for training. Doing so lets the Embedding information be combined with other features, allowing the recommendation model to make more comprehensive and accurate predictions.
The third way is called the "End2End application". It may look like a new term; its full name is "End-to-End Training", and it is not mysterious at all. It means that we do not train the Embedding in advance, but combine the training of the Embedding with the deep learning recommendation model, train them together in a unified, end-to-end way, and directly obtain a recommendation model that contains the Embedding layer. This approach is very popular. For example, Figure 6 shows three classical models that incorporate an Embedding layer: Microsoft's Deep Crossing, UCL's FNN and Google's Wide&Deep.
How to use Spark to generate Item2vec and Graph Embedding?
Spark MLlib contains a number of mature machine learning models, including the Word2vec model we have been discussing. Based on it, we will complete the training of Item2vec and of the DeepWalk-based Graph Embedding on the Spark platform.
If you are more familiar with other machine learning platforms, you may ask: TensorFlow and PyTorch have powerful deep learning toolkits, can't they be used for Embedding training? Of course they can. However, Spark, as a native distributed computing platform, still has advantages over deep learning platforms such as TensorFlow when processing big data, and many companies in the industry still use Spark to train machine learning models with relatively simple structures. Therefore, we will continue to implement the Embedding practice with Spark.
First, let’s look at how to complete the Item2vec training.
Item2vec: Processing of sequence data
As we know, Item2vec is based on the natural language processing model Word2vec, so Item2vec operates on sequence data, just as Word2vec operates on text sentences. Before we can actually train Item2vec, we need to prepare the sequence data for it. In the MovieLens dataset there is a table called ratings, which records which movies each user has rated and when. Since we have both the timestamps and the rating history, the sequences we need can be obtained by manipulating the ratings table.
However, two more questions need to be clarified before we turn the ratings into movie-watching sequences. First, MovieLens is essentially just a rating table, not a true "viewing sequence". But a user must have watched a movie in order to rate it, so we can roughly treat the rating sequence as a viewing sequence. Second, should we put all the movies into the sequence, or only the highly rated ones?
Here, I suggest that we filter the ratings and keep only the movies that users rated highly. Why do that? We need to think about what Item2vec is essentially trying to learn. We want Item2vec to learn the similarity between items, so of course we want well-rated movies to end up close together, and we do not want poorly rated movies to appear next to well-rated ones in the sequences.
Ok, so here we have clarified the idea of sample processing. That is, for a user, we first filter out the movies with low ratings, and then sort the movies he has reviewed by time stamp. In this way, we get a user’s viewing sequence, and the viewing sequences of all users constitute the training sample set of Item2vec.
So how exactly does this work on Spark? It’s pretty simple. Just understand these 5 key steps:
- Read the raw ratings data into the Spark platform;
- Filter out low-rating records with a where statement;
- Use a groupBy userId operation to aggregate each user's rating records, so that each record in the DataFrame becomes one user's rating sequence;
- Define a user-defined function sortUdf and use it to sort each user's rating records by timestamp;
- Process each user's rating records into a single string for subsequent training.
For the specific implementation, I suggest you refer to the code below. I have added comments in the important places to make it easier to follow.
```scala
// Requires: org.apache.spark.sql.{SparkSession, Row}, org.apache.spark.sql.functions._,
// org.apache.spark.sql.expressions.UserDefinedFunction and org.apache.spark.rdd.RDD
def processItemSequence(sparkSession: SparkSession): RDD[Seq[String]] = {
  // Set the path of the rating data and load it with Spark
  val ratingsResourcesPath = this.getClass.getResource("/webroot/sampledata/ratings.csv")
  val ratingSamples = sparkSession.read.format("csv")
    .option("header", "true")
    .load(ratingsResourcesPath.getPath)

  // Implement a user-defined function (UDF) for the subsequent sorting step
  val sortUdf: UserDefinedFunction = udf((rows: Seq[Row]) => {
    rows.map { case Row(movieId: String, timestamp: String) => (movieId, timestamp) }
      .sortBy { case (movieId, timestamp) => timestamp }
      .map { case (movieId, timestamp) => movieId }
  })

  // Process the raw rating data into sequence data
  val userSeq = ratingSamples
    .where(col("rating") >= 3.5)   // Filter out ratings below 3.5
    .groupBy("userId")             // Group by user ID
    .agg(sortUdf(collect_list(struct("movieId", "timestamp"))) as "movieIds")
    // Each user generates one sequence, sorted by timestamp with the UDF defined above
    .withColumn("movieIdStr", array_join(col("movieIds"), " "))
    // Concatenate all IDs into a single string for subsequent Word2vec processing

  // Keep only the sequence column and split it back into a sequence of movie IDs
  userSeq.select("movieIdStr").rdd.map(r => r.getAs[String]("movieIdStr").split(" ").toSeq)
}
```
In the rating sequences generated by this code, each sample has a very simple form: it is just a sequence of movie IDs. For example, the following is the viewing sequence of the user with ID 11888:
296 380 344 588 593 231 595 318 480 110 253 288 47 364 377 589 410 597 539 39 160 266 350 553 337 186 736 44 158 551 293 780 353 368 858
Item2vec: Model training
With the training data ready, it's time to move on to model training. Writing the entire Item2vec training process by hand would be quite a headache, but fortunately Spark MLlib already provides a handy interface to the Word2vec model. I'll post the training code below and then walk you through what each step of the code is doing.
```scala
// Requires: org.apache.spark.mllib.feature.Word2Vec, org.apache.spark.rdd.RDD
// and java.io.{File, BufferedWriter, FileWriter}
def trainItem2vec(samples: RDD[Seq[String]]): Unit = {
  // Set the model parameters
  val word2vec = new Word2Vec()
    .setVectorSize(10)
    .setWindowSize(5)
    .setNumIterations(10)

  // Train the model
  val model = word2vec.fit(samples)

  // Use the model to find the 20 items most similar to item "592"
  val synonyms = model.findSynonyms("592", 20)
  for ((synonym, cosineSimilarity) <- synonyms) {
    println(s"$synonym $cosineSimilarity")
  }

  // Save the Embedding vectors to a file
  val embFolderPath = this.getClass.getResource("/webroot/sampledata/")
  val file = new File(embFolderPath.getPath + "embedding.txt")
  val bw = new BufferedWriter(new FileWriter(file))
  var id = 0
  // Get all Embedding vectors with model.getVectors
  for (movieId <- model.getVectors.keys) {
    id += 1
    bw.write(movieId + ":" + model.getVectors(movieId).mkString(" ") + "\n")
  }
  bw.close()
}
```
As you can see from the code above, Spark’s Word2vec model training process is very simple, requiring only four or five lines of code to complete. Here, I’ll break down three of the key steps from top to bottom.
- The first step is to create the Word2vec model and set its parameters. There are three key parameters: setVectorSize, setWindowSize and setNumIterations. setVectorSize sets the dimension of the generated Embedding vectors, setWindowSize sets the size of the sliding window used to sample the sequence data, and setNumIterations sets the number of training iterations. The specific values of these hyperparameters should be tuned according to the actual training results.
- Second, the training process itself is very simple: just call the model's fit interface. When training is complete, it returns a model object containing all the model parameters.
- The final step is to extract and save the Embedding vectors. As you can see from the last few lines of code, we call the getVectors interface to extract the Embedding vector for each movie ID, and then save them to a file or another database for use by other modules.
After training finishes, we should verify whether the results are reasonable. In the code I look up the movies similar to the movie with ID 592, which is "Batman". I list the similar movies obtained by Item2vec below, so you can judge for yourself whether the result is reasonable.
Graph Embedding: Data preparation
At this point, I'm sure you're familiar with the Item2vec implementation. Next, we will talk about the Graph Embedding method based on random walks and see how to implement it using Spark. Here, we choose the DeepWalk method. Figure 3:
In the DeepWalk method, the most critical data we need to prepare is the transition probability matrix between items. Figure 3 shows the algorithm flow of DeepWalk; the transition probability matrix expresses the item relation graph of Figure 3(b) and defines the probability of jumping from item A to item B during the random walk. So let's take a look at how Spark generates this transition probability matrix.
```scala
// samples: the input collection of viewing sequences
def graphEmb(samples: RDD[Seq[String]], sparkSession: SparkSession): Unit = {
  // Break each viewing sequence into adjacent movie pairs using the flatMap operation
  val pairSamples = samples.flatMap[String](sample => {
    var pairSeq = Seq[String]()
    var previousItem: String = null
    sample.foreach((element: String) => {
      if (previousItem != null) {
        pairSeq = pairSeq :+ (previousItem + ":" + element)
      }
      previousItem = element
    })
    pairSeq
  })

  // Count the occurrences of each movie pair
  val pairCount = pairSamples.countByValue()

  // Double-layer Map data structure holding the transition counts
  val transferMatrix = scala.collection.mutable.Map[String, scala.collection.mutable.Map[String, Long]]()
  val itemCount = scala.collection.mutable.Map[String, Long]()
  var lognumber = 0

  // Build the transition probability matrix
  pairCount.foreach(pair => {
    val pairItems = pair._1.split(":")
    val count = pair._2
    lognumber = lognumber + 1
    println(lognumber, pair._1)

    if (pairItems.length == 2) {
      val item1 = pairItems.apply(0)
      val item2 = pairItems.apply(1)
      if (!transferMatrix.contains(item1)) {
        transferMatrix(item1) = scala.collection.mutable.Map[String, Long]()
      }
      transferMatrix(item1)(item2) = count
      itemCount(item1) = itemCount.getOrElse[Long](item1, 0) + count
    }
  })
  // ... (transferMatrix and itemCount are then used for random walk sampling, see below)
}
```
The input of this transition-matrix-building function is the viewing sequence data we already prepared when training Item2vec, and the output is the transition probability matrix. Since the matrix is rather sparse, I did not use a memory-hungry two-dimensional array, but a double-layer Map structure instead. Strictly speaking, transferMatrix stores co-occurrence counts rather than probabilities: transferMatrix(itemA)(itemB) is the number of times the pair A to B was observed, and dividing it by itemCount(itemA), the total outgoing count of itemA, gives the transition probability from itemA to itemB. In the process of building it, I first used Spark's flatMap operation to "break" each viewing sequence into movie pairs, then counted these pairs with the countByValue operation, and finally filled in the transition counts between every two movies according to these pair counts.
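As a small illustration (the item IDs and counts are hypothetical), this is how the actual transition probability would be read out of the two structures built above:

```scala
import scala.collection.mutable

// Illustration only: how a transition probability is read out of the two structures
// built by graphEmb above. "A" and "B" are hypothetical item IDs with made-up counts.
val transferMatrix = mutable.Map("A" -> mutable.Map("B" -> 3L, "C" -> 1L))
val itemCount = mutable.Map("A" -> 4L)

val countAB: Long = transferMatrix("A").getOrElse("B", 0L)   // co-occurrence count of the pair A -> B
val probAB: Double = countAB.toDouble / itemCount("A")       // transition probability P(B | A) = 3/4 = 0.75
println(probAB)
```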
After obtaining the transition probability matrix between items, we can proceed to the random walk sampling step shown in Figure 3(c).
Graph Embedding: Random walk sampling process
Random walk sampling is the process of generating new sequence samples using the transition probability matrix. How should we understand that? First, we randomly choose a starting item according to the distribution of item occurrence counts, and then enter the random walk process. At each step, we look up the transition probabilities from the current item in the matrix and jump according to those probabilities. For example, if the current item is A, and from A we may jump to item B with probability 0.4 or to item C with probability 0.6, then we randomly move to B or C according to these probabilities, and so on, until the sample reaches the required length.
Based on the random walk process above, I implemented it in Scala. You can refer to the following code, and I have also provided comments in key places:
```scala
// Requires: import scala.util.Random and import scala.util.control.Breaks._

// Random walk sampling function
// transferMatrix: the transition probability matrix
// itemCount: the number of occurrences of each item
def randomWalk(transferMatrix: scala.collection.mutable.Map[String, scala.collection.mutable.Map[String, Long]],
               itemCount: scala.collection.mutable.Map[String, Long]): Seq[Seq[String]] = {
  // Number of samples
  val sampleCount = 20000
  // Length of each sample
  val sampleLength = 10
  val samples = scala.collection.mutable.ListBuffer[Seq[String]]()

  // Total number of item occurrences
  var itemTotalCount: Long = 0
  for ((k, v) <- itemCount) itemTotalCount += v

  // Random walk sampleCount times to generate sampleCount sequence samples
  for (w <- 1 to sampleCount) {
    samples.append(oneRandomWalk(transferMatrix, itemCount, itemTotalCount, sampleLength))
  }
  Seq(samples.toList: _*)
}

// Generate one sample through a single random walk
// transferMatrix: the transition probability matrix
// itemCount: the number of occurrences of each item
// itemTotalCount: the total number of item occurrences
// sampleLength: the length of each sample
def oneRandomWalk(transferMatrix: scala.collection.mutable.Map[String, scala.collection.mutable.Map[String, Long]],
                  itemCount: scala.collection.mutable.Map[String, Long],
                  itemTotalCount: Long,
                  sampleLength: Int): Seq[String] = {
  val sample = scala.collection.mutable.ListBuffer[String]()

  // Determine the starting point
  val randomDouble = Random.nextDouble()
  var firstElement = ""
  var culCount: Long = 0
  // Randomly choose the starting item according to its probability of appearing
  breakable { for ((item, count) <- itemCount) {
    culCount += count
    if (culCount >= randomDouble * itemTotalCount) {
      firstElement = item
      break
    }
  }}

  sample.append(firstElement)
  var curElement = firstElement

  // Walk (sampleLength - 1) more steps to generate a sequence of length sampleLength
  breakable { for (w <- 1 until sampleLength) {
    if (!itemCount.contains(curElement) || !transferMatrix.contains(curElement)) {
      break
    }
    // The transition probability vector from curElement to the next hop
    val probDistribution = transferMatrix(curElement)
    val curCount = itemCount(curElement)
    val stepRandom = Random.nextDouble()
    var stepCulCount: Long = 0
    // Randomly choose the next item according to the transition probability vector
    breakable { for ((item, count) <- probDistribution) {
      stepCulCount += count
      if (stepCulCount >= stepRandom * curCount) {
        curElement = item
        break
      }
    }}
    sample.append(curElement)
  }}
  Seq(sample.toList: _*)
}
```
After the random walk has generated the sampleCount training samples we need, the rest of the process is exactly the same as in Item2vec: we feed these training samples into the Word2vec model to generate the final Graph Embedding, and you can verify the results of the Graph Embedding method in the same way. For the Spark implementation of Item2vec, pay attention to the parameters of the Word2vec model, such as vectorSize, windowSize and numIterations, and understand their respective roles: they set the dimension of the Embedding vectors, the size of the sliding window used to sample the sequence data, and the number of training iterations. For the DeepWalk implementation, focus on the method of generating the transition probability matrix between items and the process of generating training samples through random walks. A short sketch below shows how the pieces defined above could be glued together.
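For reference, here is a minimal sketch of how the functions defined in this article might be wired together in a Spark application. The SparkSession settings are illustrative assumptions, and it assumes that graphEmb goes on to call randomWalk and feed the resulting samples into trainItem2vec, which the truncated code above does not show.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of wiring the pieces together; settings are illustrative assumptions.
object EmbeddingPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: run locally on the sample data
      .appName("embeddingSketch")
      .getOrCreate()

    // 1. Item2vec: ratings table -> viewing sequences -> Word2vec
    val viewingSeq = processItemSequence(spark)
    trainItem2vec(viewingSeq)

    // 2. DeepWalk-style Graph Embedding: sequences -> item graph -> random-walk samples -> Word2vec
    //    (assuming graphEmb internally calls randomWalk and trains Word2vec on its output)
    graphEmb(viewingSeq, spark)

    spark.stop()
  }
}
```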
Conclusion
That's all for the relevant knowledge about Embedding; all of the content above has been tested in practice.
I'm still a beginner in this area myself, so if anything is wrong, please point it out.
If anything is unclear, feel free to leave a comment and we can discuss it together!