The Word2vec paper was published by Google’s research team in 2013 and changed the development of NLP technology to a great extent. Today, when Embedding is used to solve problems in many different fields, the discussion can hardly avoid Word2vec. So what is Embedding? For those who already know Word2vec, Embedding is essentially a generalization of the same idea: a representation learning method that uses dense vectors to represent features.
For example, in the domain of search and recommendation, Word2vec’s ideas can be used to encode key features such as Item, Query, and User into dense vectors. Once these vectors exist, they can be further used in recall and ranking scenarios. It is no exaggeration to say that this is a foundation of intelligent search and recommendation systems, and that the emergence of Word2vec promoted the development of this field.
For me, understanding Word2vec came as an epiphany, mostly because the transition from the core idea to the actual implementation was not straightforward, with questions like “How does it update the vector parameters?”. Only after going through some online courses and the model’s training code did a series of lingering questions suddenly become clear, and at the same time I realized the importance of this technique.
Although there are some excellent articles about this technique on the web, I wanted to write it down, partly to test my understanding of it, and partly to help students like me, who were struggling with it at the beginning, quickly understand it.
Prerequisites: This article assumes that you already know the concepts of one-hot encoding, logistic regression, Softmax, and the inner product.
The core idea of Word2vec
In natural language processing (NLP), features are made up of words, and since words are discrete data, a computer needs them encoded as numbers before it can process them. In the past, the most intuitive encoding method was one-hot encoding. However, an obvious problem with one-hot encoding is that any two vectors are orthogonal, so the correlation between words cannot be expressed by their inner product or cosine similarity, and such semantic correlation is very important.
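As a quick illustration, here is a minimal NumPy sketch (with a made-up four-word vocabulary, purely for demonstration) showing that the inner product of any two different one-hot vectors is always zero:

```python
import numpy as np

# A toy vocabulary of 4 words; row i of the identity matrix is the one-hot code of word i.
vocab = ["king", "queen", "man", "woman"]
one_hot = np.eye(len(vocab))

# The inner product of any two different one-hot vectors is 0,
# so one-hot codes cannot express how related two words are.
print(one_hot[0] @ one_hot[1])   # king . queen -> 0.0
print(one_hot[2] @ one_hot[3])   # man  . woman -> 0.0
print(one_hot[0] @ one_hot[0])   # a word with itself -> 1.0
```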
Since sparse one-hot encoding is not suitable, it is natural to use a dense vector to represent each word instead, much like the player radar chart in football video games, as shown below:
The radar chart in the game uses six dimensions to describe every player in the world. Each player takes different values in these six dimensions to express differences in ability, and players with similar values in some dimensions are more “similar” to each other. For example, players with a high number of shots and a high conversion rate are likely to be good strikers.
In fact, the radar chart above is a way of encoding players into dense vectors, namely the Embedding mentioned earlier. However, Embeddings in machine learning usually have many more dimensions, and each dimension is not as easy to interpret as in the radar chart.
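To make the analogy concrete, here is a minimal sketch with made-up six-dimensional attribute vectors (the player roles and numbers are purely illustrative assumptions) showing how similarity between dense vectors can be measured:

```python
import numpy as np

# Hypothetical 6-dimensional "radar chart" vectors
# (dimensions: shooting, passing, dribbling, defending, pace, physical).
striker_a  = np.array([92, 78, 85, 35, 88, 80])
striker_b  = np.array([89, 75, 88, 30, 90, 78])
goalkeeper = np.array([15, 55, 30, 40, 45, 82])

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(striker_a, striker_b))   # close to 1: the two strikers are similar
print(cosine(striker_a, goalkeeper))  # noticeably smaller
```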
That is the encoding goal of Word2vec. Next we need to decide how to do the encoding, and the idea behind it is this:
You shall know a word by the company it keeps. (J. R. Firth 1957: 11) — from cs224n
It means that the words that often appear around a word give it its meaning. Skip-gram and CBOW (Continuous Bag of Words) are the two algorithms that encode words around this core idea.
Skip-gram model
Skip-gram predicts the surrounding words from the center word, while CBOW does the reverse and predicts the center word from the surrounding words. For the window “The man loves his son”, “loves” is the center word of the window, and the other four words are called window words in this article. During training, the skip-gram and CBOW methods represent the window as follows (a small code sketch of the resulting training pairs is also given below):
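To make the two directions concrete, here is a minimal Python sketch (the example window is hard-coded; this is illustrative only) of the training examples each method would form from this window:

```python
# One window of the corpus: "The man loves his son", with "loves" as the center word.
sentence = ["the", "man", "loves", "his", "son"]
center_index = 2

# Skip-gram: predict each window word from the center word.
skip_gram_pairs = [(sentence[center_index], w)
                   for i, w in enumerate(sentence) if i != center_index]
print(skip_gram_pairs)
# [('loves', 'the'), ('loves', 'man'), ('loves', 'his'), ('loves', 'son')]

# CBOW: predict the center word from all window words.
cbow_pair = ([w for i, w in enumerate(sentence) if i != center_index],
             sentence[center_index])
print(cbow_pair)
# (['the', 'man', 'his', 'son'], 'loves')
```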
Because skip-gram is more common in practical applications, it is the method mainly used for the explanation in the rest of this article. The left figure above can be expressed with a conditional probability:

$$P(\text{the}, \text{man}, \text{his}, \text{son} \mid \text{loves})$$
Assuming that the occurrence of each word is independent, the expression above can be decomposed into:

$$P(\text{the} \mid \text{loves}) \cdot P(\text{man} \mid \text{loves}) \cdot P(\text{his} \mid \text{loves}) \cdot P(\text{son} \mid \text{loves})$$
We want these conditional probabilities to be as large as possible. Therefore, the conditional probabilities of the window words generated by every center word in the corpus are multiplied together, which gives the objective the model needs to optimize:

$$\prod_{c=1}^{T} \; \prod_{w_o \in \mathrm{window}(w_c)} P(w_o \mid w_c)$$
In the formula above, $T$ is the number of windows, which is also the number of center words; $w_c$ denotes a center word; $\mathrm{window}(w_c)$ is the set of words picked from its window, not including the center word itself, so each $w_o$ is a window word.
Taking $-\log$ of the formula turns multiplication into addition, and turns maximizing the expression above into minimizing the expression below:

$$\text{Loss} = -\sum_{c=1}^{T} \; \sum_{w_o \in \mathrm{window}(w_c)} \log P(w_o \mid w_c)$$
Once the Loss is defined, the remaining question is: how do we calculate the probability $P(w_o \mid w_c)$?
Back to the core idea — the meaning of a word can be expressed by the words that frequently appear around it. In other words, the center word $w_c$ and its window word $w_o$ are relatively similar, while $w_c$ is less similar to the other words. Assuming a vector has already been trained for each word, let $v_c$ denote the vector of the center word $w_c$ and $u_o$ the vector of the window word $w_o$. The similarity of $w_c$ and $w_o$ can then be calculated by the inner product of their vectors, $u_o^\top v_c$. In the same way, we can calculate the similarity between the center word and every other word in the corpus.
After obtaining the similarities between the center word and all words, we can use Softmax to convert the similarities into probabilities. Given the center word, the conditional probability of one of its window words can be calculated as follows:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{v} \exp(u_w^\top v_c)}$$
In the formula, $v$ represents the number of unique words in the corpus. At this point the whole Loss has been derived. As you can see, the expression is still very succinct: m (the window size) is a hyperparameter, and the word vectors $u$ and $v$ (one of each per word) are the parameters the model needs to optimize. As a quick check, a minimal numeric sketch of this Softmax step follows.
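The sketch below uses a toy vocabulary and random vectors (the matrix names V and U and the sizes are illustrative assumptions that match the network described next):

```python
import numpy as np

v, h = 10, 3                               # toy corpus: v = 10 unique words, h = 3 dimensions
rng = np.random.default_rng(0)
V = rng.normal(size=(v, h))                # center-word vectors: v_c = V[c]
U = rng.normal(size=(v, h))                # context-word vectors: u_o = U[o]

def p_window_given_center(o, c):
    """P(w_o | w_c): Softmax over the inner products of v_c with every u_w."""
    scores = U @ V[c]                      # similarity of the center word with every word
    scores -= scores.max()                 # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

print(p_window_given_center(o=3, c=1))                          # a probability in (0, 1)
print(sum(p_window_given_center(o, c=1) for o in range(v)))     # sums to 1.0
```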
The optimization is further illustrated by the following neural network diagram. In the diagram, the input layer is a v-dimensional vector, where v is the number of unique words in the corpus and each dimension represents one word. The input shown is the second word of the corpus (subscript 1): only the second neuron has the value 1, and all other neurons are 0. As you can see, before the word vectors are trained, each word is still one-hot encoded.
There is a fully connected layer between the input layer and the hidden layer. Assuming the dimension of the hidden layer is h, the parameters between the input layer and the hidden layer form a v × h matrix, which we denote as a two-dimensional array V[v][h]. In the figure, h is set to 3 just for convenience.
When the parameter matrix V is multiplied with the one-hot encoded input layer, something special happens: only the parameters connecting the neuron with value 1 (the center word) to the hidden layer are activated, namely the three parameters V[1][0], V[1][1], and V[1][2] shown in the figure, and these values are passed to the neurons of the hidden layer through forward propagation. The activated parameters are exactly the dense vector used to represent the center word, and the dimension of this vector is determined by the dimension of the hidden layer.
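A tiny NumPy sketch of this effect (toy sizes, illustrative values): multiplying a one-hot input by V simply selects one row of V.

```python
import numpy as np

v, h = 5, 3
V = np.arange(v * h, dtype=float).reshape(v, h)   # toy parameter matrix V[v][h]

x = np.zeros(v)
x[1] = 1.0                     # one-hot input: the second word (index 1) is the center word

hidden = x @ V                 # forward propagation from the input layer to the hidden layer
print(hidden)                  # [3. 4. 5.] -- identical to V[1]: only row 1 is "activated"
print(V[1])                    # [3. 4. 5.]
```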
The structure from the hidden layer to the output layer is the same as that from the input layer to the hidden layer. Because we need to calculate the similarity between the center word and every word in the corpus, and each similarity occupies one dimension of the output layer, the output layer also has v neurons, and the shape of the matrix U is likewise v × h.
Since each neuron in the output layer is the similarity between the center word and some word, the parameters connecting each output neuron to the hidden layer are the dense vector of the corresponding word. For example, suppose the third neuron in the output layer corresponds to the similarity between a window word and the center word; then the parameters U[2][0], U[2][1], and U[2][2] connecting this neuron to the hidden layer are the word vector of this window word, because during forward propagation this neuron is computed as follows:

output[2] = V[1][0] × U[2][0] + V[1][1] × U[2][1] + V[1][2] × U[2][2]
This is just the inner product of the two vectors V[1][:] and U[2][:]. By connecting a Softmax after the output layer, the similarities can be converted into probabilities. Finally, if the window words are marked as the targets at the output layer, the parameters can be adjusted through back propagation with the cross-entropy loss function, thereby optimizing the Loss. When the Loss gradually converges, the resulting parameters U and V are the dense word vectors we need.
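Putting the pieces together, here is a minimal sketch of the forward pass and cross-entropy loss for one (center word, window word) pair, with toy sizes and random parameters (the indices and sizes are illustrative assumptions):

```python
import numpy as np

v, h = 5, 3
rng = np.random.default_rng(0)
V = rng.normal(size=(v, h))             # input  -> hidden parameters (center-word vectors)
U = rng.normal(size=(v, h))             # hidden -> output parameters (window-word vectors)

center, window_word = 1, 2              # indices of the center word and one window word

hidden = V[center]                      # the hidden layer is the center word's dense vector
scores = U @ hidden                     # output neuron i is the inner product U[i] . V[center]
assert np.isclose(scores[2], V[1] @ U[2])   # neuron 2 is exactly V[1][:] . U[2][:]

probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # Softmax over all v words
loss = -np.log(probs[window_word])      # cross-entropy loss for this (center, window) pair
print(loss)
```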
You may notice that the neural network produces two vectors for the same word: one in the matrix V and one in the matrix U. So which one do we take as the word vector? Generally, we take the vector corresponding to the center word: in skip-gram, the vector between the input layer and the hidden layer (from V); in CBOW, the vector between the hidden layer and the output layer (from U).
This process looks perfect, but there is a problem: every iteration takes one center word and computes its similarity with all the words in the corpus, and back propagation then computes gradients for all the word vectors. The computational overhead is very large, so the model needs further optimization.
Negative sampling
To reduce the amount of computation during training, we usually use the negative sampling technique to optimize the model. As the name implies, negative sampling samples only a small number of negative examples for training: the window words of the center word are treated as positive samples, and for each positive sample we sample K non-window words as negative samples (K is usually around 5). In this way, the number of neurons in the output layer is reduced from the full vocabulary size v to (m-1)(K+1), which greatly reduces the amount of computation during training.
Window words are positive samples, and the small number of sampled non-window words are negative samples, so the problem becomes binary classification: the Sigmoid function is enough, and the Softmax multi-class model is no longer needed. The conditional probability can be modified as follows:

$$P(w_o \mid w_c) = \sigma(u_o^\top v_c) \prod_{k=1}^{K} \sigma(-u_{n_k}^\top v_c)$$

where $\sigma$ is the Sigmoid function and $u_{n_k}$ is the vector of the k-th sampled non-window word.
It means that the probability of co-occurrence of the center word and the window word should be close to 1, while the probability of co-occurrence of the center word and the K sampled non-window words should be close to 0. Maximum likelihood is used to maximize this probability. Again, taking $-\log$ converts maximizing the product into minimizing the following sum:

$$\text{Loss} = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_{n_k}^\top v_c)$$
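Here is a minimal NumPy sketch of this negative-sampling Loss for a single (center word, window word) pair, using random toy vectors (purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h, K = 3, 5
rng = np.random.default_rng(0)
v_c   = rng.normal(size=h)              # vector of the center word
u_o   = rng.normal(size=h)              # vector of one window word (the positive sample)
u_neg = rng.normal(size=(K, h))         # vectors of the K sampled non-window words

# Push sigma(u_o . v_c) toward 1 and every sigma(u_k . v_c) toward 0,
# i.e. minimize -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c).
loss = -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-(u_neg @ v_c))))
print(loss)
```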
This Loss is the form that often appears in papers, so take note when you come across it. After replacing the Loss with this formula, the corresponding network structure is also modified, as follows:
The main changes are that the output layer shrinks from v neurons to (m-1)(K+1), of which m-1 are positive samples and K(m-1) are negative samples, and the Softmax layer is replaced by a Sigmoid calculation.
Note that the parameter matrix from the hidden layer to the output layer, U[(m-1)(K+1)][h], is only a small part of the complete matrix U, whose size is still v × h; the sub-matrix currently being trained consists of the parameter rows of the positive and negative samples.
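Below is a minimal sketch of one training step under negative sampling, showing that only K+1 rows of U and one row of V are touched. Plain NumPy and uniform sampling of negatives are assumptions for illustration; real Word2vec implementations typically sample negatives according to word frequency.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v, h, K, lr = 10, 3, 5, 0.025
rng = np.random.default_rng(0)
V = rng.normal(size=(v, h))                      # complete matrix V, shape v x h
U = rng.normal(size=(v, h))                      # complete matrix U, shape v x h

center, positive = 1, 3                          # one (center word, window word) pair
negatives = rng.choice([i for i in range(v) if i not in (center, positive)],
                       size=K, replace=False)    # K sampled non-window words

rows = np.concatenate(([positive], negatives))   # the only rows of U touched in this step
labels = np.zeros(K + 1)
labels[0] = 1.0                                  # 1 for the window word, 0 for the negatives

pred = sigmoid(U[rows] @ V[center])              # forward pass on the (K+1)-neuron sub-network
grad = pred - labels                             # gradient of the binary cross-entropy

grad_V = grad @ U[rows]                          # gradient w.r.t. the center word's vector
U[rows] -= lr * np.outer(grad, V[center])        # only K+1 rows of U are updated
V[center] -= lr * grad_V                         # only one row of V is updated
```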
Summary
The above covers the principle of Word2vec and the optimizations used in its training.
Word2vec has further extensions in the field of search and recommendation. For example, treating a user’s behavior sequence as the word sequence in Word2vec and training dense vectors for items is known as Item2vec. There is also Graph Embedding, which generates sequences by random walks over the user-behavior graph and then encodes them in the same way.
In the NLP field, producing only one Embedding per word cannot satisfy real application scenarios: many words have multiple meanings and need to be represented by multiple Embeddings, which gave rise to a new generation of technologies beyond Word2vec, such as ELMo and BERT. Readers who are interested can continue to study them.
References:
- Hands-on Deep Learning – Word Embedding (tinyurl.com/y8g37x2s)
- Deep learning Recommendation System – The application of Embedding in recommendation system
- cs224n - Lecture 1 (tinyurl.com/yd7vmkap)