Word2vec principle explanation

Word2vec Principle (I): CBOW and Skip-Gram Model Basics

Word2vec Principle (II): Model Based on Hierarchical Softmax

Word2vec Principle (III): Model Based on Negative Sampling

Word2vec is an NLP tool released by Google in 2013. Its key feature is that it turns every word into a vector, so that the relationships between words can be measured quantitatively and mined. Although the source code is open source, Google's code repository is not accessible in China, so this article explains the principles of word2vec based on the code available on GitHub. This article focuses on the basics of word2vec.

 

Contents

1. Word vector basics

2. CBOW and Skip-Gram in neural network language models

3. Huffman trees: the foundation of word2vec


1. Word vector basics

Using vectors to represent words was not invented by word2vec; the idea has been around for a long time. The earliest word vectors were quite unwieldy: the dimension of the word vector was the size of the entire vocabulary, and for each word the position corresponding to that word was set to 1 while all other positions were 0. For example, if we have a vocabulary of five words and the word "Queen" is the second word, its word vector is (0,1,0,0,0). Similarly, the word vector for the word "Woman", the fourth word, is (0,0,0,1,0). This kind of word vector encoding is commonly called 1-of-N representation or one-hot representation.
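As a rough illustration (the five-word vocabulary and the choice of indices below are made up to match the example above), a one-hot encoding can be sketched like this:

```python
import numpy as np

# Illustrative five-word vocabulary: "Queen" is the 2nd word, "Woman" the 4th,
# matching the example in the text.
vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word`: all zeros except a 1 at its index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Queen", vocab))  # [0. 1. 0. 0. 0.]
print(one_hot("Woman", vocab))  # [0. 0. 0. 1. 0.]
```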

One-hot representation is very simple, but it has many problems. The biggest one is that vocabularies tend to be very large, often on the order of millions of words, so representing each word with a vector of a million dimensions is a memory disaster. Moreover, every position in such a vector is 0 except for a single 1, which is a very inefficient way to express information. Can we reduce the dimensionality of the word vectors?

Distributed representation solves the problems of one-hot representation. The idea is to map each word, through training, to a shorter, dense word vector. All of these word vectors together form a vector space, and we can then use ordinary statistical methods to study the relationships between words. How many dimensions should this shorter word vector have? That is usually something we specify ourselves for training.

For example, suppose we describe the words in the vocabulary with four dimensions: "Royalty", "Masculinity", "Femininity" and "Age". The word vector for King might then be (0.99, 0.99, 0.05, 0.7). Of course, in practice we cannot give such a clean interpretation to every dimension of a word vector.

 

With the shorter word vectors given by distributed representation, we can easily analyze the relationships between words. For example, if we reduce the dimensionality of the word vectors to 2, an interesting study showed that, when our words are represented by the word vectors shown in the figure below, we find:

$$\vec{King} - \vec{Man} + \vec{Woman} = \vec{Queen}$$
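For intuition, here is a toy sketch of this kind of vector arithmetic. The 2-dimensional vectors below are made-up numbers, not trained values; the point is only that adding and subtracting word vectors and comparing by cosine similarity can recover the analogy:

```python
import numpy as np

# Made-up 2-D word vectors, purely for illustration.
vectors = {
    "King":  np.array([0.9, 0.8]),
    "Man":   np.array([0.5, 0.2]),
    "Woman": np.array([0.1, 0.3]),
    "Queen": np.array([0.5, 0.9]),
    "Child": np.array([0.6, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# King - Man + Woman should land closest to Queen.
target = vectors["King"] - vectors["Man"] + vectors["Woman"]
best = max((w for w in vectors if w not in ("King", "Man", "Woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # Queen
```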

So once we have the word vectors for all the words in the vocabulary, we can do many interesting things with them. But how do we train suitable word vectors? A very common approach is to use a neural network language model.

2. CBOW and Skip-Gram in neural network language models

Before word2vec appeared, DNNs were already being used to train word vectors and to model the relationships between words. The architecture used is generally a three-layer neural network (it can also have more layers), consisting of an input layer, a hidden layer and an output layer (the softmax layer).

How does this model define its inputs and outputs? There are generally two variants: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.

The training input of the CBOW model is the word vectors of the context words surrounding a particular word, and the output is the word vector of that particular word. For example, in the sentence below, our context size is 4 and the particular word is "Learning", whose word vector is the output we want. The context consists of 8 words, 4 before and 4 after the particular word, and these 8 words are the input of the model. Since CBOW uses a bag-of-words model, all 8 input words are treated equally; their distance from the word we care about does not matter, as long as they fall within the context window.

Thus, in this CBOW example, our input is 8 word vectors, and the output is the softmax probability of every word in the vocabulary (the training objective is to maximize the softmax probability of the particular word in the training sample). The input layer of the corresponding CBOW neural network has 8 neurons, and the output layer has as many neurons as the vocabulary size; the number of hidden-layer neurons is something we can specify ourselves. Using the DNN back-propagation algorithm, we can learn the model parameters and obtain all the word vectors at the same time. Then, when we need to find the most likely center word for a given set of 8 context words, we can use the DNN forward-propagation algorithm and the softmax activation to find the neuron, and hence the word, with the highest probability.
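As a rough sketch of this CBOW forward pass (the vocabulary size, the hidden-layer size and the averaging of the 8 context vectors are illustrative choices here, not word2vec's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 10000, 300                             # vocabulary size and hidden-layer size (illustrative)
W_in  = rng.normal(scale=0.01, size=(V, H))   # input-side word vectors, one row per word
W_out = rng.normal(scale=0.01, size=(H, V))   # hidden-to-output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """Average the context word vectors, then score every vocabulary word as the center word."""
    h = W_in[context_ids].mean(axis=0)        # hidden representation from the 8 context words
    return softmax(h @ W_out)                 # probability of each word being the center word

context_ids = rng.integers(0, V, size=8)      # 8 context words: 4 before and 4 after the center
probs = cbow_forward(context_ids)
predicted_center = probs.argmax()             # most likely center word
```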

The Skip-Gram model reverses the idea of CBOW: the input is the word vector of a particular word, and the output is the word vectors of that word's context. Using the same example, with a context size of 4, the particular word "Learning" is our input, and those 8 context words are our output.

Thus, in this Skip-Gram example, our input is a particular word, and the output is the 8 words ranked highest by softmax probability. The corresponding Skip-Gram neural network has one neuron in its input layer and as many output-layer neurons as the vocabulary size; the number of hidden-layer neurons is something we can specify ourselves. Using the DNN back-propagation algorithm, we can learn the model parameters and obtain all the word vectors at the same time. Then, when we need to find the 8 most likely context words for a given word, we can use the DNN forward-propagation algorithm and take the words corresponding to the 8 neurons with the highest softmax probabilities.
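And a matching sketch of the Skip-Gram forward pass, under the same illustrative assumptions as the CBOW sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10000, 300                              # illustrative sizes, as in the CBOW sketch
W_in  = rng.normal(scale=0.01, size=(V, H))
W_out = rng.normal(scale=0.01, size=(H, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def skipgram_forward(center_id):
    """The center word's vector is the hidden representation; score every word as a context word."""
    return softmax(W_in[center_id] @ W_out)

probs = skipgram_forward(42)                   # some center word id
top8_context = np.argsort(probs)[-8:][::-1]    # the 8 most probable context words
```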

The above is the general process of training the model and obtaining word vectors with CBOW and Skip-Gram in a neural network language model. However, it differs substantially from how word2vec itself uses CBOW and Skip-Gram to train the model and obtain word vectors.

Why doesn't word2vec simply use this existing DNN model instead of developing new optimizations? The main problem is that this DNN process is very time-consuming: our vocabulary usually contains millions of words, which means the softmax in the DNN output layer has to compute an output probability for every single word. Is there a way to simplify this?

3. Huffman trees: the foundation of word2vec

Word2vec also uses CBOW and Skip-Gram to train the model and obtain word vectors, but it does not use the traditional DNN model. The first optimization is to the data structure: the neurons of the hidden layer and the output layer are replaced by a Huffman tree. The leaf nodes of the Huffman tree play the role of the output-layer neurons, with as many leaves as there are words in the vocabulary, and the internal nodes play the role of the hidden-layer neurons.

How CBOW and Skip-Gram training works with a Huffman tree will be covered in the next section; for now, let us review Huffman trees.

Huffman trees are not difficult to build, and the process is as follows (a small code sketch of the procedure appears after the worked example below):

Input: $n$ nodes with weights $(w_1, w_2, \ldots, w_n)$

Output: Corresponding Huffman tree

1) Treat $(w_1, w_2, \ldots, w_n)$ as a forest of $n$ trees, where each tree has only a single node.

2) Merge the two trees in the forest whose root nodes have the smallest weights into a new tree; these two trees become the left and right subtrees of the new tree, and the weight of the new tree's root node is the sum of the weights of the root nodes of its left and right subtrees.

3) Remove the two trees with the smallest root-node weights from the forest, and add the new tree to the forest.

4) Repeat steps 2) and 3) until there is only one tree in the forest.

Below, we use a concrete example to illustrate how a Huffman tree is built. We have 6 nodes (a, b, c, d, e, f), and their weights are (20, 4, 8, 6, 16, 3).

First, b and f, which have the smallest weights, are merged, and the new root node has weight 7. The forest now contains 5 trees, with root-node weights 20, 8, 6, 16 and 7. Next, the two smallest root weights, 6 and 7, are merged into a new subtree, and so on, until we finally obtain the Huffman tree shown below.

So what is so good about Huffman trees? Once the Huffman tree is built, the leaf nodes are usually given Huffman codes. Because leaves with high weights are close to the root while leaves with low weights are far from it, high-weight nodes get shorter codes and low-weight nodes get longer codes. This guarantees that the tree has the minimum weighted path length, which also matches our intuition from information theory: we expect more frequently used words to have shorter codes. How is the coding done? In general, for each node of the Huffman tree (except the root), we can agree that the branch to the left subtree is coded 0 and the branch to the right subtree is coded 1. As in the figure above, the code for c is then 00.
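To make the construction concrete, here is a minimal Python sketch of steps 1) to 4), run on the example weights above. It uses the standard-library heapq module, codes the left subtree as 0 and the right as 1 as in the example, and always places the lighter subtree on the left; with that (arbitrary) placement choice, c comes out as 00, consistent with the figure, although a different placement would give different but equally optimal codes.

```python
import heapq
import itertools

def build_huffman(weighted_items):
    """Build a Huffman tree and return {item: code}, coding left subtrees 0 and right subtrees 1."""
    counter = itertools.count()                # tie-breaker so heapq never has to compare trees
    # Each heap entry is (weight, tie_breaker, tree); a tree is a leaf item or a (left, right) pair.
    heap = [(w, next(counter), item) for item, w in weighted_items]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)        # the two trees with the smallest root weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(counter), (t1, t2)))  # lighter tree becomes the left child
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):            # internal node: recurse into both subtrees
            walk(tree[0], code + "0")          # left branch is coded 0
            walk(tree[1], code + "1")          # right branch is coded 1
        else:
            codes[tree] = code                 # leaf: record the accumulated code
    walk(heap[0][2], "")
    return codes

# The six example nodes and their weights; c is assigned the code "00".
print(build_huffman([("a", 20), ("b", 4), ("c", 8), ("d", 6), ("e", 16), ("f", 3)]))
```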

In word2vec, the convention is the opposite of the example above: the left subtree is coded 1 and the right subtree is coded 0, and it is also agreed that the weight of the left subtree is not less than that of the right subtree.

In the next section, on Hierarchical Softmax, we will continue with the advantages of using a Huffman tree compared with the traditional DNN language model, and with how the CBOW and Skip-Gram models are trained in that framework.