In this post we learn the skip-gram implementation of word2vec. Besides the skip-gram model there is also the CBOW model. The skip-gram model predicts the surrounding (context) words from the middle (center) word, while the CBOW model does the opposite: it predicts the middle word from its surrounding words.

So what is a middle word? And which words count as its surrounding words?

First, we need to define a window size; the words inside the window are used to define the middle word and its surrounding words. The window size is usually between 5 and 10. For example, let's set the window size to 2:

|The|quick|brown|fox|jump|

So brown is our middle word, and The, quick, fox and jump are its surrounding words.

We know that word2vec is actually a neural network (explained later), so in what format do we feed it this data for training? Look at the picture below and you'll see:

As you can see, we always put the middle word first, followed by its neighbors, and each word pair forms an input/output data pair (X, Y), where X is the feature and Y is the label.

Therefore, before we train the model, we need to go through the corpus and prepare all the input data in the form shown above.
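As a concrete illustration (a minimal sketch in plain Python; the function name generate_pairs and the toy token list are assumptions made here, not part of the original post), the (X, Y) pairs for a given window size could be produced like this:

def generate_pairs(words, window_size=2):
    # Produce (middle word, surrounding word) pairs for every position in the corpus
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window_size), min(len(words), i + window_size + 1)):
            if j != i:
                pairs.append((center, words[j]))  # (X, Y) = (feature, label)
    return pairs

print(generate_pairs(["The", "quick", "brown", "fox", "jump"], window_size=2))
# [('The', 'quick'), ('The', 'brown'), ('quick', 'The'), ('quick', 'brown'), ('quick', 'fox'), ...]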

Word2vec is a neural network

Word2vec is a simple neural network consisting of the following layers:

  • 1 input layer
  • 1 hidden layer
  • 1 output layer

The input layer takes the numeric representation of the data pairs described above and feeds it to the hidden layer. The number of units in the hidden layer is what we call the embedding size; the reason for this will become clear after a simple calculation below. Note that we do not apply an activation function after the hidden layer. In the output layer, we apply a softmax operation to get a probability for each prediction.

Here’s a diagram of this network:

The input layer

Now the question is: we just said that the input layer's input is the numeric representation of the data pairs we prepared, so how do we turn text data into a numeric representation?

It seems that more than one encoding could be used to represent our text.

Looking at the figure above, we can see that the input uses one-hot encoding. What is one-hot encoding? As the figure shows, if there are n words in the vocabulary, each word can be represented by an n-dimensional vector in which exactly one position is 1 and all other positions are 0.
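For example (a toy illustration with a 5-word vocabulary; not from the original post):

# One-hot encoding for a toy vocabulary of n = 5 words
vocab = ["The", "quick", "brown", "fox", "jump"]

def one_hot(word, vocab):
    vec = [0] * len(vocab)          # an n-dimensional vector of zeros
    vec[vocab.index(word)] = 1      # only the position of this word is 1
    return vec

print(one_hot("brown", vocab))      # [0, 0, 1, 0, 0]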

So why do we use this encoding? The answer will come later.

Hidden layer

The number of units in the hidden layer is the dimension of the vector representing each word. Let's say our hidden_size is 300, which means our hidden layer has 300 neurons; then each word is represented by a 300-dimensional vector, and there are as many of these vectors as there are words!

So the weight matrix between the input layer and the hidden layer should be a matrix of shape [vocab_size, hidden_size].

Output layer

So what should our output layer look like? As you can see from the figure above, the output layer is a vector of size [vocab_size], and each value represents the probability of outputting a particular word.

Why do we do this? Because for an input word we want to know which words are most likely to appear around it; in other words, we need the probability distribution of its surrounding words.

You can take a look at the network structure diagram above.

You will notice a very common function, softmax. Why softmax and not some other function? Let's take a look at what softmax looks like:
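The figure is not reproduced here, so here is the standard definition, softmax(x)_i = e^{x_i} / Σ_j e^{x_j}, as a small numpy sketch (the helper function is an illustration added here, not from the original post):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max does not change the result but avoids overflow
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # [0.659 0.242 0.099] (approximately)
print(softmax(scores).sum())    # 1.0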

Obviously, each value is in the range (0, 1), and the values add up to 1. Isn't that the natural representation of a probability distribution?

Softmax also has another nice property: because it is built on exponentials, if the loss function uses a logarithm (as cross entropy does), the exponential and the logarithm cancel, since log(softmax(x_i)) = x_i - log Σ_j e^{x_j}, which simplifies the computation.

For more on softmax, see Stanford's Softmax Regression tutorial.

The mathematical representation of the whole process

So now that we know the structure of the whole neural network, how do we write it mathematically?

Reviewing our structure diagram, it is clear that there are two weight matrices between the three layers, W1 and W2, together with two bias terms, b1 and b2. So our entire network can be built with the following pseudocode:

import tensorflow as tf

# Assume vocab_size = 1000
VOCAB_SIZE = 1000
# Assume embedding_size = 300
EMBEDDING_SIZE = 300

# The input word x is a matrix of size [1, vocab_size]. In practice we usually
# feed a batch of words, which gives a matrix of size [N, vocab_size].
x = tf.placeholder(tf.float32, shape=(1, VOCAB_SIZE))
# W1 is a matrix of size [vocab_size, embedding_size]
W1 = tf.Variable(tf.random_normal([VOCAB_SIZE, EMBEDDING_SIZE]))
# b1 is a bias vector of size [embedding_size]
b1 = tf.Variable(tf.random_normal([EMBEDDING_SIZE]))
# Simple matrix multiplication and addition
hidden = tf.add(tf.matmul(x, W1), b1)

# W2 is a matrix of size [embedding_size, vocab_size]
W2 = tf.Variable(tf.random_normal([EMBEDDING_SIZE, VOCAB_SIZE]))
b2 = tf.Variable(tf.random_normal([VOCAB_SIZE]))
# The output is a vector of size [vocab_size]; each value is the probability of a word
prediction = tf.nn.softmax(tf.add(tf.matmul(hidden, W2), b2))

Loss function

With the network defined, we need to select a loss function to optimize the model using the gradient descent algorithm.

Our output layer is essentially a softmax classifier, so, following convention, we choose the cross-entropy loss function.

Haha, remember what cross entropy is?

Cross entropy is defined as H(p, q) = -Σ_x p(x) log q(x), where p is the true probability distribution and q is the estimated probability distribution.

# y_label holds the true (one-hot) label for the context word
y_label = tf.placeholder(tf.float32, shape=(1, VOCAB_SIZE))
# Loss function
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# Training operation
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

Next, you can prepare the data and start training!
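As a rough sketch of what that could look like with the TensorFlow 1.x graph defined above (training_pairs, the number of epochs and the per-step batch shape are assumptions added here for illustration; the real feeding logic depends on how you prepared your one-hot data):

# Assumes x, y_label, train_op and cross_entropy_loss from the snippets above,
# and that training_pairs holds one-hot numpy arrays of shape [1, VOCAB_SIZE]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        for x_onehot, y_onehot in training_pairs:
            _, loss_val = sess.run([train_op, cross_entropy_loss],
                                   feed_dict={x: x_onehot, y_label: y_onehot})
        print("epoch", epoch, "loss", loss_val)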

Why use one-hot encoding?

We know that training word2vec produces the weight matrix W1 (ignore b1 for the moment), and this matrix is the vector representation of all our words! Each row of the matrix is the vector of one word. Now, what happens when we multiply a one-hot input vector by this matrix…

See? The characteristic of one-hot encoding is that the matrix multiplication simply selects one row of the matrix, and that row is exactly the word vector of the input word!
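A tiny numpy check (toy sizes, not from the original post) makes this concrete:

import numpy as np

W1 = np.random.rand(5, 3)            # toy weights: vocab_size = 5, hidden_size = 3
x = np.array([[0, 0, 1, 0, 0]])      # one-hot vector for the 3rd word, e.g. "brown"

# Multiplying the one-hot vector by W1 simply picks out the 3rd row of W1
print(np.allclose(x @ W1, W1[2]))    # True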

How’s that? Isn’t that wonderful?

From this we can see that the so-called word2vec is really just a lookup table: a two-dimensional matrix of floating-point numbers!

That is the complete walkthrough of the skip-gram model of word2vec. How about it? Are the principles and details of word2vec clear now?

See luozhouyang/word2vec for the full code.

To contact me

  • Email: [email protected]

  • WeChat: luozhouyang0528

  • Personal WeChat official account, which you may be interested in