1. What is Word2vec and why is it needed?

  • One-hot coding suffers from dimension disaster and semantic gap.

The curse of dimensionality arises because a one-hot vector's dimension equals the size of the vocabulary. The resulting vectors are extremely high-dimensional, which makes data samples sparse and distance calculations unreliable. With so many features per sample, the model also tends to overfit during training.

The semantic gap exists because one-hot word vectors are mutually orthogonal and therefore cannot reflect any semantic relationship between words.

  • Word Embedding converts words, which are “uncomputable” and “unstructured”, into vectors that are “computable” and “structured”.

  • Word2vec is one method of Word Embedding. It is an open-source toolkit for training word vectors released by Google in 2013. Its core algorithm is an efficiency improvement on the most computation-heavy part of NNLM. Word2vec trains a network to encode words as vectors, which works better than one-hot encoding.

  • Advantages:

    1. Since Word2vec takes context into account, it outperforms earlier Embedding methods

    2. It produces lower-dimensional vectors than earlier methods, so it is faster

    3. It is highly general and can be used in a wide variety of NLP tasks

2. Word2vec process

Word2vec mainly contains two models: Skip-gram and Continuous Bag of Words (CBOW).

  • The CBOW model predicts the center word W(t) from the words surrounding it, as shown in the figure below. The characteristic of this model is that it takes the known context as input and outputs a prediction of the current word.

Its learning objective is to maximize the logarithmic likelihood function:
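In its standard form (with Context(w) denoting the context words of w and the sum running over the whole training corpus), the objective is:

$$\mathcal{L} = \sum_{w \in \text{corpus}} \log p\big(w \mid \text{Context}(w)\big)$$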

Training process of the model:

(1) The one-hot encodings of the context words of the current word are fed to the input layer.

(2) Each of these one-hot vectors is multiplied by the same input weight matrix to obtain its projected vector.

(3) These vectors are averaged into a single vector, which is the output of the hidden layer.

(4) The hidden-layer output is multiplied by the weight matrix between the hidden layer and the output layer to produce the output vector.

(5) The output vector is normalized with Softmax to give a probability for every word in the vocabulary.

(6) The word with the largest probability value is taken as the predicted word.

(7) The error between the predicted probability vector and the true one-hot label vector is computed.

(8) After each forward pass, the error is backpropagated to update the weight matrices (see the sketch below).
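A minimal numpy sketch of steps (1)-(6); the sizes V and N, the context word ids, and both weight matrices are made-up values for illustration only:

import numpy as np

V, N, window = 10, 5, 2
W_in  = np.random.randn(V, N)   # input -> hidden weight matrix (shared by all context words)
W_out = np.random.randn(N, V)   # hidden -> output weight matrix

context_ids = [1, 3, 6, 8]                  # indices of the 2*window context words
one_hots = np.eye(V)[context_ids]           # (1) one-hot encode the context
projected = one_hots @ W_in                 # (2) multiply each by the same matrix
h = projected.mean(axis=0)                  # (3) average into the hidden-layer vector
u = h @ W_out                               # (4) hidden -> output scores
p = np.exp(u) / np.exp(u).sum()             # (5) softmax over the vocabulary
pred = int(np.argmax(p))                    # (6) predicted center word
# (7)-(8): compare p with the one-hot true label and backpropagate the error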

  • Skip-gram model:

    Skip-gram simply reverses CBOW’s causality, that is, knowing the current word and predicting the context.

    Parameter description:

    1. The vocabulary is the set of distinct words that appear in the database or text; these are the known words. The vocabulary is denoted by the letter V, and its size by |V|.

    2. N represents the number of neurons in the hidden layer.

    3. The window size is the maximum context distance at which we predict a word; it is denoted C. For example, in the given architecture diagram the window size is 2, so we predict the words at context positions (t-2), (t-1), (t+1), and (t+2).

    4. The context window is the number of words predicted around a given word. Denoted K, it equals 2*C, i.e., twice the window size. For the given figure the context window value is 4.

    5. The dimension of the input vector equals |V|. Each word is one-hot encoded.

    6. The weight matrix W between the input and hidden layers has dimension [|V|, N], where |V| denotes the size of the vocabulary.

    7. The output vector of the hidden layer is H, of dimension [1, N].

    8. The weight matrix W' between the hidden layer and the output layer has dimension [N, |V|].

    9. The dot product between H and W' produces the output vector U of dimension [1, |V|].

    Specific process:

  1. Using one-hot encoding, each word is converted to a vector of dimension [1, |V|].

  2. The word w(t) is passed from the |V|-dimensional input layer to the hidden-layer neurons.

  3. The hidden layer computes the dot product between the input vector of w(t) and the weight matrix W[|V|, N]. Since the input is one-hot, this simply selects the row of W corresponding to w(t), giving the output H[1, N].

  4. Remember: the hidden layer does not use an activation function, so H[1, N] is passed directly to the output layer.

  5. The output layer computes the dot product between H[1, N] and W'[N, |V|], which gives the vector U.

  6. To obtain a probability for each word, we apply the Softmax function to U; the training target at each context position is a one-hot vector.

  7. The word with the highest probability is the prediction. If the predicted word at a given context position is wrong, the backpropagation algorithm is used to correct the weight matrices W and W'.

    The above steps are performed for each word w(t) in the dictionary, and each word w(t) is passed K times, so forward propagation runs |V| * K times per epoch. A numpy sketch of one forward pass follows below.
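    A minimal numpy sketch of steps 1-6 above; V, N, the word index w_t, and both weight matrices are made-up values for illustration only:

import numpy as np

V, N, C = 10, 5, 2
W       = np.random.randn(V, N)   # input -> hidden weights, shape [|V|, N]
W_prime = np.random.randn(N, V)   # hidden -> output weights, shape [N, |V|]

w_t = 4                                   # index of the current (input) word
x = np.eye(V)[w_t]                        # step 1: one-hot vector of dimension |V|
h = x @ W                                 # steps 2-3: selects row w_t of W, giving H of size N
u = h @ W_prime                           # step 5: scores over the vocabulary, size |V|
p = np.exp(u) / np.exp(u).sum()           # step 6: softmax probabilities
pred = int(np.argmax(p))                  # step 7: most probable context word
# The same p is used for all K context positions; the errors at each position
# are summed and backpropagated to update W and W_prime.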

    Probability function:

    W(c, j) is the j-th word predicted at the c-th context position; W(O, c) is the actual word that appears at the c-th context position; W(I) is the single input word; u(c, j) is the j-th value of the vector U when predicting the word at the c-th context position.
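    Putting these symbols together, the standard skip-gram softmax probability is:

    $$p\big(w_{c,j} = w_{O,c} \mid w_I\big) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{|V|} \exp(u_{j'})}$$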

    Loss function:

    Since we want to maximize the probability of predicting the correct word w(c, j) at each context position c, we can use the loss function L.
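    In the standard derivation, writing j_c^* for the index of the actual word at the c-th context position and K for the number of context positions, this loss is:

    $$L = -\log \prod_{c=1}^{K} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{|V|} \exp(u_{j'})} = -\sum_{c=1}^{K} u_{c,j_c^*} + K \cdot \log \sum_{j'=1}^{|V|} \exp(u_{j'})$$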

3. Optimization techniques in Word2vec

We know that in the Word2vec model, when the training set or corpus is very large, the Softmax function computes a score over the entire vocabulary for every prediction, which is a very heavy computation. What optimizations does Word2vec use here?

Word2vec offers two efficient training methods: Negative Sampling and Hierarchical Softmax.

  • Negative Sampling:

Negative Sampling speeds up training by sampling negative examples. For example, suppose we have a training sample whose center word is w, with 2c surrounding words denoted context(w). Since w genuinely co-occurs with context(w), this pair is a true positive example. Through negative sampling we draw neg other center words w_i, i = 1, 2, ..., neg, so that context(w) paired with each w_i forms neg negative examples that do not actually occur. Using this one positive example and the neg negative examples, we perform binary logistic regression and obtain the model parameter θ_i for each word w_i as well as the word vector of every word.

The calculation uses the sigmoid function; the detailed derivation is not covered here: blog.csdn.net/sun_brother…
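A minimal numpy sketch of this binary logistic regression, assuming a made-up vector size N, neg = 5 sampled words, and random vectors standing in for the real projected vector x_w and the output parameters θ:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, neg, lr = 100, 5, 0.025
x_w = np.random.randn(N)              # projected (context) vector of the positive sample
theta_pos = np.random.randn(N)        # output parameter vector of the true word w
theta_negs = np.random.randn(neg, N)  # output parameter vectors of the neg sampled words w_i

# binary logistic regression loss: label 1 for the true pair, 0 for the sampled pairs
loss = -np.log(sigmoid(x_w @ theta_pos)) - np.sum(np.log(sigmoid(-theta_negs @ x_w)))

# one gradient step on θ of the positive word (the θ_i and x_w are updated similarly)
theta_pos -= lr * (sigmoid(x_w @ theta_pos) - 1.0) * x_w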

  • Hierarchical Softmax:

Since we replace the probability calculation of the output Softmax layer with a binary Huffman tree, the Softmax probability only needs to be computed along the tree structure. As shown in the figure below, we can follow the Huffman tree from the root node all the way down to the word w2 at a leaf node.

Compared with the earlier neural network language model, all the internal nodes of the Huffman tree play the role of the hidden-layer neurons, with the word vector at the root node corresponding to the projected vector, while the leaf nodes play the role of the Softmax output-layer neurons, so the number of leaf nodes equals the size of the vocabulary. In the Huffman tree, the Softmax mapping from the hidden layer to the output layer is not done all at once, but step by step along the tree, hence the name “Hierarchical Softmax”.

How do we “walk along the Huffman tree”? In Word2vec we use binary logistic regression: if we go to the left subtree, it is the negative class (Huffman code 1), and if we go to the right subtree, it is the positive class (Huffman code 0). The positive and negative classes are distinguished with the sigmoid function, that is:
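In the standard Word2vec formulation this is:

$$P(+) = \sigma\big(x_w^{\top}\theta\big) = \frac{1}{1 + e^{-x_w^{\top}\theta}}$$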

where x_w is the word vector of the current internal node, and θ is the logistic-regression model parameter that we need to learn from the training samples.

What are the benefits of using a Huffman tree? First, since it is a binary tree, the computation per prediction drops from V to log2(V). Second, because Huffman coding places high-frequency words close to the root, high-frequency words are found more quickly, which matches our greedy optimization idea.

It is easy to see that the probability of going to the left subtree, i.e., the negative class, is P(−) = 1 − P(+). At each internal node, the criterion for deciding whether to follow the left or the right subtree is the comparison of P(−) and P(+). The values of P(−) and P(+) are controlled by two factors: the word vector of the current node and the model parameter θ of the current node.

For w2 in the figure above, if it is the output of a training sample, then we expect P(−) to be large at the internal node n(w2,1), P(−) to be large at n(w2,2), and P(+) to be large at n(w2,3); a sketch of this path calculation follows below.
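A minimal sketch of how the probability of w2 is accumulated along its Huffman path, assuming the three-node path described above and random vectors standing in for the real x_w and θ values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N = 100
x_w = np.random.randn(N)                          # projected vector of the current sample
thetas = [np.random.randn(N) for _ in range(3)]   # one θ per internal node on the path to w2
path = ['-', '-', '+']                            # n(w2,1): left, n(w2,2): left, n(w2,3): right

p_w2 = 1.0
for theta, direction in zip(thetas, path):
    p_plus = sigmoid(x_w @ theta)                 # P(+) at this internal node
    p_w2 *= p_plus if direction == '+' else (1.0 - p_plus)
# p_w2 is the hierarchical-softmax probability assigned to the leaf w2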

  • Comparative analysis of the two methods:

Although the Huffman tree improves efficiency, if the target word of a training sample is a rare word, we have to walk a long way down the Huffman tree. In other words, the cost of Hierarchical Softmax is unstable across words. Negative Sampling does away with the Huffman tree entirely.

4. Word2vec simple code

  • gensim
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# sens_list is the tokenized corpus: a list of lists of tokens
model = word2vec.Word2Vec(sens_list, min_count=1, iter=20)  # with gensim >= 4.0, use epochs=20 instead of iter=20
model.save("word2vec.model")
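A hypothetical usage example after training (assuming the word "apple" occurs in sens_list; the word is illustrative only):

model = word2vec.Word2Vec.load("word2vec.model")
vector = model.wv["apple"]                          # the learned word vector
similar = model.wv.most_similar("apple", topn=5)    # the 5 most similar words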
  • python
import collections
import random
import numpy as np

def generate_batch(batch_size, num_skips, skip_window):
    # `data` is the corpus as a list of word ids and `data_index` the current position;
    # both are module-level globals.
    global data_index  # the global keyword lets data_index be modified inside this function
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)  # a queue that drops the oldest word when full
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window              # the source word sits at the center of the buffer
        targets_to_avoid = [skip_window]  # never predict the source word itself
        source_word = buffer[skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = source_word         # the source (center) word
            labels[i * num_skips + j, 0] = buffer[target]  # a randomly chosen context word
        # slide the window: append the next word, which drops the oldest one from the buffer
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
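A hypothetical usage example, where data is the corpus encoded as a list of word ids (the ids below are made up):

data = [5, 1, 3, 2, 0, 4, 1, 3, 2, 5, 0, 1]   # corpus as word ids (illustrative)
data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
print(batch)    # each source (center) word id repeated num_skips times
print(labels)   # one randomly chosen context word id per row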