This article mainly draws on Andrew Ng’s Deep Learning course and the Natural Language Processing course from the National Research University Higher School of Economics.

Word2vec overview

In my last article, I showed you how to use one-hot to represent a single word and the BoW model to represent a sentence. One of the big problems is that the BoW model doesn’t represent semantics.

For example, there are two sentences:

  • The cat lay on the stool in the sun
  • The dog lay beside the stool in the sun

With the BoW model, because cat and dog are two completely different words, the two sentences get two completely different vector representations. In other words, the model does not understand that the sentences mean nearly the same thing. So how can this be solved?

This is where the Word2vec model comes in. Through unsupervised learning on a large corpus, the model “understands” that cats and dogs are both animals. A sentence can then be represented in the same spirit as the earlier one-hot -> BoW step (averaging rather than summing) to obtain a semantic vector representation of the sentence.

The role of the model

The model has two main uses:

  1. Similar words: similar words get similar vector representations;

For example, cat and dog (animals); peony, jasmine, rose, etc.

  2. Word analogy

Here’s a very popular example:

man -> woman is like king -> ?

The model is then asked to solve for the missing word, and a likely answer is queen.

In practice, the model still has many shortcomings when dealing with this kind of problem.
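As an illustration, the analogy can be written as vector arithmetic (king - man + woman ≈ queen) and queried with the gensim library mentioned at the end of this article. This is only a sketch: the file name below is a hypothetical placeholder for any pretrained vectors in word2vec format.

```python
from gensim.models import KeyedVectors

# Load pretrained vectors; "pretrained-vectors.bin" is a placeholder for any
# word2vec-format file (e.g. the Google News vectors).
wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# "man -> woman is like king -> ?"  <=>  king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```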

How do you train the model?

This article focuses on two approaches: CBOW and Skip-Gram.

In addition, methods such as GloVe and PPMI + SVD can also be used to train word vectors.

CBOW

Take the following corpus as an example:

  1. We laughed and said goodbye, but we both knew it would never happen again.
  2. Ordinary is a paved path: comfortable to walk, but no flowers to grow.

If the sentences above are used to train a CBOW model, the model first represents every word as a one-hot vector. A “supervised” learning problem is then constructed from the existing corpus.

That is, during training, for a word in the sentence such as “we”, the words that appear near it, such as “laughed” and “goodbye”, are taken as positive examples, and a softmax layer outputs the prediction.

In other words, the model learns that “we” is likely to be followed by words such as “laughed” and “goodbye”.

If the sentence is very long, the model might otherwise pair “we” with words that are far away, so the CBOW model has an additional window parameter: only the words within the window before and after the current word are used for training. A minimal sketch of this pairing follows.
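The sketch below is not the exact implementation from the courses; it simply shows how each word can be paired with the words inside its window. In the standard CBOW setup, the context words are the input and the centre word is the prediction target.

```python
# Pair each word with the words inside its window.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words before and after the centre word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

sentence = "we laughed and said goodbye".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```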

Skip-Gram

The biggest difference between the Skip-Gram model and the CBOW model is that, when generating the “supervised” training data, Skip-Gram does not take every word in the window as a positive sample; instead, it randomly samples words from within the window as positive samples.

Take the above corpus as an example:

If Skip-Gram is applied to the word “knew” with a window size of 3, three positive samples might be randomly generated as follows (a sketch of this sampling appears after the list):

  1. knew -> we
  2. knew -> goodbye
  3. knew -> but
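A hedged sketch of this random sampling, assuming a simple uniform draw from the window (the samples will vary from run to run):

```python
import random

# For a centre word, pick a few of the words inside its window as positive
# (centre, context) pairs instead of using every word in the window.
def skipgram_samples(tokens, index, window=3, n_samples=3):
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    candidates = [tokens[j] for j in range(lo, hi) if j != index]
    return [(tokens[index], w)
            for w in random.sample(candidates, min(n_samples, len(candidates)))]

sentence = ("we laughed and said goodbye but we both knew "
            "it would never happen again").split()
print(skipgram_samples(sentence, sentence.index("knew"), window=3))
```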

hierarchical softmax classifier

In both the CBOW model and the Skip-Gram model, a softmax output layer is used to predict the probability of each candidate word, so the number of outputs of this softmax is very large (equal to the size of the vocabulary of the whole corpus). This makes the softmax slow to compute.

In this situation, a hierarchical softmax classifier can be used.

The hierarchical softmax classifier is based on a binary tree built over the vocabulary; readers who want to understand the details can refer to becominghuman.ai/hierarchica… in the references.
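To see why the tree helps, here is a small back-of-the-envelope comparison; the vocabulary size is an arbitrary assumption for illustration.

```python
import math

# A flat softmax scores every word in the vocabulary, while a (roughly
# balanced) hierarchical softmax only makes one binary decision per tree level.
vocab_size = 10_000
flat_outputs = vocab_size                          # 10000 scores per prediction
tree_decisions = math.ceil(math.log2(vocab_size))  # about 14 binary decisions

print(flat_outputs, tree_decisions)
```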

Negative sampling

As mentioned above, the supervised learning problems generated by the CBOW and Skip-Gram models only contain positive samples and no negative samples, which can lead to poor results. Negative sampling is designed to solve this problem.

How do you take a negative sample?

You just take any word from the vocabulary that’s not in the window range and use it as a negative sample.

Again, use the example above:

  1. knew -> we -> 1 (positive sample)
  2. knew -> RongRong -> 0 (negative sample)
  3. knew -> cat -> 0 (negative sample)
  4. knew -> life -> 0 (negative sample)

It does not matter much if a generated negative sample happens to be “wrong”, i.e. the word actually appears near the centre word; the probability of this is so small that the effect is negligible.

Finally, the generated training data can be input into the model for training.
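A minimal sketch of this labelling, assuming a simple uniform draw over the vocabulary (real word2vec implementations usually sample negatives from a frequency-weighted distribution, roughly unigram frequency raised to the 0.75 power):

```python
import random

# Label the window words 1 and a few random out-of-window words 0.
def with_negatives(center, context_words, vocab, k=3):
    examples = [(center, w, 1) for w in context_words]
    candidates = [w for w in vocab if w != center and w not in context_words]
    for w in random.sample(candidates, min(k, len(candidates))):
        examples.append((center, w, 0))
    return examples

vocab = ("we laughed and said goodbye but both knew it would "
         "never happen again cat dog life").split()
print(with_negatives("knew", ["we", "goodbye", "but"], vocab, k=3))
```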

doc2vec

Doc2vec and Paragraph2vec are two names for the same model, which is mainly used to evaluate the similarity of documents.

The model is trained in a similar but not identical way to Word2vec, which I won’t cover in this article.

A question to think about:

Is it possible to use a trained Word2vec model to represent each word in a document and then sum (or average) the word vectors to get the final document vector representation?
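A sketch of that idea, using averaging (a common variant of summing); `wv` is assumed to be the `.wv` attribute of a trained gensim Word2Vec model:

```python
import numpy as np

# Average the vectors of the in-vocabulary words to get a document vector.
def document_vector(wv, tokens):
    vectors = [wv[w] for w in tokens if w in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)
```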

conclusion

This article has mainly explained what the Word2vec model does, along with its two main training methods, CBOW and Skip-Gram.

However, there is no need to implement the model yourself in practice; good open-source libraries already provide this functionality (a minimal usage sketch follows the list):

  • gensim-word2vec
  • gensim-doc2vec
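A minimal usage sketch, assuming gensim 4.x (older versions use `size` instead of `vector_size`); the toy corpus is just for illustration:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "lay", "on", "the", "stool", "in", "the", "sun"],
    ["the", "dog", "lay", "beside", "the", "stool", "in", "the", "sun"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=3,         # context window size
    min_count=1,      # keep words that appear only once
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    negative=5,       # negative samples per positive pair
)

print(model.wv.most_similar("cat", topn=3))
```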

References

www.coursera.org/learn/nlp-s…

blog.csdn.net/qq_39422642…

becominghuman.ai/hierarchica…

radimrehurek.com/gensim/mode…