Preface

Since Mikolov introduced the concept of the word vector in his 2013 paper “Efficient Estimation of Word Representations in Vector Space”, the NLP field seems to have entered the world of embedding.

Sentence2Vec, Doc2Vec, Everything2Vec. The word vector is based on the language-model assumption that “the meaning of a word can be inferred from its context”, and it provides a distributed representation of words. Compared with the high-dimensional, sparse one-hot representation of traditional NLP, the word vectors trained by Word2Vec are low-dimensional and dense. Because Word2Vec makes use of the context of words, the vectors carry richer semantic information. The common applications at present are as follows:

  1. The trained word vectors are used as input features to enhance existing systems, such as neural networks for sentiment analysis, part-of-speech tagging, machine translation and other tasks.
  2. The word vectors are used directly from a linguistic perspective, for example using the distance between vectors to measure word similarity or query relevance (a small cosine-similarity sketch follows this list).
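
As a minimal illustration of item 2 (my own sketch, not from the original post): the snippet below computes cosine similarity between two word vectors with NumPy. The vectors here are random placeholders standing in for trained Word2Vec embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in practice these come from a trained Word2Vec model.
rng = np.random.default_rng(0)
v_king, v_queen = rng.normal(size=300), rng.normal(size=300)
print(cosine_similarity(v_king, v_queen))
```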

Note: The preface is from reference 1.

1. What is Word2vec

To summarize, word2vec uses a shallow (one hidden layer) neural network, with the CBOW or Skip-gram architecture, to map sparse one-hot word vectors to dense N-dimensional vectors (N is usually a few hundred). To speed up training, it uses tricks such as Hierarchical Softmax (built on a Huffman tree) and Negative Sampling.

In NLP, the most fine-grained objects are words. If we want to do part-of-speech tagging, the usual approach is to collect a set of samples (x, y), where x is a word and y is its part of speech, and then learn a mapping f: x -> y. Traditional methods include Bayesian classifiers, SVMs and other algorithms. But our mathematical models generally take numerical input, whereas words in NLP are abstract symbols created by humans (Chinese, English, Latin, etc.), so they need to be converted into numerical form, in other words embedded into a mathematical space. This kind of embedding is called word embedding, and Word2vec is one kind of word embedding.
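
As a toy illustration (not from the original post), the snippet below contrasts a one-hot vector with the dense embedding obtained by multiplying it with an embedding matrix; the vocabulary and dimensions are made up for the example.

```python
import numpy as np

vocab = ["I", "love", "you"]          # toy vocabulary, V = 3
V, N = len(vocab), 4                  # N: embedding dimension (tiny for display)

W = np.random.default_rng(0).normal(size=(V, N))  # embedding matrix, V x N

one_hot_love = np.zeros(V)
one_hot_love[vocab.index("love")] = 1.0           # sparse 1 x V representation

dense_love = one_hot_love @ W                     # dense 1 x N representation
print(dense_love)                                 # same as W[1], a simple row lookup
```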

If we instead regard x as a word in a sentence and y as that word’s context, then f here is the “language model” that often appears in NLP. The purpose of this model is to judge whether a sample (x, y) conforms to the laws of natural language; put more plainly, whether the word x and the word y placed together sound like something a person would actually say.

Word2vec grew out of this idea, but its ultimate purpose is not to train f perfectly; instead it focuses on a by-product of training, the model parameters (here, the weights of the neural network), and takes these parameters as a vector representation of the input x. This vector is what we call the word vector. (The above part is from Reference 2.)

2. CBOW and Skip-gram

There are two important models in Word2vec: the Continuous Bag-of-Words model (CBOW) and the Skip-gram model. Both are illustrated in Tomas Mikolov’s paper.

As the names and the figure suggest, CBOW predicts the probability of a word given the C words before it (or the C surrounding words), while the Skip-gram model does the opposite: given a word, it predicts the probabilities of the several words before and after it.
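
To make the difference concrete, here is a small sketch (my own, not from the post) that generates (context, target) training pairs for CBOW and (center, context) pairs for Skip-gram from a toy tokenized sentence, with window size C = 2.

```python
from typing import List, Tuple

def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the surrounding words predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: the center word predicts each surrounding word."""
    pairs = []
    for i, center in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, ctx))
    return pairs

sentence = "the quick brown fox jumps".split()
print(cbow_pairs(sentence))      # e.g. (['the', 'brown', 'fox'], 'quick')
print(skipgram_pairs(sentence))  # e.g. ('quick', 'the'), ('quick', 'brown'), ...
```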

The figure above clearly shows the training process of CBOW.

Once you understand this figure, you understand more than half of word2vec, so let’s walk through it. The simplest form of word vector is the 1-of-V one-hot form, which everyone is familiar with: select the V most frequent words from the corpus (ignoring the rest), where V is usually fairly large, e.g. V = 100,000; fix the order of these words; then each word can be expressed as a V-dimensional sparse vector in which exactly one element is 1 and everything else is 0. One-hot encoding is a simple direct mapping, so its drawbacks are obvious: the dimension is huge and the vectors carry no useful computational meaning. With that in mind, the figure works as follows:

  1. Input layer: the one-hot vectors of the context words. Assume the dimension of the one-hot space is V, i.e. the vocabulary size is V, and the context window size is C.
  2. Assume the dimension of the final word vector is N; then the shared weight matrix W in the figure is V x N, and it is initialized.
  3. Suppose the corpus contains the sentence “I love you” and we focus on the word “love” with C = 2, so the context is “I” and “you”. The model takes the one-hot forms of “I” and “you” as input, each of size 1 x V. The C vectors of size 1 x V are each multiplied by the same shared weight matrix W (V x N), yielding C hidden vectors of size 1 x N.
  4. The C vectors of size 1 x N are averaged to obtain a single 1 x N vector, namely the hidden layer in the figure.
  5. The output weight matrix W' is N x V, and is likewise initialized.
  6. The 1 x N hidden-layer vector is multiplied by W' and passed through softmax to obtain a 1 x V vector, each dimension of which corresponds to a word in the corpus; the index with the largest probability gives the predicted middle word.
  7. Compare this with the ground-truth one-hot vector and minimize the loss function.

The specific calculation process is as follows. From input -> hidden: $W^T x$, where W is the V x N matrix and x is the V x 1 one-hot vector, so the hidden-layer result is N x 1. From hidden -> output: $x^T W'$, where x is now the N x 1 hidden vector and W' is the N x V output matrix, so the final result is 1 x V.
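
The following NumPy sketch (my own, with tiny made-up dimensions) runs through steps 1-7 above: one-hot context vectors, the shared V x N matrix W, averaging into the hidden layer, the N x V output matrix W', softmax, and a cross-entropy loss against the ground-truth one-hot vector.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "love", "you"]
V, N, C = len(vocab), 4, 2                     # vocabulary size, embedding dim, window

W = rng.normal(scale=0.1, size=(V, N))         # input -> hidden weights, V x N
W_out = rng.normal(scale=0.1, size=(N, V))     # hidden -> output weights W', N x V

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# Context ("I", "you") predicts the middle word "love".
contexts = [one_hot("I"), one_hot("you")]            # C vectors of size 1 x V
hidden = np.mean([x @ W for x in contexts], axis=0)  # average of C (1 x N) vectors

scores = hidden @ W_out                              # 1 x V
probs = np.exp(scores) / np.sum(np.exp(scores))      # softmax over the vocabulary

target = one_hot("love")
loss = -np.sum(target * np.log(probs))               # cross-entropy vs. ground truth
print(vocab[int(np.argmax(probs))], loss)            # predicted middle word and loss
```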

3. Word2vec doesn’t really care about models

Word2vec can be divided into two parts: the model, and the word vectors obtained through the model. The idea is similar to that of an auto-encoder: a neural network is built and trained on the training data, but once the network is trained we do not use it to handle new tasks. What we really need is the parameters the model has learned from the training data, namely the hidden-layer weight matrix; as we will see later, these weights in Word2vec are exactly the “word vectors” we are trying to learn. Because modeling is not our ultimate goal, this modeling process on the training data is sometimes called a “fake task”.
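
As a minimal sketch of this idea (with a toy vocabulary and a random matrix standing in for the trained weights, not code from the post): after training, the word vector for a word is simply the corresponding row of the input weight matrix W, which is exactly what multiplying by the one-hot vector picks out.

```python
import numpy as np

vocab = ["I", "love", "you"]
V, N = len(vocab), 4
W = np.random.default_rng(0).normal(size=(V, N))  # stand-in for the trained V x N matrix

def word_vector(word: str) -> np.ndarray:
    # Multiplying a one-hot vector by W just selects one row of W,
    # so the learned word vector is a simple row lookup.
    return W[vocab.index(word)]

print(word_vector("love"))
```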

You will also see this approach in unsupervised feature learning; the most common example is the auto-encoder: the input is encoded and compressed in the hidden layer, then decoded back to its original form in the output layer. After training, the output layer is “cut off” and only the hidden layer is kept.

4. Tricks

1. Hierarchical Softmax: the final predicted output is a 1 x V vector, which is essentially a V-way classification problem. Using Hierarchical Softmax (the vocabulary is organized as a Huffman tree), the V-way classification is reduced to roughly log(V) binary classifications; for example, with V = 100,000, a full softmax scores all 100,000 words, while hierarchical softmax needs only about log2(100,000) ≈ 17 binary decisions.

2. Negative Sampling: instead of updating all V output weights for every training sample, a small number of “negative” words are sampled and only those (plus the positive word) are updated, which greatly reduces the amount of computation. The negative words are drawn from the unigram distribution raised to the 3/4 power: $len(w) = \frac{count(w)^{3/4}}{\sum\limits_{u \in vocab} count(u)^{3/4}}$. A small sketch of this sampling distribution follows.
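
Below is a minimal NumPy sketch (with a made-up word-count table) of how that sampling distribution can be computed and used to draw negative samples.

```python
import numpy as np

# Toy word counts standing in for real corpus frequencies.
counts = {"the": 5000, "cat": 300, "sat": 200, "mat": 100}
words = list(counts)

freq = np.array([counts[w] for w in words], dtype=float) ** 0.75
probs = freq / freq.sum()                # len(w) = count(w)^{3/4} / sum_u count(u)^{3/4}

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)  # draw 5 negative samples
print(dict(zip(words, probs.round(3))), negatives)
```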

References:

  1. zhuanlan.zhihu.com/p/26306795
  2. zhuanlan.zhihu.com/p/28491088
  3. alwa.info/2016/04/24/…
  4. qrfaction.github.io/2018/03/20/… (data competition trick summary)