For today’s post I have collected material from five papers. The theme is “word2vec”: the work of Mikolov et al. at Google on word vectors, and what you can do with them. The papers are as follows:
· Efficient Estimation of Word Representations in Vector Space — Mikolov et al. 2013
· Distributed Representations of Words and Phrases and their Compositionality — Mikolov et al. 2013
· Linguistic Regularities in Continuous Space Word Representations — Mikolov et al. 2013
· Word2vec Parameter Learning Explained — Rong 2014
· Word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method — Goldberg and Levy 2014
The first paper introduces the continuous bag-of-words model and the continuous skip-gram model for learning word vectors (we’ll discuss what a word vector is in a moment). The second paper demonstrates the power of word vectors, presents optimizations of the skip-gram model (hierarchical softmax and negative sampling), and extends the idea of word vectors to phrases. The third paper describes vector reasoning with word vectors and introduces the famous “King - Man + Woman = Queen” example. The last two papers are other authors’ in-depth explanations of some of Mikolov’s ideas.
An implementation of Word2vec can be found at Google Code.
What is a word vector?
In a sense, a word vector is simply a vector of weights. In a simple one-hot encoding, each element of the vector is associated with one word in the vocabulary, and a word is encoded by setting its corresponding element to 1 and all other elements to 0.
Suppose we have a vocabulary of just five words: King, Queen, Man, Woman, and Child. We could encode the word “Queen” as:
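Concretely, this is what such an encoding looks like as a minimal Python sketch (the ordering of the vocabulary indices below is an arbitrary assumption):

```python
import numpy as np

# Hypothetical 5-word vocabulary; the index assigned to each word is arbitrary.
vocab = ["King", "Queen", "Man", "Woman", "Child"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot encoding of `word` over the 5-word vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("Queen"))  # [0. 1. 0. 0. 0.]
```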
With such an encoding, the only meaningful comparison we can make between word vectors is a test for equality.
Word2vec instead uses a distributed representation of words. Take a vector with a large number of dimensions (say, 1000); each word is then represented by a distribution of weights across all of those elements. So instead of a one-to-one mapping between elements and words, the representation of a word is spread across all the elements in the vector, and each element contributes to the definition of many words. If I label the dimensions of a hypothetical word vector (there are no such pre-assigned labels in the algorithm, of course), it might look a bit like this:
Such a vector comes to represent the “meaning” of a word in some abstract way. And, as we will see, simply by examining a large corpus it is possible to learn word vectors that capture the relationships between words in a surprisingly expressive way. We can also use the vectors as inputs to a neural network.
Reasoning with word vectors
We find that the learned word representations in fact capture meaningful syntactic and semantic regularities in a very simple way. In particular, the regularities appear as constant vector offsets between pairs of words sharing a particular relationship. For example, if we denote the vector for word i as x_i and look at the singular/plural relation, we observe that x_apple - x_apples ≈ x_car - x_cars, x_family - x_families ≈ x_car - x_cars, and so on. Perhaps more surprisingly, the same holds for a variety of semantic relations, as the SemEval 2012 task on measuring relation similarity revealed.
The vectors are very good at answering analogy questions of the form “a is to b as c is to ?”. For example, “man is to woman as uncle is to ? (aunt)” can be answered simply by using the vector offset method based on cosine distance, as the sketch below illustrates.
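Here is a minimal sketch of that offset method, assuming we already have a mapping from words to learned vectors; the toy vectors below are invented purely for illustration:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word whose vector is
    closest (by cosine similarity) to vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy example with made-up vectors, purely for illustration:
vectors = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "uncle": np.array([0.2, 0.0, 1.0]),
    "aunt":  np.array([0.2, 1.0, 1.0]),
}
print(solve_analogy("man", "woman", "uncle", vectors))  # aunt
```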
For example, here are the vector offsets for three word pairs illustrating the gender relation:
The following figure shows the singular and plural relationships:
This kind of vector composition also lets us answer the question “King - Man + Woman = ?” and arrive at the result “Queen”! It is truly remarkable that such knowledge is gained simply by looking at lots of words in context (as we will see soon), with no other information provided about their meanings.
Somewhat surprisingly, the similarity of word representations goes beyond simple syntactic regularities. A simple vector-offset operation on the word vectors, vector(“King”) - vector(“Man”) + vector(“Woman”), yields a vector closest to the representation of the word “Queen”, as the following example shows.
The vectors for King, Man, Queen, and Woman are shown below:
And the result of the composition King - Man + Woman = ?:
Here are a few more results obtained using the same technique:
Here is a 2-dimensional PCA projection of the country-capital relationship:
Here are some examples of “a is to b as c is to ?” questions answered by this method:
We can also compose vectors by element-wise addition and answer questions such as “German + airlines” by looking at the tokens closest to the composite vector:
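As a hedged sketch of how such a composite-vector query can be run in practice, here is one way to do it with the gensim library; the file name of the pre-trained Google News vectors (and having them available on disk at all) is an assumption about your local setup:

```python
from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (file name assumed; adjust to your setup).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# With several positive words, most_similar queries the mean of their unit-normalized
# vectors, which (up to scaling) is the element-wise sum described above.
print(kv.most_similar(positive=["German", "airlines"], topn=5))
```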
Word vectors with such semantic relationships could be used to improve many existing natural language processing applications, such as machine translation, information retrieval, and question answering systems, and may enable new ideas and applications yet to be invented.
The figure below reports results on a syntactic and semantic word relationship test: 640-dimensional word vectors trained with the skip-gram model achieve 55% semantic accuracy and 59% syntactic accuracy.
Learning word vectors
Mikolov et al. were not the first to use continuous vector representations of words, but they did show how to reduce the computational complexity of learning such representations, making it practical to learn high-dimensional word vectors on a large amount of data. For example, “We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to the 1 million most frequent words…”.
Much of the complexity of neural network language models (feedforward or recurrent) is caused by the nonlinear hidden layer(s).
While this is what makes neural networks so attractive, the authors decided to explore simpler models that might not represent the data as precisely as neural networks, but can be trained on much more data efficiently.
Two new architectures are proposed: the continuous bag-of-words (CBOW) model and the continuous skip-gram model. Let’s look at the continuous bag-of-words model first.
Consider the passage “The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships.” Imagine a sliding window over the text that includes the current (center) word plus the four words before and after it, as shown below:
The context words form the input layer, each encoded as a one-hot vector. So if the vocabulary size is V, each word becomes a V-dimensional vector with a 1 in the position corresponding to that word and 0s everywhere else. There is a single hidden layer and an output layer, as shown in the figure below.
The training objective is to maximize the conditional probability of observing the actual output word (the focus word) given the input context words, with respect to the weights. In our example, given the input (“an”, “efficient”, “method”, “for”, “high”, “quality”, “distributed”, “vector”), we want to maximize the probability of getting “learning” as the output.
Since our input vectors are one-hot encoded, multiplying an input vector by the weight matrix W1 amounts to simply selecting a row of W1.
Given C input word vectors, the activation of the hidden layer simply copies the corresponding C rows of W1 and divides by C to take their average.
This means the link (activation) function of the hidden layer units is simply linear (i.e., it passes its weighted sum of inputs straight on to the next layer).
From the hidden layer to the output layer, a second weight matrix W2 is used to compute a score for each word in the vocabulary, and softmax is then used to obtain the posterior distribution over words.
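Here is a minimal numpy sketch of the CBOW forward pass just described; the vocabulary size, embedding size, random initialization, and word indices are illustrative assumptions rather than values from the papers:

```python
import numpy as np

V, N = 10, 4                       # vocabulary size, hidden-layer (embedding) size
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))       # input -> hidden weights (one row per word)
W2 = rng.normal(size=(N, V))       # hidden -> output weights

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cbow_forward(context_indices):
    """Given the indices of the C context words, return the predicted
    distribution over the vocabulary for the focus word."""
    # Multiplying a one-hot vector by W1 just selects a row, so we index directly,
    # then average the C selected rows (the linear "activation" of the hidden layer).
    h = W1[context_indices].mean(axis=0)
    scores = h @ W2                # one score per vocabulary word
    return softmax(scores)

context = [1, 2, 3, 5]             # indices of the surrounding words (illustrative)
probs = cbow_forward(context)
print(probs.shape, probs.sum())    # (10,) 1.0
```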
The skip-gram model is the opposite of the CBOW model: it uses the single focus word as input and the target context words as output, as shown below:
As mentioned before, the activation function of the hidden layer simply copies the corresponding row of the weight matrix W1. At the output layer, we now output C multinomial distributions instead of just one. The training objective is to minimize the summed prediction error across all the context words in the output layer. In our example, given the input “learning”, we would hope to see “an”, “efficient”, “method”, “for”, “high”, “quality”, “distributed”, “vector” at the output layer.
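And a matching self-contained sketch for the skip-gram case, under the same illustrative assumptions: the hidden layer copies a single row of W1, and the resulting output distribution is scored against each of the C context words.

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))       # input -> hidden
W2 = rng.normal(size=(N, V))       # hidden -> output

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def skipgram_loss(focus_index, context_indices):
    """Summed negative log-likelihood of the C context words given the focus word.
    The hidden layer simply copies one row of W1; the same output distribution
    is scored against every context word."""
    h = W1[focus_index]
    probs = softmax(h @ W2)
    return -sum(np.log(probs[c]) for c in context_indices)

print(skipgram_loss(4, [1, 2, 3, 5]))   # indices are illustrative
```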
Optimizations
Having to update every output word vector for every word in every training instance is very expensive…
Intuitively, the solution is to limit the number of output vectors that must be updated per training instance. Two elegant approaches do this: hierarchical softmax and negative sampling.
Hierarchical softmax uses a binary tree to represent all the words in the vocabulary: each word is a leaf of the tree, and for each leaf there is a unique path from the root to that leaf, which is used to estimate the probability of the word represented by the leaf. “We define this probability as the probability of a random walk from the root ending at the leaf in question.”
The main advantage is that instead of evaluating V output nodes in the neural network to obtain the probability distribution, only about log2(V) nodes need to be evaluated. A binary Huffman tree is used because it assigns short codes to frequent words, which makes training faster.
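A minimal sketch of the idea: each inner node of the tree gets its own vector, and a word’s probability is a product of binary (sigmoid) decisions along its root-to-leaf path, so only about log2(V) nodes are touched. The data layout below is an assumption for illustration, not the reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(h, path_node_vectors, code):
    """Probability of one word given the hidden-layer vector h.

    path_node_vectors: vectors of the inner nodes on the word's root-to-leaf path.
    code: the word's binary Huffman code (here 1 means 'go right', 0 'go left').
    The result is a product of about log2(V) binary decisions instead of a
    full softmax over all V words.
    """
    prob = 1.0
    for node_vec, bit in zip(path_node_vectors, code):
        p_right = sigmoid(np.dot(h, node_vec))
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob

# Illustrative example: a word sitting at depth 3 with Huffman code [0, 1, 1].
rng = np.random.default_rng(0)
h = rng.normal(size=4)
path = [rng.normal(size=4) for _ in range(3)]
print(hierarchical_softmax_prob(h, path, [0, 1, 1]))
```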
The idea of negative sampling is simple: at each iteration we only update a sample of the output word vectors. The target output word should be kept in the sample and gets updated, and we add a few (non-target) words to it as negative samples. A probability distribution is needed for the sampling process; it can be chosen fairly freely, and a good distribution is usually determined empirically.
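Here is a hedged sketch of a single skip-gram training step with negative sampling. The noise distribution is the unigram distribution raised to the 3/4 power, which is the empirical choice reported by Mikolov et al.; the learning rate, sizes, and indices are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_update(W_in, W_out, focus, target, noise_probs,
                             k=5, lr=0.025, rng=np.random.default_rng(0)):
    """Update only the target output vector plus k sampled negative ones
    (a real implementation would also skip negatives that equal the target)."""
    h = W_in[focus].copy()
    negatives = rng.choice(len(noise_probs), size=k, p=noise_probs)
    grad_h = np.zeros_like(h)
    for w, label in [(target, 1.0)] + [(int(n), 0.0) for n in negatives]:
        score = sigmoid(np.dot(h, W_out[w]))
        g = lr * (score - label)
        grad_h += g * W_out[w]     # accumulate gradient for the input vector
        W_out[w] -= g * h          # update this output vector only
    W_in[focus] -= grad_h          # update the focus word's input vector

# Noise distribution: unigram counts raised to the 3/4 power, then normalized.
counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])
noise_probs = counts ** 0.75
noise_probs /= noise_probs.sum()

V, N = 5, 4
rng = np.random.default_rng(1)
W_in, W_out = rng.normal(size=(V, N)), rng.normal(size=(V, N))
negative_sampling_update(W_in, W_out, focus=0, target=2, noise_probs=noise_probs)
print(W_out[2])
```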
Mikolov et al. also use a simple subsampling approach to counter the imbalance between rare and frequent words in the training set (for example, “in”, “the”, and “a” provide less information value than rare words). Each word in the training set is discarded with probability P(w_i), given by the following formula:

P(w_i) = 1 - sqrt(t / f(w_i))

where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically a value around 10^-5.
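A minimal sketch of applying that discard rule to a token stream; the toy corpus, the threshold used in the call, and the choice to estimate frequencies from the stream itself are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def subsample(tokens, t=1e-5, rng=np.random.default_rng(0)):
    """Randomly discard word occurrences with probability
    P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - np.sqrt(t / f))
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# On a tiny toy corpus every word is "frequent", so a larger threshold is used here.
corpus = ["the", "quick", "fox", "jumps", "over", "the", "lazy", "dog"] * 1000
print(len(corpus), len(subsample(corpus, t=0.05)))
```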
This post is an abridged translation of “The Amazing Power of Word Vectors” by Adrian Colyer; for more detail, please refer to the original article. Translated and organized by the Aliyun Yunqi community.