@TOC

Preface

Word representation has always been one of the most fundamental and important tasks in natural language processing (NLP), and deep learning has revolutionized the field. One of its key ideas is the word embedding, a way of representing language: the semantics of a word are encoded as a vector, and the similarity between vectors is used to measure how strongly two words are related.

This post is my summary of and reflection on the Word Representation chapter of Andrew Ng's deep learning course, compiled with reference to some books and online materials. The main purpose of writing it is to deepen my own understanding; if there are mistakes, you are welcome to point them out in the comments. Thank you very much!

I. One-hot representation

One-hot representation is an important way to represent discrete features in machine learning, and we can also use it to represent words in NLP tasks. In my blog post on neural network sequence models, I already showed the steps for representing words with one-hot vectors; here is a brief recap:

  • Build a vocabulary containing commonly used words. The size of the vocabulary is chosen manually; here we use 10,000 words. Vocabularies of 30,000-50,000 words are common for average-sized commercial applications, 100,000-word vocabularies are not unheard of, and some large Internet companies use vocabularies of a million words or even more.
  • Next, we build the one-hot vector for each word. If a word has index 1234 in the vocabulary, its one-hot vector is a 10,000-dimensional column vector with a 1 in row 1234 and 0 in every other row (see the sketch after this list).
  • In particular, we need to handle out-of-vocabulary (OOV) words, beginning-of-sentence tags (BOS), and end-of-sentence tags (EOS). OOV words are words that are not in the vocabulary; we add the tag <UNK> to the vocabulary to represent them. In addition, when building language models and machine translation systems we also need beginning and end tags, which mark the start and end positions of a sentence. For example, once a language model generates the end tag, we can consider the sentence complete. We add <BOS> and <EOS> to the vocabulary to represent them.
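
As an illustration, here is a minimal sketch of mapping words to one-hot vectors over a hypothetical toy vocabulary (a real vocabulary would contain around 10,000 common words):

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would contain ~10,000 common words.
vocab = ["<UNK>", "<BOS>", "<EOS>", "i", "want", "a", "glass", "of", "orange", "juice"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return the one-hot column vector for `word`, falling back to <UNK> if it is OOV."""
    vec = np.zeros((len(word_to_index), 1))
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1.0
    return vec

print(one_hot("orange", word_to_index).T)  # 1 only at the index of "orange"
print(one_hot("durian", word_to_index).T)  # OOV word falls back to <UNK>
```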

One drawback of this representation is that it treats each word in isolation, which weakens the generalization ability of the algorithm. Each one-hot vector only identifies the word itself, and the algorithm cannot easily generalize across words, i.e. it cannot represent the relationship between any two words, because the inner product of any two one-hot vectors is 0.

For example, we need to build a language model to generate sentences, assuming we have learned the following sentence:

I want a glass of orange juice.

Now, in another sentence, the model needs to predict the blank:

I want a glass of apple __.

Since there is no correlation between orange and apple under the one-hot representation, even if the model has learned that orange juice is a common collocation, it cannot infer that apple juice is also common. Therefore it cannot correctly fill the blank with the word juice.

In addition, one-hot vectors are usually very high-dimensional (the same size as the vocabulary) and very sparse (a 1 in only one position), so representing words with one-hot vectors inflates the number of model parameters and makes training harder.

II. Word embedding representation

Is there a way to represent words that better captures their meanings and the associations between them? The answer is yes: we can use a featurized representation for each word.

Consider an example with four attributes (in practice there are many more): gender, royalty, age, and food. Each word is scored on its relevance to each of these four attributes, so any word can be represented by a 4-dimensional feature vector; for example, Man might be (−1, 0.01, 0.03, 0.09). With this representation it is clear that Apple and Orange are very similar, so the algorithm can easily fill the word juice into the second sentence.
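
Here is a small sketch with made-up feature values (illustrative only, not numbers from a trained model) showing why such vectors make word similarity easy to measure:

```python
import numpy as np

# Hypothetical 4-dimensional feature vectors: (gender, royalty, age, food).
features = {
    "man":    np.array([-1.00,  0.01, 0.03, 0.09]),
    "woman":  np.array([ 1.00,  0.02, 0.02, 0.01]),
    "apple":  np.array([ 0.00, -0.01, 0.03, 0.95]),
    "orange": np.array([ 0.01,  0.00, -0.02, 0.97]),
}

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(features["apple"], features["orange"]))  # close to 1: very similar
print(cosine_similarity(features["apple"], features["man"]))     # much smaller
```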

When words are represented by such feature vectors, the representation is called a word embedding. The name comes from imagining each word being embedded as a point in a high-dimensional space. Word embedding is one of the most important ideas in NLP.

Note that the features above are just an intuitive example. In practice the features are not designed by hand but learned by an algorithm, and the learned features may not be easily interpretable; in any case, the algorithm can quickly tell which words are similar.

In addition, the dimension of a word embedding vector is usually much smaller than the vocabulary size, which reduces the number of parameters and eases training.

We can use the t-SNE algorithm to visualize high-dimensional word vectors; after embedding, words with similar meanings cluster together.
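
As a sketch, here is how one might run scikit-learn's t-SNE on a stand-in embedding matrix (real embeddings would come from a trained model):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical embedding matrix: one row per word, e.g. 300-dimensional vectors.
words = ["man", "woman", "king", "queen", "apple", "orange", "juice", "durian"]
embeddings = np.random.rand(len(words), 300)  # stand-in for real trained embeddings

# Project to 2D; perplexity must be smaller than the number of samples.
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()
```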

III. Functions of word embedding

1. Transfer learning

Let’s take a named entity recognition (NER) task as an example. Suppose the training set contains the sentence:

Sally Johnson is an orange farmer.

We use a bidirectional RNN (BRNN) model here, with word embedding vectors representing the words as inputs. From "orange farmer" the BRNN can tell that Sally Johnson is a person's name.

When we encounter new input such as:

Robert Lin is an apple farmer.

Since apple’s word embedding is similar to orange’s word embedding, the model can also easily identify Robert Lin as a person’s name.

Suppose our input contains obscure words such as:

Robert Lin is a durian cultivator.

Durian and cultivator are relatively uncommon words that may not appear in our training set at all. With the traditional one-hot representation, it would be difficult to predict that Robert Lin is a person's name. With word embeddings, however, durian has a vector similar to orange's, so the model can relate "durian cultivator" to "orange farmer" and still predict that Robert Lin is a person's name.

Why can word embeddings capture associations between words that never appear in the text of the current training set? Because the embedding vectors are usually trained on massive amounts of unlabeled text; we will introduce the training methods later.

The benefit of word embeddings is most obvious when the labeled training set is small, because they greatly enrich the information fed into the model and supply word-meaning information. The same holds for transfer learning in general: transferring from task A to task B is most useful when A has a lot of data and B has little. Therefore word embeddings help noticeably for many NLP tasks, but less so for some language modeling and machine translation settings, because those tasks already have large amounts of data of their own.

2. Analogical reasoning

Word embeddings can also support analogical reasoning. Take the earlier feature data again: using word embeddings, we find an interesting property. Knowing that man corresponds to woman, we can use the embeddings to automatically infer that king corresponds to queen.

We subtract the embedding of woman from the embedding of man to get $e_{man} - e_{woman}$, and subtract the embedding of queen from the embedding of king to get $e_{king} - e_{queen}$. The two difference vectors turn out to be very close, i.e. $e_{man} - e_{woman} \approx e_{king} - e_{queen}$. This is because the main difference between man and woman is gender, and, according to the vector representation, the main difference between king and queen is also gender, so the two differences are very similar.

Using this property, given the relation between one pair of words, we can find the word that best completes the same relation with another word. For example, given the relation between man and woman, to find which word stands in the same relation to king, we look for the word $w$ that maximizes the similarity between $e_{king} - e_w$ and $e_{man} - e_{woman}$. Cosine similarity is generally used here, i.e. we measure the similarity of two vectors $u$ and $v$ by the cosine of the angle between them:

$$sim(u, v) = \frac{u^{T} v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$$
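
As a small illustration, here is a sketch of this analogy search over a tiny, made-up embedding table (the vectors and the helper name `complete_analogy` are hypothetical, not from the course):

```python
import numpy as np

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Find w maximizing sim(e_c - e_w, e_a - e_b), i.e. 'a is to b as c is to w'."""
    target_diff = embeddings[a] - embeddings[b]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        score = cosine_similarity(embeddings[c] - vec, target_diff)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical toy embeddings; real ones would be learned by Word2Vec or GloVe.
embeddings = {
    "man":   np.array([-1.0, 0.1]),
    "woman": np.array([ 1.0, 0.1]),
    "king":  np.array([-1.0, 0.9]),
    "queen": np.array([ 1.0, 0.9]),
    "apple": np.array([ 0.0, -0.8]),
}
print(complete_analogy("man", "woman", "king", embeddings))  # expected: "queen"
```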

IV. Word2Vec

To train word embedding vectors, we can use the Word2Vec model. Word2Vec is a simple and computationally efficient algorithm for learning word embeddings.

The core idea of Word2Vec is to train word vectors by learning a neural network language model. It is based on the assumption that words with similar contexts have similar embedding vectors. For example, consider two sentences:

I like eating apples. I like eating pears.

We know that apple and pear are semantically very close, and in the example above their contexts are also very similar, so the model will learn similar embedding vectors for apple and pear.

Word2Vec takes a distributional-semantics view of word meaning: essentially, the meaning of a word is given by the contexts in which it is used. Recall the cloze tests we did in school: a passage has many blanks, and we are asked to choose the right word for each blank according to its context. The context already determines the meaning of the word, and if we choose the word correctly, it means we understand the meaning of the blank.

The language models used by Word2Vec fall into two categories, i.e. there are two ways to learn word embeddings:

  • Using a word as input to predict the context around it gives the Skip-gram model.
  • Using the context of a word as input to predict the word itself gives the CBOW model (this is the cloze-test example above).

Skip-gram

First, we introduce the Skip-gram model in Word2Vec. It takes a word as input and predicts the context around it.

Suppose the training set is given a sentence like this:

I want a glass of orange juice to go along with my cereal.

In the Skip-gram model, we extract context-target word pairs to construct a supervised learning problem. We randomly pick a word as the context, for example "orange", and then randomly pick another word within a certain distance of it (the window), say within ±5 or ±10 words, as the target. This gives a supervised learning problem: given the context word, predict a target word chosen at random from within ±5 or ±10 words of it. The goal of constructing this problem is not to solve the supervised learning task itself, but to use it to learn a good word embedding model.

The name Skip-gram reflects the fact that some elements are skipped: within the context window we randomly select a word as the target, regardless of whether it is adjacent to the context word.

Of course, instead of sampling a single target at random, we can also take every word in the window as a target word and pair each with the current context word to form training samples; the computational cost is correspondingly higher. To reduce it, subsampling can be used: a retention probability is computed for each sample, and the sample is kept or dropped according to that probability (which is essentially equivalent to randomly choosing a word within the window). A sketch of how such training pairs are generated follows.
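
Here is a sketch of generating such training pairs from the example sentence (the helper name `skipgram_pairs` is made up for illustration):

```python
import random

def skipgram_pairs(tokens, window=5):
    """Generate (context, target) pairs: for each context word,
    every word within +/- `window` positions becomes a target."""
    pairs = []
    for i, context in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs

sentence = "i want a glass of orange juice to go along with my cereal".split()
print(skipgram_pairs(sentence, window=5)[:5])

# Alternatively, sample a single random target per context word:
context = "orange"
i = sentence.index(context)
candidates = [w for j, w in enumerate(sentence) if j != i and abs(j - i) <= 5]
print((context, random.choice(candidates)))
```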

Next, we train the desired word embedding vectors with a simple neural network that has a single hidden layer. The network has the following details:

  • The input layer is the one-hot vector of the context we selected earlier, the original representation of the word.
  • The hidden layer dimension is chosen by us: we can set it to however many dimensions we want the word embedding vectors to have. The hidden layer is linear and uses no nonlinear activation function, which simplifies the language model; this is one of the advantages of Word2Vec.
  • The dimension of the output layer is consistent with the vocabulary dimension, and the probability of each word in the vocabulary being the target word is output using the Softmax activation function as the classifier.
  • Input layer, output layer and hidden layer are fully connected.

Neural network training has been described in a previous blog post. Since this is a supervised multi-class classification problem, we can use the cross-entropy loss as the optimization objective and fit the model parameters with gradient descent.
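
Below is a minimal PyTorch sketch of this single-hidden-layer network, assuming the 10,000-word vocabulary from the article and a hypothetical 300-dimensional hidden layer; it illustrates the structure rather than reproducing the original word2vec implementation:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # vocabulary size used in the article
EMBED_DIM = 300       # hypothetical hidden-layer (embedding) dimension

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # V: input-to-hidden weights; looking up a row = multiplying by a one-hot vector
        self.V = nn.Embedding(vocab_size, embed_dim)
        # U: hidden-to-output weights; no bias, no nonlinearity in the hidden layer
        self.U = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, context_ids):
        hidden = self.V(context_ids)   # linear hidden layer (the "input vectors")
        return self.U(hidden)          # one score per vocabulary word

model = SkipGram(VOCAB_SIZE, EMBED_DIM)
loss_fn = nn.CrossEntropyLoss()        # applies softmax + cross-entropy internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# One training step on a toy batch of (context, target) index pairs.
context = torch.tensor([1234, 42])     # hypothetical word indices
target = torch.tensor([5678, 7])
optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()

# After training, model.V.weight[i] is the input vector (word embedding) of word i,
# and model.U.weight[i] is its output vector.
```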

After training, we obtain the model's weight matrices V and U. In the input-to-hidden weight matrix V, because the input is a one-hot vector, only the weight vector V_x corresponding to the position of the context word x is activated; its dimension equals the number of hidden units. We call it the input vector. Since each word occupies a different position in the one-hot vector, x can be uniquely represented by V_x.

In the hidden-to-output weight matrix U, the weight vector U_x at the position of word x in the output layer can likewise represent x; its dimension also equals the number of hidden units. We call it the output vector, and it too uniquely represents x.

In general, we more often use the input vector as the word embedding of the word x.

In addition, Skip-gram can select multiple words as target words for the current context. The network structure only needs minor adjustments, and we can still take the input vector (or output vector) as the embedding of the context word.

CBOW

The CBOW model, also known as the Continuous Bag-of-Words model, predicts in exactly the opposite direction from Skip-gram: it uses the context of a word as input to predict the word itself. The network structure of CBOW is almost the same as that of Skip-gram; we just reverse the multi-target-word Skip-gram network described above. We can still use the cross-entropy loss as the optimization objective and fit the parameters by gradient descent.
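
As a rough sketch under the same hypothetical sizes as before, CBOW can be written by averaging the context-word embeddings and classifying the center word:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 10_000, 300  # hypothetical sizes, as before

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.V = nn.Embedding(vocab_size, embed_dim)           # input vectors
        self.U = nn.Linear(embed_dim, vocab_size, bias=False)  # output layer

    def forward(self, context_ids):
        # context_ids: (batch, context_size); average the context embeddings
        hidden = self.V(context_ids).mean(dim=1)
        return self.U(hidden)  # logits over the vocabulary for the center word

model = CBOW(VOCAB_SIZE, EMBED_DIM)
context = torch.tensor([[12, 34, 56, 78]])  # hypothetical context word indices
center = torch.tensor([90])                 # hypothetical center word index
loss = nn.CrossEntropyLoss()(model(context), center)
```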

Incidentally, the principle of CBOW is somewhat similar to the masking in BERT, except that BERT randomly masks some words in a sentence and predicts them from the remaining words to train embeddings, whereas CBOW predicts every word in the sentence.

Optimizations of Word2Vec

1. Hierarchical Softmax classifier

One of the biggest drawbacks of the Word2Vec model is that the Softmax layer is computationally expensive, especially when the vocabulary is large: we need to exponentiate and sum the scores of every unit in the Softmax layer, which becomes slow when the vocabulary reaches millions or tens of millions of words.

To solve this problem, scholars proposed an optimization method called Hierarchical Softmax classifier.

The basic idea of the hierarchical Softmax classifier is somewhat similar to a binary search tree. It replaces the original Softmax layer with a tree structure in which each internal node is a sigmoid binary classifier. Suppose we have 10,000 words, i.e. 10,000 units in the output layer. The first binary classifier at the root tells us whether the target word is among the first 5,000 words: if so, we go to the left subtree, otherwise to the right subtree. Continuing in this way, we eventually reach a leaf node, which corresponds to a specific word.

In this way we reduce the linear time complexity O(n) of the output layer to logarithmic time complexity O(log n), speeding up the computation.

In particular, in practice the hierarchical Softmax classifier does not use a perfectly balanced tree with the same number of words in the left and right branches. Instead, it is constructed so that common words sit near the top of the tree, while rarer words such as durian sit deeper down. The concrete implementation usually uses a Huffman tree.
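
The following simplified sketch (not the exact word2vec code) shows the core idea: the probability of a word is the product of sigmoid decisions made at the internal nodes along its path from the root:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(hidden, path_nodes, path_directions, node_vectors):
    """P(word | context) as a product of binary decisions along the tree path.

    hidden:          hidden-layer vector of the context word
    path_nodes:      indices of the internal nodes from the root to the word's leaf
    path_directions: +1 for 'go left', -1 for 'go right' at each node (a common convention)
    node_vectors:    one learnable vector per internal node
    """
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        prob *= sigmoid(direction * node_vectors[node] @ hidden)
    return prob

# Toy example: 3 internal nodes, 10-dimensional hidden layer (all values hypothetical).
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, 10))
hidden = rng.normal(size=10)
print(hierarchical_softmax_prob(hidden, [0, 1], [+1, -1], node_vectors))
```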

2. Negative sampling

The Word2Vec model has a very large number of training samples, and if all parameters were updated for every sample, training would be slow. Scholars therefore proposed Negative Sampling to reduce the amount of computation performed for each training sample.

We define a new supervised learning problem: given a pair of words (such as orange and juice), predict whether they form a context-target pair. That is, the original Softmax multi-class classification is converted into a set of logistic regression (sigmoid) classifiers, one per word (one vs. all). Assuming there are 10,000 words in the vocabulary, we construct 10,000 independent logistic regression models.

  • First, generate a positive example. The generation method is similar to Skip-gram: choose a context word and randomly choose a target word within the window, for example orange and juice in the sentence above. We label the positive sample 1.
  • Then use the same context word to generate negative examples, whose paired words are chosen at random from the vocabulary, for example orange-king, labeled 0. The same method generates more negative samples, such as orange-book, orange-the, orange-of. Because the paired word is chosen at random, we always treat the pair as a negative sample; even in the orange-of example, where of actually appears near orange in the sentence, we still label it 0. Finally, records like these are formed.

For each positive sample, several negative samples are selected; their number is denoted K. For small datasets K = 5-20 is usually recommended; for large datasets K is smaller, e.g. 2-5.

We build an independent logistic regression model for each of the 10,000 words (the one-vs-all scheme) and, in each training step, update only the parameters associated with the one positive sample and the K negative samples. In this way, instead of updating all the parameters of the original 10,000-way Softmax layer (3 million parameters for 300-dimensional embeddings) in every iteration, we only train K + 1 = 5 logistic regression units (1,500 parameters), which greatly reduces the computation.
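
Here is a minimal sketch of one negative-sampling update, assuming 300-dimensional embeddings and K = 4; the word indices are hypothetical:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, K = 10_000, 300, 4  # hypothetical sizes

V = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # input vectors (context words)
U = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # output vectors (target words)

context = torch.tensor([3000])                  # e.g. "orange" (hypothetical index)
positive = torch.tensor([4834])                 # e.g. "juice"  (hypothetical index)
negatives = torch.randint(0, VOCAB_SIZE, (K,))  # randomly drawn negative words

h = V(context)                            # (1, EMBED_DIM)
pos_score = (U(positive) * h).sum(dim=1)  # one logistic unit for the positive pair
neg_score = (U(negatives) * h).sum(dim=1) # K logistic units for the negative pairs

loss = nn.functional.binary_cross_entropy_with_logits(
    torch.cat([pos_score, neg_score]),
    torch.cat([torch.ones(1), torch.zeros(K)]),
)
loss.backward()  # only the embedding rows used above get nonzero gradients
```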

How do we choose the negative samples? This is an important detail of the algorithm. One option is to sample according to the empirical frequency of each word in the corpus, but this makes common words sampled far too often; another option is to sample uniformly, which ignores actual word frequencies. In negative sampling, the probability of a word being selected as a negative sample increases with its frequency, according to the formula

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j} f(w_j)^{3/4}}$$

where $f(w_i)$ is the observed frequency of word $w_i$ in the corpus. Raising the frequency to the power 3/4 takes corpus frequency into account while also increasing the chance that low-frequency words are selected.
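
A small sketch of drawing negative samples from this distribution, with made-up word frequencies:

```python
import numpy as np

# Hypothetical corpus frequencies (counts) for a tiny vocabulary.
words = ["the", "of", "orange", "juice", "durian"]
freq = np.array([50_000, 30_000, 400, 300, 5], dtype=float)

probs = freq ** 0.75   # raise to the 3/4 power
probs /= probs.sum()   # normalize into a probability distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=4, p=probs)  # K = 4 negative samples
print(dict(zip(words, probs.round(3))), negatives)
```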

V. GloVe

Several word embedding algorithms have been introduced above. There is also the GloVe (Global Vectors for Word Representation) algorithm, which has gained some momentum in the NLP field. Although it is not used as widely as Word2Vec, it is appealingly simple.

In this algorithm, we use X_ij to denote the number of times word i appears in the context of word j; X_ij thus measures how often words i and j occur together. The co-occurrence matrix X is obtained by sliding a window over the entire training corpus.

If the context is defined as a window of, say, ±10 words, then clearly X_ij = X_ji, i.e. the matrix is symmetric; if the context is defined as only the immediately preceding word, there is no such symmetry. For GloVe we generally choose the former definition. The optimization objective of the model is defined as follows (for a detailed derivation see Blog.csdn.net/coderTC/art…):

$$\min \sum_{i,j} f(X_{ij}) \left( \theta_i^{T} e_j + b_i + b'_j - \log X_{ij} \right)^2$$

By minimizing this objective, we learn vectors that can predict how often two words appear together. The weighting function f(X_ij) serves two purposes:

  • When X_ij = 0, log(X_ij) is undefined and cannot be computed. We define f(X_ij) = 0 in this case, i.e. such pairs are excluded from the sum; in other words, a pair is only counted if the two words co-occur at least once.
  • In addition, f acts as a weight that balances frequent and infrequent word pairs: very common words are not given too much weight, and uncommon words are not given too little. See the GloVe paper for details; the weighting function used there is shown below.
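
For reference, the weighting function proposed in the GloVe paper has the following form (the paper uses $x_{max} = 100$ and $\alpha = 3/4$ in its experiments):

$$f(x) = \begin{cases} (x / x_{max})^{\alpha}, & x < x_{max} \\ 1, & x \ge x_{max} \end{cases}$$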

In addition, because of GloVe's symmetry, θ and e are symmetric, i.e. they play the same role in the optimization objective. Therefore we usually take their average as the final word vector:

$$e_w^{(final)} = \frac{e_w + \theta_w}{2}$$

Although the optimization objective of GloVe is very simple (just a quadratic cost function), it works well in practice and learns good word embeddings.

VI. ELMo

Although Word2Vec and GloVe are among the most commonly used word embedding models, they share a serious limitation: they assume each word has only one meaning. In other words, Word2Vec cannot distinguish the different senses of a word. Polysemy, however, is very common, so describing a word with a single embedding vector is not entirely reasonable. For example, take the word "bundle" (the original example is a Chinese word that means both a cloth bundle and a punchline in crosstalk) in sentences like:

He shouldered his bundle and walked away into the distance.

And another sentence:

His crosstalk is very good, and he often delivers funny bundles (punchlines).

As you can see, the same word "bundle" has different meanings in different contexts. In the example above, we say the occurrences of "bundle" in the two sentences are the same type but different tokens.

To address this limitation, scholars proposed the ELMo model (Embeddings from Language Models) to realize contextual word embedding, i.e. word embedding at the token level. In this way, the same word gets different embedding vectors in different contexts, which remedies Word2Vec's weakness in handling polysemy.

The training of the ELMo model is based on an RNN (recurrent neural network) language model. More specifically, a bidirectional LSTM language model is used, with Word2Vec embedding vectors as the model input, to train the ELMo embeddings.

BRNNs and LSTMs were introduced in my blog post on neural network sequence models. The bi-LSTM language model predicts the word at the current position from all the preceding words and all the following words, i.e. it maximizes the conditional probability of the correct word at the current position. This optimization problem can be solved with the softmax cross-entropy loss.

Note that the bi-LSTM language model used here has multiple layers, i.e. it is a deep bi-LSTM LM. For the k-th word of the current sentence, we take the LSTM outputs at time step k from every layer, compute a weighted sum over the layers, and multiply by a task-specific scale; the result is the ELMo embedding of the k-th word. Its dimension is twice the dimension of the LSTM hidden layer, because each layer's output is the concatenation of the forward and backward activations. See Li Hongyi's lecture slides for a diagram.

For the k-th word of a sentence in a given task, the ELMo vector is

$$ELMo_k^{task} = \alpha^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}$$

where $\alpha^{task}$ is a task-specific scale, $s_j^{task}$ is the weight of layer $j$ (from 0 to L), and $h_{k,j}^{LM}$ is the activation output of layer $j$ at position $k$.

Why multiply the outputs of different layers by different weights? Because researchers found that embeddings from different layers are suited to different tasks. In general, ELMo learns different things at different levels, so combining them gives a good boost on downstream tasks. Experiments show that the higher layers capture semantics better, while the lower layers capture part of speech and morphology better; $s_j^{task}$ can be set (or learned) separately for each task.
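
A tiny sketch of this weighted layer combination (all shapes and values below are hypothetical stand-ins):

```python
import torch

L, SEQ_LEN, DIM = 2, 12, 1024                      # e.g. 2 LSTM layers + embedding layer, 512*2 units
layer_outputs = torch.randn(L + 1, SEQ_LEN, DIM)   # stand-in for h_{k,j}^{LM}, j = 0..L

s = torch.softmax(torch.randn(L + 1), dim=0)       # task-specific layer weights s_j^{task}
alpha = torch.tensor(1.0)                          # task-specific scale alpha^{task}

# ELMo vector for every position k: weighted sum over layers, scaled by alpha.
elmo = alpha * (s[:, None, None] * layer_outputs).sum(dim=0)  # (SEQ_LEN, DIM)
print(elmo.shape)
```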

Conclusion

That’s the basics of the common pre-trained word representation models Word2Vec, GloVe, and ELMo, summarised roughly here. In fact, the strongest pre-trained model currently available is Google’s BERT, which I’ll explain in my next blog post.