Traditional word vector models such as Word2Vec and GloVe produce fixed word vectors: each word has exactly one vector, which is clearly unsuitable for polysemous words. The ELMo algorithm instead uses a deep bidirectional language model (biLM) and trains only the language model; word vectors are then computed on the fly from the input sentence, so they are closely tied to context and can distinguish word senses better.
1. Static word vector algorithm
The word embedding algorithms Word2Vec and GloVe were introduced in previous articles. Compared with traditional one-hot encoding and co-occurrence vectors, the word vectors produced by word embedding algorithms have lower dimensionality and better support downstream tasks such as document classification and question answering.
However, both are static word vector algorithms: once a language model has been trained on a dataset, the word vector of each word is fixed. When the model is used later, a word gets the same vector no matter what sentence it appears in. For example:
- In the sentence "I like to eat xiaomi", "xiaomi" (小米) means millet, a kind of food.
- In a sentence such as "I want to buy a xiaomi phone", "xiaomi" (小米) refers to the mobile phone brand Xiaomi.
Given these two sentences, the word vector of "xiaomi" obtained from Word2Vec or GloVe is the same; neither model can give a more accurate word vector according to the context.
ELMo is a dynamic word vector algorithm: it trains a biLSTM (bidirectional LSTM) on a large corpus. When a downstream task needs word vectors, the whole sentence is fed into the biLSTM, and the biLSTM outputs are used as the word vectors, so they contain context information. The biLSTM can be understood as a function whose input is a sentence and whose output is the word vectors of the words in that sentence.
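As a minimal illustration of this difference (a toy PyTorch sketch, not the actual ELMo model; the vocabulary, dimensions, and `encode` helper are made up for this example), a static embedding table returns the same vector for a word in any sentence, while a biLSTM run over the whole sentence gives the same word different vectors in different sentences:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"i": 0, "like": 1, "to": 2, "eat": 3, "xiaomi": 4, "bought": 5, "a": 6, "phone": 7}
emb = nn.Embedding(len(vocab), 16)                              # static lookup (Word2Vec/GloVe-like)
bilstm = nn.LSTM(16, 16, bidirectional=True, batch_first=True)  # contextual encoder (ELMo-like)

def encode(words):
    ids = torch.tensor([[vocab[w] for w in words]])
    static = emb(ids)                # one fixed vector per word, regardless of the sentence
    contextual, _ = bilstm(static)   # each output vector depends on the whole sentence
    return static[0], contextual[0]

s1_static, s1_ctx = encode(["i", "like", "to", "eat", "xiaomi"])
s2_static, s2_ctx = encode(["i", "bought", "a", "xiaomi", "phone"])

print(torch.allclose(s1_static[4], s2_static[3]))  # True: static vectors are identical
print(torch.allclose(s1_ctx[4], s2_ctx[3]))        # False: contextual vectors differ
```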
2. Bidirectional language model
This section first introduces what a bidirectional language model is and how word vectors are obtained from a biLSTM. Readers unfamiliar with LSTMs can refer to the previous article "Recurrent Neural Network: RNN, LSTM, GRU".
2.1 Bidirectional language model
Given a sentence containing N words, t = [t(1), t(2), …, t(N)], the forward model predicts the next word t(k) from the preceding words [t(1), t(2), …, t(k-1)], while the backward model predicts t(k) from the following words [t(k+1), …, t(N)].
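Written out (this is the standard factorization used in the ELMo paper), the two directions model the sentence probability as:

$$p(t_1,\ldots,t_N)=\prod_{k=1}^{N} p\big(t_k \mid t_1,\ldots,t_{k-1}\big) \qquad \text{(forward)}$$

$$p(t_1,\ldots,t_N)=\prod_{k=1}^{N} p\big(t_k \mid t_{k+1},\ldots,t_N\big) \qquad \text{(backward)}$$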
2.2 Use biLSTM to get context-dependent word vectors
A biLSTM is a bidirectional recurrent neural network consisting of a forward network and a backward network. The figure above shows a biLSTM with L = 2 layers.
The input for each word t(i) is a word vector. This input vector is fixed and could, for example, be a vector generated by Word2Vec or GloVe; ELMo uses the word vectors generated by CNN-BIG-LSTM. Note that ELMo's input word vectors are static, while the word vectors ELMo produces from the biLSTM outputs are dynamic and contain context information.
In the ELMo paper, the following symbols denote the output of the i-th word at each layer of the bidirectional LSTM: the forward output contains the semantics of the words before the i-th word, and the backward output contains the semantics of the words after it.
In the text below, h(k,j,→) denotes the forward output and h(k,j,←) the backward output of word t(k) at layer j. The outputs h(k,j,→) and h(k,j,←) of each layer are the dynamic word vectors of word t(k).
The LSTM has a total of L layers. For the forward LSTM, the last-layer output h(k-1,L,→) of each word t(k-1) is used to predict the next word t(k). For the backward LSTM, the last-layer output h(k+1,L,←) of each word t(k+1) is used to predict the previous word t(k). A softmax is used in the prediction process, and the biLSTM optimizes the following objective function:
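Reconstructed from the original paper, the objective maximizes the joint forward and backward log-likelihood:

$$\sum_{k=1}^{N}\Big(\log p\big(t_k \mid t_1,\ldots,t_{k-1};\,\theta_x,\overrightarrow{\theta}_{LSTM},\theta_s\big)+\log p\big(t_k \mid t_{k+1},\ldots,t_N;\,\theta_x,\overleftarrow{\theta}_{LSTM},\theta_s\big)\Big)$$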
θ(x) denotes the input word-vector parameters, which are fixed. θ(s) denotes the softmax layer used to predict the next and previous words. θ(LSTM,→) denotes the forward LSTM parameters, used to compute h(k-1,L,→), and θ(LSTM,←) denotes the backward LSTM parameters, used to compute h(k+1,L,←).
3. ELMo algorithm
3.1 Process Introduction
ELMo uses a biLSTM with L = 2 layers. ELMo first trains the model on a large dataset and then, in downstream tasks, outputs the word vector of each word according to the input sentence. For example, given a sentence t = [t(1), t(2), …, t(N)], ELMo computes the word vectors as follows:
- Look up the static word vectors E(1), …, E(N) as the input; ELMo uses word vectors generated by CNN-BIG-LSTM.
- Feed the word vectors E(1), …, E(N) into the first-layer forward LSTM and backward LSTM respectively, obtaining the forward outputs h(1,1,→), …, h(N,1,→) and the backward outputs h(1,1,←), …, h(N,1,←).
- Pass the forward outputs h(1,1,→), …, h(N,1,→) into the second-layer forward LSTM to obtain the second-layer forward outputs h(1,2,→), …, h(N,2,→); pass the backward outputs h(1,1,←), …, h(N,1,←) into the second-layer backward LSTM to obtain the second-layer backward outputs h(1,2,←), …, h(N,2,←).
- The final word vectors of word i then include E(i), h(i,1,→), h(i,1,←), h(i,2,→), and h(i,2,←). With an L-layer biLSTM, 2L+1 word vectors are obtained for each word (see the code sketch after this list).
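A minimal PyTorch sketch of this layer-by-layer forward pass (a simplified stand-in for the real model: actual ELMo uses projected 4096-unit LSTMs with residual connections and a character-CNN input; the dimensions and the `elmo_style_forward` helper below are illustrative assumptions):

```python
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, NUM_LAYERS = 512, 512, 2   # assumed sizes, matching the 512-dim vectors below

# One separate forward LSTM and backward LSTM per layer, as in the steps above.
fwd_lstms = nn.ModuleList([nn.LSTM(EMB_DIM if l == 0 else HID_DIM, HID_DIM, batch_first=True)
                           for l in range(NUM_LAYERS)])
bwd_lstms = nn.ModuleList([nn.LSTM(EMB_DIM if l == 0 else HID_DIM, HID_DIM, batch_first=True)
                           for l in range(NUM_LAYERS)])

def elmo_style_forward(E):
    """E: (batch, N, EMB_DIM) static input vectors E(1..N). Returns the 2L+1 output tensors."""
    outputs = [E]                                            # the static input vectors
    fwd_in, bwd_in = E, E
    for l in range(NUM_LAYERS):
        fwd_out, _ = fwd_lstms[l](fwd_in)                    # h(k, l+1, ->), left to right
        bwd_out, _ = bwd_lstms[l](torch.flip(bwd_in, [1]))   # run right to left
        bwd_out = torch.flip(bwd_out, [1])                   # h(k, l+1, <-), re-aligned to positions
        outputs += [fwd_out, bwd_out]
        fwd_in, bwd_in = fwd_out, bwd_out                    # next layer reads this layer's outputs
    return outputs

E = torch.randn(1, 6, EMB_DIM)        # stand-in for E(1), ..., E(6) of a 6-word sentence
vecs = elmo_style_forward(E)
print(len(vecs))                      # 5 = 2L+1 for L = 2; each tensor is (1, 6, 512)
```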
3.2 Use word vectors
As described above, each word i in a sentence yields 2L+1 word vectors. How are these 2L+1 vectors used in practice?
First, the CNN-BIG-LSTM word vector E(i) used as ELMo's input has 512 dimensions. Each LSTM layer then produces two word vectors for word i, h(i,j,→) and h(i,j,←), each also 512-dimensional. From these, L+1 word vectors can be constructed for word i.
h(i,0) is the direct concatenation of two copies of E(i); it represents the input word vector, is static, and has 1024 dimensions.
h(i,j) is the direct concatenation of the two outputs h(i,j,→) and h(i,j,←) of the j-th biLSTM layer; it is dynamic and has 1024 dimensions.
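A tiny illustration of the concatenation, with random tensors standing in for the real 512-dimensional vectors:

```python
import torch

E_i      = torch.randn(512)   # static input vector E(i)
h_i1_fwd = torch.randn(512)   # h(i, 1, ->)
h_i1_bwd = torch.randn(512)   # h(i, 1, <-)

h_i0 = torch.cat([E_i, E_i])             # h(i, 0): static, 1024-dim
h_i1 = torch.cat([h_i1_fwd, h_i1_bwd])   # h(i, 1): dynamic, 1024-dim
print(h_i0.shape, h_i1.shape)            # torch.Size([1024]) torch.Size([1024])
```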
In ELMo, the word vectors from different layers emphasize different information: the CNN-BIG-LSTM word vectors used at the input layer encode part-of-speech information well, the first LSTM layer encodes syntactic information well, and the second LSTM layer encodes word-sense (semantic) information well.
The ELMo authors propose two ways to use word vectors:
The first is to directly use the output of the last biLSTM layer as the word vector, that is, h(i,L).
The second, more general approach fuses the L+1 outputs with learned weights, as shown below. s(task,j) are the weight coefficients normalized by a softmax, and γ(task) is a task-dependent coefficient that allows different NLP tasks to scale ELMo's vectors, increasing the flexibility of the model.
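In formula form (as given in the ELMo paper), the fused vector for word i is

$$\mathrm{ELMo}(i,\mathrm{task}) = \gamma^{\mathrm{task}} \sum_{j=0}^{L} s^{\mathrm{task}}_{j}\, h(i,j)$$

A small PyTorch sketch of this weighted fusion (the variable names are illustrative; in practice w and γ are learned jointly with the downstream task):

```python
import torch
import torch.nn.functional as F

L = 2
w     = torch.randn(L + 1, requires_grad=True)      # raw per-layer weights, learned with the task
gamma = torch.ones(1, requires_grad=True)           # task-dependent scale γ(task)
h     = [torch.randn(1024) for _ in range(L + 1)]   # h(i,0), h(i,1), h(i,2) for one word i

s = F.softmax(w, dim=0)                              # s(task, j), softmax-normalized
elmo_vec = gamma * sum(s[j] * h[j] for j in range(L + 1))
print(elmo_vec.shape)                                # torch.Size([1024])
```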
3.3 Effect of ELMo
Here is an example from the paper, with GloVe on top and ELMo on the bottom two rows. For GloVe, looking up the nearest neighbors of "play" returns related words such as "game", "performance", and "sport", which may differ from the actual meaning of "play" in a given sentence. With ELMo, however, when "play" in the first source sentence refers to a play in a sports game, the "play" in its nearest-neighbor sentence also carries the sports sense; in the second sentence, "play" refers to a stage performance, and its nearest neighbor matches that sense as well. This shows that ELMo obtains better word vectors for a word based on its context.
4. ELMo summary
- ELMo trains a language model rather than directly training word vectors. In later use, sentences are passed through the language model, and more accurate word vectors are obtained by combining contextual semantics.
- By using a biLSTM, the word vectors simultaneously capture information from both the preceding and the following context.
- In the biLSTM, the word vectors obtained from different layers emphasize different information: the CNN-BIG-LSTM word vectors used at the input layer encode part-of-speech information well, the first LSTM layer encodes syntactic information well, and the second LSTM layer encodes word-sense (semantic) information well. The final word vector is obtained by fusing the multi-layer word vectors, so it can take different levels of information into account.
5. References
- Deep contextualized word representations
- Zhihu: ELMo principle analysis and simple use