Seq2Seq is a variant of the recurrent neural network that consists of an Encoder and a Decoder. It is an important model in natural language processing and can be used for machine translation, dialogue systems, and automatic summarization.
1. Structure and use of RNN
The RNN model was introduced in the previous article “Recurrent Neural Network RNN, LSTM, GRU”. The basic RNN model is shown in the figure above. Each neuron receives two inputs: the hidden state H of the previous neuron (which carries memory) and the current input X (which carries the current information). After receiving these inputs, the neuron computes a new hidden state H and an output Y, and the hidden state is passed on to the next neuron. Because of the hidden state H, an RNN has a certain memory capability.
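As a rough illustration of this update, here is a minimal sketch of a single RNN step in Python (PyTorch); the tanh activation, the weight names, and the dimensions are common conventions assumed for the example, not taken from this article.

```python
import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One RNN step: previous hidden state H plus current input X -> new H and output Y."""
    h_t = torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)   # new hidden state (memory)
    y_t = h_t @ W_hy + b_y                               # output of this neuron
    return h_t, y_t

# Example with input_dim=4, hidden_dim=8, output_dim=5
x_t, h_prev = torch.randn(1, 4), torch.zeros(1, 8)
W_xh, W_hh, W_hy = torch.randn(4, 8), torch.randn(8, 8), torch.randn(8, 5)
b_h, b_y = torch.zeros(8), torch.zeros(5)
h_t, y_t = rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y)
```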
For different tasks, the RNN structure is usually adjusted slightly. According to the number of inputs and outputs, there are three common structures: N vs N, 1 vs N, and N vs 1.
1.1 N vs N
The figure above shows the N vs N structure of the RNN model, with N inputs x1, x2, …, xN and N outputs y1, y2, …, yN. In the N vs N structure the input and output sequences have equal length, so it is usually suitable for tasks such as the following (a minimal sketch follows the list):
- Part-of-speech tagging
- Training language models, where the previous words are used to predict the next word
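A minimal sketch of the N vs N case using PyTorch's built-in `nn.RNN`, where every time step produces an output (for example, a part-of-speech tag); the vocabulary and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# N vs N: N inputs -> N outputs, one output per time step (e.g. POS tagging).
vocab_size, emb_dim, hid_dim, num_tags = 1000, 32, 64, 10
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
out = nn.Linear(hid_dim, num_tags)

x = torch.randint(0, vocab_size, (1, 6))   # a sequence of N=6 token ids
h_all, _ = rnn(embed(x))                   # hidden state at every time step
y = out(h_all)                             # shape (1, 6, num_tags): one output per input
```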
1.2 1 vs N
In the 1 vs N structure, there is only one input x and N outputs y1, y2, …, yN. There are two ways to use 1 vs N: the first passes the input x only to the first RNN neuron, while the second passes the input x to every RNN neuron. The 1 vs N structure is suitable for tasks such as the following (a minimal sketch follows the list):
- Image captioning: the input x is an image and the output is a textual description of the image.
- Generating music of a given music category.
- Generating a novel of a given novel category.
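A minimal sketch of the second way of using 1 vs N (the single input x fed to every neuron), assuming x is, say, an image feature vector; the dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

# 1 vs N (second way): the single input x is passed to every RNN neuron.
feat_dim, hid_dim, vocab_size, N = 128, 64, 1000, 5
rnn = nn.RNN(feat_dim, hid_dim, batch_first=True)
out = nn.Linear(hid_dim, vocab_size)

x = torch.randn(1, feat_dim)                 # e.g. an image feature vector
x_repeated = x.unsqueeze(1).repeat(1, N, 1)  # the same x at all N steps
h_all, _ = rnn(x_repeated)                   # N hidden states
y = out(h_all)                               # N outputs, e.g. N word distributions
```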
1.3 N vs 1
In the N vs 1 structure, there are N inputs x1, x2, …, xN and one output y. The N vs 1 structure is suitable for tasks such as the following (a minimal sketch follows the list):
- Sequence classification: classifying a segment of speech or a piece of text, or sentence-level sentiment analysis.
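A minimal sketch of the N vs 1 case, using only the hidden state of the last neuron to classify the whole sequence; the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# N vs 1: N inputs -> a single output y (e.g. sentence sentiment classification).
vocab_size, emb_dim, hid_dim, num_classes = 1000, 32, 64, 2
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
classifier = nn.Linear(hid_dim, num_classes)

x = torch.randint(0, vocab_size, (1, 8))   # N=8 input tokens
_, h_last = rnn(embed(x))                  # hidden state of the last neuron
y = classifier(h_last.squeeze(0))          # a single output for the whole sequence
```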
2. Seq2Seq model
2.1 Seq2Seq structure
The three structures above all place restrictions on the number of inputs and outputs of the RNN, but in practice the sequence lengths of many tasks are not fixed. For example, in machine translation the source and target sentences have different lengths, and in a dialogue system questions and answers differ in length.
Seq2Seq is an important RNN model, also known as the Encoder-Decoder model, and can be understood as an N vs M model. It consists of two parts: the Encoder, which encodes the sequence information, compressing a sequence of any length into a vector C; and the Decoder, which takes the context vector C, decodes the information it contains, and outputs a sequence. There are many kinds of Seq2Seq structures; here are some common ones:
- The first structure
- The second structure
- The third structure
2.2 Encoder
The main difference between the three Seq2Seq structures lies in the Decoder; their Encoders are the same. The following figure shows the Encoder part. The Encoder's RNN accepts the inputs x and finally outputs a context vector C that encodes all of the information; the intermediate neurons have no output. The Decoder then takes the context vector C and decodes it into the desired output.
As can be seen from the figure above, the Encoder is not much different from an ordinary RNN, except that the intermediate neurons have no output. The context vector C can be calculated in several ways.
C can be taken directly as the hidden state hN of the last neuron; it can be obtained by applying some transformation q to hN; or it can be computed from the hidden states of all the neurons h1, h2, …, hN. Once the context vector C is obtained, it is passed to the Decoder.
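Written as formulas (reconstructed from the description above, with q denoting some transformation):

$$
\begin{aligned}
C &= h_N \\
C &= q(h_N) \\
C &= q(h_1, h_2, \dots, h_N)
\end{aligned}
$$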
2.3 Decoder
There are many different Decoder structures; here are three common ones.
The first structure
The first Decoder structure is relatively simple: the context vector C is used as the initial hidden state of the RNN. Each subsequent neuron only receives the hidden state h' of the previous neuron and accepts no other input x. The hidden state and output of the first Decoder structure are computed as follows.
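Consistent with this description, the computation can be written as follows (a plausible reconstruction rather than the article's exact formulas; σ is the activation function and U, V, b, c are Decoder parameters):

$$
\begin{aligned}
h'_t &= \sigma(U h'_{t-1} + b), \qquad h'_0 = C \\
y'_t &= \sigma(V h'_t + c)
\end{aligned}
$$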
The second structure
The second Decoder structure has its own initial hidden state h'0. The context vector C is no longer used as the initial hidden state of the RNN; instead, it is used as the input of every RNN neuron, so every Decoder neuron receives the same input C. The hidden state and output of this Decoder are computed as follows.
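A plausible form consistent with this description, with C entering as the input of every neuron (again a reconstruction, not the article's exact formulas):

$$
\begin{aligned}
h'_t &= \sigma(U h'_{t-1} + W C + b) \\
y'_t &= \sigma(V h'_t + c)
\end{aligned}
$$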
The third structure
The third Decoder structure is similar to the second, but the input also includes the output y' of the previous neuron. That is, the input of each neuron consists of: the hidden state h' of the previous neuron, the output y' of the previous neuron, and the context vector C produced by the Encoder. The input y'0 of the first neuron is usually the embedding vector of the sentence start token. The hidden state and output of the third Decoder structure are computed as follows.
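A plausible form consistent with this description, with the previous output y'_{t-1} added to the input (a reconstruction, not the article's exact formulas):

$$
\begin{aligned}
h'_t &= \sigma(U h'_{t-1} + W C + V y'_{t-1} + b) \\
y'_t &= \sigma(W_o h'_t + c)
\end{aligned}
$$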
3. Techniques for using the Seq2Seq model
3.1 Teacher Forcing
Teacher Forcing is used during the training stage and mainly applies to the third Decoder structure above. In that structure, each neuron takes the output y' of the previous neuron as part of its input, so if the previous neuron's output is wrong, the next neuron's output is also likely to be wrong, and the error propagates onward.
Teacher Forcing alleviates this problem to some extent. When training the Seq2Seq model, each Decoder neuron does not necessarily use the output of the previous neuron as its input; with a certain probability, it uses the corresponding token of the correct target sequence instead.
For example, consider a translation task in which a given English sentence such as “I have a cat” is translated into the corresponding Chinese sentence. Without Teacher Forcing, each Decoder neuron feeds its own output, which may be wrong, to the next neuron.
With Teacher Forcing, the neuron instead uses the correct target token as the input of the current neuron.
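Below is a minimal PyTorch sketch of one training step that uses Teacher Forcing with a Decoder of the third type. The GRU cells, the layer sizes, and the `teacher_forcing_ratio` parameter are assumptions made for this example, not details taken from the article.

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        _, h_n = self.rnn(self.embed(src))         # h_n: (1, batch, hid_dim)
        return h_n                                 # last hidden state used as C

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # input = embedding of the previous output token + context vector C
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden, context):
        emb = self.embed(prev_token).unsqueeze(1)        # (batch, 1, emb_dim)
        ctx = context.permute(1, 0, 2)                   # (batch, 1, hid_dim)
        output, hidden = self.rnn(torch.cat([emb, ctx], dim=2), hidden)
        return self.out(output.squeeze(1)), hidden       # logits, new hidden state

def train_step(encoder, decoder, src, trg, criterion, teacher_forcing_ratio=0.5):
    """trg: (batch, trg_len), where trg[:, 0] is the <sos> start token."""
    context = encoder(src)                         # context vector C
    hidden = context                               # also used as the initial hidden state
    prev_token = trg[:, 0]                         # start with <sos>
    loss = 0.0
    for t in range(1, trg.size(1)):
        logits, hidden = decoder(prev_token, hidden, context)
        loss += criterion(logits, trg[:, t])
        # Teacher Forcing: with some probability, feed the ground-truth token
        # instead of the model's own (possibly wrong) prediction.
        use_teacher = random.random() < teacher_forcing_ratio
        prev_token = trg[:, t] if use_teacher else logits.argmax(dim=1)
    return loss / (trg.size(1) - 1)
```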
3.2 Attention
In the Seq2Seq model, the Encoder always encodes all of the information of the source sentence into a fixed-length context vector C, and this vector C remains unchanged throughout the Decoder's decoding. This has several drawbacks:
- For long sentences, it is difficult to express their full meaning with a fixed-length vector C.
- RNNs suffer from vanishing gradients over long sequences, so a vector C obtained from only the last neuron is not ideal.
- It differs from human attention: when reading an article, a person focuses on the sentence currently being read.
Attention is a mechanism that focuses the model's attention on the word currently being translated. For example, when translating “I have a cat” into another language, the model should attend to the source word “I” while generating the translation of “I”, and to the source word “cat” while generating the translation of “cat”.
With Attention, the Decoder's input at each step is no longer a fixed context vector C; instead, a context vector ct is computed based on what has been translated so far.
Attention requires keeping the hidden state h of every Encoder neuron. The t-th Decoder neuron then uses the hidden state h't-1 of the previous neuron to compute a correlation vector et between the current state and every Encoder neuron. et is an N-dimensional vector (N is the number of Encoder neurons); the larger the i-th component of et, the stronger the correlation between the current state and the i-th Encoder neuron. There are many ways to compute et, that is, many possible choices for the correlation function a.
After the correlation vector et is obtained, it is normalized with softmax. The normalized coefficients are then used to combine the Encoder hidden states into the context vector ct of the current Decoder neuron.
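Written as formulas (reconstructed from the description; the dot product is shown as one possible choice of the correlation function a):

$$
\begin{aligned}
e_{t,i} &= a(h'_{t-1}, h_i), \qquad \text{e.g. } a(h'_{t-1}, h_i) = (h'_{t-1})^{\top} h_i \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{N}\exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{N} \alpha_{t,i}\, h_i
\end{aligned}
$$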
3.3 Beam Search
Beam Search is not used during training but during testing (inference). At each step, instead of keeping only the single most probable output, we keep the top k outputs with the highest probabilities and pass them to the next step. For each of these k outputs, the next neuron computes the probabilities of all L words (L is the vocabulary size); from the resulting k × L candidates, the top k are kept, and the process is repeated.
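Below is a minimal Python sketch of the procedure. The `step` function (which maps a previous token and a decoder state to per-word log-probabilities and a new state), the token ids, and the `beam_width`/`max_len` parameters are hypothetical stand-ins for a real Decoder.

```python
import heapq

def beam_search(step, init_state, sos_id, eos_id, beam_width=3, max_len=20):
    # Each beam entry: (cumulative log-probability, token sequence, decoder state).
    beams = [(0.0, [sos_id], init_state)]
    for _ in range(max_len):
        candidates = []
        for score, seq, state in beams:
            if seq[-1] == eos_id:                  # finished sequences are kept as they are
                candidates.append((score, seq, state))
                continue
            log_probs, new_state = step(seq[-1], state)
            for word_id, lp in enumerate(log_probs):     # L candidates per beam
                candidates.append((score + lp, seq + [word_id], new_state))
        # Out of the k * L candidates, keep only the top k by cumulative log-probability.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos_id for _, seq, _ in beams):
            break
    return max(beams, key=lambda c: c[0])[1]       # best-scoring token sequence
```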
4. Seq2Seq summary
The Seq2Seq model allows the input and output sequences to have different lengths, so it can be applied to a wide range of scenarios such as machine translation, dialogue systems, and reading comprehension.
The Seq2Seq model can be improved with techniques such as Teacher Forcing, Attention, and Beam Search.
5. References
- Different structures of RNN neural network model
- Seq2Seq family buckets in Tensorflow
- Seq2Seq “Attention