
Sequence-to-sequence model

Sequence-to-sequence, abbreviated as Seq2Seq, is a sequence transduction model widely used in machine translation, speech recognition, and other fields, and it generally uses an encoder-decoder structure. An encoder neural network processes a sentence symbol by symbol and compresses it into a vector representation. A decoder neural network then outputs the prediction symbol by symbol based on the encoder state, taking the previously predicted symbol as the input at each step, as shown below 👇

The input of the encoder is the source sequence, and its hidden state is the semantic encoding. The decoder is a generative language model whose hidden state is initialized with the encoder’s hidden state.

The start symbol (e.g. <SOS>) is fed in as the first input and, combined with the hidden state, predicts the first word. During training, the next input is the corresponding word of the target sequence, combined with the hidden state of the previous step, to predict the next word, and so on until <EOS> (the end symbol). Finally, the loss is computed by comparing the predictions with the target sequence.

Traditional machine translation models are basically based on the Seq2Seq model, because the translation output is a sequence of words rather than a single word, and the length of the output sequence may differ from the length of the source sequence. The model is divided into an encoder layer and a decoder layer, both composed of RNNs or RNN variants. The encoder converts the input sequence into a semantic encoding, and the decoder converts the semantic encoding into the output sequence.

The translation process is as follows:

In the encode stage, the first node takes the first word as input, and each following node takes the next word together with the previous node’s hidden state as input. Finally, the encoder outputs a context, which in turn serves as the input of the decoder. Each decoder node outputs a translated word and passes its hidden state on as input to the next step. This model works well for short text translation, but it also has a drawback: if the text is a little longer, some of its information is easily lost. To solve this problem, attention came into being.
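A minimal sketch of this encode/decode loop with PyTorch GRUs (the class names, sizes, and interfaces here are illustrative assumptions, not taken from any particular library):

```python
import torch.nn as nn

# Minimal GRU-based encoder/decoder (illustrative, not a full training setup).
class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) of word ids
        _, context = self.rnn(self.embed(src))   # context: (1, batch, hidden)
        return context                           # the "semantic encoding"

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):       # prev_token: (batch, 1)
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output), hidden          # logits over the vocabulary, new hidden state
```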

Attention

In the decode phase, the attention mechanism selects the most appropriate context for the current node as its input. Attention differs from the traditional Seq2Seq model in the following two main points.

1) The encoder provides more data to the decoder: it passes the hidden states of all its nodes to the decoder, not just the hidden state of its last node.

2) The decoder does not directly take all the hidden states provided by the encoder as input, but uses a selection mechanism to pick the hidden states most relevant to the current position. The specific steps are as follows:

  • Determine which hidden state is most closely related to the current node

  • Calculate a score for each hidden state

  • Apply a softmax to these scores, which makes the scores of highly relevant hidden states larger and the scores of weakly relevant hidden states smaller

The attention model doesn’t blindly align the first word of the output with the first word of the input. The essence of the attention function can be described as mapping a query to a set of key-value pairs.

There are three main steps in calculating attention:

  1. Compute the similarity between the query and each key to obtain the weights; commonly used similarity functions include the dot product, concatenation, and a perceptron (additive attention);
  2. Use a softmax function to normalize these weights;
  3. Take the weighted sum of the corresponding values to obtain the final attention.

At present, in NLP studies, key and value are often the same, that is, key=value.
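The three steps above, together with the key = value convention, can be written directly as a small attention function. A minimal sketch, assuming the dot product as the similarity function:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values):
    # 1. Similarity between the query and each key (here: dot product).
    #    query: (batch, d), keys/values: (batch, seq_len, d)
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)    # (batch, seq_len)
    # 2. Normalize the scores into weights with softmax.
    weights = F.softmax(scores, dim=-1)
    # 3. Weighted sum of the values gives the final attention output.
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1)    # (batch, d)
```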

Transformer model

Transformer overall structure

Like the attention-based Seq2Seq model, the Transformer model uses an encoder-decoder architecture, but its structure is more complex: the encoder side is a stack of six encoder blocks and the decoder side is a stack of six decoder blocks. 👇

The simplified internal structure of each encoder and decoder is shown below 👇

For Encoder, there are two layers:

  • Self-attention layer: helps the current node focus not only on the current word, but also on contextual semantics
  • Feedforward neural network

As shown in the figure:

The decoder part is basically the same as the encoder part, except that a masked multi-head attention layer is added at the bottom to help the current node focus on the content it needs to attend to.

Encoder process:

  1. The model first performs an embedding operation on the input data

  2. The embeddings are fed into the encoder layer, where self-attention processes the data

  3. The data is then passed to a feed-forward neural network, whose calculations can be done in parallel across positions

  4. The output of the feed-forward neural network is passed as input to the next encoder.
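A minimal sketch of one encoder block implementing steps 2 to 4 above, using PyTorch’s built-in nn.MultiheadAttention for the self-attention sub-layer (residual connections and layer normalization are discussed further below; the sizes d_model = 512, 8 heads, d_ff = 2048 are the commonly used defaults, assumed here for illustration):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One encoder layer: self-attention followed by a position-wise feed-forward network.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)     # every position attends to every position
        x = self.norm1(x + attn_out)              # residual + layer normalization
        x = self.norm2(x + self.ffn(x))           # feed-forward applied to each position in parallel
        return x
```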

Self-Attention

Self-attention is the mechanism the Transformer uses to fold its “understanding” of other related words into the word it is currently processing. Here’s an example: “The animal didn’t cross the street because it was too tired.” It is easy for us to judge whether “it” refers to the animal or the street, but it is difficult for a machine. Self-attention lets the machine associate “it” with “animal”.

Multi-Headed Attention

Instead of initializing just one set of Query, Key, and Value matrices, we initialize multiple sets. The Transformer uses eight sets, so the result is eight output matrices.

The whole process of multi-head attention can be summarized as follows:

  1. Query, Key, and Value each undergo a linear transformation
  2. Scaled dot-product attention over Q, K, and V is computed h times, once per head, with a different projection matrix W for each linear transformation
  3. The h scaled dot-product attention results are concatenated
  4. A final linear transformation yields the result of multi-head attention

As you can see, the difference in the multi-head attention proposed by Google is that the computation is carried out h times rather than just once, which allows the model to learn relevant information in different representation subspaces.
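A sketch of the four steps above, assuming the usual d_model = 512 split into h = 8 heads of 64 dimensions each and the standard 1/√d_k scaling:

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        # 1. One linear transformation each for Q, K, V (a different W per projection).
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):                   # each: (batch, seq_len, d_model)
        b = q.size(0)
        def split(x, w):                          # project, then split into h heads
            return w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        # 2. Scaled dot-product attention, computed for all h heads at once.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = F.softmax(scores, dim=-1) @ v       # (batch, h, seq_len, d_k)
        # 3. Concatenate the h heads back together.
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        # 4. Final linear transformation gives the multi-head attention output.
        return self.w_o(out)
```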

So how does attention work in the model?

The diagram below:

First, multi-head attention connects the encoder to the decoder: Key and Value come from the output of the encoder stack (so K = V here), while Query comes from the input of the multi-head attention layer in the decoder. In effect, just like attention in mainstream machine translation models, this attention is used for translation alignment between the decoder and the encoder.

Self-attention is the case K = V = Q. For example, when a sentence is input, every word in the sentence computes attention with all the other words in the sentence. The purpose is to learn the dependencies between words inside the sentence and capture the internal structure of the sentence.

Positional Encoding

So far, the model described has no way to account for the order of words in the input sequence. To solve this problem, the Transformer adds an extra vector, the positional encoding, to the inputs of the encoder and decoder layers. Its dimension is the same as that of the embedding, and it is generated in a particular way that lets the model learn positional information. This vector encodes the position of the current word and captures information about word order (simply put, the distance between different words in a sentence). The Transformer’s positional encoding does not need to be trained; it is generated by a fixed rule, and there are many specific ways to compute it.
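One common generation rule is the sinusoidal encoding from the original Transformer paper, which computes the vector from the position and the dimension index with no trainable parameters. A sketch:

```python
import math
import torch

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()              # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))             # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe        # added element-wise to the word embeddings
```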

Layer normalization

In the Transformer, each sub-layer (self-attention, FFNN) is followed by a residual connection and a layer normalization step.

There are many types of normalization, but they all share a common purpose: converting the input into data with a mean of zero and a variance of one. We normalize the data before feeding it into the activation function because we don’t want the input to fall into the saturated region of the activation function.
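A minimal sketch of the normalization itself, computed over the feature dimension and omitting the learnable scale and shift parameters that a full layer normalization implementation also carries:

```python
def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    # over the last (feature) dimension, as described above.
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)
    return (x - mean) / (std + eps)
```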

Mask

A mask hides certain values so that they have no effect when the parameters are updated. The Transformer model involves two types of mask: the padding mask and the sequence mask.

The padding mask is used in every scaled dot-product attention layer, while the sequence mask is only used in the self-attention of the decoder.

Padding Mask

Since the lengths of the input sequences differ within a batch, we align them: short sequences are padded with zeros, and if a sequence is too long, only the leftmost part is kept and the excess is discarded. Because these padding positions are meaningless, the attention mechanism should not focus on them, so some processing is needed.

To do this, a very large negative number (negative infinity) is added to the values at these positions, so that after the softmax their probabilities approach 0. The padding mask itself is a tensor in which each value is a Boolean, and the positions with the value false are the ones to be masked out.
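A sketch of this, assuming 0 is the padding id (an assumption, not fixed by the text) and a Boolean mask in which false marks the padded positions:

```python
def padding_mask(seq, pad_id=0):
    # True at real tokens, False at padded positions (pad_id=0 is an assumption).
    return seq != pad_id                                          # (batch, seq_len)

def apply_padding_mask(scores, mask):
    # Put "negative infinity" at the masked positions so that softmax
    # pushes their attention weights towards 0.
    # scores: (batch, q_len, k_len), mask: (batch, k_len)
    return scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
```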

Sequence Mask

The sequence mask is designed to prevent the decoder from seeing future information. In other words, for a sequence, when time_step is t, the decoding output should only depend on the outputs before t, not on the outputs after t. So we need a way to hide the information after t.

How do you do that? It’s also very simple: generate an upper-triangular mask matrix in which the values above the diagonal (the future positions) are all 0, and apply this matrix to each sequence.
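A sketch using torch.triu, where the positions above the diagonal (the future) end up as false/0 in the returned mask:

```python
import torch

def sequence_mask(size):
    # Strictly upper-triangular part marks the future positions.
    future = torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)
    return ~future   # True (1) = allowed, False (0) = future position, masked out

# Combining with the padding mask from above (for Boolean masks, a logical AND
# plays the role of "adding the two masks together"):
# attn_mask = padding_mask(tgt_seq).unsqueeze(1) & sequence_mask(tgt_seq.size(1))
```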

  • For the decoder’s self-attention, the scaled dot-product attention needs both the padding mask and the sequence mask as attn_mask; in the implementation, the two masks are combined to form attn_mask.
  • In all other cases, attn_mask is simply the padding mask.

Don’t panic, let the picture do the talking 😎

First, the encoder processes the input sequence, and the output of the top encoder is then turned into a set of attention vectors K and V. Each decoder uses these attention vectors in its “encoder-decoder attention” layer, which helps the decoder focus on the appropriate place in the input sequence.

After the encoding phase is complete, the decoding phase begins. Each step of the decoding phase outputs one element of the output sequence. This process repeats until a symbol is produced indicating that the decoder has finished its output. The output of each step is fed into the bottom decoder at the next time step, and just as with the encoder inputs, we embed these decoder inputs and add positional encodings to represent the position of each word.
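A sketch of this autoregressive loop, assuming a hypothetical model object with encode and decode methods and illustrative sos_id / eos_id token ids (the interface is an assumption for illustration, not a specific library API):

```python
import torch

def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    # Encode the source once; its output supplies K and V for encoder-decoder attention.
    memory = model.encode(src)
    ys = torch.tensor([[sos_id]])                  # decoder input starts with the start symbol
    for _ in range(max_len):
        logits = model.decode(ys, memory)          # (1, current_len, vocab_size)
        next_word = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_word], dim=1)     # feed the prediction back in at the next step
        if next_word.item() == eos_id:             # stop once the end symbol is produced
            break
    return ys
```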

Output layer

When the decoder stack has finished executing, how do we map the resulting vector to the words we need?

It’s as simple as adding a fully connected layer and a softmax layer at the end. If our dictionary contains 10,000 words, the softmax finally outputs the probabilities of those 10,000 words, and the word with the highest probability is our final result.
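A sketch of that final step, with an illustrative dictionary size of 10,000 words:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000              # 10,000-word dictionary, as in the example
generator = nn.Linear(d_model, vocab_size)    # the fully connected layer

decoder_output = torch.randn(1, d_model)      # output vector of the decoder stack (illustrative)
probs = F.softmax(generator(decoder_output), dim=-1)   # probabilities over the 10,000 words
next_word = probs.argmax(dim=-1)              # index of the word with the highest probability
```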

Feedback and guidance from more experienced readers are welcome ~
