Transformer is an advanced NLP model proposed by Google that achieves better results than traditional RNNs on many tasks. The Transformer uses the self-attention mechanism to build direct connections between words, so it can encode information and learn features better than an RNN. However, the Transformer still has limits on how long a dependency it can learn, and Transformer-XL is a language model that improves the Transformer's ability to learn long-distance dependencies.

1. Vanilla Transformer

Vanilla Transformer is another method of training language models based on the Transformer; readers unfamiliar with the Transformer can refer to the previous article "Transformer Model In Detail". Vanilla Transformer was proposed in the paper "Character-Level Language Modeling with Deeper Self-Attention". Its training and evaluation processes are described below to understand its defects.

1.1 Vanilla Transformer training process

The figure above shows the training process of the Vanilla Transformer language model. It can be seen that Vanilla Transformer predicts the next token based only on the preceding tokens. For example, after seeing [x1, x2, x3], it predicts the output corresponding to x3; all information after x3 is masked out.

However, the language model trained by Vanilla Transformer has a defect. During training, the corpus is divided into segments, for example [x1, x2, x3, x4] and [x5, x6, x7, x8]. When training on the second segment, the model knows nothing about the first segment. For example, take two connected sentences, "I have a cat" and "he doesn't like fish". If Vanilla Transformer trains these two sentences as two separate segments, the model cannot tell what "he" in the second segment refers to, as the sketch below illustrates.
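
A minimal sketch of this problem, with a hypothetical toy corpus and an assumed segment length of 4, shows how fixed-length segmentation cuts straight through the connected sentences:

```python
# Hypothetical illustration: fixed-length segmentation ignores sentence boundaries.
tokens = "I have a cat he doesn't like fish".split()  # two connected sentences

segment_len = 4  # assumed fixed segment length
segments = [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

print(segments)
# [['I', 'have', 'a', 'cat'], ['he', "doesn't", 'like', 'fish']]
# When the model trains on the second segment it never sees the first one,
# so it has no way to resolve what "he" refers to.
```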

1.2 Vanilla Transformer evaluation process

The figure above shows the evaluation process of Vanilla Transformer. As you can see, when predicting outputs, the model moves one position to the right at a time and predicts the current word based on the new context window. For example, the first figure predicts the next output based on [x1, x2, x3, x4], and the second figure predicts the next output based on [x2, x3, x4, x5].

Vanilla Transformer uses only the current window and makes each prediction from scratch, which makes prediction slow and throws away earlier information. For example, in the second figure above, the prediction is made with [x2, x3, x4, x5] and loses the information of x1. Predicting from scratch means the Transformer repeats the full forward computation over the window: it uses x2 to predict x3, then x2 and x3 to predict x4, and finally x2, x3, x4 and x5 to predict the next output.
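
A rough sketch of this sliding-window evaluation, with a hypothetical `model(context)` function that returns the predicted next token, makes the inefficiency explicit: every step re-encodes the whole window from scratch.

```python
# Hypothetical sketch of Vanilla Transformer evaluation with a sliding window.
# `model(context)` is assumed to return the predicted next token for the context.
def evaluate(tokens, model, window=4):
    predictions = []
    for i in range(window, len(tokens)):
        context = tokens[i - window:i]   # e.g. [x2, x3, x4, x5]
        # The whole window is re-processed from scratch at every step;
        # nothing computed for the previous window (e.g. x1's states) is reused.
        predictions.append(model(context))
    return predictions
```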

1.3 Vanilla Transformer defects

Limited long-distance dependencies: Vanilla Transformer trains on data in fixed-length segments, so the dependencies a word can use are confined to its own segment, and longer-range context is never exploited.

Semantically incomplete segments: when segmenting the data, Vanilla Transformer splits sentences in the corpus into fixed-length chunks without considering sentence boundaries, so a sentence may be cut across two segments, leaving each segment semantically incomplete.

Slow evaluation: when evaluating the model, the context of every predicted word has to be recomputed from scratch, which is inefficient.

2. Transformer-XL

Transformer-XL is a language model training method proposed by Google in 2019 to solve the Transformer's long-term dependency problem; its paper is "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context". There are two main innovations in Transformer-XL:

First, Segment-Level Recurrence is proposed. A recurrence mechanism is introduced into the Transformer: the output vectors of every layer of the previous segment are cached and reused while training the current segment. In this way, information from the previous segment can be used, improving the Transformer's ability to model long-term dependencies. During training, the cached outputs of the previous segment only take part in the forward computation and receive no backpropagated gradients.

Second, Relative Positional Encodings are proposed. The Transformer adds a positional embedding to each word embedding to represent the position of each word; the positional embedding can be computed with trigonometric functions or learned. However, this method cannot be used directly in Transformer-XL, because the embedding at the same position would be identical in every segment. Therefore, Transformer-XL proposes a new positional encoding method, Relative Positional Encodings.

2.1 Segment-Level Recurrence

In this example, you can see that a new segment can be trained using the cached information from the previous segment, and multiple previous segments can be retained if CPU or GPU memory allows. The green lines in the figure indicate that the current segment uses information from the previous segment.

Therefore, when training segment τ+1, the input of layer n in Transformer-XL includes:

  • The output of layer n-1 of segment τ (the green lines in the figure), through which no gradients are backpropagated.
  • The output of layer n-1 of segment τ+1 (the gray lines in the figure), through which gradients are backpropagated.

When training segment τ+1, the output of layer n of Transformer-XL can be computed with the following formula:
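
For reference, the segment-level recurrence is written in the Transformer-XL paper roughly as follows, where SG denotes stop-gradient and [· ∘ ·] denotes concatenation along the sequence dimension:

```latex
\begin{aligned}
\tilde{\mathbf{h}}_{\tau+1}^{\,n-1} &= \left[\,\mathrm{SG}\!\left(\mathbf{h}_{\tau}^{\,n-1}\right) \circ \mathbf{h}_{\tau+1}^{\,n-1}\,\right] \\
\mathbf{q}_{\tau+1}^{\,n},\ \mathbf{k}_{\tau+1}^{\,n},\ \mathbf{v}_{\tau+1}^{\,n} &= \mathbf{h}_{\tau+1}^{\,n-1}\mathbf{W}_{q}^{\top},\ \tilde{\mathbf{h}}_{\tau+1}^{\,n-1}\mathbf{W}_{k}^{\top},\ \tilde{\mathbf{h}}_{\tau+1}^{\,n-1}\mathbf{W}_{v}^{\top} \\
\mathbf{h}_{\tau+1}^{\,n} &= \text{Transformer-Layer}\!\left(\mathbf{q}_{\tau+1}^{\,n},\ \mathbf{k}_{\tau+1}^{\,n},\ \mathbf{v}_{\tau+1}^{\,n}\right)
\end{aligned}
```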

It can be seen that when training segment τ+1, the k and v vectors are computed using the cached information of the previous segment, while the q vector of the n-th Transformer layer only uses the information of segment τ+1 itself, i.e., the output of its own layer n-1.
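
A minimal PyTorch-style sketch of this idea (hypothetical shapes and projection matrices) is shown below: the previous segment's states are detached and concatenated in front of the current segment when forming the keys and values, while the queries come only from the current segment.

```python
import torch

# Hypothetical sketch of segment-level recurrence for a single layer.
# h_prev: cached hidden states of the previous segment, shape (prev_len, d_model)
# h_curr: layer n-1 hidden states of the current segment, shape (seg_len, d_model)
# W_q, W_k, W_v: projection matrices, each of shape (d_model, d_model)
def recurrent_attention_inputs(h_prev, h_curr, W_q, W_k, W_v):
    # Stop-gradient on the cached segment: it is reused in the forward pass only.
    h_ext = torch.cat([h_prev.detach(), h_curr], dim=0)

    q = h_curr @ W_q  # queries use only the current segment
    k = h_ext @ W_k   # keys also cover the cached previous segment
    v = h_ext @ W_v   # values likewise include the cached previous segment
    return q, k, v
```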

The diagram above shows the Transformer-XL evaluation phase. In the previous section, we saw that Vanilla Transformer can only shift its window one word to the right at a time during evaluation and must recompute from scratch at each step. Transformer-XL can instead advance one whole segment at a time and use the cached states of previous segments to predict the current outputs.

The longest dependency that Transformer-XL can support is approximately O(N × L), where L is the length of a segment and N is the number of Transformer layers; for example, with N = 16 layers and a segment length of L = 512, the effective context is on the order of 16 × 512 = 8192 tokens. The maximum dependency range supported by Transformer-XL is shown as the green area above.

2.2 Relative Positional Encodings

The traditional Transformer uses positional embeddings to indicate the positions of words; the positional embedding can be computed with trigonometric functions or obtained by training. However, Transformer-XL also uses the information of the previous segment. If the traditional positional embedding were used, the model could not distinguish words of the current segment from words of the previous segment, as shown in the following formula.
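
Concretely, with absolute positional embeddings two consecutive segments s_τ and s_τ+1 (both of length L) would be encoded as in the formula below, taken from the Transformer-XL paper:

```latex
\begin{aligned}
\mathbf{h}_{\tau+1} &= f\!\left(\mathbf{h}_{\tau},\ \mathbf{E}_{s_{\tau+1}} + \mathbf{U}_{1:L}\right) \\
\mathbf{h}_{\tau}   &= f\!\left(\mathbf{h}_{\tau-1},\ \mathbf{E}_{s_{\tau}} + \mathbf{U}_{1:L}\right)
\end{aligned}
```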

E is the word embedding and U is the positional embedding. It can be seen that the positional embeddings of the current segment τ+1 are exactly the same as those of the previous segment τ, so the model cannot tell which segment a word at a given position belongs to.

To solve this problem, Transformer-XL proposes relative positional embeddings. First, let's look at how the traditional Transformer calculates the attention score:
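
Expanding the attention score between query position i and key position j with absolute positional embeddings gives four terms, usually labeled (a) to (d):

```latex
\begin{aligned}
\mathbf{A}_{i,j}^{\mathrm{abs}}
&= \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{E}_{x_j}}_{(a)}
 + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{U}_{j}}_{(b)} \\
&+ \underbrace{\mathbf{U}_{i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{E}_{x_j}}_{(c)}
 + \underbrace{\mathbf{U}_{i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{U}_{j}}_{(d)}
\end{aligned}
```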

This is the attention score between positions i and j computed by the Transformer, where U_i denotes the positional embedding at position i and U_j the positional embedding at position j. Since U_i and U_j encode fixed absolute positions, Transformer-XL removes them from the formula and makes the following changes.

Transformer-XL modifies the attention score calculation as follows, removing all absolute positional embeddings U; the resulting formula is shown after the list below.

  • In terms (b) and (d) of the formula, the absolute positional embedding U_j used for the key vector k is replaced with the relative positional embedding R_{i-j}; that is, only the relative position between the two words matters. R is computed with the trigonometric (sinusoidal) formula and is not learned.
  • In terms (c) and (d) of the formula, the query-side factor U_i^T W_q^T is replaced by two trainable vectors u and v, because the query vector is the same for every query position: the attention bias towards different words should stay the same no matter where the query position is.
  • The key weight matrix W_k is split into two matrices, W_{k,E} and W_{k,R}, which are used for word content and word position respectively.
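
Putting these three changes together, the relative-position attention score in the Transformer-XL paper becomes:

```latex
\begin{aligned}
\mathbf{A}_{i,j}^{\mathrm{rel}}
&= \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(a)}
 + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(b)} \\
&+ \underbrace{\mathbf{u}^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(c)}
 + \underbrace{\mathbf{v}^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(d)}
\end{aligned}
```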

2.3 Overall calculation formula of Transformer-XL

With the two mechanisms of Segment-Level Recurrence and Relative Positional Encodings described above, the overall computation of Transformer-XL is as follows.
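
For reference, the complete per-layer computation for segment τ at layer n, as given in the Transformer-XL paper (with m denoting the cached memory from the previous segment and h_τ^0 the word-embedding sequence), is roughly:

```latex
\begin{aligned}
\tilde{\mathbf{h}}_{\tau}^{\,n-1} &= \left[\,\mathrm{SG}\!\left(\mathbf{m}_{\tau}^{\,n-1}\right) \circ \mathbf{h}_{\tau}^{\,n-1}\,\right] \\
\mathbf{q}_{\tau}^{\,n},\ \mathbf{k}_{\tau}^{\,n},\ \mathbf{v}_{\tau}^{\,n} &= \mathbf{h}_{\tau}^{\,n-1}{\mathbf{W}_{q}^{n}}^{\top},\ \tilde{\mathbf{h}}_{\tau}^{\,n-1}{\mathbf{W}_{k,E}^{n}}^{\top},\ \tilde{\mathbf{h}}_{\tau}^{\,n-1}{\mathbf{W}_{v}^{n}}^{\top} \\
\mathbf{A}_{\tau,i,j}^{\,n} &= {\mathbf{q}_{\tau,i}^{\,n}}^{\top}\mathbf{k}_{\tau,j}^{\,n}
 + {\mathbf{q}_{\tau,i}^{\,n}}^{\top}\mathbf{W}_{k,R}^{\,n}\mathbf{R}_{i-j}
 + \mathbf{u}^{\top}\mathbf{k}_{\tau,j}^{\,n}
 + \mathbf{v}^{\top}\mathbf{W}_{k,R}^{\,n}\mathbf{R}_{i-j} \\
\mathbf{a}_{\tau}^{\,n} &= \text{Masked-Softmax}\!\left(\mathbf{A}_{\tau}^{\,n}\right)\mathbf{v}_{\tau}^{\,n} \\
\mathbf{o}_{\tau}^{\,n} &= \text{LayerNorm}\!\left(\text{Linear}\!\left(\mathbf{a}_{\tau}^{\,n}\right) + \mathbf{h}_{\tau}^{\,n-1}\right) \\
\mathbf{h}_{\tau}^{\,n} &= \text{Positionwise-Feed-Forward}\!\left(\mathbf{o}_{\tau}^{\,n}\right)
\end{aligned}
```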

3. Transformer-XL summary

Transformer-XL is an improvement on the Vanilla Transformer language model: the Segment-Level Recurrence and Relative Positional Encodings mechanisms are introduced so that the Transformer can learn long-term dependencies. The longest dependency Transformer-XL can support is approximately O(N × L). The later XLNet model proposed by Google also uses the Transformer-XL structure.

4. References

  • Character-Level Language Modeling with Deeper Self-Attention
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformer-XL code