Author: Anna Rogers


What happens during fine-tuning of BERT?

This blog post summarizes our EMNLP 2019 paper “Revealing the Dark Secrets of BERT” (Kovaleva, Romanov, Rogers, & Rumshisky, 2019). PDF: www.aclweb.org/anthology/D…

2019 could be called the year of NLP’s Transformer: this architecture dominated the leaderboards and inspired a lot of analytical research. Without a doubt, the most popular Transformer is BERT (Devlin, Chang, Lee, & Toutanova, 2019). In addition to its numerous applications, many studies have probed it for various kinds of linguistic knowledge, typically concluding that such knowledge is indeed present, at least to some extent (Goldberg, 2019; Hewitt & Manning, 2019; Ettinger, 2019).

Our work focuses on the complementary question: what happens during fine-tuning of BERT? In particular, how much of the linguistically interpretable self-attention, presumed to be one of its strengths, is actually used to solve downstream tasks?

To answer this question, we fine-tuned BERT on the following GLUE (Wang et al., 2018) tasks and datasets:

  • Paraphrase detection (MRPC and QQP);
  • Textual similarity (STS-B);
  • Sentiment analysis (SST-2);
  • Textual entailment (RTE);
  • Natural language inference (QNLI, MNLI).

A brief introduction to BERT

BERT stands for Bidirectional Encoder Representations from Transformers. The model is essentially a multi-layer bidirectional Transformer encoder (Devlin, Chang, Lee, & Toutanova, 2019), and there are several excellent guides to how it works, including The Illustrated Transformer. Here we focus on one specific component of the Transformer architecture: self-attention. In short, it is a way of weighing the components of the input and output sequences against each other in order to model relations between them, even long-distance dependencies.

As a simple example, suppose we need to build a representation of the sentence “Tom is a black cat”. When encoding “cat”, BERT may choose to pay more attention to “Tom” and less attention to “is”, “a”, and “black”. This can be expressed as a vector of weights, one per word in the sentence. The model computes such a vector as it encodes each word in the sequence, which yields a square matrix that we call a “self-attention map”.
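To make the notion of a self-attention map concrete, here is a minimal sketch of single-head scaled dot-product self-attention for this sentence. The embeddings and projection matrices are random placeholders, so this illustrates the mechanism rather than BERT’s actual computation.

```python
# Minimal sketch of single-head scaled dot-product self-attention,
# showing how a square "self-attention map" arises. Not BERT's real code:
# the embeddings and projections below are random placeholders.
import torch
import torch.nn.functional as F

tokens = ["Tom", "is", "a", "black", "cat"]
d_model = 8                               # toy hidden size
x = torch.randn(len(tokens), d_model)     # stand-in token embeddings

# Learned projections in a real model; random here for illustration.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# One row of weights per encoded token, one column per attended token.
scores = Q @ K.T / d_model ** 0.5
attn_map = F.softmax(scores, dim=-1)      # shape (5, 5); rows sum to 1

print(attn_map.shape)                     # the square self-attention map
context = attn_map @ V                    # weighted sum = new token representations
```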

Now, it is not obvious that the relation between “Tom” and “cat” is always the most useful one. To answer a question about the cat’s color, the model should focus on “black” rather than “Tom”. Fortunately, it does not have to choose. The power of BERT (and other Transformers) is largely attributed to the fact that there are multiple heads in multiple layers, all of which learn to build independent self-attention maps. In theory, this gives the model the ability to “attend to information from different representation subspaces at different positions” (Vaswani et al., 2017). In other words, the model is able to choose between several alternative representations for the task at hand.
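If you want to look at these per-head maps yourself, the HuggingFace transformers library can return all of them for a pre-trained BERT. The snippet below is a sketch that assumes this library is installed; exact API details may vary slightly across versions.

```python
# Sketch: extracting per-layer, per-head self-attention maps from
# pre-trained BERT with the HuggingFace transformers library.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("Tom is a black cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 tensors (one per layer),
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)   # 12 layers x 12 heads
layer3_head5 = attentions[3][0, 5]            # one self-attention map
```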

Most of the self-attention weights are learned during BERT’s pre-training: the model is (pre-)trained on two tasks (masked language modeling and next sentence prediction) and subsequently fine-tuned for individual downstream tasks such as sentiment analysis. The basic idea behind separating training into a semi-supervised pre-training stage and a supervised fine-tuning stage is that the dataset of a downstream task is usually too small to learn a whole language from, but a large text corpus can be used for that purpose (and others like it) through language modeling. We thus obtain task-independent representations of the information in sentences and texts, which can then be “adapted” to downstream tasks.
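As a rough illustration of the fine-tuning stage, the sketch below attaches a fresh classification head to pre-trained BERT and trains it with supervision. The toy examples and hyperparameters are placeholders, not the exact setup used in the paper.

```python
# Sketch of the fine-tuning stage: pre-trained BERT plus a fresh
# classification head, trained with supervision on a downstream task.
# The toy data and hyperparameters here are illustrative placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["a delightful film", "a tedious mess"]   # stand-in SST-style examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

model.train()
for epoch in range(3):                            # we fine-tune for 3 epochs
    optimizer.zero_grad()
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
```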

Let us point out here that exactly how this “adaptation” is supposed to work is not spelled out in detail in either the BERT paper or the GPT technical report, which popularized the pre-training/fine-tuning approach. However, if self-attention is meant to provide “links” between parts of the input sequence so as to increase the amount of available information, and the multi-head, multi-layer architecture provides multiple alternative attention maps, then fine-tuning should presumably teach the model to rely on the attention maps that are more useful for the task at hand. For example, in sentiment analysis, relations between nouns and adjectives matter more than relations between nouns and prepositions, so fine-tuning would ideally teach the model to rely more on the more useful self-attention maps.

What are the types of self-attention patterns learned, and how many of them are there?

So what kinds of self-attention patterns does BERT learn? We found five types, shown below; a rough heuristic sketch for distinguishing them follows the list.

Figure 1. Types of BERT’s self-attention patterns. The two axes of each image correspond to the BERT tokens of the input sample; colors indicate absolute attention weights (darker colors indicate greater weights).

  • The vertical pattern indicates attention to a single token, typically the [SEP] token (a special token marking the end of a sentence) or the [CLS] token (a special BERT token whose representation is fed to the classifier as the representation of the whole sequence);
  • The diagonal pattern indicates attention to the previous/following tokens;
  • The vertical + diagonal pattern is a mixture of the two above;
  • The block pattern indicates more or less uniform attention to all tokens in the sequence;
  • The heterogeneous pattern is the only one that could in theory correspond to any meaningful relations between parts of the input sequence (although it does not necessarily do so).
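For illustration only, here is a crude heuristic for telling some of these pattern types apart from a single attention map. This is not the classifier used in the paper, the thresholds are arbitrary, and it ignores the mixed vertical + diagonal case.

```python
# Crude heuristic sketch for distinguishing attention pattern types.
# Not the paper's method; thresholds are arbitrary illustrative choices.
import numpy as np

def classify_attention_map(attn: np.ndarray) -> str:
    """attn: square (seq_len, seq_len) matrix of attention weights, rows sum to 1."""
    n = attn.shape[0]
    col_mass = attn.sum(axis=0) / n            # average attention received per token
    diag_mass = np.mean([attn[i, max(i - 1, 0):i + 2].sum() for i in range(n)])
    if col_mass.max() > 0.5:                   # most weight lands on a single token
        return "vertical"
    if diag_mass > 0.5:                        # most weight stays near the diagonal
        return "diagonal"
    if attn.std() < 0.05:                      # roughly uniform weights
        return "block"
    return "heterogeneous"
```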

Here are the proportions of the five attention types for BERT fine-tuned on seven GLUE tasks (each column represents 100% of all heads in all layers):

Figure 2. Proportions of BERT’s self-attention map types, fine-tuned on the selected GLUE tasks.

While the exact proportions vary from task to task, in most cases the potentially meaningful patterns account for less than half of BERT’s self-attention weights. At least a third of BERT’s heads attend simply to [SEP] and [CLS], a strategy that cannot contribute much meaningful information to the representations of the next layer. It also suggests that the model is severely over-parameterized, which explains the recent successes of distillation approaches (Sanh, Debut, Chaumond, & Wolf, 2019; Jiao et al., 2019).

It is worth noting that we used BERT-base, the smaller model with 12 heads in each of its 12 layers. If even it is over-parameterized, this has implications for BERT-large and all the subsequent models, some of which are 30 times larger (Wu et al., 2016).

Such heavy reliance on [SEP] and [CLS] may also indicate either that these tokens somehow “absorb” the informative representations built in earlier layers, so that the later self-attention maps do not need them much, or that BERT as a whole does not rely on self-attention all that much.

What happens during fine-tuning?

Our next question is what happens during BERT’s fine-tuning. The heat maps below show the cosine similarity between each head’s flattened self-attention map before and after fine-tuning, for every layer. Darker colors indicate greater differences. Fine-tuning was performed for three epochs on all GLUE tasks.

Figure 3. Cosine similarity between the flattened self-attention maps of pre-trained and fine-tuned BERT. The darker the color, the greater the difference.
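A minimal sketch of the comparison behind this figure: flatten each head’s self-attention map in the pre-trained and the fine-tuned model and compute their cosine similarity. The attention tuples are assumed to come from the extraction sketch shown earlier, for the same input.

```python
# Sketch: cosine similarity between each head's flattened self-attention map
# before and after fine-tuning. `pretrained_attn` and `finetuned_attn` are the
# per-layer attention tuples returned by the two models for the same input.
import torch
import torch.nn.functional as F

def head_similarities(pretrained_attn, finetuned_attn):
    """Return a (num_layers, num_heads) matrix of cosine similarities."""
    sims = []
    for pre_layer, ft_layer in zip(pretrained_attn, finetuned_attn):
        # each layer tensor: (batch, num_heads, seq_len, seq_len); batch of 1 here
        pre = pre_layer[0].flatten(start_dim=1)   # (num_heads, seq_len * seq_len)
        ft = ft_layer[0].flatten(start_dim=1)
        sims.append(F.cosine_similarity(pre, ft, dim=1))
    return torch.stack(sims)
```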

We found that most attention weights do not change all that much, and for most tasks the last two layers change the most. These changes do not appear to favor any particular kind of meaningful attention pattern; instead, the model comes to rely even more heavily on the vertical pattern. In the case of SST, the thicker vertical attention pattern in the last layers is due to joint attention to the final [SEP] and the punctuation mark before it, which we observe to be another frequent target of the vertical attention pattern.

Figure 4. Individual examples of self-attention maps for BERT fine-tuned on SST.

There are two possible explanations for this:

  • The vertical pattern is somehow sufficient, i.e., the [SEP]/[CLS] token representations have to some extent absorbed the meaningful attention patterns from earlier layers. We did find that the earliest layers attend more to [CLS], after which [SEP] starts to dominate for most tasks (see Figure 6).
  • The tasks at hand simply do not require the fine-grained, meaningful attention patterns that are supposed to be the main feature of Transformers.

How much difference does fine-tuning make?

Given the huge difference between the datasets used in pre-training and fine-tuning, and their very different training objectives, it is interesting to ask how much difference fine-tuning actually makes. To the best of our knowledge, this question has not been asked before.

We conducted three experiments on each of the selected GLUE datasets (a sketch of the three configurations follows the list):

  • Freeze the weights of the pre-trained model, train only a task-specific classifier on top, and see how BERT performs;
  • Initialize the model randomly from a normal distribution, fine-tune it on the task dataset for 3 epochs, and see how BERT performs;
  • Take the official pre-trained BERT-base model, fine-tune it on the task dataset for 3 epochs, and see how BERT performs.
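A sketch of how these three configurations could be set up with the HuggingFace transformers library (assumed here; not our exact training code) is shown below.

```python
# Sketch of the three experimental configurations with HuggingFace transformers.
from transformers import BertConfig, BertForSequenceClassification

# 1) Pre-trained BERT with frozen weights + task-specific classifier on top.
frozen = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
for param in frozen.bert.parameters():
    param.requires_grad = False            # only the classification head is trained

# 2) Randomly initialized BERT, fine-tuned for 3 epochs on the task data.
config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
random_init = BertForSequenceClassification(config)   # no pre-trained weights loaded

# 3) Official pre-trained BERT-base, fine-tuned for 3 epochs on the task data.
pretrained = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
```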

The experimental results are as follows:

While the pre-trained + fine-tuned setting clearly produces the best results, the randomly initialized + fine-tuned BERT performs disconcertingly well on all tasks except textual similarity. For sentiment analysis in particular, randomly initialized and fine-tuned BERT reaches 80% accuracy without any pre-training at all. Given the size of large pre-trained Transformers, this raises a serious question about whether the expensive pre-training provides enough gain to make economic sense. It also raises serious questions about NLP datasets that can apparently be largely solved without the task-independent linguistic knowledge that the pre-training + fine-tuning setup is supposed to provide.

18.01.2020 update: Thanks to Sam Bowman for pointing out that the random BERT results are overall comparable to the pre-Transformer GLUE baselines, and are better interpreted as showing the degree to which these tasks can be solved without deep linguistic knowledge. The NLP community needs to work more on harder datasets that actually require such knowledge, and in the meantime we should at least switch to SuperGLUE. Note that for these tasks, the GLUE baselines and most published results use word embeddings or count-based word vectors as inputs, whereas our random BERT is completely random, so a direct comparison is not entirely fair. For SST, however, a comparison can be made with the original Recursive Neural Tensor Network (Socher et al., 2013): that 2013 model is tiny by comparison and also uses random vectors as inputs, yet it beats our randomly initialized + fine-tuned BERT by 7 points on binary classification.

Are there linguistically interpretable self-attention heads?

At this point, several studies had tried to find attention heads that encode specific types of information, but most of them focused on syntax. We instead ran an experiment focusing on frame semantic relations: we extracted 473 sentences from FrameNet 1.7 that were at most 12 tokens long and in which a core frame element was at least 2 tokens away from the target word. In the example below, it is the relation between the Experiencer and the participle that evokes the Emotion_directed frame. Arguably, such relations are crucial to understanding the situation a given sentence describes, and any mechanism claiming to provide linguistically informative self-attention maps should reflect them (among many others).

We obtained representations of these sentences with pre-trained BERT and computed the maximum attention weight between the token pairs corresponding to the annotated frame semantic relations. Figure 5 shows these scores averaged over all examples in our FrameNet dataset. We found that two heads (head 2 in layer 1 and head 6 in layer 7) attend to these frame semantic relations more than the others.
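Here is a sketch of this measurement, under the assumption that the positions of the two related tokens have already been mapped to BERT wordpiece indices and that `attentions` is the per-layer tuple from the earlier extraction sketch.

```python
# Sketch: for an annotated frame semantic relation between two tokens,
# take the maximum attention weight between them, for every head in every layer.
# In the paper these scores are then averaged over all FrameNet examples.
import torch

def frame_relation_scores(attentions, token_i, token_j):
    """Return a (num_layers, num_heads) matrix of max attention between the pair."""
    scores = []
    for layer_attn in attentions:                     # (batch, heads, seq, seq)
        a = layer_attn[0]                             # single sentence in the batch
        pair_max = torch.maximum(a[:, token_i, token_j], a[:, token_j, token_i])
        scores.append(pair_max)
    return torch.stack(scores)
```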

But is this information actually used in inference?

We believe it is premature to conclude that some information is actually encoded just because it can be found by probing pre-trained BERT weights. Given the size of the model, it is likely that similar “evidence” of encoding could be found for almost any other relation (indeed, Jawahar et al. found no significant difference between different decomposition schemes in that direction). The real question is whether the model actually relies on this information during inference.

To determine whether the two heads that appear to encode frame semantic relations are actually used by fine-tuned BERT, we performed an ablation study, disabling one head at a time (i.e., replacing its learned attention weights with uniform attention). Figure 6 shows heat maps for all GLUE tasks in our sample, where each cell gives overall task performance with the given head switched off. It is clear that, while the overall pattern varies between tasks, disabling a single head rarely hurts and is often even beneficial, including for the heads we identified as encoding meaningful information that should have been the most relevant to the task. Many heads can be switched off without any loss in performance, again showing that even BERT-base is severely over-parameterized.
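A sketch of the ablation loop is shown below. Note two assumptions: the `head_mask` argument in HuggingFace transformers zeroes a head’s contribution rather than replacing its attention with uniform weights as described above, and `eval_fn` is a placeholder for evaluation on a GLUE dev set, so this only approximates our setup.

```python
# Sketch of the head-ablation loop. `eval_fn` is a placeholder that evaluates
# the model on a GLUE dev set while passing `head_mask` through to the model.
# head_mask zeroes a head (closer to Michel et al., 2019) rather than making
# its attention uniform as in our paper.
import torch

def ablate_heads(model, eval_fn, num_layers=12, num_heads=12):
    results = torch.zeros(num_layers, num_heads)
    for layer in range(num_layers):
        for head in range(num_heads):
            head_mask = torch.ones(num_layers, num_heads)
            head_mask[layer, head] = 0.0          # disable this single head
            results[layer, head] = eval_fn(model, head_mask=head_mask)
    return results                                # cf. the heat maps in Figure 6
```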

Figure 6. Model performance with one head disabled at a time; the blue line marks baseline performance with no heads disabled. Darker colors correspond to higher performance scores.

A similar conclusion was independently reached for machine translation, with attention weights zeroed out rather than replaced with uniform attention (Michel, Levy, & Neubig, 2019). We further show that this observation extends beyond individual heads to entire layers: depending on the task, a whole layer can be detrimental to model performance!

Figure 7. Model performance with one layer disabled at a time.
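The same masking mechanism from the previous sketch can approximate layer-level ablation by switching off every head in one layer; the layer’s feed-forward block still runs, so this is only an approximation of removing the whole layer.

```python
# Sketch: approximate layer-level ablation by masking all heads in one layer.
import torch

def ablate_layer(model, eval_fn, layer, num_layers=12, num_heads=12):
    head_mask = torch.ones(num_layers, num_heads)
    head_mask[layer, :] = 0.0                 # switch off every head in this layer
    return eval_fn(model, head_mask=head_mask)
```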

Discussion

Our main contribution is that, while most studies of BERT have focused on probing the pre-trained model, we ask what happens during fine-tuning and how meaningful the representations obtained through the self-attention mechanism actually are. So far, we have found no evidence that linguistically meaningful self-attention maps are crucial for fine-tuned BERT’s performance. Our results contribute to the ongoing discussion about the properties of Transformer-based models in the following ways:

A) BERT is over-parameterized. In our experiments we disabled only one head at a time, and the fact that model performance was mostly unaffected suggests that many heads are functionally redundant, i.e., disabling one head does no harm because the same information is available elsewhere. This result points to over-parameterization and helps explain the success of smaller BERTs such as ALBERT and TinyBERT.

This over-parameterization means that BERT may well have some highly important heads with linguistically meaningful self-attention patterns, but to prove that we would have to try disabling all possible combinations of heads, which is not feasible. A concurrent study suggests a promising alternative: Voita, Talbot, Moiseev, Sennrich, and Titov (2019) fine-tune the model with a regularization objective that has a pruning effect in order to identify the “important” heads of the base Transformer.

B) BERT does not need to be all that smart for these tasks. The fact that BERT can perform quite well on most GLUE tasks without pre-training suggests that, to a large degree, they can be solved without much linguistic knowledge. Instead of verbal reasoning, the model may learn to rely on various shortcuts, biases, and artifacts in the datasets to arrive at correct predictions. In that case, its self-attention maps do not necessarily have to be meaningful to us. This finding supports the recent wave of results on problems with many current datasets (Gururangan et al., 2018; McCoy, Pavlick, & Linzen, 2019).

An alternative explanation is that BERT’s success is due to something other than self-attention. For example, the heavy attention to punctuation after fine-tuning could mean that the model has actually learned to rely on some other component, or that it exploits deep patterns we do not yet understand. Moreover, the degree to which attention weights can be used to explain model predictions is itself currently under debate (Jain & Wallace, 2019; Serrano & Smith, 2019; Wiegreffe & Pinter, 2019).


Original English text: text-machine-lab.github.io/blog/2020/b…
