

By Pavel Gladkov

Translated by: ronghuaiyang

Takeaway

Some good BERT-related papers from EMNLP 2019.

BERT at EMNLP 2019

The Conference on Empirical Methods in Natural Language Processing (EMNLP) was held in Hong Kong from 3 to 7 November 2019. There were a lot of interesting papers, but I want to highlight the BERT-related ones.

Revealing the Dark Secrets of BERT

Arxiv.org/abs/1908.08…

In this paper, researchers from the University of Massachusetts Lowell investigate self-attention across BERT’s layers and heads. The datasets used are a subset of the GLUE tasks: MRPC, STS-B, SST-2, QQP, RTE, QNLI, MNLI.

Experiments:

  • Relation-specific heads in BERT
  • Changes in self-attention patterns after fine-tuning
  • Attention to linguistic features
  • Token-to-token attention
  • Disabling self-attention heads

Typical types of self-attention patterns. The two axes of each image represent the BERT tokens of the input sample, and colors denote absolute attention weights (darker colors indicate greater weights). The first three types are most likely associated with pre-trained language modeling, while the last two may encode semantic and syntactic information.
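As a rough illustration (not code from the paper), attention maps like these can be extracted with the Hugging Face transformers library; the sentence and model name below are arbitrary examples:

```python
# Illustrative sketch: extract per-layer, per-head attention maps from BERT
# with the Hugging Face transformers library (sentence and model are examples).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = outputs.attentions[0][0, 0]  # layer 0, head 0: a seq_len x seq_len weight map
print(attn.shape)
```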

Interesting findings:

The BERT model is clearly over-parameterized: a limited set of attention patterns is repeated across different heads. As a result, disabling some heads does not decrease accuracy and can even improve performance.

Very interesting. That’s why distilling BERT makes sense.
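For illustration, one way to disable heads with the transformers library is to prune them; the layer/head indices below are arbitrary examples, not the ones identified in the paper:

```python
# Illustrative sketch: prune (disable) selected self-attention heads.
# The layer/head indices are arbitrary examples, not the paper's findings.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Remove heads 0 and 2 in layer 3, and head 5 in layer 7;
# the pruned model is then fine-tuned / evaluated as usual.
model.prune_heads({3: [0, 2], 7: [5]})
```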

Visualizing and Understanding the Effectiveness of BERT

Arxiv.org/abs/1908.05…

This is one more paper on understanding BERT’s performance, this time from Microsoft Research, with some cool visualizations.

Training loss surfaces when training from scratch (top) and when fine-tuning BERT (bottom) on four datasets. Compared with random initialization, pre-training leads to wider optima and eases the optimization process.

The figure above clearly shows the main ideas of the paper:

  • The training loss of fine-tuned BERT decreases monotonically along the optimization direction, which eases optimization and speeds up convergence
  • The fine-tuning process is more robust to overfitting
  • The pre-trained model reaches flatter and wider optima

So, don’t train BERT from scratch for your task; fine-tuning works better.
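A minimal fine-tuning sketch with transformers (illustrative only; the toy texts, labels, and hyperparameters are placeholders, not the paper’s setup):

```python
# Illustrative fine-tuning sketch with toy data (not the paper's setup).
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a terrible movie"]   # placeholder training data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                              # a few toy epochs
    loss = model(**batch, labels=labels).loss   # starts from pre-trained weights
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```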

Patient Knowledge Distillation for BERT Model Compression

Arxiv.org/abs/1908.09…

Microsoft also has a paper on knowledge distillation. It proposes a new method, patient knowledge distillation, to compress the large BERT model into a shallower one. The method claims to be the first to use distillation not only on the output distribution but also on the teacher’s hidden states; in addition, the student only tries to mimic the teacher’s representations of the [CLS] token. Compared with other distillation methods, BERT-PKD performs better than DistilBERT but worse than TinyBERT.
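A rough sketch of what such a “patient” distillation loss could look like in PyTorch, assuming the [CLS] hidden states have already been collected from student layers and the teacher layers they are mapped to; the temperature and weights below are illustrative, not the paper’s values:

```python
# Rough sketch of a patient-distillation-style loss in PyTorch (weights,
# temperature, and layer mapping are illustrative, not the paper's values).
import torch
import torch.nn.functional as F

def pkd_style_loss(student_logits, teacher_logits, student_cls, teacher_cls,
                   labels, T=2.0, alpha=0.5, beta=10.0):
    # student_cls / teacher_cls: lists of [CLS] hidden states, each (batch, hidden),
    # taken from the student layers and the teacher layers they are mapped to.
    ce = F.cross_entropy(student_logits, labels)                    # hard labels
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),        # soft labels
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    pt = sum(F.mse_loss(F.normalize(s, dim=-1), F.normalize(t, dim=-1))
             for s, t in zip(student_cls, teacher_cls))             # "patient" term
    return (1 - alpha) * ce + alpha * kd + beta * pt
```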

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Arxiv.org/abs/1908.10…

Code: github.com/UKPLab/sent…

The question is as follows: are BERT embeddings suitable for semantic similarity search? This paper shows that, out of the box, BERT maps sentences to a vector space that is poorly suited to common similarity measures such as cosine similarity; its performance is even worse than plain GloVe embeddings. To overcome this shortcoming, Sentence-BERT (SBERT) is proposed. SBERT fine-tunes BERT in a Siamese or triplet network architecture.

SBERT architecture with the classification objective function, e.g. for fine-tuning on the SNLI dataset. The two BERT networks have tied weights (Siamese network structure).
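For example, the authors’ sentence-transformers library can be used roughly like this to get sentence embeddings and compare them with cosine similarity (the checkpoint name is assumed to be one of the released SBERT models):

```python
# Illustrative usage of the sentence-transformers library; the checkpoint
# name is assumed to be one of the released SBERT models.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
emb = model.encode(["A man is playing a guitar.",
                    "Someone plays an instrument."])   # numpy array (2, dim)

# Cosine similarity between the two sentence embeddings
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(cos)
```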

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

Arxiv.org/abs/1904.09…

This paper explores the cross-lingual potential of multilingual BERT as a zero-shot language transfer model.

Long story short: BERT learns good multilingual representations, with strong cross-lingual zero-shot transfer performance on a variety of tasks.
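A minimal sketch of the zero-shot setup with multilingual BERT: fine-tune on English data only, then evaluate directly on another language (the texts and usage below are illustrative, not the paper’s experiments):

```python
# Illustrative zero-shot cross-lingual transfer with multilingual BERT:
# fine-tune on English only, then predict directly on another language.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# ... fine-tune `model` on English training data as usual ...

model.eval()
batch = tokenizer("La película fue excelente.", return_tensors="pt")  # Spanish input
with torch.no_grad():
    pred = model(**batch).logits.argmax(dim=-1)  # no Spanish labels ever seen
print(pred)
```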

— the END —

Original English article: towardsdatascience.com/bert-at-emn…


