Author: Ronghuaiyang

takeaway

To what extent are contextualized word representations actually contextualized? Here is a quantitative analysis.

Incorporating context into word embeddings, as BERT, ELMo, and GPT-2 do, has proved to be a watershed idea for NLP. Replacing static word vectors (such as word2vec) with contextualized word representations has led to significant improvements on virtually every NLP task.

But just how contextual are these contextualized representations?

Consider the word “mouse.” It has multiple senses: one refers to the rodent, another to the computer device. Does BERT effectively create one representation of “mouse” per word sense? Or does it create infinitely many representations of “mouse,” each tailored to a particular context?

In our EMNLP 2019 paper, “How Contextual are Contextualized Word Representations?”, we tackled these questions and arrived at some surprising conclusions:

  1. In all layers of BERT, ELMo, and GPT-2, the word representations occupy a narrow cone in the embedding space rather than being distributed throughout it.
  2. In all three models, upper layers produce more context-specific representations than lower layers; however, how the models contextualize words differs greatly.
  3. If a word’s contextualized representations were not contextual at all, we would expect a static embedding to explain 100% of their variance. Instead, we found that, on average, a static embedding explains less than 5% of the variance.
  4. We can create a new type of static embedding for each word by taking the first principal component of its contextualized representations in a lower layer of BERT. Static embeddings created this way outperform GloVe and FastText on benchmarks such as solving word analogies.

Going back to our example, this means that BERT creates representations of “mouse” that are highly context-specific, rather than one representation per word sense. In fact, any static embedding of “mouse” would explain little of the variance in its contextualized representations. However, if we choose the vector that does maximize the explainable variance, we get a static embedding that is far better than the one provided by GloVe or FastText.

Measuring contextuality

What does contextuality look like? Consider these two sentences:

A panda dog runs.

A dog is trying to get bacon off its back.

If the representations of “dog” in these two sentences were identical (==), there would be no contextualization; that is what we get with word2vec. If they differ (!=), there is some contextualization. The difficulty is quantifying the extent to which this happens. Since there are no established measures of contextuality, we propose three new measures (sketched in code after the list below):

  1. Self-similarity (SelfSim): the average cosine similarity of a word with itself across all the contexts in which it appears, where the representations come from the same layer of a given model. For example, SelfSim(‘dog’) is computed by averaging the cosine similarity between every pair of contextualized representations of ‘dog’.

  2. Intra-sentence similarity (IntraSim): the average cosine similarity between each word in a sentence and the sentence’s context vector, where the context vector is the mean of the word vectors in that sentence.

    It helps us discern whether contextualization is naive (simply making each word more similar to its neighbors) or more nuanced, recognizing that words in the same context can influence each other while still having distinct meanings.

  3. Maximum explainable variance (MEV): the proportion of variance in a word’s contextualized representations that can be explained by their first principal component. For example, MEV(‘dog’) is the proportion of variance, across all instances of ‘dog’ in the data, explained by the first principal component of those representations. MEV(‘dog’) = 1 would mean there is no contextualization: a single static embedding could replace all the contextualized representations. Conversely, if MEV(‘dog’) is close to 0, a static embedding explains almost none of the variance.

Note that these measures are calculated for a given layer of a given model, because each layer has its own representation space. For example, the word ‘dog’ has different self-similarity values on the first and second BERT layers.
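As a concrete illustration, here is a minimal numpy sketch of the three measures for one word in one layer, assuming we have already extracted that layer’s vectors for every occurrence of the word (and, for IntraSim, for every word in a sentence). The function and variable names are mine, not from the paper:

```python
import numpy as np


def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def self_similarity(word_vecs):
    """SelfSim: average cosine similarity between a word's representations
    across contexts. word_vecs has shape (n, d), one row per occurrence."""
    n = len(word_vecs)
    sims = [cos(word_vecs[i], word_vecs[j])
            for i in range(n) for j in range(n) if i != j]
    return float(np.mean(sims))


def intra_sentence_similarity(sentence_vecs):
    """IntraSim: average cosine similarity between each word vector in a
    sentence and the context vector (the mean of the word vectors)."""
    context_vec = sentence_vecs.mean(axis=0)
    return float(np.mean([cos(v, context_vec) for v in sentence_vecs]))


def max_explainable_variance(word_vecs):
    """MEV: proportion of variance in a word's representations explained by
    their first principal component (singular-value ratio; mean-centering
    the rows first is a reasonable variant)."""
    s = np.linalg.svd(word_vecs, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))
```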

Anisotropy adjustment

When discussing contextuality, it is important to consider the isotropy of the embeddings (that is, whether they are uniformly distributed in all directions of the vector space).

In both of the figures below, SelfSim(‘dog’) = 0.95. The figure on the left depicts a case where ‘dog’ is barely contextualized: its representations are almost identical across all the contexts in which it appears, and the high isotropy of the representation space means that a self-similarity of 0.95 is genuinely high. The figure on the right depicts the opposite case: since the cosine similarity between any two words is greater than 0.95, a self-similarity of 0.95 for ‘dog’ is unremarkable; relative to other words, ‘dog’ would be considered highly contextualized!

To adjust for anisotropy, we calculate an anisotropy baseline for each measure and subtract it from the corresponding raw measure. Is it even necessary to adjust for anisotropy? Yes! As shown in the figure below, the upper layers of BERT and GPT-2 are highly anisotropic, suggesting that high anisotropy is an inherent feature of, or at least a consequence of, the contextualization process:
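A rough sketch of the adjustment for SelfSim and IntraSim, assuming we can sample representations of randomly chosen word occurrences from the corpus (the sampling itself is not shown, and the function name is mine):

```python
import numpy as np


def anisotropy_baseline(random_vecs, n_pairs=1000, seed=0):
    """Estimate anisotropy as the average cosine similarity between the
    representations of randomly sampled word occurrences (rows of
    random_vecs, shape (N, d)) from the same layer of the same model."""
    rng = np.random.default_rng(seed)
    sims = []
    while len(sims) < n_pairs:
        i, j = rng.integers(0, len(random_vecs), size=2)
        if i == j:
            continue
        a, b = random_vecs[i], random_vecs[j]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))


# Adjusted measures subtract this baseline from the raw value, e.g.:
# adjusted_selfsim = self_similarity(word_vecs) - anisotropy_baseline(random_vecs)
# (For MEV, the analogous baseline is the variance explained by the first
# principal component of randomly sampled representations.)
```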

Context-specificity

In general, contextualized representations are more context-specific in higher layers. As shown in the figure below, self-similarity decreases almost monotonically with layer depth. This parallels how the upper layers of LSTMs trained on NLP tasks learn more task-specific representations (Liu et al., 2019). GPT-2’s representations are the most context-specific, with those in its last layer being almost maximally context-specific.
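To get a feel for this trend for a single word, one can compare its representations across two contexts at every layer. Below is a sketch using the Hugging Face transformers library; the bert-base-uncased checkpoint, the example sentences, and the assumption that “mouse” is a single WordPiece token are my choices, and this two-sentence comparison is only a crude stand-in for the paper’s corpus-wide, anisotropy-adjusted self-similarity:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["The mouse scurried under the couch.",
             "Click the right button on the mouse."]
# Assumes 'mouse' is a single token in this vocabulary.
mouse_id = tokenizer.convert_tokens_to_ids("mouse")

layer_vecs = []  # per sentence: one vector of 'mouse' per layer
for sent in sentences:
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    pos = (enc["input_ids"][0] == mouse_id).nonzero()[0, 0]
    # hidden_states[0] is the embedding layer; 1..12 are the Transformer layers.
    layer_vecs.append([h[0, pos] for h in out.hidden_states])

for layer, (a, b) in enumerate(zip(*layer_vecs)):
    sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer:2d}: cos(mouse_1, mouse_2) = {sim:.3f}")
```

If the upper layers are indeed more context-specific, the printed similarity should tend to drop as the layer index increases (before any anisotropy correction).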

** Stopwords such as “the” have among the lowest self-similarity (i.e., the most context-specific representations). ** The appearance of a word in a variety of contexts, rather than its inherent polysemy, is what drives the variation in its contextualized representations. This suggests that ELMo, BERT, and GPT-2 are not simply assigning one representation per word sense; otherwise, there would not be so much variation in the representations of words with so few senses.

** Context-specificity manifests very differently in ELMo, BERT, and GPT-2. ** As shown in the figure below, in ELMo, words in the same sentence become more similar to one another in the upper layers. In BERT, words in the same sentence become more dissimilar to one another in the upper layers, though on average they are still more similar to each other than two randomly chosen words are. In contrast, in GPT-2, words in the same sentence are no more similar to each other than randomly sampled words. This suggests that contextualization in BERT and GPT-2 is more nuanced than in ELMo, as they seem to recognize that words appearing in the same context do not necessarily have similar meanings.

Static vs. contextualized

On average, less than 5 percent of the variance in a word’s contextualized representations can be explained by a static embedding. If a word’s contextualized representations were not contextual at all, we would expect their first principal component to explain 100% of the variance. Instead, on average, less than 5% of the variance can be explained. This 5% figure is a best-case scenario, in which the static embedding is the first principal component itself. There is no theoretical guarantee, for example, that a GloVe vector is similar to the static embedding that maximizes the explainable variance. This suggests that BERT, ELMo, and GPT-2 are not simply assigning one embedding per word sense; otherwise, the proportion of explainable variance would be much higher.

Principal components of BERT’s lower-layer contextualized representations outperform GloVe and FastText on many static embedding benchmarks. This takes the earlier finding to its logical conclusion: what if we created a new type of static embedding for each word by simply taking the first principal component of its contextualized representations? It turns out this works surprisingly well. If we use representations from the lower layers of BERT, these principal-component embeddings outperform GloVe and FastText on benchmark tasks covering semantic similarity, analogy solving, and concept categorization (see table below).
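A minimal sketch of this construction, assuming we have already collected a matrix of a word’s contextualized vectors from a low BERT layer over some corpus (the collection step is not shown, and the names are mine):

```python
import numpy as np


def pc_static_embedding(word_vecs):
    """Static embedding for one word: the first principal component of its
    contextualized representations (rows of word_vecs, shape (n, d)).
    Also returns the fraction of variance that component explains, i.e. the
    MEV-style quantity discussed above. Mean-centering the rows first is a
    reasonable variant of this construction."""
    _, s, vt = np.linalg.svd(word_vecs, full_matrices=False)
    first_pc = vt[0]                          # direction in R^d
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return first_pc, explained


# Hypothetical usage: layer1_vectors maps each word to its (n, d) matrix of
# layer-1 BERT vectors; the resulting table can then be evaluated like any
# static embedding (similarity, analogies, concept categorization).
# static_table = {w: pc_static_embedding(m)[0] for w, m in layer1_vectors.items()}
```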

For all three models, principal-component embeddings created from lower layers are more effective than those created from upper layers. Those derived from GPT-2 perform markedly worse than those from ELMo and BERT. Given that upper layers are much more context-specific than lower layers, and that GPT-2’s representations are the most context-specific, this suggests that the principal components of less context-specific representations are more useful for these tasks.

conclusion

In ELMo, BERT, and GPT-2, upper layers produce more context-specific representations than lower layers. However, the models contextualize words very differently: after adjusting for anisotropy, the similarity between words in the same sentence is highest in ELMo and almost non-existent in GPT-2.

On average, less than 5% of the variance in a word’s contextualized representations can be explained by a static embedding. So even in the best case, static word embeddings are a poor substitute for contextualized ones. Still, contextualized representations can be used to create a more powerful type of static embedding: the first principal components of BERT’s lower-layer contextualized representations are much better than GloVe and FastText!


Original English post: kawine.github.io/blog/nlp/20…
