Attention is being used more and more widely, especially since BERT became so popular.

What’s so special about Attention? What are its principles and essential nature? What types of Attention are there? This article will cover all aspects of Attention in detail.

What is the essence of Attention

The Attention mechanism, understood at an intuitive level, is a perfect match for its name. Its core logic is to “go from focusing on the whole to focusing on the important.”

The Attention mechanism works much like the way humans look at pictures. When we look at a picture, we don’t take in the whole image with equal clarity; we focus on its key areas. Take a look at the picture below:

We are sure to see the four characters “Jinjiang Hotel” clearly, as shown below:

But few people will notice that the Jinjiang Hotel sign also shows a phone number, or that there is a Xiyunlai Restaurant next to it, as shown below:

So, when we look at an image, it actually looks like this:

As the example above shows, our visual system is itself a kind of Attention mechanism: it focuses our limited attention on key information, saving resources and quickly extracting the most useful information.

The Attention mechanism in AI

The Attention mechanism was first applied in computer vision and then spread to NLP, and it is in NLP that it truly flourished. In 2018, BERT and GPT achieved surprisingly good results and became popular, and Transformer and Attention became the center of attention.

The position of Attention should look something like this:

In this article, we will give you a general idea of how Attention works in detail. Before we do that, let’s talk about why we use Attention.

The 3 main benefits of Attention

There are three main reasons for introducing Attention:

  1. Fewer parameters
  2. Higher speed
  3. Better results

Fewer parameters

Compared with CNN and RNN, an Attention model has lower complexity and fewer parameters, so it requires less computing power.

Higher speed

Attention solves the problem that RNNs cannot be computed in parallel. Each step of the Attention computation does not depend on the result of the previous step, so, like CNN, it can be processed in parallel.

Better results

Before Attention was introduced, there was always a worrying problem: long-distance information gets weakened, just as a person with a weak memory cannot recall the distant past.

Attention picks out the key points: even in a long text, it can capture the key information in the middle without losing what is important. In the figure below, the words in red are the ones it singles out.

The principle of Attention

The Encoder-Decoder framework and the Seq2Seq model were covered in a previous article.

The following GIF shows how Attention completes a machine translation task within the Encoder-Decoder framework.

However, Attention doesn’t have to sit inside the Encoder-Decoder framework; it can be used on its own.

Below is an illustration of Attention detached from the Encoder-Decoder framework.

Short story

The diagram above looks abstract. Here’s an example of how attention works:

There are a lot of books (values) in the library. To find them easily, we number each book with a key. When we want to learn about Marvel (the query), we can look at books about anime, movies, and even World War II (Captain America!).

To improve efficiency, we don’t read every book carefully. For Marvel, the anime- and movie-related books are read carefully (high weight), while the World War II books only need a quick scan (low weight).

We’ll have a complete picture of Marvel when we’ve seen it all.

The Attention principle breaks down into 3 steps:

Step 1: compute the similarity between the query and each key to obtain weights

Step 2: normalize the weights so they can be used directly

Step 3: take the weighted sum of the values using those weights
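The three steps above can be sketched in NumPy. The dot product used here is only one of several possible similarity functions (others are listed later in this article), and all names and shapes are illustrative:

```python
import numpy as np

def attention(query, keys, values):
    # Step 1: similarity between the query and every key (dot product).
    scores = keys @ query                      # shape: (num_keys,)
    # Step 2: normalize the scores into weights with softmax.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # weights sum to 1
    # Step 3: weighted sum of the values.
    return weights @ values                    # shape: (value_dim,)

# Usage: 3 key/value pairs, 4-dimensional keys, 2-dimensional values.
q = np.array([1.0, 0.0, 1.0, 0.0])
K = np.random.randn(3, 4)
V = np.random.randn(3, 2)
out = attention(q, K, V)
print(out.shape)  # prints (2,)
```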

From the modeling above, we can see that the idea behind Attention is simple: it can be highly summarized in the phrase “sum with weights.” To use a rough analogy, learning a new language has roughly four stages: rote learning (building a sense of the language by reading, reciting and grammar exercises) -> grasping the gist (in simple dialogue, understanding the key words of a sentence to accurately catch its core meaning) -> digesting (in complex dialogue, understanding references and context behind the language, and being able to generalize) -> mastery (immersion through a large amount of practice).

This is also the development path of Attention itself. The RNN era was the rote-learning stage; Attention models then learned to grasp the gist; the evolution to Transformer brought excellent representation-learning ability; and GPT and BERT accumulated practical experience through large-scale multi-task learning, becoming extremely powerful.

Why is Attention so good? Because it lets models stay open to all the input, grasp the essentials, and learn to integrate what they see.

— Ali Technology

For more technical details, check out the following article or video:

[Article] The Attention mechanism in deep learning

[Article] Do you really understand the ubiquitous Attention?

[Article] Exploring the Attention mechanism in NLP, with a detailed look at Transformer

[Video] Li Hongyi – Transformer

[Video] Li Hongyi – ELMO, BERT and GPT

N types of Attention

There are many different types of Attention: Soft Attention, Hard Attention, Static Attention, Dynamic Attention, Self Attention, etc. Here’s a look at the differences between these types of Attention.

Attention for NLP has been summarized in this article. The following is a direct reference:

This section classifies the form of Attention from the aspects of calculation area, information used, structural level and model.

1. Calculation area

According to the calculation area of Attention, it can be divided into the following types:

1) Soft Attention: this is the common form of Attention. It computes a weight probability over all keys, so every key gets a corresponding weight; it is a global calculation (also known as Global Attention). This is the rational approach, taking all keys into account and weighting them, but it can be somewhat more expensive to compute.

2) Hard Attention: this approach directly locates one specific key precisely and ignores all the others, which is equivalent to giving that key probability 1 and every other key probability 0. The alignment requirement is therefore very demanding and one-shot: if the alignment is wrong, the impact can be large. Also, because it is not differentiable, it usually has to be trained with reinforcement learning (or with something like Gumbel-Softmax).

3) Local Attention: a compromise between the two methods above. It computes Attention over a window: first use the Hard way to locate a position, then take a window centered on that point, and use the Soft way to compute Attention inside this small region.
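A rough sketch of this windowed idea in NumPy; the argmax-based Hard step, the window size and all names are illustrative assumptions, not a specific published model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(query, keys, values, window=2):
    scores = keys @ query
    center = int(np.argmax(scores))            # Hard step: locate one key
    lo = max(0, center - window)
    hi = min(len(keys), center + window + 1)   # window around the center
    weights = softmax(scores[lo:hi])           # Soft step inside the window
    return weights @ values[lo:hi]             # weighted sum over the window
```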

2. Information used

Suppose we want to compute Attention for a passage of original text, where “original text” means the text we are applying Attention to. The information used can then be divided into internal information and external information: internal information is the original text itself, while external information is any additional information beyond the original text.

1) General Attention: this approach uses external information and is often used for tasks that need to model the relationship between two texts. The query usually carries the additional information, and the original text is aligned according to this external query.

For example, in a reading comprehension task we need to model the association between the question and the article. Suppose the baseline is: compute one question vector q for the question, concatenate q with every article word vector, and feed the result into an LSTM. In that model, all article word vectors share the same question vector. Now suppose we want each article word to have its own question representation; that is, at every step we use the current article word vector to compute Attention over the question. Here the question is the original text, and the article word vectors are the external information.

2) Local Attention: this method uses only internal information; key, value and query all come only from the input text, and in Self Attention, key = value = query. Since there is no external information, every word in the original text can compute Attention with all the other words in the sentence, which amounts to finding the internal relationships within the original text.

Take the reading comprehension task again. As in the baseline above, when computing a vector q for the question, Attention can also be used here, using only the information of the question itself, without introducing any information from the article.
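The internal-information case, where key = value = query as in Self Attention, can be sketched as follows; this is a minimal NumPy version that leaves out the learned projection matrices real models use:

```python
import numpy as np

def self_attention(X):
    # X: (seq_len, dim) word vectors; keys, values and queries are all X.
    scores = X @ X.T / np.sqrt(X.shape[1])     # every word scores every word
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # one softmax row per word
    return weights @ X                          # (seq_len, dim)
```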

3. Structural hierarchy

In terms of structure, Attention can be divided into single-layer Attention, multi-layer Attention and multi-head Attention, according to whether there is a hierarchical relationship.

1) Single-layer Attention: the common practice, using one query to perform Attention over a text once.

2) Multi-layer Attention: generally applied to models where the texts have a hierarchical relationship. Suppose we divide a document into multiple sentences. In the first layer, we use Attention to compute a sentence vector for each sentence (i.e., single-layer Attention). In the second layer, we apply Attention over all the sentence vectors to compute a document vector (also a single-layer Attention), and then use this document vector for the task.

3) Multi-head Attention: this is the Attention referred to in “Attention is All You Need”. Multiple queries attend to the same text, with each query focusing on a different part of the text:

head_i = Attention(q_i, K, V)

Finally, the results are pieced together (concatenated and projected):

MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W^O
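A minimal sketch of the split-and-piece-together idea; the learned projection matrices of the real Transformer are omitted, and the head count and shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    seq_len, dim = Q.shape
    head_dim = dim // num_heads                 # assumes num_heads divides dim
    outputs = []
    for h in range(num_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        q, k, v = Q[:, s], K[:, s], V[:, s]     # this head's slice of each input
        weights = softmax(q @ k.T / np.sqrt(head_dim))
        outputs.append(weights @ v)             # (seq_len, head_dim)
    return np.concatenate(outputs, axis=-1)     # piece the heads back together
```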

4. Model

In terms of the models it is combined with, Attention is generally used together with CNN or LSTM, and pure Attention computation can also be carried out directly.

1) + (CNN) Attention

CNN’s convolution operation can extract important features, which is itself a form of the Attention idea. However, CNN’s receptive field is local, and multiple convolution layers have to be stacked to enlarge it. In addition, Max Pooling directly selects the feature with the largest value, which is like the idea of Hard Attention: directly picking one feature.

Attention can be added to CNN in several ways.

A. Perform Attention before the convolution operation, e.g. ABCNN-1. The task there is textual entailment, which requires processing two pieces of text.

B. Perform Attention after the convolution operation, e.g. ABCNN-2: apply Attention to the convolution-layer outputs of the two texts, and use the result as the input to the pooling layer.

C. Apply Attention at the pooling layer, replacing Max pooling. For example, in Attention pooling, an LSTM learns a good sentence vector to serve as the query, CNN learns a feature matrix to serve as the key, and the query generates weights over the key to obtain the final sentence vector.

2) LSTM + Attention

LSTM has an internal gate mechanism: the input gate selects what current information to take in, and the forget gate selects what past information to discard. This is itself a degree of Attention, and LSTM is claimed to solve the long-term dependency problem. In practice, though, LSTM still captures sequence information step by step; on long text, performance slowly decays as the number of steps grows, and it is hard to retain all the useful information.

LSTM usually needs to produce a single vector from which to do the task. The common ways are:

A. Directly use the last hidden state (some earlier information may be lost, making it hard to represent the full text).

B. Take an equal-weight average of the hidden states over all steps (treating every step equally).

C. Attention mechanism: weight the hidden states of all steps and focus on the important hidden-state information across the whole text. This performs a little better than the previous two, and it also makes it easy to visually inspect which steps matter, but one must watch out for over-fitting, and it increases the amount of computation.
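Option C can be sketched as follows. The hidden states are assumed to already come from an LSTM and are represented here simply as a NumPy array, with an illustrative query vector:

```python
import numpy as np

def attention_pool(hidden_states, query):
    # hidden_states: (num_steps, hidden_dim); query: (hidden_dim,)
    scores = hidden_states @ query
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                      # one weight per step
    return weights @ hidden_states             # (hidden_dim,) sentence vector
```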

3) pure Attention

“Attention is All You Need” discards CNN and RNN entirely and computes with pure Attention; hence the title.

5. Calculation method of similarity

When doing Attention, we need to calculate the score (similarity) between the query and the key.

1) Dot product, the simplest method: s(q, k) = qᵀ · k

2) Matrix multiplication: s(q, k) = qᵀ · W · k, where W is a learned matrix

3) Cosine similarity: s(q, k) = qᵀ · k / (‖q‖ · ‖k‖)

4) Concatenation: splice q and k together, s(q, k) = W · [q; k]

5) A multi-layer perceptron can also be used: s(q, k) = vᵀ · tanh(W · q + U · k), where W, U and v are learned parameters

This article is from Easyai.tech, an AI learning library for product managers.
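The five scoring functions listed in this section can be sketched in NumPy; W, U and v stand in for learned parameters and would normally be trained, not fixed:

```python
import numpy as np

def dot_score(q, k):
    return q @ k                               # 1) dot product

def general_score(q, k, W):
    return q @ W @ k                           # 2) matrix multiplication

def cosine_score(q, k):
    return q @ k / (np.linalg.norm(q) * np.linalg.norm(k))  # 3) cosine

def concat_score(q, k, W):
    return W @ np.concatenate([q, k])          # 4) splice q and k

def mlp_score(q, k, W, U, v):
    return v @ np.tanh(W @ q + U @ k)          # 5) multi-layer perceptron
```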