1. Why Mask?
Let's start with a question: why do we need masks?
In NLP, one of the most common problems is that input sequences vary in length. A PAD operation is usually required, filling the shorter sequences with zeros at the end. Although models such as RNNs can handle variable-length input, in practice we still need to batch the inputs and turn them into a fixed-size tensor.
PAD case:
Here are two English sentences. First, convert the text into numbers:
s1 = 'He likes cats'
s2 = 'He does not like cats'
s = s1.split(' ') + s2.split(' ')
word_to_id = dict(zip(s, range(len(s))))
id_to_word = dict((k,v) for v,k in word_to_id.items())
# {'He': 3, 'likes': 1, 'cats': 7, 'does': 4, 'not': 5, 'like': 6}
# {3: 'He', 1: 'likes', 7: 'cats', 4: 'does', 5: 'not', 6: 'like'}
s1_vector = [word_to_id[x] for x in s1.split(' ')]
s2_vector = [word_to_id[x] for x in s2.split(' ')]
sentBatch = [s1_vector, s2_vector]
print(sentBatch)
The numeric encoding:
[[3, 1, 7], [3, 4, 5, 6, 7]]
Now pad the two vectors above:
import torch
from torch.nn.utils.rnn import pad_sequence

a = torch.tensor(s1_vector)
b = torch.tensor(s2_vector)
# pad_sequence defaults to batch_first=False, so the result has shape [max_len, batch]
pad = pad_sequence([a, b])
print(pad)
The PAD result:
tensor([[3, 3],
        [1, 4],
        [7, 5],
        [0, 6],
        [0, 7]])
Take the PAD result of the sentence “He likes cats”, [3, 1, 7, 0, 0], as an example. The PAD operation causes the following problems.
1. Mean-pooling problem
Take the vector for s1 in the case above. Before PAD, mean-pooling over [3, 1, 7] gives (3 + 1 + 7) / 3 ≈ 3.67. After PAD, mean-pooling over [3, 1, 7, 0, 0] gives (3 + 1 + 7 + 0 + 0) / 5 = 2.2. The two results differ: the PAD operation distorts mean-pooling. (A quick numeric check follows after the max-pooling problem below.)
2. Max-pooling problem
For the toy vector [3, 1, 7] the maximum happens to survive padding, but in general it does not: as soon as the real values are all smaller than the padding value 0 (which easily happens with real-valued features such as embedding outputs), max-pooling returns a padded 0 instead of a real value. So the PAD operation also affects max-pooling; see the sketch right after this.
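To make the two problems concrete, here is a minimal PyTorch check; the feature values in the max-pooling part are made up for illustration and are not from the original example:

```python
import torch

# Mean-pooling: the padded zeros drag down the average
s1 = torch.tensor([3., 1., 7.])
s1_pad = torch.tensor([3., 1., 7., 0., 0.])
print(s1.mean())      # tensor(3.6667)
print(s1_pad.mean())  # tensor(2.2000)

# Max-pooling: with real-valued (possibly negative) features,
# a padded zero can become the maximum
feat = torch.tensor([-0.5, -1.2, -0.3])             # made-up feature values
feat_pad = torch.tensor([-0.5, -1.2, -0.3, 0., 0.])
print(feat.max())      # tensor(-0.3000)
print(feat_pad.max())  # tensor(0.)
```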
3. Attention
The last step of an Attention computation is usually a softmax that normalizes the scores into probabilities. But if you apply softmax directly to the padded vector, the PAD positions also receive part of the probability mass, so the probabilities of the meaningful (non-PAD) positions sum to less than 1.
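A small sketch of the effect, using made-up attention scores where the last two positions are PAD:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 0.0, 0.0])  # last two positions are PAD
probs = F.softmax(scores, dim=-1)
print(probs)
# Each PAD position still gets exp(0) worth of weight,
# so the three real positions sum to less than 1
print(probs[:3].sum())
```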
2. Mask was born to solve the PAD problem
Mask is a technique that comes hand in hand with PAD: its job is to tell the model how long the real part of each vector is. The Mask matrix has the following characteristics:
- The Mask matrix is the same shape as the matrix after PAD.
- The Mask matrix contains only 0s and 1s. A 1 means the value at that position in the padded matrix is meaningful; a 0 means the value at that position is padding and carries no meaning.
The masks for the two padded sentences above are:
mask_s1 = [1, 1, 1, 0, 0]
mask_s2 = [1, 1, 1, 1, 1]
In PyTorch, the mask can be built directly from the padded tensor pad (the padding index is 0):
padding_idx = 0
mask = pad.ne(padding_idx).byte()
print(mask)
>>> tensor([[1, 1],
        [1, 1],
        [1, 1],
        [0, 1],
        [0, 1]], dtype=torch.uint8)
1. Solve the mean-pooling problem
Multiply the padded vector by the mask element-wise, then divide the sum by the number of 1s in the mask instead of by the padded length: sum(pad * mask) / sum(mask) = (3 + 1 + 7) / 3 ≈ 3.67, which matches the mean before PAD. A minimal sketch follows below.
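A minimal sketch of masked mean-pooling in PyTorch (the variable names are mine):

```python
import torch

pad_vec = torch.tensor([3., 1., 7., 0., 0.])
mask = torch.tensor([1., 1., 1., 0., 0.])

# Sum only the real positions and divide by the number of real positions
masked_mean = (pad_vec * mask).sum() / mask.sum()
print(masked_mean)  # tensor(3.6667), the same as before PAD
```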
2. Solve the max-pooling problem
For max-pooling, we only need to make the PAD positions small enough that they can never be the maximum: before taking the max, the positions where the mask is 0 are replaced with a sufficiently small value (for example -1e10 or -inf), so the max-pooling result is no longer affected by PAD.
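A corresponding sketch for masked max-pooling; the feature values are again made up:

```python
import torch

feat_pad = torch.tensor([-0.5, -1.2, -0.3, 0., 0.])  # padded real-valued features
mask = torch.tensor([1, 1, 1, 0, 0])

# Replace the PAD positions with a very small value so they can never win the max
masked_feat = feat_pad.masked_fill(mask == 0, -1e10)
print(masked_feat.max())  # tensor(-0.3000), unaffected by the padded zeros
```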
3. Solve the Attention problem
The solution is the same as for max-pooling: make the attention scores at the PAD positions small enough that, after softmax, their probabilities are so close to 0 that they are effectively ignored.
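A small sketch of masking attention scores before softmax; the -1e9 constant is a common convention rather than something prescribed here:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 0.0, 0.0])  # last two positions are PAD
mask = torch.tensor([1, 1, 1, 0, 0])

# Push the PAD scores to a large negative value before softmax
masked_scores = scores.masked_fill(mask == 0, -1e9)
probs = F.softmax(masked_scores, dim=-1)
print(probs)  # PAD positions get ~0 probability; the real positions sum to ~1
```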
2. What are the common masks?
With that in mind, you should now understand why masks exist and what they do. In NLP tasks, the mask takes different forms depending on its function.
There are two common types of Mask: the padding mask, which is used to handle variable-length input (the kind discussed above), and the sequence mask, which is used to prevent future information from being leaked. We'll look at these two types of masks in more detail.
Padding mask – Handles input of variable length
In NLP, a common problem is that input sequences have unequal lengths. Generally speaking, we PAD the sentences in a batch, usually with the value 0. However, as mentioned earlier, padding with 0 causes many problems and affects the final result. The Mask matrix was introduced to solve exactly this PAD problem.
Here’s an example:
case 1: I like cats.
case 2: He does not like cats.
If the default seq_len is 5, case 1 has to be padded (1 marks a real token, 0 marks a PAD position):
[1, 1, 1, 0, 0]
When the tokens are embedded, the PAD positions also get embedding vectors, but PAD itself has no real meaning, so letting it take part in training may be harmful.
Therefore, it is necessary to maintain a mask tensor to keep track of which values are real. The two masks for this example are as follows:
1 1 1 0 0
1 1 1 1 1
During subsequent gradient propagation, the mask acts as a filter so that the PAD positions do not contribute. In PyTorch, the embedding layer supports this directly through the padding_idx parameter, which keeps the PAD embedding at zero and excludes it from gradient updates:
nn.Embedding(vocab_size, embed_dim, padding_idx=0)
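A quick check of what padding_idx does (the vocabulary and embedding sizes here are arbitrary):

```python
import torch.nn as nn

emb = nn.Embedding(10, 4, padding_idx=0)
print(emb.weight[0])  # all zeros: the embedding at index 0 stays fixed
                      # and receives no gradient updates during training
```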
Sequence mask – To prevent future information leaks
In a language model, we often need to predict the next word from the previous ones, and the sequence mask is there to keep the decoder from seeing future information. In other words, at time step t the decoder's output should depend only on the outputs before t, not on anything after t. So we need a way to hide the information after t.
How do we do that? It is very simple: generate a matrix whose upper triangle (above the diagonal) is all 1s and whose diagonal and lower triangle are all 0s. Applying this matrix to each sequence hides the future positions (the ones marked 1) and does the trick.
A common trick is to generate such a triangular mask matrix, as in [1]:
import torch

def sequence_mask(seq):
    batch_size, seq_len = seq.size()
    # 1s strictly above the diagonal mark the future positions to be hidden
    mask = torch.triu(torch.ones((seq_len, seq_len), dtype=torch.uint8),
                      diagonal=1)
    mask = mask.unsqueeze(0).expand(batch_size, -1, -1)  # [B, L, L]
    return mask
Harvard’s article The Annotated Transformer includes a visualization of this mask.
It is worth noting that the mask itself only needs to be a two-dimensional matrix, but since the input comes in batches, the two-dimensional matrix has to be expanded into a three-dimensional tensor. As you can see from the code above, this is already handled there.
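A small usage example of the sequence_mask function above; the batch contents are dummy values, since only the shape matters here:

```python
# A batch of 2 sequences of length 5
batch = torch.zeros(2, 5, dtype=torch.long)
mask = sequence_mask(batch)
print(mask.shape)  # torch.Size([2, 5, 5])
print(mask[0])
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]], dtype=torch.uint8)
```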
If you found this article helpful, feel free to like, share, and follow.