This article implements a Transformer to build a simple copy machine (the model learns to repeat its input); the code is based on Harvard NLP's The Annotated Transformer.
Torch version: 1.6.0
Importing third-party packages
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline
The code is written from the large framework down to the small components, starting with the overall encoder-decoder structure.
1 Encoder Decoder architecture
Encoding: the encoder encodes the input sequence src together with the source mask src_mask.
Decoding: the decoder decodes from the encoder's memory output, the source mask src_mask, the decoder input sequence tgt, and the decoder mask tgt_mask.
class EncoderDecoder(nn.Module):
def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
super(EncoderDecoder, self).__init__()
self.encoder = encoder
self.decoder = decoder
self.src_embed = src_embed
self.tgt_embed = tgt_embed
self.generator = generator
def forward(self, src, tgt, src_mask, tgt_mask):
return self.decode(self.encode(src, src_mask), src_mask,
tgt, tgt_mask)
def encode(self, src, src_mask):
return self.encoder(self.src_embed(src), src_mask)
def decode(self, memory, src_mask, tgt, tgt_mask):
return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
The encoder and decoder in the code are the Transformer encoder and decoder, src_embed is the word-embedding matrix on the encoder side, tgt_embed is the word-embedding matrix on the decoder side, and a generator is also defined. Its job is to map the decoder output vector to the vocab dimension and compute log softmax; it can be regarded as part of the decoder.
class Generator(nn.Module):
    "Map the decoder output to vocab size and take log softmax."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
2 Encoder implementation
The encoder consists of six layers with the same architecture (the same architecture does not mean shared parameters). First, implement a cloning helper.
def clones(module, N):
    "Produce N identical copies of a module."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
Defining the encoder framework
Each layer takes x and the mask and produces an updated x; a final LayerNorm is applied at the end.
class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "For each layer, feed in x and mask to get an updated x; finally apply LayerNorm."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
Layer Normalization
arxiv.org/abs/1607.06…
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
Connection between sub-layers
As mentioned above, the encoder is composed of six identical layers, and each layer contains two sub-layers (multi-head self-attention and the FFN). Each sub-layer is wrapped in a residual connection, i.e. LayerNorm(x + Sublayer(x)) in the paper; for code simplicity the implementation below applies the norm first, x + Dropout(Sublayer(LayerNorm(x))).
Note that every sub-layer (and every layer) keeps the output dimension $d_{\text{model}} = 512$.
class SublayerConnection(nn.Module):
def __init__(self, size, dropout):
super(SublayerConnection, self).__init__()
self.norm = LayerNorm(size)
self.dropout = nn.Dropout(dropout)
def forward(self, x, sublayer):
"residual connection"
return x + self.dropout(sublayer(self.norm(x)))
Define EncoderLayer
EncoderLayer is one layer of the Encoder class. It takes the input x and the mask, passes x through multi-head self-attention wrapped in a SublayerConnection, then passes the result through the FFN wrapped in a second SublayerConnection, and outputs the resulting vector.
class EncoderLayer(nn.Module):
def __init__(self, size, self_attn, feed_forward, dropout):
super(EncoderLayer, self).__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.sublayer = clones(SublayerConnection(size, dropout), 2)
self.size = size
def forward(self, x, mask):
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
return self.sublayer[1](x, self.feed_forward)
The concrete implementation of the two sub-layers is described in detail later; next, build the decoder following the same pattern.
3. Decoder implementation
The decoder is the same as the encoder, and also consists of six identical modules
Define the decoder framework
The decoder still passes through six layers in sequence, and finally through a LayerNorm
The difference from the encoder: the encoder's forward takes only x and the mask, while the decoder's forward takes four arguments: the decoder input x, the encoder memory, the source-side mask and the target-side mask.
class Decoder(nn.Module):
def __init__(self, layer, N):
super(Decoder, self).__init__()
self.layers = clones(layer, N)
self.norm = LayerNorm(layer.size)
def forward(self, x, memory, src_mask, tgt_mask):
for layer in self.layers:
x = layer(x, memory, src_mask, tgt_mask)
return self.norm(x)
Define DecoderLayer
In EncoderLayer there are only two sub-layers: multi-head self-attention and the fully connected feed-forward network. In DecoderLayer there are three: in addition to the decoder-side multi-head self-attention and the feed-forward network, there is an extra multi-head attention from the decoder side over the encoder side, similar to the attention mechanism in traditional seq2seq models.
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn        # decoder self-attention
        self.src_attn = src_attn          # attention over the encoder memory
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
In DecoderLayer, the decoder input provides Q, K and V for self-attention, combined with the decoder-side mask; this passes through a SublayerConnection. The output x is then used as the query, with the encoder memory as key and value, for the second attention sub-layer, followed by another SublayerConnection; then the feed-forward layer, and one final SublayerConnection.
Mask of the decoder
Note in particular that the decoder-side mask differs from the encoder-side mask. Both sides mask out the pad tokens within a batch, but the decoder-side mask additionally prevents "looking ahead": unlike an RNN, which explicitly depends only on the previous step at each time step, self-attention in the Transformer can see the whole sequence, so position t must be restricted to attend only to positions at or before t, never after it. Otherwise the model would be cheating, predicting known tokens from known tokens.
So we want to construct a triangular matrix. np.triu builds the strictly upper triangle of ones; comparing it with 0 then yields the lower-triangular boolean mask.
See: juejin.cn/post/693125…
def subsequent_mask(size):
attn_shape = (1, size, size)
subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
return torch.from_numpy(subsequent_mask) == 0
Let’s look at an example
print(subsequent_mask(5))
plt.figure(figsize=(5, 5))
plt.imshow(subsequent_mask(20)[0])
None
The output
tensor([[[ True, False, False, False, False],
[ True, True, False, False, False],
[ True, True, True, False, False],
[ True, True, True, True, False],
[ True, True, True, True, True]]])
The output diagram is as follows:
4. Multi-head self-attention mechanism
Set aside the multi-head part for a moment and look at plain self-attention first.
Self-attention
Just like traditional attention, the query and key compute the weights, which are then used for a weighted sum over the values. The formula from the paper is:
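$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$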
What is special here is the $\sqrt{d_k}$ in the denominator. The reason: when $d_k$ is small it hardly matters whether you divide or not, but when $d_k$ is large the dot products of query and key become very large, which can push softmax into its tiny-gradient (saturated) region; scaling by $\sqrt{d_k}$ keeps the scores roughly zero-mean with unit variance.
The problem with dot-product attention is that softmax is applied over all the scores jointly, so the components influence each other (unlike tanh, which is applied to each component independently). As a result, the higher the dimension of the vectors, the wider the range of the dot products, and the more likely it is that the maximum is much larger than the other values, making the softmax output close to one-hot (compare softmax(np.random.random(10)) with softmax(100 * np.random.random(10)); the latter clearly concentrates almost all of its probability mass on one dimension). During back propagation, most elements of the softmax Jacobian are then close to zero, so the gradient can barely flow.
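As a quick check of the saturation effect just described, here is a minimal NumPy sketch (the np_softmax helper below is defined locally for illustration; it is not part of the article's code):

def np_softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.random.random(10)
print(np_softmax(x))        # mass fairly spread over the 10 entries
print(np_softmax(100 * x))  # almost all mass on the largest entry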
def attention(query, key, value, mask=None, dropout=None):
d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) \
/ math.sqrt(d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
p_attn = F.softmax(scores, dim = -1)
if dropout is not None:
p_attn = dropout(p_attn)
return torch.matmul(p_attn, value), p_attn
In the implementation, the mask matrix is taken into account: positions where the mask equals 0 are set to a large negative number (e.g. -1e9) in the scores matrix, so that $e^{-10^9}$ is essentially 0 after softmax, which is equivalent to ignoring those positions in attention.
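A minimal standalone sketch of this masking trick (the score values and mask below are made up for illustration):

scores = torch.tensor([[1.0, 2.0, 3.0]])
mask = torch.tensor([[1, 1, 0]])            # the last position is padding
scores = scores.masked_fill(mask == 0, -1e9)
print(F.softmax(scores, dim=-1))            # roughly [0.27, 0.73, 0.00]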
Multi-head attention
Transformer uses multi-head self-attention so that the model can attend in different representation subspaces. The formula from the paper is:
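$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$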
where the projection matrices are $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
In the paper, $d_{\text{model}} = 512$, $h = 8$, and $d_k = d_v = d_{\text{model}}/h = 64$.
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # 4 linear layers: W^Q, W^K, W^V and the output projection W^O
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # the same mask is applied to all h heads
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch: d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention to all the projected vectors in batch
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" the heads and apply the final linear layer
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
The multi-head code is easy to follow: all heads are computed in parallel by packing them into one large matrix. See the code comments above for details.
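As a minimal usage sketch, the following shape check can be run once the classes above are defined (batch size 2 and sequence length 5 are arbitrary choices):

mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 5, 512)            # (batch, seq_len, d_model)
out = mha(x, x, x, mask=None)         # self-attention: query = key = value
print(out.shape)                      # torch.Size([2, 5, 512])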
5 FFN layer
The FFN sub-layer consists of two linear (fully connected) layers with a ReLU in between. The formula from the paper is:
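$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$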
In the above equation, the inner dimension is $d_{ff} = 2048$.
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
6 Embedding layer
The word vector Embedding
Note here that the embedding output is also multiplied by $\sqrt{d_{\text{model}}}$.
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # lookup table: vocab size x d_model
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
Positional Encoding
Since Transformer has no recurrence design, simple self-attention cannot distinguish the order. Therefore, in order to integrate the location information, a position coding vector needs to be designed, which should be consistent with the dimension of the input vector, so that the addition operation can be performed
Transformer's positional encoding is built from sin and cos functions; this particular design has seen little use in later papers, since learned (randomly initialized) position embeddings work about as well. The formula from the paper is:
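$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$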
In the above equation, $pos$ is the position and $i$ is the dimension index.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # compute the positional encodings once
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)
plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
None
7 Building the complete model
The inner modules of the model have all been implemented in the previous sections; now just plug the submodules into the EncoderDecoder class.
def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))

    # initialize parameters with Xavier / Glorot
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform(p)
    return model
8 Training details
Construct the batch
class Batch:
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        # (batch size, 1, seq_len)
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = \
                self.make_std_mask(self.trg, pad)
            # number of non-pad target tokens
            self.ntokens = (self.trg_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Mask out both padding and future positions."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        # broadcast to (batch size, seq_len-1, seq_len-1)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask
The Batch class builds the encoder-side mask matrix and the decoder-side mask matrix from the input sequence, the output sequence and the pad index. In addition, following the seq2seq convention, the decoder input sequence and the prediction targets are constructed by shifting the output sequence by one position.
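A minimal sketch of this shift convention (the token ids below are made up; 0 is the pad index, and the same tensor is reused as src and trg just for illustration):

trg = torch.LongTensor([[1, 3, 5, 7, 0]])   # one padded target sequence
b = Batch(src=trg, trg=trg, pad=0)
print(b.trg)      # tensor([[1, 3, 5, 7]])  -> decoder input
print(b.trg_y)    # tensor([[3, 5, 7, 0]])  -> prediction targets
print(b.ntokens)  # tensor(3): only non-pad target tokens are counted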
Training loop
def run_epoch(data_iter, model, loss_compute):
start = time.time()
total_tokens = 0
total_loss = 0
tokens = 0
for i, batch in enumerate(data_iter):
out = model.forward(batch.src, batch.trg,
batch.src_mask, batch.trg_mask)
loss = loss_compute(out, batch.trg_y, batch.ntokens)
total_loss += loss
total_tokens += batch.ntokens
tokens += batch.ntokens
if i % 50 == 1:
elapsed = time.time() - start
print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
(i, loss / batch.ntokens, tokens / elapsed))
start = time.time()
tokens = 0
return total_loss / total_tokens
The only thing to note here is that when reporting the loss metric, the loss is normalized by batch.ntokens, so the loss at pad positions is ignored.
The optimizer
- Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$
- Warmup is applied to the learning rate; the schedule from the paper is shown below:
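$$lrate = d_{\text{model}}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)$$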
Here $warmup\_steps = 4000$.
The above formula is piecewise: when $step\_num$ is smaller than $warmup\_steps$, $lrate = d_{\text{model}}^{-0.5} \cdot step\_num \cdot warmup\_steps^{-1.5}$, which grows linearly with the step number; beyond that, the rate decays as a negative power of the step number, falling quickly at first and then slowly.
The implementation is as follows:
class NoamOpt:
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
             min(step ** (-0.5), step * self.warmup ** (-1.5)))

def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model, 2, 4000,
                   torch.optim.Adam(model.parameters(), lr=0,
                                    betas=(0.9, 0.98), eps=1e-9))
Let’s draw the learning rate curve
opts = [NoamOpt(512, 1, 4000, None),
NoamOpt(512, 1, 8000, None),
NoamOpt(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None
Regularization-Label Smoothing
Label smoothing penalizes the network for being over-confident in its predictions. It redistributes part of the probability mass of the "1" in the one-hot ground truth evenly over the "0" entries. Taking three-class classification as an example, the original $y = (0, 1, 0)$ becomes $y = (0.1, 0.8, 0.1)$ after smoothing.
The implementation is as follows:
class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size          # vocab size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        # spread the smoothing mass over all classes except the target and <pad>
        true_dist.fill_(self.smoothing / (self.size - 2))
        # put the remaining confidence on the target class
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        # never predict <pad>
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            # zero out rows whose target is <pad>
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))
A note on fill_ and why we divide by size - 2: suppose the raw dictionary is A, B, C; because the model needs padding, the dictionary becomes A, B, C, <pad>, so size = 4. If the target label is A and smoothing = 0.2, then confidence = 0.8 is assigned to A, <pad> is kept at 0 (we never want the model to predict <pad>), and the remaining 0.2 is spread evenly over the size - 2 = 2 other real tokens (B and C), 0.1 each.
scatter_ is used to place the confidence at the target position.
Then the <pad> column of the smoothed matrix is set to 0.
Finally, consider positions whose target is <pad>: sequences in a batch are padded to the maximum length max_len, so with max_len = 5 the output sequence B B A C becomes B B A C <pad>. Predicting a distribution at a position whose target is <pad> is meaningless and should not contribute to the loss, so the corresponding row of true_dist is set to all zeros.
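Putting these rules together, for a non-pad target class $y$ over a vocabulary of size $V$ (including <pad>), the code builds the target distribution

$$true\_dist_k = \begin{cases} 1 - smoothing & k = y \\ 0 & k = \text{<pad>} \\ smoothing / (V - 2) & \text{otherwise} \end{cases}$$

and rows whose target is <pad> are zeroed out entirely.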
Let’s look at a simple example
crit = LabelSmoothing(5, 0, 0.4)
predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0]])
v = crit(Variable(predict.log()),
         Variable(torch.LongTensor([2, 1, 0])))
# crit.true_dist:
# tensor([[0.0000, 0.1333, 0.6000, 0.1333, 0.1333],
#         [0.0000, 0.6000, 0.1333, 0.1333, 0.1333],
#         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
print(crit.true_dist)
plt.imshow(crit.true_dist)
None
Now let's look at how the loss changes as the predicted distribution becomes more and more concentrated on the target class.
crit = LabelSmoothing(5, 0, 0.1)

def loss(x):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d],
                                 ])
    return crit(Variable(predict.log()),
                Variable(torch.LongTensor([1]))).data.item()

plt.plot(np.arange(1, 100), [loss(x) for x in range(1, 100)])
None
With a traditional one-hot target the curve would be monotonically decreasing, but under label smoothing the loss starts to rise slightly once the prediction becomes overly confident.
Loss calculation
class SimpleLossCompute:
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        # map to the vocab dimension and take log softmax
        x = self.generator(x)
        # norm is the number of valid (non-pad) tokens in the batch
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data.item() * norm
9 Small experiment - a copy machine
Fake data
def data_gen(V, batch, nbatches):
for i in range(nbatches):
data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
data[:, 0] = 1
src = Variable(data, requires_grad=False).long()
tgt = Variable(data, requires_grad=False).long()
yield Batch(src, tgt, 0)
Let’s make the dictionary size 11, where 1 to 10 are normal tokens and 0 is pad token
Training
V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
                    torch.optim.Adam(model.parameters(), lr=0,
                                     betas=(0.9, 0.98), eps=1e-9))

for epoch in range(10):
    model.train()
    run_epoch(data_gen(V, 30, 20), model,
              SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    test_loss = run_epoch(data_gen(V, 30, 5), model,
                          SimpleLossCompute(model.generator, criterion, None))
    print("test_loss", test_loss)
Running record
Epoch Step: 1 Loss: 2.949874 Tokens per Sec: 557.973450
Epoch Step: 1 Loss: 1.857541 Tokens per Sec: 557.973450
test_loss tensor(1.8417)
Epoch Step: 1 Loss: 2.048431 Tokens per Sec: 596.984863
Epoch Step: 1 Loss: 1.577389 Tokens per Sec: 861.355225
test_loss tensor(1.6092)
Epoch Step: 1 Loss: 1.865752 Tokens per Sec:
Epoch Step: 1 Loss: 1.395658 Tokens per Sec: 942.581787
test_loss tensor(1.3495)
Epoch Step: 1 Loss: 2.041692 Tokens per Sec: 608.372864
Epoch Step: 1 Loss: 1.183396 Tokens per Sec:
Epoch Step: 1 Loss: 1.291280 Tokens per Sec: 667.504517
Epoch Step: 1 Loss: 0.924788 Tokens per Sec: 906.874023
test_loss tensor(0.9144)
Epoch Step: 1 Loss: 1.222422 Tokens per Sec:
Epoch Step: 1 Loss: 0.733476 Tokens per Sec: 1043.809326
test_loss tensor(0.7075)
Epoch Step: 1 Loss: 0.829088 Tokens per Sec: 663.332275
Epoch Step: 1 Loss: 0.296809 Tokens per Sec: 1100.190186
test_loss tensor(0.3417)
Epoch Step: 1 Loss: 1.048580 Tokens per Sec: 638.724670
Epoch Step: 1 Loss: 0.277764 Tokens per Sec: 970.994873
test_loss tensor(0.2576)
Epoch Step: 1 Loss: 0.393721 Tokens per Sec:
Epoch Step: 1 Loss: 0.385875 Tokens per Sec: 690.867737
test_loss tensor(0.3720)
Epoch Step: 1 Loss: 0.544152 Tokens per Sec: 441.701752
Epoch Step: 1 Loss: 0.238676 Tokens per Sec: 965.472900
test_loss tensor(0.2562)
Greedy decoding
def greedy_decode(model, src, src_mask, max_len, start_symbol):
memory = model.encode(src, src_mask)
ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
for i in range(max_len-1):
out = model.decode(memory, src_mask,
Variable(ys),
Variable(subsequent_mask(ys.size(1)).type_as(src.data)))
prob = model.generator(out[:, -1])
_, next_word = torch.max(prob, dim = 1)
next_word = next_word.data[0]
ys = torch.cat([ys,
torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
return ys
model.eval()
src = Variable(torch.LongTensor([[1, 3, 2, 2, 4, 6, 7, 9, 10, 8]]) )
src_mask = Variable(torch.ones(1, 1, 10) )
print(greedy_decode(model, src, src_mask, max_len=10, start_symbol=1))
The generated result:
tensor([[ 1, 3, 2, 2, 4, 6, 7, 9, 10, 8]])
Attention visualization
def draw(data, x, y, ax):
    seaborn.heatmap(data, xticklabels=x, square=True, yticklabels=y,
                    vmin=0.0, vmax=1.0, cbar=False, ax=ax)
sent = [1, 3, 2, 2, 4, 6, 7, 9, 10, 8]
for layer in range(2):
    print("Encoder Layer", layer + 1)
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        draw(model.encoder.layers[layer].self_attn.attn[0, h].data,
             sent, sent if h == 0 else [], ax=axs[h])
    plt.show()
Encoder Layer 1
Encoder Layer 2
tgt_sent = [1, 3, 2, 2, 4, 6, 7, 9, 10, 8]
for layer in range(2):
    print("Decoder Self Layer", layer + 1)
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        draw(model.decoder.layers[layer].self_attn.attn[0, h].data[:len(tgt_sent), :len(tgt_sent)],
             tgt_sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()

    print("Decoder Src Layer", layer + 1)
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        draw(model.decoder.layers[layer].src_attn.attn[0, h].data[:len(tgt_sent), :len(sent)],
             sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()
Decoder Self Layer 1
Decoder Src Layer 1
Decoder Self Layer 2
Decoder Src Layer 2
Reference
Github.com/harvardnlp/…