Drastically Reduce Training Iterations and Improve Generalization Ability: Multi-Sample Dropout

Paper title: Multi-Sample Dropout for Accelerated Training and Better Generalization

Link to the paper: arxiv.org/pdf/1905.09…

Author: Hiroshi Inoue

Paper introduction

This paper describes multi-sample dropout, a variant of the standard dropout technique. Whereas traditional dropout applies a single random mask to the input of a layer in each training round (producing one "dropout sample"), multi-sample dropout creates multiple dropout samples from the same input and then averages the losses of all samples to obtain the final loss. The approach simply duplicates the part of the training network after the dropout layer and shares the weights among the duplicated fully connected layers, so no new operators are required. The network parameters are updated to minimize the average loss over the M dropout samples. The effect is similar to repeating the training M times for each input in a minibatch, so it greatly reduces the number of training iterations.
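
In symbols (my own notation, not taken from the paper): if $h$ is the feature vector entering the dropout layer, $m_1, \dots, m_M$ are the $M$ independent dropout masks, $f$ is the shared fully connected layer (plus softmax), and $\ell$ is the per-sample loss such as cross entropy, the training objective becomes

$$\mathcal{L}(h, y) = \frac{1}{M} \sum_{i=1}^{M} \ell\big(f(m_i \odot h),\, y\big)$$

and the gradients of all $M$ terms flow into the same shared weights.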

Multi-Sample Dropout

Figure 1 gives a simple example based on the dropout pipeline we often use in training; in this figure, multi-sample dropout uses two dropout samples. Only existing deep learning frameworks and common operators are needed. As shown in the figure, each dropout sample duplicates the dropout layer of the original network and the layers after it; the example duplicates the "dropout", "fully connected", and "softmax + loss func" layers. In the dropout layer, each dropout sample uses a different mask, so each keeps a different subset of neurons, but the parameters (i.e., connection weights) are shared between the duplicated fully connected layers. The same loss function, such as cross entropy, is then used to compute the loss of each dropout sample, and the final loss value is obtained by averaging the losses of all dropout samples.

This final loss is used as the objective function for training, and the class with the maximum value in the output of the last fully connected layer is taken as the predicted label. Because dropout is applied near the end of the network, the extra training time caused by the duplicated operations is small. It is worth noting that the number of dropout samples in multi-sample dropout can be arbitrary; Figure 1 just shows an example with two dropout samples.
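
To make the figure concrete, here is a minimal sketch of such a head in PyTorch (the name MultiSampleDropoutHead and its arguments are mine, not from the paper): several independent dropout masks are applied to the same feature vector, each masked copy goes through the same fully connected layer, and the per-sample cross-entropy losses are averaged.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSampleDropoutHead(nn.Module):
    """Minimal multi-sample dropout head: one shared fc layer, several dropout masks."""
    def __init__(self, in_features, num_classes, num_samples=2, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_samples)])
        self.fc = nn.Linear(in_features, num_classes)  # weights shared by all dropout samples

    def forward(self, feature, target=None):
        # one set of logits per dropout sample, all through the same fc layer
        logits_list = [self.fc(dropout(feature)) for dropout in self.dropouts]
        logits = torch.stack(logits_list, dim=0).mean(dim=0)  # averaged prediction
        if target is None:
            return logits, None
        # average the cross-entropy losses of all dropout samples
        loss = torch.stack([F.cross_entropy(l, target) for l in logits_list]).mean()
        return logits, loss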

It is also important to note that neurons are not dropped at inference time. Because all dropout samples become identical at inference, the loss of only one dropout sample needs to be computed, which allows the duplicated part of the network to be pruned away to eliminate redundant computation. Computing all dropout samples at inference does not noticeably change the predictions; it only slightly increases the inference cost.
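
As a hedged illustration (reusing the MultiSampleDropoutHead sketch above, which is my own naming, not the paper's): calling .eval() in PyTorch turns nn.Dropout into a no-op, so every dropout sample produces the same output and a single pass through the shared fully connected layer is sufficient.

head = MultiSampleDropoutHead(in_features=128, num_classes=10, num_samples=2, p=0.5)
head.eval()                       # nn.Dropout becomes the identity in eval mode
with torch.no_grad():
    feature = torch.randn(4, 128)
    logits, _ = head(feature)     # all dropout samples are now identical
    pred = logits.argmax(dim=1)   # predicted class labels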

PyTorch implementation

Github.com/lonePatient…

In the initialization method, a ModuleList containing multiple Dropout modules is defined. In forward, dropout is applied to the features multiple times, and the averages of the outputs and of the losses are then computed. Here, dropout_num is a hyperparameter that specifies the "multi" in multi-sample, i.e., the number of dropout samples. The core code is as follows:

  self.dropouts = nn.ModuleList([nn.Dropout(dropout_p) for _ in range(dropout_num)])

Complete ResNet model structure:

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResNet(nn.Module):
    def __init__(self, ResidualBlock, num_classes, dropout_num, dropout_p):
        super(ResNet, self).__init__()
        self.inchannel = 32
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        self.layer1 = self.make_layer(ResidualBlock, 32, 2, stride=1)
        self.layer2 = self.make_layer(ResidualBlock, 64, 2, stride=2)
        self.layer3 = self.make_layer(ResidualBlock, 64, 2, stride=2)
        self.layer4 = self.make_layer(ResidualBlock, 128, 2, stride=2)
        self.fc = nn.Linear(128, num_classes)
        # one Dropout module per dropout sample; the fc layer above is shared by all of them
        self.dropouts = nn.ModuleList([nn.Dropout(dropout_p) for _ in range(dropout_num)])

    def make_layer(self, block, channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)  # e.g. strides = [1, 1]
        layers = []
        for stride in strides:
            layers.append(block(self.inchannel, channels, stride))
            self.inchannel = channels
        return nn.Sequential(*layers)

    def forward(self, x, y=None, loss_fn=None):
        out = self.conv1(x)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        feature = F.avg_pool2d(out, 4)
        if len(self.dropouts) == 0:
            # plain path without dropout
            out = feature.view(feature.size(0), -1)
            out = self.fc(out)
            if loss_fn is not None:
                loss = loss_fn(out, y)
                return out, loss
            return out, None
        else:
            # multi-sample dropout: apply each dropout mask to the same features,
            # run the shared fc layer, and average the outputs and losses
            for i, dropout in enumerate(self.dropouts):
                if i == 0:
                    out = dropout(feature)
                    out = out.view(out.size(0), -1)
                    out = self.fc(out)
                    if loss_fn is not None:
                        loss = loss_fn(out, y)
                else:
                    temp_out = dropout(feature)
                    temp_out = temp_out.view(temp_out.size(0), -1)
                    temp_out = self.fc(temp_out)
                    out = out + temp_out
                    if loss_fn is not None:
                        loss = loss + loss_fn(temp_out, y)
            if loss_fn is not None:
                return out / len(self.dropouts), loss / len(self.dropouts)
            return out, None
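
A possible way to exercise this model (a hedged usage sketch; ResidualBlock is assumed to be the basic residual block defined elsewhere in the repository, and the CIFAR-10-style shapes are my choice, not from the original post):

model = ResNet(ResidualBlock, num_classes=10, dropout_num=8, dropout_p=0.5)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(16, 3, 32, 32)   # a CIFAR-10 sized batch
labels = torch.randint(0, 10, (16,))

logits, loss = model(images, y=labels, loss_fn=loss_fn)  # averaged logits and averaged loss
loss.backward()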

How does BERT use Multi-Sample Dropout?

BERT-family models already use a lot of dropout, and the implementation above is rather complex for them. A simpler implementation:

for step, (input_ids, attention_mask, token_type_ids, y) in enumerate(tk):
    input_ids, attention_mask, token_type_ids, y = input_ids.to(device), attention_mask.to(
        device), token_type_ids.to(device), y.to(device).long()
    with autocast():
        # run the forward pass batch_n_iter times; each pass draws a fresh dropout mask
        for i in range(batch_n_iter):
            output = model(input_ids, attention_mask, token_type_ids).logits
            loss = criterion(output, y) / CFG['accum_iter']
            SCALER.scale(loss).backward()  # accumulate gradients over the dropout samples
        SCALER.step(optimizer)
        SCALER.update()
        optimizer.zero_grad()
  • In the code above, batch_n_iter is the hyperparameter that specifies the "multi" in multi-sample, i.e., the number of forward passes (dropout samples) per batch.

If you use Hugging Face Transformers, you can use the following template:

import numpy as np
import torch

class Multilabel_dropout(torch.nn.Module):
    # Multi-Sample Dropout: https://arxiv.org/abs/1905.09788
    def __init__(self, config):
        super().__init__()
        self.high_dropout = torch.nn.Dropout(config.HIGH_DROPOUT)
        self.classifier = torch.nn.Linear(config.HIDDEN_SIZE * 2, 2)

    def forward(self, out):
        # apply dropout to the same features several times and average the logits
        # (the linspace only fixes the number of samples here; high_dropout has a fixed rate)
        return torch.mean(
            torch.stack(
                [self.classifier(self.high_dropout(out)) for _ in np.linspace(0.1, 0.5, 5)],
                dim=0,
            ),
            dim=0,
        )
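
One way this head might be wired up (a hedged usage sketch: SimpleConfig, the bert-base-uncased checkpoint, and the concatenation of the [CLS] hidden state with the pooler output to obtain HIDDEN_SIZE * 2 features are my assumptions, not part of the original snippet):

from transformers import AutoModel, AutoTokenizer

class SimpleConfig:
    HIGH_DROPOUT = 0.5
    HIDDEN_SIZE = 768  # hidden size of bert-base

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")
head = Multilabel_dropout(SimpleConfig())

enc = tokenizer(["multi-sample dropout example"], return_tensors="pt")
out = backbone(**enc)
# concatenate the [CLS] hidden state and the pooler output -> HIDDEN_SIZE * 2 features
features = torch.cat([out.last_hidden_state[:, 0], out.pooler_output], dim=-1)
logits = head(features)  # logits averaged over the dropout samples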

Multi-Sample Dropout

As for the principle, there are already very thorough explanations online, so I will be lazy here and refer you to the references; this post mainly provides the code. In short, multi-sample dropout can:

  • Significantly reduce the number of training iterations
  • Improve generalization ability

References

Multi-sample Dropout for Accelerated Training and Better Generalization Capability: IBM proposes “New Dropout”