This article is based on a question from a CNN assignment. It covers building a basic CNN, its classification results on the MNIST data set, and the effects of Batch Normalization, Dropout, convolution kernel size, training set size, different parts of the training data, random seeds, and different activation units. Working through all of these gives a more complete picture of CNNs, which is why I did the exercise and wrote this article.

Tools

  • Open source deep learning library: PyTorch
  • Data set: MNIST

Implementation

The initial requirements

First, build the BASE network. The PyTorch code is as follows:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # two convolutional layers followed by two fully connected layers
        self.conv1 = nn.Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1), padding=0)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1), padding=0)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), 2)   # 28x28 -> 24x24 conv, pooled to 12x12
        x = F.max_pool2d(self.conv2(x), 2)   # 12x12 -> 8x8 conv, pooled to 4x4
        x = x.view(-1, 4*4*50)               # flatten to (N, 800)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

See base.py for this code.
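
As a quick sanity check (my own addition, not part of the assignment code), the network can be instantiated and run on a dummy batch to confirm the output shape:

import torch

model = Net()
dummy = torch.randn(8, 1, 28, 28)   # a fake batch of 8 MNIST-sized images
out = model(dummy)
print(out.shape)                    # torch.Size([8, 10]): log-probabilities per class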

Problem A: Preprocessing

The MNIST data set needs to be read and transformed into a format suitable for processing. The reading code here follows BigDL Python Support; without going into detail, the files can be parsed quickly according to the data format described on the MNIST home page. The key piece is the function that reads 32-bit integers:

import numpy

def _read32(bytestream):
    # MNIST stores integers in big-endian byte order
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]

The result is a tensor of shape (N, 1, 28, 28) with pixel values in 0-255. Each value is first divided by 255 to scale it to [0, 1], and then normalized using the mean and standard deviation of the training and test sets, which are known in advance. Since those statistics were computed on the already-rescaled data, skipping the initial division by 255 made the forward outputs and gradients look completely unreasonable; it took a while to find that this was the problem.

See preprocessing.py for this code.
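
A minimal sketch of the normalization step described above, assuming the data has already been read into a NumPy array. The mean and standard deviation shown are the commonly quoted MNIST training-set statistics, stated here for illustration rather than taken from preprocessing.py:

import numpy
import torch

def to_normalized_tensor(images, mean=0.1307, std=0.3081):
    # images: array of shape (N, 1, 28, 28) with pixel values in 0-255
    x = images.astype(numpy.float32) / 255.0   # first scale to [0, 1]
    x = (x - mean) / std                       # then normalize with the dataset statistics
    return torch.from_numpy(x)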

Problem B: BASE model

The random seed was set to 0, and the parameters were learned on the first 10,000 training samples. The test-set error rate was then measured after 20 epochs. The final result is:

Test set: Average loss: 0.0014, Accuracy: 9732/10000 (97.3%)

The accuracy of the BASE model is not particularly high.
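
For reference, a minimal sketch of the training and evaluation loop implied by this setup. The optimizer and learning rate are assumptions in the spirit of the standard PyTorch MNIST example, and train_loader / test_loader stand for DataLoaders built from the preprocessed tensors of Problem A (the first 10,000 training samples and the full test set); the actual settings are in base.py.

import torch
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)                   # random seed 0, as stated above

model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)   # assumed hyperparameters

for epoch in range(20):                # 20 epochs
    model.train()
    for data, target in train_loader:  # first 10,000 training samples
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)   # NLL loss pairs with log_softmax outputs
        loss.backward()
        optimizer.step()

model.eval()
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        correct += (model(data).argmax(dim=1) == target).sum().item()
print('Accuracy: {}/{}'.format(correct, len(test_loader.dataset)))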

Problem C: Batch Normalization vs. BASE

A Batch Normalization layer is added after the convolution (or fully connected) layer in each of the first three blocks, which only requires a small change to the network structure:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1), padding=0)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1), padding=0)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)
        self.bn1 = nn.BatchNorm2d(20)
        self.bn2 = nn.BatchNorm2d(50)
        self.bn3 = nn.BatchNorm1d(500)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(self.bn1(x), 2)
        x = self.conv2(x)
        x = F.max_pool2d(self.bn2(x), 2)
        x = x.view(-1, 4*4*50)
        x = self.fc1(x)
        x = F.relu(self.bn3(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

Running with the same parameters, the result with BN is:

Test set: Average loss: 0.0009, Accuracy: 9817/10000 (98.2%)

There is a clear improvement. For more information on Batch Normalization, see [2], [5].
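
One practical point when adding BN (a general PyTorch note, not something discussed in the assignment): BatchNorm uses per-batch statistics in training mode and accumulated running averages in evaluation mode, so the model mode has to be switched explicitly before testing:

import torch

model = Net()                          # the BN version defined above

model.train()                          # training mode: BN normalizes with batch statistics
_ = model(torch.randn(8, 1, 28, 28))   # dummy batch; updates BN running mean/variance

model.eval()                           # eval mode: BN uses the stored running statistics
with torch.no_grad():
    _ = model(torch.randn(8, 1, 28, 28))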

Problem D: Dropout Layer

After adding Dropout (p=0.5) at the final fc2 layer, the results for the BASE and BN models are:

BASE: Test set: Average loss: 0.0011, Accuracy: 9769/10000 (97.7%)
  BN: Test set: Average loss: 0.0014, Accuracy: 9789/10000 (97.9%)

Dropout improves the BASE model to some extent, but on the BN model the effect is not obvious; the accuracy actually drops. A likely reason is that Batch Normalization already has a regularizing effect of its own, so adding an extra Dropout layer is unnecessary and can even hurt the result.
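
A minimal sketch of the BASE variant with Dropout. The placement shown here, on the activations feeding into fc2, is an assumption on my part (the text above says the layer is added at the final fc2 layer); NetDrop is an illustrative name, and the actual code is in the repository [7].

class NetDrop(nn.Module):
    def __init__(self):
        super(NetDrop, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1), padding=0)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1), padding=0)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)
        self.drop = nn.Dropout(p=0.5)

    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), 2)
        x = F.max_pool2d(self.conv2(x), 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.drop(x)                 # randomly zero activations with probability 0.5
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)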

Problem E: SK Model

SK model: Stacking two 3×3 conv. layers to replace 5×5 conv. layer



After such changes, the SK model is as follows:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1_1 = nn.Conv2d(1, 20, kernel_size=(3, 3), stride=(1, 1), padding=0)
        self.conv1_2 = nn.Conv2d(20, 20, kernel_size=(3, 3), stride=(1, 1), padding=0)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=(3, 3), stride=(1, 1), padding=0)
        self.fc1 = nn.Linear(5*5*50, 500)
        self.fc2 = nn.Linear(500, 10)
        self.bn1_1 = nn.BatchNorm2d(20)
        self.bn1_2 = nn.BatchNorm2d(20)
        self.bn2 = nn.BatchNorm2d(50)
        self.bn3 = nn.BatchNorm1d(500)
        self.drop = nn.Dropout(p=0.5)

    def forward(self, x):
        x = F.relu(self.bn1_1(self.conv1_1(x)))
        x = F.relu(self.bn1_2(self.conv1_2(x)))
        x = F.max_pool2d(x, 2)
        x = self.conv2(x)
        x = F.max_pool2d(self.bn2(x), 2)
        x = x.view(-1, 5*5*50)
        x = self.fc1(x)
        x = F.relu(self.bn3(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

After 20 epochs, the result is as follows:

SK: Test set: Average loss: 0.0008, Accuracy: 9848/10000 (98.5%)

Test-set accuracy improves slightly. Here two stacked 3×3 convolution kernels replace the large 5×5 kernel, so the number of weights per kernel drops from 5×5=25 to 2×3×3=18 while keeping the same receptive field. In practice this makes the computation cheaper, and the extra ReLU inserted between the two convolution layers also helps. The same idea is used in VGG.

Problem F: Change the number of channels

Here the number of feature maps is scaled by a multiplier, with the runs driven by a shell script. The results are:

SK0.2: 97.7%
SK0.5: 98.2%
SK1:   98.5%
SK1.5: 98.6%
SK2:   98.5% (max 98.7%)

As the number of feature maps in the first convolution block goes from 4 through 10, 30, and 40, the final accuracy generally improves. To some extent this suggests that, before overfitting sets in, increasing the number of feature maps amounts to extracting more features, and extracting more features helps accuracy. See sk_s.py and runsk.sh for this code.
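
A minimal sketch of how such a width multiplier could be threaded through the SK model. SKNet, the width argument, and the rounding are illustrative choices of mine, not necessarily what sk_s.py does:

class SKNet(nn.Module):
    def __init__(self, width=1.0):
        super(SKNet, self).__init__()
        c1 = max(1, int(round(20 * width)))   # width=0.2 gives 4 feature maps, width=2 gives 40
        c2 = max(1, int(round(50 * width)))
        self.conv1_1 = nn.Conv2d(1, c1, kernel_size=3)
        self.conv1_2 = nn.Conv2d(c1, c1, kernel_size=3)
        self.conv2 = nn.Conv2d(c1, c2, kernel_size=3)
        self.bn1_1 = nn.BatchNorm2d(c1)
        self.bn1_2 = nn.BatchNorm2d(c1)
        self.bn2 = nn.BatchNorm2d(c2)
        self.bn3 = nn.BatchNorm1d(500)
        self.fc1 = nn.Linear(5*5*c2, 500)
        self.fc2 = nn.Linear(500, 10)
        self.c2 = c2

    def forward(self, x):
        x = F.relu(self.bn1_1(self.conv1_1(x)))
        x = F.relu(self.bn1_2(self.conv1_2(x)))
        x = F.max_pool2d(x, 2)
        x = F.max_pool2d(self.bn2(self.conv2(x)), 2)
        x = x.view(-1, 5*5*self.c2)
        x = F.relu(self.bn3(self.fc1(x)))
        return F.log_softmax(self.fc2(x), dim=1)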

Problem G: Use different training set sizes

This is also run by a script, adding a parameter:

parser.add_argument('--usedatasize', type=int, default=60000, metavar='SZ',
                    help='use how many training data to train network')

This specifies how much training data to use; the usedatasize samples are taken from the front of the training set. See sk_s.py and runtrainingsize.sh for this part of the program. The results are as follows:

  500: 84.2%
 1000: 92.0%
 2000: 94.3%
 5000: 95.5%
10000: 96.6%
20000: 98.4%
60000: 99.1%

Clearly, more data gives more accurate results. Too little data cannot reflect the overall distribution and easily leads to overfitting, while beyond a certain amount the gains become small. In practice, though, the problem is usually too little data, and getting more is hard.
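
A minimal sketch of how the flag could be applied, using dummy tensors in place of the real MNIST arrays (take_first is an illustrative helper, not a function from sk_s.py):

import torch

def take_first(images, labels, size):
    """Keep only the first `size` training samples, counted from the front."""
    return images[:size], labels[:size]

# dummy tensors standing in for the preprocessed MNIST training set
images = torch.randn(60000, 1, 28, 28)
labels = torch.randint(0, 10, (60000,))
images, labels = take_first(images, labels, 10000)
print(images.shape)   # torch.Size([10000, 1, 28, 28])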

Problem H: Use different parts of the training set

This is also done by script; see sk_0.2.py and diffTrainingSets. The results are as follows:

    0-10000: 98.0%
10000-20000: 97.8%
20000-30000: 97.8%
30000-40000: 97.4%
40000-50000: 97.5%
50000-60000: 97.7%

Networks trained on different subsets of the training data do differ. The differences are not large, but the results are nonetheless somewhat unstable.
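
The selection of a 10,000-sample window can be sketched in the same way (take_window and offset are illustrative names, not taken from sk_0.2.py):

import torch

def take_window(images, labels, offset, size=10000):
    """Select the training samples in the window [offset, offset + size)."""
    return images[offset:offset + size], labels[offset:offset + size]

images = torch.randn(60000, 1, 28, 28)   # dummy stand-ins for the MNIST arrays
labels = torch.randint(0, 10, (60000,))
win_images, win_labels = take_window(images, labels, offset=20000)   # samples 20000-29999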

Problem I: The effect of the random seed

This was done with the runsed.sh script, using all 60,000 training samples. The results are as follows:

Seed      0: 98.9%
Seed      1: 99.0%
Seed     12: 99.1%
Seed    123: 99.0%
Seed  12345: 99.0%
Seed 123456: 98.9%

In fact, when using the entire training set, the seed setting of the random number generator has little effect on the final result.
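
For reference, the usual way to fix the seeds in PyTorch (a minimal sketch; whether the original scripts also seed NumPy or Python's random module is not shown here):

import random
import numpy
import torch

def set_seed(seed):
    # seed the common sources of randomness so runs are repeatable
    random.seed(seed)
    numpy.random.seed(seed)
    torch.manual_seed(seed)

set_seed(12345)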

Problem J: ReLU or Sigmoid?

After replacing every ReLU with Sigmoid, the network was trained on all 60,000 training samples. The comparison is as follows:

   ReLU SK_0.2:  99.0%
Sigmoid SK_0.2:  98.6%

This shows that ReLU activation units work better than Sigmoid units when training this CNN. The reason probably lies in the difference between the two mechanisms: when a neuron's sigmoid input is very large or very small, its output saturates near 1 or 0, so the gradient is almost zero in many places and the weights can hardly be updated. ReLU, even if it adds some computation elsewhere, clearly speeds up convergence and does not suffer from this gradient-saturation problem.
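
A quick numerical illustration of the saturation effect described above (my own example, not from the assignment):

import torch

x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)   # roughly [4.5e-05, 0.25, 4.5e-05]: almost zero gradient at the extremes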

Conclusion

In this article, PyTorch was used to compare several CNN models on the MNIST data set and to examine how various parameter settings affect the results, giving a fairly detailed look at many aspects of CNNs. PyTorch turns out to be quite easy to use once you have been through it; I cannot speak for other kinds of models, but at least for convolutional neural networks it feels straightforward. See [7] for the full code of the project. Since the author has not studied CNNs in depth, the interpretation of each result may be biased, and readers are welcome to point out mistakes.

References

  • [1] Introduction to Convolutional Neural Networks
  • [2] Batch Normalization
  • [3] PyTorch Doc
  • [4] MatConvNet
  • [5] Zhihu: Why does Batch Normalization work well in deep learning?
  • [6] Dropout
  • [7] Code on Github