Ta-ying Chen is a PhD candidate in machine learning at the University of Oxford and a well-known technology blogger on Medium.

Translator: Song Xian

Many computer vision applications, such as self-driving cars and face detection, are made possible by deep neural networks. What many people may not know, however, is that the breakthroughs in computer vision in recent years have been driven by a particular type of network architecture known as the residual network (ResNet). In fact, many of the advances we have seen in AI would not have been possible without residual blocks. It is the residual, a concept both simple and elegant, that gives us truly "deep" networks.

This article walks through the basic principles behind residual networks and shows how to implement a simple residual network in PyTorch and train ResNets for image classification.

Degradation problem: Are more layers more powerful?

In theory, a deeper network with more parameters should be better at the difficult task of understanding images. There is evidence, however, that plain networks built by simply stacking more layers are actually harder to train and can perform even worse than shallower ones. This is what we call the degradation problem.

The degradation problem does not seem to make intuitive sense. Logically, if we take two networks with the same number of layers and add x extra layers to the second one, the worst case should be that those x extra layers learn an identity mapping, so that the two networks perform the same. One possible explanation is that deeper networks perform worse because, in practice, the network fails to learn this identity mapping. As a result, in the first half of the 2010s, networks like VGG-16 were typically limited to around 10 to 20 layers.

Residual architecture

The residual network provides a simple and direct solution to the degradation problem described above. It creates a shortcut, called a skip connection, that lets the raw input bypass several stacked layers and then combines it with the output features of those layers. As shown in Figure 1, let the input to the stacked layers be x and let the stacked layers compute the function F; the final output y is then

y = F(x) + x
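To make the formula concrete, here is a minimal sketch, assuming F is a single 3x3 convolution. The class name `TinyResidual` and the tensor sizes are illustrative and do not come from the original article. When the weights of F are all zero, the block reduces exactly to the identity mapping, which is the "worst case" from the degradation discussion.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical minimal block computing y = F(x) + x, with F a single 3x3 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        # padding=1 keeps the spatial size, so F(x) and x have the same shape
        self.f = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.f(x) + x  # skip connection

block = TinyResidual(4)
# Zero out F so that F(x) = 0 for every input
nn.init.zeros_(block.f.weight)
nn.init.zeros_(block.f.bias)

x = torch.randn(2, 4, 8, 8)
with torch.no_grad():
    y = block(x)

# With F(x) = 0 the block passes its input through unchanged
is_identity = torch.allclose(y, x)
```

In other words, a residual block only has to push F toward zero to behave like an identity layer, rather than having to learn the identity through a stack of nonlinear layers.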

When the dimensions of F(x) and x do not match, we can simply change the dimensions of x by applying a linear projection in the skip connection.
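One common way to realize such a projection in a convolutional network is a 1x1 convolution on the shortcut path. The following is a sketch under that assumption; the class name `ProjectionResBlock`, the channel counts, and the stride are illustrative choices rather than something specified in the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionResBlock(nn.Module):
    """Hypothetical residual block whose shortcut is a 1x1 convolution.

    Useful when F(x) changes the channel count or spatial size,
    so that x cannot be added to F(x) directly.
    """

    def __init__(self, in_size: int, out_size: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_size, out_size, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_size)
        self.conv2 = nn.Conv2d(out_size, out_size, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_size)
        # Linear projection of x: matches both the channel count and the stride of F
        self.shortcut = nn.Conv2d(in_size, out_size, 1, stride=stride)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # y = F(x) + W x

block = ProjectionResBlock(16, 32, stride=2)
out = block(torch.randn(4, 16, 28, 28))
out_shape = tuple(out.shape)
```

Here the input has 16 channels at 28x28 and the output has 32 channels at 14x14, so the shortcut has to change both the channel count and the resolution before the addition.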

We call the entire pipeline above a residual block, and we can combine multiple residual blocks to build a very deep network and avoid degradation problems.

Computing environment

Libraries

We will use PyTorch (including TorchVision) to build the residual network, and the following code imports all the libraries we need:

"""
The following is an import of PyTorch libraries.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
from torchvision.utils import save_image
import matplotlib.pyplot as plt
import numpy as np
import random

Datasets

To demonstrate the power of the residual network, we will test it on two datasets: the simpler MNIST dataset (60,000 training images of handwritten digits from 0 to 9) and the more complex CIFAR-10 dataset.

The download links are as follows: 1. MNIST dataset 2. CIFAR-10 dataset

When testing a network, we often need to compare training results across several different datasets. Our open dataset platform offers a one-stop place to acquire, filter and manage high-quality datasets, which makes this work very convenient. Many well-known unstructured datasets are available for free. Using the SDK provided by Gewu Titanium, we can even integrate these datasets directly into our code for training and testing.

Hardware requirements

Generally speaking, a neural network can be trained on a CPU, but a GPU is the better choice because it greatly speeds up training. The residual network built in this article is relatively simple and runs on either a CPU or a GPU. In practical applications, however, more complex networks (such as ResNet-152) are common, and a GPU is generally preferred. We can use the following code to check whether our machine can train on a GPU:

"""
Determine if any GPUs are available
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
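Once `device` is defined, both the model and every batch of data have to be moved to it before the forward pass. A minimal sketch, in which a single convolution and a random tensor stand in for a real network and a real batch of images:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-ins for a real network and a real batch of 28x28 grayscale images
model = nn.Conv2d(1, 8, 3, padding=1).to(device)
batch = torch.randn(4, 1, 28, 28).to(device)

output = model(batch)
# The result lives on the same kind of device as the model and the input
same_device = output.device.type == device.type
```

If the model and the data sit on different devices, PyTorch raises a runtime error, so it helps to route both through the same `device` variable.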

Build the residual block

Here we show how to create the simplest residual block for a convolutional neural network, one whose input and output dimensions are the same. The following code subclasses PyTorch's nn.Module to create the residual block.

"""
Define an nn.Module class for a simple residual block with equal dimensions
"""
class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions, each followed by batchnorm and ReLU."""

    def __init__(self, in_size: int, hidden_size: int, out_size: int):
        super().__init__()
        # padding=1 keeps the spatial size unchanged for 3x3 kernels
        self.conv1 = nn.Conv2d(in_size, hidden_size, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_size, out_size, 3, padding=1)
        self.batchnorm1 = nn.BatchNorm2d(hidden_size)
        self.batchnorm2 = nn.BatchNorm2d(out_size)

    def convblock(self, x):
        """Compute F(x)."""
        x = F.relu(self.batchnorm1(self.conv1(x)))
        x = F.relu(self.batchnorm2(self.conv2(x)))
        return x

    def forward(self, x):
        """Combine the output with the original input (requires out_size == in_size)."""
        return x + self.convblock(x)  # skip connection
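Several such blocks can then be stacked into a small classifier. The article does not show the exact network it trained, so the following is only a sketch under assumed settings: the name `SmallResNet`, the width of 16 channels, the three blocks, and the MNIST-sized input are all illustrative. `ResBlock` is repeated so that the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Same residual block as above, repeated so this sketch is self-contained."""

    def __init__(self, in_size: int, hidden_size: int, out_size: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_size, hidden_size, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_size, out_size, 3, padding=1)
        self.batchnorm1 = nn.BatchNorm2d(hidden_size)
        self.batchnorm2 = nn.BatchNorm2d(out_size)

    def convblock(self, x):
        x = F.relu(self.batchnorm1(self.conv1(x)))
        x = F.relu(self.batchnorm2(self.conv2(x)))
        return x

    def forward(self, x):
        return x + self.convblock(x)  # skip connection

class SmallResNet(nn.Module):
    """Hypothetical classifier: stem convolution, stacked ResBlocks, pooling, linear head."""

    def __init__(self, in_channels=1, width=16, num_blocks=3, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, 3, padding=1)
        self.blocks = nn.Sequential(
            *[ResBlock(width, width, width) for _ in range(num_blocks)]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        x = self.blocks(F.relu(self.stem(x)))
        x = self.pool(x).flatten(1)              # (batch, width)
        return self.head(x)                      # class logits

net = SmallResNet()
logits = net(torch.randn(8, 1, 28, 28))
```

The stem convolution brings the single input channel up to the block width, which is what allows every `ResBlock` to keep equal input and output dimensions.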

Use the existing ResNet model

In fact, many networks built on the residual structure have performed so well after training on large datasets like ImageNet that we can use these existing models instead of reinventing the wheel. TorchVision provides ready-made architectures and pretrained checkpoints for networks such as ResNet-34, ResNet-50 and ResNet-152. We can obtain such a model with the following code:

"""
Create a ResNet-34 pretrained on ImageNet
"""
# Note: torchvision 0.13+ prefers weights=torchvision.models.ResNet34_Weights.DEFAULT
resnet = torchvision.models.resnet34(pretrained=True)

Note, however, that if we fine-tune the model on a dataset other than ImageNet, we must replace the last layer of the ResNet, because the dimension of the final output vector has to equal the number of classes in the dataset.

Results

After 50 epochs of training, the network we built can easily reach around 99% accuracy on the MNIST dataset, while ResNet-34 and ResNet-152 can both reach 90% accuracy on the CIFAR-10 dataset. The original paper by Kaiming He et al. also shows that the residual structure performs significantly better on the ImageNet dataset than VGG, and better than a network with the same number of layers but no residual structure.
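The training code itself is not shown in this article. As a rough sketch of what one training epoch and an accuracy check could look like, the example below uses random tensors shaped like MNIST and a tiny linear model as stand-ins; the Adam optimizer, the learning rate, the batch size, and the helper names `train_one_epoch` and `accuracy` are all assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, device):
    """Run one pass over the data and return the mean cross-entropy loss."""
    model.train()
    total_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
    return total_loss / len(loader.dataset)

@torch.no_grad()
def accuracy(model, loader, device):
    """Return the fraction of correctly classified samples."""
    model.eval()
    correct = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
    return correct / len(loader.dataset)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Random stand-in data shaped like MNIST; replace with a real DataLoader
data = TensorDataset(torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16, shuffle=True)

# Stand-in model; replace with the residual network being trained
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = [train_one_epoch(model, loader, optimizer, device) for _ in range(3)]
acc = accuracy(model, loader, device)
```

For a real experiment, the accuracy should be measured on a held-out test set rather than on the training data used here.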

Conclusion

The residual structure of Kaiming He et al. is arguably one of the most outstanding recent inventions in neural networks for computer vision. Today, almost all networks (even those beyond convolutional networks) use residuals to some degree in order to maintain good performance even when a large number of layers are stacked. This simple and elegant approach opens up countless possibilities for helping machines understand the human world.