WeChat official account: Ilulaoshi. My personal website is continuously updated; please visit it for more details.

The eve of deep learning

Although Yann LeCun proposed the convolutional neural network LeNet in the last century and used it for image classification, convolutional neural networks did not develop rapidly at first. In the nearly 20 years after LeNet was proposed, neural networks were overtaken by other machine learning methods, such as support vector machines. At the time, convolutional neural networks failed to develop rapidly for two main reasons:

1. Lack of data

Deep learning requires a lot of labeled data to perform better than other classical methods. Limited by the small storage capacity of early computers and the tight research budgets of the 1990s, most research was based on small, publicly available datasets. For example, many research papers were based on a handful of public datasets from the University of California, Irvine, many of which contained only a few hundred to a few thousand images. The situation improved with the wave of big data around 2010. In particular, Fei-Fei Li led the construction of the ImageNet dataset, which contains 1,000 categories of objects, with thousands of different images per category and hundreds of gigabytes of data in total. This scale was unmatched by any other publicly available dataset at the time. In addition, the community holds an annual competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which contestants optimize computer vision tasks on the ImageNet dataset. It is fair to say that the ImageNet dataset pushed computer vision and machine learning research into a new phase.

2. Lack of hardware

Deep learning is demanding on computing resources. Early hardware had limited computing power, which made it difficult to train more complex neural networks. The arrival of general-purpose GPU (GPGPU) computing changed the landscape. GPUs had long been designed for image processing and computer games, especially for high-throughput matrix and vector multiplication. Fortunately, that mathematics is very similar to the computation performed by the convolutional layers in deep networks. The idea of the general-purpose GPU began to emerge around 2001, and programming frameworks such as CUDA and OpenCL followed. CUDA's programming interface has a relatively gentle learning curve, and researchers can use CUDA to accelerate their scientific computing tasks on Nvidia GPUs. Computation-intensive tasks started being migrated to Nvidia GPUs around 2010.

The current wave of AI is widely believed to have started in 2012. That year, Alex Krizhevsky successfully trained AlexNet, a deep convolutional neural network, on Nvidia GPUs and won the ImageNet challenge with it, greatly improving the accuracy of image classification. By then, the storage and computation of big data were almost no longer bottlenecks, and AlexNet made both academia and industry realize the astonishing performance of deep neural networks.

AlexNet network structure

AlexNet’s design philosophy is very similar to LeNet’s, but there are also significant differences.

First, compared with the relatively small LeNet, AlexNet contains eight layers of transformations: five convolutional layers, two fully connected hidden layers, and one fully connected output layer. Let's describe the design of these layers in detail.

The convolution window shape in the first layer of AlexNet is 11 × 11. Since the height and width of most images in ImageNet are more than ten times those of MNIST images, objects in ImageNet images occupy more pixels, so a larger convolution window is needed to capture them. The convolution window shape in the second layer is reduced to 5 × 5, and 3 × 3 windows are used after that. In addition, the first, second, and fifth convolutional layers are each followed by a max pooling layer with a window shape of 3 × 3 and a stride of 2. Moreover, AlexNet uses tens of times more convolution channels than LeNet, as the sketch below makes concrete.
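To make the window-size arithmetic concrete, here is a small helper (hypothetical, for illustration only) that applies the standard output-size formula floor((input - kernel + 2 * padding) / stride) + 1, using the same layer parameters as the implementation later in this post:

def conv_out(size, kernel, stride=1, padding=0):
    # floor((input - kernel + 2 * padding) / stride) + 1
    return (size - kernel + 2 * padding) // stride + 1

print(conv_out(224, 11, stride=4, padding=1))  # 54: AlexNet's first conv layer
print(conv_out(54, 3, stride=2))               # 26: the 3x3, stride-2 max pooling that follows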

The last convolutional layer is followed by two fully connected layers with 4,096 outputs each. These two huge fully connected layers contribute nearly 1 GB of model parameters. Because of the limited video memory of early GPUs, the original AlexNet used a dual-data-stream design so that each GPU only needed to hold half of the model. Fortunately, GPU memory has come a long way in the past few years, so we usually no longer need such a special design.

Second, AlexNet replaced the sigmoid activation function with the simpler ReLU activation function. On the one hand, ReLU is simpler to compute; for example, it has no exponentiation, unlike sigmoid. On the other hand, ReLU makes the model easier to train under different parameter initialization schemes. When the output of the sigmoid function is very close to 0 or 1, the gradient in those regions is almost 0, so backpropagation cannot keep updating some of the model parameters; if the parameters are not initialized properly, the model may fail to train effectively. By contrast, the gradient of ReLU in the positive interval is always 1.
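A tiny autograd experiment illustrates the difference (a minimal sketch; the input value 10.0 is arbitrary):

import torch

x = torch.tensor(10.0, requires_grad=True)
torch.sigmoid(x).backward()
print(x.grad)  # ~4.54e-05: the sigmoid gradient nearly vanishes far from 0

x = torch.tensor(10.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # 1.0: the ReLU gradient is constant in the positive interval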

Third, AlexNet uses dropout to control the model complexity of the fully connected layers and avoid overfitting. LeNet did not use dropout.
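The effect of dropout is easy to see on a toy tensor (a minimal sketch; nn.Dropout only zeroes activations in training mode):

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
drop.train()  # dropout is only active in training mode
x = torch.ones(8)
print(drop(x))  # roughly half the entries become 0, the rest are scaled by 1 / (1 - p) = 2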

Fourth, AlexNet introduces a variety of image augmentation techniques, such as flipping, cropping, and color changes, to further enlarge the dataset and mitigate overfitting.
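As a sketch of what such augmentation can look like in torchvision (the transforms and parameters below are illustrative, not the exact recipe from the AlexNet paper):

import torchvision

augment = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),      # random flipping
    torchvision.transforms.RandomResizedCrop(224),      # random cropping
    torchvision.transforms.ColorJitter(0.4, 0.4, 0.4),  # color changes
    torchvision.transforms.ToTensor(),
])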

Here is a slightly simplified version of AlexNet implemented in PyTorch. The network assumes 1 × 224 × 224 inputs, i.e., inputs have only one channel, as in single-channel (grayscale) datasets such as Fashion-MNIST.

import torch
from torch import nn


class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()

        # a convolution layer changes the input shape as:
        # floor((input_shape - kernel_size + 2 * padding) / stride) + 1
        # input shape: 1 * 224 * 224
        # convolution part
        self.conv = nn.Sequential(
            # conv layer 1
            # floor((224 - 11 + 2) / 4) + 1 = floor(53.75) + 1 = 54
            # conv: 1 * 224 * 224 -> 96 * 54 * 54
            nn.Conv2d(in_channels=1, out_channels=96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
            # floor((54 - 3) / 2) + 1 = floor(25.5) + 1 = 26
            # 96 * 54 * 54 -> 96 * 26 * 26
            nn.MaxPool2d(kernel_size=3, stride=2),
            # conv layer 2: smaller kernel, padding keeps input and output sizes the same, more channels
            # floor((26 - 5 + 4) / 1) + 1 = 26
            # 96 * 26 * 26 -> 256 * 26 * 26
            nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            # floor((26 - 3) / 2) + 1 = 12
            # 256 * 26 * 26 -> 256 * 12 * 12
            nn.MaxPool2d(kernel_size=3, stride=2),
            # 3 consecutive conv layers with smaller kernels
            # floor((12 - 3 + 2) / 1) + 1 = 12
            # 256 * 12 * 12 -> 384 * 12 * 12
            nn.Conv2d(in_channels=256, out_channels=384, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            # 384 * 12 * 12 -> 384 * 12 * 12
            nn.Conv2d(in_channels=384, out_channels=384, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            # 384 * 12 * 12 -> 256 * 12 * 12
            nn.Conv2d(in_channels=384, out_channels=256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            # floor((12 - 3) / 2) + 1 = 5
            # 256 * 12 * 12 -> 256 * 5 * 5
            nn.MaxPool2d(kernel_size=3, stride=2))
        # fully connected part
        self.fc = nn.Sequential(
            nn.Linear(256 * 5 * 5, 4096),
            nn.ReLU(),
            # use a dropout layer to mitigate overfitting
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            # output layer:
            # the number of classes in Fashion-MNIST is 10
            nn.Linear(4096, 10))

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output
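As a quick sanity check on the shape arithmetic in the comments above, we can feed a dummy batch through the network (assuming the AlexNet class above is in scope):

net = AlexNet()
X = torch.randn(1, 1, 224, 224)  # one single-channel 224 x 224 image
print(net(X).shape)  # torch.Size([1, 10]): one logit per Fashion-MNIST class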

Model training

Although the original AlexNet paper uses the ImageNet dataset, training on ImageNet takes a very long time, so we use the Fashion-MNIST dataset to demonstrate AlexNet. When reading the data, we add an extra step that enlarges the image height and width to 224, the input size AlexNet expects. This can be achieved with a torchvision.transforms.Resize instance: we apply the Resize instance before the ToTensor instance, and then use a Compose instance to chain the two transformations for convenient calling.

import sys

import torch
import torchvision


def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    """Download the Fashion-MNIST dataset via torchvision.datasets and load it into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    # multi-process data loading is not supported on Windows
    if sys.platform.startswith('win'):
        num_workers = 0
    else:
        num_workers = 4
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)

    return train_iter, test_iter

The load_data_fashion_mnist() method defines how the data is read. Fashion-MNIST images are originally 1 × 28 × 28 in size; Resize scales each original image to whatever size we want.
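We can verify the resizing by drawing one batch from the iterator (a quick check, assuming the function above is in scope and a batch size of 128):

train_iter, test_iter = load_data_fashion_mnist(batch_size=128, resize=224)
for X, y in train_iter:
    print(X.shape)  # torch.Size([128, 1, 224, 224]): resized from 1 x 28 x 28
    break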

def train(net, train_iter, test_iter, batch_size, optimizer, num_epochs, device=mlutils.try_gpu()):
    # mlutils is the author's helper module; a sketch of its helpers follows this listing
    net = net.to(device)
    print("training on", device)
    loss = torch.nn.CrossEntropyLoss()
    timer = mlutils.Timer()
    # one epoch iterates over all training samples
    for epoch in range(num_epochs):
        # Accumulator holds 3 values: (sum of loss, number of correct predictions, number of images processed)
        metric = mlutils.Accumulator(3)
        # the training samples are consumed in mini-batches of size batch_size
        for X, y in train_iter:
            timer.start()
            # set the network to training mode
            net.train()
            # move data to device (e.g. GPU)
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            with torch.no_grad():
                # accumulate the following metrics into the variable `metric`
                metric.add(l * X.shape[0], mlutils.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            # metric[0] = accumulated loss, metric[2] = number of images processed
            train_l = metric[0] / metric[2]
            # metric[1] = number of correct predictions
            train_acc = metric[1] / metric[2]
        test_acc = mlutils.evaluate_accuracy_gpu(net, test_iter)
        if epoch % 1 == 0:
            print(f'epoch {epoch + 1}: loss {train_l:.3f}, train acc {train_acc:.3f}, test acc {test_acc:.3f}')
    # after training, report throughput in images/sec
    # the variable `metric` is defined inside the for loop, but Python allows referencing it afterwards
    print(f'total training time {timer.sum():.2f} sec, '
          f'{metric[2] * num_epochs / timer.sum():.1f} images/sec on {str(device)}')
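The train() method above relies on a few helpers from an mlutils module that this post does not show (the real implementations are in the GitHub repository linked below). A minimal sketch of what they might look like, modeled on the d2l-style utilities they resemble:

import time

import torch

def try_gpu():
    """Return a GPU device if one is available, otherwise the CPU."""
    return torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def accuracy(y_hat, y):
    """Number of correct predictions in a batch."""
    return (y_hat.argmax(dim=1) == y).float().sum().item()

class Accumulator:
    """Accumulate sums over n variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def __getitem__(self, idx):
        return self.data[idx]

class Timer:
    """Record the durations of repeated runs."""
    def __init__(self):
        self.times = []

    def start(self):
        self.tik = time.time()

    def stop(self):
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def sum(self):
        return sum(self.times)

def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Compute the accuracy of a model on a dataset, using the net's device."""
    net.eval()  # switch to evaluation mode
    if device is None:
        device = next(net.parameters()).device
    metric = Accumulator(2)  # (number of correct predictions, number of examples)
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            metric.add(accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]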

In the main() method of the program, the network is defined first, then load_data_fashion_mnist() is used to load the training and test data, and finally the model is trained with the train() method:

import argparse

def main(args):
    net = AlexNet()
    optimizer = torch.optim.Adam(net.parameters(), lr=args.lr)

    # load data
    train_iter, test_iter = mlutils.load_data_fashion_mnist(batch_size=args.batch_size, resize=224)
    # train
    train(net, train_iter, test_iter, args.batch_size, optimizer, args.num_epochs)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Image classification')
    parser.add_argument('--batch_size', type=int, default=128, help='batch size')
    parser.add_argument('--num_epochs', type=int, default=10, help='number of train epochs')
    parser.add_argument('--lr', type=float, default=0.001, help='learning rate')
    args = parser.parse_args()
    main(args)

Here, args holds the arguments that can be passed in on the command line.
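For example, assuming the script is saved as alexnet.py (a hypothetical file name), it could be invoked as: python alexnet.py --batch_size 256 --num_epochs 5 --lr 0.001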

I have uploaded the source code to GitHub, with both PyTorch and TensorFlow implementations available.

References

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
  2. d2l.ai/chapter_con…
  3. tangshusen.me/Dive-into-D…