This is the seventh day of my participation in the First Challenge 2022

Preface

Deep learning is 🔥 right now, and I know there are already countless articles introducing convolutional neural networks. Still, I want to write something a little different. Even if some of the topics have been covered many times before, my goal is to make them genuinely easy to understand, so that anyone with a little background in computers and linear algebra can follow along.

My own level is limited, so if anything here is wrong or not explained well, you are welcome to point it out 🙏

What is a neural network?

Before introducing convolutional neural networks, let’s review the basics of neural networks 📖. Neural networks are at the core of deep learning: many of the deep learning algorithms we are familiar with are in fact neural networks of one kind or another. A neural network consists of layers of nodes, usually one input layer, one or more hidden layers, and one output layer. When we speak of a network with “several layers”, we usually mean the number of hidden layers, since the input and output layers are fixed by the problem. Nodes are interconnected, and each connection has an associated weight; each node also has a threshold. If a node’s output exceeds its threshold, the node is activated and its data is sent to the next layer of the network; otherwise, nothing is passed on. Does this activation process feel familiar? That’s right: it is a simulation of how neurons in a living organism fire.
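To make this concrete, here is a minimal sketch of a single node (the inputs, weights, bias, and threshold are made-up values for illustration): it computes a weighted sum of its inputs and only “fires”, passing data onward, when that sum exceeds the threshold.

import numpy as np

# Toy node: three inputs, three weights, a bias, and an activation threshold
inputs = np.array([0.5, 0.3, 0.9])
weights = np.array([0.4, -0.2, 0.6])
bias = 0.1
threshold = 0.5

weighted_sum = np.dot(inputs, weights) + bias  # 0.2 - 0.06 + 0.54 + 0.1 = 0.78
output = weighted_sum if weighted_sum > threshold else 0.0  # node fires: 0.78 > 0.5
print(output)  # 0.78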

There are various types of neural networks, suited to different application scenarios. For example, recurrent neural networks (RNNs) are commonly used for natural language processing and speech recognition, while convolutional neural networks (ConvNets or CNNs) are commonly used for classification and computer vision tasks. Before CNNs, identifying objects in images required hand-crafted feature extraction. Convolutional neural networks now offer a far more scalable approach to image classification and object recognition, using the principles of linear algebra, particularly matrix multiplication, to identify patterns in images. However, the computational requirements of CNNs are high, and a graphics processing unit (GPU) is usually needed to train a model at a reasonable speed.

What is a convolutional neural network?

A convolutional neural network is a feedforward neural network built on the convolution operation. Compared with an ordinary neural network, it has the advantages of local connectivity and weight sharing, which greatly reduce the number of learnable parameters and speed up convergence; at the same time, the convolution operation is well suited to extracting image features. The basic components of a convolutional neural network are the convolutional layer, the pooling layer, and the fully connected layer. The convolutional layer is the first layer of the network and can be followed by further convolutional layers or pooling layers; the last layer is the fully connected layer. Earlier layers focus on simple features such as colors and edges, while later layers recognize larger portions of the image. As the image data progresses through the layers of the CNN, the network identifies ever larger elements or shapes, until it finally recognizes the intended object.

The following sections briefly describe the principles and functions of these basic components.

Convolution layer

The convolutional layer is the core component of a CNN, and its function is to extract features from the samples. It involves three parts: the input data, the filter, and the feature map. Inside a computer, an image is stored as a matrix of pixels. If the input is an RGB image, it is a 3-D pixel matrix, meaning the input has three dimensions: height, width, and depth. The filter, also known as the convolution kernel or feature detector, is essentially a two-dimensional (2-D) weight matrix that moves across the image’s receptive fields, checking for the presence of a feature.

Convolution kernels vary in size but are usually 3×3 matrices, which also determines the size of the receptive field. Different kernels extract different image features. Starting from the upper-left corner of the input image’s pixel matrix, the kernel’s weight matrix is dotted with the corresponding region of the pixel matrix; the kernel is then moved, and the process repeats until the kernel has swept the entire image. This process is called convolution. The final output of the convolution operation is called the feature map, activation map, or convolved feature.
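Here is a minimal NumPy sketch of this sliding dot-product (the 5×5 image and the 3×3 kernel are made-up values; stride 1, no padding). Like most deep learning frameworks, it actually computes cross-correlation, which is what “convolution” conventionally means in this context:

import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 single-channel image
kernel = np.array([[1., 0., -1.],                 # toy 3x3 convolution kernel
                   [1., 0., -1.],                 # (a vertical edge detector)
                   [1., 0., -1.]])

out_h = image.shape[0] - kernel.shape[0] + 1      # 5 - 3 + 1 = 3
out_w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        # Dot the kernel with the receptive field it currently covers
        receptive_field = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(receptive_field * kernel)

print(feature_map.shape)  # (3, 3)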

As the figure above ☝ shows, each output value in the feature map does not need to be connected to every pixel of the input image; it only connects to the receptive field covered by the filter. Because the output array does not map directly onto every input value, convolutional layers (and pooling layers) are often described as “partially connected” layers. This property is also called local connectivity.

As the convolution kernel moves across the image, its weights stay constant; this is called weight sharing. Some parameters, such as the weight values, are adjusted during training through backpropagation and gradient descent. But before training begins, three hyperparameters that affect the output volume must be set (a short sketch after the list illustrates their effect):

  1. Number of filters: affects the depth of the output. For example, three different filters produce three different feature maps, giving an output depth of three.

  2. Stride: the distance, in pixels, that the convolution kernel moves across the input matrix at each step. Stride values of 2 or more are rare, and a larger stride yields a smaller output.

  3. Zero-padding: usually used when the filter does not fit the input image exactly. It sets all elements that fall outside the input matrix to zero, producing a larger or equally sized output. There are three types of padding:

    • Valid padding: also known as no padding. In this case, if the dimensions do not align, the last convolution is dropped.
    • Same padding: ensures that the output layer is the same size as the input layer.
    • Full padding: increases the size of the output by adding zeros to the input’s border.
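These effects are easy to verify with torch.nn.Conv2d. The output size o follows the formula o = (w - k + 2p)/s + 1, where w is the input size, k the kernel size, p the padding, and s the stride (the sizes below are arbitrary examples):

import torch

x = torch.randn(1, 1, 28, 28)  # a batch of one 28x28 single-channel image

# Valid padding (p=0): output shrinks to (28 - 3 + 0)/1 + 1 = 26
valid = torch.nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=0)
print(valid(x).shape)    # torch.Size([1, 8, 26, 26])

# Same padding (p=1 for k=3, s=1): output stays 28x28
same = torch.nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)
print(same(x).shape)     # torch.Size([1, 8, 28, 28])

# Stride 2: (28 - 3 + 2)//2 + 1 = 14, the spatial size is roughly halved
strided = torch.nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)
print(strided(x).shape)  # torch.Size([1, 8, 14, 14])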

After each convolution operation, the CNN applies an activation function (sigmoid, ReLU, Leaky ReLU, etc.) to the feature map. This introduces a nonlinear mapping of the output and thereby increases the expressive power of the network.
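For example, ReLU simply zeroes out the negative entries of a feature map, while sigmoid squashes every value into (0, 1) (toy values below):

import torch
import torch.nn.functional as F

feature_map = torch.tensor([[-1.5, 0.3],
                            [2.0, -0.2]])
print(F.relu(feature_map))         # tensor([[0.0000, 0.3000], [2.0000, 0.0000]])
print(torch.sigmoid(feature_map))  # every entry mapped into (0, 1)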

When a CNN has multiple convolutional layers, each later layer sees the pixels within the receptive fields of the previous layer. For example, suppose we want to determine whether an image contains a bicycle. We can think of a bicycle as the sum of its parts: frame, handlebars, wheels, pedals, and so on. Each individual part of the bicycle forms a lower-level pattern in the neural network, and the combination of parts forms a higher-level pattern. This creates a feature hierarchy within the CNN.

Pooling layer

To reduce the number of parameters in the feature maps, speed up computation, and enlarge the receptive field, a pooling layer (also known as a down-sampling layer) is usually added after a convolutional layer. Pooling improves the fault tolerance of the model because it reduces the feature dimensionality without losing important information. On the one hand, this dimensionality reduction makes the model pay more attention to global rather than local features; on the other hand, it helps prevent overfitting to some extent. Concretely, pooling takes the result of an aggregation function over the values in the receptive field as the output. There are two main types of pooling (compared in the sketch after this list):

  • Max pooling: as the filter moves across the input, it selects the pixel with the maximum value to send to the output array. This approach is used more often than average pooling.

  • Average pooling: as the filter moves across the input, it computes the average of the values in the receptive field to send to the output array.
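The difference is easy to see on a toy 4×4 feature map (made-up values), where each 2×2 window is reduced to a single number:

import torch

feature_map = torch.tensor([[[[1., 3., 2., 4.],
                              [5., 6., 1., 2.],
                              [7., 2., 8., 1.],
                              [3., 4., 2., 6.]]]])  # shape: (1, 1, 4, 4)

max_pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = torch.nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(feature_map))  # tensor([[[[6., 4.], [7., 8.]]]])
print(avg_pool(feature_map))  # tensor([[[[3.7500, 2.2500], [4.0000, 4.2500]]]])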

Fully connected layer

The fully connected layer is usually located at the end of the network, and its structure is just what its name suggests. As mentioned earlier, in the partially connected layers, the input image’s pixel values are not directly connected to the output. In the fully connected layer, however, every node in the output layer connects directly to every node in the previous layer, and the feature maps are flattened into a one-dimensional vector.

This layer performs the classification task based on the features extracted by the previous layers and their various filters. While convolutional and pooling layers tend to use the ReLU function, the fully connected layer usually uses a softmax activation to classify the input appropriately, producing probability values in [0, 1].
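As a small sketch of this final stage (the 7×7×16 feature-map size is just an example; it happens to match the model defined below):

import torch
import torch.nn.functional as F

feature_maps = torch.randn(1, 16, 7, 7)        # output of a last pooling layer
flattened = feature_maps.view(-1, 16 * 7 * 7)  # flatten to one 1-D vector per sample

fc = torch.nn.Linear(16 * 7 * 7, 10)           # fully connected layer, 10 classes
logits = fc(flattened)
probas = F.softmax(logits, dim=1)              # probabilities in [0, 1]
print(probas.sum())                            # sums to 1 across the classes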

Handwritten digit recognition with custom convolutional neural network

  • Imports
import time
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True
  • Settings & dataset loading
##########################
### SETTINGS
##########################

# Device
device = torch.device("cuda:3" if torch.cuda.is_available() else "cpu")

# Hyperparameters
random_seed = 1
learning_rate = 0.05
num_epochs = 10
batch_size = 128

# Architecture
num_classes = 10

##########################
### MNIST DATASET
##########################

# Note transforms.ToTensor() scales input images
# to 0-1 range
train_dataset = datasets.MNIST(root='data',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)

test_dataset = datasets.MNIST(root='data',
                              train=False,
                              transform=transforms.ToTensor())

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         shuffle=False)

# Checking the dataset
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break
  • Model definition
##########################
### MODEL
##########################

class ConvNet(torch.nn.Module):

    def __init__(self, num_classes):
        super(ConvNet, self).__init__()
        
        # calculate same padding:
        # (w - k + 2*p)/s + 1 = o
        # => p = (s(o-1) - w + k)/2
        
        # 28x28x1 => 28x28x8
        self.conv_1 = torch.nn.Conv2d(in_channels=1,
                                      out_channels=8,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1) # (1(28-1) - 28 + 3) / 2 = 1
        # 28x28x8 => 14x14x8
        self.pool_1 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0) # (2(14-1) - 28 + 2) = 0
        # 14x14x8 => 14x14x16
        self.conv_2 = torch.nn.Conv2d(in_channels=8,
                                      out_channels=16,
                                      kernel_size=(3, 3),
                                      stride=(1, 1),
                                      padding=1) # (1(14-1) - 14 + 3) / 2 = 1                 
        # 14x14x16 => 7x7x16                             
        self.pool_2 = torch.nn.MaxPool2d(kernel_size=(2, 2),
                                         stride=(2, 2),
                                         padding=0) # (2(7-1) - 14 + 2) = 0

        self.linear_1 = torch.nn.Linear(7*7*16, num_classes)

        # Initialize all conv and linear layers with small Gaussian weights
        for m in self.modules():
            if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
                m.weight.data.normal_(0.0, 0.01)
                # Zero the bias only if the layer actually has one;
                # touching m.bias unconditionally would fail for bias-free layers
                if m.bias is not None:
                    m.bias.detach().zero_()
        
        
    def forward(self, x):
        out = self.conv_1(x)
        out = F.relu(out)
        out = self.pool_1(out)

        out = self.conv_2(out)
        out = F.relu(out)
        out = self.pool_2(out)
        
        logits = self.linear_1(out.view(-1, 7*7*16))
        probas = F.softmax(logits, dim=1)
        return logits, probas

    
torch.manual_seed(random_seed)
model = ConvNet(num_classes=num_classes)

model = model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  
  • Model training
def compute_accuracy(model, data_loader):
    correct_pred, num_examples = 0, 0
    for features, targets in data_loader:
        features = features.to(device)
        targets = targets.to(device)
        logits, probas = model(features)
        _, predicted_labels = torch.max(probas, 1)
        num_examples += targets.size(0)
        correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float() / num_examples * 100


start_time = time.time()
for epoch in range(num_epochs):
    model = model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):
        features = features.to(device)
        targets = targets.to(device)

        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = F.cross_entropy(logits, targets)
        optimizer.zero_grad()
        cost.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()

        ### LOGGING
        if not batch_idx % 50:
            print('Epoch: %03d/%03d | Batch %03d/%03d | Cost: %.4f'
                  % (epoch+1, num_epochs, batch_idx, len(train_loader), cost))

    model = model.eval()
    print('Epoch: %03d/%03d training accuracy: %.2f%%' % (
        epoch+1, num_epochs, compute_accuracy(model, train_loader)))
    print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))

print('Total Training Time: %.2f min' % ((time.time() - start_time)/60))
  • Model evaluation
with torch.set_grad_enabled(False): # save memory during inference
    print('Test accuracy: %.2f%%' % (compute_accuracy(model, test_loader)))

The output is 👇

Test accuracy: 97.97%
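As a quick extra illustration (not part of the pipeline above), the trained model can also classify a single test image, reusing the test_dataset loaded earlier:

model.eval()
with torch.no_grad():
    image, label = test_dataset[0]
    # Add a batch dimension before feeding the image to the model
    logits, probas = model(image.unsqueeze(0).to(device))
    print('Predicted: %d, actual: %d' % (torch.argmax(probas, 1).item(), label))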

❤️ Thank you

Thank you all for reading this far. If you found it helpful:

  1. Give it a like, so that more people can see it.
  2. Share your thoughts with me in the comments section; writing down your own reasoning also helps it stick.
  3. You can also read some of my previous posts if you are interested:
    • “Out of the box” Perceptron: Principles and Practice (PyTorch implementation) – Juejin (juejin.cn)
    • Logistic Regression: Principles and Practice – Juejin (juejin.cn)

Thank you again for your encouragement and support 🌹🌹🌹