1. Introduction
Image analysis encompasses many kinds of tasks, such as classification, object detection, recognition, and description. An image classifier should maintain high accuracy under changes such as occlusion, variations in illumination, and changes in viewpoint. Traditional image classification methods, in which hand-crafted feature engineering is the main step, are poorly suited to such rich environments: even domain experts cannot provide a feature set that achieves high accuracy under all of these variations, nor guarantee that hand-selected features are appropriate. This problem motivated the idea of feature learning, which is also one of the reasons artificial neural networks (ANNs) are robust for image analysis tasks. Using gradient descent (GD) and other learning algorithms, an ANN can learn image features automatically: once the raw image is fed into the network, the ANN generates features that describe it.
2. Image analysis with fully connected networks
Let's look at how an artificial neural network handles this task, and why a CNN is more efficient than a fully connected network in both time and memory. As shown in Figure 1, the input is a 3×3 grayscale image. The small image size is chosen purely for illustration; it does not mean that ANNs can only handle small images.
Figure 1
When the image is fed into the ANN, it is first converted into a pixel matrix. Since an ANN takes a one-dimensional vector rather than a two-dimensional matrix as input, the two-dimensional grayscale image is flattened into a one-dimensional vector, where each pixel corresponds to one input neuron.
Figure 2
Each pixel is mapped to one vector element, and each element of the vector represents one neuron in the ANN. Since the image has 3×3 = 9 pixels, the input layer has 9 neurons. Because ANN diagrams typically extend horizontally, each layer is drawn as a column vector.
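The flattening step above can be sketched in a few lines of NumPy (the variable names are illustrative, not from the original article):

```python
# A toy sketch: flatten a 3x3 grayscale image into a 1-D vector
# so that each pixel becomes one input neuron of the ANN.
import numpy as np

image = np.arange(9, dtype=np.float32).reshape(3, 3)  # 3x3 pixel matrix
input_vector = image.flatten()                        # shape (9,), one entry per input neuron

print(input_vector.shape)  # (9,)
```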
The input layer is connected to a hidden layer; the input layer's output is fed to the hidden layer, which learns how to convert image pixels into representative features. Suppose, as in Figure 3, there is a single hidden layer with 16 neurons.
Figure 3
Since the network is fully connected, every neuron in layer i is connected to all neurons in layer i-1. Thus each neuron in the hidden layer is connected to all 9 neurons in the input layer; equivalently, each input pixel is connected to all 16 neurons in the hidden layer, and each connection carries its own parameter (weight). Connecting every pixel to every hidden neuron, as shown in Figure 4, gives the network 9×16 = 144 parameters (weights).
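This parameter count can be verified with a minimal sketch (biases are ignored here, just as in the text above):

```python
# Illustrative check: a fully connected layer from 9 input neurons
# to 16 hidden neurons has one weight per connection, so the
# weight matrix itself holds all 9 * 16 = 144 parameters.
import numpy as np

n_inputs, n_hidden = 9, 16
W = np.random.randn(n_hidden, n_inputs)  # one row of weights per hidden neuron
hidden = W @ np.ones(n_inputs)           # every hidden neuron sees all 9 pixels

print(W.size)  # 144 parameters
```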
Figure 4
3. Lots of parameters
The number of parameters in the example above seems acceptable, but it grows rapidly as the input image gets larger and more hidden layers are added.
For example, if the network has two hidden layers with 90 and 50 neurons respectively, then the number of parameters between the input layer and the first hidden layer is 9×90 = 810, the number between the two hidden layers is 90×50 = 4500, and the total is 810 + 4500 = 5310 parameters. That is clearly too many for such a simple network. Another case is a larger input image: for a 32×32 image (1024 pixels) and a single hidden layer of 500 neurons, there are 1024×500 = 512000 parameters (weights), a huge number for a network with only one hidden layer. There must therefore be a way to reduce the number of parameters, and this is where the convolutional neural network (CNN) comes in. Although a CNN model is usually deep, its number of parameters is greatly reduced.
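The arithmetic above generalizes to any stack of fully connected layers; a small helper (hypothetical, not from the article) makes the growth explicit:

```python
# Parameter counts (weights only, biases ignored as in the text)
# for a fully connected network with the given layer widths.
def fc_params(layer_sizes):
    """Number of weights between consecutive fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

small = fc_params([9, 90, 50])     # 810 + 4500 = 5310
large = fc_params([32 * 32, 500])  # 1024 * 500 = 512000
print(small, large)
```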
4. Grouping neurons
Even in a small fully connected network, the parameter count becomes very large because every connection between neurons in adjacent layers has its own parameter. One remedy is to let a group of neurons share the same parameter: as shown in Figure 5, all neurons within a group are assigned the same weight.
Figure 5
This greatly reduces the number of network parameters. Taking Figure 4 as an example, if every 4 consecutive neurons form a group, the parameter count drops by a factor of 4: each input neuron now has 16/4 = 4 parameters, and the whole network has 144/4 = 36 parameters, a 75% reduction. This works well, but there is still room for optimization.
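The grouping arithmetic can be sketched as follows (the helper name is hypothetical; it only counts parameters, it does not build a network):

```python
# If every `group_size` consecutive hidden neurons share one weight
# per input pixel, only one weight per (input, group) pair remains.
def shared_params(n_inputs, n_hidden, group_size):
    assert n_hidden % group_size == 0, "groups must tile the hidden layer"
    n_groups = n_hidden // group_size
    return n_inputs * n_groups

print(shared_params(9, 16, 4))  # 36 parameters, a 75% reduction from 144
```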
Figure 6
Figure 7 shows the connections between each pixel and the first neuron of each group; however, every pixel is still connected to every neuron, so the network remains fully connected.
Figure 7
For simplicity, pick one group and ignore the others, as shown in Figure 8. As the figure shows, each group is still connected to all nine neurons in the input layer and therefore still has nine parameters.
Figure 8
5. Spatial correlation of pixels
So far, each neuron receives all of the pixels. Consider a function f(x1, x2, x3, x4) that takes four inputs and therefore makes its decision based on all four. If only two of the inputs actually affect the result, and using just those two produces the same output, the other two inputs are unnecessary. Applying this idea here, each neuron currently receives 9 pixels of input; if the same or better results can be obtained with fewer pixels, the number of parameters can be greatly reduced, so the network can be optimized in this direction. In image analysis, the input image is converted into a pixel matrix in which each pixel is highly correlated with its neighbors, and the correlation decreases as the distance between two pixels grows. For example, as shown in Figure 9, a face pixel is correlated with the pixels around the face, but has little correlation with pixels of the sky or the ground.
Figure 9
Based on this assumption, each neuron in the example above receives only the pixels that are spatially correlated with one another, rather than all nine pixels; thus four spatially adjacent pixels can be selected for each neuron, as shown in Figure 10. For pixel position (0,0), the most spatially relevant pixels are those at (0,1), (1,0), and (1,1). All neurons in a group share the same weights, so the four neurons in each group have only four parameters instead of nine, and the total becomes 4×4 = 16 parameters. Compared with the fully connected network of Figure 4, this removes 128 parameters (an 88.89% reduction).
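Combining weight sharing with these local 2×2 receptive fields gives the count claimed above; a quick check (illustrative arithmetic only):

```python
# Local connectivity + weight sharing: 4 groups of neurons, each
# group sharing one weight per pixel of its 2x2 receptive field.
n_groups = 4             # groups of hidden neurons, one shared weight set each
receptive_field = 2 * 2  # each neuron sees only 4 spatially adjacent pixels
local_shared = n_groups * receptive_field  # 16 parameters

fully_connected = 9 * 16                   # 144 parameters in Figure 4
reduction = 1 - local_shared / fully_connected
print(local_shared, round(reduction * 100, 2))  # 16 88.89
```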
Figure 10
6. Convolutional Neural Networks (CNN)
Because weights in a CNN are heavily shared and far fewer parameters are used, a CNN typically has many layers, something a fully connected network cannot afford. Now only four weights are assigned to all neurons in the same group, so how do those four weights cover nine pixels? Let's see how this is handled. Figure 11 shows one of the networks from Figure 10, with the weight labeled on each connection. Inside the neuron, each of the four input pixels is multiplied by its corresponding weight, as in the formula shown in Figure 11.
Figure 11
Suppose the step size (stride) of each move is set to 1 (you can choose the stride yourself). After each multiplication, the pixel index is shifted by one position, and the weight matrix is multiplied by the next set of pixels; this continues until the entire pixel matrix has been covered. The whole process is exactly a convolution operation: the group's weights are convolved with the image matrix, which is why CNN contains the word "convolution".
Figure 12
The remaining groups of neurons do the same, sliding their weight matrices from the upper-left corner of the pixel matrix to the lower-right corner.
About the author: Ahmed Gad is a lecturer specializing in deep learning and computer vision.

This article, "Convolutional Neural Network from Fully Connected Network Step-by-step", was translated by the Alibaba Yunqi Community.