What is a convolutional neural network and why is it important?
Convolutional neural networks (also known as ConvNets or CNNs) are a type of neural network that has proved very effective in image recognition and classification. Besides giving robots and autonomous vehicles the power of vision, convolutional neural networks have also been used successfully to recognize faces, objects and traffic signs.
Figure 1
As shown in Figure 1, a convolutional neural network can recognize the scene in a picture and suggest a relevant caption ("Football player is playing football"). Figure 2 shows an example of convolutional neural networks being used to recognize everyday objects, humans and animals. More recently, convolutional neural networks have also played an important role in natural language processing tasks such as sentence classification.
Figure 2
Convolutional neural networks are therefore an important tool for most machine learning practitioners today. However, understanding convolutional neural networks and learning to use them for the first time can be a painful experience. The main purpose of this article is to build an understanding of how convolutional neural networks process images.
If you are new to neural networks, I recommend reading this short tutorial on multilayer perceptrons and getting a sense of how they work before proceeding. The multilayer perceptron appears in this article as the "fully connected layer".
LeNet Architecture (1990s)
LeNet was one of the very first convolutional neural networks, and it helped propel the field of deep learning. This pioneering work by Yann LeCun was named LeNet5 after many successful earlier iterations since 1988. At the time, the LeNet architecture was used mainly for character recognition tasks such as reading zip codes and digits.
Next, we will take a visual look at how the LeNet architecture learns to recognize images. Several new architectures that improve on LeNet have been proposed in recent years, but they all build on the same basic ideas as LeNet, and they are relatively easy to understand once you understand LeNet clearly.
Figure 3: A simple convolutional neural network
The convolutional neural network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into one of four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As can be seen from the figure above, when it receives a boat image as input, the network correctly assigns the highest probability (0.94) to the boat category among the four. The sum of all probabilities in the output layer should be 1 (explained later in this article).
The convolutional neural network in Figure 3 performs four main operations:
- convolution
- Nonlinear transformation (ReLU)
- Pooling or subsampling
- Classification (Full connection layer)
These operations are the basic components of all convolutional neural networks, so understanding how they work is an important step in understanding convolutional neural networks. Let’s try to understand each operation visually.
An image is a matrix of pixel values
Essentially, each image can be represented as a matrix of pixel values.
Figure 4: Each image is a pixel matrix
A channel is a traditional term for a particular component of an image. A photo taken with a standard digital camera has three channels — red, green and blue — that you can think of as three stacked two-dimensional matrices (one for each color), each with a pixel value between 0 and 255.
Grayscale images have only one channel. For the purposes of this article, we only consider the grayscale image, that is, a two-dimensional matrix representing the image. The values of each pixel in the matrix range from 0 to 255 — 0 for black and 255 for white.
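As a tiny sketch of this representation (the pixel values below are arbitrary), an image is simply an array of numbers:

```python
import numpy as np

# A grayscale image is a single 2D matrix of pixel intensities in [0, 255]
gray = np.array([[  0,  50, 255],
                 [ 30, 200, 120],
                 [255,   0,  90]], dtype=np.uint8)   # a tiny 3x3 example

# A color photo has three such matrices stacked, one each for R, G and B
rgb = np.zeros((3, 128, 128), dtype=np.uint8)        # 3 channels, 128x128 pixels
```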
Convolution
Convolutional neural networks get their name from the "convolution" operation. In a convolutional neural network, the main purpose of convolution is to extract features from the input image. Convolution preserves the spatial relationships between pixels by learning image features over small squares of the input data. We will not go into the mathematical details of convolution here, but will try to understand how it works over images.
As mentioned above, every image can be viewed as a matrix of pixel values. Consider a 5 × 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where the pixel values are only 0 and 1):
Also, consider another 3×3 matrix, as shown below:
The convolution of the 5 × 5 image with the 3 × 3 matrix is shown in the animation in Figure 5:
Figure 5: Convolution operation. The output matrix is called the convolution feature or the feature map.
Let's take a moment to understand how this computation is done. We slide the orange matrix over the original (green) image by 1 pixel at a time (this is also called the "stride"), and for every position we compute the element-wise product of the two matrices and add the results to obtain an integer that forms a single element of the output (pink) matrix. Note that the 3 × 3 matrix "sees" only a part of the input image at each step.
In the terminology of convolutional neural networks, this 3 × 3 matrix is called a “filter” or a “kernel” or a “feature detector”, and the matrix obtained by moving the filter across the image and computing the dot product is called a “convolution feature” or an “activation map” or a “feature map”. It is important to note that the filter acts as a feature detector for the original input image.
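To make this sliding-window computation concrete, here is a minimal NumPy sketch of the operation described above (stride 1, no padding). The function name `convolve2d_valid` and the example matrices are illustrative only and are not taken from the figures.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image`, taking an element-wise product and
    summing it at each position ("valid" convolution, i.e. no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5x5 binary image and a 3x3 filter (example values only)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d_valid(image, kernel))  # 3x3 feature map
```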
It is obvious from the above animation that different filter matrices will produce different feature maps for the same input image. For example, consider the following input image:
In the table below, we can see the effects of convolving the above image with different filters. As shown, just by changing the numeric values of the filter matrix before the convolution operation we can perform different operations such as edge detection, sharpening and blurring [8]; this means that different filters can detect different features of an image, such as edges and curves. More such examples can be found here in Section 8.2.4.
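To give a rough sense of how the filter values determine the operation, here are a few standard 3 × 3 kernels of the kind listed in [8]; applying them to a grayscale image with the `convolve2d_valid` sketch above would produce the corresponding effects. The exact values used in the table above may differ.

```python
import numpy as np

# Standard kernels as commonly listed in references on image kernels [8]
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])
box_blur = np.ones((3, 3)) / 9.0   # average of the 3x3 neighborhood
```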
Another good way to understand the convolution operation is to refer to the animation in Figure 6 below:
Figure 6: Convolution operation
A filter (red outline) slides over the input image (the convolution operation) to produce a feature map. On the same image, convolution with a different filter (green outline) produces a different feature map, as shown. It is important to note that the convolution operation captures local dependencies in the original image. Also note how the two different filters produce different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices.
In practice, a convolutional neural network learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, the filter size and the network architecture before training). The more filters we have, the more image features get extracted, and the better the network becomes at recognizing new images.
The size of the feature map (convolution feature) is controlled by three parameters [4] that we need to decide before performing the convolution step:
- Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown in Figure 7, we convolve the original boat image with three distinct filters, producing three different feature maps. You can think of these three feature maps as stacked two-dimensional matrices, so the "depth" of the feature map is 3.
Figure 7.
- Stride: The stride is the number of pixels by which we slide the filter matrix over the input matrix. When the stride is 1, we move the filter one pixel at a time; when the stride is 2, the filter jumps two pixels at a time. A larger stride produces a smaller feature map.
- Zero padding: Sometimes it is convenient to pad the border of the input matrix with zeros so that the filter can also be applied to the bordering elements of the input image matrix. A nice property of zero padding is that it lets us control the size of the feature map. Convolution with zero padding is also called wide convolution, while convolution without zero padding is called narrow convolution. This is explained clearly in [14]. (A short sketch after this list shows how these parameters determine the output size.)
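As a small sketch of how filter size, stride and zero padding interact, the spatial size of the feature map can be computed with the standard formula from [4]; the helper function below is illustrative only.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size of the feature map for a square input and filter,
    using the standard formula (input - filter + 2 * padding) / stride + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# The 5x5 image convolved with a 3x3 filter, stride 1, no padding -> 3x3 map
print(conv_output_size(5, 3, stride=1, padding=0))  # 3
# With one pixel of zero padding ("wide" convolution) the map stays 5x5
print(conv_output_size(5, 3, stride=1, padding=1))  # 5
```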
Introducing nonlinearity (ReLU)
As shown in Figure 3 above, an additional operation called ReLU is applied after every convolution operation. ReLU stands for Rectified Linear Unit and is a nonlinear operation. Its output is shown below:
Figure 8: ReLU function
ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. The purpose of applying ReLU after convolution is to introduce nonlinearity into the neural network, since most of the real-world data we want our network to learn is nonlinear, whereas convolution is a linear operation (element-wise matrix multiplication and addition); we account for nonlinearity by introducing a nonlinear function such as ReLU.
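As a minimal NumPy sketch (the feature-map values below are made up for illustration), the ReLU operation is a single element-wise maximum:

```python
import numpy as np

# ReLU applied element-wise: every negative value becomes zero
feature_map = np.array([[ 0.77, -0.11,  0.11],
                        [-1.11,  1.00, -0.11],
                        [ 0.33,  0.33, -0.33]])   # example values only
rectified = np.maximum(feature_map, 0)
print(rectified)
```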
The ReLU operation can be understood clearly from Figure 9, which shows the result of applying ReLU to one of the feature maps in Figure 6. The output feature map here is also referred to as the "rectified" feature map.
Figure 9: ReLU operation
Other nonlinear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
Pooling
Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map while retaining the most important information. Spatial pooling can be of different types: max, average, sum, etc.
In the case of max pooling, we define a spatial neighborhood (for example, a 2 × 2 window) and take the largest element from the rectified feature map within that window. We could also take the average of all elements in that window (average pooling) or the sum of all elements. In practice, max pooling has been shown to work better.
Figure 10 shows an example of max pooling applied to a rectified feature map (obtained after the convolution + ReLU operation) using a 2 × 2 window.
Figure 10: Maximum pooling
We slide our 2 × 2 window by 2 cells (also called the "stride") and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
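The following is a minimal NumPy sketch of 2 × 2 max pooling with stride 2, as described above; the 4 × 4 feature-map values are made up for illustration.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep the largest value in each (size x size) window."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

# A 4x4 rectified feature map (example values) pooled down to 2x2
fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
print(max_pool(fmap))  # [[6. 8.] [3. 4.]]
```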
In the network shown in Figure 11, pooling is applied separately to each feature map (so we get three output maps out of three input maps).
Figure 11: Pooling applied to rectified feature maps
Figure 12 shows the effect of pooling on the rectified feature map obtained after the ReLU operation in Figure 9.
Figure 12: pooling
The function of pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling:
- Makes the input representations (feature dimension) smaller and more manageable
- Reduces the number of parameters and computations in the network, thereby helping to control overfitting [4]
- Makes the network robust to small transformations, distortions and translations in the input image (a small distortion in the input will not change the pooled output, since we take the maximum/average value in a local neighborhood)
- Helps us arrive at an almost scale-invariant representation of our image (the exact term is "equivariant"). This is very powerful, since we can detect objects in an image no matter where they are located (see [18] and [19] for details).
At this point…
Figure 13
So far we have seen how convolution, ReLU and pooling work. It is important to understand that these are the basic building blocks of any convolutional neural network. As shown in Figure 13, we have two sets of convolution, ReLU and pooling layers: the second convolution layer performs convolution on the output of the first pooling layer using six filters to produce a total of six feature maps. ReLU is then applied individually to each of these six feature maps. Next, we perform max pooling on each of the six rectified feature maps.
Together, these two intermediate layers extract useful features from the images, introduce nonlinearity into the network, and reduce feature dimensionality while aiming to make the features somewhat equivariant to scale and translation [18].
The output of the second pooling layer is the input of the fully connected layer, which we discuss in the next section.
Fully connected layer
The fully connected layer is a traditional multilayer perceptron that uses a softmax activation function in the output layer (other classifiers such as SVM can also be used, but this article sticks to softmax). The term "fully connected" means that every neuron in the previous layer is connected to every neuron in the next layer. If you are not familiar with multilayer perceptrons, I recommend reading this article.
The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs, as shown in Figure 14 (note that Figure 14 does not show all the connections between the nodes in the fully connected layer).
Figure 14: Fully connected layer — each node is connected to other nodes in the adjacent layer
Apart from classification, adding a fully connected layer is also a (usually) cheap way of learning nonlinear combinations of these features. Most of the features obtained from the convolutional and pooling layers may be good for the classification task on their own, but combinations of those features may be even better [11].
The sum of the output probabilities from the fully connected layer is 1. This is ensured by using softmax as the activation function in the output layer of the fully connected layer. The softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between 0 and 1 that sum to 1.
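Here is a minimal NumPy sketch of the softmax function described above; the class scores are arbitrary example values.

```python
import numpy as np

def softmax(scores):
    """Squash arbitrary real-valued scores into probabilities that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / np.sum(exps)

# Example scores for the four classes (dog, cat, boat, bird) - illustrative only
print(softmax(np.array([1.2, 0.5, 3.1, 0.2])))  # sums to 1, largest score wins
```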
Putting it all together: training with backpropagation
As discussed above, the convolution + pooling layers act as feature extractors from the input image, while the fully connected layer acts as a classifier.
Note that in Figure 15, since the input image is a boat, the target probability is 1 for the boat class and 0 for the other three classes:
- Input image = boat
- Target vector = [0, 0, 1, 0]
Figure 15: Training convolutional neural network
The overall training process of the convolutional network can be summarized as follows:
- Step 1: Initialize all filters and parameters/weights with random values
- Step 2: The network takes a training image as input and goes through the forward propagation step (convolution, ReLU and pooling operations, along with forward propagation in the fully connected layer) to find the output probabilities for each class.
- Suppose the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
- Since the weights are randomly assigned for the first training example, the output probabilities are also random.
- Step 3: Calculate the total error at the output layer (summed over all four classes); a small numeric sketch of this computation is given after this list.
- Total Error = Σ ½ (target probability − output probability)²
- Step 4: Use backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values/weights and parameter values to minimize the output error.
- The weights are adjusted in proportion to their contribution to the total error.
- When the same image is input again, the output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
- This means the network has learned to classify this particular image correctly by adjusting its weights/filters so that the output error is reduced.
- Parameters such as the number of filters, the filter sizes and the architecture of the network are all fixed before Step 1 and do not change during training; only the values of the filter matrices and the connection weights get updated.
- Step 5: Repeat steps 2-4 for all images in the training set.
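Here is a minimal numeric sketch of the Step 3 error calculation, using the target vector and the example output probabilities from the steps above.

```python
import numpy as np

# Target vector for the boat image and the (random-weight) output probabilities,
# both taken from the example in the text
target = np.array([0.0, 0.0, 1.0, 0.0])
output = np.array([0.2, 0.4, 0.1, 0.3])

# Step 3: total error = sum over the four classes of 1/2 * (target - output)^2
total_error = np.sum(0.5 * (target - output) ** 2)
print(total_error)  # 0.55
```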
The above steps train the convolutional neural network; this essentially means that all the weights and parameters of the network have been optimized to correctly classify the images in the training set.
When a new (unseen) image is fed into the convolutional neural network, the network performs the forward propagation step and outputs a probability for each class (for the new image, the output probabilities are computed using the weights that were previously optimized to correctly classify the training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into the correct categories.
Note 1: In order to provide an intuitive training process, the above steps have been simplified and the mathematical derivation has been ignored. For mathematical derivation and a thorough understanding of convolutional neural networks, see [4] and [12].
Note 2: In the example above we used two sets of alternating convolution and pooling layers. Note, however, that these operations can be repeated any number of times in a single convolutional neural network. In fact, some of the best-performing convolutional neural networks today contain tens of convolution and pooling layers! Also, it is not necessary to have a pooling layer after every convolution layer. As can be seen in Figure 16 below, we can have multiple convolution + ReLU operations in succession before a single pooling operation. Also notice how each layer of the convolutional neural network is visualized in Figure 16.
Figure 16
Visualization of convolutional neural networks
In general, the more convolution steps we have, the more complicated the features our network will be able to learn and recognize. For example, in image classification, a convolutional neural network may learn to detect edges from raw pixels in the first layer, then use those edges to detect simple shapes in the second layer, and then use those shapes to detect higher-level features, such as facial shapes, in higher layers [14]. Figure 17 illustrates this process; these features were learned using a convolutional deep belief network, and the figure is included here only to illustrate the idea (this is just an example: real convolution filters may detect objects that have no meaning to humans).
Figure 17: Convolutional deep belief network learning features
Adam Harley created an interactive visualization of a convolutional neural network trained on the MNIST handwritten digit dataset [13]. I highly recommend playing around with it to understand the details of how a convolutional neural network works.
In the figure below, we can see how the network operates on an input image of the digit "8". Note that the ReLU operation is not shown separately in Figure 18.
Figure 18: Visualization of convolutional neural network based on handwritten digit training
The input image contains 1024 pixels (a 32 × 32 image), and the first convolution layer (convolution layer 1) is formed by convolving six distinct 5 × 5 filters (with stride 1) with the input image. As shown in the figure, using six different filters produces a feature map of depth six.
Convolution layer 1 is followed by pooling layer 1, which performs 2 × 2 max pooling (with stride 2) separately over each of the six feature maps of convolution layer 1. If you move your mouse pointer over any pixel in the pooling layer, you can see that it comes from a 2 × 2 grid in the previous convolution layer (see Figure 19). Note that the pixel with the maximum value (the brightest one) in the 2 × 2 grid makes it to the pooling layer.
Figure 19: Visualization of pooling operations
Pooling layer 1 is followed by convolution layer 2, which has sixteen 5 × 5 filters (stride 1) that perform the convolution operation. Next comes pooling layer 2, which performs 2 × 2 max pooling (with stride 2). These two layers use the same concepts described above.
Then there are three fully connected (FC) layers:
- The first FC layer has 120 neurons
- The second FC layer contains 100 neurons
- The third FC layer has 10 neurons corresponding to the 10 digits; it is also known as the output layer
Notice in Figure 20 how each of the 10 nodes in the output layer is connected to all 100 nodes in the second fully connected layer (hence the name "fully connected").
Also, note that the only bright node in the output layer corresponds to "8": this means the network correctly classified our handwritten digit (a brighter node denotes a higher output, i.e. 8 has the highest probability among all the digits).
Figure 20: Fully connected layer visualization
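For readers who prefer code, the following is a rough PyTorch sketch of a network similar to the one visualized above: a 32 × 32 grayscale input, two convolution/pooling stages, and 120-100-10 fully connected layers. The layer sizes follow the description in the text; everything else (including the use of PyTorch itself) is an assumption for illustration, not part of the original demo.

```python
import torch.nn as nn

# A network similar to the visualized one (layer sizes taken from the text above)
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # conv layer 1: six 5x5 filters     -> 6 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                # pooling layer 1: 2x2, stride 2    -> 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5),   # conv layer 2: sixteen 5x5 filters -> 16 x 10 x 10
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                # pooling layer 2                   -> 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # first FC layer: 120 neurons
    nn.ReLU(),
    nn.Linear(120, 100),               # second FC layer: 100 neurons
    nn.ReLU(),
    nn.Linear(100, 10),                # output layer: 10 digit classes
)
```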
A 3D version of the visualization system is available here.
Other convolutional neural network architectures
Convolutional neural networks have been around since the early 1990s. We have already discussed LeNet, one of the very first convolutional neural networks. Some other influential architectures are listed below [3] [4].
- LeNet (1990s): Already covered in detail in this article.
- 1990s to 2012: From the late 1990s to the early 2010s, convolutional neural networks were in their incubation period. As more data and more computing power became available, the tasks that convolutional neural networks could tackle became more and more interesting.
- AlexNet (2012): In 2012, Alex Krizhevsky (and others) released AlexNet, a deeper and wider version of LeNet that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin in 2012. It was a significant breakthrough over previous approaches, and the current widespread use of CNNs can be attributed to AlexNet.
- ZF Net (2013): The ILSVRC 2013 winner was a convolutional network from Matthew Zeiler and Rob Fergus, which became known as ZFNet (short for Zeiler & Fergus Net). It improved on AlexNet by tweaking the architecture hyperparameters.
- GoogLeNet (2014): The ILSVRC 2014 winner was a convolutional network from Szegedy et al. at Google. Its main contribution was the development of the Inception module, which dramatically reduced the number of parameters in the network (4M, compared to AlexNet's 60M).
- VGGNet (2014): The runner-up in ILSVRC 2014 was the network that became known as VGGNet. Its main contribution was showing that the depth of the network (the number of layers) is a critical component of good performance.
- ResNets (2015): The residual network developed by Kaiming He (and others) was the ILSVRC 2015 winner. ResNets were the state-of-the-art convolutional neural network models and the default choice for using convolutional neural networks in practice (as of May 2016).
- DenseNet (August 2016): Recently published by Gao Huang et al., the Densely Connected Convolutional Network has each layer directly connected to every other layer in a feed-forward fashion. DenseNet has been shown to achieve significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. Please refer to this website for the implementation details.
Conclusion
In this article, I have tried to explain the main concepts behind convolutional neural networks in simple terms and have simplified/skipped over a few details, but I hope this article gives you an intuitive understanding of how they work.
This article was originally inspired by Denny Britz's "Understanding Convolutional Neural Networks for NLP" (recommended reading), on which many of the explanations here are based. To understand some of these concepts in more depth, I encourage you to read the notes from the Stanford course on convolutional neural networks, as well as the other excellent resources listed in the references below. If you have any questions or suggestions about the concepts above, please feel free to leave a comment below.
All images and animations used in this article belong to their respective authors and are displayed below.
References
- karpathy/neuraltalk2: Efficient Image Captioning code in Torch, Examples
- Shaoqing Ren, et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 2015, arXiv:1506.01497
- Neural Network Architectures, Eugenio Culurciello’s blog
- CS231n Convolutional Neural Networks for Visual Recognition, Stanford
- Clarifai/Technology
- Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
- Feature extraction using convolution, Stanford
- Wikipedia article on Kernel (image processing)
- Deep Learning Methods for Vision, CVPR 2012 Tutorial
- Neural Networks by Rob Fergus, Machine Learning Summer School 2015
- What do the fully connected layers do in CNNs?
- Convolutional Neural Networks, Andrew Gibiansky
- A. W. Harley, "An Interactive Node-Link Visualization of Convolutional Neural Networks," in ISVC, pages 867-877, 2015 (link). Demo
- Understanding Convolutional Neural Networks for NLP
- Backpropagation in Convolutional Neural Networks
- A Beginner’s Guide To Understanding Convolutional Neural Networks
- Vincent Dumoulin, et al., “A Guide to Convolution arithmetic for Deep learning”, 2015, arXiv:1603.07285
- What is the difference between deep learning and usual machine learning?
- How is a convolutional neural network able to learn invariant features?
- A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
- Honglak Lee, et al, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations (Link)