Convolutional neural networks (CNNs) were something I could never wrap my head around, partly because the name sounds so intimidating, and partly because most articles on the Internet explaining "what convolution is" are painful to read. After taking Andrew Ng's online course, I suddenly understood what this thing is and why it works. I plan to spend about six or seven articles on CNNs and build some fun applications along the way. By the end, you should be able to make something of your own.

I. Introduction: edge detection

Let's look at the simplest example: edge detection. Suppose we have an 8×8 image like the one below. Each number in the image is the pixel value at that location. As we know, the larger the pixel value, the brighter the pixel, so for illustration the smaller values on the right half are drawn darker. The line between the two colors in the middle is the edge we want to detect.

How do we detect it? We can design a 3×3 filter (also called a kernel) like the one shown. Then we lay the filter over the image so that it covers a filter-sized region, multiply the corresponding elements, and sum them up. Once one region is computed, we slide to the next region and keep computing, until we have covered every corner of the original image. This process is called "convolution".

(We don't need to know what convolution means mathematically; we only need to know how it is used in a CNN.)

The "slide" here involves a step size (stride). If the stride is 1, then after covering one region we move over by one pixel, and it is easy to see that we can cover 6×6 different regions. So let's compute the convolution over each of these 6×6 regions and assemble the results into a matrix. And... hey, what do we find?

In the result, the middle columns are bright and the sides are dark, which means the vertical edge in the middle of our original image shows up right here! The example above tells us that by designing a specific filter and convolving it with the image, we can pick out certain features of the image, such as edges.
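To make the mechanics concrete, here is a minimal NumPy sketch of this whole example. The exact pixel values (10 for the bright half, 0 for the dark half) and the 1/0/-1 filter are assumptions for illustration; the computation is exactly the slide, multiply, and sum described above:

```python
import numpy as np

# Assumed values for the 8x8 example: bright (10) on the left,
# dark (0) on the right, with a vertical edge down the middle.
image = np.zeros((8, 8))
image[:, :4] = 10

# A 3x3 vertical-edge filter (kernel).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

def conv2d(img, k, stride=1):
    """'Valid' convolution: slide the kernel, multiply element-wise, sum."""
    fh, fw = k.shape
    oh = (img.shape[0] - fh) // stride + 1
    ow = (img.shape[1] - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = img[i*stride:i*stride+fh, j*stride:j*stride+fw]
            out[i, j] = np.sum(region * k)
    return out

print(conv2d(image, kernel))
# Every row of the 6x6 result is [0, 0, 30, 30, 0, 0]: bright in the
# middle, dark on the sides, exactly where the original edge is.
```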

The example above detects vertical edges; to detect horizontal edges, we can simply rotate the filter by 90°. In principle, with careful enough design, we could build an appropriate filter for any other feature too. A convolutional neural network (CNN) keeps extracting features this way, filter after filter, from local features up to global ones, and that is how it accomplishes image recognition and other tasks.

So here's the question: how could we possibly design that many kinds of filters by hand? First, for a large collection of images we aren't even sure which features need to be recognized. Second, even when we know a feature, designing the corresponding filter is probably not easy, and the number of features may run into the tens of thousands. In fact, once we know about neural networks, we don't have to design these filters at all. Each number in each filter is just a parameter, isn't it? So we can use a large amount of data and let the machine "learn" these parameters by itself. That is exactly what a CNN does.

II. The basic concepts of CNN

1. Padding

From the introduction above, we know that the original image becomes smaller after being convolved with a filter, from (8,8) down to (6,6). Convolve once more and it becomes (4,4). What's wrong with that? There are two main problems:

  • Every convolution shrinks the image, so after just a few convolutions there would be nothing left;
  • Compared with pixels in the middle of the image, pixels at the edge take part in far fewer convolution computations, so edge information is easily lost.




To solve these problems, we can use padding: before each convolution, we fill in a ring of pixels (typically zeros) around the image, so that the image is the same size after the convolution, and at the same time the original edge pixels get included in more computations. For example, if we pad the (8,8) image to (10,10), then after a (3,3) filter the output is still (8,8), unchanged.

Padding so that the convolution preserves the size is called the "same" mode; convolution without any padding is called the "valid" mode. This is a hyperparameter we need to set when using most frameworks.
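A quick sketch of padding with NumPy (the zero fill is an assumption; any constant behaves the same way):

```python
import numpy as np

image = np.zeros((8, 8))
image[:, :4] = 10

# Pad one ring of zeros around the image: (8,8) -> (10,10).
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (10, 10)

# With the conv2d helper from earlier, a (3,3) filter on the padded
# image gives back an (8,8) output: the "same" mode. Without padding
# ("valid"), the (8,8) image would shrink to (6,6).
```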

2. Stride

The convolutions we've discussed so far all use the default stride of 1, but in fact we can set the stride to other values. For example, for an (8,8) input and a (3,3) filter: with stride=1 the output is (6,6); with stride=2 the output is (3,3). (Not the best example, since (8-3)/2 + 1 = 3.5 has to be rounded down.)
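In general, for an input of side length n, a filter of side f, padding p, and stride s, the output side length is (n + 2p - f)/s + 1, rounding the division down. A tiny helper makes the numbers above concrete:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for input n, filter f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1  # floor division does the rounding down

print(conv_output_size(8, 3, p=0, s=1))  # 6 -> "valid", stride 1
print(conv_output_size(8, 3, p=1, s=1))  # 8 -> "same", stride 1
print(conv_output_size(8, 3, p=0, s=2))  # 3 -> stride 2, 3.5 rounded down
```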

3. Pooling

Pooling extracts the dominant features of each region while shrinking the representation, which reduces the number of parameters in later layers and helps prevent the model from overfitting.

For example, MaxPooling with a 2×2 window and stride=2 takes the maximum of each window. Besides MaxPooling there is also AveragePooling, which, as the name says, takes the average of the region instead.
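A minimal sketch of MaxPooling (the 4×4 input values are made up for illustration):

```python
import numpy as np

def max_pool(img, size=2, stride=2):
    """MaxPooling: take the maximum of each (size x size) window."""
    oh = (img.shape[0] - size) // stride + 1
    ow = (img.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = img[i*stride:i*stride+size,
                            j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [1, 2, 9, 7],
              [3, 0, 8, 4]])
print(max_pool(x))
# [[6. 5.]
#  [3. 9.]]
# AveragePooling would use .mean() instead of .max().
```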

4. Convolution on multi-channel images (Important!)

This deserves its own discussion. Color images generally have three RGB channels, so the input data generally has three dimensions: (height, width, channels).

For example, a 28×28 RGB image has shape (28,28,3). In the earlier introduction, the input image was 2-dimensional, (8,8), the filter was (3,3), and the output was 2-dimensional, (6,6). If the input image is 3-dimensional, say (8,8,3), then our filter's shape becomes (3,3,3): its last dimension must match the channel dimension of the input.

The convolution now sums over all the elements of all three channels, so what used to be a sum of 9 products becomes a sum of 27 products. The dimensionality of the output therefore does not change: it is still (6,6). But in general, we apply multiple filters at the same time. If we use 4 filters at once, for example, the output becomes (6,6,4). I drew the diagram below to illustrate this process: the input image is (8,8,3), there are 4 filters each of size (3,3,3), and the output is (6,6,4).
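Here is a rough NumPy sketch of that diagram: an (8,8,3) input, four (3,3,3) filters stacked into a (3,3,3,4) weight array, and a (6,6,4) output. The values are random, just to check the shapes; real frameworks vectorize this, the loops are for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))       # (height, width, channels)
filters = rng.random((3, 3, 3, 4))  # 4 filters, each (3,3,3)

def conv_multi(img, w):
    fh, fw, _, n_f = w.shape
    oh, ow = img.shape[0] - fh + 1, img.shape[1] - fw + 1
    out = np.zeros((oh, ow, n_f))
    for f in range(n_f):            # one output channel per filter
        for i in range(oh):
            for j in range(ow):
                # sum of all 3*3*3 = 27 products across the channels
                out[i, j, f] = np.sum(img[i:i+fh, j:j+fw, :] * w[..., f])
    return out

print(conv_multi(image, filters).shape)  # (6, 6, 4)
```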

I think the diagram makes it quite clear where the key numbers 3 and 4 come from, so I won't belabor it (it took me at least 40 minutes to draw). In fact, if we look at a CNN with the neural-network notation we learned earlier:

  • Our input image is X, shape=(8,8,3);
  • Our four filters together are W1, shape=(3,3,3,4);
  • Our output is Z1, shape=(6,6,4);
  • After the activation function we get A1 = relu(Z1), still shape=(6,6,4).

So in the earlier diagram I added an activation function and labeled the corresponding parts, like this. Personally, I think it would be a shame not to save such a nice figure.

III. The structure and components of a CNN

We have seen how convolution, pooling, and padding work. Now let's look at the overall structure of a CNN, which contains three kinds of layers:

1. Convolutional layer (CONV)

Consists of filters plus an activation function. The usual hyperparameters to set are the number of filters, their size, the stride, and whether the padding is "valid" or "same". And, of course, which activation function to use.

2. Pooling layer (POOL)

There are no parameters for us to learn here, because everything is set by hand, whether MaxPooling or AveragePooling. The hyperparameters to specify are Max vs. Average, the window size, and the stride. MaxPooling is the most common, usually with a (2,2) window and stride 2; after such a pooling, the height and width of the input are halved while the channels stay unchanged.

3. Fully Connected layer (FC)

I haven't said much about this one because it is the layer we know best: the ordinary layer from the neural networks we studied before, a row of neurons. Because every unit in this layer is connected to every unit in the previous layer, it is called "fully connected".

The hyperparameters to specify here are simply the number of neurons and the activation function. Next, let's look at an example CNN to get an intuitive feel for the whole thing. The CNN below is one I made up off the top of my head. Its structure can be written as:

X → CONV(relu) → MAXPOOL → CONV(relu) → FC(relu) → FC(softmax) → Y

One point worth making: after a couple of rounds of convolution and pooling, the multi-dimensional data is first "flattened", that is, the (height, width, channel) data is compressed into a one-dimensional array of length height × width × channel, and then connected to the FC layer. From there on, it works just like an ordinary neural network. As the figure shows, the deeper we go into the network, the smaller our images get (strictly speaking, the intermediate results can't really be called images, but let's say so for convenience) while the channels get larger and larger. In the figure, the face of each cuboid pointing toward us keeps shrinking while its length keeps growing.
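Here is a minimal Keras sketch of that made-up architecture. The input shape (28,28,3), the filter counts, and the 10 output classes are placeholder assumptions, not values from the figure:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 3)),                                 # X
    layers.Conv2D(8, (3, 3), padding='valid', activation='relu'),   # CONV(relu)
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),               # MAXPOOL
    layers.Conv2D(16, (3, 3), padding='valid', activation='relu'),  # CONV(relu)
    layers.Flatten(),                     # flatten to height*width*channel
    layers.Dense(32, activation='relu'),      # FC(relu)
    layers.Dense(10, activation='softmax'),   # FC(softmax) -> Y
])
model.summary()  # prints each layer's output shape and parameter count
```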

IV. Convolutional neural networks vs. traditional neural networks

Looking back now, a CNN is really not that different from the neural networks we studied before. A traditional neural network is just a stack of FC layers; a CNN merely swaps some FC layers for CONV and POOL layers, that is, it replaces layers made of traditional neurons with layers made of filters. Why make the change? What are the benefits? There are two:

1. Parameter sharing

Let's compare a traditional fully connected layer with a CONV layer made of filters:

Suppose our image is 8×8, i.e. 64 pixels, and we use a fully connected layer with 9 units. How many parameters does this layer need? 64×9 = 576 parameters (ignoring the bias term b for now), because every connection needs its own weight w. Now look at how a filter with the same number of cells, 9, does its job: however many cells the filter has, that's how many parameters there are, so just 9 parameters! Because every region of the image uses the same filter, they all share the same set of parameters, as the sketch below confirms.
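We can verify the two counts with Keras (bias terms disabled, as in the text; the layer shapes mirror the example above):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Fully connected: 64 inputs -> 9 units, one weight per connection.
fc = keras.Sequential([keras.Input(shape=(64,)),
                       layers.Dense(9, use_bias=False)])
print(fc.count_params())    # 576

# Convolutional: one 3x3 filter shared across the whole 8x8 image.
conv = keras.Sequential([keras.Input(shape=(8, 8, 1)),
                         layers.Conv2D(1, (3, 3), use_bias=False)])
print(conv.count_params())  # 9
```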

This also makes intuitive sense. From the earlier explanation, we know that a filter detects a feature, and a given feature is likely to appear in more than one place: a "vertical edge", for instance, may show up many times in a single picture. So sharing one filter is not only reasonable, it is exactly the right thing to do. The parameter-sharing mechanism drastically reduces the number of parameters in our network, so we can train better models with fewer parameters, getting twice the result for half the effort, while effectively avoiding overfitting.

Similarly, because the filter's parameters are shared, we can still recognize a feature even when the image is shifted around. This is called "translation invariance", and it makes the model more robust.

2. Sparsity of connections

By the way convolution operates, any unit of the output is related only to a part of the input image. In a traditional neural network, by contrast, everything is fully connected, so every output unit is affected by every input unit, which greatly dilutes the recognition of the image. Each region has its own distinctive characteristics, and we don't want them to be influenced by other regions.

It is thanks to these two advantages that CNNs surpassed traditional NNs and opened a new era for neural networks.