Source | turbine cloud community

Original article | How convolutional neural networks work

Author | instter

When there is a major breakthrough in deep learning, nine times out of ten a convolutional neural network (CNN) is involved. CNNs, also known as ConvNets, are a major development in the field of deep neural networks, and they can classify certain images even more accurately than humans can. If any technique lives up to the high expectations placed on deep learning, CNNs are the leading candidate.

What’s especially great about them is that they’re easy to understand, at least once you break them down into their basic parts. I’ll walk you through those parts below. The accompanying video (listed in the references) discusses these images in more detail; in the original interactive article, clicking a picture jumps to the corresponding point in the video.

X’s and O’s

To illustrate CNNs, we can start with a very simple example: deciding whether the symbol in a picture is a cross (X) or a circle (O). This example is a good illustration of how a CNN works, while staying simple enough to avoid getting bogged down in unnecessary details. The CNN’s job here is this: every time we give it a picture, it tells us whether the symbol on it is an X or an O. It always decides one or the other.

First, consider the easiest way to identify a picture: directly compare the new picture against reference pictures of an X and an O and see which one it resembles more. But it isn’t that simple, because computers compare images very rigidly. To a computer, an image is just a bunch of pixels arranged in a two-dimensional matrix (like a chessboard), each with a position and a value. In our example, a white square (a stroke) has the value 1 and a black square (the background) has the value -1. So when comparing images, if any square has a different value, the computer considers the two images different. Ideally, we want the computer to interpret symbols correctly even when they are shifted, shrunk, rotated, or deformed. That’s where CNNs come in.
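
To see why rigid comparison fails, here is a minimal sketch in Python with NumPy. The 9×9 pixel pattern is an illustrative assumption, not taken from the article:

```python
import numpy as np

def naive_match(image_a, image_b):
    """Rigid pixel-by-pixel comparison: equal only if every square agrees."""
    return np.array_equal(image_a, image_b)

# A 9x9 "X": strokes are 1, background is -1 (hypothetical pattern).
x_image = -np.ones((9, 9), dtype=int)
for i in range(9):
    x_image[i, i] = 1        # one diagonal stroke
    x_image[i, 8 - i] = 1    # the other diagonal stroke

shifted_x = np.roll(x_image, 1, axis=1)   # the same X, shifted one pixel right

print(naive_match(x_image, x_image))    # True
print(naive_match(x_image, shifted_x))  # False: the computer calls them different
```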

Features

A CNN compares two pictures piece by piece, and the pieces it compares are called features. By matching rough features in roughly the same positions, a CNN does a far better job of telling whether two pictures are similar than whole-image comparison ever could.

Each feature is like a miniature image: a smaller two-dimensional matrix. Features capture elements that are common to images of the same symbol. In the image of an X, the most important features include the diagonal strokes and the crossing in the middle; any image of an X should match these features along its strokes and at its center.

Convolution

When a CNN sees a new image, it doesn’t know in advance where the features will appear, so it tries matching them at every possible position. In computing a match score across the whole image, each feature acts as a filter. The mathematics behind this matching is called convolution, which is where CNNs get their name.

The basic idea of convolution is to measure how well a feature lines up with a local patch of the image: multiply each pixel of the feature by the corresponding pixel of the patch, add up the products, and divide by the total number of pixels. If both pixels are white (value 1), the product is 1 × 1 = 1; if both are black (value -1), the product is (-1) × (-1) = 1. In other words, every pair of matching pixels contributes 1 and every pair of differing pixels contributes -1. If all the pixel pairs match, summing the products and dividing by the pixel count gives 1; if all of them differ, we get -1.
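
As a minimal sketch of this single-patch calculation (the 3×3 diagonal feature below is the kind of feature described above, with assumed values):

```python
import numpy as np

def match_score(feature, patch):
    """Multiply corresponding pixels, sum the products, divide by pixel count."""
    return (feature * patch).sum() / feature.size

# A 3x3 diagonal-stroke feature: strokes are 1, background is -1.
feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

print(match_score(feature, feature))    # 1.0: every pixel matches
print(match_score(feature, -feature))   # -1.0: every pixel differs
```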

We complete the convolution by repeating this process, lining the feature up with every possible patch of the image. We can then build a new two-dimensional matrix out of the match scores, placing each score at the position its patch came from. This is the original image filtered by the feature: a map telling us where the feature is found in the original image. Values close to 1 mean a strong match, values close to -1 mean a strong mismatch, and values close to 0 mean hardly any resemblance either way.
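
The whole filtering step might look like the following sketch, which slides one feature over every patch and records the match score at each position (it only scores positions where the feature fits entirely inside the image):

```python
import numpy as np

def convolve(image, feature):
    """Slide the feature over every position in the image and record the
    average pixel product at each spot, producing the filtered image."""
    fh, fw = feature.shape
    ih, iw = image.shape
    filtered = np.zeros((ih - fh + 1, iw - fw + 1))
    for row in range(filtered.shape[0]):
        for col in range(filtered.shape[1]):
            patch = image[row:row + fh, col:col + fw]
            filtered[row, col] = (patch * feature).sum() / feature.size
    return filtered
```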

The next step is to repeat the same procedure for each of the other features, convolving every part of the image. We end up with a set of filtered images, one per feature. It is convenient to treat this whole collection of convolutions as a single processing step, called the convolution layer, a name that hints more layers are to follow.

From how CNNs operate, it is not hard to see that they consume a lot of computing resources. Although we can explain a CNN’s arithmetic on a single sheet of paper, the number of additions, multiplications, and divisions adds up quickly. The operation count scales with the product of (i) the number of pixels in the image, (ii) the number of pixels in each feature, and (iii) the number of features. With so many factors driving the workload, CNN problems can easily become enormous, and it is no wonder some chip manufacturers design and build special-purpose chips just for CNNs.

Pooling

Another powerful tool in the CNN toolbox is pooling. Pooling shrinks an image while keeping the most important information: it slides a window across the image and, within each window position, keeps only the maximum value. In practice, a window 2 or 3 pixels on a side, moved in strides of 2 pixels, works well.

After pooling with a 2-pixel window and stride, the image has about a quarter as many pixels as before, but because each pooled value is the maximum of its window, it still records how well each region of the original image matched the feature. In other words, the pooled result cares more about whether a feature appears somewhere in a region than about exactly where it appears. This helps the CNN recognize that an image contains a feature without being distracted by the feature’s precise position, which is exactly the rigidity problem we saw at the start.
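
A minimal sketch of max pooling, using the 2-pixel window and 2-pixel stride mentioned above:

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Keep only the maximum value inside each window, shrinking the image
    while preserving the best match found in each region."""
    h, w = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            pooled[r, c] = feature_map[r * stride:r * stride + window,
                                       c * stride:c * stride + window].max()
    return pooled

pooled = max_pool(np.random.rand(8, 8))
print(pooled.shape)  # (4, 4): a quarter of the original pixels
```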

So a pooling layer takes one or more images and produces smaller versions of them. We end up with the same number of images, each with fewer pixels. This also relieves the computational load mentioned earlier: shrinking an 8-megapixel image down to 2 megapixels up front makes all the remaining work much lighter.

Rectified linear units (ReLU)

Another subtle but important step is the Rectified Linear Unit (ReLU), which is also mathematically simple: it converts every negative number in a filtered image to zero. This trick keeps the CNN’s math from breaking down, preventing the values flowing through it from getting stuck near zero or blowing up toward infinity. It is the axle grease of the CNN: nothing glamorous, but the CNN won’t get far without it.

The output of the ReLU step has the same number of pixels as its input; the only change is that every negative value has been replaced with zero.
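
The entire step fits in one line; a minimal sketch:

```python
import numpy as np

def relu(feature_map):
    """Replace every negative value with zero; leave the rest unchanged."""
    return np.maximum(feature_map, 0)

print(relu(np.array([[ 0.8, -0.3],
                     [-1.0,  0.1]])))
# [[0.8 0. ]
#  [0.  0.1]]
```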

Deep learning

By now the reader has probably noticed that each layer takes two-dimensional matrices as input and produces two-dimensional matrices as output, which means the layers can be stacked like Lego bricks. The original image is filtered, rectified, and pooled into a set of smaller images carrying feature information; these can then be filtered and compressed again, the features growing more complex and the images smaller with each pass. The lower layers end up holding simple features such as edges or spots of light, while higher layers hold more complex features such as shapes or patterns. These high-level features are usually easy to recognize: in a face-recognition CNN, for instance, the highest layers respond to complete faces.
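
As a sketch of this stacking, the three steps above can be chained into one reusable block. Here scipy’s `correlate2d` handles the sliding-window arithmetic, and the feature values and layer sizes are assumptions, not values from the article:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(image, features):
    """Convolution layer: one filtered image (average pixel product) per feature."""
    return [correlate2d(image, f, mode="valid") / f.size for f in features]

def relu(m):
    return np.maximum(m, 0)

def max_pool(m, k=2):
    h, w = (m.shape[0] // k) * k, (m.shape[1] // k) * k
    return m[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def block(image, features):
    """One Lego brick: convolve, rectify, then pool each filtered image."""
    return [max_pool(relu(m)) for m in conv_layer(image, features)]

# Stacking: feed each pooled output of one block into the next (hypothetical shapes).
image = np.random.rand(32, 32)
features = [np.random.choice([-1, 1], size=(3, 3)) for _ in range(4)]
layer1 = block(image, features)                 # 4 maps of 15x15
layer2 = [block(m, features) for m in layer1]   # smaller, more abstract maps
```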

Fully connected layers

Finally, the CNN has one last secret weapon: fully connected layers. A fully connected layer takes the filtered images from the higher layers and translates their feature information into votes. In our example there are two options to vote for: X or O. Fully connected layers are the primary building block of traditional neural networks. When this layer receives an image, it treats all the pixel values as a single one-dimensional list rather than a two-dimensional matrix. Every value in the list gets a say in whether the picture shows an X or an O, but the election isn’t entirely democratic: some values are better at spotting an X and some are better at spotting an O, so those values cast more votes than the others. The size of each value’s vote for each option is expressed as a weight, or connection strength.

So whenever the CNN judges a new image, the image first passes through the many lower layers and then reaches the fully connected layer. After the vote, the option with the most votes becomes the category of the image.
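
A minimal sketch of the vote; the values, weights, and category order here are hypothetical:

```python
import numpy as np

# Four high-level feature strengths, flattened into a one-dimensional list.
values = np.array([0.9, 0.0, 0.4, 0.7])

# Each row holds one value's voting strength for each option (its weights).
weights = np.array([[ 0.9, -0.2],   # this value is good at spotting an X
                    [ 0.1,  0.1],
                    [-0.3,  0.8],   # this value is good at spotting an O
                    [ 0.0,  0.6]])

votes = values @ weights                  # total votes for each option
categories = ["X", "O"]
print(categories[int(np.argmax(votes))])  # the option with the most votes wins
```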

Like the other layers, fully connected layers can be combined, because their outputs (lists of votes) look just like their inputs (lists of values). In practice, several fully connected layers are often chained together, with the intermediate ones voting for made-up “hidden” categories. Each additional fully connected layer lets the network learn more sophisticated combinations of features and make more accurate judgments.

Backpropagation

The story so far looks good, but it leaves a big question unanswered: where do the features come from? And how do we find the weights in the fully connected layers? If everything had to be chosen by hand, CNNs would be nowhere near as popular as they are. Fortunately, a machine learning technique called backpropagation does this work for us.

To use backpropagation, we need a collection of images for which we already know the answer. That means sitting down and labeling thousands of images as X or O. We then prepare an untrained CNN in which every pixel of every feature and every weight in every fully connected layer is set to a random value, and train that CNN on the labeled pictures.

For each training image, the CNN’s processing ends in a vote that determines the category. The error in that vote, measured against the known correct answer, tells us how good our current features and weights are. We can then adjust the features and weights to shrink the error: each value is nudged a little higher and a little lower, the error is recomputed each time, and whichever nudge reduces the error is kept. After adjusting every feature pixel in the convolution layers and every weight in the fully connected layers, we have a set of values slightly better at judging the current image. We then repeat the process with the other labeled images. Over the course of training, quirks of individual images are quickly washed out, but features and weights that help across many images stick around. Given enough labeled images, these values eventually settle into a stable state that performs well on most inputs.
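
A sketch of the nudge-and-keep idea described above. Real backpropagation reaches the same adjustments analytically with calculus rather than trying each value by hand, and `error_fn` here is a stand-in for running the whole network and measuring the voting error:

```python
import numpy as np

def nudge(weights, error_fn, step=0.01):
    """Try moving each weight slightly up and down; keep whichever change
    reduces the error, undo whichever does not."""
    w = weights.copy()
    for idx in np.ndindex(w.shape):
        for delta in (step, -step):
            before = error_fn(w)
            w[idx] += delta
            if error_fn(w) >= before:
                w[idx] -= delta   # the nudge didn't help, so undo it
    return w

# Toy usage: find weights whose "votes" approach a known correct answer.
target = np.array([1.0, -1.0])

def error(w):
    return float(np.sum((w - target) ** 2))

w = nudge(np.zeros(2), error)
# One sweep moves each weight a small step toward the target;
# training repeats such sweeps over many labeled images.
```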

Needless to say, backpropagation is also a computationally expensive step, which is another reason manufacturers build specialized hardware.

Hyperparameters

Unfortunately, not every aspect of a CNN can be learned in this way. A CNN designer still has a long list of decisions to make, including the following.

  • How many features should each convolution layer have? How many pixels should each feature have?
  • What window size should each pooling layer use? What stride?
  • How many hidden neurons (voting options) should each additional fully connected layer have?

Beyond these questions, there are higher-level architectural decisions to consider, such as how many layers the CNN should have and in what order. Some deep neural networks have thousands of layers, so the design space is enormous.
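
These decisions are often written down together as a configuration. A hypothetical example for the X-vs-O network (none of these numbers come from the article) might look like this:

```python
# Hypothetical hyperparameter choices for the X-vs-O network.
cnn_config = {
    "conv_layers": [
        {"num_features": 8,  "feature_pixels": 3},   # features per layer, pixels per side
        {"num_features": 16, "feature_pixels": 3},
    ],
    "pooling": {"window": 2, "stride": 2},           # window size and step
    "fully_connected": [
        {"hidden_neurons": 64},                      # hidden voting options
        {"hidden_neurons": 2},                       # final vote: X or O
    ],
}
```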

With so many possible combinations, only a tiny fraction of CNN settings can ever be tested. As a result, CNN designs tend to evolve along with the accumulated knowledge of the machine learning community, with occasional unexpected jumps in performance. And while we have covered the basic CNN building blocks, many other refinements have been tried and found effective, such as new kinds of layers or more intricate ways of connecting layers to one another.

Beyond images

Our X-and-O example is about image recognition, but CNNs can handle other kinds of data too. The trick is to transform the data into something that looks like an image. For instance, we can chop an audio signal into short time slices and split each slice into bass, midrange, treble, and higher frequency bands. This information can be arranged into a two-dimensional matrix in which rows represent moments in time and columns represent frequency bands. In this “fake picture”, pixels that sit close together are closely related, which is exactly what CNNs are good at exploiting. Researchers have gotten quite creative here, converting text data for natural language processing and chemical data for drug discovery into forms a CNN can process.
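
A minimal sketch of this audio-to-image trick; the window length, hop size, and test tone are assumptions for illustration:

```python
import numpy as np

def sound_to_image(samples, window=256, hop=128):
    """Chop the audio into short time slices and measure the strength of each
    frequency band in every slice: rows = time, columns = low-to-high frequency."""
    frames = [samples[i:i + window] for i in range(0, len(samples) - window, hop)]
    return np.array([np.abs(np.fft.rfft(frame)) for frame in frames])

# One second of a 440 Hz tone sampled at 8 kHz (hypothetical input).
t = np.linspace(0, 1, 8000, endpoint=False)
image = sound_to_image(np.sin(2 * np.pi * 440 * t))
print(image.shape)  # (time slices, frequency bands): a picture-like matrix
```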

By contrast, consider customer data in which each row represents one customer and each column records the customer’s name, email address, purchases, or browsing history. This kind of data is not a form a CNN can process, because here the positions of the rows and columns carry no meaning; they could be shuffled arbitrarily without losing any information. Pixels in an image, on the other hand, generally lose their meaning when rows or columns are rearranged.

So a rule of thumb for using CNNs: if the data is just as useful after swapping any of its rows or columns, it is not suited to CNN processing. But if a problem can be framed to look like image recognition, a CNN is probably the ideal tool.

Further reading: the CS231n lecture notes

References

  1. e2eml.school/how_convolu…
  2. brohrer.mcknote.com/zh-Hant/how…
  3. www.youtube.com/watch?v=Fmp…