
Some of the content has been added and supplemented; please click here to read the original article.

 

The original author's presentation is helpful for understanding the mechanics of convolution itself, but it does not explain the practical meaning of convolution very clearly, and in places it may even get in the way of understanding. That is why I have made fairly large changes to the original while translating it; I hope you find it helpful.

 

In fact, convolutional neural networks grew out of research in neuroscience: their computational process mimics the computation performed by the visual nervous system. For that part of the story, please read other articles.

 

For the TensorFlow side of this topic, see my blog post: Convolution and Pooling functions.

 

Introduction

I used to regret not having taken the time to really understand deep learning. For a while I tried to make sense of it through research papers and articles on the topic, and it felt like a very complicated subject. I tried to understand neural networks and their various types, but it just seemed out of reach.

Then one day, I decided to take it one step at a time and build the foundations from scratch. I decided to break the techniques down into their individual steps and work through the computations by hand until I understood how they worked. It took a big effort, but the gains were equally spectacular.

Now, not only can I understand the scope of deep learning, I can even come up with improvements, because my foundations are finally in place.

Today, I want to share my learning experience with you. I'm going to show you how to understand convolutional neural networks, taking you along the journey I went through myself, so that you gain a deep understanding of how CNNs work.

In this article I will discuss the architecture behind convolutional neural networks, which are designed to solve image recognition and classification problems.

I assume you have a basic understanding of how neural networks work. If you are unsure of your foundations, this article will help you.

 

1. How does a machine “see” an image?

The human brain is a very powerful machine. We see and process multiple images every second without ever noticing how much computation is involved; for a machine, that is not so easy. The first step in image processing is therefore to understand how to represent an image so that a machine can read it.

The representation is simple: every image is stored as a matrix of pixel values. If you change the order or the colour of the pixels, the image changes too. Let's take an example. Say you want to store and read a picture with the number 4 written on it.

The machine will "read" the image as a matrix of pixels and store the colour code of each pixel at its position. In the illustration below, 1 is white and 256 is the darkest shade of green (it should really be black, but for ease of representation each pixel value is shown as a shade of a single colour, green).
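To make the idea concrete, here is a minimal sketch: an entirely hypothetical 5 by 5 array of pixel intensities standing in for the picture of the digit 4. The machine only ever sees this grid of numbers.

import numpy as np

# A hypothetical 5 x 5 grayscale "image": each entry is one pixel intensity.
# As in the illustration, small values are light and large values are dark.
image = np.array([
    [  1,   1, 200,   1,   1],
    [  1, 200, 200,   1,   1],
    [200,   1, 200,   1,   1],
    [200, 200, 200, 200, 200],
    [  1,   1, 200,   1,   1],
])

print(image.shape)  # (5, 5) - the machine "sees" nothing but this matrix of numbers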

 

 

Once the image’s storage format is determined, the next challenge is for our neural network to understand the arrangements and patterns.

 

2. How can we help a neural network recognize images?

A digit is represented by different arrangements and combinations of pixels.

 

 

First, suppose we tried to identify it using a traditional fully connected network. What would the effect be?

A fully connected network would take this photo as an array, flattening it and treating the pixel values as features for predicting the digit. The result looks like this:

This is a representation of the number 4. We have completely lost the spatial arrangement of the pixels, and the result is hard for anyone, human or machine, to make sense of.

What should we do instead? We should extract features from the image while preserving their spatial arrangement, so that the machine can learn to "read" the image. How? Perhaps the following examples will help us understand how a machine "understands" an image.
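Reusing the hypothetical array from the sketch above, this is roughly what a fully connected network receives after flattening: a single row of numbers in which the neighbourhood relationships between pixels are no longer visible.

# Flatten the same hypothetical 5 x 5 image into a 1-D array, which is how a
# fully connected network would consume it; the 2-D arrangement is lost.
flat = image.flatten()
print(flat.shape)  # (25,)
print(flat)        # one long row of pixel values with no spatial structure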

 

Case 1:

Here we multiply the original pixel values by a single weight.

 

To the naked eye this is still easily recognizable as a 4. But if we feed the photo to a fully connected network again, it becomes a one-dimensional array; this method does not preserve the spatial arrangement of the image.

It seems that this approach does not help the machine learn to "see" images.

 

Case 2:

From the case above, we can see that flattening the image to one dimension completely destroys its spatial arrangement. We need to devise a way to feed images to the network without flattening them, preserving their spatial layout.

Instead of looking at one pixel at a time, let's try taking two pixel values of the image at once. This gives the network a view of what two adjacent pixels look like together. And now that we take two pixels at a time, we also take two weights to go with them.

 

 

Notice that the image now goes from a four-column arrangement to a three-column one. Because the weights cover two pixels at a time, and each move shares a pixel with the previous position, the image becomes smaller. Also, an important fact to realize: here we are only considering the horizontal arrangement, using two consecutive horizontal pixels; when we consider the vertical arrangement, we will use two consecutive weights in the vertical direction.
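Here is a minimal sketch of this computation; the pixel and weight values below are made up for illustration. Sliding a pair of weights across a row of four pixels yields three output values.

import numpy as np

# One hypothetical row of four pixel values and a pair of weights.
row = np.array([10, 20, 30, 40], dtype=float)
weights = np.array([1.0, 0.3])

# Slide the two weights along the row one position at a time: every pair of
# adjacent pixels produces one output value, so 4 inputs give 3 outputs.
out = np.array([row[i] * weights[0] + row[i + 1] * weights[1]
                for i in range(len(row) - 1)])
print(out)  # [16. 29. 42.]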

This is one way of extracting features from an image. Look at the different parts of the output: the right side is no longer as pronounced as before. This is due to two reasons:

  1. The pixels in the leftmost and rightmost columns are multiplied by a weight only once.
  2. The left part stays darker because it is multiplied by the higher weight value, while the right part becomes slightly lighter because of the lower weight.

Now we have two problems, and we have two solutions to solve these problems.

 

Case 3:

The problem is that the left and right edges of the image are covered by the weights only once. The solution: allow the pixels on the left and right edges to be multiplied by the weights as many times as any other pixel in the image.

We have a simple solution to solve this problem: add columns with zero values to both sides of the image.

As shown in the figure, by adding a column with a value of zero, the edge information is preserved and the size of the image increases. We can use this method when we don’t want the image size to decrease.

Similarly, zero padding lets us keep the sizes of the input and output images the same. These two different convolution modes can be selected with a simple parameter in TensorFlow; how to calculate the output size from the input size is covered later in this article.
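Here is a small NumPy sketch of the zero-padding trick (the 4 by 4 input values are arbitrary): one ring of zeros is added around the image so that the edge pixels get covered by the sliding weights as often as the interior ones.

import numpy as np

# A hypothetical 4 x 4 image padded with one ring of zeros on every side.
image = np.arange(1, 17, dtype=float).reshape(4, 4)
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (6, 6) - the padded image is larger than the original
print(padded)        # zeros surround the original pixel values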

 

Case 4:

The problem we are trying to solve here is that the smaller weight value reduces the pixel values on the right side, making it harder for our neural network to actually read the figure. What we can do is use multiple weight values and put their outputs together after the calculation.

One weight, (0.3), gives us one form of output,

while another weight, (0.1, 5), gives us a different form of output.

A combination of these two outputs gives us a much clearer picture. So what we did was simply use multiple weights instead of a single one, in order to retain more information about the image. The final output in this example is a combined version of the two images.

 

Case 5:

So far we have used the weights to deal with horizontal pixels only. But in most cases we need to preserve the spatial arrangement of the image in both the horizontal and the vertical direction. We can use a two-dimensional matrix of weights and multiply the pixels horizontally and vertically at the same time. Also keep in mind that, because the weights move both horizontally and vertically, the output loses one pixel in each of the two dimensions, i.e. the output image is smaller than the input image.

Special thanks to Jeremy Howard for inspiring me to create these visuals.

What exactly did we do in the above case?

What we did above was extract features from the image while preserving its spatial arrangement. One of the most important aspects of understanding an image is understanding how its pixels are arranged. And what we did is essentially what a convolutional neural network does: we convolve the input image with a weight matrix (the convolution kernel) that we define, so as to obtain the desired result.

Another benefit of this approach is that it reduces the number of parameters

3. Define a convolutional neural network

We need three basic components to define a basic convolutional network.

  1. Convolution layer
  2. Pooling layer (optional)
  3. Output layer

Let’s look at these in detail

 

3.1 The convolution layer

This layer performs exactly what we saw in Case 5 above. Suppose we have an image of size 6 by 6. We define a weight matrix to extract certain features from the image.

We initialize the weights as a 3 by 3 matrix. This weight matrix now slides over every pixel of the image, giving a convolved output.

The 6 by 6 image is now converted into a 4 by 4 image. Think of the weight matrix as a paintbrush painting a wall: the brush first paints one row of the wall horizontally, then comes down and paints the second row, then the third, until the entire wall is painted. As the weight matrix moves along the image, pixel values are reused by neighbouring positions, while the same weights are applied everywhere on the image. This is what enables parameter sharing in a convolutional neural network.
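Here is a minimal NumPy sketch of this computation, assuming a random 6 by 6 image and a hand-picked 3 by 3 weight matrix (the kernel values are illustrative, not learned).

import numpy as np

image = np.random.rand(6, 6)                  # hypothetical 6 x 6 input image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # illustrative 3 x 3 weight matrix

out_size = image.shape[0] - kernel.shape[0] + 1   # 6 - 3 + 1 = 4
output = np.zeros((out_size, out_size))
for i in range(out_size):                     # move down one row at a time
    for j in range(out_size):                 # move right one column at a time
        patch = image[i:i + 3, j:j + 3]       # the 3 x 3 window of pixels
        output[i, j] = np.sum(patch * kernel)  # weighted sum = one output pixel

print(output.shape)  # (4, 4) - the 6 x 6 image becomes a 4 x 4 convolved output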

Let’s see the real effect.

 

The weight matrix behaves like a filter, extracting particular information from the original image matrix. One combination of weights might extract edges, another might extract a specific colour, and yet another might blur unwanted noise; by applying different convolution kernels, the computer can extract different features of the image, or perform other operations on it.

The weights are learned by minimizing a loss function, just as in an MLP. Features are therefore extracted from the original image that help the network make correct predictions. When we have multiple convolutional layers, the initial layers extract more generic features, while deeper in the network the features extracted by the weight matrices become more and more complex and more specific to the image recognition task at hand.

 

The concepts of stride and padding

In our previous cases, the filter, or weight matrix, moved across the image one pixel at a time. The number of pixels it moves by is called the stride; here is an example with a stride of 2.

As you can see, the image size keeps shrinking as we increase the stride. Padding the image with zeros helps us solve this problem, as follows:

We can see how the original shape of the image is preserved once we pad it with zeros. This is called "same" padding, since the output image has the same size as the input.

In this way, we retain more information from the image boundaries and preserve the size of the image.
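A short Keras sketch of the two modes (assuming a Keras/TensorFlow setup with channels-last data and a hypothetical 6 by 6 single-channel input): "valid" padding with stride 2 shrinks the image, while "same" padding with stride 1 keeps its size.

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D

x = np.random.rand(1, 6, 6, 1)  # one hypothetical 6 x 6 single-channel image

valid = Sequential([Conv2D(1, (3, 3), strides=(2, 2), padding="valid",
                           input_shape=(6, 6, 1))])
same = Sequential([Conv2D(1, (3, 3), strides=(1, 1), padding="same",
                          input_shape=(6, 6, 1))])

print(valid.predict(x).shape)  # (1, 2, 2, 1) - stride 2, no padding: output shrinks
print(same.predict(x).shape)   # (1, 6, 6, 1) - "same" padding keeps the 6 x 6 size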

 

Multiple filters and activation mappings

One thing to keep in mind is that the depth dimension of the weights is the same as the depth dimension of the input image: each filter extends through the entire depth of the input. As a result, convolving with a single weight matrix produces an output of depth one. Meanwhile, since in most cases we apply multiple convolution kernels together, the number of output images grows with the number of kernels.

The outputs of the individual filters are stacked together to form the depth of the convolved image. Suppose we have an input image of size 32 * 32 * 3 and we apply 10 filters of size 5 * 5 * 3 with valid convolution (no zero padding). Then the output of the convolutional layer will have dimensions 28 * 28 * 10.

You can picture it like this:
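As a quick check of the dimensions quoted above, here is a minimal Keras sketch (assuming channels-last data) that applies 10 filters of size 5 by 5 to a 32 * 32 * 3 input.

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential([
    Conv2D(filters=10, kernel_size=(5, 5), strides=(1, 1), padding="valid",
           input_shape=(32, 32, 3))   # 32 x 32 x 3 input, no zero padding
])
model.summary()  # the output shape is (None, 28, 28, 10), as computed above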

 

3.2 The pooling layer

Sometimes the images are too big and we need to reduce the number of trainable parameters. We can then add pooling layers between the convolution layers. The sole purpose of pooling is to reduce the spatial size of the image. Pooling is applied independently on each depth slice, so the depth of the image remains unchanged. The most common form of pooling layer is max pooling (another approach is average pooling).

At the same time, the pooling layer also strengthens the robustness of the network, so that small shifts of the target within the image have less influence on the final prediction.

Here we take a stride of 2 and a pooling size of 2. The max pooling operation is applied to each depth slice of the convolved output. As you can see, the 4 * 4 convolved output becomes 2 * 2 after the max pooling operation.
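A small NumPy sketch of max pooling with a 2 by 2 window and stride 2, applied to a made-up 4 by 4 convolved output:

import numpy as np

conv_out = np.array([[1, 3, 2, 4],
                     [5, 6, 1, 2],
                     [7, 2, 9, 1],
                     [3, 4, 5, 6]], dtype=float)  # hypothetical 4 x 4 output

# Split the 4 x 4 array into 2 x 2 blocks and keep the maximum of each block.
pooled = conv_out.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6. 4.]
#  [7. 9.]]  - the 4 x 4 map has become 2 x 2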

Let’s look at maximum pooling in action.

As you can see, I have applied max pooling to a complex image. The max pooled image still retains the information that this is a car on a street. If you look closely, the size of the image has been halved. This greatly reduces the computation.

There are other forms of pooling as well, such as average pooling and L2-norm pooling.

 

Output size

The above might leave you confused about the output size of each layer, so I decided to use this section to help you work out the output dimensions. In a convolution layer, three parameters control the size of the output:

  1. The number of filters – the depth of the output volume equals the number of filters applied. Each filter (convolution kernel) outputs one image, so as the number of kernels grows, the number of output images grows with it.
  2. Stride – controls how many pixels the convolution kernel moves by at each step. A larger stride skips over more pixels and thus produces a smaller output.
  3. Zero padding – this helps us preserve the size of the input image. If one ring of zero padding is added around the original image and the stride is one, the output retains the size of the original image.

We can apply a simple formula to calculate the output size. The spatial size of the output image is (W - F + 2P)/S + 1, where W is the size of the input image, F is the size of the convolution kernel, P is the amount of padding applied and S is the stride. Suppose we have an input image of size 32 * 32 * 3 and we apply 10 filters of size 3 * 3 * 3, with a stride of one and no zero padding.

W = 32, F = 3, P = 0 and S = 1. The output depth is equal to the number of filters applied, which is 10.

Each spatial dimension of the output will be (32 - 3 + 0)/1 + 1 = 30. Therefore, the output volume will be 30 * 30 * 10.
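A tiny helper function (just a sketch of the formula above) that reproduces this calculation:

def conv_output_size(W, F, P, S):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# The example from the text: 32 x 32 x 3 input, 3 x 3 x 3 filters,
# no zero padding and a stride of 1 -> each spatial dimension becomes 30.
print(conv_output_size(W=32, F=3, P=0, S=1))  # 30
# With 10 filters, the full output volume is 30 x 30 x 10.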

 

3.3 The output layer

In image recognition, after multiple layers of convolution and pooling, we need to produce the output in the form of a class. The convolution and pooling layers only extract features from the original image; to generate the final output, we need a fully connected layer after them, with as many outputs as there are classes in our classification problem. The image at the end of the convolution layers is just a set of features, and what we need to output is whether the image belongs to a particular class. The output layer has a loss function such as categorical cross-entropy to compute the prediction error. Once the forward pass is complete, backpropagation begins to update the weights and biases so as to reduce the error and the loss.

 

4. Putting it all together: what does the whole network look like?

A CNN, as we can now see, consists of various convolution and pooling layers. Let's take a look at what such a network looks like.

  • The input image is first passed through a convolution layer, and we obtain the convolved output as an activation map. The convolution filters applied in this layer extract relevant features from the input image to pass further on.
  • Each filter gives a different feature that helps predict the correct class. If we need to preserve the size of the image we use "same" padding (zero padding); otherwise "valid" padding is used, since it helps reduce the number of features.
  • Pooling layers are then added to further reduce the number of parameters.
  • Several convolution and pooling layers are added before the prediction is made. The convolution layers help extract features; the deeper we go in the network, the more specific the extracted features become, whereas the shallower layers extract more generic features.
  • As mentioned earlier, the output layer of a CNN is a fully connected layer, in which the input from the preceding layers is flattened into one dimension and transformed into the number of classes the network is meant to predict.
  • The generated output is then compared with the true labels to obtain the error. A loss function defined on the fully connected output layer computes the mean squared loss, and the gradient of the error is then computed.
  • The error is then backpropagated to update the filter weights and the bias values.
  • One training cycle is completed in a single forward and backward pass.

 

5. Using a CNN in Keras to classify images

Let's work through an example: we take some pictures of cats and dogs as input and try to classify the images by the animal they show. This is a typical image recognition and classification problem. What the machine needs to do is look at the picture and understand the various features that indicate whether it is a cat or a dog. These features might be extracted edges, or whiskers extracted from the cats, and so on; the role of the convolution layers is to extract such features. The dataset can be found here.

Here are some example images from the dataset.

   

We first need to resize these images so that they all have the same shape. This is almost always necessary when processing images, since it is impossible to capture all images at the same size.

To keep things easy to follow, I have used just one convolution layer and one pooling layer here, which is usually not what we do when we are actually trying to make predictions.

 

# import the various packages
import os
import numpy as np
import pandas as pd
import scipy
import sklearn
import keras
from keras.models import Sequential
import cv2
from skimage import io
%matplotlib inline

# define the file paths
cat=os.listdir("/mnt/hdd/datasets/dogs_cats/train/cat")
dog=os.listdir("/mnt/hdd/datasets/dogs_cats/train/dog")
filepath="/mnt/hdd/datasets/dogs_cats/train/cat/"
filepath2="/mnt/hdd/datasets/dogs_cats/train/dog/"

# load the images
# note: scipy.misc.imread requires an older SciPy version with Pillow installed
images=[]
label=[]
for i in cat:
    image = scipy.misc.imread(filepath+i)
    images.append(image)
    label.append(0) #for cat images

for i in dog:
    image = scipy.misc.imread(filepath2+i)
    images.append(image)
    label.append(1) #for dog images

# resize all images to the same shape
for i in range(0,23000):
    images[i]=cv2.resize(images[i],(300,300))

# convert the images and labels to arrays
images=np.array(images)
label=np.array(label)

# define the hyperparameters
filters=10
filtersize=(5,5)

epochs=5
batchsize=128

input_shape=(300,300,3)

# convert the target variable to the required size (one-hot encoding)
from keras.utils.np_utils import to_categorical
label = to_categorical(label)

# define the model: one convolution layer and one pooling layer
model = Sequential()
model.add(keras.layers.InputLayer(input_shape=input_shape))
model.add(keras.layers.convolutional.Conv2D(filters, filtersize, strides=(1,1), padding='valid', data_format="channels_last", activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2,2)))
model.add(keras.layers.Flatten())

model.add(keras.layers.Dense(units=2, input_dim=50, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(images, label, epochs=epochs, batch_size=batchsize, validation_split=0.3)

model.summary()

 

In this model, I used only a single convolution and pooling layer, and the number of trainable parameters is 219,801. If I had used an MLP in this case, how many parameters would there have been? You can further reduce the number of parameters by adding more convolution and pooling layers. The more convolution layers we add, the more complex the network structure and training become, but the results also tend to improve.
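As a purely illustrative sketch, this is roughly how one might add a second convolution-plus-pooling block to the model above; the filter counts and kernel sizes are arbitrary, not tuned.

import keras
from keras.models import Sequential

input_shape = (300, 300, 3)  # same input size as the model above

deeper = Sequential()
deeper.add(keras.layers.InputLayer(input_shape=input_shape))
deeper.add(keras.layers.Conv2D(10, (5, 5), padding='valid', activation='relu'))
deeper.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
deeper.add(keras.layers.Conv2D(20, (3, 3), padding='valid', activation='relu'))  # second block
deeper.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
deeper.add(keras.layers.Flatten())
deeper.add(keras.layers.Dense(units=2, activation='softmax'))

deeper.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
deeper.summary()  # compare layer shapes and parameter counts with the single-block model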

 

Final remarks

I hope this article has given you a sense of what convolutional neural networks are. I haven't delved into the complicated mathematics of CNNs here. If you'd like to understand more, stay tuned; there is more to come. Try building your own CNN, see how it works, and make predictions on images. Please let me know what you find in the comments section.