Preface

With the rise of front-end intelligence, AI and machine learning have entered the field of vision of front-end developers. AI can solve problems in programming that can't be solved directly by rules and calculation, generating good strategies through automatic learning, making it another powerful tool for front-end engineers.

Many of you may have been tempted to open TensorFlow or PyTorch and write a machine learning Hello World, only to run into functions you didn't recognize. That's because TensorFlow and PyTorch are tools for using machine learning; they don't explain what machine learning is. So this article introduces some basic principles of machine learning, plus convolution for image processing, in the hope of helping you understand.

Basic concepts

First, what is machine learning? Machine learning is roughly equivalent to finding a function. In speech recognition, the input is a speech clip and the output is text;

In image recognition, the input is an image and the output is the object in the image;

In Go, the input is the board state and the output is the next move;

In a dialogue system, the input is a "hi" and the output is a response.

And this is a function that isn't written by hand: the machine learns it on its own from large amounts of data.

So how do we find such a function? Let's start with the linear model. Linear models are simple in form and easy to model, but they contain some of the important basic ideas of machine learning, and many more powerful nonlinear models can be built on top of them by introducing hierarchical structure or high-dimensional mappings.

Linear model

Let's look at cat-vs-dog classification. When we teach a child to tell a cat from a dog, we don't give a textbook definition; we keep showing the child cats and dogs, let the child judge, then tell them the correct answer and correct the wrong perceptions. Machine learning is the same: keep telling the computer what is right and correcting the computer's cognition. The difference is that a child's cognition is built automatically by the human brain, while a computer can't automatically construct a memory of cats and dogs.

So we need to extract features that represent cats and dogs, and put numbers on them. To simplify the example, we only use two features here, the size of the nose and the shape of the ears. Generally speaking, cats have smaller noses and pointier ears, while dogs have larger noses and rounder ears.

We’ve plotted the ear and nose features of multiple images in a two-dimensional coordinate system, and you can see that cats and dogs are located in different regions of the coordinate system.

With the naked eye we can draw a dividing line, but the computer can't see where to draw it. So how do we give that information to the computer? Let's define two variables: x1 for the size of the nose and x2 for the shape of the ear, and define the line w1*x1 + w2*x2 - b = 0, which is the same as y = w1*x1 + w2*x2 - b. If y > 0, it's a cat; if y < 0, it's a dog.
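
A minimal sketch of this classifier in Python (the feature values, weights and bias here are made up for illustration; in practice the machine learns them):

import numpy as np

# Hypothetical weights and bias; in practice these are learned
w = np.array([-1.5, 2.0])   # w1 weights nose size (big nose -> dog), w2 weights ear pointiness
b = 0.5

def classify(x):
    """x = [nose size, ear shape]; returns 'cat' if y > 0, else 'dog'."""
    y = np.dot(w, x) - b
    return "cat" if y > 0 else "dog"

print(classify(np.array([0.2, 0.9])))  # small nose, pointy ears -> cat
print(classify(np.array([0.8, 0.1])))  # big nose, round ears -> dog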

Now, from a computer’s point of view, it has a bunch of data,

And a linear model,

One more thing is needed: a goal. Our expectation is that when we feed in an unknown x, f(x) gives a predicted value y as close to the real value as possible; then we have a useful pet classifier! How can such a goal be expressed in numbers? This introduces the concept of the loss function, which measures the difference between the predicted value and the true value.

A commonly used loss function is the absolute-value loss, the absolute value of the difference between the two values. It is very intuitive, and the gaps between predictions and targets can simply be added up.

There is also the squared (least-squares) loss function.

The goal of the squared loss is to minimize the distance from each point to the regression line, computed as the Euclidean distance. So our goal for the computer becomes: find the w that minimizes this loss.
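
A minimal NumPy sketch of the two losses (function and variable names are my own):

import numpy as np

def absolute_loss(y_pred, y_true):
    # L1 loss: sum of absolute differences
    return np.sum(np.abs(y_pred - y_true))

def squared_loss(y_pred, y_true):
    # L2 loss: sum of squared differences
    return np.sum(np.square(y_pred - y_true))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(absolute_loss(y_pred, y_true))  # 0.1 + 0.2 + 0.5 = 0.8
print(squared_loss(y_pred, y_true))   # 0.01 + 0.04 + 0.25 = 0.3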

To do that, let's think back to some long-lost calculus (again, to simplify, stay in two-dimensional coordinates and suppose there is only one w to find): the points where the derivative is zero are the maxima or minima of the function.

For a simple one-variable quadratic like the one in the figure, we can directly take the derivative with respect to the parameter w and get the minimum. But for a function like the one shown below, it's hard to solve, and different functions call for different formulas, which is a problem, since the whole point is for the machine to learn by itself.

Therefore, we need a more general method: gradient descent. The basic process of gradient descent is as follows. First, we randomly pick a point as the initial value and compute the slope (the derivative) there.

When the slope is negative, I take a small step to the right,

When the slope is positive, I take a small step to the left,

Repeat at each new point: compute the new slope, take another appropriate step, and we approach a local minimum of the function, like a ball rolling down a mountain. Different starting positions can reach different local minima, so there is no guarantee of finding the global minimum. In practice, though, most functions we abstract from real problems are convex, so we can reach a minimum; and when the minimum is not unique, we can add some randomness to give the search a chance to jump out of the current region. Keep in mind that the theoretical support of machine learning is probability theory and statistics: the answer we find through machine learning is often not the optimal solution, but a very good one.
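
A minimal sketch of this loop in Python, minimizing the hand-picked convex function f(w) = (w - 3)^2 (the function and step size are my own assumptions):

def f_grad(w):
    # derivative of f(w) = (w - 3)**2
    return 2 * (w - 3)

w = 10.0        # arbitrary starting point
step = 0.1      # how far to move each time (the learning rate, introduced below)
for _ in range(100):
    w = w - step * f_grad(w)   # negative slope -> step right, positive slope -> step left
print(w)  # approaches the minimum at w = 3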

Imagine a more complex function with two inputs and one output: the loss function can now be drawn as a surface in three-dimensional space, and the problem becomes which direction a point on the surface should move so that the result decreases fastest.

Again: compute the gradient, update, compute, update… The update can be written as

w = w - η * ∂Loss/∂w

Here we meet our first hyperparameter, η, the learning rate. Parameters in machine learning come in two kinds: model parameters and hyperparameters. Model parameters, such as w, are left for the machine to learn by itself, while hyperparameters are specified by the developer before training.

From the formula above, you can see that the derivative of the loss function with respect to the parameter w determines the direction we move in, and the learning rate η determines how far we move in that direction at each step.

When η is too small, reaching the minimum is very slow; when η is too large, we step right past the minimum because the stride is too big. So how do we choose the value of η?

A common approach is to start with a relatively large value like 0.1, then decrease it exponentially: 0.01, 0.001, and so on. When the value is too large, we'll find the loss barely drops (it's probably oscillating); when we reach a value small enough for the loss to drop, we keep going downward, narrowing the range. If computing resources allow, this process can also be automated.
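
A sketch of that manual search as a loop, reusing the toy gradient above (all values assumed):

def f_grad(w):
    return 2 * (w - 3)

for eta in [0.1, 0.01, 0.001]:       # exponentially decreasing candidates
    w = 10.0
    for _ in range(50):
        w = w - eta * f_grad(w)
    print(eta, (w - 3) ** 2)          # compare final loss, then narrow the range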

Now that we know about gradient descent and learning rates, we can use a linear model to solve relatively simple problems.

Basic steps:

  • Extract features
  • Define the model
  • Compute gradients and update

Want to give it a try?

Here is a simple house-price prediction example that you can run locally; try different learning rates and watch how the loss function changes: Github.com/xs7/Machine…

The key code is as follows:

import numpy as np

# Loss function
def lossFunction(x, y, w, b):
    cost = np.sum(np.square(x.dot(w) + b - y)) / (2 * x.shape[0])
    return cost

# Derivatives
def derivation(x, y, w, b):
    wd = x.T.dot(x.dot(w) + b - y) / x.shape[0]
    bd = np.sum(x.dot(w) + b - y) / x.shape[0]
    return wd, bd

# Linear regression model
def linearRegression(x_train, x_test, y_train, y_test, delta, num_iters):
    w = np.zeros(x_train.shape[1])                        # Initialize the w parameters
    b = 0                                                 # Initialize the b parameter
    trainCost = np.zeros(num_iters)                       # Initialize loss on the training set
    validateCost = np.zeros(num_iters)                    # Initialize loss on the validation set
    for i in range(num_iters):                            # Start iterating
        trainCost[i] = lossFunction(x_train, y_train, w, b)    # Loss on the training set
        validateCost[i] = lossFunction(x_test, y_test, w, b)   # Loss on the test set
        Gw, Gb = derivation(x_train, y_train, w, b)            # Gradients on the training set
        Dw = -Gw                                               # positive slope -> move in the negative direction, hence the minus sign
        Db = -Gb                                               # same as above
        w = w + delta * Dw                                     # update parameter w
        b = b + delta * Db                                     # update parameter b
    return trainCost, validateCost, w, b
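
A hypothetical usage sketch, reusing the linearRegression above on synthetic data (the real repo's data loading will differ):

import numpy as np

# Synthetic data: 100 houses, 2 features, price = linear function + noise
rng = np.random.RandomState(0)
x = rng.rand(100, 2)
y = x.dot(np.array([3.0, 1.5])) + 0.5 + rng.randn(100) * 0.01

x_train, x_test = x[:80], x[80:]
y_train, y_test = y[:80], y[80:]

trainCost, validateCost, w, b = linearRegression(
    x_train, x_test, y_train, y_test, delta=0.1, num_iters=1000)
print(trainCost[-1], validateCost[-1], w, b)  # loss should shrink as iterations proceed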

Multilayer perceptron

The linear model we just talked about is actually a single-layer network that contains the basic elements of machine learning: a model, training data, a loss function, and an optimization algorithm. But, limited to linear operations, it can't solve more complicated problems.

We need more general models to fit different data. What about adding an extra layer? Adding a layer has roughly the same effect as transforming the coordinate axes, which lets us handle slightly more complicated problems.

But it's still a linear model and can't solve nonlinear problems. In the picture below, the points can't be separated by a straight line, but a quadratic curve like y = x² can do it; that's the advantage of nonlinearity.

To add a nonlinear structure, we introduce another basic concept of neural networks: the activation function. Common activation functions are as follows.

The ReLU function keeps only positive elements and zeroes out the negative ones; the sigmoid function squashes element values to between 0 and 1; and the tanh function squashes them to between -1 and 1. ReLU is the most widely used, and it looks like the simplest one. It works like a neuron in the human brain: if the stimulus doesn't reach the neuron's threshold, the output is zero.
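
A minimal NumPy sketch of the three functions:

import numpy as np

def relu(x):
    # keep positive elements, zero out negative ones
    return np.maximum(x, 0)

def sigmoid(x):
    # squash values into (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # squash values into (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), sigmoid(x), tanh(x))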

Sigmoid, for example, is often used as the activation function of the output layer: in a classification task, it maps the result into 0 to 1, giving a predicted probability between 0 and 1 for each preset category.

You can think of it this way: we supply a nonlinear function, and the neural network, learning by itself, uses the nonlinear elements we provide to approximate any nonlinear function, so it can be applied to many nonlinear models.

With activation functions added, we get the multilayer perceptron: a neural network composed of fully connected layers with at least one hidden layer, in which the output of each hidden layer is transformed by an activation function.
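
A minimal sketch of one forward pass through a single hidden layer (the layer sizes and random initialization are my own assumptions):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.RandomState(0)
W1, b1 = rng.randn(4, 8), np.zeros(8)   # input size 4 -> hidden size 8
W2, b2 = rng.randn(8, 2), np.zeros(2)   # hidden size 8 -> output size 2

def forward(x):
    h = relu(x.dot(W1) + b1)   # hidden layer output goes through the activation
    return h.dot(W2) + b2      # output layer

print(forward(rng.randn(4)))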

As shown in the figure above, we've formed a simple multilayer perceptron, that is, a deep neural network. As the network structure becomes more complex, we still use gradient descent for iterative optimization, but computing the gradient becomes much harder. Every edge in the network carries a weight parameter w, and the loss function's gradient must be computed for every one of them. The parameters of successive layers are functionally nested, so the derivative of the final loss with respect to a weight in the first hidden layer has to be computed layer by layer through the chain rule, and the amount of computation explodes with depth: taking each derivative directly is simply not feasible. Therefore we need the backpropagation algorithm (the BP algorithm).

Backpropagation algorithm

The backpropagation algorithm is used to compute gradients quickly in multi-layer networks. Its derivation is relatively complex, and when using a framework you just call the API directly; there's nothing for developers to change, and you probably don't want to hand-write code that computes partial derivatives anyway. As more advanced content, I'll dig the hole here and fill it in next time.

Midway summary

By now we should have a basic idea of what a neural network does. To recap: take a multi-layer network model, feed in the data, and keep computing gradients to update the model's parameters, steadily reducing the model's prediction error. The step size of each gradient update is determined by the hyperparameter learning rate. In pseudocode:

for i in number_of_iterations:
    loss = gap between predicted value and true value
    d = derivative of loss with respect to w
    w = w - d * learning_rate

Now that we know the basic structure and computation flow of a deep neural network, we can understand some simple neural-network code. Going back to PyTorch's official tutorials, the demos are full of images, so let's look at what convolutional neural networks are.

Convolutional neural network

In the network models above, every node in one layer is connected to every node in the adjacent layer; such a layer is called a fully connected layer. When we use a deep network to process images, the RGB values of every pixel are the input: a 100*100 image gives 100*100*3 = 30,000 nodes in the input layer. Even with just one hidden layer of 100 nodes, there are already 30,000*100 parameters between the input layer and the hidden layer. Add a few more layers, or a slightly larger image, and the number of parameters explodes further. The data involved in image processing is simply too large, and the cost and efficiency of a fully connected network are unacceptable. It wasn't until the convolutional neural network (CNN) appeared that image processing became practical.

So what does a convolutional neural network look like?

A typical convolutional neural network consists of three parts:

  • Convolution layer
  • Pooling layer
  • Fully connected layer

The convolution layer extracts image features, the pooling layer reduces the number of parameters, and the fully connected layer outputs the result we want.

Let’s look at convolution.

You take a picture, you give it a convolution kernel.

Slide the filter across the image, multiply and sum the corresponding positions,

After that, you get a new two-dimensional array. That's convolution. Yes… it's just multiplication and addition.
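
A minimal NumPy sketch of that sliding multiply-and-sum (stride 1, no padding; the array values are made up):

import numpy as np

def corr2d(img, kernel):
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # multiply the window by the kernel element-wise and sum
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(corr2d(img, kernel))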

If you add another convolution kernel, you get an array of two channels.

The two-dimensional array output by the convolution layer can be regarded as a representation of the input at some level along the spatial dimensions (width and height); it is also called a feature map.

The area of the input that feeds a given element is called its receptive field. For example, the receptive field of the first element of the feature map (the 3 in the figure) is the 3*3 area in the upper left corner of the input picture.

If we convolve the result again, the receptive field of the first element of the final feature map (the 17 in the figure) becomes the union of the receptive fields of its input elements, i.e. the 4*4 region in the upper left corner of the picture. We can use deeper convolutional networks to give individual elements of the feature map broader receptive fields, capturing larger-scale features of the input.

This mimics human vision. When we receive visual signals, certain cells in the cortex do preliminary processing, finding edges and directions; then the brain abstracts, deciding whether the shape of the object in front of us is round or square, and then abstracts further. In a multi-layer neural network, lower-level neurons recognize primitive image features, several lower-level features combine into a higher-level feature, and finally the most abstract features yield the classification result.

Having seen how multi-layer convolution recognizes from local abstraction to global abstraction, let's go back to the convolution kernel itself.

From the function perspective, convolution linearly transforms each position of the image and maps it to a new value, mapping layer by layer to form a complex function overall. From the template-matching perspective, the convolution kernel defines a pattern, and the convolution operation computes how similar each position is to that pattern, or how much of that pattern each position contains: the more similar the current position is to the pattern, the stronger the response.

For example, consider convolution with an edge-detection operator. The Sobel operator consists of two 3*3 matrices, a horizontal one and a vertical one, which are convolved with the image. If A is the original image, Gx and Gy are the horizontal and vertical edge-detection results respectively:
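
For reference, the standard Sobel kernels, written out in NumPy (this is the usual convention):

import numpy as np

# Gx kernel: responds to horizontal intensity changes (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Gy kernel: responds to vertical intensity changes (horizontal edges)
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

# Gx and Gy are obtained by sliding these over image A,
# with the same multiply-and-sum used in corr2d above.
print(sobel_x, sobel_y, sep="\n")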

The result of Sobel edge detection in the x direction looks like this:

Let's look at some examples that intuitively show the effects of different convolution kernels.

Doesn't convolution feel pretty handy? We can directly pick interesting ready-made kernels to use, such as kernels that detect image edges, or we can learn the kernels from data, letting the neural network learn its own operators.

In the convolution above, the window slid one pixel at a time, i.e. the stride was 1. We can also move two pixels at a time to increase the stride, or skip positions to expand the receptive field. And to keep the output array's width and height the same as the input's, we can add a ring of padding around the original image. Since this is a very basic introduction, I won't expand on these here.
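
For reference, the standard output-size relation (a textbook formula, not from the original post): for input size n, kernel size k, padding p and stride s, the output size is floor((n + 2p - k) / s) + 1. For example, a 5*5 input with a 3*3 kernel, stride 1 and padding 1 gives floor((5 + 2 - 3) / 1) + 1 = 5, so the output keeps the input's size.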

Going back to our network structure, we can see that neurons in adjacent layers are now only locally connected, and fewer connections mean fewer parameters.

But this is still not enough. An image has too many pixels; even extracting only local features still takes many parameters, so we also need pooling.

The pooling layer downsamples, shrinking the image. The pooling computation is also very simple: over a fixed-size window of the input data, max pooling outputs the maximum element in the window, while average pooling outputs the average of the elements.
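
A minimal sketch of 2*2 max pooling with stride 2 (values made up):

import numpy as np

def max_pool2d(img, size=2):
    out_h, out_w = img.shape[0] // size, img.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the max of each non-overlapping window
            out[i, j] = np.max(img[i*size:(i+1)*size, j*size:(j+1)*size])
    return out

img = np.array([[1., 2., 5., 6.],
                [3., 4., 7., 8.],
                [9., 10., 13., 14.],
                [11., 12., 15., 16.]])
print(max_pool2d(img))  # [[4, 8], [12, 16]]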

Besides downsampling to reduce image size, pooling also eases the convolution layer's over-sensitivity to position and helps avoid overfitting. Take an extreme example: an image has only four pixels, and if some pixel is 255, we can decide the image is a certain item. If in every training image the top-left pixel is 255, then without pooling the model learns "when the top-left pixel is 255, output this item." When we then use this model on a picture whose top-right pixel is 255, the model decides it is not the item, which is wrong. With pooling, no matter where the 255 appears, it is pooled out and the image is judged to be the item.

After dimensionality reduction by multiple convolution and pooling layers, the data reaches the fully connected layer, where the high-level abstract features are classified.

Finally

At this point, you should have covered most of the principles needed to understand the PyTorch/TensorFlow introductory tutorials, so you can have fun running the image classification examples and writing your own network. Which framework to use, and how, will be covered in the next post, "Super Basic Machine Learning Introduction - Practice".


The Deco intelligent code project is Aotu Lab's exploration of "front-end intelligence". Starting from generating code from design drafts, we try to complete the link from design to development and improve production and research efficiency, using many AI capabilities to analyze and recognize design drafts. If you're interested, please follow our account "Aotu Lab" (Zhihu, Juejin).
