If you want to talk about the most eye-catching technology of recent years, it is without doubt artificial intelligence. Whether or not you work in the tech industry, AI is everywhere: from AlphaGo defeating the world Go champion, to the rise of driverless cars, to tech giants going all in on AI, to universities sending large numbers of AI graduates into the workforce. People have started to believe that a new revolution is coming and our world is about to change again; then they start to worry: will my job be replaced by a machine? How do I catch this wave?
The core technology behind artificial intelligence is the Deep Neural Network. Around this time a year ago, I was watching 3Blue1Brown’s Neural Network video series on the high-speed train back to my hometown; it has only 4 episodes and runs a little over 60 minutes. Besides the excitement of learning something new, I came away with a realization that poured some cold water on this “revolutionary” technology in my mind: neural networks can solve complex tasks that used to be hard to program by hand, such as image recognition and speech recognition, but their working mechanism tells me they are still nowhere near intelligence at the biological level, and we should not expect them to replace people in the short term.
A year later, during this Spring Festival travel season, I am writing down my understanding of neural networks as a study note for this part of my knowledge. With luck, it may also help readers who know nothing about neural networks to understand them.
What is a neural network
Wikipedia explains neural networks this way:
A modern neural network is a nonlinear statistical data modeling tool. Neural networks are usually optimized with a learning method based on mathematical statistics, so they are also a practical application of statistical methods.
This definition is so broad that you could even use it to describe other machine learning algorithms, such as the logistic regression and GBDT decision trees we studied earlier. Let’s be more specific. Here is a schematic diagram of logistic regression:
Here $x_1$ and $x_2$ represent the inputs, $w_1$ and $w_2$ are parameters of the model, and $z$ is a linear function of the inputs:

$$z = w_1 x_1 + w_2 x_2 + b$$
Then we perform a sigmoid transform on $z$ (the blue circle) to get the output $y$:

$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$
In fact, the logistic regression above can be regarded as a neural network with only one input layer and one output layer. The circles containing numbers in the figure are called neurons, and the connections between layers, $w_1$, $w_2$ and $b$, are the parameters of this neural network. When every neuron in one layer is connected to every neuron in the next, the layer is called a Fully Connected Layer or Dense Layer. The sigmoid here is an example of an Activation Function; besides sigmoid, commonly used activation functions include ReLU and tanh, and they all serve to turn a linear function into a nonlinear one. That leaves one important concept, the hidden layer, which is best illustrated by stacking more than two logistic regressions together:
As shown in the figure above, all layers except the input and output layers are called hidden layers. If we add more layers, the network can also be called a Deep Neural Network. Some students may ask how many layers count as “deep”; there is no absolute answer, but I would say three or more 🙂
That is all there is to neural networks and the concepts involved in them. Neural networks are nothing special; in a broad sense, a neural network is
a nonlinear function, or several nonlinear functions whose inputs and outputs are joined together to form one larger nonlinear function.
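To make this “composed nonlinear functions” view concrete, here is a minimal NumPy sketch; the layer sizes (2 inputs, 2 hidden neurons, 1 output) and the random toy parameters are illustrative assumptions, not values taken from the figures:

```python
import numpy as np

def sigmoid(z):
    # activation function: turns the linear part into a nonlinear output
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b):
    # one fully connected (dense) layer: a linear transform followed by sigmoid
    return sigmoid(W @ x + b)

def network(x, params):
    # the whole network is just nonlinear functions plugged into each other
    (W1, b1), (W2, b2) = params
    hidden = dense(x, W1, b1)        # hidden layer
    return dense(hidden, W2, b2)     # output layer

# toy parameters: 2 inputs -> 2 hidden neurons -> 1 output
rng = np.random.default_rng(0)
params = [(rng.normal(size=(2, 2)), np.zeros(2)),
          (rng.normal(size=(1, 2)), np.zeros(1))]
print(network(np.array([1.0, 0.0]), params))
```

Each `dense` call here plays the role of one logistic regression; stacking them is exactly the “larger nonlinear function” described above.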
Neural networks do not have much to do with the human brain either, especially since we can also call them Multilayer Perceptrons, a model invented back in the 1980s. Why did they only take off some 30 years later? You guessed it, because of the name change… just kidding. In fact, deep learning went through a long dark trough until people started using GPUs to dramatically speed up model training, plus a few landmark events: AlphaGo defeating Lee Sedol, Google open-sourcing the TensorFlow framework, and so on. If you are interested, you can browse the history here.
Why neural networks
Take the neural network built from three logistic regressions in the figure above as an example. What advantage does it have over a single logistic regression? Let’s first look at a weakness of logistic regression on its own: there are data sets it can never classify correctly, such as the following:
| label | x1 | x2 |
|---|---|---|
| class 1 | 0 | 0 |
| class 1 | 1 | 1 |
| class 2 | 0 | 1 |
| class 2 | 1 | 0 |
The four samples are plotted in the coordinate system as shown below
Because the Decision Boundary of logistic regression is a straight line, no matter what you do you cannot find a single straight line that separates the two classes in the figure above; with a neural network, however, you can.
The network composed of three logistic regressions (bias is ignored here) is as follows:
Observe the computation of the whole network. Before the data reaches the output layer, the network actually computes the following:

$$z_1 = w_{11} x_1 + w_{12} x_2, \qquad z_2 = w_{21} x_1 + w_{22} x_2$$

$$x_1' = \sigma(z_1), \qquad x_2' = \sigma(z_2)$$

That is, a Linear Transformation first maps the input to $[z_1, z_2]$, and then a nonlinear transformation (sigmoid) maps $[z_1, z_2]$ to $[x_1', x_2']$. From this point on, the rest of the computation is no different from an ordinary logistic regression. So the difference is this: before the data is fed into the final model, it passes through a layer of Feature Transformation (sometimes called Feature Extraction), which makes data that previously could not be classified classifiable.
Now let’s look at the effect of this feature transformation. Suppose the hidden-layer weights are $w_{11} = w_{12} = 1$ and $w_{21} = w_{22} = -0.5$. Substituting them into the formula above, the corresponding $[x_1', x_2']$ for the 4 samples are as follows:
| label | x1 | x2 | x1′ | x2′ |
|---|---|---|---|---|
| class 1 | 0 | 0 | 0.5 | 0.5 |
| class 1 | 1 | 1 | 0.88 | 0.27 |
| class 2 | 0 | 1 | 0.73 | 0.38 |
| class 2 | 1 | 0 | 0.73 | 0.38 |
Then plot the four transformed points in the coordinate system:
Obviously, after feature transformation, these two categories can be easily separated by a decision boundary.
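As a quick sanity check, here is a small NumPy sketch that applies this hidden layer (using the weights assumed above, with bias ignored as in the figure) to the four samples and reproduces the transformed coordinates in the table:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hidden-layer weights assumed above (bias ignored): row i holds the weights of xi'
W = np.array([[ 1.0,  1.0],    # weights producing x1'
              [-0.5, -0.5]])   # weights producing x2'

samples = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
labels = ["class 1", "class 1", "class 2", "class 2"]

for x, label in zip(samples, labels):
    x_prime = sigmoid(W @ x)   # the feature transformation done by the hidden layer
    print(label, x, "->", np.round(x_prime, 2))
# the printed values match the table: the two classes are now linearly separable
```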
So the advantage of a neural network is that it can automatically perform feature transformation (or feature extraction) for us, which is especially valuable for complex problems such as sound and images, where it is hard for a human to say explicitly which features are useful.
While solving the feature-transformation problem, neural networks also introduce a new one: we need to design different network architectures for different scenarios, for example Convolutional Neural Networks (CNN) for images, Long Short-Term Memory networks (LSTM) for sequence problems, Generative Adversarial Networks (GAN) for writing poems and painting, and so on. Even Transformer/BERT, last year’s breakthrough in Natural Language Processing (NLP), is just another specific network architecture. So learning plain neural networks well also helps in understanding these more advanced architectures.
How does a neural network work
As mentioned above, a neural network can be regarded as a nonlinear function whose parameters are all the Weights and Biases connecting its neurons. This function can be abbreviated as f(W, B). Take handwritten digit recognition as an example: recognizing the digits in the MNIST dataset (the “Hello World” of deep learning), which contains tens of thousands of digit images written by different people, covering the ten digits from 0 to 9, each image being 28*28 = 784 pixels. We design a network like this to complete the task:
- The input layer holds all the pixels of one image, 784 neurons in total
- We use one hidden layer with 16 neurons, so:
  - the number of weights from the input layer to the hidden layer is 784 * 16 = 12544
  - the number of biases is 16
- The output layer has 10 neurons, one for each of the digits 0 to 9, so:
  - the number of weights from the hidden layer to the output layer is 16 * 10 = 160
  - the number of biases is 10
- All the Weights and Biases add up to 12544 + 16 + 160 + 10 = 12730

The properties of the complete network function:

- Parameters: 12730 Weights and Biases
- Input: a 28*28 handwritten digit image
- Output: the likelihoods of the 10 digits 0-9
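As a sketch of this network function f in code: the shapes below follow the 784-16-10 design, while the sigmoid activations and random initial weights are my own assumptions for illustration, since an untrained network has not learned anything yet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# the 784 -> 16 -> 10 network described above, randomly initialized
W1, b1 = rng.normal(size=(16, 784)), np.zeros(16)   # 12544 weights + 16 biases
W2, b2 = rng.normal(size=(10, 16)), np.zeros(10)    # 160 weights + 10 biases

print(W1.size + b1.size + W2.size + b2.size)        # 12730, matching the count above

def f(image):
    # the network function: a 28*28 image in, 10 scores (one per digit) out
    x = image.reshape(784)            # flatten the pixels into the input layer
    hidden = sigmoid(W1 @ x + b1)     # hidden layer, 16 neurons
    return sigmoid(W2 @ hidden + b2)  # output layer, 10 neurons

print(f(np.zeros((28, 28))))          # untrained, so the scores are meaningless for now
```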
The next question is: where does this function come from? In essence, it is asking how the values of these parameters are determined.
In machine learning, another function C is used to evaluate f. The “parameters” of C are a big data set; you feed C the Weights and Biases of f, and C outputs Bad or Good. If the answer is Bad, you keep adjusting the Weights and Biases of f and feed them to C again, and so on until C says Good. This function C is called the Cost Function (or Loss Function). In the handwritten digit example, C can be described as follows:
- Parameters: tens of thousands of handwritten digit images
- Input: the Weights and Biases of f
- Output: a number measuring how good the classification is; the smaller, the better
In other words, to complete the handwritten digit recognition task, we only need to adjust these 12,730 parameters so that the loss function outputs a small enough value. More broadly, most problems in neural networks and machine learning can be viewed as problems of defining a loss function and then tuning parameters.
For the handwriting recognition task, we can use either Cross Entropy or MSE (Mean Squared Error) as the loss function; after that, all that is left is tuning the parameters.
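As a sketch of what such a C looks like in code, here are minimal versions of both losses mentioned above; the toy prediction and one-hot target below are made-up numbers, and a real C would average over the whole data set:

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean Squared Error: average squared gap between prediction and target
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_pred, y_true, eps=1e-12):
    # cross entropy for a one-hot target; eps avoids log(0)
    return -np.sum(y_true * np.log(y_pred + eps))

# toy example: the true digit is 3 (one-hot target), the network is fairly confident
y_true = np.zeros(10); y_true[3] = 1.0
y_pred = np.full(10, 0.02); y_pred[3] = 0.82
print(mse(y_pred, y_true), cross_entropy(y_pred, y_true))  # the smaller, the better
```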
Neural networks use no special technique for parameter tuning; it is still the gradient descent algorithm everyone meets at the very beginning of machine learning. Gradient descent answers the question left open in the iteration above: when the loss function reports Bad, how should we adjust the parameters so that the Loss decreases the fastest?
The gradient can be understood as:
Consider a mountain where the height at the point (x, y) is H(x, y). The gradient at a point is the direction of the steepest slope at that point, and the magnitude of the gradient tells us how steep that slope is. – wiki/gradient
If Loss corresponds to H and the 12,730 parameters correspond to (x, y), the gradient of the Loss with respect to all parameters can be written as the following vector of length 12,730:

$$\nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_{12730}} \right]^T$$
Therefore, each iteration can be summarized as:

- Feed the model parameters into the loss function
- Compute the Loss based on the model parameters and tens of thousands of samples
- Compute the gradients of all the parameters from the Loss
- Adjust the parameters according to the gradients
The formula for adjusting a parameter according to its gradient is as follows (bias omitted for simplicity):

$$w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}$$
In the formula, $\eta$ is the learning rate: at each step we move only a small distance in the direction of fastest descent, which avoids Overshoot.
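Here is a minimal Python sketch of this update rule; the one-parameter toy loss $L(w) = (w - 3)^2$ and the learning rate are made-up values just to show the iteration at work:

```python
import numpy as np

def gradient_descent_step(params, grads, lr=0.1):
    # one iteration: move every parameter a small step against its gradient
    return [w - lr * g for w, g in zip(params, grads)]

# toy example: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = np.array([0.0])
for _ in range(50):
    grad = 2 * (w - 3)
    [w] = gradient_descent_step([w], [grad], lr=0.1)
print(w)  # close to 3, where the loss is smallest
```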
Since a neural network has so many parameters, a more efficient algorithm is needed to compute all these gradients, and this is where the Backpropagation algorithm comes in.
Back propagation algorithm
Before looking at the backpropagation algorithm, let’s review the Chain Rule from calculus. Let g = u(h) and h = f(x) be two differentiable functions. A small change △x in x causes a small change △h in h, which in turn causes a small change △g in g; the ratio △g/△x is then given by the chain rule:

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$
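As a tiny worked example (the concrete functions are made up purely for illustration), take $g = h^2$ and $h = 3x$:

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} = 2h \cdot 3 = 6 \cdot 3x = 18x$$

which agrees with differentiating $g = (3x)^2 = 9x^2$ directly.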
With that in mind, it’s easy to understand the back propagation algorithm.
Assume that our demonstration network has only 2 layers and that every layer has just 2 neurons, as shown in the figure below:

Here $x_1, x_2$ are the inputs, $y_1, y_2$ are the outputs, $t_1, t_2$ are the target values of the sample, and the loss function L used here is MSE. A superscript (1) or (2) indicates that a parameter belongs to layer 1 or layer 2, and a subscript 1 or 2 indicates the first or second neuron of that layer. Writing $z^{(l)}_i$ for the linear part and $a^{(l)}_i = \sigma(z^{(l)}_i)$ for the output of neuron $i$ in layer $l$ (so $y_i = a^{(2)}_i$), we have

$$L = \frac{1}{2}\sum_i (y_i - t_i)^2, \qquad z^{(1)}_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i, \qquad z^{(2)}_i = \sum_j w^{(2)}_{ij} a^{(1)}_j + b^{(2)}_i$$
Now let’s calculate $\frac{\partial L}{\partial w^{(2)}_{11}}$ and $\frac{\partial L}{\partial w^{(1)}_{11}}$; once you can take the partial derivatives with respect to these two parameters, you can compute the whole gradient the same way.
The so-called backpropagation algorithm computes the partial derivative of each parameter from right to left. First, for $w^{(2)}_{11}$, by the chain rule:

$$\frac{\partial L}{\partial w^{(2)}_{11}} = \frac{\partial L}{\partial z^{(2)}_{1}} \cdot \frac{\partial z^{(2)}_{1}}{\partial w^{(2)}_{11}}$$
Apply the chain rule once more to the left-hand factor:

$$\frac{\partial L}{\partial z^{(2)}_{1}} = \frac{\partial L}{\partial y_{1}} \cdot \frac{\partial y_{1}}{\partial z^{(2)}_{1}}$$
Since $y_1$ is an output value, $\frac{\partial L}{\partial y_{1}}$ can be computed directly from the derivative of MSE:

$$\frac{\partial L}{\partial y_{1}} = y_1 - t_1$$
Meanwhile $\frac{\partial y_{1}}{\partial z^{(2)}_{1}}$ is the derivative of the sigmoid function, $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$, i.e.

$$\frac{\partial y_{1}}{\partial z^{(2)}_{1}} = y_1 (1 - y_1)$$
So $\frac{\partial L}{\partial z^{(2)}_{1}}$ comes out to:

$$\frac{\partial L}{\partial z^{(2)}_{1}} = (y_1 - t_1)\, y_1 (1 - y_1)$$
Now look at the term $\frac{\partial z^{(2)}_{1}}{\partial w^{(2)}_{11}}$. Because

$$z^{(2)}_{1} = w^{(2)}_{11} a^{(1)}_{1} + w^{(2)}_{12} a^{(1)}_{2} + b^{(2)}_{1}$$

we get

$$\frac{\partial z^{(2)}_{1}}{\partial w^{(2)}_{11}} = a^{(1)}_{1}$$
Note: the formula above holds for every $w$ and $b$, and it is quite intuitive: the partial derivative of $z$ with respect to a weight $w$ is simply the input arriving from the left. It also carries another implication: if we want to adjust the weights feeding $z$ so that the Loss decreases the fastest, we should of course first adjust the weight whose input value is largest, since changing it has the most significant effect.
So, for the last-layer parameter $w^{(2)}_{11}$, we have the partial derivative:

$$\frac{\partial L}{\partial w^{(2)}_{11}} = (y_1 - t_1)\, y_1 (1 - y_1)\, a^{(1)}_{1}$$
Let’s go one layer further and compute $\frac{\partial L}{\partial w^{(1)}_{11}}$. By the chain rule:

$$\frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial L}{\partial z^{(1)}_{1}} \cdot \frac{\partial z^{(1)}_{1}}{\partial w^{(1)}_{11}}$$
Keep expanding the left-hand factor:

$$\frac{\partial L}{\partial z^{(1)}_{1}} = \frac{\partial L}{\partial a^{(1)}_{1}} \cdot \frac{\partial a^{(1)}_{1}}{\partial z^{(1)}_{1}}$$
You can see this is almost exactly the same as computing the last layer, but note that $a^{(1)}_{1}$ affects the Loss through multiple paths (it feeds both output neurons), so for this example with only 2 outputs:

$$\frac{\partial L}{\partial a^{(1)}_{1}} = \frac{\partial L}{\partial z^{(2)}_{1}} \cdot \frac{\partial z^{(2)}_{1}}{\partial a^{(1)}_{1}} + \frac{\partial L}{\partial z^{(2)}_{2}} \cdot \frac{\partial z^{(2)}_{2}}{\partial a^{(1)}_{1}}$$
In this formula, $\frac{\partial L}{\partial z^{(2)}_{1}}$ and $\frac{\partial L}{\partial z^{(2)}_{2}}$ have already been worked out at the last layer, so let’s look at $\frac{\partial z^{(2)}_{1}}{\partial a^{(1)}_{1}}$. Because

$$z^{(2)}_{1} = w^{(2)}_{11} a^{(1)}_{1} + w^{(2)}_{12} a^{(1)}_{2} + b^{(2)}_{1}$$

we get

$$\frac{\partial z^{(2)}_{1}}{\partial a^{(1)}_{1}} = w^{(2)}_{11}$$
In the same way,

$$\frac{\partial z^{(2)}_{2}}{\partial a^{(1)}_{1}} = w^{(2)}_{21}$$
Note: the adjustment intuition of gradient descent extends here as well: to make the Loss decrease the fastest, the weights with larger values should be adjusted first.
At this point, putting the pieces together, we have also worked out $\frac{\partial L}{\partial w^{(1)}_{11}}$:

$$\frac{\partial L}{\partial w^{(1)}_{11}} = \left( \frac{\partial L}{\partial z^{(2)}_{1}}\, w^{(2)}_{11} + \frac{\partial L}{\partial z^{(2)}_{2}}\, w^{(2)}_{21} \right) a^{(1)}_{1} \left( 1 - a^{(1)}_{1} \right) x_{1}$$
Looking at the equations above, with the backpropagation algorithm the partial derivative of each parameter turns into a linear Weighted Sum of quantities already computed in the layer to its right. Writing $\delta^{(l)}_{i} = \frac{\partial L}{\partial z^{(l)}_{i}}$ for the “error” of neuron $i$ in layer $l$, this can be summarized as:

$$\delta^{(l)}_{i} = \left( \sum_{k=1}^{n} w^{(l+1)}_{ki}\, \delta^{(l+1)}_{k} \right) a^{(l)}_{i} \left( 1 - a^{(l)}_{i} \right), \qquad \frac{\partial L}{\partial w^{(l)}_{ij}} = \delta^{(l)}_{i}\, a^{(l-1)}_{j}$$

In the formula, n is the number of neurons in layer l+1 (in our example, the number of classes in the output layer), (l) denotes layer l, and i denotes the i-th neuron of layer l. Since backpropagation is just linear weighting, the gradients of the whole neural network can be computed in parallel with matrix operations on a GPU.
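To tie the derivation together, here is a minimal NumPy sketch of the forward and backward pass for the 2-layer, 2-neurons-per-layer network above, with sigmoid activations and MSE loss as in the derivation; the sample, initial weights, learning rate, and step count are made-up toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # forward pass, keeping the intermediate values needed by the backward pass
    z1 = W1 @ x + b1; a1 = sigmoid(z1)   # layer 1 (hidden)
    z2 = W2 @ a1 + b2; y = sigmoid(z2)   # layer 2 (output)
    return a1, y

def backward(x, t, W2, a1, y):
    # backpropagation: compute dL/dz from right to left, then the weight gradients
    dz2 = (y - t) * y * (1 - y)          # dL/dz^(2), from the MSE and sigmoid derivatives
    dW2 = np.outer(dz2, a1)              # dL/dw^(2)_ij = dz2_i * a1_j
    db2 = dz2
    da1 = W2.T @ dz2                     # sum over both paths to the loss
    dz1 = da1 * a1 * (1 - a1)            # dL/dz^(1)
    dW1 = np.outer(dz1, x)               # dL/dw^(1)_ij = dz1_i * x_j
    db1 = dz1
    return dW1, db1, dW2, db2

# tiny 2-2-2 network trained on a single made-up sample
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)
x, t = np.array([1.0, 0.5]), np.array([0.0, 1.0])

lr = 0.5
for step in range(200):
    a1, y = forward(x, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward(x, t, W2, a1, y)
    W1 -= lr * dW1; b1 -= lr * db1       # gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
print(y, t)  # after training, the output should be close to the target
```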
Finally, now that you understand how neural networks work, isn’t it increasingly obvious that they are really just doing a big pile of calculus? Of course, as a way of proving that someone has learned calculus, neural networks are well worth studying. Just kidding ..
Summary
In this article, we covered
- What is a neural network
- Why neural networks
- How neural networks work
- Back propagation algorithm
these four topics for a fairly complete tour of the basic knowledge points of neural networks. I hope this article is helpful to you.
Reference:
- 3Blue1Brown: Neural Network
- 3Blue1Brown: Linear Transformation
- Back propagation algorithm
- Wikipedia: Artificial neural networks