preface
This post is mainly a summary of my study of the BP neural network, which is itself a fairly basic model in machine learning and well suited to beginners.
I am still quite new to machine learning myself: many of the terms are only half-familiar to me (and quite a few of the terms seem ambiguous to begin with), and many of the formulas are far from second nature. So this article is my attempt to restate the material in my own words and my own understanding; if there are any mistakes, corrections are very welcome.
Hawking said that every formula in a book halves its readership, so I am going to use as few formulas as possible, and only simple ones. Besides, I think many of the formulas in neural networks can be understood intuitively. It is of course best to fully work through the derivations, but quickly building an intuitive grasp before you understand every step is not necessarily a bad way to start.
This article also uses a fair amount of matrix knowledge; readers who have forgotten it can refer to the appendix.
Neurons and excitation functions
neurons
A neuron is the basic building block of a neural network, and if you were to draw it, it would look something like this:
The x's to the left of the neuron in the figure represent its multiple inputs, each w is the weight of the corresponding input, and the arrow to the right of the neuron indicates that it has only one output.
Of course, there are many kinds of neurons, and here are two more basic ones.
Neuron 1: Perceptron
Neural network technology originated in the 1950s and 1960s, when it was called the perceptron, and a single neuron of that era is what we call a perceptron. The perceptron is very much a product of its time: both its inputs and its output are in binary form (reportedly, because computing hardware was so backward, the perceptron's transfer function was realized mechanically, by pulling a rheostat to change a resistance).
As shown in the figure above, a perceptron has multiple binary inputs (each can only be 0 or 1) x1, x2, ..., xn, and each input has a corresponding weight w1, w2, ..., wn (not drawn in the figure). Multiply each input by its weight and sum (∑ xj wj), then compare the sum with a threshold: if it is greater than the threshold, the output is 1; otherwise the output is 0. The formula is as follows:

output = 0 if ∑j wj xj ≤ threshold, and output = 1 if ∑j wj xj > threshold
If the formula is written in matrix form and b is used to represent the negative of the threshold (i.e. b = -threshold), we obtain:

output = 0 if w·x + b ≤ 0, and output = 1 if w·x + b > 0
An example
Suppose your idol is giving a concert in your city, and you are deciding whether to attend. You might weigh the decision by three factors:
- Is the weather good?
- Is your best friend willing to go with you?
- Is the event close to public transportation? (You don’t own a car)
Represent these three factors by the corresponding binary variables x1, x2, and x3. For example, if the weather is good then x1 = 1, and if the weather is bad then x1 = 0; similarly, x2 = 1 if your best friend is willing to go and x2 = 0 otherwise; assign x3 the same way for public transportation.
Then, according to your preferences, set the weather weight to w1 = 6 and the other two weights to w2 = 2 and w3 = 2. The larger w1 indicates that the weather has the greatest influence, more than your friend's company or the convenience of transportation. Finally, suppose you choose 5 as the perceptron's threshold (that is, b is -5). With these choices, the perceptron implements the decision model: it outputs 1 whenever the weather is good and 0 whenever the weather is bad, regardless of whether your friend wants to go or how close the transportation is.
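As a sanity check, the concert decision above can be coded in a few lines. This is just my own sketch (the function name `perceptron` is not from any library), using the weights (6, 2, 2) and b = -5 chosen in the text:

```python
def perceptron(x, w, b):
    """Binary perceptron: output 1 if w . x + b > 0, else 0."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s + b > 0 else 0

w = [6, 2, 2]   # weights for weather, friend, transport
b = -5          # b = -threshold

# Good weather alone clears the threshold: 6 - 5 = 1 > 0.
print(perceptron([1, 0, 0], w, b))  # 1
# Bad weather cannot be outvoted by the other two factors: 2 + 2 - 5 = -1.
print(perceptron([0, 1, 1], w, b))  # 0
```

Try changing the weights: with w = [2, 2, 2] and the same threshold, no single factor can decide on its own.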
Neuron 2: Sigmoid neuron
Let's start with the Sigmoid function itself. Looked up literally in some dictionaries, "sigmoid" is the word for the sigmoid colon, and some sources actually call Sigmoid neurons "sigmoid colon neurons". In fact, it is just a common "S"-shaped function that maps any variable into the interval (0, 1). The formula is as follows:

σ(z) = 1 / (1 + e^(-z))
Its function image is the “S” type as shown below:
So what is a Sigmoid neuron? How is it different from the perceptron?
First, in a Sigmoid neuron the input values are no longer binary but can be arbitrary values between 0 and 1; that is, each xi is any real number between 0 and 1.
Second, the output of a Sigmoid neuron is no longer 0 or 1, but σ(w·x + b). Note that w·x + b is the matrix shorthand introduced in the perceptron formula above.
Therefore, we can obtain the Sigmoid neuron formula:

output = σ(w·x + b) = 1 / (1 + e^(-(w·x + b)))

When z = w·x + b is a large positive number, σ(z) ≈ 1; when z = w·x + b is a large negative number, σ(z) ≈ 0. In both cases, the output of the Sigmoid neuron is very close to that of a perceptron. Only when w·x + b takes a moderate value does the Sigmoid neuron deviate noticeably from the perceptron.
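The perceptron-like behaviour at the extremes is easy to verify numerically; this small sketch is my own illustration:

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large positive z: output close to 1, like a perceptron that fires.
print(sigmoid(10))    # 0.9999...
# Large negative z: output close to 0, like a perceptron that stays silent.
print(sigmoid(-10))   # 0.0000...
# Moderate z: a smooth value in between, where it differs from a perceptron.
print(sigmoid(0.5))   # ~0.62
```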
Excitation function
There is a functional relationship between a neuron's input and output, and this function is called the excitation (activation) function. So the Sigmoid function above is one kind of excitation function, and the perceptron's function can be called a threshold (or step) excitation function.
The excitation function is also known as the firing rule, which ties it to the workings of the human brain: when a neuron's input is large enough, it fires, sending an electrical signal down its axon (its output connection). Similarly, in an artificial neural network, an output is produced as soon as the input exceeds a certain level; this is the idea behind the firing rule.
The structure of neural networks
A neural network is simply multiple neurons connected into a network. Here is one of the simplest and oldest kinds: the multilayer feed-forward neural network. Its characteristics are that it has multiple layers and that the neurons are fully connected, that is, every neuron in a later layer is connected to every neuron in the previous layer (here, the direction from the input layer to the output layer is defined as from "back" to "front").
A schematic of a multilayer perceptron is shown below. The leftmost layer of the network is called the input layer, and the neurons in it are called input neurons. The rightmost, or output, layer contains the output neurons. In this case there is a single output neuron, but the output layer usually has multiple neurons. The middle layers are called hidden layers, because the neurons inside them are neither inputs nor outputs.
The significance of training neural networks
Now that we have neurons and the structure of a neural network, we come to the core question: what do we use a neural network for, and how do we make it do that?
Objectives of training
In plain, human terms, the role of a neural network is this: we feed it a large amount of data in advance (including both inputs and outputs) to train on, and after training we hope that, for future inputs from the real environment, it will also give us satisfactory outputs.
Loss function (cost function)
So how do we express mathematically how satisfactory an output is? Here we introduce the concept of the loss function (also called the cost function).
Now assume there are n groups of sample data, each containing an input and the true result (also called the expected result or expected output). For the i-th group of input, denote the output of our neural network by fi and the true (expected) result by yi.
MAE (Mean Absolute Error), a standard mathematical tool, can directly express the deviation between the outputs and the true results. So MAE can be used to write a loss function as follows; the larger the Loss value, the further the network's outputs are from our expectations:

Loss = (1/n) ∑ |fi − yi|
MSE (Mean Squared Error) can also be used as the loss function. MSE better reflects how spread out the deviations are; simply put, squaring magnifies large deviations:

Loss = (1/n) ∑ (fi − yi)²
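To make the difference concrete, here is a minimal sketch of both losses (the function names and sample numbers are my own illustration):

```python
def mae(f, y):
    """Mean Absolute Error: average of |f_i - y_i|."""
    return sum(abs(fi - yi) for fi, yi in zip(f, y)) / len(y)

def mse(f, y):
    """Mean Squared Error: the square magnifies large deviations."""
    return sum((fi - yi) ** 2 for fi, yi in zip(f, y)) / len(y)

outputs  = [0.2, 0.9, 0.4]   # what the network produced
expected = [0.0, 1.0, 1.0]   # what we wanted

print(mae(outputs, expected))  # ~0.3
print(mse(outputs, expected))  # ~0.137: the 0.6 miss dominates after squaring
```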
Substituting the Sigmoid neuron expression f(x) = σ(wx + b) into the loss function above, notice that x (the input) is fixed and yi (the expected result) is also fixed. Intuitively, then, only w and b affect the Loss, and the most important task is to find the w and b that minimize it.
To be more specific, the purpose of training a neural network is to find, for every neuron, the most suitable values of w and b, so that the output of the whole network comes closest to our expectations (saying "most" actually violates the advertising law: what a neural network finally reaches is hardly ever the true optimum of the problem).
Note: the loss function actually used below
In practice, to make the derivatives more convenient, the following loss function is generally used:

Loss = (1/2n) ∑ (fi − yi)²
Gradient descent
Following the conclusion above, denote the Loss by C. Since C depends only on w and b, it can be regarded as a function of w and b, as shown in the figure below. Note that a neural network contains a great many w's and b's (recall that every neuron has multiple weights and one threshold), so once again some intuition is needed here.
If I were to graph it, it might look something like this:
Our goal is to find the w and b that minimize C. Of course, in the picture above it is easy to see roughly where that w and b lie, but in more complex cases like the one below, how do we quickly find the point that minimizes C?
Here we introduce the gradient descent algorithm. The principle is simple: think of the picture above as a hilly landscape, and imagine a ball placed somewhere on it that we let "roll downhill naturally". The lower it rolls, the smaller C becomes, and the happier we are.
So how do we make it roll downhill? Calculus tells us that when w moves by Δw and b moves by Δb, we have:

ΔC ≈ (∂C/∂w) Δw + (∂C/∂b) Δb
Since C is the loss and we want the ball to keep rolling lower, ΔC should always be negative. How should Δw and Δb be chosen? Gradient descent designs them like this:

Δw = −η (∂C/∂w),  Δb = −η (∂C/∂b)
It can be seen that such choices make ΔC always negative:

ΔC ≈ −η [(∂C/∂w)² + (∂C/∂b)²] ≤ 0

where η is called the learning rate.
So the question now becomes: how do we compute ∂C/∂w and ∂C/∂b, the partial derivatives of C with respect to w and b?
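Before answering that, the "rolling ball" itself can be watched on a toy function. Below I made up C(w, b) = (w − 3)² + (b + 1)², whose minimum is obviously at w = 3, b = −1, so the gradients are known in closed form; a real network would get ∂C/∂w and ∂C/∂b from backpropagation instead:

```python
def grad_C(w, b):
    """Partial derivatives of the toy loss C(w,b) = (w-3)^2 + (b+1)^2."""
    return 2 * (w - 3), 2 * (b + 1)

w, b = 0.0, 0.0   # arbitrary starting point
eta = 0.1         # learning rate

for _ in range(200):
    dw, db = grad_C(w, b)
    w += -eta * dw   # delta-w = -eta * dC/dw
    b += -eta * db   # delta-b = -eta * dC/db

print(round(w, 3), round(b, 3))  # the ball has rolled to about (3, -1)
```

Each step moves against the gradient, so C can only shrink; with a far larger η the ball would overshoot and bounce around instead.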
Back propagation
Back propagation is precisely the algorithm for quickly computing ∂C/∂w and ∂C/∂b in such a setting. A multilayer perceptron trained with this algorithm, the kind of neural network described in this article, is also called a BP neural network (confusion +1).
This chapter contains some involved derivations. I think it is fine to skip the details (just read the "Forward propagation" section), as long as you know that there is a classical back propagation algorithm that can quickly compute ∂C/∂w and ∂C/∂b, from which Δw and Δb are obtained so that ΔC stays negative and the Loss keeps shrinking.
Forward propagation
Forward propagation is also called feedforward (hence the name feed-forward neural network...). It simply means giving an input to the neural network and computing layer by layer until the output layer produces an output; that whole pass is forward propagation.
Basic definition before derivation
Definitions of W, A and B
We use w^l_{jk} to denote the weight on the connection from the k-th neuron in the (l−1)-th layer to the j-th neuron in the l-th layer. For example, the figure below shows the weight on the connection between the fourth neuron of the second hidden layer and the second neuron of the third hidden layer:
We use b^l_j to denote the bias of the j-th neuron in the l-th layer, and a^l_j to denote the activation value of the j-th neuron in the l-th layer. The figure below clearly illustrates what this means:
Based on the definitions above, a formula can be written for the activation value a^l_j of a single neuron, where sum(l−1) denotes the number of neurons in the (l−1)-th layer:

a^l_j = σ( ∑_{k=1..sum(l−1)} w^l_{jk} a^{l−1}_k + b^l_j )
This way of indexing w may look odd, but its beauty shows once we write it in matrix form. Let the matrix W^l hold the weights of the l-th layer, with j as the row index and k as the column index; then w^3 in the neural network above can be written as:
Likewise, let a^l hold the activation values of the l-th layer, with j as the row index but only one column, so a^l is actually a column vector. Then a^2 above can be written as a very graphic column vector:
Similarly, b^3 can be written as a column vector:
Then, from the single-neuron activation formula above, the matrix formula for a^l follows:

a^l = σ(W^l a^{l−1} + b^l)
Weighted input z^l_j of a single neuron
An intermediate quantity z^l_j can be extracted from the formula above:

z^l_j = ∑_k w^l_{jk} a^{l−1}_k + b^l_j
Of course, it can also be abbreviated in matrix form:

z^l = W^l a^{l−1} + b^l,  so that a^l = σ(z^l)
z^l_j is simply the weighted input to the activation function of the j-th neuron in layer l.
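The matrix form makes forward propagation a three-line loop. The sketch below uses a made-up 3-4-2 network with random weights, purely to show the shapes; nothing about the sizes or values is from the original:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """For each layer l: z^l = W^l a^(l-1) + b^l, then a^l = sigma(z^l)."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 2]   # 3 input neurons, one hidden layer of 4, 2 output neurons
# W^l has one row per neuron in layer l, one column per neuron in layer l-1.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.standard_normal((m, 1)) for m in sizes[1:]]

x = np.array([[0.5], [0.1], [0.9]])   # one input, as a column vector
out = feedforward(x, weights, biases)
print(out.shape)                      # (2, 1): one activation per output neuron
```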
Loss of a single set of data
The loss function was introduced earlier. For a single group of input, its Loss (the capital "L" denotes the output layer) can be written as the formula below. The factor n from the earlier Loss formula is gone because only one group of input is considered here, whereas the earlier Loss covered n groups of data:

C = (1/2) ∑_j (y_j − a^L_j)²
This formula can also be written in matrix form, using the magnitude of a vector (see the appendix); the square of the magnitude is the sum of the squares of the vector's elements:

C = (1/2) ‖y − a^L‖²
Error δ^l_j of a single neuron
The error δ^l_j of the j-th neuron in layer l is defined as:

δ^l_j = ∂C/∂z^l_j
From this we can deduce two more steps:

δ^l_j = (∂C/∂a^l_j)(∂a^l_j/∂z^l_j) = (∂C/∂a^l_j) σ′(z^l_j)
derivation
Error matrix of output layer
From the single-neuron error formula above, the error matrix formula of the output layer follows (note that the capital "L" denotes the output layer, and the circle denotes the Hadamard product, see the appendix):

δ^L = ∇_a C ⊙ σ′(z^L)
Since the derivative of C with respect to a^L is very easy to compute for the loss function we adopted, the formula can be further simplified to:

δ^L = (a^L − y) ⊙ σ′(z^L)
Error matrix for a particular layer
First, derive the relationship between a single neuron's error δ^l_j and the next layer (layer l+1):

δ^l_j = ( ∑_k w^{l+1}_{kj} δ^{l+1}_k ) σ′(z^l_j)
The hard part of this derivation is probably where the sum over k comes from. It appears because the j-th neuron in layer l affects every neuron in the (l+1)-th layer, so when computing the partial derivatives backwards, all neurons of the (l+1)-th layer must be taken into account.
Then the error matrix (vector) δ^l of layer l can be obtained:

δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
This transformation involves a matrix transpose, which may be a little hard to see. Look closely at the w_{kj} above: the order of j and k is reversed relative to the original definition of w, which is why the transpose appears. You can also verify the step from single-neuron errors to a whole layer's error matrix by working through a small concrete example.
The error relative to the weight w
Having obtained the error of a single neuron, let's look at the relationship between the error and w:

∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j
As with the derivation in the previous section, written as a matrix it takes the following form:

∂C/∂W^l = δ^l (a^{l−1})^T
The relationship between the error and the deviation B
Consistent with the derivation for w above, the relationship between the error and the bias b follows easily:

∂C/∂b^l_j = δ^l_j
Its matrix form is very simple:

∂C/∂b^l = δ^l
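Putting the four formulas together, one backward pass on a tiny network looks like this. The 2-3-1 sizes and random values below are made up for illustration; the point is how δ flows backwards and turns into gradients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(1)
W2 = rng.standard_normal((3, 2)); b2 = rng.standard_normal((3, 1))
W3 = rng.standard_normal((1, 3)); b3 = rng.standard_normal((1, 1))

x = np.array([[0.2], [0.7]])   # one input
y = np.array([[1.0]])          # its expected output

# Forward propagation, keeping z and a of every layer for later use.
z2 = W2 @ x + b2;  a2 = sigmoid(z2)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3)

# Output-layer error: delta^L = (a^L - y) * sigma'(z^L)
delta3 = (a3 - y) * sigmoid_prime(z3)
# Propagate backwards: delta^l = (W^(l+1)^T delta^(l+1)) * sigma'(z^l)
delta2 = (W3.T @ delta3) * sigmoid_prime(z2)

# Gradients: dC/dW^l = delta^l (a^(l-1))^T and dC/db^l = delta^l
dW3, db3 = delta3 @ a2.T, delta3
dW2, db2 = delta2 @ x.T,  delta2
print(dW2.shape, dW3.shape)   # each gradient has the shape of its W
```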
conclusion
From the lengthy derivation above, it can be seen that after one forward propagation, the partial derivatives of C with respect to every w and b, namely ∂C/∂w and ∂C/∂b, can be computed quickly from the output-layer error. Adding Δw and Δb to each w and b then makes the "ball roll downhill": C, that is the Loss, keeps getting smaller, and the neural network adjusts itself in the direction we want.
Training flow of BP neural network
Based on the above knowledge, we can now summarize the whole process of training a neural network:
- Initialize the neural network, assigning random values to each neuron's w and b;
- Feed in the training sample set. For each sample, send the input to the network's input layer and do one forward propagation to obtain the output value of every neuron in the output layer;
- Compute the output layer's error, then compute the error of every neuron in each layer via the back propagation algorithm;
- From the errors, obtain ∂C/∂w and ∂C/∂b for each neuron; multiply by the negative learning rate (−η) to get Δw and Δb, and update each neuron's w and b to w + Δw and b + Δb.
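The four steps above can be sketched end to end. Everything concrete here (the 2-3-1 sizes, the two made-up samples, η = 1.0, 500 epochs) is my own choice for illustration; the structure of the loop is the point:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(42)
# Step 1: random initialization of a 2-3-1 network.
W = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
b = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
eta = 1.0

samples = [(np.array([[0.0], [1.0]]), np.array([[1.0]])),
           (np.array([[1.0], [0.0]]), np.array([[0.0]]))]

def loss():
    """Average single-sample loss (1/2)||y - a^L||^2 over the sample set."""
    c = 0.0
    for x, y in samples:
        a = x
        for Wl, bl in zip(W, b):
            a = sigmoid(Wl @ a + bl)
        c += 0.5 * float(np.sum((y - a) ** 2))
    return c / len(samples)

before = loss()
for _ in range(500):
    for x, y in samples:
        # Step 2: forward propagation, keeping z and a of each layer.
        zs, activations = [], [x]
        a = x
        for Wl, bl in zip(W, b):
            z = Wl @ a + bl
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        # Step 3: output-layer error, then back-propagate layer by layer.
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        deltas = [delta]
        for l in range(len(W) - 2, -1, -1):
            delta = (W[l + 1].T @ delta) * sigmoid_prime(zs[l])
            deltas.insert(0, delta)
        # Step 4: w <- w - eta * dC/dw and b <- b - eta * dC/db.
        for l in range(len(W)):
            W[l] -= eta * (deltas[l] @ activations[l].T)
            b[l] -= eta * deltas[l]

print(before, "->", loss())   # the Loss should have dropped noticeably
```

This updates w and b after every single sample; real implementations usually average the gradients over small batches before updating.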
After enough training, we usually obtain a neural network whose Loss is fairly small.
The appendix
matrix
Matrix addition and subtraction
The two matrices are required to have the same size (number of rows, number of columns) and then add/subtract the elements in the same position.
Matrix multiplication
I'm sure you remember this: each row of the left matrix times each column of the right matrix, so matrix multiplication requires that the number of columns of the left matrix equal the number of rows of the right matrix.
transpose
The matrix obtained by exchanging the rows and columns of a matrix A is called the transpose of A (that is, the element in row m, column n becomes the element in row n, column m), and is denoted with the symbol T:
vector
Matrices with only one row are called row vectors, and matrices with only one column are called column vectors. Row vectors such as:
Column vectors for example:
PS: Vectors are just a special kind of matrix, matrix multiplication and transpose can be applied to vectors.
Hadamard product: ⨀
Suppose S and T are two vectors of the same dimension. S⨀T denotes their element-wise product, so the elements of S⨀T are (S⨀T)j = Sj Tj.
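In numpy the Hadamard product is simply the `*` operator on arrays (this tiny sketch is my own illustration):

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])

# Element-wise (Hadamard) product, not matrix multiplication:
print(s * t)   # [ 4 10 18]
```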
The magnitude or length of a vector
In linear algebra, the magnitude (or length) of a vector, written by putting double vertical bars around it, is the square root of the sum of the squares of its components. If there is a vector S:
then its magnitude is:

‖S‖ = √(s1² + s2² + ... + sn²)