Part of the content of the article is referenced from
Three basic questions about neural networks
Three basic and important functions of a neural network are introduced through three basic questions.
How does a neural network make predictions?
A neural network makes its predictions with a logistic regression function, which serves as its prediction algorithm.
Logistic regression function
The prediction process of the neural network is formulated as $z = \mathrm{dot}(w, x) + b$.
Here $w$ represents the weights of the network, $x$ represents the input (a sample from the training set), and $\mathrm{dot}(w, x)$ represents the vector multiplication (dot product) of $w$ and $x$. $b$ represents the threshold (also called the bias), which shifts the prediction result. In the logistic regression function, the accuracy of the prediction is determined by $w$ and $b$.
Strictly speaking, the $w$ here is the transpose $w^T$, because multiplication in linear algebra requires a row times a column.
Three small examples help in understanding the formula:
- Say the weekend is coming up and you hear that there will be a music festival in your city. We want to predict whether you will decide to attend. The festival is quite far from the subway, and your girlfriend wants you to stay home with her, but the forecast says the weather will be perfect for it. So three factors affect your decision, and they can be regarded as three input features. Will you go or not? Your personal preferences, that is, how much you value each of the three factors, will influence your decision. These three levels of importance are the three weights.
- The same formula is used to predict whether there is a cat in a picture. A trained neural network holds a set of weights associated with cats. When we feed an image into the network, the image data is combined with these weights and the threshold. If the result is greater than zero there is a cat; if it is less than zero there is no cat.
- Many websites keep track of your browsing preferences and use them as weights to predict what you’ll buy.
The weights $w$ and the bias $b$ can be thought of as the strength of the connection between input and output.
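As a rough sketch, the music festival example can be written in a few lines of Python. The feature encoding and every number below are made-up assumptions for illustration, not values from a trained network.

```python
def dot(w, x):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def linear_score(w, x, b):
    """Compute z = dot(w, x) + b."""
    return dot(w, x) + b

# Hypothetical features: 1 means the factor is present.
# Order: far from the subway, girlfriend wants to stay home, good weather.
x = [1, 1, 1]
# Hypothetical weights: the sign is the direction, the size is the importance.
w = [-2.0, -3.0, 6.0]
# Hypothetical bias: a slight default reluctance to go out.
b = -0.5

z = linear_score(w, x, b)
print(z)  # a positive z leans toward "go", a negative z toward "stay home"
```

With these particular numbers the good weather outweighs the other two factors, so z comes out positive. Change the weights and the decision changes, which is exactly the role the weights play.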
Activation functions: the sigmoid function as an example
In a real neural network, the result $z$ of the formula above is not used directly; an activation function is applied to it. The activation function is very important: without it, stacking layers only produces another linear function, so the network can never learn anything more complex. There are many kinds of activation functions. Here only one, called sigmoid, is briefly introduced.
Its formula is $\sigma(z) = \frac{1}{1 + e^{-z}}$, and its graph is an S-shaped curve.
One use of it is to map $z$ into the range between 0 and 1 so that the prediction is easy to interpret. In the graph, the horizontal axis is $z$ and the vertical axis is $\hat{y}$, our final prediction. The larger $z$ is, the closer $\hat{y}$ is to 1; the smaller $z$ is, the closer $\hat{y}$ is to 0. Why map predictions into this range? Because it is easier for the neural network to compute with and easier for us to understand. In the cat example, $\hat{y} = 0.8$ means there is an 80% chance that the picture contains a cat.
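To see the mapping in numbers, here is a minimal sketch of the sigmoid function in plain Python. The split into two branches is my own addition for numerical safety, not something the formula requires.

```python
import math

def sigmoid(z):
    """Sigmoid of z: 1 / (1 + e^(-z)), a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # which avoids overflow in exp() when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, sigmoid(z))
```

Large positive inputs give values close to 1, large negative inputs give values close to 0, and z = 0 lands exactly in the middle.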
Predicted results
Combining the logistic regression function with the activation function gives the prediction formula:
$$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$$
Here $\hat{y}^{(i)}$ is the prediction result for the training sample $x^{(i)}$.
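Putting the two steps together, a prediction could be sketched like this. The function name `predict` and the two-feature "cat" input are hypothetical, chosen only to make the example concrete.

```python
import math

def sigmoid(z):
    """Plain sigmoid; fine for the moderate z values used here."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Linear step z = dot(w, x) + b followed by the sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights, input features and bias (not from a real model).
w = [1.5, -0.5]
x = [1.0, 1.0]
b = 0.4

y_hat = predict(w, x, b)
print(y_hat)  # read as "the chance that the picture contains a cat"
```

Here z works out to 1.4, so the prediction is roughly 0.8, matching the "80% chance of a cat" reading.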
Is the neural network's prediction accurate?
Only when we can judge whether a prediction is accurate can the algorithm be improved and optimized. In a neural network, the loss function is responsible for judging whether the prediction is accurate. The loss function is one of the most basic and important parts of machine learning.
The role of the loss function
There are many kinds of loss functions, but they all serve to measure the quality of the model's predictions. Put simply, a loss function uses a reasonable formula to reflect, as faithfully as possible, the difference between the prediction and the actual data.
From a more academic point of view, it measures the distance between two distributions: the original, correct (ground truth) distribution, and the predicted distribution fitted by the model.
Understanding the loss function makes it possible to analyze and understand the optimization tools that follow (gradient descent and so on). Loss functions need to be analyzed case by case, and often the hardest part is writing down a complex loss function.
Example of loss function
Recalling the variance learned in middle school, the loss function that comes to mind most easily is the squared difference:
$$L(\hat{y}^{(i)}, y^{(i)}) = (\hat{y}^{(i)} - y^{(i)})^2$$
Here $\hat{y}^{(i)}$ is the prediction for the training sample $x^{(i)}$ and $y^{(i)}$ is the actual result. The squared error is easy to understand: the smaller the difference between the two, the more accurate the prediction and the smaller the result. In practice, however, this formula is not the one used.
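One loss function we might use in practice with a sigmoid output is the cross-entropy (log) loss:
$$L(\hat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})\right)$$
We won't get into the mathematical details here, but the goal is the same: the smaller the result, the more accurate the prediction.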
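To get a feel for how a loss function scores a prediction, the sketch below compares the squared difference with the binary cross-entropy loss. I am assuming cross-entropy is the practical loss meant here, since it is the usual companion of a sigmoid output. The `eps` clamp is my own safety addition.

```python
import math

def squared_error(y_hat, y):
    """Squared difference between prediction and actual result."""
    return (y_hat - y) ** 2

def cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one sample, with y equal to 0 or 1."""
    # Clamp y_hat away from 0 and 1 so log() never receives 0.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

y = 1  # the actual result: there is a cat
for y_hat in (0.9, 0.5, 0.1):
    print(y_hat, squared_error(y_hat, y), cross_entropy(y_hat, y))
```

Both losses shrink as the prediction approaches the actual result, but cross-entropy punishes a confident wrong answer (0.1 when the truth is 1) much more heavily.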
The cost function
The loss function for the whole training set is called the cost function, and the formula is as follows:
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
Here $m$ is the number of training samples. In other words, the formula adds up the loss of every training sample and takes the average. The larger the result, the greater the cost, that is, the less accurate the prediction.
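The averaging step can be sketched as follows. The per-sample loss is assumed to be binary cross-entropy, and the two lists of predictions are invented to contrast a good model with a poor one.

```python
import math

def cross_entropy(y_hat, y, eps=1e-12):
    """Assumed per-sample loss: binary cross-entropy, clamped for safety."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def cost(y_hats, ys):
    """Average the per-sample losses over the m training samples."""
    m = len(ys)
    return sum(cross_entropy(y_hat, y) for y_hat, y in zip(y_hats, ys)) / m

ys = [1, 0, 1, 0]                       # actual results
good_predictions = [0.9, 0.1, 0.8, 0.2]  # made-up outputs of a good model
poor_predictions = [0.4, 0.6, 0.3, 0.7]  # made-up outputs of a poor model

print(cost(good_predictions, ys))
print(cost(poor_predictions, ys))
```

The poor predictions produce the larger cost, which is the signal that learning will try to reduce.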
How do neural networks learn?
Having covered how the neural network predicts and how to tell whether the prediction is accurate, the natural next step is how it learns, that is, how it makes its predictions more and more accurate. Learning means telling the computer how to change the weights and the bias, which control the strength of the connections.
Loss function and gradient descent
As mentioned above, the accuracy of the neural network's prediction is determined by $w$ and $b$, so "learning" means finding appropriate values of $w$ and $b$, and the gradient descent algorithm can carry out this process. Gradient descent changes the values of $w$ and $b$ step by step so that the value of the loss function becomes smaller and smaller, finally making the prediction more accurate.
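As a minimal sketch of what this "learning" looks like end to end, the code below trains $w$ and $b$ on a tiny invented dataset. It assumes the cross-entropy loss, for which the gradients are the average of $(\hat{y} - y) \cdot x$ for $w$ and the average of $(\hat{y} - y)$ for $b$. The dataset, learning rate and step count are all arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cost(w, b, xs, ys, eps=1e-12):
    """Average cross-entropy loss over the training set (assumed loss)."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = min(max(predict(w, x, b), eps), 1.0 - eps)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
    return total / len(xs)

def train(xs, ys, r=0.5, steps=1000):
    """Gradient descent on w and b with learning rate r."""
    m, n = len(xs), len(xs[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        dw, db = [0.0] * n, 0.0
        for x, y in zip(xs, ys):
            err = predict(w, x, b) - y      # y_hat - y for this sample
            for j in range(n):
                dw[j] += err * x[j] / m     # accumulate average gradient for w
            db += err / m                   # accumulate average gradient for b
        w = [wj - r * dwj for wj, dwj in zip(w, dw)]
        b = b - r * db
    return w, b

# Invented one-feature dataset: small inputs are class 0, large inputs class 1.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0, 0, 1, 1]

w, b = train(xs, ys)
print("cost before:", cost([0.0], 0.0, xs, ys))
print("cost after: ", cost(w, b, xs, ys))
print("prediction for x=0:", predict(w, [0.0], b))
print("prediction for x=3:", predict(w, [3.0], b))
```

The cost after training is lower than the cost at the starting point $w = 0$, $b = 0$, which is the "smaller and smaller" behaviour described above.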
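Let's review all the formulas we have seen so far:
$$z^{(i)} = w^T x^{(i)} + b$$
$$\hat{y}^{(i)} = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$$
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
Here $m$ is the number of training samples and $L$ is the loss for a single sample.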
The input training set $x$ and the actual results $y$ are fixed, so the loss function $J(w, b)$ can be understood as a function of $w$ and $b$. "Learning" means finding the most suitable $w$ and $b$ to minimize the loss function, that is, to get the most accurate result.
The mathematical significance of the loss function
Plotted over $w$ and $b$, the loss function looks like a funnel. Our goal is to find the values of $w$ and $b$ at the bottom of the funnel, where $J(w, b)$ is smallest.
The principle of gradient descent
The gradient descent algorithm updates $w$ and $b$ step by step so that they gradually approach the minimum. The process is as follows:
For convenience, first assume that the loss function has only one parameter, $w$, and that $w$ is a real number (in reality $w$ is a vector). We change the value of $w$ with the following formula:
$$w' = w - r \cdot dw$$
$w'$ is the new value of $w$; $dw$ is the partial derivative of the loss function $J$ with respect to $w$, that is, the slope of $J$ at $w$; $r$ is a parameter that represents the learning rate.
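The one-parameter update can be tried out directly. The function $J(w) = (w - 3)^2$ below is a stand-in I picked because its minimum (at $w = 3$) is obvious; it is not the network's real loss function.

```python
def J(w):
    """Stand-in convex loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ(w):
    """Derivative of J with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, r=0.1, steps=50):
    """Repeatedly apply the update w' = w - r * dw."""
    for _ in range(steps):
        dw = dJ(w)
        w = w - r * dw
    return w

w_final = gradient_descent(w=0.0)
print(w_final, J(w_final))  # w ends up very close to 3, J very close to 0
```

Each step moves $w$ a fraction of the way toward the bottom of the curve, and the steps get smaller as the slope flattens.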
Mathematical significance
Since $J$ is a convex function, its derivative with respect to $w$ is negative to the left of the minimum and positive to the right of it, and its magnitude shrinks as $w$ approaches the minimum. Subtracting $r \cdot dw$ therefore always moves $w$ toward the minimum, in smaller and smaller steps. $r$ adjusts the step size. When $r$ is too large, $w'$ can jump past the minimum to the other side, which means "learning too much". If $r$ is not appropriate, the value of $w$ may keep bouncing around the minimum, or even move away from it. So finding an appropriate value for $r$ is very important for gradient descent.
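The effect of the learning rate is easy to observe on a toy function. The sketch below uses $J(w) = w^2$ (minimum at $w = 0$) and three learning rates chosen by me to show the three typical behaviours.

```python
def descend(r, w=1.0, steps=20):
    """Run gradient descent on J(w) = w**2 and return every visited w."""
    path = [w]
    for _ in range(steps):
        dw = 2.0 * w      # derivative of J(w) = w**2
        w = w - r * dw    # update rule: w' = w - r * dw
        path.append(w)
    return path

for r in (0.1, 0.9, 1.1):
    path = descend(r)
    print(f"r={r}: first steps {path[:4]}, final |w| = {abs(path[-1]):.4f}")
```

With r = 0.1, $w$ slides steadily toward 0. With r = 0.9 every step overshoots, so $w$ flips sign and bounces from side to side of the minimum while still closing in. With r = 1.1 every overshoot is larger than the last and $w$ moves away from the minimum.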
If this is unclear, review the properties of convex functions in a calculus textbook. It also explains why so much effort goes into tuning these algorithms: parameters such as the learning rate matter a great deal.
I highly recommend watching this video for more of the theory.