Andrew Ng's Machine Learning: Neural Networks | Backpropagation Algorithm

Last week we studied neural networks for classification problems. We used logistic regression and a neural network to solve a multi-class classification problem, and found that the neural network is the more effective method when the number of features is very large. This week's lecture covers the cost function used to train a neural network and how to compute its gradient with the backpropagation algorithm. The key ideas and derivations of the backpropagation algorithm are presented, though not every computational step is included.

You can continue with Ng's lessons by clicking on the course video, and the Python code for the course assignments has been posted on GitHub. You can click on the course code to go to GitHub (if you cannot access GitHub, click on Coding instead). Corrections and improvements to the code are welcome.

Here are the notes for week 5 of Ng's machine learning course.

Cost function

Suppose we have a multi-class classification problem, the neural network has $L$ layers, and the number of neurons in layer $l$ is $s_l$. The cost function of the neural network is:

$$
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + \left(1-y_k^{(i)}\right)\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2
$$

The second term is the regularization term: the sum of the squares of all the weights in the network. The first term is similar to the cost function of logistic regression, but here we need to add up the errors of all $K$ output neurons.
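As a concrete illustration (my own sketch, not the course's official code), here is a minimal NumPy version of this cost function, assuming sigmoid activations, a list of weight matrices `thetas` with the bias column first, and one-hot labels `Y`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(thetas, X, Y, lam):
    """Regularized cost of a feed-forward network with sigmoid units.

    thetas : list of weight matrices; thetas[l] has shape (s_{l+1}, s_l + 1),
             column 0 multiplying the bias unit
    X      : (m, n) input matrix
    Y      : (m, K) one-hot label matrix
    lam    : regularization strength lambda
    """
    m = X.shape[0]
    a = X
    for theta in thetas:
        a = np.hstack([np.ones((m, 1)), a])   # prepend the bias unit
        a = sigmoid(a @ theta.T)
    h = a                                      # (m, K) network outputs

    # Cross-entropy summed over all K output neurons and all m examples.
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # Regularization: squared weights, bias column excluded.
    reg = sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return cost + lam / (2 * m) * reg
```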

Gradient calculation

In order to train the network with the gradient descent algorithm, we need to compute the gradient of the cost function. An intuitive way to do this is numerically: perturb each weight $\Theta_{ij}^{(l)}$ by a very small amount $\epsilon$ in both directions and estimate the gradient:

$$
\frac{\partial J}{\partial \Theta_{ij}^{(l)}} \approx \frac{J(\dots,\Theta_{ij}^{(l)}+\epsilon,\dots) - J(\dots,\Theta_{ij}^{(l)}-\epsilon,\dots)}{2\epsilon}
$$

But a little analysis of the algorithm's complexity shows that this approach is very slow. For each training example we need to compute the gradient of every weight, so the total amount of work is roughly (number of training examples) × (number of network weights) × (cost of one forward propagation). This complexity is not acceptable under normal circumstances, so we only use this method to verify that the gradient computed by the backpropagation algorithm is correct.
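A minimal sketch of this numerical check, assuming the cost is exposed as a function `J` of a single unrolled parameter vector (the function and parameter names here are mine, not from the course):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided finite-difference approximation of the gradient of J.

    J     : callable mapping an unrolled parameter vector to a scalar cost
    theta : unrolled parameter vector containing all network weights
    """
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        # Each component costs two full evaluations of J over the data set,
        # which is why this is only used to verify backpropagation.
        grad[j] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad
```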

The chain rule

To understand the derivation of the backpropagation formulas, we first need to understand the chain rule for differentiating a multivariate composite function. For a function of several variables $z = f(u_1, u_2, \dots, u_n)$, where each $u_i$ is in turn a function of $x$, we have:

$$
\frac{dz}{dx} = \sum_{i=1}^{n}\frac{\partial z}{\partial u_i}\,\frac{du_i}{dx}
$$

The chain rule tells us that when functions are composed in multiple levels, the derivative at one level can be obtained from the derivative of the level above it.



In the figure above, I intentionally added one more layer: $z$ is a function of the variables $y_i$, each $y_i$ is a function of the variables $u_j$, and each $u_j$ is a function of $x$. For the derivative $\frac{dz}{dx}$ we want to compute, the formula above still holds, because we can view each $y_i$ as a function of $x$. This amounts to expanding:

$$
\frac{dz}{dx} = \sum_i \frac{\partial z}{\partial y_i}\sum_j \frac{\partial y_i}{\partial u_j}\,\frac{du_j}{dx}
$$

which can be simplified so that it involves only the previous layer, reusing the previous layer's results $\frac{dy_i}{dx}$ instead of starting from scratch:

$$
\frac{dz}{dx} = \sum_i \frac{\partial z}{\partial y_i}\,\frac{dy_i}{dx}
$$

In general, if a function $z$ can be viewed as a function of the variables $y_1, \dots, y_n$, and each $y_i$ is a function of $x$, then:

$$
\frac{dz}{dx} = \sum_{i=1}^{n}\frac{\partial z}{\partial y_i}\,\frac{dy_i}{dx}
$$

A neural network is a multivariable function composed of many layers, and from the chain rule we can already get a vague sense of what backpropagation means.
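As a quick sanity check of the layered chain rule, take the (made-up) example $u = 3x$, $y = u^2$, $z = \sin y$:

$$
\frac{dz}{dx} = \frac{\partial z}{\partial y}\,\frac{dy}{du}\,\frac{du}{dx} = \cos(y)\cdot 2u \cdot 3 = 18x\cos(9x^2),
$$

which is exactly what we get by differentiating $z = \sin(9x^2)$ directly.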

The formulas

To put the magic of backpropagation to good use, we first introduce an intermediate variable $\delta_j^l$, defined as:

$$
\delta_j^l = \frac{\partial J}{\partial z_j^l}
$$

Here $l$ is the layer index, $j$ indexes the neuron within that layer, and $z_j^l$ is the intermediate variable we introduced last time (to keep things a little clearer, the backpropagation formulas below do not put parentheses around superscripts). $\delta_j^l$ is called the error of the $j$-th neuron in the $l$-th layer. Backpropagation computes the error of every neuron in every layer, and then computes the gradients from these errors.

First, let's look at the error of the output layer:

$$
\delta_j^L = \frac{\partial J}{\partial z_j^L}
$$

Using the chain rule we get:

$$
\delta_j^L = \sum_k \frac{\partial J}{\partial a_k^L}\,\frac{\partial a_k^L}{\partial z_j^L}
$$

Since $a_k^L = g(z_k^L)$, the term on the right-hand side is non-zero only when $k = j$, so:

$$
\delta_j^L = \frac{\partial J}{\partial a_j^L}\, g'(z_j^L) \tag{1}
$$

For the errors of the other layers:

$$
\delta_j^l = \frac{\partial J}{\partial z_j^l}
$$

Using the chain rule:

$$
\delta_j^l = \sum_k \frac{\partial J}{\partial z_k^{l+1}}\,\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta_k^{l+1}\,\frac{\partial z_k^{l+1}}{\partial z_j^l}
$$

where:

$$
z_k^{l+1} = \sum_i \Theta_{ki}^l\, a_i^l, \qquad a_i^l = g(z_i^l)
$$

Taking the partial derivative gives:

$$
\frac{\partial z_k^{l+1}}{\partial z_j^l} = \Theta_{kj}^l\, g'(z_j^l)
$$

So:

$$
\delta_j^l = \sum_k \delta_k^{l+1}\,\Theta_{kj}^l\, g'(z_j^l) \tag{2}
$$

Finally, the chain rule is also used to calculate the gradient $\frac{\partial J}{\partial \Theta_{ij}^l}$:

$$
\frac{\partial J}{\partial \Theta_{ij}^l} = \sum_k \frac{\partial J}{\partial z_k^{l+1}}\,\frac{\partial z_k^{l+1}}{\partial \Theta_{ij}^l}
$$

Since:

$$
z_k^{l+1} = \sum_{s} \Theta_{ks}^l\, a_s^l
$$

only the term with $k = i$ and $s = j$ survives, leaving:

$$
\frac{\partial J}{\partial \Theta_{ij}^l} = \delta_i^{l+1}\, a_j^l \tag{3}
$$
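For reference, formulas (1)–(3) can also be summarized in vectorized form (my own notation; $\odot$ denotes element-wise multiplication, and in practice the component belonging to the bias unit is dropped when propagating the error backwards):

$$
\delta^{(L)} = \nabla_{a^{(L)}} J \odot g'\!\left(z^{(L)}\right), \qquad
\delta^{(l)} = \left(\Theta^{(l)}\right)^{T} \delta^{(l+1)} \odot g'\!\left(z^{(l)}\right), \qquad
\frac{\partial J}{\partial \Theta^{(l)}} = \delta^{(l+1)} \left(a^{(l)}\right)^{T}
$$

For the cross-entropy cost with sigmoid outputs used above, the first formula simplifies to $\delta^{(L)} = a^{(L)} - y$.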

Back propagation

With formulas (1), (2) and (3), we can carry out the backpropagation algorithm (note that the formulas derived above are for a single training example). A code sketch follows the list below.

  1. For all $l$, $i$, $j$, initialize $\Delta_{ij}^{(l)} = 0$
  2. For the $t$-th training example, $t$ from $1$ to $m$:
    • Set $a^{(1)} = x^{(t)}$
    • Forward propagate and compute the activation vector $a^{(l)}$ of each layer
    • Use formula (1) to compute the output-layer error $\delta^{(L)}$
    • Use formula (2) to compute the errors $\delta^{(l)}$ of the other layers
    • Use formula (3) to accumulate $\Delta_{ij}^{(l)} \mathrel{+}= a_j^{(l)}\,\delta_i^{(l+1)}$
  3. Compute the gradient matrices:
     $$
     D_{ij}^{(l)} =
     \begin{cases}
     \dfrac{1}{m}\Delta_{ij}^{(l)} + \dfrac{\lambda}{m}\Theta_{ij}^{(l)} & \text{if } j \neq 0 \\[1ex]
     \dfrac{1}{m}\Delta_{ij}^{(l)} & \text{if } j = 0
     \end{cases}
     $$
  4. Update the weights $\Theta_{ij}^{(l)} \leftarrow \Theta_{ij}^{(l)} - \alpha D_{ij}^{(l)}$
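Putting the whole procedure together, here is a minimal, unvectorized NumPy sketch of my own (not the course's official code), assuming sigmoid activations and one-hot labels, with `thetas[l]` storing the weights from layer $l$ to layer $l+1$ and the bias column first:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(thetas, X, Y, lam):
    """One pass of backpropagation over the whole training set.

    Returns a list D with D[l] = gradient of the cost w.r.t. thetas[l].
    thetas[l] has shape (s_{l+1}, s_l + 1); column 0 multiplies the bias unit.
    """
    m = X.shape[0]
    L = len(thetas) + 1                               # number of layers
    acc = [np.zeros_like(t) for t in thetas]          # the capital-Delta accumulators

    for t in range(m):
        # Forward propagation, keeping the activation of every layer.
        a = [None] * L
        a[0] = X[t]
        for l, theta in enumerate(thetas):
            a_bias = np.concatenate([[1.0], a[l]])
            a[l + 1] = sigmoid(theta @ a_bias)

        # Output-layer error, formula (1) for the cross-entropy cost.
        delta = a[-1] - Y[t]
        for l in range(L - 2, -1, -1):
            a_bias = np.concatenate([[1.0], a[l]])
            # Formula (3): accumulate this example's gradient contribution.
            acc[l] += np.outer(delta, a_bias)
            if l > 0:
                # Formula (2): propagate the error one layer back
                # (drop the bias component; g'(z) = a * (1 - a) for sigmoid).
                delta = (thetas[l].T @ delta)[1:] * a[l] * (1 - a[l])

    # Regularized gradient matrices D; the bias column is not regularized.
    D = []
    for d_acc, theta in zip(acc, thetas):
        d = d_acc / m
        d[:, 1:] += (lam / m) * theta[:, 1:]
        D.append(d)
    return D
```

The numerical gradient from the earlier section can then be compared against the returned `D` matrices to check the implementation.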

Weight initialization

Finally, the initialization of the weights. For a neural network you cannot initialize all the weights to the same value 0 as before: that would make every logical unit in a layer compute exactly the same thing. So we initialize randomly, making each $\Theta_{ij}^{(l)} \in [-\epsilon, \epsilon]$.
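A minimal sketch of such random initialization (a small value like $\epsilon \approx 0.12$ is commonly used in the course exercise; adjust as needed):

```python
import numpy as np

def random_init(s_in, s_out, eps=0.12):
    """Weight matrix of shape (s_out, s_in + 1), including the bias column,
    drawn uniformly from [-eps, eps] to break the symmetry that an
    all-zero initialization would cause."""
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))

# Example: weights for a 400 -> 25 -> 10 network
thetas = [random_init(400, 25), random_init(25, 10)]
```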

So, after learning about neural networks and their learning algorithm this week, don't you think they are amazing?


hertzcat

2018-04-14