CHAPTER 2 How the Backpropagation Algorithm Works
Overview of backpropagation
Backpropagation was first mentioned in the 1970s, but its importance was not recognized until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
At the heart of backpropagation is an expression for the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. This expression tells us how quickly the cost function changes when we change the weights and biases.
Two assumptions about the cost function
- The cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over the cost functions $C_x$ of individual training samples $x$.
- The cost function can be written as a function of the outputs of the neural network.

Assumption 1 is needed because backpropagation actually computes the partial derivatives $\partial C_x/\partial w$ and $\partial C_x/\partial b$ for a single training sample; $\partial C/\partial w$ and $\partial C/\partial b$ are then obtained by averaging over all training samples. Assumption 2 is needed to relate the cost function to the outputs of the neural network, and hence to the network's parameters.
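For example, the quadratic (mean squared error) cost, which is also the cost assumed by the code later in this chapter, satisfies both assumptions:
$$C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \qquad C_x = \frac{1}{2} \|y(x) - a^L(x)\|^2,$$
where $y(x)$ is the desired output for training sample $x$ and $a^L(x)$ is the network's output: $C$ is the average of the per-sample costs $C_x$, and each $C_x$ is a function of the network output.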
Notation
$w^l_{jk}$ denotes the weight for the connection from the $k$-th neuron in layer $(l-1)$ to the $j$-th neuron in layer $l$.
$b^l_j$ denotes the bias of the $j$-th neuron in layer $l$.
$a^l_j$ denotes the activation of the $j$-th neuron in layer $l$.
$\sigma$ denotes the activation function.
Vectorizing the notation above
$w^l$ is the weight matrix for layer $l$; the element in row $j$, column $k$ is $w^l_{jk}$. For example, the weight matrix connecting layer 2 to layer 3 is $w^3$.
$b^l$ is the bias vector for layer $l$; its $j$-th element is $b^l_j$. For example, the bias vector of the second layer is $b^2$.
With this notation, the activation $a^l_j$ of the $j$-th neuron in layer $l$ is related to the activations of layer $(l-1)$ by the equation
$$a^l_j = \sigma\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right).$$
In vectorized form this becomes
$$a^l = \sigma\left(w^l a^{l-1} + b^l\right),$$
where $a^l$ is the activation vector whose $j$-th element is $a^l_j$. For example, the activation vector of the third layer is $a^3 = \sigma(w^3 a^2 + b^3)$.
Define the weighted input to layer $l$ as
$$z^l \equiv w^l a^{l-1} + b^l,$$
so that $a^l = \sigma(z^l)$. Its $j$-th element,
$$z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j,$$
is the weighted input to the $j$-th neuron in layer $l$.
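As a quick illustration, here is a minimal numpy sketch of the feedforward equations above for a hypothetical three-layer network (2 inputs, 3 hidden neurons, 1 output); the layer sizes and random weights are illustrative only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 3-layer network: 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
w2, b2 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))   # layer 1 -> 2
w3, b3 = rng.standard_normal((1, 3)), rng.standard_normal((1, 1))   # layer 2 -> 3

a1 = np.array([[0.5], [0.1]])   # input activations a^1
z2 = w2 @ a1 + b2               # weighted input z^2 = w^2 a^1 + b^2
a2 = sigmoid(z2)                # activation a^2 = sigma(z^2)
z3 = w3 @ a2 + b3               # weighted input z^3
a3 = sigmoid(z3)                # network output a^3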
As stated above, at the heart of backpropagation are the partial derivatives $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$ of the cost function with respect to any weight and bias. To compute them, an intermediate quantity $\delta^l_j$ is introduced, called the error of the $j$-th neuron in layer $l$, defined as
$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.$$
$\delta^l$ is the error vector for layer $l$; its $j$-th element is $\delta^l_j$.
The four fundamental equations of backpropagation
$\nabla_a C$ is the gradient operator with respect to the output activations; the result is a vector whose elements are the partial derivatives $\partial C/\partial a^L_j$.
$\odot$ is the element-wise (Hadamard) product, e.g. $(s \odot t)_j = s_j\, t_j$.
With this notation, the four fundamental equations are
$$\delta^L = \nabla_a C \odot \sigma'(z^L), \tag{BP1}$$
$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l), \tag{BP2}$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j, \tag{BP3}$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j. \tag{BP4}$$
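To make the four equations concrete, here is a minimal sketch that applies (BP1)–(BP4) by hand, continuing the hypothetical two-weight-matrix network from the numpy sketch above and assuming the quadratic cost $C = \frac{1}{2}\|a^3 - y\|^2$ (so that $\nabla_a C = a^3 - y$):

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

y = np.array([[1.0]])                          # target output for this sample

delta3 = (a3 - y) * sigmoid_prime(z3)          # (BP1): output error
delta2 = (w3.T @ delta3) * sigmoid_prime(z2)   # (BP2): error one layer back
dC_db3, dC_db2 = delta3, delta2                # (BP3): bias gradients
dC_dw3 = delta3 @ a2.T                         # (BP4): weight gradients
dC_dw2 = delta2 @ a1.T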
The backpropagation algorithm
As noted above, the backpropagation algorithm computes the gradient $\nabla C_x$ of the cost function for a single training sample $x$. In practice, backpropagation is usually combined with a learning algorithm such as stochastic gradient descent, so the gradient must be computed for many training samples. In particular, given a mini-batch of $m$ training samples, one gradient descent step on that mini-batch works as follows: for each sample $x$ in the mini-batch, use backpropagation to compute the gradients $\partial C_x/\partial w$ and $\partial C_x/\partial b$; then update each weight and bias by subtracting the learning rate times the average of the corresponding gradients.
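In outline, such a mini-batch step accumulates the per-sample gradients returned by backprop and then takes a single gradient descent step. A sketch in the spirit of update_mini_batch from network.py (reference [1]):

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying gradient
    descent using backpropagation to a single mini-batch of (x, y)
    pairs, with learning rate ``eta``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        # gradient of C_x for one training sample
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # gradient descent step using the averaged gradients
    self.weights = [w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]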
The backpropagation routine itself, which computes the per-sample gradient used above, is shown schematically below; for the complete program see network.py:
def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the gradient
    for the cost function C_x.  ``nabla_b`` and ``nabla_w`` are
    layer-by-layer lists of numpy arrays, similar to ``self.biases``
    and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x]  # list to store all the activations, layer by layer
    zs = []  # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    delta = self.cost_derivative(activations[-1], y) * \
        sigmoid_prime(zs[-1])                                    # (BP1)
    nabla_b[-1] = delta                                          # (BP3)
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())     # (BP4)
    # Note that the variable l in the loop below is used a little
    # differently to the notation in Chapter 2 of the book.  Here,
    # l = 1 means the last layer of neurons, l = 2 is the
    # second-last layer, and so on.  It's a renumbering of the
    # scheme in the book, used here to take advantage of the fact
    # that Python can use negative indices in lists.
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp    # (BP2)
        nabla_b[-l] = delta                                           # (BP3)
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())    # (BP4)
    return (nabla_b, nabla_w)
def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return (output_activations - y)

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))
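For completeness, the sigmoid helper that the code above relies on (also from network.py):

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))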
Proofs of the four fundamental equations
We now prove the four fundamental equations (BP1)–(BP4). All of them are consequences of the chain rule of multivariable calculus.
Proof of (BP1)
We start with equation (BP1), which expresses the output error $\delta^L$. By definition,
$$\delta^L_j = \frac{\partial C}{\partial z^L_j}.$$
By assumption 2 ("the cost function can be written as a function of the neural network output"), the chain rule lets us re-express this in terms of partial derivatives with respect to the output activations:
$$\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k}\, \frac{\partial a^L_k}{\partial z^L_j}.$$
This looks complicated, but the output activation $a^L_k$ of the $k$-th neuron depends only on the weighted input $z^L_j$ when $k = j$, so all terms with $k \neq j$ vanish and the expression simplifies to
$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \frac{\partial a^L_j}{\partial z^L_j}.$$
Because $a^L_j = \sigma(z^L_j)$, the second factor can be written as $\sigma'(z^L_j)$, and the equation becomes
$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j).$$
This is (BP1) in component form. Using the gradient operator $\nabla_a C$, whose elements are the partial derivatives $\partial C/\partial a^L_j$, the equation can be written in vector form as
$$\delta^L = \nabla_a C \odot \sigma'(z^L).$$
(BP1) is proved.
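As a concrete instance: for the quadratic cost $C_x = \frac{1}{2}\|a^L - y\|^2$ we have $\partial C/\partial a^L_j = a^L_j - y_j$, so (BP1) becomes
$$\delta^L = (a^L - y) \odot \sigma'(z^L),$$
which is exactly what the line `delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])` computes in the code above.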
Proof of (BP2)
Next we prove (BP2), which expresses the error $\delta^l$ in terms of the error of the next layer, $\delta^{l+1}$. To do this we rewrite $\delta^l_j = \partial C/\partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C/\partial z^{l+1}_k$, connecting $z^l_j$ and $z^{l+1}_k$ with the chain rule:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\, \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\, \delta^{l+1}_k.$$
By the definition of $z^{l+1}_k$,
$$z^{l+1}_k = \sum_j w^{l+1}_{kj}\, a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj}\, \sigma(z^l_j) + b^{l+1}_k.$$
Taking the partial derivative with respect to $z^l_j$ gives
$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\, \sigma'(z^l_j).$$
Note that although the connections between layer $l$ and layer $l+1$ are intricate, any pair of neurons in the two layers (there are no connections within a layer) is joined by exactly one connection, namely $w^{l+1}_{kj}$, which is why the partial derivative $\partial z^{l+1}_k/\partial z^l_j$ takes such a simple form. Substituting this result back into the sum above gives
$$\delta^l_j = \sum_k w^{l+1}_{kj}\, \delta^{l+1}_k\, \sigma'(z^l_j).$$
This is exactly (BP2) written in component form. In vector form,
$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l).$$
For example, $\delta^{L-1} = \left((w^L)^T \delta^L\right) \odot \sigma'(z^{L-1})$.
(BP2) is proved.
Proof of (BP3)
By the definition of the error, $\delta^l_j = \partial C/\partial z^l_j$, and the definition of the weighted input, $z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j$, we have
$$\frac{\partial z^l_j}{\partial b^l_j} = 1.$$
By the chain rule,
$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\, \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \cdot 1,$$
that is,
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j.$$
In vector form, $\dfrac{\partial C}{\partial b^l} = \delta^l$.
(BP3) is proved.
Proof of (BP4)
Again by the definition of the error, $\delta^l_j = \partial C/\partial z^l_j$, and the definition of the weighted input, $z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j$, we have
$$\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k.$$
By the chain rule,
$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j}\, \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\, a^{l-1}_k,$$
that is,
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j.$$
This can also be written in the index-free form
$$\frac{\partial C}{\partial w} = a_{\mathrm{in}}\, \delta_{\mathrm{out}},$$
where $a_{\mathrm{in}}$ is the activation of the neuron that feeds into the weight $w$ and $\delta_{\mathrm{out}}$ is the error of the neuron that the weight feeds into.
(BP4) is proved.
An intuitive reading of (BP4): the gradient for a weight is the activation flowing into the weight multiplied by the error flowing out of it.
So far all four fundamental equations of backpropagation have been proved.
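A practical way to convince yourself that the four equations (and their implementation) are correct is a finite-difference gradient check. Here is a minimal sketch, assuming the Network class from network.py (reference [1]) with its feedforward and backprop methods and the quadratic cost:

import numpy as np

def grad_check(net, x, y, eps=1e-5):
    """Spot-check backprop's weight gradients against central finite differences."""
    nabla_b, nabla_w = net.backprop(x, y)
    cost = lambda a: 0.5 * np.sum((a - y) ** 2)   # quadratic cost C_x
    for l, w in enumerate(net.weights):
        j, k = 0, 0                               # check one entry per weight matrix
        w[j, k] += eps
        c_plus = cost(net.feedforward(x))
        w[j, k] -= 2 * eps
        c_minus = cost(net.feedforward(x))
        w[j, k] += eps                            # restore the original weight
        numeric = (c_plus - c_minus) / (2 * eps)
        print("layer", l, "numeric:", numeric, "backprop:", nabla_w[l][j, k])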
A more concise proof of the four equations by another author can be found at CSDN: oio328Loio (reference [3]).
Backpropagation: the big picture
Suppose we make a small perturbation $\Delta w^l_{jk}$ to some weight $w^l_{jk}$ in the network. This perturbation propagates through the network, layer by layer, and eventually changes the cost function. The change $\Delta C$ is related to $\Delta w^l_{jk}$ by
$$\Delta C \approx \frac{\partial C}{\partial w^l_{jk}}\, \Delta w^l_{jk}.$$
One path along which the perturbation can affect the cost is: $\Delta w^l_{jk}$ first changes the activation $a^l_j$, which changes some activation $a^{l+1}_q$ in the next layer, and so on up the network until it reaches an output activation $a^L_m$ and finally the cost $C$. To compute $\Delta C$ we must sum over all such paths, namely
$$\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}\, \frac{\partial a^L_m}{\partial a^{L-1}_n}\, \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j}\, \frac{\partial a^l_j}{\partial w^l_{jk}}\, \Delta w^l_{jk}.$$
Comparing this with $\Delta C \approx \dfrac{\partial C}{\partial w^l_{jk}}\, \Delta w^l_{jk}$, we conclude that
$$\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}\, \frac{\partial a^L_m}{\partial a^{L-1}_n}\, \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j}\, \frac{\partial a^l_j}{\partial w^l_{jk}}.$$
The formula looks complicated, but it has a fairly natural intuitive interpretation as the rate of change of the cost with respect to a weight in the network. Every connection between two neurons is associated with a rate-of-change factor, namely the partial derivative of one neuron's activation with respect to the activation of the neuron feeding into it. The rate-of-change factor of a path is the product of the factors along that path, and the overall rate of change $\partial C/\partial w^l_{jk}$ is the sum of the rate-of-change factors of all possible paths from the initial weight to the final cost. For a single path, the contribution is the product
$$\frac{\partial C}{\partial a^L_m}\, \frac{\partial a^L_m}{\partial a^{L-1}_n} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j}\, \frac{\partial a^l_j}{\partial w^l_{jk}}.$$
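The path-sum picture can be checked numerically. Below is a minimal sketch using a hypothetical [1, 1, 2, 1] network, chosen so that exactly two paths lead from the weight $w^2_{11}$ to the cost (one through each neuron of layer 3); the sum of the two path products is compared with a finite-difference estimate of $\partial C/\partial w^2_{11}$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(1)
w2, b2 = rng.standard_normal((1, 1)), rng.standard_normal((1, 1))   # layer 1 -> 2
w3, b3 = rng.standard_normal((2, 1)), rng.standard_normal((2, 1))   # layer 2 -> 3
w4, b4 = rng.standard_normal((1, 2)), rng.standard_normal((1, 1))   # layer 3 -> 4
a1, y = np.array([[0.7]]), np.array([[0.0]])

def forward(w2_):
    z2 = w2_ @ a1 + b2; a2 = sigmoid(z2)
    z3 = w3 @ a2 + b3;  a3 = sigmoid(z3)
    z4 = w4 @ a3 + b4;  a4 = sigmoid(z4)
    return z2, z3, z4, a4

z2, z3, z4, a4 = forward(w2)

# Sum the rate-of-change factors over both paths w^2_11 -> a^2_1 -> a^3_m -> a^4_1 -> C.
path_sum = 0.0
for m in range(2):
    path_sum += ((a4[0, 0] - y[0, 0])                   # dC/da^4_1 (quadratic cost)
                 * sigmoid_prime(z4)[0, 0] * w4[0, m]   # da^4_1/da^3_m
                 * sigmoid_prime(z3)[m, 0] * w3[m, 0]   # da^3_m/da^2_1
                 * sigmoid_prime(z2)[0, 0] * a1[0, 0])  # da^2_1/dw^2_11

# Compare with a central finite difference of C = 0.5 * (a^4 - y)^2.
eps = 1e-6
C = lambda w2_: 0.5 * (forward(w2_)[-1][0, 0] - y[0, 0]) ** 2
numeric = (C(w2 + eps) - C(w2 - eps)) / (2 * eps)
print(path_sum, numeric)   # the two values should agree closely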
If you take all of these path contributions, sum them up with matrix operations, and simplify as much as possible, you find that you are doing backpropagation! So you can think of backpropagation as a way of summing the rates of change over all possible paths. Put differently, backpropagation is a technique that cleverly tracks how small changes in the weights and biases propagate to the output layer and affect the cost function.
If you try to turn this idea into a proof of backpropagation, it becomes considerably more involved than the proof of the four equations given earlier, because many simplifications are hiding in the details. One neat simplification is that the partial derivatives in the sum are taken with respect to activation values such as $a^{l+1}_q$; the trick is to use the weighted inputs, such as $z^{l+1}_q$, as intermediate variables instead. If that idea does not occur to you and you stick with the activations, the proof you end up with is somewhat more complicated than the one given earlier.
So it is not too mysterious that the earlier proof is the one that appears first: it is simply the result of accumulated hard work spent simplifying this path-based argument.
References
[1] Michael Nielsen. CHAPTER 2 How the backpropagation algorithm works [DB/OL]. neuralnetworksanddeeplearning.com/chap2.html, 2018-06-21.
[2] Zhu Xiaohu, Zhang Freeman. Another Chinese Translation of Neural Networks and Deep Learning [DB/OL]. Github.com/zhanggyb/nn… , 2018-06-21.
[3] oio328Loio. Neural network learning (3): the back propagation (BP) algorithm (1) [DB/OL]. Blog.csdn.net/hoho1151191… , 2018-06-25.