CHAPTER 2 How the BackPropagation Algorithm Works


Overview of back propagation

Back propagation algorithms were first mentioned in the 1970s, but their importance was not recognized until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.

At the heart of backpropagation is an expression for the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network.

This expression tells us how quickly the cost changes when we change the weights and biases.

Two assumptions about the cost function

  1. The cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over the cost functions $C_x$ of individual training samples $x$.

  2. The cost function can be written as a function of the outputs of the neural network, $C = C(a^L)$.

Assumption 1 is needed because backpropagation actually computes the partial derivatives $\partial C_x/\partial w$ and $\partial C_x/\partial b$ for a single training sample; $\partial C/\partial w$ and $\partial C/\partial b$ are then recovered by averaging over all training samples.

Assumption 2 is needed to relate the cost function to the output of the neural network, and through it to the network's weights and biases.
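As a concrete example, the quadratic cost used in network.py satisfies both assumptions: it is an average over per-sample costs,

$C = \frac{1}{2n}\sum_x \|y(x) - a^L(x)\|^2 = \frac{1}{n}\sum_x C_x, \qquad C_x = \frac{1}{2}\|y(x) - a^L(x)\|^2,$

and, for a fixed training sample $x$ (with $y(x)$ treated as a constant), $C_x$ depends on the network only through the output activations $a^L$.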

Notation

$w^l_{jk}$ denotes the weight from the $k$-th neuron in layer $(l-1)$ to the $j$-th neuron in layer $l$.

$b^l_j$ denotes the bias of the $j$-th neuron in layer $l$.

$a^l_j$ denotes the activation of the $j$-th neuron in layer $l$.

$\sigma$ denotes the activation function.

Vectorizing the notation above

$w^l$ is the weight matrix for layer $l$; the element in its $j$-th row and $k$-th column is $w^l_{jk}$.

For example, the weight matrix connecting layer 2 to layer 3 is $w^3$, whose $(j,k)$-th entry is the weight $w^3_{jk}$ from the $k$-th neuron in layer 2 to the $j$-th neuron in layer 3.

$b^l$ is the bias vector for layer $l$; its $j$-th component is $b^l_j$.

For example, the bias vector of the second layer is $b^2$, with components $b^2_1, b^2_2, \ldots$

With this notation, the activation $a^l_j$ of the $j$-th neuron in layer $l$ is related to the activations of layer $(l-1)$ by the equation

$a^l_j = \sigma\Big(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\Big).$

Vectorizing this, let $a^l$ be the activation vector of layer $l$, whose $j$-th component is $a^l_j$; then

$a^l = \sigma(w^l a^{l-1} + b^l).$

For example, the activation vector of the third layer is $a^3 = \sigma(w^3 a^2 + b^3)$.

Define the weighted input to the neurons in layer $l$ as

$z^l \equiv w^l a^{l-1} + b^l,$

so that $a^l = \sigma(z^l)$. The $j$-th component of $z^l$ is

$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j,$

the weighted input to the $j$-th neuron in layer $l$.
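To make the vectorized notation concrete, here is a minimal numpy sketch of the feedforward pass for a toy network (the layer sizes and random initialization are illustrative only; the real implementation is the feedforward section of backprop in network.py below):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 2 inputs, 3 hidden neurons, 1 output.
# weights[i] has shape (size of layer i+1, size of layer i), so its (j, k)
# entry multiplies the k-th activation of the previous layer, matching w^l_{jk}.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

a = np.array([[0.5], [0.1]])   # input activations as a column vector
for w, b in zip(weights, biases):
    z = w @ a + b              # weighted input z^l = w^l a^{l-1} + b^l
    a = sigmoid(z)             # activation a^l = sigma(z^l)
print(a)                       # output activations a^L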

As noted above, at the heart of backpropagation are the partial derivatives $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$ of the cost function with respect to any weight or bias. To compute them, an intermediate quantity $\delta^l_j$ is introduced, called the error of the $j$-th neuron in layer $l$.

Define

$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.$

$\delta^l$ is the error vector of layer $l$; its $j$-th component is $\delta^l_j$.

Four basic equations of back propagation

(BP1) $\delta^L = \nabla_a C \odot \sigma'(z^L)$, or in component form $\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)$

(BP2) $\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$

(BP3) $\frac{\partial C}{\partial b^l_j} = \delta^l_j$

(BP4) $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j$

Here $\nabla_a C$ is the gradient of the cost with respect to the output activations; it is a vector whose components are the partial derivatives $\partial C/\partial a^L_j$.

$\odot$ is the Hadamard product, i.e. element-wise multiplication, $(s \odot t)_j = s_j\,t_j$, e.g. $\begin{pmatrix}1\\2\end{pmatrix} \odot \begin{pmatrix}3\\4\end{pmatrix} = \begin{pmatrix}3\\8\end{pmatrix}$.
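To see what the four equations look like in code, here is a small numpy sketch that applies (BP1)-(BP4) to one output layer and one hidden layer. All the concrete numbers and shapes below are made up purely for illustration, and the quadratic cost is assumed so that $\nabla_a C = a^L - y$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Illustrative output layer with 2 neurons.
z_L = np.array([[0.4], [-1.2]])           # weighted inputs z^L
a_L = sigmoid(z_L)                        # output activations a^L
y   = np.array([[1.0], [0.0]])            # desired output

# (BP1): delta^L = nabla_a C (Hadamard) sigma'(z^L); quadratic cost gives nabla_a C = a^L - y.
delta_L = (a_L - y) * sigmoid_prime(z_L)

# (BP2): pull the error back through the weight matrix w^{l+1} of the next layer.
w_next = np.array([[0.3, -0.5, 0.8],
                   [0.1,  0.9, -0.2]])    # shape (2 outputs, 3 hidden neurons)
z_l = np.array([[0.2], [-0.7], [1.1]])    # weighted inputs of the hidden layer
delta_l = (w_next.T @ delta_L) * sigmoid_prime(z_l)

# (BP3) and (BP4): gradients for the hidden layer, given the activations a^{l-1}
# of the layer feeding into it (here 2 neurons).
a_prev = np.array([[0.6], [0.05]])
nabla_b = delta_l                         # dC/db^l_j = delta^l_j
nabla_w = delta_l @ a_prev.T              # dC/dw^l_{jk} = a^{l-1}_k delta^l_j
print(nabla_b.shape, nabla_w.shape)       # (3, 1) (3, 2)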

Back propagation algorithm

As we said above, the backpropagation algorithm computes the gradient of the cost function for a single training sample, $C = C_x$. In practice, backpropagation is usually combined with a learning algorithm such as stochastic gradient descent, in which the gradient is computed over many training samples. In particular, given a mini-batch of $m$ training samples, the following steps apply one gradient descent learning step based on that mini-batch:

  1. Input the set of training samples in the mini-batch.

  2. For each training sample $x$: set the input activation $a^{x,1}$, feedforward to compute $z^{x,l}$ and $a^{x,l}$ for each layer, compute the output error $\delta^{x,L}$, and backpropagate the error to obtain $\delta^{x,l}$ for every earlier layer.

  3. Gradient descent: update the weights and biases with the averaged gradients, $w^l \to w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T$ and $b^l \to b^l - \frac{\eta}{m}\sum_x \delta^{x,l}$.
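A sketch of that mini-batch update, written in the same style as the backprop code below (this mirrors the update_mini_batch method in network.py; eta is the learning rate):

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by gradient descent on a
    single mini-batch, using backpropagation to compute the gradients."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # Accumulate the gradient of C_x over every training example in the batch.
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # Gradient-descent step: average over the batch and move against the gradient.
    m = len(mini_batch)
    self.weights = [w - (eta / m) * nw for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / m) * nb for b, nb in zip(self.biases, nabla_b)]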

Schematic code for the backpropagation routine used in this mini-batch stochastic gradient descent procedure is shown below; the complete program is in network.py.

def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the gradient for the cost function C_x. ``nabla_b`` and ``nabla_w`` are layer-by-layer lists of numpy arrays, similar to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book. Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on. It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x / \partial a for the output activations."""
        return (output_activations-y)

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
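For reference, calling the routine on a small network might look like this (assuming network.py from the book's repository is importable; the layer sizes and inputs here are arbitrary):

import numpy as np
import network   # Nielsen's network.py

# A network with 2 inputs, 3 hidden neurons and 1 output neuron.
net = network.Network([2, 3, 1])

# backprop expects column vectors for both the input x and the desired output y.
x = np.array([[0.5], [0.1]])
y = np.array([[1.0]])

nabla_b, nabla_w = net.backprop(x, y)
for nb, nw in zip(nabla_b, nabla_w):
    print(nb.shape, nw.shape)   # gradient shapes match self.biases / self.weights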

Proofs of the four basic equations

We now prove the four fundamental equations (BP1)-(BP4). All of them are consequences of the chain rule from multivariable calculus.

Proof of (BP1)

We start with equation (BP1), which gives an expression for the output error $\delta^L$. By definition,

$\delta^L_j = \frac{\partial C}{\partial z^L_j}.$

By assumption 2 about the cost function ("the cost function can be written as a function of the neural network's output"), the chain rule lets us re-express this in terms of partial derivatives with respect to the output activations:

$\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k}\,\frac{\partial a^L_k}{\partial z^L_j},$

where the sum runs over all neurons $k$ in the output layer. This looks complicated, but the output activation $a^L_k$ of the $k$-th neuron depends on the weighted input $z^L_j$ of the $j$-th neuron only when $k = j$; for $k \neq j$ the factor $\partial a^L_k/\partial z^L_j$ vanishes. So the expression simplifies to

$\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\frac{\partial a^L_j}{\partial z^L_j}.$

Because $a^L_j = \sigma(z^L_j)$, the second factor can be written as $\sigma'(z^L_j)$, and the equation becomes

$\delta^L_j = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j).$

This is (BP1) in component form. Recalling that $\nabla_a C$ is the vector whose components are the partial derivatives $\partial C/\partial a^L_j$, the equation can be written in vector form as

$\delta^L = \nabla_a C \odot \sigma'(z^L).$

(BP1) is proved.
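As a small worked example (using the quadratic cost, which is what the code above assumes): with $C = \frac{1}{2}\sum_j (y_j - a^L_j)^2$ we have $\partial C/\partial a^L_j = a^L_j - y_j$, so (BP1) becomes $\delta^L_j = (a^L_j - y_j)\,\sigma'(z^L_j)$. This is exactly what the line delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1]) computes in backprop.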

Proof of (BP2)

Next we prove (BP2), which expresses the error $\delta^l$ of one layer in terms of the error $\delta^{l+1}$ of the following layer. To do this, we rewrite $\delta^l_j = \partial C/\partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C/\partial z^{l+1}_k$; the two are connected through $z^{l+1}_k$, so by the chain rule

$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\,\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\,\delta^{l+1}_k.$

By the definition of $z^{l+1}_k$,

$z^{l+1}_k = \sum_j w^{l+1}_{kj}\,a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj}\,\sigma(z^l_j) + b^{l+1}_k.$

Taking the partial derivative, we get

$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\,\sigma'(z^l_j).$

(Note that although the connections between the two layers are intricate, any pair of neurons in adjacent layers is joined by exactly one connection, so $z^{l+1}_k$ depends on $z^l_j$ only through the single weight $w^{l+1}_{kj}$, which is why this derivative is so simple.) Substituting the result back in gives

$\delta^l_j = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j).$

This is exactly (BP2) written in component form. In vector form it reads

$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l).$

For example, with two neurons in layer $l+1$, the first component is $\delta^l_1 = \left(w^{l+1}_{11}\delta^{l+1}_1 + w^{l+1}_{21}\delta^{l+1}_2\right)\sigma'(z^l_1)$.

(BP2) is proved.

Proof of (BP3)

By the definition of the error,

$\delta^l_j = \frac{\partial C}{\partial z^l_j},$

and by the definition of the weighted input,

$z^l_j = \sum_k w^l_{jk}\,a^{l-1}_k + b^l_j,$

so

$\frac{\partial z^l_j}{\partial b^l_j} = 1.$

By the chain rule,

$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial b^l_j},$

so

$\frac{\partial C}{\partial b^l_j} = \delta^l_j \cdot 1,$

that is,

$\frac{\partial C}{\partial b^l_j} = \delta^l_j.$

In vector form, $\frac{\partial C}{\partial b^l} = \delta^l$, with the derivative taken component by component.

(BP3) is proved.

Proof of (BP4)

Again by the definition of the error,

$\delta^l_j = \frac{\partial C}{\partial z^l_j},$

and by the definition of the weighted input,

$z^l_j = \sum_k w^l_{jk}\,a^{l-1}_k + b^l_j,$

we have

$\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k,$

so, by the chain rule,

$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\,a^{l-1}_k.$

This can be written in the index-free form

$\frac{\partial C}{\partial w} = a_{\rm in}\,\delta_{\rm out},$

where $a_{\rm in}$ is the activation of the neuron feeding into the weight $w$, and $\delta_{\rm out}$ is the error of the neuron the weight feeds into.

(BP4) is proved.

An intuitive way to read (BP4): the weight $w$ connects an input activation $a_{\rm in}$ to an output error $\delta_{\rm out}$, and its gradient is just their product, so when the input activation is small, or the output neuron's error is small, the weight learns slowly.

So far all four equations of back propagation have been proved.
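A standard way to convince yourself of the four equations numerically is a finite-difference check of the gradients that backprop produces. The sketch below assumes network.py is importable and uses its quadratic-cost convention; the layer sizes and the particular weight being checked are arbitrary:

import numpy as np
import network   # Nielsen's network.py

def quadratic_cost(net, x, y):
    """C_x = 0.5 * ||y - a^L||^2 for a single training example."""
    return 0.5 * np.sum((net.feedforward(x) - y) ** 2)

net = network.Network([2, 3, 1])
x = np.array([[0.5], [0.1]])
y = np.array([[1.0]])

nabla_b, nabla_w = net.backprop(x, y)

# Compare dC/dw for one arbitrary weight against a central finite difference.
l, j, k = 0, 1, 0                 # first weight matrix, row 1, column 0
eps = 1e-6
net.weights[l][j, k] += eps
c_plus = quadratic_cost(net, x, y)
net.weights[l][j, k] -= 2 * eps
c_minus = quadratic_cost(net, x, y)
net.weights[l][j, k] += eps       # restore the original weight

numeric = (c_plus - c_minus) / (2 * eps)
print(nabla_w[l][j, k], numeric)  # the two values should agree closely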

A more concise proof of the four equations by another author can be found at CSDN: oio328Loio (reference [3]).

Back propagation: the big picture

Suppose we make a small perturbation $\Delta w^l_{jk}$ to some weight $w^l_{jk}$ in the network. That perturbation propagates through the network and eventually causes a change $\Delta C$ in the cost function, related to $\Delta w^l_{jk}$ by

$\Delta C \approx \frac{\partial C}{\partial w^l_{jk}}\,\Delta w^l_{jk}.$

One path by which the perturbation can affect the cost is the following: changing $w^l_{jk}$ changes the activation $a^l_j$, which changes some activation $a^{l+1}_q$ in the next layer, and so on up through the output activation $a^L_m$ and finally the cost $C$. Along that single path the contribution is

$\Delta C \approx \frac{\partial C}{\partial a^L_m}\,\frac{\partial a^L_m}{\partial a^{L-1}_n}\,\cdots\,\frac{\partial a^{l+1}_q}{\partial a^l_j}\,\frac{\partial a^l_j}{\partial w^l_{jk}}\,\Delta w^l_{jk}.$

To compute the total change we need to sum over all possible paths, namely

$\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}\,\frac{\partial a^L_m}{\partial a^{L-1}_n}\,\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p}\,\cdots\,\frac{\partial a^{l+1}_q}{\partial a^l_j}\,\frac{\partial a^l_j}{\partial w^l_{jk}}\,\Delta w^l_{jk},$

where the sum runs over all the intermediate neurons $m, n, p, \ldots, q$ along the path. Because $\Delta C \approx \frac{\partial C}{\partial w^l_{jk}}\Delta w^l_{jk}$, comparing the two expressions for $\Delta C$ gives

$\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}\,\frac{\partial a^L_m}{\partial a^{L-1}_n}\,\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p}\,\cdots\,\frac{\partial a^{l+1}_q}{\partial a^l_j}\,\frac{\partial a^l_j}{\partial w^l_{jk}}.$

The formula above looks complicated, but it has a nice intuitive interpretation. The quantity we want is $\frac{\partial C}{\partial w^l_{jk}}$, the rate of change of the cost with respect to one weight in the network. What the formula tells us is that every connection between two neurons is associated with a rate-of-change factor, which is simply the partial derivative of one neuron's activation with respect to the other neuron's activation (and the edge from the weight itself to its neuron carries the factor $\partial a^l_j/\partial w^l_{jk}$). The rate-of-change factor of a whole path is the product of the factors along that path, and the overall rate of change $\frac{\partial C}{\partial w^l_{jk}}$ is the sum of the rate-of-change factors of all possible paths from the initial weight to the final cost. For a given path, each step along the path contributes one factor to the product.

If you take all these contributions, sum them up with matrix operations, and simplify as much as possible, you will find that you are doing exactly backpropagation! You can therefore think of backpropagation as a way of summing the rates of change over all possible paths. Put another way, backpropagation is a technique that cleverly tracks how small perturbations to the weights and biases propagate through the network, reach the output, and finally affect the cost.
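Here is a small sketch of this path-summing picture for a toy network with layer sizes 1-2-2-1 and the quadratic cost (all sizes and values are invented for illustration). It enumerates every path from one weight in the first weight matrix to the cost, multiplies the rate-of-change factors along each path, sums them, and compares the result with a finite-difference estimate of $\partial C/\partial w$:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(1)
sizes = [1, 2, 2, 1]              # 1 input, two hidden layers of 2, 1 output
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x, y = np.array([[0.7]]), np.array([[0.3]])

# Feedforward, keeping the weighted inputs z of every layer.
a, zs = x, []
for w, b in zip(weights, biases):
    z = w @ a + b
    zs.append(z)
    a = sigmoid(z)

# Rate-of-change factor of the edge from neuron n (layer i) to neuron m (layer i+1):
# d a_m / d a_n = sigma'(z_m) * w_{mn}.
def edge(i, m, n):
    return sigmoid_prime(zs[i][m, 0]) * weights[i][m, n]

# Sum the products of factors over every path from the weight weights[0][j, 0]
# to the cost C = 0.5 * (a_out - y)^2.
j = 0
path_sum = 0.0
for q in range(sizes[2]):                 # neuron q in the second hidden layer
    for m in range(sizes[3]):             # output neuron m
        dC_da_out = a[m, 0] - y[m, 0]     # dC/da for the quadratic cost
        da_dw = sigmoid_prime(zs[0][j, 0]) * x[0, 0]
        path_sum += dC_da_out * edge(2, m, q) * edge(1, q, j) * da_dw

# Finite-difference estimate of the same derivative.
def cost(ws):
    a = x
    for w, b in zip(ws, biases):
        a = sigmoid(w @ a + b)
    return 0.5 * float(np.sum((a - y) ** 2))

eps = 1e-6
w_plus = [w.copy() for w in weights]
w_plus[0][j, 0] += eps
print(path_sum, (cost(w_plus) - cost(weights)) / eps)   # should agree closely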

If you try to turn this path-summing idea into a full proof of backpropagation, it works out to be considerably messier than the proof of the four equations given earlier, because many simplifications have to be discovered along the way. One tidy step is that the partial derivatives in the sum above are taken with respect to activation values such as $a^{l+1}_q$; the trick is to switch to the weighted inputs $z^{l+1}_q$ as intermediate variables. If that idea does not occur to you and you stick with the activations, the proof you obtain is somewhat more complicated than the one given earlier.

So the earlier proof is not as mysterious as it might first appear: it is simply the outcome of a lot of hard work spent simplifying the argument sketched in this section!


References

[1] Michael Nielsen. CHAPTER 2 How the backpropagation algorithm works[DB/OL]. neuralnetworksanddeeplearning.com/chap2.html, 2018-06-21.

[2] Zhu Xiaohu, Zhang Freeman. Another Chinese Translation of Neural Networks and Deep Learning [DB/OL]. Github.com/zhanggyb/nn… , 2018-06-21.

[3] oio328Loio. Neural network learning (3) the back propagation (BP) algorithm (1) [DB/OL]. Blog.csdn.net/hoho1151191… , 2018-06-25.