
Many books on neural networks introduce the concept of computational graphs and explain the principles of neural networks from that perspective, covering quite a lot of branches and operations. But for a simple neural network viewed from a macro perspective, its concrete operation can be understood in a more concise way.

Suppose we have a neural network with only one hidden layer and one output layer. We do not care about the number of weights or the size of the input and output; we only consider the concrete process of forward propagation and back propagation in this network.

```mermaid
flowchart LR
    input([input]) --> linear1[[hidden layer f]] --> sigmoid1{{activation function g}} --> linear2[[output layer h]] --> sigmoid2{{activation function J}}
```

Here the linear unit is separated from the activation function. For simplicity, assume that all activation functions are sigmoid functions.

So there are only two elements: linear functions and sigmoid functions. For now, let us not worry about each function's input.

For the linear function $f(x) = w \times x + b$, consider its gradient. For its own layer, it should be regarded as a trivariate function of $w$, $x$, $b$; for the other layers, it should be regarded as a unary function of $x$. So its gradients are the partial derivatives with respect to $w$, $x$, and $b$, respectively. Details are as follows:

  • For $x$: $\frac{\partial f}{\partial x} = w$
  • For $w$: $\frac{\partial f}{\partial w} = x$
  • For $b$: $\frac{\partial f}{\partial b} = 1$
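
These partial derivatives are easy to verify numerically. Here is a minimal sketch (the values and the step `eps` are arbitrary choices for illustration) that checks them with finite differences:

```python
w, x, b = 2.0, 3.0, 0.5
f = lambda w, x, b: w * x + b

eps = 1e-6
# finite-difference approximations of the three partials
df_dx = (f(w, x + eps, b) - f(w, x, b)) / eps  # should be close to w
df_dw = (f(w + eps, x, b) - f(w, x, b)) / eps  # should be close to x
df_db = (f(w, x, b + eps) - f(w, x, b)) / eps  # should be close to 1
print(df_dx, df_dw, df_db)  # ≈ 2.0 3.0 1.0
```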

As for the activation function sigmoid, it is a function of one variable, so its derivative is as follows:


$$\frac{\partial g}{\partial x} = (1 - g(x)) \times g(x)$$
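
This property is worth a quick sketch: the derivative can be computed from the sigmoid's own output, without re-evaluating the function:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 0.3           # arbitrary point for illustration
s = sigmoid(z)
ds = s * (1 - s)  # derivative expressed through the output itself
# finite-difference check: the two printed values agree
print(ds, (sigmoid(z + 1e-6) - sigmoid(z)) / 1e-6)
```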

Now let’s consider the process of forward propagation.


Hidden layer: $f(x) = w \times x + b$

First activation: $g(f) = \mathrm{sigmoid}(f)$, where $f$ is the above $f(x)$

Output layer: $h(g) = w \times g + b$, where $g$ is the above $g(f)$

Second activation: $J(h) = \mathrm{sigmoid}(h)$, where $h$ is the above $h(g)$
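
Strung together, the forward pass is just these four steps in sequence. A minimal sketch, with scalar parameters chosen arbitrarily for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = 1.0             # input
w1, b1 = 0.5, 0.1   # hidden-layer parameters (hypothetical values)
w2, b2 = -0.3, 0.2  # output-layer parameters (hypothetical values)

f = w1 * x + b1     # hidden layer
g = sigmoid(f)      # first activation
h = w2 * g + b2     # output layer
J = sigmoid(h)      # second activation: the network's output
```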


There is a reason for writing each layer out separately like this, which I will come back to later. Now that we have the network's output, let us measure how accurate it is, using the so-called squared-error loss function, with a coefficient added here for convenience.


$$Loss(J) = \frac{1}{2} \times (t - J)^2$$

where $t$ is the label value.
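
Continuing the sketch above, computing the loss takes one line (the label `t` is an arbitrary illustrative value):

```python
t = 1.0                 # hypothetical label
L = 0.5 * (t - J) ** 2  # squared-error loss
```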

If there is an error, we need to figure out how to make it as small as possible. Based on the principle of gradient descent, that means adjusting the function's variables in the direction that reduces the loss.

A classic mistake is to focus too much on $x$, the input, when in fact it is the parameters of the neural network that should be viewed as the function's variables, while the inputs are immutable. That is, we adjust the weights $w$ and biases $b$ to minimize the loss.

Let us first consider the output layer, starting with adjusting its weight.

Gradient descent tells us that the parameters should move in the direction of the negative gradient. The learning rate $lr$ is quite important in general, but its exact value does not matter for our purposes here.

Generally speaking, the way to adjust is the following process.


$$w = w - lr \times \frac{\partial L}{\partial w}$$

That requires a partial derivative, and solving for it is what leads to the chain rule.


$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial J} \times \frac{\partial J}{\partial h} \times \frac{\partial h}{\partial w}$$

In the previous section we considered the gradients of the two kinds of functions, but not that of the loss function. Now let us work out the three partial derivatives together.


$$\frac{\partial L}{\partial J} = J - t$$

We take the derivative of the loss function with respect to $J$; the $\frac{1}{2}$ coefficient was chosen precisely so that this partial derivative comes out simple. Note the minus sign: differentiating $(t - J)$ with respect to $J$ contributes a factor of $-1$, so the result is $J - t$.
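
Since the sign here is easy to get wrong, a quick numerical check (with arbitrary values):

```python
t, J, eps = 1.0, 0.7, 1e-6
loss = lambda J: 0.5 * (t - J) ** 2
print((loss(J + eps) - loss(J)) / eps)  # ≈ -0.3, i.e. J - t, not t - J
```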


$$\frac{\partial J}{\partial h} = J(h) \times (1 - J(h))$$

$$\frac{\partial h}{\partial w} = g$$

where $g$, the output of the previous layer, plays the role of the generic $x$ above.
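
Putting the three factors together, and continuing the scalar sketch from before, the output-layer weight gradient and its update look like this (the learning rate is an arbitrary illustrative value):

```python
dL_dJ = J - t        # derivative of the loss w.r.t. the output
dJ_dh = J * (1 - J)  # sigmoid derivative, from its own output
dh_dw = g            # the input to the output layer

lr = 0.1                        # illustrative learning rate
dL_dw2 = dL_dJ * dJ_dh * dh_dw  # chain rule
w2 = w2 - lr * dL_dw2           # gradient-descent step
```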

Sigmoid is one of those magic functions. In general, a function's derivative has to be computed independently of the function's value, but a few special functions have derivatives that can be obtained directly from the function's own value, somewhat like $e^x$. Mathematically the two need not be related, but here they are, and that saves us some computational overhead.

It can be seen that for the weight of the output layer, the adjustment depends on the product of the first two partial derivatives. Now let us look at the adjustment of the bias $b$. The formula is as follows:


$$b = b - lr \times \frac{\partial L}{\partial b}$$

It’s basically the same thing, moving in the direction of the negative gradient.

The expanded partial derivatives are as follows:


$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial J} \times \frac{\partial J}{\partial h} \times \frac{\partial h}{\partial b}$$

As you can see, the updates of this layer's parameters all depend on a value provided by the subsequent layers: the product of the first two partial derivatives. Since $\frac{\partial h}{\partial b} = 1$, the bias gradient is exactly that product.

For a more accurate and in-depth understanding, let’s consider the weight update of the hidden layer.

It follows the same principle of gradient descent, that is, you need to calculate the gradient to update.


$$w = w - lr \times \frac{\partial L}{\partial w}$$

Notice that the $w$ here (the hidden layer's weight) is not the same as the $w$ above (the output layer's). Now let us expand the partial derivative.


$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial J} \times \frac{\partial J}{\partial h} \times \frac{\partial h}{\partial g} \times \frac{\partial g}{\partial f} \times \frac{\partial f}{\partial w}$$

I will not write out each factor here, but it is again some partial derivatives from the upper layers multiplied by the partial derivative of this layer with respect to $w$. The bias $b$ is similar, so no further details.

The partial derivatives get longer and longer, but as before, each update depends on partial derivatives computed by the subsequent layers, and there is quite a bit of overlap between them. These repeated intermediate results can be retained to avoid recomputation.

This is the essence of the BP algorithm: it discovers the common parts among these computations and reuses them, somewhat like the idea of dynamic programming.
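
Here is a sketch of that reuse, continuing the same scalar example (and ignoring that, in a careful implementation, all gradients are computed before any parameter is updated). The common factor, often called the layer's "delta", is computed once and shared by the weight update, the bias update, and the gradient passed further back:

```python
# delta of the output layer: shared by w2, b2, and everything before
dL_dh = (J - t) * J * (1 - J)

dL_dw2 = dL_dh * g  # reuse 1: output-layer weight gradient
dL_db2 = dL_dh * 1  # reuse 2: output-layer bias gradient
dL_dg = dL_dh * w2  # reuse 3: gradient propagated to the hidden layer

# delta of the hidden layer, built on the propagated gradient
dL_df = dL_dg * g * (1 - g)
dL_dw1 = dL_df * x  # hidden-layer weight gradient
dL_db1 = dL_df * 1  # hidden-layer bias gradient
```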

Another point worth noting is that the weight updates actually use $x$, which here refers to the output of the previous layer. This means the data input to each layer must be retained, which is why each layer was written out separately earlier.
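
One common way to retain those inputs is to have each layer cache whatever it receives during the forward pass. A minimal sketch (the class and attribute names are my own, not from the original):

```python
class Linear:
    """A scalar linear unit that remembers its input for back propagation."""

    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, x):
        self.x = x  # retained: needed for dL/dw = delta * x
        return self.w * x + self.b

    def backward(self, delta, lr):
        grad_input = delta * self.w   # gradient to pass further back
        self.w -= lr * delta * self.x # update uses the cached input
        self.b -= lr * delta
        return grad_input
```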

After understanding the mathematical form, let us finally consider implementing it in code. We ignore the specifics of the data and focus only on the forward-propagation and back-propagation process.

```python
import numpy as np

Tensor = np.ndarray
# Placeholder tensors: they must be given values before the loop below runs.
X: Tensor
t: Tensor
w1: Tensor
w2: Tensor
b1: Tensor
b2: Tensor
lr = 1e-6


def sigmoid(x: Tensor):
    return 1 / (1 + np.exp(-x))


def diff_sigmoid(x: Tensor):
    # To simplify the calculation, this is not the derivative at x but at
    # x = sigmoid(...), i.e. the caller passes in the sigmoid output.
    return x * (1 - x)


def loss(x: Tensor):
    return 1 / 2 * (t - x) ** 2


def diff_loss(x: Tensor):
    # derivative of the loss with respect to x; t is the label
    return x - t


for epoch in range(10):
    f = np.dot(X, w1) + b1  # hidden layer
    g = sigmoid(f)  # hidden layer through the activation function
    h = np.dot(g, w2) + b2  # output layer
    J = sigmoid(h)  # output

    # partial derivative of the loss with respect to the output
    diff_L_to_j = diff_loss(J)
    # The sigmoid derivative is computed from the activation result J.
    # Nominally it is sigmoid(h) * (1 - sigmoid(h)), but recomputing
    # sigmoid(h) would be wasteful.
    diff_J_to_h = diff_sigmoid(J)
    # merged part, equivalent to the propagated gradient
    diff_L_to_h = diff_L_to_j * diff_J_to_h

    # the partial derivative of the output layer w.r.t. w depends on its input g
    diff_h_to_w = g

    # the partial derivative of the output layer w.r.t. b is 1, no need to compute it
    diff_h_to_b = 1

    # derivative of h with respect to the previous layer
    diff_h_to_g = w2
    # derivative of the activation g w.r.t. the previous layer, from the output g itself
    diff_g_to_f = diff_sigmoid(g)

    # merged part, equivalent to the propagated gradient
    diff_L_to_f = diff_L_to_h * diff_h_to_g * diff_g_to_f

    # the partial derivative of f with respect to w depends on the input X
    diff_f_to_w = X
    # the partial derivative of f with respect to b is 1, no need to compute it
    diff_f_to_b = 1

    # adjust weights and biases
    w2 = w2 - lr * diff_L_to_h * diff_h_to_w
    b2 = b2 - lr * diff_L_to_h  # * diff_h_to_b
    w1 = w1 - lr * diff_L_to_f * diff_f_to_w
    b1 = b1 - lr * diff_L_to_f  # * diff_f_to_b
```

The program here is not real training; it makes some trade-offs: the data are never materialized, and few extra intermediate variables are used, so it may need some adjustment before actual use.
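
To actually run the loop, the placeholder tensors need values. A minimal toy setup, assuming 1×1 shapes so the elementwise products in the loop line up (the shapes and seed are my own choices, not from the original):

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 1))  # a single scalar sample
t = rng.normal(size=(1, 1))  # its label
w1 = rng.normal(size=(1, 1))
w2 = rng.normal(size=(1, 1))
b1 = np.zeros((1, 1))
b2 = np.zeros((1, 1))
```

With wider layers, the elementwise products in the backward pass would have to become matrix products with appropriate transposes, which is exactly the kind of adjustment mentioned above.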

As you can see from the code, both the forward pass and back propagation require a significant amount of memory for holding intermediate results. This is also why training a neural network has certain memory requirements.