This is the second part of my notes on Andrew Ng's machine learning course. It covers the principles and derivation of forward propagation and back propagation in neural networks.

Fundamentals of Neural Networks

Introduction to the concept

An artificial neural network (ANN), or neural network (NN) for short, is a mathematical model that imitates the structure of biological neurons joined by fixed connections. The parameters of the neurons are updated through forward and back propagation. Put simply, a neural network is made up of a series of layers, each of which contains many neurons.

A neural network can be logically divided into three kinds of layers:

  • Input layer: the first layer, which receives the features $x$.
  • Output layer: the last layer, which outputs the hypothesis $h$, the final prediction.
  • Hidden layers: the layers in between, which are not directly visible.

Features:

  • Every neural network has input and output values
  • How it is trained:
    • Massive data sets
    • Thousands upon thousands of training iterations
    • Learning from mistakes: comparing the predicted answer with the real answer, then improving recognition through back propagation.

The following uses the simplest two-layer neural network as an example.

In this network, the input feature vector is $x$, the weight matrix is $W$, the bias is $b$, and $a$ denotes the output of each neuron. The bracketed superscript is the layer index (the hidden layer is layer 1).

Formula:


  • $z=W^Tx+b$

  • $a=g(z)=\dfrac{1}{1+e^{-z}}$
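
As a minimal sketch of these two formulas, the snippet below evaluates a single sigmoid neuron in plain Python. The function names `sigmoid` and `neuron_output` and the sample numbers are illustrative assumptions, not part of the course notes.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(w, x, b):
    """Compute a = g(w^T x + b) for one neuron; w and x are equal-length lists."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative values (assumed): the weighted sum is 0.5*1 - 0.5*1 + 0 = 0
a = neuron_output(w=[0.5, -0.5], x=[1.0, 1.0], b=0.0)
print(a)  # 0.5, because sigmoid(0) = 1 / (1 + 1)
```

A weighted sum of zero sits exactly on the midpoint of the sigmoid, which is why the output is 0.5.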

Computation steps of the neural network:

  • First, compute $z^{[1]}=W^{[1]^T}x+b$ for each node in the first layer.
  • Then apply the activation function to get $a^{[1]}=g(z^{[1]})$.
  • Compute $a^{[2]}$ for the second layer in the same way.
  • Finally, compute $Loss(a^{[2]},y)$.

Vectorized computation

Formula of the first layer (each row of the weight matrix is the transposed weight vector of one neuron):


$$\begin{bmatrix}z_1^{[1]} \\z_2^{[1]} \\z_3^{[1]} \end{bmatrix}_{3\times1}=\begin{bmatrix}W_{11}^{[1]^T}\dots \\W_{21}^{[1]^T}\dots \\W_{31}^{[1]^T}\dots \end{bmatrix}_{3\times3}*\begin{bmatrix}x_1\\x_2\\x_3 \end{bmatrix}_{3\times1}+\begin{bmatrix}b_1^{[1]}\\b_2^{[1]}\\b_3^{[1]}\end{bmatrix}_{3\times1}$$


$$\begin{bmatrix}a_1^{[1]} \\a_2^{[1]} \\a_3^{[1]} \end{bmatrix}_{3\times1}=\begin{bmatrix}g(z_1^{[1]})\\g(z_2^{[1]}) \\g(z_3^{[1]}) \end{bmatrix}_{3\times1}$$

Formula of the second layer:


$$\begin{bmatrix}z^{[2]}\end{bmatrix}_{1\times1}=\begin{bmatrix}W_{11}^{[2]^T}\dots \end{bmatrix}_{1\times3}*\begin{bmatrix}a_1^{[1]} \\a_2^{[1]} \\a_3^{[1]} \end{bmatrix}_{3\times1}+\begin{bmatrix}b^{[2]}\end{bmatrix}_{1\times1}$$


$$\begin{bmatrix}a^{[2]}\end{bmatrix}_{1\times1}=\begin{bmatrix}g(z^{[2]})\end{bmatrix}_{1\times1}$$

Output layer:


$$Loss(a^{[2]},y)=-y\log(a^{[2]})-(1-y)\log(1-a^{[2]})$$

This formula is very similar to the cost function in logistic regression.
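
The layer formulas above can be sketched in NumPy as follows. This is a minimal illustration rather than code from the course: the function names `forward` and `loss`, the all-zero parameters, and the sample input are assumptions chosen so the result is easy to check by hand. Each weight matrix is assumed to store one row per neuron, so a layer is computed as `W @ a + b`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of the 3-3-1 network for one sample x of shape (3, 1)."""
    z1 = W1 @ x + b1      # (3, 3) @ (3, 1) + (3, 1) -> (3, 1)
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2     # (1, 3) @ (3, 1) + (1, 1) -> (1, 1)
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(a2, y):
    """Loss(a2, y) = -y log(a2) - (1 - y) log(1 - a2), returned as a float."""
    return (-y * np.log(a2) - (1 - y) * np.log(1 - a2)).item()

# Assumed toy setup: all-zero parameters make every activation equal to 0.5
x = np.array([[1.0], [2.0], [3.0]])
W1, b1 = np.zeros((3, 3)), np.zeros((3, 1))
W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))

z1, a1, z2, a2 = forward(x, W1, b1, W2, b2)
print(a2.shape)                 # (1, 1)
print(round(loss(a2, 1.0), 4))  # 0.6931, i.e. -log(0.5)
```

The shape comments mirror the dimension subscripts in the matrices above, which is a quick way to catch a misplaced transpose.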

Forward propagation

Starting from the input layer, computing $a^{[l]}(z)$ layer by layer and then obtaining the $Loss$ from the final output and the label value is called forward propagation.

The forward propagation process of the above example is:


  1. $z^{[1]}=W^{[1]}x+b^{[1]}$

  2. $a^{[1]}=g(z^{[1]})$

  3. $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$

  4. $a^{[2]}=g(z^{[2]})$

  5. $Loss(a^{[2]},y)=-y\log(a^{[2]})-(1-y)\log(1-a^{[2]})$

Vectorization over multiple samples

With multiple samples, the input is no longer a single vector but a matrix. The principle is essentially the same; the only caveat is that the matrix dimensions have to match.

Assuming there are $m$ training samples, the formulas above are updated as follows:


  • $Z=W^TX+b$
  • For a single sample: $z^{[1](i)}=W^Tx^{(i)}+b^{[1]}$

  • $X=\begin{bmatrix} \vdots&\vdots&\vdots\\x^{(1)}&x^{(2)}&x^{(m)}\\\vdots&\vdots&\vdots\end{bmatrix}$

  • $Z^{[1]}=\begin{bmatrix} \vdots&\vdots&\vdots\\z^{[1](1)}&z^{[1](2)}&z^{[1](m)}\\\vdots&\vdots&\vdots\end{bmatrix}$

  • $A^{[1]}=\begin{bmatrix} \vdots&\vdots&\vdots\\a^{[1](1)}&a^{[1](2)}&a^{[1](m)}\\\vdots&\vdots&\vdots\end{bmatrix}$
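
To see that stacking samples as columns gives the same result as handling them one at a time, here is a small NumPy sketch. The sample count, the random values, and the variable names are assumptions for illustration, and the weight matrix is assumed to hold one row per neuron so that the layer reads `W1 @ X + b1`.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                           # number of samples (assumed)
W1 = rng.normal(size=(3, 3))    # one row per hidden neuron
b1 = rng.normal(size=(3, 1))
X = rng.normal(size=(3, m))     # column i is the sample x^(i)

# Vectorized: b1 is broadcast across the m columns
Z1 = W1 @ X + b1                # shape (3, m)

# Loop version: one column (one sample) at a time
Z1_loop = np.hstack([W1 @ X[:, [i]] + b1 for i in range(m)])

print(Z1.shape)                  # (3, 4)
print(np.allclose(Z1, Z1_loop))  # True
```

The bias has shape `(3, 1)` while the product has shape `(3, m)`; NumPy broadcasting repeats the bias for every column, which is what makes the dimensions match.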

Activation functions

Why do you need a nonlinear activation function?

If the activation function is removed, or is itself linear, the composition of two linear functions is still linear. No matter how many layers the network has, it then only computes a linear function of its input, so the hidden layers are useless.
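
This collapse can be checked directly: with the identity as activation, $W^{[2]}(W^{[1]}x+b^{[1]})+b^{[2]}=(W^{[2]}W^{[1]})x+(W^{[2]}b^{[1]}+b^{[2]})$, which is a single linear layer. The NumPy sketch below verifies it on random values; the shapes and the names `W_eq` and `b_eq` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))
x = rng.normal(size=(3, 1))

# Two layers with the identity activation g(z) = z
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map written as one linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layer, one_layer))  # True
```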

Here are three common activation functions:


$\mathbf{Sigmoid}$ function


  • $\sigma(z)=\dfrac{1}{1+e^{-z}}$
  • $\sigma'(z)=\sigma(z)(1-\sigma(z))$, a very useful property


$\mathbf{tanh}$ function


  • $tanh(z)=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}$
  • The range is $(-1,1)$.

  • $tanh'(z)=1-(tanh(z))^2$

$\mathbf{ReLU}$ function


  • $g(z)=max(0,z)$


  • $g'(z)=\begin{cases}0&if\; z<0 \\1&if\; z>0\\undefined&if\; z=0\end{cases}$

  • Leaky ReLU function


    • $g(z)=max(0.01z,z)$

    • $g'(z)=\begin{cases}0.01&if\; z<0 \\1&if\; z>0\\undefined&if\; z=0\end{cases}$
  • ReLU is the most widely used of these activation functions.
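
The derivative formulas above can be sanity-checked against a central finite difference. The sketch below is plain Python; the function names, the test points, and the choice to return 0 for the ReLU derivative at $z=0$ (where it is undefined) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    # Undefined at z = 0; this sketch returns 0 there (an assumed convention)
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    return max(slope * z, z)

def d_leaky_relu(z, slope=0.01):
    # Undefined at z = 0; this sketch returns the negative-side slope there
    return 1.0 if z > 0 else slope

def numeric_derivative(f, z, eps=1e-6):
    """Central difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

pairs = [
    ("sigmoid", sigmoid, d_sigmoid),
    ("tanh", math.tanh, d_tanh),
    ("relu", relu, d_relu),
    ("leaky_relu", leaky_relu, d_leaky_relu),
]
for name, f, df in pairs:
    for z in (-2.0, 0.5, 3.0):
        # Print the analytic and numeric derivative side by side
        print(name, z, df(z), numeric_derivative(f, z))
```

The test points deliberately avoid $z=0$, since the finite difference straddles the kink there and no single value is correct.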

Gradient descent in neural networks

A neural network can contain multiple hidden layers. The neurons in every layer produce outputs, but the error is only computed at the output layer. How, then, do we optimize the final loss function $L(a^{[L]},y)$?

The gradient descent rule used for traditional regression problems such as logistic regression cannot be applied directly here, because the error has to be attributed and optimized layer by layer. Neural networks therefore use the back propagation algorithm to optimize the error.

Back propagation derivation

First, let’s review the loss function:


  • $Loss(a^{[L]},y)=-y\log(a^{[L]})-(1-y)\log(1-a^{[L]})$

Forward propagation computes the final value step by step, starting from the input layer. Back propagation, as the name suggests, runs in the opposite direction: starting from the final loss function, it takes partial derivatives layer by layer back toward the input. The method is based on the chain rule.

Recall the forward propagation process:


  1. $z^{[1]}=W^{[1]}x+b^{[1]}$

  2. $a^{[1]}=g(z^{[1]})$

  3. $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$

  4. $a^{[2]}=g(z^{[2]})$

  5. $Loss(a^{[2]},y)=-y\log(a^{[2]})-(1-y)\log(1-a^{[2]})$

Then the back propagation process is as follows:

Output layer:


  1. $da^{[2]}=\dfrac{\partial Loss}{\partial a^{[2]}}=-\dfrac{y}{a^{[2]}}+\dfrac{1-y}{1-a^{[2]}}$

  2. $dz^{[2]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}=(-\dfrac{y}{a^{[2]}}+\dfrac{1-y}{1-a^{[2]}})\cdot a^{[2]}(1-a^{[2]})=a^{[2]}-y$

  3. $dW^{[2]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}\cdot\dfrac{d z^{[2]}}{d W^{[2]}}=dz^{[2]}a^{[1]^T}$

  4. $db^{[2]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}\cdot\dfrac{d z^{[2]}}{d b^{[2]}}=dz^{[2]}$

Hidden layer:


  1. $da^{[1]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}\cdot\dfrac{d z^{[2]}}{d a^{[1]}}=W^{[2]^T}dz^{[2]}$

  2. $dz^{[1]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}\cdot\dfrac{d z^{[2]}}{d a^{[1]}}\cdot\dfrac{d a^{[1]}}{d z^{[1]}}=W^{[2]^T}dz^{[2]}*g'(z^{[1]})$

  3. $dW^{[1]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}\cdot\dfrac{d z^{[2]}}{d a^{[1]}}\cdot\dfrac{d a^{[1]}}{d z^{[1]}}\cdot\dfrac{d z^{[1]}}{d W^{[1]}}=dz^{[1]}x^T$

  4. $db^{[1]}=\dfrac{\partial Loss}{\partial a^{[2]}}\cdot\dfrac{d a^{[2]}}{d z^{[2]}}\cdot\dfrac{d z^{[2]}}{d a^{[1]}}\cdot\dfrac{d a^{[1]}}{d z^{[1]}}\cdot\dfrac{d z^{[1]}}{d b^{[1]}}=dz^{[1]}$

The input layer is not computed.
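
The derivation can be checked numerically. The sketch below implements the gradients for a single sample, assuming the sigmoid activation in both layers, and compares one entry of $dW^{[1]}$ against a central finite difference of the loss. The function names, the random parameters, and the step size `eps` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(params, x, y):
    a2 = forward(params, x)[3]
    return (-y * np.log(a2) - (1 - y) * np.log(1 - a2)).item()

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    # Output layer
    dz2 = a2 - y
    dW2 = dz2 @ a1.T
    db2 = dz2
    # Hidden layer: g'(z1) = a1 * (1 - a1) for the sigmoid
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)
    dW1 = dz1 @ x.T
    db1 = dz1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
shapes = [(3, 3), (3, 1), (1, 3), (1, 1)]
params = [rng.normal(size=s) for s in shapes]
x, y = rng.normal(size=(3, 1)), 1.0

dW1, db1, dW2, db2 = backward(params, x, y)

# Finite-difference check on a single entry of W1
eps = 1e-6
W1_plus = params[0].copy()
W1_plus[0, 1] += eps
W1_minus = params[0].copy()
W1_minus[0, 1] -= eps
numeric = (loss([W1_plus] + params[1:], x, y)
           - loss([W1_minus] + params[1:], x, y)) / (2 * eps)
print(abs(numeric - dW1[0, 1]) < 1e-6)  # True
```

Each gradient has the same shape as the parameter it belongs to, which is a useful check when translating the formulas into code.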

We define the error of each layer as the vector $\delta^{(l)}$, where $l$ is the layer index and $L$ is the total number of layers; $\delta^{(l)}$ is the quantity written as $dz^{[l]}$ in the derivation above. From that derivation we get:


$$\delta^{(l)}=\begin{cases}a^{[l]}-y&l=L\\W^{[l+1]^T}dz^{[l+1]}*g^{[l]\prime}(z^{[l]})&l=1,2,\dots,L-1\end{cases}$$

For $m$ samples, the same formulas are applied to the stacked matrices and averaged, with the labels collected in the row vector $Y$:


  1. $dZ^{[2]}=A^{[2]}-Y$

  2. $dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]^T}$

  3. $db^{[2]}=\dfrac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)$
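
A minimal NumPy sketch of the $m$-sample case follows. The layer-2 lines match the three formulas above. The layer-1 lines extend the single-sample derivation in the same pattern and are an assumption, as are the sigmoid activation, the function names, the toy data, and the learning rate used in the plain gradient descent update at the end.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(A2, Y):
    """Mean of the per-sample loss over the m columns."""
    return float(np.mean(-Y * np.log(A2) - (1 - Y) * np.log(1 - A2)))

def gradients(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    # Forward pass on all m samples at once
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # Layer 2, as in the formulas above
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    # Layer 1, assumed by analogy with the single-sample derivation
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2], cost(A2, Y)

# Toy data (assumed): 8 samples with 3 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)
params = [rng.normal(size=s) for s in [(3, 3), (3, 1), (1, 3), (1, 1)]]

grads, current_cost = gradients(params, X, Y)

# One gradient descent update with an assumed learning rate
learning_rate = 0.1
params = [p - learning_rate * g for p, g in zip(params, grads)]
```

`keepdims=True` keeps the bias gradients as column vectors of shape `(n, 1)` instead of flat arrays, so they can be subtracted from the biases without reshaping.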

That completes a detailed derivation of back propagation. Once you understand the chain rule, the principle of back propagation is easy to follow.