Part of the article is from captainbed.vip/1-3-3/

Shallow neural network

The computational process of a shallow neural network is almost the same as that of a single-neuron network; it is just more involved.

Vectorization of shallow neural networks

Vectorization is used almost everywhere in AI programming. It can be said that in most cases the smallest unit of data we deal with in AI programming is the vector, and for multi-neuron networks we write things using a data unit one level above the vector: the matrix. So vectorizing a shallow neural network is really matrixization.

Take the shallow neural network in the following figure as an example:

In the notation, the superscript indicates the layer number and the subscript indicates which neuron (which row) within the layer. Its original, per-neuron form is as follows:
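The original presents this as a figure; as a sketch, assuming the layout described below (three inputs $x_1, x_2, x_3$, four neurons in layer 1 and one output neuron in layer 2), the per-neuron form would look like this:

$$
\begin{aligned}
z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]},\quad & a_1^{[1]} = \sigma(z_1^{[1]})\\
z_2^{[1]} = w_2^{[1]T}x + b_2^{[1]},\quad & a_2^{[1]} = \sigma(z_2^{[1]})\\
z_3^{[1]} = w_3^{[1]T}x + b_3^{[1]},\quad & a_3^{[1]} = \sigma(z_3^{[1]})\\
z_4^{[1]} = w_4^{[1]T}x + b_4^{[1]},\quad & a_4^{[1]} = \sigma(z_4^{[1]})
\end{aligned}
$$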

As we know, each weight vector $w_i^{[j]T}$ is a row vector; for the first layer it contains three values, corresponding to $x_1$ through $x_3$. Here we have four such row vectors, which stack into a $4 \times 3$ matrix. The vectorized form is as follows:
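As a sketch (the original shows this as a figure), stacking the four row vectors gives:

$$
z^{[1]} =
\begin{bmatrix}
w_1^{[1]T}\\
w_2^{[1]T}\\
w_3^{[1]T}\\
w_4^{[1]T}
\end{bmatrix}
\begin{bmatrix}
x_1\\ x_2\\ x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1^{[1]}\\ b_2^{[1]}\\ b_3^{[1]}\\ b_4^{[1]}
\end{bmatrix}
$$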

Tip: Review matrix multiplication from linear algebra if this is unfamiliar.

Leaving out the subscripts, this can also be abbreviated as:
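A sketch of the abbreviated form, with $\sigma$ as the activation:

$$
z^{[1]} = W^{[1]}x + b^{[1]},\qquad a^{[1]} = \sigma(z^{[1]})
$$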

The same goes for the second layer:
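As a sketch:

$$
z^{[2]} = W^{[2]}a^{[1]} + b^{[2]},\qquad a^{[2]} = \sigma(z^{[2]})
$$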

However, since we usually need many training samples to train the neural network, it can be written as follows (the number in parentheses in the superscript denotes the $i$-th training sample):
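A sketch of this per-sample form, looping over the $m$ training samples, for $i = 1, \dots, m$:

$$
\begin{aligned}
z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]},\quad & a^{[1](i)} = \sigma(z^{[1](i)})\\
z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]},\quad & a^{[2](i)} = \sigma(z^{[2](i)})
\end{aligned}
$$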

As mentioned earlier when discussing vectorization, we generally do not recommend using for loops, and we replace them with vectorized operations whenever possible. We therefore also stack the feature vector $x$ of each sample into a single matrix $X$, as shown below:
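A sketch of the stacked form, where each column of $X$ is one sample's feature vector:

$$
X =
\begin{bmatrix}
\vert & \vert & & \vert\\
x^{(1)} & x^{(2)} & \cdots & x^{(m)}\\
\vert & \vert & & \vert
\end{bmatrix}
$$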

With this in mind, the above for loop can be written as matrix multiplication (again, review matrix multiplication if this is confusing):
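A sketch of the vectorized form ($b^{[1]}$ and $b^{[2]}$ are broadcast across the columns):

$$
\begin{aligned}
Z^{[1]} = W^{[1]}X + b^{[1]},\quad & A^{[1]} = \sigma(Z^{[1]})\\
Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]},\quad & A^{[2]} = \sigma(Z^{[2]})
\end{aligned}
$$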

Here, $Z$ and $A$ are both matrices, and each column corresponds to one training sample (for example, $z^{[1](2)}$ is the $z$ of the first layer for the second sample):
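To make this concrete, here is a minimal NumPy sketch of the vectorized forward pass; the layer sizes (3 inputs, 4 hidden neurons, 1 output), the sample count, and all variable names are illustrative assumptions, not code from the original article.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed sizes: n_x = 3 input features, n_1 = 4 hidden neurons, n_2 = 1 output, m = 5 samples
rng = np.random.default_rng(0)
n_x, n_1, n_2, m = 3, 4, 1, 5

W1 = rng.standard_normal((n_1, n_x)) * 0.01   # (4, 3)
b1 = np.zeros((n_1, 1))                        # (4, 1)
W2 = rng.standard_normal((n_2, n_1)) * 0.01   # (1, 4)
b2 = np.zeros((n_2, 1))                        # (1, 1)

X = rng.standard_normal((n_x, m))              # each column is one training sample

# Vectorized forward pass: every column of Z and A corresponds to one training sample
Z1 = W1 @ X + b1          # (4, m)
A1 = sigmoid(Z1)          # (4, m)
Z2 = W2 @ A1 + b2         # (1, m)
A2 = sigmoid(Z2)          # (1, m)

print(A2.shape)           # (1, 5)
```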

Loss function of shallow neural networks

We compute the loss function of the multi-neuron network with the following formula. Compared with the loss function of the single-neuron network, the only difference is that $a$ carries a layer superscript, since each layer produces its own $a$ and the loss is computed from the last layer's:
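Assuming the usual cross-entropy loss carried over from the single-neuron case, the formula would read:

$$
L(a^{[2]}, y) = -\left[y\log a^{[2]} + (1-y)\log\left(1-a^{[2]}\right)\right]
$$

and, averaged over the $m$ training samples,

$$
J = \frac{1}{m}\sum_{i=1}^{m} L\left(a^{[2](i)}, y^{(i)}\right)
$$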

⭐ Back propagation of shallow neural networks

Back propagation can be said to be standard equipment for neural network models; it is the most efficient way to compute the partial derivatives.

The idea behind computing partial derivatives for a shallow neural network is the same as for a single-neuron network. We first compute the partial derivatives of the last layer, then work backwards to the layer before it, then the layer before that, and so on.

Finding the partial derivatives of the second layer

The formulas for the second layer's partial derivatives are given below. They are the same as for the single-neuron network (single sample), with $a^{[1]}$ playing the role of $x$:
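For reference, with a sigmoid output and the cross-entropy loss above, the standard single-sample results are (writing $dz^{[2]}$ for $\frac{\partial L}{\partial z^{[2]}}$, and so on):

$$
dz^{[2]} = a^{[2]} - y,\qquad dW^{[2]} = dz^{[2]}a^{[1]T},\qquad db^{[2]} = dz^{[2]}
$$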

⭐️ Finding the partial derivatives of the first layer

The partial derivatives of the first layer are not the same as those of the second layer, because the second layer is directly adjacent to the loss function, whereas the first layer is not. 💡 You can see that the loss function above is calculated from the last layer's $a^{[2]}$; the middle layer only serves as a transition.

We need to use the chain rule to find the partial derivative of the first layer:


$$
\begin{cases}
\dfrac{\partial L}{\partial z^{[1]}} = \dfrac{\partial L}{\partial a^{[1]}}\,\dfrac{\partial a^{[1]}}{\partial z^{[1]}}\\[8pt]
\dfrac{\partial L}{\partial a^{[1]}} = \dfrac{\partial L}{\partial z^{[2]}}\,\dfrac{\partial z^{[2]}}{\partial a^{[1]}}
\end{cases}
$$

And we have already worked out $\frac{\partial z^{[2]}}{\partial a^{[1]}}$: since $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, it is simply $W^{[2]}$. The final result is derived as follows:


$$
\frac{\partial L}{\partial a^{[1]}} = W^{[2]T}\frac{\partial L}{\partial z^{[2]}}
$$

$$
\frac{\partial L}{\partial z^{[1]}} = W^{[2]T}\frac{\partial L}{\partial z^{[2]}} * \frac{\partial g^{[1]}}{\partial z^{[1]}}
$$

(here $g^{[1]}$ denotes the sigmoid activation function; in fact, other activation functions can also be used)

Having obtained $dz^{[1]}$, we can derive $dw^{[1]}$ and $db^{[1]}$ from $dz^{[1]}$; the formulas are the same as for the single-neuron network:


$$
dw^{[1]} = dz^{[1]}X^{T}
$$


$$
db^{[1]} = dz^{[1]}
$$

Multiple training samples

When training on multiple samples, we turn the vectors into matrices (lowercase letters become uppercase) through vectorization to improve computational efficiency, and we divide by the number of samples $m$ to take the average.
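As a sketch, the vectorized (multi-sample) versions of the formulas above look like this, with the $\frac{1}{m}$ factor providing the average:

$$
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y, & dW^{[2]} &= \frac{1}{m}\,dZ^{[2]}A^{[1]T}, & db^{[2]} &= \frac{1}{m}\sum_{i=1}^{m} dz^{[2](i)}\\
dZ^{[1]} &= W^{[2]T}dZ^{[2]} * \frac{\partial g^{[1]}}{\partial Z^{[1]}}, & dW^{[1]} &= \frac{1}{m}\,dZ^{[1]}X^{T}, & db^{[1]} &= \frac{1}{m}\sum_{i=1}^{m} dz^{[1](i)}
\end{aligned}
$$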

Finally, the parameters passed to np.sum() are different, because the first layer no longer has just one neuron: the dimension of $dZ^{[1]}$ is $(n^{[1]}, m)$, where $n^{[1]}$ is the number of neurons in the first layer and $m$ is the number of training samples, so for each neuron we need to add up the results over its corresponding training samples:

So what axis=1 does is make np.sum add up the elements in each row, and keepdims=True prevents the output of the sum from collapsing into the shape $(n^{[1]},)$; we need it in the shape $(n^{[1]}, 1)$.
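To make the axis=1 and keepdims=True behavior concrete, here is a minimal NumPy sketch of the vectorized backward pass; it reuses the assumed shapes from the forward-pass sketch above, and all names are illustrative rather than taken from the original article.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed shapes: n_x = 3 inputs, n_1 = 4 hidden neurons, m = 5 samples (same as the forward-pass sketch)
rng = np.random.default_rng(1)
n_x, n_1, m = 3, 4, 5
X  = rng.standard_normal((n_x, m))
Y  = rng.integers(0, 2, size=(1, m))
W1 = rng.standard_normal((n_1, n_x)) * 0.01
b1 = np.zeros((n_1, 1))
W2 = rng.standard_normal((1, n_1)) * 0.01
b2 = np.zeros((1, 1))

# Forward pass (sigmoid in both layers, as in the article)
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Backward pass, averaged over the m samples
dZ2 = A2 - Y                                        # (1, m)
dW2 = (dZ2 @ A1.T) / m                              # (1, n_1)
db2 = np.sum(dZ2, axis=1, keepdims=True) / m        # (1, 1)

dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)                  # (n_1, m); sigmoid'(Z1) = A1 * (1 - A1)
dW1 = (dZ1 @ X.T) / m                               # (n_1, n_x)
db1 = np.sum(dZ1, axis=1, keepdims=True) / m        # (n_1, 1), not (n_1,)

print(db1.shape)  # (4, 1) thanks to keepdims=True
```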