
1. Fully connected neural network structure

On the left of the figure is the simplest structure of a fully connected neural network. The leftmost layer is the input layer, which receives the data; the rightmost layer is the output layer, from which the output is read; the layers in between are the hidden layers. In a fully connected neural network there are no connections between neurons within the same layer, and every neuron in the nth layer is connected to every neuron in the (n−1)th layer, which is what "fully connected" means. The outputs of the neurons in the (n−1)th layer are the inputs of the neurons in the nth layer. Each connection carries a weight w, and these weights are the parameters of the model (that is, what the model learns).

The nodes in this structure are neurons; the diagram on the right shows the simplest node of a fully connected neural network. Neurons are the basic units of the network. A neuron consists of:

  • Input: an n-dimensional vector x
  • A linear weighting (a weighted sum of the inputs)
  • An activation function
  • An output a

The output of the previous layer is fed into the neurons of the next layer, where it is linearly weighted by the weights on the connections between the two layers. The result is then passed through the activation function and sent on to the neurons of the following layer. This step is repeated once per hidden layer until the output layer is reached and the result is produced, as in the sketch below.
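As a concrete illustration, here is a minimal sketch of this forward pass (the layer sizes and random weights are assumptions for illustration, not values from the article): each layer linearly weights its input, applies an activation function, and passes the result to the next layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # input layer: a 3-dimensional vector

# One weight matrix and bias vector per connection between consecutive layers
layer_shapes = [(3, 4), (4, 2)]                # 3 inputs -> 4 hidden -> 2 outputs
weights = [rng.normal(size=s) for s in layer_shapes]
biases = [np.zeros(s[1]) for s in layer_shapes]

a = x
for W, b in zip(weights, biases):
    z = a @ W + b                              # linear weighting of the previous layer's output
    a = sigmoid(z)                             # activation function
print(a)                                       # output of the output layer
```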

2. Backpropagation

So how does the neural network learn the weights w during training? The training method used here is the backpropagation algorithm. First we need to look at two concepts.

(1) Loss Function

The first is the loss function. The Loss is the gap between the network's output (the predicted value) and the true value, and the loss function can be written as the formula in the figure below. To prevent negative and positive errors from cancelling each other out, each error is squared; since Loss measures the error between the true value and the predicted value, squaring does not change its meaning. The whole training process is the process of shrinking this Loss. From the simplified form of the formula we can see that A, B, C, D, E and F are constant coefficients, and the unknowns are w and b. In other words, to minimize Loss we need to solve for the best w and b, and here we use gradient descent to find that solution.
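The figure with the formula is not reproduced here; a standard way to write the squared-error loss described above (my reconstruction of the form, not the figure itself) is:

$$
Loss = \sum_{i}\left(\hat{y}_i - y_i\right)^2
$$

where \(\hat{y}_i\) is the predicted value and \(y_i\) is the true value.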

(2) Gradient descent

Gradient: the direction along which the directional derivative of a function at a point attains its maximum value; that is, the function changes fastest along this direction (the direction of the gradient) at this point, and the rate of change is largest (its value is the modulus of the gradient). The first formula in the figure below gives the gradient in three-dimensional space. During backpropagation, we first set initial values for the unknown parameters and then update them in the direction in which Loss decreases fastest, according to the gradient. The second formula expresses the relationship between the current value of an unknown and its value at the previous step, where α represents the learning rate, which directly affects how quickly the model converges to a local minimum. We keep iterating in this way: each update step becomes smaller and the loss value keeps shrinking, until the Loss falls below some threshold or a set number of iterations is reached, at which point we stop training and take the parameters found in this way as our solution.
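The formulas in the figure are not reproduced here; in standard notation (my reconstruction, not the figure itself) the gradient in three-dimensional space and the parameter update rule are usually written as:

$$
\nabla f(x, y, z) = \left(\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right)
$$

$$
w_{t+1} = w_t - \alpha \, \frac{\partial Loss}{\partial w_t}
$$

where α is the learning rate mentioned above.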

This whole process is called gradient backpropagation.
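Putting the two concepts together, here is a minimal sketch of the procedure (the toy data, learning rate and iteration count are assumptions for illustration, not values from the article): fitting the simplest possible model, y = w·x + b, with a squared-error loss and gradient descent.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])          # toy inputs (assumed for illustration)
y_true = np.array([1.0, 3.0, 5.0, 7.0])     # toy targets: true relation is y = 2x + 1

w, b = 0.0, 0.0      # initial guesses for the unknown parameters
alpha = 0.01         # learning rate (the α in the update rule)

for step in range(2000):
    y_pred = w * x + b                      # forward pass (predicted values)
    loss = np.sum((y_pred - y_true) ** 2)   # squared-error loss
    # Gradients of the loss with respect to w and b (chain rule)
    dw = np.sum(2 * (y_pred - y_true) * x)
    db = np.sum(2 * (y_pred - y_true))
    # Move against the gradient: the direction of fastest loss decrease
    w -= alpha * dw
    b -= alpha * db
    if step % 500 == 0:
        print(f"step {step}: loss = {loss:.4f}")

print(w, b)          # w and b should approach 2 and 1 as the loss shrinks
```

After enough iterations w and b approach the values that minimize the loss, which is exactly the "solve for the best w and b" described above.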

3. Limitations of fully connected neural networks

The limitation of fully connected neural networks is that the more complex the network structure, the slower the whole network converges. Take the network structure shown in the figure above on the right: for the hidden layer and the output layer we need to take four partial derivatives, at the cost of three chain-rule multiplications. The deeper the network, the more partial derivatives and multiplications are required, and the greater the computational cost; the network is also likely to form many complex nested dependencies, so the whole network converges very slowly. Because of these limitations of fully connected neural networks, many other network architectures have since emerged.
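As a rough, hypothetical illustration of why the cost grows quickly (the layer sizes below are made up, not from the article), we can count the parameters of a fully connected network: every neuron in one layer connects to every neuron in the previous layer, so the number of weights grows rapidly with width and depth.

```python
# Count weights and biases of a fully connected network given its layer sizes
def count_parameters(layer_sizes):
    # One (n_in x n_out) weight matrix plus one bias per neuron for each pair of layers
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_parameters([784, 100, 10]))       # small network: 79,510 parameters
print(count_parameters([784, 512, 512, 10]))  # deeper and wider: 669,706 parameters
```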