Please indicate the source for reprinting: blog.csdn.net/gamer_gyt

Weibo: weibo.com/234654758

Github: github.com/thinkgamer

Search and Recommendation Wiki

Personal website: thinkgamer.github.io


Other articles in the series:

  • Five Common Neural Networks (1) - Feedforward Neural Networks
  • Five Common Neural Networks (2) - Convolutional Neural Networks
  • Five Common Neural Networks (3) - Recurrent Neural Networks (Part 1)
  • Five Common Neural Networks (3) - Recurrent Neural Networks (Part 2)
  • Five Common Neural Networks (3) - Recurrent Neural Networks (Part 3)
  • Five Common Neural Networks (4) - Deep Belief Networks (Part 1)
  • Five Common Neural Networks (4) - Deep Belief Networks (Part 2)
  • Five Common Neural Networks (5) - Generative Adversarial Networks

Given a set of neurons, we can construct a network with the neurons as nodes. Different neural network models have different network connection topologies. One of the most straightforward topologies is the feedforward network. The Feedforward Neural Network (FNN) is the earliest and simplest type of artificial neural network.

Introduction

In a feedforward neural network, different neurons belong to different layers. The neurons in each layer receive signals from the neurons in the previous layer and produce signals for the next layer. Layer 0 is called the input layer, the last layer is called the output layer, and the layers in between are called hidden layers. There is no feedback in the whole network: signals propagate one way from the input layer to the output layer, so the network can be represented by a directed acyclic graph.

Feedforward neural networks are also known as Multi-Layer Perceptrons (MLP). However, the name "multi-layer perceptron" is not quite accurate, because a feedforward neural network is actually composed of multiple layers of logistic regression models (continuous nonlinear functions), rather than multiple layers of perceptrons (discontinuous nonlinear functions).
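To make the distinction concrete, here is a minimal sketch (scalar activations, illustrative values only) comparing the perceptron's discontinuous step function with the continuous logistic function used in feedforward networks:

```python
import math

def step(z):
    """Perceptron activation: jumps discontinuously at z = 0."""
    return 1.0 if z >= 0 else 0.0

def logistic(z):
    """Logistic (sigmoid) activation: continuous and differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-z))

# Near z = 0 the step function jumps, while the logistic changes smoothly.
print(step(-0.01), step(0.01))                               # 0.0 1.0 (jump)
print(round(logistic(-0.01), 4), round(logistic(0.01), 4))   # 0.4975 0.5025 (smooth)
```

The continuity of the logistic function is what makes gradient-based training (discussed below) possible.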

The following figure shows a simple feedforward neural network:

Several concepts involved in neural networks:

  • L: the number of layers of the neural network
  • M_l: the number of neurons in layer l
  • f_l(·): the activation function of the neurons in layer l
  • W^l: the weight matrix from layer l-1 to layer l
  • b^l: the bias from layer l-1 to layer l
  • z^l: the net input (net activation) of the neurons in layer l
  • a^l: the output (activation) of the neurons in layer l

The information propagation formula of the neural network is as follows (Formula 1-1):

z^l = W^l · a^(l-1) + b^l
a^l = f_l(z^l)

Formula 1-1 can also be combined and written as (Formula 1-2):

z^l = W^l · f_{l-1}(z^(l-1)) + b^l

or as (Formula 1-3):

a^l = f_l(W^l · a^(l-1) + b^l)

In this way, the neural network obtains the final output a^L through layer-by-layer information propagation. The entire network can be viewed as a composite function φ(x; W, b):

Taking the vector x as the input of the first layer (a^0 = x) and the output a^L of layer L as the output of the whole function:

x = a^0 → z^1 → a^1 → z^2 → ... → z^L → a^L = φ(x; W, b)

Where W and b represent the connection weights and biases of all layers in the network.
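The layer-by-layer propagation above can be sketched in plain Python. This is a minimal illustrative example with made-up sizes and parameters, using the logistic function as every f_l:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Apply z^l = W^l a^(l-1) + b^l and a^l = f_l(z^l) layer by layer."""
    a = x  # a^0 = x
    for W, b in zip(weights, biases):
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]      # net input z^l
        a = [logistic(z_i) for z_i in z]     # activation a^l = f_l(z^l)
    return a  # a^L

# A 2-2-1 network with illustrative (made-up) parameters.
W1 = [[0.5, -0.2], [0.1, 0.4]]; b1 = [0.0, 0.1]
W2 = [[0.3, -0.6]];             b2 = [0.2]
print(forward([1.0, 2.0], [W1, W2], [b1, b2]))  # a single value in (0, 1)
```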

Parameter learning

If the cross-entropy loss function is adopted, then for a sample (x, y) the loss function is (Formula 1-4):

L(y, ŷ) = -y^T log ŷ

where y ∈ {0, 1}^C is the one-hot vector corresponding to the label y, and C is the number of classes.
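As a quick numeric check, here is a minimal sketch of the one-hot cross-entropy loss with made-up predictions:

```python
import math

def cross_entropy(y, y_hat):
    """L(y, ŷ) = -y^T log ŷ; with one-hot y this picks out -log of the true class."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat))

y = [0, 1, 0]             # one-hot label: class 1
y_hat = [0.2, 0.7, 0.1]   # assumed network output (a probability vector)
print(cross_entropy(y, y_hat))  # -log(0.7) ≈ 0.3567
```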

Given the training set D = {(x^n, y^n)}, 1 <= n <= N, feed each sample x^n into the feedforward neural network to obtain the network output ŷ^n. The structured risk function on the data set D is (Formula 1-5):

R(W, b) = (1/N) · Σ_{n=1}^{N} L(y^n, ŷ^n) + (1/2) · λ · ||W||_F^2

where W and b denote all the weight matrices and bias vectors in the network; ||W||_F^2 is the regularization term, used to prevent overfitting; λ is a positive hyperparameter, and the larger λ is, the closer W is to 0. ||W||_F^2 is generally the Frobenius norm:

||W||_F^2 = Σ_{l=1}^{L} Σ_{i=1}^{M_l} Σ_{j=1}^{M_{l-1}} (w_ij^l)^2

With the learning criterion and training samples, the network parameters can be learned by gradient descent. In each iteration of gradient descent, the parameters W^l and b^l of layer l are updated as (Formula 1-6):

W^l ← W^l - α · ∂R(W, b)/∂W^l
b^l ← b^l - α · ∂R(W, b)/∂b^l

where α is the learning rate.
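A minimal sketch of the update rule on a toy one-parameter problem (the risk function R(w) = (w - 3)^2 and its gradient 2(w - 3) are assumptions for illustration):

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly apply w ← w - α · ∂R/∂w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Minimize R(w) = (w - 3)^2; the gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)  # converges toward the minimum at w = 3
```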

The gradient descent method needs the partial derivatives of the loss function with respect to the parameters, and computing each partial derivative one by one through the chain rule is inefficient. In neural network training, the backpropagation algorithm is therefore used to compute the gradient efficiently.

Backpropagation algorithm

The training process of a feedforward neural network based on the backpropagation (BP) algorithm can be divided into the following three steps:

  • Feedforward: compute the net input z^l and activation a^l of each layer, up to the last layer
  • Backpropagation: compute the error term δ^l of each layer
  • Compute the partial derivatives with respect to the parameters of each layer and update the parameters

The specific training process is as follows:
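The three steps can be sketched end to end in plain Python. This is an illustrative 1-2-1 network with made-up parameters, and it uses a squared-error loss for brevity rather than the cross-entropy loss above:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up initial parameters for a 1-2-1 network.
W1, b1 = [[0.5], [-0.3]], [0.1, 0.2]
W2, b2 = [[0.4, 0.7]], [0.0]
x, y, alpha = [1.0], 1.0, 0.5

def train_step():
    # 1) Feedforward: compute z^l and a^l for each layer.
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [logistic(z) for z in z1]
    z2 = [sum(w * ai for w, ai in zip(row, a1)) + b for row, b in zip(W2, b2)]
    a2 = [logistic(z) for z in z2]
    loss = 0.5 * (a2[0] - y) ** 2
    # 2) Backpropagate the error term δ^l of each layer.
    d2 = [(a2[0] - y) * a2[0] * (1 - a2[0])]
    d1 = [W2[0][j] * d2[0] * a1[j] * (1 - a1[j]) for j in range(2)]
    # 3) Partial derivatives ∂L/∂W^l = δ^l (a^(l-1))^T, ∂L/∂b^l = δ^l; then update.
    for j in range(2):
        W2[0][j] -= alpha * d2[0] * a1[j]
        W1[j][0] -= alpha * d1[j] * x[0]
        b1[j]    -= alpha * d1[j]
    b2[0] -= alpha * d2[0]
    return loss

losses = [train_step() for _ in range(200)]
print(losses[0], losses[-1])  # the loss decreases over training
```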

Automatic gradient calculation

The parameters of neural network are optimized by gradient descent. After determining the risk function and network structure, we can use the chain rule to calculate the gradient of the risk function for each parameter, and implement it in code.

At present, almost all deep learning frameworks contain the function of automatic gradient calculation. When using the framework for neural network development, we only need to consider the structure of the network and implement the code. The gradient can be calculated automatically without manual intervention, which greatly improves the development efficiency.

Automatic gradient calculation methods are divided into the following three types:

Numerical differentiation

Numerical differentiation computes the derivative of f(x) by numerical methods. The derivative of the function f(x) at point x is defined as:

f'(x) = lim_{Δx → 0} (f(x + Δx) - f(x)) / Δx

To compute the derivative of f(x) at point x, one can apply a small nonzero perturbation Δx to x and compute the gradient directly from the definition above. Numerical differentiation is easy to implement, but it is difficult to choose a suitable perturbation: if Δx is too small, it causes numerical problems such as rounding error; if it is too large, it increases the truncation error and makes the derivative inaccurate. The practicality of numerical differentiation is therefore relatively poor. In practice, the following central-difference formula is commonly used to reduce the truncation error:

f'(x) ≈ (f(x + Δx) - f(x - Δx)) / (2Δx)

  • Rounding error: the difference between an approximate value and the exact value caused by rounding numbers in numerical computation, for example when real numbers are represented by floating-point numbers.
  • Truncation error: the error between the theoretical solution of the mathematical model and the exact solution of the numerical problem.
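A minimal sketch of both difference formulas, assuming f(x) = x² so the exact derivative at x = 3 is 6:

```python
def forward_diff(f, x, h=1e-6):
    """One-sided difference, straight from the limit definition."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h=1e-6):
    """Central difference: truncation error O(h^2) instead of O(h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
print(forward_diff(f, 3.0))  # ≈ 6.000001 (carries an extra O(h) error term)
print(central_diff(f, 3.0))  # ≈ 6.0
```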

Symbolic differentiation

Symbolic differentiation is an automatic differentiation method based on symbolic computation. Symbolic computation, also known as algebraic computation, uses computers to process mathematical expressions that contain variables.

Both the input and the output of symbolic computation are mathematical expressions; typical operations include simplifying expressions, factorization, differentiation, integration, solving algebraic equations, solving ordinary differential equations, and so on.

For example, simplification of mathematical expressions

  • Input: 3x + x - 1 + 2
  • Output: 4x + 1

Symbolic computation generally transforms the input expression iteratively or recursively using predefined rules; the computation stops when no transformation rule can be applied to the result.
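The rule-based idea can be sketched with a tiny expression-tree differentiator (an illustrative toy, not a real computer algebra system; expression encoding and function names are assumptions):

```python
# Expressions as nested tuples: ('x',), ('const', c), ('add', e1, e2), ('mul', e1, e2).
def diff(e):
    """Apply textbook differentiation rules recursively, computing d/dx."""
    if e[0] == 'x':
        return ('const', 1)
    if e[0] == 'const':
        return ('const', 0)
    if e[0] == 'add':   # (u + v)' = u' + v'
        return ('add', diff(e[1]), diff(e[2]))
    if e[0] == 'mul':   # (u * v)' = u'v + uv'
        return ('add', ('mul', diff(e[1]), e[2]), ('mul', e[1], diff(e[2])))
    raise ValueError(e)

def evaluate(e, x):
    """Evaluate an expression tree at a point, to check the symbolic result."""
    if e[0] == 'x':     return x
    if e[0] == 'const': return e[1]
    if e[0] == 'add':   return evaluate(e[1], x) + evaluate(e[2], x)
    if e[0] == 'mul':   return evaluate(e[1], x) * evaluate(e[2], x)

# d/dx (3x + 1) = 3, independent of x.
expr = ('add', ('mul', ('const', 3), ('x',)), ('const', 1))
print(evaluate(diff(expr), 5.0))  # 3.0
```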

Automatic differentiation

Automatic Differentiation (AD) is a method for computing the derivative of a (program) function. The object of symbolic differentiation is a mathematical expression, while the object of automatic differentiation is a function or a program, and differentiation can be performed directly on the original program code. The basic principle of automatic differentiation is that every numerical computation can be decomposed into basic operations, including +, −, ×, / and elementary functions such as exp, log, sin, cos.

Automatic differentiation also uses the chain rule to automatically compute the gradient of a composite function. We illustrate the process with a composite function that is common in neural networks. For simplicity, let the composite function f(x; w, b) be:

f(x; w, b) = 1 / (exp(-(wx + b)) + 1)

where x is the input scalar, and w and b are the weight and bias parameters, respectively.

The composite function f(x; w, b) can be decomposed into basic operations:

h1 = w · x
h2 = h1 + b
h3 = -h2
h4 = exp(h3)
h5 = h4 + 1
h6 = 1 / h5 = f(x; w, b)

The chain rule then gives the derivative of the composite function, for example with respect to w:

∂f/∂w = (∂h6/∂h5)(∂h5/∂h4)(∂h4/∂h3)(∂h3/∂h2)(∂h2/∂h1)(∂h1/∂w)
      = (-1/h5^2) · 1 · exp(h3) · (-1) · 1 · x

Mp.weixin.qq.com/s/PtX9ukKRB…