Introduction to recurrent neural networks

After the BP algorithm and CNN, why RNN?

If we look carefully at the BP algorithm and CNN (convolutional neural network), we find that their output considers only the influence of the most recent input, not the influence of inputs at other times. For single, isolated objects such as cats, dogs or handwritten digits, they achieve good recognition results. However, these algorithms do not perform as well on time-dependent tasks, such as predicting the next moment of a video or the context of a document. Thus, the RNN was born.

What is an RNN?

A Recurrent Neural Network (RNN) is a class of neural networks that takes sequence data as input, recurses in the direction in which the sequence progresses, and has all of its nodes (recurrent units) connected in a chain.

RNN is a special neural network structure, proposed on the basis of the view that human cognition is grounded in past experience and memory. It differs from DNN and CNN in that it not only considers the input at the current moment, but also gives the network a "memory" of the previous content.

RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Concretely, the network remembers earlier information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are connected to each other rather than unconnected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.

Application field

RNN has many application fields; it can be said that an RNN can be applied to almost any problem in which temporal order matters. Here are a few common application areas:

  • Natural language processing (NLP): mainly video processing, text generation, language models, image processing
  • Machine translation, machine-written novels
  • Speech recognition
  • Image caption generation
  • Text similarity calculation
  • New application areas such as music recommendation, NetEase Kaola product recommendation, and YouTube video recommendation

The development history

Representative RNN

  • Basic RNN: the basic building block of recurrent networks
  • LSTM: the breakthrough long short-term memory network
  • GRU: a newer gated cell unit
  • NTM: an exploration of a larger memory model


Basic RNN

Let's start with a simple recurrent neural network, which consists of an input layer, a hidden layer, and an output layer:

If you remove the arrowed circle labeled W, it becomes the most ordinary fully connected neural network.

  • X is a vector that represents the value of the input layer (the circles of neuron nodes are not drawn here);
  • S is a vector that represents the output value of the hidden layer (the hidden layer is drawn as a single node here, but you can imagine that this layer actually contains multiple nodes, with the number of nodes equal to the dimension of the vector S);
  • U is the weight matrix from the input layer to the hidden layer
  • O is also a vector that represents the value of the output layer;
  • V is the weight matrix from the hidden layer to the output layer.

So, what is W? The value S of the hidden layer of the recurrent neural network depends not only on the current input X but also on the previous value S of the hidden layer. The weight matrix W is the weight with which the previous hidden-layer value enters the current hidden layer as input. This is how the function of memory over time is realized.

We give the concrete graph corresponding to this abstract graph:

From the figure above, we can clearly see how the hidden layer of the previous moment affects the hidden layer of the current moment.

If we expand the diagram above, the recurrent neural network can also be drawn like this:

This is a standard RNN structure diagram, in which each arrow represents a transformation, that is, each arrow connection carries a weight. The left side shows the folded form and the right side the unfolded form; the arrow next to h in the folded structure on the left represents the "loop", which is embodied in the hidden layer. In the unfolded structure we can see that, in the standard RNN, there are weighted connections between the hidden-layer neurons of successive steps; that is, as the sequence progresses, earlier hidden layers influence later ones. In the figure, o represents the output, y represents the ground-truth value given by the sample, and L represents the loss function; we can see that the loss also accumulates as the sequence progresses. In addition to the features above, the standard RNN has the following features:

  1. Weight sharing: the W's in the figure are all the same, and so are the U's and the V's.
  2. Each input value makes weight connections only to its own route, not to other neurons.
  3. The hidden state can be understood as: h=f(existing input + past memory summary)

The above is the standard N VS N structure of RNN (N inputs correspond to N outputs); that is, the input and output sequences must have the same length.
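To make the N VS N structure and the weight sharing above concrete, here is a minimal NumPy sketch (the dimensions and parameter names are illustrative, not from the original article): the same U, W, V are reused at every step, and N inputs produce N outputs.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """N VS N forward pass: one output per input; U, W, V are shared at every step."""
    s = np.zeros(W.shape[0])              # initial hidden state s_0
    outputs = []
    for x in xs:                          # the same U, W, V are reused at each time step
        s = np.tanh(U @ x + W @ s + b)    # new memory = current input + previous memory
        outputs.append(V @ s + c)         # one output per time step
    return outputs

# toy shapes: 4 time steps, 3-dim inputs, 5-dim hidden state, 2-dim outputs
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(4)]
U, W, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
b, c = np.zeros(5), np.zeros(2)
print(len(rnn_forward(xs, U, W, V, b, c)))  # 4 outputs for 4 inputs
```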

Due to this limitation, the scope of application of the classical RNN is relatively small, but there are also some problems that the classical RNN structure is well suited to model, for example:

  • Calculate the classification label for each frame in the video. Because each frame is evaluated, the input and output sequences are of equal length.
  • The input is a character and the output is the probability of the next character. This is the famous Char RNN (see: The Unreasonable Effectiveness of Recurrent Neural Networks); it can be used to generate articles, poems, and even code, which is very interesting.

RNN variant

N VS 1

The standard N VS N structure of RNN does not solve all problems in practice. Sometimes the problem we need to handle has a sequence as input but a single value as output rather than a sequence; for example, if we feed in a string of words and want a category as output, the output does not need to be a sequence, a single output is enough. How do we model this? In fact, we only need to apply the output transformation to the last hidden state h.

This structure is usually used to deal with sequence classification problems. For example, input a text to determine its category, input a sentence to determine its emotional orientation, input a video and determine its category, and so on.
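As a sketch of this N VS 1 idea (assuming, as above, a tanh hidden layer and a softmax output; the names are illustrative): the recurrence runs over the whole sequence, and only the final hidden state is transformed into the single output.

```python
import numpy as np

def rnn_classify(xs, U, W, V, b, c):
    """N VS 1: run the recurrence over the whole sequence,
    then transform only the last hidden state into a single output."""
    s = np.zeros(W.shape[0])
    for x in xs:
        s = np.tanh(U @ x + W @ s + b)    # the whole sequence is summarised into s
    logits = V @ s + c                    # single output from the final hidden state
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # class probabilities (softmax)
```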

1 VS N

What if the input is not a sequence and the output is a sequence? We can do input calculations only at the beginning of the sequence:

There is another structure that uses the input information X as the input at every stage (the same input is fed at every step and does not vary along the sequence):

The following figure, which omits some X circles, is an equivalent representation:

This 1 VS N structure can handle the following problems:

  • Generating a caption from an image (image captioning): the input X is the feature of the image, and the output y sequence is a sentence;
  • Generate speech or music, etc., from categories.
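A minimal sketch of the second 1 VS N variant described above, where the same input X is fed at every stage (the parameter names and the choice of tanh are illustrative assumptions):

```python
import numpy as np

def rnn_generate(x, U, W, V, b, c, n_steps):
    """1 VS N: the same input x (e.g. an image feature) is fed at every stage,
    and a sequence of n_steps outputs is produced."""
    s = np.zeros(W.shape[0])
    ys = []
    for _ in range(n_steps):
        s = np.tanh(U @ x + W @ s + b)    # x does not vary along the sequence
        ys.append(V @ s + c)              # e.g. scores over a vocabulary at each step
    return ys
```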

N VS M

Let’s introduce one of the most important variants of RNN: N vs M. This structure is also called Encoder-Decoder model, or Seq2Seq model.

The original N vs N RNN requires the sequence to be of equal length. However, most of the problems we encounter are of unequal length. For example, in machine translation, the sentences in the source language and the target language often do not have the same length.

To handle this, the Encoder-Decoder structure first encodes the input data into a context vector c:

There are several ways to obtain c. The simplest is to assign the Encoder's last hidden state directly to c; c can also be obtained by applying a transformation to the last hidden state, or by transforming all of the hidden states.

After obtaining c, we decode it with another RNN network, and this part of the network is called the Decoder. The specific approach is to feed c into the Decoder as its initial state h0:

Another approach is to use c as the input at every step:

Because this Encoder-Decoder structure does not limit the sequence length of input and output, it is widely used, such as:

  • Machine translation. In fact, the Encoder-Decoder structure was first proposed in the field of machine translation;
  • Text summarization. The input is a sequence of text, and the output is a sequence that summarizes it;
  • Reading comprehension. Encode the input article and the question separately, and then decode it to get the answer to the question;
  • Speech recognition. The input is a speech signal sequence and the output is a text sequence.
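Here is a highly simplified sketch of the Encoder-Decoder idea described above, using the simplest choice of c (the Encoder's last hidden state) as the Decoder's initial state h0. A real Seq2Seq decoder would also feed the previously generated token back in and stop at an end-of-sequence symbol; those details are omitted, so treat this purely as an illustration.

```python
import numpy as np

def encode(xs, U_e, W_e, b_e):
    """Encoder: compress the input sequence into the context vector c
    (here simply the last hidden state, the simplest option mentioned above)."""
    h = np.zeros(W_e.shape[0])
    for x in xs:
        h = np.tanh(U_e @ x + W_e @ h + b_e)
    return h                               # c = last hidden state

def decode(c_vec, W_d, V_d, b_d, c_d, n_steps):
    """Decoder: use c as the initial hidden state h0 and unroll for n_steps."""
    h = c_vec
    ys = []
    for _ in range(n_steps):
        h = np.tanh(W_d @ h + b_d)         # (a real Decoder would also feed back y_{t-1})
        ys.append(V_d @ h + c_d)
    return ys
```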

Attention mechanism

In the Encoder-Decoder structure, the Encoder encodes the entire input sequence into a single unified semantic feature c before decoding. **Therefore, c must contain all the information of the original sequence, and its length becomes a bottleneck that limits the performance of the model.** For example, when the sentence to be translated is long, one c may not be able to store that much information, resulting in a drop in translation accuracy.

The Attention mechanism solves this problem by feeding in a different c at each time step. Here is a Decoder with Attention:

Each c automatically selects the context information most appropriate to the output y currently being produced. Specifically, we use a_ij to measure the correlation between h_j at the j-th stage of the Encoder and the i-th stage of decoding; the context information c_i fed into the Decoder at its i-th stage is then the weighted sum of all the h_j with weights a_ij.

Take machine translation (translating Chinese into English) as an example:

The input sequence is "I love China" (four Chinese characters); therefore h1, h2, h3, h4 in the Encoder can be regarded as the information represented by "I", "love", "Zhong" and "guo" respectively. When translating into English, the first context c1 should be most relevant to the word "I", so the corresponding a11 will be relatively large while the corresponding a12, a13, a14 will be smaller. c2 should be most related to "love", so the corresponding a22 is larger. The final c3 is most related to h3 and h4, so the values of a33 and a34 are larger.

That leaves one last question about the Attention model: how are the weights a_ij determined?

In fact, a_ij is also learned by the model; it actually depends on the hidden state of the Decoder at stage i-1 and the hidden state of the Encoder at stage j.

Again using the machine-translation example above, the calculation of a_1j (here the arrow means that h' and h_j are transformed at the same time):

The calculation of a_2j:

The calculation of a_3j:

Above is the whole process of Encoder-Decoder model calculation with Attention.
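The article does not spell out the exact transformation used to score h' against each h_j, so the sketch below assumes a common additive (MLP-style) score with illustrative parameters Wa, Ua, va; what matters is the structure: the scores are normalized into weights a_ij by a softmax, and c_i is the weighted sum of all the Encoder states h_j.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(dec_state, enc_states, Wa, Ua, va):
    """One step of Attention: derive weights a_ij from the previous Decoder state
    h'_{i-1} and every Encoder state h_j, then build c_i as the weighted sum of the h_j."""
    scores = np.array([va @ np.tanh(Wa @ dec_state + Ua @ h_j)
                       for h_j in enc_states])        # one score per Encoder stage j
    a_i = softmax(scores)                             # a_i1 ... a_iN, summing to 1
    c_i = sum(a_ij * h_j for a_ij, h_j in zip(a_i, enc_states))
    return c_i, a_i
```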

You may notice that there has been no LSTM content so far; that is because, viewed from the outside, an LSTM looks exactly like an RNN, so all of the structures above carry over to LSTMs.

Forward propagation of the standard RNN

It was mentioned above that RNN has many variants, but their mathematical derivations are essentially the same. Here we introduce the forward propagation process of the standard RNN structure.

First, the meaning of each symbol: x^(t) is the input, h^(t) is the hidden-layer unit, o^(t) is the output, L^(t) is the loss, and y^(t) is the label from the training set. The superscript (t) denotes the state at time t. Note that the hidden unit h^(t) is determined not only by the input at this moment but also by the hidden state h^(t-1) of the previous moment. V, U and W are the weights, and weights of the same type share the same value at every time step.

With the above understanding, the forward propagation algorithm is actually very simple. For time t:

h^(t) = φ(U·x^(t) + W·h^(t-1) + b)        (1)

where φ is the activation function, generally chosen to be tanh, and b is the bias.

The output at time t is even simpler:

o^(t) = V·h^(t) + c        (2)

The final predicted output of the model is:

ŷ^(t) = σ(o^(t))

where σ is the activation function; since the RNN is usually used for classification, the softmax function is generally used here.

Formula (1) is the calculation formula of the hidden layer, which is the recurrent layer. Formula (2) is the calculation formula of the output layer, which is a fully connected layer.

As we can see from the formulas above, the difference between the recurrent layer and the fully connected layer is that the recurrent layer has an additional weight matrix W.

If we repeatedly substitute formula (1) into formula (2), dropping the bias for simplicity, we get:

o^(t) = V·h^(t)
      = V·φ(U·x^(t) + W·h^(t-1))
      = V·φ(U·x^(t) + W·φ(U·x^(t-1) + W·h^(t-2)))
      = V·φ(U·x^(t) + W·φ(U·x^(t-1) + W·φ(U·x^(t-2) + ...)))

As you can see from the above, the output value o^(t) of the recurrent neural network is influenced by the previous input values x^(t), x^(t-1), x^(t-2), ... This is why a recurrent neural network can look back at arbitrarily many earlier input values.

Sometimes formula (1) is written as follows:

h^(t) = φ([U, W]·[x^(t); h^(t-1)] + b)        (3)

This is because [x^(t); h^(t-1)] in formula (3) stacks the two vectors into one. For example, suppose h^(t-1) has shape (10, 1), W has shape (10, 10), x^(t) has shape (20, 1) and U accordingly has shape (10, 20); then the final h^(t) has shape (10, 1).

Now look at formula (3): the concatenated matrix [U, W] has shape (10, 30), the concatenated vector [x^(t); h^(t-1)] has shape (30, 1), and (10, 30) × (30, 1) = (10, 1), so formula (1) is equivalent to formula (3).
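The equivalence of formula (1) and formula (3), together with the shapes discussed above, can be checked numerically with a few lines of NumPy (the random values are of course just an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)        # x^(t), shape (20,)
h_prev = rng.normal(size=10)   # h^(t-1), shape (10,)
U = rng.normal(size=(10, 20))  # shape (10, 20)
W = rng.normal(size=(10, 10))  # shape (10, 10)
b = rng.normal(size=10)

h1 = np.tanh(U @ x + W @ h_prev + b)      # formula (1)
UW = np.hstack([U, W])                    # [U, W], shape (10, 30)
xh = np.concatenate([x, h_prev])          # [x^(t); h^(t-1)], shape (30,)
h3 = np.tanh(UW @ xh + b)                 # formula (3)
print(np.allclose(h1, h3))                # True: (1) and (3) are equivalent
```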

Here is a hand-drawn illustration of Andrew Ng's derivation:

The training method of RNN — BPTT

BPTT (Back-Propagation Through Time) is the commonly used method for training RNNs. In essence it is still the BP algorithm, but because the RNN processes time-series data, the errors are propagated back through time, hence the name. The central idea of BPTT is the same as that of BP: keep moving along the negative gradient direction of the parameters to be optimized until convergence. In summary, BPTT is still the BP algorithm at heart, and BP is in essence gradient descent, so finding the gradient of each parameter is the core of this algorithm.

Take out the structure diagram again and observe: there are three parameters to optimize, namely U, V and W. Unlike in the plain BP algorithm, the optimization of the two parameters W and U needs to trace back the earlier historical data, while the parameter V only depends on the present and is relatively easy, so let us first find the partial derivative of the loss with respect to V:

∂L^(t)/∂V = ∂L^(t)/∂o^(t) · ∂o^(t)/∂V
This formula looks simple, but it is easy to make mistakes when solving it, because the activation function is nested inside; it is a derivative of a composite function.

The loss of an RNN also accumulates over time, so we cannot compute the partial derivative at time t alone:

L = Σ_t L^(t),    ∂L/∂V = Σ_t ∂L^(t)/∂o^(t) · ∂o^(t)/∂V
Finding the partial derivatives with respect to W and U is more complicated because it involves the historical data. Let us first assume there are only three moments; then the partial derivative of L with respect to W at the third moment is:

∂L^(3)/∂W = ∂L^(3)/∂o^(3) · ∂o^(3)/∂h^(3) · ∂h^(3)/∂W
          + ∂L^(3)/∂o^(3) · ∂o^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂W
          + ∂L^(3)/∂o^(3) · ∂o^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂W
Correspondingly, the partial derivative of L with respect to U at the third moment is:

∂L^(3)/∂U = ∂L^(3)/∂o^(3) · ∂o^(3)/∂h^(3) · ∂h^(3)/∂U
          + ∂L^(3)/∂o^(3) · ∂o^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂U
          + ∂L^(3)/∂o^(3) · ∂o^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂U
You can observe that to take the partial derivative with respect to W or U at some moment, you have to trace back through all the moments before it, and this is only the partial derivative for a single moment; as mentioned above, the loss also accumulates, so writing out the partial derivative of the whole loss function would be tedious. Fortunately there is a pattern, and we can write the general formulas for the partial derivatives of L with respect to W and U at time t:

∂L^(t)/∂W = Σ_{k=0}^{t} ∂L^(t)/∂o^(t) · ∂o^(t)/∂h^(t) · (Π_{j=k+1}^{t} ∂h^(j)/∂h^(j-1)) · ∂h^(k)/∂W

∂L^(t)/∂U = Σ_{k=0}^{t} ∂L^(t)/∂o^(t) · ∂o^(t)/∂h^(t) · (Π_{j=k+1}^{t} ∂h^(j)/∂h^(j-1)) · ∂h^(k)/∂U
The partial derivative of the whole loss is then obtained by summing these over all the time steps.
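To see that the three-term expansion at t = 3 really is the chain rule at work, here is a toy check with a scalar RNN and a squared-error loss at the last step only (the original derivation is for the vector case with a softmax output, so treat this purely as an illustrative sketch):

```python
import numpy as np

# toy scalar RNN: h_t = tanh(u*x_t + w*h_{t-1}), o_t = v*h_t, loss only at t = 3
u, v, w = 0.4, 0.7, 0.3
x = [0.5, -0.1, 0.8]                 # x_1, x_2, x_3
y3, h0 = 0.2, 0.1                    # target at t = 3 and initial state

h = [h0]
for xt in x:                         # forward pass
    h.append(np.tanh(u * xt + w * h[-1]))
o3 = v * h[3]

dtanh = lambda ht: 1.0 - ht ** 2     # tanh'(z) expressed via h = tanh(z)
dL_dh3 = (o3 - y3) * v               # dL3/do3 * do3/dh3 for the squared-error loss

term1 = dL_dh3 * dtanh(h[3]) * h[2]                                      # dh3/dw
term2 = dL_dh3 * dtanh(h[3]) * w * dtanh(h[2]) * h[1]                    # dh3/dh2 * dh2/dw
term3 = dL_dh3 * dtanh(h[3]) * w * dtanh(h[2]) * w * dtanh(h[1]) * h[0]  # ... * dh1/dw
grad_w = term1 + term2 + term3

def loss(w_):                        # numerical check of the same derivative
    ht = h0
    for xt in x:
        ht = np.tanh(u * xt + w_ * ht)
    return 0.5 * (v * ht - y3) ** 2

eps = 1e-6
print(grad_w, (loss(w + eps) - loss(w - eps)) / (2 * eps))   # the two values agree
```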

As mentioned, the activation function is nested inside. If we write it out and take the cumulative product in the middle:

Π_{j=k+1}^{t} ∂h^(j)/∂h^(j-1) = Π_{j=k+1}^{t} tanh'(·) · W
We will find that this cumulative product leads to repeated multiplication of the activation function's derivative, which in turn causes the gradient-vanishing and gradient-explosion phenomena.
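A quick numerical illustration of that cumulative product (the weight values and sizes below are arbitrary): because tanh' is at most 1, repeated multiplication with moderate weights drives the product towards zero, while large weights can make it blow up instead.

```python
import numpy as np

# the middle product from the formula above is  prod_j  tanh'(a_j) * W
t = 50

# scalar intuition: even if tanh' were exactly 1 everywhere, a moderate weight shrinks it
print((1.0 * 0.9) ** t)    # ~0.005: the gradient vanishes
# with a large weight the same product can instead blow up (gradient explosion)
print(1.5 ** t)            # ~6e8

# the same effect for a vector hidden state: norm of the accumulated Jacobian product
rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(10, 10)) / np.sqrt(10)
h = np.zeros(10)
prod = np.eye(10)
for _ in range(t):
    h = np.tanh(W @ h + rng.normal(size=10))
    prod = np.diag(1.0 - h ** 2) @ W @ prod      # one factor dh_j/dh_{j-1}
print(np.linalg.norm(prod))                      # essentially zero after 50 steps
```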

Gradient vanishing and gradient explosion

The causes

Let’s first look at the graphs of these two activation functions. This is the graph of the sigmoid function and its derivative.

This is the graph of the tanh function and its derivative.

It can be seen from the figures above that when the activation function is tanh, the maximum value of its derivative is 1, but it cannot be 1 all the time, and that situation rarely occurs anyway. In other words, most of the factors in the cumulative product are numbers smaller than 1, so if t is very large the product tends towards 0; this is the gradient-vanishing situation.

You may object that an RNN is obviously different from a deep feed-forward network: the parameters of an RNN are all shared, and the gradient at a given moment is the sum over the current moment and the previous moments, so even if the gradient from the shallow steps does not reach the deepest ones, there is still some gradient. This is true, but if we update the shared parameters using the gradient from only a limited number of steps, problems will still arise, because optimizing on the basis of limited information cannot find the solution that is optimal with respect to all of the information.

Gradient vanishing means that the parameters of the affected layers are no longer updated, and those hidden layers then degenerate into a simple mapping layer, which is meaningless. This is also why, in deep neural networks, more neurons per layer can sometimes be better than more depth.

Let's look at the formula Π ∂h^(j)/∂h^(j-1) = Π tanh'(·)·W again. Besides the derivative of the activation function, the cumulative product also contains the network parameter W. If the values in W are too large, then as the sequence length grows the same long-range accumulation appears, and the problem becomes gradient explosion. In ordinary applications an RNN is quite deep in time, which makes gradient explosion or gradient vanishing all the more pronounced.

We said that we mostly use the tanh function as the activation function. But the derivative of tanh is at most 1, and it cannot take the value 1 everywhere, so we are still multiplying a pile of numbers smaller than 1 and gradient vanishing still occurs. Why use it as the activation function then? The reason is that, compared with the sigmoid function, tanh has a larger gradient, so it converges faster and its gradient vanishes more slowly.

Another reason is that the sigmoid function has a further disadvantage: its output is not zero-centered. The sigmoid output is always greater than 0, so the outputs do not have zero mean; this is called the bias-shift (offset) phenomenon, and it causes the neurons of the next layer to receive the non-zero-mean signals of the previous layer as their input. A network converges better when its inputs are symmetric about the origin and its outputs are zero-centered.
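A quick check of the claims above (illustrative only): the sigmoid derivative peaks at 0.25 while the tanh derivative peaks at 1, and the sigmoid output is never zero-centered.

```python
import numpy as np

z = np.linspace(-6, 6, 1001)
sig = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sig * (1.0 - sig)         # peaks at 0.25 (at z = 0)
d_tanh = 1.0 - np.tanh(z) ** 2        # peaks at 1.0  (at z = 0)
print(d_sigmoid.max(), d_tanh.max())  # ~0.25 vs ~1.0: tanh lets gradients through better
# sigmoid's output lies in (0, 1), so it is never zero-centered; tanh's lies in (-1, 1)
```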

The solution

The defining feature of an RNN is that it can "trace back to the source" and make use of historical data, so being told that the usable history is actually limited is very uncomfortable. Solving "gradient vanishing" is therefore essential. The main methods for dealing with it are: 1. choosing a better activation function; 2. changing the propagation structure.

On the first point, the ReLU function is generally chosen as the activation function. The graph of the ReLU function is:

The derivative of the ReLU function is identically 1 on the part of its domain greater than 0, which addresses the problem of gradient vanishing.

In addition, ReLU is convenient and fast to compute, which speeds up network training. However, on the negative part of the domain its output is zero, which can leave neurons permanently unactivated (setting the learning rate sensibly reduces the probability of this happening).

ReLU has both advantages and disadvantages; the disadvantages can be avoided or reduced by other measures, and it is currently the most widely used activation function.
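For completeness, a tiny sketch of ReLU and its derivative, showing both the constant gradient of 1 on the positive side and the zero gradient that causes "dead" neurons on the negative side:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)     # identically 1 for z > 0, 0 for z <= 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.] -> no shrinking factor on the positive side,
                      #    but a neuron stuck in the negative region gets no gradient at all
```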

On the second point, the LSTM structure can solve this problem.

BRNN (Bidirectional Recurrent Neural Network)

An ordinary RNN only considers the words before the word to be predicted, that is, only the "preceding" part of the context; it does not consider the content after the word. This may miss important information and make the prediction less accurate. For many sequence-labeling tasks, being able to access the future context as well as the past context is beneficial.

A bidirectional RNN is one that can obtain memory from past time steps and also obtain information from future time steps. Why would we want information about the future?

In the sentence below, if "Teddy" is a person's name, it is impossible to tell that from the first two words alone; with the information that comes after it, the decision becomes easy. This is what the bidirectional recurrent neural network is for.

The basic idea of the bidirectional recurrent neural network (BRNN) is that each training sequence is presented forwards and backwards to two separate recurrent neural networks, both of which are connected to the same output layer. This structure provides the output layer with complete past and future context information for every point in the input sequence. The figure below shows a bidirectional recurrent neural network unfolded over time. Six unique weight matrices are reused at every time step, corresponding to: input to the forward and backward hidden layers (W1, W3), hidden layer to hidden layer itself (W2, W5), and forward and backward hidden layers to the output layer (W4, W6). It is worth noting that there is no information flow between the forward and backward hidden layers, which ensures that the unfolded graph is acyclic.

It does not matter whether the recurrent unit is a standard RNN cell, a GRU or an LSTM; any of them can be used.

The computation of the whole bidirectional recurrent neural network (BRNN) proceeds as follows:

Forward pass:

For the hidden layers of the bidirectional recurrent neural network (BRNN), the forward pass is the same as for a unidirectional RNN, except that the input sequence is presented in opposite directions to the two hidden layers, and the output layer is not updated until both hidden layers have processed the entire input sequence:
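A minimal sketch of this forward pass, using the six weight matrices W1-W6 named above (the biases b_f, b_b, b_o and the tanh activation are added for completeness and are assumptions of this sketch):

```python
import numpy as np

def brnn_forward(xs, W1, W2, W3, W5, W4, W6, b_f, b_b, b_o):
    """Bidirectional forward pass: one RNN reads the sequence left-to-right, the other
    right-to-left; the output at each step combines both hidden states.
    W1/W3: input -> forward/backward hidden, W2/W5: hidden -> hidden, W4/W6: hidden -> output."""
    T = len(xs)
    hf = np.zeros(W2.shape[0])
    hb = np.zeros(W5.shape[0])
    fwd, bwd = [], []
    for t in range(T):                                   # forward hidden layer
        hf = np.tanh(W1 @ xs[t] + W2 @ hf + b_f)
        fwd.append(hf)
    for t in reversed(range(T)):                         # backward hidden layer
        hb = np.tanh(W3 @ xs[t] + W5 @ hb + b_b)
        bwd.insert(0, hb)                                # keep time order
    # outputs are produced only after both hidden layers have seen the whole sequence;
    # note there is no information flow between the two hidden layers themselves
    return [W4 @ f + W6 @ b + b_o for f, b in zip(fwd, bwd)]
```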

Backward pass:

The backward pass of the bidirectional recurrent neural network (BRNN) is similar to back-propagation through time for a standard RNN, except that all the output-layer δ terms are computed first and then passed back to the two hidden layers in their different directions:

Deep RNN (DRNN)

A deep RNN has several hidden layers in the RNN model. The idea is that when the amount of information is too large for a single layer to keep all of the important information at once, multiple hidden layers can retain more of it, just as we may watch the same episode of a TV drama repeatedly in order to remember more of the key plot points. In the same way, we can also add several hidden layers to the bidirectional RNN to obtain a deep bidirectional RNN model.

Note: the parameters within each layer of the recurrent cell are shared across time, but the weight matrices differ between layers.
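A minimal sketch of a deep (stacked) RNN forward pass along these lines; layer_params and the tanh activation are illustrative assumptions, the point being that each layer reuses its own weights across time while different layers have different weights:

```python
import numpy as np

def deep_rnn_forward(xs, layer_params):
    """Stacked (deep) RNN: each layer re-uses its own U, W, b at every time step,
    but different layers have different weight matrices.
    layer_params is a list of (U, W, b) tuples, one per hidden layer."""
    seq = xs
    for U, W, b in layer_params:                 # layer l takes layer l-1's outputs as inputs
        h = np.zeros(W.shape[0])
        out = []
        for x in seq:
            h = np.tanh(U @ x + W @ h + b)       # parameters shared across time within a layer
            out.append(h)
        seq = out                                # feed the hidden sequence to the next layer
    return seq                                   # top-layer hidden states, one per time step
```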

It does not matter whether the recurrent unit is a standard RNN cell, a GRU or an LSTM; any of them can be used.
