RNN, the most classic network in NLP, and its variants LSTM and GRU
Image sources and references
Recurrent neural network
Why RNN
Before the rise of deep learning, NLP was dominated by statistical models. The most commonly used ones, such as N-Gram, struggle to capture medium- and long-range information. Bengio's team integrated the N-Gram idea into a feedforward neural network, but the improvement was limited.
In NLP, the input is a sequence, and the pieces of information in the sequence are interrelated. A fully connected layer, whose inputs and outputs are independent of one another, is clearly not up to the various NLP tasks. We needed a network that could properly model sequential relationships, and RNN was born of this need.
Basic structure
For a sequence $x_1, x_2, x_3, x_4, \cdots, x_T$
The most common examples of a sequence are a piece of music, a sentence, a video, and so on.
The basic structure of an RNN is:
The cell A receives an element $x_t$ of the sequence and outputs $h_t$, which is called the hidden state; $h_t$ is fed back into A together with the next input $x_{t+1}$ to model the relationships within the sequence.
This is an autoregressive (AR) model, in which the result of the previous step is fed into the next step, so it can only attend to the preceding context (or, run in the other direction, only the following context), never both at once.
To understand RNN more intuitively: for a sequence $x_0, x_1, x_2, \cdots, x_T$ fed into a cell A, the network can be unrolled into the form on the right-hand side of the figure below.
Note that every A here is the same A; that is, all time steps of the RNN share the same parameters.
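To make the parameter sharing concrete, here is a minimal NumPy sketch of the unrolled forward pass (the dimensions and toy data are illustrative, not taken from the original figure):

```python
import numpy as np

def rnn_forward(xs, U, W, b, h0):
    """Unrolled RNN forward pass: the same U, W, b are reused at every step."""
    h = h0
    hs = []
    for x in xs:                        # iterate over the sequence x_0 ... x_T
        h = np.tanh(U @ x + W @ h + b)  # hidden-state update with shared parameters
        hs.append(h)
    return hs

# toy usage: a sequence of 5 inputs of dimension 3, hidden size 4
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
U, W, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
hs = rnn_forward(xs, U, W, b, h0=np.zeros(4))
print(len(hs), hs[-1].shape)  # 5 hidden states, each of shape (4,)
```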
Hidden states
Let’s simplify the picture of the neuron and get rid of the A’s
$h_t$ is calculated by the following formula, where $U$, $W$ and $b$ are still shared:
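A standard way to write this update, using the $U$, $W$, $b$ notation above, is:

$$
h_t = \phi\left(U x_t + W h_{t-1} + b\right)
$$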
$\phi$ is the activation function; RNNs generally use tanh, which reduces the risk of gradient explosion.
The output
Now let’s see what the output of RNN looks like
The corresponding $y_t$ can be obtained from $h_t$; the specific formula is:
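A common form of this mapping, assuming an output weight matrix $V$ and bias $c$ (with a softmax for classification tasks), is:

$$
y_t = \mathrm{softmax}\left(V h_t + c\right)
$$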
Here we can see one limitation of the standard RNN: the input and output must have the same length.
Back propagation
Since the elements of the sequence all interact, the output at a later position depends on all earlier positions, so the chain-rule expression for the gradient becomes extremely long.
Moreover, the tanh activation function is usually used, and its derivative is always less than or equal to 1, from which we obtain:
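Concretely, back-propagating from step $t$ to an earlier step $k$ multiplies together the Jacobians of each hidden-state update; with the update rule above, every factor contains a $\phi'(\cdot) \le 1$ term (a sketch, using the same notation):

$$
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
  = \prod_{i=k+1}^{t} \operatorname{diag}\!\big(\phi'(U x_i + W h_{i-1} + b)\big)\, W
$$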
For long sequences, RNN is therefore prone to vanishing gradients. (Note that vanishing gradients in RNN are a bit different from those in ordinary networks: the total gradient is dominated by contributions from nearby steps, while contributions from distant steps are effectively ignored.)
Therefore, RNN is trained with a dedicated method, BPTT (back-propagation through time), which unrolls the network over time, splits it into groups, and computes the loss for each group.
In addition, when the other factors in the product are too large, gradient explosion easily occurs.
So although RNN can in theory handle long-range information well, it does not do so well in practice.
A side note here:
NLP tasks normally use layer-norm rather than batch-norm, because normalizing across the batch mixes information between different sentences; we want to normalize within each sentence instead.
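A minimal PyTorch sketch of the difference (the shapes are illustrative): LayerNorm computes statistics over the feature dimension of each token in each sentence, while BatchNorm shares statistics across the whole batch.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden = 8, 16, 32
x = torch.randn(batch, seq_len, hidden)   # a batch of sentences

# LayerNorm: statistics computed per position, per sentence (last dim only)
layer_norm = nn.LayerNorm(hidden)
y_ln = layer_norm(x)                      # shape (8, 16, 32)

# BatchNorm1d expects (batch, channels, length): statistics are shared
# across the batch, mixing information between different sentences
batch_norm = nn.BatchNorm1d(hidden)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # shape (8, 16, 32)
```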
Problem of unequal length
Not all tasks have inputs and outputs of the same length, so there are several RNN variants that handle unequal input and output lengths.
N-1
Just take the output of the last time step as the final output,
or follow the output with a softmax for classification, and so on; see the sketch below.
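A minimal PyTorch sketch of the first option, keeping only the final hidden state for classification (the sizes and class count are illustrative assumptions):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
classifier = nn.Linear(32, 5)             # 5 classes, chosen for illustration

x = torch.randn(4, 10, 16)                # batch of 4 sequences, length 10
outputs, h_n = rnn(x)                     # outputs: (4, 10, 32), h_n: (1, 4, 32)
logits = classifier(h_n[-1])              # use only the final hidden state
probs = logits.softmax(dim=-1)            # (4, 5) class probabilities
```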
1-N
There are two main ways
X is only used as the initial input:
X is used as the input at every step:
N-M
The Encoder-Decoder structure first encodes the input to obtain a code $c$; when decoding, either of the two methods above can be used.
In fact, this is equivalent to chaining two RNNs together: an N-1 network and a 1-M network.
Its loss function is defined as:
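A common definition, assuming the decoder is trained by maximum likelihood on a target sequence $y_1, \dots, y_M$ given the code $c$, is:

$$
L = -\sum_{t=1}^{M} \log p\left(y_t \mid y_1, \dots, y_{t-1}, c\right)
$$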
Variants of RNN
Bi-directional Recurrent Neural Network
An RNN always learns only from the information before a given point in the sequence. Imagine also adding the information after that point while learning.
The network can then learn more contextual (future) information and obtain better results.
It is worth noting that not too much future information should be added, as it would take up too much of the network's capacity and reduce effectiveness.
The network is built from two RNNs running in opposite directions.
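A minimal PyTorch sketch (sizes are illustrative): setting bidirectional=True runs one RNN forward and one backward over the sequence and concatenates their hidden states at each step.

```python
import torch
import torch.nn as nn

birnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)        # batch of 4 sequences, length 10
outputs, h_n = birnn(x)
print(outputs.shape)              # (4, 10, 64): forward and backward states concatenated
print(h_n.shape)                  # (2, 4, 32): final state of each direction
```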
To be continued
Long Short-Term Memory networks
Compared with long-range (long-term) memory, RNN is better at short-range (short-term) memory, so LSTM was proposed to improve the network's modeling of long sequences.
LSTM can be regarded as a special kind of RNN: both have a hidden state and a recurrent form, but LSTM introduces an input gate, an output gate and a forget gate to control the flow of information and thereby address the problem of modeling long-range information.
Capturing long-range (long-term) information is in fact a natural "talent" of LSTM and does not cost it much.
The basic structure
Before introducing the structure, let's look at what the various icons mean
- A yellow box denotes a neural-network layer
- A pink circle denotes a pointwise (element-wise) operation
Its basic structure is as follows:
Ignoring the internal details, its structure is very similar to that of RNN, and a standard LSTM is composed entirely of repetitions of this structure
Drawing the RNN structure in the same style gives a similar diagram
Next, let’s break down the specific structure step by step
Cell state
The cell state is a concept introduced by LSTM specifically to improve modeling over long distances
As you can see from the diagram, this structure resembles a conveyor belt. If you plot the structure of the entire network, you will find that this “conveyor belt” runs through the whole network
Personally, I think of the cell state as holding long-term memory, while the hidden state selectively outputs the contents of the cell state
Gating
Gating controls the flow of information. LSTM mainly uses three gates to add or remove information, thereby protecting and controlling the cell state
$\sigma$ is the sigmoid function, which outputs a value between 0 and 1 to control the flow of information
Forget gate
The forget gate is modeled on the memory mechanism of the human brain. It reads $x_t$ and $h_{t-1}$ and produces a weight between 0 and 1, which is multiplied with the cell state to indicate how much information should be forgotten; its formula is:
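In the standard LSTM formulation (with weights $W_f$, bias $b_f$, and $[h_{t-1}, x_t]$ denoting concatenation), the forget gate is:

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$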
I think a more intuitive reading is: how much information should still be remembered
Input gate
The input gate is used to update the cell state
$i_t$ is used to control how much information is written into the cell state
The update formulas are:
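In the standard formulation, the input gate $i_t$, the candidate values $\tilde{C}_t$, and the resulting cell-state update are:

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\end{aligned}
$$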
Output gate
The output gate determines the $h_t$ to output
The formula is:
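In the standard formulation:

$$
\begin{aligned}
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
$$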
The output is determined jointly by the cell state and the inputs above ($h_{t-1}$ and $x_t$)
How does LSTM mitigate vanishing gradients
To be continued
LSTM variants
To be continued
Gated Recurrent Unit
GRU is a variant of LSTM. In many cases their performance is similar, but GRU is cheaper to compute, with only about three times as many parameters as a vanilla RNN (an LSTM has about four times).
GRU merges the forget gate and the input gate into a single "update gate", and it also merges the cell state and the hidden state
The top line carries the hidden state, and the input at the bottom left is $x_t$
Reset gate: Used to determine which information to discard
Update gate: acts like the LSTM forget and input gates combined, determining what information is added or discarded
Output:
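Using the standard GRU formulation (reset gate $r_t$, update gate $z_t$, with $\odot$ denoting element-wise multiplication):

$$
\begin{aligned}
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$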