
In NLP, the most classic network is the RNN, together with its well-known variants LSTM and GRU.


Recurrent neural network

Why RNN

Before the rise of deep learning, the NLP field was dominated by statistical models. The most commonly used models, such as N-Gram, struggle to capture medium- and long-distance information. Bengio's team integrated N-Gram ideas into a feedforward neural network, but the improvement was limited.

In NLP, the input data is a sequence, and the information within the sequence is interrelated. Clearly, a fully connected layer, whose inputs and outputs are independent of each other, cannot handle the various NLP tasks. We needed a network that could properly model sequential relationships, and the RNN was born of this.

Basic structure

For a sequence $x_1, x_2, x_3, x_4, \cdots, x_T$

Common examples of sequences are a piece of music, a sentence, a video, and so on.

The basic structure of an RNN is shown below:

The cell $A$ receives an element $x_t$ of the sequence and outputs $h_t$, which is called the hidden state; $h_t$ is fed back into $A$ together with the next input $x_{t+1}$ to model the sequential relationship.

This is an autoregressive (AR) model, in which the result of the previous step is fed into the next step, so it can only attend to the preceding context (or, if run in the reverse direction, only the following context), never both at once.

To understand the RNN more intuitively, for a sequence $x_0, x_1, x_2, \cdots, x_T$ and a single cell $A$, the network can be unrolled into the form shown on the right side of the figure below.

Note that every $A$ here is the same cell; that is, all time steps of the RNN share the same parameters.

Hidden states

Let's simplify the diagram of the cell and omit the $A$'s

$h_t$ is calculated by the following formula, where $U$, $W$ and $b$ are shared across all time steps


$$h_t=\phi(Ux_t+Wh_{t-1}+b)$$

$\phi$ is the activation function; RNNs generally use tanh, which reduces the risk of gradient explosion.
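As a minimal sketch of this update (the dimensions and random weights below are assumptions for illustration, not anything from the original post):

```python
import numpy as np

# Toy dimensions, assumed for illustration only
input_size, hidden_size = 4, 3

rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One RNN step: h_t = tanh(U x_t + W h_{t-1} + b)."""
    return np.tanh(U @ x_t + W @ h_prev + b)

h = np.zeros(hidden_size)            # h_0 is usually initialized to zeros
x_t = rng.normal(size=input_size)
h = rnn_step(x_t, h)
print(h.shape)                       # (3,)
```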

The output

Now let’s see what the output of RNN looks like

The corresponding $y_t$ can be obtained from $h_t$; the specific formula is


$$y_t=\mathrm{softmax}(Vh_t+c)$$

Here we can see one limitation of the standard RNN: the input and output sequences must be of equal length.
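Continuing the NumPy sketch above, unrolling over a whole sequence produces exactly one $y_t$ per $x_t$, which is precisely the equal-length constraint. Again, all sizes and weights are made up for illustration:

```python
import numpy as np

input_size, hidden_size, vocab_size, T = 4, 3, 5, 6   # assumed toy sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
V = rng.normal(size=(vocab_size, hidden_size))
b = np.zeros(hidden_size)
c = np.zeros(vocab_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, input_size))   # input sequence x_1 ... x_T
h = np.zeros(hidden_size)
ys = []
for x_t in xs:                          # the same parameters at every step
    h = np.tanh(U @ x_t + W @ h + b)    # h_t = phi(U x_t + W h_{t-1} + b)
    ys.append(softmax(V @ h + c))       # y_t = softmax(V h_t + c)

print(len(xs), len(ys))                 # 6 6 -> one output per input: equal lengths
```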

Back propagation

Since the data in an RNN depend on one another, the output at a later position in the sequence is related to all the previous data, so the expression obtained from the chain rule becomes extremely long when computing the gradient.

Moreover, the tanh activation function is usually used, and its derivative is always less than or equal to 1, so the product of many such derivatives shrinks toward zero.

For long sequences, the RNN is therefore prone to vanishing gradients (note that gradient vanishing in an RNN is a little different from that in an ordinary network: the total gradient is dominated by the nearby terms, while the long-distance terms are drowned out).

Therefore, RNNs are trained with a special method, BPTT (backpropagation through time); in practice the unrolled network is often split into groups and the loss is computed for each group (truncated BPTT).

In addition, when the other derivatives (the weight terms) are too large, gradient explosion can easily occur.

Therefore, although an RNN can in theory handle long-distance information, it does not perform well in practice.
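As a hedged sketch of how this is commonly handled in practice (the post gives no code, so the model, sizes, and training loop here are illustrative assumptions): the hidden state is detached between segments so that gradients only flow within each group, and the gradient norm is clipped to guard against explosion.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the original post
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 8)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.01)

x = torch.randn(2, 100, 8)            # (batch, long sequence, features)
target = torch.randn(2, 100, 8)
h = None
segment = 20                          # group length for truncated BPTT

for start in range(0, x.size(1), segment):
    xs = x[:, start:start + segment]
    ts = target[:, start:start + segment]
    out, h = rnn(xs, h)
    h = h.detach()                    # cut the graph: gradients stay within this segment
    loss = nn.functional.mse_loss(head(out), ts)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # limit gradient explosion
    opt.step()
```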

One more note here:

NLP tasks normally use layer normalization instead of batch normalization, because normalizing over a batch mixes information between different sentences, while what we want is to normalize within each sentence.
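A small illustration of the difference (the shapes are assumptions, not from the post): LayerNorm normalizes the feature vector at each position of each sentence independently, so statistics never mix across the sentences in a batch.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 8           # assumed toy shapes
x = torch.randn(batch, seq_len, d_model)    # a batch of two "sentences"

ln = nn.LayerNorm(d_model)                  # normalizes over the feature dimension only
y = ln(x)

# Each (sentence, position) is normalized on its own feature vector:
print(y[0, 0].mean().item())                    # close to 0
print(y[0, 0].std(unbiased=False).item())       # close to 1
```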

Problem of unequal length

Not all tasks require the input and output to be of equal length, so there are several RNN variants designed to handle unequal input and output lengths.

N-1

Just take the output of the last time step as the result

Or follow the output of each step with a softmax for classification, and so on
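A hedged PyTorch sketch of the N-1 case (the sizes and the classification head are assumptions): only the hidden state from the last time step is fed to the classifier.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # assumed sizes
classifier = nn.Linear(16, 3)                                  # e.g. 3 classes

x = torch.randn(4, 10, 8)        # (batch, sequence length N, features)
out, h_n = rnn(x)                # out: (4, 10, 16), h_n: (1, 4, 16)

logits = classifier(h_n[-1])     # use only the last step's hidden state
probs = logits.softmax(dim=-1)   # a single prediction per sequence
print(probs.shape)               # torch.Size([4, 3])
```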

1-N

There are two main ways

X is only used as the initial input:

X as the input at every step:
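Since the figures are not reproduced here, a minimal NumPy sketch of the two approaches (dimensions are made up; x is a single vector decoded into a length-N sequence): either x is used only at the first step, or x is repeated as the input at every step.

```python
import numpy as np

hidden_size, n_steps = 4, 5                     # assumed toy sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_size, hidden_size))
U = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
x = rng.normal(size=hidden_size)                # the single input vector

# Way 1: x only as the initial input (later steps run on the previous h alone)
h = np.tanh(U @ x + b)
outputs_1 = [h]
for _ in range(n_steps - 1):
    h = np.tanh(W @ h + b)
    outputs_1.append(h)

# Way 2: x as the input at every step
h = np.zeros(hidden_size)
outputs_2 = []
for _ in range(n_steps):
    h = np.tanh(U @ x + W @ h + b)
    outputs_2.append(h)

print(len(outputs_1), len(outputs_2))  # 5 5 -> N outputs from one input
```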

N-M

This is the Encoder-Decoder structure: the encoder first produces a code $C$, and when decoding there are the same two methods as above

In fact, it is equivalent to chaining two RNNs together, an N-1 followed by a 1-M

Its loss function is defined as


$$\max_{\theta}\frac{1}{N}\sum_{n=1}^{N}\log p_{\theta}(y_n|x_n)$$
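To make the N-M idea concrete, here is a hedged PyTorch sketch (the sizes, the use of the final encoder hidden state as the code $C$, and feeding $C$ at every decoder step are illustrative choices, not the post's implementation):

```python
import torch
import torch.nn as nn

in_size, hid, out_len = 8, 16, 7                     # assumed sizes
encoder = nn.RNN(input_size=in_size, hidden_size=hid, batch_first=True)
decoder = nn.RNN(input_size=hid, hidden_size=hid, batch_first=True)
readout = nn.Linear(hid, 10)                         # e.g. 10 output symbols

x = torch.randn(2, 12, in_size)                      # source sequence, length N = 12

# Encoder (N-1): compress the whole input into a code C
_, h_n = encoder(x)
C = h_n[-1]                                          # (batch, hid)

# Decoder (1-M): here C is fed as the input at each of the M steps
dec_in = C.unsqueeze(1).repeat(1, out_len, 1)        # (batch, M, hid)
dec_out, _ = decoder(dec_in, h_n)
logits = readout(dec_out)                            # (batch, M, 10), M != N
print(logits.shape)
```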

Variants of RNN

Bi-directional Recurrent Neural Network

An RNN only ever learns from the information before a given point in the sequence. Imagine also adding the information after that point while learning.

The network will then learn more contextual (future) information and obtain better results

It is worth noting that not too much future information should be added, as this takes up too much of the network's resources and actually reduces effectiveness

The network is constructed from two RNNs running in opposite directions.
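A short sketch using PyTorch's built-in support (sizes assumed): setting `bidirectional=True` runs a forward and a backward RNN and concatenates their hidden states at every step.

```python
import torch
import torch.nn as nn

birnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True,
               bidirectional=True)     # two RNNs, opposite directions

x = torch.randn(4, 10, 8)
out, h_n = birnn(x)

print(out.shape)   # torch.Size([4, 10, 32]): forward and backward states concatenated
print(h_n.shape)   # torch.Size([2, 4, 16]): one final state per direction
```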

To be continued

Long Short-Term Memory networks (LSTM)

Compared with long-distance (long-term) memory, the RNN is better at short-distance (short-term) memory, so the LSTM was proposed to improve the network's modeling of long sequences.

The LSTM can be regarded as a special RNN: both have a hidden state and a recurrent form, but the LSTM introduces an input gate, an output gate and a forget gate to control the flow of information, thereby addressing the problem of modeling long-distance information.

Acquiring long-distance (long-term) information is practically a built-in "talent" of the LSTM and does not cost it much effort.

The basic structure

Before introducing the structure, let's take a look at what the various icons mean

  • A yellow box indicates a neural network layer
  • A pink circle indicates a pointwise (element-wise) operation

Its basic structure is as follows:

Ignoring the internal details, its structure is very similar to that of the RNN, and the standard LSTM is composed entirely of such repeating modules

For comparison, the RNN's structure can be drawn in the same style

Next, let’s break down the specific structure step by step

Cell state

The cell state is a concept introduced by the LSTM specifically to improve modeling over long distances

As you can see from the diagram, this structure resembles a conveyor belt. If you plot the structure of the entire network, you will find that this “conveyor belt” runs through the whole network

Personally, I see the cell state as holding the long-term memory, while the hidden state selectively outputs the contents of the cell state

gating

Gating controls the flow of information. The LSTM mainly uses three gates to add or remove information, protecting and controlling the cell state

$\sigma$ is the sigmoid function, which outputs a value between 0 and 1 to control how much information flows through

Forget gate

The forget gate is modeled on the human brain's memory mechanism. It reads $x_t$ and $h_{t-1}$ and produces a weight between 0 and 1, which is multiplied with the cell state to indicate how much information should be forgotten. Its formula is


$$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$$

I think a more intuitive reading is: how much information should still be remembered

Input gate

The input gate is used to update the cell state


$$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$
$$\hat{C_t}=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$$

$i_t$ controls how much of the new information is written into the cell state

The update formula is


$$C_t=f_tC_{t-1}+i_t\hat{C_t}$$

Output gate

The output gate determines the $h_t$ to output

The formula is


$$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$
$$h_t=o_t\cdot\tanh(C_t)$$

The output is determined by the cell state together with the current input and the previous hidden state
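Putting the four formulas together, here is a minimal NumPy sketch of one LSTM step (the dimensions and random weights are assumptions for illustration):

```python
import numpy as np

input_size, hidden_size = 4, 3          # assumed toy sizes
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, applied to the concatenation [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = (rng.normal(size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # input gate
    C_hat = np.tanh(W_C @ z + b_C)               # candidate cell state
    C_t = f_t * C_prev + i_t * C_hat             # update the cell state
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state
    return h_t, C_t

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
h, C = lstm_step(rng.normal(size=input_size), h, C)
print(h.shape, C.shape)  # (3,) (3,)
```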

How does the LSTM alleviate gradient vanishing

To be continued

LSTM variants

To be continued

Gated Recurrent Unit (GRU)

The GRU is a variant of the LSTM. In many cases their performance is similar, but the GRU is cheaper to compute and has only about three times as many parameters as a vanilla RNN

The GRU merges the forget gate and the input gate into a single "update gate", and it also merges the cell state and the hidden state

The top line represents the hidden state, and the input at the bottom left is $x_t$

Reset gate: Used to determine which information to discard


$$R_t=\sigma(W_{rx}x_t+W_{rh}h_{t-1}+b_r)$$

Update gate: acts like the LSTM's forget and input gates combined, determining what information is added and what is discarded


$$U_t=\sigma(W_{ux}x_t+W_{uh}h_{t-1}+b_u)$$

Output:


$$\hat{h_t}=\tanh(W_{hx}x_t+W_{hh}(R_t\odot h_{t-1})+b_h)$$
$$h_t=U_t\odot h_{t-1}+(1-U_t)\odot\hat{h_t}$$
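Likewise, a minimal NumPy sketch of one GRU step following the formulas above (sizes and weights are illustrative assumptions):

```python
import numpy as np

input_size, hidden_size = 4, 3          # assumed toy sizes
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three weight sets (reset, update, candidate) -> roughly 3x a vanilla RNN
W_rx, W_ux, W_hx = (rng.normal(size=(hidden_size, input_size)) for _ in range(3))
W_rh, W_uh, W_hh = (rng.normal(size=(hidden_size, hidden_size)) for _ in range(3))
b_r = b_u = b_h = np.zeros(hidden_size)

def gru_step(x_t, h_prev):
    R_t = sigmoid(W_rx @ x_t + W_rh @ h_prev + b_r)            # reset gate
    U_t = sigmoid(W_ux @ x_t + W_uh @ h_prev + b_u)            # update gate
    h_hat = np.tanh(W_hx @ x_t + W_hh @ (R_t * h_prev) + b_h)  # candidate state
    return U_t * h_prev + (1 - U_t) * h_hat                    # mix old and new state

h = np.zeros(hidden_size)
h = gru_step(rng.normal(size=input_size), h)
print(h.shape)  # (3,)
```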