
LSTM

Prerequisite knowledge: RNN

RNN stands for Recurrent Neural Network. It is a neural network for processing sequential data (such as time-series data).

Let’s look at ordinary RNN first.

Unlike an ordinary neural network, an RNN has an extra Memory: it stores the value of the hidden layer in Memory, which lets it remember the previous hidden-layer output. The figure above can be understood as follows:

The input of the RNN has two parts: the sequence data $x$ at the current time step and the value $h$ of the previous hidden layer. In other words, when $x^2$ is entered at the next time step, the value already in Memory is also taken into account, and the hidden layer outputs a new value $h^2$ instead of $h^1$, updating the value in Memory at the same time. The final output $y$ is obtained by passing the result through a Softmax layer to get a probability.

In simple terms, for sequential data, the RNN stores the result of analyzing $x^1$ into Memory, and then analyzes $x^2$ together with that stored result. That is, it accumulates memories and analyzes them together.
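To make this flow concrete, here is a minimal sketch of one vanilla-RNN step in NumPy. The weight names (`W_x`, `W_h`, `W_y`) and the tanh hidden activation are illustrative assumptions, not taken from the figure.

```python
import numpy as np

def rnn_step(x, h_prev, W_x, W_h, W_y, b_h, b_y):
    """One step of a vanilla RNN (a sketch, not the exact network in the figure)."""
    # The new hidden value mixes the current input x with the previous
    # hidden value h_prev that was stored in Memory.
    h = np.tanh(W_x @ x + W_h @ h_prev + b_h)
    # The output y is obtained by passing the hidden value through Softmax.
    logits = W_y @ h + b_y
    y = np.exp(logits - logits.max())
    y /= y.sum()
    return h, y
```

Running this step in a loop over $x^1, x^2, \dots$ while carrying $h$ forward is exactly the "accumulate memories and analyze them together" behavior described above.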

It is worth noting that, since the RNN processes time-series data, the RNNs at different moments are drawn side by side in the figure to show the temporal relationship more intuitively. There is therefore only one RNN in the figure, not three.

In addition, the RNN has many extensions, such as the Elman Network and the Jordan Network. The difference between them lies in what value is stored in Memory.

  • Elman Network: Memory stores the value of the hidden layer
  • Jordan Network: Memory stores the final output value

In addition, there is the Bidirectional RNN. Bidirectional means that a forward RNN and a reverse RNN are trained simultaneously, as shown in the figure below.

What is LSTM

LSTM stands for Long Short-Term Memory network. It is an advanced version of the RNN.

In the previous section, we saw that an ordinary RNN updates Memory without restriction: every time the network receives new input, Memory is overwritten.

LSTM adds three gates on top of the RNN to control Memory.

The three gates are:

  • Input gate: controls writing; a value can be written into Memory only when this gate is open.
  • Output gate: controls reading; the value in Memory can be read out only when this gate is open.
  • Forget gate: controls forgetting; it determines whether the value in Memory is cleared (forgotten) or kept.

As you can see, a Memory cell of an LSTM has four inputs and one output. The output is the value read from Memory; the inputs are the value to be written into Memory (i.e., the output of the hidden layer) and the three gating signals that control the gates.

structure

Let's now take a closer look at the structure of the LSTM. See the figure below.

$z_i$ controls the input gate, $z_o$ controls the output gate, and $z_f$ controls the forget gate. The function $f$ is the activation function; Sigmoid is generally used, so each gate value lies in $[0,1]$ and indicates how far the gate is open.

For the input gate, it is easy to see that $f(z_i)=1$ means the gate is open and $g(z)$ can be written in, while $f(z_i)=0$ means the gate is closed and $g(z)f(z_i)$ becomes 0.

For the forget gate, $f(z_f)=1$ means keeping the current Memory value $c$, while $f(z_f)=0$ means forgetting the current value $c$ in Memory. $c'$ is the updated Memory value: $c' = g(z)f(z_i) + cf(z_f)$. This formula should be easy to understand.

The final output $a$ is controlled by the output gate $f(z_o)$: clearly, $f(z_o)=1$ lets the Memory value $h(c')$ be read out, where $h$ is another activation function, so $a = h(c')f(z_o)$.
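Putting the three gates together, here is a minimal sketch of one Memory cell that follows the formulas above, $c' = g(z)f(z_i) + cf(z_f)$ and $a = h(c')f(z_o)$. Using tanh for $g$ and $h$ is an assumption; the gates $f$ use Sigmoid as described.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def memory_cell(z, z_i, z_f, z_o, c):
    """One LSTM Memory cell: returns the updated Memory c' and the output a."""
    write_value = np.tanh(z) * sigmoid(z_i)   # g(z) scaled by the input gate f(z_i)
    c_new = write_value + c * sigmoid(z_f)    # forget gate f(z_f) decides how much of the old c is kept
    a = np.tanh(c_new) * sigmoid(z_o)         # output gate f(z_o) decides whether h(c') is read out
    return c_new, a
```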

Now that we understand the structure of the Memory cell, let's look at the complete structure of the LSTM. The following figure shows the same LSTM at two adjacent moments.

Observing the figure above: the sample data $x^t$ input at time $t$ is multiplied by a matrix and converted into four vectors $z$, $z_f$, $z_i$, $z_o$ (these are vectors because we consider all the Memory cells together; for a single cell, each gating signal is just a single value). These four vectors are the control signals.

The input also includes the value $c^{t-1}$ stored in the Memory cell, together with $h^{t-1}$, the output of the hidden layer at the previous moment. The LSTM takes both $c^{t-1}$ and $h^{t-1}$ into account, whereas an RNN only needs to consider $h^{t-1}$. This is actually because of the output gate, which determines whether Memory can be read: even if the value in the Memory cell has been updated, the hidden-layer output can differ from the cell value when the output gate does not let it through.
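Here is a minimal sketch of one full LSTM time step, under the assumption that a single stacked weight matrix `W` produces the four pre-activations and that they are split in the order $z$, $z_f$, $z_i$, $z_o$ (both are illustrative choices, not taken from the figure):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W has shape (4 * hidden, input + hidden)."""
    # x^t and h^{t-1} are multiplied by one matrix and then split into
    # the four control vectors z, z_f, z_i, z_o described above.
    pre = W @ np.concatenate([x_t, h_prev]) + b
    z, z_f, z_i, z_o = np.split(pre, 4)
    c_t = np.tanh(z) * sigmoid(z_i) + c_prev * sigmoid(z_f)  # update Memory
    h_t = np.tanh(c_t) * sigmoid(z_o)                        # read Memory through the output gate
    return h_t, c_t
```

Note that both $h_t$ and $c_t$ are carried to the next moment, which matches the figure: the LSTM passes on $c^{t-1}$ and $h^{t-1}$, while a plain RNN only passes on $h^{t-1}$.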

advantage

What are the advantages of LSTM?

Problems with ordinary RNN:

  • Vanishing gradient: when the error is propagated backwards, it is multiplied by the weight $w$ at every step; if $w < 1$, the product gets smaller and smaller.
  • Exploding gradient: if $w > 1$, the product gets bigger and bigger.

The advantage of LSTM is that it can solve the vanishing gradient problem. However, LSTM does not solve exploding gradients.

Why can LSTM solve the vanishing gradient problem?

$c' = g(z)f(z_i) + cf(z_f)$

The LSTM adds the input value to the original Memory cell value at each time step, whereas the RNN overwrites Memory at each time step. The LSTM's Memory value is therefore tied to the past cell value and the input, unless the forget gate chooses to forget the original value.

In other words, in the RNN the effect of a weight on Memory is wiped out at every time step, while in the LSTM the effect of a weight on Memory persists until the forget gate chooses to forget the original Memory value. That is why the LSTM does not have the vanishing gradient problem.
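A rough way to see this from the update formula, treating the gate signals as constants with respect to $c$ (a simplifying assumption):

```latex
% Differentiate the cell update  c' = g(z) f(z_i) + c f(z_f)  with respect to c:
\frac{\partial c'}{\partial c} = f(z_f)
% A vanilla RNN instead multiplies by roughly  \sigma'(\cdot)\, w  at every
% time step, which shrinks toward 0 when w < 1.
```

As long as the forget gate stays close to 1, this factor stays close to 1 at every time step, so the gradient flowing along the Memory does not shrink step by step.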

In addition, since LSTM solves the vanishing gradient problem, we can set a smaller learning rate during training.

summary

The Memory of an ordinary RNN is a very short-term memory: it is erased every time new input arrives. By adding the forget gate, LSTM makes this short-term memory last longer, which is exactly what the name Long Short-Term Memory means.

An ordinary RNN is prone to vanishing or exploding gradients because the effect of a weight on Memory is wiped out at every time step. LSTM can solve the RNN's vanishing gradient problem, but it does not solve exploding gradients. Because the LSTM's Memory value is tied to the past cell value and the input, the effect of a weight on Memory is not removed until the forget gate chooses to forget the original value.

However, LSTM also has an obvious disadvantage: it has many parameters, which makes training harder and makes it easier to overfit. GRU, an RNN with fewer parameters and comparable performance to LSTM, was therefore proposed. In simple terms, the GRU couples the input gate to the forget gate: when the input gate opens, the forget gate forgets part of the Memory, i.e., the old value must be forgotten before a new value can be stored, as sketched below.
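As a sketch in the common GRU notation (these symbols are not from the figures above), a single update gate $z_t$ decides both how much of the old state $h_{t-1}$ to keep and how much of the new candidate $\tilde{h}_t$ to write:

```latex
% One common formulation of the GRU state update: keeping and writing are
% tied to the same gate z_t, so storing something new necessarily forgets
% a corresponding part of the old state.
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
```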

reference

  1. Machine Learning course videos (Bilibili) by Professor Li Hongyi of National Taiwan University