1. Network structure
LSTM keeps the strengths of the RNN, including its ability to remember, but it memorizes only valuable information and discards redundant memories, which makes learning easier. Compared with an RNN, an LSTM neuron is still computed from the input x and the output h of the previous hidden step; what changes is the internal structure, i.e. the formula computed inside the neuron, while the external structure stays the same. Therefore, all of the RNN architectures mentioned earlier can have their cells replaced by LSTM cells.

The reason LSTM can filter out valuable memories is that its neuron adds an input gate i, a forget gate f, an output gate o and an internal memory cell C. For a trained LSTM model, each gate (forget gate, input gate and output gate) has its own parameter set (U, W, b), learned during training; the per-gate formulas are given in the walkthrough below. When the current input carries no useful information, the forget gate f approaches 1 and the input gate i approaches 0, so the useful information from the past is preserved. When the current input carries important information, the forget gate f approaches 0 and the input gate i approaches 1; the LSTM then forgets its past memory and records the important new information.
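To make this gating behaviour concrete, here is a tiny numerical illustration using the cell-state update rule that is derived step by step below. The gate values are chosen by hand for the example, not produced by a trained model:

```python
import numpy as np

# Toy illustration: C_t = f_t * C_(t-1) + i_t * C~_t
C_prev = np.array([0.9, -0.4])        # memory carried from the past
C_cand = np.array([0.1,  0.8])        # candidate memory from the current input

# Current input carries nothing useful: forget gate ~ 1, input gate ~ 0
f_t, i_t = 0.99, 0.01
print(f_t * C_prev + i_t * C_cand)    # close to C_prev: the old memory is kept

# Current input is important: forget gate ~ 0, input gate ~ 1
f_t, i_t = 0.01, 0.99
print(f_t * C_prev + i_t * C_cand)    # close to C_cand: old memory is overwritten
```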
The following figure shows the network structure of a three-layer LSTM model, in which each box represents a neuron.
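For reference, a stacked three-layer LSTM can be instantiated directly in PyTorch. This is only a shape sketch under the assumption that the figure shows stacked layers; the input and hidden sizes below are arbitrary:

```python
import torch
import torch.nn as nn

# A stacked 3-layer LSTM: each layer feeds its hidden states to the layer above.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=3, batch_first=True)

x = torch.randn(4, 7, 10)             # (batch, sequence length, input features)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (4, 7, 20): top-layer hidden state at every time step
print(h_n.shape)     # (3, 4, 20): final hidden state of each of the 3 layers
print(c_n.shape)     # (3, 4, 20): final cell state of each of the 3 layers
```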
Let’s look at the structure of neurons:
(1) Forget gate
The first step in an LSTM is to decide what information to discard from the cell state. This decision is made by a sigmoid ("S-shaped") layer called the "forget gate layer". It looks at h_(t-1) and x_t and outputs, for each number in the cell state C_(t-1), a value between 0 and 1. A value of 1 means "keep this completely" and 0 means "discard this completely".
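In the (U, W, b) notation introduced above, with σ denoting the sigmoid (the "S-shaped" function), the forget gate is computed as:

f_t = σ(U_f · x_t + W_f · h_(t-1) + b_f)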
(2) Input gate
The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values to update. Second, a tanh layer creates a vector of new candidate values, C~_t (the tilde marks the candidate), which could be added to the cell state.
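In the same notation, these two parts are usually written as:

i_t = σ(U_i · x_t + W_i · h_(t-1) + b_i)
C~_t = tanh(U_c · x_t + W_c · h_(t-1) + b_c)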
(3) Internal memory unit
Now we update the old cell state C_(t-1) to the new cell state C_t. We multiply the old state by f_t, forgetting the information we decided to forget, and then add i_t ⊙ C~_t, the new candidate values scaled by how much we decided to update each state value.
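Written out, the cell-state update is (⊙ denotes elementwise multiplication):

C_t = f_t ⊙ C_(t-1) + i_t ⊙ C~_t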
(4) Output gate
Finally, we decide what to output. The output is a filtered version of the cell state. First, a sigmoid layer decides which parts of the cell state may be output. Then we pass the cell state through tanh (squashing the values to between -1 and 1) and multiply the result by the output of the sigmoid layer, so that we only output the parts we decided to output.
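In the same notation, the output gate and the new hidden state are:

o_t = σ(U_o · x_t + W_o · h_(t-1) + b_o)
h_t = o_t ⊙ tanh(C_t)

Putting the four steps together, a single LSTM time step can be sketched in a few lines of NumPy. This is only a minimal illustration of the formulas above, with random untrained parameters and an assumed dictionary-of-arrays layout, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the gate formulas above.

    params holds a (U, W, b) triple per gate 'i', 'f', 'o' and candidate 'c'.
    Shapes: U (hidden, input), W (hidden, hidden), b (hidden,).
    """
    i_t = sigmoid(params['U_i'] @ x_t + params['W_i'] @ h_prev + params['b_i'])
    f_t = sigmoid(params['U_f'] @ x_t + params['W_f'] @ h_prev + params['b_f'])
    o_t = sigmoid(params['U_o'] @ x_t + params['W_o'] @ h_prev + params['b_o'])
    C_cand = np.tanh(params['U_c'] @ x_t + params['W_c'] @ h_prev + params['b_c'])

    C_t = f_t * C_prev + i_t * C_cand   # forget old memory, add new memory
    h_t = o_t * np.tanh(C_t)            # expose a filtered view of the cell
    return h_t, C_t

# Random (untrained) parameters, just to show the shapes involved.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
params = {}
for g in ('i', 'f', 'o', 'c'):
    params[f'U_{g}'] = rng.normal(size=(n_h, n_in))
    params[f'W_{g}'] = rng.normal(size=(n_h, n_h))
    params[f'b_{g}'] = np.zeros(n_h)

h_t, C_t = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), params)
print(h_t.shape, C_t.shape)  # (3,) (3,)
```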
2. How LSTM mitigates the vanishing gradient
Now let's look at how LSTM alleviates the vanishing gradient problem. From the back-propagation formula of the memory cell, we can see that the gradient factor passed from one time step back to the previous one is not necessarily confined to the range 0 to 1; it may well be greater than 1, which is exactly what alleviates the vanishing gradient. Two factors determine this range. The first is the addition in the cell-state update: C_t is the sum of the scaled old state and the scaled candidate, so the gradient flows along an additive path instead of being repeatedly squashed. The second, and the main factor, is that the parameters of the logic gates can control, to a certain extent, how much the gradient of each time step decays; these parameters are learned during training, which is precisely the strength of LSTM.
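As a simplified sketch, treating the gate values as constants with respect to C_(t-1) (in full back-propagation they also depend on it through h_(t-1)):

C_t = f_t ⊙ C_(t-1) + i_t ⊙ C~_t
∂C_t / ∂C_(t-1) ≈ f_t

So the factor carried from step to step along the cell-state path is essentially the forget gate value, which training can push close to 1 whenever a memory should be kept, rather than a fixed weight matrix multiplied by a saturating activation derivative as in a plain RNN.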