1. What is LSTM
As you read this article, you infer the meaning of each word from the words you have already seen; you do not throw everything away and start thinking from scratch. Our thoughts have persistence, and the LSTM is designed to capture exactly this property.
This article introduces another commonly used gated recurrent neural network: **Long Short-Term Memory (LSTM) [1]**. Its structure is slightly more complex than that of the gated recurrent unit, but it addresses the same vanishing-gradient problem in RNNs; in fact, the GRU can be viewed as a simplified variant of the LSTM.
If you first understand how the GRU works, the LSTM will be much easier to follow.
The LSTM introduces three gates, namely the input gate, the forget gate, and the output gate, as well as a memory cell with the same shape as the hidden state (some literature treats the memory cell as a special kind of hidden state) to record additional information.
2. Input gate, forget gate and output gate
Like the reset gate and update gate in the gated recurrent unit, the gates of the LSTM take the current time step's input $X_t$ and the previous time step's hidden state $H_{t-1}$ as input, and their outputs are computed by fully connected layers with the sigmoid activation function. As a result, the elements of all three gates lie in the range $[0, 1]$. As shown below:
Specifically, suppose the number of hidden units is $h$, and that we are given the minibatch input $X_t \in \mathbb{R}^{n \times d}$ of time step $t$ (with $n$ samples and $d$ inputs) and the hidden state $H_{t-1} \in \mathbb{R}^{n \times h}$ of the previous time step. The three gates are computed as follows:
Input gate: $I_t=\sigma(X_tW_{xi}+H_{t-1}W_{hi}+b_i)$
Forget gate: $F_t=\sigma(X_tW_{xf}+H_{t-1}W_{hf}+b_f)$
Output gate: $O_t=\sigma(X_tW_{xo}+H_{t-1}W_{ho}+b_o)$
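As a rough sketch of these formulas in plain NumPy (all shapes, values, and parameter names below are made up for illustration, not taken from any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d, h = 4, 10, 8                      # batch size, input size, hidden size (arbitrary)
X_t = np.random.randn(n, d)             # current time step input
H_prev = np.random.randn(n, h)          # previous time step hidden state

# One pair of weight matrices and one bias per gate, randomly initialised for illustration.
W_xi, W_hi, b_i = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
W_xf, W_hf, b_f = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)
W_xo, W_ho, b_o = np.random.randn(d, h), np.random.randn(h, h), np.zeros(h)

I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)   # input gate, elements in (0, 1)
F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)   # forget gate
O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)   # output gate
```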
3. Candidate memory cells
Next, the LSTM needs to compute the candidate memory cell. Its calculation is similar to that of the three gates introduced above, but it uses the tanh function, whose range is $[-1, 1]$, as the activation function, as shown in the figure below:
Specifically, the candidate memory cell $\tilde{C}_t$ of time step $t$ is computed as
$\tilde{C}_t=\tanh(X_tW_{xc}+H_{t-1}W_{hc}+b_c)$
4. Memory cells
We can control the flow of information in the hidden state through the input, forget, and output gates, whose element values lie in $[0, 1]$. This is usually done with element-wise multiplication (denoted $\odot$). The memory cell $C_t$ of the current time step combines the information of the previous time step's memory cell and the current time step's candidate memory cell, with the forget gate and input gate controlling the information flow:
$C_t=F_t\odot C_{t-1}+I_t\odot \tilde{C}_t$
As shown in the figure below, the forget gate controls whether the information in the previous time step's memory cell $C_{t-1}$ is carried over to the current time step, while the input gate controls how the current input $X_t$ flows into the current time step's memory cell via the candidate $\tilde{C}_t$. If the forget gate stays close to 1 and the input gate stays close to 0, the past memory cells are preserved over time and passed on to the current time step. This design alleviates the vanishing gradient problem in recurrent neural networks and better captures dependencies across large time-step distances in time series.
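To make the "forget gate near 1, input gate near 0" behaviour concrete, here is a toy NumPy sketch (shapes and gate values chosen arbitrarily) showing that the old cell content survives many time steps:

```python
import numpy as np

h = 4
C = np.array([1.0, -2.0, 0.5, 3.0])        # some initial memory cell content
F_t = np.full(h, 0.99)                      # forget gate held close to 1
I_t = np.full(h, 0.01)                      # input gate held close to 0

for step in range(50):
    C_tilde = np.tanh(np.random.randn(h))   # whatever candidate the current input proposes
    C = F_t * C + I_t * C_tilde             # C_t = F_t ⊙ C_{t-1} + I_t ⊙ C̃_t

print(C)  # still close to a (slightly scaled) copy of the initial cell: old information survives
```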
5. Hidden state
Now that we have the memory cell, we can also control the flow of information from the memory cell to the hidden state $H_t$ through the output gate:
$H_t=O_t\odot\tanh(C_t)$
The tanh function here ensures that the elements of the hidden state lie between -1 and 1. Note that when the output gate is close to 1, the memory cell's information is passed on to the hidden state for use by the output layer; when the output gate is close to 0, the memory cell's information is kept only within the cell itself. The figure below shows the full computation of the hidden state in the LSTM:
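Putting Sections 2-5 together, a single forward step of an LSTM can be sketched as one small function. This is a plain NumPy illustration of the equations above; the function and parameter names are my own, not from any framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, p):
    """One LSTM forward step following the equations in this article."""
    I_t = sigmoid(X_t @ p["W_xi"] + H_prev @ p["W_hi"] + p["b_i"])       # input gate
    F_t = sigmoid(X_t @ p["W_xf"] + H_prev @ p["W_hf"] + p["b_f"])       # forget gate
    O_t = sigmoid(X_t @ p["W_xo"] + H_prev @ p["W_ho"] + p["b_o"])       # output gate
    C_tilde = np.tanh(X_t @ p["W_xc"] + H_prev @ p["W_hc"] + p["b_c"])   # candidate cell
    C_t = F_t * C_prev + I_t * C_tilde       # new memory cell
    H_t = O_t * np.tanh(C_t)                 # new hidden state
    return H_t, C_t

# Tiny usage example with random parameters (d inputs, h hidden units, arbitrary sizes).
d, h = 5, 3
p = {name: np.random.randn(d, h) * 0.1 for name in ["W_xi", "W_xf", "W_xo", "W_xc"]}
p.update({name: np.random.randn(h, h) * 0.1 for name in ["W_hi", "W_hf", "W_ho", "W_hc"]})
p.update({name: np.zeros(h) for name in ["b_i", "b_f", "b_o", "b_c"]})

H, C = np.zeros((1, h)), np.zeros((1, h))
for X_t in np.random.randn(6, 1, d):         # a toy sequence of 6 time steps
    H, C = lstm_step(X_t, H, C, p)
```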
6. Differences between LSTM and GRU
The structures of LSTM and GRU are very similar, but the differences lie in:
- In both models the new memory is computed from the previous state and the current input, but the GRU uses a reset gate to control how much of the previous state is taken into account, while the LSTM has no comparable gate for this step.
- The LSTM uses two separate gates for updating the memory, a forget gate and an input gate, while the GRU has only a single update gate.
- The LSTM can adjust the newly generated state through its output gate, while the GRU applies no such adjustment to its output.
- The advantage of the GRU is that it is a simpler model with only two gates: it is easier to build larger networks with it, it is computationally cheaper, and it scales up more easily.
- The LSTM is more powerful and flexible because it has three gates instead of two, as the parameter-count sketch below illustrates.
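One way to make the "simpler model" point concrete is to build both layers in tf.keras on the same input and compare `count_params()` (the exact numbers depend on implementation details; for example, Keras' GRU uses `reset_after=True` by default, which adds an extra bias term):

```python
import tensorflow as tf

d, h = 32, 64                      # illustrative input size and hidden size
x = tf.zeros((1, 10, d))           # dummy batch: (batch, time steps, features)

lstm = tf.keras.layers.LSTM(h)
gru = tf.keras.layers.GRU(h)
lstm(x)                            # call each layer once so it builds its weights
gru(x)

print("LSTM parameters:", lstm.count_params())  # 4 * (d*h + h*h + h)   = 24,832 here
print("GRU parameters: ", gru.count_params())   # 3 * (d*h + h*h + 2*h) = 18,816 here
```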
7. Can LSTM use other activation functions?
Regarding the choice of activation functions: in the LSTM, the forget gate, input gate, and output gate all use the sigmoid function as their activation function, while the hyperbolic tangent (tanh) is used as the activation function when generating the candidate memory.
It is worth noting that both of these activation functions are saturating, meaning that once the input exceeds a certain magnitude, the output barely changes. With a non-saturating activation function such as ReLU, it would be difficult to achieve a gating effect.
The sigmoid function outputs values between 0 and 1, which matches the physical meaning of a gate, and when the input is very large or very small its output is very close to 1 or 0, ensuring that the gate is fully open or fully closed. Tanh is used when generating candidate memories because its output ranges from -1 to 1, which matches the zero-centered feature distributions found in most scenarios; in addition, tanh has a larger gradient than sigmoid near an input of 0, which usually helps the model converge faster.
That said, the choice of activation functions is not set in stone; what matters is choosing them for a good reason.
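A quick numerical check of these claims (plain NumPy, arbitrary sample points):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))   # ~[0.00005, 0.27, 0.5, 0.73, 0.99995]: saturates at 0/1 -> gate closed/open
print(np.tanh(x))   # ~[-1.0, -0.76, 0.0, 0.76, 1.0]: zero-centered, saturates at ±1

# Gradients at 0: sigmoid'(0) = 0.25 while tanh'(0) = 1, so tanh gives larger gradients near 0.
print(sigmoid(0.0) * (1 - sigmoid(0.0)), 1 - np.tanh(0.0) ** 2)
```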
8. Code implementation
MNIST data classification: implementing an LSTM in TensorFlow
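Below is a minimal sketch of the same idea in tf.keras, treating each 28x28 MNIST image as a sequence of 28 rows with 28 features each (layer sizes and epoch count are arbitrary choices, not necessarily those of the original implementation):

```python
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),   # (time steps, features) = (rows, pixels per row)
    tf.keras.layers.LSTM(128),               # hidden size 128 is an arbitrary choice
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, batch_size=64,
          validation_data=(x_test, y_test))
```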
9. References
[1] Dive into Deep Learning (动手学深度学习)
Author: @ mantchs
GitHub:github.com/NLP-LOVE/ML…
Welcome to join the discussion and help improve this project! Group number: [541954936]