Advantages of LSTM over traditional RNN
As the gap between the relevant information and the point where it is needed grows, RNNs become unable to learn to connect the information. LSTMs, however, are capable of handling such “long-term dependencies.”
LSTM Networks
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is effectively their default behavior, not something they struggle to learn.
In the figure above, each line carries an entire vector from the output of one node to the inputs of others. Pink circles represent pointwise operations, such as vector addition, while yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is a bit like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to flow along it unchanged. The LSTM does have the ability to remove information from the cell state or add information to it, carefully regulated by structures called gates.
A gate is a way of selectively letting information through. Each gate consists of a sigmoid network layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
The LSTM has three of these gates to protect and control the cell state.
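As a minimal sketch of the gating idea (the vectors and values here are made up for illustration; they are not from the original post):

```python
import numpy as np

def sigmoid(x):
    # Squashes each element into (0, 1): 0 = "let nothing through",
    # 1 = "let everything through".
    return 1.0 / (1.0 + np.exp(-x))

# A gate is just an elementwise multiplication by sigmoid activations.
signal = np.array([2.0, -1.0, 0.5])
gate = sigmoid(np.array([10.0, 0.0, -10.0]))  # roughly [1.0, 0.5, 0.0]
print(gate * signal)  # roughly [2.0, -0.5, 0.0]: pass, halve, block
```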
Step-by-Step LSTM Walk Through
The forget gate layer decides what information we’re going to throw away from the cell state.
The first step in the LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 means “completely keep this,” while a 0 means “completely get rid of this.”
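In the standard notation, this step computes f_t = σ(W_f · [h_{t-1}, x_t] + b_f). A minimal sketch, with the weight shapes assumed for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_gate(h_prev, x, W_f, b_f):
    # f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one value in (0, 1)
    # per entry of the cell state C_{t-1}.
    return sigmoid(W_f @ np.concatenate([h_prev, x]) + b_f)
```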
Let’s go back to the example of a language model that tries to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the current subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we will update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we combine these two to create an update to the state.
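In the same notation, the two parts are i_t = σ(W_i · [h_{t-1}, x_t] + b_i) and C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C). A sketch under the same assumed shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def input_gate_and_candidate(h_prev, x, W_i, b_i, W_C, b_C):
    concat = np.concatenate([h_prev, x])
    i_t = sigmoid(W_i @ concat + b_i)      # which values to update, in (0, 1)
    c_tilde = np.tanh(W_C @ concat + b_C)  # new candidate values, in (-1, 1)
    return i_t, c_tilde
```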
In our language model example, we want to add the gender of the new subject to the cell state, to replace the old one we are forgetting.
The mathematical formula of the sigmoid function is:

σ(x) = 1 / (1 + e^(-x))

The mathematical formula of the tanh(x) function is:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
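A quick numeric check of these two squashing functions (the inputs are chosen arbitrarily):

```python
import numpy as np

x = np.array([-5.0, 0.0, 5.0])
print(1.0 / (1.0 + np.exp(-x)))  # sigmoid: ~[0.007, 0.5, 0.993], always in (0, 1)
print(np.tanh(x))                # tanh: ~[-0.9999, 0.0, 0.9999], always in (-1, 1)
```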
It is now time to update the old cell state C_{t-1} to the new cell state C_t. The previous steps have already decided what to do; we just need to actually do it.
We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C̃_t. These are the new candidate values, scaled by how much we decided to update each state value.
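Written out, the update is C_t = f_t * C_{t-1} + i_t * C̃_t, where the multiplications are elementwise. As a sketch:

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    # Forget part of the old state, then add the scaled new candidate values.
    return f_t * c_prev + i_t * c_tilde
```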
In terms of the language model, this is where we actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer that decides which parts of the cell state to output. Then we put the cell state through tanh (pushing the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
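In the same notation: o_t = σ(W_o · [h_{t-1}, x_t] + b_o) and h_t = o_t * tanh(C_t). A sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_step(h_prev, x, c_t, W_o, b_o):
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x]) + b_o)  # which parts of the state to expose
    h_t = o_t * np.tanh(c_t)  # push the state into (-1, 1), then filter it
    return h_t
```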
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that is what follows.
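Putting the three gates together, one full LSTM time step might look like the following. This is a minimal from-scratch sketch with assumed weight shapes, not the original post’s code; in practice you would use a library implementation such as torch.nn.LSTM or tf.keras.layers.LSTM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. params holds W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o,
    where each W has shape (hidden, hidden + input) and each b has shape (hidden,)."""
    z = np.concatenate([h_prev, x])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                    # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

# Tiny usage example with random weights (input size 3, hidden size 4).
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((4, 7)) if name.startswith("W")
          else np.zeros(4)
          for name in ["W_f", "b_f", "W_i", "b_i", "W_C", "b_C", "W_o", "b_o"]}
h, c = np.zeros(4), np.zeros(4)
for x in rng.standard_normal((5, 3)):  # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
print(h)  # final hidden state; each entry lies in (-1, 1)
```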
Original text: colah.github.io/posts/2015-…