This article starts from the limitations of RNNs, explains the basic principles of LSTM through simple concepts and a detailed walk-through of its operations, and then reinforces the understanding of this variant of RNN with a text-generation example. LSTM is widely used today; deep learning libraries such as TensorFlow and PyTorch let us call it without ever seeing its computation. I hope this article can help readers preview or review LSTM.
Sequential prediction problems have been around for a long time and are considered among the most difficult in data science. They range from predicting stock price movements to understanding the way people speak, from language translation to predicting the next word you will type on your iPhone keyboard.
In recent years, with breakthroughs in data science, the long short-term memory network (LSTM) has come to be regarded as the most effective solution to almost all sequence problems.
LSTM has many advantages over traditional feedforward neural networks and RNNs because it selectively remembers certain features over long time scales. This article explains the principles of LSTM in detail so that you can use it better.
Note: To understand this article, you’ll need some basic knowledge of recurrent neural networks and Keras, a popular deep learning library.
- How to use RNN? A small tutorial for beginners
- LSTM, GRU, and Neural Turing Machines: a detailed look at the most popular recurrent neural networks in deep learning
- LSTM multivariate time series prediction based on Keras
- Keras sequence-to-sequence learning in 10 minutes
Table of contents
1. Introduction to recurrent neural networks (RNN)
2. Limitations of RNN
3. Improving on RNN: the long short-term memory network (LSTM)
4. The LSTM architecture
4.1 Forget gate
4.2 Input gate
4.3 Output gate
4.4 The overall LSTM process
5. Use LSTM to generate text
1. Introduction to recurrent neural networks (RNN)
Consider sequential data such as the price of a stock on the stock market. A simple machine learning model or artificial neural network can predict future prices by learning certain information from the stock's history: the number of shares traded, the opening price, and so on. The stock price depends on these characteristics, but it also has a strong correlation with the prices of the past few days. In fact, for a trader, the prices (or trend) of the past few days are among the decisive factors in predicting future prices.
In a traditional feedforward neural network, all examples are treated as independent. This means that when the model predicts the price for a particular day, it does not take the previous days' share prices into account.
This temporal dependence is what a recurrent neural network captures. A typical RNN looks like this:
If you expand it, it looks like this:
It is now easier to see how these networks use the trend of past prices when predicting today's price. Here, each prediction at time t (h_t) depends on all previous predictions and the information learned from them.
RNNs can do much of what we want with sequences, but not everything. We want computers to be good enough to write Shakespearean sonnets. RNNs work well in short-term contexts, but to create a story and remember it, the model needs to understand and remember the context across long sequences, just as humans do. A simple RNN cannot do this.
Why is that? Let’s explore.
2. Limitations of RNN
Recurrent neural networks work well when we deal with short-term dependencies, for example completing a sentence such as "The color of the sky is ___". Here an RNN proves quite effective, because the answer has nothing to do with the wider context of the statement. The RNN does not need to remember any earlier information or its meaning; it only needs to know that, most of the time, the sky is blue. So the prediction would be "blue".
However, a plain RNN cannot understand the context implied by the input: some information from the past cannot be recalled when making the current prediction. Let's look at an example:
Here we can infer that, because the author worked in Spain for 20 years, he probably speaks Spanish. But to make the proper prediction, the RNN needs to keep this context in mind, and the relevant information may be separated from the point where it is needed by a large amount of irrelevant data. This is where RNN fails!
The reason behind this is the vanishing gradient problem. To understand it, you need to know something about how feedforward neural networks learn. In a traditional feedforward network, the weight update applied at a particular layer is a multiple of the learning rate, the error term reaching that layer, and the input to that layer. The error term at a particular layer is therefore a product of the errors of all the layers behind it. With an activation function like the sigmoid, the small values of its derivative (which appear in the error term) get multiplied together as we move back toward the first layers. As a result, the gradient almost vanishes by the time it reaches the early layers, and those layers become very hard to train.
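As a minimal numerical illustration (a toy setup assumed here, not something from the original article): the derivative of the sigmoid is at most 0.25, so a product of many such factors shrinks toward zero very quickly.

# toy illustration of vanishing gradients with sigmoid derivatives (illustrative only)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # at most 0.25, reached at x = 0

# pretend each of 10 layers contributes one sigmoid-derivative factor to the gradient
factors = [sigmoid_derivative(x) for x in np.random.randn(10)]
print(np.prod(factors))    # a tiny number: the earliest layers barely receive any gradient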
A similar situation occurs in RNNs. An RNN only has short-term memory: it works if we need information from a short time ago, but once many words have been fed in, the earlier information gets lost somewhere along the way. This problem can be solved by a slightly modified RNN: the long short-term memory network.
3. Improving on RNN: the long short-term memory network (LSTM)
When arranging our schedule, we first check whether a meeting is already booked. But if we need to make time for something more important, we may cancel some of the minor meetings.
An RNN does not work this way. To add new information, it transforms the existing information completely through a function: the information is modified as a whole, with no distinction between what is important and what is not.
LSTM, on the other hand, modifies information locally through multiplications and additions. Information flows selectively through the cell state, and the LSTM selectively remembers or forgets certain features. The information in a particular cell state depends on three distinct things.
Let's use an example to understand this. Returning to the share price of our particular stock, today's price depends on:
the trend of the stock over the previous days (up or down), i.e. the cell state or memory from the previous time step;
the closing price of the previous day, which is closely related to the opening price of the current day, i.e. the hidden state or memory from the previous time step;
the factors that may affect the stock on the current day, i.e. the new information fed as input to the current LSTM unit.
Another important feature of LSTM is the way it processes sequences, which lets it collect more information and contextual relationships. The following figure shows LSTM's sequential processing:
While the figure above is not a detailed, realistic LSTM architecture, it gives the intuition. Because of the properties above, LSTM does not apply a uniform operation to all of the information; it slightly modifies local information. As a result, LSTM can selectively remember or forget things, giving it a "longer short-term memory."
4. The LSTM architecture
We can understand and visualize the operation of LSTM by analogy with how a news team reports a murder. Suppose a news story is built around facts, evidence, and witnesses; any murder can be reported through these three channels.
For example, suppose the murder was initially assumed to have been committed by poisoning the victim, but the autopsy report indicates the cause of death was an impact to the head. As part of the news team, we quickly "forget" the first cause and focus on the second.
If a new suspect comes into view who had a grudge against the victim, could he be the killer? We need to "input" this information into our story for further analysis.
But none of these fragments is enough to publish in the mainstream media on its own, so after a while we need to summarize the information and "output" the corresponding conclusions to our readers, perhaps indicating and analyzing who the most likely killer is.
Below, we will introduce the architecture of LSTM network in detail:
This architecture looks quite different from the simplified version described above, so let's walk through it in detail. A typical LSTM network consists of different cells, or memory blocks, the yellow rectangles in the figure above. An LSTM unit typically passes two states to the next unit: the cell state and the hidden state. Each memory block is responsible for remembering information from previous time steps, and this memory is manipulated through three gating mechanisms: the input gate, the forget gate, and the output gate.
4.1 Forget gate
We will use the following statement as an example of a text prediction problem, first assuming that the statement has been fed into the LSTM network.
When the model encounters the first period after "person," the forget gate may realize that the context of the next sentence could change. The subject of the sentence may therefore need to be forgotten, leaving the subject's place empty. When we start talking about "Dan," this empty subject position should be assigned to "Dan." Forgetting "Bob," the subject of the earlier sentence, is handled by the forget gate.
The forget gate is responsible for removing information from the cell state: information that the LSTM no longer needs to understand things, or that is less important, is removed through a filtering operation. This is essential for optimizing the LSTM's performance.
The forget gate takes two inputs, h_t-1 and x_t. h_t-1 is the hidden state (output) of the previous unit, and x_t is the input at the current time step, i.e. the t-th element of the input sequence X. The inputs are multiplied by weight matrices, a bias is added, and the result is passed through a sigmoid function. The sigmoid outputs a vector with values between 0 and 1, one for each value in the cell state; essentially, it decides which values to keep and which to forget. A value of 0 means the forget gate wants that piece of the cell state to be forgotten completely. This output vector is finally multiplied element-wise with the cell state.
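As a rough numpy sketch of this computation (the weight matrix W_f, the bias b_f, and all sizes and values below are made-up placeholders, not taken from the article), the forget gate could be written as:

# rough sketch of a forget-gate step (hypothetical shapes and weights)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))   # forget-gate weight matrix
b_f = np.zeros(hidden_size)                                          # forget-gate bias

h_prev = rng.standard_normal(hidden_size)    # h_t-1: hidden state from the previous unit
x_t = rng.standard_normal(input_size)        # x_t: input at the current time step
c_prev = rng.standard_normal(hidden_size)    # cell state from the previous time step

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)   # one value in (0, 1) per cell-state entry
c_after_forget = f_t * c_prev                              # entries scaled toward zero are forgotten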
4.2 Input gate
Let’s use another example to show how LSTM analyzes statements:
Now the important new information is that "Bob" knows how to swim and that he served in the Navy for four years. This can be added to the cell state, and adding new information is the job of the input gate.
The input gate is responsible for adding information to the cell state. This happens in three steps:
- A sigmoid function regulates which values need to be added to the cell state. This is very similar to the forget gate: it acts as a filter for the information coming from h_t-1 and x_t.
- A tanh function creates a vector of all the candidate values that could be added to the cell state, with outputs between -1 and 1.
- The regulating filter (the sigmoid gate) is multiplied by the candidate vector (the tanh output), and this useful information is then added to the cell state.
After these three steps, we have basically ensured that the information added to the cell state is important and not redundant.
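A rough sketch of these three steps in numpy, continuing the placeholder forget-gate example above (W_i, b_i, W_c and b_c are again invented weights), might look like this:

# rough sketch of the input-gate steps (continues the placeholder example above)
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))     # input-gate weights
b_i = np.zeros(hidden_size)
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))     # candidate-value weights
b_c = np.zeros(hidden_size)

i_t = sigmoid(W_i @ np.concatenate([h_prev, x_t]) + b_i)               # step 1: sigmoid filter
c_candidate = np.tanh(W_c @ np.concatenate([h_prev, x_t]) + b_c)       # step 2: candidate values in (-1, 1)
c_t = c_after_forget + i_t * c_candidate                               # step 3: add the filtered new information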
4.3 Output gate
Not all of the information flowing through the cell state is suitable for output at a particular time. We will demonstrate this with an example:
In this sentence, a large number of words could fill the blank. But we know that the word before the blank, "brave," is an adjective that modifies a noun, so whatever fills the blank has a strong noun bias. "Bob" would therefore be a correct output.
Selecting useful information from the current cell state and presenting it as output is the job of the output gate. Its structure is as follows:
The function of the output gate can again be divided into three steps:
1. Apply the tanh function to the cell state to create a vector with values scaled between -1 and +1.
2. Generate a filter from the values of h_t-1 and x_t so that it can regulate which values from the vector created above should be output. This filter again uses a sigmoid function.
3. Multiply this regulating filter by the vector created in step 1, and send the result both as the output and as the hidden state passed to the next cell.
In the example above, the filter should ensure that it suppresses all values except "Bob"; this is why the filter is built from the input and hidden-state values and applied to the cell-state vector.
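Continuing the same placeholder sketch (W_o and b_o are again invented for illustration), the output gate could look like this:

# rough sketch of the output-gate steps (continues the placeholder example above)
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))   # output-gate weights
b_o = np.zeros(hidden_size)

scaled_state = np.tanh(c_t)                                # step 1: scale the cell state to (-1, 1)
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)   # step 2: sigmoid filter from h_t-1 and x_t
h_t = o_t * scaled_state                                   # step 3: hidden state / output of this cell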
4.4 The overall LSTM process
We have looked at the individual parts of LSTM in detail above, but readers may not yet have a clear picture of the overall process. Here we briefly walk through how an LSTM unit chooses what to remember and what to forget.
The figure below shows the detailed structure of an LSTM unit, where z is the input and z_i, z_o, and z_f are the values that control the three gates; that is, they filter the information after passing through the activation function f. The activation function is usually chosen to be the sigmoid, because its output lies between 0 and 1 and indicates how far each of the three gates is open.
Image from Machine learning handout by Li Hongyi.
Given an input z, the transformed input g(z) is multiplied by the input gate f(z_i); the product g(z)·f(z_i) represents the information retained after filtering the new input. The forget gate controlled by z_f decides how much of the previously remembered information should be kept, and the retained memory can be written as c·f(z_f). The previously retained information plus the meaningful part of the current input is what gets passed on to the next LSTM unit, so the updated memory can be written as c' = g(z)·f(z_i) + c·f(z_f), where c' represents all the useful information retained so far. We then take an activation of this updated memory, h(c'), as the candidate output; h is usually chosen to be tanh. What remains is the output gate controlled by z_o, which decides which parts of the currently activated memory are useful as output. The final output of the LSTM can therefore be written as a = h(c')·f(z_o).
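To tie the pieces together, here is a small, self-contained numpy sketch of one LSTM cell update written directly in the notation above (z, z_i, z_f and z_o are assumed to be already-computed pre-activation vectors, and the vector size of 4 is arbitrary):

# one LSTM cell update in the notation above (illustrative sketch, arbitrary sizes)
import numpy as np

def f(x):            # gate activation: sigmoid, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def g(x):            # transformation of the new input
    return np.tanh(x)

def h(x):            # transformation of the updated memory
    return np.tanh(x)

def lstm_cell_step(c, z, z_i, z_f, z_o):
    c_new = g(z) * f(z_i) + c * f(z_f)   # filtered new input plus the part of the old memory we keep
    a = h(c_new) * f(z_o)                # the output gate decides what is exposed as output
    return c_new, a

# toy usage with arbitrary 4-dimensional vectors
rng = np.random.default_rng(0)
c = rng.standard_normal(4)
z, z_i, z_f, z_o = (rng.standard_normal(4) for _ in range(4))
c, a = lstm_cell_step(c, z, z_i, z_f, z_o)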
5. Use LSTM to generate text
We now know enough about the theory and inner workings of LSTM, so let's try to build a model that predicts the next n characters following a seed from the original text of Macbeth. Most classic texts are no longer protected by copyright and can be found here (https://www.gutenberg.org/); an updated TXT version of Macbeth can be found here (https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/12/10165151/macbeth.txt).
We use Keras, a high-level neural network API that runs on top of TensorFlow or Theano, so make sure you have a working Keras installation before diving into the code. OK, let's start generating text!
Importing dependencies
# Importing dependencies numpy and keras
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils
We import all the necessary dependencies; this step is self-explanatory.
Load the text file and create a character-to-integer mapping
# load text
filename = "/macbeth.txt"
text = (open(filename).read()).lower()
# mapping characters with integers
unique_chars = sorted(list(set(text)))
char_to_int = {}
int_to_char = {}
for i, c in enumerate(unique_chars):
    char_to_int.update({c: i})
    int_to_char.update({i: c})
The text file is opened and all characters are converted to lowercase. To make the following steps easier, we map each character to a corresponding integer; this makes the computational part of the LSTM simpler.
Preparing the data set
# preparing input and output dataset
X = []
Y = []
for i in range(0, len(text) - 50, 1):
    sequence = text[i:i + 50]
    label = text[i + 50]
    X.append([char_to_int[char] for char in sequence])
    Y.append(char_to_int[label])
The data needs to be prepared in this format: if we want the LSTM to predict the "O" in "HELLO", we feed in [H, E, L, L] as input and [O] as the expected output. Similarly, here we fix the desired sequence length (50 in this example) and then save the encodings of 50 consecutive characters in X and the character that follows them in Y.
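As a tiny illustration of this windowing (using a made-up string and a window length of 4 instead of 50), the loop above produces input/label pairs like these:

# toy illustration of the sliding-window preparation (window length 4 instead of 50)
toy_text = "hello world"
window = 4
for i in range(0, len(toy_text) - window, 1):
    print(toy_text[i:i + window], "->", toy_text[i + window])
# e.g. "hell" -> "o", "ello" -> " ", "llo " -> "w", and so on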
Reshaping X
# reshaping, normalizing and one hot encoding
X_modified = numpy.reshape(X, (len(X), 50, 1))
X_modified = X_modified / float(len(unique_chars))
Y_modified = np_utils.to_categorical(Y)
The LSTM network expects input of the form [samples, time steps, features], where samples is the number of data points we have, time steps is the number of time-dependent steps within a single data point, and features is the number of variables we have for the corresponding true value in Y. We then scale the values in X_modified to lie between 0 and 1 and one-hot encode the true values in Y_modified.
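If you want to sanity-check these shapes, a small optional snippet (reusing the variables defined above) is:

# optional sanity check of the shapes fed to the LSTM
print(X_modified.shape)   # (number of samples, 50 time steps, 1 feature)
print(Y_modified.shape)   # (number of samples, number of unique characters)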
Define the LSTM model
# defining the LSTM model
model = Sequential()
model.add(LSTM(300, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(300))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
We use a Sequential model, which is a linear stack of layers. The first layer is an LSTM layer with 300 memory units, and it returns sequences; this ensures that the next LSTM layer receives sequences rather than just randomly scattered data. A Dropout layer is applied after each LSTM layer to avoid overfitting. Finally, the last layer is a fully connected (Dense) layer with a softmax activation and as many neurons as there are unique characters, because we need to output a one-hot encoded result.
Fit the model and generate characters
# fitting the model
model.fit(X_modified, Y_modified, epochs=1, batch_size=30)
# picking a random seed
start_index = numpy.random.randint(0, len(X)-1)
new_string = X[start_index]
# generating characters
for i in range(50):
    x = numpy.reshape(new_string, (1, len(new_string), 1))
    x = x / float(len(unique_chars))

    # predicting
    pred_index = numpy.argmax(model.predict(x, verbose=0))
    char_out = int_to_char[pred_index]
    seq_in = [int_to_char[value] for value in new_string]
    print(char_out)

    new_string.append(pred_index)
    new_string = new_string[1:len(new_string)]
The model is fit with a batch size of 30; the snippet above uses a single epoch, but good results require training for 100 epochs or more. Next we pick a random seed sequence (for ease of reproduction) and start generating characters. Each model prediction gives the index of the predicted character, which is then decoded back into a character and appended to the seed sequence.
The following figure shows the output of the network:
Eventually, after training for enough epochs, the results will get better and better over time. This is how you solve a sequence prediction problem with LSTM.
Conclusion
LSTM is a promising solution to sequence- and time-series-related problems, but it has the disadvantage of being hard to train: even a simple model takes a lot of time and system resources. That, however, is only a hardware limitation. This article is intended to help you accurately understand the basics of these networks. If you have any questions, please leave a comment.
Original link: https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/