How to start from RNN and understand LSTM step by step

 

 

Preface

If you have studied LSTM, the first thing that comes to mind is probably Christopher Olah's blog post "Understanding LSTM Networks", which is truly excellent and widely circulated on the Internet; when you read other articles about LSTM online, you will find that many of them trace back to this classic post. However, if you are reading about LSTM for the first time, the original post may put a few obstacles in your way:

First, it dives straight into LSTM, while many readers may not yet have a good grasp of RNN; for that reason, we start here from the simplest single-layer network. Second, the post does not explain the three gates of LSTM in enough detail. For example, three different sigmoid functions all use the same symbol σ (a mistake? No, it is simply a convention you get used to after reading enough papers). Third, different weights also share the same symbol W. Once the figures, the formulas, and the relationships between them are put into one-to-one correspondence, beginners will no longer be lost. And if the calculation behind each formula is shown as a directed flow of "water", it becomes much easier to follow; that is where the GIFs come in.

Although I have studied many models/algorithms, for a long time I did not really understand LSTM, even after reading Christopher Olah's "Understanding LSTM Networks" over and over at the beginning. Fortunately, after repeated discussions with Dr. Chen from our company's AI Lab, I finally got it!

As a side note: when someone thinks hard about a problem and still cannot get through it, nine times out of ten the material simply is not accessible enough. If that is not the reason, then asking someone may make it click instantly. That is the point of teaching, and it is also where we at July online add value.

As many readers know, we have already written/taught about many techniques, such as SVM, CNN, XGBoost, and LSTM, in what we believe is the most accessible way in China. Next, we will do the same for BERT and related techniques, treating accessibility as the entry standard and covering the whole chain from NNLM to Word2Vec, Seq2Seq, Seq2Seq with Attention, Transformer, ELMo, GPT, and BERT. We want to pave a road for every AI beginner: one step at a time, with no gaps in understanding.

Building on the references at the end of this article, including Christopher Olah's blog post and the translation by @not_god, this article adds plenty of easy-to-understand annotations (which you will not readily find in other articles) to make everything easier to follow.

 

1. RNN

Before learning LSTM, it helps to learn RNN first; and before learning RNN, it helps to first understand the most basic single-layer network, whose structure is shown in the figure below:

The input is x; applying the transformation Wx + b and then the activation function f gives the output y = f(Wx + b). I am sure you are already very familiar with this. In practical applications, we also encounter a lot of sequential data:

Such as:

  1. Natural language processing problems. x1 could be the first word, x2 the second word, and so on.
  2. Speech processing. Here x1, x2, x3, ... are the audio signals of each frame.
  3. Time series. For example, daily stock prices and so on.

Sequential data like this is not easy to handle with an ordinary neural network.
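Before moving on to sequences, here is the single-layer transform y = f(Wx + b) described above as a minimal NumPy sketch; the dimensions, the choice of tanh as f, and the function name are made up for illustration:

```python
import numpy as np

# Minimal sketch of the single-layer network y = f(Wx + b); shapes are made up.
def single_layer(x, W, b, f=np.tanh):
    return f(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # a 3-dimensional input
W = rng.normal(size=(2, 3))       # weight matrix mapping 3 -> 2 dimensions
b = np.zeros(2)                   # bias
print(single_layer(x, W, b))      # a 2-dimensional output y
```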

In order to model sequence problems, RNN introduces the concept of a hidden state h, which can extract features from sequential data and then transform them into outputs.



Let us start from the calculation of the first hidden state, h1 = f(Ux1 + Wh0 + b):


The meanings of the symbols in the drawings are:

A) Circles or squares represent vectors.

B) An arrow represents a transformation of the vector it starts from. For example, in the figure above, x1 and h0 each have an arrow pointing to h1, which means that a transformation was applied to x1 and h0 (to obtain h1).

Similar notations will appear in many papers. It is easy to get confused at the beginning, but as long as you grasp the above two points, you can easily understand the meaning behind the diagrams.



The calculation of h2 is similar to that of h1: h2 = f(Ux2 + Wh1 + b). But there are two points worth noting:

  1. In this calculation, the parameters U, W and b used at each step are the same; in other words, the parameters are shared across steps. This is an important feature of RNN and must be kept in mind.
  2. The weights in the LSTM, which you will see shortly, are not shared, because they act on two different vectors. Why are the RNN weights shared? That is easy: the RNN weights act on the same vector, just at different time steps.

The remaining hidden states h3 and h4 are computed in turn (using the same parameters U, W, b):

We only draw the case of sequence length 4 for convenience; in fact, this computation can go on indefinitely. Our RNN so far produces no output; the output values are obtained by computing directly from the hidden states h:

As we said before, an arrow represents a transformation of the corresponding vector, just like f(Wx + b). The arrow here represents a transformation of h1, which gives the output y1 = f(Vh1 + c). The remaining outputs are computed in the same way (using the same parameters V and c as for y1):

OK, and we are done! This is the classic RNN structure: the input is x1, x2, ..., xn and the output is y1, y2, ..., yn; that is, the input and output sequences must be of the same length.
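To make the shared parameters concrete, here is a minimal NumPy sketch of the forward pass just described. The choice of tanh for f, softmax for the output layer, and all the shapes are assumptions for illustration only:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, U, W, b, V, c, h0):
    """Classic RNN: h_t = tanh(U x_t + W h_{t-1} + b), y_t = softmax(V h_t + c).
    The same U, W, b, V, c are reused at every time step (parameter sharing)."""
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)   # hidden state carries the sequence context
        y = softmax(V @ h + c)           # output computed directly from h
        hs.append(h)
        ys.append(y)
    return hs, ys

# toy example: a length-4 sequence of 3-dimensional inputs, 5-dimensional hidden state
rng = np.random.default_rng(1)
xs = [rng.normal(size=3) for _ in range(4)]
U, W, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
V, c = rng.normal(size=(2, 5)), np.zeros(2)
hs, ys = rnn_forward(xs, U, W, b, V, c, h0=np.zeros(5))
print(len(ys), ys[0].shape)  # 4 outputs, one per input
```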

1.2 Applications of RNN

Human beings do not start their thinking from a blank slate every moment. As you read this article, you infer the meaning of the current word based on your understanding of the words you have already seen; you do not throw everything away and think from scratch. Your thoughts have persistence. Traditional neural networks cannot do this, and that seems like a major shortcoming. For example, suppose you want to classify the type of event happening at each point in time in a movie. It is hard to see how a traditional neural network could use earlier events in the film to reason about later ones. The recurrent neural network (RNN) solves this problem.

As we saw in the section above, an RNN is a network that contains loops. In this loop structure, each neural network module reads an input x_t and outputs a value h_t (note: the output was denoted y above; from here on we use the hidden-layer output h), and the loop then continues. Loops allow information to be passed from the current step to the next.

These loops make RNN seem mysterious. However, if you think about it, it is no harder to understand than an ordinary neural network. An RNN can be thought of as multiple copies of the same network, each module passing a message to its successor. So, if we unroll the loop:

This chain-like nature reveals that RNNs are intimately related to sequences and lists. They are the natural neural network architecture for this kind of data.

1.3 Limitations of RNN: the long-term dependency problem

One of the appealing features of RNNs is that they can connect previous information to the current task; for example, earlier video frames might inform the understanding of the current frame. If RNNs could do this reliably, they would be extremely useful. But can they? It depends. Sometimes we only need recent information to perform the current task. For example, consider a language model that predicts the next word based on the previous ones. If we try to predict the last word of "the clouds are in the sky", we do not need any further context; it is obvious the next word is "sky". In such cases, where the gap between the relevant information and the position where it is needed is small, an RNN can learn to use the past information.


But there are also more complex scenarios. Suppose we try to predict the last word in "I grew up in France… I speak fluent French." The nearby words suggest that the next word is probably the name of a language, but to figure out which language, we need the context of France mentioned much earlier, far away from the current position. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large.



Unfortunately, as this interval increases, the RNN loses the ability to learn to connect information so far away.





In theory, an RNN can certainly handle such long-term dependencies. One could carefully hand-pick parameters to solve toy versions of the problem, but in practice RNNs fail to learn them. Bengio, et al. (1994) studied this problem in depth and found some fairly fundamental reasons why training an RNN on long-term dependencies is difficult.

In other words, RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time carrying information from earlier time steps to later ones.

So if you are trying to process a piece of text to make a prediction, an RNN may leave out important information from the beginning. During backpropagation (a core procedure in which the weights are updated so as to keep reducing the error, thereby refining the fitted function), the RNN also faces the vanishing gradient problem.

Because gradients are used to update the weights of the network (new weight = old weight − learning rate × gradient), and the gradient keeps shrinking as it is propagated back through time, learning effectively stops once the gradient becomes very small.

In other words, in a recurrent neural network, the layers that receive tiny gradient updates stop learning, and those are usually the earliest time steps. Because they stop learning, the RNN forgets what it saw earlier in a long sequence, and therefore has only short-term memory.

Exploding gradients are the opposite problem: the gradient grows larger and larger, producing unstable, excessively large weight updates.
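As a back-of-the-envelope illustration of both effects: during backpropagation through time the gradient is multiplied again and again by roughly the same recurrent factor, so it either shrinks toward zero or blows up. The numbers below are purely illustrative:

```python
# Illustrative only: repeated multiplication mimics backpropagation through time.
grad = 1.0
for t in range(50):
    grad *= 0.8          # factor < 1 at each step -> vanishing gradient
print(grad)               # ~1.4e-05: early time steps receive almost no learning signal

grad = 1.0
for t in range(50):
    grad *= 1.2          # factor > 1 at each step -> exploding gradient
print(grad)               # ~9.1e+03: weight updates become unstably large
```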

Fortunately, however, there is a variant of the RNN, the LSTM, that can alleviate both vanishing and exploding gradients to some extent!

 

 

2. LSTM networks

Long Short-Term Memory networks, commonly referred to as LSTMs, are a special kind of RNN capable of learning long-term dependencies. Structurally, the LSTM is not radically different from the baseline RNN, but it uses a different function to compute the hidden state. The memory of an LSTM is what we call the cell; you can simply think of it as a black box whose inputs are the previous state h_{t-1} and the current input x_t. The cell decides which of the previous information and state to keep/remember and which to erase. In practice, this approach turns out to preserve long-range associations very effectively.

2.1 What is an LSTM network?

For example, when you want to buy groceries online, it is common to check the reviews of people who have already bought the item.

When you read the reviews, your brain subconsciously remembers only the important keywords, such as "amazing" and "awesome", and pays less attention to words like "this", "give", "all", "should" and so on. If a friend asks you the next day what the review said, you probably will not remember it word for word. Instead, you will recall the gist, such as "I'll definitely buy it again", while the rest of the irrelevant content fades from memory.

This is essentially what an LSTM or GRU does: it learns to keep only the relevant information for making predictions and to forget the irrelevant data. Put simply, because memory capacity is limited, remember the important and forget the unimportant.

LSTM was proposed by Hochreiter & Schmidhuber (1997), and was later refined and popularized by Alex Graves. It has achieved considerable success on many problems and is now widely used. The LSTM is explicitly designed to avoid the long-term dependency problem: remembering information for long periods of time is practically its default behavior, not something it struggles to learn! All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.

The tanh activation function helps regulate the values flowing through the network, keeping them within the range -1 to 1.

The LSTM has the same chain structure, but the repeating module looks different. Instead of a single neural network layer as in the RNN, the repeating module in an LSTM contains four interacting layers, three sigmoid layers and one tanh layer, which interact in a very particular way.

In the figure above, the sigmoid activation function, denoted σ, is similar to tanh, except that it squashes values into the range 0 to 1 instead of -1 to 1. This setup is what makes it possible to update or forget information:

  • Any number multiplied by 0 is 0, so that piece of information is thrown away;
  • Similarly, any number multiplied by 1 is itself, so that piece of information is kept unchanged.

If the gate value is 1, remember; if it is 0, forget. It is the same principle as before: because memory capacity is limited, remember what is important and forget what is not.
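A tiny sketch of what this 0-to-1 gating does to a vector of information (all the values here are made up):

```python
import numpy as np

information = np.array([2.0, -1.5, 0.7, 3.0])   # some information carried by a vector
gate        = np.array([1.0,  0.0, 0.5, 1.0])   # sigmoid outputs between 0 and 1

# keep, forget, partially keep, keep
print(information * gate)  # [ 2.   -0.    0.35  3.  ]
```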

As for the icons of the various elements used in the diagram: each black line carries an entire vector from the output of one node to the input of another; the pink circles represent pointwise operations, such as vector addition; and the yellow boxes are learned neural network layers. Lines merging denote concatenation of vectors, and lines forking denote the content being copied, with the copies going to different places.

2.2 The core idea of LSTM

The key to LSTM is the cell state, represented by the horizontal line running across the top of the figure. The cell state is like a conveyor belt. It runs straight down the entire chain, with only a few minor linear interactions, so it is easy for information to flow along it unchanged.

The LSTM can remove information from or add information to the cell state through carefully designed structures called gates. Gates are a way of letting information through selectively. They consist of a sigmoid neural network layer followed by a pointwise multiplication operation.

A sigmoid output of 0 means "let nothing through", and 1 means "let everything through"! This is how the network learns which data to forget and which to keep.

An LSTM has three such gates to protect and control the cell state: the forget gate, the input gate, and the output gate. Let us now look at each of these three gates.

 


3. Understanding LSTM step by step

3.1 Forget gate

The first step in our LSTM is to decide what information to discard from the cell state. This decision is made by a structure called the "forget gate". The forget gate reads the previous output h_{t-1} and the current input x_t, applies a sigmoid nonlinearity, and outputs a vector f_t = σ(W_f · [h_{t-1}, x_t] + b_f). Each dimension of this vector lies between 0 and 1, where 1 means "keep completely" and 0 means "discard completely" (again: remember the important, forget the unimportant). Finally, this vector is multiplied elementwise with the cell state C_{t-1}.



Returning to our language model example of predicting the next word based on all the previous ones: here the cell state might contain the gender of the current subject, so that the correct pronoun can be chosen. When we see a new subject, we want to forget the old subject's gender, and this is the information the forget gate decides to discard.

Many beginners may find this confusing at first, but it becomes clear in two steps:

  1. For the weight W_f in the formula on the right of the figure above: strictly speaking, it is not a single shared weight; that is, the parts acting on h_{t-1} and on x_t are not the same. Some readers' first reaction might be "what?". Don't worry, simply expand it and everything is clear at once: f_t = σ(W_{fh} h_{t-1} + W_{fx} x_t + b_f). A small code sketch of this expansion follows right after this list.
  2. How do the formula on the right and the diagram on the left correspond to each other, term by term? If the calculation process is shown as a directed flow of water, it becomes clear at a glance, so on to the GIF! The red circle represents the sigmoid activation function and the blue one the tanh function:
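Following the expansion in step 1, here is a minimal NumPy sketch of the forget gate; the shapes, the random parameters, and the function names are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_fh, W_fx, b_f):
    """f_t = sigmoid(W_fh h_{t-1} + W_fx x_t + b_f): one value in [0, 1] per cell-state dimension."""
    return sigmoid(W_fh @ h_prev + W_fx @ x_t + b_f)

rng = np.random.default_rng(2)
hidden, inputs = 4, 3
h_prev, x_t = rng.normal(size=hidden), rng.normal(size=inputs)
C_prev = rng.normal(size=hidden)                 # old cell state C_{t-1}
W_fh = rng.normal(size=(hidden, hidden))
W_fx = rng.normal(size=(hidden, inputs))
f_t = forget_gate(h_prev, x_t, W_fh, W_fx, np.zeros(hidden))
print(f_t * C_prev)   # old cell state, scaled dimension-wise by how much we keep
```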


3.2 Input gate

The next step is to decide what new information to store in the cell state. This has two parts:

First, a sigmoid layer, called the "input gate layer", decides which values we will update;

Second, a tanh layer creates a vector of new candidate values, C~_t, that could be added to the state.

In the next step, these two parts are combined to produce the update to the state.



In our language model example, we want to add the gender of the new subject to the cell state, to replace the old subject's gender that we are forgetting; this determines the information to be added.

Again, understand it in two steps:

  1. First, to understand the two formulas on the right of the figure, expand the calculation: i_t = σ(W_{ih} h_{t-1} + W_{ix} x_t + b_i) and C~_t = tanh(W_{Ch} h_{t-1} + W_{Cx} x_t + b_C).
  2. Second, the GIF! (A small code sketch of these two formulas follows after this list.)
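Here is the corresponding sketch of the input gate and the candidate values; as in this section's formulas, the weights act on the concatenation of h_{t-1} and x_t, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(h_prev, x_t, W_i, W_C, b_i, b_C):
    """i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i): which dimensions to update.
    C_tilde = tanh(W_C [h_{t-1}, x_t] + b_C): candidate values to add."""
    hx = np.concatenate([h_prev, x_t])       # concatenation of h_{t-1} and x_t
    i_t = sigmoid(W_i @ hx + b_i)
    C_tilde = np.tanh(W_C @ hx + b_C)
    return i_t, C_tilde

rng = np.random.default_rng(3)
hidden, inputs = 4, 3
h_prev, x_t = np.zeros(hidden), rng.normal(size=inputs)
W_i = rng.normal(size=(hidden, hidden + inputs))
W_C = rng.normal(size=(hidden, hidden + inputs))
i_t, C_tilde = input_gate(h_prev, x_t, W_i, W_C, np.zeros(hidden), np.zeros(hidden))
print(i_t.round(2), C_tilde.round(2))
```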


3.3 Cell state

Now it is time to update the old cell state: C_{t-1} is updated to C_t. The previous steps have already decided what to do; now we actually do it.

We multiply the old state C_{t-1} by f_t, discarding the information we decided to discard. Then we add i_t * C~_t, the new candidate values scaled by how much we decided to update each state dimension, giving C_t = f_t * C_{t-1} + i_t * C~_t.

In the language model example, this is where we actually drop the gender information of the old subject and add the new information, as decided in the previous steps; in other words, this is where the cell state is updated.

Gifs again!
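And a small numeric sketch of this update rule; the gate and state values below are made up, and in the real cell they come from the forget and input gates above:

```python
import numpy as np

# Made-up values; in a real LSTM these come from the forget gate, the old cell
# state, and the input gate / candidate computed in the previous two sections.
f_t     = np.array([0.9, 0.1, 0.5])    # how much of each old dimension to keep
C_prev  = np.array([1.0, 2.0, -1.0])   # old cell state C_{t-1}
i_t     = np.array([0.2, 0.8, 0.0])    # how much of each candidate dimension to write
C_tilde = np.array([0.5, -1.0, 0.3])   # candidate values from tanh, in [-1, 1]

C_t = f_t * C_prev + i_t * C_tilde     # C_t = f_t * C_{t-1} + i_t * C~_t
print(C_t)                             # [ 1.  -0.6 -0.5]
```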

3.4 Output gate

Finally, we need to decide what to output. This output is based on our cell state, but in a filtered form. First, we run a sigmoid layer to decide which parts of the cell state will be output. Next, we put the cell state through tanh (pushing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to output. In the language model example, since the model has just seen a subject (a pronoun), it may want to output information relevant to a verb. For instance, it might output whether the subject is singular or plural, so that we know what form the following verb should take.

Again, there are two steps to understand:

  1. Expanding the first formula on the right: o_t = σ(W_{oh} h_{t-1} + W_{ox} x_t + b_o), after which h_t = o_t * tanh(C_t).
  2. And the last GIF (a full LSTM-step code sketch follows after this list):
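Putting the three gates together, here is a minimal sketch of one complete LSTM step, following the formulas in this section; the parameter names, shapes, and random values are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, WC, Wo, bf, bi, bC, bo):
    """One LSTM time step; each weight acts on the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ hx + bf)            # forget gate
    i_t = sigmoid(Wi @ hx + bi)            # input gate
    C_tilde = np.tanh(WC @ hx + bC)        # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state
    o_t = sigmoid(Wo @ hx + bo)            # output gate
    h_t = o_t * np.tanh(C_t)               # filtered cell state becomes the output
    return h_t, C_t

rng = np.random.default_rng(4)
hidden, inputs = 4, 3
shape = (hidden, hidden + inputs)
params = [rng.normal(size=shape) for _ in range(4)] + [np.zeros(hidden) for _ in range(4)]
h_t, C_t = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), *params)
print(h_t.shape, C_t.shape)  # (4,) (4,)
```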

 

4. Variants of LSTM

So far we have described the normal LSTM. But not all LSTMs look the same. In fact, almost every paper involving LSTMs uses a slightly different version. The differences are minor, but worth mentioning.

4.1 Peephole connections and coupled gates

One popular LSTM variant, proposed by Gers & Schmidhuber (2000), adds "peephole connections". That is, we also let the gate layers look at the cell state.

In the figure above, peepholes are added to every gate, but many papers add peepholes to only some of the gates rather than all of them. Another variant couples the forget and input gates. Instead of deciding separately what to forget and what new information to add, these decisions are made together: we only forget when we are about to write something in its place, and we only write new values into the state where we have forgotten something old.

 

4.2 GRU

A more dramatically different variant is the Gated Recurrent Unit (GRU), proposed by Cho, et al. (2014). It combines the forget gate and the input gate into a single "update gate". It also merges the cell state and the hidden state, among other changes. The resulting model is simpler than the standard LSTM and has become a very popular variant.

To make it easier to understand, let me expand the first three formulas on the right-hand side of the figure: z_t = σ(W_{zh} h_{t-1} + W_{zx} x_t), r_t = σ(W_{rh} h_{t-1} + W_{rx} x_t), and h~_t = tanh(W_{hh} (r_t * h_{t-1}) + W_{hx} x_t).

Now, a sharp-eyed reader may notice a small issue: z_t and r_t are both sigmoid nonlinear mappings of h_{t-1} and x_t, so what is the difference between them? The reason is that the GRU merges the forget gate and the input gate into a single update gate: z_t controls what is remembered, and correspondingly (1 - z_t) controls what is forgotten of the old state. The reset gate r_t, on the other hand, keeps back a portion of the original h_{t-1} before it is changed, and it is this retained part (together with x_t) that the tanh transform is applied to when forming the candidate h~_t. The final formula then mixes the two: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t.
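For reference, here is a minimal sketch of one GRU step under the formulation above (biases omitted for brevity; names and shapes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU time step (biases omitted for brevity)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                           # update gate: forget and input combined
    r_t = sigmoid(Wr @ hx)                           # reset gate: how much old state feeds the candidate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # mix old state and candidate

rng = np.random.default_rng(5)
hidden, inputs = 4, 3
Wz, Wr, Wh = (rng.normal(size=(hidden, hidden + inputs)) for _ in range(3))
h_t = gru_step(rng.normal(size=inputs), np.zeros(hidden), Wz, Wr, Wh)
print(h_t.shape)  # (4,)
```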

These are just a few of the most notable LSTM variants. There are many others, such as the Depth Gated RNN proposed by Yao, et al. (2015). There are also entirely different approaches to the long-term dependency problem, such as the Clockwork RNN of Koutnik, et al. (2014). Which variant is best? Do the differences really matter? Greff, et al. (2015) compared the popular variants and concluded that they are all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures and found that some of them do work better than LSTM on certain tasks.

 

 

5. LSTM-related interview questions

To help you consolidate what you have learned above, and to help you land an AI job, here are some typical LSTM interview questions from the July online AI question bank. For more detailed answers, see www.julyedu.com/search?word… and check the "Interview questions" tab after opening the link.

  1. Derive the LSTM structure. Why does it work better than a plain RNN?
  2. What is a GRU? What changes does the GRU make to the LSTM?
  3. What are the inputs and outputs of an LSTM network?
  4. Why does the LSTM use both sigmoid and tanh activation functions rather than only sigmoid or only tanh? What is the purpose of this?
  5. How can the gradient explosion problem be fixed?
  6. How can the RNN gradient explosion and vanishing (dispersion) problems be solved?

 

 

6. References

  1. Christopher Olah's blog post "Understanding LSTM Networks"
  2. @not_god's translation of Christopher Olah's "Understanding LSTM Networks"
  3. How is an RNN constructed step by step from a single-layer network?
  4. Understanding LSTM through GIF images
  5. How to understand the LSTM network (July's annotated version of Christopher Olah's classic blog post)
  6. LSTM-related typical interview questions: www.julyedu.com/search?word… (after opening the link, check the "Interview questions" tab)
  7. How to understand the backpropagation algorithm