This article is a systematic introduction to recurrent neural networks by Jason Brownlee, a well-known machine learning educator. In it, he compares the working principles and characteristics of the three mainstream recurrent architectures in deep learning: LSTM, GRU, and NTM. After reading it, you should have a good sense of why recurrent neural networks excel at current challenges such as speech recognition, natural language processing, and machine translation.



Author | Jason Brownlee

Source | AI Tech Base Camp (rgznai100)

Compiled by | reason_W



A recurrent neural network (RNN) is a type of artificial neural network in which additional weights are added to the network graph to create cycles, allowing the network to maintain an internal state.

Once a neural network has “state”, it can explicitly learn and use contextual information, such as order or time components, in sequence prediction.

This article will take you through the various applications of RNN in deep learning in one go.

After reading this, you should be able to understand:

  • How state-of-the-art RNNs for deep learning, such as the LSTM (Long Short-Term Memory network), the GRU (Gated Recurrent Unit), and the NTM (Neural Turing Machine), work

  • How these state-of-the-art RNNs fit into the broader study of recurrence in artificial neural networks

  • Why RNNs perform so well on a range of challenging problems

We cannot cover every recurrent neural network in detail. Instead, we will focus on the recurrent neural networks used for deep learning (LSTM, GRU, and NTM) and on the background needed to understand them.

Let’s get down to business.

A Tour of Recurrent Neural Network Algorithms for Deep Learning. Photo by Santiago Medem, some rights reserved.

Overview

Let’s first understand the research background of recurrent neural networks.

Next, we will carefully study the application of LSTM, GRU and NTM in deep learning.

Finally, we’ll look at some advanced topics related to the use of RNN for deep learning.

  • Recurrent Neural Networks

    • Fully Recurrent Networks

    • Recursive Neural Networks

    • Neural History Compressor

  • Long Short-Term Memory Networks (LSTM)

  • Gated Recurrent Unit Neural Networks (GRU)

  • Neural Turing Machines (NTM)


Recurrent Neural Networks

First, let’s take a look at the research background of RNN.

It is often said that the loops in the topology give the network memory.

But a better way to think about an RNN is in terms of its training data. A "conventional" model, such as a classical multilayer perceptron, maps the inputs of the current training example to its output:

X(i) -> y(i)

A recurrent neural network is "unconventional" in that each training example is supplemented with the inputs from the previous example:

[X(i-1), X(i)] -> y(i)

The problem here, as with all feedforward network paradigms, is how to connect the input layer to the output layer, include feedback activations, and then train the construct to converge.
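
To make the [X(i-1), X(i)] -> y(i) idea concrete, here is a minimal sketch in Python/NumPy (my own illustration, not code from the original article; the array names are arbitrary) of turning a plain sequence into training pairs in which each example carries the previous input alongside the current one:

import numpy as np

def make_windowed_pairs(X, y):
    # Pair each target y(i) with the current and previous inputs [X(i-1), X(i)].
    inputs, targets = [], []
    for i in range(1, len(X)):
        inputs.append(np.concatenate([X[i - 1], X[i]]))  # [X(i-1), X(i)]
        targets.append(y[i])                             # y(i)
    return np.array(inputs), np.array(targets)

# Toy sequence: 5 time steps with 3 features each, plus scalar targets.
X = np.random.rand(5, 3)
y = np.random.rand(5)
inputs, targets = make_windowed_pairs(X, y)
print(inputs.shape, targets.shape)  # (4, 6) (4,)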

Now let’s start with a simple concept and look at the different types of recurrent neural networks.

Fully Recurrent Networks

Fully recurrent networks retain the layered topology of the multilayer perceptron, but every neuron is connected by weights to every other neuron in the network and has a single feedback connection to itself.

Not all of these connections are trained, and the extreme non-linearity of the error derivatives means that conventional backpropagation no longer works, so the network is trained with backpropagation through time (BPTT) or stochastic gradient descent (SGD).
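
As an illustration (my own sketch, not from the original article; the weight names W_xh and W_hh are assumptions), here is a minimal NumPy version of a single recurrent step, in which the hidden state feeds back into itself at every time step:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

# Input-to-hidden weights, hidden-to-hidden (feedback) weights, and bias.
W_xh = rng.standard_normal((n_hidden, n_in)) * 0.1
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
b_h = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    # One recurrent update: the previous state h_prev feeds back into the new state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(n_hidden)
for x_t in rng.standard_normal((5, n_in)):  # a toy sequence of 5 time steps
    h = rnn_step(x_t, h)
print(h)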

For more: Bill Wilson’s Tensor Product Networks (1991)

http://www.cse.unsw.edu.au/~billw/cs9444/tensor-stuff/tensor-intro-04.html

Recursive Neural Networks

Recursive neural networks are a linear architectural variant of recurrent networks. Recursion promotes branching in hierarchical feature spaces, and the resulting network architecture mimics that hierarchy as training proceeds.

Training is achieved by gradient descent using sub-gradient methods.
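
For illustration only (a sketch of the general idea, not the model from the paper; the weight names are assumptions), the core operation of a recursive network is the composition of two child representations into a parent representation, applied bottom-up over a tree:

import numpy as np

rng = np.random.default_rng(1)
dim = 4  # dimensionality of every node representation

# One shared composition matrix and bias, reused at every node of the tree.
W = rng.standard_normal((dim, 2 * dim)) * 0.1
b = np.zeros(dim)

def compose(left, right):
    # Merge two child vectors into one parent vector of the same size.
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy binary tree over four leaf vectors: ((a b) (c d))
a, b_leaf, c, d_leaf = rng.standard_normal((4, dim))
root = compose(compose(a, b_leaf), compose(c, d_leaf))
print(root.shape)  # (4,)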

This is described in detail in R. Socher et al., Parsing Natural Scenes and Natural Language with Recursive Neural Networks, 2011.

http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Socher_125.pdf

Neural History Compressor

Schmidhuber first reported this very deep learner in 1991; it was able to perform credit assignment over hundreds of neural layers through unsupervised pre-training of a hierarchy of RNNs (that is, the components that contribute most to the outcome receive more of the credit).

Each RNN is trained unsupervised to predict its next input. Only the inputs that generate an error are then fed forward, conveying new information to the next RNN in the hierarchy, which therefore processes at a slower, self-organizing time scale.

No information is lost, only compressed. The RNN stack is a "deep generative model" of the data, and the data can be reconstructed from its compressed form.
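
As a rough conceptual sketch of the compression idea (my own illustration, using a trivial stand-in predictor rather than a trained RNN), each level of the hierarchy passes on only the inputs its predictor failed to anticipate:

def compress(sequence, predict):
    # Keep only the items the predictor fails to anticipate from the history so far.
    surprises, history = [], []
    for item in sequence:
        if predict(history) != item:   # unexpected input -> pass it up the hierarchy
            surprises.append(item)
        history.append(item)
    return surprises

# Stand-in predictor: guess that the next item repeats the previous one.
predict_repeat = lambda history: history[-1] if history else None

seq = [1, 1, 1, 2, 2, 3, 3, 3, 3, 1]
print(compress(seq, predict_repeat))  # [1, 2, 3, 1] -- only the "surprising" items remain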

See J.Schmidhuber et al., Deep Learning in Neural Networks: An Overview, 2014.

http://www2.econ.iastate.edu/tesfatsi/DeepLearningInNeuralNetworksOverview.JSchmidhuber2015.pdf

Backpropagation can still fail here, however: as errors propagate backward through a larger topology, the computation of the extremely non-linear derivatives grows and credit assignment becomes harder.



Long Short-Term Memory Networks

With conventional backpropagation through time (BPTT) or real-time recurrent learning (RTRL), error signals flowing backward in time tend to either blow up or vanish.

The temporal evolution of the backpropagated error depends exponentially on the size of the weights. Exploding weights lead to oscillating, unstable weights, while vanishing gradients make learning to bridge long time lags prohibitively slow, or prevent it from working at all.

  • LSTM is a novel recurrent network architecture trained with an appropriate gradient-based learning algorithm.

  • LSTM was designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1,000 steps.

  • This holds even for noisy, incompressible input sequences, without loss of short time-lag capabilities.

The error back-flow problem is overcome by an efficient gradient-based algorithm for an architecture that enforces constant error flow through the internal states of special units (so that the error neither explodes nor vanishes).

These neurons reduce the effects of input weight conflicts and output weight conflicts.

Input weight conflict: if the input is non-zero, the same incoming weight has to be used both for storing certain inputs and for ignoring others, so it often receives conflicting weight update signals.

These signals try to make the weight participate in storing the input while also protecting the stored value. The conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling "write operations" through the input weights.

Output weight conflict: as long as the output of a unit is non-zero, the weights on its outgoing connections will attract conflicting weight update signals generated during sequence processing.

These signals try to make the output weights participate in accessing the information stored in the unit while, at other times, protecting downstream units from being perturbed by the unit's output.

These conflicts are not confined to long time lags; they also arise with short lags. As the lag grows, however, stored information must be protected against perturbation for longer, especially in the later stages of learning.

Network architecture: different types of units convey useful information about the current state of the network. For example, an input gate (or output gate) can use inputs from other memory cells to decide whether to store (or access) certain information in its memory cell.

Memory cells contain gates, and each gate is assigned to the connection it is meant to mediate. Input gates remedy input weight conflicts, while output gates remedy output weight conflicts.

Gates: specifically, to reduce input and output weight conflicts and perturbations, a multiplicative input gate unit is introduced to protect the stored contents from perturbation by irrelevant inputs, and a multiplicative output gate unit protects other units from perturbation by currently irrelevant memory contents.

An example LSTM network with 8 input units, 4 output units, and 2 memory cell blocks of size 2. in1 marks the input gate, out1 the output gate, and cell1/block1 the first memory cell of block 1. (From Long Short-Term Memory, 1997)
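
As a concrete illustration (a hedged sketch of the common modern formulation rather than the exact 1997 model; it includes a forget gate, which was added later, and all weight names are assumptions), here is one forward step of an LSTM cell in NumPy, with multiplicative gates guarding the cell state:

import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate plus the candidate update (assumed names).
W_i, W_f, W_o, W_c = (rng.standard_normal((n_hidden, n_in + n_hidden)) * 0.1 for _ in range(4))
b_i = b_f = b_o = b_c = np.zeros(n_hidden)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W_i @ z + b_i)         # input gate: controls the "write"
    f = sigmoid(W_f @ z + b_f)         # forget gate: controls retention of the cell state
    o = sigmoid(W_o @ z + b_o)         # output gate: controls the "read"
    c_tilde = np.tanh(W_c @ z + b_c)   # candidate content
    c = f * c_prev + i * c_tilde       # protected internal cell state
    h = o * np.tanh(c)                 # exposed hidden state
    return h, c

h = c = np.zeros(n_hidden)
for x_t in rng.standard_normal((4, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)  # (5,) (5,)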

The connectivity of LSTM is more complex than that of a multilayer perceptron because of its diverse processing elements and feedback connections.

Memory cell blocks: memory cells that share the same input gate and the same output gate form a structure called a memory cell block.

Memory cell blocks facilitate information storage; as in traditional neural networks, it is not easy to encode a distributed input within a single cell. A memory cell block of size 1 is just an ordinary memory cell.

Learning: a variant of real-time recurrent learning (RTRL) is used, which ensures that the non-decaying error backpropagated through the internal states of the memory cells to their "memory cell net inputs" is not propagated further back in time, owing to the multiplicative dynamics introduced by the input and output gates.

Guessing: this stochastic approach can outperform many time-lag algorithms. It was established that many long-time-lag tasks used in previous work could be solved more quickly by simple random weight guessing than by the algorithms proposed there.

See S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, 1997.

http://dl.acm.org/citation.cfm?id=1246450

The most interesting application of LSTM recurrent neural networks is in language processing. For a fuller description, see Gers’ paper:

F. Gers and J. Schmidhuber, LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages, 2001.

ftp://ftp.idsia.ch/pub/juergen/L-IEEE.pdf

F. Gers, Long Short-Term Memory in Recurrent Neural Networks, Ph.D. Thesis, 2001. http://www.felixgers.de/papers/phd.pdf

Limitations of LSTM

  • The efficient, truncated version of LSTM does not easily solve problems similar to "strongly delayed XOR".

  • Each memory cell block of the LSTM requires an input gate and an output gate, which are not needed in other recurrent approaches.

  • Constant error flow through the "Constant Error Carrousels" inside the memory cells produces the same effect as presenting the entire input string at once to a traditional feedforward architecture.

  • LSTM, like other feedforward approaches, is impaired when it comes to the notion of "recency". If precise counting of time steps is required, an additional counting mechanism may be needed.

Advantages of LSTM

  • The constant error backpropagation within the memory cells gives the architecture the ability to bridge very long time lags.

  • LSTM can handle noisy problem domains, distributed representations, and continuous values.

  • LSTM generalizes well over the problem domains considered. This is important because some tasks are intractable for already established recurrent networks.

  • There is no need to fine-tune network parameters on the problem domain.

  • LSTM is essentially equivalent to BPTT in terms of the complexity of each weight and time step update.

  • LSTM has achieved the most advanced results in such fields as machine translation and has shown great capabilities.


Gated Recurrent Unit Neural Networks

Like the LSTM, the gated recurrent unit (GRU) network has been successfully applied to sequential and temporal data, particularly long-sequence problems such as speech recognition, natural language processing, and machine translation.

Like the LSTM, the network gates its hidden units: a gating network generates signals that control how the current input and the previous memory are combined to update the current activation, and thereby the current network state.

The gates themselves have weights that are selectively updated by the learning algorithm during training.

The gating network adds computational expense in the form of extra complexity, since it introduces additional parameters.

The LSTM RNN architecture uses the computation of a simple RNN as an intermediate candidate for the internal memory cell (state). The gated recurrent unit (GRU) RNN reduces the gating signals of the LSTM RNN to two, called the update gate and the reset gate.
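
As a hedged illustration of the two gates (a sketch of the common GRU formulation, with biases omitted and weight names assumed), here is one GRU forward step in NumPy:

import numpy as np

rng = np.random.default_rng(3)
n_in, n_hidden = 3, 5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed weight names: update gate (z), reset gate (r), and candidate state.
W_z, W_r, W_h = (rng.standard_normal((n_hidden, n_in + n_hidden)) * 0.1 for _ in range(3))

def gru_step(x_t, h_prev):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh)                                        # update gate: how much new vs. old state
    r = sigmoid(W_r @ xh)                                        # reset gate: how much of the past to expose
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))   # candidate activation
    return (1 - z) * h_prev + z * h_tilde                        # blended new state

h = np.zeros(n_hidden)
for x_t in rng.standard_normal((4, n_in)):
    h = gru_step(x_t, h)
print(h.shape)  # (5,)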

The gating mechanism in the GRU (and LSTM) RNN is parameterized like the simple RNN itself. The weights corresponding to these gates are likewise updated with BPTT stochastic gradient descent to minimize a loss function.

Each parameter update will involve information about the state of the entire network. This can have detrimental effects.

The concept of gating has been explored further and extended with three new gating variants.

The three gating variants that have been considered are listed below, with an illustrative sketch after the list:

  • GRU1, where each gate is computed using only the previous hidden state and the bias;

  • GRU2, where each gate is computed using only the previous hidden state;

  • GRU3, where each gate is computed using the bias only.
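
Following the variant definitions above, here is a hedged sketch (with assumed weight and bias names) of how each variant restricts the terms that feed a gate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_gru1(U, h_prev, b):
    # GRU1: gate from the previous hidden state and the bias only (no input term).
    return sigmoid(U @ h_prev + b)

def gate_gru2(U, h_prev):
    # GRU2: gate from the previous hidden state only.
    return sigmoid(U @ h_prev)

def gate_gru3(b):
    # GRU3: gate from the bias only -- the fewest parameters of the three variants.
    return sigmoid(b)

n_hidden = 5
rng = np.random.default_rng(4)
U = rng.standard_normal((n_hidden, n_hidden)) * 0.1
b = np.zeros(n_hidden)
h_prev = rng.standard_normal(n_hidden)
print(gate_gru1(U, h_prev, b), gate_gru2(U, h_prev), gate_gru3(b), sep="\n")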

A significant reduction in parameters can be observed, with GRU3 yielding the fewest.

All three variants, along with the baseline GRU RNN, were benchmarked on handwritten digits from the MNIST database and on the IMDB movie review dataset.

Two sequence lengths were generated from the MNIST dataset, and one from the IMDB dataset.

The main driving signal of the gates appears to be the (recurrent) state, since it contains essential information about the other signals.

Stochastic gradient descent implicitly carries information about the network state, which may explain the relative success of using the bias alone in the gate signals: its adaptive updates already carry information about the state of the network.

These gating variants explore and extend the gating mechanism, but provide only a limited evaluation of the possible topologies.

For more information, see:

R. Dey and F. M. Salem, Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks, 2017. https://arxiv.org/ftp/arxiv/papers/1701/1701.05923.pdf

J. Chung, et al., Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014.

https://pdfs.semanticscholar.org/2d9e/3f53fcdb548b0b3c4d4efb197f164fe0c381.pdf


Neural Turing Machines

Neural Turing machines extend the capabilities of neural networks by coupling them to external memory resources, with which they interact through attention. In essence, an NTM uses a neural network to implement the read and write operations of the Turing-machine model of computation, and its components mirror those of a Turing machine.

The combined system is analogous to a Turing machine or a von Neumann architecture, but it is differentiable end-to-end and can be trained efficiently with gradient descent.

Preliminary results suggest that neural Turing machines can infer simple algorithms, such as copying, sorting, and associative recall, from input and output examples.

The ability of RNNs to learn and carry out complicated transformations of data over extended periods of time sets them apart from other machine learning methods. Moreover, RNNs are known to be Turing-complete, so they can simulate arbitrary procedures if properly wired.

Extending the standard RNN simplifies the solution of algorithmic tasks. The extension is primarily a large, addressable memory, by analogy with Turing's enrichment of finite-state machines with an infinite memory tape, hence the name "Neural Turing Machine" (NTM).

Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.

The NTM architecture. During each update cycle, the controller network receives input from the external environment and emits output in response. It also reads from and writes to a memory matrix through a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. (From Neural Turing Machines, 2014)

Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. This is achieved by defining "blurry" read and write operations that interact to a greater or lesser degree with all the elements in memory, rather than addressing a single element as a normal Turing machine or digital computer does.
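
To illustrate the "blurry" memory interaction (a hedged sketch of the standard read and write equations only, not the full NTM addressing mechanism; names are assumptions), a read is a weighted sum over all memory rows, and a write erases and adds content in proportion to the same attention weights:

import numpy as np

rng = np.random.default_rng(5)
N, M = 8, 4                    # 8 memory locations, each a vector of width 4
memory = rng.standard_normal((N, M))

# Attention weights over locations (non-negative, summing to 1).
w = np.exp(rng.standard_normal(N))
w /= w.sum()

def soft_read(memory, w):
    # Blurry read: a convex combination of every memory row.
    return w @ memory                                # shape (M,)

def soft_write(memory, w, erase, add):
    # Blurry write: each row is partially erased, then partially added to.
    memory = memory * (1 - np.outer(w, erase))       # erase in proportion to attention
    return memory + np.outer(w, add)                 # add in proportion to attention

r = soft_read(memory, w)
memory = soft_write(memory, w, erase=np.full(M, 0.5), add=rng.standard_normal(M))
print(r.shape, memory.shape)  # (4,) (8, 4)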

For more information, see:

A. Graves, et al., Neural Turing Machines, 2014. – https://arxiv.org/pdf/1410.5401.pdf

R. Greve, et al., Evolving Neural Turing Machines for Reward-based Learning, 2016. – http://sebastianrisi.com/wp-content/uploads/greve_gecco16.pdf


NTM Experiments

The copy task tests whether the NTM can store and recall a long sequence of arbitrary information. The network is presented with an input sequence of random binary vectors followed by a delimiter flag.

The network is trained to copy sequences of 8-bit random vectors whose lengths are drawn at random between 1 and 20. The target sequence is simply a copy of the input sequence (without the delimiter flag).
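
As a hedged sketch of how such copy-task training data might be generated (the exact setup in the paper may differ in details such as how the delimiter is encoded):

import numpy as np

def make_copy_example(rng, width=8, min_len=1, max_len=20):
    # Random binary vectors, a delimiter flag on an extra channel, then the copy target.
    length = rng.integers(min_len, max_len + 1)
    seq = rng.integers(0, 2, size=(length, width))
    # Input: the sequence on the first `width` channels, then a one-hot delimiter step.
    inputs = np.zeros((length + 1, width + 1))
    inputs[:length, :width] = seq
    inputs[length, width] = 1          # delimiter flag
    target = seq.copy()                # the network must reproduce the sequence
    return inputs, target

rng = np.random.default_rng(6)
x, y = make_copy_example(rng)
print(x.shape, y.shape)                # e.g. (14, 9) (13, 8)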

The repeat copy task extends the copy task by asking the network to output the copied sequence a specified number of times and then emit an end-of-sequence marker. Its main purpose is to see whether the NTM can learn a simple nested function.

The network receives a random-length sequence of random binary vectors, followed by a scalar value indicating the desired number of copies, which appears on a separate input channel.

The associative recall task involves organizing data with "indirection", where one data item points to another. A list of items is constructed so that querying the network with one item requires it to return the subsequent item.

Specifically, each item is defined as a sequence of binary vectors bounded by delimiter symbols on the left and right. After several items have been fed to the network, it is queried with one of the items chosen at random, and the test is whether it can produce the next item.

The dynamic N-gram task tests whether the NTM can adapt quickly to a new predictive distribution by using its memory as a rewritable table in which it keeps transition statistics, thereby emulating a conventional N-gram model.

Consider the set of all possible 6-gram distributions over binary sequences. Each 6-gram distribution can be expressed as a table of 32 numbers, one for each possible length-five binary history, specifying the probability that the next bit is 1. A particular training sequence is generated by drawing 200 successive bits using the current lookup table. The network observes the sequence one bit at a time and is asked to predict the next bit.

The priority sort task tests the NTM's sorting ability. A sequence of random binary vectors is fed to the network, each accompanied by a scalar priority drawn uniformly from the range [-1, 1]. The target sequence consists of the binary vectors sorted by priority.
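
As a hedged sketch of how a priority-sort example could be generated (details such as appending the priority as an extra channel are assumptions here):

import numpy as np

def make_priority_sort_example(rng, n_vectors=6, width=8):
    # Random binary vectors with uniform priorities in [-1, 1]; the target is the sorted sequence.
    vectors = rng.integers(0, 2, size=(n_vectors, width))
    priorities = rng.uniform(-1.0, 1.0, size=n_vectors)
    # Input: each vector with its priority appended as an extra channel (assumed encoding).
    inputs = np.hstack([vectors, priorities[:, None]])
    # Target: the same vectors, reordered by descending priority.
    target = vectors[np.argsort(-priorities)]
    return inputs, target

rng = np.random.default_rng(7)
x, y = make_priority_sort_example(rng)
print(x.shape, y.shape)   # (6, 9) (6, 8)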

In these experiments, the NTM's controller component was implemented either as a feedforward network or as an LSTM.

Conclusion

By the end of this article, you should have an understanding of the use of recurrent neural networks for deep learning, specifically the following:

  • How state-of-the-art recurrent neural networks, namely the LSTM, GRU, and NTM, perform deep learning tasks

  • How these recurrent neural networks relate to the broader study of recurrence in artificial neural networks

  • What makes RNNs so effective on a range of challenging problems

Original link:

A Tour of Recurrent Neural Network Algorithms for Deep Learning