Author | Renu Khandelwal    Compiled by | VK    Source | Towards Data Science

What is neural machine translation?

Neural machine translation is a technique for translating text from one language into another, for example from English to Hindi. Imagine you are in an Indian village where most people do not speak English, but you want to communicate easily with the villagers. In this situation, neural machine translation can help.

The task of neural machine translation is to use deep neural networks to convert a sequence of words in a source language (such as English) into a sequence of words in a target language (such as Spanish).

What are the characteristics of neural machine translation?

  • Ability to persist sequential data across multiple time steps

NMT works with sequential data that needs to be persisted over several time steps. A plain artificial neural network (ANN) does not store information across time steps. Recurrent neural networks (RNNs), such as the LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), are capable of persisting data over multiple time steps.

  • Ability to handle input and output vectors of variable length

ANNs and CNNs require a fixed-size input vector on which a function is applied to produce a fixed-size output. NMT translates one language into another, and the lengths of the word sequences in the source and target languages are variable.

How does an RNN, like an LSTM or GRU, help with sequential data processing?

An RNN is a neural network with loops in its structure that allow it to store information. It performs the same operation on each element of the sequence, and the output at each step depends on the previous elements or state. This is exactly what we need for processing sequential data.

An RNN can have one or more inputs and one or more outputs, which meets the other requirement for processing sequential data: variable-length inputs and outputs.
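To make the loop concrete, here is a minimal NumPy sketch of a vanilla RNN processing a short sequence. The weight names and sizes are illustrative assumptions, not taken from the article; the point is that the same weights are reused at every step and the hidden state carries information forward.

```python
import numpy as np

# Minimal vanilla RNN unrolled over time (illustrative names and sizes).
np.random.seed(0)
input_size, hidden_size, seq_len = 8, 16, 5

W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

xs = [np.random.randn(input_size) for _ in range(seq_len)]  # the input sequence
h = np.zeros(hidden_size)                                   # initial hidden state

for t, x_t in enumerate(xs):
    # The state at step t depends on the current input and the previous state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    print(f"step {t}: hidden state norm = {np.linalg.norm(h):.3f}")
```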

Why can’t we use a plain RNN for neural machine translation?

In a feed-forward artificial neural network, weights are not shared between layers, so we do not need to sum gradients over time. Because an RNN shares the same weight matrix W across time steps, we need to accumulate the gradient of W over every time step, as shown below.
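The formula the article points to is missing from this copy; a plausible reconstruction, writing $L$ for the loss, $h_t$ for the hidden state at step $t$, and $W$ for the shared recurrent weight matrix, is:

```latex
\[
\frac{\partial L}{\partial W}
  = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W},
\qquad
\frac{\partial L_t}{\partial W}
  = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}
    \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
    \frac{\partial h_k}{\partial W}
\]
```

Each factor $\partial h_i / \partial h_{i-1}$ contains the same matrix $W$, so the product accumulates many factors of $W$ as the number of time steps grows.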

Computing the gradient with respect to the hidden state at time step t = 0 therefore involves many factors of W, since we must backpropagate through every RNN cell. Even if we were not multiplying by a weight matrix but simply by the same scalar value over and over, this would become a problem when the number of time steps is large, say 100.

If the maximum singular value of W is greater than 1, the gradient blows up; this is known as the exploding gradient problem.

If the maximum singular value of W is less than 1, the gradient shrinks toward zero; this is known as the vanishing gradient problem.

Because the same weights are shared across all time steps, the gradient tends to either explode or vanish.

For the exploding gradient problem we can use gradient clipping: we set a threshold in advance, and if the gradient norm exceeds that threshold, we clip the gradient back down to it.
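Here is a minimal NumPy sketch of clipping by global norm; the threshold and the helper name are arbitrary choices for illustration (deep learning frameworks ship built-in equivalents).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradients down if their combined norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / (global_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads

# Example: an artificially large gradient is scaled back to the threshold.
grads = [np.full((3, 3), 10.0), np.full(3, -10.0)]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~5.0
```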

To address the vanishing gradient problem, a common approach is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks.

What are LSTM and GRU?

LSTM stands for Long Short-Term Memory network, and GRU for Gated Recurrent Unit. Both can learn long-term dependencies efficiently; an LSTM can learn to bridge time intervals in excess of 1,000 steps. This is achieved through an efficient, gradient-based learning algorithm.

LSTM and GRU remember information for a long time. They do this by deciding what to remember and what to forget.

An LSTM uses four gates to decide whether the previous state needs to be remembered. The cell state plays a key role in LSTMs: the gates decide what information to add to or remove from the cell state. These gates act like valves, determining how much information is allowed to pass through.
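As a sketch of how the gates regulate the cell state, here is one LSTM step in NumPy; packing all gate weights into a single matrix W, and all sizes, are illustrative assumptions rather than anything prescribed by the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W packs the weights of all gates row-wise."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b      # all gate pre-activations at once
    f = sigmoid(z[0:n])                          # forget gate: what to drop from c
    i = sigmoid(z[n:2 * n])                      # input gate: what to add to c
    o = sigmoid(z[2 * n:3 * n])                  # output gate: what to expose as h
    c_tilde = np.tanh(z[3 * n:4 * n])            # candidate values for the cell state
    c = f * c_prev + i * c_tilde                 # update the cell state
    h = o * np.tanh(c)                           # new hidden state
    return h, c

# Example usage with random weights (sizes are purely illustrative).
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```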

The GRU is a simpler variant of the LSTM that also addresses the vanishing gradient problem.

It uses only two gates, a reset gate and an update gate, in place of the LSTM's gating structure, and it has no separate internal memory (cell state).

The reset gate determines how to combine the new input with the memory from the previous time step. The update gate determines how much of the old memory to keep.
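In the usual formulation (standard notation, not taken from the article), the two gates look like this, where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication:

```latex
\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1})              && \text{update gate: how much old memory to keep}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1})              && \text{reset gate: how to mix new input with old memory}\\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1})\right) && \text{candidate state}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]
```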

GRUs have fewer parameters, so they are more computationally efficient and need less data to train than LSTMs.

How do we use an LSTM or GRU for neural machine translation?

We use an encoder-decoder framework with an LSTM or GRU as the basic building block to create a Seq2Seq model.

The sequence-to-sequence (Seq2Seq) model maps a source sequence to a target sequence. The source sequence is the input language of the machine translation system, and the target sequence is the output language.

Encoder: Reads the input sequence of words in the source language and encodes that information into a real-valued vector, known as the hidden state vector or context vector. This vector encodes the “meaning” of the input sequence in a single vector. The encoder’s outputs are discarded, and only its hidden (internal) states are passed to the decoder as its initial state.

Decoder: Takes the context vector from the encoder, together with the start-of-sequence marker, as its initial input and generates the output sequence.

The encoder reads the input sequence word by word, and similarly the decoder generates the output sequence word by word.
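A rough Keras-style sketch of such an encoder-decoder, as it would be set up for training; the vocabulary sizes, dimensions, and variable names are placeholders and this particular code is not from the original article.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder sizes -- replace with the actual vocabularies and a tuned dimension.
src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256

# Encoder: reads the source sequence; only its final LSTM states are kept.
encoder_inputs = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab, latent_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]  # the "context" handed to the decoder

# Decoder: starts from the encoder states and predicts the target sequence.
decoder_inputs = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(tgt_vocab, latent_dim)(decoder_inputs)
decoder_outputs, _, _ = layers.LSTM(
    latent_dim, return_sequences=True, return_state=True
)(dec_emb, initial_state=encoder_states)
decoder_outputs = layers.Dense(tgt_vocab, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```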

The decoder behaves differently in the training and inference phases, whereas the encoder works the same way in both.

Training phase of the decoder

We use teacher forcing to train the decoder quickly and efficiently.

Teacher forcing is like a teacher correcting students while they are being taught a new concept. Students learn the new concept faster and more effectively when the teacher gives them the right input during training.

The teacher forcing algorithm trains the decoder by feeding it the actual output from the previous time step, rather than the output the decoder predicted at the previous time step, as its input during training.

We add a START tag to mark the beginning of the target sequence and an END tag as the last word of the target sequence. The END tag is later used as a stopping condition in the inference phase, to indicate the end of the output sequence.
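Putting the tags and teacher forcing together at the data level amounts to shifting the tagged target sequence by one position; the token ids below are made up purely for illustration.

```python
import numpy as np

START, END = 1, 2                              # made-up ids for the START and END tags
target = np.array([START, 17, 42, 8, END])     # e.g. "START yo soy estudiante END" as ids

decoder_input = target[:-1]    # [START, 17, 42, 8] -> fed to the decoder
decoder_target = target[1:]    # [17, 42, 8, END]   -> what the decoder must predict
print(decoder_input, decoder_target)
```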


Inference phase of the decoder

In the inference (prediction) phase, there is no actual output sequence to feed in. Instead, we pass the decoder’s predicted output from the previous time step, along with the hidden states, as the input to the decoder.

At the first time step of the decoder’s prediction phase, the input is the encoder’s final states together with the START tag.

For subsequent time steps, the input to the decoder is its hidden states from the previous step together with its previous predicted output.

The prediction phase stops when the maximum target sequence length is reached or the END tag is generated.
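A schematic greedy-decoding loop for this phase; decoder_step below is a stand-in for the trained decoder (one LSTM/GRU step plus a softmax), so the control flow is the point rather than the dummy predictions.

```python
import numpy as np

START, END, MAX_LEN, VOCAB = 1, 2, 20, 50      # made-up ids and sizes

def decoder_step(token, states):
    """Stand-in for one step of the trained decoder plus its softmax layer."""
    rng = np.random.default_rng(token + states)
    probs = rng.random(VOCAB)
    return probs / probs.sum(), states + 1     # fake probabilities, fake new state

# Greedy decoding: feed the previous prediction back in until END or MAX_LEN.
token, states, output = START, 0, []
while len(output) < MAX_LEN:
    probs, states = decoder_step(token, states)
    token = int(np.argmax(probs))              # most probable next word
    if token == END:                           # END tag is the stop condition
        break
    output.append(token)
print(output)
```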

Note: This is just an intuitive explanation of Seq2Seq. In practice we also create word embeddings for the input-language and target-language words. Embeddings provide vector representations of words and their associated meanings.
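An embedding is essentially a trainable lookup table; here is a tiny NumPy illustration with made-up sizes (a framework layer such as the Embedding layer in the sketch above does the same thing, but learns the table during training).

```python
import numpy as np

vocab_size, embedding_dim = 10000, 128                    # illustrative sizes
embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01

word_ids = np.array([12, 7, 311])                         # a sentence as word ids
word_vectors = embedding_matrix[word_ids]                 # one 128-d vector per word
print(word_vectors.shape)                                 # (3, 128)
```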

How can we improve the performance of the Seq2Seq model?

  • A larger training data set
  • Hyperparameter tuning
  • The attention mechanism

What is the attention mechanism?

The encoder passes the context vector to the decoder. A context vector is a single vector that summarizes the entire input sequence.

The basic idea of the attention mechanism is to avoid learning a single vector representation for an entire sentence. Instead, attention focuses on particular input vectors of the input sequence according to a set of attention weights. This allows the decoder network to “focus” on different parts of the encoder’s output at each of its own time steps, using attention weights computed for every decoder time step.
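One common formulation (Bahdanau-style attention, written in standard notation rather than taken from the article): with $h_i$ the encoder outputs, $s_{t-1}$ the decoder state, $\alpha_{t,i}$ the attention weights, and $c_t$ the context vector used at decoder step $t$:

```latex
\[
e_{t,i} = \mathrm{score}(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j}\exp(e_{t,j})}, \qquad
c_t = \sum_{i} \alpha_{t,i}\, h_i
\]
```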

The original link: towardsdatascience.com/intuitive-e…
