Recurrent Neural Network (RNN)

Recurrent neural networks differ from other neural networks in one word: cycle. The cycle here means that the hidden output (v) is fed back in as an input, so that it takes part in training together with the input x of each time step, and both pass through the hidden-layer weights. The hidden values change from one time step to the next, but the weights are shared across time steps. Note that in the unfolded view of the figure, the underlying weight matrix is a single matrix (how the hidden output (v) and the corresponding input (x) share one matrix is explained below); it is drawn several times only so that each copy corresponds to a different moment.
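To make the weight sharing concrete, here is a minimal pure-Python sketch of an unrolled recurrence. The scalar weights `w_x`, `w_h`, the bias `b` and the `tanh` activation are illustrative assumptions, not taken from the implementation in this article.

```python
import math

def run_rnn(inputs, w_x, w_h, b, h0=0.0):
    """Unroll the scalar recurrence h_t = tanh(x_t * w_x + h_{t-1} * w_h + b)."""
    h = h0
    states = []
    for x in inputs:  # one iteration per time step
        # The same w_x, w_h and b are reused at every step (weight sharing);
        # only the hidden value h changes over time.
        h = math.tanh(x * w_x + h * w_h + b)
        states.append(h)
    return states

# Hypothetical weights and a three-step input sequence
states = run_rnn([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
print(states)
```

Only the first input is non-zero here, yet the later states are non-zero too, because the hidden value is fed back in at each step.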

Before going on, there is one optimization to explain: sharing a single matrix. A simple example illustrates it. For ease of understanding, let the hidden state be two-dimensional and the input one-dimensional; the goal is to compute a new hidden state, which is again two-dimensional. Multiplying the input and the hidden state by their own weight matrices and adding the results gives the same answer as concatenating the two vectors and multiplying once by the two weight matrices stacked together. Doing the work as one multiplication is also friendlier to parallel processing.
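The equivalence can be checked by hand with a small sketch. The numbers below are made up for illustration, and `matvec` is a hypothetical helper written in plain Python so that the arithmetic stays visible.

```python
def matvec(row, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(matrix[0])
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(cols)]

x = [2.0]            # one-dimensional input
h = [1.0, -1.0]      # two-dimensional hidden state
W_x = [[0.5, 1.0]]   # 1 x 2 input-to-hidden weights
W_h = [[1.0, 0.0],
       [2.0, 3.0]]   # 2 x 2 hidden-to-hidden weights

# Two separate products, then add the results.
separate = [a + b for a, b in zip(matvec(x, W_x), matvec(h, W_h))]

# One product: concatenate the vectors and stack the weight matrices.
combined = matvec(x + h, W_x + W_h)  # [x, h] is 1 x 3, stacked W is 3 x 2

print(separate, combined)
```

This is the same idea as applying one `nn.Linear(input_size + hidden_size, hidden_size)` layer to `torch.cat((input, hidden), 1)`.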

RNN source code implementation

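import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        # Both layers act on the concatenation [input, hidden], so a single
        # weight matrix covers the input and the previous hidden state.
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # input: (batch, input_size), hidden: (batch, hidden_size)
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)    # hidden state passed to the next step
        output = self.i2o(combined)
        output = self.softmax(output)  # log-probabilities over the outputs
        return output, hidden

    def initHidden(self):
        # Zero hidden state for a batch of size 1
        return torch.zeros(1, self.hidden_size)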

The characteristic of a recurrent neural network is that every step takes in the hidden-layer information of the previous moment, but only of the previous moment. What if the result depends not only on the previous moment, but also on information from a long time ago? Handling that calls for the long short-term memory network (LSTM). Before discussing it, however, we first study an improved variant called the GRU model, which is easier to understand than the LSTM model.

Gated Recurrent Unit (GRU)

In the GRU model, pay attention to the role of the gates: in the formulas they act as weights, or equivalently as masks. Understanding the following four formulas is enough to understand the GRU model, and the diagram is then much easier to read.


$$
R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)\\
Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)
$$

Here $R_t$ is called the reset gate, which helps capture short-term dependencies in time series data, and $Z_t$ is called the update gate, which helps capture long-term dependencies. Their effects show up in the following formulas.


$$
\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)
$$

$\tilde{H}_t$ is called the candidate hidden state; it mainly preserves short-term memory. The reset gate helps capture short-term dependencies in time series data because it acts on the previous hidden state through element-wise multiplication, which lets the model decide how much of the previous hidden state to retain. The larger the value of $R_t$ in a given dimension, the more of the previous hidden state is kept in that dimension.
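A tiny sketch of the element-wise product of the reset gate and the previous hidden state shows the masking effect. The gate values are hypothetical; in a real model they come out of a sigmoid and lie between 0 and 1.

```python
def apply_reset(h_prev, r_gate):
    """Element-wise product of the reset gate and the previous hidden state."""
    return [r * h for r, h in zip(r_gate, h_prev)]

h_prev = [0.8, -0.4, 0.6]
r_gate = [1.0, 0.5, 0.0]  # hypothetical gate values, one per dimension

gated = apply_reset(h_prev, r_gate)
print(gated)
```

The first dimension is kept in full, the second is halved, and the third is dropped before the candidate hidden state is computed.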


$$
H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t
$$

In this formula $Z_t$ is the update gate. Take an extreme example: if $Z_t$ is 1 at the current moment, the hidden state of the previous moment is carried over unchanged, and the candidate state, which holds the short-term memory, receives weight zero. Applying the same reasoning recursively, if $Z_t$ was also 1 at the previous moment, then the state from the moment before that was carried over as well. In this way information from far back in time can be preserved, which is why the update gate helps capture long-term dependencies in time series data.
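The extreme case can be played through with a scalar sketch of the update formula. The constant gate value, the candidate of 0.0 and the 50 steps are assumptions chosen for illustration; in a trained model the gate changes at every step.

```python
def gru_update(h_prev, h_candidate, z):
    """Scalar form of H_t = Z_t * H_{t-1} + (1 - Z_t) * candidate."""
    return z * h_prev + (1.0 - z) * h_candidate

def run(z, steps=50, h0=0.9, candidate=0.0):
    """Apply the update repeatedly with a fixed gate value z."""
    h = h0
    for _ in range(steps):
        h = gru_update(h, candidate, z)
    return h

kept = run(z=1.0)   # gate fully open: the old state survives every step
faded = run(z=0.5)  # half of the old state is replaced at every step

print(kept, faded)
```

With the gate at 1 the initial state is still intact after 50 steps, while with the gate at 0.5 it has almost entirely vanished.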

So the following picture makes sense.

Source code implementation

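import torch
import torch.nn as nn

class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRUCell, self).__init__()
        self.hidden_size = hidden_size
        # The update gate, the reset gate and the candidate state each get
        # their own linear layer over the concatenation [input, hidden].
        self.z_linear = nn.Linear(input_size + hidden_size, hidden_size)
        self.r_linear = nn.Linear(input_size + hidden_size, hidden_size)
        self.h_linear = nn.Linear(input_size + hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        z_gate = self.sigmoid(self.z_linear(combined))  # update gate Z_t
        r_gate = self.sigmoid(self.r_linear(combined))  # reset gate R_t
        # The candidate state sees the reset-gated previous hidden state
        combined01 = torch.cat((input, torch.mul(hidden, r_gate)), 1)
        h1_state = self.tanh(self.h_linear(combined01))
        # H_t = Z_t * H_{t-1} + (1 - Z_t) * candidate state
        h_state = torch.add(torch.mul(z_gate, hidden), torch.mul(1 - z_gate, h1_state))
        output = self.output(h_state)
        output = self.softmax(output)
        return output, h_state

    def initHidden(self):
        # Zero hidden state for a batch of size 1
        return torch.zeros(1, self.hidden_size)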

Finally, let’s look at what’s so hard to understand about LSTM.

Long Short-Term Memory Network (LSTM)

With an understanding of the GRU model, we can understand LSTM directly through the formula.


$$
I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)\\
F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)\\
O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)
$$

$I_t$, $F_t$ and $O_t$ are the input gate, forget gate and output gate, respectively.


$$
\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)
$$

$\tilde{C}_t$ is the candidate memory cell, which is mainly used to preserve short-term memory.


$$
C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t
$$

$C_t$ is the current memory cell. $F_t$ and $I_t$ weigh the long-term memory (the previous memory cell $C_{t-1}$) against the short-term memory (the candidate $\tilde{C}_t$). This works the same way as in the GRU, except that $Z_t$ and $(1 - Z_t)$ are replaced by two independent gates.
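A scalar sketch of the memory cell update shows what the two independent gates allow. All gate and state values below are hypothetical and picked so that the arithmetic is easy to follow.

```python
def lstm_cell_update(c_prev, c_candidate, f_gate, i_gate):
    """Scalar form of C_t = F_t * C_{t-1} + I_t * candidate."""
    return f_gate * c_prev + i_gate * c_candidate

c_prev, c_candidate = 0.5, 0.25

# Both gates fully open: keep all of the old memory and add all of the new.
keep_and_write = lstm_cell_update(c_prev, c_candidate, f_gate=1.0, i_gate=1.0)

# Gates that happen to sum to 1 behave like the GRU pair Z_t and (1 - Z_t).
gru_like = lstm_cell_update(c_prev, c_candidate, f_gate=0.75, i_gate=0.25)

# Both gates closed: forget the old memory and ignore the candidate.
erased = lstm_cell_update(c_prev, c_candidate, f_gate=0.0, i_gate=0.0)

print(keep_and_write, gru_like, erased)
```

The first and third cases have no counterpart in the GRU update, where the two weights always sum to 1.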


$$
H_t = O_t \odot \tanh(C_t)
$$

$H_t$ is the current hidden state. The output gate $O_t$ passes only part of the current memory cell on to the hidden state; unlike in the GRU, not everything goes to the hidden layer. This makes it more flexible to decide what is output and what is passed on to the next time step.
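The effect of the output gate can be sketched in a few lines of plain Python. The cell and gate values are made up for illustration.

```python
import math

def lstm_hidden(cell, o_gate):
    """Element-wise form of H_t = O_t * tanh(C_t)."""
    return [o * math.tanh(c) for o, c in zip(o_gate, cell)]

cell = [2.0, -1.0, 0.5]
o_gate = [1.0, 0.0, 0.5]  # hypothetical gate values, one per dimension

hidden = lstm_hidden(cell, o_gate)
print(hidden)
```

The second dimension is hidden from the output entirely, yet its value of -1.0 stays in the memory cell and can be exposed at a later step if the gate opens.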

Let’s look at the picture again

Source code implementation:

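import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, cell_size, output_size):
        super(LSTMCell, self).__init__()
        # H_t = O_t * tanh(C_t) is element-wise, so both sizes must match
        if cell_size != hidden_size:
            raise ValueError("cell_size must equal hidden_size")
        self.hidden_size = hidden_size
        self.cell_size = cell_size
        # Each gate and the candidate memory cell get their own linear layer
        self.f_linear = nn.Linear(input_size + hidden_size, cell_size)
        self.i_linear = nn.Linear(input_size + hidden_size, cell_size)
        self.o_linear = nn.Linear(input_size + hidden_size, cell_size)
        self.c_linear = nn.Linear(input_size + hidden_size, cell_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, cell):
        combined = torch.cat((input, hidden), 1)
        f_gate = self.sigmoid(self.f_linear(combined))  # forget gate F_t
        i_gate = self.sigmoid(self.i_linear(combined))  # input gate I_t
        o_gate = self.sigmoid(self.o_linear(combined))  # output gate O_t
        z_state = self.tanh(self.c_linear(combined))    # candidate memory cell
        # C_t = F_t * C_{t-1} + I_t * candidate memory cell
        cell = torch.add(torch.mul(cell, f_gate), torch.mul(z_state, i_gate))
        # H_t = O_t * tanh(C_t)
        hidden = torch.mul(self.tanh(cell), o_gate)
        output = self.output(hidden)
        output = self.softmax(output)
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

    def initCell(self):
        return torch.zeros(1, self.cell_size)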
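PyTorch also ships built-in `nn.GRUCell` and `nn.LSTMCell` modules. As a rough sketch of how a cell is stepped over a sequence, the loop below uses those built-ins rather than the hand-written classes above, so it has no output layer or softmax. The sizes and the random inputs are made-up values for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch = 4, 8, 5, 1  # hypothetical sizes

gru = nn.GRUCell(input_size, hidden_size)
lstm = nn.LSTMCell(input_size, hidden_size)

xs = torch.randn(seq_len, batch, input_size)  # one random input per time step
h_gru = torch.zeros(batch, hidden_size)
h_lstm = torch.zeros(batch, hidden_size)
c_lstm = torch.zeros(batch, hidden_size)

for x in xs:
    # The GRU carries a single hidden state from step to step.
    h_gru = gru(x, h_gru)
    # The LSTM carries both a hidden state and a memory cell.
    h_lstm, c_lstm = lstm(x, (h_lstm, c_lstm))

print(h_gru.shape, h_lstm.shape, c_lstm.shape)
```

The LSTM loop has to thread two tensors through time, which mirrors the extra memory cell in the formulas.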

References: "Python Deep Learning Based on PyTorch" and the PyTorch edition of "Dive into Deep Learning".
