Recurrent Neural Network (RNN)
Recurrent neural networks differ from other neural networks in that they contain a cycle. In short, the cycle means that the hidden output (v) is fed back, together with its weight, to take part in training, and the inputs x at different time steps also pass through the hidden-layer weights. Although the hidden weights act at different time steps, they are shared: notice that in the unfolded figure the hidden weight matrix is really just one matrix (how the hidden output (v) and the corresponding input (x) can share one matrix is explained later); several copies are drawn only to correspond to the different time steps.
Before going on, there is an optimization worth explaining, namely how a single matrix can be shared. A simple example illustrates it: for ease of understanding, let the hidden state be two-dimensional and the input one-dimensional, and the goal is to compute the new hidden state, which is again two-dimensional. Concatenating the two weight matrices side by side (and concatenating the input with the hidden state) gives exactly the same result as computing the two products separately and adding them, which improves parallel processing.
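A minimal sketch of this trick (the shapes are illustrative assumptions: batch size 3, 1-dimensional input, 2-dimensional hidden state):

import torch

# Concatenating X with H and W_xh with W_hh gives the same result
# as computing the two matrix products separately and adding them.
X, W_xh = torch.randn(3, 1), torch.randn(1, 2)   # input and its weight
H, W_hh = torch.randn(3, 2), torch.randn(2, 2)   # hidden state and its weight

separate = torch.matmul(X, W_xh) + torch.matmul(H, W_hh)
combined = torch.matmul(torch.cat((X, H), dim=1),
                        torch.cat((W_xh, W_hh), dim=0))

print(torch.allclose(separate, combined))  # True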
RNN source code implementation
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # one shared matrix acting on [input, hidden], as explained above
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # concatenate the current input with the previous hidden state
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)      # new hidden state
        output = self.i2o(combined)      # output for this time step
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
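A minimal usage sketch (the sizes, sequence length, and random inputs below are illustrative assumptions): the hidden state returned at each step is fed back in at the next step, which is exactly the cycle described above.

import torch

rnn = RNN(input_size=10, hidden_size=20, output_size=5)
hidden = rnn.initHidden()

# walk through a sequence of 8 time steps, re-injecting the hidden state
for t in range(8):
    x_t = torch.randn(1, 10)           # made-up input for time step t
    output, hidden = rnn(x_t, hidden)

print(output.shape, hidden.shape)      # torch.Size([1, 5]) torch.Size([1, 20])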
A characteristic of the recurrent neural network is that each step involves the hidden-layer information of the previous time step, but only of that one step. What if the model we want to train also depends on information from much earlier, not just the previous moment? This is where the Long Short-Term Memory network (LSTM) comes in. Before discussing that model, however, it is worth first studying the Gated Recurrent Unit (GRU), which is easier to understand than the LSTM.
Gated Recurrent Unit (GRU) model
To understand the GRU model, pay attention to the role of the gates; in the formulas a gate acts, more precisely, as a weight or as a mask. Without further ado, understanding the following four formulas is the key to the GRU model, and once they are clear the diagram becomes much easier to read.
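For reference, a standard form of the four GRU equations (written here in the convention of Dive into Deep Learning, which the explanation below follows):

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$
$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$
$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$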
Among them, $R_t$ is called the reset gate, which helps capture short-term dependencies in time-series data, and $Z_t$ is called the update gate, which helps capture long-term dependencies in time-series data. Their effects show up in the formulas that follow.
$\tilde{H}_t$ is called the candidate hidden state, and it mainly preserves short-term memory. The reason the reset gate helps capture short-term dependencies in time-series data is that it acts on the previous hidden state by element-wise multiplication, so we can decide how much of the previous hidden state to keep: the larger the value of $R_t$ in a given dimension, the more information of the previous hidden state is retained in that dimension.
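As a concrete (made-up) numerical illustration of this element-wise product: if $H_{t-1} = (0.8, -0.4)$ and $R_t = (0.9, 0.1)$, then

$$R_t \odot H_{t-1} = (0.9 \times 0.8,\; 0.1 \times (-0.4)) = (0.72, -0.04),$$

so the first dimension of the previous hidden state is almost fully kept while the second is almost discarded.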
For $Z_t$, the update gate, consider an extreme example: if $Z_t$ is 1 at the current step, then $H_t = H_{t-1}$, the information of the current step is not written in, and the candidate state carrying the short-term memory contributes zero. If we keep recursing and $Z$ was also 1 at the previous step, then $H_{t-1}$ is in turn the information stored at the step before that, with its candidate state again contributing zero. In this way information from far, far back in time can be carried forward unchanged, which is why the update gate helps capture long-term dependencies in time-series data.
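In equation form, the extreme case just described is

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t \;\xrightarrow{\;Z_t = 1\;}\; H_t = H_{t-1},$$

and if $Z$ stays close to 1 over several steps, $H_t \approx H_{t-k}$: the hidden state is simply copied forward.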
So the following picture makes sense.
Source code implementation
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRUCell, self).__init__()
        self.hidden_size = hidden_size
        # separate linear maps for the update gate, reset gate and candidate state
        self.gate_z = nn.Linear(input_size + hidden_size, hidden_size)
        self.gate_r = nn.Linear(input_size + hidden_size, hidden_size)
        self.gate_h = nn.Linear(input_size + hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        z_gate = self.sigmoid(self.gate_z(combined))   # update gate Z_t
        r_gate = self.sigmoid(self.gate_r(combined))   # reset gate R_t
        # candidate hidden state: the reset gate scales the previous hidden state
        combined01 = torch.cat((input, torch.mul(hidden, r_gate)), 1)
        h1_state = self.tanh(self.gate_h(combined01))
        # new hidden state; note that z_gate here plays the role of (1 - Z_t)
        # in the formulas above, which is just a relabelling of the same gate
        h_state = torch.add(torch.mul((1 - z_gate), hidden), torch.mul(h1_state, z_gate))
        output = self.output(h_state)
        output = self.softmax(output)
        return output, h_state

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
Finally, let’s look at what’s so hard to understand about LSTM.
Long Short-Term Memory network (LSTM)
With an understanding of the GRU model, we can understand the LSTM directly through its formulas.
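For reference, a standard form of the LSTM equations (again written in the convention of Dive into Deep Learning, which matches the notation below):

$$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)$$
$$F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)$$
$$O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)$$
$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$$
$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$$
$$H_t = O_t \odot \tanh(C_t)$$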
$I_t$, $F_t$ and $O_t$ are the input gate, forget gate and output gate respectively.
$\tilde{C}_t$ is the candidate memory cell, which mainly preserves short-term memory.
$C_t$ is the current memory cell. $F_t$ and $I_t$ weigh the long-term memory (the previous cell $C_{t-1}$) and the short-term memory (the candidate $\tilde{C}_t$) respectively. It is understood in the same way as the GRU, except that $Z_t$ and $(1 - Z_t)$ are replaced by two independently learned gates.
$H_t$ is the current hidden state, where $O_t$ controls how much of the current memory cell flows into the hidden state. This differs from the GRU, where everything goes to the hidden layer; passing only part of the memory cell on makes it more efficient and flexible to produce an output, or to feed the next hidden layer, on the basis of the hidden state.
Let’s look at the picture again
Source code implementation:
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, cell_size, output_size):
        super(LSTMCell, self).__init__()
        self.hidden_size = hidden_size
        self.cell_size = cell_size      # cell_size should normally equal hidden_size
        # separate linear maps for the forget, input and output gates
        # and for the candidate memory cell
        self.gate_f = nn.Linear(input_size + hidden_size, cell_size)
        self.gate_i = nn.Linear(input_size + hidden_size, cell_size)
        self.gate_o = nn.Linear(input_size + hidden_size, cell_size)
        self.gate_c = nn.Linear(input_size + hidden_size, cell_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, cell):
        combined = torch.cat((input, hidden), 1)
        f_gate = self.sigmoid(self.gate_f(combined))   # forget gate F_t
        i_gate = self.sigmoid(self.gate_i(combined))   # input gate I_t
        o_gate = self.sigmoid(self.gate_o(combined))   # output gate O_t
        z_state = self.tanh(self.gate_c(combined))     # candidate memory cell
        # C_t = F_t * C_{t-1} + I_t * candidate
        cell = torch.add(torch.mul(cell, f_gate), torch.mul(z_state, i_gate))
        # H_t = O_t * tanh(C_t)
        hidden = torch.mul(self.tanh(cell), o_gate)
        output = self.output(hidden)
        output = self.softmax(output)
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

    def initCell(self):
        return torch.zeros(1, self.cell_size)
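A minimal usage sketch (sizes, sequence length and random inputs are illustrative assumptions); unlike the RNN above, both the hidden state and the memory cell are carried from step to step:

import torch

lstm = LSTMCell(input_size=10, hidden_size=20, cell_size=20, output_size=5)
hidden, cell = lstm.initHidden(), lstm.initCell()

for t in range(8):
    x_t = torch.randn(1, 10)                 # made-up input for time step t
    output, hidden, cell = lstm(x_t, hidden, cell)

print(output.shape, hidden.shape, cell.shape)
# torch.Size([1, 5]) torch.Size([1, 20]) torch.Size([1, 20])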
References: Python Deep Learning: Based on PyTorch; Dive into Deep Learning.
If this article is helpful, please give it a like