1. Recurrent Neural Networks (RNNs)

People do not think from scratch. As you read this article, your understanding of each word depends on the words you have already seen; you do not throw away everything you know and start over at every word. In other words, thought has continuity.

The inability of traditional neural networks to maintain such continuity is a major drawback. Suppose, for example, that you want to classify what is happening in each frame of a movie. There is no obvious way for a traditional network to use events from earlier scenes to help interpret later ones.

Recurrent neural networks can. An RNN contains a loop that lets information be carried forward from one step to the next, so the network can retain what it has previously seen.

Fig1. RNN network structure

In the structure above, the rectangular block A takes an input $x_t$ (the feature vector at time step t) and produces an output $h_t$ (the state, or output, at time step t). The loop allows the state at one time step to be passed on to the next, because the state of the current step becomes part of the input to the next step.
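
Written as a formula, one standard way to express this loop (the weight matrices $W_h$, $W_x$ and bias $b$ are not named in the figure; this is the usual simple-RNN parameterization) is:

$$ h_t = \tanh\left(W_h\, h_{t-1} + W_x\, x_t + b\right) $$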

These loops may make RNNs look hard to understand at first, but on reflection they have a lot in common with ordinary neural networks. An RNN can be viewed as many copies of the same network, each passing its output on to the next copy. Unrolling the RNN over time steps gives the following figure:

Fig2. RNN network structure unrolled in time

The chain-like structure of RNNs makes it clear that they are naturally tied to sequences: the architecture seems almost purpose-built for problems where successive elements are correlated.
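
To make the unrolling concrete, here is a minimal MATLAB sketch (not from the original article; the sizes and random weights are made-up examples) of a simple RNN applied step by step to a sequence, reusing the same weights at every step:

% Toy forward pass of a simple RNN unrolled over time.
% X is an (inputSize x T) sequence; the same weights are reused at every step.
inputSize = 3; hiddenSize = 5; T = 10;
X   = randn(inputSize, T);            % hypothetical input sequence
Wx  = randn(hiddenSize, inputSize);   % input-to-hidden weights
Wh  = randn(hiddenSize, hiddenSize);  % hidden-to-hidden (recurrent) weights
b   = zeros(hiddenSize, 1);
h   = zeros(hiddenSize, 1);           % initial state
H   = zeros(hiddenSize, T);           % stores the state at every time step
for t = 1:T
    h = tanh(Wx * X(:,t) + Wh * h + b);   % state at step t depends on state at step t-1
    H(:,t) = h;
end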

And they really do work! In recent years, RNNs have been applied to an incredible variety of problems: speech recognition, language modeling, translation, image captioning, and more. For an account of their successes in these areas, see Andrej Karpathy’s blog post: The Unreasonable Effectiveness of Recurrent Neural Networks.

Much of this success is due to LSTMs, a special kind of RNN that for many tasks works much, much better than the standard version. Most recent results obtained with recurrent neural networks use LSTMs, and they are the networks explained in the rest of this article.

2. The problem of long-term dependencies

The appeal of RNNs is that they might connect past information to the present task; for example, earlier frames of a video might help us understand the current frame. If RNNs could really do this, they would be extremely useful. But can they? It depends on the situation.

Sometimes we only need recent information to perform the task at hand. For example, a language model tries to predict the next word from the preceding ones. If we see “the clouds are in the …”, we do not need any further context to predict that the next word is “sky”. In such cases, where the gap between the relevant information and the place it is needed is small, RNNs can learn to use the past information.

Fig3. Short-term dependencies

However, there are also cases where more context is needed. Suppose we want to predict the final word of “I grew up in France… (10,000 words omitted here)… I speak ___.” The word we want is French, but to predict it correctly we must use information mentioned a long time before, which is very difficult for an ordinary RNN.

As the gap between the point of prediction and the relevant information grows, RNNs find it increasingly difficult to connect the two.

Fig4. Long-term dependencies

In theory, RNNs could handle such “long-term dependencies” if their parameters were chosen carefully. Unfortunately, in practice RNNs do not seem to learn them. Hochreiter (1991) [German] and Bengio et al. (1994) studied this problem in depth and found fundamental reasons why it is hard.
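
The standard explanation behind these findings (not spelled out in the text above, but worth a one-line sketch) looks at how gradients flow backward through the unrolled chain. With the simple recurrence $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$:

$$ \frac{\partial \mathcal{L}}{\partial h_k} = \frac{\partial \mathcal{L}}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}, \qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\left(1 - h_i^2\right) W_h $$

When the gap $t-k$ is large, this product of many Jacobians tends either to shrink toward zero (vanishing gradients) or to blow up (exploding gradients), so distant inputs end up having almost no usable influence on the loss.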

Fortunately, LSTMs help us solve this problem.

3. The LSTM network

Long Short-Term Memory networks, usually just called “LSTMs”, are a special type of RNN. They were introduced by Hochreiter & Schmidhuber (1997) and have since been refined and popularized by many researchers. LSTMs are now widely used on a large variety of problems and achieve excellent results.

LSTMs are explicitly designed to avoid the long-term dependency problem described above. Remembering information for long periods is essentially their default behavior, not something they struggle to learn.

All recurrent neural networks consist of a chain of repeated modules with identical structure. In a standard RNN this module is very simple, for example a single tanh layer.

Fig5. Internal structure of a standard RNN

LSTMs have a similar chain structure; the only difference is the internals of the repeating module. Instead of a single tanh layer, an LSTM module contains four neural network layers that interact in a particular way.

Fig6. Internal structure of an LSTM

Do not let this structure intimidate you; we will break it down and walk through it step by step. First, let us define the notation used in the diagrams:

Fig7. Notation used in the diagrams

In these diagrams, each line carries a vector from the output of one node to the input of another. The pink circles denote pointwise operations, such as vector addition; the yellow rectangles denote learned neural network layers. Lines that merge denote concatenation of the vectors they carry, and lines that fork denote the vector being copied, with the copies going to different places.

3.1 The core idea of LSTMs

The key to LSTMs is the cell state (the entire green box is one cell): the horizontal line running across the top of the structure diagram.

The cell state is like a conveyor belt: it runs straight through the whole chain with only a few minor linear interactions. It is very easy for information to flow along it unchanged. (Translator’s note: this is what makes long-term memory retention possible.)

Fig8. The cell state as a conveyor belt

The horizontal line by itself cannot add or remove information; the LSTM does that through structures called gates.

Gates are a way of letting information through selectively. Each gate consists of a sigmoid neural layer followed by a pointwise multiplication.

Each element of the sigmoid layer’s output vector is a real number between 0 and 1, describing how much of the corresponding component should be let through: 0 means “let nothing through” and 1 means “let everything through”.
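
As a minimal illustration of this gating mechanism (a toy MATLAB sketch, not from the original article; the values are made up):

% A gate: a sigmoid produces per-element weights in (0,1),
% which then scale a candidate vector element by element.
v      = [2.0; -1.0; 0.5];       % information we may let through (hypothetical values)
z      = [4.0; -4.0; 0.0];       % pre-activation of the gate's sigmoid layer
g      = 1 ./ (1 + exp(-z));     % gate activations, roughly [0.98; 0.02; 0.50]
gatedV = g .* v;                 % pointwise product: mostly keep v(1), mostly drop v(2)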

Each LSTM cell has three such gates to protect and control the cell state. (Translator’s note: the forget gate layer, the input gate layer, and the output gate layer.)

3.2 Understanding the LSTM step by step

3.2.1 Forget gate

First, the LSTM decides which information to discard from the cell state. This decision is made by a sigmoid layer called the forget gate layer. Its inputs are $h_{t-1}$ and $x_t$ (the previous hidden state and the new input), and its output $f_t$ is a vector of values between 0 and 1 with the same length as the cell state $C_{t-1}$, giving the proportion of each component of information to keep: 0 means “discard this completely” and 1 means “keep this completely”.
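
In the usual formulation (the weight matrix $W_f$ and bias $b_f$ are the forget gate layer’s parameters; $\sigma$ is the sigmoid function):

$$ f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) $$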

Returning to the language model mentioned above, where we predict the next word from all the preceding ones: the cell state might contain the gender of the current subject, so that the correct pronouns can be used later. When we start to describe a new subject, we want to forget the gender of the old one.

Fig9. Forget Gates

3.2.2 Input gate

The next step is to decide what new information to store in the cell state. This has two parts: first, a sigmoid layer called the input gate layer decides which components to update; then a tanh layer produces a vector of candidate values $\tilde{C}_t$ that could be added to the state. In the next step we combine the two to update the cell state.
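
In formulas (same notational conventions as for the forget gate; $W_i, b_i$ and $W_C, b_C$ are the parameters of the input gate layer and the tanh layer):

$$ i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) $$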

Fig10. Input gates

In the language model example, this is where we add the gender of the new subject to the cell state, replacing the old information we are about to forget.

With these pieces in place we can update the old cell state $C_{t-1}$ to the new state $C_t$. We multiply the old state by $f_t$, forgetting the components we decided to discard, and then add $i_t * \tilde{C}_t$, the new candidate values scaled by how much we decided to update each component.
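
Written out (with $*$ denoting elementwise multiplication):

$$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$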

Fig11. Updating the cell state

3.2.3 Output gate

Finally, we decide what to output. The output is based on the cell state $C_t$, but it is a filtered version of it. First, a sigmoid layer decides which parts of the cell state will be output. Then the cell state is passed through tanh (squashing its values to between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the chosen parts are emitted as the final output $h_t$.
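
In formulas (with $W_o, b_o$ the output gate layer’s parameters):

$$ o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t * \tanh(C_t) $$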

In the language model example, suppose the model has just seen a subject and may next output a verb. The relevant information, such as whether the subject is singular or plural, must be read out of the cell state so the verb can take the correct form.

Fig12. Cell output

4. An LSTM variant: the GRU

The original article introduces several LSTM variants and what each modification does; they are not all repeated here. Readers who are interested can consult the original article directly.

The best-known variant is the GRU (Gated Recurrent Unit), proposed by Cho, et al. (2014). As shown in Fig.13, a GRU has only two gates: a reset gate and an update gate. The forget gate and input gate are combined into a single update gate, and the cell state and hidden state are merged into one. The resulting model is simpler than the standard LSTM and has become very popular.

Fig13. GRU network structure

Here $r_t$ denotes the reset gate and $z_t$ the update gate. The reset gate determines whether the previous hidden state is forgotten: when $r_t$ approaches 0, the previous hidden state $h_{t-1}$ is discarded and the hidden state is reset from the current input. The update gate determines how much the hidden state is updated toward the new candidate state.
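
For reference, one common way of writing the GRU equations (with $*$ denoting elementwise multiplication; conventions differ on whether $z_t$ or $1-z_t$ multiplies the old state) is:

$$ z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \qquad r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) $$

$$ \tilde{h}_t = \tanh\left(W \cdot [\, r_t * h_{t-1},\ x_t \,]\right), \qquad h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t $$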

Comparison with the LSTM:

  • (1) The GRU has one fewer gate than the LSTM and no separate cell state.
  • (2) In the LSTM, retaining the old state and admitting new information are controlled separately by the forget gate and the input gate; in the GRU, the reset gate controls whether the previous hidden state is retained, and there is no separate gate restricting the incoming current information.
  • (3) In the LSTM, the new cell state is not output directly; it is first filtered by the output gate. Similarly, in the GRU the new candidate hidden state $\tilde{h}_t$ from (2) is not output directly; the update gate controls how much of it enters the final output.
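
5. MATLAB example: LSTM/BiLSTM time-series forecasting

The MATLAB script below loads a single-column time series from an Excel or CSV file, optionally standardizes or normalizes it, builds delayed inputs, splits the data into training and test sets, and uses Bayesian optimization (bayesopt) to tune the LSTM/BiLSTM hyperparameters (number of layers, number of hidden units, initial learning rate, L2 regularization). It relies on MATLAB’s Deep Learning Toolbox and on bayesopt from the Statistics and Machine Learning Toolbox. The listing is an excerpt; the complete code can be downloaded from the link at the end.
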
clc; clear; close all;
%% ---------------------------- init Variables ----------------------------
opt.Delays = 1:30;
opt.dataPreprocessMode  = 'Data Standardization'; % 'None' 'Data Standardization' 'Data Normalization'
opt.learningMethod      = 'LSTM';
opt.trPercentage        = 0.80;                   %  divide data into Test  and Train dataset

% ---- General Deep Learning Parameters(LSTM and CNN General Parameters)
opt.maxEpochs     = 400;                         % maximum number of training epochs in deep learning algorithms.
opt.miniBatchSize = 32;                         % mini-batch size in deep learning algorithms.
opt.executionEnvironment = 'cpu';                % 'cpu' 'gpu' 'auto'
opt.LR                   = 'adam';               % 'sgdm' 'rmsprop' 'adam'
opt.trainingProgress     = 'none';  % 'training-progress' 'none'

% ------------- BILSTM parameters
opt.isUseBiLSTMLayer  = true;                     % if true, use Bidirectional-LSTM layers; if false, use simple LSTM layers
opt.isUseDropoutLayer = true;                    % dropout layers help avoid overfitting
opt.DropoutValue      = 0.5;

% ------------ Optimization Parameters
opt.optimVars = [
    optimizableVariable('NumOfLayer',[1 4],'Type','integer')
    optimizableVariable('NumOfUnits',[50 200],'Type','integer')
    optimizableVariable('isUseBiLSTMLayer',[1 2],'Type','integer')
    optimizableVariable('InitialLearnRate',[1e-2 1],'Transform','log')
    optimizableVariable('L2Regularization',[1e-10 1e-2],'Transform','log')];

opt.isUseOptimizer         = true;

opt.MaxOptimizationTime    = 14*60*60;
opt.MaxItrationNumber      = 60;
opt.isDispOptimizationLog  = true;

opt.isSaveOptimizedValue       = false;        %  save all of Optimization output on mat files 
opt.isSaveBestOptimizedValue   = true;         %  save Best Optimization output on a mat file  


%% --------------- load Data
data = loadData(opt);
if ~data.isDataRead
    return;
end

%% --------------- Prepare Data
[opt,data] = PrepareData(opt,data);

%% --------------- Find Best LSTM Parameters with Bayesian Optimization
[opt,data] = OptimizeLSTM(opt,data);

%% --------------- Evaluate Data
[opt,data] = EvaluationData(opt,data);



%% ---------------------------- Local Functions ---------------------------
function data = loadData(opt)
[chosenfile,chosendirectory] = uigetfile({'*.xlsx';'*.csv'},...
    'Select Excel time series Data sets','data.xlsx');
filePath = [chosendirectory chosenfile];
if filePath ~= 0
    data.DataFileName = chosenfile;
    data.CompleteData = readtable(filePath);
    if size(data.CompleteData,2)>1
        warning('Input data should be an excel file with only one column!');
        disp('Operation Failed... '); pause(.9);
        disp('Reloading data. ');     pause(.9);
        data.x = [];
        data.isDataRead = false;
        return;
    end
    data.seriesdataHeder = data.CompleteData.Properties.VariableNames(1,:);
    data.seriesdata = table2array(data.CompleteData(:,:));
    disp('Input data successfully read.');
    data.isDataRead = true;
    data.seriesdata = PreInput(data.seriesdata);
    
    figure('Name','InputData','NumberTitle','off');
    plot(data.seriesdata); grid minor;
    title({['Mean = ' num2str(mean(data.seriesdata)) ', STD = ' num2str(std(data.seriesdata)) ];});
    if strcmpi(opt.dataPreprocessMode,'None')
        data.x = data.seriesdata;
    elseif strcmpi(opt.dataPreprocessMode,'Data Normalization')
        data.x = DataNormalization(data.seriesdata);
        figure('Name','NormalizedInputData','NumberTitle','off');
        plot(data.x); grid minor;
        title({['Mean = ' num2str(mean(data.x)) ', STD = ' num2str(std(data.x)) ];});
    elseif strcmpi(opt.dataPreprocessMode,'Data Standardization')
        data.x = DataStandardization(data.seriesdata);
        figure('Name','NormalizedInputData','NumberTitle','off');
        plot(data.x); grid minor;
        title({['Mean = ' num2str(mean(data.x)) ', STD = ' num2str(std(data.x)) ];});
    end
    
else
    warning(['In order to train network, please load data.' ...
        'Input data should be an excel file with only one column!']);
    disp('Operation Cancel.');
    data.isDataRead = false;
end
end
function data = PreInput(data)
if iscell(data)
    for i=1:size(data,1)
        for j=1:size(data,2)
            if strcmpi(data{i,j},'#NULL!')
                tempVars(i,j) = NaN; %#ok
            else
                tempVars(i,j) = str2num(data{i,j});   %#ok
            end
        end
    end
    data = tempVars;
end
end
function vars = DataStandardization(data)
for i=1:size(data,2)
    x.mu(1,i)   = mean(data(:,i),'omitnan');
    x.sig(1,i)  = std (data(:,i),'omitnan');
    vars(:,i) = (data(:,i) - x.mu(1,i))./ x.sig(1,i);
end
end
function vars = DataNormalization(data)
for i=1:size(data,2)
    vars(:,i) = (data(:,i) -min(data(:,i)))./ (max(data(:,i))-min(data(:,i)));
end
end
% --------------- data preparation for LSTM ---
function [opt,data] = PrepareData(opt,data)
% prepare delays for time serie network
data = CreateTimeSeriesData(opt,data);

% divide data into test and train data
data = dataPartitioning(opt,data);

% LSTM data form
data = LSTMInput(data);
end

% ----Run Bayesian Optimization Hyperparameters for LSTM Network Parameters
function [opt,data] = OptimizeLSTM(opt,data)
if opt.isDispOptimizationLog
    isLog = 2;
else
    isLog = 0;
end
if opt.isUseOptimizer
    opt.ObjFcn  = ObjFcn(opt,data);
    BayesObject = bayesopt(opt.ObjFcn,opt.optimVars, ...
        'MaxTime',opt.MaxOptimizationTime, ...
        'IsObjectiveDeterministic',false, ...
        'MaxObjectiveEvaluations',opt.MaxItrationNumber,...
        'Verbose',isLog,...
        'UseParallel',false);
end
end

% ---------------- objective function
function ObjFcn = ObjFcn(opt,data)
ObjFcn = @CostFunction;

function [valError,cons,fileName] = CostFunction(optVars)
inputSize    = size(data.X,1);
outputMode   = 'last';
numResponses = 1;
dropoutVal   = .5;

if optVars.isUseBiLSTMLayer == 2
    optVars.isUseBiLSTMLayer = 0;
end

if opt.isUseDropoutLayer % if dropout layer is true
    if optVars.NumOfLayer ==1
        if optVars.isUseBiLSTMLayer
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                bilstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        else
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                lstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        end
    elseif optVars.NumOfLayer==2
        if optVars.isUseBiLSTMLayer
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                bilstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                bilstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        else
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                lstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                lstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        end
    elseif optVars.NumOfLayer ==3
        if optVars.isUseBiLSTMLayer
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                bilstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                bilstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                bilstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        else
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                lstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                lstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                lstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        end
    elseif optVars.NumOfLayer==4
        if optVars.isUseBiLSTMLayer
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                bilstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                bilstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                bilstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                bilstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        else
            opt.layers = [ ...
                sequenceInputLayer(inputSize)
                lstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                lstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                lstmLayer(optVars.NumOfUnits,'OutputMode','sequence')
                dropoutLayer(dropoutVal)
                lstmLayer(optVars.NumOfUnits,'OutputMode',outputMode)
                dropoutLayer(dropoutVal)
                fullyConnectedLayer(numResponses)
                regressionLayer];
        end

The complete code can be downloaded from www.cnblogs.com/ttmatlab/p/…