
Deep Learning with Python

This article is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by Francois Chollet). This post was converted from a Jupyter notebook to Markdown; the original .ipynb notebooks are available on GitHub or Gitee.

You can read the original book online (in English) at this website. The book’s author also provides the accompanying Jupyter notebooks.

This post contains my notes on Chapter 6, Deep Learning for Text and Sequences.

6.3 Advanced usage of recurrent neural networks


This time, we will work through an example to understand several techniques for recurrent neural networks: recurrent dropout, stacked recurrent layers, and bidirectional recurrent layers.

Temperature prediction problem

We will use a weather time series data set recorded by a weather station (the Jena data set). It contains 14 measurements (such as temperature, pressure, humidity, and wind direction) recorded every 10 minutes. The records span many years; here we only use the data from 2009 to 2016. We use this data set to build a model whose goal is to take some recent data (a few days' worth of data points) as input and predict the temperature 24 hours in the future.

First, download and unpack the data set:

cd ~/Somewhere
mkdir jena_climate
cd jena_climate
wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
unzip jena_climate_2009_2016.csv.zip

Take a look at the data:

import os

data_dir = "/CDFMLR/Files/dataset/jena_climate"
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

with open(fname) as f:
    data = f.read()

lines = data.split('\n')
header = lines[0].split(',')  # Fields in the CSV file are separated by a bare comma
lines = lines[1:]

print(header)
print(len(lines))
print(lines[0])

['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']

420551

01.01.2009 00:10:00,996.52,-8.02,265.40,-8.90,93.30,3.33,3.11,0.22,1.94,3.12,1307.75,1.03,1.75,152.30

Next, put the data into a NumPy array:

import numpy as np

float_data = np.zeros((len(lines), len(header)-1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]  # Skip the timestamp column
    float_data[i, :] = values
    
print(float_data.shape)

The shape is (420551, 14).
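As an alternative to splitting lines by hand, the standard-library csv module can do the parsing. This is only a sketch on two made-up lines in the same layout as the Jena file, not what the original notebook does; the string `raw` is an assumption for illustration.

```python
import csv
import io

# Two lines in the same layout as the Jena file (shortened, for illustration)
raw = '"Date Time","p (mbar)","T (degC)"\n01.01.2009 00:10:00,996.52,-8.02\n'

reader = csv.reader(io.StringIO(raw))
header = next(reader)                 # csv strips the surrounding quotes
rows = [[float(x) for x in row[1:]]   # skip the timestamp column
        for row in reader]

print(header)
print(rows)
```

With the real file, `io.StringIO(raw)` would be replaced by the open file object.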

If we plot the changes in temperature, the periodicity is obvious:

from matplotlib import pyplot as plt

temp = float_data[:, 1]
plt.plot(range(len(temp)), temp)

Let’s take a look at the data for the first 10 days (one record every 10 minutes, so 144 records per day):

plt.plot(range(1440), temp[:1440])

This graph shows that it is winter, and the daily temperature cycle is visible (it is more obvious over the last few days).

The next step is to build predictive models. First, let’s state the problem precisely: given data going back lookback timesteps (10 minutes each) and sampled every step timesteps, predict the temperature delay timesteps in the future:

  • lookback = 720: observations go back 5 days (note that the code below actually sets lookback = 1440, that is, 10 days)
  • step = 6: observations are sampled at one data point per hour
  • delay = 144: the target is 24 hours in the future
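To make the three parameters concrete, the sketch below converts them from timesteps into real durations. The helper name `describe_window` and its `minutes_per_step` argument are hypothetical, introduced only for illustration; the 10-minute interval comes from the data set description above.

```python
def describe_window(lookback, step, delay, minutes_per_step=10):
    """Translate timestep counts into durations (hypothetical helper)."""
    minutes_per_day = 24 * 60
    return {
        # How far back the observations reach, in days
        "lookback_days": lookback * minutes_per_step / minutes_per_day,
        # Number of points left in each sample after taking every `step`-th one
        "points_per_sample": lookback // step,
        # Interval between two sampled points, in minutes
        "sampling_interval_minutes": step * minutes_per_step,
        # How far ahead the target lies, in hours
        "delay_hours": delay * minutes_per_step / 60,
    }


print(describe_window(lookback=720, step=6, delay=144))
```

Calling the same helper with lookback=1440 reports a 10-day window and 240 points per sample.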

Data preparation

  1. Standardize the data so that all features are on a similar scale
# Data standardization

mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
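The mean and standard deviation are computed on the first 200,000 timesteps only, which is the part later used for training, and the same values are then applied to the whole array. A minimal sketch of that idea on made-up data (the array `toy` and the split point `n_train` are assumptions for illustration):

```python
import numpy as np

# Made-up data: 10 timesteps, 2 features on very different scales
toy = np.column_stack([np.arange(10.0), 1000.0 * np.arange(10.0)])
n_train = 6  # hypothetical split point: the first 6 rows are "training" data

# Compute the statistics on the training rows only...
mean = toy[:n_train].mean(axis=0)
std = toy[:n_train].std(axis=0)

# ...and apply them to every row, including the held-out ones
scaled = (toy - mean) / std

print(scaled[:n_train].mean(axis=0))  # approximately 0 for each feature
print(scaled[:n_train].std(axis=0))   # approximately 1 for each feature
```

After scaling, the two columns become identical even though they originally differed by a factor of 1000, which is the point of standardization.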
  2. Wrap the data in a generator that yields (samples, targets) tuples, where samples is a batch of input data and targets is the corresponding array of target temperatures.
# Generator that generates time series samples and their targets

def generator(data,                  # Source data array
              lookback,              # How many past timesteps the input data covers
              delay,                 # How many timesteps in the future the target is
              min_index, max_index,  # Delimit which part of data to draw from
              shuffle=False,         # Shuffle the samples or draw them in order
              batch_size=128,        # Number of samples per batch
              step=6):               # Sampling period, in timesteps
    
    if max_index is None:
        max_index = len(data) - delay - 1
    
    i = min_index + lookback
    while True:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
            
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows), ))
        
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        
        yield samples, targets
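To see what the generator builds for a single row, the sketch below reproduces just its index arithmetic with small made-up numbers. The helper `sample_indices` is hypothetical and not part of the original code.

```python
def sample_indices(row, lookback, delay, step):
    """Return (input indices, target index) for one sample (hypothetical helper)."""
    # Inputs: every `step`-th timestep in the window [row - lookback, row)
    input_indices = list(range(row - lookback, row, step))
    # Target: the timestep `delay` steps after `row`
    target_index = row + delay
    return input_indices, target_index


inputs, target = sample_indices(row=12, lookback=6, delay=3, step=2)
print(inputs, target)  # [6, 8, 10] 15
```

In this example the last input index is 10 rather than 12, because the window excludes `row` itself and advances in strides of `step`.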

Call this generator to instantiate the training set generator, validation set generator, and test set generator:

# Prepare the training, validation, and test generators

lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, 
                      lookback=lookback, 
                      delay=delay, 
                      min_index=0, 
                      max_index=200000, 
                      shuffle=True, 
                      step=step, 
                      batch_size=batch_size)

val_gen = generator(float_data, 
                    lookback=lookback, 
                    delay=delay, 
                    min_index=200001, 
                    max_index=300000, 
                    step=step, 
                    batch_size=batch_size)

test_gen = generator(float_data, 
                     lookback=lookback, 
                     delay=delay, 
                     min_index=300001, 
                     max_index=None, 
                     step=step, 
                     batch_size=batch_size)

# How many times to extract for validation and testing:
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size
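Because the generators loop forever, the number of batches that make up one pass over the validation or test data has to be computed by hand. Plugging in the numbers used above (with `n_rows = 420551` taken from the shape printed earlier) gives:

```python
n_rows = 420551   # number of timesteps in the data set, from float_data.shape
lookback = 1440
batch_size = 128

# Timesteps in each split, minus the first `lookback` ones that cannot
# serve as the end of a full window, divided into whole batches
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (n_rows - 300001 - lookback) // batch_size

print(val_steps, test_steps)  # 769 930
```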

A common-sense, non-machine-learning baseline

We assume that the temperature time series is continuous and that the temperature varies periodically from day to day. In that case, a bold but reasonable guess is that the temperature 24 hours from now will equal the temperature right now.

We use this common-sense, non-machine-learning approach as a baseline and evaluate it with the mean absolute error (MAE):

mae = np.mean(np.abs(preds - targets))

Any machine learning model we build afterwards has to beat this baseline to show that machine learning is worthwhile here.
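As a sanity check on this reasoning, consider an idealized series that repeats exactly every day. The sketch below builds such a series (a pure sine wave, which is an assumption; real temperatures are much noisier) and confirms that guessing "the value 24 hours from now equals the value now" gives an MAE of essentially zero. The function name `mean_absolute_error` is introduced here for illustration.

```python
import math

STEPS_PER_DAY = 144  # one reading every 10 minutes


def mean_absolute_error(preds, targets):
    """Average of |prediction - target| over all pairs."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


# Idealized temperature: a sine wave with a period of exactly one day
series = [math.sin(2 * math.pi * t / STEPS_PER_DAY)
          for t in range(10 * STEPS_PER_DAY)]

delay = STEPS_PER_DAY      # predict 24 hours ahead
preds = series[:-delay]    # naive guess: the current value
targets = series[delay:]   # actual value 24 hours later

print(mean_absolute_error(preds, targets))  # tiny, floating-point noise only
```

On real data the daily cycle is only approximate, so the baseline error is far from zero, but it can still be a hard number to beat.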

# Compute the MAE of the common-sense baseline

def evaluate_naive_method():
    batch_maes = []
    for _ in range(val_steps):
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1]  # Last observed temperature in each sample
        mae = np.mean(np.abs(preds - targets))
        batch_maes.append(mae)

    return np.mean(batch_maes)

mae = evaluate_naive_method()
celsius_mae = mae * std[1]  # Convert back from standardized units to °C
print(f'mae={mae}, the mean absolute error of temperature = {celsius_mae}°C')

The result is:

Mae = 0.2897359729905486

The mean absolute error of temperature is 2.564887434980494°C

This error is still quite large, so the next goal is to use deep learning methods to beat this benchmark.

A basic machine learning approach

Before using a complex, computationally expensive network such as an RNN, it’s a good idea to see whether a simple model can solve the problem.

So here is a simple, fully connected network applied to the temperature forecasting task:

# plot_acc_and_loss: useful function for plotting training history

import matplotlib.pyplot as plt

def plot_acc_and_loss(history):

    epochs = range(len(history.history['loss']))

    try:
        acc = history.history['acc']
        val_acc = history.history['val_acc']
        
        plt.plot(epochs, acc, 'bo-', label='Training acc')
        plt.plot(epochs, val_acc, 'rs-', label='Validation acc')
        plt.title('Training and validation accuracy')
        plt.legend()
    except KeyError:  # No accuracy metric was recorded for this model
        print('No acc. Skip')
    finally:
        plt.figure()

    loss = history.history['loss']
    val_loss = history.history['val_loss']

    plt.plot(epochs, loss, 'bo-', label='Training loss')
    plt.plot(epochs, val_loss, 'rs-', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

    plt.show()

With this helper defined, build and train the fully connected model:

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, 
                              steps_per_epoch=500, 
                              epochs=20, 
                              validation_data=val_gen, 
                              validation_steps=val_steps)

plot_acc_and_loss(history)

Loss curve during training:

Although some of these results are better than the non-machine-learning baseline, they are not reliably so. In fact, the common-sense baseline is not easy to beat: common sense contains useful information that is hard for a machine to learn. In general, when a problem has a simple and effective solution, it is hard for a model to find that simple solution through training and then improve on it, unless we hard-code the model to look for it.

A first recurrent baseline

The fully connected network starts by flattening the time series with a Flatten layer, so the model effectively ignores the notion of time. To take advantage of the temporal order, consider using a recurrent network. This time, we’ll use a GRU layer instead of an LSTM:

# Train and evaluate a GRU-based model

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, 
                              steps_per_epoch=500, 
                              epochs=20, 
                              validation_data=val_gen, 
                              validation_steps=val_steps)

plot_acc_and_loss(history)

Training process:

Before overfitting sets in, the temperature error of the best result is:

print(0.2624 * std[1], '°C')

2.3228957591704926 °C

That is better than the common-sense baseline we started with. But as we can see, the model overfits later in training. In RNNs, we can use recurrent dropout to fight overfitting.

Recurrent dropout

We have used dropout in feedforward networks, where it randomly zeroes out the input units of a layer. In RNNs it’s not that simple: applying ordinary dropout before a recurrent layer hinders learning rather than helping, so dropout has to be applied inside the recurrent layer.

When using dropout in a recurrent layer, the same dropout mask must be applied at every timestep; the mask must not vary from timestep to timestep. In addition, for recurrent layers such as LSTM and GRU, a temporally constant “recurrent dropout mask” should be applied to the inner recurrent activations of the layer. Both kinds of dropout are built into Keras recurrent layers: you specify their rates with the dropout and recurrent_dropout arguments.
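To illustrate what "the same mask at every timestep" means, here is a NumPy sketch. It is not how Keras implements dropout internally; the function `time_constant_dropout` and its arguments are hypothetical, and it only shows the masking idea for the layer inputs.

```python
import numpy as np


def time_constant_dropout(inputs, rate, rng):
    """Apply one dropout mask per sample, shared by all timesteps.

    inputs has shape (batch, timesteps, features). Hypothetical helper
    for illustration only.
    """
    batch, _, features = inputs.shape
    # One keep/drop decision per (sample, feature); the middle axis of
    # size 1 broadcasts the same mask across every timestep
    keep = rng.random((batch, 1, features)) >= rate
    # Scale the kept units so the expected activation is unchanged
    return inputs * keep / (1.0 - rate)


rng = np.random.default_rng(0)
x = np.ones((2, 5, 4))
dropped = time_constant_dropout(x, rate=0.5, rng=rng)

# Every timestep of a given sample has exactly the same zero pattern
print((dropped == dropped[:, :1, :]).all())  # True
```

A mask drawn with shape (batch, timesteps, features) instead would change from one timestep to the next, which is the behavior the text warns against.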

# Train and evaluate a dropout-regularized GRU-based model

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32, 
                     dropout=0.4, 
                     recurrent_dropout=0.4, 
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, 
                              steps_per_epoch=500, 
                              epochs=40,  # Networks using dropout take longer to converge
                              validation_data=val_gen, 
                              validation_steps=val_steps)

plot_acc_and_loss(history)

Training results:

This works better than the result reported in the book; I don’t know why.

Stacking recurrent layers

The overfitting problem is now under control, so the next step is to improve accuracy further. So far we have used a single recurrent layer; we can add a few more and stack them to increase the capacity of the network. In practice the stack does not need to be very deep: even Google Translate, according to the book, uses a stack of only seven very large LSTM layers.

To stack recurrent layers in Keras, remember that every intermediate layer must return its full sequence of outputs (a 3D tensor) rather than only the output at the last timestep, which is the default behavior:

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.1,
                     recurrent_dropout=0.5,
                     return_sequences=True,  # Return the full output sequence
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64, activation='relu',
                     dropout=0.1, 
                     recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)

plot_acc_and_loss(history)
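To make the return_sequences requirement concrete, the following NumPy sketch implements a tiny tanh RNN and compares the two output modes. It is a hypothetical stand-in with random weights, not a Keras layer; the name `toy_rnn` and its `seed` argument are assumptions for illustration.

```python
import numpy as np


def toy_rnn(inputs, units, return_sequences=False, seed=0):
    """Minimal tanh RNN over one sequence of shape (timesteps, features)."""
    rng = np.random.default_rng(seed)
    timesteps, features = inputs.shape
    W = rng.normal(size=(features, units))  # input weights
    U = rng.normal(size=(units, units))     # recurrent weights
    state = np.zeros(units)
    outputs = []
    for t in range(timesteps):
        state = np.tanh(inputs[t] @ W + state @ U)
        outputs.append(state)
    # Full sequence: one output per timestep; otherwise only the last one
    return np.stack(outputs) if return_sequences else outputs[-1]


sequence = np.ones((240, 14))  # 240 timesteps, 14 features, as in this post

full = toy_rnn(sequence, units=32, return_sequences=True)
last = toy_rnn(sequence, units=32, return_sequences=False)
print(full.shape, last.shape)  # (240, 32) (32,)

# A second recurrent layer needs a sequence as input, so it can consume `full`
stacked = toy_rnn(full, units=64, seed=1)
print(stacked.shape)  # (64,)
```

Passing `last` to a second recurrent layer would fail, because a single vector has no time axis left to iterate over.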

Loss curve:

This is also a little different from the book. But as you can see, stacking recurrent layers doesn’t improve performance much.

Bidirectional RNNs

A bidirectional RNN is a variant of the RNN that can sometimes outperform a regular RNN, especially in natural language processing. Bidirectional RNNs have been called the Swiss Army knife of deep learning for NLP.

An RNN depends on the temporal (or other) order of the sequence: shuffle or reverse the timesteps, and the representations the RNN extracts from the sequence change completely. A bidirectional RNN exploits this order sensitivity. It consists of two ordinary RNNs that process the input sequence in chronological and in reverse order, respectively, and then merges their learned representations, so that it can capture patterns a unidirectional RNN would miss.

So far we have trained in chronological order by default; now let’s try reverse order. To do that, we only need to change the last line of the generator to yield samples[:, ::-1, :], targets:

def reverse_order_generator(data, lookback, delay, min_index, max_index,
                            shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                           lookback // step,
                           data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples[:, ::-1, :], targets
        
train_gen_reverse = reverse_order_generator(
    float_data,
    lookback=lookback,
    delay=delay,
    min_index=0,
    max_index=200000,
    shuffle=True,
    step=step, 
    batch_size=batch_size)

val_gen_reverse = reverse_order_generator(
    float_data,
    lookback=lookback,
    delay=delay,
    min_index=200001,
    max_index=300000,
    step=step,
    batch_size=batch_size)

Then train the same single-layer GRU model on the reversed sequences:

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen_reverse,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen_reverse,
                              validation_steps=val_steps)
plot_acc_and_loss(history)

It doesn’t work very well. For temperature prediction, the most recent data is naturally the most useful, while information from long ago matters little. A recurrent layer gradually loses older information as it moves through the timesteps, so for this problem chronological order works better than reverse order.

For text, however, the importance of a word for understanding a sentence usually does not depend on its position in the sentence. That is, although word order matters for understanding language, it matters much less whether the sequence is read forward or backward. So for some text problems, chronological and reverse order may give very similar results:

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words (among top max_features most common words)
maxlen = 500

# Load data
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Reverse sequences
x_train = [x[::-1] for x in x_train]
x_test = [x[::-1] for x in x_test]

# Pad sequences
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(layers.Embedding(max_features, 128))
model.add(layers.LSTM(32))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
plot_acc_and_loss(history)

Loss and accuracy curve during training:

On IMDB, training on reversed sequences gives almost the same results as training in the original order.

Combining the forward and reverse views of the data lets each direction pick up what the other misses, which may improve the performance of the model. That is what a bidirectional RNN does.
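The idea can be sketched in a few lines of NumPy: run one toy RNN over the sequence in chronological order, run a second one (with its own weights) over the reversed sequence, and concatenate the two final states. The functions `toy_rnn_last_state` and `toy_bidirectional` are hypothetical, use random untrained weights, and are not the Keras implementation.

```python
import numpy as np


def toy_rnn_last_state(inputs, W, U):
    """Run a minimal tanh RNN and return its final state."""
    state = np.zeros(U.shape[0])
    for x_t in inputs:
        state = np.tanh(x_t @ W + state @ U)
    return state


def toy_bidirectional(inputs, units, seed=0):
    """Process a (timesteps, features) sequence in both directions."""
    rng = np.random.default_rng(seed)
    features = inputs.shape[1]
    # Each direction gets its own, independently initialized weights
    W_fwd, W_bwd = rng.normal(size=(2, features, units))
    U_fwd, U_bwd = rng.normal(size=(2, units, units))
    forward = toy_rnn_last_state(inputs, W_fwd, U_fwd)
    backward = toy_rnn_last_state(inputs[::-1], W_bwd, U_bwd)  # reversed time
    # Merge the two representations by concatenation
    return np.concatenate([forward, backward])


sequence = np.arange(12.0).reshape(6, 2) / 12.0  # 6 timesteps, 2 features
merged = toy_bidirectional(sequence, units=4)
print(merged.shape)  # (8,)
```

Because the two directions are concatenated, the merged representation has twice as many units as a single direction, so the layer that follows sees 8 values here instead of 4.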

In Keras, the Bidirectional layer is used to implement a bidirectional RNN:

# Train and evaluate a bidirectional LSTM on IMDB

model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

plot_acc_and_loss(history)

The result is:


Next, we try to apply the method of bidirectional RNN to the temperature prediction task.

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Bidirectional(
    layers.GRU(32), input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)
plot_acc_and_loss(history)

Going even further

From here, we can try the following to further improve the model:

  • Increase the number of units in each recurrent layer
  • Adjust the learning rate of the RMSprop optimizer
  • Try replacing the GRU layers with LSTM layers
  • Use a bigger densely connected regressor on top of (after) the recurrent layers: a larger Dense layer, or a stack of Dense layers
  • Finally, run the best-performing models on the test set; otherwise you risk ending up with an architecture that overfits the validation set

Finally, a word of caution: don’t use this temperature prediction method to predict stock prices. In markets, past performance is a poor predictor of future returns: looking in the rear-view mirror is a bad way to drive.