Deep Learning with Python

This article is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by Francois Chollet). The articles are Jupyter notebooks converted to Markdown; I will release all of the notebooks on GitHub once the whole series is finished.

You can read the original (English) text of the book online here: livebook.manning.com/book/deep-l…

The author of this book also gives a set of Jupyter notebooks: github.com/fchollet/de…


Chapter 3. Getting Started with Neural Networks

Classifying movie reviews: a binary classification problem

The original link

IMDB data set

The IMDB dataset contains 50,000 movie reviews, split evenly into a training set and a test set of 25,000 reviews each. Each split is 50 percent positive and 50 percent negative.

Keras ships with a preprocessed version of IMDB in which each review has already been converted from a sequence of words into a sequence of integers (each integer stands for a word in a dictionary):

from tensorflow.keras.datasets import imdb

# data set
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=10000)

num_words=10000 keeps only the 10,000 most frequently occurring words; rarer words are discarded.
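
As a quick sanity check, no word index in the loaded sequences should then exceed 9999:

# the largest word index across all reviews
max([max(sequence) for sequence in train_data])
# => 9999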

Let's pick a review and decode it back into text; this one happens to be a positive review:

# dictionary mapping integer indices back to words
index_word = {v: k for k, v in imdb.get_word_index().items()}

# restore a review and take a look
# (note: load_data reserves indices 0-2, so this simple decode is offset by 3,
#  which is why the text reads oddly; the Reuters example below handles the offset)
text = ' '.join([index_word[i] for i in train_data[0]])

print(f"{train_labels[0]}:", text)

Output:

1: the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn’t one will very to as itself with other and in of seen over landed for anyone of and br show’s to whether from than out themselves history he name half some br of and odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit then have two of script their with her nobody most that with wasn’t to with armed acting watch an for with heartfelt film want an

Data preparation

Let’s take a look at train_data’s current shape:

train_data.shape

Output:

(25000,)

We need to turn it into a tensor of shape (samples, word_indices), something like this:

[[0, 0, ..., 1, ..., 0, ..., 1],
 [0, 1, ..., 0, ..., 1, ..., 0],
 ...]

If a word appears in the review, its position is 1; otherwise it is 0.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # multi-hot encode: one row per sequence, with a 1 at every word index that appears
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

x_train

Output:

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

The labels need a similar treatment. Right now they look like this:

train_labels

Output:

array([1, 0, 0, ..., 0, 1, 0])

Convert them to float arrays:

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

y_train

Output:

array([1., 0., 0., ..., 0., 1., 0.], dtype=float32)

Now the data can be safely fed into the neural network we're about to build.

Building the network

For problems where the inputs are vectors and the labels are scalars (here 0 or 1), a stack of Dense layers with relu activation works well:

Dense(16, activation='relu')

Each such layer computes output = relu(dot(W, input) + b).

16 is the number of hidden units in the layer. A hidden unit is one dimension of the layer's representation space. W has shape (input_dimension, 16), so the dot product yields a 16-dimensional vector: the data is projected into a 16-dimensional representation space.

This dimension (the number of hidden units) can be seen as controlling how much freedom the network has when learning representations. More hidden units let it learn more complex representations, but they are computationally more expensive, and the network may pick up unimportant patterns that lead to overfitting.
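
For intuition, here is a rough numpy sketch of what a single Dense(16, activation='relu') layer computes; W and b here are random stand-ins rather than trained weights:

import numpy as np

W = np.random.randn(10000, 16)                     # shape (input_dimension, 16)
b = np.zeros(16)
x = np.random.randint(0, 2, 10000).astype(float)   # one multi-hot encoded review

hidden = np.maximum(0., x @ W + b)                 # relu(dot(x, W) + b)
hidden.shape                                       # (16,): the 16-dimensional representation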

Here we'll use two layers of 16 hidden units each, followed by a sigmoid-activated layer that outputs a value in [0, 1]: the predicted probability that the label is 1, i.e. that the review is positive.

relu zeroes out negative values (negative inputs become 0), while sigmoid squashes any value into [0, 1]:
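
Written out as plain numpy functions, just as a sketch for intuition (in Keras you simply pass activation='relu' or activation='sigmoid'):

import numpy as np

def relu(x):
    # negative values become 0, positive values pass through unchanged
    return np.maximum(0., x)

def sigmoid(x):
    # squashes any real number into the interval (0, 1)
    return 1. / (1. + np.exp(-x))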

Implement this network in Keras:

from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Activation functions

We’ve used the relu activation function before in MNIST, so what exactly does the activation function do?

A Dense layer without an activation function is just a linear transformation:

output = dot(W, input) + b

If every layer were only such a linear transformation, stacking multiple layers would still amount to a single linear transformation: the hypothesis space would not grow, so there would be a hard limit on what the network could learn.

With relu added, the layer computes output = relu(dot(W, input) + b). An activation function like this expands the representation space the network can explore, letting it learn more complex "knowledge".
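
Here is a tiny numpy demonstration of that point (my own sketch, not from the book): two stacked linear layers collapse into one linear map, while putting relu in between breaks that equivalence:

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(4, 2))
x = rng.normal(size=8)

stacked = (x @ W1) @ W2                    # two linear layers, no activation
collapsed = x @ (W1 @ W2)                  # a single equivalent linear layer
print(np.allclose(stacked, collapsed))     # True: stacking gained nothing

with_relu = np.maximum(0., x @ W1) @ W2    # relu between the two layers
print(np.allclose(with_relu, collapsed))   # False (in general)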

Compiling the model

When compiling the model, we also need to select the loss function, optimizer, and metrics.

For this binary classification problem, where the final output is a probability between 0 and 1, the loss function to use is binary_crossentropy (a fitting name).

Crossentropy comes from information theory; it measures the distance between probability distributions. Models that output probabilities are therefore usually trained with a crossentropy loss.
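
For reference, binary crossentropy for a single prediction can be written out by hand like this (just a sketch for intuition; Keras computes it for us, averaged over the batch):

import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    # y_true is 0 or 1; p_pred is the predicted probability of class 1
    p = np.clip(p_pred, eps, 1 - eps)    # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_crossentropy(1, 0.9))   # small loss: confident and correct
print(binary_crossentropy(1, 0.1))   # large loss: confident and wrong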

As for the optimizer, just like with MNIST we use rmsprop (the book doesn't explain why), and the metric is accuracy:

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Because these optimizers, losses, and metrics are so commonly used, Keras lets you pass them as built-in strings. But you can also pass class instances to customize their parameters:

from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import metrics

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

Training the model

To monitor the model's accuracy on data it has never seen during training, we set aside 10,000 samples from the original training data as a validation set:

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

Now train for 20 epochs (an epoch is one full pass over the training data) with mini-batches of 512 samples, while monitoring loss and accuracy on the 10,000 held-out validation samples:

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
15000/15000 [==============================] - 3s 205us/sample - loss: 0.5340 - acc: 0.7867 - val_loss: 0.4386 - val_acc: 0.8340
......
Epoch 20/20
15000/15000 [==============================] - 1s 74us/sample - loss: 0.0053 - acc: 0.9995 - val_loss: 0.7030 - val_acc: 0.8675

fit returns a History object whose history attribute records the metrics of every epoch during training:

history_dict = history.history
history_dict.keys()

Output:

dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])

Let's plot these:

# Draw training and validation losses

import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'ro-', label='Training loss')
plt.plot(epochs, val_loss_values, 'bs-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

# Draw training and verification accuracy

plt.clf()

acc = history_dict['acc']
val_acc = history_dict['val_acc']

plt.plot(epochs, acc, 'ro-', label='Training acc')
plt.plot(epochs, val_acc, 'bs-', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

As we can see, training accuracy keeps rising (and training loss keeps falling), but on the validation set the loss starts to climb again after a while; the best point is around the fourth epoch.

That is overfitting, and it starts as early as the second epoch. So we really only need to train for three or four epochs; train any longer and the model just "memorizes" the training set and generalizes poorly to data it has never seen.
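
Instead of eyeballing the plot, the best epoch can also be read directly off the history (a small sketch, assuming the history object returned by the fit above):

import numpy as np

# epoch (1-indexed, as in the plots) with the lowest validation loss
best_epoch = int(np.argmin(history.history['val_loss'])) + 1
print(best_epoch, min(history.history['val_loss']))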

So let's retrain a model from scratch (rebuild the network first; otherwise fit would continue from the weights we just trained) and then evaluate it on the test set:

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
result = model.evaluate(x_test, y_test, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 2s 69us/sample - loss: 0.4829 - accuracy: 0.8179
Epoch 2/4
25000/25000 [==============================] - 1s 42us/sample - loss: 0.2827 - accuracy: 0.9054
Epoch 3/4
25000/25000 [==============================] - 1s 42us/sample - loss: 0.2109 - accuracy: 0.9253
Epoch 4/4
25000/25000 [==============================] - 1s 43us/sample - loss: 0.1750 - accuracy: 0.9380
25000/1 - 3s - loss: 0.2819 - accuracy: 0.8836

Let’s look at the output:

print(result)
[0.2923990402317047, 0.8836]

Now that training is done, of course we want to try the model out. Let's run predictions on the test set and print them:

model.predict(x_test)

Output:

array([[0.17157233],
       [0.99989915],
       [0.79564804],
       ...,
       [0.11750051],
       [0.05890778],
       [0.5040823 ]], dtype=float32)
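
To turn these probabilities into hard 0/1 labels you can threshold them at 0.5 (my own sketch, not from the book); the resulting accuracy should roughly match what evaluate() reported:

pred_labels = (model.predict(x_test) > 0.5).astype('float32').ravel()
print((pred_labels == y_test).mean())   # roughly 0.88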

Further experiments

  1. Try to use only one layer
model = models.Sequential()
# model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
# model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid', input_shape=(10000, )))

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
result = model.evaluate(x_test, y_test, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print(result)

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 3s 116us/sample - loss: 0.5865 - accuracy: 0.7814
Epoch 2/4
25000/25000 [==============================] - 1s 31us/sample - loss: 0.4669 - accuracy: 0.8608
Epoch 3/4
25000/25000 [==============================] - 1s 32us/sample - loss: 0.3991 - accuracy: 0.8790
Epoch 4/4
25000/25000 [==============================] - 1s 33us/sample - loss: ... - accuracy: 0.8920
25000/1 - 3s - loss: 0.3794 - accuracy: 0.8732
[0.3726908649635315, 0.8732]

It's a fairly simple problem, so even one layer works reasonably well, though not quite as well as the original network.

  2. Use more layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
result = model.evaluate(x_test, y_test, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print(result)

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 3s 123us/sample - loss: 0.5285 - accuracy: 0.7614
Epoch 2/4
25000/25000 [==============================] - 1s 45us/sample - loss: 0.2683 - accuracy: 0.9072
Epoch 3/4
25000/25000 [==============================] - 1s 45us/sample - loss: 0.1949 - accuracy: 0.9297
Epoch 4/4
25000/25000 [==============================] - 1s 47us/sample - loss: 0.1625 - accuracy: 0.9422
25000/1 - 2s - loss: 0.3130 - accuracy: 0.8806
[0.30894253887176515, 0.88056]

Better than a single layer, but still not as good as the original two-layer version.

  3. Use many more hidden units
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(1024, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
result = model.evaluate(x_test, y_test, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print(result)

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 15s 593us/sample - loss: 0.5297 - accuracy: 0.7964
Epoch 2/4
25000/25000 [==============================] - 12s 490us/sample - loss: 0.2233 - accuracy: 0.9109
Epoch 3/4
25000/25000 [==============================] - 12s 489us/sample - loss: 0.1148 - accuracy: 0.9593
Epoch 4/4
25000/25000 [==============================] - 12s 494us/sample - loss: 0.0578 - accuracy: 0.9835
25000/1 - 9s - loss: 0.3693 - accuracy: 0.8812
[0.4772889766550064, 0.8812]

No better; the much wider layer mostly just overfits harder.

  4. Use mse loss
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
             loss='mse',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
result = model.evaluate(x_test, y_test, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print(result)

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 3s 119us/sample - loss: 0.1472 - accuracy: 0.8188
Epoch 2/4
25000/25000 [==============================] - 1s 46us/sample - loss: 0.0755 - accuracy: 0.9121
Epoch 3/4
25000/25000 [==============================] - 1s 50us/sample - loss: 0.0577 - accuracy: 0.9319
Epoch 4/4
25000/25000 [==============================] - 1s 47us/sample - loss: 0.0474 - accuracy: 0.9442
25000/1 - 3s - loss: 0.0914 - accuracy: 0.8828
[0.08648386991858482, 0.88276]
  5. Use tanh activation
model = models.Sequential()
model.add(layers.Dense(16, activation='tanh', input_shape=(10000, )))
model.add(layers.Dense(16, activation='tanh'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
result = model.evaluate(x_test, y_test, verbose=2)    # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print(result)

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 4s 149us/sample - loss: 0.4237 - accuracy: 0.8241
Epoch 2/4
25000/25000 [==============================] - 1s 46us/sample - loss: 0.2310 - accuracy: 0.9163
Epoch 3/4
25000/25000 [==============================] - 1s 46us/sample - loss: 0.1779 - accuracy: 0.9329
Epoch 4/4
25000/25000 [==============================] - 1s 49us/sample - loss: 0.1499 - accuracy: 0.9458
25000/1 - 3s - loss: 0.3738 - accuracy: 0.8772
[0.3238203083658218, 0.87716]

So these experiments suggest that the architecture used in the book is a reasonable choice; none of our tweaks did much better 😂.

News classification: a multiclass classification problem

The original link

In this section we classify items into more than two categories, which is called multiclass classification.

We need to sort newswires from Reuters into 46 topics. Each newswire belongs to exactly one topic, so specifically this is a single-label, multiclass classification problem.

Reuters data set

The Reuters dataset, published by Reuters in 1986 (much older than me 😂), contains newswires in 46 topics, with at least 10 examples of each topic in the training set.

Like IMDB and MNIST, this toy dataset is built into Keras:

from tensorflow.keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
    num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz
2113536/2110848 [==============================] - 6s 3us/step

The data in this dataset is preprocessed the same way as IMDB: the words are encoded as integers, and we keep only the 10,000 most frequent words.

The training set has 8K+ samples and the test set 2K+:

print(len(train_data), len(test_data))

8982 2246

As with IMDB, let's decode a sample back into text and take a look:

def decode_news(data):
    reverse_word_index = {v: k for k, v in reuters.get_word_index().items()}
    # i - 3 because indices 0, 1, 2 are reserved for "padding", "start of sequence", "unknown"
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in data])


text = decode_news(train_data[0])
print(text)
? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3

The labels are integers from 0 to 45:

train_labels[0]

Output:

3

Data preparation

First, multi-hot encode the data, reusing the vectorize_sequences function we wrote for the IMDB example:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results


x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

Which gives:

print(x_train)
array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

Then we have to deal with the labels. We could cast them to an integer tensor, or use one-hot encoding, which is the common choice for categorical data.

For this problem we use one-hot encoding: each label becomes a vector of all zeros with a 1 at the label's index:

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results


one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

In fact, Keras comes with a function that does this:

from tensorflow.keras.utils import to_categorical
# the book uses `from keras.utils.np_utils import to_categorical`, but times have changed: we're on tensorflow.keras, so the import path differs slightly

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

print(one_hot_train_labels)
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

Building the network

This problem is similar to the movie-review classification above, except that the number of possible answers grows from 2 to 46: the output space is much larger.

In a stack of Dense layers like ours, each layer can only access the information present in the output of the previous layer. If one layer drops some information, the layers after it can never recover it. When the dropped information is useless for classification, dropping it is exactly what we want; but when it is information the final classification depends on, the loss limits what the network can achieve. In other words, each layer can become an "information bottleneck".

The movie-review classifier only had to produce two outcomes at the end, so 16-unit layers, i.e. a 16-dimensional representation space, were large enough to avoid an information bottleneck.

Now the output space is 46-dimensional. If we simply copied the previous code and let the network learn in a 16-dimensional space, we would certainly create a bottleneck.

The fix is simply to use wider layers. Here we go from 16 to 64 units:

from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

The last layer outputs 46 values, one per category, and its activation is softmax, the same activation we used for MNIST.

Softmax makes the network output a probability distribution over the 46 categories: a 46-dimensional vector whose i-th element is the probability that the input belongs to the i-th category, and whose 46 elements sum to 1.
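
Softmax itself is simple enough to write out by hand, as a sketch for intuition (in Keras you just pass activation='softmax'):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.random.randn(46))
print(p.shape, round(p.sum(), 6), p.argmax())   # (46,), sums to 1, index of the most likely class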

Compiling the model

When compiling the model we again need to choose a loss function, an optimizer, and the metrics to monitor.

  • Loss function: for this multiclass classification problem, categorical_crossentropy.
  • Optimizer: rmsprop, which we use for a wide range of problems.
  • Metric: prediction accuracy, as before.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Validating the approach

We still need a validation set to evaluate the model during training. Set aside 1,000 samples from the training set:

x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

Training the model

With the preparation done, it's time for the most exciting part: training!

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 3s 372us/sample - loss: 2.6180 - accuracy: 0.5150 - val_loss: 1.7517 - val_accuracy: 0.6290
......
Epoch 20/20
7982/7982 [==============================] - 1s 91us/sample - loss: 0.1134 - accuracy: 0.9578 - val_loss: 1.0900 - val_accuracy: 0.8040

🆗 That was pretty fast. As usual, let's plot the training process.

  1. Loss during training
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo-', label='Training loss')
plt.plot(epochs, val_loss, 'rs-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

  2. Accuracy during training
plt.clf()

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plt.plot(epochs, acc, 'bo-', label='Training acc')
plt.plot(epochs, val_acc, 'rs-', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Emmmm, overfitting starts around the ninth epoch, so let's retrain for just nine epochs.

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(partial_x_train,
          partial_y_train,
          epochs=9,
          batch_size=512,
          validation_data=(x_val, y_val))

Train on 7982 samples, validate on 1000 samples
Epoch 1/9
7982/7982 [==============================] - 1s 153us/sample - loss: 2.5943 - accuracy: 0.5515 - val_loss: 1.7017 - val_accuracy: 0.6410
......
Epoch 9/9
7982/7982 [==============================] - 1s 84us/sample - loss: 0.2793 - accuracy: 0.9402 - val_loss: 0.8758 - val_accuracy: 0.8170

<tensorflow.python.keras.callbacks.History at 0x16eb5d6d0>

Then evaluate on the test set:

results = model.evaluate(x_test, one_hot_test_labels, verbose=2)
print(results)

2246/1 - 0s - loss: 1.7611 - accuracy: 0.7912
[0.983459981976082, 0.7911843]

Accuracy is about 80%, which is actually pretty decent, far better than random guessing.

A random classifier would get about 50% accuracy on the binary problem, but on this 46-class problem it would get less than 19%:

import copy

test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits_array = np.array(test_labels) == np.array(test_labels_copy)
float(np.sum(hits_array)) / len(test_labels)
0.18432769367764915

Using the predict method of the model instance, we can obtain the probability distribution of the input over 46 categories:

predictions = model.predict(x_test)
print(predictions)

array([[4.7181980e-05, 2.0765587e-05, 8.6653872e-06, ..., 3.1266565e-05,
        8.2046267e-07, 6.0611728e-06],
       ...,
       [2.5776261e-04, 8.6797208e-01, 3.9900807e-03, ..., 2.6547859e-04,
        6.5820634e-05, 6.8603881e-06]], dtype=float32)

Each prediction is a distribution over the 46 categories:

predictions[0].shape

(46,)

Their sum is 1:

np.sum(predictions[0])

0.99999994

The largest element marks the category the model predicts for this newswire:

np.argmax(predictions[0])

3

Another way to deal with labels and losses

As mentioned earlier, instead of one-hot encoding the labels we can also keep them as an integer tensor:

y_train = np.array(train_labels)
y_test = np.array(test_labels)

Only one thing changes: the loss function. With integer labels, use sparse_categorical_crossentropy instead of categorical_crossentropy:

 model.compile(optimizer='rmsprop',
               loss='sparse_categorical_crossentropy',
               metrics=['acc'])
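
A quick check (my own addition) that the two label formats carry the same information: taking the argmax over the one-hot labels recovers the original integer labels.

import numpy as np

print(np.array_equal(np.argmax(one_hot_train_labels, axis=1), train_labels))   # True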

The importance of sufficiently large intermediate layers

We talked about the "information bottleneck" earlier; for a 46-class output, the intermediate layers need to be large enough.

Now let's see what happens when they aren't. To exaggerate the effect, shrink the middle layer from 64 units all the way down to 4:

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=128,
          validation_data=(x_val, y_val))

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 2s 288us/sample - loss: 2.8097 - accuracy: 0.4721 - val_loss: 2.0554 - val_accuracy: 0.5430
......
Epoch 20/20
7982/7982 [==============================] - 1s 121us/sample - loss: 0.6443 - accuracy: 0.8069 - val_loss: 1.8962 - val_accuracy: 0.6800

<tensorflow.python.keras.callbacks.History at 0x16f628b50>

Look at that: validation accuracy drops noticeably compared with the 64-unit version. The difference is significant.

The drop happens because the 4-dimensional layer gives the network too little room to learn in, forcing it to throw away information that is useful for classification.

So is bigger always better? Let's try making the middle layer much larger:

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=128,
          validation_data=(x_val, y_val))

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 2s 273us/sample - loss: 1.5523 - accuracy: 0.6310 - val_loss: 1.1903 - val_accuracy: 0.7060
......
Epoch 20/20
7982/7982 [==============================] - 2s 296us/sample - loss: 0.0697 - accuracy: 0.9605 - val_loss: 3.5296 - val_accuracy: 0.7850

<tensorflow.python.keras.callbacks.History at 0x1707fcf90>

Training takes a bit longer and the laptop runs a bit hotter, but the result isn't much better. That's because the input to the middle layer is only the 64-dimensional output of the first layer; however large the middle layer is, it remains limited by that first-layer bottleneck.

Try making the first layer bigger too!

model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=128,
          validation_data=(x_val, y_val))

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 5s 662us/sample - loss: 1.3423 - accuracy: 0.6913 - val_loss: 0.9565 - val_accuracy: 0.7920
......
Epoch 20/20
7982/7982 [==============================] - 5s 583us/sample - loss: 0.0648 - accuracy: 0.9597 - val_loss: 2.9887 - val_accuracy: 0.8030

<tensorflow.python.keras.callbacks.History at 0x176fbbd90>

(I scaled this down a bit: I first tried 4096, but that was too big and my entry-level MBP took more than 20 minutes per run, which I was too lazy to wait for.)

All that extra time, and it still overfits quickly, and rather badly. Let's plot it:

import matplotlib.pyplot as plt

# in Jupyter, `_` is the previous cell's output, i.e. the History returned by model.fit above
loss = _.history['loss']
val_loss = _.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo-', label='Training loss')
plt.plot(epochs, val_loss, 'rs-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

So too big isn't good either; there has to be a balance.

Trying fewer/more layers

  1. Fewer layers
model = models.Sequential()
model.add(layers.Dense(46, activation='softmax', input_shape=(10000,)))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=128,
          validation_data=(x_val, y_val))

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo-', label='Training loss')
plt.plot(epochs, val_loss, 'rs-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 1s 132us/sample - loss: 2.4611 - accuracy: 0.6001 - val_loss: 1.8556 - val_accuracy: 0.6440
......
Epoch 20/20
7982/7982 [==============================] - 1s 85us/sample - loss: 0.1485 - accuracy: 0.9570 - val_loss: 1.2116 - val_accuracy: 0.7960

Fast! But the results are slightly worse.

  2. More layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=128,
          validation_data=(x_val, y_val))

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo-', label='Training loss')
plt.plot(epochs, val_loss, 'rs-', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 2s 188us/sample - loss: 1.8340 - accuracy: 0.5829 - val_loss: 1.3336 - val_accuracy: 0.6910
......
Epoch 20/20
7982/7982 [==============================] - 1s 115us/sample - loss: 0.0891 - accuracy: 0.9600 - val_loss: 1.7227 - val_accuracy: 0.7900

So more layers aren't necessarily better either!

Predicting house prices: a regression problem

The original link

In the previous two examples we did classification (predicting discrete labels). This time we look at a regression problem: predicting continuous values.

The Boston Housing dataset

We use the Boston Housing Price dataset to predict the median house price in a Boston suburb in the mid-1970s. The dataset contains information about the suburbs at the time, such as crime rates, tax rates, and so on.

Compared with the two classification datasets above, this one is small: 506 samples in total, of which 404 are in the training set and 102 in the test set. The input features are also on very different numeric scales.

Let’s first import the data (this data set also comes with Keras) :

from tensorflow.keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

print(train_data.shape, test_data.shape)

(404, 13) (102, 13)
train_data[0:3]

array([[1.23247e+00, 0.00000e+00, 8.14000e+00, 0.00000e+00, 5.38000e-01,
        6.14200e+00, 9.17000e+01, 3.97690e+00, 4.00000e+00, 3.07000e+02,
        2.10000e+01, 3.96900e+02, 1.87200e+01],
       [2.17700e-02, 8.25000e+01, 2.03000e+00, 0.00000e+00, 4.15000e-01,
        7.61000e+00, 1.57000e+01, 6.27000e+00, 2.00000e+00, 3.48000e+02,
        1.47000e+01, 3.95380e+02, 3.11000e+00],
       [4.89822e+00, 0.00000e+00, 1.81000e+01, 0.00000e+00, 6.31000e-01,
        4.97000e+00, 1.00000e+02, 1.33250e+00, 2.40000e+01, 6.66000e+02,
        2.02000e+01, 3.75520e+02, 3.26000e+00]])
train_targets[0:3]

array([15.2, 42.3, 50. ])

The targets are house prices in thousands of dollars; prices were quite low back then:

min(train_targets), sum(train_targets)/len(train_targets), max(train_targets)

(5.0, 22.395049504950496, 50.0)
import matplotlib.pyplot as plt

x = range(len(train_targets))
y = train_targets

plt.plot(x, y, 'o', label='data')
plt.title('House Prices')
plt.xlabel('train_targets')
plt.ylabel('prices')
plt.legend()
plt.show()

Data preparation

The values fed to a neural network should not have wildly different ranges. A network can technically cope with such heterogeneous data, but it doesn't learn well from it. For data like this, the standard practice is feature-wise normalization.

For each feature (each column of the input data matrix), subtract the column's mean and divide by its standard deviation. Afterwards each feature is centered on 0 and has unit standard deviation (a standard deviation of 1).

You can easily do this with Numpy:

mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data -= mean
train_data /= std

test_data -= mean
test_data /= std

Note that the test set is normalized with the mean and standard deviation computed on the training set; you should never use quantities computed on the test data in your workflow, even for something as simple as normalization.
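
If you have scikit-learn available, the same feature-wise normalization can be done with StandardScaler; this is just an equivalent alternative to the numpy version above, shown for reference:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data)   # fit on the training data only
test_data_scaled = scaler.transform(test_data)         # reuse the training mean/std on the test set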

With the data processed we can build and train the network (the targets need no preprocessing here, which makes this simpler than the classification problems).

Building the network

The less data you have, the easier it is to overfit. To slow down overfitting, use a smaller network.

For example, in this problem, we use a network with only two hidden layers, each 64 units:

from tensorflow.keras import models
from tensorflow.keras import layers

def build_model():
    # we will need to instantiate the same model several times, so we wrap construction in a function
    model = models.Sequential()
    model.add(layers.Dense(64, activation="relu", input_shape=(train_data.shape[1], )))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1))
    
    model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    
    return model

The last layer of the network has a single unit and no activation function, so it is a linear layer. This is the standard setup for the final layer of a scalar regression problem (predicting one continuous value).

Adding an activation function would constrain the range of the output; sigmoid, for example, would squash it into [0, 1]. With no activation, the linear layer is free to output values in any range.

The model is compiled with the mse (mean squared error) loss, the mean of the squared differences between the predictions and the true targets. This is the usual loss for regression problems.

We also monitor a metric we haven't used before, mae (mean absolute error): the mean absolute difference between the predictions and the targets.
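
Both are easy to spell out by hand, as a quick sketch (Keras provides them as the built-in 'mse' loss and 'mae' metric):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

print(mse([3.0, 2.0], [5.0, 2.0]), mae([3.0, 2.0], [5.0, 2.0]))   # 2.0 1.0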

Validating the approach: K-fold cross-validation

What we have been doing so far to evaluate the network and tune its parameters (such as the number of training epochs) is to split the data into a training set and a validation set, and we need to do the same here. The trouble is that we have so little data this time that the validation set would be tiny (on the order of 100 samples). The validation score could then swing a lot depending on exactly which samples land in the validation set (high variance across different splits), which would make the evaluation unreliable.

In this awkward situation, the best practice is to use k-fold cross-validation.

With K-fold validation, we split the data into K partitions (typically K = 4 or 5), instantiate K identical models, train each one on K-1 partitions, and evaluate it on the remaining partition. The final validation score is the average of the K scores obtained.

Code implementation for K-fold validation:

Slightly modifying the example from the book, I added a bit of code to visualize the training process with TensorBoard. First load the TensorBoard extension in the Jupyter Lab notebook:

# Load the TensorBoard notebook extension
# TensorBoard can visualize the training process
%load_ext tensorboard
# Clear any logs from previous runs
!rm -rf ./logs/

Output:

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard

Then start writing the main code:

import numpy as np
import datetime
import tensorflow as tf

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []

# prepare the TensorBoard callback
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

for i in range(k):
    print(f'processing fold #{i} ({i+1}/{k}) ')
    
    # Prepare validation data
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    
    # Prepare training data
    partial_train_data = np.concatenate(
        [train_data[: i * num_val_samples],
         train_data[(i+1) * num_val_samples :]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[: i * num_val_samples], 
         train_targets[(i+1) *  num_val_samples :]], 
        axis=0)
    
    # Build and train models
    model = build_model()
    model.fit(partial_train_data, partial_train_targets, 
              epochs=num_epochs, batch_size=1, verbose=0,
              callbacks=[tensorboard_callback])
    
    # Have a validation set evaluation model
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)


np.mean(all_scores)

processing fold #0 (1/4)
processing fold #1 (2/4)
processing fold #2 (3/4)
processing fold #3 (4/4)

2.4046657

Use this command to display the Tensorboard in the Jupyter Lab notebook:

%tensorboard --logdir logs/fit

It looks something like this:

It can also be accessed directly from your browser by going to http://localhost:6006.

That was just for fun. Now let's change it to train for 500 epochs (quite slow on an entry-level MBP) and record how the validation MAE evolves in each fold:

k = 4
num_val_samples = len(train_data) // k

num_epochs = 500
all_mae_histories = []

for i in range(k):
    print(f'processing fold #{i} ({i+1}/{k}) ')
    
    # Prepare validation data
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    
    # Prepare training data
    partial_train_data = np.concatenate(
        [train_data[: i * num_val_samples],
         train_data[(i+1) * num_val_samples :]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[: i * num_val_samples], 
         train_targets[(i+1) *  num_val_samples :]], 
        axis=0)
    
    # Build and train models
    model = build_model()
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)

    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)


print("Done.")
processing fold #0 (1/4)
processing fold #1 (2/4)
processing fold #2 (3/4)
processing fold #3 (4/4)
Done.

Now plot the results:

average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

This plot is too noisy and badly scaled to read, so let's clean it up:

  • Drop the first 10 points, which are on a very different scale from the rest of the curve.
  • Smooth the curve by replacing each point with an exponential moving average of the previous points.
def smooth_curve(points, factor=0.9):
  smoothed_points = []
  for point in points:
    if smoothed_points:
      previous = smoothed_points[-1]
      smoothed_points.append(previous * factor + point * (1 - factor))
    else:
      smoothed_points.append(point)
  return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

From this plot we can see that the model starts overfitting after roughly 80 epochs.

Once experiments like these have given us the best hyperparameters (number of epochs, size and number of layers, and so on), we train a final production model on all of the training data with those settings.

Train the final model

model = build_model()
model.fit(train_data, train_targets, 
          epochs=80, batch_size=16, verbose=0)

# Make a final assessment
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets, verbose=0)
print(test_mse_score, test_mae_score)

17.43332971311083 2.6102107

The test_mae_score means that our trained model's price predictions are off by about 2,600 dollars on average… 😭