Overview

We will build a network that classifies Reuters newswires into 46 mutually exclusive topics. Because there are more than two categories, this is a multiclass classification problem. And because each data point belongs to exactly one category, it is more precisely a single-label multiclass classification problem. If each data point could belong to several categories (topics), it would instead be a multilabel multiclass classification problem.

Reuters data set

This article uses the Reuters dataset, a set of short newswires and their corresponding topics, published by Reuters in 1986. It is a simple, widely used text-classification dataset. There are 46 different topics: some topics have more samples than others, but every topic has at least 10 samples in the training set. Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras. Let's load it.

from tensorflow.keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

As with the IMDB dataset, the argument num_words=10000 restricts the data to the 10,000 most frequently occurring words. There are 8,982 training samples and 2,246 test samples. As with the IMDB reviews, each sample is a list of integers (word indices). If you are curious, you can decode a newswire back into words with the following code (the indices are offset by 3 because 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown").

word_index = reuters.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

print(decoded_newswire)

Preparing the data

First, vectorize the data: turn each list of word indices into a 10,000-dimensional vector of 0s and 1s.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    results[i, sequence] = 1.  # set results[i] to 1 at the indices that appear in the sequence
  return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
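As a quick sanity check, each sample is now a 10,000-dimensional vector of 0s and 1s, so the vectorized training data should have the following shape.

x_train.shape
# (8982, 10000)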

There are two ways to vectorize the labels: you can cast the label list to an integer tensor, or you can use one-hot encoding. One-hot encoding, also called categorical encoding, is a widely used format for categorical data. In this example, one-hot encoding of the labels consists of representing each label as an all-zero vector with a 1 in the place of the label index. The code implementation is as follows.

def to_one_hot(labels, dimension=46):
  results = np.zeros((len(labels), dimension))
  for i, label in enumerate(labels):
    results[i, label] = 1.
  return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

Note that Keras has a built-in way to do this, which you have already seen in the MNIST example.

from tensorflow.keras.utils import to_categorical
 
one_hot_train_labels = to_categorical(train_labels) 
one_hot_test_labels = to_categorical(test_labels)

Building the network

This problem has a new constraint: the number of output classes is 46, so the dimensionality of the output space is much larger. In a stack of Dense layers like the one used here, each layer can only access information present in the output of the previous layer. If one layer drops some information relevant to the classification problem, that information can never be recovered by later layers: every layer is a potential information bottleneck. The previous example used 16-dimensional intermediate layers, but a 16-dimensional space is probably too limited to learn to separate 46 different classes; such a small layer could act as an information bottleneck, permanently dropping relevant information. For this reason, we will use larger layers with 64 units.

from tensorflow.keras import layers
from tensorflow.keras import models

model = models.Sequential() 
model.add(layers.Dense(64, activation='relu', input_shape=(10000,))) 
model.add(layers.Dense(64, activation='relu')) 
model.add(layers.Dense(46, activation='softmax'))

Two other points should be noted about this architecture.

  • The last layer of the network is a Dense layer of size 46. This means that, for each input sample, the network will output a 46-dimensional vector. Each element of this vector (that is, each dimension) represents a different output class.
  • The last layer uses a softmax activation. You saw this pattern in the MNIST example. It means the network will output a probability distribution over the 46 output classes: for each input sample, the network produces a 46-dimensional vector, where output[i] is the probability that the sample belongs to class i. The 46 probabilities sum to 1.

The best loss function for this example is categorical_crossentropy. It measures the distance between two probability distributions: here, the probability distribution output by the network and the true distribution of the labels. By minimizing the distance between these two distributions, you train the network to output something as close as possible to the true labels.


model.compile(optimizer='rmsprop',
        loss='categorical_crossentropy',
        metrics=['accuracy'])
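If you prefer the other labeling option mentioned earlier, keeping the labels as plain integer tensors instead of one-hot encoding them, the only thing that needs to change is the loss function. Here is a minimal sketch of that variant, assuming the same model and optimizer as above; you would then fit and evaluate with these integer labels instead of the one-hot ones.

# Alternative sketch: integer labels with the sparse loss.
y_train = np.array(train_labels)
y_test = np.array(test_labels)

# sparse_categorical_crossentropy is the integer-label counterpart of
# categorical_crossentropy; the two are mathematically equivalent.
model.compile(optimizer='rmsprop',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])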

Validating the approach

We set aside 1,000 samples from the training data to use as a validation set.

x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

Now let's train the network for 20 epochs.

history = model.fit(partial_x_train,
            partial_y_train,
            epochs=20,
            batch_size=512,
            validation_data=(x_val, y_val))

Finally, we plot the loss and accuracy curves.

import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
 
plt.plot(epochs, loss, 'bo', label='Training loss') 
plt.plot(epochs, val_loss, 'b', label='Validation loss') 
plt.title('Training and validation loss') 
plt.xlabel('Epochs') 
plt.ylabel('Loss') 
plt.legend() 
 
plt.show()

plt.clf()   
 
acc = history.history['accuracy'] 
val_acc = history.history['val_accuracy'] 
 
plt.plot(epochs, acc, 'bo', label='Training acc') 
plt.plot(epochs, val_acc, 'b', label='Validation acc') 
plt.title('Training and validation accuracy') 
plt.xlabel('Epochs') 
plt.ylabel('Accuracy') 
plt.legend() 
 
plt.show() 

The network begins to overfit after nine epochs. Let's train a new network from scratch for nine epochs and then evaluate it on the test set.

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
        loss='categorical_crossentropy',
        metrics=['accuracy'])
model.fit(partial_x_train,
      partial_y_train,
      epochs=9,
      batch_size=512,
      validation_data=(x_val, y_val))
results = model.evaluate(x_test, one_hot_test_labels)
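The results returned by evaluate contain the test loss followed by the metrics passed to compile, so you can inspect them directly.

print(results)
# [test loss, test accuracy]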

This approach reaches an accuracy of about 80%. For a balanced binary classification problem, a completely random classifier would reach 50% accuracy. In this case, a completely random classifier would only reach about 19% accuracy, so our results look quite good, at least compared to a random baseline.
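If you want to sanity-check that 19% figure, you can estimate the accuracy of a random classifier by shuffling the test labels and comparing them with the originals. A quick sketch:

import copy

# Estimate the accuracy of a purely random classifier on this test set.
test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits = np.array(test_labels) == np.array(test_labels_copy)
print(float(np.sum(hits)) / len(test_labels))
# roughly 0.19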

Generating predictions on new data

You can verify that the model instance's predict method returns a probability distribution over the 46 topics. Let's generate topic predictions for all of the test data.

predictions = model.predict(x_test) 

Each element in the predictions is a vector of length 46.

predictions[0].shape 
# (46,)

The sum of all the elements of this vector is 1.

np.sum(predictions[0]) 
# 1.0

The index of the largest element is the predicted class, that is, the class with the highest probability.

np.argmax(predictions[0]) 
# 4
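To get the predicted class for every test sample at once, take the argmax along the last axis of the predictions array.

predicted_classes = np.argmax(predictions, axis=1)
print(predicted_classes[:10])  # predicted topic index for the first 10 test newswires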

The complete code

from tensorflow.keras.datasets import reuters
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers
from tensorflow.keras import models
import matplotlib.pyplot as plt
import numpy as np


(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

#word_index = reuters.get_word_index()
#reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
#decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

#print(decoded_newswire)

def vectorize_sequences(sequences, dimension=10000):
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    results[i, sequence] = 1.
  return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

def to_one_hot(labels, dimension=46):
  results = np.zeros((len(labels), dimension))
  for i, label in enumerate(labels):
    results[i, label] = 1.
  return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

# one_hot_train_labels = to_categorical(train_labels)
# one_hot_test_labels = to_categorical(test_labels)

model = models.Sequential() 
model.add(layers.Dense(64, activation='relu', input_shape=(10000,))) 
model.add(layers.Dense(64, activation='relu')) 
model.add(layers.Dense(46, activation='softmax'))


model.compile(optimizer='rmsprop',
        loss='categorical_crossentropy',
        metrics=['accuracy'])

x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]


history = model.fit(partial_x_train,
            partial_y_train,
            epochs=9,
            batch_size=512,
            validation_data=(x_val, y_val))

loss = history.history['loss'] 
val_loss = history.history['val_loss'] 
epochs = range(1, len(loss) + 1) 
 
plt.plot(epochs, loss, 'bo', label='Training loss') 
plt.plot(epochs, val_loss, 'b', label='Validation loss') 
plt.title('Training and validation loss') 
plt.xlabel('Epochs') 
plt.ylabel('Loss') 
plt.legend() 
 
plt.show()

# plt.clf()   
 
# acc = history.history['accuracy'] 
# val_acc = history.history['val_accuracy'] 
 
# plt.plot(epochs, acc, 'bo', label='Training acc') 
# plt.plot(epochs, val_acc, 'b', label='Validation acc') 
# plt.title('Training and validation accuracy') 
# plt.xlabel('Epochs') 
# plt.ylabel('Accuracy') 
# plt.legend() 
 
# plt.show() 