
Deep Learning with Python

This article is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by Francois Chollet). Starting with this post, the notes switch from Jupyter Notebooks to Markdown; you can find the original .ipynb notebooks on GitHub or Gitee.

You can read the original text of the book online (in English) at this website. The book's author also provides the accompanying Jupyter notebooks.

This article is part of the notes for Chapter 8: Generative Deep Learning.

Generate text using LSTM

8.1 Text generation with LSTM

As someone once said, “Generating sequential data is the closest computers get to dreaming.” Using text generation as an example, we will explore how recurrent neural networks can be used to generate sequence data. The same technique can also be used for music generation, speech synthesis, chatbot dialogue, and even movie script writing.

In fact, the LSTM algorithm, as we know it today, was first developed to generate text character by character.

Generating sequence data

The general approach to generating sequences with deep learning is to train a network (usually an RNN or a CNN) to predict the next token in a sequence given the previous tokens.

To put it in more formal terms: a network that can model the probability of the next token given the previous tokens is called a language model. A language model captures the statistical structure of language, its “latent space”. Once a language model is trained, you feed it an initial text string (called the “conditioning data”), sample a new token from the model, append that token to the conditioning data, feed the result back in, and repeat; this process can generate a sequence of arbitrary length.
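To make that loop concrete, here is a minimal sketch of the procedure (my own illustration, not the book's code); predict_next_token_probs and sample_from are hypothetical placeholders for the language model and the sampling strategy discussed below:

# Minimal sketch of conditional sequence generation (illustrative only)
def generate_sequence(conditioning_tokens, length, predict_next_token_probs, sample_from):
    generated = list(conditioning_tokens)            # start from the seed ("conditioning data")
    for _ in range(length):
        probs = predict_next_token_probs(generated)  # language model: P(next token | previous tokens)
        next_token = sample_from(probs)              # pick one token using a sampling strategy
        generated.append(next_token)                 # feed it back in and repeat
    return generated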

Let’s start with a simple example: take an LSTM layer, feed it strings of N characters extracted from a text corpus, and train it to predict the (N+1)th character. The model’s output is a softmax over all possible characters, i.e. a probability distribution for the next character. This model is called a character-level neural language model.

The sampling strategy

When generating text with a character-level language model, the most important question is how to choose the next character. Here are some common approaches:

  • Greedy sampling: always pick the most likely next character. This tends to produce repetitive, predictable strings that may not even be coherent. (Think of an input method’s word suggestions.)
  • Pure random sampling: draw the next character from a uniform distribution, in which every character is equally likely. This is too random to produce anything interesting. (It is just a random jumble of characters.)
  • Stochastic sampling: draw from the distribution predicted by the language model; if the model says the probability of the next character being e is 0.3, you pick e 30% of the time. A little randomness makes the generated text more varied without being completely random, so the output can be interesting. (A small comparison sketch of the three strategies follows this list.)
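Here is that comparison sketch, using a toy three-character distribution (my own illustration, not from the book):

import numpy as np

chars = ['a', 'b', 'c']
probs = np.array([0.5, 0.3, 0.2])   # toy model output: P('a'), P('b'), P('c')

greedy_char = chars[int(np.argmax(probs))]          # greedy: always the most likely character
uniform_char = np.random.choice(chars)              # pure random: ignores the model entirely
stochastic_char = np.random.choice(chars, p=probs)  # stochastic: follows the model's probabilities

print(greedy_char, uniform_char, stochastic_char)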

Stochastic sampling looks good and creative, but there is one problem: you cannot control the amount of randomness. More randomness may be more creative, but it becomes arbitrary; less randomness is closer to real language, but it is rigid and predictable.

To control the amount of randomness during sampling, we introduce a parameter called the softmax temperature, which characterizes the entropy of the sampling probability distribution, i.e. how surprising or predictable the choice of the next character will be:

  • Higher temperature: a higher-entropy sampling distribution, producing more surprising, less structured output;
  • Lower temperature: less randomness, producing more predictable output.

Concretely, given a temperature value, the model’s softmax output is reweighted to produce a new probability distribution:

import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    """Reweight a probability distribution for a given softmax temperature."""
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)   # renormalize so it sums to 1
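As a quick illustration (assuming the corrected reweight_distribution above), reweighting a toy distribution at several temperatures shows the effect: low temperatures concentrate the probability mass on the most likely character, temperature 1.0 leaves the distribution unchanged, and higher temperatures flatten it:

original = np.array([0.5, 0.3, 0.2])   # toy softmax output

for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, reweight_distribution(original, temperature=t))
# At t=0.2 almost all the mass goes to the first character; at t=1.2 the distribution is flatter.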

Implementing character-level LSTM text generation

That’s the theory; now let’s use Keras to implement character-level LSTM text generation.

Data preparation

First, we need a large amount of text data (a corpus) to train the language model: one or more sufficiently large text files, such as Wikipedia or various books. Here we use some of Nietzsche’s works (in English translation), so the language model we learn will reflect Nietzsche’s writing style and themes. (For my own toy models over the years, I have always used Lu Xun 😂)

Download the corpus and convert it to all lowercase:

from tensorflow import keras
import numpy as np

path = keras.utils.get_file(
    'nietzsche.txt', 
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))

Output results:

Corpus length: 600893

Next, we turn the text into data (vectorization): extract partially overlapping sequences of length maxlen from the text, one-hot encode them, and pack them into a 3D array x of shape (sequences, maxlen, unique_characters). We also need to prepare an array y of the corresponding targets, i.e. the character that follows each extracted sequence (also one-hot encoded):

# Vectorize character sequences

maxlen = 60     # Length of each sequence
step = 3        # Sample a new sequence every 3 characters
sentences = []  # Holds the extracted sequences
next_chars = [] # Holds the next character for each sequence in sentences

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i+maxlen])
    next_chars.append(text[i+maxlen])
print('Number of sequences:', len(sentences))

chars = sorted(list(set(text)))
char_indices = dict((char, chars.index(char)) for char in chars)
print('Unique characters:', len(chars))

print('Vectorization...')

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Output information:

Number of sequences: 200278

Unique characters: 57

Building the network

The network we need is actually very simple: an LSTM layer followed by a Dense layer with softmax activation will do. (You don’t have to use an LSTM; a one-dimensional convolutional layer can also generate sequences; see the sketch after the model code below.)

The single-layer LSTM model for predicting the next character:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
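As an aside on the parenthetical above: an LSTM is not the only option. A rough 1D-convnet alternative with the same input and output shapes might look like this (my own sketch, not the book's code; layer sizes are arbitrary, and it reuses maxlen and chars defined earlier):

from tensorflow.keras import models, layers

conv_model = models.Sequential([
    layers.Conv1D(128, 5, activation='relu', input_shape=(maxlen, len(chars))),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation='relu'),
    layers.GlobalMaxPooling1D(),                    # collapse the time dimension
    layers.Dense(len(chars), activation='softmax')  # same output as the LSTM model
])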

Model compilation configuration:

from tensorflow.keras import optimizers

optimizer = optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

Training the language model and sampling from it

Given a language model and a seed text fragment, you can generate new text by repeating the following:

  1. Given the text generated so far, obtain the probability distribution of the next character from the model;
  2. Reweight the distribution with a given temperature;
  3. Randomly sample the next character according to the reweighted distribution;
  4. Append the new character to the end of the text.

Before training the model, we first write the “sampling function”, which reweights the original probability distribution produced by the model and draws a character index from it:

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
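A quick sanity check of sample() on a toy distribution (illustrative only) shows how temperature changes the draws:

toy_preds = np.array([0.5, 0.3, 0.2])
print([sample(toy_preds, temperature=0.2) for _ in range(10)])  # mostly index 0
print([sample(toy_preds, temperature=1.2) for _ in range(10)])  # indices mix more evenly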

Finally, we train the model and generate text. After each epoch we generate text at a series of different temperatures, so we can see both how the generated text changes as the model converges and how temperature affects the sampling strategy:

# Text generation loop

import random

for epoch in range(1, 60):    # Train for 60 epochs
    print(f'\033[1;35m👉 epoch {epoch}\033[0m')    # print('epoch', epoch)
    
    model.fit(x, y,
              batch_size=128,
              epochs=1)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print(f'📖 Generating with seed: "\033[1;32;43m{generated_text}\033[0m"')    # print(f'Generating with seed: "{generated_text}"')
    
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print(f'\n\033[1;36m🌡️ temperature: {temperature}\033[0m')    # print('\n temperature:', temperature)
        print(generated_text, end='')
        for i in range(400):    # Generate 400 characters
            # One-hot encode the text generated so far
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1
            
            # Predict, sample, and generate the next character
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            print(next_char, end='')
            
            generated_text = generated_text[1:] + next_char
            
    print('\n' + '-' * 20)

Running this produces a large amount of output. The original post shows screenshots of the generated samples after epoch 1, epoch 30, and epoch 59; they are not reproduced here.

Training a larger model on more data for longer will produce more coherent and realistic-looking samples. But in any case, text generated this way doesn’t carry any real meaning: the machine is merely sampling from a statistical model of character sequences; it doesn’t understand human language and doesn’t know what it’s saying.

Text generation based on word embedding

If we want to generate Chinese text, there are far too many Chinese characters for character-by-character generation to be a good choice. Instead, we can generate text based on word embeddings. Starting from the character-level LSTM text generation above, we slightly modify the encoding/decoding scheme and add an Embedding layer to get a basic word-embedding-based text generator:

import random
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow import keras
import numpy as np

import jieba    # Use jieba as Chinese word segmentation
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# Import text

path = '~/CDFMLR/txt_zh_cn.txt'
text = open(path).read().lower()
print('Corpus length:', len(text))

# Vectorize text sequences

maxlen = 60     # Length of each sequence
step = 3        # Sample a new sequence every 3 tokens
sentences = []      # Holds the extracted sequences
next_tokens = []    # Holds the next token for each sequence in sentences

token_text = list(jieba.cut(text))

tokens = list(set(token_text))
tokens_indices = {token: tokens.index(token) for token in tokens}
print('Number of tokens:', len(tokens))

for i in range(0, len(token_text) - maxlen, step):
    sentences.append(
        list(map(lambda t: tokens_indices[t], token_text[i: i+maxlen])))
    next_tokens.append(tokens_indices[token_text[i+maxlen]])
print('Number of sequences:', len(sentences))

# One-hot encode the targets
next_tokens_one_hot = []
for i in next_tokens:
    y = np.zeros((len(tokens),), dtype=bool)
    y[i] = 1
    next_tokens_one_hot.append(y)

# Make data set
dataset = tf.data.Dataset.from_tensor_slices((sentences, next_tokens_one_hot))
dataset = dataset.shuffle(buffer_size=4096)
dataset = dataset.batch(128)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)


# Build and compile the model

model = models.Sequential([
    layers.Embedding(len(tokens), 256),
    layers.LSTM(256),
    layers.Dense(len(tokens), activation='softmax')
])

optimizer = optimizers.RMSprop(learning_rate=0.1)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

# Sampling function: Reweights the original probability distribution obtained by the model and extracts a token index from it
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# Training model

callbacks_list = [
    ...,  # Save the model weights after every epoch
    ...,  # Reduce the learning rate when the loss stops improving
    ...,  # Interrupt training when the loss stops improving
]

model.fit(dataset, epochs=30, callbacks=callbacks_list)

# Text generation

start_index = random.randint(0, len(text) - maxlen - 1)
generated_text = text[start_index: start_index + maxlen]
print(f'📖 Generating with seed: "{generated_text}"')

for temperature in [0.2, 0.5, 1.0, 1.2]:
    print('\n🌡️ temperature:', temperature)
    print(generated_text, end='')
    for i in range(100):    # Generate 100 tokens
        # Encodes the current text
        text_cut = jieba.cut(generated_text)
        sampled = []
        for i in text_cut:
            if i in tokens_indices:
                sampled.append(tokens_indices[i])
            else:
                sampled.append(0)

        # Predict, sample, and generate the next token: same as in the previous
        # example, omitted here (see the sketch after this code block).
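For completeness, here is a sketch of the omitted step, adapted from the character-level loop above (my reconstruction using the variables defined in this example, not the original code); it continues the inner for i in range(100) loop:

        # Predict, sample, and generate the next token (reconstruction, adapted
        # from the character-level example; the original code is omitted in the post)
        preds = model.predict(np.asarray([sampled]), verbose=0)[0]
        next_index = sample(preds, temperature)
        next_token = tokens[next_index]
        print(next_token, end='')

        generated_text = generated_text + next_token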

I trained on some of Lu Xun’s essays; the final result looks roughly like this (shown as a screenshot in the original post):

As you can see, none of these sentences make sense; it’s hard to read. So we can also change how tokens are split: instead of splitting into words, split into sentences (clauses):

text = text.replace('，', ' ，').replace('。', ' 。').replace('？', ' ？').replace('：', ' ：')
token_text = tf.keras.preprocessing.text.text_to_word_sequence(text, split=' ')
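As a rough illustration of what this split produces (my own example, assuming full-width Chinese punctuation in the corpus as reconstructed above, and the imports from the example):

demo = '我家门前有两棵树，一棵是枣树，另一棵也是枣树。'
demo = demo.replace('，', ' ，').replace('。', ' 。').replace('？', ' ？').replace('：', ' ：')
print(tf.keras.preprocessing.text.text_to_word_sequence(demo, split=' '))
# e.g. ['我家门前有两棵树', '，一棵是枣树', '，另一棵也是枣树', '。']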

Everything else stays the same, and now you get somewhat more interesting text. For example, here is the result of training on some of Yu Qiuyu’s essays (also shown as a screenshot in the original post):

It’s still messy and pointless, but at least it looks a little more comfortable.

If you want better results, the simplest approach is to use more data and a larger network. Or just use one of the very large pretrained models such as GPT-3 or CPM 🙂