This is the 7th day of my participation in the November Gengwen Challenge. Check out the details: The Last Gengwen Challenge of 2021.

1. The background

I have two hobbies: one is traditional culture, the other is high technology.

On the traditional side, I love Tang and Song poetry and brush-and-ink painting. On the high-tech side, I work in cutting-edge IT programming and enjoy researching artificial intelligence.

I wanted to connect the two, the old and the new, and see what sparks would fly.

2. The results

Through experimentation with recurrent neural networks applied to text generation, I finally pulled off the trick: provide a beginning, and it will automatically generate a Song ci. What's more, the generated ci is definitely original.

Beginning → Generated text
The drizzle → Drizzle fairy guichun. The bright moon, the dream is broken in sorrow. Idle curtain cold, return. The roost crows are calling.
The wind → The wind to break, the appearance of sleep without wind. People in the dream of cuckoo charm. No man can stand at the door. Qie frosty morning.
Tall building → High-rise lights, nine streets. Tonight outside the building step on the chariot, walk sanxue empty. But to wash. Overlooking the five colors of the world.
Sea breeze → Sea breeze falls tonight, where phoenix floor preference. Wonderful. Waning moon. Will heart castle peak, fall apart.
Tonight → Tonight who and tears on the appendix. The wind is light. Like ling melancholy. Like a clear wave like a jade man. Put to shame.

Having dabbled a little in poetry myself, I am quite satisfied with the text generated from the "tall building" beginning above.

High-rise lights, nine streets. Tonight outside the building step on the chariot, walk sanxue empty. But to wash. Overlooking the five colors of the world.

A tall building implies height, and the text that follows reflects that "high" quality: a tall building looking over the streets is one artistic conception, a tall building gazing into the night is another, and it finally closes with "overlooking the five colors of the world". The word "overlooking" again conveys looking down from above, so the whole text revolves around the theme of "high". Wonderful indeed!

The following analyzes how the Song ci generation is implemented.

3. Implementation method

3.1 Data preparation

I found a Song ci dataset: a CSV file containing 20,000 Song ci.

The first column of the file is the ci title (tune name), the second column is the author, and the third column is the text. The text has already been word-segmented.

To learn about word segmentation, you can check the NLP knowledge point: Chinese word segmentation.
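
As a rough sketch of the layout (the column names here are my own labels, not taken from the dataset itself), a row looks something like this:

<title>, <author>, <word 1> <word 2> <word 3> ...

That is, the third column holds one ci whose words are already separated by spaces, which is exactly the form the Tokenizer below expects.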

3.2 Data reading

Start by importing the packages involved in the entire project.

import csv
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam

Here is how to load the data from the dataset file.

def load_data(num=1000):
    # Read the CSV file. Columns: 0 title | 1 author | 2 content
    csv_reader = csv.reader(open("./ci.csv", encoding="gbk"))
    # Store one ci per list item
    ci_list = []
    for row in csv_reader:
        # For each row, take the column that holds the ci text
        ci_list.append(row[2])
        # Exit the loop once the maximum number is reached
        if len(ci_list) > num:
            break
    return ci_list

To learn more about how to load a CSV dataset, check out NLP: CSV Reading.

The data is then serialized.

def get_train_data():
    # Load the data as the corpus, e.g. ["spring flower autumn moon", "a river of spring water flows east"]
    corpus = load_data()
    # Define a tokenizer
    tokenizer = Tokenizer()
    # Build the word index, e.g. {"spring flower": 1, "autumn moon": 2, "a river": 3}
    tokenizer.fit_on_texts(corpus)

    # Define the input sequences
    input_sequences = []
    # Take each item from the corpus
    for line in corpus:
        # Turn the text into a sequence of indices, e.g. [3, 4, 5, 6]
        token_list = tokenizer.texts_to_sequences([line])[0]
        # Expand [3, 4, 5, 6] into [3, 4], [3, 4, 5], [3, 4, 5, 6]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    # Find the longest item in the corpus
    max_sequence_len = max([len(x) for x in input_sequences])
    # Pad every item to the maximum length with leading zeros
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

    return tokenizer, input_sequences, max_sequence_len

For detailed instructions on why and how to serialize text, see: Tokenizer, texts_to_sequences, and pad_sequences.
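
As a minimal sketch of what these three steps do, here is a toy example (the two-line corpus and the resulting indices are made up purely for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_corpus = ["spring flower", "river water east flow"]
toy_tokenizer = Tokenizer()
# fit_on_texts builds the word-to-index dictionary
toy_tokenizer.fit_on_texts(toy_corpus)
print(toy_tokenizer.word_index)
# something like {'spring': 1, 'flower': 2, 'river': 3, 'water': 4, 'east': 5, 'flow': 6}

# texts_to_sequences replaces each word with its index
seqs = toy_tokenizer.texts_to_sequences(toy_corpus)
print(seqs)              # [[1, 2], [3, 4, 5, 6]]

# pad_sequences fills shorter items with leading zeros up to a fixed length
print(pad_sequences(seqs, maxlen=4, padding='pre'))
# [[0 0 1 2]
#  [3 4 5 6]]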

One thing to note: some extra processing is done here, because training for text prediction means learning to infer the following words from the preceding words.

For example, the sentence "look at the mountain, it is not a mountain; look at the mountain, it is again a mountain" is expanded into several sentences:

look at the mountain, it is not
look at the mountain, it is not a mountain
look at the mountain, it is not a mountain; look at the mountain
look at the mountain, it is not a mountain; look at the mountain, it is again
look at the mountain, it is not a mountain; look at the mountain, it is again a mountain

The idea is to tell the neural network: when the preceding text is "look at the mountain", the next word is "not"; and when the preceding text becomes "look at the mountain, it is not a mountain; look at the mountain", the next word is "again".

In other words, what follows "look at the mountain" is not fixed; it is decided by the whole series of words that come before it.

Breaking one sentence into multiple sentences is the special processing mentioned above, and it is exactly what the following code does:

for i in range(1, len(token_list)):
    n_gram_sequence = token_list[:i+1]
    input_sequences.append(n_gram_sequence)
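
As a minimal sketch of what this loop produces, here is the expansion of one toy token list (the numbers are arbitrary):

token_list = [3, 4, 5, 6]
input_sequences = []
for i in range(1, len(token_list)):
    input_sequences.append(token_list[:i + 1])
print(input_sequences)   # [[3, 4], [3, 4, 5], [3, 4, 5, 6]]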

3.3 Model Building

To train on the data, we first need a neural network model. The following builds a sequential network model.

def create_model(vocab_size, embedding_dim, max_length):
    # Build the sequential model
    model = Sequential()
    # Add an embedding layer
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=max_length))
    # Add a bidirectional long short-term memory (LSTM) layer
    model.add(layers.Bidirectional(layers.LSTM(512)))
    # Add a softmax classification layer over the vocabulary
    model.add(layers.Dense(vocab_size, activation='softmax'))
    # Adam optimizer
    adam = Adam(learning_rate=0.01)
    # Configure the training parameters
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

    return model

For detailed explanations of the knowledge points involved here, see: the neural network Sequential model, layers, and activation functions.
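
As a quick sanity check (the vocabulary size and lengths below are placeholder numbers, not values from the real dataset), the model can be built and inspected like this:

model = create_model(vocab_size=5000, embedding_dim=256, max_length=30)
model.summary()
# Per-sample output shapes, roughly:
#   Embedding               -> (30, 256)   one 256-dim vector per input word
#   Bidirectional LSTM(512) -> (1024,)     512 units forward + 512 units backward
#   Dense + softmax         -> (5000,)     a probability for every word in the vocabulary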

3.4 Training

The training code is as follows:

# Tokenizer, input sequences, maximum sequence length
tokenizer, input_sequences, max_sequence_len = get_train_data()
# Count how many words there are, then add 1; the 1 is the padding word used to equalize lengths
total_words = len(tokenizer.word_index) + 1

# Split input and output from the corpus sequences: the input is the leading words, the output is the last word
xs = input_sequences[:,:-1]
labels = input_sequences[:,-1]

# Convert the labels to one-hot encoding
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
# Create the model
model = create_model(total_words, 256, max_sequence_len-1)
# Start training
model.fit(xs, ys, epochs=15, verbose=1)

# Save the model structure
model_json = model.to_json()
with open('./save/model.json', 'w') as file:
    file.write(model_json)
# Save the trained weights
model.save_weights('./save/model.h5')

Suppose we have training sequences input_sequences that are:

[0, 0, 1, 2]
[0, 0, 3, 4]
[0, 3, 4, 5]
[3, 4, 5, 6]

The corresponding text is:

[0, 0, spring flower, autumn moon]
[0, 0, a river, spring water]
[0, a river, spring water, eastward]
[a river, spring water, eastward, flow]

Training data usually comes in pairs: one input and one output. The machine learns the knack of going from the input to the output.

In this example, because the next word is inferred from the previous words, both the input and the output are taken from the corpus sequences above.

This code takes the input and output from input_sequences:

xs = input_sequences[:,:-1] 
labels = input_sequences[:,-1]
Input xs | Output labels
[0, 0, spring flower] | [autumn moon]
[0, 0, a river] | [spring water]
[0, a river, spring water] | [eastward]
[a river, spring water, eastward] | [flow]

Since the model's output layer uses activation='softmax', the output labels are converted by tf.keras.utils.to_categorical into one-hot codes.

If you have any questions about one-hot encoding, check out: one-hot encoding.
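
As a tiny sketch of what this conversion does (using a made-up vocabulary size of 5), a label index simply becomes a vector with a single 1:

import tensorflow as tf
print(tf.keras.utils.to_categorical([2], num_classes=5))
# [[0. 0. 1. 0. 0.]]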

At this point, several concepts need to be emphasized:

  • The maximum text sequence length max_sequence_len is the length of [a river, spring water, eastward, flow], here 4. Its main purpose is to define a fixed training length: sequences that are too short are padded with 0, and sequences that are too long are truncated (a small sketch follows this list).

You can click here to learn why.

  • The input sequence length input_length is the length of [0, a river, spring water], fixed at 3. It is derived from max_sequence_len by dropping the last word, and it defines the length of the model's input.
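
Here is the small sketch mentioned above, showing both the padding and the truncation behaviour (the lengths and numbers are arbitrary):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Too short: pad with leading zeros up to the fixed length (maxlen=4 here)
print(pad_sequences([[7, 8]], maxlen=4, padding='pre'))           # [[0 0 7 8]]
# Too long: the earliest words are cut off
print(pad_sequences([[3, 4, 5, 6, 7]], maxlen=4, padding='pre'))  # [[4 5 6 7]]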

Finally, the model structure is saved in JSON format and the weights in H5 format.

3.5 Making predictions

Once training is complete, we can enjoy the fruits of our labor and start making predictions.

The prediction code is as follows:

def predict(seed_text, next_words=20):

    # Tokenizer, input sequences, maximum sequence length
    tokenizer, input_sequences, max_sequence_len = get_train_data()

    # Read the saved model structure from training
    with open('./save/model.json', 'r') as file:
        model_json_from = file.read()
    model = tf.keras.models.model_from_json(model_json_from)
    model.load_weights('./save/model.h5')

    # If next_words=20, loop 20 times, predicting one word each time
    for _ in range(next_words):
        # Serialize the current text into indices, e.g. "tall building" -> [50]
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pad the sequence with leading zeros up to the fixed length, e.g. [0, 0, ..., 50]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Predict the index of the next word
        predicted = model.predict_classes(token_list, verbose=0)
        # Define a variable to store the output word
        output_word = ''
        # Find which word the predicted index corresponds to, e.g. 55 is "lights"
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        # Input + output becomes the input for the next prediction, e.g. "tall building lights"
        seed_text = seed_text + " " + output_word

    print(seed_text)
    # Remove the spaces before returning
    return seed_text.replace(" ", "")

# Run a prediction
print(predict('drizzle', next_words=22))
# Drizzle fairy guichun. The bright moon, the dream is broken in sorrow. Idle curtain cold, return. The roost crows are calling.

The JSON and H5 files are read to restore the model.

The prediction needs to be given a leading word, along with how many words to predict after it.

(Flow: beginning word A → predicts B → [A B] → predicts C → [A B C] → predicts D → [A B C D] → ... until the last word N is predicted.)

First, predict_classes(token_list) predicts the next word from the opening word. Then the opening word plus the predicted word become the input for predicting the word after that. And so on, like a snake slowly growing from a single opening word into a long sentence, with every word semantically related to the ones before it.

That is the implementation logic of the Song ci generator. I hope it helps you.