Original link: tecdat.cn/?p=8448

Original source: Tuoduan Data Tribe official WeChat account

Text generation is one of the most recent applications of NLP. Deep learning techniques have been used for a variety of text generation tasks, such as writing poetry, generating movie scripts and even composing music. In this article, however, we’ll look at a very simple example of text generation: given an input sequence of words, we will predict the next word. We will use the original text of Shakespeare’s famous play Macbeth and predict the next word based on a given series of input words.

After completing this article, you will be able to perform text generation using the data set of your choice.

Import libraries and datasets

The first step is to import the libraries and data sets needed to execute the scripts in this article. The following code imports the required libraries:

import numpy as np
import re
import nltk
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout

The next step is to download the data set. We will use Python’s NLTK library to download the dataset.

nltk.download('gutenberg')
print(nltk.corpus.gutenberg.fileids())

You should see the following output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

The shakespeare-macbeth.txt file contains the original text of the play “Macbeth”. To read the text from this file, use the **raw** method of the Gutenberg corpus reader:

macbeth_text = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')

Let’s print the first 500 characters from the dataset:

print(macbeth_text[:500])

Here is the output:


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through

You’ll see that the text contains many special characters and numbers. The next step is to clean up the data set.

Data preprocessing

To remove punctuation and special characters, we’ll define a function called preprocess_text():

def preprocess_text(sen):
    # Remove punctuation marks and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    ...
    return sentence.lower()

The preprocess_text function takes a text string as an argument and returns the cleaned text string in lowercase.

Now let’s clean up the text and print the first 500 characters again:

macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]

Here is the output:

the tragedie of macbeth by william shakespeare actus primus scoena prima thunder and lightning enter three witches when shall we three meet againe in thunder lightning or in raine when the hurley burley done when the battaile lost and wonne  that will be ere the set of sunne where the place vpon the heath there to meet with macbeth come gray malkin all padock  calls anon faire is foule and foule is faire houer through the fogge and filthie ayre exeunt scena secunda alarum within enter king malcom

Convert words to numbers

Deep learning models are based on statistical algorithms. Therefore, in order to use a deep learning model, we need to convert words into numbers.

In this article, we will use a very simple approach and convert each word to a single integer. Before converting words to integers, we need to tokenize the text into individual words.

The following script tokenizes the text in our dataset and then prints the total number of words and the total number of unique words in the dataset:

 

from nltk.tokenize import word_tokenize

...

print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)
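The elided tokenization step is not reproduced here; a minimal sketch consistent with the variable names used above and below could look like this (word_tokenize also needs the NLTK 'punkt' resource, available via nltk.download('punkt')):

macbeth_text_words = word_tokenize(macbeth_text)
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))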

The output looks like this:

Total Words: 17250
Unique Words: 3436

There are 17,250 words in our text, of which 3,436 are unique. To convert the tokenized words to numbers, you can use the Tokenizer class from the keras.preprocessing.text module. Call its fit_on_texts method and pass it the word list. A dictionary will be created in which the keys are words and the values are the corresponding integers.

Look at the script below:

from keras.preprocessing.text import Tokenizer
...
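The elided step fits the tokenizer on the word list, roughly as follows (a sketch; the tokenizer variable name is taken from the code below):

tokenizer = Tokenizer()
tokenizer.fit_on_texts(macbeth_text_words)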

To access the dictionary that maps each word to its corresponding index, use the word_index attribute of the Tokenizer object:

vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index

If you check the length of the dictionary, it will contain 3,436 words, which is the total number of unique words in our dataset.
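You can verify this with a quick check:

print(len(word_2_index))  # 3436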

Now let’s print the 500th unique word and its integer value from the dictionary.

print(macbeth_text_words[500])
print(word_2_index[macbeth_text_words[500]])

Here is the output:

comparisons
1456

Modify data shape

An LSTM accepts data in a 3-D format (number of samples, number of time steps, features per time step). Since the output will be a single word, the shape of the output will be two-dimensional (number of samples, number of unique words in the corpus).

The following script modifies the shape of the input sequence and corresponding output.

input_sequence = []
output_words = []
input_seq_length = 100

for i in range(0, n_words - input_seq_length , 1):
    in_seq = macbeth_text_words[i:i + input_seq_length]
...

In the script above, we declare two empty lists, input_sequence and output_words. Setting input_seq_length to 100 means that each input sequence will contain 100 words. In the first iteration of the loop, the integer values of the first 100 words in the text are appended to the input_sequence list, and the 101st word is appended to the output_words list. In the second iteration, the sequence of words from the second word to the 101st word is stored in the input_sequence list, the 102nd word is stored in the output_words list, and so on. Since there are 17,250 words in the dataset, a total of 17,150 input sequences will be generated (100 fewer than the total number of words).
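The body of the loop is elided in the snippet above; a sketch consistent with this description (the out_seq name is an assumption) would be:

for i in range(0, n_words - input_seq_length, 1):
    in_seq = macbeth_text_words[i:i + input_seq_length]
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])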

Now let’s print the value of the first sequence in the input_sequence list:

print(input_sequence[0])

Output:

[1, 869, 4, 40, 60, 1358, 1359, 408, 1360, 1361, 409, 265, 2, 870, 31, 190, 291, 76, 36, 30, 190, 327, 128, 8, 265, 870, 83, 8, 1362, 76, 1, 1363, 1364, 86, 76, 1, 1365, 354, 2, 871, 5, 34, 14, 168, 1, 292, 4, 649, 77, 1, 220, 41, 1, 872, 53, 3, 327, 12, 40, 52, 1366, 1367, 25, 1368, 873, 328, 355, 9, 410, 2, 410, 9, 355, 1369, 356, 1, 1370, 2, 874, 169, 103, 127, 411, 357, 149, 31, 51, 1371, 329, 107, 12, 358, 412, 875, 1372, 51, 20, 170, 92, 9]

Let’s normalize the input sequences by dividing the integers in the sequences by the largest integer value. The following script also converts the output to a two-dimensional (one-hot) format.
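That code is not shown here; a sketch that produces the shapes reported below, using to_categorical for the one-hot output, would be:

from keras.utils import to_categorical

X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)

y = to_categorical(output_words)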

The following script outputs the input and corresponding output shapes.

print("X shape:", X.shape)
print("y shape:", y.shape)

 

Output:

X shape: (17150, 100, 1)
y shape: (17150, 3437)

Training model

The next step is to train our model. There are no hard and fast rules about how many layers and neurons should be used to train the model.

We will create three LSTM layers, each with 800 neurons. Finally, a dense layer with one neuron per unique word and a softmax activation is added to predict the index of the next word, as shown below:

...
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')
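The layer definitions are elided above; a sketch consistent with the model summary that follows (three LSTM layers of 800 units each and a softmax output over the vocabulary) would be:

model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))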

Since the output word can be one of 3,436 unique words, our problem is a multi-class classification problem, so we use the categorical_crossentropy loss function. For binary classification, binary_crossentropy would be used instead. After executing the script above, you can see the model summary:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 100, 800)          2566400
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 800)          5123200
_________________________________________________________________
lstm_3 (LSTM)                (None, 800)               5123200
_________________________________________________________________
dense_1 (Dense)              (None, 3437)              2753037
=================================================================
Total params: 15,565,837
Trainable params: 15,565,837
Non-trainable params: 0

To train the model, we can simply use the fit() method.

model.fit(X, y, batch_size=64, epochs=10, verbose=1)
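Training ten epochs of a model with roughly 15.5 million parameters can take a long time, which is presumably why load_model is imported at the top: you can save the fitted model and reload it later instead of retraining. A minimal sketch (the file name is an assumption):

model.save('macbeth_lstm.h5')          # save architecture and weights (file name assumed)
model = load_model('macbeth_lstm.h5')  # reload later without retraining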

 

 

Making predictions

To make a prediction, we will randomly select a sequence from the input_sequence list, reshape it into the 3-D format, and pass it to the predict() method of the trained model. The predicted index value is then looked up in the index_2_word dictionary, in which word indices are used as keys; the dictionary returns the word that corresponds to the index passed in.

The following script randomly selects a sequence of integers and prints the corresponding sequence of words:

...
print(' '.join(word_sequence))
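The elided part of this script picks a random encoded sequence and builds the reverse index-to-word mapping, roughly as follows (a sketch; random_seq, index_2_word and word_sequence are the names used elsewhere in the article):

random_seq_index = np.random.randint(0, len(input_sequence) - 1)
random_seq = input_sequence[random_seq_index]

index_2_word = {index: word for word, index in word_2_index.items()}

word_sequence = [index_2_word[value] for value in random_seq]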

For this run of the script, the following sequence was chosen at random:

amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most  need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane

Next, we’ll predict the 100 words that follow the sequence above:

for i in range(100):
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)

...
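The rest of the loop is elided; a sketch of one way to complete it (taking the argmax of the softmax output and sliding the input window forward by one word) would be:

for i in range(100):
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)

    # take the most probable word index from the softmax output
    # (assumes the predicted index is a valid key in index_2_word)
    predicted_word_index = int(np.argmax(model.predict(int_sample, verbose=0)))
    word_sequence.append(index_2_word[predicted_word_index])

    # slide the window: append the prediction and drop the oldest word index
    random_seq.append(predicted_word_index)
    random_seq = random_seq[1:]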

The word_sequence variable now contains the input sequence of words along with the next 100 predicted words, stored as a list. We can simply join the words in the list to get the final output string, as follows:

final_output = ""
for word in word_sequence:
    ...
print(final_output)

Here is the final output:

amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most  need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and

 

Conclusion

In this article, we saw how to use deep learning to create a text generation model with Python’s Keras library.