So far, we have used machine learning in a variety of settings: topic modeling, clustering, categorization, text summarization, and even the POS taggers and NER taggers we relied on were trained with machine learning. In this chapter we begin to explore a cutting-edge machine learning technique: deep learning. Deep learning uses biologically inspired network structures to perform text learning tasks such as text generation, classification, and word embedding. This chapter discusses the basics of deep learning and how to implement a deep learning model for text. It covers the following topics:

  • Deep learning;
  • The application of deep learning in text;
  • Text generation techniques.

13.1 Deep Learning

The previous chapters introduced machine learning techniques, including topic models, clustering and classification algorithms, and what we might call shallow learning: word embeddings. Word embeddings are the first neural network models readers encounter in this book, and they learn semantic information about words.

A neural network can be understood as a computing system or machine learning algorithm whose structure is inspired by the biological neurons in the brain. We can only describe neural networks in such general terms because our current understanding of the human brain is far from complete. What neural networks borrow from the brain is the idea of connected neurons, which shows up in models such as the perceptron and single-layer neural networks.

A standard neural network consists of a number of neuron nodes acting as operational units that interact with each other through connections. The model is loosely analogous to the structure of the brain, with nodes representing neurons and the links between them representing connections between neurons. Different layers of neurons perform different kinds of operations; the network shown in Figure 13.1 contains one input layer, multiple hidden layers, and one output layer.

 

Figure 13.1 Example of neural network structure
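To make the layered structure in Figure 13.1 concrete, here is a minimal NumPy sketch (not from the book; the layer sizes and the sigmoid activation are arbitrary choices for illustration) of how a signal passes from the input layer through one hidden layer to the output layer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.5, -1.2, 3.0])     # input layer: a vector of 3 features
W1 = np.random.randn(4, 3) * 0.1    # connections from the 3 inputs to 4 hidden neurons
W2 = np.random.randn(2, 4) * 0.1    # connections from the 4 hidden neurons to 2 outputs

hidden = sigmoid(W1 @ x)            # hidden-layer activations
output = sigmoid(W2 @ hidden)       # output-layer activations
print(hidden, output)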

Conversely, research into neural networks has advanced cognitive science and helped us understand the human brain. The tasks mentioned above, categorizing, clustering, and creating vectors for words and documents, can all be accomplished by machine learning algorithms built on neural networks.

Outside the field of text analysis, neural networks have achieved great success. At present, state-of-the-art results in image classification, machine vision, speech recognition, and medical diagnosis are usually achieved with neural networks. As mentioned above, neural networks can generate word vectors: the values stored in the hidden layer of the network in the figure can be used as word vectors.

This section introduces deep learning by extending the topic of neural networks. Deep learning simply refers to neural networks with multiple layers. At present, the vast majority of neural networks in use have a multi-layer structure, which is to say they are deep learning models. There are exceptions, such as Word2Vec, where we only extract the weights from a single layer.
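As a reminder of what that looks like in practice, here is a minimal sketch (a toy corpus invented for this example, not from the book) using Gensim's Word2Vec, where the learned weights of that single layer are read off as word vectors; the size parameter follows the Gensim 3.x API:

from gensim.models import Word2Vec

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"]]   # toy corpus
w2v = Word2Vec(sentences, size=50, min_count=1)                  # one layer of learned weights
print(w2v.wv["dog"])                                             # the 50-dimensional word vector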

Neural networks and deep learning are used in many fields, and although this book does not work through the underlying mathematics, it treats deep learning as a preferred option for natural language processing. The next section looks at how to apply deep learning to text analysis.

13.2 Application of deep learning in text

When we learned about word embeddings, we got a first taste of the power of neural networks. Extracting useful information from the structure of the network itself, as word embeddings do, is only part of what a neural network can do. Once we start using deeper networks, where it is no longer sensible to pull out individual weights as useful information, we become more interested in the network's natural output. Neural networks can be trained to perform many tasks related to text analysis, and for some of these tasks they have completely changed the way we approach the problem.

One of the best use cases for deep learning is machine translation, specifically Google's neural translation model. Until September 2016, Google used statistical and rule-based methods and models for language translation, but the Google Brain research group then switched to neural networks, an approach that also enables what is called zero-shot translation. Previously, to translate between a pair such as Malay and Arabic, Google would first translate the source language into English. With the neural model, the system accepts an input sentence in the source language and, instead of immediately emitting the translated target sentence, runs a set of internal scoring mechanisms, such as a grammar check. The deep translation model is much simpler than traditional translation pipelines, which break down the source sentences, perform rule-based translation, and then reassemble the pieces into one sentence. Although the deep model requires more training data and longer training time, its model file is still smaller than that of a statistical translation model. More and more language pairs are being moved to deep models, which outperform the previous systems; the recent Hindi translation model is one example.

Although machine translation has made great progress, it still has many shortcomings. For example, users want grammatically accurate translations, while current systems can only provide output that is semantically close to the target. Just as deep models have flourished in other fields, neural networks are expected to greatly improve machine translation.

Word embeddings are another very popular application of neural networks to text. Considering how word vectors and document vectors are used throughout NLP, word embeddings have a place in most machine learning algorithms that involve text. In fact, replacing the earlier vector representations with word embeddings means that these algorithms and applications now contain a neural network that captures contextual information about words, which helps improve classification and clustering.

Neural networks are widely used in classification and clustering tasks. Many complex applications, such as chatbots, rely on text categorization. Sentiment analysis is essentially a categorization task: deciding whether the current sentiment is positive or negative (or one of several finer-grained emotions). Convolutional neural networks, recurrent neural networks, and other complex architectures can be used for these text classification tasks, but even the simplest single-layer neural network can achieve good predictive results, as in the sketch below.
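Here is a minimal Keras sketch of that last point (the data is random placeholder data, and the 100-dimensional document vectors are an assumption for illustration, not the book's own example): a single-layer network trained as a binary sentiment classifier.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X_train = np.random.rand(500, 100)            # 500 document vectors (placeholder data)
y_train = np.random.randint(0, 2, size=500)   # 0 = negative, 1 = positive (placeholder labels)

clf = Sequential()
clf.add(Dense(1, activation='sigmoid', input_shape=(100,)))   # a single layer of neurons
clf.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
clf.fit(X_train, y_train, epochs=5, batch_size=32)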

Looking back at the POS tagging and NER tagging introduced earlier: both recognize parts of speech and named entities with neural networks, so we were already using deep learning when we tagged parts of speech with spaCy.

The mathematics of neural networks is beyond the scope of this book; when discussing the different types of neural networks and how to use them, we will only discuss their architectures, hyperparameters, and practical applications. A hyperparameter is a configurable parameter of a machine learning algorithm whose value usually needs to be set before the algorithm is run.

For ordinary neural networks and even convolutional neural networks, the sizes of the input and output spaces are fixed and set by the developer. The input or output can be an image, a sentence, or essentially any set of vectors; in natural language processing, the output vector often represents the probabilities that a document belongs to each category. A recurrent neural network (RNN) is a neural network with a special structure that can accept sequence input and perform prediction tasks far more complex than classification. Recurrent neural networks are very common in text analysis because they interpret the input as a sequence, capturing the contextual information of the words in a sentence.

Another application of neural networks to text is the probabilistic language model, which can be understood as computing the probability of the next word (or character) given the preceding text. In other words, the model uses contextual information to compute the probability of the current word. Similar approaches, such as n-gram models, were widely used before neural networks. Traditional methods estimate the co-occurrence probability of adjacent words from a corpus or text database. For example, we treat New York as a phrase because the two words have a very high probability of co-occurring, a probability estimated with conditional probability and the chain rule, as in the sketch below.
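As a rough illustration of the counting approach (a toy corpus invented for this sketch, not from the book), the conditional probability P(york | new) can be estimated directly from bigram and unigram counts:

from collections import Counter

tokens   = "new york is busy . i moved to new york . new ideas".split()
unigrams = Counter(tokens)
bigrams  = Counter(zip(tokens, tokens[1:]))

# P(york | new) = count(new york) / count(new)
p_york_given_new = bigrams[("new", "york")] / unigrams["new"]
print(p_york_given_new)   # 2/3 for this toy corpus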

A neural language model does not just memorize the occurrence probabilities of words and characters; it acts as a sequence generator, which makes it a generative model. Generative models for natural language are very interesting: they learn which kinds of sentences have a high probability of occurring, so a neural network can be used to simulate the kind of text data it was trained on.

Word embeddings are built on this idea: if the word blue appears with the same probability as red after the text the wall is painted, the embedding encodes the two words into nearby points in the same semantic space. This idea later evolved into shared representations, which map semantically equivalent but different kinds of inputs into the same vector space. For example, the English word dog has the same meaning as its Chinese counterpart, so both can be mapped to very similar vectors in a shared Chinese-English vector space. The magic of neural networks is that, with enough training, they can even map images and words into the same space; automatic captioning of images is one such technique.

Deep models that incorporate reinforcement learning, a technique that trains models through rewards and penalties, can already beat humans at Go, once considered one of the hardest areas for artificial intelligence to crack.

One of the earliest natural language processing tasks was text summarization. The traditional way to solve it is to rank sentences by how much information they provide and select a subset of them; this book used such an algorithm in the chapter on text summarization. Deep learning, by contrast, can generate a paragraph of text directly, which is closer to the way humans think: it skips the step of selecting key sentences and creates a summary from a probability model. This technique is often called natural language generation (NLG).

The neural machine translation model mentioned above is a similar generative model: it directly generates sentences in the target language. Let's take this approach as an example and build our first deep model for text.

13.3 Text generation

The previous sections discussed deep learning and natural language processing at length, including text generation techniques capable of producing convincing results. Next, we'll implement some text generation examples.

The neural network architecture we will use is a recurrent neural network, and the specific variant is the LSTM (long short-term memory) network. This network can capture both short-range and long-range context around a word. The most popular blog post about LSTMs is Understanding LSTM Networks, written by Christopher Olah, which gives good insight into their inner workings.

Andrej Karpathy has also written about a similar architecture on his blog, in The Unreasonable Effectiveness of Recurrent Neural Networks, where the implementation language is Lua and the framework is Torch.

The Python-based deep learning ecosystem is growing rapidly, and developers can build deep learning systems in a variety of ways depending on the situation. This book uses a high-level, more abstract framework so readers can easily follow the training process. Choosing a deep learning framework isn't easy in 2018; this book uses Keras as its example framework, but before doing so, let's briefly explore and compare the features of the main frameworks.

  • TensorFlow: a neural network framework released by Google and used by its artificial intelligence team, Google Brain. Unlike purely commercial tools, TensorFlow is maintained by an active open source community and can run on GPUs. GPU support is an important feature, since GPUs can perform the required math much faster than ordinary CPUs. Because TensorFlow is based on computational graphs, it fits the neural network model well. The framework offers both high-level and low-level interfaces and is currently the most popular option in both industry and research.
  • Theano: one of the earliest deep learning frameworks, developed by Yoshua Bengio's group at MILA (the Montreal Institute for Learning Algorithms). It treats symbolic graphs as a core part of building deep learning models and provides low-level interfaces, making it a very powerful system. Although its code is no longer maintained, it is still worth knowing about, if only for the history. The Lasagne and Blocks libraries are higher-level interfaces to Theano that abstract away some of its lower-level operations.
  • Caffe2: Caffe was one of the first frameworks dedicated to deep learning, developed at the University of California, Berkeley. The framework is known for its speed and modularity, although it can feel a little awkward to use since it is not written in Python and requires configuring .prototxt files to define neural networks. This extra step doesn't add much to the learning cost, and some of its features are still well worth using.
  • PyTorch: based on the Lua Torch library, PyTorch has rapidly grown into a member of the deep learning framework family. It was developed and open-sourced by Facebook's AI research group (FAIR) and provides several sets of APIs. Because it has nice properties such as dynamic computation graphs, readers are encouraged to take a look at it.
  • Keras: the deep learning framework used for the examples in this book. It is widely considered the most suitable framework for prototyping because of its many high-level, abstract interfaces, and it supports both TensorFlow and Theano as back ends. We'll see how easy it is to implement the text generation example with it. Keras also has a large and active community, and TensorFlow has announced that Keras will be bundled into a later release, which means Keras will stay very much alive for a long time to come.

It is worth getting to know each framework so you can pick the best one for a given scenario. The underlying techniques are the same across frameworks, so the logic of the text generation process described below carries over to any of them.

The examples in this chapter use a recurrent neural network. The advantage of this network is that it can remember context: the computation at each step uses information carried over from the previous step, hence the name recurrent. This lets it achieve better results on sequential data than other network structures.

We will implement the following example with a variant of the recurrent neural network, the LSTM (long short-term memory) network, which can hold on to information over long spans. LSTMs tend to do well when the input has a temporal or sequential structure. In natural language, where the occurrence of each word is affected by the context of the sentence, this property becomes even more important: the network can understand the surrounding context and remember the words that came before.

For readers interested in the mathematics behind RNN and LSTM, please refer to the following two articles:

  • Understanding LSTM Networks;
  • Unreasonable Effectiveness of Recurrent Neural Networks.

The first step in the sample code is to load the necessary libraries; make sure that Keras and TensorFlow are installed in your environment using pip or conda.

The following code is lightly adapted from the accompanying Jupyter Notebook:

import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import numpy as np

Here we use Keras's Sequential model and will add an LSTM layer to it. The next step is to organize the training data. In theory any text data can be used, depending on what kind of text we want to generate. This is where developers can get creative: with enough data, an RNN can write like J.K. Rowling, Shakespeare, or even in your own style.

Generating text with Keras requires building a map of all the distinct characters in advance (the example here is character-based). For example, suppose the input text is source_data.txt. In the sample code below, the values of the variables depend on the data set chosen, but the code works no matter which text file is selected.

filename    = 'data/source_data.txt'
data        = open(filename).read()
data        = data.lower()
# Find all the unique characters
chars       = sorted(list(set(data)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
ix_to_char  = dict((i, c) for i, c in enumerate(chars))
vocab_size  = len(chars)

Both dictionaries built above are needed as variables: one to feed characters into the model as integers, and the other to turn generated integers back into characters. It is worth inspecting them with print(chars), print(vocab_size), and print(char_to_int).
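For example, the following three calls (purely for inspection) produce the output shown below:

print(chars)
print(vocab_size)
print(char_to_int)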

The contents of the character set are as follows:

['\n', ' ', '!', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

The dictionary size is:

51

After mapping the characters to integer IDs, the dictionary contents are as follows:

{'\n': 0, ' ': 1, '!': 2, '&': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, '?': 22, '[': 23, ']': 24, 'a': 25, 'b': 26, 'c': 27, 'd': 28, 'e': 29, 'f': 30, 'g': 31, 'h': 32, 'i': 33, 'j': 34, 'k': 35, 'l': 36, 'm': 37, 'n': 38, 'o': 39, 'p': 40, 'q': 41, 'r': 42, 's': 43, 't': 44, 'u': 45, 'v': 46, 'w': 47, 'x': 48, 'y': 49, 'z': 50}

The RNN takes a sequence of characters as input and outputs a similar sequence. We now process the data source into such sequences:

seq_length = 100
list_X = []
list_Y = []
# Slide a window of seq_length characters over the text; the next character is the target
for i in range(0, len(data) - seq_length, 1):
    seq_in  = data[i:i + seq_length]
    seq_out = data[i + seq_length]
    list_X.append([char_to_int[char] for char in seq_in])
    list_Y.append(char_to_int[seq_out])
n_patterns = len(list_X)

To convert to a format that matches the model input, further processing is required:

X = np.reshape(list_X, (n_patterns, seq_length, 1))
# Scale the input to [0, 1]; this matches the scaling applied at generation time below
X = X / float(vocab_size)
# Encode the output as one-hot vectors
y = np_utils.to_categorical(list_Y)

Because each prediction outputs a single character, a character-level one-hot encoding is necessary; here np_utils.to_categorical does the encoding. For example, the letter m, which has index 37, is encoded like this:

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0.]

Let’s start formally creating the neural network model:

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

The example above creates a model with a single LSTM layer of 256 units and a fully connected output layer (created with Dense), sets the dropout rate to 0.2, uses softmax as the output activation function, and Adam as the optimization algorithm.

Dropout is used to combat overfitting, the situation where a neural network performs well only on the data set it was trained on. The activation function determines how a neuron's output value is produced, and the optimization algorithm helps the network reduce the error between predicted and true values.

Choosing the values of these hyperparameters is largely a matter of practical experience; the next chapter briefly covers how to select appropriate values for a text processing task. For now, you can treat hyperparameter selection as a black-box step. The hyperparameters used here are the standard ones for generating text with Keras.

The code for training the model is simple and, much as in scikit-learn, comes down to calling the fit function:

filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,save_best_only=True, mode='min')callbacks_list = [checkpoint]# fit the modelmodel.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

The fit function runs over the training data for the specified number of epochs, and the checkpoint callback saves the best weights found so far after each epoch.

Depending on the size of the training set, the fit function can take hours or even days to complete the training.

Alternatively, instead of training from scratch, we can load the weights of a previously trained model:

filename = "weights.hdf5"model.load_weights(filename)model.compile(loss='categorical_crossentropy', optimizer='adam')

Now that we have a trained model, we can start generating character-level text sequences.

# Pick a random seed sequence of integer-encoded characters
start   = np.random.randint(0, len(list_X) - 1)
pattern = list(list_X[start])

Because we want each run to produce different output, the seed above is chosen at random with NumPy. The following loop then generates 250 characters, one at a time:

output = []
for i in range(250):
    x          = np.reshape(pattern, (1, len(pattern), 1))
    x          = x / float(vocab_size)
    prediction = model.predict(x, verbose=0)
    index      = np.argmax(prediction)
    result     = index
    output.append(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print(''.join([ix_to_char[value] for value in output]))

As you can see, given the current seed sequence x, the model predicts the next character with the highest probability of occurring (argmax returns the index of the most probable character); that index is converted back into a character and appended to the output list. The number of characters we want in the output determines how many loop iterations we run.

The network in this LSTM example is not complex; readers can add more layers to achieve better predictions than shown here, as in the sketch below. Even a simple model can improve considerably when trained for more epochs. Andrej Karpathy's blog demonstrates this and provides examples of the model trained on Shakespeare and on the Linux source code.
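As one possible deeper variant (a sketch, not the book's tested configuration), a second LSTM layer can be stacked on top of the first; return_sequences=True is needed so that the first layer passes its full sequence of outputs to the second:

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')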

Preprocessing the input data and training for more epochs can also improve the predictions. Adding network layers and training epochs will, of course, increase the time cost of training. If you just want to experiment with RNNs rather than build a scalable production model, Keras is more than sufficient.

13.4 Summary

This chapter demonstrated the power of deep learning. We successfully trained a text generator whose grammar and spelling approach human level. Further tuning and some additional logic would be needed to create a more realistic chatbot.

Although text generated at this quality is not perfect, neural networks can produce satisfying predictions in other text analysis scenarios, such as text classification and clustering. The next chapter explores text classification using Keras and spaCy.

Before concluding this chapter, readers are advised to read the following articles to deepen their understanding of deep learning text generation techniques:

  • NLP Best Practices;
  • Deep Learning and Representations;
  • The Unreasonable Effectiveness of Recurrent Neural Networks;
  • Best of 2017 for NLP and DL.

This article is excerpted from Natural Language Processing and Computational Linguistics

 

This book describes how to use natural language processing and computational linguistics algorithms to reason about and gain insight from your data. These algorithms are based on statistical machine learning and artificial intelligence techniques, and tools that implement them are readily available in Python libraries such as Gensim and spaCy.

The book starts with data cleansing and then introduces the concepts of computational linguistics. With that in hand, it moves on to real language and text, exploring the more complex areas of statistical NLP and deep learning with the help of Python. You'll learn how to annotate, parse, and model text with the appropriate tools and frameworks, and you'll know when to choose a tool like Gensim for topic modeling and when to use Keras for deep learning.

The book strikes a good balance between theory and practice, so you can run your own natural language processing projects while mastering the theory. You'll discover Python's rich ecosystem of natural language processing tools and enter the fascinating world of modern text analysis.