In real life, text data is everywhere. Understanding and extracting the meaning of textual data is a very active research area known as natural language processing (NLP).

For enterprises, text data can be used to validate, improve, and extend the functionality of products. Three categories of natural language processing tasks come up most often in such practical applications:

  • Identify different user/customer groups (e.g., to forecast churn, lifetime value, or product preferences)

  • Accurately detect and extract different types of feedback (such as positive and negative comments/opinions, or how often specific attributes like clothing size are mentioned)

  • Categorize text according to the user's intent (e.g., a request for basic help versus an urgent question)

Although there are many papers and tutorials about natural language processing available online, there are few effective guides on how to get started and actually solve these problems. That is the purpose of this article.

In this article, we will show how to use machine learning to process text data in eight steps. We'll start with the simplest method and work up from there, then look at more involved techniques such as feature engineering, word vectors, and deep learning. You can think of this article as a high-level overview of the standard approach.

After reading this article, you will have learned:

  • How to collect, prepare, and inspect data

  • How to build simple models and, if necessary, deep learning models

  • How to interpret and understand your model, to make sure it is learning meaningful features rather than noise

There is also a link to an interactive notebook at the end of the article, so you can run the code yourself and experiment with the techniques, especially the more abstract concepts.

Step 1: Collect data

Every machine learning problem starts with data, such as a set of emails, posts, or tweets. Common sources of text data include:

  • Ecommerce reviews (from Amazon, Yelp, and other ecommerce platforms)

  • User-generated content (tweets, Facebook posts, StackOverflow questions, etc.)

  • Problem solving (customer requests, technical support tickets, chat logs)

In this article, we will use a data set provided by CrowdFlower called Disasters on Social Media.

Contributors looked at more than 10,000 tweets retrieved with search terms like "fire," "quarantine," and "chaos," and then flagged whether each tweet referred to a disaster event (as opposed to a joke, a movie review, or something non-disastrous).

Our task is to identify which tweets are actually about a disaster, as opposed to irrelevant topics such as movies. Why? One potential application would be to alert law enforcement officers to emergencies without them being distracted by irrelevant information, such as a new Adam Sandler movie. A particular challenge of this task is that both kinds of tweets contain the same search terms, so we can only distinguish them through subtle differences.

In the following article, we’ll refer to tweets related to disaster events as “disasters” and others as “irrelevant.”

Labels

The data has been labeled, so we know which category each tweet belongs to. Finding and labeling enough data to train a model on is faster, simpler, and cheaper than trying to optimize a complex unsupervised method.

Step 2: Data cleaning

An essential skill for a data scientist is knowing whether the next step should be to work on the model or on the data. A good rule of thumb is to look at the data first, and then clean it. A clean dataset lets the model learn meaningful features instead of being thrown off by irrelevant noise.

You can use the following checklist to clean your data (see the code for more details; a minimal sketch follows the list):

  • Remove all irrelevant characters, such as any non-alphanumeric characters

  • Tokenize your text by splitting it into individual words

  • Remove irrelevant tokens, such as @mentions or URLs

  • Convert all letters to lowercase so that "hello", "Hello", and "HELLO" are treated as the same word

  • Consider combining misspelled or alternatively spelled words into a single representation (e.g., "cool"/"kewl"/"cooool")

  • Consider lemmatization (reducing words such as "am", "are", and "is" to a common form such as "be")
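As a rough illustration, a minimal cleaning pass might look like the sketch below (the CSV filename and the `text` column name are assumptions, not part of the original dataset):

```python
import pandas as pd
from nltk.tokenize import RegexpTokenizer

def standardize_text(df, text_field):
    # Strip URLs and @mentions, drop leftover non-alphanumeric noise,
    # and lowercase everything so "Hello" and "hello" match.
    df[text_field] = df[text_field].str.replace(r"http\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"@\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?'\s]", " ", regex=True)
    df[text_field] = df[text_field].str.lower()
    return df

tweets = pd.read_csv("social_media_disasters.csv")  # assumed filename
tweets = standardize_text(tweets, "text")           # assumed column name

# Tokenize: split each cleaned tweet into separate words.
tokenizer = RegexpTokenizer(r"\w+")
tweets["tokens"] = tweets["text"].apply(tokenizer.tokenize)
```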

After completing these steps and checking for other errors, we are ready to train our models with clean, labeled data!

Step 3: Find a good data representation

Machine learning models take numerical values as input. Our dataset is a list of sentences, so for the model to learn characteristic patterns from it, we first need to find a way to translate each sentence into a form the model can understand: a list of numbers.

One-hot and Bag of Words

The usual way to represent text for a computer is to encode each character as a separate number (ASCII, for example). If we fed such a simple representation to a classifier, it would have to learn the structure of words from scratch based only on our data, which is impossible for most datasets. We need a higher-level approach.

For example, we can build a vocabulary of all the unique words in the dataset and associate each word with a unique index. Each sentence is then represented as a list of counts, one per vocabulary word, recording how many times that word appears in the sentence. This method is called the bag-of-words model, because it completely ignores the order of words in a sentence. As shown below:

Representing sentences with a bag of words. Sentences on the left, their extracted feature vectors on the right. Each index in the vector corresponds to a particular word.
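As a concrete sketch, scikit-learn's `CountVectorizer` builds exactly this representation (the example sentences are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Alex eats plants", "plants eat Alex"]

vectorizer = CountVectorizer()
# fit_transform builds the vocabulary and returns one count vector per sentence.
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # per-sentence word counts; word order is lost
```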

Visualization

There are about 20,000 distinct words in the "Disasters on Social Media" vocabulary, meaning that each sentence is represented by a vector of length 20,000. The vector is mostly zeros, because each sentence contains only a small subset of the vocabulary.

A good way to check whether our representations are capturing information relevant to the problem (i.e., whether tweets are disaster-related) is to visualize them and see whether the classes look well separated. Since vocabularies are usually very large and visualizing 20,000-dimensional data is impossible, we use PCA to project the data down to two dimensions and plot it, as follows:
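A minimal sketch of such a projection, assuming `X` holds the bag-of-words matrix and `y` the labels from the previous steps. The article uses PCA; the sketch uses TruncatedSVD, a close cousin that works directly on sparse count matrices:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# Project the ~20,000-dimensional count vectors down to 2 dimensions
# without densifying the sparse matrix.
svd = TruncatedSVD(n_components=2)
X_2d = svd.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", s=8, alpha=0.5)
plt.title("Bag-of-words vectors projected to 2 dimensions")
plt.show()
```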

The two classes don't seem to separate well, which may be a property of the representation we chose or simply an artifact of the dimensionality reduction. To see whether the bag-of-words features are useful, we can train a classifier on them.

Step 4: Classification

When approaching a problem for the first time, it is usually best to pick the simplest tool that can solve it. For classifying data, a popular choice is logistic regression, which is both versatile and interpretable: it is easy to train, and the most important coefficients can be extracted straight from the model.

We split the data into a training set for fitting the model and a test set for measuring how well it generalizes to unseen data. After training, we get 75.4% accuracy. Not bad! A baseline that always guesses the most frequent class ("irrelevant") would only reach 57%. But even if 75% accuracy were good enough for our needs, we should never ship a model without trying to understand it first.
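A sketch of this step with scikit-learn (the 80/20 split ratio and random seed are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the tweets to measure performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy: %.3f" % accuracy_score(y_test, y_pred))
```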

Step 5: Inspect the results

Check the confusion matrix

The first step in understanding a model is to understand what kinds of errors it makes, and which kinds we would least like it to make. In our example, false positives are irrelevant tweets classified as disasters, and false negatives are disaster tweets classified as irrelevant. If the priority is to react to every potential disaster, we want to reduce false negatives. If resources are limited, we might instead prioritize reducing false positives to avoid false alarms. A good way to visualize this information is the confusion matrix, which compares the model's predictions to the true labels of the data. Ideally, the predictions match reality (the manual labels) exactly, and the matrix is diagonal, running from the top left to the bottom right.

Confusion matrix (green indicates a higher proportion, blue a lower one)
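A minimal sketch for computing and plotting such a matrix with scikit-learn, reusing `y_test` and `y_pred` from the previous step (the label order is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Rows are true labels, columns are predictions; the off-diagonal
# cells are the false positives and false negatives discussed above.
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["irrelevant", "disaster"]).plot()
plt.show()
```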

Our classifier produces proportionally more false negatives than false positives. In other words, its most common error is classifying disaster tweets as irrelevant. If false positives impose a high cost on law enforcement, this could be a reasonable bias for our classifier to have.

Explaining and interpreting the model

To validate the model and interpret its predictions, we need to look at which words it uses to make decisions. If our data is biased, the classifier can make accurate predictions on the sample data while failing to generalize to the real world.

Here, we can plot the most important words for both the disaster and irrelevant classes. Since we can extract and rank the model's coefficients, computing word importance is easy with bag of words and logistic regression.
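A sketch of how those rankings can be extracted, assuming binary labels with 1 = disaster and the `vectorizer` and `clf` objects from the earlier steps:

```python
import numpy as np

def most_important_words(vectorizer, clf, n=10):
    # Map each vocabulary index back to its word, then sort by coefficient:
    # large positive coefficients push predictions toward "disaster",
    # large negative ones toward "irrelevant".
    words = vectorizer.get_feature_names_out()
    order = np.argsort(clf.coef_[0])
    return {
        "disaster": [words[i] for i in order[-n:][::-1]],
        "irrelevant": [words[i] for i in order[:n]],
    }

print(most_important_words(vectorizer, clf))
```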

Bag of words: the most important words for each class

Our classifier correctly picks up on some patterns (Hiroshima, Holocaust, etc.), but clearly overfits on some meaningless terms (heyoo, x1392, etc.). Our bag-of-words model is dealing with a huge vocabulary and treats all words equally; some of those words are very frequent yet contribute only noise to the prediction. Next, we'll try a representation that accounts for word frequency, to help the model extract as much signal from the data as possible.

Step 6: Account for vocabulary structure

TF-IDF feature extraction

To help the model focus on more meaningful words, we can apply a TF-IDF (term frequency–inverse document frequency) weighting on top of our bag-of-words model. TF-IDF weighs words by how rare they are in our dataset, discounting words that appear too frequently and just add noise. Below is the PCA projection of the new TF-IDF representation:

The two colors are more clearly separated here, which should make it easier for the classifier to split the two classes. Training logistic regression on the new representation gives 76.2% accuracy, a sign that the TF-IDF weighting does help.
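Swapping TF-IDF into the earlier pipeline is a small change; a sketch, reusing the `tweets` dataframe and labels `y` from before:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same pipeline as before, but raw counts are replaced by TF-IDF weights,
# which discount words that appear in a large fraction of tweets.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(tweets["text"])

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=40
)
clf_tfidf = LogisticRegression().fit(X_train, y_train)
print("accuracy: %.3f" % clf_tfidf.score(X_test, y_test))
```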

It's a very small improvement. But does our model also pick up on more important words? If it gets better results while avoiding overfitting on irrelevant words, the TF-IDF representation can be considered a real improvement.

As you can see, the words picked up by the new model do look more relevant! Although the test-set metrics only increased slightly, we have much more confidence in what the model is using, and would feel more comfortable deploying it in a system that interacts with users.

Step 7: Leverage semantics

Word2Vec feature representation

The bag-of-words, one-hot, and TF-IDF representations above all build their feature sets from the corpus being analyzed, and use those features to convert the text into numbers the computer can process.

However, when we deploy the model, we are likely to encounter words that never appeared in our training corpus. The previous models cannot classify such new data properly, even when the new words are very similar to ones seen during training.

To solve this problem, we need to capture the semantic meaning of words: the model needs to understand that "good" and "positive" are closer in meaning than "apricot" and "continent." The tool we'll use for this is Word2Vec.

Use pre-trained Word2Vec embeddings

Word2Vec is a technique for learning continuous embeddings for words. Individual embedding dimensions are hard to interpret; you can loosely think of each word as being assigned a multi-dimensional vector (one advantage being that the similarity of any two words can be measured by comparing their vectors). By reading very large amounts of text, Word2Vec learns which words tend to appear in similar contexts. After training on enough data, it produces a 300-dimensional vector for each word in the vocabulary, with semantically similar words ending up close to each other.

The authors of Word2Vec pre-trained the model on a very large corpus and open-sourced the result. Using it, we can inject some semantic knowledge into our own model. The pre-trained word vectors can be downloaded here:

https://code.google.com/archive/p/word2vec/

Sentence-level representations

A quick way to get a sentence embedding for our classifier is to average the Word2Vec vectors of all the words in the sentence. This is similar to the earlier bag-of-words approach, but this time we only lose the syntax of the sentence while keeping its semantic information.
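A sketch of this averaging, using gensim to load the pre-trained vectors (the local file path is an assumption; the binary comes from the link above):

```python
import gensim
import numpy as np

# Load the pre-trained 300-dimensional Google News vectors (large binary file).
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def sentence_embedding(tokens, vectors, dim=300):
    # Average the vectors of the words present in the vocabulary;
    # fall back to a zero vector if no word is known.
    known = [vectors[w] for w in tokens if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

embeddings = np.vstack(
    [sentence_embedding(tokens, word2vec) for tokens in tweets["tokens"]]
)
```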

Word2Vec sentence embedding: averaging the word vectors

Visualizing the Word2Vec sentence embeddings after PCA gives the following:

Here, the two groups of colors separate even more, meaning Word2Vec should help the classifier distinguish the two classes better. Training logistic regression a third time, we get 77.7% accuracy, our best result yet!

Complexity/interpretability trade-offs

Unlike the previous models, the new representation no longer uses one dimension per word, which makes it harder to see which words are most relevant to the classification. We can still inspect the logistic regression coefficients, but they relate to the 300 embedding dimensions rather than to word indices.

Step 8: Leverage syntax with end-to-end approaches

We've seen how to generate compact sentence embeddings quickly and efficiently. However, by dropping word order, we also give up all of the sentence's syntactic information. If these simpler methods don't give satisfactory results, you can use a more complex model that takes whole sentences as input and predicts labels directly, without building an intermediate representation. A common approach is to treat a sentence as a sequence of word vectors, using Word2Vec or more recent methods such as GloVe or CoVe. That is what we'll do below.

Convolutional neural networks (CNNs) for sentence classification train very quickly and work well as an entry-level deep learning architecture. While CNNs are known mainly for their performance on image data, they also deliver excellent results on text-related tasks, and they usually train much faster than most complex NLP approaches (such as LSTMs or encoder/decoder architectures). A CNN preserves word order and learns useful sequence features; unlike the previous models, it can tell the difference between "Alex eats plants" and "Plants eat Alex."
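A minimal Keras sketch of such a network (all hyperparameters are illustrative, not tuned; the embedding layer could also be initialized with the pre-trained Word2Vec vectors from Step 7):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # size of the tokenizer vocabulary (assumed)
EMBED_DIM = 300     # matches the Word2Vec dimensionality

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(128, 3, activation="relu"),  # convolutions over 3-word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),     # disaster vs. irrelevant
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Inputs are tweets converted to padded sequences of word indices, e.g.:
# model.fit(X_train_seq, y_train, validation_split=0.1, epochs=3)
```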

Training this model doesn't require much more work than the previous approaches, yet it performs far better, reaching 79.5% accuracy! As with the earlier steps, the next move should be to inspect and visualize the model's predictions to verify that it really is the best option. By now, you should be able to do that on your own.

Final thoughts

To recap, the approach we used in each step is as follows:

  • Start quickly with a simple model

  • Explain the model’s predictions

  • Understand the kinds of errors the model makes

  • Use this knowledge to decide what to work on next

The models used in these eight steps are just a few specific examples geared toward short text, but the approach behind them applies broadly to all kinds of practical NLP problems.
