Topic modeling is a technique for extracting hidden topics from large amounts of text. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm with an excellent implementation in Python’s Gensim package. However, the challenge is to extract high-quality topics that are clear, isolated and meaningful. Much depends on the quality of text preprocessing and the strategy for finding the optimal number of topics. This tutorial attempts to address both of these issues.

Contents

1. Introduction
2. Prerequisites – Download NLTK stop words and spaCy model
3. Import packages
4. What does LDA do?
5. Prepare stop words
6. Import newsgroup data
7. Remove emails and newline characters
8. Tokenize words and clean up text
9. Create bigram and trigram models
10. Remove stop words, make bigrams and lemmatize
11. Create the dictionary and corpus needed for topic modeling
12. Build the topic model
13. View the topics in the LDA model
14. Compute model perplexity and coherence scores
15. Visualize topics – keywords
16. Build the LDA Mallet model
17. How to find the optimal number of topics for LDA?
18. Find the dominant topic in each sentence
19. Find the most representative document for each topic
20. Topic distribution across documents

1. Introduction

One of the main applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such large text collections are social media feeds, customer reviews of hotels and movies, user feedback, news reports, customer-complaint emails, and so on.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators and political campaigns. At the same time, it is very difficult to read through such large volumes of text manually and identify the topics.

Therefore, an automatic algorithm is needed that can read text documents and automatically output the topics discussed.

In this tutorial, we will take a real example, the ’20 Newsgroups’ dataset, and use LDA to extract the naturally discussed topics.

I will use the Latent Dirichlet Allocation (LDA) implementation from the Gensim package, along with Mallet’s implementation (via Gensim). Mallet implements LDA efficiently; it is known to run faster and to give better topic segregation.

We will also extract the volume and percentage contribution of each topic, to get an idea of how important each topic is.

Let’s get started!

Topic modeling in Python with Gensim. Photo by Jeremy Bishop.

2. Prerequisites – Download NLTK stop words and spaCy model

We need the stop words from NLTK and spaCy’s ‘en’ model for text preprocessing. Later, we will use the spaCy model for lemmatization.

Lemmatization is simply converting a word to its root form. For example, the lemma of ‘machines’ is ‘machine’. Likewise, ‘walking’ –> ‘walk’, ‘mice’ –> ‘mouse’ and so on.
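As a quick illustration (a minimal sketch, assuming the spaCy ‘en’ model downloaded in the next code block is already installed), lemmatization with spaCy looks like this:

import spacy

# Load the English model; the parser and NER are not needed for lemmas
nlp = spacy.load('en', disable=['parser', 'ner'])

# Expected to print something like: ['the', 'mouse', 'be', 'walk']
print([token.lemma_ for token in nlp("the mice were walking")])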

# Run in python console
import nltk; nltk.download('stopwords')

# Run in terminal or command prompt
python3 -m spacy download en

3. Import packages

The core packages used in this tutorial are re, gensim, spacy and pyLDAvis. In addition, we will use matplotlib, numpy and pandas for data handling and visualization. Let’s import them.

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

4. What does LDA do?

LDA’s approach to topic modeling is to treat each document as a mixture of topics, in certain proportions, and each topic as a mixture of keywords, again in certain proportions.

Once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic–keyword distribution.

When I say topic, what is it actually and how is it represented?

A topic is nothing more than a collection of dominant keywords that are representative. You can determine the content of a topic simply by looking at the keywords.
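Purely as an illustration (the numbers below are made up, not the output of any model), you can picture the two distributions like this:

# Hypothetical topic: a weighted list of keywords (weights are illustrative)
topic_cars = [('car', 0.016), ('power', 0.014), ('light', 0.010), ('engine', 0.007)]

# Hypothetical document: a weighted mix of topics (proportions roughly sum to 1)
document_topics = [('topic_cars', 0.60), ('topic_sales', 0.25), ('topic_space', 0.15)]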

The following are the key factors for obtaining well-segregated topics:

  1. Quality of text processing.

  2. The text talks about various topics.

  3. Choice of topic modeling algorithm.

  4. The number of topics provided to the algorithm.

  5. Algorithm parameter adjustment.

5. Prepare stop words

We have already downloaded the stop words. Let’s import them and keep them available in stop_words.

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

6. Import newsgroup data

We will use the 20-Newsgroups dataset for this exercise. This version of the dataset contains approximately 11k newsgroup posts from 20 different topics. It is available as newsgroups.json.

It is imported using pandas.read_json, and the resulting dataset has three columns, as shown below.

# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()
['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
 'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
 'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
 'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
 'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
 'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']

The 20 Newsgroups dataset

7. Remove emails and newline characters

As you can see, there are many emails, newline characters and extra spaces in the text, and they are quite distracting. Let’s get rid of them with regular expressions.

# Convert to list
data = df.content.values.tolist()

# Remove Emails
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub('\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("\'", "", sent) for sent in data]

pprint(data[:1])
['From: (wheres my thing) Subject: WHAT car is this! ? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. (..truncated..)']

Even after removing the emails and extra spaces, the text still looks messy. It is not ready for LDA to consume. You need to break each sentence down into a list of words through tokenization, while clearing up all the clutter in the process.

Gensim is very helpful with this.

8. Tokenize words and clean up text

Let’s tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s simple_preprocess() is helpful for this. In addition, I have set deacc=True to remove the punctuation.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

data_words = list(sent_to_words(data))

print(data_words[:1])
[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', (..truncated..)]]

9. Create bigram and trigram models

Bigrams are pairs of words that frequently occur together in the document. Trigrams are sets of three words that frequently occur together.

Some examples from our corpus are: ‘front_bumper’, ‘oil_leak’, ‘maryland_college_park’, etc.

Gensim’s Phrases model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold. The higher these values, the harder it is for words to be combined into bigrams.

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', (..truncated..)]

10. Remove stop words, make bigrams and lemmatize

The bigram model is ready. Let’s define the functions to remove the stop words, make bigrams and lemmatize, and then call them sequentially.

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

We call these functions sequentially.

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
[['where', 's', 'thing', 'car', 'nntp_post', 'host', 'rac_wam', 'umd', 'organization', 'university', 'maryland_college', 'park', 'line', 'wonder', 'anyone', 'could', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'whatev', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]

11. Create the dictionary and corpus needed for topic modeling

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them.

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])
[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 5), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1)]]

Gensim creates a unique ID for each word in the document. The resulting corpus shown above is a mapping of (word_id, word_frequency).

For example, (0,1) above implies that the word ID 0 occurs once in the first document. Again, the word id 1 appears twice, and so on.

This is used as input to the LDA model.

If you want to see the word for a given ID, pass the ID as a key to the dictionary.

id2word[0]
'addition'

Alternatively, you can see the human-readable form of the corpus itself.

# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
[[('addition', 1), ('anyone', 2), ('body', 1), ('bricklin', 1), ('bring', 1), ('call', 1), ('car', 5), ('could', 1), ('day', 1), ('door', 2), ('early', 1), ('engine', 1), ('enlighten', 1), ('front_bumper', 1), ('maryland_college', 1), (..truncated..)]]

Okay, let’s get back on track to the next step: building the topic model.

12. Build a topic model

We have everything we need to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics.

Apart from that, alpha and eta are hyperparameters that affect the sparsity of the topics. According to the Gensim documentation, both default to 1.0/num_topics prior.

chunksize is the number of documents to be used in each training chunk, update_every determines how often the model parameters should be updated, and passes is the total number of training passes.

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

13. View topics in the LDA model

The above LDA model is built from 20 different topics, where each topic is a combination of keywords and each keyword contributes some weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics(), as shown below.

# Print the keywords in the topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0,
  '0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" '
  '+ 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + '
  '0.006*"turn"'),
 (1,
  '0.072*"line" + 0.066*"organization" + 0.037*"write" + 0.032*"article" + '
  '0.028*"university" + 0.027*"nntp_post" + 0.026*"host" + 0.016*"reply" + '
  '0.014*"get" + 0.013*"thank"'),
 (2,
  '0.017*"patient" + 0.011*"study" + 0.010*"slave" + 0.009*"wing" + '
  '0.009*"disease" + 0.008*"food" + 0.008*"eat" + 0.008*"pain" + '
  '0.007*"treatment" + 0.007*"syndrome"'),
 (3,
  '0.013*"key" + 0.009*"use" + 0.009*"may" + 0.007*"public" + 0.007*"system" + '
  '0.007*"order" + 0.007*"government" + 0.006*"state" + 0.006*"provide" + '
  '0.006*"law"'),
 (4,
  '0.568*"ax" + 0.007*"rlk" + 0.005*"tufts_university" + 0.004*"ei" + '
  '0.004*"m" + 0.004*"vesa" + 0.004*"differential" + 0.004*"chz" + 0.004*"lk" '
  '+ 0.003*"weekly"'),
 (5,
  '0.029*"player" + 0.015*"master" + 0.015*"steven" + 0.009*"tor" + '
  '0.009*"van" + 0.008*"king" + 0.008*"scripture" + 0.007*"cal" + '
  '0.007*"helmet" + 0.007*"det"'),
 (6,
  '0.028*"system" + 0.020*"problem" + 0.019*"run" + 0.018*"use" + 0.016*"work" '
  '+ 0.015*"do" + 0.013*"window" + 0.013*"driver" + 0.013*"bit" + 0.012*"set"'),
 (7,
  '0.017*"israel" + 0.011*"israeli" + 0.010*"war" + 0.010*"armenian" + '
  '0.008*"kill" + 0.008*"as" + 0.008*"attack" + 0.008*"government" + '
  '0.007*"lebanese" + 0.007*"greek"'),
 (8,
  '0.018*"money" + 0.018*"year" + 0.016*"pay" + 0.012*"car" + 0.010*"drug" + '
  '0.010*"president" + 0.009*"rate" + 0.008*"face" + 0.007*"license" + '
  '0.007*"american"'),
 (9,
  '0.028*"god" + 0.020*"evidence" + 0.018*"christian" + 0.012*"believe" + '
  '0.012*"reason" + 0.011*"faith" + 0.009*"exist" + 0.008*"bible" + '
  '0.008*"religion" + 0.007*"claim"'),
 (10,
  '0.030*"physical" + 0.028*"science" + 0.012*"direct" + 0.012*"st" + '
  '0.012*"scientific" + 0.009*"waste" + 0.009*"jeff" + 0.008*"cub" + '
  '0.008*"brown" + 0.008*"msg"'),
 (11,
  '0.016*"wire" + 0.011*"keyboard" + 0.011*"md" + 0.009*"pm" + 0.008*"air" + '
  '0.008*"input" + 0.008*"fbi" + 0.007*"listen" + 0.007*"tube" + '
  '0.007*"koresh"'),
 (12,
  '0.016*"motif" + 0.014*"serial_number" + 0.013*"son" + 0.013*"father" + '
  '0.011*"choose" + 0.009*"server" + 0.009*"event" + 0.009*"value" + '
  '0.007*"collin" + 0.007*"prediction"'),
 (13,
  '0.098*"_" + 0.043*"max" + 0.015*"dn" + 0.011*"cx" + 0.009*"eeg" + '
  '0.008*"gateway" + 0.008*"c" + 0.005*"mu" + 0.005*"mr" + 0.005*"eg"'),
 (14,
  '0.024*"book" + 0.009*"research" + 0.007*"group" + 0.007*"page" + '
  '0.007*"new_york" + 0.007*"irfan" + 0.006*"united_state" + 0.006*"author" + '
  '0.006*"include" + 0.006*"club"'),
 (15,
  '0.020*"order" + 0.017*"say" + 0.016*"people" + 0.016*"think" + 0.014*"make" '
  '+ 0.014*"go" + 0.013*"know" + 0.012*"see" + 0.011*"time" + 0.011*"get"'),
 (16,
  '0.026*"file" + 0.017*"program" + 0.012*"window" + 0.012*"version" + '
  '0.011*"entry" + 0.011*"software" + 0.011*"image" + 0.011*"color" + '
  '0.010*"source" + 0.010*"available"'),
 (17,
  '0.027*"game" + 0.027*"team" + 0.020*"year" + 0.017*"play" + 0.016*"win" + '
  '0.010*"good" + 0.009*"season" + 0.008*"fan" + 0.007*"run" + 0.007*"score"'),
 (18,
  '0.036*"drive" + 0.024*"card" + 0.020*"mac" + 0.017*"sale" + 0.014*"cpu" + '
  '0.010*"price" + 0.010*"disk" + 0.010*"board" + 0.010*"pin" + 0.010*"chip"'),
 (19,
  '0.030*"space" + 0.010*"sphere" + 0.010*"earth" + 0.009*"item" + '
  '0.008*"launch" + 0.007*"moon" + 0.007*"mission" + 0.007*"nasa" + '
  '0.007*"orbit" + 0.006*"research"')]

How do you interpret this?

Topic 0 is represented as '0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + 0.006*"turn"'.

This means that the top 10 keywords contributing to this topic are: ‘car’, ‘power’, ‘light’, etc. The weight of the word ‘car’ on topic 0 is 0.016.

The weights reflect how important the keywords are to the topic.
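If you would rather get these keyword–weight pairs programmatically than read them off the printed string, Gensim’s show_topic() should give the same information (a small sketch; the exact weights depend on your run):

# (word, probability) pairs for topic 0, highest weight first
pprint(lda_model.show_topic(0, topn=10))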

Looking at these keywords, can you guess what this topic could be? You may summarize it as ‘cars’ or ‘automobiles’.

Also, can you browse through the remaining topic keywords and determine what the topic is?


Infer topics from keywords

14. Compute model perplexity and coherence scores

Model perplexity and topic coherence provide a convenient measure of how good a given topic model is. In my experience, the topic coherence score in particular is helpful.

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Perplexity:  -8.86067503009
Coherence Score:  0.532947587081

There you have a coherence score of 0.53.

15. Visualize topics – keywords

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. There is no better tool for this than the pyLDAvis package’s interactive chart, which is designed to work well with Jupyter notebooks.

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis


pyLDAvis output

So how do we interpret pyLDAvis’s output?

Each bubble in the chart on the left represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart rather than clustered in one quadrant.

Models with too many topics often have a lot of overlap, with small bubbles clustered in one area of the diagram.

Well, if you move the cursor over one of the bubbles, the words and bars on the right will be updated. These words are the prominent keywords that make up the chosen topic.

We have successfully built a good-looking topic model.

Given that we already knew the number of natural topics in the document, finding the best model was fairly straightforward.
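When the number of natural topics is not known in advance, a common approach (covered in the continuation of this tutorial) is to train several models and compare their coherence scores. A rough sketch, reusing the corpus, id2word and data_lemmatized objects built above:

# Sketch: train LDA for several topic counts and compare c_v coherence
coherence_by_k = {}
for k in [5, 10, 15, 20, 25]:
    model_k = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                               num_topics=k, random_state=100, passes=10)
    cm = CoherenceModel(model=model_k, texts=data_lemmatized,
                        dictionary=id2word, coherence='c_v')
    coherence_by_k[k] = cm.get_coherence()

# Pick the k with the highest (or a reasonably high) coherence score
print(coherence_by_k)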

To be continued…


English original: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

More articles: www.apexyun.com

Public account: Galaxy 1

Contact email: [email protected]

(Please do not reprint without permission)