Original link: tecdat.cn/?p=3897
Original source: Tecdat data tribe official account
Text analysis: Topic modeling
library(tidyverse)
theme_set(theme_bw())
Goals
- Define topic modeling
- Explain Latent Dirichlet Allocation (LDA) and how the process works
- Use LDA to recover the topic structure of a collection whose topics are known
- Use LDA to discover the topic structure of a collection whose topics are unknown
- Determine k, the number of topics
- Methods for choosing appropriate parameters
Topic modeling
Generally, when we search for information online, there are two main methods:
- Keywords – Use a search engine and enter words related to what we are looking for
- Links – follow links between pages; linked pages are likely to share similar or related content.
Another approach is to search and explore documents by topic. Broad topics may correspond to different sections of a publication (national affairs, sports), but there may be specific topics within or across those sections.
To do this, we need detailed information about the topic of every article. Hand-coding a corpus like this would be time-consuming, not to mention that you would need to know the topic structure of the documents before you could even start coding.
Instead, we can use probabilistic topic models: statistical algorithms that analyze the words in the original text documents to uncover the topic structure of the corpus and of the individual documents. They require no manual coding or tagging of the documents beforehand – the topics emerge from the analysis of the text itself.
Latent Dirichlet Allocation
LDA assumes that every document in a corpus is a mixture of topics that run through the whole corpus. The topic structure is hidden – we can only observe the documents and their words, not the topics themselves. Because the structure is hidden (latent), the method tries to infer the topic structure given the observed words and documents.
Food and animals
Suppose you have the following sentences:
- I had bananas and spinach for breakfast.
- I like to eat broccoli and bananas.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Check out this cute hamster munching on a piece of broccoli.
Latent Dirichlet Allocation is a method for automatically discovering the topics these sentences contain. For example, given these sentences and asked for two topics, LDA might produce something like:
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ...
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ...
You might infer that Topic A is about food and Topic B is about cute animals. However, LDA does not explicitly label topics this way. All it does is tell you the probability that a particular word is associated with a topic.
LDA document structure
LDA represents documents as mixtures of topics, where each topic emits words with certain probabilities. It assumes documents are generated in the following way: to write each document, you
- Decide on the number of words N the document will have.
- Choose a topic mixture for the document (over a fixed set of K topics).
- For example, suppose we have the two topics above, food and cute animals.
- Generate each word in the document by:
- First picking a topic (according to the mixture you chose above; for example, you might pick the food topic with probability 1/3 and the cute-animals topic with probability 2/3).
- Then using that topic to generate the word itself (according to the topic's word distribution). For example, the food topic might output "broccoli" with 30% probability, "bananas" with 15% probability, and so on.
How might the sentences in the previous example be generated? When generating document D:
- Choose D to be half about food and half about cute animals.
- Choose 5 as the number of words in D.
- Pick the first word from the food topic, giving the word "broccoli".
- Pick the second word from the cute-animals topic, such as "panda".
- Pick the third word from the cute-animals topic, such as "cute".
- Pick the fourth word from the food topic, such as "cherry".
- Pick the fifth word from the food topic, such as "eat".
Thus, the document generated under the LDA model would be "cute panda eats cherry and broccoli" (note that LDA uses a bag-of-words model, so word order is ignored).
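To make the generative story concrete, here is a small simulation of that process. The topic mixture and word probabilities below are made up for illustration; they are not estimated from any data.

set.seed(1234)

# Hypothetical word distributions for two topics (probabilities are illustrative)
topics <- list(
  food    = c(broccoli = 0.30, bananas = 0.15, breakfast = 0.10, eat = 0.10, cherry = 0.05),
  animals = c(kitten = 0.20, chinchilla = 0.20, cute = 0.20, hamster = 0.15, panda = 0.05)
)

# Per-document topic mixture: half food, half cute animals
theta <- c(food = 0.5, animals = 0.5)

# Generate a 5-word document: pick a topic for each word, then a word from that topic
doc <- replicate(5, {
  z <- sample(names(theta), 1, prob = theta)
  sample(names(topics[[z]]), 1, prob = topics[[z]])
})
paste(doc, collapse = " ")

Running this a few times produces different short "documents", each drawn from the same mixture of the two topics.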
Learning topic models with LDA
Now suppose you have a collection of documents and you have picked some fixed number K of topics to discover. We want to use LDA to learn the topic representation of each document and the words associated with each topic. How? One approach (collapsed Gibbs sampling) works as follows:
- Go through each document and randomly assign each word in it to one of the K topics.
- Because the assignments are random, this initial topic structure is not very good.
- To improve it, go through each word in each document and reassign it to a topic, choosing each topic with probability proportional to how prevalent that topic is in the document and how prevalent the word is in that topic.
- In other words, in this step we assume that all topic assignments except the current word's are correct, and then update the current word's assignment using our model of how documents are generated.
- Repeat the previous step many times and you will eventually reach a roughly stable state.
- You can then use the final assignments to estimate two things:
- the topic mixture of each document (by counting the proportion of words in that document assigned to each topic), and
- the words associated with each topic (by counting the proportion of each topic's assignments that go to each word).
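In practice you rarely implement Gibbs sampling by hand: the topicmodels package can fit LDA by Gibbs sampling directly. A minimal sketch; the burn-in and iteration settings are illustrative assumptions, not values from the original post, and dtm stands for any DocumentTermMatrix such as the one built below.

library(topicmodels)

# Fit LDA by collapsed Gibbs sampling instead of the default variational EM
lda_gibbs <- LDA(dtm, k = 4, method = "Gibbs",
                 control = list(seed = 1234, burnin = 1000, iter = 2000))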
LDA with a known topic structure
LDA is especially useful when you know a priori the topic structure of a set of documents, because you can check whether it recovers that structure.
Here we take the chapters of several classic books, mix them together, and use LDA and topic modeling to discover how the chapters relate to distinct topics (that is, to the books they came from).
As pre-processing, we split the books into chapters, tokenize them into words with tidytext's unnest_tokens, and remove stop_words. Each chapter is treated as a separate "document."
library(tidytext)

by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  unite(title_chapter, title, chapter)

word_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(title_chapter, word, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"

word_counts
## # A tibble: 104,721 x 3
##                title_chapter    word     n
##                        <chr>   <chr> <int>
##  1     Great Expectations_57     joe    88
##  2      Great Expectations_7     joe    70
##  3     Great Expectations_17   biddy    63
##  4     Great Expectations_27     joe    58
##  5     Great Expectations_38 estella    58
##  6      Great Expectations_2     joe    56
##  7     Great Expectations_23  pocket    53
##  8     Great Expectations_15     joe    50
##  9     Great Expectations_18     joe    50
## 10  The War of the Worlds_16 brother    50
## # ... with 104,711 more rows
Latent Dirichlet Allocation (LDA) model
The topicmodels package requires a DocumentTermMatrix (from the tm package). We can use tidytext's cast_dtm() to convert our word counts into one:
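A minimal sketch of that conversion, assuming word_counts is the per-chapter count table created above:

library(tidytext)

# One row per chapter-word pair -> a DocumentTermMatrix with chapters as documents
chapters_dtm <- word_counts %>%
  cast_dtm(title_chapter, word, n)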
chapters_dtm
## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104721/3410774
## Sparsity : 97%
## Maximal term length: 19
## Weighting : term frequency (tf)
Now we are ready to create a four-topic LDA model.
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
## A LDA_VEM topic model with 4 topics.
- In this case we know there are four topics because there are four books; this is the advantage of knowing the underlying topic structure in advance.
- seed = 1234 sets the starting point of the random iterative process. If we had not set a seed, we could estimate a slightly different model each time we ran the script.
Let's start by tidying the model to look at the per-topic-per-word probabilities (beta).
library(tidytext)

chapters_lda_td <- tidy(chapters_lda)
chapters_lda_td
## # A tibble: 72,860 x 3
##    topic    term         beta
##    <int>   <chr>        <dbl>
##  1     1     joe 5.830326e-17
##  2     2     joe 3.194447e-57
##  3     3     joe 4.162676e-24
##  4     4     joe 1.445030e-02
##  5     1   biddy 7.846976e-27
##  6     2   biddy 4.672244e-69
##  7     3   biddy 2.259711e-46
##  8     4   biddy 4.767972e-03
##  9     1 estella 3.827272e-06
## 10     2 estella 5.316964e-65
## # ... with 72,850 more rows
We can use dplyr's top_n to find the top 5 terms within each topic:
top_terms <- chapters_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 20 x 3
##    topic      term        beta
##    <int>     <chr>       <dbl>
##  1     1 elizabeth 0.014107538
##  2     1     darcy 0.008814258
##  3     1      miss         ...
##  4     1    bennet 0.006497512
## # ... with 16 more rows
Visualization
top_terms %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
- These topics correspond very clearly to the four books
- "nemo", "sea", and "nautilus" belong to Twenty Thousand Leagues Under the Sea
- "jane", "darcy", and "elizabeth" belong to Pride and Prejudice
Also note that LDA() does not assign any labels to the topics. They are simply topics 1, 2, 3, and 4. We can infer that these correspond to the individual books, but that is only our inference.
Classification by document
Each chapter is a “document” in this analysis. Therefore, we might want to know which topics are associated with each document. Can we put these chapters back in the correct book?
chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")

chapters_lda_gamma
## # A tibble: 772 x 3
##    document              topic  gamma
##    <chr>                 <int>  <dbl>
## # ... 772 rows, one per document-topic combination (output truncated)
Each row gives the estimated proportion (gamma) of one document generated by one topic. With these per-document topic probabilities in hand, we can see how well our unsupervised learning did at distinguishing the four books.
First, we split the document name back into its title and chapter:
chapters_lda_gamma <- chapters_lda_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

chapters_lda_gamma
## # A tibble: 772 x 4
##    title                 chapter topic        gamma
##    <chr>                   <int> <int>        <dbl>
##  1 Great Expectations        ...     1 1.470726e-05
## # ... with 771 more rows
Then we look at how the gamma values are distributed across chapters within each book:
ggplot(chapters_lda_gamma, aes(gamma, fill = factor(topic))) +
geom_histogram() +
facet_wrap(~ title, nrow = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We notice that almost all chapters from Pride and Prejudice, The War of the Worlds, and Twenty Thousand Leagues Under the Sea are cleanly identified with a single topic each.
chapter_classifications <- chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(gamma)

chapter_classifications
## # A tibble: 193 x 4
##    title              chapter topic     gamma
##    <chr>                <int> <int>     <dbl>
##  1 Great Expectations     ...   ... 0.5464851
## # ... with 192 more rows
An important step in the LDA estimation algorithm is assigning each word in each document to a topic. The more words in a document that are assigned to a given topic, the more weight (gamma) generally goes to that document-topic classification.
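tidytext's augment() recovers those per-word topic assignments. A brief sketch, not shown in the original post, assuming the chapters_lda model and chapters_dtm matrix from above:

library(tidytext)

# Adds a .topic column giving the topic each term in each document was assigned to
assignments <- augment(chapters_lda, data = chapters_dtm)
assignments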
LDA with unknown topic structure
Often when using LDA, you don’t actually know the underlying topic structure of the document. In general, this is why you use LDA to analyze text in the first place.
Associated Press articles
The data is a document-term matrix of a sample of Associated Press articles published in 1992. Let's load it into R and convert it to tidy format.
## # A tibble: 302,031 x 3
##    document       term count
##  1        1     adding     1
##  2        1      adult     2
##  3        1        ago     1
##  4        1    alcohol     1
##  5        1  allegedly     1
##  6        1      allen     1
##  7        1 apparently     2
##  8        1   appeared     1
##  9        1   arrested     1
## 10        1    assault     1
## # ... with 302,021 more rows
Why tidy it first? Because the original document-term matrix contains stop words, and we want to remove them before modeling. After filtering out the stop words, the data is cast back into a document-term matrix.
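The original post doesn't show this step; here is a minimal sketch of the round trip, assuming the AssociatedPress data that ships with the topicmodels package:

library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress", package = "topicmodels")

# DTM -> tidy one-term-per-row format, drop stop words, then back to a DTM
ap_dtm <- tidy(AssociatedPress) %>%
  anti_join(stop_words, by = c("term" = "word")) %>%
  cast_dtm(document, term, count)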
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
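The original post also doesn't show the model fit for the AP data. A minimal sketch of what it presumably looked like; k = 4 matches the four topics examined next, and the seed value is an assumption:

library(topicmodels)

ap_lda <- LDA(ap_dtm, k = 4, control = list(seed = 1234))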
What are the top words for each topic?
top_terms <- tidy(ap_lda) %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 20 x 3
##    topic       term        beta
##    <int>      <chr>       <dbl>
##  1     1     soviet 0.009502197
##  2     1 government 0.009198486
##  3     1  president 0.007046753
##  4     1     united 0.006507324
##  5     1     people 0.005402784
##  6     2     people 0.007454587
##  7     2     police 0.006433472
## ...
## 16     4    percent 0.023766679
## 17     4    million 0.012489935
## 18     4    billion 0.009864418
## 19     4     market 0.008402463
## 20     4     prices         ...

top_terms %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
These four topics appear to capture broad, recognizable themes.
What if we set k = 12? We fit a model in the same way (call the tidied result ap_lda_td_12) and ask how our results change:
top_terms <- ap_lda_td_12 %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 60 x 3
##    topic      term        beta
##    <int>     <chr>       <dbl>
##  1     1  military 0.011691176
##  2     1    united 0.011598436
##  3     1      iraq 0.010618221
##  4     1 president 0.009498227
##  5     1  american 0.008259379
##  6     2      bush 0.007300862
## ...
##  8     2  campaign 0.006366915
##  9     2    people 0.006098596
## 10     2    school 0.005208529
## # ... with 50 more rows
Well, these topics seem more specific, but they are not necessarily easier to interpret, and the same would hold as we keep trying other values of k.
Choosing among them is partly a matter of intuition and domain judgment. But we can bring in a quantitative aid: perplexity.
Perplexity is a statistical measure of how well a probability model predicts a sample. You estimate the LDA model, then compare the theoretical word distributions implied by the topics with the actual word usage in the topics and documents.
perplexity() computes this value for a given model.
perplexity(ap_lda)
## [1] 2301.814
On its own, that number is not very meaningful. Its benefit comes from comparing the perplexity of models fitted with different values of k: the model with the lowest perplexity is generally considered "best".
Let’s evaluate a series of LDA models on the AP data set.
n_topics <- c(2, 4, 10, 20, 50, 100)

ap_lda_compare <- n_topics %>%
  map(LDA, x = ap_dtm, control = list(seed = 1109))

tibble(k = n_topics,
       perplex = map_dbl(ap_lda_compare, perplexity)) %>%
  ggplot(aes(k, perplex)) +
  geom_point() +
  labs(x = "Number of topics (k)",
       y = "Perplexity")
It looks like the 100-topic model has the lowest perplexity. What topics does it produce? Let's take a look at the first 12 topics generated by that model:
ap_lda_td <- tidy(ap_lda_compare[[6]])

top_terms <- ap_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 50 x 3
##    topic       term        beta
##    <int>      <chr>       <dbl>
##  1     1      party 0.020029039
##  2     1  communist 0.013810107
##  3     1 government 0.013221069
##  4     1       news 0.013036980
##  5     1     soviet 0.011512086
## # ... with 45 more rows

top_terms %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  coord_flip()
We are now getting more specific topics. The question is how to present these results and use them in an informative way.
Again, this is where intuition and domain knowledge matter for a researcher. You can use perplexity as one data point in your decision process, but much of the time it comes down to looking at the topics themselves and the highest-probability words associated with each one, and judging whether the structure makes sense. Having a known topic structure to compare against (as with the books above) is also useful.