Original link:tecdat.cn/?p=4333
Topic modeling
In text mining, we often collect sets of documents, such as blog posts or news articles, that we would like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data: it finds natural groups of items even when we are not sure what we are looking for.
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
A flowchart of a text analysis that incorporates topic modeling: the topicmodels package takes a document-term matrix as input and produces a fitted model that can be tidied by tidytext, so that it can be manipulated and visualized with dplyr and ggplot2.
Latent Dirichlet allocation
Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.
Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
Every topic is a mixture of words. For example, we could imagine a two-topic model of news stories, with one topic for “sports” and one for “entertainment.” The most common words in the sports topic might be “basketball,” “football,” and “swimming,” while the entertainment topic may be made up of words such as “movie,” “television,” and “actor.” Importantly, words can be shared between topics; a phrase like “Olympic champion” might appear in both.
LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words associated with each topic, while also determining the mixture of topics that describes each document. There are a number of existing implementations of this algorithm, and we will explore one of them in depth.
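To make these two principles concrete before fitting a real model, here is a minimal base-R sketch with entirely made-up numbers (the topics, vocabulary, and probabilities are hypothetical, not estimates from any data). Under LDA's generative view, a document's word distribution is the topic-proportion-weighted mixture of the topics' word distributions.

```r
# Two hypothetical topics over a tiny vocabulary; each topic is a
# mixture of words (each row sums to 1).
vocab <- c("basketball", "football", "movie", "actor")
beta <- rbind(
  sports        = c(0.50, 0.40, 0.05, 0.05),
  entertainment = c(0.05, 0.05, 0.50, 0.40)
)
colnames(beta) <- vocab

# Each document is a mixture of topics: say document 1 is
# 90% sports and 10% entertainment.
gamma_doc1 <- c(sports = 0.9, entertainment = 0.1)

# The implied word distribution for document 1 is the
# topic-weighted average of the topic-word distributions.
p_words_doc1 <- as.vector(gamma_doc1 %*% beta)
names(p_words_doc1) <- vocab
round(p_words_doc1, 3)
# basketball = 0.455, football = 0.365, movie = 0.095, actor = 0.085
```

LDA works in the opposite direction: given only the observed words, it estimates both the beta and gamma quantities at once.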
library(topicmodels)

data("AssociatedPress")
AssociatedPress

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
We can use the LDA() function from the topicmodels package, setting k = 2, to create a two-topic LDA model.
Almost any topic model in practice will use a larger k, but as we will soon see, this analysis approach extends to a larger number of topics.
This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
# set a seed so that the output of the model is reproducible
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
ap_lda

## A LDA_VEM topic model with 2 topics.
Fitting the model was the “easy part”: the rest of the analysis will involve exploring and interpreting the model using functions from the tidytext package.
Word-topic probabilities
The tidytext package provides a method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model: tidy(ap_lda, matrix = "beta").
## # A tibble: 20,946 x 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 aaron     1.69e-12
##  2     2 aaron     3.90e- 5
##  3     1 abandon   2.65e- 5
##  4     2 abandon   3.99e- 5
##  5     1 abandoned 1.39e- 4
##  6     2 abandoned 5.88e- 5
##  7     1 abandoning 2.45e-33
##  8     2 abandoning 2.34e- 5
##  9     1 abbott    2.13e- 6
## 10     2 abbott    2.97e- 5
## # ... with 20,936 more rows
The most common words in each topic
This visualization gives us a sense of the two topics that were extracted from the articles. The most common words in topic 1 include “percent,” “million,” “billion,” and “company,” suggesting that it may represent business or financial news. The most common words in topic 2 include “president” and “government,” suggesting that this topic represents political news. One important observation is that some words, such as “new” and “people,” are common in both topics. This is an advantage of topic modeling as opposed to “hard clustering” methods: topics used in natural language can have some overlap in terms of words.
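The per-topic ranking behind a visualization like this can be sketched in base R. The small table below is a stand-in with invented beta values (a real analysis would rank the full tidied output of the fitted model):

```r
# Stub of a tidied topic-word table (topic, term, beta);
# the beta values here are invented for illustration only.
ap_topics <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("percent", "million", "new", "president", "government", "new"),
  beta  = c(0.032, 0.019, 0.013, 0.025, 0.016, 0.012)
)

# For each topic, keep the two terms with the highest beta.
top_terms <- do.call(rbind, lapply(
  split(ap_topics, ap_topics$topic),
  function(d) d[order(-d$beta), ][1:2, ]
))
top_terms$term
# "percent" "million" "president" "government"
```

Note that “new” appears under both topics in the stub, echoing the overlap between topics described above.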
Alternatively, we could consider the terms that show the greatest difference in β between the two topics; this can be estimated from the log ratio of the two topics' β values.
## # A tibble: 198 x 4
##    term              topic1      topic2 log_ratio
##    <chr>              <dbl>       <dbl>     <dbl>
##  1 administration 0.000431  0.00138        1.68
##  2 ago            0.00107   0.000842      -0.339
##  3 agreement      0.000671  0.00104        0.630
##  4 aid            0.0000476 0.00105        4.46
##  5 air            0.00214   0.000297      -2.85
##  6 american       0.00203   0.00168       -0.270
##  7 analysts       0.00109   0.000000578  -10.9
##  8 area           0.00137   0.000231      -2.57
##  9 army           0.000262  0.00105        2.00
## 10 asked          0.000189  0.00156        3.05
## # ... with 188 more rows
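The log_ratio column in the table above is the base-2 logarithm of the ratio between a term's β in topic 2 and its β in topic 1 (terms are typically first filtered to those with a β above some small threshold in at least one topic, so that very rare words do not dominate). Using the values shown for “administration” as a check:

```r
beta_topic1 <- 0.000431   # beta for "administration" in topic 1
beta_topic2 <- 0.00138    # beta for "administration" in topic 2

# Positive values mean the word is more associated with topic 2,
# negative values mean it is more associated with topic 1.
log_ratio <- log2(beta_topic2 / beta_topic1)
round(log_ratio, 2)
# 1.68
```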
The figure shows the words that differ the most between the two topics.
As we can see, the words more common in topic 2 include political parties such as “democratic” and “republican,” as well as politicians' names. Topic 1 is characterized by currencies such as “yen” and “dollar,” along with financial terms such as “index,” “prices,” and “rates.” This helps confirm that the two topics the algorithm identified were political and financial news.
Document-topic probabilities
Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called γ (“gamma”), with tidy(ap_lda, matrix = "gamma").
## # A tibble: 4,492 x 3
##    document topic    gamma
##       <int> <int>    <dbl>
##  1        1     1 0.248
##  2        2     1 0.362
##  3        3     1 0.527
##  4        4     1 0.357
##  5        5     1 0.181
##  6        6     1 0.000588
##  7        7     1 0.773
##  8        8     1 0.00445
##  9        9     1 0.967
## 10       10     1 0.147
## # ... with 4,482 more rows
Each of these values is the estimated percentage of words generated from that topic in the document. For example, the model estimates that about 24.8% of the words in document 1 were generated from topic 1.
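With k = 2, each document's γ values sum to one across the two topics, so topic 2's share of a document is implied by its topic 1 value. For document 1 above:

```r
gamma_doc1_topic1 <- 0.248               # from the gamma table above
gamma_doc1_topic2 <- 1 - gamma_doc1_topic1

round(gamma_doc1_topic2, 3)
# 0.752 -> about 75.2% of the words in document 1 are estimated
#          to have been generated from topic 2
```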
We can see that many of these documents were drawn from a mix of the two topics, but document 6 was drawn almost entirely from topic 2, with a γ for topic 1 close to zero. To check this answer, we can tidy the document-term matrix and examine the most common words in that document.
#> # A tibble: 287 x 3
#>    document term           count
#>       <int> <chr>          <dbl>
#>  1        6 noriega           16
#>  2        6 panama            12
#>  3        6 jackson            6
#>  4        6 powell             6
#>  5        6 administration     5
#>  6        6 economic           5
#>  7        6 general            5
#>  8        6 i                  5
#>  9        6 panamanian         5
#> 10        6 american           4
#> # ... with 277 more rows
Based on the most common words, this document appears to be an article about the American government and Panama (note “noriega,” “panama,” and “panamanian”), which means the algorithm was right to place it almost entirely in topic 2 (as political/national news).
Most popular insights
1. Research hotspots of big-data journal articles
2. 618 online-shopping data review: what do the “shopaholics” care about
3. TF-IDF topic modeling, sentiment analysis and N-gram modeling for text mining in R
4. Topic modeling visualization in Python: interactive LDA and t-SNE visualization
5. Text mining of NASA data in R: network analysis, TF-IDF and topic modeling
6. LDA topic modeling and t-SNE visualization in Python
7. Topic modeling analysis of text data in R
8. Topic modeling analysis of NASA metadata text mining using R
9. Analyzing web-scraped semantic data with Python crawlers and LDA topic modeling