Original link: tecdat.cn/?p=6864
Original source: Tecdat (拓端数据部落) official account
We analyze 20,000 messages sent to 20 Usenet bulletin boards. The bulletin boards in this dataset cover topics such as cars, sports, and cryptography.
Preprocessing
We start by reading all the messages in the 20news-ByDate folder, which are organized in subfolders with a file for each message.
raw_text
## # A tibble: 511,655 x 3
##   newsgroup   id    text
##   <chr>       <chr> <chr>
## 1 alt.atheism 49960 From: mathew <mathew@mantis.co.uk>
## 2 alt.atheism 49960 Subject: Alt.Atheism FAQ: Atheist Resources
## 3 alt.atheism 49960 Summary: Books, addresses, music -- anything related to atheism
## 4 alt.atheism 49960 Keywords: FAQ, atheism, books, music, fiction, addres
## # ... with 511,645 more rows
Note the newsgroup column, which records which of the 20 newsgroups each message comes from, and the id column, which identifies a message within its newsgroup.
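The reading step can be sketched in a few lines. The article's own code is not shown, so this is a minimal Python sketch under stated assumptions: the function name `read_messages` is hypothetical, `root` is assumed to be a folder whose subfolders are newsgroups, and each file name is taken as the message id. It produces one row per line of text, the same shape as the table above.

```python
from pathlib import Path


def read_messages(root):
    """Return one (newsgroup, id, text) row per line of every message file.

    Assumed layout: root/<newsgroup>/<message id>, one file per message.
    """
    rows = []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue
        for msg_file in sorted(group_dir.iterdir()):
            if not msg_file.is_file():
                continue
            # Old Usenet posts are not reliably UTF-8; latin-1 decodes any byte.
            content = msg_file.read_text(encoding="latin-1")
            for line in content.splitlines():
                rows.append((group_dir.name, msg_file.name, line))
    return rows
```

Calling `read_messages` on the folder that holds the newsgroup subfolders returns a list of tuples that can be filtered or tokenized in later steps.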
TF-IDF
TF stands for term frequency and IDF for inverse document frequency. We expect the newsgroups to differ in subject and content, and therefore in how often they use each word.
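To make the two terms concrete, here is a minimal Python sketch of one common definition: tf is a word's count divided by the document length, and idf is the natural log of the number of documents divided by the number of documents containing the word. Each newsgroup is treated as one document. The function name `tf_idf` and the toy tokens are invented for illustration and are not taken from the article.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns {name: {word: tf-idf}} with tf = count / document length and
    idf = ln(number of documents / number of documents containing the word).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Toy stand-ins for two newsgroups; the tokens are invented.
toy = {
    "sci.space": ["space", "launch", "the", "orbit", "space"],
    "sci.crypt": ["key", "the", "encryption", "key"],
}
scores = tf_idf(toy)
```

A word that appears in every document, such as "the" in the toy data, gets an idf of zero, so only words that set a newsgroup apart receive a high score.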
Pairwise correlation of word frequencies shows which newsgroups are most similar to each other:
newsgroup_cors
## # A tibble: 380 x 3
##   item1                    item2                    correlation
##   <chr>                    <chr>                          <dbl>
## ...
## 4 talk.religion.misc       alt.atheism                    0.779
## 5 alt.atheism              soc.religion.christian         0.751
## 6 soc.religion.christian   alt.atheism                    0.751
## 7 comp.sys.mac.hardware    comp.sys.ibm.pc.hardware       0.680
## 8 comp.sys.ibm.pc.hardware comp.sys.mac.hardware          ...
## # ... with 370 more rows
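A table of this shape can be produced by correlating the word-count vectors of every ordered pair of newsgroups. With 20 newsgroups there are 20 × 19 = 380 ordered pairs, which matches the 380 rows reported above. The sketch below is a plain-Python illustration using Pearson correlation over raw counts; the names `pearson` and `pairwise_correlations` are hypothetical, and the article's own code may filter or weight words differently.

```python
import math
from collections import Counter
from itertools import permutations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(docs):
    """docs maps a newsgroup name to its tokens.

    Returns (item1, item2, correlation) for every ordered pair,
    sorted from most to least correlated.
    """
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    # A word missing from a newsgroup counts as zero in its vector.
    vectors = {name: [c[w] for w in vocab] for name, c in counts.items()}
    rows = [
        (a, b, pearson(vectors[a], vectors[b]))
        for a, b in permutations(vectors, 2)
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

Because each pair appears in both orders, every correlation value shows up twice, as in rows 5 and 6 of the table.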
Topic modeling
Can LDA (latent Dirichlet allocation) sort out Usenet messages that come from different newsgroups?
Topic 1 clearly represents the sci.space newsgroup (hence the most common word is “space”), and topic 2 probably draws on the cryptography newsgroup, with terms such as “key” and “encryption”.
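For readers who want to see what fitting LDA involves, the following is a small Python sketch of a collapsed Gibbs sampler. Gibbs sampling is swapped in here because it is short to write down; it is not necessarily the estimation method behind the article's results. The function name `lda_gibbs` and the default values of `alpha`, `beta`, and `n_iter` are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token lists.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word
```

Sorting each entry of `topic_word` by count gives the top terms per topic, which is how a topic comes to be read as "space" or "cryptography". Topic numbering is arbitrary and can change with the seed.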
Sentiment analysis
We can use the sentiment analysis techniques we discussed to examine the frequency of positive and negative words in these Usenet posts. Which newsgroups are the most positive or negative overall?
In this example we use the AFINN sentiment lexicon, which assigns each word a numeric positivity score, and visualize the average score of each newsgroup with a bar chart.
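The per-newsgroup score behind such a chart can be sketched as the mean lexicon score of the words each newsgroup uses. In the Python sketch below, `TOY_LEXICON` is a made-up stand-in, not the real AFINN data, and `average_sentiment` is a hypothetical name; it assumes the text has already been split into (newsgroup, word) pairs.

```python
from collections import defaultdict

# Illustrative scores only: a tiny invented stand-in for the AFINN lexicon.
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "awful": -3, "problem": -2}


def average_sentiment(rows, lexicon):
    """rows is an iterable of (newsgroup, word) pairs.

    Returns {newsgroup: mean score of its words that appear in the lexicon}.
    Words missing from the lexicon are ignored.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for newsgroup, word in rows:
        if word in lexicon:
            totals[newsgroup] += lexicon[word]
            matches[newsgroup] += 1
    return {group: totals[group] / matches[group] for group in totals}
```

Sorting the resulting dictionary by value gives the ordering of the bars from most negative to most positive newsgroup.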
Sentiment analysis by word
It’s worth looking into why some newsgroups are more positive or negative than others. To do this, we can examine the total positive and negative contribution of each word.
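A word's total contribution can be taken as its lexicon score multiplied by the number of times it occurs. The Python sketch below illustrates that calculation; the function name `word_contributions` and the toy lexicon are assumptions for illustration, and the scores are invented rather than real AFINN values.

```python
from collections import Counter

# Invented scores for illustration; not the real AFINN values.
TOY_LEXICON = {"good": 3, "bad": -3, "problem": -2}


def word_contributions(words, lexicon):
    """Return (word, occurrences, contribution) rows, largest effect first.

    contribution = lexicon score * number of occurrences.
    """
    counts = Counter(w for w in words if w in lexicon)
    rows = [(w, n, lexicon[w] * n) for w, n in counts.items()]
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)
```

Ranking by absolute contribution surfaces the words that move a newsgroup's score the most in either direction, which is where misleading matches tend to show up.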
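N-gram analysis
The Usenet dataset is a corpus of modern text, so we are interested in how the sentiment analysis in this article holds up when words are read in context as n-grams, for example when a sentiment word directly follows "not".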
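One common use of n-grams in sentiment analysis is to find sentiment words that directly follow a negation, since a word-by-word score counts "not good" as positive. The Python sketch below extracts bigrams and counts such cases. The `NEGATIONS` list and both function names are assumptions made for illustration, not taken from the article.

```python
from collections import Counter

# Assumed negation words, for illustration only.
NEGATIONS = {"not", "no", "never", "without"}


def bigrams(tokens):
    """Consecutive word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))


def negated_sentiment_words(tokens, lexicon):
    """Count lexicon words that directly follow a negation word."""
    return Counter(
        (first, second)
        for first, second in bigrams(tokens)
        if first in NEGATIONS and second in lexicon
    )
```

Multiplying each count by the word's lexicon score estimates how much sentiment was attributed in the wrong direction.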