Original link: tecdat.cn/?p=22984
Original source: Tuoduan Data Tribe public account
Once we've cleaned up our text and done some basic word-frequency analysis, the next step is to understand the opinion or emotion expressed in the text. This is known as sentiment analysis, and this tutorial will walk you through a simple approach to it.
In a nutshell
This tutorial is an introduction to sentiment analysis. It builds on the Tidy Text tutorial, so if you haven't read that one, I recommend starting there. This tutorial covers the following:
- Replication requirements: what is required to reproduce the analysis in this tutorial
- Sentiment data sets: the primary data sets used to score sentiment
- Basic sentiment analysis: performing basic sentiment analysis
- Comparing sentiments: comparing the differences between sentiment lexicons
- Common sentiment words: finding the most common positive and negative words
- Sentiment analysis of larger units: analyzing sentiment in larger units of text rather than individual words
Replication requirements
This tutorial leverages harrypotter text data to illustrate text mining and analysis capabilities.
library(tidyverse)  # data manipulation and plotting
library(stringr)    # text cleaning and regular expressions
library(tidytext)   # additional text mining tools
We'll work with all seven novels:

philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)
Each text is stored as a character vector in which each element represents a chapter. For example, the raw text of the first two chapters of philosophers_stone is shown below.
philosophers_stone[1:2]
## [1] "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank
## you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold
## with such nonsense.  Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly
## any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck,
## which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a
## small son called Dudley and in their opinion there was no finer boy anywhere.  The Dursleys had everything they wanted, but they also
## had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out
## about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn'... <truncated>
## [2] "THE VANISHING GLASS  Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but
## Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys'
## front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen
## that fateful news report about the owls. Only the photographs on the mantelpiece really showed how much time had passed. Ten years ago,
## there had been lots of pictures of what looked like a large pink beach ball wearing different-colored bonnets -- but Dudley Dursley was
## no longer a baby, and now the photographs showed a large blond boy riding his first bicycle, on a carousel at the fair, playing a
## computer game with his father, being hugged and kissed by his mother. The room held no sign at all that another boy lived in the house,
## too.  Yet Harry Potter was still there, asleep at the moment, but no... <truncated>
Sentiment data sets
A variety of lexicons exist for evaluating the opinion or emotion in a text. The tidytext package contains three sentiment lexicons in the sentiments data set.
sentiments
## # A tibble: 23,165 × 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 23,155 more rows
The three lexicons are
AFINN
bing
nrc
All three lexicons are based on unigrams (single words). They contain many English words that have been assigned scores for positive/negative sentiment, and possibly also for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion ("yes"/"no") into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")
Basic sentiment analysis
To perform sentiment analysis, we need our data in a tidy format. Here, all seven Harry Potter novels have been converted into a tibble in which each word is arranged by chapter and book. See the Tidy Text tutorial for more details.
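The tidying step itself is not reproduced in this post; below is a minimal sketch of how the series tibble can be built, assuming each novel is available as a character vector with one chapter per element (as in the harrypotter text data mentioned above). The names titles and series match those used in the next code chunk.

titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
            "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
              goblet_of_fire, order_of_the_phoenix, half_blood_prince,
              deathly_hallows)

series <- tibble()

for (i in seq_along(titles)) {
  clean <- tibble(chapter = seq_along(books[[i]]),   # one row per chapter
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%                    # one row per word
    mutate(book = titles[i]) %>%
    select(book, everything())
  series <- rbind(series, clean)
}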
series$book <- factor(series$book, levels = rev(titles))
series
## # A tibble: 1,089,386 × 3
##    book                 chapter word
##  * <fctr>                 <int> <chr>
##  1 Philosopher's Stone        1 the
##  2 Philosopher's Stone        1 boy
##  3 Philosopher's Stone        1 who
##  4 Philosopher's Stone        1 lived
##  5 Philosopher's Stone        1 mr
##  6 Philosopher's Stone        1 and
##  7 Philosopher's Stone        1 mrs
##  8 Philosopher's Stone        1 dursley
##  9 Philosopher's Stone        1 of
## 10 Philosopher's Stone        1 number
## # ... with 1,089,376 more rows
Now let’s use the NRC Sentiment dataset to evaluate the different emotions represented by the entire Harry Potter series. We can see that the presence of negative emotions is stronger than that of positive emotions.
series %>%
  right_join(get_sentiments("nrc")) %>%   # join the NRC lexicon (reconstructed; the original excerpt shows only the last two steps)
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## # A tibble: 10 × 2
##       sentiment     n
##           <chr> <int>
## 1      negative 56579
## 2      positive 38324
## 3       sadness 35866
## 4         anger 32750
## 5         trust 23485
## 6          fear 21544
## 7  anticipation 21123
## 8           joy 14298
## 9       disgust 13381
## 10     surprise 12991
This gives a good overall sense, but what if we want to understand how the mood changes over the course of each novel? To do this, we need to do the following.
- Create an index that breaks each book into blocks of 500 words; this is roughly the number of words on two pages, so it allows us to assess changes in sentiment even within chapters.
- Join the Bing lexicon with inner_join to assess the positive or negative sentiment of each word.
- Count the number of positive and negative words in each two-page block.
- Spread our data so that positive and negative counts sit in separate columns.
- Calculate the net sentiment (positive - negative).
- Plot our data (these steps are sketched in the code below).
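A minimal sketch of those steps, assuming the tidy series tibble built earlier and the Bing lexicon. The intermediate name bing_blocks is hypothetical and is only introduced so the plotting call that follows has something to work with.

bing_blocks <- series %>%
  group_by(book) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 500 + 1) %>%   # 500-word blocks, roughly two pages
  inner_join(get_sentiments("bing")) %>%       # positive/negative label for each word
  count(book, index = index, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%           # separate negative and positive columns
  mutate(sentiment = positive - negative)      # net sentiment per block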
bing_blocks %>%   # the per-block net sentiment computed in the sketch above
  ggplot(aes(index, sentiment, fill = book)) +
  geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")
Now we can see how the plot of each novel trends toward more positive or more negative sentiment over the course of the story.
Comparing sentiments
With several sentiment lexicons to choose from, you may want more information about which one is appropriate for your purposes. Let's use all three and examine how they differ for each novel.
# Net sentiment per 500-word block with each lexicon (pipeline reconstructed from
# the abbreviated excerpt; AFINN scores are in the `score` column in the tidytext
# version used here)
afinn <- series %>%
  group_by(book) %>% mutate(index = (1:n()) %/% 500 + 1) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(book, index) %>%
  summarise(sentiment = sum(score)) %>%
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  series %>% group_by(book) %>% mutate(index = (1:n()) %/% 500 + 1) %>%
    inner_join(get_sentiments("bing")) %>% mutate(method = "Bing"),
  series %>% group_by(book) %>% mutate(index = (1:n()) %/% 500 + 1) %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(book, method, index, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
We now have an estimate of the net sentiment (positive - negative) in each block of the novels for each sentiment lexicon. Let's plot them.
bind_rows(afinn, bing_and_nrc) %>%   # combine the three estimates computed above
  ggplot(aes(index, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(book ~ method)
The three lexicons give results that differ in absolute terms but follow quite similar relative trajectories through the novels. We see similar dips and peaks in roughly the same places, but the absolute values differ markedly. In some cases, the AFINN lexicon appears to find more positive sentiment than the NRC lexicon. This output also allows us to compare across novels. First, you get a good sense of the differences in book length: The Order of the Phoenix is much longer than The Philosopher's Stone. Second, you can compare how sentiment differs between the books in the series.
Common sentiment words
One benefit of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment.
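The counting step itself is not shown here; a minimal sketch, assuming the Bing lexicon and the tidy series tibble from above (the name word_counts matches the output below):

word_counts <- series %>%
  inner_join(get_sentiments("bing")) %>%   # attach positive/negative labels
  count(word, sentiment, sort = TRUE) %>%  # count each word within each sentiment
  ungroup()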
word_counts
## # A tibble: 3,313 × 3
##      word sentiment     n
##     <chr>     <chr> <int>
## 1    like  positive  2416
## 2    well  positive  1969
## 3   right  positive  1643
## 4    good  positive  1065
## 5    dark  negative  1034
## 6   great  positive   877
## 7   death  negative   757
## 8   magic  positive   606
## 9  better  positive   533
## 10 enough  positive   509
## # ... with 3,303 more rows
We can view this visually to assess the top n words for each sentiment.
word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%   # top 10 words per sentiment (reconstructed; the original shows only the plotting call)
  ggplot(aes(reorder(word, n), n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip()
Sentiment analysis of larger units
Much useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (single words) and try to understand the sentiment of a sentence as a whole. Such algorithms try to understand that
I had a bad day today.
is a sad sentence, not a happy one, because of the negative wording. Stanford University's CoreNLP tools are an example of this kind of sentiment analysis algorithm. For this, we may want to tokenize the text into sentences. I use the philosophers_stone data set to illustrate.
tibble(text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")   # split into sentences (this line completes the excerpted call)
## # A tibble: 6,598 × 1
## sentence
## <chr>
## 1 the boy who lived mr. and mrs.
## 2 dursley, of number four, privet drive, were proud to say that they were per
## 3 they were the last people you'd expect to be involved in anything strange o
## 4 mr.
## 5 dursley was the director of a firm called grunnings, which made drills.
## 6 he was a big, beefy man with hardly any neck, although he did have a very l
## 7 mrs.
## 8 dursley was thin and blonde and had nearly twice the usual amount of neck,
## 9 the dursleys had a small son called dudley and in their opinion there was n
## 10 the dursleys had everything they wanted, but they also had a secret, and th
## # ... with 6,588 more rows
The token = "sentences" argument attempts to split the text by punctuation.
Let’s continue to decompose the philosophers_stone text by chapters and sentences.
tibble(chapter = 1:length(philosophers_stone),   # reconstructed first line of the excerpted call
       text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences")
This will allow us to assess net sentiment by chapter and sentence. First, we need to track sentence numbers, and then I create an index that tracks the progress through each chapter. Then I unnest the sentences into words. This gives us a tibble with the individual words of each chapter, broken down by sentence. Now, as before, I join the AFINN lexicon and calculate the net sentiment score for each chapter. As we can see, the most positive sentences occur in the middle of chapter 9, at the end of chapter 17, early in chapter 4, and so on.
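A sketch of those steps, assuming the AFINN lexicon (whose scores are in the score column here). The result is assigned to book_sent, the name used by the heat-map code further below; the grouping and summarising step shown next completes the pipeline.

book_sent <- tibble(chapter = 1:length(philosophers_stone),
                    text = philosophers_stone) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  group_by(chapter) %>%
  mutate(sentence_num = 1:n(),                      # track sentence numbers
         index = round(sentence_num / n(), 2)) %>%  # progress through each chapter
  unnest_tokens(word, sentence) %>%                 # break each sentence into words
  inner_join(get_sentiments("afinn")) %>%           # attach AFINN scores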
  group_by(chapter, index) %>%
  summarise(sentiment = sum(score, na.rm = TRUE)) %>%
  arrange(desc(sentiment))

book_sent
## Source: local data frame [1,401 x 3]
## Groups: chapter [17]
##
##    chapter index sentiment
##      <int> <dbl>     <int>
## 1        9  0.47        14
## 2       17  0.91        13
## 3        4  0.11        12
## 4       12  0.45        12
## 5       17  0.54        12
## 6        1  0.25        11
## 7       10  0.04        11
## 8       10  0.16        11
## 9       11  0.48        11
## 10      12  0.70        11
## # ... with 1,391 more rows
We can illustrate this graphically with a heat map that shows our most positive and negative emotions as each chapter progresses.
ggplot(book_sent, aes(index, factor(chapter), fill = sentiment)) +   # book_sent from the pipeline above; axes reconstructed
  geom_tile(color = "white") +
  scale_fill_gradient2() +   # diverging palette: negative vs. positive
  labs(x = "Chapter progression", y = "Chapter")