Author | Audhi Aprilliant  Compiled by | VK  Source | Towards Data Science

Overview

For this project, we used raw data crawled from Twitter on May 28-29, 2019. The data is in CSV (comma-separated) format and can be downloaded here:

Github.com/audhiaprill…

The data covers two topics: tweets collected with the keyword “Joko Widodo” and tweets collected with the keyword “Prabowo Subianto”. It includes several variables and pieces of information used to determine the sentiment of users. In total, the data has 16 variables (attributes) and more than 1,000 observations. Table 1 lists some of the variables.

# Import libraries
library(ggplot2)
library(lubridate)

# Load Joko Widodo's data
# (sep must be a single character, so ',' rather than ', ')
data.jokowi.df = read.csv(file = 'data-joko-widodo.csv',
                          header = TRUE,
                          sep = ',')
senti.jokowi = read.csv(file = 'sentiment-joko-widodo.csv',
                        header = TRUE,
                        sep = ',')

# Load Prabowo Subianto's data
data.prabowo.df = read.csv(file = 'data-prabowo-subianto.csv',
                           header = TRUE,
                           sep = ',')
senti.prabowo = read.csv(file = 'sentiment-prabowo-subianto.csv',
                         header = TRUE,
                         sep = ',')

Data visualization

Data exploration aims to extract information from the Twitter data. Note that the text has already been preprocessed. We explore the variables that we consider interesting.

# Bar chart of tweets - Joko Widodo
data.jokowi.df$created = ymd_hms(data.jokowi.df$created,
                                 tz = 'Asia/Jakarta')
# Another way to make "date" and "hour" variables
data.jokowi.df$date = date(data.jokowi.df$created)
data.jokowi.df$hour = hour(data.jokowi.df$created)
# Date: 2019-05-29 (the plot subtitle below says 28 May; make sure the two match your data)
data.jokowi.date1 = subset(x = data.jokowi.df,
                           date == '2019-05-29')
data.hour.date1 = data.frame(table(data.jokowi.date1$hour))
colnames(data.hour.date1) = c('Hour','Total.Tweets')
# Create the data visualization
ggplot(data.hour.date1)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('blue')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date1$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date1$Total.Tweets)),
                              'Tweets per hour'),
                x = 8,
                y = mean(data.hour.date1$Total.Tweets)+20),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Joko Widodo',
       subtitle = '28 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  scale_fill_brewer(palette = 'Dark2')+
  theme_bw()
  
# Bar chart of tweets - Prabowo Subianto
data.prabowo.df$created = ymd_hms(data.prabowo.df$created,
                                  tz = 'Asia/Jakarta')
                                  
# Another way to make "date" and "hour" variables
data.prabowo.df$date = date(data.prabowo.df$created)
data.prabowo.df$hour = hour(data.prabowo.df$created)

# Date: 2019-05-28
data.prabowo.date1 = subset(x = data.prabowo.df,
                            date == '2019-05-28')
data.hour.date1 = data.frame(table(data.prabowo.date1$hour))
colnames(data.hour.date1) = c('Hour','Total.Tweets')

# Date: 2019-05-29
data.prabowo.date2 = subset(x = data.prabowo.df,
                            date == '2019-05-29')
data.hour.date2 = data.frame(table(data.prabowo.date2$hour))
colnames(data.hour.date2) = c('Hour','Total.Tweets')
data.hour.date3 = rbind(data.hour.date1,data.hour.date2)
data.hour.date3$Date = c(rep(x = '2019-05-28',
                             len = nrow(data.hour.date1)),
                         rep(x = '2019-05-29',
                             len = nrow(data.hour.date2)))
# One unique label per bar; this assumes 28 hourly rows (26 letters plus 'A' and 'B')
data.hour.date3$Labels = c(letters,'A','B')
data.hour.date3$Hour = as.character(data.hour.date3$Hour)
data.hour.date3$Hour = as.numeric(data.hour.date3$Hour)

# Data preprocessing
for (i in 1:nrow(data.hour.date3)) {
  if (i%%2 == 0) {
    data.hour.date3[i,'Hour'] = ' '
  }
  if (i%%2 == 1) {
    data.hour.date3[i,'Hour'] = data.hour.date3[i,'Hour']
  }
}
data.hour.date3$Hour = as.factor(data.hour.date3$Hour)

# Data Visualization
ggplot(data.hour.date3)+
  geom_bar(aes(x = Labels,
               y = Total.Tweets,
               fill = Date),
           stat = 'identity',
           alpha = 0.75,
           show.legend = TRUE)+
  geom_hline(yintercept = mean(data.hour.date3$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date3$Total.Tweets)),
                              'Tweets per hour'),
                x = 5,
                y = mean(data.hour.date3$Total.Tweets)+6),
            hjust = 'left',
            size = 3.8)+
  scale_x_discrete(limits = data.hour.date3$Labels,
                   labels = data.hour.date3$Hour)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '28 - 29 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0, 100))+
  theme_bw()+
  theme(legend.position = 'bottom',
        legend.title = element_blank())+
  scale_fill_brewer(palette = 'Dark2')
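The preprocessing loop in the code above blanks every second hour label so that the x-axis stays readable. As a minimal sketch of the same idea without a loop, the snippet below uses `ifelse` on a made-up vector; the names `hours` and `thin.labels` are illustrative and do not come from the original code.

```r
# Hypothetical hour values standing in for data.hour.date3$Hour
hours = c(12, 13, 14, 15, 16, 17)

# Keep the label at odd positions and blank out the even ones
thin.labels = ifelse(seq_along(hours) %% 2 == 1,
                     as.character(hours),
                     ' ')
print(thin.labels)
```

The result could be passed to the `labels` argument of `scale_x_discrete` in the same way as the original `Hour` column.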

Based on Figure 1, we can conclude that the number of tweets obtained by data scraping (keywords “Joko Widodo” and “Prabowo Subianto”) differs, even on the same date.

For example, Figure 1 (left) shows that tweets with the keyword “Joko Widodo” were obtained only between 03:00 and 17:00 WIB on May 28, 2019. Figure 1 (right) shows that tweets with the keyword “Prabowo Subianto” were obtained between 12:00 and 23:59 WIB on May 28, 2019 and between 00:00 and 15:00 WIB on May 29, 2019.
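The hourly counts behind these charts come from splitting each tweet's timestamp into a date and an hour. The following is a small base R sketch of that step, assuming timestamps in `YYYY-MM-DD HH:MM:SS` form; the four timestamps are invented for illustration, and the original code uses lubridate's `date()` and `hour()` instead of `format()`.

```r
# Hypothetical timestamps standing in for the 'created' column
created = as.POSIXct(c('2019-05-28 12:05:10',
                       '2019-05-28 12:47:31',
                       '2019-05-28 13:15:00',
                       '2019-05-29 00:30:45'),
                     tz = 'Asia/Jakarta')

# Split each timestamp into a calendar date and an hour of day
tweet.date = format(created, '%Y-%m-%d')
tweet.hour = as.integer(format(created, '%H'))

# Count tweets per hour for a single date
hour.counts = table(tweet.hour[tweet.date == '2019-05-28'])
print(hour.counts)
```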

# Tweets on 2019-05-28
ggplot(data.hour.date1)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('red')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date1$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date1$Total.Tweets)),
                              'Tweets per hour'),
                x = 6.5,
                y = mean(data.hour.date1$Total.Tweets)+5),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '28 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0, 100))+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')
  
# Tweets on 2019-05-29
ggplot(data.hour.date2)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('red')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date2$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
                              ceiling(mean(data.hour.date2$Total.Tweets)),
                              'Tweets per hour'),
                x = 1,
                y = mean(data.hour.date2$Total.Tweets)+6),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '29 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0, 100))+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

According to Figure 2, there is a clear difference between tweets with the keywords “Joko Widodo” and “Prabowo Subianto”. Tweets about Joko Widodo are concentrated in a particular window (07:00 to 09:00 WIB), with the highest count, 348 tweets, at 08:00 WIB. By contrast, tweets with the keyword “Prabowo Subianto” were posted at a fairly steady rate throughout May 28-29, 2019, averaging 36 tweets per hour.
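The peak hour and the hourly average quoted here can be read straight from a table of hourly counts. The sketch below shows one way to do it; apart from the 348 tweets at 08:00, the counts are invented for illustration, and the name `hourly` is hypothetical.

```r
# Hypothetical hourly counts (hour of day and number of tweets)
hourly = data.frame(Hour = c(6, 7, 8, 9, 10),
                    Total.Tweets = c(40, 210, 348, 190, 55))

# Row of the hour with the most tweets
peak = hourly[which.max(hourly$Total.Tweets), ]

# Average tweets per hour, rounded up as in the plot labels
avg.per.hour = ceiling(mean(hourly$Total.Tweets))

print(peak)
print(avg.per.hour)
```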

# JOKO WIDODO
# Keep only tweets classed as negative or positive
df.score.1 = subset(senti.jokowi, class %in% c('Negative', 'Positive'))
colnames(df.score.1) = c('Score', 'Text', 'Sentiment')
# Data visualization
ggplot(df.score.1)+
  geom_density(aes(x = Score, fill = Sentiment),
               alpha = 0.75)+
  xlim(c(-11, 11))+
  labs(title = 'Density Plot of Sentiment Scores',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  ylab('Density')+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'bottom',
        legend.title = element_blank())

# PRABOWO SUBIANTO
df.score.2 = subset(senti.prabowo, class %in% c('Negative', 'Positive'))
colnames(df.score.2) = c('Score', 'Text', 'Sentiment')
# Data visualization
ggplot(df.score.2)+
  geom_density(aes(x = Score, fill = Sentiment),
               alpha = 0.75)+
  xlim(c(-11, 11))+
  labs(title = 'Density Plot of Sentiment Scores',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  ylab('Density')+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'bottom',
        legend.title = element_blank())

Figure 3 is a bar chart of the number of tweets with the keywords “Joko Widodo” and “Prabowo Subianto” on May 28-29, 2019. Figure 3 (left) shows that Twitter users talk about Prabowo Subianto less often between 19:00 and 23:59 WIB, likely because this is when most Indonesians rest. Even so, tweets on the topic keep appearing in the middle of the night, since some users live abroad and others are still active. User activity then picks up at 04:00 WIB, peaks at 07:00 WIB, declines until 12:00 WIB, and rises again afterwards.

# JOKO WIDODO
df.senti.score1. = data.frame(table(senti.jokowi$score))
colnames(df.senti.score1.) = c('Score','Freq')
# Data preprocessing
df.senti.score1.$Score = as.character(df.senti.score1.$Score)
df.senti.score1.$Score = as.numeric(df.senti.score1.$Score)
Score1 = df.senti.score1.$Score
sign(df.senti.score1.[1,1])
for (i in 1:nrow(df.senti.score1.)) {
  sign.row = sign(df.senti.score1.[i,'Score'])
  for (j in 1:ncol(df.senti.score1.)) {
    df.senti.score1.[i,j] = df.senti.score1.[i,j] * sign.row
  }
}
df.senti.score1.$Label = c(letters[1:nrow(df.senti.score1.)])
df.senti.score1.$Sentiment = ifelse(df.senti.score1.$Freq < 0,'Negative','Positive')
df.senti.score1.$Score1 = Score1
# Data Visualization
ggplot(df.senti.score1.)+
  geom_bar(aes(x = Label,
               y = Freq,
               fill = Sentiment),
           stat = 'identity',
           show.legend = FALSE)+
  # Positive emotion
  geom_hline(yintercept = mean(abs(df.senti.score1.[which(df.senti.score1.$Sentiment == 'Positive'),'Freq'])),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score1.[which(df.senti.score1.$Sentiment == 'Positive'),'Freq'])))),
                x = 10,
                y = mean(abs(df.senti.score1.[which(df.senti.score1.$Sentiment == 'Positive'),'Freq']))+30),
            hjust = 'right',
            size = 4)+
  # Negative sentiment
  geom_hline(yintercept = mean(df.senti.score1.[which(df.senti.score1.$Sentiment == 'Negative'),'Freq']),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score1.[which(df.senti.score1.$Sentiment == 'Negative'),'Freq'])))),
                x = 5,
                y = mean(df.senti.score1.[which(df.senti.score1.$Sentiment == 'Negative'),'Freq']) -15),
            hjust = 'left',
            size = 4)+
  labs(title = 'Barplot of Sentiments',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  scale_x_discrete(limits = df.senti.score1.$Label,
                   labels = df.senti.score1.$Score1)+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

# PRABOWO SUBIANTO
df.senti.score2. = data.frame(table(senti.prabowo$score))
colnames(df.senti.score2.) = c('Score','Freq')
# Data preprocessing
df.senti.score2.$Score = as.character(df.senti.score2.$Score)
df.senti.score2.$Score = as.numeric(df.senti.score2.$Score)
Score2 = df.senti.score2.$Score
sign(df.senti.score2.[1,1])
for (i in 1:nrow(df.senti.score2.)) {
  sign.row = sign(df.senti.score2.[i,'Score'])
  for (j in 1:ncol(df.senti.score2.)) {
    df.senti.score2.[i,j] = df.senti.score2.[i,j] * sign.row
  }
}
df.senti.score2.$Label = c(letters[1:nrow(df.senti.score2.)])
df.senti.score2.$Sentiment = ifelse(df.senti.score2.$Freq < 0,'Negative','Positive')
df.senti.score2.$Score1 = Score2
# Data Visualization
ggplot(df.senti.score2.)+
  geom_bar(aes(x = Label,
               y = Freq,
               fill = Sentiment),
           stat = 'identity',
           show.legend = FALSE)+
  # Positive emotion
  geom_hline(yintercept = mean(abs(df.senti.score2.[which(df.senti.score2.$Sentiment == 'Positive'),'Freq'])),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score2.[which(df.senti.score2.$Sentiment == 'Positive'),'Freq'])))),
                x = 11,
                y = mean(abs(df.senti.score2.[which(df.senti.score2.$Sentiment == 'Positive'),'Freq']))+20),
            hjust = 'right',
            size = 4)+
  # Negative sentiment
  geom_hline(yintercept = mean(df.senti.score2.[which(df.senti.score2.$Sentiment == 'Negative'),'Freq']),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
                              ceiling(mean(abs(df.senti.score2.[which(df.senti.score2.$Sentiment == 'Negative'),'Freq'])))),
                x = 9,
                y = mean(df.senti.score2.[which(df.senti.score2.$Sentiment == 'Negative'),'Freq']) -10),
            hjust = 'left',
            size = 4)+
  labs(title = 'Barplot of Sentiments',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  scale_x_discrete(limits = df.senti.score2.$Label,
                   labels = df.senti.score2.$Score1)+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')
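The nested loops in the bar plot code multiply each row by the sign of its score, so that negative scores end up with negative frequencies and their bars point downward. A vectorized sketch of that step is shown below on invented data; `score.freq` and `Signed.Freq` are hypothetical names, and unlike the original loop this version leaves the `Score` column untouched.

```r
# Hypothetical score frequencies standing in for table(senti.jokowi$score)
score.freq = data.frame(Score = c(-3, -1, 0, 2, 4),
                        Freq = c(5, 12, 30, 9, 2))

# Each frequency takes the sign of its score; a score of 0 gives a frequency of 0
score.freq$Signed.Freq = score.freq$Freq * sign(score.freq$Score)
score.freq$Sentiment = ifelse(score.freq$Signed.Freq < 0,
                              'Negative',
                              'Positive')
print(score.freq)
```

Note that, as in the original loop, a score of exactly zero yields a frequency of zero and is labelled 'Positive', which lowers the positive average drawn on the plot.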

Figure 4 shows the density of sentiment scores for the keywords “Joko Widodo” and “Prabowo Subianto”. Each tweet is scored as the average of the scores of the root words that make it up, where each root word has a score between -10 and 10. The lower the score, the more negative the tweet, and vice versa. Figure 4 (left) shows that negative tweets containing the keyword “Joko Widodo” score between -10 and -1, with a median of -4. The same holds for positive sentiment (with positive scores, of course). According to the density plot in Figure 4 (left), the positive scores have a fairly small variance. We therefore conclude that positive sentiment in tweets containing the keyword “Joko Widodo” is not very diverse.

Figure 4 (right) shows the sentiment score density for the keyword “Prabowo Subianto”. It differs from Figure 4 (left) in that negative scores range only from -8 to -1. This means the tweets are not strongly negative (they carry negative sentiment, but not an intense one). In addition, the distribution of negative scores has two peaks between -4 and -1. Positive scores, by contrast, range from 1 to 10. Compared with Figure 4 (left), positive sentiment in Figure 4 (right) has a higher variance, with two peaks between 3 and 10. This suggests that tweets containing the keyword “Prabowo Subianto” carry strongly positive sentiment.
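To make the scoring rule concrete, here is a minimal sketch of lexicon-based scoring in which a tweet's score is the mean of the scores of its words. The lexicon, its scores, and the function `score.tweet` are all hypothetical, and the sketch skips stemming to root words, which the real pipeline would need.

```r
# Hypothetical lexicon: word and its score between -10 and 10
lexicon = c(aman = 4, selamat = 5, bohong = -6, gagal = -4)

# Average the scores of the words found in the lexicon;
# tweets with no matching word get a score of 0 (an assumption of this sketch)
score.tweet = function(text, lexicon) {
  words = strsplit(tolower(text), '[^a-z]+')[[1]]
  hits = lexicon[words[words %in% names(lexicon)]]
  if (length(hits) == 0) {
    return(0)
  }
  mean(hits)
}

print(score.tweet('Negara aman dan selamat', lexicon))
print(score.tweet('Janji gagal dan bohong', lexicon))
```

In this toy lexicon the first tweet averages two positive words and the second averages two negative ones, so their scores land on opposite sides of zero.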
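Statements about variance and the number of peaks can be checked numerically rather than by eye. The sketch below compares the spread of two invented score vectors and counts local maxima of a kernel density estimate; `scores.a`, `scores.b`, and `count.peaks` are illustrative names, and the peak count depends on the bandwidth that `density()` picks by default.

```r
# Hypothetical positive sentiment scores for two keywords
scores.a = c(2, 3, 3, 3, 4, 3, 2, 4)
scores.b = c(1, 3, 3, 10, 9, 3, 10, 2)

# A larger variance means the scores are more spread out
var.a = var(scores.a)
var.b = var(scores.b)
print(c(var.a, var.b))

# Count local maxima of the kernel density estimate to look for multiple peaks
count.peaks = function(x) {
  d = density(x)$y
  sum(diff(sign(diff(d))) == -2)
}
print(count.peaks(scores.b))
```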

library(dplyr)  # provides %>%, arrange() and mutate()

# JOKO WIDODO
df.senti.3 = as.data.frame(table(senti.jokowi$class))
colnames(df.senti.3) = c('Sentiment','Freq')
# Data preprocessing
df.pie.1 = df.senti.3
df.pie.1$Prop = df.pie.1$Freq/sum(df.pie.1$Freq)
df.pie.1 = df.pie.1 %>%
  arrange(desc(Sentiment)) %>%
  mutate(lab.ypos = cumsum(Prop) - 0.5*Prop)
# Data visualization
ggplot(df.pie.1,
       aes(x = 2,
           y = Prop,
           fill = Sentiment))+
  geom_bar(stat = 'identity',
           col = 'white',
           alpha = 0.75,
           show.legend = TRUE)+
  coord_polar(theta = 'y',
              start = 0)+
  geom_text(aes(y = lab.ypos, label = Prop),
            color = 'white',
            fontface = 'italic',
            size = 4)+
  labs(title = 'Piechart of Sentiments',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlim(c(0.5, 2.5))+
  theme_void()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.title = element_blank(),
        legend.position = 'right')

# PRABOWO SUBIANTO
df.senti.4 = as.data.frame(table(senti.prabowo$class))
colnames(df.senti.4) = c('Sentiment','Freq')
# Data preprocessing
df.pie.2 = df.senti.4
df.pie.2$Prop = df.pie.2$Freq/sum(df.pie.2$Freq)
df.pie.2 = df.pie.2 %>%
  arrange(desc(Sentiment)) %>%
  mutate(lab.ypos = cumsum(Prop) - 0.5*Prop)
# Data visualization
ggplot(df.pie.2,
       aes(x = 2,
           y = Prop,
           fill = Sentiment))+
  geom_bar(stat = 'identity',
           col = 'white',
           alpha = 0.75,
           show.legend = TRUE)+
  coord_polar(theta = 'y',
              start = 0)+
  geom_text(aes(y = lab.ypos, label = Prop),
            color = 'white',
            fontface = 'italic',
            size = 4)+
  labs(title = 'Piechart of Sentiments',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlim(c(0.5, 2.5))+
  theme_void()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.title = element_blank(),
        legend.position = 'right')

Figure 5 summarizes the sentiment scores of the tweets, divided into negative, neutral, and positive sentiment. Negative sentiment means a score below zero, neutral a score equal to zero, and positive a score above zero. Figure 5 shows that tweets with the keyword “Joko Widodo” have a lower percentage of negative sentiment than tweets with the keyword “Prabowo Subianto”, a difference of 6.3 percent. In the pie charts, tweets containing the keyword “Joko Widodo” also have higher percentages of neutral and positive sentiment than tweets with the keyword “Prabowo Subianto”. However, the distributions of positive and negative scores in the density plots show that tweets containing the keyword “Prabowo Subianto” tend to have higher sentiment scores than tweets containing the keyword “Joko Widodo”. This needs further analysis.
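The three-way split by score is simple to express in code. As a minimal sketch on invented scores, the snippet below assigns each score to a class and computes the percentage share of each class, which is the quantity a sentiment pie chart displays; the vector `scores` is hypothetical.

```r
# Hypothetical sentiment scores for a handful of tweets
scores = c(-4, 0, 3, 0, -1, 5, 2, 0)

# Below zero is negative, zero is neutral, above zero is positive
sentiment = ifelse(scores < 0, 'Negative',
                   ifelse(scores == 0, 'Neutral', 'Positive'))

# Share of each class, in percent
share = round(100 * prop.table(table(sentiment)), 1)
print(share)
```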

Figure 6 shows the terms that appear most often in tweets uploaded on May 28-29, 2019 (keywords “Joko Widodo” and “Prabowo Subianto”). This word cloud visualization reveals the hot topics discussed for each keyword. For tweets containing the keyword “Joko Widodo”, the terms “tuang”, “petisi”, “negara”, “aman”, and “nusantara” are the five most frequent. For tweets containing the keyword “Prabowo Subianto”, the five most frequent terms are “Prabowo”, “Subianto”, “Kriminalisasi”, “selamat”, and “Dubai”. This indirectly reveals a pattern in tweets uploaded with the keyword “Prabowo Subianto”: almost every tweet contains the name “Prabowo Subianto” directly rather than through a mention (@), because mentions (@) were removed during text preprocessing.
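The term counts behind a word cloud can be sketched in a few lines of base R. The example below removes mentions, lower-cases the text, and counts term frequencies on three invented tweets; the tweets, the regular expression, and the variable names are assumptions for illustration and are not the preprocessing used for the original data.

```r
# Hypothetical tweets standing in for the crawled text
tweets = c('@user1 Prabowo Subianto selamat',
           'Selamat untuk Prabowo Subianto @user2',
           'Prabowo di Dubai')

# Drop mentions such as @user1, then lower-case the text
clean = tolower(gsub('@\\w+', '', tweets))

# Split on whitespace and count how often each term occurs
terms = unlist(strsplit(trimws(clean), '\\s+'))
term.freq = sort(table(terms), decreasing = TRUE)
print(head(term.freq, 5))
```

A frequency table like `term.freq` is the kind of input a word cloud needs, with more frequent terms drawn larger.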

Go to my GitHub repo and find the code: github.com/audhiaprill…

Original link: towardsdatascience.com/twitter-dat…
