A Multidimensional Analysis of Trump's Twitter

Get Trump’s most recent Twitter feed

Besides scraping Twitter with a crawler, a simpler and more stable option is the official API. Twitter has become fairly strict lately, though, and applications for developer platform access are hard to get approved. Fortunately, I already had a Twitter developer account.

Install python-twitter, a third-party library for accessing the Twitter API

pip install python-twitter

Dump Trump’s most recent Tweets

import twitter

# Proxy addresses are left blank here; fill in your own
proxies = {
    'http': ' ',
    'https': ' '
}

# Credentials from the Twitter developer account (left blank here)
api = twitter.Api(consumer_key=' ',
                  consumer_secret=' ',
                  access_token_key=' ',
                  access_token_secret=' ', proxies=proxies)

count = 200      # tweets requested per page
max_id = None    # None means start from the newest tweet
with open('Trump-twitter.txt', encoding='utf-8', mode='w') as fp:
    while True:
        statuses = api.GetUserTimeline(screen_name="realDonaldTrump", count=count, max_id=max_id)
        if len(statuses) < 1:
            break
        for s in statuses:
            print(s)
            fp.write(str(s) + "\n")
            max_id = s.id

        # max_id is inclusive, so step past the oldest tweet already saved
        max_id = max_id - 1

Note that this API must be called from a network environment that can reach Twitter, which is why a proxy is configured.

Currently, Twitter's API only returns roughly the most recent 3,200 tweets of a user; the exact number varies slightly.
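The paging logic in the dump loop can be tried out without network access. The following is a minimal sketch, not the real API: `make_fake_timeline` is a hypothetical stand-in for the timeline endpoint, and it assumes (as Twitter documents for `max_id`) that the endpoint returns tweets with an ID less than or equal to `max_id`, newest first.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, int]


def make_fake_timeline(total: int) -> Callable[[int, Optional[int]], List[Tweet]]:
    """Hypothetical stand-in for the timeline endpoint: serves IDs total..1, newest first."""
    tweets = [{"id": i} for i in range(total, 0, -1)]

    def fetch(count: int, max_id: Optional[int]) -> List[Tweet]:
        # Return at most `count` tweets whose id is <= max_id (inclusive bound).
        eligible = [t for t in tweets if max_id is None or t["id"] <= max_id]
        return eligible[:count]

    return fetch


def fetch_all(fetch: Callable[[int, Optional[int]], List[Tweet]], count: int = 200) -> List[Tweet]:
    """Page backwards through a timeline until an empty page comes back."""
    collected: List[Tweet] = []
    max_id: Optional[int] = None
    while True:
        page = fetch(count, max_id)
        if not page:
            break
        collected.extend(page)
        # Step just below the oldest ID seen so the next page has no duplicates.
        max_id = min(t["id"] for t in page) - 1
    return collected


all_tweets = fetch_all(make_fake_timeline(450), count=200)
print(len(all_tweets))
```

Subtracting 1 from the oldest ID matters because the bound is inclusive: without it, every page would start with the last tweet of the previous page.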

The derived data looks something like this:

Data preprocessing

pandas makes it very convenient to process tabular data. Here the important information is extracted into table fields to make statistical analysis easier.
The analysis focuses on the following fields:

Field name	Description
id	Tweet ID Twitter.com/i/web/statu…
created_at	Time posted
favorite_count	Number of likes
hashtags	Hashtags
retweet_count	Number of retweets
source	Posting source, such as iPhone
text	Tweet text
media_type	Media type of the content, such as video
ret_user_name	Author of the original tweet (for retweets)
ret_user_verified	Whether the author of the original tweet is verified
quoted_user_name	Author of the quoted tweet
quoted_user_verified	Whether the author of the quoted tweet is verified
created_day	Date posted, for per-day statistics
created_day_hour	Date and hour posted, for example: 2020-06-1402
created_hour	Hour posted, for statistics on time-of-day patterns
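The step from the raw dump to these fields is not shown, so here is a hedged sketch of how one dumped line might be flattened into a row. It assumes each line of the dump is a JSON object using Twitter's usual key names (`retweeted_status`, `quoted_status`, `media`, and so on) and Twitter's `created_at` format; `SAMPLE_LINE` and `to_row` are hypothetical names, and the sample record is hand-written for illustration.

```python
import json
import re
from datetime import datetime

# Hand-written sample record (not real data), in the assumed JSON layout.
SAMPLE_LINE = json.dumps({
    "id": 1000,
    "created_at": "Fri Jun 05 15:04:05 +0000 2020",
    "favorite_count": 10,
    "retweet_count": 3,
    "hashtags": [{"text": "MAGA"}],
    "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    "text": "Example tweet text",
})


def to_row(line: str) -> dict:
    """Flatten one dumped status into the table fields (times stay in UTC)."""
    status = json.loads(line)
    created = datetime.strptime(status["created_at"], "%a %b %d %H:%M:%S %z %Y")
    retweeted_user = (status.get("retweeted_status") or {}).get("user", {})
    quoted_user = (status.get("quoted_status") or {}).get("user", {})
    media = status.get("media") or []
    return {
        "id": status["id"],
        "created_at": created.strftime("%Y-%m-%d %H:%M:%S"),
        "favorite_count": status.get("favorite_count", 0),
        "hashtags": ",".join(h["text"] for h in status.get("hashtags", [])),
        "retweet_count": status.get("retweet_count", 0),
        # The source field is an HTML anchor; keep only its visible text.
        "source": re.sub(r"<[^>]+>", "", status.get("source", "")),
        "text": status.get("text", ""),
        "media_type": media[0].get("type") if media else None,
        "ret_user_name": retweeted_user.get("name"),
        "ret_user_verified": retweeted_user.get("verified"),
        "quoted_user_name": quoted_user.get("name"),
        "quoted_user_verified": quoted_user.get("verified"),
    }


row = to_row(SAMPLE_LINE)
print(row["created_at"], row["source"])
```

A list of such rows can be passed straight to `pandas.DataFrame(rows)` to get the table.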

Code snippet

# Convert the release time into day, day-hour, and hour fields
# (expects created_at with a single space between date and time, e.g. '2020-06-14 02:15:30')
def get_created_info(row):
    created_at = row.created_at
    day, time_str = created_at.split(' ')
    hour = time_str.split(':')[0]
    day_hour = day + hour
    return day, day_hour, hour

# Apply to multiple fields
df[['created_day', 'created_day_hour', 'created_hour']] = df.apply(get_created_info, axis=1, result_type='expand')

Analysis of tweets

The data preprocessed in the previous step is loaded into a pandas DataFrame for further processing

import pandas as pd

df = pd.read_excel('./Trump-twitter.xlsx')

Now on to the multidimensional analysis

Daily post statistics

df[['id', 'created_day']].groupby(by=['created_day']).count().sort_values(by=['id'], ascending=False).head(20).plot.bar(figsize=(12, 4))

This plots a bar chart of the 20 days with the most posts
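As a cross-check on the pandas result, the same kind of count can be reproduced with the standard library alone. This is a minimal sketch that swaps `groupby(...).count()` for `collections.Counter`; it assumes `created_at` strings of the form `YYYY-MM-DD HH:MM:SS`, and the sample timestamps are made up for illustration.

```python
from collections import Counter

# Made-up timestamps standing in for the created_at column.
created_at = [
    "2020-06-05 11:02:10",
    "2020-06-05 11:40:00",
    "2020-06-05 12:15:30",
    "2020-06-04 09:00:00",
]

# Posts per day, then the (up to) 20 busiest days, busiest first.
posts_per_day = Counter(ts.split(" ")[0] for ts in created_at)
top_days = posts_per_day.most_common(20)

# Drill into the busiest day and count posts per hour.
busiest_day = top_days[0][0]
posts_per_hour = Counter(
    ts.split(" ")[1][:2] for ts in created_at if ts.startswith(busiest_day)
)

print(top_days)
print(sorted(posts_per_hour.items()))
```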

The busiest day was 2020-06-05, with 153 posts. Let's take a closer look at the number of posts per hour on that day

df1 = df[df.created_day == '2020-06-05']
df1[['id', 'created_hour']].groupby(by=['created_hour']).count().plot.bar(figsize=(12, 4))

Posting peaked at 11 and 12 o'clock that day. All of that day's tweets were then analyzed and drawn as a word cloud with WordCloud, a handy library for drawing word clouds
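The word cloud code itself is not shown. The counting step behind any word cloud can be sketched with the standard library; the result is a word-to-count mapping of the kind that the wordcloud library's `WordCloud.generate_from_frequencies` accepts. The stopword list, the `word_frequencies` helper, and the sample sentences below are all assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "and", "at", "to", "of", "is", "in", "for", "on"}


def word_frequencies(texts):
    """Count words across tweets, ignoring URLs, stopwords, and single letters."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text)  # drop links before tokenizing
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 1)
    return counts


# Made-up sample sentences, not real tweets.
sample_texts = [
    "Great day at the White House https://example.com/a",
    "A great rally today, great people",
]
freqs = word_frequencies(sample_texts)
print(freqs.most_common(3))
```

Filtering the texts first (for example, keeping only those that contain a given phrase) and then counting gives the per-topic clouds used later in the article.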

Sentiment analysis of tweets

from textblob import TextBlob as tb

def get_sentiment(row):
    # polarity ranges from -1 (most negative) to 1 (most positive)
    polarity = tb(row.text).polarity
    if polarity < 0:
        tag = 'negative'
    elif polarity < 0.3:
        tag = 'neutral'
    else:
        tag = 'positive'
    return tag, polarity

df[['text_polarity', 'polarity_prob']] = df.apply(get_sentiment, axis=1, result_type='expand')

TextBlob, a simple text-processing library, is used here. It supports sentiment analysis, part-of-speech tagging, spelling correction, and more

Trump's tweets are mostly neutral. There are of course plenty of negative ones too, which will not be explored further here…
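The thresholds decide how the tags are distributed, so it helps to isolate them from TextBlob. The sketch below reuses the article's cut-offs (below 0 is negative, below 0.3 is neutral) on made-up polarity scores; `polarity_to_tag` is a hypothetical helper name. Note that a score of exactly 0, which TextBlob returns when it finds no sentiment-bearing words, lands in the neutral bucket, which may partly explain why neutral dominates.

```python
from collections import Counter


def polarity_to_tag(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a tag using the article's thresholds."""
    if polarity < 0:
        return "negative"
    if polarity < 0.3:
        return "neutral"
    return "positive"


# Made-up scores for illustration, including the boundary values.
sample_scores = [-0.5, 0.0, 0.29, 0.3, 1.0]
tags = [polarity_to_tag(p) for p in sample_scores]
distribution = Counter(tags)

print(tags)
print(distribution)
```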

Time Period Statistics

df[['id', 'created_hour']].groupby(by=['created_hour']).count().plot.bar(figsize=(12, 4))

The chart shows that Trump posts at all hours of the day, with the most tweets around 12 o'clock. He really does seem to govern by Twitter, and the volume of tweets is staggering.

All the tweets from that hour were analyzed to create a word cloud

As the word cloud shows, “White House”, “Novel Coronavirus”, “Obama”, and “Biden” come up frequently, and he never forgets the words “Great American”…

A word cloud was also drawn for the tweets that mention “FAKE NEWS”

No explanation needed

Retweet versus original tweet statistics

explode = (0, 0.1)
df[['id', 'tweet_status']].groupby(by=['tweet_status']).count()\
.plot.pie(y='id', figsize=(5, 5), explode=explode, autopct='%1.1f%%', shadow=True, startangle=0, label='')

Retweets and original tweets each account for a large share
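The `tweet_status` column used in the pie chart is not defined in the code shown. One plausible way to derive it, offered here only as an assumption, is to treat a tweet as a retweet whenever `ret_user_name` is filled in. The label values `retweet` and `original`, the helper name, and the sample rows are all hypothetical.

```python
from collections import Counter


def tweet_status(row: dict) -> str:
    """Label a row as a retweet if it records an original author, else as original."""
    return "retweet" if row.get("ret_user_name") else "original"


# Made-up rows for illustration.
rows = [
    {"id": 1, "ret_user_name": "The White House"},
    {"id": 2, "ret_user_name": None},
    {"id": 3, "ret_user_name": ""},
    {"id": 4, "ret_user_name": "The White House"},
]

status_counts = Counter(tweet_status(r) for r in rows)
shares = {k: round(100 * v / len(rows), 1) for k, v in status_counts.items()}

print(status_counts)
print(shares)
```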

Statistics of retweeted authors

rt_df = df[['id', 'ret_user_name']].groupby(by=['ret_user_name']).count().sort_values(by=['id'], ascending=False)
rt_df.head(20).plot.bar(figsize=(16, 4))

Most retweets are of the White House, followed by his own tweets

Posting source statistics

# explode needs one entry per pie slice; adjust its length to the number of sources
explode = (0, 0.1)
df[['id', 'source']].groupby(by=['source']).count()\
.plot.pie(y='id', figsize=(5, 5), explode=explode, autopct='%1.1f%%', shadow=True, startangle=0, label='')

Mostly from the iPhone

Hashtag statistics

The hashtags show that Trump has been paying close attention to the COVID-19 epidemic and to the epidemic-related Paycheck Protection Program.
Next comes MAGA (Make America Great Again)
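The hashtag counting code is not shown, so the following is only a sketch of one way to do it. It assumes the `hashtags` field holds a comma-separated string per tweet (the article does not say how the field is stored) and folds case so that `MAGA` and `maga` count as the same tag. `count_hashtags` and the sample values are hypothetical.

```python
from collections import Counter


def count_hashtags(hashtag_fields):
    """Count hashtags across tweets, ignoring case and empty fields."""
    counts = Counter()
    for field in hashtag_fields:
        if not field:  # skip tweets without hashtags (None or empty string)
            continue
        counts.update(tag.strip().lower() for tag in field.split(",") if tag.strip())
    return counts


# Made-up field values for illustration.
sample_fields = ["MAGA,COVID19", "maga", "", None, "PPP"]
hashtag_counts = count_hashtags(sample_fields)

print(hashtag_counts.most_common())
```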

Content clustering of tweets

Trump's tweets were grouped into 10 clusters. As the figure shows, the content of his tweets is concentrated in a few areas

Word2vec on the tweets

model.similar_by_word('Trump')
# output

[('coronavirus', 0.6097418069839478),
 ('great', 0.5778061151504517),
 ('realDonaldTrump', 0.554646909236908),
 ('Great', 0.5381245613098145),
 ('National', 0.49641942977905273),
 ('America', 0.47522449493408203),
 ('today', 0.4736398458480835),
 ('people', 0.469297856092453),
 ('Democrats', 0.45948123931884766),
 ('time', 0.4551768898963928)]
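The scores in this output are cosine similarities between word vectors, which is how gensim ranks neighbours in `similar_by_word`. The ranking idea can be sketched in plain Python. The two-dimensional toy vectors and the `most_similar` helper below are made up for illustration and have nothing to do with the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-D vectors; real word2vec vectors typically have 100+ dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
neighbours = most_similar("a", toy_vectors)

print(neighbours)
```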

Likes histogram

df[df.favorite_count > 0][['id', 'favorite_count']].plot.hist(y='favorite_count', bins=50, figsize=(12, 4))

The number of likes is mainly around 5 million

Histogram of retweet counts

df[df.retweet_count > 0][['id', 'retweet_count']].plot.hist(y='retweet_count', bins=50, figsize=(12, 4))

Retweet counts mostly fall around 10,000

Conclusion

This article mainly presents objective statistics across many dimensions, without much interpretation. It is meant as a starting point: readers can use their imagination to analyze more dimensions and draw deeper interpretations.
