Getting Trump’s most recent tweets
Besides scraping Twitter with a crawler, a simpler and more stable option is the official API. Twitter has recently become strict about developer access, and applications for the developer platform are hard to get approved. Fortunately, I already had a Twitter developer account.
- Install python-twitter, a third-party library for accessing the Twitter API
pip install python-twitter
- Dump Trump’s most recent Tweets
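import twitter

# Proxy settings: fill in the proxy URLs if Twitter is not directly reachable
proxies = {
    'http': '',
    'https': ''
}
# Fill in the credentials of your own Twitter developer app
api = twitter.Api(consumer_key='',
                  consumer_secret='',
                  access_token_key='',
                  access_token_secret='',
                  proxies=proxies)
count = 200    # number of tweets requested per call
max_id = None  # None means start from the newest tweet
with open('Trump-twitter.txt', encoding='utf-8', mode='w') as fp:
    while True:
        statuses = api.GetUserTimeline(screen_name="realDonaldTrump", count=count, max_id=max_id)
        if len(statuses) < 1:
            break
        for s in statuses:
            print(s)
            fp.write(str(s) + "\n")
            max_id = s.id
        # max_id is inclusive, so continue just below the oldest tweet fetched
        max_id = max_id - 1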
Note that this API must be called from a network that can reach Twitter, which is why a proxy is configured.
Twitter’s API currently returns only about the most recent 3,200 tweets of a user; the exact number varies slightly.
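The paging logic of the dump loop can be tried without network access. The following is only a minimal sketch: `fetch_all`, `fake_get_page` and `FAKE_TIMELINE` are hypothetical names, and the in-memory list merely stands in for the real timeline call. It shows why `max_id` is lowered by one after each page: the parameter is inclusive, so without the decrement the oldest tweet of every page would be fetched twice.

```python
from typing import Callable, List, Optional


def fetch_all(get_page: Callable[[int, Optional[int]], List[dict]],
              count: int = 200) -> List[dict]:
    """Walk a timeline backwards, page by page, using max_id."""
    collected: List[dict] = []
    max_id: Optional[int] = None  # None means "start from the newest item"
    while True:
        page = get_page(count, max_id)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest ID seen so far
        max_id = min(item["id"] for item in page) - 1
    return collected


# Hypothetical stand-in for the API: 450 items with IDs 450..1, newest first
FAKE_TIMELINE = [{"id": i} for i in range(450, 0, -1)]


def fake_get_page(count: int, max_id: Optional[int]) -> List[dict]:
    eligible = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return eligible[:count]


tweets = fetch_all(fake_get_page)
print(len(tweets), "items,", len({t["id"] for t in tweets}), "unique IDs")
```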
The exported data looks something like this:
Data preprocessing
- pandas makes it very convenient to work with tabular data. Here the important information is extracted into table columns, which makes statistical analysis easier
- The analysis focuses on the following fields
Field name | Description |
---|---|
id | Tweet ID, as in Twitter.com/i/web/statu… |
created_at | Posting time |
favorite_count | Number of likes |
hashtags | Hashtags |
retweet_count | Number of retweets |
source | Posting source, such as iPhone |
text | Text of the tweet |
media_type | Media type of the content, such as video |
ret_user_name | Author of the original (retweeted) tweet |
ret_user_verified | Whether the author of the original tweet is verified |
quoted_user_name | Author of the quoted tweet |
quoted_user_verified | Whether the author of the quoted tweet is verified |
created_day | Posting date, for per-day statistics |
created_day_hour | Posting date down to the hour, for example: 2020-06-1402 |
created_hour | Posting hour, for time-of-day statistics |
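How the dumped lines become these columns is not shown in the article. The following is only a sketch under assumptions: it assumes each line of the dump is the JSON form of a status, that the key names (`retweeted_status`, `quoted_status`, `media`, `hashtags`, `full_text`) match what python-twitter writes, and that missing keys mean "absent or zero". `status_to_row` and the sample record are hypothetical.

```python
import json
import re
from datetime import datetime


def status_to_row(line: str) -> dict:
    """Flatten one dumped status (a JSON string) into the table fields."""
    s = json.loads(line)
    # Twitter timestamps look like 'Sun Jun 14 02:13:05 +0000 2020'
    created = datetime.strptime(s["created_at"], "%a %b %d %H:%M:%S %z %Y")
    ret_user = (s.get("retweeted_status") or {}).get("user", {})
    quoted_user = (s.get("quoted_status") or {}).get("user", {})
    media = s.get("media") or []
    return {
        "id": s["id"],
        "created_at": created.strftime("%Y-%m-%d %H:%M:%S"),
        "favorite_count": s.get("favorite_count", 0),
        "hashtags": ",".join(h["text"] for h in s.get("hashtags", [])),
        "retweet_count": s.get("retweet_count", 0),
        # source arrives as an HTML anchor; keep only its visible text
        "source": re.sub(r"<[^>]+>", "", s.get("source", "")),
        "text": s.get("full_text") or s.get("text", ""),
        "media_type": media[0].get("type") if media else None,
        "ret_user_name": ret_user.get("name"),
        "ret_user_verified": ret_user.get("verified"),
        "quoted_user_name": quoted_user.get("name"),
        "quoted_user_verified": quoted_user.get("verified"),
    }


# Hypothetical sample record, not real data
sample = json.dumps({
    "id": 1,
    "created_at": "Sun Jun 14 02:13:05 +0000 2020",
    "hashtags": [{"text": "MAGA"}],
    "source": '<a href="http://twitter.com/download/iphone">Twitter for iPhone</a>',
    "text": "hello",
    "retweeted_status": {"user": {"name": "The White House", "verified": True}},
})
row = status_to_row(sample)
print(row["created_at"], row["source"], row["ret_user_name"])
```

A list of such rows can be passed to `pandas.DataFrame(rows)` to obtain the table.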
- Code snippet
Convert the posting time into the derived date and hour fields
def get_created_info(row):
    # created_at is expected to look like '2020-06-14 02:13:05'
    created_at = row.created_at
    day, time_str = created_at.split(' ')
    hour = time_str.split(':')[0]
    day_hour = day + hour
    return day, day_hour, hour

# Apply to multiple fields
df[['created_day', 'created_day_hour', 'created_hour']] = df.apply(get_created_info, axis=1, result_type='expand')
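The same three fields can also be derived without a row-wise `apply`. This is only an alternative sketch, assuming `created_at` holds strings such as `2020-06-14 02:13:05`; the two sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows in the assumed 'YYYY-MM-DD HH:MM:SS' format
df = pd.DataFrame({"created_at": ["2020-06-14 02:13:05", "2020-06-05 11:59:00"]})

ts = pd.to_datetime(df["created_at"])
df["created_day"] = ts.dt.strftime("%Y-%m-%d")
df["created_hour"] = ts.dt.strftime("%H")
# Day and hour are concatenated without a separator, e.g. 2020-06-1402
df["created_day_hour"] = df["created_day"] + df["created_hour"]
print(df)
```

Working on whole columns is usually faster than calling a Python function per row, and parsing with `pd.to_datetime` fails loudly on malformed timestamps instead of silently splitting them wrong.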
Analysis of tweets
The data preprocessed in the previous step is loaded into a pandas DataFrame for further processing
import pandas as pd

df = pd.read_excel('./Trump-twitter.xlsx')
Now for the analysis along a number of dimensions
Daily post statistics
df[['id', 'created_day']].groupby(by=['created_day']).count().sort_values(by=['id'], ascending=False).head(20).plot.bar(figsize=(12, 4))
The chart shows the 20 days with the most posts
The busiest day was 2020-06-05, with 153 posts. Let’s take a closer look at the number of posts per hour on that day
df1 = df[df.created_day == '2020-06-05']
df1[['id', 'created_hour']].groupby(by=['created_hour']).count().plot.bar(figsize=(12, 4))
Posting peaked at 11 and 12 o’clock that day. All the tweets of the day are then analyzed and drawn as a word cloud. WordCloud is a handy library for drawing word clouds
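The article does not include its word cloud code. The following is only a sketch of one way to do it: count word frequencies first, then hand them to `WordCloud.generate_from_frequencies`. The stop-word list, the function names and the two sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer
STOPWORDS = {"the", "a", "to", "of", "and", "in", "is", "rt", "https", "co"}


def word_frequencies(texts):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[A-Za-z']+", text.lower()):
            if word not in STOPWORDS and len(word) > 1:
                counts[word] += 1
    return counts


def render_cloud(freqs, path="wordcloud.png"):
    """Draw the frequencies as an image (requires: pip install wordcloud)."""
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(freqs).to_file(path)


# Made-up sample texts; with real data, pass df1.text instead
freqs = word_frequencies(["Fake news again!", "The news is FAKE"])
print(freqs.most_common(3))
```

Calling `render_cloud(freqs)` writes the image to disk. Keeping the counting step separate makes it easy to inspect the top words before drawing.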
Sentiment analysis of tweets
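from textblob import TextBlob as tb  # tb is assumed to be an alias for TextBlob

def get_sentiment(row):
    # polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    polarity = tb(row.text).polarity
    if polarity < 0:
        tag = 'negative'
    elif polarity < 0.3:
        tag = 'neutral'
    else:
        tag = 'positive'
    return tag, polarity

df[['text_polarity', 'polarity_prob']] = df.apply(get_sentiment, axis=1, result_type='expand')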
The TextBlob library used here is a simple text-processing library that supports sentiment analysis, part-of-speech tagging, spelling correction and more
Trump’s tweets turn out to be mostly neutral, though there are also many negative ones; this is not explored further here…
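The thresholds that turn a polarity score into a label can be isolated and checked on their own. This is a minimal sketch: `polarity_to_tag` and `score_text` are hypothetical names, and the cut-offs 0 and 0.3 are the ones used in `get_sentiment`.

```python
def polarity_to_tag(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity < 0:
        return "negative"
    if polarity < 0.3:
        return "neutral"
    return "positive"


def score_text(text: str):
    """Score one text with TextBlob (requires: pip install textblob)."""
    from textblob import TextBlob
    polarity = TextBlob(text).sentiment.polarity
    return polarity_to_tag(polarity), polarity


for value in (-0.5, 0.0, 0.29, 0.3):
    print(value, polarity_to_tag(value))
```

A score of exactly 0.0 falls into the neutral band. As far as I know, TextBlob's default analyzer returns 0.0 when it recognizes no sentiment-bearing words, so short tweets and bare links may inflate the neutral share.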
Time-of-day statistics
df[['id', 'created_hour']].groupby(by=['created_hour']).count().plot.bar(figsize=(12, 4))
The figure shows that Trump posts around the clock, with the most tweets around 12 o’clock. He really does seem to govern through Twitter, and the volume of tweets is staggering.
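One caveat when reading hourly charts: the Twitter API reports `created_at` in UTC, and the article does not say whether the hours were converted to local time. The sketch below assumes UTC input and uses US Eastern time as the target zone. Both the zone and the two sample timestamps are assumptions, not something the article states.

```python
import pandas as pd

# Made-up sample timestamps, assumed to be in UTC
df = pd.DataFrame({"created_at": ["2020-06-05 16:05:00", "2020-06-05 03:30:00"]})

utc = pd.to_datetime(df["created_at"], utc=True)
eastern = utc.dt.tz_convert("America/New_York")
df["created_hour_utc"] = utc.dt.hour
df["created_hour_et"] = eastern.dt.hour

# Count posts per local hour, keeping empty hours so the chart shows all 24 bars
hourly = (df.groupby("created_hour_et")["created_at"].count()
            .reindex(range(24), fill_value=0))
print(hourly[hourly > 0])
```

In June, Eastern time is four hours behind UTC, so a peak at 12 o'clock in one zone appears at 16 o'clock in the other.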
All the tweets from that period are analyzed to create a word cloud
As the word cloud shows, “White House”, “Novel Coronavirus”, “Obama” and “Biden” are frequent references, and the phrase “Great American” is never far away…
Next, a word cloud of the tweets that mention “FAKE NEWS”
This one needs no explanation
Retweets versus original tweets
explode = (0, 0.1)
df[['id', 'tweet_status']].groupby(by=['tweet_status']).count()\
    .plot.pie(y='id', figsize=(5, 5), explode=explode, autopct='%1.1f%%', shadow=True, startangle=0, label='')
Retweets and original tweets each account for a large share
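The `tweet_status` column is not listed among the preprocessed fields, so its construction is unknown. One plausible way to derive it, shown purely as an assumption, is to treat a tweet as a retweet whenever `ret_user_name` is filled in. The labels and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; ret_user_name is empty for original tweets
df = pd.DataFrame({
    "id": [1, 2, 3],
    "ret_user_name": ["The White House", None, None],
})

# Assumed rule: a tweet with a retweeted author is a retweet, otherwise original
df["tweet_status"] = df["ret_user_name"].notna().map(
    {True: "retweet", False: "original"})

share = df["tweet_status"].value_counts(normalize=True)
print(share)
```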
Most-retweeted authors
rt_df = df[['id', 'ret_user_name']].groupby(by=['ret_user_name']).count().sort_values(by=['id'], ascending=False)
rt_df.head(20).plot.bar(figsize=(16, 4))
- Most retweets are from the White House, followed by retweets of his own tweets
Posting source statistics
explode = (0, 0.1)
df[['id', 'source']].groupby(by=['source']).count()\
    .plot.pie(y='id', figsize=(5, 5), explode=explode, autopct='%1.1f%%', shadow=True, startangle=0, label='')
Most tweets are posted from the iPhone
Hashtag statistics
- Trump has been paying the most attention to the COVID-19 epidemic and the related “Paycheck Protection Program”.
- Next is MAGA (Make America Great Again)
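The hashtag counts behind this ranking are not shown in the article. A minimal sketch follows, assuming the `hashtags` column stores the tags of each tweet as one comma-separated string. Both the storage format and the sample values are assumptions.

```python
import pandas as pd

# Made-up sample rows; None marks a tweet without hashtags
df = pd.DataFrame({
    "hashtags": ["COVID19,PaycheckProtectionProgram", "MAGA", None, "COVID19"],
})

tags = (df["hashtags"]
        .dropna()
        .str.split(",")
        .explode()       # one row per hashtag
        .str.strip())
tags = tags[tags != ""]
top = tags.value_counts().head(20)
print(top)
```

`top.plot.bar(figsize=(12, 4))` would draw the counts in the same style as the other charts.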
Content clustering of tweets
Trump’s tweets are clustered into 10 groups. The figure shows that the content of his tweets is concentrated in a few of them
Training word2vec on the tweets
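The article does not say which clustering method was used. The sketch below is not the author's method. It swaps in a common baseline, TF-IDF vectors clustered with scikit-learn's KMeans, on a handful of made-up sentences and with 2 clusters instead of 10 to keep the example small.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample texts standing in for the tweet column
texts = [
    "coronavirus testing update",
    "coronavirus task force briefing",
    "fake news media again",
    "the fake news never stops",
    "testing numbers for coronavirus",
    "media reports are fake",
]

# Turn each text into a sparse TF-IDF vector, dropping English stop words
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)

# The article uses 10 clusters; 2 is enough for this toy input
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
print(list(zip(labels, texts)))
```

With real data, `texts` would be `df.text` and `n_clusters` would be 10. The cluster numbers themselves are arbitrary, so only the grouping is meaningful.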
model.similar_by_word('Trump')
# output
[('coronavirus', 0.6097418069839478),
('great', 0.5778061151504517),
('realDonaldTrump', 0.554646909236908),
('Great', 0.5381245613098145),
('National', 0.49641942977905273),
('America', 0.47522449493408203),
('today', 0.4736398458480835),
('people', 0.469297856092453),
('Democrats', 0.45948123931884766),
('time', 0.4551768898963928)]
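The scores in this output are cosine similarities between word vectors. The sketch below shows that ranking on toy two-dimensional vectors. The vectors and the name `rank_by_cosine` are made up for illustration and have nothing to do with the trained model.

```python
import numpy as np


def rank_by_cosine(word, vectors, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue
        scores.append((other, float(query @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, not taken from any trained model
toy = {
    "Trump": np.array([1.0, 0.0]),
    "great": np.array([1.0, 1.0]),
    "time": np.array([0.0, 1.0]),
}
ranked = rank_by_cosine("Trump", toy)
print(ranked)
```

A score near 1 means two words appear in very similar contexts, while a score near 0 means their contexts are unrelated.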
Likes histogram
df[df.favorite_count > 0][['id', 'favorite_count']].plot.hist(y='favorite_count', bins=50, figsize=(12, 4))
The number of likes is mainly around 5 million
Retweet count histogram
df[df.retweet_count > 0][['id', 'retweet_count']].plot.hist(y='retweet_count', bins=50, figsize=(12, 4))
Retweet counts are mostly around 10,000
Conclusion
This article mainly presents objective statistics along several dimensions and offers little interpretation. It is meant as a starting point: readers can use their imagination to analyze more dimensions and draw deeper conclusions.