Hello everyone, I am the talented brother.
Recently, the web drama "Yunnan Bug Valley", the latest installment in the Ghost Blows Out the Light series, went online. It picks up from the previous installment "Longling Fan Caves", keeps the original Iron Triangle cast, and netizens say it is very good!
Today, we will use Python to crawl all of the show's current comments (including those on the trailers) and run statistics and visual analysis on them. Let's follow the netizens and watch along!
This article explains the crawler, the data processing, and the visualization in detail, so you can learn while having fun!
1. Web page analysis
All the comments come from Tencent Video (after all, it is the only platform airing the show).
Open the "Yunnan Bug Valley" playback page, press F12 to enter the browser's developer mode, scroll down and click "view more comments", and you can find the real request address for the comments.
Let's grab several comment API addresses and compare them to find the pattern:
https://video.coral.qq.com/varticle/7313247714/comment/v2?callback=_varticle7313247714commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6838089132036599025&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1630752996851
https://video.coral.qq.com/varticle/7313247714/comment/v2?callback=_varticle7313247714commentv2&orinum=100&oriorder=o&pageflag=1&cursor=6838127093335586287&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1630752996850
https://video.coral.qq.com/varticle/7313258351/comment/v2?callback=_varticle7313258351commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6838101562707822837&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1630753165406
Finally, we find that the address can be simplified to the following:
url = f'https://video.coral.qq.com/varticle/{comment_id}/comment/v2?'
params = {
    'orinum': 30,
    'cursor': cursor,
    'oriorder': 't'
}
The meanings of the four parameters are as follows (a minimal single-request sketch follows this list):
- `orinum` is the number of comments per request. The default is 10; after some testing I found the maximum is 30, so here I set the value to 30.
- `cursor` is the starting comment ID of each request. Its initial value can be set to 0; each subsequent request uses the ID of the last comment in the previous response.
- `oriorder` is the order of the requested comments (`t` means chronological order; the default is the other option, hottest first).
- In addition to the three parameters above, there is actually one more: `comment_id`. Each episode has its own ID for collecting the corresponding comments, so we need to figure out where to get that ID.
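To make these parameters and the cursor mechanism concrete, here is a minimal single-request sketch; the comment_id is one of the ids visible in the request URLs above, and the response fields are the same ones the collection code relies on later:

import requests

# One page of comments for a single episode; the next page's cursor
# comes back in the response as data['last']
comment_id = 7313247714  # example id taken from the request URLs above
url = f'https://video.coral.qq.com/varticle/{comment_id}/comment/v2'
params = {'orinum': 30, 'cursor': 0, 'oriorder': 't'}
data = requests.get(url, params=params).json()['data']
print(len(data['oriCommList']))   # up to 30 comments on this page
next_cursor = data['last']        # use as 'cursor' for the next request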
As mentioned above, we need the comment ID for each episode, and we found that it can be obtained by requesting each episode's page.
Note that we request each episode's page directly with requests, so the returned HTML is not identical to the rendered web page, but the relevant data can still be found in it.
For example, the list of episode ids looks like this:
And the episode comment ID is located here:
Then we can extract them by parsing the data with re regular expressions.
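As a toy illustration of this re-based parsing (the HTML snippet below is made up, but it mirrors the two fields we extract from the real episode pages in section 2.3):

import re

html = '..."comment_id":"7313247714","vip_ids":[{"V":"abc123"}]...'  # made-up snippet
comment_id = re.findall(r'"comment_id":"(\d+)"', html)[0]   # '7313247714'
vip_ids = re.findall(r'"vip_ids":(\[.*?\])', html)[0]       # '[{"V":"abc123"}]'
print(comment_id, vip_ids)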
2. Crawler process
Through web page analysis and our collection requirements, the whole process can be divided into the following parts:
- Crawl the episode page data
- Parse to get episode ids and episode review ids
- Collect all episode reviews
- Save data locally
2.1. Introduce the required libraries
import requests
import re
import pandas as pd
import os
2.2. Crawl the episode page data
# Used to crawl episode page data
def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
    }
    r = requests.get(url, headers=headers)
    # Fix garbled encoding
    r.encoding = r.apparent_encoding
    text = r.text
    # Remove whitespace characters
    html = re.sub(r'\s', '', text)
    return html
2.3. Parse episode ids and episode review ids
# Pass in the drama's id to crawl each episode's id and comment id
def get_comment_ids(video_id):
    # Play page address
    url = f'https://v.qq.com/x/cover/{video_id}.html'
    html = get_html(url)
    data_list = eval(re.findall(r'"vip_ids":(\[.*?\])', html)[0])
    data_df = pd.DataFrame(data_list)
    comment_ids = []
    for tid in data_df.V:
        # Address of each episode
        url = f'https://v.qq.com/x/cover/{video_id}/{tid}.html'
        html = get_html(url)
        comment_id = eval(re.findall(r'"comment_id":"(\d+)"', html)[0])
        comment_ids.append(comment_id)
    data_df['comment_id'] = comment_ids
    data_df['show'] = range(1, len(comment_ids) + 1)
    return data_df
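A small side note: eval happens to work on the captured arrays above, but json.loads is a safer way to parse them; a sketch under the assumption that the captured text is plain JSON:

import json
import re

def parse_vip_ids(html):
    # Safer alternative to eval for the "vip_ids" JSON array
    return json.loads(re.findall(r'"vip_ids":(\[.*?\])', html)[0])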
2.4. Collect all episode reviews
# Get all episode comments
def get_comment_content(data_df):
    for i, comment_id in enumerate(data_df.comment_id):
        i = i + 1
        # Initial cursor
        cursor = 0
        num = 0
        while True:
            url = f'https://video.coral.qq.com/varticle/{comment_id}/comment/v2?'
            params = {
                'orinum': 30,
                'cursor': cursor,
                'oriorder': 't'
            }
            r = requests.get(url, params=params)
            data = r.json()
            data = data['data']
            if len(data['oriCommList']) == 0:
                break
            # Comment data
            data_content = pd.DataFrame(data['oriCommList'])
            data_content = data_content[['id', 'targetid', 'parent', 'time', 'userid', 'content', 'up']]
            # Commenter info
            userinfo = pd.DataFrame(data['userList']).T
            userinfo = userinfo[['userid', 'nick', 'head', 'gender', 'hwlevel']].reset_index(drop=True)
            # Merge comment info with commenter info
            data_content = data_content.merge(userinfo, how='left')
            # Convert the Unix timestamp and shift from UTC to UTC+8 (Beijing time)
            data_content.time = pd.to_datetime(data_content.time, unit='s') + pd.Timedelta(days=8/24)
            data_content['show'] = i
            data_content.id = data_content.id.astype('string')
            save_csv(data_content)
            # Cursor for the next page
            cursor = data['last']
            num = num + 1
            pages = data['oritotal'] // 30 + 1
            print(f'Episode {i}: page {num}/{pages} of comments collected!')
2.5. Save data to the local PC
# Save comment data locally
def save_csv(df):
    file_name = 'Comment data.csv'
    if os.path.exists(file_name):
        df.to_csv(file_name, mode='a', header=False,
                  index=None, encoding='utf_8_sig')
    else:
        df.to_csv(file_name, index=None, encoding='utf_8_sig')
    print('Data saved!')
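The functions above are all we need; a minimal driver sketch for stringing them together would look like this (the video_id is a placeholder you would replace with the cover id taken from the show's play URL):

if __name__ == '__main__':
    video_id = 'xxxxxxxxxxx'   # placeholder: cover id from https://v.qq.com/x/cover/{video_id}.html
    data_df = get_comment_ids(video_id)   # episode ids + comment ids
    get_comment_content(data_df)          # collect all comments and save to CSV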
3. Data statistics and visual display
The statistical and visualization methods used this time can also be found in the previous posts "" and "".
3.1. Data preview
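The analysis below works on a DataFrame named df; a minimal sketch of loading it back from the CSV written by save_csv (assuming the same file name and working directory):

import pandas as pd

df = pd.read_csv('Comment data.csv')   # comments collected by the crawler above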
Let’s take 5 samples
df.sample(5)
Look at the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35758 entries, 0 to 35757
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        35758 non-null  int64
 1   targetid  35758 non-null  int64
 2   parent    35758 non-null  int64
 3   time      35758 non-null  object
 4   userid    35758 non-null  int64
 5   content   35735 non-null  object
 6   up        35758 non-null  int64
 7   nick      35758 non-null  object
 8   head      35758 non-null  object
 9   gender    35758 non-null  int64
 10  hwlevel   35758 non-null  int64
 11  show      35758 non-null  int64
dtypes: int64(8), object(4)
memory usage: 3.3+ MB
The talented brother (that's me) also left a comment; let's see whether it was collected.
My userid is 1296690233. Checking it, we find that my VIP level is actually 6:
df.query('userid==1296690233')
The head field holds the avatar URL; let's check whether it really is the avatar:
from skimage import io
# Display the avatar
img_url = df.query('userid==1296690233')['head'].iloc[0]
image = io.imread(img_url)
io.imshow(image)
Got it, got it!!
3.2. Number of comments per episode
For the visualization, we use pandas_bokeh as the plotting backend for Pandas.
import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()
pd.set_option('plotting.backend', 'pandas_bokeh')
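If you are not running in a notebook, pandas_bokeh can also write the charts to a standalone HTML file; a small sketch (output_file mirrors Bokeh's own output functions):

import pandas_bokeh

# Send all subsequent plots to an HTML file instead of the notebook
pandas_bokeh.output_file('comment_charts.html')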
Now let's formally start the statistics and visualization.
from bokeh.transform import linear_cmap
from bokeh.palettes import Spectral
from bokeh.io import curdoc
# curdoc().theme = 'caliber'
episode_comment_num = df.groupby('show')['id'].nunique().to_frame('Number of comments')
y = episode_comment_num['Number of comments']
mapper = linear_cmap(field_name='Number of comments', palette=Spectral[11], low=min(y), high=max(y))
episode_bar = episode_comment_num.plot_bokeh.bar(
    ylabel="Number of comments",
    title="Number of comments per episode",
    color=mapper,
    alpha=0.8,
    legend=False
)
As we can see, the first episode has the highest number of comments, about 17,000, accounting for roughly half of all comments; it is followed by episode 7, which aired this week!
3.3. Number of comments by date
df['date'] = pd.to_datetime(df.time).dt.date
date_comment_num = df.groupby('date')['id'].nunique().to_frame('Number of comments')
date_comment_num.index = date_comment_num.index.astype('string')
y = date_comment_num['Number of comments']
mapper = linear_cmap(field_name='Number of comments', palette=Spectral[11], low=min(y), high=max(y))
date_bar = date_comment_num.plot_bokeh.bar(
    ylabel="Number of comments",
    title="Number of comments by date",
    color=mapper,
    alpha=0.8,
    legend=False
)
The show started airing on August 30, and members could watch five episodes on the first day; as a member, I watched them all in one sitting. We can see that the number of comments was relatively high in the first two days after the premiere, and it is also relatively high on the days when new episodes are released (1-3 per week).
3.4. Number of comments by hour
# Note: this overwrites the original timestamp column with just the hour
df['time'] = pd.to_datetime(df.time).dt.hour
date_comment_num = pd.pivot_table(df,
                                  values='id',
                                  index=['time'],
                                  columns=['date'],
                                  aggfunc='count'
                                  )
time_line = date_comment_num.plot_bokeh(kind="line",
                                        legend="top_left",
                                        title="Comments by hour"
                                        )
Looking at the hourly comment curve, we find that the number of comments per hour peaked at 8 o'clock on the premiere day; after that, the pattern largely matches typical viewing behavior: noon, evening, and midnight are higher.
3.5. Distribution of commenter VIP levels
vip_comment_num = df.groupby('hwlevel').agg(
    **{'Number of users': ('userid', 'nunique'),
       'Number of comments': ('id', 'nunique')}
)
vip_comment_num['Comments per capita'] = round(
    vip_comment_num['Number of comments'] / vip_comment_num['Number of users'], 2)
usernum_pie = vip_comment_num.plot_bokeh.pie(
    y="Number of users",
    colormap=Spectral[9],
    title="Commenter VIP level distribution",
)
We have to say, most of the commenters are VIP users; no wonder Tencent Video keeps pushing the so-called advance screenings for VIPs, and even a VIP on top of VIP...
Is there a difference in the number of comments per VIP user?
y = vip_comment_num['Comments per capita']
mapper = linear_cmap(field_name='Comments per capita', palette=Spectral[11], low=min(y), high=max(y))
vipmean_bar = vip_comment_num.plot_bokeh.bar(
    y='Comments per capita',
    ylabel="Comments per capita",
    title="Number of comments per VIP user",
    color=mapper,
    alpha=0.8,
    legend=False
)
Basically, the higher the VIP level, the more comments per user! But why?
3.6. Comment length
Most commenters write short things like "666" or "looks great"; mine, for example, was just a few characters asking for the next update. So how many characters does a typical comment have?
import numpy as np
df['Comment Length'] = df['content'].str.len()
df['Comment Length'] = df['Comment Length'].fillna(0).astype('int')
contentlen_hist = df.plot_bokeh.hist(
    y='Comment Length',
    ylabel="Number of comments",
    bins=np.linspace(0, 100, 26),
    vertical_xlabel=True,
    hovertool=False,
    title="Comment length histogram",
    color='red',
    line_color="white",
    legend=False,
    # normed=100,
)
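Besides the histogram, a quick numeric summary answers the same question (an optional sketch):

# Count, mean, quartiles and max of comment length
print(df['Comment Length'].describe())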
Let's take a look at the longest comments:
(df.sort_values(by='Comment Length', ascending=False)
 [['show', 'content', 'Comment Length', 'nick', 'hwlevel']].head(3)
 .style.hide_index()
)
I mean, is that a plagiarized review, or is it really brilliant?
3.7. Number of likes on comments
Let’s take a look at the most liked ones
# pd.set_option('display.max_colwidth', 1000)
(df.sort_values(by='up', ascending=False)
 [['show', 'content', 'up', 'nick', 'hwlevel']].head()
 .style.hide_index()
)
Read the map and don’t get lost! Is that a meme? Over 8000 likes ~~
3.8. Users with the most comments
user_comment_num = df.groupby('userid').agg(
    **{'Number of comments': ('id', 'nunique'),
       'VIP level': ('hwlevel', 'max')}
).reset_index()
user_comment_num.sort_values(by='Number of comments', ascending=False).head()
| userid | Number of comments | VIP level |
| --- | --- | --- |
| 640014751 | 33 | 5 |
| 1368145091 | 24 | 1 |
| 1214181910 | 18 | 3 |
| 1305770517 | 17 | 2 |
| 1015445833 | 14 | 5 |
33 comments! This user is really something!! Let's see what he said:
df.query('userid==640014751')[['nick', 'show', 'time', 'content']].sort_values(by='time')
A bit boring; he is basically just spamming praise!! Let's take a look at the friend with the second-most comments.
df.query('userid==1368145091')[['nick', 'show', 'time', 'content']].sort_values(by='time')
I gotta say, these look normal. Though he does seem a little chatty, haha!
What do these two users' avatars look like? If you are curious, take a look:
from skimage import io
# Display the avatar
img_url = df.query('userid==640014751')['head'].iloc[0]
image = io.imread(img_url)
io.imshow(image)
from skimage import io
# Display the avatar
img_url = df.query('userid==1368145091')['head'].iloc[0]
image = io.imread(img_url)
io.imshow(image)
Ahem, I won’t make a judgment, after all, my avatar and nickname are also very…
3.9. Comment word cloud
This part refers to "", and we will build both the overall word cloud and word clouds for the lead characters.
First, let's look at how many times our three main characters are mentioned:
df.fillna('', inplace=True)
hu = ['khufu', 'Hu Bayi', 'Pan Yueming', 'hu', 'pan']
yang = ['Zhang Yuqi', 'Shirley', '杨']
wang = ['Jiang Chao', 'fat']
df_hu = df[df['content'].str.contains('|'.join(hu))]
df_yang = df[df['content'].str.contains('|'.join(yang))]
df_wang = df[df['content'].str.contains('|'.join(wang))]
df_star = pd.DataFrame({'role': ['Hu Bayi', 'Shirley Yang', 'Wang Pangzi'],
                        'weight': [len(df_hu), len(df_yang), len(df_wang)]
                        })
y = df_star['weight']
mapper = linear_cmap(field_name='weight', palette=Spectral[11], low=min(y), high=max(y))
df_star_bar = df_star.plot_bokeh.bar(
    x='role',
    y='weight',
    ylabel="Mention weight",
    title="Main character mention weights",
    color=mapper,
    alpha=0.8,
    legend=False
)
Wang Pangzi is the comic relief of the show; I have to say his mention count is really high!!
The overall word cloud
Hu Bayi word cloud
Shirley Yang word cloud
Wang Pangzi word cloud
Word cloud core code
import os
import stylecloud
from PIL import Image
import jieba
import jieba.analyse
import pandas as pd
from wordcloud import STOPWORDS


def ciYun(data, addWords, stopWords):
    print('Drawing the word cloud...')
    comment_data = data
    # Add custom words to the jieba dictionary
    for addWord in addWords:
        jieba.add_word(addWord)
    comment_after_split = jieba.cut(str(comment_data), cut_all=False)
    words = ' '.join(comment_after_split)
    # Word cloud stop words
    stopwords = STOPWORDS.copy()
    for stopWord in stopWords:
        stopwords.add(stopWord)
    # Generate the word cloud with the required parameters
    stylecloud.gen_stylecloud(
        text=words,
        size=800,
        palette='tableau.BlueRed_6',    # Color scheme
        icon_name='fas fa-mountain',    # paper-plane mountain thumbs-up male fa-cloud
        custom_stopwords=stopwords,
        font_path='FZZJ-YGYTKJW.TTF',
        # bg=bg,
        # font_path=font_path,  # Word cloud font (Chinese needs a local Chinese font)
    )
    print('Word cloud generated~')
    pic_path = os.getcwd()
    print(f'The word cloud image has been saved in {pic_path}')

data = df.content.to_list()
addWords = ['Teacher Pan', 'Insect Valley of Yunnan', 'Degree of reduction', 'khufu', 'Hu Bayi',
            'Pan Yueming', 'Zhang Yuqi', 'Shirley', 'Shirley Yang', 'General Yang', 'Wang Pangzi', 'fat']
# Add stop words
stoptxt = pd.read_table(r'stop.txt', encoding='utf-8', header=None)
stoptxt.drop_duplicates(inplace=True)
stopWords = stoptxt[0].to_list()
words = ['said', 'years', 'VIP', 'true', 'this is', 'no', 'dry', 'like']
stopWords.extend(words)
# Run~
ciYun(data, addWords, stopWords)
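The code above produces the overall word cloud. For the per-character clouds shown earlier, one possible approach is to reuse the df_hu / df_yang / df_wang subsets built at the start of this section and call ciYun on each of them; a sketch, assuming those DataFrames are still in scope:

# Per-character word clouds, reusing the keyword-filtered subsets from above.
# Note: stylecloud writes to stylecloud.png by default, so rename the output
# between runs (or pass output_name to gen_stylecloud).
for name, df_role in [('Hu Bayi', df_hu), ('Shirley Yang', df_yang), ('Wang Pangzi', df_wang)]:
    print(f'Generating word cloud for {name}...')
    ciYun(df_role.content.to_list(), addWords, stopWords)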