“This is the third day of my participation in the November Gengwen Challenge. For event details, see: The Last Gengwen Challenge 2021.”
Preface
In this post we will use Python to scrape the bullet-screen (danmu) comments of a Mango TV video. Without further ado, let's get started.
Development tools
Python version: 3.6.4
Related modules:
requests module;
pandas module;
and some modules that come with Python.
Environment set up
Install Python, add it to the environment variables, and install the required third-party modules with pip.
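The two third-party modules can be installed in one command, for example:

```shell
pip install requests pandas
```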
Approach
This article explains how to scrape the bullet comments and user comments of a Mango TV video, taking the movie “On the Cliff” as an example.
The target site
https://www.mgtv.com/b/335313/12281642.html?fpa=15800&fpos=8&lastp=ch_movie
Scraping the bullet comments
Analysis of the website
The bullet-comment data is loaded dynamically, so you need to open the browser's developer tools and capture the packets to find the real URLs behind it. Roughly every minute of playback, the player requests a new JSON packet containing the bullet-comment data we need.
Get the real URL
https://bullet-ali.hitv.com/bullet/202108/14/005323/12281642/0.json
https://bullet-ali.hitv.com/bullet/202108/14/005323/12281642/1.json
Each URL differs only in the trailing number: the first is 0, and it increases by one for each subsequent packet. The video is 120 minutes 20 seconds long, so rounding up gives 121 packets.
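The packet count and the URL list can be double-checked with a quick calculation (a sketch; the base URL is the one captured above):

```python
import math

# One JSON packet is produced per minute of playback.
duration_seconds = 120 * 60 + 20           # video length is 120:20
packets = math.ceil(duration_seconds / 60)
print(packets)  # 121

# The packet URLs differ only in the trailing, 0-based index.
base = 'https://bullet-ali.hitv.com/bullet/202108/14/005323/12281642/{}.json'
urls = [base.format(n) for n in range(packets)]
```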
Code implementation
```python
import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
for e in range(0, 121):
    print(f'Crawling packet {e}')
    response = requests.get(f'https://bullet-ali.hitv.com/bullet/2021/08/3/004902/12281642/{e}.json', headers=headers)
    # Extract the fields directly from the JSON
    for i in response.json()['data']['items']:
        ids = i['ids']          # user id
        content = i['content']  # bullet-comment content
        time = i['time']        # time the comment appears in the video
        # Some items have no like count
        try:
            v2_up_count = i['v2_up_count']
        except KeyError:
            v2_up_count = ''
        text = pd.DataFrame({'ids': [ids], 'barrage': [content], 'Occurrence time': [time], 'likes': [v2_up_count]})
        df = pd.concat([df, text])
df.to_csv('On the cliff.csv', encoding='utf-8', index=False)
```
Results
Scraping the comments
Analysis of web page
To see the comments under a Mango TV video, you have to scroll to the bottom of the page, and the comment data is again loaded dynamically. Open the developer tools and capture the packet as follows: Network → JS, then click “view more comments”.
A new js file is loaded each time, and it contains the comment data. The real URLs obtained:
https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943290494
https://comment.mgtv.com/v4/comment/getCommentList?page=2&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943296653
page is the page number and _ is a timestamp. Removing the timestamp from the URL does not affect the data returned, but the callback parameter wraps the response in a jQuery callback (JSONP) and interferes with parsing the JSON, so delete it as well. The final URL:
https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&_support=10000000
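Rather than editing the query string by hand, the parameters can be kept in a dict and encoded with the standard library, which makes dropping callback explicit (a sketch using the same endpoint and parameters as above):

```python
from urllib.parse import urlencode

params = {
    'page': 1,
    'subjectType': 'hunantv2014',
    'subjectId': '12281642',
    '_support': '10000000',
    # no 'callback' key: without it the endpoint returns plain JSON
}
url = 'https://comment.mgtv.com/v4/comment/getCommentList?' + urlencode(params)
print(url)
```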
Each packet contains 15 comments; with 2,527 comments in total, the last page number is 169.
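The page count follows directly from those totals (a quick check):

```python
import math

total_comments = 2527
per_page = 15
# Round up: the last, partially filled page still needs a request
pages = math.ceil(total_comments / per_page)
print(pages)  # 169
```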
Code implementation
```python
import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
for o in range(1, 170):
    url = f'https://comment.mgtv.com/v4/comment/getCommentList?page={o}&subjectType=hunantv2014&subjectId=12281642&_support=10000000'
    res = requests.get(url, headers=headers).json()
    for i in res['data']['list']:
        nickName = i['user']['nickName']  # user name
        praiseNum = i['praiseNum']        # number of likes
        date = i['date']                  # date posted
        content = i['content']            # comment content
        text = pd.DataFrame({'nickName': [nickName], 'praiseNum': [praiseNum], 'date': [date], 'content': [content]})
        df = pd.concat([df, text])
# Note: this overwrites the bullet-comment CSV if both scripts run in the same folder
df.to_csv('On the cliff.csv', encoding='utf-8', index=False)
```
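One optional tweak, not in the original code: if the CSV is opened directly in Excel, Chinese text saved as plain utf-8 may display garbled. Writing a byte-order mark with utf-8-sig avoids that:

```python
import pandas as pd

df = pd.DataFrame({'content': ['弹幕']})
# 'utf-8-sig' prepends a BOM so Excel detects the encoding correctly
df.to_csv('On the cliff.csv', encoding='utf-8-sig', index=False)
```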
Results