“This is the third day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021.”

Preface

In this post we use Python to scrape the bullet-screen (danmaku) comments of a Mango TV video. Enough preamble, let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

pandas module;

And some modules that come with Python.

Environment setup

Install Python, add it to your environment variables (PATH), and use pip to install the required modules.
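For example, the two third-party modules used below can be installed in one command:

```shell
pip install requests pandas
```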

Approach

This article explains how to scrape the bullet screen and comments of a Mango TV video, taking the movie "On the Cliff" as an example.

The target site

https://www.mgtv.com/b/335313/12281642.html?fpa=15800&fpos=8&lastp=ch_movie

Scraping the bullet screen

Analyzing the site

The file containing the bullet-screen data is loaded dynamically, so you need to open the browser's developer tools and capture the network traffic to find the real URL behind the data. For every minute of playback, the player requests a new JSON packet containing the bullet-screen data we need.

Get the real URL

https://bullet-ali.hitv.com/bullet/202108/14/005323/12281642/0.json
https://bullet-ali.hitv.com/bullet/202108/14/005323/12281642/1.json

Each URL differs only in its final number: the first is 0, and it increases by one per packet. The video runs 120 minutes 20 seconds; rounding up to whole minutes gives 121 packets.
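As a quick sanity check on that count: one JSON packet covers one minute of playback, so the packet count is the video length in minutes rounded up. A minimal sketch:

```python
import math

minutes, seconds = 120, 20            # video length: 120:20
duration_min = minutes + seconds / 60
packets = math.ceil(duration_min)     # one packet per minute, rounded up
print(packets)  # 121
```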

Code implementation

import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
for e in range(121):
    print(f'Scraping packet {e}')
    response = requests.get(
        f'https://bullet-ali.hitv.com/bullet/2021/08/3/004902/12281642/{e}.json',
        headers=headers)
    # Extract the fields directly from the JSON
    for i in response.json()['data']['items']:
        ids = i['ids']          # user id
        content = i['content']  # bullet-screen content
        time = i['time']        # time the bullet screen appears in the video
        # Some items have no like count
        try:
            v2_up_count = i['v2_up_count']
        except KeyError:
            v2_up_count = ''
        text = pd.DataFrame({'ids': [ids], 'barrage': [content], 'Occurrence time': [time]})
        df = pd.concat([df, text])
df.to_csv('On the cliff.csv', encoding='utf-8', index=False)

Results

Scraping the comments

Analyzing the page

Comments on a Mango TV video only appear once you scroll to the bottom of the page, and the comment data is also loaded dynamically. Open the developer tools and capture the traffic as follows: Network → JS, then click "view more comments" to trigger the request.

A JS file is then loaded that contains the comment data. The resulting real URLs:

https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943290494
https://comment.mgtv.com/v4/comment/getCommentList?page=2&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943296653

Here page is the page number and _ is a timestamp. Deleting the timestamp from the URL does not affect the data returned, but the callback parameter wraps the response in a jQuery callback and interferes with parsing, so delete it as well. We end up with:

https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&_support=10000000
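To see why the callback parameter gets in the way: with it present, the server returns JSONP, i.e. the JSON wrapped in a jQuery function call, which a plain JSON parser rejects. A minimal sketch of stripping such a wrapper (the function name and payload below are made up for illustration):

```python
import json
import re


def strip_jsonp(text):
    """Remove a JSONP wrapper like jQueryXXX({...}) and parse the inner JSON."""
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return json.loads(match.group(1) if match else text)


raw = 'jQuery1820749973529821774_1628942431449({"status": 200, "data": {"total": 2527}})'
print(strip_jsonp(raw))  # {'status': 200, 'data': {'total': 2527}}
```

Dropping the callback parameter from the URL, as done above, avoids the problem entirely, which is the simpler option.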

Each packet contains 15 comments; the total number of comments is 2527, so the last page is 169.
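Given those numbers, the last page and the full list of request URLs can be derived as follows:

```python
import math

total_comments, per_page = 2527, 15
last_page = math.ceil(total_comments / per_page)  # 2527 / 15 rounded up = 169

base = 'https://comment.mgtv.com/v4/comment/getCommentList'
urls = [f'{base}?page={p}&subjectType=hunantv2014&subjectId=12281642&_support=10000000'
        for p in range(1, last_page + 1)]
print(len(urls))  # 169
```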

Code implementation

import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
for o in range(1, 170):
    url = f'https://comment.mgtv.com/v4/comment/getCommentList?page={o}&subjectType=hunantv2014&subjectId=12281642&_support=10000000'
    res = requests.get(url, headers=headers).json()
    for i in res['data']['list']:
        nickName = i['user']['nickName']  # user name
        praiseNum = i['praiseNum']        # number of likes
        date = i['date']                  # date posted
        content = i['content']            # comment text
        text = pd.DataFrame({'nickName': [nickName], 'praiseNum': [praiseNum], 'date': [date], 'content': [content]})
        df = pd.concat([df, text])
df.to_csv('On the cliff.csv', encoding='utf-8', index=False)

Results