This is the sixth day of my participation in the November Writing Challenge. See details: The Last Writing Challenge of 2021.

Preface

In this article we will use Python to crawl the bullet-screen (danmu) comments of an iQiyi video. Without further ado, let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

re module;

pandas module;

lxml module;

random module;

and some modules that come with Python.

Environment setup

Install Python, add it to your environment variables, and use pip to install the required modules.
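For reference, the third-party modules above can be installed in a single command (assuming pip is available on your PATH; re, zlib, time, and random ship with Python):

```shell
pip install requests pandas lxml
```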

Analysis

This article takes the movie "Godzilla vs. King Kong" as an example to explain how to crawl the bullet screens and comments of an iQiyi video.

Target site

https://www.iqiyi.com/v_19rr0m845o.html

Grabbing the bullet screen

For iQiyi's bullet screen, you need to open the browser's developer tools and capture packets. You will find a .br compressed file that can be downloaded directly by clicking it. Its content is binary data, and a new packet is loaded for every minute of video playback.

Comparing the captured URLs, the only difference between the two is an incrementing number; the 60 means the video loads a new packet every 60 seconds:

https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_2_b2105043.br
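Since the packet index simply increments, the sequence of .br URLs can be generated programmatically. A quick sketch, with the URL template taken directly from the captured requests above:

```python
# Generate the first few .br packet URLs; the trailing index increments
# by one for every minute of playback.
base = 'https://cmts.iqiyi.com/bullet/64/00/1078946400_60_{}_b2105043.br'
urls = [base.format(i) for i in range(1, 4)]
print(urls)
```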

.br files can be decompressed with the brotli library, but in practice this is troublesome, especially the encoding problems, which are hard to solve. Decoding the result directly as UTF-8 raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 52: invalid start byte

Adding ignore to the decode call keeps the Chinese text from being garbled, but the HTML structure itself is still mangled, so extracting the data remains difficult:

decode("utf-8", "ignore")
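To see what the ignore error handler does, here is a minimal demonstration with an invalid UTF-8 byte like the 0x91 from the error message above:

```python
data = b'abc\x91def'  # 0x91 is not a valid UTF-8 start byte
text = data.decode('utf-8', 'ignore')  # invalid bytes are silently dropped
print(text)  # -> abcdef
```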

Modify the captured URL into the following link to obtain a .z compressed file instead:

https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z

The reason this works is that this is iQiyi's older bullet-screen interface, which has not been removed or changed and can still be used today. In the link, 1078946400 is the video ID. The 300 means the old interface loads a new bullet-screen packet every 5 minutes (300 seconds); since Godzilla vs. King Kong runs 112.59 minutes, dividing by 5 and rounding up gives 23 packets. The 1 is the page number, and 64 is the 7th and 8th digits of the ID value.
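Putting those observations together, the interface URL can be built from the video ID and the film's duration. A minimal sketch; note that the second path segment (00 here) is an assumption that it corresponds to the 9th and 10th digits of the ID, since the article only explains the 64 segment:

```python
import math

def bullet_url(video_id: str, page: int) -> str:
    # 64 = 7th and 8th digits of the ID; 00 assumed to be the 9th and 10th
    return (f'https://cmts.iqiyi.com/bullet/{video_id[6:8]}/{video_id[8:10]}/'
            f'{video_id}_300_{page}.z')

pages = math.ceil(112.59 / 5)  # 23 five-minute packets for a 112.59-minute film
print(pages, bullet_url('1078946400', 1))
```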

Code implementation

```python
import requests
import pandas as pd
from lxml import etree
from zlib import decompress  # for decompressing the .z files

df = pd.DataFrame()
for i in range(1, 24):  # 23 packets in total
    url = f'https://cmts.iqiyi.com/bullet/64/00/1078946400_300_{i}.z'
    bulletold = requests.get(url).content  # get the binary data
    decode = decompress(bulletold).decode('utf-8')  # decompress and decode
    with open(f'{i}.html', 'a+', encoding='utf-8') as f:  # save as a static HTML file
        f.write(decode)

    html = open(f'./{i}.html', 'rb').read()  # read the HTML file back
    html = etree.HTML(html)  # parse the page with XPath syntax
    ul = html.xpath('/html/body/danmu/data/entry/list/bulletinfo')
    for info in ul:
        contentid = ''.join(info.xpath('./contentid/text()'))
        content = ''.join(info.xpath('./content/text()'))
        likeCount = ''.join(info.xpath('./likecount/text()'))
        print(contentid, content, likeCount)
        text = pd.DataFrame({'contentid': [contentid], 'content': [content],
                             'likeCount': [likeCount]})
        df = pd.concat([df, text])
df.to_csv('Godzilla vs. King Kong.csv', encoding='utf-8', index=False)
```

Results

Grabbing comments

iQiyi's video comments are likewise loaded dynamically at the bottom of the page, so you again need the browser's developer tools to capture packets. Each time the page is scrolled down, a new packet containing comment data is loaded.

The captured URLs are:

https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page=&page_size=10&types=hot,time&callback=jsonp_1629025964363_15405
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=7963601726142521&page=&page_size=20&types=time&callback=jsonp_1629026041287_28685
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=4933019153543021&page=&page_size=20&types=time&callback=jsonp_1629026394325_81937

The first URL loads the featured (hot) comments, and from the second URL onward the full comment list is loaded. Deleting the unnecessary parameters yields the following URLs:

https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20

The difference lies in the parameters last_id and page_size. page_size is 10 in the first URL and fixed at 20 from the second URL onward. last_id is empty in the first URL and changes on every request after that. From my investigation, last_id is the id of the last comment returned by the previous request. The response data format is JSON.
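The paging rule can be sketched without hitting the network. Given a parsed JSON response (the sample dict below is hypothetical illustration data, shaped like the real responses), the next URL is built from the last comment's id:

```python
def next_url(res: dict) -> str:
    """Build the next comment-page URL from the last id of the current response."""
    last_id = res['data']['comments'][-1]['id']
    return ('https://sns-comment.iqiyi.com/v3/comment/get_comments.action'
            '?agent_type=118&agent_version=9.11.5&business_type=17'
            f'&content_id=1078946400&last_id={last_id}&page_size=20')

# hypothetical response fragment for illustration
sample = {'data': {'comments': [{'id': '111'}, {'id': '7963601726142521'}]}}
print(next_url(sample))
```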

Code implementation

```python
import requests
import pandas as pd
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
try:
    a = 0
    while True:
        if a == 0:
            url = ('https://sns-comment.iqiyi.com/v3/comment/get_comments.action'
                   '?agent_type=118&agent_version=9.11.5&business_type=17'
                   '&content_id=1078946400&page_size=10')
        else:
            # take the last id of the previous page from id_list
            url = ('https://sns-comment.iqiyi.com/v3/comment/get_comments.action'
                   '?agent_type=118&agent_version=9.11.5&business_type=17'
                   f'&content_id=1078946400&last_id={id_list[-1]}&page_size=20')
        print(url)
        res = requests.get(url, headers=headers).json()
        id_list = []  # list that stores the comment id values
        for i in res['data']['comments']:
            ids = i['id']
            id_list.append(ids)
            uname = i['userInfo']['uname']
            addTime = i['addTime']
            # first argument is the key to look up, second is the default for missing keys
            content = i.get('content', 'Nonexistence')
            text = pd.DataFrame({'ids': [ids], 'uname': [uname],
                                 'addTime': [addTime], 'content': [content]})
            df = pd.concat([df, text])
        a += 1
        time.sleep(random.uniform(2, 3))
except Exception as e:
    print(e)
df.to_csv('Godzilla vs. King Kong_comments.csv', mode='a+', encoding='utf-8', index=False)
```

Results