This is the sixth day of my participation in the November Writing Challenge. For details, see: The Last Writing Challenge of 2021.
Preface
This post uses Python to crawl iQiyi bullet-screen comments. No more nonsense; let's have a good time.
Development tools
Python version: 3.6.4
Related modules:
requests module;
re module;
pandas module;
lxml module;
random module;
and some modules that come with Python.
Environment setup
Install Python, add it to your environment variables, and use pip to install the required modules.
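The re and random modules ship with Python, so only the third-party modules need installing; for example:

```
pip install requests pandas lxml
```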
Thought analysis
This article takes the movie "Godzilla vs. King Kong" as an example to explain how to crawl an iQiyi video's bullet screen and comments!
Target site

```
https://www.iqiyi.com/v_19rr0m845o.html
```
Grabbing the bullet screen
To get the bullet screen of an iQiyi video, you need to open the browser's developer tools and capture the network packets. You will find a .br compressed file that can be downloaded directly by clicking it; its content is binary data, and a new packet is loaded for every minute of video playback.
Grab the URLs. The only difference between the two is an incrementing number; the 60 means the video loads a new packet every 60 seconds:

```
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_2_b2105043.br
```
.br files can be decompressed with the brotli library, but in practice this is painful, mainly because of encoding problems that are hard to solve. Decoding the result directly as UTF-8 reports the following error:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 52: invalid start byte
```
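For reference, here is a minimal sketch of that attempt, assuming the third-party brotli package is installed; the final line is where the UnicodeDecodeError above is raised:

```python
import brotli    # pip install brotli
import requests

# one of the captured .br packet URLs from earlier
url = 'https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br'
raw = requests.get(url).content   # binary .br payload
data = brotli.decompress(raw)     # brotli-decompress to raw bytes
text = data.decode('utf-8')       # raises the UnicodeDecodeError shown above
```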
Passing "ignore" when decoding keeps the Chinese text from being garbled, but the HTML structure still is, so extracting the data remains difficult:

```python
decode("utf-8", "ignore")
```
Instead, modify the captured URL into the following form to get a .z compressed file:

```
https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z
```
The reason this change works is that this is iQiyi's older bullet-screen interface, which has not been deleted or modified and can still be used today. In the interface link, 1078946400 is the video ID; 300 reflects the old interface loading a new bullet-screen packet every 5 minutes (300 seconds). Godzilla vs. King Kong runs 112.59 minutes, which divided by 5 and rounded up gives 23 packets; 1 is the page number; and 64 is the 7th and 8th digits of the ID value.
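As an illustration, here is how that link can be assembled from a video ID. bullet_url is a hypothetical helper; treating the "00" path segment as the last two digits of the ID is my own assumption, since the article does not explain it:

```python
import math

def bullet_url(tv_id: str, page: int) -> str:
    """Build the old-interface danmaku URL described above (illustrative)."""
    # '64' = 7th and 8th digits of the ID; '00' assumed to be its last two digits
    return (f'https://cmts.iqiyi.com/bullet/'
            f'{tv_id[6:8]}/{tv_id[-2:]}/{tv_id}_300_{page}.z')

pages = math.ceil(112.59 / 5)  # 23 packets for a 112.59-minute film
print(bullet_url('1078946400', 1))
# https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z
```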
Code implementation
```python
import requests
import pandas as pd
from lxml import etree
from zlib import decompress  # for decompressing the .z packets

df = pd.DataFrame()
for i in range(1, 24):  # 23 packets in total
    url = f'https://cmts.iqiyi.com/bullet/64/00/1078946400_300_{i}.z'
    bulletold = requests.get(url).content  # get the binary data
    decode = decompress(bulletold).decode('utf-8')  # decompress and decode
    with open(f'{i}.html', 'a+', encoding='utf-8') as f:  # save as a static HTML file
        f.write(decode)

    html = open(f'./{i}.html', 'rb').read()  # read the HTML file back in
    html = etree.HTML(html)  # parse the page so XPath syntax can be used
    ul = html.xpath('/html/body/danmu/data/entry/list/bulletinfo')
    for info in ul:
        contentid = ' '.join(info.xpath('./contentid/text()'))
        content = ' '.join(info.xpath('./content/text()'))
        likeCount = ' '.join(info.xpath('./likecount/text()'))
        print(contentid, content, likeCount)
        text = pd.DataFrame({'contentid': [contentid], 'content': [content], 'likeCount': [likeCount]})
        df = pd.concat([df, text])
df.to_csv('Godzilla vs. King Kong.csv', encoding='utf-8', index=False)
```
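Note that the XPath uses lowercase tag names under /html/body because etree.HTML lower-cases tags and wraps whatever it parses in an html/body skeleton. A hand-written illustration of the implied XML shape (not a captured packet):

```python
from lxml import etree

# hand-written sample shaped like the danmaku XML the XPath above implies
sample = b'''<danmu><data><entry><list>
  <bulletInfo>
    <contentId>123456</contentId>
    <content>example bullet comment</content>
    <likeCount>8</likeCount>
  </bulletInfo>
</list></entry></data></danmu>'''

html = etree.HTML(sample)
print(html.xpath('/html/body/danmu/data/entry/list/bulletinfo/content/text()'))
# ['example bullet comment']
```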
Results
Grabbing the comments
The comments on an iQiyi video are also dynamically loaded, at the bottom of the page, so once again you need the browser's developer tools to capture the packets. Each time the page is scrolled down, a new packet containing comment data is loaded.
The exact URLs you get:

```
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page=&page_size=10&types=hot,time&callback=jsonp_1629025964363_15405
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=7963601726142521&page=&page_size=20&types=time&callback=jsonp_1629026041287_28685
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=4933019153543021&page=&page_size=20&types=time&callback=jsonp_1629026394325_81937
```
The first URL loads the hot (featured) comments, and from the second URL onwards the full comment list is loaded. Deleting the unnecessary parameters gives the following URLs:

```
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20
```
The difference lies in the last_id and page_size parameters. page_size is 10 in the first URL and fixed at 20 from the second onwards. last_id is empty in the first URL and changes from the second onwards; in my testing, last_id is the id of the last comment returned by the previous request. The response data format is JSON.
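Before the full script, here is a minimal sketch of this pagination scheme, using a requests params dict instead of a hand-assembled URL; parameter values are taken from the trimmed requests above:

```python
import requests

base = 'https://sns-comment.iqiyi.com/v3/comment/get_comments.action'
params = {
    'agent_type': 118,
    'agent_version': '9.11.5',
    'business_type': 17,
    'content_id': 1078946400,
    'page_size': 10,   # 10 on the first request, 20 afterwards
    'last_id': '',     # empty on the first request
}
res = requests.get(base, params=params).json()
comments = res['data']['comments']
# feed the last comment's id into the next request
params.update(page_size=20, last_id=comments[-1]['id'])
```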
Code implementation
```python
import requests
import pandas as pd
import time
import random


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
try:
    a = 0
    while True:
        if a == 0:
            url = 'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&page_size=10'
        else:
            # take the last id of the previous page from id_list
            url = f'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id={id_list[-1]}&page_size=20'
        print(url)
        res = requests.get(url, headers=headers).json()
        id_list = []  # list for storing the comment id values
        for i in res['data']['comments']:
            ids = i['id']
            id_list.append(ids)
            uname = i['userInfo']['uname']
            addTime = i['addTime']
            content = i.get('content', 'Nonexistence')  # first argument is the key to look up, second is the default for a missing key
            text = pd.DataFrame({'ids': [ids], 'uname': [uname], 'addTime': [addTime], 'content': [content]})
            df = pd.concat([df, text])
        a += 1
        time.sleep(random.uniform(2, 3))
except Exception as e:
    print(e)

df.to_csv('Godzilla vs. King Kong _ comment.csv', mode='a+', encoding='utf-8', index=False)
```
Results