This is day 17 of my participation in the November Gwen Challenge. For event details, see: The Last Gwen Challenge in 2021.
Preface
This post uses Python to crawl QQ Music comments. Without further ado, let's get started.
Development tools
Python version: 3.6.4
Related modules:
requests module;
re module;
pymysql module;
and some modules that come with Python.
Environment setup
Install Python and add it to the environment variables, then use pip to install the required modules.
MySQL and Sequel Pro are both installed on the Mac.
Next, create the database, table, and primary key.
import pymysql

# Create the database
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE QQ_Music DEFAULT CHARACTER SET utf8mb4")
db.close()
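The code above only creates the database; the table and primary key are not shown. A minimal sketch of that step might look like the following. The table name `comments` and every column name are hypothetical, chosen only for illustration, and the function accepts any DB-API cursor such as the one returned by `db.cursor()`.

```python
# Hypothetical schema: the table and column names below are illustrative
# assumptions, not taken from the original project.
CREATE_TABLE_SQL = (
    "CREATE TABLE IF NOT EXISTS comments ("
    "comment_id VARCHAR(64) NOT NULL, "
    "nickname VARCHAR(255), "
    "content TEXT, "
    "comment_time DATETIME, "
    "PRIMARY KEY (comment_id)"
    ")"
)


def create_comments_table(cursor):
    """Create the (hypothetical) comments table using a DB-API cursor."""
    cursor.execute(CREATE_TABLE_SQL)
```

With pymysql, this would be called as `create_comments_table(db.cursor())` on a connection opened with `db='QQ_Music'`.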
While analyzing the QQ Music web page for the song "Last Summer", I checked the last page of "all comments" and found that the date range was cut short: one of the hot comments was dated July 12, while the comments on the last page of "all comments" only went back to July 16. Clearly, "all comments" is not really all of the comments. I wonder whether this is a bug in QQ Music.
Another issue is that jumping directly to the last page does not return the real data. You need to page backward from the last page until you reach a page with real data, and then page forward again to get the real data for the last page.
After confirming that the Ajax requests all use the same URL, I analyzed the request parameters and found that three of them change between requests:
jsonpCallback
pagenum
lasthotcommentid
pagenum is the page number. jsonpCallback does not affect the request, so it does not need to be changed. The value of lasthotcommentid corresponds to the ID of the last comment on the previous page, so it has to be updated for every request.
In other words, changing the values of pagenum and lasthotcommentid is enough for the request to succeed.
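As a rough sketch of that idea, the helper below builds the parameters for one page from a base dictionary. The function name `build_page_params` is hypothetical and not part of the original script; only the parameter names `pagenum` and `lasthotcommentid` come from the analysis above.

```python
def build_page_params(base_params, pagenum, last_comment_id):
    """Return a copy of base_params adjusted for the requested comment page.

    Only the two parameters that change between pages are overwritten;
    the base dictionary itself is left untouched.
    """
    params = dict(base_params)
    params['pagenum'] = str(pagenum)
    params['lasthotcommentid'] = last_comment_id
    return params
```

For example, `build_page_params(PARAMS, 1, LAST_COMMENT_ID)` would produce the parameters for the second page, assuming `LAST_COMMENT_ID` already holds the ID of the last comment from the first page.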
Part of the code
import re
import json
import time
import pymysql
import requests

URL = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?'
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
PARAMS = {
    'g_tk': '5381', 'jsonpCallback': 'jsoncallback4823183319594757', 'loginUin': '0',
    'hostUin': '0', 'format': 'jsonp', 'inCharset': 'utf8',
    'outCharset': 'GB2312', 'notice': '0', 'platform': 'yqq',
    'needNewCode': '0', 'cid': '205360772', 'reqtype': '2',
    'biztype': '1', 'topid': '213910991', 'cmd': '8',
    'needmusiccrit': '0', 'pagenum': '0', 'pagesize': '25',
    'lasthotcommentid': '', 'callback': 'jsoncallback4823183319594757', 'domain': 'qq.com',
    'ct': '24', 'cv': '101010',
}
LAST_COMMENT_ID = ''
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='QQ_Music', charset='utf8mb4')
cursor = db.cursor()
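def get_html(url, headers, params=None, tries=3):
    """Request the URL and return the response, retrying on failure."""
    try:
        response = requests.get(url=url, headers=headers, params=params)
        response.raise_for_status()
        response.encoding = 'utf-8'
    except requests.RequestException:
        # Covers HTTP error statuses as well as connection problems.
        print("connect failed")
        if tries > 0:
            print("reconnect...")
            # Retry with the same params and return the retry's result.
            return get_html(url, headers, params, tries - 1)
        else:
            print("3 times failure")
            return None
    return response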
if __name__ == '__main__':
    # main() belongs to the full script and is not shown in this excerpt.
    main()
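Because the request is sent with format set to jsonp, the response body is expected to arrive wrapped in a callback call rather than as bare JSON. The sketch below shows one way the re and json modules could be used to unwrap it. The function name `parse_jsonp` is hypothetical, and the payload in the usage note is made up; the real field names returned by QQ Music are not shown in this post.

```python
import json
import re


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}) and decode the JSON inside."""
    # Capture everything between the first "(" and the final ")".
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text.strip(), re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))
```

Given a made-up body such as `jsoncallback123({"code": 0})`, `parse_jsonp` returns the dictionary `{"code": 0}`. In the script above it would be applied to `response.text` from `get_html`.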
Finally, the comment information is retrieved successfully.