“This is the 17th day of my participation in the Gwen Challenge in November. See details of the event: The Last Gwen Challenge in 2021”.

Preface

This post uses Python to crawl QQ Music comments. Without further ado,

let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

re module;

pymysql module;

and some modules from the Python standard library.

Environment set up

Install Python and add it to the environment variables, then use pip to install the required modules.

MySQL and Sequel Pro are also installed on the Mac.

Next, create the database, table, and primary key information.

import pymysql
# Create database
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE QQ_Music DEFAULT CHARACTER SET utf8mb4")
db.close()
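
The snippet above only creates the database. For the table and primary key, a minimal sketch might look like the following. The table name `comments`, every column name, and the helper names `create_table` and `save_comment` are assumptions for illustration, not taken from the original script. Both helpers expect a pymysql cursor already connected to `QQ_Music`.

```python
# Hypothetical schema: the table and column names are illustrative assumptions.
CREATE_TABLE_SQL = (
    "CREATE TABLE IF NOT EXISTS comments ("
    "comment_id VARCHAR(255) NOT NULL, "
    "nickname VARCHAR(255), "
    "content TEXT, "
    "comment_time DATETIME, "
    "PRIMARY KEY (comment_id)"
    ") DEFAULT CHARSET=utf8mb4"
)

# %s placeholders let pymysql escape the values instead of string formatting.
INSERT_SQL = (
    "INSERT IGNORE INTO comments (comment_id, nickname, content, comment_time) "
    "VALUES (%s, %s, %s, %s)"
)


def create_table(cursor):
    """Create the comments table if it does not exist yet."""
    cursor.execute(CREATE_TABLE_SQL)


def save_comment(cursor, comment_id, nickname, content, comment_time):
    """Insert one comment; rows with an existing primary key are skipped."""
    cursor.execute(INSERT_SQL, (comment_id, nickname, content, comment_time))
```

With a live connection, `create_table(cursor)` would run once at startup, and `db.commit()` would follow the inserts, since pymysql does not autocommit by default.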

Taking the comment pages for "Last Summer" on QQ Music as the example, I checked the last page of "all comments" and found that it does not go back far enough: one of the hot comments is dated July 12, while the last page of all comments only reaches July 16. Clearly, "all comments" is not really all of the comments. I wonder whether this is a bug in QQ Music.

Another quirk is that jumping straight to the last page does not return the real data. You have to page backward from the last page until you reach a page with real data, and then page forward again to get the real data for the last page.

After confirming that every page uses the same Ajax request URL, I compared the requests and found that three parameters change between pages:

jsonpCallback

pagenum

lasthotcommentid

pagenum is the page number. jsonpCallback does not affect the request, so it can be left unchanged. The value of lasthotcommentid is the ID of the last comment on the previous page, so it has to be updated for every request.

In other words, the request succeeds as long as pagenum and lasthotcommentid are updated for each page.
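
The per-page update described above can be sketched as a small helper. `build_params` is a hypothetical name, the base dictionary is trimmed to a few keys to keep the example short, and `'example_comment_id'` is a placeholder rather than a real comment ID.

```python
def build_params(base_params, pagenum, last_comment_id):
    """Return a copy of base_params updated for the requested page."""
    params = dict(base_params)  # copy so the shared template is not mutated
    params['pagenum'] = str(pagenum)
    params['lasthotcommentid'] = last_comment_id
    return params


# Trimmed-down template; the full request sends more keys than shown here.
base = {'pagenum': '0', 'pagesize': '25', 'lasthotcommentid': ''}
page_3 = build_params(base, 3, 'example_comment_id')
```

Copying the template keeps the original dictionary reusable across pages, so only the two changing values need to be supplied on each request.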

Part of the code

import re
import json
import time
import pymysql
import requests

URL = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?'

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}

PARAMS = {
    'g_tk': '5381',
    'jsonpCallback': 'jsoncallback4823183319594757',
    'loginUin': '0',
    'hostUin': '0',
    'format': 'jsonp',
    'inCharset': 'utf8',
    'outCharset': 'GB2312',
    'notice': '0',
    'platform': 'yqq',
    'needNewCode': '0',
    'cid': '205360772',
    'reqtype': '2',
    'biztype': '1',
    'topid': '213910991',
    'cmd': '8',
    'needmusiccrit': '0',
    'pagenum': '0',
    'pagesize': '25',
    'lasthotcommentid': ' ',
    'callback': 'jsoncallback4823183319594757',
    'domain': 'qq.com',
    'ct': '24',
    'cv': '101010',
}

LAST_COMMENT_ID = ' '

db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306,  db='QQ_Music', charset='utf8mb4')
cursor = db.cursor()


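def get_html(url, headers, params=None, tries=3):
    """GET the URL and return the response, retrying on failure; None if all retries fail."""
    try:
        response = requests.get(url=url, headers=headers, params=params)
        response.raise_for_status()
        response.encoding = 'utf-8'
    except requests.RequestException:
        # RequestException also covers connection errors and timeouts, not just HTTP error codes
        print("connect failed")
        if tries > 0:
            print("reconnect...")
            # Pass params and tries by keyword, and return the result of the retry
            return get_html(url, headers, params=params, tries=tries - 1)
        print("3 times failure")
        return None
    return response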

if __name__ == '__main__':
    main()
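
The excerpt imports re and json but does not show the parsing step. Since the request asks for 'format': 'jsonp', the response body is presumably wrapped in a callback call, and a sketch of unwrapping it could look like this. `parse_jsonp` is a hypothetical helper and the sample payload is made up; the fields of the real response are not shown in this post.

```python
import json
import re


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}) and return the parsed JSON."""
    # Capture everything between the first "(" and the final ")".
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text.strip(), re.S)
    if match is None:
        raise ValueError("response is not in JSONP form")
    return json.loads(match.group(1))


# Made-up payload for demonstration only; real responses have their own fields.
sample = 'jsoncallback4823183319594757({"code": 0, "items": [1, 2]})'
data = parse_jsonp(sample)
```

In the crawler this would be applied to the `text` attribute of the response returned by get_html, after checking that the response is not None.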

In the end, the comment data is retrieved successfully.