I haven't used QQ since graduation. What Qzone holds is a record of some not-so-wonderful years, but they are still memories. Recently I wanted to put what I have learned into practice, so I used Python to crawl all the photos in my Qzone albums as a backup.

Analyzing QZone

Log in to Qzone

The first step of any crawl is to analyze the site, and the first thing to find out is how to log in to Qzone. The initial idea was to use the Requests library and build the login request by hand to simulate a login, but this idea was soon abandoned.

Following the event listener bound to the login button, the button's click event can be traced as follows:

Encrypting the account is unavoidable, but this pile of code is really hard to untangle. Patient warriors are welcome to give it a try!

With that approach ruled out, simulating a user login with Selenium saves both time and effort. We only need Selenium to complete the login and obtain the Cookies and the g_tk parameter described below, and can close it afterwards, so it is not too inefficient.

Analyzing the Qzone albums

After login, the page jumps to the space of {QQ_NUMBER}. If you hover over the navigation bar, you will see that every navigation link is javascript:; 😳. That is exactly how it is: a complete black box.

Of course, this is not too hard to handle: use the browser's debugging tools to capture the requests generated by the click, then filter out the right one. Because there are so many network requests, some filtering is needed. My guess was that the album data API must return a list, so I filtered on list, ruled candidates out one by one, and finally located the request. The requests below are filtered by fcg_list. The list information is returned in JSONP format, which can be read as JSON with a little processing (more on that later).

Two important sets of information can be obtained from Headers and Response, respectively:

  1. request: contains the request information needed to fetch the album list, including the request URL and parameters
  2. response: contains the information of all albums, and is the data source for the request parameters of the photos in each album

First, the request:

# url

# args
g_tk: 477819917
callback: shine0_Callback
t: 691481346
hostUin: 123456789
uin: 123456789
appid: 4
inCharset: utf-8
outCharset: utf-8
source: qzone
plat: qzone
format: jsonp
notice: 0
filter: 1
handset: 4
pageNumModeSort: 40
pageNumModeClass: 15
needUserInfo: 1
idcNum: 4
callbackFun: shine0
_ : 1551788226819

Here hostUin and uin are the QQ number. g_tk is required and changes every time you log in again (how to obtain it is explained later). The other parameters are not all required; after some experiments I reduced them to the following request parameters:

query = {
    'g_tk': self.g_tk,
    'hostUin': self.username,
    'uin': self.username,
    'appid': 4,
    'inCharset': 'utf-8',
    'outCharset': 'utf-8',
    'source': 'qzone',
    'plat': 'qzone',
    'format': 'jsonp'
}

Now let's look at the cross-domain response in JSONP format:

shine0_Callback({
    "code": 0, "subcode": 0, "message": "", "default": 0,
    "data": {
        "albumListModeSort": [
            {
                "allowAccess": 1,
                "anonymity": 0,
                "bitmap": "10000000",
                "classid": 106,
                "comment": 11,
                "createtime": 1402661881,
                "desc": "",
                "handset": 0,
                "id": "V13LmPKk0JLNRY",
                "lastuploadtime": 1402662103,
                "modifytime": 1408271987,
                "name": "Graduation season",
                "order": 0,
                "pre": "http:\/\/\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY!\/a\/dIY29GUbJgAA",
                "priv": 1,
                "pypriv": 1,
                "total": 4,
                "viewtype": 0
            },

shine0_Callback is determined by the callbackFun parameter of the request. Without that parameter, the callback name defaults to _Callback, which of course doesn't matter. All album information is stored in albumListModeSort in JSON format; only one album is shown above.

In the album information, name is the album name, id is the unique identifier used to request the photos in the album, and pre is just a link to the preview thumbnail, which doesn't matter here.

Analyzing individual albums

As with the album list, open an album and filter the requests by cgi_list to find the photo information of that album.

In the same way, the request and the response for the photo list can be read from that capture. First, the request:

# url

# args
g_tk: 477819917
callback: shine0_Callback
t: 952444063
mode: 0
idcNum: 4
hostUin: 123456789
topicId: V13LmPKk0JLNRY
noTopic: 0
uin: 123456789
pageStart: 0
pageNum: 30
skipCmtCount: 0
singleurl: 1
notice: 0
appid: 4
inCharset: utf-8
outCharset: utf-8
source: qzone
plat: qzone
outstyle: json
format: jsonp
json_esc: 1
callbackFun: shine0
_ : 1551790719497

There are several key parameters:

  1. g_tk: same as in the album list request
  2. topicId: the id of the album, as returned in the album list
  3. pageStart: index of the first photo requested
  4. pageNum: number of photos requested this time

To get all the photos of an album in a single request, set pageStart to 0 and pageNum to the largest photo count found among all albums.

You can also simplify the above parameters by adding topicId, pageStart and pageNum on the basis of the album list request parameters.
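If an album is too large to fetch comfortably in one request, the same two parameters can be used to page through it. The sketch below only computes the (pageStart, pageNum) pairs. `page_ranges` is a hypothetical helper, and `total` is assumed to be the `total` value of the album from the album list.

```python
def page_ranges(total, page_size=30):
    """Yield (pageStart, pageNum) pairs that together cover `total` photos."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


# An album with 70 photos, requested 30 at a time
for page_start, page_num in page_ranges(70):
    print(page_start, page_num)
```

Each pair would then be passed as pageStart and pageNum of one photo list request.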

Here is the list of returned photos:

shine0_Callback({
    "code": 0, "subcode": 0, "message": "", "default": 0,
    "data": {
        "limit": 0,
        "photoList": [
            {
                "batchId": "1402662093402000",
                "browser": 0,
                "cameratype": " ",
                "cp_flag": false,
                "cp_x": 455,
                "cp_y": 388,
                "desc": "",
                "exif": {
                    "exposureCompensation": "", "exposureMode": "", "exposureProgram": "",
                    "exposureTime": "", "flash": "", "fnumber": "", "focalLength": "",
                    "iso": "", "lensModel": "", "make": "", "meteringMode": "",
                    "model": "", "originalTime": ""
                },
                "forum": 0,
                "frameno": 0,
                "height": 621,
                "id": 0,
                "is_video": false,
                "is_weixin_mode": 0,
                "ismultiup": 0,
                "lloc": "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!",
                "modifytime": 1402661792,
                "name": "QQ photo 20140612104616",
                "origin": 0,
                "origin_upload": 0,
                "origin_url": "",
                "owner": "123456789",
                "ownername": "123456789",
                "photocubage": 91602,
                "phototype": 1,
                "picmark_flag": 0,
                "picrefer": 1,
                "platformId": 0,
                "platformSubId": 0,
                "poiName": "",
                "pre": "http:\/\/\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY!\/a\/dIY29GUbJgAA&bo=pANtAgAAAAABCeY!",
                "raw": "http:\/\/\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY!\/r\/dIY29GUbJgAA",
                "raw_upload": 1,
                "rawshoottime": 0,
                "shoottime": 0,
                "shorturl": "",
                "sloc": "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!",
                "tag": "",
                "uploadtime": "2014-06-13 20:21:33",
                "url": "http:\/\/\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY!\/b\/dIY29GUbJgAA&bo=pANtAgAAAAABCeY!",
                "width": 932,
                "yurl": 0
            },
            // ...
        ],
        "t": "952444063",
        "topic": {
            "bitmap": "10000000",
            "browser": 0,
            "classid": 106,
            "comment": 1,
            "cover_id": "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!",
            "createtime": 1402661881,
            "desc": "",
            "handset": 0,
            "id": "V13LmPKk0JLNRY",
            "is_share_album": 0,
            "lastuploadtime": 1402662103,
            "modifytime": 1408271987,
            "name": "Graduation season",
            "ownerName": "707922098",
            "ownerUin": "707922098",
            "pre": "http:\/\/\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY!\/a\/dIY29GUbJgAA",
            "priv": 1,
            "pypriv": 1,
            "share_album_owner": 0,
            "total": 4,
            "url": "http:\/\/\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY!\/b\/dIY29GUbJgAA",
            "viewtype": 0
        },
        "totalInAlbum": 4,
        "totalInPage": 4
    }
})

The returned photo information is stored in photoList. Again, only one photo is shown above; below it comes some basic information about the current album. totalInAlbum and totalInPage hold the total number of photos in the album and the number of photos returned by this request. The image link we need for downloading is url!

OK, now that all request and response data have been analyzed, it is time for coding.

Determine the crawl scheme

  1. Create a qqzone class to initialize the user information
  2. Use Selenium to simulate the login
  3. Obtain the Cookies and g_tk
  4. Use requests to get the album list
  5. Walk through the albums, get each photo list, and download the photos
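
Put together, the five steps could be driven from a single entry point. The outline below is only a sketch under assumptions: `start` is a hypothetical method name, and the four private methods are stubs standing in for the real ones developed in the rest of this article.

```python
class qqzone(object):
    """Outline only: each stub stands in for a method shown later."""

    def __init__(self, user):
        self.username = user['username']
        self.password = user['password']

    def _login_and_get_args(self):
        raise NotImplementedError  # Selenium login, stores Cookies and g_tk

    def _init_session(self):
        raise NotImplementedError  # prepares the requests session

    def _get_ablum_list(self):
        raise NotImplementedError  # returns {album name: album id}

    def _get_photo(self, album_name, album_id):
        raise NotImplementedError  # downloads the photos of one album

    def start(self):
        """Hypothetical entry point that runs steps 2 to 5 in order."""
        self._login_and_get_args()
        self._init_session()
        for album_name, album_id in self._get_ablum_list().items():
            self._get_photo(album_name, album_id)
```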

Create qqzone

class qqzone(object):
    """QQ Space album crawler"""
    def __init__(self, user):
        self.username = user['username']
        self.password = user['password']

Simulate the login

import logging

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException


def _login_and_get_args(self):
    """Log in to QQ and get the Cookies and g_tk"""
    opt = webdriver.ChromeOptions()

    driver = webdriver.Chrome(options=opt)
    # ...
    # time.sleep(2)

    logging.info('User {} login...'.format(self.username))
    # ...

Note here:

  1. Using selenium requires installing the matching webdriver
  2. The browser location can be specified through webdriver.Chrome(); otherwise it is looked up in the paths defined by the environment variables
  3. If your computer is slow to open a browser, you may need to sleep for a few seconds after driver.get

Get the Cookies

Getting the Cookies with Selenium is very convenient:

self.cookies = driver.get_cookies()

Get g_tk

Getting g_tk was the biggest difficulty of this crawler at first, because the value is never written directly in the page; there are only various function calls. A global search showed that it can be obtained in many places.

In the end I picked one of them and successfully obtained g_tk through Selenium's ability to execute scripts!

self.g_tk = driver.execute_script('return QZONE.FP.getACSRFToken()')
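
For reference, g_tk is widely reported to be a simple hash of the value of the `p_skey` cookie. The sketch below follows that commonly circulated description. It is not taken from this article's code, the name `calc_g_tk` is hypothetical, and Qzone may change the algorithm at any time, so asking the page itself through `execute_script` remains the safer choice.

```python
def calc_g_tk(p_skey):
    """Hash a p_skey cookie value the way g_tk is commonly described."""
    value = 5381
    for char in p_skey:
        value += (value << 5) + ord(char)
    # Keep only the low 31 bits, as the commonly described version does
    return value & 0x7fffffff
```

If this matches what the page computes, the result should equal the value returned by the script call above for the same login.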

At this point, Selenium is done, and the rest will be done through Requests.

Initialize requests.Session

The next step is to build the requests and fetch the data. For convenience, the data is requested through a session, with the cookies and headers configured once so that they don't have to be set on every request.

import requests


def _init_session(self):
    self.session = requests.Session()
    # Copy the Cookies collected by Selenium into the session
    for cookie in self.cookies:
        self.session.cookies.set(cookie['name'], cookie['value'])
    self.session.headers = {
        'Referer': '',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
    }

Request the album information

To obtain the album information, wrap up the request parameters, fetch the data with session.get, read the JSONP data as JSON with a regular expression match, and finally extract the name and id that we need.

def _get_ablum_list(self):
    """Get the album list information"""
    album_url = '{}{}'.format(
        '',  # URL of the album list request analyzed above
        self._get_query_for_request())

    logging.info('Getting ablum list id...')
    resp = self.session.get(album_url)
    data = self._load_callback_data(resp)

    # Map each album name to its id
    album_list = {}
    for item in data['data']['albumListModeSort']:
        album_list[item['name']] = item['id']

    return album_list

The parameter string is assembled by the _get_query_for_request function.

def _get_query_for_request(self, topicId=None, pageStart=0, pageNum=100):
    """Build the query string for a request

    Args:
        topicId: id of the album whose photo list is requested.
        pageStart: start position used when requesting the photo list of an album.
        pageNum: number of photos of an album requested at a time.

    Returns:
        A string that combines all the request parameters.
    """
    query = {
        'g_tk': self.g_tk,
        'hostUin': self.username,
        'uin': self.username,
        'appid': 4,
        'inCharset': 'utf-8',
        'outCharset': 'utf-8',
        'source': 'qzone',
        'plat': 'qzone',
        'format': 'jsonp'
    }
    if topicId:
        query['topicId'] = topicId
        query['pageStart'] = pageStart
        query['pageNum'] = pageNum
    return '&'.join('{}={}'.format(key, val) for key, val in query.items())
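
Joining the pairs by hand works as long as no value contains characters that need escaping. A slightly more defensive variant, sketched here with the standard library's `urllib.parse.urlencode`, escapes every value. `build_query` is a hypothetical standalone version of the method above, and the sample values are the placeholders used earlier in this article.

```python
from urllib.parse import urlencode


def build_query(g_tk, uin, topic_id=None, page_start=0, page_num=100):
    """Return the query string for the album list or photo list request."""
    query = {
        'g_tk': g_tk,
        'hostUin': uin,
        'uin': uin,
        'appid': 4,
        'inCharset': 'utf-8',
        'outCharset': 'utf-8',
        'source': 'qzone',
        'plat': 'qzone',
        'format': 'jsonp',
    }
    if topic_id:
        # Only the photo list request needs these three parameters
        query.update(topicId=topic_id, pageStart=page_start, pageNum=page_num)
    return urlencode(query)


print(build_query(477819917, '123456789', topic_id='V13LmPKk0JLNRY'))
```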

The JSONP parsing function is shown below. Its core is a regular expression match, which is very simple.

import re
from json import loads


def _load_callback_data(self, resp):
    """Parse the returned JSONP data as JSON"""
    try:
        resp.encoding = 'utf-8'
        # Capture the JSON object between the callback's parentheses
        data = loads(re.match(r'.*?\((\{.*\}).*?\).*', resp.text, re.S)[1])
        return data
    except ValueError:
        logging.error('Invalid input')
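
To see the unwrapping in isolation, here is a minimal standalone sketch of the same idea. `jsonp_to_dict` is a hypothetical helper, and the sample string is a shortened, hand-written version of the album list response shown earlier.

```python
import json
import re


def jsonp_to_dict(text):
    """Strip the callback wrapper from a JSONP string and parse the JSON."""
    match = re.match(r'.*?\((\{.*\})\s*\)', text, re.S)
    if match is None:
        raise ValueError('not a JSONP payload')
    return json.loads(match.group(1))


sample = ('shine0_Callback({"code": 0, "data": {"albumListModeSort": '
          '[{"id": "V13LmPKk0JLNRY", "name": "Graduation season"}]}});')
data = jsonp_to_dict(sample)
print(data['data']['albumListModeSort'][0]['id'])
```

Raising an error on a non-matching response, rather than indexing a failed match, makes an unexpected response (for example an expired login) easier to spot.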

Parse and download the photos

After getting the album list, request the photo list of each album in turn and download the photos one by one.

def _get_photo(self, album_name, album_id):
    """Get the photo list of a single album and download all of its photos"""
    photo_list_url = '{}{}'.format(
        '',  # URL of the photo list request analyzed above
        self._get_query_for_request(topicId=album_id))

    logging.info('Getting photo list for album {}...'.format(album_name))
    resp = self.session.get(photo_list_url)
    data = self._load_callback_data(resp)
    if data['data']['totalInPage'] == 0:
        return None

    file_dir = self.get_path(album_name)
    for item in data['data']['photoList']:
        path = '{}/{}.jpg'.format(file_dir, item['name'])
        logging.info('Downloading {}-{}'.format(album_name, item['name']))
        self._download_image(item['url'], path)
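
One thing the path construction above does not handle is photo names that repeat within an album or that contain characters such as `/`. A possible guard is sketched below. `unique_path` is a hypothetical helper, not part of the original crawler: it replaces unsafe characters and appends a counter when a name has already been used.

```python
import os
import re


def unique_path(file_dir, name, taken, ext='.jpg'):
    """Build a file path for `name` that is safe and not already in `taken`."""
    # Replace path separators and other characters many filesystems reject
    safe = re.sub(r'[\\/:*?"<>|]', '_', name).strip() or 'untitled'
    candidate = safe
    counter = 1
    while candidate in taken:
        candidate = '{}_{}'.format(safe, counter)
        counter += 1
    taken.add(candidate)
    return os.path.join(file_dir, candidate + ext)


taken = set()
print(unique_path('albums', 'QQ photo 20140612104616', taken))
print(unique_path('albums', 'QQ photo 20140612104616', taken))
```

The `taken` set would be created once per album so that counters restart in each directory.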

Images are also downloaded with requests, so remember to set a timeout.

def _download_image(self, url, path):
    """Download a single photo"""
    try:
        resp = self.session.get(url, timeout=15)
        if resp.status_code == 200:
            with open(path, 'wb') as f:
                f.write(resp.content)
    except requests.exceptions.Timeout:
        logging.warning('get {} timeout'.format(url))
    except requests.exceptions.ConnectionError as e:
        logging.error(e)

Crawl test

  • The crawl process

  • Crawl results

Final words

  1. If the format request parameter is changed from jsonp to json, JSON data can be obtained directly
  2. This example uses neither multiple processes nor multiple threads, so it is not fast and still needs optimizing
  3. Women, that crawler is received a storm of applause. That’s what’s expected of you
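
As one possible direction for the speed issue noted above, the downloads could be spread over a thread pool from the standard library. This is a sketch and not part of the original crawler: `download_all` and its `download` argument are hypothetical, with `download` standing in for something like `_download_image`.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(jobs, download, max_workers=8):
    """Run download(url, path) for every (url, path) pair on a thread pool.

    Returns the list of (url, path) pairs whose download raised an exception.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, url, path): (url, path)
                   for url, path in jobs}
        for future, job in futures.items():
            # exception() waits for the task to finish before returning
            if future.exception() is not None:
                failed.append(job)
    return failed
```

Collecting the failed pairs instead of stopping at the first error makes it possible to retry only the photos that did not come through.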

