Since graduating, I have hardly used QQ. What is recorded in Qzone comes from some not-so-wonderful years, but it is still a memory. Recently I wanted to put what I had learned into practice, so I used Python to crawl all the photos in my Qzone albums as a backup.
Analyzing QZone
Log in to Qzone
The first step of any crawl is to analyze the site, starting with how to log in to Qzone. My initial idea was to use the Requests library to build the login request and simulate the login, but I soon abandoned it.
Starting from the event listener bound to the login button, the button's click handler can be traced as follows:
Encrypting the account is to be expected, but this pile of code is really hard to analyze. Patient warriors are welcome to give it a try!
With that login method ruled out, simulating a user login with Selenium saves time and effort. We only need Selenium to complete the login and to get the Cookies and the g_tk parameter described below; after that it can be closed, so the approach is not too inefficient.
Analyzing the Qzone albums
After login, the page jumps to https://user.qzone.qq.com/{QQ_NUMBER}. If you hover over the navigation bar, you'll see that every link in it is javascript:; 😳. That's right, it is all a black box.
Of course, this is not too hard to handle: use the browser's debugging tools to capture the requests generated by a click, then filter out the right one. Because there are so many requests, the question is how to filter them. I guessed that the album data API would return a list, so I filtered by "list", excluded candidates one by one, and finally located the request. The packets were then filtered with fcg_list. The list is returned in JSONP format, which can be read as JSON with a little manipulation (more on that later).
Two important sets of information can be obtained from Headers and Response, respectively:
- request: the information needed to request the album list, including the request URL and parameters
- response: information about all albums, which is also the source of the request parameters for the photos in each album
First, look at the request:
# url
https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/fcg_list_album_v3
# args
g_tk: 477819917
callback: shine0_Callback
t: 691481346
hostUin: 123456789
uin: 123456789
appid: 4
inCharset: utf-8
outCharset: utf-8
source: qzone
plat: qzone
format: jsonp
notice: 0
filter: 1
handset: 4
pageNumModeSort: 40
pageNumModeClass: 15
needUserInfo: 1
idcNum: 4
callbackFun: shine0
_ : 1551788226819
Here hostUin and uin are the QQ number, and g_tk is required and changes every time you log in again (how to obtain it is explained later). The other parameters are not all required; after some experimenting I reduced the request parameters to the following:
query = {
    'g_tk': self.g_tk,
    'hostUin': self.username,
    'uin': self.username,
    'appid': 4,
    'inCharset': 'utf-8',
    'outCharset': 'utf-8',
    'source': 'qzone',
    'plat': 'qzone',
    'format': 'jsonp'
}
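As a rough sketch of how these parameters turn into a request URL, the snippet below uses the standard library's urllib.parse.urlencode. The function name build_album_list_url and the sample g_tk and QQ number are placeholders for illustration only and are not part of the crawler itself.

```python
from urllib.parse import urlencode

ALBUM_LIST_API = ('https://h5.qzone.qq.com/proxy/domain/'
                  'photo.qzone.qq.com/fcgi-bin/fcg_list_album_v3')


def build_album_list_url(g_tk, qq_number):
    """Return the album list URL with the reduced parameter set (hypothetical helper)."""
    query = {
        'g_tk': g_tk,
        'hostUin': qq_number,
        'uin': qq_number,
        'appid': 4,
        'inCharset': 'utf-8',
        'outCharset': 'utf-8',
        'source': 'qzone',
        'plat': 'qzone',
        'format': 'jsonp',
    }
    # urlencode joins key=value pairs with '&' and percent-encodes the values
    return '{}?{}'.format(ALBUM_LIST_API, urlencode(query))


# Placeholder values taken from the captured request shown earlier
print(build_album_list_url(477819917, '123456789'))
```

Letting urlencode do the joining avoids hand-built query strings breaking when a value contains characters such as '&' or spaces.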
Let’s look at the cross-domain response package in JSONP format:
shine0_Callback({
    "code":0,
    "subcode":0,
    "message":"",
    "default":0,
    "data": {
        "albumListModeSort" : [
            {
                "allowAccess" : 1,
                "anonymity" : 0,
                "bitmap" : "10000000",
                "classid" : 106,
                "comment" : 11,
                "createtime" : 1402661881,
                "desc" : "",
                "handset" : 0,
                "id" : "V13LmPKk0JLNRY",
                "lastuploadtime" : 1402662103,
                "modifytime" : 1408271987,
                "name" : "Graduation season",
                "order" : 0,
                "pre" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY!\/a\/dIY29GUbJgAA",
                "priv" : 1,
                "pypriv" : 1,
                "total" : 4,
                "viewtype" : 0
            },
            // ...
The name shine0_Callback is determined by the callbackFun parameter of the request; without that parameter, the response uses _Callback as the default name, which of course doesn't matter. All album information is stored as JSON in albumListModeSort; only one album is shown above.
In the album information, name is the album's name, id is its unique identifier and can be used to request the photos in the album, and pre is just a link to the preview thumbnail, which doesn't matter here.
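To make the structure concrete, here is a minimal sketch that strips the callback wrapper by slicing between the first '(' and the last ')' and then maps each album name to its id. The helper name album_ids_from_jsonp and the shortened sample response are made up for illustration.

```python
import json


def album_ids_from_jsonp(text):
    """Strip the callback wrapper and map album name -> album id (hypothetical helper)."""
    # The JSON payload sits between the first '(' and the last ')'
    start = text.index('(') + 1
    end = text.rindex(')')
    payload = json.loads(text[start:end])
    return {album['name']: album['id']
            for album in payload['data']['albumListModeSort']}


# Shortened sample: only the fields used here are kept
sample = ('shine0_Callback({"code": 0, "data": {"albumListModeSort": ['
          '{"id": "V13LmPKk0JLNRY", "name": "Graduation season", "total": 4}'
          ']}});')
print(album_ids_from_jsonp(sample))
```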
Analyzing individual albums
As with the album list, open an album and filter the packets by cgi_list to find the album's photo information.
In the same way, the packet gives us the request and the response for the photo list. First, look at the request:
# url
https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/cgi_list_photo
# args
g_tk: 477819917
callback: shine0_Callback
t: 952444063
mode: 0
idcNum: 4
hostUin: 123456789
topicId: V13LmPKk0JLNRY
noTopic: 0
uin: 123456789
pageStart: 0
pageNum: 30
skipCmtCount: 0
singleurl: 1
batchId:
notice: 0
appid: 4
inCharset: utf-8
outCharset: utf-8
source: qzone
plat: qzone
outstyle: json
format: jsonp
json_esc: 1
question:
answer:
callbackFun: shine0
_ : 1551790719497
There are several key parameters:
- g_tk: the same as in the album list request
- topicId: the same as id in the album list
- pageStart: the index of the first photo requested
- pageNum: the number of photos requested this time
To get all the photos at once, set pageStart to 0 and pageNum to the largest photo count among all albums.
The parameters can also be simplified: start from the album list request parameters and add topicId, pageStart and pageNum.
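If a single large request turns out to be unreliable, the same two parameters allow paging instead. The sketch below only computes the (pageStart, pageNum) pairs; it assumes pageStart is a zero-based photo offset as described here, and the name page_requests and the default page size of 30 (the value seen in the captured request) are illustrative choices.

```python
def page_requests(total, page_size=30):
    """Yield (pageStart, pageNum) pairs that together cover `total` photos."""
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size
        yield start, min(page_size, total - start)


# An album with 70 photos needs three requests
print(list(page_requests(70)))
```

The album's photo count is available as total in the album list response, so it can be passed straight in.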
Here is the list of returned photos:
shine0_Callback({
    "code":0,
    "subcode":0,
    "message":"",
    "default":0,
    "data": {
        "limit" : 0,
        "photoList" : [
            {
                "batchId" : "1402662093402000",
                "browser" : 0,
                "cameratype" : " ",
                "cp_flag" : false,
                "cp_x" : 455,
                "cp_y" : 388,
                "desc" : "",
                "exif" : {
                    "exposureCompensation" : "",
                    "exposureMode" : "",
                    "exposureProgram" : "",
                    "exposureTime" : "",
                    "flash" : "",
                    "fnumber" : "",
                    "focalLength" : "",
                    "iso" : "",
                    "lensModel" : "",
                    "make" : "",
                    "meteringMode" : "",
                    "model" : "",
                    "originalTime" : ""
                },
                "forum" : 0,
                "frameno" : 0,
                "height" : 621,
                "id" : 0,
                "is_video" : false,
                "is_weixin_mode" : 0,
                "ismultiup" : 0,
                "lloc" : "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!",
                "modifytime" : 1402661792,
                "name" : "QQ photo 20140612104616",
                "origin" : 0,
                "origin_upload" : 0,
                "origin_url" : "",
                "owner" : "123456789",
                "ownername" : "123456789",
                "photocubage" : 91602,
                "phototype" : 1,
                "picmark_flag" : 0,
                "picrefer" : 1,
                "platformId" : 0,
                "platformSubId" : 0,
                "poiName" : "",
                "pre" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY!\/a\/dIY29GUbJgAA&bo=pANtAgAAAAABCeY!",
                "raw" : "http:\/\/r.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY!\/r\/dIY29GUbJgAA",
                "raw_upload" : 1,
                "rawshoottime" : 0,
                "shoottime" : 0,
                "shorturl" : "",
                "sloc" : "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!",
                "tag" : "",
                "uploadtime" : "2014-06-13 20:21:33",
                "url" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY!\/b\/dIY29GUbJgAA&bo=pANtAgAAAAABCeY!",
                "width" : 932,
                "yurl" : 0
            },
            // ...
        ],
        "t" : "952444063",
        "topic" : {
            "bitmap" : "10000000",
            "browser" : 0,
            "classid" : 106,
            "comment" : 1,
            "cover_id" : "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!",
            "createtime" : 1402661881,
            "desc" : "",
            "handset" : 0,
            "id" : "V13LmPKk0JLNRY",
            "is_share_album" : 0,
            "lastuploadtime" : 1402662103,
            "modifytime" : 1408271987,
            "name" : "Graduation season",
            "ownerName" : "707922098",
            "ownerUin" : "707922098",
            "pre" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY!\/a\/dIY29GUbJgAA",
            "priv" : 1,
            "pypriv" : 1,
            "share_album_owner" : 0,
            "total" : 4,
            "url" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY!\/b\/dIY29GUbJgAA",
            "viewtype" : 0
        },
        "totalInAlbum" : 4,
        "totalInPage" : 4
    }
})
The returned photo information is stored in photoList. Again, only one photo is shown above, followed by some basic information about the current album. totalInAlbum and totalInPage hold the total number of photos in the current album and the number of photos returned by this request. The image link we need to download is url!
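The fields mentioned here can be read out of the parsed response as in the sketch below. The helper names photo_links and has_more are hypothetical, the sample is heavily shortened with a placeholder URL, and the guard for a missing photoList is an assumption about how empty albums might be returned rather than something observed in the capture.

```python
def photo_links(data):
    """Return (name, url) pairs from a parsed photo-list response."""
    # `or []` covers the assumed case of a missing or null photoList
    return [(photo['name'], photo['url'])
            for photo in data['data'].get('photoList') or []]


def has_more(data, already_fetched=0):
    """True if the album holds more photos than have been returned so far."""
    info = data['data']
    return already_fetched + info['totalInPage'] < info['totalInAlbum']


# Shortened sample with a placeholder URL
sample = {
    'data': {
        'photoList': [
            {'name': 'QQ photo 20140612104616', 'url': 'http://example.invalid/b/photo'},
        ],
        'totalInAlbum': 4,
        'totalInPage': 1,
    }
}
print(photo_links(sample))
print(has_more(sample))
```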
OK, now that all request and response data have been analyzed, it is time for coding.
Determine the crawl scheme
- Create a qqzone class to initialize the user information
- Use Selenium to simulate the login
- Obtain the Cookies and g_tk
- Use requests to get the album list information
- Walk through the albums to get the photo list information and download the photos
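The order of these steps can be sketched as a small driver function. The method names are the ones defined in the sections that follow; the run function and the FakeCrawler stand-in are hypothetical and exist only so the flow can be exercised without a browser or network access.

```python
def run(crawler):
    """Call the crawler's steps in the order of the scheme."""
    crawler._login_and_get_args()       # Selenium login, sets cookies and g_tk
    crawler._init_session()             # requests.Session with those cookies
    albums = crawler._get_ablum_list()  # {album name: album id}
    for name, album_id in albums.items():
        crawler._get_photo(name, album_id)
    return len(albums)


class FakeCrawler:
    """Stand-in that records calls instead of touching the network."""

    def __init__(self):
        self.calls = []

    def _login_and_get_args(self):
        self.calls.append('login')

    def _init_session(self):
        self.calls.append('session')

    def _get_ablum_list(self):
        self.calls.append('albums')
        return {'Graduation season': 'V13LmPKk0JLNRY'}

    def _get_photo(self, name, album_id):
        self.calls.append(('photos', name, album_id))


fake = FakeCrawler()
run(fake)
print(fake.calls)
```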
Create qqzone
class qqzone(object):
    """QQ Space album crawler"""

    def __init__(self, user):
        self.username = user['username']
        self.password = user['password']
Simulate the login
import logging
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException
# ...
    def _login_and_get_args(self):
        """Log in to QQ and get Cookies and g_tk"""
        # Selenium 3 style calls, as used when this article was written
        opt = webdriver.ChromeOptions()
        opt.set_headless()
        driver = webdriver.Chrome(chrome_options=opt)
        driver.get('https://i.qq.com/')
        # time.sleep(2)
        logging.info('User {} login...'.format(self.username))
        # The login form lives in an iframe; switch to account/password login
        driver.switch_to.frame('login_frame')
        driver.find_element_by_id('switcher_plogin').click()
        driver.find_element_by_id('u').clear()
        driver.find_element_by_id('u').send_keys(self.username)
        driver.find_element_by_id('p').clear()
        driver.find_element_by_id('p').send_keys(self.password)
        driver.find_element_by_id('login_button').click()
        time.sleep(1)
        driver.get('https://user.qzone.qq.com/{}'.format(self.username))
Note here:
- Using selenium requires installing the corresponding webdriver
- The browser driver location can be specified through webdriver.Chrome(); otherwise it is searched for in the paths defined by the environment variable
- If your computer is slow to open a browser, you may need to sleep for a few seconds after driver.get
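Instead of guessing how many seconds to sleep, one option is to poll until the page is ready. The helper below is a library-free sketch of that idea; wait_until and its defaults are hypothetical names and values, and Selenium also ships its own WebDriverWait helper for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {}s'.format(timeout))
        time.sleep(interval)


# Example with a plain Python condition: becomes true on the third check
attempts = []
print(wait_until(lambda: attempts.append(1) or len(attempts) >= 3, interval=0.01))
```

In the login flow, the condition could be a lambda that checks whether the login_frame element can be found yet, so a slow machine waits longer and a fast one does not wait at all.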
Get the Cookies
Getting the Cookies with Selenium is very convenient:
self.cookies = driver.get_cookies()
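get_cookies() returns a list of dictionaries, one per cookie, each with keys such as name, value and domain. A small sketch of collapsing that list into a plain name-to-value mapping follows; the helper name cookies_to_dict and the sample cookie values are invented for illustration.

```python
def cookies_to_dict(selenium_cookies):
    """Collapse Selenium's list of cookie dicts into {name: value}."""
    return {cookie['name']: cookie['value'] for cookie in selenium_cookies}


# Invented sample in the shape returned by driver.get_cookies()
sample_cookies = [
    {'name': 'uin', 'value': 'o0123456789', 'domain': '.qq.com'},
    {'name': 'skey', 'value': '@abcdefgh', 'domain': '.qq.com'},
]
print(cookies_to_dict(sample_cookies))
```

A dictionary in this form can be passed to requests through the cookies argument of a request.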
Get g_tk
Getting g_tk was the biggest difficulty of this crawler at first, because the value is never written directly into the page; there are only various function calls. A global search showed that it is obtained in many places.
In the end I picked one of them and successfully obtained g_tk through Selenium's script execution capability!
self.g_tk = driver.execute_script('return QZONE.FP.getACSRFToken()')
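For background only: third-party write-ups commonly describe g_tk as a simple hash of the skey (or p_skey) cookie value. The sketch below follows that description. This article does not confirm it, so treat both the algorithm and the cookie name as assumptions; the execute_script call remains the approach actually used here. The function name g_tk_from_skey is hypothetical.

```python
def g_tk_from_skey(skey):
    """Hash a skey/p_skey cookie value the way g_tk is commonly described.

    Assumption: not verified against Qzone's own JavaScript in this article.
    """
    h = 5381
    for ch in skey:
        # Equivalent to h = h * 33 + ord(ch)
        h += (h << 5) + ord(ch)
    # Keep the result within 31 bits
    return h & 0x7fffffff


print(g_tk_from_skey('@abcdefgh'))  # '@abcdefgh' is an invented cookie value
```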
At this point, Selenium is done, and the rest will be done through Requests.
Initialize requests.Session
The next step is to build the requests and retrieve the data. For convenience, the data is requested through a session, with cookies and headers configured once so that they don't have to be set for every request.
    def _init_session(self):
        self.session = requests.Session()
        # Copy the cookies captured by Selenium into the session
        for cookie in self.cookies:
            self.session.cookies.set(cookie['name'], cookie['value'])
        self.session.headers = {
            'Referer': 'https://qzs.qq.com/qzone/photo/v7/page/photo.html?init=photo.v7/module/albumList/index&navBar=1',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
        }
Request album Information
To obtain the album information, build the request parameters, fetch the data with session.get, use a regular expression to read the JSONP data as JSON, and finally parse out the name and id we need.
    def _get_ablum_list(self):
        """Get list information for albums"""
        album_url = '{}{}'.format(
            'https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/fcg_list_album_v3?',
            self._get_query_for_request())
        logging.info('Getting ablum list id...')
        resp = self.session.get(album_url)
        data = self._load_callback_data(resp)
        # Map each album name to its id
        album_list = {}
        for item in data['data']['albumListModeSort']:
            album_list[item['name']] = item['id']
        return album_list
The parameter combinations come from the _get_query_for_request function.
    def _get_query_for_request(self, topicId=None, pageStart=0, pageNum=100):
        """Build the query string for a request

        Args:
            topicId: id of the album whose photo list is requested
            pageStart: start index used when requesting an album's photo list
            pageNum: number of photos requested at a time

        Returns:
            A string that combines all the request parameters
        """
        query = {
            'g_tk': self.g_tk,
            'hostUin': self.username,
            'uin': self.username,
            'appid': 4,
            'inCharset': 'utf-8',
            'outCharset': 'utf-8',
            'source': 'qzone',
            'plat': 'qzone',
            'format': 'jsonp'
        }
        # The photo list request needs three extra parameters
        if topicId:
            query['topicId'] = topicId
            query['pageStart'] = pageStart
            query['pageNum'] = pageNum
        return '&'.join('{}={}'.format(key, val) for key, val in query.items())
The JSONP parsing function is as follows. Its core is a single regular expression match, which is very simple.
import re
from json import loads
# ...
    def _load_callback_data(self, resp):
        """Parse the returned JSONP data as JSON"""
        try:
            resp.encoding = 'utf-8'
            # Capture the {...} payload between the callback's parentheses
            data = loads(re.search(r'.*?\(({.*}).*?\).*', resp.text, re.S)[1])
            return data
        except ValueError:
            logging.error('Invalid input')
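A response that is not JSONP at all, for example an HTML error page, contains nothing for the pattern to match. The standalone sketch below is one possible variant that returns None in that case as well as for invalid JSON; the names JSONP_RE and load_jsonp and the slightly different pattern are illustrative and are not the crawler's own code.

```python
import json
import re

# Callback name, '(', payload, last ')', optional trailing characters such as ';'
JSONP_RE = re.compile(r'^[^(]*\((.*)\)[^)]*$', re.S)


def load_jsonp(text):
    """Return the JSON payload of a JSONP response, or None if it is malformed."""
    match = JSONP_RE.search(text)
    if match is None:
        return None  # no callback wrapper found
    try:
        return json.loads(match.group(1))
    except ValueError:
        return None  # wrapper found, but the payload is not valid JSON


print(load_jsonp('shine0_Callback({"code": 0, "message": ""});'))
print(load_jsonp('<html>login required</html>'))
```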
Parse and download the photos
After getting the album list, request the photo list of each album in turn and then download the photos one by one.
    def _get_photo(self, album_name, album_id):
        """Get the photo list of a single album and download all of its photos"""
        photo_list_url = '{}{}'.format(
            'https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/cgi_list_photo?',
            self._get_query_for_request(topicId=album_id))
        logging.info('Getting photo list for album {}...'.format(album_name))
        resp = self.session.get(photo_list_url)
        data = self._load_callback_data(resp)
        # Nothing to download for an empty album
        if data['data']['totalInPage'] == 0:
            return None
        file_dir = self.get_path(album_name)
        for item in data['data']['photoList']:
            path = '{}/{}.jpg'.format(file_dir, item['name'])
            logging.info('Downloading {}-{}'.format(album_name, item['name']))
            self._download_image(item['url'], path)
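Because the file path is built directly from the photo name, two photos with the same name would overwrite each other, and a name containing a path separator would break the path. Whether that happens in a given album is an assumption, but a small guard is cheap. The helper safe_filename below is a hypothetical sketch, not part of the crawler.

```python
import re


def safe_filename(name, used, default='unnamed'):
    """Return a file-system-friendly base name that is not yet in `used`."""
    # Replace path separators and other characters that are awkward in file names
    base = re.sub(r'[\\/:*?"<>|]+', '_', name).strip() or default
    candidate, n = base, 1
    while candidate in used:
        # Append a counter until the name is unique
        candidate = '{}_{}'.format(base, n)
        n += 1
    used.add(candidate)
    return candidate


used_names = set()
for photo_name in ['trip/day1', 'trip/day1', '']:
    print(safe_filename(photo_name, used_names))
```

One set of used names per album is enough, since each album is saved to its own directory.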
Images are also downloaded with requests, so remember to set a timeout.
    def _download_image(self, url, path):
        """Download a single photo"""
        try:
            resp = self.session.get(url, timeout=15)
            if resp.status_code == 200:
                # Close the file as soon as the image is written
                with open(path, 'wb') as f:
                    f.write(resp.content)
        except requests.exceptions.Timeout:
            logging.warning('get {} timeout'.format(url))
        except requests.exceptions.ConnectionError as e:
            logging.error(str(e))
Crawl test
- The crawl process
- Crawl results
Final notes
- If the format request parameter is changed from jsonp to json, the json data can be obtained directly
- This example uses neither multiple processes nor multiple threads, so it is not fast and could be optimized
- Women, that crawler is received a storm of applause. That’s what’s expected of you
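As one possible direction for the speed issue noted in this list, downloads can be handed to a thread pool from the standard library. The sketch below is deliberately generic: download_all is a hypothetical helper that accepts any download(url, path) callable, and the demonstration uses a stand-in function rather than real network access.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(tasks, download, max_workers=8):
    """Run download(url, path) for every (url, path) pair on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, url, path) for url, path in tasks]
        # result() re-raises any exception from the worker thread
        return [future.result() for future in futures]


def fake_download(url, path):
    """Stand-in for a real download function; only reports what it would do."""
    return '{} -> {}'.format(url, path)


demo_tasks = [('http://example.invalid/1', 'album/1.jpg'),
              ('http://example.invalid/2', 'album/2.jpg')]
print(download_all(demo_tasks, fake_download, max_workers=2))
```

With the crawler, the download argument could be its _download_image method. Whether a single requests.Session can be shared safely between threads is worth checking before relying on it.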
This article was originally published at www.litreily.top