The author of this article is Wei Wei.

Project requirements and problem introduction

Sometimes we want to crawl the comics on Tencent Comics. For example, let us open the page of a particular comic at ac.qq.com/Comic/comic… , as shown in the figure below:

Then, we click “Start reading” and the following interface appears:

As you can see, the page contains a comic. If we try the conventional approach and view the page's source code, we cannot find the addresses of these comic images anywhere in it. Moreover, scrolling the page triggers the loading of subsequent pages of the comic. So we can preliminarily conclude that this data is loaded asynchronously and triggered dynamically.

Following the usual approach, we’ll try to solve the problem using packet capture analysis, so we open Fiddler.

After opening Fiddler, we open the comic page again and scroll it to trigger the loading of more comic images. Meanwhile, the newly triggered resource requests appear one after another in Fiddler, as shown below:

We analyze these URLs in turn, pick out the comic-related ones, and copy them into Word, as shown below:

By comparing them, we can see the pattern of the comic resource URLs.

The corresponding rules are as follows:

ac.qq.com/store_file_download?buid=&lt;comic ID&gt;&uin=&lt;uin value&gt;&dir_path=/&name=&lt;date&gt;_&lt;random number&gt;_&lt;image ID&gt;.jpg

We can see that the address contains a random number, so this part of the URL is hard to build with our previous URL-construction methods. Even though we have worked out the URL pattern, it does not help much, because part of the pattern is a random number, i.e. an irregular field.
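To make the difficulty concrete, here is a small sketch that picks such a URL apart with Python's standard library. The URL below is made up to follow the pattern above; the buid, uin and name values are illustrative, not real Tencent data:

```python
from urllib.parse import urlsplit, parse_qs

# A made-up url following the pattern above (all values are illustrative)
url = ("http://ac.qq.com/store_file_download?buid=15017&uin=1"
       "&dir_path=/&name=16_09_08_ab12cd_0001.jpg")

qs = parse_qs(urlsplit(url).query)
# buid and uin stay fixed for a given comic and user, but the name field
# mixes a date, a random token and the image id
print(qs["buid"][0])  # 15017
print(qs["name"][0])  # 16_09_08_ab12cd_0001.jpg
```

Since the random token inside the name field is generated server-side, enumerating it in advance is infeasible; we can only capture these URLs after the page itself generates them.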

So, obviously, this "dynamically triggered URL + randomly stored resource" anti-crawling strategy is hard to defeat with our past crawling techniques. If you first try to write a conventional crawler for it, you will feel this deeply.

There is always a solution as long as we think it through. Next, we will explain how to defeat this anti-crawling strategy. Our goal today is to use Python to automatically crawl every image of a comic on Tencent Comics, automatically triggering the loading of the comic and obtaining the random addresses. We take this as an example of how to conquer the "dynamically triggered URL + randomly stored resource" anti-crawling strategy.

The difficulty and solution of the problem

From the above introduction, we can know that the difficulty of the current problem lies in:

1. The comic images are dynamically triggered and loaded asynchronously. The URL of each comic image cannot be obtained from the comic's main page, and without those URLs we cannot crawl the images.

2. Each comic image URL contains a random parameter. Even after working out the URL pattern through packet-capture analysis, we cannot actively construct these addresses.

In fact, these problems can be solved. The solution idea is as follows:

1. Use PhantomJS (a headless browser) to open the page and automatically trigger the generation of the comic images.

2. Scroll the page with JavaScript to automatically trigger the remaining comic images.

3. After the comic images are triggered, extract their addresses with a regular expression.

4. Hand the addresses to a regular crawler such as Urllib or Scrapy to automatically crawl the resources. Here we use the Urllib module to write the crawler.

PhantomJS can trigger the data because it is essentially a browser, but it is very slow, so we usually give the main part of the crawling to Urllib or Scrapy. Whatever the regular crawler cannot handle, we hand to PhantomJS and the like, and pass the result back to the regular crawler afterwards. In other words, different technologies are responsible for different parts and are combined, which makes the crawler more efficient without sacrificing functionality.

Obtaining comic image addresses by dynamic triggering with PhantomJS

Next, let’s write the implementation-related projects.

First, we import the relevant modules:

from selenium import webdriver

import time

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

Then we create a PhantomJS-based browser and set its user agent, otherwise the page may not render compatibly, as shown below:

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")
browser = webdriver.PhantomJS(desired_capabilities=dcap)

Then we open the comic page with PhantomJS and trigger the comic image addresses, as shown below:

# Open the first page of the anime

browser.get("ac.qq.com/ComicView/i…")

# Save a screenshot of the opened page for easy observation

a=browser.get_screenshot_as_file(“D:/Python35/test.jpg”)

# Get the full source code of the current page (including asynchronously loaded resources)

data=browser.page_source

# Write the source code of relevant web pages into local files for easy analysis

fh=open(“D:/Python35/dongman.html”,”w”,encoding=”utf-8″)

fh.write(data)

fh.close()

We then run the code, and when we’re done, we’ll find the corresponding screenshot D:/Python35/test.jpg as follows:

As you can see, the first few cartoon images are loaded successfully, but the later cartoon images are not loaded. Why?

Obviously the later comic images need to be triggered before they load, so we can use JS code to scroll the page automatically and trigger them.

Before triggering the remaining comic images, let us look at the page source at this point. We search for "ac.tc.qq.com/store_file_download" in the source, i.e. the URL format of the comic image resources, to see how many appear, as shown below:

As you can see, only 4 URLs match at this point, which means the URLs of the remaining comic images have not been loaded yet.

Next, we can use window.scrollTo(x, y) to scroll the page automatically and trigger the remaining URLs. Right below the line:

browser.get("ac.qq.com/ComicView/i…")

Insert the following supplementary code below:

for i in range(10):
    js = 'window.scrollTo(' + str(i * 1280) + ',' + str((i + 1) * 1280) + ')'
    browser.execute_script(js)
    time.sleep(1)

Through this loop, we can automatically slide in turn, and the simulation slide will naturally trigger the subsequent image resources.
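As a quick sanity check of what the loop sends to the browser, the generated JavaScript strings can be inspected without any browser at all. This is just a sketch; the 1280-pixel step is the value used in the loop above:

```python
# Build the same scroll commands as the loop above, without a browser
cmds = ['window.scrollTo(' + str(i * 1280) + ',' + str((i + 1) * 1280) + ')'
        for i in range(10)]

print(cmds[0])   # window.scrollTo(0,1280)
print(cmds[-1])  # window.scrollTo(11520,12800)
```

Each iteration moves the viewport roughly one screen height further down the page, and the time.sleep(1) in the real loop gives the page time to run its lazy-loading handlers before the next scroll.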

Then we run the code again. After it executes, the screenshot shows that the remaining comic images have been loaded, and the URL matches in the source code are now as follows:

As you can see, the number of matching URLs has increased to 25, so all the images on the current page have now been loaded.

Obviously, we have now used PhantomJS to trigger the asynchronous resources and obtain the random URLs. Next, we extract the comic image URLs and hand them to the Urllib module for the subsequent crawl.

After we finished using PhantomJS, we needed to close the browser, so we added the following line after the code:

browser.quit()

Write a complete crawler project

We then went on to write the crawler project.

We use the regular expression '&lt;img src="(http:..ac.tc.qq.com.store_file_download.buid=.*?name=.*?).jpg"' to extract all the comic image resource URLs, and then download the extracted images to the local disk with urllib.

The specific code is as follows:

import re
import urllib.request

pat = '&lt;img src="(http:..ac.tc.qq.com.store_file_download.buid=.*?name=.*?).jpg"'

# Get all the animation resources url

allid=re.compile(pat).findall(data)

for i in range(0, len(allid)):
    # Get the current url
    thisurl = allid[i]
    # Remove the redundant "amp;" from the url
    thisurl2 = thisurl.replace("amp;", "") + ".jpg"
    # Print the url currently being crawled
    print(thisurl2)
    # Set the local path to store the comic image
    localpath = "D:/Python35/dongman/" + str(i) + ".jpg"
    # Crawl the comic image resource with urllib
    urllib.request.urlretrieve(thisurl2, filename=localpath)

We then run the code and see the following information under the local directory D:/Python35/dongman/ :

You can see that the comic image resources have been crawled to the local disk.
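To see exactly what the regular expression and the "amp;" clean-up do, here is a self-contained sketch run against a made-up fragment of page source. The tag and URL below are illustrative, not copied from Tencent:

```python
import re

# A made-up <img> tag resembling what the saved page source contains;
# note the HTML-escaped &amp; separators inside the url
data = ('<img src="http://ac.tc.qq.com/store_file_download?buid=15017'
        '&amp;uin=1&amp;dir_path=/&amp;name=16_09_08_ab12cd_0001.jpg"/>')

pat = '<img src="(http:..ac.tc.qq.com.store_file_download.buid=.*?name=.*?).jpg"'
allid = re.compile(pat).findall(data)

# The capture stops before ".jpg", so we strip "amp;" and re-append the suffix
thisurl2 = allid[0].replace("amp;", "") + ".jpg"
print(thisurl2)
```

The replace is needed because the page source HTML-escapes & as &amp;; feeding the escaped form to urlretrieve would produce a malformed query string.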

To facilitate debugging, here is the complete code:

from selenium import webdriver
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import re
import urllib.request

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")
browser = webdriver.PhantomJS(desired_capabilities=dcap)
# Open the first page of the comic
browser.get("ac.qq.com/ComicView/i…")
# Scroll step by step to trigger the loading of the remaining comic images
for i in range(10):
    js = 'window.scrollTo(' + str(i * 1280) + ',' + str((i + 1) * 1280) + ')'
    browser.execute_script(js)
    time.sleep(1)
# Save a screenshot of the opened page for easy observation
a = browser.get_screenshot_as_file("D:/Python35/test.jpg")
# Get the full source code of the current page (including asynchronously loaded resources)
data = browser.page_source
# Write the page source to a local file for easy analysis
fh = open("D:/Python35/dongman.html", "w", encoding="utf-8")
fh.write(data)
fh.close()
browser.quit()
# Construct the regular expression for extracting the comic image resource urls
pat = '<img src="(http:..ac.tc.qq.com.store_file_download.buid=.*?name=.*?).jpg"'
# Get all the comic image resource urls
allid = re.compile(pat).findall(data)
for i in range(0, len(allid)):
    # Get the current url
    thisurl = allid[i]
    # Remove the redundant "amp;" from the url
    thisurl2 = thisurl.replace("amp;", "") + ".jpg"
    # Print the url currently being crawled
    print(thisurl2)
    # Set the local path to store the comic image
    localpath = "D:/Python35/dongman/" + str(i) + ".jpg"
    # Crawl the comic image resource with urllib
    urllib.request.urlretrieve(thisurl2, filename=localpath)

As you can see, once we have the solution idea, the project is not hard to implement. Through this example, you should grasp the way to solve this class of problem, i.e. master the solution to the "dynamically triggered URL + randomly stored resource" anti-crawling strategy. I hope you practice it, and I hope this article gives students who meet this kind of problem ideas and inspiration.

