Many websites now deploy a variety of anti-crawling measures, and a single site usually combines several of them; the more valuable the data, the more elaborate the defenses. Common anti-crawling measures and their solutions are described below.

1. Request-header anti-crawling

This is the most basic anti-crawling measure and the easiest to implement, but it is also the easiest to bypass: adding a reasonable set of request headers is usually enough to access the target site and obtain data, as in the minimal sketch below.
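A minimal sketch with the requests library; the URL and header values here are placeholders, not taken from the article:

```python
# Send a browser-like request header so the server does not reject the request.
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/96.0.4664.110 Safari/537.36"),
    "Referer": "https://example.com/",          # placeholder referer
}

resp = requests.get("https://example.com/data", headers=headers, timeout=10)
print(resp.status_code)
```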

2. IP-based anti-crawling

  • The server counts the requests coming from a single IP address within a given time window. Once the count exceeds a threshold, the server simply refuses service and returns an error. The offending IP address can then be blocked permanently or temporarily:

  • Permanent block: IP addresses on the blacklist can never access the site again

  • Fixed-period block: the IP address is rejected for a period of time and then restored

Solution:

Use proxies to get around IP access limits. The usual approach is either to buy a proxy service or to buy VPS servers and build your own proxy IP pool; a minimal usage sketch follows.
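A minimal sketch of routing requests through a proxy with requests; the proxy address is a placeholder you would replace with one from your own pool or provider:

```python
# Route the request through a proxy so the target site sees the proxy's IP.
import requests

proxies = {
    "http": "http://127.0.0.1:8888",    # placeholder proxy address
    "https": "http://127.0.0.1:8888",
}

resp = requests.get("https://example.com/data", proxies=proxies, timeout=10)
print(resp.status_code)
```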

How a proxy works: the crawler sends its requests to the proxy server, which forwards them to the target website, so the target only ever sees the proxy's IP address.

IP proxy pool architecture:

  • Storage module: typically a Redis sorted set, used for proxy deduplication and status scoring; it is the central module that the other modules connect to (a minimal sketch of this module follows the list)
  • Fetch module: periodically scrapes proxies from proxy sites and passes them to the storage module, which saves them to Redis
  • Detection module: periodically retrieves all proxies from the storage module, tests them, and updates each proxy's status score based on the result
  • Interface module: exposes a web API that reads from Redis and returns usable proxies
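A minimal sketch of the storage module described above, assuming a local Redis instance; the key name and score values are my own illustrative choices:

```python
# Proxies live in a Redis sorted set: the member is the proxy address, the
# score marks its status (100 = verified usable, 10 = newly fetched).
import random
import redis

POOL_KEY = "proxies"                     # assumed key name
SCORE_INIT, SCORE_MAX, SCORE_MIN = 10, 100, 0

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

def add(proxy: str) -> None:
    """Fetch module calls this; zadd deduplicates automatically."""
    r.zadd(POOL_KEY, {proxy: SCORE_INIT})

def mark_ok(proxy: str) -> None:
    """Detection module marks a proxy that passed the check."""
    r.zadd(POOL_KEY, {proxy: SCORE_MAX})

def mark_bad(proxy: str) -> None:
    """Detection module lowers the score of a failing proxy or drops it."""
    score = r.zscore(POOL_KEY, proxy)
    if score is not None and score > SCORE_MIN + 1:
        r.zincrby(POOL_KEY, -1, proxy)
    else:
        r.zrem(POOL_KEY, proxy)

def get_random() -> str:
    """Interface module returns a random proxy, preferring verified ones."""
    best = r.zrangebyscore(POOL_KEY, SCORE_MAX, SCORE_MAX)
    if not best:
        best = r.zrevrange(POOL_KEY, 0, 10)
    return random.choice(best) if best else ""
```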

ADSL dial-up proxy:

  • Dial-up module: re-dials to obtain a new IP address and sends it to the interface module, which calls the storage module to save the IP data
  • Interface module: receives IPs from the dial-up module and exposes an interface that returns IP data to the crawler
  • TingProxy: the proxy service itself, a piece of software installed and started on the VPS
  • Storage module: responsible for storing and serving the IPs
  • Crawler: fetches an IP from the interface and uses it as a proxy to request the target site and collect data

3. Verification-code anti-crawling

  • Verification codes are another common anti-crawling mechanism. As the technology has developed, verification-code styles have multiplied: they started as images of a few digits, later added English letters and distortion curves, and some sites now use Chinese-character verification codes as well.

  • When a page is protected by a verification code there are two broad solutions. One is to buy a recognition service: you submit the image through an API and the provider's backend returns the recognition result. The other is to train your own recognition model. This section focuses on how to handle verification codes yourself; for the service route, follow the chosen platform's API documentation to integrate recognition.

1) Character verification code:

  • For ordinary character recognition, deep learning can reach higher accuracy than the human eye. You can build your own recognition service with deep learning, as follows:

  • Training samples are usually collected by wiring the crawler up to a real solving platform and saving the correctly solved samples as training data. As a rough rule of thumb, the required number of training samples is about (number of classes) × 500; for example, training on the 36 classes of digits and letters needs roughly 36 × 500 = 18,000 samples. More training samples give a higher recognition rate but also cost more. In practice, a usable model for digit-and-letter verification codes can be trained from around 10,000 samples. For details on training a usable model, see my other article, "Building Captcha Recognition Services based on Python + Deep Learning": juejin.cn/post/684490…

  • Chinese-character recognition works the same way as ordinary character recognition; it only needs a much larger sample size, and the model structure and training process are identical. Some sites also use a verification code that asks you to enter only the characters shown in a specified color. Recognizing it is similar to an ordinary character code but needs two cooperating models: a color-recognition model that outputs the color sequence of the characters in the image, and a character-recognition model that outputs the characters themselves. In practice, the color model reaches about 99.99% accuracy with very few samples and converges quickly. The character model, however, has to cover roughly 3,500 common Chinese characters, so the actual training set was about one million samples (generated synthetically by code), reaching a recognition rate above 95%; training takes quite a long time (a GPU is much faster).

Training a verification-code recognition model with deep learning

  • The model can be designed as a general-purpose structure: only the number of output classes needs to change between tasks, so the model can be reused. Inconsistent image sizes are handled by scaling every image to one fixed size (see the sketch after this list).
  • Character verification codes are generally case-insensitive, and each platform may exclude characters that are easily confused, so a character code rarely uses all 36 classes; the missing characters can be found by examining the training samples, which reduces the number of output classes. Another point to note is that some sites return codes of variable length. For these, design the model for the maximum length and pad shorter codes with a filler character such as an underscore, choosing the padding position carefully to get better recognition accuracy.
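A minimal Keras sketch of such a reusable structure; the layer sizes, image shape and class count are my own illustrative choices, not the author's exact model. Only the constants at the top change between tasks:

```python
# Multi-head CNN: one softmax head per character position; variable-length
# codes are padded to CAPTCHA_LEN with a filler class such as "_".
from tensorflow.keras import layers, models

CAPTCHA_LEN = 4          # maximum code length (assumed)
NUM_CLASSES = 37         # e.g. 26 letters + 10 digits + 1 filler class
IMG_H, IMG_W = 60, 160   # every image is scaled to this fixed size

inputs = layers.Input(shape=(IMG_H, IMG_W, 1))
x = inputs
for filters in (32, 64, 128):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)

outputs = [layers.Dense(NUM_CLASSES, activation="softmax", name=f"char_{i}")(x)
           for i in range(CAPTCHA_LEN)]

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```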

2) Behavior verification code:

A. Coordinate selection:

  • Coordinate-click codes can also be handed to a solving platform: submit the image data, the platform returns the coordinates, and you click the corresponding points with Selenium and PhantomJS to complete the verification, or simply construct the submission parameters directly.
  • Alternatively, train a deep-learning model yourself. Self-trained models need a large sample size, so training costs are relatively high; Chinese-character click codes need comparatively fewer samples. Here is one recognition approach for Chinese-character click codes:

  • After recognizing the characters in the image, the key step is clicking them in the correct order, so you need a word-order module that takes the recognized characters and outputs the correct sequence. The simplest approach is to segment every ordering of the characters with the Jieba library and take the ordering that yields the longest segment as the correct result (see the sketch after this list); alternatively, train a word-order model with natural language processing.
  • The last step is to simulate the clicks. The simple way is to drive a browser, which has a relatively high pass rate and is easy to implement. The other way is to analyze the parameter-encryption process and construct the submission parameters directly; reversing the JS is difficult and technically demanding, but the program runs much more efficiently.
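A rough sketch of the word-order idea: try every ordering of the recognized characters, segment each with jieba, and keep the ordering whose longest segment is longest (ties broken by fewer segments). The example characters are only illustrative:

```python
# Click codes usually show 3-5 characters, so brute-forcing permutations is cheap.
from itertools import permutations
import jieba

def best_order(chars):
    best, best_key = None, None
    for perm in permutations(chars):
        candidate = "".join(perm)
        words = list(jieba.cut(candidate))
        # Prefer the ordering with the longest single segment, then fewer segments.
        key = (max(len(w) for w in words), -len(words))
        if best_key is None or key > best_key:
            best, best_key = candidate, key
    return best

print(best_order(["击", "点", "证", "验"]))   # one plausible result: "点击验证"
```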

B. Sliding verification:

  • The keys to slider recognition are determining the sliding distance and constructing the sliding trajectory. The most common way to get the distance is an image-processing algorithm: use the OpenCV library to locate the gap and compute how far the slider must move; this works for several mainstream slider codes. For the trajectory, the principle is to imitate a human drag as closely as possible, for example the widely used accelerate-then-decelerate pattern or a normal-distribution curve (which works well in practice and is easy to tune). Once the distance and track are known, Selenium and PhantomJS simulate the drag (a sketch of both steps follows this list). The drawback of this method is low efficiency.
  • A better way is to analyze the JS parameter-construction process once the distance and track are known, reverse the JS, build the submission parameters and post them directly to the backend for verification. This demands solid JS reverse-engineering skills, but the crawl efficiency is much higher.
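A hedged sketch of the two steps: OpenCV template matching to estimate the gap offset, and an accelerate-then-decelerate track. File names and the acceleration constants are placeholders, not from the article:

```python
import cv2

def gap_distance(bg_path: str, slider_path: str) -> int:
    """Locate the notch in the background image by matching the slider piece."""
    bg = cv2.imread(bg_path, cv2.IMREAD_GRAYSCALE)
    piece = cv2.imread(slider_path, cv2.IMREAD_GRAYSCALE)
    bg = cv2.Canny(bg, 100, 200)          # edge maps make the notch stand out
    piece = cv2.Canny(piece, 100, 200)
    result = cv2.matchTemplate(bg, piece, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)
    return max_loc[0]                      # x offset of the best match

def build_track(distance: int) -> list:
    """Accelerate for ~70% of the distance, then decelerate, like a human drag."""
    track, current, v = [], 0.0, 0.0
    mid = distance * 0.7
    while current < distance:
        a = 3.0 if current < mid else -4.0
        step = max(v + a / 2, 1)           # distance moved in this time slice
        v += a
        current += step
        track.append(round(step))
    return track

# Example usage:
# track = build_track(gap_distance("bg.png", "slider.png"))
```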

4. JS-obfuscated dynamic-parameter anti-crawling

  • JS parameter encryption is another anti-crawling mechanism many websites use. The simplest countermeasure is to capture pages directly with Selenium and PhantomJS: no JS analysis is needed, but collection is slow.
  • The other solution is to reverse the JS directly: rewrite the encryption code, or run it in a JS execution engine (PyV8, PyExecJS, PhantomJS) to obtain the encrypted parameters and submit them straight to the API, as sketched below.
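A minimal sketch of the second approach: loading the extracted JS into PyExecJS and calling the signing function from Python. "encrypt.js", "getSign" and the query parameters are placeholders for whatever the target site actually uses:

```python
# Evaluate the site's encryption JS and use its output as a request parameter.
import execjs
import requests

with open("encrypt.js", encoding="utf-8") as f:
    ctx = execjs.compile(f.read())

sign = ctx.call("getSign", "keyword", 1)   # call the JS function with its arguments
resp = requests.get("https://example.com/api",
                    params={"kw": "keyword", "page": 1, "sign": sign},
                    timeout=10)
print(resp.json())
```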

5. Account-based anti-crawling

  • The common case is a site that requires login before any data can be viewed. Collecting such sites needs a large pool of accounts, and you must watch the maximum request rate per account: some sites ban an account that sends too many requests in a short time. The solution is to rotate across many accounts, switching to another one as soon as the current account has sent a certain number of requests.
  • The first step in defeating account-based anti-crawling is simulated login. There are two common approaches: drive Selenium and PhantomJS through the login flow, which is relatively simple and avoids analyzing the JS (many login pages have their own JS parameter obfuscation); or reverse the JS and submit the login data directly, then save the Cookie data for the crawler to use when fetching data (a minimal sketch follows this list).
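A minimal sketch of the direct-submission path: posting credentials with requests.Session and saving the resulting Cookies for later use. The URL, field names and storage format are assumptions, not from the article:

```python
# Log in once, then persist the session cookies so the crawler (or a cookie
# pool) can reuse them for authenticated requests.
import json
import requests

session = requests.Session()
resp = session.post("https://example.com/login",
                    data={"username": "user1", "password": "pass1"},
                    timeout=10)

if resp.ok:
    with open("cookies_user1.json", "w", encoding="utf-8") as f:
        json.dump(session.cookies.get_dict(), f)
```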

General architecture of a Cookie pool:

  • Fetch module: generates the Cookie for each account
  • Storage module: stores each account and its corresponding Cookie, and provides convenient access operations
  • Detection module: checks the Cookies periodically; each site has its own test URL, which the module requests with the corresponding Cookie. If the response indicates a valid session the Cookie is kept, otherwise it is marked invalid and removed.
  • Interface module: exposes an external API that returns a random Cookie, so every Cookie has a chance of being used; the more Cookies in the pool, the lower the probability any one is picked, which reduces the risk of an account being banned (a minimal sketch follows this list)
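A minimal sketch of the interface module: a small Flask API that returns one random account's Cookie from a Redis hash. The key name and route are illustrative assumptions:

```python
# Pick a random account's Cookie from Redis and return it as JSON.
import random
from flask import Flask, jsonify
import redis

app = Flask(__name__)
r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)
COOKIES_KEY = "cookies"   # assumed hash: account -> cookie string

@app.route("/cookies/random")
def random_cookie():
    accounts = r.hkeys(COOKIES_KEY)
    if not accounts:
        return jsonify({"error": "no cookies available"}), 404
    account = random.choice(accounts)
    return jsonify({"account": account, "cookie": r.hget(COOKIES_KEY, account)})

if __name__ == "__main__":
    app.run(port=5000)
```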

6. Custom font library anti-crawling

Some websites now use a custom font library for anti-crawling: the data renders normally on the page, but the characters actually delivered in the HTML are different characters or codes. Defeating this requires parsing the site's own font file and replacing the encrypted characters with the real characters they are drawn as, i.e. building a mapping between the custom font's glyphs and a base font. A rough sketch follows.
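A hedged sketch of one common setup, using fontTools to read the site's font and a manually labelled glyph-to-character table; the file name, glyph names and mapping are hypothetical:

```python
# Map each obfuscated character in the HTML back to the real character its
# glyph actually draws.
from fontTools.ttLib import TTFont

font = TTFont("site_font.woff")        # font file downloaded from the site
cmap = font.getBestCmap()              # unicode codepoint -> glyph name

# Labelled once by inspecting the font in a viewer: glyph name -> real character.
GLYPH_TO_REAL = {"uniE001": "0", "uniE002": "1", "uniE003": "2"}  # hypothetical

def decode(text: str) -> str:
    """Replace each obfuscated character with the character it is rendered as."""
    out = []
    for ch in text:
        glyph = cmap.get(ord(ch))
        out.append(GLYPH_TO_REAL.get(glyph, ch))
    return "".join(out)

print(decode("\ue001\ue003"))          # e.g. "02" under the hypothetical mapping
```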

7. Summary

  • Most websites today apply at least basic anti-crawling, most commonly verification codes and JS parameter encryption. For quick one-off captures, Selenium, PhantomJS or Splash are recommended. For crawl tasks with a large volume of data, constructing the request parameters directly is recommended: it is more stable and more efficient, especially for the many sites with separated front and back ends.
  • A crawler inevitably puts extra load on a website, so set a reasonable crawl rate, avoid disturbing the target site or affecting its normal use, and mind your crawling etiquette.

Here is an article on whether crawlers are legal or illegal: Mp.weixin.qq.com/s/rO24Mi5G5…

Respect and abide by the law, starting with ourselves

This article does not target any real website; some of the verification-code images come from the internet, and any resemblance is coincidental

Do not use for commercial purposes

Thank you for reading