Reference: www.cnblogs.com/LLBFWH/arti…

A crawler obtains data by simulating a browser's requests to a website. A naive crawler that hits a site too frequently puts heavy pressure on the server and can even bring the site down, so site maintainers use various means to block crawler traffic. Below are several common anti-crawler mechanisms and strategies for coping with them.

For sites that load content dynamically, one counter-strategy is to find the underlying API endpoints. An example that crawls Bilibili video information this way is available at: github.com/iszoop/Bili…

Advanced crawling: coping with anti-crawler mechanisms

Crawlers and anti-crawler systems are locked in a long-running arms race. In the era of big data, data is money, and many companies deploy anti-crawler mechanisms on their websites to keep their data from being scraped. But if the anti-crawling mechanism is too strict, it can accidentally block real users; and if a site wants to fight crawlers while keeping the false-positive rate very low, its development costs go up.

Simple, low-level crawlers are fast but poorly disguised. Without an anti-crawler mechanism they can grab large amounts of data quickly and may even disrupt the server's normal operation with the sheer volume of requests. A well-disguised crawler, by contrast, crawls slowly and puts relatively little load on the server. Anti-crawling efforts therefore focus on the crude, high-volume crawlers, while well-disguised crawlers are often tolerated and still get the data. After all, a well-disguised crawler is hard to tell apart from a real user.

This article focuses on how to deal with common anti-crawler mechanisms when using the Scrapy framework.

Header checks

The simplest anti-crawling mechanism is to check the headers of HTTP requests, including User-Agent, Referer, and Cookies.

User-Agent

For each request, randomly pick a real browser User-Agent.
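A minimal sketch of a Scrapy downloader middleware that does this; the middleware name and the User-Agent strings are illustrative, and the middleware still has to be enabled in DOWNLOADER_MIDDLEWARES in settings.py:

```python
import random

# A small, hand-maintained pool of real browser User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]

class RandomUserAgentMiddleware:
    """Downloader middleware that assigns a random User-Agent to every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
```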

Referer

Referer indicates where a request came from and is usually used to detect hotlinking of images. In Scrapy, if a page's URL was extracted from a previously crawled page, Scrapy automatically fills in that earlier page as the Referer. You can also set the Referer field yourself, in the same way as the User-Agent above.
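A hedged sketch of setting the Referer explicitly when yielding follow-up requests; the spider name, start URL, and selectors are placeholders:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # Set the Referer explicitly instead of relying on Scrapy's default behaviour.
            yield scrapy.Request(
                response.urljoin(href),
                headers={"Referer": response.url},
                callback=self.parse_detail,
            )

    def parse_detail(self, response):
        pass  # extract fields here
```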

Cookies

The site may track how many times the session_id carried in a Cookie is used; exceeding the limit triggers the anti-crawling policy. You can set COOKIES_ENABLED = False in Scrapy so that requests are sent without Cookies.

Some sites, however, require Cookies to be enabled, which is trickier. You can write a separate, simple crawler that periodically sends a request without Cookies to the target site, extracts the Set-Cookie information from the response, and saves it. When crawling the actual pages, attach the stored Cookies to the request headers, as in the sketch below.
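A minimal sketch of such a cookie refresher, assuming the requests library is available; the helper name is hypothetical:

```python
import requests

def fetch_fresh_cookies(url):
    """Request the target page without sending any cookies and return
    the cookies the server sets in its response (hypothetical helper)."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    return resp.cookies.get_dict()

# In the spider, attach the stored cookies to outgoing requests, e.g.:
# yield scrapy.Request(url, cookies=stored_cookies)
```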

X-Forwarded-For

Add the X-Forwarded-For field to the request headers to present yourself as a transparent proxy server; some websites are a little more lenient toward proxy servers. The header has the following format: X-Forwarded-For: client1, proxy1, proxy2. However, because X-Forwarded-For can be tampered with at will, many sites do not trust its value.
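A sketch of a downloader middleware that adds a spoofed X-Forwarded-For header to every request; the random "client" IP is purely illustrative:

```python
import random

class XForwardedForMiddleware:
    """Adds a spoofed X-Forwarded-For header, pretending to be a transparent proxy."""

    def process_request(self, request, spider):
        fake_client_ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
        request.headers["X-Forwarded-For"] = fake_client_ip
```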

Per-IP request limits

If requests from a single IP address come in too fast, the anti-crawling mechanism is triggered. This can of course be circumvented by slowing down the crawl, at the cost of greatly increasing the crawl time. Another approach is to add a proxy:

request.meta['proxy'] = 'http://proxy_host:proxy_port'

Then use a different proxy IP for each request. The problem, however, is how to obtain a large number of proxy IPs.

You can build a proxy acquisition and maintenance system that periodically scrapes free proxy IPs from the various sites that publish them, regularly checks whether those IPs and ports are still reachable, and promptly removes the dead ones. This gives you a dynamic proxy pool from which one proxy is randomly selected for each request (see the middleware sketch below). The drawbacks are obvious, though: developing the acquisition and maintenance system is itself time-consuming and laborious, free proxies are scarce, and their stability is poor. If you must use proxies, you can also buy a stable proxy service; most such services use authenticated proxies.
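A minimal sketch of a random-proxy downloader middleware; the pool contents and class name are illustrative, and the middleware must be enabled in DOWNLOADER_MIDDLEWARES:

```python
import random

# Illustrative proxy pool; in practice it would be filled by your proxy
# acquisition/maintenance system or a paid proxy service.
PROXY_POOL = [
    "http://111.111.111.111:8080",
    "http://112.112.112.112:3128",
    # For an authenticated proxy service, credentials can go in the URL,
    # e.g. "http://user:password@proxy-host:8000", which Scrapy's built-in
    # HttpProxyMiddleware understands.
]

class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to every request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXY_POOL)
```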

Dynamically loaded content

More and more websites now use Ajax to load content dynamically. In that case you can intercept and analyze the Ajax requests; if you can construct the corresponding API request URL, you can fetch the desired content directly, usually in JSON format, without having to parse HTML at all.
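A hedged sketch of a spider that requests such an API endpoint directly and parses the JSON; the endpoint URL and field names are made up for illustration and depend entirely on the actual API you discover in the browser's network panel:

```python
import json
import scrapy

class ApiSpider(scrapy.Spider):
    name = "api_example"
    # Hypothetical Ajax endpoint discovered in the browser's network panel.
    start_urls = ["https://example.com/api/videos?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get("list", []):  # field names depend on the actual API
            yield {
                "title": item.get("title"),
                "play_count": item.get("play"),
            }
```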

However, many Ajax requests are validated by the back end, so the URL cannot simply be constructed and fetched directly. In that case you can use PhantomJS + Selenium to simulate browser behavior and capture the page after its JavaScript has been rendered. Please refer to:

Note that with Selenium, requests are no longer executed by Scrapy's downloader, so any per-request information added earlier (headers, Cookies, proxies, and so on) no longer takes effect and has to be set again inside Selenium, as in the sketch below.
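A minimal sketch of a Selenium-based downloader middleware under these assumptions: recent Selenium releases have dropped PhantomJS, so headless Chrome is used instead, chromedriver is on the PATH, and request cookies are passed as a dict. Packages such as scrapy-selenium offer more complete integrations; remember to quit the driver when the spider closes.

```python
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class SeleniumMiddleware:
    """Renders JS-heavy pages in a headless browser instead of Scrapy's downloader."""

    def __init__(self):
        options = Options()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Load the page once so the browser is on the right domain, then add
        # any cookies carried by the Scrapy request and reload with them.
        self.driver.get(request.url)
        for name, value in request.cookies.items():
            self.driver.add_cookie({"name": name, "value": value})
        if request.cookies:
            self.driver.get(request.url)
        body = self.driver.page_source
        # Returning a Response here short-circuits Scrapy's own downloader.
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding="utf-8", request=request)
```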