Text: Nine Nine Temple True Night · From: SegmentFault

As mentioned earlier, some sites resist crawling. In fact, any site of a certain size, any site run by a large company, or any site with a commercial bent will have fairly advanced anti-crawling measures. Broadly, there are two anti-crawling strategies: verify the visitor's identity and crush the bug at the door, or plant various anti-crawling mechanisms inside the site so the crawler comes back empty-handed. This section proposes countermeasures against both strategies.

Disguising your identity

Even small, obscure sites, let alone big ones, will check the headers to verify a visitor's identity. So, if we want our bugs to come back with results, we need to teach them how to disguise themselves. Sometimes disguise alone is not enough, and we also have to teach the crawler to behave like a human rather than like an unknown creature hammering the site at inhuman speed.

Customize Requests Headers

  • "I am a human!" Modify User-Agent: it holds the system and browser model and version; change it to disguise the crawler as a person browsing normally.

  • "I'm from Hebei province." Modify Referer: it tells the server which URL you arrived from rather than appearing out of nowhere; some sites check it.

  • "I brought biscuits!" Bring cookies: sometimes whether or not you carry a cookie makes a difference. Try bribing the server with a cookie so that it gives you the complete information.

  • For the exact values, press F12, capture a request, and look at its Request Headers.

    import requests

    headers = {
        'Referer': 'accounts.pixiv.net/loginlang=z…',
        # Disguise every crawler
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    }
    r = requests.get('segmentfault.com/a/119000001…', headers=headers)

The headers usually only need these two fields, and I strongly recommend attaching a User-Agent to every request your crawler makes; it is better than nothing.
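To make the cookie point concrete, here is a minimal sketch of carrying a cookie alongside the disguised headers; the cookie name, its value, and the URL are placeholders for illustration, not values from the site above. In practice you would copy the real cookie from the Request Headers shown under F12.

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    }
    # Hypothetical cookie copied from the browser's F12 panel
    cookies = {'session_id': 'xxxxxxxx'}
    r = requests.get('https://example.com/', headers=headers, cookies=cookies)

    # Or let a Session remember cookies between requests automatically
    s = requests.Session()
    s.headers.update(headers)
    r = s.get('https://example.com/')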

Reduce the access frequency of your main IP

Note: this applies to long-term, large-scale crawlers.

Some sites monitor the frequency and number of visits from a given IP, and once you cross a certain threshold they kick you out as a suspected crawler, so you need to lower your profile.

  • Zzzzz (sleep): take a break after crawling for a while, both so your crawler survives and for the server's sake.

  • IP proxy: route requests through a proxy via the proxies parameter. If you have spare IPs of your own, great; otherwise, good proxy IPs cost money.

    import time
    import requests

    time.sleep(60)  # take a break before the next request

    # Proxy IP dictionary: 'protocol type': 'full IP address + port'
    proxies = {
        'http': 'http://10.10.1.10:3128',
        'https': 'http://10.10.1.10:1080',
    }
    r = requests.get(url, headers=headers, proxies=proxies)

Breaking through anti-crawling mechanisms

In part 0 of this series I mentioned that even getting a page's source code can be a problem. Sometimes the headers are disguised well enough, yet you still can't get the source you want: parts are missing, you get a pile of irrelevant content, or you get blocked outright. That means the problem is no longer disguise, but figuring out the page's anti-crawling mechanism and working out a way around it, which demands fairly strong observation and analysis skills.

The main ones I have encountered so far are:

  • Random verification code: the page generates a random code and requires you to submit it before it accepts your request (common in login verification). The code is usually hidden in the page source, so the strategy is to fetch it first, then submit it (see the sketch after this list).
  • Messy URLs: the URL is followed by a long string of seemingly random parameters. Nothing much can be done with these; go straight to Selenium.
  • Encrypted or obfuscated source code: you know what you want is in there, but you don't know how to extract it. Solving these puzzles depends on how well your brain works.
  • Dynamic loading: you have to interact with the page to see more content, but the crawler can't interact with it. Go straight to Selenium, or capture packets by hand and analyze the target link.
  • Ajax: content is loaded asynchronously at different times, so a plain crawler only gets the first HTML response and ends up with incomplete information. Go to Selenium, or capture packets by hand and analyze the target link.
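As an illustration of the "fetch first, submit later" idea for the random verification code case, here is a minimal sketch. The field name post_key, the login URL, and the credentials are hypothetical placeholders; on a real site you would find the actual hidden field by inspecting the page source with F12.

    import re
    import requests

    s = requests.Session()
    s.headers.update({'User-Agent': 'Mozilla/5.0 ...'})  # disguise as before

    # Step 1: fetch the login page and dig the hidden random code out of its source
    login_page = s.get('https://example.com/login').text
    post_key = re.search(r'name="post_key" value="(.+?)"', login_page).group(1)

    # Step 2: submit the code back together with the rest of the form
    data = {'username': 'me', 'password': 'secret', 'post_key': post_key}
    r = s.post('https://example.com/login', data=data)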

Supplement: the Selenium module emulates a browser; it is powerful but slow. Dynamic loading is actually meant to make things easier for users, who only load more content when they click, but it also makes life harder for crawlers, because a lot of information stays hidden until then.
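For the dynamic loading and Ajax cases, a minimal Selenium sketch might look like the following; the URL and the CSS selector are assumptions for illustration. The idea is simply to let a real browser execute the page's JavaScript, wait for the content to appear, and then read the rendered source.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()          # launches a real (and slow) browser
    driver.get('https://example.com/')   # placeholder URL

    # Wait up to 10 seconds for the dynamically loaded element (selector is hypothetical)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.item-list'))
    )

    html = driver.page_source            # the fully rendered page source
    driver.quit()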

Finally

It’s always a good idea to add headers to your code.

Anti-crawling mechanisms embedded in a page are very flexible: there is no fixed code pattern for breaking them, and each case takes time to analyze.

The methods and modules that are new in this article will be demonstrated with concrete examples later.

The next article moves on to the web-parsing topic proper, and after that you can start writing crawlers. (^∀^●)

Original article: segmentfault.com/a/119000001…