Sometimes we get a 403 error when crawling a web page, because the site has anti-crawler protections to prevent the malicious harvesting of its information. So is there nothing we can do about it? Of course not!
First, how to tell that the website may have identified you as a crawler
1. CAPTCHA pages
2. Unusual content delivery delays
3. Frequent HTTP error responses, such as 404, 301, or 50x:
(1) 301 Moved Permanently
(2) 401 Unauthorized
(3) 403 Forbidden
(4) 404 Not Found
(5) 408 Request Timeout
(6) 429 Too Many Requests
(7) 503 Service Unavailable (often an IP-level block)
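As a quick illustration, here is a minimal sketch of watching for these status codes while fetching; it uses only Python's standard library, and the URL is a placeholder:

```python
# Watch for status codes that often signal an anti-crawler block.
from urllib import request
from urllib.error import HTTPError

SUSPECT_CODES = {401, 403, 404, 408, 429, 503}

def fetch(url):
    try:
        with request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except HTTPError as e:
        if e.code in SUSPECT_CODES:
            print(f"Possible anti-crawler response: HTTP {e.code} {e.reason}")
        raise  # note: urlopen follows 301 redirects silently, so those won't raise here

fetch("https://example.com/page")  # placeholder URL
```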
Second, good ways to avoid anti-crawler detection
(1) Use a multi-host strategy (multiple client hosts or multiple server hosts);
(2) Crawl slowly and do not hammer the host. If you find yourself being blocked, reduce the access frequency immediately; a smarter design probes for the critical threshold of the site's rate limit (see the back-off sketch below);
1. Reduce the number of requests as much as possible, and don't fetch the detail pages if the list pages already contain what you need. 2. If performance really matters, consider multithreading (supported by mature frameworks such as Scrapy), or even a distributed crawl.
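A minimal sketch of this back-off idea, assuming a placeholder URL and illustrative thresholds:

```python
# Crawl slowly; when a block-like status appears, double the delay.
import random
import time
from urllib import request
from urllib.error import HTTPError

def polite_fetch(url, delay=1.0, max_delay=60.0, retries=5):
    for _ in range(retries):
        time.sleep(delay + random.random())  # jittered pause before each request
        try:
            with request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except HTTPError as e:
            if e.code in (403, 429, 503):  # looks like a block: slow down
                delay = min(delay * 2, max_delay)
            else:
                raise
    raise RuntimeError(f"Still blocked after {retries} attempts: {url}")

html = polite_fetch("https://example.com/list")  # placeholder URL
```

Doubling the delay on each block-like response is one simple way to feel out the site's rate-limit threshold without hard-coding it.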
(3) Disguise yourself by changing your IP address or using a proxy server. Many websites serve different regions from different servers (different IPs), so you can dynamically switch which IP you use to reach the same website and sidestep frequency or traffic limits.
Finding distributed servers: the "Home of the Webmaster" website can show you how a site's servers are distributed and which IPs they use, so you can access those IPs directly.
There are many such sites around the world. How do I find them? Search for "top-level domain" and copy down the results.
Dynamic IP switching technology:
1. Why use an IP proxy?
To keep your own IP from being blocked for accessing too frequently. Using 100 proxy IPs to fetch 100 pages creates the illusion that 100 different people each visited one page of the site, so your access naturally won't be throttled.
2. How do I prevent my IP address from being blocked?
- Set a delay: time.sleep(random.randint(1, 3));
- Use an IP proxy, so that pages are accessed from other IP addresses instead of your own.
3. How do I obtain proxy IP addresses? For example from www.xiongmaodaili.com/ (the Panda Proxy website). The urllib chain is:
- ProxyHandler —-> wraps the proxy address for the request
- opener —-> its open() takes the place of urlopen()
- install the opener —-> so every subsequent urlopen() goes through the proxy
4. How do I check whether the proxy worked? Fetch a test site such as 2019.ip138.com/ic.asp, which echoes back the IP it sees (see the sketch below).
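A minimal sketch of that ProxyHandler-to-install_opener chain, with a placeholder proxy address (substitute one from your provider):

```python
import urllib.request

# Build the handler, build an opener from it, and install it globally.
proxy = urllib.request.ProxyHandler({"http": "http://123.45.67.89:8080"})  # placeholder proxy
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # every urlopen() call now goes through the proxy

# Verify: the test page echoes back the IP it sees, which should be the proxy's.
with urllib.request.urlopen("http://2019.ip138.com/ic.asp", timeout=10) as resp:
    print(resp.read().decode("gbk", errors="replace"))  # GBK decoding is an assumption
```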
(4) Put the crawler on a subnet whose IPs frequently visit the target site, such as an education network;
(5) Frequently change your User-Agent (see the rotation sketch after the list of strings below);
Common emulated browser strings: 1. Android
- Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
- Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
- Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
2. Firefox
- Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
- Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
3. Google Chrome
- Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
- Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
4. iOS
- Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
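A minimal sketch of rotating the User-Agent per request, drawing on the strings above (the URL is a placeholder):

```python
import random
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0",
    "Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3",
]

req = urllib.request.Request(
    "https://example.com/page",  # placeholder URL
    headers={"User-Agent": random.choice(USER_AGENTS)},  # a different browser identity each run
)
with urllib.request.urlopen(req, timeout=10) as resp:
    html = resp.read()
```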
(6) Watch out for crawler traps, such as links tagged nofollow or hidden with display: none CSS, which only a bot would follow;
(7) If you batch-crawl by following fixed rules (patterns), combine and vary the rules so your access pattern is less predictable;
(8) If possible, crawl politely according to the behavior defined in robots.txt, as in the sketch below.
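A minimal sketch of checking robots.txt before fetching, with a placeholder site, path, and crawler name:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

# Only fetch pages the site's robots.txt allows for our user agent.
if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt; skipping")
```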