Sometimes we get a 403 error when crawling a web page, because the site has anti-crawler protections to prevent the malicious harvesting of its information. So is there nothing we can do about it? Of course not!
First, how to tell that the website may have identified you as a crawler
1. CAPTCHA pages
2. Unusual content delivery delays
3. Frequent HTTP error responses, such as 404, 301, or 50x:
(1) 301 Moved Permanently
(2) 401 Unauthorized
(3) 403 Forbidden
(4) 404 Not Found
(5) 408 Request Timeout
(6) 429 Too Many Requests
(7) 503 Service Unavailable (often an IP-level block)
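As a quick illustration, here is a minimal sketch of watching for these status codes while fetching; it uses only Python's standard library, and the URL is a placeholder:

```python
# Watch for status codes that often signal an anti-crawler block.
from urllib import request
from urllib.error import HTTPError

SUSPECT_CODES = {401, 403, 404, 408, 429, 503}

def fetch(url):
    try:
        with request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except HTTPError as e:
        if e.code in SUSPECT_CODES:
            print(f"Possible anti-crawler response: HTTP {e.code} {e.reason}")
        raise  # note: urlopen follows 301 redirects silently, so those won't raise here

fetch("https://example.com/page")  # placeholder URL
```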
Second, good ways to avoid anti-crawler detection
(1) Use a multi-host strategy (multiple client hosts or multiple server hosts);
(2) Crawl slowly and do not hammer the host. If you find yourself being blocked, reduce the access frequency immediately; a smarter design probes for the critical threshold of the site's rate limit (see the back-off sketch below);
1. Reduce the number of requests as much as possible, and don't fetch the detail pages if the list pages already contain what you need. 2. If performance really matters, consider multithreading (supported by mature frameworks such as Scrapy), or even a distributed crawl.
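A minimal sketch of this back-off idea, assuming a placeholder URL and illustrative thresholds:

```python
# Crawl slowly; when a block-like status appears, double the delay.
import random
import time
from urllib import request
from urllib.error import HTTPError

def polite_fetch(url, delay=1.0, max_delay=60.0, retries=5):
    for _ in range(retries):
        time.sleep(delay + random.random())  # jittered pause before each request
        try:
            with request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except HTTPError as e:
            if e.code in (403, 429, 503):  # looks like a block: slow down
                delay = min(delay * 2, max_delay)
            else:
                raise
    raise RuntimeError(f"Still blocked after {retries} attempts: {url}")

html = polite_fetch("https://example.com/list")  # placeholder URL
```

Doubling the delay on each block-like response is one simple way to feel out the site's rate-limit threshold without hard-coding it.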
(3) Disguise yourself by changing your IP address or using a proxy server. Many websites serve different regions from different servers (different IPs), so you can dynamically switch which IP you use to reach the same website and sidestep frequency or traffic limits.
Finding distributed servers: the "Home of the Webmaster" website can show you how a site's servers are distributed and which IPs they use, so you can access those IPs directly.
There are many such sites around the world. How do I find them? Search for "top-level domain" and copy down the results.
Dynamic IP switching technology:
1. Why use an IP proxy?
To keep your own IP from being blocked for accessing too frequently. Using 100 proxy IPs to fetch 100 pages creates the illusion that 100 different people each visited one page of the site, so your access naturally won't be throttled.
2. How do I prevent my IP address from being blocked?
- Set a delay: time.sleep(random.randint(1, 3));
- Use an IP proxy, so that pages are accessed from other IP addresses instead of your own.
3. How do I obtain proxy IP addresses? For example from www.xiongmaodaili.com/ (the Panda Proxy website). The urllib chain is:
- ProxyHandler —-> wraps the proxy address for the request
- opener —-> its open() takes the place of urlopen()
- install the opener —-> so every subsequent urlopen() goes through the proxy
4. How do I check whether the proxy worked? Fetch a test site such as 2019.ip138.com/ic.asp, which echoes back the IP it sees (see the sketch below).
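A minimal sketch of that ProxyHandler-to-install_opener chain, with a placeholder proxy address (substitute one from your provider):

```python
import urllib.request

# Build the handler, build an opener from it, and install it globally.
proxy = urllib.request.ProxyHandler({"http": "http://123.45.67.89:8080"})  # placeholder proxy
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # every urlopen() call now goes through the proxy

# Verify: the test page echoes back the IP it sees, which should be the proxy's.
with urllib.request.urlopen("http://2019.ip138.com/ic.asp", timeout=10) as resp:
    print(resp.read().decode("gbk", errors="replace"))  # GBK decoding is an assumption
```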
(4) Put the crawler on a subnet whose IPs frequently visit the target site, such as an education network;
(5) Frequently change your User-Agent (see the rotation sketch after the list of strings below);
Common emulated browser strings: 1. Android
- Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
- Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
- Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
2. Firefox
- Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
- Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
3. Google Chrome
- Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
- Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
4. iOS
- Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
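A minimal sketch of rotating the User-Agent per request, drawing on the strings above (the URL is a placeholder):

```python
import random
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0",
    "Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3",
]

req = urllib.request.Request(
    "https://example.com/page",  # placeholder URL
    headers={"User-Agent": random.choice(USER_AGENTS)},  # a different browser identity each run
)
with urllib.request.urlopen(req, timeout=10) as resp:
    html = resp.read()
```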
(6) Watch out for crawler traps, such as links tagged nofollow or hidden with display: none CSS, which only a bot would follow;
(7) If you batch-crawl by following fixed rules (patterns), combine and vary the rules so your access pattern is less predictable;
(8) If possible, crawl politely according to the behavior defined in robots.txt, as in the sketch below.
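A minimal sketch of checking robots.txt before fetching, with a placeholder site, path, and crawler name:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

# Only fetch pages the site's robots.txt allows for our user agent.
if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt; skipping")
```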