Common anti-crawler measures
Verification codes (CAPTCHAs)
The site requires the visitor to type in an image verification code, enter a code sent to a mobile phone, or click or drag a slider verification.
The solution
Use an image-recognition API to identify the code, call the API of a code-solving platform, or simulate the login with manual assistance.
IP restrictions
Websites temporarily block an IP address based on how frequently and how many times it crawls.
The solution
Build an IP proxy pool and pick a proxy from it at random for each request.
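The proxy-pool idea above can be sketched roughly as follows. This is a minimal illustration using only the standard library; the proxy addresses are hypothetical placeholders (documentation-range IPs), and the helper names `pick_proxy` and `build_opener_with_random_proxy` are made up for this example.

```python
import random
import urllib.request

# Hypothetical proxy addresses (documentation-range IPs), not real servers.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:3128",
]

def pick_proxy(pool):
    """Return a proxies mapping for one randomly chosen proxy."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}

def build_opener_with_random_proxy(pool):
    """Build a urllib opener that routes requests through a random proxy."""
    handler = urllib.request.ProxyHandler(pick_proxy(pool))
    return urllib.request.build_opener(handler)

# Usage (needs working proxies, so it is left commented out):
# opener = build_opener_with_random_proxy(PROXY_POOL)
# html = opener.open("https://example.com/", timeout=10).read()
```

Building a fresh opener per request means consecutive requests are likely to leave from different addresses. A real pool would probably also need to drop proxies that stop responding.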
User-Agent (UA) restrictions
Websites restrict access based on the browser identifier (the User-Agent header) sent with each request.
The solution
Build a UA pool, pick a random User-Agent string for each request, and wait a random interval between requests.
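A minimal sketch of a UA pool combined with a random wait might look like this. The User-Agent strings are illustrative samples rather than a curated list, and `random_headers` and `polite_sleep` are hypothetical helper names.

```python
import random
import time

# Sample User-Agent strings for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers(pool=USER_AGENTS):
    """Return request headers carrying a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(pool)}

def polite_sleep(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Sleep for a random duration in [min_s, max_s] and return it."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` between requests and passing `random_headers()` to each one varies both the timing and the browser identifier. The default bounds of 1 to 3 seconds are an arbitrary assumption.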
Cookie restrictions
The site validates requests against a random cookie that it generates each time a page is opened.
The solution
Construct the cookie yourself or, if generating it is too complex, use Selenium to simulate the login and obtain it.
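For the simple case of constructing the cookie by hand, a sketch could look like the following. The cookie names and values are invented for illustration, and `cookie_header` is a hypothetical helper. In practice the values would have to match whatever the target site actually expects.

```python
from http.cookies import SimpleCookie
import urllib.request

def cookie_header(values):
    """Serialize a name -> value mapping into a Cookie request header value."""
    jar = SimpleCookie()
    for name, value in values.items():
        jar[name] = value
    return "; ".join(f"{m.key}={m.coded_value}" for m in jar.values())

# Hypothetical cookie names and values.
header = cookie_header({"sessionid": "abc123", "token": "xyz"})
req = urllib.request.Request("https://example.com/", headers={"Cookie": header})
```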
Referer hotlinking prevention
Hotlink protection verifies the legitimacy of a request based on key information that the client carries in it. There are several kinds, such as Referer-based and timestamp-based hotlink protection. The Referer header tells the server which page the request was linked from, so a site can block crawlers by accepting only requests that carry an expected Referer.
The solution
Forge the Referer field and its value in the request headers.
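Setting the Referer header by hand can be sketched with the standard library as below. The URLs are placeholders, and `request_with_referer` is a hypothetical helper name.

```python
import urllib.request

def request_with_referer(url, referer):
    """Build a request that claims to have been linked from `referer`."""
    return urllib.request.Request(url, headers={"Referer": referer})

# Placeholder URLs: pretend the image was requested from the gallery page.
req = request_with_referer(
    "https://example.com/images/photo.jpg",
    "https://example.com/gallery",
)
# data = urllib.request.urlopen(req, timeout=10).read()
```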
HTML/JS/CSS obfuscation
The site obfuscates its HTML/JS/CSS code, for example by adding random useless code, to make parsing harder.
The solution
Use tools to clean up the code, then parse it.
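The text does not say which cleanup tools to use. As one possible illustration, the sketch below uses the standard library's `html.parser` to drop script and style blocks and comments, keeping only the text a reader would see. The class and function names are hypothetical, and heavier obfuscation would need more than this.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def clean_text(html):
    """Return the visible text of an HTML fragment, joined by spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Comments are ignored automatically because `HTMLParser` routes them to `handle_comment`, which this class leaves as a no-op.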
Ajax dynamic loading
After the browser loads a page's source from its URL, the JavaScript programs in it execute in the browser. These programs load more content and insert it into the page. A crawler cannot obtain this content if it has no JS engine, if it has a JS engine but no way to handle what the scripts return, or if it has a JS engine but cannot make the site see that scripting is enabled.
The solution
Analyze the page's Ajax requests, issue them directly, and extract the data they return.
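Calling an Ajax endpoint directly might be sketched as follows. The endpoint URL, the query parameters, and the JSON layout (an `items` list whose entries have a `title`) are all assumptions made for illustration. The real values would come from inspecting the page's network traffic.

```python
import json
import urllib.parse
import urllib.request

def build_ajax_url(base, params):
    """Append query parameters to the endpoint URL."""
    return base + "?" + urllib.parse.urlencode(params)

def parse_items(payload):
    """Extract titles from a JSON payload shaped like {"items": [{"title": ...}]}."""
    data = json.loads(payload)
    return [item["title"] for item in data["items"]]

def fetch_items(base, params):
    """Request the endpoint the way the page's script would, then parse the JSON."""
    req = urllib.request.Request(
        build_ajax_url(base, params),
        # Header commonly sent by Ajax libraries; some sites check for it.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_items(resp.read().decode("utf-8"))

# Usage with a hypothetical endpoint:
# titles = fetch_items("https://example.com/api/list", {"page": 1, "size": 20})
```

Because the endpoint returns structured data, this route avoids running a JS engine at all, as long as the request can be reproduced outside the browser.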