This chapter gives an overview of common anti-crawler techniques and the countermeasures against them. It will not go deep into any specific technique; treat it as a primer and a map of directions for further study. You are also welcome to share your own crawling experience in the discussion area.

"Anti-crawler" is quite literal: preventing crawlers from scraping a site's content. Crawler technology has undoubtedly brought us a lot of convenience, but for a website, having its content scraped by others brings no benefit and can even consume a great deal of traffic, which is intolerable for site owners. Hence "anti-crawler" technology.
Why anti-crawler?
For site owners, crawlers are the most "discussed" nuisance, mainly because having a site's public, free information scraped in bulk erodes its competitive advantage and therefore its profit. As things currently stand in China, no laws or regulations clearly define whether scraping a website's information is illegal. In other words, crawling sits in a legal gray area: a lawsuit over it may succeed, or it may go nowhere. Anti-crawler technology developed in response: it is the site owner's last line of defense against crawlers scraping the site at will, and the final safeguard of the profit behind it.
The purpose of anti-crawler technology
Beyond the commercial interests mentioned above, anti-crawler technology also serves the following purposes:
1. Protecting normal access to the website

This one is common with beginners (ourselves included). As novices, we are often so excited about successfully crawling a page that we never consider the pressure we put on the web server. If we do not disconnect after the crawl finishes, the server assumes we still need its services and keeps responding, which inevitably slows down its handling of normal requests. (A "polite crawler" sketch follows this list.)

2. Data protection

Common on professional websites. A professional site usually stores data with its own characteristics, or data it has processed itself. Even though the data is public, the site owner is still unwilling to let others crawl it.

3. Preventing commercial competition

Common on commercial websites. Take the simplest example: around shopping festivals such as "Double 11", some e-commerce companies advertise slogans like "lowest price on the whole network, ten times the difference refunded if not". If one company can crawl a competitor's sales information and adjust its strategy accordingly, it will undoubtedly hit that rival hard.
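To make the first point concrete, here is a minimal sketch of a "polite" crawler, assuming placeholder URLs: it closes its connections when done and pauses between requests so the server is not kept busy serving us.

```python
import time

import requests

# Placeholder URL list; in practice these would come from the target site.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

# The context manager closes the underlying connections when the block exits,
# so the server is not left holding resources for us.
with requests.Session() as session:
    for url in urls:
        resp = session.get(url, timeout=10)
        print(url, resp.status_code)
        time.sleep(1)  # pause between requests to ease the load on the server
```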
Common anti-crawler techniques and their countermeasures
1. Limiting IP addresses

When a website notices that requests from one IP address (or one IP segment) arrive unusually fast, it concludes that the IP belongs to a crawler, restricts it, and refuses to send it data.

Solution: set a browser-like User-Agent to simulate a normal visit, and crawl through proxy IP addresses. In an earlier article we already set the User-Agent to make the server believe we were an ordinary browser.

2. Requiring login for important data

When a site finds its servers under heavy pressure even though every request carries a normal User-Agent, it often puts its important data behind a login wall.

Solution: register an account first, then crawl with that account's Cookie. (A combined sketch of User-Agent, proxy, and Cookie handling follows right after this list item.)

3. Verification codes (captchas) at login

The site requires a correct captcha on the login page before anything can be viewed.

Solution: train a machine-learning model to recognize the captcha, or hand it off to a captcha-solving platform or manual entry. (Difficult)
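Putting points 1 and 2 together, here is a minimal sketch using the requests library; the URLs, proxy address, and login credentials are all placeholders.

```python
import requests

session = requests.Session()

# 1. Set a browser-like User-Agent so the server treats us as a normal visitor.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36"
})

# 2. Route traffic through a proxy, so per-IP rate limits hit the proxy address
#    instead of our own machine (the address below is a placeholder).
proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}

# 3. Log in once with a registered account; the Session stores the returned
#    Cookie and sends it automatically on every later request.
session.post("https://example.com/login",
             data={"username": "user", "password": "pass"},
             proxies=proxies)

resp = session.get("https://example.com/protected-data", proxies=proxies)
print(resp.status_code, len(resp.text))
```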
4. Loading data asynchronously

Not all of a site's data is loaded up front; the rest is fetched only when the user scrolls toward the bottom of the page.

Solution: as we learned earlier, Ajax techniques can be tricky to handle. We have to issue requests repeatedly, work out the pattern, and then call the URLs that the Ajax code requests directly (see the first sketch below).

5. Slider verification

A popular verification method at present: access is granted only when the user drags part of an image to the matching area with the mouse. (Figure: example of a slider captcha.)

Solution: most current approaches first process the image to locate the target position, then simulate a human-like drag (see the Selenium sketch below). (Difficult)

6. Sending false data

When the site recognizes that a request comes from a crawler, it does not reject the connection; instead it answers with wrong information or data. This tricks the crawler into collecting bogus data and, at the same time, reduces the request pressure on the server.

Solution: spot-check 50% or more of the collected data, or crawl it again.

7. Behavior recognition

Large websites such as Alibaba now judge every action a user takes on the site. When the user's actions, characterized by pauses and mouse dwell time, deviate from the patterns seen during normal access, the system labels the visitor a crawler and rejects its requests.

Solution: fully simulate a browser with Selenium (PhantomJS used to be a common companion); the Selenium sketch below also illustrates this.

8. Rendering important data as images

Before transmission, the site encodes data it considers important into pictures and displays it only as images.

Solution: use OCR to recognize the data inside the image (see the last sketch below). (Difficult)
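For point 4, here is a minimal sketch of calling an Ajax endpoint directly; the JSON API URL, the parameter names, and the response shape are assumptions standing in for what you would discover in the browser's Network panel.

```python
import requests

# Hypothetical JSON endpoint found by watching the page's XHR requests.
api = "https://example.com/api/items"
headers = {"User-Agent": "Mozilla/5.0",
           "X-Requested-With": "XMLHttpRequest"}  # mimic the page's own Ajax calls

page = 1
while True:
    resp = requests.get(api, params={"page": page, "size": 20}, headers=headers)
    items = resp.json().get("data", [])
    if not items:      # empty page: the "infinite scroll" is exhausted
        break
    for item in items:
        print(item)
    page += 1
```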
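For points 5 and 7, here is a sketch of a human-like slider drag with Selenium; the page URL and CSS selector are placeholders, and the uneven step sizes and random pauses merely imitate a hand movement rather than implementing any particular captcha's real solution.

```python
import random
import time

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # PhantomJS is deprecated; (headless) Chrome works today
driver.get("https://example.com/slider-check")   # placeholder URL

slider = driver.find_element(By.CSS_SELECTOR, ".slider-button")  # placeholder selector
ActionChains(driver).click_and_hold(slider).perform()

# Move in small, uneven steps with short pauses: a constant speed and zero
# dwell time are exactly the giveaways that behavior recognition looks for.
for step in [40, 30, 15, 8, 5, 2]:
    ActionChains(driver).move_by_offset(step + random.randint(-2, 2),
                                        random.randint(-1, 1)).perform()
    time.sleep(random.uniform(0.05, 0.2))

ActionChains(driver).release().perform()
driver.quit()
```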
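For point 8, a minimal OCR sketch using the pytesseract wrapper (which requires the Tesseract OCR engine to be installed locally); "data.png" is a placeholder for an image the site uses to display text data.

```python
from PIL import Image
import pytesseract  # pip install pytesseract; also install the Tesseract engine

# Run OCR over the downloaded image and print whatever text it recognizes.
text = pytesseract.image_to_string(Image.open("data.png"), lang="eng")
print(text)
```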
Conclusion
Crawling and anti-crawling are "adversarial" technologies. As the saying goes, "while the priest climbs a post, the devil climbs ten": the emergence of a new anti-crawler technique is usually followed by the emergence of a counter-technique against it. As crawler engineers, we should realize that anti-crawler technology has gradually evolved from the early, common IP restrictions to today's verification-based methods; indeed, current crawlers can ignore IP restrictions entirely through IP proxy techniques. For action-based checks such as the slider or "click the matching characters", however, we have no particularly good method yet. We can only simulate human behavior bit by bit and guess at the judging criteria, and those criteria can change at any time when the provider upgrades them. As beginners, the most important point to remember about anti-crawler strategy is: understand what anti-crawler measures the site uses. Analyzing a site's anti-crawler measures is a process of continuous testing and continuous analysis. Once we have figured out the anti-crawler mechanism, we are more than halfway there.