Automated data harvesting on the Internet has been around almost as long as the Internet itself. Today the practice is usually called web scraping (or web data collection), and the programs that do it are often referred to as bots. The most common approach is to write an automated program that requests data from a web server (usually HTML pages or other web files) and then parses that data to extract the desired information.
Few things are more frustrating when scraping web sites than seeing the data display in the browser yet being unable to retrieve it programmatically. Perhaps you submit a form that you believe is well-formed and the server rejects it, or your IP address gets blocked by the website for no apparent reason and you can no longer access it. That happens because many sites have anti-crawler mechanisms that try to determine whether a visitor is a person or a machine. This article is about how to disguise our crawler as a human being.
For simple static HTML, you can use Python's scraping framework Scrapy, or the simpler urllib2 (urllib.request in Python 3) to fetch pages and BeautifulSoup to parse them.
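As a minimal sketch of that workflow, the snippet below fetches a page with Python 3's urllib.request (the successor to urllib2) and parses it with BeautifulSoup. The URL and the CSS class are placeholders for illustration, not a real target site.

```python
# Minimal static-page scraping sketch (Python 3: urllib.request replaces urllib2).
# The URL and the "article-title" class are illustrative placeholders.
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://example.com/articles").read()
soup = BeautifulSoup(html, "html.parser")

# Extract the text of every article title on the page.
for title in soup.find_all("h2", class_="article-title"):
    print(title.get_text(strip=True))
```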
Not every page is purely static, though; some content is loaded dynamically. If you scrape the static HTML, parse it, and find that part of the page is missing, that part was probably loaded dynamically with JavaScript. In addition, to protect their data, websites often add deliberately awkward requirements designed to stop crawlers and let only real people through. Here are a few tips to make your crawler look more human.
1. Construct an appropriate request header containing realistic browser data, in particular a modified User-Agent, using Python's Requests package (see the first sketch below).
2. What if the site requires you to log in? Manage cookies so that you stay logged in to the same site, using Selenium's delete_cookie(), add_cookie(), and delete_all_cookies() methods together with a headless browser such as PhantomJS (see the second sketch below).
3. For dynamically loaded content, Selenium combined with PhantomJS can simulate how a human operates a web page and execute the page's JavaScript, solving the problem that dynamic pages cannot be scraped directly.
4. If the site detects your crawler and blocks your IP address, switch to a different IP address, for example by anonymizing your traffic with Tor (see the third sketch below).
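A minimal sketch of tip 1, assuming the Requests package is installed; the URL and the header values are placeholders, not requirements:

```python
# Tip 1: send a browser-like User-Agent (and other common headers) with Requests.
# The URL and header values are illustrative placeholders.
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("http://example.com/data", headers=headers)
print(response.status_code)
```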
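Tips 2 and 3 in one sketch. Note that recent Selenium releases have dropped PhantomJS support, so headless Chrome stands in here for the same idea; the URLs and cookie values are placeholders:

```python
# Tips 2 and 3: let a real (headless) browser execute the page's JavaScript,
# then manage cookies so the session stays logged in.
# Headless Chrome stands in for PhantomJS, which newer Selenium versions no longer support.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("http://example.com/login")          # placeholder URL

# Cookies can only be set for the domain currently loaded.
driver.add_cookie({"name": "sessionid", "value": "placeholder-value"})
driver.get("http://example.com/members")        # revisit as a "logged-in" user

print(driver.page_source[:200])                 # JS-rendered HTML is now available

driver.delete_cookie("sessionid")               # remove one cookie
driver.delete_all_cookies()                     # or clear them all
driver.quit()
```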
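For tip 4, one common pattern is to route Requests traffic through a locally running Tor SOCKS proxy (port 9050 by default). This sketch assumes Tor is running on the local machine and the requests[socks] extra (PySocks) is installed; the URL is a placeholder:

```python
# Tip 4: route traffic through a locally running Tor SOCKS proxy so the
# target site sees a Tor exit node's IP address instead of yours.
# Assumes Tor is listening on 127.0.0.1:9050 and requests[socks] is installed.
import requests

proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# Placeholder URL; any service that echoes your IP works for a quick check.
response = requests.get("http://example.com/ip", proxies=proxies)
print(response.text)
```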
These are just a few examples; there are many simple ways to make your web bot look more like a human visitor. I'll cover more next time.