For various reasons, I recently got away from routine work and finally had time to do a simple review of the crawler knowledge I had accumulated, which made me realize how necessary it is to periodically organize what one has learned.
Common third-party libraries
For crawler beginners, it is recommended that, after understanding how a crawler works, you implement a simple crawler yourself with these common third-party libraries, without using any crawler framework, to deepen your understanding.
urllib and Requests are both Python HTTP libraries. urllib (urllib2 in Python 2) provides comprehensive functionality at a considerable cost in complexity, while Requests is more minimalist and covers the simple use cases completely. For the pros, cons, and differences between urllib and Requests, search the web.
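As a minimal sketch, assuming a placeholder URL and a throwaway User-Agent header, a page fetch with Requests looks roughly like this:

```python
import requests

# Hypothetical target URL; replace with the page you actually want to crawl.
url = "https://example.com/page"
headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a normal browser

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()   # raise an error on 4xx/5xx responses
html = response.text          # decoded page body
print(response.status_code, len(html))
```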
BeautifulSoup and lxml are both Python page-parsing libraries. BeautifulSoup is DOM-based: it loads the entire document and parses the whole DOM tree, so its time and memory costs are much higher. lxml only does partial traversal and can locate tags quickly with XPath. BeautifulSoup (bs4) is written in Python while lxml is implemented in C, which makes lxml faster than bs4.
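For illustration, here is a rough comparison of the two on a small inline HTML snippet (the markup and class name are made up for the example):

```python
from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><div class='item'><a href='/a'>first</a></div></body></html>"

# BeautifulSoup: builds the whole tree, convenient find/search API.
soup = BeautifulSoup(html, "html.parser")
links_bs4 = [a["href"] for a in soup.find_all("a")]

# lxml: an XPath expression goes straight to the nodes we want.
tree = etree.HTML(html)
links_lxml = tree.xpath("//div[@class='item']/a/@href")

print(links_bs4, links_lxml)  # ['/a'] ['/a']
```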
Crawler frameworks
Python’s most common crawler frameworks are Scrapy and PySpider.
For details on how to use the framework, please refer to the official documentation.
Dynamic page rendering
1. URL request analysis
(1) Carefully analyze the page structure and see which actions trigger JavaScript responses;
(2) Use the browser’s developer tools to find the request URL sent by the JS click action;
(3) Fetch that URL as Scrapy’s start_urls, or yield a Request for it (see the sketch below).
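As a minimal Scrapy sketch, assuming a hypothetical JSON endpoint and field names discovered in the browser’s Network panel, the refetched URL can drive the spider directly:

```python
import json

import scrapy


class ApiSpider(scrapy.Spider):
    """Hypothetical spider that hits the JSON endpoint found via the browser's dev tools."""
    name = "api_spider"
    # The URL the JS click actually requests, found in the Network panel (placeholder).
    start_urls = ["https://example.com/api/items?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get("items", []):
            yield {"title": item.get("title")}
        # Follow the next page if the response advertises one (hypothetical field).
        next_url = data.get("next")
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```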
2. selenium
Selenium is a Web automation testing tool. It was originally developed for automated testing of websites, and is conceptually like the “Button Wizard” macro tools people use to play games. It supports all major browsers, including headless browsers such as PhantomJS.
Based on our instructions, Selenium can have the browser automatically load a page, fetch the data we need, take a screenshot of a page, or check whether certain actions on a site have taken place.
Selenium does not ship with a browser of its own and has no browser functionality itself; it must be used together with a third-party browser.
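A minimal sketch of driving a real browser with Selenium, assuming Chrome and its driver are installed locally and using a placeholder URL:

```python
from selenium import webdriver

# Assumes a local Chrome + chromedriver install; any Selenium-supported browser works.
driver = webdriver.Chrome()
driver.get("https://example.com")    # let the browser load and render the page
html = driver.page_source            # HTML after JavaScript has executed
driver.save_screenshot("page.png")   # optional screenshot of the rendered page
driver.quit()
```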
3. phantomjs
When Selenium drives a regular browser to fetch pages, opening the browser and rendering the page is not efficient enough for large-scale data collection. In that case we can choose PhantomJS.
PhantomJS is a WebKit-based “headless” browser that loads websites into memory and executes JavaScript on the page, and because it doesn’t display a graphical interface, it runs more efficiently than a full browser.
If we combine Selenium with PhantomJS, we get a very powerful crawler that can handle JavaScript, cookies, headers, and anything else a real user would do.
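A rough sketch of the same idea with PhantomJS, assuming the phantomjs binary is on PATH and an older Selenium release that still ships the (now deprecated) PhantomJS driver; the URL is a placeholder:

```python
from selenium import webdriver

# Requires the phantomjs binary on PATH and an older Selenium release
# (the PhantomJS driver was deprecated and later removed from Selenium).
driver = webdriver.PhantomJS()
driver.get("https://example.com/js-heavy-page")  # placeholder URL
print(driver.page_source[:200])                  # HTML after the page's JavaScript ran
driver.quit()
```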
4. splash
Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API, implemented in Python on top of Twisted and QT, which make the service asynchronous so it can take advantage of WebKit’s concurrency.
The Python library that connects Scrapy to Splash is scrapy-splash. scrapy-splash uses Splash’s HTTP API, so a running Splash instance is required; Splash is usually run with Docker, so Docker needs to be installed.
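A minimal scrapy-splash sketch, assuming a Splash instance started with Docker on the default port 8050 and the scrapy_splash middlewares enabled in settings.py as described in the scrapy-splash README; the spider name and target URL are placeholders:

```python
# Assumes settings.py contains SPLASH_URL = "http://localhost:8050" and the
# scrapy_splash downloader/spider middlewares enabled, with Splash started via:
#   docker run -p 8050:8050 scrapinghub/splash
import scrapy
from scrapy_splash import SplashRequest


class RenderedSpider(scrapy.Spider):
    name = "rendered"  # hypothetical spider name

    def start_requests(self):
        # 'wait' gives the page's JavaScript time to run before Splash returns the HTML.
        yield SplashRequest("https://example.com", self.parse, args={"wait": 0.5})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```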
5. spynner
Spynner is a QtWebKit client that emulates a browser to load pages, raise events, fill in forms, and more.
Crawler anti-blocking strategies
1. Modify the user-agent
Spoofing the User-Agent is one of the most common ways to disguise a crawler as a browser.
The User-Agent is a string containing browser and operating-system information, sent as part of the HTTP request headers. The server uses it to decide whether the current client is a browser, a mail client, or a web crawler. The User-Agent can be viewed in request.headers; a previous article discussed how to analyze a packet and inspect its User-Agent information.
You can set up a User-Agent pool (a list, array, dictionary, etc.) that stores multiple “browsers” and randomly pick one User-Agent for each request. That way the User-Agent keeps changing, which helps prevent being blocked.
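A minimal sketch of a User-Agent pool with Requests; the pool entries and the target URL are placeholders:

```python
import random

import requests

# Small hypothetical User-Agent pool; in practice keep a longer, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def fetch(url):
    # Pick a different "browser" for every request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)


resp = fetch("https://example.com")  # placeholder URL
print(resp.request.headers["User-Agent"])
```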
2. Disable cookies
Cookies are data, sometimes encrypted, stored on the user’s device. Some websites identify users through cookies: if one visitor keeps sending requests at a high frequency, the site is likely to notice and suspect that it is a crawler.
By disabling cookies, the client actively prevents the server from writing them. Disabling cookies keeps sites that use cookies to detect crawlers from recognizing and banning us.
You can set COOKIES_ENABLED = False in your Scrapy settings to disable the cookies middleware so that no cookies are sent to the web server.
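In Scrapy this is a one-line change in settings.py (note the setting is spelled COOKIES_ENABLED):

```python
# settings.py (Scrapy): disable the cookies middleware so no cookies are
# stored or sent back to the web server.
COOKIES_ENABLED = False
```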
3. Set the request interval
A crawler can place a heavy load on a server in a short time, so the download wait time must be kept within a sensible range: if the wait is too long, large-scale crawling cannot be completed quickly; if it is too short, access may be denied.
Setting a reasonable request interval both keeps the crawler efficient and avoids putting a heavy burden on the other side’s server.
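A minimal sketch with Requests, using a randomized pause between placeholder URLs:

```python
import random
import time

import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 4)]  # placeholder URLs

for url in urls:
    requests.get(url, timeout=10)
    # Pause 1-3 seconds at random so the request pattern looks less mechanical.
    time.sleep(random.uniform(1, 3))
```

In Scrapy the equivalent is DOWNLOAD_DELAY in settings.py, which is randomized by default via RANDOMIZE_DOWNLOAD_DELAY.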
4. Proxy IP address pool
In fact, sites such as Weibo block by IP address rather than by account. That means simulated login does not help much when a large amount of data has to be fetched continuously: as long as the requests come from the same IP, switching accounts changes nothing, and the key is to change the IP address.
One strategy web servers use against crawlers is to block an IP address, or an entire IP range, outright. When an IP is blocked, you can switch to another IP address and continue. Methods: proxy IPs, or a local IP database (an IP pool).
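A minimal sketch of rotating proxies with Requests; the proxy addresses and the target URL are placeholders:

```python
import random

import requests

# Hypothetical proxy pool; in practice these come from a proxy provider
# or a locally maintained IP database.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:8080",
]


def fetch_via_proxy(url):
    proxy = random.choice(PROXY_POOL)
    # Route both HTTP and HTTPS traffic through the chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)


resp = fetch_via_proxy("https://example.com")  # placeholder URL
print(resp.status_code)
```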
5. Use Selenium
Simulating human visits to a website with Selenium is an effective way to avoid being banned. However, Selenium is inefficient and therefore not suitable for large-scale data collection.
6. Crack the verification code
Captchas are now the most common means of blocking crawlers. Capable readers can write their own algorithms to crack them, but in general it is easier to spend a little money on a third-party captcha-solving platform’s API.
Conclusion
The above is a brief overview of Python crawlers; for specific techniques you will need to look into the details yourself. I hope it is of some help to those learning to write crawlers.