There are many programming environments for building crawlers, such as Java, Python, and C++, but many people choose Python. Why? Because Python is well suited to crawling: its rich third-party libraries are powerful enough to do what you want in a few lines of code. More importantly, Python also excels at data mining and analysis. So, which frameworks are good choices for a Python crawler?
In general, a Python crawler framework is only needed for larger projects, where its main benefit is easier management and extension. In this article I recommend ten Python crawler frameworks.
1. Scrapy: Scrapy is an application framework designed to crawl websites and extract structured data. It can be used in a range of applications including data mining, information processing, and archiving historical data. It is a very powerful crawler framework and handles straightforward crawls well, such as cases where the URL pattern is known in advance; it makes scraping data such as Amazon product listings easy. For slightly more complex pages, however, such as Weibo pages, it falls short. Features include built-in support for selecting and extracting data from HTML and XML sources, and a set of reusable components (Item Loaders) shared between spiders, with built-in support for intelligently processing scraped data.
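To make this concrete, here is a minimal Scrapy spider sketch; the quotes.toscrape.com URL and the CSS selectors are illustrative assumptions, not from the original article.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Crawl a site with a known URL pattern and yield structured items."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # placeholder site

    def parse(self, response):
        # Extract structured data with Scrapy's built-in CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until the site runs out of pages
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this can be run with `scrapy runspider quotes_spider.py -o quotes.json` to dump the items as JSON.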
2. Crawley: Crawley crawls the content of a website at high speed. It supports relational and non-relational databases, and data can be exported as JSON, XML, and other formats.
3. Portia: An open-source visual crawler that lets users crawl websites without any programming knowledge! Simply annotate the pages you are interested in, and Portia will create a spider to extract data from similar pages. In short: it is built on the Scrapy kernel, crawls content visually without requiring development expertise, and dynamically matches pages that share the same template.
4. newspaper: Can be used to extract news and articles and perform content analysis. It uses multithreading and supports more than 10 languages, with all output Unicode-encoded. The author took inspiration from the simplicity and power of the Requests library; newspaper is a Python program for extracting article content.
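A brief sketch of typical usage, assuming the Python 3 fork (installed as newspaper3k); the URL is a placeholder:

```python
from newspaper import Article  # pip install newspaper3k (Python 3 fork)

# Placeholder URL; substitute any real news article
url = "https://example.com/some-news-article"

article = Article(url, language="en")
article.download()  # fetch the raw HTML
article.parse()     # extract title, authors, body text, top image

print(article.title)
print(article.authors)
print(article.text[:200])
print(article.top_image)
```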
5. Python-goose: A Python rewrite of the Goose article extraction tool, originally written in Java. The python-goose framework can extract: the main body of an article, the article's main image, any YouTube/Vimeo videos embedded in the article, the meta description, and meta tags.
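A minimal extraction sketch, assuming the goose3 fork for Python 3 (the original package imports as `from goose import Goose`); the URL is a placeholder:

```python
from goose3 import Goose  # pip install goose3

g = Goose()
# Placeholder URL; substitute a real article page
article = g.extract(url="https://example.com/some-article")

print(article.title)
print(article.meta_description)
print(article.cleaned_text[:200])  # main body of the article
if article.top_image:
    print(article.top_image.src)   # main image of the article
```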
6. Beautiful Soup: Famous for covering a number of common crawler needs. It is a Python library that extracts data from HTML or XML files, letting you navigate, search, and modify the parse tree with your favorite parser. Beautiful Soup can save you hours or even days of work. Its downside is that it cannot execute JavaScript.
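A minimal sketch of the typical requests + Beautiful Soup pairing; example.com and the extracted tags are placeholders:

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Placeholder URL; substitute the page you want to parse
html = requests.get("https://example.com").text

soup = BeautifulSoup(html, "html.parser")  # or "lxml" for speed

print(soup.title.string)          # text of the <title> tag
for link in soup.find_all("a"):   # search the parse tree
    print(link.get("href"))
```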
7. Mechanize: A stateful, programmatic browser library whose advantage is simulating browser behaviour such as cookies, form handling, and redirects; note that, like Beautiful Soup, it does not execute JavaScript. Of course, there are drawbacks, such as a serious lack of documentation; however, by studying the official examples and experimenting by hand, it can still be made to work.
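A minimal form-submission sketch; the login URL and field names are hypothetical and must match the target site's actual form:

```python
import mechanize  # pip install mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt (use responsibly)
br.addheaders = [("User-Agent", "Mozilla/5.0")]

# Hypothetical URL and form fields; adapt to the real page
br.open("https://example.com/login")
br.select_form(nr=0)      # pick the first form on the page
br["username"] = "user"   # field names must match the form's inputs
br["password"] = "secret"
response = br.submit()

print(response.geturl())  # where the submission landed
```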
8. Selenium: A driver that calls the browser. Through this library you can drive a real browser directly to perform actions such as entering a verification code. Selenium is an automated testing tool that supports a variety of browsers, including Chrome, Safari, Firefox, and other mainstream browsers; paired with the matching browser driver (e.g. ChromeDriver or geckodriver), it makes web interface testing straightforward. Selenium offers bindings for multiple languages such as Java, C#, Ruby, and Python. In a common crawling setup, PhantomJS renders and parses the JavaScript, Selenium drives it, and Python handles the post-processing (PhantomJS has since been deprecated in favor of headless Chrome and Firefox).
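A minimal sketch driving headless Chrome (assumes a matching ChromeDriver is installed; the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run without opening a window

# Assumes ChromeDriver is on PATH
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    print(driver.title)
    # JavaScript has already executed, so dynamic content is visible
    for link in driver.find_elements(By.TAG_NAME, "a"):
        print(link.get_attribute("href"))
finally:
    driver.quit()
```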
9. Cola: A distributed crawler framework. Users write a few specific functions without having to worry about the details of distributed execution; tasks are automatically distributed across multiple machines, and the whole process is transparent to the user. The project's overall design is somewhat weak, with high coupling between modules.
10. PySpider: A powerful web crawler system with a powerful WebUI, developed by a Chinese author. Written in Python, it has a distributed architecture, supports a variety of database backends, and its WebUI includes a script editor, task monitor, project manager, and result viewer. Crawls are controlled by Python scripts, and you can use any HTML-parsing package you like.
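A minimal handler sketch, following pyspider's quickstart pattern; the seed URL is a placeholder:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)         # re-run the seed once a day
    def on_start(self):
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # treat pages as fresh for ten days
    def index_page(self, response):
        # response.doc is a PyQuery object, but any HTML parser could be used
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc("title").text(),
        }
```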
Those are the ten frameworks that Python crawlers commonly use. Each has its own advantages and disadvantages, so choose the framework that fits your specific scenario. If you are interested in Python, you are welcome to join us and get free learning materials and source code.