Python Data Science (Zhihu column) · Data Analyst
If you want to work with big data, you need data first. Here are 33 open source crawlers to help you get it.
A crawler, i.e. a web crawler, is a program that automatically fetches web content. It is an essential component of search engines, which is why search engine optimization is, to a large extent, optimization for crawlers.
A web crawler automatically extracts web pages, downloading them from the World Wide Web on behalf of the search engine it serves. A traditional crawler starts from the URLs of one or more seed pages, extracts the URLs found on those pages, and, as it crawls, keeps pulling new URLs from the current page into a queue until some stop condition is met. The workflow of a focused crawler is more involved: it filters out links irrelevant to its topic using a page-analysis algorithm, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition is reached. In addition, every page the crawler fetches is stored by the system, analyzed, filtered, and indexed so that it can be queried and retrieved later; for a focused crawler, the results of this analysis can also feed back into and guide subsequent crawling.
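The crawling loop described above (seed URLs, a queue of URLs waiting to be fetched, new links extracted from every downloaded page, a stop condition) can be illustrated with a minimal Python sketch. The seed URL and page limit are placeholders, and a real crawler would additionally respect robots.txt, throttle requests, and deduplicate URLs more carefully.

```python
# Minimal sketch of the crawling loop described above, standard library only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=50):
    queue = deque([seed])   # URLs waiting to be fetched
    visited = set()         # URLs already fetched (deduplication)
    while queue and len(visited) < max_pages:   # stop condition
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            body = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue            # skip pages that fail to download
        parser = LinkParser()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)       # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                queue.append(absolute)          # new URLs go into the queue
    return visited


if __name__ == "__main__":
    pages = crawl("https://example.com/")       # placeholder seed URL
    print(f"Fetched {len(pages)} pages")
```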
There are hundreds of crawlers out there. This article rounds up the well-known and commonly used open source crawlers and groups them by implementation language. Search engines need crawlers, but what follows covers just the crawlers, not full, complex search engines, because many people simply want to scrape data rather than run a search engine.
Java crawlers
1.Arachnid
Arachnid is a Java-based web spider framework. It includes a simple HTML parser that analyzes input streams containing HTML content. By implementing a subclass of Arachnid you can develop a simple web spider, adding just a few lines of code that are called after each page on a website is parsed. The Arachnid download package includes two spider application examples that demonstrate how to use the framework.
Features: a tiny crawler framework containing a small HTML parser
License: GPL
2.crawlzilla
Crawlzilla is free software that lets you easily build a search engine, so you no longer need to rely on a commercial company's search engine or worry about indexing your own site's data yourself.
It uses the Nutch project as its core and integrates additional related toolkits, with an installation and management UI designed to make it easier to use.
Besides crawling basic HTML, Crawlzilla can also analyze files on web pages in a variety of formats (doc, PDF, PPT, ooo, RSS, etc.), so your search engine is not just a web-page search engine but a complete index of the site's data.
It has Chinese word segmentation capability, which makes searches more accurate.
Crawlzilla's goal is to provide users with a convenient, easy-to-install search platform.
License: Apache License 2.0 Development language: Java, JavaScript, shell Operating system: Linux
- Project homepage: github.com/shunfa/cra….
- Download address: sourceforge.net/projec…
Features: Easy to install, with Chinese word segmentation function
3.Ex-Crawler
Ex-Crawler is a web crawler developed in Java. The project is split into two parts: a daemon and a flexible, configurable crawler. It uses a database to store web page information.
License: GPLv3 Development language: Java Operating system: cross-platform
Features: Executed by a daemon, using a database to store web page information
4.Heritrix
Heritrix is an open source web crawler developed in Java that users can use to crawl the resources they want from the web. Its most outstanding quality is its excellent extensibility, which makes it convenient for users to implement their own crawling logic.
Heritrix adopts a modular design: the modules are coordinated by a controller class (CrawlController), which is the core of the whole system.
Code hosting: github.com/internetar….
License: Apache Development language: Java Operating system: cross-platform Features: strictly follows the exclusion directives of robots.txt files and META robots tags
5.heyDr
heyDr is a lightweight, Java-based open source multi-threaded vertical-search crawler framework released under the GNU GPL v3.
Users can build their own vertical resource crawlers with heyDr to prepare data for the early stage of building a vertical search engine.
License: GPLv3 Development language: Java Operating system: cross-platform
Features: lightweight open source multi-threaded vertical search crawler framework
6.ItSucks
ItSucks is an open source Java web spider project. Download rules can be defined with download templates and regular expressions. It provides a Swing GUI.
Features: provides a Swing GUI
7.jcrawl
jcrawl is a tiny, high-performance web crawler that can grab various types of files from web pages based on user-defined patterns, such as email addresses and QQ numbers.
License: Apache Development language: Java Operating system: cross-platform
Features: lightweight, excellent performance, can grab various types of files from web pages
8.JSpider
JSpider is a web spider implemented entirely in Java. It is invoked as follows:
jspider [URL] [ConfigName]
The URL must include the protocol name, for example http://; otherwise an error message is displayed. If ConfigName is omitted, the default configuration is used.
JSpider's behavior is controlled by its configuration files, which specify which plug-ins to use, how results are stored, and so on; they live in the conf[ConfigName] directory. JSpider's default configurations are limited and not very useful, but JSpider is easy to extend and can be used to build powerful web scraping and data analysis tools. Doing so requires a solid understanding of how JSpider works, after which you can develop plug-ins and write configuration files for your own needs.
License: LGPL Development language: Java Operating system: cross-platform
Features: powerful and easy to extend
9.Leopdo
A web search engine and crawler written in Java, including full-text and category-based vertical search, plus a word segmentation system.
License: Apache Development language: Java Operating system: cross-platform
Features: includes full-text and category-based vertical search, plus a word segmentation system
10.MetaSeeker
MetaSeeker is a complete solution for web content capture, formatting, data integration, storage management, and search.
Web crawlers can be implemented in various ways, which roughly fall into two categories:
- Server side: usually a multi-threaded program that downloads multiple target HTML pages at the same time. It can be written in PHP, Java, Python (currently very popular), and so on, and can be put together quickly; general-purpose search engine crawlers mostly work this way. However, if the other side dislikes crawlers, your IP is likely to get blocked, a server IP is not easy to change, and the bandwidth cost is considerable. I suggest taking a look at Beautiful Soup (a minimal sketch using it follows this list).
- Client side: usually suited to topic crawlers, also called focused crawlers. Building a general-purpose search engine this way is hard, but vertical search, price-comparison services, and recommendation engines are much easier: this kind of crawler does not fetch every page, only the pages you care about, and only the content on those pages that you care about, for example extracting yellow-pages information or commodity prices (search for SpyFu, it is quite interesting). These crawlers can be deployed in large numbers, can behave aggressively, and are hard to block.
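The "Beautiful SOAP" mentioned above is presumably the Python library Beautiful Soup. A minimal sketch of the server-side fetch-and-parse approach with it might look like the following; the URL is a placeholder, and the requests and beautifulsoup4 packages are assumed to be installed.

```python
# Minimal sketch of the server-side approach: fetch a page, parse it, and
# pull out the pieces you care about. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Page title and all outgoing links
print(soup.title.string if soup.title else "(no title)")
for a in soup.find_all("a", href=True):
    print(a["href"])
```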
Web crawlers in MetaSeeker fall into the latter category.
The MetaSeeker toolkit takes advantage of the Mozilla platform’s ability to extract anything Firefox sees.
MetaSeeker toolkit is free to use, download address: www.gooseeker.com/cn/node/download/front
Features: web page capture, information extraction, data extraction kit, simple operation
11.Playfish
Playfish is a web scraping tool built with Java technology and several open source Java components; it achieves a high degree of customizability and extensibility through XML configuration files.
The open source JAR packages it uses, including HttpClient (content fetching), dom4j (configuration file parsing), and Jericho (HTML parsing), are already included under lib in the war package.
The project is still quite immature, but the functionality is largely complete. It requires familiarity with XML and regular expressions. At present the tool can scrape all kinds of forums, Tieba-style boards, and various CMS systems; articles from forums and blogs built on Discuz!, phpBB, and the like can be crawled easily. Crawl definitions are written entirely in XML, which suits Java developers.
Usage: 1. download the .war package on the right and import it into Eclipse; 2. create the example database using the wcc.sql file under WebContent/sql; 3. edit dbconfig.txt in the src package and set the username and password to your MySQL username and password; 4. run SystemCore from the console; with no arguments, the default example.xml configuration file is executed.
Three examples come with the system: baidu.xml crawls Baidu Zhidao, example.xml crawls my JavaEye blog, and bbs.xml crawls the content of a Discuz!-based forum.
License: MIT Development language: Java Operating system: cross-platform
Features: High customizability and extensibility through XML configuration files
12.Spiderman
Spiderman is a web spider based on a microkernel + plug-in architecture. Its goal is to capture complex target web pages and parse them into the business data you need in a simple way.
How to use it?
First, determine your target website and target pages (i.e., the kind of pages you want to extract data from, such as the news pages of NetEase News).
Then open a target page, analyze its HTML structure, and work out the XPath of the data you want (a sketch of what XPath extraction looks like follows below).
Finally, fill the parameters into an XML configuration file and run Spiderman!
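Spiderman itself is a Java tool driven by XML configuration, so the following is only a hedged, language-neutral illustration of what an XPath expression selects, using Python's lxml library; the HTML snippet and the XPath expressions are made up for the example.

```python
# Hedged illustration of XPath-based extraction. The HTML snippet and the
# XPath expressions are invented for the example; in Spiderman the XPath
# would go into the XML configuration file instead.
from lxml import html

page = html.fromstring("""
<html><body>
  <div class="news">
    <h1>Sample headline</h1>
    <span class="date">2014-01-01</span>
  </div>
</body></html>
""")

title = page.xpath('//div[@class="news"]/h1/text()')[0]
date = page.xpath('//div[@class="news"]/span[@class="date"]/text()')[0]
print(title, date)
```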
License: Apache Development language: Java Operating system: cross-platform
Features: flexible and highly extensible; thanks to its microkernel + plug-in architecture, data capture can be completed through simple configuration without writing a line of code
13.webmagic
webmagic is a crawler framework that needs no configuration and is easy to build on; it provides a simple, flexible API, so a crawler can be implemented with only a small amount of code.
webmagic adopts a fully modular design, with functionality covering the entire crawler life cycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, automatic retries, custom UA/cookies, and more.
webmagic includes powerful page extraction features: developers can easily use CSS selectors, XPath, and regular expressions for link and content extraction, and multiple selectors can be chained together.
Documentation: webmagic.io/docs/
View source: git.oschina.net/flashs…
License: Apache Development language: Java Operating system: cross-platform
Features: functionality covers the whole crawler life cycle; uses XPath and regular expressions for link and content extraction
Note: this is a Chinese open source project, contributed by Huang Yihua
14.Web-Harvest
Web-Harvest is an open source Java web data extraction tool. It can collect specified web pages and extract useful data from them, using techniques such as XSLT, XQuery, and regular expressions to operate on text/XML.
Its working principle: HttpClient fetches the text/XML content of a page, which is then filtered with XPath, XQuery, regular expressions, and similar techniques to select exactly the data needed. The vertical search sites that were popular a couple of years ago (such as Kuxun) were built on a similar principle. For a Web-Harvest application, the key is understanding and defining the configuration files; the rest is Java code that decides what to do with the data. Of course, Java variables can also be filled into the configuration files before crawling to achieve dynamic configuration.
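Web-Harvest itself is Java and driven by its own XML configuration, so the snippet below is only a hedged illustration of the regular-expression side of that kind of pipeline, in Python; the URL is a placeholder and the pattern is deliberately simplistic.

```python
# Hedged illustration of regex-based link extraction, one of the techniques
# Web-Harvest combines with XPath/XQuery. The URL is a placeholder, and
# real-world HTML usually deserves a proper parser rather than a regex.
import re
from urllib.request import urlopen

html_text = urlopen("https://example.com/", timeout=10).read().decode(
    "utf-8", errors="replace")

# Capture the href values of anchor tags
links = re.findall(r'<a\s+[^>]*href=["\']([^"\']+)["\']', html_text, re.IGNORECASE)
print(links)
```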
License: BSD Development language: Java
Features: uses XSLT, XQuery, regular expressions, and other techniques to operate on text/XML; has a visual interface
15.WebSPHINX
WebSPHINX is a Java class library and an interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.
License: Apache
Development language: Java
Features: consists of two parts, the crawler workbench and the WebSPHINX class library
16.YaCy
YaCy is a distributed web search engine based on P2P. It is also an HTTP caching proxy server. The project is a new approach to building a P2P-based web indexing network. It can search your own index or the global index, crawl your own web pages, or launch distributed crawling, and so on.
License: GPL Development language: Java, Perl Operating system: cross-platform
Features: P2P-based distributed web search engine
Python crawlers
17.QuickRecon
QuickRecon is a simple information-gathering tool that helps you find subdomains, perform zone transfers, collect email addresses, and use microformats to discover relationships. QuickRecon is written in Python and supports Linux and Windows.
License: GPLv3 Development language: Python Operating system: Windows, Linux
Features: can find subdomains, collect email addresses, and discover interpersonal relationships
18.PyRailgun
This is a simple, easy-to-use scraping tool: a Python web crawler module that is simple, practical, and efficient, and supports crawling JavaScript-rendered pages.
License: MIT Development language: Python Operating system: cross-platform (Windows, Linux, OS X)
Features: simple, lightweight, efficient web crawling framework
Note: this software is also open-sourced by a Chinese developer
Github download: github.com/princehaku….
19.Scrapy
Scrapy is a crawler framework implemented in pure Python on top of the Twisted asynchronous processing framework. Users only need to customize a few modules to easily implement a crawler for grabbing web content and all kinds of images; it is very convenient.
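A minimal Scrapy spider looks roughly like the sketch below; the start URL and the selectors are placeholders for whatever site and fields you actually want to scrape.

```python
# Minimal Scrapy spider sketch. The start URL and selectors are placeholders.
# Run with: scrapy runspider example_spider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Extract one item per page with CSS selectors
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }
        # Follow in-page links and parse them the same way
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```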
License: BSD Development language: Python Operating system: cross-platform GitHub source code: github.com/scrapy/scra…
Features: Twisted-based asynchronous processing, well documented
C++ crawlers
20.hispider
HiSpider is a fast, high-performance spider.
Strictly speaking it is only a spider-system framework, without detailed requirements worked out. Currently it only handles URL extraction, URL deduplication, asynchronous DNS resolution, and task queueing; it supports distributed downloading across N machines and site-targeted downloading (the hispiderd.ini whitelist must be configured).
Features and Usage:
- Based on Unix/Linux system development
- Asynchronous DNS resolution
- URL deduplication
- Supports gzip/deflate HTTP compressed transfer encoding
- Detects the character set and automatically converts content to UTF-8
- Compressed document storage
- Supports distributed download from multiple download nodes
- Support targeted download from websites (hispiderd.ini Whitelist needs to be configured)
- Download statistics and download task control (stop and resume tasks) are available at http://127.0.0.1:3721/
- Depends on the basic communication libraries libevbase and libsbase (these two libraries must be installed first)
Workflow (a simplified sketch of a single download node's loop follows this list):
- Fetch a URL from the central node (along with the task number and the IP and port corresponding to the URL, which may need to be resolved by the node itself)
- Connect to the server and send the request
- Wait for the response headers to decide whether the data is wanted (currently mainly text-type data is taken)
- Wait for the data to complete (if a length header is present, wait for that much data; otherwise wait up to a fairly large cap and then apply a timeout)
- When the data is complete or the timeout is reached, zlib-compress the data and return it to the central server; the payload may include DNS information the node resolved itself, plus the compressed length and the compressed data, and on error the node simply returns the task number and the error details
- The central server receives the data with its task number and checks whether data is included; if not, the task's status is set to error, and if so, it extracts the links from the data and stores the data in a document file
- When finished, the node comes back for a new task
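hispider itself is written in C on top of libevbase/libsbase, so the sketch below only mirrors the described fetch, download, compress, and report cycle in Python; the central-server endpoints and their URL scheme are assumptions made up for the illustration.

```python
# Hedged sketch of one download node's loop in the workflow described above.
# The central-server endpoints (/task, /result, /error) are invented for the
# illustration and do not correspond to hispider's real protocol.
import zlib
import requests

CENTRAL = "http://central.example:3721"   # placeholder central node

while True:
    task = requests.get(f"{CENTRAL}/task", timeout=10).json()  # e.g. {"id": ..., "url": ...}
    if not task:
        break                                                  # no work left
    try:
        page = requests.get(task["url"], timeout=30)
        if "text" in page.headers.get("Content-Type", ""):     # keep text-type documents only
            payload = zlib.compress(page.content)              # compressed document transfer
            requests.post(f"{CENTRAL}/result/{task['id']}", data=payload, timeout=30)
        else:
            requests.post(f"{CENTRAL}/result/{task['id']}", json={"skipped": True}, timeout=30)
    except requests.RequestException as exc:
        # On error, report the task number and the error details
        requests.post(f"{CENTRAL}/error/{task['id']}", json={"error": str(exc)}, timeout=30)
```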
License: BSD Development language: C/C++ Operating system: Linux
Features: supports multi-machine distributed downloading and site-targeted downloading
21.larbin
Larbin is an open source web crawler/spider developed independently by Sébastien Ailleret, a young Frenchman. Larbin's aim is to follow the URLs on a page and perform extended crawling, ultimately providing a broad data source for search engines. Larbin is only a crawler: it fetches web pages, and how to parse them is entirely up to the user. It also provides nothing for storing results in a database or building indexes. A simple larbin crawler can fetch five million web pages a day.
With larbin we can easily obtain or enumerate all the links of a single website and even mirror a site; we can also use it to build URL lists, for example harvesting the URLs of all pages and then retrieving the linked XML files, or mp3 files, or whatever a customized larbin is set up for, which can then serve as a data source for a search engine.
License: GPL Development language: C/C++ Operating system: Linux
Features: high-performance crawler software that is only responsible for crawling, not parsing
22.Methabot
Methabot is a speed-optimized, highly configurable crawler for the web, FTP, and local file systems.
License: unknown Development language: C/C++ Operating system: Windows, Linux Features: speed-optimized, can crawl the web, FTP, and local file systems Source code: www.oschina.net/code/t…
C# crawlers
23.NWebCrawler
NWebCrawler is an open source web crawler developed in C#.
Features:
- Configurable: number of threads, wait time, connection timeout, allowed MIME types and priorities, download folder
- Statistics: number of URLs, total downloaded files, total downloaded bytes, CPU utilization, and available memory
- Priority crawling: users can set the priority of each MIME type
- Robust: 10+ URL normalization rules and crawler-trap avoidance rules (a hedged sketch of URL normalization follows this list)
License: GPLv2 Development language: C# Operating system: Windows
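NWebCrawler itself is C# and its exact rules are not documented here, so the following is only a hedged Python sketch of the kind of cleanup URL normalization rules perform: lowercase the scheme and host, drop the fragment, remove a default port, and resolve relative references.

```python
# Hedged sketch of typical URL normalization rules; the specific rules
# NWebCrawler applies may differ.
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url, base=None):
    if base:
        url = urljoin(base, url)            # resolve relative references
    parts = urlsplit(url)
    netloc = parts.netloc.lower()           # lowercase the host
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                # drop the default port
    path = parts.path or "/"                # empty path becomes "/"
    # Rebuild the URL without the fragment
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))


print(normalize("HTTP://Example.COM:80/a/b#section"))  # -> http://example.com/a/b
```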
Project homepage: www.open-open.com/lib/…
Features: visualization of statistics and execution process
24.Sinawler
The first crawler for microblog data in China! Its original name was "Sina Weibo Crawler".
After logging in, you can designate a user as the starting point and, using that user's followees and followers as leads, collect users' basic information, microblog posts, and comment data.
The data obtained with this application can support scientific research and development work related to Sina Weibo, but must not be used for commercial purposes. The application is based on the .NET 2.0 framework, requires SQL Server as the backend database, and provides database script files for SQL Server.
In addition, because of the limits of the Sina Weibo API, the crawled data may be incomplete (for example, limits on the number of followers and the number of posts that can be retrieved).
The copyright of this program belongs to the author. You may, free of charge: copy, distribute, display, and perform the work, and create derivative works. You may not use the work for commercial purposes.
Version 5.x has been released! This version has six background worker threads: a robot that crawls users' basic information, one that crawls user relationships, one that crawls user tags, one that crawls microblog content, one that crawls microblog comments, and one that adjusts the request frequency. Higher performance! It squeezes the most out of the crawler! Judging by current test results, it already meets the author's own needs.
Features of this program:
- 6 background worker threads to maximize crawler performance potential!
- The interface provides flexible and convenient parameter setting
- Abandons the app.config configuration file and encrypts configuration information to protect the database account details
- Automatically adjusts the request frequency to avoid exceeding rate limits while also avoiding running too slowly and losing efficiency
- Full control of the crawler: pause, resume, or stop it at any time
- Good user experience
License: GPLv3 Development language: C#.NET Operating system: Windows
25.spidernet
spidernet is a multi-threaded web crawler modeled on a recursive tree, with support for fetching text/html resources. It lets you set the crawl depth and a maximum download size, supports gzip decoding and both GBK (GB2312) and UTF-8 encoded resources, and stores results in SQLite data files.
TODO tags in the source code describe unfinished functionality; code contributions are welcome.
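spidernet itself is C#, so the following is only a hedged Python sketch of what the encoding handling and SQLite storage described above amount to: detect the page's charset (e.g. GBK/GB2312 vs UTF-8), decode it to a normalized string, and store it in an SQLite file. The chardet library and the table schema are assumptions made for the illustration.

```python
# Hedged sketch of charset detection plus SQLite storage. The URL, the use of
# chardet, and the table schema are assumptions for illustration only.
import sqlite3
from urllib.request import urlopen

import chardet

url = "https://example.com/"                            # placeholder URL
raw = urlopen(url, timeout=10).read()
encoding = chardet.detect(raw)["encoding"] or "utf-8"   # charset judgment
text = raw.decode(encoding, errors="replace")           # normalize to Unicode

conn = sqlite3.connect("pages.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, text))
conn.commit()
conn.close()
```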
License: MIT Development language: C# Operating system: Windows
Github source: github.com/nsnail/spi….
Features: multi-threaded web crawler modeled on a recursive tree; supports GBK (GB2312) and UTF-8 encoded resources; stores data in SQLite
26.Web Crawler
Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to restrict which links are crawled back; three filters are provided by default: ServerFilter, BeginningPathFilter, and RegularExpressionFilter, and they can be combined with AND, OR, and NOT. Listeners can be added during parsing and before and after a page is loaded. (This introduction comes from open-open.)
Development language: Java Operating system: cross-platform License: LGPL
Features: multi-threaded; supports grabbing PDF/DOC/Excel and other document sources
27. Network Miner
Network Miner is website data collection software (formerly Soukey Picking).
Soukey Picking is open source website data collection software based on the .NET platform, and currently the only open source software of its kind. Being open source does not limit its functionality; it even offers more than some commercial software.
License: BSD Development language: C#.NET Operating system: Windows
Features: Feature-rich, as good as commercial software
PHP crawlers
28.OpenWebSpider
OpenWebSpider is an open source multi-threaded web spider (robot, crawler) and search engine with many interesting features.
License: unknown Development language: PHP Operating system: cross-platform
Features: Open source multithreaded web crawler with many interesting features
29.PhpDig
PhpDig is a web crawler and search engine developed in PHP. It builds a vocabulary by indexing both dynamic and static pages, and when a query is run it displays a results page containing the keywords, ranked in a certain order. PhpDig includes a templating system and can index PDF, Word, Excel, and PowerPoint documents. PhpDig is better suited to more specialized, deeper personalized search, and is a good choice for building a vertical search engine for a specific domain.
Presentation: www.phpdig.net/navigat…
License: GPL Development language: PHP Operating system: cross-platform
Features: can collect web content and submit forms
30.ThinkUp
ThinkUp is a social media insights engine that collects data from Twitter, Facebook, and other social networks. It is an interactive analysis tool that gathers data from a personal social network account, archives and processes it, and turns it into charts for more intuitive viewing.
License: GPL Development language: PHP Operating system: cross-platform GitHub source code: github.com/ThinkUpLLC….
Features: a social media insights engine that collects data from social networks such as Twitter and Facebook, supports interactive analysis, and presents the results visually
31. Weigou (micro-purchase)
The Weigou social shopping system is an open source shopping-sharing system built on the ThinkPHP framework. It is also an open source Taobao-affiliate site program aimed at webmasters. It integrates more than 300 product data-collection interfaces from Taobao, Tmall, Taobao affiliates, and others, providing turnkey site-building services for Taobao-affiliate webmasters: anyone who knows HTML can make program templates. It is free and open to download, and a first choice for many Taobao-affiliate webmasters.
Demo site: tlx.wego360.com
License: GPL Development language: PHP Operating system: cross-platform
Erlang crawlers
32.Ebot
Ebot is a scalable distributed web crawler developed in Erlang. Crawled URLs are stored in a database and can be queried via RESTful HTTP requests.
License: GPLv3 Development language: Erlang Operating system: cross-platform
Github source: github.com/matteoreda….
Project home page: www.redaelli.org/matte…
Features: Scalable distributed web crawler
Ruby crawlers
33.Spidr
Spidr is a Ruby web crawler library that can crawl an entire website, multiple websites, or a single link completely to local storage.
Development language: Ruby License: MIT Features: can completely crawl one or more websites, or a single link, to local storage
Source: 36 Big Data, "33 open source crawler tools you can use to capture data"