Open source crawlers
DataparkSearch is a crawler and search engine distributed under the GNU GPL.
GNU Wget is a command-line crawler written in C and released under the GPL. It is mainly used to mirror web and FTP sites.
Heritrix is the Internet Archive's archival-quality crawler, designed to take periodic archival snapshots of large portions of the web; it is written in Java.
ht://Dig includes a web crawler in its indexing engine.
HTTrack uses a web crawler to create a mirror of a web site for offline viewing. It is written in C and distributed under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++. It is designed to crawl entire sites based on the ICDL standard, using only idle CPU resources.
JSpider is a highly configurable, customizable web crawler engine licensed under the GPL.
Larbin was developed by Sebastien Ailleret;
Webtools4larbin was developed by Andreas Beder;
Methabot is a speed-optimized, command-line web crawler written in C and distributed under the 2-clause BSD license. Its main features are high configurability and modularity; it can crawl local file systems, HTTP, or FTP.
Nutch is a crawler written in Java and released under the Apache license. It can be integrated with Lucene to provide full-text search;
Pavuk is a GPL-licensed, command-line web site mirroring tool with an optional X11 graphical interface. Compared with Wget and HTTrack, it has a number of advanced features, such as regular-expression-based file filtering and file creation rules.
WebVac is a crawler used by the Stanford WebBase project.
WebSPHINX (Miller and Bharat, 1998) consists of a Java class library that implements multithreaded web page retrieval and HTML parsing, together with a graphical user interface for setting the seed URLs, extracting the downloaded data, and running a basic text-based search engine.
WIRE, the Web Information Retrieval Environment (Baeza-Yates and Castillo, 2002), is a crawler written in C++ and released under the GPL. It includes several policies for scheduling page downloads and a module for generating reports and statistics, so it is mainly used for characterizing the web.
LWP::RobotUA (Langheinrich, 2004) is a Perl class for implementing well-behaved parallel web robots, distributed under the Perl 5 license.
Web Crawler is an open source web crawler class for .NET, written in C#.
Sherlock Holmes collects and indexes textual data (text files, web pages) both locally and over the network; the project is sponsored and used by the Czech web portal Centrum.
YaCy is a free distributed search engine built on peer-to-peer networks, released under the GPL;
Ruya is an open source, high-performance, breadth-first, level-based web crawler. It crawls English and Japanese pages well, is distributed under the GPL, and is written entirely in Python. Its single-domain delay crawler obeys robots.txt crawl delays (a sketch of this politeness rule follows this list).
Universal Information Crawler, a fast-developing web crawler for retrieving, storing, and analyzing data;
Agent Kernel, a Java framework for scheduling, concurrency, and storage management during crawling.
Arachnode.net is a GPL-licensed, versatile open source robot written in C# and backed by SQL Server 2005. It can download, retrieve, and store all kinds of data, including email addresses, files, hyperlinks, images, and web pages.
Dine is a multithreaded Java HTTP client and crawler that can be extended by the user; it is released under the LGPL.
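Several of the crawlers above, Ruya among them, throttle their requests per domain according to robots.txt. The sketch below shows one way this politeness rule can be implemented with Python's standard library; the class name, user agent, and one-second fallback delay are assumptions made for this illustration and are not taken from Ruya's code.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class RobotsThrottle:
    """Per-domain politeness: obey robots.txt rules and its Crawl-delay."""

    def __init__(self, user_agent="ExampleBot"):
        self.user_agent = user_agent
        self.parsers = {}      # host -> cached RobotFileParser
        self.last_fetch = {}   # host -> timestamp of the previous request

    def allowed(self, url):
        """Return True if `url` may be fetched, sleeping out the crawl delay first."""
        parts = urlparse(url)
        host = f"{parts.scheme}://{parts.netloc}"
        rp = self.parsers.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
            rp.read()
            self.parsers[host] = rp
        if not rp.can_fetch(self.user_agent, url):
            return False                                # disallowed by robots.txt
        delay = rp.crawl_delay(self.user_agent) or 1.0  # fallback delay is an assumption
        wait = delay - (time.time() - self.last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)                            # respect the per-domain delay
        self.last_fetch[host] = time.time()
        return True
```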
The composition of web crawlers
In the system framework of a web crawler, the main components are the controller, the parser, and the resource repository. The controller's main job is to assign work to the individual crawler threads in a multithreaded setting. The parser's main work is to download web pages and process their content, chiefly handling JS script tags, CSS code, whitespace characters, HTML tags, and similar material; the basic work of the crawler is carried out by the parser. The resource repository stores the downloaded web resources, usually in a large database such as Oracle, over which an index is then built.
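As a rough, single-threaded illustration of how these three parts cooperate, the sketch below wires a controller, a parser, and an in-memory repository together in Python. All names are invented for this example, and the regular-expression parsing and dictionary storage stand in for the multithreading, robust HTML handling, and database storage a production crawler would need.

```python
import re
import urllib.request

class Controller:
    """Hands out URLs to be crawled and collects newly discovered links."""
    def __init__(self, seeds):
        self.frontier = list(seeds)
        self.seen = set(seeds)

    def next_url(self):
        return self.frontier.pop(0) if self.frontier else None

    def add_links(self, links):
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.frontier.append(link)

class Parser:
    """Downloads a page, strips markup, and extracts outgoing links."""
    def fetch(self, url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def parse(self, html):
        links = re.findall(r'href="(https?://[^"]+)"', html)
        text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                      flags=re.S | re.I)             # drop JS and CSS blocks
        text = re.sub(r"<[^>]+>", " ", text)         # drop remaining HTML tags
        text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
        return text, links

class Repository:
    """Keeps downloaded pages in memory; a real system would use a database."""
    def __init__(self):
        self.pages = {}

    def save(self, url, text):
        self.pages[url] = text

def crawl(seeds, limit=10):
    controller, parser, repo = Controller(seeds), Parser(), Repository()
    while len(repo.pages) < limit:
        url = controller.next_url()
        if url is None:
            break
        try:
            text, links = parser.parse(parser.fetch(url))
        except OSError:
            continue                                 # skip pages that fail to download
        repo.save(url, text)
        controller.add_links(links)
    return repo
```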
The controller
The controller is the central coordinator of the web crawler. It is mainly responsible for allocating a thread for each URL passed in by the system and then starting that thread to run the page-crawling procedure.
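A minimal sketch of this thread-allocation role, using Python's standard thread pool; fetch_page here is a hypothetical stand-in for the page-crawling procedure that each worker thread would run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch_page(url):
    """Hypothetical crawl routine executed by a worker thread."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

def controller(urls, max_threads=8):
    """Assign each incoming URL to a worker thread and gather the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(fetch_page, url) for url in urls]
        for future in as_completed(futures):
            try:
                url, body = future.result()
                results[url] = body
            except Exception:
                pass   # a real controller would log the failure and possibly retry
    return results
```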
The parser
The parser is the main component of the web crawler. Its functions include downloading web pages and processing their text, for example filtering content, extracting special HTML tags, and analyzing the data.
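To illustrate the filtering and extraction described above, the sketch below uses Python's built-in HTMLParser to skip script and style content, collect visible text, and gather hyperlinks. The class name and the choice of what to keep are assumptions made for this example rather than the behavior of any particular crawler.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects visible text and hyperlinks, ignoring <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0          # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

# Usage: p = PageParser(); p.feed(html); then read p.links and " ".join(p.text_parts)
```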
The repository
The repository is mainly used to store the data records downloaded from web pages and to provide the target source from which the index is generated. Medium and large database products such as Oracle and SQL Server are typically used.
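A small sketch of such a repository, using SQLite in place of the Oracle or SQL Server installation mentioned above; the table layout and the keyword lookup are illustrative assumptions, meant only to show how stored pages serve as the source for building an index.

```python
import sqlite3

class PageRepository:
    """Stores downloaded pages and offers a simple keyword lookup
    over them, acting as the target source for index generation."""

    def __init__(self, path="crawl.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
        )

    def save(self, url, body):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
        )
        self.conn.commit()

    def search(self, keyword):
        cur = self.conn.execute(
            "SELECT url FROM pages WHERE body LIKE ?", (f"%{keyword}%",)
        )
        return [row[0] for row in cur.fetchall()]
```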