A lightweight event-based crawler framework

Cetty is an event-dispatch crawler framework.

Features

A crawler framework built on a fully custom event-handling mechanism.
Modular design makes the framework easy to extend.
Supports synchronous and asynchronous fetching, built on HttpClient.
Supports multithreading.
Uses the Jsoup parsing library for powerful page parsing.
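
The difference between the synchronous and asynchronous fetch modes listed above can be pictured with a small offline sketch. It uses only the JDK and does not touch Cetty's API or HttpClient: the class FetchModesSketch, the methods fetchSync and fetchAsync, and the stand-in downloader are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch of sync vs. async fetching; not Cetty's implementation. */
final class FetchModesSketch {

    /** Synchronous mode: the caller blocks until the body is available. */
    static String fetchSync(Function<String, String> downloader, String url) {
        return downloader.apply(url);
    }

    /** Asynchronous mode: the download runs on a pool thread and the caller gets a future. */
    static CompletableFuture<String> fetchAsync(Function<String, String> downloader,
                                                String url,
                                                ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> downloader.apply(url), pool);
    }

    public static void main(String[] args) {
        // Stand-in downloader so the sketch runs offline; a real one would issue an HTTP request.
        Function<String, String> fake = url -> "<html>" + url + "</html>";
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            System.out.println(fetchSync(fake, "http://example.com/a"));
            fetchAsync(fake, "http://example.com/b", pool)
                    .thenAccept(System.out::println)
                    .join(); // wait here only so the demo prints before exiting
        } finally {
            pool.shutdown();
        }
    }
}
```

In the asynchronous case the thread pool is what provides multithreading: several downloads can be in flight while the caller continues with other work.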

Quick start

Using Maven

<dependency>
    <groupId>com.jibug.cetty</groupId>
    <artifactId>cetty-core</artifactId>
    <version>0.1.5</version>
</dependency>

Help

1. Detailed documentation: cetty.jibug.com/

2. QQ group

3. Bug feedback: issues

Let's write our first demo

import java.util.List;

import com.google.common.collect.Lists;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Bootstrap, Payload, Page, Result, HandlerContext, ProcessHandlerAdapter and
// ConsoleReduceHandler come from cetty-core; import them from that artifact.

/**
 * Grabs the article list titles from the Tianya BBS:
 * http://bbs.tianya.cn/list-333-1.shtml
 *
 * @author heyingcai
 */
public class Tianya extends ProcessHandlerAdapter {

    @Override
    public void process(HandlerContext ctx, Page page) {
        // Get the parsed Document
        Document document = page.getDocument();
        // Select the rows of the article table
        Elements itemElements = document
                .select("div#bbsdoc>div#bd>div#main>div.mt5>table>tbody")
                .get(2)
                .select("tr");
        List<String> titles = Lists.newArrayList();
        for (Element item : itemElements) {
            String title = item.select("td.td-title").text();
            titles.add(title);
        }

        // Get the Result object and store the titles for the next handler
        Result result = page.getResult();
        result.addResults(titles);

        // This tutorial passes the results straight to ConsoleReduceHandler,
        // which prints them to the console
        ctx.fireReduce(page);
    }

    public static void main(String[] args) {
        // Bootstrap the crawler
        Bootstrap.me()
                // Use synchronous fetching
                .isAsync(false)
                // Entry URL to crawl
                .startUrl("http://bbs.tianya.cn/list-333-1.shtml")
                // Common request information
                .setPayload(Payload.custom())
                // Add the processor (assumed to be this class; the call is missing in the source)
                .addHandler(new Tianya())
                // Add the default result handler, which outputs to the console
                .addHandler(new ConsoleReduceHandler())
                .start();
    }
}
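
To illustrate how one handler can pass its results to the next, as the comments in the demo describe, here is a minimal, self-contained sketch of a handler chain. It is not Cetty's implementation: HandlerChainSketch, Context, fireNext, run, and the simplified Page are assumed names that only mirror the general idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of an event-style handler chain; not Cetty's actual code. */
final class HandlerChainSketch {

    /** Carries data between handlers (a stand-in for the demo's Page and Result). */
    static class Page {
        final List<String> results = new ArrayList<>();
    }

    /** A handler reacts to an event and decides whether to forward it. */
    interface Handler {
        void process(Context ctx, Page page);
    }

    /** Bound to one position in the chain; fireNext() dispatches to the following handler. */
    static class Context {
        private final List<Handler> handlers;
        private final int index;

        Context(List<Handler> handlers, int index) {
            this.handlers = handlers;
            this.index = index;
        }

        void fireNext(Page page) {
            int next = index + 1;
            if (next < handlers.size()) {
                handlers.get(next).process(new Context(handlers, next), page);
            }
        }
    }

    /** Runs a page through the chain, starting at the first handler. */
    static Page run(List<Handler> handlers, Page page) {
        if (!handlers.isEmpty()) {
            handlers.get(0).process(new Context(handlers, 0), page);
        }
        return page;
    }

    public static void main(String[] args) {
        // First handler extracts data, then forwards the page.
        Handler extract = (ctx, page) -> {
            page.results.add("title-1");
            page.results.add("title-2");
            ctx.fireNext(page);
        };
        // Last handler plays the role of a console reducer.
        Handler console = (ctx, page) -> page.results.forEach(System.out::println);
        run(Arrays.asList(extract, console), new Page());
    }
}
```

A handler that does not call fireNext simply ends the chain at that point, which is why the demo's process method finishes by firing the page onward.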

Version history

Version	Description
0.1.0	Basic crawler functionality
0.1.5	1. Support XPath 2. Fix cookies failing to be added 3

TODO

Annotation support
Proxy pool support
Use Berkeley DB as the URL manager to store very large numbers of URLs and improve access efficiency
Hot update support
Crawler management support
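
For context on the URL manager item above, the role of such a component can be sketched with plain in-memory collections: a queue of pending URLs plus a set of URLs already seen. This is only an assumed illustration of the concept. UrlManagerSketch and its methods are hypothetical names, and a store built for very large URL sets would replace these collections.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical in-memory URL manager: a FIFO frontier plus a "seen" set for de-duplication. */
final class UrlManagerSketch {
    private final Queue<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Queues the URL unless it was offered before; returns true if it was newly added. */
    synchronized boolean offer(String url) {
        if (url == null || url.isEmpty() || !seen.add(url)) {
            return false;
        }
        frontier.add(url);
        return true;
    }

    /** Returns the next URL to crawl, or null when nothing is pending. */
    synchronized String poll() {
        return frontier.poll();
    }

    /** Number of URLs waiting to be crawled. */
    synchronized int pending() {
        return frontier.size();
    }

    public static void main(String[] args) {
        UrlManagerSketch urls = new UrlManagerSketch();
        urls.offer("http://bbs.tianya.cn/list-333-1.shtml");
        urls.offer("http://bbs.tianya.cn/list-333-1.shtml"); // duplicate, ignored
        System.out.println(urls.pending()); // prints 1
    }
}
```

The methods are synchronized so that several crawler threads can share one instance. The seen set grows with every URL offered, which is the memory cost a dedicated store is meant to address.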
