A lightweight event-based crawler framework.

Cetty is a crawler framework built on event dispatch.

Features

  • Crawler framework based on a fully custom event handling mechanism.
  • Modular design makes the framework highly extensible.
  • Supports synchronous and asynchronous data fetching, based on HttpClient.
  • Supports multithreading.
  • Uses the Jsoup parsing library, which provides powerful web page parsing capabilities.
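The difference between synchronous and asynchronous fetching can be sketched with the standard library alone. This is a minimal illustration, not Cetty's implementation: the class `FetchModes`, the methods `fetchAllSync` and `fetchAllAsync`, and the stub fetcher are all hypothetical names, and no real HTTP request is made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Illustrative only: FetchModes and its methods are hypothetical, not Cetty APIs.
class FetchModes {

    // Synchronous mode: each URL is fetched in turn on the calling thread.
    static List<String> fetchAllSync(List<String> urls, Function<String, String> fetcher) {
        List<String> pages = new ArrayList<>();
        for (String url : urls) {
            pages.add(fetcher.apply(url));
        }
        return pages;
    }

    // Asynchronous mode: every fetch is submitted to a fixed-size thread pool,
    // and the results are collected in the original URL order.
    static List<String> fetchAllAsync(List<String> urls, Function<String, String> fetcher, int threadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(CompletableFuture.supplyAsync(() -> fetcher.apply(url), pool));
            }
            List<String> pages = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                pages.add(future.join()); // blocks until this fetch has finished
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for a real HTTP request.
        Function<String, String> stub = url -> "<html>" + url + "</html>";
        List<String> urls = List.of("http://example.com/a", "http://example.com/b");

        System.out.println(fetchAllSync(urls, stub));
        System.out.println(fetchAllAsync(urls, stub, 2));
    }
}
```

Both methods return pages in the same order as the input URLs. The asynchronous version overlaps the waiting time of the individual requests, which is where a thread pool pays off for network-bound crawling.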

Quick start

Using Maven

    <dependency>
        <groupId>com.jibug.cetty</groupId>
        <artifactId>cetty-core</artifactId>
        <version>0.1.5</version>
    </dependency>

Help

1. Detailed documentation: cetty.jibug.com/

2. QQ group

3. Bug feedback: issues

Let's write our first demo

    import java.util.List;

    import com.google.common.collect.Lists;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    // Bootstrap, Page, Payload, Result, HandlerContext, ProcessHandlerAdapter and
    // ConsoleReduceHandler come from cetty-core; import them from that artifact.

    /**
     * Grabs the article list titles from the Tianya BBS.
     * http://bbs.tianya.cn/list-333-1.shtml
     *
     * @author heyingcai
     */
    public class Tianya extends ProcessHandlerAdapter {

        @Override
        public void process(HandlerContext ctx, Page page) {
            // Get the Document
            Document document = page.getDocument();
            // Parse the DOM
            Elements itemElements = document
                    .select("div#bbsdoc>div#bd>div#main>div.mt5>table>tbody")
                    .get(2)
                    .select("tr");
            List<String> titles = Lists.newArrayList();
            for (Element item : itemElements) {
                String title = item.select("td.td-title").text();
                titles.add(title);
            }

            // Get the Result object and pass the parsed results to the next handler
            Result result = page.getResult();
            result.addResults(titles);

            // This tutorial passes the results directly to ConsoleReduceHandler,
            // which prints them to the console
            ctx.fireReduce(page);
        }

        public static void main(String[] args) {
            // Bootstrap class
            Bootstrap.me()
                    // Use synchronous fetching
                    .isAsync(false)
                    // Entry URL to crawl
                    .startUrl("http://bbs.tianya.cn/list-333-1.shtml")
                    // Common request information
                    .setPayload(Payload.custom())
                    // Add the custom handler
                    .addHandler(new Tianya())
                    // Add the default result handler, which outputs to the console
                    .addHandler(new ConsoleReduceHandler())
                    .start();
        }
    }
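The demo hands its results to the next handler through the context object. The idea behind that kind of hand-off can be sketched in plain Java. Everything in this sketch is a hypothetical stand-in (`HandlerChainSketch`, `Handler`, `Context`, `fireNext`, and the simplified `Page`); it does not reproduce Cetty's actual classes or internals.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: every type here is a hypothetical stand-in, not a Cetty class.
class HandlerChainSketch {

    // Carries the results that handlers accumulate for one page.
    static class Page {
        final List<String> results = new ArrayList<>();
    }

    interface Handler {
        void handle(Context ctx, Page page);
    }

    // Knows the position of the current handler and can fire the event onward.
    static class Context {
        private final List<Handler> handlers;
        private final int index;

        Context(List<Handler> handlers, int index) {
            this.handlers = handlers;
            this.index = index;
        }

        // Passes the page to the next handler, if there is one.
        void fireNext(Page page) {
            int next = index + 1;
            if (next < handlers.size()) {
                handlers.get(next).handle(new Context(handlers, next), page);
            }
        }
    }

    // Runs the page through the chain, starting at the first handler.
    static Page run(List<Handler> handlers, Page page) {
        if (!handlers.isEmpty()) {
            handlers.get(0).handle(new Context(handlers, 0), page);
        }
        return page;
    }

    public static void main(String[] args) {
        Handler extractTitles = (ctx, page) -> {
            page.results.add("first title");
            page.results.add("second title");
            ctx.fireNext(page); // hand the results to the next handler
        };
        Handler printResults = (ctx, page) -> page.results.forEach(System.out::println);

        run(List.of(extractTitles, printResults), new Page());
    }
}
```

A handler that does not call `fireNext` ends the chain for that page, so each stage decides whether downstream handlers see the event.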

Version history

Version   Notes
0.1.0     Supports basic crawler functions
0.1.5     1. Supports XPath 2. Fixes the failure when adding cookies 3.

TODO

  • Support for annotations
  • Support for a proxy pool
  • Use a Berkeley in-memory database as the URL manager, to store large numbers of URLs and improve access efficiency
  • Support for hot updates
  • Support for crawler governance