A lightweight crawler framework built on event dispatch.
Features
- Built on a fully custom event-handling mechanism.
- Modular design makes the framework highly extensible.
- Synchronous and asynchronous data fetching, built on HttpClient.
- Multithreading support.
- Page parsing through Jsoup, which provides powerful HTML parsing capabilities.
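To make the difference between synchronous and asynchronous, multithreaded fetching concrete, here is a minimal sketch in plain Java. It does not use Cetty's own classes: `FetchModesSketch`, `fetchSync`, `fetchAsync`, and the stubbed `fakeFetcher` are hypothetical names chosen for illustration, and the real framework delegates the HTTP work to HttpClient rather than to a stub.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from Cetty.
class FetchModesSketch {

    /** Fetches each URL one after another on the calling thread. */
    static List<String> fetchSync(List<String> urls, Function<String, String> fetcher) {
        return urls.stream().map(fetcher).collect(Collectors.toList());
    }

    /** Submits every URL to a fixed thread pool, then waits for all results in input order. */
    static List<String> fetchAsync(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = urls.stream()
                    .map(url -> CompletableFuture.supplyAsync(() -> fetcher.apply(url), pool))
                    .collect(Collectors.toList());
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real HTTP request so the sketch runs without network access.
        Function<String, String> fakeFetcher = url -> "<html>" + url + "</html>";
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        System.out.println(fetchSync(urls, fakeFetcher));
        System.out.println(fetchAsync(urls, fakeFetcher, 2));
    }
}
```

Both methods return the same list; the asynchronous variant only changes where and when the fetches run, which is the trade-off a setting such as `isAsync` would control.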
Quick start
Using Maven
    <dependency>
        <groupId>com.jibug.cetty</groupId>
        <artifactId>cetty-core</artifactId>
        <version>0.1.5</version>
    </dependency>
Help
1. Detailed documentation: cetty.jibug.com/
2. QQ group
3. Bug feedback: issues
Let's write our first demo:
    import java.util.List;

    import com.google.common.collect.Lists;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    // Cetty classes (Bootstrap, Page, Payload, Result, HandlerContext,
    // ProcessHandlerAdapter, ConsoleReduceHandler) come from cetty-core;
    // their import lines are omitted here.

    /**
     * Grabs the article list titles from the Tianya BBS:
     * http://bbs.tianya.cn/list-333-1.shtml
     *
     * @author heyingcai
     */
    public class Tianya extends ProcessHandlerAdapter {

        @Override
        public void process(HandlerContext ctx, Page page) {
            // Get the Document
            Document document = page.getDocument();
            // Parse the DOM: take the third tbody and collect its rows
            Elements itemElements = document
                    .select("div#bbsdoc>div#bd>div#main>div.mt5>table>tbody")
                    .get(2)
                    .select("tr");
            List<String> titles = Lists.newArrayList();
            for (Element item : itemElements) {
                String title = item.select("td.td-title").text();
                titles.add(title);
            }

            // Get the Result object and pass the parsed titles to the next handler
            Result result = page.getResult();
            result.addResults(titles);

            // This tutorial passes the results straight to the console handler,
            // which prints them to the console
            ctx.fireReduce(page);
        }

        public static void main(String[] args) {
            // Start the bootstrap class
            Bootstrap.me()
                    // Use synchronous fetching
                    .isAsync(false)
                    // Entry URL to crawl
                    .startUrl("http://bbs.tianya.cn/list-333-1.shtml")
                    // Common request settings
                    .setPayload(Payload.custom())
                    // Add the custom processor (this class)
                    .addHandler(new Tianya())
                    // Add the default result handler, which prints to the console
                    .addHandler(new ConsoleReduceHandler())
                    .start();
        }
    }
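The demo relies on one handler passing its page to the next one in the chain. As a rough, framework-independent sketch of that idea, the following plain Java shows one way such a chain could be wired. Every name here (`HandlerChainSketch`, `Handler`, `Context`, `fireNext`, `run`) is hypothetical and is not Cetty's API; Cetty's actual dispatch logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of handler chaining; not Cetty's implementation.
class HandlerChainSketch {

    /** Stand-in for a fetched page plus the results parsed from it. */
    static class Page {
        final String body;
        final List<String> results = new ArrayList<>();
        Page(String body) { this.body = body; }
    }

    /** A handler works on the page and decides whether to pass it on. */
    interface Handler {
        void handle(Context ctx, Page page);
    }

    /** Marks one position in the chain; fireNext() dispatches to the following handler. */
    static class Context {
        private final List<Handler> handlers;
        private final int index;

        Context(List<Handler> handlers, int index) {
            this.handlers = handlers;
            this.index = index;
        }

        void fireNext(Page page) {
            int next = index + 1;
            if (next < handlers.size()) {
                handlers.get(next).handle(new Context(handlers, next), page);
            }
        }
    }

    /** Runs the page through the chain, starting at the first handler. */
    static Page run(List<Handler> handlers, Page page) {
        new Context(handlers, -1).fireNext(page);
        return page;
    }

    public static void main(String[] args) {
        List<Handler> chain = new ArrayList<>();
        // Parsing step: split the body into titles, then hand the page on.
        chain.add((ctx, page) -> {
            for (String title : page.body.split(",")) {
                page.results.add(title.trim());
            }
            ctx.fireNext(page);
        });
        // Output step: print whatever the previous handler collected.
        chain.add((ctx, page) -> System.out.println(page.results));
        run(chain, new Page("first title, second title"));
    }
}
```

A handler that never calls `fireNext` ends the chain at that point, which is why the demo's `process` method finishes by firing the page onward to the console handler.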
Version history
Version | Description |
---|---|
0.1.0 | Basic crawler functionality |
0.1.5 | 1. XPath support 2. Fix for cookies failing to be added 3 |
TODO
- Support for annotations
- Proxy pool support
- Use a Berkeley in-memory database as the URL manager, to store URLs at scale and speed up access
- Support for hot updates
- Support crawler governance
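For readers wondering what a URL manager is responsible for, the sketch below shows the basic contract: remember which URLs have been seen and hand out pending ones in order. It is an assumption-laden illustration, not the planned design. `UrlManager`, `InMemoryUrlManager`, `offer`, and `poll` are hypothetical names, and a plain `HashSet` plus queue stands in for the Berkeley-backed storage mentioned in the list above.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration; the real URL manager is still a TODO item.
class UrlManagerSketch {

    /** Minimal contract: accept new URLs once, hand out pending URLs in FIFO order. */
    interface UrlManager {
        /** Returns true if the URL was new and has been queued. */
        boolean offer(String url);

        /** Returns the next pending URL, or null when nothing is left. */
        String poll();
    }

    /** Heap-only implementation; a disk-backed store would replace the two collections. */
    static class InMemoryUrlManager implements UrlManager {
        private final Set<String> seen = new HashSet<>();
        private final Queue<String> pending = new ArrayDeque<>();

        @Override
        public synchronized boolean offer(String url) {
            if (!seen.add(url)) {
                return false; // already crawled or already queued
            }
            pending.add(url);
            return true;
        }

        @Override
        public synchronized String poll() {
            return pending.poll();
        }
    }

    public static void main(String[] args) {
        UrlManager manager = new InMemoryUrlManager();
        manager.offer("http://bbs.tianya.cn/list-333-1.shtml");
        manager.offer("http://bbs.tianya.cn/list-333-1.shtml"); // duplicate, ignored
        System.out.println(manager.poll());
        System.out.println(manager.poll());
    }
}
```

Keeping the contract behind an interface means the storage backend can change without touching the crawler code that calls `offer` and `poll`.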