V1.2.2 new features

  • 1. Refactored the low-level implementation and standardized package names;
  • 2. Optimized whitelist filtering for collection threads to avoid redundant failures and retries;
  • 3. Enhanced JS-rendering collection: the new built-in “SeleniumPhantomjsPageLoader” supports loading page data via “selenium + phantomjs”;
  • 4. Support for collecting non-Web pages, such as JSON interfaces, outputting the response data directly; select “NonPageParser”;
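The two new v1.2.2 hooks might be wired up roughly as below. The class names “SeleniumPhantomjsPageLoader” and “NonPageParser” come from the list above, but the package paths, the builder methods (`setPageLoader`, `setPageParser`), and the loader's constructor argument are assumptions to verify against the project samples for your version:

```java
import com.xuxueli.crawler.XxlCrawler;
import com.xuxueli.crawler.loader.strategy.SeleniumPhantomjsPageLoader;
import com.xuxueli.crawler.parser.strategy.NonPageParser;

public class NonPageDemo {
    public static void main(String[] args) {
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setUrls("https://api.example.com/data.json")   // placeholder JSON endpoint
                // Optional: collect JS-rendered pages via selenium + phantomjs
                // (constructor argument assumed to be the phantomjs driver path):
                //.setPageLoader(new SeleniumPhantomjsPageLoader("/usr/local/bin/phantomjs"))
                .setPageParser(new NonPageParser() {
                    @Override
                    public void parse(String url, String pageSource) {
                        // For non-Web pages the raw response body is handed over as-is.
                        System.out.println(url + " -> " + pageSource);
                    }
                })
                .build();
        crawler.start(true);
    }
}
```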

Introduction

XXL-Crawler is a distributed crawler framework. A distributed crawler can be developed in one line of code, with features such as multi-threading, asynchronous execution, dynamic IP proxies, distributed operation, and JS rendering.
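As a sketch of the “one line of code” builder style, the example below follows the API shown in the project's samples; the exact class and method names (`XxlCrawler.Builder`, `setUrls`, `setWhiteUrlRegexs`, `setThreadCount`, `setPageParser`) should be checked against the version in use, and the seed URL and whitelist regex are placeholders:

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import com.xuxueli.crawler.XxlCrawler;
import com.xuxueli.crawler.parser.PageParser;

public class CrawlerDemo {
    public static void main(String[] args) {
        // Build a single-machine crawler: seed URL, whitelist regex, 3 worker threads.
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setUrls("https://my.oschina.net/xuxueli/blog")                    // placeholder seed URL
                .setWhiteUrlRegexs("https://my\\.oschina\\.net/xuxueli/blog/\\d+") // only spread to matching URLs
                .setThreadCount(3)
                .setPageParser(new PageParser<Object>() {
                    @Override
                    public void parse(Document html, Element pageVoElement, Object pageVo) {
                        // Invoked once per collected page; pageVo holds the extracted data.
                        System.out.println(pageVo);
                    }
                })
                .build();
        crawler.start(true);    // true = run synchronously until the URL pool is drained
    }
}
```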


Features

  • 1. Simplicity: the API is intuitive and easy to pick up quickly;
  • 2. Lightweight: the low-level implementation depends only on Jsoup, simple and efficient;
  • 3. Modular: modular structural design, easy to extend;
  • 4. Object-oriented: supports mapping page data to a PageVO object via annotations; the framework automatically extracts and encapsulates the PageVO data and returns it; one or more PageVOs can be extracted from a single page;
  • 5. Multi-threaded: runs in a thread pool to improve collection efficiency;
  • 6. Distributed support: by extending the “RunData” module and sharing run data through Redis or a DB, the crawler can run distributed; the standalone “LocalRunData” implementation is provided by default;
  • 7. JS rendering: by extending the “PageLoader” module, data rendered dynamically by JS can be collected; Jsoup (no JS rendering, faster), HtmlUnit (JS rendering), and Selenium + Phantomjs (JS rendering, high compatibility) are provided natively, and other implementations can be freely added;
  • 8. Retry on failure: failed requests are retried, with a configurable retry count;
  • 9. Proxy IP: counters anti-collection strategies such as WAF rules;
  • 10. Dynamic proxy: the proxy pool can be adjusted dynamically at runtime, with a customizable proxy-routing policy;
  • 11. Asynchronous: supports both synchronous and asynchronous operation;
  • 12. Whole-site spreading: supports using existing URLs as starting points for spreading to crawl an entire site;
  • 13. Deduplication: prevents pages from being crawled repeatedly;
  • 14. URL whitelist: supports setting whitelist regular expressions to filter URLs;
  • 15. Custom request information, such as request parameters, Cookie, Header, UserAgent rotation, Referrer, etc.;
  • 16. Dynamic parameters: request parameters can be adjusted dynamically at runtime;
  • 17. Timeout control: supports setting a timeout for crawler requests;
  • 18. Active pause: the crawler thread pauses actively after processing each page, to avoid being blocked for collecting too frequently;
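The annotation-based PageVO mapping from feature 4 could look roughly like the following. The annotation names `@PageSelect` / `@PageFieldSelect` and their `cssQuery` attribute follow the project's documented style but should be treated as assumptions, and the CSS selectors are placeholders:

```java
import com.xuxueli.crawler.annotation.PageFieldSelect;
import com.xuxueli.crawler.annotation.PageSelect;

// One PageVO instance is extracted per element matching the class-level cssQuery;
// the framework fills the annotated fields automatically at parse time.
@PageSelect(cssQuery = "body")
public class BlogPageVo {

    @PageFieldSelect(cssQuery = ".blog-title")   // placeholder selector: post title
    private String title;

    @PageFieldSelect(cssQuery = ".blog-content") // placeholder selector: post body
    private String content;

    public String getTitle() { return title; }
    public String getContent() { return content; }
}
```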

Documentation

  • Chinese document

Technical communication

  • Community communication