V1.2.2 new features

  • 1. Refactored the low-level implementation and standardized package names;
  • 2. Optimized whitelist filtering for collection threads to avoid redundant failures and retries;
  • 3. Enhanced JS-rendering collection: the new built-in “SeleniumPhantomjsPageLoader” supports loading page data via “selenium + phantomjs”;
  • 4. Support for collecting non-Web pages, such as JSON interfaces, outputting the response data directly; select “NonPageParser”;
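The two new v1.2.2 hooks might be wired up roughly as below. The class names “SeleniumPhantomjsPageLoader” and “NonPageParser” come from the list above, but the package paths, the builder methods (`setPageLoader`, `setPageParser`), and the loader's constructor argument are assumptions to verify against the project samples for your version:

```java
import com.xuxueli.crawler.XxlCrawler;
import com.xuxueli.crawler.loader.strategy.SeleniumPhantomjsPageLoader;
import com.xuxueli.crawler.parser.strategy.NonPageParser;

public class NonPageDemo {
    public static void main(String[] args) {
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setUrls("https://api.example.com/data.json")   // placeholder JSON endpoint
                // Optional: collect JS-rendered pages via selenium + phantomjs
                // (constructor argument assumed to be the phantomjs driver path):
                //.setPageLoader(new SeleniumPhantomjsPageLoader("/usr/local/bin/phantomjs"))
                .setPageParser(new NonPageParser() {
                    @Override
                    public void parse(String url, String pageSource) {
                        // For non-Web pages the raw response body is handed over as-is.
                        System.out.println(url + " -> " + pageSource);
                    }
                })
                .build();
        crawler.start(true);
    }
}
```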

Introduction

XXL-Crawler is a distributed crawler framework. A distributed crawler can be developed in one line of code, with features such as multi-threading, asynchronous execution, dynamic IP proxies, distributed operation, and JS rendering.
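As a sketch of the “one line of code” builder style, the example below follows the API shown in the project's samples; the exact class and method names (`XxlCrawler.Builder`, `setUrls`, `setWhiteUrlRegexs`, `setThreadCount`, `setPageParser`) should be checked against the version in use, and the seed URL and whitelist regex are placeholders:

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import com.xuxueli.crawler.XxlCrawler;
import com.xuxueli.crawler.parser.PageParser;

public class CrawlerDemo {
    public static void main(String[] args) {
        // Build a single-machine crawler: seed URL, whitelist regex, 3 worker threads.
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setUrls("https://my.oschina.net/xuxueli/blog")                    // placeholder seed URL
                .setWhiteUrlRegexs("https://my\\.oschina\\.net/xuxueli/blog/\\d+") // only spread to matching URLs
                .setThreadCount(3)
                .setPageParser(new PageParser<Object>() {
                    @Override
                    public void parse(Document html, Element pageVoElement, Object pageVo) {
                        // Invoked once per collected page; pageVo holds the extracted data.
                        System.out.println(pageVo);
                    }
                })
                .build();
        crawler.start(true);    // true = run synchronously until the URL pool is drained
    }
}
```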


Features

  • 1. Simplicity: the API is intuitive and easy to pick up quickly;
  • 2. Lightweight: the low-level implementation depends only on Jsoup, simple and efficient;
  • 3. Modular: modular structural design, easy to extend;
  • 4. Object-oriented: supports mapping page data to a PageVO object via annotations; the framework automatically extracts and encapsulates the PageVO data and returns it; one or more PageVOs can be extracted from a single page;
  • 5. Multi-threaded: runs in a thread pool to improve collection efficiency;
  • 6. Distributed support: by extending the “RunData” module and sharing run data through Redis or a DB, the crawler can run distributed; the standalone “LocalRunData” implementation is provided by default;
  • 7. JS rendering: by extending the “PageLoader” module, data rendered dynamically by JS can be collected; Jsoup (no JS rendering, faster), HtmlUnit (JS rendering), and Selenium + Phantomjs (JS rendering, high compatibility) are provided natively, and other implementations can be freely added;
  • 8. Retry on failure: failed requests are retried, with a configurable retry count;
  • 9. Proxy IP: counters anti-collection strategies such as WAF rules;
  • 10. Dynamic proxy: the proxy pool can be adjusted dynamically at runtime, with a customizable proxy-routing policy;
  • 11. Asynchronous: supports both synchronous and asynchronous operation;
  • 12. Whole-site spreading: supports using existing URLs as starting points for spreading to crawl an entire site;
  • 13. Deduplication: prevents pages from being crawled repeatedly;
  • 14. URL whitelist: supports setting whitelist regular expressions to filter URLs;
  • 15. Custom request information, such as request parameters, Cookie, Header, UserAgent rotation, Referrer, etc.;
  • 16. Dynamic parameters: request parameters can be adjusted dynamically at runtime;
  • 17. Timeout control: supports setting a timeout for crawler requests;
  • 18. Active pause: the crawler thread pauses actively after processing each page, to avoid being blocked for collecting too frequently;
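The annotation-based PageVO mapping from feature 4 could look roughly like the following. The annotation names `@PageSelect` / `@PageFieldSelect` and their `cssQuery` attribute follow the project's documented style but should be treated as assumptions, and the CSS selectors are placeholders:

```java
import com.xuxueli.crawler.annotation.PageFieldSelect;
import com.xuxueli.crawler.annotation.PageSelect;

// One PageVO instance is extracted per element matching the class-level cssQuery;
// the framework fills the annotated fields automatically at parse time.
@PageSelect(cssQuery = "body")
public class BlogPageVo {

    @PageFieldSelect(cssQuery = ".blog-title")   // placeholder selector: post title
    private String title;

    @PageFieldSelect(cssQuery = ".blog-content") // placeholder selector: post body
    private String content;

    public String getTitle() { return title; }
    public String getContent() { return content; }
}
```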

Documentation

  • Chinese document

Technical communication

  • Community communication