This post introduces a simple crawler framework. The focus is on simplicity, so let's jump straight into trying it.
Getting started

[Demo GIF of the crawler running; it may be slow to load]

Let's walk through it step by step.
```
~ $ npm install crawl-pet -g
```

Install crawl-pet globally.
```
~ $ cd /Volumes/M/download
```

Go to the directory where you want to create the new project.
```
download $ crawl-pet new
```

Create a project and fill in the parameters as prompted.
```
ctrl + c
```

If you need to customize the crawler rules, exit and edit the crawler.js file in the project directory:
```js
module.exports = {

    /*****************
     * Info part
     *****************/

    projectDir : __dirname,
    url        : "https://imgur.com/r/funny",
    outdir     : "/Volumes/M/download/imgur.com",
    saveMode   : "group",
    keepName   : true,
    limits     : 5,
    timeout    : 60000,
    limitWidth : 400,
    limitHeight: 400,
    proxy      : "http://127.0.0.1:1087",
    userAgent  : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    cookies    : null,
    fileTypes  : "png|gif|jpg|jpeg|svg|xml|mp3|mp4|pdf|torrent|zip|rar",
    sleep      : 1000,
    crawl_data : {},
    // crawl_js : "./parser.js",

    /*****************
     * Crawler part
     *****************/

    // init(queen) {},

    prep(queen) {
        let url = queen.head.url;
        let m = url.match(/^(https?:\/\/)?(([\w\-]+\.)?imgur\.com)\/*/i);
        if (m) {
            url = (!m[1] ? 'https://' : '') + url.replace(/\/page(\/\d+(\/hit\.json)?)?$|\/+$/i, '');
            if (!/\/(new|top|hot)$/i.test(url)) {
                url += '/new';
            }
            queen.head.url = url + '/page/0/hit.json';
            queen.save('api_url', url);
            queen.save('page_offset', 0);
        }
    },

    // start(queen) {},
    // filter(url) {},
    // filterDownload(url) {},
    // willLoad(request) {},

    loaded(body, links, files, crawler) {
        if (!/hit\.json/i.test(crawler.url)) {
            return;
        }
        try {
            let json = JSON.parse(body);
            let data = json.data;
            if (!data || data.length === 0) {
                return;
            }
            let add_down = 0;
            for (let pic of data) {
                if (crawler.appendDownload('https://i.imgur.com/' + pic.hash + pic.ext)) {
                    add_down += 1;
                }
            }
            if (add_down) {
                let api_url = crawler.read('api_url');
                let offset  = crawler.read('page_offset');
                let add = 5;
                while (add-- > 0) {
                    offset++;
                    crawler.appendPage(api_url + '/page/' + offset + '/hit.json');
                }
                crawler.save('page_offset', offset);
            }
        } catch (err) {
            // PASS
        }
    },

    // browser(crawler) {}
}
```
To clarify: two hook functions are overridden here, prep(queen) and loaded(body, links, files, crawler). See the project documentation for the others.
prep(queen) is a preprocessing function. It is called the first time a project is run, and again on the first run after a reset. Here it rewrites the start URL to fit the Imgur API, as sketched below.
Imgur's address structure is:

```
https://imgur.com/<section>/<sort method>/page/<page number>/hit.json
```
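To make the rewrite concrete, here is a small standalone sketch (not part of crawl-pet; the helper name toApiUrl and the console output are mine, but the regexes mirror the prep() in crawler.js above):

```js
// Standalone illustration of the URL rewrite performed by prep() above.
// toApiUrl is a made-up helper name used only for this example.
function toApiUrl(url) {
    const m = url.match(/^(https?:\/\/)?(([\w\-]+\.)?imgur\.com)\/*/i);
    if (!m) return null;
    // strip any trailing /page/... segment or trailing slashes, add the protocol if missing
    let base = (!m[1] ? 'https://' : '') + url.replace(/\/page(\/\d+(\/hit\.json)?)?$|\/+$/i, '');
    // default to the "new" sort if no sort method is present
    if (!/\/(new|top|hot)$/i.test(base)) base += '/new';
    return base + '/page/0/hit.json';
}

console.log(toApiUrl('https://imgur.com/r/funny'));
// → https://imgur.com/r/funny/new/page/0/hit.json
```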
loaded(body, links, files, crawler) is called whenever a page has finished loading. The pieces used in this example are:

- body: the text content of the loaded page
- crawler: the crawler object for the current page
- crawler.appendPage(url): adds a page URL to the crawl queue
- crawler.appendDownload(url): adds a file URL to the download queue
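As a rough sketch of how these fit together (my own illustration, not taken from the crawl-pet docs; the image regex and the follow-up URL are placeholders), a loaded() handler typically scans body and feeds the two queues:

```js
module.exports = {
    // ... config keys as in crawler.js above ...

    // Bare-bones loaded() skeleton; only appendDownload/appendPage come from crawl-pet.
    loaded(body, links, files, crawler) {
        // queue every image URL found in the raw page text (illustrative regex)
        for (let src of body.match(/https?:\/\/[^"' ]+\.(?:png|jpe?g|gif)/gi) || []) {
            crawler.appendDownload(src);
        }
        // queue a follow-up page to crawl (placeholder URL)
        crawler.appendPage('https://example.com/page/2');
    }
}
```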
In this example, because the requested page is JSON, we parse the body as JSON and add each image URL to the download queue with appendDownload. appendDownload returns false if the file has already been downloaded, so the counter only tracks genuinely new images. If any new images were added, the next five page URLs are generated and queued with appendPage.
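For reference, the handler assumes a response shaped roughly like this (only the data array and the hash/ext fields are taken from the code; the sample values and any other fields in Imgur's real payload are hypothetical):

```js
// Hypothetical, trimmed-down hit.json body as assumed by loaded() above.
const body = `{
    "data": [
        { "hash": "CstcePq", "ext": ".png" },
        { "hash": "aBcDeFg", "ext": ".jpg" }
    ]
}`;

// Each entry becomes a direct image URL: 'https://i.imgur.com/' + pic.hash + pic.ext
for (let pic of JSON.parse(body).data) {
    console.log('https://i.imgur.com/' + pic.hash + pic.ext);
}
// → https://i.imgur.com/CstcePq.png
// → https://i.imgur.com/aBcDeFg.jpg
```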
One more useful utility command: you can look up a file's original download link from its local file name.
```
$ crawl-pet -f local "CstcePq.png"
```
See the help for more commands:

```
~ $ crawl-pet -h
```
——————————————————————-
GitHub address: github.com/wl879/crawl…
There are some examples in the crawlers folder of the project.