Introduction: Most front-end developers have used command-line tools, but far fewer have ever built one. This article aims to help you develop a practical (squint smile) image-crawler command-line application in the shortest possible time.

For a better reading experience, please read the original post on my blog, the front-end inn of Tuoba. The project address is also given in the appendix below.

An introduction to Puppeteer

What is Puppeteer?

Puppeteer is the official headless Chrome tool from the Google Chrome team. Since Chrome leads the browser market, headless Chrome is poised to become the industry benchmark for automated testing of web applications, so it is well worth understanding.

What can Puppeteer do?

There are many things Puppeteer can do, including but not limited to:

  • Generate screenshots and PDFs of web pages
  • Crawl content from websites
  • Automate form submission, UI testing, keyboard input, and more
  • Create an up-to-date automated testing environment (Chrome) in which to run your test cases directly
  • Capture a timeline trace of your site to help diagnose performance issues

What are the advantages of Puppeteer?

  • Compared with a real browser, it spends less on loading CSS/JS and rendering pages, so a headless browser is much faster.
  • It can run on a server or in CI without a display, which reduces outside interference and is more stable.
  • A single machine can run many headless browser instances at once, which makes concurrent runs easy.

How do I install Puppeteer?

Installation is a one-liner: npm i --save puppeteer or yarn add puppeteer

Note that because the code uses ES2017 async/await syntax, Node should be v7.6.0 or above.

How do I use Puppeteer?

As this article is not dedicated to Puppeteer, we'll skip this part; see the links below.

Puppeteer Github

Puppeteer Api Doc

Puppeteer Chinese Api Doc

So what does Puppeteer have to do with the command-line application we want to build? Instead of a traditional request-based crawler, we will build a command-line tool that drives headless Chrome through Puppeteer and captures images straight from the DOM, which effectively sidesteps some anti-crawler defenses.

Simple application of Puppeteer

Case 1. Screenshot

Straight to the code; it is easy to follow:

const puppeteer = require("puppeteer");

const getScreenShot = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://baidu.com");
  await page.screenshot({ path: "baidu.png" });

  await browser.close();
};

getScreenShot();

This code launches a browser (here with headless: false so you can watch it work), opens a new tab, navigates to Baidu, takes a screenshot saved as baidu.png, and closes the browser.

The result is as follows:

Case 2. Capture website information

Next, let's see how to crawl website content with Puppeteer.

This time we will grab book listings from JD.com.

// book info spider
const puppeteer = require("puppeteer");
const fs = require("fs");

const spider = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto("https://search.jd.com/Search?keyword=javascript");

  const result = await page.evaluate(() => {
    let elements = document.querySelectorAll(".gl-item");

    const data = [...elements].map(i => {
      return {
        name: i.querySelector(".p-name em").innerText,
        description: i.querySelector(".p-name i").innerText,
        price: i.querySelector(".p-price").innerText,
        shop: i.querySelector(".p-shopnum").innerText,
        url: i.querySelector(".p-img a").href
      };
    });
    return data; // Return the scraped data
  });

  browser.close();
  return result;
};

spider().then(value => {
  fs.writeFile(`${__dirname}/javascript.json`, JSON.stringify(value), err => {
    if (err) {
      throw err;
    }
    console.log("file saved!");
  });
  console.log(value); // Success!
});

What we do here: navigate to the search results page for the keyword javascript, analyze the page's DOM structure to find each book's title, description, price, publisher, and product link, and then write the data into a javascript.json file so it is easy to save and browse.

The logic is simple. This is already the prototype of a crawler, and the result is a JSON file like the one shown below, which is quite satisfying.
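Each entry in the file has the shape below (the values are placeholders, since the real ones depend on the live page):

[
  {
    "name": "<book title>",
    "description": "<promotional tagline>",
    "price": "<price text>",
    "shop": "<publisher or shop name>",
    "url": "https://item.jd.com/..."
  }
]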

Case 3. Image crawler

An image crawler: this is the theme of the command-line application we are going to build.

The basic idea is this:

Open the browser -> go to Baidu Images -> focus the search input box -> type the keyword -> click the search button -> land on the results page -> scroll to the bottom to trigger lazy loading -> read the src of every image from the DOM -> save each image locally by its src -> close the browser

Code implementation:

First, the browser-manipulation part:

const browser = await puppeteer.launch(); // Launch the browser
const page = await browser.newPage(); // Open a new tab
await page.goto("https://image.baidu.com"); // Navigate to Baidu Images
console.log("go to https://image.baidu.com");

await page.focus("#kw"); // Focus the search input box
await page.keyboard.sendCharacter("cats"); // Type the keyword
await page.click(".s_search"); // Click the search button
console.log("go to search list"); // We should now be on the results page

Then comes the image-processing part:

page.on("load".async() = > {await autoScroll(page); // Scroll down to load the image
  console.log("page loading done, start fetch...");
  const srcs = await page.evaluate((a)= > {
    const images = document.querySelectorAll("img.main_img");
    return Array.prototype.map.call(images, img => img.src);
  }); // Get all img SRC
  console.log(`get ${srcs.length} images, start download`);
  for (let i = 0; i < srcs.length; i++) {
    await convert2Img(srcs[i], target);
    console.log(`finished ${i + 1}/${srcs.length} images`);
  } // Save the image
  console.log(`job finished! `);
  await browser.close();
});

Baidu Images lazy-loads more results as you scroll down, so to collect more images we scroll for a while first. Then we read the src of every image in the list from the DOM, and finally download them.
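The autoScroll helper is not shown in the snippet above; a minimal sketch of the usual Puppeteer pattern might look like this (the step size and safety cap are arbitrary choices):

function autoScroll(page) {
  // Scroll inside the page context in small steps until we reach
  // the bottom (or a safety cap), resolving when done.
  return page.evaluate(() => {
    return new Promise(resolve => {
      let totalHeight = 0;
      const distance = 100; // pixels per step
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;
        if (totalHeight >= document.body.scrollHeight || totalHeight > 10000) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}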

Run it to get a list of cat pictures:

Only the main flow of the image-download step is shown here; the more detailed code can be found on GitHub.
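For completeness, here is a hypothetical sketch of what a convert2Img(src, target) helper could look like; the real implementation lives in the repo. Baidu Images serves both inline base64 data URIs and plain URLs, so both cases are handled:

const fs = require("fs");
const path = require("path");
const https = require("https");

// Hypothetical helper: save a single image src into the target directory.
function convert2Img(src, target) {
  return new Promise((resolve, reject) => {
    if (src.startsWith("data:image")) {
      // Inline base64 image: decode and write directly
      const base64 = src.split(",")[1];
      const file = path.join(target, `${Date.now()}.png`);
      fs.writeFile(file, Buffer.from(base64, "base64"), err =>
        err ? reject(err) : resolve(file)
      );
    } else {
      // Remote image: stream the response to disk
      const file = path.join(target, `${Date.now()}.jpg`);
      https
        .get(src, res => {
          res.pipe(fs.createWriteStream(file)).on("finish", () => resolve(file));
        })
        .on("error", reject);
    }
  });
}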

So far, we have developed a basic image crawler using Node and Puppeteer.

How to optimize?

The image crawler is still rather crude at this point. Our goal is an interactive command-line application, so it can't stop there. What can be further improved? After a little thought, I listed the following:

  • The keyword for the downloaded images can be customized
  • Users can choose how many images to download
  • Supports command-line arguments
  • Supports command-line interaction
  • A pleasant interactive interface
  • Can be run directly with a double click
  • Can be invoked globally from the command line

Use commander.js to support command-line arguments

Commander is a lightweight, expressive, and powerful command-line framework that handles user input and argument parsing.

const program = require("commander");

program
  .version("0.0.1")
  .description("a test cli program")
  .option("-n, --name <name>", "your name", "zhl")
  .option("-a, --age <age>", "your age", "22")
  .option("-e, --enjoy [enjoy]")
  .action(option => {
    console.log("name: ", option.name);
    console.log("age: ", option.age);
    console.log("enjoy: ", option.enjoy);
  });

program.parse(process.argv);

Commander is easy to use: the code above defines a command's inputs in just a few lines. Specifically:

  • version defines the version number
  • description defines the description
  • option defines an input option and takes three arguments, as in option("-n, --name <name>", "your name", "zhl"): the first declares the flags, where <> marks a required value and [] an optional one; the second, "your name", is the help text telling the user what to enter; the third, "zhl", is the default value.
  • action defines the operation to perform as a callback; its argument is the parsed options, and any option the user omits falls back to its default.

For more detailed apis, refer to the Commander API documentation.

Execute the above script to get:

This gives us simple interaction on the command line, but it's not much to look at yet. Don't worry, read on.

Use Inquirer to create interactive command-line applications

Inquirer makes it easy to build a beautiful, embeddable command-line interface for Node.

Q&a command input can be provided:

Can provide a variety of forms of selection interface:

Input information can be verified:

Finally, input information can be processed:

The screenshots above come from an official Inquirer example; the source can be found in pizza.js.

The Inquirer documentation can be viewed at Inquirer Documents.

With Inquirer, we can make more sophisticated interactive command-line tools.
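As a taste, here is a minimal sketch (mine, not from the post's repo) combining an input prompt with validation and a list-style choice, roughly the questions our crawler needs to ask:

const inquirer = require("inquirer");

inquirer
  .prompt([
    {
      type: "input",
      name: "key",
      message: "What images do you want to download?",
      validate: input => (input ? true : "the keyword cannot be empty")
    },
    {
      type: "list",
      name: "number",
      message: "How many images do you want?",
      choices: ["20", "50", "100"]
    }
  ])
  .then(answers => {
    console.log(answers); // e.g. { key: 'cats', number: '50' }
  });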

Use chalk.js to make the interface aesthetically pleasing

Chalk’s syntax is very simple:

const chalk = require('chalk');
const log = console.log;

// Combine styled and normal strings
log(chalk.blue('Hello') + ' World' + chalk.red('!'));
// Compose multiple styles using the chainable API
log(chalk.blue.bgRed.bold('Hello world!'));
// Pass in multiple arguments
log(chalk.blue('Hello', 'World!', 'Foo', 'bar', 'biz', 'baz'));
// Nest styles
log(chalk.red('Hello', chalk.underline.bgBlue('world') + '!'));
// Nest styles of the same type even (color, underline, background)
log(chalk.green(
  'I am a green line ' +
  chalk.blue.underline.bold('with a blue substring') +
  ' that becomes green again!'
));

Run it and the output makes each style clear at a glance:

Let’s do something more interesting…

Some of you may have seen Zhihu's console easter egg below. Since we want to do something fun, let's add the same kind of effect to our command-line program to raise its cool factor.

First, prepare a piece of ASCII art to print. Search for "text to ASCII" and you will find plenty of online converters. For our command-line image spider, we will generate ASCII art for the string IMG SPD.

After picking a style, the result looks like this:

How do we print such a complicated string? Saving it in an ordinary string literal won't work; the escaping turns it into a mess.

To preserve the exact layout, one trick is to store the art as a comment. And what can carry a comment around at runtime? A function.

So it comes down to printing a function: call its toString() method, then use a regular expression to extract the comment in the middle.
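A minimal sketch of the trick (the art here is a stand-in for the real IMG SPD banner):

function logo() {
  /*
   ___ __  __  ____   ____  ____  ____
  |_ _|  \/  |/ ___| / ___||  _ \|  _ \
   | || |\/| | |  _  \___ \| |_) | | | |
   | || |  | | |_| |  ___) |  __/| |_| |
  |___|_|  |_|\____| |____/|_|   |____/
  */
}

// toString() preserves the function source, comment included;
// the regular expression pulls out everything between /* and */.
const art = logo.toString().match(/\/\*([\s\S]*?)\*\//)[1];
console.log(art);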

Finally take a look at the effect, dang dang dang ~

Support double click operation

A technique called Shebang is used.

Shebang (also known as hashbang) is the character sequence consisting of # and !, i.e. #!, appearing as the first two characters on the first line of a text file. When a file starts with a shebang, the program loader on Unix-like operating systems parses the rest of that line as an interpreter directive and invokes that interpreter, passing the path of the shebang file as an argument.

For Node we use #!/usr/bin/env node

At this point we can drop the .js file extension.
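For example, assuming the entry file is the app file referenced by the bin field in the next section, its first line carries the shebang; on Unix-like systems you also mark the file executable with chmod +x app:

#!/usr/bin/env node
// app: entry point. The shebang above lets the loader run this file
// with node, so the .js extension is no longer needed.
require("./index"); // hypothetical: hand off to the actual program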

Add the command to the PATH to support global invocation

Add a bin field to package.json:

"bin": {
  "img-spd": "app"
},

Running npm link symlinks the package into npm's global node_modules folder and creates a symbolic link (soft link) for the img-spd command in npm's global bin directory, which is already on your PATH.

After that, you can run img-spd from any directory.
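Putting it together (a hypothetical session; the -k and -n flags are described in the appendix):

$ npm link                # run once, inside the project directory
$ cd ~/anywhere           # any directory will do now
$ img-spd -k cats -n 50   # download 50 cat images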

Wrapping up

At this point every item on the improvement list has been addressed. Come and see the finished product ~

Looking at a whole folder of Gakki, I feel happiness about to overflow.

Finally, use a GIF to show it:

Appendix

Project address

Install

npm install -g img-spd

Usage

img-spd

or

Usage: img-spd [options]

img-spd is a spider that gets images from image.baidu.com

Options:
  -v, --version              output the version number
  -k, --key [key]            input the image keywords to download
  -i, --interval [interval]  input the operation interval (ms, default 200)
  -n, --number [number]      input the number of images to download
  -m, --headless [headless]  choose whether the program runs in headless mode
  -h, --help                 output usage information