Node is powerful and opens up more possibilities for front-end developers. Today I'll share how I wrote a headless crawler with Node. leeing.site/2018/10/17/…

Tools used

  • puppeteer
  • commander
  • inquirer
  • chalk

Let's look at what each of these tools does.

puppeteer

Headless crawling relies on it. It simulates a user opening a web page without showing a browser window. If you have written automated tests, this will feel familiar, because the crawling process is almost identical to an automated test run.

commander

A Node-based library for building command-line tools. With it we can easily define all kinds of CLI commands.

inquirer

An interactive command-line tool. What is an interactive command line? Think of npm init: it asks a question, you answer, and at the end it generates package.json from your answers.

chalk

This one simply makes the text we print to the command line look nicer, for example by coloring it.
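
To make that concrete, here is a minimal sketch of the idea behind colored terminal output, written without chalk itself: the text is wrapped in ANSI escape codes that the terminal interprets as color changes. The helper name green is hypothetical, and this only illustrates the mechanism; it is not chalk's implementation.

```javascript
// Hypothetical helper: wrap text in the ANSI codes for a green foreground.
// \x1b[32m switches the foreground to green, \x1b[39m resets it to the default.
const green = (text) => `\x1b[32m${text}\x1b[39m`;

console.log(green('download finished'));
```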

Ok, with the tools introduced, let’s begin our project in earnest.

Project introduction

First, we need to decide which features to implement. We want to type the kind of image we are looking for on the command line, and have Node crawl the web and download matching images straight to a local directory. We also want a second command that clears the images in the output directory.

File directory

|-- Documents
    |-- .gitignore
    |-- README.md
    |-- package.json
    |-- bin
    |   |-- gp
    |-- output
    |   |-- .gitkeeper
    |-- src
        |-- app.js
        |-- clean.js
        |-- index.js
        |-- config
        |   |-- default.js
        |-- helper
            |-- questions.js
            |-- regMap.js
            |-- srcToImg.js

This is the project's simple directory structure:

  • output stores the downloaded images
  • bin holds the script file used by the CLI tool
  • src is where most of the code lives
    • index.js is the project entry file
    • app.js is the main function file
    • clean.js clears the downloaded images
    • config stores configuration
    • helper stores helper methods
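
The article does not list src/config/default.js. As a minimal sketch, assuming it exports the three fields that app.js reads (searchPath, keyword, outputPath), it might look like the object below. The concrete values are placeholders, not the project's real settings, and mergeConf is a hypothetical helper that shows how command-line answers can override such defaults with Object.assign.

```javascript
const path = require('path');

// Hypothetical defaults: the field names come from app.js, the values are assumptions.
const defaults = {
    searchPath: 'https://image.baidu.com',
    keyword: '',
    outputPath: path.resolve('output'),
};

// Later sources win, so user answers override the defaults;
// the empty target object keeps `defaults` itself untouched.
function mergeConf(conf) {
    return Object.assign({}, defaults, conf);
}

const conf = mergeConf({ keyword: 'cat' });
console.log(conf.keyword);
console.log(conf.searchPath);
```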

Start project

First let’s take a look at app.js.

We wrap the core methods in a class so the command-line tool can call them easily.

The class is simple: the constructor receives the arguments and start runs the main process. start is an async function because almost every puppeteer operation on the browser is asynchronous.

We then use puppeteer to create a page instance and call its goto method to open the Baidu image search page. This is the same as opening a browser and navigating to Baidu Images ourselves, except that the browser is headless, so we never see it open.

Next we set the browser viewport size, which should be neither too large nor too small. Too large tends to trigger Baidu's anti-crawler mechanism, so the images we fetch come back as 403 or other errors. Too small leaves very few images to crawl.

Next we focus the search box, enter the keyword to search for (the one typed on the command line), and click Search.

After the page loads, we use page.$$eval to select every image on the page with the class main_img (you need to inspect the page yourself to find the right selector), read each image's src attribute, and turn each src into a local image file.

At this point, app.js is done. That’s easy.

Here’s the code.

const puppeteer = require('puppeteer');
const chalk = require('chalk');
const config = require('./config/default');
const srcToImg = require('./helper/srcToImg');

class App {
    constructor(conf) {
        // Merge the passed parameters over the default parameters
        this.conf = Object.assign({}, config, conf);
    }

    async start () {
        // Use puppeteer to create a browser instance,
        // then create a page instance from the browser
        const browser = await puppeteer.launch();
        const page = await browser.newPage();

        // Open the search engine (hard-coded to Baidu for now)
        await page.goto(this.conf.searchPath);
        console.log(chalk.green(`go to ${this.conf.searchPath}`));

        // Set the window size; too large triggers the anti-crawler mechanism
        await page.setViewport({
            width: 1920,
            height: 700
        });

        // Focus the search text input
        await page.focus('#kw');

        // Enter the keyword to search for
        await page.keyboard.sendCharacter(this.conf.keyword);

        // Click search
        await page.click('.s_search');
        console.log(chalk.green(`get start searching pictures`));

        // What to do after the page loads
        page.on('load', async () => {
            console.log(chalk.green(`searching pictures done, start fetch...`));
            // Get the src of all matching images
            const srcs = await page.$$eval('img.main_img', pictures => {
                return pictures.map(img => img.src);
            });
            console.log(chalk.green(`get ${srcs.length} pictures, start download`));

            srcs.forEach(async (src) => {
                // Brief pause (page.waitFor is deprecated in later Puppeteer releases)
                await page.waitFor(200);
                await srcToImg(src, this.conf.outputPath);
            });
        });
    }
};

module.exports = App;
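
One detail worth noting in the loop above: forEach does not wait for async callbacks, so all the 200 ms pauses start at roughly the same time and the downloads still fire together. If the goal is to space the requests out, one possible variation is a plain for...of loop. This is a sketch rather than the project's code; downloadSequentially, sleep, and the download parameter are hypothetical names, with download standing in for srcToImg.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process the sources one at a time, pausing before each download.
// `download` is any function that takes (src, dir) and may return a promise.
async function downloadSequentially(srcs, dir, download, delayMs = 200) {
    for (const src of srcs) {
        await sleep(delayMs);
        await download(src, dir);
    }
}

// Usage with a stand-in downloader that only logs.
downloadSequentially(['a.jpg', 'b.jpg'], './output', async (src, dir) => {
    console.log(`would save ${src} into ${dir}`);
}, 10);
```

Because each iteration awaits the previous one, a failed download also stops the loop, which may or may not be what you want; wrapping the download call in try/catch is one way to keep going.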

Now let's see how to turn an image's src attribute into a local image. Let's look at srcToImg.js in helper.

First, this module pulls in Node's http, https, path, and fs modules, along with some helpers: the regular expressions, promisify (which turns callback-style functions into promise-returning ones), and chalk for nicer output.

Why do we need both the http and https modules? If you look closely at the images in Baidu's image search results, you will find both http and https URLs. So we import both modules, work out which protocol each image uses, and request it with the matching module. Once the image is requested, we use the fs module's createWriteStream method to store it in our output directory.

Looking closely at the src values in the search results, we also find that besides http and https URLs there are base64 images, so those need handling too.

They are handled much like ordinary images: first derive the extension from the src, then compute the storage path and file name, and finally write the file with the fs module's writeFile method (plain writeFile is enough here).

With that, the image is stored locally.
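
The base64 branch described above can be illustrated in isolation. The sketch below assumes the src is a data URI of the form data:<mime>;base64,<payload>. The function name parseDataUri and the regular expression are hypothetical (the project keeps its own pattern in regMap.js, which this article does not show).

```javascript
// Hypothetical pattern: capture the MIME type and the base64 payload of a data URI.
const DATA_URI = /^data:(.+?);base64,(.+)$/;

// Return the file extension and decoded bytes, or null if src is not a base64 data URI.
function parseDataUri(src) {
    const match = src.match(DATA_URI);
    if (!match) {
        return null;
    }
    // 'image/jpeg' -> 'jpeg' -> 'jpg'
    const ext = match[1].split('/')[1].replace('jpeg', 'jpg');
    return { ext, data: Buffer.from(match[2], 'base64') };
}

const parsed = parseDataUri('data:image/jpeg;base64,aGVsbG8=');
console.log(parsed.ext, parsed.data.length);
```

The returned buffer can be handed to fs.writeFile directly, which has the same effect as passing the base64 string together with the 'base64' encoding.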

Here’s the code.

const http = require('http');
const https = require('https');
const path = require('path');
const fs = require('fs');
const { promisify } = require('util');
const chalk = require('chalk');
const writeFile = promisify(fs.writeFile);
const regMap = require('./regMap');

// Download an image from an http(s) URL into dir.
// promisify appends a Node-style callback as the last argument.
const urlToImg = promisify((url, dir, callback) => {
    // Pick the module that matches the URL's protocol
    let mod;
    if(regMap.isHttp.test(url)){
        mod = http;
    }else if(regMap.isHttps.test(url)){
        mod = https;
    }
    // Get the extension of the image
    const ext = path.extname(url);
    // Build the storage path from a random file name plus the extension
    const file = path.join(dir, `${parseInt(Math.random() * 1000000)}${ext}`);

    mod.get(url, res => {
        // Stream the response to disk, which is faster than writing it in one go
        res.pipe(fs.createWriteStream(file)).on('finish', () => {
            console.log(file);
            callback(null, file);
        });
    }).on('error', callback);
});

const base64ToImg = async (base64Str, dir) => {
    // matchs[1] is expected to hold the MIME type, matchs[2] the base64 payload
    const matchs = base64Str.match(regMap.isBase64);
    try {
        const ext = matchs[1].split('/')[1].replace('jpeg', 'jpg');
        const file = path.join(dir, `${parseInt(Math.random() * 1000000)}.${ext}`);

        await writeFile(file, matchs[2], 'base64');
        console.log(file);
    } catch (error) {
        console.log(chalk.red('Unrecognized picture'));
    }
};

module.exports = (src, dir) => {
    if(regMap.isPic.test(src)){
        return urlToImg(src, dir);
    }
    return base64ToImg(src, dir);
};
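
The helper module regMap.js is used throughout but never listed. Here is a minimal sketch, assuming it simply exports four regular expressions under the names the code above relies on (isHttp, isHttps, isBase64, isPic). The patterns themselves are guesses for illustration and may differ from the project's, and classify is a hypothetical function added only to show how a src would be sorted.

```javascript
// Hypothetical contents for src/helper/regMap.js; only the property names come from the article.
const regMap = {
    isHttp: /^http:\/\//,
    isHttps: /^https:\/\//,
    isBase64: /^data:(.+?);base64,(.+)$/,
    // Matches a common image extension at the end of a URL or file name
    isPic: /\.(jpe?g|png|gif|bmp|webp)$/i,
};

// Label a src by the kind of handling it needs under these patterns.
function classify(src) {
    if (regMap.isBase64.test(src)) {
        return 'base64';
    }
    if (regMap.isPic.test(src) && regMap.isHttps.test(src)) {
        return 'https';
    }
    if (regMap.isPic.test(src) && regMap.isHttp.test(src)) {
        return 'http';
    }
    return 'unknown';
}

console.log(classify('https://example.com/a.jpg'));
console.log(classify('data:image/png;base64,AAAA'));
```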

Now let's see how to empty the images under output. Here we use Node's fs module again. First we read the output folder with fs.readdir, then iterate over the files in it. If a file is an image, we delete it with fs.unlink. Easy, right?

Here's the code.

const fs = require('fs');
const regMap = require('./helper/regMap');
const config = require('./config/default');
const cleanPath = config.outputPath;

class Clean {
    constructor() {}

    clean() {
        // Read the output directory and delete every image file in it
        fs.readdir(cleanPath, (err, files) => {
            if(err){
                throw err;
            }
            files.forEach(file => {
                if(regMap.isPic.test(file)){
                    const img = `${cleanPath}/${file}`;
                    fs.unlink(img, (e) => {
                        if(e) {
                            throw e;
                        }
                    });
                }
            });
            console.log('clean finished');
        });
    }
};

module.exports = Clean;
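
In the code above, 'clean finished' is printed as soon as the unlink calls have been issued, not when they have completed. If the message should appear only after every file is gone, one option is the promise-based fs API. The following is a sketch under assumptions, not the project's code: cleanDir and IMAGE_EXT are hypothetical names, and the extension pattern stands in for regMap.isPic.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical stand-in for regMap.isPic
const IMAGE_EXT = /\.(jpe?g|png|gif|bmp|webp)$/i;

// Delete every image file directly inside `dir` and resolve with the names removed.
async function cleanDir(dir) {
    const files = await fs.promises.readdir(dir);
    const images = files.filter((file) => IMAGE_EXT.test(file));
    // Start all deletions, then wait until every one has finished
    await Promise.all(images.map((file) => fs.promises.unlink(path.join(dir, file))));
    return images;
}
```

A caller could then await cleanDir with the configured output path and log 'clean finished' afterwards, so the message reflects the actual state of the directory.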

Finally, how do we write the CLI tool? First, create a new script file named gp in the bin directory, as follows:

#!/usr/bin/env node
module.exports = require('../src/index');

The first line tells the system to look up node through /usr/bin/env and use it to run the rest of the script, starting from line 2.

Next, we add a bin object to package.json. Each property name is the name of a command, and its value is the path to the script file under bin, as follows:

"bin": {
  "gp": "bin/gp"
}

Next, let’s look at index.js

const program = require('commander');
const inquirer = require('inquirer');
const pkg = require('../package.json');
const qs = require('./helper/questions');
const App = require('./app');
const Clean = require('./clean');

program
    .version(pkg.version, '-v, --version');

program
    .command('search')
    .alias('s')
    .description('get search pictures what you want.')
    .action(async () => {
        // Ask for the keyword, then start the crawler with the answers
        const answers = await inquirer.prompt(qs.startQuestions);
        const app = new App(answers);
        await app.start();
    });

program
    .command('clean')
    .alias('c')
    .description('clean all pictures in directory "output".')
    .action(async () => {
        // Only clean after the user confirms
        const answers = await inquirer.prompt(qs.confirmClean);
        const clean = new Clean();
        answers.isRemove && await clean.clean();
    });

program.parse(process.argv);

// No command given: show the help text
if(process.argv.length < 3){
    program.help();
}

Here we bring in commander and inquirer. program.command defines a command name, alias is the command's short form, description is its description, and action is what the command does.

We first use command to define two commands, search and clean, and then use inquirer inside each action. Asking questions with inquirer is asynchronous, so we use async and await here too. inquirer.prompt receives an array of questions, each specifying the question's type, name, message, validation method, and so on; see inquirer's documentation for details. Our questions module exports two arrays, one for entering the search keyword and one for confirming the cleanup. The keyword question validates that a keyword was actually entered. If not, it prompts for the keyword again instead of proceeding; otherwise the crawl starts. The cleanup question is a simple confirmation, and the images are deleted only if the user confirms. Finally, program.parse parses Node's process.argv, and if no command was passed on the command line, the help text is printed.
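
Why does the check use process.argv.length < 3? Node always fills the first two entries of process.argv with the path of the node binary and the path of the script, so the first word the user typed sits at index 2. The sketch below mimics the name and alias lookup in plain JavaScript to make that concrete. It is an illustration only, not how commander is implemented; resolveCommand and the commands table are hypothetical.

```javascript
// Hypothetical command table mirroring the two commands and aliases defined in index.js.
const commands = [
    { name: 'search', alias: 's' },
    { name: 'clean', alias: 'c' },
];

// Map an argv array to a command name, 'help' when nothing was typed,
// or null when the word matches no command.
function resolveCommand(argv) {
    if (argv.length < 3) {
        return 'help';
    }
    const word = argv[2];
    const found = commands.find((cmd) => cmd.name === word || cmd.alias === word);
    return found ? found.name : null;
}

console.log(resolveCommand(['node', 'gp', 's']));
console.log(resolveCommand(['node', 'gp']));
```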

At this point our program is complete. All that is left is to publish it to npm so that others can download and use it. Publishing to npm is not covered here; if you are unfamiliar with it, a quick search online will turn up plenty of guides.

src/helper/questions.js is as follows:

const config = require('../config/default');

exports.startQuestions = [
    {
        type: 'input',
        name: 'keyword',
        message: 'What pictures do you want to get?',
        validate: function(keyword) {
            // Switch inquirer to callback-style (async) validation
            const done = this.async();
            if(keyword.trim() === ''){
                done('Please enter the keyword to get pictures');
                return;
            }
            done(null, true);
        }
    }
];

exports.confirmClean = [
    {
        type: 'confirm',
        name: 'isRemove',
        message: `Do you want to remove all pictures in ${config.outputPath}?`,
        default: true,
    }
];
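
The validate function above uses inquirer's callback style via this.async(). inquirer also accepts a validate function that simply returns true for valid input or a string to show as the error message, which is easier to test in isolation. Here is a minimal sketch of that form; validateKeyword is a hypothetical name, and the question object is a variation on the one above rather than the project's code.

```javascript
// Return true when a keyword was entered, otherwise the message to show the user.
function validateKeyword(keyword) {
    if (keyword.trim() === '') {
        return 'Please enter the keyword to get pictures';
    }
    return true;
}

// The same question as above, written with the return-value style of validate.
const startQuestions = [
    {
        type: 'input',
        name: 'keyword',
        message: 'What pictures do you want to get?',
        validate: validateKeyword,
    },
];

console.log(validateKeyword('cat'));
console.log(validateKeyword('   '));
```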

Installing the project

npm i get_picture -g

Reference links

  • Project repository: github.com/1eeing/get_…
  • Puppeteer: github.com/GoogleChrom…
  • Commander: github.com/tj/commande…
  • Inquirer: github.com/SBoudrias/I…