Node is known to be powerful and opens up many more possibilities for the front end. Today I’ll share how I wrote a headless crawler using Node. Original post: leeing.site/2018/10/17/…
Tools used
- puppeteer
- commander
- inquirer
- chalk
Let’s first talk about what each of these tools does.
puppeteer
The headless crawler relies on this. It can simulate a user visiting a web page without ever showing a browser window. If you have written automated tests, this should feel familiar, because the crawling process is almost identical to an automated testing flow.
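If you haven’t used it before, here’s a minimal sketch of driving a headless browser with puppeteer; the URL and screenshot path are just placeholders:

```js
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser and open a fresh page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Navigate like a real user would, then grab a screenshot as proof
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
```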
commander
A Node-based tool for building CLI (command-line) programs. With it, we can easily write all kinds of CLI commands.
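As a quick taste (a minimal sketch, not part of this project), registering a command with the commander 2.x-style API this article uses looks like this:

```js
const program = require('commander');

program
  .command('hello')
  .description('print a greeting')
  .action(() => {
    console.log('hello from our CLI');
  });

// Parse the arguments the user typed
program.parse(process.argv);
```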
inquirer
An interactive command-line tool. What does “interactive” mean here? It works just like `npm init`: the tool asks a question, you answer it, and the result is generated from your answers (in `npm init`’s case, a package.json).
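A minimal sketch of that question-and-answer flow (the question itself is invented for illustration):

```js
const inquirer = require('inquirer');

(async () => {
  // inquirer.prompt takes an array of questions and resolves with the answers
  const answers = await inquirer.prompt([
    { type: 'input', name: 'name', message: 'Project name?' }
  ]);
  console.log(`You answered: ${answers.name}`);
})();
```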
chalk
This is just a tool to make the text we output on the command line more elegant.
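A couple of lines is all it takes (a trivial sketch):

```js
const chalk = require('chalk');

console.log(chalk.green('everything is fine'));
console.log(chalk.red('something went wrong'));
```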
Ok, with the tools introduced, let’s begin our project in earnest.
Project introduction
First, we need to figure out what features we want to implement. The idea: type the keyword of the images we want on the command line, and Node goes to the network, crawls the matching images, and downloads them straight to our local machine. We also want a command that empties the images in our output directory.
File directory
|-- Documents
|-- .gitignore
|-- README.md
|-- package.json
|-- bin
|   |-- gp
|-- output
|   |-- .gitkeeper
|-- src
    |-- app.js
    |-- clean.js
    |-- index.js
    |-- config
    |   |-- default.js
    |-- helper
        |-- questions.js
        |-- regMap.js
        |-- srcToImg.js
This is a simple directory structure for the project
- output stores the downloaded images
- bin holds the script file used by the CLI tool
- src is where most of the code lives
- index.js is the project entry file
- app.js is the main logic file
- clean.js handles clearing out the downloaded images
- config stores configuration
- helper stores the helper modules
Start project
First let’s take a look at app.js.
We wrapped our core methods in a class so that the command-line tool can call them more easily.
The class is simple: the constructor receives the arguments, and start kicks off the main process. start is an async function, because almost everything puppeteer does with the browser is asynchronous.
Then we use puppeteer to create a page instance and call its goto method to simulate navigating to Baidu Images. This is exactly what happens when we open a browser and visit Baidu Images ourselves, except that, being headless, we never see the browser open.
Next we need to set the viewport size: not too big, not too small. Too big tends to trigger Baidu’s anti-crawler mechanism, so the images we crawl come back as 403s or other errors; too small leaves us with very few images to crawl.
Then we focus the search box, type in the keyword we want to search for (the one entered on the command line), and click search.
After the page loads, we use page.$$eval to grab every image on the page with the class main_img (you’ll need to inspect the page yourself to work out the right selector), read each one’s src attribute, and convert each src into a local image.
At this point, app.js is done. That’s easy.
Here’s the code.
const puppeteer = require('puppeteer');
const chalk = require('chalk');
const config = require('./config/default');
const srcToImg = require('./helper/srcToImg');

class App {
  constructor(conf) {
    // Merge any passed-in parameters over the default parameters
    this.conf = Object.assign({}, config, conf);
  }

  async start () {
    // Have puppeteer launch a browser instance,
    // then create a page instance from that browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Open the search engine (Baidu for now)
    await page.goto(this.conf.searchPath);
    console.log(chalk.green(`go to ${this.conf.searchPath}`));
    // Set the viewport size; too large will trigger the anti-crawler mechanism
    await page.setViewport({
      width: 1920,
      height: 700
    });
    // Focus the search input box
    await page.focus('#kw');
    // Type the keyword to search for
    await page.keyboard.sendCharacter(this.conf.keyword);
    // Click search
    await page.click('.s_search');
    console.log(chalk.green(`get start searching pictures`));
    // What to do once the result page has loaded
    page.on('load', async () => {
      console.log(chalk.green(`searching pictures done, start fetch...`));
      // Get the src of all the matching images
      const srcs = await page.$$eval('img.main_img', pictures => {
        return pictures.map(img => img.src);
      });
      console.log(chalk.green(`get ${srcs.length} pictures, start download`));
      srcs.forEach(async (src) => {
        await page.waitFor(200);
        await srcToImg(src, this.conf.outputPath);
      });
    });
  }
}

module.exports = App;
Now let’s see: how do we convert an image’s src attribute into a local image? Let’s look at srcToImg.js in helper.
This module mainly pulls in Node’s http, https, path and fs modules, plus a few helpers: our regular expressions, util.promisify for turning callback-style functions into promises, and chalk for nicer output.
Why introduce both the http and https modules? If you look carefully at the images in Baidu’s search results, you’ll find both http and https images, so we bring in both modules and pick whichever one matches a given image’s URL. After requesting the image, we use the fs module’s createWriteStream method to save it into our output directory.
If we look even more closely at the src of the images in Baidu’s results, we’ll find that besides images starting with http and https there are also base64 images, so we need to handle base64 separately.
The handling is much like that of an ordinary image: first split the extension out of the src, then work out the storage path and file name, and finally write the file by calling the fs module’s writeFile method (a plain writeFile is enough here).
With that, the pictures are stored locally.
Here’s the code.
const http = require('http');
const https = require('https');
const path = require('path');
const fs = require('fs');
const { promisify } = require('util');
const chalk = require('chalk');
const writeFile = promisify(fs.writeFile);
const regMap = require('./regMap');

const urlToImg = promisify((url, dir, callback) => {
  let mod;
  if(regMap.isHttp.test(url)){
    mod = http;
  }else if(regMap.isHttps.test(url)){
    mod = https;
  }
  // Get the image's extension
  const ext = path.extname(url);
  // Join the storage path and the file name
  const file = path.join(dir, `${parseInt(Math.random() * 1000000)}${ext}`);
  mod.get(url, res => {
    // Stream the response to disk, which is faster than buffering then writing
    res.pipe(fs.createWriteStream(file)).on('finish', () => {
      console.log(file);
      callback(null, file);
    });
  });
});

const base64ToImg = async (base64Str, dir) => {
  const matchs = base64Str.match(regMap.isBase64);
  try {
    const ext = matchs[1].split('/')[1].replace('jpeg', 'jpg');
    const file = path.join(dir, `${parseInt(Math.random() * 1000000)}.${ext}`);
    await writeFile(file, matchs[2], 'base64');
    console.log(file);
  } catch (error) {
    console.log(chalk.red('Unrecognized picture'));
  }
};

module.exports = (src, dir) => {
  if(regMap.isPic.test(src)){
    urlToImg(src, dir);
  }else{
    base64ToImg(src, dir);
  }
};
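One file the article never shows is src/helper/regMap.js. Based on how its properties are used above, a plausible sketch might look like this; the exact patterns are my assumption, not the author’s code:

```js
// A guess at src/helper/regMap.js: these exact patterns are assumptions,
// reconstructed from how isHttp, isHttps, isPic and isBase64 are used above.
module.exports = {
  // Does the src start with http:// or https:// ?
  isHttp: /^http:\/\//,
  isHttps: /^https:\/\//,
  // Rough check for an image URL or file name by its extension
  isPic: /\.(jpg|jpeg|png|gif)/i,
  // Captures the mime type (e.g. image/jpeg) and the base64 payload of a data URI
  isBase64: /^data:(image\/\w+);base64,(.+)/
};
```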
Now, how do we empty the images under output? Here we use Node’s fs module again. First we read the output folder with fs.readdir, then iterate over the files in it; if a file is an image, we delete it with fs.unlink. Easy, right?
Here’s the code.
const fs = require('fs');
const regMap = require('./helper/regMap');
const config = require('./config/default');
const cleanPath = config.outputPath;

class Clean {
  constructor() {}

  clean() {
    fs.readdir(cleanPath, (err, files) => {
      if(err){
        throw err;
      }
      files.forEach(file => {
        if(regMap.isPic.test(file)){
          const img = `${cleanPath}/${file}`;
          fs.unlink(img, (e) => {
            if(e) {
              throw e;
            }
          });
        }
      });
      console.log('clean finished');
    });
  }
}

module.exports = Clean;
Finally, how do we write the CLI tool? First we create a new script file called gp in the bin directory, as follows:
#!/usr/bin/env node
module.exports = require('../src/index');
The first line tells the system to find node under /usr/bin/env and use it to run line 2, which simply re-exports our entry file.
Secondly, we need to add a bin object to package.json. Each property name is the name of a command, and its value is the path to the corresponding script file under bin, like so:
"bin": {
"gp": "bin/gp"
}
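Before publishing, you can test the command locally: running npm link in the project root symlinks the package globally, so the gp command becomes available on your PATH right away.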
Next, let’s look at index.js
const program = require('commander');
const inquirer = require('inquirer');
const pkg = require('../package.json');
const qs = require('./helper/questions');
const App = require('./app');
const Clean = require('./clean');

program
  .version(pkg.version, '-v, --version');

program
  .command('search')
  .alias('s')
  .description('get search pictures what you want.')
  .action(async () => {
    const answers = await inquirer.prompt(qs.startQuestions);
    const app = new App(answers);
    await app.start();
  });

program
  .command('clean')
  .alias('c')
  .description('clean all pictures in directory "output".')
  .action(async () => {
    const answers = await inquirer.prompt(qs.confirmClean);
    const clean = new Clean();
    answers.isRemove && await clean.clean();
  });

program.parse(process.argv);

if(process.argv.length < 3){
  program.help();
}
We bring in commander and inquirer. The program.command method creates a command for us, alias is the command’s short form, description is its description, and action is what the command actually does.
We first use command to create the two commands, search and clean, and then use inquirer inside each action. Asking questions with inquirer is an asynchronous process, so once again we use async and await. inquirer.prompt receives an array of questions, each containing the question’s type, name, message and validation method; the details can all be found in inquirer’s documentation. Our questions file exports two arrays: one for entering the keyword and one for confirming the clean. The keyword array validates that a keyword was actually entered; if not, it won’t proceed and will prompt you to enter one, otherwise the crawl officially starts. The clean-confirmation array is a simple yes/no: if you confirm, the images are deleted. Finally, program.parse parses node’s process.argv, and if no arguments were passed on the command line we show the help.
At this point, our program is done. Then we just need to publish it to npm so others can download and use it. I won’t go over npm publishing here; if you’re not familiar with it, a quick online search will sort you out.
src/helper/questions.js is as follows:
const config = require('../config/default');

exports.startQuestions = [
  {
    type: 'input',
    name: 'keyword',
    message: 'What pictures do you want to get ?',
    validate: function(keyword) {
      const done = this.async();
      if(keyword === ''){
        done('Please enter the keyword to get pictures');
        return;
      }
      done(null, true);
    }
  }
];

exports.confirmClean = [
  {
    type: 'confirm',
    name: 'isRemove',
    message: `Do you want to remove all pictures in ${config.outputPath} ?`,
    default: true,
  }
];
Project download
npm i get_picture -g
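Once installed, gp search (or gp s) starts the interactive crawl, and gp clean (or gp c) empties the output directory, matching the commands we registered in index.js above.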
Reference links
- The project on GitHub: github.com/1eeing/get_…
- puppeteer: github.com/GoogleChrom…
- commander: github.com/tj/commande…
- inquirer: github.com/SBoudrias/I…