Introduction: Most front-end developers have used command-line tools, but few have ever built one. This article aims to help you develop a practical (squint smile) image-crawler command-line application in as little time as possible.
Introduction to Puppeteer
What is Puppeteer?
Puppeteer is the Google Chrome team's official headless Chrome tool: a Node library that provides a high-level API to control Chrome. With Chrome leading the browser market, Headless Chrome looks set to become an industry benchmark for automated testing of web applications, so it's well worth understanding.
What can Puppeteer do?
There are many things Puppeteer can do, including but not limited to:
- Generate screenshots and PDFs of web pages
- Crawl content from websites
- Automate form submission, UI testing, keyboard input, and more
- Create an up-to-date automated testing environment and run your tests directly in the latest Chrome
- Capture a timeline trace of your site to help diagnose performance issues
What are the advantages of Puppeteer?
- Compared with a real browser, it does less CSS/JS loading and page rendering; a headless browser is much faster than a real one.
- It can run headless on a server or in CI, which reduces outside interference and is more stable.
- A single machine can run multiple headless browsers at once, making concurrent runs easy.
How do I install Puppeteer?
Installing Puppeteer takes a single command:

```bash
npm i --save puppeteer
# or
yarn add puppeteer
```

Note that because of the ES7 async/await syntax, Node v7.6.0 or above is required.
How do I use Puppeteer?
As this article isn't dedicated to Puppeteer, we'll skip the details; you can check out the links below.
Puppeteer Github
Puppeteer Api Doc
Puppeteer Chinese Api Doc
Having said all that, what does Puppeteer have to do with the command-line application we want to build? Instead of a traditional request-based crawler, we'll use the headless browser Puppeteer to grab image URLs straight from the DOM, which neatly circumvents some anti-crawler defenses.
Simple application of Puppeteer
Case 1. Screenshot
Straight to the code; it's very easy to follow:
```js
const puppeteer = require("puppeteer");

const getScreenShot = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://baidu.com");
  await page.screenshot({ path: "baidu.png" });
  await browser.close();
};

getScreenShot();
```
This code launches a browser (headless: false keeps it visible so you can watch it work), opens a new tab, navigates to Baidu, saves a screenshot as baidu.png, and closes the browser.
The result is as follows:
Case 2. Capture website information
Next, let's learn how to use Puppeteer to crawl websites.
This time we'll grab book-list information from JD.
```js
// book info spider
const puppeteer = require("puppeteer");
const fs = require("fs");

const spider = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://search.jd.com/Search?keyword=javascript");
  const result = await page.evaluate(() => {
    let elements = document.querySelectorAll(".gl-item");
    const data = [...elements].map(i => {
      return {
        name: i.querySelector(".p-name em").innerText,
        description: i.querySelector(".p-name i").innerText,
        price: i.querySelector(".p-price").innerText,
        shop: i.querySelector(".p-shopnum").innerText,
        url: i.querySelector(".p-img a").href
      };
    });
    return data; // return the scraped data
  });
  browser.close();
  return result;
};

spider().then(value => {
  fs.writeFile(`${__dirname}/javascript.json`, JSON.stringify(value), err => {
    if (err) {
      throw err;
    }
    console.log("file saved!");
  });
  console.log(value); // Success!
});
```
What this does: navigate to the search results page for the keyword javascript, analyze the page's DOM structure, locate the title, description, price, publisher, and link for each book in the list, and write the data into javascript.json for easy saving and browsing.
The logic is simple; this is already the prototype of a crawler. The result is a JSON file like the one shown below, which is quite satisfying.
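Each entry follows the shape produced by the evaluate callback above; the values here are placeholders, not real scraped data:

```json
[
  {
    "name": "JavaScript (sample book title)",
    "description": "(sample description)",
    "price": "¥99.00",
    "shop": "(sample shop name)",
    "url": "https://item.jd.com/..."
  }
]
```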
Case 3. Image crawler
An image crawler is the theme of the command-line application we're going to build.
The basic idea is this:
Open the browser → go to Baidu Images → focus the search input box → type the keyword → click the search button → land on the results page → scroll to the bottom to trigger lazy loading → read the DOM and collect the src of every image → download each image locally from its src → close the browser
Code implementation:
First, the browser-automation part:
```js
const browser = await puppeteer.launch(); // launch the browser
const page = await browser.newPage(); // open a new tab
await page.goto("https://image.baidu.com"); // go to Baidu Images
console.log("go to https://image.baidu.com");
await page.focus("#kw"); // focus the search input box
await page.keyboard.sendCharacter("Cats"); // type the keyword
await page.click(".s_search"); // click the search button
console.log("go to search list"); // heading to the results page
```
Then the image-processing part:
page.on("load".async() = > {await autoScroll(page); // Scroll down to load the image
console.log("page loading done, start fetch...");
const srcs = await page.evaluate((a)= > {
const images = document.querySelectorAll("img.main_img");
return Array.prototype.map.call(images, img => img.src);
}); // Get all img SRC
console.log(`get ${srcs.length} images, start download`);
for (let i = 0; i < srcs.length; i++) {
await convert2Img(srcs[i], target);
console.log(`finished ${i + 1}/${srcs.length} images`);
} // Save the image
console.log(`job finished! `);
await browser.close();
});
Baidu Images lazy-loads more results as you scroll, so to collect more images we first scroll down for a while (autoScroll). Then we read the DOM to collect the src of every image in the list, and finally download them.
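The autoScroll helper used above isn't shown in the article (the full version is in the project repo). A minimal sketch, assuming we simply scroll in fixed steps inside page.evaluate until the document stops growing or a height cap is reached:

```js
// Sketch of autoScroll: scroll the page step by step so lazy-loaded
// images have a chance to appear. Step size and cap are illustrative.
function autoScroll(page) {
  return page.evaluate(() => {
    return new Promise(resolve => {
      let totalHeight = 0;
      const distance = 100; // px per scroll step
      const maxHeight = 10000; // cap so an endless feed can't scroll forever
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;
        // Stop when we've caught up with the page bottom or hit the cap
        if (totalHeight >= document.body.scrollHeight || totalHeight >= maxHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  });
}
```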
Run it to get a list of cat pictures:
Only the core logic of the image-download step is shown here; the complete code is on GitHub.
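For completeness, here is a hypothetical sketch of what convert2Img could look like. It assumes src is either a base64 data URI or an http(s) URL and that target is an existing directory; the real implementation in the repo may differ:

```js
const fs = require("fs");
const path = require("path");
const http = require("http");
const https = require("https");

// Hypothetical sketch of convert2Img: save one image `src` into `target`
const convert2Img = (src, target) =>
  new Promise((resolve, reject) => {
    const match = src.match(/^data:image\/(\w+);base64,(.+)$/);
    if (match) {
      // Inline base64 image: decode and write straight to disk
      const [, ext, data] = match;
      const file = path.join(target, `${Date.now()}.${ext}`);
      fs.writeFile(file, Buffer.from(data, "base64"), err => (err ? reject(err) : resolve()));
    } else {
      // Remote image: stream the HTTP response into a local file
      const client = src.startsWith("https") ? https : http;
      const file = path.join(target, `${Date.now()}.png`);
      client
        .get(src, res => {
          res.pipe(fs.createWriteStream(file)).on("finish", resolve).on("error", reject);
        })
        .on("error", reject);
    }
  });
```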
So far, we have developed a basic image crawler using Node and Puppeteer.
How to optimize?
The image crawler is still rough around the edges. Our goal is an interactive command-line application, so we can't stop here. After some thought, I listed the following improvements:
- The downloaded image content can be customized
- Users can choose how many images to download
- Support command-line arguments
- Support command-line interaction
- Provide a good-looking interactive interface
- Support direct double-click execution
- Support global invocation from the command line
Use Commander.js to support command-line arguments
Commander is a lightweight, expressive, and powerful command-line framework that provides robust parsing of user input and command-line arguments.
```js
const program = require("commander");

program
  .version("0.0.1")
  .description("a test cli program")
  .option("-n, --name <name>", "your name", "zhl")
  .option("-a, --age <age>", "your age", "22")
  .option("-e, --enjoy [enjoy]")
  .action(option => {
    console.log("name: ", option.name);
    console.log("age: ", option.age);
    console.log("enjoy: ", option.enjoy);
  });

program.parse(process.argv);
```
Commander is easy to use: the few lines above fully define a command's input and output. Specifically:

- version defines the version number
- description defines the description
- option defines an input option and takes three arguments, for example:

`option("-n, --name <name>", "your name", "GK")`

--name is the option's long name; <> marks a required value and [] an optional one. The second argument, "your name", is the help text telling the user what to enter, and the last, "GK", is the default value.

- action defines the operation to perform; it's a callback whose argument is the parsed options. If an option isn't supplied, its default value is used.

For more detailed APIs, refer to the Commander API documentation.
Executing the script above gives:
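Applied to our image spider, the options listed in the appendix might be declared like this; this is a sketch in the Commander 2.x style used above, with illustrative defaults, not the repo's exact code:

```js
const program = require("commander");

// Sketch: declare img-spd's options (see the appendix for the final list)
program
  .version("0.0.1")
  .option("-k, --key [key]", "the image keyword to download", "cats")
  .option("-i, --interval [interval]", "operation interval in ms", "200")
  .option("-n, --number [number]", "number of images to download", "200")
  .option("-m, --headless [headless]", "run the browser in headless mode")
  .parse(process.argv);

// With Commander 2.x, parsed options land directly on `program`
console.log(program.key, program.interval, program.number, program.headless);
```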
This already enables simple command-line interaction, but it's not much to look at yet. Don't worry, read on.
Use Inquirer to create interactive command-line applications
Inquirer can build a beautiful, embeddable command-line interface for Node.
It can provide question-and-answer style input:
It can provide several kinds of selection interfaces:
It can validate input:
Finally, it can post-process the answers:
The example above is an official Inquirer demo; the source is in pizza.js.
The Inquirer documentation is available at Inquirer Documents.
With Inquirer, we can build a more sophisticated interactive command-line tool, as sketched below.
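For instance, a minimal sketch (not the project's actual prompts) of how the spider could collect its parameters interactively:

```js
const inquirer = require("inquirer");

inquirer
  .prompt([
    {
      type: "input",
      name: "key",
      message: "What images would you like to download?",
      default: "cats"
    },
    {
      type: "input",
      name: "number",
      message: "How many images?",
      // Validation: accept digits only, otherwise show a hint
      validate: input => (/^\d+$/.test(input) ? true : "Please enter a number")
    },
    {
      type: "confirm",
      name: "headless",
      message: "Run the browser in headless mode?"
    }
  ])
  .then(answers => {
    // `answers` is an object keyed by the `name` fields above
    console.log(answers);
  });
```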
Use chalk.js to beautify the interface
Chalk’s syntax is very simple:
```js
const chalk = require("chalk");
const log = console.log;

// Combine styled and normal strings
log(chalk.blue("Hello") + " World" + chalk.red("!"));

// Compose multiple styles using the chainable API
log(chalk.blue.bgRed.bold("Hello world!"));

// Pass in multiple arguments
log(chalk.blue("Hello", "World!", "Foo", "bar", "biz", "baz"));

// Nest styles
log(chalk.red("Hello", chalk.underline.bgBlue("world") + "!"));

// Nest styles of the same type even (color, underline, background)
log(
  chalk.green(
    "I am a green line " +
      chalk.blue.underline.bold("with a blue substring") +
      " that becomes green again!"
  )
);
```
The output below makes the effect clear at a glance:
Let’s do something more interesting…
Some of you may have seen Zhihu's console easter egg shown below. Since we're doing something fun today, let's add a similar effect to our command-line program to raise its cool factor.
First, prepare a piece of ASCII art to print. Search for "text to ASCII"; there are plenty of online converters. For our command-line image spider, we'll turn "IMG SPD" into an ASCII-art string.
The chosen banner looks like this:
How do we print such a complicated string? Saving it in an ordinary string won't work; the formatting gets mangled.
To keep the layout intact, one trick is to print it as a comment. And what can carry comments around? A function.
So it's as simple as printing a function: call toString() on it, then pull out the comment in the middle with a regular expression, as sketched below.
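A minimal sketch of the trick; the banner below is a rough stand-in for the real "IMG SPD" art:

```js
const chalk = require("chalk");

// The ASCII art lives inside a block comment, which survives
// Function.prototype.toString() and is pulled out with a regex.
function banner() {
  /*
   ___ __  __  ____   ____  ____  ____
  |_ _|  \/  |/ ___| / ___||  _ \|  _ \
   | || |\/| | |  _  \___ \| |_) | | | |
   | || |  | | |_| |  ___) |  __/| |_| |
  |___|_|  |_|\____| |____/|_|   |____/
  */
}

const art = banner.toString().match(/\/\*([\s\S]*?)\*\//)[1];
console.log(chalk.green(art));
```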
Finally take a look at the effect, dang dang dang ~
Support double-click execution
This uses a technique called shebang.
A shebang (also called hashbang) is the character sequence #! placed at the very start of a text file's first line. When a file begins with a shebang, the program loader on Unix-like operating systems interprets the rest of that line as an interpreter directive, invoking that interpreter with the path of the file as an argument.
Under Node we use #!/usr/bin/env node.
With that in place, we can also drop the .js file extension, as in the sketch below.
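A sketch of what the executable entry file (named app, with no extension) might look like; the required path is hypothetical:

```js
#!/usr/bin/env node

// Entry file "app": the shebang tells the loader to run it with node.
// Make it executable first: chmod +x app
require("./src/index"); // hypothetical path to the spider's main module
```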
Add the command to the PATH to support global invocation
In package.json:

```json
"bin": {
  "img-spd": "app"
}
```
Running npm link creates a symbolic link (soft link) in npm's global node_modules folder pointing at our package, which puts the app script on the PATH under the name img-spd.
After that, you can run img-spd from any directory.
Wrapping up
At this point all of the planned improvements are in place. Come and have a look at the finished product ~
Looking at a whole folder of Gakki pictures, I'm so happy I could burst.
Finally, a GIF to show it in action:
Appendix
Project address
Install
```bash
npm install -g img-spd
```
Usage
```bash
img-spd
```
or
```
Usage: img-spd [options]

img-spd is a spider that gets images from image.baidu.com

Options:
  -v, --version              output the version number
  -k, --key [key]            input the image keywords to download
  -i, --interval [interval]  input the operation interval (ms, default 200)
  -n, --number [number]      input the number of images to download
  -m, --headless [headless]  choose whether the program runs in headless mode
  -h, --help                 output usage information
```