
Puppeteer is a Node.js library developed by the Chrome team. One of its uses is web crawling.

See GitHub for more information. A new release comes out roughly once a month. This article is based on v1.4.0, and the APIs covered here are mostly the same across versions. It summarizes the main ways I use Puppeteer as a crawler. My goal is that, for everyday crawler work, you should not need to open the official documentation.

One, installation and use

1.1 Installation

cnpm i -S puppeteer

I install it with cnpm, which has not given me any errors. A Chromium build that works with Puppeteer is downloaded by default.

1.2 Usage

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // Default is true (headless): no browser window is shown
    slowMo: 200, // Slow every operation down by 200 ms; sometimes used deliberately to imitate a human
    devtools: true // Open the developer tools. The page defaults to 800*600; after the developer tools are shown and then hidden, the page fills the whole window. Can anyone explain why?
  });
  // const page = await browser.newPage();
  const page = (await browser.pages())[0]; // My way of writing it: reuse the tab that is already open, so there is only one tab
  await page.goto('https://www.juejin.com'); // Go to Juejin
  // Your code goes here...
  await browser.close(); // Close the browser
})();

Almost every Puppeteer operation returns a Promise. Remember to await each one before moving on to the next.
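The point about awaiting every step can be sketched as a small wrapper. This is only an illustration: withPage is a hypothetical helper name, not part of Puppeteer, and it takes the puppeteer module as a parameter so the launch options stay up to the caller. It awaits each call in order and closes the browser even when the task throws.

```javascript
// Hypothetical helper (not part of Puppeteer): launch, run a task, always close.
async function withPage(puppeteer, task, launchOptions = {}) {
  const browser = await puppeteer.launch(launchOptions);
  try {
    // Reuse the first open tab if there is one, otherwise open a new tab
    const page = (await browser.pages())[0] || (await browser.newPage());
    return await task(page);
  } finally {
    // Runs whether the task resolved or threw
    await browser.close();
  }
}

// Usage (assumes puppeteer is installed):
// const puppeteer = require('puppeteer');
// withPage(puppeteer, async page => {
//   await page.goto('https://www.juejin.com');
//   return page.title();
// }).then(console.log);
```

Forgetting an await inside the task usually shows up as the browser closing before the operation has finished.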

There are also third-party crawler wrappers built on Puppeteer, but they feel over-packaged to me, so I will not cover them here.

Two, basic usage

2.1 Adjusting the Page

The default page width and height is 800*600, which I think is too small. I usually initialize the page size first.

As mentioned above, this is only the initial size: when the developer tools are opened and then hidden, the page fills the whole window. I do not know whether this is a bug.

await page.setViewport({
    width: 1280,
    height: 800
  });

2.2 Simulate input and click

As far as I can tell, these are built on document.querySelector().

await page.type(selector, 'Hello puppeteer'); // Find the element matching the selector and type the value into it. If you set slowMo earlier, you will see the characters entered into the <input/> one by one, like a human typing
await page.click(selector); // Simulate a click. Useful for traditional asynchronous pagination (URLs with no page parameter): point the selector at the "next page" element
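The click-based pagination mentioned in the comment could look roughly like the sketch below. The helper name crawlPages, its option names and the fixed delay are all assumptions for illustration; a real site may need a more reliable wait than a sleep.

```javascript
// Hypothetical helper: collect item text from a list paged by clicking "next".
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function crawlPages(page, { itemSelector, nextSelector, maxPages = 5, delay = 200 }) {
  const results = [];
  for (let i = 0; i < maxPages; i++) {
    // The callback runs in the browser: read the text of every matching element
    const texts = await page.$$eval(itemSelector, els =>
      els.map(el => el.textContent.trim())
    );
    results.push(...texts);

    // Stop when there is no "next page" element left
    const next = await page.$(nextSelector);
    if (!next) break;

    await page.click(nextSelector);
    await sleep(delay); // Crude pause so the new list can render
  }
  return results;
}

// Usage (selectors are placeholders):
// const items = await crawlPages(page, { itemSelector: '.item', nextSelector: '.next' });
```

The maxPages cap is there so that a "next" button that never disappears cannot loop forever.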

2.3 Handling iframes

If the page has an iframe tag, the page object cannot read the contents of the iframe directly. Get the corresponding frame object first.

const frames = page.frames(); // All frames attached to the page
const iframe = frames.find(f => f.name() === 'name'); // Pick the frame by its name attribute
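Because find() returns undefined when nothing matches, a small guard makes the failure easier to spot. A minimal sketch, assuming a hypothetical helper called getFrameByName:

```javascript
// Hypothetical helper: look a frame up by name and fail loudly if it is missing.
function getFrameByName(page, name) {
  const frame = page.frames().find(f => f.name() === name);
  if (!frame) {
    throw new Error(`No frame named "${name}"`);
  }
  return frame;
}

// Usage (frame name and selector are placeholders):
// const iframe = getFrameByName(page, 'name');
// const heading = await iframe.$eval('h1', el => el.textContent);
```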

2.4 The waitFor function

Both Page and Frame objects provide waitFor as a shorthand. I only use the two forms below; for the rest, I would welcome advice from more experienced readers.

For brevity, from here on I will not point out which APIs exist on both Page and Frame objects; the code shows it directly.

await iframe.waitFor('.contain .item'); // Wait until '.contain .item' appears inside the <iframe>
await page.waitFor(200); // The page waits for 200 ms
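Waiting for a selector rejects if the element never shows up within the timeout, so a crawler may want to turn that into a simple yes or no. The sketch below uses waitForSelector, which is what waitFor delegates to when given a selector string; the helper name appeared and the 5000 ms default are assumptions.

```javascript
// Hypothetical helper: true if the selector appears in time, false otherwise.
// `target` can be a Page or a Frame, since both expose waitForSelector.
async function appeared(target, selector, timeout = 5000) {
  try {
    await target.waitForSelector(selector, { timeout });
    return true;
  } catch (err) {
    // Any rejection (a timeout in the common case) is treated as "did not appear"
    return false;
  }
}

// Usage:
// if (await appeared(iframe, '.contain .item', 3000)) { /* scrape the items */ }
```

Swallowing every error keeps the sketch short; in real code you may prefer to rethrow anything that is not a timeout.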

2.5 Selectors and evaluate

Why cover them together? Because there is a composite API, $eval, that combines the two.

Let’s break it down.

2.5.1 selector

As far as I can tell, these are built on document.querySelector() and document.querySelectorAll(). Anyone familiar with those two APIs should find them easy to pick up.

await page.$(selector); // document.querySelector()
await iframe.$$(selector); // document.querySelectorAll(); $$ means "all"

2.5.2 evaluate

The first thing to understand is that a Puppeteer crawler parses the DOM inside the browser, and the function you pass to this API runs in the browser. You can therefore manipulate the DOM inside that function, but you cannot use native Node APIs there. Running console.log(global) inside it throws an error.

For example, put this inside the function: console.log('Press F12, I am on the browser console, not the Node command line')

You will find that the message appears in the Chromium console, not on the Node command line. In other words, the function runs in the context of the current site, not in your local Node process.

await page.evaluate(() => {
  // DOM manipulation
});
await iframe.evaluate(() => {
  console.log('Press F12, I am on the browser console, not on the Node command line');
});
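Since the function runs in the browser, it cannot see variables from your Node script; values have to be passed in as extra arguments to evaluate, and whatever the function returns comes back to Node. A minimal sketch, where collectLinks and the prefix parameter are hypothetical names:

```javascript
// Hypothetical helper: return the href of every link that starts with `prefix`.
async function collectLinks(page, prefix) {
  // `prefix` lives in Node; it is passed to the browser as the argument `p`
  return page.evaluate(p => {
    // Everything in here runs in the browser, so `document` is available
    return Array.from(document.querySelectorAll('a'))
      .map(a => a.href)
      .filter(href => href.startsWith(p));
  }, prefix);
}

// Usage:
// const links = await collectLinks(page, 'https://');
```

The return value has to be serializable (strings, numbers, plain objects and arrays), which is why the sketch returns href strings rather than the DOM nodes themselves.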

2.5.3 The real protagonists: $eval and $$eval

Combining the two APIs above gives $eval, one of the APIs I use most. One call does the work of both, which is pleasant to write.

const result = await page.$eval(selector, el => {
  // Return a value to get a result back; a returned Promise is awaited first
  return new Promise(resolve => {
    // ...
    resolve(obj);
  });
});
await iframe.$$eval(selector, els => { /* ... */ });
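As a concrete illustration of the two calls, here is a sketch with two hypothetical helpers, readAttribute and readList. Both accept either a Page or a Frame as target. Extra arguments after the callback are forwarded to it in the browser.

```javascript
// Hypothetical helper: read one attribute from the first element matching `selector`.
async function readAttribute(target, selector, attr) {
  // `attr` is forwarded to the browser-side callback as `name`
  return target.$eval(selector, (el, name) => el.getAttribute(name), attr);
}

// Hypothetical helper: turn every matching element into a plain object.
async function readList(target, selector) {
  return target.$$eval(selector, els =>
    els.map(el => ({
      text: el.textContent.trim(),
      href: el.getAttribute('href')
    }))
  );
}

// Usage (selectors are placeholders):
// const src = await readAttribute(page, 'img.cover', 'src');
// const entries = await readList(iframe, '.contain .item a');
```

Note that $eval rejects when no element matches the selector, while $$eval simply passes an empty array to the callback.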

2.6 Listening Events

As mentioned above, console output inside page.evaluate does not reach the Node command line, but an event listener changes that. You can also handle errors in event listeners.

page.on('console', msg => {
    console.log(msg);
});

Personally, I find that if you are printing DOM nodes, it is easier to read them in the browser's console.

// Listen for uncaught exceptions thrown inside the browser page
page.on('pageerror', pageErr => {
  console.log(pageErr);
});
// Listen for the page crashing
page.on('error', err => {
  console.log(err);
});
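The object passed to the 'console' listener is a ConsoleMessage rather than a plain string, so logging it directly prints the whole object. A minimal sketch that prints only the message type and text; the helper name forwardConsole and the output format are assumptions:

```javascript
// Hypothetical helper: print browser console output on the Node command line.
function forwardConsole(page, log = console.log) {
  page.on('console', msg => {
    // type() is e.g. 'log', 'warning' or 'error'; text() is the printed text
    log(`[browser ${msg.type()}] ${msg.text()}`);
  });
}

// Usage:
// forwardConsole(page);
// await page.evaluate(() => console.log('hello from the browser'));
```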

Three, emulating a mobile device

const devices = require("puppeteer/DeviceDescriptors");
const iPhone = devices["iPhone 6"];
// ...
await page.emulate(iPhone);
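If the device you need is not in the descriptor list, page.emulate also accepts a hand-written object with a userAgent and a viewport. The sketch below builds one; the helper name customDevice and the default values are assumptions, and the user agent string in the usage example is a placeholder you would replace with a real one.

```javascript
// Hypothetical helper: build a descriptor in the shape page.emulate expects.
function customDevice({ width, height, userAgent, deviceScaleFactor = 2 }) {
  return {
    userAgent,
    viewport: {
      width,
      height,
      deviceScaleFactor,
      isMobile: true,   // Take the page's meta viewport tag into account
      hasTouch: true,   // Enable touch events
      isLandscape: false
    }
  };
}

// Usage (placeholder user agent string):
// await page.emulate(customDevice({
//   width: 375,
//   height: 667,
//   userAgent: 'Mozilla/5.0 (iPhone; ...)'
// }));
```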

More devices are listed in the same DeviceDescriptors module.

That is my introduction to the Puppeteer crawler API.


Let me recommend two books. One is Lao Yao's PDF on regular expressions, which is very detailed and really useful, although I have forgotten most of it (╯﹏╰). The other is the booklet "Web front-end interview guide and analysis of high-frequency exam questions"; its content is rich and fundamental, and I hope to find a chance to put it into practice.

Now I want to get down to business.

I graduated in software engineering in 2017. I am familiar with Vue and have built two company projects with it. I have left my job and am based in Guangzhou, and I am hoping a senior developer out there will take me in.