Puppeteer is a Node.js library developed by the Chrome team. One of its uses is web crawling.
See GitHub for more information. Releases come out roughly once a month. This article is based on v1.4.0, and the APIs covered here should generally apply to other versions as well. It summarizes the main uses of Puppeteer as a crawler. My goal is that, for everyday crawling, you will not need to consult the official documentation.
1. Installation and usage
1.1 Installation
cnpm i -S puppeteer
Installing with cnpm completes without errors, and a Chromium build that Puppeteer supports is downloaded by default.
1.2 Usage
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({
    headless: false, // defaults to true (headless), which shows no browser window
    slowMo: 200, // slows every operation down by 200 ms; sometimes used deliberately to simulate a human
    devtools: true // show the developer tools. The page defaults to 800*600; after the developer tools are shown and then hidden, the page fills the whole window. Can anyone explain why?
  });
  // const page = await browser.newPage();
  const page = (await browser.pages())[0]; // my approach: reuse the tab that is already open, so there is only one tab
  await page.goto('https://www.juejin.com'); // go to Juejin
  // Start your show...
  await browser.close(); // close the browser
})();
Nearly every Puppeteer operation returns a Promise, so remember to await each one before starting the next.
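Because every step is a Promise, an error thrown halfway through can skip the final browser.close() and leave Chromium running. The following is a minimal sketch of one way to guard against that, not part of Puppeteer itself: withBrowser is a hypothetical helper name, and launch is assumed to be any function that returns a Promise for a browser, for example () => puppeteer.launch({ headless: false }).

```javascript
// Hypothetical helper (not a Puppeteer API): runs `task` with a browser and
// always closes the browser afterwards, even if `task` throws.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    // Await here so that `finally` runs only after the task has finished.
    return await task(browser);
  } finally {
    await browser.close();
  }
}
```

A call might look like withBrowser(() => puppeteer.launch(), async browser => { ... }), with the crawling steps placed inside the second function.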
There are also some third-party crawler demos built on Puppeteer, but they feel too heavily wrapped, so I will not cover them here.
2. Basic usage
2.1 Adjusting the Page
The default page size is 800*600, which I find too small, so I usually set the page size first.
As mentioned above, this is only the initial size. After the developer tools are opened and then hidden, the page fills the whole window; I do not know whether this is a bug.
await page.setViewport({
width: 1280,
height: 800
});
2.2 Simulate input and click
Under the hood, this appears to use document.querySelector().
await page.type(selector, 'Hello puppeteer'); // finds the <input/> matching the selector and types the value into it. If slowMo was set earlier, the characters are typed one by one, like a human typing
await page.click(selector); // simulates a click. Useful for traditional asynchronous pagination (URLs with no page parameter): point the selector at the "next page" element
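The click-based pagination described above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: crawlPages is a hypothetical helper, itemSelector and nextSelector stand in for the target site's own selectors, and the fixed pause after each click is a crude stand-in for a proper "content has changed" check.

```javascript
// Hypothetical helper: collects the text of every `itemSelector` element on up
// to `maxPages` pages, clicking `nextSelector` to move between pages.
// `page` is a Puppeteer Page; both selectors are placeholders.
async function crawlPages(page, itemSelector, nextSelector, maxPages, delayMs = 500) {
  const results = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    // Wait until at least one item is present before reading the list.
    await page.waitForSelector(itemSelector);
    const texts = await page.$$eval(itemSelector, els =>
      els.map(el => el.textContent.trim())
    );
    results.push(...texts);

    if (pageNo === maxPages) break;
    const next = await page.$(nextSelector); // null when there is no next button
    if (!next) break;
    await next.click();
    // Fixed pause so the asynchronously loaded page has time to render.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return results;
}
```

The loop stops either when maxPages is reached or when the "next page" element no longer exists, so it does not click past the last page.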
2.3 Handling iframes
If the page has an iframe tag, the page object cannot read the contents of the iframe directly. Get the corresponding Frame object first:
const frames = await page.frames();
const iframe = frames.find(f => f.name() === 'name'); // the frame whose name is 'name'
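An iframe is sometimes attached after the main page has loaded, in which case page.frames() may not contain it yet. The sketch below polls for it a few times; findFrame is a hypothetical helper, and the retry count and delay are arbitrary assumptions.

```javascript
// Hypothetical helper: polls page.frames() until a frame matches `predicate`
// or the retries run out. Returns the Frame, or null if none was found.
async function findFrame(page, predicate, retries = 10, delayMs = 200) {
  for (let attempt = 0; attempt < retries; attempt++) {
    const frame = page.frames().find(predicate);
    if (frame) return frame;
    // Pause on the Node side before looking again.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return null;
}
```

For example, findFrame(page, f => f.name() === 'name') looks the frame up by name, and a predicate on f.url() can be used when the iframe has no name.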
2.4 The waitFor function
Both Page and Frame objects have a waitFor function. I only use the following two forms; advice on the others is welcome.
For brevity, from here on I will not point out when an API exists on both Page and Frame objects; the code will simply use one or the other.
await iframe.waitFor('.contain .item'); // wait until '.contain .item' appears inside the <iframe>
await page.waitFor(200); // the page waits for 200 ms
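When waiting for a selector, the wait rejects if the element never appears within the timeout. The sketch below turns that into a boolean using waitForSelector with its timeout option; appearsWithin is a hypothetical helper, target may be a Page or a Frame, and treating every rejection as "not found" is a simplifying assumption.

```javascript
// Hypothetical helper: resolves to true if `selector` appears in `target`
// (a Page or Frame) within `timeoutMs`, and to false otherwise.
async function appearsWithin(target, selector, timeoutMs) {
  try {
    await target.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    // Simplification: any rejection, including a timeout, counts as "not found".
    return false;
  }
}
```

This is handy when an element is optional, for instance a pop-up that only some visits trigger.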
2.5 Selectors and evaluate
Why cover them together? Because there is a combined API called $eval.
Let's break it down.
2.5.1 Selectors
Under the hood, these appear to use document.querySelector() and document.querySelectorAll(). Anyone familiar with those two APIs should find them easy to pick up.
await page.$(selector); // document.querySelector()
await iframe.$$(selector); // document.querySelectorAll(); the second $ means "all"
2.5.2 evaluate
The first thing to understand is that a Puppeteer crawler parses the DOM inside the browser, and the function passed to this API runs in the browser. DOM manipulation is therefore possible inside that function, but native Node APIs are not available there. Running console.log(global) will result in an error.
For example, put this inside the function: console.log('Press F12, I am on the browser console, not on the Node command line')
You will notice that the output appears in the Chromium console, not on the Node command line. In other words, the function runs in the context of the current site, not in your local Node process.
await page.evaluate(() => {
  // DOM manipulation
});
await iframe.evaluate(() => {
  console.log('Press F12, I am on the browser console, not on the Node command line');
});
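Since the function runs in the browser, it cannot see variables from the surrounding Node code; values have to be passed in as extra arguments to evaluate, and the return value comes back to Node. A minimal sketch, assuming a hypothetical helper countLinksContaining and a keyword chosen by the caller:

```javascript
// Hypothetical helper: counts the <a> elements on the page whose href contains
// `keyword`. The arrow function runs in the browser; `keyword` is handed over
// as an argument because Node-side variables are not visible there.
async function countLinksContaining(page, keyword) {
  return page.evaluate(kw => {
    const links = Array.from(document.querySelectorAll('a'));
    return links.filter(a => a.href.includes(kw)).length;
  }, keyword);
}
```

Returning a plain number keeps the result easy to transfer back to Node; DOM nodes themselves are better reduced to strings, numbers, or plain objects before being returned.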
2.5.3 The real protagonists: $eval and $$eval
Combining the two APIs above gives $eval, one of the APIs I use most. One call does the work of the two above, which makes it pleasant to write.
const result = await page.$eval(selector, el => {
  // to assign a value to result, return it; a returned Promise is awaited and its resolved value is used
  return new Promise(resolve => {
    // ...
    resolve(obj);
  });
});
await iframe.$$eval(selector, els => {
  // ...
});
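As a concrete illustration of $$eval, the sketch below pulls the text and address out of every matching link in one call. extractLinks is a hypothetical helper, target may be a Page or a Frame, and the selector is whatever matches the links on the site being crawled.

```javascript
// Hypothetical helper: returns [{ text, href }, ...] for every element matching
// `selector`. The mapping function runs in the browser and returns plain
// objects, which can be transferred back to Node.
async function extractLinks(target, selector) {
  return target.$$eval(selector, anchors =>
    anchors.map(a => ({ text: a.textContent.trim(), href: a.href }))
  );
}
```

A call such as extractLinks(page, '.list a') then yields an ordinary array that can be filtered or saved on the Node side.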
2.6 Listening for events
As mentioned above, console output inside page.evaluate does not appear on the Node command line, but event listeners can change that. Listeners can also be used for error handling.
page.on('console', msg => {
console.log(msg);
});
In my opinion, when printing DOM nodes it is better to look at the browser's console.
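For plain text messages, the object passed to the console listener exposes text() and type(), which give more readable output on the Node side than printing the whole object. A minimal sketch, where forwardConsole and the log parameter are hypothetical names introduced for illustration:

```javascript
// Hypothetical helper: forwards browser console messages to a Node-side logger,
// prefixed with the message type (log, warning, error, ...).
function forwardConsole(page, log = console.log) {
  page.on('console', msg => {
    log(`[browser ${msg.type()}] ${msg.text()}`);
  });
}
```

Calling forwardConsole(page) once, before page.goto, is enough to see messages printed from page.evaluate on the Node command line.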
// listen for errors raised in the browser page
page.on('pageerror', pageErr => {
  console.log(pageErr);
});
// listen for errors reported on the Node side
page.on('error', err => {
  console.log(err);
});
3. Emulating a mobile device
const devices = require("puppeteer/DeviceDescriptors");
const iPhone = devices["iPhone 6"];
...
await page.emulate(iPhone);
More devices can be viewed here
That is my introduction to the Puppeteer crawler API.
I recommend two books. One is Lao Yao's regular expression PDF, which is very detailed and genuinely useful, although I have forgotten most of it (╯﹏╰). The other is the booklet on the Web front-end interview guide and analysis of high-frequency interview questions; its content is rich and fundamental, and I hope I get a chance to put it into practice.
Now I want to get down to business.
I graduated in software engineering in 2017. I am familiar with Vue and have built two company projects with it. I have now left my job, am based in Guangzhou, and am looking for a team to take me in.