Puppeteer
Puppeteer is the Chrome team's official Node library for controlling headless Chrome (a web browser without a graphical user interface). It provides a set of APIs for driving Chrome without a UI, which makes it suitable for scenarios such as crawlers and automated processing.
What can it be used for?
- Generate page screenshots and PDFs
- Automate form submission, UI testing, keyboard input, and more
- Create an up-to-date automated test environment: run tests directly in the latest version of Chrome, with the latest JavaScript and browser features
- Crawl SPA pages and generate pre-rendered content (i.e. "SSR")
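As a rough illustration of the PDF use case, the sketch below wraps page.goto and page.pdf in a small helper. The helper name savePageAsPdf, the URL, and the file name are assumptions made for this example; PDF generation has traditionally required headless mode.

```javascript
// Save a page as a PDF. `page` is expected to be a Puppeteer Page
// (or any object exposing the same goto/pdf methods).
async function savePageAsPdf(page, url, outputPath) {
  // Wait until the network is mostly quiet so late-loading content is rendered.
  await page.goto(url, { waitUntil: 'networkidle2' });
  // With `path` set, page.pdf() writes the PDF to disk.
  await page.pdf({ path: outputPath, format: 'A4' });
  return outputPath;
}

// Usage sketch (requires `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// await savePageAsPdf(page, 'https://example.com', 'example.pdf');
// await browser.close();
```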
The difference from Cheerio
- Cheerio – a library for manipulating HTML documents with jQuery-style syntax. It can only parse static HTML and cannot get data loaded through Ajax. It is generally used in combination with an HTTP client (axios + Cheerio)
- Puppeteer – simulates a real browser runtime and can request website content. It can simulate user actions (click, swipe, hover, etc.) and even inject scripts from Node to run inside the browser
Puppeteer architecture
- Puppeteer – communicates with the browser through the DevTools Protocol
- Browser – a browser instance (Chromium) that can have multiple pages
- Page – a page that contains at least one Frame
- Frame – has at least one execution context for executing JavaScript, and may have additional execution contexts
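The Browser, Page, and Frame hierarchy above can be walked from code. This is a minimal sketch; describeHierarchy is a hypothetical helper name, and it only relies on browser.pages(), page.url(), page.frames(), and frame.url().

```javascript
// Walk the Browser -> Page -> Frame hierarchy and collect every frame URL.
async function describeHierarchy(browser) {
  const pages = await browser.pages(); // all open pages (tabs)
  return pages.map((page) => ({
    page: page.url(),
    // The main frame comes first, followed by any iframes on the page.
    frames: page.frames().map((frame) => frame.url()),
  }));
}

// Usage sketch (requires `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// console.log(await describeHierarchy(browser));
// await browser.close();
```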
Getting started
const puppeteer = require('puppeteer');
// Placeholder address; replace it with the page you want to capture
const targetUrl = 'https://example.com';
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(targetUrl);
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
Walking through the code
1. Import Puppeteer
const puppeteer = require('puppeteer');
2. Create an instance
This launches a browser environment through Puppeteer
const browser = await puppeteer.launch(options);
options:
- executablePath: puppeteer.executablePath() – path of the browser executable to launch; puppeteer.executablePath() returns the default Chrome location
- headless: false – whether to run in headless mode (false shows the browser window)
- slowMo: 250 – slows down each Puppeteer operation by the specified number of milliseconds
- devtools: true – whether to automatically open the DevTools panel for each tab
- defaultViewport – the viewport for each page; defaults to 800 x 600
  - width
  - height
  - deviceScaleFactor – device scale factor
  - isMobile – whether the meta viewport tag is taken into account. Default is false
  - hasTouch – whether the viewport supports touch events. Default is false
  - isLandscape – whether the viewport is in landscape mode
- For more parameters, see puppeteer.launch()
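The options above could be combined as follows. This is a sketch rather than a recommended configuration: buildLaunchOptions and its debug flag are hypothetical names introduced here, and the viewport size is an arbitrary choice.

```javascript
// Build launch options: a visible, slowed-down browser for debugging,
// or a plain headless one otherwise. `debug` is a hypothetical switch.
function buildLaunchOptions({ debug = false } = {}) {
  return {
    headless: !debug,          // show the browser window only when debugging
    slowMo: debug ? 250 : 0,   // delay each operation by 250 ms when debugging
    devtools: debug,           // open the DevTools panel when debugging
    defaultViewport: {
      width: 1280,
      height: 800,
      deviceScaleFactor: 1,
      isMobile: false,
      hasTouch: false,
      isLandscape: false,
    },
  };
}

console.log(buildLaunchOptions({ debug: true }));

// Usage sketch (requires `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(buildLaunchOptions({ debug: true }));
```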
3. Open a new page
const page = await browser.newPage();
4. Go to the target page
await page.goto(targetUrl);
Note: page.goto also accepts an optional second argument, an options object for some simple configuration:
waitUntil: when to consider the navigation finished
- load – when the load event fires
- domcontentloaded – when the DOMContentLoaded event fires
- networkidle0 – when there have been no more than 0 network connections for at least 500 ms
- networkidle2 – when there have been no more than 2 network connections for at least 500 ms
timeout: the maximum navigation time in milliseconds. The default is 30 seconds, and 0 disables the timeout. The default can be changed with page.setDefaultNavigationTimeout(timeout)
referer (rarely used): the value of the Referer header. If provided, it takes precedence over the referer header value set by page.setExtraHTTPHeaders()
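To show how these navigation options fit together, here is a minimal sketch. The helper name openPage and the shape of its return value are assumptions for illustration; page.goto rejects when the timeout is exceeded, so the call is wrapped in try/catch.

```javascript
// Navigate with explicit options and report the outcome instead of throwing.
async function openPage(page, url, { timeout = 30000, referer } = {}) {
  const options = { waitUntil: 'networkidle2', timeout };
  if (referer) options.referer = referer; // only send a Referer when one is given
  try {
    const response = await page.goto(url, options);
    // goto can resolve to null in some cases, so guard before reading the status.
    return { ok: true, status: response ? response.status() : null };
  } catch (err) {
    // Typically a navigation timeout or a network error.
    return { ok: false, error: err.message };
  }
}

// Usage sketch:
// const result = await openPage(page, 'https://example.com', { timeout: 10000 });
// if (!result.ok) console.error(result.error);
```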
5. Close the browser
await browser.close();
Start
In fact, the getting-started section already covers the most commonly used features. To sum up, crawling a web page takes just a few steps:
- Open a browser
- Crawl the page
- Close the browser
Isn't that easy? The only question is how to crawl. Can you use jQuery?
If you can use jQuery, you can write a crawler!
Find a video site you like (the following content is for teaching purposes only!)
const puppeteer = require('puppeteer');

const demo = async () => {
  const browser = await puppeteer.launch({
    executablePath: puppeteer.executablePath(),
    headless: false
  });
  const arr = [];
  for (let i = 1; i <= 40; i++) {
    console.log(`Fetching episode ${i} of "Full-Time Master"`);
    const targetUrl = `https://goudaitv1.com/play/78727-4-${i}.html`;
    console.log(targetUrl);
    const page = await browser.newPage();
    await page.goto(targetUrl, {
      timeout: 0,
      waitUntil: 'domcontentloaded'
    });
    const baseNode = '.row';
    // This callback runs inside the page and assumes the site itself loads jQuery as `$`
    const movieList = await page.evaluate((sel) => {
      // attr('src') already returns a string (or undefined), so return it directly
      return $(sel).find('iframe#Player').attr('src') || null;
    }, baseNode);
    arr.push(movieList);
    await page.close();
  }
  console.log(arr);
  await browser.close();
};
page.evaluate(pageFunction[, ...args])
- pageFunction <function|string> – the function to be evaluated in the page context
- ...args – the arguments to pass to pageFunction
- Returns: a Promise that resolves to the return value of pageFunction
If pageFunction returns a Promise, page.evaluate waits for the Promise to resolve and returns its value.
If pageFunction returns a value that cannot be serialized, page.evaluate resolves to undefined.
Passing arguments to pageFunction:
const result = await page.evaluate(x => {
  return Promise.resolve(8 * x);
}, 7);
console.log(result); // prints 56
You can also pass in a string
console.log(await page.evaluate('1 + 2')); // prints 3
const x = 10;
console.log(await page.evaluate(`1 + ${x}`)); // prints 11
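Because values that cannot be serialized come back as undefined, a common approach is to extract plain data inside the page and return only that. The sketch below assumes a hypothetical helper named collectLinks and uses the standard DOM API instead of jQuery, so it does not depend on what the target site loads.

```javascript
// Collect link data inside the page and return only plain, serializable values.
// Returning the DOM elements themselves would not survive the trip back to Node.
async function collectLinks(page, selector) {
  return page.evaluate((sel) => {
    // This callback runs in the browser, so `document` is the page's DOM.
    return Array.from(document.querySelectorAll(sel)).map((a) => ({
      text: a.textContent.trim(),
      href: a.href,
    }));
  }, selector);
}

// Usage sketch:
// const links = await collectLinks(page, 'a');
// console.log(links);
```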
Saving to a database
Done! You can do whatever you want with that data, like
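One possible next step is to persist the collected array. The sketch below writes it to a JSON file as a stand-in for a real database; the helper name saveResults, the file name, and the sample data are all assumptions made for this example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write the crawl results to a JSON file, skipping pages where no URL was found.
function saveResults(results, filePath) {
  const cleaned = results.filter(Boolean); // drop null/undefined entries
  fs.writeFileSync(filePath, JSON.stringify(cleaned, null, 2), 'utf8');
  return cleaned.length; // number of entries actually saved
}

// Example run with placeholder data, written to the system temp directory.
const outFile = path.join(os.tmpdir(), 'episodes.json');
console.log(saveResults(['https://example.com/ep1', null], outFile));
```

In the crawler above, the same helper could be called as saveResults(arr, 'episodes.json') right after the loop finishes.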
Finally
Of course, crawling is only the tip of the iceberg. The demo above takes the lazy route of navigating straight to each page's address; we can also use click events to navigate between pages. Give it a try if you are interested.
page.click(selector[, options])
- selector – a selector for the element to click. If multiple elements match, the first one is clicked.
- options
  - button – left, right, or middle. Defaults to left.
  - clickCount – defaults to 1. See UIEvent.detail.
  - delay – the time to wait between mousedown and mouseup, in milliseconds. Defaults to 0.
- Returns: a Promise that resolves once the matching element has been clicked. The Promise is rejected if no element matches the selector.
This method finds an element that matches the selector, scrolls it into view if needed, and then clicks it via page.mouse. It throws an error if the selector does not match any element.
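Since page.click rejects when nothing matches the selector, a caller may want to handle that case explicitly. This is a minimal sketch; tryClick is a hypothetical helper name and the 50 ms delay is an arbitrary value.

```javascript
// Click an element and report whether the click happened.
async function tryClick(page, selector) {
  try {
    // Left button, holding the press for 50 ms between mousedown and mouseup.
    await page.click(selector, { button: 'left', delay: 50 });
    return true;
  } catch (err) {
    // Reached when the selector matches no element (or the click fails).
    console.warn(`Could not click ${selector}: ${err.message}`);
    return false;
  }
}

// Usage sketch:
// const clicked = await tryClick(page, '#next');
```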
Note that if click() triggers a navigation, there is a separate page.waitForNavigation() Promise to wait on. The correct pattern for clicking and waiting for the navigation looks like this:
const [response] = await Promise.all([
page.waitForNavigation(waitOptions),
page.click(selector, clickOptions),
]);
- page.waitForNavigation([options])
This method resolves when the page navigates to a new URL or reloads. It is useful when your code indirectly causes the page to navigate.
See page.waitForNavigation([options]) for more information