Puppeteer

Puppeteer is a Node library from the Chrome team for driving headless Chrome (a web browser without a graphical user interface). It provides a set of APIs for using Chrome's functionality without a UI, which makes it suitable for scenarios such as crawlers and automated processing.

What can it be used for?

  • Generate screenshots and PDFs of pages
  • Automate form submission, UI testing, keyboard input, and more
  • Create an up-to-date automated test environment: tests run directly in the latest version of Chrome, with the latest JavaScript and browser features
  • Crawl single-page applications (SPAs) and generate pre-rendered content (i.e. SSR)

How it differs from Cheerio

  • Cheerio – a library for parsing and manipulating HTML documents with jQuery-style syntax. It works only on static HTML and cannot see data loaded via Ajax. It is typically used together with axios (axios + Cheerio)
  • Puppeteer – simulates a real browser runtime environment and can request site content. It can simulate user actions (click, swipe, hover, etc.) and even inject scripts from Node to run inside the browser

Puppeteer architecture diagram

  • Puppeteer – communicates with the browser through the DevTools Protocol
  • Browser – a browser instance (Chromium) that can own multiple pages
  • Page – a page that contains at least one Frame
  • Frame – has at least one execution context for running JavaScript, and can have additional execution contexts
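
The hierarchy above can be walked from code. The following is a minimal sketch, not part of the original article: `describeBrowser` is a hypothetical helper name, and it assumes `browser` is a Puppeteer Browser instance (for example, the value returned by `puppeteer.launch()`).

```javascript
// Sketch: walk the Browser -> Page -> Frame hierarchy.
// `describeBrowser` is a hypothetical helper, not a Puppeteer API.
async function describeBrowser(browser) {
  const pages = await browser.pages(); // all pages (tabs) owned by the browser
  return pages.map((page) => ({
    url: page.url(),
    // page.frames() includes the main frame, so a page always has at least one
    frames: page.frames().map((frame) => frame.url()),
  }));
}
```

Each entry in the result lists a page URL together with the URLs of the frames attached to that page.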

Getting started

const puppeteer = require('puppeteer');

(async () => {
  const targetUrl = 'https://example.com'; // placeholder: replace with the page to capture
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(targetUrl);
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();

Code walkthrough

1. Import Puppeteer

const puppeteer = require('puppeteer');

2. Create an instance

This launches a browser instance through Puppeteer.

const browser = await puppeteer.launch(options);

options:

  • executablePath: puppeteer.executablePath() – path to the browser executable; puppeteer.executablePath() returns the default Chrome location
  • headless: false – whether to run in headless mode (false opens a visible browser window)
  • slowMo: 250 – slows down each Puppeteer operation by the specified number of milliseconds
  • devtools: true – opens the DevTools panel so the debugger can be used in the browser
  • defaultViewport – the viewport for each page, 800 x 600 by default
    • width
    • height
    • deviceScaleFactor – device scale factor
    • isMobile – whether the meta viewport tag is taken into account. Default is false
    • hasTouch – whether the viewport supports touch events. Default is false
    • isLandscape – whether the viewport is in landscape mode
  • For more parameters, see puppeteer.launch()
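
As an illustration of how these options fit together, here is a small sketch that builds an options object for debugging versus normal runs. `buildLaunchOptions` and its `debug` flag are hypothetical names for this example only, and the 1280 x 720 viewport is an arbitrary choice.

```javascript
// Sketch: assemble launch options from the fields listed above.
// `buildLaunchOptions` and `debug` are hypothetical, not Puppeteer APIs.
function buildLaunchOptions(debug) {
  return {
    headless: !debug,        // show a browser window while debugging
    slowMo: debug ? 250 : 0, // slow each operation by 250 ms while debugging
    devtools: debug,         // open DevTools while debugging
    defaultViewport: {
      width: 1280,           // arbitrary example size
      height: 720,
      deviceScaleFactor: 1,
      isMobile: false,
      hasTouch: false,
      isLandscape: false,
    },
  };
}
```

The result would be passed straight to the launch call, for example `puppeteer.launch(buildLaunchOptions(true))`.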

3. Open a new page

const page = await browser.newPage();

4. Go to the target page

await page.goto(targetUrl);


Note: page.goto accepts an optional second argument, an options object with the following fields.

waitUntil:

  • load – navigation is considered finished when the load event fires
  • domcontentloaded – navigation is considered finished when the DOMContentLoaded event fires
  • networkidle0 – navigation is considered finished when there have been no network connections for at least 500 ms
  • networkidle2 – navigation is considered finished when there have been no more than 2 network connections for at least 500 ms

timeout: the maximum navigation time in milliseconds. The default is 30 seconds, and 0 disables the timeout. The default can be changed with page.setDefaultNavigationTimeout(timeout).

referer (rarely used): the Referer header value. If provided, it takes precedence over the referer header value set by page.setExtraHTTPHeaders().
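
A short sketch of a navigation call that sets these options explicitly. `openPage` is a hypothetical helper name, `page` is assumed to be a Puppeteer Page, and the 60-second timeout is an arbitrary example value.

```javascript
// Sketch: navigate with explicit options instead of the defaults.
// `openPage` is a hypothetical helper, not a Puppeteer API.
async function openPage(page, url, referer) {
  const options = {
    waitUntil: 'networkidle2', // wait until at most 2 connections remain for 500 ms
    timeout: 60000,            // give slow pages 60 seconds instead of 30
  };
  if (referer) {
    options.referer = referer; // only set the Referer header when one is given
  }
  return page.goto(url, options);
}
```

Choosing networkidle2 rather than load is a trade-off: it waits longer, but pages that fetch data after the load event are more likely to be fully rendered.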

5. Close the browser

await browser.close();

Start

In fact, the getting started section already covers most of the commonly used functionality. To sum up, crawling a web page takes just a few steps:

  1. Open a browser
  2. Crawl
  3. Close the browser

Easy, isn't it? The remaining question is how to crawl. Do you know jQuery?

If you can use jQuery, you can write a crawler!
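
The three steps (open, crawl, close) can be wrapped in one helper so that the browser is closed even when the crawl step throws. This is a sketch under assumptions: `withBrowser` is a hypothetical name, and `launch` is any function that resolves to a browser, such as `() => puppeteer.launch()`.

```javascript
// Sketch: open a browser, run a crawl function, and always close the browser.
// `withBrowser` is a hypothetical helper, not a Puppeteer API.
async function withBrowser(launch, crawl) {
  const browser = await launch();  // 1. open a browser
  try {
    return await crawl(browser);   // 2. crawl
  } finally {
    await browser.close();         // 3. close the browser, even after an error
  }
}
```

Without the finally block, an exception during crawling would leave a Chrome process running in the background.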

Find a video site you like (the following content is for teaching purposes only!)

const puppeteer = require('puppeteer');

const demo = async () => {
  const browser = await puppeteer.launch({
    executablePath: puppeteer.executablePath(),
    headless: false
  })
  const arr = []
  for (let i = 1; i <= 40; i++) {
    console.log('Fetching "Full-Time Master" episode ' + i)
    const targetUrl = `https://goudaitv1.com/play/78727-4-${i}.html`
    console.log(targetUrl)
    const page = await browser.newPage()
    await page.goto(targetUrl, {
      timeout: 0,
      waitUntil: 'domcontentloaded'
    })
    const baseNode = '.row'
    // Runs inside the page; assumes the site itself loads jQuery as `$`.
    const streamUrl = await page.evaluate((sel) => {
      // attr('src') returns a string, or undefined if the iframe is missing
      return $(sel).find('iframe#Player').attr('src') || null
    }, baseNode)
    arr.push(streamUrl)
    await page.close()
  }
  console.log(arr)
  await browser.close()
}

page.evaluate(pageFunction[, ...args])

  • pageFunction <function|string> – the function to be executed in the context of the page
  • ...args – the arguments to pass to pageFunction
  • returns: the result of executing pageFunction

If pageFunction returns a Promise, page.evaluate waits for the Promise to resolve and returns its value.

If pageFunction returns a value that cannot be serialized, undefined is returned.

An example of passing an argument to pageFunction:

const result = await page.evaluate(x => {
  return Promise.resolve(8 * x);
}, 7);
console.log(result); // prints "56"

You can also pass in a string instead of a function:

console.log(await page.evaluate('1 + 2')); // prints "3"
const x = 10;
console.log(await page.evaluate(`1 + ${x}`)); // prints "11"
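
Because only serializable values come back from the page, a common pattern is to reduce DOM elements to plain strings before returning them. The sketch below uses `page.$$eval`, which passes all elements matching a selector to a function that runs in the page. `collectSrc` and `getFrameSources` are hypothetical names, and the default `'iframe'` selector is only an example.

```javascript
// Runs inside the page: turn DOM elements into plain, serializable strings.
// It must not reference outside variables, since it is sent to the browser as source text.
function collectSrc(elements) {
  return elements.map((el) => el.src).filter(Boolean);
}

// Sketch: `page` is assumed to be a Puppeteer Page.
// `getFrameSources` is a hypothetical helper, not a Puppeteer API.
async function getFrameSources(page, selector = 'iframe') {
  return page.$$eval(selector, collectSrc);
}
```

Returning the elements themselves would not work here, since DOM nodes cannot be serialized back to Node.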

Storing in a database

Done! You can do whatever you like with the data, such as storing it in a database.

Finally

Of course, crawling is only the tip of the iceberg. The demo above takes the lazy route of going straight to each address; we could also navigate by triggering click events. Give it a try if you're interested.

page.click(selector[, options])

  • selector – a selector for the element to click. If multiple elements match, the first one is clicked.
  • options
    • button – left, right, or middle. Defaults to left.
    • clickCount – defaults to 1. See UIEvent.detail.
    • delay – time to wait between mousedown and mouseup, in milliseconds. Defaults to 0.
  • returns: a Promise that resolves once the matching element has been clicked. The Promise is rejected if no element matches the selector.
  • This method finds an element matching the selector, scrolls it into view if needed, and then clicks it using page.mouse. It throws an error if the selector matches no elements.

Note that if click() triggers a navigation, there is a separate page.waitForNavigation() Promise to wait on. The correct pattern for waiting on the navigation looks like this:


const [response] = await Promise.all([
  page.waitForNavigation(waitOptions),
  page.click(selector, clickOptions),
]);

page.waitForNavigation([options])

This method resolves when the page navigates to a new URL or reloads. It is useful when your code indirectly causes the page to navigate.

See page.waitForNavigation([options]) for more information.

References

Puppeteer Chinese website

Puppeteer npm