Introduction to Puppeteer

With this tool, we can become puppeteers ourselves. Puppeteer is a Node.js library that drives Chrome through its DevTools API to control web pages. Compared to Selenium or PhantomJS, Puppeteer can run Chrome in headless mode, so pages are loaded and the DOM is manipulated by the real browser engine without opening a browser window. Most importantly, it is maintained by the Chrome team, which should mean better compatibility and a better long-term outlook.

What Puppeteer can do

  • Generate PDFs and screenshots of web pages
  • Crawl an SPA and generate pre-rendered content (i.e. "SSR", server-side rendering)
  • Scrape content from websites
  • Automate form submission, UI testing, keyboard input, and more
  • Create an up-to-date automated test environment (Chrome) in which to run test cases
  • Capture a timeline trace of your site to help analyze performance issues

Using Puppeteer

Installing Puppeteer

  • Because of network restrictions, the direct download of Chromium may fail. You can skip the Chromium download first and then download it manually.
# Install command
npm i puppeteer --save

# error message
ERROR: Failed to download Chromium r515411! Set "PUPPETEER_SKIP_CHROMIUM_DOWNLOAD" env variable to skip download.

# Set the environment variable to skip the Chromium download
# (Windows cmd syntax; on macOS/Linux use: export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1)
set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1

# Or install the module without running its install scripts
npm i --save puppeteer --ignore-scripts

# Module installed successfully
+ puppeteer@0.13.0
added 1 package in 1.77s
  • Download Chromium manually. After downloading, unzip the package to get Chromium.app and put it in any directory you like, for example /Users/huqiyang/Documents/project/z/chromium/Chromium.app. When the package is installed normally, Chromium.app is placed in the .local-chromium directory.
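
Since a manually downloaded Chromium can live anywhere, it can help to keep its path out of the script. The following is a minimal sketch, assuming a hypothetical environment variable named CHROMIUM_PATH (Puppeteer itself does not read it). When the variable is set, it is passed as executablePath; otherwise the option is omitted and Puppeteer falls back to the Chromium it bundles. The helper names buildLaunchOptions and launchBrowser are made up for illustration.

```javascript
// Build the options object for puppeteer.launch().
// CHROMIUM_PATH is a hypothetical variable name chosen for this sketch;
// Puppeteer itself does not read it.
function buildLaunchOptions(env) {
  const options = { headless: true };
  if (env.CHROMIUM_PATH) {
    // Point Puppeteer at the manually downloaded Chromium binary
    options.executablePath = env.CHROMIUM_PATH;
  }
  return options;
}

// Launch a browser with those options. puppeteer is required lazily so that
// buildLaunchOptions() can be used without the module being installed.
async function launchBrowser() {
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions(process.env));
}
```

A call to launchBrowser() could then stand in for a puppeteer.launch() call that hard-codes executablePath.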

Tip: what to do when the Chromium download fails

  • Switch to a Chromium mirror hosted in China
PUPPETEER_DOWNLOAD_HOST=https://storage.googleapis.com.cnpmjs.org npm i puppeteer
  • Or install with cnpm
npm install -g cnpm --registry=https://registry.npm.taobao.org
cnpm i puppeteer

See the Puppeteer API documentation for details.

A first try with Puppeteer: taking a screenshot

Key points

  • puppeteer.launch starts a browser instance
  • browser.newPage() creates a new page
  • page.goto navigates to the given web page
  • page.screenshot takes a screenshot
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // By default Puppeteer looks in <project directory>/node_modules/puppeteer/.local-chromium/
    executablePath: '/Users/huqiyang/Documents/project/z/chromium/Chromium.app/Contents/MacOS/Chromium',
    // Set the timeout
    timeout: 15000,
    // Ignore HTTPS errors
    ignoreHTTPSErrors: true,
    // Open developer tools; when this is true, headless is always false
    devtools: false,
    // Turn off headless mode (a visible browser window is opened)
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('https://www.jianshu.com/u/40909ea33e50');
  await page.screenshot({
    path: 'jianshu.png',
    type: 'png',
    // quality: 100, // only valid for JPEG
    fullPage: true
    // clip: {
    //   x: 0,
    //   y: 0,
    //   width: 1000,
    //   height: 40
    // }
  });
  await browser.close();
})();
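
The quality option applies only to JPEG output, so the right screenshot options depend on the target file. Here is a small sketch using a hypothetical helper, screenshotOptions (not part of Puppeteer), that derives type from the file extension and adds quality only for JPEG. Treating every other extension as PNG is an assumption made for this sketch.

```javascript
const path = require('path');

// Hypothetical helper: derive page.screenshot() options from the output file name.
// Files ending in .jpg/.jpeg become JPEG; anything else is treated as PNG.
function screenshotOptions(filePath, quality = 80) {
  const ext = path.extname(filePath).toLowerCase();
  const isJpeg = ext === '.jpg' || ext === '.jpeg';
  const options = { path: filePath, type: isJpeg ? 'jpeg' : 'png', fullPage: true };
  if (isJpeg) {
    // quality (0-100) is only valid for JPEG screenshots
    options.quality = quality;
  }
  return options;
}
```

It would be used as await page.screenshot(screenshotOptions('jianshu.jpg')).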

The results

Advanced: getting NetEase Cloud Music lyrics and comments

The NetEase Cloud Music APIs are encrypted with AES and RSA, and data can only be obtained by sending the encrypted payload in a POST request. With Puppeteer none of this matters: as long as an element is displayed on the page, we can read it.

Key points

  • page.type focuses the input box and types text into it
  • page.keyboard.press simulates pressing a key on the keyboard; key combinations currently do not work on Mac, which is a known bug
  • page.waitFor makes the page wait; the argument can be a time, an element selector, or a function
  • page.frames() returns all frames of the current page, from which you can pick the iframe you want by name
  • iframe.$('.srchsongst') gets an element inside the iframe
  • iframe.evaluate() runs a function in the browser, as if it were executed in the console, and returns a Promise
  • Array.from converts an array-like object into an array
  • page.click() clicks an element
  • iframe.$eval() is equivalent to running document.querySelector in the iframe and passing the matched element to the callback as its first argument
  • iframe.$$eval() is equivalent to running document.querySelectorAll in the iframe and passing the array of matched elements to the callback as its first argument
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    executablePath: '/Users/huqiyang/Documents/project/z/chromium/Chromium.app/Contents/MacOS/Chromium',
    headless: false
  });
  const page = await browser.newPage();
  // Go to the page
  await page.goto('https://music.163.com/#');

  // Click on the search box and type it in
  const musicName = 'Who the hell would think';
  await page.type('.txt.j-flag', musicName, {delay: 0});

  // Press Enter
  await page.keyboard.press('Enter');

  // Get the iframe of the song list
  await page.waitFor(2000);
  let iframe = await page.frames().find(f => f.name() === 'contentFrame');
  const SONG_LS_SELECTOR = await iframe.$('.srchsongst');

  // Get the address of the song
  const selectedSongHref = await iframe.evaluate(e => {
    const songList = Array.from(e.childNodes);
    // The title must match the text shown in the result list exactly
    const idx = songList.findIndex(v => v.childNodes[1].innerText.replace(/\s/g, ' ') === "I don't even think about it.");
    return songList[idx].childNodes[1].firstChild.firstChild.firstChild.href;
  }, SONG_LS_SELECTOR);

  // Go to the song page
  await page.goto(selectedSongHref);

  // Get the nested iframe of the song page
  await page.waitFor(2000);
  iframe = await page.frames().find(f => f.name() === 'contentFrame');

  // Click the expand button
  const unfoldButton = await iframe.$('#flag_ctrl');
  await unfoldButton.click();

  // Get the lyrics
  const LYRIC_SELECTOR = await iframe.$('#lyric-content');
  const lyricCtn = await iframe.evaluate(e => {
    return e.innerText;
  }, LYRIC_SELECTOR);

  console.log(lyricCtn);

  // Take a screenshot
  await page.screenshot({
    path: 'songs.png',
    fullPage: true
  });

  // Write to the file
  let writerStream = fs.createWriteStream('lyrics.txt');
  writerStream.write(lyricCtn, 'UTF8');
  writerStream.end();

  // Get the number of comments
  const commentCount = await iframe.$eval('.sub.s-fc3', e => e.innerText);
  console.log(commentCount);

  // Get comments
  const commentList = await iframe.$$eval('.itm', elements => {
    const ctn = elements.map(v => {
      return v.innerText.replace(/\s/g, ' ');
    });
    return ctn;
  });
  console.log(commentList);
})();
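
The title lookup inside iframe.evaluate fails with a TypeError when no row matches, because findIndex returns -1 and the row it indexes is undefined. One alternative is to do the matching in Node with a small pure function. This sketch assumes the rows have first been extracted as plain { title, href } objects (for example with iframe.$$eval). The names normalizeTitle and pickSongHref are hypothetical, and collapsing whitespace before comparing is a choice made for this sketch rather than the script's exact behavior.

```javascript
// Collapse runs of whitespace and trim, so "A\n  B " compares equal to "A B".
function normalizeTitle(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Return the href of the first row whose title matches, or null when nothing matches.
// rows: array of { title, href } objects extracted from the result list.
function pickSongHref(rows, wantedTitle) {
  const wanted = normalizeTitle(wantedTitle);
  const match = rows.find(row => normalizeTitle(row.title) === wanted);
  return match ? match.href : null;
}
```

A null result can then be reported clearly before calling page.goto.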

The results

Advanced crawling

Crawl an SPA and generate pre-rendered content (that is, "SSR", server-side rendering); in plain terms, we grab what is actually displayed on the page. Here we explore this by crawling vehicle information from Guazi, a used-car direct-sales site.

First, let's try with axios

const axios = require('axios');
const useAxios = () => {
  axios.get('https://www.guazi.com/hz/buy/')
    .then((result) => {
      console.log(result.data);
    })
    .catch((err) => {
      console.log(err);
    });
};

useAxios();

What comes back is clearly not the content we are looking for.

Crawling with Puppeteer

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    executablePath: '/Users/huqiyang/Documents/project/z/chromium/Chromium.app/Contents/MacOS/Chromium',
    headless: true
  });
  const page = await browser.newPage();

  // Go to the page
  await page.goto('https://www.guazi.com/hz/buy/');

  // Get the page title
  let title = await page.title();
  console.log(title);

  // Get the car brands
  // ($ inside page.evaluate is assumed to be the jQuery that the page itself loads)
  const BRANDS_INFO_SELECTOR = '.dd-all.clearfix.js-brand.js-option-hid-info';
  const brands = await page.evaluate(sel => {
    const ulList = Array.from($(sel).find('ul li p a'));
    const ctn = ulList.map(v => {
      return v.innerText.replace(/\s/g, ' ');
    });
    return ctn;
  }, BRANDS_INFO_SELECTOR);
  console.log('Car Brand:', JSON.stringify(brands));
  let writerStream = fs.createWriteStream('car_brands.json');
  writerStream.write(JSON.stringify(brands, undefined, 2), 'UTF8');
  writerStream.end();
  // await bodyHandle.dispose();

  // Get the car listings
  const CAR_LIST_SELECTOR = 'ul.carlist';
  const carList = await page.evaluate((sel) => {
    const catBoxs = Array.from($(sel).find('li a'));
    const ctn = catBoxs.map(v => {
      const title = $(v).find('h2.t').text();
      const subTitle = $(v).find('div.t-i').text().split('|');
      return {
        title: title,
        year: subTitle[0],
        milemeter: subTitle[1]
      };
    });
    return ctn;
  }, CAR_LIST_SELECTOR);

  console.log(`Total: ${carList.length}`, JSON.stringify(carList, undefined, 2));

  writerStream = fs.createWriteStream('car_info_list.json');
  writerStream.write(JSON.stringify(carList, undefined, 2), 'UTF8');
  writerStream.end();

  await browser.close();
})();
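
Each listing's subtitle is split on | into a year and a mileage. Here is a small sketch of that step as a standalone function, which also trims the pieces and tolerates a missing field. The function name parseSubTitle and the sample strings are made up for illustration; the field name milemeter follows the script above.

```javascript
// Split a listing subtitle such as "2015 | 32,000 km" into its fields.
// Missing parts become empty strings instead of undefined.
function parseSubTitle(subTitle) {
  const [year = '', milemeter = ''] = subTitle.split('|').map(part => part.trim());
  return { year, milemeter };
}

console.log(parseSubTitle(' 2015 | 32,000 km '));
```

Because page.evaluate serializes the function it receives, a helper like this would be applied in Node to the raw subtitle strings returned from the page, not referenced from inside the evaluated callback.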

The results

References

  • Official documentation for Puppeteer
  • Key name mapping
  • More tutorials on the web: Getting started with Puppeteer, Pre-puppeteer Testing automation, and using puppeteer-AutoTest to automate cnodeJS.
  • Check the Puppeteer issue tracker when you run into problems

To be continued. If you have questions, please leave a comment and we can learn together.