The previous tutorial, Node.js crawler introduction (1): crawling static pages, covered crawling static pages, which is quite simple. But with some dynamic pages (rendered via Ajax), sending requests the way we did before won't get the data we want. That is where crawling dynamic web pages comes in, and both Selenium and Puppeteer are good tools for the job.

Puppeteer is recommended here for no other reason than that it is Google's own project and is actively maintained and updated. The following is a translation of the official documentation:

Puppeteer is a Node library that provides a set of high-level APIs for controlling Chrome or Chromium over the DevTools Protocol. It runs in headless mode (no browser UI) by default, but can be configured to run in full, headful mode.

It can be used:

  • Take screenshots of web pages and export them as PDFs (see the sketch after this list)
  • Crawl SPA and SSR sites
  • Automate form submission, UI testing, keyboard input, and so on
  • Create an up-to-date automated testing environment, running tests directly against the latest version of Chrome and the latest JS features
  • Capture a timeline trace of a site to help diagnose performance problems
  • Test Chrome extensions
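
As a quick taste of the first item, here is a minimal sketch of taking a screenshot and exporting a PDF. The URL, output file names, and the A4 format option are illustrative choices of mine, not something from the original article.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()
    await page.goto('https://example.com')
    // Save a PNG screenshot of the current viewport
    await page.screenshot({ path: 'example.png' })
    // Export the page as a PDF (PDF export requires headless mode)
    await page.pdf({ path: 'example.pdf', format: 'A4' })
    await browser.close()
})()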

Before using it, we first need to install it. By default, installing puppeteer also downloads a recent build of Chromium, which is about 300 MB.

npm install puppeteer

If you already have a recent version of Chrome on your machine, you can install just the core package instead, and then configure the local Chrome path when launching Puppeteer.

npm install puppeteer-core  // core package only, without the bundled Chromium

Let's say we want to crawl the front-end job listings on Lagou. This is a dynamic page, so let's use it as our example.

Since all operations against Chrome are asynchronous, it is recommended to use ES2017 async/await syntax (as the official documentation does) to avoid callback hell and keep the code readable.

  1. Start the browser with puppeteer and open the dynamic page

Note that if you are using a local browser (puppeteer-core), you need to pass the local Chrome path in the launch configuration:

const browser = await puppeteer.launch({
    executablePath: 'C:/Users/Administrator/AppData/Local/Google/Chrome/Application/chrome.exe'
})
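With the browser launched, the rest of this step is to open a new tab and navigate it to the dynamic page. A minimal sketch, using the same Lagou URL as the full code below; the waitUntil option is my own addition to give the Ajax-rendered content a chance to settle, whereas the original code simply calls page.goto with the URL alone.

const page = await browser.newPage()
// Navigate and wait until network activity has mostly stopped, so Ajax requests have finished
await page.goto('https://www.lagou.com/jobs/list_web%E5%89%8D%E7%AB%AF?px=new&city=%E5%B9%BF%E5%B7%9E', {
    waitUntil: 'networkidle2'
})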
  2. Execute a function in the Chrome environment to retrieve the required data and return it to the Node execution environment

The figure above shows where the data we need sits in the DOM; the function executed in the Chrome environment collects and organizes it.

// Select every job item in the listing
let list = document.querySelectorAll('.s_position_list .item_con_list li')
let res = []
for (let i = 0; i < list.length; i++) {
    res.push({
        // Position name, company and salary are stored as data attributes on each <li>
        name: list[i].getAttribute('data-positionname'),
        company: list[i].getAttribute('data-company'),
        salary: list[i].getAttribute('data-salary'),
        // The requirements live in a text node; strip spaces and newlines
        require: list[i].querySelector('.li_b_l').childNodes[4].textContent.replace(/ |\n/g, ''),
    })
}
return res

Here's a debugging tip: the function that extracts the data can be written and tested directly in the Chrome DevTools console before handing it to Puppeteer.
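
For example, since a bare return is not allowed at the top level of the console, one approach (my own suggestion, not from the original article) is to wrap the snippet in an immediately invoked function while testing; the console then prints the returned array so you can inspect it on the spot.

(() => {
    // Same selector as the crawler uses
    let list = document.querySelectorAll('.s_position_list .item_con_list li')
    let res = []
    for (let i = 0; i < list.length; i++) {
        res.push({ name: list[i].getAttribute('data-positionname') })
    }
    return res
})()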

Finally, here is the complete code:

const puppeteer = require('puppeteer');

(async () => {
    // Launch the browser
    const browser = await puppeteer.launch({
        headless: false, // headless mode is the default; turn it off to watch the browser work
    })
    // Control the browser to open a new tab
    const page = await browser.newPage()
    // Open the page to crawl in the new tab
    await page.goto('https://www.lagou.com/jobs/list_web%E5%89%8D%E7%AB%AF?px=new&city=%E5%B9%BF%E5%B7%9E')
    // Use the evaluate method to run the passed-in function in the browser (a full browser
    // environment) and return the result to the Node environment
    let data = await page.evaluate(() => {
        let list = document.querySelectorAll('.s_position_list .item_con_list li')
        let res = []
        for (let i = 0; i < list.length; i++) {
            res.push({
                name: list[i].getAttribute('data-positionname'),
                company: list[i].getAttribute('data-company'),
                salary: list[i].getAttribute('data-salary'),
                require: list[i].querySelector('.li_b_l').childNodes[4].textContent.replace(/ |\n/g, ''),
            })
        }
        return res
    })
    console.log(data)
})()
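To run it, save the snippet to a file (for example crawl.js, a name chosen here purely for illustration) and execute it with node crawl.js; the extracted job list is printed to the terminal.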

The results

That completes crawling a dynamic web page, but Puppeteer can do far more than this and has many powerful APIs; refer to the official documentation for details.
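One API particularly worth knowing for dynamic pages is page.waitForSelector, which pauses until a matching element appears in the DOM. A minimal sketch of using it in this crawler, added here as my own suggestion rather than part of the original code:

// Wait for the Ajax-rendered job list to appear before extracting data
await page.waitForSelector('.s_position_list .item_con_list li')
let data = await page.evaluate(() => {
    // ... same extraction function as above ...
})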

The third installment will show how to run the crawler on a schedule and store the results in a database.