Using Puppeteer as a Node crawler

Puppeteer is a Node crawler framework built on Chromium. Its great strength is that it has all the functionality of a real browser and can be controlled from Node.js, which makes it a perfect fit for the crawler we want to build.

Installing Puppeteer also downloads Chromium. If your network connection is poor, install it with cnpm instead.

The official project also publishes a separate package. To quote the documentation:

Since version 1.7.0 we publish the puppeteer-core package, a version of Puppeteer that doesn't download any browser by default.

This means that puppeteer-core gives you Puppeteer's core functionality without downloading a full browser. I did not use it this time; I will cover it in detail next time ~

A real-world demo

A Node command-line tool. It asks the ping.chinaz.com/ website for the IP addresses behind a domain and their response times, then pings each IP address from the local machine to find the best node.

Result when run:

PS: Why write a tool instead of just checking on the website? Because an IP address that the website reports as reachable may still fail to ping locally, and trying them one by one is not practical, so this tool was born ~

Why use Puppeteer? Because the site doesn't provide an API, we can only read the information we need from the page after its requests have completed.

Development starts with the documentation

Official documentation: PPTR

If it fails to open the first time, add 185.199.110.133 raw.githubusercontent.com to your hosts file, because the documentation is loaded from GitHub. Also uncheck "Disable cache" in the Network panel.

Start the browser

A demo from the official website:

const puppeteer = require('puppeteer')

;(async () => {
  // Create a browser object
  const browser = await puppeteer.launch()
  // Open a new page
  const page = await browser.newPage()
  // Set the URL of the page
  await page.goto('https://www.google.com')
  // other actions...

  // Finally close the browser (if not, node will not end)
  await browser.close()
})()

puppeteer.launch

First, take a look at the launch documentation.

I only used one option for launch: headless, which controls whether the browser interface is hidden. While debugging at the beginning I wanted to watch the browser, so I chose to show the browser interface.

const browser = await puppeteer.launch({
  headless: false // false shows the browser interface, true hides it
})

page.goto

await page.goto('https://ping.chinaz.com/www.baidu.com', {
  timeout: 0, // No timeout limit
  waitUntil: 'networkidle0'
})

page is the current tab. See the page.goto documentation.

When the page opens, ping.chinaz.com/ automatically starts its requests if a URL follows it in the path, so all I have to do is wait until those requests have completed before fetching the content.

goto provides several options. The first is timeout, the maximum time to wait for the navigation, which defaults to 30 seconds. Setting it to 0 disables the timeout, so you wait indefinitely.

The second is waitUntil, and I'm using networkidle0. With it, navigation is considered finished once there have been no network connections for at least 500 milliseconds, and the script then continues.
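The two options can be bundled into a small helper. This is only a sketch: openAndWait is a hypothetical name (not part of Puppeteer), and it assumes the browser argument is the object returned by puppeteer.launch().

```javascript
// Hypothetical helper (not a Puppeteer API): open a tab and wait for the network to go quiet.
// `browser` is assumed to be the object returned by puppeteer.launch().
async function openAndWait(browser, url, timeoutMs = 0) {
  const page = await browser.newPage()
  await page.goto(url, {
    timeout: timeoutMs, // 0 disables the navigation timeout
    waitUntil: 'networkidle0' // finished once there are no network connections for at least 500 ms
  })
  return page
}
```

It would be called as `const page = await openAndWait(browser, 'https://ping.chinaz.com/www.baidu.com')` inside an async function.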

Getting the elements on the page

Normally we query elements in the browser console using the $ functions. Puppeteer offers similar functionality, but in more granular detail.

Page selector documentation

page.$ and page.$eval, page.$$ and page.$$eval: the idea is the same in each pair. The $ variants only return the first matching element, while the $$ variants find all matching elements and give them to you as an array.

page.$$(selector) and page.$$eval(selector, pageFunction[, ...args]) are similar in that both query elements.

page.$$ finds all matching nodes and gives you handles with methods to operate on them. For example, you can perform page…

page.$$eval is closest to the way we normally query: the callback that follows receives the matched DOM nodes and can operate on them directly.

So I use page.$$eval to query all the rows in the table, and the callback to process the innerText of each item.

let list = await page.$$eval('#speedlist .listw', options =>
  options.map(option => {
    let [city, ip, ipaddress, responsetime, ttl] = option.innerText.split(/[\n]/g)
    return { city, ip, ipaddress, responsetime, ttl }
  })
)
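The parsing step inside the callback can be tried without a browser. The sketch below assumes each row's innerText holds one field per line in the order the destructuring expects. The sample row is made up for illustration, and parseRow is a hypothetical name.

```javascript
// Same parsing as the $$eval callback, pulled out so it can run without a browser.
// parseRow is a hypothetical name used only for this sketch.
function parseRow(innerText) {
  const [city, ip, ipaddress, responsetime, ttl] = innerText.split(/[\n]/g)
  return { city, ip, ipaddress, responsetime, ttl }
}

// Made-up sample row; 192.0.2.1 is an address reserved for documentation.
const sampleRow = 'Example City\n192.0.2.1\nExample ISP\n25ms\n52'
console.log(parseRow(sampleRow))
```

Every field comes back as a string, which is why the full script later runs parseInt on responsetime before comparing it.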

All that remains is business logic.

Ping method

The ping method is particularly simple to implement with the ping package from npm.

ping.sys.probe(ipAddress, callback)

The following wraps it in an asynchronous method: pass in an IP address and it measures how long the local ping takes. Wrapping it in a Promise lets us use async and await.

const ping = require('ping')

const pingIp = ip =>
  new Promise((resolve, reject) => {
    let startTime = new Date().getTime()
    ping.sys.probe(ip, function(isAlive) {
      if (isAlive) {
        // Reachable: report the elapsed time in milliseconds
        resolve({
          ip: ip,
          time: new Date().getTime() - startTime
        })
      } else {
        // Unreachable: reject with a time of -1
        reject({
          ip: ip,
          time: -1
        })
      }
    })
  })

Icing on the cake

If our Promise rejects, normally we can only handle it with try/catch. That is not very elegant; see my previous article on how to gracefully handle errors thrown by async functions.

const awaitWrap = promise => {
  return promise.then(data => [data, null]).catch(err => [null, err])
}
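A short sketch of how the wrapper is consumed. The two stub functions stand in for pingIp and are assumptions for illustration, as are the documentation-range IP addresses.

```javascript
const awaitWrap = promise =>
  promise.then(data => [data, null]).catch(err => [null, err])

// Stand-ins for pingIp (assumed for this sketch); the IPs are reserved documentation addresses.
const reachable = () => Promise.resolve({ ip: '192.0.2.1', time: 12 })
const unreachable = () => Promise.reject({ ip: '192.0.2.2', time: -1 })

async function demo() {
  // Success: the first slot holds the data, the second is null
  const [res, noError] = await awaitWrap(reachable())
  // Failure: the first slot is null, the second holds the rejection value
  const [noResult, error] = await awaitWrap(unreachable())
  return { res, noError, noResult, error }
}

demo().then(result => console.log(result))
```

Because the rejection is turned into a value, a loop over many IP addresses can keep going after a failed ping without a try/catch around each await.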

Loading effect in the project

The whole process takes a long time and involves a lot of waiting, so a good loading prompt is very important. The ora package from npm is used.

Add a loading spinner and a matching message at every step.
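One way to avoid repeating the start/succeed/fail calls is a small wrapper. This is a sketch, not what the project code does: withSpinner is a hypothetical helper, and createSpinner is assumed to behave like ora, meaning createSpinner(text).start() returns an object with succeed(text) and fail(text).

```javascript
// Hypothetical helper: run one async step with a spinner around it.
// `createSpinner` is assumed to behave like ora (start, succeed, fail).
async function withSpinner(createSpinner, text, task) {
  const spinner = createSpinner(text).start()
  try {
    const result = await task()
    spinner.succeed(`${text}: done`)
    return result
  } catch (err) {
    spinner.fail(`${text}: failed`)
    throw err // let the caller decide how to handle the failure
  }
}
```

With ora and puppeteer loaded, a call might look like `const browser = await withSpinner(ora, 'Initialize the browser environment', () => puppeteer.launch())`.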

Complete project code

Dependencies:

npm package    Role
ora            Friendly loading effect
puppeteer      The crawler framework
ping           Pings the nodes

index.js

const ora = require('ora')
let inputUrl = process.argv[2]
if (!inputUrl) {
  ora().warn('Please enter the URL you want to check')
  process.exit()
}

const puppeteer = require('puppeteer')

const { awaitWrap, pingIp, moveHttp } = require('./utils')

let url = moveHttp(inputUrl)

const BaiseUrl = 'https://ping.chinaz.com/'

;(async () => {
  const init = ora('Initialize the browser environment').start()
  const browser = await puppeteer.launch({
    headless: true // true hides the browser interface
  })
  init.succeed('Initialization complete')
  const loading = ora(`Parsing ${url}`).start()
  const page = await browser.newPage() // Create a new page
  await page.goto(BaiseUrl + url, {
    timeout: 0, // No timeout limit
    waitUntil: 'networkidle0'
  })

  loading.stop()
  let list = await page.$$eval('#speedlist .listw', options =>
    options.map(option => {
      let [city, ip, ipaddress, responsetime, ttl] = option.innerText.split(/[\n]/g)
      return { city, ip, ipaddress, responsetime, ttl }
    })
  )

  if (list.length == 0) {
    ora().fail('Please enter the correct URL or IP')
    process.exit()
  }

  ora().succeed('Got the IP addresses, now trying to connect to each one')

  let ipObj = {}
  let success = []
  let failList = []
  let fast = Infinity
  let fastIp = ''
  for (let i = 0; i < list.length; i++) {
    let item = list[i]
    let time = parseInt(item.responsetime, 10)
    // Skip rows without a numeric response time and IPs that were already tried
    if (!isNaN(time) && !ipObj[item.ip]) {
      const tryIp = ora(`Trying ${item.ip}`).start()
      let [res, error] = await awaitWrap(pingIp(item.ip))
      if (!error) {
        success.push(res.ip)
        if (res.time < fast) {
          fast = res.time
          fastIp = res.ip
        }
        tryIp.succeed(`${res.ip} connected, time taken: ${res.time}ms`)
      } else {
        failList.push(error.ip)
        tryIp.fail(`${error.ip} connection failed`)
      }

      ipObj[item.ip] = time
    }
  }

  if (success.length > 0) {
    ora().succeed(`Request succeeded: ${JSON.stringify(success)}`)
  }
  if (failList.length > 0) {
    ora().fail(`Request failed: ${JSON.stringify(failList)}`)
  }
  if (fastIp) {
    ora().info(`Recommended node: ${fastIp}, time: ${fast}ms`)
    ora().info(`hosts configuration: ${fastIp} ${url}`)
  }

  await browser.close() // Close the browser
})()

utils/index.js

var ping = require('ping')

module.exports = {
  // Turn a promise into a [data, error] pair so callers can skip try/catch
  awaitWrap: promise => {
    return promise.then(data => [data, null]).catch(err => [null, err])
  },

  // Ping an IP address locally and report how long it took
  pingIp: ip =>
    new Promise((resolve, reject) => {
      let startTime = new Date().getTime()
      ping.sys.probe(ip, function(isAlive) {
        if (isAlive) {
          resolve({
            ip: ip,
            time: new Date().getTime() - startTime
          })
        } else {
          reject({
            ip: ip,
            time: -1
          })
        }
      })
    }),

  // Strip the protocol, and the trailing slash of a bare domain
  moveHttp: val => {
    val = val.replace(/http(s)?:\/\//i, '')
    var temp = val.split('/')
    if (temp.length <= 2) {
      if (val[val.length - 1] === '/') {
        val = val.substring(0, val.length - 1)
      }
    }
    return val
  }
}
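To make the behavior of moveHttp concrete, here is a standalone copy with a few inputs. The example URLs are chosen for illustration only.

```javascript
// Standalone copy of moveHttp from utils/index.js so this snippet runs on its own.
const moveHttp = val => {
  val = val.replace(/http(s)?:\/\//i, '')
  const temp = val.split('/')
  // Only a bare domain (at most one slash) loses its trailing slash
  if (temp.length <= 2 && val[val.length - 1] === '/') {
    val = val.substring(0, val.length - 1)
  }
  return val
}

console.log(moveHttp('https://www.baidu.com/'))  // www.baidu.com
console.log(moveHttp('www.baidu.com'))           // www.baidu.com
console.log(moveHttp('http://example.com/a/b/')) // example.com/a/b/
```

The protocol is always removed, but a trailing slash is only dropped when nothing follows the domain, so longer paths pass through unchanged.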

Conclusion

Throughout the project, the most important parts are opening the browser and learning to read about the various events in the documentation. Here I only waited for the browser's network requests to finish, but there are many other capabilities:

  • Waiting until a node appears, then running a callback
  • Intercepting requests
  • Running a callback after a request finishes
  • Taking screenshots
  • Website performance analysis
  • And so on…
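A few of these can be combined in one short sketch. captureWithoutImages is a hypothetical helper written for illustration, and the page argument is assumed to be a Puppeteer Page. The calls it relies on (setRequestInterception, the request event, waitForSelector, screenshot) are Puppeteer page methods, but check them against the documentation for your version.

```javascript
// Hypothetical helper: block image requests, wait for a node, then take a screenshot.
// `page` is assumed to be a Puppeteer Page; the selector and file path are chosen by the caller.
async function captureWithoutImages(page, url, selector, screenshotPath) {
  // Intercept requests so each one can be aborted or allowed through
  await page.setRequestInterception(true)
  page.on('request', request => {
    if (request.resourceType() === 'image') {
      request.abort()
    } else {
      request.continue()
    }
  })

  await page.goto(url)
  await page.waitForSelector(selector) // resolves once the node appears
  await page.screenshot({ path: screenshotPath })
}
```

Once interception is on, every request has to be either continued or aborted, otherwise the page stalls, which is why the handler has both branches.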

Overall, the arrival of Puppeteer gives Node more possibilities. I hope the Node ecosystem keeps getting better.

Of course, when debugging Node programs, especially with so many nodes and prototype-chain methods involved, debugging in the terminal alone is not enough. To debug your Node scripts better, check out my comparison of Node debugging tools.

(getIp is my project folder, and index.js is the index.js code shown above)

The end ~