Most web pages today are dynamic. If you simply fetch the raw HTML, you miss everything loaded afterwards, such as product prices and images, to say nothing of login walls and other access restrictions. For a small crawler, reverse-engineering complex scripts is more trouble than it is worth, and sites update constantly, so a parser you have painstakingly cracked can break at any time and force you to start over. All of this makes crawling much harder.

Fortunately, Node.js has a tool that shrugs off access restrictions and anti-crawler measures: by simply simulating a real user, it can get around most of them. That tool is Google's Puppeteer.

1. Advantages and disadvantages of Puppeteer

Puppeteer is essentially a Chrome browser that you drive from code. It can simulate mouse clicks, keyboard input, and other user actions, a bit like a macro-recording tool, so it is hard for a site to tell whether it is facing a human user or a crawler, and most restrictions simply never trigger.

Its big advantage is simplicity, real simplicity: it is probably the easiest to use of all the libraries that can crawl dynamic web pages.

The disadvantage is just as obvious: it is slow and not very efficient. It launches a full Chrome browser on every run, so it falls far short of other libraries and is not suited to large-scale crawling. For a small crawler, though, it is more than enough.

Let's use a crawler I actually wrote, one that watches a JD.com product page, to see how simple it is. I originally wrote it because I wanted to buy an Apple Magic Trackpad. After shopping around, I found the prices on JD's Paipai auction site very attractive. The Magic Trackpad may well be the only item there worth fighting over, but listings are rare and only appear once in a long while.

So I decided to monitor the product page and pop up a reminder whenever a new Magic Trackpad listing appeared. I could even have added automatic bidding, but I didn't write it: I don't want to buy anything except the trackpad, so there would have been no way to test it.

OK, here we go!

2. Step 1: Install Puppeteer

Install the Puppeteer library; it is the only dependency we use:

```shell
npm install puppeteer
```

3. Step 2: Connect to the page

Connecting to a web page is just as simple, requiring only a few lines of code:

```javascript
// Launch the browser
const browser = await puppeteer.launch()
// Open a new page
const page = await browser.newPage()
// Navigate to the URL
await page.goto(url)
```

And that's it, we're connected! puppeteer.launch() accepts many more options, but here we only use headless. It defaults to true; set it to false and the browser window is displayed. We can exploit this to implement the pop-up notification: whenever a qualifying item is found, relaunch with headless set to false.
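The headless toggle can be captured in a tiny helper. This is my own sketch for illustration: the `launchOptions` name is not part of Puppeteer or the original code.

```javascript
// Hypothetical helper: derive launch options from whether a match was found.
// headless stays true while silently monitoring; switching it to false
// makes the next launch pop the browser window up as the notification.
function launchOptions(found) {
    return { headless: !found }
}

// While nothing is found, keep running invisibly:
console.log(launchOptions(false)) // { headless: true }
// Once the trackpad shows up, show the browser:
console.log(launchOptions(true))  // { headless: false }
```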

4. Crawl commodity information

After connecting to the page, the next step is to crawl the product information and then analyze it.

Target page: the Magic Trackpad auction listing

4.1 Obtaining the corresponding element label

As you can see on the page, whenever a new listing appears it shows up in the "recommended auctions" section, so that is the only part we need to crawl. There are two ways to do it:

One is page.$eval, the equivalent of document.querySelector in the browser: it grabs only the first element that matches.

The other is page.$$eval, the equivalent of document.querySelectorAll: it grabs all matching elements.

Both take a CSS selector as the first argument and a callback as the second, and the callback receives the matched element(s). Look at the code:

```javascript
// Grab all the items in the "recommended auctions" list
const goods = page.$$eval('#auctionRecommend > div.mc > ul > li', ele => ele)
```
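Outside the browser, the difference between the two boils down to querySelector versus querySelectorAll. Here is a plain-array analogy; the `evalFirst` and `evalAll` names are made up for illustration and are not Puppeteer APIs:

```javascript
// Plain-array analogy for the two Puppeteer helpers:
// $eval  passes the FIRST match to the callback,
// $$eval passes ALL matches to the callback at once.
const items = ['Magic Trackpad', 'Magic Mouse', 'Magic Keyboard']

const evalFirst = (list, fn) => fn(list[0]) // like page.$eval
const evalAll = (list, fn) => fn(list)      // like page.$$eval

console.log(evalFirst(items, el => el.length)) // 14: length of the first name
console.log(evalAll(items, els => els.length)) // 3: number of matches
```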

4.2 Analyzing product information

Now that we have the tags for all the items in the recommended list, let's analyze them: take each item's name and check whether it contains the keyword. If it does, relaunch with headless set to false; if not, connect again half an hour later.
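The core of that check is plain string matching. Pulled out as a standalone function it might look like this; `hasKeyword` is a hypothetical name, not from the original code:

```javascript
// Hypothetical extraction of the keyword check:
// true as soon as any product name contains the keyword, false otherwise.
function hasKeyword(names, keyword) {
    for (const n of names) {
        if (n.includes(keyword)) return true
    }
    return false
}

console.log(hasKeyword(['Magic Mouse', 'Magic Trackpad'], 'Trackpad')) // true
console.log(hasKeyword(['Magic Mouse'], 'Trackpad'))                   // false
```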

Puppeteer provides a wait command, page.waitFor(), which can wait either for a length of time or for an element to finish loading.
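Waiting by time is just a timed promise under the hood. Here is a minimal stand-in for the time-based form, my own sketch rather than Puppeteer's implementation:

```javascript
// Minimal stand-in for the time-based page.waitFor(ms):
// resolve a promise after the given delay.
const wait = ms => new Promise(resolve => setTimeout(resolve, ms))

// Usage: pause for two seconds between actions
async function demo() {
    console.log('waiting...')
    await wait(2000)
    console.log('done')
}
```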

```javascript
const goods = page.$$eval('#auctionRecommend > div.mc > ul > li', el => {
    // Both errors and a missing keyword return false, so monitoring continues
    try {
        for (let i = 0; i < el.length; i++) {
            let n = el[i].querySelector('div.p-name').textContent
            if (n.includes('Control panel')) {
                return true
            }
        }
        return false
    } catch (error) {
        return false
    }
})

// The window is already visible: stop monitoring
if (!bool) {
    return console.log('Page open, no longer monitored')
}

await goods.then(async (b) => {
    if (b) {
        console.log("It's in stock!")
        await page.waitFor(2000)
        await browser.close()
        return requestUrl(false)
    } else {
        console.log('Not available yet')
        console.log('Try again in 30 minutes.')
        await page.waitFor(1800000)
        await browser.close()
        return requestUrl(true)
    }
})
```

5. Optimize the code

For a small crawler like this, the efficiency loss hardly matters and optimization is not strictly necessary, but as a bit of a perfectionist I still want to trim whatever I can.

5.1 Blocking Images

In this crawler we never need to look at any images, so there is no point loading them. For a little extra efficiency, we block them all:

```javascript
// Enable request interception
await page.setRequestInterception(true)
page.on('request', interceptedRequest => {
    // Check whether the URL ends in .jpg or .png
    if (interceptedRequest.url().endsWith('.jpg') || interceptedRequest.url().endsWith('.png')) {
        interceptedRequest.abort()
    } else {
        interceptedRequest.continue()
    }
})
```
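The interceptor's decision is a pure function of the URL. Factored out, it might look like this; `shouldBlock` is a hypothetical name for illustration:

```javascript
// Hypothetical helper mirroring the interceptor's check:
// block any request whose URL ends in .jpg or .png.
const shouldBlock = url => url.endsWith('.jpg') || url.endsWith('.png')

console.log(shouldBlock('https://img.jd.com/item.jpg'))      // true
console.log(shouldBlock('https://paipai.jd.com/index.html')) // false
```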

5.2 Resize the Window

When the browser window pops up, you will notice that its visible area is very small, which is not only awkward to browse but can also cause clicks and keystrokes to miss. So we adjust the viewport:

```javascript
await page.setViewport({
    width: 1920,
    height: 1080,
})
```

At this point all the code is complete; let's try it out!

6. Complete code

```javascript
const puppeteer = require('puppeteer')

const url = 'https://paipai.jd.com/auction-detail/114533257?entryid=p0120003dbdnavi'

const requestUrl = async function (bool) {
    const browser = await puppeteer.launch({ headless: bool })
    const page = await browser.newPage()

    // Block all .jpg / .png requests
    await page.setRequestInterception(true)
    page.on('request', interceptedRequest => {
        if (interceptedRequest.url().endsWith('.jpg') || interceptedRequest.url().endsWith('.png')) {
            interceptedRequest.abort()
        } else {
            interceptedRequest.continue()
        }
    })

    // Enlarge the viewport so the page is usable when the window shows
    await page.setViewport({
        width: 1920,
        height: 1080,
    })

    await page.goto(url)

    const goods = page.$$eval('#auctionRecommend > div.mc > ul > li', el => {
        try {
            for (let i = 0; i < el.length; i++) {
                let n = el[i].querySelector('div.p-name').textContent
                if (n.includes('Control panel')) {
                    return true
                }
            }
            return false
        } catch (error) {
            return false
        }
    })

    // The window is already visible: stop monitoring
    if (!bool) {
        return console.log('Page open, no longer monitored')
    }

    await goods.then(async (b) => {
        if (b) {
            console.log("It's in stock!")
            await page.waitFor(2000)
            await browser.close()
            return requestUrl(false)
        } else {
            console.log('Not available yet')
            console.log('Try again in 30 minutes.')
            await page.waitFor(1800000)
            await browser.close()
            return requestUrl(true)
        }
    })
}

requestUrl(true)
```

The full code is also available on GitHub: github.com/Card007/Nod… If this helped you, feel free to follow me; I will keep putting out more good articles!