• A Guide to Automating & Scraping the Web with JavaScript (Chrome + Puppeteer + Node JS)
  • By Brandon Morelli
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: pot-code
  • Proofreader: Bambooom

Chrome + Puppeteer + Node JS

Take off with Headless Chrome


Abstract

This article will teach you how to automate web scraping with JavaScript, using a technology developed by the Google team called Puppeteer. Puppeteer runs in a Node environment and can be used to control headless Chrome. What is headless Chrome? In plain English, it means using the API Puppeteer provides to simulate a user’s browsing behavior without actually opening a Chrome window.

If that still doesn’t make sense, just imagine fully automating Chrome with JavaScript.

Preface

Make sure you have Node 8 or later installed. If not, download it from the official website and check that the version number shown under “Current” is at least 8.

If you are new to Node, see Learn Node JS — The 3 Best Online Node JS Courses.

After Node is installed, create a project folder and install Puppeteer. Installing Puppeteer also downloads a matching version of Chromium. You can set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 to skip that download, in which case you must manually specify the browser’s executable path each time you call the launch method.

npm install --save puppeteer
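If you did skip the Chromium download, a minimal sketch of pointing launch() at a browser of your own might look like this (executablePath is a real launch option; the path below is a hypothetical example, so adjust it for your system):

// Minimal sketch, assuming Chromium was skipped at install time.
// The path itself is hypothetical; point it at your local browser binary.
const browser = await puppeteer.launch({
  executablePath: '/usr/bin/chromium-browser'
});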

Example 1 — Taking a screenshot of a web page

Once Puppeteer is installed, we can start writing a simple example. This example is taken directly from the official documentation and takes a screenshot of a given website.

First, create a JS file and name it whatever you like. Here we use test.js as an example. Type in the following code:

const puppeteer = require('puppeteer');

async function getPic() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.screenshot({path: 'google.png'});

  await browser.close();
}

getPic();

Let’s examine the above code line by line.

  • Line 1: Importing dependencies.
  • Lines 3-10: Core code, where the automation process is done.
  • Line 12: calls the getPic() method.

Observant readers will notice that getPic() is preceded by the async keyword, which indicates that it is an asynchronous function. async and await come in pairs and were introduced in ES2017. Since it is an async function, calling it returns a Promise. When the async function returns a value, the Promise resolves with that value (or rejects if an exception is thrown).

Inside an async function, we can use an await expression to suspend execution of the function until the Promise in the expression settles. Don’t worry if this doesn’t make sense yet. I’ll explain it in more detail later, and it will become clear.
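As a quick aside (this snippet is not part of the tutorial’s code), here is a minimal sketch of how an async function and its returned Promise behave:

// An async function always returns a Promise.
async function greet() {
  return 'hello'; // the returned Promise resolves with 'hello'
}

greet().then((value) => {
  console.log(value); // prints: hello
});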

Next, we’ll take a closer look at the getPic() method:

  • Line 4:
const browser = await puppeteer.launch();

This code launches Puppeteer, which essentially opens an instance of Chrome and assigns that instance to the variable browser. Because of the await keyword, execution blocks (suspends) here until the Promise settles, whether it succeeds or fails.

  • Line 5:
const page = await browser.newPage();

Next, create a new page in the browser instance obtained above, wait until it returns and assign the new page object to the variable page.

  • Line 6:
await page.goto('https://google.com');

Use the page object we obtained above to load the given URL. The code then pauses and waits for the page to load.
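As an aside (not used in this example), goto() also accepts an options object that controls when navigation counts as finished; for instance, waitUntil: 'networkidle2' waits until network activity has mostly stopped:

// Hedged sketch: treat navigation as done once there have been no more
// than 2 network connections for at least 500 ms.
await page.goto('https://google.com', {waitUntil: 'networkidle2'});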

  • Line 7:
await page.screenshot({path: 'google.png'});

Once the page is loaded, we can take a screenshot of it. The screenshot() method accepts an object parameter that can be used to configure, among other things, the path where the screenshot is saved. Don’t forget the await keyword.
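As an aside (not part of the original example), the options object supports more than just path; for instance, fullPage: true captures the entire scrollable page rather than only the current viewport:

// Small sketch using the fullPage option.
await page.screenshot({path: 'google.png', fullPage: true});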

  • Line 9:
await browser.close();

Finally, close the browser.

Run the example

Enter the following command on the command line to execute the sample code:

node test.js

Here’s a screenshot from the example:

Isn’t that impressive? This was just a warm-up; next, let’s see how to run the code in a non-headless environment.

Headless? Seeing is believing, so try it out for yourself. Take line 4:

const browser = await puppeteer.launch();

Replace it with this:

const browser = await puppeteer.launch({headless: false});

Then run it again:

node test.js

Even cooler, right? With {headless: false} configured, you can watch the code control Chrome.

There is one small problem, though: our screenshot was a bit incomplete, because the page object’s default screenshot size is rather small. We can reset the page’s viewport size with the following code and then take the screenshot:

await page.setViewport({width: 1000, height: 500});

That’s better:

The final code is as follows:

const puppeteer = require('puppeteer');

async function getPic() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await page.setViewport({width: 1000, height: 500});
  await page.screenshot({path: 'google.png'});

  await browser.close();
}

getPic();

Example 2 — Scraping data

From the above example, you should have a grasp of the basic usage of Puppeteer. Let’s look at a slightly more complicated example.

Before you start, take a look at the official documentation. You’ll find that Puppeteer can do things like simulate mouse clicks, fill in form data, enter text, read page data, and more.
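To give you a taste of those APIs, here is a hedged sketch; the selectors are hypothetical and not from this tutorial:

// Typing into a field and simulating a mouse click.
// '#search-input' and '#search-button' are made-up selectors.
await page.type('#search-input', 'puppeteer'); // enter text
await page.click('#search-button');            // simulate a mouse click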

In the tutorial that follows, we’ll scrape a website called Books to Scrape, which was built specifically for developers to practice scraping.

Create a new JS file in the folder you created earlier. Here we use scrape.js as an example. Enter the following code:

const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual scraping goes here...

  // Return a value
};

scrape().then((value) => {
    console.log(value); // Success!
});

Given the experience of the previous example, this code shouldn’t be hard to understand. If it still isn’t clear... that’s fine too.

First, as before, import the Puppeteer dependency, then define a scrape() method to hold the scraping code. This method returns a value, which we then process (the example code simply prints the value).

To test this, add the following line to the scrape method:

let scrape = async () => {
  return 'test';
};

On the command line, type node scrape.js. If nothing goes wrong, the console will print the string test. Once the test passes, we’ll continue refining the scrape() method.

Step 1: Prepare

As in example 1, get the browser instance, create a new page, and load the URL:

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/');
  await page.waitFor(1000);
  // Scrape
  browser.close();
  return result;
};

Let’s look at the above code again:

First, we create a browser instance with headless set to false so we can see what the browser is doing directly:

const browser = await puppeteer.launch({headless: false});

Then create a new tab:

const page = await browser.newPage();

Next, navigate to books.toscrape.com:

await page.goto('http://books.toscrape.com/');

Optionally, pause the code for 1 second to make sure the page is fully loaded:

await page.waitFor(1000);
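As an aside (not in the original), instead of a fixed one-second delay you can wait for a specific element to appear, which is usually more reliable:

// Hedged alternative: wait until the first book element is in the DOM.
await page.waitForSelector('.product_pod');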

After the task is complete, close the browser and return the execution result.

browser.close();
return result;

That’s the end of Step 1.

Step 2: Start scraping

If you open the Books to Scrape website, you’ll see that there are tons of books in there, although the data isn’t real. To start, let’s grab the data for the first book on the page and return its title and price information (the one with the red border in the image).

Check the documentation and notice that this method simulates page clicks:

page.click(selector[, options])

  • selector: a selector that locates the element to be clicked. If more than one element matches, the first one is clicked.

Here you can use the developer tools to find the element’s selector: right-click the image and select Inspect:

This opens the developer toolbar and highlights the previously selected element. Click the three dots at the start of the highlighted line and select Copy → Copy selector:

Now that we have the element selector, and the element click method we found earlier, we get the following code:

await page.click('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.image_container > a > img');

You will then observe that the browser clicks on the image of the first book and the page jumps to the details page.

On the details page, we only care about the title and price information – see the red box in the picture.

To obtain this data, we use the page.evaluate() method. This method lets us run the browser’s built-in DOM APIs, such as querySelector(), inside the page.

First create the page.evaluate() method and store its return value in the result variable:

const result = await page.evaluate(() => {
  // return something
});

Again, to find selectors for the elements we want to use in this method, open the developer tools once more and inspect the element in question:

The title is a simple h1 element, obtained with the following code:

let title = document.querySelector('h1');

All we need is the text content of the element. We can append .innerText to the expression, as follows:

let title = document.querySelector('h1').innerText;

The same goes for obtaining price information:

The price element has a price_color class, and you can use this class as a selector to get the price element:

let price = document.querySelector('.price_color').innerText;

Now that we have the title and price, we can put them into an object and return it:

return {
  title,
  price
}

Recall that we grabbed the title and price information, returned them in an object, and assigned that object to the result variable. Your code should now look like this:

const result = await page.evaluate(() => {
  let title = document.querySelector('h1').innerText;
  let price = document.querySelector('.price_color').innerText;
  return {
    title,
    price
  }
});

Then we simply return result and print it to the console:

return result;

Finally, the combined code is as follows:

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');
    await page.click('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.image_container > a > img');
    await page.waitFor(1000);

    const result = await page.evaluate(() => {
        let title = document.querySelector('h1').innerText;
        let price = document.querySelector('.price_color').innerText;

        return {
            title,
            price
        }

    });

    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value); // Success!
});

Run the code on the console:

node scrape.js
// {title: 'A Light in the Attic', price: '£51.77'}

If you did everything correctly, you’ll see the correct output on the console, and you’ve completed your first web scraper.

Example 3 — Further improvements

If you think about it, the title and price information are displayed directly on the home page, so there’s no need to visit the details page to grab that data. That raises a further question: can we grab the titles and prices of all the books at once?

There are many ways to scrape this data, and you’ll need to work one out yourself. Also, as hinted above, fetching the data directly from the home page has a catch: some titles are not displayed in full there.

The challenge

The goal: grab the title and price information of all the books on the home page and return the results as an array. The correct output should look like this:
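(The original screenshot is missing; judging from the single-book output earlier, the result should look roughly like this, with the remaining entries elided:)

[
  { title: 'A Light in the Attic', price: '£51.77' },
  ... // one object per book on the page
]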

Go ahead and try it; it’s almost as easy to implement as the example above. If you find it too hard, follow the tips below.


Tip:

The big difference is that you need to iterate through the entire result set. The code looks like this:

const result = await page.evaluate(() => {
  let data = []; // Create an empty array
  let elements = document.querySelectorAll('xxx'); // Select all relevant elements
  // Iterate over all the elements
    // Extract the title information
    // Extract the price information
    data.push({title, price}); // Push the data into the array
  return data; // Return the data set
});

If you still can’t crack it, well, here is the reference answer. In future tutorials, I’ll expand on this code and cover more advanced scraping techniques. You can subscribe by submitting your email address here, and we’ll let you know when new content is published.

Reference answer:

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    const result = await page.evaluate(() => {
        let data = []; // Create an array to hold the results
        let elements = document.querySelectorAll('.product_pod'); // Select all books

        for (var element of elements){ // Iterate over the list of books
            let title = element.childNodes[5].innerText; // Extract the title information
            let price = element.childNodes[7].children[0].innerText; // Extract price information

            data.push({title, price}); // Combine data into an array
        }

        return data; // Return the data set
    });

    browser.close();
    return result; // Return data
};

scrape().then((value) => {
    console.log(value); // Print the result
});
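As an aside (not from the original article), indexing into childNodes is brittle, since any change in whitespace or markup shifts the indices. A sturdier sketch selects within each .product_pod by class and tag; on Books to Scrape, the anchor’s title attribute holds the full, untruncated title:

// A hedged alternative for the loop body inside page.evaluate().
let title = element.querySelector('h3 a').getAttribute('title'); // full title
let price = element.querySelector('.price_color').innerText;    // price text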

Conclusion:

Thanks for reading! If you want to learn Node.js, check out Learn Node JS — The 3 Best Online Node JS Courses.

Every week I publish four technical articles about web development. Subscribe! Or you can follow me on Twitter.


The Nuggets Translation Project is a community that translates high-quality technical articles from around the Internet and shares them on Juejin (Nuggets). The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and other fields. For more high-quality translations, follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.