• Web Scraping with Puppeteer in Node.js
  • Belle Poopongpanit
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Badd
  • Proofreader: Shixi-Li

Implementing a Puppeteer web scraper in Node.js

Have you ever tried to build a new app using the API of your favorite company or website, only to find that they either don't offer an API or no longer make it available to the public? (We're talking about you, Netflix.) Well, it happened to me, and since I don't take no for an answer, I found a way around it: web scraping.

Web scraping is a technique that uses software to automatically extract and collect data from a website. Once you've collected the data, you can use it to develop your own APIs.

There are multiple technologies for web scraping, and Python is the star of the bunch. However, I am a big fan of JavaScript, so in this article I will show you how to do it with Node.js and Puppeteer.

Puppeteer is a Node.js library that lets us run a headless Chrome browser (headless because it doesn't need a graphical user interface) to pull data from websites.

With most of us cooped up at home due to COVID-19, binge-watching Netflix has become a way for many of us to pass the time (that, and crying). For fellow Netflix viewers who, like me, are bored and paralyzed by too many choices, I found a website, www.digitaltrends.com/movies/best…, that lists the best movies on Netflix as of April 2020. Once I've scraped the page and extracted the data, I'll store it in a JSON file. That way, whenever I want to build an API for Netflix's best recent movies, I can pull the data straight from that JSON file.

Getting started

First, I create a new Webscraper folder in VS Code, and inside it I create a new file called netflixcorone.js.

Next, install Puppeteer from the terminal:

npm i puppeteer

Then, import the required modules and libraries. The starting code in netflixcorone.js is as follows:

const puppeteer = require('puppeteer')
const fs = require('fs')

fs is Node.js's file-system module. We will use it to create the JSON file that holds our data.

Writing the scraper

Now we’re going to write the scrape() function.

async function scrape (url) {
   const browser = await puppeteer.launch();
   const page = await browser.newPage();
   await page.goto(url)

scrape() receives a URL as an argument. We call puppeteer.launch() to launch the headless browser, then browser.newPage() to open a blank page in it. Finally, we tell the page to navigate to the given URL with page.goto().
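As an aside, puppeteer.launch() accepts an options object. While debugging a scraper it can help to actually watch the browser work; a minimal sketch (used inside scrape(), with an illustrative slowMo value) might look like this:

// Launch a visible browser instead of a headless one,
// slowing each Puppeteer operation down by 250 ms so you can watch it
const browser = await puppeteer.launch({
   headless: false,
   slowMo: 250
});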

Finding the data we want to scrape

To scrape the data we want from a site, we need to know which HTML elements it lives in. Open www.digitaltrends.com/movies/best… in Chrome, then open the Inspector (the Chrome DevTools). You can use the following shortcuts:

On Mac: Command + Option + J. On Windows: Ctrl + Shift + I.

Since I want to grab each movie's title and summary from the article, I need to select the h2 element (the title) and its corresponding sibling p element (the summary).
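Before writing any scraping code, you can sanity-check this selector idea directly in the DevTools Console. A quick, illustrative check on the first movie might be:

// In the DevTools Console: grab the first h2 and its next sibling
var first = document.querySelectorAll('h2')[0];
console.log(first.innerText);                    // should be a movie title
console.log(first.nextElementSibling.innerText); // should be its summary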

Working with the selected elements

Let’s continue with the code:

var movies = await page.evaluate(() => {
   var titlesList = document.querySelectorAll('h2');
   var movieArr = [];

   for (var i = 0; i < titlesList.length; i++) {
      movieArr[i] = {
         title: titlesList[i].innerText.trim(),
         summary: titlesList[i].nextElementSibling.innerText.trim()
      };
   }
   return movieArr;
})

We use page.evaluate() to access the DOM of the page, so we can run our own JavaScript code there just as if we were typing it into the DevTools Console panel.

document.querySelectorAll('h2') selects all of the h2 elements on the page. We store the resulting node list in the titlesList variable.

Then, we create an empty array called movieArr.

We want to store each movie's title and summary in its own object, so we run a for loop that fills movieArr with objects that have title and summary properties.

To get each movie's title, we loop over titlesList, which holds all of the h2 element nodes. We read the innerText property to get each h2's text content, then strip surrounding whitespace with the .trim() method.

If you look closely at the page in the DevTools Console panel, you'll notice that it has a lot of p elements with no unique class name or ID, which makes it hard to precisely select the p element that holds each movie's summary. To solve this, we read the nextElementSibling property of each h2 node (titlesList[i]): inspecting the page shows that the p element containing a movie's summary is the immediate sibling of the h2 element containing its title.
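Note that this assumes every h2 on the page is a movie title immediately followed by a p. That holds for this page today, but if the layout ever changes, a slightly more defensive version of the loop (my own sketch, not part of the original code) could skip any h2 whose next sibling isn't a paragraph:

for (var i = 0; i < titlesList.length; i++) {
   var sibling = titlesList[i].nextElementSibling;
   // Only keep h2 headings that are actually followed by a <p> summary
   if (sibling && sibling.tagName === 'P') {
      movieArr.push({
         title: titlesList[i].innerText.trim(),
         summary: sibling.innerText.trim()
      });
   }
}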

Saving the scraped data to a JSON file

At this point, we have completed the main data extraction and are ready to store the data in a JSON file.

fs.writeFile("./netflixscrape.json".JSON.stringify(movies, null.3), (err) => {
   if (err) {
      console.error(err);
      return;
   };
   console.log("Great Success");
});

fs.writeFile() creates a new JSON file to store the movie data. It takes three arguments:

  1. The name of the file to create

  2. The data to write. JSON.stringify() converts a JavaScript object into a JSON string. It takes three arguments of its own: the object to convert (our movies array), a replacer argument for filtering out unwanted properties (null here), and a space argument that sets the indentation of the output (3 here), so the resulting JSON file looks nice and clean.

  3. A callback function that receives an error argument (err).

If writing fails, the callback prints the error to the console and returns; otherwise it prints "Great Success".
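As an aside, if you prefer async/await over callbacks, Node.js also ships a promise-based file-system API. Since scrape() is already an async function, an equivalent sketch using fs.promises would be:

// Promise-based alternative using Node's built-in fs.promises API
try {
   await fs.promises.writeFile("./netflixscrape.json", JSON.stringify(movies, null, 3));
   console.log("Great Success");
} catch (err) {
   console.error(err);
}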

Finally, the overall code is as follows:

const puppeteer = require('puppeteer')
const fs = require('fs')

async function scrape (url) {
   const browser = await puppeteer.launch();
   const page = await browser.newPage();
   await page.goto(url)

   var movies = await page.evaluate(() => {
      var titlesList = document.querySelectorAll('h2');
      var movieArr = [];
      for (var i = 0; i < titlesList.length; i++) {
         movieArr[i] = {
         title: titlesList[i].innerText.trim(),
         summary: titlesList[i].nextElementSibling.innerText.trim(),
         };
      }
      return movieArr;
   })
   fs.writeFile("./netflixscrape.json".JSON.stringify(movies, null.3),  (err) => {
      if (err) {
         console.error(err);
         return;
      }
      console.log("Great Success");
   });
   await browser.close()
}

scrape("https://www.digitaltrends.com/movies/best-movies-on-netflix/")

We added await browser.close() at the end to shut down Puppeteer's headless browser. On the last line, we call the scrape() function and pass in the URL of the page we want to scrape.
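One improvement worth considering (my own suggestion, not from the original article): if anything inside scrape() throws, for example a failed page.goto(), the browser would be left running. Wrapping the body in try/finally guarantees it always closes:

async function scrape (url) {
   const browser = await puppeteer.launch();
   try {
      const page = await browser.newPage();
      await page.goto(url)
      // ... evaluate the page and write the JSON file, as above ...
   } finally {
      // Runs even if goto() or evaluate() throws
      await browser.close()
   }
}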

Run the scrape() function

To run the code, type node netflixcorone.js in the terminal.

If all goes well (and it should), you'll see "Great Success" in the console, along with a freshly baked JSON file containing all the movie titles and summaries.

Congratulations! 👏 You are now officially a hacker! OK, just kidding. But now that you know how to scrape data from the web and develop your own APIs with it, doesn't that sound a lot better?
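To close the loop on the API idea from the start of the article, here is a minimal, illustrative sketch of serving the scraped JSON file as an API with Express (Express is my choice here, not something the article prescribes; install it with npm i express):

const express = require('express')
// Load the file our scraper produced
const movies = require('./netflixscrape.json')

const app = express()

// GET /movies returns the scraped list of titles and summaries
app.get('/movies', (req, res) => {
   res.json(movies)
})

app.listen(3000, () => console.log('API listening on port 3000'))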
