You’ve surely heard of crawlers, and you’ve probably also heard that Python makes it easy to scrape images from the web. But I don’t know Python, so I’ll try it with Node.


01 Preface

What is a crawler

In official language, a crawler is an “automatic web-browsing program”: it browses pages for us, so we don’t have to click around by hand to download articles or pictures. You may have used ticket-grabbing software; it works by continuously hitting the railway’s official interface through software to snatch tickets. Such ticket-grabbing software is, however, illegal.

So how do you tell whether a crawler is legal? There is no clear-cut rule on this; the prevailing attitude has been neutral. A crawler is a technology, and technology itself is not illegal. But if you use it to scrape inappropriate information or copyrighted images for commercial use, you are breaking the law. In practice, when using crawler technology: do not crawl personal or private information, do not crawl copyrighted images, never put the scraped data to commercial use, and never interfere with the normal operation of the target website.

In short: use this technique with care.

How to crawl

I did some research: there are many ways to build a crawler with Node, and many people like to pair Cheerio with Request. But I also found a very handy tool: Puppeteer, which comes officially from Google. Essentially, it takes what a human does in a browser and turns it into a callable API. The official Puppeteer documentation lists the following capabilities (a minimal hello-world example follows the list):

  • Generate screenshots and PDFs of pages.
  • Crawl SPAs (single-page applications) and generate pre-rendered content (that is, “SSR”, server-side rendering).
  • Automate form submission, UI testing, keyboard input, and so on.
  • Create an up-to-date automated testing environment: run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
  • Capture a timeline trace of a site to help diagnose performance problems.
  • Test Chrome extensions.
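
To make that concrete, here is the hello-world from the Puppeteer docs: open a page and screenshot it (the first bullet above). It assumes Puppeteer can find a browser; the installation section below covers what to do when the default browser download is blocked.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.screenshot({ path: "example.png" }); // save a screenshot to disk
  await browser.close();
})();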

I also read some write-ups from fellow developers, and I have to say: this thing is amazing! I don’t know the full API yet, but I do understand the general workflow. If the official documentation doesn’t click for you, Bilibili has introductory video series on Puppeteer.

02 Installation Process

Installing Puppeteer

This installation was a real headache; it took me a long time to get it to succeed. The culprit was the Great Firewall: the default browser download simply couldn’t get through. So we have to install it another way.

A plain npm install downloads a bundled browser by default, and that download is exactly where I kept getting stuck with errors, several times in a row. Helpful netizens pointed out that we can skip the browser during installation and download it separately afterwards:

env PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true npm i puppeteer
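A note on that command: the env VAR=value prefix is Unix shell syntax. On Windows cmd (which this walkthrough appears to target, given the chrome-win/chrome.exe path later on), the equivalent is roughly:

set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
npm i puppeteer

Alternatively, Puppeteer’s install script also reads flags from npm config, so to my knowledge a line like this in the project’s .npmrc persists the setting; verify against your Puppeteer version’s docs:

puppeteer_skip_chromium_download=true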

Now that Puppeteer itself is installed, we still need to install a browser by hand. So where do we download it from? I downloaded it manually.

There are a lot of revisions there. Following the online tutorials, we should pick the right revision number (if you’re not sure which, the next step shows how to check). Go back to the project root and open node_modules/puppeteer/package.json to see which browser revision our Puppeteer expects.



My revision here turned out to be 737027, so that’s the browser build I download manually. Download the one that matches your own version.
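
If you’d rather not squint at the file, the expected revision can also be printed directly. In the Puppeteer releases of that era it was recorded under the puppeteer.chromium_revision field of the package’s package.json; the field name can differ between versions, so treat this one-liner (run from the project root) as a hedged convenience:

node -e "console.log(require('puppeteer/package.json').puppeteer.chromium_revision)"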

Referencing the browser

Once the browser is downloaded, we still have to point Puppeteer at it. This was another headache: I followed a lot of tutorials that didn’t work, perhaps because their systems were different, and after being stuck here for ages I almost gave up. Fortunately, one article finally solved my problem. I knew the path was wrong; I just didn’t know how to write it correctly.

Once the browser is downloaded, unpack it into the project root, alongside package.json. Then create a new index.js file in the root:

const puppeteer = require("puppeteer");
const fs = require("fs");
const request = require("request");
const path = require("path");
// Configure the path to the browser executable. This is the key step!
const pathToExtension = require("path").join(
    __dirname,
    "./chrome-win/chrome.exe"
  );
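With the path configured, it’s worth a quick sanity check that Puppeteer can actually launch the unpacked browser. A minimal sketch, reusing the pathToExtension constant above (headless: false just so the window is visible):

// Sanity check: launch the manually unpacked Chromium
(async () => {
  const browser = await puppeteer.launch({
    executablePath: pathToExtension, // the chrome.exe unpacked above
    headless: false, // show the window so we can watch it work
  });
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await browser.close();
})();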

Finally, my project directory looks like this:


03 Choosing a Website

With everything set up, we need a site to test against; I picked this one to crawl.

In fact, almost anything can be crawled, as long as we analyze it correctly. So do the most familiar front-end move: hit F12 and take a look around. Study the structure of the page, and once you understand it, start extracting.
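
For example, the listing page’s thumbnails turn out to be reachable with the ul>li>a>img selector (the same one the code below relies on), so you can preview exactly what the crawler will see by pasting this into the DevTools console:

// Paste into the DevTools console (F12) on the listing page
document.querySelectorAll("ul>li>a>img").forEach(function (img) {
  console.log(img.getAttribute("src")); // relative src values; the crawler prepends the site origin
});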


A bit of writing (read: copy-pasting) produces this code:

const puppeteer = require("puppeteer");
const fs = require("fs");
const request = require("request");
const path = require("path");

let i = 2; // page number; starting from page 2 for convenience
async function netbian(i) {
  const pathToExtension = require("path").join(
    __dirname,
    "./chrome-win/chrome.exe"
  );
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: pathToExtension,
  });
  const page = await browser.newPage();
  await page.goto(`http://pic.netbian.com/4kfengjing/index_${i}.html`);
  // Image nodes; see the official docs for the API
  // (note: $$eval, not $eval, since we want all matches as an array)
  let images = await page.$$eval("ul>li>a>img", (els) =>
    els.map((x) => "http://pic.netbian.com" + x.getAttribute("src")) // the src address of each image
  );
  mkdirSync(`./images`); // storage directory
  for (const m of images) {
    await downloadImg(m, "./images/" + new Date().getTime() + ".jpg");
  }
  netbian(++i); // next page; cap the last page as needed
  // close the browser
  await browser.close();
}
netbian(i); // kick things off

// Create a directory synchronously (recursively)
function mkdirSync(dirname) {
  if (fs.existsSync(dirname)) {
    return true;
  } else if (mkdirSync(path.dirname(dirname))) {
    fs.mkdirSync(dirname);
    return true;
  }
  return false;
}

// Download a file to save the image
function downloadImg(src, dest) {
  return new Promise(function (resolve, reject) {
    const writeStream = fs.createWriteStream(dest);
    const readStream = request(src);
    readStream.pipe(writeStream);
    readStream.on("end", function () {
      console.log("File downloaded successfully");
    });
    readStream.on("error", function (err) {
      console.log("Error message: " + err);
      reject(err);
    });
    writeStream.on("finish", function () {
      console.log("File write succeeded");
      resolve();
    });
  });
}
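
One caveat about dependencies: the request package has since been deprecated. As a hedged alternative, here is a minimal sketch of the same downloadImg helper using only Node’s built-in http module (http rather than https, because pic.netbian.com is served over plain http above; swap in require("https") for https URLs):

const http = require("http");
const fs = require("fs");

// Drop-in replacement sketch for downloadImg, with no `request` dependency
function downloadImg(src, dest) {
  return new Promise(function (resolve, reject) {
    http
      .get(src, function (res) {
        if (res.statusCode !== 200) {
          res.resume(); // drain the response so the socket is released
          reject(new Error("Unexpected status: " + res.statusCode));
          return;
        }
        const writeStream = fs.createWriteStream(dest);
        res.pipe(writeStream);
        writeStream.on("finish", resolve);
        writeStream.on("error", reject);
      })
      .on("error", reject);
  });
}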

04 Starting the Crawl

Now simply run node index.js from the root directory. After it executes, an images directory appears, containing our pictures.


05 Summary

There is far more to Puppeteer than this; I’ve only shown the basic workflow here. Whatever we can do by hand in a browser, we can do through the corresponding API. I’m just getting started, and I’ll explore the technology further later. And of course no one can guarantee that a site’s structure will never change; in practice, the crawler has to change along with it.

Project code: Github repository