There is no 100% effective anti-crawler technique. This article introduces a simple way to bypass virtually any front-end anti-crawler measure.

The walkthrough below uses Baidu Index as an example; the code has been packaged into a Baidu Index crawler Node library: https://github.com/Coffcer/baidu-index-spider

Note: do not abuse crawlers to cause trouble for others.

Baidu Index's anti-crawler strategy

Looking at the Baidu Index page, the index data is presented as a trend chart. When the mouse hovers over a given day, two requests are triggered and the results are displayed in a hover box:

As usual, let's take a look at these requests:

Request 1:

Request 2:

It turns out that Baidu Index implements an anti-crawler strategy on the front end. When the mouse moves over the chart, two requests are triggered: one returns a fragment of HTML, the other returns a generated image. The HTML contains no actual values; instead, it sets width and margin-left so that the corresponding digits from the image are displayed. The requests also carry parameters such as res and res1 that we don't know how to reproduce, so it is difficult to scrape Baidu Index data with a conventional simulated request or by parsing the HTML.

The crawler approach

How do we break through Baidu's anti-crawler measures? It's actually very simple: ignore them entirely. We just simulate user actions, screenshot the values we need, and run image recognition on the screenshots. The steps are as follows:

  1. Simulate login
  2. Open the index page
  3. Move the mouse to the specified date
  4. Wait for the requests to finish, then capture the numeric part of the image
  5. Run image recognition to get the value
  6. Repeat steps 3 to 5 to get the value for each date

In theory, this method can crawl the content of any website. Next we'll implement the crawler step by step, using the following libraries:

  • Puppeteer — simulates browser operations
  • node-tesseract — a wrapper around Tesseract, used for image recognition
  • jimp — image cropping

Install Puppeteer to simulate user operations

Puppeteer is a Chrome automation tool that lets you drive Chrome with commands. It can simulate user actions, run automated tests, power crawlers, and so on. Usage is very simple; there are plenty of tutorials online, and after following this article you'll have a rough idea of how to use it.

The API documentation: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md

Installation:

npm install --save puppeteer

Puppeteer automatically downloads Chromium during installation to ensure it can run out of the box. However, the download may fail on networks inside China. If it does, you can install with cnpm, or point the download at Taobao's mirror before installing:

npm config set PUPPETEER_DOWNLOAD_HOST=https://npm.taobao.org/mirrors
npm install --save puppeteer

You can also skip the Chromium download at install time and specify the path to a locally installed Chrome in code:

// npm
npm install --save puppeteer --ignore-scripts

// node
puppeteer.launch({ executablePath: '/path/to/Chrome' });

Implementation

For clarity, only the main parts of the code are listed below, and everywhere a selector is involved it is replaced with "…". See the GitHub repository linked at the top of this article for the full code.

Open the Baidu Index page and simulate login

What we're doing here is simulating user actions, clicking and typing step by step. We don't handle the login-captcha case; handling captchas is a separate topic, and if you have logged into Baidu on this machine before, a captcha is usually not required.

// Launch the browser.
// If headless is true, Puppeteer drives Chromium in the background; in other words, you can't see what's happening in the browser.
// If false, the browser opens on your machine and shows every action.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

// Open Baidu Index
await page.goto(BAIDU_INDEX_URL);

// Simulate login
await page.click('...');
await page.waitForSelector('...');
// Enter your Baidu username and password, then log in
await page.type('...', 'username');
await page.type('...', 'password');
await page.click('...');
await page.waitForNavigation();
console.log('✅ login successful');

Simulate mouse movement to obtain the data

Scroll to the trend-chart area, move the mouse over a date, wait for the request to finish and the value to appear in the tooltip, then take a screenshot and save the image.

// Get the coordinates of day 1 on the chart
const position = await page.evaluate(() => {
  const $image = document.querySelector('...');
  const $area = document.querySelector('...');
  const areaRect = $area.getBoundingClientRect();
  const imageRect = $image.getBoundingClientRect();

  // Scroll to the chart's visible area
  window.scrollBy(0, areaRect.top);

  return { x: imageRect.x, y: 200 };
});

// Move the mouse to trigger the tooltip
await page.mouse.move(position.x, position.y);
await page.waitForSelector('...');

// Get the tooltip information
const tooltipInfo = await page.evaluate(() => {
  const $tooltip = document.querySelector('...');
  const $title = $tooltip.querySelector('...');
  const $value = $tooltip.querySelector('...');
  const valueRect = $value.getBoundingClientRect();
  const padding = 5;

  return {
    title: $title.textContent.split(' ')[0],
    x: valueRect.x - padding,
    y: valueRect.y,
    width: valueRect.width + padding * 2,
    height: valueRect.height
  };
});

Screenshot

Compute the coordinates of the value, take a screenshot, and crop the image with jimp.

await page.screenshot({ path: imgPath });

// Crop the image to keep only the numbers
const img = await jimp.read(imgPath);
await img.crop(tooltipInfo.x, tooltipInfo.y, tooltipInfo.width, tooltipInfo.height);
// Enlarge the image to improve the recognition accuracy
await img.scale(5);
await img.write(imgPath);

Image recognition

Here we use Tesseract for image recognition. Tesseract is an open-source OCR tool from Google that recognizes text in images and can be trained to improve accuracy. GitHub already has a simple Node wrapper, node-tesseract, which requires Tesseract to be installed and added to your PATH.

Tesseract.process(imgPath, (err, val) => {
  if (err || val == null) {
    console.error('❌ recognition failed: ' + imgPath);
    return;
  }
  console.log(val);
});

In practice, untrained Tesseract makes a few recognition errors; for example, a number starting with 9 might be recognized as starting with '3'. To improve accuracy, we can train Tesseract. But if the errors made during recognition are consistent, we can also simply fix them with a regular expression.
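As a minimal sketch of the regex fix (the function name and cleanup rules here are hypothetical; the substitutions you actually apply should come from misreads you observe in your own OCR output):

```javascript
// Clean up a raw OCR result: strip everything that isn't a digit or a
// thousands separator, then drop the separators. Any consistent misread
// (e.g. a known digit confusion) would get its own replace() rule here.
function fixOcrValue(raw) {
  const digitsOnly = raw.replace(/[^\d,]/g, ''); // remove OCR noise around the number
  return digitsOnly.replace(/,/g, '');           // drop thousands separators
}

console.log(fixOcrValue(' 12,345\n')); // "12345"
```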

Encapsulation

Once the above pieces are in place, they only need to be combined to form a Baidu Index crawler Node library. Of course, there is plenty of room for optimization, such as batch crawling or crawling a specified range of days; none of it is hard to build on this foundation.

const recognition = require('./src/recognition');
const Spider = require('./src/spider');

module.exports = {
  async run (word, options, puppeteerOptions = { headless: true }) {
    const spider = new Spider({ imgDir, ...options }, puppeteerOptions);

    // Fetch the data
    await spider.run(word);

    // Read the captured screenshots and run image recognition
    const wordDir = path.resolve(imgDir, word);
    let imgNames = fs.readdirSync(wordDir);
    const result = [];

    imgNames = imgNames.filter(item => path.extname(item) === '.png');

    for (let i = 0; i < imgNames.length; i++) {
      const imgPath = path.resolve(wordDir, imgNames[i]);
      const val = await recognition.run(imgPath);
      result.push(val);
    }

    return result;
  }
};

Countering this crawler

Finally, how would you defend against this kind of crawler? I think examining the mouse's movement trajectory might be one way. Of course, the front end has no 100% effective anti-crawler measure; all we can do is make crawling a little more difficult.
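As a sketch of the trajectory idea (a hypothetical heuristic, not a production detector): a script that jumps the cursor straight to a target produces perfectly collinear, evenly spaced sample points, while a human hand does not.

```javascript
// Flag a mouse trail as suspicious if every sampled point lies (almost)
// exactly on the straight line between the first and last point.
// Hypothetical heuristic: a real detector would combine many signals.
function looksScripted(points, tolerance = 2) {
  if (points.length < 3) return true; // too few samples: likely a direct jump
  const a = points[0];
  const b = points[points.length - 1];
  const len = Math.hypot(b.x - a.x, b.y - a.y) || 1;
  return points.every(p => {
    // Perpendicular distance from p to the line through a and b
    const dist = Math.abs((b.x - a.x) * (a.y - p.y) - (a.x - p.x) * (b.y - a.y)) / len;
    return dist <= tolerance;
  });
}

console.log(looksScripted([{ x: 0, y: 0 }, { x: 50, y: 0 }, { x: 100, y: 0 }]));  // true
console.log(looksScripted([{ x: 0, y: 0 }, { x: 50, y: 30 }, { x: 100, y: 0 }])); // false
```

Note that Puppeteer's default `page.mouse.move` moves in straight-line steps, which is exactly the pattern such a check would catch, so the crawler would in turn need to add jitter to its movements.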