A multi-process paginated crawler in Node.js in 50 lines

Preface

Coding should be a lifelong career, not just a young person's job that ends at 30. This article is included in the GitHub repository https://github.com/ponkans/F2E. Stars are welcome, and it is continuously updated.

Node is very convenient for writing crawlers. Most articles online only cover single-process crawling, so I took some time after work to write a multi-process crawler. See the end of the article for how to get the source code ~

I hope you take something away from every article. This one covers a Node.js-based multi-process crawler, and after reading it you should have picked up:

Basic usage of the Node cluster module and inter-process communication
How to quickly write a crawler for simple pages
Basic usage of superagent

Architecture diagram

Target analysis

I am obsessed with Japanese animation. I often check the ranking list on Douban and then hide at home to enjoy the shows quietly, shh ~~

I don't know how many anime you have watched, but I have already finished One Piece!

Target locked: crawl the first 10 pages of Douban's Japanese animation chart.

Let's take a look at how Douban's Japanese animation chart requests its data.

Page 1 packet capture

Page 2 packet capture

Capturing the requests for the first two pages leads to a clear summary:

The hot-list API for Japanese animation is https://movie.douban.com/j/search_subjects
The request parameters stay the same from page to page, except that page_start increases by 20 each time
The request method is GET

Building a GET request

superagent is a lightweight HTTP client module for Node.js that makes sending requests very convenient.

According to the summary obtained from the above analysis, we can easily construct the request with superagent.
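
A minimal sketch of such a request is shown below. Only the URL, the GET method, and the page_start step of 20 come from the analysis above. The other query parameters (type, tag, sort, page_limit) and the helper names buildQuery and fetchPage are assumptions made for illustration, so check them against your own packet capture.

```javascript
// Sketch only: every query value except page_start is an assumption.
const API_URL = 'https://movie.douban.com/j/search_subjects';
const PAGE_SIZE = 20; // page_start grows by 20 per page (from the analysis above)

// Build the query object for a zero-based page number.
function buildQuery(page) {
  return {
    type: 'tv',            // assumed value
    tag: '日本动画',        // assumed value
    sort: 'recommend',     // assumed value
    page_limit: PAGE_SIZE, // assumed parameter
    page_start: page * PAGE_SIZE,
  };
}

// Send the GET request for one page and return the parsed JSON body.
async function fetchPage(page) {
  // Required lazily so buildQuery can be used without superagent installed.
  const superagent = require('superagent');
  const res = await superagent.get(API_URL).query(buildQuery(page));
  return res.body;
}
```

With superagent installed (npm install superagent), fetchPage(0) requests the first page and fetchPage(1) the second.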

Multi-process creation

For the underlying principles of Node's multi-process architecture, see my other article in the “Big Front-end Advanced Node.js” series on the underlying implementation of the multi-process model.

With the cluster module that ships with Node, creating multiple child processes is easy.

As a rule of thumb, you create as many child processes as the CPU has cores, although on a real production server there is more to consider ~
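
The following sketch shows one way to fork a worker per core and receive messages from the workers. The helper names (workerCount, startWorkers, main) and the message shape are hypothetical and are not taken from the original source code.

```javascript
const cluster = require('cluster');
const os = require('os');

// Rule of thumb from the text: one worker per CPU core (never fewer than one).
function workerCount(cores = os.cpus().length) {
  return Math.max(1, cores);
}

// Primary side: fork `count` workers and route their messages to `onMessage`.
function startWorkers(count, onMessage) {
  const workers = [];
  for (let i = 0; i < count; i++) {
    const worker = cluster.fork();
    worker.on('message', (msg) => onMessage(msg, worker));
    workers.push(worker);
  }
  return workers;
}

// Entry point: the same script runs in the primary and in every worker.
function main() {
  // cluster.isPrimary needs Node 16+; older versions expose cluster.isMaster.
  if (cluster.isPrimary) {
    startWorkers(workerCount(), (msg, worker) => {
      console.log(`worker ${worker.id} says:`, msg);
      worker.disconnect(); // let the worker exit once it has reported
    });
  } else {
    // Worker side: report back over the built-in IPC channel.
    process.send({ pid: process.pid, status: 'ready' });
  }
}
```

To try it, call main() at the bottom of the script and run the file with node. Each worker should report once and then exit.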

Child process paging fetch

There is a bit of algorithmic logic involved here, but it is actually quite simple.

For example, on a Mac with four cores I start four child processes to crawl. The algorithm below makes the four child processes split the network requests evenly.

If all the requests land on a single child process, there is no point in starting so many child processes.
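
One simple way to split the work evenly is round-robin assignment, sketched below. This is an illustration of the idea and not necessarily the algorithm used in the original source. The function name assignPages is hypothetical.

```javascript
// Distribute zero-based page numbers across workers in round-robin order,
// so no worker receives more than one page more than any other.
function assignPages(totalPages, workerCount) {
  const assignments = Array.from({ length: workerCount }, () => []);
  for (let page = 0; page < totalPages; page++) {
    assignments[page % workerCount].push(page);
  }
  return assignments;
}

// 10 pages over 4 workers:
console.log(assignPages(10, 4));
// [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Each worker then multiplies its page numbers by 20 to get the page_start values it should request.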

Close child processes

After the crawl, you don’t have to keep the process running. You can shut it down to save resources.

cluster.disconnect(); 
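
The primary process needs to know when every worker has finished before it disconnects them. The sketch below assumes each worker sends a { type: 'done' } message when its pages are crawled. That message shape and the function name shutdownWhenAllDone are hypothetical.

```javascript
const cluster = require('cluster');

// Call `shutdown` once, after every worker has sent a { type: 'done' } message.
// By default this disconnects all workers so the processes can exit.
function shutdownWhenAllDone(workers, shutdown = () => cluster.disconnect()) {
  const pending = new Set(workers);
  for (const worker of workers) {
    worker.on('message', (msg) => {
      if (!msg || msg.type !== 'done') return;
      // Ignore repeated 'done' messages from the same worker.
      if (!pending.delete(worker)) return;
      if (pending.size === 0) shutdown();
    });
  }
}
```

In the primary process this would be called with the array of workers returned by cluster.fork().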

Multi-process disorder problem

When crawling with multiple processes, the operating system decides when each child process runs, and network responses come back at different times, so the crawled data arrives in no particular order. For example, if you crawl the first 20 pages, the first result to arrive is not necessarily page 1.

We can add a movieIndex field to indicate the order of the crawl.
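
A small sketch of this idea follows. It assumes movieIndex is computed as page_start plus the item's position within its page, which the text does not spell out. The helper names and the sample titles are placeholders.

```javascript
// Tag every item of one page with its global position in the chart.
function tagPage(subjects, pageStart) {
  return subjects.map((subject, i) => ({ ...subject, movieIndex: pageStart + i }));
}

// Merge pages in whatever order they arrived, then restore chart order.
function mergeInOrder(taggedPages) {
  return taggedPages.flat().sort((a, b) => a.movieIndex - b.movieIndex);
}

// Example: the page with page_start = 20 arrives before the page with page_start = 0.
const arrived = [
  tagPage([{ title: 'C' }], 20),
  tagPage([{ title: 'A' }, { title: 'B' }], 0),
];
console.log(mergeInOrder(arrived).map((m) => m.title)); // A, B, C
```

Sorting on movieIndex in the primary process makes the final output independent of which worker finishes first.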

Result

Here is the crawl result for the first 10 pages.

Conclusion

Node's multi-process architecture helps a single-threaded Node program make better use of multi-core CPUs. For time-consuming operations, multiple processes are worth trying.

When using multiple processes, data synchronization is a very important problem. Handled badly, it easily leads to a series of pitfalls. For example, my earlier article in the “Big Front-end Advanced Node.js” series on a Double Eleven flash-sale system (a must-read for leveling up) discusses the overselling problem, which is exactly a multi-process data synchronization problem.

This article is only a very simple, entry-level crawler. I will write more in-depth, hands-on posts about Node multi-process later ~

Biubiubiu:

A collection of front-end component libraries and tools from major tech companies (PC, mobile, JS, CSS, etc.)
“Big Front-end Basic Components” series: a web watermarking NPM package in 80 lines
Asynchronous and non-blocking I/O (synchronous/asynchronous/blocking/non-blocking/read/select/epoll)

If you liked this article, please follow and give it a like. Thank you 💕😊

Contact me / official account

Reply [source] in the official account's chat to get the crawler source code ~

Search WeChat for “water monster”, or scan the QR code below and reply “add group”, and I will add you to the technical discussion group. Honestly, even if you never say a word in this group, just reading the chat helps you grow. (Alibaba technical experts, the authors aobing and Java3y, a senior front-end engineer from Mogujie, Ant Financial security experts: all big names.)

The water monster also publishes original articles regularly, and regularly exchanges experience with readers or helps review résumés. Follow so you don't get lost, and maybe we will get to run together 🏃 ↓↓↓