When building a production application, you usually look for ways to optimize its performance. In this article, we'll look at a technique for improving the way Node.js applications handle workloads.

Node.js applications run single-threaded, which means that on a multi-core system (which most machines are these days), the application does not fully utilize all of the cores. To make use of the other available cores, you can start a cluster of Node.js processes and distribute the load among them.

Running multiple processes increases the throughput of the server (requests/second) because multiple requests can be processed at the same time. Below, we will first use the Node.js Cluster module to create child processes, and then use the PM2 process manager to manage the cluster.

What is a Cluster?

The Node.js Cluster module allows the creation of worker processes that run simultaneously and share the same server port. Each spawned child process has its own event loop, memory, and V8 instance. Child processes use inter-process communication (IPC) to communicate with the parent process.
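
To make the IPC channel concrete, here is a minimal standalone sketch (not part of the demo application we build below) of a parent and a worker exchanging messages:

const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();

  // The parent receives messages from the worker over the IPC channel...
  worker.on('message', (msg) => {
    console.log(`Message from worker ${worker.process.pid}:`, msg);
  });

  // ...and can send messages to it.
  worker.send({ greeting: 'hello from the master' });
} else {
  // The worker receives messages from its parent...
  process.on('message', (msg) => {
    console.log('Message from master:', msg);
  });

  // ...and reports back over the same channel.
  process.send({ status: 'ready', pid: process.pid });
}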

Having multiple processes to handle requests means that several requests can be processed at the same time, and if one worker is busy with a time-consuming or blocking operation, the other workers can continue to serve incoming requests without the application stalling.

Running multiple workers also lets you update an application in production with almost no downtime. You can modify the application and restart the workers one at a time, waiting for one child process to be fully spawned before restarting the next. That way, there is always a worker running while you update the application.
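
As a rough sketch of that idea (assuming it runs inside the primary process after the workers have been forked; a real deployment would also need error handling and timeouts):

const cluster = require('cluster');

// Restart workers one at a time: fork a replacement, wait until it is
// listening, then kill the old worker and move on to the next one.
function rollingRestart(workerIds) {
  if (workerIds.length === 0) return;

  const [id, ...rest] = workerIds;
  const oldWorker = cluster.workers[id];
  const replacement = cluster.fork();

  replacement.once('listening', () => {
    oldWorker.kill(); // at least one worker is serving requests at all times
    rollingRestart(rest);
  });
}

rollingRestart(Object.keys(cluster.workers));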

Incoming requests are distributed to child processes in two ways:

  • The main process listens for requests on the port and distributes them across the workers in round-robin fashion. This is the default on all platforms except Windows (see the snippet after this list).
  • The main process creates the listen socket and sends it to interested workers, which then accept incoming connections directly.
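
Node.js lets you pick between the two approaches through cluster.schedulingPolicy, or the NODE_CLUSTER_SCHED_POLICY environment variable:

const cluster = require('cluster');

// SCHED_RR is round-robin (the default everywhere except Windows);
// SCHED_NONE leaves the distribution to the operating system.
// This must be set before the first worker is forked.
cluster.schedulingPolicy = cluster.SCHED_RR;

// Equivalently, set it from the environment before starting Node:
//   NODE_CLUSTER_SCHED_POLICY=rr node app.js   (or "none")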

Clustering an application

To understand the performance benefits of clustering, we will first build a Node.js application that does not use clustering, and then compare it to one that does.

const express = require('express');
const app = express();
const port = 3000;

app.get('/', (req, res) => {
  res.send('Hello World!');
})

app.get('/api/:n', function (req, res) {
  let n = parseInt(req.params.n);
  let count = 0;

  if (n > 5000000000) n = 5000000000;

  for (let i = 0; i <= n; i++) {
    count += i;
  }

  res.send(`Final count is ${count}`);
})

app.listen(port, () => {
  console.log(`App listening on port ${port}`);
})

It looks a little contrived, but that's deliberate. The application has two routes: one returns the string "Hello World!", and the other takes a parameter n and returns the cumulative sum of the integers from 1 to n.

The loop is an O(n) operation, so it gives us an easy way to simulate a long-running operation on the server, as long as the parameter n is large enough. We cap n at 5,000,000,000 so a request can't overload the machine. (The sum could of course be computed in O(1) as n(n+1)/2; the loop is deliberately wasteful so that it keeps the CPU busy.)

Run node app.js and pass in a small value of n (e.g., http://localhost:3000/api/50); the request finishes quickly and returns the result.

When you pass in a large value of n, you'll see the problem with a single-threaded application. Try sending 5000000000 (via http://localhost:3000/api/5000000000).

The application will take a few seconds to return the result. If you open another browser tab at the same time and send another request (either / or /api/:n works), you'll see that it also has to wait a few seconds. A single CPU core must finish the first request before it can handle the other.

Now, let’s use the Cluster module to spawn child processes and see how it improves performance.

const express = require('express');
const port = 3000;
const cluster = require('cluster');
const totalCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  console.log(`Number of CPUs is ${totalCPUs}`);
  console.log(`Master ${process.pid} is running`);

  // Fork workers.
  for (let i = 0; i < totalCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died`);
    console.log("Let's fork another worker!");
    cluster.fork();
  });

} else {
  const app = express();
  console.log(`Worker ${process.pid} started`);

  app.get('/', (req, res) => {
    res.send('Hello World!');
  })

  app.get('/api/:n', function (req, res) {
    let n = parseInt(req.params.n);
    let count = 0;

    if (n > 5000000000) n = 5000000000;

    for (let i = 0; i <= n; i++) {
      count += i;
    }

    res.send(`Final count is ${count}`);
  })

  app.listen(port, () => {
    console.log(`App listening on port ${port}`);
  })
}

The application does the same thing as before, but this time we spawn several child processes that all share port 3000 and can handle requests sent to it. Under the hood, worker processes are spawned with the child_process.fork() method, which returns a ChildProcess object with a built-in communication channel that allows messages to be passed back and forth between the child and its parent.

We create as many child processes as there are CPU cores on the machine running the application. It is recommended not to create more workers than there are logical cores on the machine, as this can incur scheduling overhead: the system must schedule every created process so that each one gets its turn on the limited number of cores.
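
One common pattern is to make the worker count configurable but cap it at the number of logical cores. Here is a small sketch (the WEB_CONCURRENCY variable name is just an illustration, not something the Cluster module defines):

const os = require('os');

// Use the configured worker count if one is given, but never exceed
// the number of logical cores reported by the OS.
const requested = parseInt(process.env.WEB_CONCURRENCY, 10) || os.cpus().length;
const totalCPUs = Math.min(requested, os.cpus().length);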

The worker processes are created and managed by the main process. When the application starts, we check whether it is the main process with cluster.isMaster. This is determined by process.env.NODE_UNIQUE_ID: if NODE_UNIQUE_ID is undefined, isMaster is true.

In the main process, we call cluster.fork() to create the worker processes and print the process IDs of the main and worker processes. After running the code above, you should see a log like the one below (my machine has 8 cores). Whenever a worker dies, a new one is forked so the CPU cores stay fully utilized.

Number of CPUs is 8
Master 19981 is running
Worker 19984 started
Worker 19983 started
Worker 19986 started
App listening on port 3000
App listening on port 3000
App listening on port 3000
Worker 19985 started
Worker 19982 started
App listening on port 3000
App listening on port 3000
Worker 19987 started
Worker 19988 started
App listening on port 3000
App listening on port 3000
Worker 19989 started
App listening on port 3000

To see how clustering improves things, repeat the earlier experiment: request /api/:n with a large value of n, then quickly send another request from a second tab. The second request returns immediately, even while the first one is still running, which shows it is not queued behind the first. With multiple workers available to handle requests, both server availability and throughput improve.

We can feel the performance boost from clustering here, but we haven't measured it. Next, let's benchmark both applications and let the data speak for itself.

Performance testing

Let's run a load test against both applications to see how each handles a large number of incoming requests. We will use the loadtest npm package for this.

loadtest lets you simulate a large number of concurrent requests against an API so you can measure its performance.
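
Besides the CLI we use below, the loadtest package also exposes a programmatic API, so the same benchmark could be run from a script. A minimal sketch based on the package's documented usage (exact result field names may vary between versions):

const loadtest = require('loadtest');

const options = {
  url: 'http://localhost:3000/api/5000000',
  maxRequests: 1000, // equivalent to -n 1000
  concurrency: 100,  // equivalent to -c 100
};

loadtest.loadTest(options, (error, result) => {
  if (error) {
    return console.error('Load test failed:', error);
  }
  console.log('Requests per second:', result.rps);
  console.log('Mean latency (ms):', result.meanLatencyMs);
});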

First, install loadtest globally:

npm install -g loadtest

Let's first load test the application without clustering:

loadtest http://localhost:3000/api/5000000 -n 1000 -c 100

The command above sends 1000 requests with a concurrency of 100. Here is the output:

[Fri Mar 26 2021 14:33:46 GMT+0800 (GMT+08:00)] INFO Requests: 0 (0%), requests per second: 0, mean latency: 0 ms
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Target URL:          http://localhost:3000/api/5000000
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Max requests:        1000
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Concurrency level:   100
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Agent:               none
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Completed requests:  1000
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Total errors:        0
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Total time:          4.949394667000001 s
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Requests per second: 202
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Mean latency:        468.8 ms
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO Percentage of the requests served within a certain time
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO   50%      490 ms
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO   90%      496 ms
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO   95%      499 ms
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO   99%      500 ms
[Fri Mar 26 2021 14:33:51 GMT+0800 (GMT+08:00)] INFO  100%      505 ms (longest request)

For the same request (n = 5000000), the server handles 202 requests per second, with a mean latency of 468.8 ms (the average time to complete a single request).

Next, let’s compare this result with an application that uses clustering.

Stop that service, start the clustered application, and run the same load test:

[Fri Mar 26 2021 14:42:17 GMT+0800 (GMT+08:00)] INFO Requests: 0 (0%), requests per second: 0, mean latency: 0 ms
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Target URL:          http://localhost:3000/api/5000000
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Max requests:        1000
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Concurrency level:   100
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Agent:               none
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Completed requests:  1000
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Total errors:        0
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Total time:          1.204361125 s
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Requests per second: 830
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Mean latency:        112.9 ms
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO Percentage of the requests served within a certain time
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO   50%      110 ms
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO   90%      133 ms
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO   95%      142 ms
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO   99%      162 ms
[Fri Mar 26 2021 14:42:18 GMT+0800 (GMT+08:00)] INFO  100%      170 ms (longest request)

As you can see, the clustered application handles 830 requests per second with a mean latency of 112.9 ms, roughly four times the throughput of the unclustered version. Clustering makes a big difference here!

Before moving on to the next section, let’s look at a scenario where clustering might not provide much of a performance boost.

Let's run one more test against each application, this time with requests that are not CPU-intensive.

Run the following command against the application without clustering:

loadtest http://localhost:3000/api/5000 -n 1000 -c 100

The result is:

Total time:          0.383569709 s
Requests per second: 2607
Mean latency:        35.5 ms

Then start the clustered application and run the same command.

The result is:

Total time:          0.401926125 s
Requests per second: 2488
Mean latency:        37.7 ms

As you can see, the clustered application performs slightly worse. What's going on here?

In this test, we called the API with a small value of n, which means the loop in our code only runs a handful of times. The operation is not CPU-intensive, and clustering only really pays off for CPU-intensive work.

So if your application does not run many CPU-intensive tasks, it may not be worth spawning that many workers. Keep in mind that every process you create has its own memory and V8 instance; because of this extra resource allocation, spawning a large number of child Node.js processes is not always advisable.

In our example, the clustered application performed slightly worse than the unclustered one because we paid the cost of creating and coordinating several child processes without getting much benefit in return. In practice, you can determine which applications in your microservices architecture would benefit from clustering by running tests like these and checking whether the extra worker processes actually help.

Managing a Node.js cluster with PM2

In the application above, we used the Cluster module to create and manage worker processes by hand: we decided how many workers to spawn (one per CPU core), spawned them, and listened for dying workers so we could fork replacements. Even for this very simple application, that is a fair amount of clustering code; in a production application, it would be more.

A tool that makes process management much easier is the PM2 process manager. PM2 is a production process manager for Node.js applications with a built-in load balancer. When configured correctly, PM2 automatically runs your application in cluster mode, spawns workers for you, and forks new workers when old ones die. PM2 makes it easy to stop, delete, and start processes, and it ships with monitoring tools to help you observe and tune your application's performance.

To use PM2, first install it globally:

npm install pm2 -g

We will use it to run our original, unclustered application:

const express = require('express');
const app = express();
const port = 3000;

app.get('/', (req, res) => {
  res.send('Hello World!');
})

app.get('/api/:n', function (req, res) {
  let n = parseInt(req.params.n);
  let count = 0;

  if (n > 5000000000) n = 5000000000;

  for (let i = 0; i <= n; i++) {
    count += i;
  }

  res.send(`Final count is ${count}`);
})

app.listen(port, () => {
  console.log(`App listening on port ${port}`);
})

To start the application, execute the following command:

pm2 start app.js -i 0

The -i flag tells PM2 to start the application in cluster_mode (as opposed to fork_mode). If -i is set to 0, PM2 spawns as many workers as there are CPU cores.

Your application now runs in cluster_mode, with no code changes required. You can repeat the tests from the previous section, and you should get results comparable to the clustered application. Behind the scenes, PM2 also uses the Node.js Cluster module, along with other tools that make process management easier.

In the terminal, PM2 prints a table with details about the worker processes.

You can stop the application with the following command:

pm2 stop app.js

Running pm2 start app.js -i 0 every time you start the service can get cumbersome, so the options can be stored in a configuration file, the Ecosystem File. This file lets you define separate configurations for different applications, which is especially useful for microservice applications.

You can generate an Ecosystem File with:

pm2 ecosystem

This generates a file called ecosystem.config.js. For our current application, change it to:

module.exports = {
  apps: [{
    name: "app",
    script: "app.js",
    instances: 0,
    exec_mode: "cluster"
  }]
}

Setting exec_mode to cluster tells PM2 to load-balance between the application instances. Setting instances to 0 spawns as many worker processes as the server has CPU cores.

-i or instances can be set to:

  • 0 or max (deprecated), to spread the application across all CPUs;
  • -1, to spread the application across all CPUs minus one;
  • a number, to spread the application across that many CPUs (number <= number of CPU cores).
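
The Ecosystem File supports many more options than instances and exec_mode. For example (a sketch using fields documented by PM2; the values here are only illustrative), you can pass environment variables to the app and have PM2 restart a process that grows too large:

module.exports = {
  apps: [{
    name: "app",
    script: "app.js",
    instances: 0,
    exec_mode: "cluster",
    env: {
      NODE_ENV: "production", // available in the app as process.env.NODE_ENV
      PORT: 3000,
    },
    max_memory_restart: "300M", // restart a process that exceeds 300 MB
  }]
}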

Now, you can launch the app like this:

pm2 start ecosystem.config.js

The application will run in Cluster Mode just as before.

You can start, restart, reload, stop, and delete apps:

$ pm2 start app_name
$ pm2 restart app_name
$ pm2 reload app_name
$ pm2 stop app_name
$ pm2 delete app_name

# When using an Ecosystem file:

$ pm2 [start|restart|reload|stop|delete] ecosystem.config.js

The restart command kills the process immediately and then starts it again. The reload command achieves a zero-downtime reload: workers are restarted one by one, and each old worker is only killed after its replacement has spawned.
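
For reload to be truly zero-downtime, the application itself should also shut down gracefully when PM2 asks it to stop. A minimal sketch of that pattern, extending the app.listen() call from our demo application (PM2 sends SIGINT by default when stopping a process):

const server = app.listen(port, () => {
  console.log(`App listening on port ${port}`);
});

// Stop accepting new connections, let in-flight requests finish, then exit.
process.on('SIGINT', () => {
  server.close(() => {
    console.log('All connections closed, exiting.');
    process.exit(0);
  });
});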

You can also check the status, logs, and metrics of running applications.

The following command lists all application states managed by PM2:

pm2 ls

The following command will print a real-time log:

pm2 logs

The following command displays the real-time monitoring panel:

pm2 monit

For more information about PM2 and its cluster mode, see the documentation.

Conclusion

Clustering improves the performance of Node.js applications by making more efficient use of system resources. When we changed our application to use clustering, we saw a significant increase in throughput. We then briefly introduced PM2, a tool that simplifies managing a cluster. I hope you found this article useful. For more information about clustering, see the Node.js Cluster module documentation and the PM2 documentation. For details on how the Cluster module works under the hood, see the article on the Node.js multi-process Cluster module.

The code for this article can be found on GitHub.

References

  • Improving Node.js Application Performance With Clustering
  • PM2 documentation
  • The principles of Node.js multi-process architecture