• Improving Node.js Application Performance With Clustering
  • Joyce Echessa
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: zenblo
  • Proofreaders: PassionPenguin, Ashira97

Improve Node.js application performance with clustering

When building a production application, we usually look for ways to optimize its performance as much as possible. In this article, we'll explore an approach that can improve how Node.js applications handle their workload.

A Node.js instance runs in a single thread, which means that on the multi-core systems most machines have today, the application will not use all of the CPUs. To make use of the other available CPUs, you can launch a cluster of Node.js processes and distribute the load between them.

Because multiple clients can be served at the same time, having multiple processes to handle requests can significantly increase server throughput (requests per second). In this article, we’ll first explore how to create child processes using the Node.js cluster module, and then we’ll explore how to manage clusters using the PM2 process manager.

An introduction to the cluster module

The Node.js cluster module enables the creation of child processes (worker processes) that run simultaneously and share the same server port. Each spawned child process has its own event loop, memory, and V8 instance. The child processes communicate with the parent process using inter-process communication (IPC).
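
As a quick illustration of that IPC channel (a minimal, self-contained sketch, separate from the app we build below), a parent and a worker can exchange messages like this:

const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();

  // The parent receives messages from the worker over the IPC channel.
  worker.on('message', (msg) => {
    console.log(`Parent received: ${msg.cmd}`);
    worker.kill();
  });

  worker.send({ cmd: 'ping' });
} else {
  // The worker listens on the same channel and replies to the parent.
  process.on('message', (msg) => {
    if (msg.cmd === 'ping') process.send({ cmd: 'pong' });
  });
}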

Having multiple processes to handle incoming requests means that several requests can be processed simultaneously, and if a long-running or blocking operation occupies one worker process, the other worker processes can keep handling incoming requests, so the application as a whole does not stall while it waits for the blocking operation to complete.

By running multiple worker processes, we can also update the application without downtime: modify the application, then restart the worker processes one at a time, waiting for one child process to be fully spawned before restarting the next. This way, there is always a worker process running while we update the application.
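
A minimal sketch of such a rolling restart, written as a helper you could call from the main process of the clustered app shown later in this article (the function and its name are illustrative, not part of the original code):

const cluster = require('cluster');

// Restart workers one at a time: fork a replacement, wait until it is
// listening, then let the old worker finish its in-flight work and exit.
function restartWorkers(ids, i = 0) {
  if (i >= ids.length) return;
  const oldWorker = cluster.workers[ids[i]];
  const newWorker = cluster.fork();

  newWorker.once('listening', () => {
    oldWorker.once('exit', () => restartWorkers(ids, i + 1));
    oldWorker.disconnect(); // stop accepting new connections, then exit
  });
}

restartWorkers(Object.keys(cluster.workers));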

Incoming connections are distributed between the child processes in one of two ways (a sketch showing how to choose between them follows the list):

  • By default, on all platforms except Windows, the main process listens for connections on the port and distributes them among the worker processes in round-robin fashion.
  • Alternatively, the main process creates the listening socket and sends it to interested worker processes, which then accept incoming connections directly.
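
Node.js exposes that choice through cluster.schedulingPolicy, or equivalently the NODE_CLUSTER_SCHED_POLICY environment variable; a minimal sketch (this setting is not part of the article's app):

const cluster = require('cluster');

// Must be set before the first cluster.fork() call.
// cluster.SCHED_RR is the round-robin approach (the default on every
// platform except Windows); cluster.SCHED_NONE leaves the distribution
// of connections to the operating system.
cluster.schedulingPolicy = cluster.SCHED_NONE;

// The same choice can be made without code changes:
//   NODE_CLUSTER_SCHED_POLICY="none" node app.js   (or "rr")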

Using cluster

To see the advantages of clustering, we’ll start with an example of a Node.js application that doesn’t use clustering and then compare it to an application that uses clustering:

const express = require('express');
const app = express();
const port = 3000;

app.get('/', (req, res) => {
  res.send('Hello World!');
})

app.get('/api/:n', function (req, res) {
  let n = parseInt(req.params.n);
  let count = 0;

  if (n > 5000000000) n = 5000000000;

  for (let i = 0; i <= n; i++) {
    count += i;
  }

  res.send(`Final count is ${count}`);
})

app.listen(port, () => {
  console.log(`App listening on port ${port}`);
})

This is a somewhat contrived app that you would not find in the real world, but it works for our purposes. The application consists of two routes: one that returns the string "Hello World!", and one that takes a route parameter n, adds the numbers from 0 to n into a count variable, and returns a string containing the final count.

This is an O(n) operation, so passing a large enough n conveniently simulates a long-running operation on the server. We cap n at 5,000,000,000 to spare our machines from running even more iterations.
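
As an aside, the loop exists purely to keep the CPU busy: the sum it computes has a closed form, so the same value could be produced instantly (this check is not part of the app):

// Sum of 0..n is n * (n + 1) / 2. For n = 5,000,000,000 the result
// (about 1.25e19) exceeds Number.MAX_SAFE_INTEGER, so both the loop and
// this formula only yield an approximation in double precision.
const n = 5_000_000_000;
console.log(n * (n + 1) / 2); // ~1.25000000025e19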

If you run the application with node app.js and pass it a suitably small value of n (e.g., http://localhost:3000/api/50), it will execute quickly and return a response almost immediately. The root route (http://localhost:3000) also responds quickly.

When you pass a large value of n, you will see the problems of running an application on a single process. Try passing it a very big number such as 5000000000 (via http://localhost:3000/api/5000000000).

The application may take a few seconds to complete your request. If you open another browser tab and try to send another request to the server (to the / or /api/:n route), that request will also take a few seconds to complete, because the single process is busy with the other long-running operation: the single CPU core has to finish the first request before it can handle the other one.

Now, let's use the cluster module in the application to spawn some child processes and see how things improve.

Here is the modified application:

const express = require('express');
const port = 3000;
const cluster = require('cluster');
const totalCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  console.log(`Number of CPUs is ${totalCPUs}`);
  console.log(`Master ${process.pid} is running`);

  // Fork worker processes
  for (let i = 0; i < totalCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died`);
    console.log("Forking another worker!");
    cluster.fork();
  });

} else {
  const app = express();
  console.log(`Worker ${process.pid} started`);

  app.get('/', (req, res) => {
    res.send('Hello World!');
  })

  app.get('/api/:n', function (req, res) {
    let n = parseInt(req.params.n);
    let count = 0;

    if (n > 5000000000) n = 5000000000;

    for (let i = 0; i <= n; i++) {
      count += i;
    }

    res.send(`Final count is ${count}`);
  })

  app.listen(port, () => {
    console.log(`App listening on port ${port}`);
  })
}

The application does the same thing as before, but this time we are spawning several child processes that all share port 3000 and can handle requests sent to that port. The worker processes are spawned using the child_process.fork() method, which returns a ChildProcess object with a built-in communication channel that allows messages to be passed back and forth between the child and its parent.

We create as many child processes as there are CPU cores on the machine the application runs on. It is recommended not to create more worker processes than there are logical cores on the machine, as that introduces process-scheduling overhead: the system has to schedule every created process so that each gets its turn on the limited number of cores.
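
As an aside, newer Node.js versions (18.14+) offer os.availableParallelism(), which asks libuv for an estimate of usable parallelism and can be a better fork count in restricted environments; a hedged variant of the counting line above:

const os = require('os');

// Prefer availableParallelism() where it exists (Node.js 18.14+);
// fall back to counting logical cores on older versions.
const totalCPUs = typeof os.availableParallelism === 'function'
  ? os.availableParallelism()
  : os.cpus().length;

console.log(`Planning to fork ${totalCPUs} workers`);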

The worker processes are created and managed by the main (master) process. When the application first runs, we check cluster.isMaster to see whether it is the master process. This is determined by the process.env.NODE_UNIQUE_ID variable: if NODE_UNIQUE_ID is undefined, isMaster is true.
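
In other words, the check boils down to something like the following sketch of the behavior just described (not the cluster module's actual source):

// Workers are forked with NODE_UNIQUE_ID set in their environment;
// the process that was started directly has no such variable.
const isMaster = process.env.NODE_UNIQUE_ID === undefined;
console.log(isMaster ? 'running as the master' : 'running as a worker');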

If the process is the master, we call cluster.fork() several times to spawn the worker processes. We log the process IDs of the master and the workers; below you can see the output of running the application on a four-core system. When a child process dies, we fork a new one so that the available CPU cores stay in use.

Number of CPUs is 4
Master 67967 is running
Worker 67981 started
App listening on port 3000
Worker 67988 started
App listening on port 3000
Worker 67982 started
Worker 67975 started
App listening on port 3000
App listening on port 3000

To experience the improvement you get with clustering, run the same experiment as before: first send a request with a large value of n to the server, then quickly send another request in a second browser tab. The second request completes while the first is still running, rather than waiting for it to finish. Because multiple worker processes are available to handle requests, both server availability and throughput improve.

Running one request in one browser tab and then quickly running another in a second tab may show us the improvement clustering provides, but it is not a proper or reliable way to measure performance. Let's look at some benchmarks that better illustrate how clustering improves our application.

Performance metrics

Let's load-test the two applications to see how well each handles a large number of incoming connections. We will use the loadtest package for this.

With the loadtest package, you can simulate a large number of concurrent connections to your API in order to measure its performance.

To use loadtest, you first need to install it globally:

$ npm install -g loadtest
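
Besides the CLI used below, the package also exposes a programmatic API. A hedged sketch based on its documented options (the exact names of the result fields may vary between versions):

const loadtest = require('loadtest');

const options = {
  url: 'http://localhost:3000/api/500000',
  maxRequests: 1000,
  concurrency: 100,
};

loadtest.loadTest(options, (error, result) => {
  if (error) return console.error('Load test failed:', error);
  // Field names follow the package README; treat them as assumptions.
  console.log(`Requests per second: ${result.rps}`);
  console.log(`Mean latency: ${result.meanLatencyMs} ms`);
});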

Then, run the application you want to test with node app.js. We will first test the version that does not use clustering.

With the application running, open another terminal and run the following load test:

$ loadtest http://localhost:3000/api/500000 -n 1000 -c 100

The command above sends 1000 requests to the given URL, 100 of which are concurrent. Here is the output of running the command:

Requests: 0 (0%), requests per second: 0, mean latency: 0 ms

Target URL:          http://localhost:3000/api/500000
Max requests:        1000
Concurrency level:   100
Agent:               none

Completed requests:  1000
Total errors:        0
Total time:          1.268364041 s
Requests per second: 788
Mean latency:        119.4 ms

Percentage of the requests served within a certain time
  50%      121 ms
  90%      132 ms
  95%      135 ms
  99%      141 ms
 100%      142 ms (longest request)

We see that with the same request (when n = 500000), the server handled 788 requests per second, with a mean latency (the average time needed to complete a single request) of 119.4 milliseconds.

Let's try again, but this time with a larger value of n (still without clustering):

$ loadtest http://localhost:3000/api/5000000 -n 1000 -c 100

Here is the output:

Requests: 0 (0%), requests per second: 0, mean latency: 0 ms
Requests: 573 (57%), requests per second: 115, mean latency: 798.3 ms

Target URL:          http://localhost:3000/api/5000000
Max requests:        1000
Concurrency level:   100
Agent:               none

Completed requests:  1000
Total errors:        0
Total time:          8.740058135 s
Requests per second: 114
Mean latency:        828.9 ms

Percentage of the requests served within a certain time
  50%      869 ms
  90%      874 ms
  95%      876 ms
  99%      879 ms
 100%      880 ms (longest request)

For requests with n = 5000000, the server handled 114 requests per second, with a mean latency of 828.9 milliseconds.

Let’s compare this to an application that uses clustering.

Stop the non-clustered application, run the clustered application, and finally run the same load test.

Here are the test results for http://localhost:3000/api/500000:

Requests: 0 (0%), requests per second: 0, mean latency: 0 ms

Target URL:          http://localhost:3000/api/500000
Max requests:        1000
Concurrency level:   100
Agent:               none

Completed requests:  1000
Total errors:        0
Total time:          0.70144636 s
Requests per second: 1426
Mean latency:        65 ms

Percentage of the requests served within a certain time
  50%      61 ms
  90%      81 ms
  95%      90 ms
  99%      106 ms
 100%      112 ms (longest request)

With the same request (when n = 500000), the application that uses clustering handled 1426 requests per second, a significant increase over the 788 requests per second of the application without clustering. The clustered application's mean latency was 65 milliseconds, compared with 119.4 ms for the non-clustered one. You can clearly see the improvement clustering brings.

Here are the test results for http://localhost:3000/api/5000000:

Requests: 0 (0%), requests per second: 0, mean latency: 0 ms

Target URL:          http://localhost:3000/api/5000000
Max requests:        1000
Concurrency level:   100
Agent:               none

Completed requests:  1000
Total errors:        0
Total time:          2.43770738 s
Requests per second: 410
Mean latency:        229.9 ms

Percentage of the requests served within a certain time
  50%      235 ms
  90%      253 ms
  95%      259 ms
  99%      355 ms
 100%      421 ms (longest request)

Here (when n = 5000000), the clustered application handled 410 requests per second, versus 114 for the non-clustered application, with a mean latency of 229.9 ms compared to 828.9 ms.

Before moving on to the next section, let’s look at a situation where clustering might not provide much performance improvement.

We will run two more tests for each application. We will test requests that are not CPU-intensive and run fairly fast without overloading the event loop.

With the non-clustered application running, perform the following test:

$ loadtest http://localhost:3000/api/50 -n 1000 -c 100

Here’s the summary:

Total time:          0.531421648 s
Requests per second: 1882
Mean latency:        50 ms

With the same non-clustered application still running, perform the following test:

$ loadtest http://localhost:3000/api/5000 -n 1000 -c 100

Here’s the summary:

Total time:          0.50637567 s
Requests per second: 1975
Mean latency:        47.6 ms

Now, stop that application and start the clustered version instead.

With the clustered application running, perform the following test:

$ loadtest http://localhost:3000/api/50 -n 1000 -c 100

Here’s the summary:

Total time:          0.598028941 s
Requests per second: 1672
Mean latency:        56.6 ms

The clustered application handled 1672 requests per second, while the non-clustered application handled 1882. The mean latency was 56.6 ms for the clustered application and 50 ms for the non-clustered one.

Let's run one more test. With the same clustered application still running, perform the following test:

$ loadtest http://localhost:3000/api/5000 -n 1000 -c 100

Here’s the summary:

Total time:          0.5703417869999999 s
Requests per second: 1753
Mean latency:        53.7 ms

Here, the clustered application handled 1753 requests per second, as opposed to 1975 for the non-clustered application, and its mean latency was 53.7 ms versus 47.6 ms.

Based on these tests, you can see that clustering does not always significantly improve performance. In fact, here the clustered application performed slightly worse than the one without clustering. Why is that?

In the tests above, we called the API with fairly small values of n, which means the loop runs only a small number of iterations and the operation is not very CPU-intensive. Clustering shines when it comes to CPU-intensive tasks: when your application is likely to run such tasks, clustering improves the number of them that can run at once.

However, if your application is not running many CPU-intensive tasks, the performance gain may not be worth the overhead of spawning a large number of worker processes. Remember, every process you create has its own memory and V8 instance. Because of the additional resource allocation, spawning a large number of Node.js child processes is generally not recommended.

In our example, the clustered application performs worse than the non-clustered one because we pay the cost of creating several child processes without gaining much advantage in return. In the real world, you can use this approach to determine which applications in your microservice architecture could benefit from clustering, running benchmarks to check whether the benefits justify the extra complexity.

Managing clusters with PM2

In our application, we used the Node.js cluster module to create and manage worker processes manually. We first determined the number of worker processes to spawn (one per CPU core), then spawned those workers ourselves, and finally listened for dying worker processes so that we could spawn new ones. Even in this very simple application, that is a fair amount of code to write; a production application needs even more.

There is a tool that makes process management easier: the PM2 process manager. PM2 is a production process manager for Node.js applications with a built-in load balancer. When configured appropriately, PM2 automatically runs your application in cluster mode, spawns workers for you, and spawns new workers when old ones die. PM2 makes it easy to stop, delete, and start processes, and it ships with monitoring tools that help you observe and tune your application's performance.

To use PM2, install it globally:

$ npm install pm2 -g

We’ll use it to run our first unmodified application:

const express = require('express');
const app = express();
const port = 3000;

app.get('/', (req, res) => {
  res.send('Hello World!');
})

app.get('/api/:n', function (req, res) {
  let n = parseInt(req.params.n);
  let count = 0;

  if (n > 5000000000) n = 5000000000;

  for (let i = 0; i <= n; i++) {
    count += i;
  }

  res.send(`Final count is ${count}`);
})

app.listen(port, () => {
  console.log(`App listening on port ${port}`);
})

Run the application with the following command:

$ pm2 start app.js -i 0

The -i <number of instances> flag tells PM2 to start the application in cluster_mode (as opposed to fork_mode). If <number of instances> is set to 0, PM2 spawns as many worker processes as there are CPU cores.

Just like that, the application now runs in cluster mode, with no code changes required. You can run the same tests as before and get results comparable to those of the clustered application. Under the hood, PM2 also uses the Node.js cluster module, along with other handy process-management tools.

In the terminal, PM2 prints a table showing details of the spawned processes.

You can stop the application with the following command:

$ pm2 stop app.js

The application goes offline, and the terminal output shows all of its processes in the stopped state.

Instead of always passing configuration flags when starting the application (pm2 start app.js -i 0), you can save them in a separate configuration file, called an ecosystem file. This file also lets you define specific configurations for different applications, which is especially useful for microservice applications, for example.

Ecosystem files can be generated using the following command:

$ pm2 ecosystem

It will generate a file called ecosystem.config.js, which we then modify as follows:

module.exports = {
  apps: [{
    name: "app",
    script: "app.js",
    instances: 0,
    exec_mode: "cluster",
  }],
}

Setting exec_mode to cluster instructs PM2 to load-balance between the instances. Setting instances to 0 spawns as many worker processes as there are CPU cores.

The -i or instances option can be set to:

  • 0 or max (deprecated) to spread the application across all CPUs
  • -1 to spread the application across all CPUs minus one
  • number to spread the application across number CPUs

You can now run the application with the following command:

$ pm2 start ecosystem.config.js

The application will run in cluster mode as before.

You can start, restart, reload, stop, and delete the application using the following commands:

$ pm2 start app_name
$ pm2 restart app_name
$ pm2 reload app_name
$ pm2 stop app_name
$ pm2 delete app_name

When using an ecosystem file:

$ pm2 [start|restart|reload|stop|delete] ecosystem.config.js

The restart command kills and restarts the processes immediately, while the reload command achieves a 0-second-downtime reload: worker processes are restarted one by one, and PM2 waits for a new worker to spawn before killing the old one.
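
For a reload to be truly zero-downtime, the application itself should shut down gracefully when PM2 asks it to. A minimal sketch, assuming PM2's default behavior of sending SIGINT and force-killing only after a timeout:

const express = require('express');
const app = express();

app.get('/', (req, res) => res.send('Hello World!'));

const server = app.listen(3000);

// On SIGINT, stop accepting new connections and exit once in-flight
// requests have finished, so the reload drops no requests.
process.on('SIGINT', () => {
  server.close(() => process.exit(0));
});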

You can also check the status, logs, and metrics of running applications.

The following command lists the status of all applications managed by PM2:

$ pm2 ls

The following command is used to display logs in real time:

$ pm2 logs

The following command displays the real-time dashboard in the terminal:

$ pm2 monit

For more information about PM2 and its cluster mode, see the documentation.

Conclusion

We began by seeing how clustering improves the performance of Node.js applications by using system resources more efficiently: when the application was modified to use clustering, we saw a significant increase in throughput. We then took a quick look at a tool, PM2, that simplifies managing a cluster. I hope you found this article useful. For more information about clusters, see the cluster module documentation and the PM2 documentation, and you can also check out this tutorial.

Our contributing writer Joyce Echessa is a full-stack Web developer. She occasionally writes down her thoughts in technical articles to keep track of what she has learned.
