Translator: Wang Yezhu


Node.js is an event-based platform. This means that everything that happens in Node is a reaction to an event. Data passed into Node flows through a cascade of nested callbacks. This process is abstracted away from the developer and handled by libuv, the library that provides the event loop mechanism.

The event loop is perhaps the most misunderstood concept in Node.

I work for Dynatrace, a performance monitoring service. When we built event loop monitoring, we put a lot of effort into understanding exactly what we were monitoring.

This article will cover what we’ve learned about how the event loop works and how to monitor it properly.

Common misconceptions

Libuv is the library that provides the event loop to Node.js. Bert Belder, one of the key people behind libuv, opened his excellent Node Interactive keynote with a Google image search showing the different ways people depict event loops, most of which, he says, are wrong.

Let me summarize (in my view) the most common misconceptions.

Myth 1: The event loop runs in a thread separate from user code

THE MISCONCEPTION

There is a main thread that executes the user’s JavaScript code (userland code) and another thread that runs the event loop. Whenever an asynchronous operation occurs, the main thread hands it over to the event loop thread, and when the operation completes, the event loop thread notifies the main thread to execute a callback.

FACTS

There is only one thread that executes JavaScript code, and the event loop runs in that thread. Callback execution (and any user code in a running Node.js application is a callback) is driven by the event loop. We’ll look at this in more detail later.
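This single-threadedness is easy to demonstrate. In the minimal sketch below (my own illustration, not from the original article), a timer scheduled with a 0 ms delay cannot fire until a long synchronous loop on the same thread finishes:

```javascript
const start = Date.now();
let firedAfter = -1;

setTimeout(() => {
  // This cannot run until the synchronous block below finishes,
  // despite the 0 ms delay: callbacks share the thread with user code.
  firedAfter = Date.now() - start;
  console.log(`timer fired after ${firedAfter} ms`);
}, 0);

// Block the one and only thread for roughly 100 ms.
while (Date.now() - start < 100) { /* busy-wait */ }
```

If the event loop ran in a separate thread, the timer would fire after roughly 0 ms; instead it fires only once the busy-wait releases the thread.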

Myth 2: All asynchronous operations are handled by a thread pool

THE MISCONCEPTION

Asynchronous operations, such as working with the file system, making outgoing HTTP requests, or talking to a database, are always offloaded to a thread pool provided by libuv.

FACTS

Libuv creates a pool of four threads by default to offload asynchronous operations. But today’s operating systems already provide asynchronous interfaces for many I/O tasks (for example, AIO on Linux). Libuv uses these asynchronous interfaces whenever possible, avoiding the thread pool. The same applies to third-party subsystems such as databases: the authors of these drivers prefer asynchronous interfaces over the thread pool. In short: the thread pool is used for asynchronous I/O only when there is no other way.

Myth 3: The event loop is similar to a stack or queue

THE MISCONCEPTION

The event loop continuously polls a FIFO queue of asynchronous tasks and executes the callback when a task completes.

FACTS

Although queue-like structures are involved, the event loop does not process a single stack or queue. Rather, it is a series of phases, each handling specific tasks, processed in a circular fashion.

Understanding the different phases of the event loop

To really understand the event loop, we have to understand what work is done in each phase. Hopefully with Bert Belder’s approval, my way of showing how the event loop works looks like this:

Let’s talk about each of these phases. A full explanation can be found in the Node.js documentation.

Timers

Various scheduled tasks set by setTimeout() or setInterval() are processed in this phase.

IO Callbacks

Most callbacks are processed in this phase. As all userland code in a Node.js application is essentially a callback (for example, an incoming HTTP request triggers a cascade of nested callbacks), this is the user’s code.

IO Polling

New events are polled here, to be processed in the next iteration of the loop.

Set Immediate

Executes all callbacks registered via setImmediate().

Close

All callbacks listening for close events will be handled in this phase.

Monitor event loop

We can see that virtually everything that happens in a Node application runs through the event loop. This means that if we can derive metrics from the event loop, those metrics can give us valuable information about an application’s general health and performance. Since there is no API to fetch runtime metrics from the event loop, monitoring tools each provide their own metrics. Let’s take a look at the ones we provide.

Tick Frequency

The number of loop iterations (“ticks”) completed per time interval.

Tick Duration

The time one loop iteration takes.

Since our agent runs as a native module, it is relatively easy for us to obtain this information by adding probes.

Applying the tick frequency and tick duration metrics

When we first tested this under different kinds of load, the results were surprising. Let me show you an example:

In the following scenarios, I ran an Express.js application that sends a request to another HTTP server for each incoming request.

Here are four scenarios:

  1. Idle: no requests received.

  2. `ab -c 5`: Apache Bench issuing five concurrent requests at a time.

  3. `ab -c 10`: ten concurrent requests.

  4. `ab -c 10` (slow backend): the backend HTTP server delays its response by 1 s to simulate a slow backend. This creates callback pressure as requests pile up inside Node while waiting on the backend.

If we look at the results chart, we can draw an interesting conclusion:

The duration and frequency of event loop iterations adapt dynamically to changes in load.

If the application is idle, meaning there are no pending tasks (timers, callbacks, etc.), there is no reason to run through the event loop phases at full speed, so the loop adapts by blocking for a while in the polling phase to wait for new external events.

This also means that the metrics under no load (low frequency, high duration) look similar to the metrics for a slow backend under high load.

We also see that the sample application works best with five concurrent requests.

Tick frequency and tick duration should therefore always be interpreted relative to the current number of requests per second.

Although this data gave us some valuable information, we still didn’t know where the time was being spent, so we dug deeper and came up with two more metrics.

Work processed latency

This metric measures how long it takes for an asynchronous task to be picked up and processed by the thread pool.

High work processing latency indicates a busy/exhausted thread pool.

To test this metric, I created an Express route that uses a library called sharp to process images. Because image processing is expensive, sharp uses the thread pool to do its work.

Running Apache Bench with five concurrent connections against the image-processing route is directly reflected in this chart and is clearly distinguishable from a medium-load scenario without image processing.

Event Loop Latency

Event loop latency measures how long it takes for a timer scheduled with setTimeout(X) to actually be processed.

A high event loop latency indicates an event loop busy processing callbacks.

For this metric, I created an Express route that uses a very inefficient algorithm to compute the Fibonacci sequence.

Running Apache Bench against the Fibonacci route with five concurrent connections clearly shows a busy callback queue.
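The idea behind this metric can be sketched in a few lines (my own example): schedule a timer, then keep the loop busy; the timer’s lateness is the event loop latency.

```javascript
const scheduled = Date.now();
let lagMs = -1;

setTimeout(() => {
  // How late a "0 ms" timer actually fired is the loop's latency.
  lagMs = Date.now() - scheduled;
  console.log(`event loop latency: ~${lagMs} ms`);
}, 0);

// Simulate a busy callback (like the inefficient Fibonacci route)
// by blocking the thread for roughly 150 ms.
const blockUntil = Date.now() + 150;
while (Date.now() < blockUntil) { /* spin */ }
```

Production tools typically sample this continuously with a repeating timer rather than a single measurement, but the principle is the same.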

We clearly see that the above four metrics can provide valuable information to help us better understand how Node.js works internally.

All of these metrics need to be seen in a bigger picture to be understood. We are therefore currently collecting this data and will use it to derive baselines.

Adjust the event loop

In fact, having indicators without knowing how to take action to fix them doesn’t help us much. Here are some suggestions for what to do when the event loop seems busy.

Use all the CPUs

A Node.js application runs on a single thread. This means that on multi-core machines the load is not distributed across all cores. Using Node’s cluster module, it is easy to spawn one child process per CPU. Each child process maintains its own event loop, and the master process transparently distributes the load among all children.

Adjusting the thread pool

As mentioned above, libuv creates a pool of four threads by default. This default can be overridden by setting the environment variable UV_THREADPOOL_SIZE. While this can mitigate the problem for I/O-bound applications, a large thread pool can still exhaust memory or CPU under heavy load.
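For example (the entry point `app.js` is a placeholder name):

```shell
# libuv reads UV_THREADPOOL_SIZE once at startup, so it must be set
# before the process launches; changing it at runtime has no effect.
UV_THREADPOOL_SIZE=8 node app.js
```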

Move computationally intensive work out of the service

If Node.js is spending too much time on computation-intensive operations, offloading that work to a separate service, perhaps one written in a language better suited to the task, is a viable option.

Conclusion

Let’s summarize what we’ve learned in this article:

  • The event loop keeps a Node.js application running

  • Its workings are often misunderstood: it runs through a series of phases, each of which handles specific tasks

  • The event loop does not provide metrics out of the box, so the metrics collected by different APM vendors differ

  • While these metrics provide valuable information about performance bottlenecks, a deep understanding of the event loop mechanism and the code being executed is key.

  • In the future, Dynatrace will add an event loop remote monitoring technology to root cause detection to associate event loop anomalies with problems.

There’s no doubt in my mind that we’ve just created the most comprehensive event loop monitoring solution on the market today, and I’m excited to see these new features rolling out to our users in the coming weeks.

Thank you

The outstanding Node.js agent team at Dynatrace put a lot of effort into event loop monitoring. Most of the findings presented in this blog post are based on their in-depth knowledge of the inner workings of Node.js. I want to thank Bernhard Liedl, Dominik Gruber, Gerhard Stobich and Gernot Reisinger for their hard work and support.

I hope this article has shed some light on the subject. Please follow me on Twitter @dkhan; I’m happy to answer your questions there or in the comments section below.

If you want to learn more about the inner workings of event loops or how to use them as a developer, I recommend this post from my friend on RisingStack.

If you’d like to give our Node.js monitor a try, download our free trial and share your feedback with me at any time — it’s how we get to know our users.