How to Improve the Throughput of Node.js Applications
Key takeaways
- Use aggregated IO operations whenever possible to write in batches and minimize the number of system calls.
- Account for timer scheduling overhead and consolidate the timers in your application.
- CPU profilers give you useful information, but they don't tell the whole story about your process.
- Use ECMAScript's advanced syntax with caution, especially if you're not using the latest JavaScript engine or a transpiler like Babel.
- Understand the composition of your dependency tree and measure the performance of the dependencies you use.
When we want to optimize the performance of an application that performs IO, we need to understand how the application spends its CPU cycles and which factors prevent it from executing with higher parallelism. This article shares some of my thoughts from improving the DataStax Node.js driver for Apache Cassandra, along with the key factors that were degrading application throughput.
Background
V8, the standard JavaScript engine used by Node.js, compiles JavaScript code to machine code and then runs it as native code. The V8 engine uses the following three components to ensure both low startup time and optimal performance:
- A generic compiler that quickly compiles JavaScript code into machine code.
- A runtime profiler that tracks how much time is spent executing which parts of the application and identifies the code modules worth optimizing.
- An optimizing compiler that optimizes the code the runtime profiler has marked as hot; if its optimization assumptions later prove invalid, it can also deoptimize that code again.
While the optimizing compiler delivers peak performance, it does not optimize all code, especially code that is not written the right way. You can refer to the suggestions from the Google Chrome DevTools team to see which code patterns V8 refuses to optimize. Typical examples include (see the sketch after this list):
- Functions that contain a try-catch statement
- Functions whose arguments are reassigned through the arguments object
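This advice applies mainly to older V8 versions, but the workaround is simple either way: keep the try-catch out of the hot function by extracting it into a small wrapper. A minimal sketch, with illustrative function names:

// Hot loop contains no try-catch, so older V8 optimizers can compile it.
function totalParsedLength(items) {
  let total = 0;
  for (let i = 0; i < items.length; i++) {
    total += tryParseLength(items[i]);
  }
  return total;
}

// The try-catch is isolated in a small helper instead.
function tryParseLength(json) {
  try {
    return JSON.parse(json).length;
  } catch (err) {
    return 0; // malformed input counts as zero
  }
}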
While an optimizing compiler can significantly speed up code, in typical IO-intensive applications most of the performance gains come from reordering instructions and avoiding expensive calls to increase the number of operations performed per second; this is what we will discuss in the following sections.
Benchmarking
To identify the optimization techniques that will benefit the most users, we need to simulate real user scenarios and define benchmarks based on the amount of work performed by common tasks. We should first test the throughput and latency of the API entry points; additionally, if you want more information, you can also measure the performance of internal methods. process.hrtime() is recommended for obtaining high-resolution real time and measuring execution times. Although it may cause some inconvenience to project development, I recommend introducing performance metrics as early in the development cycle as possible. You can start with throughput tests of a few method calls and then gradually add more complex measurements, such as latency distributions.
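As a minimal sketch of such a throughput test (the measured function and iteration count here are placeholders, not taken from the driver project):

// Throughput benchmark sketch using process.hrtime().
const ITERATIONS = 1e6;

function workUnderTest() {
  // Placeholder for the API entry point being measured.
  return JSON.stringify({ id: 1, name: 'example' });
}

const start = process.hrtime();
for (let i = 0; i < ITERATIONS; i++) {
  workUnderTest();
}
const [seconds, nanos] = process.hrtime(start);
const elapsedSeconds = seconds + nanos / 1e9;
console.log(`${(ITERATIONS / elapsedSeconds).toFixed(0)} ops/sec`);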
CPU profiling
There are a variety of CPU profilers available today, and Node.js itself ships with an out-of-the-box CPU profiler that can handle most usage scenarios. The built-in Node.js profiler, derived from V8's built-in profiler, samples the stack at a fixed frequency; you can produce the V8 tick log file by running the node command with the --prof flag. You can then aggregate the profiling results into more readable text using the --prof-process flag:
$ node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt
Open the processed log file in an editor and you will see that the whole record is divided into sections. First let's look at the [Summary] section, which looks like this:
[Summary]:
 ticks  total  nonlib  name
 20109  41.2%   45.7%  JavaScript
 23548  48.3%   53.5%  C++
   805   1.7%    1.8%  GC
  4774   9.8%          Shared libraries
   356   0.7%          Unaccounted
The values above represent the number of samples taken in JavaScript/C++ code and in the garbage collector, respectively; they will vary depending on the code being profiled. You can then drill into the specific subsections (such as [JavaScript], [C++], …) as needed to get detailed sampling information. In addition, the profile contains a very useful section called [Bottom Up (Heavy) Profile], which shows the callers of each function in a tree structure, in the following basic format:
223 32% LazyCompile: *function1 lib/file1.js:223:20
221 99% LazyCompile: ~function2 lib/file2.js:70:57
221 100% LazyCompile: *function3 /lib/file3.js:58:74
The percentage on each line represents that caller's share of the calls to the function on the line above, while an asterisk before a function name means the function was optimized and a tilde means it was not. In the example above, 99% of the calls to function1 come from function2, while function3 accounts for 100% of the calls to function2. CPU profiles and flame graphs are very useful for analyzing stack usage and CPU time. Keep in mind, however, that these profiles are not the whole story: a large number of asynchronous IO operations can make them hard to interpret.
System calls
Node.js uses the platform-independent interface provided by libuv to implement non-blocking IO; all IO operations in an application (sockets, file system, …) are eventually converted into system calls. Dispatching these system calls takes a lot of time, so we need to aggregate IO operations whenever possible and write in batches to minimize the number of system calls. Concretely, we should buffer the data destined for a socket or file and flush it all at once instead of processing each operation individually. You can use a write queue to manage all your write operations. The logic of a typical write-queue implementation is as follows:
- When a write is requested and we are inside a processing window: append the buffer to the pending ("to write") list.
- When the window closes: concatenate all the pending buffers and write them to the target pipe at once.
You can define the window size based on the total buffer length or on the time elapsed since the first item entered the queue, but in choosing the window size we need to weigh the latency of individual write operations against the latency of the whole batched write. You also need to consider both the maximum number of writes that can usefully be aggregated and the overhead of a single write request. You might cap a write queue at some number of kilobytes; in our experience, around 8 kilobytes is a good threshold, though this value will vary with your usage scenario. You can refer to our complete implementation of the write queue. In summary, when we adopted batched writes, the number of system calls dropped significantly, ultimately improving the overall throughput of the application.
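The following is a minimal sketch of this write-queue logic (names and defaults are illustrative; a production implementation such as the driver's also needs error handling and backpressure):

// Coalesce writes issued within a short window, or up to a total-length
// threshold, into a single socket write (one system call).
class WriteQueue {
  constructor(socket, maxDelayMs = 1, maxLength = 8192) {
    this.socket = socket;
    this.maxDelayMs = maxDelayMs; // window starts when the first buffer is queued
    this.maxLength = maxLength;   // flush early beyond this many bytes (~8 KB)
    this.buffers = [];
    this.totalLength = 0;
    this.timer = null;
  }

  push(buffer) {
    this.buffers.push(buffer);
    this.totalLength += buffer.length;
    if (this.totalLength >= this.maxLength) {
      this.flush(); // window closed by size
    } else if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.buffers.length === 0) {
      return;
    }
    // Concatenate all pending buffers and write them in one call.
    this.socket.write(Buffer.concat(this.buffers, this.totalLength));
    this.buffers = [];
    this.totalLength = 0;
  }
}

Callers simply do queue.push(buffer) instead of socket.write(buffer); flushing by size bounds the latency that batching adds to any individual write.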
Node.js timers
Timers in Node.js expose the same API as window timers in browsers, which makes it easy to implement simple scheduling, and they are used widely across the ecosystem, so an application can be riddled with delayed calls. Similar to other hashed-wheel schedulers, Node.js uses a hash table and linked lists to maintain timer instances. Unlike other wheel schedulers, however, it does not maintain a fixed-length hash table; instead, it indexes timers by the time at which they fire. When a new timer is added and a bucket with the same key already exists (timers with the same trigger time), the addition costs O(1). If the key does not exist, a new bucket is created and the timer is added to it. With this in mind, we should reuse existing timer buckets as much as possible and avoid the expensive pattern of removing an entire bucket and then creating a new one. For example, when using a sliding delay, you should create the new timer with setTimeout() before removing the old one with clearTimeout(). In our heartbeat handling, we schedule the new idle timer before removing the previous one, ensuring the O(1) insertion path.
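To make the ordering concrete, here is a sketch of a sliding idle timeout (names are illustrative): creating the replacement timer before clearing the old one means the bucket for that delay still exists, so the insertion takes the O(1) path.

// Sliding idle timeout: reset on every piece of activity.
const IDLE_DELAY_MS = 30000;
let idleTimer = null;

function onActivity() {
  const previous = idleTimer;
  // Schedule the new timer first, so its bucket is reused...
  idleTimer = setTimeout(onIdle, IDLE_DELAY_MS);
  // ...and only then remove the previous timer.
  if (previous !== null) {
    clearTimeout(previous);
  }
}

function onIdle() {
  console.log('connection idle, sending heartbeat');
}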
ECMAScript language features
When overall performance matters, we need to avoid some of the higher-level language features in ECMAScript; typical examples are Function.prototype.bind(), Object.defineProperty() and Object.defineProperties(). The performance shortcomings of these features are documented in JavaScript engine release notes and issue trackers, for example the Promise performance improvements in V8 5.3 and the Function.prototype.bind performance improvements in V8 5.4.
You also need to be careful with the new language features in ES2015 and ESNext, which can be much slower than their ECMAScript 5 equivalents. The six-speed project website tracks the performance of these language features on different JavaScript engines; if you can't find measurements for a particular feature, you can run some tests yourself.
The V8 team has been working on bringing the performance of the new language features in line with their ES5 counterparts. We can follow their progress on ES2015 performance optimization in their performance plan, where they collect user suggestions for improvements and publish design documents describing their solutions. You can also keep up to date with V8 implementation progress on the V8 blog, though V8 improvements may take a while to land in an LTS version of Node.js: per the LTS plan, a new V8 version is only merged during a major Node.js release. You may have to wait 6-12 months for a new V8 engine to land in Node.js, and new releases within an existing Node.js LTS line will only pick up selected fixes for the V8 engine.
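If you suspect a particular feature, a quick self-test is easy to write; for example, a rough micro-benchmark sketch comparing Function.prototype.bind() with an equivalent closure (absolute numbers will vary by engine and version):

// bind() vs. an arrow-function closure, timed with process.hrtime().
function greet(name) {
  return 'hello ' + this.prefix + name;
}
const ctx = { prefix: 'dear ' };
const ITERATIONS = 1e6;

function measure(label, fn) {
  const start = process.hrtime();
  for (let i = 0; i < ITERATIONS; i++) {
    fn('world');
  }
  const [s, ns] = process.hrtime(start);
  console.log(`${label}: ${(s * 1e3 + ns / 1e6).toFixed(1)} ms`);
}

measure('bind', greet.bind(ctx));
measure('closure', (name) => greet.call(ctx, name));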
Dependencies
The Node.js runtime provides a complete IO library, but the ECMAScript standard offers only a handful of built-in data structures, forcing us to rely on third-party libraries for basic tasks. There is no guarantee that these third-party libraries are correct and efficient; even popular, highly-starred modules can have problems. The Node.js ecosystem is so prolific that many dependencies contain only a few methods you could easily implement yourself, so we must weigh the cost of reinventing the wheel against the performance risk that comes with a dependency. Our team tries to avoid introducing new dependencies and is conservative about all of them, but we do welcome libraries like bluebird, which publishes its own solid performance benchmarks. Our project uses async to handle asynchronous operations, and async.series(), async.waterfall() and async.whilst() are used extensively in the code base. It is hard to single out a library with that many layers of indirection as the culprit, but other developers have identified performance problems in it. There are alternative libraries such as neo-async, which is measurably faster and publishes its performance measurements.
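To illustrate the trade-off, a minimal hand-rolled replacement for the series pattern takes only a few lines (a sketch: unlike async.series(), it does not collect task results):

// Run callback-style tasks one after another.
function series(tasks, done) {
  let index = 0;
  function next(err) {
    if (err || index === tasks.length) {
      return done(err);
    }
    tasks[index++](next);
  }
  next();
}

// Usage: each task receives a callback to call when it finishes.
series([
  (cb) => setTimeout(cb, 10),
  (cb) => setTimeout(cb, 10),
], (err) => console.log(err ? 'failed' : 'all tasks completed'));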
Conclusion
Some of the optimization tips in this article are common sense, while others relate to the implementation details and inner workings of the Node.js ecosystem and the JavaScript core engine. In the client driver we develop, we achieved a two-fold increase in throughput by introducing these optimizations. Considering that our Node.js applications run on a single thread, how an application spends its CPU time slices and orders its instructions has a large effect on overall throughput and on achieving high parallelism.
About the author
Jorge Bay is the core engineer for the Node.js and C# client drivers for Apache Cassandra and for DataStax Enterprise (DSE) at DataStax. Jorge has over 15 years of professional software development experience, and the Node.js client driver he implemented for Apache Cassandra is also the basis for DataStax's official driver.