# Introduction

Node.js is used more and more widely in BFF layers that separate front end and back end, in full-stack development, in client-side tooling and in other fields. However, compared with the vigorous development of the application layer, its runtime remains a black box for most front-end developers and has not improved much in this respect, which hinders the adoption and promotion of Node.js in business systems.

I currently work on the AliNode team at Alibaba. Over the past few years I have personally handled many online failures, and the most troublesome and hardest to solve fall into two categories:

  • Memory leaks
  • CPU stuck at 100% for a long time (infinite-loop-like hangs)

For the first category, when memory grows slowly and eventually leads to OOM, we have plenty of time to capture a heap snapshot and analyze it to locate the leak (see the previous article “Node Crime Scene Revealed – Quick Memory Leak Location”).

For the second category, there has been no good way to handle cases such as a while loop whose exit condition never becomes true, a process that appears hung because of a long-running regular expression match, or an application that OOMs within a very short time because of abnormal requests. This article introduces a Core dump based approach to analyzing and locating such online application faults.

## Background

Investigating these stubborn problems essentially boils down to two steps: first preserve the scene, then analyze it with tools.

### How to preserve the scene?

A Core dump is a debugging aid provided by the operating system. It contains key information such as the process's memory, the program counter and the stack pointers, and it is very helpful for diagnosing and debugging programs, especially for errors that are hard to reproduce.

There are two ways to generate a Coredump file:

  • I. When the application crashes and terminates unexpectedly, the operating system records a core dump automatically.

This method is commonly used for “postmortem” analysis, for example of an OOM triggered by a memory avalanche.

You can run ulimit -c unlimited to lift the kernel's limit on core file size, and you can also start the Node application with the --abort-on-uncaught-exception flag so that uncaught exceptions trigger an automatic Core dump.

Note that this is not an entirely safe thing to enable: online deployments usually run under daemon tools with automatic restart, such as pm2, which means that if the program crashes and restarts frequently it will generate a large number of Coredump files and may even fill up the server's disk. So if you turn this on, be sure to set up disk monitoring and alerting on the server.

  • II. Generate one manually by calling gcore <pid>.

This method is commonly used for “live inspection”, to locate problems while a Node.js process is in a suspended-animation (hung) state. A minimal shell sketch of both methods is shown below.
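As a rough sketch, the two approaches look like this on Linux; the entry file app.js and the pid are placeholders:

```bash
# Method I: let the OS record a core dump when the process crashes
ulimit -c unlimited                           # lift the core-file size limit in this shell
node --abort-on-uncaught-exception app.js     # uncaught exceptions now abort and dump core

# Method II: snapshot a live (possibly hung) process without killing it
gcore <pid>                                   # writes core.<pid> to the current directory
```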

Once you have the Coredump file for the process in question, all that remains is how to analyze it.

### How to analyze?

We usually use tools such as MDB, GDB and LLDB to diagnose the cause of a crash, but these tools can only recover the stack information at the C/C++ level. Here is a simple example:

```js
// crash.js
const article = { title: "Node.js", content: "Hello, Node.js" };

setTimeout(() => {
  console.log(article.b.c);
}, 1000);
```

With the ulimit setting enabled, run node --abort-on-uncaught-exception crash.js; when the process crashes you will find a core.<pid> file in the current directory (on macOS it usually ends up under /cores/). Then we try to analyze it with LLDB:

```bash
$ lldb node -c core.<pid>
```

LLDB prints that the core file was loaded. Then we execute bt to backtrace the stack at the time of the crash, as shown in the figure below:


Those familiar with the Node.js source code will recognize node::Start -> uv_run -> uv__run_timers as the path from startup into libuv, and can infer that the crash was caused by a timer executing in the JS layer. This is consistent with the code in question; however, given the complexity of real production code, the C++ stack alone cannot tell us where in the JS code the crash was triggered.
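For reference, the LLDB built-ins used at this step are nothing exotic (output omitted; the figure above shows what they print for this core file):

```
(lldb) bt                # backtrace of the currently selected thread (C/C++ frames only)
(lldb) thread list       # enumerate threads if the crash did not happen on thread 1
(lldb) thread select 2   # switch to another thread before running bt again
```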

Fortunately, we have llnode, a plugin for LLDB that can restore JS stack frames and V8 heap objects with the help of the API exposed by LLDB.

```bash
$ npm install llnode -g
$ llnode node -c core.<pid>
```

Then v8 bt is executed to backtrace, and the result is shown below:


Cool~ now the JS stack is visible too, and with this more complete information we can quickly locate the offending function.
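Besides v8 bt, llnode exposes a few other subcommands inside LLDB that are often handy when digging through a core file (a non-exhaustive list; the names come from the llnode documentation):

```
(lldb) v8 bt                      # mixed JS / C++ backtrace of the selected thread
(lldb) v8 findjsobjects           # group JS heap objects by type, with counts and sizes
(lldb) v8 findjsinstances <Type>  # list the instances of one type
(lldb) v8 inspect <address>       # print the fields of a single JS object
```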

At this point it may seem we could end here: with llnode + Coredump we appear to have solved the stubborn problems mentioned above.

In practice, however, using the raw llnode client directly has some problems:

  • Troublesome installation and configuration
    • llnode depends on the LLDB version; some Coredump files cannot be resolved into correct stack information under lldb 3.x/4.x.
    • The llnode plugin itself is a native addon, and installing and compiling it locally requires a newer GCC (with C++11 support) and a working Python environment.
    • There are problems restoring UTF-8 encoded Chinese characters: some characters come out garbled or abnormally truncated, and the reported string sizes are wrong, which hampers troubleshooting.
  • Analysis is not automated or intelligent enough
    • The default thread is not necessarily the JS thread, in which case you have to look through the threads one by one to find it.
    • Reading the mixed JS and C++ stacks on the command line, interspersed with unknown stub frames, is quite hard on the eyes.

## Coredump analysis service

The Coredump analysis service is a free service provided by the Node.js performance platform (AliNode). We have done further development and deep customization on top of the open-source llnode to solve the problems above, lower the barrier to entry for developers, and assist online fault analysis and location more intelligently.

The following two real fault cases show how it works in practice.

### Preparation

First, a brief introduction to uploading the Coredump file and the Node executable for analysis. After enabling the service, visit the console home page and open the Files option of any application. (If you have not created an application yet, follow the procedure for creating one first.)

By default the file list shows performance analysis files such as heap snapshots, CPU profiles and GC traces; a Coredump file list has now been added. Move the mouse pointer to the Files button in the left tab to see the new entry:


Select Coredump files to enter the Coredump file list. Click the Upload file option shown in the figure above and, following the prompts in the dialog, upload the corresponding Coredump file and Node executable to the server.

The uploaded Node executable is typically renamed in the form <os info>-<node version>.node, for example centos7-alinode-v4.2.2.node. Finally, the Coredump file and the Node executable must correspond one to one. By one-to-one correspondence I mean that the Coredump file was generated by a process started from that very Node executable; without this correspondence the analysis results are usually meaningless.

### CPU at 100% for a long time (infinite-loop-like)

This example is built with Egg.js. Let's start with the Egg controller code:

```js
'use strict';

const Controller = require('egg').Controller;

class RegexpController extends Controller {
  async long() {
    const { ctx } = this;
    // The original sample built a long string of <br/> tags separated by wide runs of
    // whitespace plus some itinerary text; the whitespace runs are abbreviated here.
    let str = '<br/>                                   ' + '                         ';
    str += '<br/>              <br/>' + '            <br/> ' + '           <br/>';
    str += '           <br/>' + '            ' + '           ' + '        <br/>';
    str += '        <br/>          <br/>';
    str += 'According to the schedule, proceed to Siem Reap airport and return to China.<br/>';
    str += 'For airmail service, add 280/order.<br/>';
    // Strip leading/trailing <br/> tags; on input like the above, the nested
    // quantifiers make this regular expression backtrack catastrophically.
    const r = str.replace(/(^(\s*?<br[\s\/]*?>\s*?)+|(\s*?<br[\s\/]*?>\s*?)+?$)/igm, '');
    ctx.body = r;
  }
}

module.exports = RegexpController;
```

Regular expression matching is a fairly common operation in Node.js applications, and the strings being matched often come from users or other internal interfaces. This means the input is not under our control: if an abnormal input triggers catastrophic backtracking in the regular expression, the match can be expected to take years or even decades to finish. Under the single-main-thread worker model, either case leaves the Node.js application in a state of suspended animation: the process is still alive, but no new requests are processed.
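To make catastrophic backtracking concrete outside the Egg example, here is a minimal stand-alone snippet; the pattern and input are the textbook demonstration case, not taken from the fault above:

```js
// catastrophic.js -- classic textbook case of catastrophic backtracking.
// The nested quantifier (a+)+ lets the engine split a run of 'a's in
// exponentially many ways before it concludes that the '!' can never match.
const evil = /^(a+)+$/;
const input = 'a'.repeat(28) + '!';

console.time('match');
console.log(evil.test(input)); // false -- but only after a long wait
console.timeEnd('match');
// Each additional 'a' roughly doubles the running time; a few more characters
// are enough to hang a single-threaded Node.js process indefinitely.
```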

In the Egg code above we simulate such a backtracking state: as soon as the controller is accessed, the Node.js server freezes. At this point we use gcore, as in the previous section, to generate a Coredump file for the current process, rename the generated Coredump file and the corresponding Node executable, and upload them. After the upload succeeds, the following interface is displayed:


Click the Analyze button to start the analysis; when it completes, the result looks like this:


In llnode, thread 1 is not necessarily the main JS thread, so we filter out the thread that contains JS stack information and highlight it in red at the top for convenience.


It is easy to see that the process is currently performing a replace operation; both the exact content of the regular expression and the location of the function performing the replace are clear. Even though the string being matched is truncated in this view, it is easy to get the full picture: hover over the “…” and click the string to expand it:


As you can see, the string content here is exactly the same as the string that triggered the long backtracking in our simulated code. We also provide a relatively lightweight solution to this kind of persistent 100% CPU problem; for details, see “The Node.js performance platform supports infinite loops and regular attack locating”.

### Heap memory avalanches in a short time

A memory avalanche within a short period is also relatively hard to troubleshoot, because alerting always has some lag: in many cases, by the time we receive the alert the process has already hit OOM and restarted, leaving no time to capture a heap snapshot. Again, let's start with the Egg controller code:

```js
'use strict';
const Controller = require('egg').Controller;
const fs = require('fs');
const path = require('path');
const util = require('util');
const readFile = util.promisify(fs.readFile);

class DatabaseError extends Error {
  constructor(message, stack, sql) {
    super();
    this.name = 'SequelizeDatabaseError';
    this.message = message;
    this.stack = stack;
    this.sql = sql;
  }
}

class MemoryController extends Controller {
  async oom() {
    const { ctx } = this;
    let bigErrorMessage = await readFile(path.join(__dirname, 'resource/error.txt'));
    bigErrorMessage = bigErrorMessage.toString();
    const error = new DatabaseError(bigErrorMessage, bigErrorMessage, bigErrorMessage);
    ctx.logger.error(error);
    ctx.body = { ok: false };
  }
}

module.exports = MemoryController;
```

This example also comes from a real online fault. When we use egg-logger to write logs, we often log objects without limiting the size of their properties, as in this example. In egg-logger, when an Error instance is passed as the first argument, the inspect method of the Node.js core util module is called to format it. The problem is that if the error object has properties containing very large strings, inspect blows up the heap and triggers an OOM; that is exactly what happens here when a huge string is placed in resource/error.txt. A minimal sketch of the pattern follows below.
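As an aside, a stand-alone sketch of this logging pattern looks roughly like the following; the string size and heap behavior here are illustrative assumptions, not the production numbers:

```js
// inspect-blowup-sketch.js -- illustrative only
const util = require('util');

// An error carrying a huge string on one of its properties,
// just like the DatabaseError above (size chosen for illustration).
const big = 'x'.repeat(50 * 1024 * 1024); // ~50 MB of characters
const err = new Error('query failed');
err.sql = big;

// util.inspect walks every own enumerable property and builds one large
// formatted string; older Node.js versions do not truncate long strings by
// default, so formatting can multiply the memory needed and, with a
// constrained heap (--max-old-space-size), push the process into OOM.
const formatted = util.inspect(err);
console.log('formatted length:', formatted.length);
```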

If we have enabled the ulimit setting described in the Background section, the system automatically generates a Coredump file when heap memory avalanches. As before, we upload the Coredump file and the corresponding Node executable to the performance platform and click the Analyze button. After the analysis completes, click the JS main thread marked in red and you can see the following information:


The circled JS call stack matches exactly the process described above. Better still, we can check the size of the string that caused the problem: hover over the string to see its size:


It is the inspect of this 186 MB string that caused the OOM. In the real online failure, the concatenated SQL statement was about 120 MB; it first caused the database operation to fail, and the DatabaseError instance created for that failure carried the offending SQL statement intact in its properties, so the heap avalanched as soon as ctx.logger.error(error) was called.

## Some thoughts

I have been doing Node.js-related development since 2014, from writing business code to building infrastructure, and I now focus on the lower layers of the runtime. Although I was not among the very first developers in China to pick up Node.js, I have watched this young technology grow while being questioned constantly along the way: “unstable”, “a toy”, “a scripting tool”, “can't handle large projects” are common voices in the community.

In my opinion, for the developers themselves:

  • First, we should avoid the self-limiting mindset of holding a hammer and looking for nails; every technology has scenarios where it fits and scenarios where it does not.
  • Second, the prosperity of the Node.js community and its ecosystem has proved its appeal as a server-side technology, and at the level of underlying performance, stability and troubleshooting there are now many ways to do things that used to be very hard, so we should have more confidence in the technology itself.

And along the way, I have been accompanied by @PiaoLing.

Currently we have only gone from 0 to 10, and there are many, many gaps left for us to fill. If you are interested in this, please come and join us.

Special thanks

This article was first published on Node.js. Many thanks to @days pigs.

## References

  • Core dump – Wikipedia
  • github.com/nodejs/llno…
  • The Node.js performance platform supports infinite loops and regular attack locating