I hope I can help you

The cause of

One night, the director told me about a machine that was so heavily loaded that it was out of place in a crowd of servers.It seems to be hovering around 20%, but there is very little business traffic on this machine, not only that, the FEX00 CPU load used for the business disaster recovery on this machine is also not low, around 13%.

Foreshadowing the

Behavior of the Node module

In Node, modules fall into two categories:

  1. The modules provided by Node are compiled binaries, and some core modules are directly loaded into memory, which can be called “core modules”.
  2. User-written modules can be called file modules.

All modules introduced by Node are cached, and the cached objects are compiled and executed. No matter the core module or the file module, require adopts the cache first mode for the second load of the same module.

// lib/internal/modules/cjs/loader.js
const module = new Module(filename, parent);
Module._cache[filename] = module;
Copy the code

Refers cached references to the require object in the makeRequireFunction.

// lib/internal/modules/cjs/helpers.js
function makeRequireFunction(mod, redirects) {
  require.cache = Module._cache;
}
Copy the code

So that’s why Node needs to be rebooted every time the code is updated. Is there a way to hot reload?

overloading

For those of you who are used to the modern JS stack, there is a function that automatically refreshes or restarts the code when it is updated.

This feature saves time and operational costs during the development phase to the point where it becomes a standard part of the development environment.

Older services deployed on the server also have this capability. The simplest way to do this is to listen for changes in the code file and remove the cache of the module.


// Old framework code
function onChg(prev, now) {
    if (prev.mtime == now.mtime) return;
    delete require.cache[file];
}
Copy the code

But there are some hidden problems:


// lib/internal/modules/cjs/loader.js
function updateChildren(parent, child, scan) {
  const children = parent && parent.children;
  if(children && ! (scan && children.includes(child))) children.push(child); }Copy the code

If parent is present, since there is no module. _cache, require again pushes its own references into the parent.children array, so there is a risk of leakage.

To assume that

The system starts to locate problems.

Because services on this server are mixed, htop to see which process is occupying resources. Then use the lsof command to find the path of the process, determine which project it is, and view the log.

However, there is no log for this old project.

I checked the output of the process again

tail -f /proc/<PID>/fd/1
Copy the code

Emmm, it doesn’t even have output.

Let’s look at system calls

strace -p <PID>
Copy the code
#The output
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x24a7f60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x24a7fa4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x24a7fa0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
Copy the code

Tons of FUtex, but I can not follow this train of thought trace down.

Did not think what can be online without stopping the process of the way to locate the problem, so began to look at the source code, because the framework of the service is the internal research and development of the older generation of revolutionaries, the company is separated after the document.

Although the previous upgrade from Node 0.12 to 10.14 made the core module feel like a dead end, the other tool modules didn’t go too far.

When I focused on the previously unexplored code, I found a few lines of strange code.


// For ease of understanding, modify the source code based on
if (ENVIRONMENT==='dev') {
  var watcher = require('./lib/watchNode.js')
  var absDir = __dirname.replace(/\\/g.'/')
  watcher.takeCare([APP_PATH])
}
Copy the code

The development environment listens for file changes and then restarts.

Then find the value of the online ENVIRONMENT, people silly, unexpectedly is dev.

Suddenly, it must be the online service listening for file changes. However, there should be no new file generated under this path. If you don’t understand, keep going down.

Verify the hypothesis

Since the traffic of this service is very small, it is presumed that it has nothing to do with the caller, so we simply build a test environment and start debugging.

Fortunately, once built, the test environment can reproduce the bugs on the line.

First came the idea of using PERF.

node --perf_basic_prof index.js
Copy the code

To generate a/tmp/perf-<PID>.mapFile, and then use the tool to generate a fire map.Once again touching the intellectual blind spot.

Try using a Tick Processor instead.

node --prof index.js
Copy the code

After the command is executed, a file isolate-0x3c9c4d0-v8- .log is generated

node --prof-process isolate-0x3c9c4d0-v8-<PID>.log
Copy the code
#The output[JavaScript]: Ticks total nonlib name 72 2.6% 2.7% LazyCompile: *resolve path.js:1073:28 20 0.7% 0.7% Builtin: Builtin: StringAdd_CheckNone_NotTenured 13 0.5% 0.5% LazyCompile: *<anonymous> fs.js:153:18 [C++]: Ticks total nonlib name 287 10.5% 10.6% __pthread_cond_signal 265 9.7% 9.8% uv_timer_stop 230 8.4% 8.5% __lll_unlock_wake 156 5.7% 5.7% __lll_lock_wait 123 4.5% 4.5% epoll_pwait 120 4.4% 4.4% __GI___libc_malloc 92 3.4% 3.4% __pthread_mutex_lock 82 3.0% 3.0% node: : contextify: : ContextifyScript: : the New (v8: : FunctionCallbackInfo < Value > v8: : const &) 79 2.9% 2.9% Uv_TIMer_start 53 1.9% 2.0% _INT_FREE 52 1.9% 1.9% Timer_CB 52 1.9% 1.9% Node ::StatWatcher::New(v8::FunctionCallbackInfo<v8::Value> const&) 49 1.8% 1.8% __GI___libc_free 46 1.7% 1.7% _int_malloc 37 1.4% 1.4% node::fs::NewFSReqWrap(v8::FunctionCallbackInfo<v8::Value> const&) [Bottom up (heavy) profile]:  Note: percentage shows a share of a particular caller in the total amount of its parent calls. Callers occupying less than Ticks parent name 287 10.5% __pthread_cond_signal 80 27.9% LazyCompile: * (anonymous > / home/work/hornbill/nest/server/lib/watchNode js: 80 100.0% Builtin spake: 80 100.0% LazyCompile ArrayForEach: ~ < anonymous > / home/work/hornbill/nest/server/lib/watchNode js: they 80 and 100.0% LazyCompile: ~<anonymous> fs.js:140:18 79 27.5% LazyCompile: * (anonymous > / home/work/hornbill/nest/server/lib/watchNode js: because 79 100.0% LazyCompile: * < anonymous > fs. Js: 153-18Copy the code

The watchnode. js file flooded the screen.

ForEach = forEach = forEach = forEach = forEach = forEach = forEach = forEach = forEach = forEach = forEach

I ran a little script, and I was stunned. Over 50,000 files were being listened to, and there were 24 processes doing the same thing.

Using fs.watch() is more efficient than fs.watchFile and fs.unwatchFile. fs.watch should be used instead of fs.watchFile and fs.unwatchFile when possible.

Although the framework uses fs.watchfile to listen for file changes, and the Node documentation says to use fs.watch as much as possible, what does fs.watchfile do that causes the system to make a lot of futex calls?

Books to time square hate less

fs.watchFile

In the Node source code, In turn along the lib/fs. Js – lib/internal/fs/watcher. Js – > SRC/node_stat_watcher. Cc – deps/uv/SRC/fs – poll. Find uv_fs_poll_start c, Then there is the poll_cb callback in uv_fs_stat.

/ * comment knew * /
/* Reschedule timer, subtract the delay from doing the stat(). */
  interval = ctx->interval;
  interval -= (uv_now(ctx->loop) - ctx->start_time) % interval;

  if (uv_timer_start(&ctx->timer_handle, timer_cb, interval, 0))
    abort(a);Copy the code

The good guy is a timed polling.

fs.watch

This is a little easier to find than the last API

In turn along the lib/fs. Js – lib/internal/fs/watcher. Js – > SRC/fs_event_wrap uv_fs_event_init in cc.

A lot of files have function definitions, Node documentation, look for inotify for Linux, the operating system you use.

Scrolling through the Node source code took some time, but it wasn’t a waste of time. As you can see from the above analysis, since the framework uses the Fs.watchfile API, this explains why there are a lot of fsstat calls in the flame chart: probably a timer that constantly checks the file’s status.

There is a lot of FUtex, but it is probably because of the number of FSstat calls.

conclusion

The cause of the crash was a very elementary mistake: using the wrong configuration file. The bug ran for five or six years without anyone noticing it.

Fortunately, our infrastructure team continues to improve the infrastructure and bring hidden problems to light. Not only that, but for the past two years, the front end group has been replacing old projects with new technology stacks, taking note and putting a lot of effort into engineering links such as documentation, monitoring, and release deployment processes.

The problem is solved, but there is still a huge amount of knowledge to learn, so keep trying.

Refer to the link

Linuxtools – RST. Readthedocs. IO/zh_CN/lates…

www.zhangyunling.com/448.html

Tech.nitoyon.com/en/blog/201…

nodejs.org/api/fs.html

Linuxtools – RST. Readthedocs. IO/zh_CN/lates…

Groups.google.com/g/nodejs/c/…

Stackoverflow.com/questions/5…

Fex.baidu.com/blog/2015/0…

Github.com/Kimi-Gao/Pr…

zhuanlan.zhihu.com/p/34702356

Segmentfault.com/a/119000001…

www.spaghettiml.com/en/2019/01/…

Blog.csdn.net/wuheshi/art…

www.cnblogs.com/sunsky303/p…

Stackoverflow.com/questions/9…

Github.com/nswbmw/node…