My manager recently assigned me a task: a system-level OOM occurred when an online customer deployed their application, which triggered the OOM Killer to kill the application. My job was to get to the bottom of it.

I investigated, analyzed, and worked through the problem along the following lines.

1. What is a system-level OOM (out-of-memory)?

When a process is created, the operating system sets up a virtual address space for it (4 GB on a 32-bit system). This virtual address space is not the same as physical memory: only when the process actually accesses those addresses does the operating system allocate physical pages and create the mappings. There is plenty of material on virtual versus physical memory, so I will not repeat it here; this article keeps things at an intuitive level.

With virtual memory, an operating system can run many processes simultaneously even when their combined virtual memory far exceeds the system's physical memory plus swap space. But if those processes keep touching their virtual addresses, the operating system must back them with physical pages, and at some critical point physical memory and swap are exhausted: an OOM occurs.
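On Linux, the gap between memory that has been promised to processes and memory the system can actually back is visible directly in /proc/meminfo. A quick, Linux-specific illustration (field values of course vary per machine):

```shell
#!/bin/sh
# Compare physical memory (MemTotal), the kernel's commit ceiling
# (CommitLimit), and memory already promised to processes (Committed_AS).
# Committed_AS exceeding CommitLimit is exactly the overcommit situation
# that can end in an OOM once processes touch their pages.
grep -E 'MemTotal|CommitLimit|Committed_AS' /proc/meminfo
```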

2. What happens when a system-level OOM occurs?

When an OOM occurs, the operating system has two options: 1) restart the system; or 2) kill a specific process, chosen by policy, and reclaim its memory. The second strategy is clearly less disruptive, and since our online systems use it, only the second strategy is discussed here.

  • The second behavior is also known as the OOM Killer. Which process does the system kill to free memory? The document Selecting a Process describes the Linux kernel's selection algorithm, which computes a starting score for each process:

    badness_for_task = total_vm_for_task / (sqrt(cpu_time_in_seconds) * sqrt(sqrt(cpu_time_in_minutes)))

    Here total_vm_for_task is the total virtual memory used by the process and cpu_time_in_seconds is its accumulated CPU running time. The formula favors processes that occupy a lot of memory but have been running only a short time.
  • If the process runs as root or with superuser privileges, the score above is divided by 4;
  • If the process has direct hardware access (i.e., it is a hardware driver), the score is again divided by 4.
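As a worked example of this heuristic, here is the same formula evaluated with made-up numbers: a hypothetical process with total_vm of 1048576 (1 GB in KB units, purely illustrative) that has accumulated one hour of CPU time:

```shell
#!/bin/sh
# Worked example of the old badness heuristic, with hypothetical inputs.
awk 'BEGIN {
  total_vm = 1048576            # assumed total_vm of the task (illustrative)
  cpu_sec  = 3600               # 1 hour of CPU time, in seconds
  cpu_min  = 60                 # the same hour, in minutes
  badness  = total_vm / (sqrt(cpu_sec) * sqrt(sqrt(cpu_min)))
  printf "badness = %.0f\n", badness
  # a root / superuser process gets its score divided by 4
  printf "badness (root) = %.0f\n", badness / 4
}'
# prints:
# badness = 6279
# badness (root) = 1570
```

Notice how the denominator grows with CPU time: the same 1 GB process with only one minute of CPU time would score far higher, which is exactly the "large memory, short runtime" preference described above.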

The documentation is not complete, though. The Linux kernel's oom_killer code (analyzed in this article) also factors in, beyond the above, the memory of child processes, the nice value, oom_adj, and more.

The operating system calculates the score for each process and records it in the /proc/[pid]/oom_score file. When OOM occurs, the operating system kills the process with the highest score.
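These per-process scores can be inspected directly. A small, Linux-specific sketch that ranks the five processes the kernel currently considers the "worst" OOM victims:

```shell
#!/bin/sh
# Rank processes by their current oom_score (higher = more likely to be
# killed on OOM). Processes may exit mid-scan, hence the error handling.
for p in /proc/[0-9]*; do
  score=$(cat "$p/oom_score" 2>/dev/null) || continue
  name=$(cat "$p/comm" 2>/dev/null)
  [ -n "$score" ] && printf '%s\t%s\t%s\n' "$score" "${p#/proc/}" "$name"
done | sort -rn | head -5
```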

3. How do I implement an OOM alarm?

An OOM alarm can be generated in two ways:

  • Before the event: the alarm fires when an OOM is about to occur.
  • During/after the event: the alarm fires once the OOM Killer has already run.

An advance alarm would be ideal, but in practice it is extremely hard to achieve without false positives or misses. Our online applications are Java applications. Consider this scenario: a customer application keeps allocating memory, and the system's physical memory usage reaches 90%. What will the system and the application do next? In my view there are three possibilities:

1) the Java application stops allocating and garbage collection frees memory, so the system returns to normal;
2) the application keeps allocating until it exceeds the heap size while the system still has physical memory left, producing a Java-level OOM (heap overflow);
3) the application keeps allocating and exhausts the system's physical memory without exceeding the maximum heap size, triggering an operating-system OOM.

In a scenario like this, accurately predicting the next behavior of the system and the application is extremely difficult.

On the other hand, we already have an online alarm based on machine memory usage, which covers three possibilities: 1) the application has a problem that will cause neither a heap overflow nor a system OOM; 2) the application may cause a heap overflow; 3) the application may cause a system OOM. Whatever the actual situation turns out to be, this alarm is useful.

A during/after-event alarm is also worthwhile, for two reasons: 1) it can be implemented with no false positives and no misses; 2) for an application that is about to hit OOM anyway, there is little practical difference between an alarm during the event and one shortly beforehand. Moreover, customers have so far complained that their applications died without any notice, which wastes both the customer's time and our R&D troubleshooting time.

All things considered, if we can detect the abnormal states of a Java application and provide during/after-event alarms plus material for on-site analysis, that is already very valuable.

4. What are the exception states of Java applications?

The Java application exception states defined here are:

  • The Java application was killed by the user (kill or kill -9).
  • The Java application suffered a heap overflow.
  • The Java application was killed by the OOM Killer (which sends SIGKILL, i.e. the equivalent of kill -9).

5. How do we detect these abnormal states of a Java application?

First, a Java heap overflow can be made to generate a dump via the -XX:+HeapDumpOnOutOfMemoryError JVM flag, and we can discover whether a heap overflow has occurred by polling for the dump file (an event-based notification would be better, of course, and is worth researching).
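A minimal sketch of the polling idea: assume the JVM was started with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath pointing at a known directory, so a watcher only needs to look for new .hprof files there. The temp directory and the simulated dump file below are stand-ins for illustration:

```shell
#!/bin/sh
# Sketch of polling-based heap dump detection. The dump directory stands
# in for whatever -XX:HeapDumpPath points at; the touched file simulates
# a dump the JVM would write on OutOfMemoryError.
DUMP_DIR=$(mktemp -d)
touch "$DUMP_DIR/java_pid12345.hprof"   # simulated heap dump

# One polling pass: count .hprof files that have appeared
count=$(find "$DUMP_DIR" -name '*.hprof' | wc -l)
echo "heap dumps found: $count"
# on Linux this prints: heap dumps found: 1
```

A real watcher would run such a pass on a timer (or use inotify) and raise the alarm when the count goes up.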

So, now the question is how do we find a Java application killed by the user or killed by OOM?


5.1 ShutdownHook / sun.misc.Signal

Experienced readers will immediately think of registering a shutdown hook to detect system signals. A registered shutdown hook does run on SIGTERM (the default signal of a plain kill, e.g. kill PID), but it cannot detect SIGKILL (kill -9). Our investigation also found that sun.misc.Signal.handle can be used to catch system signals, but unfortunately SIGKILL cannot be caught that way either.

5.2 strace

This tool is powerful: it can trace all of a process's system calls and signals (including SIGKILL), and it is built in, easy to use, and produces readable output. Here is one of my experiments (process 24063 is a Java process that triggers an OOM):

 

The downside, however, is that strace has a significant performance impact on the traced application. Normally a system call (such as open, read, write, close) involves a single context switch from user mode to kernel mode; under strace, each system call triggers multiple additional context switches, as shown below:

(See this article for more information.)

Still, this gives us a workable option: despite the significant performance impact, it can be offered to customers as a debugging tool.

5.3 ftrace + System Logs

ftrace is a tool built into Linux (see the appendix for debugfs mount status). It helps developers understand the runtime behavior of the Linux kernel for troubleshooting or performance analysis. Importantly, its impact on application performance is minimal, and since we only need to watch kill events, the effect on customer applications is almost nil (see Section 6 for the performance tests). In our scenario it supports listening for kernel events, including SIGKILL signals sent to processes. ftrace is very easy to use; refer to this documentation, or use this script from GitHub. Here is a screenshot of the script running:

In the figure above, I ran kill 29265, which produced SIGNAL 15, and kill -9 29428, which produced SIGNAL 9. The problem with this tool, however, is that when a Java process is killed by the system-level OOM Killer, it did not detect the corresponding signal (this needs further investigation).
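For reference, the kernel signal tracepoints that such a script relies on can also be enabled by hand through the tracing filesystem. This is only a sketch: it assumes root privileges and debugfs mounted at /sys/kernel/debug (see the appendix):

```shell
# Sketch: enable the kernel's signal tracepoints via ftrace.
# Assumes root and debugfs mounted at /sys/kernel/debug.
cd /sys/kernel/debug/tracing
echo 1 > events/signal/signal_generate/enable   # which process sends which signal
echo 1 > events/signal/signal_deliver/enable    # which process receives it
cat trace_pipe    # stream matching events; look for sig=9 (SIGKILL) records
```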

In addition, when the OOM Killer is triggered, the system log (/var/log/messages on CentOS) records specific messages like the following:

5.4 Auditd + System Log

(Using the system log to find OOM information is the same as above and will not be repeated; this section mainly introduces auditd.)

A colleague suggested we try auditd. We investigated it and found that it meets our requirements, with less performance impact in testing than ftrace (see Section 6 for the performance analysis). auditd is part of the Linux Auditing System: it receives events that occur in the kernel (system calls, file accesses) and writes them to a log for later analysis.

Here is the framework of the Linux audit system:

Among them:

  • On the left is our application;
  • In the middle is the Linux kernel, which contains an audit module. It can record three types of events: 1) User: events generated by user-space applications; 2) Task: task-type events (such as forking a child process); 3) Exit: events logged when a system call returns. These can be combined with the Exclude rule to filter events, which are finally sent to the auditd daemon in user space.
  • On the right are the user-space applications, with auditd as the core daemon. It receives events generated in the kernel and records them in audit.log, which can be queried with ausearch or aureport. On startup, auditd reads auditd.conf to configure the daemon's behavior (such as the log file location), and reads the event rules in audit.rules to control event monitoring and filtering in the kernel. We can also adjust the kernel's event listening and filtering rules at runtime with auditctl.

For more information, do your own search or check out this article.

The auditd daemon is also started by default on CentOS (>= 6.8). Let's test the tool. First, execute the following command:

auditctl -a always,exit -F arch=b64 -S kill -k test_kill

This command logs an event whenever the kill system call returns, and tags it with the key test_kill (so the log can be filtered later, e.g. with ausearch -k test_kill). Then we can run an arbitrary script and kill it, after which the following output appears in /var/log/audit/audit.log:

The first record (SYSCALL) describes the process that sent the SIGKILL signal; the second (OBJ_PID) describes the process that received it.

5.5 Shell + dmesg

If we can control the Java application’s startup script, then this is the least disruptive option. Take a look at the following shell script:

```shell
#!/bin/bash
java -Xms4g -Xmx4g Main
ret=$?
## exit codes > 127 mean the process received a signal
if [ $ret -gt 127 ]; then
    sig=$((ret - 128))
    echo "Got SIGNAL $sig"
    if [ $sig -eq $(kill -l SIGKILL) ]; then
        echo "process was killed with SIGKILL"
        dmesg > $HOME/dmesg-kill.log
    fi
fi
```

This script does these things:

  1. Start the Java application with java -Xms4g -Xmx4g Main;
  2. When the Java application exits, read its exit status code from $?;
  3. If the exit code is greater than 127, the application exited because it received a signal; if that signal was SIGKILL, collect the kernel ring buffer via dmesg.
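The "128 plus signal number" convention the script relies on can be checked with a tiny self-contained experiment, in which a child process simply SIGKILLs itself:

```shell
#!/bin/bash
# Demonstrate the 128 + signal-number exit status convention:
# a child process is killed with SIGKILL (9) and the parent
# reads the resulting exit status.
sh -c 'kill -9 $$'   # the child sends SIGKILL to itself
ret=$?
sig=$((ret - 128))
echo "exit=$ret signal=$sig"
# prints: exit=137 signal=9
```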

If the application quits because it was killed by OOM Killer, the following information will be displayed in dmesg-kill.log:

The advantage of this scheme is that it is the least intrusive; the drawback is that it yields less information about the kill than auditd. We only learn which signal the process received, whereas auditd can also tell us which process, user, and group the signal came from.

6. Performance testing

6.1 Test Environment

| Item | Value |
| --- | --- |
| Test machine | ecs.n1.medium |
| CPU | 2 vCPU |
| Processor model | Intel Xeon E5-2680v3 |
| Processor frequency | 2.5 GHz |
| Memory | 4 GB |
| System image | CentOS 7.4 (64-bit) |

6.2 Test Script

6.2.1 Test 1: Impact on System Call Performance

Test method:

Read 500 bytes at a time from /dev/zero and write them to /dev/null, looping 100 million (100M) times:

```shell
dd if=/dev/zero of=/dev/null bs=500 count=100M
```

This generates about 200 million system calls (100 million reads and 100 million writes).

Test results:

| Test target | Total time (s) | Average time per syscall (μs) |
| --- | --- | --- |
| No event listening | 41.7 | 0.2085 |
| auditd | 47.1 | 0.2355 |
| ftrace | 77.3 | 0.3865 |
| strace | > 3600 | > 18 |

6.2.2 Test 2: Impact on Java Application Performance

Test method:

Construct a consumer and a provider application. The consumer makes HSF calls to the provider, which returns predefined data. The call loops 1 million times, and we observe the consumer's total time.

Test results:

| Test target | Total time (s) | Average time per call (ms) |
| --- | --- | --- |
| No event listening | 492 | 0.492 |
| auditd enabled on both consumer and provider | 484 | 0.482 |
| ftrace enabled on both consumer and provider | 493 | 0.493 |
| strace enabled on both consumer and provider | > 3600 | > 3.6 |

7. Summary

To sum up, we can solve the OOM application problem of customers by the following means:

  1. Use the existing machine-memory-usage alarm to notify the customer in advance;
  2. Add -XX:+HeapDumpOnOutOfMemoryError to the JVM startup parameters to help collect heap overflow information;
  3. Collect OOM Killer information from the system log (/var/log/messages) or dmesg;
  4. Use the startup shell script (see Section 5.5), auditd (see Section 5.4), or ftrace (see Section 5.3) to learn that the application was killed (possibly by the customer themselves);
  5. [Optional] Enable strace to help customers debug problems.

8. Other tools

8.1 trap

The trap command specifies an action to take when a signal is received, and is usually used to clean up when a script is interrupted. When the shell receives the signal sigspec, it reads and executes the arg argument (a command). Here I try to intercept the SIGTERM and SIGKILL signals of the current script:

```shell
#!/bin/bash
sighdl () {
    echo "signal caught"
    # do something
    exit 0
}

trap sighdl SIGKILL SIGTERM

### main script
X=0
while :
do
    echo "X=$X"
    X=`expr ${X} + 1`
    sleep 1
done
```

trap can detect the SIGTERM signal of the current process, but not SIGKILL. It is the shell counterpart of a Java application's shutdown hook or sun.misc.Signal handler.

9. Appendix

9.1 debugfs Mount Status (required by ftrace) by System

| Operating system | System version | debugfs mounted by default | Note |
| --- | --- | --- | --- |
| CentOS | 7.4 (64-bit) | Yes | |
| CentOS | 7.3 (64-bit) | Yes | |
| CentOS | 7.2 (64-bit) | Yes | |
| CentOS | 6.9 (64-bit) | No | mount -t debugfs nodev /sys/kernel/debug |
| CentOS | 6.8 (64-bit) | No | mount -t debugfs nodev /sys/kernel/debug |
| CentOS | 6.8 (32-bit) | No | mount -t debugfs nodev /sys/kernel/debug |
| Aliyun Linux | 17.1 (64-bit) | Yes | |
| Ubuntu | 16.04 (64-bit) | Yes | |
| Ubuntu | 16.04 (32-bit) | Yes | |
| Ubuntu | 14.04 (64-bit) | Yes | |
| Ubuntu | 14.04 (32-bit) | Yes | |
| Debian | 9.2 (64-bit) | Yes | |
| Debian | 8.9 (64-bit) | Yes | |
| SUSE Linux | Enterprise Server 12 SP2 (64-bit) | Yes | |
| SUSE Linux | Enterprise Server 11 SP2 (64-bit) | Yes | |
| OpenSUSE | 42.3 (64-bit) | Yes | |
| CoreOS | 1465.8.0 (64-bit) | Yes | |
| FreeBSD | 11.1 (64-bit) | No | |