The problem background

Since the company’s test environment was configured for dual-machine load balancing, JVM application processes on both machines often quit automatically for no reason, up to a dozen or twenty times a day

The troubleshooting process

Start by examining the most common JVM memory overflow problems

I. The process exits on its own

1. Did an OOM inside the JVM itself crash the process?

The application logs show the program running normally right before each exit. There are no exceptions that would crash the program, let alone obvious errors such as heap OOM, method-area OOM, or stack overflow

So this can be ruled out

2. Did an OOM make the JVM exit directly, before it could log anything?

Further speculation: did the program hit an OOM and exit immediately, without time to print any OOM information?

Reproducing an OOM locally and consulting the relevant references confirmed:

Not every OOM crashes the process outright

Moreover, an OOM exit is an active behavior of the JVM: before the process dies, it can save the relevant exception information and then exit

The JVM in the test environment was also configured to write a heap dump when the process exited on OOM, but no dump files were found after any of the exits

So the possibility that the JVM itself exited actively because of an OOM can basically be ruled out
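
For context, a typical heap-dump-on-OOM setup looks like the sketch below. The post does not show the actual startup command, so the dump path, jar name, and exact flag set here are assumptions; only -Xmx6144m comes from the text.

```bash
# Assumed startup options (standard HotSpot flags). If the JVM itself had
# exited because of an OOM, a .hprof file would appear under HeapDumpPath --
# none was found on the test servers.
java -Xmx6144m \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/data/dumps \
     -jar app.jar
```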

II. The process is killed by the OS

Since the JVM did not exit voluntarily, it must have been killed by the OS

So with JVM application-level failures out of the way, it’s time to start thinking at the system level

Troubleshooting at this level covers CPU, network, memory, and disk

First, consider the most likely memory problems

1. Are resources available to the JVM limited by the OS?

Running the ulimit -a command to view the resources processes are allowed to use shows that memory is unlimited

Checking the limits file under /proc/{pid} for the JVM process tells the same story

So OS-imposed limits on the process's resource usage can basically be ruled out as well
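
The checks boil down to something like the sketch below; the way the JVM's pid is looked up here is just one possible option.

```bash
# Per-shell resource limits ("max memory size", "virtual memory", etc.).
ulimit -a

# Effective limits of the running JVM process itself.
JVM_PID=$(pgrep -n java)        # assumes a single java process of interest
cat /proc/${JVM_PID}/limits
```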

2. Does the OS preserve the scene of the accident when a process crashes?

Consider further:

Since the JVM can save heap dumps and other error messages before crashing on its own

Then the OS should also be able to keep a copy of the information before killing the process

You can analyze this information to determine the cause of a process crash

Sure enough, consulting the references confirmed it:

The OS can indeed be configured to save information about a process to a core file before killing it

But despite a long time spent trying, no core file ever came out…
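
For reference, enabling core files usually involves steps like the sketch below (the core path is an arbitrary example). In hindsight, the kernel OOM killer sends SIGKILL, which never produces a core dump, which likely explains why nothing showed up.

```bash
# Typical way to enable core files for crashed processes (run as root where
# needed). Note: a process killed with SIGKILL -- which is what the kernel
# OOM killer sends -- does not produce a core dump at all.
ulimit -c unlimited                        # remove the core-file size limit
cat /proc/sys/kernel/core_pattern          # where/how core files are written
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern   # e.g. /tmp/core.java.<pid>
```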

3. Is the process killed because the system is short of memory?

Since the OS has no limits on the resources available to processes, is it possible that OS resources are not enough in the first place?

Time to bring out the free and top commands

Oh…

The memory available to the program is only 7.4 GB, and a quick look shows -Xmx was set all the way to 6144M!

Well, it felt like the truth was about to come out

As we all know, the JVM's memory is not just the heap (its largest region); there are also thread stacks, the method area, and so on

With the heap alone allowed to run up to 6 GB, plus the rest of the JVM's memory, the OS has to fight the JVM for what is left for its own buffers and caches
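
A rough sketch of how the mismatch shows up on the box; the commands are standard, the figures in the comments are just the numbers quoted above, not captured output, and the pid lookup assumes a single java process.

```bash
free -h          # total vs. used memory on the box (only ~7.4 GB in total here)
top              # press Shift+M inside top to sort processes by memory usage

# Confirm the configured max heap; this prints -XX:MaxHeapSize=6442450944,
# i.e. 6144m.
JVM_PID=$(pgrep -n java)
jinfo -flag MaxHeapSize ${JVM_PID}
```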

III. Final speculation and verification

In general, the OS allocates memory to a process page by page: it does not hand over everything the process might need at once, but only as much as the process actually uses
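
This lazy, page-by-page allocation can be seen by comparing the JVM's reserved virtual memory with its resident memory; a minimal sketch (pid lookup assumed as before):

```bash
# VSZ (virtual size) includes the whole reserved heap up front; RSS (resident
# set size) only grows as the heap actually fills with objects.
JVM_PID=$(pgrep -n java)
ps -o pid,vsz,rss,comm -p ${JVM_PID}
```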

So the presumed process crash looks like this:

As objects keep being created, heap usage grows and grows. Because the heap limit is as high as 6 GB, Eden is not yet full and no GC has been triggered, but system memory is already close to exhausted and there is no free memory left to give to the heap

The OS can only kill the process that consumes the most memory, and who would that be if not the JVM…

Analyzing the JVM's GC logs shows that for almost every crash, the last GC before it left the total heap footprint at 5 GB+, which is consistent with the prediction above
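
The GC-log setup and check look roughly like the sketch below; Java 8 flag syntax is assumed, and the log path and jar name are made up for illustration.

```bash
# Assumed GC logging options (Java 8 style; on Java 9+ use -Xlog:gc*:file=...).
java -Xmx6144m \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/data/logs/gc.log \
     -jar app.jar

# Heap occupancy in the last few GC records before a crash (5 GB+ each time).
tail -n 40 /data/logs/gc.log
```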

Further digging also turned up the dmesg command (dmesg -T | grep "(java)") for checking which processes have been killed by the OS:

Segmentfault.com/q/101000000…

Blog.csdn.net/ispringmw/a…
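
The check itself is a one-liner; typical OOM-killer entries in the kernel log mention the victim process by name, along the rough lines shown in the comments below.

```bash
# Kernel log entries left by the OOM killer, with human-readable timestamps.
# Matching lines look roughly like:
#   Out of memory: Kill process <pid> (java) score <n> or sacrifice child
#   Killed process <pid> (java) total-vm:... anon-rss:...
dmesg -T | grep -i "killed process"
dmesg -T | grep "(java)"
```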

Well, there’s no running

The records showed Java processes being killed a dozen or twenty times a day, matching the number of JVM process exits on the test servers

The reason for each kill was also recorded as Out of memory

So: the JVM process was using too much memory, driving the system itself into OOM

The truth comes out

Problem solving

The solution

Once the problem is known, the fix is simple: just turn down the heap memory

The heap was reduced to 4 GB, and the JVM processes were then observed continuously, day after day
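
In concrete terms the change is just the heap flag; the exact startup command is assumed here, with other options unchanged.

```bash
# Before: -Xmx6144m on a ~7.4 GB box left too little for everything else.
# After:  a 4 GB cap leaves headroom for stacks, metaspace, and the OS itself.
java -Xmx4g -jar app.jar
```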

Closing the loop with follow-up observation

The JDK's jstat tool is used to observe the JVM's object-creation rate; the YGC count, frequency, and duration; how many objects are promoted to the old generation after each YGC; the FGC count, frequency, and duration; and the total GC time

GC logs are also used to examine the details of each collection
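
A typical way to run this kind of watch with jstat is sketched below; the one-second sampling interval and the pid lookup are assumptions.

```bash
# GC utilisation and counts every 1000 ms:
#   S0/S1/E/O/M = survivor/eden/old/metaspace usage %, YGC/YGCT = young GC
#   count/time, FGC/FGCT = full GC count/time, GCT = total GC time.
JVM_PID=$(pgrep -n java)
jstat -gcutil ${JVM_PID} 1000

# Raw region sizes (in KB), useful for watching promotion into the old
# generation after each young GC.
jstat -gc ${JVM_PID} 1000
```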

After that, the process exits on the test servers were largely eliminated

The dmesg -T | grep "(java)" check basically no longer showed the JVM process being killed by the OS

Remaining issues

To be honest, the JVM does still get killed occasionally due to OOM

Since the heap was turned down, processes are no longer killed a dozen or twenty times a day as before, but it still happens every few days

So the root cause is clearly still the system's memory; the problem has not been completely solved, but it now has almost no impact on development and testing work

Presumably the system itself has other programs, or some other memory use, that the JVM ends up crowding out

The specific cause involves deeper OS knowledge and will take more experience to pin down

I'll dig into this further when I have the energy; for now, tired…

Conclusion

In fact, the problem itself is not complicated: it is simply the system running out of memory

Let’s just say it took quite a bit of effort to sort it out

Fortunately, it worked out in the end

  • The overall process is:

    • First, rule out an active exit caused by the JVM's own OOM

    • Then rule out OS limits on the resources available to the process

    • Finally, conclude that the system ran out of memory, causing the JVM process to be forcibly killed

  • The knowledge involved is:

    • JVM runtime data area

    • The principle of GC

    • JVM memory overflow

    • GC log analysis skills

    • JDK troubleshooting tools such as jinfo and jstat

    • OS-level limits on process resources

    • How the OS allocates memory to processes

    • How and why the OS kills processes

    • System-level (OS) OOM

    And so on…

Troubleshooting is a process of reasoning, speculating, and verifying based on the symptoms

It’s a fun and rewarding way to apply what you’ve learned and learn something new