Introduction:

Hello everyone, I am an enthusiastic Chaoyang citizen.

Recently, in some open source community groups, I saw a student urgently asking for help because OOM had appeared in their online environment. The student first posted a screenshot of the log system output, with the OOM exception information clearly visible in the stack trace, and said, "The service has been restarted several times today. There may be a memory leak, but I can't find where it is. Where should I start troubleshooting?" You may also have run into this during development: badly written code causes a memory leak, and in the end the whole service goes down with OOM. Of course, OOM is a general term; it can occur in four different memory areas of the JVM, as I mentioned in my earlier article "[Internal Strength Training Series 2] Give the JVM a Pulse". Students who are not familiar with this can read it carefully.

OOM is a common problem for Java developers and a must-ask topic in interviews. Personally, I started fighting it many years ago, and I have seen online services go down with OOM due to multi-threaded processing, improper references held by collection classes, recursive loops, and other scenarios. I remember back in the monolithic-application days, a colleague added business-processing logic while writing logs through Log4j. The appender part was not handled properly, and as the number of users grew, objects in the logging thread were never collected, causing a memory leak. At first we kept tweaking the Tomcat parameters to increase the JVM's startup memory. For a while, every time OOM appeared we would restart and add memory, but there were always times when that didn't help. In the end we made changes at the code level, found the root cause of the memory leak, and cured it completely.

Today, let's take a look at the causes of OOM and some troubleshooting methods, to help you track down the "O" and solve the problem.

OOM Concepts

Let’s review the concept of OOM first.

OOM

Thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector.

In other words, this error is thrown when the JVM does not have enough memory to allocate space for an object and the garbage collector can no longer free up any more.

Memory leaks

Memory that is no longer needed is not released and cannot be reclaimed, so the JVM can never use that memory again: that is a memory leak. In short, "I'm not using it, but nobody else can have it either!"

Out of memory

An overflow occurs when a program requests more memory than the JVM can provide. In most cases, memory overflow is caused by memory leaks; in other cases, traffic is heavy and the service is neither scaled out nor rate-limited, so memory simply runs out. "I just can't take any more!"
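To make these two concepts concrete, here is a minimal sketch (the class and field names are made up for illustration, not taken from any real service): a static collection keeps references to objects that are never used again, which is the leak; keep it running long enough and the heap fills up, which is the overflow.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakToOverflowDemo {

    // The "leak": a static collection that holds references forever,
    // so nothing added to it can ever be garbage collected.
    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            // Each "request" caches 1 MB and never removes it.
            CACHE.add(new byte[1024 * 1024]);
            // Eventually the heap is exhausted and the JVM throws
            // java.lang.OutOfMemoryError: Java heap space -- the overflow.
        }
    }
}
```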

Everyday OOM

Because our business is healthcare, the traffic peak comes on Monday mornings. Long ago, before the service architecture was upgraded from a single monolithic application, developer skill levels were uneven and all the business logic was coupled together. Especially while the business was growing, OOM would appear from time to time, along with excessively high server CPU usage. Operations or development staff would stare at the server monitoring, watch the memory keep climbing until it could barely hold on, and then restart the service. Truly "artificial intelligence" operations and maintenance! At that time, the colleague who restarted the service would dump a memory snapshot and a thread snapshot and send them to the R&D staff for analysis. The operations staff really suffered during that period!


Later, as the business developed and the team expanded, we gradually moved to a microservice architecture. Before services could scale out automatically, some services still occasionally hit OOM. Of course, our monitoring system also set thresholds: for example, when a service exceeded 80% memory usage, the operations and R&D colleagues responsible for it would receive a batch of alerts by SMS, email, and IM. If a service did hit OOM, the process would hang there and a snapshot would be exported automatically. The O&M colleagues would send the snapshot to the R&D colleagues for analysis, and more often than not it turned out that a memory leak had caused the OOM.

Troubleshooting OOM

Next, let me walk through how we handle OOM in our current setup.

1. Add startup parameters to the service


This is the deployment.yml configuration file each of our services keeps on Git, where parameters are added so that when OOM occurs a snapshot file is exported to DUMP_PATH. Jenkins then executes the Jenkinsfile and an sh script inside the Docker image to start the service. Of course, this can also be set in a startup script; I'm sure every company does it differently. The important thing is to add the two launch parameters -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath, so that when OOM does happen, a snapshot from that moment is saved to help the R&D staff analyze it.

2. Check logs

When OOM occurs, the monitoring system sends a warning and we receive a flood of SMS and IM messages (some of them custom-defined alarms), so I won't include screenshots here. Operations and development then go to Kibana and query the relevant logs based on the alert message.


Here you switch to the filter for the relevant service and query its logs. Of course, the figure here is only a mock-up, so the OOM query returns no data. If the service had actually generated an OOM, the stack trace would show up in the real log query.

3. O&M positioning

When O&M receives an alert, they check the server. The first step is to locate the problematic service, using the usual server commands such as top, jps, and jinfo. After the relevant service is located, if online analysis is needed, the relevant developers are brought in to look at the stack, GC, thread state, and other related information, using jmap, jstat, and jstack. If OOM really has occurred, ops takes the snapshot and sends it to the developers for analysis.
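As a side note, besides taking a dump from the outside with jmap, a heap dump can also be triggered from inside the JVM, which is handy if you want to build the kind of automatic snapshot export mentioned above. Here is a minimal sketch using the HotSpot diagnostic MXBean (the class name and output path are just placeholders for illustration):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    public static void main(String[] args) throws Exception {
        // Obtain the HotSpot diagnostic MXBean from the platform MBean server.
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // Write a heap dump; live = true keeps only reachable objects,
        // which is usually what you want when hunting a leak.
        diagnostic.dumpHeap("/tmp/manual-dump.hprof", true);
    }
}
```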

4. Analysis of OOM

Simulating OOM

Because OOM hasn't appeared in our system recently and I have no live case at hand (the last one happened a long time ago), I will write a demo to walk through the troubleshooting steps and methods. As I mentioned earlier, OOM is sometimes caused not by code but by bad parameter settings, heavy traffic, and so on; today, though, the focus is on a memory leak in code leading to a memory overflow. Apart from the program counter, OOM can appear in the heap, the virtual machine stack, the method area, and the native method stack. Since we use MAT (Eclipse Memory Analyzer, which can be installed standalone without Eclipse) for our daily memory analysis, that is the tool I'll demonstrate with today.
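As a quick illustration that the heap is not the only area that can run out, here is a minimal sketch for the virtual machine stack (the class and method names are made up for the demo): unbounded recursion exhausts the stack and throws StackOverflowError, while creating threads in an endless loop can instead produce "java.lang.OutOfMemoryError: unable to create new native thread".

```java
public class StackDepthDemo {

    private static long depth = 0;

    private static void recurse() {
        depth++;
        recurse(); // no termination condition: the VM stack eventually runs out
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (Throwable t) {
            // With default settings this is usually StackOverflowError.
            System.out.println("Depth reached: " + depth + ", error: " + t.getClass().getName());
        }
    }
}
```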

Let's take a heap overflow as the example and go straight to the code:
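The original post showed the demo as a screenshot; here is a minimal sketch along the same lines (the class name and allocation size are illustrative, matching the description below):

```java
import java.util.ArrayList;
import java.util.List;

public class HeapOomDemo {

    public static void main(String[] args) {
        List<byte[]> bigObjects = new ArrayList<>();
        int count = 0;
        while (true) {
            // Brutally stuff another 10 MB object into the heap on every pass
            // and keep a reference to it so it can never be collected.
            bigObjects.add(new byte[10 * 1024 * 1024]);
            System.out.println("Allocated big object #" + (++count));
        }
    }
}
```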


So we simply do the "big object" thing: a brutal infinite loop that keeps stuffing 10 MB objects into the heap. For the run, set the memory with -Xms10M -Xmx50M, and configure a snapshot to be generated on OOM and saved to a specific path with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=&lt;specific path&gt;.

These are set as the VM options in the demo's run configuration.


Of course, if you are debugging locally, you can also monitor with JProfiler, which shows you the service's threads, memory, and so on in real time. But this requires the service to open a port, and collecting data in real time puts some pressure on the service. I remember that back in the monolithic days I also connected JProfiler directly to the service to troubleshoot problems.


Result: the program ran 8 iterations and then ran straight out of heap space. Because the parameters were set, the java_pid53518.hprof file was written out.


Troubleshooting approach

Here is the specific approach to investigating with MAT:

  1. Use the MAT overview to find the type consuming the most memory and get a rough idea of the cause of the leak;
  2. Look at the detailed object list and reference chain of that large type to locate the specific point of the leak;
  3. Inspect the objects' field values and dependencies to work out the program logic and parameters;
  4. Use the thread view to check whether the OOM problem is related to multi-threading;
  5. Locate the relevant code and fix it. Of course, it could end tragically with you not finding it...

Troubleshooting with MAT

  1. Start MAT:

On macOS, clicking the installed app to start it reports an error, so start it directly from the command line: /your path/MemoryAnalyzer -data ./dump


  2. Load the hprof file

When MAT opens, the main screen shows an Overview. You can see that at the time of the OOM the whole heap was 32.6 MB, and one set of objects occupied 32 MB. So the next step is to look at those objects.

  3. View the object list

We just saw that the program ran 8 iterations. There are exactly eight objects taking up 32 MB, which matches what the program is doing. So what are these objects? How many instances are there? And in which thread are they running?

Right-click the large object and choose List objects &gt; with incoming references.


  4. View the object reference graph

Retained Heap is the memory of the object itself plus all the objects it keeps alive, while Shallow Heap is the memory of the object itself. The ArrayList object itself is only 24 bytes, but the objects it retains take up 33 MB of memory. In other words, something must keep adding objects to the ArrayList, and that is what is causing the OOM. The view also shows the code inside main, the program's main thread. This is consistent with what our code itself is doing.

  5. View the threads

From the thread view, it is the ArrayList in the main thread that holds the large objects, leading to the OOM. The other threads show nothing abnormal, which again matches the code's logic.

  6. The last step is to remove the code that keeps stuffing the large objects into the collection (see the sketch below). Problem solved.
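The general shape of the fix, as a minimal sketch (the process method is a placeholder for real business logic): stop retaining references you no longer need, so each chunk becomes garbage-collectable after it has been used.

```java
public class HeapOomFixedDemo {

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            byte[] chunk = new byte[10 * 1024 * 1024]; // 10 MB, used and then dropped
            process(chunk);
            // No list.add(chunk): nothing keeps a reference,
            // so the chunk can be garbage collected on the next iteration.
        }
        System.out.println("Finished without OOM");
    }

    private static void process(byte[] chunk) {
        // Placeholder for real business logic.
        chunk[0] = 1;
    }
}
```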

Conclusion

This has been only a brief introduction to the OOM troubleshooting process; of course, some cases are not so simple, and online monitoring is a more complicated and troublesome matter. Personally, I think the most important things are doing code reviews well, monitoring the system, and logging the key steps. Service monitoring and governance is a necessary capability for today's microservice and cloud-native systems, and it also reflects the technical strength of a team. We are constantly improving our own ability to monitor and clear obstacles quickly. I hope this simple walkthrough gives you something useful. I will keep sharing and making progress together with everyone.

Previous articles:

Linear Data Structure (Part 1)


Linear Data Structure (Part 2)

[Internal Strength Training Series 2] Give the JVM a Pulse

[Internal Strength Training Series 2] The Difficult JIT

Special statement

This article is an original article. If you need to reprint it, please contact me and indicate the source. Thank you very much!