Understanding the JVM – How to troubleshoot memory region overflow problems
Preface
This article continues the hands-on content of the previous memory-region-overflow case study by adding several more OOM cases. It does not introduce new theory; everything is based on case practice, and I hope these cases help you better understand the common routines for troubleshooting JVM OOM overflows.
Overview
- How does Jetty's underlying mechanism cause a direct memory overflow? Where is the code design flaw?
- How should a thread "suspended animation" (apparent hang) be handled? What can cause high memory usage? This case demonstrates common analysis techniques.
- A simple example of how a queue can cause a JVM heap overflow, showing the importance of designing the queue data structure well.
Previous review:
Understanding the JVM in Depth – how do memory regions overflow?
Hands-on cases
First case: how did Jetty cause a direct memory overflow?
Crime scene:
This scenario is not one you see often, but it is a good case worth collecting. The special thing about this project is that it used a Jetty server instead of the more common Tomcat. Shortly after the project went live, an alert suddenly fired, and the content of the alarm was that the service could suddenly no longer be accessed.
There is no doubt that the first thing to do is check the online logs; a server that goes down is very likely an OOM. After checking the logs, I found the following content:
There is no doubt that this is a Direct buffer memory overflow. A search on the Internet revealed that direct memory, also known as out-of-heap memory, is not bound by the JVM’s heap limits and is managed directly by the native operating system. In the next section we will see why there is a direct memory overflow.
Preliminary investigation:
It's worth explaining what Jetty is: Jetty is also written in Java, so it can be deployed directly just like Tomcat, and it runs as a JVM process just like Tomcat does. Like Tomcat, Jetty listens on a port when it starts, and you can then send requests to it for MVC dispatching and other operations.
A direct memory overflow occurred, yet our code did not use any NIO or Netty APIs and did not allocate direct memory itself, so after repeated investigation the allocation could only be attributed to Jetty. We don't need to dwell on why Jetty uses direct memory; there is also a JVM design flaw involved here. Let's first look at how direct memory is understood and operated:
Those interested in Jetty's direct memory leaks can check out the Jetty wiki page "Jetty/Howto/Prevent Memory Leaks".
Another problem related to the JVM is running out of native memory. The symptom to watch for is that the process size keeps growing while heap usage remains relatively stable. Native memory can be consumed by many things; the JIT compiler is one, and NIO ByteBuffers are another. Sun bug #6210541 discusses a still-unresolved problem where the JVM itself, in some circumstances, allocates direct ByteBuffers that are never garbage collected, effectively consuming native memory. Guy Korland's blog discusses the issue. Because the JIT compiler is also a consumer of native memory, a shortage of native memory may show up as an OutOfMemoryError from the JIT, such as: Exception in thread "CompilerThread0" java.lang.OutOfMemoryError: requested xxx bytes for ChunkPool::allocate. Out of swap space?
By default, if the NIO SelectChannelConnector is configured, Jetty allocates and manages its own pool of direct ByteBuffers for IO. It also allocates MappedByteBuffers to memory-map static files, via DefaultServlet settings. However, if you disable either of these options, you may be exposed to this JVM ByteBuffer allocation problem. For example, on Windows you may have disabled memory-mapped buffers for the static file cache on DefaultServlet to avoid file-locking problems.
First, if we want to bypass the Java heap and use a chunk of the native operating system's memory outside the heap, we need a class called DirectByteBuffer. We can use this class to obtain direct memory: the object itself is created in the heap, but as soon as it is constructed, a matching block of memory outside the heap (from the operating system) is allocated and associated with it. You can picture the two pieces of memory as a level of indirection, like a treasure map: we mark the treasure on the map, but the treasure itself lies elsewhere. Isn't it interesting that we can use the treasure map to manipulate the treasure itself?
So you might be wondering: how does this off-heap block get released? When you stop referencing the DirectByteBuffer object it becomes garbage, and when the collector reclaims it, the corresponding "treasure" (the off-heap memory) is released as well.
The problem is that if more and more of these heap objects that map direct memory accumulate without being freed, the direct memory behind them runs out over time; since it cannot be reclaimed, the only possible outcome is an OOM exception!
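To make the "treasure map" relationship concrete, here is a minimal sketch (not from the project in this case; the class name is mine) showing that the object returned by ByteBuffer.allocateDirect is only a heap-side handle, while the bytes it reads and writes live outside the heap:

```java
import java.nio.ByteBuffer;

public class DirectBufferSketch {
    public static void main(String[] args) {
        // The ByteBuffer object itself lives on the JVM heap ("the treasure map"),
        // but the 1 MB it manages is allocated from native memory ("the treasure").
        ByteBuffer direct = ByteBuffer.allocateDirect(1024 * 1024);

        direct.putInt(42);                   // writes into the off-heap block
        direct.flip();
        System.out.println(direct.getInt()); // reads it back through the heap handle

        // The native block is only freed after this heap object becomes unreachable
        // and is reclaimed; if such handles keep surviving GC, direct memory piles up.
    }
}
```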
In general we might assume that a JVM causes a direct memory overflow by allocating too much direct memory all at once, but is that what happened on this system? As you can see from the crime scene above, clearly not. So what did happen in this case?
Case study:
Let's continue the analysis. Although the error was not caused by allocating a huge number of objects at once, the root cause was still a steadily growing backlog of direct memory. Since every block of direct memory is mapped by an object in the heap, as long as those heap objects are not reclaimed, the direct memory they map cannot be freed either. Following this line of thought, we found that the heap objects mapping direct memory survived young-generation collections without being reclaimed, and because the Survivor space could not hold them they were promoted straight into the old generation. Although these objects kept entering the old generation, there were too few of them to fill it, so no old-generation collection was ever triggered! Annoyingly, judging from the result, the problem was actually caused by objects being promoted directly into the old generation.
In short: the heap objects that map the off-heap memory were not reclaimed for a long time; after each YGC the newly constructed objects were promoted into the old generation; round after round, these references kept accumulating, yet the old generation never filled up enough to trigger a Full GC, which eventually led to the off-heap memory overflowing.
Doesn't NIO consider reclaiming direct memory at all?
Of course it does: in the JDK source, the java.nio.Bits class has a reserveMemory method that calls System.gc(). This may indeed trigger an old-generation collection, but it is a crude and unreliable fix! In a previous article we saw an example of Full GC becoming far too frequent because of repeated System.gc() calls, and once you disable explicit GC with a JVM argument (-XX:+DisableExplicitGC), this code becomes completely ineffective.
Here I can't help suspecting that whoever wrote this piece of code was being lazy; hopefully a later JDK will fix this in NIO.
```java
// These methods should be called whenever direct memory is allocated or
// freed. They allow the user to control the amount of direct memory
// which a process may access. All sizes are specified in bytes.
static void reserveMemory(long size, int cap) {
    // ...
    System.gc();
    // ...
}
```
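For illustration only, a tiny experiment (hypothetical, not from the article) shows this path in action: keep allocating direct buffers while holding references to them, and each allocation goes through the reserveMemory shown above; once System.gc() can no longer free any of the mapping objects, the JVM throws the same "Direct buffer memory" error seen at the crime scene. A small -XX:MaxDirectMemorySize makes it fail quickly:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with a small limit to see it fail fast, e.g.:
//   java -XX:MaxDirectMemorySize=16m DirectOomDemo
public class DirectOomDemo {
    public static void main(String[] args) {
        List<ByteBuffer> pinned = new ArrayList<>();
        while (true) {
            // Keeping every buffer referenced means none of the heap-side mapping
            // objects can be collected, so the off-heap memory behind them is never
            // released and the limit is eventually exceeded:
            //   java.lang.OutOfMemoryError: Direct buffer memory
            pinned.add(ByteBuffer.allocateDirect(1024 * 1024));
        }
    }
}
```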
The solution:
- Treat the root cause: since these are objects that survive only briefly in the young generation but cannot fit into the Survivor space, the natural fix is to enlarge the young generation and the Survivor regions (for example via -Xmn and -XX:SurvivorRatio), so that the heap objects mapping direct memory are reclaimed at the next YGC instead of being promoted; then direct memory no longer accumulates.
- Prefer the first approach for the sake of your system's health. System.gc() is a notorious method; it has not been marked deprecated only because so much existing code still calls it. We should avoid it entirely in our own development.
Second case: how should thread "suspended animation" be handled?
Crime scene:
This case involves a very ordinary Tomcat-based web system, but one day it was reported that the service was in "suspended animation", meaning the service was unavailable at that moment, similar to the downstream-service-down problem mentioned above. In this case, however, the service became reachable again after a while, which means the apparent hang lasted only for a period of time.
Preliminary investigation:
This requires a few Linux skills. Note that reading the logs will not help here, because this is not an OOM problem; instead, use the top command to see how much memory and CPU the process is using.
When an interface shows this kind of apparent hang, the following two lines of thought can generally be used for troubleshooting:
- This service may use a large amount of memory that cannot be freed, leading to frequent GC
- The CPU load may be too high: some process is monopolizing CPU resources (much like the lag you see in everyday use when the system CPU load spikes), so your service threads cannot get CPU time and therefore cannot respond to interface requests.
After the above investigation, we found that CPU consumption was very low, but memory usage was as high as 50%! This was a 4-core, 8 GB machine: the JVM could use roughly 4-6 GB of memory, with a heap of about 3-4 GB. So if the JVM process was occupying that much of the machine's memory, it meant the JVM was not merely using the memory the system had allocated to it; the system itself was running short of memory!
What happens when memory usage is high?
Let’s take a look at what happens to the JVM if the system’s memory usage is too high.
- High memory usage leads to frequent Full GC, and GC brings Stop-The-World pauses
- Memory usage is too high, eventually causing an OOM
- If the memory usage is too high, the operating system may kill the process directly because of insufficient memory.
So which of these points matches what we found in the investigation above?
Case analysis
According to the above content, let’s analyze the situation one by one:
First, the OOM scenario can be ruled out directly: this case is an apparent hang, and the log check found no OOM, so we need not consider it. That leaves the first and third points.
The first point is about frequent Full GC and Stop-The-World pauses. However, during the actual troubleshooting, each GC took only a few hundred milliseconds; so although GC ran frequently, it did not consume much time overall, which is normal.
Finally, we focus on the third point: the process being killed because of insufficient memory. A reminder is needed here: this system already had automated health checks and startup scripts, so whenever the service went down it would be restarted automatically. This fits the third point very well: the operating system killed the process, the scripts automatically restarted it, and everything looked normal again afterwards.
Who is taking up so much memory?
A separate section is needed here to explain why so much memory was occupied. I won't detail the full troubleshooting process; in short, analysis with tools such as MAT showed that one kind of object had been occupying memory for a long time with no way to be reclaimed: a custom class loader. Due to poor logic control, threads created large numbers of these class loaders to keep loading classes, and these objects eventually piled up in the old generation; Full GC found they were still referenced and could not reclaim them.
The solution:
The problem in this case is that class loaders were created over and over again and Full GC could not collect them, so the final solution was to modify the code to ensure that the class loader is not created repeatedly.
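The article does not show the offending code, so the following is only a hypothetical sketch of the anti-pattern and of the kind of fix described here (all class and method names are made up):

```java
import java.net.URL;
import java.net.URLClassLoader;

public class LoaderLeakSketch {
    // Anti-pattern: creating a fresh class loader for every task keeps piling up
    // loader instances (and every class they define) in the old generation as long
    // as anything still references the loaded classes.
    static Class<?> loadPerTask(URL pluginJar, String className) throws Exception {
        URLClassLoader loader = new URLClassLoader(new URL[]{pluginJar});
        return loader.loadClass(className);
    }

    // The kind of fix described in this case: create the loader once and reuse it,
    // so repeated loads do not accumulate new loader instances that Full GC cannot reclaim.
    private static volatile URLClassLoader SHARED_LOADER;

    static Class<?> loadShared(URL pluginJar, String className) throws Exception {
        if (SHARED_LOADER == null) {
            synchronized (LoaderLeakSketch.class) {
                if (SHARED_LOADER == null) {
                    SHARED_LOADER = new URLClassLoader(new URL[]{pluginJar});
                }
            }
        }
        return SHARED_LOADER.loadClass(className);
    }
}
```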
Third case: the heap overflow problem of a data synchronization system
Crime scene:
This case is a data synchronization system, responsible for synchronizing data from one system to another, using Kafka as the middleware for production and consumption.
What was the problem? The system kept running into OOM: after each OOM it had to be restarted manually, and soon after the restart the OOM appeared again. Meanwhile, the backlog of data in Kafka kept growing and the GC frequency kept rising, until the system finally broke down completely.
Preliminary investigation:
A JVM heap overflow can occur either when a great many objects are still alive after a Full GC, or when so many objects are created at once that the heap simply cannot hold them. Both situations are possible here.
In this case the overflow appeared only after the system had been running for a while: even when there was more data to handle, it did not overflow immediately after startup, although the frequency of OOM kept increasing. From this we can conclude that too many live objects were accumulating in memory: the data being processed was loaded into the heap but could not be released, leaving more and more surviving objects in the heap.
Case study:
We used jstat to analyze the case and found that objects were not reclaimed even after Full GC. Once the old generation reached 100%, new allocations could no longer be satisfied, and in the end the JVM could only throw an overflow and the process died.
Again, MAT’s tools are used for this analysis, which was illustrated in the case before this column, so I won’t go into too much detail here.
Next, let's analyze the root cause of this sudden leak. This time we need to pay attention to the Kafka message queue. I won't explain again what Kafka is; in this case its main job is to keep feeding data so that the producing and consuming sides can carry out the data synchronization.
The point is this: consumers can pull hundreds of records at a time, so to make consumption convenient, some developers packed the consumed content into Lists, with each List holding several hundred records. What happens when this structure is consumed? Here is a metaphor: think of products on a conveyor belt. Each worker on the line was originally supposed to handle one item from the belt at a time, but with this design it is as if more than a hundred products are hung onto every single item on the belt! The consumer simply cannot keep up, while the producer keeps pushing data, and in the end the only option left is to go on strike (OOM).
The solution:
This is a typical case of mismatched production and consumption rates. The solution is to use a bounded blocking queue, for example with a capacity of 1024: once the queue is full, the producer stops producing and blocks while waiting on the queue; as soon as space frees up, consumption resumes. This way the consumer can process data in time, and objects no longer pile up in the heap where they cannot be reclaimed.
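Here is a minimal sketch of the bounded hand-off described above; the capacity of 1024 comes from the case, while everything else (class names, the String record type) is hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedHandoffSketch {
    // A bounded queue of 1024 records: when it is full, put() blocks the producer
    // instead of letting unconsumed records pile up in the heap.
    private static final BlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(1024);

    // Producer side: called for each record pulled from the message queue.
    static void onRecord(String record) throws InterruptedException {
        QUEUE.put(record); // blocks when the queue is full -> natural back-pressure
    }

    // Consumer side: drains one record at a time and processes it.
    static void consumeLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String record = QUEUE.take(); // blocks when the queue is empty
            process(record);
        }
    }

    private static void process(String record) {
        // business logic for syncing one record (placeholder)
    }
}
```

With put() providing back-pressure, the producer slows down automatically whenever the consumer falls behind, which is exactly the behavior the fix in this case relies on.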
Conclusion
This article covered three different cases, analyzing their details from a system-analysis perspective. Online OOM problems can be strange and varied, so what I hope you take from these cases is an approach to solving problems rather than the cases themselves memorized: real incidents may stem from resource allocation, from the code, from system issues, or even from configuration parameters bringing the system down. There is no cure-all solution, but by studying these cases we build a certain theoretical foundation, so when problems appear in the future we can quickly relate them back to a known case and avoid the pitfalls.
Final words
This is the end of the hands-on case studies. Starting with the next installment of this column, I'll dive a little deeper into the basics of the JVM, drawing on what I've learned from the book. If you're having trouble with the JVM fundamentals, I'd recommend reading "Understanding the JVM in Depth, 3rd Edition" over the coming week.
Finally, let me recommend my personal WeChat public account: "Lazy Nest". If you have any questions, you can reach me via private message on the account; of course, questions in the comments will be answered as soon as I see them.
Historical article Review:
Note that "Youdao Cloud Notes" links are used here, for your collection and self-review:
Understanding the JVM in Depth – Phase Summary and Review (Part 2)
Understanding the JVM in Depth – Case Studies