This article starts with a BUG in ConcurrentLinkedQueue, a class in the JDK’s J.U.C package. The thread pool in the Jetty framework used this queue, which caused a memory leak.
Along the way, three visual monitoring tools, JConsole, VisualVM, and JMC, let you watch the “memory leak” happen. That’s interesting. Let’s take a look.
Let’s start with a BUG
Some time ago, I came across a BUG in the JDK that was a bit interesting.
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8137185
It is a memory leak.
What causes the memory leak?
This queue: ConcurrentLinkedQueue.
The same BUG was also reported in the Jetty project:
I took a look and found the Jetty report more interesting.
So I’ll follow the Jetty report, but it is the same JDK BUG either way. The address is as follows:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=477817
Let me translate, with my level eight-and-a-half broken English, what this fellow named Max said.
He says that incorrect use of ConcurrentLinkedQueue (abbreviated CLQ from here on) in Java projects can cause memory leaks.
The use of a CLQ in Jetty’s QueuedThreadPool causes slow memory growth and eventually a memory leak.
QueuedThreadPool only uses the add and remove methods of this queue. Unfortunately, the remove method does not shrink the underlying list; it only nulls out the item in the removed node. Therefore, the list grows without bound.
He then attached a program that demonstrates the problem.
We won’t look at his program here; the problem itself is demonstrated later in this article.
Let’s take a look at Jetty’s QueuedThreadPool thread pool.
Which version of Jetty?
As you can see, this BUG was reported on September 18, 2015. So, let’s just find a version before that date.
So I found a Maven version that was released on September 3, 2015:
QueuedThreadPool in this version looks like this:
As you can see, it does use a CLQ.
Looking at all the calls on this object, Jetty only uses the queue’s size, add, and remove(obj) methods:
This matches what Max described earlier.
Max then attached some pictures to support his argument:
Focus on the part I’ve framed, where he says he is attaching a picture. The memory leak can be seen in this picture, which comes from their real project.
The project has been running for about two days, with a Web request coming in every five minutes.
Here’s the picture:
From his picture, all I can see is that the CLQ holds a lot of nodes.
However, he did say that his project was under light load, so the Jetty framework it uses should not have created that many nodes.
OK, that covers the problem Max raised. Now an expert steps in to solve the puzzle:
Before looking at the answer, let’s look at the person who answered.
Greg Wilkins, who is he?
I found his LinkedIn page:
https://www.linkedin.com/in/gregwilkins/?originalSubdomain=au
The leader of the Jetty project. Those few words are enough to make you call him awesome.
High-end ingredients often need only the simplest cooking. High-end talent often needs only a few words of introduction.
An expert’s resume is just this plain, unadorned, and boring.
And look at that head of hair. Ah, I’m jealous. Once again it proves that “the balder, the stronger” does not apply to the gods abroad.
Well, let’s see how the Jetty project leader answered:
First, he was stupefied and shocked! And he opened with the interjection “Ouch”. As we often say:
Roughly, he said: wow, this not only causes a memory leak, it also makes the queue slower and slower over time. Truly shocking.
This problem is bound to affect servers that use a large number of threads. Hopefully not all servers are affected.
But whether or not every server is affected, when it does happen it is a very serious BUG for those servers.
Then he said “Great catch!” I read this as an exclamation, something like: nicely spotted.
It is not easy to translate, so here is an example to give you a feel for it:
I did not expect to end up teaching English in a technical article either.
Finally, he said he was working on a fix.
Then, seven minutes and 37 seconds later, Greg replied again:
Almost eight minutes later, he was still in shock. I suspect he had been shaking his head for the whole eight minutes.
He said: I’m still shaking my head at how this went unnoticed for so long. For Jetty, the fix is simple: using a set structure instead of a queue achieves the same effect.
Let’s take a look at QueuedThreadPool in Jetty after the fix. Here I’m using a package released on October 6, 2015, the closest release after the BUG was reported:
The corresponding code inside looks like this:
ConcurrentHashSet is used instead of CLQ.
Since this BUG was later fixed in the JDK, I was curious whether CLQ had been given a chance to come back.
So I took a look at the code in the latest version released this year:
It uses neither ConcurrentHashSet nor CLQ.
Instead, the newKeySet method of JDK 8’s ConcurrentHashMap takes center stage:
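As a minimal sketch of what that looks like in plain Java (the class, method, and variable names below are made up for illustration and are not Jetty’s code), ConcurrentHashMap.newKeySet() returns a thread-safe Set, and removing an element from it really drops the entry:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a thread-safe set used to track worker threads.
// The names here are hypothetical and are not taken from Jetty.
class KeySetSketch {

    // Adds and then removes `cycles` distinct threads; returns the final size of the set.
    static int addThenRemove(int cycles) {
        Set<Thread> threads = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < cycles; i++) {
            Thread t = new Thread(() -> { }); // never started, used only as a set element
            threads.add(t);     // register the thread
            threads.remove(t);  // deregister it; the entry is dropped from the backing map
        }
        return threads.size();
    }

    public static void main(String[] args) {
        System.out.println("size after add/remove cycles: " + addThenRemove(10_000));
    }
}
```

Unlike a queue, a set has no ordering to maintain, which is presumably why it is enough for a pool that only needs add, remove, and size.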
That is the evolution of one small Jetty thread pool. Congratulations, you have picked up another piece of knowledge you will almost never use.
Back to Greg’s reply: in it, he also provided a Demo of the fix, which I will walk through in the next section.
23 minutes later, he submitted the fix.
From his first reply to the post, through locating the problem, to committing the code: 30 minutes.
And then at 2:57 a.m. (do the experts never sleep?) Max replied:
I can’t believe the CLQ has a problem like this. They should at least explain it in the API documentation.
By “they” he means the JDK team, and Doug Lea in particular; after all, this is the master’s work.
Why is it not explained in the API documentation?
Because they didn’t know about it either.
Greg replied twice in a row and pointed directly at the fix:
The problem lies in the source code of the remove method, in the line labeled ① in the figure above.
That line unlinks the removed node (whose item has already been replaced with null) from the list, so that the GC can reclaim the node.
However, when the removed node has no successor, for example when there is only one element in the collection, the check next != null fails.
So the item of the node to be removed has been set to null, but the node is still linked into the queue, and the GC cannot reclaim it.
His solution is simple too; it is in the places labeled ② and ③. In short, make sure the code executes the pred.casNext method.
To sum up, the memory leak is caused by nodes whose item has been set to null: because of a flaw in the code, they will never be used again, yet they cannot be collected by the GC.
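To make the mechanism concrete, here is a single-threaded toy model of it. This is a sketch under stated assumptions, not the JDK source: the real queue is lock-free and uses CAS, all class and method names below are hypothetical, and the “repaired” variant is just one possible way to unlink dead nodes, not necessarily the patch that Greg or the JDK applied. It assumes one element stays in the queue while others are added and removed.

```java
// Single-threaded toy model of the unlink problem. This is NOT the JDK source:
// the real queue is lock-free and CAS-based; only the unlink condition is imitated.
class UnlinkSketch {

    static final class Node {
        Object item;
        Node next;
        Node(Object item) { this.item = item; }
    }

    private final Node head = new Node(null); // sentinel, never removed
    private Node tail = head;

    void add(Object o) {
        Node n = new Node(o);
        tail.next = n;
        tail = n;
    }

    // Leaky variant: a removed node is unlinked only when it has a successor.
    boolean removeLeaky(Object o) {
        Node pred = head;
        for (Node p = head.next; p != null; p = p.next) {
            if (p.item != null && o.equals(p.item)) {
                p.item = null;              // the item is cleared ...
                if (p.next != null) {
                    pred.next = p.next;     // ... but a last node is never unlinked
                }
                return true;
            }
            pred = p;
        }
        return false;
    }

    // Repaired variant (one possible approach): dead nodes met on the way are unlinked too.
    boolean removeUnlinking(Object o) {
        Node pred = head;
        for (Node p = head.next; p != null; ) {
            Node next = p.next;
            boolean matched = p.item != null && o.equals(p.item);
            if (matched) {
                p.item = null;
            }
            if (p.item == null && next != null) {
                pred.next = next;           // drop the dead node; pred stays where it is
            } else {
                pred = p;
            }
            if (matched) {
                return true;
            }
            p = next;
        }
        return false;
    }

    // Number of nodes still reachable from the sentinel, dead or alive.
    int nodeCount() {
        int n = 0;
        for (Node p = head.next; p != null; p = p.next) {
            n++;
        }
        return n;
    }

    // Keeps one permanent element, then adds and removes a fresh element `cycles` times.
    static int nodesAfter(int cycles, boolean leaky) {
        UnlinkSketch q = new UnlinkSketch();
        q.add("permanent");
        for (int i = 0; i < cycles; i++) {
            Object o = new Object();
            q.add(o);
            boolean removed = leaky ? q.removeLeaky(o) : q.removeUnlinking(o);
            if (!removed) {
                throw new IllegalStateException("element not found");
            }
        }
        return q.nodeCount();
    }

    public static void main(String[] args) {
        System.out.println("leaky:     " + nodesAfter(1_000, true) + " nodes");
        System.out.println("unlinking: " + nodesAfter(1_000, false) + " nodes");
    }
}
```

In this toy, the leaky variant leaves one extra dead node behind per cycle, which is also why every traversal takes longer than the previous one; the other variant stays at a small constant number of nodes.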
If the cause of this BUG is unclear to you, you are probably not familiar with the structure of the CLQ.
In that case I recommend the book The Art of Concurrent Programming in Java, which has a section devoted to this queue, clearly written and well illustrated.
The story of this BUG in Jetty is clear.
Then, let’s go back to the link for the JDK BUG:
The cause described there is the same one I explained above: the node is never unlinked, so it cannot be reclaimed.
He says the BUG is present in the latest JDK 7, 8, and 9 releases.
By “latest”, he means the latest at the time the BUG was first reported:
Running the Demo
In this section, let’s run the fix Demo that Greg gave and get a feel for what the BUG looks like.
https://bugs.eclipse.org/bugs/attachment.cgi?id=256704
You can open the link above and copy the code straight into IDEA:
Note line 13: Greg used ConcurrentHashSet there because his Demo shows the fix. Here CLQ is used instead, because we want to demonstrate the BUG.
This Demo calls the queue’s add(obj) and remove(obj) in an infinite loop. Every 10000 cycles it prints the elapsed time, the queue size, and the maximum, free, and total memory.
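For readers who cannot open the attachment, the loop described above can be sketched roughly as follows. This is not Greg’s attachment: the class name, the output format, and the bounded cycle count are my own assumptions, and the permanent first element is assumed from the fact that the reported queue size stays at 1.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// A bounded sketch of the kind of loop described above; it is not Greg's attachment.
class ClqLoopSketch {

    // Runs `cycles` add/remove pairs, reporting every `reportEvery` cycles.
    // Returns the final queue size.
    static int run(int cycles, int reportEvery) {
        Queue<Object> queue = new ConcurrentLinkedQueue<>();
        queue.add("permanent"); // assumption: one element stays in the queue throughout

        Runtime rt = Runtime.getRuntime();
        long last = System.nanoTime();
        for (int i = 1; i <= cycles; i++) {
            Object o = new Object();
            queue.add(o);
            queue.remove(o);
            if (i % reportEvery == 0) {
                long now = System.nanoTime();
                System.out.printf("duration=%dms size=%d max=%dMB free=%dMB total=%dMB%n",
                        (now - last) / 1_000_000, queue.size(),
                        rt.maxMemory() >> 20, rt.freeMemory() >> 20, rt.totalMemory() >> 20);
                last = now;
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        run(100_000, 10_000);
    }
}
```

On an affected JDK the duration column would be expected to creep upward from one report to the next; on a fixed JDK it should stay roughly flat.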
The result looks like this (JDK version 1.7.0_71):
As you can see, the duration printed for each batch keeps increasing, while the queue size is always 1.
The last three memory-related values can be ignored for now; we will use graphical tools in the next section.
Do you know how long I’ve been running this program by the time I write this article?
61 hours, 32 minutes and 53 seconds.
The latest 10000 cycles took 575615 ms, which is nearly 10 minutes:
That’s what Greg was talking about: not only does memory leak, the queue also gets slower and slower.
However, when I run the same program on JDK 1.8.0_212, it looks like this:
The time interval is stable and does not increase over time.
This version contains the fix for the BUG. Let me show you the source code:
In JDK 1.8.0_212, the remove(obj) method of CLQ has an “unlink” comment at the end of line 502.
The official fix can be found here:
http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/rev/8efe549f3c87
There are many changes, but the principle is the same as described above:
I only ran the sample code on two JDK versions.
No memory leak showed up on JDK 1.8.0_212; I checked the source code of the corresponding remove(obj) method, and it has indeed been fixed.
The memory leak can be seen on JDK 1.7.0_71.
Unlink: one simple word, with so many stories hidden behind it.
JConsole, VisualVM, JMC
Now that we’re talking about memory leaks, it is worth introducing some visual troubleshooting tools.
The program has been running for 61 hours. Let’s look at the heap memory usage during this period.
You can see a clear, slow upward trend in overall heap usage.
The diagram above is from JConsole.
Combined with what the program does, the picture points to a memory leak: this is a very classic memory-leak trend.
Next, let’s take a look at JMC monitoring:
The chart above shows the used heap size, and it follows the same trend as in JConsole.
Then look at the VisualVM chart:
In VisualVM I don’t know how to view the chart for the whole 60-plus hours of the run, but the chart above also shows an upward trend.
In VisualVM, we can dump the heap directly and then analyze it:
You can clearly see that CLQ nodes account for 94.2% of the size.
However, judging from our program, we never use that many nodes. We only ever use one.
If this is not a memory leak, what is?
Memory leaks will eventually result in OOM.
So when an OOM happens, we need to analyze whether there is a memory leak, that is, check whether the objects in memory should still be alive. If they should, it is not a memory leak; the program has simply run out of memory. In that case, check the JVM parameters (-Xmx / -Xms) and see whether they can be raised given the machine’s memory.
At the same time, check the code as well: are there objects with overly long life cycles, or data structures used improperly? The goal is to minimize the program’s memory consumption at run time.
By making the heap smaller, we can make the memory leak end in an OOM much sooner.
Use the previous test case, but set -Xmx to 20m, so the maximum available heap size is 20 MB.
Then run the code and monitor it with VisualVM, JConsole, and JMC. To have enough time to get the monitoring tools ready, I added a sleep at line 8. The rest of the code is the same as before:
Add the -Xmx20m argument:
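One way to double-check that the argument was picked up is to ask the JVM what it thinks its limit is. The small class below is only a sketch with a made-up name; Runtime.maxMemory() is a standard API, but the value it reports can differ slightly from the -Xmx setting depending on the JVM and the collector.

```java
// Prints the heap limit the running JVM reports, to check that -Xmx was picked up.
// The class name is hypothetical.
class HeapLimitSketch {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // With -Xmx20m this is expected to be close to 20; the exact figure depends on the JVM.
        System.out.println("max heap (MB): " + maxHeapMb());
    }
}
```

Note that the flag is case-sensitive: it is -Xmx20m, with a capital X.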
Once it is up and running, we can watch the memory change through the tools; from top to bottom they are VisualVM, JConsole, and JMC:
The trend in the pictures matches our earlier analysis: memory keeps growing.
An OOM occurs after the program has run for 19 minutes and 06 seconds:
So what would a normal chart look like?
On JDK 1.8.0_121 (with the remove method fixed), run with the same JVM argument (-Xmx20m):
First of all, the log above shows that the interval does not increase and the program runs very fast.
Then VisualVM is used to monitor the memory; this screenshot was taken after 19 minutes of running:
You can see that heap usage does not grow over time. But there are still very frequent GC operations.
This is easy to understand, since CLQ’s data structure is a linked list, and a linked list is made up of individual nodes.
After remove is called, the node becomes eligible for reclamation, so calling remove this frequently keeps triggering the JVM’s Minor GC.
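The same thing can be observed from inside the program, without a graphical tool. The sketch below (class and method names are hypothetical) uses the standard GarbageCollectorMXBean API to count collections while an add/remove loop runs on an otherwise empty queue; how many collections actually happen depends on the heap size and the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Counts garbage collections that happen while an add/remove loop is running.
class GcCountSketch {

    // Sums the collection counts of all collectors the JVM exposes.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 means the count is undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Runs add/remove pairs and returns how many collections happened in the meantime.
    static long collectionsDuring(int cycles) {
        Queue<Object> queue = new ConcurrentLinkedQueue<>();
        long before = totalCollections();
        for (int i = 0; i < cycles; i++) {
            Object o = new Object();
            queue.add(o);
            queue.remove(o); // short-lived objects: collectable once no longer reachable
        }
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        System.out.println("collections during the loop: " + collectionsDuring(5_000_000));
    }
}
```

Run with a small heap such as -Xmx20m, the count would be expected to be well above zero, which matches the sawtooth pattern a healthy heap chart usually shows.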
A memory leak caused by a JDK BUG like this can be quite devastating. The first time you feel it is when the application hits an OOM.
Your first reaction may be to increase the heap space without much thought. And since your application happens to go live about once a week, it gets restarted regularly.
Then there is no OOM for a long time, and you think maybe the problem has been solved.
But it is still there. Sooner or later the application goes about half a month without a release, for example the seven days of the National Day holiday plus a few days before and after. It is not restarted, it gets slower and slower, and finally the OOM appears a second time.
At this point, you start to think it might not be as simple as running out of memory.
Could it be a memory leak?
So you restart again. After this restart, you start dumping the memory from time to time for analysis.
Suddenly you notice: why are there so many nodes?
Finally, you find the cause of the problem.
It was a BUG in the JDK.
And, like Greg, you say: “Oh my God, shocking.”
My program, which has been running for over 60 hours, now uses 233 MB of heap, while the total heap size is close to 2 GB.
Putting the total heap size and the used heap size on the same chart shows that the road to an OOM is still long and hard:
By my rough calculation, the program will need to run for about 475 hours, roughly 19 days, before the memory leak produces an OOM.
I’ll keep it running as long as I can, but listening to the humming fan, I don’t know whether my computer will hold up.
If it holds, I’ll let you know in a later post.
Well, that’s the end of the graphical tools section.
We have only shown a very small part of what they can do; used properly, they often let you get twice the result with half the effort.
If you are not familiar with their capabilities, I suggest reading Understanding the Java Virtual Machine (3rd edition), which has a section devoted to these tools.
One last word (please follow)
This is a screenshot I took while writing the article last night. My girlfriend glanced at it and thought I was watching the market and studying a stock chart: what a strong stock.
If only the stock market as a whole were as simple and straightforward as a memory leak.
All you would have to do is cash out before the OOM. Unfortunately, some people jump in the moment before the OOM. What a sad story.
The two books mentioned in the article are both excellent and worth studying. As a Java programmer, I highly recommend buying them if you don’t already own them.
You won’t lose out and you won’t be cheated; you will only wish you had found them sooner. You’ll find that a great many JVM and multithreading interview questions come from these two books:
If you find a mistake, please leave a comment to point it out and I will fix it.
Thank you for reading. I insist on original content, and your attention is very welcome and much appreciated.
I am Why, a literary creator held back by code. I am not a big shot, but I like sharing. I am a warm and interesting nice guy from Sichuan.