I have been sitting on this post for a long time: the material was mostly ready, but I was too lazy to organize it until today, when I finally decided to write it up. As far as I know, this was the only Chinese-language introduction to bug JDK-8221393.

This is a bug in JDK11's ZGC. Once triggered, it causes the process's CPU usage to climb gradually; left unchecked, it eventually eats all available CPU and takes down service availability. This performance bug is still not fixed in the latest JDK11 code. Don't worry too much, though: the bug requires a fairly specific combination of conditions to trigger, which is probably one reason the JDK developers have not fixed it there. While reading the JDK13 source code, I also found that the bug has been fixed in JDK13, along with some related design improvements.

The bug shows up as in the chart above: CPU usage roughly doubles every 10 days with no change in business logic or code, and returns to normal after a restart. In the chart, we restarted on November 19 and November 29; each time the CPU dropped sharply, and then the cycle started over. This service matters a lot to us. Several colleagues had been hunting for the cause for a long time and tried many optimizations, but the problem was never solved, so the service had to be restarted on a schedule. It was cracked the day after I got involved, heh heh... (not because I am any better, but because they lacked the right weapon).

The troubleshooting process

As an engineer, faced with the symptoms above, what would you do? I assume your first reaction is that something is wrong with the business code: a memory leak somewhere? Some code path loading too much data and slowing down? My colleagues tried heap dumps and jstack thread dumps, saw nothing abnormal, and even optimized some unreasonable logic in the business code along the way, but the problem remained. The real issue was that they never looked in the direction of hot code, mainly because they did not know what tools were available.

At the time I happened to be following Alibaba's Arthas, which had just added the profiler command that can generate CPU flame graphs, and I had also just read Ruan Yifeng's article "How to read a flame graph?". So I recommended Arthas to my colleagues and suggested they generate a CPU flame graph to see where the CPU was actually being spent, and that is where I noticed an anomaly.

How to read a flame graph?

If you don't know how to read a flame graph, don't worry: it is actually quite simple. For details, see Ruan Yifeng's "How to read a flame graph?".

The Y-axis represents the call stack: each layer is a function, and the deeper the call stack, the taller the flame, with the currently executing function at the top and its parent functions below. The X-axis represents the number of samples: the wider a function is along the X-axis, the more samples it appeared in, i.e. the longer it ran. Note that the X-axis does not represent time; the call stacks are simply sorted alphabetically. When reading a flame graph, look at which functions on top occupy the most width. Any "flat top" (plateau) indicates a function that may have a performance problem.

You can simply assume that the wider a block is, the more CPU it consumes; the key to performance analysis with a flame graph is finding functions that are wide when you expect them to be narrow. In our graph the anomaly on the left is obvious: it has several "flat tops", meaning those functions are consuming too much CPU, and they are most likely the cause of the problem.

So I pulled the full call stack for that function and found something in our code that didn't make sense.

    [0x00007f2091c15000] java.lang.Thread.State: RUNNABLE
        at java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames(java.base@11/Native Method)
        at java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames(java.base@11/StackStreamFactory.java:386)
        at java.lang.StackStreamFactory$AbstractStackWalker.getNextBatch(java.base@11/StackStreamFactory.java:322)
        at java.lang.StackStreamFactory$AbstractStackWalker.peekFrame(java.base@11/StackStreamFactory.java:263)
        at java.lang.StackStreamFactory$AbstractStackWalker.hasNext(java.base@11/StackStreamFactory.java:351)
        at java.lang.StackStreamFactory$StackFrameTraverser.tryAdvance(java.base@11/StackStreamFactory.java:593)
        at java.util.stream.ReferencePipeline.forEachWithCancel(java.base@11/ReferencePipeline.java:127)
        at java.util.stream.AbstractPipeline.copyIntoWithCancel(java.base@11/AbstractPipeline.java:502)
        at java.util.stream.AbstractPipeline.copyInto(java.base@11/AbstractPipeline.java:488)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(java.base@11/AbstractPipeline.java:474)
        at java.util.stream.FindOps$FindOp.evaluateSequential(java.base@11/FindOps.java:150)
        at java.util.stream.AbstractPipeline.evaluate(java.base@11/AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.findFirst(java.base@11/ReferencePipeline.java:543)
        at org.apache.logging.log4j.util.StackLocator.lambda$getCallerClass$6(StackLocator.java:54)
        at org.apache.logging.log4j.util.StackLocator$$Lambda$94/0x00007f238e2cacb0.apply(Unknown Source)
        at java.lang.StackStreamFactory$StackFrameTraverser.consumeFrames(java.base@11/StackStreamFactory.java:534)
        at java.lang.StackStreamFactory$AbstractStackWalker.doStackWalk(java.base@11/StackStreamFactory.java:306)
        at java.lang.StackStreamFactory$AbstractStackWalker.callStackWalk(java.base@11/Native Method)
        at java.lang.StackStreamFactory$AbstractStackWalker.beginStackWalk(java.base@11/StackStreamFactory.java:370)
        at java.lang.StackStreamFactory$AbstractStackWalker.walk(java.base@11/StackStreamFactory.java:243)
        at java.lang.StackWalker.walk(java.base@11/StackWalker.java:498)
        at org.apache.logging.log4j.util.StackLocator.getCallerClass(StackLocator.java:53)
        at org.apache.logging.log4j.util.StackLocatorUtil.getCallerClass(StackLocatorUtil.java:61)
        at org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:43)
        at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:46)
        at org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:29)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
        at com.xxxxxxx.infra.cache.redis.ReadCommandTemplate.<init>(ReadCommandTemplate.java:21)
        at com.xxxxxxx.infra.cache.redis.RedisClient$4.<init>(RedisClient.java:138)
        at com.xxxxxxx.infra.cache.redis.RedisClient.hget(RedisClient.java:138)
        ...

ReadCommandTemplate is the Redis client wrapper we maintain, and our business logic instantiates a new ReadCommandTemplate for every Redis read or write, so the class generates a large number of instances. Because the logger field in ReadCommandTemplate is not declared static, every instance creates its own Logger, and something in that logger-creation path clearly has a performance problem.

    public abstract class ReadCommandTemplate<T> {
        private Logger logger = LoggerFactory.getLogger(ReadCommandTemplate.class);
        /** ... */
        // ...
    }

Experienced Java developers will spot the issue at a glance, but a non-static logger should only slightly hurt performance, nothing like the weirdness in the chart at the top! Indeed, this Redis client is widely used across our services and the other applications have no such problem... Could it be that this one application happens to meet every condition required to trigger some bug?

Bug confirmation

To confirm whether the non-static logger was the cause, there were two options: 1. fix the logger and watch whether CPU still climbs; 2. actually understand how the bug works. Since option 1 needed at least a week of observation to be conclusive, we did both in parallel: a colleague fixed the logger while I dug into the problem and read the JVM source. Option 1 later confirmed the logger was indeed to blame, and I also found the JVM team's mailing-list discussion from 2019, after someone had reported the bug to them.

JDK developer Stefan Karlsson confirmed and outlined the cause of this bug, as shown in the screenshot above. Roughly: when the JVM looks up a resolved method it calls ResolvedMethodTable::lookup(), which searches a hash table with only 1007 buckets. Because ZGC in JDK11 never cleans up the ResolvedMethodTable, the data in it keeps growing and lookups keep getting slower.

In short, the combination that triggers the bug is JDK11 + ZGC + log4j used in a way that walks the stack frequently.

In-depth analysis

To understand this bug, let's first look at what happens when we call LoggerFactory.getLogger(). In JDK9 and later, log4j uses the new StackWalker API to retrieve the thread's stack and iterate over its stack frames, and StackWalker in turn calls the native method fetchStackFrames to fetch frames from the JVM.
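To make this concrete, here is a minimal sketch of caller-class lookup in the spirit of what log4j's StackLocator does on JDK9+. The class and method names (CallerFinder, findCaller) are my own, purely for illustration, not log4j's API; the point is simply that every getLogger() call ends up walking stack frames via StackWalker.

    import java.lang.StackWalker.Option;
    import java.lang.StackWalker.StackFrame;
    import java.util.Optional;

    // Illustrative only: find the first class on the stack below a given class name,
    // roughly what a logging framework does to figure out "who called me".
    public final class CallerFinder {
        private static final StackWalker WALKER =
                StackWalker.getInstance(Option.RETAIN_CLASS_REFERENCE);

        static Optional<Class<?>> findCaller(String fqcn) {
            return WALKER.walk(frames -> frames
                    .dropWhile(f -> !f.getClassName().equals(fqcn)) // skip frames above fqcn
                    .dropWhile(f -> f.getClassName().equals(fqcn))  // skip fqcn itself
                    .findFirst()
                    .map(StackFrame::getDeclaringClass));
        }
    }

Every call like this materializes stack frames inside the JVM, and that, as we will see, is exactly where the trouble starts.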

I dug through the JVM code to see what ResolvedMethodTable::lookup() actually does. First, what is the ResolvedMethodTable? You can think of it as a hash table inside the JVM that stores information about every method that has been resolved, and it has only 1007 buckets (why so small?). Such a small bucket count inevitably produces hash collisions, which are handled by separate chaining, so lookup() has to traverse the singly linked list in a bucket. If that list gets too long, lookup performance drops. The lookup() source is as follows.

    oop ResolvedMethodTable::lookup(Method* method) {
      unsigned int hash = compute_hash(method);
      int index = hash_to_index(hash);
      return lookup(index, hash, method);
    }

    oop ResolvedMethodTable::lookup(int index, unsigned int hash, Method* method) {
      for (ResolvedMethodEntry* p = bucket(index); p != NULL; p = p->next()) {
        if (p->hash() == hash) {
          oop target = p->object_no_keepalive();
          if (target != NULL && java_lang_invoke_ResolvedMethodName::vmtarget(target) == method) {
            ResourceMark rm;
            log_debug(membername, table) ("ResolvedMethod entry found for %s index %d",
                                          method->name_and_sig_as_C_string(), index);
            return p->object();
          }
        }
      }
      return NULL;
    }

The next question is: what causes the ResolvedMethodTable to hold so much data that the linked list in an individual bucket grows too long? Again, the answer is in the bug's mailing-list thread mail.openjdk.java.net/pipermail/z… which I roughly translate as follows:

Besides unloading unused classes, the GC does a lot of other cleanup, such as removing stale entries from the ResolvedMethodTable and the StringTable. The ResolvedMethodTable holds weak pointers to instances of the Java class ResolvedMethodName; if a ResolvedMethodName becomes unreachable, the GC would normally clean up the unreachable object. A ResolvedMethodName is part of a MemberName. When StackWalker traverses stack frames, it creates MemberName Java objects and stores them in StackFrameInfo objects. The JVM implementation looks like this:

    void java_lang_StackFrameInfo::set_method_and_bci(Handle stackFrame,
                                                      const methodHandle& method,
                                                      int bci, TRAPS) {
      ...
      // Create a ResolvedMethodName object and insert it into the ResolvedMethodTable
      CallInfo info(method(), ik, CHECK);
      MethodHandles::init_method_MemberName(mname, info);  // add the ResolvedMethodName to the MemberName
      ...
    }

The stack call information is as follows:

  java_lang_StackFrameInfo::set_method_and_bci(Handle, methodHandle 
const&, int, Thread*)
  JavaFrameStream::fill_frame(int, objArrayHandle, methodHandle const&, 
Thread*)
  StackWalk::fill_in_frames(long, BaseFrameStream&, int, int, 
objArrayHandle, int&, Thread*)
  StackWalk::fetchNextBatch(Handle, long, long, int, int, 
objArrayHandle, Thread*)
  JVM_MoreStackWalk
  java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames

Later, if the StackFrameInfo objects are no longer used, they become unreachable and are collected by the GC, which should then also remove the corresponding ResolvedMethodName entries from the ResolvedMethodTable. The problem is that ZGC in JDK11 never actually removes these ResolvedMethodName entries, so the ResolvedMethodTable keeps growing, the bucket chains keep getting longer, and lookups keep getting slower.

In JDK11 this cleanup happens in calls such as SymbolTable::unlink() and ResolvedMethodTable::unlink(), which other collectors perform as part of their cleanup; ZGC in JDK11 simply never calls ResolvedMethodTable::unlink(), so the dead entries are never removed.

To simplify: every time StackWalker walks the stack, each materialized frame puts a ResolvedMethodName object into the JVM's ResolvedMethodTable, but ZGC in JDK11 never effectively cleans up the entries that are no longer used. Since the ResolvedMethodTable is a fixed-size hash table, the more data it holds, the longer each bucket's linked list becomes and the slower lookups get, which is why CPU usage climbs higher and higher.
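If you want to see the effect for yourself, here is a rough reproduction sketch of my own (not taken from the bug report, and I have not timed it rigorously); it assumes a JDK11 process started with ZGC enabled (e.g. -XX:+UnlockExperimentalVMOptions -XX:+UseZGC) and simply walks the stack in a tight loop, the way a non-static logger would on every request.

    // Rough reproduction sketch: under JDK11 + ZGC, every walk materializes
    // StackFrameInfo/MemberName objects, and per the explanation above their
    // ResolvedMethodName entries accumulate in the ResolvedMethodTable.
    public class StackWalkerChurn {
        public static void main(String[] args) {
            StackWalker walker = StackWalker.getInstance();
            long walks = 0;
            while (true) {
                // Force frame materialization, mimicking LoggerFactory.getLogger() per request.
                walker.walk(frames -> frames.limit(16).count());
                if (++walks % 1_000_000 == 0) {
                    // Over hours or days, the time per walk (and process CPU) should creep upward.
                    System.out.println(walks + " walks");
                }
            }
        }
    }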

How to avoid the pit

If you understood the mechanism above, you already know how to stay out of trouble; if not, the short version is: don't combine JDK11 + ZGC with frequent StackWalker calls (for example, by misusing log4j). That doesn't mean you can't use log4j at all, as long as you don't trigger StackWalker frequently. In our case, the logger only needed to be declared static, so that StackWalker runs once at class initialization instead of on every request, and the problem goes away. This also explains why our many other JDK11 applications using the same problematic RedisClient never showed the CPU anomaly: they had not enabled ZGC.
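For reference, a minimal sketch of the fix as applied to our wrapper class (names simplified): the only change is making the logger a static final field.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public abstract class ReadCommandTemplate<T> {
        // static: getLogger() (and thus StackWalker) runs once per class load,
        // not once per Redis request.
        private static final Logger LOGGER =
                LoggerFactory.getLogger(ReadCommandTemplate.class);

        // ... rest of the template unchanged
    }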

To repeat: the bug only bites when JDK11 + ZGC + frequent StackWalker calls all come together; break any one of the three and you are safe.

Bugfix

For our application, following the avoidance guide above is enough, but for the JDK team this bug is definitely something that needs fixing.

As the bug page shows, the bug has been fixed in JDK13. Let's see how they fixed it. Is it simply a matter of having ZGC call the table cleanup (unlink) at the right point? Yes, but the JDK team went further: besides the cleanup, they also reworked the ResolvedMethodTable implementation to support dynamic resizing, which avoids overly long bucket chains altogether. For details, see src/hotspot/share/prims/resolvedMethodTable.cpp in the JDK source.

    void ResolvedMethodTable::do_concurrent_work(JavaThread* jt) {
      _has_work = false;
      double load_factor = get_load_factor();
      log_debug(membername, table)("Concurrent work, live factor: %g", load_factor);
      if (load_factor > PREF_AVG_LIST_LEN && !_local_table->is_max_size_reached()) {
        grow(jt);
      } else {
        clean_dead_entries(jt);
      }
    }

    void ResolvedMethodTable::grow(JavaThread* jt) {
      ResolvedMethodTableHash::GrowTask gt(_local_table);
      if (!gt.prepare(jt)) {
        return;
      }
      log_trace(membername, table)("Started to grow");
      {
        TraceTime timer("Grow", TRACETIME_LOG(Debug, membername, table, perf));
        while (gt.do_task(jt)) {
          gt.pause(jt);
          {
            ThreadBlockInVM tbivm(jt);
          }
          gt.cont(jt);
        }
      }
      gt.done(jt);
      _current_size = table_size();
      log_info(membername, table)("Grown to size:" SIZE_FORMAT, _current_size);
    }

Conclusion

The root cause on our side was sloppy code (the logger was not declared static), and the same sloppy pattern also hurts applications that never trigger this bug, since creating a logger on every request costs something by itself. After the fix was rolled out to other applications, the CPU usage of some services dropped by more than 5%. This underlines the importance of code quality, especially for foundational code that is widely reused.

The bug is fixed in JDK13 but not in later JDK11 releases, so it still affects JDK11 users. My guess is that because the bug is not easy to trigger (it requires JDK11 + ZGC + frequent StackWalker calls, e.g. via log4j misuse), and because users who do trigger it can easily work around it in their own code, it simply isn't an important or urgent bug from the JDK developers' point of view, so the fix only landed in a later release.

References

  1. Arthas, Alibaba's open-source Java diagnostic tool
  2. Ruan Yifeng, "How to read a flame graph?"
  3. The JDK developers' mailing-list discussion of this bug

    This article is from blog.csdn.net/xindoo