Background

After switching from VMs to k8s, a service provider started receiving a Dubbo thread pool overflow alarm every hour, right on the hour. CPU and memory usage were not high, and the thread pool settled back down shortly after the hour passed. Another thing worth noting: the Dubbo thread count normally sat around 10-20, but on the hour it jumped to 400-500, which is quite a dramatic spike.

Someone might ask: isn't it simply because traffic is higher on the hour? Yes, traffic on the hour is a bit higher, but nowhere near enough to explain this; monitoring shows it is only about 5%-10% above normal, and CPU and memory usage stay low at the same time.

Locating the problem

So that's the background; the goal now is to figure out what is going on. The troubleshooting process went as follows:

1. Examine the thread stacks. Fortunately the problem is easy to reproduce, because by default Dubbo dumps the current thread stacks to the user's home directory when the thread pool overflows. A look at the dump shows that most threads are BLOCKED on logback logging:

"DubboServerHandler - 172.21.41.166:20833 - thread - 111" Id = 253 BLOCKED on Ch. Qos. Logback. Core. Rolling. TimeBasedRollingPolicy @ 288 c40d owned by "DubboServerHandler - 172.21.41.166:20833 - thread - 75" Id=200 at ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:229) - blocked on ch.qos.logback.core.rolling.TimeBasedRollingPolicy@288c40d at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:102) at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84) at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51) at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270) at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257) at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421) at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383) at ch.qos.logback.classic.Logger.warn(Logger.java:688) ...... at org.apache.dubbo.common.bytecode.Wrapper36.invokeMethod(Wrapper36.java) at org.apache.dubbo.rpc.proxy.javassist.JavassistProxyFactory$1.doInvoke(JavassistProxyFactory.java:47) at org.apache.dubbo.rpc.proxy.AbstractProxyInvoker.invoke(AbstractProxyInvoker.java:84) at org.apache.dubbo.config.invoker.DelegateProviderMetaDataInvoker.invoke(DelegateProviderMetaDataInvoker.java:56) at org.apache.dubbo.rpc.protocol.InvokerWrapper.invoke(InvokerWrapper.java:56) at com.alibaba.csp.sentinel.adapter.dubbo.SentinelDubboProviderFilter.invoke(SentinelDubboProviderFilter.java:65) at org.apache.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:83) ... Number of locked synchronizers = 1 - Java. Util. Concurrent. The ThreadPoolExecutor $1 a579e21 Worker @ / / have a look Dubboserverhandler-172.21.41.166:20833-thread-75 What does the thread do? 
"dubboServerHandler-172.21.41.166:20833-thread-75" Id=200 RUNNABLE at ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:229) - locked ch.qos.logback.core.rolling.TimeBasedRollingPolicy@288c40d at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:102) at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84) at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51) at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270) at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257) at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421) at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383) at ch.qos.logback.classic.Logger.info(Logger.java:579) at com.tuya.aries.biz.service.category.impl.CategoryBizService$afterPropertiesSet$1.onRemoval(CategoryBizService.kt:65) at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1963) at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3562) at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3538) at com.google.common.cache.LocalCache$Segment.storeLoadedValue(LocalCache.java:3226) at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2378) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2337) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2295) - locked com.google.common.cache.LocalCache$StrongAccessWriteEntry@52efd51e at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2208) at com.google.common.cache.LocalCache.get(LocalCache.java:4053) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4057) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4986) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:736) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) at org.springframework.aop.framework.adapter.MethodBeforeAdviceInterceptor.invoke(MethodBeforeAdviceInterceptor.java:56) at  org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:168) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92) ... Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@146c6130Copy the code

Most threads are blocked trying to acquire a lock while writing logs, and that lock is held by thread DubboServerHandler-172.21.41.166:20833-thread-75. Reading the logback source shows that this thread is in the middle of a log rollover, which is a synchronous operation.
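
To make it concrete why one slow rollover stalls every logging thread, here is a simplified sketch of what a synchronous rolling appender does on each append. This is my own paraphrase for illustration, not logback's actual source: the rollover check and the rollover itself run under a lock shared by every thread that logs through the appender, so while one thread is renaming the old file, everyone else queues up behind it.

// Simplified sketch (paraphrase for illustration, not logback's actual source):
// the rollover check and the rollover run under a lock shared by all logging threads.
class RollingAppenderSketch {
    private final Object rolloverLock = new Object();

    void append(String event) {
        synchronized (rolloverLock) {
            if (rolloverDue()) {
                // Renames the active file (e.g. info.log -> info_2021-09-01.10.log).
                // In this incident the rollover overlapped with the k8s cleanup task
                // holding the file handle and took several seconds, so every other
                // thread that tried to log piled up on rolloverLock -- exactly the
                // BLOCKED threads seen in the dump above.
                rollover();
            }
        }
        write(event); // the normal append path
    }

    boolean rolloverDue() { return false; }            // time- or size-based trigger
    void rollover() { /* rename the old file, open a new one */ }
    void write(String event) { /* append the formatted event */ }
}

In the dump above, the contended lock is the TimeBasedRollingPolicy instance itself, which is why every BLOCKED thread names it.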

On the face of it, then, logback's rollover is a synchronous operation, and most business threads are blocked waiting for it to finish. To verify this guess I looked at the business side's logback.xml, and it does indeed roll over on the hour (the %d{yyyy-MM-dd.HH} pattern triggers an hourly rollover):

<appender name="InfoFile" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
        <level>INFO</level>
    </filter>
    <file>${loggingRoot}/info.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <FileNamePattern>${loggingRoot}/info_%d{yyyy-MM-dd.HH}.log</FileNamePattern>
        <MaxHistory>7</MaxHistory>
    </rollingPolicy>
    <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS}, %p, %c, %F, %L, %M %m%n</pattern>
    </encoder>
</appender>

At this point I jumped to a first conclusion: the call volume is high, and logback happens to be rolling the log at that moment, so the two together cause the problem. The second half of that thought is not quite right, though: why was there no problem on the VMs but there is on k8s? We'll come back to that. To test the conjecture, I tried the following:

  1. Changed the rollover policy to roll by size: slapped in the face, the problem still occurred
  2. Switched the logging to async (see the sketch after this list): slapped in the face again
  3. Went through the business side's logback.xml configuration once more: nothing special in it
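
For reference, test 2 amounts to wrapping the existing file appender in logback's AsyncAppender, roughly like the snippet below; the queue size and neverBlock value are illustrative, not the settings that were actually used:

<appender name="AsyncInfoFile" class="ch.qos.logback.classic.AsyncAppender">
    <!-- Illustrative values, not the business side's actual settings -->
    <queueSize>1024</queueSize>
    <discardingThreshold>0</discardingThreshold>
    <!-- Drop events rather than block the caller when the queue is full -->
    <neverBlock>true</neverBlock>
    <appender-ref ref="InfoFile"/>
</appender>

Even with an async wrapper, callers can still block once the queue fills up while the worker thread is stuck on the rollover (unless neverBlock is enabled), which may be part of why this test did not help here.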

As you can guess, the problem has nothing to do with logback itself, although it clearly has something to do with log output. There was nothing more to be gained by dwelling on logback, so the focus shifted to k8s.

2. Ops pointed out that the k8s nodes the service had moved to were not equipped with SSDs. I tested that conjecture immediately and got slapped in the face once more.

By now everyone was a little worn out. What on earth was causing this? I had no idea, so I offered a temporary mitigation:

  1. Increase the Dubbo thread pool size (see the sketch after this list): CPU and memory usage are low, so with more threads the pool at least won't overflow; the worst case is longer response times on the hour
  2. Add replicas: with less traffic per node, the probability of a thread pool overflow is certainly lower. I know adding machines is not a real fix, but if the service were simply switched back off k8s, the problem would become genuinely hard to investigate
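
For completeness, the first stop-gap is just a provider-side configuration change; with Dubbo's XML configuration it looks roughly like the line below. The port is the one from the thread dump, and the thread count is an illustrative value, not what was actually configured:

<!-- Illustrative sketch: enlarge the provider's fixed thread pool.
     Port taken from the thread dump; 500 is an example value. -->
<dubbo:protocol name="dubbo" port="20833" threadpool="fixed" threads="500"/>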

With these measures in place the thread pool no longer overflows, but monitoring still shows that the number of Dubbo threads in use on the hour is far above the normal level. The business side is not reassured and still wants to switch back. I understand: the graphs are hard to look at, and who dares rely on the service until the root cause is found?

Around this time I happened to be testing the rollover strategy of the company's logging component. Local tests showed no problems, so I planned to deploy the test program to k8s and let it run for a few days, with rollover configured by size and the size capped at 800MB. On k8s, though, I stumbled on something odd: every time the log file reached 500MB, it was truncated to 0 bytes. It turned out there is a background task on the k8s nodes that cleans up log files: any log file larger than 500MB has its contents cleared in place. After checking with ops, clearing a file also holds its file handle for several seconds.
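
I never saw the ops cleanup job itself, but its observable effect was equivalent to the hypothetical sketch below: scan the log directory and truncate any file over 500MB in place. Truncation keeps the same inode, which is why the running service quietly keeps writing to a now-empty file; the directory and file pattern here are assumptions.

// Hypothetical sketch with the same observable effect as the cleanup task on the
// k8s nodes: any log file over 500MB is truncated to 0 bytes in place.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LogCleanupSketch {
    private static final long LIMIT = 500L * 1024 * 1024; // 500MB

    public static void main(String[] args) throws IOException {
        Path logDir = Paths.get("/data/logs"); // assumed log location
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(logDir, "*.log")) {
            for (Path file : logs) {
                if (Files.size(file) > LIMIT) {
                    // The real task reportedly held the file handle for several
                    // seconds while clearing the file -- the window that collided
                    // with logback's hourly rename.
                    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                        ch.truncate(0);
                    }
                }
            }
        }
    }
}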

At this point the cause of the problem was essentially confirmed, and after ops disabled the cleanup task, every affected service returned to normal.

Problem summary

The k8s nodes run a scheduled task that clears out log files larger than 500MB, and by sheer coincidence the business side produces roughly 500MB of logs per hour. So at the top of the hour the cleanup task clears the log file and holds its file handle for a few seconds; at the same moment logback performs its hourly rollover, which renames the log file. The thread doing the rollover holds logback's lock for those seconds, every other business thread that wants to write a log line must first acquire that same lock, and so the number of blocked threads keeps growing until the Dubbo thread pool overflows.

So what did we learn from this whole exercise? Looking back, the problem is very simple, but I was involved from start to finish, and the truth is that many seemingly intractable problems are not actually complicated: they are simple once found, and baffling until then. The key is to take something away from it. The problem itself may not be very valuable and may never recur, but the troubleshooting process is, and in the end that is what experience amounts to:

  1. Keep an open mind: a problem that on the surface looks like A may in fact be B+C+D+E+F+G. In the case above, the surface cause looked like logback, but the real cause was a combination of factors plus a coincidence. How do you keep an open mind? Partly by mastering the frameworks you use every day and continuing to learn efficiently beyond them; partly by picking up knowledge that seems unused at work but is actually very useful, such as operating system fundamentals, which pays off when you need it. The more you know, the more angles you can consider when facing a problem, and that is experience too.
  2. Stay rigorous: an open mind is not enough; you also need a rigorous attitude. An open mind gives you more directions to think in, but without rigor you are more likely to give up halfway down the right path and get stuck on the wrong one. Take the problem above: why didn't I look at the file size while testing the rollover strategy? Why didn't I watch how the log file changed at the top of the hour? Once you have an idea you take things for granted, and when the verification fails you hit a wall and waste time.
  3. Organize and summarize: when you have no clear lead, write down every point you can think of on paper and then sort through them, so you are less likely to miss details. It's like a detective solving a case: they are looking for a killer, you are looking for a bug.
  4. Be humble and willing to learn: there aren't that many geniuses around, and even if there were, that has nothing to do with you; since you are not one, just be a bit more diligent. One person's ability is always limited, especially an ordinary person's. More people means more ideas, and a discussion may bring gains you didn't expect.