A service that had been running stably suddenly hit 100% CPU; RPC calls to it failed and callers fell back to their mock (degradation) logic. After a restart the problem reappeared about an hour later. I dumped the thread stacks, but nothing looked obviously wrong at first glance, so an investigation began.

The day before, I had dumped the thread stacks and found that the threads with high CPU usage belonged to Sentinel, Alibaba's rate-limiting component. According to the stack traces, the Sentinel dashboard periodically pulls the service's QPS statistics, and the interface that serves this request reads local files, because the QPS metrics are written to files under the /root/logs/csp/ directory and read back via NIO. So my guess was that under certain conditions Sentinel hits an infinite-loop bug while reading these files, which would explain the 100% CPU.
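For reference, a rough sketch of that first check; the process ID is a placeholder and the grep pattern is just an assumption about how Sentinel's threads are named, not taken from the original dump:

```shell
# Dump the thread stacks of the service process
jstack <pid> > /tmp/dump.txt

# Look for Sentinel-related threads and their stack frames (pattern is an assumption)
grep -A 15 -i "sentinel" /tmp/dump.txt

# The directory Sentinel writes its QPS metric files to
ls -l /root/logs/csp/
```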

With that in mind, I removed the rate-limiting feature, expecting the problem to be solved, but the next morning it was back. So I restarted the other nodes of the service, keeping one problematic node for analysis and planning to take it off the registry. Unexpectedly, that node was already missing from the registry: presumably all nodes of the service had lost contact with the registry, and the others only reappeared because they re-registered after being restarted.

Since the node was no longer registered, it made sense that callers went straight to their mock logic. Continuing from the previous day's progress, I first checked whether the process had too many file handles open, using lsof, and sure enough it did.
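Something along these lines; a sketch of the first pass, counting open handles per process:

```shell
# Count open file handles per PID and sort descending (second column of lsof is the PID)
lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head

# Or, for a single known process, count its fd entries directly
ls /proc/<pid>/fd | wc -l
```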

Process 15567 had more than 200,000 open file handles. I was still inclined to blame Sentinel's file reading, because I remembered that the previous day the rate-limiting feature had not been completely removed, only the rate-limiting rules (configuration). So the next step was to check whether the process had files open under the /root/logs/csp directory, a path I had found in the source code of the rate-limiting component.

Unfortunately, that view doesn't show how many handles each individual file accounts for, so I used lsof -p <process ID> to inspect the process's file handles in detail.
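Roughly like this, using the PID found above; the grep pattern is just the directory named earlier:

```shell
# List all file handles held by the process (can be very long)
lsof -p 15567

# Filter for Sentinel's metric files
lsof -p 15567 | grep -i "/root/logs/csp"
```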

Instead of files under /root/logs/csp, what I found was a huge number of Redis connections. So this did not look like a bug in the Sentinel rate-limiting component after all; indeed, after the rate-limiting rules were removed the day before, the Sentinel threads no longer consumed much CPU. The next step was to count the open Redis sockets.
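Counting them can be done with a filter like the following; the port is an assumption (6379 is the Redis default, and redis-cluster nodes may use other ports):

```shell
# Count TCP connections the process holds to Redis; -nP keeps addresses and ports numeric
lsof -p 15567 -nP | grep TCP | grep -c ":6379"
```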

So the file handle problem was indeed related to redis-cluster. I then used redis-cli to connect to the Redis service to see whether its client connection count had grown and whether it matched the number of handles held by the process. (The day before, I had also noticed occasional connection timeouts in a message-consuming service.)
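Checked roughly like this; host and port are placeholders:

```shell
# Ask a Redis node how many clients are connected
redis-cli -h <redis-host> -p 6379 info clients | grep connected_clients

# Or count the client list directly
redis-cli -h <redis-host> -p 6379 client list | wc -l
```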

However, nothing unusual showed up: the number of client connections on the Redis side was the same as usual. With no other leads, I went back to the thread dump and found a large number of Dubbo threads in WAITING state, blocked either while returning Redis connections to the connection pool or while borrowing connections from it.
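A rough way to confirm that from the dump; the thread and class names below are assumptions based on typical Dubbo + Jedis/commons-pool stacks, not copied from the original dump:

```shell
# How many Dubbo worker threads are there, and how many threads are WAITING?
grep -c "DubboServerHandler" /tmp/dump.txt
grep -c "java.lang.Thread.State: WAITING" /tmp/dump.txt

# Where are they parked? e.g. borrowing pooled connections
grep -B 2 -A 10 "GenericObjectPool.borrowObject" /tmp/dump.txt | head -40
```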

A few Dubbo threads were still in RUNNABLE state because their requests hit the in-memory cache. Requests are intercepted with AOP, and those that hit the in-memory cache never go down into the business logic, so they don't touch Redis at all. That is why the requests handled by these threads were still served normally.

So why was the CPU at 100%? With most business threads blocked in WAITING state, which threads were burning the CPU? That is when I turned to the top command.
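The per-thread view looks like this, using the PID from earlier:

```shell
# Show CPU usage per thread of the process; the busiest threads sort to the top
top -Hp 15567
```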

Two threads unrelated to business logic were consuming most of the CPU. Going back to the thread dump, the only threads matching the names shown in top's COMMAND column were two garbage-collection threads.
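Another common way to tie a hot thread from top back to the dump (not necessarily what was done here) is to convert its decimal thread ID to the hexadecimal nid that jstack prints; the TID below is made up for illustration:

```shell
# top -H shows the thread ID in decimal; jstack prints it as nid=0x...
printf '%x\n' 15678            # hypothetical TID from top -Hp -> 3d3e
grep -i "nid=0x3d3e" /tmp/dump.txt
```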

Next, I used the jstat command to look at garbage collection.
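For example, sampling once per second (the interval and sample count are arbitrary):

```shell
# Print GC utilisation and counters every 1000 ms, 10 samples; watch the FGC (full GC) column
jstat -gcutil 15567 1000 10
```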

Watching GC with jstat over and over, the FGC (full GC) count went up by 1 on nearly every run, and each run took a little over a second, so roughly one full GC per second. The truth was very close now; from here on, the problem had to be in the business code.

Since the service uses an in-memory cache, some full GC activity is not surprising in itself. The first two possibilities that came to mind were that there was simply too much cached data, or that the scheduled task that cleans up the cache had failed. And since the service had been stable for a long time, I speculated that some malformed data had triggered a bug that made the cleanup fail, so the cache kept growing and growing. But after reading the code, I saw that the cleanup is wrapped in a try-catch and the cache is cleared if an exception occurs; the cleanup task, however, runs only once an hour.

I didn't dig any deeper at this point. Removing the in-memory cache is not an option, so the only thing to do was to estimate how much memory the business data needs and increase the memory of the instance. A database query put the number of records to be cached at a little over 4 million. A local test case showed that holding those 4-million-plus records takes roughly 700 MB of memory, so increasing the machine's memory slightly is enough.
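The fix is then just to give the instance, and the JVM heap, more room. A minimal sketch, assuming the service is started with a plain java command; the jar name and heap sizes are made up for illustration:

```shell
# Illustrative only: size the heap so the ~700 MB of cached records plus the normal working set fit comfortably
java -Xms2g -Xmx2g -jar demo-service.jar
```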

One question remains: why were there so many Redis connection file handles when a connection pool is used? Stop-the-world pauses alone shouldn't cause that many Redis handles to be opened, should they? If you know the answer, I'd really appreciate a comment explaining it. Thank you!