The problem background
Received frequent alarm mail, timing task scheduling failure, see the XXL – job actuator list is empty, but the service and according to the health, view the historical task execution records found actuator is decreasing, because it is online service, only to restart, and then thread logs also have no, at the same time try to access the service of health examination interface, The health check interface is not accessible. The service has been suspended, but the TCP health check configured for the service has not been detected.
Symptom: The executor list of xxl-job is empty, TCP detection is normal, and the service is displayed normally. However, the HTTP health check interface cannot be accessed, and the service is suspended.
Preliminary investigation process:
-
Looking at the APM online, I found two anomalies,
- The heap memory will be in the full state regularly (the heap memory that is full is Eden Space—-, the principal’s scheduled task. The calculation task is very large, and it is normal to be full, and the number of GC counts shows that young GC and old GC are not too abnormal.) —– the time when the heap memory dies is almost the same as the normal heap memory. Dump out of line memory, check there is no problem, temporarily exclude the memory problem
- The thread pool of the restarted service has been growing slowly. The normal thread pool does not grow all the time, and the amount of growth is large
-
Go to the terminal and use arthas to check the server thread status
Arthas goes into the terminal, executes the thread command and finds that a number of threads are WATING, dumps the thread stack and finds that there are over 200 threads in WATING.Copy the code
-
Arthas looks at the WATING thread stack and finds that all threads are on the bottom stack, but doesn't see much of a clue.Copy the code
-
Insert thread init method, find xxL-job thread
[arthas@1]$ stack java.lang.Thread "<init>" Copy the code
-
If xxL-job thread is leaking, the thread will hang when the number of threads increases to a certain number (close to 400), which is embarrassing…. , then at the more normal before running for online banking services, online number of threads is also found on nearly 400 orders of magnitude has been very stable, the service is also very healthy, should also not be this reason, no ideas, temporary first TCP health check for the guarantee of the HTTP service can hang up when the first time to restart (behind are analyzed, The reason xxL-job threads grow so fast is because the default thread pool of the jetty server built into XXL-job is 256 threads.
Re-screening process:
-
Dongjie found that the timed task in the test environment also hung. He checked the memory and thread pool of the test environment and found that it was basically the same as that in the online environment. There was no big exception, but fortunately, the scene of the test environment was hung, so there should be more clues.
-
Since memory and threads didn’t find anything too serious, can we see any clues from the CPU of the suspended service
-
Enter the terminal, the top command to check the CPU, sure enough, there is a problem, CPU has been full
-
Enter the Arthas terminal
Thread-n 3 Check whether the three threads with the highest CPU usage are always on the following two stacks. 1. The first is the business code 2. The other two are bothlog4J2 dozen log relatedCopy the code
-
-
View the business code:
1. The thread is stuck waiting for the result of the Callable task, and idles until no result is returned. The executorService thread pool must have threads that are held all the time. 3. View the definition of the executorService thread pool. All thread names start with school-threadCopy the code
-
Arthas looks at the thread stack in the thread pool
[arthas@1]$thread 525 was found to be stuck in logger.error, and the last stack was identical to two of the top three CPU hogsCopy the code
-
Check the com. Lmax. Disruptor. MultiProducerSequencer. Next source, should lookdo whileThe loop is at line 136 (locksupport.parknanos (1);) It's been idling. If you want to be sure that it is. So we tried by arthas watch command to find out the following variables will know whether that is ex. [1] arthas @ $watch com. Lmax. Disruptor. Sequence a get"{returnObj}"Current: Gets the sequence value that the event publisher needs to publish cachedGatingSequence: Get event handler processing to the sequence of value [arthas @ 24631] $watch com. Lmax. Disruptor. Util. Util getMinimumSequence"{returnObj}"BufferSize: 128 n: 1 gatingSequence: the smallest sequence value of the current event handler (possibly multiple event handlers) bufferSize: 128 n: 1log4J2 When you need to store events in the RingBuffer for disruptor asynchronous logging, the ringbuffer is full, but the consumer cannot process it, causing the location to retrieve the next stored event to idleCopy the code
/ * * *@see Sequencer#next() */ @Override public long next(a) { return next(1); } / * * *@see Sequencer#next(int) */ @Override public long next(int n) { if (n < 1) { throw new IllegalArgumentException("n must be > 0"); } long current; long next; do { // Get the sequence value that the event publisher needs to publish current = cursor.get(); next = current + n; //wrapPoint stands for the position of the applied sequence after one loop around RingBuffer long wrapPoint = next - bufferSize; // Get the sequence value processed by the event handler long cachedGatingSequence = gatingSequenceCache.get(); /** * 1. The event publisher is requesting a sequence value greater than the event handler's current sequence value and the event publisher is requesting a sequence value minus the length of the loop less than the event handler's sequence value. * 2. If (1) is satisfied, the given sequence can be applied. * 3. If (1) is not satisfied, you need to look at the minimum sequence value of the current event handler (there may be multiple event handlers). If the minimum sequence value is greater than or equal to * the minimum sequence value of the current event handler is one circle larger, then the sequence cannot be applied (if it is applied, it will be overwritten). In a 400-meter track (bufferSize), Ming ran 599 meters (nextSequence) and Hong (slowest consumer) ran 200 meters * (cachedGatingSequence). Xiao Ming runs one meter before she knocks over xiao Hong. * * / if (wrapPoint > cachedGatingSequence || cachedGatingSequence > current) { long gatingSequence = Util.getMinimumSequence(gatingSequences, current); if (wrapPoint > gatingSequence) { LockSupport.parkNanos(1); // TODO, should we spin based on the wait strategy? continue; } gatingSequenceCache.set(gatingSequence); } else if (cursor.compareAndSet(current, next)) { break; }}while (true); return next; } Copy the code
-
After looking at the stack and verifying the source code, it appears that log4j2 created an infinite loop while logging asynchronously through disruptor, causing the service CPU to burst and thus the service to hang.
-
Local authentication (recurring issue) :
-
To verify the idea, we also used a thread pool and frantically printed the log. In order to generate an infinite loop as quickly as possible, we set the RingbufferSize for disruptor as small as possible. – DAsyncLogger. RingBufferSize = 32768, this machine is the same, but is set to the minimum value of 128 RingBufferSize
-
Verification code:
fun testLog(a){ var i = 0 while(i < 250000){ executorService.submit { LOGGER.debug("test $i") } i++ } LOGGER.debug("commit finish")}Copy the code
-
Call this function a few more times (not mandatory, it may take several times to appear), and the same stack and result appear as on line
-
-
Disruptor bug (log4j2) : disruptor bug (disruptor) : disruptor bug (disruptor) : Disruptor bug (log4j2) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor) : Disruptor bug (disruptor To this direction is too persistent, in this erroneous area blind looking for a long time for a long time, finally make the head big.
-
I went to Xing Bin for a discussion, and the discussion was really useful. I found other problems from different directions of thinking (thanks for xing Bin’s ideas and doubts), and arthas entered the failed service again
1. A thread whose ID is 36 is found in the blocked state. 2. Check that thread stack 36 is blocked by thread 35. Look at 35 thread stack, and look at the front of the stack is the same is stuck in the com. Lmax. Disruptor. MultiProducerSequencer. Next 4. Again carefully look at, actually stuck is kafka. Clients. Metadata. Update 270 rows and Objects. RequireNonNull (topic,"topic cannot be null"); Kafka. Clients. Metadata. Add 117 lines of the debug ("Updated cluster metadata version {} to {}", this.version, this.cluster); MetaData can add and update MetaData with synchronized locksCopy the code
-
So why does MetaData lock its own add with its own update? Also take a look at our log configuration for log4j2
<CCloudKafka name="KafkaLogger" syncsend="false" > <Property name="bootstrap.servers">127.0.0.1:9092</Property> <PatternLayout pattern="[%d{yyyy-MM-dd HH:mm:ss.SSS}][%t][%level] %m%n"/> </CCloudKafka> <Async name="async" includeLocation = "true"> <appender-ref ref="Console"/> <appender-ref ref="RollingFileInfo"/> <appender-ref ref="RollingFileError"/> <appender-ref ref="AsyncMailer"/> <appender-ref ref="KafkaLogger"/> </Async> Copy the code
welogAsync printing is configured in 4J2log", one of which is sent to Kafka. The entire log printing and sending process is simple as follows. 1. When the Ringbuffer is full 2. Kafka periodically updates metadata. whenlog4 j2 try the log write to the disruptor will MultiProducerSequencer. The next fetch next can be inserted into the location of the storage, found no place can deposit and will conduct LockSupport. ParkNanos temporarily block 1 ns, After the disruptor consumer consumes the log event, remove the event to create a space 4. That's where the problem comes in, when Kafka's KafkaProducerwaitWhen the OnMetadata method attempts to consume this message, it will first add the topic to the MetaData MetaData. However, when it adds the topic, it will find that it cannot obtain the lock because it has been acquired by the update in step 2. Then the disruptor MultiProducerSequencer. The next idle all the time. Idle threads then continue to consume CPU, causing the service to hangCopy the code
-
So here’s the question that some of you who are familiar with log4j2 might be asking about the asynchrony of log4j2 in two ways
AsyncAppender and AsyncLogger are two asynchronous log implementations in Log4j2. AsyncAppender uses ArrayBlockingQueue to store log events that need to be output asynchronously. AsyncLogger uses the Disruptor framework to implement high throughput. The following configuration is asynchronous AsyncAppender, but why Disruptor is used? Because actually we global configuration - DLog4jContextSelector = org. Apache. Logging. Log4j. Core. The async. AsyncLoggerContextSelector, This causes the application to use Disruptor to implement asynchronism.Copy the code
<Async name="async" includeLocation = "true"> <appender-ref ref="Console"/> <appender-ref ref="RollingFileInfo"/> <appender-ref ref="RollingFileError"/> <appender-ref ref="AsyncMailer"/> <appender-ref ref="KafkaLogger"/> </Async> Copy the code
More AsyncAppender AsyncLogger and the difference between the two blog https://bryantchang.github.io/2019/01/15/ for referencelog4j2-asyncLogger/ https://bryantchang.github.io/2018/11/18/log4j-async/ Copy the code
-
The number of threads of xxL-job is increasing, and the threads are in wait state. This is related to the built-in jetty service of XXL-job. Then debug the breakpoint to thread.init () and see that the jetty server started the Thread. CorePoolSize and corePoolSize are 256 threads. This is why our scheduled task service starts up and the number of threads keeps increasing and then it doesn’t change much after a certain number, because of the thread pool.
conclusion
-
To solve the problem
To summarize the problem:logWhen 4j2 asynchronously logs, the queue is full, and we are using Kafka to print logs. Kafka starts a deadlock when the queue is full, causing distuptor to loop. 1. AsyncLogger set -d to set AsyncLogger to block if the queue is fulllog4j2.AsyncQueueFullPolicy=Discard 2. AsyncAppender <Async name="async" blocking="false" includeLocation = "true"> If you set a discarding policy, you also need to set the level of discarding logs as follows: -d Based on the projectlog4j2.DiscardThreshold=INFO Copy the code
-
conclusion
The solution of this problem did take a lot of time, from the beginning of the various points of doubt to the final step close to the truth, in fact, it was quite difficult, in many misunderstandings for a long time, spent a lot of time, but finally to the moment of the solution was very happy, but in the middle of the rightlogThe problems of 4J2 that I don't understand and easy to ignore details are still exposed. In fact, gradually a line comes down and I have a set of procedures and ideas for solving them. This is the most gratitening thing. I will also learn a lot of ideas and methods to solve problems.Copy the code