The previous article, “Behind a JVM Full GC Was a Heart-Stopping Production Incident!”, covered a case in which a production system went down because of JVM Full GC. This article continues with a problem that another production system ran into.
I. Background
The background is this: during a traffic peak, the MQ middleware that a production system depended on failed, which triggered the system's degradation mechanism. Shortly after the mechanism kicked in, the system froze and could not respond to any requests.
First, a brief overview of the architecture. The system's core behavior is writing data to MQ, and that data is critical: none of it may be lost.
So a degradation mechanism was designed: if the MQ middleware fails, the system immediately switches to writing the core data to files on local disk.
As an aside, if you are not familiar with MQ middleware, I suggest reading the earlier post “Java Advanced Interview Series 1: Why Do You Want to Introduce Messaging Middleware into Your System Architecture?” for a basic understanding.
But if every incoming record were written synchronously to a local disk file under peak concurrency, performance would be very poor and system throughput would drop sharply. Such a degradation mechanism could never run in production, because it would collapse under highly concurrent requests.
So the degradation mechanism was designed with care.
The core idea: when the MQ middleware fails and the degradation mechanism is triggered, the system does not write each request to local disk immediately. Instead, it uses double buffering with batched flushes.
In simple terms, the system writes each incoming message to an in-memory buffer right away, and a background thread flushes the buffered data to disk.
The figure below shows the whole process.
The memory buffer is actually split into two regions.
One is the current buffer, which the system writes incoming data to; the other is the ready buffer, which the background thread flushes to disk.
Each buffer is 512 KB. The system writes to the current buffer as requests arrive, but with only 512 KB of space the current buffer is bound to fill up.
Again, see the figure below.
When the current buffer is full, the current and ready buffers are swapped. After the swap, the ready buffer holds the 512 KB of data that was just written.
The new current buffer is empty, so the system can continue writing new data into it.
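The two-region buffer and the swap described above can be sketched roughly as follows. This is only an illustration, not the system's actual code: the class name `DoubleBuffer` and its methods are hypothetical, the capacity is a constructor parameter (the article uses 512 KB), and the sketch is single-threaded, so real code would need locking around `write` and `swap`.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a two-region buffer; not thread-safe.
final class DoubleBuffer {
    private ByteBuffer current; // worker threads append here
    private ByteBuffer ready;   // the background thread flushes this one

    DoubleBuffer(int capacityBytes) {
        current = ByteBuffer.allocate(capacityBytes);
        ready = ByteBuffer.allocate(capacityBytes);
    }

    /** Appends a record to the current buffer; returns false if it does not fit. */
    boolean write(byte[] record) {
        if (current.remaining() < record.length) {
            return false;
        }
        current.put(record);
        return true;
    }

    /** Swaps roles: the full buffer becomes ready, the other becomes an empty current. */
    void swap() {
        ByteBuffer full = current;
        current = ready;
        current.clear(); // reuse the previously flushed buffer for new writes
        ready = full;
        ready.flip();    // switch the full buffer from writing to reading
    }

    ByteBuffer ready() {
        return ready;
    }

    int currentRemaining() {
        return current.remaining();
    }
}
```

With the article's sizes, this would be created as `new DoubleBuffer(512 * 1024)`. Swapping references instead of copying data keeps the swap cheap regardless of buffer size.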
The whole process is shown below:
At this point, the background thread can use the Java NIO API to write the data in the ready buffer straight to a local disk file in high-performance append mode.
Of course, the background thread involves more machinery than this. For example, each disk file has a fixed maximum size; once a file reaches that size, the thread automatically opens a new file and continues writing there.
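As a rough sketch of the flush step, the following uses `FileChannel` from Java NIO to append a buffer to a file and roll over to a new file once a size limit would be exceeded. The class name `RollingAppender`, the `degrade-N.log` file naming, and the rollover rule are assumptions made for illustration; the article does not describe the real implementation at this level of detail.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: append-mode flush with fixed-size file rollover.
final class RollingAppender {
    private final Path dir;
    private final long maxFileBytes;
    private int fileIndex = 0;

    RollingAppender(Path dir, long maxFileBytes) {
        this.dir = dir;
        this.maxFileBytes = maxFileBytes;
    }

    Path currentFile() {
        return dir.resolve("degrade-" + fileIndex + ".log"); // assumed naming scheme
    }

    /** Appends everything readable in the buffer to the current file. */
    void flush(ByteBuffer ready) throws IOException {
        Path file = currentFile();
        // Roll over if this flush would push the file past its size limit.
        if (Files.exists(file) && Files.size(file) + ready.remaining() > maxFileBytes) {
            fileIndex++;
            file = currentFile();
        }
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            while (ready.hasRemaining()) {
                channel.write(ready); // a single call may write only part of the buffer
            }
            channel.force(false);     // ask the OS to persist the file content
        }
    }
}
```

Opening and closing the channel on every flush keeps the sketch short; a long-running flusher would more likely keep the channel open between flushes.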
II. A hidden danger is planted
Good! With this mechanism, highly concurrent requests can be handled smoothly even during peak periods, and everything looks fine.
But one design decision made while developing this degradation mechanism planted a hidden danger for later.
The decision was that if the current buffer was full, all worker threads would sit in a while loop and wait indefinitely.
Wait until when? Until the ready buffer has been flushed to disk and cleared, so that it can be swapped with the current buffer.
Only once the current buffer is empty again can the worker threads continue writing data.
But have you considered the possibility that an unusual situation could happen?
It actually takes a while for the background thread to flush the ready buffer to the disk file.
What if, in the process of flushing data to a disk file, the current buffer is suddenly filled as well?
In this case, all worker threads in the system cannot write to the current buffer, and all threads are stuck.
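The stall can be reproduced with a small state model. The sketch below is an assumption-laden simplification, not the original code: it tracks only how many bytes the current buffer holds and whether the ready buffer is still being flushed, and the names `StallingDoubleBuffer`, `tryWrite`, and `flushFinished` are hypothetical. It also assumes a single record never exceeds the buffer capacity.

```java
// Hypothetical model of the write path that waits indefinitely.
final class StallingDoubleBuffer {
    enum Result { WRITTEN, STALLED }

    private final int capacity;
    private int currentUsed = 0;
    private boolean readyFlushing = false; // true while the background flush is running

    StallingDoubleBuffer(int capacity) {
        this.capacity = capacity;
    }

    synchronized Result tryWrite(int recordBytes) {
        if (currentUsed + recordBytes > capacity) {
            if (readyFlushing) {
                return Result.STALLED; // both buffers are occupied, nothing to swap with
            }
            readyFlushing = true;      // swap: the full buffer becomes ready and starts flushing
            currentUsed = 0;           // the new current buffer is empty
        }
        currentUsed += recordBytes;
        return Result.WRITTEN;
    }

    /** Called by the background thread once the ready buffer is on disk. */
    synchronized void flushFinished() {
        readyFlushing = false;
    }

    /** The behavior the article describes: loop with no timeout and no fallback. */
    void write(int recordBytes) {
        while (tryWrite(recordBytes) == Result.STALLED) {
            Thread.yield();
        }
    }
}
```

Every worker thread that calls `write` while `tryWrite` keeps returning `STALLED` spins in that loop, which is how a slow flush can stop the whole system from answering requests.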
The figure below illustrates the problem.
This is the fundamental flaw in the double-buffer design of the system's degradation mechanism. After development, the mechanism was stress-tested only at normal request volumes; at that load the two 512 KB buffers worked well and no problem appeared.
III. Peak traffic, and the problem erupts
The trouble came at peak time. During one peak, request pressure on the system reached 10 times the normal level.
Of course, under normal circumstances, write requests during peak periods go directly to the MQ middleware cluster, so a 10x increase in traffic does not matter: the MQ cluster is built to withstand high concurrency.
Unfortunately, during this peak the MQ middleware cluster suddenly failed, something that happens only a few times a year.
This caused the system to trigger the degradation mechanism and start writing data to the in-memory double buffer.
Remember, this was peak time, with 10 times the normal number of requests, and that load immediately caused a problem.
Highly concurrent requests filled the current buffer almost instantly, the two buffers were swapped, and the background thread started flushing the ready buffer to disk.
Then, because of the flood of requests, the new current buffer filled up before the ready buffer had finished flushing.
This was awkward: the production system suddenly started to hang…
The symptom was that every thread in every instance, on every machine, was stuck waiting.
IV. Locating the problem and applying the right remedy
As a result, the system stopped responding to requests at peak time. The problem was later resolved through emergency online troubleshooting, diagnosis, and repair.
In fact, the solution is very simple. We took a JVM thread dump to see where the system's threads were stuck, and found that a large number of them were waiting for the current buffer.
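The article does not say which tool produced the dump; a thread dump of a running JVM is commonly taken with the JDK's `jstack` tool. As a rough in-process alternative, the sketch below groups live threads by state and top stack frame using `Thread.getAllStackTraces()`, which is enough to spot many threads parked at the same place. The class name `StuckThreadSummary` is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts live threads by state and top stack frame.
final class StuckThreadSummary {
    static Map<String, Integer> byTopFrame() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            String top = stack.length == 0
                    ? "<no frames>"
                    : stack[0].getClassName() + "." + stack[0].getMethodName();
            String key = entry.getKey().getState() + " at " + top;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one line per distinct (state, top frame) pair with its thread count.
        byTopFrame().forEach((where, count) -> System.out.println(count + "  " + where));
    }
}
```

In an incident like this one, a single entry with a very large count, pointing at the buffer-wait loop, would be the giveaway.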
The fix was to increase the size of each of the production system's two buffers from 512 KB to 10 MB.
This lets the degradation mechanism's double buffering run smoothly during production peaks, without a sudden rush of requests filling both buffers.
With larger buffers, the current buffer takes much longer to fill, which gives the background thread time to finish flushing the ready buffer to disk.
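To see why a larger buffer can help, consider a simple model: the current buffer fills in `size / ingest rate`, while a flush costs a fixed per-flush overhead plus `size / disk throughput`. Worker threads stall when the fill time is shorter than the flush time. All numbers below are invented for illustration (10 MB/s normal ingest, 100 MB/s at 10x peak, 200 MB/s disk throughput, 5 ms overhead per flush); the article gives no such measurements, and the class name `BufferSizing` is hypothetical.

```java
// Hypothetical back-of-the-envelope model; every rate here is an assumption.
final class BufferSizing {
    /** Time for incoming traffic to fill one buffer, in milliseconds. */
    static double fillMillis(double bufferMb, double ingestMbPerSec) {
        return bufferMb / ingestMbPerSec * 1000.0;
    }

    /** Time to flush one buffer: a fixed per-flush cost plus transfer time. */
    static double flushMillis(double bufferMb, double overheadMs, double diskMbPerSec) {
        return overheadMs + bufferMb / diskMbPerSec * 1000.0;
    }

    /** Writers stall when the current buffer fills before the ready buffer is flushed. */
    static boolean stalls(double bufferMb, double ingestMbPerSec,
                          double overheadMs, double diskMbPerSec) {
        return fillMillis(bufferMb, ingestMbPerSec)
                < flushMillis(bufferMb, overheadMs, diskMbPerSec);
    }

    public static void main(String[] args) {
        double overheadMs = 5.0;     // assumed fixed cost per flush
        double diskMbPerSec = 200.0; // assumed sequential write throughput
        System.out.println(stalls(0.5, 10.0, overheadMs, diskMbPerSec));   // 512 KB, normal load
        System.out.println(stalls(0.5, 100.0, overheadMs, diskMbPerSec));  // 512 KB, 10x load
        System.out.println(stalls(10.0, 100.0, overheadMs, diskMbPerSec)); // 10 MB, 10x load
    }
}
```

Under these assumed numbers the program prints `false`, `true`, `false`: 512 KB copes with normal load, stalls at 10x load, and 10 MB absorbs the 10x load because the fixed overhead is spread over more data. The same model also shows a limit: if the sustained ingest rate is at or above the disk throughput, no buffer size prevents the stall.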
One lesson from this production failure: any complex mechanism that is designed and developed for a system must be stress-tested against the maximum traffic seen at production peaks. That is the only way to ensure that a complex mechanism can withstand real peak traffic once it goes live.
End
Architecture Notes for Hugesia (ID: Shishan100)
More than ten years of architecture experience at BAT