Welcome to follow our wechat official account: Shishan100

My new course ** “C2C e-commerce System Micro-service Architecture 120-day Practical Training Camp” is online in the public account ruxihu Technology Nest **, interested students, you can click the link below for details:

120-Day Training Camp of C2C E-commerce System Micro-Service Architecture

“Behind a JVM FullGC in my last article was a heart-stopping online production accident! “, told us about a case where the online system went down due to JVM FullGC. In this article, we continue to talk about the problems encountered by another online system in a production environment.

I. Background introduction

The background is this: a system on the line, in the event of a peak MQ middleware failure, triggers the degradation mechanism, and after the degradation mechanism is triggered for a short time, the system suddenly freezes and cannot respond to any requests.

To give you a brief introduction to the overall architecture of this system, this system is simply to have a very core behavior, is to write data to MQ, but the data written to MQ is very core and critical, absolutely do not allow loss.

So a downgrading mechanism was designed so that if the MQ middleware failed, the system would immediately write the core data to a local disk file.

As an added bonus, for those of you who are not familiar with the concept of MQ middleware, I suggest you take a look at the previous post “Java Advanced Interview Series 1” why do you want to introduce messaging middleware into your system architecture? Start with a basic understanding of MQ middleware.

But if in the peak of the concurrency value under the condition of relatively high, immediately receive a data synchronous write local disk file, the performance is very poor, can lead to system throughput instantly dropped substantially, the degradation mechanism is absolutely cannot run in the production environment, because they would have collapsed under high concurrent requests.

So at the time of design, the downgrade mechanism was carefully designed.

Our core idea is that when MQ middleware fails and the degrade mechanism is triggered, the system receives a request and does not write to the local disk immediately, but uses the mechanism of dual buffer and batch flush.

In simple terms, the system receives a message and immediately writes to the memory buffer, then starts a background thread to flush the cached data to disk.

The whole process, if you look at the picture below, you can see.

The memory buffer is actually designed to be split into two regions.

One is the current area, which is used by the system to write data, and the other is the ready area, which is used by background threads to flush data to disk.

The buffer size of each area is set to 512KB. The system writes to the current buffer on receipt of the request, but the current buffer has only 512KB of memory space, so it is bound to be full.

Again, let’s take a look at the picture below.

When the current buffer is full, the current and ready buffers are swapped. After the swap, the ready buffer holds 512kb of data that was previously written full.

Then the current buffer is now empty and can continue and the system continues to write the new data to the swapped new current buffer.

The whole process is shown below:

At this point, the background thread can write the data in the ready buffer directly to the local disk file in high performance Append mode using the Java NIO API.

Of course, background threads have a whole set of mechanisms, for example, a disk file has a fixed size, and if it reaches a certain size, it automatically opens a new disk file to write data.

Two, buried hidden danger

Good! With the above mechanism, high concurrency requests can be handled smoothly, even during peak times, and everything looks good!

But, at that time this degrade mechanism is in the development, we take the train of thought, buried hidden trouble for the back!

The idea was that if the current buffer was full, all threads would be stuck in a while loop waiting indefinitely.

When will it be? Wait until the ready buffer is flushed to disk files, then clear the Ready buffer and swap with the current buffer.

The current buffer must become empty again before the worker thread can continue writing data.

But have you considered the possibility that an unusual situation could happen?

It actually takes a while for the background thread to flush the ready buffer to the disk file.

What if, in the process of flushing data to a disk file, the current buffer is suddenly filled as well?

In this case, all worker threads in the system cannot write to the current buffer, and all threads are stuck.

Give you a picture to see this problem!

This is the most fundamental problem of the double buffer mechanism of the degrade mechanism of the system. After the development of this degrade mechanism, the normal request pressure test was used, and it was found that the two buffers worked well under the condition of 512KB, and there was no problem.

Three, peak request, problem outbreak

But the problem is the rush hour. At one peak, the system request pressure reached 10 times higher than normal.

Of course, under normal flow, during peak times, write requests are actually written directly to the MQ middleware cluster, so it doesn’t matter if your peak traffic increases by 10 times, MQ clusters are naturally resistant to high concurrency.

Unfortunately, at peak times, the MQ middleware cluster suddenly failed, which happens only a few times a year.

This causes the system to suddenly trigger the degrade mechanism and start writing data to the memory double buffer.

You know, this is the rush hour and the number of requests is 10 times normal. So 10 times the request pressure instantly caused a problem.

The problem is that the current buffer is flooded with high concurrency requests, the two buffers are swapped, and background threads start flushing the ready buffer to disk files.

As a result, the current buffer is suddenly full before the ready buffer is flushed to disk files due to the rush of requests.

This is awkward, the online system suddenly starts to freak out…

Typically, all threads of instances deployed on all machines are stuck in wait state.

Fourth, locate the problem and take appropriate medicine

As a result, the system began to fail to respond to requests at peak times. Later, the problem was solved through online emergency troubleshooting, positioning and repair.

In fact, the solution is very simple. We take a snapshot of the JVM dump to see where the system threads are stuck, and find that a large number of threads are stuck waiting for the current buffer.

The solution is to increase the size of the online system’s two-segment buffer from 512KB to a buffer of 10MB.

This also allows the downgrading mechanism’s dual buffering mechanism to run smoothly during online peak times, without the sudden rush of requests hitting both buffers.

Because larger buffers allow the ready buffer to be flushed to disk files, the current buffer will not fill as quickly.

But one of the lessons learned from this online failure feedback is that any complex mechanism for system design and development must be stress-tested against the maximum flow at peak online times. This is the only way to ensure that any complex mechanics that come online can withstand the rush of online traffic.

End

If there is any harvest, please help to forward, your encouragement is the biggest power of the author, thank you!

A large wave of micro services, distributed, high concurrency, high availability of original series of articles is on the way

Please scan the qr code belowContinue to pay attention to:

Architecture Notes for Hugesia (ID: Shishan100)

More than ten years of EXPERIENCE in BAT architecture

**> **推荐阅读:** > > 1、[拜托!面试请不要再问我Spring Cloud底层原理](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be13b83f265da6116393fc7) > > 2、[【双11狂欢的背后】微服务注册中心如何承载大型系统的千万级访问?](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be3f8dcf265da613a5382ca) > > 3、[【性能优化之道】每秒上万并发下的Spring Cloud参数优化实战](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be83e166fb9a049a7115580) > > 4、[微服务架构如何保障双11狂欢下的99.99%高可用](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be99a68e51d4511a8090440) > > 5、[兄弟,用大白话告诉你小白都能听懂的Hadoop架构原理](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5beaf02ce51d457e90196069) > > 6、[大规模集群下Hadoop NameNode如何承载每秒上千次的高并发访问](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5bec278c5188253e64332c76) > > 7、【[性能优化的秘密】Hadoop如何将TB级大文件的上传性能优化上百倍](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5bed82a9e51d450f9461cfc7) > > [8、](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf2c6b6e51d456693549af4)[拜托,面试请不要再问我TCC分布式事务的实现原理坑爹呀!](https://juejin.cn/post/6844903716089233416) [](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf2c6b6e51d456693549af4) > > 9、[【坑爹呀!】最终一致性分布式事务如何保障实际生产中99.99%高可用?](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf2c6b6e51d456693549af4) > > 10、[拜托,面试请不要再问我Redis分布式锁的实现原理!](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf3f15851882526a643e207) > > **11、****[【眼前一亮!】看Hadoop底层算法如何优雅的将大规模集群性能提升10倍以上?](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf5396f51882509a768067e)** > > **12、****[亿级流量系统架构之如何支撑百亿级数据的存储与计算](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Fjuejin.im%25252525252Fpost%25252525252F5bfab59fe51d4551584c7bcf)** > > 13、[亿级流量系统架构之如何设计高容错分布式计算系统](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Fjuejin.im%252525252Fpost%252525252F5bfbeeb9f265da61407e9679) > > 14、[亿级流量系统架构之如何设计承载百亿流量的高性能架构](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Fjuejin.im%2525252Fpost%2525252F5bfd2df1e51d4574b133dd3a) > > 15、[亿级流量系统架构之如何设计每秒十万查询的高并发架构](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Fjuejin.im%25252Fpost%25252F5bfe771251882509a7681b3a) > > 16、[亿级流量系统架构之如何设计全链路99.99%高可用架构](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Fjuejin.im%252Fpost%252F5bffab686fb9a04a102f0022) > > 17、[七张图彻底讲清楚ZooKeeper分布式锁的实现原理](https://link.juejin.im/?target=https%3A%2F%2Fjuejin.im%2Fpost%2F5c01532ef265da61362232ed) > > 18、[大白话聊聊Java并发面试问题之volatile到底是什么?](https://juejin.cn/post/6844903730303746061) > > 19、[大白话聊聊Java并发面试问题之Java 8如何优化CAS性能?](https://juejin.cn/post/6844903731234865160) > > 20、[大白话聊聊Java并发面试问题之谈谈你对AQS的理解?](https://juejin.cn/post/6844903732061159437) > > 21、[大白话聊聊Java并发面试问题之公平锁与非公平锁是啥?](https://juejin.cn/post/6844903732883226637) > > 22、[大白话聊聊Java并发面试问题之微服务注册中心的读写锁优化](https://juejin.cn/post/6844903734267510798) > > 23、[互联网公司的面试官是如何360°无死角考察候选人的?(上篇)](https://juejin.cn/post/6844903734930046989) > > 24、[互联网公司面试官是如何360°无死角考察候选人的?(下篇)](https://juejin.cn/post/6844903735655661581) > > 25、[Java进阶面试系列之一:哥们,你们的系统架构中为什么要引入消息中间件?](https://juejin.cn/post/6844903736444207117) > > 26、[【Java进阶面试系列之二】:哥们,那你说说系统架构引入消息中间件有什么缺点?](https://juejin.cn/post/6844903737123667975) > > 27、[【行走的Offer收割机】记一位朋友斩获BAT技术专家Offer的面试经历](https://juejin.cn/post/6844903741213130765) > > 28、[【Java进阶面试系列之三】哥们,消息中间件在你们项目里是如何落地的?](https://juejin.cn/post/6844903742114906125) > > 29、[【Java进阶面试系列之四】扎心!线上服务宕机时,如何保证数据100%不丢失?](https://juejin.cn/post/6844903742928601095) > > 30、[一次JVM FullGC的背后,竟隐藏着惊心动魄的线上生产事故!](https://juejin.cn/post/6844903743712935944)**