Welcome to follow our wechat official account: Shishan100

My new course ** “C2C e-commerce System Micro-service Architecture 120-day Practical Training Camp” is online in the public account ruxihu Technology Nest **, interested students, you can click the link below for details:

120-Day Training Camp of C2C E-commerce System Micro-Service Architecture

“This article is about the resolution of an online production system accident, which represents a serious failure of the JVM FullGC in an online production system.

1. Introduction to service scenarios

Firstly, I will briefly talk about a background of online production system. Since this paper is only used as a case, a large number of business backgrounds will be weakened.

To put it simply, this is A distributed system. System A needs to transfer A very core and critical data to another system B through network request.

Therefore, A question is taken into consideration here. If system A has just transferred core data to system B, and system B crashes for no reason, won’t the data be lost?

Therefore, in the architecture design of this distributed system, a very classical Quorum algorithm is adopted.

The algorithm is simply that system B has to deploy an odd number of nodes, say at least 3 machines, or 5 machines, 7 machines, something like that.

Then each time system A transfers A data to the system, it must send A request to all the machines deployed by system B, transmitting A copy to all the machines deployed by system B.

To determine that A write from System A to system B is successful, system A must successfully transfer data to the machines of system B that exceed the Quorum number within the specified time range.

For example, if system B has three machines deployed, its Quorum is: 3/2 + 1 = 2, i.e., the number of all machines / 2 + 1.

Therefore, system A needs to determine whether A core data has been written successfully. If system B has three machines deployed, system A must receive A write success response from two machines of system B within A specified period of time.

Only then can system A consider the data to be written successfully to system B. This is known as the Quorum mechanism.

That is, in distributed architecture, data is transferred between systems, and for one system to ensure that the data it transfers to another system will not be lost, the machines receiving the Quorum (majority) of the other system must respond with success within a specified period of time.

In fact, this mechanism is widely used in many distributed systems and middleware systems. Our online distributed system also adopts this Quorum mechanism to transfer data between two systems.

I want to give you a picture of what this architecture looks like.

As shown in the figure above, the Quorum mechanism for transferring A piece of data between systems A and B is clearly demonstrated.

Next, we’ll show you in code what the Quorum write mechanism above might look like at the code level.

PS: Because the actual mechanism involves a large number of underlying network transmission, communication, fault tolerance and optimization, the following code is greatly simplified and only expresses a core meaning.

This is the code after a lot of simplification, but the core meaning is clear. You can watch it twice, but it’s actually pretty easy to figure out.

Asynchronously start the thread to send data to all the machines of system B, while entering a while loop waiting for the Quorum machines of system B to return the result.

If the expected number of machines do not return A result after the specified timeout period, the cluster deployed by system B is judged to be faulty and system A is directly quit, which is equivalent to system A down.

The whole code, that’s what it means!

Two, the problem protrudes

It’s not that hard just to look at the code, but the problem is that running it online is not as easy as you think when you’re writing the code.

Once, during the operation of the online production system, the overall system load was very stable. There should have been no problem, but suddenly we received an alarm saying that system A suddenly broke down.

Then began to check, left check right check, found that system B cluster is fine, should not have a problem.

And then I check system A, and there’s nothing wrong with system A.

Finally, by combining the logs of system A and the garbage collection logs of system A’s JVM FullGC, we can figure out the specific cause of the failure.

Third, positioning problems

In fact, the reason is very simple. After system A runs online for A period of time, it will occasionally carry out JVM FullGC that stops the World for A long time, which is A large area of garbage collection.

However, A large number of worker threads in system A will stop working. The worker thread will not resume running until JVM FullGC has finished.

Let’s look at the following code snippet:

But this is not true because the if statement would not have been true without the JVM FullGC.

It will pause for a second to enter the next while loop, which will then receive a Quorum number return from system B, and the while loop will be interrupted and continue.

As A result, JVM FullGC was delayed for dozens of seconds, which somehow triggered the execution of if judgment, and system A inexplicably quit and crashed.

JVM FullGC on line causes system to stall for a long time, which is one of the hidden killers of system instability.

Fourth, solve the problem

As for the above code stability optimization, it is also simple. We just have to put something in the code to monitor if JVM FullGC is happening in the code above.

If a JVM FullGC occurs, expireTime is automatically extended.

For example, the following code improvements:

Through the above code improvements, the stability of the online system can be effectively optimized to ensure that it will not randomly occur abnormal outage exit in the case of JVM FullGC.

END

If there is any harvest, please help to forward, your encouragement is the biggest power of the author, thank you!

A large wave of micro services, distributed, high concurrency, high availability of original series of articles is on the way

Please scan the qr code belowContinue to pay attention to:

Architecture Notes for Hugesia (ID: Shishan100)

More than ten years of EXPERIENCE in BAT architecture

**> **推荐阅读:** > > 1、[拜托!面试请不要再问我Spring Cloud底层原理](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be13b83f265da6116393fc7) > > 2、[【双11狂欢的背后】微服务注册中心如何承载大型系统的千万级访问?](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be3f8dcf265da613a5382ca) > > 3、[【性能优化之道】每秒上万并发下的Spring Cloud参数优化实战](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be83e166fb9a049a7115580) > > 4、[微服务架构如何保障双11狂欢下的99.99%高可用](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5be99a68e51d4511a8090440) > > 5、[兄弟,用大白话告诉你小白都能听懂的Hadoop架构原理](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5beaf02ce51d457e90196069) > > 6、[大规模集群下Hadoop NameNode如何承载每秒上千次的高并发访问](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5bec278c5188253e64332c76) > > 7、【[性能优化的秘密】Hadoop如何将TB级大文件的上传性能优化上百倍](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Flink.juejin.im%2525252525252F%2525252525253Ftarget%2525252525253Dhttps%252525252525253A%252525252525252F%252525252525252Flink.juejin.im%252525252525252F%252525252525253Ftarget%252525252525253Dhttps%25252525252525253A%25252525252525252F%25252525252525252Fjuejin.im%25252525252525252Fpost%25252525252525252F5bed82a9e51d450f9461cfc7) > > [8、](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf2c6b6e51d456693549af4)[拜托,面试请不要再问我TCC分布式事务的实现原理坑爹呀!](https://juejin.cn/post/6844903716089233416) [](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf2c6b6e51d456693549af4) > > 9、[【坑爹呀!】最终一致性分布式事务如何保障实际生产中99.99%高可用?](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf2c6b6e51d456693549af4) > > 10、[拜托,面试请不要再问我Redis分布式锁的实现原理!](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf3f15851882526a643e207) > > **11、****[【眼前一亮!】看Hadoop底层算法如何优雅的将大规模集群性能提升10倍以上?](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Flink.juejin.im%25252525253Ftarget%25252525253Dhttps%2525252525253A%2525252525252F%2525252525252Fjuejin.im%2525252525252Fpost%2525252525252F5bf5396f51882509a768067e)** > > **12、****[亿级流量系统架构之如何支撑百亿级数据的存储与计算](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Flink.juejin.im%252525252F%252525253Ftarget%252525253Dhttps%25252525253A%25252525252F%25252525252Fjuejin.im%25252525252Fpost%25252525252F5bfab59fe51d4551584c7bcf)** > > 13、[亿级流量系统架构之如何设计高容错分布式计算系统](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Flink.juejin.im%2525253Ftarget%2525253Dhttps%252525253A%252525252F%252525252Fjuejin.im%252525252Fpost%252525252F5bfbeeb9f265da61407e9679) > > 14、[亿级流量系统架构之如何设计承载百亿流量的高性能架构](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Flink.juejin.im%25253Ftarget%25253Dhttps%2525253A%2525252F%2525252Fjuejin.im%2525252Fpost%2525252F5bfd2df1e51d4574b133dd3a) > > 15、[亿级流量系统架构之如何设计每秒十万查询的高并发架构](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Flink.juejin.im%252F%253Ftarget%253Dhttps%25253A%25252F%25252Fjuejin.im%25252Fpost%25252F5bfe771251882509a7681b3a) > > 16、[亿级流量系统架构之如何设计全链路99.99%高可用架构](https://link.juejin.im/?target=https%3A%2F%2Flink.juejin.im%3Ftarget%3Dhttps%253A%252F%252Fjuejin.im%252Fpost%252F5bffab686fb9a04a102f0022) > > 17、[七张图彻底讲清楚ZooKeeper分布式锁的实现原理](https://link.juejin.im/?target=https%3A%2F%2Fjuejin.im%2Fpost%2F5c01532ef265da61362232ed) > > 18、[大白话聊聊Java并发面试问题之volatile到底是什么?](https://juejin.cn/post/6844903730303746061) > > 19、[大白话聊聊Java并发面试问题之Java 8如何优化CAS性能?](https://juejin.cn/post/6844903731234865160) > > 20、[大白话聊聊Java并发面试问题之谈谈你对AQS的理解?](https://juejin.cn/post/6844903732061159437) > > 21、[大白话聊聊Java并发面试问题之公平锁与非公平锁是啥?](https://juejin.cn/post/6844903732883226637) > > 22、[大白话聊聊Java并发面试问题之微服务注册中心的读写锁优化](https://juejin.cn/post/6844903734267510798) > > 23、[互联网公司的面试官是如何360°无死角考察候选人的?(上篇)](https://juejin.cn/post/6844903734930046989) > > 24、[互联网公司面试官是如何360°无死角考察候选人的?(下篇)](https://juejin.cn/post/6844903735655661581) > > 25、[Java进阶面试系列之一:哥们,你们的系统架构中为什么要引入消息中间件?](https://juejin.cn/post/6844903736444207117) > > 26、[【Java进阶面试系列之二】:哥们,那你说说系统架构引入消息中间件有什么缺点?](https://juejin.cn/post/6844903737123667975) > > 27、[【行走的Offer收割机】记一位朋友斩获BAT技术专家Offer的面试经历](https://juejin.cn/post/6844903741213130765) > > 28、[【Java进阶面试系列之三】哥们,消息中间件在你们项目里是如何落地的?](https://juejin.cn/post/6844903742114906125) > > 29、[【Java进阶面试系列之四】扎心!线上服务宕机时,如何保证数据100%不丢失?](https://juejin.cn/post/6844903742928601095)**