Follow our WeChat official account: Shishan100
This article covers the resolution of an incident in an online production system: a serious failure caused by JVM Full GC.
1. Introduction to the business scenario
First, some brief background on the online production system. Since it serves only as a case study here, most of the business context is left out.
Put simply, this is a distributed system. System A needs to transfer a piece of core, critical data to another system, B, over a network request.
This raises a question: if system A has just transferred the core data to system B and system B then crashes unexpectedly, won't the data be lost?
To address this, the architecture of this distributed system adopts the classic Quorum algorithm.
Put simply, system B is deployed on an odd number of nodes: at least 3 machines, or 5, or 7, and so on.
Each time system A transfers a piece of data to system B, it must send a request to every machine deployed by system B, so that each of them receives a copy.
For a write from system A to system B to count as successful, system A must successfully transfer the data to at least a Quorum of system B's machines within the specified time.
For example, if system B is deployed on three machines, its Quorum is 3 / 2 + 1 = 2 (using integer division), i.e., the total number of machines / 2 + 1.
So when system A needs to determine whether a piece of core data has been written successfully and system B has three machines, system A must receive a write-success response from two of them within the specified period of time.
Only then can system A consider the data successfully written to system B. This is known as the Quorum mechanism.
In other words, when data is transferred between systems in a distributed architecture, one system can be sure that the data it hands to another will not be lost only if a Quorum (majority) of the receiving system's machines respond with success within a specified period of time.
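The Quorum arithmetic above can be written down in a few lines of Java. This is only a minimal sketch: the names `QuorumMath`, `quorum`, and `isWriteSuccessful` are hypothetical and do not come from the original system.

```java
// Hypothetical helper for illustration; not taken from the original system.
final class QuorumMath {
    private QuorumMath() {}

    /** Majority of a cluster: nodeCount / 2 + 1, relying on Java integer division. */
    static int quorum(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        return nodeCount / 2 + 1;
    }

    /** A write counts as successful once the number of acks reaches the quorum. */
    static boolean isWriteSuccessful(int acks, int nodeCount) {
        return acks >= quorum(nodeCount);
    }
}
```

With three machines, two success responses are enough; with five machines, three are required.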
In fact, this mechanism is widely used in distributed systems and middleware. Our online distributed system also uses this Quorum mechanism to transfer data between the two systems.
Here is a picture of what this architecture looks like.
As the figure above shows, this is how the Quorum mechanism transfers a piece of data between systems A and B.
Next, let's look at what the Quorum write mechanism above might look like at the code level.
The code is heavily simplified, but its core meaning is clear. Read it a couple of times and it is easy to follow.
It asynchronously starts threads that send the data to all of system B's machines, then enters a while loop that waits for a Quorum of system B's machines to return a result.
If the expected number of machines has not returned a result once the specified timeout has elapsed, the cluster deployed by system B is judged to be faulty and system A exits immediately, which is equivalent to system A going down.
That is all the code does.
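A minimal sketch of such a write loop, under stated assumptions, might look like the following. Each machine of system B is modeled as a `BooleanSupplier` that returns true on a successful write; `QuorumWriter`, `write`, `timeoutMillis`, `pollMillis`, and the order of the two checks inside the loop are all assumptions made for illustration. Where the article says system A exits, the sketch returns false so that the caller decides what to do.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the Quorum write described in the text.
final class QuorumWriter {
    private QuorumWriter() {}

    /**
     * Sends the data to every node on its own thread, then polls until a
     * quorum of acks is visible or the deadline has passed.
     * Returns true on quorum, false on timeout or interruption.
     */
    static boolean write(List<BooleanSupplier> nodes, long timeoutMillis, long pollMillis) {
        int quorum = nodes.size() / 2 + 1;
        AtomicInteger acks = new AtomicInteger();

        // Asynchronously send one copy to each machine of system B.
        for (BooleanSupplier node : nodes) {
            Thread sender = new Thread(() -> {
                if (node.getAsBoolean()) {
                    acks.incrementAndGet();
                }
            });
            sender.setDaemon(true);
            sender.start();
        }

        long expireTime = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            long now = System.currentTimeMillis();
            if (now > expireTime) {
                // Timeout: the system described in the article exits here.
                return false;
            }
            if (acks.get() >= quorum) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller that mirrors the behavior described above would terminate the process, for example with `System.exit(1)`, when `write` returns false.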
2. The problem surfaces
The code looks simple enough, but running it in production is not as easy as writing it.
Once, while the online production system was running with a very stable overall load and no reason to expect trouble, we suddenly received an alarm saying that system A had gone down.
We started investigating, checking everything left and right, and found that the system B cluster was fine; nothing there should have caused a problem.
Then we checked system A and found nothing wrong with system A itself either.
Finally, by combining system A's logs with its JVM garbage collection logs, we worked out the specific cause of the failure.
3. Locating the problem
In fact, the reason is very simple. After running online for a while, system A occasionally performs a JVM Full GC, a garbage collection over a large part of the heap, that stops the world for a long time.
While that happens, a large number of worker threads in system A stop working. They do not resume until the JVM Full GC has finished.
Let’s look at the following code snippet:
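One way to see the effect is to replay the wait loop against a scripted clock instead of the real one. In this sketch, `StallReplay`, its parameters, and the order of the checks are assumptions for illustration; the 30-second timeout in the usage note is likewise illustrative.

```java
// Hypothetical replay of the wait loop with scripted wake-up times.
final class StallReplay {
    private StallReplay() {}

    /**
     * wakeUpTimes[i] is the time (ms) at which loop iteration i runs;
     * acksAt[i] is the number of acks visible at that moment.
     * Returns true if the loop sees a quorum, false if the timeout
     * branch fires first or the script ends without a quorum.
     */
    static boolean replay(long[] wakeUpTimes, int[] acksAt, long expireTime, int quorum) {
        for (int i = 0; i < wakeUpTimes.length; i++) {
            long now = wakeUpTimes[i];
            if (now > expireTime) {
                return false; // the `if` that makes system A exit
            }
            if (acksAt[i] >= quorum) {
                return true;  // quorum reached, leave the loop
            }
        }
        return false;
    }
}
```

With a 30-second deadline and iterations at 0 ms, 1,000 ms, and 2,000 ms, the loop reaches a quorum normally. If the third iteration is delayed to 41,000 ms by a stop-the-world pause, the timeout branch fires even though all three acks are already available.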
Without the JVM Full GC, the condition of this if statement would never have been true.
The loop would pause for one second and enter the next iteration of the while loop, by which time a Quorum of responses from system B would have arrived, and the while loop would break so that execution could continue.
Instead, the JVM Full GC stalled the system for dozens of seconds, so by the time the thread resumed, the timeout had already passed. The if condition evaluated to true, and system A inexplicably exited and went down.
Long stalls caused by JVM Full GC in production are one of the hidden killers of system stability.
4. Solving the problem
Making the code above more stable is also simple. We just need to add something to the code that detects whether a JVM Full GC has occurred.
If a JVM Full GC occurs, expireTime is automatically extended.
For example, the code can be improved as follows:
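A minimal sketch of one possible improvement follows. It assumes a pause can be inferred when the gap between two loop iterations is much larger than the sleep interval, and it pushes `expireTime` back by the time lost. The class `PauseAwareDeadline`, the `pauseThresholdMillis` parameter, and this gap-based detection are assumptions; the article does not show how the production code detects a Full GC.

```java
// Hypothetical deadline that extends itself when the loop was stalled.
final class PauseAwareDeadline {
    private final long pollMillis;
    private final long pauseThresholdMillis;
    private long expireTime;
    private long lastCheck;

    PauseAwareDeadline(long now, long timeoutMillis, long pollMillis, long pauseThresholdMillis) {
        this.pollMillis = pollMillis;
        this.pauseThresholdMillis = pauseThresholdMillis;
        this.expireTime = now + timeoutMillis;
        this.lastCheck = now;
    }

    /**
     * Call once per loop iteration with the current time in milliseconds.
     * If the iteration arrived much later than the sleep interval allows,
     * the excess is treated as a pause and added to expireTime.
     * Returns true only if the (possibly extended) deadline has passed.
     */
    boolean isExpired(long now) {
        long gap = now - lastCheck;
        lastCheck = now;
        long stall = gap - pollMillis;
        if (stall > pauseThresholdMillis) {
            expireTime += stall; // give back the time lost to the pause
        }
        return now > expireTime;
    }

    long expireTime() {
        return expireTime;
    }
}
```

In the wait loop, the plain comparison against `expireTime` would be replaced by a call such as `deadline.isExpired(System.currentTimeMillis())`. A gap-based check cannot tell a Full GC apart from other stalls; reading the JVM's `GarbageCollectorMXBean` counters is another option if the extension should be tied to garbage collection specifically.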
With the improvements above, the stability of the online system is effectively improved: it no longer exits abnormally at random when a JVM Full GC occurs.
END
Architecture Notes for Hugesia (ID: Shishan100)
More than ten years of architecture experience at BAT