Why conduct a fault drill?
High request volumes, holiday peak traffic, and growing system complexity together make both expected and unexpected failures likely. In many cases, incident response plans are missing or unreliable and developers have little hands-on experience with failures, so when alarms fire on all sides the team is thrown into confusion and the best window for response is missed. Abnormal faults that have never shown up in production are especially likely to catch everyone unprepared when they suddenly do.
Is the system robust enough? Does it have enough capacity to cope with a failure? How does it behave when a failure actually occurs? We do not want to answer these questions only when a real online failure happens; that is too risky and too costly. Instead, we want to simulate every plausible failure in advance, in an online environment isolated from real traffic, observe how the system responds, and verify that the planned strategies work as expected.
To sum up, the main objectives of a fault drill are as follows:
- Make sure the system responds to failures the way we expect it to
- Look for unexpected weaknesses in the system
- Find further ways to improve the robustness of the system so that real incidents are avoided
Ideally this becomes a streamlined routine: run fault drills regularly, identify the system's risk points, optimize the business systems, and produce workable, effective response plans.
What is a fault drill?
Fault drills are the core of evaluating an application's high-availability capability. A complete fault drill consists of the objects to be drilled, the specific faults injected into those objects, the application's expected behavior under those faults, and the observation and assessment of its actual behavior.
(1) Drill objects
The drill object is where the fault is injected: it can be the application itself, the application's downstream dependencies, or the machine the application runs on.
(2) Faults on the drill objects
Common fault types are as follows:
| Fault type | Examples |
|---|---|
| Dependent RPC service failure | Timeout / unavailable |
| Middleware failure | Kafka timeout/unavailable, Redis timeout/unavailable |
| Infrastructure failure | Database timeout/unavailable, DNS timeout/unavailable |
| Machine failure | CPU fully loaded, NIC traffic saturated, network interrupted, machine down, equipment room power outage, disk space full |
| Abnormal traffic | Incoming traffic surges or drops to zero |
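To make the first row concrete: a fault in a dependent RPC service can be simulated by wrapping the client and injecting latency or an error before the real call. The sketch below only illustrates the idea and is not tied to any particular fault-injection framework; `UserClient` and the field names are hypothetical.

```go
package faultdrill

import (
	"context"
	"errors"
	"time"
)

// UserClient is a hypothetical downstream RPC client, used only for illustration.
type UserClient interface {
	GetUser(ctx context.Context, id int64) (string, error)
}

// RPCFault describes the fault to inject for this dependency.
type RPCFault struct {
	Enabled     bool          // master switch for the drill
	ExtraDelay  time.Duration // simulate "timeout": add latency before the real call
	Unavailable bool          // simulate "unavailable": fail without calling downstream
}

// ErrInjectedUnavailable marks errors produced by the drill itself.
var ErrInjectedUnavailable = errors.New("fault drill: dependency unavailable")

// faultyUserClient wraps a real client and injects the configured fault.
type faultyUserClient struct {
	real  UserClient
	fault RPCFault
}

func (c *faultyUserClient) GetUser(ctx context.Context, id int64) (string, error) {
	if c.fault.Enabled {
		if c.fault.Unavailable {
			return "", ErrInjectedUnavailable
		}
		if c.fault.ExtraDelay > 0 {
			select {
			case <-time.After(c.fault.ExtraDelay): // simulated slow dependency
			case <-ctx.Done():
				return "", ctx.Err() // caller's own deadline fires first
			}
		}
	}
	return c.real.GetUser(ctx, id)
}
```

In a real setup the `RPCFault` values would usually come from a configuration center, so the fault can be switched on and off during the drill without redeploying.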
(3) Expected behavior of the application in response to the fault
In other words, the response plan. For each fault to be drilled, prepare a fault response plan. A plan template looks like this:
| Link/Scenario | Fault | Drillable? | Impact | Response plan | Plan execution SOP | Impact of executing the plan | Plan cancellation condition | Plan cancellation SOP | Fallback if the plan fails |
|---|---|---|---|---|---|---|---|---|---|
(4) Observing and assessing the application's actual behavior
Use the monitoring system to observe the application's key indicators: exception counts, traffic, business curves, machine load, and any other place that may be affected by the fault.
How to do a fault drill?
Before the fault drill
(1) Verify the essential basic capabilities
- The application must be able to propagate dye-marked (drill) traffic along the whole service link (a minimal sketch follows this list)
- It must be possible to simulate failures of downstream dependent services
- The application must be able to record, replay, and isolate request traffic
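A minimal sketch of the dye-marking ability, assuming the mark travels as an HTTP header; the header name `X-Fault-Drill-Dye` and the helper names are made up for illustration, and the same idea applies to RPC metadata.

```go
package dye

import (
	"context"
	"net/http"
)

// Header used to mark drill ("dyed") traffic; the real name is team-specific.
const dyeHeader = "X-Fault-Drill-Dye"

type dyeKey struct{}

// Middleware copies the dye mark from the inbound request into the context.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		if r.Header.Get(dyeHeader) != "" {
			ctx = context.WithValue(ctx, dyeKey{}, true)
		}
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// IsDyed reports whether the current request is drill traffic.
func IsDyed(ctx context.Context) bool {
	v, _ := ctx.Value(dyeKey{}).(bool)
	return v
}

// Propagate copies the dye mark onto an outbound request so the mark
// travels along the whole service link.
func Propagate(ctx context.Context, out *http.Request) {
	if IsDyed(ctx) {
		out.Header.Set(dyeHeader, "1")
	}
}
```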
(2) Determine the scope and environment of the fault drill
- Which request traffic should faults be injected into?
    - Decision rule: choose the request traffic of the core service links
    - Recommended practice: analyze the service links and identify the core ones
- Which downstream service failures should be simulated?
    - Decision principle: the downstream service has a high probability of failing, its failure affects a wide range of business, or its failure would hit core business
    - Recommended practice: use link analysis to determine which downstream services the core links depend on, and reverse dependency analysis to determine which service links a given downstream fault would affect, then evaluate the affected business scope (see the sketch after this list)
- In which environment should the faults be simulated?
    - Decision rule: the closer the environment is to the online production environment, the better
    - Recommended practice: run the drill on an online machine that carries no real traffic (with external real traffic cut off)
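The link-analysis and reverse-dependency steps can be approximated from tracing data: extract caller-to-downstream edges for the core links, then collect, for each downstream service, which core links depend on it. The sketch below assumes the edges have already been pulled out of the tracing system; `Edge` and its field names are illustrative.

```go
package linkanalysis

// Edge is one "core link depends on downstream service" pair
// extracted from tracing data.
type Edge struct {
	Link       string // name of the core service link (e.g. an entry API)
	Downstream string // downstream service that link depends on
}

// ReverseDeps returns, for every downstream service, the set of core links
// that depend on it. Services touched by many core links are strong
// candidates for fault simulation.
func ReverseDeps(edges []Edge) map[string]map[string]bool {
	deps := make(map[string]map[string]bool)
	for _, e := range edges {
		if deps[e.Downstream] == nil {
			deps[e.Downstream] = make(map[string]bool)
		}
		deps[e.Downstream][e.Link] = true
	}
	return deps
}
```

For example, if the result shows a cache appearing on every core link while some auxiliary service appears on only one, the cache fault is the higher-priority scenario to drill.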
(3) Replay traffic isolation and shadow table isolation
- Traffic isolation: replayed drill traffic must be kept separate from real user traffic
- Shadow table isolation: data written by drill traffic goes to shadow tables rather than production tables (see the sketch after this list)
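One common way to implement shadow table isolation is to rewrite table names in the data-access layer for dyed traffic, so drill writes land in, say, `orders_shadow` instead of `orders`. A minimal sketch, assuming a suffix naming convention and the dye mark from the earlier snippet:

```go
package shadow

// shadowSuffix is an assumed naming convention for shadow tables.
const shadowSuffix = "_shadow"

// TableFor routes drill ("dyed") traffic to the shadow table and real traffic
// to the production table, so replayed drill writes never touch production data.
// The dyed flag would come from the dye mark shown earlier.
func TableFor(table string, dyed bool) string {
	if dyed {
		return table + shadowSuffix
	}
	return table
}
```

In practice this rewriting is usually done once in the ORM or database middleware so that business code does not need to change.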
(4) Develop fault response plans
Prepare a fault response plan for each fault scenario to be drilled.
- Plan formulation principles: a plan may target one or more fault or risk types; it must state under what conditions it can be started and withdrawn and what preconditions it needs to be effective, i.e. once started it genuinely reduces the impact of the fault it targets; and it must state what additional impact starting the plan may cause and whether it could trigger new faults (a sketch of such a plan record follows)
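To make these principles concrete, a plan can be kept as structured data whose fields mirror the template table above, with explicit trigger and cancel conditions and known side effects. The fields and example values below are purely illustrative.

```go
package plan

// ResponsePlan records one row of the plan template as structured data.
type ResponsePlan struct {
	Scenario       string   // link / scenario the plan covers
	Faults         []string // one or more fault or risk types the plan targets
	TriggerWhen    string   // condition under which the plan may be started
	ExpectedEffect string   // what "effective" means once the plan is started
	SideEffects    string   // extra impact the plan itself may cause
	CancelWhen     string   // condition under which the plan is withdrawn
	Fallback       string   // what to do if executing the plan itself fails
}

// Example entry, with made-up values.
var redisTimeoutPlan = ResponsePlan{
	Scenario:       "order detail page",
	Faults:         []string{"Redis timeout", "Redis unavailable"},
	TriggerWhen:    "cache p99 latency above threshold for several minutes",
	ExpectedEffect: "fall back to DB reads; page error rate stays within SLO",
	SideEffects:    "higher database load",
	CancelWhen:     "cache latency back to normal for a sustained period",
	Fallback:       "degrade the module and hide non-critical content",
}
```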
(5) Configure the faults to be injected
(6) Determine the objectives of the drill
- Confirm that the fault response plan is effective, that is, the fault's impact range shrinks after the plan is started
- Confirm that when the fault occurs, the business processes behave as expected (judged through business metrics, event-tracking monitoring, and the relevant business link-tracing tools)
- Confirm that the load indicators of the application machines stay within the expected range (via alarms from the basic tooling), setting additional checkpoints according to the application's business characteristics (see the sketch after this list)
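The load objective can be checked mechanically by comparing observed indicators against the expected ceilings for the drill window. In the sketch below the metric values are assumed to have been fetched from the monitoring system already; names and bounds are placeholders.

```go
package checkpoint

import "fmt"

// Checkpoint is one expectation about a load or business indicator during the drill.
type Checkpoint struct {
	Metric   string  // e.g. "cpu_usage_percent", "error_rate"
	Observed float64 // value read from the monitoring system during the drill
	Max      float64 // upper bound that still counts as "within expectation"
}

// Verify returns a message for every checkpoint that exceeded its bound;
// an empty result means the load objective was met.
func Verify(points []Checkpoint) []string {
	var violations []string
	for _, p := range points {
		if p.Observed > p.Max {
			violations = append(violations,
				fmt.Sprintf("%s = %.2f exceeds expected max %.2f", p.Metric, p.Observed, p.Max))
		}
	}
	return violations
}
```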
(7) Train participating internal personnel
(8) Notify the external personnel involved
Based on the impact range, notify the RD of the affected applications, the O&M RD, and the RD of the basic components involved. The notice should cover:
- The application initiating the fault drill
- The start and end time of the drill
- The application cluster/environment used for the drill
- The estimated impact on each related application, for example the peak QPS of calls to downstream dependent services and the expected rate of abnormal requests seen by upstream services
Recommended practice: add everyone involved to a work group named "XXX Application Fault Drill", send the drill notification to that group, and coordinate there.
During the fault drill
- Replay the recorded online traffic, under progressively increasing pressure, to a machine of the initiating application that carries no real traffic
- Turn on the application's fault simulation switch and observe the fault's impact. Note: to keep real traffic unaffected, faults are injected only into dyed traffic (see the sketch after this list)
- Start the application's fault response plan
    - Check whether the fault's impact is eliminated or reduced as expected
    - Observe business indicators
    - Observe machine load indicators
    - Verify that the business processes behave as expected (for example, the XX module is no longer displayed and the YY interface is no longer requested)
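The note about faulting only dyed traffic can be enforced at the injection point itself: every injection site consults a single guard that requires both the drill switch and the dye mark, so real traffic passes through untouched even while the switch is on. A minimal sketch, with the dye check standing in for the one shown earlier:

```go
package drill

import "context"

// faultSwitchOn is the drill's global fault-simulation switch
// (in practice driven by a configuration center).
var faultSwitchOn bool

type dyeKey struct{}

// isDyed stands in for the dye check from the earlier sketch.
func isDyed(ctx context.Context) bool {
	v, _ := ctx.Value(dyeKey{}).(bool)
	return v
}

// ShouldInjectFault is the single guard called at every injection point:
// faults are injected only when the switch is on AND the request is dyed,
// so real user traffic is never affected by the drill.
func ShouldInjectFault(ctx context.Context) bool {
	return faultSwitchOn && isDyed(ctx)
}
```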
After the fault drill
- Cleanup
    - Shut down the replayed traffic or the traffic isolation task
    - Turn off the fault simulation switch and withdraw the response plan
    - Clean up data, caches, logs, and so on written during the drill (optional)
    - Reset any service configuration switches that were modified during the drill
    - Restart the application
- Notify the relevant personnel that the drill is over
- Drill report and summary
    - Whether the expected goals were met: did the plan take effect, did the business processes run as expected, was the machine load normal
    - Whether anything unexpected happened
    - Collect the key indicators (business indicators, machine load indicators)
    - Sort out follow-up improvement points
When should fault drills be done?
Faults should be accumulated as reusable scenarios and simulated online at a controllable cost, so that the system and its developers get more hands-on practice in ordinary times and the systems, tools, processes, and people mature faster.
Make drills routine and set up a regular drill cycle.
Follow-up plan for fault drills
Follow-up work on fault drills focuses on three aspects: making drills routine, standardizing fault classification, and making drills intelligent.
Use regular drills to drive stability improvements, enrich the set of failure scenarios, and define minimal failure scenarios and their handling methods; build intelligent drills on top of architecture and business analysis, and accumulate industry-level fault drill solutions.
Original link: www.jackielee.cn/posts/6fb69…