Preface
Have you ever been woken by a phone call in the middle of the night and plunged into a frantic round of troubleshooting and service restoration? Perhaps a small change made before bed hit an unexpected scenario and, through a butterfly effect, caused widespread system chaos, failures and service interruptions that affected customers' business.
Incidents like this are still reported across the IT industry, even where monitoring, alerting and troubleshooting procedures are adequate. The crux of the question is how much confidence we can place in complex systems once they go into production. Monitoring, alerting and troubleshooting are passive, after-the-fact responses. Is it possible to detect the weaknesses of these complex systems in advance?
Chaos engineering originated as an internal practice at Netflix and has since been widely adopted across the industry. There are now even full-time Chaos Engineers, which sounds rather impressive, doesn't it?
Before you panic about chaos engineering, let's start with three fundamental questions:
- What is chaos engineering
- Why chaos engineering
- How to implement chaos engineering
Finally, I introduce React Chaos, a small tool for front-end chaos engineering, and explain how it works internally. By looking at how chaos engineering applies to front-end scenarios, I hope to help improve the robustness of front-end systems.
What is chaos engineering
Definition of Chaos Engineering
Chaos engineering means conducting controlled, experience-guided experiments on distributed systems to observe system behavior and discover weaknesses, in order to build the capability and confidence to withstand the chaos that unexpected conditions cause as the system grows in size.
Introduction to the development of chaos engineering
In August 2008, a glitch in Netflix’s main database led to a three-day outage that disrupted DVD rentals and affected large numbers of customers in several countries.
Netflix engineers then set out to find an alternative architecture, and since 2011, the system has gradually migrated to AWS to run a new distributed architecture based on microservices.
This architecture eliminates single points of failure, but it also introduces new kinds of complexity that demand more reliable and fault-tolerant systems. To address this, Netflix engineers created Chaos Monkey, which randomly terminates EC2 instances running in production. Engineers can quickly see whether the service they are building is robust and resilient enough to tolerate unplanned failures.
At this point, chaos engineering began to rise.
The timeline of chaos engineering's evolution since 2010 is as follows:
- In 2010, Netflix internally developed Chaos Monkey, a chaos experiment tool that randomly terminates EC2 instances on the AWS cloud
- In 2011, Netflix released its "monkey army" toolset, Simian Army
- In 2012, Netflix open-sourced Simian Army, written in Java and including Chaos Monkey V1, to the community
- In 2014, Netflix started recruiting Chaos Engineers
- In 2014, Netflix proposed Failure Injection Testing (FIT), which uses the characteristics of the microservice architecture to control the blast radius of chaos experiments
- In 2015, Netflix released Chaos Kong to simulate the outage of an AWS region
- In 2015, Netflix and the community formally put forward the guiding principles of chaos engineering: Principles of Chaos Engineering
- In 2016, Kolton Andrus (a former chaos engineer at Netflix and Amazon) founded Gremlin to commercialize chaos testing tools
- In 2017, Netflix open-sourced Chaos Monkey V2, rewritten in Golang, which must be integrated with the CD tool Spinnaker to be used
- In 2017, Netflix released ChAP (Chaos Automation Platform), which can be seen as an enhanced version of Failure Injection Testing (FIT)
- In 2017, a new book, "Chaos Engineering", written by a former Chaos Engineer at Netflix, was published online
- In 2017, Russell Miles founded ChaosIQ and open-sourced the Chaos Toolkit chaos experiment framework
An understanding of chaos engineering
Before chaos engineering, disaster recovery for large applications evolved gradually. At first, a single server in a single machine room was an easy single point of failure. Then, with cloud computing, clusters made it possible to scale machines horizontally. Operational capabilities kept strengthening, and added monitoring made it possible to spot online problems as soon as they appeared and resolve them before they spread.
With back-end microservices, more and more machines are deployed, the number of nodes grows, and the uncertainty of the overall system increases. Network failures and disk and memory problems of all kinds lead to higher failure rates.
Chaos engineering therefore came into being: by conducting controlled experiments on distributed systems, it improves the system's resilience and its resistance to unknown risks.
To sum up chaos engineering:
- A technology culture that embraces failure
- A set of abstract and rigorous practical principles
- A stable means of active defense
- A rapidly evolving field of technology
Principles of chaos engineering
Why chaos engineering?
“If you don’t find and solve problems early (introduce chaos engineering experiments), eventually the problems will come to you (weekend/midnight).”
Chaos engineering can be understood as an active defense capability that finds unknown problems in advance and improves system resilience.
- Because experiments run within a minimized blast radius, they reduce business losses and expose significant risks ahead of time in a controllable way
- Improve system resiliency and continuously verify the fault tolerance of the system in extreme scenarios
- Enhance team confidence, verify the effectiveness of stability measures, and quantify team value
How to implement chaos engineering?
A complete chaos engineering experiment is a continuously iterating closed loop. Starting from the initial experimental requirements and experimental targets, you determine the scope through a feasibility assessment, design appropriate observation metrics, experimental scenarios and environments, and select suitable experimental tools and a platform framework.
Next, draw up the experimental plan, communicate it to the stakeholders of the systems under test, then carry out the experiment together and collect the metrics you designed in advance.
After the experiment, clean up and restore the environment, analyze the results, trace the root causes and fix the problems. Then automate the experimental scenarios, integrate them into the delivery pipeline and run them regularly.
After that, you can widen the experimental scope, iterating continuously and improving in an orderly way.
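The closed loop described above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not a real chaos platform: the `Experiment` interface, `runExperiment` and the toy two-replica service are all hypothetical names introduced here.

```typescript
// Hypothetical shape of one chaos experiment; none of these names come from a real tool.
interface Experiment {
  name: string;
  steadyState: () => boolean; // observation metric: is the system healthy?
  inject: () => void;         // introduce the fault
  rollback: () => void;       // clean up and restore the environment
}

interface ExperimentResult {
  name: string;
  executed: boolean;      // false when the system was already unhealthy
  healthyDuring: boolean; // did the steady state hold while the fault was active?
  healthyAfter: boolean;  // did the system recover after rollback?
}

function runExperiment(exp: Experiment): ExperimentResult {
  // Never inject faults into a system that is already unhealthy.
  if (!exp.steadyState()) {
    return { name: exp.name, executed: false, healthyDuring: false, healthyAfter: false };
  }
  let healthyDuring = false;
  try {
    exp.inject();
    healthyDuring = exp.steadyState();
  } finally {
    exp.rollback(); // always restore, even if the observation throws
  }
  return { name: exp.name, executed: true, healthyDuring, healthyAfter: exp.steadyState() };
}

// A toy "service" with two replicas; it is healthy while at least one replica is up.
const replicas: boolean[] = [true, true];
const killOneReplica: Experiment = {
  name: "kill-one-replica",
  steadyState: () => replicas.some((up) => up),
  inject: () => { replicas[0] = false; },
  rollback: () => { replicas[0] = true; },
};

const result = runExperiment(killOneReplica);
console.log(result);
```

Keeping the rollback in a `finally` block is one simple way to honor the "clean up and restore" step even when an observation fails. In a real setup the result would feed the analysis and automation steps.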
A few important things to note during implementation:
- Select experimental tools based on technical architecture
- Minimize the blast radius to control experimental risk
- Build a technical culture of designing for failure
- Set goals around the strategy and design the organization around the goals
- Reuse mature products to improve efficiency
Zhou Yang, a technical expert at Alibaba Cloud, described how chaos engineering is implemented in Alibaba's new retail, cloud services, cloud business and other areas in his talk "Chaos Engineering Practice under the Cloud-Native Architecture".
Minimize the blast radius and make experiments routine
Each system provides a unified access layer, which makes chaos experiments easier to run, minimizes the blast radius and controls experimental risk. Running experiments routinely has exposed many problems.
I remember that during one year's Alibaba Double 11, the senior technical team deliberately cut off the network of one data center half an hour after the midnight peak to run a disaster recovery drill. It carried some risk, but it was a great help in training the team to handle such emergencies.
Stability of cloud services
Cloud services face a constant demand for stability and must cope with all kinds of extreme scenarios. Alibaba Cloud has a great deal of practice here, with many solutions spanning hardware, networking, systems and operating models.
Standardize experimental procedures through platform capabilities
Based on the platform plug-in capability, the entire experiment process can be standardized: planning, execution, observation, recording, restoration and analysis, and the stability of the system can be improved through continuous standardized drills.
Can chaos engineering be implemented on the front end?
As we have seen, chaos engineering in the industry is practiced mainly on the server side. Is it necessary to implement it on the front end as well?
The answer, of course, is yes.
The front-end page is the entry point to a product's functionality, and user experience comes first. Yet problems such as abnormal data returned by the server or failed type checks can leave the whole page blank and its functions unavailable.
Front-end engineers therefore face ever higher demands: how to handle all kinds of abnormal data, how to confine errors to the smallest controllable scope, and how to keep the page usable in a degraded mode.
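One way to keep an abnormal server response from blanking the whole page is to validate the data at the boundary and fall back to a degraded default. The sketch below assumes a hypothetical `UserProfile` payload. The type, the guard and the default value are illustrative only and do not come from any particular project.

```typescript
// Hypothetical payload shape, used only for illustration.
interface UserProfile {
  name: string;
  tags: string[];
}

// Degraded default shown when the server data is unusable.
const FALLBACK_PROFILE: UserProfile = { name: "Guest", tags: [] };

// Type guard: checks the shape at runtime instead of trusting the server.
function isUserProfile(value: unknown): value is UserProfile {
  if (typeof value !== "object" || value === null) return false;
  const { name, tags } = value as { name?: unknown; tags?: unknown };
  return (
    typeof name === "string" &&
    Array.isArray(tags) &&
    tags.every((t: unknown) => typeof t === "string")
  );
}

// Parses a raw response body; malformed JSON or a wrong shape degrades gracefully.
function parseProfile(raw: string): UserProfile {
  try {
    const data: unknown = JSON.parse(raw);
    return isUserProfile(data) ? data : FALLBACK_PROFILE;
  } catch {
    return FALLBACK_PROFILE;
  }
}

console.log(parseProfile('{"name":"Ann","tags":["admin"]}').name); // prints "Ann"
console.log(parseProfile('{"name":null}').name);                   // prints "Guest"
console.log(parseProfile("<html>502</html>").name);                // prints "Guest"
```

The point of the design is that the error range stays inside `parseProfile`: callers always receive a well-typed value and can render a degraded view instead of crashing.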
React Chaos
J.C. Hiatt released React Chaos (github.com/jchiatt/rea…), a tool for front-end chaos engineering that randomly breaks your React components by throwing errors. React Chaos is a higher-order component: you wrap it around any component whose behavior under exceptions you want to test.
React Chaos provides two core building blocks:
- withChaos: a higher-order component that wraps another component so that errors are thrown at random
- ErrorBoundary: catches and handles the thrown errors
The use and implementation of withChaos
  const ComponentWithChaos = withChaos(
    ComponentWillHaveChaos,
    1,
    'a custom error message, level 1',
    true
  );
The withChaos method wraps a component. Let's see how it is implemented. Here is the signature of withChaos, which takes four parameters: the component to wrap, the chaos level, the error message, and whether to run in production.
  const withChaos = (
    WrappedComponent: React.ElementType,
    level: Level,
    errorMessage?: string,
    runInProduction?: boolean
  )
Inside withChaos, a new component is returned. It wraps WrappedComponent in the Chaos component, in higher-order-component fashion, so that the extra behavior lives inside Chaos.
  return class extends React.Component {
    render() {
      return (
        <Chaos
          level={level}
          errorMessage={errorMessage}
          runInProduction={runInProduction}
        >
          <WrappedComponent {...this.props} />
        </Chaos>
      );
    }
  };
Let's look at the logic inside the Chaos component, which includes a section that randomly throws exceptions based on level:
  const chaosLevel = level !== 5 ? convertChaosLevel(level) : 0.5;
  const chaosOn = Math.random() >= chaosLevel;
  if (chaosOn) {
    throw new Error(errorMessage);
  }
That is how withChaos works: the Chaos component wraps WrappedComponent and throws an exception according to level, which makes the wrapped component crash.
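To make the random throw concrete, here is a self-contained sketch of the same check outside React. The excerpt above does not show `convertChaosLevel`, so the mapping below (levels 1 to 10, with a higher level lowering the threshold) is an assumption made purely for illustration. `maybeThrow`, its injectable `random` parameter and `failureRate` are hypothetical helpers, not part of the library.

```typescript
// ASSUMPTION: the real convertChaosLevel is not shown in the article.
// Here a level from 1 to 10 maps to a threshold from 0.9 down to 0.0.
function convertChaosLevel(level: number): number {
  return (10 - level) / 10;
}

// Mirrors the check from the excerpt; `random` is injectable so the
// behaviour can be tested deterministically (hypothetical helper).
function maybeThrow(
  level: number,
  errorMessage: string,
  random: () => number = Math.random
): void {
  const chaosLevel = level !== 5 ? convertChaosLevel(level) : 0.5;
  const chaosOn = random() >= chaosLevel;
  if (chaosOn) {
    throw new Error(errorMessage);
  }
}

// Estimates how often a given level throws over many simulated renders.
function failureRate(level: number, trials: number): number {
  let failures = 0;
  for (let i = 0; i < trials; i++) {
    try {
      maybeThrow(level, "chaos!");
    } catch {
      failures++;
    }
  }
  return failures / trials;
}

console.log(failureRate(1, 10000)); // roughly 0.1 under the assumed mapping
console.log(failureRate(9, 10000)); // roughly 0.9 under the assumed mapping
```

Because the comparison is `random() >= chaosLevel`, a lower threshold means more frequent failures. Under the assumed mapping, a higher level therefore produces more chaos.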
The use and implementation of ErrorBoundary
ErrorBoundary wraps a component to catch exceptions thrown inside it, and renders a fallback when an exception occurs. It can be used as follows:
  <ErrorBoundary fallback={<Fallback />}>
    <ComponentWithChaos />
  </ErrorBoundary>
Internally, the ErrorBoundary component detects exceptions through the componentDidCatch lifecycle method and updates its state so that the fallback is rendered when an exception has occurred.
  componentDidCatch(err: Error) {
    this.setState({
      hasError: true,
      error: err,
    });
  }
  ...
  render() {
    const { error } = this.state;
    const { children, fallback } = this.props;
    if (error) {
      return (
        fallback || (
          <pre>Error was caught but no fallback component was provided.</pre>
        )
      );
    }
    return children;
  }
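The same catch-and-fallback idea can be shown without React. The sketch below is a framework-free analogy, not the library's ErrorBoundary: `renderWithBoundary` and the string-returning "components" are hypothetical stand-ins chosen so the example runs on its own.

```typescript
// A "component" here is just a function that returns markup as a string.
type Render = () => string;

const DEFAULT_FALLBACK = "Error was caught but no fallback was provided.";

// Runs the child render; if it throws, returns the fallback instead of
// letting the error propagate and take down the rest of the page.
function renderWithBoundary(child: Render, fallback?: string): string {
  try {
    return child();
  } catch {
    return fallback ?? DEFAULT_FALLBACK;
  }
}

const healthy: Render = () => "<p>Profile</p>";
const broken: Render = () => {
  throw new Error("a custom error message, level 1");
};

// Only the broken section degrades; its siblings still render.
const page = [
  renderWithBoundary(healthy),
  renderWithBoundary(broken, "<p>Something went wrong</p>"),
  renderWithBoundary(healthy),
].join("");

console.log(page); // prints "<p>Profile</p><p>Something went wrong</p><p>Profile</p>"
```

In React itself, error boundaries catch errors thrown during rendering, in lifecycle methods and in constructors of the tree below them, but not errors in event handlers or asynchronous code, so those paths still need their own handling.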
Practice effect
Finally, let’s look at the example in action:
- Initial effect
- The effect of random exceptions
With this tool, the withChaos method lets you simulate component failures with very little code, and ErrorBoundary makes it easy to catch and contain those failures so that the whole page does not go blank.
Final thoughts
In fact, chaos engineering is not an entirely new concept. Service degradation, multi-site active-active deployment and exception handling are practices that everyone already has plenty of experience with.
As Netflix continues to practice chaos engineering, more and more of the industry is adopting the concept. Through purposeful, continuous, automated experiments in production, run within a controlled blast radius, system resilience keeps improving and the team's confidence in handling abnormal scenarios grows.
The concept of chaos engineering is highly relevant to front-end systems too. A front-end system needs a contingency plan for the instability of the systems it depends on, and it needs to keep improving its own robustness.
All right, guys, ready to mess up your front-end?