Chaos Mesh is an open source, cloud-native chaos engineering platform. With Chaos Mesh, users can easily inject faults into services and follow the status of a chaos experiment in Chaos Dashboard. However, monitoring how the experiment itself runs does not tell us how the performance of the application changes. From an observability standpoint, the state of the experiment alone may not give us the whole picture of the fault, which hinders further understanding and debugging of both the system and the fault.

Apache SkyWalking is an open source application performance monitoring (APM) system that provides monitoring, tracing, diagnosis, and other functions for cloud-native services. SkyWalking can collect events and display them in its dashboards, so you can see directly how different events affect service performance. Combined with Chaos Mesh, it can therefore show the impact of chaos experiments on services.

This tutorial shows how to combine SkyWalking with Chaos Mesh and use event monitoring to observe, in real time, how chaos experiments affect application performance.

Prerequisites

  • Create a SkyWalking cluster; see the SkyWalking README.
  • Deploy Chaos Mesh. Installing it with Helm is recommended.
  • Install JMeter, a Java load-testing tool. Other tools also work; it is only used to put load on the service.
  • If you only want a demo, you can configure everything by referring to the chaos-mesh-on-skywalking repository.

Step 1 – Access the SkyWalking cluster

Once SkyWalking is installed, you can access its UI. Since there are no services to monitor yet, you need to add a service and instrument it with the SkyWalking agent. This article instruments an application built on Spring Boot, a lightweight microservice framework, to set up a simple demo environment.

Create the deployment by referring to the demo-deployment.yaml file in the chaos-mesh-on-skywalking repository, then deploy it with kubectl apply -f demo-deployment.yaml -n skywalking. After a successful deployment, the SkyWalking UI shows the monitored service in real time.

Note: Since the Spring Boot port is also 8080, port forwarding must avoid a conflict with the SkyWalking port. For example, use kubectl port-forward svc/spring-boot-skywalking-demo 8079:8080 -n skywalking.

Step 2 – Deploy SkyWalking Kubernetes Event Exporter

SkyWalking Kubernetes Event Exporter watches and filters events in the Kubernetes cluster. By setting filter conditions, you select the events you need and send them to the SkyWalking backend. This lets SkyWalking show when events in your Kubernetes cluster affect the metrics of your service. To deploy it with a single command, create a YAML file based on the reference configuration, set the filters and exporters parameters, and deploy it with kubectl apply.

Step 3 – Put load on the service using JMeter

To make the effect easier to observe, first put load on the Spring Boot service. This article uses JMeter, a widely used Java load-testing tool. JMeter sends requests to the service on port 8079 (the forwarded port from Step 1) from 5 threads that run continuously.

As the SkyWalking dashboard shows, the current success rate is 100% and the service load is around 5300 CPM (calls per minute).
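For readers without JMeter at hand, the load pattern described here (5 threads sending requests continuously) can be approximated with a short Python script. This is only a sketch, not a replacement for JMeter: the URL http://localhost:8079/ assumes the port-forward from Step 1 and a root path that the demo may not actually serve, and the names http_get_ok and run_load are illustrative.

```python
import threading
import time
import urllib.request


def http_get_ok(url, timeout=2.0):
    """Return True if a GET request to url is answered with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Connection errors, timeouts and HTTP error statuses count as failures.
        return False


def run_load(send, threads=5, duration_s=10.0):
    """Call send() in a loop from several threads.

    Returns (calls, successes, calls_per_minute).
    """
    counts = {"calls": 0, "ok": 0}
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s

    def worker():
        while time.monotonic() < deadline:
            ok = send()
            with lock:
                counts["calls"] += 1
                counts["ok"] += int(bool(ok))

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()

    cpm = counts["calls"] * 60.0 / duration_s
    return counts["calls"], counts["ok"], cpm


if __name__ == "__main__":
    # Assumed URL: the locally forwarded port 8079 from Step 1.
    url = "http://localhost:8079/"
    calls, ok, cpm = run_load(lambda: http_get_ok(url), threads=5, duration_s=30.0)
    print(f"calls={calls} success_rate={ok / max(calls, 1):.1%} cpm={cpm:.0f}")
```

The CPM printed by the script is measured on the client side, so it will not necessarily match the value SkyWalking reports for the service.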

Step 4 – Inject faults with Chaos Mesh and observe the effect

With this in place, you can use the Chaos Dashboard to simulate stress scenarios and observe changes in service performance as the experiment progresses.
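Experiments created in Chaos Dashboard correspond to Chaos Mesh custom resources, so a stress scenario can also be written as a StressChaos manifest. The sketch below builds such a manifest in Python and prints it as JSON, which kubectl apply -f - accepts. The label app=spring-boot-skywalking-demo, the single worker per stressor, the mode "one", and the 5m duration are assumptions for illustration; adjust them to match your deployment.

```python
import json


def stress_chaos_manifest(name, cpu_load, memory_size=None,
                          namespace="skywalking",
                          app_label="spring-boot-skywalking-demo",
                          duration="5m"):
    """Build a StressChaos manifest as a plain dict.

    cpu_load is the CPU load percentage per worker (0-100).
    memory_size is an optional size string such as "128MB".
    The label, worker counts, mode and duration are illustrative assumptions.
    """
    stressors = {"cpu": {"workers": 1, "load": cpu_load}}
    if memory_size:
        stressors["memory"] = {"workers": 1, "size": memory_size}
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "stressors": stressors,
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Scenario with 10% CPU load and 128 MB memory load.
    manifest = stress_chaos_manifest("stress-cpu10-mem128", 10, "128MB")
    print(json.dumps(manifest, indent=2))
```

Assuming the script is saved as stress_chaos.py (a hypothetical file name), its output can be piped into the cluster with python stress_chaos.py | kubectl apply -f -.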

The following experiments use different StressChaos configurations to observe the corresponding changes in service performance:

  • CPU load 10%, memory load 128 MB.

The start and end time marks of the chaos experiment can be shown in the chart using the switch on the right. Hover over a marker line to see whether it is the “Applied” or the “Recovered” event of the experiment. During the period between the two green markers, the throughput of the service dropped to 4929 CPM. After the experiment ended, performance returned to normal.

  • With the CPU load increased to 50%, the service load dropped further to 4307 CPM.

  • In the extreme case, with the CPU load at 100%, the service load dropped to 40% of its level without a chaos experiment.

Because Linux process scheduling does not let a single process occupy the CPU all the time, the deployed Spring Boot demo can still handle 40% of the requests even in the extreme case of a fully loaded CPU.
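To compare the three scenarios on one scale, the reported figures can be converted into relative drops against the baseline. The sketch below only does arithmetic on the numbers quoted in this article; the baseline of 5300 CPM is the approximate value from Step 3, so the results are approximate as well.

```python
BASELINE_CPM = 5300  # approximate load before any chaos experiment (Step 3)


def relative_drop(cpm, baseline=BASELINE_CPM):
    """Return the drop in throughput as a percentage of the baseline."""
    return (baseline - cpm) / baseline * 100


observed = {
    "CPU 10% + memory 128 MB": 4929,
    "CPU 50%": 4307,
}

for scenario, cpm in observed.items():
    print(f"{scenario}: {cpm} CPM, {relative_drop(cpm):.1f}% below baseline")

# The 100% CPU case is reported as 40% of the baseline, not as a CPM value.
full_load_cpm = round(BASELINE_CPM * 0.40)
print(f"CPU 100%: about {full_load_cpm} CPM, "
      f"{relative_drop(full_load_cpm):.1f}% below baseline")
```

With these inputs, the drops come out at roughly 7.0%, 18.7%, and 60.0%, and the 100% CPU case corresponds to about 2120 CPM.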

Summary

By combining SkyWalking with Chaos Mesh, we can see clearly when a service is affected by a chaos experiment and how it performs after the fault is injected. This makes it easy to observe service performance under a variety of extreme conditions, which increases our confidence in the service.

Chaos Mesh grew a lot in 2021. To learn more about users’ experience with chaos engineering in practice and to keep improving its support for users, the community has launched a Chaos Mesh user survey. Click the link to take part, thank you! www.surveymonkey.com/r/X78WQPC

You are welcome to join the Chaos Mesh community. Join the #project-chaos-mesh channel on the CNCF Slack (slack.cncf.io) to discuss and contribute to the project. If you find a bug or a missing feature, you can also open an issue or a PR on GitHub (github.com/chaos-mesh).