For cloud services, a system anomaly can cause great losses. To minimize them, we would otherwise only be able to investigate after an anomaly occurs, trying to narrow it down to the specific parameter change that triggered it. However, as cloud native drives services toward ever finer-grained microservices, and massive data and user scale push the infrastructure toward large-scale distribution, system failures become harder and harder to predict. We therefore need to experiment on the system continuously to proactively uncover its defects; this practice is called chaos engineering. After all, practice is the only criterion for testing truth: chaos engineering helps us grasp the behavior of the system more thoroughly and improve its resilience.

Litmus is an open-source cloud-native chaos engineering toolset that focuses on simulating failures in Kubernetes clusters, helping developers and SREs discover defects in clusters and applications and thus improve system robustness.

Litmus architecture

The architecture of Litmus is shown below:

Components of Litmus can be divided into two parts:

  1. Portal
  2. Agents

Portal is a set of Litmus components that serve as a cross-cloud control plane (with a web UI) for managing chaos experiments; it coordinates and observes the chaos experiment workflows running on Agents.

Agent is also a set of Litmus components, including the chaos experiment workflows that run on a Kubernetes cluster.

With the Portal, users can create and schedule new chaos experiment workflows on an Agent and observe the results from the Portal. Users can also connect more clusters to the Portal and use it as a single pane of glass for managing cross-cloud chaos projects.

Portal components

  • Litmus WebUI

    Litmus WebUI provides a Web user interface where users can easily build and observe chaos experiment workflows. Litmus WebUI also acts as a control plane for the cross-cloud chaos experiment.

  • Litmus Server

    Litmus Server is middleware that processes API requests from the user interface and stores configuration details and processing results in a database. It also serves as the communication bridge to the Agents, scheduling workflows to them.

  • Litmus DB

    Litmus DB stores the chaos experiment workflows and the details of their test results.

Agent components

  • Chaos Operator

    The Chaos Operator watches the ChaosEngine and runs the chaos experiments defined in the CR. The Chaos Operator is namespace-scoped and runs in the litmus namespace by default. After an experiment completes, the Chaos Operator invokes the Chaos Exporter to export the experiment metrics to the Prometheus database.

  • CRDs

    The following CRDs are created during the Litmus installation:

    ```
    chaosexperiments.litmuschaos.io
    chaosengines.litmuschaos.io
    chaosresults.litmuschaos.io
    ```
  • Chaos Experiment

    Chaos Experiment is the basic unit of the LitmusChaos architecture. Users can pick ready-made chaos experiments from the Chaos Hub, or create new ones themselves, to build the chaos experiment workflows they need. In simple terms, a Chaos Experiment is a CR that defines which operations the test supports, which parameters can be passed in, and which kinds of objects can be targeted. These experiments usually fall into three categories: generic tests (memory, disk, CPU, and so on), application tests (for example, tests against Nginx), and platform tests (tests against a cloud platform: AWS, Azure, GCP). See the Chaos Hub documentation for details.

  • Chaos Engine

    ChaosEngine binds a Chaos Experiment to a target application in a given namespace. This CR is watched by the Chaos Operator.

  • Chaos Results

    ChaosResult stores the results of chaos experiments. It is created or updated while an experiment runs and contains various information, including the ChaosEngine configuration and the experiment status. The Chaos Exporter reads these results and exports them to the Prometheus database.

  • Chaos Probes

    Chaos Probes are pluggable checks that can be defined in the ChaosEngine of any chaos experiment. The experiment pod runs each probe according to its defined mode, and probe success (including that of the standard "built-in" checks) is a necessary condition for declaring the experiment successful.

  • Chaos Exporter

    The Chaos Exporter exposes a Prometheus metrics endpoint; optionally, experiment metrics can be scraped from it into the Prometheus database.

  • Subscriber

    The Subscriber interacts with the Litmus Server to fetch chaos experiment workflows for the Agent and to send detailed workflow results back.
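The Agent-side objects above come together in a single ChaosEngine custom resource. The sketch below is illustrative only: it assumes an nginx Deployment labeled `app=nginx`, a pre-created ServiceAccount named `pod-delete-sa`, and a hypothetical probe URL, none of which come from this tutorial.

```yaml
# Illustrative ChaosEngine: binds the pod-delete experiment to an nginx app
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos            # hypothetical name
  namespace: default
spec:
  engineState: active          # set to "stop" to halt the experiment
  appinfo:
    appns: default             # namespace of the target application
    applabel: app=nginx        # label selector for the target pods
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # assumed pre-created ServiceAccount
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
        probe:
          - name: check-frontend       # hypothetical Chaos Probe
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: http://nginx.default.svc.cluster.local
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1
```

Once applied, the Chaos Operator picks up this CR, runs the experiment, evaluates the probe, and records the outcome in a ChaosResult.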

Prepare the KubeSphere application template

KubeSphere integrates OpenPitrix to provide application lifecycle management. OpenPitrix is a multi-cloud application management platform that enables KubeSphere to implement an app store and application templates to visually deploy and manage applications. For applications that do not exist in the App Store, users can either deliver Helm Chart to KubeSphere’s public repository or import it into a private repository to provide application templates.

This tutorial will use KubeSphere’s application template to deploy Litmus.

To deploy an application from an application template, you need to create an enterprise space, a project, and two user accounts (ws-admin and project-regular). ws-admin must be granted the workspace-admin role in the enterprise space, and project-regular must be granted the operator role in the project. Before creating them, let’s review KubeSphere’s multi-tenant architecture.

Multi-tenant Architecture

KubeSphere’s multi-tenant system has three levels: cluster, enterprise space, and project. A project in KubeSphere is equivalent to a Kubernetes namespace.

You should create a new enterprise space to work in rather than using the system workspace, which runs system resources and is mostly read-only. For security reasons, it is also strongly recommended to grant different tenants different permissions for collaborating within an enterprise space.

You can create multiple enterprise spaces within a KubeSphere cluster, and multiple projects within each enterprise space. KubeSphere provides several built-in roles at each level by default, and you can also create roles with custom permissions. The KubeSphere hierarchy suits enterprise users with multiple teams or organizations and different roles within each team.

Create an account

After installing KubeSphere, you need to add users with different roles to the platform so that they can work at different levels for their authorized resources. At the beginning, the system has only one account admin with a platform-admin role by default. In this step, you will create an account user-manager and then use user-manager to create a new account.

Log in to the web console with the default admin account and password (admin/P@88w0rd).

For security reasons, it is strongly recommended that you change your password the first time you log in to the console. To change the password, select Personal Settings from the drop-down menu in the upper right corner. Set a new password in Password Settings. You can also change the console language in personal Settings.

  1. After logging into the console, click platform Management in the upper left corner, and then select Access Control.

    Within Account Roles, there are four built-in roles, as shown below. The first account to be created next will be assigned the users-manager role.

    | Built-in role | Description |
    | --- | --- |
    | workspaces-manager | Workspace administrator who manages all enterprise spaces on the platform. |
    | users-manager | User administrator who manages all users on the platform. |
    | platform-regular | Regular platform user who has no resource operation rights until invited to join an enterprise space or cluster. |
    | platform-admin | Platform administrator who can manage all resources on the platform. |
  2. In Account Management, click Create. In the pop-up window, provide all the required information (marked with *), and then select users-manager in the Role field. Refer to the following example.

    When you’re done, click OK. The newly created account is displayed in the account list in Account Management.

  3. Log out, log back in as user-manager, and create the following three accounts.

    | Account | Role | Description |
    | --- | --- | --- |
    | ws-manager | workspaces-manager | Creates and manages all enterprise spaces. |
    | ws-admin | platform-regular | Manages all resources in a specified enterprise space (this account is used to invite the member project-regular to join the enterprise space). |
    | project-regular | platform-regular | This account is used to create workloads, pipelines, and other resources in a given project. |
  4. View the three accounts created.

Create enterprise Space

In this step, you create an enterprise space using the ws-manager account created in the previous step. As the basic logical unit for managing projects, creating workloads, and organizing members, the enterprise space is the foundation of KubeSphere’s multi-tenant system.

  1. Log in to KubeSphere as ws-manager, which has the authority to manage all enterprise spaces on the platform. Click Platform Management in the upper left corner and select Access Control. In Enterprise Spaces, you can see that only one default enterprise space, system-workspace, is listed; it runs system-specific components and services and cannot be deleted.

  2. Click Create on the right, name the new enterprise space demo-workspace, and set the user ws-admin as the enterprise space administrator, as shown below:

    When you’re done, click Create.

  3. Log out of the console, and then log back in as ws-admin. In Enterprise Space Settings, select Enterprise Members, and then click Invite Members.

  4. Invite project-regular into the enterprise space and grant it the workspace-viewer role.

    The actual role name is in the format `<workspace name>-<role name>`. For example, in the enterprise space named demo-workspace, the actual name of the viewer role is demo-workspace-viewer.

  5. After adding project-regular to the enterprise space, click OK. In the enterprise members, you can see the two members listed.

    | Account | Role | Description |
    | --- | --- | --- |
    | ws-admin | workspace-admin | Manages all resources in the specified enterprise space (in this example, this account is used to invite new members to join the enterprise space and to create projects). |
    | project-regular | workspace-viewer | This account will be used to create workloads and other resources in the specified project. |

Create a project

In this step, you create a project using the ws-admin account. A project in KubeSphere is identical to a namespace in Kubernetes, providing virtual isolation of resources. See Namespaces for more information.

  1. Log in to KubeSphere as ws-admin. In Project Management, click Create.

  2. Enter a project name (for example, Litmus) and click OK to finish. You can also add aliases and descriptions for the project.

  3. In Project Management, click on the project you just created to view its details.

  4. Invite project-regular to the project and grant the role of operator to the user. Please refer to the following figure for details.

    A user with the operator role is a project maintainer and can manage resources other than users and roles in the project.

Adding an Application Repository

  1. Log in to KubeSphere’s web console as ws-admin. In your enterprise space, go to the App Repositories page under App Management and click Add Repository.

  2. In the pop-up dialog box, set the repository name to litmus, set the repository URL to https://litmuschaos.github.io/litmus-helm/, click Validate to verify the URL, and then click OK.

  3. After the application repository is imported successfully, it is displayed in the list as shown in the following figure.
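For reference, the same repository can also be added outside KubeSphere with the Helm CLI (assuming Helm 3 is installed locally); this is equivalent to the repository imported above:

```shell
# Add the Litmus Helm repository and list the charts it provides
helm repo add litmus https://litmuschaos.github.io/litmus-helm/
helm repo update
helm search repo litmus
```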

Deploy the Litmus control plane

After importing the application repository for Litmus, you can deploy Litmus from the application template.

  1. Log out of KubeSphere and log back in as project-regular. In your project, go to the Applications page under Application Workloads and click Deploy New Application.

  2. Select From application template in the dialog box that appears. There are two choices:

     From the App Store: built-in apps and apps uploaded individually as Helm charts.

     From application templates: apps from private application repositories and the enterprise app pool.

  3. Select the private application repository litmus that you added earlier from the drop-down list.

  4. Select litmus-2-0-0-beta for deployment.

  5. View the application information and configuration files, select a version from the Version drop-down list, and click Deploy.

  6. Set the application name, confirm the application version and deployment location, and click Next.

  7. On the App Configuration page, you can manually edit the manifest file, then click Deploy.

  8. Wait for Litmus to be created and start running.
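If you prefer the command line to the KubeSphere UI, the same control plane can be installed directly with Helm. The release name and namespace below are assumptions, not values mandated by the chart:

```shell
# Install the Litmus 2.0 beta control-plane chart into its own namespace
helm install litmus litmus/litmus-2-0-0-beta \
  --namespace litmus --create-namespace
```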

Accessing the Portal Service

The Service name of the Portal frontend is litmusportal-frontend-service. You can first check its NodePort on the Services page:

Then use ${NodeIP}:${NodePort} to access the Portal:
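If you are unsure of the NodePort, it can also be read with kubectl; the litmus namespace below is an assumption based on the deployment location chosen earlier:

```shell
# Print the NodePort assigned to the Portal frontend service
kubectl get svc litmusportal-frontend-service -n litmus \
  -o jsonpath='{.spec.ports[0].nodePort}'
```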

Default username and password:

```
Username: admin
Password: litmus
```

(Optional) Deploying the Agent

Litmus contains two types of agents:

  • Self Agent
  • External Agent

By default, the cluster where Litmus is installed is automatically registered as Self Agent, and Portal performs chaos experiments in Self Agent by default.

As mentioned above, the Portal is a cross-cloud control plane for chaos experiments: users can connect multiple External Agents deployed in external Kubernetes clusters to the current Portal, dispatch chaos experiments to those agents, and observe the results in the Portal.

For details about how to deploy the External Agent, refer to the official documentation of Litmus.

Creating chaos experiments

After the Portal is installed, you can use the Portal interface to create chaos experiments. You need to create an application for testing:

```shell
$ kubectl create deployment nginx --image=nginx --replicas=2 --namespace=default
```
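Before scheduling any experiment, you can confirm that the target pods are running; `kubectl create deployment` labels the pods `app=nginx` by default:

```shell
# Both replicas should be in the Running state before chaos is injected
kubectl get pods -n default -l app=nginx
```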

Let’s start creating experiments.

  1. Log in to the Portal.

  2. Schedule a workflow.

  3. Select an Agent, such as Self Agent:

  4. Select Add chaos experiments from Chaos Hub:

  5. Set the workflow name:

  6. Click Add a new experiment to add a chaos experiment to the workflow:

  7. Select the pod-delete experiment:

  8. Schedule it to run immediately:

  9. In KubeSphere you can see that the Pod has been deleted and rebuilt:

  10. You can also see that the experiment is successful on the Portal interface:

    Click on the specific Workflow node to see the detailed log:

  11. Repeat the above steps to create the chaos experiment pod-cpu-hog:

    In KubeSphere you can see that the CPU usage of the Pod is approaching 1 core:

  12. Set the replicas of Nginx to 1 before starting the experiment:

    There is now only one Pod, with IP 10.233.71.170:

    Repeat the above steps to create the chaos experiment pod-network-loss and set the packet loss rate to 50%:

    Back on the KubeSphere screen, hover over the toolbox icon in the lower right corner and select Kubectl from the pop-up menu.

    Ping the Pod IP to test the packet loss rate; it is close to 50%, so the experiment succeeds:
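A fixed-count ping from the kubectl toolbox gives a rough loss measurement; the Pod IP is the one shown above, and with 50% injected loss roughly half of the probes should fail:

```shell
# Send 100 probes to the Pod; the summary line reports the packet loss rate
ping -c 100 10.233.71.170
```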

All of the above experiments are for Pod. In addition to Pod, you can also experiment with Node, K8s components and other services. Interested readers can test themselves.

Workflow

A Workflow is essentially a chaos experiment workflow. Although each workflow in the previous section contained only one experiment, a workflow can in fact contain multiple experiments and execute them in sequence.

Workflows are implemented as custom resources, which can be viewed in the KubeSphere console, where you can see all the workflows created earlier:

Using pod-network-loss as an example, let’s look at its parameters:

Each experiment in a workflow is also a custom resource, of kind ChaosEngine.

Here is what each field and environment variable means:

  • appns: the namespace of the target application.
  • experiments: the name of the experiment to run (network delay, pod delete, and so on); you can list the supported experiments with `kubectl get chaosexperiments -n <namespace>`.
  • chaosServiceAccount: the ServiceAccount to use.
  • jobCleanUpPolicy: whether to keep the Job that runs the experiment; can be delete or retain.
  • annotationCheck: whether to perform the annotation check (true/false); if false, all matching pods may be targeted.
  • engineState: the state of the experiment; can be set to active or stop.
  • TOTAL_CHAOS_DURATION: the duration of the chaos test; the default is 15 seconds.
  • CHAOS_INTERVAL: the interval between chaos injections; the default is 5 seconds.
  • FORCE: whether pod deletion uses the --force option.
  • TARGET_CONTAINER: the container to kill inside the Pod (the first container by default).
  • PODS_AFFECTED_PERC: the percentage of targeted pods out of the total; defaults to 0 (equivalent to 1 replica).
  • RAMP_TIME: the time to wait before and after injecting chaos.
  • SEQUENCE: the execution policy; defaults to parallel and can be set to serial or parallel.
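Put together, these fields map onto the ChaosEngine CR roughly as follows. The values below are illustrative (the target labels and ServiceAccount name are assumptions); NETWORK_PACKET_LOSS_PERCENTAGE carries the 50% loss rate used in the experiment above:

```yaml
# Illustrative fragment of a pod-network-loss ChaosEngine spec
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=nginx          # assumed target label
    appkind: deployment
  chaosServiceAccount: pod-network-loss-sa   # assumed ServiceAccount
  jobCleanUpPolicy: delete
  annotationCheck: "false"
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: NETWORK_PACKET_LOSS_PERCENTAGE   # the 50% used above
              value: "50"
            - name: PODS_AFFECTED_PERC
              value: "100"
```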

The detailed parameters of the other experiments are not covered here; interested readers can consult the relevant documentation.

Conclusion

In this article, we introduced the architecture of the chaos engineering framework Litmus and how to deploy it on KubeSphere. Through a series of chaos experiments, we verified the ability of the infrastructure and services to tolerate faults. Litmus is an excellent chaos engineering framework backed by a strong community, and its experiment store (the Chaos Hub) will keep growing. You can deploy these chaos experiments into a cluster with one click to create chaos, view the results intuitively through the visual interface, and verify the resilience of the cluster. With Litmus, we can not only react to faults but also proactively inject them to expose system defects and avoid black swan events.

References

  • Getting Started with Litmus

This article is published by OpenWrite!