Author | Wupeng    Source | Alibaba Cloud Native official account

Kubernetes Stability Assurance Manual series

  • Kubernetes Stability Assurance Manual – Minimal edition
  • Kubernetes Stability Assurance Manual – Log topics
  • Kubernetes Stability Assurance Manual – Observability topics
  • Kubernetes Stability Assurance Manual – Insight + Contingency Plan

Review

Stability assurance is a complex topic: cluster stability has to be assured effectively, iteratively, and sustainably. A systematic approach may help solve this problem.

To form a systematic method, we can describe the sources of stability-assurance complexity with a data model, and then digitize and visualize cluster stability assurance on top of that data model. The data model also captures the understanding, practice, and experience gained in each iteration of stability assurance.

Sources of stability complexity

The complexity of stability assurance generally comes from the following dimensions:

  • Number of system components and interactions: changes over time
  • Dynamic behavior characteristics of system components and interactions: not easy to derive and observe
  • System resource type and quantity: changes over time
  • Dynamic behavior characteristics of system resources: not easy to derive and observe
  • Cluster stability assurance actions: not easy to standardize and perform safely

In summary, there are two questions:

  • How to understand the cluster effectively and comprehensively
  • How to carry out stability assurance actions safely through contingency plans

Data model

Insight and contingency plans can be abstracted into a data model made up of four diagrams and three lists:

Four diagrams

  • Architecture diagram: describes cluster components and their interactions
  • Architecture operation diagram: describes the dynamic characteristics of cluster components and their interactions
  • Resource composition diagram: describes the composition of cluster resources
  • Resource operation diagram: describes the dynamic usage of cluster resources

Three lists

  • Event List: describes the events generated in the cluster that need attention
  • Operation List: describes the management operations that can be performed in the cluster
  • Plan List: describes the association between events and operations in a cluster

As follows:

Insight

Cluster functions are provided by the cluster architecture, and the functional components run on cluster resources. Therefore, the core of insight into cluster stability is to grasp the characteristics of the cluster architecture and of the cluster resources.

1. Architecture diagram

A cluster architecture can usually be represented as a graph in which nodes represent components and edges represent their interactions. The graph structure makes the architecture easy to understand at a glance, as shown in the following figure:

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "0ce0e913f6e5516846c654dbd81db6ecab1f684e",
            "name": "kube-apiserver"
        },
        {
            "_id": "f0740d8bb67520857061a9b71d4a9e4fc50bfe3d",
            "name": "etcd",
            "description": "inside XXX VPC",
            "type": "managed component | storage",
            "dependencies": {}
        },
        {
            "_id": "05952a825e91cb50a81cbaf23c6941d5c3bb2c89",
            "name": "eni-operator",
            "description": "inside XXX VPC, manages ENI",
            "type": "component",
            "dependencies": {
                "serviceAccount": "enioperator",
                "clusterrole": "enioperator",
                "clusterrolebinding": "enioperator",
                "configmaps": ["eniconfig"],
                "secrets": ["enioperator"]
            }
        },
        {
            "_id": "42699513a7561e89a5f99881d7b05653a1625c51",
            "name": "Network Service",
            "description": "Provides cloud resource management services such as VPC/VSwitch",
            "type": "cloud service"
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "description": "ENI management requests"
        },
        {
            "_id": "93f3c21247165f0be3a969fc80f72bc1a402e9f5",
            "source": "eni-operator", "target": "Network Service",
            "description": "Accesses Alibaba Cloud ECS OpenAPI to manage VPC/VSwitch network resources"
        }
    ]
}
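A description like this is only useful if it stays consistent as components change. The following is a minimal sketch of a consistency check, assuming that edges refer to nodes by their `name` (as the sample appears to do) and using a trimmed copy of the structure with normalized names; the helper names `dangling_edges` and `dependencies` are hypothetical.

```python
from collections import defaultdict

# Trimmed copy of the architecture description; only the fields used here are kept.
ARCHITECTURE = {
    "nodes": [
        {"name": "kube-apiserver"},
        {"name": "etcd"},
        {"name": "eni-operator"},
        {"name": "Network Service"},
    ],
    "edges": [
        {"source": "eni-operator", "target": "kube-apiserver"},
        {"source": "eni-operator", "target": "Network Service"},
    ],
}


def dangling_edges(graph):
    """Return edges whose source or target is not a declared node name."""
    names = {node["name"] for node in graph["nodes"]}
    return [
        edge for edge in graph["edges"]
        if edge["source"] not in names or edge["target"] not in names
    ]


def dependencies(graph):
    """Map each component to the sorted list of components it calls."""
    out = defaultdict(list)
    for edge in graph["edges"]:
        out[edge["source"]].append(edge["target"])
    return {name: sorted(targets) for name, targets in out.items()}


if __name__ == "__main__":
    print(dangling_edges(ARCHITECTURE))
    print(dependencies(ARCHITECTURE))
```

A check of this kind would flag, for example, an edge whose source is spelled differently from the node it is meant to reference.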

2. Architecture operation diagram

While the cluster is running, the state of components and their interactions can be inferred from externally observed data such as logs, metrics, and traces. By superimposing this dynamic insight data on the static architecture diagram, the health of the cluster can be grasped intuitively, as shown below:

The numbers represent insight data, such as “number of exceptions” or “request traffic”. Besides numbers, color can indicate health status and line thickness can indicate traffic volume.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "ea4538dc0625d06b0dc93579998e04288656050f",
            "name": "mutatehook",
            "deploy": {
                "type": "K8s:Deployment",
                "namespace": "kube-system",
                "replicas": 3
            },
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "mutatehook",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "fuzzy": "fail OR Fail OR error OR Error"
                        }
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "insight":[
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "xxx",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "unauthorized": "Unauthorized",
                            "throttling": "'Throttling' OR 'throttling'"
                        }
                    }
                }
            ]
        }
    ]
}
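To illustrate how a `signal` entry could become the number and color drawn on the diagram, here is a minimal sketch. It is a local approximation only: it treats the query as plain substrings joined by `OR` and does not reproduce the query semantics of the actual log service. The function names, the sample log lines, and the thresholds are all hypothetical.

```python
def parse_or_terms(query):
    """Split an 'a OR b' style query into plain substrings, dropping quotes."""
    return [term.strip().strip("'\"") for term in query.split(" OR ") if term.strip()]


def count_matches(query, lines):
    """Count log lines containing at least one term of the query."""
    terms = parse_or_terms(query)
    return sum(1 for line in lines if any(term in line for term in terms))


def health_color(abnormal_count, warn_at=1, critical_at=10):
    """Map an abnormal count to a display color (thresholds are hypothetical)."""
    if abnormal_count >= critical_at:
        return "red"
    if abnormal_count >= warn_at:
        return "yellow"
    return "green"


# Hypothetical log lines standing in for data fetched from the log store.
SAMPLE_LINES = [
    "request ok",
    "Error: connection refused",
    "Throttling by apiserver",
    "admission fail",
]

if __name__ == "__main__":
    count = count_matches("fail OR Fail OR error OR Error", SAMPLE_LINES)
    print(count, health_color(count))
```

The count would be rendered next to the node or edge, and the color would drive how the node or edge is drawn.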

3. Resource composition diagram

Resource management is a complex topic. By analyzing how the resources in a cluster are composed, the resource composition can also be represented as a graph: nodes represent resources, and edges represent ownership or binding relationships between resources.

It can be described by the following data structure:

{
    "kinds": ["vpc", "vswitch", "securitygroup", "ecs", "clb", "rds", "nat", "eip"],
    "tags": {
        "cluster/product": "xxx",
        "cluster/id": "2736f42d4e882ad6825d6364545a3f1cb5136859",
        "cluster/name": "xxx",
        "cluster/env": "staging"
    },
    "nodes": [
        {
            "kind": "vpc",
            "nodes": [
                {
                    "_id": "c505f21871bac7385c1387988cf226310af0831e",
                    "id": "vpc-xxx",
                    "description": "",
                    "ipv4": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": ""
                    },
                    "url": "https://vpc.console.aliyun.com/vpc/xxx"
                }
            ]
        },
        {
            "kind": "ecs",
            "nodes": [
                {
                    "_id": "47c4fe5cc2585a49f07798a0b8b69cda7f8d4a23",
                    "id": "xxx",
                    "az": "xxx",
                    "interfaces": {
                        "primary": {
                            "ip": "xxx",
                            "eni": "xxx",
                            "mac": "xxx"
                        }
                    },
                    "instance-type-family": "xxx",
                    "instance-type": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": "worker",
                        "node/container-runtime": "xxx",
                        "node/user-networking": "xxx",
                        "node/system-networking": "xxx"
                    },
                    "status": "",
                    "condition": "",
                    "url": "https://ecs.console.aliyun.com/#/server/xxx"
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "a754c748b2723a25c017421dd0969d00df3c000b",
            "source": "vsw-xxx", "target": "vpc-xxx",
            "description": ""
        },
        {
            "_id": "c34b164eba2897cfb2b574a576672d8aa441d709",
            "source": "eip-xxx", "target": "ngw-xxx",
            "description": ""
        }
    ]
}
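As a sketch of how this structure could be queried, the snippet below flattens the per-kind groups into an index, lists the resources attached to a given resource, and reports ids that appear in edges but not in the inventory. It assumes that an edge points from the subordinate resource (`source`) to its owner (`target`), as in the `vsw-xxx` to `vpc-xxx` sample; the resource ids and function names are hypothetical.

```python
# Trimmed, hypothetical instance of the resource composition structure.
COMPOSITION = {
    "nodes": [
        {"kind": "vpc", "nodes": [{"id": "vpc-1"}]},
        {"kind": "vswitch", "nodes": [{"id": "vsw-1"}, {"id": "vsw-2"}]},
    ],
    "edges": [
        {"source": "vsw-1", "target": "vpc-1"},
        {"source": "vsw-2", "target": "vpc-1"},
        {"source": "eip-1", "target": "ngw-1"},
    ],
}


def index_resources(composition):
    """Flatten the per-kind groups into {resource id: (kind, resource)}."""
    index = {}
    for group in composition["nodes"]:
        for resource in group["nodes"]:
            index[resource["id"]] = (group["kind"], resource)
    return index


def children_of(composition, target_id):
    """Return ids of resources attached to target_id via an edge."""
    return sorted(e["source"] for e in composition["edges"] if e["target"] == target_id)


def unknown_references(composition):
    """Return ids used in edges that are absent from the node inventory."""
    known = set(index_resources(composition))
    used = {e[key] for e in composition["edges"] for key in ("source", "target")}
    return sorted(used - known)


if __name__ == "__main__":
    print(children_of(COMPOSITION, "vpc-1"))
    print(unknown_references(COMPOSITION))
```

Ids reported by `unknown_references` are candidates for a closer look, since they are bound to something but not tracked in the inventory.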

4. Resource operation diagram

While resources are in use, their internal state and the relationships between them can also be inferred from externally observed data such as logs, metrics, and events. By superimposing this dynamic insight data on the static resource composition diagram, the usage of cluster resources can be grasped intuitively.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "35103ac62d4ef0a314e2a5128f44c684205bea2f",
            "id": "vpc",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "vpc/exist": "DescribeVpcs",
                        "vswitch/count": "DescribeVSwitches"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/count": "DescribeInstances",
                        "securitygroup/count": "DescribeSecurityGroups"
                    }
                }
            ]
        },
        {
            "_id": "6450e07dc67027f76f29fbfcb841e57200855196",
            "id": "ecs",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "auto"
                    },
                    "signal": {
                        "ecs/state_change": ""
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "caa1e395c713f47766ca7bcfc20419c0be0f0803",
            "source": "i-xxx", "target": "sg-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeInstances"
                    }
                }
            ]
        },
        {
            "_id": "537dc478d95714792b3694674d6164f72b361bb0",
            "source": "eip-xxx", "target": "ngw-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeEipAddresses"
                    }
                }
            ]
        }
    ]
}
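Several signals in this structure map to the same OpenAPI action (for example, `DescribeInstances`). A collector could therefore group signals by action and call each action once. The sketch below shows that grouping on a trimmed copy of the structure. It assumes that only sources of type `OpenAPI` are polled and that other types, such as `auto`, are delivered some other way; the function name is hypothetical and no cloud API is actually called.

```python
from collections import defaultdict

# Trimmed copy of the resource operation structure.
OPERATION = {
    "nodes": [
        {
            "id": "ecs",
            "insight": [
                {
                    "source": {"vendor": "cloud:aliyun:ecs", "type": "OpenAPI"},
                    "signal": {
                        "ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData",
                    },
                },
                {
                    "source": {"vendor": "cloud:aliyun:ecs", "type": "auto"},
                    "signal": {"ecs/state_change": ""},
                },
            ],
        }
    ],
    "edges": [
        {
            "source": "i-xxx", "target": "sg-xxx",
            "insight": [
                {
                    "source": {"vendor": "cloud:aliyun:ecs", "type": "OpenAPI"},
                    "signal": {"exist": "DescribeInstances"},
                }
            ],
        }
    ],
}


def required_calls(operation_graph):
    """Group signal names by (vendor, OpenAPI action) so each action is listed once."""
    calls = defaultdict(set)
    for item in operation_graph["nodes"] + operation_graph["edges"]:
        for insight in item.get("insight", []):
            source = insight["source"]
            if source.get("type") != "OpenAPI":
                continue  # assumption: non-OpenAPI sources are not polled
            for signal, action in insight["signal"].items():
                calls[(source["vendor"], action)].add(signal)
    return {key: sorted(signals) for key, signals in calls.items()}


if __name__ == "__main__":
    for (vendor, action), signals in required_calls(OPERATION).items():
        print(vendor, action, signals)
```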

Contingency plan

Cluster exceptions are inevitable and must be handled safely and effectively.

Exceptions can be represented as events, and a safe and effective operation is one that has been reviewed and rehearsed. By associating exceptions with operations, so that an exception triggers the corresponding operation, we obtain a reviewed and rehearsed contingency plan for handling cluster exceptions safely and effectively.

1. Event list

The event format can follow the CloudEvents specification: github.com/cloudevents…

It can be described by the following data structure:

{
    "events": [
        {
            "_id": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "description": "restart workload manually",
            "event": {
                "id": "restart-workload",
                "source": "xxx",
                "specversion": "1.0",
                "type": "com.aliyun.trigger.manual",
                "datacontenttype": "application/json",
                "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}"
            }
        }
    ]
}
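Before an event is allowed to trigger anything, it is worth checking that it is well formed. The sketch below checks the four attributes that CloudEvents 1.0 requires (`id`, `source`, `specversion`, `type`) and decodes the JSON-encoded `data` string used in the sample. The function name `parse_event_data` is hypothetical, and the check is deliberately minimal rather than a full implementation of the specification.

```python
import json

# Attributes that CloudEvents 1.0 marks as required.
REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")

EVENT = {
    "id": "restart-workload",
    "source": "xxx",
    "specversion": "1.0",
    "type": "com.aliyun.trigger.manual",
    "datacontenttype": "application/json",
    "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}",
}


def parse_event_data(event):
    """Validate required attributes and return the decoded event data as a dict."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]
    if missing:
        raise ValueError("missing required attributes: " + ", ".join(missing))
    data = event.get("data")
    # In the sample, data is a JSON document serialized into a string.
    if event.get("datacontenttype") == "application/json" and isinstance(data, str):
        data = json.loads(data)
    return data or {}


if __name__ == "__main__":
    print(parse_event_data(EVENT))
```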

2. Operation list

To reduce the chance of misoperation and to avoid running unreviewed, unverified operations when an exception occurs, define a list of the operations that are allowed to be performed in the cluster.

It can be described by the following data structure:

{
    "actions": [
        {
            "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
            "name": "Action Restart Workload",
            "exec": "restart-workload",
            "env": [
                "NAMESPACE",
                "NAME",
                "TYPE"
            ]
        }
    ]
}
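The `env` field declares which parameters an operation accepts. One way to enforce it, shown here as a sketch, is to build the environment for `exec` strictly from the declared names and to reject anything missing or undeclared. The function name `build_env` and the parameter values are hypothetical, and the sketch stops short of actually running the command.

```python
ACTION = {
    "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
    "name": "Action Restart Workload",
    "exec": "restart-workload",
    "env": ["NAMESPACE", "NAME", "TYPE"],
}


def build_env(action, params):
    """Return the environment for an action, allowing only its declared variables."""
    declared = set(action["env"])
    extra = sorted(set(params) - declared)
    missing = sorted(name for name in declared if not params.get(name))
    if extra:
        raise ValueError("undeclared variables: " + ", ".join(extra))
    if missing:
        raise ValueError("missing variables: " + ", ".join(missing))
    return {name: str(params[name]) for name in action["env"]}


if __name__ == "__main__":
    # Hypothetical parameters, e.g. taken from the data of a triggering event.
    env = build_env(ACTION, {"NAMESPACE": "default", "NAME": "web", "TYPE": "deployment"})
    print(ACTION["exec"], env)
```

Keeping this step separate from execution makes it possible to rehearse a plan as a dry run before it is trusted in an incident.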

3. Plan list

Based on the event list and the operation list, events can be associated with operations so that exceptions are handled in an event-driven manner. Each such association is a contingency plan.

It can be described by the following data structure:

{
    "plans": [
        {
            "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
            "name": "Plan Restart Workload",
            "description": "Restart the workload",
            "event": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"]
        }
    ]
}
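The plan list ties the other two lists together through their `_id` values. The sketch below resolves the operations to run for a given event, using trimmed copies of the samples. The function name `resolve_plan` is hypothetical, and a real controller would also need to handle ordering, failures, and approval.

```python
PLANS = {
    "plans": [
        {
            "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
            "name": "Plan Restart Workload",
            "event": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"],
        }
    ]
}

ACTIONS = {
    "actions": [
        {
            "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
            "name": "Action Restart Workload",
            "exec": "restart-workload",
        }
    ]
}


def resolve_plan(plans, actions, event_id):
    """Return the operations of the first plan bound to event_id, in plan order."""
    by_id = {action["_id"]: action for action in actions["actions"]}
    for plan in plans["plans"]:
        if plan["event"] == event_id:
            # A KeyError here means the plan references an undefined operation.
            return [by_id[action_id] for action_id in plan["actions"]]
    return []


if __name__ == "__main__":
    for action in resolve_plan(PLANS, ACTIONS, "a1ab5b61857be35a5c5b203dd84b49248161c823"):
        print(action["exec"])
```

An event with no matching plan resolves to an empty list, so nothing runs unless a reviewed plan exists for it.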

Global visual stability assurance

Based on the data model of four diagrams and three lists described above, a core of insight and contingency plans for cluster stability assurance can be formed, and from it a globally visualized stability assurance service can be derived.

Such a service has the following key characteristics:

  • A global perspective
  • Digitalization
  • Visualization

This service is implemented based on two principles:

  • People process images much more efficiently than words
  • A global perspective makes it possible to “understand the system end to end”, “locate problems accurately”, and “handle problems safely”

Take the traffic maps of daily life as an example:

With a traffic map, we can quickly understand the road layout and key junctions of a region, and the conventional red, yellow, and green colors directly convey congestion. Richer traffic maps also show important events such as road repairs and closures.

In this way, visualization lets us quickly grasp the traffic and geography of an area.

The underlying data model is the foundation; visualization makes the value of the data easier to realize.

One implementation

1) Deployment pattern

  • Deployed per region
  • Serves one or more clusters in a region

2) User experience

Following stability assurance best practices, the service is divided into the following sections:

  • Running link diagram
    • This is the most frequently used section for day-to-day stability assurance. Its visualizations make the occurrence, scope, and impact of an anomaly directly visible, and anomalies can be handled visually through a graphical console rather than a command line
  • Deployment architecture diagram
    • This section is used to understand the deployment architecture of the cluster and to detect and handle problems in the deployment dimension
    • Capacity management, including node management and capacity planning, is performed in this section
  • Business flow chart
    • This section accumulates flow charts of business functions. They help the business control functional complexity and understand the status of its functions, which together support business iteration
    • Business-related data analysis can be placed in this section
  • Data analysis: this section serves two kinds of data needs
    • Business needs
      • View: SLO information such as cluster size and cluster stability
      • Query: statistics by feature (for example, resource requests by label)
    • Stability assurance needs
      • View: SLI information such as cluster utilization, and SLO information such as cluster stability
      • Query: statistics by feature (for example, all associated resources and resource leaks by label)
  • Observability management
    • This section manages observability-related matters, including:
      • Observation data generation
      • Observation data collection
      • Observation data processing
      • Observation data consumption
  • Controllability management
    • This section manages control-related operations, including:
      • Release management
      • Disaster management
      • Budget management
      • Resource management
      • Chaos engineering
      • Security management
      • Regular health checks

During normal operation of the system:

  • In the “Data analysis” section, confirm the coverage and accuracy of the cluster's observability and controllability
  • In the “Observability management” section, manage observability, including data sources and the completion and governance of monitoring and alerts
  • In the “Controllability management” section:
    • Configure plans and manage issues according to problems found in the observation data
    • Configure plans according to problems found in chaos engineering or drills
  • In the “Running link diagram” and “Deployment architecture diagram” sections, visually associate the configured monitoring, alerts, and plans with components or links

During system exception and recovery, in the “Running link diagram” section:

  • Detect the exception through the cluster running link diagram or an alert
  • Start problem tracing automatically or manually
  • Identify abnormal components, abnormal links, and severity from the colors of components and interactions in the cluster running link diagram
  • Click the abnormal count on a component in the cluster running link diagram to view the associated exception details, or jump to the logging or tracing system for a manual query
  • Decide which plan to execute and which components are involved, based on the exception details or platform prompts
  • Execute the contingency plan (to contain the problem or restore service) in the cluster running link diagram
  • Confirm the effect of the plan from the colors of components and interactions in the cluster running link diagram
  • End problem tracing automatically or manually

The following items are recorded during problem tracing:

  • The issue
  • The time when the exception occurred
  • The actions performed during exception handling
  • A runtime snapshot
  • The time when the exception was recovered
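The items recorded during problem tracing can be captured in a small record type. The following is a minimal sketch; the class name, field names, and the derived time-to-recover value are hypothetical and are not part of the data model described in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TraceRecord:
    """One problem-tracing record, from exception to recovery."""
    issue: str
    started_at: datetime
    actions: list = field(default_factory=list)   # (time, action name) pairs
    snapshot: dict = field(default_factory=dict)  # runtime snapshot at detection
    recovered_at: Optional[datetime] = None

    def record_action(self, name, at):
        """Append an action performed during exception handling."""
        self.actions.append((at, name))

    def close(self, at):
        """Mark the exception as recovered."""
        if at < self.started_at:
            raise ValueError("recovery time precedes exception time")
        self.recovered_at = at

    def time_to_recover(self):
        """Return the recovery duration, or None while tracing is still open."""
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.started_at


if __name__ == "__main__":
    record = TraceRecord("mutatehook errors", datetime(2021, 1, 1, 10, 0))
    record.record_action("Plan Restart Workload", datetime(2021, 1, 1, 10, 5))
    record.close(datetime(2021, 1, 1, 10, 12))
    print(record.time_to_recover())
```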

Data model and competitiveness analysis

The data model is a vehicle through which stability assurance best practices can be iterated, shared, and applied. Common insight and plans can be standardized into services, while customized insight and plans can be described with the fixed structures and then implemented by generic controllers.

The technical core of a stability assurance service that builds insight + plans on the data model is:

  • Insight model
    • Key questions:
      • How to gain insight into cluster stability?
      • How to gain insight into business iteration efficiency?
  • Data model
    • Key question:
      • How to define valid and extensible data descriptions?

Building on this technical core, iteration can focus on the following competitive strengths:

  • Insight
    • Global
    • Digital
    • Visual
  • Efficiency
    • Shortest operating path
    • Minimum operating cost
  • Advancement
    • Best practices built into processes

Summary

With the specs of the seven data models (four diagrams and three lists), insight + plans can be expressed as structured descriptions. With this as the core, we keep iterating on the practice and understanding of stability assurance in order to accelerate business iteration. Taken further, the model could also feed back into the direction of business development.

If you are interested, you are welcome to discuss in the comments.