Authors | Yao Jinghua, Technical Expert at Ant Financial; Fan Kang, Senior Development Engineer at Ant Financial
As Kubernetes clusters grow in size and complexity, it becomes increasingly difficult for them to deliver Pods efficiently and with low latency. This article shares Ant Financial's methods and experience in designing an SLO architecture and achieving a high SLO.
Why SLO?
Gartner defines SLO as follows: within the SLA framework, the SLO is the goal the system must achieve, and it should ensure, as far as possible, that callers succeed. The three related terms are SLI, SLO, and SLA:
- **SLI** defines a measurement metric that describes how good a service is, together with the standard for "good"; for example, "a Pod is delivered within 1 minute." SLIs are usually formulated in terms of latency, availability, throughput, and success rate.
- **SLO** defines a small target: the percentage of time, over a period, that an SLI meets the "good" standard; for example, "99% of Pods are delivered within 1 minute." Once a service publishes its SLO, users have expectations about its quality.
- **SLA** is an agreement derived from an SLO. It usually describes how much the service provider pays if the target percentage defined by the SLO is not met. SLAs are generally written down in black and white as legally binding contracts, typically between a service provider and external customers (for example, Alibaba Cloud and its users). When an SLO is broken between internal services, the consequence is usually not financial compensation but an acknowledgment of responsibility.
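To make the SLI/SLO relationship concrete, here is a minimal Go sketch (our own illustration, not Ant's code) that evaluates an SLO such as "99% of Pods delivered within 1 minute" against a batch of per-Pod delivery latencies; the threshold and target values are assumptions taken from the example above.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values mirroring "99% of Pods delivered within 1 minute".
const (
	sliThreshold = 1 * time.Minute
	sloTarget    = 0.99
)

// attainment returns the fraction of deliveries that met the SLI threshold.
func attainment(deliveryLatencies []time.Duration) float64 {
	if len(deliveryLatencies) == 0 {
		return 1.0 // nothing delivered, nothing violated
	}
	good := 0
	for _, d := range deliveryLatencies {
		if d <= sliThreshold {
			good++
		}
	}
	return float64(good) / float64(len(deliveryLatencies))
}

func main() {
	samples := []time.Duration{30 * time.Second, 45 * time.Second, 90 * time.Second}
	ratio := attainment(samples)
	fmt.Printf("attainment=%.4f, SLO met: %v\n", ratio, ratio >= sloTarget)
}
```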
So, we focus more on the SLO inside the system.
What Do We Care About in a Large K8s Cluster?
As production environments keep evolving, K8s clusters become more complex and cluster scale keeps growing. How to guarantee the availability of a K8s cluster at large scale is a hard problem facing many vendors. For a K8s cluster, we usually have the following concerns:
- The first question is whether the cluster is healthy: are all components working properly, and how many Pod creation failures are there in the cluster? This is a question of overall metrics.
- The second question is what is happening in the cluster: are there any exceptions, and what are users doing in it? This is a question of traceability.
- The third question is, after an exception occurs, which component caused the drop in success rate? This is a question of root-cause localization.
So how do we solve the above problems?
- First, we define a set of SLOs that describe the availability of the cluster.
- Next, we must be able to track the lifecycle of Pods in the cluster; for Pods that fail, we also need to analyze the cause of failure so we can quickly locate the abnormal component.
- Finally, we need to eliminate cluster exceptions through optimization.
SLIs on Large K8s Clusters
Let’s take a look at some of the cluster metrics.
- The first metric: cluster health. It currently has three values: Healthy, Warning, and Fatal. Warning and Fatal map to the alerting system; for example, a P2 alert puts the cluster into Warning, while a P0 alert puts it into Fatal and it must be handled immediately.
- The second metric: success rate. Here, success rate means the Pod creation success rate. This is a very important metric: Ant creates millions of Pods per week, so even a small fluctuation in the success rate causes a large number of Pod failures. A drop in the Pod creation success rate is also the most intuitive reflection of a cluster anomaly.
- The third metric: the number of residual Terminating Pods. Why not use the deletion success rate? At the million-Pod scale, even a 99.9% deletion success rate leaves thousands of Terminating Pods behind. With that many Pods lingering and occupying application capacity, this is not acceptable in a production environment.
- The fourth metric: service online rate. The service online rate is measured by probes: if a probe fails, the cluster is considered unavailable. Service availability here targets the master components.
- The last metric: the number of failed machines, a node-dimension metric. Failed machines are typically physical machines that cannot deliver Pods properly, for example because the disk is full or the load is too high. Faulty machines must be discovered quickly, isolated, and repaired in time, since they reduce cluster capacity.
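As a hedged illustration of how such cluster-level SLIs might be exposed for the dashboards and alerts discussed later, the Go sketch below defines them as Prometheus metrics. The metric names, labels, and encodings are our own assumptions, not Ant's actual ones.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative SLI metrics; names and labels are assumptions.
var (
	// 0 = Healthy, 1 = Warning, 2 = Fatal.
	ClusterHealth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_health_status",
		Help: "Cluster health: 0 healthy, 1 warning, 2 fatal.",
	})

	PodCreateTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "pod_create_total",
		Help: "Pod creation attempts, labeled by result.",
	}, []string{"result"}) // result = success | failure

	TerminatingPods = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "terminating_pods_residual",
		Help: "Number of Pods stuck in Terminating.",
	})

	ServiceOnline = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "master_component_online",
		Help: "Probe result for master components: 1 online, 0 offline.",
	}, []string{"component"})

	FailedNodes = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "failed_nodes_count",
		Help: "Number of nodes that cannot deliver Pods properly.",
	})
)

func init() {
	prometheus.MustRegister(ClusterHealth, PodCreateTotal, TerminatingPods, ServiceOnline, FailedNodes)
}
```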
Success criteria and cause classification
Once we have the cluster metrics, we need to refine them and define the criteria for success.
Let’s start with the Pod creation success metric. We divide Pods into regular Pods and Job Pods. The RestartPolicy of a regular Pod is Always, while that of a Job Pod is Never or OnFailure. Both have a set delivery deadline, such as one minute. The delivery standard for a regular Pod is that it becomes Ready within 1 minute; for a Job Pod, it is that the Pod reaches Running, Succeeded, or Failed within 1 minute. Of course, the creation time excludes the PostStartHook execution time.
For Pod deletion, the success criterion is that the Pod is removed from etcd within a specified time. Of course, the deletion time excludes the PreStopHook period.
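The Go sketch below is one way these two criteria could be checked, assuming the trace system has already collected the relevant timestamps and hook durations; the record fields and the 1-minute deadline are illustrative, not Ant's actual implementation.

```go
package slo

import "time"

const deliveryDeadline = 1 * time.Minute

// PodKind distinguishes regular Pods (RestartPolicy=Always) from Job Pods
// (RestartPolicy=Never/OnFailure); the delivery criteria differ slightly.
type PodKind int

const (
	RegularPod PodKind = iota
	JobPod
)

// CreateRecord holds timestamps assumed to be collected by the trace system.
type CreateRecord struct {
	Kind              PodKind
	CreatedAt         time.Time
	ReadyAt           time.Time // regular Pods: when the Ready condition became true
	ReachedFinalPhase time.Time // Job Pods: when phase became Running/Succeeded/Failed
	PostStartHookCost time.Duration
}

// CreateSucceeded reports whether the Pod met the creation SLO:
// delivered within 1 minute, excluding PostStartHook execution time.
func CreateSucceeded(r CreateRecord) bool {
	end := r.ReadyAt
	if r.Kind == JobPod {
		end = r.ReachedFinalPhase
	}
	if end.IsZero() {
		return false
	}
	return end.Sub(r.CreatedAt)-r.PostStartHookCost <= deliveryDeadline
}

// DeleteSucceeded reports whether the Pod was removed from etcd within the
// deadline, excluding the PreStopHook period.
func DeleteSucceeded(requestedAt, goneFromEtcdAt time.Time, preStopCost time.Duration) bool {
	if goneFromEtcdAt.IsZero() {
		return false
	}
	return goneFromEtcdAt.Sub(requestedAt)-preStopCost <= deliveryDeadline
}
```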
Failed machines must be discovered, isolated, and degraded as soon as possible. For example, if a physical disk becomes read-only, the node must be tainted within 1 minute. The recovery time of a faulty machine depends on the cause of the fault; for example, a system failure that requires reinstalling the operating system takes longer to recover from.
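As a hedged illustration of the "taint within 1 minute" action (not Ant's actual controller), a client-go based remediation step might look like the sketch below; the taint key is a hypothetical example.

```go
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// TaintReadOnlyDisk marks a faulty node so the scheduler stops placing Pods on it.
// The taint key is illustrative; the real key/effect depends on cluster conventions.
func TaintReadOnlyDisk(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	taint := corev1.Taint{
		Key:    "node.example.com/disk-readonly", // hypothetical key
		Effect: corev1.TaintEffectNoSchedule,
	}
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```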
With these standards in place, we also categorized the causes of Pod failures. Some failures are caused by the system, and those we need to care about; others are caused by users, and those we do not.
For example, RuntimeError is a system cause: the underlying runtime has a problem. ImagePullFailed means Kubelet failed to download the image; since Ant has a webhook that verifies image accessibility, image download failures are generally caused by the system.
Since user-side causes cannot be fixed on the system side, we expose these failure causes to users through a query interface so they can resolve them themselves. For example, ContainerCrashLoopBackOff is usually caused by the user's own container exiting.
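A minimal sketch of this classification idea is shown below. Only the reason strings named in the text come from the article; the extra entry and the overall shape of the table are assumptions, and a real system would need many more entries.

```go
package slo

// Cause indicates who is responsible for a Pod failure.
type Cause int

const (
	SystemCause  Cause = iota // counted against the cluster SLO
	UserCause                 // surfaced to the user via the query interface
	UnknownCause
)

// classification maps failure reasons mentioned in the text (plus one
// illustrative extra) to responsibility.
var classification = map[string]Cause{
	"RuntimeError":              SystemCause, // underlying runtime problem
	"ImagePullFailed":           SystemCause, // images are pre-verified by webhook, so pull errors are system-side
	"ContainerCrashLoopBackOff": UserCause,   // usually the user's container exits
	"InvalidPodSpec":            UserCause,   // hypothetical example of a user-side error
}

// Classify returns who should handle a given failure reason.
func Classify(reason string) Cause {
	if c, ok := classification[reason]; ok {
		return c
	}
	return UnknownCause
}
```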
The infrastructure
The SLO infrastructure serves two purposes. On one hand, it shows the current cluster metrics to end users and operations staff. On the other hand, its components work together to analyze the current cluster state, identify the factors affecting the SLO, and provide data support for improving the Pod delivery success rate in the cluster.
Looking from the top down, the top-level components focus on metrics such as cluster health, Pod creation/deletion/upgrade success rates, the number of residual Pods, and the number of unhealthy nodes. The Display Board is what is usually called the monitoring dashboard.
We also built the Alert subsystem, which supports flexible configuration: for each metric, it can trigger multiple alert channels, such as phone calls, SMS, and email, based on either the percentage drop or the absolute drop of the metric.
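A hedged sketch of what such a rule might look like is given below; the field names, thresholds, and channel strings are illustrative, not the actual Alert subsystem configuration.

```go
package alerting

// Rule describes one alert for an SLI, configured both by relative drop and
// by an absolute floor. Fields are illustrative.
type Rule struct {
	Metric        string
	MaxDropRatio  float64  // e.g. 0.05 -> alert if the metric drops more than 5% vs. baseline
	AbsoluteFloor float64  // e.g. 0.95 -> alert if the metric falls below 95%
	Channels      []string // "phone", "sms", "email"
}

// ShouldAlert compares the current value against a baseline and the rule's thresholds.
func ShouldAlert(r Rule, baseline, current float64) bool {
	if current < r.AbsoluteFloor {
		return true
	}
	if baseline > 0 && (baseline-current)/baseline > r.MaxDropRatio {
		return true
	}
	return false
}
```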
The Analysis System provides a more detailed report on cluster operation by analyzing historical metric data together with the collected node metrics and master component metrics. Among its parts:
- The Weekly Report subsystem provides statistics on Pod creation/deletion/upgrade in the cluster for the current week, along with a summary of failure cases.
- Terminating Pods Number lists the Pods that K8s newly failed to delete over a period of time, together with the reason each Pod remains.
- Unhealthy Nodes shows, for a period of time, the proportion of total available time across all nodes in the cluster, the available time of each node, operations records, and the list of nodes that cannot be recovered automatically and need manual recovery.
To support these features, we developed a Trace System that analyzes and shows the specific reason why a single Pod creation/deletion/upgrade failed. It consists of three modules: log and event collection, data analysis, and Pod lifecycle display:
- The log and event collection module collects the running logs and Pod/Node events of each master component and node component, and stores them indexed by Pod/Node.
- The data analysis module reconstructs the time spent in each stage of the Pod lifecycle and determines the reasons for Pod failures and node unavailability.
- Finally, the Report module exposes an API and a UI to end users, showing the Pod lifecycle and the cause of errors.
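To make the data flow concrete, the sketch below shows one possible per-Pod record that such a trace system could assemble from logs and events indexed by Pod UID. The field names and structure are our assumptions, not the actual Trace System schema.

```go
package trace

import "time"

// PhaseSpan records how long one stage of the Pod lifecycle took and which
// component was responsible for it (scheduler, kubelet, CNI, runtime, ...).
type PhaseSpan struct {
	Name      string
	Component string
	Start     time.Time
	End       time.Time
}

// PodTrace is an illustrative per-Pod record assembled from master/node logs and events.
type PodTrace struct {
	PodUID     string
	NodeName   string
	Phases     []PhaseSpan
	Events     []string // raw events indexed by Pod UID
	Delivered  bool     // result of the SLO success determination
	FailReason string   // classified failure reason, empty when delivered
}

// TotalDuration sums the recorded phase durations for the Pod.
func (t PodTrace) TotalDuration() time.Duration {
	var total time.Duration
	for _, p := range t.Phases {
		total += p.End.Sub(p.Start)
	}
	return total
}
```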
The trace system
Next, let's take a Pod creation failure as an example to show the workflow of the tracing system.
After the user enters a Pod UID, the tracing system looks up the corresponding Pod lifecycle analysis record and the delivery success determination through the Pod index. The stored data is not only basic information for end users; by looking at the lifecycle of Pods across the cluster, it also lets us analyze the operating state of the cluster and of each node. For example, if too many Pods are scheduled onto a hot node, the delivery of different Pods causes resource contention on that node, the node load becomes too high, its delivery capacity drops, and eventually Pod delivery on that node times out.
As another example, historical statistics can be used to derive an execution time baseline for each phase of the Pod lifecycle. Using these baselines as evaluation criteria, we compare the average time and time distribution across different component versions and recommend improvements. In addition, by looking at each component's share of the overall Pod lifecycle, we find the steps with the largest share, which provides data support for later optimization of Pod delivery time.
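A minimal sketch of both calculations is shown below, assuming the phase durations have already been extracted from traces; the percentile choice and function names are illustrative.

```go
package analysis

import (
	"sort"
	"time"
)

// Baseline returns an approximate percentile (e.g. 0.95) of historical phase
// durations, which can then serve as the evaluation standard for new component versions.
func Baseline(durations []time.Duration, percentile float64) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(percentile * float64(len(sorted)-1))
	return sorted[idx]
}

// ShareOfLifecycle computes each phase's share of the total delivery time, to
// highlight which step to optimize first.
func ShareOfLifecycle(perPhase map[string]time.Duration) map[string]float64 {
	var total time.Duration
	for _, d := range perPhase {
		total += d
	}
	shares := make(map[string]float64, len(perPhase))
	for name, d := range perPhase {
		if total > 0 {
			shares[name] = float64(d) / float64(total)
		}
	}
	return shares
}
```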
Node Metrics
A healthy cluster requires not only high availability of the master component, but also node stability.
If we compare Pod creation to an RPC call, each node is an RPC service provider, and the total capacity of the cluster equals the total number of Pod creation requests the nodes can handle. Every additional unavailable node means a drop in cluster delivery capacity and available resources, so we must ensure the high availability of the nodes in the cluster. Every failed Pod delivery/deletion/upgrade also means higher cost and a worse experience for users, so cluster nodes must stay healthy for the Pods scheduled onto them to be delivered successfully.
In other words, node exceptions must not only be detected as early as possible, they must also be fixed as soon as possible. By analyzing the role of each component on the Pod delivery link, we added metrics for the different component types and converted host running state into metrics. After collecting this data and combining it with the Pod delivery results on each node, we can build a model to predict node availability, analyze whether a node has an unrecoverable anomaly, and adjust the node's weight in the scheduler accordingly to improve the Pod delivery success rate.
For Pod creation/upgrade failures, users can retry to work around the problem, but Pod deletion failures are different. Although K8s is oriented toward the desired end state and components keep retrying, dirty data can still remain in the end, for example a Pod that has been deleted from etcd but whose data still lingers on the node. We therefore designed and implemented an inspection system: it queries the apiserver for the Pods scheduled onto the current node, compares that list with what actually exists on the node, finds leftover processes/containers/volume directories/cgroups/network devices, and tries to release those leftover resources through other means.
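The core of that comparison could look like the sketch below: list the Pods the apiserver believes are on this node, list the Pod UIDs that still own resources locally, and flag the difference. This is a hedged illustration; the LocalLister interface is an assumption standing in for node-local discovery of containers, volume directories, and cgroups.

```go
package inspection

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// LocalLister abstracts node-local discovery of containers/volume dirs/cgroups;
// its implementation is out of scope here and entirely hypothetical.
type LocalLister interface {
	PodUIDsOnNode() ([]string, error)
}

// FindOrphans returns Pod UIDs that still own resources on the node but are no
// longer known to the apiserver, i.e. dirty data left after a failed deletion.
func FindOrphans(ctx context.Context, client kubernetes.Interface, nodeName string, local LocalLister) ([]string, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return nil, err
	}
	expected := make(map[string]bool, len(pods.Items))
	for _, p := range pods.Items {
		expected[string(p.UID)] = true
	}

	actual, err := local.PodUIDsOnNode()
	if err != nil {
		return nil, err
	}
	var orphans []string
	for _, uid := range actual {
		if !expected[uid] {
			orphans = append(orphans, uid)
		}
	}
	return orphans, nil
}
```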
Unhealthy node
The following describes how faulty machines are handled.
There are many data sources for judging a machine as faulty, mainly node monitoring indicators, such as:
- A Volume fails to mount
- NPD (Node Problem Detector), a community framework
- The Trace system, for example Pods on a node persistently failing to download images
- The SLO, for example a large number of Pods left behind on a single machine
We developed a number of controllers to inspect these faults and produce a list of faulty machines; a faulty machine can have several faults at once. For each faulty machine, different actions are taken depending on the fault. The main actions are: applying a taint to prevent Pods from being scheduled onto the node; lowering the node's priority; and running automatic recovery directly. Some special cases, such as a full disk, require manual intervention.
The faulty machine system produces a daily report showing what it did that day. Developers can keep refining the whole system by adding controllers and processing rules.
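A minimal sketch of the fault-to-action dispatch described above is shown below; the fault names and the policy table are illustrative assumptions, and a real system would drive real controllers rather than return enum values.

```go
package remediation

// Action is what the faulty-machine system does for a detected fault.
type Action int

const (
	TaintNode       Action = iota // stop scheduling new Pods onto the node
	LowerPriority                 // reduce the node's weight in the scheduler
	AutoRecover                   // run an automated repair
	ManualIntervene               // e.g. full disk, needs a human
)

// policy maps fault names (illustrative) to actions; each controller appends
// the faults it detects, and a node may carry several at once.
var policy = map[string]Action{
	"DiskReadOnly":    TaintNode,
	"HighLoad":        LowerPriority,
	"KubeletNotReady": AutoRecover,
	"DiskFull":        ManualIntervene,
}

// Decide returns the set of actions for all faults found on a node.
func Decide(faults []string) []Action {
	seen := map[Action]bool{}
	var actions []Action
	for _, f := range faults {
		if a, ok := policy[f]; ok && !seen[a] {
			seen[a] = true
			actions = append(actions, a)
		}
	}
	return actions
}
```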
Tips on increasing SLO
Next, let’s share some ways to achieve a high SLO.
- First, the biggest problem we faced while raising the success rate was image downloading. Remember, Pods must be delivered within a certain time frame, and image downloads usually take a lot of time. So we measure the image download time and report an ImagePullCostTime error when the download takes so long that the Pod cannot be delivered on time (see the sketch after this list).
Fortunately, Dragonfly, Alibaba's image distribution platform, supports image lazy loading (Image LazyLoad), which lets Kubelet create containers from remote images without downloading them first. This greatly speeds up Pod delivery. For more information about Image LazyLoad, check out Alibaba Dragonfly's shared material.
- Second, as the success rate goes up, it becomes harder and harder to improve the success rate of a single Pod, so the workload layer can be made to retry. At Ant, the PaaS platform keeps retrying until the Pod is delivered successfully or a timeout is reached. Of course, when retrying, the previously failed nodes need to be excluded.
- Third, critical DaemonSets must be checked. If a critical DaemonSet is missing but Pods are still scheduled onto the node, problems are very likely, affecting the creation/deletion link. This check needs to be connected to the faulty machine system.
- Fourth, many plugins, such as CSI plugins, need to register with Kubelet. There may be nodes where everything looks normal but registration with Kubelet failed; such a node also cannot deliver Pods and needs to be connected to the faulty machine system.
- Finally, because the cluster has a very large number of users, isolation is important. On top of permission isolation, QPS isolation and capacity isolation are also needed to prevent one user's Pods from exhausting cluster capacity, protecting the interests of other users.