Author | Tian Xiaoxu

As Kubernetes becomes the de facto standard of cloud-native computing, enterprise use of Kubernetes is becoming mainstream.

According to the CNCF 2019 Kubernetes usage survey, 84% of respondents run Kubernetes in production, 34% run more than 1,000 containers in production, and 19% run more than 5,000. As clusters grow larger and more complex, cluster availability faces new challenges:

  • Overall metrics: is the cluster healthy, are all components working properly, how many Pod creations have failed in the cluster, and so on;
  • Traceability: what happened in the cluster, whether there were exceptions, what users did, and so on;
  • Root cause location: after an exception occurs, find out which component is at fault.

A good answer to these problems is the SLO: define SLOs to describe cluster availability, track the Pod lifecycle in the cluster, and quickly locate the abnormal component when a Pod fails. This article is based on an interview with Ant Group technical experts Fan Kang and Yao Jinghua, who shared how Ant Group's SLO system was built.

An SLA is an agreement derived from an SLO. An SLA forms a legally binding contract, usually between a service provider and an external customer, while an SLO is an expectation agreed between internal services that defines what the service is expected to provide.

1 SLO indicator definition

If cluster availability is to be described through definitions, the key question becomes which metrics to use. At Ant Group, five key indicators describe cluster availability: cluster health, Pod creation success rate, number of residual Terminating Pods, service online rate, and number of faulty machines.

  • Cluster health: three values are typically used: Healthy, Warning, and Fatal. Warning and Fatal map to the alerting system; for example, when a P2 alarm fires, the cluster status is Warning.
  • Pod creation success rate: a very important indicator. Ant Group creates millions of Pods every week, so even a small dip in the success rate means a large number of failed Pods.
  • Number of residual Terminating Pods: some may wonder why this is tracked instead of a deletion success rate. The reason is that at the scale of a million Pods, even a 99.9% deletion success rate still leaves thousands of Terminating Pods behind, and that many residual Pods occupy application capacity, which is unacceptable in production;
  • Service online rate: this metric is measured by probes; a probe failure means the cluster is unavailable.
  • Number of faulty machines: a node-level indicator. A faulty machine is typically a physical machine that cannot deliver Pods correctly. Faulty machines must be discovered, isolated, and repaired quickly, or cluster capacity suffers.

The thresholds and SLO targets for these indicators are set according to the growth of the business and may need to be adjusted as the business keeps growing.
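As an illustration only, the five indicators and their targets could be written down in one place so that alerting and reporting share a single definition. The names and target values in this Go sketch are assumptions for illustration, not Ant Group's actual thresholds.

```go
// Package slo sketches a single source of truth for availability indicators.
package slo

// Indicator pairs a cluster-availability metric with its SLO target.
type Indicator struct {
	Name   string // metric name (illustrative)
	Target string // SLO target; adjusted as the business grows
}

// ClusterSLO lists the five key indicators described above with made-up targets.
var ClusterSLO = []Indicator{
	{"cluster_health", "Healthy (no Warning/Fatal alarms open)"},
	{"pod_creation_success_rate", ">= 99.9% per week"},
	{"residual_terminating_pods", "<= 100 Pods stuck beyond 1h"},
	{"service_online_rate", "probe success >= 99.99%"},
	{"faulty_machines", "isolated and repaired within the agreed window"},
}
```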

Take the Pod creation success rate as an example. Ant Group divides Pods into ordinary Pods and Job-class Pods: the RestartPolicy of an ordinary Pod is Never, while the RestartPolicy of a Job-class Pod is Never or OnFailure, and both types have a delivery deadline. The delivery standard for an ordinary Pod is that it becomes Ready within 1 minute; the delivery standard for a Job-class Pod is that it reaches the Running, Succeeded, or Failed state within 1 minute. At first, the Pod creation success rate was defined as the ratio of successfully created Pods to total Pods. However, it soon became clear that this made it hard to tell why Pods were failing, so failures were split into two categories, user-caused and system-caused, and the creation success rate was redefined as the ratio of successfully created Pods to total Pods minus user-failed Pods.
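To make the adjusted formula concrete, here is a minimal Go sketch (not Ant Group's implementation) that classifies each delivery as succeeded, user-failed, or system-failed and computes the success rate with user-caused failures removed from the denominator.

```go
package slo

// DeliveryOutcome is a hypothetical classification of one Pod delivery attempt.
type DeliveryOutcome int

const (
	Succeeded    DeliveryOutcome = iota // delivery standard met within 1 minute
	UserFailed                          // failure caused by the user (e.g. invalid spec, bad image)
	SystemFailed                        // failure caused by the platform
)

// CreationSuccessRate returns successful deliveries divided by
// (total deliveries - user-caused failures).
func CreationSuccessRate(outcomes []DeliveryOutcome) float64 {
	var succeeded, userFailed int
	for _, o := range outcomes {
		switch o {
		case Succeeded:
			succeeded++
		case UserFailed:
			userFailed++
		}
	}
	denominator := len(outcomes) - userFailed
	if denominator == 0 {
		return 1.0 // no system-attributable deliveries in this window
	}
	return float64(succeeded) / float64(denominator)
}
```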

2 SLO system of Ant Group

After defining the key indicators of SLO, the next step is to build the SLO architecture.

According to Fan Kang, Ant Group's SLO system covers two aspects: one presents the current cluster indicators to end users and operations staff; the other has the individual components work together to analyze the current state of the cluster, identify the factors that affect the SLO, and provide the data needed to improve the cluster's Pod delivery success rate.

Ant Group SLO architecture diagram

From the top down, the layered architecture of Ant Group's SLO includes SLO, Trace System, Increase of SLO, Target, and The Unhealthy Node.

Among them, the top-level components mainly present the various indicators, such as cluster health status, Pod creation/deletion/upgrade success rates, the number of residual Pods, and the number of unhealthy nodes:

  • Display Board is the monitoring dashboard. Since it is not necessarily watched in real time, an Alert subsystem supporting multiple alarm channels was built so that the best window for handling emergencies is not missed.
  • Analysis System produces more detailed cluster operation reports by analyzing historical indicator data together with collected node metrics and master-component metrics.
  • Weekly Report summarizes the cluster's Pod creation/deletion/upgrade statistics for the week, along with a digest of failure cases.
  • Terminating Pods Number lists the Pods newly added over a period that could not be deleted through the Kubernetes mechanism, together with the reasons they remain.
  • Unhealthy Nodes shows the proportion of total available time of all nodes in the cluster over a period, the available time of each node, operation and maintenance records, and the list of nodes that cannot recover automatically and need manual intervention.
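As an illustration of how a "residual Terminating Pods" view could be fed, the following sketch (assuming in-cluster access with client-go; the 30-minute threshold is an arbitrary illustrative value) counts Pods whose deletionTimestamp has been set for longer than the threshold.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	threshold := 30 * time.Minute // illustrative threshold for "residual"
	residual := 0
	for _, p := range pods.Items {
		// A Pod being deleted carries a deletionTimestamp; if it has been
		// stuck longer than the threshold, treat it as residual.
		if p.DeletionTimestamp != nil && time.Since(p.DeletionTimestamp.Time) > threshold {
			residual++
			fmt.Printf("residual terminating pod: %s/%s\n", p.Namespace, p.Name)
		}
	}
	fmt.Printf("residual terminating pods: %d\n", residual)
}
```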

To support these features, Ant Group also developed a Trace System to analyze and show the specific reason a single Pod creation/deletion/upgrade failed. It includes three modules: log and event collection, data analysis, and Pod lifecycle display. The log and event collection module gathers the running logs of each master component and node component as well as Pod and Node events, and stores the logs and events indexed by Pod and by Node. The data analysis module reconstructs the time spent in each stage of the Pod lifecycle and determines the cause of Pod failure or node unavailability. Finally, the Report module exposes an API and UI to end users, showing the Pod lifecycle and the cause of any error.
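A rough sketch of the "index events by Pod" idea, assuming client-go: list the events that reference one Pod and print them in chronological order, which is the raw material a per-Pod lifecycle view needs. This is an illustration, not the Trace System's actual code.

```go
package trace

import (
	"context"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// PrintPodTimeline prints one Pod's events in time order
// (scheduled, image pulled, started, killed, ...).
func PrintPodTimeline(client kubernetes.Interface, namespace, podName string) error {
	selector := fmt.Sprintf("involvedObject.kind=Pod,involvedObject.name=%s", podName)
	events, err := client.CoreV1().Events(namespace).List(context.TODO(),
		metav1.ListOptions{FieldSelector: selector})
	if err != nil {
		return err
	}
	items := events.Items
	sort.Slice(items, func(i, j int) bool {
		return items[i].LastTimestamp.Before(&items[j].LastTimestamp)
	})
	for _, e := range items {
		fmt.Printf("%s  %-20s %s\n", e.LastTimestamp.Time.Format("15:04:05"), e.Reason, e.Message)
	}
	return nil
}
```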

3 Experience Summary

So far, Ant Group's SLO practice has not only improved the Pod delivery success rate of its clusters, but also, by building the tracing system, produced a data analysis and diagnosis platform that measures the time spent on the key links of Pod delivery and classifies failure causes. Fan Kang also shared five lessons on how to achieve a high SLO.

  • The biggest obstacle the SLO governance team faced while raising the success rate was image downloading. Pods must be delivered within a specified time, and image downloads usually take a long time, so the team defined an ImagePullCostTime error based on the measured image download time, meaning the download took so long that the Pod could not be delivered on time (see the sketches after this list). In addition, Alibaba's image distribution platform Dragonfly supports image lazy loading (Image LazyLoad), so Kubelet does not need to download the full image when creating a container, which greatly accelerates Pod delivery;
  • Improve the success rate of a single Pod: as the success rate rises, further gains become harder and harder; introducing workload-level retries helps. Ant Group's internal PaaS platform keeps retrying until the Pod is delivered successfully or the attempt times out. Note that the node that failed previously must be excluded before retrying (also sketched after this list).
  • Check the critical DaemonSets: if a critical DaemonSet is missing on a node, scheduling Pods to that node easily causes problems and can even affect the creation/deletion path, so such nodes should be handed to the faulty machine system.
  • Many plugins, such as the CNI plugin, need to register with Kubelet. A node may look completely normal while its registration with Kubelet has failed; such a node cannot deliver Pods either and should also be handed to the faulty machine system.
  • Because the cluster has many users, isolation matters. On top of permission isolation, QPS isolation and capacity isolation are also needed to prevent one user's Pods from exhausting the cluster capacity and harming other users.
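The image-download point above can be illustrated with a sketch (not Ant Group's implementation) that derives an ImagePullCostTime-style signal from the gap between the kubelet's "Pulling" and "Pulled" events for a Pod; the budget value and the single-pull simplification are assumptions.

```go
package trace

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// ImagePullCost reports how long image pulling took for a Pod, based on its
// events, and whether that exceeded the given budget. For simplicity it takes
// the first "Pulling" and the last "Pulled" event, ignoring multiple containers.
func ImagePullCost(events []corev1.Event, budget time.Duration) (time.Duration, bool, error) {
	var pulling, pulled *corev1.Event
	for i := range events {
		switch events[i].Reason {
		case "Pulling":
			if pulling == nil {
				pulling = &events[i]
			}
		case "Pulled":
			pulled = &events[i]
		}
	}
	if pulling == nil || pulled == nil {
		return 0, false, fmt.Errorf("no image pull events recorded")
	}
	cost := pulled.LastTimestamp.Time.Sub(pulling.LastTimestamp.Time)
	return cost, cost > budget, nil
}
```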
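For the single-Pod retry point, here is a sketch (again an assumption, not Ant Group's PaaS code) of excluding previously failed nodes before the next attempt by adding a NotIn node-affinity requirement to the Pod spec.

```go
package retry

import (
	corev1 "k8s.io/api/core/v1"
)

// ExcludeFailedNodes mutates the Pod spec so that a required node-affinity
// rule keeps the Pod off the given nodes on the next scheduling attempt.
func ExcludeFailedNodes(spec *corev1.PodSpec, failedNodes []string) {
	if len(failedNodes) == 0 {
		return
	}
	req := corev1.NodeSelectorRequirement{
		Key:      "kubernetes.io/hostname",
		Operator: corev1.NodeSelectorOpNotIn,
		Values:   failedNodes,
	}
	if spec.Affinity == nil {
		spec.Affinity = &corev1.Affinity{}
	}
	if spec.Affinity.NodeAffinity == nil {
		spec.Affinity.NodeAffinity = &corev1.NodeAffinity{}
	}
	na := spec.Affinity.NodeAffinity
	if na.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		na.RequiredDuringSchedulingIgnoredDuringExecution = &corev1.NodeSelector{}
	}
	sel := na.RequiredDuringSchedulingIgnoredDuringExecution
	if len(sel.NodeSelectorTerms) == 0 {
		sel.NodeSelectorTerms = []corev1.NodeSelectorTerm{{}}
	}
	// Terms are ORed while expressions inside a term are ANDed, so the
	// exclusion must be added to every term to always apply.
	for i := range sel.NodeSelectorTerms {
		sel.NodeSelectorTerms[i].MatchExpressions = append(
			sel.NodeSelectorTerms[i].MatchExpressions, req)
	}
}
```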