Introduction: The emergence of Kubernetes allows most developers to operate and maintain complex distributed systems, greatly lowering the barrier to deploying containerized applications. However, operating and managing a production-grade, highly available Kubernetes cluster is still very difficult. This article shares how Ant Financial manages large-scale Kubernetes clusters efficiently and reliably, and describes the design of the core components of the cluster management system in detail.
Kubernetes leads container orchestration with its advanced design concepts and excellent technical architecture. More and more companies are beginning to run Kubernetes in production. At Alibaba and Ant Financial, Kubernetes has already been deployed in production at large scale.
An overview of the system
A Kubernetes cluster management system needs convenient cluster lifecycle management capabilities, covering cluster creation, upgrade, and work node management. At large scale, the controllability of cluster changes is directly related to cluster stability, so the system's ability to monitor changes, roll them out gradually (grayscale), and roll them back is one of the key points of the design. In addition, once a super-large cluster reaches 10K nodes, node hardware faults and component anomalies become routine. A management system for large-scale clusters must fully consider these abnormal scenarios from the beginning of the design and be able to recover from them.
Design patterns
Against this background, we designed an end-state-oriented cluster management system. The system periodically checks whether the current cluster state is consistent with the target state; if any inconsistency is found, the Operators initiate a series of operations to drive the cluster toward the target state. This design borrows from the negative-feedback closed-loop control system common in control theory: a closed-loop system can effectively resist external disturbances, which in our scenario correspond to node hardware and software failures.
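To make the closed-loop idea concrete, here is a minimal sketch of such a reconciliation loop. All types and functions (ClusterState, observe, desired, drive) are hypothetical placeholders, not the actual Ant Financial code:

```go
// Minimal sketch of an end-state reconciliation loop.
// Every name below is an illustrative placeholder.
package main

import (
	"log"
	"time"
)

// ClusterState is a simplified stand-in for the declared and observed state.
type ClusterState struct {
	MasterVersion string
	ReadyNodes    int
}

// observe queries the live cluster; desired reads the declared target state.
func observe() ClusterState { return ClusterState{MasterVersion: "v1.14", ReadyNodes: 9} }
func desired() ClusterState { return ClusterState{MasterVersion: "v1.14", ReadyNodes: 10} }

// drive issues corrective operations that move the cluster toward `want`.
func drive(cur, want ClusterState) error {
	log.Printf("reconciling: observed=%+v desired=%+v", cur, want)
	return nil
}

func main() {
	// Periodic check: any drift (node failure, component crash, manual change)
	// is treated as a disturbance and corrected on the next cycle.
	for range time.Tick(30 * time.Second) {
		if cur, want := observe(), desired(); cur != want {
			if err := drive(cur, want); err != nil {
				log.Printf("reconcile failed, will retry next cycle: %v", err)
			}
		}
	}
}
```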
Architecture design
As shown in the figure above, the meta-cluster is a highly available Kubernetes cluster that manages the Master nodes of N business clusters, and a business cluster is a Kubernetes cluster that serves production workloads. SigmaBoss is the cluster management portal, providing users with a convenient interface and a controllable change process. The cluster-operator deployed in the meta-cluster provides the ability to create, delete, and upgrade business clusters. It is designed around the end state: when a Master node or component of a business cluster becomes faulty, it automatically isolates and repairs the fault so that the business cluster's Master converges to a stable end state. We call this Kubernetes-managing-Kubernetes scheme Kube on Kube, or KOK for short. The machine-operator and the node fault self-healing components are deployed in each business cluster to manage its work nodes, providing functions for adding, deleting, upgrading, and troubleshooting nodes. On top of machine-operator's ability to maintain the end state of a single node, cluster-level grayscale change and rollback capabilities are built into SigmaBoss.
Core components
Cluster end-state holder
Based on Kubernetes CRDs, a Cluster CRD is defined in the meta-cluster to describe the end state of a business cluster. Each business cluster corresponds to one Cluster resource, and creating, deleting, or updating a Cluster resource corresponds to creating, deleting, or upgrading the business cluster. Cluster-operator watches Cluster resources and drives the Master components of the business cluster toward the end state described by the Cluster resource. The Master components of business clusters are maintained centrally in the ClusterPackageVersion CRD: a ClusterPackageVersion resource records the Master components (for example, api-server, controller-manager, scheduler, and Operators) and their default startup parameters. A Cluster resource is associated with exactly one ClusterPackageVersion, and you can release or roll back the Master components of a business cluster by modifying the ClusterPackageVersion recorded in the Cluster CRD.
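A rough sketch of what such CRD types could look like in Go (kubebuilder-style) is shown below; every field name is an assumption for illustration, not the real Cluster schema:

```go
// Hypothetical kubebuilder-style types sketching the Cluster CRD described above.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterSpec declares the desired end state of one business cluster.
type ClusterSpec struct {
	// ClusterPackageVersion pins the bundle of Master components
	// (api-server, controller-manager, scheduler, Operators, ...)
	// and their default startup parameters.
	ClusterPackageVersion string `json:"clusterPackageVersion"`
	// MasterReplicas is how many Master replicas to run in the meta-cluster.
	MasterReplicas int32 `json:"masterReplicas"`
}

// ClusterStatus reports the state cluster-operator has observed.
type ClusterStatus struct {
	Phase           string `json:"phase,omitempty"`           // e.g. Creating, Running, Upgrading
	ObservedPackage string `json:"observedPackage,omitempty"` // package version actually running
}

// Cluster is one business cluster managed by cluster-operator; changing the
// referenced ClusterPackageVersion releases or rolls back its Master.
type Cluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ClusterSpec   `json:"spec,omitempty"`
	Status ClusterStatus `json:"status,omitempty"`
}
```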
Node end-state holder
The management tasks of Kubernetes cluster work nodes are as follows:
- Node system configuration, kernel patch management;
- Installation, upgrade, and removal of Docker, Kubelet, and other components;
- Node end-state and schedulability management (for example, scheduling is enabled only after key DaemonSets have been deployed);
- Automatic node fault self-healing.
To accomplish the above management tasks, a Machine CRD is defined in the business cluster to describe the end state of a work node. Each work node corresponds to one Machine resource, and the work node is managed by modifying its Machine resource.
The Machine CRD definition is shown in the figure below. The Spec describes the component names and versions to be installed on the node, and the Status records the installation and running state of each component on the current work node. In addition, the Machine CRD provides a pluggable end-state management capability that other node management Operators can collaborate with, as described later in this article.
Component versions on the work node are managed through the MachinePackageVersion CRD. A MachinePackageVersion maintains the RPM version, configuration, and installation method of each component. A Machine resource is associated with N different MachinePackageVersions in order to install multiple components.
Based on the Machine and MachinePackageVersion CRDs, an end-state controller named machine-operator is designed and implemented. Machine-operator watches Machine resources, parses the associated MachinePackageVersions, and performs O&M operations on the node to drive it to its end state, then continues to guard that end state.
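The sketch below illustrates, with hypothetical type and field names, how the Machine spec/status and the MachinePackageVersion reference described above might be modeled in Go; it is not the actual CRD definition:

```go
// Hypothetical sketch of the Machine / MachinePackageVersion relationship.
package v1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PackageRef points a Machine at one MachinePackageVersion, which holds the
// component's RPM version, configuration, and installation method.
type PackageRef struct {
	Name    string `json:"name"`    // e.g. "docker", "kubelet"
	Version string `json:"version"` // resolved MachinePackageVersion
}

// MachineSpec lists the components the work node should be running.
type MachineSpec struct {
	Packages []PackageRef `json:"packages"`
}

// ComponentStatus records the install/run state of one component on the node.
type ComponentStatus struct {
	Name    string `json:"name"`
	Version string `json:"version"`
	Ready   bool   `json:"ready"`
}

// MachineStatus mirrors the node's sub-end-states as conditions so other
// node Operators can plug in (see the readiness-gate mechanism below).
type MachineStatus struct {
	Components []ComponentStatus      `json:"components,omitempty"`
	Conditions []corev1.NodeCondition `json:"conditions,omitempty"`
}

// Machine is one work node managed by machine-operator.
type Machine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachineSpec   `json:"spec,omitempty"`
	Status MachineStatus `json:"status,omitempty"`
}
```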
Node end-state management
As business demands change, node management is no longer limited to installing components such as Docker and Kubelet. Requirements such as waiting for the log-collection DaemonSet to finish deploying before enabling scheduling keep appearing, and there are more and more of them. If all such end states were managed by machine-operator alone, the coupling between machine-operator and other components would grow and the extensibility of the system would suffer. Therefore, we designed a node end-state management mechanism to coordinate machine-operator with other node O&M Operators. The design is shown in the figure below:
- Full ReadinessGates: records the full list of conditions that must be checked before a node can be scheduled;
- Condition ConfigMap: the per-node ConfigMap through which each node Operator reports its sub-end-state.
Collaboration:
- Each external node Operator detects and reports its sub-end-state to the corresponding Condition ConfigMap;
- Machine-operator collects all sub-end-states related to the node from the labeled Condition ConfigMaps and synchronizes them into the conditions of the Machine status;
- Machine-operator checks whether the node has reached its end state against the condition list recorded in the full ReadinessGates; if it has not, scheduling is not enabled (a minimal sketch of this gate check follows the list).
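A minimal, hypothetical sketch of the gate check described above; the types and names are illustrative, not the real machine-operator code:

```go
// Hypothetical readiness-gate check: scheduling is enabled only when every
// condition listed in the full ReadinessGates is True.
package main

import "fmt"

// Condition is one sub-end-state reported by a node Operator via its
// Condition ConfigMap and synced into Machine status by machine-operator.
type Condition struct {
	Type   string // e.g. "LogAgentReady"
	Status string // "True" | "False" | "Unknown"
}

// readyToSchedule checks every gate against the synced conditions.
func readyToSchedule(gates []string, conds []Condition) bool {
	byType := make(map[string]string, len(conds))
	for _, c := range conds {
		byType[c.Type] = c.Status
	}
	for _, g := range gates {
		if byType[g] != "True" {
			return false // a required sub-end-state is missing or not True
		}
	}
	return true
}

func main() {
	gates := []string{"ComponentsInstalled", "LogAgentReady"}
	conds := []Condition{{"ComponentsInstalled", "True"}, {"LogAgentReady", "False"}}
	fmt.Println(readyToSchedule(gates, conds)) // false: keep the node unschedulable
}
```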
Automatic node fault self-healing
As we all know, physical machine hardware fails with a certain probability. As the number of cluster nodes grows, faulty nodes become the norm. If they are not repaired online in time, the resources of these physical machines sit idle.
To solve this problem, we designed a closed-loop self-healing system for fault detection, isolation and repair.
As shown in the figure below, fault discovery combines Agent reporting with active probing by the monitoring system, ensuring both timeliness and reliability (Agent reporting is more real-time, while active probing covers scenarios where an abnormal Agent fails to report). Fault information is stored centrally in the event center, and any component or system that cares about cluster faults can subscribe to event-center events to obtain it.
The self-healing system creates a different maintenance process for each fault type, for example a hardware repair process or a system reinstallation process. During the process, the faulty node is first isolated (scheduling on the node is suspended), and then the Pods on the node are labeled for migration to notify PaaS or the MigrateController to move them away. Once these operations are complete, the repair actions are performed (hardware repair, operating system reinstallation, and so on). If the node is restored successfully, scheduling is enabled again; if it is not automatically restored for a long time, it is handed over for manual troubleshooting.
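The sketch below illustrates this isolate / migrate / repair / recover flow with placeholder functions; the event-center record and node operations shown are assumptions for illustration, not real interfaces:

```go
// Hypothetical sketch of a self-healing handler driven by fault events.
package main

import "log"

// FaultEvent is a simplified fault record from the event center.
type FaultEvent struct {
	Node string
	Kind string // e.g. "disk-failure", "kernel-panic"
}

func cordon(node string) error               { return nil } // suspend scheduling on the node
func markPodsForMigration(node string) error { return nil } // notify PaaS / MigrateController
func repair(node, kind string) error         { return nil } // hardware repair or OS reinstall
func uncordon(node string) error             { return nil } // re-enable scheduling

func handle(ev FaultEvent) {
	if err := cordon(ev.Node); err != nil {
		log.Printf("isolate %s failed: %v", ev.Node, err)
		return
	}
	if err := markPodsForMigration(ev.Node); err != nil {
		log.Printf("migrate pods on %s failed: %v", ev.Node, err)
		return
	}
	if err := repair(ev.Node, ev.Kind); err != nil {
		// Leave the node isolated and hand it over for manual troubleshooting.
		log.Printf("repair of %s failed, manual intervention needed: %v", ev.Node, err)
		return
	}
	if err := uncordon(ev.Node); err != nil {
		log.Printf("re-enable scheduling on %s failed: %v", ev.Node, err)
	}
}

func main() {
	handle(FaultEvent{Node: "node-001", Kind: "disk-failure"})
}
```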
Risk prevention
Based on the atomic capabilities provided by machine-operator, cluster-level grayscale change and rollback capabilities are designed and implemented in the system. In addition, to further reduce change risk, the Operators perform a risk assessment before initiating a real change, as shown in the diagram below.
High-risk change operations (such as node deletion and system reinstallation) are connected to a unified traffic-limiting center, which maintains traffic-limiting policies for different types of operations. If traffic limiting is triggered, the change is fused (halted).
To evaluate whether a change process is healthy, we perform a health check on each component before and after the change. Although health checks detect most anomalies, they cannot cover every scenario, so during risk assessment the system also obtains cluster service indicators (such as the Pod creation success rate) from the event center and the monitoring system. If any indicator is abnormal, the system automatically fuses the change.
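Below is a hypothetical sketch of such a pre-change risk assessment, combining the traffic-limiting check, component health checks, and a service-indicator threshold; every function name and the 0.99 threshold are illustrative assumptions:

```go
// Hypothetical pre-change risk assessment: fuse (halt) the change if the
// rate limiter, a health check, or a service indicator says no.
package main

import (
	"errors"
	"fmt"
)

func allowedByRateLimiter(op string) bool { return true }  // query the traffic-limiting center
func podCreateSuccessRate() float64       { return 0.999 } // from the monitoring system
func healthCheck(component string) error  { return nil }   // per-component health probe

// assessChange returns an error when the change must be fused.
func assessChange(op string, components []string) error {
	if !allowedByRateLimiter(op) {
		return errors.New("rate limit triggered, change fused")
	}
	for _, c := range components {
		if err := healthCheck(c); err != nil {
			return fmt.Errorf("health check failed for %s: %w", c, err)
		}
	}
	if podCreateSuccessRate() < 0.99 {
		return errors.New("pod creation success rate abnormal, change fused")
	}
	return nil
}

func main() {
	if err := assessChange("reinstall-node", []string{"kubelet", "docker"}); err != nil {
		fmt.Println("change blocked:", err)
		return
	}
	fmt.Println("change allowed")
}
```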
Conclusion
This article has shared the core design of Ant Financial's Kubernetes cluster management system at its current stage. The core components make extensive use of the Operator-based, end-state-oriented design pattern. This cluster management system withstood the performance and stability tests of this year's Double 11 preparations.
A complete cluster management system should not only ensure cluster stability and O&M efficiency, but also improve the overall resource utilization of the cluster. Next, we will improve the resource utilization of Ant Financial's production clusters by raising the node online rate and reducing the node idle rate.
Q & A
Q1: At present, most of our company’s applications are already deployed in Docker. How should we transition to K8s? Is there a case study?
A1: I have been working at Ant for nearly 5 years. Ant’s workloads have evolved from Xen virtual machines at the beginning to Docker containers scheduled by K8s, with roughly one major iteration every year. K8s is a very open “PaaS” framework: if your applications are already deployed in Docker and fit the characteristics of “cloud native” applications, migrating to K8s should in theory be smooth. Because Ant carries a heavy legacy burden, in practice we made some enhancements to K8s to meet business requirements and ensure a smooth migration of services.
Q2: Does deploying applications in K8s and Docker affect performance? For example, is it recommended to deploy big data processing tasks in K8s?
A2: My understanding is that Docker is a container, not a virtual machine, so the impact on performance is limited. Ant’s big data, AI, and other workloads are being migrated to K8s alongside online applications. Time-insensitive services such as big data can make good use of idle cluster resources and greatly reduce data center costs.
Q3: How can a K8s cluster be better combined with a traditional operation and maintenance environment? Right now the company certainly won’t run everything on K8s.
A3: If the infrastructure is not unified, resources cannot be scheduled uniformly, and maintaining two relatively independent O&M systems is costly. During migration, Ant implemented an “Adapter” that “bridges” the traditional instructions for creating containers or publishing releases into K8s resource modifications.
Q4: How does node monitoring work? Will Pods be transferred when a node goes down? Are there businesses that do not allow automatic migration?
A4: Node monitoring is divided into the hardware, system, and component levels. Hardware monitoring data comes from the IDC, and system-level monitoring uses an internally developed monitoring platform. If a node becomes abnormal, its Pods are migrated automatically. For some stateful services, custom Operators implement automatic Pod migration; Pods without automatic migration capability are destroyed automatically once they expire.
Q5: Will the whole K8s cluster be transparent to developers in the future? Will we program against the cluster with code or deployment files, instead of writing applications and deploying them container by container? Is there such a plan?
A5: K8s provides a lot of extended capabilities for building PaaS platforms, but it is very difficult to deploy applications directly to K8s right now. I think some kind of DSL for application deployment is the future, and K8s will be at the heart of this infrastructure.
Q6: We currently use kube-to-kube to manage the cluster. What are the advantages of Kube-on-kube over Kube-to-kube? In a large-scale scenario, where is the performance bottleneck during node scaling of a K8s cluster? How is it solved?
A6: There are already many CI/CD pipelines running on top of K8s. With the Kube-on-Kube solution, we can manage the control plane of a business cluster just as we manage an ordinary business application. Besides kubelet and Pouch, many DaemonSet Pods also run on each node, so when nodes are added at large scale, node components initiate a large number of List/Watch requests to the API server. Our optimization mainly focuses on improving API server performance and, in cooperation with the API server, reducing full List/Watch requests from nodes.
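As a generic illustration of reducing full List/Watch pressure (not Ant’s actual optimization), the client-go snippet below scopes a Pod watch to a single node with a field selector; the node name is a placeholder:

```go
// Generic client-go sketch: watch only the Pods bound to one node instead of
// listing/watching all Pods, which reduces load on the API server.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the component runs in a Pod
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// The field selector limits the watch to Pods scheduled onto this node.
	w, err := cs.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=node-001",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		log.Printf("event: %s", ev.Type)
	}
}
```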
Q7: Hello cangmo, since our company has not rolled out K8s yet, I would like to ask: what benefits does K8s bring us? What current problems can it solve? Which business scenarios and processes should adopt it first? Can the existing infrastructure be switched to Kubernetes smoothly?
A7: I think K8s’s biggest difference is its end-state-oriented design philosophy, rather than issuing one operation after another. This is very valuable for complex O&M scenarios. Judging from Ant’s migration practice, a smooth switch is possible.
Q8: Is the cluster-operator a Pod-based Operator that uses Pods to start the business cluster’s Master? And does the machine-operator then run on the physical machine?
A8: The Operators all run in Pods. The cluster-operator pulls up the machine-operator of the business cluster.
Q9: Hello! To cope with high-concurrency scenarios such as Double 11, what scale of meta-cluster is appropriate for managing what scale of business clusters? As I understand it, the cluster-operator list/watches resources; what optimizations have you made for large-scale concurrent scenarios?
A9: A single cluster can manage 10,000 nodes, and each business cluster contributes only a few Master nodes to the meta-cluster, so in theory one meta-cluster can manage 3K+ business clusters.
Q10: If a node encounters kernel, Docker, or K8s anomalies, how do you keep the system as normal as possible at the software level?
A10: Applications should have health checks, exit proactively when unhealthy, get detected by K8s, and be pulled up again on other nodes.
The resource scheduling team of Ant Financial is hiring
The resource scheduling team of Ant Financial is recruiting. Looking forward to your joining.
Ant Financial’s resource scheduling platform uses Pouch, Kubernetes, and other technologies to provide efficient resource scheduling capabilities for upper-layer businesses, with more than 10,000 physical nodes in a single cluster. The team is committed to breaking through technical bottlenecks to improve the resource utilization of large-scale clusters, reduce resource costs under the complex business scenarios of Double 11, build intelligent, efficient, and stable financial infrastructure, and support Ant’s technology-driven financial services.
Join us and you will:
- Design and implement a large-scale, efficient, and intelligent new-generation scheduling system based on the Kubernetes platform;
- Participate in the architecture design and development of the Ant Kubernetes cluster management platform, and build a large, highly available, intelligent cluster management system;
- Improve resource utilization and reduce costs; design and develop co-location (hybrid deployment), VPA, CPUShare, and other technologies;
- Support resource scheduling for computing, big data, machine learning/deep learning, and other workloads; design and develop high-concurrency, low-latency, large-scale scheduling technology.
Job Requirements:
- Familiar with at least one of Java or Golang, attentive to engineering quality, and able to solve various system problems independently;
- Familiar with the Kubernetes/Docker ecosystem, proficient in Kubernetes/container scheduling technology and the related project code;
- Deep understanding of the Linux system; experience with cgroups, CPU share, memory share, and other resource-related technologies is preferred.
Welcome to send your resume: [email protected]
Financial Class Distributed Architecture (Antfin_SOFA)