Author | Zhou Tao (Alibaba Cloud technical expert) | Alibaba Cloud Native official account
Challenges of Alibaba node operation and maintenance
At Alibaba, the challenges of node operation and maintenance come mainly from three aspects: scale, complexity, and stability.
The first is scale. Since the first ASI cluster was set up in 2018, hundreds of ASI clusters and hundreds of thousands of nodes have gone into production, with the largest single cluster exceeding 10,000 nodes. On top of them, Alibaba Group runs tens of thousands of different applications, such as Taobao and Tmall, with millions of container instances in total. ASI stands for Alibaba Serverless Infrastructure; it covers both the cluster management plane and the nodes. Each ASI cluster is an ACK managed cluster created through the standard Alibaba Cloud OpenAPI, on top of which we developed our own scheduler, developed and deployed many add-ons, enhanced functionality, optimized performance, and integrated with the group's various systems, achieving full node hosting so that application developers and operators do not need to care about the underlying container infrastructure.
Second, the environment is very complex. The IaaS layer currently runs a variety of heterogeneous machine types, including x86 servers, domestically produced ARM models, and GPU and FPGA models serving new computing and AI workloads. Many kernel versions are also online: 4.19 was rolled out at scale last year, while nodes on kernel versions 3.10 and 4.9 still need to be supported. Thousands of online applications are running, spanning businesses as different as Taobao, Tmall, Cainiao, Ele.me, Kaola, and Hema; at the same time, secure-container workloads such as big data, offline computing, real-time computing, and search run in the same host environment as these online applications.
Finally, the stability requirements are high. Online business is very sensitive to latency and jitter: jitter, a hang, or a crash on a single node can affect a particular user's order or payment on Taobao and lead to complaints, so the overall demand for stability is very high, requiring timely and effective handling of single-node failures.
KubeNode: An Introduction to the Cloud Native Node O&M Base
KubeNode is a base project developed by Alibaba to manage and operate nodes in a cloud native way. Compared with traditional procedural O&M methods, KubeNode extends K8s with CRDs and a set of corresponding Operators, providing complete lifecycle management of nodes and node components. Through this declarative, final-state-oriented approach, managing nodes and node components becomes as simple as managing an application in K8s, achieving a high degree of node consistency and self-healing capability.
The simplified architecture of KubeNode consists of the following parts:
On the central side, the Machine Operator is responsible for node and node component management, and the Remedy Operator for node self-healing. On the node side runs the Kube Node Agent, a single-machine agent that watches the CRD object instances generated by the central Machine Operator and Remedy Operator and performs the corresponding operations, such as installing node components and executing fault self-healing tasks.
In conjunction with KubeNode, Alibaba also uses NPD (Node Problem Detector) for single-machine fault detection, together with Kube Defender, an Alibaba-developed component, for unified risk control. The fault detection items provided by the community version of NPD are relatively limited, so Alibaba extended it, adding many node fault detection items drawn from years of practice in node and container operation and maintenance, greatly enriching single-node fault detection capability.
1. Relationship between KubeNode and community projects
- github.com/kube-node: unrelated; that project was discontinued in early 2018.
- ClusterAPI: KubeNode can be used as a complement to ClusterAPI's node final-state management.
A note on the relationship between Alibaba's self-developed KubeNode project and community projects: the name KubeNode may sound familiar because there was once a GitHub project with the same name, github.com/kube-node, but that project was discontinued in early 2018, so the two share only the name.
In addition, the community's ClusterAPI is a project for creating and managing K8s clusters and nodes. Here is a comparison of the two projects:
- Cluster creation: ClusterAPI is responsible for creating clusters; KubeNode does not provide this.
- Node creation: both ClusterAPI and KubeNode can create nodes.
- Node component management and final-state maintenance: ClusterAPI does not provide this functionality; KubeNode can manage node components and maintain their final state.
- Node fault self-healing: ClusterAPI mainly provides self-healing based on node health status; KubeNode offers richer self-healing for node components and can self-heal a variety of hardware and software failures on nodes.
Overall, KubeNode works alongside ClusterAPI and is a good complement to it.
The node components mentioned here refer to the software running on the node, such as kubelet and Docker; internally we use Pouch as our container runtime. Besides kubelet and Pouch, which are essential for scheduling, there are more than a dozen other components covering distributed container storage, monitoring collection, secure containers, fault detection, and more.
Installation and upgrade of kubelet and Docker are usually done with one-off, process-oriented actions, for example via Ansible. Over a long running period, it is common for a component's version to be accidentally modified or for the component to hit a bug and stop working. Meanwhile, these components iterate very fast at Alibaba, and a new version usually has to be released and leveled across clusters within one or two weeks. To meet the requirements of rapid iteration, safe upgrades, and consistent component versions, Alibaba developed KubeNode: nodes and node components are described by K8s CRDs and managed toward their final state, guaranteeing version consistency, configuration consistency, and correct running state.
2. KubeNode – Machine Operator
The diagram above shows the architecture of the Machine Operator, a standard Operator design: a set of extended CRDs plus central controllers. The CRD definitions include Machine and MachineSet for nodes, and MachineComponent and MachineComponentSet for node components.
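To make the CRD layer concrete, below is a minimal sketch of what three of these types might look like if written with standard Kubernetes API conventions. The field names are illustrative assumptions; the article does not publish KubeNode's actual schema.

```go
// Illustrative sketch of KubeNode-style CRD types; field names are assumptions.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Machine declares the desired final state of a single node.
type Machine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec struct {
		Provider string            `json:"provider"`         // Infra Provider, e.g. "alibabacloud"
		Labels   map[string]string `json:"labels,omitempty"` // synced to the Node once it is Ready
		Taints   []string          `json:"taints,omitempty"` // likewise synced to the Node
	} `json:"spec"`

	Status struct {
		Phase string `json:"phase,omitempty"` // e.g. "Importing", "Ready"
	} `json:"status,omitempty"`
}

// MachineComponent describes one version of a node component
// (kubelet, Pouch, storage/monitoring agents, ...) and how to install it.
type MachineComponent struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec struct {
		Name    string `json:"name"`    // e.g. "kubelet", "pouch"
		Version string `json:"version"` // the desired version of this component
		Package string `json:"package"` // where the install package lives
	} `json:"spec"`
}

// MachineComponentSet binds a set of machines to the component versions
// they should run; it is the unit that batch upgrades operate on.
type MachineComponentSet struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec struct {
		Selector   map[string]string `json:"selector"`            // which Machines are covered
		Components map[string]string `json:"components"`          // component name -> desired version
		BatchSize  int               `json:"batchSize,omitempty"` // machines to upgrade per batch
	} `json:"spec"`
}
```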
The controllers on the central side include the Machine Controller, the MachineSet Controller, and the MachineComponentSet Controller, which respectively control node creation and import, and the installation and upgrade of node components.
The Infra Provider is extensible and can connect to different cloud vendors. Currently it only connects to Alibaba Cloud, but other vendors such as AWS or Azure could be connected by implementing the corresponding provider.
On each node, the Kube Node Agent watches the CRD resources; when it finds new object instances, it installs or upgrades the node components, periodically checks that components are running normally, and reports their running status.
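A condensed sketch of that convergence loop is below; the helper functions stand in for the real watch, install, and health-check logic and are assumptions made to keep the example self-contained.

```go
package main

import (
	"log"
	"time"
)

// DesiredComponent is what the agent extracts from the watched
// Machine/MachineComponent objects for this node.
type DesiredComponent struct {
	Name, Version string
}

// fetchDesired would come from a client-go watch in the real agent;
// stubbed here so the sketch stays self-contained.
func fetchDesired() []DesiredComponent {
	return []DesiredComponent{{Name: "kubelet", Version: "v1.20.4"}}
}

func installedVersion(name string) string { return "" }   // query local package/process state
func install(c DesiredComponent) error    { return nil }  // run the component's install package
func healthy(name string) bool            { return true } // e.g. process, port, and self-check probes

func main() {
	for {
		for _, c := range fetchDesired() {
			// Converge only when the actual state drifts from the desired state.
			if installedVersion(c.Name) != c.Version {
				if err := install(c); err != nil {
					log.Printf("install %s@%s failed: %v", c.Name, c.Version, err)
					continue
				}
			}
			// Periodic health check; in the real agent the result is
			// reported back so central controllers see component status.
			if !healthy(c.Name) {
				log.Printf("component %s unhealthy on this node", c.Name)
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```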
1) Use Case: import nodes
The following walks through the process of importing an existing node with KubeNode.
First, the user submits an import operation for an existing node in our multi-cluster management system. The system then issues the certificates and installs the Kube Node Agent. Once the agent is up and running, the third step submits the Machine CRD. The Machine Controller then moves the status to the importing phase and, after the Node becomes Ready, synchronizes labels and taints from the Machine to the Node. In step five, the MachineComponentSet determines, from the Machine's information, which node components need to be installed and synchronizes them to the Machine. Finally, the Kube Node Agent watches the Machine and MachineComponent information and completes the installation of the node components; once all components are running normally, the node import is complete. The whole process is similar to a user submitting a Deployment and eventually a business Pod starting up.
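As an illustration of step 4, here is a sketch of how a controller could synchronize labels from the Machine onto the Node using the real client-go API; the label key and the fake-clientset demo are just for the example.

```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// syncLabelsToNode copies labels declared on the Machine onto the Node,
// which is step 4 of the import flow once the Node reports Ready.
func syncLabelsToNode(ctx context.Context, cs kubernetes.Interface,
	nodeName string, machineLabels map[string]string) error {

	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	for k, v := range machineLabels {
		node.Labels[k] = v // the Machine object is the source of truth here
	}
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	// Demo against a fake clientset so the sketch runs without a cluster.
	cs := fake.NewSimpleClientset(&v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1"}})
	err := syncLabelsToNode(context.Background(), cs,
		"node-1", map[string]string{"topology.kubernetes.io/zone": "zone-a"})
	fmt.Println("synced:", err == nil)
}
```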
Final-state consistency of node components mainly covers the correctness and consistency of software versions, software configuration, and running state.
2) Use Case: upgrade components
Here is the component upgrade process, which relies on the batch upgrade capability provided by MachineComponentSet Controller.
First, the user submits a component upgrade operation on the multi-cluster management system, and then a batch-by-batch upgrade cycle begins: the number of machines to upgrade in this batch is updated on the MachineComponentSet, after which the MachineComponentSet Controller calculates and updates the component version information on the corresponding number of nodes. The Kube Node Agent then watches the component change, installs the new version, and, after checking its status, reports the component as normal. Once all components in the batch have upgraded successfully, the next batch begins.
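A sketch of the batch-selection step is below: pick at most batchSize machines still behind the target version and mark them for upgrade, leaving the actual install to the Kube Node Agent. The machine type and field names are illustrative assumptions.

```go
package main

import "fmt"

// machine mirrors the minimal state the controller needs: which component
// version a node currently reports.
type machine struct {
	Name    string
	Version string
}

// nextBatch picks at most batchSize machines that still lag behind the
// target version; Kube Node Agent performs the actual install on each.
func nextBatch(machines []machine, target string, batchSize int) []machine {
	var batch []machine
	for _, m := range machines {
		if m.Version == target {
			continue // already converged, skip
		}
		batch = append(batch, m)
		if len(batch) == batchSize {
			break
		}
	}
	return batch
}

func main() {
	fleet := []machine{{"n1", "v1.0"}, {"n2", "v1.1"}, {"n3", "v1.0"}}
	fmt.Println(nextBatch(fleet, "v1.1", 2)) // -> n1 and n3
}
```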
The single-cluster, single-component upgrade process above is relatively simple, but with more than ten components and hundreds of clusters online, leveling versions across all clusters is not so easy. We operate through ASIOps, our unified cluster operation and maintenance platform. In ASIOps, hundreds of clusters are organized into a limited number of release pipelines, each arranged in the order test -> pre-release -> production. A normal release selects a pipeline and publishes in the preset cluster order; within each cluster, the release proceeds in batches of 1/5/10/50/100/... After each batch, a health inspection is triggered: if there is a problem, the release pauses automatically; if not, the next batch starts automatically after an observation period. In this way, a new component version can be rolled out safely and efficiently.
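The release pattern itself, growing batches gated by health inspections, can be sketched as follows; upgradeNode and healthCheck are assumptions standing in for the real ASIOps operations.

```go
package main

import (
	"fmt"
	"time"
)

func upgradeNode(name string) error { return nil }  // trigger the upgrade on one node
func healthCheck(name string) bool  { return true } // post-release inspection

// release rolls the upgrade out in growing batches (1/5/10/50/100/...),
// runs a health inspection after each batch, and pauses on any problem.
func release(nodes []string, observe time.Duration) error {
	sizes := []int{1, 5, 10, 50, 100}
	for i, b := 0, 0; i < len(nodes); b++ {
		n := sizes[len(sizes)-1]
		if b < len(sizes) {
			n = sizes[b]
		}
		end := i + n
		if end > len(nodes) {
			end = len(nodes)
		}
		for _, node := range nodes[i:end] {
			if err := upgradeNode(node); err != nil {
				return fmt.Errorf("paused: upgrade of %s failed: %w", node, err)
			}
			if !healthCheck(node) {
				return fmt.Errorf("paused: %s failed health inspection", node)
			}
		}
		time.Sleep(observe) // observation period before the next batch
		i = end
	}
	return nil
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5", "n6", "n7"}
	fmt.Println(release(nodes, time.Millisecond))
}
```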
3. KubeNode – Remedy Operator
Next is the Remedy Operator in KubeNode, also a standard Operator, which is used for fault self-healing.
The Remedy Operator also consists of a set of CRDs and corresponding controllers. The CRD definitions include NodeRemedier and RemedyOperationJob; the controllers include the Remedy Controller and the RemedyJob Controller, together with a registry of fault self-healing rules. On the single-machine side there are NPD and the Kube Node Agent.
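As with the Machine Operator, a minimal sketch of the two Remedy CRDs may help; all field names are assumptions for illustration.

```go
// Illustrative sketch of the Remedy CRD types; field names are assumptions.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeRemedier is a self-healing rule: when a Node reports the given
// Condition, run the given remedy operation.
type NodeRemedier struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec struct {
		ConditionType string `json:"conditionType"` // e.g. a hung-task condition
		Operation     string `json:"operation"`     // e.g. "restart-machine"
	} `json:"spec"`
}

// RemedyOperationJob is one concrete self-healing task generated by the
// Remedy Controller and executed by the Kube Node Agent on the target node.
type RemedyOperationJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec struct {
		NodeName  string `json:"nodeName"`
		Operation string `json:"operation"`
	} `json:"spec"`

	Status struct {
		Phase string `json:"phase,omitempty"` // e.g. "Pending", "Succeeded"
	} `json:"status,omitempty"`
}
```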
Host Doctor is an independent fault diagnosis system on the central side that obtains active operation and maintenance events from cloud vendors and converts them into fault Conditions on nodes. On Alibaba Cloud, hardware failures and planned maintenance operations on the physical machines hosting ECS instances can be obtained through standard OpenAPIs; after integrating with them, node problems can be sensed in advance and the services on affected nodes migrated away before the failure actually hits.
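A Host Doctor-style component could surface such an event as a Node Condition roughly as follows, using the real client-go API; the condition type name is an assumption, and the fake clientset is only there so the sketch runs without a cluster.

```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// reportMaintenance surfaces an upcoming maintenance event as a Node
// Condition, which downstream controllers can react to (e.g. migrate Pods).
func reportMaintenance(ctx context.Context, cs kubernetes.Interface, nodeName, reason string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
		Type:               "MaintenancePlanned", // assumed condition type, not a real KubeNode name
		Status:             v1.ConditionTrue,
		Reason:             reason,
		LastTransitionTime: metav1.Now(),
	})
	// Node conditions live in the status subresource, hence UpdateStatus.
	_, err = cs.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cs := fake.NewSimpleClientset(&v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1"}})
	err := reportMaintenance(context.Background(), cs, "node-1", "ECSActiveMaintenance")
	fmt.Println("condition reported:", err == nil)
}
```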
Use Case: self-healing a hung machine
The following introduces a typical self-healing process using the hung-machine case.
First, self-healing rules described by CRDs are configured and delivered on ASI Captain, our multi-cluster management system; these rules can be added flexibly and dynamically, and a corresponding repair operation can be configured for each Node Condition.
The NPD on the node then periodically checks for various types of faults. When it finds the abnormal log "Task XXX blocked for more than 120 seconds" in the kernel log, it determines that the node is hung and reports the fault Condition on the Node. The Remedy Controller, watching the change, triggers the self-healing process: first it calls the Kube Defender risk control center interface to determine whether the current self-healing operation is allowed to execute; after approval, it generates a RemedyOperationJob self-healing task, which the Kube Node Agent watches and then executes.
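A toy version of that detection step might look like the sketch below: scan the kernel log for the hung-task pattern and raise a signal. The log path is an assumption, and in practice this logic would run as an NPD plugin that sets a Node Condition rather than printing.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Assumed kernel log location; NPD typically tails the kernel log stream.
	f, err := os.Open("/var/log/kern.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// The signature written by the kernel's hung-task detector.
		if strings.Contains(sc.Text(), "blocked for more than 120 seconds") {
			// In NPD this would raise a Node Condition, the signal that
			// the Remedy Controller watches, instead of printing.
			fmt.Println("hung task detected: raise the node fault Condition")
			return
		}
	}
}
```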
As you can see, the whole self-healing process does not depend on external third-party systems: NPD does the fault detection and the Remedy Operator performs the self-healing, completing the whole process in a cloud native way and finding and repairing faults within minutes at the fastest. Through enhanced NPD detection rules, the faults handled cover hardware failures, OS kernel failures, and component failures, with full-link repair. It is worth noting that all self-healing operations go through the Kube Defender unified risk control center for flow control at minute, hour, and day granularity, to prevent scenarios such as a Region/Zone-level outage, massive IO hangs, or other widespread software bugs from triggering self-healing on all nodes in a Region and causing even more serious secondary failures.
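The flow-control idea behind that risk control can be sketched as a multi-window rate limiter: cap how many self-healing operations may proceed per minute, hour, and day, and block everything beyond the cap for human inspection. The limits and the sliding-window design are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// window tracks how many operations ran within one time span.
type window struct {
	span  time.Duration
	limit int
	times []time.Time
}

// Defender caps self-healing throughput across several windows at once.
type Defender struct {
	mu      sync.Mutex
	windows []*window
}

func NewDefender(perMinute, perHour, perDay int) *Defender {
	return &Defender{windows: []*window{
		{span: time.Minute, limit: perMinute},
		{span: time.Hour, limit: perHour},
		{span: 24 * time.Hour, limit: perDay},
	}}
}

// Allow reports whether one more self-healing operation may run now.
func (d *Defender) Allow(now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, w := range d.windows {
		// Drop timestamps that have fallen out of this window.
		keep := w.times[:0]
		for _, t := range w.times {
			if now.Sub(t) < w.span {
				keep = append(keep, t)
			}
		}
		w.times = keep
		if len(w.times) >= w.limit {
			return false // over the cap: block the remedy for human review
		}
	}
	for _, w := range d.windows {
		w.times = append(w.times, now)
	}
	return true
}

func main() {
	d := NewDefender(1, 10, 50)
	fmt.Println(d.Allow(time.Now())) // true: first operation passes
	fmt.Println(d.Allow(time.Now())) // false: the per-minute cap of 1 is hit
}
```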
KubeNode Data system
The construction of the KubeNode data system plays a very important role in measuring and improving the overall SLO.
On the node side, NPD detects faults and reports them to the event center, while WALLE, the single-node metrics collection component, collects node and container metrics, including common ones such as CPU, memory, IO, and network, as well as many others covering the kernel, secure containers, and so on. Prometheus (the ARMS product on Alibaba Cloud's public cloud) collects and stores the metrics of all nodes, together with data from the extended Kube State Metrics, to obtain key metric information from the Machine Operator and Remedy Operator. On top of these data, user-facing monitoring dashboards, fault alerts, and full-link diagnosis are configured.
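For a flavor of how such metrics can be exposed to Prometheus, here is a sketch using the standard Prometheus Go client; the metric name and labels are assumptions, not KubeNode's real metrics.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// componentConsistent reports whether a node component matches its desired
// version; the metric name and labels are assumptions for this sketch.
var componentConsistent = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubenode_component_consistent",
		Help: "1 if the component on the node matches the desired version, else 0.",
	},
	[]string{"node", "component"},
)

func main() {
	prometheus.MustRegister(componentConsistent)
	// In a real agent this would be updated from the convergence loop.
	componentConsistent.WithLabelValues("node-1", "kubelet").Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```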
With this data system in place, we can analyze resource utilization, provide real-time monitoring, alerting, and failure analysis, and also measure overall KubeNode coverage, the consistency rate of nodes and node components, and node self-healing efficiency. It also provides full-link diagnosis for all nodes: when investigating a node problem, you can view all historical events on the node, which helps locate the root cause quickly.
Future
At present, KubeNode covers all ASI clusters of Alibaba Group. Going forward, together with Alibaba Group's "Unified Resource Pool" project, KubeNode will be extended to a larger scope and more scenarios, so that the cloud native container infrastructure O&M architecture can deliver greater value.
About the Author
Zhou Tao, Alibaba Cloud technical expert, joined Alibaba in 2017. Over the past few years she has been responsible for developing Alibaba's cluster node control system at a scale of hundreds of thousands of nodes, and has taken part in each year's Double Eleven promotion. Since the group's move-to-cloud project began at the end of 2018, the nodes under management have expanded from on-premises physical machines to bare metal servers on Alibaba Cloud's public cloud, supporting the Double Eleven event and the full cloud native transformation of the core trading systems.