Author | Zeng Fansong (alias Zhuling), Senior Technical Expert, Alibaba Cloud Container Platform. Follow the "Alibaba Cloud Native" WeChat public account and reply with the keyword "919" to get the slides for this talk.

Takeaway: This article describes the typical problems Alibaba encountered while running Kubernetes at scale in production environments and the corresponding solutions, including performance and stability enhancements to etcd, kube-apiserver, and kube-controller-manager. These key enhancements are what allowed Alibaba's Kubernetes clusters of ten thousand nodes to smoothly support the Tmall 618 promotion in 2019.

Background

Alibaba's cluster management system has gone through several rounds of architectural evolution, from the earliest AI system (2013) to the full adoption of Kubernetes in 2018. The story of that period is fascinating, and I may have the chance to share it separately. Rather than discussing why Kubernetes won out both in the community and inside the company, this article focuses on what problems we hit when adopting Kubernetes and what key optimizations we made.














  • 200,000 pods
  • 1,000,000 objects













  1. etcd suffered high read/write latency and even denial of service; moreover, its space quota made it unable to hold the large number of objects that Kubernetes needs to store at this scale;
  2. Querying pods/nodes through the API server had very high latency, and concurrent queries could drive the backend etcd out of memory (OOM);
  3. The controllers could not perceive the latest changes from the API server in time, and their processing latency was high; after an abnormal restart, service recovery could take several minutes;
  4. The scheduler had high latency and low throughput, unable to meet Alibaba's routine operational needs, let alone the extreme scenarios of big promotions.


etcd improvements

To solve these problems, Alibaba Cloud Container Platform made extensive improvements at every layer to raise Kubernetes performance in large-scale scenarios. The first layer is etcd: as the database storing all Kubernetes objects, it is critical to the performance of the whole cluster.

  • In the first version, we raised the total amount of data etcd could hold by moving etcd's data onto a Tair cluster. A significant drawback, however, was that the extra Tair cluster increased operational complexity and posed a challenge to data safety: its replication model is not based on Raft consensus groups, so data safety was sacrificed.

  • In the second version, we stored different types of API server objects in different etcd clusters. Internally, etcd organizes data into different directories; by routing data from different directories to different backend etcd clusters, we reduced the total data stored in any single etcd cluster and improved scalability.

  • In the third improvement, we studied etcd's internal implementation and found a key factor limiting etcd's scalability in the page allocation algorithm of the underlying bbolt db: as the amount of data stored in etcd grows, the performance of bbolt db's linear scan for "a run of n contiguous free pages" degrades significantly.

To solve this problem, we designed a segregated hashmap for free-page management, keyed by the size of a contiguous page run, with the start page id of the run as the value. This makes free-page lookup an O(1) hashmap access and greatly improves performance. When a block is released, the new algorithm tries to merge it with address-adjacent pages and updates the segregated hashmap. A more detailed analysis of the algorithm can be found on the CNCF blog: www.cncf.io/blog/2019/0…
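The freelist idea above can be illustrated with a toy model. This is a simplified Python sketch, not bbolt's actual Go implementation: exact-size requests hit the hashmap in O(1), larger runs are split as a fallback, and freed runs are merged with their address-adjacent neighbors.

```python
class SegregatedFreelist:
    """Toy sketch of a segregated-hashmap freelist, in the spirit of the
    bbolt improvement described above (the real code lives in Go)."""

    def __init__(self):
        self.by_size = {}   # run size -> set of start page ids
        self.start = {}     # start page id -> run size
        self.end = {}       # last page id of a run -> start page id

    def _add(self, s, n):
        self.by_size.setdefault(n, set()).add(s)
        self.start[s] = n
        self.end[s + n - 1] = s

    def _del(self, s):
        n = self.start.pop(s)
        self.by_size[n].discard(s)
        del self.end[s + n - 1]
        return n

    def allocate(self, n):
        """O(1) for an exact-size hit; otherwise split a larger run."""
        if self.by_size.get(n):
            s = next(iter(self.by_size[n]))
            self._del(s)
            return s
        for size in sorted(self.by_size):   # fallback path
            if size > n and self.by_size[size]:
                s = next(iter(self.by_size[size]))
                self._del(s)
                self._add(s + n, size - n)  # keep the remainder free
                return s
        return None

    def free(self, s, n):
        """Release pages [s, s+n), merging with address-adjacent free runs."""
        if s - 1 in self.end:               # merge with the preceding run
            ps = self.end[s - 1]
            n += self._del(ps)
            s = ps
        if s + n in self.start:             # merge with the following run
            n += self._del(s + n)
        self._add(s, n)
```

For example, freeing pages [0, 4) and then [4, 10) yields a single run of 10 contiguous pages, so a subsequent `allocate(10)` succeeds in one hashmap lookup.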





API Server improvements

Efficient node heartbeats

For a Kubernetes cluster, one of the core problems affecting its ability to scale further is how to handle node heartbeats effectively. In a typical non-trivial production environment, the kubelet reports a heartbeat every 10 seconds, each about 15 KB in size (including the dozens of images on the node and several volumes), which causes two major problems:

  1. Each heartbeat request triggers an update of the Node object in etcd, which in a 10k-node cluster generates roughly 1 GB/min of transaction log (etcd records the change history of every object);
  2. The API server's CPU consumption is high: Node objects are large, serialization/deserialization is expensive, and handling heartbeat requests accounted for more than 80% of API server CPU time.
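The ~1 GB/min figure follows directly from the stated numbers; a quick back-of-the-envelope check (assuming 10,000 nodes, one 15 KB heartbeat per node every 10 seconds):

```python
# Back-of-the-envelope check of the heartbeat write volume quoted above.
nodes = 10_000
heartbeat_interval_s = 10
heartbeat_size_kb = 15

writes_per_min = nodes * (60 / heartbeat_interval_s)        # 60,000 updates/min
log_mb_per_min = writes_per_min * heartbeat_size_kb / 1024  # convert KB -> MB
print(round(log_mb_per_min))  # 879 MB/min, i.e. on the order of 1 GB/min
```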



To fix this, Kubernetes introduced the built-in Lease API, separating the liveness heartbeat from the Node object. With it, the kubelet behaves as follows:


  1. The kubelet updates its Lease object every 10 seconds as the liveness signal, and the Node Controller judges whether a node is alive from the Lease's state;
  2. For compatibility, the Node object is still updated every 60 seconds, so that the eviction manager and other components can keep working with the original logic.

Because a Lease object is very small, updating it is far cheaper than updating a Node object. With this mechanism, the CPU overhead of the API server dropped significantly, etcd's transaction log volume was greatly reduced, and the achievable scale grew from 1,000 to several thousand nodes. This feature is enabled by default since upstream Kubernetes 1.14.

API Server load balancing

In a production cluster, multiple API server nodes are typically deployed for performance and availability. In practice, however, load imbalance can arise among the API servers, especially during cluster upgrades or when some nodes restart. This puts great strain on cluster stability: the load that high availability was meant to spread across API servers can, in extreme cases, all fall back onto a single node, leading to long response times or even crashing that node and triggering an avalanche. The figure below shows a simulated case from a stress-test cluster: in a three-node cluster, all load migrates to one of the API servers after an upgrade, and its CPU usage is far higher than the other two nodes.





  1. Add an LB on the API server side, with all kubelets connecting to the LB; this is the model of typical Kubernetes clusters delivered by cloud vendors;
  2. Add an LB on the kubelet side, letting the LB choose which API server to connect to.





  1. API server: treat clients as untrusted and protect itself from being overwhelmed by requests. When its load exceeds a threshold, it returns 429 Too Many Requests to tell clients to back off; when the load exceeds a higher threshold, it rejects requests by closing client connections;
  2. Client: on receiving frequent 429 responses within a period of time, it rebuilds its connection to switch to another API server, and it also rebuilds connections periodically to complete the shuffling;
  3. At the operations level, API servers are upgraded with maxSurge=3 to avoid performance jitter during the upgrade.
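The rebalancing effect of points 1 and 2 can be seen in a toy simulation. This Python sketch is illustrative only (the thresholds and client counts are invented, not production values): each client sticks to one API server but switches to a random other server after a 429, and an initially fully skewed load spreads out.

```python
import random

# Toy simulation of 429-driven rebalancing across API servers.
# LIMIT plays the role of the load threshold at which a server answers 429.
random.seed(0)
SERVERS, CLIENTS, LIMIT = 3, 300, 120

def simulate(rounds=50):
    assigned = [0] * CLIENTS                 # start skewed: everyone on server 0
    for _ in range(rounds):
        load = [assigned.count(s) for s in range(SERVERS)]
        for c in range(CLIENTS):
            s = assigned[c]
            if load[s] > LIMIT:              # server answers 429: back off
                load[s] -= 1
                s = random.choice([x for x in range(SERVERS) if x != s])
                assigned[c] = s              # rebuild connection elsewhere
                load[s] += 1
    return [assigned.count(s) for s in range(SERVERS)]

final = simulate()
print(final)  # no server stays above LIMIT; the initial skew is spread out
```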

As the monitoring graph in the lower left of the figure shows, the enhanced version achieves essentially balanced load across the API servers, and when two nodes restart (the jitter in the graph), it quickly and automatically returns to the balanced state.

List-Watch & Cacher

List-Watch is the core mechanism for communication between server and clients in Kubernetes: the API server uses a Reflector to watch the changes of all objects in etcd and keeps them in memory, while clients such as controllers and kubelets subscribe to data changes through a similar mechanism.



A key concept in List-Watch is the resourceVersion: every modification of an object produces a monotonically increasing version number, which clients use to mark how far they have synchronized and to resume watching after a disconnect.













Inside the API server's cache, each object type has its own storage:

  1. Pod storage
  2. Node storage
  3. ConfigMap storage
  4. …

Each type of storage has a bounded queue holding the most recent changes to objects of that type, used to support watchers that lag slightly behind (retry scenarios). In general, all types share one increasing version-number space (1, 2, 3, …, n); that is, as shown in the figure above, the version numbers of pod objects are guaranteed to increase but not to be consecutive. A client using List-Watch may care about only a subset of pods: a typical kubelet cares only about the pods bound to its own node, in the figure only the green pods (2, 5). Because the storage queue is a bounded FIFO, old changes are evicted as new pod updates are enqueued. If the evicted updates are all irrelevant to a client, that client's progress stays at rv=5; when it reconnects after version 5 has been evicted, the API server cannot tell whether there were changes between 5 and the current minimum in the queue (7) that the client needs to see, so it returns a "too old resource version" error, forcing the client to re-list all data. To solve this problem, Kubernetes introduced the Watch Bookmark mechanism:
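The eviction problem and the bookmark fix can be shown with a toy model. In this Python sketch (a simplification, not the real watch cache), the server's bounded change queue evicts old events; a watcher stuck at an old resourceVersion gets a "too old" error, while a watcher whose version was advanced by a bookmark reconnects cleanly.

```python
from collections import deque

# Toy model of the bounded watch-cache queue and Watch Bookmark behavior.
class WatchCache:
    def __init__(self, maxlen=3):
        self.queue = deque(maxlen=maxlen)  # bounded FIFO of (rv, key)
        self.rv = 0

    def update(self, key):
        self.rv += 1
        self.queue.append((self.rv, key))  # old events fall off the front

    def resume(self, client_rv):
        """Replay events after client_rv; fail if they were already evicted."""
        if self.queue and client_rv < self.queue[0][0] - 1:
            raise LookupError("too old resource version")
        return [(rv, k) for rv, k in self.queue if rv > client_rv]

cache = WatchCache()
for key in ["pod-a", "pod-b", "pod-a", "pod-c", "pod-d"]:
    cache.update(key)          # rv runs 1..5; the queue now holds rv 3..5 only

try:
    cache.resume(1)            # a watcher that stalled at rv=1
except LookupError as e:
    print(e)                   # without bookmarks: forced to re-list everything

assert cache.resume(cache.rv) == []  # a bookmark advanced the client to rv=5:
                                     # reconnect succeeds, nothing to replay
```

A bookmark is just a periodic "you are current up to rv=N" event, so a watcher's resourceVersion keeps advancing even when none of the recent changes matched its filter.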





Cacher & Indexing

Besides List-Watch, the other client access pattern is querying the API server directly, as shown in the figure below. To guarantee that a client reads consistent data no matter which API server node it hits, the API server serves query requests by fetching the data from etcd. From a performance perspective, this raises several problems:

  1. Indexes are not supported: to query the pods of one node, you first have to fetch all pods in the cluster, which is very expensive;
  2. Because of etcd's request-response model, a single request for a large amount of data consumes a lot of memory; the API server therefore usually limits the request size and completes large queries through pagination, and the many round trips caused by pagination significantly reduce performance;
  3. To guarantee consistency, the API server queries etcd with a quorum read; the overhead of these queries is cluster-wide and cannot be scaled out.





  1. The client queries the API server at time t0;
  2. The API server asks etcd for the current data version rv@t0;
  3. The API server waits until its Reflector's data has reached version rv@t0;
  4. It then responds to the request from its cache.
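The four steps above can be sketched with a condition variable standing in for the watch cache. This is a minimal Python model under stated assumptions (all names are illustrative; the real mechanism lives inside the API server's cacher): the request notes etcd's current revision, waits for the cache to catch up, then answers from the cache.

```python
import threading

# Minimal sketch of a consistent read served from the watch cache.
class CachedStore:
    def __init__(self):
        self.rv = 0
        self.data = {}
        self.cond = threading.Condition()

    def apply(self, rv, key, value):
        """Reflector applying a watch event to the cache."""
        with self.cond:
            self.rv, self.data[key] = rv, value
            self.cond.notify_all()

    def read_at_least(self, target_rv, key, timeout=1.0):
        """Serve a read only once the cache has reached rv@t0 (= target_rv)."""
        with self.cond:
            if not self.cond.wait_for(lambda: self.rv >= target_rv, timeout):
                raise TimeoutError("cache lagging behind etcd")
            return self.data.get(key)

store = CachedStore()
etcd_rv_t0 = 2                      # pretend etcd reported rv@t0 = 2

def reflector():                    # watch events arrive asynchronously
    store.apply(1, "pod-a", "Pending")
    store.apply(2, "pod-a", "Running")

threading.Thread(target=reflector).start()
print(store.read_at_least(etcd_rv_t0, "pod-a"))  # answers once rv >= 2
```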





Controller failover

In a 10K-node production cluster, the controllers hold nearly a million objects. Fetching these objects from the API server and deserializing them is far from free, and a controller restart may take several minutes just to recover this state. That is unacceptable for a company of Alibaba's scale. To reduce the impact of component upgrades on system availability, we needed to minimize the interruption that a single controller upgrade causes. The solution:

  1. Pre-start the standby controller's informers so that the data the controller needs is loaded in advance;
  2. When the active controller is upgraded, it releases its Leader Lease, triggering the standby to take over immediately.
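The warm-standby handoff can be sketched as follows. In real clusters this is built on client-go's leader election over a Lease; the Python names below are illustrative stand-ins. The point is the ordering: the standby fills its informer cache before acquiring leadership, so takeover is nearly instant.

```python
import threading, time

# Sketch of a warm-standby controller handoff via a leader lease.
class LeaderLease:
    def __init__(self):
        self.holder = None
        self.cond = threading.Condition()

    def acquire(self, name):
        with self.cond:
            self.cond.wait_for(lambda: self.holder is None)
            self.holder = name

    def release(self):
        with self.cond:
            self.holder = None
            self.cond.notify_all()

events = []

def standby(lease):
    events.append("standby: informer cache warmed")  # pre-load BEFORE leading
    lease.acquire("standby")                         # block until lease is free
    events.append("standby: leading with warm cache")

lease = LeaderLease()
lease.acquire("active")
t = threading.Thread(target=standby, args=(lease,))
t.start()
time.sleep(0.1)    # standby warms its cache while the active one still leads
lease.release()    # upgrade: active steps down, standby takes over immediately
t.join()
print(events)
```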





Customized scheduler

For historical reasons, Alibaba's scheduler uses a self-developed architecture. Due to time constraints, this talk does not cover the scheduler enhancements in detail; only two basic ideas are shared here, as shown below:

  1. Equivalence classes: a typical scale-out request expands multiple containers at once, so we divide the requests in the pending queue into equivalence classes for batch processing, significantly reducing the number of Predicates/Priorities runs;
  2. Relaxed randomization: for a single scheduling request, when the cluster has many candidate nodes, there is no need to evaluate all of them; once enough feasible nodes have been found, the subsequent phases of scheduling can proceed (trading solution accuracy for scheduling performance).
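The two ideas above can be sketched in a few lines of Python. This is an illustrative toy, not Alibaba's scheduler: the grouping key and the feasibility predicate are invented stand-ins for the real equivalence hash and Predicates.

```python
import random
from itertools import groupby

# Toy sketches of equivalence-class batching and relaxed randomization.
random.seed(1)

def equivalence_classes(pending):
    """Group pending pods by resource spec so one feasibility evaluation
    can be reused for the whole batch."""
    key = lambda p: (p["cpu"], p["mem"])
    return [list(g) for _, g in groupby(sorted(pending, key=key), key=key)]

def find_feasible(nodes, pod, want=3):
    """Scan nodes in random order and stop after `want` feasible ones,
    instead of evaluating every node in the cluster."""
    feasible = []
    for n in random.sample(nodes, len(nodes)):
        if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]:
            feasible.append(n["name"])
            if len(feasible) >= want:   # relaxed: good enough, stop early
                break
    return feasible

pending = [{"cpu": 2, "mem": 4}] * 3 + [{"cpu": 4, "mem": 8}] * 2
classes = equivalence_classes(pending)
print(len(classes))                     # 2 classes instead of 5 separate pods

nodes = [{"name": f"node-{i}", "free_cpu": 8, "free_mem": 16} for i in range(1000)]
print(len(find_feasible(nodes, classes[0][0])))  # 3, without scanning all 1000
```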


Conclusion

Through this series of enhancements and optimizations, Alibaba successfully brought Kubernetes into production and reached the very large scale of 10,000 nodes in a single cluster. In summary:

  1. etcd's storage capacity was raised by separating indexes from data, sharding data across clusters, and finally improving the block allocation algorithm of the underlying bbolt db storage engine, greatly improving etcd's performance when storing large volumes of data; a single etcd cluster can now back a large-scale Kubernetes cluster, which greatly simplifies the overall system architecture;
  2. Landing lightweight node heartbeats, improving load balancing across API server nodes in HA clusters, adding bookmarks to the List-Watch mechanism, and attacking the most painful List performance bottleneck of large Kubernetes clusters through indexing and caching made stably running a 10,000-node cluster possible;
  3. Hot standby greatly shortens the service interruption of the controller/scheduler during active/standby switchover, improving the availability of the whole cluster;
  4. The two most effective performance ideas in Alibaba's self-developed scheduler: equivalence-class processing and a relaxed randomization algorithm.

Through these feature enhancements, Alibaba successfully runs its core internal business on Kubernetes clusters of ten thousand nodes, which withstood the test of the Tmall 618 promotion in 2019.

About the author: Zeng Fansong (alias Zhuling) is a senior technical expert on the Alibaba Cloud native application platform, with rich experience in distributed system design and development. In the field of cluster resource scheduling, his self-developed scheduling system has managed hundreds of thousands of nodes, and he has deep hands-on experience in cluster resource scheduling, container resource isolation, and co-locating different workloads. He is currently responsible for the large-scale landing of Kubernetes at Alibaba and its application to Alibaba's core e-commerce business, which has improved application release efficiency and cluster resource utilization and stably supported the 2018 Double 11 and 2019 618 promotions.
