Ant Group operates what may be the largest Kubernetes (K8s) cluster in the world: the K8s community officially cites 5K nodes as the upper limit of K8s scale, while Ant Group operates a single cluster of roughly 10K nodes. If the cluster size the community and most K8s users can imagine is Mount Tai, then Ant Group has built a Mount Everest on top of the official solution, leading the advance of K8s scalability.

This difference in magnitude is not merely quantitative; it demands a qualitative leap in how the cluster is operated and maintained. Ant Group can run a cluster of this scale, with all its challenges, because it has invested far more optimization effort than the upstream K8s project.

Great oaks grow from little acorns. This article focuses on the high-availability work Ant Group has done on etcd, the cornerstone of K8s: only when the etcd foundation is stable can the K8s high-rise built on top of it stay stable. Attached is a related WeChat Moments post from Huang Dongxu of TiDB [the image is used with Huang's permission].

Challenges

Etcd is, first of all, the KV database of a K8s cluster. Seen from the database perspective, the components of a K8s cluster play the following roles:

  1. etcd – the cluster database

  2. kube-apiserver – proxy for the etcd API and the data cache layer

  3. kubelet – producer and consumer of data

  4. kube-controller-manager – producer and consumer of data

  5. kube-scheduler – producer and consumer of data

Essentially, etcd is a KV database storing K8s' own resources, user-defined CRDs, and K8s system event data. Each type of data has different consistency and durability requirements; for example, the durability required for event data is lower than that for K8s resource data and CRD data.

When K8s was first being promoted, early proponents claimed that one of its advantages over OpenStack was that it did not use a message queue and therefore had lower latency. This is actually a misunderstanding: both the watch interface exposed by etcd and the informer mechanism in the K8s client packages show that K8s effectively treats etcd as a message queue. K8s messages come in many forms, such as K8s Events.
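To illustrate this push model, here is a minimal sketch (not Ant's production code) that subscribes to changes under a key prefix using the etcd v3 Go client; the endpoint and prefix are placeholder assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; in a real control plane kube-apiserver is the
	// component holding this connection, not a standalone program.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Watch behaves like a message subscription: every change under the prefix
	// is pushed to this channel as an event, which is how informers stay current.
	for resp := range cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s (mod revision %d)\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}
```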

From the message-queue point of view, the components of a K8s cluster play the following roles:

  1. etcd – message router

  2. kube-apiserver – etcd producer, message agent, and message broadcaster [it can also be seen as a secondary message router and a consumer-side proxy]

  3. kubelet – producer and consumer of messages

  4. kube-controller-manager – producer and consumer of messages

  5. kube-scheduler – producer and consumer of messages

Etcd is a push-mode message queue. It is both the KV database and the message router of a K8s cluster, playing the roles that MySQL and the MQ play in an OpenStack cluster. This design appears to simplify the cluster architecture, but in practice it does not: in large-scale K8s clusters, the usual practice is to first move event data into a separate etcd cluster, physically separating the KV data from part of the MQ-style data so that the KV and MQ roles are partially decoupled. As mentioned in reference document 2, Meituan "reduces the pressure on the main etcd cluster by splitting out an independent Event cluster".

As a K8s cluster grows, etcd comes under three kinds of pressure: a surge in KV data, a surge in event messages, and message write amplification. To make this concrete, here are some figures from our cluster:

  1. the etcd KV data volume is above 1 million keys;

  2. the etcd event data volume is above 100,000;

  3. peak etcd read traffic is above 300,000 QPM, of which event reads exceed 10K QPM;

  4. peak etcd write traffic is above 200,000 QPM, of which event writes exceed 15K QPM;

  5. etcd CPU usage regularly spikes above 900%;

  6. etcd memory RSS is above 60 GiB;

  7. etcd disk usage reaches 100 GiB;

  8. etcd itself runs more than 9K goroutines;

  9. etcd uses more than 1.6K user-mode threads;

  10. a single etcd GC pass can take up to 15 ms under normal conditions.

Etcd, implemented in Go, is pushed very hard in terms of CPU, memory, GC, goroutine count, and thread usage, all of which are close to the limits of the Go runtime: CPU profiles frequently show the Go runtime and GC consuming more than 50% of the CPU.

Before Ant's K8s clusters went through this high-availability project, once a cluster exceeded 7,000 nodes the following performance bottlenecks appeared:

  1. etcd read and write latency spiked heavily, sometimes up to the minute level;

  2. kube-apiserver queries for Pods/Nodes/ConfigMaps/CRDs had high latency and could drive etcd to OOM;

  3. an etcd list-all of Pods could take up to 30 minutes;

  4. in 2020 there were several incidents in which list-all pressure caused an etcd cluster to collapse;

  5. controllers could not perceive data changes in time, with watch latency of up to 30 s.

If an etcd cluster in this state was dancing on a knife's edge, the K8s cluster was an active volcano: one careless move could trigger a P-level incident. At that time, operating the K8s masters was probably one of the most dangerous jobs in all of Ant Group.

High Availability Strategy

There are, broadly, three ways to improve the availability of a distributed system:

  1. improve its own stability and performance;

  2. manage upstream traffic carefully;

  3. guarantee the SLO of downstream services.

Etcd has been hardened by the community and by a wide range of users over the years and is stable enough in itself. What the Ant team can do is squeeze the most out of the cluster's resources, using both scale-up and scale-out techniques to improve its performance as far as possible.

Etcd is the cornerstone of K8s and has no downstream service of its own; if anything, its downstream is the physical node environment it runs on. The following sections describe the high-availability improvements we have made at the etcd level, covering both cluster performance and request traffic management.

File System Upgrade

A golden phoenix rarely flies out of a poor mountain nest: there is no better way to make etcd run faster than to give it a high-performance machine.

1. Use NVMe SSDs

An etcd deployment is the etcd program plus its runtime environment. The early etcd servers used SATA disks, and a quick test showed that etcd's disk reads were very slow, so the machines were upgraded to the F53 specification with NVMe SSDs. After BoltDB data was moved onto NVMe SSDs, the random write rate rose to more than 70 MiB/s.

Reference document 2 mentions that Meituan "can sustain five times its normal peak traffic by deploying on high-spec physical machines with SSDs". Clearly, improving hardware is the first choice of the large players: if the problem can be solved by throwing machines at it, do not waste engineering effort.

2. Use tmpfs

In theory, NVMe SSDs are still an order of magnitude slower than memory for reads and writes. In our tests, replacing NVMe SSDs with tmpfs [without disabling swap-out] improved etcd performance under concurrent reads and writes by as much as 20%. After examining the characteristics of each type of K8s data, and considering that Events have low durability requirements but high real-time requirements, we did not hesitate to run the Event etcd cluster on tmpfs, which lifted the overall performance of K8s another notch.

3. Disk file system

After upgrading the storage medium, the next thing to examine at the storage layer is the on-disk file system. Etcd originally ran on ext4 with the default 4 KiB block size. In a write-only parallel stress test [with a combined KV size of 10 KiB], our team found there was still headroom: switching the file system to XFS with a 16 KiB block size further improved etcd's write performance.

However, under concurrent reads and writes, the disk's write queue is under almost no pressure, and since etcd 3.4 implements concurrent reads from its cache, the read pressure on the disk is close to zero, so further file system tuning would add little to etcd's performance. From this point on, the key to scaling up a single etcd node shifts from disk to memory: optimizing the read/write speed of its in-memory index.

4. Transparent huge pages

Modern operating systems offer two mechanisms for managing large memory pages: huge pages and transparent huge pages (THP), and most users rely on THP to manage memory pages dynamically. In an etcd environment, THP should be disabled [on Linux, for example, via /sys/kernel/mm/transparent_hugepage/enabled]; otherwise monitoring metrics such as RT and QPS will frequently show spikes and performance will be uneven.

Etcd tuning parameters

MySQL operations engineers are often jokingly called "parameter-tuning engineers", and the same goes for another well-known KV database, RocksDB, which has an outrageous number of tunables: the point is to use different parameters for different storage media and operating environments so the hardware can be exploited to the full. Etcd is nowhere near that level of configurability yet, but we can expect it to expose more and more tunable parameters over time.

Etcd itself already exposes a number of tuning knobs. Apart from the freelist improvement contributed by Alibaba's K8s team [changing its organization from a list to a map], the commonly adjusted etcd parameters today are:

  1. Write batch

  2. Compaction

1. Write batch

Like other conventional databases, etcd improves disk throughput by committing writes in periodic batches and flushing them to the drive asynchronously, while using an in-memory cache to keep latency in check. The relevant parameters are:

  1. batch write number – how many KV writes are committed in one batch; the default is 10K.

  2. batch write interval – the batch commit interval; the default is 100 ms.

These two etcd batch defaults are not appropriate for a large-scale K8s cluster; they need to be adjusted to the actual environment to avoid OOM. As a rule of thumb, both values should shrink proportionally as the number of nodes in the cluster grows.
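As a rough sketch of where these knobs live, the snippet below starts an embedded etcd with smaller batch values. The concrete numbers are hypothetical, and we assume an etcd release whose embed.Config exposes these fields [they correspond to the --backend-batch-limit and --backend-batch-interval server flags]:

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "/var/lib/etcd" // assumed data directory

	// Hypothetical values: commit smaller batches more often in a large
	// cluster so a single commit does not accumulate too much dirty data.
	cfg.BackendBatchLimit = 5000
	cfg.BackendBatchInterval = 50 * time.Millisecond

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()
	<-e.Server.ReadyNotify()
	log.Println("embedded etcd ready")
}
```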

2. Compaction

Because etcd supports transactions and message notification, it uses an MVCC mechanism to keep multiple versions of each key and relies on periodic compaction to reclaim stale versions. Etcd exposes the following compaction parameters:

  1. compaction interval – the compaction period.

  2. compaction sleep interval – the pause between two batches within one compaction run; the default is 10 ms.

  3. compaction batch limit – the number of KVs compacted per batch; the default is 1000.

(1) Compaction period

Compaction of a K8s etcd cluster can be driven in two ways, with some caveats on the etcd side:

  1. kube-apiserver issues compactions through etcd's compact command and API, and exposes its own compaction period parameter;

  2. etcd runs compaction periodically by itself;

  3. etcd's own periodic compaction interval can be set between 0 and 1 hour;

  4. etcd's periodic compaction can be enabled but not disabled: if the interval is set to more than one hour, etcd clamps it to one hour.

Based on the Ant K8s team's tests and verification in offline environments, our current experience with compaction period values is as follows:

  1. at the etcd level, set the compaction interval to one hour, which effectively disables etcd's own compaction and hands control of the compaction period to kube-apiserver;

  2. at the kube-apiserver level, set the compaction period according to the size of the online cluster.

We delegate the compaction period to kube-apiserver because etcd is the KV database while kube-apiserver is only its cache: kube-apiserver holds weakly stateful data, so it is easier to restart and therefore easier to reconfigure. The compaction period can then be tuned up or down as the number of nodes in the cluster changes.
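For reference, a client-driven compaction period boils down to a loop like the sketch below. It only illustrates the idea behind kube-apiserver's periodic compaction; the probe key and the error handling are illustrative assumptions:

```go
package etcdops

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// periodicCompact compacts etcd history on a fixed period, discarding all
// revisions older than the store revision observed at each tick.
func periodicCompact(ctx context.Context, cli *clientv3.Client, period time.Duration) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Any read returns the current store revision in its header.
			resp, err := cli.Get(ctx, "compact-probe") // hypothetical probe key
			if err != nil {
				continue
			}
			rev := resp.Header.Revision
			if _, err := cli.Compact(ctx, rev); err != nil {
				log.Printf("compact to revision %d failed: %v", rev, err)
			}
		}
	}
}
```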

Compaction is itself a write operation. In a large cluster, running compaction frequently adds to the cluster's read/write latency, and the effect is visible every time a compaction task runs.

Furthermore, if the workload on the platform has clear peaks and troughs, for example heavy traffic between 08:30 and 21:45 and low traffic the rest of the day, compaction can be scheduled as follows:

  1. at the etcd level, set the compaction interval to 1 hour;

  2. at the kube-apiserver level, set the compaction period to 30 minutes;

  3. run a periodic job on the etcd operations platform: whenever the current time falls in a business trough, start an additional compaction task with a 10-minute period.

During periods when traffic stays high all day with no troughs, such as a major sales promotion, the etcd operations platform can shut this extra task down at any time to minimize its impact on reads and writes.

(2) A single compaction run

Even a single compaction task is executed by etcd in batches. BoltDB, the storage engine used by etcd, follows a multi-reader, single-writer model: multiple read transactions can run concurrently, but only one write transaction can run at a time.

To prevent a single compaction task from holding BoltDB's read/write lock for too long, each compaction batch processes a fixed number of keys [the compaction batch limit], after which etcd releases the lock and sleeps for one compaction sleep interval.

Before v3.5 the compaction sleep interval was hard-coded to 10 ms; from v3.5 onwards etcd makes this parameter configurable so that compaction can be tuned for large-scale K8s clusters. Like the batch write interval and batch write number, the compaction sleep interval and compaction batch limit need to be set according to the cluster size, so that etcd runs smoothly and kube-apiserver's read/write RT stays free of spikes.

Operations platform

Whether we tune etcd's parameters or upgrade the file system it runs on, the goal is to scale etcd up. There are two more scale-up approaches that we have not yet used:

  1. profile etcd under stress tests or live traffic, analyze the bottlenecks in etcd's code paths, and optimize the code to improve performance;

  2. reduce the amount of data held by a single etcd node through other means.

How much code-level etcd optimization a user can do depends on its engineering resources; the more sustainable approach is to track the community and absorb the gains of each version upgrade in time. Improving etcd performance by shrinking the data it stores, on the other hand, depends on the user's own tooling and operational capability.

We benchmarked the relationship between a single etcd node's RT/QPS and the volume of KV data it holds. The conclusion: as the KV data volume grows, RT rises linearly while QPS throughput drops off sharply. One implication is to keep the data volume on a single etcd node as small as possible, guided by analysis of the data composition, external traffic characteristics, and data access patterns inside etcd.
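The benchmark itself is straightforward; the sketch below shows the shape of such a measurement. The value size, key layout, and the idea of comparing runs at different preload sizes are assumptions for illustration, not the setup behind the figures above:

```go
package etcdbench

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// avgPutRT preloads the store with `preload` keys and then measures the
// average latency of `samples` additional Puts, so RT can be compared
// across different data volumes.
func avgPutRT(ctx context.Context, cli *clientv3.Client, preload, samples int) (time.Duration, error) {
	val := strings.Repeat("x", 1024) // assumed 1 KiB value size
	for i := 0; i < preload; i++ {
		if _, err := cli.Put(ctx, fmt.Sprintf("/bench/preload/%d", i), val); err != nil {
			return 0, err
		}
	}
	start := time.Now()
	for i := 0; i < samples; i++ {
		if _, err := cli.Put(ctx, fmt.Sprintf("/bench/sample/%d", i), val); err != nil {
			return 0, err
		}
	}
	return time.Since(start) / time.Duration(samples), nil
}
```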

At present, ant’s ETCD operation and maintenance platform has the following data analysis functions:

  1. longest N KV – the N keys with the longest values;

  2. top N KV – the N keys accessed most frequently within a given period;

  3. top N namespaces – the N namespaces holding the most keys;

  4. verb + resource – statistics on external access by verb and resource;

  5. connections – the number of long-lived connections on each etcd node;

  6. client source statistics – statistics on the sources of external requests to each etcd node;

  7. redundant data analysis – the distribution of keys in the etcd cluster that receive no external access.
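As an illustration of what the first function amounts to, the sketch below ranks the keys under a prefix by value size using the etcd v3 client; a production tool would paginate instead of reading the whole prefix into memory at once:

```go
package etcdops

import (
	"context"
	"sort"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// longestKVs returns the n keys under prefix with the largest values,
// a rough sketch of the "longest N KV" analysis.
func longestKVs(ctx context.Context, cli *clientv3.Client, prefix string, n int) ([]*mvccpb.KeyValue, error) {
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return nil, err
	}
	kvs := resp.Kvs
	sort.Slice(kvs, func(i, j int) bool { return len(kvs[i].Value) > len(kvs[j].Value) })
	if len(kvs) > n {
		kvs = kvs[:n]
	}
	return kvs, nil
}
```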

Based on these analysis results, the following work can be carried out:

  1. client rate limiting

  2. load balancing

  3. cluster splitting

  4. removal of redundant data

  5. detailed analysis of business traffic

1. Cluster splitting

As mentioned above, a classic way to improve etcd cluster performance is to move event data into an independent etcd cluster [kube-apiserver's --etcd-servers-overrides flag supports this kind of split], because events are a relatively large, fast-churning, and heavily accessed class of K8s data. Splitting reduces the data volume of the main etcd cluster and the external client traffic hitting each etcd node.

Some empirical, conventional ways to split etcd data are:

  1. pod/cm

  2. node/svc

  3. event, lease

These splits will, with high probability, significantly improve the RT and QPS of the K8s cluster, but further splitting may still be necessary. Detailed analysis of the hot data [top N KV] and the external client access patterns [verb + resource] provided by the data analysis platform serves as the basis for further etcd cluster splitting.

2. Client data analysis

Client data analysis covers the longest-N-KV analysis and the top-N-namespace analysis.

One obvious fact is that the larger the KV accessed in a single read or write, the longer etcd's response time. After obtaining the longest N KVs written by a client, the platform team can review with that client whether its usage of the platform is reasonable, reducing both the business's traffic pressure on the K8s platform and the storage pressure on etcd itself.

In general, each namespace on the K8s platform is assigned to one business for its own use. As mentioned earlier, K8s can be overwhelmed by list-all pressure, most of which comes from namespace-level list-all accesses. After obtaining the top N namespaces, we closely monitor the list-all long-lived requests of the businesses with the largest data volumes and apply rate limiting at the kube-apiserver level, which largely ensures that the K8s cluster is not overwhelmed by these requests and keeps the cluster highly available.

3. Redundant data analysis

Etcd holds not only hot data but also cold data. Cold data brings no external traffic pressure, but it enlarges the lock granularity of etcd's in-memory index, which increases the RT of every etcd access and lowers the overall QPS.

Recently, redundant-data analysis on the etcd of a large-scale [7K+ node] K8s cluster found that one business had stored a large amount of data in etcd yet had not accessed it even once in a week: it was using the K8s cluster's etcd as a cold backup for its CRD data. After discussing this with the business side and migrating the data out of etcd, the number of in-memory keys immediately dropped by about 20%, and the P99 RT of most etcd KV accesses immediately fell by 50% to 60%.

4. Load balancing

K8s platform operators generally follow this rule of thumb: whenever an etcd cluster node is restarted, restart all kube-apiservers as soon as possible to keep the number of connections between kube-apiserver and each etcd node balanced. There are two reasons for this.

  1. When kube-apiserver starts, it connects to a random node of the etcd cluster; after etcd nodes restart, however, the distribution of kube-apiserver connections across etcd nodes becomes irregular, so individual etcd nodes carry uneven client pressure.

  2. When the connections between kube-apiserver and etcd are balanced, two thirds of all read and write requests are forwarded to the leader by the followers, which keeps the etcd cluster's load balanced; if the connections are unbalanced, the cluster's performance becomes unpredictable.

The per-node connection load reported by the etcd operations platform shows the balance of cluster connections in real time, which tells us when operational intervention is needed and keeps the etcd cluster healthy overall.

The latest etcd release, v3.5, already provides automatic load balancing between the etcd client and etcd nodes, but the latest K8s release does not yet use it. We can track the K8s community's progress in adopting this version, pick up the technical dividend promptly, and reduce the platform's operational burden.

The road ahead

After more than a year of high-availability work covering both kube-apiserver and etcd, the K8s cluster has stabilized; notably, there has been no P-level incident in the K8s cluster for half a year. But the high-availability work cannot stop: Ant Group, sitting in the leaders' quadrant of global K8s scale engineering, is pushing toward clusters with even more nodes, which will drive further improvements in etcd cluster capability.

Much of the etcd work described above revolves around scaling up, and that direction still needs to be pushed further:

  1. follow etcd's latest features promptly and turn the open-source value of the community's progress into customer value on Ant's K8s platform;

  2. follow Alibaba Group's etcd optimization work promptly, including etcd compaction algorithm optimization, the single-node multi-BoltDB architecture, and kube-apiserver-side data compression [see reference document 1], providing reference and feedback to our peers so we improve together;

  3. track the performance bottlenecks of etcd on Ant's K8s platform, propose our own solutions, and feed them back to open source while increasing the technical value of our platform.

Beyond single-node etcd performance, our next focus is the scale-out direction: a distributed etcd cluster. The cluster splitting described above is, in essence, already a form of distribution that improves the overall performance of the etcd cluster: the cluster's data is partitioned by the data types of the K8s business layer.

This work can be taken further: without regard to the business meaning of each KV, data can be routed purely at the KV level to multiple back-end etcd sub-clusters according to some routing scheme, achieving overall hot/cold load balancing across the etcd cluster.

A distributed etcd cluster can be implemented in two ways, proxy-based and proxyless. In a proxy-based cluster the request path is client [kube-apiserver] -> proxy -> etcd server, while in a proxyless cluster it is client [kube-apiserver] -> etcd server.

The benefit of a proxy-based distributed etcd cluster is that development can build directly on the etcd proxy provided by the community and contribute back to it later, uniting open-source value, technical value, and customer value. However, our tests show that when kube-apiserver sends reads and writes to etcd through a proxy, RT and QPS drop by 20% to 25%. So the next step is to focus on developing a proxyless distributed etcd cluster.

Note that the current split etcd deployment is, in effect, a proxy-based distributed cluster about 67% of the time: roughly two thirds of kube-apiserver requests are forwarded to the leader by followers, which essentially act as proxies. If all kube-apiserver requests were handled directly by the leader, the K8s cluster would gain a theoretical RT/QPS benefit of about 67% × 20% ≈ 13.4%.

The drawback of a proxyless distributed etcd cluster is that placing the routing logic inside kube-apiserver raises the cost of kube-apiserver version upgrades. But a cost that only affects upgrades of a single component, kube-apiserver, is well worth a benefit of at least 20% (which will only grow as the etcd cluster scales further).
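For illustration only, client-side routing of this kind boils down to something like the sketch below; the prefixes are hypothetical, and a real implementation would also have to keep revisions and watches consistent across sub-clusters:

```go
package etcdops

import (
	"strings"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// router sends each key to one of several back-end etcd sub-clusters based on
// its prefix, a minimal sketch of proxyless client-side routing.
type router struct {
	defaultCli *clientv3.Client
	byPrefix   map[string]*clientv3.Client // e.g. "/registry/events/" -> event sub-cluster
}

// pick returns the client responsible for the given key.
func (r *router) pick(key string) *clientv3.Client {
	for prefix, cli := range r.byPrefix {
		if strings.HasPrefix(key, prefix) {
			return cli
		}
	}
	return r.defaultCli
}
```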

Besides the multiple-etcd-cluster approach, our data middleware team has implemented the etcd v3 API on top of OBKV, which is another promising technical route, similar to the etcd v3 API layer on top of TiKV mentioned by Huang Dongxu at the start of this article. It can be called an etcd-like system, and that work is also in progress.

In short, as K8s keeps growing, the importance of Ant Group's overall etcd work becomes more and more prominent. If the early road of etcd high-availability engineering was a stumble along a muddy path, the road ahead is bound to be a broad one, and it keeps getting broader!

References

Reference document 1: www.kubernetes.org.cn/9284.html

Reference document 2: tech.meituan.com/2020/08/13/…

About the author

Yu Yu (GitHub @Alexstocks), leader of the Dubbo-go community, is a programmer with 11 years of front-line experience in server-side infrastructure and middleware development.

He has participated in and improved well-known projects such as Redis, Pika, Pika-Port, etcd, Muduo, Dubbo, Dubbo-go, and Sentinel-go. He currently works on container orchestration in Ant Group's large-scale K8s cluster scheduling team, helping maintain one of the largest Kubernetes production clusters in the world and working to build scalable, financial-grade, trusted cloud-native infrastructure.

If you are interested in serverless auto-scaling, adaptive co-location (hybrid deployment), or secure container technologies such as Kata/Nanovisor, you are welcome to join us.

Email address: [email protected] or [email protected].

