1. Foreword
After more than half a year of research and development, Ant Financial completed the full rollout of Kubernetes this year, with 100% of core links running on Kubernetes. By this year's Double Eleven, Ant Financial was managing tens of thousands of machines and hundreds of thousands of business instances through Kubernetes, and more than 90% of its business was running smoothly on Kubernetes. The whole switchover was smooth and transparent to the business, a key step in the evolution toward cloud native resource infrastructure.
This article introduces how Kubernetes is used at Ant Financial, the unprecedented challenges that Double Eleven brought to Kubernetes, and our best practices. We hope that by sharing what we learned along the way, we can make it easier for you to use Kubernetes.
2. The Status Quo of Kubernetes at Ant Financial
2.1 Development history and implementation scale
The rollout of Kubernetes at Ant Financial went through four stages:
- Platform R&D: In the second half of 2018, Ant Financial and Alibaba Group jointly invested in building out the Kubernetes technology ecosystem, aiming to replace the internally developed platform with Kubernetes;
- Grayscale verification: In early 2019, Kubernetes entered grayscale rollout at Ant Financial. By upgrading the architecture of some resource clusters and replacing a portion of production instances in grayscale, Kubernetes was verified at a small scale;
- Cloudification rollout (cloudification refers to moving Ant Financial's internal infrastructure onto the cloud): In April 2019, Ant Financial finished adapting Kubernetes to its internal cloud environment, and before the 618 shopping festival reached the goal of 100% Kubernetes usage in its cloud data centers. This was the first large-scale verification of Kubernetes inside Ant Financial;
- Rollout at scale: After June 18, 2019, Ant Financial began promoting Kubernetes across the board. Before this year's big promotion, Ant Financial achieved the goal of running 100% of core links on Kubernetes, and Kubernetes then successfully stood up to the Double Eleven test.
2.2 Unified Resource Scheduling Platform
Kubernetes carries Ant Financial's technical goal for resource scheduling in the cloud native era: unified resource scheduling. Unified scheduling can effectively improve resource utilization and greatly reduce resource costs. The key to achieving it is to sink the scheduling capabilities of each second-layer platform down to the resource layer, so that resources are allocated uniformly through Kubernetes.
Ant Financial follows a standardized extension model in rolling out Kubernetes to achieve the goal of unified scheduling:
- All service extensions are built on the Kubernetes APIServer, adapting and extending service functionality through the CRD + Operator pattern (see the sketch after this list);
- Basic services adapt their specific needs through resource interfaces defined at the Node layer, which helps form best practices for resource access.
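To make the CRD + Operator pattern concrete, here is a minimal sketch of an Operator built on the community's controller-runtime library. The TimeShareApp resource, its apps.example.com group, and the reconcile logic are hypothetical illustrations for this article, not Ant Financial's internal code:

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// gvk identifies a hypothetical CRD; a real Operator would usually
// generate typed clients for its own resource instead.
var gvk = schema.GroupVersionKind{
	Group:   "apps.example.com",
	Version: "v1",
	Kind:    "TimeShareApp",
}

type reconciler struct {
	client.Client
}

// Reconcile is the heart of the Operator: read the declared spec,
// then drive the actual state toward it.
func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(gvk)
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	log.FromContext(ctx).Info("reconciling", "name", req.Name)
	// Create/update Pods, Services, etc. here, then report progress
	// back into .status so the end state is observable.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(gvk)
	if err := ctrl.NewControllerManagedBy(mgr).
		For(obj).
		Complete(&reconciler{mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```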
Thanks to this continuous standardization work, over the past half year we have landed a range of technologies on Kubernetes, including secure containers, unified logging, fine-grained GPU scheduling, network security isolation, and secure trusted computing, and we use Kubernetes to uniformly manage these resources in service of a large number of online businesses and computing tasks.
3. Kubernetes in Practice for Double Eleven
Below, we walk through the following scenarios to show how Kubernetes is used at Ant Financial, along with the challenges we faced and the practices we adopted.
3.1 Time-sharing resource multiplexing
During big promotions, the traffic peaks of different business domains usually arrive in different time windows, and handling each window requires a large amount of extra compute. In past events we tried to achieve time-shared reuse of resources by rapidly migrating applications, but that approach was limited by the need to warm up service instances, unpredictable migration time, and the stability of large-scale rescheduling.
For this year's Double Eleven, we adopted shared resource scheduling plus fine-grained traffic switching to achieve time-shared use of resources. To fully utilize resources and switch quickly, we made the following enhancements:
- The internal scheduling system introduced union resource management, which places business instances whose peaks and valleys do not overlap into the same resource set to maximize utilization (see the sketch after this list);
- We built an application time-sharing multiplexing platform that integrates resource updates, traffic switching, and risk control, through which SREs can quickly and stably switch resources to handle different business flood peaks.
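As a toy illustration of the placement rule, the sketch below checks whether a candidate workload's peak window overlaps any existing member of a resource set. The hour-of-day peakWindow model and all names are hypothetical simplifications of what a real scheduler would track:

```go
package main

import "fmt"

// peakWindow describes when a workload expects its traffic peak,
// as hours of the day [start, end). A hypothetical model for illustration.
type peakWindow struct{ start, end int }

func overlaps(a, b peakWindow) bool {
	return a.start < b.end && b.start < a.end
}

// canShare reports whether a candidate workload may join a resource set:
// its peak must not overlap any existing member's peak.
func canShare(set []peakWindow, candidate peakWindow) bool {
	for _, w := range set {
		if overlaps(w, candidate) {
			return false
		}
	}
	return true
}

func main() {
	// E.g. a payment link peaking around midnight and a batch report
	// peaking in the morning can share one resource set.
	set := []peakWindow{{0, 2}}
	fmt.Println(canShare(set, peakWindow{8, 10})) // true: peaks don't overlap
	fmt.Println(canShare(set, peakWindow{1, 3}))  // false: peaks collide
}
```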
The platform and technology ultimately delivered exciting results: Ant Financial achieved large-scale resource sharing across tens of thousands of instances on different business links, and instances sharing resources can be switched smoothly at the minute level. This capability breaks through the efficiency limits of today's horizontal scaling and opens up the design space for time-shared resource reuse.
3.2 Colocation of Computing Tasks
In the community's adoption cases, Kubernetes clusters mostly run online businesses, while computing workloads typically run through enclosure-style (land-grab) resource claims and an independent second-layer scheduler. At Ant Financial, however, from the first day we decided to use Kubernetes, our goal was to bring computing workloads into Kubernetes and achieve unified resource scheduling.
Within Ant Financial, we use Kubernetes to support a growing range of computing workloads, such as AI training frameworks, batch jobs, and stream computing. They all share one trait: they request resources on demand and release them as soon as they finish.
We adapt computing tasks through the Operator model: a task calls the Kubernetes API to request Pods and other resources only when it actually runs, and deletes its Pods to release resources when it exits (a sketch of this pattern follows below). At the same time, we introduced dynamic resource scheduling and a task-profiling system into the scheduling engine, providing tiered resource guarantees for online and computing workloads of different priority levels, so that online business stays unaffected while resource utilization is maximized.
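A minimal sketch of the request-on-run, release-on-exit pattern using the standard client-go library; the namespace, image, and the runTaskPod helper are hypothetical, and a real Operator would watch the Pod's phase rather than leaving the wait elided:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// runTaskPod creates a Pod only when the task starts and deletes it
// when the task ends, so resources are held only while work is running.
func runTaskPod(ctx context.Context, cs kubernetes.Interface, ns string) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "batch-task-"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "worker",
				Image:   "busybox",
				Command: []string{"sh", "-c", "echo training step done"},
			}},
		},
	}
	created, err := cs.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return err
	}
	// ... watch the Pod phase and wait for completion here ...
	return cs.CoreV1().Pods(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := runTaskPod(context.Background(), cs, "default"); err != nil {
		panic(err)
	}
}
```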
Apart from the promotion's peak window (00:00–02:00), computing tasks running on Ant Financial's Kubernetes were not degraded; every minute, hundreds of computing tasks were still requesting and releasing resources on Kubernetes. Going forward, Ant Financial will continue to push the convergence of business scheduling and build Kubernetes into an aircraft carrier of resource scheduling.
3.3 Kubernetes at Scale
Ant Financial is one of the few companies running Kubernetes clusters at the world's largest scale, with more than 10,000 machines and hundreds of thousands of Pods. As workloads like computing-task colocation and time-shared resource multiplexing landed, the dynamic use of resources and the automated operation of business posed great challenges to Kubernetes' stability and performance.
The first challenge was scheduler performance. In a 5K-node scale test, the community scheduler achieved only 1–2 Pods/s of scheduling throughput, which clearly could not meet Ant Financial's requirements.
Since instances of the same business usually have identical resource requirements, we developed a batch scheduling feature and, combined with work such as local optimization of filter performance, ultimately achieved a hundredfold improvement in scheduling throughput (a simplified sketch of the idea follows below).
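The core idea behind batch scheduling can be sketched as equivalence classes: pods with identical requirements share one node-filtering pass instead of paying the filtering cost per pod. Everything below (the pod/node structs, the milli-core model) is a deliberate simplification for illustration, not the production scheduler:

```go
package main

import "fmt"

type pod struct {
	name string
	cpu  int // requested milli-cores
}

type node struct {
	name string
	free int // available milli-cores
}

// key groups pods into equivalence classes: pods with the same key have
// identical scheduling requirements, so one filter pass serves them all.
func key(p pod) int { return p.cpu }

func scheduleBatch(pods []pod, nodes []node) map[string]string {
	assign := map[string]string{}
	groups := map[int][]pod{}
	for _, p := range pods {
		groups[key(p)] = append(groups[key(p)], p)
	}
	for cpu, group := range groups {
		// Filter once per equivalence class instead of once per pod.
		feasible := []*node{}
		for i := range nodes {
			if nodes[i].free >= cpu {
				feasible = append(feasible, &nodes[i])
			}
		}
		// Place every pod in the class against the shared feasible set.
		for _, p := range group {
			for _, n := range feasible {
				if n.free >= cpu {
					assign[p.name] = n.name
					n.free -= cpu
					break
				}
			}
		}
	}
	return assign
}

func main() {
	pods := []pod{{"a", 500}, {"b", 500}, {"c", 500}}
	nodes := []node{{"n1", 1000}, {"n2", 1000}}
	fmt.Println(scheduleBatch(pods, nodes))
}
```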
After solving scheduling performance, we found that in scale scenarios the APIServer became the key component affecting Kubernetes availability, and the flexible extensibility of CRD + Operator put great pressure on the cluster. A business team has a hundred ways to take down a production cluster, and they are hard to guard against. This is partly the result of the community's historically low investment in scale and the APIServer's weak self-protection, and partly the flexibility of the Operator model itself: while developers benefit from the Operator's flexibility, they often hand the cluster administrator the risk of business behavior running out of control. Even developers quite familiar with Kubernetes cannot guarantee that their Operator will not blow up a large production cluster.
Faced with a situation where the "nuclear button" is not in the cluster administrator's hands, Ant Financial tackled the problems of scale internally from two directions:
- We distilled internal best-practice principles, continuously refined from lessons learned during iteration, to help teams design better Operators:
  - When defining a CRD, state the expected maximum number of objects up front; for extensions that serve a large number of CRs, an aggregated APIServer is the better choice;
  - CRDs must use Namespaced scope to limit the blast radius;
  - The combination of a MutatingWebhook with resource Update operations can do uncontrollable damage to a running cluster; avoid it;
  - Every controller should use Informers and configure proper rate limiting for write operations (see the sketch after this list);
  - DaemonSet is a very advanced primitive; avoid this kind of design where possible, and if it is truly necessary, use it under the guidance of Kubernetes experts.
- We landed a number of optimizations in Kubernetes itself, including multi-dimensional flow control, serving full List requests from the WatchCache, automatic resolution of update conflicts in controllers, and custom indexes in the APIServer.
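As one illustration of the "Informers plus write-side rate limiting" principle above, here is a minimal client-go sketch. The 30-second resync, the exponential rate limiter parameters, and the Pod-watching example are arbitrary choices for demonstration:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Informer: watch once and read from a local cache, instead of
	// hammering the APIServer with repeated List/Get calls.
	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Rate-limited queue: bursts and retry storms are throttled so a hot
	// loop in the controller cannot flood the APIServer with writes.
	queue := workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute))

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.AddRateLimited(key)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		fmt.Println("processing", item.(string))
		// ... perform the (throttled) write against the APIServer here ...
		queue.Forget(item) // success: reset this key's backoff
		queue.Done(item)
	}
}
```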
Through these standards and optimizations, we improved the entire API serving path from client to server: resource delivery capacity grew six-fold in the first half of the year, and weekly cluster availability reached three nines (99.9%), letting Kubernetes smoothly support the Double Eleven promotion.
3.4 Elastic Site Building on Cloud Resources
In recent years, Ant Financial has made full use of cloud resources, temporarily scaling out the entire site's capacity for the big promotion through rapid, elastic site building, then tearing the sites down afterwards to release the resources. This elastic approach saves Ant Financial a large amount of resources.
Kubernetes provides powerful orchestration, but the management of clusters themselves is comparatively weak. Starting from zero, Ant Financial built a management system based on Kubernetes-on-Kubernetes and end-state-oriented design to operate dozens of production clusters, providing rapid, elastic site building for the promotion.
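The end-state-oriented idea is the same declarative loop Kubernetes itself uses, lifted one level up: declare the desired cluster, observe the actual one, and converge. The sketch below is a deliberately simplified, hypothetical illustration of that loop; the ClusterSpec fields and converge steps are invented for demonstration:

```go
package main

import "fmt"

// ClusterSpec is a hypothetical desired end state for one site's cluster.
type ClusterSpec struct {
	Name  string
	Nodes int // desired worker node count
}

// ClusterStatus is the observed actual state.
type ClusterStatus struct {
	ReadyNodes int
}

// converge performs one reconciliation step toward the desired state and
// reports whether the cluster has reached it. A real system would provision
// actual machines and control-plane components here.
func converge(spec ClusterSpec, status *ClusterStatus) bool {
	switch {
	case status.ReadyNodes < spec.Nodes:
		status.ReadyNodes++ // e.g. provision one more node
	case status.ReadyNodes > spec.Nodes:
		status.ReadyNodes-- // e.g. recycle a node after the promotion
	}
	return status.ReadyNodes == spec.Nodes
}

func main() {
	spec := ClusterSpec{Name: "elastic-site-1", Nodes: 3}
	status := &ClusterStatus{}
	for !converge(spec, status) {
		fmt.Println("ready nodes:", status.ReadyNodes)
	}
	fmt.Println("cluster", spec.Name, "reached end state")
}
```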
In this way we automated site building and can deliver a ready-to-use Kubernetes cluster of thousands of Nodes within 3 hours. For this year's Double Eleven we delivered all elastic cloud clusters within a single day. As the technology continues to mature, we expect to routinely deliver clusters that meet business traffic-admission standards within a day, so that Ant Financial's sites can be built anytime, anywhere.
4. Looking to the Future and Meeting the Challenges
The cloud native era has arrived. Through Kubernetes, Ant Financial has taken the first step in building cloud native infrastructure. Although many practical challenges still await us, we believe that with sustained investment in the technology, they will be solved one by one.
4.1 Platform and Tenants
One of the challenges we face today is the immaturity of multi-tenancy. Ant Financial cannot maintain a separate Kubernetes cluster for every business department internally, yet the multi-tenant capabilities of a single Kubernetes cluster are very weak, which shows in two dimensions:
- The APIServer and etcd lack tenant-level service assurance capabilities;
- Namespaces cannot effectively isolate all resources; a Namespace provides only partial resource scoping, which is unfriendly to platform consumers.
Going forward, we will continue to invest in and apply core capabilities such as API Priority and Fairness for APIServer requests and Virtual Cluster, to effectively guarantee tenants' service quality and isolation.
4.2 Automated Operation and Maintenance
Beyond resource scheduling, automated operation and maintenance is the next important scenario for Kubernetes. It covers end-state-driven self-maintenance of application resources across their full life cycle, including but not limited to automatic resource delivery and fault self-healing.
As the degree of automation keeps rising, an important problem is how to effectively contain the risks automation introduces, so that automated operations genuinely improve efficiency instead of carrying a standing risk of a delete-the-database-and-run disaster.
Ant Financial went through exactly this while rolling out Kubernetes: from the initial excitement that a high degree of automation brings, to the deep frustration of out-of-control defects eventually blowing up into failures. All of it shows there is still a long way to go for automated operations on Kubernetes.
For this reason, we will work with our counterparts at Alibaba Group to push the standardization of Operators, covering access standards, the Operator framework, grayscale-release capability, and control governance, to make automated operations on Kubernetes more observable and controllable.
5. Concluding Remarks
This year we took Kubernetes from 0 to 1, and it withstood the real-world test of the Double Eleven promotion. But cloud native infrastructure building is still in its infancy. Going forward, we will keep investing in elastic resource delivery, cluster-scale services, technical risk, and automated operations, to support and accelerate the business's move to cloud native.
Finally, we welcome like-minded partners to join us and take part in building infrastructure for cloud native scenarios! You can open the official account [Financial-Grade Distributed Architecture] and go to the [Join Us] – [Open Positions] tab for job information.
About the Author
Cao Yin, head of the Kubernetes rollout at Ant Financial. He joined Ant Financial in 2015 and has mainly worked on container technology and platform R&D. Since 2018 he has been responsible for the Kubernetes rollout at Ant Financial. Before that, he worked on Alibaba Cloud Elastic Compute for four years and has a deep understanding of cloud computing infrastructure.
Financial Class Distributed Architecture (Antfin_SOFA)