Abstract: At Huawei Developer Conference (Cloud) 2021, Shen Yifan, PaaS cloud platform architect at ICBC, delivered a keynote speech titled “ICBC's Multi-K8s Cluster Management and Disaster Recovery Practice” and shared ICBC's hands-on experience with the multi-cloud container orchestration engine Karmada.

This article is shared from the Huawei Cloud community post “Karmada | ICBC's Multi-K8s Cluster Management and Disaster Recovery Practice”, original author: technology torch bearers.

The speech mainly covers four aspects:

  • ICBC cloud platform status

  • Multi-cluster management solutions in the industry and our selection

  • Why Karmada?

  • Landing status and future outlook

ICBC's cloud computing business background

Overall architecture of the ICBC cloud platform

ICBC cloud platform technology stack

We adopted industry-leading cloud products and mainstream open-source technologies, combined them with our financial business scenarios, and carried out in-depth customization.

  • Infrastructure cloud: a new generation of infrastructure cloud built on Huawei Cloud Stack 8.0 products, tailored to our operations and maintenance requirements.

  • Application platform cloud: independently developed and built by introducing the open-source container technology Docker, the container cluster scheduling technology Kubernetes, and so on.

  • Upper-layer application solutions: a cloud ecosystem of peripheral supporting services, load balancing, microservices, holographic monitoring, a log center, and more, built on HAProxy, Dubbo, Elasticsearch, etc.

Results of ICBC's financial cloud

In terms of the container cloud, the results of ICBC's financial cloud are also remarkable, reflected first in the scale of cloud adoption. To date, the application platform cloud runs more than 200,000 containers, of which business containers account for about 55,000, and some of the overall core businesses have been brought onto the container cloud. Beyond having the industry's largest scale of cloud adoption, our business scenarios cover a wide range of core applications. Core banking systems, including personal financial accounts, quick payment, online channels, and commemorative coin reservation, are deployed in containers. Core technology support applications such as MySQL, along with middleware and microservice frameworks, have also entered the cloud, as have new technology fields including the Internet of Things, artificial intelligence, and big data.

As more and more core business applications move to the cloud, our biggest challenges are disaster recovery and high availability, and we have done a lot of practical work in this area:

1) The cloud platform supports a multi-level failure protection mechanism that evenly distributes different instances of the same service across different resource domains in two sites and three data centers. This ensures that overall service availability is not affected when a single storage device, a single cluster, or even a single data center fails.

2) When a fault occurs, the cloud platform recovers automatically by restarting containers or drifting them to healthy nodes.

In our overall container cloud practice we also encountered some problems, one of the more prominent being the multi-cluster situation at the PaaS container layer. At present, the total number of K8s clusters at ICBC has reached nearly 100, for four main reasons:

1) Cluster variety: as just mentioned, our business scenarios are very broad. GPU workloads need nodes with GPU devices, while middleware and databases have different requirements for the underlying network and container storage. These different solutions inevitably mean we must customize clusters for different business scenarios.

2) K8s itself has performance limits: the scheduler, etcd, API server, and other components all have bottlenecks, so each cluster has an upper bound on its size.

3) Business expansion is very fast.

4) There are many failure-domain partitions. Our two-site three-center architecture has at least three DCs, and each DC contains different network zones isolated by firewalls. Multiplying these factors produces a large number of cluster failure domains.

Based on the above four points and the current situation, we still rely on our existing solution at the container cloud management platform level: the cloud management platform manages these multiple K8s clusters, and upper-layer business applications must choose a specific K8s cluster themselves according to their preferences, network, region, and so on. After a K8s cluster is selected, scheduling inside that cluster is automatically spread across failure domains.

However, the existing solution still exposes many problems to upper-layer applications:

1) Upper-layer applications care that the container cloud can scale automatically during business peaks, but auto-scaling currently happens only within a single cluster; there is no overall auto-scaling across clusters.

2) There is no cross-cluster automatic scheduling capability. Scheduling exists only within a cluster, and applications must select a specific cluster on their own.

3) Clusters are not transparent to upper-layer users.

4) There is no cross-cluster failover capability. We still rely heavily on replica redundancy across the two-site three-center architecture, so the automated failure recovery process lacks high availability at this level.

Multi-cluster management solutions in the industry and our selection

Based on this situation, we set some goals and carried out an overall technology selection among the industry's solutions, divided into five modules:

There are three reasons why we wanted it to be an open-source project with community support:

  • We hope the overall solution is independent and controllable within the enterprise, which is also a big benefit of open source

  • We do not want to spend extra manpower on it

  • As for why we did not simply integrate all the scheduling and failure recovery capabilities into the cloud management platform just mentioned: we want the overall multi-cluster management module to be isolated from the cloud management platform and to sink down into a separate multi-cluster management module below it

With these goals in mind, we conducted some research on community solutions.

KubeFed

The first project we investigated was the popular cluster federation project, which has v1 and v2 versions; at the time of our research, v2 (KubeFed) was the mainstream.

KubeFed itself solves some of these problems, providing cluster lifecycle management, Override capabilities, and basic scheduling, but it has a few flaws that are critical for us:

1) Its scheduling capability is very basic, the community does not plan to invest much more effort in scheduling or to support custom scheduling, and it does not support scheduling by available resources.

2) The second point is the most criticized: it does not support native K8s objects, so the management cluster must use KubeFed's newly defined CRDs. For upper-layer applications that have used native K8s resource objects for so long, we would also have to redevelop the cloud management platform's own APIs, and that cost is very high.

3) It has essentially no failover capability.

RHACM

The second project we investigated was RHACM, led mainly by Red Hat and IBM. Our overall conclusion was that its functions are relatively complete, covering the capabilities just mentioned, and its upper application layer is close to our cloud management platform in the capabilities it offers users. However, it only supports OpenShift, so the retrofit cost for the stock K8s clusters we already have would be high, and the solution is too heavyweight. At the time of our research it was also not open source, and overall community support was inadequate.

Karmada

At that time, we discussed the status quo and pain points of multi-cluster management with Huawei, including the community's federation projects, and both sides hoped to innovate in this area. The following is Karmada's functional view:

Karmada functional view

From the perspective of its overall functional view and roadmap, Karmada fits the goals we mentioned above very well. It offers overall cluster lifecycle management, cluster registration, multi-cluster scaling, multi-cluster scheduling, and a unified API, and it supports the underlying standard APIs (CNI/CSI) as well as upper-layer applications such as Istio, service mesh, and CI/CD, all accounted for in the overall plan. Because its functions matched ours so well, ICBC decided to invest in the project, to build Karmada together with Huawei and many other partners, and to give it back to the community.

Why Karmada

Technical architecture

In my personal understanding, it has the following advantages:

1) Karmada is deployed in a K8s-like form, with its own API Server, Controller Manager, and so on. For enterprises that already run many K8s clusters, the transformation cost is relatively small; we only need to deploy a Karmada management cluster on top.

2) karmada-controller-manager manages various CRD resources, including Cluster, Policy, Binding, and Work, as management-plane resource objects, but it does not intrude on the native K8s resource objects we want to deploy.

3) Karmada only manages the scheduling of resources across clusters; scheduling inside member clusters remains highly autonomous.
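To make the first two points concrete, here is a minimal sketch of inspecting the management plane with plain kubectl; the kubeconfig path is an assumption, while the resource names are Karmada's management-side CRDs:

```sh
# Talk to the Karmada control plane exactly as you would a normal K8s API server.
# The kubeconfig path below is illustrative.
export KUBECONFIG=$HOME/.kube/karmada-apiserver.config

kubectl get clusters                 # member clusters registered with Karmada
kubectl get propagationpolicies -A   # distribution policies
kubectl get resourcebindings -A      # template-to-cluster bindings
kubectl get works -A                 # per-cluster resource envelopes
```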

How are resources distributed in Karmada overall?

  • Step 1: Register the member clusters with Karmada

  • Step 2: Define the Resource Template

  • Step 3: Define the Propagation Policy

  • Step 4: Define the Override Policy

  • Step 5: Watch Karmada do the work
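As a sketch of step 1, assuming push mode and illustrative cluster names and kubeconfig paths, registration can be done with karmadactl (pull mode instead runs a karmada-agent inside the member cluster):

```sh
# Register an existing member cluster with the Karmada control plane (push mode).
karmadactl join member1 \
  --kubeconfig=$HOME/.kube/karmada-apiserver.config \
  --cluster-kubeconfig=$HOME/.kube/member1.config

# Confirm the Cluster object was created on the control plane.
kubectl --kubeconfig=$HOME/.kube/karmada-apiserver.config get clusters
```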

The following figure shows the overall delivery process:

After the Deployment is matched by a Propagation Policy, a binding is created; after the Override Policies are applied to that binding, the individual Work objects are produced. Each Work is essentially the encapsulation of a resource object for one member cluster. In my view, the Propagation and Work mechanisms matter most, so let's look at each in turn.

The Propagation mechanism

First we define the Propagation Policy. Looking at the whole YAML, we begin with a simple policy that selects the cluster named member1. Second, the policy must match a K8s Resource Template; here it matches the nginx Deployment we defined in the default namespace. Besides cluster affinity, the policy also supports cluster tolerations and distribution by cluster label or failure domain.
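A minimal version of the policy described here might look like the following; the names are illustrative, and the placement section is where cluster affinity, tolerations, and label- or failure-domain-based distribution would be expressed:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  resourceSelectors:           # match the Resource Template to distribute
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:           # the simple case: pick member1 by name
      clusterNames:
        - member1
```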

Once the Propagation Policy is defined, the K8s Resource Template to be delivered is automatically matched against it. After matching, the Deployment is distributed to, say, three clusters A, B, and C, which creates a binding between the resource and those three clusters, called a ResourceBinding.

The ResourceBinding YAML records the clusters that were chosen, in this case member1. ResourceBinding supports two scopes, cluster and namespace, which correspond to different scenarios. The namespace scope is used when namespaces provide tenant isolation within a cluster, so the ResourceBinding is namespaced as well. In the cluster scenario, where an entire member cluster serves a single tenant, we can use the cluster-scoped ClusterResourceBinding instead.
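Users do not write ResourceBinding objects by hand; Karmada's controllers generate them. Roughly, and with fields abbreviated from what the controller actually produces, a namespace-scoped binding looks like this:

```yaml
apiVersion: work.karmada.io/v1alpha1
kind: ResourceBinding
metadata:
  name: nginx-deployment      # derived from the bound resource
  namespace: default          # namespace-scoped variant; cf. ClusterResourceBinding
spec:
  resource:                   # points back at the Resource Template
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
    namespace: default
  clusters:                   # filled in by the Karmada scheduler
    - name: member1
```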

The Work mechanism

How are Work objects generated and distributed after the ResourceBinding is created?

When the ResourceBinding is generated and bound to, say, the three clusters A, B, and C (the 1:m in the figure refers to these three clusters), the Binding Controller goes to work and generates concrete Work objects from the Resource Template and the binding. A Work object is in general the encapsulation of a resource for a specific member cluster: the manifests field of a Work holds the complete Deployment YAML to be delivered to that member cluster, so the Work is simply a resource envelope.
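A Work object, again generated by the controllers rather than written by hand, is roughly the following sketch; the execution namespace name follows Karmada's karmada-es-&lt;cluster&gt; convention, and the embedded Deployment is abbreviated:

```yaml
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: nginx-deployment
  namespace: karmada-es-member1   # execution namespace for cluster member1
spec:
  workload:
    manifests:                    # the complete YAML destined for the member cluster
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx
          namespace: default
        spec:
          replicas: 2
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.19
```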

Karmada's advantages

Through hands-on use and verification, ICBC found that Karmada has the following advantages:

1) Resource scheduling

  • Customizable cross-cluster scheduling policies

  • Transparent to upper-layer applications

  • Both scopes of resource binding scheduling (namespace and cluster) are supported

2) Disaster recovery

  • Dynamic binding adjustment

  • Resource objects are automatically distributed based on cluster labels or failure domains (see the sketch after this list)

3) Cluster management

  • Cluster registration

  • Full lifecycle management

  • A standard API

4) Resource management

  • Support for K8S native objects

  • Work objects support obtaining the deployment status of resources in member clusters

  • Resource object distribution supports both pull and push modes
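To illustrate the label- and failure-domain-based distribution mentioned under disaster recovery, here is a hedged sketch of a policy that selects clusters by label and spreads replicas across member clusters, treating each cluster as a failure domain; the label and group counts are assumptions:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-spread
  namespace: default
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      labelSelector:             # pick clusters by label rather than by name
        matchLabels:
          environment: production
    spreadConstraints:           # spread across failure domains
      - spreadByField: cluster   # here each member cluster is one domain
        maxGroups: 3
        minGroups: 2
```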

Karmada's landing status at ICBC and future outlook

First, let's look at two of Karmada's functions in practice: cluster management and resource delivery.

So far, Karmada manages some of our existing clusters in ICBC's test environment. In terms of future planning, a key point is how to integrate it with our overall cloud platform.

Cloud Platform integration

Here we hope that the capabilities mentioned earlier, multi-cluster management, cross-cluster scheduling, cross-cluster scaling, failure recovery, and the overall resource view, all sink into a single control plane such as Karmada.

For the upper-layer cloud management platform, we focus more on its management of users and services, including user management, application management, input management, and so on, as well as the Karmada-derived policies defined on the cloud platform. The figure above is simplified: in practice the cloud platform may still need a direct connection to each K8s cluster. For example, which node a pod runs on is something the Karmada plane may not care about, so pod-location details may have to be obtained from the member K8s clusters. This is a problem we will need to solve during integration, and it is also consistent with Karmada's design philosophy: Karmada does not need to care about a pod's exact position inside a member cluster.

Future Vision 1: Scheduling across clusters

For cross-cluster scheduling, Karmada already supports failure-domain spreading, application preferences, and weight comparison, as mentioned above. But we also hope it can schedule according to clusters' resources and capacity, so that resource usage does not become unbalanced among member clusters. Although this is not yet implemented, Karmada has a Cluster CRD whose status information collects the ready state of nodes and the Allocatable remainder of each node, that is, the remaining CPU and memory. With all this information available, implementing custom resource-aware scheduling is entirely a matter of roadmap.
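For reference, an abbreviated and illustrative Cluster object: the spec is set at registration time, while the status fields mentioned above (node readiness, allocatable CPU/memory remainder) are collected by Karmada's controllers, and all values here are made up:

```yaml
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member1
spec:
  syncMode: Push                     # Pull mode would run karmada-agent in the member
  apiEndpoint: https://member1.example.com:6443   # illustrative endpoint
status:
  conditions:
    - type: Ready
      status: "True"
  nodeSummary:
    totalNum: 10                     # nodes in the member cluster
    readyNum: 10                     # of which ready
  resourceSummary:
    allocatable:                     # headroom a resource-aware scheduler could use
      cpu: "80"
      memory: 320Gi
```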

Once the overall scheduling design is complete, we hope to achieve the effect shown in the figure below at ICBC. When Deployment A is scheduled, it lands on Cluster 1 because of its preference settings. Deployment B is split across clusters 1, 2, and 3 by failure-domain spreading. Deployment C is also spread across failure domains, but because of resource constraints its surplus pods are scheduled to Cluster 2.

Future Vision 2: Scaling across clusters

Cross-cluster scaling is also on the Karmada roadmap now, but some issues still need to be addressed:

1) The relationship between cross-cluster scaling and scaling within member clusters. Today our upper-layer business applications usually configure a single-cluster scaling policy. When scaling policies exist both across clusters and inside member clusters, what is the relationship between them? Should the upper layer manage scaling as a whole, or should one level take priority? This is something we will consider later.

2) The relationship between cross-cluster scaling and cross-cluster scheduling. I think there should generally be one scheduler: the multi-cluster component is only responsible for the scaling decision itself, for example triggering when CPU or memory utilization reaches 70%-80%, while the concrete placement is carried out by the overall scheduler.

3) We need to aggregate the metrics of each cluster, deal with related performance bottlenecks, and settle on the overall working mode; these also need consideration later.

Future Vision 3: Cross-cluster failure recovery and high availability

1) A health assessment strategy for member clusters: a cluster may be disconnected from the management cluster while the business containers in the member cluster remain undamaged.

2) Custom fault recovery policies: analogous to a RestartPolicy of Always, Never, or OnFailure.

3) The relationship between rescheduling and cross-cluster scaling: we hope that multi-cluster scheduling is handled by one overall scheduler, while scaling retains control of its own scaling strategy.

Overall, for ICBC's business scenarios, Karmada's current capabilities and future roadmap can foreseeably solve the pain points of those scenarios. I am very glad to have had the opportunity to join the Karmada project, and I hope more developers will join the community to build this new multi-cloud management project with us.

Attached: Karmada community technical exchange addresses

Project address: github.com/karmada-io/…

Slack address: karmada-io.slack.com
