Author | Cao Guanglei (Fang Qiu)

Background

Distributing workloads across different zones, different hardware types, and even different clusters and cloud vendors has become a very common requirement. In the past, an application could only be split into multiple workloads (such as Deployments), which were then either managed manually by the SRE team or handled by deep customization at the PaaS layer to support fine-grained management of the multiple workloads belonging to one application.

Furthermore, application deployment scenarios come with a variety of topology-spreading and elasticity requirements. The most common is spreading Pods across one or more topology dimensions, for example:

  • Spread the application across nodes to avoid stacking replicas on a single node (improving disaster recovery).
  • Spread the application across availability zones (AZs) to improve disaster recovery.
  • Spread the application across zones at a specified ratio of replicas per zone.

As cloud native has rapidly gained popularity at home and abroad, applications have an ever-growing demand for elasticity. Public cloud vendors have successively launched serverless container services to support elastic deployment, such as Alibaba Cloud ECI and AWS Fargate. Take ECI as an example: ECI connects to a Kubernetes cluster through Virtual Kubelet, and with the right configuration a Pod can be scheduled to the virtual node backed by ECI. Some common elasticity requirements are:

  • Deploy the application to its own (basic) node pool first, and only deploy to elastic nodes when resources are insufficient. When scaling down, remove the Pods on elastic nodes first to save costs.
  • Plan a basic node pool and an elastic node pool. During application deployment, a fixed number or proportion of Pods must land in the basic node pool, and the remaining Pods go to the elastic node pool.

In response to these requirements, OpenKruise added the WorkloadSpread feature in v0.10.0. It currently supports Deployment, ReplicaSet, and CloneSet workloads, managing the partitioned deployment and elastic scaling of their Pods. This article describes the application scenarios and implementation principles of WorkloadSpread in detail to help users better understand the feature.

WorkloadSpread introduction

Official documentation (see related link 1 at the end of this article)

In short, WorkloadSpread distributes a workload's Pods to different types of nodes according to certain rules, covering both the spreading and the elasticity scenarios described above.

Comparison of existing schemes

A brief comparison with some existing solutions in the community.

Pod Topology Spread Constraints (see link 2 at the end of this article)

Pod Topology Spread Constraints is a solution provided by the Kubernetes community that spreads Pods at the granularity of a topology key. Once the constraints are defined, the scheduler selects nodes that meet the distribution conditions based on the configuration.

Because PodTopologySpread only spreads Pods evenly, it does not support custom partition counts or ratios, and scaling down breaks the distribution. WorkloadSpread lets you customize the replica count per partition and manages the scale-down order, so it avoids these shortcomings of PodTopologySpread in some scenarios.
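
For comparison, here is a minimal sketch of the native constraint (the app: demo selector is only an illustrative label); it can only express even spreading controlled by maxSkew, not per-partition replica counts or ratios:

# pod.spec fragment using the native PodTopologySpread feature
topologySpreadConstraints:
- maxSkew: 1                                 # replica counts per zone may differ by at most 1
  topologyKey: topology.kubernetes.io/zone   # spread across zones
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: demo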

UnitedDeployment (see link 3 at the end of this article)

UnitedDeployment is a solution provided by the Kruise community that manages Pods in multiple regions by creating and managing multiple workloads.

UnitedDeployment covers both the spreading and elasticity requirements very well, but it is a new workload, so usage and migration costs for users are high. WorkloadSpread, by contrast, is a lightweight solution that only needs to be configured and associated with an existing workload, as sketched below.
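
A minimal sketch of that association (names are placeholders; the field layout follows the WorkloadSpread API, where spec.targetRef points at the existing workload):

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:                               # the existing workload, which stays unchanged
    apiVersion: apps.kruise.io/v1alpha1    # or apps/v1 for Deployment / ReplicaSet
    kind: CloneSet
    name: workload-demo
  subsets:
  - name: subset-a
    # maxReplicas, requiredNodeSelectorTerm, patch ...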

Application scenarios

Below are some application scenarios of WorkloadSpread with their corresponding configurations, to help you quickly understand its capabilities.

1. Deploy a maximum of 100 replicas in the basic node pool and the rest in the elastic node pool

subsets:
- name: subset-normal
  maxReplicas: 100
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic # replica count unlimited
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

If the workload has fewer than 100 replicas, all of them are deployed to the normal node pool; any replicas beyond 100 are deployed to the elastic node pool. During scale-down, the Pods on elastic nodes are deleted first.

Since WorkloadSpread does not intrude on the workload itself but only constrains its distribution, we can also combine it with HPA to adjust the replica count dynamically based on resource load: at business peaks the extra Pods are automatically scheduled to elastic nodes, and once the peak passes the resources in the elastic node pool are released first.
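
As an illustration only, an HPA working alongside the WorkloadSpread above might look like the following sketch (names and thresholds are placeholders; CloneSet exposes the scale subresource, and on older clusters the API version would be autoscaling/v2beta2 instead of autoscaling/v2):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: workload-demo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: workload-demo
  minReplicas: 50
  maxReplicas: 200      # replicas above 100 land in the elastic pool per the WorkloadSpread above
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70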

2. Deploy to the basic node pool first; if resources are insufficient, deploy to the elastic node pool

scheduleStrategy:
  type: Adaptive
  adaptive:
    rescheduleCriticalSeconds: 30
    disableSimulationSchedule: false
subsets:
- name: subset-normal # replica count unlimited
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic # replica count unlimited
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

Both subsets have no replica limit, and the simulation-scheduling and rescheduling capabilities of the Adaptive scheduling strategy are enabled. Pods are preferentially deployed to the normal node pool; if normal resources are insufficient, the webhook selects elastic nodes through simulated scheduling. When a Pod in the normal node pool stays pending for longer than the 30s threshold, the WorkloadSpread controller deletes the Pod to trigger a rebuild, and the new Pod is scheduled to the elastic node pool. During scale-down, the Pods on elastic nodes are removed first to save costs.

3. Spread the workload across three zones at a ratio of 1:1:3

subsets:
- name: subset-a
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-a
- name: subset-b
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-b
- name: subset-c
  maxReplicas: 60%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-c

Based on the actual capacity of the different zones, the workload is spread at a ratio of 1:1:3. WorkloadSpread ensures that scale-up and scale-down both follow the defined ratio.

4. Configure different resource quotas for different CPU architectures

The nodes a workload is distributed over may have different hardware configurations and CPU architectures, which may require different Pod configurations per subset. These configurations can be Pod metadata such as labels and annotations, or resource quotas, environment variables, and other settings for the containers inside the Pod.

subsets:
- name: subset-x86-arch
  # maxReplicas...
  # requiredNodeSelectorTerm...
  patch:
    metadata:
      labels:
        resource.cpu/arch: x86
    spec: 
      containers:
      - name: main
        resources:
          limits:
            cpu: "500m"
            memory: "800Mi"
- name: subset-arm-arch
  # maxReplicas...
  # requiredNodeSelectorTerm...
  patch:
    metadata:
      labels:
        resource.cpu/arch: arm
    spec: 
      containers:
      - name: main
        resources:
          limits:
            cpu: "300m"
            memory: "600Mi"

In the sample above, we patch different labels and container resources onto the Pods of the two subsets, which allows more fine-grained Pod management. When a workload's Pods are distributed on nodes with different CPU architectures, configuring different resource quotas makes better use of the hardware.

Implementation principles

WorkloadSpread is a purely bypass-style elasticity/topology management solution. The user only needs to create a WorkloadSpread for their Deployment/CloneSet/Job object; the workload itself is not changed, and there is no additional cost to using the workload.

1. Subset priority and replica number control

A WorkloadSpread defines multiple subsets, and each subset represents a logical domain. Users can divide subsets by node configuration, hardware type, zone, and so on. In particular, subsets have a priority:

  1. Priority goes from high to low, in the order the subsets are defined.
  2. Subsets with higher priority are scaled up first; subsets with lower priority are scaled down first.

2. How is the scale-down priority controlled?

In theory, a bypass scheme like WorkloadSpread cannot interfere with the scale-down ordering logic inside the workload controller.

Fortunately, this problem can be solved: starting from v1.21, Kubernetes allows setting the controller.kubernetes.io/pod-deletion-cost annotation on Pods managed by a ReplicaSet (and therefore a Deployment) to specify a Pod deletion cost. The higher a Pod's cost, the lower its deletion priority.

Kruise has supported the deletion-cost feature in CloneSet since v0.9.0.

Therefore, the WorkloadSpread controller controls the workload's scale-down order by adjusting the deletion-cost of the Pods belonging to each subset, as sketched below.
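
For illustration, this is what the annotation looks like on a Pod; the value here is only an example, and in practice the WorkloadSpread controller sets and updates it automatically:

apiVersion: v1
kind: Pod
metadata:
  name: workload-demo-pod    # placeholder name
  annotations:
    # higher cost = lower deletion priority when the workload scales down
    controller.kubernetes.io/pod-deletion-cost: "200"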

For example, the CloneSet associated with the following WorkloadSpread has 10 replicas:

subsets:
- name: subset-a
  maxReplicas: 8
- name: subset-b # replica count unlimited

The deletion-cost values and the scale-down order are as follows:

  • 2 Pods on subset-b have deletion-cost 100 (scaled down first)
  • 8 Pods on subset-a have deletion-cost 200 (scaled down last)

Then, if the user modifies the WorkloadSpread to:

subsets:
 - name: subset-a
   maxReplicas: 5 # changed from 8
 - name: subset-b

Then the WorkloadSpread controller will change the deletion-cost of three Pods on subset-a from 200 to -100:

  • 3 Pods on subset-a have deletion-cost -100 (scaled down first)
  • 2 Pods on subset-b have deletion-cost 100 (scaled down next)
  • 5 Pods on subset-a have deletion-cost 200 (scaled down last)

In other words, Pods that exceed a subset's limit are scaled down first, while the remaining Pods are still scaled down in order from the last subset to the first.

3. Replica count control

The key implementation problem for WorkloadSpread is ensuring that the webhook injects Pod rules strictly according to subset priority order and the maxReplicas limits.

3.1 Solving the concurrency consistency problem

The WorkloadSpread status contains a status entry for each subset, where the missingReplicas field represents the number of Pods this subset still needs. A value of -1 means the count is unlimited (the subset has no maxReplicas configured).

spec:
  subsets:
  - name: subset-a
    maxReplicas: 1
  - name: subset-b
  # ...
status:
  subsetStatuses:
  - name: subset-a
    missingReplicas: 1
  - name: subset-b
    missingReplicas: -1
  # ...

When the webhook receives a Pod create request:

  1. Based on subsetStatuses, select a suitable subset whose missingReplicas is greater than 0 or equals -1.
  2. Once a suitable subset is found, if its missingReplicas is greater than 0, subtract 1 and try to update the WorkloadSpread status.
  3. If the update succeeds, inject the rules defined by that subset into the Pod.
  4. If the update fails, re-fetch the WorkloadSpread to get the latest status and go back to step 1 (with a retry limit).

Similarly, when the webhook receives a Pod delete/eviction request, missingReplicas is increased by 1 and the status is updated.

There is no doubt that we are using optimistic locking to resolve update conflicts. However, optimistic locking alone is not appropriate here: when a workload creates Pods, it creates a large number of them in parallel, so the apiserver sends many Pod create requests to the webhook at almost the same moment, and parallel processing produces many conflicts. Optimistic locking is well known to be a poor fit when conflicts are frequent, because the retry cost of resolving them is very high. We therefore also added a WorkloadSpread-level mutex to turn parallel processing into serial processing. The mutex introduces a new problem: once the lock is acquired by the current goroutine, the WorkloadSpread read from the informer is very likely stale and would still conflict. So after a goroutine updates the WorkloadSpread, it caches the latest WorkloadSpread object before releasing the lock, and the next goroutine can then read the latest WorkloadSpread directly from that cache. Of course, when multiple webhook instances are running, the optimistic locking mechanism is still needed to resolve conflicts between them.

3.2 Solving the data consistency problem

So, is the missingReplicas value controlled solely by the webhook? The answer is no, because:

  1. A Pod create request received by the webhook may not actually succeed (for example, the Pod is invalid, or a later quota check fails).
  2. A Pod delete/eviction request received by the webhook may not succeed either (for example, it is blocked by a PDB or PUB).
  3. In Kubernetes, a Pod can always terminate or disappear without going through the webhook (the phase becomes Succeeded/Failed, etcd data is lost, and so on).
  4. Relying on the webhook alone would also be inconsistent with the idea of desired-state-oriented design.

Therefore, the WorkloadSpread status is maintained by the webhook and the controller together:

  • The webhook intercepts Pod create/delete/eviction requests and adjusts the missingReplicas value on that path.
  • At the same time, during reconciliation the controller fetches all the Pods belonging to the workload, classifies them by subset, and updates missingReplicas to the actual number of missing Pods.
  • As the analysis above suggests, the controller may see Pods from the informer with some delay. So we also added a creatingPods map to the status: when the webhook injects a Pod, it records an entry whose key is the Pod name and whose value is a timestamp. The controller combines this map with the Pods it observes to maintain the real missingReplicas. Similarly, a deletingPods map records Pod delete/eviction events. An illustrative status snippet is shown after this list.
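
As an illustrative sketch only (Pod names and timestamps are placeholders), a subset status carrying these maps could look like:

status:
  subsetStatuses:
  - name: subset-a
    missingReplicas: 0
    creatingPods:
      # written by the webhook at injection time, cleared once the
      # controller sees the Pod through the informer
      workload-demo-pod-x7k2d: "2021-09-08T08:00:00Z"
    deletingPods:
      # written by the webhook on delete/eviction requests
      workload-demo-pod-q9z8p: "2021-09-08T08:05:00Z"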

4. Adaptive scheduling capability

WorkloadSpread supports a scheduleStrategy configuration. The default type is Fixed, which assigns Pods to subsets strictly according to the subsets' order and maxReplicas limits.

In real scenarios, however, a subset's nodes or topology may not be able to host the full maxReplicas count, so Pods need to be placed according to the resources actually available in each zone. This calls for the Adaptive strategy.

The Adaptive capability provided by WorkloadSpread can be logically divided into two types:

  • SimulationSchedule: in the Kruise webhook, the node/Pod data from the informer is assembled into a scheduling account to simulate scheduling the Pod. It performs a simple filter over nodeSelector/affinity, tolerations, and basic resource requests (so it works less well for nodes such as virtual-kubelet ones).
  • Reschedule: after a Pod is assigned to a subset, if it fails to be scheduled for longer than rescheduleCriticalSeconds, the subset is temporarily marked as unschedulable and the Pod is deleted to trigger a rebuild. By default the unschedulable mark lasts 5 minutes, meaning Pods created within those 5 minutes skip that subset.

Summary

By combining some of Kubernetes' existing features in a bypass manner, WorkloadSpread gives workloads elastic and multi-domain deployment capabilities. We hope users can use WorkloadSpread to reduce deployment complexity and take advantage of its elastic scaling capabilities to genuinely reduce costs.

At present, Alibaba Cloud is actively rolling this out internally, and adjustments made during that process will be fed back to the community in a timely manner. Some new capabilities are also planned for WorkloadSpread, such as letting WorkloadSpread manage the existing Pods of a workload it takes over, supporting constraints on batch workloads, and even matching Pods by label across the workload hierarchy. Some of these capabilities need to be weighed against the actual needs of the community. We hope you will participate in the Kruise community, file more issues and PRs, help users solve more cloud native deployment difficulties, and build a better community together.

  • GitHub:

github.com/openkruise/…

  • Official website:

openkruise.io/

  • Slack:

Channel in Kubernetes Slack

  • DingTalk group:

Related links:
Link 1 (WorkloadSpread documentation): openkruise.io/zh-cn/docs/…
Link 2 (Pod Topology Spread Constraints): kubernetes.io/docs/concep…
Link 3 (UnitedDeployment): openkruise.io/zh-cn/docs/…

Click the link below to learn about the OpenKruise project now! github.com/openkruise/…