
Introduction

Singles’ Day 2021 is the second year that Alibaba Group’s core applications run entirely on the cloud. This year, on the premise of ensuring stability, we mainly explored how to use the advantages of cloud native technology to reduce costs and improve resource utilization. In this year’s promotion, the core clusters adopted mixed deployment of exclusive and shared instances on top of unified underlying resources, combined with cloud disks for the transaction business, which reduced the unit cost of the mixed-deployment clusters by more than 30%.

Cloud native database management and control

With the popularity of cloud native technology, more and more applications run on Kubernetes, and Kubernetes has become the de facto unified application delivery interface. Our internal database products are also moving to cloud native across the board, aiming to unlock database product capabilities through resource pooling and deep integration with cloud infrastructure. Database management and control orchestrates DB instances on top of Kubernetes. Because the size limit of a single Kubernetes cluster cannot accommodate the deployment scale of DB instances, we designed a multi-cluster scheduling system that can schedule resources across multiple Kubernetes clusters.

For nodes added to a Kubernetes cluster, different database products choose one of the following two modes based on their specific workloads:

  • Single-instance ECS node: each node is exclusively occupied by one instance and resources are scheduled by ECS. Isolation between nodes is good, but ECS resource limits are strict, which causes some stability and performance problems.
  • ECS resource pool mode: the database team maintains a pool of larger ECS nodes and performs secondary scheduling of database instances within the pool. Resources can be coordinated among multiple pods on a node, improving product stability and performance.

In the second mode, when the ECS pool is built and the database team performs secondary scheduling, cgroup isolation groups are required to limit the resources of each instance pod. For customers on the cloud, there are two instance forms:

  • Exclusive: CPU resources are bound with the cpuset mode of the cgroup.
  • Shared: resources are isolated with the cpu-share mechanism of the cgroup, which allows CPU resources to be oversold to some extent.

Based on these two isolation modes, the underlying ECS resource pool is strictly divided into an exclusive resource pool and a shared resource pool. Once a node has been allocated an exclusive instance, it is no longer allowed to host shared instances, so the two resource pools are completely independent.

Normally, the Alibaba Cloud database team schedules these resource pools in a unified manner. Because the node pool is large enough, separating the resource pools does not hurt resource utilization. However, for a single enterprise customer, whose instances are not mixed with other customers’ workloads, this strict separation is not optimal. For example, suppose a customer has 4 instances, 2 in shared mode and 2 in exclusive mode. Under the previous scheduling logic, the customer must purchase four machines to deploy them. To solve this problem, we want to be able to mix core and non-core instances on the same machine: core instances use CPUSet to get exclusive use of some CPU cores, while non-core instances use CPUShare to share the remaining cores, so that resources are fully utilized.

CPUSet and CPUShare mixed deployment scheme

The premise of mixing exclusive and shared instances is that the performance of exclusive instances must not be affected, so the resources used by shared instances have to be limited. In Kubernetes, cgroups are used to limit the CPU and memory that a pod can use. The memory limit is relatively simple: a maximum available memory is set for the pod, and OOM is triggered once it is exceeded. For CPU limits, Kubernetes uses the CFS quota to limit the CPU time slices a process can use per unit of time. When the workloads of exclusive and shared instances on the same node increase, the processes of an exclusive instance may be switched back and forth between different CPU cores, affecting its performance. Therefore, to avoid affecting exclusive instances, we want the CPUs of exclusive and shared instances on the same node to be bound to separate cores.
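For reference, the cgroup v1 knobs behind these two limits are cpuset.cpus for core binding and cpu.cfs_period_us / cpu.cfs_quota_us for time-slice limiting. The following is a minimal sketch of how they are written, assuming a cgroupfs mounted at /sys/fs/cgroup and hypothetical cgroup paths; it only illustrates which files are involved, not any actual component of the system described here.

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // bindExclusive pins a cgroup to specific cores via the cpuset subsystem,
    // e.g. cpus = "2-3".
    func bindExclusive(cgroupPath, cpus string) error {
        return os.WriteFile(filepath.Join("/sys/fs/cgroup/cpuset", cgroupPath, "cpuset.cpus"),
            []byte(cpus), 0644)
    }

    // limitShared caps a cgroup's CPU time with the CFS quota: quota/period is
    // the number of cores' worth of time the processes may consume per period.
    func limitShared(cgroupPath string, milliCPU int64) error {
        period := int64(100000) // 100ms, the usual CFS period
        quota := milliCPU * period / 1000
        dir := filepath.Join("/sys/fs/cgroup/cpu", cgroupPath)
        if err := os.WriteFile(filepath.Join(dir, "cpu.cfs_period_us"), []byte(fmt.Sprint(period)), 0644); err != nil {
            return err
        }
        return os.WriteFile(filepath.Join(dir, "cpu.cfs_quota_us"), []byte(fmt.Sprint(quota)), 0644)
    }

    func main() {
        // Hypothetical cgroup paths, for illustration only.
        _ = bindExclusive("kubepods/pod-exclusive-123", "2-3")
        _ = limitShared("kubepods/pod-shared-456", 4000) // a 4-core CFS quota
    }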

Kubernetes native mixed part implementation

In Linux, the cpuset subsystem of cgroups can be used to bind processes to specified CPUs. Docker also has startup parameters to bind containers to specified CPUs:

    --cpuset-cpus=""    CPUs in which to allow execution (0-3, 0,1)
    --cpuset-mems=""    Memory nodes (MEMs) in which to allow execution (0-3, 0,1). Only effective on NUMA systems.

Since version 1.8, Kubernetes has provided the CPU Manager feature to support cpuset capabilities. From Kubernetes 1.12 through the then-current 1.22, the feature has remained in Beta. CPU Manager is a module in kubelet whose main job is to bind containers that satisfy the core-binding policy to specific CPUs, improving the performance of CPU-sensitive workloads.

Currently, CPU Manager supports two policies, none and static, which are set with the kubelet --cpu-manager-policy flag. A dynamic policy, which would dynamically adjust the cpuset during the container life cycle, is planned to be added later.

  • none: the default policy of CPU Manager, meaning cpuset is not enabled. The CPU request corresponds to cpu shares and the CPU limit corresponds to the CFS quota.
  • static: allows pods with certain resource characteristics on a node to get enhanced CPU affinity and exclusivity. Kubelet allocates the bound CPU set before the container starts, and the CPU topology is taken into account to improve CPU affinity.

The static policy manages a shared CPU pool that initially contains all the CPU resources on the node. The number of CPUs available for exclusive allocation equals the total number of CPUs on the node minus those reserved by the --kube-reserved or --system-reserved parameters. Starting with version 1.17, the reserved CPU list can be set explicitly with the kubelet --reserved-cpus parameter; an explicit list specified with --reserved-cpus takes precedence over the CPUs reserved with --kube-reserved and --system-reserved. The CPUs reserved by these parameters are taken from the initial shared pool as whole cores, in ascending order of physical core ID. The shared pool is the set of CPUs that BestEffort and Burstable pods run on. Containers in Guaranteed pods that declare non-integer CPU requests also run on the shared pool; only containers in Guaranteed pods that specify integer CPU requests are allocated exclusive CPU resources.
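As an illustration of that last rule, the check that decides whether a container gets exclusive cores under the static policy boils down to roughly the following. This is a simplified sketch, not the actual kubelet code:

    package main

    import "fmt"

    // getsExclusiveCPUs sketches the static-policy rule: only containers in a
    // Guaranteed pod whose CPU request is a whole number of cores (and equal to
    // the limit, which is what makes the pod Guaranteed) are pinned to exclusive
    // CPUs; everything else runs on the shared pool.
    func getsExclusiveCPUs(qosClass string, cpuRequestMilli, cpuLimitMilli int64) bool {
        if qosClass != "Guaranteed" {
            return false
        }
        if cpuRequestMilli != cpuLimitMilli {
            return false // request != limit, so not Guaranteed for this resource
        }
        return cpuRequestMilli%1000 == 0 // an integer number of cores
    }

    func main() {
        fmt.Println(getsExclusiveCPUs("Guaranteed", 4000, 4000)) // true: 4 whole cores
        fmt.Println(getsExclusiveCPUs("Guaranteed", 2500, 2500)) // false: 2.5 cores -> shared pool
        fmt.Println(getsExclusiveCPUs("Burstable", 4000, 8000))  // false: not Guaranteed
    }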

As a general-purpose container orchestration platform, Kubernetes provides this core-binding capability, but it has certain limitations for our scenario:

  • The QoS class of the pod must be Guaranteed. Database instance resources are oversold, so a pod’s request and limit settings differ and the pod is not Guaranteed.
  • Kubelet’s CPU allocation policy is fixed and cannot be extended flexibly, so it cannot satisfy the scheduling requirements of mixing exclusive and shared instances.
  • Dynamically lifting the binding restriction is not supported. Kubelet does not support releasing bound cores when the binding restriction needs to be temporarily lifted to improve database performance.

Mixed deployment implementation on the scheduling platform

In Kubernetes’ native core-binding capability, the binding policy is implemented inside kubelet, which is not easy to modify or extend, and different business scenarios may need different binding policies, so managing the policy uniformly on the scheduling side is more appropriate. Kubelet is then only responsible for executing the binding, and the concrete binding policy is specified by the upper-layer business. When we create a pod in CPUSet mode, the binding decision is written into the pod’s annotations, and kubelet binds the pod to the specified CPUs before starting the container.

    annotations:
      alloc-spec: |
        {"cpu": "Spread"}
      alloc-status: '{"cpu": [0, 1]}'

**alloc-spec:** specifies the allocation policy. The following modes are supported:

  • Spread: the bound logical cores are spread across physical cores, and the sibling logical cores of those physical cores are given to low-pressure applications. Unified scheduling also provides a stacking-constraint algorithm that governs how applications share physical cores (see the sketch below).
  • SameCoreFirst: binding within the same physical core is preferred; logical cores of the same physical core are allocated first whenever possible.
  • BindAllCPUs: binds all CPU cores.

**alloc-status:** the specific CPU cores that have been allocated and bound.
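To make the difference between Spread and SameCoreFirst concrete, here is a minimal sketch that assumes a simple topology map from physical core ID to its logical CPUs; it illustrates the two strategies only and is not the scheduler’s actual allocation code.

    package main

    import (
        "fmt"
        "sort"
    )

    // topology maps a physical core ID to its logical CPU IDs (hyperthreads).
    type topology map[int][]int

    // spread picks at most one logical CPU per physical core per round, so the
    // workload is scattered across physical cores.
    func spread(topo topology, want int) []int {
        var picked []int
        cores := sortedKeys(topo)
        for round := 0; len(picked) < want; round++ {
            progressed := false
            for _, c := range cores {
                if round < len(topo[c]) && len(picked) < want {
                    picked = append(picked, topo[c][round])
                    progressed = true
                }
            }
            if !progressed {
                break // not enough logical CPUs
            }
        }
        return picked
    }

    // sameCoreFirst fills all logical CPUs of one physical core before moving on
    // to the next, so sibling hyperthreads stay together.
    func sameCoreFirst(topo topology, want int) []int {
        var picked []int
        for _, c := range sortedKeys(topo) {
            for _, lcpu := range topo[c] {
                if len(picked) == want {
                    return picked
                }
                picked = append(picked, lcpu)
            }
        }
        return picked
    }

    func sortedKeys(topo topology) []int {
        keys := make([]int, 0, len(topo))
        for k := range topo {
            keys = append(keys, k)
        }
        sort.Ints(keys)
        return keys
    }

    func main() {
        // 4 physical cores, each with 2 hyperthreads.
        topo := topology{0: {0, 4}, 1: {1, 5}, 2: {2, 6}, 3: {3, 7}}
        fmt.Println(spread(topo, 4))        // [0 1 2 3]: one thread per physical core
        fmt.Println(sameCoreFirst(topo, 4)) // [0 4 1 5]: siblings packed together
    }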

When the scheduler allocates or releases a CPUSet-mode pod on a node, the set of CPU cores available to CPUShare pods changes. The shareable CPU range is communicated to kubelet by modifying the node’s annotation:

    kind: Node
    metadata:
      name: xx
      annotations:
        cpu-sharepool: '{"cpuIDs": [0,1,2,3,4,5,6,99,100,101,102,103]}'

To mix exclusive and shared instances, we only need to create exclusive instances in CPUSet mode and shared instances in CPUShare mode; the CPUs of shared and exclusive instances are then bound separately and do not interfere with each other. However, this solution requires custom development of the native kubelet component, which limits its deployment in common environments.

A mixed deployment implementation based on the scheduler and an extension controller

The previous sections analyzed the native Kubernetes implementation and the scheduling platform implementation. Neither fully meets the database side’s core-binding requirements for mixing exclusive and shared instances, so we want to support different binding policies without modifying the Kubernetes source code.

Before introducing our implementation, we first introduce the three components involved in mixed deployment under the database management and control architecture:

  • Scheduler: has a global view of resources and can allocate CPU cores according to the scheduling policy specified by the business. The supported CPU scheduling policies include exclusive/shared mixing, NUMA affinity and anti-affinity, and I/O multi-queue spreading.
  • Controller: writes the CPU cores allocated by the scheduler into the pod’s annotations and provides a query interface to the business.
  • Cgroup-controller: deployed on each node as a DaemonSet, it watches the pod resources on that node. When a pod’s annotations contain core-binding information, it modifies the pod’s cgroup cpuset configuration to complete the binding.

Compared with a vanilla Kubernetes cluster, this architecture differs in two main ways. First, scheduling spans multiple Kubernetes clusters, which removes the scale constraint of a single cluster on the database. Second, database resource scheduling is decoupled from instance creation: resource scheduling happens before the pod is created. When an instance is created, resources are allocated through the scheduler, and after the allocation succeeds the business initiates a task flow to start the instance.

The scheduler

The scheduler’s core flow is similar to that of the Kubernetes scheduler (Filter -> Ranker -> Assume -> Process Pod); a sketch of this filter/ranker shape follows the list below:

  • Filter: filters out nodes that do not meet the requirements
  • Ranker: scores the candidate nodes and selects the node with the highest score
  • Sync Process Pod: processes resource requests (cloud disk, security group, ENI, etc.) asynchronously
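The sketch below, with hypothetical Node and Pod types, only illustrates the plug-in shape of this filter/ranker flow; it is not the real scheduler.

    package main

    import "fmt"

    // Node and Pod are simplified stand-ins for the scheduler's real objects.
    type Node struct {
        Name     string
        FreeCPUs int
    }
    type Pod struct {
        Name         string
        RequestCPUs  int
        ExclusiveCPU bool
    }

    // Filter removes nodes that cannot host the pod; Ranker scores the rest.
    type Filter func(Pod, Node) bool
    type Ranker func(Pod, Node) int

    func schedule(p Pod, nodes []Node, filters []Filter, rank Ranker) (Node, bool) {
        best, bestScore, found := Node{}, -1, false
        for _, n := range nodes {
            ok := true
            for _, f := range filters {
                if !f(p, n) {
                    ok = false
                    break
                }
            }
            if !ok {
                continue
            }
            if s := rank(p, n); s > bestScore {
                best, bestScore, found = n, s, true
            }
        }
        return best, found
    }

    func main() {
        enoughCPU := func(p Pod, n Node) bool { return n.FreeCPUs >= p.RequestCPUs }
        mostFree := func(p Pod, n Node) int { return n.FreeCPUs - p.RequestCPUs }

        nodes := []Node{{"node-a", 16}, {"node-b", 64}}
        if n, ok := schedule(Pod{"db-1", 32, true}, nodes, []Filter{enoughCPU}, mostFree); ok {
            fmt.Println("scheduled to", n.Name) // node-b
        }
    }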

In the exclusive/shared mixed deployment, the scheduler is responsible for allocating concrete CPU cores to exclusive instances, and the core allocation policy can be customized by the business. For the group’s exclusive/shared mixing scenario, the scheduler mainly implements the following logic.

Node CPU data/state initialization and synchronization

I/O queue spreading information is related to the specific physical machine model. When a node comes online, it writes its I/O queue information into the node annotation. The scheduler watches the node’s I/O queue information, validates it, and initializes the CPU information. When the scheduler sees the annotation change, it updates the CPU information only if the following two conditions are met:

  • The number of CPUs has increased
  • The CPU distribution of the existing I/O queues is unchanged in the new I/O queue information

Exclusive instance core binding is spread across I/O queues

When selecting CPU cores for an exclusive instance, spreading across I/O queues is taken into account.

The spreading policy depends on the node’s CPU allocation priority.

Filter orchestration

The scheduler decouples the different filters so that pipelines for different businesses can be defined through plug-in orchestration.

Application-dimension anti-affinity

An application has at most one instance on a given machine, regardless of whether it is exclusive or shared. This ensures that a single machine failure affects at most 1/N of the cluster.

resourceSet.spec.scheduleConfig.policy.machineMaxList.N.key = ${APP_NAME}
resourceSet.spec.scheduleConfig.policy.machineMaxList.N.value = 1

The process of applying for a pod of such a business (machineMaxKey = ${APP_NAME}, value = 1) is as follows:

Exclusive/shared CPU mixed scheduling

An unallocated host has 104 cores, of which 8 are reserved, so at most 96 cores can be allocated. The following figure shows the process of scheduling two 32C shared instances and then migrating in another 32C shared instance.

On a single ECS node, exclusive instances can occupy at most 2/3 of the sellable CPUs: out of 96 sellable physical CPUs, at most 64 cores can be sold as exclusive, and the remaining 32 cores can be oversold to other instances at a fixed oversell ratio (an exclusive instance cannot occupy the whole machine).

Calculation of the CPU available to instances: available CPU = min(a, b), where:

a. available CPU = available physical CPUs − max(shared instance) × 2
b. available CPU = [available CPUs of the machine − sum(available CPUs of shared instances)] / oversell ratio

The controller

The controller shields the underlying Kubernetes resource interfaces from the business; all business operations are translated uniformly by the controller. In the exclusive/shared mixed deployment scheme, the controller is responsible for writing the core-binding information allocated by the scheduler into the pod’s annotations and for verifying that the binding is correct when the pod starts. In a big-promotion scenario, if an exclusive instance hits a performance bottleneck, it must be able to temporarily occupy the CPU resources of the whole machine, so the controller provides an interface for the business to temporarily lift the CPU binding restriction.

    annotations:
      cpu-isolate-mode: exclusive
      custom-cgroup-info: '{"engine": {"cpuset": "2-3"}}'
      exclude-reserve-cpuset: "true"  # whether to temporarily lift the core-binding limit; if set, the contingency plan binds all CPU cores except the reserved ones

cgroup-controller

Cgroup-controller is an extension controller for Kubernetes. It is deployed in the Kubernetes cluster as a DaemonSet and mounts the host’s /sys/fs/cgroup/ directory into its container. Based on a pod’s QoS class, it locates the corresponding cgroup configuration path and modifies the cgroup configuration files directly.

     volumes:
        - hostPath:
            path: /var/run/
            type: "Directory"
          name: docker-sock
        - hostPath:
            path: /sys/fs/cgroup/
            type: "Directory"
          name: cgroups

In the exclusive/shared mixed deployment scenario, cgroup-controller needs to be able to bind cores for exclusive instances and maintain the CPU resource pool for shared instances. It records the distribution of exclusive and shared CPUs on the current node and dynamically adjusts the list of CPUs available to shared pods when the exclusive instances on the node change. For example, if the current node reserves CPUs [0-1] and an exclusive instance scheduled onto it occupies CPUs [2-3], cgroup-controller computes the shared CPUs on the node as total CPUs − node-reserved CPUs − exclusive CPUs, and binds all shared pods to those shared CPUs. When another exclusive pod is scheduled, cgroup-controller updates the shared CPU set and refreshes the CPU binding of all shared pods on the host.
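A minimal sketch of that recomputation is shown below, with hypothetical types for the node’s CPU bookkeeping; the real cgroup-controller additionally handles watches, pod lookup, and error cases.

    package main

    import "fmt"

    // recomputeSharedPool returns all CPUs minus the node-reserved ones and minus
    // the CPUs pinned by exclusive instances; shared pods are then rebound
    // (cpuset.cpus) to exactly this set.
    func recomputeSharedPool(allCPUs, reserved []int, exclusive map[string][]int) []int {
        taken := map[int]bool{}
        for _, c := range reserved {
            taken[c] = true
        }
        for _, cpus := range exclusive {
            for _, c := range cpus {
                taken[c] = true
            }
        }
        var shared []int
        for _, c := range allCPUs {
            if !taken[c] {
                shared = append(shared, c)
            }
        }
        return shared
    }

    func main() {
        all := []int{0, 1, 2, 3, 4, 5, 6, 7}
        reserved := []int{0, 1}
        exclusive := map[string][]int{"pod-exclusive-a": {2, 3}}

        fmt.Println(recomputeSharedPool(all, reserved, exclusive)) // [4 5 6 7]

        // A second exclusive pod arrives: the shared pool shrinks and every
        // shared pod's cpuset must be refreshed to the new list.
        exclusive["pod-exclusive-b"] = []int{4, 5}
        fmt.Println(recomputeSharedPool(all, reserved, exclusive)) // [6 7]
    }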

Cgroup-controller dynamically adjusts a pod’s resource quota by modifying the pod’s cgroup configuration directly. Although this does not require touching the Kubernetes source code, it has the following problems:

  • Cgroup-controller can only modify a container’s resource quota after the container has started.
  • When a change in the exclusive instances is detected, the CPU binding of shared instances is refreshed dynamically, which may affect the performance of the shared instances to some extent.

Because cgroup-controller directly manipulates the cgroup resource quota of DB instances and therefore directly affects the DB instances’ data plane, it performs the necessary safety checks when making changes. For example, when setting the CPUs bound to an exclusive instance, it checks whether the bound cores satisfy the I/O queue distribution and whether the number of bound cores exceeds the number of exclusive cores that the node is allowed to sell.
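Those checks can be sketched roughly as follows; this is a simplified illustration with hypothetical parameter names (and it assumes “satisfy the I/O queue distribution” means the bound cores are spread over more than one I/O queue), not the controller’s real validation code.

    package main

    import (
        "errors"
        "fmt"
    )

    // validateExclusiveBinding sketches the two safety checks: the bound cores
    // should be spread over more than one I/O queue (queueOf maps CPU -> I/O
    // queue), and the total bound cores must stay within the node's sellable
    // exclusive capacity.
    func validateExclusiveBinding(bound []int, queueOf map[int]int, alreadyExclusive, sellableExclusive int) error {
        queues := map[int]bool{}
        for _, c := range bound {
            q, ok := queueOf[c]
            if !ok {
                return fmt.Errorf("cpu %d has no I/O queue info", c)
            }
            queues[q] = true
        }
        if len(bound) > 1 && len(queues) < 2 {
            return errors.New("bound cores are not spread across I/O queues")
        }
        if alreadyExclusive+len(bound) > sellableExclusive {
            return errors.New("exceeds the node's sellable exclusive CPU cores")
        }
        return nil
    }

    func main() {
        queueOf := map[int]int{0: 0, 1: 0, 2: 1, 3: 1}
        fmt.Println(validateExclusiveBinding([]int{0, 2}, queueOf, 60, 64)) // <nil>
        fmt.Println(validateExclusiveBinding([]int{0, 1}, queueOf, 60, 64)) // not spread across I/O queues
    }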


This article is the original content of Aliyun and shall not be reproduced without permission.