Kubernetes is Google’s open-source container cluster management system and an important member of the Docker ecosystem. As the community and vendors continue to improve it, Kubernetes has become one of the leaders in container management. At ArchSummit, Huawei senior engineer Wang Zefeng gave the talk “Kubernetes’ Practice in Huawei’s Global IT System”. The following is a summary of his speech.

Wang Zefeng

Senior engineer of Huawei PaaS Development Department

Wang Zefeng has years of experience in system software development and performance tuning in the telecommunications field, with expertise in deep message parsing and protocol identification. As a core developer of the Huawei PaaS platform, he was responsible for the design and development of the O&M system in its initial version, as well as for multi-AZ support and affinity/anti-affinity scheduling in Huawei’s public cloud PaaS CCE service. He is currently a committer in the Kubernetes community, leading the development of affinity-scheduling features, and takes a strong interest in how open-source community projects operate.

A brief introduction

See Figure 1

As a global company, Huawei has a very large IT system deployed in regions all over the world, with heavy business volume and a rapidly growing number of applications. As the scale grows, so do the problems:

  1. Resource utilization is low, and the number of VMs is growing rapidly.

  2. Deployment and maintenance across multiple data centers in different regions does not scale, and operation and maintenance costs are huge.

  3. Systems are overloaded, and scaling an application takes a long time.

  4. Rapid iteration is hard to support, and launching new services is inefficient.

See Figure 2

In the face of this business pressure, Cloud Container Engine (CCE) was introduced to solve these problems. Huawei’s CCE can be understood as a commercial version of Kubernetes (Figure 2). Overall, Huawei wants to connect its IaaS and PaaS platforms through one set of core technologies and have that single set of technologies support multiple business scenarios.


See Figure 3

At the end of 2015, CCE went online in Huawei’s IT system (Figure 3); resource utilization increased 3 to 4 times, and the deployment has since grown to more than 2,000 virtual machines. Next, I will focus on two recent key technology practices: Kubernetes multi-cluster federation and affinity/anti-affinity scheduling between applications.

Kubernetes multi-cluster federation

See Figure 4

Figure 4 is a very simple Kubernetes architecture diagram, in which you can see the following components:

- Pod (container group)

- Container

- Label

- Replication Controller

- Service

- Node

- Kubernetes Master

The basic scheduling unit is the Pod, which the following examples will describe.

See Figure 5

It is estimated, based on one business line of Huawei’s internal IT system (Figure 5), that after full containerization in 2016 the scale will reach 30,000 virtual machines and 100,000 containers. In a typical scenario, Huawei’s IT systems deploy applications across multiple data centers (DCs) and need cross-domain service discovery. However, a single Kubernetes cluster struggles to support this because of the severe network restrictions and differences between domains. How to provide large-scale, centrally managed deployment of cross-domain applications therefore became the problem to solve.

See Figure 6

Figure 6 shows a simple multi-cluster federation architecture. The three black blocks are the three key components added at the cluster-federation level, and the white block at the bottom is a cluster. This architecture is very similar to Kubernetes’ original architecture: there, a Master manages multiple Nodes; here, a Master manages multiple clusters. If the clusters in the diagram were replaced by Nodes, it would be the Kubernetes architecture. Users create applications through a unified API entry point. What cluster federation does is split a federation-level application object into sub-applications and deploy them across clusters. The Cluster Controller maintains cluster health and monitors load, while the Service Controller provides cross-cluster service discovery.

See Figure 7

In general, all cluster management architectures are similar, consisting of a Controller, a Scheduler, and an Agent (the Kubelet). Here the list-watch mechanism (Figure 7) decouples the interaction between components and embodies the design philosophy of “everything talks to the API”. It is similar to the message bus in an SOA architecture, but differs in that data storage and event notification are handled together: when a component receives a new event, it also gets the new data, including whether the change is a newly created object or an update to an existing one.

An example: the first step is to create a ReplicaSet. The controller-manager (which contains multiple controllers), the Scheduler, and the Kubelet each set up a watch, but on different objects. The ReplicaSet Controller watches ReplicaSets; the Scheduler watches Pods (that is, application instances) whose destNode is empty, i.e. unscheduled Pods; and the Kubelet watches only Pods whose destNode is its own Node, so it processes a Pod only once that Pod has been scheduled to it. When an application is created, Kubernetes saves the object to etcd and emits a creation event once the object is saved. The ReplicaSet Controller, which is already watching for this event, is notified: it sees that a new RS has been created and splits it. Since an RS corresponds to a multi-instance stateless application, the controller creates as many Pods as the RS has replicas.

The first phase of the creation process is now complete, and the new Pods’ destNode is still empty. The Scheduler then receives the Pod-creation notifications and runs its scheduling algorithm (a comprehensive evaluation of all available Nodes in the cluster against each Pod’s scheduling requirements, selecting one Node), then binds each Pod to a Node by updating its destNode, which completes the Scheduler’s part. The Kubelet on the bound Node then sees that a new Pod has been scheduled to it and creates and starts the containers according to the Pod definition.

Through this process, the components watch different objects (or the same object in different states) and are naturally decoupled; each component handles a different stage of the application lifecycle, and the ordering of work across components in an asynchronous distributed system is still guaranteed. At the same time, a watch only delivers events, so each notification, such as a Pod creation, carries only incremental information and the amount of data exchanged is very small. We know that when a message or event is delivered over a message bus it can be lost, so a fault-tolerance mechanism is needed to guarantee eventual consistency. Kubernetes solves this internally with list: for example, the ReplicaSet Controller periodically performs a full list of ReplicaSet data, which is the application state the system currently requires; it checks the actual running instances of each application and reconciles accordingly, deleting the surplus and creating whatever is missing.
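The sketch below is a rough illustration of this list-watch pattern, not the real client-go API: a controller loop consumes incremental watch events and periodically falls back to a full re-list to reconcile desired and actual state. The Event and state types are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a simplified stand-in for a watch notification: the changed
// object plus whether it was added, modified, or deleted.
type Event struct {
	Type string // "ADDED", "MODIFIED", "DELETED"
	Name string
	Want int // desired replica count carried with the event (incremental data)
}

// reconcile compares the desired replica count with what is actually running
// and deletes the surplus or creates what is missing.
func reconcile(name string, want, have int) {
	switch {
	case have < want:
		fmt.Printf("%s: creating %d pod(s)\n", name, want-have)
	case have > want:
		fmt.Printf("%s: deleting %d pod(s)\n", name, have-want)
	}
}

func main() {
	events := make(chan Event, 8)              // fed by a watch on the API server
	desired := map[string]int{}                // local cache built up from events
	running := map[string]int{"rs-a": 0}       // stand-in for observed actual state
	relist := time.NewTicker(30 * time.Second) // periodic full list as the safety net

	go func() { events <- Event{"ADDED", "rs-a", 3} }()

	for {
		select {
		case e := <-events: // incremental path: react to one change with its data
			desired[e.Name] = e.Want
			reconcile(e.Name, e.Want, running[e.Name])
		case <-relist.C: // eventual-consistency path: re-check everything
			for name, want := range desired {
				reconcile(name, want, running[name])
			}
		}
	}
}
```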

See Figure 8

Figure 8 shows the application creation process under cluster federation. As shown in the figure, the user sends a create-application request to the federated API Server, which first saves the object. Meanwhile, the Cluster Controller periodically collects monitoring data from each cluster and synchronizes it to the API Server. When the federated Scheduler sees the newly created application through its watch, it selects suitable clusters according to its scheduling algorithm, based on all the cluster monitoring information in the federation, including health status and other indicators (currently, mainly each cluster’s capacity and load), and splits the application across those clusters. In this example, the application is split into two clusters: two instances in cluster A and three in cluster B. The split result is still written back to the API Server, which is consistent with how native Kubernetes creates objects in a single cluster. The federated Replication Controller then watches the creation of the sub-RSs (in practice an attribute update on the RS object) and creates the actual application entities in the corresponding clusters, using the instance counts produced by the earlier split.

See Figure 9

Figure 9 shows the key mechanism of the federated scheduler. It watches two kinds of objects: ReplicaSets and Clusters (how many clusters are available in the federation, how loaded they are, and so on). When an RC/RS changes, it is saved to a local queue, from which a worker reads RC/RS objects one by one and processes them: it selects suitable target clusters according to the loaded scheduling policy, then calculates how many instances should be created in each cluster. The result is finally written back to the API Server in the data format shown in the figure: an update to the RS field destClusters, which was null before scheduling and now becomes something like [{cluster: “clusterA”, replicas: 6}, …].
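A minimal sketch of that split step, assuming a simple capacity-weighted policy (the types, field names, and policy are illustrative, not the real federation code):

```go
package main

import "fmt"

// Cluster holds the monitoring data the federated scheduler cares about:
// the name of a member cluster and how much spare capacity it reports.
type Cluster struct {
	Name      string
	FreeSlots int
}

// ClusterReplicas mirrors one destClusters entry written back to the RS,
// e.g. {cluster: "clusterA", replicas: 6}.
type ClusterReplicas struct {
	Cluster  string
	Replicas int
}

// split distributes the desired replica count across clusters in proportion
// to their free capacity; any rounding remainder goes to the cluster with most room.
func split(total int, clusters []Cluster) []ClusterReplicas {
	free, biggest := 0, 0
	for i, c := range clusters {
		free += c.FreeSlots
		if c.FreeSlots > clusters[biggest].FreeSlots {
			biggest = i
		}
	}
	out := make([]ClusterReplicas, len(clusters))
	assigned := 0
	for i, c := range clusters {
		n := total * c.FreeSlots / free
		out[i] = ClusterReplicas{Cluster: c.Name, Replicas: n}
		assigned += n
	}
	out[biggest].Replicas += total - assigned // hand out the rounding remainder
	return out
}

func main() {
	clusters := []Cluster{{"clusterA", 40}, {"clusterB", 60}}
	fmt.Println(split(5, clusters)) // [{clusterA 2} {clusterB 3}], as in the example
}
```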

See Figure 10

There are multiple controllers in a cluster federation. In this example (Figure 10), there are two main ones: a Cluster Controller and a Replication Controller. The Cluster Controller watches Cluster objects and maintains the status of all clusters in the federation, including periodic health checks and load monitoring. If secure access is configured, it also has to make sure that the configuration (such as the certificate files used to access each cluster) remains usable. The Replication Controller watches ReplicaSets, as does the federated scheduler, but here it watches the RSs that the scheduler has already split and is responsible for creating the specified number of instances in each cluster.

See Figure 11

Following the previous example, the split produced two instances for cluster A and three for cluster B. This can be understood as the control-plane side, which solves the problem of one application being deployed to multiple clusters under federation. The next step is the data plane: how the application achieves cross-cluster access (Figure 11). We know that inside a single cluster there is a fully connected network, but across clusters we also want to support hybrid-cloud scenarios where the underlying networks may not be connected, so we designed directly for the case in which the networks between clusters are disconnected. For external traffic, user requests are distributed to each cluster through global distributed routing; each cluster has a load balancer, which then directs traffic to its Nodes. For access within a cluster, the original service discovery mechanism still works. For cross-cluster access, there is one constraint: the load balancer in each cluster must have a public IP address reachable by all Nodes in all clusters. When an application request needs to cross clusters, the traffic is forwarded to the peer cluster’s load balancer, which acts as a reverse proxy and forwards it to the Node where the back-end application instance runs.
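As a sketch of the reverse-proxy role the peer cluster’s load balancer plays (in reality the LB is a dedicated component, not this Go program, and the addresses are made up), Go’s standard library alone can express the idea:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical address of a Node running a back-end instance in this cluster.
	backend, err := url.Parse("http://10.0.1.15:8080")
	if err != nil {
		log.Fatal(err)
	}
	// The cluster's load balancer listens on a publicly reachable address and
	// reverse-proxies requests arriving from other clusters to the local Node.
	proxy := httputil.NewSingleHostReverseProxy(backend)
	log.Fatal(http.ListenAndServe(":80", proxy))
}
```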

See Figure 12

Figure 12 shows the key mechanisms of the Service Controller: it watches Service changes and it watches Cluster changes. If a cluster goes down, traffic should no longer be forwarded to it, so the controller watches cluster changes and refreshes the connectivity of each service back end (cluster) to make sure every back end it routes to is reachable. At the same time, a watched Service change may be a newly created service, a refresh of the service’s back-end instances, or a failure of the LB behind the service; all of these are changes to the service and must be handled accordingly.

The Cluster watch handles changes in cluster state. Take adding a service as an example. In the federated Service Controller, each cluster has Service objects, including Endpoints, and these come in two kinds: the Nodes where the service’s Pod instances run in the local cluster, and, for cross-cluster access, the LBs of the other clusters where the service also runs. The controller writes both sets of information into kube-proxy. To provide external access, the Endpoints of the service instances inside each cluster are also written to that cluster’s LB for reverse proxying. This enables communication both within a single cluster and across clusters. For external traffic, an LB in the nearest cluster can be chosen based on the user’s region or access latency to achieve the best access speed.
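A simplified sketch of how the two kinds of back ends might be merged into one endpoint list for kube-proxy; all names and addresses here are invented for illustration:

```go
package main

import "fmt"

// endpointsFor builds the endpoint list that would be written to kube-proxy
// for one service: Nodes hosting local Pod instances plus, for each other
// cluster the service runs in, the public address of that cluster's LB.
func endpointsFor(localNodes []string, remoteLBs map[string]string) []string {
	eps := append([]string{}, localNodes...)
	for cluster, lb := range remoteLBs {
		eps = append(eps, fmt.Sprintf("%s (LB of %s)", lb, cluster))
	}
	return eps
}

func main() {
	local := []string{"10.0.1.15:8080", "10.0.1.16:8080"}      // Pods in this cluster
	remote := map[string]string{"clusterB": "203.0.113.10:80"} // peer cluster's LB
	fmt.Println(endpointsFor(local, remote))
}
```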

Affinity/anti-affinity scheduling between applications


See Figure 13

The move to containers and microservices creates affinity and anti-affinity problems (Figure 13). Previously, one virtual machine would host several components, with inter-process communication between them. With containerization, containers are usually split by process: for example, the business process runs in one container, while monitoring, log processing, or local data handling runs in another container with its own independent lifecycle. In that case, if the two are placed at distant points in the network, requests go through multiple hops and performance suffers. Affinity is therefore used to deploy related containers close together, with enhanced networking so that communication is routed locally and network overhead is reduced. For high reliability, anti-affinity spreads instances as widely as possible, so that when a node fails the impact on an application is only 1/N, i.e. a single instance.

See Figure 14

Let’s start with a single application (Figure 14). In this example, we want all instances to be deployed within one AZ (i.e. affinity at the AZ level) and to be anti-affine at the Node level, with one instance per Node, so that when a Node goes down, only one instance is affected. However, failure domains such as AZ, rack, and Node are understood and named differently by different enterprises on different platforms. For example, one company might want an application to be anti-affine at the subrack level, because an entire subrack can fail. In essence, though, these are all just labels on Nodes: at scheduling time the scheduler only needs to group Nodes dynamically by the chosen label and evaluate the affinity and anti-affinity relationships against those groups.
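As a small sketch of that idea, the snippet below groups Nodes by whatever failure-domain label the platform uses (the label keys and node names are made up):

```go
package main

import "fmt"

// Node carries the labels a platform attaches to it; the failure-domain key
// ("az", "rack", "subrack", ...) is whatever the enterprise chose.
type Node struct {
	Name   string
	Labels map[string]string
}

// groupByDomain buckets nodes by the value of a failure-domain label, so the
// scheduler can apply affinity/anti-affinity at that level.
func groupByDomain(nodes []Node, key string) map[string][]string {
	groups := map[string][]string{}
	for _, n := range nodes {
		groups[n.Labels[key]] = append(groups[n.Labels[key]], n.Name)
	}
	return groups
}

func main() {
	nodes := []Node{
		{"node-1", map[string]string{"az": "az1", "rack": "r1"}},
		{"node-2", map[string]string{"az": "az1", "rack": "r2"}},
		{"node-3", map[string]string{"az": "az2", "rack": "r3"}},
	}
	fmt.Println(groupByDomain(nodes, "az"))   // map[az1:[node-1 node-2] az2:[node-3]]
	fmt.Println(groupByDomain(nodes, "rack")) // one group per rack
}
```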


See Figure 15

Both hard and soft versions of affinity and anti-affinity are supported (Figure 15). If all rules are hard, application deployment may fail because of improper configuration. There are essentially two classes of algorithms here, with similar implementation logic. Hard rules are implemented as filtering: all Nodes that do not satisfy the rules are filtered out, guaranteeing that whatever remains meets the conditions. Soft rules are best effort: scheduling can still succeed even if a condition is not met. For soft rules the approach is scoring and ranking: Nodes are scored by how well they satisfy the affinity and anti-affinity rules, with a higher degree of conformance scoring higher, and the Node is then selected from the highest score down.
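A minimal sketch of the two algorithm classes, hard filtering and soft scoring, using invented types and a trivial scoring rule:

```go
package main

import "fmt"

// candidate pairs a Node name with the Pods already placed in its failure domain.
type candidate struct {
	Node string
	Pods []string
}

// hardFilter drops any Node whose domain already hosts a Pod that the new Pod
// is hard-anti-affine to; whatever remains is guaranteed to satisfy the rule.
func hardFilter(cands []candidate, antiAffineTo string) []candidate {
	var out []candidate
	for _, c := range cands {
		ok := true
		for _, p := range c.Pods {
			if p == antiAffineTo {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, c)
		}
	}
	return out
}

// softScore ranks the remaining Nodes: each Pod the new Pod prefers to be
// near adds to the score, and the highest-scoring Node is picked first.
func softScore(cands []candidate, affineTo string) (best string, bestScore int) {
	for _, c := range cands {
		score := 0
		for _, p := range c.Pods {
			if p == affineTo {
				score++
			}
		}
		if best == "" || score > bestScore {
			best, bestScore = c.Node, score
		}
	}
	return best, bestScore
}

func main() {
	cands := []candidate{
		{"node-1", []string{"app-a", "app-b"}},
		{"node-2", []string{"app-a"}},
		{"node-3", nil},
	}
	filtered := hardFilter(cands, "app-b")  // hard anti-affinity to app-b
	node, _ := softScore(filtered, "app-a") // soft affinity to app-a
	fmt.Println(node)                       // node-2
}
```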

See Figure 16

Affinity is mutual, and this is where symmetry comes in (Figure 16). Affinity is defined on the application: for example, if application B has affinity for application A, the rule is written in B’s definition, and A knows nothing about it. If A is created first and B later, there is no problem, because B will seek out A; but in the reverse order, A will not seek out B. So when implementing the algorithm, we added a default behavior: at scheduling time we consider not only the Pods the current Pod is affine/anti-affine to (the one-way view above), but also check which Pods are affine/anti-affine to the current Pod, to achieve symmetry.
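The symmetric check can be pictured like this (a toy sketch; the rule store and application names are hypothetical):

```go
package main

import "fmt"

// rules maps an application to the applications it declares affinity for;
// only the declaring side knows about the rule.
var rules = map[string][]string{
	"B": {"A"}, // B declares affinity to A; A's definition says nothing about B
}

// affine reports whether x and y should be placed together, checking both the
// rules x declares and the rules other applications declare about x (symmetry).
func affine(x, y string) bool {
	for _, t := range rules[x] { // rules the Pod being scheduled declares
		if t == y {
			return true
		}
	}
	for _, t := range rules[y] { // rules declared by the other side
		if t == x {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(affine("B", "A")) // true: B declares it
	fmt.Println(affine("A", "B")) // also true, thanks to the symmetric check
}
```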

Another common question is what happens when an affine application is migrated. Two points need explaining. First, symmetry is built into the algorithm: regardless of which application was deployed first, when an instance is rebuilt and rescheduled it still checks which Pods in the current system it is affine or anti-affine to, and follows or avoids them accordingly. Second, RC/RS currently rebuilds Pods only when a Node goes down; if the Node stays healthy, Pods are not moved. Together, these two points ensure that affine applications stay together and anti-affine applications stay apart.

See Figure 17

At this point we have implemented both hard and soft rules. Soft rules pose no problem: even if a rule is not satisfied, scheduling can still succeed, so full symmetry can be applied. The main problem is the symmetry of hard affinity. If the application a Pod is affine to does not exist (has not been scheduled), failing to schedule is reasonable. But failing simply because no other Pod declares affinity to the Pod currently being scheduled is unreasonable. So hard affinity is not made fully symmetric: its reverse direction is treated as a soft affinity.

See Figure 18

Although symmetry is considered in the implementation of affinity/anti-affinity, it only takes effect during scheduling; once scheduled, a Pod is not adjusted during its lifecycle as conditions change. So we address this problem further in two ways (Figure 18).

The first is to restrict migration on restart, known as Forgiveness. If a Node fails or restarts and a Pod is migrated (rebuilt and scheduled onto another Node), any local data the Pod had saved is lost with the move. With Forgiveness, when the Node goes down the Pod is not migrated; instead it waits and is brought back up in place once the Node recovers, so data previously stored on the local disk can still be processed. We also introduced a timeout for how long local data remains valid. For example, logs staged locally may be worthless after two or three days; within that window the Pod should stay bound to the Node and wait for it to recover, but if the Node has not recovered after two or three days, the local data no longer matters and we want the Pod to be migrated. During eviction checks, a Forgiveness Pod whose timeout has not expired stays bound to its Node, while one whose timeout has expired is evicted and migrated normally.
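A sketch of that eviction decision, with an invented forgiveness type and made-up grace periods:

```go
package main

import (
	"fmt"
	"time"
)

// forgiveness describes how long a Pod's local data is worth waiting for
// after its Node goes down; zero means no forgiveness at all.
type forgiveness struct {
	GracePeriod time.Duration
}

// shouldEvict decides, during an eviction check, whether a Pod on a failed
// Node should be migrated now or kept bound while waiting for the Node.
func shouldEvict(nodeDownSince time.Time, f forgiveness, now time.Time) bool {
	if f.GracePeriod == 0 {
		return true // no local data to protect: migrate immediately
	}
	return now.Sub(nodeDownSince) > f.GracePeriod // past the timeout: data is stale
}

func main() {
	down := time.Now().Add(-36 * time.Hour)
	f := forgiveness{GracePeriod: 72 * time.Hour}                    // e.g. logs matter for ~3 days
	fmt.Println(shouldEvict(down, f, time.Now()))                    // false: keep waiting for the Node
	fmt.Println(shouldEvict(down.Add(-48*time.Hour), f, time.Now())) // true: timeout expired, evict
}
```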

The second is runtime migration. While the system runs, cluster load, inter-application affinity, Node health, and quality of service all change from moment to moment. The system should be able to adjust the distribution of application instances dynamically based on the current situation, which is the point of rescheduling. The basic behavior of rescheduling is to check cluster state periodically and migrate instances accordingly. However, many factors have to be considered before a migration, such as application availability (at no point may all instances of an application be unavailable because of a migration), unidirectionality (if a migrated Pod simply moves back to its original Node, the migration is pointless), and convergence (migrating one Pod must not trigger a cascade of further migrations).
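Those pre-migration checks could be pictured roughly like this; the structure and thresholds are invented for illustration:

```go
package main

import "fmt"

// migration proposes moving one instance of an application between Nodes.
type migration struct {
	App      string
	FromNode string
	ToNode   string
}

// canMigrate applies the three constraints from the text: keep the application
// available, don't bounce a Pod back to where it came from, and don't set off
// further migrations.
func canMigrate(m migration, readyReplicas int, lastNode map[string]string, triggers int) bool {
	if readyReplicas <= 1 {
		return false // availability: this is the only ready instance
	}
	if lastNode[m.App] == m.ToNode {
		return false // unidirectionality: would just move back
	}
	if triggers > 0 {
		return false // convergence: this move would force other Pods to move
	}
	return true
}

func main() {
	m := migration{App: "app-a", FromNode: "node-1", ToNode: "node-2"}
	last := map[string]string{"app-a": "node-3"}
	fmt.Println(canMigrate(m, 3, last, 0)) // true: all three checks pass
	fmt.Println(canMigrate(m, 1, last, 0)) // false: would take the app down
}
```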


Qiniu Architect Practice Day is an offline technical salon launched by Qiniu Cloud. Together with senior technical experts from the industry and outstanding architects from large companies and startups, it aims to provide developers, architects, and decision makers with a cutting-edge, in-depth platform for technical exchange, helping them keep up with technology and learn from others’ experience.

 

The 11th session, “Sports + Live Broadcast Technology Best Practices”, will be held in Beijing on August 14th, and registration is now open. Click “Read the original article” below for more information; we look forward to your participation.