Kubernetes (K8s) is an excellent open-source container scheduling system, and it is itself a distributed application. Although this series of articles is about Internet architecture, some of the design ideas in K8s are well worth pondering and borrowing. I am not an operations expert; in this article I try to look at the K8s architecture from my own perspective and analyze some of the architectural considerations it makes in three areas: stability, simplicity, and extensibility.

  • Stability: the system itself is stable enough; actions that users perform through it land reliably; it is fault tolerant enough to cope with network problems; it has sufficiently high availability; and so on.
  • Simplicity: the design of the system itself is simple enough; there is little coupling between components; each component has a single responsibility; and so on.
  • Extensibility: the modules of the system are layered; internal and external modules are treated equally; external extension modules can easily be plugged into the system; modules implement unified interfaces so that concrete implementations can easily be swapped; and so on.

Below, we will look at some examples of how K8s is designed in these three respects. While looking at how K8s does it, we can ask ourselves: if we needed to develop a product similar to K8s, a highly reliable resource state management and coordination system, how would we design it?

1. Stability: Declarative application management

As we know, K8s defines many kinds of resources (such as Pod, Service, Deployment, ReplicaSet, StatefulSet, Job, CronJob, and so on). When managing resources, we use declarative configuration (JSON, YAML, etc.) to create, delete, modify, and query them. The configuration we provide describes the target state we want the resource to reach, called the Spec. K8s observes the actual state of each resource, called the Status. Whenever Spec != Status, K8s's various controllers come into play, performing whatever operations are needed so that the resource eventually converges to the desired Spec. This declarative style of management is less direct than imperative management, but it is far more fault tolerant, as discussed in more detail in a later section. It is also very simple: as long as the user provides an appropriate Spec definition, there is no need to expose dozens or hundreds of different APIs to change every aspect of a resource. Of course, we can still create dedicated management APIs for some important actions (such as scaling, or changing the image); under the hood these APIs simply modify the Spec, so the underlying mechanism stays unified.
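To make the Spec/Status loop concrete, here is a minimal, hypothetical sketch in Go (not actual Kubernetes code) of a controller that repeatedly compares the declared Spec with the observed Status and acts to close the gap:

```go
package main

import (
	"fmt"
	"time"
)

// ReplicaSetLike is a hypothetical resource with a desired Spec and an observed Status.
type ReplicaSetLike struct {
	SpecReplicas   int // what the user declared
	StatusReplicas int // what is actually running
}

// reconcile closes the gap between Spec and Status one step at a time.
// Real controllers do the same thing conceptually: observe, diff, act.
func reconcile(r *ReplicaSetLike) {
	switch {
	case r.StatusReplicas < r.SpecReplicas:
		r.StatusReplicas++ // e.g. create one Pod
		fmt.Println("scaled up to", r.StatusReplicas)
	case r.StatusReplicas > r.SpecReplicas:
		r.StatusReplicas-- // e.g. delete one Pod
		fmt.Println("scaled down to", r.StatusReplicas)
	default:
		// Spec == Status: nothing to do, the system has converged.
	}
}

func main() {
	rs := &ReplicaSetLike{SpecReplicas: 3, StatusReplicas: 0}
	for i := 0; i < 5; i++ {
		reconcile(rs)
		time.Sleep(100 * time.Millisecond) // stand-in for a controller's resync interval
	}
}
```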

In an earlier article in this series (S1E2), I shared the design of a task table; declarative object management follows the same idea. What we save in the database is the result we want, and different task Jobs then work toward that result (also saving each component's current state back to the database). Even if a task fails, subsequent tasks keep retrying, which is the most reliable approach.

2. Stability: Edge triggering vs. level triggering

K8s adopts declarative management, also known as level triggering. The other approach is imperative management, also known as edge triggering. For example, suppose we are building a payment system and a user tops up 100 yuan, withdraws 100 yuan, and then tops up another 100 yuan. With imperative management these are three commands; if the withdrawal request is lost, the user's account balance is simply wrong, which is clearly unacceptable, so imperative management (edge triggering) must be paired with compensation logic. Declarative management instead tells the system that the user's balance after the three operations is 100, 0, and 100; even if the withdrawal message is lost, the final balance still ends up at 100.

Take a look at the example below. When the network is healthy, edge triggering works fine: we perform on, off, and on, and the final state is 1.

When there is a network problem and the "off" operation is lost, edge triggering ends up in the wrong state of 2. Level triggering does not have this problem: although the state is wrong for a while (stuck at 1) while the network is down, once the network recovers the system immediately sees that the desired state at that point is 0, moves back to 0, and the final state ends up at the correct value of 1. Now imagine scaling our Pods up or down. If we told K8s how many Pods to add or remove each time, network problems could easily leave us with a Pod count that is not what we wanted. It is better to tell K8s what state we want: no matter whether the network, a management component, or a Pod has problems, K8s will eventually adjust things to the state we asked for. Better slow than wrong.
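A tiny sketch (an invented example, not K8s code) of why the two triggering styles behave differently when a message is lost:

```go
package main

import "fmt"

// Edge triggering: apply deltas. A lost event corrupts the state permanently.
func applyEdge(state int, deltas []int) int {
	for _, d := range deltas {
		state += d
	}
	return state
}

// Level triggering: each observation carries the full desired state,
// so the last observation that arrives wins, regardless of what was lost.
func applyLevel(state int, observations []int) int {
	for _, desired := range observations {
		state = desired
	}
	return state
}

func main() {
	// Intended sequence: on(+1), off(-1), on(+1) => final state 1.
	// The "off" message is lost on the network.
	fmt.Println("edge-triggered: ", applyEdge(0, []int{+1 /* off lost */, +1})) // 2, wrong
	fmt.Println("level-triggered:", applyLevel(0, []int{1 /* 0 lost */, 1}))    // 1, correct
}
```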


3. Stability: High availability design

As we know, etcd is a distributed key-value store/coordination system based on the Raft protocol; for high availability it is recommended to run it as a cluster with an odd number of nodes, such as 3, 5, or 7. For the Master nodes, we can deploy an etcd instance on each node so that the API Server on that node can talk directly to the local etcd. The API Server itself is stateless, so it can sit behind a load balancer; both nodes and clients reach an appropriate API Server through the load balancer. Components such as the Controller Manager and the Scheduler behave like Jobs and clearly should not run actively on multiple nodes at the same time, so they use a preemption model to elect a Leader: only the Leader takes on work while the followers stay on standby (a simplified sketch of this preemption-style election follows the list below). The overall structure is shown in the figure below, and the common high-availability patterns are:

  • Stateless: multiple nodes + load balancing
  • Stateful: primary node + secondary (or standby) nodes
  • Stateful: symmetric nodes that synchronize state with one another
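The following is a much simplified, in-memory illustration of the preemption idea; in K8s the lease record actually lives in etcd behind the API Server and is updated with optimistic concurrency, typically via client-go's leader-election helpers:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Lease is a minimal, in-memory stand-in for a lease record.
type Lease struct {
	mu       sync.Mutex
	holder   string
	expireAt time.Time
}

// TryAcquire grants the lease if it is free, already held by the caller, or expired;
// otherwise the caller stays a follower.
func (l *Lease) TryAcquire(id string, ttl time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if l.holder == "" || l.holder == id || now.After(l.expireAt) {
		l.holder = id
		l.expireAt = now.Add(ttl)
		return true
	}
	return false
}

func main() {
	lease := &Lease{}
	for _, id := range []string{"controller-manager-a", "controller-manager-b"} {
		if lease.TryAcquire(id, 15*time.Second) {
			fmt.Println(id, "is the leader and starts working")
		} else {
			fmt.Println(id, "stays on standby")
		}
	}
}
```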

4. Simplicity: Publish/subscribe based on list-watch

One of K8s's design principles is that etcd sits behind the API Server and components in the cluster cannot talk to the database directly. Exposing the database directly to the components would not only be chaotic, it would also be unsafe to let anyone read and write etcd directly; security controls such as authentication and authorization need to be handled uniformly by the API Server (we will come back to the API Server's plug-in chain later).

For the various resources in a K8s cluster, the controllers and the scheduler need to be aware of state changes (such as creation) and then carry out their management duties based on those change events. For decoupling, this clearly calls for something like a message queue: each management component can listen for state-change events of the resources it cares about and just do its own job, without being aware of the others. But if K8s relied on some message middleware to provide this, the overall complexity would increase, and the middleware itself would need security customization.

The approach K8s takes is to let the API Server itself act as a simple message bus. All components establish long-lived HTTP connections through the Watch mechanism so they learn about change events for the resources they are interested in as they happen; after doing their part of the work, they call the API Server again to write new Specs, which are then picked up and processed by other controllers. Watch is a push mechanism and handles changes in real time; however, considering factors such as the network, events may be lost and components may restart, so push must be combined with pull to compensate. The API Server therefore also provides a List interface, used to resynchronize the latest state when a Watch goes wrong or a component restarts. This list-watch mechanism, combining push and pull, satisfies both timeliness and reliability.
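As a rough sketch of consuming list-watch, the example below uses client-go's shared informer, which performs an initial List and then Watches for changes with periodic resync; the kubeconfig path and resync interval are illustrative and error handling is omitted for brevity:

```go
package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, _ := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	clientset, _ := kubernetes.NewForConfig(config)

	// The informer lists once, then watches, and resyncs periodically
	// to compensate for any missed events.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	deployInformer := factory.Apps().V1().Deployments().Informer()

	deployInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			d := obj.(*appsv1.Deployment)
			fmt.Println("deployment created:", d.Namespace+"/"+d.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			d := newObj.(*appsv1.Deployment)
			fmt.Println("deployment updated:", d.Namespace+"/"+d.Name)
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, deployInformer.HasSynced)
	select {} // keep watching
}
```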

Taking the creation of a Deployment as an example, the components subscribe to events roughly as follows:

  • The Deployment Controller subscribes to Deployment creation events
  • The ReplicaSet Controller subscribes to ReplicaSet creation events
  • The Scheduler subscribes to creation events for Pods not yet bound to a node
  • Each Kubelet subscribes to Node-Pod binding events for its own node

The chain of resource-change operations in the cluster then unfolds like this:

  1. The client calls the API Server to create the Deployment Spec
  2. The Deployment Controller receives a message that it needs to process the new Deployment
  3. The Deployment Controller calls the API Server to create the ReplicaSet
  4. The ReplicaSet Controller receives a message to process the new ReplicaSet
  5. The ReplicaSet Controller calls the API Server to create the Pod
  6. Scheduler receives a message for a new Pod to process
  7. After processing, Scheduler decides to bind the Pod to Node1, calling API Server to write the binding
  8. The Kubelet on Node1 receives an event telling it that there is a Pod to deploy
  9. The Kubelet on Node1 deploys the Pod according to its Spec

As you can see, the API Server with list-watch provides a simple and reliable message bus. The event chain of resource messages decouples the components from one another, while the declarative object management described earlier keeps the management stable. Looking at it in layers, the master components form the control plane, responsible for controlling and managing the cluster state, while the node components form the execution plane; the Kubelet plays a "dumb" executor role. Their communication bridge is the API Server's events, and the Kubelet has no awareness of the controllers' existence.

5. Simplicity: The API Server as the converged entry point for resource management

As shown in the figure below, the API Server implements pre-processing of resource management operations (authentication, authorization, admission, and so on) based on a plug-in + filter chain, much like the well-known interceptor chain in Spring MVC.

  • Authentication: various plug-ins determine who the caller is
  • Authorization: various plug-ins determine whether the user is allowed to operate on the requested resource
  • Defaulting and conversion: default values are filled in and the client version of the resource is converted to the etcd version
  • Admission control: various plug-ins validate or modify the resource; mutation happens before validation
  • Validation: each field is checked against various validation rules
  • Idempotency and concurrency control: optimistic concurrency (version numbers) verifies that the resource has not been modified concurrently
  • Auditing: all resource changes are logged

If a resource is being deleted, there are a few additional steps:

  • Graceful shutdown
  • Finalizer hooks: configured finalizers are called back at this point
  • Garbage collection: resources whose owning (root) resource is gone are deleted in cascade

For complex processing flows, a responsibility chain / processing chain plus plug-ins is a common approach. You might say that this API Server design does not look simple at all: why are there so many links? In fact, this is the simplest way. Each link runs as an independent plug-in (plug-ins can be updated independently and can be enabled or disabled dynamically by configuration), and each plug-in does only what it should do. Without such a design, we would probably end up with one giant method of ten thousand lines of code.
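A minimal sketch of the responsibility-chain-plus-plug-in idea; the plug-ins and rules here are invented for illustration and are not the real API Server admission plug-ins:

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a simplified stand-in for an incoming resource-change request.
type Request struct {
	User     string
	Resource string
	Verb     string
}

// Plugin is one link in the chain; each link does exactly one job.
type Plugin interface {
	Handle(req *Request) error
}

type AuthnPlugin struct{}

func (AuthnPlugin) Handle(req *Request) error {
	if req.User == "" {
		return errors.New("authentication failed: unknown user")
	}
	return nil
}

type AuthzPlugin struct{}

func (AuthzPlugin) Handle(req *Request) error {
	if req.Verb == "delete" && req.User != "admin" {
		return errors.New("authorization failed: delete requires admin")
	}
	return nil
}

// Chain runs each plugin in order and stops at the first failure.
type Chain []Plugin

func (c Chain) Handle(req *Request) error {
	for _, p := range c {
		if err := p.Handle(req); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	chain := Chain{AuthnPlugin{}, AuthzPlugin{}}
	fmt.Println(chain.Handle(&Request{User: "alice", Resource: "pods", Verb: "delete"}))
}
```

Each plug-in can be added, removed, or replaced without touching the others, which is exactly what keeps the overall design simple.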

6. Simplicity: Design of Scheduler

  • Pods waiting to be scheduled have a priority; Pods with higher priority are scheduled first
  • First, find all available nodes
  • Filter the nodes using predicates
  • Rank the remaining nodes using priorities (scoring functions)
  • Schedule the Pod to the node with the highest score

Common predicate algorithms are:

  • Port conflict checks
  • Whether the node's resources meet the Pod's requests
  • Compatibility (e.g. affinity) considerations

Common priority algorithms are:

  • Network topology proximity
  • Balanced resource usage
  • Nodes with more free resources are preferred
  • Nodes already in use are preferred
  • Nodes that already have the image cached are preferred

We can borrow this design pattern when building business systems such as routing or rule engines. "Simple" here means that each individual piece is simple, and the pieces can be combined into a complex rule system; this is far simpler than lumping all the logic together.
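Here is a small sketch of the predicate-then-priority flow described above; the node fields, predicates, and scoring rules are invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified view of a schedulable node.
type Node struct {
	Name     string
	FreeCPU  int
	HasImage bool // whether the Pod's image is already cached on this node
}

// Predicate is a hard filter: a node either fits the Pod or it does not.
type Predicate func(n Node, requestCPU int) bool

// Priority is a soft score used to rank the nodes that passed every predicate.
type Priority func(n Node) int

func totalScore(n Node, prios []Priority) int {
	sum := 0
	for _, p := range prios {
		sum += p(n)
	}
	return sum
}

// schedule filters nodes with predicates, ranks the survivors with priorities,
// and returns the best node.
func schedule(nodes []Node, requestCPU int, preds []Predicate, prios []Priority) (Node, bool) {
	feasible := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(n, requestCPU) {
				fits = false
				break
			}
		}
		if fits {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, false
	}
	sort.Slice(feasible, func(i, j int) bool {
		return totalScore(feasible[i], prios) > totalScore(feasible[j], prios)
	})
	return feasible[0], true
}

func main() {
	nodes := []Node{
		{Name: "node1", FreeCPU: 2, HasImage: false},
		{Name: "node2", FreeCPU: 4, HasImage: true},
	}
	enoughCPU := func(n Node, req int) bool { return n.FreeCPU >= req }
	moreFreeCPU := func(n Node) int { return n.FreeCPU }
	imageCached := func(n Node) int {
		if n.HasImage {
			return 10
		}
		return 0
	}
	if best, ok := schedule(nodes, 2, []Predicate{enoughCPU}, []Priority{moreFreeCPU, imageCached}); ok {
		fmt.Println("schedule the Pod to", best.Name) // node2: more free CPU and image cached
	}
}
```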

7. Extensibility: Layered architecture

K8s adopts a Linux-like layered architecture:

  • Core layer: Kubernetes' core functionality, exposing APIs outward for building higher-level applications and providing a plug-in application execution environment inward
  • Application layer: deployment (stateless applications, stateful applications, batch tasks, clustered applications, etc.) and routing (service discovery, DNS resolution, etc.)
  • Management layer: system metrics (such as infrastructure, container, and network metrics), automation (such as automatic scaling and dynamic provisioning), and policy management (RBAC, Quota, PSP, NetworkPolicy, etc.)
  • Interface layer: the kubectl command-line tool, client SDKs, and cluster federation

8. Extensibility: Interfaces and plug-ins

Besides the fact that many of K8s's internal components use a plug-in architecture, K8s also abstracts some core external resources and services into unified interfaces in its overall design, so that concrete implementations can be plugged in, as shown in the figure below (a minimal sketch of programming against such an interface follows the list):

  • The Container Runtime Interface (CRI) was introduced in K8s v1.5 to decouple the Kubelet from the container runtime. The original internal, entirely Pod-oriented interface was split into gRPC interfaces for Sandbox and Container, and image management and container management were separated into different services.
  • For networking, K8s supports two kinds of plug-ins:
    • Kubenet: a network plug-in based on the CNI bridge plug-in (extending it with port mapping and traffic shaping), currently recommended as the default plug-in
    • CNI: the Container Network Interface, a container network specification initiated by CoreOS and the basis of Kubernetes network plug-ins
  • For storage, the Container Storage Interface (CSI) was introduced in K8s v1.9 to extend the Kubernetes storage ecosystem. CSI is in fact the standard storage interface for the whole container ecosystem, also used by other container cluster schedulers such as Mesos and Cloud Foundry.
In addition, since version 1.7 Kubernetes provides CRDs (CustomResourceDefinitions), a secondary-development capability for extending the K8s API. Through this extension point you can add new resource types to the K8s API; it is simpler and easier than modifying the K8s source code or building a custom API server, and there are no code-merging or compatibility worries when the K8s core version is upgraded. This feature greatly improves the extensibility of K8s.
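As a rough illustration of the CRD idea, custom resource types are usually declared in Go in the following style (for example with tooling such as kubebuilder); the Workflow resource and its fields are hypothetical:

```go
// Package v1alpha1 sketches a hypothetical "Workflow" custom resource.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// WorkflowSpec is the desired state declared by the user.
type WorkflowSpec struct {
	Steps    []string `json:"steps"`
	Parallel bool     `json:"parallel,omitempty"`
}

// WorkflowStatus is the observed state written back by a controller.
type WorkflowStatus struct {
	Phase string `json:"phase,omitempty"`
}

// Workflow follows the same Spec/Status convention as built-in resources,
// so the declarative control loop described earlier applies to it as well.
type Workflow struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WorkflowSpec   `json:"spec,omitempty"`
	Status WorkflowStatus `json:"status,omitempty"`
}
```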

9, extension: PV & PVC & StorageClass

The decoupled design of K8s around storage deserves a special mention. As shown in the figure below, let's look at how K8s decouples the storage layer:

  • First of all, we obviously need a Volume abstraction to abstract away storage. However, making K8s users (whether operations or development) configure the Volumes they need every time they deploy a Pod is clearly too much coupling (take NFS volumes, which would need an address each time; users do not need to, and should not have to, care about such low-level details). The Volume layer describes the underlying storage capability.
  • So K8s abstracts the concepts of PersistentVolume (PV) and PersistentVolumeClaim (PVC). An administrator first configures PVs that map to Volumes; a user only needs to create a PVC that binds to a PV and then reference that PVC when creating a Pod. The PVC does not care about the concrete details of the Volume, only the capacity requirements and access modes. The PV layer abstracts the global volume resources that operations can provide, while the PVC layer describes the storage resource request a user wants for a Pod.
  • K8s also provides a further layer of abstraction called StorageClass: by associating a PVC with a specified (or default) StorageClass, PVs can be created dynamically (a simplified sketch of the binding flow follows this list).
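A simplified, hypothetical sketch of the static PV/PVC binding flow; with a StorageClass, a new PV would instead be provisioned on demand:

```go
package main

import "fmt"

// PV is a simplified model of a pre-provisioned persistent volume.
type PV struct {
	Name       string
	CapacityGi int
	Bound      bool
}

// PVC is a simplified model of a user's storage request.
type PVC struct {
	Name      string
	RequestGi int
	BoundPV   string
}

// bind matches a claim to the first unbound volume that is large enough,
// mirroring the static PV/PVC binding flow described above.
func bind(claim *PVC, volumes []*PV) bool {
	for _, v := range volumes {
		if !v.Bound && v.CapacityGi >= claim.RequestGi {
			v.Bound = true
			claim.BoundPV = v.Name
			return true
		}
	}
	return false
}

func main() {
	// The administrator pre-provisions volumes; the user only states a capacity request.
	volumes := []*PV{{Name: "pv-nfs-1", CapacityGi: 5}, {Name: "pv-nfs-2", CapacityGi: 20}}
	claim := &PVC{Name: "data-claim", RequestGi: 10}
	if bind(claim, volumes) {
		fmt.Println(claim.Name, "bound to", claim.BoundPV)
	}
}
```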

Besides the storage abstractions of Volume, PV, PVC, and StorageClass, other components in K8s reflect similar ideas of layered abstraction and dynamic binding.

When we program in OO languages, we naturally know that we define a class and then instantiate it to create objects. If a class is especially complex (with different implementations), we might use a factory pattern (or reflection, passing in the target type name) to create the objects. The K8s storage abstraction is a similar kind of decoupling, and in architecture design, even in table/schema design, we can likewise introduce the notions of class and instance. For example, the workflow definition in a workflow system can be regarded as a class or template, and every launched workflow is an instance of that workflow.

Conclusion

This article has taken a quick tour of the K8s architecture. I wonder whether you share my impression of how carefully K8s is designed for high availability, high reliability, and high extensibility: almost any operation is allowed to fail and yet eventually reach a consistent state, and almost any component is designed to be extended or replaced so that users can implement their own customized requirements.

If your business system is also a complex resource coordination system (K8s abstracts the resources relevant to operations; our business systems abstract other kinds of resources), then there is a lot to learn from the design philosophy of K8s. For example, if we were building a very complex process (workflow) engine, we could consider:

  • Abstract an interface for process executors and plug them into the system as plug-ins
  • Sort out and clearly enumerate the resources involved in the process
  • Store the desired outcome of process management declaratively in a database
  • Let all process management components read, write, and subscribe to changes through a unified API service
  • Implement the process management and control components as plug-in chains and responsibility chains
  • Use a unified gateway for authentication and authorization