This article reviews the relationship between K8s, the CRI (Container Runtime Interface), and containers. First, we need to understand the general workflow of K8s.
How K8s works
In K8s, there is a control plane, also known as the master node, which runs the apiserver, controller manager, kube-scheduler, kube-dns, and so on. When we create an application (a Deployment or StatefulSet), the main flow is as follows:
- The resource is submitted to the apiserver via the kubectl command, and the apiserver stores it in etcd.
- The controller manager picks up the newly created resource through a control loop and creates the Pod objects. Note that only the Pod objects are created at this point; no containers have been scheduled or created yet.
- Kube-scheduler also loops over newly created but unscheduled Pods and, after running a series of scheduling algorithms, binds each Pod to a node and updates the information in etcd. It does this by setting the nodeName field in the Pod spec.
- Kubelet watches for changes to all Pod objects. When it finds a Pod bound to a node, and that node is the kubelet's own, it takes over all subsequent work, including setting up the Pod network, creating containers, and so on.
- Kubelet calls the container runtime through CRI to create the containers in the Pod.
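The flow above can be sketched as a toy simulation. This is only an illustration of the hand-off between components, not how Kubernetes is implemented: the `Store` class stands in for the apiserver and etcd, round-robin placement replaces the real scheduling algorithms, and all class and function names are made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pod:
    name: str
    node_name: Optional[str] = None      # set by the scheduler (spec.nodeName)
    containers_started: bool = False     # set by the kubelet


@dataclass
class Store:
    """Stand-in for apiserver + etcd: every component reads and writes here."""
    deployments: Dict[str, int] = field(default_factory=dict)  # name -> replicas
    pods: Dict[str, Pod] = field(default_factory=dict)


def controller_manager_sync(store: Store) -> None:
    """Create missing Pod objects for each Deployment; nothing is scheduled yet."""
    for name, replicas in store.deployments.items():
        for i in range(replicas):
            pod_name = f"{name}-{i}"
            store.pods.setdefault(pod_name, Pod(pod_name))


def scheduler_sync(store: Store, nodes: List[str]) -> None:
    """Bind each unscheduled Pod to a node (round-robin replaces real scoring)."""
    unscheduled = [p for p in store.pods.values() if p.node_name is None]
    for i, pod in enumerate(unscheduled):
        pod.node_name = nodes[i % len(nodes)]


def kubelet_sync(store: Store, my_node: str) -> List[str]:
    """Start containers only for Pods bound to this kubelet's own node."""
    started = []
    for pod in store.pods.values():
        if pod.node_name == my_node and not pod.containers_started:
            pod.containers_started = True   # a real kubelet would call the CRI here
            started.append(pod.name)
    return started


store = Store()
store.deployments["web"] = 3            # kubectl apply -> apiserver -> etcd
controller_manager_sync(store)
scheduler_sync(store, ["node-a", "node-b"])
started_on_a = kubelet_sync(store, "node-a")
print(started_on_a)                     # Pods that landed on node-a
```

Each component only reads and writes the shared store, which mirrors the way the real components communicate through the apiserver rather than calling each other directly.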
In the figure below, containerd acts as the high-level runtime and calls runC to set up namespace isolation and cgroup resource limits.
Container Runtime Interface (CRI)
CRI was introduced in Kubernetes 1.5 and acts as a bridge between the kubelet and the container runtime. A high-level container runtime that wants to integrate with Kubernetes is expected to implement CRI. The runtime is responsible for managing images and supporting Kubernetes Pods, as well as managing individual containers. CRI has a single purpose: it describes, for Kubernetes, the operations a container runtime should provide and the parameters each operation takes.
CRI is a container-centric API. It was designed so that Pod information and Pod APIs are not exposed to container runtimes such as Docker. A Pod is always a Kubernetes orchestration concept and has nothing to do with the container itself, which is why the API has to be container-centric.
CRI sits between the kubelet and the container runtime. Common runtimes include:
- Docker: Docker has moved part of its functionality into containerd, so CRI can interact with containerd directly. Docker itself therefore does not need to support CRI (containerd already does).
- Containerd: containerd can use a shim to connect to different low-level runtimes. The containerd shim is described in more detail below.
- CRI-O: a lightweight runtime that supports runC and Clear Containers as low-level runtimes.
How CRI works
CRI consists of three interfaces: Sandbox, Container, and Image. It provides common operations for manipulating containers, such as create, delete, and list.
The Sandbox interface provides the environment in which containers run, including the Pod network. The Container interface covers container life-cycle operations. The Image interface covers image operations.
Kubelet calls the CRI interface over gRPC, first creating an environment called the PodSandbox. Once the PodSandbox is available, it goes on to call the Image and Container interfaces to pull the image and create the container. The shim translates these requests into the API of the specific runtime and carries out the corresponding operations on the low-level runtime.
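To make the call order concrete, here is a minimal in-memory sketch. The recorded call names (RunPodSandbox, PullImage, CreateContainer, StartContainer) are real CRI RPC names, but `FakeRuntime` and `sync_pod` are hypothetical stand-ins: there is no gRPC, no shim, and no real container involved, and the image names are placeholders.

```python
import itertools
from typing import Dict, List, Set


class FakeRuntime:
    """In-memory stand-in for a CRI runtime; records calls instead of doing work."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.calls: List[str] = []
        self.images: Set[str] = set()
        self.sandboxes: Set[str] = set()
        self.containers: Dict[str, str] = {}   # container id -> state

    # Sandbox operations
    def run_pod_sandbox(self, pod_name: str) -> str:
        sandbox_id = f"sb-{next(self._ids)}"
        self.sandboxes.add(sandbox_id)
        self.calls.append("RunPodSandbox")
        return sandbox_id

    # Image operations
    def pull_image(self, image: str) -> str:
        self.images.add(image)
        self.calls.append("PullImage")
        return image

    # Container operations
    def create_container(self, sandbox_id: str, image: str) -> str:
        if sandbox_id not in self.sandboxes:
            raise ValueError("sandbox must exist before containers are created")
        if image not in self.images:
            raise ValueError("image must be pulled first")
        container_id = f"c-{next(self._ids)}"
        self.containers[container_id] = "created"
        self.calls.append("CreateContainer")
        return container_id

    def start_container(self, container_id: str) -> None:
        self.containers[container_id] = "running"
        self.calls.append("StartContainer")


def sync_pod(runtime: FakeRuntime, pod_name: str, images: List[str]) -> List[str]:
    """Kubelet-side ordering: sandbox first, then image pull, create, start."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    container_ids = []
    for image in images:
        runtime.pull_image(image)
        cid = runtime.create_container(sandbox_id, image)
        runtime.start_container(cid)
        container_ids.append(cid)
    return container_ids


runtime = FakeRuntime()
ids = sync_pod(runtime, "web-0", ["nginx:1.25", "busybox:1.36"])
print(runtime.calls)
```

The sandbox is created once per Pod, while the image and container calls repeat for every container in it. In the actual CRI definition, the sandbox and container operations both belong to a single RuntimeService, with image operations in a separate ImageService.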
PodSandbox
What is the Sandbox? Both virtual-machine-based and container-based runtimes use cgroups for resource quotas, and both provide a conceptually isolated runtime environment; they differ in how the isolation is implemented. The sandbox is therefore the room K8s leaves for compatibility with different runtime environments: each low-level runtime can create a PodSandbox in whatever way suits its implementation. For Kata, the PodSandbox is a virtual machine; for Docker, it is a set of Linux namespaces.
Once the Pod Sandbox is set up, the kubelet can create user containers inside it. When removing a Pod, the kubelet stops and removes the containers first, and then stops and removes the Pod Sandbox. For namespace-based runtimes, once the sandbox is running, a new container simply joins the namespaces of the existing sandbox.
By default, the Pod Sandbox is a pause container. The defaultSandboxImage referenced in the kubelet code is the official gcr.io/google_containers/pause-amd64 image.
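The idea that a sandbox is whatever isolation unit the runtime provides, and that containers join it rather than creating their own, can be modelled roughly as follows. This is a conceptual sketch only: the `Sandbox` class, the `create_sandbox` function, the runtime name strings, and the `isolation_id` handles are all invented for illustration and do not correspond to real kubelet or runtime code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sandbox:
    pod_name: str
    kind: str                       # "namespaces" or "vm"
    isolation_id: str               # toy handle for the shared environment
    containers: List[str] = field(default_factory=list)

    def add_container(self, name: str) -> str:
        """A new container joins the sandbox's existing environment."""
        self.containers.append(name)
        return self.isolation_id


def create_sandbox(pod_name: str, runtime: str) -> Sandbox:
    """Pick the sandbox implementation from the (hypothetical) runtime name."""
    if runtime == "kata":
        return Sandbox(pod_name, "vm", f"vm:{pod_name}")
    return Sandbox(pod_name, "namespaces", f"ns:{pod_name}")


runc_sandbox = create_sandbox("web-0", "runc")
app_env = runc_sandbox.add_container("app")
sidecar_env = runc_sandbox.add_container("sidecar")
kata_sandbox = create_sandbox("web-1", "kata")

print(runc_sandbox.kind, kata_sandbox.kind)
print(app_env == sidecar_env)       # both containers share one environment
```

From the kubelet's point of view both sandboxes look the same; only the runtime knows whether the shared environment is a set of namespaces or a virtual machine.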
shim v2
Because many different low-level runtimes have been developed, such as Kata Containers and gVisor, their shim implementations differ, so an extra layer is needed between the CRI shim and the low-level runtime. Take Kata Containers as an example. Because its PodSandbox is implemented differently at the low-level runtime, the CRI shim cannot operate Kata Containers directly. What Kata Containers currently does is provide its own Kata shim, which translates CRI shim operations into Kata operations.
The problem with this is obvious. Each container starts its own shim for the integration, and adding a Kata shim means mapping its operations onto CRI, which carries a significant performance penalty. Ideally, a shim would be matched to each sandbox rather than to each container. This approach is also a burden for other CRI developers.
Shim v2 aims to add a layer between containerd and the OCI runtime that provides APIs each runtime can implement, so that different runtimes can be supported while containerd still controls state and abstracts the operations. For the specific proposal, see the issue: github.com/containerd/…
With shim v2, you can specify a shim for each Pod. When start is called on the sandbox you created, a shim is started. Subsequent API calls, namely the Container API calls from CRI described earlier, do not start another shim. In other words, a Pod starts only one shim, which handles the operations for all containers in that Pod.
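The difference in process count can be shown with a small model. This is a sketch under simplifying assumptions: it counts only user containers (the sandbox's own pause container is ignored), and `shims_per_container` and `ShimV2Manager` are hypothetical names rather than containerd APIs.

```python
from typing import Dict, List


def shims_per_container(pods: Dict[str, List[str]]) -> int:
    """Older model: one shim process for every container."""
    return sum(len(containers) for containers in pods.values())


class ShimV2Manager:
    """Toy model of the per-Pod shim: started with the sandbox, reused afterwards."""

    def __init__(self) -> None:
        self.shims: Dict[str, List[str]] = {}   # sandbox id -> containers it serves

    def start_sandbox(self, sandbox_id: str) -> None:
        # Starting the sandbox is the only call that launches a shim.
        self.shims[sandbox_id] = []

    def create_container(self, sandbox_id: str, container: str) -> None:
        # Container calls reuse the sandbox's shim; no new shim is started.
        self.shims[sandbox_id].append(container)


pods = {"web-0": ["app", "sidecar"], "db-0": ["postgres"]}

manager = ShimV2Manager()
for pod_name, containers in pods.items():
    manager.start_sandbox(pod_name)
    for container in containers:
        manager.create_container(pod_name, container)

print(shims_per_container(pods), len(manager.shims))   # prints: 3 2
```

With two Pods holding three containers in total, the per-container model needs three shims while the per-Pod model needs two, and the gap grows with the number of containers per Pod.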
Again, take Kata as an example. After switching to shim v2, the call flow is as follows:
When create and start operations are performed, they are all mapped to the concrete shim v2 implementation, regardless of how CRI itself is mapped and implemented.
If you found this useful, please follow my official account or check out my blog at packyzbq.coding.me. I post my learning notes there from time to time, so we can learn from and talk with each other.