Author | Senior R&D Engineer at Alibaba
1. Basic Knowledge
Background of storage snapshots
When using storage, we often need to snapshot online data and be able to restore it quickly, in order to improve the fault tolerance of data operations. Snapshots are also useful when online data needs to be copied or migrated quickly, for example for environment replication or data development. In K8s, storage snapshots are implemented through the CSI Snapshotter controller.
Storage snapshot user interface: Snapshot
As we know, K8s simplifies the use of storage through the PVC and PV system, and the design of storage snapshots is modeled on that PVC & PV design. To take a storage snapshot, the user declares a VolumeSnapshot object and specifies the corresponding VolumeSnapshotClass; related components in the cluster then dynamically generate the storage snapshot and its VolumeSnapshotContent. As shown in the figure below, the process of dynamically generating VolumeSnapshotContent is very similar to that of dynamically generating a PV.
Storage snapshot user interface: Restore
Once a storage snapshot exists, how do you quickly restore data from it? As shown below:
As shown above, a PVC object can specify a VolumeSnapshot object in its dataSource field. After the PVC is submitted, the related components in the cluster find the snapshot data pointed to by the dataSource, create the corresponding storage and PV object, and restore the snapshot data into the new PV, thus recovering the data.
Topology: meaning
First of all, let's understand what topology means here: the topology is a "location" relationship among the nodes managed by a K8s cluster, and a node declares which topology domain it belongs to through its node labels. Three kinds of topology are commonly encountered in practice:
- The first is region, a common concept when using cloud storage services. It is identified in K8s by the label failure-domain.beta.kubernetes.io/region, which marks which region each node of a single cross-region K8s cluster belongs to.
- The second is the widely used availability zone, identified in K8s by the label failure-domain.beta.kubernetes.io/zone. It marks which availability zone each node of a single cross-zone K8s cluster belongs to.
- The third is hostname, the single-machine dimension: the topology domain is the scope of one node. It is usually marked in K8s by the label kubernetes.io/hostname and will be covered in detail at the end of the article when we talk about Local PV.
The three topologies above are just the common ones; topology domains can also be user-defined. You define a string as the key of a topology domain, and the different values of that key represent different topology locations within the domain.
For example, you can define a topology domain at the rack dimension, i.e. the rack in a machine room. Machines on different racks are then marked with different topology locations: one set of machines on a rack can be labeled rack=rack1, and machines on another rack can be labeled rack=rack2, so the rack dimension can be used to distinguish where nodes are located in K8s. A sketch of such a label is shown below.
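As a minimal illustration (the node name and label values are hypothetical), a node carrying such a custom rack topology key might look like this fragment:

```yaml
# Fragment of a Node object with a custom "rack" topology label.
# Node name and label values are hypothetical examples.
apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    kubernetes.io/hostname: node1   # built-in single-machine topology
    rack: rack1                     # custom topology key/value for the rack domain
```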
Let’s take a look at the use of topology in K8s storage.
Background of storage topology scheduling
As we said earlier, storage and compute resources are managed separately in K8s, with storage managed through PVC and PV. A PV may carry an "access location" restriction, that is, it uses nodeAffinity to specify which nodes can access it. Why does this access location restriction exist?
Because the creation of a pod and the creation/binding of a PV in K8s can be considered parallel processes, there is no guarantee that the node the pod finally lands on can actually access the storage behind a PV with location restrictions, which causes the pod to fail to run. Here are two classic examples:
Let's take a look at the Local PV example first. A Local PV encapsulates local storage on a node as a PV, so that local storage is accessed through the PV interface. Why is Local PV needed at all? Simply put, the PV/PVC system was mainly designed for distributed storage, and distributed storage depends on the network. If a workload has very high I/O performance requirements, distributed storage may not be able to meet them; in that case local storage is used, which removes the network overhead and usually performs better. Local storage also has drawbacks: distributed storage can achieve high availability with multiple replicas, while with local storage the application itself has to implement multi-replica high availability, for example with a Raft-like protocol.
Now let's look at what might happen in the Local PV scenario if there is no "access location" restriction on the PV.
When the user submits the PVC, the K8s PV controller may bind it to a PV on node2. However, if the pod that uses this PVC is scheduled to node1, it will not be able to use the storage when it starts, because the PV it needs is actually the local storage on node2.
The second scenario (which also goes wrong if the "access location" of the PV is not restricted):
If the nodes managed by a K8s cluster are distributed across multiple availability zones in a single region, a dynamically created volume may end up in zone 2, while the pod that later uses it may be scheduled to zone 1 and therefore cannot use the storage. Cloud disks (block storage) such as Alibaba Cloud disks currently cannot be attached across availability zones, so if the storage is created in zone 2 but the pod runs in zone 1, the storage is unusable. This is the second common problem scenario.
Let's take a look at how K8s solves these problems with storage topology scheduling.
Storage Topology Scheduling
To summarize the two problems above: when a PV is bound to a PVC, or when a PV is dynamically created, it is not yet known which node the pod that uses it will be scheduled to, yet the use of the PV is restricted by the topology location of the node where the pod runs. In the Local PV scenario, the pod must be scheduled to the specified node to use that PV; in the second, cross-zone scenario, the pod must be scheduled to a node in the same availability zone as the PV to use the Alibaba Cloud disk service. So how does K8s solve this?
To put it simply, K8s delays both the binding of a PV to a PVC and the dynamic creation of a PV until the pod scheduling result is available. What good does that do?
- First, if the PV is pre-provisioned, as with Local PV, the PVC of the pod that will use it has not been bound yet, so the scheduler can take both the pod's compute resource requirements (such as CPU/memory) and the pod's PVC requirements into account: the selected node must satisfy the compute resource requirements, and it must also satisfy the nodeAffinity of the PV that can be bound to the pod's PVC.
- Second, for dynamically provisioned PVs, once the node the pod will run on is known, the PV can be created according to the topology information recorded on that node, which guarantees that the topology location of the newly created PV is consistent with that of the node. In the Alibaba Cloud disk example above, since we know the pod will run in zone 1, the disk is created in zone 1.
To implement the delayed binding and delayed creation of PVs described above, three components in K8s need changes:
- The PV controller, also known as the persistent volume controller, needs to support delayed binding.
- The second is the component that dynamically provisions PVs: once the pod scheduling result is available, it creates the PV based on the pod's topology information.
- The third and most important change is in kube-scheduler. When selecting a node for a pod, it must consider not only the pod's CPU/memory requirements but also its storage requirements. That is, for the pod's PVCs it checks whether the candidate node satisfies the nodeAffinity of the PVs that can match those PVCs; or, in the dynamic provisioning case, it checks whether the node satisfies the topology restrictions declared in the StorageClass. This ensures that the node the scheduler finally selects meets the topology restrictions of the storage itself.
That covers the basics of storage topology scheduling.
2. Use Case Interpretation
Let's walk through the basics from Part 1 with YAML use cases.
Volume Snapshot/Restore example
Here is how to use a storage snapshot. First, the cluster administrator creates a VolumeSnapshotClass object in the cluster. An important field in the VolumeSnapshotClass is the snapshotter, which specifies the volume plugin used to create storage snapshots; this volume plugin needs to be deployed in advance and will be discussed later.
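A minimal sketch of such a VolumeSnapshotClass, assuming the snapshot.storage.k8s.io/v1 API (older v1alpha1/v1beta1 clusters, like the one this article was written against, name the field snapshotter instead of driver); the driver name is a placeholder:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: disk-snapshotclass
# The CSI volume plugin that actually takes the snapshot;
# "diskplugin.csi.alibabacloud.com" is used here only as an example driver name.
driver: diskplugin.csi.alibabacloud.com
deletionPolicy: Delete
```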
Next, when the user wants to take an actual storage snapshot, they declare a VolumeSnapshot object. The VolumeSnapshot first needs to specify a VolumeSnapshotClassName, and the other very important field is source, which specifies where the snapshot data comes from. Here the source name is set to disk-pvc, meaning the snapshot is created from that PVC object. After the VolumeSnapshot object is submitted, the related components in the cluster find the PV storage corresponding to the PVC and take a snapshot of it.
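A sketch of the VolumeSnapshot described above, again assuming the v1 snapshot API; the object names follow the example in the text:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: disk-snapshot
spec:
  volumeSnapshotClassName: disk-snapshotclass   # class created by the administrator
  source:
    persistentVolumeClaimName: disk-pvc          # snapshot is taken from this PVC's PV
```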
Now that you have a snapshot, how do you recover data from it? Declare a new PVC object and, in the dataSource under its spec, declare which VolumeSnapshot the data comes from; here the disk-snapshot object is specified. After this PVC is submitted, the related components in the cluster dynamically generate a new PV whose data comes from the storage snapshot.
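A hedged sketch of such a restore PVC; the StorageClass name and requested size are hypothetical, only the dataSource wiring matters:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-disk        # hypothetical StorageClass for the new volume
  resources:
    requests:
      storage: 20Gi                 # should be at least the size of the snapshotted volume
  dataSource:                       # restore the data from the snapshot below
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: disk-snapshot
```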
Local PV example
Local PV YAML
A Local PV can only be accessed on the local node, so when declaring the Local PV object you use nodeAffinity in the PV to restrict access to a single node, thus placing a topology restriction on the PV. The topology key used here is kubernetes.io/hostname, meaning the PV can only be used on node1; a pod that wants to use this PV must therefore be scheduled to node1.
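A sketch of such a Local PV, with a hypothetical capacity and local path; the nodeAffinity pins it to node1 via kubernetes.io/hostname:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 10Gi                   # hypothetical size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage   # matches the no-provisioner StorageClass below
  local:
    path: /mnt/disks/ssd1           # hypothetical local disk path on node1
  nodeAffinity:                     # topology restriction: only node1 may use this PV
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1
```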
Why is a storageClassName needed when statically creating the PV? As mentioned earlier, for a Local PV to work properly it needs the delayed binding feature. With delayed binding, after the user submits the PVC, the PV controller must not bind it immediately even if there are matching PVs in the cluster, so we need a way to tell the PV controller to hold off. That is what the StorageClass is for here: its provisioner field is set to kubernetes.io/no-provisioner, which tells K8s it will not dynamically create PVs, and what really matters is its volumeBindingMode field set to WaitForFirstConsumer, which can simply be understood as delayed binding. A sketch follows below.
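A minimal sketch of the StorageClass this refers to; the name local-storage is a placeholder:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning for Local PV
volumeBindingMode: WaitForFirstConsumer     # delay binding until a pod is scheduled
```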
When the user submits the PVC, the PV controller sees it, finds that its StorageClass uses the delayed binding mode, and does nothing for the moment.
Later, when a pod that actually uses the PVC happens to be scheduled to a node that satisfies the PV's nodeAffinity, the PVC used by the pod is truly bound to the PV. This ensures that the PVC is bound only after the pod has been assigned to a node, and ultimately that the pod can access the Local PV. This is how the topology constraint of a PV is satisfied in the static provisioning scenario.
Dynamic provisioning PV topology restriction example
With dynamically provisioned PVs, how do you apply topology restrictions? Dynamic provisioning means the PV is created on demand, and we want the created PV to carry a topology location restriction. How is that specified?
In the StorageClass, you still need to set volumeBindingMode to WaitForFirstConsumer, i.e. delayed binding.
A second field of particular importance is allowedTopologies, which is where the restriction is declared. In the figure above you can see that the topology restriction is at the availability-zone level, which has two layers of meaning:
- The first is that the dynamically created PV must be accessible in this availability zone;
- The second is that when the scheduler sees a pod whose PVC uses exactly this StorageClass, it will only select nodes in this availability zone.
In short, we must ensure both that the dynamically created storage is accessible from this availability zone and that the scheduler places the pod on a node in this zone, so that the storage and the node running the pod that uses it end up in the same topology domain. When writing the PVC, the user does so exactly as before; the topology restrictions live in the StorageClass, as sketched below.
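A sketch of such a StorageClass, with a placeholder CSI provisioner name; the zone label key shown is the older failure-domain.beta.kubernetes.io/zone used in this article (newer clusters use topology.kubernetes.io/zone):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk
provisioner: diskplugin.csi.alibabacloud.com   # placeholder CSI driver name
volumeBindingMode: WaitForFirstConsumer        # delayed binding
allowedTopologies:                             # dynamically created PVs must live in this zone
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - cn-hangzhou-d
```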
3. Operation Demonstration
This section will demonstrate the previous content in an online environment.
First, take a look at the K8s cluster built on my Alibaba Cloud servers. There are three nodes in total: one master node and two worker nodes; the master node cannot schedule pods.
Next, I have deployed the plugins I need in advance: one is the snapshot plugin (csi-external-snapshotter*), the other is the dynamic cloud disk plugin (csi-disk*).
Now let's start the snapshot demo. First dynamically create a cloud disk, then take a snapshot of it. To dynamically create a cloud disk, you first create the StorageClass, then create the PVC from which the PV is dynamically provisioned, and then create a pod that uses it.
Let's look at snapshotclass.yaml first:
It specifies that the snapshots are created by the csi-external-snapshotter plugin deployed earlier.
Next, create the VolumeSnapshotClass, and then start taking the snapshot.
Then apply volumesnapshot.yaml to create the VolumeSnapshot:
Let's check whether the snapshot has been created. As shown below, the VolumeSnapshotContent was created 11 seconds ago.
Take a look at its content, mainly the information recorded in the VolumeSnapshotContent: after the snapshot is taken, it records the snapshot ID returned by the cloud storage vendor, as well as the snapshot data source, that is, the PVC specified earlier, through which the corresponding PV can be found.
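As a rough sketch of what such a dynamically created VolumeSnapshotContent can look like (object names, IDs and the driver are placeholders, and the exact schema depends on the snapshot API version):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-xxxxxxxx              # generated name (placeholder)
spec:
  deletionPolicy: Delete
  driver: diskplugin.csi.alibabacloud.com # placeholder CSI driver name
  source:
    volumeHandle: d-xxxxxxxx              # underlying volume ID (placeholder)
  volumeSnapshotRef:                      # points back to the bound VolumeSnapshot
    name: disk-snapshot
    namespace: default
status:
  readyToUse: true
  snapshotHandle: s-xxxxxxxx              # snapshot ID returned by the cloud vendor (placeholder)
```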
That is roughly the snapshot demo. Now delete the VolumeSnapshot just created, and you can see that the dynamically created VolumeSnapshotContent is deleted as well.
Next, let's look at the dynamic PV provisioning process with topology restrictions added. First create the StorageClass and look at the restrictions in it: its volumeBindingMode is set to WaitForFirstConsumer, i.e. delayed binding, and a topology restriction is applied through the allowedTopologies field, where I have configured an availability-zone-level restriction.
Then I try to create the PVC. Once created, it should theoretically stay in a pending state. Let's check: it is indeed waiting because of delayed binding; since no pod is using it yet, it can neither be bound nor trigger the dynamic creation of a new PV.
Next, create a pod that uses the PVC and see what happens. The pod is also pending.
One node cannot be scheduled because of a taint; that is the master. The other two nodes report that there is no PV available to bind.
But why is there no PV to bind on those two nodes? Isn't the PV supposed to be created dynamically?
Let's take a closer look at the topology restriction in the StorageClass. From the explanation above we know that PVs created with this StorageClass must be accessible in availability zone cn-hangzhou-d, and pods using this storage must also be scheduled to nodes in cn-hangzhou-d.
Check whether the nodes carry this topology information; if they don't, it certainly won't work.
Take a look at all the labels on the first node. The labels do contain such a topology key, but its value is cn-hangzhou-b, while the StorageClass restricts the zone to cn-hangzhou-d.
As a result, the pod cannot be scheduled on either of the two nodes. Now let's modify the topology restriction in the StorageClass, changing cn-hangzhou-d to cn-hangzhou-b.
This means that the dynamically created PV must be accessible in zone cn-hangzhou-b, and the pod using this storage must be scheduled to a node in cn-hangzhou-b. Delete the pod and let it be rescheduled, and let's see what happens. Now the pod has been scheduled and the container is starting.
That is, after the StorageClass is changed from cn-hangzhou-d to cn-hangzhou-b, the two nodes in the cluster now match the topology of the StorageClass, which guarantees that the pod has a node to be scheduled to. The pod in the figure above ends up Running, showing that the topology restriction works after the change.
4. Processing Flow
Kubernetes processing flow for Volume Snapshot/Restore
Next, take a look at the specific processing flow of storage snapshot and topology scheduling in K8s. As shown below:
Let's start with the storage snapshot flow and first explain the CSI part: storage extensions in K8s are recommended to be implemented out-of-tree in the form of CSI.
CSI-based storage extension consists of two main parts:
- The first part is the CSI controllers driven by the K8s community, namely the csi-snapshotter controller and the csi-provisioner controller; these are mostly generic controller components;
- The other part is the vendor-specific csi-plugin, also called the storage driver, implemented by each cloud storage vendor on top of its own OpenAPI.
The two parts communicate over a Unix domain socket, and together they form a complete storage extension.
As shown in the figure above, when the user submits a VolumeSnapshot object, it is watched by the csi-snapshotter controller, which then calls the csi-plugin over gRPC. The csi-plugin takes the storage snapshot through the vendor's OpenAPI; once the snapshot is created, the result is returned to the csi-snapshotter controller, which writes the snapshot information into a VolumeSnapshotContent object and binds it to the VolumeSnapshot submitted by the user, much like the binding between a PVC and a PV.
With a storage snapshot in hand, how do you use it to restore previous data? As mentioned earlier, you declare a new PVC object and specify a snapshot object in its dataSource. When the PVC is submitted, it is watched by the csi-provisioner, which creates the storage over gRPC. The difference from an ordinary creation is that it also specifies the snapshot ID, so when the storage is created at the cloud vendor an extra step restores the snapshot data into the newly created volume. The flow then returns to the csi-provisioner, which writes the information about the new volume into a new PV object. The PV controller watches that PV and binds it to the user-submitted PVC, and the pod can then use the restored data through the PVC. This is the storage snapshot flow in K8s.
Kubernetes processing flow for Volume topology-aware Scheduling
Next, take a look at the processing flow of storage topology scheduling:
The first step is declaring delayed binding, which is done through the StorageClass as described above and is not repeated here.
In the figure above, the parts in red are the storage topology scheduling logic newly added to the scheduler. Let's first look at the general flow of the scheduler selecting a node for a pod without the red parts:
- First, after the user submits the pod, it is watched by the scheduler, which performs pre-selection (predicates) first: it matches every node in the cluster against the resources the pod needs.
- If a node matches, it becomes a candidate; of course there can be more than one, so a batch of candidate nodes is selected.
- Then comes the second stage, scoring (priorities), which scores the candidate nodes and finds the best matching node.
- The scheduler then writes the scheduling result into the pod's spec.nodeName field; the kubelet on that node watches the pod and the pod creation process begins.
Now, how does volume-related scheduling work during node filtering (step 2)?
- The first step is to find all the PVCs used by the pod: those that are already bound and those that require delayed binding.
- For bound PVCs, check whether the nodeAffinity of their corresponding PVs matches the topology of the current node. If it does not match, the node cannot be scheduled; if it matches, move on to the PVCs that require delayed binding;
- For PVCs that require delayed binding, first list all PVs in the cluster that satisfy the PVC, then match them one by one against the topology in the current node's labels. If none of the existing PVs match, the existing PVs cannot satisfy the need and a PV has to be created dynamically, so check whether the current node satisfies the topology restriction declared in the StorageClass: if that restriction matches the topology in the node's labels, the node is usable; if not, the node cannot be scheduled.
After the above steps, all nodes that satisfy both the pod's compute resource requirements and its storage requirements have been found. The scheduler then performs the third step, the scoring stage, internally. After pre-selection and scoring, the pod's chosen node and some PV/PVC cache information maintained by the scheduler are updated.
The fourth and important step: now that the node for the pod has been selected, regardless of whether its PVCs are to be bound to existing PVs or PVs are to be created dynamically, the scheduler triggers the follow-up work by updating the information in the PVC and PV objects, which in turn triggers the PV controller to do the binding, or the csi-provisioner to do the dynamic creation.
Conclusion
- By analogy with the PVC & PV system, this article explained the K8s resource objects for storage snapshots and how they are used.
- It introduced the need for the storage topology scheduling feature through problems encountered in two real scenarios, and explained how K8s solves these problems with topology scheduling.
- By analyzing the internal mechanisms of storage snapshots and storage topology scheduling in K8s, you can gain a deeper understanding of how these features work.