Author: HuiZhi | Source: Alibaba Cloud Native official account
In the article "Read K8s persistent storage process", we focused on the internal storage flow of K8s and the call relationships among PV, PVC, StorageClass, Kubelet, and so on. This article focuses on CSI (Container Storage Interface): what CSI is and how it works internally.
Background
K8s natively supports several PV storage types, such as iSCSI, NFS, CephFS, and more (see link). The code for these in-tree storage drivers lives in the Kubernetes repository, which strongly couples K8s code to third-party storage vendors' code:
- To change the in-tree storage code, the user must update the K8s component, which is costly
- Bugs in the In-Tree storage code cause K8s components to become unstable
- The K8s community is responsible for maintaining and testing in-tree storage functions
- The In-Tree storage plug-in enjoys the same privileges as the K8s core components, causing security risks
- Third-party storage developers must follow the rules of the K8s community to develop in-tree storage code
The CSI container storage interface standard solves the above problems by decoupling the third-party storage code from the K8s code, so that third-party storage developers only need to implement CSI (regardless of whether the container platform is K8s or Swarm, etc.).
Introduction to CSI core process
Before going into the details of the CSI components and their interfaces, let's take a look at the CSI storage flow in K8s. A Pod goes through three stages when it mounts a storage volume: Provision/Delete, Attach/Detach, and Mount/Unmount. The following sections walk through how K8s uses CSI in each of these three phases.
1. Provisioning Volumes
1. The cluster administrator creates a StorageClass resource. The StorageClass contains the CSI plug-in name (provisioner: pangu.csi.alibabacloud.com) and the required storage parameters (parameters: type=cloud_ssd). The sc.yaml file is as follows:
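The original sc.yaml is not reproduced in this copy; a minimal sketch consistent with the provisioner and parameter quoted above and with the csi-pangu StorageClass shown later in this article (the reclaimPolicy value is an assumption) would be:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-pangu
provisioner: pangu.csi.alibabacloud.com
parameters:
  type: cloud_ssd
reclaimPolicy: Delete
volumeBindingMode: Immediate
```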
2. The user creates a PersistentVolumeClaim resource. The PVC specifies the storage size and the StorageClass (as shown above). The pvc.yaml file is as follows:
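The original pvc.yaml is likewise missing here; a hedged sketch (the PVC name and the 40Gi size are assumptions, the size matching the block-device example later in this article):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pangu-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi
  storageClassName: csi-pangu
```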
3. The volume controller (PersistentVolumeController) observes that the newly created PVC has no matching PV in the cluster and that its storage type is out-of-tree, so it adds an annotation to the PVC: volume.beta.kubernetes.io/storage-provisioner=[out-of-tree CSI plug-in name] (in this case pangu.csi.alibabacloud.com).
4. The External Provisioner component observes that the PVC contains the annotation volume.beta.kubernetes.io/storage-provisioner and that its value is its own plug-in name, and starts the provisioning process:
- Get the relevant StorageClass resource and get the parameter (in this case type=cloud_ssd) from it for later CSI function calls.
- Call the CreateVolume function of the external CSI plug-in through the Unix Domain socket.
5. The external CSI plug-in returns success, indicating that the volume has been created; the External Provisioner component then creates a PersistentVolume resource in the cluster.
6. The volume controller binds the PV and PVC.
2. Attaching Volumes
1. When the AD controller (AttachDetachController) observes that a Pod using a CSI PV has been scheduled to a node, it calls the Attach function of the internal in-tree CSI plugin (csiAttacher).
2. The internal in-tree CSI plugin (csiAttacher) creates a VolumeAttachment object in the cluster.
3. The External Attacher observes the VolumeAttachment object and calls the ControllerPublish function of the external CSI plug-in to attach the volume to the corresponding node. After the external CSI plug-in attaches successfully, the External Attacher updates .status.attached of the related VolumeAttachment object to true.
4. The in-tree CSI plugin (csiAttacher) inside the AD controller observes that .status.attached of the VolumeAttachment object is true, and updates the AD controller's internal state (ActualStateOfWorld), which is reflected in .status.volumesAttached of the Node resource.
3. Mounting Volumes
1. The Volume Manager (a Kubelet component) observes that a new Pod using a CSI-type PV has been scheduled to this node and calls the WaitForAttach function of the internal in-tree CSI plugin (csiAttacher).
2. The internal in-tree CSI plugin (csiAttacher) waits for .status.attached of the VolumeAttachment object in the cluster to become true.
3. The in-tree CSI plugin (csiAttacher) calls the MountDevice function, which internally calls the NodeStageVolume function of the external CSI plug-in through the Unix domain socket; the plugin then calls the SetUp function of the internal in-tree CSI plugin (csiMountMgr), which internally calls the NodePublishVolume function of the external CSI plug-in through the Unix domain socket.
4. Unmounting Volumes
1. The user deletes the related Pod.
2. The Volume Manager (a Kubelet component) observes that the Pod containing the CSI volume has been deleted and calls the TearDown function of the internal in-tree CSI plugin (csiMountMgr), which internally calls the NodeUnpublishVolume function of the external CSI plug-in through the Unix domain socket.
3. The Volume Manager (a Kubelet component) calls the UnmountDevice function of the internal in-tree CSI plugin (csiAttacher), which internally calls the NodeUnstageVolume function of the external CSI plug-in through the Unix domain socket.
5. Detaching Volumes
1. When the AD controller observes that the Pod containing the CSI volume has been deleted, it calls the Detach function of the internal in-tree CSI plugin (csiAttacher).
2. csiAttacher deletes the related VolumeAttachment object in the cluster (the object is not removed immediately because of its finalizer).
3. The External Attacher observes that the DeletionTimestamp of the VolumeAttachment object is not empty and calls the ControllerUnpublish function of the external CSI plug-in to detach the volume from the corresponding node. After the external CSI plug-in detaches successfully, the External Attacher removes the finalizer of the VolumeAttachment object, and the object is then actually deleted.
4. The in-tree CSI plugin (csiAttacher) inside the AD controller observes that the VolumeAttachment object has been deleted and updates the AD controller's internal state; the AD controller also updates the Node resource so that .status.volumesAttached no longer includes the volume.
6. Deleting Volumes
1. The user deletes the related PVC.
2. The External Provisioner component observes the PVC deletion event and performs different operations according to the PVC's reclaim policy:
- Delete: call the DeleteVolume function of the external CSI plug-in to delete the volume; once the volume is deleted successfully, the Provisioner deletes the corresponding PV object from the cluster.
- Retain: the Provisioner does not delete the volume.
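The reclaim-policy branch can be sketched as follows; the function and callback names are illustrative stand-ins, not the actual external-provisioner code:

```go
package main

import "fmt"

// ReclaimPolicy mirrors the PV reclaim policy values discussed above.
type ReclaimPolicy string

const (
	ReclaimDelete ReclaimPolicy = "Delete"
	ReclaimRetain ReclaimPolicy = "Retain"
)

// handlePVCDeletion sketches what the External Provisioner does when a PVC
// bound to a dynamically provisioned PV is deleted. deleteVolume stands in
// for the CSI DeleteVolume call; deletePV for removing the PV object.
func handlePVCDeletion(policy ReclaimPolicy, deleteVolume, deletePV func() error) error {
	switch policy {
	case ReclaimDelete:
		// Delete the backing volume first; only remove the PV object on success.
		if err := deleteVolume(); err != nil {
			return fmt.Errorf("delete volume: %w", err)
		}
		return deletePV()
	case ReclaimRetain:
		// Retain: leave both the volume and the PV object untouched.
		return nil
	default:
		return fmt.Errorf("unknown reclaim policy %q", policy)
	}
}
```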
CSI Sidecar component introduction
To adapt K8s to the CSI standard, the community has extracted the K8s-side storage flow logic into the CSI Sidecar components.
1. Node Driver Registrar
1) function
The node-driver-registrar component registers the external CSI plug-in with Kubelet, so that Kubelet can call external CSI plug-in functions through a specific Unix domain socket (Kubelet calls NodeGetInfo, NodeStageVolume, NodePublishVolume, NodeGetVolumeStats, etc.).
2) principle
The node-driver-registrar component registers the plug-in through Kubelet's plugin registration mechanism. After registration:

- Kubelet annotates the Node resource: Kubelet calls the NodeGetInfo function of the external CSI plug-in, and the returned [nodeID] and [driverName] become the value of the annotation key csi.volume.kubernetes.io/nodeid.
- Kubelet updates the Node labels: the [AccessibleTopology] value returned by NodeGetInfo is applied to the Node labels.
- Kubelet updates the Node status: maxAttachLimit (the maximum number of volumes that can be attached to the node) is written to status.allocatable as attachable-volumes-csi-[driverName]=[maxAttachLimit].
- Kubelet updates the CSINode resource: [driverName], [nodeID], [maxAttachLimit], and [AccessibleTopology] are written to its spec (only the keys of the topology are kept).
2. External Provisioner
1) function
Create/delete the actual storage volume and the PV resource that represents it.
2) principle
External-provisioner takes the --provisioner parameter at startup, which specifies the provisioner name and corresponds to the provisioner field in the StorageClass.
On startup, external-provisioner watches the PVC and PV resources in the cluster.
For PVC resources in the cluster:
- Determine whether a storage volume needs to be dynamically created for the PVC. The criteria are:
  - The PVC contains the annotation key volume.beta.kubernetes.io/storage-provisioner (added by the volume controller), and its value equals the provisioner name.
  - If the VolumeBindingMode of the PVC's StorageClass is WaitForFirstConsumer, the PVC must contain the annotation key volume.kubernetes.io/selected-node with a non-empty value (see how the scheduler handles WaitForFirstConsumer); if it is Immediate, the provisioner provides the volume immediately.
- Call the CreateVolume function of the external CSI plug-in through a specific Unix domain socket.
- Create a PV resource named [provisioner PV prefix]-[PVC UUID].
For PV resources in the cluster:
- The criteria for determining whether a PV needs to be deleted are:
  - Its .status.phase is Released.
  - Its .spec.persistentVolumeReclaimPolicy is Delete.
  - It contains the annotation pv.kubernetes.io/provisioned-by, whose value is the provisioner's own name.
- Call the DeleteVolume interface of the external CSI plug-in through a specific Unix domain socket.
- Delete the PV resource from the cluster.
3. External Attacher
1) function
Attach or detach the storage volume.
2) principle
External-attacher internally watches the VolumeAttachment and PersistentVolume resources in the cluster.
For VolumeAttachment resources:
- Obtain all relevant information, such as the volume ID, node ID, and attach Secret, from the VolumeAttachment resource and the PV it references.
- If the DeletionTimestamp field of the VolumeAttachment is empty, the volume needs to be attached: call the ControllerPublishVolume interface of the external CSI plug-in through a specific Unix domain socket. If it is not empty, the volume needs to be detached: call the ControllerUnpublishVolume interface of the external CSI plug-in through a specific Unix domain socket.
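The core decision of external-attacher can be sketched like this; the VolumeAttachment struct below is a simplified stand-in, not the real API type:

```go
package main

// VolumeAttachment is a simplified stand-in for the K8s object watched
// by external-attacher.
type VolumeAttachment struct {
	DeletionTimestamp *string // nil while the object is live
	Attached          bool    // mirrors .status.attached
}

// reconcile sketches external-attacher's core decision: a live object needs
// ControllerPublishVolume (attach); one marked for deletion needs
// ControllerUnpublishVolume (detach). publish/unpublish stand in for the
// CSI calls made over the Unix domain socket.
func reconcile(va *VolumeAttachment, publish, unpublish func() error) error {
	if va.DeletionTimestamp == nil {
		if va.Attached {
			return nil // already attached, nothing to do
		}
		if err := publish(); err != nil {
			return err
		}
		va.Attached = true // external-attacher then patches .status.attached
		return nil
	}
	// Deletion requested: detach, after which the finalizer can be removed.
	return unpublish()
}
```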
For the PersistentVolume resource:
- Add the finalizer external-attacher/[driver name] to the PV.
- If the PV is being deleted (DeletionTimestamp is not empty), remove the finalizer external-attacher/[driver name].
4. External Resizer
1) function
Expand a storage volume.
2) principle
External-resizer internally watches the PersistentVolumeClaim resources in the cluster.
For the PersistentVolumeClaim resource:
- Determine whether the PersistentVolumeClaim needs to be expanded: the PVC must be in the Bound state and its .status.capacity must be smaller than .spec.resources.requests.storage.
- Update the PVC's .status.conditions to indicate that it is in the Resizing state.
- Call the ControllerExpandVolume interface of the external CSI plug-in through a specific Unix domain socket.
- Update the PV's .spec.capacity.
- If the CSI plug-in supports online file system expansion, the ControllerExpandVolume interface returns NodeExpansionRequired=true, and external-resizer updates the PVC's .status.conditions to the FileSystemResizePending state; if it is not supported, the expansion is complete: external-resizer clears the PVC's .status.conditions and updates the PVC's .status.capacity.

The Volume Manager (a Kubelet component) then observes that the storage volume needs to be expanded online and calls the NodeExpandVolume interface of the external CSI plug-in through a specific Unix domain socket to expand the file system.
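The resize flow above can be sketched as follows, with capacities as plain byte counts instead of resource.Quantity values; the function and parameter names are illustrative, not external-resizer's actual code:

```go
package main

// expandPVC sketches external-resizer's decision flow. controllerExpand
// stands in for the CSI ControllerExpandVolume call and reports whether a
// node-side file system expansion is still required. The returned condition
// mirrors the PVC .status.conditions state set by external-resizer.
func expandPVC(statusCapacity, requested int64,
	controllerExpand func(int64) (nodeExpansionRequired bool, err error)) (condition string, err error) {
	if requested <= statusCapacity {
		return "", nil // nothing to do: only expansion is supported, never shrinking
	}
	nodeExpand, err := controllerExpand(requested)
	if err != nil {
		return "Resizing", err // still resizing; the call will be retried
	}
	if nodeExpand {
		// File system expansion happens later on the node (NodeExpandVolume).
		return "FileSystemResizePending", nil
	}
	return "", nil // done: .status.capacity is updated, conditions cleared
}
```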
5. livenessprobe
1) function
Check that the CSI plug-in works properly.
2) principle
It exposes a /healthz HTTP endpoint to serve Kubelet's liveness probe; internally it calls the Probe interface of the external CSI plug-in through a specific Unix domain socket.
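A minimal sketch of such a health endpoint, assuming a probe callback that stands in for the gRPC Probe call (the real livenessprobe sidecar talks CSI over the Unix domain socket; all names here are illustrative):

```go
package main

import (
	"errors"
	"net/http"
	"net/http/httptest"
)

// probeFunc stands in for calling the CSI plug-in's Probe RPC.
type probeFunc func() error

var errDown = errors.New("driver not responding")

// healthzHandler sketches the /healthz endpoint served to kubelet's
// liveness probe: 200 when Probe succeeds, 500 otherwise.
func healthzHandler(probe probeFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := probe(); err != nil {
			http.Error(w, "driver unhealthy: "+err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	}
}

// serveOnce exercises the handler in-process (in place of a real listener)
// and returns the HTTP status code.
func serveOnce(probe probeFunc) int {
	rec := httptest.NewRecorder()
	healthzHandler(probe)(rec, httptest.NewRequest("GET", "/healthz", nil))
	return rec.Code
}
```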
CSI Interface Introduction
Third-party storage vendors need to implement the three gRPC services of the CSI plug-in: IdentityServer, ControllerServer, and NodeServer.
1. IdentityServer
IdentityServer exposes the identity and basic information of the CSI plug-in.
```go
// IdentityServer is the server API for Identity service.
type IdentityServer interface {
	// Get CSI plug-in information, such as name and version
	GetPluginInfo(context.Context, *GetPluginInfoRequest) (*GetPluginInfoResponse, error)
	// Get the capabilities of the CSI plug-in,
	// for example whether it provides the Controller service
	GetPluginCapabilities(context.Context, *GetPluginCapabilitiesRequest) (*GetPluginCapabilitiesResponse, error)
	// Health check of the CSI plug-in
	Probe(context.Context, *ProbeRequest) (*ProbeResponse, error)
}
```
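For illustration, a toy identity service with simplified local stand-in types; a real driver implements the generated gRPC interface above (from the CSI spec package) and registers it with its gRPC server:

```go
package main

// pluginInfo and identityServer are simplified stand-ins, not the CSI spec types.
type pluginInfo struct {
	Name    string
	Version string
}

type identityServer struct {
	info pluginInfo
}

// GetPluginInfo reports the driver name, which StorageClass.provisioner and
// CSIDriver.metadata.name must match.
func (s *identityServer) GetPluginInfo() pluginInfo { return s.info }

// GetPluginCapabilities reports whether the Controller service is offered;
// node-only drivers report false so the controller sidecars skip them.
func (s *identityServer) GetPluginCapabilities() (controllerService bool) { return true }

// Probe is the health check wired to the livenessprobe sidecar.
func (s *identityServer) Probe() error { return nil }
```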
2. ControllerServer
ControllerServer is responsible for creating/deleting volumes, attaching/detaching volumes, and creating/deleting snapshots.
```go
// ControllerServer is the server API for Controller service.
type ControllerServer interface {
	// Create a storage volume
	CreateVolume(context.Context, *CreateVolumeRequest) (*CreateVolumeResponse, error)
	// Delete a storage volume
	DeleteVolume(context.Context, *DeleteVolumeRequest) (*DeleteVolumeResponse, error)
	// Attach a storage volume to a specific node
	ControllerPublishVolume(context.Context, *ControllerPublishVolumeRequest) (*ControllerPublishVolumeResponse, error)
	// Detach a storage volume from a specific node
	ControllerUnpublishVolume(context.Context, *ControllerUnpublishVolumeRequest) (*ControllerUnpublishVolumeResponse, error)
	// Verify whether the volume capabilities meet the requirements,
	// for example whether cross-node read/write is supported
	ValidateVolumeCapabilities(context.Context, *ValidateVolumeCapabilitiesRequest) (*ValidateVolumeCapabilitiesResponse, error)
	// List all storage volumes
	ListVolumes(context.Context, *ListVolumesRequest) (*ListVolumesResponse, error)
	// Get the capacity of the storage pool
	GetCapacity(context.Context, *GetCapacityRequest) (*GetCapacityResponse, error)
	// Get the capabilities supported by the ControllerServer,
	// for example whether snapshots are supported
	ControllerGetCapabilities(context.Context, *ControllerGetCapabilitiesRequest) (*ControllerGetCapabilitiesResponse, error)
	// Create a snapshot
	CreateSnapshot(context.Context, *CreateSnapshotRequest) (*CreateSnapshotResponse, error)
	// Delete a snapshot
	DeleteSnapshot(context.Context, *DeleteSnapshotRequest) (*DeleteSnapshotResponse, error)
	// List all snapshots
	ListSnapshots(context.Context, *ListSnapshotsRequest) (*ListSnapshotsResponse, error)
	// Expand a storage volume
	ControllerExpandVolume(context.Context, *ControllerExpandVolumeRequest) (*ControllerExpandVolumeResponse, error)
}
```
3. NodeServer
NodeServer is responsible for mounting and unmounting storage volumes.
```go
// NodeServer is the server API for Node service.
type NodeServer interface {
	// Format the storage volume and mount it to the temporary global (staging) directory
	NodeStageVolume(context.Context, *NodeStageVolumeRequest) (*NodeStageVolumeResponse, error)
	// Unmount the storage volume from the temporary global directory
	NodeUnstageVolume(context.Context, *NodeUnstageVolumeRequest) (*NodeUnstageVolumeResponse, error)
	// Bind-mount the storage volume from the global directory to the target (Pod) directory
	NodePublishVolume(context.Context, *NodePublishVolumeRequest) (*NodePublishVolumeResponse, error)
	// Unmount the storage volume from the target directory
	NodeUnpublishVolume(context.Context, *NodeUnpublishVolumeRequest) (*NodeUnpublishVolumeResponse, error)
	// Get capacity information of the storage volume
	NodeGetVolumeStats(context.Context, *NodeGetVolumeStatsRequest) (*NodeGetVolumeStatsResponse, error)
	// Expand the storage volume
	NodeExpandVolume(context.Context, *NodeExpandVolumeRequest) (*NodeExpandVolumeResponse, error)
	// Get the capabilities supported by the NodeServer,
	// for example whether volume stats are supported
	NodeGetCapabilities(context.Context, *NodeGetCapabilitiesRequest) (*NodeGetCapabilitiesResponse, error)
	// Get node information, such as node ID and topology
	NodeGetInfo(context.Context, *NodeGetInfoRequest) (*NodeGetInfoResponse, error)
}
```
K8s CSI API object
K8s supports the CSI standard and contains the following API objects:
- CSINode
- CSIDriver
- VolumeAttachment
1. CSINode
```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSINode
metadata:
  name: node-10.212.101.210
spec:
  drivers:
  - name: yodaplugin.csi.alibabacloud.com
    nodeID: node-10.212.101.210
    topologyKeys:
    - kubernetes.io/hostname
  - name: pangu.csi.alibabacloud.com
    nodeID: a5441fd9013042ee8104a674e4a9666a
    topologyKeys:
    - topology.pangu.csi.alibabacloud.com/zone
```
Function:

- Determine whether the external CSI plug-in is registered successfully. The Node Driver Registrar component creates this resource after registering with Kubelet, so there is no need to create the CSINode resource explicitly.
- Map the Node resource name in Kubernetes to the node name (nodeID) in the third-party storage system. Kubelet calls the NodeGetInfo function of the external CSI plug-in's NodeServer to obtain the nodeID.
- Display volume topology information. The topologyKeys of CSINode represent the topology of the storage node; topology information enables the scheduler to select an appropriate node when scheduling a Pod.
2. CSIDriver
```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSIDriver
metadata:
  name: pangu.csi.alibabacloud.com
spec:
  # Whether the CSI plug-in requires the volume attach operation
  attachRequired: true
  # Whether the CSI plug-in needs Pod information during the Mount phase
  podInfoOnMount: true
  # Volume lifecycle modes supported by the CSI plug-in
  volumeLifecycleModes:
  - Persistent
```
Function:

- Simplify the discovery of external CSI plug-ins. The resource is created by the cluster administrator; kubectl get csidriver shows which CSI plug-ins exist in the environment.
- Customize Kubernetes behavior. For example, if an external CSI plug-in does not need the volume attach operation, set .spec.attachRequired to false.
3. VolumeAttachment
```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: 21481ae252a2457f9abcb86a3d02ba05
  finalizers:
  - external-attacher/pangu-csi-alibabacloud-com
  name: csi-0996e5e9459e1ccc1b3a7aba07df4ef7301c8e283d99eabc1b69626b119ce750
spec:
  attacher: pangu.csi.alibabacloud.com
  nodeName: node-10.212.101.241
  source:
    persistentVolumeName: pangu-39aa24e7-8877-11eb-b02f-021234350de1
status:
  attached: true
```
Function: VolumeAttachment records information about volumes and nodes.
Supported features
1. Topology support
StorageClass has an allowedTopologies field:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-pangu
provisioner: pangu.csi.alibabacloud.com
parameters:
  type: cloud_ssd
volumeBindingMode: Immediate
allowedTopologies:
- matchLabelExpressions:
  - key: topology.pangu.csi.alibabacloud.com/zone
    values:
    - zone-1
    - zone-2
```
After the external CSI plug-in is deployed, each Node is labeled with the [AccessibleTopology] value returned by the NodeGetInfo function (see the Node Driver Registrar section).
Before calling the CreateVolume interface of the CSI plug-in, the External Provisioner sets AccessibilityRequirements in the request parameters:
- For WaitForFirstConsumer: when the PVC annotation contains volume.kubernetes.io/selected-node with a non-empty value, first obtain the TopologyKeys of the corresponding node's CSINode, then obtain the values from the labels of the Node resource according to those keys, and finally compare them with the AllowedTopologies of the StorageClass to check whether they are included. If not, an error is reported.
- For Immediate: fill in the values of the StorageClass AllowedTopologies. If the StorageClass does not set AllowedTopologies, fill in the values of all nodes that contain the TopologyKeys.
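The allowedTopologies comparison described above can be sketched as a label match; topologyTerm below is a simplified stand-in for the matchLabelExpressions structure:

```go
package main

// topologyTerm maps a topology key to its allowed values, a simplified
// stand-in for one allowedTopologies matchLabelExpressions entry.
type topologyTerm map[string][]string

// topologyAllowed reports whether a node's labels (obtained via the CSINode
// topology keys) satisfy at least one allowedTopologies term: for that term,
// every key must be present on the node with one of the allowed values.
func topologyAllowed(nodeLabels map[string]string, allowed []topologyTerm) bool {
	if len(allowed) == 0 {
		return true // no restriction configured
	}
	for _, term := range allowed {
		ok := true
		for key, values := range term {
			v, present := nodeLabels[key]
			if !present || !contains(values, v) {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func contains(list []string, s string) bool {
	for _, x := range list {
		if x == s {
			return true
		}
	}
	return false
}
```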
How does the scheduler handle Pods that use storage volumes?
(Based on the community v1.18 scheduler.)
The scheduling process of the scheduler mainly includes the following three steps:
- Filter: filter the nodes that meet the Pod's scheduling requirements.
- Score: score the filtered nodes; the node with the highest score is selected.
- Bind: the scheduler reports the scheduling result to kube-apiserver, updating the Pod's .spec.nodeName field.
Scheduler pre-selection (Filter) phase: it handles the PVC/PV binding relationships of the Pod and dynamic provisioning of PVs, and makes scheduling take into account the node affinity of the PVs used by the Pod. The detailed process is as follows:
- If the Pod contains no PVC, skip directly.
- FindPodVolumes
  - Obtain the Pod's boundClaims, claimsToBind, and unboundClaimsImmediate.
    - boundClaims: bound PVCs
    - claimsToBind: PVCs whose StorageClass VolumeBindingMode is VolumeBindingWaitForFirstConsumer
    - unboundClaimsImmediate: PVCs whose StorageClass VolumeBindingMode is VolumeBindingImmediate
  - If len(unboundClaimsImmediate) is not empty, an error is reported: such PVCs should already have been bound by the volume controller.
  - If len(boundClaims) is not empty, check whether the node affinity of the PVs bound to those PVCs conflicts with the labels of the current node; if so, an error is reported (this checks the topology of Immediate-type PVs).
  - If len(claimsToBind) is not empty:
    - Check whether an existing PV matches the PVC; record matched PVs in the scheduler cache.
    - PVCs with no matching PV go through the dynamic provisioning process: the AllowedTopologies field of the StorageClass is used to determine whether the current candidate node meets the topology requirements (for WaitForFirstConsumer PVCs).

The scheduler Score phase is not discussed here.
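The bucketing in FindPodVolumes can be sketched as follows; the pvc struct and its fields are simplified stand-ins, not the real Kubernetes types:

```go
package main

// pvc is a simplified stand-in for a PersistentVolumeClaim as seen by the
// scheduler's volume binder.
type pvc struct {
	Name        string
	Bound       bool   // has a bound PV
	BindingMode string // from its StorageClass: "Immediate" or "WaitForFirstConsumer"
}

// classifyPodPVCs sketches the FindPodVolumes step: each PVC of the Pod
// falls into exactly one of the three buckets described above.
func classifyPodPVCs(pvcs []pvc) (boundClaims, claimsToBind, unboundClaimsImmediate []string) {
	for _, c := range pvcs {
		switch {
		case c.Bound:
			boundClaims = append(boundClaims, c.Name)
		case c.BindingMode == "WaitForFirstConsumer":
			claimsToBind = append(claimsToBind, c.Name)
		default:
			// Unbound with Immediate mode: it should already have been
			// bound by the volume controller, so scheduling errors out.
			unboundClaimsImmediate = append(unboundClaimsImmediate, c.Name)
		}
	}
	return
}
```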
Scheduler Assume phase

The scheduler first assumes the PV/PVC, then assumes the Pod.

- Deep-copy the Pod to be scheduled.
- AssumePodVolumes (for WaitForFirstConsumer PVCs)
  - For PVCs matched to an existing PV in the scheduler cache, set the annotation pv.kubernetes.io/bound-by-controller="yes".
  - For PVCs that did not match a PV in the scheduler cache, set the annotation volume.kubernetes.io/selected-node=[selected node].
- Assume the Pod
  - In the scheduler cache, set the Pod's .spec.nodeName to [selected node].
Scheduler Bind phase

BindPodVolumes:

- Call the Kubernetes API to update the PV/PVC resources in the cluster to match the PV/PVC in the scheduler cache.
- Check the PV/PVC status:
  - Check that all PVCs are in the Bound state.
  - Check that the NodeAffinity of all PVs does not conflict with the node's labels.
- The scheduler performs Bind: call the Kubernetes API to update the Pod's .spec.nodeName field.
2. Expand the storage volume
Storage volume expansion was covered in the External Resizer section, so it is not repeated here. Users only need to edit the PVC's .spec.resources.requests.storage field; note that volumes can only be expanded, not shrunk.
If expansion fails, the PVC cannot simply be edited back to the original size (capacity can only be increased). For how to recover from a failed expansion, see the method provided on the K8s website: kubernetes.io/docs/concep…
3. Limit the number of volumes on a node
The volume limitation is mentioned in the Node Driver Registrar section so I won’t go over it again.
4. Monitor the storage volume
The storage provider implements the NodeGetVolumeStats interface of the CSI plug-in, which Kubelet calls and reflects in its metrics:

- kubelet_volume_stats_capacity_bytes: capacity of the storage volume
- kubelet_volume_stats_used_bytes: used capacity of the storage volume
- kubelet_volume_stats_available_bytes: available capacity of the storage volume
- kubelet_volume_stats_inodes: total number of inodes of the storage volume
- kubelet_volume_stats_inodes_used: number of inodes used
- kubelet_volume_stats_inodes_free: number of inodes remaining
5. Secret
CSI storage volumes can use Secrets to handle the private data required in different phases. StorageClass currently supports the following Secret parameters:
- csi.storage.k8s.io/provisioner-secret-name
- csi.storage.k8s.io/provisioner-secret-namespace
- csi.storage.k8s.io/controller-publish-secret-name
- csi.storage.k8s.io/controller-publish-secret-namespace
- csi.storage.k8s.io/node-stage-secret-name
- csi.storage.k8s.io/node-stage-secret-namespace
- csi.storage.k8s.io/node-publish-secret-name
- csi.storage.k8s.io/node-publish-secret-namespace
- csi.storage.k8s.io/controller-expand-secret-name
- csi.storage.k8s.io/controller-expand-secret-namespace
The Secret is included in the corresponding CSI interface parameters; for example, for the CreateVolume interface it is included in CreateVolumeRequest.Secrets.
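For illustration, a hedged sketch of a StorageClass carrying provisioner-phase Secret parameters (the Secret name, namespace, and StorageClass name below are assumptions, not from the original article):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-pangu-with-secret
provisioner: pangu.csi.alibabacloud.com
parameters:
  type: cloud_ssd
  # The referenced Secret is resolved by the sidecar and passed to the
  # CSI plug-in in CreateVolumeRequest.Secrets.
  csi.storage.k8s.io/provisioner-secret-name: pangu-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
```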
6. Block device
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-example
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  volumeClaimTemplates:
  - metadata:
      name: html
    spec:
      accessModes:
      - ReadWriteOnce
      volumeMode: Block
      storageClassName: csi-pangu
      resources:
        requests:
          storage: 40Gi
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeDevices:
        - devicePath: "/dev/vdb"
          name: html
```
Third-party storage vendors need to implement the NodePublishVolume interface for block devices. Kubernetes provides a mount utility package for block devices (k8s.io/kubernetes/pkg/util/mount); the EnsureBlock and MountBlock functions of this package can be called in the NodePublishVolume phase.
7. Volume snapshot/volume clone capability
Due to length constraints, the principles are not introduced here. Interested readers can refer to the volume snapshot and volume clone documentation.
Conclusion
This article introduced the CSI core processes in general, and analyzed the CSI Sidecar components, CSI interfaces, and API objects against the CSI standard. On K8s, using any kind of CSI storage volume involves the processes above, and container storage issues in production environments are often among the hardest to troubleshoot. This article organizes those processes to help engineers investigate storage problems in their own environments.