The article “Brief Analysis of CSI working Principle” was first published on: blog.hdls.me/16255765577…

Recently I have been working on CSI, and the deeper I go, the more intricate its details turn out to be. Sorting out the CSI workflow has deepened my own understanding, and I would like to share that understanding here.

I will introduce CSI through two articles. This is the first one, focusing on the basic components and working principles of CSI; it uses Kubernetes as the CO (Container Orchestration system) for CSI. The second article will analyze the concrete implementations of several typical CSI projects.

The basic components of CSI

Kubernetes storage plug-ins come in two types: in-tree and out-of-tree. The former run inside the Kubernetes core components; the latter run independently of the Kubernetes components. This article focuses on out-of-tree plug-ins.

Out-of-tree plug-ins interact with the Kubernetes components through gRPC interfaces, and Kubernetes provides a number of sidecar components that cooperate with CSI plug-ins to implement rich functionality. The out-of-tree ecosystem thus consists of two parts: the sidecar components and the third-party plug-ins themselves.

Sidecar components

external-attacher

Watches VolumeAttachment objects and calls the ControllerPublishVolume and ControllerUnpublishVolume interfaces of the CSI Driver Controller service to attach a volume to, or detach it from, a node.

If the storage system requires an attach/detach step, this component is needed, because the attach/detach controller inside Kubernetes does not call the CSI Driver interfaces directly.
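Conversely, a driver that does not need the attach step can declare this in its CSIDriver object so that Kubernetes skips the VolumeAttachment flow entirely. A sketch (the driver name here is hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: hostpath.csi.example.io   # example driver name
spec:
  # When false, the attach/detach controller and external-attacher
  # skip the attach/detach step for volumes of this driver.
  attachRequired: false
```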

external-provisioner

Watches PVC objects and invokes the CreateVolume and DeleteVolume interfaces of the CSI Driver Controller service to provision a new volume, provided that the provisioner field of the StorageClass referenced by the PVC matches the name returned by the GetPluginInfo interface of the CSI Driver Identity service. Once the new volume is provisioned, external-provisioner creates the corresponding PV in Kubernetes.

If the reclaim policy of the PV bound to the PVC is Delete, the external-provisioner component calls the DeleteVolume interface of the CSI Driver Controller service when it observes the PVC's deletion. Once the volume is deleted, the component also deletes the corresponding PV.
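As an illustration (all names below are hypothetical), a StorageClass whose provisioner field matches the driver name reported by GetPluginInfo, plus a PVC referencing it, is all that is needed to trigger external-provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc
# Must equal the name returned by the CSI driver's GetPluginInfo.
provisioner: example.csi.driver.io
reclaimPolicy: Delete            # PV and backing volume removed with the PVC
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  storageClassName: example-sc
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```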

This component also supports creating volumes from snapshot data sources. If a snapshot CRD is specified as the data source in the PVC, the component fetches the snapshot information through the VolumeSnapshotContent object and passes it to the CSI driver when calling the CreateVolume interface; the CSI driver then creates the volume from that snapshot.
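A PVC restored from a snapshot looks like the following sketch (names are hypothetical; the referenced VolumeSnapshot must already exist):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
spec:
  storageClassName: example-sc
  dataSource:
    name: example-snapshot        # an existing VolumeSnapshot object
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```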

external-resizer

If the user requests more storage on the PVC object, this component calls the ControllerExpandVolume interface of the CSI Driver Controller service to expand the volume. (Any file-system resize needed on the node is then performed by kubelet through the NodeExpandVolume interface of the Node service.)

external-snapshotter

This component works together with the Snapshot Controller. The Snapshot Controller creates a VolumeSnapshotContent object for each VolumeSnapshot created in the cluster. External-snapshotter watches VolumeSnapshotContent objects and, when one is observed, passes the corresponding parameters in a CreateSnapshotRequest to the CSI Driver Controller service, invoking its CreateSnapshot interface. This component is also responsible for calling the DeleteSnapshot and ListSnapshots interfaces.
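The trigger for that flow is a VolumeSnapshot bound to a VolumeSnapshotClass whose driver field names the CSI driver. A sketch with hypothetical names:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: example-snapclass
driver: example.csi.driver.io     # same name as returned by GetPluginInfo
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: example-snapshot
spec:
  volumeSnapshotClassName: example-snapclass
  source:
    persistentVolumeClaimName: example-pvc   # the PVC to snapshot
```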

livenessprobe

Monitors the health of the CSI driver and reports it to Kubernetes through a liveness probe; when the CSI driver is detected to be unhealthy, the Pod is restarted.

node-driver-registrar

Registers the CSI driver with the kubelet on the corresponding node through kubelet's plug-in registration mechanism, obtaining the driver information by calling the NodeGetInfo interface of the CSI Driver Node service directly.

external-health-monitor-controller

Calls the ListVolumes or ControllerGetVolume interface of the CSI Driver Controller service to check the health of CSI volumes and reports it as events on the PVC.

external-health-monitor-agent

Calls the NodeGetVolumeStats interface of the CSI Driver Node service to check the health of CSI volumes and reports it as events on the Pod.

Third-party plug-ins

A third-party storage provider (SP) needs to implement two plug-ins: Controller and Node. The Controller plug-in manages volumes and is typically deployed as a StatefulSet; the Node plug-in is responsible for mounting volumes into Pods and is deployed as a DaemonSet on every node.

The CSI plug-in interacts with kubelet and the external Kubernetes components over gRPC on a Unix Domain Socket. CSI defines three sets of RPC interfaces that an SP must implement in order to communicate with the external Kubernetes components: CSI Identity, CSI Controller, and CSI Node. Let's look at these interface definitions in detail.

CSI Identity

Both the Controller and Node plug-ins need to implement this service, which provides identity information about the CSI driver. The interfaces are as follows:

service Identity {
  rpc GetPluginInfo(GetPluginInfoRequest)
    returns (GetPluginInfoResponse) {}

  rpc GetPluginCapabilities(GetPluginCapabilitiesRequest)
    returns (GetPluginCapabilitiesResponse) {}

  rpc Probe (ProbeRequest)
    returns (ProbeResponse) {}
}

The node-driver-registrar component calls GetPluginInfo to register the CSI driver with kubelet. GetPluginCapabilities indicates which capabilities the CSI driver provides.

CSI Controller

These interfaces are used to create and delete volumes, attach and detach volumes, create, delete and list snapshots, and expand volumes. The Controller plug-in must implement them. The interfaces are as follows:

service Controller {
  rpc CreateVolume (CreateVolumeRequest)
    returns (CreateVolumeResponse) {}

  rpc DeleteVolume (DeleteVolumeRequest)
    returns (DeleteVolumeResponse) {}

  rpc ControllerPublishVolume (ControllerPublishVolumeRequest)
    returns (ControllerPublishVolumeResponse) {}

  rpc ControllerUnpublishVolume (ControllerUnpublishVolumeRequest)
    returns (ControllerUnpublishVolumeResponse) {}

  rpc ValidateVolumeCapabilities (ValidateVolumeCapabilitiesRequest)
    returns (ValidateVolumeCapabilitiesResponse) {}

  rpc ListVolumes (ListVolumesRequest)
    returns (ListVolumesResponse) {}

  rpc GetCapacity (GetCapacityRequest)
    returns (GetCapacityResponse) {}

  rpc ControllerGetCapabilities (ControllerGetCapabilitiesRequest)
    returns (ControllerGetCapabilitiesResponse) {}

  rpc CreateSnapshot (CreateSnapshotRequest)
    returns (CreateSnapshotResponse) {}

  rpc DeleteSnapshot (DeleteSnapshotRequest)
    returns (DeleteSnapshotResponse) {}

  rpc ListSnapshots (ListSnapshotsRequest)
    returns (ListSnapshotsResponse) {}

  rpc ControllerExpandVolume (ControllerExpandVolumeRequest)
    returns (ControllerExpandVolumeResponse) {}

  rpc ControllerGetVolume (ControllerGetVolumeRequest)
    returns (ControllerGetVolumeResponse) {
      option (alpha_method) = true;
    }
}

As mentioned in the introduction of the external Kubernetes components above, different interfaces are called by different components to implement different features. For example, CreateVolume/DeleteVolume work with external-provisioner to implement creating/deleting volumes, and ControllerPublishVolume/ControllerUnpublishVolume work with external-attacher to implement attaching/detaching volumes.

CSI Node

These interfaces are used to mount/unmount volumes and check volume status. The Node plug-in must implement them. The interfaces are as follows:

service Node {
  rpc NodeStageVolume (NodeStageVolumeRequest)
    returns (NodeStageVolumeResponse) {}

  rpc NodeUnstageVolume (NodeUnstageVolumeRequest)
    returns (NodeUnstageVolumeResponse) {}

  rpc NodePublishVolume (NodePublishVolumeRequest)
    returns (NodePublishVolumeResponse) {}

  rpc NodeUnpublishVolume (NodeUnpublishVolumeRequest)
    returns (NodeUnpublishVolumeResponse) {}

  rpc NodeGetVolumeStats (NodeGetVolumeStatsRequest)
    returns (NodeGetVolumeStatsResponse) {}

  rpc NodeExpandVolume (NodeExpandVolumeRequest)
    returns (NodeExpandVolumeResponse) {}

  rpc NodeGetCapabilities (NodeGetCapabilitiesRequest)
    returns (NodeGetCapabilitiesResponse) {}

  rpc NodeGetInfo (NodeGetInfoRequest)
    returns (NodeGetInfoResponse) {}
}

NodeStageVolume is used so that multiple Pods can share a volume: the volume is first mounted to a temporary staging directory, and then mounted into each Pod through NodePublishVolume. NodeUnstageVolume is its reverse operation.
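On a typical node the two steps correspond to two mount points under /var/lib/kubelet. The exact layout varies with Kubernetes version and driver; the paths below are only illustrative:

```
# NodeStageVolume: volume mounted once per node into a staging directory
/var/lib/kubelet/plugins/kubernetes.io/csi/<driver>/<volume-hash>/globalmount

# NodePublishVolume: bind-mounted from the staging directory once per Pod
/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
```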

The working process

Let's take a look at the entire workflow of mounting a volume into a Pod. The process consists of three phases: Provision/Delete, Attach/Detach, and Mount/Unmount. However, not every storage solution goes through all three phases; for example, NFS has no Attach/Detach phase.

The whole process involves not only the components described above, but also the AttachDetachController and PVController inside the ControllerManager, as well as kubelet. The following sections analyze the Provision, Attach, and Mount phases in detail.

Provision

The Provision phase is shown in the figure above. Both external-provisioner and PVController watch PVC resources.

  1. When PVController observes a PVC created in the cluster, it checks whether an in-tree plug-in matches it. If not, it treats the storage type as out-of-tree and adds the annotation volume.beta.kubernetes.io/storage-provisioner={csi driver name} to the PVC;
  2. When external-provisioner observes that the CSI driver named in the PVC's annotation matches its own, it calls the CreateVolume interface of the CSI Driver Controller service;
  3. When the CreateVolume interface returns success, external-provisioner creates the corresponding PV in the cluster;
  4. When PVController observes that the PV has been created in the cluster, it binds the PV to the PVC.

Attach

In the Attach phase, the volume is attached to the node. The whole process is shown in the figure above.

  1. When the ADController observes that a Pod using a CSI-type PV has been scheduled to a node, it calls the interface of the internal in-tree CSI plug-in, which creates a VolumeAttachment resource in the cluster;
  2. When the external-attacher component observes that a VolumeAttachment resource has been created, it calls the ControllerPublishVolume interface of the CSI Driver Controller service;
  3. After the ControllerPublishVolume interface returns success, external-attacher sets the Attached status of the corresponding VolumeAttachment object to true;
  4. When the ADController observes that the Attached status of the VolumeAttachment object is true, it updates its internal state, ActualStateOfWorld.
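The VolumeAttachment resource that drives this phase looks roughly like the following (a sketch; the name is generated by the attach/detach controller, and the driver/node names here are hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-<hash>                 # generated by the attach/detach controller
spec:
  attacher: example.csi.driver.io  # the CSI driver name
  nodeName: node-1
  source:
    persistentVolumeName: example-pv
status:
  attached: true                   # set by external-attacher after ControllerPublishVolume succeeds
```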

Mount

The final step, mounting the volume into a Pod, involves kubelet. In short, during Pod creation the kubelet on the target node calls the CSI Node plug-in to perform the mount. The following breaks down the components inside kubelet.

First, in syncPod, kubelet's main Pod-creation function, kubelet calls the WaitForAttachAndMount method of its sub-component volumeManager and waits for the volume mount to complete:

func (kl *Kubelet) syncPod(o syncPodOptions) error {
	...
	// Volume manager will not mount volumes for terminated pods
	if !kl.podIsTerminated(pod) {
		// Wait for volumes to attach/mount
		if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
			kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
			klog.Errorf("Unable to attach or mount volumes for pod %q: %v; skipping pod", format.Pod(pod), err)
			return err
		}
	}
	...
}

The volumeManager contains two components: desiredStateOfWorldPopulator and reconciler. These two components work together to mount and unmount the volumes in a Pod. The whole process is as follows:

desiredStateOfWorldPopulator and reconciler cooperate in a producer-consumer pattern. volumeManager maintains two queues (strictly speaking interfaces, but they act as queues here): DesiredStateOfWorld and ActualStateOfWorld. The former maintains the expected state of the volumes on the current node; the latter maintains their actual state.

desiredStateOfWorldPopulator does only two things in its own loop. One is to fetch newly created Pods on the current node from kubelet's podManager and record the volumes to be mounted into DesiredStateOfWorld. The other is to fetch the Pods deleted from the current node from podManager and check whether their volumes are recorded in ActualStateOfWorld; if not, it removes them from DesiredStateOfWorld as well. This ensures that DesiredStateOfWorld records the expected state of all volumes on the node. The code is as follows (some code has been removed to simplify the logic):

// Iterate through all pods and add to desired state of world if they don't
// exist but should
func (dswp *desiredStateOfWorldPopulator) findAndAddNewPods() {
	// Map unique pod name to outer volume name to MountedVolume.
	mountedVolumesForPod := make(map[volumetypes.UniquePodName]map[string]cache.MountedVolume)
	...
	processedVolumesForFSResize := sets.NewString()
	for _, pod := range dswp.podManager.GetPods() {
		dswp.processPodVolumes(pod, mountedVolumesForPod, processedVolumesForFSResize)
	}
}

// processPodVolumes processes the volumes in the given pod and adds them to the
// desired state of the world.
func (dswp *desiredStateOfWorldPopulator) processPodVolumes(
	pod *v1.Pod,
	mountedVolumesForPod map[volumetypes.UniquePodName]map[string]cache.MountedVolume,
	processedVolumesForFSResize sets.String) {
	uniquePodName := util.GetUniquePodName(pod)
    ...
	for _, podVolume := range pod.Spec.Volumes {   
		pvc, volumeSpec, volumeGidValue, err :=
			dswp.createVolumeSpec(podVolume, pod, mounts, devices)

		// Add volume to desired state of world
		_, err = dswp.desiredStateOfWorld.AddPodToVolume(
			uniquePodName, pod, volumeSpec, podVolume.Name, volumeGidValue)
		dswp.actualStateOfWorld.MarkRemountRequired(uniquePodName)
    }
}

The reconciler is the consumer, and it mainly does three things:

  1. unmountVolumes(): iterates over the volumes in ActualStateOfWorld and checks whether each is present in DesiredStateOfWorld. If not, it calls the CSI Node interface to unmount the volume and records the result in ActualStateOfWorld;
  2. mountAttachVolumes(): takes the volumes waiting to be mounted from DesiredStateOfWorld, calls the CSI Node interface to mount or expand them, and records the result in ActualStateOfWorld;
  3. unmountDetachDevices(): iterates over the volumes in ActualStateOfWorld and unmounts/detaches any volume that is attached but has no Pod using it and no record in DesiredStateOfWorld.
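The compare-and-act pattern behind these three steps can be sketched with a toy model. Plain Go maps stand in for the real DesiredStateOfWorld/ActualStateOfWorld interfaces here; the actual kubelet types track devices, Pods, and mount state in far more detail:

```go
package main

import "fmt"

// state is a toy stand-in for DesiredStateOfWorld / ActualStateOfWorld:
// the key is a volume name, a true value means "present in this state".
type state map[string]bool

// reconcile compares desired and actual state and returns the actions
// the reconciler would take: mount what is desired but not actual,
// unmount what is actual but no longer desired.
func reconcile(desired, actual state) (toMount, toUnmount []string) {
	for v := range desired {
		if !actual[v] {
			toMount = append(toMount, v)
		}
	}
	for v := range actual {
		if !desired[v] {
			toUnmount = append(toUnmount, v)
		}
	}
	return
}

func main() {
	desired := state{"vol-a": true, "vol-b": true} // what Pods on this node need
	actual := state{"vol-b": true, "vol-c": true}  // what is currently mounted
	toMount, toUnmount := reconcile(desired, actual)
	fmt.Println("mount:", toMount, "unmount:", toUnmount)
	// prints: mount: [vol-a] unmount: [vol-c]
}
```

In the real reconciler each "mount"/"unmount" decision becomes a call into the CSI Node service, and ActualStateOfWorld is updated only after the call succeeds.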

Let's take mountAttachVolumes() as an example to see how it calls the CSI Node interface:

func (rc *reconciler) mountAttachVolumes() {
	// Ensure volumes that should be attached/mounted are attached/mounted.
	for _, volumeToMount := range rc.desiredStateOfWorld.GetVolumesToMount() {
		volMounted, devicePath, err := rc.actualStateOfWorld.PodExistsInVolume(volumeToMount.PodName, volumeToMount.VolumeName)
		volumeToMount.DevicePath = devicePath
		if cache.IsVolumeNotAttachedError(err) {
			...
		} else if !volMounted || cache.IsRemountRequiredError(err) {
			// Volume is not mounted, or is already mounted, but requires remounting
			err := rc.operationExecutor.MountVolume(
				rc.waitForAttachTimeout,
				volumeToMount.VolumeToMount,
				rc.actualStateOfWorld,
				isRemount)
			...
		} else if cache.IsFSResizeRequiredError(err) {
			err := rc.operationExecutor.ExpandInUseVolume(
				volumeToMount.VolumeToMount,
				rc.actualStateOfWorld)
			...
		}
	}
}

The mount operation is all done in rc.operationExecutor, of type OperationExecutor:

func (oe *operationExecutor) MountVolume(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorld ActualStateOfWorldMounterUpdater,
	isRemount bool) error {
	...
	var generatedOperations volumetypes.GeneratedOperations
	generatedOperations = oe.operationGenerator.GenerateMountVolumeFunc(
		waitForAttachTimeout, volumeToMount, actualStateOfWorld, isRemount)

	// Avoid executing mount/map from multiple pods referencing the
	// same volume in parallel
	podName := nestedpendingoperations.EmptyUniquePodName

	return oe.pendingOperations.Run(
		volumeToMount.VolumeName, podName, "" /* nodeName */, generatedOperations)
}

This function first constructs the operation function and then executes it, so let's look at the constructor:

func (og *operationGenerator) GenerateMountVolumeFunc(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorld ActualStateOfWorldMounterUpdater,
	isRemount bool) volumetypes.GeneratedOperations {

	volumePlugin, err :=
		og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)

	mountVolumeFunc := func() volumetypes.OperationContext {
		// Get mounter plugin
		volumePlugin, err := og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)
		volumeMounter, newMounterErr := volumePlugin.NewMounter(
			volumeToMount.VolumeSpec,
			volumeToMount.Pod,
			volume.VolumeOptions{})
		...
		// Execute mount
		mountErr := volumeMounter.SetUp(volume.MounterArgs{
			FsUser:              util.FsUserFrom(volumeToMount.Pod),
			FsGroup:             fsGroup,
			DesiredSize:         volumeToMount.DesiredSizeLimit,
			FSGroupChangePolicy: fsGroupChangePolicy,
		})
		// Update actual state of world
		markOpts := MarkVolumeOpts{
			PodName:             volumeToMount.PodName,
			PodUID:              volumeToMount.Pod.UID,
			VolumeName:          volumeToMount.VolumeName,
			Mounter:             volumeMounter,
			OuterVolumeSpecName: volumeToMount.OuterVolumeSpecName,
			VolumeGidVolume:     volumeToMount.VolumeGidValue,
			VolumeSpec:          volumeToMount.VolumeSpec,
			VolumeMountState:    VolumeMounted,
		}

		markVolMountedErr := actualStateOfWorld.MarkVolumeAsMounted(markOpts)
		...
		return volumetypes.NewOperationContext(nil, nil, migrated)
	}

	return volumetypes.GeneratedOperations{
		OperationName:     "volume_mount",
		OperationFunc:     mountVolumeFunc,
		EventRecorderFunc: eventRecorderFunc,
		CompleteFunc:      util.OperationCompleteHook(util.GetFullQualifiedPluginNameForVolume(volumePluginName, volumeToMount.VolumeSpec), "volume_mount"),
	}
}

This function finds the corresponding plug-in in the CSI plug-in list registered with kubelet, executes volumeMounter.SetUp, and finally updates the ActualStateOfWorld record. Here the external CSI plug-in is implemented by csiMountMgr, with the following code:

func (c *csiMountMgr) SetUp(mounterArgs volume.MounterArgs) error {
	return c.SetUpAt(c.GetPath(), mounterArgs)
}

func (c *csiMountMgr) SetUpAt(dir string, mounterArgs volume.MounterArgs) error {
	csi, err := c.csiClientGetter.Get()
	...

	err = csi.NodePublishVolume(
		ctx,
		volumeHandle,
		readOnly,
		deviceMountPath,
		dir,
		accessMode,
		publishContext,
		volAttribs,
		nodePublishSecrets,
		fsType,
		mountOptions,
	)
    ...
	return nil
}

As you can see, what kubelet's volumeManager ultimately invokes on the CSI Node service, via csiMountMgr, are the NodePublishVolume/NodeUnpublishVolume interfaces. At this point, the whole process of mounting a volume into a Pod has been sorted out.

Conclusion

This article analyzed the working process of the entire CSI system from three aspects: the CSI components, the CSI interfaces, and how a volume is mounted into a Pod. CSI is the standard storage interface of the container ecosystem, and the CO communicates with CSI plug-ins over gRPC. To keep the design universal, Kubernetes places much of the work in external components that cooperate with the CSI plug-in to implement the various features, which keeps the internal Kubernetes logic clean and the CSI plug-ins simple.