About the author

Wang Cheng is a Tencent Cloud R&D engineer and Kubernetes contributor. He works on containerizing database products and on resource control, and follows Kubernetes, Go, and the cloud native field.

Overview

Once you enter the world of K8s, you will find many interfaces designed for extension, including CSI, CNI, and CRI. These interfaces are abstracted in order to provide open, extensible, and standardized capabilities.

K8s persistent storage has migrated from in-tree volumes to CSI plugins (out-of-tree). On the one hand, this decouples the K8s core code from volume-related code, which makes it easier to maintain. On the other hand, it lets cloud vendors implement a unified interface and offer their own cloud storage capabilities, building an open, win-win cloud storage ecosystem.

This article analyzes the CSI implementation mechanism by following the core lifecycle of a persistent volume (PV): Create, Attach, Detach, Mount, Unmount, and Delete.

Related terms

Term: Definition
CSI: Container Storage Interface.
CNI: Container Network Interface.
CRI: Container Runtime Interface.
PV: Persistent Volume.
PVC: Persistent Volume Claim.
StorageClass: A resource object that names a provisioner (i.e. the Storage Provider) and bundles the Volume parameters.
Volume: A unit of storage that will be made available inside of a CO-managed container, via the CSI.
Block Volume: A volume that will appear as a block device inside the container.
Mounted Volume: A volume that will be mounted using the specified file system and appear as a directory inside the container.
CO: Container Orchestration system, communicates with Plugins using CSI service RPCs.
SP: Storage Provider, the vendor of a CSI plugin implementation.
RPC: Remote Procedure Call.
Node: A host where the user workload will be running, uniquely identifiable from the perspective of a Plugin by a node ID.
Plugin: Aka “plugin implementation”, a gRPC endpoint that implements the CSI Services.
Plugin Supervisor: Process that governs the lifecycle of a Plugin, MAY be the CO.
Workload: The atomic unit of “work” scheduled by a CO. This MAY be a container or a collection of containers.

This article and the ones that follow are based on K8s v1.22.

Process overview

PV creation core process:

  • apiserver creates the Pod, and a Volume is created according to PodSpec.Volumes;
  • PVController watches the PVC/PV informers, adds the relevant annotations (such as pv.kubernetes.io/provisioned-by), and reconciles the PVC/PV binding (Bound);
  • StorageClass.volumeBindingMode is checked: with WaitForFirstConsumer, the PV is created only after the Pod has been scheduled to a Node; with Immediate, the PV creation logic is called right away without waiting for Pod scheduling;
  • external-provisioner watches the PVC informer and calls the CreateVolume RPC to create the volume;
  • AttachDetachController takes the Bound PVC/PV, passes it through the InTreeToCSITranslator, and the CSIPlugin internal logic creates the VolumeAttachment resource;
  • external-attacher watches the VolumeAttachment informer and calls the ControllerPublishVolume RPC to attach the volume;
  • kubelet reconciles continuously: based on controllerAttachDetachEnabled || PluginIsAttachable and the current Volume state, it finally mounts the Volume into the Pod directory for use by the container.
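
The volumeBindingMode step in the list above can be sketched as a small decision function. This is only an illustration: shouldProvision and its scheduledNode parameter are hypothetical names, not Kubernetes functions; only the two mode values come from the text.

```go
package main

import "fmt"

// BindingMode mirrors the two StorageClass.volumeBindingMode values named above.
type BindingMode string

const (
	Immediate            BindingMode = "Immediate"
	WaitForFirstConsumer BindingMode = "WaitForFirstConsumer"
)

// shouldProvision is a hypothetical helper (not a Kubernetes API). It reports
// whether PV creation may start, given the binding mode and the node the
// consuming Pod was scheduled to ("" means the Pod is not scheduled yet).
func shouldProvision(mode BindingMode, scheduledNode string) bool {
	switch mode {
	case Immediate:
		// Create the PV right away, without waiting for Pod scheduling.
		return true
	case WaitForFirstConsumer:
		// Create the PV only after the Pod has been scheduled to a Node.
		return scheduledNode != ""
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldProvision(Immediate, ""))
	fmt.Println(shouldProvision(WaitForFirstConsumer, ""))
	fmt.Println(shouldProvision(WaitForFirstConsumer, "node-1"))
}
```

Delaying creation until the node is known is what allows a provisioner to take the Pod's placement into account when it creates the volume.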

Starting with CSI

CSI (Container Storage Interface) is an industry-standard interface specification jointly developed by community members from Kubernetes, Mesos, Docker, and others (github.com/container-s…). It is designed to expose any storage system to containerized applications.

The CSI specification defines the minimum set of operations and deployment recommendations for a storage provider to implement a CSI-compliant Volume Plugin. The primary focus of the CSI specification is to declare the interfaces that the Volume Plugin must implement.

Take a look at Volume’s lifecycle:

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
              Stage |    | Unstage
             Volume |    | Volume
                +---v----+---+
                |  VOL_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

The lifecycle of a dynamically provisioned volume, from
creation to destruction, when the Node Plugin advertises the
STAGE_UNSTAGE_VOLUME capability.

As the Volume lifecycle shows, a persistent volume goes through the following phases before a Pod can use it:

CreateVolume -> ControllerPublishVolume -> NodeStageVolume -> NodePublishVolume

When a Volume is deleted, it goes through the same phases in reverse:

NodeUnpublishVolume -> NodeUnstageVolume -> ControllerUnpublishVolume -> DeleteVolume

Each step of the process above corresponds to a standard interface provided by CSI. Cloud storage vendors only need to implement their own storage plugins against these standard interfaces to connect seamlessly with the underlying K8s orchestration system and offer a variety of cloud storage, backup, and snapshot capabilities.
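
To make the ordering constraint concrete, here is a minimal sketch that encodes the lifecycle diagram as a transition table. The state names CREATED, NODE_READY, VOL_READY, and PUBLISHED and the RPC names come from the diagram; the NONE state (standing in for the unlabeled start and end boxes) and the apply helper are assumptions made for illustration.

```go
package main

import "fmt"

// State is a volume lifecycle state taken from the diagram above.
type State string

const (
	None      State = "NONE" // assumption: label for "before create / after delete"
	Created   State = "CREATED"
	NodeReady State = "NODE_READY"
	VolReady  State = "VOL_READY"
	Published State = "PUBLISHED"
)

// edge is one arrow in the diagram: an RPC issued while in a given state.
type edge struct {
	from State
	rpc  string
}

// transitions lists every arrow of the diagram.
var transitions = map[edge]State{
	{None, "CreateVolume"}:                   Created,
	{Created, "ControllerPublishVolume"}:     NodeReady,
	{NodeReady, "NodeStageVolume"}:           VolReady,
	{VolReady, "NodePublishVolume"}:          Published,
	{Published, "NodeUnpublishVolume"}:       VolReady,
	{VolReady, "NodeUnstageVolume"}:          NodeReady,
	{NodeReady, "ControllerUnpublishVolume"}: Created,
	{Created, "DeleteVolume"}:                None,
}

// apply runs the RPCs in order and fails on the first one that the
// diagram does not allow in the current state.
func apply(s State, rpcs ...string) (State, error) {
	for _, rpc := range rpcs {
		next, ok := transitions[edge{s, rpc}]
		if !ok {
			return s, fmt.Errorf("%s is not valid in state %s", rpc, s)
		}
		s = next
	}
	return s, nil
}

func main() {
	s, err := apply(None, "CreateVolume", "ControllerPublishVolume",
		"NodeStageVolume", "NodePublishVolume")
	fmt.Println(s, err) // PUBLISHED <nil>

	_, err = apply(Created, "NodePublishVolume")
	fmt.Println(err) // NodePublishVolume is not valid in state CREATED
}
```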

Multi-component collaboration

To provide highly extensible, out-of-tree persistent volume management, the K8s CSI implementation involves the following components:

Component introduction

  • kube-controller-manager: the K8s resource controller. It mainly uses PVController and AttachDetachController to implement binding/unbinding (Bound/Unbound) and Attach/Detach for persistent volumes.
  • csi-plugin: split out of the K8s core to implement the control logic and invocation of the CSI standard interfaces. It is the core hub of the whole CSI control logic.
  • node-driver-registrar: a sidecar container maintained by the official K8s SIG. It uses the kubelet plugin registration mechanism to register the plugin with kubelet, and it calls the CSI plugin's Identity service to obtain the plugin information.
  • external-provisioner: a sidecar container maintained by the official K8s SIG. Its main function is to create and delete persistent volumes.
  • external-attacher: a sidecar container maintained by the official K8s SIG. Its main function is to attach and detach persistent volumes.
  • external-snapshotter: a sidecar container maintained by the official K8s SIG. Its main function is to provide snapshots and backup/restore for persistent volumes.
  • external-resizer: a sidecar container maintained by the official K8s SIG. Its main function is to expand persistent volumes, which requires the cloud vendor's plugin to provide the corresponding capability.
  • kubelet: the control hub running on each Node in K8s. Its main function is to reconcile the attach, mount, monitoring/probing, and reporting of Pods and Volumes on that Node.
  • cloud-storage-provider: the plugin implemented by each cloud storage vendor against the CSI standard interfaces, including the Identity service, the Controller service, and the Node service.

Component communication

Because CSI plugin code is considered untrusted in K8s, the CSI Controller Server communicates with the external CSI sidecars, and the CSI Node Server communicates with kubelet, over Unix sockets. Communication with the Storage Service provided by the cloud storage vendor goes through gRPC (HTTP/2).

RPC calls

According to the CSI specification, cloud storage vendors who want to integrate seamlessly with the K8s container orchestration system need to implement the following interfaces:

  • Identity service: both the Node Plugin and the Controller Plugin must implement this set of RPCs. It negotiates version information between K8s and CSI and exposes the plugin information.
  • Controller service: the Controller Plugin must implement this set of RPCs. It creates and manages volumes and corresponds to the attach/detach volume operations in K8s.
  • Node service: the Node Plugin must implement this set of RPCs. It mounts the volume to the specified directory and corresponds to the mount/unmount volume operations in K8s.

The RPC interface functions are as follows:
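
As a rough illustration of how the three services divide the work, the sketch below models them as plain Go interfaces with one in-memory implementation. The RPC names (GetPluginInfo, CreateVolume, ControllerPublishVolume, NodePublishVolume, and so on) are those of the CSI specification, but the plain-argument signatures, the memDriver type, and the driver name are simplifying assumptions: the real services are gRPC services with request/response messages.

```go
package main

import "fmt"

// Simplified stand-ins for the three CSI services (assumed signatures).
type IdentityService interface {
	GetPluginInfo() (name, version string)
}

type ControllerService interface {
	CreateVolume(name string, sizeBytes int64) (volumeID string, err error)
	DeleteVolume(volumeID string) error
	ControllerPublishVolume(volumeID, nodeID string) error
	ControllerUnpublishVolume(volumeID, nodeID string) error
}

type NodeService interface {
	NodePublishVolume(volumeID, targetPath string) error
	NodeUnpublishVolume(volumeID, targetPath string) error
}

// memDriver is a hypothetical in-memory plugin implementing all three.
type memDriver struct {
	volumes  map[string]int64  // volumeID -> size in bytes
	attached map[string]string // volumeID -> nodeID
	mounts   map[string]string // targetPath -> volumeID
}

// Compile-time checks that memDriver satisfies every service.
var (
	_ IdentityService   = (*memDriver)(nil)
	_ ControllerService = (*memDriver)(nil)
	_ NodeService       = (*memDriver)(nil)
)

func newMemDriver() *memDriver {
	return &memDriver{
		volumes:  map[string]int64{},
		attached: map[string]string{},
		mounts:   map[string]string{},
	}
}

func (d *memDriver) GetPluginInfo() (string, string) {
	return "mem.csi.example.com", "0.0.1"
}

// CreateVolume is idempotent for the same name and size.
func (d *memDriver) CreateVolume(name string, sizeBytes int64) (string, error) {
	id := "vol-" + name
	if old, ok := d.volumes[id]; ok && old != sizeBytes {
		return "", fmt.Errorf("volume %s already exists with a different size", id)
	}
	d.volumes[id] = sizeBytes
	return id, nil
}

func (d *memDriver) DeleteVolume(id string) error {
	if _, ok := d.attached[id]; ok {
		return fmt.Errorf("volume %s is still attached", id)
	}
	delete(d.volumes, id)
	return nil
}

func (d *memDriver) ControllerPublishVolume(id, nodeID string) error {
	if _, ok := d.volumes[id]; !ok {
		return fmt.Errorf("volume %s not found", id)
	}
	d.attached[id] = nodeID
	return nil
}

func (d *memDriver) ControllerUnpublishVolume(id, nodeID string) error {
	delete(d.attached, id)
	return nil
}

func (d *memDriver) NodePublishVolume(id, targetPath string) error {
	if _, ok := d.attached[id]; !ok {
		return fmt.Errorf("volume %s is not attached", id)
	}
	d.mounts[targetPath] = id
	return nil
}

func (d *memDriver) NodeUnpublishVolume(id, targetPath string) error {
	delete(d.mounts, targetPath)
	return nil
}

func main() {
	d := newMemDriver()
	id, _ := d.CreateVolume("data", 1<<30)
	fmt.Println(d.NodePublishVolume(id, "/pods/p1/data")) // rejected: not attached yet
	_ = d.ControllerPublishVolume(id, "node-1")
	fmt.Println(d.NodePublishVolume(id, "/pods/p1/data")) // <nil>
}
```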

Creating/deleting a PV

Creating and deleting a persistent volume (PV) in K8s is implemented by the external-provisioner component. The project code is at github.com/kubernetes-…

First, the command-line arguments are parsed in the standard cmd entry point, and the newController -> Run() logic is executed as follows:

// external-provisioner/cmd/csi-provisioner/csi-provisioner.go
func main() {
	...
	// Initialize the CSI provisioner, which implements the Volume create/delete interface
	csiProvisioner := ctrl.NewCSIProvisioner(
		clientset,
		*operationTimeout,
		identity,
		*volumeNamePrefix,
		*volumeNameUUIDLength,
		grpcClient,
		snapClient,
		provisionerName,
		pluginCapabilities,
		controllerCapabilities,
		...
	)
	...
	// The actual ProvisionController, which wraps the CSIProvisioner above
	provisionController = controller.NewProvisionController(
		clientset,
		provisionerName,
		csiProvisioner,
		provisionerOptions...,
	)
	...
	run := func(ctx context.Context) {
		...
		// Run the controller
		provisionController.Run(ctx)
	}
}

Next, the PV create/delete flow is called:

PV creation: runClaimWorker -> syncClaimHandler -> syncClaim -> provisionClaimOperation -> Provision -> CreateVolume
PV deletion: runVolumeWorker -> syncVolumeHandler -> syncVolume -> deleteVolumeOperation -> Delete -> DeleteVolume

The relevant interface is abstracted by sigs.k8s.io/sig-storage-lib-external-provisioner:

// sigs.k8s.io/sig-storage-lib-external-provisioner is imported via vendoring
// external-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/v7/controller/volume.go
type Provisioner interface {
	// Calls the CreateVolume RPC to create the PV
	Provision(context.Context, ProvisionOptions) (*v1.PersistentVolume, ProvisioningState, error)
	// Calls the DeleteVolume RPC to delete the PV
	Delete(context.Context, *v1.PersistentVolume) error
}

Controller reconciliation

The PV-related controllers in K8s are PVController and AttachDetachController.

PVController

PVController adds the relevant annotation to the PVC (such as pv.kubernetes.io/provisioned-by), and the external-provisioner component then creates or deletes the corresponding PV. PVController watches for the PV to be created successfully, binds it to the PVC (Bound), and its reconcile task is complete. The volume is then handed over to AttachDetachController for the next step.

On top of the K8s informer mechanism, PVController uses a local cache to handle status updates and binding events for PVCs and PVs efficiently, maintaining its own local store for Add/Update/Delete events.
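
The pattern described here (event handlers update a local store and enqueue only a key, and a worker later reads the latest state from the store) can be sketched as follows. This is a single-goroutine illustration with hypothetical types (queue, store, onEvent, drain); it is not the client-go workqueue or cache API.

```go
package main

import "fmt"

// queue is a hypothetical stand-in for a work queue: a key that is
// already waiting is not queued a second time.
type queue struct {
	order   []string
	pending map[string]bool
}

func newQueue() *queue { return &queue{pending: map[string]bool{}} }

func (q *queue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

func (q *queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// store is the local cache: object key -> last observed phase.
type store map[string]string

// onEvent plays the role of the Add/Update/Delete handlers: it records
// the newest state locally and enqueues only the key.
func onEvent(s store, q *queue, key, phase string, deleted bool) {
	if deleted {
		delete(s, key)
	} else {
		s[key] = phase
	}
	q.Add(key)
}

// drain processes every queued key once, reading the current state from
// the store, so several events for one object collapse into one sync.
func drain(s store, q *queue) []string {
	var synced []string
	for {
		key, ok := q.Get()
		if !ok {
			return synced
		}
		phase, exists := s[key]
		if !exists {
			synced = append(synced, key+": deleted")
			continue
		}
		synced = append(synced, key+": "+phase)
	}
}

func main() {
	s, q := store{}, newQueue()
	onEvent(s, q, "default/pvc-a", "Pending", false)
	onEvent(s, q, "default/pvc-a", "Bound", false) // update before the worker ran
	onEvent(s, q, "default/pvc-b", "Pending", false)
	fmt.Println(drain(s, q)) // [default/pvc-a: Bound default/pvc-b: Pending]
}
```

Queuing keys rather than objects is what lets the worker act on the most recent state instead of replaying every intermediate event.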

First, the standard newController -> Run() logic:

// kubernetes/pkg/controller/volume/persistentvolume/pv_controller_base.go
func NewController(p ControllerParameters) (*PersistentVolumeController, error) {
	...
	// Initialize PVController
	controller := &PersistentVolumeController{
		volumes:                       newPersistentVolumeOrderedIndex(),
		claims:                        cache.NewStore(cache.DeletionHandlingMetaNamespaceKeyFunc),
		kubeClient:                    p.KubeClient,
		eventRecorder:                 eventRecorder,
		runningOperations:             goroutinemap.NewGoRoutineMap(true /* exponentialBackOffOnError */),
		cloud:                         p.Cloud,
		enableDynamicProvisioning:     p.EnableDynamicProvisioning,
		clusterName:                   p.ClusterName,
		createProvisionedPVRetryCount: createProvisionedPVRetryCount,
		createProvisionedPVInterval:   createProvisionedPVInterval,
		claimQueue:                    workqueue.NewNamed("claims"),
		volumeQueue:                   workqueue.NewNamed("volumes"),
		resyncPeriod:                  p.SyncPeriod,
		operationTimestamps:           metrics.NewOperationStartTimeCache(),
	}
	...
	// Watch PV add/update/delete events
	p.VolumeInformer.Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { controller.enqueueWork(controller.volumeQueue, obj) },
			UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.volumeQueue, newObj) },
			DeleteFunc: func(obj interface{}) { controller.enqueueWork(controller.volumeQueue, obj) },
		},
	)
	...
	// Watch PVC add/update/delete events
	p.ClaimInformer.Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { controller.enqueueWork(controller.claimQueue, obj) },
			UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.claimQueue, newObj) },
			DeleteFunc: func(obj interface{}) { controller.enqueueWork(controller.claimQueue, obj) },
		},
	)
	...
	return controller, nil
}

Next, the PVC/PV binding/unbinding logic is called:

PVC/PV bind: claimWorker -> updateClaim -> syncClaim -> syncBoundClaim -> bind
PVC/PV unbind: volumeWorker -> updateVolume -> syncVolume -> unbindVolume

AttachDetachController

AttachDetachController takes the Bound PVC/PV and, through the InTreeToCSITranslator, converts volumes managed by in-tree plugins to the out-of-tree CSI plugin mode.

The CSIPlugin internal logic then creates or deletes the VolumeAttachment resource, which completes the reconcile task. The volume is then handed over to the external-attacher component for the next step.

The core code is implemented in reconciler.Run() as follows:

// kubernetes/pkg/controller/volume/attachdetach/reconciler/reconciler.go
func (rc *reconciler) reconcile() {

	// Detach first, so that a Volume whose Pod was rescheduled to another node is detached from the old node
	for _, attachedVolume := range rc.actualStateOfWorld.GetAttachedVolumes() {
		// If the Volume is not in the desired state, call DetachVolume to delete the VolumeAttachment resource object
		if !rc.desiredStateOfWorld.VolumeExists(
			attachedVolume.VolumeName, attachedVolume.NodeName) {
			...
			err = rc.attacherDetacher.DetachVolume(attachedVolume.AttachedVolume, verifySafeToDetach, rc.actualStateOfWorld)
			...
		}
	}
	// Call AttachVolume to create a VolumeAttachment resource object
	rc.attachDesiredVolumes()
	...
}
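
The reconcile loop above compares an actual state of the world with a desired one. A minimal sketch of that comparison, using hypothetical types (attachment, and a free-standing reconcile function) rather than the controller's real caches, might look like this:

```go
package main

import (
	"fmt"
	"sort"
)

// attachment identifies one volume attached to one node (hypothetical type).
type attachment struct {
	volume, node string
}

// reconcile returns what must be detached and what must be attached to
// move the actual state to the desired state. Detaches are computed first,
// matching the ordering of the loop above.
func reconcile(actual, desired map[attachment]bool) (detach, attach []attachment) {
	for a := range actual {
		if !desired[a] {
			detach = append(detach, a)
		}
	}
	for d := range desired {
		if !actual[d] {
			attach = append(attach, d)
		}
	}
	sortAttachments(detach)
	sortAttachments(attach)
	return detach, attach
}

// sortAttachments makes the result deterministic (map iteration is random).
func sortAttachments(as []attachment) {
	sort.Slice(as, func(i, j int) bool {
		if as[i].volume != as[j].volume {
			return as[i].volume < as[j].volume
		}
		return as[i].node < as[j].node
	})
}

func main() {
	// The Pod using vol-1 was rescheduled from node-a to node-b.
	actual := map[attachment]bool{{"vol-1", "node-a"}: true}
	desired := map[attachment]bool{{"vol-1", "node-b"}: true}

	detach, attach := reconcile(actual, desired)
	fmt.Println(detach) // [{vol-1 node-a}]
	fmt.Println(attach) // [{vol-1 node-b}]
}
```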

Attach/Detach Volume

Attaching and detaching a persistent volume (PV) in K8s is implemented by the external-attacher component. The project code is at github.com/kubernetes-…

The external-attacher component watches the VolumeAttachment objects created by AttachDetachController in the previous step. If the driver name in .spec.attacher refers to the CSI plugin in its own Pod, it calls that plugin's ControllerPublishVolume interface to attach the volume.

First, the command-line arguments are parsed in the standard cmd entry point, and the newController -> Run() logic is executed as follows:

// external-attacher/cmd/csi-attacher/main.go
func main() {
	...
	ctrl := controller.NewCSIAttachController(
		clientset,
		csiAttacher,
		handler,
		factory.Storage().V1().VolumeAttachments(),
		factory.Core().V1().PersistentVolumes(),
		workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax),
		workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax),
		supportsListVolumesPublishedNodes,
		*reconcileSync,
	)

	run := func(ctx context.Context) {
		stopCh := ctx.Done()
		factory.Start(stopCh)
		ctrl.Run(int(*workerThreads), stopCh)
	}
    ...
}

Next, the Volume attach/detach logic is called:

Volume attach (Attach): syncVA -> SyncNewOrUpdatedVolumeAttachment -> syncAttach -> csiAttach -> Attach -> ControllerPublishVolume
Volume detach (Detach): syncVA -> SyncNewOrUpdatedVolumeAttachment -> syncDetach -> csiDetach -> Detach -> ControllerUnpublishVolume

Kubelet mount/unmount Volume

Mounting and unmounting a persistent volume (PV) in K8s is implemented by the kubelet component.

Kubelet starts the reconcile loop via VolumeManager. When it observes that a new Pod using a PV whose PersistentVolumeSource is CSI has been scheduled to this node, it calls the reconcile function to handle the related Attach/Detach/Mount/Unmount logic.

// kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go
func (rc *reconciler) reconcile() {
	// Unmount first, so that a Volume that is to be mounted by another Pod is unmounted
	rc.unmountVolumes()

	// Then, based on controllerAttachDetachEnabled || PluginIsAttachable and the current Volume state,
	// perform AttachVolume / MountVolume / ExpandInUseVolume
	rc.mountAttachVolumes()

	// Unmount or detach Volumes that are no longer needed (for example, after Pod deletion)
	rc.unmountDetachDevices()
}

The related call logic is as follows:

Volume mount (Mount): reconcile -> mountAttachVolumes -> MountVolume -> SetUp -> SetUpAt -> NodePublishVolume
Volume unmount (Unmount): reconcile -> unmountVolumes -> UnmountVolume -> TearDown -> TearDownAt -> NodeUnpublishVolume

Summary

This article analyzed the CSI implementation mechanism by following the core lifecycle of a persistent volume (PV) in K8s: Create, Attach, Detach, Mount, Unmount, and Delete. The source code excerpts illustrate the logic of each step and should make the K8s CSI workflow easier to understand.

K8s opens up its storage capability through CSI plugins (out-of-tree). On the one hand, this decouples the K8s core code from volume-related code, which makes it easier to maintain. On the other hand, by following the CSI standard interfaces, cloud vendors can implement the interfaces according to their business needs and offer their own cloud storage capabilities, building an open, win-win cloud storage ecosystem.

PS: For more articles, see k8s-club.

