Author | Zhiheng Sun (Huizhi), Alibaba development engineer

K8s is known for its persistent storage, which ensures that application data exists independently of the application lifecycle. However, its internal implementation is rarely discussed. What is the internal storage flow of K8s? How do PV, PVC, StorageClass, Kubelet, the CSI plugin, and the other components call one another? This article reveals these mysteries.

K8s persistent storage foundation

Before moving on to the K8s storage process, let’s review the basic concepts of persistent storage in K8s.

1. Noun explanation

  • In-tree: code that lives in the official K8s repository;

  • Out-of-tree: code that is decoupled from K8s internals and lives outside the official K8s repository;

  • PV: PersistentVolume, a cluster-level resource created by the cluster administrator or the External Provisioner. The PV lifecycle is independent of the Pod that uses the PV, and the storage device details are stored in the PV's .spec;

  • PVC: PersistentVolumeClaim, a namespace-level resource created by the user or StatefulSet controller (based on VolumeClaimTemplate). PVC is similar to Pod in that Pod consumes Node resources and PVC consumes PV resources. A Pod can request specific levels of resources (CPU and memory), while a PVC can request specific volume size and Access Mode.

  • StorageClass: a cluster-level resource created by the cluster administrator. The SC provides a class template for dynamically provisioning storage volumes; the .spec of the SC defines different quality-of-service levels and backup policies for the PVs it provisions;

  • CSI: Container Storage Interface, whose purpose is to define an industry-standard "Container Storage Interface" so that storage vendors (SP) can develop plugins against the CSI standard once and have them work across different Container Orchestration (CO) systems, such as Kubernetes, Mesos, and Swarm.
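To make the CSI calls referenced throughout this article concrete, here is a heavily simplified Go sketch of the two CSI gRPC services. The real interfaces are generated from the CSI protobuf spec and carry typed request/response messages; the Request/Response placeholders below are stand-ins, not the actual generated types.

package csisketch

import "context"

// Placeholder types; real CSI uses protobuf-generated request/response messages.
type (
    Request  struct{}
    Response struct{}
)

// Controller service: cluster-side operations, called over gRPC by the
// External Provisioner (Create/Delete) and External Attacher (Publish/Unpublish).
type ControllerServer interface {
    CreateVolume(ctx context.Context, req *Request) (*Response, error)              // Provision
    DeleteVolume(ctx context.Context, req *Request) (*Response, error)              // Delete
    ControllerPublishVolume(ctx context.Context, req *Request) (*Response, error)   // Attach
    ControllerUnpublishVolume(ctx context.Context, req *Request) (*Response, error) // Detach
}

// Node service: per-node operations, called by Kubelet's Volume Manager.
type NodeServer interface {
    NodeStageVolume(ctx context.Context, req *Request) (*Response, error)     // mount device to the global dir
    NodeUnstageVolume(ctx context.Context, req *Request) (*Response, error)   // unmount the global dir
    NodePublishVolume(ctx context.Context, req *Request) (*Response, error)   // bind-mount into the Pod dir
    NodeUnpublishVolume(ctx context.Context, req *Request) (*Response, error) // remove the bind-mount
}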

2. Components

  • PV Controller: manages PV and PVC binding and lifecycle, and performs **Provision/Delete** operations on data volumes as required;

  • AD Controller: responsible for the **Attach/Detach** operations of data volumes, attaching devices to target nodes;

  • Kubelet: Kubelet is the main “Node agent” running on each Node. Its functions include Pod lifecycle management, container health check, container monitoring, etc.

  • Volume Manager: a component in Kubelet that manages the **Mount/Unmount** operations of data volumes (and Attach/Detach operations when Kubelet is configured to perform them), volume device formatting, and so on;

  • Volume Plugins: storage plugins developed by storage providers that extend the volume management capabilities of various storage types and implement the actual operations on third-party storage. Volume Plugins come in two flavors: in-tree and out-of-tree;

  • External Provisioner: a sidecar container that calls the CreateVolume and DeleteVolume functions of Volume Plugins to perform **Provision/Delete** operations. Because the K8s PV controller cannot directly call the functions of out-of-tree Volume Plugins, the External Provisioner calls them via gRPC;

  • External Attacher: a sidecar container that calls the ControllerPublishVolume and ControllerUnpublishVolume functions of Volume Plugins to perform **Attach/Detach** operations. Because the K8s AD controller cannot directly call the functions of out-of-tree Volume Plugins, the External Attacher calls them via gRPC.

3. Persistent volume usage

Kubernetes introduced PV and PVC to enable applications and their developers to request storage resources without dealing with storage facility details. There are two ways to create a PV:

  • A cluster administrator manually creates a static PV for an application;

  • Alternatively, the user manually creates a PVC, and the Provisioner component dynamically creates a corresponding PV.

The following uses NFS shared storage as an example to see the difference.

Creating a storage volume statically

The following figure shows the process for statically creating a storage volume:

Step 1: The cluster administrator creates an NFS PV. NFS is an in-tree storage type natively supported by K8s. The YAML file is as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.4.1
    path: /nfs_storage

Step 2: The user creates a PVC. The YAML file is as follows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Running kubectl get pvc shows that the PV and PVC are bound:

[root@huizhi ~]# kubectl get pvc
NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-pvc   Bound    nfs-pv   10Gi       RWO                           4s

Step 3: The user creates the application and uses the PVC created in step 2.

apiVersion: v1
kind: Pod
metadata:
  name: test-nfs
spec:
  containers:
  - image: nginx:alpine
    imagePullPolicy: IfNotPresent
    name: nginx
    volumeMounts:
    - mountPath: /data
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc

The NFS remote storage is then mounted to the /data directory of the Pod nginx container.

Dynamically creating a storage volume

Dynamic creation of storage volumes requires an **nfs-client-provisioner** and the corresponding StorageClass to be deployed in the cluster.

Dynamic volume creation requires less intervention from the cluster administrator than static volume creation. The following figure shows the process:

The cluster administrator only needs to ensure that an NFS-related StorageClass exists in the environment:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs-sc
provisioner: example.com/nfs
mountOptions:
  - vers=4.1

Step 1: The user creates a PVC whose storageClassName is set to the NFS StorageClass name:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
  annotations:
    volume.beta.kubernetes.io/storage-class: "nfs-sc"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Mi
  storageClassName: nfs-sc

Step 2: The nfs-client-provisioner in the cluster dynamically creates the corresponding PV. You can see that the PV has been created in the environment and is bound to the PVC:

[root@huizhi ~]# kubectl get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS   CLAIM         REASON   AGE
pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b   10Mi       RWX           Delete          Bound    default/nfs            4s

Step 3: Create an application and use the PVC created in Step 2, which is the same as step 3 for statically creating a storage volume.

K8s persistent storage process

1. Process overview

The flowchart here is adapted from @Jun Bao's cloud native storage course (see reference 2).

The process is as follows:

  1. The user creates a Pod containing a PVC that requires a dynamic storage volume;

  2. The **Scheduler** schedules the Pod to an appropriate Worker node based on the Pod configuration, node status, PV configuration, and other information;

  3. The **PV controller** watches that the PVC used by the Pod is Pending, so it calls the Volume Plugin (in-tree) to create the storage volume and creates a PV object (for out-of-tree, this is handled by the External Provisioner);

  4. The **AD controller** finds that the Pod and PVC are in the Pending state and calls the **Volume Plugin** to attach the storage device to the target Worker node;

  5. On the Worker node, the **Volume Manager** in Kubelet waits for the storage device to finish attaching and mounts the device to the global directory via the Volume Plugin: /var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV name] (taking iscsi as an example);

  6. **Kubelet** starts the Pod's containers via Docker and maps the volume mounted to the local global directory into the container using a bind mount.

The more detailed process is as follows:

2. Process details

The process of persistent storage varies slightly with different K8s versions. This article is based on Kubernetes version 1.14.8.

According to the preceding flowchart, the lifecycle of a storage volume has three phases: Provision/Delete, Attach/Detach, and Mount/Unmount.

Provisioning volumes

There are two workers in the PV controller:

  • ClaimWorker: handles PVC add/update/delete events and PVC state transitions;
  • VolumeWorker: handles PV state transitions.
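A minimal sketch of how these two workers are driven, modeled on pkg/controller/volume/persistentvolume in the Kubernetes source (simplified; the real Run method also starts informers and a resync loop):

package pvcontroller

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// PersistentVolumeController holds the two work queues (fields elided).
type PersistentVolumeController struct{ /* claimQueue, volumeQueue, informers, ... */ }

func (ctrl *PersistentVolumeController) claimWorker()  { /* pop and sync one PVC */ }
func (ctrl *PersistentVolumeController) volumeWorker() { /* pop and sync one PV */ }

// Run starts both worker loops and blocks until stopCh closes.
func (ctrl *PersistentVolumeController) Run(stopCh <-chan struct{}) {
    go wait.Until(ctrl.claimWorker, time.Second, stopCh)  // PVC add/update/delete events
    go wait.Until(ctrl.volumeWorker, time.Second, stopCh) // PV state transitions
    <-stopCh
}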

PV state transitions (UpdatePVStatus):

  • The initial state of a PV is Available; after the PV is bound to a PVC, the state changes to Bound;
  • After the PVC bound to the PV is deleted, the state changes to Released;
  • When the PV reclaim policy is Recycle, or the PV's .spec.ClaimRef is manually deleted, the PV state returns to Available;
  • When the reclaim policy is unknown, the Recycle operation fails, or the backing storage volume fails to be deleted, the PV state changes to Failed;
  • Manually deleting the PV's .spec.ClaimRef also returns the PV state to Available.

PVC state transitions (UpdatePVCStatus):

  • When no PV satisfying the PVC exists in the cluster, the PVC state is Pending; after the PV and PVC are bound, the PVC state changes from Pending to Bound;
  • If the PV bound to the PVC is deleted from the environment, the PVC state changes to Lost;
  • If the PVC is then bound again to a **PV of the same name**, the PVC state returns to Bound.

The Provisioning process is as follows (simulating a user creating a new PVC):

**Static volume process (FindBestMatch):** the PV controller first tries to select an Available PV in the environment to match the new PVC.

  • **DelayBinding:** the PV controller determines whether the PVC requires **delayed binding**: 1. Check whether the PVC's annotations contain volume.kubernetes.io/selected-node; if present, the scheduler has already chosen a node for this PVC (for ProvisionVolume), so delayed binding is not needed; 2. If the annotation is absent and the PVC has no StorageClass, delayed binding is not needed by default; if a StorageClass exists, check its VolumeBindingMode field: WaitForFirstConsumer means delay the binding, Immediate means do not.

  • **FindBestMatchPVForClaim:** the PV controller tries to find an existing PV in the environment that satisfies the PVC. It iterates over **all PVs** once and selects the best match among those that qualify (a condensed sketch follows this list). The filtering rules are: 1. VolumeMode must match; 2. the PV must not already be bound to a PVC; 3. pv.Status.Phase must be Available; 4. LabelSelector check: the PV and PVC labels must be consistent; 5. the StorageClass of the PV and PVC must be consistent; 6. among the PVs that satisfy the PVC's requested size, the smallest one is tracked during each iteration and returned as the final result;

  • **Bind:** the PV controller binds the selected PV and PVC: 1. update the PV's .spec.ClaimRef to the current PVC; 2. update pv.Status.Phase to Bound; 3. add the PV annotation pv.kubernetes.io/bound-by-controller: "yes"; 4. update the PVC's .spec.VolumeName to the PV name; 5. update the PVC's .status.Phase to Bound; 6. add the PVC annotations pv.kubernetes.io/bound-by-controller: "yes" and pv.kubernetes.io/bind-completed: "yes";
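A condensed Go sketch of the FindBestMatchPVForClaim filtering, with the rule numbers from the list above as comments. The helper functions are hypothetical stand-ins for checks whose details live in pkg/controller/volume/persistentvolume; this is an outline, not the controller's actual code:

package pvmatch

import v1 "k8s.io/api/core/v1"

// Hypothetical stand-ins for the detailed checks described above.
func volumeModeMatches(pv *v1.PersistentVolume, claim *v1.PersistentVolumeClaim) bool    { return true }
func labelSelectorMatches(pv *v1.PersistentVolume, claim *v1.PersistentVolumeClaim) bool { return true }
func storageClassMatches(pv *v1.PersistentVolume, claim *v1.PersistentVolumeClaim) bool  { return true }

// findBestMatch returns the smallest Available PV that satisfies the claim,
// or nil if none qualifies.
func findBestMatch(claim *v1.PersistentVolumeClaim, pvs []*v1.PersistentVolume) *v1.PersistentVolume {
    requested := claim.Spec.Resources.Requests[v1.ResourceStorage]
    var best *v1.PersistentVolume
    for _, pv := range pvs {
        if !volumeModeMatches(pv, claim) { // rule 1
            continue
        }
        if pv.Spec.ClaimRef != nil { // rule 2: already bound to a PVC
            continue
        }
        if pv.Status.Phase != v1.VolumeAvailable { // rule 3
            continue
        }
        if !labelSelectorMatches(pv, claim) { // rule 4
            continue
        }
        if !storageClassMatches(pv, claim) { // rule 5
            continue
        }
        size := pv.Spec.Capacity[v1.ResourceStorage]
        if size.Cmp(requested) < 0 {
            continue // smaller than the claim's request
        }
        // rule 6: track the smallest PV that still fits
        if best == nil {
            best = pv
        } else if bestSize := best.Spec.Capacity[v1.ResourceStorage]; size.Cmp(bestSize) < 0 {
            best = pv
        }
    }
    return best
}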

**ProvisionVolume:** if no suitable PV is available in the environment, the flow enters the dynamic provisioning scenario:

  • **Before provisioning:** 1. the PV controller determines whether the StorageClass used by the PVC is in-tree or out-of-tree by checking whether the Provisioner field of the StorageClass has the **"kubernetes.io/"** prefix; 2. the PV controller updates the PVC annotation: claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner;

  • **in-tree provisioning (internal provisioning):** 1. the in-tree Provisioner implements the NewProvisioner method of the ProvisionableVolumePlugin interface, which returns a new Provisioner; 2. the PV controller calls the Provisioner's Provision function, which returns a PV object; 3. the PV controller creates the PV object returned in the previous step and binds it to the PVC: .Spec.ClaimRef is set to the PVC and .Status.Phase to Bound, .Spec.StorageClassName is set to the same StorageClassName as the PVC, and the annotations "pv.kubernetes.io/bound-by-controller" = "yes" and "pv.kubernetes.io/provisioned-by" = plugin.GetPluginName() are added;

  • **out-of-tree provisioning (external provisioning):** 1. the External Provisioner checks whether claim.Spec.VolumeName in the PVC is empty, and skips the PVC if it is not; 2. the External Provisioner checks whether claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] equals its own Provisioner Name (the External Provisioner is passed the --provisioner parameter at startup to determine its own Provisioner Name); 3. if the PVC's VolumeMode=Block, the External Provisioner checks whether it supports block devices; 4. the External Provisioner calls the Provision function, which calls the CSI storage plugin's CreateVolume interface through gRPC; 5. the External Provisioner creates a PV to represent the volume and binds the PV to the PVC.
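A minimal Go sketch of checks 1 and 2 above, as they might appear inside an External Provisioner's sync loop. The annotation key is the one named above; everything else is an illustrative stand-in, not the code of any particular provisioner library:

package provisioner

import v1 "k8s.io/api/core/v1"

const annStorageProvisioner = "volume.beta.kubernetes.io/storage-provisioner"

// shouldProvision reports whether this provisioner instance is responsible
// for the claim; provisionerName comes from the --provisioner startup flag.
func shouldProvision(claim *v1.PersistentVolumeClaim, provisionerName string) bool {
    if claim.Spec.VolumeName != "" {
        return false // check 1: the claim already has a volume, skip it
    }
    // check 2: only handle claims the PV controller assigned to this provisioner
    return claim.Annotations[annStorageProvisioner] == provisionerName
}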

Deleting volumes

The Deleting process is the inverse of Provisioning:

When a user deletes a PVC, the PV controller changes pv.Status.Phase to Released.

When pv.Status.Phase == Released, the PV controller first checks the value of Spec.PersistentVolumeReclaimPolicy: Retain means the PV is skipped; Delete is handled as follows (a minimal sketch of this dispatch follows the list):

  • **in-tree deleting:** 1. the in-tree Provisioner implements the NewDeleter method of the DeletableVolumePlugin interface, which returns a new Deleter; 2. the controller calls the Deleter's Delete function to delete the corresponding volume; 3. after the volume is deleted, the PV controller deletes the PV object;

  • **out-of-tree deleting:** 1. the External Provisioner calls the Delete function, which calls the CSI plugin's DeleteVolume interface through gRPC; 2. after the volume is deleted, the External Provisioner deletes the PV object.
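The dispatch just described, condensed into a Go sketch; the branch bodies are compressed into comments, so this is an outline of the behavior above, not the controller's actual code:

package pvreclaim

import v1 "k8s.io/api/core/v1"

// reclaimVolume outlines what the PV controller does once a PV is Released.
func reclaimVolume(pv *v1.PersistentVolume) {
    switch pv.Spec.PersistentVolumeReclaimPolicy {
    case v1.PersistentVolumeReclaimRetain:
        // skip: keep both the backing volume and the PV object
    case v1.PersistentVolumeReclaimDelete:
        // in-tree: NewDeleter().Delete(), then delete the PV object;
        // out-of-tree: External Provisioner -> CSI DeleteVolume -> delete the PV
    case v1.PersistentVolumeReclaimRecycle:
        // legacy policy: scrub the volume, then make the PV Available again
    }
}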

Attaching volumes

Both Kubelet and the AD controller can perform attach/detach operations. If Kubelet is started with --enable-controller-attach-detach=false, Kubelet attaches and detaches the volumes on its own node; otherwise the AD controller does it (the default). This section uses the AD controller as an example to explain the attach/detach operations.

The AD controller has two core variables:

  • DesiredStateOfWorld (DSW): the expected volume attachment state of the cluster, containing Nodes -> Volumes -> Pods information;
  • ActualStateOfWorld (ASW): the actual volume attachment state of the cluster, containing Volumes -> Nodes information.
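The rough shape of these two caches, as a simplified Go sketch (the real types live in pkg/controller/volume/attachdetach/cache and carry much more bookkeeping):

package adcache

type (
    nodeName   string
    volumeName string
    podName    string
)

// DesiredStateOfWorld: which volumes should be attached to which nodes,
// and which Pods rely on them (Nodes -> Volumes -> Pods).
type DesiredStateOfWorld struct {
    nodes map[nodeName]map[volumeName]map[podName]struct{}
}

// ActualStateOfWorld: which volumes are actually attached to which nodes
// (Volumes -> Nodes).
type ActualStateOfWorld struct {
    attached map[volumeName]map[nodeName]bool
}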

The Attaching process is as follows:

The AD controller initializes the DSW and ASW based on the resource information in the cluster.

The AD controller's DSW and ASW are then updated periodically by the following components:

  • **Reconciler:** runs periodically through a GoRoutine to ensure that volumes are attached/detached, updating the ASW in the process:

**in-tree attaching:** 1. the in-tree Attacher implements the NewAttacher method of the AttachableVolumePlugin interface, which returns a new Attacher; 2. the AD controller calls the Attacher's Attach function to attach the device; 3. the ASW is updated.

**out-of-tree attaching:** 1. the in-tree CSIAttacher is called to create a **VolumeAttachment (VA)** object, which contains the Attacher information, node name, and PV information (a sketch of such an object follows this list); 2. the External Attacher watches the VolumeAttachment resources in the cluster and calls the Attach function when a volume needs to be attached, which calls the CSI plugin's ControllerPublishVolume interface through gRPC.

  • **DesiredStateOfWorldPopulator:** runs periodically through a GoRoutine; its main function is to update the DSW:

FindAndRemoveDeletedPods: traverses all Pods in the DSW and removes them from the DSW if they have been deleted from the cluster; FindAndAddActivePods: traverses all Pods in the PodLister and adds them to the DSW if they are not there yet.

  • **PVC Worker:** watches PVC add/update events, handles the Pods related to the PVC, and updates the DSW in real time.
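For the out-of-tree path, the VolumeAttachment object from step 1 of out-of-tree attaching looks roughly like this when built with the k8s.io/api/storage/v1 types. All field values here are illustrative; in the real implementation the object name is a hash over attacher, PV, and node:

package attacher

import (
    storagev1 "k8s.io/api/storage/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newVolumeAttachment builds the VA object that the External Attacher watches.
func newVolumeAttachment(pvName, node, driver string) *storagev1.VolumeAttachment {
    return &storagev1.VolumeAttachment{
        ObjectMeta: metav1.ObjectMeta{
            Name: "csi-" + pvName + "-" + node, // illustrative; really a hash
        },
        Spec: storagev1.VolumeAttachmentSpec{
            Attacher: driver, // CSI driver name, e.g. "csi.example.com"
            NodeName: node,   // the node the volume should be attached to
            Source: storagev1.VolumeAttachmentSource{
                PersistentVolumeName: &pvName, // the PV backing this attachment
            },
        },
    }
}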

Detaching volumes

The Detaching process is as follows:

  • When a Pod is deleted, the AD controller watches the event. It first checks whether the Node resource where the Pod resides contains the label "volumes.kubernetes.io/keep-terminated-pod-volumes"; if it does, no operation is performed; if not, the volume is removed from the DSW;

  • The AD controller brings the ActualStateOfWorld closer to the DesiredStateOfWorld through the **Reconciler**, and when it finds a volume in the ASW that does not exist in the DSW, it detaches it:

**in-tree detaching:** 1. the in-tree Detacher implements the NewDetacher method of the AttachableVolumePlugin interface, which returns a new Detacher; 2. the AD controller calls the Detacher's Detach function to detach the corresponding volume; 3. the AD controller updates the ASW.

**out-of-tree detaching:** 1. the AD controller calls the in-tree CSIAttacher to delete the related VolumeAttachment object; 2. the External Attacher watches the VolumeAttachment (VA) resources in the cluster and calls the Detach function when a volume needs to be detached, which calls the CSI plugin's ControllerUnpublishVolume interface through gRPC; 3. the AD controller updates the ASW.

Mounting and unmounting volumes

The **Volume Manager** also has two core variables:

  • DesiredStateOfWorld (DSW): the expected volume mounting state of the cluster, containing Volumes -> Pods information;
  • ActualStateOfWorld (ASW): the actual volume mounting state of the cluster, containing Volumes -> Pods information.

The Mounting/Unmounting process is as follows:

The purpose of the global mount path: a block device can only be mounted once on Linux, but in a K8s scenario a PV may be used by multiple Pod instances on the same Node. The block device is therefore formatted and mounted to a temporary global directory on the Node first, and then the bind mount technique in Linux is used to mount this global directory into the corresponding Pod directory: /var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV name]. A minimal illustration follows.
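A minimal Go illustration of the two steps, assuming an already-formatted device and hypothetical paths (real Kubelet code uses the mount utilities in k8s.io/utils/mount with far more error handling, rather than shelling out):

package mountsketch

import (
    "fmt"
    "os/exec"
)

// mountGlobalThenBind shows the global-mount + bind-mount pattern.
func mountGlobalThenBind(device, globalDir, podDir string) error {
    // Step 1 (NodeStageVolume / MountDevice): mount the device once per node.
    if out, err := exec.Command("mount", device, globalDir).CombinedOutput(); err != nil {
        return fmt.Errorf("global mount failed: %v: %s", err, out)
    }
    // Step 2 (NodePublishVolume / SetUp): bind-mount the global directory into
    // the Pod's volume directory; repeatable for every Pod on the same node.
    if out, err := exec.Command("mount", "--bind", globalDir, podDir).CombinedOutput(); err != nil {
        return fmt.Errorf("bind mount failed: %v: %s", err, out)
    }
    return nil
}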

VolumeManager initializes the DSW and ASW based on the resource information in the cluster.

The VolumeManager has two internal components that periodically update the DSW and ASW:

  • DesiredStateOfWorldPopulator: runs periodically through a GoRoutine; its main function is to update the DSW;
  • Reconciler: runs periodically through a GoRoutine to ensure that volumes are mounted/unmounted, updating the ASW through the three operations below (a condensed loop sketch follows them):

UnmountVolumes: ensures that volumes are unmounted after their Pod is deleted. Iterate over all Pods in the ASW; if a Pod is no longer in the DSW (meaning the Pod has been deleted), then, taking VolumeMode=FileSystem as an example:

  1. Remove all bind-mounts: call the Unmounter's TearDown interface (for out-of-tree, the CSI plugin's NodeUnpublishVolume interface);
  2. Unmount the volume: call the DeviceUnmounter's UnmountDevice function (for out-of-tree, the CSI plugin's NodeUnstageVolume interface);
  3. Update the ASW.

MountAttachVolumes: ensures that the volumes a Pod needs are mounted successfully. Iterate over all Pods in the DSW; if a Pod is not in the ASW (meaning the volume still needs to be mounted and mapped into the Pod directory), then, taking VolumeMode=FileSystem as an example:

  1. Wait for the volume to be attached to the node (by the External Attacher or by Kubelet itself);
  2. Mount the volume to the global directory: call the DeviceMounter's MountDevice function (for out-of-tree, the CSI plugin's NodeStageVolume interface);
  3. Update the ASW: the volume has been mounted to the global directory;
  4. Bind-mount the volume into the Pod: call the Mounter's SetUp interface (for out-of-tree, the CSI plugin's NodePublishVolume interface);
  5. Update the ASW.

UnmountDetachDevices: ensures that volumes that need to be unmounted are actually unmounted. Iterate over all UnmountedVolumes in the ASW; if one is not in the DSW, then:

  1. Unmount the volume: call the DeviceUnmounter's UnmountDevice function (for out-of-tree, the CSI plugin's NodeUnstageVolume interface);
  2. Update the ASW.
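Put together, one pass of the Volume Manager's Reconciler has roughly the following shape. This is a simplified sketch whose method names mirror the three operations above, not the literal Kubelet code:

package vmsketch

// reconciler holds the Volume Manager's DSW/ASW handles (fields elided).
type reconciler struct{ /* DSW, ASW, operation executor, ... */ }

func (rc *reconciler) unmountVolumes()       { /* operation 1 above */ }
func (rc *reconciler) mountAttachVolumes()   { /* operation 2 above */ }
func (rc *reconciler) unmountDetachDevices() { /* operation 3 above */ }

// reconcile is one periodic pass of the Reconciler GoRoutine.
func (rc *reconciler) reconcile() {
    rc.unmountVolumes()       // ASW Pods gone from DSW: TearDown, then UnmountDevice
    rc.mountAttachVolumes()   // DSW Pods missing from ASW: wait for attach, MountDevice, SetUp
    rc.unmountDetachDevices() // unmounted volumes not in DSW: UnmountDevice
}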

Conclusion

This article first introduced the basic concepts and usage of K8s persistent storage, then analyzed the internal storage flow of K8s in depth. Every storage type used on K8s goes through the stages above (some scenarios skip Attach/Detach), so a storage problem in an environment must come from a failure in one of these stages.

Container storage has many pitfalls, especially in private cloud environments. But the more challenges, the more opportunities! The domestic proprietary cloud market is currently competing fiercely in the storage field, and our Agile PaaS container team welcomes you to join us and create a better future together!

References

  1. Kubernetes community source code
  2. Kubernetes Storage Architecture and Plugin Usage (Jun Bao)
  3. Applying Storage and Persistence to Data Volumes – Core Knowledge (Zhi Tian)
  4. [kubernetes-design-proposals] Volume Provisioning
  5. [kubernetes-design-proposals] CSI Volume Plugins in Kubernetes Design Doc

The Cloud Native Application Platform team is hiring!

The Alibaba Cloud Native Application Platform team is currently looking for talent. If you:

  • Are passionate about container and infrastructure-related cloud native technologies, with deep accumulation and outstanding achievements (such as product implementations, innovative technologies, open source contributions, or leading academic results) in one direction of cloud native infrastructure, such as Kubernetes, Serverless platforms, container network and storage, or operations platforms;

  • Have excellent presentation, communication, and teamwork skills, think ahead about technology and business, and have strong ownership with a results-oriented, decisive mindset;

  • Are familiar with at least one programming language between Java and Golang;

  • Hold a bachelor's degree or above and have at least 3 years of working experience.

Resumes can be sent to [email protected]. If you have any questions, please add WeChat: TheBeatles1994.
