The content is derived from the official Longhorn 1.1.2 English technical manual.

What is Longhorn?

Table of Contents

  • 1. The design
    • 1.1. Longhorn Manager and Longhorn Engine
    • 1.2. Advantages of microservice-based design
    • 1.3. CSI Driver
    • 1.4. CSI Plugin
    • 1.5. Longhorn UI
  • 2. Longhorn volume and primary storage
    • 2.1. Thin Provisioning and Volume Size
    • 2.2. Restoring a Volume in Maintenance mode
    • 2.3. Replicas
      • 2.3.1. How replica read and write operations work
      • 2.3.2. How new replicas are added
      • 2.3.3. How faulty replicas are rebuilt
    • 2.4. Snapshots
      • 2.4.1. Working Principle of snapshot
      • 2.4.2. Periodic Snapshot
      • 2.4.3. Deleting a Snapshot
      • 2.4.4. Storing snapshots
      • 2.4.5. Crash consistency
  • 3. Backup and secondary storage
    • 3.1. Working Principle of backup
    • 3.2. Periodically back up data
    • 3.3. Disaster recovery volumes
    • 3.4. Backup storage update interval, RTO, and RPO
  • Appendix: How persistent storage works in Kubernetes
    • How Kubernetes workloads use new and existing persistent storage
      • Existing storage configuration
      • Dynamic storage configuration
    • Horizontal scaling of Kubernetes Workloads with persistent storage

1. The design

The Longhorn design has two layers: a Data plane and a Control plane. The Longhorn Engine is the data plane of the storage controller, and the Longhorn Manager is the control plane.

1.1. Longhorn Manager and Longhorn Engine

The Longhorn Manager Pod runs as a Kubernetes DaemonSet on every node in the Longhorn cluster. It is responsible for creating and managing volumes in the Kubernetes cluster and handles API calls from the UI or the Kubernetes volume plugin. It follows the Kubernetes controller pattern, sometimes called the operator pattern.

The Longhorn Manager communicates with the Kubernetes API server to create a new Longhorn volume CRD. The Longhorn Manager then watches the response from the API server, and when it sees that the Kubernetes API server has created the new Longhorn volume CRD, the Longhorn Manager creates a new volume.
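
For illustration, the following sketch shows what creating such a Longhorn volume CRD object directly through the Kubernetes API could look like, using the Python Kubernetes client. The group, version, and field names (longhorn.io/v1beta1, numberOfReplicas, frontend) are assumptions based on typical Longhorn 1.1.x manifests rather than something stated in this manual; most users create volumes indirectly through a PersistentVolumeClaim instead.

    # Hypothetical sketch: creating a Longhorn volume custom resource directly.
    # Group/version/field names are assumptions based on Longhorn 1.1.x and may differ.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    volume = {
        "apiVersion": "longhorn.io/v1beta1",
        "kind": "Volume",
        "metadata": {"name": "example-vol", "namespace": "longhorn-system"},
        "spec": {
            "size": "21474836480",    # 20 GiB, expressed in bytes as a string
            "numberOfReplicas": 3,    # replicas will be placed on different nodes
            "frontend": "blockdev",   # expose the volume as a block device
        },
    }

    # The Longhorn Manager watches for this custom resource and creates the
    # engine and replica instances that make up the new volume.
    api.create_namespaced_custom_object(
        group="longhorn.io", version="v1beta1",
        namespace="longhorn-system", plural="volumes", body=volume,
    )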

When the Longhorn Manager is asked to create a volume, it creates a Longhorn Engine instance on the node the volume is attached to, and creates a replica on each node where a replica will be placed. Replicas should be placed on different hosts to ensure maximum availability.

Multiple data paths to the replicas ensure high availability of the Longhorn volume. Even if there is a problem with a particular replica or with the engine, the problem does not affect all replicas or the Pod's access to the volume. The Pod will still operate normally.

The Longhorn Engine always runs on the same node as the Pod that uses the Longhorn volume. It synchronously replicates the volume across multiple replicas stored on multiple nodes.

The engine and the replicas are orchestrated using Kubernetes.

In the figure below,

  • There are three instances of Longhorn volumes.
  • Each volume has a dedicated controller, called the Longhorn Engine, which runs as a Linux process.
  • Each Longhorn volume has two replicas, and each replica is a Linux process.
  • The arrows in the figure indicate the read/write data flow between the volume, the controller instance (Longhorn Engine), the replica instances, and the disks.
  • Because each volume has a separate Longhorn Engine, if one controller fails, the function of other volumes is not affected.

Figure 1. Read/write data flow between volumes, Longhorn engine, replica instances, and disks

1.2. Advantages of microservice-based design

In Longhorn, each engine needs to serve only one volume, simplifying the design of the storage controllers. Because the failure domain of the controller software is isolated to individual volumes, a controller crash affects only one volume.

The Longhorn Engine is simple and lightweight enough that we can create up to 100,000 individual engines. Kubernetes schedules these independent engines, drawing resources from a shared pool of disks, and works with Longhorn to form a resilient distributed block storage system.

Because each volume has its own controller, controllers and replica instances of each volume can also be upgraded without significant interruption of IO operations.

Longhorn can create a long-running job to coordinate upgrades of all the real-time volumes without interrupting the continuous running of the system. To ensure that the upgrade does not cause unforeseen problems, Longhorn can choose to upgrade a small portion of the volume and roll back to an older version if problems occur during the upgrade.

1.3. CSI Driver

The Longhorn CSI driver takes the block device, formats it, and mounts it on the node. The kubelet then bind-mounts the device inside the Kubernetes Pod. This allows the Pod to access the Longhorn volume.

The required Kubernetes CSI driver image will be automatically deployed by Longhorn Driver Deployer.

1.4. CSI Plugin

Longhorn manages volumes in Kubernetes through the CSI Plugin. This allows the Longhorn plugin to be installed easily.

The Kubernetes CSI Plugin calls Longhorn to create volumes that provide persistent data for Kubernetes workloads. The CSI Plugin lets you create, delete, attach, detach, mount, and snapshot volumes. All other functionality provided by Longhorn is implemented through the Longhorn UI.

The Kubernetes cluster uses the CSI Interface internally to communicate with the Longhorn CSI Plugin. The Longhorn CSI Plugin uses the Longhorn API to communicate with the Longhorn Manager.

Longhorn does make use of iSCSI, so additional configuration of the nodes may be required. This may include installing open-iscsi or iscsiadm depending on the distribution.

1.5. Longhorn UI

The Longhorn UI interacts with the Longhorn Manager through the Longhorn API and acts as a complement to Kubernetes. The Longhorn UI enables you to manage snapshots, backups, nodes, and disks.

In addition, the space usage of the cluster's worker nodes is collected and illustrated by the Longhorn UI.

2. Longhorn volume and primary storage

When a volume is created, the Longhorn Manager creates a Longhorn Engine microservice and the replicas for that volume as microservices. Together, these microservices form a Longhorn volume. Each replica should be placed on a different node or on a different disk.

After the Longhorn Manager creates the Longhorn Engine, it connects the engine to the replicas. The engine exposes a block device on the node where the Pod runs.

Longhorn volumes can also be created with kubectl.
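
As a minimal sketch (assuming the default longhorn StorageClass that a standard Longhorn installation registers), the following PVC causes the CSI machinery described above to provision a Longhorn volume; the same object could equally be written as a YAML manifest and applied with kubectl.

    # Minimal sketch: requesting a Longhorn-backed volume through a PVC.
    # Assumes the "longhorn" StorageClass registered by a default Longhorn install.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="longhorn-example-pvc"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],   # Longhorn volumes are ReadWriteOnce block devices
            storage_class_name="longhorn",
            resources=client.V1ResourceRequirements(requests={"storage": "2Gi"}),
        ),
    )
    core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)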

2.1. Thin Provisioning and Volume Size

Longhorn is a thin-provisioned storage system. This means a Longhorn volume only takes up as much space as it currently needs. For example, if you allocate a 20 GB volume and use only 1 GB of it, the actual data size on the disk is 1 GB. You can view the actual data size in the volume details of the UI.

If you delete content from the volume, the size of the Longhorn volume itself does not shrink. For example, if you create a 20 GB volume, use 10 GB, and then remove 9 GB of content, the actual size on the disk is still 10 GB instead of 1 GB. This happens because Longhorn is running at the block level and not the filesystem level, so Longhorn doesn’t know if the content has been deleted by the user. This information is mainly stored at the file system level.

2.2. Restoring a Volume in Maintenance mode

When attaching a volume from the Longhorn UI, there is a Maintenance mode checkbox. It is needed to restore a volume from a snapshot.

This option causes the volume to be attached without enabling the frontend (block device or iSCSI), to ensure that no one can access the volume data while the volume is attached.

Since v0.6.0, the snapshot restore operation requires the volume to be in maintenance mode. This is because if the content of the block device is modified while the volume is mounted or in use, the filesystem can become corrupted.

Maintenance mode is also useful for checking the state of a volume without worrying about unexpected data access.

2.3. Replicas

Each replica contains a chain of snapshots of the Longhorn volume. Snapshots are like the layers of an image, with the oldest snapshot serving as the base layer and newer snapshots layered on top. Data is included in a newer snapshot only if it overwrites data in an older snapshot. Together, the chain of snapshots shows the current state of the data.

For each Longhorn volume, multiple replicas of the volume should run in the Kubernetes cluster, each on a separate node. All replicas are treated equally, and the Longhorn Engine always runs on the same node as the Pod that consumes the volume. In this way, we ensure that even if the Pod goes down, the engine can be moved to another Pod and your service will not be interrupted.

The default replica count can be changed in the settings. When attaching a volume, you can also change the replica count for that volume in the UI.

If the current count of healthy replicas is less than the specified replica count, Longhorn begins rebuilding new replicas.

If the current count of healthy replicas is greater than the specified replica count, Longhorn does nothing. In this case, if a replica fails or is deleted, Longhorn does not start rebuilding a new replica unless the healthy replica count falls below the specified replica count.

Longhorn replicas are built using Linux sparse files, which support thin provisioning.
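
The effect of sparse files can be demonstrated directly with a generic Linux sketch (not Longhorn-specific code): the apparent size of the file can be large while the space actually allocated on disk stays small until data is written.

    # Generic Linux sparse-file demonstration (not Longhorn code).
    import os

    path = "/tmp/sparse-demo.img"          # throwaway demo file
    with open(path, "wb") as f:
        f.truncate(20 * 1024**3)           # 20 GiB apparent size, nothing allocated yet
        f.seek(1 * 1024**3)
        f.write(b"x" * 4096)               # writing 4 KiB allocates only a few blocks

    st = os.stat(path)
    print("apparent size:", st.st_size)              # ~20 GiB
    print("allocated on disk:", st.st_blocks * 512)  # only a few KiB
    os.remove(path)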

2.3.1. How replica read and write operations work

When data is read from a replica of the volume, if the data can be found in the live data, the live data is used. If not, the newest snapshot is read. If the data is not found in the newest snapshot, the next older snapshot is read, and so on, until the oldest snapshot is read.

When a snapshot is taken, a differencing disk is created. As the number of snapshots grows, the differencing disk chain (also called the snapshot chain) can become quite long. To improve read performance, Longhorn therefore maintains a read index that records which differencing disk holds valid data for each 4K block of storage.

In the figure below, the volume has eight blocks. The read index has eight entries and is lazily populated when a read operation occurs.

A write operation resets the read index entry for the written block so that it points to the live data. The live data consists of data at some indices and empty space at others.

Besides the read index, no additional metadata is currently maintained to indicate which blocks are used.

Figure 2. How the read index keeps track of the snapshots that hold the latest data

The figure above is color-coded and shows which blocks contain the latest data according to the read index. The source of the latest data is also listed in the table below:

Read index entry    Source of the latest data
0                   Latest snapshot
1                   Live data
2                   Oldest snapshot
3                   Oldest snapshot
4                   Oldest snapshot
5                   Live data
6                   Live data
7                   Live data

Note that, as indicated by the green arrow in the figure above, the read index entry for index 5 previously pointed to the second-oldest snapshot as the source of the latest data, and then changed to point to the live data when the 4K block of storage at index 5 was overwritten by the live data.

The read index is kept in memory and consumes one byte for each 4K block. The one-byte entry size means that up to 254 snapshots can be kept for each volume.

The read index consumes a certain amount of in-memory data structure for each replica. For example, a 1 TB volume has about 268 million 4K blocks, so at one byte per block the read index consumes about 256 MB of memory per replica.
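
To make the read index concrete, here is an illustrative model in Python (not Longhorn's actual implementation) of a replica as a chain of differencing layers plus a per-4K-block read index. The 255 sentinel for "live data" mirrors the one-byte entries and the 254-snapshot limit described above.

    # Illustrative model of a replica's snapshot chain and read index
    # (not Longhorn's real code).
    BLOCK = 4096
    LIVE = 255   # sentinel entry meaning "the block lives in the live data"

    class Replica:
        def __init__(self, num_blocks):
            self.snapshots = [{}]                    # oldest snapshot first; {block: bytes}
            self.live = {}                           # the writable live data layer
            self.read_index = [None] * num_blocks    # one entry (one byte) per 4K block

        def write(self, block, data):
            self.live[block] = data
            self.read_index[block] = LIVE            # writes always land in the live data

        def take_snapshot(self):
            # The live data becomes the newest snapshot; a blank live layer replaces it.
            self.snapshots.append(self.live)
            self.live = {}
            new_idx = len(self.snapshots) - 1
            self.read_index = [new_idx if e == LIVE else e for e in self.read_index]

        def read(self, block):
            entry = self.read_index[block]
            if entry is None:
                return b"\0" * BLOCK                 # block was never written
            if entry == LIVE:
                return self.live[block]
            return self.snapshots[entry][block]      # jump straight to the right layer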

2.3.2. How new replicas are added

When a new replica is added, the existing replicas are synchronized to the new replica. The first replica is created by taking a new snapshot of the live data.

The following steps show a more detailed breakdown of how Longhorn adds a new replica:

  1. The Longhorn Engine is paused.
  2. Assume the snapshot chain in the replica consists of the live data and a snapshot. When the new replica is created, the live data becomes the newest (second) snapshot, and a new, blank version of the live data is created.
  3. The new replica is created in WO (write-only) mode.
  4. The Longhorn Engine is unpaused.
  5. All the snapshots are synchronized to the new replica.
  6. The new replica is set to RW (read-write) mode.

2.3.3. How faulty replicas are rebuilt

Longhorn always tries to maintain at least the given number of healthy replicas for each volume.

When the controller detects a failure in one of its replicas, it marks that replica as being in an error state. The Longhorn Manager is responsible for initiating and coordinating the process of rebuilding the failed replica.

To rebuild the failed replica, the Longhorn Manager creates a blank replica and calls the Longhorn Engine to add the blank replica to the volume's replica set.

To add the blank replica, the engine performs the following operations:

  1. Pause all read and write operations.
  2. Add the blank replica in WO (write-only) mode.
  3. Take a snapshot of all existing replicas, which will now each have a blank differencing disk at the head of their chains.
  4. Unpause all read and write operations. Only write operations are dispatched to the newly added replica.
  5. Start a background process to synchronize all the disks, except the most recent differencing disk, from a good replica to the blank replica.
  6. After the synchronization completes, all replicas have consistent data, and the volume manager sets the new replica to RW (read-write) mode.

Finally, the Longhorn Manager calls the Longhorn Engine to remove the failed replica from the replica set.
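
The rebuild sequence can be summarized in pseudocode form. This is a hedged sketch of the steps listed above; the object and method names are hypothetical and are not real Longhorn Engine or Manager APIs.

    # Hedged pseudocode of the rebuild flow; all names are hypothetical.
    def rebuild_failed_replica(engine, healthy_replicas, blank_replica):
        engine.pause_io()                              # 1. pause all reads and writes
        engine.add_replica(blank_replica, mode="WO")   # 2. add the blank replica write-only
        for replica in healthy_replicas + [blank_replica]:
            replica.take_snapshot()                    # 3. every replica gets a blank
                                                       #    differencing disk at its head
        engine.resume_io()                             # 4. writes now also reach the new replica
        source = healthy_replicas[0]
        for disk in source.disks_except_latest():      # 5. background sync of all disks
            blank_replica.sync_from(source, disk)      #    except the newest differencing disk
        engine.set_replica_mode(blank_replica, "RW")   # 6. data is consistent; enable reads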

2.4. Snapshots

The snapshot function enables a volume to be restored to a point in history. Backups in secondary storage can also be built from snapshots.

When a volume is restored from a snapshot, it reflects the state of the volume at the time the snapshot was created.

The snapshot feature is also part of Longhorn's rebuild process. Every time Longhorn detects that a replica is down, it automatically takes a snapshot and starts rebuilding the replica on another node.

2.4.1. Working Principle of snapshot

Snapshots are like the layers of an image, with the oldest snapshot serving as the base layer and newer snapshots layered on top. Data is included in a newer snapshot only if it overwrites data in an older snapshot. Together, the chain of snapshots shows the current state of the data.

A snapshot cannot be changed after it is created unless it is deleted, in which case its changes are merged with the next most recent snapshot. New data is always written to the live version. New snapshots are always created from live data.

When a new snapshot is taken, the live data becomes the newest snapshot. Then a new, blank version of the live data is created, taking the place of the old live data.
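
Continuing the illustrative Replica model from section 2.3.1 (not Longhorn's real code), taking a snapshot simply freezes the current live layer and starts a fresh, blank one:

    # Usage of the illustrative Replica model from section 2.3.1.
    r = Replica(num_blocks=8)
    r.write(0, b"A" * 4096)
    r.take_snapshot()                 # live data becomes the newest snapshot
    r.write(0, b"B" * 4096)           # new writes go to the fresh, blank live layer

    assert r.read(0) == b"B" * 4096            # read index points at the live data
    assert r.snapshots[-1][0] == b"A" * 4096   # the snapshot still holds the old value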

2.4.2. Periodic Snapshot

To reduce the space used by snapshots, the user can schedule recurring snapshots or backups with a number of snapshots or backups to retain; Longhorn then automatically creates new snapshots or backups on schedule and cleans up any excess ones.

2.4.3. Deleting a Snapshot

You can manually delete unnecessary snapshots through the UI. When a snapshot is deleted, the system marks it as removed.

In Longhorn, you cannot delete the latest snapshot. This is because whenever a snapshot is deleted, Longhorn merges its contents with the next snapshot so that the correct contents are retained for the next and future snapshots.

However, Longhorn cannot do this for the latest snapshot, because there is no more recent snapshot to merge the deleted snapshot into. The snapshot after the latest snapshot is the volume head, which holds the live data being read and written by the user, so the merge cannot happen there.

Instead, the latest snapshot will be marked as deleted and will be cleaned up next time, if possible.

To clean up the latest snapshot, create a new snapshot and delete the previous “latest” snapshot.

2.4.4. Storing snapshots

Snapshots are stored locally, as part of each replica of the volume. They are stored on the disks of the nodes in the Kubernetes cluster. Snapshots are stored in the same location as the volume data on the physical disks of the host.

2.4.5. Crash consistency

Longhorn is a crash-consistent block storage solution.

It is normal for the operating system to keep content in a cache before writing it to the block layer. This means that if all the replicas are shut down, Longhorn may not contain the changes that occurred immediately before the shutdown, because the content was held in the operating system-level cache and had not yet been transferred to the Longhorn system.

This problem is similar to what can happen when a desktop computer is shut down due to a power outage. After power is restored, you may find some corrupted files on your hard drive.

To force data to be written to the block layer at any given moment, you can run the sync command manually on the node, or you can unmount the disk. In either case, the operating system writes the content from the cache to the block layer.

Longhorn automatically runs the sync command before taking a snapshot.
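
On Linux this corresponds to flushing cached writes down to the block layer. A generic sketch (not Longhorn code; the file path is a placeholder) of doing the same thing from an application before a snapshot is taken:

    # Generic Linux sketch (not Longhorn code): flush cached writes to the block
    # layer so that a snapshot taken right afterwards contains them.
    import os

    with open("/mnt/longhorn-volume/data.log", "ab") as f:   # placeholder path
        f.write(b"important record\n")
        f.flush()               # flush the userspace buffer
        os.fsync(f.fileno())    # ask the kernel to write this file's pages to the device

    os.sync()                   # or flush everything, like running the `sync` command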

3. Backup and secondary storage

A backup is an object in the backupstore, which is an NFS or S3-compatible object store outside the Kubernetes cluster. Backups provide a form of secondary storage, so your data can still be retrieved even if your Kubernetes cluster becomes unavailable.

Because volume replication is performed synchronously, and because of network latency, it is difficult to replicate volumes across regions. The backupstore is also used as a medium to address this problem.

After configuring the backup target in the Longhorn Settings, Longhorn can connect to the backup storage and show you a list of existing backups in the Longhorn UI.

If Longhorn runs in a second Kubernetes cluster, it can also synchronize disaster recovery volumes with the backups in secondary storage, so that your data can be recovered more quickly in the second Kubernetes cluster.

3.1. Working Principle of backup

A backup is created using a snapshot as its source, so it reflects the state of the volume's data at the time the snapshot was created.

Compared to snapshots, a backup can be thought of as a flattened version of a chain of snapshots. Similar to the way information is lost when a layered image is converted to a flat image, data is also lost when a chain of snapshots is converted to a backup. In both cases, any overwritten data is lost.

Because backups do not contain snapshots, they do not contain the history of changes to the volume data. After you restore a volume from a backup, the volume initially contains one snapshot. This snapshot is a merged version of all the snapshots in the original chain and reflects the live data of the volume at the time the backup was created.

While a snapshot can be terabytes in size, a backup is made up of 2 MB files.

Each new backup of the same original volume is incremental: it detects and transmits the blocks that changed between snapshots. This is a relatively easy task, because each snapshot is a differencing file and stores only the changes since the last snapshot.

To avoid storing a very large number of small blocks, Longhorn performs backup operations using 2 MB blocks. This means that if any 4K block within a 2 MB boundary changes, Longhorn backs up the entire 2 MB block. This provides the right balance between manageability and efficiency.
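
A simplified sketch of that block layout (illustrative Python, not Longhorn's implementation): the snapshot is cut into 2 MB blocks, each block is checksummed, and only blocks whose checksums are not already present in the backupstore are uploaded.

    # Illustrative sketch of content-addressed 2 MB backup blocks
    # (not Longhorn's actual implementation).
    import hashlib

    BLOCK_SIZE = 2 * 1024 * 1024   # 2 MB backup granularity

    def backup_snapshot(snapshot_path, backupstore):
        """Upload only the 2 MB blocks that the backupstore does not already hold."""
        block_list = []   # ordered (offset, checksum) pairs: the backup's metadata
        with open(snapshot_path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                checksum = hashlib.sha256(block).hexdigest()
                if checksum not in backupstore:    # block not backed up yet
                    backupstore[checksum] = block  # upload (and compress) it once
                block_list.append((offset, checksum))
                offset += len(block)
        return block_list

    # Usage sketch (a dict stands in for the S3/NFS backupstore):
    # metadata = backup_snapshot("/path/to/snapshot.img", backupstore={})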

Figure 3. The relationship between the backup in secondary storage and the snapshot in primary storage

The figure above describes how to create a backup from a snapshot in Longhorn:

  • The primary storage side of the diagram shows one replica of a Longhorn volume in the Kubernetes cluster. The replica consists of a chain of four snapshots. In order from newest to oldest, the snapshots are Live Data, snap3, snap2, and snap1.
  • The secondary storage side of the diagram shows the backupstore in an external object storage service, such as S3.
  • In secondary storage, the color coding of backup-from-snap2 shows that it includes the blue changes from snap1 and the green changes from snap2.

The changes in snap2 did not overwrite the data in snap1, so the changes from both snap1 and snap2 are included in backup-from-snap2.

  • The backup named backup-from-snap3 reflects the state of the volume data when snap3 was created. The color coding and arrows show that backup-from-snap3 includes all the red changes from snap3, but only one of the green changes from snap2.

This is because one of the red changes in snap3 overwrote one of the green changes in snap2. This illustrates how backups do not include the full history of changes, because they merge a snapshot with the snapshots before it.

  • Each backup maintains its own set of 2 MB blocks. Each 2 MB block is backed up only once. The two backups share one green block and one blue block.

When a backup is removed from secondary storage, Longhorn does not delete all the blocks it uses. Instead, it periodically performs garbage collection to clear unused blocks from secondary storage.

The 2 MB blocks for all backups belonging to the same volume are stored in a common directory and can therefore be shared across multiple backups.

To save space, 2 MB blocks that do not change between backups can be reused across the multiple backups that share the same backup volume in secondary storage. Because checksums are used to address the 2 MB blocks, some degree of deduplication is achieved for the 2 MB blocks within the same volume.

Volume-level metadata is stored in volume.cfg. The metadata file for each backup (for example, snap2.cfg) is relatively small, because it contains only the offsets and checksums of all the 2 MB blocks in the backup.

Each 2 MB block (.blk file) is compressed.

3.2. Periodically back up data

Backups can be scheduled using recurring snapshots and backups, but can also be performed as required.

It is recommended to schedule periodic backups for your volumes. If the backupstore is unavailable, schedule periodic snapshots instead.

Creating a backup involves copying data over a network, so it takes time.

3.3. Disaster recovery volumes

A disaster recovery (DR) volume is a special volume that stores data in a backup cluster in case the entire primary cluster fails. DR volumes are used to improve the resiliency of Longhorn volumes.

DR volumes are used to restore data from backups. Therefore, the following operations cannot be performed on DR volumes before they are activated:

  • Creating, deleting, and restoring snapshots
  • Creating a backup
  • Creating a PersistentVolume
  • Creating a PersistentVolumeClaim

DR volumes can be created from volume backups in the backupstore. After a DR volume is created, Longhorn monitors its original backup volume and incrementally restores from the latest backup. A backup volume is an object in the backupstore that contains multiple backups of the same volume.

If the original volume in the primary cluster breaks down, the DR volume in the backup cluster can be activated immediately. This greatly reduces the time required to restore data from the backup storage to the volumes in the backup cluster.

When a DR volume is activated, Longhorn checks the last backup of the original volume. If that backup has not yet been restored, the restore starts and the activation fails. In that case, you need to wait until the restore completes and then retry the activation.

If any DR volumes exist, the backup target in the Longhorn setting cannot be updated.

When a DR volume is activated, it becomes a regular Longhorn volume and cannot be deactivated.

3.4. Backup storage update interval, RTO, and RPO

Usually, incremental restores are triggered by periodic backupstore updates. You can set the backupstore update interval in Settings – General – Backupstore Poll Interval.

Note that this interval can potentially affect the recovery time objective (RTO). If it is too long, there may be a large amount of data for the DR volume to restore, which will take a long time.

As for the recovery point objective (RPO), it is determined by the recurring backup schedule of the backup volume. If the recurring backup schedule for normal volume A creates a backup every hour, the RPO is one hour. For how to set up recurring backups in Longhorn, see the recurring snapshots and backups documentation.

The following analysis assumes that a backup of the volume is created every hour and that incremental recovery from a backup takes five minutes:

  • If the backupstore poll interval is 30 minutes, at most one new backup will exist since the last restoration. Restoring one backup takes 5 minutes, so the RTO is 5 minutes.
  • If the backupstore poll interval is 12 hours, up to 12 new backups may exist since the last restoration. The time to restore them is 5 × 12 = 60 minutes, so the RTO is 60 minutes (see the worked example below).
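
The arithmetic can be written out explicitly. This is a worked example under the assumptions above (hourly backups, five minutes of restore time per backup), not a Longhorn API:

    # Worked example of the worst-case RTO under the stated assumptions.
    def worst_case_rto_minutes(poll_interval_hours,
                               backup_interval_hours=1.0,
                               restore_minutes_per_backup=5.0):
        # Number of backups that can accumulate between two backupstore polls:
        pending_backups = max(1, poll_interval_hours / backup_interval_hours)
        return pending_backups * restore_minutes_per_backup

    print(worst_case_rto_minutes(0.5))   # 30-minute polling -> 5.0 minutes
    print(worst_case_rto_minutes(12))    # 12-hour polling   -> 60.0 minutes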

Appendix: How persistent storage works in Kubernetes

To understand persistent storage in Kubernetes, it is important to understand Volumes, PersistentVolumes, PersistentVolumeClaims and StorageClasses, and how they work together.

An important property of a Kubernetes Volume is that it has the same lifecycle as the Pod to which it belongs. If the Pod is lost, the Volume is lost. A PersistentVolume, in contrast, persists in the system until it is deleted by the user. Volumes can also be used to share data between containers within the same Pod, but this is not the primary use case because users typically only have one container per Pod.

A PersistentVolume (PV) is a piece of persistent storage in the Kubernetes cluster, and a PersistentVolumeClaim (PVC) is a storage request. StorageClasses allow new storage to be dynamically configured for a workload as needed.

How Kubernetes workloads use new and existing persistent storage

Broadly speaking, there are two main ways to use persistent storage in Kubernetes:

  • Use an existing persistent volume
  • Dynamically configure a new persistent volume

Existing storage configuration

To use an existing PV, your application needs to use a PVC that is bound to the PV, and the PV should contain at least the minimum amount of resources that the PVC requires.

In other words, a typical workflow for setting up existing storage in Kubernetes is as follows:

  1. Set up the persistent storage volume, in the sense of the physical or virtual storage that you have access to.
  2. Add a PV that refers to the persistent storage.
  3. Add a PVC that refers to the PV.
  4. Mount the PVC as a volume in your workload.

When a PVC requests a piece of storage, the Kubernetes API server tries to match the PVC with a pre-allocated PV as matching volumes become available. If a match is found, the PVC is bound to the PV, and the user starts to use the pre-allocated piece of storage.

If no matching volume exists, the PersistentVolumeClaim remains unbound indefinitely. For example, a cluster provisioned with many 50 Gi PVs would not match a PVC requesting 100 Gi. The PVC can be bound after a 100 Gi PV is added to the cluster.

In other words, you can create an unlimited number of PVCs, but they will only be bound to PVs if the Kubernetes master can find matching PVs with at least the amount of disk space required by each PVC.
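
As an illustration (generic Kubernetes objects with hypothetical names and an invented NFS server address), here is a pre-provisioned PV and a PVC that can bind to it because the PV offers at least the requested capacity. The dicts mirror the YAML manifests you would apply with kubectl.

    # Generic Kubernetes illustration; names and the NFS address are placeholders.
    existing_pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": "pv-50g"},
        "spec": {
            "capacity": {"storage": "50Gi"},
            "accessModes": ["ReadWriteOnce"],
            "nfs": {"server": "nfs.example.com", "path": "/exports/data"},
        },
    }

    claim = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "data-claim"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            # 20Gi <= 50Gi, so this claim can bind to pv-50g; a claim for 100Gi
            # would stay Pending until a large enough PV is added.
            "resources": {"requests": {"storage": "20Gi"}},
        },
    }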

Dynamic storage configuration

For dynamic storage provisioning, your application needs to use a PVC that is bound to a StorageClass. The StorageClass contains the authorization to provision new persistent volumes.

The overall workflow for dynamically provisioning new storage in Kubernetes involves a StorageClass resource:

  1. Add a StorageClass and configure it to automatically provision new storage from the storage you have access to.
  2. Add a PVC that refers to the StorageClass.
  3. Mount the PVC as a volume in your workload.

Kubernetes cluster administrators can use a Kubernetes StorageClass to describe the "classes" of storage they offer. StorageClasses can have different capacity limits, different IOPS, or any other parameters that the provisioner supports. A storage-vendor-specific provisioner is used together with the StorageClass to allocate PVs automatically, following the parameters set in the StorageClass object. In addition, the provisioner can enforce resource quotas and permission requirements for users. In this design, administrators are freed from the unnecessary work of anticipating the need for PVs and allocating them in advance.

When StorageClasses are used, the Kubernetes administrator is no longer responsible for allocating every piece of storage. The administrator only needs to grant users access to a storage pool and decide the quota for each user. The user can then carve out the needed pieces of storage from the pool.

You can also use StorageClass without having to explicitly create a StorageClass object in Kubernetes. Since StorageClass is also a field used to match a PVC with a PV, you can manually create a PV using a custom StorageClass name, and then you can create a PVC that requires a PV with that StorageClass name. Kubernetes can then bind the PVC to the PV using the specified StorageClass name, even if the StorageClass object does not exist as a Kubernetes resource.

Longhorn introduces a Longhorn StorageClass so that Kubernetes workloads can carve out persistent storage from Longhorn as needed.
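
A sketch of such a StorageClass is shown below as a Python dict mirroring the YAML manifest. The provisioner name driver.longhorn.io and the parameters shown follow the stock Longhorn examples, but treat them as assumptions and check your installed version.

    # Sketch of a Longhorn StorageClass; provisioner name and parameters are
    # assumptions based on stock Longhorn 1.1.x examples.
    longhorn_storage_class = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "longhorn"},
        "provisioner": "driver.longhorn.io",    # the Longhorn CSI driver
        "parameters": {
            "numberOfReplicas": "3",            # replicas per dynamically created volume
            "staleReplicaTimeout": "2880",      # minutes before a failed replica is cleaned up
        },
    }

    # A PVC whose spec.storageClassName is "longhorn" then triggers the CSI plugin
    # to provision a new Longhorn volume automatically.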

Horizontal scaling of Kubernetes Workloads with persistent storage

VolumeClaimTemplate is a StatefulSet Spec property that provides a way for block storage solutions to scale Kubernetes workloads horizontally.

This property can be used to create matching PVs and PVCs for the Pods that the StatefulSet creates.

These PVCs are created from a StorageClass, so they can be provisioned automatically when the StatefulSet scales up.

When the StatefulSet scales down, the extra PVs and PVCs remain in the cluster, and they are reused when the StatefulSet scales up again.

VolumeClaimTemplate is important for block storage solutions such as EBS and Longhorn. Because these solutions are essentially ReadWriteOnce, they cannot be shared between Pods.

A Deployment does not work well with persistent storage if more than one Pod runs with persistent data. For multiple Pods, a StatefulSet should be used, as in the sketch below.
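
Here is a generic sketch (hypothetical names; the image and sizes are placeholders) of a StatefulSet whose volumeClaimTemplates entry makes Kubernetes create one Longhorn-backed PVC per Pod as the set scales out.

    # Generic sketch of a StatefulSet using volumeClaimTemplates; names, image,
    # and sizes are placeholders.
    statefulset = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "example-db"},
        "spec": {
            "serviceName": "example-db",
            "replicas": 3,                      # three Pods, hence three PVCs/PVs
            "selector": {"matchLabels": {"app": "example-db"}},
            "template": {
                "metadata": {"labels": {"app": "example-db"}},
                "spec": {
                    "containers": [{
                        "name": "db",
                        "image": "postgres:13",
                        "volumeMounts": [{"name": "data",
                                          "mountPath": "/var/lib/postgresql/data"}],
                    }],
                },
            },
            "volumeClaimTemplates": [{          # one PVC per Pod, created automatically
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],   # not shared between Pods
                    "storageClassName": "longhorn",
                    "resources": {"requests": {"storage": "10Gi"}},
                },
            }],
        },
    }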