Preface

UCloud officially released a new generation of archive storage products based on US3 in August 2020, built on UCloud's new self-developed storage architecture. Compared with standard storage, it cuts storage cost by nearly 80%, and it is priced nearly 30% below similar archive storage products on the market.

According to IDC's forecast, the annual growth of global data volume will reach 175 ZB by 2025, of which only about 15 ZB can actually be stored, a loss rate of more than 91%. In the iceberg model of enterprise data today, 80 percent of data volume is cold data. UCloud believes that in the public cloud there is still plenty of room for archive storage to improve through technical means.

How can the latest high-capacity hardware be used to its fullest to drive storage costs down further? How can user data be kept safe over the long retention periods of archive storage? Answering these questions requires heavy optimization and hardware adaptation across the entire IO path of US3 archive storage, while keeping the product easy to use and sparing users any extra cost.

This article analyzes the software and hardware choices and optimization details of the underlying storage engine of US3 archive storage from two angles: how UCloud uses hard disk technology to improve storage density, and how it optimizes IO scheduling to reduce operating costs.

Using SMR disks and JBOD devices to improve storage density

Reducing hardware cost is mainly a matter of improving storage density. We explored different storage media, including Blu-ray, tape, and hard disk, and also studied the hardware design of Microsoft's Pelican system. Our ultimate goal is that **users can activate and read their data within minutes in an emergency and within hours under normal circumstances, and the minimum storage period does not have to be measured in years.** Therefore, UCloud, drawing on its own strengths in storage technology, set Blu-ray and tape aside for now and chose high-density hard disks to build the archive cloud storage service.

Here’s how data is recorded on a traditional hard disk.

Such traditional hard disks generally use perpendicular magnetic recording (PMR). Data is recorded on tracks that are parallel to each other and do not overlap, so capacity can only grow by adding more tracks.

Compared with these traditional disks, there is a magnetic recording technology, shingled magnetic recording (SMR), that raises track density and therefore the overall capacity of the drive. Before introducing the hardware implementation of SMR disks, we need a bit of background. First, let's zoom in on the disk's magnetic head.

For physical reasons, a disk's write head is much wider than its read head, so reads and writes need different track widths: writing needs the wider track. This asymmetry is what makes higher track density possible. Now let's look at the structure of an SMR disk.

An SMR disk writes each new track partially overlapping the previous one, leaving the previous track narrower and thus packing tracks more densely. As you can see, the overlapping tracks are stacked like roof shingles, hence the name shingled magnetic recording.

From the hardware structure of SMR drives, it is easy to see that the capacity gain comes with real difficulties for writing: once the track following the current one holds data, the current track can no longer be rewritten, because the tracks overlap and the wider write head would destroy the data behind it. In practice, an SMR disk is divided into many zones, and data within each zone can only be appended. About 1% of the zones have non-overlapping tracks; these are called CMR zones and support random reads and writes.
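To make the zone rules concrete, here is a minimal sketch of that zone model in Go. The types, sizes, and byte-granularity interface are illustrative assumptions of this sketch; real SMR drives expose zones through the ZBC/ZAC command set, not an API like this.

```go
package main

import "fmt"

// ZoneKind distinguishes the two zone types found on an SMR drive.
type ZoneKind int

const (
	CMR ZoneKind = iota // conventional zone: random writes allowed
	SMR                 // shingled zone: append-only at the write pointer
)

// Zone is one zone of the drive; sizes and units here are illustrative.
type Zone struct {
	Kind     ZoneKind
	Start    int64 // first byte of the zone on the device
	Len      int64 // zone length in bytes
	WritePtr int64 // next writable offset, relative to Start
}

// Write models the drive's rule: an SMR zone only accepts writes that land
// exactly on its write pointer; a CMR zone accepts writes anywhere.
func (z *Zone) Write(off, n int64) (int64, error) {
	if off < 0 || off+n > z.Len {
		return 0, fmt.Errorf("write outside zone")
	}
	if z.Kind == SMR && off != z.WritePtr {
		return 0, fmt.Errorf("SMR zone: write at %d, write pointer is %d", off, z.WritePtr)
	}
	if end := off + n; end > z.WritePtr {
		z.WritePtr = end
	}
	return z.Start + off, nil
}

func main() {
	z := Zone{Kind: SMR, Start: 0, Len: 256 << 20} // a hypothetical 256 MiB zone
	if _, err := z.Write(4096, 4096); err != nil {
		fmt.Println(err) // rejected: not at the write pointer
	}
	if addr, err := z.Write(0, 4096); err == nil {
		fmt.Println("appended at device offset", addr)
	}
}
```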

Predictably, shielding the upper layers from these SMR constraints has a cost. There are two off-the-shelf ways to hide the sequential-write restriction: device managed and host aware. Either way, the drive converts random I/Os into sequential I/Os internally, which brings write amplification and degraded read performance, shortens disk life under certain I/O patterns, and leaves the upper layer with no control over any of it.

UCloud's storage team has accumulated experience on several existing products with bypassing the file system and operating directly on the block layer. To avoid a hard dependency on a file system in the underlying storage, we chose the host managed mode to manage reads and writes on SMR disks.
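As a rough illustration of what "bypassing the file system" means, the sketch below opens a raw block device with O_DIRECT on Linux and issues an aligned write. The device path and block size are assumptions, and a host-managed SMR drive additionally requires zone-aware IO (e.g. through the kernel's zoned block device support), which this sketch omits.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

const blockSize = 4096 // assumed logical block size of the device

// alignedBlock returns an n-byte slice whose base address is block-aligned,
// which O_DIRECT requires for both the buffer and the file offset.
func alignedBlock(n int) []byte {
	raw := make([]byte, n+blockSize)
	rem := int(uintptr(unsafe.Pointer(&raw[0])) & (blockSize - 1))
	shift := 0
	if rem != 0 {
		shift = blockSize - rem
	}
	return raw[shift : shift+n]
}

func main() {
	// Hypothetical device path; running this needs root and a scratch disk.
	f, err := os.OpenFile("/dev/sdX", os.O_RDWR|syscall.O_DIRECT, 0)
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	buf := alignedBlock(blockSize)
	copy(buf, "written straight to the block layer")
	if _, err := f.WriteAt(buf, 0); err != nil { // offset 0 is block-aligned
		fmt.Println("write:", err)
	}
}
```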

When data is persisted to disk, we bundle a small amount of metadata with it and write the two together. There are three considerations behind this (a sketch of the resulting record layout follows the list):

First, this metadata includes an overall CRC of the IO, which guards against silent data corruption on the disk and improves data reliability for users archiving to US3. In a cold-storage scenario where massive amounts of data sit for a long time, bit flips and similar disk errors deserve special attention.

Second, if our metadata becomes unavailable due to some devastating software or hardware failure, we can rebuild the overall structure by re-reading the metadata written alongside each IO. This recovery is expensive, of course, and is meant as the handling scheme for black swan events.

Third, because metadata does not need a separate update, each IO is written only once, reducing write amplification, which matters all the more on HDDs, where random IO is not a strong point.
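Here is a minimal sketch of what such a self-describing record could look like. The header fields, magic value, and the choice of CRC32C are assumptions for illustration, not US3's actual on-disk format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// recordHeader is the small metadata written together with each IO.
// The layout and fields are illustrative assumptions.
type recordHeader struct {
	Magic  uint32 // lets a rescan of the raw disk find record boundaries
	Length uint32 // payload length in bytes
	CRC    uint32 // CRC32C of the payload, to detect silent corruption
}

const headerSize = 12

// encodeRecord bundles header and payload into one sequential write,
// so no second IO is needed to persist the metadata.
func encodeRecord(payload []byte) []byte {
	h := recordHeader{
		Magic:  0x55533352, // arbitrary marker for this sketch
		Length: uint32(len(payload)),
		CRC:    crc32.Checksum(payload, castagnoli),
	}
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, h)
	buf.Write(payload)
	return buf.Bytes()
}

// decodeRecord verifies the CRC on read, surfacing bit flips as errors.
func decodeRecord(b []byte) ([]byte, error) {
	var h recordHeader
	if err := binary.Read(bytes.NewReader(b), binary.LittleEndian, &h); err != nil {
		return nil, err
	}
	if len(b) < headerSize+int(h.Length) {
		return nil, fmt.Errorf("truncated record")
	}
	payload := b[headerSize : headerSize+h.Length]
	if crc32.Checksum(payload, castagnoli) != h.CRC {
		return nil, fmt.Errorf("CRC mismatch: possible silent data corruption")
	}
	return payload, nil
}

func main() {
	rec := encodeRecord([]byte("archived object chunk"))
	if p, err := decodeRecord(rec); err == nil {
		fmt.Printf("recovered %d bytes\n", len(p))
	}
}
```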

Several CMR zones at the head of the disk are reserved for the disk's self-describing metadata, stored in redundant copies. Even so, the full 1% of CMR zones is far more than metadata needs, so the remaining CMR zones are abstracted, together with the append-only SMR zones, into append-only data zones, maximizing the usable disk space.

At this point we had raised the density of a single disk, increasing its capacity by 150%. We then raised the disk density of a single cabinet to push overall storage density further. Instead of the traditional high-density 36-disk model, we adopted JBOD enclosures. Thanks to UCloud's self-built data centers, the old constraints of per-cabinet floor loading and the scarcity of high-power cabinets no longer apply, so a single cabinet can house more JBOD devices, multiplying per-rack storage capacity by 5.375 and increasing the number of hard disks by 59%.

In addition, we use a dual-head hardware architecture in which all disks in a JBOD are visible to both heads at the same time. If one machine goes down, our leader-election algorithm immediately switches to the other machine, preserving service availability.
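The article does not detail the leader-election mechanism itself, but a lease-based scheme is one common shape for it. The sketch below is a simplified, single-process illustration under that assumption; a real deployment would arbitrate the lease through a consistent external store.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leaseArbiter stands in for whatever consistent store decides which head
// currently owns the JBOD; its details are assumptions of this sketch.
type leaseArbiter struct {
	mu      sync.Mutex
	owner   string
	expires time.Time
}

// tryAcquire grants the lease if it is free, expired, or already ours, so a
// healthy head keeps renewing and the standby takes over once renewals stop.
func (a *leaseArbiter) tryAcquire(head string, ttl time.Duration) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	now := time.Now()
	if a.owner == "" || a.owner == head || now.After(a.expires) {
		a.owner, a.expires = head, now.Add(ttl)
		return true
	}
	return false
}

func main() {
	a := &leaseArbiter{}
	fmt.Println(a.tryAcquire("head-A", 50*time.Millisecond)) // true: A serves IO
	fmt.Println(a.tryAcquire("head-B", 50*time.Millisecond)) // false: B stands by
	time.Sleep(60 * time.Millisecond)                        // A stops renewing
	fmt.Println(a.tryAcquire("head-B", 50*time.Millisecond)) // true: B takes over
}
```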

Optimizing the IO scheduling algorithm to reduce operating costs

In essence, higher density lowers capital expenditure (CAPEX). For archive storage, though, long-term operating expense (OPEX) is also a large share of the total. The optimization described here reduces our electricity bill (i.e., the OPEX cost) without hurting user experience or storage performance.

To that end, we added a scheduling algorithm built around hard disk spin-up and spin-down at the IO scheduling layer. It eliminates the power wasted by large numbers of idling disks in the cold-storage workload of these high-density machines.

The overall scheduling algorithm has many factors to weigh. First, we divide the disks in a JBOD into several disk groups along fault domains, so that with an appropriate EC stripe configuration and number of JBODs, both disk failures and JBOD failures can be tolerated. Spin-up and spin-down are then performed on whole disk groups.
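A minimal sketch of the placement constraint this implies: the shards of a hypothetical k+m erasure-coded stripe must be spread across fault domains so that losing any one domain never destroys more shards than parity can rebuild. The 4+2 geometry below is an assumption, not US3's actual stripe size.

```go
package main

import "fmt"

// placeStripe spreads the k+m shards of one EC stripe round-robin across
// fault domains (e.g. JBODs or disk groups) and rejects layouts where the
// loss of a single domain would destroy more shards than parity can repair.
func placeStripe(data, parity, domains int) (map[int][]int, error) {
	placement := make(map[int][]int)
	for s := 0; s < data+parity; s++ {
		d := s % domains
		placement[d] = append(placement[d], s)
	}
	for d, shards := range placement {
		if len(shards) > parity {
			return nil, fmt.Errorf("domain %d holds %d shards, but only %d losses are repairable",
				d, len(shards), parity)
		}
	}
	return placement, nil
}

func main() {
	if p, err := placeStripe(4, 2, 3); err == nil {
		fmt.Println("4+2 over 3 domains:", p) // 2 shards/domain: one domain loss is repairable
	}
	if _, err := placeStripe(4, 2, 2); err != nil {
		fmt.Println(err) // 3 shards/domain: one domain loss would be fatal
	}
}
```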

At the same time, while still serving users' urgent reads, we must keep each disk's spin-up/spin-down count within a safe range. We amortize the number of power cycles a disk can tolerate over its service life down to an hourly allowance for each day, and the algorithm guarantees a minimum cooldown interval between consecutive spin cycles of any disk. This usage pattern lowers user cost while preserving the reliability of the data on the disks.
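A sketch of that amortization follows; the cycle rating, service life, and cooldown interval are illustrative assumptions, not the real drive specifications.

```go
package main

import (
	"fmt"
	"time"
)

const (
	ratedCycles = 50000                    // assumed start/stop rating of the drive
	serviceLife = 5 * 365 * 24 * time.Hour // assumed service life of 5 years
	cooldown    = 10 * time.Minute         // assumed minimum gap between cycles
)

// hourlyBudget amortizes the lifetime cycle rating into cycles per hour.
func hourlyBudget() int {
	return ratedCycles / int(serviceLife/time.Hour) // ~1 cycle/hour here
}

// diskGroup tracks per-group spin state; scheduling acts on whole groups.
type diskGroup struct {
	lastSpin time.Time
	usedHour int // cycles consumed in the current hour window
}

// maySpin allows a spin-up/down only after the cooldown has elapsed and
// while the hourly budget remains; the caller increments usedHour on each
// spin, and urgent reads wait for the next eligible slot otherwise.
func (g *diskGroup) maySpin(now time.Time) bool {
	return now.Sub(g.lastSpin) >= cooldown && g.usedHour < hourlyBudget()
}

func main() {
	g := &diskGroup{lastSpin: time.Now().Add(-15 * time.Minute)}
	fmt.Println("budget:", hourlyBudget(), "cycle(s)/hour, may spin:", g.maySpin(time.Now()))
}
```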

Beyond reliability, we must make sure our hardware can fully sustain write performance. Given the nature of SMR drives and our business logic, all writes, including compaction, happen sequentially. We therefore size the EC stripes so that the write bandwidth of one disk group covers the network bandwidth of the whole machine, wasting none of the performance.
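A back-of-the-envelope check of that sizing; all figures here (NIC speed, per-disk throughput, EC geometry, group size) are illustrative assumptions.

```go
package main

import "fmt"

func main() {
	const (
		nicGbps    = 10.0  // assumed NIC bandwidth of one head
		diskMBps   = 180.0 // assumed sequential write throughput per disk
		dataShards = 4     // assumed EC geometry: 4 data + 2 parity
		parity     = 2
		groupDisks = 12 // assumed disks spun up in one group
	)

	nicMBps := nicGbps * 1000 / 8 // 10 Gbps = 1250 MB/s of incoming data
	// Erasure coding inflates the bytes hitting disk by (k+m)/k.
	needed := nicMBps * float64(dataShards+parity) / float64(dataShards)
	available := diskMBps * float64(groupDisks)

	fmt.Printf("disks must absorb %.0f MB/s; the group provides %.0f MB/s\n",
		needed, available)
	// 1875 MB/s needed vs 2160 MB/s available: one spun-up group covers the
	// NIC, so no extra groups must stay powered just for write bandwidth.
}
```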

In closing

Based on the two major design considerations above, raising disk storage density and reducing operating costs (i.e., electricity), we built the underlying storage engine of US3 archive storage (pictured above). It slashes the cost of US3 archive storage while guaranteeing high reliability for cold data kept in storage for long periods.

US3 archive storage will keep improving the product's user experience on many fronts, such as more convenient automatic data cooling and smarter cost reduction, so that users can fully enjoy the price dividend of UCloud's technical innovation. We are also exploring other storage media, such as tape, for deep-archive scenarios, letting users store massive amounts of cold data without touching the complex underlying hardware directly.