The importance of latency metrics to storage system performance

We analyze a distributed storage system from three dimensions: reliability, ease of use, and performance.

Reliability: the cornerstone of a storage system. A storage system must provide at least 99.99% data reliability; data loss or corruption is fatal, especially in large-scale distributed clusters such as big data and cloud storage platforms.

Ease of use: system administrators care most about product design, troubleshooting, and system scalability.

Performance: if reliability is the cornerstone of a storage system, performance is its soul. An excellent storage system is impossible without both high reliability and high performance.

This article analyzes distributed storage systems from the performance dimension. So how do we evaluate the performance of a distributed storage system?

Let’s take a look at the main parameters for profiling a storage system: IO Type, IO Mode, IO Size, and IO Concurrency. Different combinations of these parameter values correspond to different business models, as shown in the following table:

Table 1: Common service types

Note 1: SPC-1 measures the IOPS of a storage system under random small I/O loads, while SPC-2 measures the bandwidth of a storage system under continuous read/write applications with heavy loads.

Note 2: SPC-1 abstracts the test area into ASUs (Application Storage Units), which fall into three categories: ASU1, a temporary data area; ASU2, a user data area; and ASU3, a log area.

As the table above shows, the variety of service types makes comprehensive coverage difficult, so three performance indicators are used to evaluate a storage system: IOPS, Throughput, and Latency. Generally, Throughput is used to evaluate system performance for large IO, while IOPS is used for small IO. So what is the role of Latency? Latency is often ignored in performance evaluation: during testing, the IOPS and Throughput of the system look satisfactory, yet after the business system goes live, the Latency turns out to be too high.
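To make the relationship between these indicators concrete, here is a small Python calculation (the workload numbers are illustrative assumptions, not measurements). Throughput is simply IOPS multiplied by the IO size, which is why large-IO workloads are judged by Throughput and small-IO workloads by IOPS:

```python
def throughput_mib_s(iops: float, io_size_kib: float) -> float:
    """Throughput (MiB/s) implied by a given IOPS at a given I/O size."""
    return iops * io_size_kib / 1024

# Illustrative workloads (assumed numbers, not benchmark results):
print(throughput_mib_s(iops=100_000, io_size_kib=4))     # small I/O: ~390 MiB/s
print(throughput_mib_s(iops=2_000, io_size_kib=1024))    # large I/O: 2000 MiB/s
```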

Here’s a quote from a blog post: “Even if the vendor’s new storage system miraculously achieves 1,000,000 4K random IOPS under a 60/40 mixed read/write workload, I’d still like to know what the latency is.” Dimitris Krekoukias wrote in a recent blog post on IOPS and latency that one vendor’s system can achieve 15,000 IOPS with an average latency of 25 ms, yet the database engine still reported high I/O latency.

A commenter on Krekoukias’s Recovery Monkey blog detailed how a credit card processor avoids slowing down operations such as fraud prevention or withdrawal authorization only when latency stays below 4 ms. In the same post, Krekoukias explained how to understand latency; a simple paraphrase: at 1,000 IOs per second, each IO takes on average 1000 ms / 1000 IOPS = 1 ms, so the average latency per IO is 1 ms. This calculation, however, ignores concurrency. For asynchronous, highly concurrent services such a storage system can meet the requirements, but whether it can satisfy OLTP and OLAP services, which are generally synchronous and low-concurrency, remains questionable.
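The missing factor is exactly that concurrency. By Little’s Law, average latency equals the number of outstanding IOs divided by IOPS, so 1,000 IOPS only implies 1 ms per IO at a queue depth of 1. A minimal sketch, with assumed queue depths:

```python
def avg_latency_ms(iops: float, outstanding_ios: int) -> float:
    """Little's Law: average latency = concurrency / throughput."""
    return outstanding_ios / iops * 1000

for qd in (1, 8, 32):  # assumed queue depths, for illustration only
    print(f"1000 IOPS at QD={qd}: {avg_latency_ms(1000, qd):.0f} ms per I/O")
# QD=1 -> 1 ms, QD=8 -> 8 ms, QD=32 -> 32 ms: the same IOPS figure can hide
# very different per-I/O latency.
```

At a queue depth of 32, the same 1,000 IOPS corresponds to 32 ms per IO, which is what a synchronous, low-concurrency database actually experiences.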

The data above shows that the key to high performance is the combination of high IOPS and low Latency: when a storage system delivers higher IOPS, the latency of each individual I/O must not rise too much at the same time, or the performance of the business system will suffer. For example, JetStress recommends that the average latency be less than 20 ms and the maximum latency less than 100 ms.

Latency problems in distributed storage

Let’s first take a look at latency in traditional storage. Traditional storage systems have a natural advantage in I/O latency. The first advantage is a short physical I/O path: the JBOD is usually attached directly to the back end of the controller enclosure (the head unit). The second is the use of RAID 5, RAID 10, or RAID 2.0 data protection algorithms, which operate at the disk or chunk level and have much lower overhead than the failure-domain-based replica mode.

Figure 1: Traditional SAN IO architecture

Figure 2: How RAID 2.0 works

Distributed storage, in contrast, consists of many servers, sometimes hundreds, and uses the replica mode. An I/O is therefore processed on multiple replica servers over the network, and each replica runs a data consistency check algorithm. All of these operations increase I/O latency.

Figure 3: Distributed cluster IO flow

As the figure above shows, in a distributed storage system with three replicas, one write I/O must be completed on the leader and both followers before it is acknowledged to the user.
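To get a feel for why waiting on every replica raises latency, here is a rough simulation in Python. The per-replica latency distribution is an assumption chosen purely for illustration; the point is that the write completes only when the slowest of the three replicas finishes, so its average latency is noticeably higher than a single replica’s average:

```python
import random

random.seed(0)
N = 100_000
single, replicated = [], []
for _ in range(N):
    # Assumed per-replica latency model: 1 ms base plus an exponential tail.
    lats = [1.0 + random.expovariate(1 / 0.5) for _ in range(3)]
    single.append(lats[0])        # local write: one device only
    replicated.append(max(lats))  # 3-replica write: wait for the slowest copy

print(f"avg single-replica latency : {sum(single) / N:.2f} ms")
print(f"avg 3-replica write latency: {sum(replicated) / N:.2f} ms")
```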

How much of a performance impact does the replica pattern of distributed storage really have? For a sense of perspective, let’s look at two performance comparisons:

Figure 4: Running OLTP services on local storage

Figure 5: Running OLTP services on distributed storage

According to the performance data in the above two figures, the I/O latency of distributed storage is indeed very high.

Another performance problem of distributed storage systems is I/O jitter, which shows up in latency: the average latency deviates widely from the minimum and maximum latency. This is caused mainly by the architecture of distributed systems and by improper handling of I/O requests. For example, if one replica responds slowly, the latency of the entire I/O increases; or the consistency algorithm causes I/Os to pile up because they are sent to the same storage node.
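Jitter is easiest to see when latency samples are summarized by more than the average. A small helper (the sample values below are made up for illustration) shows how a few slow outliers barely move the mean but blow up the maximum and the 99th percentile:

```python
import statistics

def jitter_report(latencies_ms):
    s = sorted(latencies_ms)
    avg = statistics.mean(s)
    return {
        "min": s[0],
        "avg": round(avg, 2),
        "p99": s[int(len(s) * 0.99) - 1],
        "max": s[-1],
        "max/avg": round(s[-1] / avg, 1),  # rough measure of jitter
    }

# Hypothetical samples: mostly ~1 ms, with a few slow-replica outliers.
samples = [1.0] * 95 + [1.2] * 3 + [18.0, 42.0]
print(jitter_report(samples))
```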

Yan Rong develops a distributed storage cache engine

How to reduce Latency in distributed storage is in fact a systemic and complex problem. Before diving in, let’s take a look at what Alibaba Cloud storage and Huawei’s flash arrays have done:

Alibaba Cloud ESSD

  • Hardware upgrades: servers, SSDs, networking (40 Gb/s, 25 Gb/s), network switches, and the RDMA protocol
  • Network communication framework: Luna, combining RDMA, a user-mode driver, and a private I/O protocol
  • Thread model improvements: user-mode driver, the addition of coroutines, and a Run-to-Completion mode
  • Data layout: append-only writes, with metadata recording the location of the data, plus finer-grained layering and abstraction; customized policy modules for different I/O attributes to meet different I/O profile requirements (sketched after this list)
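The “append write with metadata recording the data’s location” idea can be sketched very simply. This is a generic append-only log illustration, not Alibaba Cloud’s implementation, and all names are made up:

```python
class AppendLog:
    """Toy append-only data layout: data is only ever appended, and a small
    metadata index records where each key's latest version lives
    (offset, length). Overwrites never modify old bytes in place, which
    keeps the write path sequential."""

    def __init__(self):
        self._log = bytearray()
        self._index = {}                         # key -> (offset, length)

    def write(self, key: str, data: bytes) -> None:
        offset = len(self._log)
        self._log += data                        # sequential append
        self._index[key] = (offset, len(data))   # metadata records the location

    def read(self, key: str) -> bytes:
        offset, length = self._index[key]
        return bytes(self._log[offset:offset + length])

log = AppendLog()
log.write("blk42", b"old")
log.write("blk42", b"new contents")              # overwrite = append + index update
print(log.read("blk42"))                         # b'new contents'
```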

Huawei Flash storage OceanStor Dorado V3

  • Hardware optimization: the disk enclosure is directly connected to the SAS controller enclosure at 12 Gbps, together with NVMe SSDs
  • Latency-sensitive critical processing is not frequently interrupted or blocked by other tasks: processor grouping divides the CPU cores into partitions according to service requirements, and critical services run within their partitions without interruption to keep request latency stable and low (see the generic illustration after this list)
  • Read requests and writes to the cache are given priority for the key processing resources of the storage system, including CPU, memory, and concurrent access to disks
  • Asynchronous cache flushing runs at low priority and aggregates requests effectively, so that large blocks are written to the SSDs sequentially; partitioning hot and cold data optimizes request processing efficiency, reduces write amplification, and improves performance
  • The SSDs’ garbage collection mechanism is fully leveraged, reducing data migration during block erasure and thus the impact on SSD performance and service life
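Processor grouping itself is a generic Linux technique rather than anything vendor-specific. Purely as an illustration (this is not Huawei’s implementation, and the core numbers are arbitrary), a latency-critical process can restrict itself to a dedicated set of cores:

```python
import os

# Illustration only: reserve CPUs 0-3 for this (latency-critical) process,
# leaving other cores for background work. os.sched_setaffinity is
# Linux-specific; the core numbers here are assumptions.
os.sched_setaffinity(0, {0, 1, 2, 3})
print("allowed CPUs:", sorted(os.sched_getaffinity(0)))
```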

And so on.

It can be seen that the main methods for improving performance and reducing latency come down to the following:

  • Hardware upgrades, such as NVMe and RDMA
  • Optimizing the IO path and communication framework
  • Optimizing the processing time of each module
  • Optimizing the disk layout
  • Adding a data caching layer

Taking NVMe as an example: the NVMe driver bypasses the SCSI protocol stack and interacts directly with the generic block layer, which reduces I/O latency, while the block layer’s multi-queue mechanism further improves performance. The drawback of this approach is the higher cost to customers.

Figure 7: Generic storage device stack versus NVMe stack
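On a Linux host, the multi-queue setup of an NVMe device can be inspected through sysfs. A quick check is sketched below; the device name nvme0n1 is an assumption, and the exact sysfs layout can vary with kernel version:

```python
from pathlib import Path

dev = Path("/sys/block/nvme0n1")                 # assumed device name
sched = (dev / "queue" / "scheduler").read_text().strip()
print("I/O scheduler:", sched)                   # often "[none]" for NVMe

mq = dev / "mq"                                  # one subdirectory per hardware queue
if mq.is_dir():
    print("hardware queues:", sum(1 for _ in mq.iterdir()))
```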

A common and affordable way for storage vendors to reduce latency is to add a data caching layer on the storage nodes: one SSD is configured as a cache in front of several HDDs, typically using the open-source BCache solution. As the following figure shows, however, the performance improvement of this approach is limited, mainly because the I/O path through the distributed storage core layer remains long and its logic complex. The open-source BCache solution also has problems of its own: although BCache is open source, most users cannot maintain it themselves when something goes wrong; it increases the difficulty of operations and maintenance; and its ability to troubleshoot member disks is weak.

Figure 8: Back-end caching

Drawing on years of technical experience and on the design ideas of vendors at home and abroad, the Yan Rong R&D team designed both front-end and back-end caches into its own distributed storage software: on top of the back-end cache, a client-side cache layer is added. Part of the storage core’s functionality is split out, and a distributed cache read-write layer is implemented between the network layer and the storage core layer. Once the network layer receives a request, the data is written to this cache layer, providing distributed cached reads and writes while ensuring data reliability. The overall design idea is as follows:

Figure 9: Distributed read and write cache
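As a conceptual sketch only (this is not Yan Rong’s actual cache engine, and every name below is made up), the heart of such a write-back layer is to acknowledge a write once it is safely in the cache tier and to flush it to the slower storage core asynchronously:

```python
import threading
import time
from collections import OrderedDict

class WriteBackCache:
    """Deliberately simplified, single-node sketch of a write-back cache.

    Writes are acknowledged as soon as they land in the fast cache tier;
    a background thread flushes dirty entries to the slow back end."""

    def __init__(self, backend_write, flush_interval=0.5):
        self._backend_write = backend_write
        self._dirty = OrderedDict()              # key -> data, in arrival order
        self._lock = threading.Lock()
        flusher = threading.Thread(
            target=self._flush_loop, args=(flush_interval,), daemon=True
        )
        flusher.start()

    def write(self, key, data):
        with self._lock:
            self._dirty[key] = data              # acknowledged once it is cached
        return "ok"

    def _flush_loop(self, interval):
        while True:
            time.sleep(interval)
            with self._lock:
                batch, self._dirty = self._dirty, OrderedDict()
            for key, data in batch.items():      # aggregate, then write back
                self._backend_write(key, data)

cache = WriteBackCache(backend_write=lambda k, v: print("flushed", k))
cache.write("blk7", b"hello")
time.sleep(1)                                    # give the flusher a chance to run
```

A real engine must also replicate the cached data before acknowledging the write, otherwise a node failure would lose acknowledged data; that is the data-reliability requirement mentioned above, and it is omitted from this sketch.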

Compared with other back-end caching solutions, Yan Rong’s distributed storage software delivers a significant performance improvement: IOPS increases by nearly 30%, reaching more than 85% of the performance of a raw SSD, while latency is reduced by nearly a factor of three. The detailed comparison data is as follows:

Figure 10: Performance comparison

Building on the existing design, Yan Rong’s R&D team continues to iterate and plans the overall cache engine architecture for the next product release as follows: a multi-level cache combining local read caching, distributed read caching, and back-end read-write caching, which can meet the needs of different businesses for high and stable performance. Multiple levels of cache on the I/O path make the I/O of the whole cluster more stable and reduce I/O fluctuation at the I/O-processing level. Performance is expected to exceed 90% of a raw SSD.
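The planned read path can be summarized as local cache first, then the distributed cache, then the back end. A trivial sketch of that lookup order (all names are illustrative, not Yan Rong’s API):

```python
def cached_read(key, local_cache, dist_cache, backend_read):
    """Illustrative multi-level read path: local -> distributed -> back end."""
    if key in local_cache:
        return local_cache[key]
    if key in dist_cache:
        local_cache[key] = dist_cache[key]       # promote to the local cache
        return local_cache[key]
    data = backend_read(key)                     # miss everywhere: hit the back end
    dist_cache[key] = data
    local_cache[key] = data
    return data
```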

Figure 11: Distributed multi-level cache