1. What did we do in 2017?

As early as 2017, Dr. Wang Jian sparked a heated discussion on whether "IDC as a Computer" could be achieved. To achieve it, storage and compute must be separated, so that the two kinds of resources can be scheduled independently. Of all the workloads to migrate, the database is the hardest, because databases demand low and stable I/O latency. From an industry perspective, however, storage-compute separation is the direction technology is heading: systems such as Google Spanner and Amazon Aurora have already realized it.

So in 2017 we committed firmly to achieving storage-compute separation for the database. And in 2017 we did: based on Pangu and AliDFS (a Ceph branch), we carried 10% of the transaction volume in the Zhangbei cell with storage and compute separated. 2017 was thus the first year of database storage-compute separation, laying a solid foundation for the large-scale rollout in 2018.

2. Technological breakthroughs in 2018

If 2017 was the breakthrough year for database storage-compute separation, 2018 was the year of pursuing extreme performance and of moving from experiment to large-scale deployment; the technical challenges can be imagined. Building on 2017, the challenge in 2018 was to make storage-compute separation higher-performance, ubiquitous, versatile, and simple.



In 2018, to push database I/O performance and throughput to the limit under storage-compute separation, we developed DADI DBFS, a user-mode cluster file system. By consolidating this technology into DADI DBFS, we enabled storage-compute separation for the Group's transaction databases at full-cell scale. So what technical innovations did DBFS make to become a mid-tier storage product?

2.1 User-mode technology

2.1.1 Zero copy

Zero copy on the I/O path is achieved by bypassing the kernel entirely in user mode. Avoiding the kernel copies yields a significant gain in throughput and latency.

Previously, in kernel mode, data was copied twice: once when the user-mode process copied data into the kernel, and once when the kernel copied it to the user-mode network-forwarding process. Both copies hurt overall throughput and latency.

After switching to pure user mode, we use a polling model to issue I/O requests. To address the CPU consumption of polling, we use adaptive sleep, so that cores are not wasted when the queue is idle.
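Below is a minimal sketch of the adaptive-sleep polling idea; it is an illustration of the technique, not the DBFS source, and poll_cq() is a hypothetical stand-in for the real completion-queue check.

/* Adaptive-sleep polling sketch: poll the completion queue in user
 * mode, and back off exponentially (up to 1 ms) when idle so that
 * busy polling does not burn a whole core. */
#include <stdbool.h>
#include <time.h>

extern bool poll_cq(void);          /* hypothetical completion-queue check */

void poll_loop(void)
{
    unsigned idle_us = 0;
    for (;;) {
        if (poll_cq()) {
            idle_us = 0;            /* found work: resume hot polling */
            continue;
        }
        if (idle_us < 1000)
            idle_us = idle_us ? idle_us * 2 : 1;  /* grow the nap */
        struct timespec ts = { 0, (long)idle_us * 1000 };
        nanosleep(&ts, NULL);       /* yield the core while idle */
    }
}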



2.1.2 RDMA

In addition, DBFS exchanges data directly with Pangu storage over RDMA, achieving latency close to a local SSD along with higher throughput. This is what made this year's cross-network, extremely low-latency I/O possible, laying a strong foundation for large-scale storage-compute separation. The RDMA cluster the Group deployed this year is arguably the largest in the industry.

2.2 Page cache

To provide buffered I/O, we implemented our own page cache, built on a touch-count-based LRU algorithm. Touch counting was introduced to better match database I/O patterns: large table scans are common in databases, and we do not want those rarely reused pages to erode the effectiveness of the LRU. A page moves between the hot and cold ends based on its touch count.

The page size of the page cache is configurable; aligning it with the database page size yields better cache efficiency. In general, the DBFS page cache provides the following capabilities (a sketch of the touch-count mechanism follows the list):

● Pages migrate between the hot and cold ends based on touch count

● The hot-to-cold capacity ratio is configurable; it is currently 2:8

● Page size is configurable and can be tuned to match the database page size

● Multiple shards improve concurrency; total capacity is configurable
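As a minimal sketch of the touch-count mechanism, assuming a doubly linked hot/cold LRU (the threshold and list helpers below are hypothetical, not the DBFS implementation):

/* A page is admitted at the cold end and promoted to the hot end only
 * after PROMOTE_THRESHOLD touches, so a one-pass table scan, which
 * touches each page once, cannot erode the hot end. */
#define PROMOTE_THRESHOLD 3

struct page {
    struct page *prev, *next;
    int touch_count;
};

struct lru;                                            /* hot + cold lists, e.g. 2:8 split */
void move_to_hot(struct lru *l, struct page *p);       /* hypothetical helper, list surgery elided */
void move_to_cold_mru(struct lru *l, struct page *p);  /* hypothetical helper, list surgery elided */

void on_page_access(struct lru *l, struct page *p)
{
    p->touch_count++;
    if (p->touch_count >= PROMOTE_THRESHOLD) {
        move_to_hot(l, p);        /* reused often enough: promote */
        p->touch_count = 0;
    } else {
        move_to_cold_mru(l, p);   /* scanned pages age out on the cold side */
    }
}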



2.3 Asynchronous I/O

Most databases use asynchronous I/O to improve I/O throughput. To be compatible with the I/O patterns of upper-layer databases, we implemented asynchronous I/O ourselves, with the following features (a sketch of the lock-free queue follows the list):

● Lock-free queue implementation

● Configurable I/O depth, enabling precise latency control for different database I/O types

● Adaptive polling to reduce CPU consumption
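The lock-free queue can be pictured as a single-producer/single-consumer ring; the sketch below illustrates the idea at an assumption level, not the DBFS implementation.

/* SPSC lock-free ring for async I/O submission. The capacity (the
 * I/O depth) must be a power of two here for simple wrap-around. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define QDEPTH 256                    /* configurable I/O depth */

struct io_req { off_t offset; size_t len; void *buf; };

struct ring {
    struct io_req *slots[QDEPTH];
    _Atomic unsigned head;            /* consumer cursor */
    _Atomic unsigned tail;            /* producer cursor */
};

bool ring_push(struct ring *r, struct io_req *req)   /* called by the DB thread */
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == QDEPTH)
        return false;                 /* full: configured I/O depth reached */
    r->slots[t % QDEPTH] = req;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

struct io_req *ring_pop(struct ring *r)              /* called by the I/O thread */
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t)
        return NULL;                  /* empty */
    struct io_req *req = r->slots[h % QDEPTH];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return req;
}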



2.4 Atomic write

DBFS implements atomic writes to avoid partial writes of database pages. On DBFS, InnoDB can safely turn off its doublewrite buffer, eliminating the doubled page-write traffic and saving that write bandwidth under storage-compute separation.

PostgreSQL, for example, uses buffered I/O; atomic writes also spare PG the sporadic torn-page problem it can hit when flushing dirty pages.
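For illustration: once the file system guarantees atomic page writes, the database-side page-protection mechanisms can in principle be turned off. The snippets below show the relevant knobs; verify against your database versions before relying on this in production.

# my.cnf: with atomic page writes, InnoDB's doublewrite buffer is redundant
[mysqld]
innodb_doublewrite = 0

# postgresql.conf: full-page images in the WAL serve the same purpose
full_page_writes = off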

2.5 Online Resize

To avoid data migration during capacity expansion, DBFS works with Pangu to support online resize of a volume. DBFS manages the underlying storage space with its own bitmap allocator; we optimized the allocator to achieve lock-free resize at the file-system level, so upper-layer services can expand at any time, efficiently and without disruption, which traditional ext4 cannot match.

Online resize also avoids wasting storage space: there is no need to reserve roughly 20% headroom, because a volume can simply be expanded and written on demand.
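A minimal sketch of the lock-free grow idea, assuming the allocator publishes its bitmap through an atomic pointer (illustrative only; the real DBFS allocator is more involved):

/* Grow the allocation bitmap without blocking concurrent allocators:
 * build a larger copy, then publish it with one atomic store. The old
 * map is leaked here for brevity; a real system would reclaim it
 * safely (e.g. via epochs/RCU). */
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t   nbits;    /* capacity in allocation units */
    uint64_t words[];  /* one bit per unit: 1 = allocated */
} bitmap_t;

static _Atomic(bitmap_t *) current_map;

void bitmap_grow(size_t new_nbits)
{
    bitmap_t *old = atomic_load(&current_map);
    size_t old_words = (old->nbits + 63) / 64;
    size_t new_words = (new_nbits + 63) / 64;
    bitmap_t *nu = calloc(1, sizeof(*nu) + new_words * sizeof(uint64_t));
    nu->nbits = new_nbits;
    memcpy(nu->words, old->words, old_words * sizeof(uint64_t));
    atomic_store(&current_map, nu);   /* newly added space is free (zero bits) */
}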

The following is the bitmap change process during expansion:



2.6 Switching between TCP and RDMA

Introducing RDMA at Group-database scale is itself a major risk. DBFS and Pangu jointly implemented live switching between RDMA and TCP and rehearsed the switchover across the whole link, keeping the risk of RDMA within a controllable range and strengthening the stability guarantees.

In addition, DBFS, Pangu, and the network team carried out extensive capacity-level testing, failure drills, and other RDMA work, preparing for the largest RDMA rollout in the industry.

2.7 Deployment for the 2018 big promotions

After these technical breakthroughs, DBFS completed its arduous task, passing the full-link stress tests of the big promotions and the Double 11 exam, once again validating the feasibility of storage-compute separation and the overall technical direction.

3. DBFS as a storage product

Beyond the functions a file system must implement, DBFS provides a number of features that make it more universal, easier to use, and more stable and secure for business workloads.

3.1 Technology consolidation and enablement

We have consolidated all of these innovations into DBFS as product features, so that more businesses can access different underlying storage media from user mode, and more databases can achieve storage-compute separation.

3.1.1 POSIX compatibility

To support database services, we are compatible with the most common POSIX file interfaces, making it easy to connect upper-layer databases. Together with the page cache, asynchronous I/O, and atomic write described above, this provides the rich I/O capabilities databases need. We also implemented the glibc stream interfaces to support manipulation of file streams. These two interface layers greatly reduce the complexity of adopting DBFS, improve its ease of use, and let it support more database services.

The POSIX interfaces are familiar and not listed here; the following glibc interfaces are shown for reference:

// glibc interface
FILE *fopen(const char *path, const char *mode);
FILE *fdopen(int fildes, const char *mode);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
int fflush(FILE *stream);
int fclose(FILE *stream);
int fileno(FILE *stream);
int feof(FILE *stream);
int ferror(FILE *stream);
void clearerr(FILE *stream);
int fseeko(FILE *stream, off_t offset, int whence);
int fseek(FILE *stream, long offset, int whence);
off_t ftello(FILE *stream);
long ftell(FILE *stream);
void rewind(FILE *stream);
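For reference, a minimal usage sketch through these stream interfaces (the mount path is made up for illustration):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/dbfs/demo.dat", "w+");  /* hypothetical DBFS path */
    if (!f)
        return 1;
    const char page[16] = "page payload";
    fwrite(page, sizeof(page), 1, f);  /* buffered write through the page cache */
    fflush(f);
    fclose(f);
    return 0;
}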

3.1.2 FUSE implementation

In addition, a FUSE layer was implemented for compatibility with the Linux ecosystem, connecting DBFS to the VFS. With FUSE, users who do not require extreme performance can access DBFS without any code changes, which greatly improves the product's ease of use and also makes traditional operations and maintenance much more convenient.



3.1.3 Service-oriented capability

DBFS developed its own shmQ component, based on shared-memory IPC, enabling support for both PostgreSQL's process architecture and MySQL's thread architecture. This makes DBFS more universal and more secure, and lays a solid foundation for future online upgrades.

shmQ is lock-free and excels in both latency and throughput: in current tests, access latency stays within a few microseconds even for 16 KB and larger database pages. With the service architecture and multi-process support in place, performance and stability meet expectations.
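shmQ itself is not public; as a rough sketch of the underlying idea, an IPC queue can live in a shared-memory segment that both a multi-process database and the DBFS service map (the segment name below is hypothetical):

/* Map a shared-memory region for cross-process IPC; a lock-free ring
 * like the one sketched in 2.3 would live inside this region. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SHMQ_NAME "/dbfs_shmq"        /* hypothetical segment name */

void *map_queue(size_t size)
{
    int fd = shm_open(SHMQ_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    void *q = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return q == MAP_FAILED ? NULL : q;
}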



3.1.4 Clustered file system

Clustering is another distinguishing feature of DBFS. It lets the database run in shared-disk mode, scaling compute resources linearly and saving storage costs for the business. Shared-disk mode also gives the database rapid elasticity, greatly improving the SLA of primary/standby switchover. The clustered file system provides one-writer-many-readers as well as multi-writer capability, laying a solid foundation for both shared-disk and shared-nothing database architectures. Compared with traditional OCFS, our user-mode implementation performs better and is more autonomous and controllable; OCFS depends heavily on the Linux VFS and, for example, has no independent page cache.

In one-writer-many-readers mode, DBFS nodes take roles (M/S): one M node shares data with multiple S nodes, and both M and S nodes can access the data on Pangu. The upper-layer database enforces the access rules: M nodes may read and write, while S nodes are read-only. If the primary fails, a switchover is performed, proceeding as follows:



● When service monitoring detects that the M node is unreachable or abnormal, the system decides whether to initiate a switchover.

● If so, the management platform issues the switchover command; once the command completes, both DBFS and the upper-layer database have switched roles.

● The most important step in the DBFS switchover is the I/O fence, which revokes the old M node's I/O capability to prevent double writes (a sketch of the idea follows this list).
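The fence can be pictured as a generation (epoch) check on every write; the sketch below is an assumption-level illustration, not the DBFS protocol.

/* Each writer tags its I/O with the generation it was granted; the
 * storage side rejects writes from a fenced (stale) generation, so a
 * deposed M node cannot double-write after the switchover. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t current_generation;  /* bumped at each switchover */

void fence_old_master(void)
{
    /* issued during switchover: from now on, writes tagged with any
     * earlier generation are refused */
    atomic_fetch_add(&current_generation, 1);
}

bool admit_write(uint64_t writer_generation)
{
    return writer_generation == atomic_load(&current_generation);
}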

For multi-writer support, DBFS implements global meta-lock control across all nodes and optimizes block-group allocation. A disk-based quorum algorithm is also involved; the details are fairly complex and are left for another time.

3.2 Hardware and software co-design

With the emergence of new storage media, databases will inevitably exploit them for better performance or lower cost, and gain autonomy over the underlying storage media.

Judging from Intel's storage-media roadmap, AEP, Optane, and SSD form a spectrum of performance versus capacity, with QLC emerging in the large-capacity direction. Weighing overall performance against cost, we consider Optane a relatively good caching product, and we chose it to implement DBFS's head-node persistent file cache.



3.2.1 Persistent File Cache

DBFS implements a local persistent cache on Optane, further improving database read and write performance under storage-compute separation. To be production-ready, the file cache does a great deal of work, for example:

● Stable and reliable failure handling

● Dynamic enable and disable

● Load balancing

● Collection and display of performance metrics

● Data-correctness scrubbing

These capabilities lay a solid foundation for online stability. I/O to the Optane device uses SPDK's pure user-mode stack, integrated with Fusion Engine's vhost. The file cache's page size can be tuned to the block size of the upper-layer database for best results.
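The read path with the persistent cache can be sketched as follows (the function names are hypothetical stand-ins for the real cache and Pangu interfaces):

/* Serve reads from the local Optane cache when possible; on a miss,
 * fetch across the network from Pangu and populate the cache. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

extern bool cache_lookup(off_t off, size_t len, void *buf);        /* hypothetical */
extern void cache_insert(off_t off, size_t len, const void *buf);  /* hypothetical */
extern int  pangu_read(off_t off, size_t len, void *buf);          /* hypothetical */

int dbfs_read(off_t off, size_t len, void *buf)
{
    if (cache_lookup(off, len, buf))
        return 0;                      /* hit: served at local-SSD latency */
    int rc = pangu_read(off, len, buf);
    if (rc == 0)
        cache_insert(off, len, buf);   /* populate for future reads */
    return rc;
}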

Here is the architecture of file cache:



Here are the read/write performance gains from the test:



The cache-enabled numbers come from the file cache; overall, as the hit ratio rises, read latency falls. Many of the file cache's performance metrics are monitored as well.

3.2.2 Open-Channel SSD

X-Engine, DBFS, and the Fusion Engine team are jointly building a controllable storage system on Open-Channel SSDs. Deep exploration and practice in reducing SSD wear, raising SSD throughput, reducing read/write interference, and other areas have already produced very good results. We have integrated X-Engine's tiered-storage strategy into the read and write paths, and we look forward to further work on intelligent storage.

4. Summary and outlook

In 2018, DBFS supported X-DB at scale through Double 11 with storage and compute separated. It also enabled one-writer-many-readers capability for ADS, and supported Tair, among other services.

Beyond supporting these services, DBFS itself now supports both the PG process architecture and the MySQL thread architecture, exposes a VFS interface, and is compatible with the Linux ecosystem, making it a true storage-layer product: a clustered user-mode file system. Going forward, hardware/software co-design, tiered storage, NVMe-oF, and other technologies will enable more databases and unlock greater value.




This article is original content of Alibaba Cloud's Yunqi Community and may not be reproduced without permission.