preface

To solve reliability, capacity, and cost problems in data backup scenarios, more and more users prefer to back up to object storage. However, in some scenarios using the US3 object storage directly for backup is inconvenient or even inappropriate. In database backup, for example, backing up straight to object storage means first running mysqldump for a logical backup or xtrabackup for a physical backup onto a local disk, and then uploading the backup files with an object storage SDK or tool, which makes the backup process cumbersome. Another example is archiving service logs: storing the logs in US3 object storage keeps costs down, but operating on them requires the SDK or command-line tools, so backup code has to be written and management becomes complicated. Both problems can be solved by providing a way to access object storage remotely through POSIX interfaces.

Open source solutions in practice

There are already open source projects that map buckets in object storage to file systems, such as S3FS and Goofys, but we ran into problems when using them.

s3fs

S3FS uses FUSE to locally mount buckets from S3 and other object stores that support the S3 protocol (FUSE is described below). After testing S3FS, we found its performance when writing large files to be particularly poor. Studying its implementation, we found that S3FS first writes incoming data to local temporary files and then uploads the data concurrently to object storage as a multipart upload. When local disk space is insufficient, parts are uploaded synchronously instead, as in the following code:

```cpp
ssize_t FdEntity::Write(const char* bytes, off_t start, size_t size)
{
  // no enough disk space
  if(0 != (result = NoCachePreMultipartPost())){
    S3FS_PRN_ERR("failed to switch multipart uploading with no cache(errno=%d)", result);
    return static_cast<ssize_t>(result);
  }
  // start multipart uploading
  if(0 != (result = NoCacheLoadAndPost(0, start))){
    S3FS_PRN_ERR("failed to load uninitialized area and multipart uploading it(errno=%d)", result);
    return static_cast<ssize_t>(result);
  }
}
```

Because our main usage scenario is backing up large files, and keeping a local cache would add cloud host disk cost, we decided to drop this solution.

goofys

Goofys is written in Go and mounts S3 and some non-S3 object stores as Linux file systems. After testing, we found that Goofys has three main problems:

1. Writes have no concurrency control. When writing large files, Goofys also splits the file into parts and starts one goroutine per part to write to back-end storage. Object storage is generally accessed over HTTP, and because each request is synchronous, the lack of a limit on concurrent requests produces a large number of connections and consumes a large amount of memory and other resources.

2. Reads are synchronous, and read performance is poor. FUSE has two read modes, async and sync, which can be selected by an option at mount time. Goofys forces sync mode and stops readahead after three out-of-order reads. The code is as follows:

```go
if !fs.flags.Cheap && fh.seqReadAmount >= uint64(READAHEAD_CHUNK) && fh.numOOORead < 3 {
	...
	err = fh.readAhead(uint64(offset), len(buf))
	...
}
```

Here fh.numOOORead is the count of out-of-order reads. The FUSE kernel module splits IOs larger than 128K and aligns them to 128K. To briefly explain the difference between FUSE's synchronous and asynchronous read modes: the normal read path in the kernel enters the underlying file system's read_iter function, which then calls generic_file_read_iter in the VFS layer. That function performs readahead internally by calling readpages; if the required page is still missing after readahead, it calls readpage to read a single page. Since Goofys does not allow changing this setting, we tested S3FS under the two configurations and captured the call stack at read time to compare them. The read stack with asynchronous read mode set looks like this:

```
fuse_readpages+0x5/0x110 [fuse]
read_pages+0x6b/0x190
__do_page_cache_readahead+0x1c1/0x1e0
ondemand_readahead+0x1f9/0x2c0
? pagecache_get_page+0x30/0x2d0
generic_file_buffered_read+0x5a5/0xb10
? mem_cgroup_try_charge+0x8b/0x1a0
? mem_cgroup_throttle_swaprate+0x17/0x10e
fuse_file_read_iter+0x10d/0x130 [fuse]
? __handle_mm_fault+0x662/0x6a0
new_sync_read+0x121/0x170
vfs_read+0x91/0x140
```

vfs_read is the entry of the read system call into the VFS layer. readpages is then called to read multiple pages at once, and fuse_readpages forwards the read request to the user-mode file system, completing the read. The stack in synchronous read mode looks like this:

```
fuse_readpage+0x5/0x60 [fuse]
generic_file_buffered_read+0x61a/0xb10
? mem_cgroup_try_charge+0x8b/0x1a0
? mem_cgroup_throttle_swaprate+0x17/0x10e
fuse_file_read_iter+0x10d/0x130 [fuse]
? __handle_mm_fault+0x662/0x6a0
new_sync_read+0x121/0x170
vfs_read+0x91/0x140
```

As in the asynchronous path, the read goes through generic_file_read_iter, but when the corresponding page is still missing after readahead it falls back to reading a single page. The relevant code (kernel 4.14) is:

```c
no_cached_page:
		/*
		 * Ok, it wasn't cached, so we need to create a new
		 * page..
		 */
		page = page_cache_alloc_cold(mapping);
		if (!page) {
			error = -ENOMEM;
			goto out;
		}
		error = add_to_page_cache_lru(page, mapping, index,
				mapping_gfp_constraint(mapping, GFP_KERNEL));
		if (error) {
			put_page(page);
			if (error == -EEXIST) {
				error = 0;
				goto find_page;
			}
			goto out;
		}
		goto readpage;
```

When synchronous reading is set, the FUSE module defeats the kernel's readahead and instead falls through to no_cached_page to read a single page. As a result, the read IO that reaches the user-mode file system in synchronous mode is a mix of large readpages IOs and 4K single-page readpage IOs. Because the offsets repeat, Goofys judges the reads to be out of order and stops readahead after three of them. From then on every interaction with US3 is a 4K GET request, so performance is poor and cannot meet users' needs.

3. The multipart upload size is not fixed and does not match US3. US3 currently uses a fixed part size of 4MB, while Goofys calculates its part size dynamically, so the code has to be modified by hand to adapt it (a sketch of such an adaptation follows the code below). The code is as follows:

```go
func (fh *FileHandle) partSize() uint64 {
	var size uint64

	if fh.lastPartId < 1000 {
		size = 5 * 1024 * 1024
	} else if fh.lastPartId < 2000 {
		size = 25 * 1024 * 1024
	} else {
		size = 125 * 1024 * 1024
	}
	...
}
```
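For comparison, adapting this to US3 essentially means pinning the part size at 4MB instead of growing it with the part index. A minimal sketch of such a change (illustrative only, not a patch we actually maintain):

```go
// US3 only accepts a fixed 4MB part size, so the dynamic sizing above
// would be replaced with a constant (illustrative change, not shipped code).
const us3PartSize = 4 * 1024 * 1024

func (fh *FileHandle) partSize() uint64 {
	return us3PartSize
}
```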

In addition, S3 provides no rename interface; in both S3FS and Goofys, rename is implemented by copying the source object to the target object and then deleting the source.

US3, by contrast, supports renaming objects directly, and US3FS implements rename through that interface, giving it much better rename performance than S3FS and Goofys. Moreover, when S3FS or Goofys mounts a US3 bucket, requests have to pass through a proxy for protocol conversion; US3FS removes this hop from the IO path, which is a further performance advantage.
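To make the difference concrete, here is a hedged Go sketch of the two approaches. The ObjectClient interface and its method names (CopyObject, DeleteObject, RenameObject) are hypothetical placeholders, not the actual US3 SDK API:

```go
// ObjectClient is a hypothetical stand-in for an object storage SDK;
// the method names are illustrative only.
type ObjectClient interface {
	CopyObject(srcKey, dstKey string) error   // server-side copy
	DeleteObject(key string) error
	RenameObject(srcKey, dstKey string) error // available on US3, not on S3
}

// renameByCopy is what S3FS and Goofys must do on S3: copy the whole object
// to the new key, then delete the old one. Its cost grows with file size.
func renameByCopy(c ObjectClient, src, dst string) error {
	if err := c.CopyObject(src, dst); err != nil {
		return err
	}
	return c.DeleteObject(src)
}

// renameDirect is what US3FS can do on US3: a single metadata operation
// whose cost is independent of file size.
func renameDirect(c ObjectClient, src, dst string) error {
	return c.RenameObject(src, dst)
}
```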

Our experience with S3FS and Goofys showed that both have performance problems in US3 backup scenarios and that adapting them would take considerable work. We therefore decided to develop our own file system, backed by object storage, that meets users' needs in data backup scenarios.

US3FS design overview

US3FS implements part of the POSIX API on top of FUSE. Before introducing the US3FS implementation, let's take a quick look at Linux's VFS mechanism and the FUSE framework (feel free to skip this part).

VFS

VFS, the Virtual File System, is an abstraction layer in the Linux kernel belonging to the I/O subsystem. Upward, it provides a file system interface to user-mode applications; downward, it abstracts concrete operations into a common set of function pointers for the underlying file systems to implement.

Linux file system metadata is divided into directory entries (dentries) and inodes. File names are not part of a file's metadata; to speed up lookups, the VFS creates dentries in memory to cache the mapping between file names and inodes and to represent the directory tree. In local file system implementations, dentries exist only in memory and are never written to disk. When a file is looked up and no dentry is found in memory, the VFS calls into the concrete file system to find the file and build the corresponding data structures. Inodes cache a file's metadata, such as its size and modification time, and are persisted to disk. File data is read from and written to the corresponding pages and the block device through the address space.

FUSE

FUSE, Filesystem in Userspace, lets a file system be implemented in user mode. Implementing a feature directly in the kernel is painful: kernel debugging is hard, and it is easy to get bogged down in kernel details. FUSE simplifies the developer's life by hiding those details and providing a set of user-space interfaces for implementing a file system. The kernel-mode FUSE module and the user-mode FUSE library communicate through /dev/fuse, and the library then invokes the user's own implementation. The downside is a longer IO path and extra kernel/user mode switches, which has some impact on performance.
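To make the user-space side concrete, here is a minimal read-only FUSE file system in Go written against the bazil.org/fuse library (the library's classic hello-world pattern). It is only a sketch of what a FUSE daemon looks like, not the library or code US3FS necessarily uses, and the mount point is hypothetical:

```go
package main

import (
	"context"
	"log"
	"os"

	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

// FS is the file system root passed to fs.Serve.
type FS struct{}

func (FS) Root() (fs.Node, error) { return Dir{}, nil }

// Dir is the single root directory, containing one read-only file.
type Dir struct{}

func (Dir) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Inode = 1
	a.Mode = os.ModeDir | 0555
	return nil
}

func (Dir) Lookup(ctx context.Context, name string) (fs.Node, error) {
	if name == "hello" {
		return File{}, nil
	}
	return nil, fuse.ENOENT
}

func (Dir) ReadDirAll(ctx context.Context) ([]fuse.Dirent, error) {
	return []fuse.Dirent{{Inode: 2, Name: "hello", Type: fuse.DT_File}}, nil
}

// File serves a fixed string; an object-storage file system would fetch
// the data from the bucket here instead.
type File struct{}

const greeting = "hello, world\n"

func (File) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Inode = 2
	a.Mode = 0444
	a.Size = uint64(len(greeting))
	return nil
}

func (File) ReadAll(ctx context.Context) ([]byte, error) {
	return []byte(greeting), nil
}

func main() {
	// Mount at a hypothetical mount point and serve FUSE requests.
	c, err := fuse.Mount("/mnt/hello")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	if err := fs.Serve(c, FS{}); err != nil {
		log.Fatal(err)
	}
}
```

Reading /mnt/hello/hello then returns the fixed string; every such read travels through the kernel FUSE module, over /dev/fuse, into this user-mode process and back.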

Metadata design

US3FS implements the FUSE interfaces to map the objects in a US3 bucket to files. Unlike a distributed file system, there is no metadata server (MDS) maintaining file metadata; it has to be fetched from US3 over HTTP. With a large number of files this would generate a burst of requests and perform poorly. To address this, US3FS maintains the bucket's directory tree in memory and attaches a validity period to cached file metadata, avoiding frequent round trips to US3.

This caching also introduces a consistency issue: when multiple clients modify files in the same bucket, cache consistency is not guaranteed, and users have to make that trade-off themselves. To improve lookup performance, files are not laid out flat across the whole namespace the way object storage stores them; instead, as in a traditional file system, each directory has its own data structure holding the files it contains. At the same time, the inode is kept as lean as possible, retaining only the necessary fields to reduce memory footprint.

The inode currently stores uid, gid, size, mtime, and so on, which are persisted on the object through US3's metadata feature. For example, as shown in the figure below, the US3 bucket contains an object named "a/b/c/f1". In the file system, each "/"-delimited prefix is mapped to a directory, producing the directory tree on the left.
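A hedged sketch of what such in-memory metadata might look like in Go (type and field names are illustrative, not US3FS's actual definitions):

```go
package us3fs // illustrative package name, not the actual US3FS code

import (
	"strings"
	"time"
)

// Inode keeps only the fields needed for stat-like operations; in US3FS
// these are persisted via US3 object metadata rather than a metadata server.
type Inode struct {
	UID, GID uint32
	Size     uint64
	Mtime    time.Time
	expireAt time.Time // how long the cached metadata is trusted before refresh
}

// Dentry is a node of the in-memory directory tree: each directory keeps its
// own map of children instead of relying on a flat object listing.
type Dentry struct {
	name     string
	inode    *Inode
	children map[string]*Dentry // nil for regular files
}

// insertObject maps an object key such as "a/b/c/f1" onto the tree,
// creating one directory per "/"-delimited prefix.
func (root *Dentry) insertObject(key string, ino *Inode) {
	parts := strings.Split(key, "/")
	cur := root
	for _, dir := range parts[:len(parts)-1] {
		child, ok := cur.children[dir]
		if !ok {
			child = &Dentry{name: dir, children: map[string]*Dentry{}}
			cur.children[dir] = child
		}
		cur = child
	}
	name := parts[len(parts)-1]
	cur.children[name] = &Dentry{name: name, inode: ino}
}
```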

IO flow design

For writes, US3 supports multipart upload of large files. US3FS uses this feature: data is written into a cache and uploaded to back-end storage in the background in 4MB parts. The multipart upload flow is shown in the figure below. Write concurrency across the whole system is limited by a token bucket: each part-writing worker acquires a token before uploading, and the last part is written when the file is closed, completing the upload.
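Below is a hedged Go sketch of this write path. The Uploader interface, the token-bucket size, and the overall structure are assumptions made for illustration, not US3FS's actual code:

```go
package us3fs // illustrative package name, not the actual US3FS code

// partSize matches US3's fixed 4MB multipart size.
const partSize = 4 * 1024 * 1024

// Uploader stands in for the US3 multipart-upload API; the method names
// here are assumptions for illustration, not the real SDK.
type Uploader interface {
	UploadPart(partNum int, data []byte) error
	FinishUpload() error
}

// tokens bounds system-wide write concurrency: every background part
// upload must take a slot first, playing the role of the token bucket.
var tokens = make(chan struct{}, 64)

type writeHandle struct {
	up       Uploader
	buf      []byte     // write cache for the part currently being assembled
	partNum  int        // number of parts dispatched so far
	inflight int        // background uploads whose result is still pending
	errs     chan error // results of background part uploads
}

func newWriteHandle(up Uploader) *writeHandle {
	return &writeHandle{up: up, errs: make(chan error, 1024)}
}

// Write appends data to the cache and uploads every full 4MB part in the
// background, each guarded by a concurrency token.
func (h *writeHandle) Write(p []byte) {
	h.buf = append(h.buf, p...)
	for len(h.buf) >= partSize {
		part := append([]byte(nil), h.buf[:partSize]...)
		h.buf = h.buf[partSize:]
		h.partNum++
		h.inflight++
		num := h.partNum
		tokens <- struct{}{} // acquire a token before uploading
		go func() {
			defer func() { <-tokens }() // release the token when done
			h.errs <- h.up.UploadPart(num, part)
		}()
	}
}
```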

For reads, US3FS prefetches data into its cache to improve performance. The kernel FUSE module itself splits reads and writes, and without modifying the kernel the maximum IO size is 128KB, so in this scenario IOs arrive sliced into 128KB pieces; without prefetching, the network bandwidth cannot be fully utilized. The prefetch algorithm of US3FS works as follows:

As shown in the figure, after the first synchronous read completes, a readahead of the same length is issued for the range that follows, and the midpoint of that readahead window is recorded as the trigger for the next one. If a subsequent read is not sequential, the previous state is cleared and a new readahead is started. If it is sequential, the system checks whether the end of the current read has reached the trigger offset: if it has, the readahead window is doubled, up to a preset threshold, and a readahead is issued for the new window; if not, no readahead is performed. Prefetching improves sequential read performance. Since US3FS is mostly used with large files, it does not cache data itself; the kernel page cache sits above US3FS and works well when users repeatedly read the same file.
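A hedged Go sketch of this readahead policy (the names and the growth cap are illustrative assumptions, not the actual US3FS implementation):

```go
package us3fs // illustrative package name, not the actual US3FS code

// maxWindow caps readahead window growth; the value is an assumption.
const maxWindow = 32 * 1024 * 1024

type readAhead struct {
	lastEnd int64 // end offset of the previous read, -1 initially
	paEnd   int64 // end offset of the range already prefetched
	paSize  int64 // current readahead window size
	trigger int64 // crossing this offset triggers the next readahead
}

func newReadAhead() *readAhead { return &readAhead{lastEnd: -1} }

// onRead is called after a synchronous read of [off, off+n) completes;
// prefetch asynchronously fetches [off, off+size) into the cache.
func (ra *readAhead) onRead(off, n int64, prefetch func(off, size int64)) {
	end := off + n
	switch {
	case off != ra.lastEnd:
		// First or non-sequential read: reset state, prefetch a window of
		// the current read length, and put the trigger at its midpoint.
		ra.paSize = n
		ra.paEnd = end + ra.paSize
		ra.trigger = end + ra.paSize/2
		prefetch(end, ra.paSize)
	case end >= ra.trigger:
		// Sequential read crossed the trigger: double the window up to the
		// cap, prefetch the next window, and move the trigger to its midpoint.
		if ra.paSize*2 <= maxWindow {
			ra.paSize *= 2
		}
		prefetch(ra.paEnd, ra.paSize)
		ra.trigger = ra.paEnd + ra.paSize/2
		ra.paEnd += ra.paSize
	}
	ra.lastEnd = end
}
```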

Data consistency

Because of how object storage works, data written to a large file is not visible until the whole multipart upload has completed. Data written through US3FS is therefore not readable until the file is closed. On close, US3FS sends the request that finishes the multipart upload, ending the write process; only then does the data become visible to the user.
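Continuing the hedged write-path sketch from the IO flow section (the method names remain assumptions), the close path might look like this:

```go
// Close uploads whatever is left in the write cache as the final part,
// waits for the background part uploads, and then ends the multipart
// upload. Only after FinishUpload returns does the data become visible.
func (h *writeHandle) Close() error {
	if len(h.buf) > 0 {
		h.partNum++
		if err := h.up.UploadPart(h.partNum, h.buf); err != nil {
			return err
		}
		h.buf = nil
	}
	for ; h.inflight > 0; h.inflight-- {
		if err := <-h.errs; err != nil {
			return err
		}
	}
	return h.up.FinishUpload()
}
```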

Comparison test

With a test model of 64 concurrent streams and 4MB IOs, we ran sequential writes and sequential reads of 40GB files several times; the averaged results are as follows:

During the test, Goofys had a relatively high memory footprint, peaking at around 3.3GB, while US3FS stayed fairly flat with a peak of around 305MB, saving roughly 90% of the memory. S3FS does relatively well here because it caches in local temporary files, so it uses little memory; but it writes heavily to local files, and once disk space runs out its performance degrades to the figures shown in the table.

The sequential read results confirm our earlier analysis: because of its design, Goofys cannot meet our performance requirements in this scenario. In addition, in a test that moves a 1GB file, the comparison results are as follows:

It can be seen that in scenarios that require moving (renaming) files, especially large ones, US3FS improves performance by a factor of hundreds.

conclusion

In short, S3FS and Goofys each have their strengths and weaknesses for reading and writing large files. US3FS, developed by the US3 team, performs better for both reads and writes, is better adapted to US3, and is easier to extend.