Preface

Our YRCloudFile is a distributed file system for the cloud era. Its main features are high-performance access to massive numbers of small files, seamless support for the Kubernetes platform, and data services for hybrid cloud scenarios. While developing YRCloudFile, we also study mainstream distributed file systems in the industry to learn from their strengths and avoid their weaknesses. This article discusses several leading distributed file systems we have investigated, all of them open source, because open source projects offer a wealth of public information and readable code, which makes them easier to understand and discuss.

When we investigate a distributed file system, we focus on its core architecture, which includes the following aspects:

1) Its metadata scheme, such as whether there is a metadata service or cluster, how metadata is organized, the metadata placement strategy, and so on.

2) The replication mechanism and consistency of metadata: EC or multiple replicas, synchronous or asynchronous writes between replicas, and the resulting consistency guarantees.

3) The replication mechanism and consistency of data: again, EC or multiple replicas, and consistency among those replicas.

Around these core issues we will also extend the analysis to topics such as system availability and performance.

The distributed file systems discussed in this article were selected for analysis based on the following considerations:

  1. HDFS: selected because it is a classic. HDFS is mainly used for big data analysis, has abundant public material, and has a large user base.

  2. MooseFS: simple but very typical design; LizardFS is a variant of it.

  3. Lustre: well known and widely used as a point of comparison; there is a commercial company behind Lustre, but the code is open and downloadable.

  4. GlusterFS: a long-established distributed file system with many deployments; it has long been a target of secondary development in China, and its consistent hashing idea is classic.

  5. CephFS: Ceph has been the most popular distributed storage system of recent years, bar none; I consider it the synthesis of distributed storage theory and practice.

HDFS

The Hadoop Distributed File System (HDFS) serves the Hadoop big data ecosystem. It mainly stores large files and defines a write-once-read-many file access model, which greatly simplifies data consistency problems.

See the architecture diagram on the official website:

HDFS started out with a single NameNode (the metadata node of HDFS), which became a capacity and performance bottleneck. Support for multiple NameNodes was added later.

HDFS is not a general-purpose distributed file system: it does not provide complete POSIX semantics, was never designed to, and is not suited to being one. Its characteristics are well known and a great deal of material has accumulated over the years, so this article will not say much more about it.

MooseFS

If you’re a file system aficionado and developer planning to write a distributed file system in the near future, here’s what you might do:

1) Metadata

For simplicity, use a separate metadata service: not a metadata cluster, but a single metadata server, like the HDFS NameNode.

To avoid the metadata service being a single point of failure (cluster unavailability or even data loss if it goes offline or its data is corrupted), you deploy one or more additional metadata services as backups, creating a master-slave architecture. The data between master and slave can be synchronized either in real time or asynchronously.
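To make the synchronous versus asynchronous trade-off concrete, here is a minimal sketch (hypothetical classes such as MetaMaster and MetaSlave, not MooseFS code) of a master shipping its metadata operation log to a slave:

```python
# Hypothetical master/slave metadata replication sketch; not MooseFS code.
# With sync=True the master acknowledges only after the slave has applied the
# operation (no loss on failover, higher latency); with sync=False it
# acknowledges immediately and the slave may lag (possible loss on failover).

class MetaSlave:
    def __init__(self):
        self.log = []

    def apply_op(self, op):
        self.log.append(op)          # replay the operation into the slave's state

class MetaMaster:
    def __init__(self, slave, sync=True):
        self.slave = slave
        self.sync = sync
        self.log = []
        self.pending = []            # ops not yet shipped in async mode

    def submit(self, op):
        self.log.append(op)          # commit locally first
        if self.sync:
            self.slave.apply_op(op)  # wait for the slave before acking
        else:
            self.pending.append(op)  # ship later in the background
        return "ack"

    def flush(self):
        while self.pending:
            self.slave.apply_op(self.pending.pop(0))

master = MetaMaster(MetaSlave(), sync=False)
master.submit({"op": "mkdir", "path": "/a"})
# If the master dies here, the asynchronous slave has never seen /a.
```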

Also for simplicity, the metadata service stores its data directly on a local file system such as ext4.

2) Data

For simplicity, the data service also stores its data on local file systems. For fault tolerance, multiple replicas of the data are kept across servers.

The obvious downside is that the single metadata service is a bottleneck, limiting both the maximum number of files a cluster can hold and the performance of concurrent access to it.

MooseFS has exactly this architecture: it calls the metadata service the Master Server and the data service the Chunk Server, as shown in the official schematic.

We analyze the read and write flows of MooseFS to see whether it has data consistency problems. Here we refer to the “MooseFS 3.0 User’s Manual” (moosefs.com/Content/Dow…).

It is important to note that all MooseFS replicas are available for reading:

Write process diagram:

As you can see from the write flow, MooseFS uses chain replication, which is not a problem in itself. However, according to the public information, MooseFS does not wrap multi-replica writes in a transaction, and any replica can serve reads, so we suspect that MooseFS has data consistency problems under failure.
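To illustrate the chain idea (a generic sketch of chain replication, not MooseFS's actual implementation): the client writes to the head of the chain, each replica stores the block and forwards it to its successor, and the acknowledgement comes back only from the tail.

```python
# Generic chain-replication sketch; illustrative only, not MooseFS code.
# Each replica stores the block and forwards it to the next one in the chain;
# the acknowledgement propagates back from the tail.

def chain_write(replicas, block):
    """replicas is an ordered list of dicts standing in for chunk servers."""
    def write_at(i):
        replicas[i].setdefault("blocks", []).append(block)  # store locally
        if i + 1 < len(replicas):
            return write_at(i + 1)   # forward down the chain
        return "ack"                 # the tail acknowledges
    return write_at(0)

chain = [{}, {}, {}]                 # three replicas
print(chain_write(chain, b"ABC"))    # "ack" only after every replica stored it
# If the chain is interrupted mid-write, the replicas diverge unless some
# transaction or recovery mechanism reconciles them, which is the gap
# discussed above.
```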

```
git clone https://github.com/moosefs/moosefs.git
git checkout v3.0.111
```

Checking out this code and following the write path by the keyword CLTOCS_WRITE_DATA, we find two problems:

1) When writing multiple replicas, there is no mechanism to ensure consistency in case of failure.

2) Local data writes go through hdd_write(), which internally calls pwrite(); the data is not persisted immediately but sits in the page cache, and the kernel's writeback mechanism is responsible for eventually flushing it to disk (see the sketch below).
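The page cache point can be demonstrated independently of MooseFS: a successful write()/pwrite() only guarantees that the data has reached kernel memory, and only an explicit fsync()/fdatasync() forces it onto stable storage. A minimal sketch:

```python
# Minimal demonstration that pwrite() alone does not persist data.
# os.pwrite returns once the bytes are in the page cache; only os.fsync
# (or os.fdatasync) forces them to disk. Not MooseFS code.
import os

fd = os.open("/tmp/demo.dat", os.O_CREAT | os.O_WRONLY, 0o644)
os.pwrite(fd, b"ABC", 0)   # returns 3: the data now sits in the page cache
# A power failure at this point can lose the data even though the caller
# was already told the write succeeded.
os.fsync(fd)               # now the kernel must flush it to stable storage
os.close(fd)
```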

You can construct a scenario where the client writes ABC, replica 1 has written AB, replica 2 has written only A, and the cluster loses power. After the cluster restarts, the replicas are inconsistent.

In another scenario, the client writes ABC, replicas 1 and 2 both complete the write but have not flushed it, and MooseFS reports success to the client. After a power failure and restart, replica 1 and replica 2 may be inconsistent, and the data the client reads back may differ from what it was promised.

In fact, in previous work we maintained a production MooseFS system and analyzed it. When writing this article we went back to the latest MooseFS code, re-read the relevant sections, and found the same data consistency problem.

moosefs.com/

Lustre

Lustre is widely used in the HPC space, so it must have something special. But here we still focus on its metadata and data architecture and on data consistency.

Lustre's architecture is similar to MooseFS's in that it is a typical metadata-service architecture. Lustre, however, does not seem to provide a built-in replica mechanism (replication must be enabled explicitly, and the replicas are asynchronous), perhaps because in the HPC scenario the data space comes from back-end SAN arrays, so Lustre itself does not need to provide data redundancy.

Since data redundancy and consistency are left to the back-end SAN arrays, we will not go into Lustre further.

lustre.org/

GlusterFS

GlusterFS is a very famous distributed file system, and many companies in China have built products on it for a long time. Its most suitable application scenarios are video storage and log storage, whose I/O pattern is large sequential reads and writes, a workload that most file systems handle well.

GlusterFS is most talked about for its "consistent hashing," "no central node architecture," and "stack-based design." Learning material about GlusterFS is plentiful online, so this article will not repeat it; instead we focus on its metadata design and data consistency.

GlusterFS uses a metadata-server-free structure, also described as a "no central node" or "decentralized" architecture, and uses consistent hashing, i.e. a distributed hash table (DHT), to locate data. Once a GlusterFS cluster is deployed, the I/O nodes know the cluster's list of storage nodes; when reading or writing a file, the file name and the storage node list are fed into the DHT algorithm to determine where the file is stored, as in the sketch below.
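To make the idea concrete, here is a toy consistent-hashing placement function (a sketch of the general DHT idea, not GlusterFS's actual elastic hashing): the file name is hashed onto a ring of bricks, and changing the brick list changes where some files land, which is exactly the rebalancing cost discussed next.

```python
# Toy consistent-hash placement; illustrative, not GlusterFS's elastic hashing.
# Each brick owns many points on a hash ring; a file belongs to the first
# brick clockwise from the hash of its name.
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(bricks, vnodes=64):
    ring = sorted((_h(f"{b}#{i}"), b) for b in bricks for i in range(vnodes))
    return [p for p, _ in ring], [b for _, b in ring]

def locate(points, owners, filename):
    i = bisect.bisect(points, _h(filename)) % len(points)
    return owners[i]

points, owners = build_ring(["brick1", "brick2", "brick3"])
print(locate(points, owners, "/videos/a.mp4"))

# Removing a brick (e.g. after a crash) changes the ring, so some files now
# map to a different brick and must be found or moved during recovery.
points2, owners2 = build_ring(["brick1", "brick3"])
print(locate(points2, owners2, "/videos/a.mp4"))
```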

When the cluster is fault-free, the storage node list is fixed and the DHT algorithm is simple and effective. However, when nodes join or leave (for example, when a storage node crashes), the DHT handles it poorly: the handling logic is complex and service availability suffers greatly. Because the storage node list changes, the DHT-computed locations of many files also change, which means data must be moved, and business I/O may have to wait for this recovery I/O to complete.

On the other hand, such a metadata-server-less structure makes metadata operations expensive. For example, a single ls fans out so that several storage nodes each perform their own listing. Since metadata operations are generally believed to account for a large share of file system workloads, having no metadata service becomes a real headache.

As for consistency, GlusterFS does not provide strong data consistency. We worked on GlusterFS a few years ago, and split-brain is a term that still leaves a deep impression on us.

Split-brain means the data replicas are inconsistent and the cluster cannot determine which one is valid. For example, replica 1 holds ABC while replica 2 holds AB.

We can construct a split-brain scenario (illustrated in the sketch after the list):

  • brick1 down, write fileX
  • brick1 up, brick2 down, write fileX
  • brick1 and brick2 both up: split brain occurs, and brick1 and brick2 blame each other
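One generic way to see why neither copy can win (an illustrative version-vector style of accounting, not GlusterFS's actual AFR changelog format): each replica counts the writes it accepted while its peer was down, and after the sequence above each replica blames the other, so neither copy is a superset of the other.

```python
# Illustrative split-brain accounting with "writes my peer missed" counters;
# not GlusterFS's actual changelog xattrs.

pending = {"brick1": {"brick2": 0}, "brick2": {"brick1": 0}}

# brick1 down, write fileX: brick2 records one write that brick1 missed
pending["brick2"]["brick1"] += 1
# brick1 up, brick2 down, write fileX: brick1 records one write that brick2 missed
pending["brick1"]["brick2"] += 1

# both up: each side claims the other is stale, so no replica can be chosen
b1_blames_b2 = pending["brick1"]["brick2"] > 0
b2_blames_b1 = pending["brick2"]["brick1"] > 0
print("split brain:", b1_blames_b2 and b2_blames_b1)   # True
```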

The official GlusterFS documentation explicitly mentions this problem; see "Heal info and split-brain resolution" (docs.gluster.org/en/latest/T…). Since GlusterFS is not intended to be a strongly consistent distributed file system, it is not hard to see why it is mainly used for video or log storage with large sequential I/O: the files are large, the total number of files is small, and performance is throughput-oriented. The data is not critical, so consistency problems such as split brain have limited impact.

Judging from the GlusterFS community's roadmap, they also intend to address metadata performance and data consistency, but I think these two issues are tied so closely to the core architecture that they are hard to solve properly. Still, technology keeps moving, and we look forward to it.

www.gluster.org/

CephFS

Ceph is the most successful distributed storage system of recent years. We call it distributed storage rather than a distributed file system because Ceph covers three application scenarios: block storage (Ceph RBD), object storage (Ceph RGW), and file storage (CephFS).

I believe many people are familiar with Ceph RBD, because a few years ago, when OpenStack-based cloud computing was in the ascendant, Ceph RBD and OpenStack fit together naturally and were widely deployed.

RADOS is the core underneath CephFS, RBD, and RGW. This layer is responsible for data placement, multiple replicas, strong consistency between replicas, data recovery, and other core functions. RADOS locates data with CRUSH, a hashing algorithm that allows a degree of user control over placement. RADOS treats a multi-replica write as a transaction to ensure the atomicity of a single write, and keeps replicas consistent through pglog and other mechanisms. A rough mental model is sketched below.
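This is a greatly simplified sketch, not Ceph's actual implementation (real RADOS uses CRUSH maps, placement groups, and pglog with far more machinery, and the function names here are invented): an object hashes to a placement group, the placement group maps to an ordered set of OSDs, and the primary replicates the write to every replica before acknowledging it.

```python
# Greatly simplified primary-copy write with a per-PG operation log.
# Illustrative only; not Ceph's actual CRUSH or pglog implementation.
import hashlib

PG_NUM = 8
OSDS = ["osd0", "osd1", "osd2", "osd3", "osd4", "osd5"]
store = {osd: {"log": [], "objects": {}} for osd in OSDS}

def object_to_pg(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % PG_NUM

def pg_to_osds(pg: int, replicas=3):
    # stand-in for CRUSH: a deterministic ordered set of OSDs for the PG
    return [OSDS[(pg + i) % len(OSDS)] for i in range(replicas)]

def write(name: str, data: bytes):
    pg = object_to_pg(name)
    acting = pg_to_osds(pg)                  # acting[0] acts as the primary
    version = len(store[acting[0]]["log"]) + 1
    for osd in acting:                       # primary drives all replicas
        store[osd]["log"].append((pg, version, name))   # pglog-like entry
        store[osd]["objects"][name] = data
    return "ack"                             # acked only after every replica applied

write("rbd_data.1", b"hello")
```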

We also have Ceph RBD development experience, and we see Ceph as a marriage of distributed storage theory and practice. Its theory is complete; although the implementation is complex, the modules are reasonably clear and the code quality is quite high. It is well worth studying.

From a theoretical point of view, Ceph is impeccable. A few years ago many companies invested in secondary development of Ceph, but after the hype passed, only vendors with strong R&D capability kept investing. One reason is that the Ceph code base is very complex and hard to fully master, and the optimization gains are not always proportional to the resources invested.

Let's return to the topic of distributed file systems and to this article's focus on metadata and data consistency.

CephFS has a metadata service called the MDS. Both the metadata and the data of CephFS are stored in RADOS, so CephFS's development approach is quite different from that of MooseFS described above. MooseFS, for example, has to implement leader/follower replication for its metadata service itself, while CephFS does not need to worry about this: its data is stored in the cluster reliably, consistently, and redundantly by RADOS. Simply put, the CephFS metadata service (MDS) is a metadata layer built on top of RADOS, a stable data placement and storage layer that guarantees strong consistency through replicas or EC. This architecture takes full advantage of RADOS, but it also brings a drawback: as mentioned earlier, a large share of distributed file system operations are metadata operations, and CephFS must first go to the MDS and then to RADOS for the data (see the sketch below).
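A sketch of that layering, with invented interfaces rather than the real MDS/RADOS protocols: namespace operations go through the MDS, which itself persists its metadata as RADOS objects, while file data is written by the client straight into RADOS objects.

```python
# Hypothetical sketch of the CephFS layering; class and object names invented.

class Rados:                        # stands in for the consistent object layer
    def __init__(self):
        self.objects = {}
    def write(self, oid, data):
        self.objects[oid] = data    # replication/EC and consistency live here
    def read(self, oid):
        return self.objects[oid]

class MDS:                          # metadata layer built on top of RADOS
    def __init__(self, rados):
        self.rados = rados
    def create(self, path):
        ino = abs(hash(path)) & 0xffffffff
        self.rados.write(f"mds_meta.{ino:x}", path.encode())  # metadata also lives in RADOS
        return ino

class CephFSClient:
    def __init__(self, mds, rados):
        self.mds, self.rados = mds, rados
    def write_file(self, path, data):
        ino = self.mds.create(path)                    # 1) metadata hop to the MDS
        self.rados.write(f"{ino:x}.00000000", data)    # 2) data goes directly to RADOS
        return ino

rados = Rados()
client = CephFSClient(MDS(rados), rados)
client.write_file("/a.txt", b"hello")
```

Every file operation pays the metadata hop first, which is the cost this architecture trades for reusing RADOS's consistency and recovery.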

ceph.io/

Conclusion

In this article we briefly discussed several common open source distributed file systems, mainly from the two perspectives of metadata and data consistency. From a theoretical standpoint, we believe CephFS is the most complete.

However, a distributed file system can be discussed from many different angles: MooseFS is well suited to large-file backup scenarios where CephFS would be overly complex, and Lustre is widely used in HPC scenarios where CephFS may suffer performance degradation. So this article is not meant to criticize the systems mentioned; rather, it discusses our views from the "metadata and data consistency" perspective and gathers broader ideas for designing and implementing our own file system.

Each system has its own trade-offs and its own bright spots; this article cannot cover them all one by one. We will discuss them in more detail when the opportunity arises, and we welcome your thoughts as well.