Many engineers who are new to the storage industry, or who want to move into distributed storage, ask the same question: “What does a distributed storage engineer’s skill tree look like?”
So we put this question to our senior R&D engineers.
First, the scope. We believe that storage in the broad sense includes databases, but distributed databases are a category of their own: their technical characteristics, technical difficulties and application scenarios are all distinct.
Discussing distributed databases is an entirely separate topic, so here we will focus on distributed storage beyond distributed databases.
Distributed storage is still a very broad topic. In terms of access semantics, it is generally divided into object storage, block storage and file storage.
Take object storage first: its application scenarios are relatively narrow. By comparison, block storage is the foundation of cloud computing (in IaaS, every virtual machine you create needs a disk), and file storage has a longer history and wider use. What decides this is mainly the access semantics of upper-layer applications: most of them access data either through virtualized block devices or through POSIX file semantics.
Technically, AWS S3 is the de facto standard for object storage. Objects are not modified in place, and brief data inconsistencies between replicas can be tolerated.
Because object storage pursues low cost, implementing erasure coding (EC) is often required.
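To make the cost argument concrete, here is a minimal sketch of the simplest possible erasure code: k data shards plus a single XOR parity shard. This is only an illustration under our own assumptions. Production systems typically use Reed-Solomon style codes that survive several lost shards, and the function names below are hypothetical.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(shards: list[bytes | None]) -> list[bytes]:
    """Rebuild at most one missing shard (marked None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    if missing:
        survivors = [s for s in shards if s is not None]
        shards[missing[0]] = reduce(xor_bytes, survivors)
    return shards


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a k+m layout."""
    return (k + m) / k
```

For comparison, `storage_overhead(4, 2)` gives 1.5 for a 4+2 layout, while three full replicas (one copy plus two extra) cost `storage_overhead(1, 2)`, which is 3.0. That gap is why low-cost object storage leans on EC.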
Next, distributed block storage and distributed file storage. What they share is the pursuit of high performance and strong data consistency. Distributed block storage is generally used to provide virtual disks for virtual machines, such as the EBS products of the major public clouds, and in most scenarios a volume is used by only one client. A distributed file system, however, is often used by many clients at once and has to handle concurrent reads and writes of massive numbers of files, which makes it technically much harder.
Our YRCloudFile is positioned as a high-performance distributed file system, so we have focused on discussing the skill tree of distributed file system engineers. Since each section can be developed as a separate topic, this is only a brief description.
Here is what we see as the backbone of the skill tree for distributed file storage engineers:
1) Medium:
Designing for HDDs is one thing; designing for high-performance SSDs is quite another. On HDDs, ordinary synchronous threads are enough. On SSDs, synchronous threads cannot keep the hardware busy. Whether to use I/O polling, whether libaio is good enough, and whether io_uring is stable enough on the latest kernels are all questions you need the skills to answer when designing distributed file storage for SSDs. In addition, the low latency of the device makes the performance cost of thread context switches stand out.
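The difference between one synchronous thread and a device that wants many requests in flight can be sketched as follows. This is only an illustration under our own assumptions: the Python standard library has no libaio or io_uring binding, so a thread pool stands in for queue depth here, and `BLOCK_SIZE` and the function names are hypothetical. Note that the pool pays exactly the context-switch cost described above, which is what libaio and io_uring are meant to avoid.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed I/O size for this sketch


def read_blocks_sync(fd: int, block_ids: list[int]) -> list[bytes]:
    """One thread, one outstanding read at a time (queue depth 1)."""
    return [os.pread(fd, BLOCK_SIZE, b * BLOCK_SIZE) for b in block_ids]


def read_blocks_deep(fd: int, block_ids: list[int], depth: int = 32) -> list[bytes]:
    """Keep up to `depth` reads in flight by issuing them from a thread pool.

    Results come back in the same order as `block_ids`.
    """
    def read_one(block_id: int) -> bytes:
        return os.pread(fd, BLOCK_SIZE, block_id * BLOCK_SIZE)

    with ThreadPoolExecutor(max_workers=depth) as pool:
        return list(pool.map(read_one, block_ids))
```

Both functions return the same data. On an HDD the first is usually adequate, while on a fast SSD only the second gives the device enough parallel work, at the price of one thread per outstanding request.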
2) Metadata architecture:
The first choice is between an architecture with dedicated metadata and a distributed architecture without metadata.
After comparing the two, we chose the architecture with metadata. That raises further questions. How is the directory tree partitioned? How is cluster membership established and maintained? Under all kinds of abnormal network conditions, can the cluster members agree on whether a node is alive or dead? These are all part of the distributed file storage engineer’s skill tree.
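One common way to partition a directory tree, shown here only as a sketch and not as a description of how YRCloudFile does it, is to hash the parent directory of each entry onto a metadata server. The server names and the `owner_of` function are hypothetical.

```python
import hashlib
import posixpath


def owner_of(path: str, servers: list[str]) -> str:
    """Pick the metadata server responsible for the directory entry at `path`.

    Entries are placed by hashing the parent directory, so all children of one
    directory land on the same server and a listing touches a single node.
    """
    parent = posixpath.dirname(posixpath.normpath(path)) or "/"
    digest = hashlib.sha256(parent.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

The design choice has visible costs: a very large or very hot directory still lands on one server, and simple modulo placement moves most directories when the server list changes, which is one reason membership has to be agreed on before placement can be trusted.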
3) Reliability and consistency:
To keep data safe when a disk fails completely, the system must be fault tolerant, usually by storing redundant data: either multiple replicas or EC. How, then, is data kept consistent between replicas?
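One textbook answer to the replica consistency question is versioned quorum replication: with N replicas, a write must reach W of them and a read must hear from R of them, with W + R > N, so every read overlaps the latest successful write. The sketch below is an in-memory illustration under our own assumptions. The class and function names are hypothetical, and version numbers are assumed to be handed out monotonically by the caller (for example by a primary).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """In-memory stand-in for one storage node."""
    up: bool = True
    store: dict = field(default_factory=dict)  # key -> (version, value)

    def put(self, key, version, value) -> bool:
        """Apply the write if it is newer; return False if the node is down."""
        if not self.up:
            return False
        current_version, _ = self.store.get(key, (0, None))
        if version > current_version:
            self.store[key] = (version, value)
        return True

    def get(self, key):
        """Return (version, value), or None if the node is down."""
        return self.store.get(key, (0, None)) if self.up else None


def quorum_write(replicas, key, version, value, w) -> bool:
    """Send the write to every replica; succeed only if at least w acknowledge."""
    acks = sum(r.put(key, version, value) for r in replicas)
    return acks >= w


def quorum_read(replicas, key, r):
    """Collect replies from reachable replicas and return the newest value."""
    replies = [x for x in (rep.get(key) for rep in replicas) if x is not None]
    if len(replies) < r:
        raise RuntimeError("not enough replicas reachable")
    return max(replies, key=lambda reply: reply[0])[1]
```

A real system also has to repair stale replicas and clean up after a write that reached fewer than W nodes; the sketch leaves both out.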
4) Efficiency:
Many customers are used to a native Linux file system, where the PageCache makes things feel fast even when the underlying disk is an HDD. After switching to a distributed file system, the data is distributed and may be local or remote. Can performance still meet the customer’s needs, and how can that be guaranteed?
Distributed file systems typically use several kinds of cache, such as client-side caches. How, then, is cache consistency between clients ensured?
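One simple approach to the client cache question is server-driven invalidation: the server remembers which clients hold a cached copy of a file and tells them to drop it before a new write becomes visible. The sketch below is a single-process illustration under our own assumptions, with hypothetical class names. Real systems usually add leases or locks with timeouts so that a crashed or partitioned client cannot block writers.

```python
class FileServer:
    """Tracks which clients cache each file and invalidates them on writes."""

    def __init__(self):
        self.data = {}     # path -> bytes
        self.cachers = {}  # path -> set of clients holding a cached copy

    def read(self, client, path):
        self.cachers.setdefault(path, set()).add(client)
        return self.data.get(path, b"")

    def write(self, client, path, value):
        # Invalidate every other cached copy before the write becomes visible.
        for other in self.cachers.pop(path, set()) - {client}:
            other.invalidate(path)
        self.data[path] = value
        self.cachers[path] = {client}


class Client:
    """A client with a local read cache kept coherent by server callbacks."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.read(self, path)
        return self.cache[path]

    def write(self, path, value):
        self.server.write(self, path, value)
        self.cache[path] = value

    def invalidate(self, path):
        self.cache.pop(path, None)
```

With two clients sharing one `FileServer`, a write by the first removes the file from the second client’s cache, so its next read goes back to the server instead of returning stale data.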
We believe that distributed file storage is demanding both in theory and in engineering. In recent years, domestic technology has advanced rapidly and technical sharing has become very thorough. Much of the key theory of distributed storage is available online, but newcomers need to sort out how the pieces connect in order to build their own knowledge tree.
As the Internet flattens the world, it also flattens the boundaries of knowledge. You can always find technical material on a subtopic; the difficulty is that you first need to know the subtopic exists, that is, you need to know what to look for.
Therefore, we believe that for experienced distributed storage engineers, the challenge lies mainly in good engineering practice: how to implement modules so that they are simpler, better designed and faster.
For new engineers who want to work on distributed storage, or more specifically on distributed file storage, what is needed is the chance to go through the actual process of building a distributed storage system. For them the challenge is how to build the right knowledge tree, which is also the goal of this topic.
In our experience, building such a knowledge tree on your own is time-consuming and calls for a broad range of skills.
Therefore, it is highly recommended that you participate in a distributed storage project.
This can be an open source project, or you can join a company that does distributed storage.