In the era of big data and information explosion, storing massive amounts of data at a lower cost has become an important part of enterprise big data operations. UCloud has developed a new-generation object storage service, US3, which has already launched compute-storage separation and big data backup solutions for big data scenarios.

The main reasons behind this include:

1. With the rapid development of network technology, network transmission performance is no longer the bottleneck for high-throughput workloads in big data scenarios;
2. The HDFS storage solution in the Hadoop technology stack is complex and costly to operate and maintain;
3. The object storage service US3, built on the cloud platform's massive storage resource pools, is on-demand, simple to use, reliable, stable, and inexpensive, making it an ideal storage replacement for HDFS.

Therefore, to make it easier for users to adopt US3 for compute-storage separation and big data backup in Hadoop scenarios, US3 developed three components: the US3Hadoop adapter, US3Vmds, and US3Distcp.

This article introduces some of the design ideas behind the US3Hadoop adapter and the problems solved during its development.

Overall design idea

Storage access in the Hadoop ecosystem basically goes through FileSystem, a common base class for file systems. The US3Hadoop adapter (hereafter, the adapter) operates on US3 through US3FileSystem, an implementation of this base class, just as HDFS provides DistributedFileSystem and the AWS S3 protocol is served by S3AFileSystem. The adapter sends both IO and index requests directly to US3, as shown in the following diagram:

Here, index operations are the APIs that do not read or write file data, such as HeadFile, ListObjects, Rename, DeleteFile, and Copy (used to modify metadata); IO operations are APIs such as GetFile, PutFile (for files smaller than 4MB), and the four multipart-upload APIs: InitiateMultipartUpload, UploadPart, FinishMultipartUpload, and AbortMultipartUpload. With these US3 APIs in hand, we can work out how FileSystem's member methods map onto them and which methods need to be overridden. Based on actual requirements, and referring to the DistributedFileSystem and S3AFileSystem implementations, we identified the main methods to implement: initialize, create, rename, getFileStatus, open, listStatus, mkdirs, setOwner, setPermission, setReplication, setWorkingDirectory, getWorkingDirectory, getScheme, getUri, getDefaultBlockSize, and delete. Methods that are hard to emulate on object storage, such as append, simply throw an unsupported-operation exception.
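For illustration, here is a minimal sketch of such a FileSystem subclass. The class name US3FileSystemSketch, the "us3" scheme string, and the configuration handling are assumptions for this sketch rather than the adapter's actual code; the class is declared abstract so only a few representative overrides need to be shown.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;

// Declared abstract so the sketch can focus on a few representative methods;
// a real adapter overrides the full set listed above.
public abstract class US3FileSystemSketch extends FileSystem {

  private URI uri;

  @Override
  public String getScheme() {
    // The scheme users put in paths, e.g. us3://bucket/key (an assumption here).
    return "us3";
  }

  @Override
  public URI getUri() {
    return uri;
  }

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    this.uri = URI.create(getScheme() + "://" + name.getAuthority());
    // A real implementation would read credentials/endpoint from conf here
    // and build the client used for US3 index and IO requests.
  }

  @Override
  public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
      throws IOException {
    // Object storage has no efficient append, so hard-to-emulate methods
    // simply throw, as described above.
    throw new UnsupportedOperationException("append is not supported by this adapter");
  }
}
```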

The semantics of FileSystem's member methods are similar to those of a standalone file system engine, which organizes and manages files in a directory tree. US3's ListObjects API provides a way to pull such a directory tree: when Hadoop calls listStatus, all children of the current directory (prefix) can be fetched by looping over ListObjects and returning the corresponding results.
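As a rough illustration of that loop, the sketch below pulls children page by page until the listing is exhausted. The Us3Index interface, the ListPage fields, and the delimiter/maxKeys parameters are assumptions standing in for the real US3 SDK.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Hypothetical thin wrapper around the US3 ListObjects API; the names and
// fields below are assumptions for this sketch, not the real SDK.
interface Us3Index {
  ListPage listObjects(String prefix, String delimiter, String marker, int maxKeys)
      throws IOException;
}

class ListPage {
  List<FileStatus> entries;   // children of the prefix, already converted to FileStatus
  String nextMarker;          // where the next page should continue from
  boolean truncated;          // false once the listing is exhausted
}

class ListStatusSketch {
  // Pull all children of a "directory" by looping ListObjects until the
  // listing is no longer truncated.
  static FileStatus[] listStatus(Us3Index index, Path dir) throws IOException {
    String prefix = dir.toUri().getPath();
    if (prefix.startsWith("/")) {
      prefix = prefix.substring(1);   // object keys carry no leading slash
    }
    if (!prefix.isEmpty() && !prefix.endsWith("/")) {
      prefix = prefix + "/";          // directories are keys ending with "/"
    }
    List<FileStatus> result = new ArrayList<>();
    String marker = "";
    while (true) {
      ListPage page = index.listObjects(prefix, "/", marker, 1000);
      result.addAll(page.entries);
      if (!page.truncated) {
        break;
      }
      marker = page.nextMarker;       // continue from where the last page stopped
    }
    return result.toArray(new FileStatus[0]);
  }
}
```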

US3's metadata feature is used to store a file's user/group and permissions, and this information is mapped onto the file's key-value metadata pairs. The output stream for a written file first buffers up to 4MB of data in memory, and then decides whether to use PutFile or the multipart-upload APIs depending on the subsequent writes.
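A minimal sketch of that write-path decision might look as follows, assuming a hypothetical Us3IoClient whose method names mirror the APIs mentioned above but whose signatures are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical IO client whose method names mirror the APIs above; the
// signatures are invented for this sketch.
interface Us3IoClient {
  void putFile(String key, byte[] data, int len) throws IOException;
  String initiateMultipartUpload(String key) throws IOException;
  void uploadPart(String key, String uploadId, int partNumber, byte[] data, int len)
      throws IOException;
  void finishMultipartUpload(String key, String uploadId) throws IOException;
}

class Us3OutputStreamSketch extends OutputStream {
  private static final int PART_SIZE = 4 * 1024 * 1024;   // 4MB in-memory buffer

  private final Us3IoClient client;
  private final String key;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(PART_SIZE);
  private String uploadId;    // stays null while everything still fits in the buffer
  private int partNumber = 1;

  Us3OutputStreamSketch(Us3IoClient client, String key) {
    this.client = client;
    this.key = key;
  }

  @Override
  public void write(int b) throws IOException {
    buffer.write(b);
    if (buffer.size() >= PART_SIZE) {
      flushAsPart();          // data exceeds 4MB: switch to multipart upload
    }
  }

  private void flushAsPart() throws IOException {
    if (uploadId == null) {
      uploadId = client.initiateMultipartUpload(key);
    }
    byte[] data = buffer.toByteArray();
    client.uploadPart(key, uploadId, partNumber++, data, data.length);
    buffer.reset();
  }

  @Override
  public void close() throws IOException {
    if (uploadId == null) {
      // Everything fit into the 4MB buffer: a single PutFile is enough.
      byte[] data = buffer.toByteArray();
      client.putFile(key, data, data.length);
    } else {
      if (buffer.size() > 0) {
        flushAsPart();        // upload the tail part
      }
      client.finishMultipartUpload(key, uploadId);
    }
  }
}
```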

Reading a file is implemented by wrapping the stream instance returned by GetFile and reading the expected data from it. While these implementations may look straightforward, there is plenty of room for optimization.

Analyzing how FileSystem is invoked shows that index operations account for more than 70% of calls in big data scenarios, and getFileStatus is the most frequent of them, so it is worth optimizing. So where are the optimization points?

First of all, a "directory" in US3 (object storage is key-value storage, so directories are only simulated) is a key ending with "/", whereas FileSystem operates on files through the Path structure, which does not end with "/". When the key derived from a Path is used in a HeadFile request, US3 may return 404 because the key actually exists as a directory in US3, so a second HeadFile on "key/" is required; and if the key does not exist as a directory either, getFileStatus has paid for two round trips and its latency increases significantly.

So when creating a directory, the US3 adapter does two things:
1. Write an empty object to US3 whose MIME type is "file/path" and whose key is "Key".
2. Write an empty object to US3 whose MIME type is "application/x-director" and whose key is "Key/".

The MIME type of an ordinary file is application/octet-stream. This lets getFileStatus tell whether the current key is a file or a directory with a single HeadFile call, and an empty directory can also be shown in the US3 console. In addition, since Hadoop scenarios mostly write large files and writing an empty index object only takes milliseconds, the extra latency is basically negligible.
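Putting the pieces together, getFileStatus could then look roughly like the sketch below. HeadResult and Us3Head are hypothetical stand-ins for the real HeadFile client; the MIME-type strings follow the convention described above.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Hypothetical HeadFile client and result; the field names are assumptions.
interface Us3Head {
  HeadResult headFile(String key) throws IOException;
}

class HeadResult {
  boolean exists;
  long size;
  long mtime;
  String mimeType;
}

class GetFileStatusSketch {
  static FileStatus getFileStatus(Us3Head index, Path path, String key) throws IOException {
    HeadResult head = index.headFile(key);
    if (head.exists) {
      // The mkdirs convention above marks directories with special MIME types,
      // so a single HeadFile call can tell files and directories apart.
      boolean isDir = "file/path".equals(head.mimeType)
          || "application/x-director".equals(head.mimeType);
      return new FileStatus(isDir ? 0 : head.size, isDir, 1, 0, head.mtime, path);
    }
    // Fallback for directories created by other tools, where only "key/" exists.
    HeadResult dirHead = index.headFile(key + "/");
    if (dirHead.exists) {
      return new FileStatus(0, true, 1, 0, dirHead.mtime, path);
    }
    throw new FileNotFoundException(path.toString());
  }
}
```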

In addition, getFileStatus exhibits obvious temporal locality in Hadoop workloads: a key recently accessed via getFileStatus on a given FileSystem instance will be accessed again several times within a short period. To exploit this, US3FileSystem inserts the FileStatus into a cache with a 3-second lifetime before returning it from getFileStatus. When a key is deleted, its FileStatus is removed from the cache and a 404 entry with a 3-second lifetime is inserted in its place. For rename, the source key's cache entry is reused to build the entry for the destination key, and the source entry is then removed. All of this reduces the amount of interaction with US3, and a cache hit (at the microsecond level) cuts getFileStatus latency by a factor of about 100.
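A minimal sketch of such a cache, here using Guava's CacheBuilder with a 3-second expireAfterWrite as one possible implementation (the adapter's actual cache structure is not described in this article):

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.fs.FileStatus;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// An empty Optional models the "404 cache" so a recently deleted key does not
// trigger another HeadFile within the 3-second window.
class FileStatusCacheSketch {
  private final Cache<String, Optional<FileStatus>> cache = CacheBuilder.newBuilder()
      .expireAfterWrite(3, TimeUnit.SECONDS)
      .maximumSize(10_000)        // capacity bound is an assumption
      .build();

  Optional<FileStatus> get(String key) {
    return cache.getIfPresent(key);   // null means "not cached", so ask US3
  }

  void putHit(String key, FileStatus status) {
    cache.put(key, Optional.of(status));
  }

  void putMiss(String key) {
    cache.put(key, Optional.empty()); // negative (404) entry after a delete
  }

  void rename(String src, String dst) {
    // Reuse the source entry to build the destination entry, then invalidate the source.
    Optional<FileStatus> cached = cache.getIfPresent(src);
    if (cached != null && cached.isPresent()) {
      cache.put(dst, cached);         // a real adapter would rewrite the Path inside
    }
    putMiss(src);
  }
}
```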

Of course, this introduces some consistency issues, but only when at least one of several concurrent jobs writes (for example via delete or rename); if they all only read, there is no impact, and big data scenarios mostly fall into the latter category.

ListObjects consistency problem

US3's ListObjects interface, like those of other object storage services, currently provides only eventual consistency (although US3 will later introduce a ListObjects interface with strong consistency guarantees). As a result, adapters for other object storage services may write a file and then occasionally fail to see it when listStatus is called immediately afterwards. Other object storage vendors sometimes alleviate this by introducing a middleware service (typically a database): the file index is written to the middleware when a file is written, and the index information from the middleware is merged into the results when listStatus is called.

But this is not enough. If, for example, the write to the object store succeeds but the program crashes before the index reaches the middleware, the two become inconsistent and the behavior degrades back to eventual consistency.

The US3Hadoop adapter's approach is comparatively simple and efficient, requires no additional services, and provides read-your-writes consistency for index operations, which is equivalent to strong consistency in most Hadoop scenarios. Unlike the S3AFileSystem implementation, which returns immediately after create, rename, or delete, the US3Hadoop adapter internally calls the ListObjects interface to perform a "check", and only returns once the "check" result matches expectations.
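Conceptually, the "check" is just a bounded polling loop against the list service, along the lines of the sketch below; the retry budget and back-off values are assumptions for illustration.

```java
import java.io.IOException;

// The post-write "check": poll the list service until the key's visibility
// matches what the index operation just did, or the retry budget runs out.
class ListCheckSketch {
  interface Lister {
    // True if the key currently appears in the ListObjects result.
    boolean visibleInListing(String key) throws IOException;
  }

  static void waitUntil(Lister lister, String key, boolean expectVisible)
      throws IOException, InterruptedException {
    final int maxAttempts = 50;        // retry budget: an assumption for this sketch
    for (int i = 0; i < maxAttempts; i++) {
      if (lister.visibleInListing(key) == expectVisible) {
        return;                        // the listing has caught up, safe to return
      }
      Thread.sleep(10);                // brief back-off before asking again
    }
    throw new IOException("listing did not converge for key: " + key);
  }
}
```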

Of course, there is still room for optimization here. For example, deleting a directory means listing all the files under it and then removing them one by one with the DeleteFile API; if every DeleteFile call were "checked", the overall latency would double. The US3Hadoop adapter therefore "checks" only the last index operation, because the oplog of index operations is synchronized to the list service in chronological order: if the last operation passes the "check", the earlier oplog entries must already have reached the list service.
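Continuing the sketch above, deleting a directory would then run the "check" only for the final delete:

```java
import java.io.IOException;
import java.util.List;

// Delete every child, but run the listing "check" only for the last index
// operation in the batch.
class DeleteDirSketch {
  interface Deleter {
    void deleteFile(String key) throws IOException;
  }

  static void deleteDirectory(Deleter deleter, ListCheckSketch.Lister lister,
                              List<String> childKeys, String dirMarkerKey)
      throws IOException, InterruptedException {
    for (String key : childKeys) {
      deleter.deleteFile(key);         // no per-file check here
    }
    deleter.deleteFile(dirMarkerKey);  // the last index operation...
    ListCheckSketch.waitUntil(lister, dirMarkerKey, false);  // ...is the only one checked
  }
}
```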

Deep customization of Rename

As mentioned earlier, rename is another important optimization point for US3. Other object storage adapters implement it with the Copy interface: copy the file first, then delete the source. Clearly, if the file being renamed is large, the whole rename process suffers high latency.

Based on the requirements of this scenario, US3 developed a dedicated Rename API. The US3Hadoop adapter can therefore implement rename with much lighter semantics and keep its latency at the millisecond level.
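The dedicated Rename API itself is not described in this article, so the sketch below only contrasts the two strategies behind a hypothetical client interface:

```java
import java.io.IOException;

// Hypothetical client interface: the dedicated rename call keeps latency at
// the index level, while copy-then-delete scales with the object size.
class RenameSketch {
  interface Us3Client {
    void rename(String srcKey, String dstKey) throws IOException;   // dedicated Rename API
    void copy(String srcKey, String dstKey) throws IOException;     // generic fallback
    void deleteFile(String key) throws IOException;
  }

  static void rename(Us3Client client, String srcKey, String dstKey, boolean hasRenameApi)
      throws IOException {
    if (hasRenameApi) {
      client.rename(srcKey, dstKey);   // millisecond-level, independent of file size
    } else {
      client.copy(srcKey, dstKey);     // latency grows with the size of the object
      client.deleteFile(srcKey);
    }
  }
}
```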

Keeping reads efficient and stable

Read is a high-frequency operation in big data scenarios, so the read stream in the US3Hadoop adapter is not a simple wrapper around the HTTP response body; it incorporates a number of optimizations. For example, a read-ahead buffer reduces the number of network IO system calls and the latency of read operations, and the improvement is especially noticeable for large sequential reads.
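The read-ahead idea can be sketched as follows; the 512KB buffer size is an assumption rather than the adapter's actual value.

```java
import java.io.IOException;
import java.io.InputStream;

// Refill a large buffer from the underlying HTTP body stream so most read()
// calls are served from memory. The 512KB size is an assumption.
class ReadAheadSketch {
  private static final int BUFFER_SIZE = 512 * 1024;

  private final InputStream underlay;   // body stream of the GetFile response
  private final byte[] buffer = new byte[BUFFER_SIZE];
  private int bufferLen = 0;            // valid bytes currently in the buffer
  private int bufferPos = 0;            // consumption offset inside the buffer

  ReadAheadSketch(InputStream underlay) {
    this.underlay = underlay;
  }

  int read(byte[] dst, int off, int len) throws IOException {
    if (bufferPos >= bufferLen) {
      // Buffer exhausted: one large network read instead of many small ones.
      bufferLen = underlay.read(buffer, 0, buffer.length);
      bufferPos = 0;
      if (bufferLen < 0) {
        return -1;                      // end of stream
      }
    }
    int n = Math.min(len, bufferLen - bufferPos);
    System.arraycopy(buffer, bufferPos, dst, off, n);
    bufferPos += n;
    return n;
  }
}
```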

In addition, FileSystem's read streams expose a seek interface to support random reads. There are two scenarios:

1. Seeking to a position before the current read position. A straightforward implementation would discard and close the HTTP response body of the current Underlay Stream, issue a new GetFile request starting from the seek position, and use the body of that HTTP response as the new Underlay Stream. In practice, however, we found that many seek calls are not necessarily followed by a read: the stream may simply be closed afterwards, or a later seek may move back to a position at or beyond the current read position. So when seek is called, the US3Hadoop adapter only records the seek position; the Underlay Stream is lazily closed and reopened when read is actually called. Furthermore, if the seek target still falls within the read-ahead buffer, the Underlay Stream is not reopened at all and only the buffer's consumption offset is adjusted.

2. The other random-read scenario is seeking to a position after the current read position. Here the adapter also opens the stream lazily, as above, but when the seek is finally resolved it is not always necessary to close the old Underlay Stream and open a new one at the target position. The current read position may be very close to the seek target, say only 100KB away, and that distance may fall entirely within the read-ahead buffer; in that case, simply adjusting the buffer's consumption offset is again enough.

This is in fact what the US3Hadoop adapter does, and the current rule is: if the distance from the current read position to the seek target is less than or equal to the remaining space in the read-ahead buffer plus 16KB, the adapter reaches the target simply by adjusting the buffer's consumption offset and consuming data from the existing Underlay Stream. The extra 16KB accounts for data that is likely already sitting in the TCP receive buffer. The threshold also factors in the point at which consuming that many bytes from an already-open Underlay Stream costs roughly as much as issuing a new GetFile request and waiting for its HTTP response body to start arriving.
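The lazy-seek rule can be summarized in a small decision sketch. The field and helper names are ours, and backward seeks that still land inside the buffered data are handled the same way by moving the buffer offset.

```java
// The field and helper names here are ours; the point is the rule for when a
// seek can stay on the current stream instead of reopening it.
class LazySeekSketch {
  private static final long TCP_SLACK = 16 * 1024;  // data likely already in the TCP receive buffer

  private long streamPos;       // position of the underlying GetFile stream
  private long bufferRemaining; // unread bytes left in the read-ahead buffer
  private long target = -1;     // position recorded by seek(), applied lazily

  void seek(long pos) {
    target = pos;               // just remember it; no network work yet
  }

  // Called at the start of read(): decide how to reach the recorded seek target.
  SeekAction resolveSeek() {
    if (target < 0 || target == streamPos) {
      return SeekAction.NONE;           // nothing to do
    }
    if (target > streamPos && target - streamPos <= bufferRemaining + TCP_SLACK) {
      return SeekAction.SKIP_FORWARD;   // adjust the buffer offset / consume a few bytes
    }
    return SeekAction.REOPEN;           // close the Underlay Stream, GetFile from target
  }

  enum SeekAction { NONE, SKIP_FORWARD, REOPEN }
}
```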

Finally, the stream optimizations must also handle Underlay Stream failures. In HBase scenarios, for example, an open stream may be held for a long time but not read for a while because of other operations; US3 may then proactively close and release the TCP connection behind the Underlay Stream, and subsequent operations on the stream will hit a TCP RST exception. To preserve availability, the US3Hadoop adapter reopens the Underlay Stream at the position that has already been read and continues from there.
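A rough sketch of that recovery path, retrying once after reopening the stream at the already-read position:

```java
import java.io.IOException;
import java.io.InputStream;

// If the underlying stream dies (e.g. the server closed an idle connection and
// we hit a TCP RST), reopen it at the already-read position and retry once.
class ReopenOnErrorSketch {
  interface StreamOpener {
    InputStream openAt(long position) throws IOException;   // GetFile starting at an offset
  }

  private final StreamOpener opener;
  private InputStream underlay;
  private long position;        // bytes successfully read so far

  ReopenOnErrorSketch(StreamOpener opener, InputStream underlay) {
    this.opener = opener;
    this.underlay = underlay;
  }

  int read(byte[] dst, int off, int len) throws IOException {
    try {
      int n = underlay.read(dst, off, len);
      if (n > 0) {
        position += n;
      }
      return n;
    } catch (IOException e) {
      try { underlay.close(); } catch (IOException ignored) { }
      underlay = opener.openAt(position);   // resume from where reading stopped
      int n = underlay.read(dst, off, len);
      if (n > 0) {
        position += n;
      }
      return n;
    }
  }
}
```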

Final thoughts

Building on open source implementations, the US3Hadoop adapter further optimizes the core problems described above, improving the reliability and stability of Hadoop access to US3. It has served as an important bridge between Hadoop and US3 in a number of customer cases, helping users improve storage read and write efficiency in big data scenarios.

However, the US3Hadoop adapter still has plenty of room for improvement. Compared with HDFS, its index and IO latency still lag behind, and its atomicity guarantees are relatively weak; these are the problems we plan to tackle next. US3Vmds already solves most of the index latency issues, greatly improving the performance of accessing US3 through the US3Hadoop adapter and approaching native HDFS performance in some scenarios. For concrete numbers, see the official documentation (docs.ucloud.cn/ufile/tools…).

Going forward, US3 will continue to improve and optimize its storage solutions for big data scenarios, further improving users' US3 experience while reducing the cost of big data storage.