Background
HDFS is the default storage system of the Hadoop ecosystem, and many data analysis and management tools are designed and implemented against its API. However, HDFS was designed for traditional on-premises data centers, and maintaining it in the cloud is not easy: monitoring, tuning, capacity expansion, and fault recovery all require considerable manpower.
With the trend of separating storage from compute, many teams try to build data lakes on object storage, and object storage vendors provide connectors for the Hadoop ecosystem. However, due to the inherent limitations of object storage, its functionality and performance are very limited, and these problems become more prominent as data grows to a certain scale.
JuiceFS is designed to address these issues: it preserves the cloud-native characteristics of object storage while better integrating HDFS semantics and functionality, significantly improving overall performance. This article takes Alibaba Cloud OSS as an example to show how JuiceFS comprehensively improves the performance of object storage in big data scenarios in the cloud.
Metadata performance
To be fully compatible with HDFS and deliver extreme metadata performance, JuiceFS manages all metadata in memory and uses OSS only as the data store; metadata operations never need to access OSS, which guarantees both performance and consistency. The response time of most metadata operations is under 1 ms, whereas OSS typically takes tens to over a hundred milliseconds. Below are the results of a metadata comparison using NNBench:
The rename result shown above is for a single file only, because OSS implements rename as a copy followed by a delete, which is slow. In real big data jobs it is usually directories that get renamed: on OSS this is O(N) and slows down significantly as the number of files in the directory grows, whereas rename in JuiceFS is O(1), just an atomic operation on the metadata server, and stays equally fast regardless of directory size.
Similarly, the du operation reports the total size of all files in a directory, which is useful for managing capacity or understanding the scale of the data. For a directory with 100 GB of data (3,949 subdirectories and files), JuiceFS is 76 times faster than OSS. This is because du in JuiceFS returns immediately from directory sizes that the metadata server keeps up to date in memory, whereas OSS requires the client to traverse all files in the directory and sum their sizes. The more files in the directory, the larger the performance gap.
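Both of these operations go through the standard Hadoop FileSystem API, so the same code runs unchanged against OSS or JuiceFS; only the URI scheme and connector configuration differ. Here is a minimal sketch (the jfs:// address and paths are illustrative assumptions, not the exact setup used in the benchmark):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataOps {
    public static void main(String[] args) throws Exception {
        // Hypothetical JuiceFS path; for comparison an oss://bucket/... path would be used instead.
        FileSystem fs = FileSystem.get(URI.create("jfs://myjfs/"), new Configuration());

        // rename: an O(1) atomic metadata operation in JuiceFS,
        // but an O(N) copy-and-delete for directories on object storage.
        fs.rename(new Path("/warehouse/staging"), new Path("/warehouse/ready"));

        // du: JuiceFS answers from sizes kept in the metadata server's memory;
        // an object-store connector has to list and sum every object under the prefix.
        ContentSummary summary = fs.getContentSummary(new Path("/warehouse/ready"));
        System.out.println("total bytes: " + summary.getLength());
    }
}
```

Running the same two calls against an OSS path makes the cost difference directly visible: rename degenerates into server-side copies plus deletes, and getContentSummary into a full listing of the prefix.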
Sequential read and write performance
In big data scenarios, a lot of raw data is stored in text format, written by appending and read sequentially (or one split of it is read sequentially). Throughput is the key metric when accessing such files. To support this well, JuiceFS splits files into logical chunks of 64 MB, which are in turn split into 4 MB (configurable) blocks written to object storage, so multiple blocks can be read and written concurrently to improve throughput. OSS also supports multipart uploads, but it limits part size and part count, whereas JuiceFS has no such limits and supports files of up to 256 PB.
JuiceFS's built-in LZ4 or Zstandard compression can compress and decompress data transparently during concurrent reads and writes, which not only reduces storage cost but also cuts network traffic, further improving sequential read and write performance. For data that is already compressed, both algorithms detect this automatically and avoid compressing it again.
Combined with JuiceFS's intelligent prefetch and write-back algorithms, it is easy to saturate the network bandwidth and make full use of multi-core CPUs, maximizing the throughput of text file processing. The following chart shows the results of a single-threaded sequential I/O test using random (incompressible) data, demonstrating that JuiceFS significantly speeds up both reads and writes of large files.
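To make the layout concrete, here is a small sketch of how a byte offset in a file maps onto this chunk/block scheme (a simplified illustration of the description above, not JuiceFS's actual code; 64 MB and 4 MB are the default sizes mentioned in the text):

```java
public class BlockLayout {
    static final long CHUNK_SIZE = 64L << 20; // 64 MB logical chunks
    static final long BLOCK_SIZE = 4L << 20;  // 4 MB blocks uploaded to object storage (configurable)

    // Map a byte offset in a file to its chunk index, block index within the chunk,
    // and offset within that block. Neighbouring blocks can be read or written to the
    // object store concurrently, which is where the throughput gain comes from.
    static long[] locate(long fileOffset) {
        long chunkIndex = fileOffset / CHUNK_SIZE;
        long offsetInChunk = fileOffset % CHUNK_SIZE;
        long blockIndex = offsetInChunk / BLOCK_SIZE;
        long offsetInBlock = offsetInChunk % BLOCK_SIZE;
        return new long[]{chunkIndex, blockIndex, offsetInBlock};
    }

    public static void main(String[] args) {
        long[] loc = locate(200L << 20); // byte 200 MB into the file
        System.out.printf("chunk=%d block=%d offset=%d%n", loc[0], loc[1], loc[2]);
    }
}
```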
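The idea of skipping already-compressed data can be sketched as follows. The JDK's built-in Deflater is used here only to keep the example dependency-free; JuiceFS itself uses LZ4 or Zstandard, so this illustrates the principle rather than the implementation:

```java
import java.util.zip.Deflater;

public class CompressionProbe {
    // Compress a small sample of the block; if it barely shrinks, the block is
    // almost certainly already compressed (gzip text, Parquet/ORC pages, ...)
    // and storing it compressed again would only waste CPU.
    static boolean worthCompressing(byte[] block) {
        int sampleLen = Math.min(block.length, 64 * 1024);
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(block, 0, sampleLen);
        deflater.finish();
        byte[] out = new byte[sampleLen + 64];
        int compressedLen = deflater.deflate(out);
        deflater.end();
        return compressedLen < sampleLen * 0.9;
    }
}
```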
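For reference, a single-threaded sequential I/O test of this kind can be written directly against the Hadoop FileSystem API, for example as below (the path, sizes, and use of random data are assumptions for illustration, not the exact benchmark used here):

```java
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialIO {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/bench/seq.dat");   // illustrative path
        byte[] buf = new byte[4 << 20];            // 4 MB application buffer
        new Random(42).nextBytes(buf);             // random, incompressible data

        // Sequential write: append-style, which the client turns into concurrent block uploads.
        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(file, true)) {
            for (int i = 0; i < 256; i++) {        // ~1 GB in total
                out.write(buf);
            }
        }
        report("write", 256L * buf.length, start);

        // Sequential read: large contiguous reads let prefetching keep the pipeline full.
        start = System.nanoTime();
        try (FSDataInputStream in = fs.open(file)) {
            while (in.read(buf) > 0) { /* consume */ }
        }
        report("read", 256L * buf.length, start);
    }

    static void report(String op, long bytes, long startNanos) {
        double secs = (System.nanoTime() - startNanos) / 1e9;
        System.out.printf("%s: %.1f MB/s%n", op, bytes / secs / (1 << 20));
    }
}
```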
Random read performance
For analytical data warehouses, raw data is usually cleaned and stored in more efficient columnar formats (Parquet or ORC), which greatly reduce storage space and significantly speed up analysis. The access pattern of columnar data is very different from that of text: random reads dominate, which places higher demands on the overall performance of the storage system.
JuiceFS includes a number of optimizations for the access patterns of these columnar files, at the heart of which is caching data blocks on the SSDs of compute nodes. To guarantee the correctness of cached data, JuiceFS identifies every block written to OSS with a unique ID that never changes, so cached blocks never need to be invalidated; they are only evicted by an LRU policy when space runs low. Since usually only some columns of a Parquet or ORC file are hot, caching the whole file or a whole 64 MB chunk would waste space, so JuiceFS caches in 1 MB (configurable) blocks.
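Because block IDs are immutable, the cache needs no invalidation path at all, only eviction. A minimal sketch of such an LRU block cache (illustrative only, not JuiceFS's cache implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Node-local block cache sketch: entries are keyed by the block's unique, immutable ID,
// so a cached copy can never become stale; blocks are only evicted (LRU) when the cache is full.
public class BlockCache extends LinkedHashMap<String, byte[]> {
    private final int maxBlocks;

    public BlockCache(int maxBlocks) {
        super(16, 0.75f, true);        // access-order: least-recently-used entries age out first
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks;     // evict the LRU block; never "invalidate"
    }
}
```

With 1 MB blocks, for example, a 100 GB SSD cache corresponds to roughly 100,000 entries in such a structure.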
The compute cluster keeps only one cached copy of each block: a consistent hashing scheme determines which node caches it, and the locality-aware mechanisms of the scheduling framework are used to place compute tasks on the node that already holds the cached data. This achieves data locality comparable to HDFS, or even better: HDFS keeps three replicas and usually schedules among them randomly, so the operating system's page cache is poorly utilized, whereas JuiceFS always routes reads of the same data to the same node, so the page cache is used far more effectively.
Data locality is lost when the scheduler cannot perform locality-aware scheduling, for example when SparkSQL reads small files and randomly merges several of them into one task; this happens even with HDFS. JuiceFS's distributed cache solves this problem well: when a compute task is not scheduled onto the node that holds the cache, the JuiceFS client fetches the cached data from that node through an internal P2P mechanism, greatly improving the cache hit ratio and performance.
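The single-cached-copy placement described above can be pictured as a consistent-hash ring: every client hashes a block ID to the same owner node, so the scheduler can place tasks there, and a task running elsewhere knows which node to fetch the cached block from. The sketch below is illustrative only, not JuiceFS's actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class CacheRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Place each node on the ring several times (virtual nodes) for a more even spread.
    public CacheRing(List<String> nodes, int virtualNodes) {
        for (String node : nodes)
            for (int i = 0; i < virtualNodes; i++)
                ring.put(hash(node + "#" + i), node);
    }

    // The node that owns (caches) a block; a task scheduled elsewhere fetches the
    // cached block from this node (the P2P path mentioned above).
    public String ownerOf(String blockId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(blockId));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```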
We selected Q2, whose query time is representative, to test the effect of different block sizes and cache settings:
With caching disabled, a 1 MB block size performs better than 4 MB, because 4 MB blocks cause more read amplification, slowing down random reads and wasting a lot of network bandwidth, which leads to congestion.
With node-local caching enabled, Spark can serve random reads directly from cached blocks, which greatly improves random read performance. However, because SparkSQL randomly merges small files into tasks, most tasks are not scheduled onto the node that caches their data, so the cache hit ratio stays low.
With distributed caching enabled, the JuiceFS client can read cached data at full speed from a fixed node no matter where the compute task is scheduled; the cache hit ratio is very high and queries are fast (typically the second run of a query already shows a significant speedup).
JuiceFS also supports random writes, but big data workloads do not need this capability and OSS does not support it, so we did not compare it.
Comprehensive performance
TPC-DS is a typical benchmark for big data analysis scenarios, and we used it to measure how much JuiceFS improves on OSS across different data formats and analysis engines.
The test environment
We built a cluster in the cloud using CDH 5.16 (probably the most widely deployed version) with the following configuration and software versions:
Apache Spark (cloudera2)
Apache Impala 2.12
Presto 0.234
OSS Java SDK 3.4.1
JuiceFS Hadoop SDK 0.6-beta
Master: 4 CPU, 32 GB memory × 1
Slave: 4 CPU, 16 GB memory, 200 GB efficient cloud disk × 2, × 3
Spark parameters: master yarn, driver-memory 3g, executor-memory 9g, executor-cores 3, num-executors 3, spark.locality.wait 100, spark.dynamicAllocation.enabled false
The test data was the 100 GB TPC-DS data set, in multiple storage formats and with different parameters. Running all 99 queries would have taken too long, so we chose the first 10 as representatives; they already cover the main query types.
Write performance
We test write performance by reading from and writing to the same table, using the following SQL statement:
INSERT OVERWRITE store_sales SELECT * FROM store_sales;
We compared the unpartitioned text format and the Parquet format partitioned by date; JuiceFS showed significant performance improvements in both cases, especially for partitioned Parquet. Analysis shows that OSS spends a lot of time on rename, which has to copy data and cannot be parallelized, whereas rename in JuiceFS is an atomic operation that completes instantly.
SparkSQL query performance
Apache Spark is widely used, so we used SparkSQL to test the speedup of JuiceFS on three file formats: unpartitioned text, and Parquet and ORC partitioned by date.
For the unpartitioned text format, all text data has to be scanned and the main bottleneck is the CPU, so the speedup from JuiceFS is limited, up to 3x. Note that when OSS is accessed over HTTPS, the TLS library in Java is much slower than the Go TLS library used by JuiceFS, and JuiceFS also compresses data to reduce network traffic, so when both use HTTPS to access OSS, JuiceFS shows an even bigger advantage.
The figure above shows that JuiceFS performance barely changes when HTTPS is enabled, while OSS slows down considerably.
For interactive queries, hotspot data is often queried repeatedly. The figure above shows the result of the same query being repeated three times. JuiceFS relies on cached hotspot data to significantly improve performance.
For the ORC data set, the speedup is similar to Parquet, ranging from 40% up to 11x.
JuiceFS can significantly improve OSS query performance for all data formats, up to more than 10 times.
Impala query performance
Impala is a very high-performance interactive analysis engine with its own optimizations for I/O locality and I/O scheduling, so even without JuiceFS's distributed cache it achieves great results: 42 times faster than running on OSS!
Presto is a query engine similar to Impala, but because the OSS configured in our test environment did not work with Presto (for reasons we could not determine), we were unable to compare JuiceFS against OSS on Presto.
Conclusion
In summary, JuiceFS significantly improves the speed of OSS in all scenarios, especially for columnar formats such as Parquet and ORC, with writes up to 8 times faster and queries more than 10 times faster. This performance improvement not only saves data analysts' valuable time, but also significantly reduces computing resource usage and cost.
The performance comparison above uses Alibaba Cloud OSS as an example, but JuiceFS's acceleration applies to all cloud object stores, including Amazon S3, Google Cloud GCS, Tencent Cloud COS, and all kinds of private or in-house object storage; JuiceFS can significantly improve their performance in data lake scenarios. In addition, JuiceFS offers better Hadoop compatibility (such as permission control and snapshots) and full POSIX access, making it ideal for data lakes on the cloud.
If you found this helpful, please check out our project Juicedata/JuiceFS! (0ᴗ0✿)