Preface
Have you ever needed a cold-data cache serving only a few hundred QPS, yet burned dozens of servers just to stay under the storage watermark? Have you ever had a table of hundreds of gigabytes that must be served entirely from cache to meet business requirements? Have you ever had a table of just tens of megabytes that, because of its high QPS, had to be repeatedly split and balanced across multiple servers to fight hot spots? Facing such varied scenarios, the Ali-HBase team is committed to providing services with more choices at lower cost. This article describes HBase compression and DataBlockEncoding.
Lossless compression: smaller, faster and more resource-efficient
General-purpose compression is an important way to reduce database storage costs, and it is widely used in HBase. Databases generally organize data into blocks, and each block is compressed and decompressed independently. The larger the block, the higher the compression ratio and the better the scan throughput; the smaller the block, the lower the read latency. As a tradeoff, online HBase usually uses a 64 KB block size: data is kept uncompressed in the cache, and is compressed when flushed to disk and decompressed when read back.
HBase commonly uses LZO or Snappy compression. Both pursue high compression and decompression speed while achieving a reasonable compression ratio. However, as businesses grow rapidly, more and more of them need to scale out because of storage watermarks. HBase has applied cross-cluster partition recovery technology to reduce replica counts and has upgraded hardware models, but this still cannot keep up with the rapid growth of storage. We have therefore been looking for a compression method with a higher compression ratio.
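The block-size tradeoff above can be illustrated with a small sketch. This uses `zlib` from the Python standard library as a stand-in (it is not one of HBase's actual codecs) and synthetic log-like data, compressing the same bytes in 4 KB blocks versus 64 KB blocks:

```python
import random
import zlib

# Synthetic "log-like" data: redundant text drawn from a small vocabulary.
random.seed(42)
words = ["GET", "PUT", "region", "server", "latency", "ok", "error", "ms"]
data = " ".join(random.choice(words) for _ in range(200_000)).encode()

def compressed_size(data: bytes, block_size: int) -> int:
    """Compress independently in fixed-size blocks, the way a database
    compresses each data block on its own, and return the total size."""
    total = 0
    for i in range(0, len(data), block_size):
        total += len(zlib.compress(data[i:i + block_size]))
    return total

small = compressed_size(data, 4 * 1024)    # 4 KB blocks
large = compressed_size(data, 64 * 1024)   # 64 KB blocks (HBase's common choice)
print(f"4K blocks: {small} bytes, 64K blocks: {large} bytes")
```

Larger blocks give each compression context more data to exploit, so the 64 KB total comes out smaller; the price, as the text notes, is that a reader must decompress a whole 64 KB block to fetch one KV.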
New compression algorithms (ZSTD, LZ4) come online
Zstandard (abbreviated Zstd) is a new lossless compression algorithm that aims to be both fast and achieve a high compression ratio. Unlike LZMA and ZPAQ, it does not pursue the highest possible compression ratio, nor does it pursue the highest compression speed like LZ4. Its compression speed exceeds 200 MB/s and its decompression speed exceeds 400 MB/s, which basically meets current HBase throughput requirements. Production data shows that Zstd improves the compression ratio by 25% to 30% over LZO, which for storage-bound businesses can mean a cost reduction of a quarter to a third.
In another class of cases, tables are small but the QPS is high and RT requirements are strict. For these scenarios we introduced LZ4 compression, whose decompression speed can be more than twice that of LZO in some scenarios. When a read has to decompress a block, the RT and CPU cost of LZ4 are significantly lower than LZO's.
A chart gives an intuitive view of how the various compression algorithms perform. Taking several typical online data scenarios as examples, the actual compression ratios and single-core decompression speeds are as follows (all data come from production):
| Business type | Uncompressed table size | LZO (ratio / decompression MB/s) | ZSTD (ratio / decompression MB/s) | LZ4 (ratio / decompression MB/s) |
|---|---|---|---|---|
| Monitoring | 419.75 T | 5.82 / 372 | 13.09 / 256 | 5.19 / 463.8 |
| Logging | 77.26 T | 4.11 / 333 | 6.0 / 287 | 4.16 / 496.1 |
| Risk control | 147.83 T | 4.29 / 297.7 | 5.93 / 270 | 4.19 / 441.38 |
| Consumption records | 108.04 T | 5.93 / 316.8 | 10.51 / 288.3 | 5.55 / 520.3 |
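Some back-of-envelope arithmetic on the monitoring row shows what these ratios mean for on-disk footprint (this workload compresses especially well, above the 25-30% average cited earlier):

```python
# Storage sizes implied by the table's "Monitoring" row.
uncompressed_tb = 419.75
lzo_ratio, zstd_ratio = 5.82, 13.09

lzo_tb = uncompressed_tb / lzo_ratio    # on-disk size under LZO
zstd_tb = uncompressed_tb / zstd_ratio  # on-disk size under ZSTD

# Fraction of the LZO-compressed storage saved by switching to ZSTD.
savings = 1 - lzo_ratio / zstd_ratio
print(f"LZO: {lzo_tb:.1f} T, ZSTD: {zstd_tb:.1f} T, savings: {savings:.0%}")
```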
As of November 11, 2017, ZSTD had been fully rolled out online, with a cumulative storage optimization of more than 2.5 PB; once fully applied it is expected to save more than 15 PB of storage. LZ4 has also been launched for some latency-sensitive read services. The figure below shows the drop in a cluster's total storage after applying ZSTD to a monitoring workload: the data shrinks from 100+ T to 75 T.
Encoding: finding and exploiting the structure in the data
As a schema-free database, HBase is more flexible than traditional relational databases: users can write data with different schemas into the same table without designing the table structure in advance. However, because the engine knows nothing about the data's structure, HBase must spend many extra bytes annotating field lengths, and it cannot apply type-specific compression. To address this, HBase provides Encoding, which reduces this structural overhead. Because encoding has low CPU cost and good performance, it is usually also kept enabled for data in the block cache.
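To make the "extra bytes" concrete, here is a rough sketch of the classic HBase KeyValue layout (the field sizes follow the commonly documented format; treat the exact numbers as illustrative):

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size of one classic HBase KeyValue (before any
    encoding): 4B key length + 4B value length + 2B row length + row
    + 1B family length + family + qualifier + 8B timestamp + 1B type + value."""
    key = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key + len(value)

size = keyvalue_size(b"user123", b"cf", b"age", b"29")
overhead = size - len(b"29")
print(size, overhead)  # a 2-byte value costs 34 bytes, i.e. 32 bytes of framing
```

For small values, the fixed framing plus the row/family/qualifier repeated in every cell dominates the payload, which is exactly the redundancy encoding targets.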
Introduction to the existing DIFF Encoding
HBase has long supported DataBlockEncoding, which compresses data by eliminating the parts of each KeyValue that repeat the previous one. Taking DIFF, the most common algorithm online, as an example, a KV is encoded as follows:
- A one-byte flag (its purpose is explained below)
- If the key length differs from the previous KV's, 1-5 bytes for the key length
- If the value length differs from the previous KV's, 1-5 bytes for the value length
- 1-5 bytes recording the length of the prefix shared with the previous KV's key
- The non-prefix part of the row key
- The column family name, only for the first KV
- The non-prefix part of the column name
- 1-8 bytes of timestamp, either the original value or the difference from the previous KV's timestamp (whichever takes fewer bytes)
- If the type differs from the previous KV's, 1 byte for the type (Put, Delete)
- The value content
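The heart of the scheme above is prefix sharing between adjacent sorted keys. A toy sketch of just that part (the real encoder also handles lengths, timestamps, and the flag byte):

```python
def diff_encode(keys):
    """Toy DIFF-style key compression: each key stores only the length of the
    prefix it shares with the previous key, plus its distinct suffix."""
    out, prev = [], b""
    for k in keys:
        shared = 0
        while shared < min(len(prev), len(k)) and prev[shared] == k[shared]:
            shared += 1
        out.append((shared, k[shared:]))
        prev = k
    return out

def diff_decode(encoded):
    """Rebuild the keys by splicing each suffix onto the previous key."""
    keys, prev = [], b""
    for shared, suffix in encoded:
        k = prev[:shared] + suffix
        keys.append(k)
        prev = k
    return keys

keys = [b"row001/cf:a", b"row001/cf:b", b"row002/cf:a", b"row002/cf:b"]
enc = diff_encode(keys)
raw = sum(len(k) for k in keys)
packed = sum(1 + len(s) for _, s in enc)  # 1 byte for the shared-prefix length
print(raw, packed)
```

Because HBase keys are sorted, neighbors share long prefixes (row key, column family), so even this toy version roughly halves the key bytes.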
When decoding, how do we know whether the key length is the same as the previous KV's, whether the value length is the same, and whether the stored timestamp is an original value or a difference? That is the job of the leading 1-byte flag; its 8 bits mean:
- Bit 0: if 1, the key length is the same as the previous KV's
- Bit 1: if 1, the value length is the same as the previous KV's
- Bit 2: if 1, the type is the same as the previous KV's
- Bit 3: if 1, the stored timestamp is a difference; otherwise it is the original value
- Bits 4-6: taken together (representing 0-7), the number of bytes used for the stored timestamp
- Bit 7: if 1, the timestamp difference is negative and its absolute value was stored
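A decoder for that flag byte is a few lines of bit masking. One detail is an assumption on my part: since the list says timestamps take 1-8 bytes but three bits only represent 0-7, the sketch below stores length minus one:

```python
def parse_diff_flag(flag: int) -> dict:
    """Decode the 1-byte DIFF flag described above (bit 0 is the lowest bit)."""
    return {
        "key_len_same":   bool(flag & 0x01),        # bit 0
        "value_len_same": bool(flag & 0x02),        # bit 1
        "type_same":      bool(flag & 0x04),        # bit 2
        "ts_is_diff":     bool(flag & 0x08),        # bit 3
        # bits 4-6: assumed to hold (timestamp length - 1), so 1-8 bytes fit
        "ts_length":      ((flag >> 4) & 0x07) + 1,
        "ts_negative":    bool(flag & 0x80),        # bit 7
    }

f = parse_diff_flag(0x75)  # 0b0111_0101
print(f)
```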
After DIFF encoding, a seek in a file consists of two steps:
- Use the index key to find the corresponding data block
- Starting from the first complete KV, search sequentially, decoding one KV after another until the target KV is found
Validated against online data, DIFF Encoding reduces data volume by a factor of two to five.
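Reusing the toy (shared-prefix-length, suffix) representation from before, the sequential in-block search in step two can be sketched as:

```python
def seek_in_block(encoded, target):
    """Sequential seek: restore keys one by one from their
    (shared-prefix-length, suffix) form until we reach `target`."""
    prev = b""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix  # decode the next key
        if prev == target:
            return prev
        if prev > target:
            break  # keys are sorted, so the target is absent
    return None

block = [(0, b"row001"), (5, b"2"), (5, b"3")]  # keys row001, row002, row003
print(seek_in_block(block, b"row002"))
```

Note that every probe must decode from the start of the block; this linear cost is what the next section attacks.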
The new Indexable Delta Encoding goes live
For performance, HBase usually keeps meta information (such as index blocks) in the block cache. With a small block size, the meta information grows large and may no longer fit entirely in the cache, degrading performance; with a large block size, DIFF Encoding's sequential in-block search becomes the bottleneck for random reads. To solve this, we developed Indexable Delta Encoding, which supports fast index-based lookup within a block and greatly improves seek performance. Its principle is shown below:
After locating the data block through the block index, we read the offsets of the complete KVs stored at the end of the block and binary-search them to quickly find the complete KV that bounds the query key; we then decode each diff KV sequentially from there until the target KV is found.
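A sketch of that lookup, again on the toy representation: `full_idx` plays the role of the complete-KV offsets stored at the end of the block (the names and in-memory layout here are illustrative, not the actual on-disk format):

```python
from bisect import bisect_right

def indexable_seek(entries, full_idx, target):
    """Binary-search the complete KVs, then decode diffs sequentially.
    `entries` are (shared-prefix-length, suffix) records; `full_idx` lists the
    positions of complete KVs (whose shared-prefix-length is 0)."""
    full_keys = [entries[i][1] for i in full_idx]
    pos = bisect_right(full_keys, target) - 1  # rightmost complete KV <= target
    if pos < 0:
        return None  # target sorts before every key in the block
    prev = b""
    for shared, suffix in entries[full_idx[pos]:]:
        prev = prev[:shared] + suffix  # decode the next key
        if prev == target:
            return prev
        if prev > target:
            break
    return None

block = [(0, b"row001"), (5, b"2"), (0, b"row003"), (5, b"4")]
full_idx = [0, 2]  # entries 0 and 2 are complete KVs
print(indexable_seek(block, full_idx, b"row004"))
```

The binary search narrows the scan to the short run between two complete KVs, which is why large blocks stop being a random-read bottleneck.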
With Indexable Delta Encoding, the random-seek performance of an HFile is doubled compared to before; for 64 KB blocks, random-seek performance is almost the same as with no encoding at all. In a random-Get scenario with full cache hits, storage overhead grows by only 3-5% compared to DIFF Encoding while RT drops by 50%. Indexable Delta Encoding is already used in multiple online scenarios and has stood the test of Singles' Day. For example, in one risk-control deployment with two clusters receiving nearly equal peak traffic, the cluster running Indexable Delta Encoding served slightly more requests, yet its average read RT was 10%-15% lower.
In the future
Performance optimization and user experience are the Ali-HBase team's unremitting pursuit. We are constantly enriching our toolbox and striving to build the best online storage product for massive data. If you have any questions about the HBase implementation, please contact us.