Bigtable: A Distributed Storage System for Structured Data
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
6 Refinements
The previous section described the implementation of Bigtable. To achieve the high performance, availability, and reliability that users require, that implementation needs a number of refinements. This section describes parts of the implementation in more detail, focusing on these refinements.
Locality groups
A client program can group multiple column families together into a locality group. Each locality group in a Tablet gets its own SSTable. Separating column families that are not typically accessed together into different locality groups makes read operations more efficient. For example, in the Webtable table, page metadata (such as language and checksum) can be placed in one locality group and page content in another; an application that reads only the metadata then never has to read the page content.
In addition, some useful tuning parameters can be set per locality group. For example, a locality group can be declared to be in-memory. The Tablet server lazily loads the SSTables of an in-memory locality group into memory; once loaded, column families in that group can be read without accessing the hard disk. This feature is particularly useful for small pieces of data that are accessed frequently: inside Bigtable, we use it to speed up access to the location column family in the METADATA table.
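To make the idea concrete, here is a minimal sketch in which each locality group owns its own store (standing in for a separate SSTable), so a read of metadata never touches the store that holds page contents. The class and method names are mine, not Bigtable's API, and the in-memory flag is only modeled, not enforced.

```python
# Minimal sketch of locality groups (hypothetical names, not the Bigtable API).
class LocalityGroup:
    def __init__(self, name, families, in_memory=False):
        self.name = name
        self.families = set(families)   # column families in this group
        self.in_memory = in_memory      # tuning knob: keep this group's SSTables in RAM
        self.store = {}                 # (row, column) -> value

class Table:
    def __init__(self, groups):
        self.groups = groups
        # Map each column family to the group that owns it.
        self.family_to_group = {f: g for g in groups for f in g.families}

    def put(self, row, column, value):
        family = column.split(":", 1)[0]
        self.family_to_group[family].store[(row, column)] = value

    def get(self, row, column):
        family = column.split(":", 1)[0]
        return self.family_to_group[family].store.get((row, column))

# Webtable-like layout: metadata in one group (kept in memory), content in another.
webtable = Table([
    LocalityGroup("meta", ["language", "checksum"], in_memory=True),
    LocalityGroup("content", ["contents"]),
])
webtable.put("com.cnn.www", "language:", "EN")
webtable.put("com.cnn.www", "contents:", "<html>...</html>")
print(webtable.get("com.cnn.www", "language:"))  # reads only the "meta" group
```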
Compression
The client program can decide whether the SSTables of a locality group should be compressed and, if so, which compression format to use. Each SSTable block (whose size is a tuning parameter of the locality group) is compressed with the user-specified format. Compressing each block separately costs a small amount of space, but it means that reading a small portion of an SSTable does not require decompressing the entire file. Many client programs use a custom two-pass compression scheme. The first pass uses the Bentley-McIlroy algorithm, which compresses long common strings across a large scan window. The second pass uses a fast compression algorithm that looks for repetitions within a small 16 KB window. Both passes are very fast: they encode at 100-200 MB/s and decode at 400-1000 MB/s.
Although speed was our main concern when choosing a compression algorithm, this two-pass scheme also compresses remarkably well. For example, in Webtable we use it to store page contents. In one test we stored a large number of pages in a compressed locality group; for the purposes of the experiment, we kept only one version of each document. The scheme achieved a 10:1 space reduction, compared with the 3:1 or 4:1 that traditional Gzip typically achieves on HTML pages. It works so well because pages fetched from the same host are stored next to each other, so the Bentley-McIlroy pass can find a great deal of shared content across pages from that host. Many other applications also obtain better compression ratios by choosing row names so that similar data ends up clustered together. Compression is even more effective when multiple versions of the same data are stored in Bigtable.
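The sketch below shows only the block-wise structure of the idea: each block is compressed independently, so reading one block never requires decompressing the whole file. zlib stands in for Google's actual Bentley-McIlroy and fast-window codecs, and the block size and helper names are my own assumptions.

```python
# Sketch of per-block SSTable compression (stand-in codec, not Google's implementation).
import zlib

BLOCK_SIZE = 64 * 1024  # block size is a locality-group tuning parameter

def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks and compress each block on its own."""
    blocks = []
    for off in range(0, len(data), block_size):
        raw = data[off:off + block_size]
        # Bigtable applies two passes here (long-window Bentley-McIlroy,
        # then a fast 16 KB-window pass); zlib is only a placeholder.
        blocks.append(zlib.compress(raw))
    return blocks

def read_block(blocks, index):
    """Decompress just the block that holds the requested data."""
    return zlib.decompress(blocks[index])

payload = b"<html>shared boilerplate across pages</html>" * 4000
blocks = compress_blocks(payload)
assert read_block(blocks, 1) == payload[BLOCK_SIZE:2 * BLOCK_SIZE]
```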
Caching for read performance
To improve read performance, Tablet servers use two levels of caching. The Scan Cache is the higher-level cache; it caches the key-value pairs returned to the Tablet server code through the SSTable interface. The Block Cache is the lower-level cache; it caches SSTable blocks read from GFS. The Scan Cache is most useful for applications that read the same data repeatedly; the Block Cache is useful for applications that tend to read data close to data they recently read (for example, sequential reads, or random reads of different columns within the same locality group of a hot row).
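A minimal sketch of the two levels, assuming a simple LRU policy and a toy block lookup; the class names, capacities, and GFS fetch hook are all hypothetical, not Bigtable's internals.

```python
# Two cache levels: key/value pairs (scan cache) above SSTable blocks (block cache).
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.items = capacity, OrderedDict()
    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]
    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

class TabletReader:
    def __init__(self, fetch_block_from_gfs):
        self.scan_cache = LRUCache(10_000)   # level 1: key/value pairs
        self.block_cache = LRUCache(1_000)   # level 2: 64 KB SSTable blocks
        self.fetch_block = fetch_block_from_gfs

    def read(self, key):
        value = self.scan_cache.get(key)     # repeated reads of the same key hit here
        if value is not None:
            return value
        block_id = hash(key) % 1_000         # toy block lookup for the sketch
        block = self.block_cache.get(block_id)
        if block is None:
            block = self.fetch_block(block_id)   # only now do we go to GFS
            self.block_cache.put(block_id, block)
        value = block.get(key)               # nearby keys in the same block hit the block cache
        self.scan_cache.put(key, value)
        return value
```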
Bloom filter
As described in Section 5.3, a read operation must read all of the SSTables that make up a Tablet's state. If those SSTables are not in memory, this may require many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for the SSTables in a particular locality group. A Bloom filter lets us ask whether an SSTable might contain any data for a given row/column pair. For certain applications, the small amount of Tablet server memory used to store Bloom filters drastically reduces the number of disk seeks required by read operations. With Bloom filters, most lookups of nonexistent rows or columns never need to touch disk at all.
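A small illustrative Bloom filter makes the guarantee concrete: a "no" answer is always correct, so a nonexistent row/column never forces a disk seek. The sizes and hashing below are my own choices, not Bigtable's implementation.

```python
# Toy Bloom filter over (row, column) pairs.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, row: str, column: str):
        for pos in self._positions(f"{row}/{column}"):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, row: str, column: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(f"{row}/{column}"))

bf = BloomFilter()
bf.add("com.cnn.www", "contents:")
assert bf.might_contain("com.cnn.www", "contents:")   # present keys always answer True
print(bf.might_contain("no.such.row", "contents:"))   # almost always False: skip this SSTable
```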
Commit-log implementation
If we kept the commit log for each Tablet in a separate log file, a very large number of files would be written concurrently to GFS. Given how the underlying file system is implemented on each GFS server, these writes could cause a large number of disk seeks as they reach the different physical log files. In addition, a separate log file per Tablet reduces the effectiveness of group commit, since groups tend to be smaller. To fix these problems, we use a single commit log per Tablet server, appending mutations for different Tablets to the same physical log file.
Using one log provides significant performance benefits during normal operation, but it complicates recovery. When a Tablet server dies, the Tablets it was serving are moved to many other Tablet servers: each of those servers typically loads only a small number of the original server's Tablets. To recover the state of a Tablet, the new Tablet server must re-apply the mutations for that Tablet from the commit log written by the original server. However, the mutations for all of the original server's Tablets are mingled in the same log file. One option would be for each new Tablet server to read the full commit log file and apply only the entries needed for the Tablets it is recovering. But under that scheme, if 100 Tablet servers were each assigned a single Tablet from the failed server, the log file would be read 100 times (once by each server).
To avoid duplicating log reads, we first sort the commit-log entries by the keys (table, row name, log sequence number). In the sorted output, all mutations for a particular Tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. To parallelize the sorting, we partition the log file into 64 MB segments and sort the segments in parallel on different Tablet servers. The sorting is coordinated by the Master server and is initiated when a Tablet server indicates that it needs to recover mutations from some commit-log file.

Writing commit logs to GFS sometimes causes performance hiccups for a variety of reasons (for example, a GFS server involved in the write crashes, or the network paths to the particular set of three GFS servers holding the replicas are congested or overloaded). To protect mutations from such GFS latency spikes, each Tablet server actually has two log-writing threads, each writing to its own log file; only one of the two is actively in use at a time. If writes to the active log file are performing poorly, the Tablet server switches to the other thread, and subsequent mutations are written to that thread's log file. Every log entry carries a sequence number, so during recovery the Tablet server can detect and ignore duplicate entries that result from this switching.
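A simplified, single-process sketch of the recovery-side handling: sort entries by (table, row, sequence number) so each recovering Tablet reads one contiguous run, and drop duplicate sequence numbers left behind by the thread switching. The entry layout and names are illustrative, not the real log format.

```python
# Sketch of sorting and de-duplicating commit-log entries for recovery.
from itertools import groupby

log_entries = [
    # (table, row, sequence_number, mutation)
    ("Webtable", "com.cnn.www", 12, "set contents"),
    ("Webtable", "com.aaa.www",  9, "set language"),
    ("Users",    "alice",       11, "set email"),
    ("Webtable", "com.cnn.www", 12, "set contents"),   # duplicate from a thread switch
]

def recoverable_mutations(entries):
    entries = sorted(entries, key=lambda e: (e[0], e[1], e[2]))
    seen = set()
    for table, row, seq, mutation in entries:
        if seq in seen:          # sequence numbers let us detect duplicates
            continue
        seen.add(seq)
        yield table, row, seq, mutation

# Group the sorted, de-duplicated stream by table so each Tablet's
# mutations can be applied in a single sequential pass.
for table, group in groupby(recoverable_mutations(log_entries), key=lambda e: e[0]):
    print(table, [(row, seq, m) for _, row, seq, m in group])
```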
Speeding up Tablet recovery
When the Master server moves a Tablet from one Tablet server to another, the source Tablet server first performs a minor compaction on that Tablet. This compaction shortens recovery time by reducing the amount of uncompacted state in the source server's commit log. After finishing this compaction, the server stops serving the Tablet. Before actually unloading the Tablet, the source server performs another, usually very fast, minor compaction to eliminate any remaining uncompacted state that arrived while the first compaction was running. After this second minor compaction, the Tablet can be loaded on another Tablet server without requiring any recovery of log entries.
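The unload sequence can be summarized as the sketch below; the method names are hypothetical, not Bigtable's internal interface.

```python
# Sketch of the Tablet unload sequence described above (hypothetical method names).
def migrate_tablet(source_server, tablet, target_server):
    # 1) First minor compaction: fold most of the uncompacted log state into
    #    SSTables while the Tablet keeps serving requests.
    source_server.minor_compact(tablet)

    # 2) Stop serving, then run a second, very fast minor compaction to
    #    flush whatever arrived during step 1.
    source_server.stop_serving(tablet)
    source_server.minor_compact(tablet)

    # 3) The Tablet's state is now entirely in SSTables, so the target
    #    server can load it without replaying any commit-log entries.
    target_server.load(tablet)
```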
Exploiting immutability
Besides the SSTable caches, various parts of the Bigtable system are simplified by the fact that all of the SSTables we generate are immutable. For example, reading data from an SSTable requires no synchronization of file-system accesses, so concurrent operations on rows can be implemented very efficiently. The memtable is the only mutable data structure that is accessed by both reads and writes. To reduce contention during reads of the memtable, we make each memtable row copy-on-write (COW), which allows reads and writes to proceed in parallel.
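A copy-on-write sketch for memtable rows, assuming a single writer lock and illustrative class names (this is not the actual memtable code): a writer never mutates the dictionary a reader might be traversing; it copies the row, applies the change, and swaps the reference.

```python
# Copy-on-write memtable rows: reads take no lock against writes.
import threading

class CowMemtable:
    def __init__(self):
        self.rows = {}                      # row key -> snapshot dict of columns
        self.write_lock = threading.Lock()  # writers still serialize among themselves

    def write(self, row, column, value):
        with self.write_lock:
            new_row = dict(self.rows.get(row, {}))  # copy on write
            new_row[column] = value
            self.rows[row] = new_row                # swap in the new snapshot

    def read(self, row, column):
        # No lock: we read a snapshot reference that writers never modify in place.
        return self.rows.get(row, {}).get(column)
```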
Since SSTables are immutable, the problem of permanently removing deleted data becomes a matter of garbage collecting obsolete SSTables. Each Tablet's SSTables are registered in the METADATA table. The Master server removes obsolete SSTables with a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.
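A toy mark-and-sweep pass over this structure, with the METADATA table modeled as a plain dictionary from Tablet to its registered SSTable names; the layout and names are mine, for illustration only.

```python
# Mark-and-sweep over SSTables: anything on disk not referenced by METADATA is obsolete.
def garbage_collect(metadata: dict, sstables_on_disk: set) -> set:
    live = {name for registered in metadata.values() for name in registered}  # mark
    return sstables_on_disk - live                                            # sweep

metadata = {"Webtable/tablet-00": {"sst-001", "sst-002"},
            "Webtable/tablet-01": {"sst-003"}}
print(garbage_collect(metadata, {"sst-000", "sst-001", "sst-002", "sst-003"}))
# {'sst-000'} -> obsolete, safe for the Master server to remove
```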
Finally, the immutability of SSTables lets us split Tablets quickly. Instead of generating a new set of SSTables for each child Tablet, the child Tablets share the SSTables of the original Tablet.
7 Performance Evaluation
To test the performance and scalability of Bigtable, we set up a Bigtable cluster with N Tablet servers, where N was varied. Each Tablet server was configured with 1 GB of memory, and data was written to a GFS cell of 1786 machines, each with two 400 GB IDE drives. We used N client machines to generate the workload for these tests. (The number of clients equaled the number of Tablet servers, so that the clients would never become a bottleneck.) Each client machine had a dual-core 2 GHz Opteron processor, enough physical memory to hold the working set of all running processes, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with a total of about 100-200 Gbps of bandwidth available at the root. All of the machines were in the same facility, so the network round-trip time between any two machines was under 1 ms.
The Tablet servers, Master server, test clients, and GFS servers all ran on the same set of machines. Every machine ran a GFS server. The machines also ran either a Tablet server, a client process, or processes from other jobs that were using the machine pool at the same time as these tests.
R is the number of distinct Bigtable row keys involved in the test. R was chosen so that each benchmark read or wrote approximately 1 GB of data per Tablet server.
The sequential write benchmark used row keys with names 0 to R-1. This space of row keys was partitioned into 10N equal-sized ranges, which a central scheduler assigned to the N clients; this dynamic assignment helped reduce the performance impact of other processes running on the client machines. We wrote a single string under each row key. Each string was generated randomly and was therefore incompressible. In addition, strings under different row keys were distinct, so no cross-row compression was possible. The random write benchmark was similar, except that the row key was hashed modulo R immediately before writing, so that the write load was spread roughly uniformly across the entire row space for the duration of the benchmark.
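A sketch of this key-generation scheme as I read it from the description above; N, R, the hash function, and the value generator are illustrative choices, not the benchmark's actual parameters.

```python
# Sequential keys walk 0..R-1 in 10N scheduler-assigned ranges;
# random keys hash the row key modulo R to spread load over the row space.
import hashlib, random, string

N = 4          # number of clients / Tablet servers (example value)
R = 1_000_000  # number of distinct row keys (chosen so ~1 GB per server)

def key_ranges(r=R, n=N):
    """Split [0, R) into 10*N equal ranges for the central scheduler to hand out."""
    size = r // (10 * n)
    return [(start, min(start + size, r)) for start in range(0, r, size)]

def sequential_key(i):
    return f"{i:010d}"

def random_key(i):
    h = int(hashlib.md5(str(i).encode()).hexdigest(), 16) % R
    return f"{h:010d}"

def random_value(length=1000):
    """Incompressible 1000-byte payload, distinct per row."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

print(len(key_ranges()), "ranges;", sequential_key(42), "->", random_key(42))
```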
The sequential read benchmark generated row keys in exactly the same way as the sequential write benchmark, but instead of writing, it read the string stored under each row key (written earlier by the sequential write benchmark). Similarly, the random read benchmark mirrored the operation of the random write benchmark.
The scan benchmark is similar to the sequential read benchmark, but it uses the API that Bigtable provides for scanning over all values in a range of rows. Because a single RPC fetches a large sequence of values from a Tablet server, the scan benchmark reduces the number of RPC calls executed.
The random read (memory) benchmark is similar to the random read benchmark, but the locality group containing the benchmark data is marked as in-memory, so reads are satisfied from the Tablet server's memory instead of requiring a read from GFS. For this benchmark, we reduced the amount of data per Tablet server from 1 GB to 100 MB so that all of it would fit comfortably in the Tablet server's memory.
Figure 6 shows two views of benchmark performance when reading and writing 1000-byte values: the number of operations per second per Tablet server, and, in the curves, the aggregate number of operations per second across all Tablet servers.
Performance of a single Tablet server
Let's start by analyzing the performance of a single Tablet server. Random reads are slower than all other operations by an order of magnitude or more. Each random read transfers a 64 KB SSTable block over the network from GFS to the Tablet server, of which only a single 1000-byte value is used. The Tablet server executes approximately 1200 reads per second, which translates into approximately 75 MB/s of data read from GFS. This bandwidth is enough to saturate the Tablet server's CPU, because of overheads in the network protocol stack, SSTable parsing, and Bigtable code, and it is almost enough to saturate the network links used in our system. Most Bigtable applications with this type of access pattern reduce the block size to a smaller value, typically 8 KB.
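A quick back-of-the-envelope check of those numbers:

```python
# Each random read pulls a 64 KB block from GFS even though only 1000 bytes are used.
reads_per_second = 1200
block_bytes = 64 * 1024
print(reads_per_second * block_bytes / 2**20)  # = 75.0 MB/s fetched from GFS
print(1000 / block_bytes)                      # < 2% of each fetched block is actually used
```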
Random reads from memory are much faster because each 1000-byte read is satisfied from the Tablet server's local memory rather than by fetching a 64 KB block from GFS.
Random and sequential writes perform better than random reads because each Tablet server appends all incoming writes to a single commit log and uses group commit to stream the writes efficiently to GFS. There is little difference between the performance of random writes and sequential writes: in both cases, all writes to the same Tablet server are recorded in the same commit log.
Sequential reads perform better than random reads because every 64 KB SSTable block fetched from GFS is stored in the Block Cache, where it is used to serve the next 64 read requests.
Scans are even faster. Because the Tablet server returns a large number of values in response to a single client RPC, the RPC overhead is amortized over many values.
Scaling
Aggregate throughput increases dramatically, by more than a factor of 100, as the number of Tablet servers in the system grows from 1 to 500. For example, the performance of random reads from memory increases by almost a factor of 300 as the number of Tablet servers increases by a factor of 500, because the bottleneck for that benchmark is the CPU of an individual Tablet server.
However, performance does not increase linearly. In most benchmarks, per-server throughput drops significantly when the number of Tablet servers grows from 1 to 50. This drop is caused by load imbalance across the servers, most often because other applications are contending for CPU and network. Our load-balancing algorithm tries to correct the imbalance, but it cannot do a perfect job for two reasons: rebalancing is throttled to limit the number of Tablet movements (a Tablet is unavailable for a short time, typically about one second, while it is being moved), and the load generated by the benchmarks shifts around as the benchmark progresses.
The random read benchmark shows the worst scaling: aggregate throughput increases by only a factor of 100 while the number of servers increases by a factor of 500. As explained above, each 1000-byte read transfers a 64 KB block over the network. This traffic saturates various shared 1 Gbps links in our network, so per-server throughput drops sharply as the number of machines increases.
References
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data.
- Yan Wei. Google Bigtable (Chinese translation).