1. Foreword

This article first introduces HBase, including its overall architecture, dependent components, and core service classes. It then focuses on the HBase read path and describes how to optimize read performance on both the client and the server, combining theory with practice in the hope of inspiring readers. If you find a mistake in the article, please leave a message below so we can discuss and learn together.

2. HBase Overview

HBase is a distributed, scalable, column-oriented database for storing massive amounts of data. Its main purpose is to solve the problem of real-time random reads and writes over massive data. HBase generally relies on HDFS as its underlying distributed file system, and on that premise this article describes the HBase architecture, read path, and optimization practices in detail.

2.1 HBase Key Processes

HBase is a distributed database in Master/Slave architecture. It contains two core services, Master and RegionServer. HBase relies on HDFS for underlying storage and ZooKeeper for consistency coordination.

  • Master is a lightweight process responsible for all DDL operations, load balancing, and Region metadata management, and it takes the leading role in failure recovery.
  • RegionServer manages HRegions, communicates with clients point-to-point, and serves real-time reads and writes.
  • ZooKeeper performs Master elections and stores key information such as the meta-Region address, replication progress, and RegionServer addresses and ports.

2.2 HBase Architecture

Firstly, the architecture diagram is given as follows:

(Figure: HBase architecture, showing RegionServers hosting HRegions, each HRegion containing a MemStore and HFiles.)


3. Read Path Parsing

The client can read data in two ways: Get and Scan. Get is a point query that returns a single row of data for a given rowkey; the client can also batch multiple Get objects into one request, so that several rows are returned in a single RPC. A Get object can specify columns and filters to obtain only the requested column data under a specific rowkey. Scan is a range query: by setting startRow and stopRow on the Scan object, the client determines the range to scan and retrieves all data within it. A client-initiated read can be divided into two phases: first, how the client routes the request to the correct RegionServer; second, how the RegionServer handles the read request.
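
To make the two access patterns concrete, here is a minimal sketch assuming the HBase 1.x client API; the table name, rowkeys, and column names are illustrative.

    // Imports assumed: org.apache.hadoop.hbase.*, org.apache.hadoop.hbase.client.*,
    // org.apache.hadoop.hbase.util.Bytes
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("likes"))) {
        // Get: point query for one row, restricted to a single column
        Get get = new Get(Bytes.toBytes("user1"));
        get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"));
        Result row = table.get(get);

        // Scan: range query over [user1, user9)
        Scan scan = new Scan(Bytes.toBytes("user1"), Bytes.toBytes("user9"));
        try (ResultScanner rs = table.getScanner(scan)) {
            for (Result r : rs) {
                // process each row here
            }
        }
    }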

3.1 How Does a Client Send Requests to a Specific RegionServer

An HRegion manages a contiguous range of rows within one table; a table consists of multiple HRegions, which provide read and write services on RegionServers. Therefore, before a client can send a request to the right RegionServer, it needs the HRegion meta information, which is stored in the hbase:meta system table, itself served by a RegionServer. Because this mapping is the basis for every client to locate HRegions, the address of the meta Region is stored on ZooKeeper. The flowchart for obtaining HRegion meta information is as follows:

3.2 RegionServer Processes read requests

RegionServer processes a Get request as a special Scan request whose startRow and stopRow are the same. Therefore, the handling of a Get request can be understood through the handling of a Scan request.

3.2.1 Data Organization

Let’s first review how HBase organizes data. A Table is horizontally divided into multiple HRegions. Within an HRegion, each column family corresponds to a Store, which contains one MemStore and multiple HFiles. An HFile carries a block index, so a given rowkey can be quickly located to the corresponding data block via the index combined with binary search.

With this background, we can translate the handling of a read request into the problem of how to retrieve the data the user needs (by default, the latest version, non-deleted and non-expired data; the user may also set filters or limit the number of returned rows) from one MemStore and multiple HFiles.

In a RegionServer, every component that may be involved in a Region read is initialized as a scanner object. A Region read is encapsulated as a RegionScanner. A column family corresponds to a Store, encapsulated as a StoreScanner. Inside a Store, the MemStore is encapsulated as a MemStoreScanner and each HFile as a StoreFileScanner. The query for data ultimately falls on the MemStoreScanner and StoreFileScanner objects. Some of these scanners are first filtered out based on the TimeRange and rowkey range of the Scan; the remaining scanners form a minimum heap (KeyValueHeap) inside the RegionServer. The core of this data structure is a PriorityQueue sorted by the KeyValue each scanner currently points to.

    // Used to organize all scanners
    protected PriorityQueue<KeyValueScanner> heap = null;
    // The scanner currently at the top of the PriorityQueue
    protected KeyValueScanner current = null;
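
To make the heap's merge behavior concrete, here is a simplified sketch of what a KeyValueHeap-style traversal does. This is illustrative Java, not the actual HBase implementation; scanners is the filtered scanner list and emit() is a hypothetical callback.

    // Merge sorted scanners through a priority queue keyed on each scanner's
    // current KeyValue; every scanner is internally sorted.
    PriorityQueue<KeyValueScanner> heap = new PriorityQueue<>(
        Comparator.comparing(KeyValueScanner::peek, KeyValue.COMPARATOR));
    heap.addAll(scanners); // e.g. ScannerA..ScannerD
    while (!heap.isEmpty()) {
        KeyValueScanner top = heap.poll();  // scanner holding the globally smallest KeyValue
        Cell kv = top.next();               // advance this scanner's current pointer
        emit(kv);                           // subject to the filter conditions in 3.2.2
        if (top.peek() != null) {
            heap.add(top);                  // re-insert until the scanner is exhausted
        }
    }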

3.2.2 Data Filtering

Data resides in memory and in HDFS files; the RegionServer builds the minimum heap over them, filters the data, and returns the desired values. Assume the HRegion has four HFiles and one MemStore, and that after the TimeRange/rowkey filtering described above the minimum heap holds four scanner objects, which we call ScannerA through ScannerD. Assume the rowkey we want to query is rowA. Each scanner has a current pointer to the KeyValue it is traversing, so the current pointer of the scanner at the top of the heap points at rowA (rowA:cf:colA). Calling next() advances the current pointer to iterate over all the data in a scanner. The figure below shows the logical organization of the scanners.

A KeyValue read from the top of the heap is returned to the client only if it satisfies all of the following conditions:
  • The KeyValue type is Put
  • The column is among the columns specified by the Scanner
  • The filter conditions are met
  • It is the latest version
  • It is not deleted

If the Scan parameters are more complex, these conditions change accordingly. For example, if the Scan is configured to return raw data, KeyValues carrying delete markers are also returned. We will not go into further detail here; this completes the basic parsing of the read path. The process described here is, of course, quite rough, and interested readers can study the source code for this part.
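
For example, a raw Scan that also returns data carrying delete markers can be constructed like this (a sketch assuming the HBase 1.x API):

    Scan scan = new Scan();
    scan.setRaw(true); // return all versions, including KeyValues with delete markers
    // Note: a raw scan cannot restrict itself to individual columns.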

4. Read Optimization

After walking through the read path, we now describe how to optimize read requests, drawing on the practice of our Like service. Since we are talking about optimization, we must first know which factors affect read performance. As before, we look at the optimization methods from both the client side and the server side.

4.1 Client Optimization

HBase data can be read in two modes: Get and Scan. In general, the first read requires the client to talk to ZooKeeper and then locate Region information through the meta table, so the RT of the first read is high. To avoid this, the client should preheat the table: obtain the locations of all Regions in the table so that the Region addresses are cached, for example by sending a Scan or Get request to each Region. Beyond that, check whether the rowkeys have read/write hotspots: if hotspots exist, the advantage of a distributed system is lost, since all requests fall on only one or a few HRegions and request efficiency suffers. Also consider the read/write ratio of the workload.
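
A minimal warm-up sketch, assuming the HBase 1.x client API, an open Connection conn, and a version in which the getAllRegionLocations bug discussed in Section 4.2.5 is fixed; the table name is illustrative.

    // Fetching all region locations populates the client's region cache,
    // so the first real Get/Scan skips the ZooKeeper and meta lookups.
    try (RegionLocator locator = conn.getRegionLocator(TableName.valueOf("likes"))) {
        List<HRegionLocation> locations = locator.getAllRegionLocations();
    }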

4.1.1 Get Request Optimization

  • Batch Get requests to reduce the number of RPCs. However, if a batch contains too many Gets and encounters a disk latency spike or a Region split, all Gets in the batch fail (no partially successful results are returned) and an exception is thrown.
  • Specify the column family and qualifier. This lets the server filter out many useless scanners, reducing I/O and improving efficiency. The same applies to Scan. A sketch of both points follows this list.
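
A sketch of both points, assuming the HBase 1.x client API; the Table instance, table, and column names are illustrative.

    // Batch several Gets into one call; one bad Region can fail the whole batch.
    List<Get> gets = new ArrayList<>();
    for (String key : Arrays.asList("user1", "user2", "user3")) {
        Get get = new Get(Bytes.toBytes(key));
        // Restrict to one column family/qualifier so the server can skip useless scanners
        get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"));
        gets.add(get);
    }
    Result[] results = table.get(gets); // one batched request instead of one RPC per row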

4.1.2 Scan Request Optimization

  • Set a reasonable startRow and stopRow. If a Scan request sets neither and only sets a filter, a full-table scan is performed.
  • Set an appropriate caching value, for example scan.setCaching(100). A Scan may cover a large amount of data, so the client does not load it all in a single RPC; it is fetched over multiple RPC round trips. The default value is 100. To scan massive data without logical paging, raise caching to, say, 1000 to reduce the number of RPCs and improve efficiency; if the caller retrieves and iterates data interactively, a caching of 50 or 100 makes more sense. A sketch follows this list.
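
A sketch of a bounded Scan with explicit caching, again assuming the HBase 1.x API; names and values are illustrative.

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("user100")); // bound the range; avoids a full-table scan
    scan.setStopRow(Bytes.toBytes("user200"));
    scan.setCaching(500);                       // rows fetched per RPC round trip
    scan.addFamily(Bytes.toBytes("cf"));
    try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
            // process each row here
        }
    }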

4.2 Server Optimization

More can be done on the server side than on the client side. First, we list the factors that affect how the server handles read requests.

  • GC spikes
  • Disk I/O spikes
  • The number of HFiles
  • Cache configuration
  • The data locality rate
  • Whether hedged read is enabled
  • Whether short-circuit read is enabled
  • Whether high-availability read is enabled

There is no easy way to avoid GC spikes; a young GC in HBase generally takes 20 to 30ms. Disk latency spikes are likewise inevitable: a SATA disk delivers roughly 150 random read IOPS, while an SSD delivers more than 30,000, so SSDs improve storage throughput and reduce the impact of spikes. The number of HFiles grows through the flush mechanism and shrinks through compaction: the more HFiles, the more I/Os a single query may require, and the higher the read latency. Optimization here focuses on compaction configuration, including the trigger threshold, the file size threshold, and the number of files that can be compacted at a time. The read cache can be set to CombinedBlockCache, and adjusting the ratio between the read cache and the MemStore (hfile.block.cache.size) also matters for read optimization. The following is the optimization practice we carried out based on business requirements. When our online cluster was first established, it served an important fan (follower) service with strict RT requirements, and we took the following measures to meet its needs.
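
As an illustration, the kinds of hbase-site.xml knobs involved look like the following; this is a sketch with example values rather than recommendations. hbase.hstore.compactionThreshold is the number of HFiles in a Store that triggers a minor compaction, hbase.hstore.compaction.max caps the number of files compacted at once, and hfile.block.cache.size sets the block cache's share of the heap.

    <property>
        <name>hbase.hstore.compactionThreshold</name>
        <value>3</value>
    </property>
    <property>
        <name>hbase.hstore.compaction.max</name>
        <value>10</value>
    </property>
    <property>
        <name>hfile.block.cache.size</name>
        <value>0.4</value>
    </property>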

4.2.1 Heterogeneous Storage

HBase resource isolation + heterogeneous storage. SATA disks are far inferior to SSDs in random IOPS, single-access RT, and read/write throughput, so they cannot serve RT-sensitive workloads; HBase therefore needs to support SSD storage media. For HBase to support heterogeneous storage, corresponding support is required at the HDFS level. HDFS 2.6.x and later can place files on SSDs, meaning SSDs and SATA disks can coexist in one HDFS cluster, with data directories tagged by storage type such as [SSD] and [DISK]. However, HBase 1.2.6 cannot set the storage policy of a table's column family or of the RegionServer WAL to SSD; this capability arrived in community HBase 2.0, so we backported the corresponding patch onto our own HBase version. The community issue for SSD support is: issues.apache.org/jira/browse… . After SSDs are added, the HDFS cluster storage architecture is shown in the following figure:

In a hybrid cluster, the DataNode data directories are tagged with their storage type in hdfs-site.xml; on an SSD machine, for example:

 <property>
      <name>dfs.datanode.data.dir</name>
      <value>[SSD]file:/path/to/dfs/dn1</value>
 </property>

On the RegionServers backed by SSDs, change the following value in hbase-site.xml:

 <property>
      <name>hbase.wal.storage.policy</name>
      <value>ONE_SSD</value>
 </property>

ONE_SSD can also be replaced with ALL_SSD. RegionServers on SATA machines need no change, or can be set to HOT.
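
For reference, once the column-family storage policy feature is available (community HBase 2.0, or a 1.2.6 build carrying the backported patch), the policy can also be set per column family. Below is a sketch assuming the HBase 2.0 ColumnFamilyDescriptorBuilder API; the Admin instance, table, and family names are illustrative.

    // Create a table whose column family 'cf' keeps its HFiles on SSD.
    TableDescriptor desc = TableDescriptorBuilder.newBuilder(TableName.valueOf("fans"))
        .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
            .setStoragePolicy("ONE_SSD") // storage policy for this family's files
            .build())
        .build();
    admin.createTable(desc);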

4.2.2 HDFS Short-Circuit Read

This feature was introduced by HDFS-2246. In our cluster, RegionServers are colocated with DataNodes, which guarantees a high data locality rate: the first replica of written data lands on the local DataNode. With short-circuit read disabled, even reading data that resides on the local DataNode requires an RPC request, with the data returned only after passing through several layers of processing. The principle of short-circuit read is that when the client requests data, the DataNode opens the data file and its checksum file and, instead of streaming the contents, passes the two file descriptors directly to the client, which then opens the files and reads the data itself. This is implemented via UNIX domain socket inter-process communication. The flow is shown in the figure below:

To enable short-circuit read, modify hdfs-site.xml:

    <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.domain.socket.path</name>
        <value>/var/run/hadoop/dn.socket</value>
    </property>

4.2.3 HDFS Hedged Read

If local data is not returned within a certain time because of disk jitter or other reasons, the client sends the same read request to another DataNode; whichever response arrives first is used and the later one is discarded, which reduces the impact of disk latency spikes. This feature is disabled by default. To use it in HBase, modify hbase-site.xml:

    <property>
       <name>dfs.client.hedged.read.threadpool.size</name>
       <value>50</value> 
    </property>
    <property>
       <name>dfs.client.hedged.read.threshold.millis</name>
       <value>100</value>
    </property>

The thread pool size can match the number of read handlers, and the timeout threshold should not be too small; otherwise hedged reads will strain both the cluster and the client. The effect can be checked through the Hadoop metrics hedgedReadOps and hedgedReadOpsWin: the former counts hedged reads issued, the latter counts hedged reads that returned faster than the original read.

4.2.4 High Availability Read

HBase is a CP system: at any moment only one RegionServer serves reads and writes for a given Region, which guarantees data consistency and avoids multi-replica synchronization problems. However, when a RegionServer goes down, recovery takes a certain time (deltaT), during which the Region does not provide service. This deltaT is mainly determined by the amount of WAL that must be replayed during failure recovery. To keep reads available during that window, we maintain a standby cluster and replicate data to it, letting clients fall back to the standby when the primary times out. The schematic diagram of cluster replication is as follows:
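
A minimal client-side failover sketch under this active/standby model; the two Table handles are assumed to come from connections to the primary and standby clusters, the names are illustrative, and the 300ms timeout matches Section 4.2.5.

    // Read from the primary cluster; fall back to the standby on failure/timeout.
    // primaryTable/standbyTable are Table handles from two separate Connections.
    Get get = new Get(Bytes.toBytes("rowA"));
    try {
        return primaryTable.get(get);  // operation timeout configured to ~300ms
    } catch (IOException e) {
        return standbyTable.get(get);  // the standby may lag replication slightly
    }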

4.2.5 Troubleshooting a Preheating Failure

Cold-start preheating did not take effect. The problem shows up the first time an application reads from HBase after initialization; as the flowchart in Section 3.1 shows, this involves multiple RPC requests and therefore takes a long time. Once all Region addresses are cached, the client and RegionServer communicate point-to-point and RT is guaranteed, so we warm up when the application starts, normally by calling getAllRegionLocations. However, in version 1.2.6 getAllRegionLocations had a bug (1.3.x and 2.x had similar problems): it is expected to return the locations of all Regions and cache their addresses, but it only cached the first Region of the table. After discovering the problem, the author reported it to the community and submitted a patch to fix it; the issue link is: issues.apache.org/jira/browse… . With the bug fixed, getAllRegionLocations can be used to warm up the application at startup, so no RT spike occurs the first time the application reads or writes HBase. We set the timeout of the fan service to 300ms on both the active and standby clusters; after these optimizations, 99.99% of its batch Get requests complete within 20ms and 99.9999% within 400ms.

5. Summary

The HBase read path is more complex than the write path; this article only briefly introduces its core ideas. Because of this complexity, optimization requires a deep understanding of the underlying principles, and one must look not only at HBase's own components but also at whether the components it depends on can be optimized. Finally, my knowledge is limited and the views in this article inevitably have flaws; corrections and discussion are welcome.

The infrastructure team is mainly responsible for technical products such as the Data Platform (DP), real-time computing (Storm, Spark Streaming, Flink), offline computing (HDFS, YARN, Hive, Spark SQL), online storage (HBase), and real-time OLAP (Druid). Interested readers are welcome to contact [email protected].
