Article | zhaoyuan on big data

One, foreword

This article first introduces HBase, including its overall architecture, dependent components, and core service classes. This section focuses on the HBase data reading process and describes how to optimize performance on the client and server. In addition, combining theories with practices, this section hopes to inspire readers. If the article has a mistake please leave a message below, we discuss common study.

2. HBase Overview

HBase is a distributed, scalable, column-oriented database for storing massive data. Its main function is to solve the real-time random read/write problem of massive data. Generally, HBase relies on HDFS as the underlying distributed file system. Based on this premise, this article describes the HBase architecture, read paths, and optimization practices in detail.

2.1 HBase Key Processes

HBase is a distributed database in Master/Slave architecture. It contains two core services, Master and RegionServer. HBase relies on HDFS for underlying storage and ZooKeeper for consistency coordination.

  • Master is a lightweight process that is responsible for all DDL operations, load balancing, region information management, and takes the lead role in downtime recovery.

  • RegionServer manages HRegion, communicates with client endpoint to point, and is responsible for reading and writing real-time data.

  • Zookeeper performs HMaster elections and stores key information such as meta-Region address, replication progress, Regionserver address and port.

2.2 HBase architecture

Firstly, the architecture diagram is given as follows

Architecture Analysis: HBase data storage is based on the LSM architecture. Data is sequentially written to the HLog. By default, RegionServer has only one HLog instance, and then written to the MemStore of HRegion. HRegion is a contiguous region of data in an HBase table. Data is arranged by rowKey lexicographical order. RegionServer manages these HRegions. When the MemStore reaches the threshold, flush operations are triggered and hfiles are written to a single HFile. Hfiles are periodically consolidated by major and minor operations. All hfiles and log files are stored in HDFS.

This section describes the key components, roles, and architecture of HBase. The HBase read path is described in this section.

Read path parsing

The client can read data in two ways: Get and Scan. Get is a random point-and-search method that returns a row of data based on a Rowkey. You can also pass in a list of Rowkeys when constructing a Get object, so that multiple data can be returned in a single RPC request. The Get object can set columns and filters to obtain only the specified column data under a specific rowkey. Scan is a range query. You can specify startRow and endRow of the Scan object to determine the data range for a Scan and obtain all the data in the range.

A read process initiated by the client can be divided into two phases. The first phase is how the client sends the request to the correct RegionServer, and the second phase is how the RegionServer handles the read request.

3.1 How Does a Client Send Requests to a Specific RegionServer

HRegion manages a table and a contiguous data range. A table consists of multiple HRegions. These HRegions provide read and write services on the RegionServer. Therefore, when a client sends a request to a specific RegionServer, it needs to know the HRegion meta information. The meta information is stored in the hbase: Meta system table, which also provides services on a RegionServer. This information is crucial. Is the basis for all clients to locate HRegion, so this mapping information is stored on ZooKeeper.

The flowchart for obtaining HRegion meta information is as follows:

Take a single Rowkey Get request as an example. After the user initializes the connection to ZooKeeper and sends a Get request, the USER needs to locate the HRegion address of the rowkey. If the address is not in the cache, you need to ask ZooKeeper (arrow 1) for the address of the meta table. After obtaining the meta table address, read the meta table data to locate the HRegion information and RegionServer address of the RowKey (arrow 2). The address is cached and the CONCURRENT Get requests are sent point-to-point to the corresponding RegionServer(arrow 3). At this point, the client locates the process of sending requests.

3.2 RegionServer Processes read requests

RegionServer processes the Get request as a special Scan request. The startRow and StopRow of the Get request are the same. Therefore, you can understand the process of the Get request through the process of the Scan request.

3.2.1 Data Organization

Let’s review the organizational structure of HBase data. First, Table is horizontally divided into multiple Hregions. According to a column family, each HRegion contains a MemStore and multiple HFile files. The user needs to know that a given rowkey can be quickly located to the corresponding data block according to the index combined with binary search. With this background information, we can translate the processing of a Read request to the problem of how to get the correct data from a MemStore, multiple hfiles that the user needs (by default, the latest version, non-deleted, and non-expired data). At the same time, the user may set filter to specify the number of returned items and other filtering conditions.

RegionServer initializes all components that may be involved in the read as the corresponding Scanner object. RegionScanner object is encapsulated for the read of Region, and a column family corresponds to a Store. The corresponding encapsulation is StoreScanner. Inside Store, MemStore is encapsulated as MemStoreScanner. Every HFile is encapsulated as StoreFileScanner. Finally, the query for the data falls on top of the query for MemStoreScanner and StoreFileScanner.

Some of these scanners are first filtered out based on the TimeRange and Rowkey Range of a Scan. The remaining scanners form a minimum KeyValueHeap inside RegionServer. The core of this data structure is a PriorityQueue, which is sorted by the KeyValue pointed to by the Scanner.

                                                                
     

    // Use to organize all scanners

    protected PriorityQueue heap = null ;

    // PriorityQueue is currently the first Scanner

    protected KeyValueScanner current = null ;

Copy the code

3.2.2 Data Filtering

Data is stored in memory and HDFS files. RegionServer builds a minimum heap to read the data. RegionServer filters the data and returns the desired value.

We assume that the HRegion has four Hfiles and one MemStore, and that the minimum heap has four scanner objects. ScannerA -d replaces these scanner objects and assumes that the rowkey we need to query is rowA. Each scanner has a current pointer that points to the KeyValue being traversed, so the current pointer to the scanner object at the top of the heap points to rowA(rowA: CF :colA). Move the current pointer to iterate over all the data in the scanner by firing the next() call. Scanner The following figure shows the logical organization view.

The first next request returns To rowA: CF :colA in ScannerA, and the ScannerA pointer moves to the next KeyValue rowA: CF :colB, the ordination in the heap remains the same.

The second next request returns rowA:cf:colB in ScannerA. The current pointer to ScannerA moves to the next KeyValue rowB:cf:ColA, Because the heap is sorted by KeyValue, rowB is smaller than rowA, so the sequence of scanner changes inside the heap, as shown below:

The scanner closes after all internal data is retrieved, or the rowA searches for the next item after all data is retrieved. By default, the returned data must be filtered by ScanQueryMatcher. The returned data must meet the following conditions:

  • The keyValue type is PUT

  • Columns are the columns specified by Scanner

  • The filter conditions are met

  • Latest version

  • Undeleted data

If the scan parameters is more complex, the condition will also change, such as when you return to the Raw data specified scan, hit delete the tag data will be returned, this part is no longer in detail, thus read basic parsing is complete, the process is introduced in this paper, of course, very rough, interested students can study this part of the source code.

Fourth, read optimization

After introducing the read process, we will introduce how to optimize the read request by combining with the practice of the like service. Since optimization is talked about, we must first know which points can affect the performance of the read request. We still understand the optimization method in depth from the client side and the server side.

4.1 Client Layer

HBase data can be read in two modes: Get and Scan.

At the general level, the client and server need to communicate with ZooKeeper and then locate region information through the meta table. Therefore, the RT is high when the HBase is read for the first time. To avoid this situation, the client needs to preheat the table. You can obtain information about all regions in the table and send a Scan or Get request to each region. In this way, the region address is cached.

Whether there are read/write hotspots on the Rowkey. If there are hotspots, the advantages of the distributed system are lost. All requests fall on only one or several HRegions, and the request efficiency is not high.

What the read-write ratio is. RegionServer (MVCC) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK) : MVCC (STUCK)

Get request optimization

  • Get requests are quantized to reduce the number of RPCS, but if the number of gets in a batch is too large, if disk burrs or Split burrs are encountered, all GETS will fail (no partial success results will be returned) and an exception will be thrown.

  • Specify the column family, identifier. In this way, the server can filter out many useless scanners, reducing I/O times and improving efficiency. This method also applies to Scan.

Scan Request Optimization

  • Set up reasonable startRow and stopRow. If the two values are not set in the SCAN request, but only filter is set, a full table scan is performed.

  • Set an appropriate number of caching, scans.setcaching (100). Scan may Scan a large amount of data. Therefore, the client does not load all data in a single Scan request. Instead, the client loads data in multiple RPC requests. The default value is 100. If you need to scan massive data without logical paging, you can set the cache value to 1000 to reduce the number of RPCS and improve the processing efficiency. If the user needs to retrieve data quickly and iteratively, it makes sense to set caching to 50 or 100.

4.2 Server Optimization

There is more that can be done with server optimization than with the client. First we list the points that affect how the server handles read requests.

  • The gc burr

  • Disk burr

  • HFile Number of files

  • Cache configuration

  • Localization rate

  • Whether the Hedged Read pattern is on

  • Whether short-circuit reading is enabled

  • Whether to enable high availability

There is no easy way to avoid gc burrs, and generally a Young gc of HBase takes between 20 and 30ms. Disk burrs are inevitable. SATA disks have read IOPS of about 150 and SSDS have read IOPS of more than 30000 randomly. Therefore, SSDS can improve the throughput of storage media and reduce the impact of burrs. The number of hfiles is increased by the flush mechanism, and decreased by Compaction. If a query results in a maximum number of IOs, read latency becomes greater. This section focuses on optimizing configuration when Compaction occurs, including triggering thresholds, file size thresholds, and the number of files that can be committed at a time. Read cache can be set as CombinedBlockCache. It is also important to adjust the ratio of read cache and MemStore to read request optimization. Here we set hfile.block.cache. The following is the optimization practice we do based on business requirements.

At the beginning of the establishment of our online cluster, we connected to the important fan business, which has high requirements for RT. In order to meet the business needs, we took the following measures.

4.2.1 Heterogeneous Storage

HBase resource isolation + Heterogeneous storage. SATA disks are far inferior to SSD in random IOPS, single access RT, and read/write throughput. SATA disks are not competent for RT sensitive services. Therefore, HBase needs to support SSD storage media.

In order for HBase to support heterogeneous storage, response support is required at the HDFS level. HDFS 2.6.x and later versions provide the ability to store files on SSDS. In other words, SSDS and SATA disks can coexist in an HDFS cluster. The HDFS storage format is [SSD] and [disk]. However, HBase 1.2.6 cannot set the storage format of the table column family and the WAL of RegionServer to [SSD]. This function was opened after the community HBase 2.0 version, so we backport the corresponding patch from the community. On top of our own version of HBase. Community issues that support [SSD] are as follows:

https://issues.apache.org/jira/browse/HBASE-14061?jql=text%20~%20%22storage%20policy%22.

After SSD disks are added, the HDFS cluster storage architecture is shown in Figure 5:

Figure 5 HDFS cluster storage logic in hybrid models

In an ideal mixed-model cluster deployment, HBase supports three file storage policies: HOT, ONE_SSD, ALL_SSD, where ONE_SSD storage policy can not only store two copies of the three copies on cheap SATA disk media to reduce SSD disk storage cost, but also access the data on the local SSD disk can obtain the ideal RT. Is an ideal storage strategy. The HOT storage policy is the same as that without heterogeneous storage, but ALL_SSD stores all copies on SSDS. At present, we do not have such an ideal hybrid model in Youzan. There are only two big data models, pure SATA and pure SSD. The corresponding architecture of such models will be different from before, as shown in Figure 6.

Based on this scenario, we made the following planning:

  1. Plan SSD machines into independent groups and set hbase.wal.storage.policy to ONE_SSD to ensure wal localization rate.

  2. Configure the table in the SSD group as ONE_SSD or ALL_SSD.

  3. Table storage policies in non-SSD groups use the default HOT

The configuration policy is as follows: Change the configuration policy in HDFS -site. XML


     

    <property>

         <name>dfs.datanode.data.dir</name>

         <value>[SSD]file:/path/to/dfs/dn1</value>

    </property>

Copy the code

The value is changed in hbase-site. XML in RegionServer of SSDS


     

    <property>

         <name>hbase.wal.storage.policy</name>

         <value>ONE_SSD</value>

    </property>

Copy the code

ONE_SSD can also be replaced with ALL_SSD. SATA RegionServer does not need to be changed or changed to HOT.

4.2.2 HDFS short-circuit Read

Enable the short-circuit read mode of the HDFS. This feature is introduced by HDFS-2246. The RegionServer of our cluster is mixed with datanodes. In this way, data localization rate is guaranteed. The first copy of data is written to the local Datanodes first. When the short-circuit read function is disabled, even if the data on the local DataNode is read, RPC requests need to be sent and data is returned after layer upon layer processing. The realization principle of short-circuit read is that when the client requests data from DataNode, DataNode will open the file and checksum file. Instead of passing the path to the client, pass the descriptors of the two files directly to the client. After receiving the descriptors of the two files, the client directly opens the files to read the data. This feature is realized through UNIX Domain Socket interprocess communication, as shown in Figure 7.

The internal implementation of this feature is complex and is designed to slot the state and count copies of shared memory segments, which will not be elaborated here.

To enable short-circuit read, modify the hdFS-site. XML file:


     

       <property>

           <name>dfs.client.read.shortcircuit</name>

           <value>true</value>

       </property>

       <property>

           <name>dfs.domain.socket.path</name>

            value>/var/run/hadoop/dn.socket</value>

       </property>

Copy the code

Holdings HDFS Hedged the Read

Open Hedged Read mode. If local data is not returned within a period of time due to disk jitter or other reasons, the same data request is sent to other Datanodes. The returned data is the first to be returned and the later data is discarded, which reduces the impact of disk burr. This function is disabled by default. To use this function in HBase, you need to modify hbase-site.xml


     

       <property>

          <name>dfs.client.hedged.read.threadpool.size</name>

          <value>50</value>

       </property>

       <property>

          <name>dfs.client.hedged.read.threshold.millis</name>

          <value>100</value>

       </property>

Copy the code

The thread pool size can be the same as the number of read handlers, and the timeout threshold should not be too small, otherwise it will strain both the cluster and the client. HedgedReadOps and hedgedReadOps can be monitored by Hadoop to check the effect of Hedged Read. The former means the number of Hedged read, The latter means the number of times a Hedged read is faster than a native reading.

4.2.4 High Availability Read

Highly available read. HBase is a CP system. Only one RegionServer provides read and write services for one region at a time. This ensures data consistency and avoids the problem of multi-copy synchronization. However, if a RegionServer goes down, the system requires a certain recovery time (deltaT). Within this deltaT, Region does not provide services. This deltaT time is primarily determined by the number of logs that need to be played back in the recovery from the outage. The cluster replication schematic diagram is shown in Figure 8 below.

HBase provides the HBase Replication mechanism to implement one-way asynchronous data Replication between clusters. Two clusters are deployed online. The SSD groups in the standby cluster and the primary cluster have the same configuration. When some RegionServers or even clusters in the active cluster become unavailable due to sudden traffic from disks, networks, or other services, the standby cluster is required to continue providing services. Data in the standby cluster may be delayed due to HBase Replication. According to the current size of our cluster, the average delay is less than 100ms. Therefore, in order to achieve high availability, the fan business can accept the replication delay and abandon strong consistency in favor of final consistency and high availability. The scheme adopted in the first version is as follows:

Fan business party is don’t want to perceive the back-end service status, that is to say, at the client level, they just want a Put or Get request normal delivery and returns the expected data, it needs to be highly available client encapsulate a demotion, fusing the processing logic, here we adopt Hystrix as the underlying fusing processing engine, The basic HBase API is encapsulated on the engine. You only need to configure the ZK addresses of the primary and secondary equipment rooms. All degraded circuit breaker logic is encapsulated in the Ha-hbase-client.

4.2.5 Troubleshooting preheating Failure

The application of cold start preheating does not take effect. This problem occurs when an application accesses HBase for the first time to read data after initialization. The process is shown in Figure 2. This process takes a long time because it involves multiple RPC requests. After all Region addresses are cached, the client and RegionServer communicate point-to-point so that RT is guaranteed. So we’re going to do a warm-up when the application starts up, and what we usually do to warm-up is call getAllRegionLocations. GetAllRegionLocations had a bug with 1.2.6 (later, 1.3.x, and 2.x). GetAllRegionLocations was expected to return all Region locations and cache those Region addresses. This method only caches the first Region of the table. The author reported the problem to the community after finding it and submitted patch to fix the problem. Issue link: https://issues.apache.org/jira/browse/HBASE-20697?filter=-2. In this way, you can use the getAllRegionLocations method after the bug is fixed to warm up the application after it starts. RT burrs are not generated when the application reads and writes HBase for the first time.

Set the timeout period for both the active and standby fan services to 300ms. After these optimizations, its bulk Get requests are 99.99% within 20ms and 99.9999% within 400ms.

Five, the summary

The HBase read path is more complex than the write path. This document only briefly introduces the core ideas. Because of this complexity, when considering optimization, it is necessary to have a deep understanding of its principles, and not only focus on its own service components, but also consider whether its dependent components can also be optimized. Finally, my ability is limited, the opinion in this paper inevitably has flaws, but also hope to exchange correction.

The infrastructure team is mainly responsible for the excellent Data Platform (DP), real-time computing (Storm, Spark Streaming, Flink), offline computing (HDFS,YARN,HIVE, Spark SQL), online storage (HBase), Real-time OLAP(Druid) and other technical products, welcome interested partners contact [email protected]

Refer to http://www.nosqlnotes.com/technotes/hbase/hbase-read/http://hbasefly.com/2016/11/11/ http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html https://www.cloudera.com/documentation/enterprise/5-6-x/topics/admin_hedged_reads.html

Further reading

  1. Big Data development Platform (Data Platform) in the good best practices

  2. Good data warehouse metadata system practice

  3. How we redesigned the NSQ – Other features and future plans

  4. Quantitative analysis and optimization of resource consumption in HBase write and throughput scenarios

  5. Good big data platform security construction practice

  6. Flink has praised the practice of real-time computing

  7. SparkSQL is in good practice

  8. Druid is in awesome practice

-The End-

Vol.156

Have a good technical team

It has 3 million merchants, 150 industries and 20 billion e-commerce transactions

Provide technical support

Micro mall | retail | Beauty industry

Wechat official account: Favorcoder; Weibo: @FavorTechnology

Tech blog: tech.youzan.com

The bigger the dream, 

the more important the team.