1. Background
For a variety of reasons, the HBase cluster used by our core services was migrated to a cloud Elastic MapReduce (EMR) cluster, keeping the default EMR parameters for the HBase components.
Later, two core nodes (each running a RegionServer and a DataNode) went down during a traffic peak and a large number of regions got stuck in RIT (Region In Transition). Automatic recovery took 15 minutes, resulting in a P0 incident.
The root cause: EMR sets the default HDFS socket timeout to 900,000 ms (15 minutes). Region recovery therefore spent 15 minutes trying to read WAL logs from the faulty node before retrying the next one, and the self-healing time could not meet the requirements of online services. Setting the timeout to 60,000 ms (1 minute) enables rapid self-healing.
This article describes best practices for HBase parameter tuning, based on the characteristics of each HBase component and the online-service requirements of high availability and low jitter.
2. A Review of the HBase Architecture
This is only a brief overview of the architecture, just enough to discuss the parameters that need to be tuned for each component. For more details, please refer to the Comprehensive Understanding of HBase Architecture.
2.1 Overall Architecture
Physically, an HBase cluster consists of three types of servers in a master-slave arrangement: ZooKeeper, HMaster, and RegionServer.
- RegionServer serves read and write requests. When a user accesses data through a client, the client communicates directly with the RegionServer.
- HMaster handles RegionServer management and DDL operations (creating and deleting tables).
- ZooKeeper is a distributed coordination service. It maintains the state of the whole cluster, ensures high availability (HA), and enables automatic failover when a node fails.
- The underlying storage still relies on HDFS: all HBase data is stored in HDFS, Hadoop DataNodes hold the data managed by RegionServers, and the Hadoop NameNode maintains metadata for all physical data blocks.
2.2 Composition of a RegionServer
A RegionServer runs on a DataNode of HDFS and has the following components:
- WAL: Write-Ahead Log, a file on the distributed file system. It records new data that has not yet been persisted to disk, so that if the node goes down before persisting, the data can be recovered from the WAL.
- BlockCache: the read cache. It keeps frequently accessed data in memory; when the cache is full, the least recently used data is evicted.
- MemStore: the write cache. It holds data that has not yet been written to disk and sorts it before flushing, so that data lands on disk in sorted order. Each column family in each region has its own MemStore.
- HFile: stores the data on disk as KeyValues sorted lexicographically by key.
3. Read Optimization
3.1 Optimizing the Read/write Memory Ratio
A RegionServer has one BlockCache and N MemStores. The sum of their configured sizes must not exceed HeapSize x 0.8, otherwise HBase will not start.
BlockCache is the read cache and is critical to read performance. If the read volume is large, a machine with a 1:4 CPU-to-memory ratio is recommended, for example 8 cores with 32 GB or 16 cores with 64 GB.
In a read-heavy, write-light scenario, you can increase the BlockCache ratio and decrease the MemStore ratio to improve read performance.
Core adjustment parameters are as follows:
hfile.block.cache.size = 0.5
hbase.regionserver.global.memstore.size = 0.3
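A rough worked example (heap size assumed for illustration): on a RegionServer with a 32 GB heap, these values give about 0.5 x 32 GB = 16 GB of BlockCache and 0.3 x 32 GB = 9.6 GB of total MemStore, and 0.5 + 0.3 = 0.8 sits exactly at the upper bound described above, so any further increase of one ratio must be matched by a decrease of the other.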
3.2 Reducing the number of HFiles
When a read misses the cache, HBase has to open HFiles on disk. The more HFile files there are, the more I/O operations are needed and the higher the read latency.
Therefore, HBase merges HFiles through its compaction mechanism.
However, for online services, running compactions during daytime traffic peaks puts heavy pressure on disk I/O and causes read/write latency spikes, so compaction must be rate-limited (see Section 5).
3.3 Enabling Short-Circuit Read
HBase data is stored in HDFS, and a read from HDFS normally passes through the DataNode process. With short-circuit local read enabled, the client can read local data directly, bypassing the DataNode.
Assume there are two users, User1 and User2. User1 has permission to access the HDFS file /appData/hbase1, while User2 does not but still needs to read it. User1 can open the file to obtain a file descriptor and pass that descriptor to User2, so that User2 can read the file contents even without the permission itself.
Mapped onto HDFS, the DataNode plays the role of User1 and the client (DFSClient) plays the role of User2; the file to be read is /appData/hbase1 on the DataNode's local disk. The implementation is shown in the figure below:
Core parameters are as follows:
dfs.client.read.shortcircuit = true
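One point beyond the parameter above: in current Hadoop versions short-circuit reads go through a Unix domain socket, so dfs.domain.socket.path also needs to be configured (a local path such as /var/lib/hadoop-hdfs/dn_socket is a typical example) and must be accessible to both the DataNode and the RegionServer acting as the HDFS client.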
3.4 Enabling Hedged Reads (Disk I/O Needs to Be Evaluated)
When short-circuit read is enabled, local reads are preferred. However, in some cases a local read may fail or stall for a while because of disk or network problems.
To cope with this kind of problem, the HDFS client provides hedged reads, which HBase can take advantage of.
The basic principle: the client issues a local read, and if it has not returned within a certain time, it sends a request for the same data to another DataNode. Whichever request returns first is used, and the other is discarded.
Of course, this feature clearly amplifies disk I/O load and should be used with caution.
The core parameters are as follows (adjust them based on the actual environment):
dfs.client.hedged.read.threadpool.size = 10 // Number of threads dedicated to serving hedged reads; 0 (the default) disables hedged reads
dfs.client.hedged.read.threshold.millis = 500 // Waiting time (default 500 ms) before spawning the second read request
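Note that these are HDFS client parameters; for HBase reads they take effect in the RegionServer's HDFS client, so they are normally placed in the configuration loaded by the RegionServer (for example hbase-site.xml on the RegionServer nodes) rather than on the application side.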
4. Write optimization
4.1 Increasing MemStore Memory
In a write-heavy, read-light scenario, you can increase the MemStore memory ratio and reduce the BlockCache ratio, the opposite of the read optimization in 3.1.
The exact values can be chosen based on the read/write ratio of the workload.
4.2 Raising the HFile Blocking Threshold
This does not conflict with 3.2; it is a trade-off that has to be weighed.
During writes, the MemStore flushes to disk when certain conditions are met, generating an HFile. When the number of HFiles in a Store exceeds a certain threshold, writes and updates are blocked.
The RegionServer log will then contain messages like "Has too many store files...". When this happens, writes have to wait for a compaction to reduce the number of HFiles, typically a minor compaction.
So we raise that threshold as far as reasonable so that writes are not blocked waiting for compaction.
Core parameters:
hbase.hstore.blockingStoreFiles = 100
With a high write rate it is easy to accumulate a large number of HFiles, because compaction cannot keep up with the write speed.
So you should run a major compaction during off-peak hours to make full use of idle system resources. If the number of HFiles still cannot be brought down, add nodes.
4.3 Properly Increasing the MemStore Blocking Multiplier
When a MemStore reaches the flush threshold (hbase.hregion.memstore.flush.size, default 128 MB), it is flushed to disk, which basically does not block writes. But when the total MemStore size of a Region reaches the blocking multiplier (hbase.hregion.memstore.block.multiplier, default 4, i.e. 4 times the flush threshold, 4 x 128 MB = 512 MB by default), all update requests to that Region are blocked and a flush is forced; the client may then see a RegionTooBusyException.
These two parameters can be increased appropriately to minimize write blocking.
Core parameters include:
hbase.hregion.memstore.flush.size = 134217728 // 128 MB; the value is specified in bytes
hbase.hregion.memstore.block.multiplier = 4
5. IO optimization
HBase relies on compaction to keep read latency stable over time, but compaction itself can cause a large number of read latency spikes and write blocking.
To balance performance and stability, compaction needs to be rate-limited.
Core adjustment parameters are as follows:
hbase.offpeak.start.hour = 22 // Start hour of the off-peak window (unrestricted compaction)
hbase.offpeak.end.hour = 6 // End hour of the off-peak window
hbase.hstore.compaction.throughput.higher.bound = 15728640 // Upper bound of the compaction speed limit, 15 MB/s
hbase.hstore.compaction.throughput.lower.bound = 10485760 // Lower bound of the compaction speed limit, 10 MB/s
hbase.hregion.majorcompaction = 0 // Disable periodic major compaction
hbase.regionserver.thread.compaction.large = 1 // Number of large compaction threads
hbase.regionserver.thread.compaction.small = 1 // Number of small compaction threads
hbase.hstore.compaction.max = 3 // Maximum number of HFiles merged in one minor compaction
With compaction rate-limited during the day and the scheduled major compaction turned off, HFiles may not be merged sufficiently. Consider triggering a major compaction externally (for example via the Java API) at night to keep the number of HFiles to a minimum.
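A minimal sketch of such an external trigger (not from the original article, assuming the HBase 2.x Java client; the table name my_table is a placeholder):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class NightlyMajorCompact {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // majorCompact only queues the request; the actual merging
            // runs asynchronously on the RegionServers.
            admin.majorCompact(TableName.valueOf("my_table"));
        }
    }
}
Scheduling the job inside the night window (for example with cron) is left to the operating environment.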
6. Fault recovery optimization
A RegionServer outage can be caused by a full GC, a network exception, a known bug (CLOSE_WAIT connections not being closed), or a DataNode exception.
Once a RegionServer goes down, HBase detects the outage, reassigns all regions on that server to other healthy RegionServers in the cluster, and recovers unflushed data by replaying the HLog (WAL). After recovery the regions serve traffic again. The whole process is automatic and needs no human intervention. The basic principle is shown in the figure below:
If a DataNode is abnormal and the read timeout settings are too large (dfs.client.socket-timeout and dfs.socket.timeout), the WAL logs on that node cannot be read in time during region recovery, which greatly increases the recovery time.
Core parameters are as follows:
dfs.client.socket-timeout = 60000
dfs.datanode.socket.write.timeout = 480000
dfs.socket.timeout = 60000
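A note beyond the original list: dfs.socket.timeout is the older name of the read timeout that dfs.client.socket-timeout replaced, so both are set to the same value to cover different Hadoop versions, while dfs.datanode.socket.write.timeout covers the write path. These settings must be visible to the RegionServer's HDFS client (for example in hbase-site.xml on the RegionServer nodes) for the shorter WAL-read timeout to take effect during recovery.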
7. Other optimizations
7.1 Split Policy
HBase 2.0.0 and later use SteppingSplitPolicy as the default split policy.
With SteppingSplitPolicy, when a table has only a few regions the split threshold is low, so splits are triggered frequently.
Since our tables are already pre-split, we can switch the split policy to a fixed-size one (the size is determined by hbase.hregion.max.filesize), as in the parameter below and the sketch that follows it.
hbase.regionserver.region.split.policy = org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy
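For reference, a minimal sketch (not part of the original article, assuming the HBase 2.x Java client; the table name, column family, and split keys are placeholders) of creating a pre-split table and pinning the split policy per table instead of changing the cluster-wide default:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Pre-split points chosen for illustration; with the table pre-split,
            // ConstantSizeRegionSplitPolicy ties later splits to hbase.hregion.max.filesize only.
            byte[][] splitKeys = { Bytes.toBytes("2"), Bytes.toBytes("5"), Bytes.toBytes("8") };
            TableDescriptorBuilder table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("my_table"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                .setRegionSplitPolicyClassName(
                    "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
            admin.createTable(table.build(), splitKeys);
        }
    }
}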
7.2 Enabling RSGroup
RSGroup is very helpful for O&M operations such as scaling the cluster in and out, and it keeps the impact of region movement under control: move_servers_rsgroup moves regions one by one in a loop.
hbase.coprocessor.master.classes = org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint
hbase.master.loadbalancer.class = org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer
In addition, the meta table can be placed in its own RSGroup, isolated from the business RSGroups, to avoid a retry storm when an RS fails and the meta region cannot be moved (stuck in an abnormal OPENING state). At the same time, increase the number of meta-table handlers.
hbase.regionserver.metahandler.count = 400 // Tune based on the number of clients
8. Summary
Starting from the HBase architecture, this article has walked through the tuning parameters for each component and for the read/write path, aimed at online services that require high availability and low jitter.
If you have other tuning experience, feel free to leave a comment.
Thanks for reading to the end. Original content takes effort, so please follow and give it a like!
To piece scattered knowledge together, I maintain a Java knowledge map: github.com/saigu/JavaK… (for easy access to past articles).