“This is the 35th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021”.
1. HBase optimization
1.1. High availability
In HBase, the HMaster monitors the lifecycle of each HRegionServer and balances the load across HRegionServers. If the HMaster goes down, the entire HBase cluster falls into an unhealthy state and cannot keep working for long. Therefore, HBase supports a high-availability (HA) configuration for the HMaster.
-
Stop the HBase cluster. (If the HBase cluster is not enabled, skip this step.)
[moe@hadoop102 conf]$ bin/stop-hbase.sh
-
Create the backup-masters file in the conf directory
[moe@hadoop102 conf]$ vim backup-masters
-
Configure the backup HMaster nodes in the backup-masters file
hadoop103
hadoop104
-
Distribute backup-masters to the other nodes
[moe@hadoop102 conf]$ xsync backup-masters
-
Start the HBase cluster
[moe@hadoop102 conf]$ bin/start-hbase.sh
-
Open the HBase web UI to verify the backup masters
1.2. Pre-division
Each region maintains a StartRow and an EndRow. If newly added data falls within the RowKey range maintained by a region, the data is handed to that region. Based on this principle, you can plan in advance the partitions that data will be written to, which improves HBase performance.
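To make the routing concrete, here is a small Python sketch (not HBase code) that maps a RowKey to a region index given the split points used in the examples below; HBase compares keys byte-wise, which for these ASCII keys matches lexicographic order:

```python
import bisect

# Split points as in SPLITS => ['1000','2000','3000','4000'], which create
# 5 regions: (-inf,1000), [1000,2000), [2000,3000), [3000,4000), [4000,+inf)
split_points = ["1000", "2000", "3000", "4000"]

def region_for(rowkey: str) -> int:
    """Return the index of the region whose [StartRow, EndRow) range holds rowkey."""
    # bisect_right counts the split points <= rowkey, which is the region index.
    return bisect.bisect_right(split_points, rowkey)

print(region_for("0999"))  # 0 -> first region, EndRow '1000'
print(region_for("1000"))  # 1
print(region_for("2500"))  # 2
print(region_for("9999"))  # 4 -> last region, no EndRow
```

Data whose RowKey already falls inside a region's range never causes a move; pre-splitting simply decides these boundaries before any data arrives.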
-
Manually set pre-division
hbase> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
-
Generate hexadecimal sequence pre-partitioning
create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
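The idea behind HexStringSplit is to cut the space of fixed-width hex strings into evenly sized ranges. A simplified Python sketch of that idea (not HBase's exact implementation) looks like this:

```python
def hex_string_splits(num_regions: int, width: int = 8) -> list[str]:
    """Evenly spaced hex split keys over 00000000..ffffffff, in the spirit
    of HexStringSplit. Simplified sketch, not the HBase source algorithm."""
    max_val = 16 ** width
    step = max_val // num_regions
    # num_regions regions need num_regions - 1 split keys.
    return [format(i * step, f"0{width}x") for i in range(1, num_regions)]

splits = hex_string_splits(15)
print(len(splits))             # 14 split keys -> 15 regions
print(splits[0], splits[-1])   # 11111111 eeeeeeee
```

This only balances data if RowKeys are themselves uniformly distributed hex strings (e.g. hashed keys, as in section 1.3).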
-
Prepartition according to the rules set in the file
The contents of the splits.txt file are as follows:
aaaa
bbbb
cccc
dddd
create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
-
Create pre-partitions using the Java API
// A custom algorithm generates a series of hash values stored in a two-dimensional array
byte[][] splitKeys = ...; // some hash function
// Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance
HTableDescriptor tableDesc = new HTableDescriptor(tableName);
// Create the pre-partitioned HBase table from the descriptor and the split keys
hAdmin.createTable(tableDesc, splitKeys);
1.3 RowKey Design
A RowKey uniquely identifies a piece of data. Which partition the data is stored in depends on which pre-split range its RowKey falls into. The goal of RowKey design is to distribute data evenly across all regions and prevent data skew to some extent. Let's look at common RowKey design approaches.
-
Generate random numbers or hash values
For example:
The original RowKey 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1.
The original RowKey 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd after SHA1.
The original RowKey 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913 after SHA1.
Before doing this, we usually draw a sample from the data set to decide which hashed RowKeys to use as the split thresholds of each partition.
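The hash-then-sample workflow can be sketched in Python; the sample values and region count below are made up for illustration:

```python
import hashlib

def hashed_key(rowkey: str) -> str:
    """Replace the original RowKey with its SHA1 hex digest."""
    return hashlib.sha1(rowkey.encode("utf-8")).hexdigest()

# Draw a sample of real keys, hash them, then pick evenly spaced hashed
# keys as the split thresholds for the pre-partitioned table.
sample = [str(i) for i in range(1000, 6000, 500)]
hashed = sorted(hashed_key(k) for k in sample)
num_regions = 5
splits = [hashed[i * len(hashed) // num_regions] for i in range(1, num_regions)]
print(len(splits))  # 4 split keys -> 5 regions
```

Because SHA1 digests are effectively uniform, the resulting regions receive roughly equal shares of future writes.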
-
String inversion
20170524000001 becomes 10000042507102
20170524000002 becomes 20000042507102
This also spreads sequentially inserted data across regions to some extent.
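The reversal above is a one-liner in Python; the fast-changing tail of the timestamp becomes the prefix, so consecutive keys land in different regions:

```python
def reverse_key(rowkey: str) -> str:
    """Reverse the RowKey so the fast-changing tail becomes the prefix."""
    return rowkey[::-1]

print(reverse_key("20170524000001"))  # 10000042507102
print(reverse_key("20170524000002"))  # 20000042507102
```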
-
String splicing
20170524000001_a12e
20170524000001_93i7
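A minimal sketch of this splicing, assuming a random four-character hex suffix (the suffix scheme here is illustrative; real designs often use a deterministic salt so keys can be found again):

```python
import random
import string

def salted_key(rowkey: str) -> str:
    """Append a short random suffix, e.g. 20170524000001_a12e."""
    suffix = "".join(random.choices(string.hexdigits.lower()[:16], k=4))
    return f"{rowkey}_{suffix}"

key = salted_key("20170524000001")
print(key)  # e.g. 20170524000001_7f3a
```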
1.4. Memory optimization
HBase operations require a large amount of memory, since tables can be cached in memory. Typically about 70% of the available memory is given to HBase's Java heap. However, a very large heap is not recommended, because the RegionServer becomes unavailable for a long time during GC; 16 to 48 GB is a common range. If the frameworks' memory usage leaves the system short of memory, their processes may be killed by the OS.
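The heap size is set in hbase-env.sh; the value below is only an example within the 16-48 GB range mentioned above and must be tuned for your cluster:

```shell
# hbase-env.sh -- illustrative value, tune for your machine's memory
export HBASE_HEAPSIZE=16G
```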
1.5. Basic optimization
-
Allow appending content to HDFS files
hdfs-site.xml, hbase-site.xml
Property: dfs.support.append
Description: Enabling HDFS append synchronization supports HBase data synchronization and persistence. The default value is true.
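In config-file form, the property above would be set like this (the same XML pattern applies to the other properties in this section):

```xml
<!-- hdfs-site.xml (mirrored in hbase-site.xml) -->
<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>
```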
-
Optimize the maximum number of open files allowed by DataNode
hdfs-site.xml
Property: dfs.datanode.max.transfer.threads
Description: HBase usually operates on a large number of files at the same time. Set this to 4096 or higher depending on cluster size and data volume. Default value: 4096.
-
Optimize the wait time for data operations with high latency
hdfs-site.xml
Property: dfs.image.transfer.timeout
Description: If a data operation has very high latency and the socket needs to wait longer, set this to a larger value (the default is 60000 ms) to ensure the socket is not timed out.
-
Optimize data write efficiency
mapred-site.xml
Properties: mapreduce.map.output.compress and mapreduce.map.output.compress.codec
Description: Enabling these two properties can greatly improve file write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.
-
Set the number of RPC listeners
hbase-site.xml
Property: hbase.regionserver.handler.count
Description: Specifies the number of RPC listeners; the default value is 30. Adjust it to the client load: increase it when there are many read and write requests.
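As a config fragment (the value 100 is an illustrative example for a busy cluster, not a recommendation from this article):

```xml
<!-- hbase-site.xml -->
<property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
</property>
```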
-
Optimize HStore file size
hbase-site.xml
Property: hbase.hregion.max.filesize
Description: The default value is 10737418240 (10 GB). Consider lowering it if HBase MR jobs will run, because one region corresponds to one map task. When a region's HFiles reach this size, the region is split in two.
-
Optimize the HBase client cache
hbase-site.xml
Property: hbase.client.write.buffer
Description: Specifies the HBase client write buffer size. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. Generally, set a reasonable buffer size to reduce the number of RPCs.
-
Specify the number of rows fetched by scan.next
hbase-site.xml
Property: hbase.client.scanner.caching
Description: Specifies the default number of rows returned by the scan.next method. The larger the value, the greater the memory consumption.
-
Flush, compact, split mechanisms
When a MemStore reaches its threshold, its data is flushed to a StoreFile. The compact mechanism merges small flushed files into a larger StoreFile. Split: when a region reaches its threshold, the oversized region is split in two.
Attributes involved:
That is, 128 MB is the default threshold of Memstore
hbase.hregion.memstore.flush.size: 134217728
That is, if the total size of all MemStores in a single HRegion exceeds the specified value, all MemStores in that HRegion are flushed. A RegionServer processes flushes asynchronously by adding requests to a queue, following a producer-consumer model. A problem arises when the queue cannot be consumed fast enough and requests pile up: memory usage can surge and, at worst, trigger an OOM.
hbase.regionserver.global.memstore.upperLimit: 0.4
hbase.regionserver.global.memstore.lowerLimit: 0.38
That is, when the total memory used by MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files. MemStore flushes execute in descending order of size until the memory used by MemStores is slightly below lowerLimit.
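The upperLimit/lowerLimit policy just described can be sketched as a small simulation; this is an illustration of the described behavior, not the HBase source:

```python
def flush_until_below(memstores: dict[str, int], total_capacity: int,
                      upper: float = 0.4, lower: float = 0.38) -> list[str]:
    """Once total MemStore usage crosses upper * capacity, flush MemStores
    largest-first until usage drops below lower * capacity.
    Returns the names of flushed MemStores, in flush order."""
    flushed = []
    used = sum(memstores.values())
    if used < upper * total_capacity:
        return flushed  # under the upper limit: nothing to do
    for name in sorted(memstores, key=memstores.get, reverse=True):
        if used < lower * total_capacity:
            break
        used -= memstores[name]
        flushed.append(name)
    return flushed

# 1000 MB of MemStore capacity; usage 450 MB > 400 MB -> flush biggest first
stores = {"r1": 200, "r2": 150, "r3": 100}  # hypothetical sizes in MB
print(flush_until_below(stores, 1000))  # ['r1'] -- one flush already drops below 380
```

Flushing largest-first frees the most memory per flush, which is why a single flush is often enough to fall back under the lower watermark.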
2. Related links
Big data HBase Learning Journey 3
Big data HBase Learning Journey Part 2
Big data HBase Learning Journey 1