“This is the 35th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021”.

1. HBase optimization

1.1. High availability

In HBase, the HMaster monitors the HRegionServer life cycle and balances the load among HRegionServers. If the HMaster fails, the entire HBase cluster falls into an unhealthy state and cannot keep working for long. Therefore, HBase supports a high-availability (HA) configuration for the HMaster.

  1. Stop the HBase cluster. (If the HBase cluster is not enabled, skip this step.)

    [moe@hadoop102 conf]$ bin/stop-hbase.sh
  2. Create the backup-masters file in the conf directory

    [moe@hadoop102 conf]$ vim backup-masters
  3. Configure the HMaster node in the backup-masters file

    hadoop103
    hadoop104
  4. Distribute the backup-masters file to the other nodes

    [moe@hadoop102 conf]$ xsync backup-masters
  5. Start the HBase cluster

    [moe@hadoop102 conf]$ bin/start-hbase.sh
  6. Open the HBase web UI to verify the result
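As a quick sanity check, the backup-masters file is just a plain list of hostnames, one per line. A sketch of creating it non-interactively (written to /tmp here purely for illustration; the real file lives in HBase's conf directory):

```shell
# The conf/backup-masters file is a plain list of backup HMaster hostnames,
# one per line. Writing to /tmp only for illustration:
printf 'hadoop103\nhadoop104\n' > /tmp/backup-masters
cat /tmp/backup-masters
```

After start-hbase.sh, the extra HMaster processes on hadoop103 and hadoop104 appear under "Backup Masters" on the active master's web UI.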

1.2. Pre-splitting

Each region maintains a StartRow and an EndRow. If newly added data falls within the RowKey range maintained by a region, the data is handed to that region. Based on this principle, you can plan in advance the regions that data will be written to, which improves HBase performance.
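The lookup described above can be sketched with plain string comparison. This is a minimal illustration of the principle, not HBase's internal code, and the split points are hypothetical:

```java
import java.util.Arrays;

// Sketch (not HBase internals): finding the target region for a RowKey,
// given sorted pre-split boundaries. Region 0 covers (-inf, splits[0]),
// region i covers [splits[i-1], splits[i]).
public class RegionLocator {
    static int locate(String[] splits, String rowKey) {
        int idx = Arrays.binarySearch(splits, rowKey);
        // An exact match on a boundary falls into the region that starts there;
        // otherwise binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splits = {"1000", "2000", "3000", "4000"}; // hypothetical split points
        System.out.println(locate(splits, "0999")); // region 0
        System.out.println(locate(splits, "2500")); // region 2
        System.out.println(locate(splits, "1000")); // region 1 (boundary match)
    }
}
```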

  1. Manually set pre-split points

    hbase> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
  2. Generate hexadecimal-sequence split points automatically

    create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
  3. Pre-split according to rules set in a file

    The contents of the splits.txt file are as follows:

    aaaa
    bbbb
    cccc
    dddd
    create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
  4. Create pre-split regions using the Java API

    // A custom algorithm generates a series of hash values, stored in a two-dimensional byte array
    byte[][] splitKeys = someHashFunction();
    // Create an HBaseAdmin instance
    HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
    // Create an HTableDescriptor instance for the target table
    HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(tableName));
    tableDesc.addFamily(new HColumnDescriptor("info"));
    // Create the HBase table with pre-split regions from the descriptor and the split keys
    hAdmin.createTable(tableDesc, splitKeys);
    hAdmin.close();
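The hash-function placeholder above can be implemented in many ways. One simple approach, sketched below, slices the 32-bit hex keyspace into equal ranges; this mirrors the idea behind HexStringSplit but is not guaranteed to match its exact output:

```java
// Sketch: generate numRegions - 1 split keys by cutting the 32-bit hex
// keyspace into equal slices (the idea behind HexStringSplit; not its exact output).
public class SplitKeyGen {
    static byte[][] hexSplitKeys(int numRegions) {
        byte[][] splitKeys = new byte[numRegions - 1][];
        long range = 0x100000000L; // 2^32 keyspace
        for (int i = 1; i < numRegions; i++) {
            // Boundary i sits at i/numRegions of the keyspace, as 8 hex chars
            String key = String.format("%08x", i * range / numRegions);
            splitKeys[i - 1] = key.getBytes();
        }
        return splitKeys;
    }

    public static void main(String[] args) {
        byte[][] keys = hexSplitKeys(4); // 4 regions -> 3 split keys
        for (byte[] k : keys) System.out.println(new String(k));
        // prints 40000000, 80000000, c0000000
    }
}
```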

1.3 RowKey Design

A RowKey uniquely identifies a piece of data, and the region in which a piece of data is stored depends on which pre-split range its RowKey falls into. The goal of RowKey design is to distribute data evenly across all regions and, to some extent, prevent data skew. Let's look at common RowKey design approaches.

  1. Hashing: generate random numbers or hash digests

    For example, the original RowKey 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1; the original RowKey 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd after SHA1; and the original RowKey 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913 after SHA1. Before applying this approach, we generally sample a subset of the data set to decide which hashed RowKey values to use as the split thresholds for each region.
  2. String inversion

    20170524000001 becomes 10000042507102
    20170524000002 becomes 20000042507102

    This also scatters, to some extent, data that would otherwise be inserted in monotonically increasing order.

  3. String concatenation (salting)

    20170524000001_a12e
    20170524000001_93i7
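The three designs above can be sketched in plain Java. This is a minimal illustration; the salt suffix is fixed here, whereas in practice it would be random or computed:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the three RowKey designs: hashing, string reversal, and salting.
public class RowKeyDesign {
    // 1. Hashing: SHA-1 digest of the original key, rendered as 40 hex chars
    static String sha1(String rowKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(rowKey.getBytes(StandardCharsets.UTF_8)))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // 2. String reversal: scatters monotonically increasing keys
    static String reverse(String rowKey) {
        return new StringBuilder(rowKey).reverse().toString();
    }

    // 3. Salting: append a suffix (fixed here for illustration; usually random)
    static String salt(String rowKey, String suffix) {
        return rowKey + "_" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(sha1("1001"));              // 40 hex characters
        System.out.println(reverse("20170524000001")); // 10000042507102
        System.out.println(salt("20170524000001", "a12e"));
    }
}
```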

1.4. Memory optimization

HBase operation requires a large amount of memory, since tables can be cached in memory. Generally, 70% of the available memory is allocated to the HBase Java heap. However, a very large heap is not recommended, because the RegionServer is unavailable during long GC pauses. Generally, 16 to 48 GB of heap memory is appropriate. If the framework's memory usage is too high and system memory becomes insufficient, the framework process will be killed by the system.
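The heap itself is configured in conf/hbase-env.sh. A minimal sketch; the 16g value is only an illustration within the 16-48 GB range above, and must be tuned per cluster:

```shell
# conf/hbase-env.sh -- illustrative value, tune for your cluster
export HBASE_HEAPSIZE=16g   # JVM heap for HBase daemons, within the 16-48 GB guideline
```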

1.5. Basic optimization

  1. Allow appending content to HDFS files

    hdfs-site.xml, hbase-site.xml

    Property: dfs.support.append
    Explanation: After HDFS append support is enabled, HBase data synchronization and persistence can work. The default value is true.
  2. Optimize the maximum number of open files allowed by DataNode

    hdfs-site.xml

    Property: dfs.datanode.max.transfer.threads
    Explanation: HBase generally operates on a large number of files at the same time. Set this to 4096 or higher depending on cluster size and data volume. Default value: 4096.
  3. Optimize the wait time for data operations with high latency

    hdfs-site.xml

    Property: dfs.image.transfer.timeout
    Explanation: If the latency of a data operation is very high and the socket needs to wait longer, set this to a larger value (60000 ms by default) to ensure the socket does not time out.
  4. Optimize data write efficiency

    mapred-site.xml

    Properties: mapreduce.map.output.compress, mapreduce.map.output.compress.codec
    Explanation: Enabling these two properties greatly improves file write efficiency and reduces write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.
  5. Set the number of RPC handlers

    hbase-site.xml

    Property: hbase.regionserver.handler.count
    Explanation: Specifies the number of RPC handlers; the default value is 30. Adjust it according to client load: increase the value when the volume of read and write requests is large.
  6. Optimize HStore file size

    hbase-site.xml

    Property: hbase.hregion.max.filesize
    Explanation: The default value is 10737418240 (10 GB). You can reduce this value if MapReduce tasks need to run against HBase, since one region corresponds to one map task. When an HFile reaches this size, the region is split into two.
  7. Optimized HBase client cache

    hbase-site.xml

    Property: hbase.client.write.buffer
    Explanation: Specifies the HBase client write-buffer size. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. Generally, set a buffer size that keeps the number of RPCs down.
  8. Specify the number of rows fetched per scanner next call

    hbase-site.xml

    Property: hbase.client.scanner.caching
    Explanation: Specifies the default number of rows fetched by the scan.next method. The larger the value, the greater the memory consumption.
  9. Flush, compact, split mechanisms

    When a MemStore reaches its threshold, its data is flushed into a StoreFile. The compact mechanism merges the small flushed files into larger StoreFile files. Split: when a region reaches its threshold, the oversized region is split into two.

    Attributes involved:

    hbase.hregion.memstore.flush.size: 134217728

    That is, 134217728 bytes (128 MB) is the default MemStore flush threshold. If the total size of all MemStores in a single HRegion exceeds this value, all MemStores in that HRegion are flushed. A RegionServer processes flushes asynchronously by adding requests to a queue, following a producer-consumer model. There is a pitfall here: when the queue cannot consume requests fast enough and a huge backlog builds up, memory can surge and, at worst, trigger an OOM.

    hbase.regionserver.global.memstore.upperLimit: 0.4
    hbase.regionserver.global.memstore.lowerLimit: 0.38

    That is, when the total memory used by MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files. MemStores are flushed in descending order of size until the total memory used by MemStores drops slightly below lowerLimit.
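For reference, the hbase-site.xml properties discussed above take the standard Hadoop XML form. A fragment sketch; the values are illustrative, not recommendations:

```xml
<!-- hbase-site.xml fragment; illustrative values only -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>
<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>
```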

2. Related links

Big data HBase Learning Journey 3

Big data HBase Learning Journey Part 2

Big data HBase Learning Journey 1