Knowledge of HBase needs to be shared within the team. After studying it for a while, my notes were fairly scattered, so this post organizes them systematically.

Overview

Features

Hbase is a NoSQL database, which means it does not support SQL as a query language the way traditional RDBMSs do. Hbase is a distributed storage system; technically it is closer to distributed storage than to a distributed database, and it lacks many RDBMS features such as typed columns, secondary indexes, triggers, and an advanced query language. So what does Hbase offer? The following:

  • Strongly consistent reads and writes: Hbase is not an “eventually consistent” data store, which makes it well suited to high-speed aggregation tasks
  • Automatic sharding: The system is distributed in clusters by Region. As the number of rows increases, regions are automatically sharded and redistributed
  • Automatic failover
  • Hadoop/HDFS integration: Hbase supports HDFS as its underlying file system out of the box, with little effort to connect
  • Rich “clean, efficient” APIs: Thrift/REST gateways and a Java API
  • Block caching, bloom filters, and efficient column query optimization
  • Operations management: Hbase provides a built-in web UI for operational management and exposes JMX metrics

When to use Hbase?

Hbase is not suitable for all problems:

  • If you have hundreds of millions or billions of rows, Hbase is a good choice. If you only have a few million rows or fewer, a traditional RDBMS is the better choice, because with that little data only a few machines would actually do any work while the rest of the cluster sits idle
  • Second, make sure you can live without the features an RDBMS provides (secondary indexes, statically typed columns, transactions, and so on). A system built on an RDBMS cannot simply be pointed at Hbase; switching requires a redesign.
  • Finally, make sure there is enough hardware. An HDFS cluster does not perform well with fewer than five DataNodes (given the default replication factor of 3), plus a NameNode.

Hbase can run in standalone mode, but that should be limited to development environments.

Internal application

  • Store service data: vehicle GPS information, driver location information, user operation information, device access information…
  • Store log data: architecture monitoring data (login logs, middleware access logs, push logs, SMS and email sending records…) , service operation log information
  • Store service attachments: our UDFS system stores attachments such as images, videos, and documents

Internally, however, we do not call the native Hbase API directly: raw API access cannot be monitored, which hurts system stability and makes version upgrades hard to control.

Hbase architecture

  • ZooKeeper provides distributed coordination; RegionServers also register their information in ZooKeeper.
  • HDFS is the underlying file system that Hbase runs on
  • RegionServer, understood as a data node, stores data.
  • Master: RegionServers report their status to the Master in real time, so the Master knows the global running state of every RegionServer and can drive RegionServer failover and Region splitting.

Architecture refinement

  • HMaster is the implementation of the Master server. It monitors the RegionServer instances in the cluster and is the interface for all metadata changes. In a cluster, the HMaster usually runs on the same node as the NameNode. In more detail:

    • HMasterInterface exposes operations on: Table (createTable, modifyTable, removeTable, enable, disable), ColumnFamily (addColumn, modifyColumn, removeColumn), and Region (move, assign, unassign)
    • Background threads run by the Master: the LoadBalancer thread, which moves Regions to balance the cluster's load, and the CatalogJanitor thread, which periodically checks the hbase:meta table.
  • HRegionServer implements RegionServer and manages Regions. RegionServer runs on Datanodes in the cluster

    • HRegionInterface exposes operations on: data (get, put, delete, next, etc.) and Regions (splitRegion, compactRegion, etc.)
    • RegionServer Background Thread: CompactSplitThread, MajorCompactionChecker, MemStoreFlusher, LogRoller
  • Regions represent slices of a table. A Region contains one Store per column family; each Store has one MemStore and multiple StoreFiles (HFiles), and StoreFiles are made up of Blocks at the lowest level.

Store design

In Hbase, a table is divided into smaller pieces that are stored on different servers. These pieces are called Regions, and a RegionServer is where Regions live. The Master process distributes Regions among RegionServers. In the Hbase implementation, the HRegionServer and HRegion classes represent a RegionServer and a Region respectively. Besides hosting HRegions, an HRegionServer manages two types of files for data storage:

  • HLog, also known as the write-ahead log (WAL)
  • HFile, the file that stores the actual data

HLog
  • MasterProcWAL: the HMaster records management operations, such as server conflict resolution, table creation, and other DDL operations, in its own WAL files, which are stored under the MasterProcWALs directory. Unlike the RegionServers' WALs, the HMaster's WAL also lets another Master take over and continue the operations if the active Master goes down.

  • The WAL records all Hbase data changes. If a RegionServer crashes before the MemStore is flushed, the WAL ensures the changes can still be replayed and applied. If writing to the WAL fails, the whole operation that modifies the data fails.

    • Typically, each RegionServer has only one WAL instance. Before 2.0, the WAL implementation was called HLog
    • WAL is located in the */hbase/WALs/* directory
    • MultiWAL: with only one WAL per RegionServer, all edits must be written to that WAL serially (an HDFS file can only be appended sequentially), which can become a performance bottleneck. MultiWAL lets a RegionServer write several WALs in parallel, raising overall throughput through multiple HDFS pipelines, although it does not raise the throughput of a single Region.
  • WAL configuration:

    <!-- Enable MultiWAL -->
    <property>
      <name>hbase.wal.provider</name>
      <value>multiwal</value>
    </property>

Wikipedia about WAL

HFile

HFile is the format in which Hbase stores data in HDFS. It contains multiple layers of indexes, so Hbase does not need to load a whole file to look up data. The size of these indexes depends on the size of the keys and the amount of data; with large data sets it is not unusual for the block indexes on a single RegionServer to approach 1GB.

Discussing how a database stores data is really discussing how data is organized effectively on disk, something that is easy to overlook because we usually focus on how to read and consume data efficiently rather than on the storage itself.

How an HFile is generated

At first, there are no blocks in the HFile, and the data still exists in the MemStore.

When the HFile Writer is created, the first empty Data Block appears. The initialized Data Block has space reserved for the Header, which is used to store the metadata information of a Data Block.

Then, the KeyValues in the MemStore are appended to the first Data Block in memory:

Note: if Data Block Encoding is configured, the encoding happens as each KeyValue is appended, so the encoded data is no longer in plain KeyValue form. Data Block Encoding is an internal encoding mechanism that HBase provides to reduce the structural overhead of KeyValues.

A simple read and write flow

Hbase single-machine deployment mode

This time we deploy a standalone version of Hbase: the separate Hbase daemons (Master, RegionServers, and ZooKeeper) all run in the same JVM process and persist to the local file system. This is the simplest deployment, but it helps in understanding Hbase. After the installation, we will walk through the hbase command line.

Environment

  • CentOS 7
  • Hbase 1.2.8

Installing the standalone version

  1. Make sure the JDK is installed. On Linux, the simplest route is the distribution's package manager (a binary JDK also works). On CentOS:

yum install java-1.8.0-openjdk* -y
  2. Download the Hbase binary package; a download mirror is at mirror.bit.edu.cn/apache/hbas…

tar -xf hbase-1.2.8-bin.tar.gz
cd hbase-1.2.8
  3. Configure the Hbase environment: edit conf/hbase-env.sh and point JAVA_HOME at wherever your JDK lives.

JAVA_HOME=/etc/alternatives/java_sdk_1.8.0/
  4. Configure conf/hbase-site.xml, the main Hbase configuration file. Here you can specify the directories where Hbase and ZooKeeper write data, as well as the Hbase root directory.

I put the Hbase root under an hbase directory in the hadoop user's home directory. There is no need to create Hbase's data directory in advance; Hbase creates it automatically. If the directory already exists, Hbase will try to migrate it, which is not what you want.

useradd -s /sbin/nologin -m hadoop

vim conf/hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hadoop/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zookeeper</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
    <description>
      Controls whether HBase will check for stream capabilities (hflush/hsync).

      Disable this if you intend to run on LocalFileSystem, denoted by a rootdir
      with the 'file://' scheme, but be mindful of the NOTE below.

      WARNING: Setting this to false blinds you to potential data loss and
      inconsistent system state in the event of process and/or node failures. If
      HBase is complaining of an inability to use hsync or hflush it's most
      likely not a false positive.
    </description>
  </property>
</configuration>
  5. The Hbase binary package ships with a start-hbase.sh script, which makes starting Hbase easy. If the configuration is correct, Hbase starts up normally.
./bin/start-hbase.sh

Once Hbase has started, open http://localhost:16010 to view the Hbase Web UI.

Using Hbase

You can use the command-line tool that Hbase provides in its bin/ directory.

  1. Connect to Hbase
./hbase shell

  2. To view help information, type
>help

  3. To create a table, you must specify the table name and a column family name
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.6320 seconds

=> Hbase::Table - test

  4. To list information about your table, use list 'sometable'

  5. To view the table in more detail, use the describe command

  6. Put data into the table with the put command

  7. View all of the data in the table with the scan command

  8. Get a single row of data with the get command

  9. The remaining commands you can try on your own

  10. To exit the shell, run quit

This section covered installing the standalone version of Hbase and the basics of the hbase shell. For a deeper understanding of Hbase, see the official documentation.

Hbase Data Model

In Hbase, there are some terms that need to be understood in advance. As follows:

  • Table: an Hbase table consists of multiple rows
  • Row: a row in Hbase consists of a row key and one or more columns with values. Rows are sorted alphabetically by row key, so row key design matters: a good design keeps related rows close together. A common pattern is to reverse a site's domain name, e.g. org.apache.www, org.apache.mail, org.apache.jira, so that all Apache domains sort next to each other.
  • Column: a column consists of a column family and a column qualifier, usually written as family:qualifier. Qualifiers do not need to be declared when the table is created.
  • Column Family: a column family physically groups a set of columns and their values. Each family has storage properties that can be configured, such as whether to use caching, the compression type, and how many versions to keep. Every row in a table has the same column families, even if a row stores nothing in some of them.
  • Column Qualifier: identifies a specific column within a column family. Qualifiers are not fixed in advance, so different rows may have entirely different qualifiers.
  • Cell: a cell is addressed by row, column family, and column qualifier, and holds a value together with a timestamp; the timestamp generally represents the value's version.
  • Timestamp: a timestamp is written alongside each value and acts as its version number. By default it is the time at which the data was written, but a different timestamp can be specified when writing.

HBase is a sparse, distributed, persistent, multi-dimensional, sorted map, indexed by row key, column key, and timestamp.

When data is stored in Hbase, it can be thought of as two nested SortedMaps: the outer SortedMap is sorted by row key, and the inner one is sorted by column.
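A conceptual sketch of those two sorted maps in plain Java (not a real Hbase API), using rows from the test data below:

// Outer map: row key -> row, sorted by row key.
// Inner map: "family:qualifier" -> value, sorted by column.
SortedMap<String, SortedMap<String, String>> table = new TreeMap<String, SortedMap<String, String>>();

SortedMap<String, String> row = new TreeMap<String, String>();
row.put("info:name", "zhangsan");
row.put("ship:addr", "beijing");
table.put("524382618264914241", row);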

The test data

create 'user','info','ship';

put 'user', '524382618264914241', 'info:name', 'zhangsan'
put 'user', '524382618264914241', 'info:age',30
put 'user', '524382618264914241', 'info:height',168
put 'user', '524382618264914241', 'info:weight',168
put 'user', '524382618264914241', 'info:phone','13212321424'
put 'user', '524382618264914241', 'ship:addr','beijing'
put 'user', '524382618264914241', 'ship:email','[email protected]'
put 'user', '524382618264914241', 'ship:salary',3000

put 'user', '224382618261914241', 'info:name', 'lisi'
put 'user', '224382618261914241', 'info:age',24
put 'user', '224382618261914241', 'info:height',158
put 'user', '224382618261914241', 'info:weight',128
put 'user', '224382618261914241', 'info:phone','13213921424'
put 'user', '224382618261914241', 'ship:addr','chengdu'
put 'user', '224382618261914241', 'ship:email','[email protected]'
put 'user', '224382618261914241', 'ship:salary',5000

put 'user', '673782618261019142', 'info:name', 'zhaoliu'
put 'user', '673782618261019142', 'info:age',19
put 'user', '673782618261019142', 'info:height',178
put 'user', '673782618261019142', 'info:weight',188
put 'user', '673782618261019142', 'info:phone','17713921424'
put 'user', '673782618261019142', 'ship:addr','shenzhen'
put 'user', '673782618261019142', 'ship:email','[email protected]'
put 'user', '673782618261019142', 'ship:salary',8000

put 'user', '813782218261011172', 'info:name', 'wangmazi'
put 'user', '813782218261011172', 'info:age',19
put 'user', '813782218261011172', 'info:height',158
put 'user', '813782218261011172', 'info:weight',118
put 'user', '813782218261011172', 'info:phone','12713921424'
put 'user', '813782218261011172', 'ship:addr','xian'
put 'user', '813782218261011172', 'ship:email','[email protected]'
put 'user', '813782218261011172', 'ship:salary',10000

put 'user', '510824118261011172', 'info:name', 'yangyang'
put 'user', '510824118261011172', 'info:age',18
put 'user', '510824118261011172', 'info:height',188
put 'user', '510824118261011172', 'info:weight',138
put 'user', '510824118261011172', 'info:phone','18013921626'
put 'user', '510824118261011172', 'ship:addr','shanghai'
put 'user', '510824118261011172', 'ship:email','[email protected]'
put 'user', '510824118261011172', 'ship:salary',50000

Design points of Hbase tables (Schemas)

Schema design is an issue for any database. Relational databases have normal forms to guide it; as a column-oriented store, Hbase's schema design is just as important.

Which properties should you pay attention to when designing a table, and how should you set them? That is what this section covers.

Comparison between Hbase and relational databases

  • Data type: Hbase stores everything as strings (bytes); an RDBMS has rich data types
  • Data manipulation: Hbase supports basic create/read/update/delete with no join support; an RDBMS offers rich functions and joins across tables
  • Storage model: Hbase is column-oriented; an RDBMS stores rows according to the table structure
  • Data protection: Hbase keeps old versions after an update; an RDBMS replaces the old value
  • Scalability: Hbase nodes are added easily; an RDBMS needs a middle tier and sacrifices performance

Considerations for Hbase design

Key Hbase concepts: table, row key, column family, and timestamp

  • How many column families should the table have?
  • What data goes into each column family?
  • How many columns are in each column family?
  • What are the column names? Even though column names do not have to be defined when the table is created, you need to know them to read and write data.
  • What data should go into the cells?
  • How many versions should be stored for each cell?
  • What should the row key structure be, and what information should it contain?

Design points

Row key design

This is the key part, directly tied to later access performance. A poor row key design can make subsequent queries exponentially slower.

  • Avoid monotonically increasing row keys. Hbase rows are stored in sorted order, so with an increasing key most writes in any given period land on a single Region and one node carries all of the load. A key such as [metric_type][event_timestamp] spreads the pressure, because different metric_type prefixes map to different Regions (see the sketch after this list).
  • Keep row keys as short as they can be while still being useful for data access; a shorter key does not help much if it is useless for reads, so the trade-off is between length and usability.
  • Row keys cannot be changed; the only way to "change" one is to delete the row and insert it again.
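A minimal sketch of the [metric_type][event_timestamp] pattern, in the same HBase 0.98 client style as the Java examples later in this post; the "metrics" table and the d:v column are made-up placeholders:

public void writeMetric(HTable metrics, String metricType, long eventTimestamp, double value) throws IOException {
    // Prefix the key with the metric type so different metrics land in different Regions.
    byte[] rowKey = Bytes.add(Bytes.toBytes(metricType), Bytes.toBytes(eventTimestamp));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(value));
    metrics.put(put);
}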
Column family design

A column family is a collection of columns. All members of a column family share the family name as a prefix, separated from the column qualifier by a colon (:).

  • Hbase currently does not handle more than two or three column families well, so keep the number of column families small. If a table has a column family A with 1 million rows and a column family B with 1 billion rows, family A's data gets spread across a huge number of Regions (splitting is driven by the table as a whole), which makes scanning family A inefficient.

  • Keep column family names as short as possible to save space and speed things up, for example D for data or V for value

Column family attribute configuration
  • HFile data block size: the default is 64KB. The block size affects the size of the data block index. Larger blocks mean more data is loaded into memory per read, which helps sequential scans; smaller blocks make random reads perform better
> create 'mytable',{NAME => 'cf1', BLOCKSIZE => '65536'}
  • Data block caching: block caching is enabled by default, and can be turned off for data that is rarely read
> create 'mytable',{NAME => 'cf1', BLOCKCACHE => 'FALSE'}
  • Data compression: compression saves disk space but increases CPU load
> create 'mytable',{NAME => 'cf1', COMPRESSION => 'SNAPPY'}

Hbase table design is driven by your requirements, but sticking to these design guidelines still helps performance. The points above summarize the key considerations.

Java API operation

Hbase can be reached through various clients, such as REST, Thrift, and the ORM framework Kundera. Hbase also provides a Java API for operating on tables and column families; the hbase shell is a wrapper around that Java API.

The Hbase Java API provides the following advanced features:

  • Metadata management, for example column family compression and region splitting
  • Creating, deleting, updating, and reading data by row key

Let's go straight to the code; that makes it easier to understand.

Environment

  • Hbase 0.98
  • Java 1.8
  • Zookeeper 3.4.6
  • Mac OS

Example

Mismatched Hbase client and server versions can cause problems, so use the same version wherever possible. The server in this test runs Hbase 0.98, so the client is 0.98 as well. And because Hadoop 2.x is a major improvement over 1.x, use the hadoop2 build of the Hbase client.

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>0.98.24-hadoop2</version>
        </dependency>
Establishing a connection
  1. You can create an HTable directly, for example new HTable(conf, "tableName"). However, every HTable instantiation queries the .META. table to check whether the table exists, which makes creating instances slow

  2. HTablePool is used much like HTable, but it maintains a pool of connections, so it does not create a brand-new HTable for every request.

More detailed configuration can be supplied when creating an HTable or HTablePool, as sketched below.

HTablePool hTablePool = new HTablePool();
hTablePool.getTable("user");
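A hedged sketch of passing explicit configuration, for instance pointing the client at a specific ZooKeeper ensemble; the host name zk-host and the pool size are placeholders:

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "zk-host");               // where the cluster's ZooKeeper lives
conf.set("hbase.zookeeper.property.clientPort", "2181");
HTablePool pool = new HTablePool(conf, 10);                   // at most 10 cached references per table
HTableInterface user = pool.getTable("user");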
CRUD operations

The row key uniquely identifies a row in an Hbase table and, combined with the column family, column qualifier, and timestamp, locates data within it. The Java API provides the following classes for Hbase CRUD:

  • Put
  • Get
  • Delete
  • Scan
  • Increment

We discuss a few classes in detail, and the rest can be inferred from one example.

Writing data

When a write request is received, the data is by default written synchronously to both the HLog (WAL) and the MemStore; writing to both places guarantees durability. The MemStore is eventually persisted to an HFile on disk, and each MemStore flush creates a new HFile.

The Put class is used to store data in an Hbase table. When storing data, a Put instance must be given a row key.

After the Put instance is created, add data to it


    public void put() throws IOException {
        // Get the default configuration
        Configuration conf = HBaseConfiguration.create();
        // Obtain the Table instance
        HTable table = new HTable(conf, "tab1");
        // Create a Put instance and specify a rowKey
        Put put = new Put(Bytes.toBytes("row-1"));
        // Add a column with the value "Hello" in the "cf1:greet" column
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("greet"), Bytes.toBytes("Hello"));
        // Add a column with the value "John" in the column "cf1:person"
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("person"), Bytes.toBytes("John"));
        table.put(put); 
        table.close();
    }

Data can also be inserted in batches:
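For example, a hedged sketch of a batch insert, continuing the table and column names used above:

List<Put> puts = new ArrayList<Put>();
Put p1 = new Put(Bytes.toBytes("row-2"));
p1.add(Bytes.toBytes("cf1"), Bytes.toBytes("greet"), Bytes.toBytes("Hi"));
Put p2 = new Put(Bytes.toBytes("row-3"));
p2.add(Bytes.toBytes("cf1"), Bytes.toBytes("greet"), Bytes.toBytes("Hey"));
puts.add(p1);
puts.add(p2);
table.put(puts);   // one call submits the whole batch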

// The HTable object also accepts a list of puts: table.put(List<Put> puts).

Execution result:

Reading data

Hbase reads go through an LRU block cache. Data is read from an HTable object as follows.

A Get instance is constructed much like a Put in that it specifies a rowkey.

If you want to look for specific cells, that is, data for specific columns, you can use additional methods for more fine-tuning.

Consider the following case code:

public void get() throws IOException {
        // Get the default configuration
        Configuration conf = HBaseConfiguration.create();
        // Obtain the Table instance
        HTable table = new HTable(conf, "tab1");
        // Create a Get instance and specify a rowKey
        Get get = new Get(Bytes.toBytes("row-1"));
        // Request only the "cf1:greet" column
        get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("greet"));
        // Execute the Get and read the value back
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("greet"));
        System.out.println("Value obtained: " + new String(value));
        table.close();
    }

The execution result

Updating data

Updating data is basically the same as writing it: build a Put for the same row, set a different value on the same column, and the new value is written as the latest version.

The code is as follows:

public void update() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Obtain the Table instance
        HTable table = new HTable(conf, "tab1");
        // Create a Put instance and specify a rowKey
        Put put = new Put(Bytes.toBytes("row-1"));
        // Set the "cf1:greet" column to the new value "Good Morning"
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("greet"), Bytes.toBytes("Good Morning"));
        // The "cf1:person" column is left untouched this time
// put.add(Bytes.toBytes("cf1"), Bytes.toBytes("person"), Bytes.toBytes("John"));
        table.put(put);
        table.close();
    }

Execution result:

Deleting data

A Delete only marks the data as deleted (a tombstone) rather than removing it immediately; it is a logical delete first. The data is physically removed later, when HFiles are compacted and the marked records are dropped.

Delete objects are also similar to Put and Get

Constructing a Delete instance

If you want to be more specific, you can target particular columns, and so on.

Look at the following case code:

 public void delete() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Obtain the Table instance
        HTable table = new HTable(conf, "tab1");
        // Create a Delete instance and specify the rowKey
        Delete delete = new Delete(Bytes.toBytes("row-1"));
        // delete column "cf1:greet"
        delete.deleteColumn(Bytes.toBytes("cf1"), Bytes.toBytes("greet"));
        
        table.delete(delete);
        table.close();
    }

Execution result (the delete was executed twice in a row):

Operation optimization

Once a system goes live, development and tuning continue throughout its life, and HBase is no exception. This section focuses on Hbase tuning.

Hbase Query Optimization

As a NoSQL database, Hbase's most basic operations are create, read, update, and delete, and among them queries are the most common.

Setting the Scan Cache

Scan queries in HBase can call setCaching(), which reduces the number of round trips between client and server and improves Scan performance.


   /**
   * Set the number of rows for caching that will be passed to scanners.
   * If not set, the default setting from {@link HTable#getScannerCaching()} will apply.
   * Higher caching values will enable faster scanners but will use more memory.
   * @param caching the number of rows for caching
   */
  public void setCaching(int caching) {
    this.caching = caching;
  }
Specify the columns to return

When using Scan or Get to fetch a large number of rows, it is best to specify only the columns you need, because the server transfers everything requested over the network to the client, and that volume can become a bottleneck. Filtering out unneeded columns greatly reduces network I/O.

  /**
   * Get all columns from the specified family.
   * <p>
   * Overrides previous calls to addColumn for this family.
   * @param family family name
   * @return this
   */
  public Scan addFamily(byte [] family) {
    familyMap.remove(family);
    familyMap.put(family, null);
    return this;
  }

  /**
   * Get the column from the specified family with the specified qualifier.
   * <p>
   * Overrides previous calls to addFamily for this family.
   * @param family family name
   * @param qualifier column qualifier
   * @return this
   */
  public Scan addColumn(byte [] family, byte [] qualifier) {
    NavigableSet<byte []> set = familyMap.get(family);
    if(set == null) {
      set = new TreeSet<byte []>(Bytes.BYTES_COMPARATOR);
    }
    if (qualifier == null) {
      qualifier = HConstants.EMPTY_BYTE_ARRAY;
    }
    set.add(qualifier);
    familyMap.put(family, set);
    return this;
  }

In general, call scan.addColumn(...) to request only the columns you need.

Close the ResultScanner

If you forget to close the ResultScanner returned by table.getScanner, it keeps its connection to the server and its resources are never released, leaving some server-side resources tied up.

So when you are done, close it, just as you would close a JDBC connection to MySQL.

scanner.close()
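A minimal sketch of the safe pattern, closing the scanner in a finally block so it is released even if iteration throws:

ResultScanner scanner = table.getScanner(scan);
try {
    for (Result r : scanner) {
        // process each Result here
    }
} finally {
    scanner.close();   // always release the server-side scanner resources
}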

Disable block caching

For batch full-table scans, block caching is enabled by default; caching every block the scan reads churns the cache and reduces scanning efficiency, so turn it off in that case.

scan.setCacheBlocks(true|false);

For frequently read data, you are advised to use the default value and enable block caching

Caching query Results

If HBase is queried frequently, you can put a caching layer between the application and HBase: new queries check the cache first and only go to HBase on a cache miss.
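A minimal sketch of the idea, using an in-process map as a stand-in for a real cache; the cache choice, key format, and (absent) invalidation policy are assumptions, not something Hbase provides:

private final Map<String, Result> cache = new ConcurrentHashMap<String, Result>();

public Result cachedGet(HTable table, String rowKey) throws IOException {
    Result cached = cache.get(rowKey);
    if (cached != null) {
        return cached;                                   // cache hit: skip Hbase entirely
    }
    Result result = table.get(new Get(Bytes.toBytes(rowKey)));
    cache.put(rowKey, result);                           // cache miss: query Hbase, then remember the result
    return result;
}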

Write optimization

Writes are a common Hbase workload, and write performance is one of Hbase's strengths among NoSQL stores. The following describes how to optimize writes.

Disable WAL log writing

To keep the system highly available, the WAL is enabled by default; it exists mainly for disaster recovery. If an application can tolerate some risk of data loss, it can skip the WAL when writing.

Risk: if the RegionServer goes down, writes that were not yet flushed are lost and cannot be recovered.
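A hedged sketch of skipping the WAL for an individual Put with the 0.98 client (weigh the risk above before doing this):

Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("greet"), Bytes.toBytes("Hello"));
put.setDurability(Durability.SKIP_WAL);   // this mutation is not written to the WAL
table.put(put);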

Set AutoFlush

The default is true: every put the client receives is sent to the server immediately. If it is set to false, the client buffers puts locally and only submits them to the RegionServer when the write buffer threshold is reached or HTable.flushCommits() is called.

Risk: if the client crashes before the buffered requests are sent to the RegionServer, that data is lost.

        table.setAutoFlush(false);
        table.setWriteBufferSize( 12 * 1024 * 1024 );
Create a Region in advance

A new table generally starts with a single Region, so all data inserted into the table lands in that one Region until it reaches a size threshold and splits. This means that early on, all writes to the table hit one server, putting heavy pressure on it while the rest of the cluster's resources sit idle.

It is recommended to pre-create (pre-split) Regions when creating the table, for example with the RegionSplitter utility that Hbase provides; a sketch using the admin API follows.
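Besides the RegionSplitter command-line tool, a table can be pre-split through the admin API at creation time. A hedged sketch; the table name, column family, and split keys are made up for illustration:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("metrics"));
desc.addFamily(new HColumnDescriptor("d"));
// Three split points produce four initial regions, so early writes spread across servers
byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("2"), Bytes.toBytes("5"), Bytes.toBytes("8")
};
admin.createTable(desc, splitKeys);
admin.close();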

Delay log Flush

By default, WAL edits are written out immediately. With deferred log flush, edits are kept in memory and synced to HDFS on an interval that defaults to 1s; the interval is controlled by the parameter below.

hbase.regionserver.optionallogflushinterval

A larger value, such as 5s, keeps edits in memory longer before the RegionServer periodically flushes them.
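As a related, hedged sketch: in the 0.98 API a table can also opt into asynchronous WAL flushing through its descriptor, deferring syncs to that interval. The table name is a placeholder:

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
TableName name = TableName.valueOf("user");
HTableDescriptor desc = admin.getTableDescriptor(name);
desc.setDurability(Durability.ASYNC_WAL);   // WAL edits for this table are synced asynchronously
// Depending on version and configuration, the table may need to be disabled before modifyTable.
admin.modifyTable(name, desc);
admin.close();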

Important parameters of Scan

Scan is a very common Hbase operation. The Java API section above only touched on Scan briefly; because it is used so often, it is worth covering in detail.

Scan

Data tables in HBase are divided into regions for data sharding. Each Region is associated with a RowKey range. Data in each Region is organized according to the lexicographical order of Rowkeys.

Based on this design, HBase can easily handle the “specify a RowKey range and obtain all records within that range” query, which is called Scan in HBase.

  1. Build the Scan and set startRow and stopRow; if they are not set, a full-table scan is performed
  2. Obtain the ResultScanner
  3. Iterate over the query results
  4. Close the ResultScanner

 public void stringFilter() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Obtain the Table instance
        HTable table = new HTable(conf, "user");

        // Build the Scan
        Scan scan = new Scan();
        scan = scan.setStartRow(Bytes.toBytes("startRowxxx")).setStopRow(Bytes.toBytes("StopRowxxx"));
        RowFilter filter = new RowFilter(
                CompareFilter.CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("224382618261914241")));
        scan.setFilter(filter);

        // Obtain the ResultScanner
        ResultScanner scanner = table.getScanner(scan);
        Result result = null;

        // Process the results
        while ((result = scanner.next()) != null) {
            byte[] value = result.getValue(Bytes.toBytes("ship"), Bytes.toBytes("addr"));
            if (value == null || value.length == 0) {
                continue;
            }
            System.out.println(
                    new String(value)
            );
            System.out.println("hello World");
        }

        // Close the ResultScanner
        scanner.close();
        table.close();
    }

Other Settings

Caching: Specifies the number of Results that can be read in a batch by an RPC request

The following example code sets the number of Results to be read back at one time to 100:

scan.setCaching(100);

Each time the Client sends a SCAN request to the RegionServer, a batch of data (the number of Results to be retrieved each time is determined by Caching) will be retrieved and stored in the Result Cache.

Each time the application reads data, it obtains the data from the local Result Cache. If the data in the Result Cache is finished, the Client sends a SCAN request to RegionServer to obtain more data.

Batch: Sets the number of columns in each Result

The following example code sets a limit of 3 on the number of columns in each Result:

scan.setBatch(3);

This parameter is applicable to scenarios where a row of data is too large, so that the requested column of a row of data is split into multiple Results and returned to the Client.

The following is an example:

Suppose a row has 10 columns: {Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09, Col10} Suppose Batch is set to 3 in Scan, then the row will be split into 4 Results and return:

Result1 -> {Col01, Col02, Col03}
Result2 -> {Col04, Col05, Col06}
Result3 -> {Col07, Col08, Col09}
Result4 -> {Col10}

For the Caching parameter, we said it is the number of Results the Client fetches from the RegionServer per request. In the example above, one row was split into four Results, so it decrements the Caching counter four times. Combining Caching and Batch, consider a more complex example:

For example, the Scan parameters are set as follows:

final byte[] start = Bytes.toBytes("Row1");
final byte[] stop = Bytes.toBytes("Row5");
Scan scan = new Scan();
scan.withStartRow(start).withStopRow(stop);
scan.setCaching(10);
scan.setBatch(3);

The RowKey of the data to be read and the column set associated with it are as follows:

Row1: {Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09, Col10}
Row2: {Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09, Col10, Col11}
Row3: {Col01, Col02, Col03, Col04, Col05, Col06, Col07, Col08, Col09, Col10}

Review the definitions of Caching and Batch:

Caching: Affects the number of Results returned during a read.

Batch: specifies the number of columns that can be contained in a Result. If the number of columns in a row exceeds the Batch limit, the row will be split into multiple Results.

The result set returned by the Client on its first request to RegionServer looks like this:

Result1 -> Row1: {Col01, Col02, Col03}
Result2 -> Row1: {Col04, Col05, Col06}
Result3 -> Row1: {Col07, Col08, Col09}
Result4 -> Row1: {Col10}
Result5 -> Row2: {Col01, Col02, Col03}
Result6 -> Row2: {Col04, Col05, Col06}
Result7 -> Row2: {Col07, Col08, Col09}
Result8 -> Row2: {Col10, Col11}
Result9 -> Row3: {Col01, Col02, Col03}
Result10 -> Row3: {Col04, Col05, Col06}

Limit: limits the number of rows obtained by a Scan operation

Similar to the LIMIT clause in SQL syntax to limit the total number of rows obtained by a single Scan operation:

scan.setLimit(10000);

Note: The Limit argument is new in version 2.0. However, in version 2.0.0, there seems to be a BUG when Batch and Limit are set at the same time. Preliminary analysis suggests that the cause of the problem is related to the logical processing of the numberOfCompletedRows counter in BatchScanResultCache. Therefore, you are not advised to set the two parameters at the same time.

CacheBlock: Indicates whether the RegionServer needs to cache HFileBlocks involved in the Scan

scan.setCacheBlocks(true);

Raw Scan: whether delete markers and data that has been deleted but not yet cleaned up can be read

scan.setRaw(true);

MaxResultSize: Limits the result set returned by a Scan from the memory usage dimension

The following example code sets the maximum value of the returned result set to 5MB:

scan.setMaxResultSize(5 * 1024 * 1024);

Reversed Scan: indicates the Reversed Scan

Normal Scan reads in lexicographical order from small to large, whereas Reversed Scan does the opposite:

scan.setReversed(true);

Scan with a Filter

A Filter applies additional conditions to the records returned by a Scan. The conditions can involve row keys, column names, or column values, and multiple Filter conditions can be combined.

The most common Filter is SingleColumnValueFilter, based on which queries like the following can be implemented:

Return all rows that satisfy the condition {column I:D has a value greater than or equal to 10}.

The example code is as follows:
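A hedged sketch of such a query using SingleColumnValueFilter; the family I, qualifier D, and the byte encoding of the value are assumptions made for illustration:

Scan scan = new Scan();
// Match rows where column I:D >= 10; the comparison is on raw bytes here,
// so this assumes the values were written with Bytes.toBytes(long) and are non-negative.
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("I"), Bytes.toBytes("D"),
        CompareFilter.CompareOp.GREATER_OR_EQUAL,
        new BinaryComparator(Bytes.toBytes(10L)));
filter.setFilterIfMissing(true);   // skip rows that do not have column I:D at all
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);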

Filters enrich HBase's query capability, but before using one, note that query response latency can become unpredictable: you cannot know in advance how much data must be scanned before a record matching the condition is found. The problem can be kept under control by effectively limiting the Scan range (setting StartRow and StopRow). In other words, understand your business data model in detail before reaching for a Filter.

Finally

This article ran a bit long; keep it around for reference.

References

  • HBase enterprise development in practice
  • HBase Reference Documents
  • HBase Read process: Simple HBase Start Tutorial 4
  • Official Hbase Documents
  • Hbase Shell Command
  • Hbase shell tutorial
  • HBase High-performance random query – HFile principle parsing