Detailed HBase Underlying Principles (In depth, recommended to save)

Search the official account “Five minutes to Learn Big data” to delve deeply into big data technology

Introduction of HBase

HBase is a distributed, column-oriented, open source database built on HDFS. Its name is short for Hadoop Database. HBase relies on the Hadoop cluster for its computing and storage capabilities.

It sits somewhere between NoSQL and an RDBMS. Data can be retrieved only by row key or by a range of row keys, and only single-row transactions are supported (complex operations such as multi-table joins can be implemented with the help of Hive).

Features of HBase tables:

Large: A table can have more than a billion rows and millions of columns.
Column-oriented: Storage and permission control are organized by column (family), and each column family is retrieved independently.
Sparse: Null columns take up no storage space, so tables can be designed to be very sparse.

Basic HBase Principles

System architecture

This section describes HBase components based on the figure

Client

The Client maintains some caches, such as region location information, to speed up access to HBase.

Zookeeper

HBase can use its built-in ZooKeeper or an external ZooKeeper. In production environments, an external ZooKeeper is generally used for the sake of uniformity.

Functions of ZooKeeper in HBase:

Ensures that there is only one active master in the cluster at any time
Stores the addressing entry for all regions
Monitors the status of the Region Servers in real time and notifies the Master when a Region Server goes online or offline

HMaster

Assigns regions to Region Servers
Balances load across Region Servers
Detects failed Region Servers and reassigns their regions
Garbage-collects files on HDFS
Processes schema update requests

HRegion Server

The HRegion Server maintains the regions assigned to it by the HMaster and processes I/O requests to these regions
The HRegion Server divides regions that become too large during operation

As shown in the figure, the HMaster does not need to be involved when the Client accesses HBase data: addressing goes through ZooKeeper and the HRegion Server, and data reads and writes go to the HRegion Server.

The HMaster only maintains the metadata of tables and HRegions, so its load is low.

HBase table data model

Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

By a single row key
By a range of row keys
By a full table scan

A row key can be any string (the maximum length is 64 KB; in practice it is usually 10 to 100 bytes). Internally, HBase stores the row key as a byte array.

HBase sorts the data in a table by row key, in dictionary order.

Data is stored in the byte order of the row key. When designing keys, take full advantage of this sorted storage and place rows that are often read together next to each other (locality).

Note: sorting integers in dictionary order gives 1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, … To preserve the natural order of integers, row keys must be left-padded with zeros.
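The effect of zero padding can be checked with a small sketch. It uses plain Python byte strings rather than an HBase client, and the helper name padded_row_key and the width of 5 digits are assumptions chosen for illustration.

```python
def padded_row_key(n: int, width: int = 5) -> bytes:
    """Left-pad a non-negative integer with zeros so byte order matches numeric order."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(n).zfill(width).encode("ascii")


ids = [1, 2, 10, 11, 20, 100]

# Unpadded keys sort in dictionary (byte) order, not numeric order.
unpadded = sorted(str(n).encode("ascii") for n in ids)

# Zero-padded keys sort in the same order as the integers themselves.
padded = sorted(padded_row_key(n) for n in ids)

print(unpadded)
print(padded)
```

The width has to be chosen large enough for the largest expected value, because a key with more digits than the width would sort out of place again.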

A row read or write is an atomic operation (no matter how many columns are read or written at once). This design decision makes it easy for the user to understand the behavior of the program when performing concurrent update operations on the same row.

Column Family

Each column in an HBase table belongs to a column family. Column families are part of a table’s schema (columns are not) and must be defined before using the table.

Column names are prefixed with column families. For example, courses:history, courses:math belong to the courses column family.

Access control and disk and memory usage accounting are all performed at the column family level. The more column families a table has, the more files are involved in I/O and searching when reading a row, so avoid defining more column families than necessary.

Column

A column is a specific column under a column family, similar to a column created in a MySQL table.

Timestamp

In HBase, a storage unit identified by a row and a column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when data is written, in which case it is the current system time in milliseconds, or it can be assigned explicitly by the client. An application that needs to avoid version conflicts must generate its own unique timestamps. Within each cell, versions are sorted in reverse chronological order, so the latest data comes first.

To avoid the management burden (including storage and indexing) caused by too many data versions, HBase provides two ways to reclaim old versions:

Keep the last n versions of the data
Keep the versions written within a recent period of time (by setting the data’s time to live, TTL)

Users can configure this for each column family.
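The two reclamation rules can be sketched on a single cell. This is an illustration in plain Python, not HBase code: the function retain_versions, its parameters, and the sample timestamps are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


def retain_versions(
    versions: Dict[int, bytes],
    max_versions: Optional[int] = None,
    ttl_ms: Optional[int] = None,
    now_ms: int = 0,
) -> List[Tuple[int, bytes]]:
    """Return the (timestamp, value) pairs that survive, newest first."""
    # Versions are ordered in reverse chronological order, as in a cell.
    ordered = sorted(versions.items(), key=lambda item: item[0], reverse=True)
    if ttl_ms is not None:
        # Rule 2: keep only versions written within the last ttl_ms milliseconds.
        ordered = [(ts, v) for ts, v in ordered if now_ms - ts <= ttl_ms]
    if max_versions is not None:
        # Rule 1: keep only the newest max_versions versions.
        ordered = ordered[:max_versions]
    return ordered


# Hypothetical cell with four versions keyed by millisecond timestamp.
cell = {1000: b"v1", 2000: b"v2", 3000: b"v3", 4000: b"v4"}

last_two = retain_versions(cell, max_versions=2)
recent = retain_versions(cell, ttl_ms=1000, now_ms=4500)
```

Here last_two keeps the versions at 4000 and 3000, while recent keeps only the version at 4000, because it is the only one written within 1000 ms of the assumed current time.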

Cell

A cell is uniquely identified by {row key, column (= <family> + <label>), version}. The data in a cell is untyped and stored as raw bytes.

Version number (VersionNum)

The version number of the data. Each piece of data can have multiple versions. The default value is the system timestamp and the type is Long.

Physical storage

1. Overall structure

All rows in a Table are arranged in lexicographical order by Row Key.
The Table is divided into multiple HRegions along the row direction.
HRegions are split by size (10 GB by default). Each table starts with a single HRegion. As data is inserted into the table, the HRegion grows, and as the number of rows in the table increases, there are more and more HRegions.
HRegion is the smallest unit of distributed storage and load balancing in HBase. Smallest unit means that different HRegions can be placed on different HRegion Servers, but a single HRegion is never split across multiple servers.
Although HRegion is the smallest unit of load balancing, it is not the smallest unit of physical storage.

In fact, an HRegion consists of one or more Stores, each of which holds one column family. Each Store in turn consists of one MemStore and zero or more StoreFiles, as shown in the figure above.

2. StoreFile and HFile structures

StoreFile is stored in HFile format on HDFS.

The format of HFile is:

First of all, HFile files are of variable length; only two parts have a fixed length: Trailer and File Info. As shown in the figure, the Trailer holds pointers to the starting points of the other data blocks.

File Info records meta information about the file, such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY, and so on.

The Data Index and Meta Index blocks record the starting point of each Data block and Meta block.

Data blocks are the basic unit of HBase I/O. To improve efficiency, the HRegionServer uses an LRU-based Block Cache. The size of each Data block can be specified when creating a table. Large blocks favor sequential scans, while small blocks favor random lookups. Each Data block consists of a Magic header followed by a sequence of KeyValue pairs. The Magic content is a random number whose purpose is to detect data corruption.

Each KeyValue pair in an HFile is a simple byte array, but that byte array contains many fields and has a fixed structure. Its layout is as follows:

It starts with two fixed-length values that give the length of the Key and the length of the Value. The Key follows: a fixed-length value giving the length of the RowKey, then the RowKey, then a fixed-length value giving the length of the Family, then the Family, then the Qualifier, and finally two fixed-length values for the Time Stamp and the Key Type (Put/Delete). The Value part has no such structure and is pure binary data.
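The layout just described can be made concrete with a short encoder and decoder. This is a sketch, not HBase's own serialization code: the field widths (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the type codes are assumptions modeled on the commonly documented KeyValue format, and the function names are made up for illustration.

```python
import struct

PUT, DELETE = 4, 8  # key type codes, assumed here for illustration


def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    """Serialize one KeyValue as: key length, value length, key, value."""
    key = (
        struct.pack(">H", len(row)) + row          # row length + row
        + struct.pack(">B", len(family)) + family  # family length + family
        + qualifier                                # qualifier has no length prefix
        + struct.pack(">qB", timestamp, key_type)  # timestamp + key type
    )
    return struct.pack(">ii", len(key), len(value)) + key + value


def decode_keyvalue(buf):
    """Parse the byte array produced by encode_keyvalue."""
    key_len, value_len = struct.unpack_from(">ii", buf, 0)
    pos = 8
    key_end = pos + key_len
    (row_len,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    row = buf[pos:pos + row_len]
    pos += row_len
    (family_len,) = struct.unpack_from(">B", buf, pos)
    pos += 1
    family = buf[pos:pos + family_len]
    pos += family_len
    # The qualifier fills whatever is left before the 9 trailing key bytes.
    qualifier = buf[pos:key_end - 9]
    timestamp, key_type = struct.unpack_from(">qB", buf, key_end - 9)
    value = buf[key_end:key_end + value_len]
    return row, family, qualifier, timestamp, key_type, value


encoded = encode_keyvalue(b"row1", b"courses", b"math", 1700000000000, PUT, b"95")
decoded = decode_keyvalue(encoded)
```

Because the qualifier carries no length of its own, its size is derived from the total key length, which is why the key length has to come first.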

HFile is divided into six parts:

Data Block – Holds the data in the table. This part can be compressed.
Meta Block (optional) – Holds user-defined key-value pairs. This part can be compressed.
File Info – Meta information about the HFile. It is not compressed. Users can also add their own meta information in this section.
Data Block Index – Index of the Data blocks. The key of each index entry is the key of the first record in the indexed block.
Meta Block Index (optional) – Index of the Meta blocks.
Trailer – This section has a fixed length and stores the offset of each segment. When an HFile is read, the Trailer is read first. It gives the starting location of each segment (the Magic Number of each segment is used as a sanity check), and the Data Block Index is then read into memory. When looking up a key, there is no need to scan the entire HFile: the block containing the key is found in memory, that whole block is read into memory with one disk I/O, and the key is then located inside it. The Data Block Index is evicted using an LRU policy.
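The lookup through the in-memory Data Block Index amounts to a binary search over the first keys of the blocks. The following sketch assumes a hypothetical index with four blocks; the keys, offsets, and the function name find_block are illustrative only.

```python
import bisect

# Hypothetical in-memory Data Block Index: the first key of each block
# and the file offset where that block starts.
first_keys = [b"a", b"g", b"n", b"t"]
offsets = [0, 65536, 131072, 196608]


def find_block(key: bytes):
    """Return the offset of the only block that can contain key, or None."""
    # The candidate block is the last one whose first key is <= key.
    i = bisect.bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first key in the file
    return offsets[i]


offset_for_h = find_block(b"h")
```

Only the block at the returned offset then needs to be read from disk and searched, which is the saving the Trailer description refers to.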

The Data blocks and Meta blocks of an HFile are usually stored compressed, which greatly reduces network I/O and disk I/O. The trade-off is the CPU cost of compression and decompression. Currently, HFile can be compressed in two ways: Gzip and Lzo.

3. MemStore and StoreFile

An HRegion consists of multiple Stores. Each Store holds all the data of one column family, including a MemStore in memory and StoreFiles on disk.

A write goes to the MemStore first. When the amount of data in the MemStore reaches a certain threshold, the HRegionServer starts a flush (flushcache) and writes the data out as a StoreFile. Each flush creates a separate StoreFile.

When the size of a StoreFile exceeds a certain threshold, the current HRegion is split into two, and the HMaster assigns the new HRegions to the appropriate HRegion Servers for load balancing.

When a client reads data, it looks in the MemStore first and then in the StoreFiles.

4. HLog (WAL log)

WAL stands for write-ahead log. Like MySQL’s binlog, it is used for disaster recovery. The HLog records all data changes, so once a change has been logged it can be recovered from the log.

Each Region Server maintains one HLog, not one HLog per region. As a result, the logs of different regions (from different tables) are mixed together. The benefit is that continuously appending to a single file reduces the number of disk seeks and therefore improves write performance. The drawback is that if a Region Server goes offline, its log must be split by region and the pieces sent to other Region Servers before the regions can be recovered there.
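Splitting a shared log during recovery can be sketched as grouping entries by region while preserving their order. The entry layout (sequence number, region name, row key, value), the region names, and the function split_log are assumptions made for this illustration, not the HLog file format.

```python
from collections import defaultdict

# Hypothetical WAL of one Region Server: entries from different regions
# (and different tables) are interleaved in a single file.
wal = [
    (1, "users,a", b"alice", b"1"),
    (2, "orders,0", b"0001", b"x"),
    (3, "users,a", b"bob", b"2"),
    (4, "orders,0", b"0002", b"y"),
]


def split_log(entries):
    """Group log entries by region, keeping sequence-number order within each."""
    per_region = defaultdict(list)
    for entry in sorted(entries):  # tuples sort by sequence number first
        per_region[entry[1]].append(entry)
    return dict(per_region)


split = split_log(wal)
```

Each per-region list can then be handed to whichever Region Server takes over that region and replayed in order.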

The HLog File is a normal Hadoop Sequence File:

The key of an HLog Sequence File is an HLogKey object, which records where the written data belongs. In addition to the table and region names, the HLogKey contains a sequence number and a timestamp. The timestamp is the write time; the sequence number starts from 0, or from the last sequence number stored in the file system.
The value of an HLog Sequence File is an HBase KeyValue object, that is, the same KeyValue found in HFile, described earlier.

Reading and writing process

1. Read request process:

Both the meta table and the table data are stored on HRegionServers. To access table data, the Client first contacts ZooKeeper and obtains the location of the meta table, that is, which HRegionServer hosts it.

The Client then connects to that HRegionServer using its IP address and reads the meta table to obtain the metadata stored in it.

Based on that metadata, the Client contacts the HRegionServer that hosts the target region, which scans its MemStore and StoreFiles to find the data.

Finally, the HRegionServer returns the data to the Client.
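The addressing step, together with the client-side location cache mentioned earlier, can be sketched as follows. This is a simulation in plain Python rather than the HBase client: the Client class, the META_ROWS contents, and the server names are hypothetical, and the meta table is reduced to a sorted list of (region start key, server) pairs.

```python
import bisect

# Hypothetical meta rows for one table: (region start key, hosting server).
META_ROWS = [(b"", "rs1.example"), (b"g", "rs2.example"), (b"p", "rs3.example")]


class Client:
    def __init__(self, meta_rows):
        self._meta_rows = sorted(meta_rows)  # stands in for the remote meta table
        self._cache = []                     # cached (start_key, end_key, server)
        self.meta_reads = 0                  # how often the meta table was consulted

    def _read_meta(self, row_key):
        self.meta_reads += 1
        starts = [start for start, _ in self._meta_rows]
        # The hosting region is the last one whose start key is <= row_key.
        i = bisect.bisect_right(starts, row_key) - 1
        start, server = self._meta_rows[i]
        end = starts[i + 1] if i + 1 < len(starts) else None
        return start, end, server

    def locate(self, row_key):
        """Return the server for row_key, using the cache when possible."""
        for start, end, server in self._cache:
            if start <= row_key and (end is None or row_key < end):
                return server
        region = self._read_meta(row_key)
        self._cache.append(region)
        return region[2]


client = Client(META_ROWS)
first = client.locate(b"hello")   # cache miss: reads the meta table
second = client.locate(b"kiwi")   # same region: served from the cache
third = client.locate(b"zebra")   # different region: reads the meta table again
```

After the first lookup for a region, later requests for keys in the same range skip the meta table entirely, which is the purpose of the client cache.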

View meta table information

hbase(main):011:0> scan 'hbase:meta'

2. Write request process:

The Client also accesses ZooKeeper first, finds the Meta table, and obtains Meta table metadata.

The Client determines the HRegion and HRegionServer that correspond to the data to be written.

The Client sends a write request to that HRegionServer, which receives the request and responds.

The HRegionServer first writes the data to the HLog to prevent data loss.

The data is then written to the MemStore.

If both the HLog write and the MemStore write succeed, the write is considered successful.

If the MemStore reaches its threshold, the MemStore data is flushed to a StoreFile.

When the number of StoreFiles grows, a compaction is triggered that merges the many StoreFiles into one large StoreFile.

As StoreFiles grow, the Region grows too. When the threshold is reached, a split is triggered and the Region is split.
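The write steps above (log first, then memory, then flush and compaction) can be sketched with a toy store. This is a simplified model, not HBase code: the class LittleStore, its thresholds counted in number of rows and files, and the in-memory lists standing in for the HLog and StoreFiles are all assumptions.

```python
class LittleStore:
    """Toy model of one Store: WAL, MemStore, and immutable StoreFiles."""

    def __init__(self, flush_threshold=2, compaction_threshold=2):
        self.wal = []          # append-only log of (row, value)
        self.memstore = {}     # in-memory buffer of the latest writes
        self.storefiles = []   # sorted, immutable lists of (row, value), oldest first
        self.flush_threshold = flush_threshold
        self.compaction_threshold = compaction_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. write to the log first
        self.memstore[row] = value      # 2. then write to memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Each flush writes the sorted MemStore out as one new StoreFile.
        self.storefiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        if len(self.storefiles) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        # Merge oldest to newest so that newer values overwrite older ones.
        merged = {}
        for storefile in self.storefiles:
            merged.update(storefile)
        self.storefiles = [sorted(merged.items())]

    def get(self, row):
        # Reads check the MemStore first, then StoreFiles from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):
            for key, value in storefile:
                if key == row:
                    return value
        return None


store = LittleStore()
store.put(b"a", b"1")
store.put(b"b", b"2")   # MemStore reaches 2 rows: first flush
store.put(b"a", b"3")
store.put(b"c", b"4")   # second flush, which triggers a compaction
```

After these four writes the store holds a single merged StoreFile, an empty MemStore, and four log entries.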

Details:

HBase uses the MemStore and StoreFiles to store table updates. On update, data is first written to the log (WAL) and to memory (MemStore). Data in the MemStore is sorted. When the MemStore grows to a certain threshold, a new MemStore is created and the old one is added to the flush queue, where a separate thread flushes it to disk as a StoreFile. At the same time, a redo point is recorded in ZooKeeper, indicating that changes made before this point have been persisted. If the system fails unexpectedly, data in the MemStore may be lost; in that case, the WAL is used to recover the data written after the checkpoint.

StoreFiles are read-only and cannot be modified once created, so HBase updates are really append operations. When the number of StoreFiles in a Store reaches a certain threshold, a compaction (minor_compact or major_compact) is performed, merging the changes to the same key into one large StoreFile. When the size of a StoreFile reaches a certain threshold, the StoreFile is split in two.

Because updates are only ever appended, serving a read requires accessing all the StoreFiles and the MemStore in the Store and merging them by row key. Since the StoreFiles and the MemStore are sorted and StoreFiles have in-memory indexes, the merge is still fairly fast.

HRegion management

HRegion distribution

An HRegion can be assigned to only one HRegion Server at any time. The HMaster keeps track of which HRegion Servers are available, which HRegions are assigned to which HRegion Servers, and which HRegions are still unassigned. When a new HRegion needs to be assigned and an HRegion Server has capacity available, the HMaster sends that HRegion Server a load request to assign it the HRegion. After receiving the request, the HRegion Server starts serving the HRegion.

HRegion Server online

The HMaster uses ZooKeeper to track the status of HRegion Servers. When an HRegion Server starts, it creates a znode representing itself under the server directory in ZooKeeper. Because the HMaster subscribes to change notifications on the server directory, ZooKeeper notifies it in real time whenever a node is added to or deleted from that directory. As a result, the HMaster learns about a new HRegion Server as soon as it comes online.

HRegion Server offline

When an HRegion Server goes offline, its session with ZooKeeper is closed and ZooKeeper automatically releases the exclusive lock on the node representing that server. The HMaster can then conclude that one of the following has happened:

The network between the HRegion Server and ZooKeeper is disconnected.
The HRegion Server is down.

In either case, the HRegion Server can no longer serve its HRegions. The HMaster then deletes the znode representing the HRegion Server from the server directory and assigns its HRegions to other nodes that are still alive.

Working mechanism of HMaster

Master online

When the master starts, it performs the following steps:

Obtains from ZooKeeper the unique lock that represents the active master, which prevents other HMasters from becoming the master.
Scans the server parent node in ZooKeeper to obtain the list of currently available HRegion Servers.
Communicates with each HRegion Server to obtain the current mapping between assigned HRegions and HRegion Servers.
Scans the set of .META. regions, computes which HRegions are currently unassigned, and adds them to the list of HRegions to be assigned.

Master offline

The HMaster only maintains the metadata of tables and regions and does not take part in the I/O of table data. When the HMaster goes offline, all metadata modifications are frozen: tables cannot be created or deleted, table schemas cannot be changed, HRegion load balancing cannot run, HRegions cannot be brought online or taken offline, and HRegions cannot be merged. The only exception is the HRegion split, which proceeds normally because it involves only the HRegion Server. Reads and writes of table data continue normally. Therefore, in the short term an offline HMaster has no impact on reads and writes in the HBase cluster.

As the startup process shows, all the information held by the HMaster is redundant: it can be collected or computed from other parts of the system.

Therefore, an HBase cluster usually has one HMaster providing services and one or more standby HMasters waiting to take over its position.

Three important HBase mechanisms

1. The flush mechanism

1. hbase.regionserver.global.memstore.size (default: 40% of the heap size): the global MemStore size of a RegionServer. Exceeding this size triggers a flush to disk. A RegionServer-level flush blocks client reads and writes.

2. hbase.hregion.memstore.flush.size (default: 128 MB): the MemStore size of a single region. When it is exceeded, the whole HRegion is flushed.

3. hbase.regionserver.optionalcacheflushinterval (default: 1 hour): the maximum time an edit can live in memory before it is flushed automatically.

4. hbase.regionserver.global.memstore.size.lower.limit (default: heap size * 0.4 * 0.95): Sometimes the cluster’s write load is so high that writes consistently outpace flushes, while we still want the MemStore to stay below a safe limit. In this case, writes are blocked until the MemStore is back to a manageable size. When a RegionServer-level flush occurs, client writes are blocked until the total MemStore size of the RegionServer drops to heap size * 0.4 * 0.95.

5. hbase.hregion.preclose.flush.size (default: 5 MB): If the MemStore of a region is larger than this value when a close of the region is triggered, a pre-flush is run first to empty the MemStore, and only then is the region taken offline. Once a region is offline, it can no longer accept writes, and flushing a large MemStore takes a long time. The pre-flush empties the MemStore before the region goes offline, so that the flush performed by the final close is fast.

6. hbase.hstore.compactionThreshold (default: 3): the number of HFiles allowed in a Store before they are merged. As the MemStore of each column family of each region is flushed to HFiles, by default once there are more than 3 HFiles they are merged and rewritten into one new file. A larger value triggers merges less often, but each merge takes longer.
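The thresholds in items 1, 2, and 4 are simple arithmetic on the heap size. The sketch below works them out for an assumed 16 GiB heap; the function names are hypothetical, and the fractions are the defaults quoted in the list above.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def global_memstore_marks(heap_bytes, size_fraction=0.4, lower_limit=0.95):
    """Return (upper, lower) global MemStore marks in bytes for one RegionServer."""
    upper = heap_bytes * size_fraction   # flushes start above this size
    lower = upper * lower_limit          # blocked writes resume below this size
    return upper, lower


def region_needs_flush(memstore_bytes, flush_size=128 * MIB):
    """True when a single region's MemStore exceeds the per-region flush size."""
    return memstore_bytes > flush_size


# Assumed example: a RegionServer with a 16 GiB heap.
upper, lower = global_memstore_marks(16 * GIB)
print(f"upper mark: {upper / GIB:.2f} GiB, lower mark: {lower / GIB:.2f} GiB")
```

With these defaults the upper mark is 6.4 GiB and the lower mark is 6.08 GiB, so a RegionServer-level flush has to free roughly a third of a gibibyte before blocked writes resume.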

2. The compact mechanism

Merges small StoreFiles into a large HFile. Removes expired data, including data that has been marked as deleted. Reduces the stored versions of the data to one.
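A compaction of this kind can be sketched as a merge of sorted files that drops delete markers, the data they cover, and all but the newest version. This is an illustration under simplifying assumptions: cells are tuples of (row, timestamp, kind, value), a delete marker hides every version of its row at or before its timestamp, and the function major_compact is hypothetical.

```python
import heapq


def sort_key(cell):
    row, timestamp, kind, _ = cell
    # Rows ascending, newest timestamp first, delete markers before puts on ties.
    return (row, -timestamp, 0 if kind == "delete" else 1)


def major_compact(storefiles, max_versions=1):
    """Merge sorted StoreFiles, dropping deleted data and extra versions."""
    merged = heapq.merge(*storefiles, key=sort_key)
    out = []
    current_row, kept, deleted_at = None, 0, None
    for row, timestamp, kind, value in merged:
        if row != current_row:
            current_row, kept, deleted_at = row, 0, None
        if kind == "delete":
            deleted_at = timestamp  # newest marker for this row; the marker is dropped
            continue
        if deleted_at is not None and timestamp <= deleted_at:
            continue                # hidden by a delete marker
        if kept < max_versions:
            out.append((row, timestamp, kind, value))
            kept += 1
    return out


# Two hypothetical StoreFiles, each already sorted by sort_key.
older = [(b"a", 100, "put", b"a1"), (b"b", 100, "put", b"b1"), (b"c", 100, "put", b"c1")]
newer = [(b"a", 200, "put", b"a2"), (b"b", 200, "delete", b"")]

compacted = major_compact([older, newer])
```

In the result, row a keeps only its newest version, row b disappears along with its delete marker, and row c is carried over unchanged.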

3. The split mechanism

When an HRegion reaches the threshold, it is split into two. By default, the split happens when an HFile reaches 10 GB.
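A split can be sketched as cutting a sorted run of rows near the middle of its total size. The function split_region, the (row key, size) representation, and the tiny sizes below are assumptions for illustration; HBase chooses its split point from the store files, not from individual rows.

```python
def split_region(rows, max_size):
    """Split a sorted list of (row_key, size_bytes) in two once it exceeds max_size."""
    total = sum(size for _, size in rows)
    if total <= max_size or len(rows) < 2:
        return [rows]  # small enough, or nothing to split
    running = 0
    split_at = len(rows) - 1
    for i, (_, size) in enumerate(rows):
        running += size
        if running >= total / 2:
            # Cut after row i, but always leave at least one row on each side.
            split_at = min(max(i + 1, 1), len(rows) - 1)
            break
    return [rows[:split_at], rows[split_at:]]


# Hypothetical region of four rows, 4 bytes each, with a 10-byte threshold.
region = [(b"a", 4), (b"b", 4), (b"c", 4), (b"d", 4)]
daughters = split_region(region, max_size=10)
```

Each resulting half covers a contiguous key range, so the two new regions can be assigned to different servers independently.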
