This is the 14th day of my participation in the First Challenge 2022.
0. Introduction
HBase is modeled on Google BigTable. It is a distributed, column-oriented, non-relational database system for massive data sets that provides real-time random reads and writes.
As the open source implementation of BigTable, it achieves distributed mass storage through data sharding and HDFS. In terms of data structure, the column family design lets the table structure be customized at run time. The LSM tree lets data be written to disk as continuous sequential writes, greatly improving write performance.
(1) Features
- Mass storage: built on HDFS underneath, HBase can store massive amounts of data.
- Column storage: HBase table data is stored by column family, and a column family contains several columns.
- Easily extensible: storage relies on HDFS underneath, so when disk space runs short you only need to dynamically add DataNode service nodes.
- High concurrency: supports highly concurrent read/write requests.
- Sparsity: sparsity mainly concerns the flexibility of HBase columns. In a column family, you can define as many columns as you want. If a column's data is empty, it takes up no storage space.
- Multiple versions of data: the data in an HBase table can have multiple version values. By default, versions are distinguished by version number, which is the timestamp at which the data was inserted.
- Single data type: all data in HBase is stored as byte arrays.
(2) Application
- Transportation: GPS information for ships, with tens of millions of records stored every day.
- Finance: consumption information, loan information, credit card repayment information, etc.
- E-commerce: e-commerce website transaction information, logistics information, tour information, etc
- Telecommunications: call information
Summary: HBase is suitable for storing massive amounts of detailed data that must still be queried quickly (a single table with more than ten million or even hundreds of millions of rows, combined with high concurrency requirements).
1. Scalable architecture
HBase is designed for scalable storage of massive data and provides low-latency, real-time data access for online services.
The scalability of HBase depends on HRegions, which can be split, and on HDFS, which is itself a scalable distributed file system.
The architecture diagram is as follows:
ZooKeeper
- Implements high availability for HMaster
- Stores HBase's metadata information, which is the addressing entry for all HBase tables
- Monitors HMaster and HRegionServer
HMaster
All HRegion information, including the Key range, HRegionServer address, and access port number, is recorded on the HMaster server. To ensure high availability of HMasters, HBase starts multiple HMasters and elects a primary server using ZooKeeper.
- Assigns Regions to HRegionServers
- Maintains load balancing for the entire cluster
- Maintains the metadata information of the cluster
- Detects failed Regions and reassigns them to healthy HRegionServers
HRegionServer
HRegionServer is a physical server. Multiple HRegion instances can be started on each HRegionServer. When the amount of data written to an HRegion reaches the threshold, the HRegion is split into two HRegions, and HRegions are migrated across the cluster to balance the load on the HRegionServers.
- Manages Regions
- Receives read/write requests from clients
- Splits Regions that grow too large during operation
HRegion
HRegion is the process in HBase responsible for data storage. Applications read and write data by communicating with HRegions. In HBase, data is managed by HRegion: to access a piece of data, an application must first locate the HRegion that holds it and then submit its read or write operation to that HRegion, which performs the operation on the storage layer. Each HRegion stores the data in one Key range [key1, key2).
- Each HRegion consists of multiple Stores.
- Each Store holds one column family (Column Family). A table has as many Stores as it has column families.
- Each Store consists of one MemStore and multiple StoreFiles. The MemStore is the part of the Store held in memory. When it is written to a file, it becomes a StoreFile. Underneath, a StoreFile is saved in the HFile format.
The read sequence diagram is as follows:
As shown in the sequence diagram above, the steps are as follows:
- The application obtains the address of the primary HMaster through ZooKeeper.
- It passes in the Key value to get the address of the HRegionServer where this Key is located.
- It then sends a request to the HRegion on that HRegionServer to obtain the required data.
Summary:
- HBase is designed to store massive data in a distributed way. Its data routing algorithm is different from Memcached's.
- HBase shards data by Key range. Each shard is an HRegion.
- The application looks up the shard through HMaster, obtains the HRegionServer, and communicates with that HRegionServer to obtain the data it needs.
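The Key-range lookup summarized above can be sketched in a few lines of Python. This is only an illustration, not the HBase client API: `RegionInfo`, `RegionDirectory`, and the `rs1:16020`-style addresses are hypothetical names, and the sketch assumes that the first region starts at the empty key so every Key falls into some range.

```python
import bisect
from dataclasses import dataclass


@dataclass
class RegionInfo:
    start_key: str  # inclusive lower bound of the Key range ("" = from the beginning)
    server: str     # hypothetical "host:port" of the HRegionServer holding the region


class RegionDirectory:
    """Toy stand-in for the table that maps Key ranges to HRegionServers."""

    def __init__(self, regions):
        # Keep regions ordered by start key so a binary search can find the range.
        self.regions = sorted(regions, key=lambda r: r.start_key)
        self.start_keys = [r.start_key for r in self.regions]

    def locate(self, key):
        # The owning region is the last one whose start key is <= the given key.
        idx = bisect.bisect_right(self.start_keys, key) - 1
        return self.regions[idx]


directory = RegionDirectory([
    RegionInfo("", "rs1:16020"),
    RegionInfo("g", "rs2:16020"),
    RegionInfo("p", "rs3:16020"),
])

print(directory.locate("apple").server)  # rs1:16020
print(directory.locate("g").server)      # rs2:16020
print(directory.locate("zebra").server)  # rs3:16020
```

Because ranges are contiguous and ordered, only the start keys need to be stored: the end of one range is the start of the next.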
Extensible data model
To improve data writing speed, HBase uses a data structure called LSM tree for data storage.
LSM Tree: Log-Structured Merge Tree
Data is written continuously, log-style, and the sorted trees on disk are then merged asynchronously.
LSM tree, as shown in figure:
The LSM tree can be regarded as an n-order merge tree.
Data writes (including inserts, modifications, and deletions) are performed in memory and a new record is created (modifications record new data values, while deletions record a deletion flag).
In memory, the data is still kept as a sorted tree. When the amount of data exceeds the specified memory threshold, this sorted tree is merged with the most recent sorted tree on disk.
When the amount of data in that on-disk sorted tree also exceeds its threshold, it is merged with the sorted tree at the next level on disk.
During a merge, older data is overwritten by the latest update (or kept as a different version).
A read always starts from the sorted tree in memory. If the data is not found there, the search continues through the sorted trees on disk in order.
A data update in the LSM tree does not require disk access and can be completed in memory.
When the workload is mainly writes and reads concentrate on recently written data, the LSM tree greatly reduces the number of disk accesses and speeds up access.
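To make the write and read paths concrete, here is a minimal Python sketch of the idea. It is not HBase's implementation: `TinyLSM`, `memtable_limit`, and `compact` are hypothetical names, Python lists stand in for on-disk sorted trees, and, as a simplification, a full in-memory table is flushed as a new sorted segment and merged later in a separate `compact` step rather than merged into the disk tree immediately.

```python
TOMBSTONE = object()  # deletion flag recorded in place of a value


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # in-memory writes (sorted when flushed)
        self.segments = []   # stand-ins for on-disk sorted trees, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Inserts and modifications both just record a new value in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        # A deletion is also a write: it records a deletion flag.
        self.put(key, TOMBSTONE)

    def _flush(self):
        segment = sorted(self.memtable.items(), key=lambda kv: kv[0])
        self.segments.insert(0, segment)
        self.memtable = {}

    def get(self, key):
        # Search memory first, then the segments from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for segment in self.segments:
            for k, v in segment:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Merge oldest to newest so newer values overwrite older ones,
        # then drop the keys whose latest record is a deletion flag.
        merged = {}
        for segment in reversed(self.segments):
            merged.update(segment)
        live = sorted(
            ((k, v) for k, v in merged.items() if v is not TOMBSTONE),
            key=lambda kv: kv[0],
        )
        self.segments = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # memtable is full, flushed as the first segment
db.put("a", 10)     # newer version of "a"
db.delete("b")      # deletion flag; memtable is full again, second segment
print(db.get("a"))  # 10
print(db.get("b"))  # None
db.compact()
print(db.segments)  # [[('a', 10)]]
```

Note that neither `put` nor `delete` ever touches an existing segment, which is why writes stay cheap. The cost moves to reads, which may have to look through several segments, and to the background merge.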
Data model
The logical architecture is shown below:
The physical architecture is shown as follows:
NameSpace
(Namespace, comparable to a database)
A namespace is similar to the database concept in a relational database, and each namespace contains multiple tables. HBase has two built-in namespaces, hbase and default. The hbase namespace holds HBase's built-in tables. The default namespace is the one users get by default. A table may or may not specify a namespace. If it does, the namespace and the table name are separated by a colon (:).
Table
A table concept similar to a relational database.
However, when a table is defined in HBase, only its column families need to be declared. Data attributes such as TTL and COMPRESSION are defined on the column family, and specific columns do not need to be declared.
Row
(One logical row of data)
Each row of data in an HBase table consists of a RowKey and multiple columns. The columns of a row are grouped by column family. The column family of a piece of data can only be chosen from the column families defined for the table. You cannot use a column family that does not exist in the table, otherwise a NoSuchColumnFamilyException is thrown.
RowKey
(Primary key for each row)
A RowKey is a unique string specified by the user that identifies one row. RowKey design is important because data is stored in lexicographical order of RowKey and can only be retrieved by RowKey when querying. If you write with a RowKey that already exists, the previous data is updated.
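Lexicographical order has a practical consequence for RowKey design that is easy to show in Python. The keys below are made-up examples. Since keys are compared byte by byte, numeric suffixes do not sort numerically unless they are padded to a fixed width.

```python
# RowKeys are compared byte by byte, so "user10" sorts before "user2".
row_keys = [b"user2", b"user10", b"user1"]
plain_order = sorted(row_keys)
print(plain_order)   # [b'user1', b'user10', b'user2']

# Zero-padding the numeric part makes byte order match numeric order.
padded_keys = [f"user{n:05d}".encode() for n in (2, 10, 1)]
padded_order = sorted(padded_keys)
print(padded_order)  # [b'user00001', b'user00002', b'user00010']
```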
Column Family
(Column family)
A column family is a collection of columns.
- A column family has the flexibility to define multiple columns on the fly.
- Most of the table attributes are defined in column families. Different column families in the same table can have completely different attribute configurations, but all columns in the same column family will have the same attribute.
- The point of a column family is that HBase tries to place the columns of the same column family on the same machine, so if you want columns to be stored together, define them in the same column family.
Column Qualifier
(Column name)
Columns in HBase can be defined freely. Neither the names nor the number of columns in a row is restricted. Only the column families are. A column therefore cannot exist without a column family, and a column name must be prefixed with the column family it belongs to, for example info:name and info:age.
TimeStamp
(Timestamp, i.e. the version)
Used to identify different versions of data. By default the timestamp is assigned by the system, but it can also be specified explicitly by the user. When reading cell data, you can omit the version number. If no version number is specified, HBase returns the latest version by default.
Cell
Multiple versions of data can be stored in a single column. Each version is called a Cell.
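Putting RowKey, column family, column qualifier, timestamp, and cell together, the logical model behaves like a nested map. The Python sketch below illustrates that view only. `ToyTable` and `NoSuchColumnFamily` are hypothetical names (the latter merely mirrors the NoSuchColumnFamilyException mentioned earlier), and timestamps are passed in explicitly instead of being assigned by the system.

```python
class NoSuchColumnFamily(Exception):
    """Hypothetical error for writes to an undeclared column family."""


class ToyTable:
    def __init__(self, families):
        self.families = set(families)  # only column families are declared up front
        self.rows = {}  # row_key -> {"family:qualifier" -> {timestamp -> bytes}}

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise NoSuchColumnFamily(family)
        # Columns are created on the fly; absent columns occupy no space at all.
        self.rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default: the latest version
        return versions.get(timestamp)


table = ToyTable(["info"])
table.put("1001", "info:name", b"Tom", timestamp=1)
table.put("1001", "info:name", b"Tommy", timestamp=2)

print(table.get("1001", "info:name"))               # b'Tommy'
print(table.get("1001", "info:name", timestamp=1))  # b'Tom'
print(table.get("1001", "info:age"))                # None
```

Writing to a column such as `extra:x` raises `NoSuchColumnFamily` here, because `extra` was never declared for the table.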
Region
(Table partition)
A Region consists of several rows of a table. The rows in a Region are sorted lexicographically by RowKey. A Region cannot span RegionServers. When the amount of data grows large, HBase splits the Region.
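As a rough illustration of splitting, the Python sketch below cuts a sorted run of RowKeys into two contiguous halves once it passes a threshold. The function `split_region` is hypothetical, and counting rows is an assumption made for simplicity: the text above only says that a Region is split when the data grows large.

```python
def split_region(row_keys, threshold):
    """Split one region (a sorted list of RowKeys) into two if it is too large."""
    if len(row_keys) <= threshold:
        return [row_keys]
    mid = len(row_keys) // 2
    # Both halves stay sorted and contiguous, so each still covers one Key range.
    return [row_keys[:mid], row_keys[mid:]]


region = ["r1", "r2", "r3", "r4", "r5"]
halves = split_region(region, threshold=4)
print(halves)                              # [['r1', 'r2'], ['r3', 'r4', 'r5']]
print(split_region(region, threshold=10))  # [['r1', 'r2', 'r3', 'r4', 'r5']]
```

After a split, the first key of the second half (`r3` here) becomes the boundary between the two new Key ranges.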