1. HBase explained in a cartoon

Developer.51cto.com/art/201904/…

2. HBase storage architecture

client

  • Contains the interfaces for accessing HBase
  • Maintains caches, such as a cache of hbase:meta metadata, to speed up HBase access

zookeeper

  • Implements high availability for HMaster
  • Stores the location of the HBase metadata table and serves as the addressing entry point for all HBase tables
  • Monitors HMaster and HRegionServer

HMaster

  • Assigns regions to specific RegionServers during startup and performs management operations such as splitting and merging regions
  • Maintains load balancing for the entire cluster
  • Maintains metadata about the cluster
  • Detects failed regions and reassigns them to a healthy HRegionServer

Note: When HMaster is down, HBase can still read and write data, because clients locate regions through ZooKeeper and the metadata table hbase:meta rather than through HMaster. HMaster is still required for table creation, column family configuration changes, and region splits and merges.

HRegionServer

  • An HRegionServer hosts one or more regions; the data that is read and written is stored in those regions
  • Receives read and write requests from clients
  • Splits regions that grow too large during operation

HLog

  • The write-ahead log (WAL) records operations before they are applied. When an operation reaches a Region, HBase first writes it to the WAL and then adds the data to the in-memory MemStore. When the MemStore reaches a certain size, HBase flushes it to an HFile. If the server crashes, the data in memory is lost, and the WAL is used to recover it.

Region

  • A Region is a section of a table. HBase is an automatically sharding database; a Region is comparable to a partition of a partitioned table in a relational database, or to a shard in Redis.
  • A Region contains multiple Stores
  • Each Region has a start rowKey and an end rowKey, which define the range of rows it stores
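
Because every Region covers a contiguous rowKey range, finding the Region for a given rowKey amounts to a search over sorted start keys. The sketch below is only an illustration of that idea; the region names, the key values, and the `locate_region` helper are hypothetical and are not part of the HBase client API.

```python
from bisect import bisect_right

# Hypothetical region table: (start_key, region_name), sorted by start key.
# An empty start key means "from the beginning of the table".
REGIONS = [
    ("", "region-a"),
    ("g", "region-b"),
    ("p", "region-c"),
]


def locate_region(row_key, regions=REGIONS):
    """Return the name of the region whose key range contains row_key."""
    start_keys = [start for start, _ in regions]
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect_right(start_keys, row_key) - 1
    return regions[index][1]


print(locate_region("apple"))  # region-a
print(locate_region("g"))      # region-b
print(locate_region("zebra"))  # region-c
```

A start key is inclusive here, so the key "g" itself belongs to the region that starts at "g".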

store

  • A store stores a column family. Columns in a column family are stored together and can be retrieved at one time.

memstore

  • A Store has exactly one MemStore
  • A MemStore is an in-memory buffer: data is written to it first and later flushed to disk

HFile

  • A Store contains multiple HFiles. When the MemStore is full, HBase creates a new HFile on HDFS and writes the MemStore contents to it.
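
The interplay of WAL, MemStore, and HFile described in this section can be sketched with a toy model. This is not how HBase is implemented: the `ToyStore` class, its method names, and the entry-count flush threshold (real HBase uses bytes) are assumptions made purely for illustration.

```python
class ToyStore:
    """Toy model of one Store: a WAL, one memstore, and a list of flushed files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # counted in entries, not bytes
        self.wal = []       # stand-in for the on-disk write-ahead log
        self.memstore = {}  # in-memory buffer
        self.hfiles = []    # each flush produces one immutable, sorted "HFile"

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the operation first
        self.memstore[row_key] = value     # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore out as a new sorted file and start a fresh buffer.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # flushed entries no longer need to be replayed

    def crash_and_recover(self):
        # Memory is lost on a crash; the WAL survives and is replayed.
        self.memstore = {}
        for row_key, value in self.wal:
            self.memstore[row_key] = value


store = ToyStore(flush_threshold=3)
for key in ["r1", "r2", "r3", "r4"]:
    store.put(key, "v-" + key)

store.crash_and_recover()
print(len(store.hfiles))  # 1
print(store.memstore)     # {'r4': 'v-r4'}
```

The first three writes fill the memstore and are flushed to one file; the fourth write exists only in memory and in the WAL, which is why it can be restored after the simulated crash.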

3. HBase data read and write process

4. Region split

Why split regions

  • A Region holds a contiguous range of a table's rowKeys. When a Region grows too large, it is split, because reading from a very large Region takes a long time. HBase is also called an automatically sharding database because large data sets are split across different machines, queried there, and the results aggregated.

Automatic splitting of regions

  • Before version 0.94, the split policy was ConstantSizeRegionSplitPolicy, which splits at a fixed region size
  • From version 0.94 on, the default split policy is IncreasingToUpperBoundRegionSplitPolicy
    • The split threshold is calculated as Math.min(tableRegionsCount^3 * initialSize, defaultRegionMaxFileSize)
    • tableRegionsCount: the total number of the table's regions on the RegionServer. initialSize: taken from the configuration parameter hbase.increasing.policy.initial.size; if it is not configured, hbase.hregion.memstore.flush.size * 2 is used. defaultRegionMaxFileSize: the maximum Region size, set by the configuration parameter hbase.hregion.max.filesize.
    • Calculation example
(1) With only one region, the upper limit is 256MB, because 1^3 * 128 * 2 = 256MB. (2) With 2 regions, the upper limit is 2GB, because 2^3 * 128 * 2 = 2048MB. (3) With 3 regions, the upper limit is 6.75GB, because 3^3 * 128 * 2 = 6912MB. This continues until the calculated value exceeds defaultRegionMaxFileSize, after which the upper limit is defaultRegionMaxFileSize.
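
The calculation example can be reproduced with a few lines of Python. This is a sketch of the formula only, not HBase code; the function name is made up, and the 10GB value used for hbase.hregion.max.filesize is an assumption chosen for the illustration.

```python
MB = 1024 * 1024


def split_size_upper_bound(region_count, initial_size, max_file_size):
    """min(region_count^3 * initial_size, max_file_size), as in the formula above."""
    return min(region_count ** 3 * initial_size, max_file_size)


flush_size = 128 * MB           # hbase.hregion.memstore.flush.size
initial_size = 2 * flush_size   # fallback when no initial size is configured
max_file_size = 10 * 1024 * MB  # assumed value for hbase.hregion.max.filesize

for count in range(1, 5):
    bound = split_size_upper_bound(count, initial_size, max_file_size)
    print(count, "region(s):", bound // MB, "MB")
# 1 region(s): 256 MB
# 2 region(s): 2048 MB
# 3 region(s): 6912 MB
# 4 region(s): 10240 MB
```

With four regions the cubic term (16384MB) already exceeds the assumed maximum, so the upper bound is capped there.
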
  • KeyPrefixRegionSplitPolicy strategy
    • Parameter used by this policy: KeyPrefixRegionSplitPolicy.prefix_length
    • This policy inherits from IncreasingToUpperBoundRegionSplitPolicy
    • Under the default policy, rows with the same prefix may be placed in different regions; with this policy, rows that share the same prefix stay in the same region.
    • If all your data has only one or two prefixes, KeyPrefixRegionSplitPolicy is ineffective and the default policy is the better choice. If your prefixes are too fine-grained, queries are more likely to span regions, and the default policy is again the better choice.
    • This policy suits data with many prefixes, where most queries target a single prefix rather than spanning several.
  • DelimitedKeyPrefixRegionSplitPolicy strategy
    • Parameter DelimitedKeyPrefixRegionSplitPolicy.delimiter: the prefix delimiter
    • For example, with the prefix delimiter _, the prefixes of host1_001 and host12_999 are host1 and host12 respectively, and regions are split by these prefixes
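
One way to picture both prefix-based policies is that they shorten a candidate split point to its prefix, so a split boundary never falls between two rows that share a prefix. The following is a minimal sketch under that assumption; the function names are hypothetical, and strings stand in for the byte arrays HBase actually uses.

```python
def key_prefix_split_point(split_point, prefix_length):
    """Keep only the first prefix_length characters of the candidate split point."""
    return split_point[:prefix_length]


def delimited_split_point(split_point, delimiter):
    """Keep only the part of the candidate split point before the first delimiter."""
    index = split_point.find(delimiter)
    # Without a delimiter there is no prefix to cut at; keep the key unchanged.
    return split_point if index == -1 else split_point[:index]


print(key_prefix_split_point("host12_999", 5))   # host1
print(delimited_split_point("host1_001", "_"))   # host1
print(delimited_split_point("host12_999", "_"))  # host12
```

Note how a fixed prefix length of 5 cannot tell host1 and host12 apart, while the delimiter-based variant can.
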
  • BusyRegionSplitPolicy strategy
    • Scenario: hotspots. Regions are accessed at different frequencies; some are accessed very heavily within a short period and come under heavy pressure. These are hot regions.
    • Parameters used to decide whether a Region is hot:
      • hbase.busy.policy.blockedRequests: the fraction of blocked requests above which a Region counts as busy. The value ranges from 0.0 to 1.0; the default of 0.2 means that 20 percent of requests are blocked.
      • hbase.busy.policy.minAge: the minimum age of a Region before it may be split. A Region younger than this is not split, which prevents a Region from being split merely because it was accessed heavily for a short time. The default is 10 minutes.
      • hbase.busy.policy.aggWindow: the time window over which busyness is calculated, in milliseconds. The default is 300000 milliseconds. It controls how often the calculation runs.
    • Decision procedure: if current time - last check time >= hbase.busy.policy.aggWindow, compute aggBlockedRate = blocked requests in the window / total requests in the window. If aggBlockedRate > hbase.busy.policy.blockedRequests, the Region is considered busy.
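
The decision procedure can be written down as a short sketch. The function names and argument names below are invented for illustration, and the default values are the ones quoted in the list above; this is not the BusyRegionSplitPolicy source code.

```python
def window_elapsed(now_ms, last_check_ms, agg_window_ms=300_000):
    """True once a full aggregation window has passed since the last check."""
    return now_ms - last_check_ms >= agg_window_ms


def is_busy(blocked_requests, total_requests, blocked_threshold=0.2):
    """Compare the blocked-request rate of one window with the threshold."""
    if total_requests == 0:
        return False  # no traffic in the window, so nothing was blocked
    agg_blocked_rate = blocked_requests / total_requests
    return agg_blocked_rate > blocked_threshold


def should_split(region_age_ms, blocked_requests, total_requests,
                 min_age_ms=600_000, blocked_threshold=0.2):
    """A region is split only if it is old enough and currently busy."""
    if region_age_ms < min_age_ms:
        return False
    return is_busy(blocked_requests, total_requests, blocked_threshold)


print(window_elapsed(now_ms=400_000, last_check_ms=0))                             # True
print(is_busy(blocked_requests=30, total_requests=100))                            # True
print(should_split(region_age_ms=60_000, blocked_requests=30, total_requests=100))  # False
```

The last call returns False even though 30 percent of the requests were blocked, because the region is only one minute old and the minimum age is ten minutes.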

5. Region merging

In the HBase shell: merge_region 'region1 hash', 'region2 hash'

6. MemStore flush

  • When a MemStore's size exceeds this value, it is flushed to disk. The default is 128MB
<property>
	<name>hbase.hregion.memstore.flush.size</name>
	<value>134217728</value>
</property>
  • When data has been in the MemStore for more than 1 hour, it is flushed to disk
<property>
	<name>hbase.regionserver.optionalcacheflushinterval</name>
	<value>3600000</value>
</property>
  • The global MemStore size limit of an HRegionServer; exceeding it triggers a flush to disk. The default is 40% of the heap size
<property>
	<name>hbase.regionserver.global.memstore.size</name>
	<value>0.4</value>
</property>
  • Manual flush
    • flush 'tableName'
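
The three automatic triggers can be summarised in one small check. This is a sketch under the assumption that each trigger is an independent threshold comparison; the `flush_reasons` function and its argument names are hypothetical, and the defaults mirror the configuration values shown above.

```python
MB = 1024 * 1024
GB = 1024 * MB


def flush_reasons(memstore_bytes, oldest_edit_age_ms, global_memstore_bytes, heap_bytes,
                  flush_size=134217728,       # hbase.hregion.memstore.flush.size
                  flush_interval_ms=3600000,  # hbase.regionserver.optionalcacheflushinterval
                  global_fraction=0.4):       # hbase.regionserver.global.memstore.size
    """Return the list of triggers that would cause a flush."""
    reasons = []
    if memstore_bytes > flush_size:
        reasons.append("memstore size")
    if oldest_edit_age_ms > flush_interval_ms:
        reasons.append("data age")
    if global_memstore_bytes > global_fraction * heap_bytes:
        reasons.append("global memstore size")
    return reasons


print(flush_reasons(200 * MB, 0, 1 * GB, 8 * GB))          # ['memstore size']
print(flush_reasons(10 * MB, 2 * 3600000, 1 * GB, 8 * GB))  # ['data age']
print(flush_reasons(10 * MB, 0, 4 * GB, 8 * GB))            # ['global memstore size']
```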

7. HFile compaction

  • Why HFiles need compaction
    • Each MemStore flush creates a new HFile, and HFiles are stored on disk. Every read from disk involves a seek; the more files there are, the more seeks each read needs and the lower the efficiency. To keep seeks down, the number of small files has to be reduced, so HFiles are merged.
    • Compaction finds the HFiles to be merged in a Store, merges them, and then removes the original small files.
  • Minor compaction
    • Merges several HFiles in a Store into a single HFile. During this process, data that has exceeded its TTL (record retention time) is removed, but data that has been deleted or updated is only marked, not physically removed. This kind of compaction is triggered very frequently.
    • Trigger conditions
<!-- Minor compaction starts only when at least 3 store files qualify -->
<property>
	<name>hbase.hstore.compaction.min</name>
	<value>3</value>
</property>
<!-- At most 10 store files are selected for one minor compaction -->
<property>
	<name>hbase.hstore.compaction.max</name>
	<value>10</value>
</property>
<!-- Default 128MB: store files smaller than this value are always included in minor compaction -->
<property>
	<name>hbase.hstore.compaction.min.size</name>
	<value>134217728</value>
</property>
<!-- Default Long.MAX_VALUE: store files larger than this value are always excluded from minor compaction -->
<property>
	<name>hbase.hstore.compaction.max.size</name>
	<value>9223372036854775807</value>
</property>
  • Major compaction
    • Merges all HFiles in the Store into one HFile
    • Trigger condition
<!-- Default: a major compaction runs every 7 days -->
<property>
	<name>hbase.hregion.majorcompaction</name>
	<value>604800000</value>
</property>
  • Manual trigger
    • major_compact 'panda_nodes'
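
The difference between the two kinds of compaction can be illustrated with a toy merge. This sketch rests on several simplifying assumptions that are not from HBase itself: a cell is a (row_key, timestamp, value) tuple, a value of None acts as a delete marker for the whole row, and a major compaction keeps only the newest live version of each row. The `compact` function is hypothetical.

```python
import heapq

DELETE = None  # a cell whose value is DELETE acts as a delete marker


def compact(hfiles, now, ttl, major):
    """Merge HFiles that are each sorted by row key, newest timestamp first."""
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result = []
    seen_rows = set()
    for row_key, timestamp, value in merged:
        if now - timestamp > ttl:
            continue  # expired by TTL: dropped by both kinds of compaction
        if not major:
            # Minor compaction keeps delete markers and older versions.
            result.append((row_key, timestamp, value))
            continue
        if row_key in seen_rows:
            continue  # older version, or hidden by a newer delete marker
        seen_rows.add(row_key)
        if value is not DELETE:
            result.append((row_key, timestamp, value))
    return result


file1 = [("r1", 100, "a"), ("r2", 100, "b")]
file2 = [("r1", 200, "a2"), ("r2", 150, DELETE), ("r3", 10, "old")]

print(compact([file1, file2], now=1000, ttl=950, major=False))
# [('r1', 200, 'a2'), ('r1', 100, 'a'), ('r2', 150, None), ('r2', 100, 'b')]
print(compact([file1, file2], now=1000, ttl=950, major=True))
# [('r1', 200, 'a2')]
```

In both runs the expired row r3 disappears. Only the major run physically drops the deleted row r2 and the superseded version of r1.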

8. HBase data model

  • Namespace (table namespace): Namespaces are optional. They are used only when several tables need to be grouped together for unified management.
  • Table: A table consists of one or more column families. Data attributes such as time to live (TTL) and compression algorithm are defined on the column family. Once the column families are defined, the table is still empty; it holds data only after rows are added.
  • Row: A row contains multiple columns, which are grouped by column family. The column families used in a row can only be chosen from those defined for the table; using a column family that the table does not define raises NoSuchColumnFamilyException. Because HBase stores data by column family, the data of one row is spread across different Stores, one per column family.
  • Column Family: A column family is a collection of columns. HBase tries to place the columns of one column family on the same server to improve access performance and to manage related columns together. All data attributes are defined on column families. In HBase, the column family, not the column, is the most important concept.
  • Column Qualifier: Multiple columns make up a row. A column is usually written together with its family as Column Family:Column Qualifier. Columns can be defined freely, and a row can have any number of columns; the only restriction is the column family.
  • Cell: A column can store multiple versions of its data, and each version is called a Cell. Cells in HBase therefore differ from cells in a traditional relational database: HBase data is more fine-grained, with the data at one location subdivided into multiple versions.
  • Timestamp (timestamp/version number): This can be called either a timestamp or a version number, since it identifies the versions of the Cells in the same column. If you do not specify a version number, the system uses the current timestamp as the version number. When you supply your own number as the version, the Timestamp really only carries the meaning of a version number (so I have always thought "version number" is the better name for this concept).
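
The data model can be pictured as a nested map: row key, then "family:qualifier", then timestamp, then value. The toy class below is only an illustration of that picture; `ToyTable` and its methods are made up for this sketch and are unrelated to the real HBase client API, and a plain KeyError stands in for NoSuchColumnFamilyException.

```python
import time


class ToyTable:
    """Nested-map picture of a table: row -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, column_families):
        self.column_families = set(column_families)  # fixed when the table is defined
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp=None):
        if family not in self.column_families:
            raise KeyError("no such column family: " + family)
        if timestamp is None:
            timestamp = int(time.time() * 1000)  # current time as the version number
        column = family + ":" + qualifier
        self.rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, family, qualifier):
        """Return the newest version of a column, or None if it does not exist."""
        versions = self.rows.get(row_key, {}).get(family + ":" + qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable(["info"])
table.put("row1", "info", "name", "panda", timestamp=1)
table.put("row1", "info", "name", "panda-v2", timestamp=2)
print(table.get("row1", "info", "name"))  # panda-v2
print(table.rows["row1"]["info:name"])    # {1: 'panda', 2: 'panda-v2'}
```

Qualifiers are created freely on write, while an unknown column family is rejected, which mirrors the rule that only the column families are fixed by the table definition.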

Note: For technical communication, please contact email

Panda Notebook Email:[email protected]