Today’s share is about HBase design experience and the lessons we learned.

– 1. Introduction to HBase

– 2. HBase reads and writes in detail: read amplification, compaction, and fault recovery

– 3. Using HBase for alarm data

– 4. HBase optimization experience

What is HBase?

HBase is a distributed, column-oriented open source database. The technology derives from the Google paper “Bigtable: A Distributed Storage System for Structured Data” by Fay Chang. Just as Bigtable leverages the distributed data store provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

HBase is a subproject of the Apache Hadoop project. Unlike common relational databases, HBase is a database suitable for unstructured data storage. Another difference is that HBase is column-based rather than row-based.

(1) Data model

This is a table: based on the rowkey, we can retrieve the data of one or more column families. Each column family can hold an unlimited number of columns, and each column stores data. Every value is versioned, with versions distinguished by timestamp. So within a table, the combination of rowkey, column family, column, and version timestamp determines a unique value.

(2) Logical architecture

For any table, its rowkeys are globally ordered. For physical storage, the table is spread across multiple machines by dividing it into regions, by size or some other strategy. Each region holds a slice of the data, and regions are kept balanced across the physical machines.

(3) System architecture

This is a fairly standard storage architecture. The HMaster is mainly responsible for lightweight coordination, such as region transfer, load balancing, and failure recovery; it does not participate in queries. The real queries happen on the Region Servers, which are responsible for storage. As mentioned earlier, each table is divided into several regions, and these are stored on the Region Servers.

The most important piece here is the HLog. To ensure data consistency, a write goes to a log file first — a classic database technique: once the log entry is persisted, the write can be acknowledged as successful. As mentioned earlier, HBase tables can have many column families, and each column family corresponds to a store within a region. Each store contains StoreFiles and a MemStore.

To optimize writes, HBase first writes into the MemStore. When the MemStore fills up, its data is flushed to disk. StoreFiles and the MemStore together follow a data model called the LSM-Tree (Log-Structured Merge-Tree).

Data is then written in append mode, whether the operation is an insert, update, or delete — everything is a continual append. Updates and deletes are written as markers (tombstones), which avoids random disk I/O and improves write performance. The underlying storage, of course, is HDFS.
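To make the append-only idea concrete, here is a toy, hypothetical sketch (not HBase’s actual implementation): every operation, including update and delete, is appended as a new entry, with deletes recorded as tombstone markers, and a read takes the latest entry for a key.

```python
# Toy LSM-style append-only store: nothing is modified in place.
log = []  # append-only list of (key, value, op) entries

def put(key, value):
    log.append((key, value, "PUT"))

def delete(key):
    log.append((key, None, "DELETE"))  # tombstone marker, not an in-place erase

def get(key):
    # scan from newest to oldest; the latest entry for the key wins
    for k, v, op in reversed(log):
        if k == key:
            return None if op == "DELETE" else v
    return None

put("row1", "a")
put("row1", "b")   # "update" is just another append
delete("row1")     # "delete" appends a tombstone
print(get("row1")) # None: the tombstone shadows the earlier values
```

The old values and tombstones linger in the log until a compaction physically removes them, which is exactly the clean-up step discussed later.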

HBase uses ZooKeeper as a distributed coordination service to maintain the state of all servers in the cluster. ZooKeeper tracks which servers are healthy and available, and sends notifications when a server fails. ZooKeeper runs a consensus protocol to guarantee consistency of the distributed state; note that the consensus protocol requires an odd number of machines, typically three or five.

ZooKeeper’s distributed protocol is worth mastering — after all, the hottest projects in big data, Flink and HBase, both rely on ZooKeeper for distributed coordination.

(4) How does a read work?

  1. Look up the Meta Table to locate the Region Server that owns the rowkey
  2. Read (or write) the data on the corresponding Region Server
  3. The Meta Table, which records all regions in the system, is structured as follows:
  • Key: table, region start Key, region ID
  • Value: region server

(5) What is Region Server and what does it store?

The Region Server runs on HDFS Datanodes and consists of the following components:

  • WAL: Write Ahead Log is a file on a distributed file system that stores new data that has not been persisted. It is used for fault recovery.
  • BlockCache: This is the read cache, which stores the most frequently accessed data in memory. BlockCache is the LRU (Least Recently Used) cache.
  • MemStore: This is a write cache that stores new data in memory that has not yet been persisted to disk. When written to a hard disk, data is sorted first. Note that each Column Family of each Region has a MemStore.
  • HFile: stores HBase data on HDFS in the form of ordered KeyValues.
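The LRU behavior of the BlockCache can be sketched with an ordered dict. This is a conceptual toy, not HBase’s `LruBlockCache` implementation: on every hit the block is marked most recently used, and when the cache is full the least recently used block is evicted.

```python
from collections import OrderedDict

class LruBlockCache:
    """Toy LRU cache illustrating the BlockCache eviction policy."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)         # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = LruBlockCache(2)
cache.put("b1", "data1"); cache.put("b2", "data2")
cache.get("b1")            # b1 becomes most recently used
cache.put("b3", "data3")   # cache is full: b2 (least recently used) is evicted
print(list(cache.blocks))  # ['b1', 'b3']
```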

(6) How to write data?

  1. The data is first written to the WAL (the WAL appends at the end of the file, which is high-performance)

  2. After the data is added to the MemStore write cache, the server can send an ACK to the client to indicate the write succeeded
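The two steps above can be sketched in a few lines. This is a minimal conceptual model (real WAL entries carry sequence IDs and are fsynced): durability comes from the WAL append, speed comes from acknowledging as soon as the MemStore is updated.

```python
wal = []        # write-ahead log: durable, append-only record of every edit
memstore = {}   # in-memory write cache, flushed to HFiles later

def write(rowkey, value):
    wal.append((rowkey, value))   # step 1: sequential WAL append for durability
    memstore[rowkey] = value      # step 2: update the in-memory write cache
    return "ACK"                  # step 3: now it is safe to ack the client

assert write("row1", "v1") == "ACK"
```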

(7) What is a MemStore?

  • It caches HBase data in memory, in the same sorted KeyValue form as an HFile
  • Updates are sorted per Column Family

(8) When is the write cache flushed to disk? Does every write go to disk?

  1. When the MemStore has accumulated enough data

  2. The entire sorted data set is written sequentially to a new HFile on HDFS

  3. Since each Column Family has its own MemStore, this is one reason HBase limits the number of Column Families (there should not be too many)

  4. A maximum sequence number is maintained so the system knows which data has been persisted so far
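A flush can be sketched as: sort the MemStore contents, write them out as a new immutable file, record the max sequence number, and clear the cache. A hypothetical toy version (real flushes stream to HDFS):

```python
memstore = {"rowB": "2", "rowA": "1", "rowC": "3"}
hfiles = []     # each flush produces one new immutable, sorted file

def flush(max_seq_id):
    hfile = sorted(memstore.items())       # HFiles store ordered KeyValues
    hfiles.append({"kvs": hfile, "max_seq_id": max_seq_id})
    memstore.clear()                       # the write cache is emptied
    return hfile

kvs = flush(max_seq_id=42)
print([k for k, _ in kvs])  # ['rowA', 'rowB', 'rowC']
```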

(9) What is HFILE?

HFile uses a multi-tiered index similar to a B+ tree to query data without reading the entire file:

  • KeyValues are stored in order.
  • The rowkey points into a multi-level index, which in turn points to a specific data block in 64KB units.
  • Each block has its leaf index.
  • The last key of each block is stored in the mid-tier index.
  • The index root node points to the mid-tier index.

The trailer, which points to the meta information blocks, is written to the end of the HFile when data is persisted. The trailer also contains information such as the bloom filter and the time range.

The bloom filter is used to skip files that do not contain the requested rowkey, and the time-range information lets queries skip files outside the requested time range.

The indexes just discussed are loaded into memory when the HFile is opened, so a data lookup needs only a single disk seek.
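The “last key of each block in the mid-tier index” idea can be illustrated with a toy lookup (block contents and sizes are made up): because each block’s last key is indexed, a binary search over the index identifies the single block to read.

```python
import bisect

blocks = [["a", "c"], ["f", "h"], ["m", "z"]]   # sorted data blocks (stand-ins for 64KB blocks)
block_last_keys = [b[-1] for b in blocks]       # mid-tier index: last key per block

def find_block(rowkey):
    # first block whose last key >= rowkey is the only block worth reading
    i = bisect.bisect_left(block_last_keys, rowkey)
    return blocks[min(i, len(blocks) - 1)]

print(find_block("g"))  # ['f', 'h'] -- only this block is read from disk
```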

2. Read amplification, write amplification, and fault recovery

(10) When will read merge be triggered?

A Region Server holds the read cache, the write cache, and HFiles.

So when we read a piece of data, it may be in the read cache (LRU), it may have just been written to the write cache, or it may be in neither, in which case HBase uses the Block Cache indexes and bloom filters to load the relevant HFiles into memory. Data may therefore come from the read cache, from a scanner over the write cache, and from HFiles — this is what HBase calls a read merge.

As mentioned earlier, a store may have accumulated many HFiles, so a single read request may have to read multiple files, hurting performance. This is known as read amplification.

(11) Given read amplification, is there a way to read fewer files? — compaction

In simple terms, HBase automatically merges smaller HFiles and rewrites them into fewer, larger HFiles.

It uses a merge-sort algorithm to combine small files into large ones, effectively reducing the number of HFiles.

This process is called a minor compaction.
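Because every HFile is already sorted, a minor compaction is essentially a k-way merge, which the standard library can demonstrate directly. A hedged sketch (real compactions also handle versions and tombstones):

```python
import heapq

# three small, individually sorted HFiles (toy data)
small_hfiles = [
    [("a", 1), ("d", 4)],
    [("b", 2), ("e", 5)],
    [("c", 3), ("f", 6)],
]

# streaming k-way merge produces one larger, still-sorted HFile
merged = list(heapq.merge(*small_hfiles))
print([k for k, _ in merged])  # ['a', 'b', 'c', 'd', 'e', 'f']
```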

(12) Which HFiles does a major compaction target?

1. It merges and rewrites all HFiles of each Column Family.

2. In the process, deleted and expired cells are physically removed, which improves read performance.

3. Since a major compaction rewrites all hfiles, it incurs a significant amount of hard drive I/O and network overhead. This is called Write Amplification.

4. HBase schedules it automatically by default; because of write amplification, it is advisable to run it in the early morning or on weekends instead.

5. A major compaction also restores data locality: data scattered by server crashes or load balancing is rewritten close to its Region Server again.

(13) A Region Server manages multiple regions. How many does it manage? When is a region split? — region splitting

Let’s review the concept of region:

  • The HBase Table is horizontally cut into one or several regions. Each region contains a continuous, ordered row bounded by the start key and end key.
  • The default size of each region is 1GB.
  • The Region Server reads and writes data in the Region and interacts with the client.
  • Each Region Server can manage about 1000 regions (which may come from one or more tables).

At first, each table has only one region by default. When a region grows too large, it is split into two sub-regions, each containing half of the original region’s data. The two sub-regions are created on the original Region Server in parallel, and the split is reported to the HMaster. For load balancing, the HMaster may later migrate a new region to another Region Server.

(14) Because of splits and load-balancing migrations, a region’s data may end up spread across nodes, causing read amplification, until a major compaction rewrites the data close to its Region Server node again.

(15) How is HBASE data backed up?

All reads and writes go through the primary DataNode of HDFS. HDFS automatically replicates the WAL (the write-ahead log) and HFile blocks; HBase relies on HDFS to ensure data integrity and safety. When data is written to HDFS, one copy is written to the local node and two more copies are written to other nodes.

WAL and HFiles are persisted to hard disks and backed up. How does HBase recover data from MemStore that has not been persisted to HFile?

(16) How is HBASE data recovered after downtime?

  1. If a Region Server crashes, its regions cannot be accessed until the crash is detected and fault recovery is complete. Zookeeper detects node faults by heartbeat detection, and HMaster receives a notification of region Server faults.

  2. When the HMaster detects that a Region Server has failed, it reassigns the regions managed by that server to other healthy Region Servers. To recover the data that was in the failed server’s MemStore and not yet persisted to HFiles, the HMaster splits the WAL into several files and distributes them to the new Region Servers. Each Region Server then replays its share of the WAL to rebuild the MemStore for the regions it was assigned.

  3. WAL contains a series of modification operations, each of which represents a PUT or DELETE operation. These changes are written in chronological order, and when persisted they are written to the end of the WAL file.

  4. What about data that was still in the MemStore and had not been persisted to an HFile? The WAL files are replayed: read the WAL, sort and apply all changes into a MemStore, and finally flush that MemStore to an HFile.
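The replay in step 4 can be sketched as follows — a hedged toy model (real replay also respects sequence IDs so already-flushed edits are skipped): apply each WAL edit in chronological order to a fresh MemStore.

```python
# toy WAL: (key, value, op) edits in chronological order
wal = [("row1", "v1", "PUT"), ("row2", "v2", "PUT"), ("row1", None, "DELETE")]

def replay(wal_entries):
    memstore = {}
    for key, value, op in wal_entries:  # edits are applied in order
        if op == "PUT":
            memstore[key] = value
        else:
            memstore.pop(key, None)     # DELETE removes the cell
    return memstore                     # ready to be flushed to an HFile

recovered = replay(wal)
print(recovered)  # {'row2': 'v2'}
```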

That concludes the HBase fundamentals. Now let’s move on to hands-on practice.


3. Practical experience

How to design rowKey

I couldn’t start writing this part until I got home from work — okay, here we go.

Alarm-service scenarios fall into two types:

  1. Instantaneous event type — occurs at a single point in time.
  2. Persistent event type — has a start time and, some time later, an end time.

In both cases, the rowkey is: unique ID + time + alarm type

We apply MD5 to the ID as a hash, which ensures the data is distributed evenly.
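As a toy illustration of this rowkey scheme (the field names and formats here are hypothetical, not our production layout): an MD5-derived prefix spreads writes evenly across regions, while time and alarm type follow for range scans within one ID.

```python
import hashlib

def make_rowkey(unique_id, event_time, alarm_type):
    # MD5 prefix as a salt: evenly distributes rowkeys across regions
    prefix = hashlib.md5(unique_id.encode()).hexdigest()[:8]
    return f"{prefix}_{event_time}_{alarm_type}"

# deterministic: the same ID always lands in the same region range
print(make_rowkey("device-42", "20240101120000", "CPU_HIGH"))
```

One design note: hashing the ID trades away cross-ID time-range scans in exchange for avoiding hot-spotting on sequential keys.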

Metrics platform

The second scenario is a metrics platform. We use Kylin as an encapsulation layer over the data stored in HBase, in which users choose which dimensions and which metrics to query. For transaction data, for example, you can pick the time and the city, form a chart, and build a report, which can then be shared with others.

Why Kylin? Because Kylin is a MOLAP engine built on a precomputation model, and it meets our requirement of sub-second page response times.

Second, it has to handle a certain level of concurrency, and the raw data is at the scale of tens of billions of rows. We also need some flexibility — ideally a SQL interface — and the workload is mainly offline. All things considered, we chose Kylin.

Kylin profile

Apache Kylin™ is an open source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop for extremely large datasets. Originally developed by eBay and contributed to the open source community, it can query huge Hive tables in sub-seconds. Its principle is fairly simple: based on a precomputation model, you declare in advance which dimensions and metrics you will query. Kylin precomputes all combinations of the preset metrics and dimensions with MOLAP and stores the results in HBase; a SQL query is then served by scanning HBase by dimension and metric. Sub-second queries are possible precisely because the heavy computation has already been done.

Kylin architecture

The logic matches that architecture. On the left is the data warehouse, where all data lives. In the middle is the computing engine, which runs daily scheduled jobs that convert Hive data into HBase key-value structures and store them in HBase. Externally it provides a SQL interface and routing: it parses SQL statements and converts them into concrete HBase commands.

Kylin has the concepts of Cube and Cuboid, and the logic is very simple. Suppose the query dimensions are A, B, C, and D: each dimension can either be included in a query or not, giving 2^4 = 16 combinations. The full set of combinations is called the Cube, and each individual combination is called a Cuboid.
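The cube/cuboid combinatorics can be verified in a couple of lines — with 4 dimensions, each present or absent, there are 2^4 = 16 cuboids:

```python
from itertools import combinations

dims = ["A", "B", "C", "D"]
# every subset of the dimensions, from the empty set up to all four
cuboids = [c for r in range(len(dims) + 1) for c in combinations(dims, r)]
print(len(cuboids))  # 16, including the empty combination
```

This also shows why cube size explodes with dimension count: n dimensions give 2^n cuboids to precompute.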

How does Kylin physically store data in HBase — a shell example

Start by defining a source table with two dimensions, year and city.

Then define a measure, such as total price. Below are all the combinations; as mentioned, Kylin has many cuboid combinations — for example, the cuboid 00000011 appears in the first three rows.

There is a small trick in encoding the dimension values: the rowkey would be much longer if the original values were placed in it directly.

The rowkey is also stored in every cell, alongside each version’s value, so very long rowkeys put a lot of pressure on HBase. A dictionary encoding is therefore used to shorten them. In this way, the data computed by Kylin can be stored compactly in HBase.
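A minimal sketch of the dictionary-encoding idea (the values and code width are made up for illustration; Kylin’s actual dictionary format differs): each long dimension value maps to a short fixed-width code, and the rowkey concatenates codes instead of raw values.

```python
values = ["Beijing", "Shanghai", "Guangzhou"]

# build a fixed-width code per distinct value, plus the reverse mapping
encode = {v: format(i, "02d") for i, v in enumerate(sorted(values))}
decode = {code: v for v, code in encode.items()}

# two dimension values packed into 4 bytes instead of 16 characters
rowkey = encode["Shanghai"] + encode["Beijing"]
print(rowkey)  # '0200'
```

Fixed-width codes also let the rowkey be sliced back into its dimension fields without delimiters.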

Using bit-level encodings for multidimensional statistical analysis is fast, and is a common technique in big data analytics!

Of course, in practice we also provide query functionality with Apache Phoenix.

4. Optimization experience


1. Check whether the SCAN cache is set properly.

Optimization principle: first, what is the scan cache? A scan typically returns a large amount of data, so when a client sends a scan request, it does not load all the data locally at once but splits the work across multiple RPC requests. On one hand, a single huge response could severely consume network bandwidth and affect other services; on the other hand, it could cause an OOM on the client. The design is therefore to load a batch of data locally, iterate over it, load the next batch, and so on until all data has been scanned. Each batch loaded locally goes into the scan cache, which defaults to 100 rows.

In general the default scan cache setting works fine. But for large scans (tens or even hundreds of thousands of rows in a single scan), requesting 100 rows at a time means hundreds or even thousands of RPC requests per scan, and that interaction cost is undoubtedly very high. In such cases, consider increasing the scan cache to, say, 500 or 1000. In a test I ran with a scan over 100,000+ rows, raising the scan cache from 100 to 1000 effectively reduced the overall scan latency — by roughly 25%.

Suggestion: in large scan scenarios, raise the scan cache from 100 to 500 or 1000 to reduce the number of RPCs.
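The back-of-the-envelope math behind this suggestion: the number of RPC round trips for a scan is roughly ceil(total_rows / scan_cache_size), so raising the cache size cuts RPCs proportionally.

```python
import math

def rpc_count(total_rows, cache_size):
    # each RPC fetches up to cache_size rows
    return math.ceil(total_rows / cache_size)

print(rpc_count(100_000, 100))   # 1000 RPCs with the default cache of 100
print(rpc_count(100_000, 1000))  # 100 RPCs after raising the cache to 1000
```

The 10x drop in round trips is where most of the observed latency reduction comes from, at the cost of more client memory per batch.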

2. Can BATCH requests be used for GET requests?

Optimization principle: HBase provides both single-GET and batch-GET APIs. Batch GET reduces the number of RPC round trips between the client and the RegionServer and improves read performance. Note, however, that a batch GET either successfully returns all requested data or throws an exception.

Optimization suggestion: Use batch GET for read requests

3. Can the request explicitly specify the column family or column?

Optimization principle: HBase is a typical column-family database. Data of the same column family is stored together, and different column families are stored in different directories. If a table has multiple column families and a query does not specify one, data from every column family must be retrieved independently by rowkey. As a result, a query that does not specify the column family performs much worse than one that does — in many cases 2 to 3 times worse.

Optimization suggestion: whenever possible, specify the exact column family or column for the lookup.

4. Check whether cache is disabled for offline batch read requests.

Optimization principle: offline batch reads usually scan an entire table once. The data volume is large, and the request runs only once. In this scenario, if the default scan settings are used, data loaded from HDFS is also stored in the cache. The large volume of data entering the cache would squeeze out the hot data of other real-time services, which would then have to reload from HDFS, causing significant read-latency spikes.

Optimization suggestion: disable caching for offline batch read scans: scan.setCacheBlocks(false)

5. The rowkey should be hashed (for example, with MD5), and the table should be pre-split.

6. If the JVM heap is less than 20 GB, choose LRUBlockCache as the BlockCache policy; otherwise choose BucketCache in offheap mode. Looking forward to HBase 2.0!

7. Check the number of StoreFiles per Region and per RegionServer: are there too many HFiles?

Optimization suggestions:

hbase.hstore.compactionThreshold should not be set too high; the default is 3. hbase.hstore.compaction.max.size should be set according to the region size; a simple rule of thumb is hbase.hstore.compaction.max.size = RegionSize / hbase.hstore.compactionThreshold.

8. Does this Compaction consume too many system resources?

(1) Minor compaction settings: hbase.hstore.compactionThreshold should be neither too small nor too large; 5~6 is recommended, with hbase.hstore.compaction.max.size = RegionSize / hbase.hstore.compactionThreshold. (2) Major compaction settings: for large regions (over 100 GB) serving read-latency-sensitive businesses, automatic major compaction is usually not recommended — trigger it manually during off-peak hours instead. For small regions or latency-insensitive businesses, automatic major compaction can stay enabled. (3) Hoping that more good compaction strategies, similar to stripe compaction, stabilize for production use soon.

9. Is a BloomFilter set, and is it set reasonably?

A BloomFilter should be set for any business, usually to ROW; only if you are sure the business’s random query pattern is row + column family should it be set to ROWCOL.

The BloomFilter is enabled per ColumnFamily, and we can set it through the Java API:

// Specify the BloomFilter type to enable via HColumnDescriptor.setBloomFilterType()
// Options: NONE, ROW, ROWCOL

We can also specify BloomFilter when creating the Table

hbase> create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}

10. Enable the Short-Circuit Local Read function

Optimization principle: currently, all HDFS data reads go through a DataNode: the client sends a read request to the DataNode, which reads the file from disk and sends it to the client over TCP. The Short-Circuit strategy allows the client to bypass the DataNode and read local data directly.

11. Is Hedged Read on?

Optimization principle: HBase data is stored in three copies in HDFS, and Short-Circuit Local Read is preferred for local reads. But in some cases, a disk or network problem can cause local reads to fail for a short time. To cope with this, the community proposed a compensating retry mechanism called Hedged Read. Its basic principle: the client initiates a local read, and if no result returns within a certain time, it sends a request for the same data to another DataNode; whichever response arrives first is used, and the other is discarded. Suggestion: turn on Hedged Read.

12. Is the data local rate too low?

Data locality: HDFS data is usually stored in three copies. Suppose RegionA is currently on Node1, and data a was written with copies on (Node1, Node2, Node3), data b on (Node1, Node4, Node5), and data c on (Node1, Node3, Node5). Every piece of data has one copy on the local Node1, so all reads can be served locally and the locality rate is 100%. Now suppose RegionA is migrated to Node2, where only data a has a copy: reads of b and c must go across the network, and locality drops to 33% (assuming a, b, and c are the same size).

Optimization principle: if the data locality rate is too low, a large number of cross-network I/O requests are generated, which inevitably causes high read latency; improving locality therefore effectively optimizes random read performance. Low locality is usually caused by region migration (the automatic balancer, RegionServer failover, or manual moves), so locality can be maintained by avoiding unnecessary region migration. Alternatively, running major_compact rewrites the data locally and brings the locality rate back to 100%.

Optimization suggestion: avoid region migration without cause — for example, disable the automatic balancer, and when an RS goes down, bring it back up and migrate its regions home. Run major_compact during off-peak hours to improve the data locality rate.

Reference article:

Blog.csdn.net/shufangreal…


Thank you so much for reading this far! If you found this article well written or useful, a like 👍, a follow ❤️, or a share 👥 would mean a lot!

If there are any mistakes in this post, please point them out in the comments — thank you very much!

Finally, I recently put together an interview resource, the “Java Interview Process Manual”, covering Java core technology, the JVM, Java concurrency, SSM, microservices, databases, data structures and more. How to get it: GitHub github.com/Tingyu-Note… , or follow the public account: Ting Yu Notes, where more is on the way.