Today’s share is about HBase design experience and lessons learned.

This is the second article in the series, covering HBase compaction (merging), region splits, and failover.

– 1. Introduction to HBase

– 2. HBase reads and writes in detail: read amplification, compaction, and fault recovery

– 3. HBase alarm service experience

– 4. HBase optimization experience

One, what is HBase?

HBase is a distributed, column-oriented open source database. The technology derives from the Google paper “Bigtable: A Distributed Storage System for Structured Data” written by Fay Chang. Just as Bigtable leverages the distributed data store provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

HBase is a subproject of the Apache Hadoop project. Unlike common relational databases, HBase is a database suitable for unstructured data storage. Another difference is that HBase is column-based rather than row-based.

(1) Data model

(2) Logical architecture

(3) System architecture

(4) How is data read?

(5) What is a Region Server and what does it store?

(6) How is data written?

(7) What is a MemStore?

(8) How does the write cache flush to disk? Does every write go straight to disk?

(9) What is an HFile?

The previous article addressed those questions! Now for the update, so let's get right into it.

Two, read amplification, write amplification, and fault recovery

(10) When is a read merge triggered?

A Region Server contains the read cache (BlockCache), the write cache (MemStore), and HFiles.

So when we read a piece of data, it may be in the read cache (an LRU cache), it may have just been written to the write cache, or it may be in neither cache; in that case HBase uses the BlockCache index and Bloom filters to load the corresponding HFiles into memory. Data may therefore come from the read cache, from the scanner over the write cache, and from HFiles. This is what HBase calls a read merge.

As mentioned earlier, the write cache may have been flushed into multiple HFiles, so a single read request may have to read multiple files, which hurts performance. This is also known as read amplification.
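To keep read amplification in check on the schema side, Bloom filters let HBase skip HFiles that cannot contain the requested row. Below is a minimal sketch (HBase 2.x Java API) of enabling one at table creation; the table and family names are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("alarm"))       // hypothetical table name
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf"))          // hypothetical family name
                    .setBloomFilterType(BloomType.ROW)        // row-level Bloom filter
                    .build())
                .build();
            admin.createTable(table);
        }
    }
}
```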

(11) Since read amplification exists, is there a way to read fewer files? – the write merge

In simple terms, HBase automatically merges some of the smaller HFiles and rewrites them into a smaller number of larger HFiles

It uses a merge-sort algorithm to combine small files into large ones, effectively reducing the number of HFiles

This process is called a minor compaction.
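To make the merge-sort idea concrete, here is a toy k-way merge in Java: each input list stands in for a sorted HFile, and a priority queue merges them into one sorted output in a single pass. Real compactions operate on HFile scanners and KeyValues, not string lists; this only shows the algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSketch {
    static List<String> merge(List<List<String>> sortedFiles) {
        // each queue entry is {fileIndex, offsetWithinFile}
        PriorityQueue<int[]> pq = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> sortedFiles.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedFiles.size(); i++) {
            if (!sortedFiles.get(i).isEmpty()) pq.add(new int[]{i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            merged.add(sortedFiles.get(e[0]).get(e[1]));
            if (e[1] + 1 < sortedFiles.get(e[0]).size()) {
                pq.add(new int[]{e[0], e[1] + 1}); // advance within that file
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(
                List.of("a", "d", "f"), List.of("b", "e"), List.of("c"))));
        // prints [a, b, c, d, e, f]
    }
}
```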

(12) Which HFiles does the write merge target? – the major compaction

1. A major compaction merges and rewrites all the HFiles under each Column Family.

2. In the process, deleted and expired cells are physically removed, which improves read performance.

3. Because a major compaction rewrites all HFiles, it incurs a large amount of disk I/O and network overhead. This is called write amplification.

4. HBase schedules major compactions automatically by default. Because of the write amplification, you are advised to run them in the early morning or on weekends instead; a client-side sketch follows below.

5. A major compaction also moves data that has drifted away from the Region Server, due to server crashes or load balancing, back onto nearby nodes, restoring data locality.
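As mentioned in point 4 above, here is a hedged sketch of requesting compactions from the Java client so the heavy major compaction runs in an off-peak window; "alarm" is a hypothetical table name, and both calls are asynchronous requests:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.compact(TableName.valueOf("alarm"));      // request a minor compaction
            admin.majorCompact(TableName.valueOf("alarm")); // full rewrite: schedule off-peak
            // shell equivalent: major_compact 'alarm'
        }
    }
}
```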

(13) A Region Server manages more than one region. How many regions does it manage, and when does a region grow and split? – the region split

Let’s review the concept of region:

  • The HBase Table is horizontally partitioned into one or more regions. Each region holds a continuous, sorted range of rows bounded by a start key and an end key.
  • The default size of each region is 1GB.
  • The Region Server reads and writes data in the Region and interacts with the client.
  • Each Region Server can manage about 1000 regions (which may come from one or more tables).

At first, each table has only one region by default. When a region grows too large, it splits into two sub-regions, each holding half of the original region's data. The two sub-regions are created in parallel on the original region server, and the split is reported to the HMaster. For load balancing, the HMaster may later migrate a new region to another Region Server.
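One practical consequence: since a new table starts with a single region, a heavy initial load hammers one server until splits kick in. A common remedy is pre-splitting at creation time. A sketch, with hypothetical table, family, and split keys (chosen here as hex prefixes, which suits MD5-prefixed rowkeys):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("alarm"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                .build();
            // three split points -> four initial regions with contiguous key ranges
            byte[][] splitKeys = {
                Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
            };
            admin.createTable(table, splitKeys);
        }
    }
}
```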

(14) Due to splits and load balancing, a region may end up being served far from the nodes that hold its data, so reads suffer until the next write merge (major compaction) rewrites the data onto nodes near the Region Server, restoring locality

(15) How is HBase data backed up?

All reads and writes are routed through the primary DataNode of HDFS. HDFS automatically replicates the WAL (write-ahead log) and the HFile blocks. HBase relies on HDFS to guarantee data integrity and safety: when data is written to HDFS, one copy is written to the local node and two more copies are written to other nodes.

WAL files and HFiles are persisted to disk and replicated, so how does HBase recover the MemStore data that has not yet been persisted to an HFile?

(16) How is HBase data recovered after a crash?

  1. If a Region Server crashes, the regions it serves cannot be accessed until the crash is detected and failover completes. ZooKeeper detects node failure through missed heartbeats, and the HMaster is then notified that the Region Server has failed.

  2. When the HMaster detects that a Region Server has failed, it reassigns the regions that server managed to other, healthy Region Servers. To recover the failed server's MemStore contents that were never persisted to HFiles, the HMaster splits the WAL into several files and distributes them to the new Region Servers. Each Region Server then replays the WAL fragments it received to rebuild the MemStores for the regions it was assigned.

  3. The WAL contains a series of edits, each representing a single PUT or DELETE operation. Edits are written chronologically, and when persisted they are appended to the end of the WAL file on disk.

  4. What about data that was still in the MemStore and never persisted to an HFile? The WAL files are replayed: the WAL is read, its edits are sorted and applied to a MemStore, and finally that MemStore is flushed to an HFile.
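Since WAL replay is what makes unflushed MemStore data recoverable, the client can choose how strictly each write hits the WAL. A small sketch with hypothetical table and values; SYNC_WAL is the usual default, while SKIP_WAL trades crash safety for speed:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DurabilityExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("alarm"))) {
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            put.setDurability(Durability.SYNC_WAL); // log to WAL before the MemStore
            table.put(put);
        }
    }
}
```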

That's the end of the HBase fundamentals. Tomorrow, the hands-on walkthrough!


Three, hands-on experience

How to design the rowkey

I can’t start writing until I get back from work. Okay, here we go.

Alarm service scenarios fall into two types:

  1. Instantaneous events – they begin and end at essentially the same moment.
  2. Persistent events – they start at one point in time and end at a later one.

In both cases, the rowkey is: unique ID + time + alarm type

We take an MD5 hash of the ID so that rows are spread evenly across regions.
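A minimal sketch of that rowkey scheme, assuming a hex-encoded MD5 prefix followed by time and alarm type; the field order and separators are illustrative, not our exact production format:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyBuilder {
    public static String build(String id, long epochMillis, String alarmType) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        // md5(id) + time + alarm type: the hash prefix evens out the key distribution
        return hex + "_" + epochMillis + "_" + alarmType;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("device-42", System.currentTimeMillis(), "CPU_HIGH"));
    }
}
```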

Metrics platform

The second scenario is what we call the metrics platform. We put a layer of encapsulation over Kylin, in which users can choose which HBase-backed data to query, which dimensions to slice by, and which metrics to compute. For example, with transaction data you can pick the time and the city, form a chart, and build it into a report. The report can then be shared with others.

Why Kylin? Because Kylin is a MOLAP engine built on a precomputation model, and it meets our requirement for sub-second page responses.

Second, we had real concurrency requirements, and the raw data was at the scale of ten billion rows. We also needed some flexibility, ideally a SQL interface, with a mostly offline workload. All things considered, we chose Kylin.

Kylin overview

Apache Kylin™ is an open source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop, supporting extremely large datasets. Originally developed by eBay and contributed to the open source community, it can query huge Hive tables in sub-seconds. The principle is fairly simple: using a precomputation model, you declare in advance which dimensions and metrics you will query by; Kylin computes every preset combination of them in MOLAP fashion and stores all the results in HBase; at query time it scans HBase according to the dimensions and metrics in the SQL. Sub-second queries are possible precisely because the computation has already been done before the query arrives.

Kylin architecture

The architecture follows the same logic. On the left is the data warehouse, where all the source data lives. In the middle is the compute engine, which runs the daily scheduled builds and converts the results into HBase's key-value structure for storage in HBase. On top, Kylin exposes a SQL interface and a routing layer that parses SQL statements and translates them into concrete HBase operations.

Kylin has a pair of concepts called Cube and Cuboid, and the logic is very simple. Suppose you know in advance that the query dimensions are A, B, C, and D. Each of the four dimensions can either appear in a query or not, giving 16 combinations. The full set is called the cube, and each individual combination is called a cuboid.
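A toy illustration of cube vs. cuboid: treat each of the 4 dimensions as one bit, so every bitmask from 0 to 15 names one cuboid, and all 16 together form the cube:

```java
public class CuboidEnumeration {
    public static void main(String[] args) {
        String[] dims = {"A", "B", "C", "D"};
        for (int mask = 0; mask < (1 << dims.length); mask++) {
            StringBuilder cuboid = new StringBuilder();
            for (int i = 0; i < dims.length; i++) {
                if ((mask & (1 << i)) != 0) cuboid.append(dims[i]);
            }
            System.out.printf("%4s -> %s%n",
                    Integer.toBinaryString(mask),
                    cuboid.length() == 0 ? "(grand total)" : cuboid);
        }
    }
}
```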

How Kylin physically stores data in HBase – an HBase shell walkthrough

Start by defining a source table with two dimensions, year and city.

Then define a metric, such as total price. Below that are all the combinations. As mentioned earlier, Kylin enumerates many cuboid combinations; for example, the first three rows belong to the cuboid 00000011.

There is a small trick in how the dimension values are encoded: the rowkey would get much longer if the original values were placed in it directly.

The rowkey is also stored in every cell alongside each version's value, so a very long rowkey puts a lot of pressure on HBase. A dictionary encoding is therefore used to shorten it. In this way, the data computed by Kylin can be stored compactly in HBase.
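A toy sketch of the dictionary idea: map each distinct dimension value to a short integer ID and store the ID in the rowkey instead of the raw string. Kylin's real dictionary is a trie-based structure; this is only to show the principle:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictEncoder {
    private final Map<String, Integer> dict = new LinkedHashMap<>();

    public int encode(String value) {
        // assign the next ID on first sight, reuse it afterwards
        return dict.computeIfAbsent(value, v -> dict.size());
    }

    public static void main(String[] args) {
        DictEncoder cities = new DictEncoder();
        System.out.println(cities.encode("Beijing"));  // 0
        System.out.println(cities.encode("Shanghai")); // 1
        System.out.println(cities.encode("Beijing"));  // 0 again
    }
}
```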

Using bit operations for this kind of multi-dimensional statistical analysis is fast, and it is a common technique in big data analysis!

Of course, in practice we also provide query capability through Apache Phoenix.
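For example, here is a hedged sketch of querying HBase through Phoenix's JDBC driver; the ZooKeeper host and the ALARM table are assumptions, not our actual deployment:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // jdbc:phoenix:<zookeeper-quorum>[:port]
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT alarm_type, COUNT(*) FROM ALARM GROUP BY alarm_type");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```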

Four, optimization experience

So let me draw a picture

1. Check whether the scan cache is set properly.

Optimization principle: Before tackling the question, we need to explain what the scan cache is. A scan usually returns a large amount of data, so when a client issues a scan request, the data is not loaded locally all at once but spread across multiple RPC requests. On the one hand, returning everything in one request could severely consume network bandwidth and affect other services; on the other hand, it could OOM the client. Instead, the client loads one batch of data locally, iterates over it, loads the next batch, and so on until the scan completes. Each batch lands in the scan cache, which holds 100 rows by default.

The default scan cache setting works fine in most cases. But for large scans (tens or even hundreds of thousands of rows in a single scan), fetching 100 rows at a time means hundreds or even thousands of RPC requests per scan, and that much round-tripping is undoubtedly expensive. In that case, consider raising the scan cache to something like 500 or 1000. In a test I ran on a scan of over 100,000 rows, raising the scan cache from 100 to 1000 effectively reduced the overall scan latency, by roughly 25%.

Optimization suggestion: in large scan scenarios, raise the scan cache from 100 to 500 or 1000 to reduce the number of RPCs
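A sketch of that suggestion with a hypothetical table name; setCaching controls how many rows each scanner RPC returns:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class ScanCachingExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("alarm"))) {
            Scan scan = new Scan();
            scan.setCaching(500); // rows fetched per RPC, up from the default 100
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // process each row here
                }
            }
        }
    }
}
```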

2. Can GET requests be batched?

Optimization principle: HBase provides both a single-GET and a batch-GET API. The batch API reduces the number of RPC round trips between the client and the RegionServer and improves read performance. Note, too, that a batch GET either returns all of the requested data successfully or throws an exception.

Optimization suggestion: use batch GETs for read requests
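A sketch of a batch GET against a hypothetical table; note the single table.get(List) call in place of a loop of single GETs:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchGetExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("alarm"))) {
            List<Get> gets = new ArrayList<>();
            for (String row : new String[]{"row-1", "row-2", "row-3"}) {
                gets.add(new Get(Bytes.toBytes(row)));
            }
            Result[] results = table.get(gets); // all-or-nothing: throws on failure
            for (Result r : results) {
                // process each result here
            }
        }
    }
}
```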

3. Does the request explicitly specify a column family or column?

Optimization principle: HBase is a typical column-family database: data in the same column family is stored together, while different column families are stored in different directories. If a table has multiple column families, a query that does not specify the column family has to retrieve data from every column family independently by rowkey. It therefore performs much worse than a query that does specify the column family, in many cases 2 to 3 times worse.

Optimization suggestion: whenever you can specify a column family or column for a precise lookup, do so
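Building on the scan sketch from optimization 1, this fragment restricts the request to a named family or column ("cf" and "status" are hypothetical names):

```java
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf"));                             // one whole family, or...
// scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("status")); // ...one single column
```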

4. Check whether the block cache is disabled for offline batch read requests.

Optimization principle: offline batch reads typically scan an entire table in one pass. On the one hand the data volume is large, and on the other the request runs only once. In this scenario, with the default scan settings, data loaded from HDFS is also placed in the block cache. That flood of cold data will evict the hot data of real-time services, which then have to reload from HDFS, causing noticeable read-latency spikes

Optimization suggestion: disable the block cache for offline batch read scans via scan.setCacheBlocks(false)
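And the corresponding fragment for a one-off offline job, again extending the earlier scan sketch, combining the larger per-RPC batch with a disabled block cache:

```java
Scan scan = new Scan();
scan.setCaching(500);       // still fetch many rows per RPC
scan.setCacheBlocks(false); // keep cold data out of the BlockCache
```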

The remaining optimizations will get a complete summary in the next article. It's a little late today, so rest comes first ~~~

❤️❤️❤️❤️

Thank you so much for reading this far! If you think this article is well written and useful, please give it a like 👍, a follow ❤️, and a share 👥; it really helps!

If there are any mistakes in this blog, please comment, thank you very much!

To wrap up: I recently put together an interview resource, the "Java Interview Process Manual", covering Java core technology, the JVM, Java concurrency, SSM, microservices, databases, data structures, and more. How to get it: GitHub github.com/Tingyu-Note… , or follow the public account: Ting Yu Notes, where more is on the way.