I read about HBase a long time ago, and I assumed that this kind of database, which requires big data knowledge, must be beyond the reach of business development programmers like me. But as more companies went through this process, it became clear that big data is really just another branch of Java development. Hive, HBase, Spark and the other common tools of the Hadoop ecosystem show up in many ToB services. Unlike ToC business, ToB customers need to see changes in their data, and the more data there is, the more valuable it is to them. For example, a user who recently ran a demo store had 600,000 keyword records in the advertising system, and the product required the job to run once every two hours. So even the average programmer needs to know the basics of big data.

Back to the topic of HBase: this project gave me a much more complete understanding of it, so let's analyze it together.

How to learn new things

I read the latest issue of "Strange Flower Said" and was quite moved. Knowledge is infinite, but wisdom is limited, and continuous induction and summary is the best way to turn knowledge into wisdom. So I summarized how I learn a new software system, whether it is a language, a framework, or an out-of-the-box system such as a database.

Learn how to use

HBase has no fixed schema, so any row can hold any number of columns. That makes it a good fit for data warehouse scenarios where you don't want to worry about the structure of the data and just throw it in. The only thing you really need to design is the ROWKEY. In simple terms, rows are stored in lexicographic order of their row keys, so scanning adjacent keys is very fast. How the keys are laid out is, of course, entirely up to the user.

ROW KEY    Data content
aabb       .
abaa       .
ba         .
bbcc       .
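
To make the lexicographic ordering concrete, here is a minimal Java sketch, assuming the standard HBase 2.x client is on the classpath and reusing the table and column family names from the example below; it scans the adjacent keys from "aabb" up to (but not including) "ba":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AdjacentScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("campaigns"))) {
            // Rows are sorted by the raw bytes of the row key, so this scan
            // only touches the contiguous range ["aabb", "ba") and returns
            // "aabb" and "abaa" without reading anything else.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("aabb"))
                .withStopRow(Bytes.toBytes("ba"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}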

Create Hbase table

create 'campaigns', {NAME => 'campaign', VERSIONS => 1, BLOCKCACHE => true, BLOCKSIZE => '65536', BLOOMFILTER => 'ROW'},
  {SPLITS => ['00','01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19',
              '20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39',
              '40','41','42','43','44','45','46','47','48','49','50','51','52','53','54','55','56','57','58','59',
              '60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79',
              '80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99']}

Here campaigns is the table name and campaign is the column family name. If you ignore the other options, that is all you need to create a table. So what does the rest mean? Let me go through it.
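
For reference, here is a rough Java equivalent of that shell command. It is only a sketch assuming the HBase 2.x Admin API; the connection configuration comes from the default classpath resources:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateCampaignsTable {
    public static void main(String[] args) throws IOException {
        // same split points as the shell example: "00" .. "99"
        byte[][] splits = new byte[100][];
        for (int i = 0; i < 100; i++) {
            splits[i] = Bytes.toBytes(String.format("%02d", i));
        }

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptorBuilder table =
                TableDescriptorBuilder.newBuilder(TableName.valueOf("campaigns"));
            table.setColumnFamily(ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("campaign"))
                .setMaxVersions(1)                 // VERSIONS => 1
                .setBlockCacheEnabled(true)        // BLOCKCACHE => true
                .setBlocksize(65536)               // BLOCKSIZE => '65536'
                .setBloomFilterType(BloomType.ROW) // BLOOMFILTER => 'ROW'
                .build());
            admin.createTable(table.build(), splits);
        }
    }
}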

BLOCKCACHE

With BLOCKCACHE enabled, HBase reads data block by block and keeps recently read blocks in memory, so one disk read can serve later requests for the same block. BLOCKSIZE controls how much data is read from disk in one go, so it is usually set close to the amount of data a typical read needs.

BLOOMFILTER

A Bloom filter can give false positives but never false negatives: when it says "yes" the row may or may not be there, but when it says "no" the row is definitely not there. HBase uses this small data structure to filter out most invalid read requests before touching the StoreFiles.
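
As a quick illustration of that property (not HBase's internal implementation), here is a small sketch using Guava's BloomFilter, assuming Guava is on the classpath:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomDemo {
    public static void main(String[] args) {
        // expect ~10,000 keys with a 1% false positive rate
        BloomFilter<String> filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
        filter.put("aabb");

        // "yes" only means "maybe": a small fraction of absent keys still pass
        System.out.println(filter.mightContain("aabb")); // true
        // "no" is definite: false here guarantees the key was never added
        System.out.println(filter.mightContain("zzzz")); // almost certainly false
    }
}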

SPLITS

HBase is designed to be distributed, but hot data shows up all the time, and the result is that a single region server gets overwhelmed with reads and writes while the advantages of distribution go to waste. By designing the regions in advance (pre-splitting) you can avoid hotspot problems and also save the time spent on region splits later.
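
One common way to make row keys actually spread across those pre-split '00'-'99' regions is to salt the key with a two-digit bucket prefix. This is a minimal sketch of that idea; the bucket count and key layout are assumptions for illustration, not the project's actual key design:

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
    private static final int BUCKETS = 100; // matches the '00'..'99' split points

    // Prefix the business key with a stable two-digit bucket so sequential
    // keys (timestamps, auto-increment ids) land on different regions.
    static byte[] saltedKey(String businessKey) {
        int bucket = (businessKey.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        return Bytes.toBytes(String.format("%02d", bucket) + "_" + businessKey);
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(saltedKey("campaign_20240101_000123")));
    }
}

The trade-off is that a range scan now has to fan out across all 100 buckets, so salting fits write-heavy or point-read workloads better than heavy range scanning.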

Why is there only one column family

A region contains multiple HStores, and each HStore corresponds to one column family. When the largest StoreFile in a region exceeds the threshold, HBase splits the whole region. If one column family is large and another is small, the small one is forced to split along with it and ends up scattered across many regions, which causes a lot of extra scanning when querying the small column family. That is why column families should be kept to one, or at least kept similar in size.

www.cnblogs.com/gaopeng527/…

Querying

put "table","rowkey","columFamily:colum","value"
get "table","rowkey","columFamily:colum"
Copy the code
count "campaigns"
scan "campaigns", {}
Copy the code
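
The same put and get look roughly like this from the Java client. It is a sketch assuming the HBase 2.x client; the row key and column names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PutAndGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("campaigns"))) {
            // put "campaigns", "00_rowkey", "campaign:name", "demo"
            Put put = new Put(Bytes.toBytes("00_rowkey"));
            put.addColumn(Bytes.toBytes("campaign"), Bytes.toBytes("name"), Bytes.toBytes("demo"));
            table.put(put);

            // get "campaigns", "00_rowkey", "campaign:name"
            Get get = new Get(Bytes.toBytes("00_rowkey"));
            get.addColumn(Bytes.toBytes("campaign"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("campaign"), Bytes.toBytes("name"))));
        }
    }
}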

For more commands, see: www.cnblogs.com/nexiyi/p/hb…

When truncate is used to clear a table, all the pre-split regions are removed as well, so it is better to drop the table and re-create it with the splits.

Best practices

Best practices are usually accumulated by stepping on pitfalls.

Phoenix

The background: the service uses Python to read HBase, but some calculations cannot be done on the ROWKEY alone (HBase coprocessors could do it), and secondary indexes are needed. After some research I found Phoenix, the "magic tool": it builds a SQL layer on top of HBase and ships with its own HBase coprocessors, which saves the time of writing that code yourself. So I decided: it's you, Pikachu. And then the pitfalls began.

Pitfall 1: data types

We all know that in MySQL you can declare a variety of column types, such as Integer and BigDecimal. HBase, on the other hand, stores everything as raw bytes, for example byte[] s = Bytes.toBytes("r1").

Phoenix requires typed fields to be stored with a fixed-length byte encoding. You can use github.com/apache/phoe… to encode values with the right type and length when putting them in, or use the functions Phoenix provides at read time, such as select SUM(TO_NUMBER("impressions")) from "data_report_campaigns".
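
Reading through Phoenix happens over JDBC. A minimal sketch, assuming the Phoenix driver is on the classpath and using a placeholder ZooKeeper address together with the table and column names from the query above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSum {
    public static void main(String[] args) throws Exception {
        // "jdbc:phoenix:<zookeeper quorum>" is the standard thick-driver URL form
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement();
             // TO_NUMBER converts the raw byte value at read time
             ResultSet rs = stmt.executeQuery(
                 "SELECT SUM(TO_NUMBER(\"impressions\")) FROM \"data_report_campaigns\"")) {
            if (rs.next()) {
                System.out.println(rs.getBigDecimal(1));
            }
        }
    }
}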

Pitfall 2: secondary indexes

We assumed secondary indexes would behave the same as in MySQL, but we were too naive. A Phoenix secondary index actually creates a new table and uses the indexed column as the ROWKEY of that table, similar in spirit to an inverted index.



The painful part, however, is that if the query touches a column that is not covered by the index, the index will not be used.

In addition, the secondary index is maintained by Phoenix through an HBase coprocessor, so only writes that go through Phoenix keep the index up to date. If data is inserted directly through the HBase API, you have to update the index table yourself.
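
For a feel of how this looks, here is a hedged sketch over the same JDBC connection; the index name and the "campaign_name" and "clicks" columns are hypothetical, not from the project:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PhoenixIndexDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // The index is itself an HBase table whose row key starts with "campaign_name".
            // INCLUDE copies "impressions" into the index table, so queries that read
            // only these two columns can be answered from the index alone.
            stmt.execute("CREATE INDEX idx_campaign_name ON \"data_report_campaigns\" "
                       + "(\"campaign_name\") INCLUDE (\"impressions\")");

            // Uses the index: every referenced column is covered by it.
            stmt.executeQuery("SELECT \"impressions\" FROM \"data_report_campaigns\" "
                            + "WHERE \"campaign_name\" = 'demo'");

            // Does not use the index: "clicks" is not covered, so Phoenix falls back
            // to scanning the data table (unless the query is hinted).
            stmt.executeQuery("SELECT \"clicks\" FROM \"data_report_campaigns\" "
                            + "WHERE \"campaign_name\" = 'demo'");
        }
    }
}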

Phoenix 4.7 has a bug

The row count in HBase was inconsistent with the count reported by Phoenix: HBase had only 600,000 rows while Phoenix reported 8,000,000. At first I thought high concurrency had corrupted the data, but I could not reproduce it. The suspicion is a timing-related bug around manually triggered HBase major compactions.

The fix: delete from system.stats where physical_name = 'cust.profile'; Link: stackoverflow.com/questions/3…

When a major compaction runs, Phoenix does not update its statistics at the same time, so the stale statistics rows have to be deleted.

Using HBase coprocessors

I was resistant to coprocessors at first, because putting program logic inside the data store always feels like it might affect the data. Later I found that TiDB and other big data computing frameworks follow the same idea of pushing computation down to the data source: it avoids the bandwidth cost of shipping large amounts of data around and is therefore faster. There are two types of coprocessors:

  • Endpoint: implements computation on the server side, similar to a stored procedure
  • Observer: an observer (hook) pattern, which is how things like secondary indexes are implemented; a minimal sketch follows the demo link below

Usage demo: www.ibm.com/developerwo…
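
As a flavor of the Observer type, here is a minimal sketch assuming the HBase 2.x coprocessor API. It only logs every Put before it is applied, which is the same hook point a secondary-index implementation would use:

import java.io.IOException;
import java.util.Optional;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.wal.WALEdit;

public class PutLoggingObserver implements RegionCoprocessor, RegionObserver {

    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    // Called by the region server before every Put; a secondary-index
    // implementation would write the index row here instead of logging.
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        System.out.println("about to write row " + Bytes.toString(put.getRow()));
    }
}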

Other relevant best practices

www.cnblogs.com/davidwang45…

System monitoring

HBase disaster recovery, performance monitoring, and how to upgrade the system smoothly.

Performance optimization

HDFS configuration optimization, HBase server optimization (GC, compaction, and hardware configuration), column family design optimization, and client optimization.

Overall understanding

Without real hands-on practice, there is not much useful to say about system monitoring and performance tuning. After learning how to use a system and its best practices, the next step is to understand its design: the overall architecture and the underlying data structures.

In a nutshell, that means understanding:

  • What does the HBase write path look like?
  • What does the HBase read path look like?

There is too much to cover here. Let me start with an overall HBase architecture diagram and go through the rest slowly.