Table design in HBase differs from table design in a relational database in many ways. HBase introduces concepts such as the Rowkey, the column family, and the timestamp, so designing the table structure carefully is very important.

Creating a table

HBase uses the Rowkey, the column family (together with the column), and the timestamp to identify a single cell of data.
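One way to picture this addressing scheme is as a nested sorted map. The sketch below is a plain-Java toy model, not HBase client code; the class name `CellModel` and its methods are hypothetical and exist only for illustration.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of HBase addressing (not the HBase client API):
// rowkey -> "family:qualifier" -> timestamp -> value
final class CellModel {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // A put with a new timestamp adds a version instead of replacing the old one.
    void put(String rowkey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowkey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Reads the newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowkey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowkey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).firstEntry().getValue();
    }

    // Reads one specific version of a cell, or null if that version is absent.
    String getVersion(String rowkey, String column, long timestamp) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowkey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).get(timestamp);
    }
}
```

In this model, writing "info:name" for the same row at timestamps 1 and 2 leaves both values readable: `getLatest` returns the newer one and `getVersion` can still fetch the older one.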

This is completely different from a relational database:

| Attribute | HBase | RDBMS |
|---|---|---|
| Data types | Strings only | Rich data types |
| Data manipulation | Simple add, delete, modify, and query; no join support | Various functions and table joins |
| Storage mode | Column-based storage | Based on table structure and row storage |
| Data protection | The old version remains after an update | The old value is replaced |
| Scalability | Easy to add nodes, high compatibility | Needs an intermediate layer and sacrifices functionality |

Therefore, an HBase table design needs to answer the following questions:

1. How many column families should the table have

2. What data does each column family hold

3. How many columns are in each column family

4. What are the column names

5. What data should be stored in each cell

6. How many versions are kept for each cell

7. What is the Rowkey structure and what information should it contain

Points to note:

1. Join

HBase has no join. Instead, a wide table structure can keep the related data in the same row, so that no join is needed.

2. Rowkey

Rowkey design is very important because HBase stores rows sorted by Rowkey, so both the leading part and the trailing part of the key must be chosen carefully.

You can create and modify tables with the HBase Shell or the Java API:

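    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    Configuration config = HBaseConfiguration.create();
    // Admin is an interface, so it is obtained from a Connection rather than constructed.
    Connection connection = ConnectionFactory.createConnection(config);
    Admin admin = connection.getAdmin();
    TableName table = TableName.valueOf("myTable");
    admin.disableTable(table);
    HColumnDescriptor cf1 = ...;
    admin.addColumn(table, cf1);      // adding new ColumnFamily
    HColumnDescriptor cf2 = ...;
    admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily
    admin.enableTable(table);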

Rowkey design

A Rowkey is an opaque byte array. Rows are stored in the table sorted lexicographically by Rowkey.

Each Region serves a contiguous range of rows delimited by Rowkeys, and HFiles store rows on disk in sorted order, so Rowkey design greatly affects HBase performance.

The Rowkey acts as the index. If you don’t know the Rowkey, you can only scan the entire table, and performance deteriorates significantly.

Take a ranking of popular movies as an example:

1. Sorting Rowkeys from largest to smallest

HBase natively sorts Rowkeys only from smallest to largest. To read the largest values first, store Rowkey = Integer.MAX_VALUE - Rowkey and convert the value back at the application layer.
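The subtraction trick can be sketched as follows. This is a minimal illustration, assuming the value being ranked is a non-negative `int` (called `score` here, a hypothetical name) and that it is stored as a fixed-width 4-byte big-endian key; the class `DescendingRowkey` is not part of HBase.

```java
import java.nio.ByteBuffer;

final class DescendingRowkey {
    // Encodes a non-negative score so that larger scores sort first.
    // Assumes 0 <= score <= Integer.MAX_VALUE; the key is 4 bytes, big-endian.
    static byte[] encode(int score) {
        if (score < 0) {
            throw new IllegalArgumentException("score must be non-negative");
        }
        return ByteBuffer.allocate(Integer.BYTES).putInt(Integer.MAX_VALUE - score).array();
    }

    // Recovers the original score at the application layer.
    static int decode(byte[] rowkey) {
        return Integer.MAX_VALUE - ByteBuffer.wrap(rowkey).getInt();
    }

    // Unsigned byte-wise lexicographic comparison, the kind of ordering applied to Rowkeys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this encoding, the key for a score of 9000 compares lower than the key for a score of 15, so an ascending scan returns the most popular entry first. The fixed width matters: keys of different lengths would not sort numerically.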

2. Hash the Rowkey as much as possible

The Rowkey should be hashed as much as possible so that data does not pile up in a single Region and reads and writes are not concentrated in one place.

For example, if the Rowkey is the concatenated string userid_videoid, rows can end up distributed unevenly across users.

There are three ways to solve the problem: reverse the userID; hash the userID; or take the userID modulo some number, compute the MD5 of the result, and add the first six characters of the digest in front of the userID.
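The three options can be sketched in plain Java as below. The class `RowkeySpreading`, the `buckets` parameter, and the exact key layout are assumptions made for illustration; only the general techniques come from the text above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RowkeySpreading {
    // Option 1: reverse the userID so that its fastest-changing digits lead the key.
    static String reversed(String userId) {
        return new StringBuilder(userId).reverse().toString();
    }

    // Option 2: replace the userID with a hash of it (MD5 as lowercase hex here).
    static String hashed(String userId) {
        return md5Hex(userId);
    }

    // Option 3: take userID modulo `buckets` (an assumed parameter), MD5 the result,
    // and put the first six hex characters in front of the original key.
    static String prefixed(long userId, String videoId, int buckets) {
        String bucket = Long.toString(Math.floorMod(userId, (long) buckets));
        return md5Hex(bucket).substring(0, 6) + userId + "_" + videoId;
    }

    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Because the prefix in option 3 is derived from the userID, a reader can recompute it to locate a given user's rows. The trade-off in all three options is that rows are no longer stored in userID order, so range scans over consecutive users stop being cheap.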

3. Keep the Rowkey as short as possible

If the Rowkey is too long, the storage overhead is high.

If the Rowkey is too long, memory is also used less efficiently and the index hit ratio decreases.
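To make the cost concrete, here is a minimal sketch comparing two encodings of the same numeric ID. The class `RowkeySize` and the sample ID are illustrative assumptions; the point is only that a fixed-width binary encoding can be much shorter than decimal text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RowkeySize {
    // A numeric ID written as decimal text: one byte per digit.
    static byte[] asText(long id) {
        return Long.toString(id).getBytes(StandardCharsets.UTF_8);
    }

    // The same ID as a fixed-width binary value: always 8 bytes.
    static byte[] asBinary(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    public static void main(String[] args) {
        long id = 1234567890123456789L;
        System.out.println("text bytes:   " + asText(id).length);
        System.out.println("binary bytes: " + asBinary(id).length);
    }
}
```

For a 19-digit ID the text form takes 19 bytes and the binary form 8. The binary form is harder to read in the HBase Shell, which is the usual price for the shorter key.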

Column family

A column family is a collection of columns. All members of a column family share the same prefix; for example, courses:history and courses:math are both members of the courses column family. The colon is the separator. The column family prefix must consist of printable characters, while the column name after it may be any byte array.

Column families must be declared when the table is created. Columns do not need to be declared, and users can create new columns at any time.

Rules of thumb:

  • Aim to keep region sizes between 10 and 50 GB.
  • Aim to keep cells no larger than 10 MB, or 50 MB if you use the MOB type. Otherwise, store the cell data in HDFS and store a pointer to the data in HBase.
  • A typical schema has 1 to 3 column families per table. HBase tables should not be designed to mimic RDBMS tables.
  • For tables with one or two column families, 50 to 100 regions is a reasonable number. Remember that a region is a contiguous segment of a column family.
  • Keep column family names as short as possible. The column family name is stored with every value (ignoring prefix encoding), so names should not be self-documenting and descriptive as in a typical RDBMS.
  • If you are storing time-based machine data or log information, and the row key is based on device ID or service ID + time, regions holding older data will never receive additional writes. You end up with a small number of active regions and a large number of regions that are no longer written to. In this case a larger number of regions is acceptable, because resource consumption depends only on the active regions.
  • If only one column family is written frequently, only that column family consumes memory. Keep write patterns in mind when allocating resources.
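The "device ID + time" row key mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions: device IDs have a fixed length, timestamps are non-negative epoch milliseconds, and the class `DeviceTimeKey` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DeviceTimeKey {
    // Builds rowkey = device ID bytes followed by an 8-byte big-endian timestamp.
    // Assumes fixed-length device IDs so that keys of different devices do not interleave.
    static byte[] rowkey(String deviceId, long epochMillis) {
        byte[] id = deviceId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + Long.BYTES).put(id).putLong(epochMillis).array();
    }

    // Unsigned byte-wise lexicographic comparison, the kind of ordering applied to Rowkeys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout all rows of one device are adjacent and ordered by time, so new writes for a device always land at the end of that device's key range while the earlier part of the range stays untouched.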

Examples

Shops and items

Shops and items have a many-to-many relationship.

RDBMS table structure design:

Shops table:

| Column name | Description |
|---|---|
| id | Primary key |
| name | Shop name |
| address | Address |
| regdate | Registration date |

Items table:

| Column name | Description |
|---|---|
| id | Primary key |
| name | Item name |
| price | Price |
| details | Item details |
| title | Display name |

Relation table:

| Column name | Description |
|---|---|
| shop_id | Shop primary key |
| item_id | Item primary key |
| type | Association type |

HBase table structure design:

Shops table:

Items table:
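One common way to model a many-to-many relationship in HBase, sketched here as an assumption rather than as the exact layout intended above, is to drop the relation table and give each table a column family whose column names are the IDs from the other side. The plain-Java model below stands in for the two tables; the class `ShopItemModel` and the family names `item` and `shop` are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ShopItemModel {
    // rowkey -> ("family:qualifier" -> value); each map stands in for one HBase table.
    final Map<String, Map<String, String>> shops = new TreeMap<>();
    final Map<String, Map<String, String>> items = new TreeMap<>();

    // Records the relation on both sides, replacing the RDBMS relation table.
    // The association type is stored as the cell value.
    void link(String shopId, String itemId, String type) {
        shops.computeIfAbsent(shopId, k -> new TreeMap<>()).put("item:" + itemId, type);
        items.computeIfAbsent(itemId, k -> new TreeMap<>()).put("shop:" + shopId, type);
    }

    // Lists the item IDs of one shop by reading a single row, with no join.
    Set<String> itemsOf(String shopId) {
        Set<String> result = new TreeSet<>();
        Map<String, String> row = shops.get(shopId);
        if (row == null) {
            return result;
        }
        for (String column : row.keySet()) {
            if (column.startsWith("item:")) {
                result.add(column.substring("item:".length()));
            }
        }
        return result;
    }
}
```

The relation is written twice, once per table, which is the usual denormalization cost of avoiding joins: reads touch one row, but every link has to be kept consistent in two places.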

Weibo users and fans

Users and fans have a one-to-many relationship.

RDBMS table structure design:

Users table:

| Column name | Description |
|---|---|
| id | Primary key |
| nickname | User name |

Fan relationship table:

| Column name | Description |
|---|---|
| user_id | User ID |
| fans_id | Fan ID |

HBase table structure design:
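As one possible layout, offered as an assumption and not necessarily the design intended here, the one-to-many relationship can be stored in a "tall" table whose Rowkey is user_id and fans_id joined by an underscore. All fans of a user are then adjacent and can be read with a prefix scan. The sketch uses a sorted map in place of an HBase table; the class `FanTable` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class FanTable {
    // Sorted rowkey -> nickname of the fan; stands in for an HBase table.
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Rowkey layout (assumed): "<userId>_<fanId>".
    void addFan(String userId, String fanId, String fanNickname) {
        rows.put(userId + "_" + fanId, fanNickname);
    }

    // Prefix scan: returns the fan IDs of all rows whose key starts with "<userId>_".
    List<String> fansOf(String userId) {
        String start = userId + "_";
        String stop = userId + (char) ('_' + 1);   // smallest string after the prefix range
        List<String> fans = new ArrayList<>();
        for (String key : rows.subMap(start, true, stop, false).keySet()) {
            fans.add(key.substring(start.length()));
        }
        return fans;
    }
}
```

The alternative is a "wide" layout with one row per user and one column per fan. The tall layout keeps rows small when a user has very many fans, at the cost of a scan instead of a single-row read.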
