Table design in HBase differs from table design in relational databases in many ways. HBase introduces several concepts, such as the Rowkey, column families, and timestamps, so designing the table structure well is very important.
Creating a table
HBase locates a value by the combination of its Rowkey, column family (with column qualifier), and timestamp.
This is completely different from a relational database:
Attribute | HBase | RDBMS |
---|---|---|
Data types | Strings only | Rich data types |
Data operations | Simple create, read, update, and delete; joins are not supported | A wide range of functions and table joins |
Storage model | Column-oriented storage | Table-structured, row-oriented storage |
Data protection | Old versions are retained after an update | Values are replaced on update |
Scalability | Nodes are easy to add; highly compatible | Requires a middleware layer and sacrifices functionality |
Therefore, when designing an HBase table, consider the following questions:
1. How many column families should the table have?
2. What data does each column family hold?
3. How many columns are in each column family?
4. What are the column names?
5. What data should be stored in each cell?
6. How many timestamped versions are kept for each cell?
7. What is the Rowkey structure, and what information should it contain?
Points to note:
1. Join
HBase has no join operation, so related data is usually denormalized into one wide table, with the necessary key fields added to each row record.
2. Rowkey
Rowkey design is very important because HBase stores rows in sorted order, so both the prefix and the suffix of the key must be considered.
Tables can be created and modified with the HBase shell or the Java API:
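```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

Configuration config = HBaseConfiguration.create();
// Admin is an interface, so it is obtained from a Connection rather than with `new`.
Connection connection = ConnectionFactory.createConnection(config);
Admin admin = connection.getAdmin();
TableName table = TableName.valueOf("myTable");

admin.disableTable(table);

// addColumn/modifyColumn are the HBase 1.x names; 2.x renames them to
// addColumnFamily/modifyColumnFamily.
HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

admin.enableTable(table);
```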
Rowkey design
A Rowkey is an uninterpreted byte array, and rows are stored in a table in lexicographic order of their Rowkeys.
Each Region serves a contiguous range of rows determined by Rowkey, and HFiles store rows on disk in sorted order, so the Rowkey greatly affects HBase performance.
The Rowkey is effectively the index: if you do not know the Rowkey, you can only scan the entire table, and performance deteriorates significantly.
The following uses popular videos as an example:
1. Sorting Rowkeys from largest to smallest
HBase natively sorts Rowkeys only from smallest to largest. To get the reverse order, store Rowkey = Integer.MAX_VALUE - Rowkey at the application layer.
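The subtraction trick only works if the resulting keys also compare correctly as strings or bytes, which in practice means a fixed width. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and a `TreeSet` stands in for HBase's ascending Rowkey order.

```java
import java.util.Locale;
import java.util.TreeSet;

final class DescendingRowkey {
    // Integer.MAX_VALUE has 10 decimal digits; zero-padding to that width
    // keeps lexicographic order identical to numeric order.
    private static final int WIDTH = 10;

    /** Maps a non-negative id to a key that sorts in the opposite order. */
    static String reversedKey(int id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", Integer.MAX_VALUE - id);
    }

    /** Recovers the original id from a reversed key. */
    static int originalId(String key) {
        return Integer.MAX_VALUE - Integer.parseInt(key);
    }

    public static void main(String[] args) {
        // A TreeSet of strings iterates in ascending lexicographic order,
        // standing in here for HBase's Rowkey ordering.
        TreeSet<String> keys = new TreeSet<>();
        for (int id : new int[] {3, 100, 42}) {
            keys.add(reversedKey(id));
        }
        for (String key : keys) {
            System.out.println(key + " -> " + originalId(key));
        }
    }
}
```

Iterating the set in ascending key order visits the ids from largest to smallest, which is the effect the application-layer transformation is meant to achieve.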
2. Spread Rowkeys as evenly as possible
Rowkeys should be spread out (hashed) as much as possible so that data does not accumulate in a single Region and reads and writes do not concentrate on one hotspot.
For example, if the Rowkey is the concatenated string userid_videoid, the data will be distributed unevenly across users.
There are three ways to solve the problem:
- Reverse the userID.
- Hash the userID.
- Take the userID modulo some number, hash the result with MD5, and add the first six characters of the digest to the userID as a prefix.
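The first and third approaches can be sketched in plain Java as follows. This is only an illustration: the class name, the method names, the bucket count, and the reading of "first six" as six hexadecimal characters of the digest are all assumptions rather than details given above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RowkeyDistribution {
    /** Reverses the userID so that its fastest-changing digits come first. */
    static String reversed(String userId) {
        return new StringBuilder(userId).reverse().toString();
    }

    /** Hex-encoded MD5 digest of the input (32 lowercase hex characters). */
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(Character.forDigit((b >> 4) & 0xf, 16));
                hex.append(Character.forDigit(b & 0xf, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Prefixes the userID with six hex characters derived from userId mod buckets. */
    static String salted(long userId, int buckets) {
        String bucket = Long.toString(Math.floorMod(userId, (long) buckets));
        return md5Hex(bucket).substring(0, 6) + "_" + userId;
    }

    /** Builds the final Rowkey from the transformed user part and the video id. */
    static String rowkey(String userPart, String videoId) {
        return userPart + "_" + videoId;
    }
}
```

Because the prefix depends only on the bucket, the number of distinct prefixes is bounded by the bucket count, so a reader that needs every user can still issue one scan per prefix. Reversing is cheaper but only helps when the low-order digits of the userID are well distributed.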
3. Keep the Rowkey as short as possible
If the Rowkey is too long, the storage overhead is high.
If the Rowkey is too long, memory is used less efficiently and the index hit ratio drops.
Column family
A column family is a collection of columns whose members all share the same prefix. For example, courses:history and courses:math are both members of the courses column family. The colon separates the column family from the column qualifier. The column family prefix must consist of printable characters, while the column qualifier may be any byte array.
Column families must be declared when the table is created. Columns do not need to be declared, and users can create new columns at any time.
Rules of thumb:
- Aim for region sizes between 10 and 50 GB.
- Aim for cells no larger than 10 MB, or 50 MB if you use the MOB type. Otherwise, store the cell data in HDFS and store a pointer to the data in HBase.
- A typical schema has 1 to 3 column families per table. HBase tables should not be designed to mimic RDBMS tables.
- For tables with one or two column families, around 50-100 regions is appropriate. Remember that a region is a contiguous segment of a column family.
- Keep column family names as short as possible. The column family name is stored with every value (ignoring prefix encoding), so names should not be self-documenting and descriptive as in a typical RDBMS.
- If you are storing time-based machine data or log information, and the row key is based on device ID or service ID plus time, older data regions eventually stop receiving writes. You end up with a small number of active regions and a large number of regions that are no longer written to. In this case, a larger number of regions is acceptable because resource consumption depends only on the active regions.
- If only one column family is written to frequently, only that column family consumes memory. Keep write patterns in mind when allocating resources.
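To see why short names matter, it helps to estimate the size of a single stored cell. The sketch below assumes the classic KeyValue layout, in which every cell repeats the Rowkey, column family, and qualifier next to a fixed set of length and timestamp fields; it ignores compression, data block encoding, and tags, and the class name and sample values are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CellSizeEstimate {
    // Fixed per-cell bytes, ASSUMING the classic KeyValue layout:
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Approximate uncompressed size of one cell, in bytes. */
    static long cellBytes(String rowkey, String family, String qualifier, int valueBytes) {
        return FIXED_OVERHEAD
                + utf8Length(rowkey)
                + utf8Length(family)
                + utf8Length(qualifier)
                + valueBytes;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String rowkey = "0123456789abcdef"; // hypothetical 16-byte Rowkey
        long verbose = cellBytes(rowkey, "courses", "math", 8);
        long terse = cellBytes(rowkey, "c", "math", 8);
        System.out.println("courses:math -> " + verbose + " bytes per cell");
        System.out.println("c:math       -> " + terse + " bytes per cell");
    }
}
```

Under these assumptions an 8-byte value costs 55 bytes with the family name courses and 49 bytes with c. The same arithmetic applies to the Rowkey, which is also repeated in every cell of the row.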
Examples
Shops and items
Shops and items have a many-to-many relationship.
RDBMS table structure design:
Shops table:
Column name | Description |
---|---|
id | Primary key |
name | Shop name |
address | Address |
regdate | Registration date |
Items table:
Column name | Description |
---|---|
id | Primary key |
name | Item name |
price | Price |
details | Item details |
title | Display name |
Relationship table:
Column name | Description |
---|---|
shop_id | Shop primary key |
item_id | Item primary key |
type | Association type |
HBase table structure design:
Shops table:
Items table:
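As a rough illustration of how a many-to-many relationship is often denormalized into wide rows in HBase, the sketch below models a table as a sorted in-memory map. It is one possible layout under assumed names (an info family for attributes and an item family whose qualifiers are item ids), not necessarily the design intended here, and it does not use the HBase client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WideRowSketch {
    // rowkey -> ("family:qualifier" -> value); TreeMap keeps both levels sorted,
    // loosely mirroring how HBase orders rows and columns.
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String rowkey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowkey, k -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    /** Returns the qualifiers present in one column family of one row. */
    List<String> qualifiers(String rowkey, String family) {
        List<String> result = new ArrayList<>();
        TreeMap<String, String> row = rows.get(rowkey);
        if (row == null) {
            return result;
        }
        String prefix = family + ":";
        for (String column : row.tailMap(prefix).keySet()) {
            if (!column.startsWith(prefix)) {
                break; // left the requested column family
            }
            result.add(column.substring(prefix.length()));
        }
        return result;
    }

    public static void main(String[] args) {
        WideRowSketch shops = new WideRowSketch();
        shops.put("shop1", "info", "name", "Corner Store");
        // Each related item becomes a column; the value holds the association type.
        shops.put("shop1", "item", "item9", "on_sale");
        shops.put("shop1", "item", "item7", "on_sale");
        System.out.println(shops.qualifiers("shop1", "item"));
    }
}
```

An items table built the same way, with a shop family keyed by shop id, would answer the reverse question without a join, at the cost of writing each association twice.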
Weibo users and fans
Users and fans have a one-to-many relationship.
RDBMS table structure design:
Users table:
Column name | Description |
---|---|
id | Primary key |
nickname | User name |
Fan relationship table:
Column name | Description |
---|---|
user_id | User ID |
fans_id | Fan ID |
HBase table structure design:
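One common way to store a one-to-many relationship is a tall table whose Rowkey is the user id followed by the fan id, so that all fans of a user sit next to each other and can be read with a prefix scan. The sketch below is a hedged illustration of that idea with a sorted in-memory map; the key format, class name, and the assumption that ids never contain an underscore are mine, not taken from the design above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class FansPrefixScan {
    // Sorted rowkey -> value map standing in for an HBase table.
    private final TreeMap<String, String> table = new TreeMap<>();

    /** Stores one row per (user, fan) pair; ids are assumed not to contain '_'. */
    void addFan(String userId, String fanId) {
        table.put(userId + "_" + fanId, "");
    }

    /** Reads all fans of a user with a range scan over the shared key prefix. */
    List<String> fansOf(String userId) {
        String start = userId + "_";
        // The character after '_' gives the smallest string that is greater
        // than every key beginning with the prefix, so it serves as the stop key.
        String stop = userId + (char) ('_' + 1);
        List<String> fans = new ArrayList<>();
        for (String key : table.subMap(start, stop).keySet()) {
            fans.add(key.substring(start.length()));
        }
        return fans;
    }

    public static void main(String[] args) {
        FansPrefixScan fans = new FansPrefixScan();
        fans.addFan("u1", "f2");
        fans.addFan("u1", "f1");
        fans.addFan("u10", "f3");
        System.out.println(fans.fansOf("u1"));
    }
}
```

The separator keeps u1 and u10 apart: keys for u10 sort before the u1_ prefix, so they never fall inside the scanned range.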