HBase Introduction (4): Table Structure Design - RowKey

Table structure design in HBase differs from that of relational databases in many ways. HBase introduces several concepts of its own, such as the Rowkey, the column family, and the timestamp, so getting the table structure right matters a great deal.

Creating tables

HBase locates a piece of data by the combination of Rowkey, column family (plus column), and timestamp.

This is completely different from a relational database:

Attribute	HBase	RDBMS
Data types	Byte arrays only (typically strings)	Rich data types
Data operations	Simple insert, delete, update, and query; no joins	Rich functions and table joins
Storage model	Column-family-oriented storage	Row-oriented storage based on the table schema
Data retention	Old versions are kept after an update	Updates replace the old value
Scalability	Easy to add nodes, high compatibility	Needs an intermediate layer and sacrifices functionality

Therefore, an HBase table design needs to answer the following questions:

1. How many column families should the table have?

2. What data does each column family hold?

3. How many columns are in each column family?

4. What are the column names?

5. What data should be stored in each cell?

6. How many timestamped versions should each cell keep?

7. What is the Rowkey structure, and what information should it contain?

Points to note:

1. Joins

HBase has no joins, so related data is usually denormalized into one large table, with the linking keys stored in the row itself.

2. Rowkey

Rowkey design is very important: HBase stores rows sorted by Rowkey, so what goes into the prefix and the suffix of the key must be chosen carefully.

Tables can be created and altered with the HBase Shell or the Java API. For example, with the Java API:

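    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    // HBase 1.x client API. Admin is an interface, so it is obtained from a Connection.
    Configuration config = HBaseConfiguration.create();
    Connection connection = ConnectionFactory.createConnection(config);
    Admin admin = connection.getAdmin();
    TableName table = TableName.valueOf("myTable");

    admin.disableTable(table);

    HColumnDescriptor cf1 = ... ;
    admin.addColumn(table, cf1);      // adding new ColumnFamily

    HColumnDescriptor cf2 = ... ;
    admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

    admin.enableTable(table);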

Rowkey design

A Rowkey is an indivisible byte array, and rows are stored in the table in lexicographic order of their Rowkeys.

Each Region serves a contiguous range of Rowkeys, and HFiles store rows on disk in sorted order, so the Rowkey greatly affects HBase performance.

The Rowkey is the index. If you do not know the Rowkey, the only option is a full table scan, and performance deteriorates significantly.

Take a ranking of popular movies as an example:

1. Sorting Rowkeys from largest to smallest

HBase natively sorts Rowkeys only from smallest to largest. To get descending order, compute the key at the application layer as Rowkey = Integer.MAX_VALUE - value, so that larger values produce smaller keys.
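The subtraction trick can be sketched in plain Java. The class and method names below are hypothetical, and the sketch assumes the ranked value (for example a play count) is a non-negative int. The key is written as 4 big-endian bytes so that byte-wise lexicographic order agrees with numeric order.

```java
import java.nio.ByteBuffer;

public class DescendingKey {
    // Encode a non-negative count so that larger counts sort first
    // under ascending, byte-wise lexicographic order.
    static byte[] encode(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        int reversed = Integer.MAX_VALUE - count;
        // ByteBuffer is big-endian by default, so for non-negative ints
        // the byte order matches the numeric order.
        return ByteBuffer.allocate(Integer.BYTES).putInt(reversed).array();
    }

    // Recover the original count from an encoded key.
    static int decode(byte[] key) {
        return Integer.MAX_VALUE - ByteBuffer.wrap(key).getInt();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] popular = encode(9000);
        byte[] obscure = encode(12);
        // The more popular movie gets the smaller key, so a scan meets it first.
        System.out.println(compare(popular, obscure) < 0);
        System.out.println(decode(popular));
    }
}
```

A fixed-width binary encoding is used here because decimal strings of different lengths do not sort numerically ("9" sorts after "10").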

2. Spread Rowkeys out as much as possible

Rowkeys should be distributed (hashed) as evenly as possible, so that data does not pile up in a single Region and reads and writes are not concentrated in one place.

For example, if the Rowkey is the concatenation userid_videoid, the rows are grouped by user ID and the load across Regions can become uneven.

There are three ways to solve the problem: reverse the userID; hash the userID; or take the userID modulo some number, compute the MD5 digest of the result, and prepend the first six characters of the digest to the userID.
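The three approaches might look like the following in plain Java. This is only a sketch: the class name, method names, and the bucket count of 1000 are hypothetical, and reading "the first six" as the first six hex characters of the MD5 digest is an assumption.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowkeySpreading {
    // Option 1: reverse the user ID so the fast-changing low digits come first.
    static String reversed(String userId) {
        return new StringBuilder(userId).reverse().toString();
    }

    // Hex-encoded MD5 digest of a string (always 32 hex characters).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Option 2: replace the user ID with its hash.
    static String hashed(String userId) {
        return md5Hex(userId);
    }

    // Option 3: prefix the user ID with the first six hex characters
    // of MD5(userId mod buckets). The bucket count is an assumption.
    static String prefixed(long userId, int buckets) {
        String bucket = Long.toString(Math.floorMod(userId, (long) buckets));
        return md5Hex(bucket).substring(0, 6) + userId;
    }

    // Build the final Rowkey in the userid_videoid layout.
    static String rowkey(String userPart, String videoId) {
        return userPart + "_" + videoId;
    }

    public static void main(String[] args) {
        System.out.println(rowkey(reversed("10000123"), "v42"));
        System.out.println(rowkey(hashed("10000123"), "v42"));
        System.out.println(rowkey(prefixed(10000123L, 1000), "v42"));
    }
}
```

Reversing and prefixing keep the original userID recoverable from the key, while a pure hash does not, so the userID would then have to be stored in a column as well.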

3. Keep the Rowkey as short as possible

If the Rowkey is too long, the storage overhead is high, because the Rowkey is stored with every cell.

If the Rowkey is too long, memory is also used less efficiently and the index hit ratio drops.
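To make the length difference concrete, here is a small sketch that compares two encodings of the same numeric ID: a decimal string and a fixed-width 8-byte binary value. The class and method names are hypothetical, and the ID is an arbitrary example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RowkeyLength {
    // The ID written as decimal text, one byte per digit.
    static byte[] asText(long id) {
        return Long.toString(id).getBytes(StandardCharsets.UTF_8);
    }

    // The same ID written as a fixed-width 8-byte big-endian value.
    static byte[] asBinary(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    public static void main(String[] args) {
        long id = 1234567890123456789L;
        System.out.println("text bytes:   " + asText(id).length);
        System.out.println("binary bytes: " + asBinary(id).length);
    }
}
```

The binary form is shorter for large IDs but harder to read when inspecting keys by hand, so which one to use is a trade-off.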

Column family

A column family is a collection of columns whose names all share the same prefix. For example, courses:history and courses:math are both members of the courses column family. The colon is the separator. The column family prefix must consist of printable characters, while the column name after it may be any byte array.

Column families must be declared when the table is created. Columns do not need to be declared, and users can add new columns at any time.
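As a small illustration of the family:column naming, the following sketch splits a full column name at the first colon. The helper is hypothetical plain Java, not an HBase API, and it treats names as strings for simplicity.

```java
public class ColumnName {
    // Returns {family, column}. Only the first colon is treated as the
    // separator, because the part after it may itself contain colons.
    static String[] split(String fullName) {
        int idx = fullName.indexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("missing ':' in " + fullName);
        }
        return new String[] {
            fullName.substring(0, idx),
            fullName.substring(idx + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("courses:history");
        System.out.println("family = " + parts[0] + ", column = " + parts[1]);
    }
}
```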

Rules of thumb:

Aim for region sizes between 10 and 50 GB.
Aim for cells no larger than 10 MB, or 50 MB if you use the MOB type. Otherwise, store the cell data in HDFS and store pointers to the data in HBase.
A typical schema has 1 to 3 column families per table. HBase tables should not be designed to mimic RDBMS tables.
For a table with one or two column families, 50-100 regions is a good number. Remember that a region is a contiguous segment of a column family.
Keep column family names as short as possible. The column family name is stored with every value (ignoring prefix encoding), so the names should not be self-documenting and descriptive as in a typical RDBMS.
If you are storing time-based machine data or log information, and the row key is based on device ID or service ID plus time, the regions holding older data eventually stop receiving writes. You end up with a small number of active regions and a large number of regions that are no longer written to. In this case a larger number of regions is acceptable, because resource consumption is driven only by the active regions.
If only one column family is written to frequently, only that column family consumes memory. Keep the write pattern in mind when allocating resources.
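The "device ID plus time" row key mentioned above could be composed as in the following sketch. The names are hypothetical, and the sketch assumes fixed-length device IDs and non-negative timestamps so that byte-wise order groups rows by device and then by time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DeviceTimeKey {
    // Rowkey = device ID bytes followed by an 8-byte big-endian timestamp.
    static byte[] rowkey(String deviceId, long epochMillis) {
        byte[] id = deviceId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(epochMillis)
                .array();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] older = rowkey("dev01", 1000L);
        byte[] newer = rowkey("dev01", 2000L);
        byte[] other = rowkey("dev02", 1000L);
        // Within one device, newer rows sort after older rows.
        System.out.println(compare(older, newer) < 0);
        // All rows of dev01 sort before any row of dev02.
        System.out.println(compare(newer, other) < 0);
    }
}
```

With this layout new rows for a device always land at the end of that device's key range, which is why the ranges holding older data stop receiving writes.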

Examples

Shops and items

Shops and items have a many-to-many relationship.

RDBMS table structure design:

Shops table:

Column name	Description
id	Primary key
name	Shop name
address	Address
regdate	Registration date

Items table:

Column name	Description
id	Primary key
name	Item name
price	Price
details	Item details
title	Display name

Association table:

Column name	Description
shop_id	Shop primary key
item_id	Item primary key
type	Association type

HBase table structure design:

Shops table:

Items table:

Weibo users and fans

Users and fans have a one-to-many relationship.

RDBMS table structure design:

Users table:

Column name	Description
id	Primary key
nickname	User name

Fan mapping table:

Column name	Description
user_id	User ID
fans_id	Fan ID

HBase table structure design:
