HBase is well suited to storing massive, PB-scale data sets (on the order of 10 billion records). A query by RowKey typically returns within tens to hundreds of milliseconds.

So how does HBase do this?

The rest of this article walks through the idea behind a data query and the process it follows.

Query process:

Step 1:

Suppose a project has 10 billion business records stored in an HBase cluster made up of multiple data nodes (servers). Each data node hosts several Regions, and each Region holds a contiguous, RowKey-sorted slice of the table's data (with the figures used below, about 2 million records each).

Now we query a record by its primary key, the RowKey. HBase quickly locates the data node that holds the record and the Region within that node. The client does this through the hbase:meta table, which maps RowKey ranges to Regions, rather than by asking the Master on every read. Suppose the 10 billion records occupy 10 TB and are split across 5,000 Regions, so each Region is about 2 GB.

Because the record lives in exactly one Region, only that Region's 2 GB of data needs to be searched.
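The Region lookup in Step 1 can be pictured as a binary search over sorted Region start keys. The sketch below is only an illustration of that idea: the start keys, server names, and the locate_region function are made up for the example and are not HBase APIs.

```python
import bisect

# Hypothetical region table: region i covers [start key i, start key i+1).
# The first region starts at the empty key, so every RowKey falls in some region.
REGION_START_KEYS = ["", "row-2000000", "row-4000000", "row-6000000", "row-8000000"]
REGION_SERVERS = ["server-a", "server-b", "server-c", "server-a", "server-b"]


def locate_region(row_key):
    """Return (region index, server name) for the region whose range holds row_key."""
    # bisect_right returns the position of the first start key greater than
    # row_key; the region just before that position contains the key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


print(locate_region("row-4500000"))
```

Because the start keys are sorted, the lookup cost grows with the logarithm of the number of Regions, not with the number of records.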

Step 2:

HBase stores data by column family. For example, suppose a record has 300 fields: the first 100 relate to personnel information and form one column family (a set of columns), the middle 100 relate to company information and form a second column family, and the last 100 relate to personal transaction information and form a third.

The column families are stored separately. This storage structure is what allows HBase tables to be very wide, up to millions of columns (fields).

Suppose, for round numbers, that a 2 GB Region is divided into four column families of equal size. Each column family is then 500 MB.

At this point, we only need to search the 500 MB column family that holds the requested fields.
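A toy model of Step 2, assuming one separate store per column family. The family names, column names, and the get_columns helper are invented for illustration. The point is only that a read for one family never touches the others.

```python
# Hypothetical layout: one separate store per column family, keyed by RowKey.
stores = {
    "person": {"row-001": {"name": "Li", "age": "30"}},
    "company": {"row-001": {"company_name": "Acme"}},
    "trade": {"row-001": {"amount": "99.5"}},
}

families_read = []  # records which column-family stores a query touched


def get_columns(row_key, family):
    """Read one row's columns from a single column family's store only."""
    families_read.append(family)
    return stores[family].get(row_key, {})


company = get_columns("row-001", "company")
print(company)
print(families_read)
```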

Step 3:

The column family that holds the record is stored, at the lowest level, as one or more HFiles.

An HFile can be thought of as the finer-grained storage file underneath a column family.

If a typical HFile is 100 MB, the 500 MB column family consists of five HFiles.

HBase keeps data sorted both in memory and on disk. The record may turn up in the first HFile examined or in the last, so on average about half of them, roughly 2.5 HFiles or 250 MB, need to be searched.
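Step 3 can be sketched as checking sorted files one at a time. This is a simplified model, not HBase's actual read path: each "HFile" here is just a pair of parallel sorted lists, and find_in_store is a hypothetical helper that stops at the first file containing the key.

```python
import bisect


def find_in_hfile(hfile, row_key):
    """Binary-search one sorted file; return the value, or None if absent."""
    keys, values = hfile
    i = bisect.bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return values[i]
    return None


def find_in_store(hfiles, row_key):
    """Check files in order; return (value, number of files examined)."""
    for examined, hfile in enumerate(hfiles, start=1):
        keys, _ = hfile
        # Only search a file whose [first key, last key] range can hold row_key.
        if keys and keys[0] <= row_key <= keys[-1]:
            value = find_in_hfile(hfile, row_key)
            if value is not None:
                return value, examined
    return None, len(hfiles)


hfiles = [
    (["a", "c"], ["1", "3"]),
    (["b", "d"], ["2", "4"]),
]
print(find_in_store(hfiles, "d"))
```

Because every file is sorted, each per-file search is a binary search instead of a full scan, and files whose key range cannot contain the RowKey are skipped outright.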

Step 4:

Each HFile stores data as key/value pairs. To find a record, only the keys need to be examined to check whether they match the query condition.

Keys are generally short. Assuming a key-to-value size ratio of 1:25, only about 10 MB of key data out of the 250 MB needs to be read to find the record.

On a mechanical hard disk with a read speed of about 100 MB/s, reading that data takes about 0.1 seconds.

On an SSD, assuming roughly ten times that throughput, it takes about 0.01 seconds.
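The arithmetic of the four steps can be checked in a few lines. All figures come from the walkthrough above, in decimal units. The SSD throughput of 1,000 MB/s is an assumption, chosen because it is the speed implied by the 0.01-second figure.

```python
# Figures from the walkthrough, in decimal units (1 TB = 1,000,000 MB).
TOTAL_MB = 10 * 1_000_000    # 10 TB of data in total
REGIONS = 5000               # number of Regions
FAMILIES_PER_REGION = 4      # column families per Region
HFILE_MB = 100               # typical HFile size
AVG_HFILES_SEARCHED = 2.5    # average number of HFiles searched
KEY_TO_VALUE_RATIO = 25      # 1:25 key/value ratio
HDD_MB_PER_S = 100           # mechanical disk read speed
SSD_MB_PER_S = 1000          # assumed SSD read speed (not stated in the text)

region_mb = TOTAL_MB / REGIONS                  # step 1: one Region
family_mb = region_mb / FAMILIES_PER_REGION     # step 2: one column family
searched_mb = AVG_HFILES_SEARCHED * HFILE_MB    # step 3: HFiles searched
key_mb = searched_mb / KEY_TO_VALUE_RATIO       # step 4: key data only
hdd_seconds = key_mb / HDD_MB_PER_S
ssd_seconds = key_mb / SSD_MB_PER_S

print(region_mb, family_mb, searched_mb, key_mb)
print(hdd_seconds, ssd_seconds)
```

Each step shrinks the data to be read by a large factor, which is how 10 TB is narrowed down to about 10 MB.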

HBase also has an in-memory caching mechanism. If the data is already cached in memory, the query is faster still.
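The effect of caching can be illustrated with a tiny least-recently-used cache. This is a minimal sketch, not HBase's cache implementation: the TinyBlockCache class, the block ids, and the read_block_from_disk function are all hypothetical, and plain LRU eviction is an assumption made for the example.

```python
from collections import OrderedDict


class TinyBlockCache:
    """Minimal LRU cache standing in for an in-memory block cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block id -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, block_id, load_from_disk):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        block = load_from_disk(block_id)       # slow path: read from disk
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return block


disk_reads = []


def read_block_from_disk(block_id):
    """Pretend disk read that records which blocks were actually fetched."""
    disk_reads.append(block_id)
    return f"data-for-{block_id}"


cache = TinyBlockCache(capacity=2)
for block_id in ["b1", "b2", "b1", "b3", "b2"]:
    cache.get(block_id, read_block_from_disk)
print(cache.hits, cache.misses, disk_reads)
```

A repeated read of a block that is still cached skips the disk entirely, which is why frequently queried rows return faster than the disk estimates above suggest.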


Conclusion:

This is why HBase query performance does not degrade noticeably as the data volume grows: each step narrows the search to a slice of roughly fixed size, however large the table is.

In addition, HBase is a column-oriented database (through its column family mechanism). If a table has a large number of fields, they can be grouped into column families that are stored independently, so a query that needs only some fields reads only the files of the relevant column family.

This layered storage structure, combined with distributed storage, is what keeps queries over massive HBase data efficient.
