Abstract: The HBase native API gives users ultimate control, but it also brings high development and learning costs. SQL is a good solution to this problem. This article starts with why SQL is needed, then explains SQL on HBase, then focuses on the optimizations and improvements in Ali-HBase SQL, and closes with an outlook on future work.

Abstract: At the HBase session of the 2017 Cloud Computing Conference, Tian Mu of Alibaba gave a talk on SQL practice and improvements in Ali-HBase. The talk starts with why SQL is needed, then explains SQL on HBase, then focuses on the optimizations and improvements in Ali-HBase SQL, and closes with an outlook on future work.

Here are the highlights:

Why do you need SQL?

Hashing

Bucketing

HBase Native API implementation

SQL on HBase

Alipay intelligent search dump platform

Product reports

IoT device information storage

Ali-HBase SQL

Performance optimization

The goal is to push the performance of simple requests as far as possible, so that SQL stays within 5% of the HBase native API. The gap between SQL and the HBase API is most visible in single-row read and write scenarios, so that is where the work is focused. The main measure is a client-side metadata cache. The metadata covers column names, data types, table attributes, index information, and so on. The metadata is not refreshed on every request. Instead, the client refreshes it periodically: it uses the version number to check whether its cached copy is the latest, and fetches the latest version only if it is not. This cache update strategy is an optimization for UPSERT.
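
The refresh policy described above can be sketched roughly as follows. This is only an illustration, not the Ali-HBase SQL client code: the class names `TableMetadata` and `MetadataCache`, the callbacks `fetch_version` and `fetch_metadata`, and the 60-second interval are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class TableMetadata:
    """Hypothetical metadata record: a version number plus column types."""
    version: int
    columns: Dict[str, str]  # column name -> data type


class MetadataCache:
    """Client-side cache that is checked periodically, not on every request."""

    def __init__(
        self,
        fetch_version: Callable[[str], int],             # cheap: returns only the version
        fetch_metadata: Callable[[str], TableMetadata],  # expensive: returns everything
        refresh_interval_s: float = 60.0,                # assumed interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_version = fetch_version
        self._fetch_metadata = fetch_metadata
        self._interval = refresh_interval_s
        self._clock = clock
        # table name -> (cached metadata, time of the last version check)
        self._cache: Dict[str, Tuple[TableMetadata, float]] = {}

    def get(self, table: str) -> TableMetadata:
        now = self._clock()
        entry = self._cache.get(table)
        if entry is None:
            # First use of this table: load the full metadata once.
            meta = self._fetch_metadata(table)
            self._cache[table] = (meta, now)
            return meta

        meta, last_checked = entry
        if now - last_checked >= self._interval:
            # Periodic check: compare version numbers, and reload the full
            # metadata only when the cached copy is no longer the latest.
            if self._fetch_version(table) != meta.version:
                meta = self._fetch_metadata(table)
            self._cache[table] = (meta, now)
        return meta
```

With this shape, a single-row UPSERT normally costs only a dictionary lookup on the client. The trade-off is that a schema change may go unnoticed for up to one refresh interval.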

Future work

Future plans include support for column name mapping and immutable data encoding, both of which we are currently investigating. For large, wide tables, column name mapping saves 1/3 to 1/2 of the storage space. Immutable data encoding can save nearly 50% more storage space, with the limitation that the data cannot be modified once written.
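
To illustrate why column name mapping saves space: HBase stores the column qualifier with every cell, so replacing a long column name with a short code shrinks every cell in a wide table. The sketch below is a hypothetical mapper, not the encoding Ali-HBase SQL uses (the source does not describe it). The class `ColumnNameMapper` and its minimal-length integer codes are assumptions for the example.

```python
from typing import Dict


class ColumnNameMapper:
    """Maps long column names to short byte codes and back (illustrative only)."""

    def __init__(self) -> None:
        self._name_to_code: Dict[str, bytes] = {}
        self._code_to_name: Dict[bytes, str] = {}

    def encode(self, name: str) -> bytes:
        """Return the short qualifier for a column name, assigning one if new."""
        code = self._name_to_code.get(name)
        if code is None:
            n = len(self._name_to_code)  # next unused integer
            # Minimal big-endian encoding: 1 byte for 0..255, 2 bytes after that.
            code = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
            self._name_to_code[name] = code
            self._code_to_name[code] = name
        return code

    def decode(self, code: bytes) -> str:
        """Return the original column name for a stored qualifier."""
        return self._code_to_name[code]


def qualifier_bytes_saved(mapper: ColumnNameMapper, name: str, cells: int) -> int:
    """Bytes saved on qualifiers alone when `cells` cells use the short code."""
    return (len(name.encode("utf-8")) - len(mapper.encode(name))) * cells
```

The overall saving depends on how large the values are relative to the names, so the function above only counts qualifier bytes and makes no claim about total table size.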

In addition, the heavy client needs to change. At present, whenever we optimize a feature or fix a bug, users have to upgrade the SQL client, which is very painful for them. Supporting Query Server mode with a thin client solves this problem of constant product iteration: users get our improvements without having to upgrade.

We will support distributed sequences, and eventually we will also offer distributed SQL capabilities.

We also plan to make index consistency optional by offering asynchronous global secondary indexes. In some scenarios, such as logging, users do not need strong consistency, and it is acceptable for the index to become consistent within about 1 minute. For these cases the global index is updated asynchronously, which further reduces the cost of updates.
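
The idea of an asynchronously updated index can be sketched in a few lines. This is a single-process toy model, not the planned Ali-HBase implementation: the class `AsyncIndexedTable` and the method `apply_pending` are hypothetical names, and the in-memory queue stands in for whatever mechanism propagates index updates in the real system.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple


class AsyncIndexedTable:
    """Data table whose secondary index is updated after the write returns."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}         # row key -> indexed column value
        self.index: Dict[str, Set[str]] = {}   # indexed value -> row keys
        # Index updates that have been accepted but not yet applied.
        self._pending: Deque[Tuple[str, Optional[str], str]] = deque()

    def put(self, row_key: str, value: str) -> None:
        """Write the row and queue the index update instead of applying it."""
        old = self.data.get(row_key)
        self.data[row_key] = value
        self._pending.append((row_key, old, value))

    def apply_pending(self) -> int:
        """Apply queued index updates in order; a background task would call this."""
        applied = 0
        while self._pending:
            row_key, old, new = self._pending.popleft()
            if old is not None and old in self.index:
                self.index[old].discard(row_key)
                if not self.index[old]:
                    del self.index[old]
            self.index.setdefault(new, set()).add(row_key)
            applied += 1
        return applied

    def lookup(self, value: str) -> Set[str]:
        """Index lookup; may lag behind the data until apply_pending runs."""
        return set(self.index.get(value, ()))
```

A write returns as soon as the row is stored, so its cost does not include index maintenance. Until `apply_pending` runs, `lookup` may return stale results, which is the relaxed consistency that scenarios such as logging can tolerate.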

SQL practice and improvement of Ali-HBase