Abstract: Gauss Redis combines the respective advantages of open source Redis and HBase to provide database services with lower cost, better performance, and more flexibility.

This article was shared in the Huawei Cloud community as “Huawei Cloud PB-level database GaussDB(for Redis) Revealed Issue 9: Comparison with HBase”, written by the official blog of Gauss Redis.

1. Introduction

HBase is a distributed, column-oriented open source database. Built on the Hadoop ecosystem, it is used by many companies at home and abroad for a variety of services in modern Internet systems. This article briefly describes the basic architecture and application scenarios of HBase, then focuses on how key HBase features perform in these scenarios and on the pain points of using HBase. It then describes how GaussDB(for Redis) (Gauss Redis for short), a strongly consistent, persistent NoSQL database developed by Huawei, performs in the same scenarios and how it relieves those pain points.

2. HBase system overview

The physical structure of HBase includes components such as ZooKeeper, HMaster, RegionServer, and HDFS. ZooKeeper provides HMaster high availability, RegionServer monitoring, the entry point to metadata, and cluster configuration maintenance. The HMaster maintains the Region information of the cluster, processes metadata changes, and balances load. A RegionServer directly serves user read/write requests for the Regions assigned to it, handles splitting those Regions, and uses a write-ahead log (WAL) for fault tolerance. HDFS provides the underlying data storage service: it stores metadata and table data in a distributed manner and ensures high reliability and availability by keeping multiple copies of the data.

In the logical structure, the RowKey is the primary key of the table, and rows are arranged in lexicographical order of RowKey. When an HRegion reaches a certain size, it is split by RowKey range. A ColumnFamily divides a table vertically and manages multiple columns as a group. In HBase, ColumnFamilies are part of a table’s schema, but individual Columns are not. A Cell is a stored value. All data in HBase is stored as byte arrays.
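The logical model above can be pictured with a small sketch. This is a toy in-memory model, not HBase client code; the class `ToyTable` and its methods are hypothetical names introduced only for illustration. It shows three points from the paragraph: RowKeys are kept in lexicographical byte order, ColumnFamilies are fixed in the schema while columns are not, and every value is a byte array.

```python
from bisect import bisect_left


class ToyTable:
    """Toy model of an HBase table's logical layout (illustrative only)."""

    def __init__(self, column_families):
        # ColumnFamilies are fixed in the schema; columns inside them are not.
        self.column_families = set(column_families)
        self.row_keys = []  # RowKeys kept in lexicographical (byte) order
        self.cells = {}     # (row_key, family, qualifier) -> value bytes

    def put(self, row_key: bytes, family: str, qualifier: bytes, value: bytes):
        if family not in self.column_families:
            raise KeyError(f"unknown ColumnFamily: {family}")
        i = bisect_left(self.row_keys, row_key)
        if i == len(self.row_keys) or self.row_keys[i] != row_key:
            self.row_keys.insert(i, row_key)  # keep the key list sorted
        self.cells[(row_key, family, qualifier)] = value

    def scan(self, start: bytes, stop: bytes):
        """Return the RowKeys in [start, stop) in sorted order."""
        lo = bisect_left(self.row_keys, start)
        hi = bisect_left(self.row_keys, stop)
        return self.row_keys[lo:hi]


table = ToyTable(["info"])
table.put(b"user10", "info", b"age", b"31")
table.put(b"user2", "info", b"city", b"Shenzhen")  # a new column, no schema change
table.put(b"user1", "info", b"age", b"25")
print(table.row_keys)  # [b'user1', b'user10', b'user2']
```

Note that `user10` sorts before `user2`: the order is byte-wise, not numeric, which is why RowKey design matters for range scans and Region splits.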

3. Where HBase shines

3.1 Label Data Storage

Label data is a typical example of a sparse matrix. It describes various attributes of entities and is mainly used in fields such as intelligent recommendation, business intelligence, and marketing engines.

Suppose three different users have left a large amount of behavioral data in different apps of the same company, including user information they filled in directly, specific behaviors while using the apps, and annotations of certain phenomena by domain experts. A labeling algorithm running in the background turns this data into a table of labels for each user.

Because the collection of user behavior is limited, the labels that can be obtained differ from user to user, and a large number of cells in the table can only be empty. This is the so-called sparse matrix. As users use the apps more deeply, more fields that they are or are not interested in will gradually be discovered, and the number of columns in the table will grow accordingly.

These characteristics are disastrous for MySQL: the table structure must be defined when a table is created, dynamically adding and deleting attributes is a huge amount of work, and the storage cost of a large number of NULL values becomes unacceptable. With HBase, columns that have no value do not occupy any storage space, so limited resources are used efficiently. In addition, only the ColumnFamily is specified when an HBase table is created, and columns are easy to add and delete, which makes future attribute expansion straightforward.
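The difference between the two storage styles can be shown with a few lines of Python. The sketch below uses made-up label data (the user names and labels are assumptions, not taken from the article) and compares a fixed-schema layout, where every row carries every column, with a sparse layout that stores only the cells that have a value.

```python
# Hypothetical label data: each user has only some of the labels.
users = {
    "user_a": {"gender": "F", "likes_sports": "yes"},
    "user_b": {"city": "Beijing"},
    "user_c": {"gender": "M", "city": "Shanghai", "likes_music": "no"},
}

# Fixed-schema view: every row carries every column.
all_columns = sorted({col for labels in users.values() for col in labels})
dense_rows = {
    user: [labels.get(col) for col in all_columns]  # None plays the role of NULL
    for user, labels in users.items()
}
dense_cells = sum(len(row) for row in dense_rows.values())
null_cells = sum(v is None for row in dense_rows.values() for v in row)

# Sparse view: only cells that actually have a value are stored.
sparse_cells = {
    (user, col): value
    for user, labels in users.items()
    for col, value in labels.items()
}

print(dense_cells, null_cells, len(sparse_cells))  # 12 6 6

# Adding a new attribute touches one cell; no other row changes.
sparse_cells[("user_a", "likes_travel")] = "yes"
```

In this small example half of the fixed-schema cells are empty, and the share only grows as more labels are discovered.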

3.2 Collection of Internet of Vehicles data

An Internet of Vehicles (IoV) system uses on-board equipment to collect the data generated while vehicles are running, uploads it in real time over the network, and analyzes and uses it dynamically on the platform.

The data an IoV system faces is therefore characterized by a large number of vehicle terminals continuously writing TB or even PB of data with high concurrency. Moreover, real-time analysis requires low-latency queries so that the analysis results stay timely.

HBase uses the LSM storage model to cope with highly concurrent writes while keeping read latency acceptable. In addition, HBase scales well horizontally: RegionServers can be added or removed to adjust capacity dynamically and meet cost requirements.
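To make the LSM idea concrete, here is a minimal sketch. It is a teaching toy under simplifying assumptions, not how HBase is implemented: `ToyLSM` is a hypothetical class, the "disk" is a Python list, and the write-ahead log, deletes, and indexes are left out. It shows why writes are cheap (they only touch an in-memory table that is flushed as a sorted run) and why reads must check the newest data first.

```python
class ToyLSM:
    """Minimal LSM-style store: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # recent writes, mutable, in memory
        self.runs = []      # flushed runs, oldest first; each is a sorted list

    def put(self, key, value):
        # A write only touches memory, which keeps writes cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:  # a real store would use an index or binary search
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping the newest value for each key.
        merged = {}
        for run in self.runs:  # oldest to newest, so later values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


store = ToyLSM(memtable_limit=2)
store.put("v1", 10)
store.put("v2", 20)  # memtable is full, so it is flushed as a sorted run
store.put("v1", 11)  # newer value for v1 lives in the memtable
print(store.get("v1"), store.get("v2"))  # 11 20
```

Reads get slower as runs accumulate, which is why LSM stores merge runs in the background; `compact()` shows the idea in its simplest form.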

3.3 Preservation of transaction records

In mobile payment, securing sensitive information such as historical transaction records is an important topic. When a data center suffers a natural disaster or an external attack, the system must ensure that no data is lost, that the recovery time objective (RTO) is as short as possible, and that the recovery point objective (RPO) is as small as possible.

HBase uses the Hadoop Distributed File System (HDFS) as its storage system. HDFS implements a three-replica policy and places the replicas on different nodes or racks according to certain rules, which provides strong disaster recovery (DR) capability. In engineering practice, strategies such as Region replicas, active/standby clusters, and active-active clusters that back each other up have also been developed to maximize availability and disaster recovery.

4. HBase is not all-powerful

As the three examples above show, HBase’s design lets it perform well in scenarios that call for sparse matrix storage, absorbing highly concurrent and heavy write traffic, high availability, and high reliability. However, this does not mean that HBase fits every scenario or has no weaknesses.

4.1 Achilles Heel of HBase

1. Juliet pauses

Java systems can’t get around the Full GC discussion. When a Full GC causes a stop-the-world (STW) pause on a RegionServer, ZooKeeper stops receiving heartbeat messages from it and determines that the node is down, and other nodes take over its data. When the Full GC completes, the RegionServer shuts itself down to prevent split-brain. This is called the Juliet pause. Avoiding this type of problem as far as possible generally requires experienced Java programmers to tune the GC policy carefully for the business scenario.

2. Fewer data types

HBase stores only byte arrays. Data such as strings, complex objects, and even images must be converted into byte arrays before being stored. Such storage can only express loose relationships between data items. For data structures and relationships such as sets, queues, and maps, developers have to write their own encoding and decoding logic, which is less flexible.
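The cost of this hand-written encoding can be seen in a small example. The sketch below assumes a set of string tags stored in a single cell and uses JSON as the encoding; both the helper names and the choice of JSON are assumptions made for illustration, not something the article prescribes.

```python
import json


def encode_set(members):
    """Serialize a set of strings to bytes (sorted so the encoding is stable)."""
    return json.dumps(sorted(members)).encode("utf-8")


def decode_set(raw):
    """Turn the stored bytes back into a Python set."""
    return set(json.loads(raw.decode("utf-8")))


# Adding one member is a read-modify-write of the whole value.
cell = encode_set({"sports", "music"})
tags = decode_set(cell)   # 1. read and decode the whole value
tags.add("travel")        # 2. change it in application code
cell = encode_set(tags)   # 3. encode and write the whole value back
print(cell)  # b'["music", "sports", "travel"]'
```

With a native Set type, as in Redis, the same update is a single `SADD` command and no application-side encoding is needed.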

3. Performance bottlenecks

HBase tables are divided into Regions by RowKey in lexicographical order. A poor RowKey design results in load imbalance: when a large number of requests hit a single Region and form a hotspot, the I/O of the RegionServer hosting it may be overwhelmed.

When a RegionServer goes offline, ZooKeeper detects that the node is down, its data is moved to other nodes that take over, and the Region information in the Meta table is modified. During this process, the data on that RegionServer is unavailable and requests for it are blocked.
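The hotspot problem described above is easy to reproduce in a sketch. The code below assumes four Regions with hypothetical split keys and timestamp-like RowKeys; all names are illustrative. The first count shows range partitioning sending every key to one Region. The second count applies a hash-derived prefix (often called salting), a common mitigation that this article does not itself describe, to show how hashing spreads the same keys out.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

# Hypothetical Region boundaries: 4 Regions split by RowKey range.
SPLIT_KEYS = [b"2", b"5", b"8"]


def region_for(row_key: bytes) -> int:
    """Range partitioning: a RowKey goes to the Region whose range contains it."""
    return bisect_right(SPLIT_KEYS, row_key)


def salted(row_key: bytes) -> bytes:
    """Prefix the key with one hash-derived digit so neighboring keys spread out."""
    digit = hashlib.sha256(row_key).digest()[0] % 10
    return str(digit).encode() + b"_" + row_key


# Timestamp-like keys written around the same time share a long prefix.
keys = [f"17000{i:05d}".encode() for i in range(1000)]

plain = Counter(region_for(k) for k in keys)
spread = Counter(region_for(salted(k)) for k in keys)
print(plain)   # every key lands in Region 0
print(spread)  # keys are spread over all four Regions
```

The trade-off is that salted keys are no longer stored in their natural order, so a range scan has to query every prefix.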

4.2 Icarus Wings of Redis

4.2.1 Good performance of open source Redis

The features of open source Redis relieve HBase’s pain points to some extent, because it has the following advantages:

1. Richer data types

Redis 5.0 offers nine data types (String, List, Set, ZSet, Hash, Bit Array, HyperLogLog, Geospatial Index, and Streams) together with the operations defined on them. Compared with HBase’s single data type, Redis gives developers more choice in how to express data and its relationships.

2. The smooth speed of pure memory

Open source Redis is essentially an in-memory key-value database: the entire data set is loaded and operated on in memory. This means that Redis responds much faster and handles far more requests than HBase, which requires disk I/O. A large number of test results show that open source Redis can reach about 100,000 reads and writes per second.

4.2.2 Significant weaknesses of open source Redis

Operating purely in memory also leaves open source Redis with unavoidable weaknesses in the following two respects:

1. Big data nightmare

As the amount of data keeps growing, limited memory becomes a constraint. More memory must be added to hold the full data set, and because memory is much more expensive than disk, the cost of use surges. At the same time, common servers have memory in the GB range, which severely limits the competitiveness of open source Redis for high-volume databases.

2. What to do when the power goes out

Another drawback of operating purely in memory is that all data can be lost after a power outage. Existing solutions use AOF or RDB to persist data so that it can be restored into memory after the process restarts. However, neither method is perfect. AOF is a log of executed commands, so recovery is relatively slow. RDB periodically dumps the in-memory data, so data written after the latest dump is at risk of being lost. In addition, in the worst case half of the memory has to be reserved, which reduces memory utilization.
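The trade-off between the two persistence styles can be illustrated with a toy store. This is a sketch of the idea only, not Redis internals: `ToyStore` is a hypothetical class, the "files" are Python objects, and the real formats and flush policies are ignored.

```python
import copy


class ToyStore:
    """In-memory key-value store with toy RDB-style and AOF-style persistence."""

    def __init__(self):
        self.data = {}
        self.snapshot = {}  # stands in for the RDB file on disk
        self.log = []       # stands in for the AOF file on disk

    def set(self, key, value):
        self.data[key] = value
        self.log.append(("SET", key, value))  # AOF style: append every write command

    def save_snapshot(self):
        self.snapshot = copy.deepcopy(self.data)  # RDB style: dump memory at one point in time

    def crash(self):
        self.data = {}  # memory is lost; the snapshot and the log survive

    def recover_from_snapshot(self):
        self.data = copy.deepcopy(self.snapshot)

    def recover_from_log(self):
        self.data = {}
        for op, key, value in self.log:  # replay every command in order
            if op == "SET":
                self.data[key] = value


store = ToyStore()
store.set("a", 1)
store.save_snapshot()
store.set("b", 2)  # written after the last snapshot
store.crash()

store.recover_from_snapshot()
rdb_view = dict(store.data)  # {'a': 1}: the write to 'b' is gone

store.recover_from_log()
aof_view = dict(store.data)  # {'a': 1, 'b': 2}, at the cost of replaying the whole log
print(rdb_view, aof_view)
```

Snapshot recovery is a single load but loses whatever was written after the dump; log recovery keeps every logged write but takes time proportional to the length of the log.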

5. Gauss Redis: adults don’t have to choose

HBase and open source Redis each have their own strengths. A familiar saying comes to mind: only children make choices, adults want it all. Gauss Redis has the advantages of both and better meets the demand for database services.

  • Compatible with the Redis 5.0 protocol

Gauss Redis inherits the rich data types of open source Redis, which gives more options for describing data and its relationships. For example, in the sparse matrix scenario, the Hash type can be used, and not even the ColumnFamily of an HBase table has to be defined, so data can be organized flexibly.

  • Performance on par with open source Redis

For details, see the article comparing the performance of Huawei Cloud GaussDB(for Redis) clusters with that of open source Redis clusters. In high-traffic, high-concurrency scenarios, both Gauss Redis and open source Redis provide better read/write performance than HBase.

  • Higher disaster recovery reliability

The storage layer of Gauss Redis is built on DFV, the distributed, strongly consistent data lake developed by Huawei. The 3AZ (three availability zones) feature has been implemented at some sites, and the AZs are physically isolated from each other in ventilation, fire protection, water, and power.

  • Elastic scaling in seconds

Gauss Redis uses an architecture that separates storage from compute and sinks data to the storage pool. When compute nodes are added or removed, only the mappings need to be modified and no data is relocated, so scaling is smooth and completes in seconds. The data unavailability that HBase experiences while Regions go online or offline does not occur.

  • Low-cost mass persistent storage

After logical and physical compression, the full data set is stored in DFV persistent storage in the shared storage pool. Data is therefore not lost in a power outage, and the overall cost per GB is less than one tenth of that of open source Redis. In practice, DFV capacity can be expanded at any time according to service needs, so the storage limit of open source Redis does not apply.

  • Automated monitoring, operation and maintenance, and other advantages

Gauss Redis comes with a comprehensive monitoring system that visualizes key performance indicators such as request latency and provides automatic removal of faulty nodes, smooth migration, automatic alarms, and automatic recovery. In addition, Gauss Redis uses a hash strategy to balance data, which avoids hotspots better than HBase, and it is not subject to Full GC pauses.
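The scaling item above (only mappings change, no data is relocated) can be pictured with a short sketch. Everything here is a hypothetical model rather than the actual Gauss Redis or DFV implementation: the slot count, the toy hash, the round-robin assignment, and the node names are all assumptions chosen for readability. The point is that when data lives in a shared pool addressed by slot, adding a compute node only rewrites the slot-to-node mapping.

```python
NUM_SLOTS = 12  # hypothetical; chosen small for readability

# Shared storage pool: data is addressed by slot, not by compute node.
storage_pool = {slot: {} for slot in range(NUM_SLOTS)}


def assign_slots(nodes):
    """Return a slot -> node mapping that spreads slots round-robin."""
    return {slot: nodes[slot % len(nodes)] for slot in range(NUM_SLOTS)}


def slot_of(key: str) -> int:
    return sum(key.encode()) % NUM_SLOTS  # toy hash, not a real one


def put(mapping, key, value):
    slot = slot_of(key)
    node = mapping[slot]             # the compute node that serves the request
    storage_pool[slot][key] = value  # the data itself lives in the shared pool
    return node


mapping = assign_slots(["node-1", "node-2"])
put(mapping, "k1", "v1")
pool_before = {slot: dict(kv) for slot, kv in storage_pool.items()}

# Scale out: only the mapping changes; nothing in storage_pool is copied or moved.
mapping = assign_slots(["node-1", "node-2", "node-3"])
print(storage_pool == pool_before)  # True
```

In a design where each node owns its data locally, the same scale-out step would require copying the contents of the reassigned slots between nodes, during which that data is harder to serve.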

6. Conclusion

Building on compatibility with the Redis 5.0 protocol, Gauss Redis combines the advantages of open source Redis and HBase with the features of Huawei DFV storage to avoid the weaknesses of both in typical scenarios, providing database services with lower cost, better performance, and greater flexibility.

7. Appendix

Author: Huawei Cloud Gauss Redis team.

For more technical articles, check out the official Gauss Redis blog:

Bbs.huaweicloud.com/community/u…

Gauss Redis official homepage:

www.huaweicloud.com/product/gau…
