Original article: https://mp.weixin.qq.com/s/I3FnP1JtY5QZCY2D3QioyQ
Lindorm is an important part of the big data storage and processing layer of Alibaba's Feitian (Apsara) operating system. It is a distributed NoSQL database for big data that evolved from HBase, combining large-scale, high-throughput, fast, flexible, and real-time hybrid access capabilities, and providing world-leading high-performance, cross-domain, multi-consistency, multi-model hybrid storage and processing for massive-data scenarios. Today, Lindorm serves essentially all structured and semi-structured big data storage scenarios across the Alibaba economy.
Note: Lindorm is the internal name of the HBase branch within Alibaba. The version sold on Alibaba Cloud is called HBase Enhanced Edition. In this article, "HBase Enhanced Edition" and "Lindorm" refer to the same product.
Extreme optimization, super performance
Compared with HBase, Lindorm is deeply optimized in RPC, memory management, caching, and log writing, and introduces many new techniques that greatly improve read/write performance. On the same hardware, Lindorm achieves more than 5x the throughput of HBase, with latency spikes reduced to roughly 1/10 of HBase's. These numbers were not produced under laboratory conditions: they were measured with the open-source benchmark tool YCSB without tuning any parameters. We have published the test tools and scenarios in the Alibaba Cloud documentation, and anyone can follow the guide to reproduce the same results.
Trie Index
Lindorm's data file, LDFile (similar to HFile in HBase), is a read-only B+ tree structure in which the file index is a critical data structure. The index has high priority in the block cache and should reside in memory as much as possible. If we can shrink the file index, we free up valuable block-cache memory for data; alternatively, with the same index footprint we can increase index density and reduce data-block size to improve performance. HBase's index blocks store full Rowkeys, yet in a sorted file many rowkeys share the same prefix, so a Trie (prefix tree) structure can eliminate this redundancy and compress the index.
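To make the idea concrete, here is a minimal, illustrative sketch (not Lindorm's actual index format) of how a prefix tree can index sorted rowkeys so that a shared prefix is stored only once:

import java.util.TreeMap;

// Illustrative sketch: a prefix tree (trie) mapping rowkeys to data-block offsets.
// A shared prefix such as "order_2019_" is stored once along the trie path,
// so the index is much smaller than a flat list of full rowkeys.
public class TrieIndexSketch {

    static final class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        long blockOffset = -1;          // >= 0 if a rowkey ends at this node
    }

    private final Node root = new Node();

    // Insert a rowkey and the offset of the data block that contains it.
    public void put(String rowkey, long blockOffset) {
        Node cur = root;
        for (char c : rowkey.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.blockOffset = blockOffset;
    }

    // Look up the block offset for an exact rowkey, or -1 if absent.
    public long get(String rowkey) {
        Node cur = root;
        for (char c : rowkey.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return -1;
        }
        return cur.blockOffset;
    }

    public static void main(String[] args) {
        TrieIndexSketch index = new TrieIndexSketch();
        // The common prefix "order_2019_" is stored only once in the trie.
        index.put("order_2019_0001", 0L);
        index.put("order_2019_0002", 65536L);
        index.put("order_2019_0003", 131072L);
        System.out.println(index.get("order_2019_0002"));   // 65536
    }
}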
ZGC support, 100 GB heap average 5ms pause
ZGC (powered by Dragonwell JDK) is a representative of the next generation of pauseless GC algorithms. Its core idea is that the mutator uses a memory read barrier to detect pointer changes, so that most of the Mark and Relocate work can be performed concurrently with the application.
Running ZGC stably on a 100 GB heap relies on a series of optimizations, including:
- Lindorm memory self-management technology, which reduces the number of objects and the memory allocation rate by an order of magnitude (for example, CCSMap, contributed by the Alibaba HBase team to the community).
- AJDK ZGC page caching mechanism optimization (locks, page caching policy).
- AJDK ZGC trigger-timing optimization, eliminating ZGC concurrent-mode failures.
AJDK ZGC has run stably on Lindorm for two months and successfully passed the Double Eleven test. Its JVM pause times are stable at around 5 ms, with maximum pauses not exceeding 8 ms. ZGC greatly improves the RT and latency-spike metrics of online clusters, with average RT improved by 15%~20% and P999 RT cut roughly in half. In this year's Double Eleven, on the Ant risk-control cluster, ZGC helped reduce P999 latency from 12 ms to 5 ms.
Note: The units in the figure should be µs; the average GC pause is about 5 ms.
LindormBlockingQueue
VersionBasedSynchronizer
LDLog is used in Lindorm for data recovery during system failover, guaranteeing the atomicity and durability of data. Every write must first go to the LDLog; only after the LDLog write succeeds can subsequent operations such as writing the memstore proceed. Handlers in Lindorm therefore have to wait for the WAL write to complete before being woken up to continue. Under high pressure, ineffective wake-ups cause a large number of CPU context switches and degrade performance. To address this, Lindorm developed a version-based, highly concurrent multi-path synchronizer that greatly reduces context switching.
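The following is a minimal, illustrative sketch of the version-based idea, not Lindorm's actual VersionBasedSynchronizer: instead of signaling each handler individually, the WAL sync thread publishes a monotonically increasing "synced version", and each handler simply waits until the published version reaches the sequence number of its own write.

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a version-based synchronizer: writers wait until the WAL has synced
// up to (at least) their own sequence number. One advance of the version wakes
// every waiter whose condition is now satisfied, avoiding per-handler signaling
// and the context switches it causes.
public class VersionSynchronizerSketch {

    private final AtomicLong syncedVersion = new AtomicLong(0);
    private final Object monitor = new Object();

    // Called by the WAL sync thread after fsync: publish the highest durable sequence id.
    public void advanceTo(long version) {
        syncedVersion.accumulateAndGet(version, Math::max);
        synchronized (monitor) {
            monitor.notifyAll();          // wake all handlers; each re-checks its own condition
        }
    }

    // Called by a handler: block until the WAL is durable up to `version`.
    public void awaitVersion(long version) throws InterruptedException {
        if (syncedVersion.get() >= version) return;   // fast path, no blocking
        synchronized (monitor) {
            while (syncedVersion.get() < version) {
                monitor.wait();
            }
        }
    }
}

A production-quality implementation would additionally shard waiters across several monitors (the "multi-path" part) so that a single lock does not become the new contention point.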
Completely lock-free
The HBase kernel takes a large number of locks on the critical path; in high-concurrency scenarios these locks cause thread contention and performance degradation. The Lindorm kernel removes locks from key paths such as MVCC and WAL. In addition, HBase produces various metrics at runtime, such as QPS, RT, and cache hit ratio, and even the seemingly "humble" operations that record these metrics involve many locks. To solve this, Lindorm drew on the ideas behind tcmalloc and developed LindormThreadCacheCounter to eliminate the performance cost of metrics recording.
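As a rough illustration of the tcmalloc-style idea (not Lindorm's actual LindormThreadCacheCounter), each thread accumulates into its own striped cell, so hot-path increments never contend on a shared lock, and an aggregated value is computed only when the metric is read. The JDK's LongAdder already implements exactly this striping, so the sketch simply wraps it:

import java.util.concurrent.atomic.LongAdder;

// Rough illustration of a thread-cached metrics counter: increments go to
// per-thread cells (no shared lock on the hot path); the total is aggregated
// only when the metric is reported.
public class ThreadCachedCounterSketch {

    private final LongAdder readRequests = new LongAdder();

    // Hot path: called by every handler thread, contention-free in the common case.
    public void onReadRequest() {
        readRequests.increment();
    }

    // Cold path: called periodically by the metrics reporter.
    public long snapshot() {
        return readRequests.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadCachedCounterSketch qps = new ThreadCachedCounterSketch();
        Runnable worker = () -> { for (int i = 0; i < 1_000_000; i++) qps.onReadRequest(); };
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(qps.snapshot());   // 2000000
    }
}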
Handler coroutines
In high-concurrency applications, a single RPC request is often handled by multiple sub-modules and involves several I/O operations. As these sub-modules cooperate, the system context-switches very frequently. Context-switch optimization is an unavoidable topic for high-concurrency systems, and the industry has many ideas and practices; among them, coroutines and SEDA (staged event-driven architecture) are the approaches we focused on. Considering engineering cost, maintainability, and code readability, Lindorm chose coroutines for asynchronous optimization. We use the built-in Wisp2.0 feature of the Dragonwell JDK provided by the Alibaba JVM team to run HBase handlers on coroutines. Wisp2.0 works out of the box, effectively reducing the system's resource consumption, and the optimization effect is considerable.
New Encoding algorithm
To improve performance, HBase usually loads Meta information into the block cache. If the block size is small, the Meta information becomes large and cannot be fully cached, degrading performance. If the block size is large, sequentially scanning through an encoded block to locate a row becomes the bottleneck for random reads. To solve this problem, Lindorm developed Indexable Delta Encoding, which allows fast lookup within a block through an intra-block index, greatly improving seek performance.
With Indexable Delta Encoding, the random seek performance of an HFile is doubled compared with before, and for 64 KB blocks it is almost the same as with no encoding at all (other encoding algorithms incur some performance loss). For random Gets with a full cache hit, RT decreases by 50% compared with Diff Encoding.
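To convey the general idea, here is a conceptual sketch, not Lindorm's actual on-disk format: delta (prefix) encoding stores each key as a shared-prefix length plus a suffix, and an intra-block index of "restart" entries lets a reader binary-search to a nearby position instead of decoding the block sequentially from the start.

import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of "delta encoding + intra-block index".
// Each key is stored as (sharedPrefixLen, suffix) relative to the previous key.
// Every RESTART_INTERVAL-th key is stored in full and recorded in an index, so a
// seek binary-searches the index and then decodes only a few entries.
public class IndexedDeltaBlockSketch {
    private static final int RESTART_INTERVAL = 4;

    private final List<Integer> sharedLens = new ArrayList<>();
    private final List<String> suffixes = new ArrayList<>();
    private final List<Integer> restartPos = new ArrayList<>();   // entry position of each restart
    private final List<String> restartKeys = new ArrayList<>();   // full key at each restart
    private String lastKey = "";

    public void add(String key) {                                 // keys added in sorted order
        int shared = 0;
        if (sharedLens.size() % RESTART_INTERVAL == 0) {          // restart point: store the full key
            restartPos.add(sharedLens.size());
            restartKeys.add(key);
        } else {
            int max = Math.min(lastKey.length(), key.length());
            while (shared < max && lastKey.charAt(shared) == key.charAt(shared)) shared++;
        }
        sharedLens.add(shared);
        suffixes.add(key.substring(shared));
        lastKey = key;
    }

    // Return the first key >= target, or null if none.
    public String seek(String target) {
        int lo = 0, hi = restartKeys.size() - 1, startIdx = 0;
        while (lo <= hi) {                                        // binary search on the intra-block index
            int mid = (lo + hi) >>> 1;
            if (restartKeys.get(mid).compareTo(target) <= 0) { startIdx = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        String prev = "";
        for (int i = restartPos.get(startIdx); i < sharedLens.size(); i++) {
            String key = prev.substring(0, sharedLens.get(i)) + suffixes.get(i);
            if (key.compareTo(target) >= 0) return key;
            prev = key;
        }
        return null;
    }

    public static void main(String[] args) {
        IndexedDeltaBlockSketch block = new IndexedDeltaBlockSketch();
        for (String k : new String[]{"row001", "row002", "row003", "row010", "row011", "row020"}) block.add(k);
        System.out.println(block.seek("row005"));   // row010
    }
}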
Other
Compared with the HBase community edition, Lindorm includes dozens of performance optimizations and re-architected components and introduces many new technologies. Due to space constraints, only some are listed here. Other core technologies include:
- CCSMap
- Quorum-based write protocol for automatically avoiding failures
- Efficient group commit
- High-performance fragmentation-free cache: Shared BucketCache
- Memstore Bloomfilter
- Efficient data structures for reading and writing
- GC-invisible memory management
- Separation of online computing and offline job architecture
- JDK/operating system deep optimization
- FPGA offloading of Compaction
- User-mode TCP acceleration
- …
Rich query models, lower development threshold
WideColumn model (native HBase API)
WideColumn is an access model and data structure identical to HBase's, which makes Lindorm 100% compatible with the HBase API. Users can access Lindorm through the WideColumn API in the high-performance native client provided by Lindorm, or, via the Alihbase-Connector plug-in, with the standard HBase client and API (no code modification required). Lindorm also adopts a lightweight client design that pushes much of the data routing, batch dispatching, timeout, and retry logic down to the server, together with many optimizations in the network transport layer to save CPU on the application side. Compared with HBase, Lindorm improves client CPU efficiency by 60% and network bandwidth efficiency by 25%.
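Since the WideColumn model is 100% compatible with the HBase API, existing HBase code can run against Lindorm unchanged. Below is a minimal sketch using the open-source HBase client; the ZooKeeper address, table name, and column family are illustrative placeholders, and the connection configuration will differ when using Lindorm's native client or the Alihbase-Connector.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal example of the standard HBase client API; the same code works
// against Lindorm's WideColumn model. Endpoint, table, and column family
// below are placeholders for illustration.
public class WideColumnExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host1,zk-host2,zk-host3");   // placeholder endpoint
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("order_table"))) {
            // Write one row.
            Put put = new Put(Bytes.toBytes("order_0001"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("status"), Bytes.toBytes("paid"));
            table.put(put);
            // Read it back.
            Result result = table.get(new Get(Bytes.toBytes("order_0001")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("f"), Bytes.toBytes("status"))));
        }
    }
}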
In addition, the HBase native API supports high-performance secondary indexes. When data is written through the HBase native API, index data can be transparently written into index tables. During queries, a Scan + Filter query that might otherwise scan the whole table is turned into a query on the index table first, greatly improving query performance.
TableService model (SQL, secondary index)
HBase supports only the Rowkey index, so multi-field queries are inefficient. Users therefore have to maintain multiple tables to serve different query scenarios, which increases application complexity and makes it hard to guarantee data consistency and write efficiency. Moreover, HBase provides only simple KV APIs such as Put, Get, and Scan, with no data types: all data must be converted and stored by the user. For developers used to SQL, the barrier to entry is high and the process error-prone.
To resolve these pain points, lower the barrier to entry, and improve development efficiency, we added the TableService model to Lindorm. It provides rich data types, a structured query expression API, and native support for SQL access and global secondary indexes, solving many technical challenges and greatly lowering the threshold for ordinary users. With SQL and SQL-like APIs, users can use Lindorm much as they would a relational database. Here is a simple example of Lindorm SQL.
-- Primary table and index DDL
create table shop_item_relation (
    shop_id varchar,
    item_id varchar,
    status varchar
constraint primary key(shop_id, item_id));
create index idx1 on shop_item_relation (item_id) include (ALL);           -- index on the second primary-key column, redundantly storing all columns
create index idx2 on shop_item_relation (shop_id, status) include (ALL);   -- multi-column index, redundantly storing all columns

-- Write data; both indexes are updated synchronously
upsert into shop_item_relation values('shop1', 'item1', 'active');
upsert into shop_item_relation values('shop1', 'item2', 'invalid');

-- Queries automatically use the appropriate index
select * from shop_item_relation where item_id = 'item2';                           -- hits idx1
select * from shop_item_relation where shop_id = 'shop1' and status = 'invalid';    -- hits idx2
FeedStream model
In these scenarios, traditional message queues (such as Kafka) have several limitations:
- Storage: not suitable for long-term data storage; data usually expires in days.
- Deletion ability: a specified data entry cannot be deleted.
- Query capability: complex query and filtering conditions are not supported.
- Consistency and performance are hard to guarantee at the same time: throughput-oriented systems such as Kafka may lose data in some cases to improve performance, while transactional message queues have limited throughput.
- Rapid partition expansion: the number of partitions in a Topic is fixed and cannot be expanded quickly.
- Physical vs. logical queues: usually only a small number of physical queues are supported (for example, each partition can be regarded as a queue), but businesses need to simulate logical queues on top of physical ones. For example, an IM system maintains a logical message queue per user, which often requires a lot of extra development work.
To meet these requirements, Lindorm provides a queue model called FeedStreamService, which can solve problems such as message synchronization, device notification, and auto-increment ID assignment for massive numbers of users.
Note: This model is already in trial on the HBase Enhanced Edition of Alibaba Cloud. Interested users can contact the cloud HBase support DingTalk account or open a ticket on Alibaba Cloud.
Full-text indexing model
Although the TableService model in Lindorm provides data types and secondary indexes, Solr and ES remain superior full-text search engines. Using Lindorm together with Solr/ES maximizes the strengths of both, allowing us to build sophisticated big data storage and retrieval services. Lindorm has a built-in external index synchronization component that automatically synchronizes data written to Lindorm to an external index component such as Solr or ES. This model is well suited to businesses that need to store large amounts of data where the fields used as query conditions are only a small part of the original data and must be combined arbitrarily in queries, for example:
- In logistics scenarios, a large amount of tracking information needs to be stored and queried with arbitrary combinations of multiple fields.
- In traffic monitoring scenarios, a large number of vehicle-passing records are stored, and records of interest are retrieved by any combination of vehicle attributes.
- In member and product search scenarios on various websites, large amounts of product/member information are stored and must support complex, arbitrary queries on a few fields to meet users' ad-hoc search needs.
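A common access pattern with this model (a sketch of the pattern, not a specific Lindorm API) is to query the search engine first for matching row keys and then fetch the full records from the wide table in one batch. The searchRowkeysFromEs helper and the table name below are hypothetical placeholders:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the "search engine for conditions, wide table for data" pattern:
// 1) the external index (Solr/ES) answers the multi-condition query and returns rowkeys;
// 2) a batched Get fetches the full rows from the wide table.
public class SearchThenGetSketch {

    // Hypothetical placeholder: in a real system this would call the ES/Solr client
    // with the combined conditions and return the matching primary keys.
    static List<String> searchRowkeysFromEs(String plate, String dateRange) {
        return List.of("20191111_plateA_0001", "20191111_plateA_0002");
    }

    static List<Result> fetchRows(Connection conn, List<String> rowkeys) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("vehicle_record"))) {   // illustrative table name
            List<Get> gets = new ArrayList<>();
            for (String rk : rowkeys) gets.add(new Get(Bytes.toBytes(rk)));
            Result[] results = table.get(gets);        // one batched round trip for all rows
            return Arrays.asList(results);
        }
    }
}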
More models on the way
In addition to the above models, we will develop more easy-to-use models based on business needs and pain points, to make Lindorm easier to use and further lower the barrier to entry. Time-series models, graph models, and more are on the way, so stay tuned.
High availability with zero intervention and second-level recovery
Growing from infancy to maturity, Alibaba HBase has taken many falls, some of them painful, and we are fortunate to have grown under the trust of our customers. Over nine years of use inside Alibaba we have accumulated a large body of high-availability technology, which has been applied to the HBase Enhanced Edition.
MTTR optimization
HBase is an open-source implementation based on Google's famous BigTable paper. Its core design is that data is persisted in HDFS, the underlying distributed file system, which maintains multiple replicas to guarantee the reliability of the whole system, so HBase itself does not need to manage replication or consistency. This simplifies the overall engineering, but it also introduces a "single point of service" limitation: the reads and writes of any piece of data are served by only one fixed node. When that node goes down, the data can return to service only after the log is replayed to rebuild the in-memory state and the regions are loaded onto a new node.
Adjustable consistency levels
In the HBase architecture, each region can be online on only one RegionServer. If that RegionServer goes down, the region must go through re-assignment, WAL splitting, and WAL replay before it can serve reads and writes again. This recovery can take several minutes, which is an unsolvable pain point for demanding businesses. In addition, although HBase supports active/standby synchronization, on failure the clusters can only be switched manually at cluster granularity, and data between the active and standby clusters can only be eventually consistent, whereas some services require strong consistency.
On top of this architecture, Lindorm offers several consistency levels, and users can choose among them based on their business needs.
Client HA switchover
HBase can currently run in active/standby mode, but there is no efficient client-side switchover solution on the market. An HBase client can access only an HBase cluster at a fixed address. If the primary cluster fails, the user must stop the client, modify its configuration, and restart it to connect to the standby cluster, or else design a complex set of access logic on the business side for reaching both clusters. Alibaba HBase has modified the HBase client so that traffic switchover happens inside the client: after a switchover command is delivered to the client over a high-availability channel, the client closes the old connections, opens connections to the standby cluster, and retries the request.
Cloud native, lower cost of use
Lindorm was designed for the cloud from the start of the project; its designs reuse cloud infrastructure wherever possible and are specially optimized for cloud environments. For example, in addition to cloud disks, we also support storing data in OSS, a low-cost object store, to reduce costs. We have also made many optimizations for ECS deployment, adapted to instance types with small memory, and strengthened deployment flexibility, all in the spirit of cloud native and to save customers money.
ECS + cloud disk: extreme elasticity
The HBase Enhanced Edition (Lindorm) on the cloud is deployed on ECS plus cloud disks (some large customers may use their own dedicated hardware). The ECS + cloud disk deployment gives Lindorm extreme elasticity. By contrast, traditional physical-machine deployments have the following problems:
- Service elasticity is hard to meet: when unexpected traffic peaks or abnormal requests occur, it is difficult to obtain new physical servers for capacity expansion in a short time.
- Storage/compute coupling limits flexibility: the ratio of CPU to disk on a physical server is fixed, but each service has different characteristics. With the same server type, some services run out of compute while storage sits idle, and others have spare compute but hit storage bottlenecks. This is especially true after HBase introduced hybrid storage, where it is hard to determine the right ratio of HDDs to SSDs: demanding services often exhaust SSD space while HDDs remain idle, and a large amount of SSD capacity on offline-service machines goes unused.
- Heavy O&M pressure: with physical machines, operators must track warranty expiry, disk failures, NIC failures, and other hardware faults; repairing a physical machine is a long process that requires taking the machine offline, so the O&M burden is heavy. For a massive storage service like HBase, several disk failures a day is normal. All of these problems are solved when Lindorm is deployed on ECS plus cloud disks.
Integrated cold/hot data separation
In massive-big-data scenarios, part of the data in a table becomes archival or rarely accessed as time goes by, yet this historical data, such as order data or monitoring data, is large in volume. Reducing its storage cost saves enterprises a great deal of money. Lindorm's cold/hot separation feature exists precisely to cut storage cost dramatically with minimal operation and configuration effort: Lindorm provides a new storage medium for cold data whose cost is only one third of that of an efficient cloud disk.
ZSTD-v2: another 100% improvement in compression ratio
Two years ago we switched the group's storage compression algorithm to ZSTD, gaining an extra 25% compression over SNAPPY. This year we went further and implemented a new ZSTD-v2 algorithm. For compressing small blocks of data, we proposed training a dictionary on pre-sampled data and then using the dictionary to accelerate compression. When building Lindorm's LDFiles, we take advantage of this capability to sample and train on the data, build the dictionary, and then compress. In tests on various business data sets, we achieved up to 100% higher compression ratios than the native ZSTD algorithm, which means another 50% saving in storage cost for customers.
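The general dictionary-compression workflow looks roughly like the sketch below. It uses the open-source zstd-jni binding (com.github.luben:zstd-jni); the class and method names are taken from that library as best recalled and should be treated as an assumption, not as Lindorm's internal ZSTD-v2 API.

import com.github.luben.zstd.Zstd;
import com.github.luben.zstd.ZstdDictCompress;
import com.github.luben.zstd.ZstdDictTrainer;
import java.nio.charset.StandardCharsets;

// Sketch of dictionary-based compression for small blocks: sample representative
// records, train a shared dictionary, then compress each small block with it.
public class ZstdDictSketch {
    public static void main(String[] args) {
        // 1) Train a dictionary from pre-sampled data. Dictionary training needs a
        //    reasonably large, representative sample set; these records are synthetic.
        ZstdDictTrainer trainer = new ZstdDictTrainer(4 * 1024 * 1024, 4 * 1024);
        for (int i = 0; i < 10_000; i++) {
            String record = "order_id=" + i + ",status=paid,channel=app,city=hangzhou";
            trainer.addSample(record.getBytes(StandardCharsets.UTF_8));
        }
        byte[] dict = trainer.trainSamples();

        // 2) Compress a small block with the shared dictionary (compression level 3).
        ZstdDictCompress sharedDict = new ZstdDictCompress(dict, 3);
        byte[] block = "order_id=424242,status=paid,channel=app,city=hangzhou"
                .getBytes(StandardCharsets.UTF_8);
        byte[] compressed = Zstd.compress(block, sharedDict);
        System.out.println("raw=" + block.length + " bytes, compressed=" + compressed.length + " bytes");
    }
}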
HBase Serverless, preferred for getting started
The HBase Serverless edition on Alibaba Cloud is a new HBase service built on the Lindorm kernel with a Serverless architecture, and it truly turns HBase into a service. Users no longer need to plan resources in advance, choose CPU and memory specifications, or purchase a cluster; they do not need to perform complex operations such as scaling out for business peaks or storage growth, nor waste idle resources during business troughs.
Security and multi-tenant capabilities for large accounts
The Lindorm engine has a complete user-name/password system built in, providing multiple levels of permission control and authenticating every request to prevent unauthorized data access and keep user data secure. In addition, Lindorm provides multi-tenant isolation features such as Groups and Quota limits, ensuring that different businesses within an enterprise sharing the same cluster do not affect each other and can share one big data platform securely and efficiently.
User and ACL system
The Lindorm kernel provides an easy-to-use user authentication and ACL system. For authentication, users only need to fill in the user name and password in the configuration. The password is not stored in plaintext on the server and is not transmitted in plaintext during authentication, so even if the authentication traffic is intercepted, it cannot be reused or forged.
Group isolation
When multiple users or services use the same HBase cluster, resource contention occurs, and the reads and writes of important online services may be affected by batch reads and writes of offline jobs. The Group feature of the HBase Enhanced Edition (Lindorm) is designed to solve this multi-tenant isolation problem.
Quota throttling
A complete Quota system is built into the Lindorm kernel to limit the resource usage of individual users. For each request, the Lindorm kernel precisely calculates the CU (Capacity Unit) consumed, based on the resources actually used. For example, a Scan request with a filter may return little data, yet the RegionServer may have consumed a large amount of CPU and I/O to filter it; this real consumption is counted into the CU. When using Lindorm as a shared big data platform, an enterprise administrator can assign a separate user to each business and use the Quota system to limit a user's read CUs per second or total CUs per second, preventing any single user from monopolizing resources and affecting others. Quota limits are also supported at the namespace and table level.
Conclusion
Lindorm, the new-generation NoSQL database, is the result of nine years of technical accumulation by the Alibaba HBase & Lindorm team. It provides world-leading high-performance, cross-domain, multi-consistency, multi-model hybrid storage and processing for massive-data scenarios, focusing on simultaneously meeting the demands of big data (unlimited scaling, high throughput), online services (low latency, high availability), and multi-functional queries. It offers users seamless scaling, high throughput, continuous availability, stable millisecond-level response, tunable consistency, low storage cost, and richly indexed real-time hybrid data access.
Finally
If you like this article, remember to follow me. Thank you for your support!