Introduction: Tair is a key system for carrying the traffic peak of Double 11 and supports the core experience scenarios of e-commerce transactions. It not only kept latency smooth at the sub-millisecond level at a peak of billions of QPS, but also delivered technical innovations in those core scenarios.

Author | Desert Ice    Source | Ali Technology public account

One: Preface

Double 11 in 2021 was the 13th for Tmall and also the 13th for Tair. Every Tair team member who took part in the preparation experienced it differently. Some felt the unique atmosphere of a technical team pulling together for the first time. Others, with a few more years behind them, have now collected seven battle robes, enough to summon the dragon. For me, this year was the most tense I have been through: the preparation had its ups and downs, and the stress tests of the new products launched this year did not go smoothly. It was also the warmest: with the full support of our business partners, the team fought side by side and in the end delivered a service that was "smooth as silk and steady as a rock".

Two: Background

Since its birth in April 2009, Tair has gone through several iterations and supports different engines for a rich set of business scenarios. Among them, MDB/LDB is the sub-product with the longest history and is still the absolute workhorse of Double 11 today. It carried the Double 11 traffic peak smoothly and also performed well during the stress-test phase. Behind this are mature kernel capabilities that cover the full range of scenarios, and productized capabilities that improve operations efficiency with each iteration. Of course, running a database product with 10K+ instances also depends on the system owners' professional skill, their attention to how the product runs, and their quick response to requirements.

Beyond its own stability, Tair MDB/LDB is also the cornerstone on which all other products in the Tair product line are built. For example, Tair MDB with PMem, Tair's first milestone in persistent memory, was followed by the Tair persistent memory edition released at the 2020 Cloud Computing Conference. In this year's Double 11, TairSQL, built on the persistent memory architecture, expanded the computing scenarios supported by the in-memory database Tair. Tair persistent memory played an important role in several scenarios of this year's Double 11, as described in the following sections.

Three: Tair persistent memory edition

Tair persistent memory is a large-capacity, Redis-compatible in-memory database product that is sold publicly on the Alibaba Cloud official website and also serves core applications inside Alibaba Group. Compared with the Redis community edition, the cost of a single instance can be reduced by up to 30%, and data persistence does not rely on traditional disks. It persists every operation while providing throughput and latency on par with the Redis community edition, which greatly improves the reliability of business data.

The storage medium used by Tair persistent memory is Intel® Optane™ persistent memory (Optane PMem), a solution that offers both memory-class and storage-class performance. It combines cost-effective, large-capacity memory with support for data persistence, so that more data can be kept closer to the CPU. This speeds up large-memory workloads, shortens database restart time, reduces I/O, lowers the power consumption of large-memory nodes, and protects data in the event of a power failure.

Intel® Optane™ persistent memory fills the gap between traditional SSDs and DRAM. It provides distinct operating modes to meet the needs of a variety of workloads, from cloud and databases to in-memory analytics, virtualized infrastructure, and other data-intensive and compute-intensive workloads, helping users draw deeper insight from larger data sets.

Since its release at the 2020 Cloud Computing Conference, Tair persistent memory has served more and more user scenarios, both on the cloud and inside the group. The feedback collected from these users raised the bar for the range of supported scenarios, access performance, and cost-effectiveness. In response, Tair persistent memory developed a core optimization technique that dynamically and adaptively moves data between DRAM and persistent memory, keeping the space occupied by the user index area and data area within a fixed proportion and meeting the data storage needs of different user scenarios.

At the same time, Tair persistent memory is deeply integrated with the kernel technology of the Aliyun Linux operating system. This supports the data snapshots required by scenarios such as primary/standby replication and real-time backup, and greatly reduces the latency impact of taking real-time snapshots when the memory footprint is large. Beyond covering more scenarios and optimizing performance for high-frequency ones, Tair persistent memory also improves cost-effectiveness: it slims down the space taken by the metadata of its self-developed persistent memory storage structures, and it compresses the List and Hash structures that users access most often. With these transparent, more compact data structures and stable persistence performance, it achieves a data compression ratio of 1 to 2 times, significantly reducing the hardware cost of the persistent edition.

Besides its continuous optimization for general Redis scenarios, Tair persistent memory also shines in scenarios such as advertising and feature storage, which place high combined demands on cost, data consistency, low latency, and capacity. In addition, it brought innovations to two different user scenarios in Double 11 2021, significantly improving system stability, cost-effectiveness, and the application experience. We first introduce the TairCPC data model, which plays an important role in risk control scenarios.

1 TairCPC

TairCPC, which made its debut in Double 11 2020, was integrated into the Tair persistent memory product this year and played an important role in the risk control scenarios of Double 11.

TairCPC provides Sketches, a set of aggregation operators that is pushed down into the storage engine as a module. It uses very little space to perform high-performance computation over sampled data, and users get real-time results back directly after each incremental write. The risk control business that uses TairCPC is a core module of the group's transaction link and directly affects the security of all online transactions. It uses TairCPC on its core real-time computing link for real-time risk control.
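To make the idea of a sketch concrete, here is a minimal Python illustration of answering "how many distinct items have I seen?" in bounded space with incremental writes. The article does not describe TairCPC's internals, so this example swaps in a much simpler K-Minimum-Values (KMV) estimator; the class name, the parameter k, and the hash choice are all assumptions made for illustration and are not Tair APIs.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative only, not TairCPC)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap via negation: holds the k smallest hashes
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Deterministic 64-bit hash of the item's string form.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        """Incremental write: O(log k) time, memory bounded by k hashes."""
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept hash: replace it.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Estimated number of distinct items added so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        # The k-th smallest of n uniform hashes sits near k/n of the hash range.
        return (self.k - 1) / ((kth_smallest + 1) / 2.0 ** 64)


sketch = KMVSketch(k=256)
for event in ["buyer-1", "buyer-2", "buyer-1", "buyer-3"]:
    sketch.add(event)
print(sketch.estimate())  # fewer than k distinct items, so the count is exact
```

The point of the sketch is the trade-off the text describes: memory stays fixed at k hashes no matter how many events arrive, each write is cheap, and an approximate result is available immediately after every write.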

With the help of Tair persistent memory, this scenario saved about one third of its storage space in this year's Double 11, and the cost advantage of persistent memory greatly reduced users' costs. Extensive performance optimization was done for TairCPC on Tair persistent memory, so that performance in many scenarios is on par with DRAM and slow-query performance improved by an order of magnitude, effectively improving system stability. Full data persistence (RPO=0) is achieved with almost no impact on performance.

2 TairSQL

The technical innovation Tair made in the core experience scenario of Double 11 2021 comes from a subsystem internally code-named TairSQL. During the Double 11 peak, coupons issued automatically after users place orders, and the write-off of assets after successful transactions, generate corresponding write traffic to the database system. Write latency must be held at a low, millisecond level so that users see consistent prices across shopping-guide scenarios such as product search and detail pages.

For a database product, the technical challenges of the price-consistency scenario are a high read/write load and demanding latency requirements. The following sections briefly describe the kernel technologies TairSQL uses to address them.

TairSQL kernel technology

Based on the business characteristics of Double 11, TairSQL reworked data storage on persistent memory, reduced client connection overhead, accelerated cluster initialization, optimized memory usage, and did other work related to cost-effectiveness and stability, all in service of high-throughput, low-latency scenarios. This rests mainly on several core features: persistent memory storage, an efficient transaction processing model, and lightweight user interface access:

  • Persistent memory data storage. Using persistent memory as the final storage medium reduces I/O latency on the access path and removes the time-consuming swapping of cached data to and from disk, with its frequent evictions, found in traditional database products. Data is placed according to the access frequency of index data and user-area data, so that high-frequency index lookups and updates complete in DRAM.
  • Transaction processing model. In a horizontally scaled cluster, each node serves dozens of partitions and each partition is served by a single thread, which avoids the overhead of lock contention and provides smoother P99 access latency.
  • Lightweight user interface. Lightweight user interface access reduces the SQL parsing and compilation cost of each user request. Combined with the transaction processing model, it allows a user's read and write requests to be processed and returned within a few hundred microseconds.
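The transaction processing model in the list above, one thread per partition, can be illustrated with a small Python sketch. This is not TairSQL code: the class names, the CRC32-based routing, and the in-process queues are assumptions chosen only to show why a partition's data needs no locks when exactly one thread ever touches it.

```python
import queue
import threading
import zlib


class Partition:
    """One partition: private state mutated only by its own worker thread."""

    def __init__(self):
        self.data = {}
        self.inbox = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            task = self.inbox.get()
            if task is None:  # shutdown signal
                break
            func, reply = task
            try:
                # No lock needed: only this thread reads or writes self.data.
                reply.put((True, func(self.data)))
            except Exception as exc:
                reply.put((False, exc))


class PartitionedStore:
    def __init__(self, num_partitions=8):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _route(self, key):
        # Hypothetical routing rule: stable hash of the key modulo partition count.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def execute(self, key, func):
        """Run func(partition_data) on the owning partition and wait for the result."""
        reply = queue.Queue(maxsize=1)
        self._route(key).inbox.put((func, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def incr(self, key, delta=1):
        def txn(data):
            data[key] = data.get(key, 0) + delta
            return data[key]
        return self.execute(key, txn)

    def get(self, key):
        return self.execute(key, lambda data: data.get(key))

    def close(self):
        for partition in self.partitions:
            partition.inbox.put(None)
        for partition in self.partitions:
            partition.worker.join()


store = PartitionedStore(num_partitions=4)
store.incr("coupon:42")
store.incr("coupon:42", 2)
print(store.get("coupon:42"))  # 3
store.close()
```

Requests for the same key are serialized by the partition's queue, so a read-modify-write such as incr is atomic without any mutex on the data itself. The cost is that one hot partition is limited to one thread, which is why hot-spot visibility matters for this kind of design.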

Solid kernel technology only meets a product's "physiological needs"; its "safety needs" have to be met by the corresponding stability technology.

TairSQL stability technology

Stability technology touches every aspect of the product, including not only stability-oriented features in development but also the peripheral components that reflect the system's running state. This section mainly introduces the stability technologies of monitoring, the client, and server-side flow control.

Monitoring. As we all know, monitoring is the eyes of the system; without it, the details of how the product is running are hard to see. TairSQL currently has two main monitoring systems: one for cluster availability metrics, and a Grafana + Prometheus + TairSQL pipeline that presents RT and QPS data at one-second granularity. How complete the monitoring is directly determines whether certain details of the system can be discovered. For example, TairSQL's second-level monitoring clearly shows the QPS of each data node, so hot spots can be found even before flow control is triggered. Seen from the database, the final source of data access, hot spots have nowhere to hide.
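As a rough illustration of how per-node QPS at one-second granularity can expose a hot spot before flow control kicks in, the following Python sketch flags nodes whose QPS is far above the cluster median. The function name, the 3x ratio, and the minimum-QPS floor are hypothetical values for illustration; the article does not say how TairSQL's monitoring identifies hot spots.

```python
from statistics import median


def find_hot_nodes(qps_by_node, ratio=3.0, min_qps=1000):
    """Return the names of nodes whose QPS exceeds both `ratio` times the
    cluster median and the absolute floor `min_qps`, in sorted order."""
    if not qps_by_node:
        return []
    baseline = median(qps_by_node.values())
    threshold = max(baseline * ratio, min_qps)
    return sorted(node for node, qps in qps_by_node.items() if qps > threshold)


# One second of per-node QPS samples (made-up numbers).
sample = {"node-1": 5000, "node-2": 5200, "node-3": 4800, "node-4": 21000}
print(find_hot_nodes(sample))  # ['node-4']
```

Comparing against the median rather than the mean keeps a single very hot node from raising the baseline it is measured against.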

Client. TairSQL uses a rich-client approach in which requests are routed directly to the node to be accessed. Controlling resource consumption on the client, reducing interaction overhead with the server, limiting the impact on the server when 10K+ application nodes connect and disconnect, and notifying clients promptly when the server topology changes are all implementation considerations and optimization points in the client SDK. The client is also adapted to in-group products such as VipServer and Hawk-Eye, which shields applications from back-end node changes and supports shadow-table link access and locating problems along the full access link.

Server-side flow control. Flow control, or back pressure, is a feature of any mature server-side product. TairSQL's online server-side flow control currently limits requests along two dimensions, the memory size and the length of the work queue, with default values derived from limit stress tests. The trigger phase is relatively loose, and flow control is only triggered under abnormal conditions. The recovery phase is relatively strict: the flow-control state is released only after the node has been determined, with high confidence, to be back to normal.
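The "loose trigger, strict recovery" behavior described above is a form of hysteresis, and a minimal Python sketch can make it concrete. Everything here is an assumption for illustration: the class name, the two limits, the recovery watermark expressed as a fraction of the limits, and the use of consecutive healthy samples as a stand-in for "high confidence". The article does not give TairSQL's actual thresholds or recovery rule.

```python
class FlowController:
    """Back pressure with hysteresis: loose trigger, strict recovery."""

    def __init__(self, max_queue_len, max_queue_bytes,
                 recover_fraction=0.5, recover_samples=5):
        self.max_queue_len = max_queue_len        # limit on work queue length
        self.max_queue_bytes = max_queue_bytes    # limit on work queue memory
        self.recover_fraction = recover_fraction  # lower watermark for recovery
        self.recover_samples = recover_samples    # healthy samples required in a row
        self.throttled = False
        self._healthy_streak = 0

    def observe(self, queue_len, queue_bytes):
        """Feed one sample of the work queue; return True while throttling."""
        if not self.throttled:
            # Trigger phase: only an outright breach of a limit starts throttling.
            if queue_len > self.max_queue_len or queue_bytes > self.max_queue_bytes:
                self.throttled = True
                self._healthy_streak = 0
            return self.throttled

        # Recovery phase: both dimensions must stay under the lower watermark
        # for several consecutive samples before throttling is released.
        healthy = (queue_len <= self.max_queue_len * self.recover_fraction
                   and queue_bytes <= self.max_queue_bytes * self.recover_fraction)
        self._healthy_streak = self._healthy_streak + 1 if healthy else 0
        if self._healthy_streak >= self.recover_samples:
            self.throttled = False
            self._healthy_streak = 0
        return self.throttled


controller = FlowController(max_queue_len=100, max_queue_bytes=1000, recover_samples=3)
for queue_len, queue_bytes in [(50, 500), (150, 500), (40, 400), (40, 400), (40, 400)]:
    print(queue_len, queue_bytes, controller.observe(queue_len, queue_bytes))
```

Requiring a streak below a lower watermark keeps the node from flapping in and out of flow control when the load hovers around the limit.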

Four: Sibling systems

The innovations made by Tair, the cloud-native in-memory database, are inseparable from the support of Alibaba Cloud's well-developed infrastructure:

  • DBaaS, the database management and control platform, lets Tair quickly deliver the general capabilities of Alibaba Cloud databases, such as security audit, high availability, elastic scaling, and intelligent diagnosis, as well as Tair's enterprise-grade capabilities such as data flashback and global distribution. For Tair persistent memory, DBaaS works with Alibaba Cloud Container Service ACK to support affinity scheduling of persistent memory and computing resources, which reduces persistent memory access latency, and provides QoS policies for persistent memory to keep the service secure and controllable and the product experience consistent.
  • The persistent memory product series provided by Divine Dragon bare metal servers gives Tair, the cloud-native in-memory database, the foundation for elastic service. Network technology optimized for traffic bursts lets Tair handle high-throughput scenarios with ease. Intelligent prediction of hardware risks, for components such as memory, lets Tair anticipate the level of risk ahead of peak hours and avoid it.
  • Aliyun Linux not only adapts to persistent memory hardware, but also provides optimized support for Tair's persistent memory data snapshots and reduces the latency of real-time snapshots.

Five: Summary

The performance of the Tair persistent memory edition in the 2021 Tmall Global Shopping Festival is an important milestone in Tair's product evolution. Tair will continue to build on DRAM and persistent memory as its core storage, focusing on cloud-native capabilities, intelligent data placement across mixed storage media, and the integration of online storage with real-time computing. By strengthening its capabilities as a cloud-native in-memory database and serving multiple workloads within a single system, it aims to truly help customers across many online scenarios.

This article is original content of Aliyun and may not be reproduced without permission.