Introduction

This article describes Didi's practical path of building NewSQL on top of its distributed NoSQL storage, Fusion. It covers the features, application scenarios, and design of Fusion-NewSQL.

1. Background

Fusion-NewSQL is a NewSQL storage system developed by Didi and built on top of distributed KV storage. It is compatible with the MySQL protocol, supports secondary indexes, and provides very large-scale persistent data storage with high-performance reads and writes.

Our problems

With the rapid, sustained growth of Didi's business, data volume and request volume increased sharply, putting increasing pressure on storage systems. Although sharding (splitting databases and tables) can to some extent absorb the growth in data volume and requests, Didi's multiple lines of business (express, premier, two-wheelers, etc.) evolve rapidly, and requests to add fields and indexes are very frequent. Sharding schemes are unfriendly to frequent schema changes: they create a heavy workload for DBAs, change cycles are long, and large-table operations can impact online services. At the same time, sharding schemes support secondary indexes poorly or not at all. Given the above, a NewSQL database became one direction for solving our business problems.

Open source product research

Initially, we investigated the open-source distributed NewSQL solution TiDB. While TiDB is an excellent NewSQL product, it is not a good fit for our business scenarios, for the following reasons:

- We need a high-throughput, low-latency database, but TiDB, in order to support transactions, uses a two-phase-commit (2PC) scheme that inherently cannot meet the low latency most of our businesses require (99th-percentile latency within 100 ms, or even within 50 ms). Moreover, most of our businesses do not really need distributed transactions, or can bypass them through other compensation mechanisms; this is determined by the business scenarios.
- The storage cost of TiDB's three replicas is relatively high.
- Some of our internal scenarios that import offline data into online systems cannot be connected to TiDB directly.

For these reasons, we set out to develop our own NewSQL that meets our business needs.

Our foundation

We did not intend to develop a complete NewSQL system from scratch; instead, we built a NewSQL system that meets our business scenarios on top of our own distributed KV store, Fusion. Fusion is a NoSQL database that adopts the Codis architecture, is compatible with the Redis protocol and data structures, and uses RocksDB as its storage engine. Fusion is already used by several hundred businesses within Didi and is one of Didi's main online storage systems.

The architecture diagram of Fusion is as follows:

We use hash sharding for data partitioning. From top to bottom: users access Fusion through a Redis-protocol client, and the request is sent to a Proxy, which forwards it to the back-end Fusion data nodes. To route a request, the Proxy computes a hash of the request's key and takes it modulo the number of slots to obtain a fixed slot id. Each slot id is mapped to a storage node, which solves the data-routing problem.
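As a rough sketch of this routing logic (a minimal illustration assuming a Codis-style layout; Fusion's actual hash function and slot count are not specified here, so CRC32 over 1024 slots is an assumption):

```python
import zlib

NUM_SLOTS = 1024  # assumed slot count; Codis-style deployments commonly use 1024

def slot_for_key(key: bytes, num_slots: int = NUM_SLOTS) -> int:
    """Map a key to a fixed slot id: hash(key) mod the number of slots."""
    return zlib.crc32(key) % num_slots

# Each slot id maps to one storage node; here a toy three-node layout.
slot_table = {slot: f"fusion-node-{slot % 3}" for slot in range(NUM_SLOTS)}

def route(key: bytes) -> str:
    """Return the storage node responsible for the given key."""
    return slot_table[slot_for_key(key)]

print(route(b"student_1001"))  # the same key always lands on the same node
```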

With a high-concurrency, low-latency, high-capacity storage layer in place, all we need to do is build the MySQL protocol and secondary indexes on top of it. How to convert the MySQL data format into Redis data structures for storage then becomes the problem we must face; it is discussed in detail later.

2. Requirements

- High throughput, low latency, and large capacity.
- Compatibility with the MySQL protocol and its downstream ecosystem.
- Support for primary-key and secondary-index queries.
- Flexible schema changes that do not affect the stability of online services.

3. Architecture design

Fusion-NewSQL consists of the following parts:

- A Fusion cluster that stores data: the Data cluster.
- A Fusion cluster that stores index information: the Index cluster.
- A configuration center responsible for schema management: ConfigServer.
- An asynchronous index builder: the Consumer, which consumes the mysql-binlog-format data written to MQ by the Data cluster and, based on the schema information, generates index data and writes it to the Index cluster.
- External dependencies: MQ and ZooKeeper.

The architecture diagram is as follows:

4. Detailed design

Storage structure

How to convert MySQL's table-structured data into Redis data structures was the first problem we faced.

The diagram below:

We convert each row of a MySQL table into a Redis Hashmap. The Hashmap's key consists of the table name plus the primary-key value and is globally unique. The following figure shows how a MySQL primary-key query is converted into the Redis protocol:
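A minimal sketch of this conversion (the exact key encoding, table name plus primary-key value joined by an underscore, is an assumption based on the description above):

```python
# One MySQL row of an assumed table `student`.
row = {"id": 1001, "name": "Wang", "age": 21}

hash_key = f"student_{row['id']}"                    # "student_1001", globally unique
hash_fields = {column: str(value) for column, value in row.items()}

# Over the Redis protocol this is roughly:
#   HSET student_1001 id 1001 name Wang age 21
# and the primary-key query SELECT * FROM student WHERE id = 1001 becomes:
#   HGETALL student_1001
```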

Besides the data, indexes also need to be stored in Fusion-NewSQL. Unlike the data, which is stored as Hashmaps, indexes are stored as key-value pairs. The key-value format differs slightly by index type (the formats below spell out the index name to make the encoding more intuitive):

Unique index: Key: table_indexName_indexColumnsValue; Value: Rowkey

Non-unique index: Key: table_indexName_indexColumnsValue_Rowkey; Value: NULL

The reason for this difference is that, for a non-unique index, the portion of the key before the appended Rowkey may repeat and is therefore not globally unique. The unique index, by contrast, does not encode the Rowkey in the key: when the query is a simple "=" lookup, the Rowkey can be found directly with a GET instead of a scan, which is more efficient.
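A sketch of the two encodings (separators and names are assumptions, continuing the student example):

```python
def unique_index_key(table: str, index_name: str, column_value) -> str:
    """Unique index: a GET on this key returns the Rowkey directly."""
    return f"{table}_{index_name}_{column_value}"

def non_unique_index_key(table: str, index_name: str, column_value, rowkey: str) -> str:
    """Non-unique index: appending the Rowkey makes the key globally unique."""
    return f"{table}_{index_name}_{column_value}_{rowkey}"

# Unique index on student.uid:
#   KEY student_idx_uid_9527              VALUE student_1001
# Non-unique index on student.age (two students aged 21):
#   KEY student_idx_age_21_student_1001   VALUE (empty)
#   KEY student_idx_age_21_student_1002   VALUE (empty)
```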

How to query data through secondary indexes will be highlighted later in the query process.

Data read and write process

Data write

1. The user sends the MySQL protocol request through the mysql-SDK to DiseServer.
2. DiseServer validates the user's SQL against the schema.
3. DiseServer converts the validated SQL into the Redis Hashmap structure.
4. The Data cluster writes the data to WAL files and stores it in RocksDB.
5. A background thread in the Data cluster consumes the WAL files, converts them into mysql-binlog format, and sends the data to MQ.
6. The asynchronous index module consumes MQ, builds index information based on the mysql-binlog operation type (INSERT, UPDATE, DELETE) and the schema information, and writes the index data to the Index cluster.

Through this link, a single MySQL write by the user completes both the data storage and the index construction. Since the index is built from the data asynchronously via MQ, there is a time lag between the data and the index.
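Step 6 can be sketched as follows; this is a minimal illustration under assumed names and event shapes, covering non-unique indexes only, not Fusion-NewSQL's actual implementation:

```python
def handle_binlog_event(event: dict, schema: dict, index_cluster) -> None:
    """Consume one binlog-style event from MQ and maintain index KV entries.

    Assumed event shape: {"table", "op", "before", "after"}, where "before"
    and "after" are row images (INSERT carries "after", DELETE carries
    "before", UPDATE carries both).
    """
    meta = schema[event["table"]]  # e.g. {"pk": "id", "indexes": [{"name": "idx_age", "column": "age"}]}

    def index_keys(row: dict) -> list:
        rowkey = f"{event['table']}_{row[meta['pk']]}"
        return [f"{event['table']}_{idx['name']}_{row[idx['column']]}_{rowkey}"
                for idx in meta["indexes"]]

    if event["op"] in ("UPDATE", "DELETE"):
        for key in index_keys(event["before"]):   # drop stale index entries
            index_cluster.delete(key)
    if event["op"] in ("INSERT", "UPDATE"):
        for key in index_keys(event["after"]):    # write fresh index entries
            index_cluster.set(key, "")            # non-unique index: empty value
```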

The query

Here is an example of using a secondary index to query data:

1. After receiving a SQL query, DiseServer selects an index based on the query condition. If no index matches, DiseServer returns an error (Fusion-NewSQL cannot use non-indexed fields as query conditions).
2. Based on the selected index, a scan range is constructed, and the scan command is used to traverse the Index cluster and obtain the set of primary keys that meet the conditions.

The following figure shows an example of scanning a secondary index driven by a SQL query:

3. Using the primary keys, the hgetall command is run against the Data cluster to fetch the qualifying rows.
4. The result set is built into MySQL format and returned to the user.

As the index-data format above shows, the prefix must be fixed during a range scan. Mapped back to the SQL statement, this means that only one field in the WHERE condition can be a range query, not several. For example:

Suppose index_age_name is a composite index on the age and name fields, and the query is:

SELECT * FROM student WHERE age > 20 AND name > 'W';

During the scan there is no way to fix a prefix, so the qualifying data cannot be retrieved through the index_age_name index. Therefore, with indexes stored in KV form, only one field in the WHERE condition can carry a range query. This could be worked around by storing each field of the composite index separately and intersecting the results of multiple lookups, but that runs contrary to the original goal of reducing RPC round trips and latency. To solve this problem, we introduced the Elasticsearch search engine, which is explained later in this article.
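A sketch of the scan-based lookup and of the prefix limitation (key layout as above; the client calls follow redis-py naming with decoded string responses, which is an assumption):

```python
def query_students_by_age(index_cluster, data_cluster, age: int) -> list:
    """SELECT * FROM student WHERE age = <age>, via the non-unique index idx_age."""
    prefix = f"student_idx_age_{age}_"
    rows = []
    for index_key in index_cluster.scan_iter(match=prefix + "*"):
        rowkey = index_key[len(prefix):]           # the trailing part is the Rowkey
        rows.append(data_cluster.hgetall(rowkey))  # fetch the full row from the Data cluster
    return rows

# With the composite index index_age_name, "age = 21 AND name > 'W'" still fixes
# the prefix "student_index_age_name_21_", so a single scan works; but
# "age > 20 AND name > 'W'" fixes no prefix, so the KV index cannot serve it.
```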

▍ Schema change

When a schema change is required, the user submits a work order to the management system. After approval, the management system pushes the change request to the configuration center. After a safety check, the configuration center writes the new schema to storage and pushes the change to each node.

Field changes:

A node that receives the push updates its local schema. Historical data is not actually modified; instead, at query time, the stored fields are matched against the schema information. If a field in the schema is missing from the data, its default value is used; if the data contains more fields than the schema, the extra fields are hidden.
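A minimal sketch of this read-time matching (function and field names are hypothetical):

```python
def materialize_row(stored: dict, schema_fields: dict) -> dict:
    """Match stored fields against the current schema at query time.

    schema_fields maps field name -> default value; historical rows are
    never rewritten on a schema change.
    """
    row = {field: stored.get(field, default)   # missing field -> default value
           for field, default in schema_fields.items()}
    # Fields present in storage but absent from the schema are simply hidden.
    return row

# After adding a column `city` with default 'unknown':
print(materialize_row({"id": "1001", "name": "Wang"},
                      {"id": None, "name": None, "city": "unknown"}))
# {'id': '1001', 'name': 'Wang', 'city': 'unknown'}
```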

Adding an index is processed in two steps:

1. When the index is added, historical data is not processed; incremental data immediately goes through the index-building process.
2. A historical-index construction tool scans the historical data, builds the KV entries for the new index, and completes the index construction for the historical data.

One optimization here: the tool scans a slave rather than the master, to avoid any impact on online traffic.
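A sketch of the backfill tool under the same assumed key layout and redis-py-style client calls:

```python
def backfill_index(replica, index_cluster, table: str, index_name: str, column: str) -> None:
    """Scan a slave replica of the Data cluster (never the master) and build
    the KV entries of a newly added non-unique index for historical data."""
    for rowkey in replica.scan_iter(match=f"{table}_*"):
        row = replica.hgetall(rowkey)
        if column in row:                      # skip rows lacking the indexed field
            index_cluster.set(f"{table}_{index_name}_{row[column]}_{rowkey}", "")
```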

5. Ecological construction

The days when a single storage product could solve all problems are long gone, and data silos cannot serve businesses well. Fusion-NewSQL was therefore designed from the start with data flow to and from other storage systems in mind.

Fusion-NewSQL to other storage systems

Fusion-NewSQL sends data to MQ in MySQL-compatible binlog format. Downstream systems that can consume MySQL data can store it in other systems by consuming Fusion-NewSQL's data from MQ in the same format. This approach achieves maximum compatibility with minimal work.

▍ Hive to Fusion-NewSQL

Fusion-NewSQL also supports transferring offline Hive table data into Fusion-NewSQL with its FastLoad (DTS) tool, to ease the flow of data from offline to online systems.

If users moved the data themselves, they would scan the Hive table and construct MySQL write statements to Fusion-NewSQL row by row, as follows:

1. The mysql-client sends the write request to DiseServer.
2. DiseServer parses the MySQL write, converts it into a Hashmap, and sends the converted data to the Data cluster over the Redis protocol.
3. Storage nodes in the Data cluster receive the data and write it to WAL files.
4. Storage nodes in the Data cluster go through the RocksDB write process, including memtable writes, flushes, and background compactions.
5. An asynchronous thread consumes the WAL and sends the data to MQ in mysql-binlog format.
6. The asynchronous indexer consumes the mysql-binlog and sends write requests to the Index cluster.
7. Storage nodes in the Index cluster write the WAL.
8. Storage nodes in the Index cluster go through the RocksDB write process.

As you can see from the flow above, this migration approach has several pain points:

- Every user with a Hive-to-Fusion-NewSQL import requirement has to develop a set of code with the same logic, which incurs high maintenance costs.
- Each piece of Hive data must pass through a long link, so imports take a long time.
- The offline platform holds a large volume of data with high throughput, which directly raises the QPS of the online system and greatly affects its stability.

Based on these pain points, we designed the FastLoad data import platform. By agreeing on a Hive-to-Fusion-NewSQL table format, we use Hadoop to process the data in parallel and construct the SST files that RocksDB recognizes, bypassing the complex DISE write link. Data is imported into Fusion-NewSQL as follows:

1. The user fills in a work order that maps fields of the Hive table to fields of the Fusion-NewSQL table; several Hive fields may be combined into one Fusion-NewSQL field.
2. Hadoop traverses the Hive table and obtains from ZooKeeper the routing information describing where the data should be stored in the Data cluster and the Index cluster.
3. Using the computed routing information, the data is built directly into SST files that RocksDB can recognize; their contents are the KV data composed according to DISE's table-structure information.
4. The SST files are sent directly to the designated storage nodes, which use the ingest function provided by RocksDB to load them straight into Fusion-NewSQL for users to read.
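A sketch of the build step (all names are hypothetical; the point is that each Hive row is turned into exactly the KV layout DiseServer would have produced, then written into SST files that RocksDB's ingest feature can load, bypassing the WAL/memtable/flush/compaction path):

```python
def build_kv_for_row(table: str, pk_column: str, row: dict, field_mapping: dict):
    """Convert one Hive row into the Fusion-NewSQL KV layout.

    field_mapping maps a Fusion-NewSQL field to the list of Hive columns
    that compose it (several Hive columns may form one field, per the work order).
    """
    rowkey = f"{table}_{row[pk_column]}"
    fields = {dst: "_".join(str(row[col]) for col in cols)
              for dst, cols in field_mapping.items()}
    return rowkey, fields

# The Hadoop job groups (rowkey, fields) pairs by slot/node using the routing
# info from ZooKeeper, sorts them by key, writes them with an SST file writer,
# and ships the files to the storage nodes for direct ingestion.
```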

This scheme avoids the long, complex write link, does not increase the QPS of the system, and, as long as disk and network I/O are not at their bottleneck, has almost no impact on online access. In addition, users only need to fill in the schema mapping between Hive and Fusion-NewSQL; they do not need to worry about the implementation.

Complex search through Elasticsearch

While businesses were using MySQL or Fusion-NewSQL, we found the following scenario: a service's search conditions are very complex, involving many fields, conditions, and aggregations. In such cases, the service chooses Elasticsearch as a downstream of MySQL or Fusion-NewSQL and imports the data into Elasticsearch. With Elasticsearch's rich search capability, the service first obtains the primary keys from Elasticsearch, then fetches the full data from MySQL or Fusion-NewSQL by primary key.

Based on this scenario, Fusion-NewSQL provides a special index type: ES. When creating an index, users can select the fields that need complex queries and create an ES index on them. This way, each service avoids developing its own logic to interact with Elasticsearch and keeps MySQL as its single database interface. At the same time, this compensates for the inability of Fusion-NewSQL's KV secondary indexes, described earlier, to support range retrieval over multiple fields.

The architecture diagram is as follows:

At step 4 (red) in the figure above, the ES index simply writes the fields contained in the ES index, together with the primary key, to Elasticsearch. At step 1 (green), if an ES-type index is chosen during a query, an Elasticsearch DSL statement is assembled from the fields in the WHERE condition; the primary keys are fetched from Elasticsearch, and the rows are then fetched from the Data cluster. Because Elasticsearch queries are comparatively slow, Fusion-NewSQL allows a single table to carry both KV indexes and ES indexes: KV indexes serve latency-sensitive queries with relatively simple conditions, while ES indexes serve complex query conditions with looser latency requirements.
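As an illustration, the WHERE clause from the earlier example could be assembled into a DSL query roughly like this (field names and the stored rowkey field are assumptions):

```python
# Elasticsearch DSL for: WHERE age > 20 AND name > 'W'
query = {
    "_source": ["rowkey"],                  # only the primary key is needed from ES
    "query": {
        "bool": {
            "filter": [
                {"range": {"age":  {"gt": 20}}},
                {"range": {"name": {"gt": "W"}}},
            ]
        }
    },
}
# DiseServer sends this to Elasticsearch, collects the matching rowkeys, then
# issues HGETALL against the Data cluster for each rowkey, as with KV indexes.
```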

6. Summary

Fusion-NewSQL currently serves more than 70 core services, including orders, fare estimates, billing, the user center, and the trading engine, with a total QPS of over 2 million and more than 600 TB of data in total.

Of course, Fusion-NewSQL is not a general-purpose, complete NewSQL solution. Rather, by supporting the SQL protocol and combining various components, it builds a database with a SQL interface on top of an existing NoSQL database. Still, this approach meets most business scenarios at minimal development cost and offers a high input-output ratio.

7. Follow-up

- Limited transaction support: for example, if the business can arrange for related data to land on one node, single-node cross-row transactions can be supported.
- Real-time indexes to replace asynchronous indexes, meeting write-then-read requirements. There is already a scheme of double-write plus a compensation mechanism, which satisfies real-time indexing in the normal case without distributed transactions and guarantees data-index consistency in abnormal cases.
- More SQL protocol and feature support.

Author:

Li Xin, Didi | Senior Software Development Engineer

Many years of experience in distributed storage design and development. Participated in the design and development of the NoSQL/NewSQL database Fusion, the distributed time-series database Sentry, the NewSQL database SDB, and other systems.