This is the sixth day of my participation in Gwen Challenge

Recent science and technology exchange with counterparts, often asked depots table and how to choose the distributed database, there are also many online about + (depots table) and traditional relational database middleware NewSQL distributed database of articles, but some of the ideas and judgment is I think is extreme, from environment to evaluate good or bad it is unfair.

This paper compares the realization principles of the key features of the two modes, hoping to clarify their real advantages and disadvantages and application scenarios as objectively and neutrally as possible.

Where is NewSQL database Advanced?

First of all, regarding whether “middleware + relational database sub-database sub-table” can be counted as NewSQL distributed database, there is a foreign paper pavLO-NewSQL-Sigmodrec. According to the classification in this paper, Spanner, TiDB and OB are the first new architecture types. Middleware schemes such as Sharding-Sphere, Mycat and DRDS are the second kind (there is also a third cloud database in this paper, which will not be introduced in detail in this paper). Is the distributed architecture based on middleware (including SDK and Proxy) + traditional relational database (sub-database and sub-table) mode? I think so, because storage is really distributed and can scale horizontally. But not a pseudo-distributed database? There is some truth to that in terms of architectural sophistication.” Pseudo is mainly reflected in the repeated SQL parsing and execution plan generation of middleware layer and bottom DB, storage engine based on B+Tree, etc., which is actually redundant and inefficient in distributed database architecture. In order to avoid causing a war of words between real and fake distributed databases, NewSQL database in this article refers to the new architecture NewSQL database.

Where is NewSQL database superior to middleware + sub-library sub-table? Draw a simple architecture comparison diagram:

  1. Traditional databases are not as efficient as NewSQL databases because of their disk-oriented design, memory-based storage management and concurrency control.

  2. Middleware pattern SQL parsing, execution plan optimization and other repetitive work in middleware and database, relatively low efficiency;

  3. NewSQL database distributed transactions are optimized for higher performance compared to XA;

  4. The new architecture NewSQL database storage design is based on paxOS (or Raft protocol) multiple copies, compared to traditional database master/slave mode (semi-synchronous to asynchronous also has the problem of loss), in the implementation of the real high availability, high reliability (RTO<30s, RPO=0)

  5. NewSQL database naturally supports data sharding, and data migration and expansion are automated, greatly reducing the work of the DBA. Meanwhile, it is transparent to the application, and there is no need to specify branch database and table keys in SQL.

These are some of the main marketing points of the NewSQL database product, but are these seemingly nice features really so? Next, I will elaborate my understanding on the above points respectively.

Distributed transaction

It’s a double-edged sword.

CAP restrictions

Think of the earlier NoSQL databases that don’t support distributed transactions (the latest versions of mongoDB and others are starting to do so). Is there a lack of theoretical and practical support? No, the reason is that the CAP theorem is still the collar curse on A distributed database, guaranteeing strong consistency while necessarily sacrificing availability A or partition tolerance P. Why don’t most NoSQL provide distributed transactions?

So does NewSQL database break the CAP theorem limit? Don’t. The nose of the NewSQL database, Google Spanner (based on which most distributed databases are designed), provides consistency and greater than five nines, claims to be a “de facto CA”, The real meaning of cloud Spanner is that the probability of system being in CA state is very high, and the probability of service outage due to network partition is very low. The real reason is that cloud Spanner creates a private global network to ensure that there is no network partition caused by network interruption, and its efficient operation and maintenance team is also the selling point of Cloud Spanner. The details can be found in Spanner, TrueTime and CAP Theory written by Eric Brewer, the author of CAP.

I recommend an interesting article on distributed systems, standing on the distributed shoulders of giants, that says: In distributed systems, you can know where the work is, or you can know when the work is done, but you can’t know both; Two-phase protocols are essentially anti-availability protocols.

Completeness:

Does the two-phase commit protocol strictly support ACID, and can all exception scenarios be overridden? 2PC sends an exception in the commit phase, which is similar to the commit phase of maximum effort. Strictly speaking, atomicity of A and consistency of C cannot be guaranteed within A period of time (the recovery mechanism can guarantee final A and C after fault recovery). Complete distributed transaction support is not a simple thing, it needs to be able to cope with network and all kinds of hardware including network card, disk, CPU, memory, power supply and other anomalies, through strict tests. Previously with a friend of the exchange, they even said that currently known NewSQL in distributed transaction support are incomplete, they have cases failed, insiders so sure, also shows that distributed transaction support integrity is actually uneven levels.

However, distributed transaction is a very important underlying mechanism of these NewSQL databases, and cross-resource DML and DDL are dependent on its implementation. If the performance and completeness of this part are compromised, the correctness of the upper cross-shard SQL execution will be greatly affected.

performance

Distributed transaction XA is also supported in traditional relational databases, but why is it rarely used in high-concurrency scenarios? Because XA’s basic two-phase commit protocol has some problems, such as high network overhead, long blocking time and deadlocks, it is rarely used in OLTP systems based on traditional relational databases on a large scale. Distributed transaction implementation of NewSQL database is still mostly based on two-phase commit protocol, such as Google Percolator distributed transaction model, which adopts atomic clock +MVCC+ Snapshot Isolation (SI), This approach ensures global consistency through TSO(Timestamp Oracle), avoids locks through MVCC, and makes part of the commit asynchronous through primary and secondary locks, which actually improves the performance of distributed transactions compared to XA.

SI is an optimistic lock. In hot data scenarios, a large number of commit failures may occur. In addition, the isolation level of SI is not exactly the same as RR, it does not have fantasy read, but it does have write tilt.

However, no matter how optimized it is, compared with 1PC, the extra GID acquisition, network overhead, and prepare log persistence on 2PC will still cause a significant performance loss, especially when the number of cross-nodes is large. For example, in a banking scenario, when a batch deduction is made, a file may be used for W accounts. No matter how you do it or how you swallow it, it’s not very high.

Distributed transaction test data given by Spanner

Although the NewSQL distributed database products are advertised as fully supporting distributed transactions, this does not mean that applications can be completely free from concerns about data splitting. The best practices of these databases still state that distributed transactions should be avoided in most scenarios.

Since the performance cost of strong consistent transactions is too high, we can reflect on whether we really need such strong consistent distributed transactions. Especially after the microservices split, many systems are unlikely to be placed in a unified database. Trying to make Consistency weak is flexible transactions, abandon ACID(Atomicity,Consistency, Isolation, Fix), Switch to BASE(Basically Available,Soft state,Eventually Consistent), such as Saga, TCC, reliable message guarantee, and other models. For large-scale high-concurrency OLTP scenarios, I personally recommend using flexible transactions rather than strongly consistent distributed transactions. As for flexible transactions, THE author has also written about a technical component before, and some new models and frameworks have emerged in recent years (such as Fescar, which is open source by Alibaba). I don’t have enough space to describe it, so I will write a separate article when I am free.

Can only use two-phase commit protocol to solve distributed transactions? The idea of avoiding distributed transactions with updateserver in oceanbase1.0 was instructive, but this has also changed to 2PC since version 2.0. Distributed transactions in the industry are not only two-phase commit solution, there are other solutions its-time-to-move-on-from-two-phase(if not open, there is a domestic translation version www.jdon.com/51588)

HA and remote live more

Master/slave mode is not the optimal mode. Even semi-synchronous replication, there is a loss problem in extreme cases (semi-synchronous to asynchronous). Currently, it is generally accepted that a better solution is based on PaxOS distributed consistency protocol or other paxOS-like methods such as RAFT mode. Google Spanner, TiDB, cockcoachDB, and OB all adopt this method. Based on Paxos protocol, the multi-copy storage follows the half-write principle and supports automatic master selection, which solves the problem of high data reliability, shortens failover time, and improves availability. In particular, it reduces the workload of operation and maintenance. This solution is technically mature and is also the standard configuration of the underlying NewSQL database. Of course, this approach can also be used in traditional relational databases. Alibaba and wechat teams have also transformed MySQL storage to support multiple copies of PaxOS. MySQL has also launched an official version of MySQL Group Cluster, and it is expected that master-slave mode may become history in the near future.

Distributed consistency algorithm itself is not difficult, but in engineering practice, it is not easy to realize a production-level reliable and mature consistency protocol because of the need to consider a lot of exceptions and do a lot of optimization. For example, you need to convert it to multi-PaxOS or multi-RAFT to reduce network and disk I/O overhead by batch or asynchronous.

It is important to note that many NewSQL database vendors advertise remote multiplexes based on PaxOS or RAFT protocols, but this is based on the premise that the network latency between remote databases is not too high. Take the “two and three centers” of the bank as an example, the distance between the remote is thousands of miles, and the delay reaches tens of milliseconds. If you want to live more, it needs the remote copy to participate in the database log half confirmation, so high delay almost no OLTP system can accept.

It is a good vision to do remote multi-living at the database level, but there is no good solution for the delay caused by distance at present. I communicated with the ant team before that the ant’s remote multi-activity scheme is to synchronize double-write transaction information at the application layer through MQ. The remote DC stores the transaction information in the distributed cache. Once the remote switchover occurs, the database synchronization middleware will inform the data delay time, and the application will read the transaction information from the cache. Blacklist service objects, such as users and accounts, and remove them from the blacklist after data synchronization. Since the double-write is not all database operation logs but only transaction information, the data delay only affects the data for a period of time, which is the remote degree multi-live scheme that I think is more reliable at present.

In addition, some systems have been unitary transformation, which should also be taken into consideration when choosing paxOS master, which is also the lack of many NewSQL databases.

Scale horizontal extension and sharding mechanism

Paxos algorithm solves the problem of high availability and high reliability, but does not solve the problem of Scale horizontal expansion, so sharding must be supported. NewSQL databases have built-in sharding mechanism, and automatically identify hotspots according to the data load of each sharding (disk usage, write speed, etc.), and then split, migrate, and merge the sharding. Applications are unaware of these processes, which saves a lot of operation and maintenance workload for DBAs. For example, TiDB divides data into regions. If the region size reaches 64 MB, data is automatically migrated.

At the beginning of design, the split key, split mode (range, modular, consistent hash or custom routing table), routing rules, number of split library tables, and expansion mode of each table should be defined. Compared to The NewSQL database, this pattern introduces a lot of intrusion and complexity to the application, which is also a challenge for most systems.

The online capacity expansion can also be achieved in the database and table mode. The basic idea is to add data through asynchronous replication, then set the route switch as read-only, and finally release the write operation. Of course, this requires the cooperation of middleware and database.

The problem here is that the unified built-in sharding strategy for NewSQL databases (e.g., TiDB based on range) may not be the most efficient because it is inconsistent with the partitioning elements of the domain model, resulting in distributed transactions for many transactions. Banks, for example, the core business system is based on the customer dimension, that is to say the customer table, the customer account table, water table together in most scenarios are written, but if, in accordance with each table primary key range shard, the deal is not done on a shard, this in the high frequency OLTP system will result in performance problems.

Distributed SQL support

Common single-shard SQL, both of which are well supported. Because the NewSQL database is a generic database, it supports more complete SQL, including cross shard join, aggregation and other complex SQL. Most middleware patterns are designed for application requirements, but most also support SQL with split key, library table traversal, single library JOIN, aggregation, sorting, paging, etc. But cross-library join and aggregation support is not enough.

NewSQL databases generally do not support stored procedures, views, foreign keys, and other functions, while the underlying middleware pattern is traditional relational databases, these functions are relatively easy to support if only involving a single library. NewSQL databases are usually compatible with MySQL or PostgreSQL, so SQL support is limited to these two types. Middleware such as driver mode usually only needs to perform simple SQL parsing, routing calculation, and SQL rewriting, so it can support more types of SQL database.

The difference in SQL support mainly lies in distributed SQL execution plan generator. Because NewSQL database has distribution and statistics information of underlying data, it can do CBO and generate execution plan with higher efficiency. However, middleware mode does not have such information. It can only be Based on rule-based-opimization (RBO), which is why the middleware pattern generally does not support cross-library join, because the implementation is often not efficient, so it is better to leave it to the application.

It can also be seen that the architecture style of middleware + sub-library sub-table mode reflects a compromise and balance, which is an application-oriented design. The NewSQL database is more demanding and “all-in-one”. It is a common underlying technology software, so the complexity and technical threshold of the latter is much higher.

The storage engine

The storage engine design of traditional relational database is disk-oriented, mostly based on B+ tree. The B+ tree reduces the height of the tree to reduce random read times and disk seek times, improving the read performance. However, a large number of random writes lead to tree splitting, resulting in random writes, and the write performance deteriorates. The underlying storage engine of NewSQL mostly uses LSM. Compared with B+ tree LSM, the random write to disk is changed into sequential write, greatly improving the write performance. However, LSM’s read performance is worse than B+ tree because of the need to merge data. Generally speaking, LSM is more suitable for the scenario where the write is greater than the read. Of course, this is only a comparison from the perspective of data structure. In the actual implementation of the database, the read and write performance will be optimized by SSD, buffer, Bloom filter and other methods, so the read performance will not decrease much. NewSQL data due to multiple copies, distributed transactions and other overhead, compared to the stand-alone relational database SQL response time is not superior, but due to the elastic expansion of the cluster, the overall QPS improvement is significant, which is also NewSQL database vendors say distributed database is more important than throughput. Rather than a single SQL response time.

Maturity and Ecology

Distributed database is a new general underlying software, accurate measurement and evaluation needs a multi-dimensional test model, It should include development status, usage, community ecology, monitoring operation and maintenance, surrounding supporting tools, function satisfaction, DBA talent, SQL compatibility, performance testing, high availability testing, online capacity expansion, distributed transactions, isolation level, online DDL, etc. Although the development of NewSQL database has been tested for a certain period of time, However, most of them are concentrated in the Internet and non-core transaction systems of traditional enterprises, which are still in the stage of rapid iteration and continuous optimization of scale use. In contrast, the traditional relational database has been developed for many years. Through the complete evaluation, it has obvious advantages in maturity, function, performance, surrounding ecology, risk control, relevant talent accumulation and other aspects, and has better compatibility to the established system. For Internet companies, the growing pressure of data volume and the gene of pursuing new technology will be more inclined to try NewSQL database, which is a very attractive solution without considering the problems of database table splitting, application transformation, expansion, transaction consistency and so on. For traditional enterprises such as banks with high risk awareness, NewSQL database may still be in the stage of exploration and prudent pilot for a period of time in the future. Based on middleware + sub-database sub-table mode has simple architecture and lower technical threshold. Although it is not as comprehensive as NewSQL database, the most core demand of most scenarios is the correct routing of split SQL, and this function of middleware mode is more than enough, it can be said that it is sufficient in most OLTP scenarios.

Due to space limitations, other features such as online DDL, data migration, operation and maintenance tools are not covered in this article.

conclusion

If you still don’t know which mode to choose after reading the above, consider whether the points NewSQL solves are real pain points for yourself by combining these questions:

  • Do strongly consistent transactions have to be resolved at the database layer?

  • Is the growth rate of the data unpredictable?

  • Does the frequency of capacity expansion exceed its o&M capability?

  • Is throughput more important than response time?

  • Is it necessary to be completely transparent to the application?

  • Is there a DBA team that is familiar with the NewSQL database?

If two or three of the above are true, then you can consider using NewSQL database. Although there may be some learning costs in the early stage, it is the development direction of database, and the future benefits will be higher, especially in the Internet industry, with the rapid development of data volume, the pain of database and table will increase day by day. Of course, you have to be prepared to take some risks with NewSQL. If you haven’t made up your mind yet, consider the following questions:

  • Does the final consistency meet actual scenarios?

  • Can the total amount of data in the next few years be estimated?

  • Is there a system maintenance window for capacity expansion and DDL operations?

  • Is it more sensitive to response time than throughput?

  • Do you need compatibility with existing relational database systems?

  • Is there an accumulation of traditional database DBA talent?

  • Is the intrusion of database and table into the application tolerable?

If most of these questions are “yes”, keep the tables separate. There are few perfect solutions in software, and NewSQL databases are not a silver bullet for data distribution architectures. In comparison, database and table partitioning is a scheme with lower cost and less risk. It maximizes the reuse of traditional relational database ecology, and satisfies most functions after database and table partitioning through middleware, with stronger customization ability. In the current stage of NewSQL database is not fully mature, sub-database sub-table can be said to be a low upper limit but high lower limit scheme, especially the core system of the traditional industry, if you still intend to use the database as a black box product, practical good sub-database sub-table will be considered a safe choice.

In many cases, the selection of software depends on the characteristics of the field and the style of the architect. Limited by the author’s knowledge and the characteristics of the industry, the above are only some superficial personal views, welcome to discuss.