About the author
Zhang Liang is the head of data research and development at JD.com. Passionate about open source, he currently leads two open source projects, Elastic-Job and Sharding-Sphere (Sharding-JDBC). He specializes in Java-based distributed architecture and in cloud platforms built on Kubernetes and Mesos. He advocates elegant code and has studied in depth how to write expressive code. He joined the JD.com data department at the beginning of 2018 and is now in charge of data research and development. His current focus is on building Sharding-Sphere into a first-class financial data solution for the industry.
Accumulated data is the wealth of industry giants in every field, and the database is one of the most important ways to store it. Now that big data and microservices are mainstream, the traditional relational database is facing a revolution of its own. Cloud native database architectures are getting more and more attention, so I would like to discuss cloud native data architectures with you. In this first part, we analyze the current state of development of the various kinds of databases.
Can relational databases still be used?
Relational databases have led the database field for several decades. The chart below shows the DB-Engines ranking, one of the most widely cited database popularity rankings. The ranking is based on the number of keyword searches on Google and Bing, the number of practitioners, the number of job postings, and the number of questions and followers on Stack Overflow:
DB-Engines database rankings published in June 2018
As of June 2018, among the top six databases only MongoDB, in fifth place, is a document database; the rest are all relational databases, and the scores of the top three are far ahead of the others.
1. Advantages
Relational databases remain strong after being bombarded by technological innovations such as big data, NoSQL, and NewSQL because of their inherent advantages. These advantages show up mainly in their effect on developers, on operations and maintenance staff, and on the system itself.
Development advantages
For developers, the first advantage of relational databases is that they are SQL oriented.
SQL is the structured query language for relational databases. Although different relational databases have their own SQL dialects, most of them support ANSI-standard SQL. As a language designed for database access, SQL makes it convenient to insert, delete, update, and query data, and to handle authorization and administration. SQL queries are also highly flexible, which makes it easy to move between online transaction processing (OLTP) and online analytical processing (OLAP).
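As a small illustration of that flexibility, the sketch below runs an OLTP-style point lookup and an OLAP-style aggregate against the same table using the same language. It uses Python's built-in sqlite3 module purely for convenience; the t_order table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_order (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO t_order (order_id, user_id, amount) VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 10, 75.0), (3, 11, 40.0)],
)

# OLTP-style access: fetch a single row by primary key.
one_order = conn.execute(
    "SELECT order_id, user_id, amount FROM t_order WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: aggregate over many rows.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM t_order GROUP BY user_id ORDER BY user_id"
).fetchall()

print(one_order)
print(totals)
```

Both statements go through the same connection and the same dialect; only the shape of the query changes.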
In addition, SQL is a language that application developers are expected to master, and it is so widespread that it is rare to find an application developer who cannot write SQL at all. As a result, SQL dramatically reduces the cost of recruiting developers.
Beyond the SQL language itself, the various programming languages also offer very complete support for relational databases. Take Java as an example: JDBC is the standard interface for accessing databases in Java, and relational database vendors provide drivers that implement it. Engineers developing in Java do not need to be aware of the differences between relational databases; they can simply program against the JDBC interface.
Because relational storage does not map one-to-one onto object-oriented Java programs, many object-relational mapping (ORM) frameworks have emerged to ease the object-relational impedance mismatch, such as JPA with implementations like Hibernate, MyBatis, and jOOQ. They further simplify the daily work of application engineers. Most of these frameworks are built on top of JDBC, so they are highly compatible with the various relational databases.
Operational advantages
Because relational databases have been around for so long, it is relatively easy to recruit a database administrator (DBA) for any of the common ones. DBAs ensure the stability, security, integrity, and performance of the database, monitor and analyze system bottlenecks, and review whether the design is sound.
Mature relational databases have complete ecosystems of their own and mature supporting tools for high availability, data backup, and performance monitoring and analysis. Large enterprises and important business systems generally need dedicated DBAs for operations and maintenance.
System advantages
Time is the only real test of a technology's maturity and stability. Relational databases have been used on a large scale for decades, and their storage engines are very mature. An MVCC-based database engine strikes a good balance between performance and correctness, and B+ tree indexes greatly improve query efficiency. For something as critical as the data layer, the cautious choice of a relational database is the architect's preferred solution.
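To give a rough sense of why a B+ tree index keeps lookups cheap, the sketch below estimates how many levels a tree needs for a given number of keys. It is an idealized model, not a description of any particular storage engine: it assumes every node holds exactly `fanout` entries, and the fanout of 100 used in the example is an arbitrary illustrative value.

```python
def btree_levels(row_count: int, fanout: int) -> int:
    """Estimate the number of levels needed to index row_count keys.

    Idealized assumption: every node holds exactly `fanout` entries.
    """
    if row_count < 1 or fanout < 2:
        raise ValueError("row_count must be >= 1 and fanout must be >= 2")
    levels = 1
    capacity = fanout  # keys reachable through a tree with `levels` levels
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


for rows in (1_000_000, 1_000_000_000):
    print(rows, btree_levels(rows, fanout=100))
```

Under these assumptions a thousandfold increase in rows adds only a couple of levels, which is why a lookup touches only a handful of pages. The depth still grows with data volume, though, so very large tables pay for it in extra I/O per lookup.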
ACID transactions are another powerful guarantee that relational databases give to application systems. ACID is an acronym for the four properties a database transaction needs in order to execute correctly: atomicity, consistency, isolation, and durability. Only a database that supports transactions can ensure the correctness and integrity of data to the greatest possible extent:
- Atomicity. All operations in a transaction either complete together (commit) or do not take effect at all (rollback); a transaction cannot stop at some intermediate step. If an error occurs while the transaction is executing, the data is restored to the state it was in before the transaction began.
- Consistency. A transaction that modifies data should move the database from one consistent state to another. A consistent state means that the data satisfies its integrity constraints and that the intermediate state of a transaction is not visible outside the transaction.
- Isolation. When multiple transactions execute concurrently, they should not affect one another; the result should be as if the database had executed them one at a time.
- Durability. After a transaction completes, all the changes it made are persisted in the database.
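Atomicity is the easiest of these properties to demonstrate in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module; the account table, the names, and the balances are hypothetical. A CHECK constraint makes the second statement of a transfer fail, and the first statement is rolled back along with it.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates take effect or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


failed = transfer(conn, "alice", "bob", 150)  # the debit would violate the CHECK
after_failure = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failure, succeeded, after_success)
```

The credit to bob in the failed transfer is applied first and then undone by the rollback, so no half-finished transfer is ever visible afterwards.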
Using transactions in application code is not difficult, and development frameworks such as Spring make it simple and elegant by handling it at the AOP level.
2. Shortcomings
In the era when enterprise applications ran on a single data node, the performance and access capacity of relational databases were beyond reproach. However, with the rapid growth of traffic and data volume, relational databases can no longer underpin such large-scale systems the way they used to, and have even become the bottleneck of application systems.
Relational databases mainly have the following three deficiencies:
- Limited concurrent access per node. Because the data stored in a database is stateful, a database cannot be split and scaled out as freely as a service. A single database node therefore serves a large number of query and update requests from many service nodes, which is not a peer-to-peer deployment architecture.
- Limited data capacity per node. The larger the data volume, the deeper the index built to query it. Index depth determines the number of I/O accesses per lookup, so the deeper the index, the slower the search.
- Distributed transaction performance degrades significantly. Once the database has been split, distributed transactions must replace local transactions. XA-based distributed transactions use two-phase commit, which locks resources in the prepare phase and holds them until the entire transaction ends, so performance drops sharply as system concurrency increases.
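The two-phase commit flow mentioned in the last item can be sketched as a toy coordinator. This is a simplified in-process model, not an XA implementation: the Participant class, its can_commit switch, and the locked flag are hypothetical names introduced only to show where resources stay locked.

```python
class Participant:
    """Toy resource manager. `can_commit` is a hypothetical switch that simulates a failure."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.locked = False
        self.state = "idle"

    def prepare(self):
        # Phase 1: lock the resources involved and vote on the outcome.
        self.locked = True
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        # Phase 2 (success): apply the change and release the locks.
        self.state = "committed"
        self.locked = False

    def rollback(self):
        # Phase 2 (failure): undo the change and release the locks.
        self.state = "rolled_back"
        self.locked = False


def two_phase_commit(participants):
    """Return True if every participant committed, False if all were rolled back."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:  # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:  # phase 2: roll back everywhere
        p.rollback()
    return False


ok_nodes = [Participant("db_a"), Participant("db_b")]
mixed_nodes = [Participant("db_a"), Participant("db_b", can_commit=False)]
print(two_phase_commit(ok_nodes), [p.state for p in ok_nodes])
print(two_phase_commit(mixed_nodes), [p.state for p in mixed_nodes])
```

Every participant holds its locks from prepare() until the coordinator has heard from all the others, so the slowest node sets the lock duration for everyone. That waiting is the source of the performance drop under high concurrency.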
To sum up, the shortcomings of relational databases ultimately stem from their original design. They were not designed as distributed products and are inherently unfriendly to distributed systems, which makes it hard for them to fit the architectural model of the Internet. Next to stateless services that can be scaled out flexibly at any time, relational databases look somewhat cumbersome.
NoSQL has not lived up to expectations
As the shortcomings of relational databases became more apparent, NoSQL emerged as a useful complement. NoSQL, however, is not meant to replace relational databases. Rather, the name stands for Not Only SQL, offering options beyond SQL.
NoSQL has many categories, including key-value databases, document databases, column family databases, and graph databases, for different scenarios.
1. Key-value database
The key-value database is represented by Redis. It is used as a cache in many scenarios, but Redis can also persist data to disk. Redis is very efficient for queries by primary key, but not for queries by content.
Redis provides clustering, which distributes data across different nodes and effectively relieves the traffic bottleneck of a single node. If the data set does not fit entirely in memory, Redis performance degrades. Sharding Redis data by primary key is therefore a good solution when the data volume is large.
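As a sketch of what sharding by key can look like, the code below follows the key-to-slot scheme documented for Redis Cluster: CRC16 of the key modulo 16384 hash slots. It deliberately ignores hash tags, and the node_for helper is a hypothetical even split of the slot range, not how a real cluster assigns slots to nodes.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the key, modulo 16384.

    Simplification: hash tags such as {user} are not handled here.
    """
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT


def node_for(key: str, node_count: int) -> int:
    """Hypothetical mapping that splits the slot range evenly across nodes."""
    return key_slot(key) * node_count // SLOT_COUNT


for key in ("user:1000", "user:1001", "order:42"):
    print(key, key_slot(key), node_for(key, node_count=3))
```

Because the slot depends only on the key, any client can compute where a key lives without consulting a central directory.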
Redis provides transaction functionality through the MULTI, EXEC, DISCARD, and WATCH commands. A Redis transaction executes its queued commands in one go, sequentially, and without interruption. However, if some of the commands in a transaction fail, the others are not rolled back, so Redis transactions are not equivalent to transactions in the relational database sense.
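The difference from a relational transaction is easier to see in a toy model. The class below is an in-memory stand-in written only to illustrate queue-then-execute semantics without rollback; it is not a Redis client, and ToyStore and its methods are hypothetical names.

```python
class ToyStore:
    """In-memory stand-in that mimics MULTI/EXEC queuing; it does not talk to Redis."""

    def __init__(self):
        self.data = {}
        self.queue = None  # None means no transaction is open

    def multi(self):
        self.queue = []

    def command(self, name, key, value=None):
        if self.queue is None:
            return self._apply(name, key, value)
        self.queue.append((name, key, value))
        return "QUEUED"

    def _apply(self, name, key, value):
        if name == "set":
            self.data[key] = value
            return "OK"
        if name == "incr":
            number = int(self.data.get(key, 0)) + 1  # ValueError if not an integer
            self.data[key] = number
            return number
        raise ValueError(f"unknown command: {name}")

    def execute(self):
        """Run the queued commands in order; a failure does not undo the others."""
        queued, self.queue = self.queue, None
        results = []
        for name, key, value in queued:
            try:
                results.append(self._apply(name, key, value))
            except ValueError as err:
                results.append(err)
        return results


store = ToyStore()
store.multi()
store.command("set", "name", "abc")
store.command("incr", "name")  # fails at execution time: "abc" is not an integer
store.command("set", "done", "yes")
results = store.execute()
print(results, store.data)
```

The failing increment is reported in the result list, but the two set commands around it still take effect, which is exactly what a relational transaction would not allow.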
2. Document database
The document database is represented by MongoDB. The document model is closer to an object-oriented representation of data and has a highly flexible schema that maps easily to JSON.
The design philosophy of a document database is completely different from that of a relational database. It has no statically defined table structure; attributes can be added or removed flexibly, and subdocuments and arrays can be embedded in documents at will. Application design for a document database should therefore focus on the objects themselves rather than on how the table structure is defined. This makes it very convenient for developers to change program logic without worrying about the table locks that schema changes cause in a relational database.
MongoDB queries are very flexible, and indexes can be created on whatever content needs to be searched to improve efficiency. In addition, MongoDB is much stronger than a relational database in distributed scenarios: it shards data automatically and handles load balancing and failover across shards transparently. It also has GridFS built in to support the storage of large files.
However, MongoDB does not support ACID transactions the way relational databases do; it relies on eventual consistency instead. It is therefore not recommended for highly critical business systems such as orders, transactions, and accounting, and is better suited to business systems such as forums, which place lower demands on transactions.
3. Column family database
The column family database is represented by HBase in the Hadoop big data system. It is a distributed database designed to handle massive amounts of data.
HBase locates a record by its row key and column family. The attributes within a column family are not fixed, much as in a document database. HBase also splits data automatically, so storage can scale out horizontally. HBase data is stored in a distributed file system such as HDFS, which makes it very well suited to massive data sets.
HBase uses a log-structured merge-tree (LSM tree). It first records data changes in memory and, once the changes reach a specified threshold, merges them to disk in batches. Turning individual writes into batch writes greatly improves write speed. When reading, however, HBase has to look for data both in memory and on disk, which hurts read performance. HBase is therefore better suited to write-heavy, read-light applications. In addition, HBase does not support ACID transactions either, and data can only be queried by row key.
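The write and read paths described above can be sketched with a very small LSM-style store. This is a toy model and not HBase's implementation: TinyLSM, its flush_threshold parameter, and the in-memory list standing in for on-disk files are all assumptions made for illustration, and compaction is left out.

```python
import bisect


class TinyLSM:
    """Toy LSM store: writes go to a memtable, which is flushed as a sorted segment."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical threshold, in entries
        self.memtable = {}
        self.segments = []  # stand-in for on-disk files, oldest first

    def put(self, key, value):
        # A write only touches memory until the threshold is reached.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the whole memtable out in one batch as a sorted, immutable segment.
        if self.memtable:
            items = sorted(self.memtable.items())
            keys = [k for k, _ in items]
            values = [v for _, v in items]
            self.segments.append((keys, values))
            self.memtable = {}

    def get(self, key):
        # A read checks memory first, then every segment from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.segments):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None


lsm = TinyLSM(flush_threshold=2)
lsm.put("a", 1)
lsm.put("b", 2)  # threshold reached: first segment is written
lsm.put("a", 3)  # newer value for "a" stays in the memtable for now
print(lsm.get("a"), lsm.get("b"), lsm.get("z"), len(lsm.segments))
```

A put never searches existing segments, while a get may have to visit all of them, which mirrors the trade-off between fast writes and slower reads.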
Graph databases are databases that deal with graph relationships and are used for special scenarios, so they are not covered here.
In general, there are many types of NoSQL databases that are suitable for different scenarios. Let’s compare the three types of NoSQL databases in the following table:
Although the usage scenarios of the various NoSQL databases differ greatly, most of them provide good support for the functions a distributed database needs, such as sharding and data migration, and they handle massive data volumes and high concurrency better than traditional relational databases.
NoSQL databases offer good scalability and flexibility, but their shortcomings are just as obvious:
Each NoSQL database has its own query language, which makes it much harder to develop applications against a standard interface than with SQL. NoSQL databases also do not provide ACID transactions, so many enterprises cannot safely use them in their core business systems.
As the name Not Only SQL suggests, they are merely a useful complement to SQL-based relational databases, not a replacement for them.
The rise of NewSQL
Because SQL and ACID transactions remained so popular, and because distributed databases were in greater demand than ever, another kind of database, NewSQL, was born.
NewSQL, a term for scalable distributed databases, inherits NoSQL's ability to handle massive amounts of data while keeping the SQL and ACID transaction support of traditional relational databases. NewSQL focuses on hybrid databases and leans toward multi-model database implementations that no longer distinguish between OLTP and OLAP queries.
In 2016, Andrew Pavlo and Matthew Aslett published a paper called “What’s Really New with NewSQL?” The paper divides NewSQL into three categories: new architectures, transparent sharding middleware, and database-as-a-service.
References:
https://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf
1. New architecture
This class of NewSQL consists of new database systems designed from the start for distributed architectures.
They typically use a shared-nothing architecture and support features such as multi-node concurrency control, fault tolerance through automated data replication, flow control, and distributed query processing.
Because they are designed for distributed multi-node systems from the ground up, they do better at query optimization and at inter-node communication protocols. For example, the data nodes of such a NewSQL database can communicate directly with one another without relying on a central node.
With the exception of Google's Spanner, databases of this kind need to manage their own storage and the distribution of data on disk and in memory. This means the database system sends queries to the nodes that hold the data, rather than copying data to the node that issued the request, which reduces network traffic.
Because the new architectural designs and storage engines have not yet been fully proven over time, enterprises are particularly cautious when selecting them. At the same time, few engineers have experience operating this new generation of NewSQL databases, so their user base is still small compared with relational databases. Many enterprises are currently following or trialing new-architecture NewSQL, but have not yet migrated their core systems to it.
The most typical new-architecture products are Google's Spanner and TiDB, which was developed in China.
2. Transparent sharding middleware
Transparent sharding database middleware splits data across multiple data nodes, while each data node is still a stand-alone relational database. The middleware uses central components to route data manipulation requests, coordinate transactions, manage data distribution, and replicate data. The entire cluster appears as a single logical instance, and applications can use it without any changes.
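The routing step can be illustrated with a deliberately naive sketch. The data source names (ds_0, ds_1), the modulo-based rule, and the string substitution below are all hypothetical simplifications; real middleware parses the SQL and supports configurable sharding strategies.

```python
# Hypothetical layout: two data nodes, each holding two physical tables per logical table.
DATA_NODES = ["ds_0", "ds_1"]
TABLES_PER_NODE = 2


def route(logic_table, sharding_value):
    """Pick the data node and physical table for one sharding key value."""
    data_source = DATA_NODES[sharding_value % len(DATA_NODES)]
    table_index = (sharding_value // len(DATA_NODES)) % TABLES_PER_NODE
    return data_source, f"{logic_table}_{table_index}"


def rewrite(sql, logic_table, sharding_value):
    """Naive rewrite: swap the logical table name for the physical one.

    Real middleware would parse the statement instead of using string replacement.
    """
    data_source, actual_table = route(logic_table, sharding_value)
    return data_source, sql.replace(logic_table, actual_table)


logical_sql = "SELECT * FROM t_order WHERE order_id = ?"
for order_id in (5, 6):
    print(order_id, rewrite(logical_sql, "t_order", order_id))
```

The application keeps issuing the logical statement against t_order; only the routing layer knows which node and which physical table actually serve the request.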
The core advantage of transparent sharding middleware is compatibility. A system can switch between its existing stand-alone relational database and the sharding middleware at low cost, without any code changes by developers. Such middleware is designed to make full use of the computing and storage capabilities of traditional relational databases in distributed scenarios, rather than to implement a new relational database.
In this way, we keep the stability and compatibility of the traditional relational database while adding the ability to handle distributed scenarios on top of it. Incremental improvement rather than disruption is the core idea behind this kind of NewSQL product. Thanks to MySQL being open source and popular, database middleware based on the MySQL protocol is the most common.
Because traditional single-node relational databases were designed around disk, they cannot use memory-oriented storage management and concurrency control, and so are less efficient than NewSQL systems redesigned for distributed architectures. In addition, work such as SQL parsing and query plan optimization is repeated in both the middleware and the database, which makes overall efficiency slightly lower than that of a newly designed NewSQL system.
This kind of NewSQL is very popular among large and medium-sized Internet companies in China, and almost every company has its own database middleware. However, because these products are tightly coupled to each company's internal business systems, there are few mature open source ones. Sharding-Proxy, part of the Sharding-Sphere ecosystem that we will discuss later, belongs to this kind of NewSQL product.
3. Cloud database
The last type of NewSQL is the cloud database product offered by cloud computing companies. Users of a cloud database do not need to maintain the database or its hardware themselves; all data is hosted by the services of the cloud platform. Users connect to the cloud database through its URL, and operate and monitor the system through an API or an operations dashboard.
Cloud databases have the lowest cost of use, and engineers do not have to deal with any of the details of the database. They are ideal for small and medium-sized companies, but for companies with large volumes of data, the first two kinds of NewSQL, whether open source or built in-house, are more appropriate.
Aurora, provided by Amazon, is a good example of this type of NewSQL product.
Overall, NewSQL, while still immature, is the right attempt at looking to the future. The three types of NewSQL databases have different concerns: new-architecture databases focus on radical innovation, transparent sharding middleware focuses on incremental improvement, and cloud databases focus on hiding operational details from users.
While the different types have their differences, their core functions are similar. Whatever the flavor of NewSQL, hybrid databases will be the future, and development costs will fall sharply once OLTP and OLAP no longer need to be distinguished.
At this point, we have a general overview of the current state of the various databases. In the next article, I will elaborate on the core functional characteristics of cloud native databases. Readers with thoughts on the topic are welcome to leave a comment.