Feifei Li, currently Vice President and Senior Researcher at Alibaba Group, is in charge of the Alibaba Cloud Intelligent Database Business Division. Prior to joining Alibaba, he was a tenured professor in the Department of Computer Science at the University of Utah. His research has won major academic awards, including IEEE ICDE and ACM SIGMOD Best Paper Awards.

In 2018, Li joined Alibaba's DAMO Academy and led a team to conduct research with independent intellectual property rights. The Alibaba Cloud Intelligent Database Business Division he leads has developed a new generation of distributed database systems that support Alibaba Group's complex businesses, massive data volumes, and the transaction peaks of the Double 11 shopping festival. These systems have been applied to intelligent urban traffic management in several cities and serve enterprises in finance, retail, logistics, manufacturing, and other industries.

In 2018, Alibaba Cloud's databases entered Gartner's Magic Quadrant for databases, the first time a Chinese company had appeared on that list. Recently, Alibaba Cloud's databases were also selected for Forrester's database evaluation report, making Alibaba Cloud the first technology company in China to be recognized by both top analyst firms.

On May 10, 2019, DTCC 2019 (the 10th China Database Technology Conference) was held in Beijing. Feifei Li attended and delivered a keynote speech, and during the conference he gave an in-depth interview to Lao Yu, executive editor of IT168 & ITPUB.

Two announcements:

1. PolarDB has become the fastest-growing database product on Alibaba Cloud since its commercialization last October;

2. AnalyticDB has passed the TPC-DS benchmark and ranked first in the world, including first in cost-effectiveness; the results have been published on the official TPC-DS website.

Some highlights:

1. None of the self-developed cloud native databases currently on the market is a truly distributed database; they can only be called databases on distributed storage. This may therefore be a breakthrough point in the second half of the competition;

2. The boundary between NoSQL and traditional relational databases will become increasingly blurred;

3. Many vendors claim their databases are NewSQL, but strictly speaking they have only implemented one or two points along some dimensions and have not solved all of NewSQL's technical challenges;

4. MongoDB's license modification is clever: it pressures cloud vendors to open-source their hosting platforms, which MongoDB can then use to run its own cloud hosting service;

5. In the first half, the core competitiveness of cloud vendors is actually the underlying hosting platform. Cloud vendors therefore will never host the latest versions of open source databases whose licenses have been modified, since doing so would force them to open-source their own control platforms;

6. The second half has two focal points: continuously improving the competitiveness of the hosting platform, and having a self-developed database kernel. This is why everyone is building their own databases: the competitiveness of the hosting platform alone cannot open up a gap.

……………………………………………………………………………………………

The following is the original interview, lightly edited for readability without changing its meaning.

Q: Could you talk about Alibaba Cloud's competitiveness in the database field in terms of the important product milestones and breakthroughs of its database product line?

A: As we all know, the database market is mainly divided into the following segments. The first is traditional OLTP, the so-called RDBMS online transaction processing systems. The classic commercial products are Oracle and SQL Server; the open source ones are MySQL and PostgreSQL. OLTP is a big part of Alibaba Cloud's business.

The second segment is OLAP, online analytical processing, with products such as Teradata and AWS Redshift.

The third segment is NoSQL databases, driven by the need to process unstructured and semi-structured data. HBase, Cassandra, and the now very popular MongoDB and Redis belong to this segment.

The last segment is the tool ecosystem: data transfer, data backup, and data management products. Beneath these four segments sits the operations and control platform, also known as the cloud database hosting platform. Together, these modules constitute the cloud database system and architecture.

The above covers the major segments of the database market; next I will talk about Alibaba Cloud's core products and technology accumulation in each module. First is the most important segment, OLTP, which can be broken down into two categories:

The first is hosted products, that is, third-party commercial and open source databases such as SQL Server, MySQL, and PostgreSQL. These mainly give customers rich choices so they can seamlessly migrate their on-premises databases to the cloud.

The second category is self-developed cloud native databases, of which the most important heavyweight product is PolarDB. PolarDB is a cloud native database built on distributed shared storage. The advantage of distributed shared storage is that storage and compute are separated and decoupled; once decoupled, each can be scaled elastically on its own, achieving extreme elasticity. This is very attractive to customers on the cloud, because one of the key demands of cloud customers is on-demand usage and on-demand billing.

In addition, PolarDB has many other technologies. For high availability, it uses three replicas with a distributed consensus protocol, Parallel Raft, to achieve financial-grade availability, so customers don't have to worry about RPO and RTO. Above the distributed storage sits a one-writer, multiple-reader set of compute nodes, and above that an intelligent proxy layer that performs automatic load balancing across the compute nodes. This combination gives PolarDB a significant advantage in cloud native OLTP processing.

For example, PolarDB can scale elastically in minutes: going from 2 cores to 32 cores takes about five minutes, and going from 2 nodes to 4 nodes also takes only a few minutes. Storage can grow from a few TB to 100 TB, and a single node can sustain a million QPS. For cloud customers it offers excellent solutions for elasticity, high availability, load balancing, and so on; compared with the traditional on-premises database architecture, PolarDB is very competitive.

I can say with great confidence that PolarDB on Alibaba Cloud has matched, and in some respects surpassed, AWS Aurora in both performance and technology. Alibaba Cloud has also published papers introducing this technology at top international conferences such as SIGMOD and VLDB.

Commercially, since commercialization began last October, PolarDB has been the fastest-growing database product on Alibaba Cloud. Its customers range from new retail to finance to traditional manufacturing, and many enterprises have begun migrating their database applications to PolarDB. That is the situation for the OLTP segment.

In the OLAP segment, we also divide products into hosted and self-developed. On the hosted side there are traditional BI tools such as Tableau. The main self-developed product is AnalyticDB, an analytical distributed database whose main feature is hybrid row-column storage, which can answer complex multi-table queries in seconds or even milliseconds.

Without going into technical detail, let me give two concrete examples. One is the recent TPC-DS ranking. As we know, TPC-DS is an analytical database benchmark the industry regards as very important. The good news is that AnalyticDB has passed TPC-DS's layers of testing and ranked first in the world, including first in cost-effectiveness; the results have been published on the official TPC-DS website. In addition, a paper introducing the whole AnalyticDB system will appear at this year's VLDB. These two things demonstrate how advanced the technology is.

Commercially, AnalyticDB supports highly parallel, second-level online analysis of massive data across a series of industries, from taxation and the City Brain to the public cloud, and from finance to real estate. It is a natural complement to PolarDB, forming a complete data link from OLTP to OLAP.

Finally, Alibaba Cloud also has a strong technical layout in NoSQL and tools, mainly core products forged through years of use inside the group. In tools, for example, we have DTS for data transmission, which performs real-time, incremental, consistency-preserving replication between different databases, between on-cloud and off-cloud environments, and between cloud instances, so customers can migrate databases quickly; and we have the DBS data backup service. This series of products starts from the customer's point of view: What do customers need? What are their pain points? We work backward from customer needs to what we must do technically, which is how we reached today's state.

Q: PolarDB maps to Aurora, and AnalyticDB maps to Redshift. So does Alibaba Cloud's database development follow its own established R&D strategy, or a follower strategy?

A: Objectively speaking, AWS is definitely the first mover in the cloud, not only in databases but also in IaaS and PaaS. Given their advanced experience and the detours they avoided, there is no need for us to take a completely different path. I personally believe we should learn from everyone's strengths and keep an open mind.

So, to answer your question: I think we were a follower in the beginning, and that is nothing to be ashamed to admit. But we must go from follower to surpasser, and become a leader. Through several years of effort, we have been able to carve out a different path and become a leader. How does one go from follower to leader? The core is to start from customer needs.

What are Alibaba Cloud's advantages? Its advantage is access to the vast customer demand in China. AWS's main market is in the United States, and the demands of customers in the United States and China are partly the same and partly different. For example, China has many large and medium-sized state-owned enterprises; there is no such organizational structure in the United States, and their demands are certainly different from those of American commercial companies. That is a very concrete example of how new ways of thinking and new challenges shape the way we evolve our technology, ultimately leading to a technical path different from Aurora's.

In addition, we are backed by Alibaba Group and its complex ecosystem, from e-commerce to offline new retail such as Hema, to online entertainment such as Youku. These not only pose great technical challenges but also provide a very rich training ground. This is one of the core guarantees of our ability to keep developing new technology.

Q: So far, how many services are in the Alibaba Cloud database product line?

A: We now have about 16 products in total, from hosted products to self-developed products. They are divided into the four main segments: OLTP, OLAP, NoSQL, and tools, plus the underlying hosting platform that users cannot see. The underlying hosting platform is not a standalone product; it is an invisible existence.

In terms of the number of database products, we and AWS are essentially in the same order of magnitude; there is no big difference. The core differences lie in the OLTP and OLAP segments.

Alibaba Cloud has gone from follower to roughly on par with AWS, and has even achieved a lead in some areas of technology. Take OLAP: AnalyticDB's performance has put it at the top of the TPC-DS list. In our own comparison against Redshift on AWS (buying Redshift on AWS and running the same workload), AnalyticDB performed better than Redshift on many TPC-DS queries.

In addition, in some areas we have what others lack: AWS may not have what Alibaba Cloud has. For example, in the distributed database segment, the group's Double 11 scenario required us to build a share-nothing architecture, so we built PolarDB-X on top of PolarDB. Such a share-nothing distributed architecture can support Double 11's massive, highly concurrent workloads.

From AWS's point of view, there is no product that directly targets this. So the cloud database era today is a state where a hundred flowers bloom and a hundred schools of thought contend: vendors worldwide, including Alibaba, AWS, Azure, and Google, each lead in some areas, while in other areas another vendor may lead. Objectively speaking, Alibaba Cloud's databases hold an absolute leading position in China in market, technology, and products, and globally they are at AWS's level. Hopefully, in the second half of the competition, we can truly be the leader in some areas.

Q: We know that several open source database vendors, like MongoDB, have modified their license agreements, mainly targeting cloud computing vendors. What do you think the relationship between the two will be in the future? Is this one of the driving forces behind cloud vendors releasing their own cloud native databases?

A: That's a very good question, and I'll extend it: not only do open source database vendors have the motivation and pressure to make the transition to cloud native, but traditional giants like Oracle are certainly also pushing hard to transition to the cloud.

Cloud native databases involve many technical points, the most important being elasticity, storage-compute separation, isolation, and multi-tenancy, and above all having your own cloud hosting platform. To provide services on the cloud, a company like Oracle or MongoDB must rely on cloud vendors' hosting platforms, which is why MongoDB changed its license last year.

In fact, MongoDB's license modification was very clever. It still allows hosting of the open source version of MongoDB, but if a vendor wants to keep providing services based on future versions, the hosting platform underneath must be open-sourced. That is, if AWS or Alibaba Cloud wants to continue hosting the latest version of MongoDB, their underlying control platforms must be open-sourced, and MongoDB could then use those open-sourced platforms for its own cloud hosting service. In fact, MongoDB did exactly that, developing its own Atlas. MongoDB's latest financial report shows Atlas growing by more than 40%, with its share of MongoDB's total revenue rising from just over 10% at the beginning of last year to more than 30% at the end of it.

The idea behind this is simple. Rather than let cloud vendors provide hosting services around the open source version of MongoDB and capture market share, MongoDB wants to run the hosting service itself, add its own kernel, and take the whole pie, positioning cloud vendors as just one layer of Infrastructure as a Service (IaaS). That is MongoDB's strategy, and commercial database vendors do something similar: Oracle builds Oracle Cloud, and SAP builds its own SAP Cloud.

Cloud vendors' coping strategy is also simple: continue hosting these products, but only previous versions, for which the license imposes no requirements on the hosting platform. They will absolutely not host the latest versions and open-source their own control platforms. What was each cloud vendor's core competitiveness in the first half of the cloud database war? It was actually the hosting platform underneath, because in the first half we mainly relied on MySQL, PostgreSQL, and commercial databases like SQL Server to pull the on-premises database market onto the cloud. That was the core competitiveness.

Users have two options on the cloud: either use hosted MySQL, PostgreSQL, or SQL Server, or build their own in a virtual machine.

Given these two choices, the cloud vendor's value lies in the hosting platform, because at the kernel level there is no difference from self-building. The core of the hosting platform is the SLA guarantee: the service-level agreement, RTO, and RPO can be much better than, or equal to, a self-built deployment, at a lower cost.

For users, achieving the same SLA guarantees as the hosting platform might require a strong DBA team, so the hosting platform can greatly reduce operating costs. That was the situation in the first half.

In the second half, if MongoDB and other vendors run their own cloud hosting services, they will pull customers who used cloud vendors' hosting services back toward building on virtual machines, positioning cloud vendors entirely as IaaS. For example, if customers use MongoDB Atlas, they effectively get the SLA capability that AWS's or Alibaba Cloud's hosting platform would provide without paying the cloud vendor directly for it, so for cost reasons they may choose that route.

So how do cloud vendors respond? Two points. The first is to continuously strengthen the hosting platform's competitiveness. For example, Alibaba Cloud has a self-driving cloud hosting platform called SDDP, which uses machine learning and artificial intelligence to perform automatic operations and optimization of database instances on the hosting platform, keeping the platform competitive. Second, from the kernel's perspective: why are Amazon, Alibaba, and Google all building their own cloud native databases? They realized that hosting-platform competitiveness alone could not open up a gap, so they needed their own controllable kernels, kernels whose performance can differ from traditional on-premises databases. Cloud native features can then attract customers to migrate from MySQL, PostgreSQL, and MongoDB to the self-developed cloud native databases.

AWS is the most typical, with Aurora, DynamoDB in NoSQL, and Redshift in analytics; after MongoDB modified its license, AWS introduced its own DocumentDB. The logic behind all of this is the same as before. Personally, I think this game is already in the second half. To sum up, as cloud vendors we need to work on two fronts. One is the control platform: improve its operations capability and efficiency through intelligent means, and improve its security, reliability, and verifiability. Last year AWS launched QLDB, the Quantum Ledger Database, which uses the Merkle tree technology from blockchains to verify database operation logs; customers can then verify the operation logs after moving to the cloud and confirm that the SLA is being met. Those are differentiators on the control platform side. The other front is the kernel: we continuously invest in kernel R&D so as to compete in a differentiated way with traditional databases and with newer kernels like MongoDB's. That is my view of the more exciting aspects of the second half of the cloud database battlefield.
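The Merkle-tree-based verification described here can be sketched in a few lines. This is an illustrative toy, not AWS's actual QLDB implementation: the platform publishes a root hash over its append-only operations log, and a customer who holds that root can recompute it to detect any tampering with the log.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the Merkle root of a list of log entries (bytes)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# A hypothetical append-only operations log kept by the hosting platform.
log = [b"CREATE TABLE t", b"INSERT 42", b"ALTER TABLE t", b"BACKUP t"]
root = merkle_root(log)

# Changing any single entry changes the root, so tampering is detectable.
tampered = list(log)
tampered[1] = b"INSERT 43"
assert merkle_root(log) == root
assert merkle_root(tampered) != root
```

In a real ledger database the root is extended incrementally as entries arrive and published periodically; the fixed-list version above only shows the verification idea.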

Q: You mentioned cloud native databases, and I've been hearing a few terms lately: cloud native distributed database, distributed middleware, and so on. How does one distinguish true cloud native from pseudo cloud native?

A: That's a good question. What is the traditional database architecture? It is a share-everything architecture. For example, above the local disks there may be a large memory pool built from multiple memory modules, and above that is compute, shared across multiple cores. The key point is that when transactions or queries come in, they share state across the entire database, from storage to memory to cores. That is the share-everything architecture, the traditional one; Oracle and SQL Server are examples.

This architecture has the advantage that shared state makes coordination easy, but the disadvantage that scalability is greatly limited, which is how the concept of distribution arose. The core challenge of distribution is to provide the ability to scale out as well as scale up.

How do you scale out? The classic approach, as in Google Spanner, is share-nothing: partition the tables and shard the data. If queries and transactions cross shards, you need distributed queries and distributed transactions. This is the architecture of database-and-table splitting, of Spanner, and of PolarDB-X. Share-nothing has two branches. One is the natively distributed architecture: sharding and partitioning are done underneath, but the customer does not need to know. The business logic does not need to change; if there are distributed transactions or distributed queries, the database takes care of them itself. Customers do not have to worry about how to split databases and tables or restructure their business logic. That is one branch.

The other branch is the solution you mentioned in your question: using middleware to split databases and tables, which is invasive to the business logic. It requires the database service provider, or the customers themselves, to understand the business logic clearly. For example, inventory and orders are two separate databases that usually have no intersection, so by business logic they can be split into two databases stored on different nodes. That is the middleware solution, and there are many such solutions in the industry.

The advantage of this solution is that it is relatively simple; the disadvantage is that it requires invasive changes to the customer's business logic, which natively distributed databases do not, and it does not support distributed queries and transactions as well as natively distributed databases do. All of this concerns share-nothing distributed databases.
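The inventory/orders split described above can be sketched as a toy middleware router (all names are hypothetical, for illustration only): the application, not the database, holds the split logic, which is exactly what makes the middleware approach invasive.

```python
import zlib

# Vertical split: each business domain lives in its own set of databases,
# mirroring the inventory/orders example; each set is further split by key.
DB_FOR_TABLE = {
    "inventory": ["inv_db_0", "inv_db_1"],
    "orders": ["ord_db_0", "ord_db_1"],
}

def route(table: str, shard_key: str) -> str:
    """Return the physical node for one row via a deterministic key hash."""
    nodes = DB_FOR_TABLE[table]
    return nodes[zlib.crc32(shard_key.encode()) % len(nodes)]

# Every query must now carry its shard key -- the application knows the
# split, so changing it later means changing application code.
node = route("orders", "order-1001")
assert node in DB_FOR_TABLE["orders"]
```

A natively distributed database performs the same routing below the SQL layer, so the application never sees it.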

Now, what you're calling a cloud native distributed database: I think that is a false proposition, a false concept. In fact, none of the vendors' cloud databases today has a truly distributed architecture; most use distributed storage as shared storage and then run many readers on top. PolarDB and Aurora both have this architecture. It is actually distributed shared storage with one writer, multiple readers, and storage-compute separation above it, which I think is the most typical architecture of today's so-called cloud native databases or cloud native distributed databases: a fast RDMA network is used to build distributed shared storage that looks like a local disk to the kernel above, though it is actually a distributed disk. The benefit is that there is only one physical copy of the data, which avoids the primary-replica physical replication challenges of MySQL and PostgreSQL: the writer and reader nodes operate on the same physical data, which brings many, many benefits. Strictly speaking, though, it is a database on distributed storage; it cannot be called a distributed database. This differs a bit from our classic definition of a distributed database, but people now call this architecture a cloud native database or cloud native distributed database.

What will happen in the second half? Personally, I think the breakthrough may be to combine the truly distributed architecture described above with the shared-storage architecture of cloud native databases through sharding (architecturally, the cloud native distributed database is called a shared-storage architecture). What are the benefits? Shared storage relies on RDMA, and RDMA is limited: it cannot be extended indefinitely, across an AZ or further. RDMA shared storage can only scale to a dozen or a few dozen nodes; once you cross a network switch, the performance penalty of RDMA is very high. Remote access cannot be as fast as local access, which limits the shared-storage architecture.

The most classic example, Oracle RAC, has 10 or 20 nodes. But if concurrency is high enough, or the node count is large enough, the only way out is to scale out rather than scale up; then you must adopt a distributed partitioning and sharding architecture. But if partitions and shards do not sit on shared storage, what is the impact? Each shard cannot be too big, because a single node can only hold so much data. That means there may be many shards, so the number of distributed transactions becomes very high, and once you have to do a distributed commit, the performance penalty is very high. So if I can combine the two, one of the nice things is that I can scale out again: with share-nothing on top and shared-storage nodes underneath, each shard can be made really big. In other words, I need very few shards to hold the same data. Fewer shards mean far less cross-shard distributed processing and far better distributed performance. So the combination of the two, I think, will be a relatively new and interesting challenge.

Q: The following question has always bothered me. You just talked about distribution, and there is also an older classification: SQL, NoSQL, NewSQL. What is the relationship between NewSQL and distributed databases? I used to think of the progression as SQL, then NoSQL, then NewSQL. Is the relationship between distributed databases and NewSQL one of inclusion, or something else?

A: This is also a very good question; colleagues and friends in the database field are often confused by it. First, although it is called NoSQL, it does not actually mean "no SQL"; it is short for Not Only SQL. How did NoSQL first develop? It came out of traditional relational databases. In a relational database, scale-out capability is limited because of strong consistency, namely ACID: atomicity, consistency, isolation, and durability. Guaranteeing these conflicts with distributed expansion, and at that time neither hardware nor software technology could make a traditional relational database scale out without limit.

However, many Internet companies, represented by Google, did need data storage, transaction processing, and query processing to scale almost without limit, because the data volumes were too large: the data generated every day had to be stored, a process of unbounded growth. Moreover, that data did not have to be structured; it could be semi-structured or even unstructured, and so the concept of NoSQL was born. To sum up, the core idea of NoSQL is to weaken the strong consistency requirements of the traditional relational database. For example, where a traditional relational database might offer snapshot isolation, a NoSQL system may offer only read committed, guaranteeing merely that there are no dirty reads. If an application needs a higher isolation level, it writes that logic at the application layer and solves the problem with external consistency mechanisms; the database level guarantees only the absence of dirty reads, sacrificing some data consistency in exchange for almost unlimited horizontal scaling. The most classic such systems are HBase, Google's Bigtable, Cassandra, and so on. That is where NoSQL came from.

To sum up: to provide unlimited horizontal scale-out, these systems sacrifice certain consistency guarantees, mainly on consistency and isolation, in exchange for nearly unlimited horizontal scalability.
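The trade-off described here, keeping only "no dirty reads" and pushing stronger guarantees up to the application, can be sketched with a toy versioned store. This is illustrative only, not any vendor's design:

```python
class ReadCommittedStore:
    """Toy key-value store offering only read-committed isolation:
    readers never see uncommitted (dirty) data, but a value may change
    between two reads of the same transaction (no snapshot)."""

    def __init__(self):
        self.committed = {}
        self.pending = {}  # txn_id -> {key: value}, buffered until commit

    def write(self, txn, key, value):
        self.pending.setdefault(txn, {})[key] = value

    def read(self, key):
        return self.committed.get(key)  # pending writes are invisible

    def commit(self, txn):
        self.committed.update(self.pending.pop(txn, {}))

s = ReadCommittedStore()
s.write("t1", "balance", 100)
assert s.read("balance") is None   # no dirty read of t1's buffered write
s.commit("t1")
assert s.read("balance") == 100    # visible only after commit
```

Anything stronger, such as repeatable reads across a transaction, would have to be layered on top by the application, which is exactly the burden the NoSQL design pushed upward.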

Where does NewSQL come from? NoSQL has been developing for about a decade; the concept emerged around 2008 or 2009. People gradually found it difficult to push consistency and similar concerns up into the application logic layer, and also found that unstructured and semi-structured data have strong consistency requirements too; it is not the case that transaction processing involves only structured data. So NoSQL systems turned out to need ACID guarantees as well: so-called eventual consistency and weak consistency are not acceptable for many applications, and snapshot isolation is still needed. That is what friends on the NoSQL side discovered. What did friends on the relational side discover? MySQL has supported JSON since 5.7, and PostgreSQL likewise has native JSON support: typical semi-structured data. Traditional relational databases can no longer support only structured data; they must support unstructured and semi-structured data too. In other words, both sides started moving toward the middle. One side had strong consistency guarantees but supported only structured data and had limited scale-out; the other supported semi-structured and unstructured data and scaled out very well but lacked consistency guarantees. Each side sacrificed some things to get what it wanted, then increasingly realized that what it had sacrificed was also needed by customers, so both sides began to fill in the missing capabilities.
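The "relational side moving toward the middle" can be illustrated with a relational table that carries a JSON document column. MySQL 5.7+ has a native JSON type for this; the sketch below uses SQLite with Python's json module purely as a dependency-free stand-in, filtering on a document field that has no fixed schema:

```python
import json
import sqlite3

# A structured primary key plus a schema-on-read JSON payload column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
db.executemany("INSERT INTO events (payload) VALUES (?)", [
    (json.dumps({"type": "click", "x": 10}),),
    (json.dumps({"type": "scroll", "dy": -3}),),
])

# Query semi-structured data living inside a relational table.
rows = db.execute("SELECT payload FROM events").fetchall()
clicks = [d for (p,) in rows if (d := json.loads(p))["type"] == "click"]
assert clicks == [{"type": "click", "x": 10}]
```

In MySQL or PostgreSQL the filter would be pushed into SQL itself via their JSON path operators rather than done client-side as here.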

Wanting both, it becomes "both ... and ...": the combination of the two is NewSQL. So in the end, I personally think that if NewSQL does become a common trend, the boundary between traditional NoSQL and relational databases will become more and more blurred.

That said, NewSQL is not equal to a distributed database, though a distributed database can be a NewSQL database. What matters in NewSQL is scale-out capability, which first and foremost requires a distributed storage architecture: not necessarily distributed shared storage, but definitely sharding and partitioning. HBase and MongoDB shard by default, which is the most common situation. But distributed data does not necessarily make a distributed database; that depends on whether queries and transactions are truly distributed. A system may shard its data while its queries and transactions form a perfectly shardable workload: one transaction or query touches only the first shard, another touches only the second. Although it has many shards and many queries and transactions, each query and transaction is handled by a single shard. Strictly speaking, I do not consider that a distributed database, just a neat partitioning of the business logic that makes it perfectly parallel.

A truly distributed database has two characteristics: first, the data must be split into multiple shards; second, its queries and transactions are likely to cross shards, which is what makes them distributed. Traditional NoSQL systems support only the first characteristic, or support the second only by sacrificing the data consistency level in exchange for the ability to run distributed transactions and distributed queries.
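The distinction between a perfectly shardable workload and a truly distributed one comes down to whether a transaction's keys span more than one shard. A toy classifier makes this concrete (modulo sharding over integer keys is an assumption for illustration):

```python
def shard_of(key: int, num_shards: int) -> int:
    """Place a key on a shard by simple modulo hashing."""
    return key % num_shards

def is_distributed(txn_keys, num_shards: int = 4) -> bool:
    """A transaction is distributed iff its keys span more than one shard."""
    return len({shard_of(k, num_shards) for k in txn_keys}) > 1

# A perfectly shardable workload: every transaction stays on one shard,
# so no distributed commit is ever needed.
assert not is_distributed([8, 4, 12])   # all three keys land on shard 0
# A cross-shard transaction needs distributed commit machinery (e.g. 2PC).
assert is_distributed([3, 4])           # keys land on shards 3 and 0
```

By the definition above, only a system that can also execute the second kind of transaction, with consistency guarantees, counts as a distributed database.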

As for NewSQL, I personally believe that a really good NewSQL database must, first, support structured, semi-structured, and unstructured data.

Second, a good NewSQL database should have very good horizontal scale-out capability, supporting distributed queries and distributed transactions, while also having very good elasticity and scalability on a single node. At present, no single database solves all these problems perfectly, and there are other technical points: HTAP (hybrid transactional/analytical processing), with mixed reads and writes to handle efficiently, and multi-model support, with multiple data forms in one database queried through a unified interface. A database that solved all of these perfectly would be a better NewSQL database.

For now, NewSQL is just a technical concept. Strictly speaking, many vendors say their databases are NewSQL; they may have implemented one or two points along some dimensions, but they have not completely solved all of the technical challenges just mentioned, so I think we still have a long way to go.


This article is original content of the Yunqi Community and may not be reproduced without permission.