Recently, our CEO Liu Qi was interviewed by iFenxi (“Love Analysis”). He analyzed current trends in the database market, described TiDB’s characteristics and application scenarios, and outlined the company’s future plans. Below are the iFenxi report and the interview transcript. They are very informative, enjoy 🙂

Research | li zhe wang

Writing | li zhe

Even within databases, a niche of the big data field, PingCAP is an unusual company: its product, TiDB, is one of the few databases on the market geared towards HTAP scenarios.

Traditionally, databases are divided into transactional databases (TP) and analytical databases (AP).

NoSQL databases that have emerged in recent years, such as MongoDB and the Hadoop-based HBase, are mostly analytical databases, which solve large-scale data query and analysis problems through a distributed architecture.

However, the transactional databases that carry production systems have always been dominated by traditional vendors. Oracle, IBM and others hold the large-enterprise market, while small and medium-sized enterprises and Internet companies mostly use the open source MySQL. Few new technologies or new companies have been able to enter this market.

In 2012, Google published Spanner, a transactional database built on a distributed architecture. Inspired by Google, a series of new database vendors such as CockroachDB emerged abroad to tackle the TP problem, but the Chinese market was almost blank: no startup was developing this kind of database.

In 2015, PingCAP was established to fill the gap in China.

A team with an Internet background, building a database the open source way

Unlike other database vendors on the market, most of PingCAP’s founding team came from large Internet companies such as Wandoujia and JD.com, and almost none came from traditional IT or database vendors.

With this Internet background, every member of the founding team has lived through exponential data growth and has experience handling massive data, so they give priority to scalability when building database products.

Also, since most Internet companies use MySQL, TiDB was compatible with the MySQL protocol from the start, making it easier for PingCAP to acquire customers.

Another characteristic of the Internet is that open source comes first. PingCAP committed to an open source model for its database from day one. Unlike many other teams, PingCAP founder Liu Qi and his co-founders were previously the authors of the distributed caching project Codis, so they know how to run an open source community and how to leverage community power to develop products.

On the one hand, the open source community expands the reach of PingCAP’s products and brings in potential customers. On the other hand, running the community lets PingCAP concentrate on the research and development of its core product, TiDB, while some other features can be implemented by community users.

In addition, user feedback helps PingCAP understand users’ potential needs and serves as a reference for TiDB development.

The product supports both TP and AP, with strong consistency and scalability as the main features

TiDB was originally designed to solve TP problems, but in practice it was difficult to get customers to replace their MySQL database with a new one, especially if the database vendor was a little-known startup.

Most enterprise customers still keep the traditional MySQL database at the front end and use TiDB as a data mart behind it, connected to the front-end database. The real-time performance of this data mart is far better than that of a Hadoop architecture, and it can run in an actual production system.

After running this way for a while, once customers have come to accept PingCAP’s product, they gradually replace the MySQL database and use TiDB as the front-end database.

When customers use TiDB as a data mart, the front end needs to query data from it, which places higher demands on TiDB’s query capabilities. TiDB adjusted its executor accordingly and extended its AP functionality.

As a result, TiDB supports both TP and AP functions, making it a Hybrid Transactional/Analytical Processing (HTAP) database product.

In TP scenarios, TiDB offers strong consistency and can serve industries that are highly sensitive to data consistency, such as finance. Compared with traditional databases, scalability is TiDB’s biggest advantage: performance can be improved by adding machines.

In AP scenarios, TiDB provides better real-time performance and faster data processing than HBase.

At this stage, customers are mainly in Internet finance, games and other Internet fields, and sales leads come mainly from the open source community

Compared with traditional enterprises, Internet companies are more willing to try new technologies, and a team with an Internet background better understands how their businesses work.

At the same time, Internet companies grow much faster than traditional enterprises and their data volumes grow extremely quickly, so their need to improve the underlying technical architecture and database performance is more pressing, especially in the game and Internet finance industries.

As a result, most of PingCAP’s early customers are Internet companies. Tongcheng Travel, 360 Finance, Mobike and others have successively become PingCAP customers.

By the end of 2017, PingCAP had a team of about 100 people, more than 80% of whom were in R&D, and only one full-time salesperson.

A single salesperson’s ability to acquire customers is very limited. PingCAP still relies on the open source community to acquire customers, and the salesperson is only responsible for following up with interested companies. In 2017, the number of users running TiDB in production reached 200, of which about a dozen became paying customers.

At this stage, PingCAP still focuses on product polishing and community operation, and has not yet entered the stage of large-scale product promotion. Therefore, in 2018, PingCAP will consider entering traditional industries such as finance, medical care and logistics, but will not increase its sales team on a large scale, and will still adopt a cautious market strategy.

iFenxi recently interviewed Liu Qi, founder of PingCAP. He discussed PingCAP’s business model, its future strategy, and future trends in the database industry. Part of the interview is shared below.

Founded to solve database scalability, the product now meets both TP and AP business needs

iFenxi: Why did you start PingCAP?

Liu Qi: I had this idea when I worked at JD.com. There was no database that scaled well, and the most common approach was sharding across databases and tables. But this approach has drawbacks: first, it scales elastically poorly; second, it is hard to use; third, it places a heavy mental burden on programmers; and fourth, its expressive power is weak.
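To make the drawbacks of manual sharding concrete, here is a minimal sketch in Python. It is not from the interview and not how any particular company sharded its data; `NUM_SHARDS`, `ShardedOrders` and `moved_after_resharding` are hypothetical names, and modulo routing is simply an assumed, common scheme.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, fixed when the application is written


def shard_for(user_id: int) -> int:
    """Route a row to a shard by taking the key modulo the shard count."""
    return user_id % NUM_SHARDS


class ShardedOrders:
    """Toy stand-in for an orders table split across NUM_SHARDS databases."""

    def __init__(self) -> None:
        self.shards = defaultdict(list)  # shard index -> list of (user_id, amount)

    def insert(self, user_id: int, amount: int) -> None:
        self.shards[shard_for(user_id)].append((user_id, amount))

    def total_for_user(self, user_id: int) -> int:
        # Single-shard query: easy, because the shard key is known.
        rows = self.shards[shard_for(user_id)]
        return sum(a for u, a in rows if u == user_id)

    def total_all(self) -> int:
        # Cross-shard query: the application must fan out and merge by hand.
        return sum(a for rows in self.shards.values() for _, a in rows)


def moved_after_resharding(user_ids, old_n: int, new_n: int) -> int:
    """Count keys whose shard changes when the shard count changes."""
    return sum(1 for u in user_ids if u % old_n != u % new_n)
```

In this toy model, growing from 4 to 5 shards moves most keys to a different shard, which illustrates the poor elastic expansion, and every cross-shard aggregate has to be written in application code, which illustrates the programming burden and the limited expressive power.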

I was working on a project that also required distributed databases, but there was no satisfactory product on the market.

Therefore, at the beginning, we wanted to solve our own problems. Along the way, we also developed a distributed cache. Later, we set out to solve the database scalability problem and started our own business.

iFenxi: A database is underlying technology, so customers choose their suppliers very carefully. How did you acquire your first customers?

Liu Qi: In 2016, after we received Series A financing from Yunqi Capital, we began to think about how to acquire our first users. Indeed, putting a new database online is risky for users. Who wants to risk their online business on a new database?

Gaia Interactive Entertainment was our first user. At that time they had problems with their MySQL database: online queries were very slow and the whole system was stalling to the point of being unusable, so they could hardly keep the business running without trying new technology. Our product was still in beta when they started putting the database online.

Because taking a new database online is a real risk, many users do it the other way around. They keep a number of MySQL instances running online and build a big data cluster behind them that holds all the data, so it looks a bit like a data warehouse. Because we are compatible with the MySQL protocol, the data can be replicated to us and queried in real time.

In the game industry, or in risk control where real-time requirements are high, users urgently need this technology to solve their problems.

We have published quite a few financial case studies, and many of them are real-time risk control scenarios. The advantage is that this use does not directly serve the online business, so the risk is lower than replacing the online MySQL, and it addresses exactly their pain point.

After this stage, if customers feel the technology is stable enough, they retire the online MySQL and move our product to the front to support the entire business.

When customers use our database as a warehouse, the queries are in fact very complex. Our database lets customers do things they did not dare to do before; a single SQL query can even run several pages long.

The catch is that our original design was not aimed at AP workloads, while these queries are AP-heavy. So when we optimized the executor, we made corresponding adjustments and extended the AP functionality.

In this way, our product can support both online TP and AP services, and our product becomes HTAP.

Once we had built this product well, we found that its characteristics are very distinctive, that there is no strong competitor in this field, and that it meets users’ needs. In many cases, users’ requirements cannot simply be divided into TP or AP; in fact there is no clear definition. Customers do not even care about the distinction; they only want their problems solved.

iFenxi: For writes and queries, row storage and column storage behave differently. How can TiDB handle both within one table?

Liu Qi: Row versus column is just a storage format. Technically, one can be converted to the other.

For example, cold data is slowly converted to column storage in the background, while the latest data stays in row storage. The front end is still standard row storage; depending on whether data is hot or cold, it is kept in row storage or converted to column storage.

In fact, recent papers have put forward a new view: data storage need not be purely row or purely column, but can be chosen by access frequency. Frequently accessed data uses row storage, so there is no need to scan the entire table. There are many ways to implement this.
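The hot/cold idea described above can be sketched as a toy in Python. This is only an illustration under stated assumptions, not TiDB’s storage engine: `HybridTable`, `compact` and the `keep_hot` threshold are hypothetical names, and "cold" is approximated here simply as "older than the newest `keep_hot` rows".

```python
from dataclasses import dataclass, field


@dataclass
class HybridTable:
    """Toy table: hot rows stay row-oriented, cold rows are packed into columns."""

    columns: tuple
    hot_rows: list = field(default_factory=list)      # list of dicts (row storage)
    cold_columns: dict = field(default_factory=dict)  # column name -> list (column storage)

    def __post_init__(self) -> None:
        for name in self.columns:
            self.cold_columns.setdefault(name, [])

    def insert(self, row: dict) -> None:
        # New data always lands in row storage, which is cheap to write.
        self.hot_rows.append(row)

    def compact(self, keep_hot: int) -> int:
        """Background step: move all but the newest keep_hot rows into columns."""
        cut = max(len(self.hot_rows) - keep_hot, 0)
        cold, self.hot_rows = self.hot_rows[:cut], self.hot_rows[cut:]
        for row in cold:
            for name in self.columns:
                self.cold_columns[name].append(row[name])
        return len(cold)

    def column_sum(self, name: str) -> int:
        # A scan reads the packed cold column plus the few remaining hot rows.
        return sum(self.cold_columns[name]) + sum(r[name] for r in self.hot_rows)
```

Writes only ever touch `hot_rows`, while an analytical scan such as `column_sum` reads mostly from the packed column, which is the trade-off the answer describes.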

iFenxi: When Google built Spanner, it emphasized scalability. Did it place lower demands on computing power?

Liu Qi: That was Google’s earlier approach, but with it, relatively complex operations give the database a relatively long response time, which is determined by the storage format.

However, Google’s 2017 paper changed the storage format to a more hybrid one. We iterated along the same lines as Google, and we changed our storage format earlier because we ran into our users’ actual needs earlier.

iFenxi: Is there a conflict between algorithms and scalability? Will a complex algorithm hurt scalability?

Liu Qi: Algorithms have nothing to do with scalability. Algorithms mainly affect execution efficiency.

For example, with column storage, execution is more efficient. Suppose a bank sums the balances of all its accounts: with column storage, this is very simple.
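The bank example can be sketched in Python as follows. This is a hedged illustration only: the `(account_id, balance)` layout and the function names are assumptions, and a real column store adds compression and vectorized execution that this toy omits.

```python
from array import array


def to_columns(rows):
    """Turn a list of (account_id, balance) rows into two packed columns."""
    ids = array("q", (r[0] for r in rows))
    balances = array("q", (r[1] for r in rows))
    return {"account_id": ids, "balance": balances}


def total_from_rows(rows):
    # Row storage: every full record is visited to pick out one field.
    return sum(balance for _account_id, balance in rows)


def total_from_columns(columns):
    # Column storage: only the balance column is read.
    return sum(columns["balance"])
```

Both functions return the same total; the difference is how much data has to be touched to get it, which grows with the number of fields per record in the row layout.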

iFenxi: What changes need to be made to the database when it is moved to the front end?

Liu Qi: How much concurrency to use depends on the load of the whole system. Some optimization will be done.

Suppose there is a data cluster of 100 machines and the computation is pushed evenly to every machine. With a high degree of concurrency, every machine may already be very busy; adding tasks at that point is useless, and the machine will crash.

But if you have a “smart” scheduler that controls the instructions and assigns different operations to different machines while maintaining high concurrency, the machines won’t be too busy. The problem is that it can introduce long delays.
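A minimal sketch of such a load-aware scheduler, in Python, is shown below. It is an assumption-laden toy, not TiDB’s scheduler: the `schedule` function, the per-machine cap `max_per_machine` and the least-loaded policy are all hypothetical choices made for illustration.

```python
import heapq


def schedule(tasks, num_machines: int, max_per_machine: int):
    """Assign tasks to the least-loaded machine; defer tasks once all are full.

    Returns (assignments, deferred): assignments maps machine index -> task list,
    deferred holds tasks that must wait, which is where the extra delay comes from.
    """
    heap = [(0, m) for m in range(num_machines)]  # (current load, machine index)
    heapq.heapify(heap)
    assignments = {m: [] for m in range(num_machines)}
    deferred = []
    for task in tasks:
        load, machine = heap[0]
        if load >= max_per_machine:
            deferred.append(task)  # every machine is already at its cap
            continue
        heapq.heapreplace(heap, (load + 1, machine))
        assignments[machine].append(task)
    return assignments, deferred
```

Capping the load keeps any single machine from being overwhelmed, and the `deferred` list makes the cost visible: tasks that wait are exactly the added latency mentioned in the answer.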

Of course, the same data may not be calculated by CPU, but by GPU or FPGA, which has higher requirements on the scheduler. According to the development trend, the ability of the scheduler is an important indicator to measure the performance of a database.

iFenxi: How does TiDB achieve real-time performance?

Liu Qi: Because the architecture itself is distributed, performance can keep scaling, no matter how much data comes in at the front. If it is not fast enough now, you can make it faster by adding machines.

Speed also depends on the computation, and some computations cannot be pushed down to all the nodes. For example, if I want to bring all the data back and sort it, there is no way for the nodes to do that on their own.

In this case, the optimizer is important: it identifies which computations should be pushed down for parallel execution and which should not.
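The difference between a computation that pushes down well and one that does not can be sketched in Python. This is an illustrative model only: `partitions` stands in for data held on separate nodes, and `pushed_down_sum` and `global_sort` are hypothetical names, not TiDB APIs.

```python
import heapq


def pushed_down_sum(partitions):
    """SUM can be pushed down: each node sums locally, the coordinator adds partials."""
    partials = [sum(p) for p in partitions]  # would run on each node in parallel
    return sum(partials), len(partials)      # coordinator receives one number per node


def global_sort(partitions):
    """A full sort cannot finish on the nodes alone: each node sorts locally,
    but the coordinator still has to receive and merge every row."""
    sorted_parts = [sorted(p) for p in partitions]
    merged = list(heapq.merge(*sorted_parts))
    return merged, sum(len(p) for p in partitions)  # rows shipped to the coordinator
```

The second value returned by each function is the amount of data the coordinator receives: one partial result per node for the sum, but every row for the sort.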

iFenxi: For a MySQL-based architecture, can data be migrated to TiDB seamlessly?

Liu Qi: We considered this problem from the beginning of the design. For MySQL we can do seamless migration. For other protocols such as Oracle or DB2, code changes may be required.

iFenxi: What is the migration cycle for other protocols?

Liu Qi: That also depends on the complexity of the business. For example, the original business may have 100,000 SQL statements, and each needs to be verified once; if the business itself is not complicated, it goes faster. On the MySQL protocol side, we can complete a proof of concept (POC) very quickly.

iFenxi: Are you considering fast migration support for Oracle or DB2 next?

Liu Qi: We have no plans to do that, because new businesses don’t use these technologies anymore. Supporting them would only make sense as a way into old projects, and compatibility is exactly the problem with old projects. Users need to know how compatible the new technology really is: can they replace the old system with it without worry?

Compatibility means not only feature compatibility but also bug compatibility, and 100% compatibility is very hard to achieve. The enterprise’s original programmers may have left, so replacing an old system involves heavy workload and risk.

At present the key industries are Internet-related, such as Internet finance and games; the product suits scenarios with large data volumes and high business complexity

iFenxi: Which industries are your products targeted at?

Liu Qi: In the process of commercialization, the most important thing is to make a product, and then improve its functions according to the needs of customers.

In addition, our products are open source. The advantage of open source is that users can promptly give feedback on their experience and problems while using the product, and in this process we can find out who our potential users are.

Our first customer was a gaming company, which was not what we expected. We had thought Internet companies would come first, because they adopt new technologies so aggressively.

The game industry has its own characteristics. What makes game companies the most money is operating hit games, whose daily revenue may reach tens of millions. They want their infrastructure to be stable and robust enough; if they hit a bottleneck, they have to take the game down, which is costly, so they look to new technology to solve the problem.

Then there are Internet companies and traditional industries. Internet companies are actually quite conservative about adopting our new product: with so many MySQL instances already in use, they consider it very risky to switch suddenly to a new technology.

However, companies such as those in Internet finance have high requirements for real-time risk control and for managing the business with real-time information. Earlier solutions could not meet these requirements, so they chose our product.

iFenxi: What are the application scenarios of TiDB?

Liu Qi: Our database is highly general-purpose and is mainly aimed at new business requirements. We have not designed it for any particular industry.

As for where our product’s advantages show: the customer’s data volume needs to be at the 100-million level or above. If the data volume is relatively small, there is no need for a distributed database. In addition, the business should be relatively complex, so that our advantages are more obvious.

iFenxi: Which industries will you focus on next?

Liu Qi: From the point of view of revenue, finance should be an industry we focus on. Other fields such as logistics and medical care are also growing fast.

The team is mainly from Internet companies, with very few sales staff

iFenxi: How did PingCAP’s user growth progress in 2017?

Liu Qi: In 2017, we had 200 users running in the production environment. The unit price of the product is relatively high, but there are fewer paying users.

iFenxi: TiDB is an open source technology. What enhancements will you make to provide enterprise-level products?

Liu Qi: Although the technology we provide is open source, some parts are still closed source, such as monitoring and operations components, backup tools, security tools and so on.

Enterprise applications need a good user interface and a good set of tools, and that is what we provide.

The other part is what we call Database & Service: we provide not just a database but a database platform through which enterprise users can apply for a TiDB cluster. Without it, requests may have to be handled manually by an administrator, and the experience is very different.

iFenxi: How does TiDB charge?

Liu Qi: We currently have two models. One is cloud deployment; for example, you can find us at the database entry on Tencent Cloud. This business model is relatively simple.

The other is that you can buy our subscription or our license, priced by the number of nodes.

iFenxi: How large is the company’s team?

Liu Qi: The company now has about 100 employees, 82 of them in R&D. There is only one salesperson. We have so few sales staff because users come to us on their own, so we have not invested much in this area.

Our requirements for R&D are very high, including engineers’ support for external users, response speed and so on. Our team may not look as impressive as Oracle’s, but many outside companies contribute to us.

For example, Mobike contributed a lot of the scheduler code, and Toutiao contributed optimizations for many scenarios; contributors also include Samsung’s research institute in South Korea. Many people help us with testing, which also reflects one of the benefits of open source technology.

iFenxi: Will the R&D staff undertake part of the pre-sales work?

Liu Qi: In 2017, there were still some R&D personnel doing pre-sales work, but we will make some adjustments in 2018, which is also a very important task for us.

The personnel structure should form a complete system, with pre-sales, implementation, and R&D each performing their own duties, and different people assigned to solve problems at different stages.

iFenxi: With so few sales staff, does that place higher demands on running the community?

Liu Qi: I think the more R&D engineers there are, the faster communication with the community will be. The most important users in the community are developers, and developers definitely communicate more smoothly with other developers; sales staff cannot replace this role. For example, if a user reports a problem with some code, our developers respond quickly.

Large-scale users such as Toutiao, Mobike and Tongcheng contact us on their own initiative because of their pain points, and do not need sales to do extra work.

Of course, there are also many small users in the community. Although they do not have such large spending power, they also have a direct effect on the community.

They test their own scenarios and find problems we’ve never seen before, and the information they provide is very important to us, so we put a lot of effort into running the community.

iFenxi: Is PingCAP’s team background mostly Internet?

Liu Qi: Right, most come from the Internet industry, from relatively large Internet companies, and have experienced the pain that comes with large data volumes.

In addition, some come from traditional industries, and our pre-sales person previously worked in the financial industry, so he understands the application scenarios of the financial industry better.

iFenxi: When entering traditional industries, are there any changes in personnel structure requirements?

Liu Qi: At present, we don’t think so. We hope to win customers directly through our products and reflect the advantages of our products. If it is a customer who uses the same database, we will not fight for it, which is not our strength.

iFenxi: How do you balance product development and community maintenance?

Liu Qi: We always build a basic version first and then promote it in the community. When we encounter a bug, we must fix it; otherwise it will affect a lot of users.

On the internal development side, we develop many new features quickly, but they do not go into the stable version right away. Instead, we first release a beta version to the community, find bugs through user testing, and fix them.

In this process, we need to get feedback from the community through continuous testing by users. Because it’s not up to us, it’s up to the users.

The convergence of TP and AP is the future trend, and the database market will become more diversified in the future

iFenxi: In the CAP theorem, consistency and availability conflict. How do you optimize for this?

Liu Qi: In the future, we will provide an option that users can choose according to their own needs, high consistency or high availability. For example, bank data requires high consistency, while Internet applications are more focused on high availability. We will provide all of them to users and let them choose.

iFenxi: How is NewSQL technology different from previous technologies?

Liu Qi: Historically, SQL came first. Why did NoSQL then appear? Because SQL databases could not scale. NoSQL can scale, but its expressive power is poor and it may not support transactions, so it lacks the traditional advantages of SQL.

NewSQL has both advantages at the same time: it scales well, and it has the transaction processing and expressive power of SQL.

iFenxi: Will TP and AP merge in the next step?

Liu Qi: We think so. Users don’t care whether it’s TP or AP. Solving their problems is what ultimately matters.

TP and AP were separated for historical reasons; when databases were first created, there was no such distinction. Now that the technology makes it possible, of course you want to merge them. There may still be a separate AP system for complicated data analysis, but our product is still iterating rapidly, and ultimately it depends on who has the best performance.

iFenxi: Will there be another Oracle in the distributed database platform space?

Liu Qi: For historical reasons, Oracle’s position cannot be replaced in the short term, but new database architectures are also rising fast, and Oracle now faces unprecedented challenges. I think that in the next two years, 20% of traditional databases will be replaced by new databases.

Look at our user growth rate now, this trend is quite obvious.

iFenxi: How will the market landscape change in the future?

Liu Qi: I think the market will become more diversified.

First of all, current demand is very fragmented, and traditional databases cannot express it well. For example, requirements for streaming are getting higher and higher.

The advantage of a relational database is its versatility and balance. But some scenarios are hard to fit into the current database framework, and it will certainly not be as smooth to use as a purpose-built database, such as a graph database.

From a trend point of view, when NoSQL came out, people thought about what scenarios it could replace, and it turned out that NoSQL had a lot of constraints. NewSQL will certainly change the landscape, and there should be two or three big players eating up the majority of the market, but the smaller players will still be there.

iFenxi: Will the development of open source technology affect the business of database companies?

Liu Qi: Open source technology has in fact existed for a long time; MySQL, for example, has a history of more than 20 years. But enterprise use is not that simple after all, and there are still many problems that need a team to solve.

There will never be a completely free database, and even open source databases will be charged.

iFenxi: Internet companies tend to develop their own infrastructure. Will this affect PingCAP?

Liu Qi: The answer differs between domestic and foreign companies. Domestic companies like to build private clouds, but the situation abroad is very different. Many foreign companies have removed their private clouds, and the reason is very simple: deploying a private cloud is less efficient than directly using a mature public cloud.

Many Internet companies today do not want to be locked in by companies like Oracle as in the past: they want to use your database but also keep some control. Because Internet companies grow fast and their requirements change visibly, they want a certain understanding of and control over the database, so that they can modify the database code to meet their own customization needs.

iFenxi: Will cloud vendors eventually become competitors of database companies?

Liu Qi: The relationship between databases and the cloud is a bit like that between an app and an app store. Cloud vendors may also build databases, but the relationship should be more of a partnership.