Editor’s note

In this article, "Efficiency Selection: An Introduction to the Internal Selection of the Distributed Database TiDB," the NetEase DBA team compares an innovative architecture based on TiDB with the traditional MySQL + DDB architecture, and explains why NetEase is considering introducing TiDB from several dimensions: business fit, cost reduction and efficiency improvement, and technological innovation.

The author, Ni Shansan ([email protected]), is a NetEase database expert and the leader of the Hangyan database operation and maintenance team.

TiDB is one of the hottest topics in the open source database field, and it has consistently led in China in development momentum, speed of technical iteration, and community activity. The NetEase DBA team has been following TiDB with curiosity and interest and making "small" attempts with it. However, when a relational database product is evaluated seriously for online selection, the motivation and the expected results have to be discussed very rigorously.

To put it bluntly, there is no reason for us to use a new technology to challenge an existing, stable solution, especially since DDB, NetEase's self-developed distributed database, can meet our current needs. For us, TiDB is not a 0-to-1 change in distributed databases, but a collection of 1-to-1.1 improvements on a series of pain points, and these need to be spelled out in detail. This article is a comprehensive discussion of that question.

The DBA team hopes to gradually bring TiDB into the core selection of NetEase databases. If you are interested in the related technology or have specific requirements, please stay in touch with the DBA team. We also hope this article helps answer questions and promote adoption.

DDB is a horizontal sharding middleware built on MySQL and developed by NetEase; it can be likened to ShardingSphere or Vitess.

Distributed database technology development background

As PC servers with their significant cost advantage have become absolutely mainstream, few people today think of databases in terms of the Oracle or DB2 of the minicomputer and disk-cabinet era, those large centralized databases filling rooms of cabinets. However, the volume of business data and requests keeps growing, so in recent years we have had to rely on distributed database technology to assemble massive cluster services out of small PC server resources.

Distributed databases have a variety of implementation modes. Compared with the access experience we could provide in the single-database era, businesses generally require a distributed database solution to meet the following basic features:

  • A (relational) database should support SQL as fully as possible, including CRUD, DDL, and transaction semantics

  • Service is provided through a unified cluster entry point that hides the internal distributed details

  • Scalability: by whatever means, the database must keep up with growth in data volume and required throughput

  • As one of the most critical pieces of infrastructure, high availability of the service in the face of hardware and software failures is also an important consideration

Assembling large-scale services from scattered resources is now standard practice in server software design, and both theory and practice keep developing. In reality, though, distributed technology for relational databases may not be as mature as we imagine. Relational databases themselves appeared early, yet they were among the last to get solid distributed technology. In the broader storage field, 2006 was probably the most important year for distributed technology: Google BigTable, Amazon Dynamo, and Hadoop appeared at almost the same time, laying the foundation, from technical principles to infrastructure implementation, for half of today's distributed storage industry. For relational databases, however, until the publication of the Google Spanner paper in 2012, the space was long monopolized by commercial solutions and the theory of distributed technology developed slowly. It is probably because of the four characteristics above that the technical bar for distributed databases is clearly higher than for services that do not need to support transactions and SQL.

In my opinion, the development of distributed technology for relational databases can be divided into three main routes, each with multiple stages:

At present, the overall state of the industry is broadly similar. Although the three routes emerged at different times, all of them are widely accepted and actively used today, rather than one simply replacing another as is commonly assumed. At the same time, you can see that the choices are not that many, at least not as plentiful as in the big data stack. Let's analyze our possible selection further:

  • Although the Sharded Storage solution (route 3) is technically impressive and faultless from performance to throughput, it is a completely commercial solution. We do not rule out making full use of it in public cloud scenarios when the time is right, but betting everything on a commercial solution from a single vendor is obviously not our only option, unless a relatively open alternative exists (the good news is that NetEase Data is indeed working in this direction and has made progress).

  • DDB is the sharding (sub-database and sub-table) middleware (route 1) developed by NetEase on top of MySQL, and it is our mainstream selection at present. Within this category of distributed solutions, the functionality of the various open source middlewares is very similar to DDB's. Compared with DDB, none of them has a clear advantage in this homogeneous competition, while their shortcomings are quite prominent, and none is as general-purpose as DDB. So within the middleware camp, DDB is the optimal solution. However, sharding middleware has gradually run into efficiency problems in today's business scenarios, which is the main focus of this article: why do we need to try a new distributed database technology when we already have the DDB distributed solution?

  • That leaves NewSQL (route 2), and the question of whether it solves the pain points we encounter with the DDB middleware solution. Among the NewSQL representatives, only TiDB is primarily MySQL compatible, while others such as YugabyteDB target Postgres compatibility, so our focus naturally shifted to DDB vs. TiDB.

So what looks like a generational contest between database technologies is in fact not that complex. For our current situation, the traditional stack = DDB + MySQL, and the new generation = TiDB.

Comparison and analysis of TiDB's selection advantages over the traditional distributed scheme

The fundamental question is whether, and why, we need to introduce TiDB. As mentioned above, we hope that a new generation of database can solve or improve some of the problems we have with the current MySQL distributed-middleware mode. Below we elaborate on those problems and how TiDB improves on them.

Advantage overview

First, here is a summary of the advantages of TiDB identified in the selection survey; each is then discussed by category:

  • Horizontal scalability on a par with DDB

  • Better SQL support than DDB

  • More flexible granularity when scaling resources in and out, with no need to double capacity each time as DDB does

  • Better efficiency and safety of capacity expansion

  • Reduced data storage costs (no need for RAID 10, plus RocksDB compression, partly offset by RocksDB write amplification)

  • Better online DDL support: apart from ADD INDEX, changes basically return within seconds, and the cost of ADD/MODIFY COLUMN on large tables drops significantly

  • Significantly improved primary/secondary replication efficiency, reducing the risk of degraded consistency and high availability under write load

  • A higher floor for data consistency

  • Significantly improved reliability and efficiency of automatic HA recovery

  • HTAP capability reduces the cost and risk of data transfer pipelines and provides more efficient real-time analytics for the business

SQL support is better than in the traditional sharding scheme

TiDB is not fully compatible with every SQL feature and syntax of MySQL; it still has some limitations:

  • No support for stored procedures/triggers, no support for SELECT ... INTO variables, and no support for MySQL's newer WITH ROLLUP

  • Foreign key constraints are not supported

  • DML that updates base tables through views is not supported

  • No support for explicit XA transactions (so although TiDB is MySQL compatible, it cannot serve as a DDB storage node)

  • Savepoint is not supported

  • Versions prior to 5.0 do not support DECIMAL precision changes

For us, however, these limitations have little impact, because the distributed middleware DDB also does not support most of them, or supports them only in a very limited way. We can regard TiDB's SQL capability as a superset of DDB's; a service only needs to switch drivers, for example from DBI, the DDB proprietary driver, to an open source MySQL driver such as JDBC plus a connection pool, to complete the migration.

The most direct part of this superset is the markedly better support for subqueries and joins. In DDB mode, every table is hash-distributed on specified fields. If the join field of a JOIN statement happens to be the distribution field of both tables, the join can be completed in the distributed database with fair performance. If, however, the join has to match rows across nodes, DDB can only compute it centrally in the DBI client, caching all the data there, which easily causes the DBI to crash with OOM. Subqueries, furthermore, are simply not supported, because that optimizer functionality was never completed. TiDB's support in this area is obviously much better (a sketch follows).
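A rough SQL sketch of the difference, with hypothetical table names and distribution keys (assume t_order and t_user are both hash-distributed on user_id):

    -- Join on the distribution key: DDB can push this down to each shard, and
    -- TiDB handles it as an ordinary join.
    SELECT o.order_id, u.nick
    FROM   t_order o JOIN t_user u ON o.user_id = u.user_id
    WHERE  o.user_id = 10086;

    -- Join on a non-distribution key: DDB's DBI client has to pull and merge
    -- rows from every shard (risking OOM), while TiDB's optimizer executes it
    -- server side.
    SELECT o.order_id
    FROM   t_order o JOIN t_warehouse w ON o.warehouse_no = w.warehouse_no
    WHERE  w.region = 'east';

    -- Subqueries like this are simply unsupported in DDB but work in TiDB.
    SELECT user_id
    FROM   t_order
    WHERE  pay_amount > (SELECT AVG(pay_amount) FROM t_order);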

However, there are a few other considerations when using TiDB:

  • AUTO_INCREMENT IDs are allocated independently by each TiDB server; they are guaranteed to be globally unique and increasing within a single TiDB server, but global monotonic increase and continuity are not guaranteed. In practice the result is similar to DDB's batch allocation of unique IDs (see the sketch after this list).

  • The length limit for a single row and a single column is 6 MB, which is also the upper limit for a single KV entry. In MySQL, record length can reach the GB level through longtext/longblob; that does not apply in TiDB (the data types are retained, but the limit still applies).

  • The transaction statement limit, which defaults to 5000 when optimistic transactions are used with transaction retry enabled, is adjustable; since we are unlikely to use the optimistic-lock pattern widely, it has little impact. The other previously painful transaction limits, 300,000 rows and 100 MB total size, no longer exist after 4.0.
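To make the AUTO_INCREMENT point concrete, here is a minimal sketch; the table name is hypothetical, and the allocation batch size (on the order of 30,000 per TiDB server by default in the versions we tested) is an assumption to check against your own cluster:

    -- IDs are unique but not continuous across TiDB servers.
    CREATE TABLE t_demo (
        id   BIGINT PRIMARY KEY AUTO_INCREMENT,
        body VARCHAR(64)
    );

    -- Each tidb-server caches its own ID range, so inserts routed through
    -- different servers can produce values such as 1, 2, 3, 30001, 30002 ...
    INSERT INTO t_demo (body) VALUES ('a'), ('b');

    -- A business that truly needs strictly increasing IDs should generate them
    -- itself, just as it already does with DDB's batch-allocated unique IDs.
    SELECT id, body FROM t_demo ORDER BY id;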

The actual efficiency and capacity of horizontal expansion are significantly improved

Both DDB and TiDB can theoretically expand storage capacity by adding storage nodes and rebalancing data, but their approaches differ greatly. Objectively speaking, at the scale DDB is used inside NetEase, horizontal expansion has become difficult in some scenarios.

Traditional distributed middleware (Vitess, for example) expands in the same way: before scaling out, you must add a multiple of the existing resources to the cluster, resources that do not yet serve traffic; then, through various means, the existing data and the real-time incremental data are synchronized and split onto the new resources; finally, at a chosen moment, the new resources replace the old ones, the old resources are taken offline, and the expansion is complete.

With years of experience and tooling, we have very complete automation and reliability guarantees for DDB's traditional expansion mode, so capability is not the problem. The problem is efficiency:

  • First, resources must be multiplied. Our large internal DDB clusters store data on the order of hundreds of terabytes, the result of eight years of accumulation, and at that volume the data is unlikely to double explosively. Yet the traditional split-one-into-two expansion usually requires doubling the resources: after the split the utilization drops from 100% to 50%, while the actual growth in a year may be less than 15%, which shows how poor the cost efficiency of such one-shot expansion is. Of course we can scale by rehashing and similar techniques, and technically we can, but then there is the cost in time and stability.

  • Second, the expansion process itself is too slow. We have many ways to migrate the existing data and synchronize the incremental data in real time; technically there is no problem, but the efficiency is too low. On the premise of keeping the online load stable, our empirical figures are roughly 10-20 hours per 2 TB for physical data migration and 30-40 hours per 2 TB for logical migration, including the time for incremental catch-up. These transfers do run in parallel across the distributed cluster, and due to disk capacity our databases usually use 6 TB as the parallel unit, but even so the total time is measured in weeks.

  • Then there is the success rate of expansion. In most cases, sufficient resources and a long enough preparation period make expansion feasible. In recent years, however, we have run into cases where the expansion plan simply could not be executed. For example, one expansion was triggered by a business-logic change: data write throughput rose rapidly, and both load and capacity became insufficient. But MySQL's weak master-slave replication efficiency meant the incremental catch-up could never keep pace with the write volume after the split, and even our more efficient internal parallel CDC scheme could not keep up, so the data-source switchover could never be completed. In other words, even with enough resources prepared, the existing cluster might still be impossible to expand.

To sum up, DDB expansion has real difficulties in cost, efficiency, and feasibility in some scenarios, and introducing TiDB can solve these problems for us. From an operations perspective, expanding TiKV is a simple matter of adding machines:

  • Because TiDB server is stateless, it is easy to scale out, and PD, as the management service, rarely needs expansion. So in general, capacity expansion mainly concerns the storage service, TiKV.

  • Expanding TiKV only requires configuring and starting the service on the newly added servers. The new TiKV instances automatically register with the existing cluster's PD, and PD performs online load balancing, gradually migrating part of the data to the new instances without the business noticing. The newly added TiKV resources can be considered part of the cluster from the moment they start; there is no long, externally controlled data-redistribution process to wait for (see the query sketch after this list).

  • By the way, scale-in is basically as troublesome as scale-out in the traditional distributed mode, with the same data-redistribution inefficiency. For a TiKV node in TiDB, as long as the relevant constraints are not violated, TiKV notifies PD before a safe shutdown, so PD can first migrate the data on that TiKV to other instances and complete an unnoticeable scale-in once enough data replicas are guaranteed. Efficiency and automation are again far better than in the traditional mode.
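For reference, the rebalancing after a scale-out can be observed directly from SQL; a minimal sketch, assuming the information_schema tables available in the TiDB versions we tested (column names may vary slightly by version):

    -- Region and leader counts gradually converge across stores as PD
    -- rebalances in the background after a new TiKV node registers.
    SELECT store_id,
           address,
           store_state_name,
           leader_count,
           region_count,
           capacity,
           available
    FROM   information_schema.tikv_store_status
    ORDER  BY store_id;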

So TiDB greatly improves on the traditional mode in expansion efficiency, process automation, and reliability, turning an expansion measured in days or even weeks into one measured in minutes.

Schema change support capability improved

Similar to the horizontal expansion discussed above, traditional MySQL-based distributed middleware has significant schema-change bottlenecks, again rooted in storage resources and performance.

Some schema changes are relatively easy, such as creating a new table, which is no big deal in any mode. Others, however, are troublesome in MySQL, or in traditional block-based row-store databases generally: adding fields, changing field types and attributes, adding indexes, changing keys, and so on. In Oracle, all of these can be handled uniformly through normal DDL or online redefinition. In MySQL, this kind of change has always been a technical difficulty, with many tools and many pitfalls, as the following figure suggests:

Major pain points include:

  • MySQL’s online DDL has always been hard to trust, and even as the online DDL has improved over time, we have had to stick with external change tools to avoid certain situations.
  1. Beyond the stability risks that do not scale linearly with incremental updates, online DDL has other risks. For example, we recently ran into the problem that the INPLACE algorithm of online DDL, under write-heavy load, can trigger flushing that affects the stability of the instance (see www.percona.com/community-b… /). We have spent years building strategy logic to decide which method to use in which case, and it is very hard to make that logic completely reliable for every specific business scenario, so to avoid unnecessary risk we have had to keep falling back to the external tools that are still widely used and popular in the industry as the unified change method.
  2. Because of these uncertain risks, our self-developed operations platform automatically handles on the order of 100,000 changes, large and small, per year. When the success rate of such a frequently used core function is capped by a 128 MB default limit, we regularly hit errors like ERROR 1799 (HY000): Creating index 'idx_XXX' required more than 'innodb_online_alter_log_max_size' bytes of modification log... in some business scenarios; the change not only fails but also wastes a lot of resources rolling back. Can such a feature be called reliable? (See the variable sketch after this list.)
  3. In fact, it has been the industry norm for DBAs not to use online DDL widely for years. Alibaba Cloud and other public cloud vendors do basically the same as we do: if you ask the vendor to provide change support, they all use external tools, even in the PolarDB era.
  • MySQL's internal mechanism is unreliable, and the external tools have problems too, especially efficiency. An external-tool change needs more free disk space than the table being changed, because the external tools use the same copy approach as the original DDL, just without blocking DML: an exact copy of the original table's data must be made to apply the change, and the process generates a large amount of binlog. For example, to change a 500 GB table, our experience is that the database needs at least 1 TB to 1.5 TB of local free space before we consider it completely safe. Copying the data is similar to a capacity expansion and takes a long time. In reality the large tables of core businesses are often exactly the ones being changed, so you have to expand before you can change, and the time and resource cost of the change often becomes unacceptable to the business: doubled resources, weeks of time, read/write-separation consistency broken during the change, high availability unprotected... The business is often daunted, and what should be a routine database change turns into something theoretically possible but practically impossible.

  • There are all kinds of other problems: performance impact, consistency risks from constraints and keys, high availability risk during changes, distributed change scheduling and atomicity control in the DDB environment, the MDL blocking everyone hits in extreme cases, impact on backup mechanisms...
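For the ERROR 1799 case above, the relevant MySQL knob is innodb_online_alter_log_max_size; a minimal sketch of checking and raising it (whether raising it is enough still depends on the write rate during the change):

    -- The online-DDL modification log that buffers concurrent DML defaults to 128 MB.
    SHOW GLOBAL VARIABLES LIKE 'innodb_online_alter_log_max_size';

    -- The variable is dynamic; raising it before a risky change trades memory/disk
    -- for a lower chance of ERROR 1799, but does not remove the risk entirely.
    SET GLOBAL innodb_online_alter_log_max_size = 4 * 1024 * 1024 * 1024;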

These pain points are serious and directly affect business iteration, stability, and efficiency. We are desperate for a new model to change the situation; and incidentally, part of this change pain remains even with PolarDB and Aurora, because they still run MySQL on top. Over the years, our DBA team has written more than 20,000 lines of back-end scheduling and control code, not counting tools, workflows, and front-end code, just to provide reliable automated online changes. Today we can basically guarantee efficient and safe daily changes, except where insufficient resources make a change impossible. Although this is the capability we live on, I still want to say that this model is complex and of little intrinsic value, its efficiency is becoming unacceptable as the business footprint grows, and I hope a technical breakthrough can be made.

TiDB has a great advantage in this area according to our research and current application conclusions.

In brief, TiDB has the following advantages:

  • Because rows are modeled as KV pairs and LSM storage is append-only rather than update-in-place, a large class of changes to data types, added columns, and default values can be completed instantly by changing only metadata. Compared with MySQL 8.0's INSTANT DDL, the scope is clearly wider and the efficiency very high: such changes complete at the 1 s level and are barely noticeable to the business.

  • ADD INDEX must generate new persisted data, so it cannot be instant. It is implemented as a background task with asynchronous threads and has relatively little impact on the business. TiDB also exposes clear progress and speed information for the index build, which makes scheduling easier (see the sketch after this list).

  • Both of the above consume far less extra space than MySQL's COPY algorithm or the external-tool mode. And since TiDB itself is much easier to expand than MySQL middleware, the efficiency and feasibility of changes improve greatly.

  • There is no master-slave delay problem amplified by changes.
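A minimal sketch of what this looks like in practice; the table and index names are hypothetical, and the throttling variables and their values are as in the TiDB versions we tested:

    -- Metadata-only change: returns almost immediately regardless of table size.
    ALTER TABLE t_order ADD COLUMN channel VARCHAR(32) DEFAULT 'web';

    -- ADD INDEX runs as a background reorg job whose progress can be watched.
    ALTER TABLE t_order ADD INDEX idx_channel (channel);
    ADMIN SHOW DDL JOBS;

    -- Smaller values reduce the impact on foreground traffic at the cost of a
    -- slower index build.
    SET GLOBAL tidb_ddl_reorg_worker_cnt = 2;
    SET GLOBAL tidb_ddl_reorg_batch_size = 128;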

Of course, although efficiency improves, we cannot say TiDB changes have no impact on the business at all. Specifically, there are some new situations:

  • Because the schema version changes during DDL, DML statements touching the same table as an in-flight DDL may fail transiently. The business is therefore required to connect through a connection pool and to have a retry mechanism on its side to avoid outright transaction failures.

  • ADD INDEX operations: according to our load tests, frequent reads and writes of the target columns, especially over a long period, can hurt performance noticeably (that is, adding an index on a heavily used business field may have a large impact). In extreme cases QPS/TPS drops by about 23%, similar to pt-osc.

In general, we hope that the introduction of TiDB will solve the pain points in the current business table structure changes, improve the related efficiency, and help the business to achieve faster and safer iteration.

High availability reliability and efficiency are significantly improved

MySQL is the most widely used database in the industry, but until MGR is fully adopted (and it is not yet), its approach to service high availability is problematic. For the specific design issues you can search my earlier high-availability article on KM; here we go straight to the business pain points.

  • Master-slave consistency can be made reasonably safe, but the floor is very low: it requires many parameters and mechanisms to cooperate, such as multiple levels of log persistence, semi-synchronous replication, table-based replication metadata, the lossless semi-sync ACK mode, and so on (a sketch of these settings follows after this list). A problem with any one of them can lead to master-slave inconsistency, which in turn undermines reliability.

  • Without MGR, there is no coordination mechanism inside the cluster. Switching requires external participation and cannot be completed within the cluster the way ZooKeeper or Redis Cluster does it. This leaves many holes around switch success rate, feasibility of resource drift, and how to prevent split brain and double write in extreme cases. It is not impossible to handle, but it is not easy to make completely reliable; my earlier article introduces the mechanisms and principles in detail if you are interested.

  • With MGR, the problems of role management and autonomous switching inside the cluster are solved, but the official access path still has no good answer. Whether it is a Sidecar-style proxy, modifying the client, or external resource drift as before, the client-side switching problem is still not well solved. MySQL has not learned from Redis or MongoDB here, which is frustrating.

  • There is also an efficiency issue with external switching, that is, the time it takes for the switch to become visible to clients. It may be a domain-name switch, client caching, or, in DDB's case, serialized metadata updates. In the DDB mode used widely inside NetEase there are two particular pain points: network timeouts when notifying clients, and the configuration center being updated only serially. As a result, even the most basic switchover caused by a hardware failure can only be kept to somewhere between 1 and 5 minutes. That is a switching-efficiency problem.

  • The biggest pain point, even with MGR, is that MySQL binlog replication is a nightmare for high availability. I will discuss this separately.
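As an illustration of how many knobs the MySQL-side consistency floor depends on, here is a sketch of the kind of settings referred to above; semi-sync requires the plugin to be installed, names can vary by version, and the metadata repositories are usually set in my.cnf rather than at runtime:

    -- Persist binlog and redo on every commit.
    SET GLOBAL sync_binlog = 1;
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;

    -- Semi-synchronous replication with the lossless (AFTER_SYNC) ACK mode,
    -- and a timeout that avoids silently falling back to async too quickly.
    SET GLOBAL rpl_semi_sync_master_enabled = ON;
    SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';
    SET GLOBAL rpl_semi_sync_master_timeout = 10000;

    -- master_info_repository / relay_log_info_repository = TABLE keep the
    -- replication metadata crash-safe on the replica.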

To take it separately: one of MySQL's most important problems is binlog master-slave replication performance, and upgrading to MGR does not improve it:

  • Compared with Oracle's physical (data page) replication, MySQL's logical binlog replication performance is simply painful. Even with parallel replication, for most business patterns no parallel mode brings a qualitative change in replication efficiency, so we have to fiddle with log persistence, group commit, and other peripheral factors to squeeze out throughput, sometimes at the cost of persistence risk. Our general rule is that, with no CPU, IO, or network bottleneck, a single MySQL instance can sustain replication for at most about 3000-5000 TPS of simple updates; as the logic gets more complex, some services see delay beyond 2000 TPS. With all the safety measures on, especially strict log persistence on the slave, performance is even worse... and replication delay means automatic HA switchover fails: the logs have reached the slave, but the data has not been applied, and the delayed data cannot be read or written after a switchover.

  • High online load can cause replication delay. For example, during the peak periods of some of our e-commerce activities, the write peaks of orders and related data cause delay, so at the most critical moment the database effectively has no high-availability protection.

  • Changes can cause delay. Both online DDL and external-tool changes produce delay that grows with the size of the changed object and may be measured in days. During the change, the replicas used for read/write splitting cannot serve reads, and the primary has no high-availability protection. Doesn't that make you break out in a cold sweat?

  • The only workable way we see for MySQL to beat delay is to split horizontally at the instance level so that each instance carries as little TPS as possible, a rather different kind of parallelism. But this is not very feasible either. In the distributed-middleware scenario represented by DDB, for instance, splitting a database from 64 shards to 640 changes the number of client connection objects, the distributed transaction overhead, and the difficulty of cluster changes and control. And there is still no guarantee that delay is fully solved; if the problem is a hotspot, what happens when the business's hot writes are concentrated in a single shard?

TiDB also has huge advantages in terms of high availability and data replication:

  • The consistency floor is very high: Raft keeps the data replicas reliable, and it is hard to create inconsistency.

  • TiKV Raft groups elect leaders autonomously, and the stateless TiDB Server layer updates routing automatically, so autonomous switchover and drifting of business connections are no longer difficult.

  • Switching efficiency is also very good: although each Raft group holds its own election, elections are highly parallel and stabilize within 10 s.

  • Replication is highly parallel at the Raft group level; in our load tests, write pressure 10-20 times the level at which an equivalent MySQL cluster starts to lag carried no delay risk.

TiDB's multi-replica data consistency and high availability are a generational advance over traditional MySQL, especially non-MGR MySQL; it has finally entered the era of a properly qualified distributed cluster.

In the future, especially as businesses use more public cloud facilities, an efficient and reliable mechanism like TiDB's may become a hard requirement. Although the public cloud has an advantage in the overall resource pool, in our experience and by SLA standards the reliability of any single resource is lower than in our own machine rooms. If for cost reasons we do not rely entirely on RDS, the higher probability of resource problems will magnify all the weaknesses of the old high-availability mechanisms, and TiDB solves those problems at the root, which is one of the key reasons we want to upgrade.

Storage costs have advantages

There is no big difference between TiDB and traditional databases in storage hardware selection: we generally use SSDs in both cases, so storage cost can be estimated as follows.

Taking a single SSD as the minimum unit: in MySQL mode we still end up with at least 4 disks even with a minimum redundancy of one replica, whereas TiDB, thanks to its consistency guarantees, needs no RAID, so the minimum unit is 3 Raft copies, an actual investment of 3 disks. That is at least 25% less hardware.

Furthermore, TiDB has another optimization advantage over MySQL: for historical reasons InnoDB compression is not widely used, while TiDB's RocksDB storage engine compresses by default. Although the KV layout differs enough that a direct comparison is impossible, in general compression always saves some space.

In our quantified tests, the gap between compressed TiDB and uncompressed MySQL varies a lot with the business data. For example, sysbench-style random, meaningless strings do not compress much:

random character data like 68487932199-96439406143-93774651418-41631865787-96406072701-20604855487-25459966574-28203206787-41238978918-19503783441 came to 114 GB in InnoDB versus 102 GB in TiDB.

But for actual business data containing many numbers, floating-point values, timestamps, and short meaningful strings, the compression can be striking:

one such table was 398 GB uncompressed in InnoDB and less than 30 GB in TiDB. So the compression ratio depends heavily on the data type and is not always that dramatic, but it is certainly better than uncompressed InnoDB. Our conservative estimate from practical experience is an overall compression saving of about 30%-50%.

Putting it together, the 25% RAID saving combined with the 30%-plus compression saving means that moving from MySQL to TiDB can cut storage hardware cost by roughly 35%-50%. This is clearly meaningful for businesses with large and growing data volumes.

Incidentally, for newer hardware (not that new, we just have not used much of it), such as NVMe storage attached over PCIe that can no longer sit behind a RAID controller, software-level redundancy like TiDB's is the natural fit, and the RAID hardware-redundancy model can be abandoned entirely.

HTAP Innovation model

While the items above are arguably improvements from 1 to 1.1, HTAP may be a genuine 0-to-1 scenario innovation, and we strongly expect the related TiDB features to bring business value to NetEase.

Traditionally we think of MySQL as a pure OLTP database: row storage, a limited optimizer, no MPP engine, limited middleware capability, mediocre resource utilization... for all sorts of technical reasons it simply cannot run OLAP analytical queries. At the other end, ClickHouse focuses on MPP and resource utilization, but give it moderately concurrent writes or non-batch SQL and it runs into stability risks, and it does not handle OLTP-style requests well. The division is stark.

Most of our current business scenarios look like the one above: data must flow from an OLTP system into an online OLAP system dedicated to ad-hoc analysis and into an offline data warehouse dedicated to batch computation. The figure looks simple, but the technology stack behind it is quite complex. NetEase's data platform team, DBA team, business algorithm teams, and middleware R&D team are all tied to this pipeline, which is arguably one of the most technically demanding parts of the business; I believe many companies in the industry are in a similar situation.

  • The OLTP part is mainly MySQL and its distributed middleware.

  • The pipeline may also be an ETL link with processing logic. The stack may include Sqoop or DataX for regular batches, real-time CDC (internally, NDC), and queues plus business code, such as Kafka, for cleaning and logical filtering.

  • For the OLAP side handling ad-hoc queries we currently have Oracle, Greenplum, ClickHouse, Kudu + Impala, and Presto; other SQL engines such as Doris, Kylin, Druid, and Hive are also available. If streams are involved there are also message queues, Flink, Spark, and so on; plus Yarn and Spark on the offline task side...

As you can see from the Xiaomi tech share above, this is an industry that delights in listing technology stacks and link architectures. From my point of view (and that of the other teams involved), big data is genuinely cool, technically necessary, and genuinely applicable, but it has some obvious pain points:

  • ETL process timeliness and data consistency.

  • Logic changes are also hard: whether the service architecture, the storage mode, or the structure of the most upstream table changes, a large number of changes along the link are needed to support it.

  • The technology stacks involved are numerous, specialized, and deep, requiring a lot of experienced development and operations effort.

  • The resource investment is high, for both the intermediate processes and the storage, and it multiplies when the requirements span several technology stacks.

  • Some efficiency issues arising from architectural and team complexity.

The industry keeps exploring databases that straddle OLTP and OLAP: Oracle built in MPP, Kudu attempted OLTP compatibility, and TiDB, relying on its distinctive row-plus-column dual storage, seems to be walking a more reasonable path.

TiDB's stated goal is 100 percent of OLTP scenarios and 80 percent of OLAP scenarios. I am not sure how realistic that is, but as things stand, replacing part of the stack behind the complex ad-hoc analysis requirements above is a reasonable goal. TiDB's technical solution has also gone through a major iteration in just a few years: around 2019 the plain TiSpark solution was the norm, while the current solution combined with TiFlash column storage really has a character of its own.

To keep it simple: as discussed earlier, at the bottom TiDB is an OLTP database that models rows as KV data in TiKV and uses the Raft protocol and log synchronization to keep multiple replicas consistent, and the TiFlash column replicas are kept consistent with TiKV through the same Raft log. In this way the TiSpark OLAP engine and the TiDB OLTP engine can freely choose between two data sources that are fully consistent yet stored in heterogeneous formats. Column storage is itself a great boost for OLAP, and with the development team's further optimization in that area, the result is a true HTAP database that keeps data consistent across the storage and compute layers while suiting both workloads (a usage sketch follows).
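At the SQL level the switch is a single statement; a minimal sketch, with a hypothetical table name and the system table and session variable as in the TiDB versions we tested:

    -- Add one column-store replica for an online table; TiKV row data is
    -- untouched and the TiFlash replica is kept consistent via the Raft log.
    ALTER TABLE t_order SET TIFLASH REPLICA 1;

    -- Check replication progress / availability of the column replica.
    SELECT table_schema, table_name, replica_count, available, progress
    FROM   information_schema.tiflash_replica
    WHERE  table_name = 't_order';

    -- Optionally pin an analytical session to the column engine; otherwise the
    -- optimizer chooses between TiKV and TiFlash per query.
    SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';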

This architecture has some significant advantages:

  • The architecture is simple and compact, deployment and management are mature, and TiFlash itself is very flexible: a business can adjust at any stage, leave it off for any online table to save cost, and turn it on at any time to support new analysis needs.

  • The cost of standing up a cluster drops from HDFS scale to ordinary database scale, so you do not need to find a dozen machines just for a minimal online deployment; resources can be invested as needed.

  • The analytical data, from table structure to content, is essentially identical to the online tables. If the analysis is based on the online structure, you just write SQL, which saves all the resources and maintenance of a synchronization link. Objectively, of course, ETL with real transformation is still sometimes needed.

  • With optimizer support, which storage form a query uses can be specified explicitly or left to the optimizer; the pure-SQL surface is more complete, so the business can implement requirements more easily.

  • TiSpark, for example, can push tasks down to TiFlash through the same coprocessor protocol used with TiKV, and TiFlash's ClickHouse-style vectorized engine improves resource utilization and performance.

  • Expansion is as easy as with TiKV. Admittedly, most distributed big data solutions, with a few exceptions (ClickHouse, naming names), can do this too.

We already have some practice inside the company, covered briefly later. At the same time, we do not expect TiDB's HTAP to work wonders overnight; we only hope to improve service capability in certain scenarios, simplify the architecture, and improve efficiency:

  • Replacing the existing Oracle and Greenplum deployments. The ROLLUP-style workloads may be a bit of a hassle, but most summary-table workloads are definitely fine. Greenplum is manageable, but Oracle is getting hard to expand on our current hardware.

  • Running SQL that is unsuitable for MySQL or DDB clusters somewhere more appropriate, streamlining OLAP processing, and encouraging more data analysis capability and demand.

  • Partially replacing Impala on Kudu and Presto on HDFS, as shown below. Where all the data comes from online imports, the links shown can be tried, offering the appeal of a more standard SQL entry point.

TiDB internal practices

Application Scenario Recommendation

Since TiDB is not our only distributed option, strictly speaking it has certain applicable scenarios for us, summarized as follows:

  • Businesses with clear off-peak seasons, strong periodicity, or peaks that need protection

  • For example, e-commerce with promotional events, modules built specifically for temporary activities, or education businesses with pronounced cycles can fully enjoy the advantage of flexible scale-out and scale-in.

  • Businesses with particularly strong needs for high availability and data consistency

  • Accounting and reconciliation, consumer vouchers, external payment callbacks, and so on: high availability and a high consistency floor help such businesses be as reliable as possible. On the public cloud, single-resource reliability is lower than in our internal machine rooms, so if RDS is not used, TiDB is worth trying.

  • Core modules needing a two-site, three-center scheme

  • TiDB provides a well-targeted three-center implementation plan; see docs.pingcap.com/zh/tidb/sta… . In short, the same-city machine rooms provide real-time, highly available cross-room redundancy and switchover, while the remote machine room provides a very reliable final layer of data redundancy. This suits high-importance system metadata, such as private cloud service metadata and especially internal object storage metadata, critical data at very large scale, which can also benefit from fast horizontal scaling.

  • Businesses that require HTAP

  • As mentioned above, if any online data needs to be analyzed and queried, you can try doing it inside a single TiDB system instead of the traditional derived-data pipeline.

However, since our goal is to bring TiDB into the core database selection as much as possible, I recommend that businesses currently interested in TiDB not get too hung up on whether their scenario is a perfect match. A matching scenario is of course best, but even if you are unsure about applicability and simply:

  • plan to use MySQL/DDB but are also very interested in TiDB

  • genuinely face the hard pain points of large-scale changes and capacity expansion

  • just want to experiment

  • want to reduce storage and operation costs

  • want to learn and share new technology

then give it a try. For safety's sake, whatever new technology is being tried, a business whose usage pattern is relatively simple or peripheral but whose traffic and data volume are relatively large is certainly the best place to start.

Technology transfer and supporting schemes

Differences between using TiDB and other existing databases:

  • The main considerations for a business using TiDB are SQL compatibility and the client connection scheme.

  • For a self-developed system, and for NetEase businesses used to MySQL or to public/private cloud RDS rather than DDB, using TiDB is no different from previous projects: use a common driver and connection pool and just change the database address. In my experience, the SQL features TiDB does not support are rarely used in our normal development.

  • If you want to use TiDB to store the metadata of other open source systems, look more closely at SQL compatibility, because the depth of SQL usage varies widely across the industry; we have seen some foreign open source projects use SQL in fairly complex ways.

  • If you are used to DDB, you need to switch to the JDBC driver and add a connection pool, because the DBI client has a built-in connection pool and JDBC does not.

  • On the SQL side, replacing DDB is generally fine, since TiDB's support is a superset of DDB's. The main thing to note is that DDB's old unique-ID allocation mode should be replaced with a MySQL-compatible approach such as implicit (auto-increment) allocation.

  • If you used to connect to DDB through the QS proxy with LBD, remove LBD and switch to a normal JDBC connection. Those who used QS without LBD (for example, in NLB mode) do not need to change anything.

TiDB Server Distributed cluster access

As we know, TiDB Server is a group of stateless compute nodes. With MySQL we used to provide a single entry point, such as a virtual IP, a domain name, or the DDB master address, so that the business did not need to care about back-end failover details. TiDB instead exposes a group of TiDB servers, and we can choose among the following options:

Scheme 1: traditional load balancers such as HAProxy or LVS, combined with VIP or domain-name switching. This is the most commonly used scheme.

The main problem with this scheme is ensuring the high availability of the load balancer itself, and the stack is likely to end up relying on OSPF. Still, it is the simplest and most reliable access method, so we recommend it.

Scheme 2: use the multi-host configuration of the native MySQL JDBC driver, with a connection string written as follows:

static final String DB_URL = "jdbc:mysql:loadbalance://10.170.161.65:4121,10.200.166.102:4121/zane?useSSL=false&failOverReadOnly=false";

That is, the connection is established once against multiple TiDB Server endpoints. In our tests, load balancing across the TiDB servers, automatic exclusion of failed nodes, and connection recovery all work; see the link above for details. The main drawback is that if the TiDB Server list changes, for example due to scaling or migration, the business has to modify the connection string and restart, which is troublesome.

Scheme 3: thanks to the stateless nature of TiDB Server, we ported the TiDB Server layer into Kubernetes and expose it as a Service both inside and outside the cluster.

The Kubernetes adaptation of the stateless TiDB Server went very smoothly, and Ingress can provide access to the service from outside Kubernetes; the related preparation and verification are complete. Our ultimate goal is full cloud nativeness for TiDB; the current partly-in-cloud, partly-out-of-cloud scheme is just a transitional model chosen for stability.

In summary, all three options are available, and we will determine which model to use based on the actual situation and business discussions.

Monitoring and other o&M support

From the DBA perspective, TiDB already has a fairly complete monitoring mechanism and supporting tools, all based on the open source technology our peers use, so we do not plan to build separate collection scripts and dashboards for it as we did for traditional databases; we will reuse the native scheme directly.

TiDB components (including the logging, management, and other supporting tools beyond the main services shown in the figure) expose standard Prometheus endpoints, and the monitoring service uses native Prometheus pull mode. Grafana also has a complete set of TiDB dashboards with good separation of concerns and strong logical connections between panels, proven effective in practice. So we will use the native solution for TiDB monitoring display and alerting rather than migrating it to Sentry or other self-built systems, and hooking the alerts into our current POPO or internal phone alert interfaces has also been worked through.

Beyond monitoring, TiDB covers the other basic operational needs of a database, including but not limited to backup and restore, scale-out and scale-in, automated cluster operation tools, log processing tools, data import and export, cluster configuration management, and rolling version upgrades... Most valuable of all, the TiDB community attaches great importance to SOP documentation, turning community experience into practical guidance, which is rare among open source databases, and we have rehearsed it fully internally. Our team has also had a lot of contact with PingCAP's support and tooling R&D teams, and it is clear that their engineers and community supporters are senior DBAs with a deep understanding of, and much hands-on experience with, the requirements and difficulties of running databases, so the related tools are quite practical and reliable compared with other open source projects.

Going further, the common needs of internal daily use, such as OWL's web-based data access and the automated work-order support for daily changes, are ready and verified online. In daily use the business should feel no difference between TiDB and our other MySQL databases.

In conclusion, we believe that TiDB has all the conditions for internal promotion in terms of operation and maintenance support capacity.

Existing services are migrated to the TiDB solution

Adding new services is easy; migrating existing services with existing data and applications to TiDB is harder. Besides the access-method and SQL adaptations mentioned above, which require an application release, the main question is how to move the existing data from DDB or MySQL into TiDB. A few considerations:

  • We certainly do not want a migration scheme with a long outage (or at least a long write outage); real-time data synchronization is preferred

  • We must also ensure that the application can be rolled back at any time, which means being able to roll back the data source as well. Incremental data written after the switch to TiDB must therefore be replicated back to the original database cluster in real time, so that we can fall back to the original data source at any moment.

  • There should be some room for grayscale release, to verify that the business can connect to and use TiDB.

After some internal practice, making full use of our existing internal tools and TiDB's supporting tools, these requirements have proven feasible and have been verified to some extent:

We use NDC, our internal CDC tool that supports both DDB distributed clusters and single MySQL clusters, to synchronize the full data into the target TiDB cluster and keep the incremental data continuously synchronized in real time. In the reverse direction, TiCDC subscribes to TiDB's change logs and writes them to a relay MySQL (using the blackhole engine purely for log transfer), from which the standard NDC pipeline synchronizes back to DDB. This gives us a bidirectional real-time replication scheme.

NDC is a multi-data-source migration and CDC platform developed internally by NetEase, roughly analogous to an Xdata + Canal platform product.

Possible steps for business migration are:

  • The core problem with bidirectional synchronization between two data sources is that there is currently no conflict resolution or loop-prevention mechanism, so full double-writing is not possible. Writes must stay on a single data source at any given time for the scheme to be safe; this is also the most critical consideration in the migration steps.

  • Start by setting up real-time DDB->TiDB synchronization until the data has caught up. The TiDB->DDB reverse link is not yet enabled.

  • If possible, the business releases read-only traffic to TiDB as a grayscale step to verify service availability.

  • For the official switchover, during an off-peak period: disable writes on the DDB cluster, pause the DDB->TiDB link, enable the TiDB->DDB reverse link, and release the application version that points at the TiDB data source. The service is write-blocked only during this release window.

  • If the previous step succeeds and the post-release regression passes, the migration is basically done.

  • If a problem is found after switching data sources, the application and the data source must be rolled back together to locate it. Because the reverse link keeps DDB consistent with TiDB, we simply block writes on TiDB, enable writes on DDB, and roll back the application.

In addition, during discussions with some businesses we found that a large amount of online data ETL and message-bus traffic currently depends on subscribing to DDB/MySQL binlogs, so NDC compatibility with TiDB is also necessary. We adopt the same approach as for the switchover:

we first replay TiDB's change logs into a relay MySQL through the dedicated tool, and then use the existing NDC subscription, so the original data links stay essentially unchanged.

Recent TiDB related business application brief

This round of technology promotion is not meant to scratch the surface; we genuinely expect TiDB to bring a qualitative change to our database technology and usage efficiency. TiDB has recently gone, or will soon go, live in multiple business modules of NetEase Pay and NetEase Cloud Music, and its HTAP, second-level DDL, and fast scale-out/scale-in capabilities will keep improving the user experience.

Generally speaking, the application of TiDB in NetEase is still in its infancy. For a long time we struggled with two questions: "why do we need a new distributed system when we have DDB" and "where is the value of HTAP?" After some exploration and promotion we have made progress in understanding both, hence this article. We hope this recent work serves as a starting point, turning the capability gains brought by database technology into actual business value and raising the level of NetEase's database services. Teams and individuals interested in TiDB-related technology are welcome to stay in active communication with our DBA team to verify and apply these technologies together.