About the author: Xu Chao, senior DBA engineer at VIPKID.

This article shares how VIPKID uses TiDB 4.0 in practice. It covers two parts: first, the application scenarios of TiDB at VIPKID, and second, the surprises and benefits TiDB 4.0 has brought us.

Application scenarios of TiDB at VIPKID

First, a brief introduction to VIPKID. VIPKID is an online English education company for children, serving kids aged 4 to 15 and their parents, mainly through one-on-one classes taught by teachers in North America.

Scenario 1: Large data volume and high concurrency

Returning to the topic, the first application scenario for TiDB at VIPKID is workloads with large data volumes and highly concurrent writes, as shown in the following figure:

Take one example: we have a classroom troubleshooting system that collects event information from the classroom in real time, such as teachers and students entering or leaving the classroom and classroom service initialization events. This information is reported from both the teacher side and the student/parent side. Based on it, colleagues on the business side can quickly locate faults in a classroom and take debugging measures, such as switching lines, to keep class quality up.

At present the system peaks at about 10,000 TPS and adds roughly 18 million rows per day. The table keeps only the last two months of data, yet a single table already holds about 1.2 billion rows, which is a fairly large volume, so we migrated all of this system's data to TiDB.

During the migration we made some small changes: we removed the original auto-increment ID from the table and pre-split the table through its table options to avoid write hotspots. Hotspots can also be diagnosed in version 3.0, for example by querying the information_schema.TIDB_HOT_REGIONS table or by pulling hot Region information directly from PD, but what you get back is only a Region ID; you then have to call a TiDB server API to map it to the specific table or index. In 4.0, TiDB Dashboard presents hotspots visually, so we can directly see write hotspots, read hotspots, and other heat information.
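As a rough illustration (the table and column names here are made up, not our real schema), the pre-split can be expressed with TiDB's SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS table options, and the 3.0-style hotspot lookup is just a query against information_schema:

```sql
-- Hypothetical classroom-event table: no AUTO_INCREMENT primary key, so rows
-- use the implicit _tidb_rowid; SHARD_ROW_ID_BITS scatters that rowid and
-- PRE_SPLIT_REGIONS splits the table into Regions at creation time,
-- avoiding a single write hotspot.
CREATE TABLE classroom_event (
  id           BIGINT      NOT NULL,
  classroom_id BIGINT      NOT NULL,
  event_type   VARCHAR(64) NOT NULL,
  created_at   DATETIME    NOT NULL,
  KEY idx_classroom_time (classroom_id, created_at)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 3;

-- The 3.0-era way of finding hot Regions; the result still has to be mapped
-- back to a concrete table or index via the TiDB server API.
SELECT * FROM information_schema.TIDB_HOT_REGIONS;
```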

Scenario 2: Multi-dimensional queries after core modules are sharded

The second scenario: many of our core businesses were already sharded into multiple databases and tables early on, for example the class schedule. If you are familiar with sharding, you know that usually only one business column is chosen as the sharding key. The schedule, however, is queried along several dimensions, including classroom, teacher, student, and course, and other services also need to access this data. For example, a teacher accesses it along the teacher dimension; if the data is sharded by the student dimension, the teacher's request has to be broadcast to every shard, which is not acceptable online. We initially considered two solutions: aggregate all the data into Elasticsearch, or write to two sets of clusters, each sharded along its own dimension. Either way, the maintenance cost is high, both for the R&D teams and for us DBAs.

Therefore, we use DM to replicate all the upstream sharded tables downstream in real time and merge them into one global table. All cross-dimensional queries, whether from the admin-side operation platform or from the business metrics monitoring system, go to TiDB and query this global table, as shown in the figure below:

The logic is actually fairly simple. DM pulls the binlogs of all the online shards, filters the binlog events to be replicated against block and allow lists, matches them against the configured routing rules, and applies them to the downstream global table. There are two cases. In one, the upstream sharded tables already contain a global ID, which naturally avoids primary key conflicts after the merge. In the other, there is no global ID; the official solution used to be to pre-process the ID column before replication so that no primary key conflicts occur downstream. However, after talking with PingCAP we learned they no longer recommend this approach. Since our business does not depend on the auto-increment primary key ID, we took a different route: we pre-created the table downstream in TiDB as shown in the following figure, removed the primary key and auto-increment attributes from the ID column, and appended the ID to the existing composite unique index. That solved the problem:
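A minimal sketch of the idea (the table and columns are hypothetical, not our real schema): the upstream shards keep their auto-increment primary key, while the downstream global table drops the primary key and auto-increment attributes on id and appends id to the composite unique index, so rows coming from different shards cannot collide:

```sql
-- Upstream shard in MySQL (one of many, e.g. schedule_00 ... schedule_31):
CREATE TABLE schedule_00 (
  id          BIGINT   NOT NULL AUTO_INCREMENT,
  student_id  BIGINT   NOT NULL,
  teacher_id  BIGINT   NOT NULL,
  lesson_time DATETIME NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uk_student_lesson (student_id, lesson_time)
);

-- Downstream global table, created in advance in TiDB: id is no longer a
-- primary key and no longer auto-increment; it is appended to the unique key
-- so that ids repeated across shards cannot conflict after the merge.
CREATE TABLE schedule_global (
  id          BIGINT   NOT NULL,
  student_id  BIGINT   NOT NULL,
  teacher_id  BIGINT   NOT NULL,
  lesson_time DATETIME NOT NULL,
  UNIQUE KEY uk_student_lesson_id (student_id, lesson_time, id),
  KEY idx_teacher (teacher_id, lesson_time)
);
```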

On the DM side we did two simple things. The first is DM delay monitoring. DM itself supports delay monitoring by writing a heartbeat, but our DM workers are not attached to the MySQL primary nodes; they are attached to read-only replicas, so we do not allow DM to write heartbeats to those replicas directly. We already have an online heartbeat mechanism for lag monitoring, so DM's delay monitoring simply reuses the online heartbeat table: the online heartbeat is replicated through DM and rewritten by the routing rules. Although several upstream clusters may feed into one TiDB cluster, each of them ends up with its own downstream heartbeat table, which is enough for DM latency monitoring. As you can see in the figure below, the replication delay in the shard-merging scenario is within 200 ms most of the time:
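A rough sketch of the idea, with hypothetical schema and table names: the upstream heartbeat table has its timestamp refreshed about once per second, DM routes each cluster's heartbeat into its own downstream table, and the lag is just the difference between that timestamp and the current time:

```sql
-- Downstream (TiDB): one heartbeat table per upstream cluster, e.g.
-- monitor.heartbeat_cluster01, populated by DM via routing rules.
-- Upstream, a job refreshes the single row roughly once per second:
--   UPDATE monitor.heartbeat SET ts = NOW(6) WHERE id = 1;

-- Replication lag in milliseconds for cluster01:
SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1000 AS lag_ms
FROM monitor.heartbeat_cluster01
WHERE id = 1;
```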

The second thing is DM high availability. DM itself does not provide high availability: it can automatically restart a DM-worker process that dies, but it can do nothing when the whole node goes down. So we did a simple experiment, as shown below:

Essentially, we put all the persistent data of DM-master and DM-worker on shared storage and run DM-worker in containers. When a node goes down, we only need to start the corresponding worker on a standby machine and point it at that worker's directory on the shared storage. The catch is that the worker's IP cannot be pinned, so during this process we have to update the DM-master configuration file and then do a rolling update.

Scenario 3: Data lifecycle management

The third scenario for TiDB at VIPKID is data lifecycle management. Anyone who has run MySQL knows that large tables are expensive to maintain, so many companies archive data in tiers. We now classify data into several tiers according to the business and its read/write patterns, and put the warm data and cold data, along with the online read traffic on them, into TiDB.

One thing to consider when introducing TiDB for this scenario is data migration: a prerequisite is keeping the hot-data and cold-data table schemas consistent. In reality, the cold data is usually much larger than the hot data, so when the online hot table gets a schema change, such as adding a column, applying the same change to the cold data is very expensive, perhaps ten or dozens of times the cost of changing the hot table. That is why we wanted TiDB here: on the one hand, TiDB's DDL adds and drops columns in seconds; on the other hand, with TiDB's horizontal scalability, the business teams can share a single TiDB archive cluster. Here is a brief summary of our experience with TiDB DDL, as shown in the figure below:

The first step of a DDL only changes the metadata, so it returns to the user within seconds; the remaining data cleanup is handled asynchronously in the background by GC. The time an add-index operation takes depends on the data volume, because it involves backfilling and reorganizing data, but it does not block online traffic; according to the official documentation, the backfill in TiDB is split into multiple units that are read and written concurrently. Unlike MySQL, adding a column in TiDB only changes the metadata and does not rebuild the underlying data, so the original rows never need to be backfilled. There are two cases: if the default value is NULL, nothing extra is stored; if a default value is specified, it only needs to be recorded alongside the table schema in TiKV. When a query decodes a row and the column value is missing, TiDB returns NULL in the first case or the recorded default in the second, without ever touching the historical data.
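For example (with a hypothetical archive table name), both of these statements complete in seconds regardless of how many rows the table holds, because only the schema changes and historical rows are never rewritten:

```sql
-- Adding a nullable column: nothing is stored for existing rows; missing
-- values decode as NULL.
ALTER TABLE archive_lesson ADD COLUMN cancel_reason VARCHAR(255) NULL;

-- Adding a column with a default: the default is recorded with the table
-- schema in TiKV and returned when a historical row is decoded.
ALTER TABLE archive_lesson ADD COLUMN status TINYINT NOT NULL DEFAULT 0;
```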

Scenario 4: Real-time data analysis

Another scenario is real-time data analysis, which is in fact one of the scenarios through which TiDB was introduced at VIPKID:

This scenario mainly serves BI analysts and BI platforms. Architecturally, we replicate all online data into TiDB through DM and run real-time analysis on it, relying on TiDB's horizontal scalability and its ability to push computation down.

What are the surprises and benefits of TiDB 4.0?

TiFlash column storage engine

TiFlash is a columnar storage engine, which makes it friendly to AP workloads; it also has its own compute capability and supports pushing some operators down to accelerate computation. TiFlash replicates data from TiKV in real time without disturbing TiKV's existing operation: it joins each Raft group as a Learner, and replication is configured at table granularity. It also offers intelligent engine selection: the optimizer can automatically choose TiFlash columnar storage or TiKV row storage, and can even mix them within a single query to get the best speed. The choice is not made only per SQL statement; the granularity goes down to individual operators within one statement. For example, in one query the part that can use an index may run faster on row storage, while the part that needs a full table scan goes to columnar storage, improving the overall query speed. Finally, TiFlash supports independent deployment, so it can be separated from TiKV to achieve a degree of hardware resource isolation.
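Adding TiFlash replicas and checking their readiness is done in SQL; a small sketch with a made-up table name:

```sql
-- Add one TiFlash replica for a table (replication is configured per table):
ALTER TABLE bi.fact_lesson SET TIFLASH REPLICA 1;

-- Check replication progress; AVAILABLE = 1 means the replica is ready:
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.TIFLASH_REPLICA;
```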

Let’s talk about the benefits TiFlash brings to us.

The first is the performance improvement. I ran a simple test: the environment has five TiKV nodes and one TiFlash node, with a single table of 250 million rows. A count over this table takes more than 40 seconds on the TiKV index but about 10 seconds on TiFlash, and that is with only a single TiFlash node; with more TiFlash nodes the speed should improve further.

TiFlash

TiKV
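The comparison in the two screenshots above can be roughly reproduced with the READ_FROM_STORAGE optimizer hint, which forces one execution onto TiKV and the other onto TiFlash (the table name is hypothetical):

```sql
-- Force the count to run on the TiKV row store:
SELECT /*+ READ_FROM_STORAGE(TIKV[t]) */ COUNT(*) FROM bi.fact_lesson t;

-- Force the same count onto the TiFlash column store:
SELECT /*+ READ_FROM_STORAGE(TIFLASH[t]) */ COUNT(*) FROM bi.fact_lesson t;
```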

The second is cost reduction. We used to have a dedicated cluster for BI, and we have now replaced it with a new TiDB 4.0 cluster. The table on the left shows the resource allocation of each component in the new 4.0 cluster, and the table on the right compares the resources of the new and old clusters under the same load. As you can see, deploying the new cluster reduced our overall cost by 35%.

Third, TiFlash helps solve some stability problems online. Previously, when TP services were running on the TiKV layer and a SQL statement triggered a full table scan, the load on the whole TiKV cluster would rise, which could lengthen the response time of TP queries across the board. Now I can add a TiFlash node and create TiFlash replicas for the large tables in that cluster, so that such queries preferentially go to TiFlash and TiKV is affected as little as possible.
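One related knob for this kind of isolation (a sketch, not something mentioned above) is restricting which storage engines a session may read from, so an analytical connection cannot fall back to scanning TiKV:

```sql
-- For an analytical session, only allow reads from TiFlash (plus TiDB's own
-- memory tables); a query on a table without a TiFlash replica then returns
-- an error instead of falling back to a TiKV full table scan.
SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';
```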

Let me also mention two "limitations" we ran into when using TiFlash.

First, if your SQL contains functions that TiFlash has not implemented yet, those calculations cannot be pushed down; you only benefit from TiFlash's speed-up as a column store. If you need a push-down that is not yet implemented, you can contact PingCAP's official technical support (add the WeChat account XXX and mention "TiFlash technical consultation").

Second, TiFlash does not yet cover JOINs between large tables well; it only optimizes JOINs between small tables and large tables. (We later learned from PingCAP that in July they plan to release a JOIN optimization for joining a few small tables with a large table that is an order of magnitude better than the existing approach, and that they will start optimizing the remaining JOIN scenarios within the year.) If you are familiar with Spark, you may know there are several Hash Join algorithms, one of which broadcasts the small table to every data node and performs the Hash Join locally on each node. The screenshot on the left of the following figure is from a big-screen dashboard of teacher-recruitment statistics:

This query took about 17 minutes to execute. TiFlash was indeed used during execution, but almost all of the time went into the Hash Join rather than into TiFlash. As the figure shows, the load on the TiDB server was very high, so the bottleneck was the TiDB server itself.

So we looked for a way to break up the Hash Join and introduced TiSpark, as shown below:

In comparison tests (a single TiDB server with 8 cores and 64 GB of RAM), the same query on a small TiSpark cluster takes about a third of the time. The following figure shows the BI-facing TiDB cluster after the transformation:

The main additions here are the TiSpark and TiFlash components. The SQL behind the big-screen dashboards is well defined, so much of it goes directly through TiSpark, while the queries and statistics from BI analysts and the BI platform go through TiDB, which then intelligently chooses between TiKV and TiFlash.

TiDB Dashboard

The Dashboard built into TiDB 4.0 is, in my opinion, a big improvement in usability and the out-of-the-box experience.

Hotspots could also be looked up in version 3.0, but there was a certain learning cost and the information was scattered. Version 4.0 puts it all in the Dashboard and presents it to users in a much more intuitive way. The Dashboard's log search and slow-log features follow the same logic: they make TiDB easier to use. Without them, a small cluster can be inspected node by node, but on a large cluster users would have to build their own log collection and reporting pipeline, which carries a real deployment cost. Another piece is cluster diagnostics. Our business runs regular activities every week, and sometimes R&D says this week's traffic is about the same as last week's, yet response times feel longer. With the Dashboard I can generate a diagnosis report for a specified time range and compare it against a clear baseline range:

Backup & Restore

Backup used to be a relatively blank area for TiDB, which supported only logical backup, and logical backup is very inefficient for large data volumes. The BR tool in 4.0 moves toward physical backup, which is very encouraging.

BR communicates with PD to get the current timestamp and the current distribution of Regions across TiKV, and internally starts a temporary TiDB instance to build the backup requests, which it sends to every TiKV node. The backup is therefore distributed: each TiKV node backs up only the Region Leaders on its own instance, which is close to a physical backup. When a node finishes backing up its Region Leaders, it returns the corresponding metadata to BR, and BR saves that metadata in a separate file for use during restore.
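We drove the backup with the standalone br binary. For illustration only, later 4.0 releases also expose a SQL interface with roughly the same semantics; a sketch, assuming the SQL BACKUP statement is available in your version and the path is hypothetical:

```sql
-- Back up all databases to shared storage mounted at the same path on the BR
-- host and on every TiKV node; the rate limit applies per TiKV node.
BACKUP DATABASE * TO 'local:///mnt/backup/2020-06-full/'
  RATE_LIMIT = 80 MB/SECOND;
```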

We tried it briefly, as shown below:

Following the official recommendation, we backed up to shared storage: the same shared storage was mounted on the BR machine and on all TiKV machines. Backing up a 2.4 TB TiDB cluster took a few hours. There is still room to optimize, because the average speed was only a bit over 80 MB/s, probably limited by the bandwidth of our current SAS disks; with later tuning it should be faster. You can also see the size of each directory in the BR backup; the total backup size is a bit over 700 GB.

That’s all for sharing. Thank you.

This article is adapted from Chao Xu’s talk at TiDB DevCon 2020.