Author introduction: Ji Haodong, head of the database team at Zhuanzhuan, responsible for the company's overall database operations.

What problems did the initial introduction of TiDB solve?

Introducing TiDB mainly solved two problems for us: database and table sharding, and operational complexity.

Sharding is a very common problem: it increases the complexity of our business logic, and multi-dimensional shard mapping can degrade overall performance. With TiDB, we no longer need to shard databases and tables, and we no longer need to write all that complex routing logic.

As for operational complexity, TiDB scales out quickly without DBAs performing complex data relocation or the business migrating traffic. In addition, online DDL on large tables is almost transparent to the business.

New problems arise

The introduction of TiDB brings with it some new problems.

  • Deployment is slow and management is difficult. TiDB Ansible runs into a variety of exceptions when managing multiple TiDB clusters, which greatly increases our operational complexity.

  • Hotspots cannot be located quickly. Data hotspots are a common problem for an e-commerce platform like ours. Because the cluster has many nodes, hot keys cannot be located quickly; we have to query each node's logs and troubleshoot step by step, which is costly.

  • Cluster status cannot be viewed quickly. There are too many monitoring items and logs are scattered, so confirming the cluster's status at a given time means analyzing it step by step; cluster exceptions cannot be located quickly.

  • Data extraction increases online response latency. This is a very common problem, because extraction jobs also consume TiKV resources and affect its performance.

  • Large clusters cannot be backed up effectively. Fast backup and recovery of large clusters is an urgent problem for us. Previously we only adopted TiDB when the data volume was very large, which is exactly when backup matters most. We had been relying on logical backups, but logical backups are not efficient enough for us.

  • TiKV thread pool configuration is complex and affects services. The number of threads is set when TiKV is deployed, and each pool has three priority levels to configure. Two read thread pools, readpool.storage and readpool.coprocessor, must be configured for different service scenarios. As our business evolves and iterates, our SQL changes, so the read pools are used in different ways, and adjusting the thread configuration can affect business access to varying degrees.

What problems does TiDB 4.0 solve?

Let’s take a look at some of the problems that TiDB 4.0 can solve.

Cluster deployment and management issues — TiUP

Previously, managing clusters with TiDB Ansible was quite difficult, and TiDB Ansible has its own problems. TiDB 4.0 introduces a new component management tool, TiUP, and the experience so far has been very good: we can deploy 3 TiDB, 3 PD, 3 TiKV, and 1 TiFlash node in about one minute, which is impressive. TiUP also provides a number of subcommands for checking cluster status. One reminder: TiFlash uses many ports, so plan its port assignments carefully.
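
As a reference, here is a minimal sketch of such a deployment with TiUP. The hostnames, cluster name, and version are placeholders, not our actual topology; note that TiFlash in 4.0 requires placement rules to be enabled on PD.

```bash
# Minimal TiUP deployment sketch (placeholder IPs; defaults used for ports and directories).
cat > topology.yaml <<'EOF'
global:
  user: "tidb"

server_configs:
  pd:
    replication.enable-placement-rules: true   # required for TiFlash in 4.0

pd_servers:
  - host: 10.0.1.1
  - host: 10.0.1.2
  - host: 10.0.1.3
tidb_servers:
  - host: 10.0.1.4
  - host: 10.0.1.5
  - host: 10.0.1.6
tikv_servers:
  - host: 10.0.1.7
  - host: 10.0.1.8
  - host: 10.0.1.9
tiflash_servers:
  - host: 10.0.1.10
EOF

# Deploy, start, and check the cluster (cluster name and version are examples).
tiup cluster deploy demo-cluster v4.0.0 topology.yaml --user tidb -p
tiup cluster start demo-cluster
tiup cluster display demo-cluster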

Data Hotspot — Key Visualizer

In the early days we could only troubleshoot hotspot issues by digging through various logs and analyzing them slowly. TiDB 4.0 adds a new visualization tool, Key Visualizer, that gives a quick and intuitive view of hotspots across the entire cluster. As shown in the figure above, we took an online cluster, copied its data and traffic over, and then directed new traffic to it. We can see the cluster's write status at any point in time: the current state, the number of bytes written, which database and table are involved, and the corresponding row key. In the figure on the right, the brightness of each area of the heatmap indicates how busy the corresponding key range is. Generally speaking, this figure is in the expected state, with writes distributed fairly evenly. If there were a hotspot, a bright line would appear and the hot key would be obvious; with this tool we can find hot keys quickly.

Viewing cluster status quickly – TiDB Dashboard

To address the problem that cluster status cannot be assessed quickly, TiDB 4.0 adds a new component called TiDB Dashboard. Through TiDB Dashboard and its cluster diagnostic report, we can quickly get the cluster's basic information, load information, component information, configuration information, and error information. This information is very rich and very effective for pinpointing cluster exceptions.

TiDB Dashboard is one of the highlights of TiDB 4.0; it provides access to cluster information in real time. Above is the Dashboard overview page, which includes QPS, response latency, node status, and alerts. Through the overview, DBAs can quickly check the cluster's status and locate problems; it is fair to say that the overall usability of TiDB 4.0 is now very high.
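
For reference, the Dashboard is built into PD in 4.0 and needs no separate deployment; a minimal access sketch, with a placeholder PD address and the default client port:

```bash
# TiDB Dashboard is served by PD; open it in a browser at the PD client endpoint:
#   http://10.0.1.1:2379/dashboard

# The PD endpoints (and the rest of the topology) can be listed with TiUP:
tiup cluster display demo-cluster
```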

Slow query viewing is a milestone feature. We had complained about TiDB's slow query handling all the way from 1.0 to 4.0. In 4.0, the Dashboard lets us specify a database, browse different slow queries, and locate them quickly. We no longer need our own ETL pipeline or dedicated machines; we can locate slow queries quickly, with sorting, execution time, and other details included. For a company using TiDB, this is very good news.
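
Besides the Dashboard UI, the same data can be pulled over SQL from the cluster-wide slow query table added in 4.0. A minimal sketch, assuming a TiDB server on the default port 4000 and a hypothetical database named order_db:

```bash
# Recent slow queries for one database, slowest first
# (host, credentials, and the database filter are placeholders).
mysql -h 10.0.1.4 -P 4000 -u root -p -e "
  SELECT time, query_time, db, query
  FROM information_schema.cluster_slow_query
  WHERE db = 'order_db'
    AND time > NOW() - INTERVAL 1 HOUR
  ORDER BY query_time DESC
  LIMIT 10;"
```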

Through the slow query page we get a list of slow queries, and from the list we can drill down into the specific SQL statement. The SQL statement information includes the statement template, fingerprint ID, samples, execution plans, and transaction-related metrics, which are very valuable to us. When we built our own ETL, many of these indicators simply were not available; now, through SQL statement analysis, we can see each execution stage of a slow query and how long each stage took, which greatly improves our overall SQL analysis experience.
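
The statement-level data behind this page is also exposed as system tables. A sketch of listing the statements with the highest average latency, using the same placeholder endpoint as above:

```bash
# Statement summary tables group executions by fingerprint (digest) and
# include latency breakdowns such as coprocessor process and wait time.
mysql -h 10.0.1.4 -P 4000 -u root -p -e "
  SELECT digest_text, exec_count, avg_latency, avg_process_time, avg_wait_time
  FROM information_schema.cluster_statements_summary
  ORDER BY avg_latency DESC
  LIMIT 10;"
```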

Log search has also been added. In the early days, when we did our own ETL, we had to retrieve all kinds of logs and then analyze them ourselves. With the log search function, we no longer need to log in to each machine or build a dedicated system for log analysis, which greatly reduces our labor and development costs. With this tool we can specify a time period, a log level, and a node, and retrieve the most recent logs, which is very convenient.
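
Component logs can likewise be searched through a system table; a minimal sketch with placeholder connection details (a time filter keeps the scan bounded, and the level value is shown lowercase here):

```bash
# Search recent warning-level logs across cluster components via SQL.
mysql -h 10.0.1.4 -P 4000 -u root -p -e "
  SELECT time, type, instance, level, message
  FROM information_schema.cluster_log
  WHERE time > NOW() - INTERVAL 30 MINUTE
    AND level = 'warn'
  LIMIT 50;"
```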

Data extraction increases online response latency — TiFlash

We now enable TiFlash nodes to solve the problem that data extraction increases online response latency. TiFlash's features include asynchronous replication, consistency, intelligent engine selection, and computational acceleration. We will not go into the underlying principles here, but focus on how we use it at Zhuanzhuan. Our main scenario is physical isolation: adding TiFlash is comparable to adding a TiKV node on a new machine, but we separate the traffic so that different requests go to different back-end storage nodes, and data extraction no longer affects overall online performance. TiDB can also choose intelligently, deciding from the complexity of the SQL whether to read from TiKV or TiFlash; in our case we enforce the split, with online queries going to TiKV and offline queries to TiFlash.
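
A minimal sketch of how this kind of setup is typically wired up, with hypothetical database and table names; the replica statement and the isolation variable are standard TiDB 4.0 syntax:

```bash
# Create a TiFlash replica for a (hypothetical) large table used by extraction jobs,
# then check that replication has finished (AVAILABLE = 1).
mysql -h 10.0.1.4 -P 4000 -u root -p -e "
  ALTER TABLE order_db.orders SET TIFLASH REPLICA 1;

  SELECT table_schema, table_name, replica_count, available, progress
  FROM information_schema.tiflash_replica
  WHERE table_schema = 'order_db' AND table_name = 'orders';"

# In the offline/extraction session, restrict reads to the TiFlash engine;
# online sessions can keep the default or be pinned to 'tikv'.
mysql -h 10.0.1.4 -P 4000 -u root -p -e "
  SET SESSION tidb_isolation_read_engines = 'tiflash';
  SELECT COUNT(*) FROM order_db.orders;"
```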

Large clusters cannot be effectively backed up — Backup & Restore

The distributed Backup & Restore (BR) tool solves the problem that large clusters cannot be backed up effectively. In our tests on 10-gigabit network cards, with a rate limit of 120 MB/s, backing up 300 GB of data to a network file system took less than 10 minutes, and restoring it from the network file system under the same 120 MB/s limit took about 12 minutes, which greatly reduces our backup and recovery time. Another key point is that backup speed scales with the number of TiKV nodes: the more TiKV nodes we have, the faster both backup and restore become.
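
A sketch of the corresponding commands, assuming BR can reach PD and that every TiKV node has the same NFS path mounted; the PD address and backup path are placeholders:

```bash
# Full backup to a network file system mounted at the same path on every TiKV node.
# --ratelimit caps the backup speed per TiKV node, in MiB/s.
br backup full \
  --pd "10.0.1.1:2379" \
  --storage "local:///mnt/nfs/backup/2020-06-01" \
  --ratelimit 120 \
  --log-file backup_full.log

# Restore the same backup set into a new or empty cluster.
br restore full \
  --pd "10.0.1.1:2379" \
  --storage "local:///mnt/nfs/backup/2020-06-01" \
  --ratelimit 120 \
  --log-file restore_full.log
```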

Configuration of TiKV thread pool — Unified Read Pool

Another optimization in TiDB 4.0 is the Unified Read Pool. Before 4.0, we had to configure readpool.storage and readpool.coprocessor ourselves and adjust them by hand, and every adjustment could affect the business, which was a pain point. The Unified Read Pool merges storage and coprocessor reads into a single thread pool; whether a request goes through storage or the coprocessor is determined by the SQL itself. This not only improves the operational experience but also solves our problem of uneven resource allocation between the two pools. The figure above shows how we enable the Unified Read Pool configuration.
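
For reference, a sketch of the relevant TiKV settings as they would appear under server_configs in a TiUP topology; the thread counts are example values, not tuning recommendations, and the two use-unified-pool switches route the old storage and coprocessor pools into the unified one:

```bash
# Unified read pool settings for TiKV, in TiUP topology form.
cat <<'EOF'
server_configs:
  tikv:
    readpool.unified.min-thread-count: 1
    readpool.unified.max-thread-count: 8
    readpool.storage.use-unified-pool: true
    readpool.coprocessor.use-unified-pool: true
EOF

# For a running cluster, the same keys can be edited and applied with TiUP:
#   tiup cluster edit-config demo-cluster
#   tiup cluster reload demo-cluster -R tikv
```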

Future plans

TiDB 4.0 delivers a number of practical features, such as TiDB Dashboard, TiFlash, and the unified read pool, that improve the overall usability of TiDB. We plan to upgrade to v4.0, which will free up some human resources and reduce our operational complexity.

This article is adapted from Ji Haodong's talk at TiDB DevCon 2020. For videos from the conference, follow the official Bilibili account (ID: TiDB_Robot).