Preface
In ancient times, doctors relied on the four classic methods of “looking, listening, asking, and pulse-taking”, judging a patient's condition from the overall outward signs. Today, CT lets X-rays pass through the body's tissues and show the whole picture, so doctors can quickly diagnose problems from that information. CT not only raised the efficiency of diagnosis to a new level, but also provided a standard for describing the state of the body objectively, which makes it an important milestone in the history of medicine.
A running TiDB cluster is said to have a hot spot when only a few nodes are very busy while the others are relatively idle. As a distributed database, TiDB automatically and dynamically redistributes data to keep the cluster as balanced as possible, but abrupt changes in business characteristics or load can still produce hot spots, and these often become performance bottlenecks.
Before TiDB 4.0, diagnosing read/write hot spots in a cluster generally meant the database equivalent of “looking, listening, asking, and pulse-taking”, examining the cluster's outward behavior and narrowing down the problem step by step:
- Check whether the CPU and I/O usage of each component is balanced.
- Go through the cluster's list of hot Regions and check the affected tables one by one.
- Analyze the business logic behind each table to find the cause of the hot spot.
- …
The whole process is tedious, spans several tools and components, has a learning cost, and does not give an intuitive overall picture.
Google offers a visualization tool for its Bigtable cloud service, Key Visualizer, which solves the hot spot detection problem elegantly. TiDB implements its own Key Visualizer in version 4.0. We can now easily take a “CT scan” of the cluster and observe its overall hot spots and traffic distribution quickly and intuitively, as shown in the figure below:
Why are there hot spots?
This may sound like a TiDB bug, but it is not; it is a feature 🙃. In all seriousness, hot spots usually arise because the read/write pattern of the business does not fit a distributed setting well.
For example, if 90% of the traffic reads or writes one small piece of data, that is a typical hot spot: in the TiDB architecture a given row is served by a single TiKV node, not by every node. If most of the traffic keeps accessing the same row, most of it ends up on one TiKV node. The performance of that one TiKV machine then caps the performance of the whole service, and adding more machines does not raise capacity.
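To make the bottleneck concrete, here is a minimal Python sketch of the scenario above. The Region size, the node names, and the one-node-per-Region mapping are illustrative assumptions, not TiDB's actual layout (real Regions are replicated and much larger).

```python
from collections import Counter

def node_load(requests, region_of, node_of):
    """Count requests per node, given key -> Region and Region -> node mappings."""
    load = Counter()
    for key in requests:
        load[node_of[region_of(key)]] += 1
    return load

# Assumed layout: keys 0..299 split into three Regions of 100 keys each,
# and each Region served by exactly one (hypothetical) TiKV node.
def region_of(key):
    return key // 100

node_of = {0: "tikv-1", 1: "tikv-2", 2: "tikv-3"}

# 90 of 100 requests hit the single row with key 42; the other 10 are spread out.
requests = [42] * 90 + list(range(0, 300, 30))
load = node_load(requests, region_of, node_of)
print(load)
```

With this made-up workload, tikv-1 serves 94 of the 100 requests. Adding a fourth node to the model would not change that, because the hot row still lives in a single Region.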
In fact, TiDB handles data by Region, that is, a batch of adjacent data. Besides the scenario above, other patterns can also produce hot spots, such as a write hot spot on table data when an auto-increment primary key keeps writing adjacent rows, or a write hot spot on a table index when rows with adjacent timestamps are written under a time index. We will not cover them one by one here; interested readers can read the TUG community article “TiDB hot issues in detail”.
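The auto-increment case can be sketched in a few lines of Python. The split points below are invented for illustration, and the model ignores Region splitting and replication; it only shows why consecutive keys keep hitting the same key range.

```python
import bisect
from collections import Counter

# Hypothetical split points: Region 0 holds keys below 1000, Region 1 holds
# 1000..1999, Region 2 holds 2000..2999, Region 3 holds keys from 3000 upward.
SPLIT_KEYS = [1000, 2000, 3000]

def region_of(key):
    """Index of the Region whose key range contains `key`."""
    return bisect.bisect_right(SPLIT_KEYS, key)

def writes_per_region(keys):
    """Count how many writes land in each Region."""
    return Counter(region_of(k) for k in keys)

# An auto-increment primary key: every new row gets the next larger ID.
next_ids = range(3500, 3600)
hot = writes_per_region(next_ids)
print(hot)
```

All 100 writes in this toy run land in Region 3, the Region that holds the largest keys, while the other three Regions receive nothing.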
How do you find the hot spot culprits?
How it works
As described above, a hot spot essentially means that most read/write traffic touches only a small number of Regions, so only a few TiKV nodes in the cluster carry most of the work. TiDB Key Visualizer shows the read/write traffic of all Regions as a heat map, using color to indicate traffic volume. The heat map lets users quickly see how hot each Region in the cluster is and intuitively see where the hot Regions are and how they change over time, as shown in the following figure:
Figure notes:
- The vertical axis (Y) of the heat map represents the Regions in the cluster, spanning all databases and tables in the TiDB cluster. The horizontal axis (X) represents time.
- The darker the color (cold), the lower the read/write traffic of the Region. The brighter the color (hot), the higher the read/write traffic.
Users can also choose to display only read traffic or only write traffic. In the figure above, for example, six distinct bright lines appear in the lower half, indicating about six Regions (or groups of adjacent Regions) with very high read/write traffic at any given time. Hovering the mouse pointer over a bright line shows which database and table the high-traffic Region belongs to.
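The idea of a Region-by-time heat map can be sketched in Python as follows. This is not Key Visualizer's implementation: the sample format `(time_bucket, region, bytes)`, the linear brightness scale, and the character palette are all assumptions made for the sake of a small runnable example.

```python
def build_heatmap(samples, n_regions, n_buckets):
    """Accumulate (time_bucket, region, nbytes) samples into a Region x time grid."""
    grid = [[0] * n_buckets for _ in range(n_regions)]
    for bucket, region, nbytes in samples:
        grid[region][bucket] += nbytes
    return grid

def render(grid, shades=" .:*#"):
    """Draw one text row per Region; brighter characters mean more traffic."""
    peak = max(max(row) for row in grid) or 1
    top = len(shades) - 1
    return "\n".join("".join(shades[v * top // peak] for v in row) for row in grid)

# Made-up traffic: Region 2 is hot the whole time, Region 1 has one short burst.
samples = [
    (0, 0, 25), (1, 0, 25), (2, 0, 25),
    (1, 1, 50),
    (0, 2, 100), (1, 2, 100), (2, 2, 100),
]
grid = build_heatmap(samples, n_regions=3, n_buckets=3)
print(render(grid))
```

Each printed row is one Region and each column is one time bucket, so the row of `#` characters plays the role of a bright horizontal line in the real heat map.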
Interpreting common heat map patterns
1. Even: the desired result
As shown in the figure, a uniform color, or an even mix of dark and light, indicates that reads or writes are spread evenly over time and across Regions, so access pressure is shared by all machines. This kind of load suits a distributed database best and is what we most want to see.
2. Alternating light and dark along the X-axis: watch resources at peak times
As shown in the figure, the heat map alternates between light and dark along the X-axis (time) but is fairly uniform along the Y-axis (Region), which indicates that the read or write load changes periodically. This can happen with periodic scheduled tasks, for example a big data platform extracting data from TiDB at a fixed time every day. In this case, check whether resources are sufficient during the peak periods.
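If the heat map data were available as a plain grid (one row per Region, one column per time bucket), the peak periods could be picked out with a sketch like the one below. The grid layout, the function name, and the 80% threshold are assumptions for illustration only.

```python
def peak_buckets(grid, threshold=0.8):
    """Time buckets whose cluster-wide traffic is at least `threshold` times the busiest bucket."""
    totals = [sum(column) for column in zip(*grid)]
    busiest = max(totals)
    if busiest == 0:
        return []
    return [i for i, total in enumerate(totals) if total >= threshold * busiest]

# Made-up periodic load: every Region is busy in buckets 1 and 3 and quiet otherwise.
grid = [
    [5, 50, 5, 50],
    [5, 50, 5, 50],
    [5, 40, 5, 60],
]
print(peak_buckets(grid))
```

For this grid the per-bucket totals are 15, 140, 15, and 160, so buckets 1 and 3 are reported as peaks, which are the periods where resource headroom is worth checking.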
3. Alternating light and dark along the Y-axis: watch how concentrated the hot spots are
As shown in the figure, the heat map contains several bright stripes. Along the Y-axis the areas around the stripes are dark, which indicates that the Regions in the bright stripes have high read/write traffic. Check from the business perspective whether this is expected. For example, if every service is tied to the user table, the overall traffic of the user table will be high, and a bright area for it in the heat map makes sense. Note that TiKV has its own Region-based hot spot balancing mechanism, so the more Regions a hot spot covers, the better the traffic can be balanced across all TiKV nodes. In other words, thicker and more numerous bright stripes mean the hot spots are more scattered and more TiKV nodes are being put to use. Thinner and fewer bright stripes mean the hot spots are more concentrated, the hot TiKV node stands out more, and the DBA needs to pay closer attention and possibly intervene.
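One way to put a number on "thick" versus "thin" stripes is to count how many Regions are needed to account for most of the traffic. The function below is a hypothetical helper, not something Key Visualizer exposes, and the 80% share is an arbitrary choice.

```python
def regions_for_share(region_traffic, share=0.8):
    """Smallest number of Regions that together carry at least `share` of total traffic."""
    total = sum(region_traffic)
    if total == 0:
        return 0
    running = 0
    ranked = sorted(region_traffic, reverse=True)
    for count, traffic in enumerate(ranked, start=1):
        running += traffic
        if running >= share * total:
            return count
    return len(region_traffic)

# Made-up per-Region traffic for one time bucket.
thin_stripe = [1, 1, 96, 1, 1]                    # one very hot Region
thick_stripe = [5, 15, 15, 15, 15, 15, 15, 5]     # heat spread over six Regions

print(regions_for_share(thin_stripe), regions_for_share(thick_stripe))
```

In the thin case a single Region carries more than 80% of the traffic, so little can be balanced away; in the thick case six Regions share it, which gives the balancing mechanism more to work with.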
4. Sudden brightness: watch for sudden increases in read/write requests
As shown in the figure, some areas of the heat map suddenly change from dark to bright, which indicates that traffic to those Regions rose sharply within a short time, for example because of a trending topic on a microblog platform or a flash sale. Depending on the business, the DBA should check whether the change in traffic is expected and assess whether system resources are sufficient. As with the third pattern, the thickness of the bright area along the Y-axis matters a great deal. A very thin bright area means that a large amount of traffic appeared within a short time and is concentrated on a small number of TiKV nodes, which calls for the DBA's close attention.
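A sudden dark-to-bright change in one Region's traffic series can be flagged with a simple baseline comparison. The sketch below is one possible heuristic under assumed parameters (a 3-bucket window, a 5x jump factor, and a floor of 1.0 to avoid dividing attention over near-zero baselines); it is not how Key Visualizer itself works.

```python
def sudden_jumps(series, factor=5.0, window=3, floor=1.0):
    """Indices where traffic exceeds `factor` times the mean of the previous `window` buckets."""
    jumps = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        # `floor` keeps an all-zero baseline from flagging every tiny value.
        if series[i] > factor * max(baseline, floor):
            jumps.append(i)
    return jumps

# Made-up traffic for one Region: quiet, then a burst starting at bucket 4.
series = [10, 12, 11, 10, 90, 95, 92, 11]
print(sudden_jumps(series))
```

Only bucket 4 is reported: it is the moment the traffic jumps, while the following buckets are already measured against a raised baseline.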
5. Bright diagonal line: pay attention to the business access pattern
As shown in the figure, the heat map shows bright diagonal lines, which indicates that the Regions being read or written are contiguous and shift over time. This often occurs during data import or during scans that use an index, for example continuous writes to a table with an auto-increment ID. The Regions in the bright part of the figure are read/write hot spots and can cause performance problems for the whole cluster. In this case, the business may need to redesign the primary key so that writes are scattered as much as possible and the pressure is spread across multiple Regions, or schedule the task for off-peak periods.
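The effect of scattering the primary key can be illustrated with a toy model. The Region boundaries and the multiplicative `scatter` function below are assumptions chosen so the arithmetic is easy to follow; they are not the mechanism TiDB uses.

```python
import bisect

# Hypothetical boundaries of four Regions over the key space 0..999.
SPLIT_KEYS = [250, 500, 750]

def region_of(key):
    """Index of the Region whose key range contains `key`."""
    return bisect.bisect_right(SPLIT_KEYS, key)

def scatter(row_id, key_space=1000, multiplier=617):
    """Toy scatter: map consecutive IDs to far-apart keys (617 is coprime with 1000)."""
    return (row_id * multiplier) % key_space

sequential_ids = list(range(900, 908))
plain = sorted({region_of(i) for i in sequential_ids})
spread = sorted({region_of(scatter(i)) for i in sequential_ids})
print(plain, spread)
```

Written as-is, the eight consecutive IDs all fall into Region 3; after scattering they touch all four Regions. The trade-off, at least in this model, is that rows with neighbouring IDs are no longer stored next to each other.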
Note that only a few common heat map patterns are listed here. Key Visualizer actually displays the heat map of every database and table in the cluster, so you are likely to see different patterns in different areas, or a mixture of several patterns. Interpret them flexibly according to the actual situation.
How to resolve hot spots
Both the old “looking, listening, asking, and pulse-taking” approach and today's Key Visualizer serve to find the hot spot “culprit”. Once the culprit is found, it can be dealt with to improve the overall performance and health of the cluster. TiDB has many built-in features that help alleviate common hot spot problems. Space does not allow us to go into detail here; interested readers can read the article “Common Hot Issues of High Concurrency Writing and How to Avoid Them”.
Practical cases
After that long pitch, let's look at a practical example to get a feel for what Key Visualizer can do. Our developers often use scores from standard benchmarks to judge performance improvements in TiDB and TiKV. With Key Visualizer, we recently found a problem in the SQL written by a performance test program, as shown in the following figure:
This is the read heat map of a TPC-C test on TiDB. Suppose that this is a real business workload that we are tuning. The left half of the figure is the data import phase of the benchmark, and the right half is the performance test phase.
As the figure shows, during the performance test phase (right half) the bMSQL_NEW_ORDER table had significantly more traffic than any other table. The bright band is tall, meaning the hot table spans many Regions, so the load should spread well across the TiKV nodes. Even so, by design this table should not receive such a large volume of read traffic, so the pattern is unreasonable.
We therefore analyzed the SQL statements that touch this table and found that the test program contained redundant SQL that repeatedly read data from it. We then made some improvements to the optimizer at the database level to improve performance in this case.
Other Application Scenarios
In addition to the scenarios above, Key Visualizer can also help in the following scenarios:
1. Discover changes in business load
The business load on a database often changes over time, for example through a gradual shift in what users need or care about. Key Visualizer lets you observe the load at a fine granularity, and by comparing it with historical load you can spot trends in time and get ahead of them.
2. Check business health
Many users have gradually moved their application architecture from a monolithic system to microservices. As the architecture changes, call chains become more complex and the whole system becomes harder to monitor. The database is the last, and often the most important, link in those call chains. Viewing the history of database load in Key Visualizer helps you judge the health of the business and discover anomalies in time.
3. Promotion rehearsal
Online competition is increasingly fierce, with invented shopping festivals and promotions almost every week, so preventing failures is an essential part of a DBA's work. With the heat map from Key Visualizer, a promotion can be rehearsed in advance: you can get an intuitive, qualitative view of the business behavior at a lower level and learn ahead of time which traffic pattern corresponds to the simulated scenario. When a similar pattern later appears in production, it can be handled calmly, reducing the likelihood of failure.
Try it early
For now, users who want to try Key Visualizer can start the master version of PD (or set tidb_version to latest for an Ansible deployment) and open the following address in a browser:
http://PD_ADDRESS:2379/dashboard
Note: If you have changed PD's default port, replace the port in the address above with your own.
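For scripts that need to print or open the address, a small helper along these lines may be convenient. The function name and the example host are hypothetical; only the URL shape and the default port 2379 come from the address above.

```python
def dashboard_url(pd_address, port=2379):
    """Build the TiDB Dashboard address for a PD host and port."""
    return f"http://{pd_address}:{port}/dashboard"

# Example with a made-up PD host, first on the default port, then on a custom one.
print(dashboard_url("127.0.0.1"))
print(dashboard_url("127.0.0.1", port=12379))
```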
In addition to the Key Visualizer, TiDB Dashboard also includes more diagnostic features that we’ll cover in a future series of articles.