Introduction to sync-diff-inspector

In data synchronization scenarios, verifying the consistency of upstream and downstream data is critical; missing or incorrect verification can lead to bad business decisions. Sync-diff-inspector is a consistency verification tool developed by the Data Platform team. It can verify upstream/downstream data consistency in a variety of data synchronization scenarios, such as from multiple sources to a single target (for example, from sharded MySQL databases and tables to TiDB) or from a single source to a single target (for example, from one TiDB table to another TiDB table). During data verification, both efficiency and correctness are crucial. First, let's look at the architecture diagram of sync-diff-inspector to get a general understanding of its functions and implementation principles.

Sync-diff-inspector 2.0 architecture diagram

Why sync-diff-inspector 2.0?

In version 1.0, users reported several issues, including:

  • During consistency verification of large tables, TiDB ran out of memory.
  • FLOAT columns could not be verified.
  • The output was not user-friendly; verification results needed to be simplified.
  • GC ran during verification, causing the verification to fail.

The reasons for the above problems are related to the implementation of the original version:

  • A single thread divided the table into chunks, and comparison started only after the entire table had been divided, which left TiKV underutilized during that period.
  • The checkpoint feature wrote the status of every checked chunk to the database, so database write I/O became a bottleneck of the checking process.
  • When the checksums of a chunk range differed, the rows were compared one by one, consuming a large amount of I/O resources.
  • There was no adaptive GC, so the snapshot being verified could be garbage-collected, causing the verification to fail.

What's new in sync-diff-inspector 2.0

Chunk division

Whether two tables contain the same data can be checked by comparing their checksums, but finding which rows differ requires row-by-row comparison. To reduce the number of rows that must be compared one by one when checksums do not match, sync-diff-inspector takes a middle path: it splits the table into chunks in index order and compares the upstream and downstream data chunk by chunk.

Chunk division follows the earlier approach: TiDB statistics split the table into buckets whose ranges are defined by the index, and the buckets are then merged or further split according to the configured chunk size. When a bucket is split, random rows are selected as the split points.
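A minimal sketch of this merge-or-split planning in Go is shown below. The Bucket fields and the planChunks helper are illustrative stand-ins, not the tool's actual types:

    // Bucket is an illustrative stand-in for the bucket ranges that TiDB
    // statistics provide for an index.
    type Bucket struct {
        Lower string // lower bound of the bucket (index values)
        Upper string // upper bound of the bucket (index values)
        Count int64  // approximate number of rows in the bucket
    }

    // planChunks merges adjacent buckets until the accumulated row count
    // reaches chunkSize. A real implementation would also split a bucket
    // that is much larger than chunkSize by picking random rows inside it
    // as extra split points.
    func planChunks(buckets []Bucket, chunkSize int64) [][2]string {
        var chunks [][2]string
        var lower string
        var acc int64
        for _, b := range buckets {
            if acc == 0 {
                lower = b.Lower
            }
            acc += b.Count
            if acc >= chunkSize {
                chunks = append(chunks, [2]string{lower, b.Upper})
                acc = 0
            }
        }
        if acc > 0 {
            // Remaining rows form the final, smaller chunk.
            chunks = append(chunks, [2]string{lower, buckets[len(buckets)-1].Upper})
        }
        return chunks
    }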

In the original sync-diff-inspector, a single thread divided the table into chunks, and comparison began only after the whole table had been divided. Version 2.0 divides chunks asynchronously to improve resource utilization during this period. Two situations still reduce resource utilization, and each is handled as follows:

  1. During chunk division, the configured chunk size may be smaller than a bucket, so the bucket has to be split into several chunks. This is relatively slow, so the consumer side, that is, the chunk comparison threads, would otherwise sit idle and resource utilization would drop. Two measures are taken: multiple buckets are split asynchronously to improve utilization, and because some tables have no bucket information and can only be treated as a single bucket, multiple tables are divided at the same time to increase the number of buckets being split asynchronously.
  2. Chunk division itself consumes resources, and dividing chunks too fast can slow down chunk comparison to some extent, so the consumer side uses a channel to throttle chunk division across the tables.

In summary, the optimized sync-diff-inspector divides chunks for several tables at the same time. As shown in the figure below, three chunk_iters are created, each dividing one table, and the progress of chunk division is regulated through a global channel. Note that throttling is applied only at the table level: once a chunk_iter starts, it divides all of its chunks asynchronously. A chunk_iter blocks when the global channel's buffer is full, and once all of its chunks have entered the buffer, the next table starts being divided.
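The throttling pattern can be sketched in Go as below. This is a simplified illustration rather than the tool's actual code: the Chunk type, the table list, and the chunk counts are made up, and the real implementation coordinates buckets and checkpointing as well. The key point is that a bounded channel blocks producers when comparison falls behind:

    // Simplified sketch of channel-based throttling between chunk division
    // (producers) and chunk comparison (the consumer).
    package main

    import (
        "fmt"
        "sync"
    )

    type Chunk struct {
        Table string
        Range [2]int // illustrative chunk boundary
    }

    func main() {
        // A bounded buffer: when it is full, producers block, which naturally
        // limits how far chunk division can run ahead of comparison.
        chunkCh := make(chan Chunk, 64)

        tables := []string{"t1", "t2", "t3"}
        var wg sync.WaitGroup
        for _, tbl := range tables {
            wg.Add(1)
            go func(tbl string) {
                defer wg.Done()
                // Each "chunk_iter" divides its table asynchronously; pushing a
                // chunk blocks whenever the global buffer is full.
                for i := 0; i < 100; i++ {
                    chunkCh <- Chunk{Table: tbl, Range: [2]int{i * 1000, (i + 1) * 1000}}
                }
            }(tbl)
        }
        go func() { wg.Wait(); close(chunkCh) }()

        // Consumer: compare chunks as they become available.
        for c := range chunkCh {
            _ = c // a real consumer would run the checksum comparison here
        }
        fmt.Println("all chunks consumed")
    }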

Checkpoint and repair SQL

Sync-diff-inspector supports resuming a check from a breakpoint. The diff process records a breakpoint every ten seconds; if the verifier exits abnormally, running sync-diff-inspector again continues the verification from the most recently saved breakpoint. If the configuration file has changed since the last run, sync-diff-inspector discards the breakpoint information and verifies everything again.

The completeness and correctness of this feature depend on the global ordering and continuity defined during chunk division. Compared with the original checkpoint, sync-diff-inspector 2.0 no longer records the status of every chunk; it only records the most recently checked chunk in the run of consecutively completed chunks, which greatly reduces the amount of data to record. A chunk's global order is captured by a structure containing the chunk's table, the bucket(s) of that table it comes from (if the chunk is formed by merging two or more buckets, the first and last buckets are recorded), the total number of chunks those buckets were divided into, and this chunk's index among them. This structure also makes it possible to determine whether two chunks are contiguous. Each time the breakpoint timer fires, the last chunk of the consecutive run of compared chunks is selected as the checkpoint and its information is written to a local file.
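A minimal sketch of such a globally ordered chunk identifier follows. The field and method names are illustrative rather than the tool's exact types; the point is that the tuple (table, bucket range, chunk count, chunk index) defines a total order and lets us test adjacency:

    // ChunkID identifies a chunk's position in the global division order.
    type ChunkID struct {
        TableIndex       int // which table the chunk belongs to
        BucketIndexLeft  int // first bucket covered by the chunk
        BucketIndexRight int // last bucket covered by the chunk
        ChunkIndex       int // this chunk's index within the bucket range
        ChunkCnt         int // how many chunks the bucket range was split into
    }

    // IsAdjacent reports whether next immediately follows c within the same table.
    // (Cross-table adjacency is handled separately in the real tool.)
    func (c ChunkID) IsAdjacent(next ChunkID) bool {
        if c.TableIndex != next.TableIndex {
            return false
        }
        if c.BucketIndexLeft == next.BucketIndexLeft {
            // Same bucket range: chunk indices must be consecutive.
            return next.ChunkIndex == c.ChunkIndex+1
        }
        // Crossing a bucket boundary: c must be the last chunk of its range and
        // next must be the first chunk of the range that starts right after it.
        return c.ChunkIndex == c.ChunkCnt-1 &&
            next.BucketIndexLeft == c.BucketIndexRight+1 &&
            next.ChunkIndex == 0
    }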

When differing rows are found, sync-diff-inspector generates repair SQL and saves it to a local file. Because chunks are checked in parallel and out of order, a separate file is created for each chunk that contains differences, and the file is named after the chunk's globally ordered structure. Repair SQL and checkpoint records must be written in a fixed order:

  1. If the repair SQL record is written first and the program exits unexpectedly, the chunk whose repair SQL was written but not yet covered by a checkpoint will be generated again on the next run. Normally its repair SQL file is simply overwritten. However, because buckets are split at randomly chosen rows, the number of chunks a bucket produces is fixed but their boundaries are not: differing rows that fell into, say, the third chunk of a bucket last time may fall within the range of the fourth chunk this time, leaving duplicate repair SQL behind.
  2. If the checkpoint is written first and the program exits abnormally, the next check starts from the chunk after the one recorded by the checkpoint. If that chunk found differences but its repair SQL was never written, the repair SQL information is lost.

Sync-diff-inspector therefore writes the repair SQL record first. On the next run, all repair SQL files belonging to chunks after the checkpointed chunk (the files are named after the chunk's globally ordered structure, so the order of any two chunks is easy to determine) are moved into a trash folder, which avoids duplicate repair SQL.
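A rough sketch of this restart cleanup, assuming repair SQL files are named by a sortable chunk identifier; chunkIDFromFileName is a hypothetical parser for the file-name encoding, and the directory names are placeholders:

    import (
        "os"
        "path/filepath"
    )

    // after reports whether chunk a comes after chunk b in the global order
    // (table index first, then bucket range, then chunk index).
    func after(a, b ChunkID) bool {
        if a.TableIndex != b.TableIndex {
            return a.TableIndex > b.TableIndex
        }
        if a.BucketIndexLeft != b.BucketIndexLeft {
            return a.BucketIndexLeft > b.BucketIndexLeft
        }
        return a.ChunkIndex > b.ChunkIndex
    }

    // moveStaleFixSQL moves repair SQL files that belong to chunks after the
    // checkpointed chunk into a trash directory, so that re-generated chunks
    // with different random boundaries cannot leave duplicate repair SQL behind.
    func moveStaleFixSQL(fixSQLDir, trashDir string, checkpoint ChunkID) error {
        entries, err := os.ReadDir(fixSQLDir)
        if err != nil {
            return err
        }
        if err := os.MkdirAll(trashDir, 0o755); err != nil {
            return err
        }
        for _, e := range entries {
            id, ok := chunkIDFromFileName(e.Name()) // hypothetical helper
            if !ok {
                continue
            }
            if after(id, checkpoint) {
                src := filepath.Join(fixSQLDir, e.Name())
                dst := filepath.Join(trashDir, e.Name())
                if err := os.Rename(src, dst); err != nil {
                    return err
                }
            }
        }
        return nil
    }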

Binary checksum and adaptive chunkSize

Checksumming a large table chunk by chunk carries extra overhead for each checksum (including the time to establish a session and transmit the query). If the chunks are small, this overhead accounts for a larger share of each checksum's time, so the predetermined chunkSize generally needs to be set larger. However, if the chunkSize is too large and a chunk's upstream and downstream checksums differ, comparing that large chunk directly row by row is very expensive.

During data synchronization, usually only a small amount of data becomes inconsistent. Based on this assumption, when a chunk's upstream and downstream checksums are found to differ during verification, the chunk can be split by dichotomy into two sub-chunks of similar size, and their checksums compared to further narrow down the range that may contain the inconsistent rows. This optimization works because a checksum comparison consumes far less time and memory than a row-by-row comparison; by repeatedly narrowing the possible range with checksums, the number of rows that need row-by-row comparison shrinks, comparison speeds up, and memory consumption drops. In addition, each checksum calculation is equivalent to traversing its sub-chunk once, and since the inconsistency is normally confined to one branch, the extra work at each level forms a geometric series (1 + 1/2 + 1/4 + ...); ignoring per-query overhead, the binary check therefore costs roughly the equivalent of only two extra checksums over the original chunk.
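The idea can be sketched as the following recursion. checksumEqual, rowCount, splitInHalf, and compareRows are hypothetical helpers standing in for the real checksum query, row count, LIMIT-based midpoint lookup, and row-by-row comparison:

    // binaryCheck narrows a mismatching chunk down by repeatedly splitting it in
    // half and comparing checksums, falling back to row comparison for small or
    // doubly-inconsistent chunks.
    func binaryCheck(c Chunk, minRows int64) error {
        if checksumEqual(c) { // hypothetical: upstream vs. downstream checksum
            return nil
        }
        if rowCount(c) <= minRows { // hypothetical row counter
            return compareRows(c) // hypothetical row-by-row comparison
        }
        left, right := splitInHalf(c) // hypothetical LIMIT-based midpoint split
        leftEqual, rightEqual := checksumEqual(left), checksumEqual(right)
        if !leftEqual && !rightEqual {
            // Both halves differ: further splitting is unlikely to help, so
            // compare the whole chunk row by row (the pessimistic fallback).
            return compareRows(c)
        }
        if !leftEqual {
            return binaryCheck(left, minRows)
        }
        return binaryCheck(right, minRows)
    }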

Since computing a checksum is equivalent to traversing every row in the range, the number of rows in the range can be counted in the same query as the CRC32-based checksum. With the row count of every chunk known, the LIMIT syntax can be used to locate the index value of the row in the middle of the chunk, which is the prerequisite of the binary split.
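As an illustration, the checksum query can return the row count in the same table scan. The query shape below (COUNT plus a BIT_XOR of CRC32 over concatenated columns) is an assumption about the general form of such a check, not the tool's exact SQL, which also handles NULLs and collation details:

    import (
        "fmt"
        "strings"
    )

    // buildChecksumSQL builds a query that returns both the row count and a
    // CRC32-based checksum for one chunk range in a single table scan.
    func buildChecksumSQL(table string, columns []string, where string) string {
        concat := strings.Join(columns, ", ")
        return fmt.Sprintf(
            "SELECT COUNT(*) AS cnt, "+
                "BIT_XOR(CAST(CRC32(CONCAT_WS(',', %s)) AS UNSIGNED)) AS checksum "+
                "FROM %s WHERE %s",
            concat, table, where)
    }

    // buildMidpointSQL locates the index values of the chunk's middle row using
    // LIMIT, which is what makes splitting the chunk in half possible.
    func buildMidpointSQL(table, indexColumns, where string, rowCount int64) string {
        return fmt.Sprintf(
            "SELECT %s FROM %s WHERE %s ORDER BY %s LIMIT 1 OFFSET %d",
            indexColumns, table, where, indexColumns, rowCount/2)
    }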

However, the chunkSize must not be set too large either. When both sub-chunks produced by a split contain inconsistent data, the splitting stops and row-by-row comparison begins, and a very large chunk is more likely to contain multiple inconsistent rows at once, which reduces the benefit of the binary check. The default chunkSize is 50,000 rows per chunk, and each table is divided into at most 10,000 chunks.

Index processing

Tables in the upstream and downstream databases may have different schemas; for example, a downstream table may have only some of the upstream indexes. Choosing the wrong index can make the check very time-consuming. During table structure verification, only indexes that exist both upstream and downstream are kept (if there are none, all indexes are kept). On the other hand, some indexes are built on non-unique columns, so many rows may share the same index value and the chunks may be divided unevenly. When selecting an index, sync-diff-inspector prefers the primary key or a unique index, and otherwise picks the index with the lowest repetition rate.
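A simplified version of this selection rule could look like the sketch below; IndexInfo and its fields are hypothetical stand-ins for the real schema metadata:

    // IndexInfo is a hypothetical summary of one index on a table.
    type IndexInfo struct {
        Name        string
        IsPrimary   bool
        IsUnique    bool
        Selectivity float64 // estimated distinct values / total rows
    }

    // pickIndex prefers the primary key, then any unique index, then the index
    // with the lowest repetition rate (i.e. the highest selectivity).
    func pickIndex(indexes []IndexInfo) *IndexInfo {
        var best *IndexInfo
        for i := range indexes {
            idx := &indexes[i]
            switch {
            case idx.IsPrimary:
                // Primary key wins outright.
                return idx
            case idx.IsUnique && (best == nil || !best.IsUnique):
                best = idx
            case best == nil || (!best.IsUnique && idx.Selectivity > best.Selectivity):
                best = idx
            }
        }
        return best
    }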

Where processing

Suppose a table t is defined as follows:

create table t (a int, b int, c int, primary key (a, b, c));

And suppose the range of one chunk is ((1, 2, 3), (1, 2, 4)].

The original sync-diff-inspector generates the following WHERE conditions, joined by AND:

  • ((a > 1) OR (a = 1 AND b > 2) OR (a = 1 AND b = 2 AND c > 3))
  • ((a < 1) OR (a = 1 AND b < 2) OR (a = 1 AND b = 2 AND c <= 4))

The new sync-diff-inspector notices that the first two columns of both boundaries are equal and generates a much simpler condition:

(a = 1) AND (b = 2) AND (c > 3) AND (c <= 4)
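A sketch of this common-prefix optimization in Go follows. Column values are simplified to pre-formatted strings and the tail handling is reduced to the single-column case; the function name is illustrative:

    import (
        "fmt"
        "strings"
    )

    // buildRangeWhere builds the WHERE condition for a chunk whose lower and
    // upper bounds are tuples over the index columns. Columns that share the
    // same value in both bounds are emitted as simple equalities.
    func buildRangeWhere(cols, lower, upper []string) string {
        var conds []string
        i := 0
        // Common prefix: lower[i] == upper[i] becomes "col = value".
        for ; i < len(cols) && lower[i] == upper[i]; i++ {
            conds = append(conds, fmt.Sprintf("(%s = %s)", cols[i], lower[i]))
        }
        if i < len(cols) {
            // Remaining columns keep the range form; a full implementation would
            // expand these into the OR chains shown above for multi-column tails.
            conds = append(conds, fmt.Sprintf("(%s > %s)", cols[i], lower[i]))
            conds = append(conds, fmt.Sprintf("(%s <= %s)", cols[i], upper[i]))
        }
        return strings.Join(conds, " AND ")
    }

For the chunk above, buildRangeWhere([]string{"a", "b", "c"}, []string{"1", "2", "3"}, []string{"1", "2", "4"}) returns (a = 1) AND (b = 2) AND (c > 3) AND (c <= 4).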

Adaptive GC

In the original sync-diff-inspector, checks on a large number of tables could fail because of GC. Sync-diff-inspector now supports adaptive GC: it starts a background goroutine during diff initialization and keeps refreshing the GC safepoint TTL throughout the check, so that the snapshot being verified is not garbage-collected and the verification proceeds smoothly.
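The pattern can be sketched as a ticker-driven goroutine. The gcKeeper interface below is a placeholder for whatever client actually extends the service GC safepoint (in TiDB's ecosystem this goes through PD); the method name, TTL, and refresh interval are illustrative assumptions:

    import (
        "context"
        "time"
    )

    // gcKeeper is a stand-in for the client that extends the GC safepoint.
    type gcKeeper interface {
        UpdateServiceSafePoint(ctx context.Context, ttl time.Duration, snapshotTS uint64) error
    }

    // keepSnapshotAlive periodically refreshes the safepoint TTL so the snapshot
    // used for verification is not garbage-collected while the diff is running.
    func keepSnapshotAlive(ctx context.Context, keeper gcKeeper, snapshotTS uint64) {
        ticker := time.NewTicker(time.Minute) // refresh interval is illustrative
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return // diff finished or was cancelled; stop refreshing
            case <-ticker.C:
                // Errors are ignored here for brevity; a real tool would log them.
                _ = keeper.UpdateServiceSafePoint(ctx, 10*time.Minute, snapshotTS)
            }
        }
    }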

Handling float columns

Because of the nature of float, only six significant digits are reliable, so the checksum SQL includes float columns as round(%s, 5-floor(log10(abs(column)))). When the column's value is the special case 0, log10(abs(column)) is NULL and the expression evaluates to NULL, yet ISNULL(column), which is also part of the checksum string, is not true, so this case needs special handling.
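For illustration, the per-column expression could be assembled as below. This is a sketch: the zero-value special case is handled with a CASE expression here, which is an assumption about one possible fix rather than the tool's exact SQL:

    import "fmt"

    // floatColumnExpr returns the expression used for a FLOAT column inside the
    // checksum string. Floats only carry about six significant digits, so the
    // value is rounded relative to its magnitude; the zero value is special-cased
    // because log10(abs(0)) is NULL.
    func floatColumnExpr(col string) string {
        rounded := fmt.Sprintf("round(%s, 5-floor(log10(abs(%s))))", col, col)
        // Illustrative special case: emit 0 directly when the column value is 0.
        return fmt.Sprintf("CASE WHEN %s = 0 THEN 0 ELSE %s END", col, rounded)
    }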

User interaction optimization

Sync-diff-inspector now outputs the following information:

  • Logs are written to a log file.
  • A progress bar is displayed in the foreground, showing which tables are currently being compared.
  • The verification result of each table is recorded, including the overall comparison time, the amount of data compared, the average speed, each table's comparison result, and each table's configuration.
  • Generated repair SQL information.
  • Checkpoint information recorded at a fixed interval.

The effect is shown below:

See Overview for details

Performance improvement

Based on the above optimizations, we ran performance tests on a Sysbench dataset of 668.4 GB, containing 190 tables with 10 million rows each. The test results are as follows:

Sync-diff-inspector 2.0 is faster than the original version and uses less memory on TiDB.

Future

Open Architecture

Sync-diff-inspector defines a Source abstraction. It currently supports only TiDB-to-TiDB, MySQL-to-MySQL, and MySQL-to-TiDB data consistency checks, but in the future, by implementing the corresponding Source methods, data consistency verification can be performed against a variety of other databases, such as Oracle and Aurora.
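Conceptually, such an abstraction might look like the interface below. The method set here is a hypothetical sketch of what one side of the comparison needs to provide, not the tool's actual interface:

    import "context"

    // Source is a hypothetical sketch of the abstraction over one side of the
    // comparison (upstream or downstream). Implementing it for another database
    // would let the same diff logic verify that database.
    type Source interface {
        // GetCountAndChecksum returns the row count and checksum of one chunk.
        GetCountAndChecksum(ctx context.Context, chunkWhere string) (count int64, checksum uint64, err error)
        // GetRows returns the rows of one chunk for row-by-row comparison.
        GetRows(ctx context.Context, chunkWhere string) ([]map[string]string, error)
        // Close releases the underlying connections.
        Close() error
    }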

Support for more types

Sync-diff-inspector currently does not support certain column types, for example JSON, BIT, BINARY, and BLOB. They need special treatment in the checksum SQL statements; for example, for JSON columns the value of every key appearing in the JSON must be extracted via json_extract.
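As a rough illustration of what that treatment might look like for JSON, the sketch below flattens a known set of keys into scalar expressions. The key list and the expression shape are assumptions, not a committed design:

    import (
        "fmt"
        "strings"
    )

    // jsonColumnExpr sketches how a JSON column's keys could be flattened into
    // comparable scalar expressions for the checksum string.
    func jsonColumnExpr(col string, keys []string) string {
        parts := make([]string, 0, len(keys))
        for _, k := range keys {
            parts = append(parts, fmt.Sprintf("json_extract(%s, '$.%s')", col, k))
        }
        return strings.Join(parts, ", ")
    }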

A more aggressive binary checksum

The new sync-diff-inspector uses the binary checksum method to reduce the amount of data compared row by row, but it stops splitting and falls back to row-by-row comparison as soon as both halves of a chunk contain inconsistent data. This is a pessimistic strategy, assuming the chunk may contain many inconsistencies at that point, whereas sync-diff-inspector is usually used in scenarios with only a few inconsistent rows. A more aggressive approach would keep splitting until it produces an array of chunks that each contain inconsistent data and are close to the minimum number of rows (3,000 by default), and then compare only those chunks row by row.