This article is reprinted from the public account “White Noise OG”.
After last week’s long launch cycle, I finally found time to write up what happened along the way. TiDB is an excellent domestic distributed NewSQL database; thanks to its horizontal scalability, strong consistency, and high availability, it has been used in the accounting and payment core systems of domestic banks since March 2018.
Around the middle of the year, the construction of the bank’s key systems entered the final sprint toward production, and this time several of them went live on TiDB. To optimize the allocation of cluster resources, we carried out the subject of this post: scaling in the online system’s TiKV nodes and migrating their regions.
Scaling out a TiDB database is explained in detail in the official documentation (pingcap.com/docs-cn/op-…) and has been discussed widely, but hands-on experience with scaling in and migrating regions in a banking transaction system is rarely shared, which is one purpose of this article.
Now to the topic. First, the environment: the server cluster uses an NVMe+SSD storage scheme to run 16 TiKV instances. For an important core payment system, a two-site, three-data-center deployment with five replicas is indispensable, and each TiKV holds 8K+ regions. The whole migration took 5 hours, during which the system never stopped serving external traffic; it went very smoothly.
Let’s look at the migration process:
(1) TiKV uses the Raft consensus algorithm to guarantee strong consistency across replicas. The migration is essentially the reverse of scaling out: after the TiKVs to be taken offline are labeled, their regions are moved to the TiKVs that will be retained.
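The labeling step can be sketched as a toy placement function (a hypothetical model for illustration, not PD’s actual scheduler or API): stores marked offline are excluded as targets, and new replicas for a region go to retained stores that do not already hold one.

```python
# Toy sketch of label-driven replica placement (hypothetical model,
# not PD's real scheduler): stores marked "offline" are excluded,
# and new replica targets are chosen among retained stores that do
# not already hold a replica of the region.

def pick_targets(stores, region_replicas, needed):
    """stores: {store_id: {"offline": bool}}; returns target store ids."""
    candidates = [
        sid for sid, meta in stores.items()
        if not meta["offline"] and sid not in region_replicas
    ]
    return sorted(candidates)[:needed]

stores = {
    1: {"offline": False}, 2: {"offline": False}, 3: {"offline": False},
    4: {"offline": False}, 5: {"offline": False},
    6: {"offline": True},  7: {"offline": True},   # TiKVs being scaled in
}
# Region 1 currently has replicas on stores 3, 6, 7 (two on offline stores).
targets = pick_targets(stores, region_replicas={3, 6, 7}, needed=2)
print(targets)  # -> [1, 2]: stores chosen to receive the new replicas
```

In the real cluster this decision is made by PD based on store labels and placement rules; the point here is only that offline-labeled stores are removed from the candidate set.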
(2) Now focus on the Raft group of Region 1 and move its replicas (the handling is the same for every region). The region’s data is copied to the retained TiKVs; since the new replicas’ data is still incomplete, they join the Raft group as Learners.
(3) After the Learners are created, PD initiates an election within the Raft group (5 original replicas + 2 Learners):
- The election adds a label constraint to ensure that the leader ends up on a retained TiKV;
- Because Learners have no voting rights, the election effectively involves only the 5 voting replicas, so the majority (N+1)/2 is still 3.
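The quorum arithmetic in the second bullet can be checked in a few lines (a sketch of the rule, not TiKV code): Learners replicate the log but are excluded from the voter count, so with 5 voters and 2 Learners the majority is still 3.

```python
# Raft quorum sketch: only voters count toward the majority;
# Learners replicate the log but have no vote.

def majority(num_voters):
    # N // 2 + 1; for odd N this equals the (N + 1) / 2 in the text.
    return num_voters // 2 + 1

voters, learners = 5, 2
quorum = majority(voters)          # Learners deliberately excluded
print(voters + learners, quorum)   # 7 replicas in total, but quorum is 3
```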
(4) A new leader is elected. Once the data on the two new replicas has caught up, they are promoted to voters, and the replicas on the offline TiKVs are deleted.
(5) This leaves us with a new 5-replica Raft group.
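Steps (2)–(5) can be tied together in a small simulation (a hypothetical model for illustration; in reality TiKV and PD drive this through Raft configuration changes): add Learners, let them catch up to the leader’s log, promote them to voters, then drop the replicas on the offline stores.

```python
# Toy end-to-end sketch of the scale-in flow for one region
# (hypothetical model; real TiKV/PD do this via raft conf changes).

class Replica:
    def __init__(self, store, role):
        self.store, self.role, self.log_index = store, role, 0

def migrate(region, leader_index, offline_stores, new_stores):
    # (2) add the incomplete new replicas as Learners
    region += [Replica(s, "learner") for s in new_stores]
    # (3) Learners catch up by replicating the leader's log
    for r in region:
        r.log_index = leader_index
    # (4) promote caught-up Learners to voters...
    for r in region:
        if r.role == "learner" and r.log_index == leader_index:
            r.role = "voter"
    # ...and delete the replicas on the offline TiKVs
    return [r for r in region if r.store not in offline_stores]

old = [Replica(s, "voter") for s in (1, 2, 3, 6, 7)]
new_group = migrate(old, leader_index=100,
                    offline_stores={6, 7}, new_stores=(4, 5))
print(sorted(r.store for r in new_group))  # [1, 2, 3, 4, 5]
print({r.role for r in new_group})         # {'voter'}
```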
Here are a few more points:
1. Disk I/O greatly affects migration efficiency; our test environment used ordinary SAS disks.
2. Steps (2), (3), and (4) are not atomic operations, and of course the Learner’s own data is not yet consistent, but Raft guarantees eventual consistency. After checking with PingCAP’s developers, improvements on these points will be added in later releases.
3. To me, the most interesting and meaningful point is the introduction of the Learner, a very clever design for migration that resolves the awkward position of a not-yet-consistent replica during elections. The Learner is also an important role in the Multi-Raft protocol; the HTAP engines TiFlash & TiSpark introduce columnar replicas in the same way. Looking forward to TiDB 3.0.
PS: Cloud TiDB, the highlight of this launch, is now running smoothly, and I hope to find a chance to summarize and share that as well. Since going live, TiDB has carried out many important changes without ever suspending the system’s external service. From a developer’s perspective, TiDB has clearly invested a lot in the direction of financial-grade NewSQL databases.
Finally, thanks to PingCAP’s Gin and the R&D gurus for their support, and to the ops folks for their hard work until 4 a.m.