After studying the earlier articles on raft-rs and RaftStore (such as Commit and Apply Scenario Analysis of Raft Propose and An Overview of RaftStore), you should be familiar with how raft-rs and RaftStore work. Where raft-rs handles a single Raft group (i.e. a single Region), RaftStore handles multiple Raft groups (i.e. multiple Regions). Split and Merge are operations unique to the Raft groups in RaftStore: in TiKV, Split divides one Region into multiple Regions, while Merge combines two Regions with adjacent ranges into one. This article walks through the source code of Split.

Region epoch

message RegionEpoch {
    // Conf change version, auto increment when add or remove peer
    uint64 conf_ver = 1;
    // Region version, auto increment when split or merge
    uint64 version = 2;
}

We will start with the Region epoch, whose protobuf definition is shown above. As mentioned in previous articles in this source reading series, it is essentially a pair of version numbers, updated by the following rules:

  1. When a configuration change (adding or removing a Peer) happens, conf_ver is incremented by 1.

  2. During a Split, the version of both the original Region and the new Regions equals the version of the original Region plus the number of newly created Regions.

  3. During a Merge, the version of both Regions equals the larger of the two Regions' versions plus 1.

Rules 2 and 3 lead to an interesting corollary: if two Regions have overlapping ranges, their order in history can be determined by comparing their versions. A larger version means a newer Region, and the two versions can never be equal.

The proof is simple: a Region's range only changes through Split and Merge, and each Split or Merge updates the version of every affected Region to a value larger than before. So for any given key range, no matter which Region owns it over time, the version of the owning Region is strictly monotonically increasing.
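
To make rules 2 and 3 and this corollary concrete, here is a minimal sketch in Rust; the struct mirrors the protobuf above, while the helper functions are purely illustrative and are not TiKV's actual code.

// Illustrative only: the version rules for RegionEpoch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct RegionEpoch {
    pub conf_ver: u64,
    pub version: u64,
}

// Rule 2: after a Split, the original Region and every new Region get
// version = original version + number of newly created Regions.
pub fn version_after_split(original: &RegionEpoch, new_region_count: u64) -> u64 {
    original.version + new_region_count
}

// Rule 3: after a Merge, both Regions get max(version_a, version_b) + 1.
pub fn version_after_merge(a: &RegionEpoch, b: &RegionEpoch) -> u64 {
    a.version.max(b.version) + 1
}

// Corollary: of two Regions with overlapping ranges, the one with the
// larger version is the newer one; the versions can never be equal.
pub fn is_newer(a: &RegionEpoch, b: &RegionEpoch) -> bool {
    a.version > b.version
}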

PD uses this rule to decide which of several Regions with overlapping ranges is the newest.

When a Proposal is proposed, the current Region epoch of the PeerFsm is attached to it, and the epoch's validity is checked again at apply time. If it is stale, the Proposal is skipped.

As shown in the figure above, the Region epoch attached to the new Proposal is the one observed after the Proposal at Applied Index was applied. If any Proposal between Applied Index + 1 and Last Index modifies the Region epoch, the new Proposal may be skipped at apply time.

The check is implemented in store::util::check_region_epoch; a Proposal is skipped if (a minimal sketch follows the list):

  1. For non-admin requests: the version in the Proposal differs from the current version.

  2. For admin requests such as Split and Merge: the Region epoch in the Proposal is not exactly equal to the current one.
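
Put together, the check amounts to something like the following sketch, reusing the RegionEpoch struct from the earlier sketch; the real store::util::check_region_epoch works on the full request and Region and returns an error rather than a bool.

// Illustrative only: whether a Proposal should be skipped at apply time.
pub fn epoch_is_stale(
    is_admin: bool,          // Split, Merge and other admin commands
    proposed: &RegionEpoch,  // epoch attached when the Proposal was proposed
    current: &RegionEpoch,   // epoch of the Region at apply time
) -> bool {
    if is_admin {
        // Admin commands such as Split and Merge require an exact match.
        proposed.version != current.version || proposed.conf_ver != current.conf_ver
    } else {
        // Normal requests only care about the range, so only the version matters.
        proposed.version != current.version
    }
}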

How a Split is triggered

A Split is generally triggered in one of two ways:

  1. Triggered by PD;

  2. Triggered by each Region in TiKV through its own periodic check.

The PD trigger is mainly used to specify exactly which keys to split at; the feature described in the Split Region usage document is implemented through the PD trigger.

Each Region runs a split check every split-region-check-tick-interval (default: 10s); the code is in PeerFsmDelegate::on_split_region_check_tick. The check is skipped in the following situations (a condensed sketch follows the list):

  • A check task is already in progress;

  • The amount of data written since the last check is smaller than the threshold;

  • A snapshot is being generated and the number of skipped checks has not yet reached a fixed limit. If Splits happen frequently while a snapshot is being generated, the snapshot may be discarded because its Region epoch no longer matches the current one; however, the check cannot be postponed forever, so an upper limit on the number of skips is set.
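
A condensed sketch of these skip conditions; the struct and field names here are made up for illustration and do not match the actual fields of PeerFsmDelegate.

// Illustrative only: whether the periodic split check should actually run.
pub struct SplitCheckTickState {
    pub check_in_progress: bool,     // a check task is already queued or running
    pub approximate_size_diff: u64,  // data written since the last check
    pub size_diff_threshold: u64,    // the configured threshold
    pub generating_snapshot: bool,   // a snapshot is currently being generated
    pub skipped_ticks: u64,          // ticks skipped because of the snapshot
    pub max_skipped_ticks: u64,      // fixed upper limit on such skips
}

pub fn should_run_split_check(s: &SplitCheckTickState) -> bool {
    if s.check_in_progress {
        return false;
    }
    if s.approximate_size_diff < s.size_diff_threshold {
        return false;
    }
    if s.generating_snapshot && s.skipped_ticks < s.max_skipped_ticks {
        // Splitting while a snapshot is being generated may invalidate the
        // snapshot (epoch mismatch), but the check cannot be postponed forever.
        return false;
    }
    true
}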

When a check is triggered, a task is sent to split_checker_worker. When the task runs, it calls Runner::check_split in split_checker.rs:

  1. Call coprocessor::new_split_checker_host to obtain a SplitCheckerHost. For every registered split_check_observer, add_checker is called; if the observer's trigger threshold is met, its split_check is added to SplitCheckerHost::checkers. If checkers ends up empty, the check ends here. (It is worth mentioning that the coprocessor here is not the coprocessor that handles computation push-down, but the one that observes RaftStore events and exposes event hooks to the outside world. It is a nice way to reduce the intrusion of external event observation into the RaftStore code.)

  2. Get the check policy from the SplitCheckerHost: the final policy is APPROXIMATE only if every split_check's policy is APPROXIMATE; otherwise it is SCAN.

  3. Get the Split key.

    A. If the policy is SCAN, scan_split_keys is called: it scans all the data of the largest Column Family within the Region's range, and for each KV pair calls the on_kv of every split_checker to compute the Split keys. After the scan finishes, the split_keys of each split_checker are traversed in turn and the first non-empty result is returned. Because it has to scan the stored data, this policy introduces extra I/O.

    B. If the policy is APPROXIMATE, the approximate_split_keys of each split_checker is called in turn and the first non-empty result is returned. This is implemented with RocksDB properties and introduces almost no extra I/O, making it the better-performing strategy.

  4. Send a CasualMessage::SplitRegion to the Region.

SplitCheckerHost only aggregates the results of the individual split_checks; the real logic lives in the split_checks themselves, which all implement the SplitChecker trait mentioned in the process above.

pub trait SplitChecker<E> {
    /// Hook to call for every kv scanned during split.
    ///
    /// Return true to abort scan early.
    fn on_kv(&mut self, _: &mut ObserverContext<'_>, _: &KeyEntry) -> bool {
        false
    }

    /// Get the desired split keys.
    fn split_keys(&mut self) -> Vec<Vec<u8>>;

    /// Get approximate split keys without scan.
    fn approximate_split_keys(&mut self, _: &Region, _: &E) -> Result<Vec<Vec<u8>>> {
        Ok(vec![])
    }

    /// Get split policy.
    fn policy(&self) -> CheckPolicy;
}
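
As a toy example of how a checker plugs into this trait, here is a simplified size-based checker in the spirit of the one in size.rs. It is only a sketch: the real checker tracks more state, and the KeyEntry accessors used here (key, entry_size) are assumed for illustration.

// A simplified size-based split checker (illustrative only).
struct SimpleSizeChecker {
    current_size: u64,        // bytes accumulated during the scan
    split_size: u64,          // size at which a split key is recorded
    batch_split_limit: usize, // stop scanning once this many keys are found
    split_keys: Vec<Vec<u8>>,
}

impl<E> SplitChecker<E> for SimpleSizeChecker {
    fn on_kv(&mut self, _: &mut ObserverContext<'_>, entry: &KeyEntry) -> bool {
        self.current_size += entry.entry_size() as u64;
        if self.current_size >= self.split_size {
            // Record a split key and start accumulating the next chunk.
            self.split_keys.push(entry.key().to_vec());
            self.current_size = 0;
        }
        // Returning true aborts the scan early.
        self.split_keys.len() >= self.batch_split_limit
    }

    fn split_keys(&mut self) -> Vec<Vec<u8>> {
        std::mem::take(&mut self.split_keys)
    }

    fn policy(&self) -> CheckPolicy {
        CheckPolicy::Scan
    }
}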

There are the following kinds of split_check:

  1. Check whether the total or approximate size of the Region exceeds the threshold; the code is in size.rs.

  2. Check whether the total or approximate number of keys in the Region exceeds the threshold; the code is in keys.rs.

  3. Split at the middle key of the Region's range; the code is in half.rs. Besides the PD-specified keys mentioned above, this is another check triggered by PD, and currently it can only be triggered manually through pd-ctl and tikv-ctl commands.

  4. Split based on the table prefix that the key belongs to; the code is in table.rs, and this check is disabled by default in the configuration.

Due to space constraints, specific implementation details can be seen in the code.

The Split implementation

The implementation of Split is relatively simple: in general, the Split operation is proposed as a Proposal that goes through Raft to reach consensus, and then each Peer executes the Split separately.

Let’s talk about the specific process.

Once a Split is triggered, the triggering party sends a CasualMessage::SplitRegion to the Region; the handling code is in PeerFsmDelegate::on_prepare_split_region. Besides checking whether it is the leader, the handler also checks whether the version has changed, and if so, the Split is rejected.

If the check passes, an RPC request is sent to PD to allocate some new ids, including the ids of all the new Regions and of all their Peers. After PD replies, a Proposal of type AdminCmdType::BatchSplit is constructed and proposed to the Peer. The code is in handle_ask_batch_split under pd_worker.
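
To show roughly what information a BatchSplit Proposal carries, here is a hypothetical, self-contained mirror of the relevant data; the real messages are BatchSplitRequest and SplitRequest in kvproto's raft_cmdpb, and the names below are illustrative only.

// Hypothetical mirror of the data carried by an AdminCmdType::BatchSplit command.
pub struct SplitRequestLite {
    pub split_key: Vec<u8>,     // where to cut the range
    pub new_region_id: u64,     // Region id allocated by PD
    pub new_peer_ids: Vec<u64>, // Peer ids allocated by PD, one per replica
}

pub struct BatchSplitLite {
    pub requests: Vec<SplitRequestLite>,
    // Whether the original Region keeps the right part of the range after the Split.
    pub right_derive: bool,
}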

The subsequent flow is described in Commit and Apply Scenario Analysis of Raft Propose. As mentioned above, the validity of the Region epoch is checked before apply, and the Proposal is skipped if the epoch is stale. Assuming it is not skipped, let's look at how this Proposal is applied in ApplyDelegate::exec_batch_split:

  1. Update the version of the original Region; the epochs of the new Regions inherit from the original Region's epoch.

  2. Depending on the value of right_derive, decide whether the original Region keeps the rightmost or the leftmost part of the range after the Split, and set the start key and end key of every Region accordingly (see the sketch after this list).

  3. Call write_peer_state and write_initial_apply_state to create the metadata for each newly split Region.
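
A simplified sketch of how the new ranges are derived from the split keys and right_derive, purely for illustration; it is not the actual ApplyDelegate code.

// Given the original range [start_key, end_key) and the sorted split keys,
// derive the consecutive ranges of all Regions after the Split (illustrative only).
pub fn derive_ranges(
    start_key: Vec<u8>,
    end_key: Vec<u8>,
    split_keys: Vec<Vec<u8>>,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    // Boundaries: start, k1, ..., kn, end  ->  n + 1 ranges.
    let mut bounds = vec![start_key];
    bounds.extend(split_keys);
    bounds.push(end_key);
    bounds
        .windows(2)
        .map(|pair| (pair[0].clone(), pair[1].clone()))
        .collect()
}

// right_derive decides which of these ranges the original Region keeps:
// the last (rightmost) one if true, the first (leftmost) one if false;
// all the other ranges go to the newly created Regions.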

After the apply is complete, ApplyFsm sends a PeerMsg::ApplyRes to PeerFsm; the handling code is in PeerFsmDelegate::on_ready_split_region:

  1. If it is the leader, report the meta information (including the range, Region epoch, and so on) of itself and the new Regions to PD.

  2. Create the PeerFsm and ApplyFsm of each new Region in turn and do some registration work.

  3. Update PeerFsm’s Region epoch.

It is worth noting that if the process crashes after the apply result has been persisted but before the work above is finished, this work can be recovered after restart. In fact, the application of any log entry should be designed to satisfy this principle.

At this point, the Split work is complete. Once a majority of the Peers of the original Region have completed the Split, the new Region can successfully elect a leader and start serving requests.

Consistency in the Split process

On the premise that the clock drift between machines stays within a bounded range, the Leader of a Region holds a Raft lease that guarantees no other Leader with a newer term will appear during that period. Based on this guarantee, the lease can be used to serve linearizable local reads. For the specific implementation, refer to the earlier articles in this source reading series.

During a Split, however, the lease held by the original Region's Leader no longer provides this guarantee.

Assume three replicas and consider the following situation: the Split Proposal has been applied on two Followers but not yet on the Leader (apply is asynchronous, so a Follower's apply progress can be ahead of the Leader's).

After the Split, the range of the original Region shrinks, and the rest of the range belongs to the new Region. The new Region already has Peers created on those two Followers, which is a majority of the replicas required by Raft, so it can legitimately hold an election, produce a Leader, and serve read and write requests. Meanwhile, the Leader of the original Region has still not applied the Split Proposal; if it keeps serving read requests for the original range on the strength of its lease, linearizability is violated.

For this reason, TiKV does not renew the lease during a Split. It records the index of the latest committed Split Proposal in last_committed_split_idx; see Peer::handle_raft_ready_append. Determining whether a Split is in progress (Peer::is_splitting) is then just a matter of checking whether last_committed_split_idx is greater than applied_index.
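
In other words, the check boils down to roughly the following sketch; the field names follow the description above, but this is not the exact code on Peer.

// Illustrative only: a Split is considered in progress until the last
// committed Split Proposal has been applied.
pub struct SplitProgress {
    pub last_committed_split_idx: u64, // recorded in handle_raft_ready_append
    pub applied_index: u64,
}

impl SplitProgress {
    pub fn is_splitting(&self) -> bool {
        self.last_committed_split_idx > self.applied_index
    }
}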

If you read the code of Peer::handle_raft_ready_append that records last_committed_split_idx, you will notice that it does not expire the lease immediately; it only records the index to prevent the next renewal. In other words, during the Split the Leader of the original Region can still serve local reads for up to one lease duration, which, given the analysis above, looks unreasonable.

The reason is quite interesting. On a TiKV where the original Region's Peer is not the Leader, the newly created Peer of the new Region cannot start an election immediately; it has to wait for a full Raft election timeout. Only on the TiKV where the original Region's Peer is the Leader can the new Region's Peer campaign immediately. Raft's election timeout is longer than the lease duration, which is a precondition for the lease to be correct in the first place. Therefore, within the lease period during a Split, even if the new Region's Peers on other TiKVs have been created, they will not start an election, so no new data can be written to the new Region, and reading from the original Region does not break linearizability.

Conclusion

The basic flow of Region Split is relatively simple; it relies on the reliable replication that Raft already provides within the original Region. The corresponding Region Merge, because the two Regions involved belong to different Raft groups, has to deal with interactions with Region Split and Raft snapshots as well as the impact of network isolation, and is undoubtedly more complex. We will cover Region Merge in a future source reading article, so stay tuned!