1. Background
Percolator is a distributed transaction solution presented in the paper “Large-scale Incremental Processing Using Distributed Transactions and Notifications”, published by Google in 2010. In the paper, the scheme is used to solve the problem of incrementally updating a search engine’s index.
Percolator supports ACID semantics and implements the Snapshot Isolation transaction isolation level, so it can be considered a generic distributed transaction solution. Percolator is built on Google’s own Bigtable; at its core, it is a two-phase commit protocol that leverages Bigtable’s single-row transactions.
2. Architecture
Percolator consists of three components:
- Client: the control center of the whole protocol and the coordinator of the two-phase commit;
- TSO (Timestamp Oracle): a global timing service that provides globally unique, monotonically increasing timestamps;
- Bigtable: the distributed storage that actually persists the data.
2.1 Client
There are two roles in the two-phase commit algorithm: coordinator and participant. In Percolator, the Client acts as the coordinator, responsible for initiating and committing transactions.
2.2 Timestamp Oracle (TSO)
Percolator relies on TSO to provide a globally unique and increasing timestamp to implement Snapshot Isolation. The Client needs to get a timestamp from the TSO both at the beginning of the transaction and at the time of the commit.
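As a toy illustration (not Percolator’s or TiKV’s actual TSO, which batches allocations and persists its progress), a timestamp oracle can be reduced to an atomic fetch-and-add counter:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A strictly increasing timestamp allocator. This sketch keeps only the
/// core invariant: every caller gets a unique, larger timestamp.
struct Tso(AtomicU64);

impl Tso {
    fn get_ts(&self) -> u64 {
        // fetch_add guarantees concurrent callers receive distinct values.
        self.0.fetch_add(1, Ordering::SeqCst) + 1
    }
}

fn main() {
    let tso = Tso(AtomicU64::new(0));
    let (a, b) = (tso.get_ts(), tso.get_ts());
    assert!(b > a);
}
```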
2.3 Bigtable
Bigtable is a multi-dimensional, ordered map whose key-value pairs have the following form:
(row:string, column:string, timestamp:int64) -> string
The key is a (row, column, timestamp) triple, and the value can be treated as an uninterpreted byte array.
In Bigtable, a row can contain multiple columns, and Bigtable provides transactions across multiple columns within a single row. Percolator uses this feature to guarantee that operations on multiple columns of the same row are atomic. Percolator’s metadata is stored in special columns, as follows:
(Image: Research.Google)
The relevant columns are c:lock, c:write, and c:data:
- c:lock: a record is inserted into this column during the transaction’s Prewrite phase;
- c:write: a record is inserted into this column at commit time;
- c:data: stores the data itself.
A sketch of this per-cell layout follows.
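As a rough model (the field layouts here are illustrative assumptions, not Bigtable’s API), each cell’s three logical columns can be viewed as timestamp-indexed maps:

```rust
use std::collections::BTreeMap;

/// One Percolator cell: every logical column maps a timestamp to a value,
/// mirroring Bigtable's (row, column, timestamp) -> string model.
#[derive(Default)]
struct PercolatorCell {
    /// c:data  — start_ts -> value bytes written by the transaction
    data: BTreeMap<u64, Vec<u8>>,
    /// c:lock  — start_ts -> lock info (e.g. location of the primary lock)
    lock: BTreeMap<u64, String>,
    /// c:write — commit_ts -> start_ts of the data version made visible
    write: BTreeMap<u64, u64>,
}

fn main() {
    let cell = PercolatorCell::default();
    assert!(cell.data.is_empty() && cell.lock.is_empty() && cell.write.is_empty());
}
```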
2.4 Snapshot Isolation
- All reads in a transaction see a consistent snapshot, which behaves like the Repeatable Read isolation level.
- When two concurrent transactions write to the same cell, at most one of them can commit successfully.
- When a transaction commits, it is rolled back if some of the data it updated has also been written by a concurrent transaction that committed after this transaction’s start time; otherwise, it commits.
- Snapshot Isolation has the write skew problem: if two transactions’ read sets overlap but their write sets do not, both can commit successfully, yet neither sees the data newly written by the other, so the Serializable isolation level is not reached. In exchange, Snapshot Isolation offers better read performance than Serializable, because reads only access snapshot data and take no locks.
3. Transaction Processing
3.1 Write Logic
Percolator commits a transaction with a two-phase commit (2PC) algorithm; the two phases are called Prewrite and Commit.
In the Prewrite phase:
1) Get a timestamp from the TSO as the transaction’s start_ts;
2) For each row the transaction needs to write, write the transaction’s start_ts into the lock column, and write the new data, tagged with start_ts, into the data column, e.g. 14: ”value2″ in the figure above. One of these locks is chosen as the primary lock; the others are secondary locks, and each secondary lock contains a pointer to the primary lock.
1. If the data to be written already has a version newer than start_ts, the current transaction must be rolled back.
2. If the row to be locked already holds a lock, the current transaction must be rolled back.
In the Commit phase:
1) Get a timestamp from the TSO as the transaction’s commit_ts;
2) Delete the primary lock and write commit_ts into the write column. The two operations must be atomic. If the primary lock no longer exists, the commit fails.
3) Repeat the step above for all secondary locks. A condensed code sketch of both phases follows.
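The following is a condensed sketch of the two phases against a single in-memory store; Bigtable row transactions, failure handling, and the rollback path are elided, and all names are illustrative:

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Default)]
struct Row {
    data: BTreeMap<u64, String>,  // start_ts -> value
    lock: Option<(u64, Vec<u8>)>, // (owner's start_ts, primary key)
    write: BTreeMap<u64, u64>,    // commit_ts -> start_ts
}

#[derive(Default)]
struct Store(HashMap<Vec<u8>, Row>);

impl Store {
    /// Prewrite one key: write data and lock, aborting on either conflict rule.
    fn prewrite(&mut self, key: &[u8], value: &str, start_ts: u64, primary: &[u8]) -> bool {
        let row = self.0.entry(key.to_vec()).or_default();
        // Rule 1: a version newer than start_ts has already been committed.
        if row.write.keys().next_back().map_or(false, |&c| c >= start_ts) {
            return false;
        }
        // Rule 2: the key is already locked (by any transaction).
        if row.lock.is_some() {
            return false;
        }
        row.data.insert(start_ts, value.to_string());
        row.lock = Some((start_ts, primary.to_vec()));
        true
    }

    /// Commit one key: atomically replace our lock with a write record.
    fn commit(&mut self, key: &[u8], start_ts: u64, commit_ts: u64) -> bool {
        let row = self.0.entry(key.to_vec()).or_default();
        match row.lock.take() {
            Some((ts, _)) if ts == start_ts => {
                row.write.insert(commit_ts, start_ts);
                true
            }
            other => {
                row.lock = other; // not our lock (or no lock at all): fail
                false
            }
        }
    }
}

fn main() {
    let mut s = Store::default();
    // Bob -> Joe transfer at start_ts = 7, with Bob holding the primary lock.
    assert!(s.prewrite(b"Bob", "$3", 7, b"Bob"));
    assert!(s.prewrite(b"Joe", "$9", 7, b"Bob"));
    assert!(s.commit(b"Bob", 7, 8)); // committing the primary decides the txn
    assert!(s.commit(b"Joe", 7, 8));
}
```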
Here is a typical example of transferring 7 dollars from Bob to Joe:
1. Before the transaction starts, the two accounts Bob and Joe have 10 dollars and 2 dollars respectively.
(Image: Research.Google)
2. In the Prewrite phase, write the lock (7: I am primary) into Bob’s lock column, and write the new data (7: $3) into Bob’s data column.
(Image: Research.Google)
3. Still in the Prewrite phase, write the secondary lock: write the lock (7: primary@Bob.bal) into Joe’s lock column, pointing to the primary lock written before, and write the data (7: $9) into Joe’s data column.
(Image: Research.Google)
4. In the Commit phase, delete the primary lock and, in the same row transaction, write a record with the new timestamp (commit_ts) into the write column.
(Image: Research.Google)
5. Still in the Commit phase, clear the secondary locks and write records at the new timestamp into the corresponding write columns.
(Image: Research.Google)
3.2 Read Logic
1) Get a timestamp ts.
2) Check whether the data we are currently reading has a lock with a timestamp in the range [0, ts].
- If there is a lock with a timestamp in [0, ts], the data is locked by a transaction that started before the current one but has not yet committed. Since there is no way to know whether that transaction will commit, the read cannot be satisfied until the lock is released.
- If there is no lock, or the lock’s timestamp is greater than ts, the read can be satisfied.
3) From the write column, get the record with the largest commit_ts in the range [0, ts], and read the start_ts it points to.
4) Using the start_ts obtained in the previous step, read the record from the data column. A sketch of this read path follows.
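As a minimal sketch of this snapshot-read path for one key (the columns are passed in as plain maps, and blocking on a lock is simplified to an error the caller retries; all names are illustrative):

```rust
use std::collections::BTreeMap;

/// Snapshot read of one key at `read_ts`.
/// `data`: start_ts -> value; `write`: commit_ts -> start_ts;
/// `lock`: the start_ts of an outstanding lock on this key, if any.
fn get(
    data: &BTreeMap<u64, String>,
    lock: Option<u64>,
    write: &BTreeMap<u64, u64>,
    read_ts: u64,
) -> Result<Option<String>, &'static str> {
    // Step 2: a lock in [0, read_ts] belongs to an earlier, still-unresolved
    // transaction; we cannot read until it commits or rolls back.
    if let Some(lock_ts) = lock {
        if lock_ts <= read_ts {
            return Err("locked: wait for the lock to be released, then retry");
        }
    }
    // Step 3: find the largest commit_ts <= read_ts and its start_ts.
    if let Some((_, &start_ts)) = write.range(..=read_ts).next_back() {
        // Step 4: fetch the value written at that start_ts from the data column.
        return Ok(data.get(&start_ts).cloned());
    }
    Ok(None) // no version committed before read_ts
}

fn main() {
    let mut data = BTreeMap::new();
    let mut write = BTreeMap::new();
    data.insert(5u64, "v1".to_string());
    write.insert(8u64, 5u64); // value written at ts 5, committed at ts 8
    assert_eq!(get(&data, None, &write, 10).unwrap().as_deref(), Some("v1"));
}
```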
3.3 Handling the Client Crash Scenario
Percolator’s transaction coordinator runs in the Client, and the Client may crash. If the Client fails during commit, the locks written by the transaction are left behind; if these locks are not cleared in time, subsequent transactions will block on them indefinitely.
Percolator clears locks lazily: when transaction A encounters a lock left by transaction B, A clears the lock if it determines that B has failed. The difficulty is that A can rarely be 100% certain that B has really failed, so A might clear locks left by a transaction B that is still alive and in the middle of committing.
To avoid this anomaly, the Percolator transaction model designates one of the locks written by each transaction as the primary lock, which serves as the synchronization point for both cleanup and commit. Both the cleanup operation and the transaction commit change the state of the primary lock, and because lock changes are performed inside a Bigtable row transaction, only one of the two can succeed, avoiding the race described above.
The status of the primary lock determines whether the transaction has been committed:
If the primary lock does not exist and a commit_ts has been written into the write column, the transaction has been committed.
If the primary lock still exists, the transaction has not entered the Commit phase, i.e., it has not been committed.
When transaction A encounters a lock record left by transaction B during its commit, it acts according to the status of transaction B’s primary lock.
If transaction B’s primary lock does not exist and a commit_ts is present in the write column, the transaction has been committed, and transaction A needs to roll forward transaction B’s lock records. Roll-forward is the reverse of rollback: it clears the lock record and writes the commit_ts into the write column.
If transaction B’s primary lock still exists, transaction A can conclude that B has not committed, so A may choose to delete the lock records left by B; it must delete B’s primary lock first.
If transaction B’s primary lock does not exist and there is no commit_ts in the write column, transaction B has been rolled back, and A simply clears the locks B left behind.
Although this logic never produces an inconsistency, transaction A may clear the primary lock of a still-live transaction B, forcing B to roll back, which hurts the overall performance of the system.
To solve this problem, Percolator uses the Chubby lock service to store the liveness state of every Client that is committing a transaction, so that it can determine whether a Client has really died; the primary lock and conflicting lock records are cleared only after the Client has actually failed. However, a Client may also be alive but stuck, never committing its transaction; if its lock records were never cleared, other conflicting transactions could never commit.
To handle this, a wall time is stored with each liveness record; if the wall time is too old, the conflicting lock records are processed anyway. A long-running transaction must refresh its wall time periodically to avoid being rolled back.
The final transaction conflict logic is as follows:
If transaction B’s primary lock does not exist and a commit_ts is present in the write column, transaction A rolls forward transaction B’s locks. Roll-forward is the reverse of rollback: it clears the lock records and writes the commit_ts into the write column.
If transaction B’s primary lock does not exist and there is no commit_ts in the write column, transaction B has been rolled back, so the locks left by B are cleared.
If transaction B’s primary lock exists and its TTL has expired, transaction A may clear the lock records left by B, clearing B’s primary lock first.
If transaction B’s primary lock exists and its TTL has not expired, transaction A must wait for B to commit or roll back before proceeding. This decision procedure is sketched below.
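A sketch of that decision procedure (the types here are assumptions for illustration, not TiKV’s actual API):

```rust
/// Observed state of transaction B's primary lock.
enum PrimaryStatus {
    Committed { commit_ts: u64 }, // primary lock gone, write record present
    RolledBack,                   // primary lock gone, no write record
    Locked { ttl_expired: bool }, // primary lock still present
}

/// What transaction A does with a conflicting lock left by B.
enum Action {
    RollForward { commit_ts: u64 }, // replace the lock with a write record
    Cleanup,                        // clear the stale lock (roll back B)
    Wait,                           // B may still be alive: retry later
}

fn resolve_conflicting_lock(primary: PrimaryStatus) -> Action {
    match primary {
        PrimaryStatus::Committed { commit_ts } => Action::RollForward { commit_ts },
        PrimaryStatus::RolledBack => Action::Cleanup,
        PrimaryStatus::Locked { ttl_expired: true } => Action::Cleanup,
        PrimaryStatus::Locked { ttl_expired: false } => Action::Wait,
    }
}

fn main() {
    assert!(matches!(
        resolve_conflicting_lock(PrimaryStatus::Locked { ttl_expired: false }),
        Action::Wait
    ));
}
```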
4. Implementation and Optimization in TiKV
4.1 Implementation of Percolator in TiKV
TiKV’s underlying storage engine is RocksDB. RocksDB provides atomic write batches, which satisfy Percolator’s requirement for row transactions.
RocksDB also provides a feature called Column Families (CF): a single RocksDB instance can have multiple CFs, each of which is an isolated keyspace with its own LSM-tree. Multiple CFs in the same RocksDB instance share one WAL, however, which makes it possible to write to several CFs atomically.
In TiKV, there are three CFs in a RocksDB instance: CF_DEFAULT, CF_LOCK, and CF_WRITE, which correspond to Percolator’s data column, LOCK column, and write column respectively.
We also need to store multiple versions of each key. How is the version information represented? In TiKV, the key and the timestamp are simply combined into an internal key that is stored in RocksDB. Here is the content of each CF:
- CF_DEFAULT: (key, start_ts) -> value
- CF_LOCK: key -> lock_info
- CF_WRITE: (key, commit_ts) -> write_info
Combine key and timestamp as follows:
- Encode the user key so that it is memcomparable;
- Invert the timestamp bitwise and encode it in big-endian form;
- Append the encoded timestamp to the encoded key.
For example, the key “key1” with timestamp 3 is encoded as “key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc”. This way, different versions of the same key are adjacent in RocksDB, and newer versions sort before older ones.
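A simplified sketch of this encoding (the 8-byte-group padding scheme follows the memcomparable format described above; treat this as an illustration rather than the production codec):

```rust
/// Encode (user_key, ts) into an internal key: memcomparable-encode the key
/// in 8-byte groups, then append the bitwise-inverted timestamp in
/// big-endian form so that newer versions sort first.
fn encode_internal_key(user_key: &[u8], ts: u64) -> Vec<u8> {
    let mut out = Vec::new();
    // Each 8-byte group is zero-padded and followed by a marker byte
    // (0xFF - pad_len), so prefixes compare correctly as raw byte strings.
    for chunk in user_key.chunks(8) {
        out.extend_from_slice(chunk);
        let pad = 8 - chunk.len();
        out.extend(std::iter::repeat(0u8).take(pad));
        out.push(0xFF - pad as u8);
    }
    if user_key.len() % 8 == 0 {
        // Length is a multiple of 8: append an all-padding terminator group.
        out.extend_from_slice(&[0u8; 8]);
        out.push(0xFF - 8);
    }
    // !ts: larger (newer) timestamps become smaller byte strings.
    out.extend_from_slice(&(!ts).to_be_bytes());
    out
}

fn main() {
    let encoded = encode_internal_key(b"key1", 3);
    // Prints: 6b 65 79 31 00 00 00 00 fb ff ff ff ff ff ff ff fc
    let hex: Vec<String> = encoded.iter().map(|b| format!("{:02x}", b)).collect();
    println!("{}", hex.join(" "));
}
```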
The implementation of Percolator in TiKV is slightly different from that in the paper. In TiKV, CF_WRITE has four different types of data:
- Put: CF_DEFAULT holds a corresponding value; the write is a Put operation;
- Delete: the write is a Delete operation;
- Rollback: when a transaction is rolled back, instead of simply deleting the record in CF_LOCK, we insert a Rollback record in CF_WRITE;
- Lock: indicates that the key was locked by the transaction but not modified.
These four types are sketched below.
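As a sketch of the four record types (the meaning of the Lock variant is an assumption based on TiKV’s documentation):

```rust
/// Record types stored in CF_WRITE.
#[derive(Debug)]
enum WriteType {
    Put,      // a value for this version exists in CF_DEFAULT
    Delete,   // this version deletes the key
    Rollback, // the transaction with this start_ts was rolled back
    Lock,     // the key was locked, but not changed, by the transaction
}

fn main() {
    println!("{:?}", WriteType::Rollback);
}
```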
4.2 Optimization of Percolator in TiKV
4.2.1 Parallel Prewrite
For a transaction, we do not perform Prewrite one key at a time: when there are multiple TiKV nodes, Prewrite is executed on them in parallel.
In TiKV’s implementation, when a transaction is committed, the keys involved in the transaction are split into multiple batches, and all batches are prewritten in parallel; there is no need to wait for the primary key’s Prewrite to succeed first.
If a conflict occurs during the Prewrite phase, the transaction is rolled back. When we perform a rollback, we insert a Rollback record into CF_WRITE instead of deleting the corresponding lock record as described in Percolator’s paper. The Rollback record indicates that the transaction has been rolled back, so a late-arriving Prewrite request from the same transaction (which can happen when the network misbehaves) will not succeed. If we let such a Prewrite succeed, correctness would still be preserved, but the key would stay locked until the lock record expired before other transactions could lock it again.
4.2.2 Short Value in Write Column
When reading a value, we first look up the start_ts of the latest visible version of the key in CF_WRITE, and then fetch the actual record from CF_DEFAULT. If the value is small, two RocksDB lookups are relatively expensive.
The implementation therefore avoids the double lookup for short values: if a value is small, during the Prewrite phase it is placed in CF_LOCK instead of CF_DEFAULT, and during the Commit phase it is moved from CF_LOCK into CF_WRITE. Reading a short value then only touches CF_WRITE, saving one RocksDB lookup.
4.2.3 Point Read Without Timestamp
For each transaction, we normally allocate a start_ts first and then ensure that the transaction sees only data committed before start_ts. But if a transaction reads only a single key, is allocating a start_ts necessary? The answer is no: we can simply read the latest version of that key.
4.2.4 Calculated Commit Timestamp
To ensure Snapshot Isolation, all reads within a transaction must be repeatable. The commit_ts must therefore be large enough that the transaction is not treated as committed before a valid read that did not observe it; otherwise, Repeatable Read is broken. For example:
Txn1 gets start_ts 100
Txn2 gets start_ts 200
Txn2 reads key “k1” and gets value “1”
Txn1 prewrites “k1” with value “2”
Txn1 commits with commit_ts 101
Txn2 reads key “k1” and gets value “2”
Txn2 reads “k1” twice but gets different results. If commit_ts were allocated from PD, this problem could not occur: since Txn2’s first read happens before Txn1’s Prewrite, and Txn1’s commit_ts is allocated after its Prewrite finishes, Txn1’s commit_ts is guaranteed to be greater than Txn2’s start_ts.
On the other hand, commit_ts cannot be arbitrarily large. If commit_ts runs ahead of the actual time, the committed data may be unreadable by newly started transactions, which breaks integrity. Without asking PD, we cannot be sure that a timestamp does not exceed the current actual time.
To guarantee both Snapshot Isolation semantics and data integrity, commit_ts should fall in the following range:
max(start_ts, max_read_ts_of_written_keys) < commit_ts <= now
Commit_ts is calculated as follows:
commit_ts = max(start_ts, region_1_max_read_ts, region_2_max_read_ts, ...) + 1
where region_N_max_read_ts is the maximum timestamp of all read operations on Region N, taken over all Regions involved in the transaction. A sketch of the calculation follows.
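A sketch of the calculation (collecting each Region’s max read timestamp is bookkeeping assumed here, not taken from TiKV’s code):

```rust
/// commit_ts = max(start_ts, per-Region max read ts over written keys) + 1
fn calc_commit_ts(start_ts: u64, region_max_read_ts: &[u64]) -> u64 {
    let max_read = region_max_read_ts.iter().copied().max().unwrap_or(0);
    start_ts.max(max_read) + 1
}

fn main() {
    // Txn started at ts 100; its written keys were last read at ts 200.
    assert_eq!(calc_commit_ts(100, &[150, 200]), 201);
}
```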
4.2.5 Single-Region 1PC
For a non-distributed database, it is relatively easy to guarantee the ACID properties of transactions, but a distributed database needs 2PC to do so; the Percolator algorithm used by TiKV is such a 2PC algorithm.
Within a single Region, a write batch is guaranteed to be applied atomically. So if all the data affected by a transaction lives in one Region, 2PC is unnecessary: as long as the transaction has no write conflict, it can be committed directly.
5. Summary
Advantages:
- Transaction management is layered on top of the storage system; the overall architecture is clear, scales well, and is easy to implement;
- Read/write performance is good when transaction conflicts are rare.
Disadvantages:
- When transaction conflicts are frequent, performance is poor, because the retries triggered by conflicts are expensive;
- Although MVCC is used for concurrency control, reads may still wait: under read/write conflicts, read performance suffers noticeably.
Overall, the Percolator model is a commendable design, with a clear architecture and a simple implementation; in scenarios with few read/write conflicts, its performance is quite good.
6. References
1. Codis author reveals TiKV transaction model for the first time, open source implementation of Google Spanner
2. Advantages and disadvantages analysis of Google Percolator transaction model
3. Large-scale Incremental Processing Using Distributed Transactions and Notifications — Google Research
4. Database · Principle Introduction · Interpretation of distributed transaction implementation principle of Google Percolator (Taobao.org)
Author: Wang Xiang, Vivo Internet Database team