Author: ano
Code Miscellany: Lockless Programming (Part 1)
Origin
As the saying goes, a Kubernetes developer on a database team may not be able to write a B-tree, but should at least have the heart of one. So I have been studying how to write a persistent B-tree in my spare time. That led me to the Bw-Tree (the lock-free B-tree proposed by Microsoft in 2014), and I started thinking about how to implement its mapping table. Then I started reading about FasterKV (Microsoft's persistent lock-free KV store from 2018) and went completely off track…
This article records my experience fiddling with a few lock-free data structures in Rust, so that I can look it up later and remind myself:
Don't bother with lock-free programming!
The data structure
Implementations of lock-free data structures usually require special hardware support, such as processors with atomic CAS or LL/SC instructions, or transactional memory. Mainstream processors today all support CAS, so most lock-free data structures are designed around CAS operations. Generally speaking, CAS operates on 32/64-bit integers, which is exactly the width of a pointer, so lock-free structures are built around atomic operations on pointers. This article mainly walks through a lock-free linked list implemented in Rust, based on the unordered linked list algorithm proposed by Zhang et al. in 2013 [1].
Note: CPUs that support multi-word CAS also exist.
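As a warm-up, here is what a CAS retry loop looks like on a 64-bit atomic in Rust. This is only a minimal illustration of the pattern everything below builds on, not part of the list algorithm itself:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Read the current value, compute a new one, and publish it only if the
// value has not changed in the meantime; otherwise retry.
fn cas_increment(v: &AtomicU64) {
    loop {
        let cur = v.load(Ordering::Acquire);
        if v.compare_exchange(cur, cur + 1, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return;
        }
        // CAS failed: another thread changed the value; loop and retry.
    }
}
```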
Unordered Set
First, let's introduce the lock-free linked list mentioned above. It is actually an unordered set backed by a singly linked list, supporting the following three operations:

- insert(k): inserts the element k and returns whether it succeeded;
- remove(k): deletes the element k and returns whether it succeeded;
- contains(k): checks whether the element k is in the set (linked list).
The nodes and overall structure of the linked list are shown below:
Each node contains two atomic variables in addition to its own element:
- state represents the state of the current node; the paper defines four states:
  - DAT: a visible element (node)
  - INS: an element being inserted
  - REM: an element being deleted
  - INV: an invalid element (node)
- next stores the pointer to the next node
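In Rust, the node described above might be sketched like this (the field and constant names are my own, not the paper's):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicU8, Ordering};

// The four node states described above.
const DAT: u8 = 0; // visible element
const INS: u8 = 1; // element being inserted
const REM: u8 = 2; // element being deleted
const INV: u8 = 3; // invalid element

struct Node<E> {
    element: E,
    state: AtomicU8,          // one of DAT / INS / REM / INV
    next: AtomicPtr<Node<E>>, // pointer to the next node
}

impl<E> Node<E> {
    fn new(element: E, state: u8) -> Self {
        Node {
            element,
            state: AtomicU8::new(state),
            next: AtomicPtr::new(ptr::null_mut()),
        }
    }
}
```

Both state and next are atomics precisely so that later operations can flip them with CAS.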
Like an ordinary singly linked list, this list has a head pointer to the list head, and nodes are chained through the pointer stored in each node's next field. To insert a new node, we point its next at the current head and then swing head to the new node's address with a CAS, as shown in the figure. The CAS may of course fail, in which case we simply repeat the steps in the figure. The Rust code looks like this:
fn enlist(&self, node_ptr: SharedNodePtr<E>, g: &Guard) {
    debug_assert!(!node_ptr.is_null());
    let node = unsafe { node_ptr.deref() };
    let mut head_ptr = self.head.load(Ordering::Acquire, g);
    loop {
        // Set node.next to untagged(head).
        node.next.store(head_ptr.with_tag(0), Ordering::Release);
        match self.head.compare_exchange(
            // Keep head's tag.
            head_ptr, node_ptr.with_tag(head_ptr.tag()),
            Ordering::AcqRel, Ordering::Acquire, g,
        ) {
            Ok(_) => return,
            Err(err) => head_ptr = err.current,
        }
    }
}
If we leave node deletion aside, an unordered set based on this linked list is easy to implement: adding an element just inserts a node for it at the head of the list (DAT), and deleting an element inserts a node the same way but marked as a deletion (REM). To look up an element, traverse the list from the head until the first node holding that element; that node's state is the element's state. The big problem with this simple approach is that the list grows without bound, and the cost of every set operation is proportional to the list length. The algorithm by Zhang et al. therefore introduces the INS and INV states and adds physical node deletion.
Of the three operations, contains is read-only, so its implementation simply scans down the list until the first visible element is found; being read-only, it works correctly regardless of GC. The other two operations resemble each other: both insert a node carrying the element at the head of the list, then scan from that node toward the tail to determine whether the operation succeeded. The algorithm goes roughly as follows:
- insert: enlist a node in the INS state, then scan from it toward the tail for the first other node holding the same element
  - if that node's state is INS or DAT, the element already exists in the list, and the insertion fails;
  - if the state is REM, the element has been deleted from the list, and the insertion succeeds;
  - nodes in the INV state are ignored;
  - if no such node is found, the insertion succeeds.
- remove: enlist a node in the REM state, then scan from it toward the tail for the first other node holding the same element
  - if that node's state is REM, the element has already been deleted, and the deletion fails;
  - if the state is INS, another thread is inserting the element; try to mark it as REM with a CAS
    - if the CAS succeeds, the deletion succeeds;
    - if it fails, re-read the node's state and retry;
  - nodes in the INV state are ignored;
  - if the state is DAT, use a CAS to mark the node as INV; the deletion succeeds or fails according to whether the CAS succeeds or fails.
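The state-machine part of remove can be sketched as a small function that, given the first matching node's state, decides the outcome. This is a simplified illustration only: resolve_remove and the state constants are hypothetical names, and the real algorithm also walks the list rather than looking at a single state word:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const DAT: u8 = 0;
const INS: u8 = 1;
const REM: u8 = 2;
const INV: u8 = 3;

// Decide whether our remove succeeds based on the state of the first node
// with the same element, retrying internally when a CAS fails.
fn resolve_remove(state: &AtomicU8) -> bool {
    loop {
        match state.load(Ordering::Acquire) {
            // Already deleted by someone else: our deletion fails.
            REM => return false,
            // Visible element: try to invalidate it; the CAS outcome is ours.
            DAT => {
                return state
                    .compare_exchange(DAT, INV, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok();
            }
            // Another thread is inserting: try to turn its insert into a delete.
            INS => {
                if state
                    .compare_exchange(INS, REM, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
                {
                    return true;
                }
                // CAS failed: the state changed; re-read and retry.
            }
            // INV nodes are skipped during the scan, so we never get here.
            _ => unreachable!("INV nodes are ignored by the scan"),
        }
    }
}
```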
During this scan in insert and remove, whenever an INV node is encountered, a CAS is used to try to unlink it, thereby shortening the linked list:
CAS(prev.next, cur_ptr, cur.next)
In CAS-based lock-free linked list algorithms, two problems are usually encountered:
- The ABA problem: thread 1 reads A before performing a CAS; thread 2 then changes the value to B and back to A; thread 1's CAS now succeeds and may corrupt the data structure;
  - The algorithm in this article has no ABA problem, because no atomic variable's value ever cycles (state only moves forward, and next only ever skips further toward the tail, since there is no insertion in the middle of the list);
- If thread 1 and thread 2 concurrently delete B and C from A -> B -> C -> D, the deletion of C may be invalidated or cancelled:
  - Invalidation: thread 1 deletes B first, making the list A -> C -> D; thread 2 still sees B -> C -> D, so setting B.next to D has no effect on the actual list;
  - Cancellation: thread 1 reads B.next and gets C; thread 2 then deletes C, making the list A -> B -> D; thread 1's CAS(A.next, B, C) nevertheless succeeds, turning the list back into A -> C -> D;
  - This concurrent-deletion problem does exist in this algorithm, but it does not affect correctness, because a deleted node must be in the INV state, and nodes in that state do not affect any operation;
In GC languages such as Java, concurrent deletion is not a big problem. In non-GC languages, however, it makes reclaiming nodes harder; this will be analyzed and resolved in the GC section below.
The time complexity of all three operations is O(N), so performance suffers when the set is large. Moreover, chasing pointers causes a large number of CPU cache misses, so lock-free algorithms generally have much worse single-threaded performance than ordinary non-concurrent algorithms.
Lock-free hash table
In hash tables, linked lists are commonly used to resolve hash collisions. Once we have a lock-free linked list, it is natural to build a lock-free hash table on top of it, with the following operations:

- put(k, v): maps k to v;
- remove(k): deletes the mapping for k;
- get(k): gets the value k maps to.
Obviously, we can construct a lock-free hash table simply by giving it buckets like a regular hash table, with each bucket pointing to a lock-free linked list. This lock-free design relies on the bucket count being fixed: buckets can be neither added nor removed. In the FasterKV paper published at SIGMOD 2018 [3], Microsoft proposed a lock-free resizing method built on the epoch framework; as I understand it, it is still a spin lock in disguise, so I will not cover it in detail here.
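The bucket-selection layer of such a fixed-bucket table can be sketched as follows. NUM_BUCKETS and bucket_index are illustrative names; in the full structure, each bucket would hold one lock-free linked list from the previous section:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The bucket count never changes, so choosing a bucket needs no
// synchronization at all; all concurrency lives inside the per-bucket list.
const NUM_BUCKETS: usize = 16;

fn bucket_index<K: Hash>(key: &K) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % NUM_BUCKETS
}
```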
For experiments, I also implemented an in-memory version of FasterKV in Rust (without the ability to grow or shrink); the results are shown in the implementation section.
Garbage Collection (GC)
Modern high-level programming languages, such as Java, Go, Python, and a host of JVM-based languages, provide garbage-collected memory. Rust is one of the newer, lower-level "unlucky" languages without GC: the compiler automatically weaves memory-freeing code into the program, roughly following C++-style RAII. But in lock-free data structures, concurrent access raises an extra question: when is it safe to reclaim an object and its memory?
Consider a scenario where thread 1 removes a node from the linked list while thread 2 is still accessing it; this happens routinely under lock-free concurrency. The node cannot be safely reclaimed at that point, yet we know it will eventually become unreachable. The main problem that lock-free garbage collection has to solve, then, is finding the moment at which a deleted object can no longer be accessed. Well-known GC methods include QSBR (Quiescent-State-Based Reclamation), EBR (Epoch-Based Reclamation), and HPBR (Hazard-Pointer-Based Reclamation); this article mainly introduces and uses EBR.
Epoch-Based Reclamation
The main idea of EBR is to divide operations into generations (epochs): as the epoch keeps advancing, objects deleted in an old epoch are guaranteed to be unreachable in newer epochs, which identifies a safe moment to reclaim them. EBR works as shown below:
In a common EBR implementation:

- The world has a single global epoch that only grows; each advance is accompanied by a memory fence, making all previous changes visible;
- Each thread enters the current epoch when an operation starts and exits it when the operation finishes; in practice each thread maintains a local epoch and synchronizes it on every operation
  - for efficiency, the local epoch may be synchronized only once every several operations;
- When a thread deletes an object from a lock-free data structure, it adds the object to a retire list, tags it with the current global epoch, and attempts to advance the global epoch
  - likewise for efficiency, the global epoch may be advanced only once every several operations.
Clearly, once the epoch has advanced, all earlier deletions are visible to every thread, so as soon as every thread has exited the epoch in which a deletion occurred, the deleted objects can be safely reclaimed. The safe epoch (objects retired in any epoch less than or equal to it can be safely reclaimed) is therefore the minimum of all threads' local epochs minus 1. Under this strategy the global epoch can grow quickly, so there is an alternative: advance the global epoch only when every thread's local epoch already equals it. Since the gap between the global epoch and any local epoch is then at most 1, the safe epoch is simply the global epoch minus 2. Under this strategy the epoch can even be reduced to just 3 generations, and the generation that can be safely reclaimed is (global epoch + 1) % 3.
Note: although the generation gap is still guaranteed to be at most 1, it is the local epoch that is used as the retire generation, so the safe epoch should be the global epoch minus 3.
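The two safe-epoch rules above can be sketched as plain arithmetic (my own formulation; safe_epoch and reclaimable_generation are hypothetical helpers):

```rust
// Strategy 1: the safe epoch is min(local epochs) - 1; objects retired in
// any epoch <= this value can be freed.
fn safe_epoch(local_epochs: &[u64]) -> u64 {
    local_epochs.iter().copied().min().unwrap_or(0).saturating_sub(1)
}

// Strategy 2: the global epoch only advances when all local epochs have
// caught up, so they lag by at most 1; with only three generations the
// reclaimable one is (global + 1) % 3, i.e. global - 2 modulo 3.
fn reclaimable_generation(global_epoch: u64) -> u64 {
    (global_epoch + 1) % 3
}
```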
EBR effectively identifies a grace period after which reclamation is safe, but it has some problems:

- If some operations are slow, the safe/global epoch may fail to advance, so memory is never freed;
- A deletion must take full effect; it must not be cancelled or invalidated in a way that lets the node be accessed again by later operations.
The second problem bites the lock-free linked list described above: concurrent deletions can cause a node's deletion to be cancelled or invalidated. Thanks to @zzy590 for pointing this out. After discussing it with him, we arrived at a solution: mark the pointer first, deleting nodes in two stages. The method works roughly as follows:
- Before deleting a node, mark its next pointer: CAS(node.next, unmark(next), mark(next));
- Deleting the successor of a marked node is forbidden, i.e. adjacent concurrent deletions are forbidden;
- The CAS of the delete operation becomes CAS(prev.next, unmark(cur), unmark(next));
- Any thread that encounters a node with a marked pointer may attempt to delete it;
- If the deletion is forbidden, the node is simply skipped.
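Stage 1, marking the next pointer, can be sketched by representing next as a pointer-sized word and using its low bit as the mark. This is an illustration under the assumption that node addresses are at least 2-byte aligned, so bit 0 of a real pointer is always free:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MARK: usize = 1;

fn mark(p: usize) -> usize { p | MARK }
fn unmark(p: usize) -> usize { p & !MARK }
fn is_marked(p: usize) -> bool { p & MARK != 0 }

// Try to mark node.next as the first stage of deletion; fails if another
// deleter already marked it, in which case this node must be skipped.
fn try_mark(next: &AtomicUsize) -> bool {
    let p = next.load(Ordering::Acquire);
    !is_marked(p)
        && next
            .compare_exchange(unmark(p), mark(p), Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
}
```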
In the modified algorithm, one case of adjacent concurrent deletion is shown below:
Besides object reclamation, EBR can also serve as a framework for preventing other concurrent-access problems; for example, FasterKV uses it to advance the fuzzy region of its hybrid log. It is quite a general tool.
Implementation
Finally, I implemented the lock-free linked list and hash table described in this article in Rust. The EBR framework uses the crossbeam-epoch crate, on top of which I implemented the FasterKV-style atomic pointer that uses the high 16 bits as a tag and the low 48 bits as the address.
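The tag/address packing can be sketched as follows (an illustration of the idea only; the exact layout in the repository may differ):

```rust
// 16-bit tag in the high bits, 48-bit address in the low bits of one u64,
// so tag and pointer can be updated together with a single 64-bit CAS.
const ADDR_BITS: u32 = 48;
const ADDR_MASK: u64 = (1 << ADDR_BITS) - 1;

fn pack(tag: u16, addr: u64) -> u64 {
    ((tag as u64) << ADDR_BITS) | (addr & ADDR_MASK)
}

fn tag_of(word: u64) -> u16 {
    (word >> ADDR_BITS) as u16
}

fn addr_of(word: u64) -> u64 {
    word & ADDR_MASK
}
```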
The code is available on GitHub: github.com/ruaaadb/loc… . Those who are interested are welcome to grab it and play with it.
Note: compilation requires a locally linked tcmalloc library; tcmalloc can sustain extremely high local memory allocation rates.
I ran a simple linked-list benchmark on my MacBook Pro (M1 Pro); the performance is as follows:
| Threads | R=2K, S=1K, L=0% | R=2K, S=1K, L=34% | R=2K, S=1K, L=80% | R=2K, S=1K, L=100% |
|---|---|---|---|---|
| 1 | 637 ops/ms | 649 ops/ms | 660 ops/ms | 671 ops/ms |
| 2 | 1060 ops/ms | 1094 ops/ms | 1196 ops/ms | 1312 ops/ms |
| 4 | 1754 ops/ms | 1914 ops/ms | 2148 ops/ms | 2571 ops/ms |
| 8 | 2907 ops/ms | 3321 ops/ms | 4042 ops/ms | 4941 ops/ms |
Note: R is the key range, S is the expected list length, L is the lookup ratio, and writes in the test are half deletes and half inserts.
As you can see, the performance is not particularly good, but thanks to advances in hardware, even with the two-stage deletion the numbers still manage to beat those reported in the paper.
References
[1] Zhang, Kunlong, et al. “Practical non-blocking unordered lists.” International Symposium on Distributed Computing. Springer, Berlin, Heidelberg, 2013.
[2] Harris, Timothy L. "A Pragmatic Implementation of Non-blocking Linked-Lists." International Symposium on Distributed Computing. Springer, Berlin, Heidelberg, 2001.
[3] Chandramouli, Badrish, et al. “Faster: A concurrent key-value store with in-place updates.” Proceedings of the 2018 International Conference on Management of Data. 2018.
[4] github.com/crossbeam-r…
[5] en.wikipedia.org/wiki/ABA_pr…
[6] www.youtube.com/watch?v=trs…