Disruptor architecture (UML diagram)
1 Single-threaded write
Disruptor’s RingBuffer can be completely lock-free because of “single-threaded write”. This is the prerequisite of all prerequisites: without it, no technique can be completely lock-free. Redis, Netty, and other high-performance frameworks are built around the same core idea.
2 System memory optimization – Memory barrier
Achieving lock-freedom requires one key technique: the memory barrier.
The Java equivalent is a volatile variable, with its happens-before semantics.
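As a minimal sketch of this idea (illustrative class and field names, not Disruptor’s actual code), a single writer can publish data through a volatile field: the volatile write happens-before any read that observes it, so plain writes made before it become visible too.

```java
// Minimal single-writer publication via a volatile field.
// Writing `cursor` happens-before any read that sees the new value,
// so the plain write to `data` is also guaranteed to be visible.
class SingleWriterCell {
    private long data;                  // plain field, written only by the single writer
    private volatile long cursor = -1;  // the volatile write/read acts as the barrier

    void publish(long value, long sequence) {  // called by the single writer thread
        data = value;                          // 1. plain write
        cursor = sequence;                     // 2. volatile write publishes it
    }

    long readIfAvailable(long expectedSequence) {  // called by any reader thread
        if (cursor >= expectedSequence) {          // volatile read
            return data;                           // sees the matching `data` write
        }
        return Long.MIN_VALUE;                     // not yet published (sentinel)
    }
}
```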
See: memory barriers in the Linux kernel, i.e. smp_wmb()/smp_rmb(). For example, Linux’s kfifo uses smp_wmb() on both its underlying read and write paths: github.com/opennetwork…
3 System cache optimization – Eliminating false sharing
The cache system stores data in units of cache lines. A cache line is a contiguous run of bytes whose size is an integral power of 2, typically 32 to 256 bytes; the most common cache line size is 64 bytes.
When multiple threads modify variables that are independent of each other, they inadvertently hurt each other’s performance if those variables share the same cache line. This is called false sharing.
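To make the effect concrete, here is a hedged sketch (class and field names are mine, not Disruptor’s): the two threads below never touch each other’s counter, yet because both counters usually land on the same cache line, the loop typically runs several times slower than it would with padded counters.

```java
// False-sharing demo: two independent counters that share one cache line.
public class FalseSharingDemo {
    static class Counters {
        volatile long a;  // updated only by thread 1
        volatile long b;  // updated only by thread 2, but adjacent to `a` in memory
    }

    public static void main(String[] args) throws InterruptedException {
        Counters c = new Counters();
        Runnable incA = () -> { for (long i = 0; i < 100_000_000L; i++) c.a++; };
        Runnable incB = () -> { for (long i = 0; i < 100_000_000L; i++) c.b++; };

        long start = System.nanoTime();
        Thread t1 = new Thread(incA);
        Thread t2 = new Thread(incB);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // For comparison, separate `a` and `b` with seven long fields of padding
        // and rerun: the padded version is typically several times faster.
        System.out.printf("elapsed: %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}
```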
Core: the Sequence
Think of it as an AtomicLong that tracks progress. It is also designed to prevent false sharing of CPU cache lines between different Sequences.
The following design ensures that the saved value always occupies a cache line of its own (8 longs, exactly 64 bytes), trading space for time. The padding variables carry no meaning in themselves; they exist only to pad the cache line so the CPU Cache can be used as effectively as possible. Accessing the L1 or L2 Cache embedded in the CPU has a latency of 1/15 or even 1/100 of a memory access; memory is far slower than the CPU. To achieve extreme performance, you need to serve as many reads as possible from the CPU Cache rather than from memory.
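This mirrors Disruptor’s own LhsPadding/Value/RhsPadding class hierarchy; the sketch below is a simplified rendering of that idea, not the exact source. Because superclass fields are laid out before subclass fields, the value is always surrounded by seven longs on each side, so wherever the object falls within a 64-byte line, value never shares a cache line with a foreign variable.

```java
// Simplified sketch of Disruptor's padded Sequence layout.
// Field layout order follows the class hierarchy: 7 longs, the value, 7 longs.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;        // left-hand padding
}

class Value extends LhsPadding {
    protected volatile long value;                    // the only meaningful field
}

class RhsPadding extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15;  // right-hand padding
}

class PaddedSequence extends RhsPadding {
    long get()       { return value; }
    void set(long v) { value = v; }
}
```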
The CPU Cache loads data from memory not field by field but whole cache lines at a time. If you define a long array of length 64, the data is loaded from memory into the CPU Cache not one array element at a time, but one fixed-length cache line at a time.
On a 64-bit Intel CPU, the cache line is typically 64 bytes. One long takes 8 bytes, so 8 longs are loaded at once, i.e. eight consecutive array elements per load. This makes iterating over arrays very fast: the next seven consecutive accesses hit the cache, so there is no need to read from memory again.
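A small experiment (a sketch under the 64-byte-line assumption, not a rigorous benchmark) makes this granularity visible: summing every element and summing every 8th element of a large long[] cost roughly the same time, because both walks have to pull in every 64-byte cache line once.

```java
// Cache-line granularity demo: stride 1 vs stride 8 over a large long[].
// Both loops touch every 64-byte cache line exactly once, so their times
// are close even though the stride-8 loop does only 1/8 of the additions.
public class CacheLineStride {
    public static void main(String[] args) {
        long[] data = new long[8 * 1024 * 1024];  // 64 MB, far larger than any cache
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < data.length; i++)    sum += data[i];  // stride 1
        long t1 = System.nanoTime();
        for (int i = 0; i < data.length; i += 8) sum += data[i];  // stride 8: one read per line
        long t2 = System.nanoTime();

        System.out.printf("stride 1: %d ms, stride 8: %d ms (sum=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
    }
}
```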
But this becomes problematic when you use individual variables instead of arrays. Disruptor’s RingBuffer defines a RingBufferFields class containing indexMask and several other variables that hold the internal state of the RingBuffer. When the CPU loads this data from memory into the cache, it loads not only the data itself but also the other variables defined before and after it.
Disruptor is a multithreaded framework, and those neighboring variables can be updated and read by multiple threads, with the writes and reads coming from different CPU cores. To keep updates synchronized, the CPU Cache therefore has to keep writing data back to memory or reloading it from memory.
These write-backs and loads are not performed per variable; they always operate on a whole cache line. So when the variables before and after INITIAL_CURSOR_VALUE are written back to memory, that field is written back too, and its cached copy is invalidated; the next read must go to memory again, which is much slower. To prevent this, Disruptor applies cache line padding, defining seven long variables before and after the variables declared in RingBufferFields:
- The first seven come from the inherited RingBufferPad class
- The next seven are defined directly in the RingBuffer class
These 14 variables serve no practical purpose. We neither read nor write them.
The variables defined in RingBufferFields are final and never modified after the initial write. Therefore, once they are loaded into the CPU Cache, as long as they are accessed frequently enough, they will never be evicted. Reads of these values thus always run at CPU Cache speed, never at memory speed.
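Since Java 8, the same effect can be requested from the JVM instead of hand-writing 14 dummy fields: annotating a field with sun.misc.Contended (renamed jdk.internal.vm.annotation.Contended in Java 9) asks the JVM to pad around it. Note this is a sketch of the general mechanism, not what Disruptor does, and application classes need the -XX:-RestrictContended flag for the annotation to take effect.

```java
// Alternative to manual padding (Java 8). Run with -XX:-RestrictContended,
// since the JVM ignores the annotation on non-JDK classes by default.
import sun.misc.Contended;

class ContendedCursor {
    @Contended            // the JVM inserts padding around this field
    volatile long cursor;
}
```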
RingBuffer: exploiting caching and branch prediction
Disruptor takes advantage of CPU Cache performance throughout. At its core, the Disruptor framework is a high-speed producer-consumer queue.
Producers keep adding new tasks to the queue, and consumers keep removing them from it. The most natural way to implement a queue is a linked list: maintain the head and tail, let producers keep inserting new nodes at the tail, and let consumers keep pulling the oldest node off the head. LinkedBlockingQueue can be used directly for the producer-consumer pattern. Disruptor, however, does not use a LinkedBlockingQueue. Instead it uses a RingBuffer, a data structure backed by a fixed-length array. Compared with a linked list, array data enjoys spatial locality in memory.
Multiple consecutive elements of the array are loaded into the CPU Cache together, so traversal is fast. The nodes of a linked list do not sit in adjacent memory, so they get none of the benefit of accessing consecutive data after a whole cache line has been loaded.
Traversing data sequentially has another great advantage: accurate branch prediction at the CPU level. This lets the program make more efficient use of the CPU’s multi-stage pipeline and run faster. (If you do not quite remember how this works, you can go back and review Lecture 25 on branch prediction.)
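A rough way to feel the locality difference (a sketch, not a rigorous benchmark): sum the same values from a long[] and from a LinkedList&lt;Long&gt;. The array walk enjoys sequential cache-line loads and a perfectly predictable loop branch; the linked list chases pointers to scattered nodes.

```java
import java.util.LinkedList;

// Spatial-locality demo: contiguous array vs pointer-chasing linked list.
public class LocalityDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        long[] array = new long[n];
        LinkedList<Long> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add((long) i);
        }

        long t0 = System.nanoTime();
        long s1 = 0;
        for (int i = 0; i < n; i++) s1 += array[i];  // sequential, cache-friendly
        long t1 = System.nanoTime();
        long s2 = 0;
        for (long v : list) s2 += v;                 // node-by-node pointer chasing
        long t2 = System.nanoTime();

        System.out.printf("array: %d ms, linked list: %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}
```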
4 Algorithm optimization – the sequence barrier mechanism
When a producer publishes events, it always starts with:
```java
long sequence = ringBuffer.next();
```
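That call is only the first step of the standard two-phase publish protocol. A typical producer looks like this (next(), get() and publish() are the real RingBuffer API; the event class and its setter are illustrative):

```java
// Two-phase publish: claim a slot, fill the pre-allocated event, publish.
long sequence = ringBuffer.next();             // 1. claim the next sequence (may wait if the ring is full)
try {
    MyEvent event = ringBuffer.get(sequence);  // 2. fetch the pre-allocated event in that slot
    event.setValue(42);                        //    (MyEvent and setValue are hypothetical)
} finally {
    ringBuffer.publish(sequence);              // 3. make the slot visible to consumers
}
```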
Disruptor 3.0 uses SequenceBarrier and Sequence together to coordinate and manage the pace of work between consumers and producers, avoiding the use of locks and CAS:
- A consumer’s sequence number must be less than the producer’s sequence number
- A consumer’s sequence number must be less than those of the consumers it depends on (its predecessors)
- The producer’s sequence number cannot run more than a full buffer ahead of the smallest consumer sequence number; otherwise a producer that is too fast would overwrite events that have not yet been consumed
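On the consumer side, these rules materialize as a loop around SequenceBarrier.waitFor(), which gates on the producer cursor and on any predecessor consumers. Below is a simplified sketch in the spirit of Disruptor’s BatchEventProcessor (error and shutdown handling omitted; MyEvent is a placeholder event type):

```java
// Simplified consumer loop in the spirit of BatchEventProcessor.
void runConsumer(com.lmax.disruptor.RingBuffer<MyEvent> ringBuffer,
                 com.lmax.disruptor.SequenceBarrier barrier,
                 com.lmax.disruptor.Sequence sequence,            // this consumer's own progress
                 com.lmax.disruptor.EventHandler<MyEvent> handler) throws Exception {
    long nextSequence = sequence.get() + 1L;
    while (true) {
        // blocks (per the configured WaitStrategy) until events are available
        long availableSequence = barrier.waitFor(nextSequence);
        while (nextSequence <= availableSequence) {
            handler.onEvent(ringBuffer.get(nextSequence), nextSequence,
                            nextSequence == availableSequence);   // endOfBatch flag
            nextSequence++;
        }
        sequence.set(availableSequence);  // publish progress; gates the producer
    }
}
```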
SingleProducerSequencer#next
```java
/**
 * @see Sequencer#next(int)
 */
@Override
public long next(int n)
{
    if (n < 1)
    {
        throw new IllegalArgumentException("n must be > 0");
    }

    // nextValue is a plain field of SingleProducerSequencer: single writer, no CAS
    long nextValue = this.nextValue;

    long nextSequence = nextValue + n;
    // the sequence whose slot this claim would reuse (one full lap behind)
    long wrapPoint = nextSequence - bufferSize;
    // cached minimum of the gating (consumer) sequences, refreshed lazily
    long cachedGatingSequence = this.cachedValue;

    if (wrapPoint > cachedGatingSequence || cachedGatingSequence > nextValue)
    {
        long minSequence;
        // park until the slowest consumer has moved past the wrap point
        while (wrapPoint > (minSequence = Util.getMinimumSequence(gatingSequences, nextValue)))
        {
            LockSupport.parkNanos(1L); // TODO: Use waitStrategy to spin?
        }

        this.cachedValue = minSequence;
    }

    this.nextValue = nextSequence;

    return nextSequence;
}
```
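A worked example of the wrap check (numbers are illustrative): with bufferSize = 8, nextValue = 13 and n = 1, we get nextSequence = 14 and wrapPoint = 14 - 8 = 6. Sequence 14 reuses the same array cell as sequence 6, so if the slowest consumer has only reached sequence 5 (less than 6), the producer parks until the minimum gating sequence reaches 6; otherwise it would overwrite an event that has not yet been consumed.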