Contents
1. CPU caching basics
2. Cache hits
3. Cache consistency
4. Typical use cases
5. Queue false sharing
6. Takeaway
CPU cache knowledge is a basic topic in big-tech interviews, and one that interviewers value quite highly; if you master this material well, it will earn you a lot of extra points!
History:
In the early decades of computing, main memory was incredibly slow and expensive, but CPUs weren't particularly fast either. Starting in the 1980s, the gap widened rapidly: microprocessor clock speeds skyrocketed, while memory access times improved far less dramatically. As this gap grew, it became increasingly clear that a new type of fast memory was needed to bridge it.
CPU Caching Basics (part 1)
Today I'm going to share cache consistency and cache hits.
Cache consistency
Excerpted in part from Wikipedia:
To keep data consistent with lower-level storage (such as main memory), updates must be propagated to it in due time. There are two write policies for doing this: write-back and write-through.
Combining the write policy with the allocation-on-miss policy gives the following:
Write-back:

- If the cache hits, memory does not need to be updated immediately, which reduces memory write operations. The allocation policy is usually write-allocate.
- How do we mark that a line has been updated since it was loaded? Each cache line provides a dirty bit to indicate whether it has been modified since it was loaded. (As mentioned earlier, the CPU loads data chunk by chunk, not byte by byte.)

Write-through:

- Write-through means that whenever the cache receives a write, it writes the data straight through to memory. If the address is also present in the cache, the cached copy must be updated at the same time. Since this design generates a lot of write traffic, a buffer is needed to reduce hardware contention. This buffer is called a write buffer and is usually no larger than four cache blocks. (Write buffers can also be used with write-back caches for the same purpose.)
- Write-through is easier to implement than write-back, and it makes it easier to maintain data consistency.
- The allocation policy is usually no-write-allocate.

For a two-level cache system, the L1 cache may use write-through to simplify the implementation, while the L2 cache uses write-back to ensure data consistency.
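To make the two policies concrete, here is a minimal, purely illustrative Java sketch (the class and method names are my own invention, not any real hardware API): a toy single-line cache that either writes straight through to a map standing in for main memory, or sets a dirty bit and writes back only on eviction.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a single cache line to illustrate write-through vs write-back. */
public class WritePolicyDemo {
    static Map<Integer, Integer> memory = new HashMap<>(); // stands in for main memory

    static class CacheLine {
        int address;
        int value;
        boolean dirty; // dirty bit: set when the line was modified after loading
    }

    static CacheLine line = new CacheLine();

    // Write-through: every write updates the cache AND goes straight to memory.
    static void writeThrough(int address, int value) {
        line.address = address;
        line.value = value;
        memory.put(address, value); // memory is always up to date
    }

    // Write-back: only the cache is updated; memory is updated on eviction.
    static void writeBack(int address, int value) {
        line.address = address;
        line.value = value;
        line.dirty = true; // remember that memory is now stale
    }

    // Eviction: a dirty line must be flushed before it is replaced.
    static void evict() {
        if (line.dirty) {
            memory.put(line.address, line.value);
            line.dirty = false;
        }
    }

    public static void main(String[] args) {
        writeThrough(0x10, 1);
        System.out.println("write-through, memory: " + memory);        // {16=1}
        writeBack(0x10, 2);
        System.out.println("write-back before evict, memory: " + memory); // still {16=1}
        evict();
        System.out.println("write-back after evict, memory: " + memory);  // {16=2}
    }
}
```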
The MESI protocol:
Here is an excellent webpage, referenced by a lot of material, with an animation: www.scss.tcd.ie/Jeremy.Jone…
I recommend playing with the animation on that site first to see how each CPU's cache and main memory read and write data.
Here's a quick walkthrough: main memory holds x = 0, and the processor has two cores, CPU0 and CPU1.
- CPU0 reads x: it first looks in CPU0's own cache. On a miss, the request goes out on the address bus, which connects the CPUs and main memory; the other caches and main memory are searched at the same time and their versions are compared, the value of x is fetched from main memory, and it is delivered to CPU0's cache over the data bus.
- CPU0 writes x + 1: it takes x = 0 from its own cache and adds 1.
- CPU1 reads x: it misses in its own cache, so it searches the other CPUs and main memory via the address bus and compares versions (if the versions are the same, the main-memory value is discarded first); it finds the latest value of x in CPU0. CPU0 pushes the value of x into CPU1's cache over the data bus, and then the value of x in main memory is updated as well.
- CPU1 writes x + 1: CPU1 takes its own cached x = 1 and increments it to 2. (This time main memory and CPU0's cache are not updated, but the other CPUs are notified via RFO.)
You can try the other situations yourself; a small simulation sketch follows below as well.
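To help follow the animation, here is a minimal single-variable MESI sketch in Java (my own toy model, not a faithful hardware simulator) that reproduces the four steps above for two CPUs:

```java
/**
 * Toy MESI model for one variable x shared by two CPU caches.
 * Useful for stepping through the read/write scenarios above by hand.
 */
public class MesiDemo {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    static int mainMemoryX = 0;
    static int[] cachedX = new int[2];                       // per-CPU cached copy of x
    static State[] state = { State.INVALID, State.INVALID }; // per-CPU line state

    static int read(int cpu) {
        if (state[cpu] == State.INVALID) {                   // cache miss
            int other = 1 - cpu;
            if (state[other] == State.MODIFIED) {            // other cache holds the newest value
                mainMemoryX = cachedX[other];                // write it back first
            }
            cachedX[cpu] = mainMemoryX;
            // If the other cache also holds the line, both become SHARED; otherwise EXCLUSIVE.
            if (state[other] != State.INVALID) {
                state[cpu] = State.SHARED;
                state[other] = State.SHARED;
            } else {
                state[cpu] = State.EXCLUSIVE;
            }
        }
        return cachedX[cpu];
    }

    static void write(int cpu, int value) {
        int other = 1 - cpu;
        if (state[other] == State.MODIFIED) {
            mainMemoryX = cachedX[other];                    // flush the other CPU's dirty copy
        }
        state[other] = State.INVALID;                        // RFO: invalidate the other copy
        cachedX[cpu] = value;
        state[cpu] = State.MODIFIED;                         // our copy is now dirty
    }

    public static void main(String[] args) {
        System.out.println("CPU0 reads x = " + read(0) + ", state0 = " + state[0]); // EXCLUSIVE
        write(0, read(0) + 1);                                                      // x + 1 on CPU0
        System.out.println("CPU1 reads x = " + read(1) + ", state1 = " + state[1]); // SHARED
        write(1, read(1) + 1);                                                      // x + 1 on CPU1
        System.out.println("memory = " + mainMemoryX + ", cpu1 cache = " + cachedX[1]);
    }
}
```

Running it prints the state after each step, which you can compare against the animation.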
Notification protocol:
The snooping protocol. This is essentially a bus-based data notification technique. CPU caches use it to identify the state of data held in other caches. If data is shared, the state of the shared data can be broadcast to the other CPU caches. The protocol requires each CPU cache to be able to "snoop" on notifications of data events and react accordingly.
MESI protocol states:
Modified (M), Exclusive (E), Shared (S), Invalid (I)
Follow the animation again with these states in mind; it's really not that complicated.
This is actually very useful for understanding the volatile keyword in Java!
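As a quick illustration of the visibility problem that volatile (together with the cache-coherence machinery above) addresses, here is a minimal sketch; the class name and the 100 ms sleep are just my own choices for the demo:

```java
/**
 * Minimal visibility sketch: with `volatile`, the reader is guaranteed to see the
 * writer's update; remove `volatile` and the loop may spin forever on some JVMs.
 */
public class VolatileVisibility {
    static volatile boolean stop = false; // try removing `volatile` to see the difference

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait; without volatile this read may be hoisted out of the loop
            }
            System.out.println("reader saw stop = true");
        });
        reader.start();

        Thread.sleep(100);
        stop = true; // volatile write: guaranteed to become visible to the reader
        reader.join();
    }
}
```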
The next section walks through some typical use cases!
To expand:
MOESI: MOESI is a complete cache-coherence protocol that contains all the states commonly used in other protocols. In addition to the four MESI states, it has a fifth "Owned" state, which represents data that has been modified and is shared. This avoids having to write modified data back to main memory before sharing it; the data must still be written back eventually, but the write-back can be deferred.
MESIF: data in the Forward state is clean and can be discarded without further communication.
AMD uses MOESI, Intel uses MESIF
I won’t go further here
Now let's go through a wave of use cases.
Case 1:
```java
public class CpuCache {
    static int LEN = 120 * 1024 * 1024; // ~480 MB of ints, run with a large enough heap
    static int arr[] = new int[LEN];

    public static void main(String[] args) {
        long currAddTwo = System.currentTimeMillis();
        addTwo();
        System.out.println(System.currentTimeMillis() - currAddTwo);

        long currAddEight = System.currentTimeMillis();
        addEight();
        System.out.println(System.currentTimeMillis() - currAddEight);
    }

    private static void addTwo() {
        for (int i = 0; i < LEN; i += 2) {
            arr[i] *= i;
        }
    }

    private static void addEight() {
        for (int i = 0; i < LEN; i += 8) {
            arr[i] *= i;
        }
    }
}
```
Try to guess how long each call takes, or how many times faster one is than the other.
Looking only at loop counts: if addTwo does 4n iterations, addEight does n.
But don't forget that the CPU loads data one cache line (64 bytes) at a time, so stepping by 2 or by 8 touches the same cache lines and takes about the same amount of time. On my machine the output is:
```
48
36
```
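If you want to locate the 64-byte boundary on your own machine, here is a small sketch of my own (not from the original article) that sweeps the stride: the time per pass stays roughly flat while the stride is within one cache line (up to 16 ints), then starts falling as whole cache lines get skipped.

```java
/** Sweep the stride and time one pass over the array; the knee in the timings hints at the cache-line size. */
public class CacheLineSweep {
    static final int LEN = 64 * 1024 * 1024; // ~256 MB of ints, give the JVM enough heap (e.g. -Xmx1g)
    static final int[] arr = new int[LEN];

    public static void main(String[] args) {
        for (int stride = 1; stride <= 256; stride *= 2) {
            long start = System.nanoTime();
            for (int i = 0; i < LEN; i += stride) {
                arr[i] *= 3; // one write every `stride` ints
            }
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("stride " + stride + " ints (" + stride * 4 + " bytes): " + millis + " ms");
        }
    }
}
```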
False sharing:
Using Martin’s example, with some modifications, the code looks like this:
```java
public class FalseShare implements Runnable {
    public static int NUM_THREADS = 2; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;
    private static VolatileLong[] longs;

    public FalseShare(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(1000);
        System.out.println("starting....");
        if (args.length == 1) {
            NUM_THREADS = Integer.parseInt(args[0]);
        }
        longs = new VolatileLong[NUM_THREADS];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
        final long start = System.currentTimeMillis();
        runTest();
        System.out.println("duration = " + (System.currentTimeMillis() - start));
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseShare(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
            // System.out.println(t);
        }
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; //, p7, p8, p9;
    }
}
```
The logic of the code: NUM_THREADS threads (set to 2 in the code above, and to 4 in the first test below) each modify a different element of the longs array. Each element is a VolatileLong, which holds a single long value plus six unused padding longs. value is declared volatile so that modifications to it are visible to all threads.
With the thread count set to 4: with the padding line (six longs) the test runs in 13s, versus only 9s with four padding longs; with the padding line commented out entirely it runs in 24s. I found this test result a little odd.
To clarify the definition of false sharing:
In Java programs, the members of an array are contiguous in memory, and therefore contiguous in the cache as well. In fact, adjacent member variables of a Java object are also loaded into the same cache line. If multiple threads operate on different member variables that happen to sit in the same cache line, false sharing occurs.
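If you want to check the actual field layout on your own JVM, the JOL (Java Object Layout) library can print it. A minimal sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

```java
import org.openjdk.jol.info.ClassLayout;

/**
 * Prints the field layout of VolatileLong: the object header, the value field, and
 * the padding longs, so you can see how many bytes one element occupies and whether
 * two neighboring elements could still share a 64-byte cache line.
 */
public class LayoutCheck {
    public static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6;
    }

    public static void main(String[] args) {
        System.out.println(ClassLayout.parseClass(VolatileLong.class).toPrintable());
    }
}
```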
The Disruptor project (github.com/LMAX-Exchan…) illustrates the problem:
A thread running on core 1 wants to update variable X, while a thread running on core 2 wants to update variable Y, and both frequently modified variables sit on the same cache line. The two threads then take turns sending RFO messages to claim ownership of that cache line.
On the surface, X and Y are operated on by separate threads and the two operations have nothing to do with each other; they merely share a cache line, yet all the contention behaves as if they shared a single variable.
Back to the code above: simply put, each array element has an object header of 8 bytes (32-bit JVM) or 12 bytes (64-bit JVM), plus the 8-byte value field; adding six long integers contributes another 48 bytes of padding, so a single element roughly fills a 64-byte cache line by itself and different objects no longer share one cache line. That avoids the frequent RFO messages caused by a shared cache line and reduces contention. So why are our measured numbers off? With six longs it took more time than with four.
The reason is that my machine has only 2 cores. With the thread count set to 2, the six-long version drops to 4s, while the version with the padding line commented out takes 10s.
So the padding lets each object occupy a cache line of its own, reducing cache-line synchronization between cores.
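As a side note, since JDK 8 you can also let the JVM do the padding for you with the @Contended annotation (sun.misc.Contended in JDK 8, moved to jdk.internal.vm.annotation in JDK 9+); it only takes effect for user classes when the JVM is started with -XX:-RestrictContended. A minimal sketch, not a drop-in replacement for the example above:

```java
import sun.misc.Contended; // JDK 8; jdk.internal.vm.annotation.Contended in JDK 9+

/**
 * Sketch of JVM-managed padding. Run with -XX:-RestrictContended, otherwise the
 * annotation is ignored for user classes and the fields may still share a cache line.
 */
public class ContendedDemo {
    @Contended
    volatile long x; // padded onto its own cache line by the JVM

    @Contended
    volatile long y; // likewise, so writers of x and y no longer false-share

    public static void main(String[] args) {
        ContendedDemo d = new ContendedDemo();
        d.x = 1;
        d.y = 2;
        System.out.println(d.x + ", " + d.y);
    }
}
```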
Queue false sharing
In the JDK's LinkedBlockingQueue there is a head reference pointing to the front of the queue and a last reference pointing to the tail. Such queues are often used in asynchronous programming, and these two references are frequently modified by different threads, yet they may well end up in the same cache line, causing false sharing. The more threads and the more cores, the greater the negative impact on performance.
Grizzly, for example, ships its own LinkedTransferQueue that differs from the one built into JDK 7: it uses PaddedAtomicReference to improve concurrency. In fact, this is a questionable coding technique that does not really make sense.
Netty also used to use PaddedAtomicReference in place of plain Node references, solving the queue false-sharing problem with padding, but that code has since been removed.
The essence of AtomicReference and LinkedTransferQueue is optimistic locking (CAS). Optimistic locking performs poorly under heavy contention and should be used where contention is light.
So Padded-AtomicReference is something of a false proposition: if contention is heavy, why not use a lock plus volatile? And if contention is light, PaddedAtomicReference has no advantage over a plain AtomicReference. Using a padded AtomicReference is therefore a questionable coding technique.
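For reference, the idea behind a padded reference is simply an AtomicReference with unused long fields appended so that two hot references cannot land on the same cache line. A minimal sketch of the concept (my own illustration, not the actual JDK 7 or Grizzly source):

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Illustration of the padded-reference idea: the extra long fields push the next
 * object's hot field onto a different cache line. Not the real JDK 7 implementation.
 */
public class PaddedAtomicReference<T> extends AtomicReference<T> {
    // 15 * 8 = 120 bytes of padding around the value held by AtomicReference
    public volatile long p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15;

    public PaddedAtomicReference(T initialValue) {
        super(initialValue);
    }
}
```

A queue could then keep its head and tail in two such padded references, so that updating one does not invalidate the cache line holding the other.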
So JDK 1.8 removed the padding logic from LinkedTransferQueue. Here is a benchmark you can run against the 1.7 code:
```java
import java.util.concurrent.*;

public class BlockingQueueTest {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 3; ++i) {
            loop();
        }
    }

    private static void loop() throws InterruptedException {
        // final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
        final BlockingQueue<Object> queue = new LinkedTransferQueue<Object>();
        for (int i = 0; i < 10; ++i) {
            queue.put(i);
        }
        final int THREAD_COUNT = 50;
        final CountDownLatch startLatch = new CountDownLatch(1);
        final CountDownLatch endLatch = new CountDownLatch(THREAD_COUNT);
        for (int i = 0; i < THREAD_COUNT; ++i) {
            Thread thread = new Thread() {
                public void run() {
                    try {
                        startLatch.await();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    try {
                        for (int i = 0; i < 1000 * 20; ++i) {
                            Object item = queue.take();
                            queue.put(item);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    } finally {
                        endLatch.countDown();
                    }
                }
            };
            thread.start();
        }
        long startMillis = System.currentTimeMillis();
        startLatch.countDown();
        endLatch.await();
        long millis = System.currentTimeMillis() - startMillis;
        System.out.println(queue.getClass().getName() + " : " + millis);
    }
}
```
50 threads contend for 10 objects; the LinkedBlockingQueue time is greater than the LinkedTransferQueue time.
On 1.7, LinkedTransferQueue was several times faster; on 1.8 the two are almost the same.
Thank you for reading this far. If this article was well written and you feel you got something out of it:
Please give it a like 👍, a follow ❤️, and a share 👥; it really does help me a lot!!
If there are any mistakes in this blog, please point them out in the comments. Thank you very much! ❤️❤️❤️❤️