1. Foreword

As the project neared its end, I finally had some spare time, so I picked up a book I had left unfinished, Java High-Concurrency Programming, and forced myself to study it. When I reached the chapter on Disruptor, a lock-free concurrency framework, I came across the concept of false sharing, described by many as a silent performance killer in concurrent applications. Shame on me for only learning about it now.

What exactly is false sharing? No hurry. Let's pour a glass of water and review the computer architecture basics we left behind at school.

2. Concept analysis

CPU cache (three levels)

Cache memory is temporary storage that sits between the CPU and main memory. Its capacity is much smaller than main memory's, but it exchanges data with the CPU much faster. There are several levels of cache between the CPU and main memory: CPU caches are divided into level 1 and level 2 caches, and many CPUs also have a level 3 cache. The data held at each cache level is a subset of the data held at the next level down, and the closer a cache is to the CPU, the faster and smaller it is.

The main purpose of the cache is to bridge the mismatch between CPU computation speed and memory read/write speed: the CPU computes far faster than memory can be read or written, so without a cache the CPU would spend a long time waiting for data to arrive or to be written back to memory. The data in the cache is only a small part of what is in memory, but it is the part the CPU is about to access in the near future. When the CPU needs that data, it can bypass main memory and read it directly from the cache, which greatly speeds up access.

If our program performs the same operation on the same block of data many times, loading that data into a cache close to the CPU while the operation runs can speed the program up considerably.

We use L1, L2 and L3 to denote the level 1, level 2 and level 3 caches. Ordered by how closely they are coupled to the CPU, the speeds are L1 > L2 > L3 > main memory, and the capacities are L1 < L2 < L3 < main memory.

The L1 cache is small but fast and sits next to the CPU core that uses it. The L2 cache is larger and slower, but is still private to a single core. The L3 cache is larger and slower still, and is shared by all CPU cores on a single socket. Finally there is main memory, which is shared by all cores on all sockets. On a CPU with a level 3 cache, the cache hit rate is around 95%, so less than 5% of the data has to be fetched from main memory. A schematic diagram of the three-level cache:
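Diagram aside, you can get a rough feel for these levels from code. The sketch below is my own, not from the article: it walks arrays of different sizes and prints the average time per access. The working-set sizes chosen as stand-ins for L1/L2/L3 are assumptions, hardware prefetching can blur the boundaries, and a serious measurement should use a harness such as JMH.

public class CacheLevelDemo {

    // Walks a long[] one cache line at a time (stride of 8 longs = 64 bytes)
    // and reports the average nanoseconds per access.
    static double nsPerAccess(int lengthInLongs) {
        long[] data = new long[lengthInLongs];
        int mask = lengthInLongs - 1;            // lengthInLongs must be a power of two
        int index = 0;
        long sum = 0;
        final int accesses = 50_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < accesses; i++) {
            sum += data[index];
            index = (index + 8) & mask;          // jump to the next 64-byte cache line
        }
        long elapsed = System.nanoTime() - start;
        if (sum == 42) System.out.println("!");  // keep the JIT from optimising the loop away
        return (double) elapsed / accesses;
    }

    public static void main(String[] args) {
        // Working sets of 32 KB, 256 KB, 8 MB and 64 MB: rough stand-ins for
        // L1, L2, L3 and main memory on many desktop CPUs (an assumption).
        for (int kb : new int[]{32, 256, 8 * 1024, 64 * 1024}) {
            int longs = kb * 1024 / 8;
            System.out.printf("working set %6d KB: %.2f ns/access%n", kb, nsPerAccess(longs));
        }
    }
}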

Cache line

A cache line is the smallest unit in the CPU cache; the CPU cache consists of many cache lines. A cache line is usually 64 bytes in size (note: this depends on the CPU; this article assumes 64 bytes and does not discuss other sizes such as 32 bytes), and it corresponds to a block of addresses in main memory. A Java long is 8 bytes, so a cache line can hold up to 8 long values. Thus, if you access an array of longs, loading one element into the cache brings the neighboring seven along with it, which lets you walk through the array very quickly. In fact, you can iterate very quickly over any data structure allocated in a contiguous chunk of memory. If the items of a data structure are not adjacent in memory (as in a linked list), you do not get this benefit, and each item may incur a cache miss.
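As a quick illustration of the point about contiguous memory (my own sketch, not from the article), the following compares summing a long[] with summing a LinkedList<Long>. The array is traversed one cache line at a time and benefits from the eight-longs-per-line loading described above, while the linked list chases pointers to nodes that may be scattered across the heap. Absolute numbers depend on the JVM and heap layout, so treat this as a sketch, not a benchmark.

import java.util.LinkedList;
import java.util.List;

public class CacheLineTraversal {
    public static void main(String[] args) {
        final int n = 2_000_000;

        long[] array = new long[n];
        List<Long> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add((long) i);
        }

        long start = System.nanoTime();
        long arraySum = 0;
        for (int i = 0; i < n; i++) {
            arraySum += array[i];   // contiguous: one loaded cache line serves 8 elements
        }
        System.out.println("array sum=" + arraySum + " took "
                + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        long listSum = 0;
        for (long v : list) {
            listSum += v;           // pointer chasing: each node (and boxed Long) may miss the cache
        }
        System.out.println("list  sum=" + listSum + " took "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}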

The MESI protocol

The MESI protocol is an Invalidate-based cache consistency protocol and is one of the most commonly used protocols to support write-back caching.

Cache line states

A CPU cache operates in units of cache lines, and the MESI protocol describes the state of a cache line in a multi-core processor. (Mainstream processors use it to maintain cache coherence and memory coherence.)

In the MESI protocol, a cache line is in one of four states:

M (Modified): the local processor has modified the cache line; it is a dirty line whose contents differ from those in memory, and this is the only cached copy.
E (Exclusive): the contents of the cache line are the same as those in memory, and no other processor holds a copy.
S (Shared): the contents of the cache line are the same as those in memory, but other processors may also hold a copy.
I (Invalid): the cache line is invalid and cannot be used.

State transitions

In the MESI protocol, each cache's controller not only knows about its own reads and writes, but also snoops on the operations of other caches. The state of each cache line migrates among the four states according to the read and write operations of the local core and of the other cores. The MESI state transition diagram is as follows:

Initial: at first the cache line holds no data, so it is in the I state.
Local Write: if the local processor writes data to a cache line in the I state, the line's state becomes M.
Local Read: if the local processor reads a cache line in the I state, the cache clearly has no data for it. There are two cases: (1) no other processor's cache holds the data, so the line is loaded from memory and set to the E state, meaning "I am the only one holding this data; no other processor has it"; (2) some other processor's cache does hold the data, in which case the line is loaded and set to the S state. (Note: if a cache line is in the M state and is then read or written by the local processor, its state does not change.)
Remote Read: suppose there are two processors, C1 and C2. If C2 needs to read a cache line held by C1, C1 must send the contents of that line to C2 through the memory controller. After receiving it, C2 sets the corresponding cache line to the S state. The memory also picks this data up from the bus and saves it, so memory stays consistent before the state is set.
Remote Write: strictly speaking this is not a separate remote write; rather, after C2 obtains C1's data it wants to write it, not just read it, while C1 still holds a copy. C2 issues an RFO (Request For Ownership) asking for the right to write this line of data; the corresponding cache line in every other processor is set to I, and no one but C2 may touch the line. This keeps the data safe, but issuing RFO requests and invalidating the other copies imposes a significant performance cost on writes.
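To make the transition rules concrete, here is a small Java model of the table above. It is my own simplification, not from the article, and obviously not how the hardware implements it: it tracks the state of one core's copy of a cache line, driven by local and remote read/write events.

public class MesiModel {
    // The four MESI states described above.
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // Events seen by one cache line: the local core reads/writes, or a remote core reads/writes.
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    // Next state of THIS core's copy. 'sharedElsewhere' says whether another cache
    // already held the data when we loaded it (the two Local Read cases above).
    static State next(State s, Event e, boolean sharedElsewhere) {
        switch (e) {
            case LOCAL_READ:
                if (s == State.INVALID) return sharedElsewhere ? State.SHARED : State.EXCLUSIVE;
                return s;                              // M/E/S are unchanged by a local read
            case LOCAL_WRITE:
                return State.MODIFIED;                 // writing always dirties our copy
            case REMOTE_READ:
                return s == State.INVALID ? State.INVALID : State.SHARED; // we supply the data, both end up S
            case REMOTE_WRITE:
                return State.INVALID;                  // another core's RFO invalidates our copy
        }
        throw new AssertionError();
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = next(s, Event.LOCAL_READ, false);   // -> EXCLUSIVE
        s = next(s, Event.LOCAL_WRITE, false);  // -> MODIFIED
        s = next(s, Event.REMOTE_WRITE, false); // -> INVALID (the other core sent an RFO)
        System.out.println(s);
    }
}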

False sharing

With these concepts in mind, let's ask a question: what happens if multiple threads operate on different member variables that happen to sit on the same cache line?

That's right, false sharing! Let's look at a classic illustration of a CPU cache line:

Note: a thread running on core 1 wants to update variable X, while another thread running on core 2 wants to update variable Y, and both of these frequently modified variables sit on the same cache line. The two threads take turns sending RFO (Request For Ownership) messages to claim ownership of that cache line. When core 1 acquires ownership and starts updating X, the corresponding cache line on core 2 must be set to the I (invalid) state; when core 2 acquires ownership and starts updating Y, the corresponding cache line on core 1 must be set to the I (invalid) state. Trading ownership back and forth not only produces a large number of RFO messages; when a thread then needs to read the line, its L1 and L2 copies are invalid and the data is only up to date in the shared L3 cache. As we saw earlier, reading from L3 already hurts performance, and in the worse case of a cross-socket read or an L3 miss the data can only be loaded from main memory.

For example:

ArrayBlockingQueue has three member variables:

- takeIndex: the index of the position from which the next element can be taken
- putIndex: the index of the position at which the next element can be inserted
- count: the number of elements in the queue

These three variables can easily end up in the same cache line, yet modifications to them are not closely related. Each modification therefore invalidates the copies cached by other cores, so the cache line cannot be used to full effect.

When a producer thread puts an element into the ArrayBlockingQueue, putIndex changes, which invalidates the cache line in the consumer thread's cache and forces it to be re-read from a higher cache level or from memory. This inability to make full use of a cache line is what we call false sharing.
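If you want to check whether fields really land next to each other, the OpenJDK JOL tool can print an object's field layout. The sketch below is my own and assumes the org.openjdk.jol:jol-core dependency is on the classpath; exact offsets vary by JVM version and flags, so the output is only illustrative.

import java.util.concurrent.ArrayBlockingQueue;
import org.openjdk.jol.info.ClassLayout;

public class QueueLayout {
    public static void main(String[] args) {
        // Prints each field of ArrayBlockingQueue with its offset and size.
        // Fields such as takeIndex, putIndex and count typically end up within
        // a few bytes of each other, i.e. on the same 64-byte cache line.
        System.out.println(ClassLayout.parseClass(ArrayBlockingQueue.class).toPrintable());
    }
}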

The CPU cache stores data in units of cache lines. When multiple threads modify variables that are logically independent of each other, but those variables happen to share a cache line, the threads unintentionally degrade each other's performance. This is false sharing.

3. Program simulation

Four threads each modify a different element of an array whose element type is VolatileLong. VolatileLong contains one long field, value, plus six unused long fields for padding. value is declared volatile so that changes to it are visible to all threads. The main code is as follows:

public class FalseShare implements Runnable {

    public static int NUM_THREADS = 4;
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;
    private static VolatileLong[] longs;
    public static long SUM_TIME = 0L;

    public FalseShare(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    private static void exeTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseShare(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
// public long p1, p2, p3, p4, p5, p6; // cache line padding
    }

    public static void main(final String[] args) throws Exception {
        for (int j = 0; j < 10; j++) {
            System.out.println("The first" + j + "Time...");
            longs = new VolatileLong[NUM_THREADS];
            for (int i = 0; i < longs.length; i++) {
                longs[i] = new VolatileLong();
            }
            long start = System.nanoTime();
            exeTest();
            long end = System.nanoTime();
            SUM_TIME += end - start;
        }
        System.out.println("Average time:" + SUM_TIME / 10); }}Copy the code

First execution:

// public long p1, p2, p3, p4, p5, p6; // cache line padding

Second execution:

          public long p1, p2, p3, p4, p5, p6;     // cache line padding

Each execution of the program runs the test 10 times and computes the average time. The results:

First execution: average time: 28305116160
Second execution: average time: 14071204270

[Explanation]: a cache line is 64 bytes and a long is 8 bytes. A Java object header is 8 bytes on a 32-bit JVM and 12 bytes on a 64-bit JVM with compressed pointers (16 bytes with compression disabled), so padding the object with 6 unused long fields adds 6 * 8 = 48 bytes. This is enough to put different VolatileLong objects on different cache lines and thereby avoid false sharing. (It does not matter that on a 64-bit JVM the object then exceeds 64 bytes; what matters is that different threads never operate on the same cache line.)
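A related note on manual padding: the JVM is free to reorder fields within a class, so padding fields declared next to the hot field are not guaranteed to stay next to it. A common workaround, popularized by the Disruptor library, is to split the padding across superclasses, because a superclass's fields are laid out before a subclass's fields. A minimal sketch follows; the class names are mine, not from the article.

abstract class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7;        // padding before the hot field
}

abstract class Value extends LhsPadding {
    protected volatile long value;                    // the field the threads actually contend on
}

abstract class RhsPadding extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15;  // padding after the hot field
}

public class PaddedLong extends RhsPadding {
    public void set(long v) { value = v; }
    public long get() { return value; }
}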

4. Avoiding false sharing

1. Cache line padding (keep objects operated on by different threads in different cache lines)

1) Pad cache lines manually

2) Pad automatically with the @Contended annotation (see the sketch after this list)

2. Use compiler directives to force each variable into its own aligned storage (omitted here)
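A minimal sketch of the annotation approach (not from the article): @sun.misc.Contended was added in Java 8, and for application classes it only takes effect when the JVM is started with -XX:-RestrictContended; in Java 9 and later the annotation moved to jdk.internal.vm.annotation and needs an --add-exports flag to be visible to application code.

import sun.misc.Contended;

public class ContendedCounter {
    // Asks HotSpot to add padding (128 bytes by default) around this field so that
    // it does not share a cache line with neighbouring fields.
    @Contended
    public volatile long value = 0L;
}

Run with java -XX:-RestrictContended ... so the annotation is honored for application classes; this is the same mechanism JDK-internal classes such as Thread use for their own hot fields.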

5. Summary

1) The CPU cache works in units of cache lines; the root cause of false sharing is different CPU cores operating on the same cache line at the same time.
2) False sharing can be solved by cache line padding; Java 8 introduced the @sun.misc.Contended annotation for automatic padding.
3) A cache line has four states: M (modified), E (exclusive), S (shared), I (invalid).
4) The state of each cache line migrates among the four states according to the read and write operations of the local core and the other cores.
5) Not every scenario needs the false sharing fix: the CPU cache is limited, and padding sacrifices some of it.

over