
✨ I am YuShiwen, and I like to share knowledge and blog. Let's learn and grow together! 📢 We all come to knowledge at different times, but what you learn is your own; keep at it, everyone! 📢 Guide: this post has five chapters. ⛳️ Chapter 1 is an example that lets you feel the speed of cache line padding first-hand. ⛳️ Chapter 2 explains how data is transferred between memory, cache, and registers, so that you have the basics down first. ⛳️ Chapter 3 raises the problem of false sharing and explains what it is. ⛳️ Chapter 4 shows how to solve the false-sharing problem in caches (you can use cache line padding or the @Contended annotation). ⛳️ Chapter 5 is the complete code, before and after cache line padding, which you can use to tune and experiment. I hope you read patiently to the end; I promise there will be something to take away.

1. Comparison before and after cache line padding

Let's take a look at an example first, to give you a sense of how fast your code can take off once you understand the basics: defining seemingly useless member fields in a class brings a substantial speed boost. Run without cache line padding, the benchmark takes 3579 milliseconds.

Now add 7 long fields before and 7 after the variable x (56 bytes on each side; this is the cache line padding described in later sections). These 14 variables are never used anywhere in the code, yet the run takes only 1280 ms, almost three times faster. Ps: the complete code behind both measurements is in Chapter 5; you can jump ahead to read it.

Why is this so magical? Let me state the conclusions up front; the details can be revisited after reading the rest of the post.

  • Cache consistency is synchronized in units of cache lines. That is, the transfer unit between caches is the cache line, and a cache line is usually 64 bytes.
  • Whenever the contents of a cache line change, the caches must be synchronized;
  • So even though the two threads do not use the same data, the data sits in the same cache line, and as soon as the contents of that cache line change, cache synchronization has to happen, and this synchronization takes time.

2. How data is transferred between memory, cache, and registers

Why is that? We covered cache consistency earlier in this blog post: "Understand the underlying principles of high concurrency." Interviewer: Tell me about MESI (Cache Consistency Protocol); click the title to jump to it. The relationship between memory, cache, and registers is roughly as follows. The process of loading an executable file on the hard disk into a register goes like this:

  1. The executable file on the hard disk (stored as binary at the bottom) is loaded into memory; the operating system allocates resources for it, and it becomes process A, which is not yet running.
  2. After a certain period of time, CPU0's time slice is allocated to process A. CPU0 loads the thread and reads the required data from memory into the cache one cache line at a time; a cache line is now typically 64 bytes. (Remember this 64-byte cache line size; the number will come up several times later.)
  3. The data is then read from the cache into a register. The cache nowadays is usually a three-level cache; I won't draw every level here.
  4. Once the register has the data, it is sent to the Arithmetic and Logic Unit (ALU) for calculation.
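
If you want to verify the 64-byte figure on your own machine, here is a minimal sketch. It assumes a Linux system, where the kernel exposes the coherency line size under /sys; the path is a platform assumption and does not exist on Windows or macOS.

import java.nio.file.Files;
import java.nio.file.Path;

public class CacheLineSize {
    public static void main(String[] args) throws Exception {
        // Linux-only (a platform assumption): the kernel reports the
        // L1 cache line size of cpu0 in this sysfs file.
        Path p = Path.of("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size");
        System.out.println("Cache line size: " + Files.readString(p).trim() + " bytes");
    }
}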

Here is why the three-level cache was designed in the first place:

  • Computers use a clock to synchronize instructions. The clock pulses at a fixed frequency (called the clock frequency). When you buy a 1.5GHz computer, 1.5GHz is the clock frequency, i.e. 1.5 billion clock pulses per second. A complete clock pulse is called a cycle. The clock doesn't keep track of minutes and seconds; it simply beats at a constant rate.
  • The main reason is that it takes the CPU far too long to access main memory directly. The approximate time for the CPU to read data from each level of cache and from memory is as follows:
CPU access    Approximate cycles required    Approximate time required
register      1 cycle                        0 ns
L1 Cache      3-4 cycles                     1 ns
L2 Cache      10-20 cycles                   3 ns
L3 Cache      40-45 cycles                   15 ns
memory        -                              60-90 ns
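
A quick sanity check on the table: at a clock frequency of 3GHz, one cycle lasts 1 / 3,000,000,000 s ≈ 0.33 ns, so an L1 hit of 3-4 cycles works out to about 1 ns and an L3 hit of 40-45 cycles to about 15 ns, matching the rows above. A memory access of 60-90 ns therefore costs on the order of 200 cycles, during which the CPU just sits waiting. That gap is exactly why the cache hierarchy exists.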

3. Data sharing problems in the cache (true sharing and false sharing)

3.1 True sharing (different CPUs' registers use the same variable X)

Let's start with true sharing of data. Both CPU0 and CPU1 use the data X (never mind the data Y for now). If cache consistency were not enforced, the following problem would occur: in a multithreaded program, two CPUs both start by reading long X = 0 and then execute the statements below:

long X = 0;
X++;

X is initialized to 0 at the start. Say we have two threads A and B:

  1. Thread A executes on CPU0. It loads the value of X from main memory into the cache, then from the cache into a register, performs X+1 in the register, and gets X equal to 1. At this point the value X = 1 is still sitting in CPU0's cache.
  2. Because thread A's result X = 1 is still in the cache and has not been flushed back to memory, thread B, executing on CPU1, loads the value of X from memory and still sees 0. Thread B then performs X+1, gets X equal to 1, and stores it in CPU1's cache.
  3. Threads A and B both got the value 1, and after a certain period of time both flush it back to memory.
  4. After the write-back, two X++ operations have been performed, yet X is still 1.

You can see that although we performed X++ twice, X was only incremented once; this is the result of cache inconsistency.
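
Here is a minimal, runnable sketch of the lost update described above (the class and field names are mine, not from the original code). Two threads each increment a shared plain long field one million times; because the read-increment-write sequence is not atomic, the printed total is almost always far below the expected 2,000,000.

package mesi;

public class LostUpdateDemo {
    // A plain field: each thread reads, increments, and writes back non-atomically.
    static long x = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                x++; // read-modify-write, not atomic
            }
        };
        Thread a = new Thread(task, "ThreadA");
        Thread b = new Thread(task, "ThreadB");
        a.start();
        b.start();
        a.join();
        b.join();
        // Expected 2_000_000, but increments get lost along the way.
        System.out.println("x = " + x);
    }
}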

How to fix the problem:

  • Specifically, we can use the MESI protocol (see my blog post: blog.csdn.net/MrYushiwen/…) to guarantee cache consistency. When the same data sits in the caches of different CPUs, data consistency has to be considered, and synchronizing the data takes a certain amount of time; that is the price of keeping the caches consistent.
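
Note that MESI keeps the caches consistent, but it does not make X++ atomic by itself; at the Java level you still need an atomic read-modify-write. Below is a minimal sketch using AtomicLong (my choice of illustration, not from the original post), which makes the two-thread counter from above always end at 2,000,000.

package mesi;

import java.util.concurrent.atomic.AtomicLong;

public class AtomicUpdateDemo {
    static final AtomicLong x = new AtomicLong(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                x.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread a = new Thread(task, "ThreadA");
        Thread b = new Thread(task, "ThreadB");
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("x = " + x.get()); // always 2_000_000
    }
}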

3.2 False sharing (different CPUs use different variables, one uses X, the other uses Y, but X and Y are in the same cache line)

  • Cache consistency is synchronized in units of cache lines. That is, the transfer unit between caches is the cache line, and a cache line is usually 64 bytes.
  • Whenever the contents of a cache line change, the caches must be synchronized;
  • In 3.1, the registers used the same X, which necessarily sits in the same cache line; that is true sharing, and the shared data is X.
  • In 3.2, different CPUs' registers use different variables, one uses X and the other uses Y, but X and Y are in the same cache line (they were read together in one 64-byte load). Since cache consistency is synchronized per cache line, even though the two CPUs do not use the same data, the data they use (X and Y) shares one cache line, and synchronizing that line still takes time. A concrete sketch of this scenario follows the list.
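
To make the scenario concrete, here is a minimal sketch of false sharing (my own illustration; the class name is hypothetical): x plays the role of X and y the role of Y. Two long fields declared back to back are normally laid out adjacently by the JVM, so they will usually land in the same 64-byte cache line, and the two threads keep invalidating each other's copy of that line even though they never touch the same variable.

package mesi;

public class FalseSharingDemo {
    // x and y are 8 bytes each and declared back to back, so they will
    // usually end up in the same 64-byte cache line (layout is JVM-dependent).
    public volatile long x;
    public volatile long y;

    public static void main(String[] args) throws InterruptedException {
        FalseSharingDemo demo = new FalseSharingDemo();
        Thread threadA = new Thread(() -> {
            for (long i = 0; i < 100_000_000L; i++) {
                demo.x = i; // this CPU only ever writes x ...
            }
        }, "ThreadA");
        Thread threadB = new Thread(() -> {
            for (long i = 0; i < 100_000_000L; i++) {
                demo.y = i; // ... and this CPU only ever writes y, yet each
                            // write invalidates the other CPU's cache line.
            }
        }, "ThreadB");
        long start = System.nanoTime();
        threadA.start();
        threadB.start();
        threadA.join();
        threadB.join();
        System.out.println("Time: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}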

4. False-sharing solutions (cache line padding or the @Contended annotation)

4.1 Cache line padding

As shown in Chapter 1, we can pad the cache line before and after the variable x:

public volatile long A,B,C,D,E,F,G;
public volatile long x = 1L;
public volatile long a,b,c,d,e,f,g;
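Seven long fields on each side is deliberate: 7 × 8 = 56 bytes before x and 56 bytes after it, so together with the 8-byte x itself, any 64-byte window that contains x is filled entirely by this object's own fields.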

With the padding added, the layout from Section 3.2 changes: x is now surrounded on each side by 56 bytes of padding belonging to its own object.

No matter how the 64-byte cache line containing x is aligned, that cache line cannot also contain the variable Y, and likewise the cache line containing Y cannot contain x. So there is no false sharing between them, no cache consistency synchronization is needed between the two CPUs, and that synchronization time is saved.
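
If you want to inspect the padded layout yourself, the JOL (Java Object Layout) tool can print it. This is a sketch assuming the org.openjdk.jol:jol-core dependency is on the classpath; UseCacheLineFill is the class from Chapter 5.

package mesi;

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    public static void main(String[] args) {
        // Prints each field's offset and size, so you can check that the
        // A..G and a..g padding fields surround x inside the object.
        System.out.println(ClassLayout.parseClass(UseCacheLineFill.class).toPrintable());
    }
}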

4.2 The @Contended annotation

Java 8 provides the @sun.misc.Contended annotation to avoid false sharing. It works by adding 128 bytes of padding before and after the annotated object or field; using twice the size of most hardware cache lines also avoids false sharing caused by adjacent-sector prefetch. Cache lines today are typically 64 bytes, so the 128 bytes that @Contended pads on each side is more than enough. Note: for the @Contended annotation to take effect, you need to add the JVM parameter -XX:-RestrictContended at startup.

However, in Java 11 the @Contended annotation was moved into the java.base module, under the package jdk.internal.vm.annotation, which is where the Contended annotation type is now defined. I use Java 12, where the annotation can likewise be used to pad the cache line, as shown below.
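
Here is a minimal sketch of the annotation-based version (the class name is mine). Because jdk.internal.vm.annotation is not exported by default, compiling user code against it needs --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED, and the JVM must still be started with -XX:-RestrictContended for the padding to be applied outside the JDK.

package mesi;

import jdk.internal.vm.annotation.Contended;

public class ContendedCacheLineFill {
    // The JVM pads around this field, replacing the 14 hand-written dummy longs.
    @Contended
    public volatile long x = 1L;
}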

5. Complete code (with and without cache line padding)

You can also run the following code yourself to see the magic effect of cache line padding.

5.1 Code without cache line padding:

package mesi;

import java.util.concurrent.CountDownLatch;

/**
 * @Author: YuShiwen
 * @Date: 2022/2/27 2:52 PM
 * @Version: 1.0
 */

public class NoCacheLineFill {

    public volatile long x = 1L;
}

class MainDemo {

    public static void main(String[] args) throws InterruptedException {
        // CountDownLatch was introduced in java1.5 and is implemented with a counter whose initial value is the number of threads.
        // Every time a thread completes its task, the countDown method is called and the count is reduced by 1.
        // When the counter reaches 0, all threads have completed their tasks, and the thread waiting on await resumes execution.
        CountDownLatch countDownLatch = new CountDownLatch(2);

        NoCacheLineFill[] arr = new NoCacheLineFill[2];
        arr[0] = new NoCacheLineFill();
        arr[1] = new NoCacheLineFill();

        Thread threadA = new Thread(() -> {
            for (long i = 0; i < 1_000_000_000L; i++) {
                arr[0].x = i;
            }
            countDownLatch.countDown();
        }, "ThreadA");

        Thread threadB = new Thread(() -> {
            for (long i = 0; i < 1_000_000_000L; i++) {
                arr[1].x = i;
            }
            countDownLatch.countDown();
        }, "ThreadB");

        final long start = System.nanoTime();
        threadA.start();
        threadB.start();
        // Wait for thread A and thread B to complete
        countDownLatch.await();
        final long end = System.nanoTime();
        System.out.println("Time:" + (end - start) / 1 _000_000 + "毫秒"); }}Copy the code

5.2 Code with cache line padding:

package mesi;

import java.util.concurrent.CountDownLatch;

/**
 * @Author: YuShiwen
 * @Date: 2022/2/27 3:45 PM
 * @Version: 1.0
 */

public class UseCacheLineFill {

    public volatile long A, B, C, D, E, F, G;
    public volatile long x = 1L;
    public volatile long a, b, c, d, e, f, g;
}

class MainDemo01 {

    public static void main(String[] args) throws InterruptedException {
        // CountDownLatch was introduced in java1.5 and is implemented with a counter whose initial value is the number of threads.
        // Every time a thread completes its task, the countDown method is called and the count is reduced by 1.
        // When the counter reaches 0, all threads have completed their tasks, and the thread waiting on await resumes execution.
        CountDownLatch countDownLatch = new CountDownLatch(2);

        UseCacheLineFill[] arr = new UseCacheLineFill[2];
        arr[0] = new UseCacheLineFill();
        arr[1] = new UseCacheLineFill();

        Thread threadA = new Thread(() -> {
            for (long i = 0; i < 1_000_000_000L; i++) {
                arr[0].x = i;
            }
            countDownLatch.countDown();
        }, "ThreadA");

        Thread threadB = new Thread(() -> {
            for (long i = 0; i < 1_000_000_000L; i++) {
                arr[1].x = i;
            }
            countDownLatch.countDown();
        }, "ThreadB");

        final long start = System.nanoTime();
        threadA.start();
        threadB.start();
        // Wait for thread A and thread B to complete
        countDownLatch.await();
        final long end = System.nanoTime();
        System.out.println("Time:" + (end - start) / 1 _000_000 + "毫秒"); }}Copy the code

I am YuShiwen, and I like to share knowledge and blog. Let's learn and grow together! See you in the next post.