The problem

(1) What is a CPU cache line?

(2) What is a memory barrier?

(3) What is false sharing?

(4) How can false sharing be avoided?

CPU cache architecture

The CPU is the heart of the computer: it ultimately executes every instruction and every program.

Main memory (RAM) is where a program's data lives, and several levels of cache sit between the CPU and main memory, because even direct access to main memory is comparatively slow.

If you operate on the same piece of data many times, such as a loop counter, it makes sense to keep it close to the CPU while you work on it. You don't want to run out to main memory to fetch and increment it on every iteration.

The closer you are to the CPU, the faster and smaller the cache.

So the L1 cache is small but fast, and it sits right next to the CPU core that uses it.

L2 is larger and slower, but still private to a single CPU core.

L3 is common in modern multi-core machines; it is larger and slower still, and is shared by all the CPU cores on a single socket.

Finally there is main memory, which holds all the data a program runs on; it is the largest and slowest level and is shared by all CPU cores across all sockets.

When the CPU needs a piece of data, it looks for it in L1 first, then L2, then L3, and finally goes to main memory if the data is in none of the caches.

The farther out it has to go, the longer the access takes.

So if you are performing some operation very frequently, you want the data it touches to stay in the L1 cache.

CPU cache line

A cache is made up of cache lines, usually 64 bytes each (most current processors use 64-byte cache lines; some older processors used 32 bytes), and each line effectively maps to a block of addresses in main memory.
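If you are curious what the cache line size is on your own machine, Linux exposes it through sysfs. Below is a minimal sketch; it assumes a Linux system, and that index0 under cpu0 is the L1 data cache, which can vary from machine to machine:

import java.nio.file.Files;
import java.nio.file.Paths;

public class CacheLineSize {
    public static void main(String[] args) throws Exception {
        // Linux only: sysfs reports each cache's line size in bytes.
        // index0 is usually the L1 data cache, but this is machine-dependent.
        byte[] bytes = Files.readAllBytes(Paths.get(
                "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"));
        System.out.println("cache line size: " + new String(bytes).trim() + " bytes");
    }
}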

A Java long is 8 bytes, so you can store up to 8 long variables in a cache line.

As the program runs, each cache fill loads 64 consecutive bytes from main memory. Therefore, if you access an array of type long, loading one of its values into the cache brings the seven neighboring elements along with it for free.

However, if you use data structures where items are not adjacent to each other in memory, such as linked lists, you will not get the benefit of free cache loading.
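You can observe the effect of line-granular loading with a rough micro-benchmark. The sketch below is illustrative only (timings vary by machine, and a serious measurement would use a harness like JMH): a sequential pass and a stride-8 pass over the same array load the same number of cache lines, so they tend to take roughly the same time even though the second pass does one eighth of the work.

public class CacheLineEffect {
    public static void main(String[] args) {
        long[] arr = new long[8 * 1024 * 1024]; // 64MB, larger than most L3 caches

        // Pass 1: visit every element; each loaded cache line serves 8 longs.
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < arr.length; i++) {
            sum += arr[i];
        }
        System.out.println("sequential: " + (System.nanoTime() - start) / 1000000 + "ms");

        // Pass 2: visit every 8th element; the same cache lines are loaded,
        // but only one eighth of the additions are performed.
        start = System.nanoTime();
        for (int i = 0; i < arr.length; i += 8) {
            sum += arr[i];
        }
        System.out.println("stride-8: " + (System.nanoTime() - start) / 1000000 + "ms");

        System.out.println(sum); // keep sum live so the loops are not optimized away
    }
}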

There is a downside to this free loading, however. Imagine a standalone long variable a (not part of any array), with another long variable b sitting right next to it in memory: b is loaded for free whenever a is loaded.

That seems harmless, until a thread on one CPU core is modifying a while a thread on another core is reading b.

When the first thread modifies a, both a and b are loaded into a cache line of that core. After a is updated, every other cache line containing a is invalidated (by the cache coherence protocol, e.g. MESI), because the copy of a in those caches is no longer the latest value.

When the second thread then reads b, it finds its cache line invalid and must reload it from main memory.

Keep in mind that the cache works in units of whole lines, so invalidating a's line also invalidates b's, and vice versa.

The problem is that b is completely unrelated to a, yet every update to a forces b to be reread from main memory, slowing things down with a cache miss.

This is known as false sharing.

False sharing

Having covered the CPU cache architecture and the cache-line mechanism, we can now turn to the topic of false sharing.

When multiple threads modify variables that are independent of each other but happen to share the same cache line, they inadvertently hurt each other's performance; this is called false sharing.

Let's look at an example that illustrates exactly how false sharing behaves.

public class FalseSharingTest {

    public static void main(String[] args) throws InterruptedException {
        testPointer(new Pointer());
    }

    private static void testPointer(Pointer pointer) throws InterruptedException {
        long start = System.currentTimeMillis();
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 100000000; i++) { pointer.x++; }
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 100000000; i++) { pointer.y++; }
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(System.currentTimeMillis() - start);
        System.out.println(pointer);
    }
}

class Pointer {
    volatile long x;
    volatile long y;
}

In this example, we declare a Pointer class that contains two variables, x and y (they must be declared volatile to ensure visibility, but more on memory barriers later). One thread increments x 100 million times while another thread increments y 100 million times.

As you can see, x has nothing to do with y at all, yet updating x invalidates the other cache lines containing x, and those same lines contain y. Running this program prints a time of 3890ms.

Avoiding false sharing

As we know, a cache line is 64 bytes and a long is 8 bytes, which makes false sharing easy to avoid. The author summarizes the following three methods:

(1) Pad with seven more long fields between the two long variables

We change the Pointer above to the following structure:

class Pointer {
    volatile long x;
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long y;
}

Run the program again and the output time magically drops to 695ms.
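If you want to confirm that the padding really keeps x and y apart in the object layout (and that the JVM has not reordered the fields), the OpenJDK JOL tool can print field offsets. A sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

import org.openjdk.jol.info.ClassLayout;

public class LayoutCheck {
    public static void main(String[] args) {
        // Prints every field's offset; with the padding in place, x and y
        // should end up at least 64 bytes apart, i.e. on different lines.
        System.out.println(ClassLayout.parseClass(Pointer.class).toPrintable());
    }
}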

(2) Create your own long wrapper instead of using Java's primitive long directly

Change Pointer to the following:

class Pointer {
    MyLong x = new MyLong();
    MyLong y = new MyLong();
}

class MyLong {
    volatile long value;
    long p1, p2, p3, p4, p5, p6, p7;
}

At the same time, change pointer.x++ to pointer.x.value++ and pointer.y++ to pointer.y.value++. Running the program again takes 724ms.
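For clarity, the changed lines in testPointer now read:

Thread t1 = new Thread(() -> {
    for (int i = 0; i < 100000000; i++) { pointer.x.value++; }
});
Thread t2 = new Thread(() -> {
    for (int i = 0; i < 100000000; i++) { pointer.y.value++; }
});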

(3) Use the @sun.misc.Contended annotation (Java 8)

Modify MyLong as follows:

@sun.misc.Contended
class MyLong {
    volatile long value;
}

This annotation has no effect by default; you must add -XX:-RestrictContended to the JVM startup parameters for it to work. Rerunning the program then takes 718ms.
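For example, assuming the test class above was compiled as FalseSharingTest:

java -XX:-RestrictContended FalseSharingTest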

Note that the first two of the three methods above work by adding padding fields that serve no functional purpose, and the JVM may optimize them away, so the third method is recommended.

Conclusion

(1) The CPU has multiple levels of cache; the closer to the CPU, the smaller and faster the cache;

(2) Data in the CPU cache is handled in units of cache lines;

(3) Cache lines bring the benefit of loading adjacent data for free, which is why processing arrays is so fast;

(4) Cache lines also have a downside: threads modifying unrelated variables can slow each other down, which is false sharing;

(5) The main idea for avoiding false sharing is to keep unrelated variables out of the same cache line;

(6) The first way is to add seven long fields between every two variables;

(7) The second is to wrap values in your own padded class instead of using a bare long;

(8) The third is to use the @sun.misc.Contended annotation provided by Java 8.

Bonus

Which classes in Java guard against false sharing?

Remember the ConcurrentHashMap source analysis?

Its size() counting is segmented, and each segment uses a class called CounterCell, which carries the @sun.misc.Contended annotation.
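Its definition in the JDK 8 source is roughly the following (simplified; check your own JDK's ConcurrentHashMap source for the exact code):

// Simplified from the JDK 8 ConcurrentHashMap source.
@sun.misc.Contended static final class CounterCell {
    volatile long value;
    CounterCell(long x) { value = x; }
}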

If you haven't read it, you can follow my WeChat public account “Tong Elder brother read source code” and find that article in the message history.

Besides this class, Java's LongAdder also uses this annotation to avoid false sharing. In the next chapter, we will walk through the LongAdder source code.
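As a quick preview, LongAdder spreads contended increments across multiple padded cells internally; using it as a concurrent counter is straightforward:

import java.util.concurrent.atomic.LongAdder;

public class LongAdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder counter = new LongAdder();
        Runnable task = () -> {
            for (int i = 0; i < 100000000; i++) { counter.increment(); }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counter.sum()); // prints 200000000
    }
}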

What other uses of false-sharing avoidance do you know of?


Welcome to follow my WeChat public account “Tong Elder brother read source code” for more source-code series articles, and tour the ocean of source code with me.