Baidu distributed ID generator (GitHub)

At the same time, it pads the cache line (CacheLine) to avoid the hardware-level "false sharing" problem that would otherwise arise in the RingBuffer.

This raises a question: what is false sharing?

False sharing is a problem rooted in CPU hardware, so some hardware background is needed first.

The CPU cache

In a modern computer, the CPU can process data at rates of thousands of gigabytes per second, far faster than main memory can supply it; caches exist to bridge that gap.

Inside the CPU core is a physical structure called the register, which can be read at the full speed of the CPU but has very little capacity.

Besides registers, there are the L1 and L2 caches: the closer a cache sits to the CPU core, the faster and smaller it is. So the L1 cache is small but fast and sits right next to the core that uses it; L2 is larger and slower, and is still private to a single core.

L3 is larger and slower still, but is shared by a group of CPU cores; and finally there is main memory, which is shared by all cores.

The L1 Cache has the smallest capacity and the highest speed. Each core has its own L1 Cache (strictly speaking, two of them: an L1d Cache for data and an L1i Cache for instructions).

The L2 Cache is larger, for example 256KB, and slower. In general, each core also has a private L2 Cache. The L2 cache acts as a buffer for the L1 cache: L1 is expensive to manufacture and therefore limited in capacity, so L2 holds data the CPU needs but L1 cannot store.

The L3 Cache is the largest of the three levels, for example 12MB, and the slowest. A group of CPU cores shares one L3 Cache. In the past, some CPUs did not even have an L3 cache; the L3 cache and main memory can be thought of as buffers for the L2 cache, with capacity increasing and manufacturing cost per unit decreasing at each level.

The L3 Cache differs fundamentally from L1 and L2: L1 and L2 are private to each CPU core, while L3 is shared by a group of cores, and can be thought of as a smaller but faster memory.

Cache line

The CPU does not fetch data into the cache one value at a time; it transfers data in a fixed minimum unit of 64 bytes. Each such 64-byte unit is called a cache line.

Most current CPUs use a 64-byte cache line. Assuming a level 1 cache of 512 bytes, the number of cache lines it can hold is 512 / 64 = 8.
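If you want to check your own machine's cache line size from Java, one option on Linux is reading sysfs (a minimal sketch; the path below is Linux-specific, and other operating systems need different mechanisms):

import java.nio.file.Files;
import java.nio.file.Paths;

public class CacheLineSize {
    public static void main(String[] args) throws Exception {
        // On Linux, the kernel exposes cpu0's L1d cache line size under sysfs.
        byte[] bytes = Files.readAllBytes(Paths.get(
                "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"));
        System.out.println("Cache line size: " + new String(bytes).trim() + " bytes"); // typically 64
    }
}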

The false sharing problem

Suppose a cache line holds eight values of type long, [x, a, b, y, c, d, e, f], where x is written by thread 1 and y is written by thread 2.

If thread 1 and thread 2 write at the same time, a problem arises. Say thread 1's write completes first: the copy of the cache line held by thread 2's core becomes invalid. Before writing, thread 2 must check whether the line has been modified, and if it has, re-read it from L3 (or memory) and redo its work.

When the same cache line is hotly contended like this, CPU performance suffers severely even though the threads never touch the same variable. This is false sharing.
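Here is a minimal sketch that makes the effect observable (illustrative only, not a rigorous benchmark; the class and field names are made up, and exact field layout is ultimately up to the JVM):

public class FalseSharingDemo {

    static class Counters {
        volatile long x;                      // written only by thread 1
        // long p1, p2, p3, p4, p5, p6, p7;   // uncomment to pad x and y onto separate lines
        volatile long y;                      // written only by thread 2
    }

    public static void main(String[] args) throws InterruptedException {
        final Counters c = new Counters();
        Thread t1 = new Thread(() -> { for (long i = 0; i < 100_000_000L; i++) c.x = i; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < 100_000_000L; i++) c.y = i; });

        long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.printf("took %d ms%n", (System.nanoTime() - start) / 1_000_000);
        // With the padding commented out, x and y usually share one 64-byte line,
        // and the run is typically several times slower than with padding enabled.
    }
}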

The Baidu distributed ID generator source code

So how does the Baidu distributed ID generator solve the false sharing problem?

You can see it in the source code:

import java.util.concurrent.atomic.AtomicLong;

public class PaddedAtomicLong extends AtomicLong {
    private static final long serialVersionUID = -3415778863941386253L;

    /** Padding fields; they carry no meaningful data. */
    public volatile long p1, p2, p3, p4, p5, p6 = 7L;

    public PaddedAtomicLong() {
        super();
    }

    public PaddedAtomicLong(long initialValue) {
        super(initialValue);
    }

    /** Reads the padding so the JIT cannot eliminate the unused fields. */
    public long sumPaddingToPreventOptimization() {
        return p1 + p2 + p3 + p4 + p5 + p6;
    }
}

A long is 8 bytes, so p1 through p6 together occupy 48 bytes.

At first glance that is not 64 bytes. Note, though, that serialVersionUID is a static field: it lives with the class, not in each instance, so it contributes nothing to the object's layout.

What fills out the line is the rest of the object. PaddedAtomicLong extends AtomicLong, so every instance also carries AtomicLong's 8-byte long value field, plus an object header (8 to 16 bytes on a 64-bit JVM).

Header + value + p1..p6 adds up to at least 64 bytes, one full cache line, so the hot value field does not end up sharing its line with another thread's hot data. Apart from value, none of these fields has any practical meaning: CPU cache is deliberately wasted, trading space for speed.
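If you want to verify the layout, OpenJDK's JOL tool can print it (a sketch; org.openjdk.jol:jol-core is an external dependency, and the expected output in the comment assumes a typical 64-bit HotSpot):

import org.openjdk.jol.info.ClassLayout;

public class LayoutCheck {
    public static void main(String[] args) {
        // Prints the field offsets and total instance size of PaddedAtomicLong.
        System.out.println(ClassLayout.parseClass(PaddedAtomicLong.class).toPrintable());
        // Expect: object header, then AtomicLong.value, then p1..p6,
        // for an instance size of at least 64 bytes.
    }
}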

The method below is an explicit use of the padding fields: it prevents the JVM's JIT compiler from treating p1 through p6 as unused and optimizing them away.

public long sumPaddingToPreventOptimization() {
    return p1 + p2 + p3 + p4 + p5 + p6;
}

Padding code like this is clever but hard to read. Since Java 8 there is an annotation that achieves the same effect more elegantly.

Java 8 false-sharing support

Straight to the code:

import sun.misc.Contended;

public class DemoLong {
    @Contended
    private long count;
}

Java 8 added the @Contended annotation; at runtime, the JVM automatically pads the cache line around the annotated field.

Note, however, that for your own classes you need to start the application with -XX:-RestrictContended, or the annotation will not take effect.

The underlying principle is still cache-line padding, so CPU cache is still wasted.
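As a sketch of how the annotation replaces the manual padding (the class and field names here are illustrative; on Java 8 the annotation lives in sun.misc, and in Java 9+ it moved to jdk.internal.vm.annotation):

// Launch with: java -XX:-RestrictContended ContendedCounters
import sun.misc.Contended;

public class ContendedCounters {
    @Contended          // the JVM inserts padding around this field at runtime
    volatile long tail;

    @Contended
    volatile long cursor;

    public static void main(String[] args) {
        ContendedCounters c = new ContendedCounters();
        c.tail++;       // tail and cursor now sit on separate cache lines,
        c.cursor++;     // so concurrent writers no longer falsely share
        System.out.println(c.tail + " " + c.cursor);
    }
}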

Application scenarios

A few years ago, most low-end CPUs had no L3 cache at all; today L3 caches are common and have become an important direction for improving CPU performance (AMD's Zen 3 architecture, for example, improves overall performance largely by boosting L3 cache performance). Even so, L3 capacity is still precious.

So cache-line padding should be used sparingly: every padded line is cache that other code cannot use.

For a distributed ID generator, however, the amount of hot, frequently written data is small. In Baidu's case, only the ring buffer's tail and cursor use PaddedAtomicLong, so the cache consumed is tiny.
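For concreteness, a simplified sketch of how the ring's two hot fields are declared, reusing the PaddedAtomicLong shown above (the tail/cursor names come from the project; the surrounding class and methods here are illustrative):

import java.util.concurrent.atomic.AtomicLong;

public class RingBufferSketch {
    private static final long START_POINT = -1;

    // The only two heavily contended fields in the ring; each one
    // gets its own padded cache line via PaddedAtomicLong.
    private final AtomicLong tail = new PaddedAtomicLong(START_POINT);   // producer position
    private final AtomicLong cursor = new PaddedAtomicLong(START_POINT); // consumer position

    public long nextTail() { return tail.incrementAndGet(); }
    public long nextCursor() { return cursor.incrementAndGet(); }
}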

At the same time, the generator is accessed by many threads concurrently, so those two fields are heavily contended. If false sharing occurred there, ID-generation performance would drop significantly and become an obvious bottleneck.

So for an ID generator, padding the cache lines does more good than harm overall.


Thank you!