For almost any program, the CPU cannot complete its computation alone; at the very least it has to interact with memory, reading the input data and writing back the results.

Modern CPUs are extremely fast, but the I/O speed of storage devices is very slow by comparison: typically, a CPU can execute hundreds of instructions in the time a single read or write to main memory takes. To bridge this gap, modern CPUs add one or more levels of cache whose read/write speed is close to the CPU's processing speed. The CPU reads the data it needs into the cache so that computation can proceed quickly, and when the computation is complete it writes the results back toward main memory, so it rarely has to stall waiting on memory reads and writes.

Multi-level cache – filling the gap between memory read/write speed and CPU computing speed

(The author's schematic diagram of the modern CPU cache hierarchy is omitted here.) The approximate read/write speeds of the cache levels are as follows:

Cache         Clock cycles (approx.)   Time (approx.)
Main memory   –                        80 ns
L3            40                       15 ns
L2            10                       3 ns
L1            3–4                      1 ns
Register      1                        –

The closer a cache sits to the CPU, the faster its reads and writes, and consequently the smaller its capacity and the higher its cost. When the CPU needs data, it searches the nearest cache first and moves outward level by level on a miss. On a hit there is no need to read from main memory at all; the data is used directly for the computation. On a miss the data is loaded from main memory and written into the cache levels along the way, so the next access can be served straight from the cache.
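
As a rough mental model only, the lookup order can be sketched in Java; real caches are managed entirely by hardware, and every name below is invented for illustration:

import java.util.HashMap;
import java.util.Map;

public class CacheLookupModel {
	static Map<Long, Long> l1 = new HashMap<>();
	static Map<Long, Long> l2 = new HashMap<>();
	static Map<Long, Long> l3 = new HashMap<>();
	static Map<Long, Long> mainMemory = new HashMap<>();

	static long read(long address) {
		Long v = l1.get(address);                                // nearest level first
		if (v == null) v = l2.get(address);                      // miss: try the next level
		if (v == null) v = l3.get(address);
		if (v == null) v = mainMemory.getOrDefault(address, 0L); // miss everywhere: go to RAM
		l1.put(address, v); // keep the loaded data close to the CPU for next time
		return v;
	}

	public static void main(String[] args) {
		mainMemory.put(42L, 7L);
		System.out.println(read(42L)); // first read walks all the way to main memory
		System.out.println(read(42L)); // second read hits L1 directly
	}
}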

Locality principle and Cache lines

When a CPU accesses memory, whether it is fetching instructions or data, the storage units it touches tend to cluster in a small contiguous region.

  • When we read the value of an int variable i from memory, does the CPU really load only the 4 bytes of i into the cache?

The answer is: NO!!

When we read a 4-byte int variable, the hardware assumes that the program will very likely access the neighboring data next, so it loads that adjacent data into the cache as well. The next read can then be served directly from the cache without touching main memory, which reduces the number of main-memory accesses and improves the cache hit ratio.

In plain English, the CPU reads data block by block, even if you only need 1 byte of it; such a block is called a Cache Line. The Cache Line size differs between CPUs; on most Intel CPUs it is 64 bytes.

Each level of cache consists of a number of Cache Lines. Every time the CPU pulls data from main memory, the adjacent data is stored into the same Cache Line.
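
Java has no standard API for querying the Cache Line size, but on Linux it is usually exposed through sysfs. A minimal sketch (Linux-only; the exact sysfs path is an assumption about the target machine):

import java.nio.file.Files;
import java.nio.file.Paths;

public class PrintCacheLineSize {
	public static void main(String[] args) throws Exception {
		// Line size of the L1 data cache in bytes; on most x86 machines this prints 64.
		String path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size";
		System.out.println(new String(Files.readAllBytes(Paths.get(path))).trim());
	}
}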

The following test code performs 10 million array reads in each of two ways. When the Cache Line is defeated, it takes 11ms; when the Cache Line feature is used well, it takes only 3ms.

public class CacheLine {
	static int length = 8 * 10000000;
	static long[] arr = new long[length];

	public static void main(String[] args) {
		long temp; // no special meaning; just a sink for the values we read

		// 1. Hop 8 longs (64 bytes) per read: each read lands outside the previous read's Cache Line
		long start = System.currentTimeMillis();
		for (int i = 0; i < length; i += 8) {
			temp = arr[i];
		}
		long end = System.currentTimeMillis();
		System.out.println(end - start);// 11ms

		// 2. Read sequentially: the Cache Line takes effect; only the first eighth of the array is read, so both loops do 10 million reads
		start = System.currentTimeMillis();
		for (int i = 0; i < length / 8; i++) {
			temp = arr[i];
		}
		end = System.currentTimeMillis();
		System.out.println(end - start);// 3ms
	}
}

False sharing

When multiple threads read and write shared variables at the same time, the cache coherence protocol invalidates the entire Cache Line whenever any data within it is modified. If variables that have nothing to do with each other are allocated to the same Cache Line, each thread's writes keep invalidating the other thread's copy of that line, and the benefit of the Cache Line is lost. This phenomenon is called "false sharing".

The following code starts two threads that modify the shared variables a and b respectively. Since a and b together occupy only 16 bytes, they can easily be allocated to the same Cache Line. Although the two threads modify data that has nothing to do with each other, each write invalidates the other thread's copy of the Cache Line, forcing a reload from main memory and degrading the program's performance.

import java.util.concurrent.CountDownLatch;

public class FalseShare {
	static volatile long a;
	static volatile long b;

	public static void main(String[] args) throws InterruptedException {
		CountDownLatch cdl = new CountDownLatch(2);

		long t1 = System.currentTimeMillis();
		new Thread(()->{
			for (long i = 0; i < 1_0000_0000L; i++) {
				// this thread only writes a
				FalseShare.a = i;
			}
			cdl.countDown();
		}).start();
		
		new Thread(()->{
			for (long i = 0; i < 1_0000_0000L; i++) {
				// this thread only writes b
				FalseShare.b = i;
			}
			cdl.countDown();
		}).start();
		cdl.await();
		long t2 = System.currentTimeMillis();
		System.err.println(t2 - t1);
	}
}

The running result of the program: 2782ms.

Alignment padding

A Cache Line holds 64 bytes of data. If enough padding is inserted so that a and b can never be allocated to the same Cache Line, the problem described above disappears.

A solution is sketched below. With the padding in place, the program's running time drops to 752ms.
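
The original article presents the padded code as a screenshot. A minimal reconstruction of the usual trick (seven long padding fields on each side of a; the field names are invented here) might look like the following. Note that the JVM does not guarantee that field layout follows declaration order, so this is a best-effort technique:

import java.util.concurrent.CountDownLatch;

public class FalseSharePadded {
	// The padding fields are never read; together with the 8 bytes of a itself,
	// p1..p7 fill a full 64-byte Cache Line so that a and b cannot share one.
	static volatile long p1, p2, p3, p4, p5, p6, p7;
	static volatile long a;
	static volatile long q1, q2, q3, q4, q5, q6, q7;
	static volatile long b;

	public static void main(String[] args) throws InterruptedException {
		CountDownLatch cdl = new CountDownLatch(2);
		long t1 = System.currentTimeMillis();
		new Thread(() -> {
			for (long i = 0; i < 1_0000_0000L; i++) a = i; // this thread only writes a
			cdl.countDown();
		}).start();
		new Thread(() -> {
			for (long i = 0; i < 1_0000_0000L; i++) b = i; // this thread only writes b
			cdl.countDown();
		}).start();
		cdl.await();
		System.err.println(System.currentTimeMillis() - t1);
	}
}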

@Contended

Cache Line alignment by hand is a rather crude solution, because you cannot know in advance which CPU your program will run on, and the Cache Line size varies from CPU to CPU. If the Cache Line is larger than 64 bytes, padding with 7 long variables no longer works.

JDK 8 introduced a new annotation, @Contended, which asks the JVM to place a variable in a separate Cache Line that is not shared with other variables.

The modified code is sketched below. The program now takes 744ms; each variable sits in its own Cache Line and the false sharing is gone.
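
The article's modified code is also a screenshot. A minimal reconstruction using JDK 8's sun.misc.Contended (the annotation moved to jdk.internal.vm.annotation in JDK 9) might look like this:

import java.util.concurrent.CountDownLatch;
import sun.misc.Contended;

public class FalseShareContended {
	// @Contended asks the JVM to pad this field into its own Cache Line.
	// Outside JDK-internal classes it only takes effect when the JVM is
	// started with -XX:-RestrictContended (see the note below).
	@Contended
	static volatile long a;
	static volatile long b;

	public static void main(String[] args) throws InterruptedException {
		CountDownLatch cdl = new CountDownLatch(2);
		long t1 = System.currentTimeMillis();
		new Thread(() -> {
			for (long i = 0; i < 1_0000_0000L; i++) a = i;
			cdl.countDown();
		}).start();
		new Thread(() -> {
			for (long i = 0; i < 1_0000_0000L; i++) b = i;
			cdl.countDown();
		}).start();
		cdl.await();
		System.err.println(System.currentTimeMillis() - t1);
	}
}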

Note

In older JDK 8 builds, @Contended is restricted by default for application classes and must be enabled manually with -XX:-RestrictContended. On the author's JDK version (1.8.0_191), the flag was no longer needed and the @Contended annotation was enabled by default.
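
For example, on a build where the flag is still required (class name taken from the sketch above):

javac FalseShareContended.java
java -XX:-RestrictContended FalseShareContended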

Closing

Understanding hardware design is essential to writing high-performance programs!!