Introduction: This article describes how to measure the access latency of the multi-level CPU cache and the computer principles behind the measurement.
The CPU cache is usually organized as a multi-level pyramid: L1 is closest to the CPU and has the smallest access latency, but also the smallest capacity. The levels below it grow larger and slower. This article walks through how to measure the access latency of each level and what the hardware and the compiler do along the way.
Photo source: cs.brown.edu/courses/csc…
Cache Latency
Wikichip[1] provides the cache latency of different CPU models, usually in units of cycles, which can be converted to nanoseconds with a simple calculation. Taking Skylake as an example, the per-level reference latencies are derived from cycle counts at the following CPU frequency:
CPU Frequency: 2654 MHZ (0.3768 nanosec/clock)
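To make the conversion concrete, here is a small sketch that turns cycle counts into nanoseconds at that frequency. The cycle counts in the array are illustrative placeholders only; the real per-level values should be looked up on wikichip for the exact CPU model under test.

#include <stdio.h>

int main(void)
{
    double freq_mhz = 2654.0;                 /* CPU frequency from above */
    double ns_per_cycle = 1000.0 / freq_mhz;  /* = 0.3768 ns per clock */

    /* Placeholder cycle counts for L1/L2/L3 -- substitute the values
     * wikichip lists for the CPU model being tested. */
    int cycles[] = { 4, 14, 40 };
    const char *level[] = { "L1", "L2", "L3" };

    for (int i = 0; i < 3; i++)
        printf("%s: %d cycles = %.2f ns\n",
               level[i], cycles[i], cycles[i] * ns_per_cycle);
    return 0;
}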
Design experiments
1. Naive thinking
Allocate a buffer the size of the cache level under test. The first pass warms the buffer so that all of its data is loaded into the cache; the second pass is timed, and the average latency of each read is computed from it.
The implementation, mem-lat.c, is as follows:
#include <sys/types.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define ONE p = (char **)*p;
#define FIVE    ONE ONE ONE ONE ONE
#define TEN     FIVE FIVE
#define FIFTY   TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY

static void usage()
{
    printf("Usage: ./mem-lat -b xxx -n xxx -s xxx\n");
    printf("  -b buffer size in KB\n");
    printf("  -n number of reads\n");
    printf("  -s stride skipped before the next access\n");
    printf("Please don't use non-decimal based numbers\n");
}

int main(int argc, char* argv[])
{
    unsigned long i, j, size, tmp;
    unsigned long memsize = 0x800000;  /* 1/4 LLC size of Skylake, 1/5 of Broadwell */
    unsigned long count = 1048576;     /* memsize / 64 * 8 */
    unsigned int stride = 64;          /* amount of memory skipped before the next access */
    unsigned long sec, usec;
    struct timeval tv1, tv2;
    struct timezone tz;
    unsigned int *indices;

    while (argc-- > 0) {
        if ((*argv)[0] == '-') {  /* look at first char of next */
            switch ((*argv)[1]) { /* look at second */
            case 'b':
                argv++;
                argc--;
                memsize = atoi(*argv) * 1024;
                break;
            case 'n':
                argv++;
                argc--;
                count = atoi(*argv);
                break;
            case 's':
                argv++;
                argc--;
                stride = atoi(*argv);
                break;
            default:
                usage();
                exit(1);
                break;
            }
        }
        argv++;
    }

    char* mem = mmap(NULL, memsize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);

    // trick 3: init pointer chasing, one pointer per stride = 8 bytes
    size = memsize / stride;
    indices = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        indices[i] = i;

    // trick 2: fill mem with pointer references
    for (i = 0; i < size - 1; i++)
        *(char **)&mem[indices[i]*stride] = (char *)&mem[indices[i+1]*stride];
    *(char **)&mem[indices[size-1]*stride] = (char *)&mem[indices[0]*stride];

    char **p = (char **)mem;
    tmp = count / 100;

    gettimeofday(&tv1, &tz);
    for (i = 0; i < tmp; ++i) {
        HUNDRED; // trick 1
    }
    gettimeofday(&tv2, &tz);

    if (tv2.tv_usec < tv1.tv_usec) {
        usec = 1000000 + tv2.tv_usec - tv1.tv_usec;
        sec = tv2.tv_sec - tv1.tv_sec - 1;
    } else {
        usec = tv2.tv_usec - tv1.tv_usec;
        sec = tv2.tv_sec - tv1.tv_sec;
    }

    printf("Buffer size: %ld KB, stride %d, time %ld.%06ld s, latency %.2f ns\n",
           memsize / 1024, stride, sec, usec,
           (sec * 1000000 + usec) * 1000.0 / (tmp * 100));

    munmap(mem, memsize);
    free(indices);
}
Here are the three tricks:
- HUNDRED macro (trick 1): the macro expansion keeps the timed loop almost free of loop-control instructions, minimizing their interference with the memory accesses being measured.
- Pointer chasing (trick 2): the buffer slots are strung together through pointers to pointers (char **), so every access is a plain dereference with no offset calculation; a minimal illustration follows this list.
- 8-byte stride (trick 3): char * and char ** are both 8 bytes on a 64-bit machine, so the stride is 8.
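To make the pointer-chasing trick concrete, here is a minimal, self-contained sketch (separate from mem-lat.c): three 8-byte slots are linked into a ring, and p = (char **)*p walks the ring with a single dependent load per step. The printed addresses will differ on every run.

#include <stdio.h>

int main(void)
{
    char *slot[3];

    slot[0] = (char *)&slot[1];   /* slot 0 points to slot 1 */
    slot[1] = (char *)&slot[2];   /* slot 1 points to slot 2 */
    slot[2] = (char *)&slot[0];   /* slot 2 closes the ring   */

    char **p = &slot[0];
    for (int i = 0; i < 6; i++) {
        printf("p = %p\n", (void *)p);
        p = (char **)*p;          /* follow the chain: one load, no offset math */
    }
    return 0;
}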
Test method:
#set -x
work=./mem-lat
buffer_size=1
stride=8
for i in `seq 1 15`; do
    taskset -ac 0 $work -b $buffer_size -s $stride
    buffer_size=$(($buffer_size*2))
done
The test results are as follows:
// L1
Buffer size: 1 KB, stride 8, time 0.003921 s, latency 3.74 ns
Buffer size: 2 KB, stride 8, time 0.003928 s, latency 3.75 ns
Buffer size: 4 KB, stride 8, latency 3.75 ns
Buffer size: 8 KB, stride 8, latency 3.74 ns
Buffer size: 16 KB, stride 8, latency 3.76 ns
Buffer size: 32 KB, stride 8, latency 3.78 ns
// L2
Buffer size: 64 KB, stride 8, latency 3.86 ns
Buffer size: 128 KB, stride 8, latency 3.87 ns
Buffer size: 256 KB, stride 8, latency 3.86 ns
Buffer size: 512 KB, stride 8, latency 3.86 ns
Buffer size: 1024 KB, stride 8, latency 3.92 ns
// L3
Buffer size: 2048 KB, stride 8, latency 3.94 ns
Buffer size: 4096 KB, stride 8, latency 3.92 ns
Buffer size: 16384 KB, stride 8, time 0.004272 s, latency 4.07 ns
Compared with the reference values, the measured L1 latency is too large while the measured L2 and L3 latencies are too small, which does not meet expectations.
2. Thinking with hardware: the cache line
In modern processors, data is organized in the cache at cache-line granularity: a cache line is the unit of every read and write, and the most common cache line size is 64 bytes.
If we simply read a 128 KB buffer sequentially at 8-byte granularity and the data resides in L2, then the first access to each cache line pulls the whole line into L1, and the remaining accesses to that line hit L1. The "L2" latency we measure is therefore much smaller than the real one.
For the CPU tested in this article, the cache line size is 64 bytes, so we only need to set the stride to 64.
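As a quick sanity check of this reasoning, the sketch below (illustrative only, not part of mem-lat.c) maps byte offsets to 64-byte cache-line indices: with a stride of 8, eight consecutive accesses land in the same line, so at most one in eight can miss L1, while with a stride of 64 every access touches a new line.

#include <stdio.h>

#define CACHE_LINE 64

int main(void)
{
    /* Offsets produced by a stride-8 walk over the first two cache lines. */
    for (unsigned off = 0; off < 2 * CACHE_LINE; off += 8)
        printf("offset %3u -> cache line %u\n", off, off / CACHE_LINE);
    return 0;
}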
The test results are as follows:
// L1
Buffer size: 1 KB, stride 64, time 0.003933 s, latency 3.75 ns
Buffer size: 2 KB, stride 64, time 0.003930 s, latency 3.75 ns
Buffer size: 4 KB, stride 64, time 0.003925 s, latency 3.74 ns
Buffer size: 8 KB, stride 64, time 0.00392 s, latency 3.75 ns
Buffer size: 16 KB, stride 64, time 0.003935 s, latency 3.75 ns
Buffer size: 32 KB, stride 64, time 0.00119 s, latency 3.92 ns
// L2
Buffer size: 64 KB, stride 64, time 0.007423 s, latency 7.08 ns
Buffer size: 128 KB, stride 64, time 0.007414 s, latency 7.07 ns
Buffer size: 256 KB, stride 64, time 0.007437 s, latency 7.09 ns
Buffer size: 512 KB, stride 64, time 0.007429 s, latency 7.09 ns
Buffer size: 1024 KB, stride 64, time 0.00765 s, latency 7.30 ns
Buffer size: 2048 KB, stride 64, time 0.007670 s, latency 7.32 ns
Buffer size: 4096 KB, stride 64, time 0.007695 s, latency 7.34 ns
Buffer size: 16384 KB, stride 64, time 0.008172 s, latency 7.79 ns
Although the measured L2 and L3 latencies are now larger than in the first attempt, they are still below the expected values.
3. Thinking with hardware: prefetch
Modern processors usually support prefetching: the hardware loads data that is expected to be used soon into the cache ahead of time, which reduces the time the CPU spends waiting for data from memory, improves the cache hit ratio, and thus improves software performance.
Intel processors support four types of hardware prefetcher[2], which can be turned on and off through MSR 0x1A4: the L2 hardware prefetcher, the L2 adjacent cache line prefetcher, the DCU (L1 data) prefetcher, and the DCU IP prefetcher.
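The experiments in this article sidestep the prefetchers by choosing the stride rather than by touching the MSR. For completeness, here is a minimal sketch of the MSR route, assuming a Linux machine with the msr kernel module loaded, root privileges, and an Intel CPU that implements the prefetch-control MSR 0x1A4 described in [2]; it is not used anywhere else in this article.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* /dev/cpu/N/msr is provided by the Linux msr driver (modprobe msr). */
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
        perror("read MSR 0x1a4");
        return 1;
    }
    printf("MSR 0x1a4 = 0x%llx\n", (unsigned long long)val);

    /* Bits 0-3 disable, in order: L2 HW prefetcher, L2 adjacent-line
     * prefetcher, DCU prefetcher, DCU IP prefetcher. */
    val |= 0xf;
    if (pwrite(fd, &val, sizeof(val), 0x1a4) != sizeof(val))
        perror("write MSR 0x1a4");

    close(fd);
    return 0;
}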
Here we simply set the stride to 128 and then 256 to sidestep the hardware prefetchers. The measured L3 access latency increases significantly:
// stride 128
Buffer size: 1 KB, stride 256, time 0.003927 s, latency 3.75 ns
Buffer size: 2 KB, stride 256, time 0.003924 s, latency 3.74 ns
Buffer size: 4 KB, stride 256, time 0.003928 s, latency 3.75 ns
Buffer size: 8 KB, stride 256, time 0.003923 s, latency 3.74 ns
Buffer size: 16 KB, stride 256, time 0.003930 s, latency 3.75 ns
Buffer size: 32 KB, stride 256, time 0.003929 s, latency 3.75 ns
Buffer size: 64 KB, stride 256, time 0.007534 s, latency 7.19 ns
Buffer size: 128 KB, stride 256, time 0.00742 s, latency 7.12 ns
Buffer size: 256 KB, stride 256, time 0.00749 s, latency 7.13 ns
Buffer size: 512 KB, stride 256, time 0.007698 s, latency 7.34 ns
Buffer size: 512 KB, stride 128, time 0.007597 s, latency 7.25 ns
Buffer size: 1024 KB, stride 128, time 0.00916 s, latency 8.74 ns
Buffer size: 2048 KB, stride 128, time 0.010008 s, latency 9.55 ns
Buffer size: 4096 KB, stride 128, time 0.010008 s, latency 9.55 ns
Buffer size: 8192 KB, stride 128, time 0.010364 s, latency 9.89 ns
Buffer size: 16384 KB, stride 128, time 0.012031 s, latency 11.47 ns
// stride 256
Buffer size: 512 KB, stride 256, time 0.007698 s, latency 7.34 ns
Buffer size: 1024 KB, stride 256, time 0.01265 s, latency 12.0 ns
Buffer size: 2048 KB, stride 256, time 0.025210 s, latency 24.04 ns
Buffer size: 4096 KB, stride 256, time 0.025466 s, latency 24.29 ns
Buffer size: 8192 KB, stride 256, latency 24.64 ns
Buffer size: 16384 KB, stride 256, time 0.027442 s, latency 26.17 ns
The L3 access latency is now basically in line with expectations, but L1 and L2 are still obviously too large.
If the goal is to measure random-access latency, a more common approach is to shuffle the order in which the buffer's pointers are strung together:
// shuffle indices
for (i = 0; i < size; i++) {
    j = i + rand() % (size - i);
    if (i != j) {
        tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }
}
As the results below show, random access gives basically the same numbers as a stride of 256:
Buffer size: 1 KB, stride 64, time 0.003942 s, latency 3.76 ns
Buffer size: 2 KB, stride 64, time 0.003925 s, latency 3.74 ns
Buffer size: 4 KB, stride 64, time 0.003928 s, latency 3.75 ns
Buffer size: 8 KB, stride 64, time 0.00392 s, latency 3.75 ns
Buffer size: 16 KB, stride 64, time 0.003932 s, latency 3.75 ns
Buffer size: 32 KB, stride 64, time 0.004276 s, latency 4.08 ns
Buffer size: 64 KB, stride 64, time 0.007465 s, latency 7.12 ns
Buffer size: 128 KB, stride 64, time 0.007470 s, latency 7.12 ns
Buffer size: 256 KB, stride 64, time 0.007521 s, latency 7.17 ns
Buffer size: 512 KB, stride 64, time 0.00940 s, latency 8.91 ns
Buffer size: 1024 KB, stride 64, time 0.01523 s, latency 14.53 ns
Buffer size: 2048 KB, stride 64, time 0.027567 s, latency 26.29 ns
Buffer size: 4096 KB, stride 64, latency 26.56 ns
Buffer size: 8192 KB, stride 64, time 0.029945 s, latency 28.56 ns
Buffer size: 16384 KB, stride 64, time 0.03487 s, latency 33.26 ns
4. Thinking with the compiler: the register keyword
Now that the too-small L3 latency has been fixed, let's look at why L1 and L2 still measure too large. To find out, we first disassemble the executable and check whether it executes the assembly instructions we expect:
objdump -D -S mem-lat > mem-lat.s
- -D: Disassemble the contents of all sections.
- -S: Intermix source code with disassembly (the program must be compiled with -g for the source to be available).
The generated assembler file mem-lat.s:
char **p = (char **)mem;
400b3a: 48 8b 45 c8 mov -0x38(%rbp),%rax
400b3e: 48 89 45 d0 mov %rax,-0x30(%rbp) // push stack
//...
HUNDRED;
400b85: 48 8b 45 d0 mov -0x30(%rbp),%rax
400b89: 48 8b 00 mov (%rax),%rax
400b8c: 48 89 45 d0 mov %rax,-0x30(%rbp)
400b90: 48 8b 45 d0 mov -0x30(%rbp),%rax
400b94: 48 8b 00 mov (%rax),%rax
First, the variable mem is assigned to the variable p, and p is stored on the stack at -0x30(%rbp).
char **p = (char **)mem;
400b3a: 48 8b 45 c8 mov -0x38(%rbp),%rax
400b3e: 48 89 45 d0 mov %rax,-0x30(%rbp)
The logic of each memory access:
HUNDRED; // p = (char **)*p
400b85: 48 8b 45 d0 mov -0x30(%rbp),%rax
400b89: 48 8b 00 mov (%rax),%rax
400b8c: 48 89 45 d0 mov %rax,-0x30(%rbp)
- The rax register reads the value of the pointer variable p from the stack. p has type char **, a pointer to a pointer: it points to a char * variable, and its value is itself an address, say 0x2000.
- The value stored at the address held in rax is loaded into rax, which corresponds to the unary operation *p. If address 0x2000 holds 0x3000, rax becomes 0x3000.
- rax is written back to the variable p on the stack, so the value of p is updated to 0x3000.
According to the disassembly, the single mov instruction we expected was compiled into three, so each pointer-chasing step performs three memory accesses instead of one, and the measured latency is inflated roughly threefold.
The C register keyword allows the compiler to store variables in registers, avoiding the overhead of reading from the stack each time.
It’s a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
We declare p with the register keyword.
register char **p = (char **)mem;
The test results are as follows:
// L1
Buffer size: 1 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 2 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 4 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 8 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 16 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 32 KB, stride 64, time 0.000030 s, latency 0.03 ns
// L2
Buffer size: 64 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 128 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 256 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 512 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 1024 KB, stride 64, time 0.000030 s, latency 0.03 ns
// L3
Buffer size: 2048 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 4096 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 16384 KB, stride 64, time 0.000030 s, latency 0.03 ns
The measured access latency is now well under 1 ns, which is obviously not in line with expectations either.
5. Thinking with the compiler: touch it!
Disassembling again to see what went wrong, the compiled code looks like this:
for (i = 0; i < tmp; ++i) {
40155e: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
401565: 00
401566: eb 05 jmp 40156d <main+0x37e>
401568: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
40156d: 48 8b 45 f8 mov -0x8(%rbp),%rax
401571: 48 3b 45 b0 cmp -0x50(%rbp),%rax
401575: 72 f1 jb 401568 <main+0x379>
HUNDRED;
}
gettimeofday (&tv2, &tz);
401577: 48 8d 95 78 ff ff ff lea -0x88(%rbp),%rdx
40157e: 48 8d 45 80 lea -0x80(%rbp),%rax
401582: 48 89 d6 mov %rdx,%rsi
401585: 48 89 c7 mov %rax,%rdi
401588: e8 e3 fa ff ff callq 401070 <gettimeofday@plt>
The HUNDRED macro produced no assembly code at all. The statements involving the variable p only read data and have no visible side effect, so they were most likely optimized away by the compiler. Touching p after the timed loop keeps the reads alive:
register char **p = (char **)mem;
tmp = count / 100;

gettimeofday(&tv1, &tz);
for (i = 0; i < tmp; ++i) {
    HUNDRED;
}
gettimeofday(&tv2, &tz);

/* touch pointer p to prevent compiler optimization */
char **touch = p;
Disassemble to verify:
HUNDRED;
401570: 48 8b 1b mov (%rbx),%rbx
401573: 48 8b 1b mov (%rbx),%rbx
401576: 48 8b 1b mov (%rbx),%rbx
401579: 48 8b 1b mov (%rbx),%rbx
40157c: 48 8b 1b mov (%rbx),%rbx
Now the HUNDRED macro produces exactly the assembly we want: mov instructions that operate only on the rbx register.
The latency test results are as follows:
// L1
Buffer size: 1 KB, stride 64, time 0.001687 s, latency 1.61 ns
Buffer size: 2 KB, stride 64, time 0.001684 s, latency 1.61 ns
Buffer size: 4 KB, stride 64, time 0.001682 s, latency 1.60 ns
Buffer size: 8 KB, stride 64, time 0.001693 s, latency 1.61 ns
Buffer size: 16 KB, stride 64, time 0.00168 s, latency 1.61 ns
Buffer size: 32 KB, stride 64, time 0.001783 s, latency 1.70 ns
// L2
Buffer size: 64 KB, stride 64, latency 5.62 ns
Buffer size: 128 KB, stride 64, time 0.00228 s, latency 5.64 ns
Buffer size: 256 KB, stride 64, latency 5.68 ns
Buffer size: 512 KB, stride 64, time 0.007856 s, latency 7.49 ns
Buffer size: 1024 KB, stride 64, time 0.014929 s, latency 14.24 ns
// L3
Buffer size: 2048 KB, stride 64, time 0.026970 s, latency 25.72 ns
Buffer size: 4096 KB, stride 64, time 0.026968 s, latency 25.72 ns
Buffer size: 8192 KB, stride 64, time 0.028823 s, latency 27.49 ns
Buffer size: 16384 KB, stride 64, time 0.033325 s, latency 31.78 ns
L1 latency is 1.61 ns and L2 latency is 5.62 ns, finally in line with expectations!
Closing remarks
The idea and code in this article draw on lmbench[3] and the mem-lat tool written by other colleagues on the team. Finally, a pit I dug for myself: when randomizing the buffer pointers, I did not take the effect of hardware TLB misses into account. Interested readers can extend the experiment to cover it.
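As a starting point for that follow-up, one idea (sketched here, untested) is to back the buffer with 2 MB huge pages so that even a 16 MB buffer is covered by a handful of TLB entries. MAP_HUGETLB requires huge pages to be reserved first, e.g. via /proc/sys/vm/nr_hugepages, and the helper below simply falls back to normal pages when none are available.

#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Allocate 'size' bytes, preferring a huge-page backing to reduce TLB misses. */
static char *alloc_buffer(size_t size)
{
    char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (mem == MAP_FAILED)  /* no huge pages reserved: fall back to 4 KB pages */
        mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}

int main(void)
{
    char *buf = alloc_buffer(16UL << 20);  /* 16 MB, the largest test size */
    printf(buf ? "allocated\n" : "mmap failed\n");
    return 0;
}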
This article is the original content of Aliyun and shall not be reproduced without permission.