Contents

1. CPU caching basics

2. Cache hits

3. Cache coherence

4. Typical use cases

5. False sharing

Takeaway

CPU cache knowledge is a fundamental topic in big-tech interviews, and one that interviewers value highly. Master it well and it will earn you serious extra points!

History:

In the early decades of computing, main memory was slow and expensive, but CPUs weren't particularly fast either. Starting in the 1980s, the gap widened rapidly: microprocessor clock speeds skyrocketed, while memory access times improved far less dramatically. As the gap grew, it became increasingly clear that a new kind of fast memory was needed to bridge it.

Before 1980: CPUs had no cache

1980–1995: CPUs gained caches, first L1 and then a second level (L2)

Today: L1, L2, and L3 are standard; some chips add an L0, and a few add an L4

Now, down to business.

CPU caching basics

A register is a small storage area inside the CPU used to temporarily hold instructions, data, and addresses. Registers have very limited capacity but extremely fast read and write speeds. In computer architecture, registers hold the intermediate results of calculations, speeding up programs by giving the CPU rapid access to that data.

Registers sit at the top of the memory hierarchy and are the fastest memory the CPU can read and write. They are usually measured by the number of bits they hold, for example an 8-bit register or a 32-bit register. Within the CPU, the components built from registers include the instruction register (IR), the program counter, and the accumulator. Registers are now usually implemented as register files, though on some machines they have also been implemented with individual flip-flops, high-speed core memory, thin-film memory, and other means.

The term can also refer to the group of registers that instruction operands can index directly, more properly called "architectural registers". For example, the x86 instruction set defines a set of eight 32-bit registers, but a CPU implementing the x86 instruction set may have more than eight registers internally.

CPU cache

In computer systems, the CPU cache (hereafter simply "the cache") is a component that reduces the average time the processor takes to access memory. It is the second layer from the top of the pyramid-shaped storage hierarchy, just below the CPU registers. Its capacity is far smaller than main memory's, but its speed is close to the processor's frequency.

When the processor issues a memory access request, it first checks whether the requested data is in the cache. If it is (a hit), the data is returned directly without touching memory; if not (a miss), the data is first loaded from memory into the cache and then returned to the processor.
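A toy direct-mapped cache in C makes this hit/miss flow concrete. The sizes and the `cache_read` helper are illustrative only, not any real CPU's design:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_LINES 64   /* hypothetical: 64 cache lines */
#define LINE_SIZE 64   /* 64-byte lines, as on most modern x86 CPUs */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_LINES];
static uint8_t    memory[1 << 20];  /* 1 MiB of simulated main memory */
static size_t     hits, misses;

/* Read one byte: on a hit, serve it from the cache; on a miss,
 * load the whole line from memory first, then serve it. */
uint8_t cache_read(uint64_t addr) {
    uint64_t line_addr = addr / LINE_SIZE;
    uint64_t index     = line_addr % NUM_LINES;  /* direct-mapped placement */
    uint64_t tag       = line_addr / NUM_LINES;

    cache_line *line = &cache[index];
    if (line->valid && line->tag == tag) {
        hits++;                                  /* hit: no memory access */
    } else {
        misses++;                                /* miss: fill the line */
        memcpy(line->data, &memory[line_addr * LINE_SIZE], LINE_SIZE);
        line->valid = true;
        line->tag   = tag;
    }
    return line->data[addr % LINE_SIZE];
}
```

Reading 128 consecutive bytes touches only two 64-byte lines, so it costs just two misses; the other 126 reads are hits. That is exactly why caching pays off.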

Caching is effective mainly because memory accesses at runtime exhibit locality of reference, in two forms: spatial locality and temporal locality. By exploiting this locality well, a cache can achieve extremely high hit rates.

From the program's point of view, the cache is transparent, so programmers usually cannot control it directly. However, code can be optimized around the cache's characteristics to make better use of it.
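A classic example of such an optimization is traversal order over a 2-D array. Both functions below compute the same sum, but the row-major version walks memory sequentially and uses every byte of each fetched cache line (spatial locality), while the column-major version jumps 2 KB between accesses. The matrix size and function names here are just for illustration:

```c
#include <stddef.h>

#define N 512

/* Row-major traversal: consecutive j values touch consecutive addresses,
 * so each 64-byte cache line fetched is fully used before moving on. */
long sum_row_major(int m[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal: successive accesses are N * sizeof(int) = 2 KB
 * apart, so almost every access lands on a different cache line, and lines
 * fetched earlier are often evicted before their neighbours are used. */
long sum_col_major(int m[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```

On matrices much larger than the cache, the row-major version is typically several times faster, even though the two loops do identical arithmetic.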

Today’s computers generally have three levels of cache (L1, L2, L3). Let’s look at the structure:

Among them:

  • The L1 cache comes in two kinds: an instruction cache and a data cache. L2 and L3 caches do not distinguish between instructions and data.

  • L1 and L2 are private to each CPU core, while L3 is shared by all cores.

  • The closer a cache level is to the CPU, the smaller and faster it is; the farther away, the larger and slower.

  • Beyond the caches lie main memory, and then the hard disk.

And take a look at their speed:

Take a look at the processor I use for work (it's a bit underpowered):

Specific information can be seen:

L1 is roughly 27–36 times faster than main memory. L1 and L2 are measured in KB, while L3 is measured in MB. L1 is split into a 32 KB data cache and a 32 KB instruction cache. Ever wonder why there is no L4?

Let’s look at a picture

This chart from AnandTech's Haswell review is useful because it illustrates the performance impact of adding a huge (128 MB) L4 cache on top of the regular L1/L2/L3 structure. Each step represents a new cache level. The red line is the chip with the L4 – note that for large data sets it is still almost twice as fast as the other two Intel chips. But a larger cache requires more transistors, which makes it slower and more expensive and increases the size of the chip.

What about L0?

The answer is: yes. Modern CPUs also typically have very small "L0" caches, usually only a few KB in size, for storing micro-operations (µOPs). Both AMD and Intel use such a cache. Zen's µOP cache holds 2,048 entries, versus 4,096 for Zen 2. These tiny cache pools operate under the same general principles as L1 and L2, but represent an even smaller pool of memory that the CPU can access with lower latency than L1. Companies often tune these features against each other. Zen 1 and Zen+ (Ryzen 1XXX, 2XXX, and 3XXX APUs) have a 64 KB, 4-way set-associative L1 instruction cache and a 2,048-entry µOP cache. Zen 2 (Ryzen 3XXX desktop CPUs, Ryzen Mobile 4XXX) has a 32 KB, 8-way set-associative L1 instruction cache and a 4,096-entry µOP cache. By doubling the associativity and the µOP cache size, AMD could cut the L1 instruction cache size in half.

Having said that, how does CPU caching work?

The purpose of the CPU cache, in the CPU's own voice: "I'm so fast that going to main memory for data every time costs too much, so I keep a memory pool of my own for the data I want most." And what gets loaded into the CPU cache? Frequently used data and instructions.

What if I don't find the data I want in L1's memory pool? That's a cache miss.

What then? Go to L2. Some processors use an inclusive cache design (data stored in L1 is duplicated in L2), while others are exclusive (the two caches never share data). If the data is not in the L2 cache either, the CPU continues down the chain to L3 (usually still on the same die), then L4 (if present), and finally main memory (DRAM).
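This fall-through chain can be sketched as a tiny cost model. The cycle counts below are illustrative orders of magnitude only, not measurements of any real chip:

```c
enum { L1, L2, L3, DRAM, NUM_LEVELS };

/* Hypothetical access latencies in cycles for each level of the chain. */
static const int latency[NUM_LEVELS] = { 4, 12, 40, 200 };

/* Walk the chain described above: probe L1, then L2, then L3, and fall
 * back to main memory. hit_level is where the data is assumed to reside;
 * every level probed on the way down adds its latency to the total. */
int access_cost(int hit_level) {
    int cycles = 0;
    for (int lvl = L1; lvl < NUM_LEVELS; lvl++) {
        cycles += latency[lvl];   /* pay this level's probe/access cost */
        if (lvl == hit_level)
            return cycles;        /* found the data at this level */
    }
    return cycles;                /* served from DRAM */
}
```

The model shows why misses compound: an L1 hit costs 4 cycles, but a request that falls all the way to DRAM pays for every probe along the way, 4 + 12 + 40 + 200 = 256 cycles.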

This raises a question: how can the CPU locate data in the cache efficiently? It can't simply walk through the cache lines one by one.

The next article will cover how cache hits actually work – stay tuned.

Thank you for reading this far. If you found this article well written and got something out of it:

A like 👍, a follow ❤️, and a share 👥 would really, truly help me!!

If there are any mistakes in this post, please leave a comment – thank you very much! ❤️❤️❤️❤️