Introduction: In the analysis of Striped64, a new high-concurrent atom accumulator in JDK8, it is found that there is a concept of “pseudo-sharing”, and to understand it, we must have a certain understanding of CPU cache. Therefore, this paper will first make a research and exploration on CPU cache architecture and some related terms.
How CPU caching works
As we all know, in today’s computer era, the difference between CPU processing speed and memory reading and writing speed is very huge. In order to solve this difference and make full use of CPU efficiency, CPU cache emerges as The Times require. It is the buffer of temporary data exchange between CPU processor and memory.
CPU cache and memory are both a type of non-permanent random-access memory (RAM), so is it physically different from memory? Of course, THE CPU cache is basically composed of SRAM (static RAM) (there are also IBM Power series processing is made of eDRAM CPU cache), and memory is often called DRAM, in fact, it is SDRAM (synchronous dynamic random access memory), is a kind of DRAM. DRAM composed of memory only contains a transistor and a capacitor, the integration is very high can easily achieve large capacity, but because of the capacitor to store information so need to constantly refresh the capacitor charge, and the time difference between charge and discharge lead to DRAM data read and write speed is much slower than SRAM.
SRAM that constitutes cache is more complex than DRAM that constitutes memory, so it occupies large space, high cost and low integration. So that in the early stage of LOW CPU technology, CPU cache can not be integrated into the CPU but only integrated into the motherboard. But its advantage is that it does not need to refresh the circuit so that the reading and writing speed is fast.
If the physical structure and performance differences between SRAM and DRAM show the physical principles of CPU caching, then the principle of time locality and the principle of space locality are the logical principles supporting CPU caching. The principle of temporal locality states that a referenced memory location is likely to be referenced many times in the near future. The principle of spatial locality states that if a memory location is no longer referenced, a program is likely to reference a memory location near that memory location in the near future.
CPU cache hierarchy
Because the CPU is the most began to categorize CPU Cache internal integration of CPU Cache already cannot satisfy the need of high performance CPU, and restrictions on the manufacturing process and not the number of greatly increased the CPU internal Cache, so the integration in the motherboard caches, people at that time was called the CPU internal Cache level Cache, namely the L1 Cache, The Cache on the mainboard outside the CPU is called the level 2 Cache, or L2 Cache. Level 1 Cache is actually divided into level 1 Data Cache (D-cache, L1d) and level 1 Instruction Cache (Instruction Cache, I-cache, L1i), which are respectively used for storing Data and decoding instructions to execute Data. Both can be accessed by CPU at the same time. Reduced CPU multi – core, multi – thread contention cache caused by conflict.
Early Intel and AMD seem to have different views on the last layer of cache L2, each CPU core has an independent level 1 cache L1, then the level 2 cache L2? AMD still used a separate level 2 Cache for each CPU core, but Intel used a “Smart Cache” design with multiple cores sharing level 2 Cache, which was better than AMD’s design at the time.
As the manufacturing process improved, L2 was also integrated into the CPU Cache, but then the need for big data processing and game performance, etc., the emergence of L3 Cache level on high-end cpus. The advent of level 3 caching is said to provide an uphill boost to CPU performance. Of course, in 2018, it is no longer the privilege of high-end CPUS to have level 3 cache, and it is said that level 4 cache has appeared on some special cpus. Of course, this does not mean that the more levels of cache, the more performance will be improved. After level 3 cache, due to the transmission distance from the CPU and the increase of its own capacity, The performance gains from CPU access to the cache and direct access to memory have been gradually offset, so instead of adding so-called level 4 caching, direct access to memory is preferred. In summary, the CPU cache hierarchy is roughly described. The following is from the book “Understanding Computer Systems in Depth” about the CPU cache and memory, hard disk storage hierarchy:
As you can see from the diagram, understanding Computer Systems in Depth divides registers into L0 level caches, followed by L1, L2, L3, memory, local disk, and remote storage. The higher up the cache, the less space, the faster, and the more expensive it is; The further down you go, the more storage, the slower, and the cheaper. From top to bottom, each layer can be regarded as the cache of the next layer, that is: register L0 is the cache of L1 level cache, L1 is L2 cache, and so on; The data in each layer comes from the next layer, so the data in each layer is a subset of the data in the next layer.
Before we continue with CPU caching, let’s take a quick look at some of the CPU information I found using CPU-Z on my laptop:
The above information can also be viewed using a tool called CoreInfo. The first graph shows that I have L1, L2, and L3 CPU caches. The bottom right corner shows that my processor is dual-core, four-threaded. So what does it mean to have two cores and four threads? Dual-core we are not unfamiliar, analogy of mononuclear before, is a CPU with multiple cores, each core with independent computing unit, controller, registers, L1 and L2 cache, then a multiple CPU core Shared the last layer of CPU cache L3, make its can be run at the same time a process of multiple threads, as shown in the figure below:
So what is four-threading, in fact, this is what’s called HyperthreadingTechnology, which simulates a physical chip for a logic core by using special hardware instructions, so if you look at the processor through device management in Windows you’ll see four, In fact, both of them are simulated, which can fully “mobilize” the temporarily idle resources inside the CPU, because our CPU actually has many execution units that are idle when running a program. The purpose of simulating a core is to use some spare CPU space (resources) to maximize CPU performance. But simulated kernel after all is virtual, so it will share registers, and the simulated logic nuclear L1, L2, therefore even dual-core four threads or only two level cache L1, two secondary L2 cache, a three-stage cache L3, so if the nuclear physics and its simulation of nuclear threads to use the same execution units at the same time, Or access the same cache row, one by one.
What about dual cpus or processors? In front of the double core is in a processor with two processor cores, the core is two, but other hardware is also two cores in the common, and the double CPU is the real sense of the double core, not only is the processor core is two, other such as cache and other hardware configuration are also double. One CPU corresponds to one physical slot. Multiple cpus are connected through a QPI bus. Most of our common computers (like my laptop above) are single-CPU multicore, and true multicore is not what PCS are used to.
The internal structure of the CPU cache
Now that we have an understanding of the CPU cache hierarchy, let’s take a look inside the CPU cache to see its internal structure.
The image above is a view of the internal structure of a CPU cache, from the book Understanding Computer Systems. Originally, inside the CPU cache is made up of S group, the size of the S with the size of the cache storage address space, and then each group has a number of cache line inside the cache line, for example above, each group has a E line cache line, E equal to 2, each cache line contains a tag if it valid and effective t a tag, And then the actual cache data is stored in B bytes. The size of the entire cache is C=BES.
And a memory address when doing the cache lookup, first of all, in the middle of the s points should be in which group, high t bits indicate which line in the group, at the low b a said should start with the how many shifts of the cache line read, after all, a cache line can hold a lot of data, usually 64 bytes.
This is called a “direct mapping cache” when E is equal to 1, a “full associative cache” when E is equal to C/B (a group containing all rows), and a “group associative cache” when 1>E>C/B (a group containing rows) falls between. Because the CPU cache space is small, general memory data mapping algorithm to CPU cache will inevitably lead to there are many different data place is mapped to the same cache line, this kind of access to the same cache line data will lead to different cache hit, not need to be to the next level cache or memory load data to replace the original cache, This kind of miss is called “collision miss”, and if the collision miss continues to occur, we call it “jitter”. Obviously, direct mapping caches have only one row per group, so this “jitter” would probably be quite frequent, and this is certainly not the highest cache design. However, due to the large number of rows, it is very difficult and expensive for the CPU to quickly match the desired data in a large cache. So it’s only good for small caches. In the end, only “group associative caching” was our best solution.
In the above cpu-Z screenshot, my CPU cache is the group linked cache. The 8-way after L1/L2 indicates that each group has 8 rows, and L3 has 12 rows. The total cache size of L1 D /L1 I is 32KB. L2 cache size is 256KB….
The typical cache line size is 64 bytes (not including significant and marker bits), so B is equal to 64, and so is my laptop, as you can see in the second image of CPU-Z above. This information can also be accessed using the CoreInfo tool or if we program in Java, Cache information can also be retrieved through the CacheSize API, a small Google project that allows the Java language to access native Cache information. Example code is as follows:
public static void main(String[] args) throws CacheNotFoundException { CacheInfo info = CacheInfo.getInstance(); CacheLevelInfo l1Datainf = info.getCacheInformation(CacheLevel.L1, CacheType.DATA_CACHE); System.out.println(" l1Datainf.toString()); CacheLevelInfo l1Instrinf = info.getCacheInformation(CacheLevel.L1, CacheType.INSTRUCTION_CACHE); System.out.println(" l1instrinf.tostring ()); }Copy the code
The print result is as follows:
Level 1 data cache information: CacheLevelInfo [cacheLevel=L1, cacheType=DATA_CACHE, cacheSets=64, cacheCoherencyLineSize=64, cachePhysicalLinePartitions=1, cacheWaysOfAssociativity=8, isFullyAssociative=false, isSelfInitializing=true, TotalSizeInBytes =32768] CacheLevelInfo [cacheLevel=L1, cacheType=INSTRUCTION_CACHE, cacheSets=64, cacheCoherencyLineSize=64, cachePhysicalLinePartitions=1, cacheWaysOfAssociativity=8, isFullyAssociative=false, isSelfInitializing=true, totalSizeInBytes=32768]Copy the code
CacheSets appears to have 64 rows, cacheCoherencyLineSize means 64 bytes cache row size, cacheWaysOfAssociativity means eight cache rows in a set, TotalSizeInBytes is the size of the entire cache line. L1 data/instruction cache size is: C=B×E×S=64×8×64=32768 bytes =32KB.
CPU cache read/write and cache consistency protocol
The first is read, read the CPU to perform a memory word w instruction from bottom to top, in order to find below contains words w to find the cache line, again by the lower will be the cache line is returned to a layer of caching on a layer of caching will this cache line on its own after a cache line, continue to return to a layer, until you reach the L1. After placing the data row into its cache row, L1 extracts the word W that the CPU really needs from the stored cache row and returns it to the CPU. The cache determines whether a request is hit or not, and then 1) group selection; 2) Line matching; 3) Word extraction.
One important aspect of this is that when the CPU cache misses, when it requests the lower level cache, the data is returned in a unit of cache behavior, not just the single word you want, and when there is a miss conflict, the corresponding replacement strategy is implemented to replace it.
Finally, there are two kinds of writing cases:
1. To write a cached word w, that is, write hit: how to update its next level cache after updating its copy of w first? The simplest is “direct write”, that is, immediately write the cache line containing W back to the first layer of the cache layer, although it is simple, but you know that the CPU may be writing data all the time, if everyone keeps writing will inevitably produce a lot of bus traffic, not good for the processing of other data; Another method known as the “back”, delayed update as far as possible, only when the need to replace the updated cache replacement strategy than write it back and then of the first layer of the cache, this reduced the total flow, but adds complexity, cache line must be additional maintenance a “modified”, indicates that the cache line has been modified.
2. Write a word that is not in the cache, that is, write failure: one is write allocation, which is to load the cache that is not hit, and then update the entire cache line, and then the processing logic of write hit; The other is non-write assignment, which simply writes the word to the next level.
Another important topic when it comes to CPU cache writes is the cache consistency protocol MESI. About cache coherence protocol and its variations are another complex content, while the msci actually only is numerous, one of the most famous in the consistent agreement in the name of the name also as for the agreement on the abbreviation of four kinds of cache state abbreviation, cache coherence protocol specifies how to guarantee the consistency problem of cached in each CPU cache:
Using the MESI protocol as an example, each Cache line has four states, which can be represented by two bits:
state
describe
M(Modified)
This line of data is valid, but the data is modified and inconsistent with the data in memory. The data only exists in this Cache.
E(Exclusive)
This line of data is valid. The data is the same as the data in memory. The data only exists in the local Cache.
S(Shared)
This line of data is valid. The data is consistent with the data in memory, which is stored in many caches.
I(Invalid)
This line of data is invalid.
About cache coherence protocol, because of its content, and a comparison of me here only rough said I understand, all in all it is a kind of data in multiple CPU cache consistent means, as to what is means, according to the consistency of each CPU manufacturers use agreement of different and different, in my current understanding, mainly has the following kinds:
-
Before the CPU modifies its cache, it broadcasts to other CPU caches via the last level cache L3 (because the last level cache is shared by multiple cores) or bus (in the case of multiple cpus across slots), invalidates other cache rows that contain this cache data, and then changes its cache data and marks it as M. When another CPU cache needs to read the modified row (or needs to be swapped out due to a collision miss), the modified row is immediately written back to memory, and the other CPU loads the latest data from memory into its own cache.
-
When the CPU cache uses “direct write” to update the cache, other cpus learn from the bus that the relevant cache line data is invalid through sniffing technology, and immediately invalidate their own cache line data, so that the next time the read fails to load the latest data back into memory.
-
When a CPU changes its cache line data, it proactively sends the relevant updates to other cpus with relevant caches via L3 or bus (in the case of multiple cpus across slots), so that they update their caches synchronously to be consistent.
All in all, a means to an CPU cache consistency emerge in endlessly, and through the above three ways, can be seen in the treatment of the cache consistency problem, if it is a single CPU core processors, so always be using the last level cache L3 to pass data, and this is not the worst, when multiprocessor cross slot, Data also has to be transferred across the bus and across slots to ensure cache consistency, which can be even more challenging for performance. This problem with CPU cache consistency will be at the root of the “pseudo-sharing” we discussed at the beginning of this article, as described in the next section.