Having previously taken a quick look at all kinds of memory, I understand DRAM and SRAM; the other memories I only know hazily and will not go into here. These notes focus mainly on the Cache.

I. What is a Cache

A Cache consists of S sets. Each set contains E blocks, and each block holds B bytes of data. The total Cache capacity is therefore C = S*E*B bytes of data. A block is just a row in the Cache, and data is transferred one block at a time. For example, a cache with S = 64 sets, E = 8 blocks per set, and B = 64 bytes per block has C = 64*8*64 = 32KB.

Note:!! Block and line mean the same thing!! When I read CSAPP I was confused for a long time before realizing that a block and a line are the same thing, so I am going to stick with "block" here.

1. Direct-mapped cache (Direct-Mapped Cache)

In the simplest case there are S = 2^s sets, E = 1 block per set, and B = 2^b bytes per block, with t + s + b address bits in total. An address can then be divided as follows: when you look up the data at an address, the address is sliced into a tag (t bits), a set index (s bits), and a block offset (b bits). The figure above shows the E = 1 case. Each incoming address is split into tag, set index, and block offset. First the set is located by the set index; since each set has only one block, we only need to check whether that block's tag equals the incoming tag (and the valid bit is set). If yes, the data is in the cache. If no, it is not in the cache.
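As a concrete illustration, here is a minimal sketch of such a direct-mapped lookup in C. The structure layout, field names, and the widths S_BITS and B_BITS are my own illustrative choices, not code from CSAPP:

```c
#include <stdint.h>
#include <stdbool.h>

#define S_BITS 6                     /* s: 2^6 = 64 sets            */
#define B_BITS 6                     /* b: 2^6 = 64 bytes per block */
#define NUM_SETS (1u << S_BITS)

typedef struct {
    bool     valid;                  /* does this block hold data?   */
    uint64_t tag;                    /* upper t bits of the address  */
    uint8_t  data[1u << B_BITS];     /* the B cached bytes           */
} cache_block;

static cache_block cache[NUM_SETS];  /* E = 1: one block per set */

/* Slice the address into tag / set index / block offset and probe. */
bool lookup(uint64_t addr)
{
    uint64_t offset = addr & ((1u << B_BITS) - 1);
    uint64_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);
    uint64_t tag    = addr >> (B_BITS + S_BITS);

    (void)offset;                    /* on a hit, offset selects the byte */
    cache_block *blk = &cache[set];
    return blk->valid && blk->tag == tag;   /* hit iff valid and tags match */
}
```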

This is called a direct-mapped cache because every incoming address maps to exactly one block: the set index selects a set, the set contains a single block, and the tag is compared against that one block. The obvious drawback of this cache is that if two addresses A and B share the same set index and the program's access pattern interleaves them as ABABAB, then every access misses, because A and B keep evicting each other. The remedy is to increase E so that several blocks with the same set index can live in a set at the same time, as described below.

2. Set-associative mapping (E-way Set Associative Cache)

When E > 1, each set has E blocks available for caching, which means addresses with the same set index no longer have to compete for a single block; there are now several blocks to choose from. And once there is a choice, something must decide which old block to overwrite when a new one comes in: that is where the replacement (eviction) algorithm comes in. Here is an example of 2-way.

Each incoming address is again decomposed into tag, set index, and block offset. The set with the matching set index is located, and then all E blocks in that set are compared to see whether any of them has the matching tag and valid = 1.
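Continuing the sketch above, a set-associative lookup scans every way of the selected set; on a miss, a replacement policy picks the victim (LRU is used here as one common choice). The LRU-timestamp scheme and all names are again illustrative assumptions:

```c
#define E_WAYS 2                          /* 2-way set associative */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;                   /* timestamp for LRU */
} sa_block;

static sa_block sets[NUM_SETS][E_WAYS];
static uint64_t tick;                     /* global access counter */

/* Probe all E blocks of the set; on a hit, refresh the LRU stamp. */
bool sa_lookup(uint64_t addr)
{
    uint64_t set = (addr >> B_BITS) & ((1u << S_BITS) - 1);
    uint64_t tag = addr >> (B_BITS + S_BITS);

    for (int way = 0; way < E_WAYS; way++) {
        sa_block *blk = &sets[set][way];
        if (blk->valid && blk->tag == tag) {
            blk->last_used = ++tick;      /* mark as most recently used */
            return true;                  /* hit */
        }
    }
    return false;                         /* miss: the victim would be the
                                             way with the smallest last_used */
}
```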

3. Fully associative mapping

Going from direct mapping to set-associative mapping increases the freedom of where data can be stored in the cache, and a fully associative mapping maximizes this freedom. As the name implies, all blocks can be regarded as belonging to one single set; the concept of the set index disappears, and the address is divided into only a tag and an offset. A block can be placed anywhere, as long as the corresponding Cache line is empty or is chosen by the replacement rule.

Obviously, for the same cache size, a fully associative cache makes the fullest use of all the cache space and has the most flexible replacement strategy. The downside is that every time an address is checked against the cache, all cache blocks must be searched for a matching tag.

II. Writing to the Cache

1. What to do on a write-hit

Strategy 1: write-through: write immediately to memory.

When a write hits the cache, not only is the data in the cache modified, the new data is also immediately propagated to main memory, so the corresponding write completes there as well.

The obvious downside of the write-through strategy is that every write goes to both the cache and main memory, which is inefficient.

Strategy 2: write-back: defer the write to memory until the line is replaced.

When a write hits the cache, only the data in the cache is modified; nothing is flushed to memory. The data is not written to memory until that block is evicted.

For write-back, a dirty bit must be added to each line to indicate whether the block has been written. Algorithm: when the cache is about to overwrite a block, it checks the dirty bit; if it is 1, the data is written back to memory. If dirty is 0, there is of course no need to write anything back.

In other words: write back to memory at the last possible moment.

2. What to do on a write-miss

Policy 1: write-allocate: load the block into the cache, then update the line in the cache.

Policy 2: no-write-allocate: write straight to memory, without loading the block into the cache.

At present, caches basically adopt the combination of set-associative mapping + write-back + write-allocate.
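Putting section II together, here is a minimal sketch of how a write could be handled under write-back + write-allocate, reusing the address-slicing constants from the earlier sketch. The dirty-bit bookkeeping and the helper names fetch_from_memory / write_to_memory are hypothetical stand-ins for the memory interface:

```c
typedef struct {
    bool     valid;
    bool     dirty;                        /* modified since it was loaded? */
    uint64_t tag;
    uint8_t  data[1u << B_BITS];
} wb_block;

/* Hypothetical helpers standing in for the memory interface. */
void fetch_from_memory(uint64_t addr, uint8_t *buf);
void write_to_memory(uint64_t tag, uint64_t set, const uint8_t *buf);

/* Write one byte under write-back + write-allocate (direct-mapped for
   brevity; a set-associative version would first pick a victim way). */
void cache_write(wb_block *line, uint64_t addr, uint8_t byte)
{
    uint64_t offset = addr & ((1u << B_BITS) - 1);
    uint64_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);
    uint64_t tag    = addr >> (B_BITS + S_BITS);

    if (!(line->valid && line->tag == tag)) {            /* write miss */
        if (line->valid && line->dirty)                  /* dirty victim: */
            write_to_memory(line->tag, set, line->data); /* flush it first */
        fetch_from_memory(addr, line->data);             /* write-allocate */
        line->valid = true;
        line->tag   = tag;
        line->dirty = false;
    }
    line->data[offset] = byte;             /* modify only the cache copy */
    line->dirty = true;                    /* defer the memory write */
}
```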

III. The Memory Hierarchy (Memory Hierarchies)

The gap in access speed between our storage devices and the CPU, which widens as capacities grow, is largely made up for by program locality. These characteristics of storage technology and the locality of our programs complement each other perfectly, and together they suggest how a storage system should be designed: as a memory hierarchy.

1. On a cache miss, replacing an existing block with the newly fetched one is called replacing or evicting the block.
2. The block that is ejected is called the victim block.
3. Deciding which block to replace is governed by the cache's replacement policy.

IV. Cache effects on program performance

1. The memory mountain (Memory Mountain)

Read throughput (Read Throughput), also called read bandwidth (Read Bandwidth), is the number of bytes read from the storage system per second (MB/s), and is often used to measure the performance of a program's storage system. The memory mountain (Memory Mountain) measures read throughput as a function of spatial and temporal locality. Each computer has a unique memory mountain that characterizes the capabilities of its memory system. Below are the test program for the memory mountain and the memory mountain of an Intel Core i7 system. The run function below is what actually measures the memory mountain.

double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(double);

    test(elems, stride);                     // warm up the cache
    cycles = fcyc2(test, elems, stride, 0);  // call test(elems, stride)
    return (size / stride) / (cycles / Mhz); // convert cycles to MB/s
}
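run() depends on two helpers not shown here: test, which traverses the array, and fcyc2, which times it in CPU cycles. Below is a simplified sketch of what test might look like; the real CSAPP mountain program uses 4x4 loop unrolling and multiple accumulators, so this version only conveys the idea:

```c
#define MAXELEMS (1 << 24)            /* enough doubles to overflow L3 */
static double data[MAXELEMS];         /* the array being traversed */

/* Read the first `elems` elements of data[] with the given stride.
   Returning the sum keeps the compiler from deleting the reads. */
double test(int elems, int stride)
{
    double acc = 0.0;
    for (int i = 0; i < elems; i += stride)
        acc += data[i];
    return acc;
}
```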

Note: this graph shows the data of a second run, taken after a first warm-up run.

1. When the size is fixed and <= 32KB (the L1 cache capacity), throughput is almost unchanged as the stride increases, because all the data is in the L1 cache and every access hits.
2. When the size is fixed and > 32KB, only part of the data fits in the L1 cache, and the larger the size, the less of it can stay cached, so the spatial-locality slope appears.

With the stride held fixed, you can see that:

- When the working set is <= 32KB, read throughput is almost unchanged, because all the data is in the L1 cache and every access hits.
- When the working set is larger than 32KB and smaller than 256KB, a growing share of the reads are served from the L2 cache, so throughput drops sharply and a cliff appears; as the size approaches 32KB + 256KB, throughput settles down to roughly the L2 read speed.
- When the working set is a bit larger than 256KB and smaller than 8MB, there is another cliff, after which roughly the L3 read speed is maintained.
- When the working set is larger than 8MB, data can only be read from main memory, so throughput drops rapidly; after that, no matter how large the size gets, roughly the main-memory read speed is maintained.

Even when the working set is too large to fit in any cache, the highest point of the main-memory ridge (stride = 1) is still 8 times higher than its lowest point. So even when a program's temporal locality is poor, spatial locality can compensate for it, and it is very important.

2. Rearranging loops to improve spatial locality
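A classic illustration of this idea (my sketch, in the spirit of CSAPP's row-sum versus column-sum example; the names are illustrative) is that traversing a 2D array in the order it is laid out in memory misses only once per block, while traversing it column-wise can miss on every access:

```c
#define N 1024

/* Good spatial locality: the inner loop walks memory with stride 1,
   so there is only one miss per block (1 in 8 for 64-byte blocks). */
double sum_rows(double a[N][N])
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Poor spatial locality: the inner loop strides by N doubles, touching
   a different block every iteration, so once the working set exceeds
   the cache, every access misses. */
double sum_cols(double a[N][N])
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```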

3. Using blocking to improve temporal locality

Below is the code for a matrix multiply (see the sketch after the list). In its inner loop, only the access a[i*n+k] has good spatial locality, while b[k*n+j] has poor spatial locality, since it strides through b a whole row at a time. The number of misses per iteration of the (i, j) loops is n/8 + n = 9n/8, and over all n^2 iterations that is (9/8)n^3 misses in total. Optimization through matrix blocking is therefore considered. Here are some important points to close the lesson:

1. Focus your attention on the inner loop, where most of the computation and memory accesses are concentrated;

2. Spatial locality is best when data is accessed with stride 1, in the order it is actually stored in memory;
3. Once a piece of data has been read from memory, use it as much as possible (the kij version);
4. When the data size is large enough, temporal locality deteriorates, because a block may already be evicted by the next time it is accessed.
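Here is a sketch of the two versions discussed above: the plain ijk multiply and a blocked variant. The block size BSIZE is an illustrative choice; it should be picked so that three BSIZE x BSIZE sub-matrices fit in the cache together:

```c
#define BSIZE 32   /* illustrative: 3 * 32*32 doubles = 24KB, fits in 32KB L1 */

/* Plain ijk multiply: a[i*n+k] streams with stride 1 (good spatial
   locality), b[k*n+j] strides by n doubles (poor), giving about
   9n/8 misses per (i, j) pair and (9/8)n^3 in total. */
void mmm_ijk(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] += sum;
        }
}

/* Blocked multiply: each BSIZE x BSIZE tile of b is reused many times
   while it is still in the cache, improving temporal locality.
   For brevity, n is assumed to be a multiple of BSIZE. */
void mmm_blocked(const double *a, const double *b, double *c, int n)
{
    for (int kk = 0; kk < n; kk += BSIZE)
        for (int jj = 0; jj < n; jj += BSIZE)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + BSIZE; j++) {
                    double sum = c[i*n + j];
                    for (int k = kk; k < kk + BSIZE; k++)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
}
```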