
Original link: mp.weixin.qq.com/s/_j4O1aGnj… By Lin 䭽


The TLB is a hardware "component" inside the CPU.

Because the TLB lives inside the CPU, its replacement logic cannot rely on data structures built in memory. So for an 8-way set-associative cache, the design problem boils down to this: each time a replacement is needed, pick the least recently used entry out of the 8 entries in the set.

1. Accumulate a usage count

One approach is to have the hardware compare, in parallel, the usage counts recorded in the eight cache entries.

This solution needs to do two things:

  • Extra space is needed in each cache entry to record how many times the entry has been used (accumulated bits). This is similar to the timed read-bit technique we discussed in page table design: read bits are folded into an accumulated count after a fixed period of time.

  • The hardware must implement a fast circuit for finding the minimum value among those counts.

The first requirement adds space overhead and also needs a timer to drive the accumulation, which is costly. Keep in mind that cache capacity is expensive, so spending it on counters naturally comes at the expense of usable cache space.

The second requirement likewise calls for extra hardware design.
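To make the cost concrete, here is a minimal Java sketch of this scheme (the names `CounterLru`, `touch`, and `victim` are invented for illustration): every way in the set carries a usage counter, and eviction picks the smallest one. In real hardware the scan below would have to be a parallel comparison circuit, and the counters themselves eat into cache space.

```java
// Sketch of scheme 1: per-entry usage counters plus a minimum search.
public class CounterLru {
    private final long[] useCount = new long[8]; // one counter per way in the set

    // Record one use of a way (the "accumulate" step).
    public void touch(int way) {
        useCount[way]++;
    }

    // Pick the way with the smallest count, i.e. the least recently used one.
    // Hardware would need a dedicated circuit to do this comparison quickly.
    public int victim() {
        int lru = 0;
        for (int way = 1; way < 8; way++) {
            if (useCount[way] < useCount[lru]) {
                lru = way;
            }
        }
        return lru;
    }
}
```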

So, is there a better plan?

2. Simulating LRU with 1 bit

A better solution is to simulate LRU rather than implement it exactly.

We can keep the basic idea above, but give each cache entry just a single LRU bit that records whether the entry has been used recently. Replacement then only considers entries whose LRU bit is 0.

A further refinement saves the timer: whenever all eight LRU bits have been set to 1, clear them all back to 0. In other words, using an entry sets its LRU bit to 1, and the moment all 8 bits are 1, they are all reset to 0.
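Here is a minimal sketch of the 1-bit scheme, again with invented names (`OneBitLru`, `touch`, `victim`): each way keeps a single bit, using a way sets its bit, and once all eight bits are 1 they are all cleared to 0, which is what replaces the timer.

```java
// Sketch of the 1-bit approximation: one LRU bit per way, no counters, no timer.
public class OneBitLru {
    private final boolean[] used = new boolean[8]; // LRU bit per way: true = used recently

    // Mark a way as recently used; when all 8 bits are set, clear them all.
    public void touch(int way) {
        used[way] = true;
        boolean allSet = true;
        for (boolean bit : used) {
            allSet &= bit;
        }
        if (allSet) {
            java.util.Arrays.fill(used, false); // the "all set to 0" step from the text
        }
    }

    // Replacement only looks at ways whose LRU bit is still 0.
    public int victim() {
        for (int way = 0; way < 8; way++) {
            if (!used[way]) {
                return way;
            }
        }
        return 0; // not reached if touch() is used as above
    }
}
```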

3. Simulating LRU with a search tree

Finally, I’ll introduce a clever way to simulate LRU with search trees.

For an 8-way set-associative cache, this method needs 8 − 1 = 7 bits to build the tree.

The eight cache entries are controlled by seven nodes, each of which is one bit.

0 means the node points to the left, 1 means the node points to the right.

When initialized, all nodes point to the left, as shown below:

Each subsequent write starts at the root node and follows the arrows (0 to the left, 1 to the right) to find the next slot to update.

Say the next slot to update in the diagram is 0. Once the update is complete, every node arrow along that path is reversed, i.e., 0 becomes 1 and 1 becomes 0.
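This procedure can be written in a few lines of code. The Java sketch below (the class name `TreePlru` and its methods are invented for illustration) stores the seven node bits in an array, follows them from the root to find the next slot, and, when a slot is used, points every arrow on its path away from that slot. For a fill at the slot the arrows were leading to, that is exactly the "reverse all arrows on the path" step just described; the same update also handles a hit on an existing entry, which is discussed further below.

```java
public class TreePlru {
    // bits[0] is the root; bits[1..2] are its children; bits[3..6] the next level.
    // 0 = the arrow points left, 1 = the arrow points right (same convention as the text).
    private final int[] bits = new int[7]; // all 0 initially: every arrow points left

    // Follow the arrows from the root; the leaf we reach is the next slot to replace.
    public int victim() {
        int node = 0;
        for (int level = 0; level < 3; level++) {
            node = 2 * node + 1 + bits[node]; // left child if the bit is 0, right child if 1
        }
        return node - 7; // leaf indices 7..14 map to slots 0..7
    }

    // After a slot is used (fill or hit), point every arrow on its path away from it.
    // For a fill at the slot the arrows led to, this equals reversing all arrows on the path.
    public void touch(int slot) {
        int node = 0;
        for (int level = 2; level >= 0; level--) {
            int direction = (slot >> level) & 1; // 0 = the slot lies in the left subtree
            bits[node] = 1 - direction;          // make the arrow point the other way
            node = 2 * node + 1 + direction;     // descend toward the slot
        }
    }
}
```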

The figure above shows the result of reading a: all the arrows on the path have been reversed, and the next slot is now 4, marked in orange.

The image above shows the result of reading b; the orange marker, i.e. the next slot to update, is now at 2.

The figure above shows the state after reading c.

Assuming the subsequent reads are d, e, f, g, h, the cache ends up looking like this:

If the user reads a value that is already cached, such as c, the arrows pointing toward c are flipped. Here is the result of reading c again:

This does not change the location of the next update; it only reverses the path leading to c.

If we now read a new value x, it will overwrite the entry in the orange position.

So, essentially, this tree of bits maintains an approximately first-in, first-out order: whatever leaf the node arrows lead to is the entry to be eliminated (reused) next.
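As a small usage demo (assuming the `TreePlru` sketch above), filling an empty set with eight values uses the slots in the order 0, 4, 2, 6, 1, 5, 3, 7 and then wraps back to slot 0, the oldest entry, which is the FIFO-like behaviour described here:

```java
public class TreePlruDemo {
    public static void main(String[] args) {
        TreePlru tree = new TreePlru();
        for (char value = 'a'; value <= 'h'; value++) {
            int slot = tree.victim();            // slot chosen for this fill
            System.out.println("read " + value + " -> fill slot " + slot);
            tree.touch(slot);                    // reverse the arrows on the path
        }
        System.out.println("next victim: slot " + tree.victim()); // back to slot 0
    }
}
```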

This is a design I personally find very ingenious. If you built a queue instead, moving the element that was just hit to the tail of the queue each time, you would need at least a linked list, and every node in that list would need at least the current value plus a next pointer. Creating such data structures in memory is easy, but doing so inside the CPU is very hard.
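For contrast, here is what an exact LRU looks like in software, using Java's LinkedHashMap in access order. It works in a few lines, but under the hood it is precisely the pointer-linked structure described above: cheap in memory, expensive in silicon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Exact LRU cache: trivial in software, but it relies on a doubly linked list
// of nodes with pointers, which is exactly what is hard to build inside the CPU.
public class ExactLru<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public ExactLru(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration order = LRU order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```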

So this bit-tree design solves the problem neatly.

Of course, this is only a simulation of LRU, and you can still construct access sequences whose eviction order deviates from true LRU.

