The Art of Multiprocessor Programming is an important reference for concurrent programming. Based on my own experience and understanding from reading it, this article shares some personal notes for readers who want a deeper understanding of the hardware that concurrent programs run on.
Hardware
Processors and Threads
A multiprocessor consists of multiple hardware processors, each capable of executing a sequential program. When talking about multiprocessor architectures, the basic unit of time is the instruction cycle: the time it takes a processor to fetch and execute an instruction.
A thread is a sequential program, a software abstraction. A context switch means the processor executes one thread for a period of time and then switches to executing another. The processor may deschedule a thread, removing it from execution, for various reasons:
- The thread makes a memory request that takes a while to complete
- The thread has run long enough to let another thread execute.
When a thread is removed from the schedule, it may resume execution on another processor.
Interconnect
There are two common basic server interconnection structures:
- SMP(Symmetric Multiprocessing)
- Nonuniform Memory Access (NUMA)
SMP (Symmetric Multiprocessing) means that multiple CPUs work symmetrically, with no master/subordinate relationship among them. Every CPU shares the same physical memory, and it takes every CPU the same amount of time to access any memory address, so SMP is also called UMA (Uniform Memory Access). In a typical SMP architecture there is a cache between each CPU and memory, and both the processors and main memory have bus controllers responsible for sending and listening to broadcast messages on the bus. The overall structure is shown in the figure below:
This structure is the easiest to implement, but as the number of processors grows, the bus cannot scale with them and eventually becomes overloaded.
In a NUMA architecture, in contrast to SMP, a series of nodes are connected to each other through a point-to-point network, somewhat like a small local area network. Each node contains several processors and some local memory. A node's local memory is also accessible to the other nodes; of course, accessing one's own local memory is faster than accessing another node's memory. The network is more complex than a bus and requires more elaborate protocols, but it brings scalability. As shown below:
From a programmer's point of view, the interconnect is a finite resource, whether the underlying architecture is SMP or NUMA. Keep this in mind when writing code and avoid consuming too much interconnect bandwidth.
Memory
All processors share memory, which is typically abstracted as a large array of "words" indexed by address. The word length is platform-dependent, nowadays usually 64 bits, as is the maximum address length. A 64-bit address space is a very large amount of memory.
The process for a processor to access memory is briefly summarized as follows:
- To read, the processor sends memory a message containing the address; memory replies with a message carrying the value stored at that address.
- To write, the processor sends memory a message containing the address and the value to be written; after the data is written, memory replies with an acknowledgement message.
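As a rough sketch of this request/response exchange, memory can be modeled as a big array of 64-bit words; the class and method names below (Memory, read, write) are invented for illustration only:

```java
// Illustrative sketch only: models memory as an array of 64-bit "words" and
// the read/write exchange as simple method calls.
public class Memory {
    private final long[] words;

    public Memory(int size) {
        this.words = new long[size];
    }

    // "Send a message containing the address to read"; the return value
    // plays the role of memory's reply message.
    public long read(int address) {
        return words[address];
    }

    // "Send a message containing the address and the value to be written";
    // returning normally stands in for the acknowledgement message.
    public void write(int address, long value) {
        words[address] = value;
    }
}
```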
Cache
Cache hit ratio
Reading directly from memory every time is slow: a memory access can take hundreds of instruction cycles, which is very inefficient. Therefore one or more caches are introduced: small, fast pieces of memory next to the processor, located between the processor and main memory.
When the value at an address needs to be read, the cache is checked first: if the address is present, this is a hit and the value is read directly; if it is absent, this is a miss. Similarly, when a value needs to be written to an address, if the address is already in the cache it can be written there without accessing main memory.
We generally care about the percentage of requests that hit in the cache, which is called the cache hit ratio.
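For example, if a program issues 1,000 memory requests and 950 of them are served from the cache, the hit ratio is 950 / 1000 = 95%. A tiny helper to make the arithmetic explicit (names are illustrative, not from the book):

```java
// Illustrative helper: hit ratio = hits / (hits + misses).
public final class CacheStats {
    public static double hitRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        // 950 hits out of 1,000 requests -> 0.95
        System.out.println(hitRatio(950, 50));
    }
}
```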
Locality and cache lines
Most programs exhibit a high degree of locality:
- If a processor reads or writes to a memory address, it is likely to read or write to the same address again soon.
- If a processor reads or writes a memory address, it is likely that it will soon read or write nearby addresses as well.
Because of locality, a cache typically operates not on one word at a time but on a group of adjacent words, called a cache line.
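A classic illustration of spatial locality is array traversal order. Java stores each row of a 2-D array contiguously, so walking it row by row touches adjacent words (and thus the same cache line) repeatedly, while walking it column by column strides across rows and loses that benefit. A rough sketch, for illustration only:

```java
// Illustrative sketch: the row-major loop visits adjacent elements, so most
// accesses hit the cache line that was just loaded; the column-major loop
// takes large strides and loses that spatial locality.
public class Locality {
    static final int N = 2048;
    static final int[][] matrix = new int[N][N];

    static long sumRowMajor() {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += matrix[i][j];   // adjacent words: good locality
        return sum;
    }

    static long sumColumnMajor() {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += matrix[i][j];   // large strides: poor locality
        return sum;
    }
}
```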
Multilevel cache
Modern processors typically have multiple levels of cache, referred to, from nearest the processor outward, as the L1, L2, and L3 caches:
- L1 Cache is usually located on the same chip as the processor, closest to the processor, and requires only one to three instruction cycles to access
- The L2 cache is usually also located on the same chip as the processor, but placed next to it rather than inside the core; reaching it requires longer copper wires or even extra circuitry, which increases latency to typically around 8 to 11 instruction cycles.
- The L1 and L2 caches are private to each processor, so the same data may end up duplicated across several processors' private caches. A cache shared by all processors, the L3 cache, is therefore introduced. Its materials and wiring differ from those of L1/L2, and it takes longer to access, generally around 20 to 25 instruction cycles.
Cache capacity is limited, and only a fraction of memory can be held in the cache at any one time, so a cache replacement policy is needed. If the replacement policy may evict any cache line, the cache is fully associative. If, at the other extreme, each memory location can go into only one particular cache line, the cache is direct mapped. The compromise, where a line may be placed in any slot of a set of size k, is called k-way set associative.
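To make the mapping concrete, here is a small sketch that computes which set a memory address falls into in a k-way set associative cache; all sizes and names below are assumptions chosen only to make the arithmetic concrete:

```java
// Illustrative sketch of address-to-set mapping in a k-way set associative
// cache. Sizes (64-byte lines, 32 KiB cache, 8 ways) are assumptions.
public class SetAssociativeMapping {
    static final int LINE_SIZE = 64;                 // bytes per cache line
    static final int CACHE_SIZE = 32 * 1024;         // total cache bytes
    static final int WAYS = 8;                       // the k in "k-way"
    static final int NUM_SETS = CACHE_SIZE / (LINE_SIZE * WAYS);

    // A line may be placed in any of the WAYS slots of exactly one set.
    static int setIndex(long address) {
        long lineNumber = address / LINE_SIZE;
        return (int) (lineNumber % NUM_SETS);
    }

    public static void main(String[] args) {
        // Fully associative corresponds to WAYS == total number of lines
        // (any slot); direct mapped corresponds to WAYS == 1 (one slot).
        System.out.println(setIndex(0x1234L));
    }
}
```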
Coherence
Sharing (or contention) occurs when one processor accesses a main memory address that another processor has already loaded into its cache. Cache coherence must then be considered: if one processor updates a shared cache line, the other processors' copies need to be invalidated so that they do not read stale values.
In the MESI cache coherence protocol, a cache line can be in one of four states:
- Modified: the cache line has been modified and must eventually be written back to main memory. Until then, no other processor may cache this line.
- Exclusive: the cache line has not been modified, and no other processor may load it into its cache.
- Shared: the cache line has not been modified, and other processors may also load it into their caches.
- Invalid: the cache line holds no meaningful data.
For example, suppose that the processors and main memory are connected by a bus, as shown in the figure below:
A) Processor A reads data from address a and stores the line in its cache in the Exclusive state.
B) Processor B reads data from address a; processor A detects the address conflict and responds with the data for address a from its cache. After that, the line for address a is cached by both A and B in the Shared state.
C) Processor B writes to address a, changes its copy's state to Modified, and broadcasts an invalidation so that all other processors that have cached the line (here, A) set their copies to Invalid.
D) Processor A then needs to access address a again and broadcasts the request; B sends the modified data both to A and to main memory, and both copies are set to the Shared state.
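The four states and the walkthrough above can be restated as a toy Java sketch; the enum and the transition sequence below are purely illustrative and not a real coherence implementation:

```java
// Toy model of the MESI states, used only to replay the A/B walkthrough above.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

public class MesiWalkthrough {
    public static void main(String[] args) {
        MesiState lineAtA = MesiState.INVALID;
        MesiState lineAtB = MesiState.INVALID;

        // A) Processor A reads address a: only A has a copy.
        lineAtA = MesiState.EXCLUSIVE;

        // B) Processor B reads address a: A supplies the data, both copies Shared.
        lineAtA = MesiState.SHARED;
        lineAtB = MesiState.SHARED;

        // C) Processor B writes address a: B's copy becomes Modified,
        //    every other copy is invalidated.
        lineAtB = MesiState.MODIFIED;
        lineAtA = MesiState.INVALID;

        // D) Processor A reads address a again: B writes the data back and
        //    both copies return to Shared.
        lineAtA = MesiState.SHARED;
        lineAtB = MesiState.SHARED;

        System.out.println("A=" + lineAtA + ", B=" + lineAtB);
    }
}
```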
When processors access logically distinct data that happens to lie on the same cache line, the situation is called false sharing.
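A typical way this shows up in Java is two counters, each updated by a different thread, that end up next to each other in the same object and therefore very likely on the same cache line. The sketch below is illustrative only; the usual mitigation is to pad the fields or put them in separate objects so each counter lands on its own line:

```java
// Illustrative sketch of false sharing: the two counters are logically
// independent, but because they sit next to each other in the same object
// they are likely to share a cache line, so writes by thread 1 keep
// invalidating the line that thread 2 is writing, and vice versa.
public class FalseSharingDemo {
    static class Counters {
        volatile long counterA;   // updated only by thread 1
        volatile long counterB;   // updated only by thread 2
    }

    public static void main(String[] args) throws InterruptedException {
        Counters c = new Counters();
        Thread t1 = new Thread(() -> { for (int i = 0; i < 100_000_000; i++) c.counterA++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 100_000_000; i++) c.counterB++; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Padding the fields (or using separate objects) removes the false
        // sharing without changing the program's logic.
        System.out.println(c.counterA + " " + c.counterB);
    }
}
```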
Spinning
Spinning: a processor repeatedly checks a word in memory, waiting for another processor to change it.
On an SMP or NUMA system with caches, spinning consumes very few resources. From our introduction to MESI: the first time the address is read, a cache miss occurs and the line containing it is loaded into the cache. Thereafter, as long as the data does not change, the processor reads it only from its own cache without touching the interconnect. When the address is eventually changed, the spinning processor's copy is invalidated, and the next read re-requests the data and observes the change.
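A minimal illustration of local spinning in Java (the flag name is made up for this example): the waiting thread repeatedly reads the same word, and after the first miss those reads are served from its own cache until the writer changes the value.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative spin-wait: the waiter re-reads the same word until another
// thread changes it. With MESI, the repeated reads hit the waiter's cache
// and do not touch the interconnect until the value actually changes.
public class SpinDemo {
    static final AtomicBoolean ready = new AtomicBoolean(false);

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            while (!ready.get()) {
                Thread.onSpinWait();   // hint to the CPU that we are spinning
            }
            System.out.println("observed the change");
        });
        waiter.start();

        Thread.sleep(100);   // let the waiter spin for a while
        ready.set(true);     // invalidates the waiter's cached copy
        waiter.join();
    }
}
```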
Why TTASLock is better than TASLock
From the analysis above, every call TASLock makes to LOCKED.compareAndSet(this, false, true) is a write that must acquire the cache line, generating an invalidation and occupying interconnect bandwidth; since the while loop executes it on every iteration, it produces a flood of coherence traffic. TTASLock's locked.get(this), by contrast, is just a local spin in the cache. That is why TTASLock performs much better than TASLock.
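For reference, here is a sketch of the two locks in the common AtomicBoolean formulation used in the book; the article's code uses a field-updater/VarHandle style instead, but the spinning pattern, and thus the performance difference, is the same:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// TASLock: every loop iteration performs getAndSet, a write that broadcasts
// an invalidation and occupies the interconnect even while the lock is held.
class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (state.getAndSet(true)) {
            // spin, generating coherence traffic on every iteration
        }
    }

    public void unlock() {
        state.set(false);
    }
}

// TTASLock: spin on a plain read first (served from the local cache), and
// only attempt the expensive getAndSet once the lock looks free.
class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (state.get()) {
                // local spinning: no interconnect traffic while the lock is held
            }
            if (!state.getAndSet(true)) {
                return;
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}
```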