The background of NUMA
Before NUMA, CPUs kept happily marching toward higher and higher clock frequencies (scaling up). Once physical limits got in the way, development turned toward more and more cores (scaling out). In the beginning, the memory controller still lived in the North Bridge, and every CPU accessed memory through it, so all memory accesses were “uniform”, as shown in the following figure:
This architecture is called Uniform Memory Access (UMA). It is very simple to deal with at the software level: the bus model guarantees that all memory accesses are uniform, that is, every processor core shares the same memory address space. However, as the number of CPU cores increases, such an architecture inevitably runs into problems, such as bandwidth pressure on the bus and contention when accessing the same memory. NUMA was invented to solve these problems.
Note: the North Bridge, also known as the Host Bridge, is the most important chip in a motherboard chipset and plays the leading role in it. Chipsets are usually named after their North Bridge chip; for example, the North Bridge of the Intel 845E chipset is the 82845E, and the North Bridge of the 875P chipset is the 82875P.
NUMA architecture details
The background above already suggests the principle behind Non-Uniform Memory Access (NUMA): in this architecture, different memory banks and CPU cores belong to different nodes, and each node has its own Integrated Memory Controller (IMC).
Within a node, the architecture is similar to SMP, and the cores communicate over the IMC bus. Different nodes communicate with each other through the Quick Path Interconnect (QPI), as shown in the following figure:
Generally speaking, one CPU socket corresponds to one node. One property worth noting is that QPI latency is higher than that of the IMC bus, so a CPU's access to remote memory differs from its access to local memory, and measurements show the difference is very noticeable.
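To make the local/remote distinction concrete, here is a minimal sketch using libnuma (the library and the two-node layout are assumptions, not measurements from this article): it allocates one buffer on the node the current CPU belongs to and another on a different node; on real hardware, touching the remote buffer is measurably slower.

```c
// gcc local_vs_remote.c -o local_vs_remote -lnuma   (assumed build line)
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {                     /* kernel without NUMA support */
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int cpu         = sched_getcpu();               /* CPU we are running on */
    int local_node  = numa_node_of_cpu(cpu);        /* its NUMA node         */
    int max_node    = numa_max_node();
    int remote_node = (local_node + 1) % (max_node + 1);

    size_t len   = 64UL * 1024 * 1024;              /* 64 MiB per buffer */
    char *local  = numa_alloc_onnode(len, local_node);
    char *remote = numa_alloc_onnode(len, remote_node);
    if (!local || !remote) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(local,  0, len);   /* served by the local IMC                */
    memset(remote, 0, len);   /* crosses the inter-node link (QPI/UPI)  */

    printf("cpu %d is on node %d, remote buffer bound to node %d\n",
           cpu, local_node, remote_node);
    numa_free(local, len);
    numa_free(remote, len);
    return 0;
}
```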
In Linux, there are several caveats to NUMA:
- By default, the kernel does not migrate memory pages from one NUMA Node to another;
- However, mechanisms such as NUMA Balancing can migrate pages between nodes (for example, pushing cold pages out to a remote node); see the sketch after this list;
- The policies governing page migration between NUMA nodes are still actively debated in the community.
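To make the first two points concrete, here is a hedged sketch (assuming the libnuma headers are installed) that uses the move_pages() system call to ask the kernel which node a page currently lives on and then to request that it be migrated; the target node 0 is just an example.

```c
// gcc page_node.c -o page_node -lnuma   (assumed build line)
#include <numa.h>
#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = aligned_alloc(page_size, page_size);
    if (!buf) return 1;
    *(volatile char *)buf = 1;            /* first touch places the page on a node */

    void *pages[1]  = { buf };
    int   status[1];

    /* nodes == NULL: only report where the page is, do not move it */
    if (move_pages(0, 1, pages, NULL, status, MPOL_MF_MOVE) == 0)
        printf("page currently resides on node %d\n", status[0]);

    int target[1] = { 0 };                /* example: ask for node 0 */
    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) == 0)
        printf("after migration request: node %d\n", status[0]);

    free(buf);
    return 0;
}
```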
NUMA detailed analysis
The Non-Uniform Memory Architecture was developed to solve the scalability problem of the traditional symmetric multiprocessor (SMP) system. In an SMP system, the processors share the memory controller in the North Bridge to access external memory and I/O, so all processors access memory and I/O in the same way and at the same cost. As more processors are added to such a system, contention on the shared bus grows and overall performance drops noticeably. The SMP system is shown as follows:
In a NUMA system, by contrast, nodes are classified relative to the CPU performing the access:
- Local node: for any CPU within a node, that node is its local node. (Fastest access)
- Neighbor node: a node adjacent to the local node. (Slower than local)
- Remote node: a node that is neither the local node nor a neighbor node. (Slowest access)
A hypercube can be used as an effective topology to describe a NUMA system: it limits the number of nodes to 2^C, where C is the number of neighbors each node has, as shown in the figure below.
Overhead summary
Taking C=3 as an example: for node 1, nodes 2, 3 and 5 are neighbor nodes, while nodes 4, 6, 7 and 8 are remote nodes. Clearly, the access cost satisfies local node < neighbor node < remote node.
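Linux exposes these relative costs as a distance matrix (from the ACPI SLIT table). A small sketch, assuming libnuma is available, prints it; the local distance is conventionally 10, and neighbor/remote nodes report larger values, matching the ordering above.

```c
// gcc distances.c -o distances -lnuma   (assumed build line)
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    int max = numa_max_node();

    printf("node ");
    for (int j = 0; j <= max; j++) printf("%4d", j);
    printf("\n");

    for (int i = 0; i <= max; i++) {
        printf("%4d ", i);
        for (int j = 0; j <= max; j++)
            printf("%4d", numa_distance(i, j));   /* relative access cost */
        printf("\n");
    }
    return 0;
}
```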
NUMA case study
AMD HyperTransport
While earlier SMP (symmetric multiprocessor) systems had a single memory controller located in the North Bridge, the more advanced approach today is to integrate the memory controller into the CPU, so that each CPU has its own memory controller and they no longer compete with one another.
One of the first processors to do this was the AMD Opteron family, which AMD introduced in 2003.
Each CPU integrates a memory controller (IMC), and HyperTransport links are used to connect the CPUs. These links allow a CPU to reach memory attached to other CPUs, though of course at a higher cost than accessing its local memory.
Operating system support
To support the NUMA architecture, the OS design must take memory distribution into account.
As a simple example, if a process is running on a given processor, the physical memory allocated to the process should be the processor’s local memory, not external memory.
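As a hedged sketch of that local-allocation idea (libnuma assumed; on Linux, local allocation is in fact already the default policy), a thread can request it explicitly and then rely on first touch to place its pages:

```c
// gcc local_policy.c -o local_policy -lnuma   (assumed build line)
#include <numa.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) return 1;

    numa_set_localalloc();              /* allocate on the node we are running on */

    size_t len = 16UL * 1024 * 1024;
    char *buf = malloc(len);
    if (!buf) return 1;
    memset(buf, 0, len);                /* first touch -> pages land on the local node */
    free(buf);
    return 0;
}
```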
The OS also tries to avoid migrating a process from one node to another. In an ordinary multiprocessing system the OS already tries not to move processes between processors, because that means losing the contents of the processor’s cache; if a migration does become necessary, such an OS can pick any idle processor.
In a NUMA system, however, the choice of the new processor is more constrained. The most important constraint is that the new processor’s memory access cost should not be higher than that of the previous one, which means a processor in the local node should be selected whenever possible.
If no such processor can be found, the OS has no choice but to select a processor elsewhere.
In this worst-case scenario, there are two options:
- If the migration is only temporary, the process can later be migrated back to a more suitable processor.
- If it is not temporary, the process’s memory can be copied to the new processor’s node, so that subsequent accesses hit the copy instead of remote memory. This is clearly a trade of space for time.
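The second option is roughly what the kernel's migrate_pages interface provides. A hedged sketch using the libnuma wrapper (the node numbers are made-up examples; a pid of 0 means the calling process):

```c
// gcc migrate.c -o migrate -lnuma   (assumed build line)
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;

    struct bitmask *from = numa_parse_nodestring("1");   /* example source node */
    struct bitmask *to   = numa_parse_nodestring("0");   /* example target node */
    if (!from || !to) return 1;

    /* Move the calling process's pages from node 1 to node 0. */
    if (numa_migrate_pages(0 /* pid: self */, from, to) < 0)
        perror("numa_migrate_pages");

    numa_bitmask_free(from);
    numa_bitmask_free(to);
    return 0;
}
```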
NUMA Node operation
NUMA Node distribution
There are two NUMA nodes, each managing 16GB of memory.
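A small sketch, assuming libnuma, that reports the same per-node memory totals (similar to what `numactl --hardware` prints):

```c
// gcc nodes.c -o nodes -lnuma   (assumed build line)
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;

    int max = numa_max_node();
    for (int n = 0; n <= max; n++) {
        long long free_bytes = 0;
        long long total = numa_node_size64(n, &free_bytes);   /* sizes in bytes */
        printf("node %d: %lld MiB total, %lld MiB free\n",
               n, total >> 20, free_bytes >> 20);
    }
    return 0;
}
```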
NUMA Node binding
The cost of communication between nodes differs: accessing a remote node is more expensive than accessing the local one. This information is presented as a distance matrix between the nodes.
We can bind a process to the CPUs or the memory of a particular NUMA node, as shown in the figure above.
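For example, the sketch below (libnuma assumed; the node number is illustrative) binds the calling process’s CPUs and memory to node 0, which is roughly what `numactl --cpunodebind=0 --membind=0 <command>` does from the shell:

```c
// gcc bind.c -o bind -lnuma   (assumed build line)
#include <numa.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) return 1;

    struct bitmask *nodes = numa_parse_nodestring("0");  /* example: node 0 only */
    if (!nodes) return 1;
    numa_bind(nodes);                  /* pin both CPU affinity and memory policy */
    numa_bitmask_free(nodes);

    /* From here on, this process runs on node 0's CPUs and allocates its memory there. */
    size_t len = 8UL * 1024 * 1024;
    char *buf = malloc(len);
    if (buf) { memset(buf, 0, len); free(buf); }
    return 0;
}
```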