This section introduces the concepts of algorithm acceleration and task offloading, as well as the differences among algorithm acceleration, task offloading, and heterogeneous computing, so that readers can understand the implementation forms of hardware acceleration on the basis of its principles. Specific examples are given in the next section.
1. The concept of algorithm acceleration
Algorithm acceleration moves the CPU-intensive algorithms of a system into hardware for processing, compressing the execution time of the algorithm and allowing the CPU and the accelerator to run in parallel, so as to accelerate overall performance. Algorithm acceleration is the primary form of hardware acceleration: a specific algorithm is implemented in hardware, and software explicitly controls the operation of the accelerator. The general process of software-controlled accelerator operation is as follows (a code sketch follows the list):
(1) Initialize the accelerator and complete the configuration required for its operation.
(2) Software prepares the data. If the accelerator has no built-in DMA, software or other hardware writes the data into the accelerator's FIFO. If the accelerator has built-in DMA and can actively read data, software passes the data location to the accelerator's DMA engine, which then transfers the data to the accelerator on its own. To minimize the frequency of CPU interaction, data can also be exchanged through descriptor queues.
(3) Software starts the accelerator, which then processes the data.
(4) After processing is completed, the hardware writes the result to an output FIFO, outputs it directly to other hardware, or writes it to an agreed memory location.
(5) If necessary, the hardware raises an interrupt to the software, and the software completes the subsequent processing.
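As a minimal illustration of the flow above, the C sketch below drives a hypothetical memory-mapped accelerator with built-in DMA through the five steps. Every register offset, bit definition, and name here is an assumption invented for this example, not the programming model of a real device.

```c
#include <stdint.h>

/* Hypothetical register map of a memory-mapped accelerator with
 * built-in DMA; all offsets and bits are illustrative assumptions. */
#define REG_CTRL     0x00   /* bit 0: start; bit 1: enable interrupt */
#define REG_STATUS   0x04   /* bit 0: processing done */
#define REG_DMA_SRC  0x08   /* physical address of the input buffer  */
#define REG_DMA_DST  0x0C   /* physical address of the output buffer */
#define REG_DMA_LEN  0x10   /* input length in bytes */

#define CTRL_START   (1u << 0)
#define CTRL_IRQ_EN  (1u << 1)
#define STATUS_DONE  (1u << 0)

/* Base of the accelerator's registers, assumed already mapped into the
 * CPU's address space (e.g., by ioremap()/mmap() during driver init). */
static volatile uint32_t *acc;

static void     reg_write(uint32_t off, uint32_t v) { acc[off / 4] = v; }
static uint32_t reg_read(uint32_t off)              { return acc[off / 4]; }

/* Drives one job through steps (1)-(5) of the flow described above. */
int run_accelerator(uint32_t src_pa, uint32_t dst_pa, uint32_t len)
{
    /* (1) Initialize the accelerator and complete its configuration. */
    reg_write(REG_CTRL, CTRL_IRQ_EN);

    /* (2) The accelerator has built-in DMA, so software only tells it
     *     where the data is; the DMA engine fetches it by itself.     */
    reg_write(REG_DMA_SRC, src_pa);
    reg_write(REG_DMA_DST, dst_pa);
    reg_write(REG_DMA_LEN, len);

    /* (3) Start the operation. */
    reg_write(REG_CTRL, CTRL_IRQ_EN | CTRL_START);

    /* (4)+(5) Hardware writes the result to dst_pa and signals
     *     completion. We poll here; a real driver would sleep and let
     *     the interrupt handler perform this check instead.           */
    while ((reg_read(REG_STATUS) & STATUS_DONE) == 0)
        ; /* busy-wait */

    return 0;
}
```

The descriptor-queue variant mentioned in step (2) would replace the single SRC/DST/LEN registers with a ring of descriptors in shared memory, so the CPU can batch many jobs per notification instead of interacting with the accelerator once per job.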
2. The concept of task offloading
Task offloading (workload offload) typically refers to a device with limited computing power offloading part of its work to be processed elsewhere. For example, mobile terminal nodes such as mobile phones and Internet of Things devices are limited by their own computing performance, so work that consumes a lot of computing resources is transferred to the edge or the cloud for processing.
A similar problem exists in cloud computing data centers. Cloud computing uses virtualization technology to run business VMs together with Hypervisor management and other background tasks on the host CPU. On the one hand, as technology develops, services place ever higher demands on host performance, while the improvement of CPU performance is increasingly limited. On the other hand, the I/O performance improvement of networking and storage requires a lot of computing resources, and management and background tasks occupy a large share of the CPU. Under the combined influence of these two factors, the user business that should obtain more computing resources actually gets less.
Therefore, we want to remove the management and background tasks from the host CPU as much as possible and hand the host CPU over to the user business. As a result, we need to offload management and other background tasks, including the virtualization environment itself, to a specific hardware device that still provides sufficient support for the user business environment.
It should be noted that the task offloading described in this chapter refers to offloading tasks from one chip to another at the board level; it does not cover offloading tasks to other servers over the network.
3. The difference between algorithm acceleration and task offloading
Task offloading and algorithm acceleration are essentially the same: both achieve overall acceleration by putting part of the work into hardware for execution. In terms of implementation, however, algorithm acceleration is the most basic form, and task offloading is a more advanced form.
As shown in Figure 5.3(a), the algorithm acceleration module is in the same address space as the CPU. The CPU can “see” the algorithm acceleration module and interact with it directly on both the control plane and the data plane. Task offloading is a bit more complicated and has some new features:
- Task offloading is built on algorithm acceleration modules; in essence it is still implemented by hardware processing modules for algorithms.
- As shown in Figure 5.3(b), task offloading usually refers to transferring tasks to another system. The two systems need an interface to communicate, such as a PCIe bus between two chips. In contrast to the algorithm acceleration architecture, the offloaded parts are not under the “control” of the host CPU, which “cannot see” them.
- Task offloading must consider not only the design and implementation of the hardware processing module, as algorithm acceleration does (the software and hardware interaction of algorithm acceleration is relatively simple), but also the data and control interface interaction with other software or hardware.
- Offloaded tasks need to think about interactions in terms of the system hierarchy: for a given task, who provides services to it, and to whom does it provide services. In hardware, this is reflected in the data transmission between modules (including the CPU and memory modules running the software) and in the configuration and status interaction on the control plane. A HAL layer is needed to provide a software abstraction of the hardware operations (a minimal HAL sketch follows Figure 5.3).
(a) Algorithm acceleration (b) Task offloading
Figure 5.3 Comparison of algorithm acceleration and task offloading
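To make the HAL point concrete, the sketch below shows one possible shape of such an abstraction: a small operations table that hides whether a function is served by a local acceleration module or by an offload device across a PCIe link. The structure, function names, and the two backends are assumptions for illustration, not an existing API.

```c
#include <stddef.h>

/* A hypothetical HAL for an offloaded function: upper-layer software
 * calls a uniform operations table and does not care which backend
 * actually performs the work. All names are illustrative assumptions. */
struct accel_ops {
    int  (*open)(void **ctx);
    int  (*submit)(void *ctx, const void *in, size_t in_len,
                   void *out, size_t out_len);
    void (*close)(void *ctx);
};

/* Backend A: local algorithm acceleration module in the CPU's address
 * space; submit() would program MMIO registers as in the earlier sketch. */
static int local_submit(void *ctx, const void *in, size_t in_len,
                        void *out, size_t out_len)
{
    /* ... write DMA addresses, set the start bit, wait for done ... */
    return 0;
}

/* Backend B: offload device on the far side of a PCIe link; submit()
 * would enqueue a request descriptor and ring a doorbell. The host CPU
 * "cannot see" the processing modules behind that queue. */
static int offload_submit(void *ctx, const void *in, size_t in_len,
                          void *out, size_t out_len)
{
    /* ... fill request message, notify device, await completion ... */
    return 0;
}

static const struct accel_ops local_ops   = { .submit = local_submit };
static const struct accel_ops offload_ops = { .submit = offload_submit };

/* Upper-layer code is written once against accel_ops; moving from
 * algorithm acceleration to task offloading only swaps the ops table. */
int compress_block(const struct accel_ops *ops, void *ctx,
                   const void *in, size_t n, void *out)
{
    return ops->submit(ctx, in, n, out, n);
}
```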
4. The difference between algorithm acceleration and heterogeneous computing
If we regard the GPU as an accelerator that accelerates specific graphics algorithms, then the GPU can also be seen as an algorithm accelerator. From this point of view, heterogeneous computing and algorithm acceleration are essentially the same. There are, however, differences between heterogeneous computing and basic algorithm acceleration, and these differences significantly affect how each is applied. The differences are as follows:
- Algorithm acceleration is a lower-level form; heterogeneous computing is a higher-level form. Algorithm acceleration targets a specific algorithm scenario, whereas heterogeneous computing targets not a single scenario but a class of scenarios: it extracts features common to that class and optimizes for those common features to achieve acceleration.
- Algorithm acceleration is custom acceleration, while heterogeneous computing devices are processors. In algorithm acceleration, a specific algorithm is implemented entirely in hardware; software participates in control-plane processing, but the hardware processing module does not support instruction programming. Heterogeneous computing devices, by contrast, are collectively referred to as processors: they support software instruction programming and have a certain generality (the sketch after this list illustrates the contrast).
- Algorithm acceleration is usually custom-developed hardware and software, while heterogeneous computing aims at platformization. Algorithm acceleration generally implements only the driver for the algorithm's hardware processing module and hands the rest of the work to subsequent software developers. Heterogeneous computing must implement not only the device driver, but also mixed programming across the heterogeneous platform, runtime support, and even data consistency between the two systems.
- The development threshold of algorithm acceleration is low, so it can serve many scenarios large and small. Heterogeneous computing targets a number of large-scale typical scenarios, and its development is a large-manpower, long-term, iterative process.
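The programmability difference can be seen in how software invokes the two kinds of devices. In the hedged sketch below, the fixed-function path reuses the hypothetical registers from the earlier example, and hc_load_kernel()/hc_launch() are stand-ins for a real heterogeneous runtime API (such as CUDA or OpenCL); both sides are illustrative assumptions, not real interfaces.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder runtime for a programmable heterogeneous device. These
 * stubs only show the shape of the interaction; they execute nothing. */
typedef struct { const char *name; } hc_kernel;

static hc_kernel hc_load_kernel(const char *name)
{
    /* A real runtime would load/JIT device code here. */
    return (hc_kernel){ .name = name };
}

static int hc_launch(hc_kernel k, size_t n_threads, void **args)
{
    (void)args;
    printf("launch kernel '%s' over %zu threads\n", k.name, n_threads);
    return 0;
}

/* Fixed-function algorithm accelerator: the algorithm is frozen in the
 * hardware, so invoking it is pure configuration -- software supplies
 * data locations and a start bit, never a program. */
static void run_fixed_function(volatile uint32_t *regs,
                               uint32_t src_pa, uint32_t len)
{
    regs[2] = src_pa;  /* hypothetical DMA source register */
    regs[4] = len;     /* hypothetical length register     */
    regs[0] = 1u;      /* hypothetical start bit           */
}

/* Heterogeneous computing device: the hardware is a processor, so
 * software picks the algorithm at run time by loading and launching a
 * kernel -- the same device can run many different algorithms. */
static int run_programmable(void **args, size_t n)
{
    hc_kernel k = hc_load_kernel("vector_add"); /* the program is data */
    return hc_launch(k, n, args);               /* generic dispatch    */
}
```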