Machine Learning for Computer Architecture
One of the key contributors to recent advances in machine learning (ML) has been the development of custom accelerators, such as Google's TPU and Edge TPU, which significantly increase the available compute power and unlock capabilities such as AlphaGo, RankBrain, WaveNets, and conversational agents. These accelerators improve neural network training and inference performance, opening up new possibilities across a wide range of applications such as vision, language, understanding, and self-driving cars.
To sustain these advances, the hardware accelerator ecosystem must continue to innovate in architecture design and adapt to rapidly evolving ML models and applications. This requires evaluating many different accelerator design points, each of which can not only improve compute performance but also unlock new capabilities. These design points are typically parameterized by a variety of hardware and software factors, such as memory capacity, the number of compute units at different levels, parallelism, interconnect networks, pipelining, software mappings, and so on. This is a daunting optimization task: the search space is exponentially large, and the objective functions (e.g., lower latency and/or higher energy efficiency) are computationally expensive to evaluate through simulation or synthesis, making it challenging to identify feasible accelerator configurations.
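To make the "exponentially large search space" concrete, here is a minimal sketch of a discretely parameterized accelerator design space. The parameter names and value ranges are illustrative assumptions for this example, not Apollo's actual search space.

```python
from itertools import product

# Illustrative (not Apollo's actual) discrete parameter space: each
# architectural knob takes one of a few discrete settings.
SEARCH_SPACE = {
    "num_pes_x": [2, 4, 8],           # PE array width
    "num_pes_y": [2, 4, 8],           # PE array height
    "pe_memory_kib": [64, 128, 256],  # per-PE shared memory
    "simd_lanes": [4, 8, 16],         # compute lanes per core
    "dram_bw_gbps": [25, 50, 100],    # DRAM bandwidth
}

def num_design_points(space):
    """The space grows exponentially with the number of parameters."""
    n = 1
    for choices in space.values():
        n *= len(choices)
    return n

def all_configs(space):
    """Enumerate every configuration as a dict (tractable only for toy spaces)."""
    keys = list(space)
    for values in product(*space.values()):
        yield dict(zip(keys, values))

print(num_design_points(SEARCH_SPACE))  # 3^5 = 243
```

Even this toy space with 5 knobs and 3 settings each has 243 points; a realistic space with tens of knobs quickly exceeds what exhaustive simulation can cover.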
In “Apollo: Transferable Architecture Exploration”, we present our progress on ML-driven design of custom accelerators. While recent work has shown promising results in using ML to improve the low-level floorplanning process (the spatial placement and wiring of hardware components in silicon), in this work we focus on incorporating ML into the high-level system specification and architectural design phase, a key factor in overall chip performance that establishes the design elements controlling high-level functionality. Our study demonstrates how ML algorithms can facilitate architecture exploration and suggest high-performing architectures across a range of deep neural networks in domains such as image classification, object detection, OCR, and semantic segmentation.
Architecture search space and workloads
The goal of architecture exploration is to discover a set of feasible accelerator parameters for a set of workloads, such that a desired objective function (e.g., weighted average runtime) is minimized under an optional set of user-defined constraints. However, the manifold of the architecture search space typically contains many points for which there is no feasible mapping from software to hardware. Some of these design points are known a priori and can be circumvented by the user formulating them as optimization constraints (e.g., under an area budget constraint, the total memory size must not exceed a predefined limit). However, due to the interactions between the architecture and the compiler, and the complexity of the search space, some constraints cannot be properly expressed in the optimization, so the compiler may fail to find a feasible software mapping for the target hardware. Such infeasible points are not easy to express in the optimization problem and are generally unknown until the full compiler pass is run. Thus, one of the main challenges of architecture exploration is to effectively sidestep the infeasible points and explore the search space with a minimum number of cycle-accurate architecture simulations.
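The distinction above between a priori constraints and compiler-discovered infeasibility can be sketched as a constrained objective: known constraints are checked cheaply up front, while mapping failures only surface after an expensive evaluation. All names and numbers below are illustrative assumptions, not Apollo's actual interface.

```python
# Hypothetical user-defined a priori constraint: total on-chip memory budget.
MEMORY_BUDGET_KIB = 512

def violates_known_constraints(cfg):
    # Checkable before any simulation: total PE memory must fit the budget.
    total_mem_kib = cfg["num_pes"] * cfg["pe_memory_kib"]
    return total_mem_kib > MEMORY_BUDGET_KIB

def evaluate(cfg, compile_and_simulate):
    """Return a reward for a design point, or 0.0 for infeasible points.

    `compile_and_simulate` stands in for the expensive toolchain: it returns
    (throughput, area) on success, or None when the compiler cannot find a
    feasible software mapping -- which is only known after running it.
    """
    if violates_known_constraints(cfg):
        return 0.0  # rejected cheaply, before any expensive simulation
    result = compile_and_simulate(cfg)  # hours of simulation in practice
    if result is None:  # no feasible software-to-hardware mapping exists
        return 0.0
    throughput, area = result
    return throughput / area  # e.g., inferences/sec per unit area
```

The asymmetry is the point: the first check costs microseconds, while the second kind of infeasibility costs a full compiler pass to discover, which is why the optimizer must learn to avoid such regions.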
The figure below shows the overall architecture search space of the target ML accelerator. The accelerator contains a two-dimensional array of processing elements (PEs), each of which performs a set of arithmetic computations in a single-instruction multiple-data (SIMD) manner. The main architectural components of each PE are its processing cores, each containing multiple compute lanes for SIMD operations. Each PE has shared memory across all of its cores, used primarily to store model activations, partial results, and outputs, while the individual cores have memory used mainly to store model parameters. Each core has multiple compute lanes with multiply-accumulate (MAC) units. The results of model computations in each cycle are either stored back in PE memory for further computation or offloaded back to DRAM.
Optimization strategy
In this study, we explore four optimization strategies in the context of architecture exploration:
1. Random: Samples the architecture search space uniformly at random.
2. Vizier: Explores the search space using Bayesian optimization, suited to settings where evaluating the objective function is expensive (e.g., hardware simulation, which can take hours to complete). Using a set of sampled points from the search space, Bayesian optimization forms a surrogate function, usually represented as a Gaussian process, that approximates the manifold of the search space. Guided by the value of the surrogate function, the Bayesian optimization algorithm trades off exploitation and exploration: it decides between sampling more from promising regions of the manifold (exploitation) and sampling more from unseen regions of the search space (exploration). The optimization algorithm then uses these newly sampled points to further update the surrogate function to better model the target search space. Vizier uses expected improvement as its core acquisition function. Here, we use Vizier (Safe), a variant that performs constrained optimization, guiding the optimization process to avoid suggesting trials that do not satisfy the given constraints.
3. Evolutionary: Performs evolutionary search using a population of k individuals, where the genome of each individual corresponds to a sequence of discrete accelerator configurations. New individuals are generated by selecting two parents from the population for each individual using tournament selection, recombining their genomes with some crossover rate, and mutating the recombined genome with some probability.
4. Population-based black-box optimization (P3BO): Uses an ensemble of optimization methods, including evolutionary and model-based ones, which has been shown to increase sample efficiency and robustness. Sampled data are exchanged between the optimization methods in the ensemble, and the optimizers are weighted by their performance history to generate new configurations. In our study, we use a variant of P3BO in which the optimizers' hyperparameters are dynamically updated using evolutionary search.
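The evolutionary strategy described in item 3 (tournament selection, crossover, mutation over a discrete genome) can be sketched as follows. The toy search space, reward function, and all rates are illustrative assumptions standing in for the expensive simulator and the paper's actual settings.

```python
import random

# Toy discrete search space: each gene is an index into one knob's choices.
SPACE = [[2, 4, 8], [64, 128, 256], [4, 8, 16]]  # illustrative knobs

def toy_reward(genome):
    # Stand-in for the expensive simulator: here, larger settings score higher.
    return sum(SPACE[i][g] for i, g in enumerate(genome))

def tournament(pop, rewards, k=2):
    # Tournament selection: return the best of k randomly chosen individuals.
    contenders = random.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: rewards[i])]

def crossover(a, b, rate=0.5):
    # Uniform crossover: each gene is taken from one parent or the other.
    return [x if random.random() < rate else y for x, y in zip(a, b)]

def mutate(genome, prob=0.1):
    # Each gene is independently resampled with probability `prob`.
    return [random.randrange(len(SPACE[i])) if random.random() < prob else g
            for i, g in enumerate(genome)]

def evolve(pop_size=20, generations=30, seed=0):
    random.seed(seed)
    pop = [[random.randrange(len(c)) for c in SPACE] for _ in range(pop_size)]
    for _ in range(generations):
        rewards = [toy_reward(g) for g in pop]
        pop = [mutate(crossover(tournament(pop, rewards),
                                tournament(pop, rewards)))
               for _ in range(pop_size)]
    return max(pop, key=toy_reward)

best = evolve()
```

In the real setting, `toy_reward` would be replaced by a cycle-accurate simulation returning zero for infeasible configurations, which is exactly why selection pressure pushes the population toward feasible regions early on.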
Accelerator search space embeddings
To better visualize the effectiveness of each optimization strategy in navigating the accelerator search space, we use t-distributed stochastic neighbor embedding (t-SNE) to map the explored configurations into a two-dimensional space across the optimization horizon. The objective (reward) for all experiments is defined as throughput (inferences per second) per accelerator area. In the figures below, the x-axis and y-axis represent the t-SNE components (embedding 1 and embedding 2) of the embedding space. Star and circle markers show infeasible (zero-reward) and feasible design points, respectively, with the size of the feasible points proportional to their reward.
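This kind of visualization can be sketched with scikit-learn's t-SNE implementation. The configurations and rewards below are synthetic stand-ins for real search trajectories, and the hyperparameters (perplexity, init) are illustrative choices.

```python
# Project explored accelerator configurations into 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
configs = rng.integers(0, 4, size=(40, 6)).astype(float)  # 40 explored points, 6 knobs
rewards = rng.random(40)  # synthetic rewards; 0 would mark an infeasible point

embedding = TSNE(n_components=2, perplexity=5.0,
                 init="random", random_state=0).fit_transform(configs)
# `embedding` has shape (40, 2): one (embedding-1, embedding-2) coordinate per
# explored configuration, ready to scatter-plot with marker size ~ reward.
```

Note that t-SNE preserves local neighborhoods rather than global distances, so it is well suited to showing whether an optimizer's samples cluster in feasible regions, but distances between far-apart clusters should not be over-interpreted.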
As expected, the random strategy searches the space in a uniformly distributed manner and eventually finds very few feasible points in the design space.
Compared to random sampling, Vizier's default optimization strategy strikes a good balance between exploring the search space and finding design points with higher rewards (1.14 vs. 0.96). However, this approach tends to get stuck in infeasible regions, and although it does find a few points with the maximum reward (denoted by red cross markers), it finds few feasible points in the final iterations of exploration.
The evolutionary optimization strategy, on the other hand, finds feasible solutions early in the optimization and assembles clusters of feasible points around them. As such, this approach mostly navigates the feasible regions (green circles) and efficiently sidesteps the infeasible points. In addition, evolutionary search is able to find more design options with the maximum reward (red crosses). This diversity of maximum-reward solutions gives designers the flexibility to explore various architectures with different design trade-offs.
Finally, the population-based optimization method (P3BO) explores the design space in a more targeted way (regions with high-reward points) in order to find optimal solutions. The P3BO strategy finds design points with the highest reward in search spaces with tighter constraints (e.g., a larger fraction of infeasible points), demonstrating its effectiveness in navigating search spaces with many infeasible points.
Architecture exploration under different design constraints
We also studied the benefits of each optimization strategy under different area budget constraints: 6.8 mm², 5.8 mm², and 4.8 mm². The violin plots below show the full distribution of the maximum achievable reward at the end of optimization (after 4K trial evaluations per run, across 10 runs) for the studied optimization strategies. The wider sections represent a higher probability of observing feasible architecture configurations at a particular reward. This implies that optimization algorithms whose distributions widen at points with higher rewards (higher performance) are preferable.
The two top-performing optimization strategies for architecture exploration are Evolutionary and P3BO, both in terms of delivering solutions with high rewards and in robustness across multiple runs. Looking across the different design constraints, we observe that as the area budget constraint tightens, the P3BO strategy yields higher-performing solutions. For example, when the area budget constraint is set to 5.8 mm², P3BO finds a design point with a reward (throughput / accelerator area) of 1.25, outperforming all the other optimization strategies. The same trend is observed when the area budget constraint is set to 4.8 mm², with slightly better rewards and higher robustness (less variability) across multiple runs.
The violin plots show the full distribution of the maximum achievable reward across 10 runs after 4K trial evaluations, under an area budget of 6.8 mm². P3BO and Evolutionary yield more high-performing designs (wider sections). The x-axis and y-axis represent the studied optimization algorithms and the geometric mean of speedup (reward) over the baseline accelerator, respectively.
Conclusion
While Apollo is a first step toward better understanding accelerator design spaces and building more efficient hardware, inventing accelerators with new capabilities remains uncharted territory. We believe this research is an exciting path forward to further explore ML-driven techniques for architecture design and co-optimization across the computing stack (e.g., compilers, mapping, and scheduling) to develop efficient accelerators with new capabilities for the next generation of applications.
Blog source: Rainy Night Blog