- Google’s Apollo AI for Chip Design Improves Deep Learning Performance by 25%
- By Anthony Alford
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: PingHGao
- Proofread: PassionPenguin, Chorer
Google’s APOLLO chip-design AI framework improves deep learning chip performance by 25%
Scientists at Google Research have unveiled APOLLO, a new framework for optimizing the design of artificial intelligence accelerator chips. APOLLO uses evolutionary algorithms to select chip parameters that minimize deep learning inference latency while also minimizing chip area. Using APOLLO, the researchers found designs that were 24.6% faster than those selected by a baseline algorithm.
Research scientist Amir Yazdanbakhsh outlined the system in a recent blog post. APOLLO searches for a set of hardware parameters, such as memory size, I/O bandwidth, and number of processor units, that provides optimal inference performance for a given deep learning model. By using evolutionary algorithms and transfer learning, APOLLO can explore the parameter space efficiently, reducing the overall time and cost of design. According to Yazdanbakhsh:
We believe this research is an exciting path forward to further explore machine-learning-driven architecture design and co-optimization across the computing stack (such as compilers, mapping, and scheduling) to develop efficient accelerators with new capabilities for future applications.
Deep learning models are used in many fields, from computer vision (CV) to natural language processing (NLP). However, these models typically require significant compute and memory resources for inference, which strains the limited hardware of edge and mobile devices. Custom hardware accelerators, such as the Edge TPU, can reduce model inference latency, but often require modifications to the model, such as parameter quantization or model pruning. Some researchers, including the Google team, have proposed using AutoML to design high-performance models for specific accelerator hardware.
By contrast, the APOLLO team’s strategy is to customize the accelerator hardware to optimize the performance of a given deep learning model. The accelerators are based on a 2D array of processing elements (PEs), each containing multiple single-instruction multiple-data (SIMD) cores. This basic template can be customized by the values of several parameters, including the size of the PE array, the number of cores per PE, and the amount of memory per core. Overall, the design space contains nearly 500M parameter combinations. Because a proposed accelerator design must be simulated in software, evaluating its performance on deep learning models is time-consuming and computationally expensive.
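To see why the design space grows so large, consider a minimal sketch of such a parameterized template. The parameter names and value ranges below are illustrative assumptions only (the article does not list the actual ranges, which multiply out to roughly 500M designs rather than the few hundred shown here); the point is that the space is the Cartesian product of all parameter choices.

```python
from itertools import product

# Hypothetical parameter ranges -- illustrative assumptions, not the
# actual values from the APOLLO paper.
design_space = {
    "pe_array_rows":      [4, 8, 16, 32],     # height of the 2D PE array
    "pe_array_cols":      [4, 8, 16, 32],     # width of the 2D PE array
    "cores_per_pe":       [1, 2, 4, 8],       # SIMD cores in each PE
    "memory_per_core_kb": [16, 32, 64, 128],  # local memory per core
    "io_bandwidth_gbps":  [5, 10, 20],        # off-chip I/O bandwidth
}

# The design space size is the product of the number of choices per
# parameter, so it grows multiplicatively with each added parameter.
num_designs = 1
for values in design_space.values():
    num_designs *= len(values)
print(f"{num_designs} candidate designs")

# A few concrete configurations from the space:
for config in list(product(*design_space.values()))[:3]:
    print(dict(zip(design_space.keys(), config)))
```

Since each candidate must be evaluated in a slow software simulator, exhaustively sweeping even a modest space like this quickly becomes impractical, which is what motivates the sample-efficient search strategies described next.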
APOLLO is built on top of Vizier, Google’s internal “black box” optimization tool, and Vizier’s Bayesian optimization approach is used as a baseline for evaluating APOLLO’s performance. The APOLLO framework supports a variety of optimization strategies, including random search, model-based optimization, evolutionary search, and an ensemble approach called population-based black-box optimization (P3BO). The Google team ran several experiments to find the best set of accelerator parameters for a range of computer vision models, including MobileNetV2 and MobileNetEdge, under three different chip-area constraints. They found that the P3BO algorithm produced the best designs, and that its performance gap over Vizier became more pronounced as the available chip area decreased. Compared with a manually guided exhaustive or “brute force” search, P3BO found better configurations while using 36% fewer search evaluations.
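For intuition, here is a minimal sketch of a plain evolutionary search over a discrete design space like the one above. This is not P3BO itself (P3BO additionally ensembles several optimizers, including Bayesian methods), and `simulate_latency`, `PARAM_CHOICES`, and the population settings are all hypothetical stand-ins: in a real run the objective would invoke the expensive software simulator, which is exactly why reducing the number of evaluations matters.

```python
import random

# Hypothetical discrete design space (assumed, not from the paper).
PARAM_CHOICES = {
    "pe_array_size":      [4, 8, 16, 32],
    "cores_per_pe":       [1, 2, 4, 8],
    "memory_per_core_kb": [16, 32, 64, 128],
}

def simulate_latency(design):
    # Placeholder objective standing in for the cycle-accurate simulator:
    # more parallelism lowers latency, more memory adds a small cost.
    return (1000.0 / (design["pe_array_size"] * design["cores_per_pe"])
            + 0.1 * design["memory_per_core_kb"])

def random_design():
    return {k: random.choice(v) for k, v in PARAM_CHOICES.items()}

def mutate(design):
    # Re-sample one randomly chosen parameter to produce a child design.
    child = dict(design)
    key = random.choice(list(PARAM_CHOICES))
    child[key] = random.choice(PARAM_CHOICES[key])
    return child

# Evolve a small population under a fixed evaluation budget.
population = [random_design() for _ in range(16)]
for _ in range(50):
    population.sort(key=simulate_latency)
    survivors = population[:8]  # keep the fittest half
    children = [mutate(random.choice(survivors)) for _ in range(8)]
    population = survivors + children

best = min(population, key=simulate_latency)
print(best, simulate_latency(best))
```

The key design property is that the search spends its fixed evaluation budget near configurations that already performed well, rather than sampling the space uniformly, which is how such methods beat exhaustive search with far fewer simulator calls.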
Accelerator hardware design for speeding up AI inference is an active area of research. Apple’s new M1 processor includes a Neural Engine designed to accelerate AI computations. Researchers at Stanford University recently published a paper in Nature describing a system called Illusion, which uses a network of smaller chips to simulate a single larger accelerator. At Google, scientists have also published work on optimizing chip layout, using AI to find the best placement of integrated-circuit components on the physical chip.