About the author: kevinxiaoyu, a senior researcher in Tencent's TEG Architecture Platform Department. His research interests include architecture design and optimization for deep learning heterogeneous computing and hardware acceleration, FPGA cloud, high-speed visual perception, and related topics. The three-part "Heterogeneous Acceleration Techniques for Deep Learning" series analyzes, at the technical level, the architectural evolution of heterogeneous acceleration in academia and industry.
The previous installment discussed typical architectures emerging in academia, mainly from the angle of solving the core problem of bandwidth. This article looks at the different choices semiconductor manufacturers and Internet giants are making in AI computing.
I. Choice of semiconductor manufacturers
Different vendors face different application scenarios, so the suitable architectures and solutions differ, for example in the design orientation of cloud-side versus terminal-side processing architectures. For a semiconductor vendor, as long as the market is large enough and enough customers will pay, there is sufficient incentive to customize hardware accordingly. The following describes the solutions of semiconductor manufacturers, represented by NVIDIA and Intel.
1.1 NVIDIA
NVIDIA's home turf is graphics and massively parallel data computing. The GPU is the workhorse of heterogeneous computing and the engine behind the rise of AI; its ecosystem is very mature, with excellent support from the development toolchain through to the various libraries. NVIDIA's deep learning strategy is to continue its GPU architecture and focus on two directions: enriching the ecosystem and increasing customization for deep learning. For the former, it introduced cuDNN and other libraries optimized for neural networks, improving ease of use and tuning the underlying GPU structure. For the latter, it no longer insists on 32-bit floating point, adding support for more data types such as 16-bit floating point and 8-bit fixed point, and it adds dedicated deep learning modules (such as TensorCore in the V100) to strengthen customization and greatly improve the performance of deep learning applications. The Volta V100, released in May 2017, delivers 120 TFlops for deep learning workloads. Its parameters and architecture are shown in Figures 3.1 and 3.2.
Figure 3.1 Volta GPU parameter comparison and the TensorCore computation structure
Figure 3.2 Volta GPU architecture
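To make the mixed-precision idea concrete, here is a minimal NumPy sketch of the numerical pattern behind TensorCore-style operations, not NVIDIA's implementation: FP16 inputs are multiplied, but the products are accumulated in FP32, so the running sum keeps full precision. The 4×4 sizes and the pure-FP16 comparison are illustrative assumptions.

```python
import numpy as np

def mixed_precision_matmul(a_fp16, b_fp16, c_fp32):
    """TensorCore-style D = A @ B + C: FP16 multiplicands, FP32 accumulation."""
    # Promote to FP32 before accumulating so rounding error comes only
    # from the FP16 inputs, not from the running sum.
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32) + c_fp32

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4)).astype(np.float16)
b = rng.standard_normal((4, 4)).astype(np.float16)
c = np.zeros((4, 4), dtype=np.float32)

d_mixed = mixed_precision_matmul(a, b, c)       # FP16 in, FP32 accumulate
d_fp16 = a @ b + c.astype(np.float16)           # everything kept in FP16
print("max |mixed - all-FP16| =", np.abs(d_mixed - d_fp16.astype(np.float32)).max())
```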
The GPU's problems are high latency, low energy efficiency, and high cost.
To solve the bandwidth problem caused by a large number of cores running simultaneously, the GPU on the one hand adopts a multi-level cache scheme similar to the CPU's to keep data close to the cores, and on the other hand uses the most advanced memory technology so that more cores can be kept busy. But when many cores access the device memory at the same time, the queuing mechanism means that each core's request cannot be served immediately, which increases task-processing latency.
From the perspective of customization and computing granularity, the GPU targets general-purpose parallel computing. To stay compatible with earlier architectures and a wider range of applications, the design carries a certain amount of balance and redundancy. For example, in Figures 3.1 and 3.2, the TensorCores are designed for AI computing, while the other 5120 cores serve general-purpose parallel computing. As a result, for the same die area and transistor count, its performance and power consumption are not superior to those of AI-specific processors.
In terms of cost, a Tesla V100 card is currently estimated at more than 100,000 yuan, and other models run from several thousand to tens of thousands, which is expensive for large-scale cloud deployment.
Owing to the continuity of its historical designs, NVIDIA's GPUs mainly target servers and desktops. Even the low-power TX and TK series are aimed at high-performance terminal applications that are not very power-sensitive, such as autonomous driving. To keep its computing-platform product line complete, competitiveness in the embedded market is indispensable as new fields emerge. NVIDIA has therefore open-sourced a scalable deep learning processor architecture, DLA (Deep Learning Accelerator). Open source is common for lightweight software development, but because of the large investment, high barrier to entry, and poor fault tolerance, open-source hardware architectures are still in their infancy. DLA has become a powerful lever for NVIDIA to extend its ecosystem to the embedded side while avoiding head-on confrontation with the many AI-computing startups. Huang's courage and imagination are remarkable. The RTL of DLA is available at github.com/nvdla/.
1.2 Intel
As a general computing platform supplier, Intel will not give up any large-scale computing platform market. As mentioned in the first article, using CPUs for large-scale deep learning is inefficient. Facing the GPU's advantages in deep learning support and lightweight tailoring, Intel adopted a "buy, buy, buy" strategy: in heterogeneous computing, it acquired Altera, the second-largest FPGA company; for cloud deep learning applications, it acquired Nervana; for low-power embedded front-end deep learning scenarios, it acquired Movidius; and for autonomous/assisted driving terminal applications, it acquired Mobileye.
In Intel's vision, FPGAs and AI ASICs can connect to Xeon processors at the chip level through the QPI interface (up to twice the bandwidth of PCIe 4.0) and can even be integrated into the same chip. Despite the current difficulties and delays in this integration, the benefits of the design are clear: a high-bandwidth, low-latency, reconfigurable coprocessor is always desirable in scenarios where the type of computation keeps changing, such as cloud and desktop machines, as shown in Figure 3.3:
Figure 3.3 Intel's heterogeneous computing layout in the datacenter
Lake Crest: Intel's Lake Crest deep learning chip, launched in October 2017, is likewise a second-stage, scalable AI processor. It carries 32 GB of on-package HBM2, DNN acceleration, and chip-level interconnect. Because HBM2 replaces off-chip DDR, board-level integration improves greatly and power consumption drops. It can be used in single-card, multi-chip, multi-card, and chip-pool system architectures. The internal architecture of Lake Crest is shown in Figure 3.4.
Figure 3.4 Intel ASIC Lake Crest internal structure
Movidius: targets low-power embedded front-end applications. The market is considerable, spanning robotics, drones/vehicles, security, and even personal portable acceleration devices. Movidius, for example, has launched the Neural Compute Stick, developed for portable personal deep learning and capable of 100 GFlops at 1 W. Note that this is the total power of the portable device: reaching 100 GFlops/W on the deep learning chip itself is relatively easy, but it becomes quite hard once the power supply, control, and off-chip memory chips are included, as shown in Figure 3.5.
Figure 3.5 Movidius' Neural Compute Stick
II. The choice of Internet giants
2.1 Google
Google attaches great importance to frontier exploration and strategic positioning. As early as 2013, it realized that AI would see explosive growth in its business: not only would internal services require huge computing resources, but external services would span a variety of AI scenarios. Using GPUs would be too costly and would not help enrich the upstream and downstream of an AI ecosystem; a complete product line from deep learning framework to computing platform is more conducive to seizing the initiative and enabling an end-to-end application model. Therefore, on the one hand Google began developing the TPU, aiming for GPU-level performance at one tenth of the cost; on the other hand it developed TensorFlow and the TensorFlow-to-TPU compilation environment, promoting TPU adoption through TF and deep learning cloud services, thereby increasing TPU usage and amortizing the design cost. The TPU is now in its second generation (TPU2), used for internal search, maps, speech, and other services, and TPU2 has been opened to external research institutions to build an end-to-end deep learning ecosystem. In deep learning, Google focuses on seizing the opportunity to shape the ecosystem and quickly turning it into products. The deep combination of AI plus hardware, or application plus hardware, has thus become the key to implementing the "AI First" theme of the Google I/O conference, from ecosystem to application: the TPU in the cloud, the Google Home smart speaker on the embedded side, and the IPU in the Pixel 2 phone are concrete embodiments of this idea.
2.1.1 Computing Characteristics and Architecture Change — TPU1[1]
Google's TPU has gone through two generations, TPU1 and TPU2. It is a typical AI chip whose evolution expands from solving bandwidth in the first stage to scaling compute power in the second, from loosely distributed deployment to compute-dense clusters, and from traditional network-based data exchange to ultra-low-latency chip-level interconnect.
The architecture and board-level structure of TPU1 are shown in Figures 3.6 and 3.7 respectively. TPU1 implements only Inference and supports 16-bit/8-bit operations. It mainly performs matrix-matrix, matrix-vector, and vector-vector multiplication to support deep learning algorithms such as MLP (multi-layer perceptron), LSTM, and CNN. To ease rapid deployment and minimize changes to the data center architecture, TPU1 communicates with the server as a PCIe card, and the board carries two sets of DDR3 memory.
Internally, its hallmark is a very large systolic array (65,536 8-bit MAC units; for comparison, a VU9P FPGA has 6,840 DSP slices and a Tesla P40 has 3,840 cores) together with a storage architecture of 24 MB of on-chip buffer plus off-chip DDR3, running at 700 MHz. Peak throughput is 65,536 MACs × 2 ops × 0.7 GHz ≈ 91.75 Tops, at a power of only 40 W. The key to reaching such performance while avoiding the bandwidth bottleneck is the extremely high data reuse of the systolic-array computation, as described in Section 2.1 of Part II.
Figure 3.6 TPU1 internal structure and on-chip layout
Figure 3.7 Board-level structure and deployment of TPU1
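For concreteness, the peak-throughput figure above follows directly from the array size and clock, and the bandwidth argument follows from counting how often each loaded weight is reused. The short sketch below reproduces both numbers, assuming 2 ops (multiply plus add) per MAC per cycle and a 256-row input batch; both assumptions are only for illustration.

```python
# Back-of-envelope peak throughput of a 256x256 systolic array at 700 MHz.
macs = 256 * 256                  # 65,536 8-bit multiply-accumulate units
ops_per_mac = 2                   # one multiply + one add per cycle
freq_hz = 0.7e9                   # 700 MHz
print(f"peak ~ {macs * ops_per_mac * freq_hz / 1e12:.2f} Tops")   # ~91.75 Tops

# Data reuse: once a 256x256 weight tile sits in the array, every weight is
# reused for each of the `batch` input rows streamed through it.
batch = 256
weight_bytes = macs * 1                    # 8-bit weights
matmul_ops = 2 * 256 * 256 * batch         # (batch x 256) @ (256 x 256)
print(f"ops per weight byte loaded: {matmul_ops / weight_bytes:.0f}")   # 2 * batch
```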
TPU1 has strong peak compute, but several problems appear in practical use:
1. According to statistics, more datacenter customers care about latency than throughput. After the TPU's launch, customers often set the latency option to the highest priority.
2. On CNN, LSTM, and MLP workloads, the measured computational efficiency is 78.2%, 8.2%, and 12.7% respectively; the CNN figure benefits from the high on-chip data reuse that CNN can achieve. Google puts the share of these three workloads in its data centers at 5%, 29%, and 61% respectively. So although the 65,536 DSP units can reach high efficiency on CNN, CNN accounts for only 5% of the data center workload. Meanwhile, the huge DSP array occupies a large silicon area, which lowers yield and raises cost. High benefit on only 5% of the workload makes the cost-effectiveness debatable (the back-of-envelope sketch after this list combines these numbers).
3. The PCIe-card form factor makes TPU invocation dependent on the host server, and the deployment density of TPUs in a rack is relatively low.
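A back-of-envelope combination of the utilization and workload-mix figures in point 2 gives a fleet-wide average utilization of only around 14% for the large array; the snippet below simply does that weighted sum with the numbers quoted in the text.

```python
# Workload share in Google's data centers and TPU1 utilization, as quoted above.
workloads = {
    "CNN":  {"share": 0.05, "utilization": 0.782},
    "LSTM": {"share": 0.29, "utilization": 0.082},
    "MLP":  {"share": 0.61, "utilization": 0.127},
}
avg = sum(w["share"] * w["utilization"] for w in workloads.values())
print(f"traffic-weighted average utilization ~ {avg:.1%}")   # about 14%
```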
At the core of these problems is computing granularity: the systolic array is too large. On the one hand, bandwidth remains the bottleneck in low-data-reuse scenarios such as LSTM and MLP; on the other hand, when latency takes precedence over throughput, the batch size is small and it is hard to keep the array busy, so a relatively smaller computing granularity fits better. Hence Google began developing TPU2.
2.1.2 Computing Characteristics and Architecture Change — TPU2[2]
TPU2 is a typical second-stage AI chip: it uses 16 GB of HBM to obtain 600 GB/s of memory bandwidth, and it mainly solves two problems, computing granularity and compute scaling.
1. Computing granularity: the systolic array is reduced from one 256×256 array to two 128×128 arrays, as shown in Figure 3.8, so that each control and scheduling unit covers a quarter of the original array. This improves single-chip computational efficiency while reducing the latency of moving data in and out (a toy sketch of this effect follows the list below).
2. Compute scaling: TPU1's PCIe-only communication is replaced with board-level interconnect to build multi-board, high-density compute clusters. Compute scaling is implemented in a chip-level distributed fashion, reducing the difficulty of task partitioning as well as the synchronization and communication cost. For example, the current TensorFlow Research Cloud consists of 1024 TPU2 chips, and each task is said to be able to call on up to 256 of them. Board-level interconnect bypasses the inefficient PCIe-Ethernet path of traditional servers and provides about 200 GB/s of board-to-board bandwidth, greatly reducing inter-board communication latency, which enables efficient computation of models that span chips and the 7 ms response time in the data center.
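Why smaller arrays help at small batch sizes can be seen with a toy pipeline model (a deliberate simplification, not TPU2's actual scheduling): a weight-stationary systolic array needs roughly its own dimension in cycles to fill and drain, so with small batches a 256-wide array spends most of its time partly empty, while a 128-wide array wastes much less.

```python
def pipeline_efficiency(batch, array_dim):
    """Toy model: useful cycles vs. fill/drain overhead of a systolic array."""
    return batch / (batch + array_dim)   # overhead ~ array_dim cycles (simplified)

for batch in (8, 32, 128, 1024):
    e256 = pipeline_efficiency(batch, 256)
    e128 = pipeline_efficiency(batch, 128)
    print(f"batch={batch:5d}   256x256: {e256:5.1%}   two 128x128: {e128:5.1%}")
```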
Other improvements:
1. On-package HBM replaces off-chip DDR3, shrinking the board area so that four TPU2 chips fit on one board.
2. Single-chip performance is 45 Tops; single-board (four-chip) performance is 180 Tops.
3. Supports 32-bit floating-point arithmetic.
4. Supports both training and Inference.
Figure 3.8 TPU2 board layout and internal structure: four chips (A in the figure), two groups of 25 GB/s dedicated networks (B), two Omni-Path Architecture (OPA) interfaces (C), and the power interface (D)
Figure 3.9 TPU2 deployment in a data center: A and D are CPU racks, B and C are TPU2 racks; the blue lines are the UPS, the red dotted lines the power supply, and the green dotted lines the rack network switch configuration
2.2 Microsoft
In contrast to Google's choice to develop its own ASIC, Microsoft's BrainWave project [3] adopts reconfigurable FPGAs. The likely reasons are as follows:
1. Application scenarios determine architecture. Compared with TensorFlow, CNTK's influence, and hence its ability to feed requirements back into hardware, is weaker.
2. Microsoft is more focused on a programmable heterogeneous distributed platform for cloud services of various types and sizes, and on latency-sensitive compute-intensive applications, rather than deep learning alone. Once such a heterogeneous distributed platform is in place, the FPGA can be divided into fixed logic (a static region) and task-specific configurable logic (a dynamic region): the static region hosts the distributed-node framework and data interfaces, while the dynamic region can be swapped as needed, as shown in Figures 3.10 and 3.11.
Figure 3.10 BrainWave's large-scale reconfigurable cloud architecture
Figure 3.11 Microsoft's server-side FPGA pool and its smart NIC structure (third generation)
In Figure 3.11, the FPGA communicates with the server through a PCIe interface in the form of a smart NIC. The FPGAs interconnect with each other at the chip level, with 40 Gb/s of bandwidth and ultra-low latency, and the communication path bypasses the CPU. This structure can load an entire DNN model into the combined on-chip caches of multiple FPGAs for ultra-low-latency computation, and it is also compatible with ultra-low-latency processing of various other accelerated scenarios, such as search, video, and sensor data streams [4]. Its applications fall into two categories:
A. Scalable, low-latency tasks: when tens of thousands of FPGAs are interconnected at the chip level, they form an FPGA pool. Communication and data exchange among them escape the PCIe-server interaction pattern; tasks are assigned and exchanged directly at the chip level, yielding up to 1 Exa-op/s of aggregate performance, for example machine translation at 78,120,000 pages per second. For scenarios where bandwidth is the bottleneck, such as LSTM, the model can be split across multiple chips so that weights need not be imported from off-chip DDR during computation, avoiding the bandwidth limit and unleashing the compute power (a rough sizing sketch follows after point B).
B. Local accelerator: similar to TPU1's acceleration mode, the FPGA serves as a local accelerator of the server, exchanging data and receiving tasks over PCIe. It can serve as the main DNN compute unit, or as a dedicated pre-processing coprocessor placed between the server's network interface and the switch in the form of a smart NIC, offloading the CPU under high throughput so that the CPU can focus on the tasks it is better at, such as real-time codec and encryption/decryption of cloud-service data streams.
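As a rough sketch of the "pin every weight on-chip" idea in point A above (the per-FPGA SRAM figure below is an assumption for illustration, not Microsoft's published spec), one can estimate how many pooled FPGAs it takes so that no weight is ever fetched from off-chip DDR while serving:

```python
import math

def fpgas_needed(params_millions, bytes_per_weight, sram_mb_per_fpga):
    """How many pooled FPGAs does it take to keep every weight in on-chip RAM?"""
    model_mb = params_millions * 1e6 * bytes_per_weight / 2**20
    return math.ceil(model_mb / sram_mb_per_fpga), model_mb

# Assumptions (illustrative): a 300M-parameter LSTM served with 8-bit weights,
# ~30 MB of usable on-chip RAM per FPGA after the static shell is placed.
n, size_mb = fpgas_needed(params_millions=300, bytes_per_weight=1, sram_mb_per_fpga=30)
print(f"model ~{size_mb:.0f} MB -> ~{n} FPGAs with all weights pinned on-chip")
```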
Figure 3.11 shows Microsoft's third-generation FPGA-pooling architecture. The first generation used a single card carrying multiple FPGA chips, with one server and four cards as one unit of computing granularity, as shown in Figure 3.12; the second generation moved to a single card per server to keep the servers homogeneous.
Figure 3.12 FPGA accelerator card and task allocation on Microsoft servers (first generation)
2.3 IBM
IBM has also launched a brain-inspired ASIC, TrueNorth [5]. The project started as early as 2006; by lowering the clock frequency to 1 kHz, it focuses on low-power applications. The interconnection of the on-chip nodes can be configured programmatically. It has found applications in embedded front-end visual recognition in academia and defense projects, and large numbers of chips can be interconnected to tackle larger tasks, as shown in Figure 3.13.
Figure 3.13 TrueNorth chip architecture and interconnection
2.4 Summary and comparison of data center applications
On the datacenter side, FPGAs versus ASICs can be summarized as follows:
A. Scope and flexibility: FPGAs have limited single-card performance compared with cloud acceleration ASICs such as the TPU. The advantage of FPGA-based architectures depends on the coverage of the datacenter's acceleration services. When the mainstream accelerated services, represented by cloud services, are not limited to matrix operations such as DNN but must handle a variety of acceleration and big-data workloads from different customers, the reconfigurable, deeply customizable FPGA is a good choice. ASICs are designed for a specific domain or application: the more specific the application, the more specialized the architecture, the more efficient the computation, and the narrower the business coverage.
B. Peak compute capability: peak compute depends on two things, the number of arithmetic units and the operating frequency. For the former, setting aside bandwidth and power constraints, an ASIC can customize the number and physical layout of its arithmetic units on demand, with no theoretical upper limit, and so reach higher performance and energy efficiency (e.g., the TPU's roughly 92 Tops); an FPGA, to remain general and programmable, offers a DSP count that varies only by device model, with the ceiling set by the FPGA vendor. FPGA operating frequency is not fixed: it depends on the process node (e.g., TPU at 28 nm, DianNao at 65 nm) and on the logic delay of the design. To accommodate arbitrary designs, an FPGA's logic cells, DSPs (its multiply-accumulate units), RAMs, and other resources are distributed roughly uniformly across the die, so there is inherent routing delay between DSPs and between RAM blocks; the frequency ceiling of large-scale logic is usually below 500 MHz, and the DSP hard cores themselves stay below 1 GHz. An ASIC can decide its internal resource placement freely, without this inherent routing delay, and thus reach higher frequencies (e.g., NVIDIA GPUs above 1.5 GHz and Intel i7 CPUs above 4 GHz). In addition, whether actual performance approaches the peak depends on the degree of customization: deep customization on an FPGA can keep every DSP busy, which partly compensates for its smaller DSP count relative to an ASIC.
C. Computational efficiency: this depends on the balance between computing granularity and scheduling. With very large granularity, such as TPU1's 256×256 systolic array, full performance requires huge bandwidth and a high data-reuse rate, so high efficiency is hard to maintain across diverse scenarios. Therefore, more architectures build a Processing Element (PE) out of a small or medium-sized structure (say, 256 DSPs) and organize many PEs into a PE array, securing efficiency through multi-level task allocation and scheduling (e.g., Cambricon's DaDianNao).
D. Design cycle and time to market: the FPGA's programmable model decouples design from tape-out, and a large-scale logic design typically takes about half a year to develop; an ASIC's design flow is more complex, and tape-out, packaging, and other steps also depend on the foundry's schedule, so the development cycle is usually 18 to 24 months. Even with the scale and commitment of Google's TPU team, it took 15 months from design to deployment. This roughly threefold lag makes FPGAs better suited to rapid iteration, quickly catching up with constantly updated algorithm models.
E. Cost and risk: the chip cost of a high-performance FPGA depends on purchase volume, and both supply risk and R&D risk are relatively low; ASIC manufacturing is expensive, as shown in Equation (1), and involves yield and other issues. If an error is discovered once the chip is in production, the investment must be made all over again from scratch, so the risk is high.
C_t = C_d + C_m + C_p × V / (n × Y)    (1)
In Equation (1), C_t is the total cost of chip design and manufacturing, C_d the design cost, C_m the mask cost, C_p the processing cost per wafer, V the total volume, Y the yield, and n the number of dies per wafer. As the equation shows, once the process node is fixed, the unit cost of an ASIC depends on the deployment volume V: above roughly 10,000 units the unit price can drop to a few hundred yuan. At the same time, ASIC cost is closely tied to the process node; a more advanced process can raise the chip's operating frequency and cut its power consumption, but it also raises the total cost considerably.
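To see how Equation (1) behaves, the snippet below evaluates per-chip cost at several volumes. Every figure is a placeholder chosen only so the curve lands near the text's "a few hundred per unit above 10,000 pieces"; the point is the shape of the curve, with the one-off design and mask costs amortizing away as V grows.

```python
def asic_unit_cost(V, Cd, Cm, Cp, n, Y):
    """Per-chip cost from Eq. (1): Ct = Cd + Cm + Cp * V / (n * Y); unit cost = Ct / V."""
    Ct = Cd + Cm + Cp * V / (n * Y)
    return Ct / V

# Illustrative placeholders (not real quotes): 2M design NRE, 0.5M mask set,
# 4k per wafer, 400 dies per wafer, 80% yield, all in the same cost unit.
for V in (1_000, 10_000, 100_000, 1_000_000):
    c = asic_unit_cost(V, Cd=2_000_000, Cm=500_000, Cp=4_000, n=400, Y=0.8)
    print(f"V={V:>9,}  unit cost ~ {c:,.0f}")
```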
This comparison only weighs the merits of the chips themselves. Sometimes, within a complete industry chain, the value of an ASIC is not limited to the chip itself: it is the necessary path upstream and downstream once a field takes off. Set against the prosperity of, and a hold over, the whole industry chain and ecosystem in the years ahead, the cost of an ASIC is insignificant. One could say the TPU is merely the opening move in Google's plan to dominate deep learning and extend from the cloud both upstream and downstream; compared with the embedded AI hardware applications now gradually unfolding, this is only the beginning.
III. Low-latency FPGAs and ASICs in the datacenter and cloud
With the combination of deep learning and big data and the landing of many applications, the evolution of computing platforms is ultimately a competition of ecosystems. Performance, latency, and throughput are the key metrics, but sometimes ease of use, generality, and operations decide the outcome. Overall, GPUs, FPGAs, and ASICs each have their strengths. The GPU ecosystem is the most mature: years of operation make it unmatched in library completeness, end-to-end ease of use from algorithm to hardware, and developer-community building. The FPGA is the most flexible: it can enter the market quickly in the early phase of an emerging computing workload, before an ASIC appears, and is not constrained by an ASIC's required scale and cost, making it well suited to reconfigurable heterogeneous computing in cloud scenarios. The ASIC offers the deepest customization: once bandwidth is solved, it can reach nearly 10 to 100 times the physical computing scale of GPUs and FPGAs, making it the ultimate weapon of the data center.
Although FPGAs and ASICs can be customized far more deeply than GPUs within a narrow field, with today's constantly evolving AI models they must solve two problems to catch up with the GPU: the iteration cycle, and end-to-end ecosystem construction.
3.1 The problem of FPGA and ASIC — iteration cycle
The development cycle and entry barrier of FPGAs and ASICs have long been criticized. Even with fast-iterating FPGAs, a datacenter project takes nearly six months to go live, making it hard to seize the market quickly when new opportunities arise. The industry currently has two answers: high-level synthesis and AI coprocessors.
1) High-level synthesis (HLS): development in C or another high-level language, with HLS tools converting it directly into RTL that can be applied to an FPGA or ASIC, improving development efficiency by a factor of four or more. Typical representatives are Xilinx's HLS and Altera's OpenCL. However, compared with direct RTL development, HLS-generated structures fall well short in execution efficiency and logic scale, and can lose around 30% of performance. Some vendors therefore add AI-specific optimizations to high-level synthesis: commonly used AI modules, such as Conv, Pooling, and ReLU in CNN, are hand-written in RTL and collected into hardware IP libraries to replace the corresponding auto-generated modules, seeking a balance between customization and ease of use to raise the performance of the final system.
2) AI coprocessor: an FPGA or ASIC with a degree of generality. To balance generality and customization while keeping the coprocessor efficient, only the most time-consuming, compute-dense parts of the algorithm run on the coprocessor, with the rest handled by the CPU, reducing the performance loss that generality brings. The coprocessor supports a corresponding instruction set, and the CPU drives it by sending instructions, adapting to different application scenarios. Once the logic structure is fixed, the long RTL iteration cycle turns into the development of instruction sequences; FPGA recompilation and place-and-route are not even needed, which greatly shortens time to deployment. Unlike the DianNao series and the TPU, Cambricon's processor is an instruction-oriented processor with the most complete instruction set [6], as shown in Figure 3.14. A general AI coprocessor can of course also be built on an FPGA, but with limited resources it will eventually migrate to an ASIC once application scenarios appear in volume.
Figure 3.14 Instruction set and overall chip architecture of the Cambricon processor
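The snippet below is a toy software model of the instruction-driven coprocessor idea, not Cambricon's actual ISA: the "hardware" (here, an interpreter with three hypothetical opcodes) stays fixed, while a new network is expressed purely as a new instruction sequence, so supporting a new model means emitting new instructions rather than redoing RTL or re-running place-and-route.

```python
import numpy as np

def run(program, memory):
    """Toy AI coprocessor: a fixed datapath whose behaviour is driven by instructions."""
    for op, dst, *src in program:
        if op == "MATMUL":                      # dst = src0 @ src1
            memory[dst] = memory[src[0]] @ memory[src[1]]
        elif op == "ADD":                       # dst = src0 + src1 (e.g. bias)
            memory[dst] = memory[src[0]] + memory[src[1]]
        elif op == "RELU":                      # dst = max(src0, 0)
            memory[dst] = np.maximum(memory[src[0]], 0)
        else:
            raise ValueError(f"unknown opcode {op}")
    return memory

# A two-layer MLP written as an instruction sequence (hypothetical opcodes).
mem = {"x": np.random.randn(1, 8),
       "w1": np.random.randn(8, 16), "b1": np.zeros((1, 16)),
       "w2": np.random.randn(16, 4), "b2": np.zeros((1, 4))}
mlp = [("MATMUL", "h", "x", "w1"), ("ADD", "h", "h", "b1"), ("RELU", "h", "h"),
       ("MATMUL", "y", "h", "w2"), ("ADD", "y", "y", "b2")]
print(run(mlp, mem)["y"].shape)   # (1, 4): a new model is just a new program
```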
3.2 The problem of FPGA and ASIC — end-to-end ecosystem construction
End-to-end means a seamless path from deep learning frameworks such as TensorFlow, Torch, and Caffe down to the hardware, letting developers rely entirely on the compiler to invoke and optimize computing resources without hardware knowledge or tuning skills, perhaps without even realizing hardware acceleration is in use. Here the GPU leads heterogeneous computing. When FPGAs or ASICs join the computation as AI coprocessors, corresponding interfaces must be provided inside the deep learning framework, and the compiler translates framework-level code into instruction sequences that match the hardware instruction set, realizing top-level invocation of the FPGA or ASIC. While generating the instruction sequence, the compiler should also produce an optimal hardware-resource configuration, supporting single-card single-chip, single-card multi-chip, multi-card single-chip, multi-card multi-chip, and other real deployment modes. Moreover, users have different priorities, especially for Inference services: demands differ greatly in throughput, latency, resource allocation, and so on. The compiler plays a vital role in dynamically coordinating hardware resources and compute topologies, even scaling from single machine and multi-machine to distributed deployments.
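As a hedged illustration of that coordination role (the heuristic, the op count, and the hardware numbers below are invented for the example and do not describe any real compiler), a deployment-planning pass might search for the smallest chip count and batch size that both keep up with the request rate and stay within the latency budget:

```python
def plan(requests_per_s, latency_budget_ms, chip_tops, gops_per_request):
    """Toy planner: smallest (chips, batch) that keeps up with the load
    and meets the latency budget, assuming ideal multi-chip scaling."""
    for chips in range(1, 257):
        for batch in (1, 2, 4, 8, 16, 32):
            arrival_ms = batch / requests_per_s * 1e3          # time to gather a batch
            compute_ms = batch * gops_per_request / (chips * chip_tops)
            if compute_ms <= arrival_ms and arrival_ms + compute_ms <= latency_budget_ms:
                return chips, batch, round(arrival_ms + compute_ms, 2)
    return None

# Hypothetical Inference service: 2,000 requests/s, 10 ms budget,
# 45 Tops per chip, ~200 GOPs per request (all numbers are assumptions).
print(plan(requests_per_s=2000, latency_budget_ms=10, chip_tops=45, gops_per_request=200))
# -> (9, 1, 0.99): nine chips, batch size 1, roughly 1 ms end-to-end
```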
IV. Conclusion
A deep understanding of the datacenter's business needs and industry trends, and of the differences among architectures and solutions, lies at the heart of choices such as scenario-specific design, general-purpose versus custom, performance versus bandwidth, and new computing forms from FPGAs and ASICs to quantum computing. This is also the driving force behind our continued exploration of heterogeneous computing architectures.
References
[1] Jouppi N P, Young C, Patil N, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit[C]// International Symposium on Computer Architecture (ISCA). 2017. arXiv:1704.04760.
[2] Jeff Dean. Recent Advances in Artificial Intelligence and the Implications for Computer System Design[C]. Hot Chips 2017.
[3] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, et al. Accelerating Persistent Neural Networks at Datacenter Scale[C]. Hot Chips 2017.
[4] Caulfield A M, Chung E S, Putnam A, et al. A cloud-scale acceleration architecture[C]//Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016: 1-13.
[5] Merolla P A, Arthur J V, Alvarez-Icaza R, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface[J]. Science, 2014, 345 (6197) : 668-673.
[6] Liu S, Du Z, Tao J, et al. Cambricon: An Instruction Set Architecture for Neural Networks[C]// International Symposium on Computer Architecture. IEEE Press, 2016:393-405.