Abstract: CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture designed specifically for AI scenarios, aiming to improve developer efficiency and unleash the ultimate computing power of Ascend AI processors.

1. Introduction

In 2016, AlphaGo beat the world's top Go players, defeating all human challengers at the game.

By 2020, GPT-3 was writing novels, screenplays, and code, seemingly perfect at everything.

In 2021, the Pangu model came closer to human-level understanding of Chinese than any model before it, with superior generalization ability…

In recent years, the field of artificial intelligence has kept refreshing human cognition and overturning human imagination.

Just as a human masters a skill through practice, training a sufficiently smart AI model often requires tens of thousands of data samples, or far more. Take GPT-3 as an example: it has 175 billion parameters and a training corpus of 45 TB, and a single training run takes on the order of months. The demand for computing power has thus been a stumbling block on the AI track!

At the same time, as artificial intelligence applications mature, the demand for processing unstructured data such as text, images, audio, and video grows exponentially, and data processing is gradually shifting from general-purpose computing to heterogeneous computing.

The Ascend AI basic hardware and software platform launched by Huawei, Ascend AI processors plus the heterogeneous computing architecture CANN, with its innate strengths in computing power and heterogeneous computing, is gradually becoming a catalyst for the rapid landing of the AI industry.

CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture designed specifically for AI scenarios, aiming to improve developer efficiency and unleash the ultimate computing power of Ascend AI processors. Upward, it supports the mainstream AI frameworks; downward, it shields users from the hardware differences across the chip series, meeting users' all-round demand for artificial intelligence with full-scenario coverage, a low entry barrier, and high performance.

2. Compatibility with mainstream frameworks for fast algorithm porting

In the field of artificial intelligence, frameworks for building AI models abound. Besides Huawei's open-source MindSpore, there are deep learning frameworks such as Google's TensorFlow, Facebook's PyTorch, and Caffe.

Through its plugin adaptation layer, CANN can easily take on AI models developed in different frameworks, converting the models defined by each framework into a standardized graph format expressed in Ascend IR (Intermediate Representation) and shielding the differences between frameworks.

In this way, developers need only minimal changes to port their algorithms quickly and experience the surging computing power of Ascend AI processors, greatly reducing the cost of switching platforms. Isn't that sweet?
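To make "very few changes" concrete, here is a minimal sketch of a TensorFlow 1.x migration based on the npu_bridge adapter shipped with CANN; the option names follow Huawei's public migration samples, and details vary by CANN version, so treat this as a sketch rather than a definitive recipe:

```python
# Minimal TensorFlow 1.x migration sketch (assumes the CANN npu_bridge
# package is installed); the rest of the training script stays unchanged.
import tensorflow as tf
from npu_bridge.estimator import npu_ops  # registers NPU support
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"                    # route the graph to the NPU
custom_op.parameter_map["use_off_line"].b = True   # execute on the Ascend device
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    pass  # build and run the original model here, unchanged
```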

3. A simple, easy-to-use development interface that lets even beginners play with AI

Relying on artificial intelligence for intelligent transformation has become a required course in almost every industry. Adhering to a philosophy of minimalist development, CANN provides AscendCL (Ascend Computing Language), an easy-to-use programming interface that shields developers from the differences between underlying processors: master one set of APIs and you can target the whole series of Ascend AI processors.

At the same time, it keeps developers covered across future CANN version upgrades: full backward compatibility is maintained, with no compromise in runtime efficiency!

Simple AI application development interface

Artificial intelligence carries humanity's vision of a better life. When you face the daily soul-searching question of "what kind of garbage is this and which bin does it go in", an AI garbage-sorting bin application can come to your rescue.

AscendCL provides a C API library for developing deep neural network inference applications, covering runtime resource management, model loading and execution, and image preprocessing, making it easy for developers to unlock AI applications such as image classification and object recognition. AscendCL can be called indirectly through mainstream open-source frameworks, or the programming interfaces that CANN exposes can be invoked directly from your own code.

Here are the five steps that get the AI garbage-sorting app going (a code sketch follows the list):

  1. Request runtime management resources: initialize the internal system resources (device, context, and so on).
  2. Load the model file and allocate output memory: convert the open-source model into the OM format supported by CANN and load it into memory; obtain the basic model information and allocate memory for the model outputs in preparation for inference.
  3. Preprocess the data: preprocess the image data that is read in, then construct the model's input data.
  4. Run model inference: execute inference on the constructed input data.
  5. Parse the inference results: interpret the model's outputs to obtain the prediction.
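The sketch below walks the same five steps using pyACL, the Python binding of AscendCL. The model path, input shape, and integer constants are illustrative, error handling is pared down, and API names follow Huawei's public pyACL samples, so expect details to vary by CANN version:

```python
import acl
import numpy as np

# Memory policy / memcpy kind constants (values per the acl headers).
ACL_MEM_MALLOC_HUGE_FIRST = 0
ACL_MEMCPY_HOST_TO_DEVICE = 1
ACL_MEMCPY_DEVICE_TO_HOST = 2

# 1. Request runtime management resources: init, device, context.
acl.init()
acl.rt.set_device(0)
context, ret = acl.rt.create_context(0)

# 2. Load the OM model (converted beforehand with the ATC tool) and
#    query its description to size the output buffer.
model_id, ret = acl.mdl.load_from_file("garbage_sort.om")  # hypothetical model
desc = acl.mdl.create_desc()
acl.mdl.get_desc(desc, model_id)
out_size = acl.mdl.get_output_size_by_index(desc, 0)

# 3. Preprocess: a dummy normalized image stands in for real DVPP/AIPP output.
img = np.zeros((1, 3, 224, 224), dtype=np.float32)
dev_in, ret = acl.rt.malloc(img.nbytes, ACL_MEM_MALLOC_HUGE_FIRST)
acl.rt.memcpy(dev_in, img.nbytes, acl.util.numpy_to_ptr(img),
              img.nbytes, ACL_MEMCPY_HOST_TO_DEVICE)
dev_out, ret = acl.rt.malloc(out_size, ACL_MEM_MALLOC_HUGE_FIRST)

def make_dataset(ptr, size):
    ds = acl.mdl.create_dataset()
    acl.mdl.add_dataset_buffer(ds, acl.create_data_buffer(ptr, size))
    return ds

# 4. Model inference on the constructed input.
in_ds, out_ds = make_dataset(dev_in, img.nbytes), make_dataset(dev_out, out_size)
acl.mdl.execute(model_id, in_ds, out_ds)

# 5. Parse the inference result: copy the scores back and take the top class.
scores = np.zeros(out_size // 4, dtype=np.float32)
acl.rt.memcpy(acl.util.numpy_to_ptr(scores), out_size, dev_out,
              out_size, ACL_MEMCPY_DEVICE_TO_HOST)
print("predicted class:", int(np.argmax(scores)))

# Teardown: unload the model and release runtime resources.
acl.mdl.unload(model_id)
acl.rt.destroy_context(context)
acl.rt.reset_device(0)
acl.finalize()
```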

Flexible operator development interface

When your AI model contains operators that CANN does not yet support, or when you want to modify existing operators to improve computational performance, you can use CANN's open custom-operator development interface to develop any operator you need.

For AI developers at different levels, CANN provides two operator development modes, the efficient TBE-DSL and the professional TBE-TIK, flexibly meeting different developers' needs.

Of the two, TBE-DSL is easier to get started with. It handles data tiling and scheduling automatically, so developers only need to focus on the computation logic of the operator itself and can produce high-performance operators without understanding the hardware details, as in the sketch below.
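Here is a minimal TBE-DSL sketch of an element-wise Add operator, following the pattern of Huawei's public TBE samples; the operator name my_add is illustrative, and module paths vary by CANN version:

```python
import te.lang.cce
from te import tvm

def my_add(input_x, input_y, output_z, kernel_name="my_add"):
    shape = input_x.get("shape")
    dtype = input_x.get("dtype").lower()

    # Declare the inputs; the DSL only asks for the compute logic.
    data_x = tvm.placeholder(shape, name="data_x", dtype=dtype)
    data_y = tvm.placeholder(shape, name="data_y", dtype=dtype)
    res = te.lang.cce.vadd(data_x, data_y)   # element-wise add

    # Tiling and scheduling are generated automatically.
    with tvm.target.cce():
        schedule = te.lang.cce.auto_schedule(res)

    config = {"name": kernel_name, "tensor_list": [data_x, data_y, res]}
    te.lang.cce.cce_build_code(schedule, config)
```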

TBE-TIK is somewhat harder. Unlike TBE-DSL, which abstracts the programming at a high level, it offers instruction-level programming and tuning capabilities: developers issue instruction-like calls by hand, which lets them squeeze the most out of the hardware and implement more efficient and more complex operators.
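For contrast, here is a minimal TBE-TIK sketch of the same element-wise Add, modeled on Huawei's public TIK samples (tensor sizes and stride arguments are illustrative): the developer explicitly moves data between global memory (GM) and the on-chip unified buffer (UB) and issues the vector instruction by hand:

```python
from te import tik

tik_instance = tik.Tik()
# 128 float16 elements in global memory for each operand.
data_a = tik_instance.Tensor("float16", (128,), name="data_a", scope=tik.scope_gm)
data_b = tik_instance.Tensor("float16", (128,), name="data_b", scope=tik.scope_gm)
data_c = tik_instance.Tensor("float16", (128,), name="data_c", scope=tik.scope_gm)
a_ub = tik_instance.Tensor("float16", (128,), name="a_ub", scope=tik.scope_ubuf)
b_ub = tik_instance.Tensor("float16", (128,), name="b_ub", scope=tik.scope_ubuf)
c_ub = tik_instance.Tensor("float16", (128,), name="c_ub", scope=tik.scope_ubuf)

# Explicit data movement GM -> UB (128 fp16 = 8 blocks of 32 bytes).
tik_instance.data_move(a_ub, data_a, 0, 1, 8, 0, 0)
tik_instance.data_move(b_ub, data_b, 0, 1, 8, 0, 0)
# One vector add over all 128 elements, strides chosen per the TIK samples.
tik_instance.vadd(128, c_ub, a_ub, b_ub, 1, 1, 1, 1, 8, 8, 8)
# Move the result back UB -> GM, then build the kernel.
tik_instance.data_move(data_c, c_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE(kernel_name="simple_add",
                      inputs=[data_a, data_b], outputs=[data_c])
```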

Convenient IR graph construction interface

In addition, developers can use the standardized Ascend IR (Intermediate Representation) graph interface to build high-performance models that execute on Ascend AI processors by directly invoking CANN's operator libraries, without going through a deep learning framework at all.

4. 1200+ high-performance operators that build a surging source of computing power

A model built on a deep learning framework is actually composed of computing units called operators (Op for short), each corresponding to a specific piece of computation logic.

Accelerating operators on hardware is the basis and core of accelerating a neural network. CANN currently provides 1200+ deeply optimized, hardware-friendly operators, and it is this wealth of high-performance operators that builds a surging source of computing power and lets your neural network accelerate "in an instant".

  • NN (Neural Network) operator library: covers the computation types of commonly used deep learning algorithms across the TensorFlow, PyTorch, MindSpore, and ONNX frameworks, and accounts for the largest share of all operators in CANN. In most cases, users only need to focus on the algorithm itself and do not have to develop and debug operators themselves.
  • BLAS (Basic Linear Algebra Subprograms) operator library: numerical routines for basic linear algebra operations on vectors and matrices. CANN supports general matrix multiplication and basic operations such as Max, Min, Sum, multiplication, and addition.
  • DVPP (Digital Video Pre-Processor) operator library: provides high-performance video encoding/decoding, image encoding/decoding, and image cropping and scaling.
  • AIPP (AI Pre-Processing) operator library: mainly implements image resizing, color-space conversion (image format conversion), and mean subtraction/coefficient multiplication (image normalization), and is integrated into the model inference flow to meet its input requirements.
  • HCCL (Huawei Collective Communication Library) operator library: mainly provides collective communication functions such as Broadcast, AllReduce, ReduceScatter, and AllGather for single-machine multi-card and multi-machine multi-card setups, delivering efficient data transfer for distributed training.

5. A high-performance graph compiler that gives neural networks superpowers

The hardest thing in the world is waiting: waiting for the traffic light, waiting for the summer and winter holidays, waiting for the takeout to arrive, waiting for the right person…

Artificial intelligence is no different. As neural network structures evolve rapidly, pure hand optimization of AI model performance runs into bottlenecks more and more often. Like a magician, CANN's graph compiler takes the highly abstract computation graph and compiles and optimizes it according to the structural characteristics of the Ascend AI processor hardware, so that it executes efficiently.

So what are this magician's tricks?

Automatic operator fusion: fusing automatically along multiple dimensions, such as operator, subgraph, and scope, effectively reduces the number of compute nodes and greatly shortens computation time.

Buffer fusion: targeting the memory-bound nature of neural network computation with its high data throughput, computing efficiency is improved by reducing the number of data transfers and raising the utilization of the cache buffer inside the Ascend AI processor.

We made a comparison before and after Buffer fusion:

Before fusion, operator 1 writes its result from the cache buffer inside the Ascend AI processor out to external memory, and operator 2 then reads that data from external memory back into the cache buffer as its input. After fusion, the result of operator 1 stays in the cache buffer and operator 2 reads it from there directly, which effectively reduces the number of data transfers and improves computing performance.

Whole-graph sinking: the Ascend AI processor integrates rich computing resources such as the AI Core, AI CPU, DVPP, and AIPP. Thanks to this rich soil, CANN can sink not only the computation itself onto the Ascend AI processor for acceleration, but also the control flow, DVPP, and communication parts for on-device execution. Especially in training scenarios, executing the entire closed loop of a logically complex computation graph inside the AI processor effectively reduces interaction with the host CPU and improves computing performance.

Heterogeneous scheduling: when a computation graph contains multiple types of computing tasks, CANN makes full use of the Ascend AI processor's rich heterogeneous computing resources. While honoring the dependencies in the graph, it distributes the tasks across different computing units for parallel execution, improving the utilization of each unit and ultimately the overall efficiency of the computation.

6. Automatic mixed precision, effectively balancing performance and accuracy

As the name implies, automatic mixed precision is a technique that automatically mixes half-precision and single-precision computation to accelerate model execution; it plays an indispensable role in large-scale model training scenarios.

Single precision (FP32) is a data type commonly used by computers, while half precision (FP16) is a relatively new floating-point type that occupies 2 bytes (16 bits) of storage and suits scenarios that do not demand high precision.

Obviously, using the FP16 type will lose some computational accuracy, but in deep learning training not every calculation requires high precision. Operators in the computation graph that are insensitive to accuracy can therefore be computed in FP16, which effectively reduces memory consumption and strikes a balance between performance and accuracy.
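To see the trade-off concretely, here is a tiny, plain NumPy illustration (not CANN-specific) of FP16's limited precision, roughly 3 to 4 significant decimal digits:

```python
import numpy as np

# Smallest representable step around 1.0 in each type.
print(np.finfo(np.float16).eps)   # ~0.000977
print(np.finfo(np.float32).eps)   # ~1.19e-07

# A small addend is swallowed in FP16 but preserved in FP32.
print(np.float16(1.0) + np.float16(0.0001))  # 1.0    (lost)
print(np.float32(1.0) + np.float32(0.0001))  # 1.0001 (kept)
```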

7. Exascale clusters open the era of AI supercomputing

As the problems that mainstream deep learning models can solve become more and more complex, the models themselves also grow in complexity, and the field of artificial intelligence needs ever more computing power to meet the training needs of future networks.

"Pengcheng Cloud Brain II", built on the Ascend AI basic hardware and software, breaks through today's industry ceiling of 100 PFLOPS (10^17 operations per second), bringing E-level computing power (EFLOPS, 10^18 operations per second) onto the stage of history.

It integrates thousands of Ascend AI processors and delivers 256 to 1024 PFLOPS of AI computing power, that is, up to roughly one EFLOPS.

How to schedule thousands of Ascend AI processors efficiently is a hard problem in large-scale cluster networking.

CANN integrates the Huawei Collective Communication Library (HCCL) to give multi-machine, multi-card training on Ascend AI processors a high-performance collective communication solution for data parallelism and model parallelism:

  1. A two-level topology, high-speed HCCS mesh interconnect within a server and a non-blocking RDMA network between servers, combined with topology-adaptive communication algorithms, makes full use of link bandwidth and spreads inter-server traffic in parallel across independent network planes, greatly improving the linearity of model training in ultra-large-scale clusters.
  2. An integrated high-speed communication engine and a dedicated hardware scheduling engine greatly reduce communication scheduling overhead, schedule communication tasks and computing tasks in a unified and coordinated way, and keep system jitter under precise control (a usage sketch follows this list).
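At the application level, frameworks reach HCCL through their Ascend adapters. Below is a minimal sketch using the PyTorch Ascend adapter (torch_npu), which registers "hccl" as a distributed backend; the environment variables and device indices are illustrative, and launcher details vary by setup:

```python
# Minimal HCCL AllReduce sketch via the PyTorch Ascend adapter (torch_npu).
# Launch one process per device with RANK/WORLD_SIZE set by your launcher.
import os
import torch
import torch_npu  # Ascend adapter: registers the "hccl" backend and .npu()
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.npu.set_device(rank)              # one Ascend device per process
dist.init_process_group(backend="hccl", # collectives execute over HCCL
                        rank=rank, world_size=world_size)

t = torch.ones(4).npu()
dist.all_reduce(t)                      # HCCL AllReduce across all cards
print(f"rank {rank}: {t}")              # each rank now holds world_size * 1s

dist.destroy_process_group()
```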

If Pengcheng Cloud Brain II can be compared to a grand symphony orchestra, then CANN is its excellent conductor, joining hands with Ascend AI processors to open a new chapter in the era of AI supercomputing.

8. Written at the end

Since its release in 2018, CANN has kept pushing to break through, bringing developers a minimalist experience while unleashing the ultimate performance of AI hardware; these have become the legs on which CANN stands in the field of artificial intelligence.

We believe it will unswervingly join hands with those who want to change the world, changing the world and building the future together on the AI track!

CANN will also get a new, more powerful version 5.0 at the end of 2021. What surprises will it bring? Let's wait and see!
