As an AI engineer who has worked through the basic TensorFlow tutorial, played with Keras for a while, tried fastai, and followed the MXNet/Gluon course from Li Mu's team, I started learning PyTorch today. That suddenly got me thinking about this question, so I wrote up and organized the following.

1. What is a deep learning framework

The concept is quite broad, so let me start with a few examples and then summarize at the end.


PyTorch is described as:

  1. A seamless replacement for NumPy that accelerates neural networks with the computing power of GPUs.
  2. An automatic differentiation mechanism that makes implementing neural networks easier.

Some features of MXNet/Gluon:

I remember the automatic differentiation mechanism covered in the course from Li Mu's team.

PaddlePaddle:

PaddlePaddle automatically applies the chain rule to compute the gradient of every parameter and variable in the model.

TensorFlow has had an automatic differentiation mechanism for a long time; searching online for "TensorFlow automatic differentiation" turns up many blog posts.

So it is clear that today's deep learning frameworks all provide automatic gradient computation.


2. The emergence of deep learning frameworks

Background: IBM's Deep Blue chess system defeated world champion Garry Kasparov in 1997 (Hsu, 2002), and AlphaGo defeated the top human Go player Ke Jie in 2017. Cause: neural networks once again became a hot topic in academia and industry, with academia leading industry and pushing results into industrial applications, so many deep learning computing frameworks appeared. The root cause is that deep learning depends heavily on the hardware environment, which raises the bar for developers; a deep learning computing framework shields a large amount of hardware-level development cost, so researchers and developers can focus on implementing algorithms and iterate quickly.

Example: Most deep learning frameworks have been around since 2016.

  • TensorFlow: first release on GitHub (v0.5.0) on November 9, 2015 – tensorflow-github-tags
  • TensorFlow v1.0 released on February 15, 2017 – "TensorFlow 1.0 is officially released with major updates and 5 highlights"
  • TensorFlow 2.0 released on October 1, 2019 – "TensorFlow 2.0 launched in the wee hours! 'Change everything, beat PyTorch'"
  • PyTorch: first release (v0.1.1) on August 24, 2016 – pytorch-github-tags
  • PyTorch 1.0 officially released in December 2018 – "PyTorch 1.0 is now available!"
  • PaddlePaddle: first release on GitHub (v0.8.0b0) on August 31, 2016 – paddlepaddle-github-tags
  • Microsoft CNTK: first release on GitHub (version 2015-12-08) on February 22, 2016 – CNTK-github-tags
  • Amazon's MXNet: first release on GitHub (v0.7.0) on May 27, 2016 – mxnet-github-tags

Therefore, the year when the giants entered the field and deep learning computing frameworks exploded was late 2015 to 2016.


In addition, I searched online for "the first year of deep learning computing frameworks" and found the following (from an article whose title translates roughly as "A lively year for domestic frameworks and open source"):

In 2020, "open source" became one of the labels of the AI field, and the year was also called the first year of open source for deep learning frameworks in China. Since the beginning of that year, several domestic AI frameworks, such as Huawei MindSpore, Megvii MegEngine, Tencent TNN, and Tsinghua Jittor, have been open-sourced one after another. In addition, Baidu's PaddlePaddle has kept expanding its compatibility and openness through cooperation with different enterprises. As domestic technology companies open-source their frameworks, they gradually break the dominance of PyTorch, TensorFlow, Keras, and MXNet, laying the most critical foundation for the development of domestic artificial intelligence technology. According to the "Market Share Survey of Deep Learning Platforms in China" by the global consulting firm IDC, 86.2% of surveyed enterprises and developers choose open-source deep learning frameworks. For now, apart from large tech companies like Huawei and Tencent, open-source projects from a number of newer AI companies are also becoming mainstream; Megvii's deep learning framework MegEngine is one of them.

I don’t think I’ve heard much about it, but something is better than nothing, 😀

3. Summary

The role of deep learning computing frameworks

  1. Provide GPU-backed tensors that replace NumPy both for general numerical computation and for neural network operations.
  2. Provide an automatic derivative/differentiation/gradient mechanism, so that implementing a neural network becomes easy.
  3. Build in many basic network components, such as fully connected layers, convolutional layers, RNN/LSTM, etc., to reduce coding work and let people focus on steps such as model design rather than programming (a minimal sketch follows this list).
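To make these three roles concrete, here is a minimal sketch using PyTorch as one example (assuming torch is installed; any mainstream framework would look much the same):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. GPU-capable tensors as a NumPy-style numerical workhorse
x = torch.randn(32, 10, device=device)

# 3. a built-in network component (a fully connected layer)
model = nn.Linear(10, 1).to(device)

# 2. automatic differentiation: backward() fills in .grad for every parameter
loss = model(x).pow(2).mean()
loss.backward()
print(model.weight.grad.shape)   # torch.Size([1, 10])
```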

Or as others put it:

Most deep learning frameworks have five core components:

  1. Tensor
  2. Operations based on tensors
  3. Computation Graph
  4. Automatic Differentiation
  5. BLAS, cuBLAS, cuDNN and other extension packages

3.1 How does a deep learning framework speed up computation?

1. Tensors + various tensor-based operations + computational graphs = faster computation


The original Theano used computational graphs; its performance was about 1.8 times that of NumPy on the CPU and about 11 times that of NumPy on the GPU.

Reference:

-Running on a GPU and performing 11 times better than NumPy, this Python library is worth having

-How to understand TensorFlow's computational graph?

> A computational graph is essentially a program logic graph that TensorFlow builds in memory. The graph can be split into multiple blocks and run in parallel on multiple different CPUs or GPUs, which is called parallel computing. Computational graphs can therefore support large-scale neural networks.



-Why does TensorFlow use "computational graphs" to represent computation?

3.1.1 Computational graphs

Because there are so many tensors and tensor operations, the difficulty of keeping track of their relationships raises questions such as whether multiple operations should run in parallel or in sequence, how to coordinate the various underlying devices, and how to avoid redundant operations. These problems can slow down the whole deep learning network or introduce unnecessary bugs, and the computational graph was created precisely to solve them.

How computational graphs were first introduced to AI: computational graphs were first brought into the field of artificial intelligence by Yoshua Bengio's 2009 paper "Learning Deep Architectures for AI". As shown in the figure there, the author uses placeholders (*, +, sin) as operation nodes and the letters x, A, b as variable nodes, then connects these nodes with directed edges to form a clear "graph" data structure that expresses the logical relations between operations; this is the original computational graph.
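To make this concrete, here is a toy sketch (not any framework's real API) of the kind of graph described above: variable nodes x, A, b and operation nodes *, +, sin wired into a small DAG representing sin(A * x + b), which is then evaluated by traversing the graph.

```python
import math

class Node:
    """A node in a toy computational graph: either a variable or an operation."""
    def __init__(self, op=None, inputs=(), value=None, name=""):
        self.op, self.inputs, self.value, self.name = op, tuple(inputs), value, name

    def evaluate(self):
        # Forward traversal: evaluate the input nodes, then apply this node's op.
        if self.op is None:                       # variable (leaf) node
            return self.value
        return self.op(*[n.evaluate() for n in self.inputs])

# variable nodes
x = Node(value=2.0, name="x")
A = Node(value=3.0, name="A")
b = Node(value=1.0, name="b")

# operation nodes
mul = Node(op=lambda u, v: u * v, inputs=(A, x), name="*")
add = Node(op=lambda u, v: u + v, inputs=(mul, b), name="+")
out = Node(op=math.sin, inputs=(add,), name="sin")

print(out.evaluate())   # sin(3*2 + 1) = sin(7)
```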

Development of computational graphs: as the technology evolved, the industry combined the characteristics of scripting languages and low-level languages (scripting languages are convenient for modeling but slow to execute; low-level languages are the opposite) and gradually settled on this workflow: the frontend uses a scripting language such as Python for modeling, while the backend is implemented in a low-level language such as C++, combining the advantages of both. This design greatly reduces the code coupling of traditional frameworks for cross-device computation and avoids the maintenance cost of modifying the frontend every time the backend changes. The key link between the frontend and the backend is the computational graph.


Advantages of computational graphs: the computational graph is the intermediate representation through which the frontend and backend actually interact, so you can use Tensor objects as the data structure, functions/methods as the types of operations, and apply particular operations to particular data structures, thereby defining a powerful, MATLAB-like modeling language.

Note that developers normally do not use the model structure they wrote as the computational graph in the intermediate representation directly, because that graph usually contains a large number of redundant nodes and has not yet extracted shared variables. The graph is therefore typically optimized with methods such as dependency pruning, symbol fusion, and memory sharing. At present, different frameworks implement computational graphs with different mechanisms and emphases. Theano and MXNet, for example, implicitly convert expressions into computational graphs at compile time, while Caffe is more direct: it creates a Graph object and calls it explicitly, in a manner like graph.operator(xxx).

Conclusion: introducing computational graphs lets developers see the internal structure of a neural network from a macro perspective. Just as a compiler can decide register allocation from the viewpoint of the entire program, the computational graph can decide, at the macro level, how GPU memory is allocated at runtime and how different underlying devices cooperate in a distributed environment. In addition, many deep learning frameworks use computational graphs for model debugging, printing a text description of the operation currently being executed in real time.

3.2 How automatic differentiation is implemented

Automatic differentiation tool (mainly more convenient than traditional symbolic differentiation)

The traditional way of computing derivatives: the counterpart of automatic differentiation, and the more traditional approach in the industry, is symbolic differentiation, commonly known as analytical differentiation. Its drawback is that for some nonlinear operations (such as the rectified linear unit, ReLU) or for large-scale problems, symbolic differentiation is often very expensive and sometimes not even feasible (the expression may not be differentiable in closed form).
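As a quick illustration of what symbolic differentiation looks like (a small sketch assuming SymPy is installed): the derivative is produced as a closed-form expression, which tends to blow up for deeply composed functions and is awkward for piecewise operations like ReLU.

```python
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x**2) * sp.exp(x)

# Symbolic ("analytical") differentiation produces a closed-form expression:
# d/dx[sin(x^2) * exp(x)] = 2*x*exp(x)*cos(x**2) + exp(x)*sin(x**2)
print(sp.diff(expr, x))
```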

Advantages of automatic differentiation: a neural network is a complex composite function built from many nonlinear operations, and the computational graph represents the internal logical relations of this function completely and in a modular way.

  1. So differentiating this complex function, that is, computing the model's gradient, becomes simply a complete traversal of the computational graph from input to output (a toy sketch follows this list).
  2. This traversal-based automatic differentiation method is now widely used to compute model gradients.
  3. Many computational graph packages, such as the Computation Graph Toolkit, have pre-implemented automatic differentiation, because it can handle cases where symbolic differentiation does not work.
  4. In addition, since the derivative at each node only needs to be computed with respect to its adjacent nodes, a module that implements automatic differentiation can generally be attached directly to any operation class, and it can of course also be called directly by a higher-level differentiation module.
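Here is a toy sketch of that idea in reverse mode (scalar values for illustration only; real frameworks generalize the same scheme to tensors): each node stores only the local derivatives with respect to its adjacent input nodes, and one backward traversal of the graph yields the gradient of the output with respect to every variable.

```python
import math

class Var:
    """A toy reverse-mode autodiff node (shared subgraphs not handled)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input_node, local_derivative)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        # Accumulate the upstream gradient, then push it to adjacent inputs.
        self.grad += upstream
        for node, local in self.parents:
            node.backward(upstream * local)

def mul(a, b):
    return Var(a.value * b.value, parents=((a, b.value), (b, a.value)))

def add(a, b):
    return Var(a.value + b.value, parents=((a, 1.0), (b, 1.0)))

def sin(a):
    return Var(math.sin(a.value), parents=((a, math.cos(a.value)),))

# y = sin(A * x + b), the same graph as in the computational-graph sketch above
x, A, b = Var(2.0), Var(3.0), Var(1.0)
y = sin(add(mul(A, x), b))
y.backward()

print(x.grad)   # dy/dx = A * cos(A*x + b) = 3 * cos(7)
print(A.grad)   # dy/dA = x * cos(A*x + b) = 2 * cos(7)
```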

3.3 What are extension packages used for?

Common ones include BLAS, cuBLAS, cuDNN, etc.

  • What is already in place: a complete structure/workflow/set of modules

    The ability to transform the data to be processed into tensors, to apply all kinds of operations to the tensors, to arrange the computations and the flow of those operations and tensors, to train the model through automatic differentiation, and then to get the output and start testing, is actually enough to support a deep learning computing framework.

  • What still needs improvement: computational efficiency.

    Most of the implementation described above is based on high-level languages (such as Java, Python, Lua, etc.), and even for the simplest operation a high-level language consumes more CPU cycles than a low-level language such as C++, to say nothing of a deep and complex neural network. Slow execution thus becomes a natural weakness of high-level languages.


There are two solutions to the problem of low computational speeds in high-level languages.

  1. The first way is to emulate a traditional compiler. Just as a traditional compiler compiles a high-level language into platform-specific assembly for efficient execution, this method converts the high-level language code to C and then compiles and executes it on top of C. To make this conversion possible, the implementation of each tensor operation comes with a pre-written C conversion part, which the compiler assembles at compile time. Compilers such as PyCUDA and Cython already do this.
  2. The second method was mentioned above: use a scripting language for frontend modeling and a low-level language such as C++ for the backend. The interaction between the high-level and low-level languages then happens inside the framework, so a change no longer requires modifying the frontend or doing a full recompilation (only the modified parts need to be compiled), and the overall speed is faster.
  3. In addition, because optimizing code in a low-level language is difficult and most of the underlying operations already have openly available, well-optimized implementations, another significant source of acceleration is the use of off-the-shelf extension packages. For example, BLAS (Basic Linear Algebra Subprograms), originally implemented in Fortran, is an excellent basic matrix (tensor) operation library; Intel's MKL (Math Kernel Library) is another. Developers can choose flexibly according to their own preferences (see the timing sketch after this list).

    Generic BLAS libraries are usually optimized only for common CPU scenarios, but most current deep learning models have moved to parallel GPU computing, so more targeted libraries such as NVIDIA's GPU-optimized cuBLAS and cuDNN may be a better choice.
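As a rough illustration of why frameworks lean on such libraries (timings vary by machine; this is only a sketch), compare the same matrix multiplication written as a plain Python loop with NumPy's @ operator, which dispatches to an optimized BLAS (e.g. OpenBLAS or MKL) under the hood:

```python
import time
import numpy as np

n = 200
a = np.random.rand(n, n)
b = np.random.rand(n, n)

def matmul_pure_python(a, b):
    # Naive triple loop written in the high-level language itself.
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                out[i][j] += aik * b[k][j]
    return out

t0 = time.perf_counter(); matmul_pure_python(a.tolist(), b.tolist()); t1 = time.perf_counter()
t2 = time.perf_counter(); a @ b; t3 = time.perf_counter()
print(f"pure Python: {t1 - t0:.3f}s, NumPy/BLAS: {t3 - t2:.5f}s")
```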

Speed is critical for a deep learning framework. For example, training a neural network might take four days without acceleration but only four hours with it. In the fast-moving AI field, especially for young AI startups, this difference can determine who is a pioneer and who is a follower.

3.4 Model acceleration tools for deployment

3.4.1 GPU-level acceleration – Nvidia TensorRT

In many cases, deploying a model on a target platform and putting it into production requires a dedicated model acceleration tool such as TensorRT. TensorRT is responsible only for model inference; it is generally not used to train models.

Why does TensorRT speed up the model?

TensorRT is an acceleration package that NVIDIA built for its own platforms. It does two main things to speed up a model.

  1. TensorRT supports INT8 and FP16 computation. Deep learning networks typically use 32-bit or 16-bit floating-point data during training; TensorRT uses lower precision during inference to speed it up (see the sketch after this list).
  2. TensorRT restructures the network, fusing operations that can be combined, and optimizes for the characteristics of the GPU. Most deep learning frameworks are not specifically optimized for GPUs, so NVIDIA, the manufacturer of the GPUs themselves, naturally released TensorRT as a tool for accelerating them. In an unoptimized deep learning model, a convolution layer, a bias layer, and a ReLU layer, for example, require three separate calls to the corresponding cuDNN APIs, yet the implementations of these three layers can in fact be fused entirely. TensorRT merges such combinable layers.
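To illustrate the precision side of point 1 (a small NumPy-only sketch; TensorRT's real INT8 calibration is considerably more sophisticated): weights trained in FP32 can be stored and computed in FP16 or INT8, shrinking memory traffic and enabling faster math at the cost of a small, bounded rounding error.

```python
import numpy as np

w = np.random.randn(1000).astype(np.float32)   # "trained" FP32 weights

w_fp16 = w.astype(np.float16)                  # half precision

# Symmetric linear quantization to INT8 with a single scale factor
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale  # what inference effectively sees

print("bytes:", w.nbytes, w_fp16.nbytes, w_int8.nbytes)   # 4000 2000 1000
print("max abs error (fp16):", np.abs(w - w_fp16).max())
print("max abs error (int8):", np.abs(w - w_dequant).max())
```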

3.4.2 CPU-level acceleration — Intel OpenVINO toolbox

OpenVINO is a deep learning inference acceleration toolkit developed by Intel, originally focused on convolutional neural networks (CNNs). It lets Intel hardware deliver its maximum deep learning compute performance, so when deploying a deep learning model on the CPU side, the OpenVINO toolkit can be used to improve its inference speed.


In addition, see the Machine Heart (Synced) article "CPU-based deep learning inference deployment optimization practice"; there are also video lectures online that you can search for.

In CPU-side system-level optimization practice, two approaches are mainly used: math library optimization (based on MKL-DNN) and deep learning inference SDK optimization (Intel OpenVINO). Both involve acceleration via SIMD instruction sets.


Referring to material on model quantization for neural network inference acceleration, we can see that many cloud service providers and hardware vendors offer inference optimization infrastructure, such as Amazon's SageMaker and Deep Learning AMIs, and Intel's Deep Learning Boost with the Vector Neural Network Instructions (VNNI), etc.

4. Reference: