Abstract: This article focuses on the specific needs of AI frameworks for IR, what solutions the industry has, and some thoughts of MindSpore.

Share this article from huawei cloud community the column MindSpore technology | AI framework layer IR analysis, the original author: feet girl full month.

IR (Intermediate Representation) is the intermediary of the translation between source code and object code in the process of program compilation. The design of IR is very critical for the compiler. A good IR should consider the completeness of compilation from source code to object code, the ease of compilation optimization and performance. What is the nature of the AI framework? The essence of an AI framework is to translate a user’s model representation into executable code for efficient execution (training and reasoning), where from the user’s model representation (such as deep neural networks) to the final executable code is the behavior of a compiler that also has an IR, Its design is critical to the completeness/flexibility/ease of use/performance of the AI framework.

This article focuses on the specific needs of AI frameworks for IR, what solutions the industry has, and some of MindSpore’s thinking. First of all, I will introduce you to the classification and characteristics of the general compiler IR.

Introduction to IR in the industry

I. IR According to its organizational structure [1], it can be divided into Linear IR, Graphical IR and Hybrid IR

  • Linear IR:

Similar to the pseudocode of some abstract machines, the corresponding algorithm iterates through a simple linear sequence of operations

  • Hybrid IR:

It combines elements of graph IR and linear IR. A common hybrid IR uses the underlying linear IR to represent acyclic code blocks and the graph IR to represent the control flow between these blocks

  • Graphical IR:

The knowledge/information of the compilation process is stored in the graph, and the corresponding algorithm is described by operating on the objects (nodes, edges, lists, and trees) in the graph

An example of linear IR is stack-machine Code, which is a single-address Code that assumes operands exist on a Stack. Most operations take operands from the stack and push their results onto the stack. For example, the expression b-a*3 corresponds to the following stack machine code:

push 3
push a
multiply
push a
substract
Copy the code

LLVM IR is a typical hybrid IR that contains two layers (CFG+BB) :

The top layer is the Control Flow Graph (CFG), which represents the Control Flow between Basic blocks (BB). Each Node of CFG is a basic block. There is an Edge between basic block B1 and B2: B1 -> B2. If the control flow may flow from the last instruction of basic block B1 to the first instruction of basic block B2

At the bottom is the basic block, in which each instruction is presented in the form of SSA (Static Single Assignment) and these instructions form a linear list of instructions

Sea of Nodes IR (by Cliff Click) is a typical graph IR[2]. In this IR, the two-layer structure of BB+SSA instruction in CFG diagram is simplified, BB is removed, and only one layer structure containing instructions is left. By introducing special REGION, IF, and PROJECTION instructions, the whole order instruction in BB block is relaxed into explicit data dependency and control dependency, and the control dependency and data dependency are represented and processed in the same way, thus simplifying IR analysis and transformation. Here is a simple IR example:

In this example, the boxes are the nodes of the graph, representing SSA instructions, and the arrows are the edges of the graph; Solid arrows indicate control dependencies; Hollow arrows represent data dependencies. From this example, you can see that the IR explicitly includes the use-def dependency, and no additional calculations are required.

Based on the explicit use-def information in this IR, two types of optimizations can be easily implemented: Parse time optimization (globalization)

At Parse, only partial optimizations, such as peephole optimizations (e.g., constant folding, identity-function), can be done because the program does not have complete information. By designing appropriate classes and inheritance systems, peephole optimization can be implemented using simple algorithms:

For global optimization, such as Sparse Conditional Constant Propagation (SCCP), it can be easily implemented. Firstly, def-Use Chains are calculated based on the explicit use-def in the figure. Then, SCCPSea of Nodes IR can be easily implemented. This provides a very important idea: express the dependency information explicitly in the figure IR. Second, analyzing IR from the perspective of common programming languages, we can see that IR forms are divided into two different camps: One is the IR compiler for imperative programming languages, and the other is the IR compiler for functional programming languages. The IR compiler for imperative programming languages is based on SSA, which I won’t go into here, but I’m going to focus on IR for functional programming languages. In IR for functional programming languages, CPS, or ANF, is the basic component of a Continuation-passing style (CPS) A function f always has an extra argument in addition to its own argument continuationContinuation is also a function that, when f has finished evaluating its return value, calls a continuation using that return value as an argument to it, instead of returning it. So a CPS function does not formally return; when it does return, it passes all its arguments to a continuation, allowing the continuation to continue execution. Such as:

def foo(x):
return x+1
Copy the code

In CPS form, k is a continuation:

def foo(x,k):
k(x+1)
Copy the code

Intuitively, the advantage of the function not “return” but “continue” CPS is to make the following information explicit: Procedure returns (calling a continuation), intermediate values (with an explicit name), order of evaluation, trailing calls (calling a procedure with the same continuation) such as the following python code that takes the product of all primes less than n.

def prodprimes(n):
    if n == 1:
        return 1
    if isprime(n):
        return n * prodprimes(n - 1)
return prodprimes(n - 1)
Copy the code

When expressed in CPS form:

def prodprimes(n, c):
    def k(b):
        if b == True:
            m = n - 1
            def j(p):
                a = n * p
                c(a)
            prodprimes(m, j)
        else:
            def h(q):
                c(q)
            i = n - 1
            prodprimes(i, h)
    if n == 1:
        c(1)
    else:
        isprime(n, k)
Copy the code

As you can see from the code above, “procedure returns” are replaced by calls to CONTINUations such as C, J, K, and H; Intermediate values A, b, m, and I are given variable names. The CPS form is ideal for compiler analysis and transformations, such as tail-recursion elimination: If function F ends with a call to function G, then the continuation of function G need not be a continuation generated in F, but can be replaced with a continuation passed to f. In the original code for the example above, the “return prodprimes(n-1)” statement is a tail recursion. In CPS form, it is clear that h(q) is defined as c(q), so h is equal to C, and the following transformation can be performed [3] :

def h(q):                         i = n - 1
    c(q)            ->           prodprimes(i, c)
i = n - 1
prodprimes(i, h)
Copy the code

While CPS is very consistent and powerful, one of its big problems is that it is hard to read. Hence the a-norm Form (ANF) 2. The ANF Form directly converts the Direct Style source code [4] without CPS Form

The ANF form divides expressions into two categories: atomic expressions and compound expressions.

An atomic expression represents a constant value, a variable, a primitive, or an anonymous function. A composite expression is composed of multiple atomic expressions and can be regarded as an anonymous function or primitive function call. The first input of the combination is the called function and the remaining inputs are the parameters of the call. A compound expression is either let-bound to a variable or only appears in the last position. As you can see, the ANF form uses let-bound to explicitly express intermediate values and control flow and evaluation order. Its grammar is defined as follows [5]

The < aexp > : : = NUMBER | STRING | VAR | BOOLEAN | PRIMOP | (lambda (VAR)... <exp>) <cexp> ::= (<aexp> <aexp>...) | (if <aexp> <exp> <exp>) <exp> ::= (let ([VAR <cexp>]) <exp>) | <cexp> | <aexp>Copy the code

For example, the prodprimes function above, if expressed using the above syntax, would be:

(define prodprimes
  (lambda (n)
    (let (a (= n 1))
      (if a 1 (let (b isprime(n))
                   (if b (let (m (- n 1))
                           (let (p (prodprimes m))
                             (* n p)))
                         (let (s (- n 1))
                           (prodprimes m))
                    ))))))
Copy the code

This ANF form, if translated into Python, would look something like this:

def prodprimes(n):
    r = n == 1
    if r:
        return 1
    b = isprime(n)
    if b:
        m = n - 1
        p = prodprimes(m)
        return n * p
    s = n - 1
return prodprimes(s)
Copy the code

From this code, you can also see that the ANF form is simpler to understand than the CPS form

The effect of layer IR in AI frame

Now mainstream AI frameworks all have layer IR. A good layer IR is conducive to the compilation, optimization and execution of AI models, which is the basis for AI frameworks to conduct efficient training and reasoning. From the perspective of training, there are three execution modes of AI frameworks in the industry at present: Eager execution mode, Graph execution mode, and Staging(mixed) execution mode. Graph execution mode and Staging mode are all based on layer IR. The Eager execution mode typically takes advantage of features of the host language (now mostly Python) for interpreted execution, using techniques such as overloading and Tape.

Graph execution mode mainly takes the Graph structure of the AI model, and then compiles and optimizes the compilation and execution based on Graph IR. Now there are three ways to get the Graph structure of the AI model: First, programmers can use API mapping (TF1.x, etc.). Second, Tracing Jits (a trend introduced by JAX, now supported by TF2.0/Pytorch, etc.). Tracing jIts refers to running the user’s model script to get a forward execution sequence. And then based on that sequence the advantage is that it’s easy to match the Eagle pattern, The disadvantages of simple implementation are that the transformation of control flow is troublesome, the execution sequence is difficult to implement if it is related to the execution result of the operator, and it is difficult to deal with side effects. Therefore, TF needs to combine AST analysis to solve the transformation of control flow. The third type is AST JIT (Pytorch TorchScript) is based on Python AST for composition. The advantage is that the conversion function can be comprehensive, including control flow, etc. The disadvantage is that the implementation is complex. Many Dynamic features of Python can be implemented in a large workload and Staging mode, which is similar to the Eager mode. Compilation execution acceleration (using Tracing JIT or AST JIT) for partial and molecular graphs is performed using Python modifiers. Graph IR is also used.

From the perspective of reasoning, a large number of compilation optimizations, such as quantization and pruning, are generally carried out on layer IR when THE AI framework generates the final reasoning model. Meanwhile, the final reasoning model format is also directly or indirectly used by layer IRAI. The requirements and challenges of layer IR are compared with other general IR. AI frame layer IR has some special needs and challenges:

**AI’s model deals mainly with tensor data, which is quite different from normal applications, but adding tensor data types is not too difficult for the compiler’s IR.

** Automatic differentiation: ** Differentiability is the biggest difference between AI model development and general application development. Modern AI frameworks all provide automatic differentiation. The challenge lies in the simplicity of implementation, performance, and future extensibility of higher-order differentiation

**JIT capability: ** Either graph mode or Staging mode can be considered JIT mode from an algorithm engineer’s point of view because no compilation steps are shown. Compilation performance is a major challenge for JIT

** Implicit parallelism: ** From the perspective of developers, there are two types of parallelism: one is explicit parallelism, where the developer explicitly tells the system where to parallelize, such as “start multithreading/add”

** Parallel decorator: ** Another way is implicit parallelism, through the compiler to analyze dependencies, automatic parallelism generally speaking, the traditional CFG+BB compiler, because the program analysis using full order analysis, easy to do explicit parallelism; Functional compilers are theoretically easy for data dependency analysis and implicit parallel optimization. Interestingly, Kernel execution accounts for most of the overhead in deep learning scenarios, and asynchronous concurrency at runtime can also significantly improve the overall performance. Implicit parallelism is relatively weakened, but it still plays a role in achieving the ultimate performance

* * Loop optimization: The AI calculations involve a lot of Tensor operations, which is Loop optimization for the compiler, but the challenge is mainly in the IR of the operator layer. Of course, layer IR is also a compiler IR, so it should be generic. This includes basic functions such as type system, control flow and data flow analysis, and side effect elimination

Some genres in the industry on layer IR

** Calculation of graph IR: ** A DAG-centered implementation, many early frameworks are using this scheme to calculate graph IR design is more natural, calculation graph is mainly composed of edges and nodes, nodes are generally used to express operators, variables, constants and so on; Edge corresponds to Tensors and actually expresses a data dependence relationship. At the back of the automatic differential and optimization is carried out based on the DAG the advantage of this scheme is simple and intuitive, optimizing the performance overhead of small shortcomings is the formal calculation chart IR is not really a compiler IR, the type system, the support of the complex logic, such as recursive, side effects of treatment, control flow and data flow analysis support is not complete

**CFG+BB: ** Based on the IR of the traditional compiler to do layer IR, such as TorchScript, Julia, how to achieve automatic differentiation? We take Julia Zygote as an example [6] : For ordinary code (non-PHI, non-branch) in BB block, AD code can be generated in reverse order with the help of the chain rule

After expressing the above expression as SSA, inserting J and calculating AD, pseudo-SSA code can be obtained as shown in the figure below:

The %6 node in the figure above is called an “alpha node”, corresponding to the node %6 in Primal, B3 in the top row, which is the reverse function of “/” operation

For the control flow between CFG’s, reverse analysis of the control flow is required and the appropriate dumb PHI nodes are inserted into the Primal CFG to record and play back the control flow. For example, this code calculates power:

In the corresponding Primal CFG, the %1 phi node is inserted as the dumb PHI node to record the control flow. This %1 is then used for control in AD CFG (the %1 record is pushed into the control flow and then played back out of the stack in AD CFG)

Through subsequent code optimization, the AD Power code is similar to the following pseudocode:

It can be seen that the automatic differentiation of CFG+BB is finally realized by iteration, and the SSA form with Scope needs to solve the problem of boundary transfer, which will still bring some trouble in processing automatic differentiation

** How to do graph optimization? ** is converted to use-def and def-use for optimization

** How to do parallel optimization? ** Since CFG+BB is in full order, it needs to be converted to use-def and analyzed in combination with side effect information

The advantage of using CFG+BB scheme is complete function, scheme maturity, high reuse, but CFG+BB form of automatic differential/graph optimization/parallel optimization, have to carry out certain conversion work, and is not so intuitive and efficient

Functional IR

Using functional IR to make layer IR, typical such as Relay, Myia, etc., how to achieve automatic differentiation? For non-control flows, AD is calculated in the same way as in the BB block above. For control flow, functional IR takes a different approach, converting iteration to recursion and branching through the switch function. For example, the same pow() function above:

def pow(x, n):
    return header_pow(n, 1, x)
def header_pow(phi_n, phi_r, x):
def body_pow():
    phi_n_1 = phi_n - 1
    phi_r_1 = phi_r * x
        return header_pow(phi_n_1, phi_r_1, x)
    def after_pow():
        return phi_r
    f = switch(phi_n > 0, header_pow, after_pow)
    f()
Copy the code

Taking pow(5,3) as an example, the recursive call process is as follows:

pow(5, 3) -> header_pow(3, 1, 5) -> body_pow() -> header_pow(2, 5, 5) -> body_pow() -> header_pow(1, 5*5, 5) – > body_pow – > header_pow (0, 5 * 5 * 5, 5) – > after_pow () (the return 5 * 5 * 5)

It can be seen that the call and return of recursive call here correspond to the phi node loading and unloading operations of CFG+BB control flow respectively

Since the AD process is the process of function transformation, the graph after AD is also the structure of recursive call, so there is no need for phi node loading and unloading operations like CFG+BB control flow, and the recursive call process naturally replaces the process of loading and unloading

Take the derivative of phi with respect to x

def x_grad_pow(x, n): phi_n = n phi_r = 1 return x_bprop_header_pow(phi_n, phi_r, x, 1) def x_bprop_header_pow(phi_n, phi_r, x, sens): def env_x_bprop_body_pow(): %3 = x_bprop_header_pow(phi_n -- 1, phi_r * phi_x, x, 1) %4 = phi_r_bprop_header_pow(phi_n -- 1, phi_r * phi_x, x, 1) %4 = phi_r_bprop_header_pow(phi_n -- 1, phi_r * phi_x, x, 1) %5 = %4 * phi_r return %3 + %5 def env_x_bprop_after_pow(): return 0 f = switch(phi_n > 0, env_x_bprop_body_pow, env_x_bprop_after_pow) r = switch(phi_n > 0, f(), 0) return r def phi_r_bprop_header_pow(phi_n, phi_r, x, sens): def env_phi_r_bprop_body_pow(): %3 = phi_r_bprop_header_pow(phi_n - 1, phi_r * x, x, 1) %4 = %3 * x return %4 def env_phi_r_bprop_after_pow(): return 1 if phi_n > 0: %5 = env_phi_r_bprop_body_pow() else: %5 = env_phi_r_bprop_after_pow() return %5Copy the code

The advantage of functional IR is that it is friendly to automatic differentiation and suitable for parallel analysis. However, the challenge lies in the elimination of side effects of functional IR and the performance of functional IR in the execution state (including recursion is not friendly to execution).

Design considerations for Mindspore

The layer IR of MindSpore is called MindIR, and the technical route chosen by MindIR is Functional Graph IR (referring to Sea of Nodes, Thorin, Myia, etc.), which has the following features:

Functional with a more natural automatic differential implementation and a more convenient implicit parallel analysis capability: Function as a first class citizen, supports higher-order functions, including the flow of control is through the special function, can be done in the form of unified differential function implemented in the form of no side effects, compared with the imperative languages, can simplify the analysis and the optimization of more native support for closures, on the one hand can express users convenient source of closures, said It is also possible to support the automatic differential algorithm that requires access to the intermediate result of the original function in the reverse function: the reverse function accesses the intermediate result and returns it as a closure using data dependent partial order analysis, which facilitates out-of-order or parallel execution

Graph based is more suitable for JIT rapid optimization: it adopts a one-layer presentation similar to Sea of Nodes IR and integrates control flow and data flow, which is more suitable for JIT optimization

ANF form: Similar to Thorin, Graph IR is adopted and Scope is eliminated. However, the CPS form of Thorin IR is not adopted, but the ANF form MindIR, which has similar expression ability and is more intuitive and easier to check, hopes to realize automatic differential and implicit parallel analysis more conveniently through Functional method. Graph Based approach integrates control flow and data flow to support more efficient JIT optimization. [7]MindIR grammar is inherited from ANF, and its definition is as follows:

<ANode> ::= <ValueNode> | <ParameterNode> <ParameterNode> ::= Parameter <ValueNode> ::= Scalar | Named | Tensor | Type | Shape | Primitive | MetaFuncGraph | FuncGraph < CNode > : : = (< AnfNode >...). <AnfNode> ::= <CNode> | <ANode>Copy the code

The ANode in MindIR corresponds to the atomic expression of ANF. The ANode has two subclasses ValueNode and ParameterNodeValueNode, indicating that the constant node can bear a constant value (scalar, symbol, tensor, type, dimension, etc.). It can also be a Primitive function or a MetaFuncGraph or aFuncGraph because in functional programming the function definition itself is a value. ParameterNode is a compound expression of ParameterNode corresponding to ANF in MindIR, indicating that gradient contributions of ParameterNode and CNode will be calculated when MindSpore automatically differentiates. And returns the gradient for the final ParameterNode without calculating the gradient for ValueNode

Here is an example of a program to understand MindIR by comparison

def func(x, y):
 return x / y

@ms_function
def test_f(x, y):
    a = x - 1
    b = a + y
    c = b * func(a, b)
 return c
Copy the code

The ANF expression for this Python code is:

lambda (x, y)
    let a = x - 1 in
    let b = a + y in
    let func = lambda (x, y)
        let ret = x / y in
        ret end in
    let %1 = func(a, b) in
    let c = b * %1 in
    c end
Copy the code

The corresponding MindIR is w.url.cn/s/Ansh1KW

In MindIR, a FuncGraph represents the definition of a common function. The FuncGraph is generally composed of ParameterNode, ValueNode and CNode. It can clearly express the calculation process from parameter to return value. The two test_f and func functions in Python code are converted to two function diagrams, whose arguments X and y are converted to ParameterNode of the function diagram, and each expression is converted to a CNode. The first input of CNode links to the function called, such as add, func, and return. Note that these nodes are ValueNode, as they are understood as constant function values. Other inputs to CNode link to the parameters of this call. Parameter values can come from ParameterNode, ValueNode, and other CNodes.

In ANF, each expression is bound to a variable with a let expression, and the dependency on the output of the expression is indicated by reference to the variable, while in MindIR, each expression is bound to a node, and the dependency is indicated by directed edges between nodes

Functional semantics

An important feature of MindIR compared with traditional computational graphs is that it can not only express data dependencies between operators, but also express rich functional semantics

Higher-order functions

In MindIR, the definition of a function is defined by a subgraph, but it can itself be a value that is passed as input or output to other higher-order functions. For example, in the following simple example, function F is passed to function G as an argument, so function G is a higher-order function that receives input from the function. The real call point of function F is inside function G

@ms_function
def hof(x):
 def f(x):
 return x + 3
 def g(function, x):
 return function(x) * function(x)
    res = g(f, x)
 return res
Copy the code

The corresponding MindIR is w.url.cn/s/A8vb8X3

In practical network training scripts, Partial and HyperMap functions used in the optimizer and GradOperation functions are typical higher-order functions. Higher-order semantics greatly improve the flexibility and control flow of the MindSpore expression

Control flow is expressed in MindIR in the form of higher-order function selection calls. This form transforms the control flow into the data flow of higher order functions, thus making the automatic differential algorithm more powerful. It can support not only automatic differentiation of data flow, but also automatic differentiation of control flow such as conditional jump, loop and recursion. Here is a simple Fibonacci use case to illustrate

@ms_function
def fibonacci(n):
 if(n < 1):
 return 0
 elif(n == 1):
 return 1
 else:
 return fibonacci(n-1) + fibonacci(n-2)
Copy the code

The corresponding MindIR is w.url.cn/s/AUiE9Mc

Those who qualify can go onto those who qualify. Those who qualify can go onto those who qualify. Those who qualify can go onto those who qualify. Fibonacci is the True branch of IF and those who qualify can go onto those who qualify. Those called in those who qualify qualify Fibonacci are the True branch of ELIF and those who qualify qualify Fibonacci are the False branch of ELIF.

The key thing to understand here is that in MindIR conditional jumps and recursion can be described as higher-order control flows. For example, ✓ Fibonacci and Those who qualify Fibonacci are passed in as parameters of the switch operator. The switch selects which function can be returned according to those parameters. Switch makes a binary selection of the input function as a normal value and does not call it, whereas the actual function call is made on the CNode immediately after switch

Free variables and closures

Free variables refer to variables in the scoped environment rather than local variables in a code block

A closure is a programming language feature that refers to the combination of a block of code and a scoped environment

In MindIR, blocks of code are rendered as function diagrams, and the scoped environment can be understood as the context in which the function is called, and free variables are captured by value copy rather than reference.

A typical closure use is as follows:

@ms_function
def func_outer(a, b):
 def func_inner(c):
 return a + b + c
 return func_inner

@ms_function
def ms_closure():
    closure = func_outer(1, 2)
    out1 = closure(1)
    out2 = closure(2)
 return out1, out2
Copy the code

The corresponding MindIR is w.url.cn/s/AsUMXTS

In the example, a and b are free variables because the variables A and B in func_inner are parameters defined in the referenced parent graph func_outer. The variable closure is a closure that combines the function func_inner with its context func_outer(1, 2). So out1 is 4 because it’s the same thing as 1+2+1, and out2 is 5 because it’s the same thing as 1+2+2

reference

[1] “Engineering A Compiler” Second Edition, Chapter 5. Intermediate Representation

[2] “To Combine Analyses, to Combine Optimizations”

[3] COMPILING WITH CONTINUATIONS, Chapter 1

[4] Functional Programming Languages Part V: Functional Intermediate Representations

[5]matt.might.net/articles

[6] Don’t Unroll Adjoint: Differentiating SSA-Form Programs

[7] mindspore.cn/doc/note/z

Click to follow, the first time to learn about Huawei cloud fresh technology ~