Data flow analysis is a widely used method in static code analysis. Due to recent in learning software test, static code scanning frequently encounter data flow analysis, at the same time oneself in the graduation design also implements some simple code optimization algorithm based on code data flow analysis, combined with their own understanding to write this article, a simple chat this artifact in the field of static code analysis, I hope you practive.
What is data flow analysis
Data flow analysis is a code analysis method, which can analyze and summarize the semantic information in the program through algebraic method, so as to infer the Definition and Use information of variables in the dynamic execution of the program.
- In the compiler, through data flow analysis, the compiler can understand the definition and usage relationship between variables, so that the optimization of Constant Propagation can be calculated in advance during compilation, reducing the amount of computation required during program running.
- Data flow analysis can also help the compiler check whether a piece of code is reachable (if it can be executed on the fly), and the compiler can use the analysis to remove some unreachable code
- In software development and testing, data flow analysis, as a static analysis method, analyzes the declaration, assignment and use of variables when the program is not running, and forms reasonable code improvement suggestions
What is a basic block
A basic block is a piece of code in which sequential execution can only enter from a single entry and leave from a single exit without any branches (if, else, while, for…). There is. As shown below
int x = 1;
int sum = 0;
while( x <= 100)
{
sum += x;
x ++;
}
printf ("%d\n", sum);
Copy the code
Its basic blocks can be divided into:
In compilation optimization, we usually use the jump instruction of intermediate code to divide the basic block. The above code can be translated as an intermediate code instruction (simplified version) :
(dec, x, ) (assign, x, 1 ) (dec, sum, ) (assign, sum, 0 ) L0: (jump if not, x <= 100, L1) (+=, sum, x) (+=, x , 1) (jump, L0 , ) L1: [...] (call, printf, )Copy the code
In intermediate code, the next statement of the jump instruction and the target statement of the jump instruction are used as the first instruction of the new base block, and the first instruction of the program is also used as the first instruction of the base block. The code from the first instruction to the next instruction can be divided into a basic block. Therefore, it is divided into:
If you’re interested, you can go back and see how the division of the intermediate code corresponds to the division of the original code, and you can find the basis for the division of the intermediate code.
From basic blocks to control flow diagrams
A Control Flow Graph (CFG) is a Graph that describes the path of execution during the execution of a program. In a data flow diagram, each node is a base block, and the edges of the diagram represent jump relationships between code blocks. For example, if-else, while and for statements will jump to different branches and execute the code in the branch according to the judgment result after the judgment is completed. The CFG side can describe this relationship. The code analyzer can determine the path of code execution and the direction of variable data according to CFG. An ENTRY node and an EXIT node are also added to compile analysis. From the code in the previous section, we can construct the following data flow diagram:
Constant value?? Use??
Before we get into the specifics of data flow analysis, two very, very important concepts need to be clarified
- Definition, def: Generally refers to the assignment of a variable, including variable declaration and initialization, copy, formal parameter list, etc
- Use: Use the value of a variable to perform operations, including expressions, logical criteria, passing variables as arguments, returning values, etc
In our analysis, we generally summarize the variables that have been evaluated and used into two sets. Such as the following code snippet:
int a = 20;
int b = a + 10;
Copy the code
The fixed value set and the use set are respectively:
We defined and assigned two integer variables a and B, so a and B need to be added to the set of fixed values. At the same time, when we assign b, we use the value of A, and we also need to add a to the set of uses.
Variables “generated” and “killed”
Data flow analysis needs to analyze the changes of variables during execution, and it needs to analyze the fixed values and usage in the base block. In CFG, each base block may generate a new variable or overwrite the value defined by the precursor node on the execution path, known as a fixed value operation. We can break down the fixed value operations into two categories:
- Genenrate (gen) – The generation of fixed values, that is, the definition and initialization of variables
- Kill – The precursor node on the path and the given value are overwritten in this statement, that is, the value generated by the precursor node is “killed”
In general, we can define the fixed set generated in this line or the basic block code as the generation set gen[Bi]gen[B{I}]gen[Bi], and the “killed” fixed set as the killed set kill[Bi]kill[B{I}]kill[Bi].
For example, going back to the 1-100 summation code we used earlier, the code for Block A and Block C:
In this code, we can see that Block A generates A new fixed value x and A fixed value sum, so
In Block C, sum and x rewrite the values generated in Block A, respectively, so:
Methods of data flow analysis
In the control flow diagram, each base block receives the precursors’ values from other paths and, after processing, issues the latest values to the path’s successors. We can stipulate:
- IN[Bi]IN[B_{I}]IN[Bi] is a set of fixed values emitted by BiB_iBi, the precursor node on the path, to the current node
- OUT[Bj]OUT[B_{j}]OUT[Bj] is the latest set of values that the current node sends to its successor node BjB_jBj on the path
Since a node in CFG may arrive from multiple nodes and may travel from the current node to different nodes (as shown below), we need to specify the merging mode of fixed values from different nodes. We can define the convergence function for constant value merging as:
In the analysis of data flow, the summarized fixed value may change after passing the node. We define the operation of each node on the fixed value as the transfer function FS (x) F_S (x) FS (x). The propagation direction of transfer function is related to the direction of data flow analysis. If the data flow adopts forward data analysis, the information of the data flow will be transmitted forward along the execution path, and the information on the node comes from the precursor node on the path. In this case, the function of the transfer function is as follows:
If the data flow is analyzed backward, the data flow is propagated backward along the execution path, with information on nodes coming from subsequent nodes along the path. At this point, the transfer function is:
In addition, for special nodes ENTRY and EXIT, we have
Reaches the fixed value
This section is an example of how we use data flow analysis to calculate arrival values. In a data stream, a value is called an Reaching Definition if it can travel a certain path from a generated node to a node without being killed on the way. Let’s go back to the summing procedure from the beginning of this article:
We first define the direction of our analysis, and since we want to analyze from definition generation to a node, the direction of analysis should be the forward analysis method. So we can define the turn function as:
Think about it. If we add a fixed value to the base block, it should appear in the OUTOUTOUT, and if we kill the definition in the base block, we should remove it in the OUTOUTOUT. So the intersection function should be:
We can calculate (the program computes using iterative methods) :
Kill | Generate |
---|---|
Kill[A] = ; |
Gen[A] = |
Kill[B] = ; |
Gen[B] = |
Kill[C] = ; |
Gen[C] = |
Kill[D] = ; |
Gen[D] = |
Note that Block B can be reached from Block A and Block C, so the OUT sets of both need to be merged. So there are
Block | IN | OUT |
---|---|---|
ENTRY | ||
A | ||
B | ||
C | ||
D | ||
EXIT |
From the table we know that sum_{c} of Block D comes from Block C, in other words, xCX_ {c}xc is the arrival constant of D node.
subsequent
Data flow analysis can be used in many optimization scenarios for compilers, or for static code checking, static security vulnerability checking, and so on for computers. In the future, I will write some related articles to discuss (first set a flag, haha).