Ali Cloud DataWorks team – Qin Qi

What is code defect detection?

When doing a code review (CR), you have probably wondered: why do I have to do this by hand every time? Could a machine do it instead? After some investigation, I concluded that machines cannot yet fully replace manual review, but they can take over some auxiliary work and make manual CR more efficient. For example, algorithms or rules can automatically uncover hidden defects in code, which is the domain of code defect detection. Defect detection takes two forms: one only predicts whether a defect exists; the other identifies the defect and locates it at a specific position in the code. Both are useful to developers, since finding problems early saves considerable effort and cost. There is also research on defect repair, which generates patches for detected defects so that they can be fixed automatically.

Development of code defect detection

As software testing technology has developed, defect detection has been studied extensively, since effective defect detection makes software testing more efficient. Its development can be divided into the following stages:

  • At first, people summarized obvious characteristics from the history of defects, such as the number of times a file had been modified and its code complexity. If a code file with similar characteristics appeared later in development, it was treated as more likely to contain defects and given extra attention.
  • Later, the fixed patterns or rules behind common defects were summarized, giving rise to today's common defect detection methods based on static rules, which detect specified defect types. This capability has even been integrated into the development stage, where real-time detection surfaces problems as early as possible.
  • With the rapid development of deep learning in recent years, many people have applied it to defect detection. Models trained on the meta-information or code of historical defects can predict whether code files are defective and identify specified defects.


Representative products and technical analysis

Here are a few representative products and technologies in the field of code defect detection.

Infer and Getafix (Facebook)

These are Facebook's defect detection and automatic repair tools.

Infer is a static code analysis tool that can discover defects in code before release. Infer is open source on GitHub, where it has about 12K stars. It supports Java, C/C++, Objective-C, and other languages. For Java code, it can find null-pointer exceptions, resource leaks, and many other defects. Infer's workflow has two phases: capture and analysis. During the capture phase, Infer translates the code into an internal intermediate language. During the analysis phase, Infer analyzes each function and method individually. When it finds a bug in a method, it stops analyzing that method without affecting the analysis of other methods. You can fix the reported bugs and rerun the analysis until no bugs are detected. Refer to the following figure:


Run "infer run -- javac Hello.java" to see the output:


You can try it out briefly here.

Getafix is a bug-fixing tool that recommends trusted fixes for defects discovered by Infer, and it has been deployed for Facebook's apps. Getafix applies a new hierarchical clustering algorithm to thousands of code changes that developers made in the past, examining both each change and the context around it to provide accurate fix suggestions.

ClusterFuzz (Google)

Google's ClusterFuzz uses fuzzing to detect software security and stability problems. Fuzzing is a black-box testing technique that automatically generates and executes a large number of random test cases to discover unknown vulnerabilities in a product or protocol. As of February 2021, ClusterFuzz had found nearly 29,000 bugs in Google products and more than 26,000 bugs in open source projects. For its detailed structure, refer to the following figure:


SonarQube

SonarQube is a real-time code review tool based on static analysis rules that is easy to integrate locally. It provides thousands of rules and supports 27 programming languages, including JS and TS. It assesses code quality along seven dimensions: complexity distribution, duplicate code, unit test statistics, coding rule checks, comment rate, potential bugs, and structure and design. Bugs are detected by tools such as FindBugs, PMD, and Checkstyle. Besides the server version, SonarQube also offers lightweight IDE plug-ins, such as the SonarLint plug-in for VS Code, which supports coding style checks and defect detection in multiple languages.

PRECFIX (Alibaba)

PRECFIX (Patch Recommendation by Empirically Clustering) is a code defect detection and repair tool proposed by Alibaba engineers. It mines code commit data for "defect repair pairs", i.e. the code fragments before and after a defect is repaired. Similar pairs are then clustered and abstracted into templates. Finally, the source code is scanned and matched against the templates to recommend the corresponding repairs.

After a preliminary trial, a simple summary and comparison can be made:

| | SonarQube | Infer |
|---|---|---|
| Supported languages | Java, C, JS, Python, and others (27 languages) | Android/Java, C, C++, Objective-C |
| Usage | Local server or IDE plug-in | Command line |
| CI (continuous integration) support | Yes | Yes |
| Effect | IDE plug-ins support a variety of rules, work much like ESLint, and detect common bugs, coding standard violations, and more | Finds null pointer exceptions, unclosed I/O streams, and so on |

Common technical approaches to defect detection

Based on the literature, some common defect detection approaches are summarized below:

Machine learning: training a defect prediction model on code change information

First, we have to mention the SZZ algorithm, which is widely used to identify defect-introducing changes and has driven the development of defect detection technology. It works on the change records of a version control system (such as Git). The main process is as follows:

  1. Identify defect-fixing changes: among all code changes, find those whose records contain a defect ID.
  2. Identify the defective code: use the diff algorithm of the version control system to determine the lines of code modified to fix the defect (i.e., the defective code).
  3. Identify candidate defect-introducing changes: trace the change history; the change that first committed the defective code may be the one that introduced the defect.
  4. Eliminate noise: remove noise from the candidate defect-introducing changes (i.e., unrelated commits, such as those made after the defect was discovered).
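
The four steps above can be sketched on a toy, in-memory commit history. This is only an illustration under stated assumptions: the Commit class, the BUG-<number> ID format, the reported_at mapping, and the line-text "blame" table are all hypothetical stand-ins for a real version control system, where one would use the actual diff and blame facilities instead.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Commit:
    sha: str
    time: int                                     # any increasing timestamp
    message: str
    added: list = field(default_factory=list)     # line texts this commit added
    removed: list = field(default_factory=list)   # line texts this commit deleted

BUG_ID = re.compile(r"BUG-\d+", re.IGNORECASE)    # assumed defect ID format

def szz(history, reported_at):
    """history: commits oldest first; reported_at: defect ID -> report time.
    Returns {fix commit sha: [suspected defect-introducing shas]}."""
    blame = {}         # line text -> commit that last added it (toy 'blame')
    introducers = {}
    for commit in history:
        match = BUG_ID.search(commit.message)
        if match:                                  # step 1: a defect-fixing change
            # steps 2-3: lines removed by the fix are treated as the defective
            # code; the commits that last added them are the candidates
            suspects = [blame[l] for l in commit.removed if l in blame]
            # step 4: drop candidates committed after the defect was reported
            cutoff = reported_at.get(match.group(0).upper(), commit.time)
            introducers[commit.sha] = sorted(
                {c.sha for c in suspects if c.time <= cutoff})
        for line in commit.removed:                # keep the blame table current
            blame.pop(line, None)
        for line in commit.added:
            blame[line] = commit
    return introducers

history = [
    Commit("c1", 1, "add ratio helper", added=["x = a / b", "return x"]),
    Commit("c2", 2, "add logging", added=["log(x)"]),
    Commit("c3", 5, "tweak logging", added=["log.debug(x)"], removed=["log(x)"]),
    Commit("c4", 6, "fix BUG-7: guard division",
           added=["x = a / b if b else 0"],
           removed=["x = a / b", "log.debug(x)"]),
]
print(szz(history, {"BUG-7": 4}))
```

In this toy history, c3 touched a line that the fix also removed, but it was committed after the defect was reported, so the noise filter discards it and only c1 remains as a suspect.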

The SZZ algorithm provides a way to identify defect-introducing changes. Later studies found that the data it produces is very noisy, so several improved variants were proposed to address this problem:

  • Annotation Graph SZZ (AG-SZZ): ignores blank lines and code style changes, and uses the annotation graph (a structure that tracks how code changes evolve) to trace the commit history of code changes.
  • Meta-change Aware SZZ (MA-SZZ): ignores meta-changes, i.e. changes such as branch creation, merging, and file attribute modification.
  • Refactoring Aware SZZ (RA-SZZ): ignores refactoring changes, because refactoring does not change the external behavior of the software.

Although SZZ is noisy, most code defect detection schemes still use it to label data.

Once SZZ has identified the defect-introducing changes, features along different dimensions can be extracted to represent them, and a defect prediction model can then be built with machine learning. Commonly extracted features include:

  • Features based on change metadata, such as the developer, commit time, change log, and number of modified lines
  • Features based on the changed code content, such as code complexity, word frequencies in the changed code, logs, and file names, or differences in the number of nodes of each type in the abstract syntax tree (AST) of the code file before and after the change
  • Features based on the software evolution process, quantified from the project's code change history, such as the number of times the relevant files have been modified and the number of developers who have modified them
  • Features from the software project management system, which adds further dimensions such as CR information and defect information
  • Features of the defective code itself, typically the source code or its abstract syntax tree
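
One of the content-based features above, the per-type difference in AST node counts before and after a change, can be sketched as follows. This is a minimal sketch that uses Python's standard ast module as a stand-in parser; the function names are hypothetical, and a real pipeline would use a parser for the project's own language.

```python
import ast
from collections import Counter

def node_type_counts(source):
    """Count AST nodes by type name for one version of a file."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def ast_diff_features(before, after):
    """Return {node type: count after - count before} for types whose count changed."""
    old, new = node_type_counts(before), node_type_counts(after)
    return {name: new[name] - old[name]
            for name in sorted(set(old) | set(new))
            if new[name] != old[name]}

before = "def ratio(a, b):\n    return a / b\n"
after = ("def ratio(a, b):\n"
         "    if b == 0:\n"
         "        return 0\n"
         "    return a / b\n")
print(ast_diff_features(before, after))
```

The resulting dictionary can be turned into a fixed-length feature vector by choosing a fixed list of node types and filling in zero for the types that did not change.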

Commonly used models are either supervised or unsupervised. They differ in whether a labeled data set is available (that is, whether each code change is known to be defective or not). Supervised models build a classification or regression model from the labeled code; a Support Vector Machine (SVM) classifier is a common choice. Unsupervised models do not need labeled data; they extract features from different aspects of the code changes and represent them as feature vectors. Commonly used models include LSTM and Bi-LSTM.

Common evaluation metrics for this approach are:

  • Accuracy, precision, recall, F1-measure, AUC, and other metrics commonly used in machine learning
  • Effort-aware metrics: the number or proportion of defects found within a given amount of code (i.e., review effort) when developers review code in the order suggested by the prediction model
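
As a rough illustration of these two kinds of metrics, the sketch below computes precision, recall, and F1 from binary labels, plus one possible effort-based measure: the share of defective changes found when reviewers inspect changes in descending order of predicted risk until a fraction of the total lines of code is used up. The function names, the 20% default budget, and the tuple layout are assumptions made for this example.

```python
def precision_recall_f1(actual, predicted):
    """actual / predicted: sequences of 0/1 labels (1 = defective)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def recall_at_effort(changes, budget=0.2):
    """changes: list of (risk_score, lines_of_code, is_defective).
    Inspect the riskiest changes first until `budget` of all lines is spent,
    then report the fraction of defective changes that were inspected."""
    total_loc = sum(loc for _, loc, _ in changes)
    total_defects = sum(1 for _, _, bad in changes if bad)
    inspected = found = 0
    for _, loc, bad in sorted(changes, key=lambda c: c[0], reverse=True):
        if inspected + loc > budget * total_loc:
            break                      # the next change would exceed the budget
        inspected += loc
        found += 1 if bad else 0
    return found / total_defects if total_defects else 0.0

print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))
print(recall_at_effort([(0.9, 10, True), (0.8, 10, False),
                        (0.7, 30, True), (0.1, 50, True)]))
```

A model can score well on precision and recall and still rank large, expensive-to-review changes first, which is why the effort-based view is reported separately.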

To sum up, the general idea of the machine learning approach is shown in the following figure:


Code similarity analysis

This method borrows the idea of vectorization from Natural Language Processing (NLP) and combines it with data mining: code with known defects is embedded into vectors, the target code to be checked is vectorized in the same way, and its similarity to the defective code is then judged by the distance between the vectors.
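
A minimal sketch of this idea follows. It assumes a simple bag-of-tokens vector in place of a learned embedding (real systems typically train an embedding model), and uses cosine similarity as the distance measure; the function names and the sample snippets are hypothetical.

```python
import math
import re
from collections import Counter

def embed(code):
    """Toy 'embedding': token counts (identifiers, or single non-space characters)."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\S", code))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def most_similar_defect(target, known_defects):
    """Return (similarity, label) of the known defect closest to `target`."""
    vec = embed(target)
    return max((cosine(vec, embed(snippet)), label)
               for label, snippet in known_defects.items())

known_defects = {
    "unclosed stream": "f = open(path)\ndata = f.read()",
    "division by zero": "ratio = total / count",
}
print(most_similar_defect("fh = open(name)\ntext = fh.read()", known_defects))
```

In practice a similarity threshold decides whether the closest match is reported at all; the threshold has to be tuned on real data.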

Defect localization based on program spectra

A program spectrum is the coverage information about program statements collected while the program runs, together with whether each run passed. The concept was later applied to program code analysis. For a given statement, the more passing test cases execute it, the lower its suspiciousness (doubt rate), i.e. the less likely it is to be defective, and vice versa. In real systems, however, most test cases pass, and this imbalance skews the suspiciousness scores, so the contribution of passing test cases usually has to be adjusted to get better results.
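
One well-known way to compute a per-statement suspiciousness score is the Tarantula formula, which divides the pass and fail counts of each statement by the total number of passing and failing tests, so that a large number of passing tests does not drown out the failing ones. The sketch below assumes coverage is given as a mapping from test name to the set of executed statement ids; that data layout is an assumption for this example.

```python
def tarantula(coverage, outcomes):
    """coverage: test name -> set of executed statement ids.
    outcomes: test name -> True if the test passed.
    Returns statement id -> suspiciousness in [0, 1]."""
    total_pass = sum(1 for ok in outcomes.values() if ok)
    total_fail = len(outcomes) - total_pass
    scores = {}
    for stmt in set().union(*coverage.values()):
        passed = sum(1 for t, cov in coverage.items()
                     if stmt in cov and outcomes[t])
        failed = sum(1 for t, cov in coverage.items()
                     if stmt in cov and not outcomes[t])
        # normalise by the totals so the many passing tests do not dominate
        fail_ratio = failed / total_fail if total_fail else 0.0
        pass_ratio = passed / total_pass if total_pass else 0.0
        denom = fail_ratio + pass_ratio
        scores[stmt] = fail_ratio / denom if denom else 0.0
    return scores

coverage = {"t1": {1, 2}, "t2": {1, 3}, "t3": {1, 2, 3}, "t4": {1, 3, 4}}
outcomes = {"t1": True, "t2": True, "t3": True, "t4": False}
ranking = sorted(tarantula(coverage, outcomes).items(),
                 key=lambda item: item[1], reverse=True)
print(ranking)
```

Statements are then reviewed from the top of the ranking down; other formulas such as Ochiai plug into the same structure by changing only the scoring expression.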

Defect repair techniques

Here are some other techniques for defect repair:

Generate-and-validate defect repair techniques

Generate-and-validate repair has two main steps. The first step generates a series of candidate patches through search; the second runs the test cases against them. If all test cases pass, the repair is considered successful. Otherwise, search and validation continue until a patch succeeds or a timeout occurs. Common generate-and-validate methods include GenProg and RSRepair.
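
The generate-and-validate loop can be illustrated with a deliberately tiny example. It is not how GenProg or RSRepair work internally (they use genetic programming and random search over real programs); here the "program" is a single arithmetic expression, the only mutation is swapping one operator, candidates are enumerated in a fixed order, and an attempt limit stands in for the timeout. All names are hypothetical.

```python
import itertools

OPERATORS = ["+", "-", "*", "//"]

def candidate_patches(source):
    """Generate step: yield variants of `source` with one operator replaced.
    `source` is a space-separated expression such as 'a - b'."""
    tokens = source.split()
    for i, tok in enumerate(tokens):
        if tok in OPERATORS:
            for op in OPERATORS:
                if op != tok:
                    yield " ".join(tokens[:i] + [op] + tokens[i + 1:])

def passes_all(source, tests):
    """Validate step: the patch must give the expected value on every test."""
    for args, expected in tests:
        try:
            if eval(source, {"__builtins__": {}}, dict(args)) != expected:
                return False
        except Exception:      # e.g. division by zero counts as a failed test
            return False
    return True

def repair(source, tests, max_attempts=100):
    """Try candidates until one passes all tests or the attempt budget runs out."""
    for patch in itertools.islice(candidate_patches(source), max_attempts):
        if passes_all(patch, tests):
            return patch
    return None

tests = [({"a": 1, "b": 2}, 3), ({"a": 2, "b": 2}, 4)]
print(repair("a - b", tests))
```

Note that with only the second test case, both "a + b" and "a * b" would pass; a patch that passes the tests is not necessarily the correct fix.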

Semantics-driven repair

Semantics-driven methods express the repair problem formally and obtain the final patch by solving it. Classic algorithms include SemFix, DirectFix, Angelix, NOPOL, and DynaMoth.

Current technical challenges

Many code defect detection methods have been introduced above, but they are hard to apply in actual production. In practice, several difficulties directly affect the final detection results.

  1. Lack of high-quality data sets. Training on labeled data is relatively straightforward, but code data with label information is hard to collect in production, and automatic labeling has low accuracy. Labeling also depends on the information submitted by developers and on the defect records in the software management system. If that data is inaccurate, the training result suffers as well.
  2. The business is complex and there are many types of defects. In production, business types are complex and diverse, and so are the corresponding defect types, so the training set must be rich enough. Otherwise the model generalizes poorly and may not work well on the code of another business.
  3. Lack of a unified evaluation method. For example, the test-based generate-and-validate methods in defect repair decide whether a repair succeeded by whether all test cases pass. In practice, however, passing the test cases does not mean the software is free of hidden defects, which introduces large errors into the evaluation of patches.

Future

In code defect detection, methods based on static rule scanning are the most widely used on the market. With the development of machine learning, however, a great deal of research based on machine learning algorithms has appeared. It is foreseeable that the future will belong to intelligent algorithms, which will take part in the whole software development life cycle and maximize productivity. Having seen GPT-3's stunning results in code generation, I am sure that day is not far off.

The above is only my personal opinion, and I am still learning the finer details. If you have questions, criticism and discussion are welcome.



