Abstract: This article introduces SmartCommit, a tool that assists with code commits. It automatically generates a grouped commit plan using a decomposition algorithm for tangled (mixed) changes, accepts developer feedback and interactive adjustments, and progressively guides developers toward atomic commits that follow best practices.

This article is from the Huawei Cloud community post “With SmartCommit, you don’t have to worry about compound commit anymore”, by the agile xiaozhi.

Committing code is one of the most frequent daily activities in collaborative software development, and best practice requires each commit to be atomic. However, several studies have found that composite commits are common in real open source and industrial projects: developers often commit all the code changes made over a period of time at once, even when those changes carry multiple unrelated intentions or belong to multiple development and maintenance tasks. This article introduces SmartCommit [1], a tool that assists with code commits. It automatically generates a grouped commit plan using a decomposition algorithm for tangled (mixed) changes, accepts developer feedback and interactive adjustments, and progressively guides developers toward atomic commits that follow best practices.

Atomicity of code submission

Committing code changes is a basic function of version control systems such as Git, and it is also one of the most frequent daily operations for developers. In software development and maintenance involving a team, individual developers change code for different purposes, and Git records these changes in units of commits. The commit is the basis for the other functions of a version control system. The chronological sequence of commits forms the commit history of the repository, which records the content, time, description, and author of each change. A clear commit history therefore matters for activities such as code review, team collaboration, branch management, continuous integration, fault localization, and repair.

Using commits properly to document, organize, and manage individual contributions is the foundation of effective collaborative software development and maintenance. Task-oriented atomic commits are a best practice advocated in the official Git documentation, and they are an explicit requirement in open source communities (Angular, Vue, etc.) and well-known software companies (Google, Microsoft, etc.).

Commit atomicity means that the code changes within each commit should be highly cohesive and self-contained, focused on a single development or maintenance task (such as adding a feature, fixing a bug, or refactoring). For software developers working in a team, following this practice makes it easier for others to understand and review changes, to locate and revert the commit that introduced a problem, and to select and reuse historical changes. For mining software repositories (MSR) researchers, task-oriented commits reduce noise in the data and thus provide a clearer evolution history.

Description of code commit atoms in Git documentation [2]

Requirements for change submitters in Google engineering practice [3]

Specification for submission information in Angular projects [4]

Atomic and composite commits

In collaborative software development, atomic commits are important for effective team coordination. However, multiple studies have found that composite commits are prevalent in open source and industrial projects, accounting for 10 to 40 percent of commits.

In contrast to an atomic commit, a composite commit is one in which a developer commits all the code changes made over a period of time at once, even if they mix changes for multiple development or maintenance tasks. The reasons for composite commits fall into three main categories:

1. In day-to-day development developers are often involved in multiple tasks simultaneously, consciously or unconsciously, such as code structure refactoring (i.e.

2. While some teams or open source communities provide code contribution specifications for developers, there are few clear rules or guidelines for submission styles.

3. Although some version control systems and tools can select and organize fine-grained code changes (Git interactive staging, GitKraken/Fork, etc.), they depend entirely on the developer to trigger them and select changes manually, which carries a high cognitive and usage cost.

A typical atomic commit (left) versus a composite commit (right)

Research on code change decomposition

To address the composite commit problem, the research community has proposed a variety of methods for decomposing tangled changes, such as methods based on heuristic rules [5], on program slicing [6], on data flow and control flow dependence [7], and on pattern matching [8]. However, these methods share the following limitations:

  • Most are post-hoc: they detect or decompose composite commits at some later stage (such as code review, commit picking, or history slicing), after the commits have already been recorded in the version history.

  • The decomposition results of prior art are often too fine-grained to correspond to developer intentions and tasks, and therefore not applicable to the code submission phase.

  • Prior art does not take into account the developer’s background knowledge in code change decomposition algorithms. Considering that atomic code submission is a best practice that is implemented in relation to the project context, team requirements, developer habits, etc., the code change decomposition process needs to incorporate the developer’s knowledge of the problem context, domain, and project, allowing for some flexibility and subjectivity.

Summary

Task-oriented, highly cohesive, self-contained atomic commits are widely accepted as a best practice, but because of parallel multi-tasking, the lack of unified specifications, and insufficient tool support, composite commits that violate this practice remain common. Although this problem has been studied extensively, the limitations of existing methods make them difficult to apply in the actual commit workflow.

SmartCommit: an interactive code change decomposition tool

To overcome the limitations of existing approaches, this article introduces SmartCommit, an interactive, graph-based code change decomposition approach. Its goal is to progressively guide and assist developers in making atomic commits for development and maintenance tasks during the development phase, thereby eliminating composite commits at their root.

First, SmartCommit recasts the decomposition of tangled changes as an incremental optimization problem:

Given a changeset C = {c1, c2, c3, …} as input, the goal of the change decomposition problem is to divide C into a list of non-empty sets G = [g1, g2, …, gn]. Each set is called a change group and corresponds to one development or maintenance task.

As shown in the figure below, committing all fine-grained changes as a single commit is the initial state, and committing each fine-grained change as its own commit is the extreme state. The purpose of change decomposition is to find an acceptable state between the initial state and the extreme state that satisfies commit atomicity.
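The problem statement above can be made concrete with a small Python sketch. It is only an illustration: the function names (`is_valid_decomposition`, `initial_state`, `extreme_state`) are hypothetical and not part of SmartCommit, and it assumes that change groups must not share changes, which the article states later when it calls the subsets mutually exclusive.

```python
from typing import Hashable, List, Set


def is_valid_decomposition(changeset: Set[Hashable],
                           groups: List[Set[Hashable]]) -> bool:
    """Check that groups are non-empty, pairwise disjoint, and cover the changeset."""
    seen: Set[Hashable] = set()
    for group in groups:
        if not group:          # every change group must be non-empty
            return False
        if seen & group:       # a change may belong to only one group
            return False
        seen |= group
    return seen == changeset   # every change must be assigned to some group


def initial_state(changeset: Set[Hashable]) -> List[Set[Hashable]]:
    """All changes in one commit (the composite commit)."""
    return [set(changeset)]


def extreme_state(changeset: Set[Hashable]) -> List[Set[Hashable]]:
    """One commit per fine-grained change."""
    return [{change} for change in sorted(changeset)]


changes = {"c1", "c2", "c3"}
print(initial_state(changes))   # one group holding all three changes
print(extreme_state(changes))   # three singleton groups
```

Any grouping accepted by `is_valid_decomposition` lies somewhere between these two states; which one is acceptable depends on the tasks the developer had in mind.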

However, because real situations are complex and committing is flexible, change decomposition is difficult to complete fully automatically. SmartCommit therefore treats change decomposition as an incremental optimization process driven by human-computer interaction: the algorithm provides an Initial Suggestion, and two interactive mechanisms, Coarse Control and Fine Tuning, help developers quickly arrive at the change grouping they want to commit. The algorithm draws on program analysis and fine-grained quantitative evaluation, so the decomposition it generates gives developers a good starting point; the interaction mechanism uses the developer's feedback and adjustments to steer the generated decomposition as close as possible to the desired state. Committing code is itself a human-computer interaction process (developers use commands to select the changes to commit, describe them, and associate them with issues), so SmartCommit's decomposition mechanism is a natural way of combining human and machine: it draws on both the computational strengths of the algorithm and the knowledge of the developer to decompose tangled changes by task.

Interactive decomposition of tangled changesets

The ideas above are implemented as the SmartCommit algorithm. The following figure shows its workflow, which consists of four stages:

1. Changeset preprocessing

Given a Git workspace or a composite commit, the code changes in it are abstracted and represented as diff hunks (code change blocks); the collection of all diff hunks forms the changeset.

2. Graph construction

Based on a code change graph, a Diff Hunk Graph is constructed by aggregating the nodes involved in changes at the granularity of diff hunks: each diff hunk is a vertex, and the explicit and implicit relationships between diff hunks are edges.

3. Interactive decomposition of changes

Using an edge-centered graph partitioning algorithm, the vertex set of the diff hunk graph is divided into several independent subsets. Each subset is converted into a change group, and the groups are presented to the user as a proposed decomposition for interactive adjustment.

4. Change group submission

When the decomposition is ready to commit, the developer selects the change groups to commit, provides a message describing the code changes in each group, and submits multiple change groups to version control with one click.

Overview of the SmartCommit algorithm workflow

Due to limited space, the rest of this article focuses on the three core parts of the SmartCommit algorithm:

1. Data structure: A graph structure used to model the relationship between code changes distributed at different points in the project

2. Decomposition algorithm: a relation-centered algorithm, based on graph partitioning, that generates a decomposition of the code changes

3. Interaction mechanism: a mechanism that combines the analysis ability of the algorithm with the background knowledge of the developer

Data structure

To model and manage fine-grained relationships between code changes, we adopted a graph as the data structure and designed the “Change Relation Graph”. The vertex set of the graph consists of diff hunks (change blocks), each of which corresponds to one independent code edit in a workspace or commit. The edge set consists of relations between diff hunks, and each edge describes, along one dimension, the relation between the two connected diff hunks and its strength.

For the vertex set, SmartCommit analyzes the input workspace or composite commit and extracts diff hunk information from the text-based Git diff results and the abstract syntax tree (AST) of the code. The necessary information includes:

1. index: has the form file_index:hunk_index and uniquely locates a diff hunk in the source code. file_index is the index of the source file (parent file) to which the current change belongs within the changeset, numbered from 0. hunk_index is the index of the current change among all changes in the parent file, sorted by starting line number and numbered from 0.

2. change_type: the type of change action, such as add, delete, or modify.

3. base_hunk/current_hunk: the code blocks corresponding to the previous (base) and current versions of the diff hunk, including the file type, file path, line number range, code snippet, AST subtree, and so on.
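The node information listed above can be illustrated with a minimal Python sketch. It is not SmartCommit's implementation (which is written in Java and also uses the AST): the `DiffHunk` class and `parse_hunks` function are hypothetical names, only the line ranges are captured, and the change type is inferred from hunk headers under the assumption that the diff was produced without context lines (for example with `git diff -U0`).

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Unified diff hunk header, e.g. "@@ -10,2 +12,0 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass(frozen=True)
class DiffHunk:
    file_index: int                 # index of the parent file in the changeset, from 0
    hunk_index: int                 # index of the hunk within its file, from 0
    change_type: str                # "add", "delete", or "modify"
    base_range: Tuple[int, int]     # (start line, line count) in the base version
    current_range: Tuple[int, int]  # (start line, line count) in the current version

    @property
    def index(self) -> str:
        return f"{self.file_index}:{self.hunk_index}"


def parse_hunks(file_index: int, unified_diff: str) -> List[DiffHunk]:
    """Collect one DiffHunk per hunk header found in a single-file unified diff."""
    hunks: List[DiffHunk] = []
    for line in unified_diff.splitlines():
        match = HUNK_HEADER.match(line)
        if match is None:
            continue
        base_start, cur_start = int(match.group(1)), int(match.group(3))
        base_count = int(match.group(2)) if match.group(2) is not None else 1
        cur_count = int(match.group(4)) if match.group(4) is not None else 1
        if base_count == 0:
            change_type = "add"       # nothing removed from the base version
        elif cur_count == 0:
            change_type = "delete"    # nothing added in the current version
        else:
            change_type = "modify"
        # Hunks appear in a unified diff in order of starting line number.
        hunks.append(DiffHunk(file_index, len(hunks), change_type,
                              (base_start, base_count), (cur_start, cur_count)))
    return hunks
```

Calling `parse_hunks(0, diff_text)` on the diff of the first changed file yields hunks indexed `0:0`, `0:1`, and so on, matching the index scheme described above.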

For the edge set, SmartCommit integrates indicators that related studies have shown to correlate with the coupling between changes, and it evaluates the relation between diff hunks and its strength along multiple dimensions:

1. Structural Correlation

Refers to the direct or indirect syntactic and semantic dependencies between diff hunks. These are usually directional and must not be broken across commits (for example, a method call cannot be committed before the method's declaration/definition; otherwise an intermediate commit would contain a compilation error).

2. Heuristic Correlation

Refers to heuristic rules, such as similarity and proximity between diff hunks, from which one can infer that multiple fine-grained changes originate from the same editing action. Its purpose is to detect systematic changes, changes applied to cloned code, neighboring changes in the same scope, and so on.

3. Refactoring Correlation

Its purpose is to detect multiple fine-grained code changes at different locations in the project that result from the same systematic or structural change, mainly the various types of refactoring operations.

4. Logical Correlation

Refers to multiple changes resulting from common operations that are semantically unrelated to the program but consistent in editing action, such as code formatting, dead code cleanup, and text movement.

Each type of relation above corresponds to one type of edge, and the strength of the relation is used as the edge weight; the calculation method is described in detail in the paper and its code implementation. Beyond these relations, the graph can easily be extended with correlations in other dimensions, such as evolutionary coupling and timestamp difference. However, obtaining this information requires the implementation to rely on code editing history logged by a particular version control system (VCS) or integrated development environment (IDE).
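As a rough illustration of typed, weighted edges, the Python sketch below models the four relation types and derives one heuristic edge from textual similarity. This is an assumption-laden example, not the weighting used by SmartCommit (that is defined in the paper): `RelationKind`, `RelationEdge`, `similarity_edge`, and the `min_similarity` cutoff are hypothetical, and the similarity measure is simply the standard library's `difflib.SequenceMatcher` ratio.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from enum import Enum
from typing import Optional


class RelationKind(Enum):
    STRUCTURAL = "structural"
    HEURISTIC = "heuristic"
    REFACTORING = "refactoring"
    LOGICAL = "logical"


@dataclass(frozen=True)
class RelationEdge:
    source: str          # index of the source diff hunk, e.g. "0:1"
    target: str          # index of the target diff hunk
    kind: RelationKind   # which dimension this edge describes
    weight: float        # strength of the relation, here in [0, 1]


def similarity_edge(source_id: str, source_code: str,
                    target_id: str, target_code: str,
                    min_similarity: float = 0.8) -> Optional[RelationEdge]:
    """Create a heuristic edge when two hunks are textually similar enough.

    The 0.8 cutoff is an arbitrary value chosen for this illustration.
    """
    ratio = SequenceMatcher(None, source_code, target_code).ratio()
    if ratio < min_similarity:
        return None
    return RelationEdge(source_id, target_id, RelationKind.HEURISTIC, ratio)
```

Two hunks that apply the same edit to cloned code would receive a heuristic edge with a weight close to 1, while unrelated hunks receive no edge from this rule.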

Decomposition algorithm

After obtaining the diff hunk graph for the changeset in the current workspace, we transform changeset decomposition into a graph partitioning problem on that graph: taking into account the edges between nodes and their weights, the vertex set of the graph is divided into a set of independent (mutually exclusive) subsets.

Following the idea of multi-level graph partitioning, SmartCommit adopts an edge-centered graph partitioning algorithm based on Kruskal's algorithm. The algorithm takes a diff hunk graph and an optional weight threshold as input (when the user does not set a threshold, a cut-off point is determined dynamically by the max-gap splitter algorithm). First, the algorithm creates an empty priority queue 𝑄 to hold edges and a disjoint-set (union-find) structure 𝑆 to hold change groups; each vertex is initialized as its own element, that is, each diff hunk starts as a separate group. Each edge (u, v) in the graph is added to the priority queue as a triplet (w, (u, v)), where w is the edge weight. Edges are prioritized first by weight in descending order, and then by the ID of the source node and the ID of the target node in ascending order. The algorithm then enters a loop. It pops the highest-priority edge from the queue. If both endpoints of the edge are already in the same group, the edge is ignored and the loop continues; otherwise, if its weight is greater than the threshold, the groups of the two endpoints are merged. If the weight of the edge is less than the threshold, or if 𝑄 is empty, the loop terminates and the connected sets are generated from 𝑆 as the result. If some nodes still form groups on their own, they are merged into one group, which is appended after the other groups. Finally, all node groups are output as the partition of the diff hunk graph.
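The loop described above can be sketched in Python as follows. This is an illustrative reading of the description, not the SmartCommit source: the function names are hypothetical, an edge whose weight equals the threshold is treated as below it, and the max-gap rule is assumed to place the threshold at the midpoint of the largest gap between consecutive sorted weights, which the article does not spell out.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, float]  # (source id, target id, weight)


def max_gap_threshold(weights: List[float]) -> float:
    """Assumed max-gap rule: cut in the middle of the largest gap between weights."""
    distinct = sorted(set(weights), reverse=True)
    if len(distinct) < 2:
        return 0.0  # no gap to split on; fall back to merging every positive edge
    best = max(range(len(distinct) - 1),
               key=lambda k: distinct[k] - distinct[k + 1])
    return (distinct[best] + distinct[best + 1]) / 2


def partition(nodes: List[str], edges: List[Edge],
              threshold: Optional[float] = None) -> List[List[str]]:
    """Kruskal-style grouping: merge endpoints of edges heavier than the threshold."""
    if threshold is None:
        threshold = max_gap_threshold([w for _, _, w in edges])

    parent: Dict[str, str] = {n: n for n in nodes}  # each hunk starts as its own group

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Min-heap on (-weight, source, target): heaviest edge first, then IDs ascending.
    queue = [(-w, u, v) for u, v, w in edges]
    heapq.heapify(queue)
    while queue:
        neg_w, u, v = heapq.heappop(queue)
        if -neg_w <= threshold:
            break  # all remaining edges are at least as light, so stop merging
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    groups: Dict[str, List[str]] = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    result = [g for g in groups.values() if len(g) > 1]
    leftovers = [g[0] for g in groups.values() if len(g) == 1]
    if leftovers:
        result.append(leftovers)  # isolated hunks are gathered into one trailing group
    return result


nodes = ["0:0", "0:1", "1:0", "1:1", "2:0"]
edges = [("0:0", "0:1", 0.9), ("1:0", "1:1", 0.8),
         ("0:1", "1:0", 0.2), ("1:1", "2:0", 0.1)]
print(partition(nodes, edges))
```

In this toy graph the largest gap lies between the weights 0.8 and 0.2, so only the two strong edges cause merges and the isolated hunk `2:0` ends up in the trailing group. Passing a lower explicit threshold yields coarser groups.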

Graph partitioning process based on the Kruskal and max-gap splitter algorithms

Interaction mechanism

The grouping of nodes based on the graph partitioning algorithm will be converted into code change grouping and presented to the developer in an appropriate form as a grouping scheme suggested by the algorithm for review. If the proposed grouping scheme deviates from the developer’s expectations, the developer can adjust it through two interactive actions:

  • Coarse control

By changing the termination condition of the algorithm, the decomposition algorithm is rerun to produce decompositions of different granularity. In the actual implementation of SmartCommit, the diff hunk graph is cached in memory or on disk, so rerunning the algorithm does not require rebuilding the graph and produces results quickly. The purpose of coarse-grained control is to use the developer's feedback on the generated decomposition to steer the algorithm toward the desired grouping more quickly, producing decompositions that need little or no fine-grained adjustment.

  • Fine tuning

The developer moves one or more diff hunks between groups, so that a small number of misplaced hunks end up in the groups where they belong. The purpose of fine-grained tuning is to let developers correct the results of the algorithm or exclude changes that do not need to be committed.

After tuning to an acceptable decomposition scheme, the developer can select some or all of the change groups to produce multiple commits in one click, recording the selected change groups as a series of consecutive Commits in the version management system.
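The fine-tuning action can be pictured as a simple operation on the list of change groups. The sketch below is a hypothetical helper, not SmartCommit's API; it assumes groups are lists of diff hunk indexes and that a group emptied by a move should disappear.

```python
from typing import List


def move_hunk(groups: List[List[str]], hunk_id: str, target: int) -> List[List[str]]:
    """Return a new grouping with hunk_id moved into groups[target].

    The input is left unchanged, and groups emptied by the move are dropped.
    """
    if not 0 <= target < len(groups):
        raise IndexError(f"no group at position {target}")
    if not any(hunk_id in group for group in groups):
        raise ValueError(f"unknown diff hunk {hunk_id!r}")
    moved = [[h for h in group if h != hunk_id] for group in groups]
    moved[target].append(hunk_id)
    return [group for group in moved if group]


suggested = [["0:0", "0:1"], ["1:0"], ["2:0"]]
adjusted = move_hunk(suggested, "1:0", 0)  # the developer decides 1:0 belongs to the first task
print(adjusted)
```

Returning a new list instead of mutating the suggestion makes it easy to keep the algorithm's proposal around, for example to undo an adjustment.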

It is important to note that SmartCommit’s default grouping granularity is for broad development and maintenance tasks (such as implementing new features, fixing problems, refactoring, and so on) rather than specific fine-grained changes (such as adding classes, changing method parameters, changing return value types, and so on) because of the need to consider generality across projects. SmartCommit follows the classification of development and maintenance tasks by Semantic-Release [9], an automated release tool, and Commitizen [10], a tool that regulates Commit Messages, so it can be used in conjunction with these tools.


Summary

To avoid compound commits at their root, SmartCommit aims to guide and assist developers in thinking about the atomicity of code changes during development. Unlike previous work, SmartCommit models the relationship between code changes from multiple dimensions based on a graph structure, groups the changes for broad maintenance and development tasks, and introduces human-computer interaction mechanisms that combine algorithms with the strengths of developers.

Tool prototype implementation

The core algorithm of SmartCommit is implemented in Java [11], with a GUI based on NodeJS and Electron [12]. Thanks to the cross-platform nature of Java and Electron, the tool supports Windows, Linux, and macOS. It can be packaged as standalone desktop software or as an IntelliJ IDEA plug-in, or it can be configured as a Git subcommand, git sc, to be invoked from the Git command line when committing code.

As a Huawei-Peking University 2019-2020 technical cooperation project, SmartCommit was implemented as an IntelliJ IDEA plug-in in early 2020 and is used for daily code commits by engineers from multiple teams in Huawei Cloud, Consumer Cloud, Euler, Cloud Core Network, and other departments. In addition to the basic change decomposition function provided by the open source version, the Huawei internal version provides additional functions such as commit type classification, automatic commit message generation, and issue ID association recommendation.

Experimental evaluation and validation

The researchers evaluated SmartCommit on both open source projects and in industrial environments, and cross-validated the algorithm and tool by comparing the results. Controlled experiments on 3000 simulated composite commits from well-known open source projects, together with an analysis of usage data from 83 Huawei internal engineers over 36 weeks, show that:

1. Accuracy: SmartCommit produced an initial decomposition scheme with a median accuracy of 71.00%-83.50% on 10 open source projects and 74.70% and 70.45% on 2 industrial projects.

2. Interactivity: When the initial decomposition is corrected by fine-grained tuning alone, without user involvement in generating it, more steps (1-15 actions) are required; with coarse-grained control in actual use, fewer than 5 adjustment steps are needed 80% of the time.

3. Performance: 90% of the time, SmartCommit can complete analysis in less than 5 seconds, and the running time does not increase significantly with the size of the input changeset. Most users consider SmartCommit performance acceptable in their daily work.

4. Practical value: The 10 active users interviewed said SmartCommit can be a great way to help developers follow best practices and provide additional benefits, such as making it easier to write Commit messages for grouped change groups and finding changes (such as local configuration, personal information, sensitive data) that shouldn’t be committed after grouping.

Conclusion

To address the problem of composite commits, this article introduces SmartCommit, a code change decomposition and commit tool based on static program analysis and graph partitioning. The tool automatically analyzes the relationships between fine-grained changes and decomposes tangled changes or non-atomic commits. With a GUI front end, it assists developers interactively and incrementally in following task-oriented commit best practices. SmartCommit has been tested in both open source and industrial projects, and the results show that its automatic decomposition algorithm provides a starting point for improving commit atomicity and that its interactive mechanism reduces the cost of following best practices.

As a tool prototype built from research results, SmartCommit has some limitations and shortcomings in practical use, such as:

1. The current implementation only supports Git projects and Java language code.

2. The algorithm does not make full use of the correlation between changes of all dimensions, and there is room for further improvement in the accuracy of automatic decomposition.

3. The tool mainly targets the development and commit stage, where it can replace commands such as git diff/add/commit/push. Whether it can also be applied to the code review stage remains an open question.

To address these limitations and further improve the effectiveness and usability of SmartCommit, we are currently developing version 2.0 of SmartCommit, which refactors the tool to make the following improvements:

1. Graph construction: Extract and decouple the language-related graph construction part, and use a general code Parser to replace the current JDT Parser dedicated to Java to support more languages.

2. Graph decomposition: The matrix form is used to store multiple graphs, and the association information between changes of more dimensions is added. Combined with the top-down graph division and the top-down point clustering algorithm, the data-driven method is further introduced to improve the accuracy of the automatic decomposition algorithm.

3. Interaction and application: collect and make use of the user feedback data generated during interaction; add APIs to the core algorithm for processing pull requests, and integrate a change description generation algorithm.

References

[1] Bo Shen, Wei Zhang, Haiyan Zhao, Wei Zhao, Guangtai Liang, Zhi Jin. SmartCommit: a graph-based interactive assistant for activity-oriented commits. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE), pp. 379-390. 2021.

[2] git-scm.com/docs/gitwor…

[3] google.github.io/eng-practic…

[4] github.com/angular/ang…

[5] K. Herzig and A. Zeller. “The Impact of Tangled Code Changes”. 2013 10th Working Conference on Mining Software Repositories (MSR). 2013: 121-130.

[6] W. Muylaert and C. De Roover. “Untangling Composite Commits Using Program Slicing”. 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM). 2018: 193-202.

[7] M. Barnett, C. Bird, J. Brunet et al. “Helping Developers Help Themselves: Automatic Decomposition of Code Review Changesets”. 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. 2015: 134-144.

[8] M. Dias, A. Bacchelli, G. Gousios et al. “Untangling Fine-Grained Code Changes”. 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). 2015: 341-350.

[9] github.com/semantic-re…

[10] github.com/commitizen/…

[11] github.com/Symbolk/Sma…

[12] github.com/Symbolk/Sma…
