By Xu Lun, F(X)Team, Ali Tao Department
Can machine learning fix bugs automatically? To many of you this may sound like a fantasy, but it is something we can try for ourselves, and anyone can learn how.
It is both harder and easier than it sounds. The recipe itself is simple: we build datasets of buggy and fixed code fragments, train on them with neural machine translation techniques, and then use the trained model to predict how new buggy code should be fixed. Both the idea and the dataset come from the paper “An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation”.
The data format for automatic code repair is very simple: one side is the buggy code, the other is the fixed code.
The main process is as follows:
To make the model generalize across more code, the authors abstract it, replacing concrete identifiers and literals with placeholder tokens:
Let’s look at an example from the dataset:
The buggy code looks like this:
```java
public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( VAR_1 . length ) - 1 ) ] . getTime ( ) ) ; }
```
The fix looks like this:
```java
public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( type ) - 1 ) ] . getTime ( ) ) ; }
```
How to use CodeBERT to fix bugs automatically
Microsoft has been leading the way in AI4SE (artificial intelligence for software engineering). Let’s walk step by step through how to use Microsoft’s CodeBERT model to fix bugs automatically.
Step 1: Install the Transformers framework, which CodeBERT is built on:
```bash
pip install transformers --user
```
Step 2: Install PyTorch or TensorFlow as the backend for Transformers. If your GPU driver supports it, install the latest version:
```bash
pip install torch torchvision torchtext torchaudio --user
```
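To confirm that PyTorch can actually see the GPU before committing to a long training run, here is a quick sanity-check snippet (not part of the original workflow, just a convenience):

```python
import torch

# True means CUDA is available, so training will run on the GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the NVIDIA card used below
```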
Step 3: Download Microsoft’s CodeXGLUE dataset:
```bash
git clone https://github.com/microsoft/CodeXGLUE
```
The data now sits under CodeXGLUE/Code-Code/code-refinement/data/ and is split into two datasets, small and medium.
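Before training, it is worth peeking at the data to see the parallel-corpus layout. A minimal sketch, assuming we are inside CodeXGLUE/Code-Code/code-refinement/ and the file names match those used in the training command below:

```python
# Each line in the "buggy" file aligns with the same line in the "fixed" file.
buggy_path = "data/small/train.buggy-fixed.buggy"
fixed_path = "data/small/train.buggy-fixed.fixed"

with open(buggy_path) as fb, open(fixed_path) as ff:
    for i, (buggy, fixed) in enumerate(zip(fb, ff)):
        print("BUGGY:", buggy.strip())
        print("FIXED:", fixed.strip())
        print("-" * 60)
        if i == 2:  # only show the first three pairs
            break
```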
Let’s start by practicing on the small dataset:
```bash
cd code
export pretrained_model=microsoft/codebert-base
export output_dir=./output
python run.py \
  --do_train \
  --do_eval \
  --model_type roberta \
  --model_name_or_path $pretrained_model \
  --config_name roberta-base \
  --tokenizer_name roberta-base \
  --train_filename ../data/small/train.buggy-fixed.buggy,../data/small/train.buggy-fixed.fixed \
  --dev_filename ../data/small/valid.buggy-fixed.buggy,../data/small/valid.buggy-fixed.fixed \
  --output_dir $output_dir \
  --max_source_length 256 \
  --max_target_length 256 \
  --beam_size 5 \
  --train_batch_size 16 \
  --eval_batch_size 16 \
  --learning_rate 5e-5 \
  --train_steps 100000 \
  --eval_steps 5000
```
Training time depends on your hardware; on an NVIDIA 3090 GPU it took about one night. The best model is saved to $output_dir/checkpoint-best-bleu/pytorch_model.bin.
Then we can use the test set to verify our training results:
```bash
python run.py \
  --do_test \
  --model_type roberta \
  --model_name_or_path roberta-base \
  --config_name roberta-base \
  --tokenizer_name roberta-base \
  --load_model_path $output_dir/checkpoint-best-bleu/pytorch_model.bin \
  --dev_filename ../data/small/valid.buggy-fixed.buggy,../data/small/valid.buggy-fixed.fixed \
  --test_filename ../data/small/test.buggy-fixed.buggy,../data/small/test.buggy-fixed.fixed \
  --output_dir $output_dir \
  --max_source_length 256 \
  --max_target_length 256 \
  --beam_size 5 \
  --eval_batch_size 16
```
On my machine, inference took about half an hour, and the output looked like this:
```
10/26/2021 11:51:57 - INFO - __main__ -   Test file: ../data/small/test.buggy-fixed.buggy,../data/small/test.buggy-fixed.fixed
100%|██████████| 365/365 [30:40<00:00, 5.04s/it]
10/26/2021 12:22:39 - INFO - __main__ -   bleu-4 = 79.26
10/26/2021 12:22:39 - INFO - __main__ -   xMatch = 16.3325
10/26/2021 12:22:39 - INFO - __main__ -   ********************
```
How do we evaluate the quality of the generated code? We can compare output/test_1.output against output/test_1.gold using the evaluator.py script:
```bash
python evaluator/evaluator.py -ref ./code/output/test_1.gold -pre ./code/output/test_1.output
```
The following output is displayed:
```
BLEU: 79.26; Acc: 16.33
```
The former is the BLEU score, which measures the quality of generated text; the latter is the exact-match accuracy.
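If you want to sanity-check the accuracy number yourself, exact match is easy to recompute. A minimal sketch, assuming test_1.output and test_1.gold are aligned line by line (if each line carries an index prefix, strip it first); BLEU itself is best left to the official evaluator.py:

```python
# Exact-match accuracy: the fraction of generated fixes that are
# token-for-token identical to the reference fixes.
with open("code/output/test_1.output") as f:
    predictions = [line.strip() for line in f]
with open("code/output/test_1.gold") as f:
    references = [line.strip() for line in f]

matches = sum(p == r for p, r in zip(predictions, references))
print(f"Acc: {100.0 * matches / len(references):.2f}")
```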
How good are these numbers? We can compare them against the published baselines:
| Method | BLEU | Acc (100%) | CodeBLEU |
|---|---|---|---|
| Naive copy | 78.06 | 0.0 | – |
| LSTM | 76.76 | 10.0 | – |
| Transformer | 77.21 | 14.7 | 73.31 |
| CodeBERT | 77.42 | 16.4 | 75.58 |
The accuracy may not look high, but CodeBERT already represents roughly a 60% improvement over the RNN approach used in the original paper.
We can diff the generated fixes against the reference fixes to get a feel for the differences.
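A small sketch using Python's difflib prints such a diff for any single test example (again assuming the prediction and gold files are aligned line by line; the index N is just for illustration):

```python
import difflib

N = 0  # which test example to inspect
with open("code/output/test_1.output") as f:
    prediction = f.readlines()[N].strip()
with open("code/output/test_1.gold") as f:
    reference = f.readlines()[N].strip()

# Token-level diff: the dataset is already whitespace-tokenized.
diff = difflib.unified_diff(
    reference.split(), prediction.split(),
    fromfile="gold", tofile="prediction", lineterm=""
)
print("\n".join(diff))
```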
With more data, the model fixes bugs automatically without us spending a single brain cell, a kind of "passive income" that keeps accruing while we sleep.
Automatic bug detection
If automatic bug fixing still feels far from practical, you can start by just finding the bugs. Don't look down on this more modest goal of automatic defect detection; narrowing the scope brings a big gain in applicability and accuracy.
The defect-detection dataset is even simpler: a single field indicates whether or not the code contains a bug.
The dataset is stored in JSONL format, as follows:
{"project": "qemu", "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e", "target": 1, "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n MirrorState *s = FILTER_MIRROR(nf); \n Chardev *chr; \n chr = qemu_chr_find(s->outdev); \n if (chr == NULL) {\n error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,\n \"Device '%s' not found\", s->outdev); \n qemu_chr_fe_init(&s->chr_out, chr, errp);" , "idx": 8} {"project": "qemu", "commit_id": "21ce148c7ec71ee32834061355a5ecfd1a11f90f", "target": 1, "func": "static inline int64_t sub64(const int64_t a, const int64_t b)\n\n{\n\n\treturn a - b; \n\n}\n", "idx": 10}Copy the code
We don’t need to write any model code of our own; we just run the training script:
```bash
python run.py \
  --output_dir=./saved_models \
  --model_type=roberta \
  --tokenizer_name=microsoft/codebert-base \
  --model_name_or_path=microsoft/codebert-base \
  --do_train \
  --train_data_file=../dataset/train.jsonl \
  --eval_data_file=../dataset/valid.jsonl \
  --test_data_file=../dataset/test.jsonl \
  --epoch 5 \
  --block_size 200 \
  --train_batch_size 32 \
  --eval_batch_size 64 \
  --learning_rate 2e-5 \
  --max_grad_norm 1.0 \
  --evaluate_during_training \
  --seed 123456
```
Training takes much less time than for automatic repair: everything was done in about 20 minutes. Then run the test set:
```bash
python run.py \
  --output_dir=./saved_models \
  --model_type=roberta \
  --tokenizer_name=microsoft/codebert-base \
  --model_name_or_path=microsoft/codebert-base \
  --do_eval \
  --do_test \
  --train_data_file=../dataset/train.jsonl \
  --eval_data_file=../dataset/valid.jsonl \
  --test_data_file=../dataset/test.jsonl \
  --epoch 5 \
  --block_size 200 \
  --train_batch_size 32 \
  --eval_batch_size 64 \
  --learning_rate 2e-5 \
  --max_grad_norm 1.0 \
  --evaluate_during_training \
  --seed 123456
```
Calculate the accuracy:
```bash
python ../evaluator/evaluator.py -a ../dataset/test.jsonl -p saved_models/predictions.txt
```
The running results are as follows:
```
{'Acc': 0.6288433382137628}
```
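Under the hood, the evaluator simply compares the predicted label for each idx against the target field in test.jsonl. A minimal sketch of the same computation, assuming predictions.txt holds one idx and label per line separated by whitespace:

```python
import json

# Gold labels keyed by idx, taken from the "target" field of test.jsonl.
gold = {}
with open("../dataset/test.jsonl") as f:
    for line in f:
        example = json.loads(line)
        gold[str(example["idx"])] = int(example["target"])

# Predictions are assumed to look like "idx<TAB>label", one pair per line.
correct = total = 0
with open("saved_models/predictions.txt") as f:
    for line in f:
        idx, label = line.split()
        correct += int(gold[idx] == int(label))
        total += 1

print({"Acc": correct / total})
```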
Let’s compare this to the mainstream results in the industry:
| Methods | Acc (%) |
|---|---|
| BiLSTM | 59.37 |
| TextCNN | 60.69 |
| RoBERTa | 61.05 |
| CodeBERT | 62.08 |
So our run clears the published accuracy bar. The benefit of this approach is still the "passive income" we talked about earlier: as long as it keeps accumulating valid data, its detection ability keeps improving, with very little maintenance effort.