Patch Correctness Verification Method Based on CodeBERT and Stacking Ensemble Learning

In recent years, automatic program repair has become an important research topic in software engineering. However, most existing repair techniques are based on patch generation and testing, and the patch verification step is very time-consuming. Moreover, because test suites are incomplete, many candidate patches pass all tests yet are actually incorrect, which leads to patch overfitting. To improve the efficiency of patch verification and alleviate patch overfitting, a static patch verification method is proposed. The method first uses the large pre-trained model CodeBERT to automatically extract semantic features from buggy code fragments and patched code fragments, and then trains a Stacking ensemble learning model on historical bug-fixing patch data; the trained model can then effectively verify new bug-fixing patches. The verification ability of the proposed method is evaluated on 1,000 patches related to the Defects4J defect dataset. Experimental results show that the method can effectively verify patch correctness and thereby improve the efficiency of patch verification.
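The abstract does not spell out implementation details such as how the CodeBERT vectors are pooled, how the buggy and patched fragments are combined, or which base learners the Stacking ensemble uses. The following is a minimal sketch of such a pipeline under stated assumptions: the Hugging Face microsoft/codebert-base checkpoint, [CLS]-token embeddings, concatenation of the buggy and patched vectors, and scikit-learn's StackingClassifier with hypothetical random-forest and gradient-boosting base learners and a logistic-regression meta-learner. It is an illustration, not the authors' implementation.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Load CodeBERT from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base")
codebert.eval()

def embed(code: str) -> np.ndarray:
    # Encode one code fragment into a fixed-length semantic vector,
    # taken from the [CLS] position of CodeBERT's last hidden layer.
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = codebert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

def patch_features(buggy: str, patched: str) -> np.ndarray:
    # One possible feature construction: concatenate the embeddings
    # of the buggy fragment and the patched fragment.
    return np.concatenate([embed(buggy), embed(patched)])

# Stacking ensemble: cross-validated base-learner predictions feed a
# logistic-regression meta-learner (the choice of base learners is an assumption).
classifier = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Training and prediction on labelled historical patches (1 = correct, 0 = overfitting):
# X = np.stack([patch_features(b, p) for b, p in fragment_pairs]); y = labels
# classifier.fit(X, y); classifier.predict(X_new)

Here fragment_pairs, labels, and X_new are placeholder names; in the paper's setting the labelled examples would come from the historical Defects4J-related patches described above.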

Automatic program repair; Patch verification; Pre-training model; Ensemble learning; Defects4J defect dataset

Han Wei, Jiang Shujuan, Zhou Wei


School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China

Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China


2025

Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Peking University Core Journal (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2025, 52(1)