A just-in-time software defect prediction method based on data augmentation
Just-in-time (JIT) software defect prediction aims to predict whether a code commit made during project development and maintenance will introduce defects. Model training for JIT software defect prediction relies on high-quality datasets, yet existing methods have not investigated how dataset augmentation affects prediction performance. To improve that performance, a JIT software defect prediction method based on dataset augmentation, named prediction based on data augmentation (PDA), is proposed. PDA consists of four parts: feature concatenation, sample generation, sample filtering, and sampling processing. The augmented dataset has an ample number of high-quality samples and is free of class imbalance. PDA is compared with the latest JIT software defect prediction method (JIT-Fine): on the JIT-Defects4J dataset the F1 score improves by 18.33%, and on the LLTC4J dataset it still improves by 3.67%, which supports PDA's generalization ability. Ablation experiments show that the performance gain of PDA comes mainly from the dataset augmentation and filtering mechanisms.
Keywords: data augmentation; deep learning; just-in-time defect prediction; sample generation; imbalanced datasets
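The four stages named in the abstract can be pictured with a small sketch. The abstract does not say how any stage is implemented, so everything below is an assumption made for illustration: the function names are hypothetical, Gaussian jitter stands in for sample generation, a nearest-class-centroid check stands in for sample filtering, and random oversampling stands in for the sampling step. It is not the authors' PDA implementation.

```python
import random
from collections import Counter


def concat_features(feats_a, feats_b):
    """Stage 1: join two per-commit feature vectors into one vector."""
    return list(feats_a) + list(feats_b)


def generate_samples(samples, n_new, rng, noise=0.05):
    """Stage 2 (assumed): create new samples by jittering existing ones."""
    new = []
    for _ in range(n_new):
        x, y = rng.choice(samples)
        new.append(([v + rng.gauss(0.0, noise) for v in x], y))
    return new


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def filter_samples(generated, originals):
    """Stage 3 (assumed): keep a generated sample only if the nearest
    class centroid of the original data matches its label."""
    labels = {y for _, y in originals}
    cents = {lbl: centroid([x for x, y in originals if y == lbl]) for lbl in labels}
    kept = []
    for x, y in generated:
        nearest = min(cents, key=lambda lbl: sq_dist(x, cents[lbl]))
        if nearest == y:
            kept.append((x, y))
    return kept


def balance(samples, rng):
    """Stage 4 (assumed): random oversampling until all classes have equal size."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append((x, y))
    target = max(len(group) for group in by_label.values())
    out = []
    for group in by_label.values():
        out.extend(group)
        out.extend(rng.choice(group) for _ in range(target - len(group)))
    return out


def pda_sketch(feats_a, feats_b, labels, n_new=20, seed=0):
    """Run the four illustrative stages in order on toy data."""
    rng = random.Random(seed)
    data = [(concat_features(a, b), y) for a, b, y in zip(feats_a, feats_b, labels)]
    generated = generate_samples(data, n_new, rng)
    data = data + filter_samples(generated, data)
    return balance(data, rng)


# Toy input: three clean commits (label 0) and one defect-inducing commit (label 1).
toy_a = [[0.10, 0.20], [0.90, 0.80], [0.20, 0.10], [0.15, 0.25]]
toy_b = [[1.0], [5.0], [1.2], [0.9]]
toy_labels = [0, 1, 0, 0]

augmented = pda_sketch(toy_a, toy_b, toy_labels)
class_counts = Counter(y for _, y in augmented)
print(class_counts)
```

Running the stages in this order means the balancing step sees only samples that passed the filter, which matches the order in which the abstract lists the four parts. A real system would replace each stand-in with the method described in the body of the paper.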