Cross-project Open Source Software Defect Prediction Based on Multi-Source Domain Adaptation and Data Augmentation
李光杰 1, 唐艺 1, 何焱 1, 张启磊 1, 邢颖 2, 赵梦赐 2
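Author information
- 1. National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071
- 2. Beijing University of Posts and Telecommunications, Beijing 100876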
Abstract
Predicting defects by mining software repositories (MSR) is crucial for enhancing the security and quality of software. With extensive software defect data acquired by mining various repositories, numerous machine learning-based approaches have been proposed for defect prediction. However, because defect data originating from different repositories are heterogeneous, the robustness of these approaches is significantly compromised. In light of this, we propose a defect prediction approach based on multi-source Domain Adaptation and Data Augmentation (DPDA). The approach mines feature similarities between the various source repositories and the target repository. Specifically, it employs a weighted maximum mean discrepancy (MMD) to minimize the distance between their feature distributions. Meanwhile, an attention mechanism assigns different attention scores to different sources, increasing the weight of source repositories that are highly similar to the target repository. This weighting focuses the model on the most similar source repositories and reduces the impact of irrelevant ones. Comparative experiments demonstrate that the approach achieves the best defect prediction performance among the compared methods.
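The two ingredients named in the abstract, a weighted distribution distance and similarity-based source weights, can be illustrated with a small sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the Gaussian kernel, the biased MMD estimator, the softmax over negative MMD used as a stand-in for the attention scores, the toy two-dimensional data, and all function names. It is not the authors' implementation of DPDA.

```python
import math
import random

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))

def source_weights(sources, target, gamma=1.0, temperature=1.0):
    """Softmax over negative MMD: sources closer to the target get larger weights.

    Hypothetical stand-in for the attention scores mentioned in the abstract.
    """
    scores = [-mmd_squared(s, target, gamma) / temperature for s in sources]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def weighted_mmd_loss(sources, target, weights, gamma=1.0):
    """Weighted sum of per-source MMD values, the quantity one would minimize."""
    return sum(w * mmd_squared(s, target, gamma)
               for w, s in zip(weights, sources))

def sample(n, mean, rng):
    """Toy 2-D features drawn around `mean` (illustrative data only)."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]

rng = random.Random(0)
target = sample(40, 0.0, rng)
near_source = sample(40, 0.2, rng)   # distribution close to the target
far_source = sample(40, 3.0, rng)    # distribution far from the target

weights = source_weights([near_source, far_source], target)
loss = weighted_mmd_loss([near_source, far_source], target, weights)
print(weights, loss)
```

In this toy setting the source drawn near the target receives the larger weight, which mirrors the stated goal of reducing the influence of dissimilar repositories. In a trained model the features would come from a learned encoder and the weights would likely be learned jointly rather than computed in closed form.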
Key words
defect prediction/multi-source domain adaptation/attention mechanism/data augmentation
Publication year
2024