
Learning-difficulty- and generalization-aware oversampling method for software defect prediction

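The imbalanced class distribution of software defect data poses a major challenge for software defect prediction. Synthetic oversampling is the mainstream technique for addressing this problem, but designing a sampling strategy that avoids the over-generalization risk caused by introducing abnormal samples remains a persistent difficulty for oversampling methods in software defect prediction. To address this, the paper proposes an oversampling method (GDOS) that combines the learning difficulty of samples with the generalization impact of synthesis. Specifically, GDOS uses each sample's local prior probability and the sample distribution along potential synthesis directions to measure the sample's safe factor and generalization factor, and derives the sample's selection weight from them. By lowering the synthesis probability in potentially over-generalized regions and giving relatively safe neighbor synthesis directions a higher selection probability, the method supports the synthesis of high-quality samples. Experiments on 26 PROMISE datasets show that GDOS outperforms both classical sampling methods and sampling methods proposed specifically for software defect prediction on MCC, pd, pf, F-measure, and other metrics.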
Software defect prediction oversampling technique with generalization and difficulty-aware
The class-imbalanced distribution of software defect data poses great challenges for software defect prediction. Synthetic oversampling is the most popular technique for solving this problem, but designing a suitable sampling strategy that avoids the risk of over-generalization caused by the introduction of abnormal samples is still an open challenge for software defect prediction. To solve this problem, a Generalization and Difficulty-aware Oversampling (GDOS) method for minority oversampling is proposed, which combines the influence of sample learning difficulty and synthetic generalization. For each oversampling seed sample, GDOS evaluates the selection weights of its assistant minority samples by measuring the safe factor and the generalization factor simultaneously, according to its local prior probability and the sample distribution information along the potential synthesis direction. By suppressing the possibility of synthesizing samples in potential over-generalization regions and enhancing the possibility of synthesizing samples in relatively safe directions, GDOS supports the synthesis of high-quality samples. Numerical comparison with nine state-of-the-art methods on twenty-six datasets from the PROMISE repository demonstrates the superiority of GDOS in terms of MCC, pd, pf and F-measure.
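The abstract describes the weighting idea only in words, so the following is a minimal Python sketch of that general idea and not the authors' GDOS algorithm. It assumes that the "safe factor" can be approximated by the share of minority labels among a point's k nearest samples, and that a "generalization factor" can be approximated by the same share measured at the midpoint between a seed and its candidate neighbor. The function names (`knn_indices`, `safe_factor`, `weighted_oversample`), the product used to combine the two factors, and the midpoint probe are all illustrative assumptions; the paper's actual formulas are not given in this record.

```python
import numpy as np


def knn_indices(X, query, k, exclude=None):
    """Return indices of the k rows of X closest to `query` (Euclidean)."""
    d = np.linalg.norm(X - query, axis=1)
    if exclude is not None:
        d[exclude] = np.inf  # never pick the excluded row itself
    return np.argsort(d)[:k]


def safe_factor(X, y, point, k, minority=1):
    """Assumed proxy for a local prior: minority share among the k nearest
    samples of `point` (a point that belongs to X counts itself)."""
    idx = knn_indices(X, point, k)
    return float(np.mean(y[idx] == minority))


def weighted_oversample(X, y, n_new, k=5, minority=1, seed=0):
    """SMOTE-style interpolation where the neighbor is drawn with a
    probability proportional to an assumed safety-times-direction score."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples")
    n_nbrs = min(k, len(X_min) - 1)
    synthetic = []
    for _ in range(n_new):
        s = rng.integers(len(X_min))          # pick a seed sample
        seed_pt = X_min[s]
        nbrs = knn_indices(X_min, seed_pt, n_nbrs, exclude=s)
        # Score each candidate direction: safety of the neighbor multiplied
        # by the minority share around the midpoint of the segment.
        w = np.array([
            safe_factor(X, y, X_min[j], k, minority)
            * safe_factor(X, y, (seed_pt + X_min[j]) / 2.0, k, minority)
            for j in nbrs
        ]) + 1e-12                             # avoid an all-zero weight vector
        j = rng.choice(nbrs, p=w / w.sum())    # weighted neighbor selection
        gap = rng.random()                     # interpolation coefficient in [0, 1)
        synthetic.append(seed_pt + gap * (X_min[j] - seed_pt))
    return np.vstack(synthetic)
```

Calling `weighted_oversample(X, y, n_new=10)` on a feature matrix `X` and a 0/1 label vector `y` returns ten synthetic minority rows. Directions whose midpoint is surrounded mostly by majority samples receive a weight close to zero, which mirrors the abstract's goal of suppressing synthesis in regions prone to over-generalization.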

software defect prediction; class imbalance; oversampling; over-generalization

范洪旗、严远亭、张以文、张燕平


Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei 230601, Anhui

School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui

software defect prediction; class imbalance; oversampling; over-generalization

Supported by the National Natural Science Foundation of China (61806002, 62272001)

2024

Computer Integrated Manufacturing Systems
No. 210 Research Institute of China North Industries Group Corporation


CSTPCD; Peking University Core Journals
Impact factor: 1.092
ISSN: 1006-5911
Year, volume (issue): 2024, 30(8)