Journal of Jiangxi Normal University (Natural Science Edition), 2024, Vol. 48, Issue 3: 301-310. DOI: 10.16357/j.cnki.issn1000-5862.2024.03.11

The Software Defect Prediction Model Combining Improved Sampling Algorithm and Unsupervised Clustering

Shi Haihe (石海鹤) 1, Zhou Shiwen (周世文) 1, Zhong Linhui (钟林辉) 1, Xiao Zhengxing (肖正兴) 2

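Author information

  • 1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022
  • 2. School of Artificial Intelligence, Shenzhen Polytechnic University, Shenzhen, Guangdong 518055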

Abstract

Building on the adaptive synthetic sampling algorithm ADASYN, this paper addresses the connectivity between clusters of different densities within the minority class. Points at a medium distance from each sampling point are included in the range used to generate new samples, yielding the improved oversampling algorithm T-ADASYN, which effectively increases the connectivity between minority-class clusters of different densities and produces a more evenly distributed dataset. The connectivity-based spectral clustering algorithm is then used for clustering-based prediction, combining an oversampling algorithm with unsupervised clustering for the first time and giving a novel and practical software defect prediction model, TA-SC (T-ADASYN + spectral clustering). The model is validated with F-score as the evaluation metric and spectral clustering as the clustering model. Experimental results show that T-ADASYN improves on commonly used oversampling algorithms by an average of 6% on both the public PROMISE and NASA datasets, and that the TA-SC model improves on commonly used clustering algorithms by 3% and 2% on the PROMISE and NASA datasets, respectively.
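
For orientation, the baseline that T-ADASYN starts from can be sketched as below. This is a minimal pure-Python sketch of standard ADASYN only, not the authors' T-ADASYN: the abstract does not state how medium-distance points are selected, so that step is not reproduced. The function names `nearest_indices` and `adasyn` and the parameters `k`, `beta`, and `seed` are illustrative assumptions.

```python
import math
import random


def nearest_indices(i, points, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Baseline ADASYN sketch: return a list of synthetic minority samples.

    `minority` and `majority` are lists of equal-length numeric tuples.
    `beta` is the desired balance level (1.0 aims for full balance).
    """
    rng = random.Random(seed)
    n_min = len(minority)
    # Total number of synthetic samples to generate.
    total = int((len(majority) - n_min) * beta)
    # Minority points come first, so any index >= n_min is a majority point.
    points = minority + majority

    # r_i: share of majority points among the k nearest neighbours of x_i.
    ratios = []
    for i in range(n_min):
        neighbours = nearest_indices(i, points, k)
        ratios.append(sum(j >= n_min for j in neighbours) / k)
    norm = sum(ratios)
    if total <= 0 or norm == 0:
        return []

    synthetic = []
    for i, r in enumerate(ratios):
        # Harder-to-learn points (more majority neighbours) get more samples.
        count = round(r / norm * total)
        candidates = nearest_indices(i, minority, k)
        if not candidates:
            continue  # a lone minority point has no partner to interpolate with
        for _ in range(count):
            z = minority[rng.choice(candidates)]
            lam = rng.random()
            # New sample lies on the segment between x_i and a minority neighbour.
            synthetic.append(
                tuple(a + lam * (b - a) for a, b in zip(minority[i], z))
            )
    return synthetic
```

Because each synthetic point is interpolated between a minority sample and one of its nearest minority neighbours, new samples stay close to existing local clusters; the abstract indicates that T-ADASYN widens this range to medium-distance points to better connect clusters of different densities.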

Key words

software defect prediction/class imbalance/oversampling algorithm/clustering algorithm/unsupervised learning

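Funding

National Natural Science Foundation of China (62062039)

National Natural Science Foundation of China (61872123)

Special Project of the Center for Scientific Research and Development in Higher Education Institutions, Ministry of Education (ZJXF2022255)

Jiangxi Normal University Graduate Innovation Fund (YJS2022027)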

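Publication year

2024
Journal of Jiangxi Normal University (Natural Science Edition)
Jiangxi Normal University

Indexed in: CSTPCD, Peking University Core Journals
Impact factor: 0.538
ISSN: 1000-5862
Number of references: 5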