Generalization- and difficulty-aware oversampling technique for software defect prediction
The imbalanced class distribution of software defect data poses a major challenge for software defect prediction. Synthetic oversampling is the most popular technique for addressing this problem, but designing a sampling strategy that avoids the risk of over-generalization caused by introducing abnormal samples remains an open challenge. To address it, a Generalization and Difficulty-aware Oversampling (GDOS) method is proposed, which combines the influence of sample learning difficulty and synthetic generalization in minority oversampling. For each oversampling seed sample, GDOS evaluates the selection weights of its assistant minority samples by measuring the safe factor and the generalization factor simultaneously, according to its local prior probability and the sample distribution information along the potential synthesis direction. By suppressing the likelihood of synthesizing samples in potential over-generalization regions and increasing the likelihood of synthesizing samples in relatively safe directions, GDOS ensures the synthesis of high-quality samples. Numerical comparisons with nine state-of-the-art methods on twenty-six datasets from the PROMISE repository demonstrate the superiority of GDOS in terms of MCC, pd, pf and F-measure.
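For readers unfamiliar with the four evaluation measures named above, the following minimal sketch computes them from a binary confusion matrix using their standard definitions, with the defective class treated as positive. It is an illustration only and not code from the GDOS study; the function name defect_metrics and the example counts are assumptions introduced here.

```python
import math


def defect_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures for binary defect prediction.

    Hypothetical helper for illustration; 'defective' is the positive class.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    # pd: probability of detection (recall on the defective class)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    # pf: probability of false alarm (false positive rate)
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    # precision is needed only as an ingredient of the F-measure
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-measure: harmonic mean of precision and pd
    f_measure = (
        2 * precision * pd / (precision + pd) if (precision + pd) else 0.0
    )
    # MCC: Matthews correlation coefficient, in the range [-1, 1]
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"pd": pd, "pf": pf, "f_measure": f_measure, "mcc": mcc}


# Made-up counts for an imbalanced test set: 50 defective, 150 clean modules
metrics = defect_metrics(tp=40, fp=10, tn=140, fn=10)
```

Of these measures, pf is the only one for which a lower value is better. MCC uses all four cells of the confusion matrix, which is why it is often preferred for imbalanced data.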