A Feature Selection Method to Handle Imbalanced Data in Text Classification

Abstract: Imbalanced data is a common problem in text classification applications. Feature selection, which can reduce the dimensionality of the feature space and improve classifier performance, is widely used in text classification. This paper presents a new feature selection method named NFS, which selects class information words rather than terms with high document frequency. To further improve classifier performance, we combine NFS with data resampling to address the imbalanced data problem. Experiments were conducted on the Reuters-21578 collection, and the results show that NFS performs better than the chi-square statistic and mutual information on the original dataset when the number of selected features is greater than 1000. The maximum Macro-F1 is 0.7792 when NFS is applied to the resampled dataset, an increase of 4.02% over the original dataset. Thus, the proposed method effectively improves minority class performance.
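The abstract reports results in Macro-F1, the unweighted mean of per-class F1 scores, which weights a minority class as heavily as a majority class. The following is a minimal sketch of that metric only, not of the NFS method, which this record does not describe. The function name `macro_f1`, the toy labels, and the choice to average over the classes present in `y_true` are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption for this sketch: classes that appear only in y_pred are
    not averaged in.
    """
    labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced labels (made up for illustration): 9 majority, 1 minority.
y_true = ["earn"] * 9 + ["corn"]
y_pred = ["earn"] * 10  # a classifier that never predicts the minority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro_f1={score:.4f}")
```

On this toy input the classifier is right on nine of ten documents, yet the missed minority class contributes an F1 of zero and pulls Macro-F1 well below accuracy. This is why Macro-F1 is a natural measure when the goal is better minority class performance.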

Keywords:

Text classification; Feature selection; Imbalanced data

Authors:

Fengxiang Chang, Jun Guo, Weiran Xu, Kejun Yao

Affiliations:

School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China

Computing and Data Processing Center GRUEEX Ltd., Zagreb, Croatia

Publication year:

2015

Journal of Digital Information Management

ISSN: 0972-7272

Year, volume (issue): 2015, 13(3)