Journal of Anhui University of Science and Technology (Natural Science), 2024, Vol. 44, Issue 3: 83-89. DOI: 10.3969/j.issn.1672-1098.2024.03.011

The Two-staged Text Feature Selection Method with Maximum Correlation and Minimum Redundancy

Leng Ting¹, Ye Renyu¹, Xu Sirong¹

Author information

  • 1. School of Mathematics and Physics, Anqing Normal University, Anqing, Anhui 246133, China

Abstract

Objective: The traditional chi-square statistic method (CHI) selects features using only the correlation between text features and text categories and ignores redundancy among features, which leads to poor text classification performance. This work addresses that problem. Methods: Applying the principle of maximum correlation and minimum redundancy, the feature subset initially selected by CHI is screened for low-redundancy features, following the idea of strong correlation and low redundancy, to improve text feature selection. On this basis, a two-stage text feature selection method based on maximum correlation and minimum redundancy (CHI_impMI) is proposed. Results: In classifying the Fudan University news text corpus, CHI_impMI achieved the best value on every performance metric and the best classification results, compared with the CHI and CHI_MI feature selection methods. Conclusion: CHI_impMI strikes a good balance between relevance and redundancy and thereby effectively improves text classification performance.
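The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption: features are binary term-presence indicators, stage 1 uses the ordinary Pearson chi-square statistic, and stage 2 uses the generic greedy "relevance minus mean redundancy" criterion with plain mutual information. It does not reproduce the improved measure behind CHI_impMI. The function names `chi_square`, `mutual_information`, and `two_stage_select` are hypothetical.

```python
import math
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic of the feature-by-class contingency table."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    f_counts = Counter(feature)
    c_counts = Counter(labels)
    stat = 0.0
    for f, nf in f_counts.items():
        for c, nc in c_counts.items():
            expected = nf * nc / n
            stat += (joint[(f, c)] - expected) ** 2 / expected
    return stat


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    return sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )


def two_stage_select(X, y, k1, k2):
    """Return indices of k2 features: chi-square prefilter, then greedy mRMR.

    X is a list of documents, each a list of discrete feature values.
    Assumption: plain MI is used for both relevance and redundancy.
    """
    columns = list(zip(*X))  # one tuple of values per feature

    # Stage 1: keep the k1 features with the largest chi-square score.
    ranked = sorted(range(len(columns)), key=lambda j: -chi_square(columns[j], y))
    candidates = ranked[:k1]

    # Stage 2: repeatedly add the candidate with the best
    # (relevance to the class) minus (mean redundancy with chosen features).
    relevance = {j: mutual_information(columns[j], y) for j in candidates}
    selected = []
    while candidates and len(selected) < k2:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a chi-square-only ranking, an exact copy of a strong feature would be kept alongside the original. In this sketch the second stage penalizes the copy for its high mutual information with the already selected feature, so a weaker but non-redundant feature can be preferred.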

Key words

chi-square statistics/the principle of maximum correlation and minimum redundancy/mutual information/text classification/feature selection

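Funding

National Social Science Fund of China (21BTJ040)

Key Natural Science Research Project of Anhui Higher Education Institutions, Anhui Provincial Department of Education (KJ2019A0557)

Anhui Provincial Graduate Innovation and Entrepreneurship Practice Project (2022cxcysj166)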

Publication year

2024
Journal of Anhui University of Science and Technology (Natural Science)
Anhui University of Science and Technology

Impact factor: 0.331
ISSN: 1672-1098