情报工程, 2024, Vol. 10, Issue 5: 85-98. DOI: 10.3772/j.issn.2095-915x.2024.05.008

Research on Short Text Classification Method Based on BERTopic Topic Modeling and RoBERTa Algorithm

刘桂锋 陈亦侯 包翔 韩牧哲

Author information

  • 1. Institute of Science and Technology Information, Jiangsu University, Zhenjiang 212013

Abstract

[Purpose/Significance] To address the sparsity problem in short text classification, this paper proposes a short text classification method that expands features with topic probabilities, based on a BERTopic-RoBERTa-PCA-CatBoost model. [Methods/Processes] The RoBERTa model is used to obtain word vector representations of the short texts, and the BERTopic topic model is used to extract topic probability feature vectors. The two are fused for feature expansion, and the CatBoost algorithm performs the final classification. [Limitations] At the classification level, no deep learning algorithm was used for verification; at the feature fusion level, future work may consider other fusion methods. [Results/Conclusions] Compared with the LDA-CatBoost model, the proposed BERTopic-RoBERTa-PCA-CatBoost model improves accuracy by 10.90%, precision by 10.91%, and recall by 10.68%. The short text classification method based on topic probability feature expansion can overcome the limitations of a single model and improve the effectiveness of short text classification.
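The feature expansion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which fusion operator is used or where PCA is applied, so this example assumes plain concatenation of each text's embedding with its topic probability vector and leaves out the PCA and CatBoost stages. The function name `fuse_features` and all numeric values are hypothetical stand-ins; in practice the two inputs would come from a RoBERTa encoder and a fitted BERTopic model.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_features(
    embeddings: Sequence[Sequence[float]],
    topic_probs: Sequence[Sequence[float]],
) -> List[Vector]:
    """Concatenate each text's embedding with its topic probability vector.

    Assumption: fusion is simple concatenation; the source abstract does not
    specify the operator.
    """
    if len(embeddings) != len(topic_probs):
        raise ValueError("need one topic probability vector per embedding")
    return [list(e) + list(p) for e, p in zip(embeddings, topic_probs)]


# Hypothetical stand-in values for three short texts:
# 4-dimensional "embeddings" and probabilities over 3 topics.
embeddings = [
    [0.12, -0.40, 0.33, 0.05],
    [0.90, 0.10, -0.25, 0.47],
    [-0.31, 0.22, 0.08, -0.66],
]
topic_probs = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.33, 0.33, 0.34],
]

# Each fused row has 4 + 3 = 7 features and would be passed to the classifier.
fused = fuse_features(embeddings, topic_probs)
```

Concatenation keeps both feature groups intact, so a tree-based classifier can split on either the embedding dimensions or the topic probabilities. Other fusion schemes, which the abstract lists as future work, would replace only the body of `fuse_features`.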

Key words

Short Text Classification/Word Vector/BERTopic Model/RoBERTa Model

Publication year

2024
情报工程 (indexed in CSTPCD, CHSSCD)