Research on Short Text Classification Method Based on BERTopic Topic Modeling and RoBERTa Algorithm

[Purpose/Significance] To address the sparsity problem in short text classification, this paper proposes a short text classification method that expands features with topic probabilities, built on a BERTopic-RoBERTa-PCA-CatBoost model. [Methods/Processes] The RoBERTa model is used to obtain word vector representations of the short texts, and the BERTopic topic model is used to extract topic probability feature vectors. The two are fused to expand the features, and the CatBoost algorithm then performs the classification. [Limitations] At the classification level, the method has not been validated with deep learning classifiers. At the feature fusion level, future work may consider other fusion methods. [Results/Conclusions] Compared with the LDA-CatBoost model, the proposed BERTopic-RoBERTa-PCA-CatBoost model improves accuracy by 10.90%, precision by 10.91%, and recall by 10.68%. The short text classification method based on topic probability feature expansion can overcome the shortcomings of a single model and improve short text classification performance.
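The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two feature sets are fused, so plain concatenation is an assumption here, and the function names `fuse_features` and `fuse_batch` and all numeric values are hypothetical stand-ins rather than the authors' implementation. In a real pipeline the embeddings would come from RoBERTa and the topic probabilities from BERTopic; where PCA is applied is likewise not stated in the abstract and is left out.

```python
from typing import List, Sequence


def fuse_features(embedding: Sequence[float],
                  topic_probs: Sequence[float]) -> List[float]:
    """Fuse one text embedding with its topic probability vector.

    Assumption: fusion is simple concatenation; the abstract does not
    specify the fusion method.
    """
    return list(embedding) + list(topic_probs)


def fuse_batch(embeddings: Sequence[Sequence[float]],
               topic_probs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Fuse a batch of embeddings with the matching topic vectors."""
    if len(embeddings) != len(topic_probs):
        raise ValueError("embeddings and topic_probs must have equal length")
    return [fuse_features(e, t) for e, t in zip(embeddings, topic_probs)]


# Hypothetical toy inputs: two short texts, 4-dimensional embeddings
# (real RoBERTa vectors are far larger) and 3 topic probabilities each.
embeddings = [[0.12, -0.40, 0.33, 0.05],
              [0.48, 0.10, -0.22, 0.91]]
topic_probs = [[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]]

fused = fuse_batch(embeddings, topic_probs)
print(len(fused), len(fused[0]))
```

Each fused row carries the dense semantic features followed by the topic distribution, which is one plausible way to give a sparse short text additional signal before dimensionality reduction and classification.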

Short Text Classification; Word Vector; BERTopic Model; RoBERTa Model

Liu Guifeng, Chen Yihou, Bao Xiang, Han Muzhe

Institute of Science and Technology Information, Jiangsu University, Zhenjiang 212013

2024

Technology Intelligence Engineering (情报工程)

CSTPCD, CHSSCD
Year, Volume (Issue): 2024, 10(5)