
Text classification method based on knowledge enhancement

To address inaccurate classification in text classification tasks caused by poor-quality data, data imbalance, and overly small datasets, a text classification algorithm based on knowledge enhancement is proposed. First, the algorithm augments the dataset by adding external knowledge. Second, the original text and the external knowledge are embedded using GloVe word vectors, and text features are extracted with CNN, LSTM, and BERT models. Third, the extracted original-text features and external-knowledge features are fused to obtain the final text features. Finally, the fused text features are fed into a multilayer perceptron for classification, which yields the final classification result. Experiments on different datasets show that, on the SST-5 dataset, the classification accuracy of the CNN(KB), LSTM(KB), and BERT(KB) models improves by 5.01%, 7.92%, and 1.5%, respectively, over the baseline models; on the SST-2 dataset, the accuracy of LSTM(KB) and BERT(KB) improves by 1.76% and 1.29%, respectively; and on the IMDB dataset, the accuracy of CNN(KB), LSTM(KB), and BERT(KB) improves by 0.97%, 2.87%, and 0.76%, respectively. These results indicate that the algorithm can effectively improve text classification accuracy and can serve as a reference for text classification applications in different fields.
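
The four steps in the abstract (embed, extract features, fuse, classify) can be illustrated with a minimal sketch. It is not the authors' implementation: the random embedding table stands in for GloVe, mean pooling stands in for the CNN, LSTM, or BERT encoder, concatenation is assumed as the fusion operator (the abstract does not specify one), and every name and dimension below is hypothetical. The weights are untrained, so only the data flow is meaningful.

```python
import math
import random

random.seed(0)

EMB_DIM = 4      # toy size; pretrained GloVe vectors come in 50 to 300 dimensions
HIDDEN = 8       # hypothetical hidden-layer width
NUM_CLASSES = 5  # SST-5 has five sentiment labels

_embeddings = {}  # hypothetical stand-in for a pretrained GloVe lookup table


def embed(token):
    """Return a fixed random vector per token (placeholder for GloVe)."""
    if token not in _embeddings:
        _embeddings[token] = [random.uniform(-1, 1) for _ in range(EMB_DIM)]
    return _embeddings[token]


def encode(text):
    """Mean-pool token vectors (placeholder for a CNN, LSTM, or BERT encoder)."""
    vectors = [embed(tok) for tok in text.lower().split()]
    if not vectors:
        return [0.0] * EMB_DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fuse(text_feat, knowledge_feat):
    """Fuse by concatenation; the fusion operator is an assumption here."""
    return text_feat + knowledge_feat


def linear(x, weights, bias):
    """Apply a dense layer: one dot product plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_out, n_in):
    """Create untrained random weights and zero biases."""
    weights = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


W1, B1 = init_layer(HIDDEN, 2 * EMB_DIM)
W2, B2 = init_layer(NUM_CLASSES, HIDDEN)


def classify(text, knowledge):
    """Forward pass: encode both inputs, fuse, then apply a two-layer MLP."""
    fused = fuse(encode(text), encode(knowledge))
    hidden = [max(0.0, h) for h in linear(fused, W1, B1)]  # ReLU
    logits = linear(hidden, W2, B2)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


# Example inputs are invented for illustration only.
probs = classify("the film is a quiet triumph", "triumph : a great success")
```

Here `probs` is a probability distribution over the five classes. In a real system the encoder and the perceptron would be trained jointly, and the external knowledge string would come from whatever knowledge source is paired with each sample.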

Keywords: deep learning; neural networks; text classification; knowledge enhancement; feature extraction

Authors: Zhang Bolun (张博伦), Zhao Yahui (赵亚慧), Jiang Kexin (姜克鑫), Lu Xinghua (卢星华)


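College of Integration Science (融合学院), Yanbian University, Yanji, Jilin 133002, China

College of Engineering, Yanbian University, Yanji, Jilin 133002, China

College of Foreign Languages, Yanbian University, Yanji, Jilin 133002, China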


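Funding: State Language Commission "13th Five-Year Plan" Research Project (YB135-76); Yanbian University Foreign Languages and Literature First-Class Discipline Construction Project (18YLPY13)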

2024

Journal of Yanbian University (Natural Science Edition) [延边大学学报(自然科学版)]
Publisher: Yanbian University


Impact factor: 0.388
ISSN: 1004-4353
Year, volume (issue): 2024, 50(2)