Research on Text Semantic Enhancement Topic Crawler Integrating BTM and TextCNN (融合BTM与TextCNN的文本语义增强主题爬虫研究)

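Abstract: In an information age of massive data, retrieving the required information efficiently and accurately is a major challenge, and topic crawlers are an effective way to collect information about a specific domain. Common topic-similarity computations rely on word-level feature representations and ignore the topic features of the text as a whole, which degrades the precision and recall of the crawler system. To address this, a topic crawler that fuses the BTM and TextCNN models is proposed. The content-topic discrimination module is treated as a text classification problem: the text topic vectors obtained from BTM are fused with Word2vec word vectors to enrich the semantic information of the text, and a convolutional neural network is used to improve the accuracy of the discrimination module, compensating for the insufficient text feature representation of conventional convolutional neural network classifiers. Experimental results show that, on the open-source news text classification dataset (THUCNews) and a self-crawled dataset of real papers, the fused BTM and TextCNN model achieves average classification precision of 93.7% and 91.3% on the test sets, which is 0.6 and 1.3 percentage points higher, respectively, than TextCNN alone.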

English title: Research on Text Semantic Enhancement Topic Crawler Integrating BTM and TextCNN

English abstract: In an information era with massive amounts of data, efficiently and accurately retrieving the information we require is a huge challenge. Topic crawlers are an effective way to obtain information about a particular domain. General topic similarity computation works at the word granularity level and ignores the feature expression of the text as a whole, which harms both the precision and the recall of the crawler system. To solve this problem, a topic crawler method based on BTM and TextCNN is proposed, in which the content topic discrimination module is treated as a text classification problem. The semantic information of the text is enhanced by fusing the text topic vector from BTM with Word2vec word vectors. The method uses a convolutional neural network to improve the accuracy of the discrimination module, which alleviates the inadequate representation of text features in conventional convolutional neural network classification models. The experimental results show that the average classification precision of the fused BTM and TextCNN model on the test sets of the open-source news text classification dataset (THUCNews) and the real paper dataset is 93.7% and 91.3%, respectively, an increase of 0.6 and 1.3 percentage points over the TextCNN benchmark model.
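The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the BTM topic vector and the Word2vec word vectors are combined, so the concatenation used here (appending the document-level topic vector to every word vector) is an assumption, as are all function names, dimensions, and the random stand-in values. In a real system the topic vector would come from a trained BTM (biterm topic model) and the word vectors from a trained Word2vec model.

```python
import math
import random


def fuse(word_vectors, topic_vector):
    """Append the document-level topic vector to every word vector.

    Concatenation is an assumed fusion strategy, not one confirmed by the source.
    """
    return [list(w) + list(topic_vector) for w in word_vectors]


def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide one filter over the token axis, apply ReLU, then max-over-time pooling.

    Returns 0.0 when the text is shorter than the filter height.
    """
    height = len(kernel)
    best = 0.0  # ReLU floor: negative activations never win
    for start in range(len(matrix) - height + 1):
        activation = bias
        for row, kernel_row in zip(matrix[start:start + height], kernel):
            activation += sum(x * k for x, k in zip(row, kernel_row))
        best = max(best, activation)
    return best


def textcnn_features(matrix, filters):
    """One pooled feature per filter, as in a TextCNN-style feature layer."""
    return [conv_max_pool(matrix, kernel) for kernel in filters]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Linear layer plus softmax over the topic classes."""
    logits = [sum(f * w for f, w in zip(features, row)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Hypothetical sizes and random stand-ins for trained parameters.
rng = random.Random(0)
word_dim, n_topics, n_tokens, n_classes = 4, 3, 6, 2
words = [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(n_tokens)]
topic = [0.7, 0.2, 0.1]  # stand-in for a BTM document-topic distribution

fused = fuse(words, topic)
filters = [[[rng.uniform(-1, 1) for _ in range(word_dim + n_topics)]
            for _ in range(height)] for height in (2, 3, 4)]
features = textcnn_features(fused, filters)
weights = [[rng.uniform(-1, 1) for _ in features] for _ in range(n_classes)]
probs = classify(features, weights, [0.0] * n_classes)
print(len(fused[0]), probs)
```

Each fused row has `word_dim + n_topics` columns, so every convolution window sees both local word information and the topic signal for the whole text. Other fusion choices, such as concatenating the topic vector with the pooled features instead, would fit the abstract equally well.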

English keywords:

topic crawler; topic similarity; TextCNN; BTM; Word2vec

Authors:

Ai Fangju (艾芳菊), Yin Xiaoyin (尹虓寅)

Affiliation:

School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China

Chinese keywords:

主题爬虫 (topic crawler); 主题相似度 (topic similarity); TextCNN; BTM; Word2vec

Funding:

Open Fund of the Hubei Key Laboratory of Big Data in Science and Technology (科技大数据湖北省重点实验室开放基金)

Project number:

E1KF291005

Publication year:

2024

DOI:

10.11907/rjdk.231116

Software Guide (软件导刊)

Hubei Information Society (湖北省信息学会)

Impact factor: 0.524

ISSN: 1672-7800

Year, volume (issue): 2024, 23(3)

Number of references: 19