Unsupervised Keyword Extraction Algorithm Integrating Semantic Features

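Abstract: Traditional keyword extraction algorithms based on the word graph model lack an understanding of text semantics. To address this, the paper proposes an unsupervised keyword extraction algorithm that integrates semantic features. The method combines word embedding techniques with the word graph model, so that both the semantic information and the word order information of a text are incorporated into the traditional word graph algorithm. First, Word2vec and Doc2vec models are used to build vector representations of words and of the text, respectively, capturing the text's word order information. Next, the semantic similarity between each candidate word and the text is computed from the word vectors. This similarity is used to improve the TextRank algorithm by reassigning the edge weights between candidate keywords and their initial values. The corresponding restart probability matrix and transition probability matrix are then constructed, and the word graph model iterates over them to score the candidate words and extract the keywords. Experimental results show that effectively fusing the semantic information and word order information of a text improves the accuracy of keyword extraction.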

English title: Unsupervised Keyword Extraction Algorithm Integrating Semantic Features

English abstract: Traditional keyword extraction algorithms based on the word graph model lack text semantic information. To deal with this problem, an unsupervised keyword extraction algorithm integrating semantic features is proposed. The method combines word embedding technology with the word graph model to integrate the semantic information and word order information of the text into the traditional word graph algorithm. Firstly, Word2vec and Doc2vec models are used to represent words and text respectively, and the word order information of the text is obtained. Then, the semantic similarity between candidate words and the text is calculated from the word vectors, and the TextRank algorithm is improved by redistributing the edge weights and initial values of the candidate keywords. In addition, the corresponding restart probability matrix and transition probability matrix are constructed so that the word graph model can iteratively calculate candidate word scores and extract keywords. The experimental results show that effectively fusing the semantic information and word order information of the text can improve the accuracy of keyword extraction.
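The abstract gives no formulas, so the following is only a minimal sketch of how a semantically biased TextRank of this general kind might look, not the authors' implementation. It assumes that edge weights are window co-occurrences scaled by word-vector cosine similarity, that the restart distribution and the initial scores are each word's cosine similarity to a document vector, and that word and document vectors share one dimension. The function name `semantic_textrank` and the parameters `window` and `damping` are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


def semantic_textrank(tokens, word_vecs, doc_vec, window=3, damping=0.85,
                      tol=1e-8, max_iter=200):
    """Score candidate words with a restart-biased random walk on a word graph.

    tokens    : candidate words in document order
    word_vecs : dict mapping each word to its embedding (e.g. from Word2vec)
    doc_vec   : document embedding of the same dimension (assumption)
    """
    words = sorted(set(tokens))
    index = {w: i for i, w in enumerate(words)}
    n = len(words)

    # Edge weights (assumed form): each co-occurrence of two distinct words
    # within `window` consecutive tokens adds their clipped cosine similarity.
    weights = np.zeros((n, n))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:
                sim = max(cosine(word_vecs[w], word_vecs[v]), 0.0)
                weights[index[w], index[v]] += sim
                weights[index[v], index[w]] += sim

    # Restart distribution (assumed form): normalized word-document similarity.
    restart = np.array([max(cosine(word_vecs[w], doc_vec), 0.0) for w in words])
    total = restart.sum()
    restart = restart / total if total > 0 else np.full(n, 1.0 / n)

    # Column-stochastic transition matrix; a node with no weighted edges
    # redistributes its score according to the restart distribution.
    col_sums = weights.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    transition = np.where(col_sums > 0, weights / safe_sums, restart[:, None])

    # Power iteration, starting from the restart distribution.
    scores = restart.copy()
    for _ in range(max_iter):
        updated = damping * (transition @ scores) + (1.0 - damping) * restart
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return {w: float(scores[index[w]]) for w in words}
```

Sorting the returned dictionary by value in descending order and keeping the top entries would give the extracted keywords under these assumptions. With real data, the vectors would come from trained Word2vec and Doc2vec models rather than being supplied by hand.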

English keywords:

keyword extraction; semantic information; word order information; vector representation; TextRank algorithm

Authors:

赵长路、刘军、胡佳、胡宝权

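Affiliations:

School of Mechanical and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050

School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028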

Keywords:

keyword extraction; semantic information; word order information; vector representation; TextRank algorithm

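Funding:

National Natural Science Foundation of China; National Key R&D Program of China (Ministry of Science and Technology); Hongliu First-Class Discipline Construction Project of Lanzhou University of Technology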

Project numbers:

71861025; 2018YFB1703105

Publication year:

2024

DOI:

10.3969/j.issn.1672-9722.2024.07.001

Computer & Digital Engineering

709th Research Institute, China Shipbuilding Industry Corporation

CSTPCD

Impact factor: 0.355

ISSN: 1672-9722

Year, volume (issue): 2024, 52(7)