Unsupervised Keyword Extraction Algorithm Integrating Semantic Features

Traditional keyword extraction algorithms based on the word graph model lack an understanding of text semantics. To address this, an unsupervised keyword extraction algorithm integrating semantic features is proposed. The method combines word embedding techniques with the word graph model, so that both the semantic information and the word order information of the text are incorporated into the traditional word graph algorithm. First, Word2vec and Doc2vec models are used to produce vector representations of the words and of the text, respectively, which captures the word order information of the text. Next, the semantic similarity between each candidate word and the text is computed from the word vectors. The TextRank algorithm is then improved by reassigning the edge weights between candidate keywords and their initial values, and the corresponding restart probability matrix and transition probability matrix are constructed so that the word graph model can iteratively compute candidate word scores and extract keywords. Experimental results show that effectively fusing the semantic information and word order information of the text improves the accuracy of keyword extraction.
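
The abstract does not give the exact weighting formulas, so the following is only a minimal sketch of the general idea (a TextRank-style random walk with restart biased by word-document similarity), not the authors' implementation. The function names `cosine` and `semantic_textrank`, the edge weight (co-occurrence scaled by `1 + similarity`), the restart distribution (clipped, normalised cosine similarity), and the toy vectors are all assumptions made for illustration. In practice the vectors would come from trained Word2vec and Doc2vec models.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_textrank(tokens, word_vecs, doc_vec, window=2, damping=0.85,
                      max_iter=100, tol=1e-8):
    """Score candidate words with a random walk with restart on a word graph.

    Assumed (hypothetical) design, not taken from the paper:
      * edge weight = co-occurrence count * (1 + clipped word-word cosine)
      * restart probability proportional to clipped word-document cosine
      * initial scores equal to the restart distribution
    """
    words = sorted(set(tokens))

    # Undirected edges between distinct words whose positions differ by < window.
    weight = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            v = tokens[j]
            if v != w:
                sim = max(cosine(word_vecs[w], word_vecs[v]), 0.0)
                weight[(w, v)] += 1.0 + sim
                weight[(v, w)] += 1.0 + sim

    # Row sums used to turn edge weights into transition probabilities.
    out_sum = {w: 0.0 for w in words}
    for (u, _), wt in weight.items():
        out_sum[u] += wt

    # Restart distribution from word-document semantic similarity.
    raw = {w: max(cosine(word_vecs[w], doc_vec), 0.0) + 1e-12 for w in words}
    total = sum(raw.values())
    restart = {w: raw[w] / total for w in words}

    scores = dict(restart)
    for _ in range(max_iter):
        # Mass sitting on words without outgoing edges is sent back via restart.
        dangling = sum(scores[w] for w in words if out_sum[w] == 0.0)
        new = {w: (1.0 - damping + damping * dangling) * restart[w] for w in words}
        for (u, v), wt in weight.items():
            new[v] += damping * scores[u] * wt / out_sum[u]
        delta = sum(abs(new[w] - scores[w]) for w in words)
        scores = new
        if delta < tol:
            break
    return scores


# Toy usage with made-up 2-dimensional vectors.
tokens = ["graph", "keyword", "model"]
word_vecs = {"graph": [1.0, 0.0], "keyword": [1.0, 1.0], "model": [0.0, 1.0]}
doc_vec = [1.0, 0.2]
scores = semantic_textrank(tokens, word_vecs, doc_vec)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In this toy graph "graph" and "model" occupy structurally identical positions, so any difference in their scores comes only from the restart term, that is, from how close each word vector is to the document vector.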

keyword extraction; semantic information; word order information; vector representation; TextRank algorithm

赵长路、刘军、胡佳、胡宝权

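School of Mechanical and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050

School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028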

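National Natural Science Foundation of China (71861025); National Key R&D Program of China, Ministry of Science and Technology (2018YFB1703105); Lanzhou University of Technology Hongliu First-Class Discipline Construction Project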

2024

Computer & Digital Engineering
No. 709 Research Institute, China Shipbuilding Industry Corporation

CSTPCD
Impact factor: 0.355
ISSN: 1672-9722
Year, volume (issue): 2024, 52(7)