Unsupervised keyword extraction method based on text summarization

You Zeshun (尤泽顺)¹, Zhou Xi (周喜)¹, Dong Rui (董瑞)¹, Zhang Yangning (张洋宁)², Yang Fengyi (杨奉毅)¹

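Author information

1. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, Xinjiang 830011; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049; Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, Xinjiang 830011
2. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, Xinjiang 830011; College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi, Xinjiang 830052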

Abstract

Embedding-based keyword extraction methods lose accuracy on long documents. To address this problem, a summarization-based keyword extraction approach, summarization-based document embedding rank (SDERank), is proposed. The document embedding is computed as a weighted sum of sentence vectors, with each sentence weighted by its semantic relevance to the document topic. Existing embedding-based methods also ignore the relations between candidate words when selecting keywords. To address this, SDERank+, an improved version of SDERank, uses the PageRank algorithm to extract co-occurrence weights between candidate words, and these weights are used to correct the original similarity scores. Experimental results on four widely used datasets show that SDERank and SDERank+ achieve F1 scores that are on average 2.2% and 3.29% higher, respectively, than those of MDERank, the previous best model.
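The two ideas in the abstract, a relevance-weighted document embedding and a PageRank-based correction of similarity scores, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several choices here are assumptions made only for illustration: the document topic is approximated by the mean sentence vector, sentence weights come from a softmax over cosine similarity to that topic, candidates are scored by cosine similarity to the document embedding, and the PageRank score is applied through a hypothetical multiplicative factor `alpha`. All function names are hypothetical, and real sentence and phrase vectors would come from a pretrained encoder rather than hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def document_embedding(sentence_vecs):
    """Return (document vector, sentence weights).

    Assumption: the topic is the mean sentence vector, and the weights are
    a softmax over each sentence's cosine similarity to that topic.
    """
    n, dim = len(sentence_vecs), len(sentence_vecs[0])
    topic = [sum(s[d] for s in sentence_vecs) / n for d in range(dim)]
    exps = [math.exp(cosine(s, topic)) for s in sentence_vecs]
    total = sum(exps)
    weights = [e / total for e in exps]
    doc = [sum(w * s[d] for w, s in zip(weights, sentence_vecs))
           for d in range(dim)]
    return doc, weights


def pagerank(cooccurrence, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted co-occurrence matrix."""
    n = len(cooccurrence)
    out_weight = [sum(row) for row in cooccurrence]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if out_weight[j] > 0:
                    incoming += rank[j] * cooccurrence[j][i] / out_weight[j]
                else:
                    incoming += rank[j] / n  # isolated node: spread uniformly
            new_rank.append((1 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


def rank_candidates(candidates, candidate_vecs, sentence_vecs,
                    cooccurrence=None, alpha=0.5):
    """Rank candidates by similarity to the weighted document embedding.

    If a co-occurrence matrix is given, each score is scaled by a
    PageRank-based factor (a hypothetical combination rule).
    """
    doc, _ = document_embedding(sentence_vecs)
    scores = [cosine(v, doc) for v in candidate_vecs]
    if cooccurrence is not None:
        pr = pagerank(cooccurrence)
        top = max(pr)
        scores = [s * (1 + alpha * p / top) for s, p in zip(scores, pr)]
    return sorted(zip(candidates, scores), key=lambda pair: -pair[1])


if __name__ == "__main__":
    sentences = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy sentence vectors
    words = ["keyword", "extraction", "weather"]
    word_vecs = [[1.0, 0.1], [0.8, 0.3], [0.0, 1.0]]  # toy candidate vectors
    cooc = [[0, 3, 0], [3, 0, 1], [0, 1, 0]]          # toy co-occurrence counts
    for word, score in rank_candidates(words, word_vecs, sentences, cooc):
        print(f"{word}: {score:.3f}")
```

In this toy input the third sentence points away from the other two, so it receives the smallest weight and contributes least to the document embedding. That is the intended effect of weighting sentences by topic relevance in long documents.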

Keywords

automatic keyword extraction/text summarization/long document modeling/document topic analysis/semantic processing/weight optimization/vector similarity

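Funding

Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01E04)

Major Science and Technology Project of Xinjiang Uygur Autonomous Region (2020A02001-1)

Western Young Scholars Program of the Chinese Academy of Sciences (2019-XBQNXZ-B-008)

Youth Innovation Promotion Association of the Chinese Academy of Sciences (2021436)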

Publication year

2024

Computer Engineering and Design

Institute 706, Second Academy of China Aerospace Science and Industry Corporation

CSTPCD, Peking University Core Journal

Impact factor: 0.617

ISSN: 1000-7024
