Semantic Information Extraction Methods for Review Papers

Abstract: [Objective] To fully mine the semantic information content of review papers, this study proposes a system of information elements, formally defines the corresponding extraction tasks, and constructs an information extraction framework for them. [Methods] Review papers are highly specialized, their terms are sparsely distributed, and they are difficult to annotate. To address these problems, we apply multi-task learning so that annotated data from different tasks complement one another, and we introduce self-supervised learning to exploit latent information in unlabeled data. [Results] The proposed framework significantly improves performance on every task; in particular, accuracy on the inter-element relation recognition task rises by 8.32 percentage points. Self-supervised learning further raises the overall F1 score by about 2 percentage points. [Limitations] The information extraction process does not consider non-textual data such as images and tables. [Conclusions] This study proposes a method and workflow for mining the semantic information content of review papers, and introduces multi-task and self-supervised learning to improve mining performance using cross-task labeled data and unlabeled data.

Keywords:

Information Extraction; Reading Comprehension; Multi-Task Learning; Self-Supervised Learning

Authors:

Hu Maodi (胡懋地), Yu Qianqian (于倩倩), Qian Li (钱力), Chang Zhijun (常志军), Zhang Zhixiong (张智雄)

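Author affiliations:

National Science Library, Chinese Academy of Sciences, Beijing 100190

Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190

Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, National Press and Publication Administration, Beijing 100190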

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0828

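Data Analysis and Knowledge Discovery (数据分析与知识发现)

National Science Library, Chinese Academy of Sciences

CSTPCD, CSSCI, CHSSCD, PKU Core (北大核心), EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(11)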