Semantic Information Extraction Methods for Review Papers (面向综述论文的语义情报内容挖掘方法研究)

[Objective] To fully mine the semantic information content of review papers, this study proposes a system of information elements, formally defines the corresponding extraction tasks, and builds an information extraction framework for them. [Methods] Review papers are highly specialized, their terms are sparsely distributed, and they are difficult to annotate. To address these issues, we apply multi-task learning so that annotated data from different tasks complement one another, and we introduce self-supervised learning to exploit latent information in unlabeled data. [Results] The proposed framework significantly improved performance on every task; in particular, accuracy on the inter-element relation recognition task rose by 8.32 percentage points. Self-supervised learning further raised the overall F1 score by about 2 percentage points. [Limitations] The information extraction process does not consider non-textual data such as figures and tables. [Conclusions] This study presents a workflow for mining the semantic information content of review papers and introduces multi-task and self-supervised learning, which use cross-task annotated data and unlabeled data to improve mining performance.

Keywords: Information Extraction; Reading Comprehension; Multi-Task Learning; Self-Supervised Learning

胡懋地、于倩倩、钱力、常志军、张智雄

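National Science Library, Chinese Academy of Sciences, Beijing 100190

Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190

Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, National Press and Publication Administration, Beijing 100190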

2024

Data Analysis and Knowledge Discovery (数据分析与知识发现)
National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core (北大核心), EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(11)