Identification of Problem and Method in Scientific Papers Based on Formulaic Expression Desensitization and Enhanced Boundary Recognition

Zhang Yingyi (张颖怡)¹, Zhang Chengzhi (章成志)²

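Author information

1. Department of Archives and E-Government, School of Society, Soochow University, Suzhou 215123
2. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094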

Abstract

Research problems and methods are core components of scientific papers and matter for organizing, managing, and retrieving papers and for evaluating research output. To alleviate two weaknesses of existing problem and method identification, namely dependence on formulaic expressions and errors in recognizing word boundaries, we propose a model that combines formulaic expression desensitization with enhanced boundary recognition. Specifically, formulaic expression desensitization is implemented through data augmentation, and boundary recognition is strengthened by combining a pointer network with a sequence labeling model. As more scientific papers become open access, researchers increasingly use full texts for entity recognition. To demonstrate the necessity of using full texts, we manually construct annotated abstract and full-text datasets in the field of natural language processing, and design numerical and content-based metrics to compare the two datasets in terms of problem and method recognition results and problem-method relation pair extraction results. Ten-fold cross-validation shows that the proposed model outperforms the SciBERT-BiLSTM-CRF baseline by 3.69 percentage points in macro-averaged F1, and the difference is statistically significant. Comparing entity recognition and relation pair extraction results between abstracts and full texts shows that problem and method entities in abstracts are broader in meaning, whereas full texts contain more entities and relation pairs describing model design and training details.
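The reported gain is stated in macro-averaged F1, which averages the per-type F1 scores so that problem and method entities count equally regardless of how often each occurs. The sketch below illustrates that metric only. It assumes exact-match scoring of spans, which the abstract does not specify, and the label names `Problem` and `Method`, the span layout, and the toy data are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed span layout: (sentence_id, start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, int, str]


def f1_for_type(gold: Set[Span], pred: Set[Span], entity_type: str) -> float:
    """Exact-match F1 for one entity type (the matching criterion is an assumption)."""
    gold_t = {s for s in gold if s[3] == entity_type}
    pred_t = {s for s in pred if s[3] == entity_type}
    true_pos = len(gold_t & pred_t)
    if true_pos == 0:
        # Covers empty gold, empty predictions, and no overlap.
        return 0.0
    precision = true_pos / len(pred_t)
    recall = true_pos / len(gold_t)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Set[Span], pred: Set[Span],
             entity_types: Iterable[str] = ("Problem", "Method")) -> float:
    """Unweighted mean of the per-type F1 scores."""
    types = list(entity_types)
    return sum(f1_for_type(gold, pred, t) for t in types) / len(types)


# Toy data: one Method span is predicted with a wrong right boundary.
gold = {(0, 0, 3, "Problem"), (0, 5, 8, "Method"), (1, 2, 4, "Method")}
pred = {(0, 0, 3, "Problem"), (0, 5, 7, "Method"), (1, 2, 4, "Method")}
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.2f}")
```

Under exact matching, a single boundary error counts as both a false positive and a false negative, which is one reason boundary recognition has a direct effect on this metric.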

Key words

knowledge entity recognition/identification of problem and method/pointer networks/data augmentation

Funding

National Natural Science Foundation of China (72074113)

Publication year

2024

Journal of the China Society for Scientific and Technical Information (情报学报)

China Society for Scientific and Technical Information; Institute of Scientific and Technical Information of China

Indexed in: CSTPCD/CSSCI/CHSSCD/Peking University Core Journals

Impact factor: 1.296

ISSN: 1000-0135

Number of references: 60
