中文学术论文全文语步识别研究

Identifying Moves in Full-Text Chinese Academic Papers

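Abstract (translated from the Chinese): [Objective] Existing research on move recognition in academic papers handles only a small number of moves, works at a coarse granularity, and lacks publicly available move classification datasets. To address these problems, this study investigates move recognition over the full text of academic papers, providing a foundation for machines to understand paper content automatically. [Methods] Based on the BERT model, a move classification dataset for academic papers is constructed through multi-stage fine-tuning, and a move recognition method that incorporates section title text is proposed, enabling fine-grained recognition of moves across the full text of Chinese academic papers. [Results] Experiments show that, on the 22-class move classification task, the overall accuracy of the RoBERTa-wwm-ext model improves by 0.031 to 0.909, and the Micro-F1 score improves by 0.022 to 0.837. [Limitations] The constructed move classification dataset still has a small amount of data imbalance, and the proposed method is limited by the quality of the papers; once these issues are addressed, the model's ability to recognize moves should improve further. [Conclusions] The proposed method achieves high move recognition accuracy. The results can be applied to the automatic understanding of academic papers, paper quality evaluation, and semantic retrieval of papers, and are important for the effective use of scientific and technical literature.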

English abstract: [Objective] This paper investigates the recognition of moves in full-text academic papers, establishing a foundation for automatically understanding paper content. Existing research on move recognition in academic papers processes only a small number of moves at coarse granularity, and few open datasets exist for move classification. [Methods] Based on the BERT model, we constructed a move classification dataset of academic papers with multi-stage fine-tuning. We then proposed a move recognition model that incorporates section titles to recognize moves at a fine-grained level. [Results] For the 22-class classification task, the overall accuracy of the RoBERTa-wwm-ext model increased by 0.031 to 0.909, and the Micro-F1 improved by 0.022 to 0.837. [Limitations] The constructed corpus contains a small amount of unbalanced data, and the quality of the input papers affects the proposed model's performance. [Conclusions] The proposed model benefits the automatic understanding of academic papers, research quality evaluation, and semantic content retrieval, which play important roles in using scientific and technological literature.
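The abstract states that section title text is incorporated into move recognition but does not say how. One common way to feed two text segments to a BERT-style classifier is to pack them as a sentence pair, and the sketch below illustrates that idea only. The helper `build_pair_input`, the character-level tokenization, and the `max_len` parameter are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_input(
    section_title: str, sentence: str, max_len: int = 128
) -> Tuple[List[str], List[int]]:
    """Pack a section title and a sentence into one BERT-style sequence.

    Hypothetical helper: single characters stand in for the output of a
    real tokenizer, which is a rough approximation for Chinese text.
    """
    # Three positions are reserved for [CLS] and the two [SEP] markers.
    title_tokens = list(section_title.replace(" ", ""))[: max_len - 3]
    sent_tokens = list(sentence.replace(" ", ""))

    # Truncate the sentence so the whole sequence fits in max_len.
    budget = max_len - 3 - len(title_tokens)
    sent_tokens = sent_tokens[: max(budget, 0)]

    tokens = [CLS] + title_tokens + [SEP] + sent_tokens + [SEP]
    # Segment 0 covers [CLS], the title, and the first [SEP];
    # segment 1 covers the sentence and the final [SEP].
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(sent_tokens) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    tokens, segment_ids = build_pair_input("实验结果", "模型准确率达到0.909", max_len=16)
    print(tokens)
    print(segment_ids)
```

In this sketch the title occupies the first segment, so a classifier head on the `[CLS]` position could attend to both the section context and the sentence. Whether the paper uses this layout, simple concatenation, or a separate title encoder is not stated in the abstract.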

English keywords:

Academic Papers Understanding; Move Recognition; Pre-trained Model

Authors:

杜新玉 (Du Xinyu), 李宁 (Li Ning)

Affiliation:

Computer School, Beijing Information Science and Technology University (北京信息科技大学计算机学院), Beijing 100101

Keywords (Chinese):

学术论文理解 (academic paper understanding); 语步识别 (move recognition); 预训练模型 (pre-trained model)

Funding:

National Natural Science Foundation of China

Grant number:

61672105

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2022.1284

数据分析与知识发现 (Data Analysis and Knowledge Discovery)

中国科学院文献情报中心 (National Science Library, Chinese Academy of Sciences)

CSTPCD; CSSCI; CHSSCD; 北大核心 (PKU Core); EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(2)

Number of references: 32