
Research on Full-Text Move Recognition in Chinese Academic Papers

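[Objective] Existing research on move recognition in academic papers handles only a small number of moves, works at a coarse granularity, and lacks publicly available move classification datasets. To address these problems, this study investigates full-text move recognition in academic papers, providing a basis for machines to understand paper content automatically. [Methods] Based on the BERT model, a move classification dataset for academic papers is constructed through multi-stage fine-tuning, and a move recognition method that incorporates section title text is proposed, achieving fine-grained recognition of moves across the full text of Chinese academic papers. [Results] Experimental results show that, on the 22-class move classification task for academic papers, the overall accuracy of the RoBERTa-wwm-ext model improves by 0.031 to 0.909, and the Micro-F1 score improves by 0.022 to 0.837. [Limitations] The constructed move classification dataset still has a small degree of data imbalance, and the proposed method is limited by the quality of the papers; once these issues are addressed, the model's ability to recognize moves should improve further. [Conclusions] The proposed method achieves high move recognition accuracy. The results can be applied to the automatic understanding of academic papers, paper quality evaluation, and semantic retrieval of papers, and are important for the effective use of scientific and technical literature.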
Identifying Moves in Full-Text Chinese Academic Papers
[Objective] This paper investigates the recognition of moves in full-text academic papers, laying a foundation for automatically understanding paper content. Existing research on move recognition in academic papers processes only a small number of moves at coarse granularity, and few open datasets exist for move classification. [Methods] Based on the BERT model, we constructed a move classification dataset of academic papers with multi-stage fine-tuning. We then proposed a move recognition model that incorporates section titles to recognize moves at a fine-grained level. [Results] For the 22-class classification task, the overall accuracy of the RoBERTa-wwm-ext model increased by 0.031 to 0.909, and the Micro-F1 improved by 0.022 to 0.837. [Limitations] The constructed corpus contains a small amount of unbalanced data, and the performance of the proposed model is affected by the quality of the papers. [Conclusions] The proposed model benefits the automatic understanding of academic papers, research quality evaluation, and semantic content retrieval, which play important roles in using scientific and technological literature.
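The abstract does not say how section titles are fed to the model. One common option, shown here purely as an illustrative sketch and not as the authors' implementation, is to encode the section title and the target sentence as a BERT-style sentence pair. The function names, the character-level toy tokenizer, and the `max_len` default below are all assumptions made for this example.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def build_pair_input(
    section_title: str, sentence: str, max_len: int = 128
) -> Tuple[List[str], List[int]]:
    """Build a BERT-style pair input: [CLS] title [SEP] sentence [SEP].

    Returns the token list and the matching segment ids
    (0 for the title segment, 1 for the sentence segment).
    """
    title_tokens = char_tokenize(section_title)
    sent_tokens = char_tokenize(sentence)

    # Three positions are reserved for [CLS] and the two [SEP] markers.
    # The title is kept whole; only the sentence is truncated.
    budget = max_len - 3 - len(title_tokens)
    if budget < 0:
        raise ValueError("max_len is too small to hold the section title")
    sent_tokens = sent_tokens[:budget]

    tokens = [CLS] + title_tokens + [SEP] + sent_tokens + [SEP]
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(sent_tokens) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    tokens, segment_ids = build_pair_input("方法", "基于BERT模型")
    print(tokens)
    print(segment_ids)
```

In a real pipeline the toy tokenizer would be replaced by the tokenizer that ships with the chosen pre-trained model, and the pair input would be passed to a classifier with 22 output labels.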

Academic Papers Understanding; Move Recognition; Pre-trained Model

Du Xinyu, Li Ning


Computer School, Beijing Information Science and Technology University, Beijing 100101

academic paper understanding; move recognition; pre-trained model

National Natural Science Foundation of China (Grant No. 61672105)

2024

Data Analysis and Knowledge Discovery
National Science Library, Chinese Academy of Sciences


Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core; EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(2)