Research on an Academic Paper Move Recognition Method with ChatGPT Data Augmentation

Xu Qinya (许钦亚)¹, Xue Qiuhong (薛秋红)², Qian Li (钱力)³, Liu Huizhou (刘会洲)⁴, Liu Lujing (刘鲁静)⁵

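Author information

1. National Science Library, Chinese Academy of Sciences, Beijing 100190; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
2. Information Science Academy, China Electronics Technology Group Corporation, Beijing 100086
3. National Science Library, Chinese Academy of Sciences, Beijing 100190; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190; Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, National Press and Publication Administration, Beijing 100190
4. Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190
5. Institute of Technology for Carbon Neutrality, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055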

Abstract

[Purpose/Significance] The move structure of an academic paper helps readers understand its content in depth and quickly locate key information. This study investigates methods for full-text move recognition so that the core content of academic papers can be captured quickly, thereby advancing intelligent semantic retrieval. [Method/Process] Building on existing research on move recognition methods, this study proposes SciBERT-HAMI, a fine-grained move recognition model that integrates ChatGPT data augmentation with a pre-trained language model. Starting from the original text, the model uses the ChatGPT large language model to expand the corpus, increasing the diversity and volume of the training data. A hierarchical neural network learns semantic feature representations of the paper at the "word-sentence-section" levels to capture semantic information at different granularities. SciBERT word embeddings serve as the input, and the hierarchical neural network is trained with the FocalLoss loss function for fine-grained move recognition. [Result/Conclusion] With the ChatGPT data augmentation strategy, the SciBERT-HAMI-DA model achieves F1 scores of 73.1% and 74.1% (0.731 and 0.741) on the CoreSC and AZ datasets, respectively. Comparative experiments show that the proposed model improves performance on fine-grained move recognition over full-text academic papers, and ablation experiments verify the effectiveness of both the data augmentation and the model components. Integrating a pre-trained language model with ChatGPT data augmentation improves the predictive performance of the full-text move recognition model, which helps advance the automation and intelligence of academic research.
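The abstract names FocalLoss as the training loss but gives no formula or hyperparameters. As a rough illustration only, the following is a minimal pure-Python sketch of the standard multi-class focal loss, FL(p_t) = -α (1 - p_t)^γ log(p_t). The function names and the default values `gamma=2.0` and `alpha=1.0` are assumptions made for this sketch, not settings reported by the authors.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    logits: list of raw scores, one per move class
    target: index of the true class
    gamma, alpha: hypothetical defaults, not taken from the paper
    """
    p_t = softmax(logits)[target]
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples,
    # so training focuses on hard or rare classes.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(batch_logits, targets, gamma=2.0, alpha=1.0):
    """Average focal loss over a batch of examples."""
    losses = [
        focal_loss(logits, target, gamma, alpha)
        for logits, target in zip(batch_logits, targets)
    ]
    return sum(losses) / len(losses)
```

With `gamma=0` and `alpha=1` this reduces to ordinary cross-entropy. Larger `gamma` values down-weight confident predictions more strongly, which is the usual motivation for using focal loss on imbalanced label sets such as fine-grained move categories.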

Key words

move recognition/ChatGPT/data augmentation/SciBERT

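Funding

Major Program of the National Social Science Fund of China (21&ZD329)

Publication year

2024

Library and Information Service (图书情报工作)

National Science Library, Chinese Academy of Sciences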

Indexed in: CSTPCD, CSSCI, CHSSCD, Peking University Core Journals

Impact factor: 2.203

ISSN: 0252-3116

Number of references: 4
