基于古籍大模型的无监督互文自动发现研究

Research on Unsupervised Automatic Intertextual Discovery Based on Large Models of Ancient Books

叶文豪¹, 胡蝶¹, 王东波¹, 周好¹, 刘浏¹


Author information

1. College of Information Management, Nanjing Agricultural University, Nanjing 210095

摘要

[目的/意义]针对先秦典籍这一高度互文的文本，建立一套无监督的互文自动发现流程，探索基于大语言模型来开展古籍内容分析与考据校勘工作，以提高工作效率。[方法/过程]对比分析现有语言模型的不同技术路线，并引入对比学习框架，采用无监督的方式训练互文自动发现模型，通过构建成语溯源任务检验系列模型效果，选择最优模型。[结果/结论]成语溯源的结果显示，当前对话大语言模型仍存在大量事实性错误。所构建的基于对比学习框架的古籍大模型ESimCSE-GujiRoBERTa在成语溯源任务中取得最优结果。此模型在先秦诸子典籍引用互文识别上展现优异的语义区分能力。同时，从春秋三传互文识别的结果来看，互文自动发现能够为古籍考据校勘工作提供有益参考。

Abstract

[Purpose/Significance] Pre-Qin classics are highly intertextual. This study establishes an unsupervised pipeline for automatically discovering intertextuality in these texts and explores the use of large language models for content analysis and for textual criticism and collation of ancient books, with the aim of improving work efficiency. [Method/Process] The study compares the technical routes of existing language models, introduces a contrastive learning framework to train intertextuality discovery models in an unsupervised manner, and constructs an idiom origin tracing task to evaluate the resulting models and select the best one. [Result/Conclusion] The idiom origin tracing results show that current conversational large language models still make many factual errors. ESimCSE-GujiRoBERTa, an ancient-book large model built on the contrastive learning framework, achieves the best results on the idiom origin tracing task. The model shows excellent semantic discrimination when identifying citation-based intertextuality among the classics of the pre-Qin masters. Results on intertextuality identification across the three commentaries on the Spring and Autumn Annals (Chun Qiu San Zhuan) further suggest that automatic intertextuality discovery can provide a useful reference for textual criticism and collation of ancient books.
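As a rough illustration of the two ideas named in the abstract (unsupervised contrastive training and similarity-based source tracing), the sketch below shows a generic in-batch contrastive (InfoNCE-style) loss and a cosine-similarity ranking step. It is not the authors' implementation: the function names `info_nce_loss` and `rank_candidates`, the temperature of 0.05, and the toy two-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a sentence encoder such as the ESimCSE-GujiRoBERTa model described above, and anything specific to the ESimCSE training recipe is omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Generic in-batch contrastive loss (illustrative, hypothetical helper).

    anchors[i] and positives[i] are two views of the same sentence;
    every other row of `positives` serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def rank_candidates(query, candidates):
    """Indices of `candidates`, most similar to `query` first."""
    return sorted(
        range(len(candidates)),
        key=lambda j: cosine(query, candidates[j]),
        reverse=True,
    )


# Toy embeddings (assumed values, not real model output).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]
shuffled = [[0.1, 1.0], [1.0, 0.1]]

# Correctly paired views give a much lower loss than mismatched ones.
print(info_nce_loss(anchors, aligned) < info_nce_loss(anchors, shuffled))

# Candidate source passages ranked against a query (e.g. an idiom).
print(rank_candidates([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]))
```

Under these assumptions, tracing an idiom to its source reduces to embedding the idiom and every candidate passage, then reading off the top-ranked candidates.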

关键词

古籍互文/语言模型/无监督/对比学习

Key words

ancient book intertextuality/language model/unsupervised learning/contrastive learning


Publication year

2024

图书情报工作 (Library and Information Service)

中国科学院文献情报中心 (National Science Library, Chinese Academy of Sciences)


CSTPCD/CSSCI/CHSSCD/北大核心

Impact factor: 2.203

ISSN: 0252-3116
