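图书情报工作 (Library and Information Service), 2024, Vol. 68, Issue 23: 41-51. DOI: 10.13266/j.issn.0252-3116.2024.23.004

Research on Unsupervised Automatic Intertextual Discovery Based on Large Models of Ancient Books
(Original title: 基于古籍大模型的无监督互文自动发现研究)

Ye Wenhao (叶文豪) 1, Hu Die (胡蝶) 1, Wang Dongbo (王东波) 1, Zhou Hao (周好) 1, Liu Liu (刘浏) 1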

Author information

  • 1. College of Information Management, Nanjing Agricultural University, Nanjing 210095

Abstract

[Purpose/Significance] Pre-Qin classics are highly intertextual. This study establishes an unsupervised pipeline for automatic intertextual discovery and explores the use of large language models for content analysis, textual research, and collation of ancient books, with the aim of improving work efficiency. [Method/Process] The study compares the technical routes of existing language models, introduces a contrastive learning framework, and trains intertextual discovery models in an unsupervised manner. An idiom origin tracing task is constructed to evaluate the resulting series of models and select the best one. [Result/Conclusion] The idiom origin tracing results show that current conversational large language models still make a large number of factual errors. ESimCSE-GujiRoBERTa, the ancient-book model built on the contrastive learning framework, achieves the best results on the idiom origin tracing task. The model shows excellent semantic discrimination in identifying citation intertextuality among the classics of the pre-Qin masters. In addition, the results of intertextual identification across the three commentaries on the Spring and Autumn Annals ("Chun Qiu San Zhuan") indicate that automatic intertextual discovery can provide a useful reference for the textual research and collation of ancient books.
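The abstract names a contrastive learning framework but does not spell out the training objective or the retrieval step. As an illustration only, the sketch below assumes a SimCSE-style InfoNCE loss with in-batch negatives and cosine similarity, followed by nearest-neighbour lookup over candidate passages. The function names, the toy two-dimensional vectors, and the temperature of 0.05 are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] are two embeddings of the same sentence;
    every positives[j] with j != i serves as a negative for anchors[i].
    The temperature value is an assumed default, not one from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def most_similar(query, candidates):
    """Index of the candidate embedding closest to the query by cosine similarity."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


# Toy embeddings (hypothetical): matching pairs should give a lower loss
# than the same vectors paired with the wrong partner.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
best_match = most_similar([1.0, 0.0], aligned)
```

In a real system the vectors would come from a sentence encoder rather than being written by hand; minimizing a loss of this kind pulls the two views of a sentence together and pushes other sentences in the batch away, which is what makes similarity search over passages meaningful.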

Key words

ancient book intertextuality / language model / unsupervised learning / contrastive learning

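Publication year

2024

图书情报工作 (Library and Information Service)
National Science Library, Chinese Academy of Sciences (中国科学院文献情报中心)

Indexed in: CSTPCD, CSSCI, CHSSCD, Peking University Core Journals (北大核心)
Impact factor: 2.203
ISSN: 0252-3116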