
Research on Unsupervised Automatic Intertextual Discovery Based on Large Models of Ancient Books

[Purpose/Significance] Pre-Qin classics are highly intertextual. This study establishes an unsupervised pipeline for automatic intertextuality discovery and explores the use of large language models for content analysis, textual criticism, and collation of ancient books, with the aim of improving the efficiency of this work. [Method/Process] The study compares the technical routes of existing language models, introduces a contrastive learning framework to train intertextuality discovery models without supervision, and constructs an idiom origin tracing task to evaluate the resulting series of models and select the best one. [Result/Conclusion] The idiom origin tracing results show that current conversational large language models still make a large number of factual errors. ESimCSE-GujiRoBERTa, the ancient-book language model built on the contrastive learning framework, achieves the best results on the idiom origin tracing task. The model also shows excellent semantic discrimination in identifying quotation-based intertextuality among the classics of the pre-Qin masters. In addition, the results of intertextuality identification across the three commentaries on the Spring and Autumn Annals (Chun Qiu San Zhuan) indicate that automatic intertextuality discovery can provide a useful reference for textual criticism and collation of ancient books.
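The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: a SimCSE-style in-batch contrastive objective, and similarity-based retrieval of candidate intertext pairs. It is not the authors' implementation. The function names, the temperature of 0.05, the similarity threshold of 0.9, and the toy vectors (standing in for sentence embeddings from a model such as ESimCSE-GujiRoBERTa) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss.

    anchors[i] and positives[i] are two views of the same sentence; every
    other positive in the batch serves as a negative for anchors[i].
    The temperature value is an assumed hyperparameter.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def find_intertext_pairs(source_vecs, target_vecs, threshold=0.9):
    """Return (source_index, target_index, similarity) for every pair whose
    cosine similarity reaches the (assumed) threshold, most similar first."""
    pairs = []
    for i, s in enumerate(source_vecs):
        for j, t in enumerate(target_vecs):
            sim = cosine(s, t)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return sorted(pairs, key=lambda p: -p[2])


if __name__ == "__main__":
    # Toy 2-d "embeddings"; real sentence vectors would come from the encoder.
    text_a = [[1.0, 0.0], [0.0, 1.0]]
    text_b = [[0.9, 0.1], [0.0, 1.0]]
    print(info_nce_loss(text_a, text_b))
    print(find_intertext_pairs(text_a, text_b))
```

In a pipeline like the one the abstract describes, an objective of this kind would be minimised during unsupervised training, and the retrieval step would then be run over sentence embeddings from two texts to surface candidate intertextual passages for a human collator to review.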

Keywords: ancient book intertextuality; language model; unsupervised learning; contrastive learning

Ye Wenhao (叶文豪), Hu Die (胡蝶), Wang Dongbo (王东波), Zhou Hao (周好), Liu Liu (刘浏)


College of Information Management, Nanjing Agricultural University, Nanjing 210095


2024

Library and Information Service (图书情报工作)
National Science Library, Chinese Academy of Sciences


Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core Journals (北大核心)
Impact factor: 2.203
ISSN:0252-3116
Year, volume (issue): 2024, 68(23)