Intelligent Completion of Ancient Texts Based on Pre-trained Language Models

[Objective] This paper proposes a new method for completing ancient texts based on pre-trained language models. It uses representations obtained from pre-trained language models at different semantic levels and for simplified and traditional Chinese characters to construct a mixture-of-experts system and a simplified-traditional Chinese fusion model. [Methods] We designed a mixture-of-experts model for transmitted texts and a simplified-traditional Chinese fusion model for excavated texts, integrating and exploiting the models' capabilities in each scenario to further improve completion performance. [Results] On self-constructed datasets of transmitted and excavated texts, the models achieved completion accuracies of 70.14% and 57.13%, respectively. [Limitations] The work approaches the task only from a natural language processing perspective. Future work could use multimodal techniques that combine computer vision with natural language processing, integrating image and semantic information, which may yield better results. [Conclusions] Validated on the constructed datasets of transmitted and excavated texts, the proposed models achieve relatively high accuracy and offer a competitive approach to ancient text completion.
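
The abstract names a mixture-of-experts system but does not describe its internals. As a rough illustration only, the sketch below shows one generic way such a system could combine per-expert predictions for a missing character: each expert outputs a probability distribution over candidate characters, and a gate assigns softmax weights to the experts. The function names (`softmax`, `mix_experts`, `complete`), the two toy experts, and the gate scores are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_probs: List[Dict[str, float]],
                gate_scores: List[float]) -> Dict[str, float]:
    """Weighted sum of per-expert character distributions.

    expert_probs: one {character: probability} dict per expert (hypothetical).
    gate_scores:  one raw gate score per expert; softmax turns them into weights.
    """
    if len(expert_probs) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    mixed: Dict[str, float] = {}
    for weight, probs in zip(weights, expert_probs):
        for char, prob in probs.items():
            mixed[char] = mixed.get(char, 0.0) + weight * prob
    return mixed


def complete(expert_probs: List[Dict[str, float]],
             gate_scores: List[float]) -> str:
    """Return the candidate character with the highest mixed probability."""
    mixed = mix_experts(expert_probs, gate_scores)
    return max(mixed, key=mixed.get)


if __name__ == "__main__":
    # Toy distributions for one missing character (made-up numbers).
    expert_a = {"之": 0.6, "也": 0.4}
    expert_b = {"之": 0.2, "也": 0.8}
    print(complete([expert_a, expert_b], [0.0, 0.0]))  # equal gate weights
    print(complete([expert_a, expert_b], [2.0, 0.0]))  # gate favors expert_a
```

In a trained system the gate scores and expert distributions would come from learned models rather than fixed numbers; the sketch only shows the mixing step.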

Keywords: Digitization of Ancient Books; Pre-trained Language Models; Mixture-of-Experts Systems

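Li Jiajun (李嘉俊), Ming Can (明灿), Guo Zhihao (郭志浩), Qian Tieyun (钱铁云), Peng Zhiyong (彭智勇), Wang Xiaoguang (王晓光), Li Xuhui (李旭晖), Li Jing (李静)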

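School of Computer Science, Wuhan University, Wuhan 430072

Laboratory of Intelligent Computing for Cultural Heritage (文化遗产智能计算实验室), Wuhan University, Wuhan 430072

School of Information Management, Wuhan University, Wuhan 430072

School of History, Wuhan University, Wuhan 430072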

Funding: Major Program of the National Social Science Fund of China (Grant No. 21&ZD334)

2024

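Data Analysis and Knowledge Discovery (数据分析与知识发现)
National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core (北大核心), EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(5)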