Ancient Chinese Language Resource on Tongjiazi: Construction and Application

王兆基¹ 张诗睿¹ 胡韧奋¹ 张学涛¹

Author information

1. School of International Chinese Language Education, Beijing Normal University, Beijing 100875

Abstract

Tongjiazi, i.e. characters used in place of the original characters because they share the same or a similar sound, are common in ancient Chinese texts. They make the texts harder for readers to understand and pose a major challenge for ancient Chinese information processing. To support both the manual identification and the machine processing of Tongjiazi, this paper builds and open-sources a multi-dimensional Tongjiazi resource consisting of three sub-resources: a corpus, a knowledge base, and an evaluation dataset. The corpus contains more than 11 000 entries with detailed annotations of Tongjia usages. The knowledge base is a graph with characters as nodes and Tongjia and phono-semantic (形声) relations as edges; it describes the attributes of Tongjiazi and their original characters in terms of pronunciation, glyph, and meaning, and contains 4 185 character nodes and 8 350 pairs of relations. The evaluation dataset targets the needs of ancient Chinese information processing and supports two subtasks, Tongjiazi detection and original character identification, with 19 678 evaluation entries. On this basis, the paper builds a series of baseline models for the automatic recognition of Tongjiazi and, drawing on the experimental results, analyzes the factors that affect recognition and ways to improve it. It further discusses applications of the resource in the collation of ancient texts, humanities research, and the teaching of classical Chinese.
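The graph layout described in the abstract (characters as nodes, typed relations as edges) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the relation labels `tongjia` and `xingsheng`, the tuple layout, and the helper names `build_graph`, `candidate_originals`, and `flag_candidates` are hypothetical and are not taken from the released resource. The three edges are textbook examples, not entries copied from the knowledge base.

```python
from collections import defaultdict

# Illustrative edges only (source character, target character, relation).
# The labels and layout are assumptions, not the schema of the released data.
EDGES = [
    ("蚤", "早", "tongjia"),    # 蚤 borrowed for 早 ("early")
    ("女", "汝", "tongjia"),    # 女 borrowed for 汝 ("you")
    ("汝", "女", "xingsheng"),  # 汝 is phono-semantic, with 女 as its phonetic part
]


def build_graph(edges):
    """Index edges as graph[source][relation] -> set of target characters."""
    graph = defaultdict(lambda: defaultdict(set))
    for source, target, relation in edges:
        graph[source][relation].add(target)
    return graph


def candidate_originals(graph, char):
    """Return the characters that `char` may stand in for via a tongjia edge."""
    return sorted(graph.get(char, {}).get("tongjia", set()))


def flag_candidates(graph, sentence):
    """List (index, character, candidates) for every character with a tongjia edge.

    This is only a dictionary lookup: it over-generates, because a character
    such as 女 is usually not a Tongjiazi. Deciding that requires context.
    """
    hits = []
    for index, char in enumerate(sentence):
        candidates = candidate_originals(graph, char)
        if candidates:
            hits.append((index, char, candidates))
    return hits


graph = build_graph(EDGES)
print(flag_candidates(graph, "旦日不可不蚤自来谢项王"))
```

A lookup like this only proposes candidates. The two evaluation subtasks named in the abstract, detection and original character identification, both depend on the surrounding context, which a lookup table cannot capture.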

Key words

ancient Chinese/resource database/Tongjiazi/automatic recognition

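Funding

Major Project of the State Language Commission (ZDA145-9)

National Natural Science Foundation of China (62006021)

Key Project of the Beijing Social Science Foundation (21DTR037)

Planning Project of the Paleography and Chinese Civilization Inheritance and Development Program (G1930)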

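Publication year

2024

Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences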

Indexed in: CSTPCD, CHSSCD, Peking University Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Number of references: 27
