Ancient Chinese Language Resource on Tongjiazi: Construction and Application
In ancient Chinese texts, it is common to use Tongjiazi, i.e., characters with the same or similar sounds written in place of the original characters. To facilitate the manual analysis and machine processing of Tongjiazi, this paper builds a multi-dimensional resource for Tongjiazi comprising three sub-datasets: a corpus, a knowledge base, and an evaluation dataset. The corpus contains more than 11,000 sentences with detailed annotations of Tongjia usages. The knowledge base is represented as a graph with 4,185 characters as nodes and 8,350 edges describing relations of pronunciation, glyph, and meaning. The evaluation dataset includes 19,678 test entries for two subtasks: Tongjiazi detection and original character identification. This paper also builds a series of baseline models for the automatic recognition of Tongjiazi and analyzes the factors affecting their performance.
ancient Chinese; resource; database; Tongjiazi; automatic recognition
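
The knowledge base described in the abstract, a graph with characters as nodes and typed edges for pronunciation, glyph, and meaning relations, might be pictured roughly as follows. This is only a minimal sketch under stated assumptions: the abstract does not give the actual schema, so the `CharacterGraph` class, its methods, the undirected treatment of edges, and the `candidate_originals` helper are all hypothetical. The two character pairs (蚤 for 早, 女 for 汝) are common textbook examples of Tongjiazi and are not taken from the paper's resource.

```python
from collections import defaultdict

# Relation types named in the abstract; how the paper encodes them is unknown.
RELATION_TYPES = {"pronunciation", "glyph", "meaning"}


class CharacterGraph:
    """Hypothetical undirected graph: characters as nodes, typed relations as edges."""

    def __init__(self):
        # Maps a character to a set of (neighbor, relation) pairs.
        self._adj = defaultdict(set)

    def add_edge(self, a, b, relation):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation!r}")
        # Assumption: relations are symmetric, so store both directions.
        self._adj[a].add((b, relation))
        self._adj[b].add((a, relation))

    def neighbors(self, char, relation=None):
        """Return sorted, de-duplicated neighbors, optionally filtered by relation."""
        pairs = self._adj.get(char, ())
        return sorted({b for b, r in pairs if relation is None or r == relation})


def build_toy_graph():
    """Build a two-pair illustration (textbook examples, not the paper's data)."""
    g = CharacterGraph()
    g.add_edge("蚤", "早", "pronunciation")  # 蚤 written for 早 ("early")
    g.add_edge("女", "汝", "pronunciation")  # 女 written for 汝 ("you")
    g.add_edge("女", "汝", "glyph")          # 汝 contains the component 女
    return g


def candidate_originals(graph, char):
    """Hypothetical lookup: sound-related characters as candidate originals."""
    return graph.neighbors(char, relation="pronunciation")
```

In a setup like this, the second subtask (original character identification) could start from `candidate_originals(graph, char)` and rank the returned candidates by context; whether the paper's baselines use the knowledge base this way is not stated in the abstract.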