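Automatic Word Segmentation and Part-of-Speech Tagging for Classical Chinese Based on Radicals

Original Chinese title: 融合部首信息的古汉语自动分词与词性标注一体化分析 (literally, "An Integrated Analysis of Automatic Word Segmentation and Part-of-Speech Tagging for Classical Chinese Incorporating Radical Information")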

[Objective] This paper proposes an integrated model that incorporates radical information to address the low accuracy and low efficiency of existing automatic word segmentation and part-of-speech tagging techniques for Classical Chinese. [Methods] Based on more than 70,000 records pairing Chinese characters with their radicals, we constructed a radical vector representation model, Radical2Vector. We combined Radical2Vector with SikuRoBERTa, a text representation model for Classical Chinese, and coupled both with a BiLSTM-CRF model to form the main model architecture of the experiments. We also designed a dual-layer annotation scheme for word segmentation and part-of-speech tags, and ran integrated word segmentation and part-of-speech tagging experiments on the Zuo Zhuan (《左传》) dataset. [Results] The model achieved an F1 score of 95.75% on the word segmentation task and 91.65% on the part-of-speech tagging task, which are 8.71 and 13.88 percentage points higher than the baseline model, respectively. [Limitations] The approach incorporates only a single radical for each character and does not use the other components of the characters. [Conclusions] The proposed model successfully integrates radical information and effectively improves the representation of Classical Chinese texts. With the integrated word segmentation and part-of-speech tagging scheme, the model performs well on both tasks.
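The abstract does not spell out the dual-layer annotation scheme, so the following is only a minimal sketch of one common way to encode segmentation and part-of-speech information per character. It assumes a BMES segmentation layer (B = begin, M = middle, E = end, S = single-character word) and a second layer that repeats the word's part-of-speech tag on each of its characters. The function names and the tag labels (`n`, `v`, `nr`, `p`, `ns`) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Word = Tuple[str, str]          # (word, pos_tag)
CharRow = Tuple[str, str, str]  # (character, seg_tag, pos_tag)


def to_char_labels(words: List[Word]) -> List[CharRow]:
    """Expand (word, pos) pairs into per-character two-layer labels.

    Layer 1 is an assumed BMES segmentation tag; layer 2 copies the
    word-level part-of-speech tag onto every character of the word.
    """
    rows: List[CharRow] = []
    for word, pos in words:
        n = len(word)
        for i, ch in enumerate(word):
            if n == 1:
                seg = "S"
            elif i == 0:
                seg = "B"
            elif i == n - 1:
                seg = "E"
            else:
                seg = "M"
            rows.append((ch, seg, pos))
    return rows


def from_char_labels(rows: List[CharRow]) -> List[Word]:
    """Rebuild (word, pos) pairs from per-character two-layer labels.

    A word is closed whenever an S or E tag is seen; its part-of-speech
    tag is read from the character that closes it.
    """
    words: List[Word] = []
    buf = ""
    last_pos = ""
    for ch, seg, pos in rows:
        buf += ch
        last_pos = pos
        if seg in ("S", "E"):
            words.append((buf, pos))
            buf = ""
    if buf:  # tolerate a sequence that ends without a closing tag
        words.append((buf, last_pos))
    return words


# Illustrative input only: segmentation and tags here are assumptions.
sentence = [("鄭伯", "n"), ("克", "v"), ("段", "nr"), ("于", "p"), ("鄢", "ns")]
labels = to_char_labels(sentence)
restored = from_char_labels(labels)
```

Encoding both layers on the same character sequence lets a single sequence labeler predict word boundaries and word classes together, which is the general idea behind an integrated segmentation and tagging setup.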

Word Segmentation; Part-of-Speech Tagging; Ancient Chinese Information Processing

Chang Bolin (常博林), Yuan Yiguo (袁义国), Li Bin (李斌), Xu Zhixing (许智星), Feng Minxuan (冯敏萱), Wang Dongbo (王东波)

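School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097

Center for Language Big Data and Computational Humanities, Nanjing Normal University, Nanjing 210097

College of Information Management, Nanjing Agricultural University, Nanjing 210095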

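Chinese keywords (translated): automatic word segmentation; automatic part-of-speech tagging; ancient Chinese information processing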

2024

Data Analysis and Knowledge Discovery (数据分析与知识发现)
National Science Library, Chinese Academy of Sciences

CSTPCD; CSSCI; CHSSCD; PKU Core (北大核心); EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(11)