Automatic Word Segmentation and Part-of-Speech Tagging for Classical Chinese Based on Radicals

Abstract: [Objective] This paper proposes an integrated model that incorporates radical information to address the low accuracy and low efficiency of existing automatic word segmentation and part-of-speech tagging methods for Classical Chinese. [Methods] Using data on more than 70,000 Chinese characters and their radicals, we built a radical vector representation model, Radical2Vector. We combined Radical2Vector with SikuRoBERTa, a text representation model for Classical Chinese, and concatenated the two with a BiLSTM-CRF model to form the main experimental architecture. We also designed a two-layer annotation scheme covering word segmentation and part of speech, and ran integrated segmentation and tagging experiments on the Zuo Zhuan dataset. [Results] The model achieved an F1 score of 95.75% on word segmentation and 91.65% on part-of-speech tagging, improvements of 8.71 and 13.88 percentage points over the baseline model, respectively. [Limitations] The approach incorporates only a single radical per character and does not use other character components. [Conclusions] Incorporating radical information improves the representation of Classical Chinese text, and the integrated scheme performs well on both word segmentation and part-of-speech tagging.
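The abstract does not spell out the annotation scheme, so the following is only a minimal sketch of one common way to cast segmentation and part-of-speech tagging as a single character-level labeling task: each character receives a label that joins a position tag (B, M, E, S) with the word's part of speech. The function names, the tag inventory, and the sample sentence annotation are assumptions made for illustration, not the scheme used in the paper.

```python
from typing import List, Tuple

Word = Tuple[str, str]      # (word, part-of-speech tag)
CharLabel = Tuple[str, str]  # (character, joint label such as "B-n")


def encode(words: List[Word]) -> List[CharLabel]:
    """Convert (word, pos) pairs into per-character joint labels."""
    labeled: List[CharLabel] = []
    for word, pos in words:
        if len(word) == 1:
            positions = ["S"]  # single-character word
        else:
            # Begin, zero or more Middle, End
            positions = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        for ch, position in zip(word, positions):
            labeled.append((ch, f"{position}-{pos}"))
    return labeled


def decode(labeled: List[CharLabel]) -> List[Word]:
    """Rebuild (word, pos) pairs from per-character joint labels."""
    words: List[Word] = []
    buffer = ""
    pos = ""
    for ch, label in labeled:
        position, pos = label.split("-", 1)
        buffer += ch
        if position in ("S", "E"):  # a word ends here
            words.append((buffer, pos))
            buffer = ""
    if buffer:  # unterminated word at the end of the sequence
        words.append((buffer, pos))
    return words


# Hypothetical annotation of a short Zuo Zhuan sentence; tags are illustrative.
sample = [("郑伯", "n"), ("克", "v"), ("段", "nr"), ("于", "p"), ("鄢", "ns")]
labels = encode(sample)
```

With labels in this form, a sequence labeler such as a BiLSTM-CRF predicts one joint label per character, and `decode` recovers both the word boundaries and the tags in one pass.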

Keywords:

Word Segmentation; Part-of-Speech Tagging; Ancient Chinese Information Processing

Authors:

Chang Bolin (常博林), Yuan Yiguo (袁义国), Li Bin (李斌), Xu Zhixing (许智星), Feng Minxuan (冯敏萱), Wang Dongbo (王东波)

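Affiliations:

School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097

Center for Language Big Data and Computational Humanities, Nanjing Normal University, Nanjing 210097

College of Information Management, Nanjing Agricultural University, Nanjing 210095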

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0834

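Data Analysis and Knowledge Discovery (数据分析与知识发现)

National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core (北大核心), EI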

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(11)