Automatic sentence segmentation and word segmentation for the Liye Qin bamboo manuscripts based on a CRF model

Feng Huimin (冯慧敏)¹, Guo Shuaishuai (郭帅帅)², Liu Ming (刘铭)²

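Author information

1. Department of Basic Courses, Shandong Agriculture and Engineering University, Jinan 250100; Institute for Advanced Studies in History of Science, Northwest University, Xi'an 710127
2. Institute for Advanced Studies in History of Science, Northwest University, Xi'an 710127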

Abstract

Research on ancient Chinese information processing has seldom used unearthed documents as a corpus. The Liye Qin bamboo slips are ten times as numerous as all previously unearthed Qin slips and fill many gaps in the historical record of the Qin dynasty. Taking the Liye Qin Slips (《里耶秦简》) as the experimental corpus, this paper explores automatic sentence segmentation and word segmentation of the slips based on the CRF (conditional random field) model. Reflecting the actual characteristics of the slip texts, different feature templates are set up to test how well the model's sequence labeling generalizes across tasks. Comparative experiments on a joint approach to sentence segmentation and word segmentation are run to select the better-performing processing scheme, and further comparative experiments are designed against deep learning methods and pretrained models. The results show that the joint CRF labeling scheme improves overall performance on every task, with F1 scores of 75.79% for automatic sentence segmentation and 94.44% for automatic word segmentation. Because it is also faster and takes less time, this approach is better suited to the Liye Qin bamboo slips. The results can support the proofreading of the last three volumes of the Liye Qin Slips and the in-depth processing and construction of the corpus.
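The joint labeling idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the tag set or the feature templates, so the `B/M/E/S` word-position tags, the `-0`/`-1` sentence-break flag, the window size, and the function names `joint_tags`, `window_features`, and `decode` are all hypothetical. The sample text is a toy example whose segmentation is illustrative, not drawn from the paper's annotated corpus.

```python
from typing import Dict, List, Tuple

def word_tags(word: str) -> List[str]:
    """Word-position tags: S for a single-character word, else B, M..., E."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

def joint_tags(sentences: List[List[str]]) -> List[Tuple[str, str]]:
    """Map segmented sentences to (character, joint tag) pairs.

    Assumed joint tag: word position plus a flag that is "1" when a
    sentence break follows the character and "0" otherwise.
    """
    pairs: List[Tuple[str, str]] = []
    for words in sentences:
        tagged = [(ch, t) for w in words for ch, t in zip(w, word_tags(w))]
        for i, (ch, t) in enumerate(tagged):
            flag = "1" if i == len(tagged) - 1 else "0"
            pairs.append((ch, f"{t}-{flag}"))
    return pairs

def window_features(chars: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """Character unigrams and bigrams in a +/-size window around position i,
    in the spirit of a CRF feature template (assumed, not the paper's)."""
    def at(j: int) -> str:
        return chars[j] if 0 <= j < len(chars) else "<PAD>"
    feats: Dict[str, str] = {}
    for off in range(-size, size + 1):
        feats[f"U[{off}]"] = at(i + off)
    for off in range(-size, size):
        feats[f"B[{off}]"] = at(i + off) + at(i + off + 1)
    return feats

def decode(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Recover sentences of words from well-formed joint tags."""
    sentences: List[List[str]] = []
    words: List[str] = []
    current = ""
    for ch, tag in pairs:
        pos, flag = tag.split("-")
        current += ch
        if pos in ("E", "S"):      # a word ends at this character
            words.append(current)
            current = ""
        if flag == "1":            # a sentence break follows this character
            sentences.append(words)
            words = []
    return sentences

# Toy input (illustrative only): two short "sentences" of segmented words.
toy = [["迁陵", "守", "丞"], ["敢", "言", "之"]]
labeled = joint_tags(toy)
features = window_features([ch for ch, _ in labeled], 0, size=1)
restored = decode(labeled)
```

With a single tag per character, one CRF pass would predict word boundaries and sentence breaks together, which is one plausible reading of the joint scheme; two separate taggers would instead use the word-position tags and the break flag as independent label sets.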

Key words

CRF model/Liye Qin bamboo manuscripts/automatic sentence segmentation/automatic word segmentation

Publication year

2024

Science & Technology Review (科技导报)

China Association for Science and Technology

CSTPCD; Peking University Core Journals

Impact factor: 0.559

ISSN: 1000-7857
