Automatic sentence segmentation and word segmentation for Liye Qin bamboo manuscripts based on the CRF model
Research on ancient Chinese information processing seldom uses unearthed documents as a corpus. The number of Liye Qin bamboo manuscripts is ten times that of all the Qin slips unearthed before, and they can fill many gaps in the historical records of the Qin Dynasty. In this paper, we used them as the experimental corpus and explored automatic sentence segmentation and word segmentation of unearthed documents based on the CRF model. We took the actual characteristics of the corpus into account and set up different feature templates to verify the generalization ability of the sequence labeling model on different tasks. As a comparative experiment, we set up a joint approach to sentence segmentation and word segmentation in order to select the better-performing processing plan. In addition, a comparative experiment was designed with deep learning methods and pretrained models. The results showed that the joint approach improved overall performance on each task, with the F1-scores of automatic sentence segmentation and word segmentation reaching 75.79% and 94.44%, respectively. Because it also takes less time to run, this approach is more suitable for the Liye Qin bamboo slips. The research results can support the proofreading of the last three volumes of the Liye Qin bamboo slips as well as the in-depth processing and construction of the corpus.
CRF model; Liye Qin bamboo manuscripts; automatic sentence segmentation; automatic word segmentation
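
As a rough illustration of how sentence segmentation and word segmentation can be cast as a single character-level sequence labeling task for a CRF, the following Python sketch builds joint labels and context-window features. The label scheme (B/M/E/S word positions combined with a P/O sentence-boundary flag), the function names `joint_labels` and `window_features`, and the window size are all assumptions made for illustration; the abstract does not specify the tag set or the feature templates actually used. The Latin letters stand in for characters of a slip text.

```python
def joint_labels(sentences):
    """Turn segmented text into characters with joint labels.

    sentences: list of sentences, each a list of words (strings).
    Each label is "<word position>-<boundary flag>", where the word
    position is B/M/E/S and the flag is P if a sentence break follows
    the character, otherwise O. (Hypothetical scheme, for illustration.)
    """
    chars, labels = [], []
    for words in sentences:
        for wi, word in enumerate(words):
            for ci, ch in enumerate(word):
                if len(word) == 1:
                    pos = "S"            # single-character word
                elif ci == 0:
                    pos = "B"            # word-initial character
                elif ci == len(word) - 1:
                    pos = "E"            # word-final character
                else:
                    pos = "M"            # word-internal character
                ends_sentence = wi == len(words) - 1 and ci == len(word) - 1
                chars.append(ch)
                labels.append(pos + ("-P" if ends_sentence else "-O"))
    return chars, labels


def window_features(chars, i, size=2):
    """Unigram features from a context window around position i,
    similar in spirit to a simple CRF feature template."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        feats[f"c[{off}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


# Placeholder input: two "sentences" of already segmented "words".
chars, labels = joint_labels([["AB", "C"], ["D", "EFG"]])
for ch, lab in zip(chars, labels):
    print(ch, lab)
```

The resulting character, label, and feature sequences could then be passed to any CRF toolkit. Splitting each joint label at the hyphen recovers the two single-task label sequences, which is one way a joint model can be compared against separately trained models.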