Research on the Construction Method and Distribution Law of Object-Image Database for Ancient Poetry

Liu Maolin (刘懋霖)¹, Zhao Meng (赵萌)¹, Wang Hao (王昊)¹

Author information

1. School of Information Management, Nanjing University; Jiangsu Key Laboratory of Data Engineering and Knowledge Service

Abstract

From the perspective of digital humanities, ancient poetry resources hold great value but are difficult to analyze at scale. Studying methods for automatically constructing a knowledge base of ancient poetry supports macro-level analysis of ancient poetry and the mining of its value. First, based on the concept of the "object image", the approach attempts to extract all objective named things in ancient poems that may carry emotion, which reduces the complexity of analysis and makes an automated pipeline possible. Second, a RoBERTa-BiLSTM-CRF model is built with deep learning methods to extract object images from ancient poetry corpora. Then, The Whole Tang Dynasty Poems (《全唐诗》) and a selection of Song Dynasty poetry are used to verify the feasibility and generalizability of the model. Finally, an object-image database of The Whole Tang Dynasty Poems is constructed, and the distribution of its object images is preliminarily analyzed. After training on the automatically annotated corpus of The Whole Tang Dynasty Poems, the model achieves F1 scores of 89.6%, 93.3%, and 93.6% on common nouns, time nouns, and place names, respectively. When the model is transferred to a Song Dynasty poetry corpus that was not used for training, the extraction density is 4.5 object images per poem and the model is able to discover out-of-vocabulary words, which indicates good generalizability and extensibility.
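
The abstract reports per-category F1 scores and an extraction density without showing how such figures are computed. The sketch below is one plausible way to derive them from character-level BIO tag sequences, which is a common output format for a BiLSTM-CRF tagger. It is not the authors' code: the BIO scheme, the tag name "n" for common nouns, the exact-match span criterion, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, type) spans.

    `end` is exclusive. A stray "I-x" with no matching open span is
    treated leniently as the start of a new span (an assumption).
    """
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel flushes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((start, i, kind))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, kind = i, tag[2:]
        else:
            start, kind = None, None
    return spans


def f1_by_type(gold_seqs, pred_seqs):
    """Entity-level F1 per type; a span counts only on exact match."""
    tp = defaultdict(int)
    n_gold = defaultdict(int)
    n_pred = defaultdict(int)
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        for span in g:
            n_gold[span[2]] += 1
        for span in p:
            n_pred[span[2]] += 1
        for span in g & p:
            tp[span[2]] += 1
    scores = {}
    for kind in set(n_gold) | set(n_pred):
        precision = tp[kind] / n_pred[kind] if n_pred[kind] else 0.0
        recall = tp[kind] / n_gold[kind] if n_gold[kind] else 0.0
        total = precision + recall
        scores[kind] = 2 * precision * recall / total if total else 0.0
    return scores


def extraction_density(pred_seqs):
    """Average number of extracted spans per sequence (e.g. per poem)."""
    if not pred_seqs:
        return 0.0
    return sum(len(bio_to_spans(p)) for p in pred_seqs) / len(pred_seqs)


# Toy example on the five characters of "床前明月光"; the tags are hypothetical.
gold = ["B-n", "O", "B-n", "I-n", "O"]    # 床, 明月
pred = ["B-n", "O", "B-n", "I-n", "I-n"]  # 床, 明月光 (boundary error)
print(f1_by_type([gold], [pred]))         # {'n': 0.5}
print(extraction_density([pred]))         # 2.0
```

Exact-match scoring penalizes the boundary error on 明月光 as both a false positive and a false negative, which is why the toy F1 is 0.5 even though the model located both object images approximately.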

Key words

Digital humanities/Ancient poetry/Object image/Deep learning

Funding

National Natural Science Foundation of China, General Program (72074108)

Fundamental Research Funds for the Central Universities, Nanjing University (010814370113)

Publication year

2024

Library Journal (图书馆杂志)

Shanghai Society for Library Science; Shanghai Library

Indexed in: CSSCI, CHSSCD, Peking University Core Journals (北大核心)

Impact factor: 1.475

ISSN: 1000-4254

Number of references: 31