Construction and Research of a Character Relationship Extraction Corpus for Tibetan Texts

德吉措¹, 安见才让¹

Author information

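1. School of Computer Science, Qinghai Minzu University, Xining 810007; Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining 810007; State Key Laboratory of Tibetan Intelligent Information Processing and Application (jointly built by province and ministry), Xining 810007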

Abstract

A high-quality, standardized corpus is an important foundation for entity relation extraction research and can improve the precision and recall of extraction tasks. At present, Tibetan relation extraction corpora are mostly built by traditional manual annotation and are limited to specific domains, so annotation efficiency is low and person-relationship corpora are relatively scarce. This paper first constructs a Tibetan person-name entity recognition corpus. Then, by analyzing person-relationship features, entity-relationship categories, and their annotation specifications, it builds a trigger-word dictionary and uses it to back-label the corpus, yielding 15,400 annotated entity recognition entries and 8,000 annotated Tibetan person-relationship extraction entries. To verify the usability of the corpus, named entity recognition and relation extraction experiments were carried out and analyzed statistically: the entity recognition F1 score reaches 67.2% and the relation extraction F1 score reaches 66.2%. The results indicate that the corpus provides a data foundation for subsequent research on Tibetan person-relationship extraction.
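The trigger-word back-labeling step mentioned in the abstract could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' procedure: the function `back_label`, the dictionary `TRIGGERS`, the relation labels, and the romanized placeholder tokens are all hypothetical, and the sentence is assumed to be already segmented with person spans already marked.

```python
from itertools import combinations

# Hypothetical trigger dictionary: trigger token -> relation label.
# The real dictionary and label set are not given in the abstract.
TRIGGERS = {"pha": "father", "dge_rgan": "teacher"}


def back_label(tokens, person_spans, triggers=TRIGGERS):
    """Propose relation labels for one segmented sentence.

    tokens: list of str (already word-segmented).
    person_spans: list of (start, end) token indices, end exclusive.
    Returns a list of (person_1, person_2, relation, trigger) tuples.
    The direction of the relation is NOT resolved here.
    """
    # Token positions that belong to a person mention cannot be triggers.
    inside = {i for start, end in person_spans for i in range(start, end)}
    found = [
        (tok, triggers[tok])
        for i, tok in enumerate(tokens)
        if i not in inside and tok in triggers
    ]
    labels = []
    # Every pair of person mentions is paired with every trigger found.
    for (s1, e1), (s2, e2) in combinations(sorted(person_spans), 2):
        for trigger, relation in found:
            labels.append(
                (" ".join(tokens[s1:e1]), " ".join(tokens[s2:e2]), relation, trigger)
            )
    return labels


# Placeholder sentence: two person mentions plus one trigger token.
example_tokens = ["PER_A", "ni", "PER_B", "gi", "pha", "red"]
example_spans = [(0, 1), (2, 3)]
example_labels = back_label(example_tokens, example_spans)
```

A sentence with fewer than two person mentions, or with no trigger token, yields an empty list, so only candidate sentences receive a label. In practice such automatically proposed labels would presumably still need checking against the annotation specification.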

Key words

Corpus/Character relationship extraction/Tibetan text/Trigger words

Publication year

2024

Qinghai Science and Technology (青海科技)

Qinghai Provincial Department of Science and Technology


Impact factor: 0.052

ISSN: 1005-9393

Number of references: 18
