中文信息学报 (Journal of Chinese Information Processing), 2024, Vol. 38, Issue 8: 76-83.

基于预训练的藏医药实体关系抽取

Entity Relation Extraction Based on Pre-trained Language Model for Tibetan Medicine

周青 拥措 拉毛东只 尼玛扎西

Author Information

  • 1. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000, China; Key Laboratory of Tibetan Information Technology and Artificial Intelligence of the Tibet Autonomous Region, Lhasa, Tibet 850000, China; Engineering Research Center of Tibetan Information Technology, Ministry of Education, Lhasa, Tibet 850000, China


Abstract

The texts in the field of Tibetan medicine are mainly stored in unstructured form, and information extraction from these texts plays an important role in mining the knowledge of Tibetan medicine. To address the poor semantic expressiveness and low nested-entity extraction accuracy of existing Tibetan entity relation extraction models, this paper introduces an entity relation extraction method based on a pre-trained language model. The TibetanAI_ALBERT_v2.0 pre-trained language model is used to help the model better recognize entities, and a span-based method is used to handle nested entities. On top of Dropout, a KL-divergence loss term is added to improve the model's generalization ability. Experiments on the TibetanAI_TMIE_v1.0 Tibetan medicine dataset show that the precision, recall, and F1 score reach 84.5%, 80.1%, and 82.2%, respectively; the F1 score is 4.4 percentage points higher than the baseline, demonstrating the effectiveness of the proposed method.
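The KL-divergence term described above (added on top of Dropout, in the spirit of R-Drop-style regularization: the same batch is passed through the model twice, and the two Dropout-perturbed output distributions are pulled together) can be sketched as follows. This is a minimal illustrative sketch in plain Python, not the paper's implementation; the function names and the per-example formulation are assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one example's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    # KL(p || q) for two discrete distributions over the same labels.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl_loss(logits_a, logits_b):
    # Symmetric KL between the predictions of two Dropout forward passes:
    # 0.5 * (KL(p || q) + KL(q || p)). During training this term would be
    # added, with a weighting coefficient, to the usual task loss.
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_div(p, q) + kl_div(q, p))
```

Because Dropout masks differ between the two passes, `logits_a` and `logits_b` differ even for identical inputs; the symmetric KL penalty discourages the model from relying on any particular Dropout configuration, which is the generalization effect the abstract refers to.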


Key words

Tibetan medicine / entity relation extraction / pre-trained language model


Funding

Science and Technology Department of the Tibet Autonomous Region Project (XZ202401JD0010)

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0116100)

Publication Year

2024
中文信息学报 (Journal of Chinese Information Processing)
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD, CSCD, CHSSCD, Peking University Core Journals
Impact Factor: 0.8
ISSN: 1003-0077
References: 26