Computer Systems & Applications, 2024, Vol. 33, Issue 1: 280-288. DOI: 10.15888/j.cnki.csa.009370

Two-stage Medical Terminology Standardization Based on RoBERTa and T5

Zhou Jing (周景)¹, Cui Cancan (崔灿灿)¹, Wang Mengdi (王梦迪)², Wang Zemin (王泽敏)¹

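Author information

  • 1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
  • 2. Beijing Zhongke Ruijian Technology Co., Ltd. (北京中科睿见科技有限公司), Beijing 100080, China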

Abstract

Medical terminology standardization is an important means of eliminating entity ambiguity and is widely used in knowledge graph construction. The medical domain involves a large number of specialized terms and complex expressions, so traditional matching models often struggle to reach high accuracy. To address this problem, a two-stage model combining semantic recall and precise ranking is proposed to improve medical terminology standardization. In the semantic recall stage, a semantic representation model, CL-BERT, is proposed based on improved supervised contrastive learning and RoBERTa-wwm. CL-BERT generates a semantic representation vector for each entity, and candidates are recalled according to the cosine similarity between vectors to obtain a candidate set of standard terms. In the precise ranking stage, T5 combined with prompt tuning is used to build a precise semantic matching model, with FGM adversarial training applied during model training. The matching model then ranks each candidate in the standard term candidate set against the original term to obtain the final standard term. Experiments on the public CCKS 2019 dataset achieve an F1 score of 0.9206, showing that the proposed two-stage model performs well and offers a new approach to medical terminology standardization.
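The recall-then-rank flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vectors below are toy stand-ins for CL-BERT embeddings, `score_pair` is a hypothetical placeholder for the T5-based matching model, and all function and variable names are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_candidates(
    query_vec: Vector, standard_vecs: Dict[str, Vector], top_k: int
) -> List[str]:
    """Stage 1: return the top_k standard terms most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), term)
        for term, vec in standard_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:top_k]]


def rerank(
    mention: str, candidates: List[str], score_pair: Callable[[str, str], float]
) -> str:
    """Stage 2: pick the candidate with the highest pairwise matching score."""
    return max(candidates, key=lambda cand: score_pair(mention, cand))


# Toy embeddings (assumed values, standing in for CL-BERT outputs).
standard_vecs = {
    "standard_a": [1.0, 0.0],
    "standard_b": [0.8, 0.6],
    "standard_c": [0.0, 1.0],
}
query_vec = [0.9, 0.1]

# Hypothetical matching scores standing in for the fine-grained matcher.
toy_scores = {"standard_a": 0.3, "standard_b": 0.7}


def score_pair(mention: str, candidate: str) -> float:
    return toy_scores[candidate]


candidates = recall_candidates(query_vec, standard_vecs, top_k=2)
final_term = rerank("original_term", candidates, score_pair)
```

In this toy setup the recall stage narrows the vocabulary to a small candidate set cheaply, and the ranking stage is free to reorder that set, which is the division of labor the abstract describes.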

Keywords

medical terminology standardization / RoBERTa-wwm / contrastive learning / T5 / prompt tuning / knowledge graph

Publication year: 2024

Computer Systems & Applications
Institute of Software, Chinese Academy of Sciences

CSTPCD
Impact factor: 0.449
ISSN: 1003-3254
References: 5