Two-stage Medical Terminology Standardization Based on RoBERTa and T5

Medical terminology standardization is an important means of eliminating entity ambiguity and is widely used in knowledge graph construction. The medical field involves many professional terms and complex expressions, so traditional matching models often struggle to reach high accuracy. To address this problem, a two-stage model of semantic recall followed by precise ranking is proposed to improve medical terminology standardization. First, in the semantic recall stage, a semantic representation model, CL-BERT, is proposed based on improved supervised contrastive learning and RoBERTa-wwm. CL-BERT generates semantic representation vectors for entities, and candidates are recalled according to the cosine similarity between vectors to obtain a candidate set of standard terms. Second, in the precise ranking stage, T5 combined with prompt tuning is used to build a precise semantic matching model, and FGM adversarial training is applied during model training. The matching model then ranks each pair of original term and candidate standard term to obtain the final standard term. Experiments on the public ccks2019 dataset achieve an F1 score of 0.9206. The results show that the proposed two-stage model performs well and offers a new approach to medical terminology standardization.
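The recall-then-rank flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors are toy stand-ins for CL-BERT embeddings, `toy_match_score` (a token-overlap score) is a hypothetical placeholder for the T5 prompt-tuned matching model, and the function names and the English example terms are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_candidates(
    query_vec: Sequence[float],
    standard_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Stage 1: keep the k standard terms whose vectors are closest to the query."""
    ranked = sorted(
        standard_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [term for term, _ in ranked[:k]]


def rank_candidates(
    mention: str,
    candidates: List[str],
    match_score: Callable[[str, str], float],
) -> str:
    """Stage 2: score each (mention, candidate) pair and return the best candidate."""
    return max(candidates, key=lambda cand: match_score(mention, cand))


def toy_match_score(mention: str, candidate: str) -> float:
    """Hypothetical stand-in for the matching model: Jaccard token overlap."""
    m, c = set(mention.lower().split()), set(candidate.lower().split())
    union = m | c
    return len(m & c) / len(union) if union else 0.0


# Toy vectors (assumption): in the paper these would come from CL-BERT.
STANDARD_VECS = {
    "appendectomy": [0.9, 0.1, 0.0],
    "laparoscopic appendectomy": [0.8, 0.3, 0.1],
    "cholecystectomy": [0.1, 0.9, 0.2],
}

mention = "laparoscopic appendectomy surgery"
mention_vec = [0.85, 0.2, 0.05]  # toy embedding of the mention

candidates = recall_candidates(mention_vec, STANDARD_VECS, k=2)
best = rank_candidates(mention, candidates, toy_match_score)
print(candidates, "->", best)
```

In this toy run, the recall stage discards the unrelated term on cosine similarity alone, and the second stage picks the final term from the two surviving candidates using the pairwise score. Splitting the work this way lets a more expensive pairwise model run on only a handful of candidates.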

Keywords: medical terminology standardization; RoBERTa-wwm; contrastive learning; T5; prompt tuning; knowledge graph

Zhou Jing (周景), Cui Cancan (崔灿灿), Wang Mengdi (王梦迪), Wang Zemin (王泽敏)

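School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China

Beijing Zhongke Ruijian Technology Co., Ltd. (北京中科睿见科技有限公司), Beijing 100080, China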

2024

Computer Systems & Applications (计算机系统应用)
Institute of Software, Chinese Academy of Sciences

CSTPCD
Impact factor: 0.449
ISSN:1003-3254
Year, volume (issue): 2024, 33(1)