Research on Chinese Medical Terminology Standardization Based on Fused Multi-Strategy Contrastive Learning

Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning


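Chinese abstract: [Objective] To address the challenges that Chinese medical terminology standardization faces, namely short texts, high similarity, and single versus multiple entailment, this study investigates a recall-ranking-quantity prediction framework based on fused multi-strategy contrastive learning. [Methods] First, candidate recall fuses statistical text features with deep semantic features and obtains a candidate entity set according to similarity scores. Second, candidate ranking trains vector representations of the original terms, the standard entities, and the candidate entities from recall using a pre-trained model and contrastive learning strategies, then re-ranks the candidates by cosine similarity. Third, quantity prediction updates the vector representation of the original term through multi-head attention and predicts the number of standard entities that the original term entails. Finally, the similarity scores from candidate recall and candidate ranking are fused, and the corresponding standard entities are selected in order according to the predicted quantity. [Results] Performance was evaluated on the Chinese medical terminology standardization dataset Yidu-N7k against statistical models and mainstream deep learning models. The framework reached an accuracy of 92.17%, an improvement of up to 0.94 percentage points over pre-trained binary classification baseline models. On a self-built dataset of 150 female breast cancer mammography examination reports, the framework reached an accuracy of 97.85%, the best performance. [Limitations] The experiments were conducted only on medical datasets; effectiveness in other domains requires further study. [Conclusions] Multi-strategy candidate recall considers text information comprehensively and thereby addresses the short-text challenge; contrastive-learning candidate ranking captures subtle textual differences and addresses the high-similarity challenge; quantity prediction with multi-head attention strengthens the vector representation and addresses the single- and multiple-entailment challenge. The proposed method shows potential for advancing medical information mining and clinical research.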

English abstract: [Objective] To address the challenges of short texts, high similarity, and single and multiple entailments in the standardization of Chinese medical terminology, this paper proposes a recall-ranking-quantity prediction framework based on the fusion of multi-strategy contrastive learning. [Methods] First, we integrated text statistical features and deep semantic features to retrieve candidate entities, obtaining the candidate set based on similarity scores. Second, for candidate ranking, we trained vector representations of the original terms, standard entities, and candidate entities from recall with pre-trained models and contrastive learning strategies, then re-ranked the candidates by cosine similarity. Next, we updated the vector representations of the original terms through multi-head attention to predict the number of standard entities entailed by each original term. Finally, we fused the similarity scores from candidate recall and candidate ranking and selected the standard entities in order according to the quantity prediction results. [Results] We evaluated the framework on the Chinese medical terminology normalization dataset Yidu-N7k. Compared with statistical models and mainstream deep learning models, the proposed framework achieved an accuracy of 92.17%, an improvement of up to 0.94 percentage points over the pre-trained binary classification baseline models. Additionally, on a dataset of 150 expert-labeled mammography examination reports for female breast cancer, the framework's accuracy reached 97.85%, the best performance. [Limitations] The experiments were conducted only on medical datasets, and the effectiveness in other domains needs further exploration. [Conclusions] Multi-strategy candidate recall considers text information comprehensively and addresses the challenge of short texts. Candidate ranking with contrastive learning captures subtle textual differences and addresses the challenge of high similarity. Quantity prediction with multi-head attention enhances the vector representation and addresses the challenge of single and multiple entailments. The proposed method shows potential for promoting medical information mining and clinical research.
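
The final selection step described in the abstract (fuse the recall and ranking similarity scores, then keep as many standard entities as the quantity prediction indicates) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: character-bigram Jaccard overlap stands in for the statistical text feature, hand-written toy vectors stand in for embeddings from the pre-trained model, the fusion weight `alpha` is hypothetical, and `predicted_count` is passed in directly rather than produced by a multi-head attention model.

```python
import math


def char_bigrams(text):
    """Return the set of character bigrams (the whole string if it is too short)."""
    if len(text) < 2:
        return {text}
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Bigram Jaccard overlap; an assumed stand-in for the statistical feature."""
    sa, sb = char_bigrams(a), char_bigrams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def select_entities(term, term_vec, candidates, predicted_count, alpha=0.5):
    """Rank candidates by a fused score and keep the top `predicted_count`.

    candidates maps each standard-entity name to its (toy) vector.
    alpha is a hypothetical weight between the two similarity scores.
    """
    scored = []
    for name, vec in candidates.items():
        fused = alpha * jaccard(term, name) + (1 - alpha) * cosine(term_vec, vec)
        scored.append((fused, name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:predicted_count]]


if __name__ == "__main__":
    # Toy data only: names and vectors are made up for the demonstration.
    toy_candidates = {"abd": [1.0, 0.0], "xyz": [0.0, 1.0], "abc": [0.6, 0.8]}
    print(select_entities("abc", [1.0, 0.0], toy_candidates, predicted_count=2))
```

In this sketch a single-entailment term would be handled with `predicted_count=1` and a multiple-entailment term with a larger value, which is why the count is kept as a separate input from the ranking itself.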

English keywords:

Medical Terminology Normalization; Multi-Strategy Candidate Recall; Contrastive Learning; Breast Cancer Mammography Examination Report

Authors:

岳崇浩, 张剑, 吴义熔, 李小龙, 华晟, 童顺航, 孙水发

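Affiliations:

Yichang Key Laboratory of Intelligent Medicine, Yichang 443002

College of Computer and Information Technology, China Three Gorges University, Yichang 443002

School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121

College of Economics and Management, China Three Gorges University, Yichang 443002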

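Keywords:

medical terminology standardization; multi-strategy candidate recall; contrastive learning; breast cancer mammography examination report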

Funding:

National Social Science Fund of China

Grant number:

20BTQ066

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0931

Data Analysis and Knowledge Discovery

National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD, CSSCI, CHSSCD, 北大核心 (Peking University Core Journals), EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(6)