Semantic information retrieval based on the case report dataset of Adverse Drug Reactions Journal

Abstract:

Objective: To explore the application value of semantic information retrieval (semantic retrieval) using the case report dataset of Adverse Drug Reactions Journal.

Methods: The dataset consisted of the PDF files of 2 597 case reports published in Adverse Drug Reactions Journal from 1999 to 2022. The semantic retrieval system was built on Baidu's PaddlePaddle deep learning framework, the code was written in Python, and the text encoding model was Baidu's RocketQA model. Precision at rank k (P@k), recall at rank k (R@k), mean reciprocal rank (MRR), mean average precision (MAP), and precision-recall (P-R) curves were used to evaluate the performance of semantic retrieval. Semantic retrieval and keyword matching retrieval were compared by calculating the recall of each method.

Results: After preprocessing, the title field served as the set of items to be retrieved and contained 2 597 documents. After deduplication and curation, the set of search terms (queries) contained 1 388 drug names and 1 118 adverse reactions/events. For semantic retrieval with drug name queries and adverse reaction/event queries, precision was 0.667 to 1 and 0.566 to 1, and recall was 0.667 to 0.871 and 0.566 to 0.863, respectively. The P-R curves for the top 1, 3, 5, and 10 documents retrieved with the two query types showed that, as recall increased, precision declined gradually for the top 1 and top 3 documents but markedly for the top 5 and top 10 documents. The MRR for the 2 query types was 0.854 and 0.871, and the MAP was 0.778 and 0.773, respectively. With adverse reactions/events as queries, semantic retrieval had higher recall than keyword matching retrieval; with drug names as queries, keyword matching retrieval generally had higher recall than semantic retrieval.

Conclusions: The semantic retrieval system built on the Baidu PaddlePaddle deep learning framework performed well on the case report dataset of Adverse Drug Reactions Journal. Semantic retrieval performed better with adverse reaction/event queries, whereas keyword matching retrieval performed better with drug name queries.
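The evaluation metrics named in the abstract (P@k, R@k, MRR, MAP) can be illustrated with a minimal Python sketch. This is not the authors' code: it assumes the standard textbook definitions, and the function names, document IDs, and relevance judgments below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant (k must be > 0)."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at each rank where a relevant document occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average each per-query metric over all queries (MRR and MAP are these averages)."""
    n = len(runs)
    return {
        "P@k": sum(precision_at_k(r, qrels[q], k) for q, r in runs.items()) / n,
        "R@k": sum(recall_at_k(r, qrels[q], k) for q, r in runs.items()) / n,
        "MRR": sum(reciprocal_rank(r, qrels[q]) for q, r in runs.items()) / n,
        "MAP": sum(average_precision(r, qrels[q]) for q, r in runs.items()) / n,
    }


if __name__ == "__main__":
    # Hypothetical toy data: ranked result lists and relevance judgments per query.
    runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
    print(evaluate(runs, qrels, k=3))
```

In this sketch, `runs` maps each query to its ranked result list and `qrels` maps each query to its set of relevant documents. How relevance was judged in the study itself is not stated in the abstract.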

English title: Semantic information retrieval based on the case report dataset of Adverse Drug Reactions Journal


English keywords:

Information storage and retrieval; Case reports; Database; Semantic retrieval; Keyword retrieval; Deep learning

Authors:

Xiao Yayi (肖雅艺), Lei Yi (雷毅), Wang Xin (王欣), Bai Xiangrong (白向荣), Zhang Qingxia (张青霞), Fei Xiaolu (费晓璐), Meng Yan (孟艳)

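Author affiliations:

Information Center, Xuanwu Hospital, Capital Medical University, Beijing 100053

Faculty of Information Technology, Beijing University of Technology, Beijing 100124

Medical Affairs Office, Xuanwu Hospital, Capital Medical University, Beijing 100053

Department of Pharmacy, Xuanwu Hospital, Capital Medical University, Beijing 100053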

Keywords:

Information storage and retrieval; Case reports; Database; Semantic retrieval; Keyword retrieval; Deep learning

Funding:

National Key Research and Development Program of China, Key Special Project (two grants)

Project numbers:

2020YFC2005505; 2020YFC2005503

Publication year:

2024

DOI:

10.3760/cma.j.cn114015-20230920-00691

Adverse Drug Reactions Journal

Chinese Medical Association

CSTPCD

Impact factor: 0.667

ISSN: 1008-5734

Year, volume (issue): 2024, 26(3)

Number of references: 23