Similarity calculation methods for small molecule compounds based on mass spectral entropy

Wu Liping (吴丽萍)¹, Xiang Cheng (向诚)², Zhang Haiqiang (张海强)¹, Li Yong (李勇)¹

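Author information

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650550
2. Faculty of Life Science and Technology, Kunming University of Science and Technology, Kunming 650550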

Abstract

In tandem mass spectrometry (MS2) data retrieval, searches typically rely on the similarity between mass spectra. Two problems limit current practice: spectra of unequal length reduce retrieval efficiency and accuracy, and commercial software offers only a single similarity calculation method. To address them, this study proposes two MS2 data alignment methods, "stitch and fill" and "match and fill", and performs similarity retrieval with a mass spectral entropy similarity measure based on information entropy. First, features are extracted from the normalized raw spectra so that only the data that highlight the characteristics of each spectrum are retained, and the spectra are then preprocessed with each of the two alignment methods. Next, using information entropy, the entropy difference between the virtual spectrum obtained by mixing the unknown and known spectra and the two individual spectra is calculated, which gives the correlation coefficient, i.e., the similarity, between the unknown and known spectra. Finally, a mass spectral dataset of small molecule compounds is used for validation. The results show that both preprocessing methods resolve the problem of unequal spectrum lengths in similarity calculation, and that the entropy-based similarity calculation is stable and yields reliable results. The method is suitable for similarity retrieval of small molecule compounds and offers a new option for spectral similarity calculation in commercial software.
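The abstract does not give the formulas, so the following is only a minimal sketch of how an entropy-based similarity between two aligned spectra might be computed. It assumes the commonly used spectral entropy definition, in which the similarity is 1 minus the entropy gain of the 1:1 mixed (virtual) spectrum divided by ln 4. The function names, the greedy peak matching, and the m/z tolerance `tol` are illustrative assumptions and are not taken from the paper; the "match and fill" step shown here is one plausible reading of that name (matched peaks share a slot, unmatched peaks are filled with zero).

```python
import math


def normalize(peaks):
    """Scale intensities so they sum to 1. `peaks` is a non-empty list of (mz, intensity)."""
    total = sum(intensity for _, intensity in peaks)
    return [(mz, intensity / total) for mz, intensity in peaks]


def match_and_fill(a, b, tol=0.01):
    """Align two peak lists into equal-length intensity vectors (assumed scheme).

    Peaks whose m/z values differ by at most `tol` share one slot; a peak with
    no partner in the other spectrum is paired with a zero intensity.
    """
    a, b = sorted(a), sorted(b)
    va, vb = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if abs(a[i][0] - b[j][0]) <= tol:
            va.append(a[i][1]); vb.append(b[j][1])
            i += 1; j += 1
        elif a[i][0] < b[j][0]:
            va.append(a[i][1]); vb.append(0.0)
            i += 1
        else:
            va.append(0.0); vb.append(b[j][1])
            j += 1
    for _, intensity in a[i:]:  # leftover peaks of spectrum a
        va.append(intensity); vb.append(0.0)
    for _, intensity in b[j:]:  # leftover peaks of spectrum b
        va.append(0.0); vb.append(intensity)
    return va, vb


def entropy(p):
    """Shannon entropy (natural log) of an intensity vector; zeros contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_similarity(a, b, tol=0.01):
    """Similarity from the entropy difference between the mixed spectrum and both inputs."""
    va, vb = match_and_fill(normalize(a), normalize(b), tol)
    mixed = [(x + y) / 2 for x, y in zip(va, vb)]  # virtual 1:1 mixed spectrum
    gain = 2 * entropy(mixed) - entropy(va) - entropy(vb)
    return 1 - gain / math.log(4)


unknown = [(100.0, 50.0), (150.0, 30.0), (200.0, 20.0)]
known = [(100.005, 45.0), (150.0, 35.0), (250.0, 20.0)]
print(entropy_similarity(unknown, known))
```

Under this definition, identical spectra score 1 and spectra with no shared peaks score 0, because mixing two disjoint normalized spectra raises the entropy by exactly ln 2 relative to their mean entropy.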

Key words

small molecule compounds/similarity calculation/MS2 data/information entropy/mass spectral entropy

Funding

National Natural Science Foundation of China (82160787)

&&(20025800400)

Publication year

2024

Journal of Beijing University of Chemical Technology (Natural Science Edition)

Beijing University of Chemical Technology

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.399

ISSN: 1671-4628

Number of references: 6
