Application of a combinatorial clustering algorithm in protein data analysis using Transformer
This study adapts the Transformer model to dimensionality reduction of protein features. Its self-attention mechanism lets the model capture long-range dependencies, and its multi-head attention design lets it capture interactions between features from different perspectives, which further improves the expressiveness and robustness of the reduced representation. The paper proposes a new combinatorial clustering algorithm, GRKM, which extends the original K-means algorithm in three ways: the Grey Wolf Optimization Algorithm determines the number of clusters K, a Random Walk algorithm determines the initial cluster centers, and the Mahalanobis distance measures the similarity between samples. Experiments on five representative protein datasets show that the improved algorithm substantially outperforms the algorithm before improvement on the silhouette coefficient and the Davies-Bouldin (DB) index. For the final analysis, APP protein data are clustered into eight classes and the biological function of each class is discussed, which also yields clear gains in interpretability. The proposed algorithm offers a reference tool for practical applications such as understanding protein function in depth, discovering potential biomarkers, and guiding drug design.
protein sequence; Transformer model; clustering algorithm; Mahalanobis distance; Random Walk; Grey Wolf Optimization Algorithm
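To make the distance and evaluation steps named in the abstract concrete, the following is a minimal sketch, not the authors' GRKM implementation. It assumes a single covariance matrix estimated from the whole data set, at least two feature dimensions, and initial centers passed in by the caller, since the abstract does not specify the Grey Wolf or Random Walk procedures. The names `mahalanobis_kmeans` and `davies_bouldin` are hypothetical, and the DB index here uses the standard Euclidean definition.

```python
import numpy as np


def mahalanobis_kmeans(X, init_centers, n_iter=100):
    """K-means-style loop that assigns samples by squared Mahalanobis distance.

    Assumption: one covariance matrix is estimated from all of X and reused
    for every cluster. The initial centers come from the caller.
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)  # copy, never alias the input
    k = len(centers)
    # Pseudo-inverse keeps the sketch usable when the covariance is singular.
    VI = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        diff = X[:, None, :] - centers[None, :, :]        # shape (n, k, d)
        d2 = np.einsum("nkd,de,nke->nk", diff, VI, diff)  # squared distances
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def davies_bouldin(X, labels):
    """Davies-Bouldin index with Euclidean distances (lower is better)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    cents = np.array([X[labels == j].mean(axis=0) for j in ids])
    # Mean distance of each cluster's members to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == j] - c, axis=1).mean()
        for j, c in zip(ids, cents)
    ])
    sep = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # exclude i == j from the maximum
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    return ratios.max(axis=1).mean()
```

A call such as `labels, centers = mahalanobis_kmeans(X, X[[0, 4]])` followed by `davies_bouldin(X, labels)` gives a cluster assignment and one of the two quality scores the abstract reports. Passing the centers explicitly leaves room to plug in whatever initialization the full method uses.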