Identification of crucial molecular fragments through ECFP fingerprints and decision trees
Fragment-based drug design is an emerging technique in pharmaceutical research. One of the key challenges in this approach is the identification and quantitative characterization of molecular fragments. A strategy based on molecular fingerprints and decision trees is proposed for identifying important molecular fragments. The strategy uses Extended-Connectivity Fingerprints (ECFP) to encode the molecular fragments of protein-ligand complexes. Three decision-tree-based models (Random Forest, XGBoost, and LightGBM) are employed to quantify feature importance, enabling the extraction of important molecular fragments with high reliability. The feature importance of the ECFP features follows an exponential decay trend, indicating that only a few ECFP features contribute significantly to the binding affinity of protein-ligand complexes. Molecular fragments that are consistently recognized as highly contributive by all three models can be regarded as highly reliable markers. These markers can be applied in fragment-based drug design and optimization.
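The consensus step described above can be illustrated with a minimal sketch. It assumes that each of the three models has already produced one importance score per ECFP bit (for example, from a fitted model's feature-importance vector) and that "consistently recognized" means ranking in the top k of every model. The function names, the cutoff k, and the numbers below are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from typing import Dict, List, Set


def top_k_features(importances: Dict[int, float], k: int) -> Set[int]:
    """Return the k ECFP bit indices with the highest importance."""
    ranked = sorted(importances, key=lambda bit: importances[bit], reverse=True)
    return set(ranked[:k])


def consensus_features(per_model: Dict[str, Dict[int, float]], k: int) -> List[int]:
    """Return bits ranked in the top k of every model.

    The result is sorted by the mean normalized importance across models,
    highest first. Normalizing per model makes the scores comparable.
    """
    shared = set.intersection(
        *(top_k_features(imp, k) for imp in per_model.values())
    )

    def mean_share(bit: int) -> float:
        shares = [imp[bit] / sum(imp.values()) for imp in per_model.values()]
        return sum(shares) / len(shares)

    return sorted(shared, key=mean_share, reverse=True)


# Hypothetical importances keyed by ECFP bit index (illustrative values only).
example = {
    "random_forest": {12: 0.40, 87: 0.25, 301: 0.20, 5: 0.10, 999: 0.05},
    "xgboost": {12: 0.35, 301: 0.30, 5: 0.20, 87: 0.10, 999: 0.05},
    "lightgbm": {301: 0.45, 12: 0.30, 640: 0.15, 87: 0.07, 5: 0.03},
}

consensus = consensus_features(example, k=3)
print(consensus)  # bits that all three models place in their top 3
```

With these illustrative values, only bits 12 and 301 appear in the top three of every model, so they would be the candidate markers. Mapping such bits back to the substructures that set them is a separate step that depends on the fingerprinting toolkit used.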