Abstract: Like other pharma companies, we maintain production QSAR models of ADMET end points and update them regularly. Here, for six ADMET end points, we examine the predictions of test set molecules on multiple versions of random forest models spanning a period of 10 years. For any given end point, the predictions for the majority of molecules are similar for all model versions. However, for a small minority of molecules, the prediction shifts substantially over the span of a few versions. For most molecules that shift, the prediction becomes more accurate at later times. This Perspective investigates metrics that can help indicate which molecules will shift substantially in prediction and when the shift will occur.
García-Ortegón Miguel, Simm Gregor N. C., Tripp Austin J., Hernández-Lobato José Miguel...
17 pages
Abstract: The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate compound’s interaction with the target. By contrast, molecular docking is a widely applied method in drug discovery to estimate binding affinities. However, docking studies require a significant amount of domain knowledge to set up correctly, which hampers adoption. Here, we present dockstring, a bundle for meaningful and robust comparison of ML models using docking scores. dockstring consists of three components: (1) an open-source Python package for straightforward computation of docking scores, (2) an extensive dataset of docking scores and poses of more than 260,000 molecules for 58 medically relevant targets, and (3) a set of pharmaceutically relevant benchmark tasks such as virtual screening or de novo design of selective kinase inhibitors. The Python package implements a robust ligand and target preparation protocol that allows nonexperts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more realistic evaluation objective than simple physicochemical properties, yielding benchmark tasks that are more challenging and more closely related to real problems in drug discovery.
Abstract: Synthesis planning and reaction outcome prediction are two fundamental problems in computer-aided organic chemistry for which a variety of data-driven approaches have emerged. Natural language approaches that model each problem as a SMILES-to-SMILES translation lead to a simple end-to-end formulation, reduce the need for data preprocessing, and enable the use of well-optimized machine translation model architectures. However, SMILES representations are not efficient for capturing information about molecular structures, as evidenced by the success of SMILES augmentation to boost empirical performance. Here, we describe a novel Graph2SMILES model that combines the power of Transformer models for text generation with the permutation invariance of molecular graph encoders that mitigates the need for input data augmentation. In our encoder, a directed message passing neural network (D-MPNN) captures local chemical environments, and the global attention encoder allows for long-range and intermolecular interactions, enhanced by graph-aware positional embedding. As an end-to-end architecture, Graph2SMILES can be used as a drop-in replacement for the Transformer in any task involving molecule(s)-to-molecule(s) transformations, which we empirically demonstrate leads to improved performance on existing benchmarks for both retrosynthesis and reaction outcome prediction.
Abstract: Imbalanced data sets in materials informatics are pervasive and pose a challenge to the development of classification models. This work investigates crystal point group prediction as an example of an imbalanced classification problem in materials informatics. Multiple resampling and classification techniques were considered. The findings suggest that the most influential variable of the resampling algorithms is the one controlling the number of samples to omit (undersample) or synthetically generate (oversample), as expected. The effect of balancing is to enhance the classification performance of the minority class at the cost of reducing the correct predictions of the majority class. Moreover, ideal balancing, where the classes are precisely balanced, is not optimal. Instead, partial balancing should be performed. In this study, the ideal ratio of the minority to majority class was found to be around two-thirds. The biggest improvement in the classification was for the random undersampling technique with k-nearest neighbors and random forest.
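The partial-balancing idea in this abstract can be illustrated with a minimal sketch of random undersampling toward a minority-to-majority ratio of two-thirds. This is not the authors' code: the function name `undersample_to_ratio`, the toy labels, and the simplification to two classes (the point group problem in the paper is multi-class) are all assumptions made for illustration.

```python
import random
from collections import Counter


def undersample_to_ratio(labels, minority_label, ratio=2 / 3, seed=0):
    """Randomly undersample the non-minority samples.

    Keeps every minority sample and a random subset of the remaining
    samples so that n_minority / n_majority is approximately `ratio`.
    Returns the sorted indices of the retained samples.
    Hypothetical helper, binary case only.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Number of majority samples needed to reach the target ratio,
    # capped at the number actually available.
    n_keep = min(len(majority), round(len(minority) / ratio))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


# Toy data: 20 minority and 100 majority samples (assumed, not from the paper).
labels = ["minor"] * 20 + ["major"] * 100
idx = undersample_to_ratio(labels, "minor")
counts = Counter(labels[i] for i in idx)
```

With a ratio of 1.0 the same function would produce exact balancing, which the abstract reports as suboptimal; the retained indices can then be used to subset the feature matrix before training a classifier such as k-nearest neighbors or random forest.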
Abstract: Graph-based architectures are becoming increasingly popular as a tool for structure generation. Here, we introduce a novel open-source architecture, HyFactor, in which, similar to the InChI linear notation, the number of hydrogens attached to the heavy atoms is considered instead of the bond types. HyFactor was benchmarked on the ZINC 250K, MOSES, and ChEMBL data sets against the conventional graph-based architecture ReFactor, our implementation of the DEFactor architecture reported in the literature. On average, HyFactor models contain some 20% fewer fitting parameters than those of ReFactor. The two architectures display similar validity, uniqueness, and reconstruction rates. Compared to the training set compounds, HyFactor generates more similar structures than ReFactor. This could be explained by the fact that the latter generates many open-chain analogues of cyclic structures in the training set. It has been demonstrated that the reconstruction error of heavy molecules can be significantly reduced using data augmentation. The codes of HyFactor and ReFactor as well as all models obtained in this study are publicly available from our GitHub repository: https://github.com/Laboratoire-de-Chemoinformatique/HyFactor.
Aniceto Natália, Bonifácio Vasco D. B., Guedes Rita C., Martinho Nuno...
16 pages
Abstract: Blocking the catalytic activity of urease has been shown to have a key role in different diseases as well as in different agricultural applications. A vast array of molecules have been tested against ureases of different species, but the clinical translation of these compounds has been limited due to challenges of potency, chemical and metabolic stability, as well as promiscuity against other proteins. The design and development of new compounds greatly benefit from insights from previously tested compounds; however, no large-scale studies surveying the urease inhibitors’ chemical space exist that can provide an overview of the compounds developed to date. Therefore, given the increasing interest in developing new compounds for this target, we carried out a comprehensive analysis of the activity landscape published so far. To do so, we assembled and curated a data set of compounds tested against urease. To the best of our knowledge, this is the largest data set of urease inhibitors to date, composed of 3200 compounds of diverse structures. We characterized the data set in terms of chemical space coverage, molecular scaffolds, distribution with respect to physicochemical properties, as well as temporal trends of drug development. Through these analyses, we highlighted different substructures and functional groups responsible for distinct activity and inactivity against ureases. Furthermore, activity cliffs were assessed, and the chemical space of urease inhibitors was compared to DrugBank. Finally, we extracted meaningful patterns associated with activity using a decision tree algorithm. Overall, this study provides a critical overview of urease inhibitor research carried out in the last few decades and reveals underlying SAR patterns such as under-reported chemical functional groups that contribute to the overall activity. With this work, we propose different rules and practical implications that can guide the design or selection of novel compounds to be screened as well as lead optimization.
Blay Vincent, Radivojevic Tijana, Allen Jonathan E., Hudson Corey M....
14 pages
Abstract: The growing capabilities of synthetic biology and organic chemistry demand tools to guide syntheses toward useful molecules. Here, we present Molecular AutoenCoding Auto-Workaround (MACAW), a tool that uses a novel approach to generate molecules predicted to meet a desired property specification (e.g., a binding affinity of 50 nM or an octane number of 90). MACAW describes molecules by embedding them into a smooth multidimensional numerical space, avoiding uninformative dimensions that previous methods often introduce. The coordinates in this embedding provide a natural choice of features for accurately predicting molecular properties, which we demonstrate with examples for cetane and octane numbers, flash points, and histamine H1 receptor binding affinity. The approach is computationally efficient and well-suited to the small- and medium-size datasets commonly used in biosciences. We showcase the utility of MACAW for virtual screening by identifying molecules with high predicted binding affinity to the histamine H1 receptor and limited affinity to the muscarinic M2 receptor, which are targets of medicinal relevance. Combining these predictive capabilities with a novel generative algorithm for molecules allows us to recommend molecules with a desired property value (i.e., inverse molecular design). We demonstrate this capability by recommending molecules with predicted octane numbers of 40, 80, and 120, the octane number being an important characteristic of biofuels. Thus, MACAW augments classical retrosynthesis tools by providing recommendations for molecules on specification.
Abstract: In modern drug design, one of the main issues is the optimization of an initial lead structure toward a drug candidate by modifying specific properties in the desired direction. The synthetic feasibility of the target structure is often neglected during this process, resulting in structures with low or suboptimal synthetic accessibility. In this work, we present a novel approach for synthesis-aware lead optimization called Synthesia. In contrast to the traditional approaches, Synthesia integrates the preservation of the synthesizability of the target structure into the lead structure modification process. Synthesia is able to create structural diversity for a lead structure that matches user-defined molecular properties without losing the applicability of a particular synthetic pathway. The methodology is validated by demonstrating that Synthesia is capable of providing structural analogues of DrugBank compounds that meet generic modification goals and maintain their synthetic pathways. In addition, Synthesia is used to cluster compounds from two different patent structure series (CDK7, Daurismo) according to their compatibility with the same synthetic pathways, maximizing the synthetic efficiency and providing an initial estimation of the effort of synthesizing the entire series. Altogether, we demonstrate Synthesia’s ability to modify compound properties while maintaining in silico synthesizability.
Defelipe Lucas A., Arcon Juan Pablo, Turjanski Adrian G., Mayol Gonzalo F....
12 pages
Abstract: Protein–protein interactions (PPIs) are essential, and modulating their function through PPI-targeted drugs is an important research field. PPI sites are shallow protein surfaces readily accessible to the solvent, thus lacking a proper pocket to fit a drug, while their lack of endogenous ligands prevents drug design by chemical similarity. The development of PPI-blocking compounds is, therefore, a tough challenge. Mixed solvent molecular dynamics has been shown to reveal protein–ligand interaction hot spots in protein active sites by identifying solvent sites (SSs). Furthermore, our group has shown that SSs significantly improve protein–ligand docking. In the present work, we extend our analysis to PPI sites. In particular, we analyzed water, ethanol, and phenol-derived sites in terms of their capacity to predict protein–drug and protein–protein interactions. Subsequently, we show how this information can be incorporated to improve both protein–ligand and protein–protein docking. Finally, we highlight the presence of aromatic clusters as key elements of the corresponding interactions.
Brinkmann Bregje W., Singhal Ankush, Sevink G. J. Agur, Neeft Lisette...
15 pages
Abstract: Ingested nanomaterials are exposed to many metabolites that are produced, modified, or regulated by members of the enteric microbiota. The adsorption of these metabolites potentially affects the identity, fate, and biodistribution of nanomaterials passing through the gastrointestinal tract. Here, we explore these interactions using in silico methods, focusing on a concise overview of 170 unique enteric microbial metabolites which we compiled from the literature. First, we construct quantitative structure–activity relationship (QSAR) models to predict their adsorption affinity to 13 metal nanomaterials, 5 carbon nanotubes, and 1 fullerene. The models could be applied to predict log k values for 60 metabolites and were particularly applicable to ‘phenolic, benzoyl and phenyl derivatives’, ‘tryptophan precursors and metabolites’, ‘short-chain fatty acids’, and ‘choline metabolites’. The correlations of these predictions to biological surface adsorption index descriptors indicated that hydrophobicity-driven interactions contribute most to the overall adsorption affinity, while hydrogen-bond interactions and polarity/polarizability-driven interactions differentiate the affinity to metal and carbon nanomaterials. Next, we use molecular dynamics (MD) simulations to obtain direct molecular information for a selection of vitamins that could not be assessed quantitatively using QSAR models. This showed how large and flexible metabolites can gain stability on the nanomaterial surface via conformational changes. Additionally, unconstrained MD simulations provided excellent support for the main interaction types identified by QSAR analysis. Combined, these results enable assessing the adsorption affinity for many enteric microbial metabolites quantitatively and support the qualitative assessment of an even larger set of complex and biologically relevant microbial metabolites to carbon and metal nanomaterials.