The use of Informatics Approaches in Cheminformatics
Alexander Tropsha
Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy

OUTLINE
• Overview of Current Projects
• Background on Cheminformatics
• Examples of Application Projects: Data Retrieval, Modeling, Testable Hypothesis Generation, Validation

Overview of Current Projects
[Figure: research areas of the laboratory. Labels:]
• Database of compounds (with their measured activities for multiple targets); tools to visualize and navigate chemical space
• Structure-Activity Relationships (SAR) modeling: descriptors (e.g., C-C=C-O), physico-chemical properties (logS, BP, MP, logK, etc.), biological activities
• C-ChemBench / CECCR project
• Computational Chemical Biology: Complementary Ligands Based on Receptor Information (CoLiBRI); protein structure-function relationships modeling; Simplicial Neighborhood Analysis of Protein Packing (SNAPP); protein-ligand recognition
• Activity/function prediction for molecules: empirical rules/filters, similarity search, consensus QSAR models (selected models)
• VIRTUAL SCREENING: ~10^6 – 10^9 molecules screened down to ~10^2 – 10^3 molecules
• Cheminformatics and Structural Bioinformatics: descriptors and QSAR approaches (modeling techniques, applicability domain definitions, etc.); tools for chemical data mining
• Computational Chemical Toxicology: Tetrahymena pyriformis

The Laboratory for Molecular Modeling
Principal Investigator: Alexander Tropsha
Research Professors: Clark Jeffries, Alexander Golbraikh, Hao Zhu, Simon Wang, M. Karthikeyan
Graduate Research Assistants: Christopher Grulke, Nancy Baker, Kun Wang, Hao Tang, JuiHua Hsieh, Rima Hajjo, Tanarat Kietsakorn, Tong Ying Wu, Liying Zhang, Melody Luo, Guiyu Zhao, Andrew Fant
Postdoctoral Fellows: Georgiy Abramochkin, Lin Ye, Denis Fourches
Visiting Research Scientists: Achintya Saha, Aleks Sedykh, Berk Zafer
Adjunct Members: Weifan Zheng, Shubin Liu
Research Programmer: Theo Walker
System Administrator: Mihir Shah

MAJOR FUNDING
NIH
- P20-HG003898 (RoadMap)
- R21GM076059 (RoadMap)
- R01-GM66940
- GM068665
EPA (STAR awards)
- RD832720
- RD833825

What is Chemoinformatics?
Dr. Frank Brown introduced the term “chemoinformatics” in the Annual Reports of Medicinal Chemistry in 1998:
“The use of information technology and management has become a critical part of the drug discovery process. Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and organization”
In fact, Chemoinformatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information.
[Figure labels: toxicity prioritization & screening; environmental toxicity screening]
Slide courtesy of Ann Richard
http://www.bioinfoinstitute.com/chemoinfo.htm

NIH’s Molecular Libraries Initiative in numbers
NIH Roadmap Initiative: Molecular Libraries Initiative
• 4 Chemical Synthesis Centers (CombiChem, parallel synthesis, DOS): 4 centers + DPI; 100K – 1M compounds; expected 1M compounds
• MLSCN (9+1): 9 centers, 1 NIH intramural; 20 x 10 = 200 assays
• PubChem (NLM)
• ECCR (6): Exploratory
• Predictive ADMET Centers (10)
Current SAR matrix (as of May 25, 2007):
- 256 different MLSCN bioassays
- over 140,000 chemicals
- 29,558 compounds categorized as “active” in at least one MLSCN bioassay

Chemocentric view of biological data
[Figure labels: Toxicity, Risk Assessment, SAR (structure-activity relationships), increasing uncertainty]

[Figure: cheminformatics methods]
Quantitative Structure-Activity Relationships (QSAR), pharmacophore mapping, docking, molecular modeling, molecular mechanics, property filtering, 2D and 3D substructural searching, similarity searching, molecular diversity analysis, quantum mechanical and semiempirical methods, ADMET, scoring functions, druglikeness, decision trees, neural networks, virtual screening, data mining, cluster analysis, graph theory, multiple linear regression, principal components analysis, inductive logic reasoning, genetic algorithms, Active Analog, Hansch, Free-Wilson

[Figure: application areas]
Pharmaceutical Sciences, Drug Discovery, Chemical Design, Materials Science, Green Chemistry, Agricultural Pesticides, Food Science, Polymers, Atmospheric Chemistry, Environmental Studies, Predictive Toxicology

Key point: Focus on Externally Validated Predictions
• Input: SAR dataset; external database/library
• Cheminformatics Magic
• Output: small number of computational hits
• Real test: a large fraction are confirmed actives

Cheminformatics Analysis of Assertions Describing Drug-Induced Liver Injury in Different Species
In Collaboration with BioWisdom, UK

Background
• Drug Induced Liver Injury (DILI) is one of the major causes of drug toxicity, both during clinical development and post-approval
• Animal studies, and clinical trials on limited populations, are used to establish drug safety; both appear insufficient
• A wealth of published information that could deepen our understanding of the mechanisms of DILI is available, but the information is scattered across published works that use inconsistent language

Introduction to the Safety Intelligence Program (SIP)
An industry-sponsored initiative that embraces the expertise of its pharmaceutical members and other stakeholders to build the world's most comprehensive intelligence resource for use in improving drug safety assessments.

The Safety Intelligence System
5,700 pathologies; 8,500 compounds; 192,000 facts; 1 interface
The largest, ever-expanding collection of known effects of chemicals in different tissues, drug effects on clinical biomarkers of tissue injury, and drug molecular mechanisms.
Facts (assertions) derived from:
• Biomedical literature
• Regulatory documents: EMEA EPARs, FDA NDAs
• Label data
• And many more…

Intelligence Network Build Process
[Figure: flow diagram of the build process. Labels:]
• Inputs: public domain sources, licensed sources, proprietary sources
• Structured data sources (e.g., GO, UMLS, SWISS_PROT)
• Unstructured data sources (e.g., Medline, patents, FDA SBAs)
• Processing components: meta-search, spiders, structured data loader, data source descriptors, concept maps, Sofia terminology & ontology, user-defined term list, noun phrase discovery, selected corpus
• Automated assertion generation: raw assertion discovery, relationship discovery, relations typing, semantic normalisation, chemistry canonicalisation
• QA (pass/fail) and DocView (manual validation), leading to the Intelligence Network
Slide courtesy of Julie Barnes, BioWisdom

Species Concordance Study Design
• The Safety Intelligence System contains comprehensive assertional meta-data describing >5,800 effects of >8500 compounds in the liver
  E.g. ‘Acetaminophen INDUCES Hepatocyte Death (mouse)’ (pathological effect)
  E.g. ‘Prednisolone SUPPRESSES Collagen Synthesis (human)’ (physiological effect)
• A subset of the above assertional meta-data, referenced by MEDLINE or the EMEA EPARs, was exported from the Safety Intelligence System for analysis
• The data were restricted to therapeutic products only
• The compounds were assigned to human, rodent or non-rodent groups according to the species in which the effect was reported
• The concordance of drug-induced liver effects across humans, rodents and non-rodents was determined

Species Concordance of Drug-Induced Liver Effects: Assertions Evidenced by MEDLINE
• 14,600 assertions, 1061 compounds
• Large data set, lending itself to quantitative analyses
• Non-rodent data are less well represented than human and rodent data

Objectives
• Can we employ cheminformatics approaches to validate assertions of drug-induced liver effects in different species?
• Can we identify chemotypes that define species-specific liver effects?
• Can we establish chemistry-driven rules for concordance (or lack thereof) between chemical effects on humans vs. non-humans?

Project Workflow
• Primary data sources
• BioWisdom Safety Intelligence System: assertional meta-data generated using the Sofia™ platform
• Assertion export
• SIP Members: assertion refinement
• Chemical curation, fragment analysis & QSAR

Study Design
• Used assertions evidenced by MEDLINE, rather than EMEA EPARs, because of their greater quantity
• Used rodent and human data to build the model (knowing that non-rodent data are sparse in MEDLINE)
• Used non-rodent data (where a liver effect was observed) to validate the model

Curation of Chemical Data
Step 1: all inorganic molecules have been removed, as well as those having no available SMILES strings.
(993 of 1061 molecules remaining)
Examples:
• Zinc chloride: Cl[Zn]Cl
• Ferrous sulfate: [Fe+2].[O-]S(=O)(=O)[O-]
• Sulfur: [S]
• Cobalt dichloride: [Cl-].[Cl-].[Co+2]
• Manganese chloride: [Cl-].[Cl-].[Mn+2]
• Activated charcoal: C
• cis-Diaminedichloroplatinum: [NH4+].[NH4+].[Cl-].[Cl-].[Pt+2]

Step 2: 2D structures were obtained from the SMILES strings, using JChem software from ChemAxon. Then, all counter-ions have been removed and molecules have been neutralized, using ChemAxon Standardizer (+aromatization, +normalization of nitro groups). (989 compounds remaining)
Example: Na+ counter-ion [structure figure]

Step 3: manual molecular cleaning to correct some structures and to remove compounds with non-sensible SMILES or duplicates (951 of 1061 molecules remaining)

Data transformation for the revised Venn diagram
• The species profile for each compound (951) was retrieved from the original data automatically with a program written in Delphi.
• For the cheminformatics analysis, we assumed that each compound has been tested in all species, i.e., humans, rodents and non-rodents.
• “1” = known liver effect; “0” = no liver effect

The Venn Diagram of the Curated Dataset
Total number of compounds: 951
• HUMAN (650), RODENT (685), NON-RODENT (166)
• Human only: 236
• Rodent only: 257
• Non-rodent only: 18
• Human and rodent only: 292
• Human and non-rodent only: 12
• Rodent and non-rodent only: 26
• Human, rodent and non-rodent: 110

1. Clustering of compounds in the chemistry space*
• Calculation of fragment descriptors: sequences of atoms/bonds (e.g., C*C*C-C=O, C*C-C=O, C-C=O, C-C, C=O, C*C)
• The fragment descriptors are the inputs for the clustering algorithm
*ISIDA is developed in the group of Prof. A. Varnek, Univ. of Strasbourg.

1. Clustering of 951 compounds in the chemistry space
For cluster analysis we used fragment descriptors, a hierarchical algorithm, Euclidean distance between compounds, and complete linkage between clusters. Small clusters were identified with high levels of similarity between compounds.
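The clustering setup described above (fragment descriptors, hierarchical algorithm, Euclidean distance, complete linkage) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the fragment-count vectors are made up, the function name `complete_linkage` is hypothetical, and the stopping rule (a fixed number of clusters) is an assumption, since the slides do not say how the dendrogram was cut. The actual study used ISIDA fragment descriptors on 951 compounds.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def complete_linkage(vectors, n_clusters):
    """Agglomerative clustering with complete linkage.

    vectors: list of equal-length fragment-count vectors (one per compound).
    Returns a sorted list of clusters, each a sorted list of compound indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest complete-linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical fragment counts for five compounds over four fragment types
# (columns could be, e.g., C-C, C=O, C*C, C-N); the values are made up.
fragment_counts = [
    [4, 1, 6, 0],
    [4, 1, 6, 1],
    [5, 1, 6, 0],
    [0, 3, 0, 2],
    [1, 3, 0, 2],
]
clusters = complete_linkage(fragment_counts, n_clusters=2)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

This naive version recomputes distances at every merge, which is fine for a toy example; for a data set of the size discussed here a library implementation would be the more practical choice.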
Clustering of compounds in chemical space
Example 1: Barbiturate derivatives (compounds a–d); sedation/anaesthesia

  ID    HUMAN  RODENT  NON-RODENT
  45    0      1       0
  76    0      1       0
  93    0      1       0
  543   0      1       0

Example 2: a = cladribine, b = clofarabine, c = cordycepin; all anticancer drugs

     ID    HUMAN    RODENT   NON-RODENT
  a  201   1        0        0
  b  208   1        0        0
  c  223   0 (???)  1 (???)  0

1. Example 1: Assessing potential data gaps
Allobarbital, Aprobarbital, Barbital, Methohexital (compounds a–d): each has HUMAN = 0, RODENT = 1, NON-RODENT = 0
• Recent mining of MEDLINE did not identify any evidence for these compounds having human liver effects
• Basic searches in Google (e.g. barbital, human, hepatotoxicity) did not reveal evidence for these compounds having human liver effects
• The apparent lack of human liver effects may be due to these compounds being used for sedation/anaesthesia, where lower doses and shorter exposures may be used than in animal studies

1. Example 2: Assessing potential data gaps
a, Cladribine: HUMAN = 1, RODENT = 0, NON-RODENT = 0
b, Clofarabine: HUMAN = 1, RODENT = 0, NON-RODENT = 0
• Recent mining of MEDLINE did not identify any new evidence for 2a and 2b having rodent liver effects
• However, EMEA EPAR data in the Safety Intelligence System did identify 2b as having rodent liver effects (no rodent liver effects identified for 2a)
c, Cordycepin: HUMAN = 0 (???), RODENT = 1 (???), NON-RODENT = 0
• Recent mining of MEDLINE did identify an effect of 2c in a human hepatocellular cell line

1. Clustering of compounds in chemical space
Example 3: a. amiodarone (antiarrhythmic agent), b. benzarone (used for treatment of peripheral vascular disorders), c. benzbromarone (uricosuric agent, used for gout), d. benziodarone (vasodilator).

     ID    HUMAN  RODENT  NON-RODENT
  a  60    1      1       1
  b  98    1      1       0
  c  99    1      1       0
  d  100   0      1       0

Does compound d lack human liver effects?

1. Example 3: Assessing potential data gaps
d, Benziodarone: HUMAN = 0, RODENT = 1, NON-RODENT = 0. Does this compound lack human liver effects?
• Recent mining of MEDLINE did not identify any new evidence for 3d having human liver effects
• However, a basic search in Google (e.g. benziodarone, human, hepatotoxicity) did reveal that the drug caused hepatotoxicity in humans (inferred)

1. Clustering of compounds in chemical space
Example 4: Estrogen-like compounds

     Compound            ID    HUMAN  RODENT  NON-RODENT
  a  2-methoxyestradiol  8     1      1       0
  b  Estradiol           329   1      1       1
  c  Estriol             332   0      1       0
  d  Estrone             333   1      1       0
  e  Ethinyl estradiol   338   1      1       1

1. Example 4: Assessing potential data gaps
c, Estriol: HUMAN = 0, RODENT = 1, NON-RODENT = 0
• Recent mining of MEDLINE and a basic search in Google (e.g. estriol, human, hepatotoxicity) did not identify any new evidence for estriol (c) having human liver effects

1. Clustering of compounds in chemical space
• Some clusters have been identified in which compounds share highly similar molecular structures and also toxicity profiles for H, R and NR. This information is highly important for identifying chemotypes that define species-specific DILI effects.
• However, in some clusters, similar compounds appear to display different toxicity profiles. These cases may correspond to missing or unreported data, and highlight areas for gap-spotting or additional experimental investigation.
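The gap-spotting idea summarized above (structurally similar compounds whose species profiles disagree) can be expressed as a simple check within each cluster. The sketch below is an illustration, not the procedure used in the study: the profiles are copied from the barbiturate and amiodarone-like examples in the slides, while the cluster names, the function name `flag_possible_gaps`, and the majority-vote rule are assumptions introduced here.

```python
from collections import Counter

# Species profiles (HUMAN, RODENT, NON-RODENT) keyed by compound ID, taken from
# the cluster examples in the slides; the cluster names are hypothetical labels.
clusters = {
    "barbiturates": {45: (0, 1, 0), 76: (0, 1, 0), 93: (0, 1, 0), 543: (0, 1, 0)},
    "benzofurans": {60: (1, 1, 1), 98: (1, 1, 0), 99: (1, 1, 0), 100: (0, 1, 0)},
}


def flag_possible_gaps(members, species_index=0):
    """Return IDs whose value for one species differs from the cluster majority.

    species_index: 0 = HUMAN, 1 = RODENT, 2 = NON-RODENT.
    A flagged compound is only a candidate for missing or unreported data.
    """
    values = [profile[species_index] for profile in members.values()]
    majority, _ = Counter(values).most_common(1)[0]
    return sorted(cid for cid, profile in members.items()
                  if profile[species_index] != majority)


gaps = {name: flag_possible_gaps(members) for name, members in clusters.items()}
print(gaps)  # {'barbiturates': [], 'benzofurans': [100]}
```

Here compound 100 (benziodarone) is flagged for the human species, matching the question raised on the slide. The barbiturate cluster is not flagged because its members agree with each other, which shows a limit of this rule: it cannot detect a gap shared by every member of a cluster.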
Analysis of chemical fragment distribution
• Set A (HUMAN ONLY): compounds found to show liver effects for humans only
• Set B (RODENT ONLY): compounds lacking liver effects for humans
Are there differences in fragment distributions between compounds displaying human- vs. rodent-specific effects?

STRUCTURE REPRESENTATION
[Figure: naphthalen-1-amine as viewed by computers (MOL file), by another molecule, and by chemists]
• Graphs are widely used to represent and differentiate chemical structures, where atoms are vertices and bonds are expressed as edges connecting these vertices.
• Molecular graphs allow the computation of numerous indices (molecular descriptors) to compare them quantitatively.

2. Analysis of fragment distributions within sets A and B

  Fragment type     FA     FB     ΔF
  C-N-C             71.6   49.0   22.6
  C-C-C-N-C         50.0   28.0   22.0
  C-C-C-N           58.9   37.4   21.5
  C-C-N-C           64.0   43.6   20.4
  C-C-N-C-C         39.8   20.6   19.2
  C-N               86.4   67.7   18.7
  C-C-N             76.3   59.1   17.1
  C-N-C-C-N         24.2    7.8   16.4
  C-C-C-N-C-C       30.9   15.2   15.8
  C-N-C-C-N-C       21.2    5.8   15.3
  N-C-C-N           24.6    9.7   14.8
  C*N               35.2   20.6   14.5
  C*C               80.1   66.1   13.9
  C-C-N-C-C-O       22.0    8.6   13.5
  C-C-N-C=O         29.2   16.0   13.3
  C*C*N             33.1   19.8   13.2
  C-C-N-C-C-N       18.6    6.2   12.4
  S-C               23.3   10.9   12.4
  C-C-N-C-C-N-C     17.8    5.8   12.0
  C-S-C             15.3    3.5   11.8
  C-N-C-C-O         29.2   17.5   11.7
  C-N-C=O           37.7   26.1   11.6
  C*C*C*C           70.8   59.1   11.6
  C-S-C-C           13.6    1.9   11.6
  C-C-N-C-C=O       17.4    5.8   11.5
  O-C-C-N-C=O       15.7    4.3   11.4
  C=C-N             15.3    3.9   11.4
  C-N-C-C=O         19.9    8.6   11.4
  C-N-C=C           14.0    2.7   11.3
  C*C*C             75.0   63.8   11.2
  C-C-C             86.9   75.9   11.0
  N-C-C-N-C-C-O     12.7    1.9   10.8
  C-C-C=O           47.9   37.4   10.5
  O=C-C-N-C=O       15.7    5.4   10.2
  C-C-C-N-C-C-N     14.8    4.7   10.2
  S-C-C             14.4    4.3   10.1
  N-C=O             42.8   32.7   10.1
  C*C*C*N           23.3   13.2   10.1
  C*N*C             29.7   19.8    9.8
  C-C-C-C-N         33.1   23.3    9.7
  C-C-C-N-C-C=O     13.1    3.5    9.6
  N-C*N             15.7    6.2    9.5
  C-C=C-N           12.7    3.5    9.2
  N-C-C-N-C-C=O     11.4    2.3    9.1
  C=C-C-O           14.4    5.4    9.0
  C-C-C-N-C-C-C     14.4    5.4    9.0
  C-C=C-N-C         11.4    2.7    8.7
  S-C-C-C           11.4    2.7    8.7
  N-C-C=O           20.8   12.1    8.7
  C-C-C-C-N-C       27.1   18.7    8.4
  C-C*N             17.4    8.9    8.4
  Etc.

FA = Fragment Frequency (%) for set A (Human Only, 236 compounds)
FB = Fragment Frequency (%) for set B (Rodent Only, 257 compounds)

2. Differential fragment frequency distribution
FA = Fragment Frequency in A
FB = Fragment Frequency in B
ΔF = (FA - FB)

3. Binary QSAR based classification
• Class A (248), HUMAN ONLY: compounds known to affect liver in humans only
• Class B (283), RODENT ONLY: compounds NOT affecting liver in humans
Can we predict the compound class from its structure only?

Principle of QSAR/QSPR modeling: Introduction
[Figure: a table of COMPOUNDS, each encoded by DESCRIPTORS and associated with a PROPERTY value; Quantitative Structure Property Relationships link the descriptors to the property]

3. QSAR based classification
Using SUPPORT VECTOR MACHINES (SVM)
Accuracy (%) = 100 × (number of compounds correctly predicted) / (total number of compounds)

  Fold  Model ID  Descriptors  Modeling set 5-fold CV  Modeling set accuracy  External set accuracy
  1     217       fragments    62.3%                   88.2%                  71.0%
  1     162       Dragon       62.9%                   77.6%                  67.3%
  2     112       fragments    64.9%                   81.2%                  64.2%
  2     197       Dragon       67.5%                   81.2%                  55.7%
  3     194       fragments    62.4%                   91.3%                  64.2%
  3     198       Dragon       65.2%                   91.1%                  61.3%

Folds 4 and 5 (values in extracted order; column assignment unclear):
  4     208 fragments, 151 Dragon: 64.9%, 99.3%, 72.6%, 84.9%, 82.6%, 68.9%, 68.9%
  5     205 fragments, 175 Dragon: 62.1%, 63.3%, 61.9%, 94.4%, 70.8%

NB: Preliminary results; could be improved.
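The evaluation bookkeeping behind the table above (the accuracy formula and 5-fold cross-validation on the modeling set) can be sketched with the standard library alone. To keep the example self-contained, the SVM is replaced by a nearest-neighbour classifier on fragment sets; this stand-in, the toy data, and all function names are assumptions made for illustration and are not the models reported in the slides.

```python
import random


def accuracy(y_true, y_pred):
    """Accuracy (%) = 100 * correctly predicted / total number of compounds."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def tanimoto(a, b):
    """Tanimoto similarity between two fragment sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_1nn(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): label of the most similar training compound."""
    best = max(range(len(train_x)), key=lambda i: tanimoto(train_x[i], query))
    return train_y[best]


def cross_validate(x, y, k=5, seed=0):
    """Return the accuracy (%) obtained on each of the k held-out folds."""
    scores = []
    for test in k_fold_indices(len(x), k, seed):
        held_out = set(test)
        train = [i for i in range(len(x)) if i not in held_out]
        preds = [predict_1nn([x[i] for i in train], [y[i] for i in train], x[j])
                 for j in test]
        scores.append(accuracy([y[j] for j in test], preds))
    return scores


# Hypothetical toy data: fragment sets labelled "A" (human only) or "B" (rodent only).
x = [{"C-N", "C-N-C"}, {"C-N", "C-C-N"}, {"C-N-C", "S-C"}, {"C-N", "C-N-C", "S-C"},
     {"C-N", "N-C-C-N"}, {"C-C", "C=O"}, {"C-C", "C*C"}, {"C=O", "C*C"},
     {"C-C", "C=O", "C*C"}, {"C-C", "C-C-C"}]
y = ["A"] * 5 + ["B"] * 5
fold_scores = cross_validate(x, y, k=5)
print(fold_scores, sum(fold_scores) / len(fold_scores))
```

The toy classes are deliberately well separated, so the scores here say nothing about real performance. As the slides stress, the external set accuracy is the meaningful figure, and it would be computed with the same `accuracy` function on compounds never used for modeling.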
QSAR based classification
• Modeling set: Class A (248), HUMAN ONLY, and Class B (283), RODENT ONLY, are used to build the QSAR MODELS
• EXTERNAL SET: 18 compounds reporting no liver effects in humans or rodents

3. QSAR based classification
External set: 18 compounds

  Model ID  Descriptors  Modeling set 5-fold CV  Modeling set accuracy  External set accuracy
  206       Fragments    62.9%                   92.5%                  77.8%
  141       Dragon       64.0%                   97.9%                  66.7%

14 of 18 compounds are predicted to lack liver effects for humans. 4 compounds are predicted to have human liver effects.
BUT: Missing/unreported data???
• Sulfadoxine (ID = 820): Human = 0, Rodent = 0
• IN THE MODELING SET: Sulfadimethoxine (ID = 819): Human = 1, Rodent = 0

3. Sulfadoxine: Assessing potential data gap
Sulfadoxine: Human = 0, Rodent = 0. Missing/unreported data?
• Recent mining of MEDLINE did identify evidence of pyrimethamine/sulfadoxine (Fansidar) causing hepatitis in patients
• Normally, combinations would be excluded from these analyses for