The NIH Roadmap and PubChem
Gary Wiggins
I533 Spring 2006

NIH Roadmap
• Series of initiatives designed to pursue major opportunities in biomedical research and gaps in current knowledge that cannot be addressed by any single NIH Institute or Center
• Goal: enable rapid transformation of new scientific knowledge into tangible benefits for public health
• http://nihroadmap.nih.gov/

NIH Molecular Libraries and Imaging Initiative
• Part of the “New Pathways to Discovery” area
• Goal: augment the “toolbox” for understanding the functionally interconnected molecular events that maintain health and lead to disease
• Build on high-throughput, highly specific, mechanism-based biological assays
• Aims to develop and discover small molecules that hold promise as research tools to probe cellular physiology and pathophysiology

NIH Molecular Imaging Roadmap
• High specificity/high sensitivity molecular imaging probes
• Molecular imaging and contrast database
• Imaging probe development center

NIH Roadmap Molecular Libraries Initiative (MLI)
• A series of integrated research programs with the goal of making small molecule screening and screening data more widely available to the research community
• http://nihroadmap.nih.gov/molecularlibraries/index.asp

MLI Aims
• Go beyond the identification of compounds with potential therapeutic properties
• Will result in the identification of compounds to use as probes to study cellular processes in health and disease
• Biological screening data, assay protocols, and chemical structures for compounds to be publicly available in PubChem

NIH MLI Components
• Molecular Libraries Screening Center Network (MLSCN)
• Cheminformatics (centered around PubChem)
• Technology development

NIH MLI Technology Development Areas
• Chemical diversity
  – Pilot-scale libraries for investigation of novel chemical diversity space
  – Novel methods for natural product chemistry
• Development of assays
• Novel instrumentation and detection technologies for high throughput screening
• Datasets and algorithms for better prediction of absorption, distribution, metabolism, excretion, and toxicity properties of small molecules

Assay Guidance Manual
• Originally written as a guide for therapeutic project teams within Eli Lilly; covers:
  – Identifying potential assay formats compatible with High Throughput Screen (HTS) and Structure Activity Relationship (SAR)
  – Developing optimal assay reagents
  – Optimizing assay protocol with respect to sensitivity, dynamic range, signal intensity and stability
  – Adaptation of the assay to the microtiter plate formats
  – Validation of the assay performance
  – Orthogonal follow-up assays for chemical probe validation and refinement
• http://www.ncgc.nih.gov/guidance/index.html

NIH Molecular Libraries Small Molecule Repository
• Run under contract by Discovery Partners International
• Collects samples for high throughput biological screening and distributes them to the NIH Molecular Libraries Screening Center Network
• http://mlsmr.discoverypartners.com/MLSMR_HomePage/

Roadmap MLI Funded Areas
• Molecular Libraries Screening Centers (MLSCN)
  – Nine of them at academic institutions
  – NIH Chemical Genomics Center
    • http://www.ncgc.nih.gov/
• http://nihroadmap.nih.gov/molecularlibraries/fundedresearch.asp

Roadmap MLI Funded Areas (continued)
• Submitting assays for HTS in the MLSCN
  – 28 different submissions
• Pilot-scale libraries for HTS (8)
• New methodologies for natural product chemistry (6)
• Assay development for HT molecular screening (39)
• Molecular libraries screening instrumentation (4)

Roadmap MLI Funded Areas (continued)
• Novel preclinical tools for predictive ADME-Toxicology (5)
• Innovation in molecular imaging probes (11)
• Development of high-resolution probes for cellular imaging (9)

Roadmap MLI Funded Areas (continued)
• Exploratory Centers for Cheminformatics Research at:
  – Indiana University
  – University of Michigan
  – Rensselaer Polytechnic Institute
  – MIT
  – North Carolina State University, Raleigh
  – University of North Carolina, Chapel Hill

IU Projects Underway
• Innovative cross-screen analysis of NIH Developmental Therapeutics Program Human Tumor Cell Line data
• Development of cheminformatics web services and use cases in Taverna
• Development of a novel interface for the analysis of PubChem HTS data
• A structure storage and searching system for Distributed Drug Discovery
• Quantum chemical computer simulations database
• Training modules for cheminformatics instruction on the Web
• Web guide for essential cheminformatics resources (http://www.indiana.edu/~cheminfo/cicc/resources.html)
• Design of a grid-based distributed data architecture for chemistry

NIH NCI Developmental Therapeutics Program
• The NCI has been collecting and testing compounds for 50 years. For about 30 years this has been managed by the Developmental Therapeutics Program (DTP). From 1955 to 1985 the primary test was to look for an increase in survival of mice bearing transplantable tumors. In 1990, the primary screen switched to looking for inhibition of growth of 60 human tumor cell lines in culture. DTP also ran the anti-HIV screen for about 10 years and managed the yeast anti-cancer screen, in which compounds were tested for their ability to inhibit the growth of yeast strains with defined mutations in cell cycle genes. These assays provide the bulk of the data DTP makes publicly available.

NIH NCI DTP
• DTP’s correlation analyses allow one to associate a list of genes with a given compound or vice versa
• Want to get workflows running that integrate chemical structure data with the gene expression and sequence data in the bioinformatics world
• Need help in the practical details of creating web services that will work in the myGrid/Taverna (or equivalent) framework

NIH DTP Data
                       Compounds    Data Points
Chemical Structures    ~265,000
60 Cell Assay          ~43,000      ~12,000,000
Anti-HIV Assay         ~45,000      ~90,000
Yeast Assay            ~110,000     ~600,000
in vivo Antitumor      ~120,000     ~1,100,000

NCI Panel of 60 Human Cancer Cell Lines
• Protein levels
• RNA measurements
• Mutation status
• Enzyme activity levels

NIH DTP’s COMPARE Program
• The pattern of activity that a compound exhibits across all 60 cell lines is related to its mechanism of action
  – Can be used to discover the mechanism of a compound’s actions by looking at which compounds of known activity are correlated with the unknown
  – Has been used to discover novel compounds with a given activity by testing the top correlating compounds to a compound with the activity of interest
  – Used to prioritize compounds that seem to have a novel mechanism
  – Calculates a correlation coefficient between two vectors in 60-dimensional space

NIH DTP
• Given a compound tested in the 60 cell assay, one can look for the genes whose expression most highly correlates with the ability of the compound to inhibit cell growth. Conversely, given a gene, one can look for compounds whose ability to inhibit cell growth is most highly correlated with the expression of that gene.
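
The COMPARE idea of correlating activity vectors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which correlation coefficient is used, so a Pearson coefficient is assumed here; the function names, compound names, and activity values are all hypothetical; and the vectors have 6 entries rather than 60 to keep the example short.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator


def rank_by_correlation(seed, references):
    """Return (name, r) pairs sorted from most to least correlated with seed."""
    scored = [(name, pearson(seed, profile)) for name, profile in references.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up activity values, one per cell line (6 lines here instead of 60).
seed_profile = [5.1, 6.3, 4.8, 7.0, 5.5, 6.1]
reference_profiles = {
    "known_A": [5.0, 6.4, 4.9, 6.8, 5.6, 6.0],  # tracks the seed closely
    "known_B": [7.0, 5.5, 6.9, 4.8, 6.2, 5.1],  # roughly the opposite pattern
    "known_C": [5.5, 5.6, 5.4, 5.5, 5.6, 5.5],  # nearly flat profile
}

for name, r in rank_by_correlation(seed_profile, reference_profiles):
    print(f"{name}: r = {r:.3f}")
```

The same ranking works in the other direction described above: replace the reference compound profiles with gene expression values measured across the same cell lines, and the top of the list gives the genes most correlated with a compound's growth inhibition pattern.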

NIH DTP Needs
• Grid Web services
• Visualization
  – may use VOTables
• Tools to squish a set of points in a high-dimensional space down into 2D or 3D while attempting to preserve the relative distances
  – Looking at the nearest neighbors of the point of interest with such a map could reveal relations that would be missed in just a table listed by distance

NIH DTP Main Search Page
• http://dtp.nci.nih.gov/docs/dtp_search.html

High-Throughput Screening (HTS)
• the integration of biological, chemical and clinical data
• automated & standardized statistical analysis of large and complex data volumes
• biological and chemical profiling by use of statistical analyses on combined data from screening, pharmacological profiling, and structural properties

Other Potential Partners
• Center for Chemical Genomics at the University of Michigan
  – http://www.lifesciences.umich.edu/institute/labs/ccg/index.html
• Milos Novotny (IUB Chemistry): $3.5 million National Center for Research Resources (NIH) grant to conduct research in the analysis of glycoproteins
• David Flockhart (IUB School of Medicine): Cytochrome P450 database http://medicine.iupui.edu/flockhart/

PubChem
• 5,298,729 compounds as of 1/16/2006
• the place to go for biological and related data
• the central depository of all information related to the NIH Roadmap project
• expected that the actual data will reside there, and only some things may be held elsewhere, with PubChem acting as a pointer
  – May even have the images from screens and assays
• chemical structures from Elsevier's xPharm database

PubChem Data (as of 10/25/2005)
• Bioassays deposited       177
• Bioassay test results     3,158,669
• Substances deposited      7,848,390
• Unique Substances         5,269,228

PubChem Technical Details
• Entrez database system
  – For all textual information in the database
• NCBI Toolkit - an open-source infrastructure toolkit
• OpenEye OEChem toolkit and associated software
  – for most structure standardization tasks, plus some structure identifier computations like SMILES and IUPAC name generation
• NIST InChI library
  – for computing the InChI identifier
• CACTVS Chemoinformatics Toolkit
  – for structure depictions, structure database system, structure query execution, structure deduplication, some property calculations and the WWW structure and image editors
• Various general low-level support libraries, e.g.,
  – zlib, png, gd and freetype libraries
• In-house code
  – for the queuing system, deposition system, display CGIs, structure standardization set-up, update scripts, etc.

PubChem Database Display and Query Subsystems - 1
• A special Entrez version
  – stores textual and numerical data
  – hosted on a MS SQL Server relational database cluster
  – holds precomputed structure images for display, ASN.1 structure data blobs for download, and extensive crosslinking functions for linking to other NCBI databases

PubChem Display and Query Subsystems - 2
• structure search component
  – based on the CACTVS structure search system
  – pseudo-relational in nature (the underlying storage manager is the Sleepycat BDB database manager)
  – hosted on a Linux server cluster
  – structure search file is not stored in the SQL database, but there is an automatic synchronization and update mechanism
  – Some data, such as Lipinski filter criteria, are stored in both databases

PubChem Programming Utilities
• Entrez Programming Utilities
  – http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
• CACTVS chemoinformatics toolkit
  – a full ASN.1 parser for CACTVS understands the full data spec for structures and assay data
  – modules for talking to the Entrez database for accessing structure blobs and some other NCBI systems

PubChem Data Deposition
• PubChem Deposition Gateway
• http://pubchem.ncbi.nlm.nih.gov/deposit/deposit.cgi

PubChem Sketcher
• No need to worry about the type of structure definition displayed in the top line
• uses a hidden internal representation to transfer the information
• http://pubchem.ncbi.nlm.nih.gov/search/

InChI, The IUPAC International Chemical Identifier
• Official site: http://www.iupac.org/inchi/
• Unofficial InChI FAQ: http://wwmm.ch.cam.ac.uk/inchifaq/
• WSDL InChI server at http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere

Searching InChIs
• Sample search:
• "InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)15-11-21-17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H3"
• Must include the quotation marks
• no carriage return or line feed in the string
• InChI code for C60 fullerene:
  – InChI=1/C60/c1-2-5-6-3(1)8-12-10-4(1)9-11-7(2)17-21-13(5)23-24-14(6)22-18(8)28-20(12)30-26-16(10)15(9)25-29-19(11)27(17)37-41-31(21)33(23)43-44-34(24)32(22)42-38(28)48-40(30)46-36(26)35(25)45-39(29)47(37)55-49(41)51(43)57-52(44)50(42)56(48)59-54(46)53(45)58(55)60(57)59

ACD Labs and InChIs
• Transferring structures from PubChem to ACD/ChemSketch
• http://www.acdlabs.com/download/technotes/90/draw_db/pubchem.pdf

InChI Support in BKChem
• BKChem - a free chemical drawing program
• Successfully reads most InChIs
• http://bkchem.zirael.org/inchi_en.html

InChI
• PubChem sketcher also supports generation of InChI strings
• http://pubchem.ncbi.nlm.nih.gov/edit/
  – Change the format selector to "InChI"
  – Copy the InChI generated and input to the text searching component, putting quotation marks around the InChI before searching it

Protein Data Bank (PDB) Data Dictionaries
• develop software and data definitions to support the structural genomics efforts
• enable high-throughput data deposition
• data dictionaries define items at the level of detail of the materials and methods section of a journal
• uses macromolecular Crystallographic Information File (mmCIF) data dictionaries
• http://mmcif.pdb.org/index.html

Translate WSDL to Human Readable Form
• http://soapclient.com/soaptest.html
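
The two rules from the "Searching InChIs" slide (the string must be in quotation marks and must contain no carriage return or line feed) are easy to break when an InChI is copied from a wrapped document. A minimal sketch of a helper that applies them before pasting into the text search box follows. The function name is hypothetical, and the check for the `InChI=` prefix is an added assumption, not a PubChem requirement stated in the slides.

```python
def prepare_inchi_query(raw):
    """Return an InChI as a single-line string wrapped in straight double quotes."""
    # Drop surrounding whitespace and any straight or curly quotes already present.
    cleaned = raw.strip().strip('"\u201c\u201d')
    # Remove carriage returns, line feeds, and any other embedded whitespace.
    cleaned = "".join(cleaned.split())
    # Assumed sanity check: every InChI string starts with this prefix.
    if not cleaned.startswith("InChI="):
        raise ValueError("expected a string beginning with 'InChI='")
    return '"' + cleaned + '"'


# Sample search string from the slide, with a line break introduced on purpose.
wrapped = (
    "InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)15-11-21-\r\n"
    "17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H3"
)
print(prepare_inchi_query(wrapped))
```

Because the helper removes whitespace but cannot restore characters lost in copying, a string whose hyphens were dropped at a line break will still be wrong after cleaning; comparing against the original source remains necessary.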