The NIH Roadmap - Indiana University

The NIH Roadmap
and PubChem
Gary Wiggins
I533
Spring 2006
NIH Roadmap
• Series of initiatives designed to pursue
major opportunities in biomedical research
and gaps in current knowledge that cannot
be addressed by any single NIH Institute
or Center
• Goal: enable rapid transformation of new
scientific knowledge into tangible benefits
for public health
• http://nihroadmap.nih.gov/
NIH Molecular Libraries and
Imaging Initiative
• Part of the “New Pathways to Discovery” area
• Goal: augment the “toolbox” for understanding
the functionally interconnected molecular
events that maintain health and lead to disease
• Build on high-throughput, highly specific,
mechanism-based biological assays
• Aims to develop and discover small molecules
that hold promise as research tools to probe
cellular physiology and pathophysiology
NIH Molecular Imaging Roadmap
• High specificity/high sensitivity molecular
imaging probes
• Molecular imaging and contrast database
• Imaging probe development center
NIH Roadmap Molecular Libraries
Initiative (MLI)
• A series of integrated research programs
with the goal of making small molecule
screening and screening data more widely
available to the research community
• http://nihroadmap.nih.gov/molecularlibraries/index.asp
MLI Aims
• Go beyond the identification of compounds
with potential therapeutic properties
• Will result in the identification of
compounds to use as probes to study
cellular processes in health and disease
• Biological screening data, assay protocols,
and chemical structures for compounds to
be publicly available in PubChem
NIH MLI Components
• Molecular Libraries Screening Center
Network (MLSCN)
• Cheminformatics (centered around
PubChem)
• Technology development
NIH MLI Technology Development
Areas
• Chemical diversity
– Pilot-scale libraries for investigation of novel chemical
diversity space
– Novel methods for natural product chemistry
• Development of assays
• Novel instrumentation and detection
technologies for high throughput screening
• Datasets and algorithms for better prediction of
absorption, distribution, metabolism, excretion,
and toxicity properties of small molecules
Assay Guidance Manual
• Originally written as a guide for therapeutic project
teams within Eli Lilly; covers:
– Identifying potential assay formats compatible with High
Throughput Screen (HTS) and Structure Activity Relationship
(SAR)
– Developing optimal assay reagents
– Optimizing assay protocol with respect to sensitivity, dynamic
range, signal intensity and stability
– Adaptation of the assay to the microtiter plate formats
– Validation of the assay performance
– Orthogonal follow-up assays for chemical probe validation and
refinement
• http://www.ncgc.nih.gov/guidance/index.html
NIH Molecular Libraries Small
Molecule Repository
• Run under contract by Discovery Partners
International
• Collects samples for high throughput
biological screening and distributes them
to the NIH Molecular Libraries Screening
Center Network
• http://mlsmr.discoverypartners.com/MLSMR_HomePage/
Roadmap MLI Funded Areas
• Molecular Libraries Screening Centers
(MLSCN)
– Nine of them at academic institutions
– NIH Chemical Genomics Center
• http://www.ncgc.nih.gov/
• http://nihroadmap.nih.gov/molecularlibraries/fundedresearch.asp
Roadmap MLI Funded Areas
• Submitting assays for HTS in the MLSCN
– 28 different submissions
• Pilot-scale libraries for HTS (8)
• New methodologies for natural product
chemistry (6)
• Assay development for HT molecular
screening (39)
• Molecular libraries screening
instrumentation (4)
Roadmap MLI Funded Areas
• Novel preclinical tools for predictive
ADME-Toxicology (5)
• Innovation in molecular imaging probes
(11)
• Development of high-resolution probes for
cellular imaging (9)
Roadmap MLI Funded Areas
• Exploratory Centers for Cheminformatics
Research at:
– Indiana University
– University of Michigan
– Rensselaer Polytechnic Institute
– MIT
– North Carolina State University, Raleigh
– University of North Carolina, Chapel Hill
IU Projects Underway
• Innovative cross-screen analysis of NIH Developmental
Therapeutics Program Human Tumor Cell Line data
• Development of cheminformatics web services and use
cases in Taverna
• Development of a novel interface for the analysis of
PubChem HTS data
• A structure storage and searching system for Distributed
Drug Discovery
• Quantum chemical computer simulations database
• Training modules for cheminformatics instruction on the
Web
• Web guide for essential cheminformatics resources
(http://www.indiana.edu/~cheminfo/cicc/resources.html)
• Design of a grid-based distributed data architecture for
chemistry
NIH NCI Developmental
Therapeutics Program
• The NCI has been collecting and testing compounds for
50 years. For about 30 years this has been managed by
the Developmental Therapeutics Program (DTP). From
1955 to 1985 the primary test was to look for increase in
survival of mice bearing transplantable tumors. In 1990,
the primary screen switched to looking for inhibition of
growth of 60 human tumor cell lines in culture. DTP also
ran the anti-HIV screen for about 10 years and managed
the yeast anti-cancer screen in which compounds were
tested for their ability to inhibit the growth of yeast strains
with defined mutations in cell cycle genes. These assays
provide the bulk of the data DTP makes publicly
available.
NIH NCI DTP
• DTP’s correlation analyses allow one to
associate a list of genes with a given compound
or vice versa
• Want to get workflows running that integrate
chemical structure data with the gene
expression and sequence data in the
bioinformatics world
• Need help with the practical details of creating web
services that will work in the myGrid/Taverna (or
equivalent) framework
NIH DTP Data
                      Compounds    Data Points
Chemical Structures   ~265,000
60 Cell Assay         ~43,000      ~12,000,000
Anti-HIV Assay        ~45,000      ~90,000
Yeast Assay           ~110,000     ~600,000
in vivo Antitumor     ~120,000     ~1,100,000
NCI Panel of 60 Human Cancer
Cell Lines
• Protein levels
• RNA measurements
• Mutation status
• Enzyme activity levels
NIH DTP’s COMPARE Program
• The pattern of activity across all 60 cell lines that
a compound exhibits is related to the
mechanism of action
– Can be used to discover the mechanism of a
compound’s actions by looking at which compounds
of known activity are correlated with the unknown
– Has been used to discover novel compounds with a
given activity by testing the top correlating
compounds to a compound with the activity of interest
– Used to prioritize compounds that seem to have a
novel mechanism
– Calculates a correlation coefficient between two
vectors in 60-dimensional space
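The correlation step in the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not DTP's actual COMPARE code: it assumes a Pearson correlation coefficient, uses made-up activity values, and shortens each vector to 5 cell lines instead of 60. The names `pearson` and `rank_by_correlation` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("a constant vector has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom

def rank_by_correlation(seed, library):
    """Sort library compounds by descending correlation with the seed pattern."""
    scores = {name: pearson(seed, pattern) for name, pattern in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative (made-up) activity patterns; real patterns have 60 entries,
# one per cell line.
seed = [5.1, 6.3, 4.8, 7.0, 5.5]
library = {
    "compound_A": [5.0, 6.5, 4.7, 7.2, 5.4],  # similar pattern
    "compound_B": [7.0, 4.9, 6.8, 4.5, 6.1],  # roughly opposite pattern
    "compound_C": [5.5, 5.4, 5.6, 5.5, 5.7],  # weakly related pattern
}
ranking = rank_by_correlation(seed, library)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

Compounds at the top of such a ranking are the candidates one would examine first when guessing the mechanism of the seed compound.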
NIH DTP
• Given a compound tested in the 60 cell
assay, one can look for the genes whose
expression most highly correlates with the
ability of the compound to inhibit cell
growth. Conversely, given a gene, one can
look for compounds whose ability to inhibit
cell growth is most highly correlated with
the expression of that gene.
NIH DTP Needs
• Grid Web services
• Visualization – may use VOTables
• Tools to project a set of points in a
high-dimensional space down into 2D or 3D while
attempting to preserve the relative distances
– Looking at the nearest neighbors of the point of
interest with such a map could reveal relations that
would be missed in just a table listed by distance
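One family of techniques for this kind of distance-preserving projection is multidimensional scaling (MDS). The slide does not name a method, so the following is only a sketch under that assumption: a small metric MDS using the SMACOF update (Guttman transform) with unit weights, written in plain Python. The function names and the sample points are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(points):
    """All pairwise distances in the original (high-dimensional) space."""
    return [[euclidean(p, q) for q in points] for p in points]

def stress(target, coords):
    """Sum of squared differences between target and embedded distances."""
    n = len(coords)
    return sum((target[i][j] - euclidean(coords[i], coords[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds_smacof(target, dims=2, iterations=100, seed=0):
    """Embed n points in `dims` dimensions so distances approximate `target`."""
    n = len(target)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(n)]
    for _ in range(iterations):
        updated = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = euclidean(coords[i], coords[j])
                # Guttman transform: scale each difference vector by the
                # ratio of target distance to current embedded distance.
                ratio = target[i][j] / d if d > 1e-12 else 0.0
                for k in range(dims):
                    updated[i][k] += ratio * (coords[i][k] - coords[j][k]) / n
        coords = updated
    return coords

# Five made-up points in a 5-dimensional space.
points = [
    [0, 0, 0, 0, 0],
    [3, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [3, 4, 0, 0, 0],
    [1, 1, 0, 0, 2],
]
target = distance_matrix(points)
embedding = mds_smacof(target)
print("stress after embedding:", round(stress(target, embedding), 4))
```

Plotting `embedding` as a 2D scatter gives the kind of map described above, where the neighbors of a point of interest can be inspected visually.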
NIH DTP Main Search Page
• http://dtp.nci.nih.gov/docs/dtp_search.html
High-Throughput Screening (HTS)
• the integration of biological, chemical and
clinical data
• automated & standardized statistical
analysis of large and complex data
volumes
• biological and chemical profiling by use of
statistical analyses on combined data from
screening, pharmacological profiling, and
structural properties
Other Potential Partners
• Center for Chemical Genomics at the University
of Michigan
– http://www.lifesciences.umich.edu/institute/labs/ccg/index.html
• Milos Novotny (IUB Chemistry): $3.5 million National
Center for Research Resources (NIH) grant to
conduct research in the analysis of glycoproteins
• David Flockhart (IUB School of Medicine):
Cytochrome P450 database
http://medicine.iupui.edu/flockhart/
PubChem
• 5,298,729 compounds as of 1/16/2006
• the place to go for biological and related data
• the central depository of all information related
to the NIH Roadmap project
• expected that the actual data will reside there,
and only some things may be held elsewhere,
with PubChem acting as a pointer
– May even have the images from screens and assays
• chemical structures from Elsevier's xPharm
database
PubChem Data (as of 10/25/2005)
• Bioassays deposited: 177
• Bioassay test results: 3,158,669
• Substances deposited: 7,848,390
• Unique Substances: 5,269,228
PubChem Technical Details
• Entrez database system
– For all textual information in the database
• NCBI Toolkit - an open-source infrastructure toolkit
• OpenEye OEChem toolkit and associated software
– for most structure standardization tasks, plus some structure
identifier computations like SMILES and IUPAC name
generation.
• NIST InChI library
– for computing the InChI identifier
• CACTVS Chemoinformatics Toolkit
– for structure depictions, structure database system, structure
query execution, structure deduplication, some property
calculations and the WWW structure and image editors
• Various general low-level support libraries, e.g.,
– zlib, png, gd and freetype libraries
• In-house code
– for the queuing system, deposition system, display CGIs,
structure standardization set-up, update scripts, etc.
PubChem Database Display and
Query Subsystems - 1
• A special Entrez version
– stores textual and numerical data
– hosted on a MS SQL Server relational
database cluster
– holds precomputed structure images for
display, ASN.1 structure data blobs for
download, and extensive crosslinking
functions for linking to other NCBI databases
PubChem Display and Query
Subsystems - 2
• structure search component
– based on the CACTVS structure search system
– pseudo-relational in nature (the underlying storage
manager is the Sleepycat BDB database manager)
– hosted on a Linux server cluster
– structure search file is not stored in the SQL
database, but there is an automatic synchronization
and update mechanism
– Some data, such as Lipinski filter criteria, are stored
in both databases
PubChem Programming Utilities
• Entrez Programming Utilities
– http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
• CACTVS chemoinformatics toolkit
– a full ASN.1 parser for CACTVS understands
the full data spec for structures and assay
data
– modules for talking to the Entrez database for
accessing structure blobs and some other
NCBI systems
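As a rough illustration of how the Entrez Programming Utilities are called, the sketch below builds an ESearch request URL for the PubChem Compound database. It assumes the present-day E-utilities endpoint (`esearch.fcgi`) and the Entrez database name `pccompound`, neither of which appears on the slide; the helper name `esearch_url` is hypothetical. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed base URL of the E-utilities service (not taken from the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="pccompound", retmax=20):
    """Build an ESearch URL for a text query against an Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

url = esearch_url("aspirin")
print(url)
```

The resulting URL could be fetched with `urllib.request.urlopen(url)`; the service answers with an XML document listing the matching record identifiers.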
PubChem Data Deposition
• PubChem Deposition Gateway
• http://pubchem.ncbi.nlm.nih.gov/deposit/deposit.cgi
PubChem Sketcher
• No need to worry about the type of
structure definition displayed in the top line
• uses a hidden internal representation to
transfer the information
• http://pubchem.ncbi.nlm.nih.gov/search/
InChI, The IUPAC International
Chemical Identifier
• Official site: http://www.iupac.org/inchi/
• Unofficial InChI FAQ: http://wwmm.ch.cam.ac.uk/inchifaq/
• WSDL InChI server at http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere
Searching InChIs
• Sample search:
• “InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)15-11-21-17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H3”
• Must include the quotation marks
• no carriage return or line feed in the string
• InChI code for C60 fullerene:
– InChI=1/C60/c1-2-5-6-3(1)8-12-10-4(1)9-11-7(2)17-21-13(5)23-24-14(6)22-18(8)28-20(12)30-26-16(10)15(9)25-29-19(11)27(17)37-41-31(21)33(23)43-44-34(24)32(22)42-38(28)48-40(30)46-36(26)35(25)45-39(29)47(37)55-49(41)51(43)57-52(44)50(42)56(48)59-54(46)53(45)58(55)60(57)59
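The two search rules above (quotation marks required, no carriage return or line feed) are easy to break when an InChI is pasted from a wrapped document. A small helper along these lines can prepare the string; the function name `inchi_search_term` is hypothetical, and the ethanol InChI is used only as a short example.

```python
def inchi_search_term(raw):
    """Return an InChI as a single-line search term wrapped in quotation marks."""
    # InChI strings contain no whitespace, so every space, tab, CR and LF
    # found in the input is treated as copy-and-paste damage and removed.
    cleaned = "".join(raw.split())
    # Drop any straight or curly quotation marks already around the string.
    cleaned = cleaned.strip('"\u201c\u201d')
    if not cleaned.startswith("InChI="):
        raise ValueError("not an InChI string")
    return f'"{cleaned}"'

# An InChI for ethanol that was broken across two lines when pasted.
pasted = "InChI=1/C2H6O/c1-2-3/\n  h3H,2H2,1H3"
print(inchi_search_term(pasted))
```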
ACD Labs and InChIs
• Transferring structures from PubChem to
ACD/ChemSketch
• http://www.acdlabs.com/download/technotes/90/draw_db/pubchem.pdf
InChI Support in BKChem
• BKChem - a free chemical drawing
program
• Successfully reads most InChIs
• http://bkchem.zirael.org/inchi_en.html
InChI
• PubChem sketcher also supports
generation of InChI strings
• http://pubchem.ncbi.nlm.nih.gov/edit/
– Change the format selector to "InChI"
– Copy the InChI generated and input to the
text searching component, putting quotation
marks around the InChI before searching it
Protein Data Bank (PDB)
Data Dictionaries
• develop software and data definitions to
support the structural genomics efforts
• enable high-throughput data deposition
• data dictionaries define items at the level
of detail of the materials and methods
section of a journal
• uses macromolecular Crystallographic
Information File (mmCIF) data dictionaries
• http://mmcif.pdb.org/index.html
Translate WSDL to Human
Readable Form
• http://soapclient.com/soaptest.html
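What such a translation involves can be sketched with Python's standard XML parser: read the WSDL and list each port type with its operations and messages. The sample WSDL below is invented for illustration (it is not the WSDL of any server mentioned in these slides), and `describe_wsdl` is a hypothetical helper that handles only WSDL 1.1 port types.

```python
import xml.etree.ElementTree as ET

# Namespace of WSDL 1.1 documents.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# A made-up, minimal WSDL used only to exercise the function below.
SAMPLE_WSDL = """<?xml version="1.0"?>
<definitions name="InChIService"
             xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="http://example.org/inchi"
             targetNamespace="http://example.org/inchi">
  <message name="GenerateRequest"/>
  <message name="GenerateResponse"/>
  <portType name="InChIPort">
    <operation name="generateInChI">
      <input message="tns:GenerateRequest"/>
      <output message="tns:GenerateResponse"/>
    </operation>
    <operation name="validateInChI">
      <input message="tns:GenerateRequest"/>
      <output message="tns:GenerateResponse"/>
    </operation>
  </portType>
</definitions>
"""

def describe_wsdl(wsdl_text):
    """Return readable lines: one per port type, one per operation."""
    root = ET.fromstring(wsdl_text)
    lines = []
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        lines.append(f"Port type {port_type.get('name')}")
        for op in port_type.findall("wsdl:operation", WSDL_NS):
            inp = op.find("wsdl:input", WSDL_NS)
            out = op.find("wsdl:output", WSDL_NS)
            in_msg = inp.get("message") if inp is not None else "(none)"
            out_msg = out.get("message") if out is not None else "(none)"
            lines.append(f"  {op.get('name')}: {in_msg} -> {out_msg}")
    return lines

for line in describe_wsdl(SAMPLE_WSDL):
    print(line)
```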