Slide 1 - Evidence

The use of Informatics
Approaches in Cheminformatics
Alexander Tropsha
Laboratory for Molecular Modeling,
UNC Eshelman School of Pharmacy
OUTLINE
• Overview of Current Projects
• Background on Cheminformatics
• Examples of Application Projects: Data Retrieval → Modeling →
Testable Hypothesis Generation → Validation
C-C=C-O
> Database
of compounds (with their measured activities for multiple targets)
> Tools to visualize and navigate into chemical space.
Structure-Activity Relationships (SAR) modeling
DESCRIPTORS
Physico-Chemical
properties (logS, BP,
MP, logK etc.)
Biological
activities
Computational Chemical Biology
C-ChemBench / CECCR project
Complementary Ligands Based on Receptor Information
(CoLiBRI)
Computational Chemical Biology
Protein Structure-Function
relationships modeling
Simplicial Neighborhood
Analysis of Protein Packing
(SNAPP)
Activity/Function prediction for molecules
Empirical Rules/Filters
Similarity Search
Consensus QSAR models
VIRTUAL
SCREENING
~10^2 – 10^3
molecules
~10^6 – 10^9
molecules
Activity/Function prediction for molecules
Protein-ligand recognition
Cheminformatics and Structural
Bioinformatics
Selected Models
Descriptors and QSAR approaches
(modeling techniques, applicability domain definitions etc.)
Cheminformatics and Structural Bioinformatics
Tools for chemical
data mining
Tetrahymena Pyriformis
Computational Chemical Toxicology
The Laboratory for Molecular Modeling
Principal Investigator
Alexander Tropsha
Research Professors
Clark Jeffries, Alexander Golbraikh, Hao Zhu, Simon Wang, M. Karthikeyan
Graduate Research Assistants
Christopher Grulke, Nancy Baker, Kun Wang, Hao Tang, Jui-Hua Hsieh, Rima Hajjo,
Tanarat Kietsakorn, Tong Ying Wu, Liying Zhang, Melody Luo, Guiyu Zhao, Andrew Fant
Postdoctoral Fellows
Georgiy Abramochkin, Lin Ye, Denis Fourches
Visiting Research Scientists
Achintya Saha, Aleks Sedykh, Berk Zafer
Adjunct Members
Weifan Zheng, Shubin Liu
Research Programmer
Theo Walker
System Administrator
Mihir Shah
MAJOR FUNDING
NIH
- P20-HG003898 (RoadMap)
- R21GM076059 (RoadMap)
- R01-GM66940
- GM068665
EPA (STAR awards)
- RD832720
- RD833825
What is Chemoinformatics?
Dr. Frank Brown introduced the term “chemoinformatics” in the Annual
Reports of Medicinal Chemistry in 1998:
“The use of information technology and management has become a
critical part of the drug discovery process. Chemoinformatics is the mixing
of those information resources to transform data into information and
information into knowledge for the intended purpose of making better
decisions faster in the area of drug lead identification and optimization”
In fact, Chemoinformatics is a generic term that encompasses the design,
creation, organization, management, retrieval, analysis, dissemination,
visualization and use of chemical information.
toxicity prioritization & screening
environmental toxicity screening
Slide courtesy of Ann Richard
http://www.bioinfoinstitute.com/chemoinfo.htm
NIH’s Molecular Libraries Initiative in numbers
NIH Roadmap Initiative
Molecular Libraries Initiative
4 Chemical Synthesis
Centers
CombiChem
Parallel synthesis
DOS
4 centers + DPI
100K – 1M compounds
Expected
1M compounds
MLSCN (9+1)
9 centers
1 NIH intramural
20 x 10 = 200 assays
PubChem
(NLM)
ECCR (6) Predictive
Exploratory ADMET
Centers
(10)
Current SAR matrix (as of May 25, 2007):
- 256 different MLSCN bioassays
- over 140,000 chemicals
- 29,558 compounds categorized as “active” in at least one MLSCN bioassay
200 assays
Chemocentric
view of
biological data
NO2
Toxicity Risk Assessment
SAR
structure-activity
relationships
increasing uncertainty
Quantitative Structure-Activity
relationships (QSAR)
Pharmacophore
mapping
Docking
Molecular
modeling
Molecular
mechanics
Property
filtering
2D Similarity Searching
3D Substructural Searching
Molecular
Diversity Analysis
Quantum
mechanical
Semiempirical
ADMET
2D Substructure
Searching
Scoring
functions
Druglikeness
Decision trees
Neural
Networks
Virtual Screening
Data Mining
Cluster
analysis
Graph
theory
Multiple linear regression
Principal components analysis
Inductive logic reasoning
Genetic
algorithms
Active Analog
Hansch
Free-Wilson
Pharmaceutical
Sciences
Drug Discovery
Chemical Design
Materials Science
Green Chemistry
Agricultural
Pesticides
Food Science
Polymers
Atmospheric
chemistry
Environmental
Studies
Green Chemistry
Predictive
Toxicology
Key point: Focus on Externally Validated
Predictions
Input: SAR dataset and external database/library
Cheminformatics “Magic”
Output: Small number of computational hits
Real Test: Large fraction are confirmed actives
Cheminformatics Analysis of Assertions
Describing Drug-Induced Liver Injury
in Different Species
In Collaboration with BioWisdom, UK
Background
• Drug-Induced Liver Injury (DILI) is one of the major causes of drug
toxicity, both during clinical development and post-approval
• Animal studies and clinical trials on limited populations are used to
establish drug safety; both appear insufficient
• A wealth of published information that could deepen our understanding of
the mechanisms of DILI is available, but it is scattered across published
works that use inconsistent language
Introduction to the Safety Intelligence Program (SIP)
An industry-sponsored initiative that embraces the expertise of its
pharmaceutical members and other stakeholders to build the world's most
comprehensive intelligence resource for use in improving drug safety
assessments.
The Safety Intelligence System
5,700 pathologies
8,500 compounds
192,000 facts
1 interface
The largest ever-expanding collection of known effects of chemicals
occurring in different tissues, drug effects on clinical biomarkers of
tissue injury, and drug molecular mechanisms.
Facts (assertions) derived from:
Biomedical literature
Regulatory documents: EMEA EPARs, FDA NDAs
Label Data
And many more…
Intelligence Network Build Process
Public Domain Sources
Licensed Sources
Proprietary Sources
Meta-Search
Structured Data Sources
e.g., GO, UMLS, SWISS_PROT
Data Source Descriptors
Concept Maps
Sofia Terminology &
Ontology
Unstructured Data Sources
e.g., Medline, Patents, FDA SBAs
Spiders
Structured
Data Loader
User Defined
Term List
Noun Phrase Discovery
Selected Corpus
Automated
Assertion Generation
Pass
QA
QA
Fail
Pass
Raw Assertion Discovery
Relationship Discovery
Relations Typing
Semantic Normalisation
Chemistry Canonicalisation
DocView
(manual validation)
Intelligence Network
Pass
Slide courtesy of Julie Barnes, Biowisdom
Species Concordance Study Design
The Safety Intelligence System contains comprehensive assertional meta-data
describing >5,800 effects of >8,500 compounds in the liver
E.g. ‘Acetaminophen INDUCES Hepatocyte Death (mouse)’ (pathological effect)
E.g. ‘Prednisolone SUPPRESSES Collagen Synthesis (human)’ (physiological effect)
A subset of the above assertional meta-data, referenced by MEDLINE or the
EMEA EPARs, were exported from the Safety Intelligence System for analysis
The data were restricted to therapeutic products only
The compounds were assigned to human, rodent or non-rodent groups
according to the species in which the effect was reported
The concordance of drug-induced liver effects across humans, rodents and non-rodents was determined
Species Concordance of Drug-Induced Liver Effects:
Assertions Evidenced by MEDLINE
14,600 assertions, 1061 compounds
Large data set – lending itself to quantitative analyses
Non-rodent data are less well represented than human and rodent
Objectives
Can we employ cheminformatics approaches to validate
assertions of drug-induced liver effects in different species?
Can we identify chemotypes that define species-specific
liver effects?
Can we establish chemistry driven rules for concordance (or lack
thereof) between chemical effects on humans vs. non-humans?
Project Workflow
Primary
data
sources
BioWisdom
Safety Intelligence System
Assertional meta-data generated
using the Sofia™ platform
Assertion
export
SIP Members
Assertion
refinement
Chemical curation, fragment
analysis & QSAR
Study Design
• Used assertions evidenced by MEDLINE, rather than
EMEA EPARs, because of their greater quantity
• Used rodent and human data to build the model
(knowing that non-rodent data are sparse in MEDLINE)
• Used non-rodent data (where a liver effect was
observed) to validate the model
Curation of Chemical Data
Step 1: All inorganic molecules were removed, as well as those with no
available SMILES string (993 of 1061 molecules remaining).
Examples:
Zinc chloride: Cl[Zn]Cl
Ferrous sulfate: [Fe+2].[O-]S(=O)(=O)[O-]
Sulfur: [S]
Cobalt dichloride: [Cl-].[Cl-].[Co+2]
Manganese chloride: [Cl-].[Cl-].[Mn+2]
Activated charcoal: C
cis-Diamminedichloroplatinum: [NH4+].[NH4+].[Cl-].[Cl-].[Pt+2]
Step 2: 2D structures were generated from the SMILES strings using JChem software
from ChemAxon. All counter-ions were then removed and the molecules were
neutralized using ChemAxon Standardizer, which was also used for aromatization
and normalization of nitro groups (989 compounds remaining).
Example:
Na+
Step 3: Structures were cleaned manually to correct errors and to remove
duplicates and compounds with nonsensical SMILES
(951 of 1061 molecules remaining).
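The three curation steps above can be approximated with a short, standard-library-only sketch. This is not the JChem/ChemAxon Standardizer workflow used in the study: the carbon test, the "largest fragment" rule for counter-ions, and the exact-string duplicate check are simplifying assumptions, and the records are hypothetical. Neutralization, aromatization and nitro-group normalization are not attempted, and carbon-only entries such as activated charcoal would pass this crude filter.

```python
import re

# One atom: a bracket atom such as [Na+] or [nH], or a bare organic-subset atom.
ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOSPFI]|[bcnosp])")

def elements(smiles):
    """Return the element symbols found in a SMILES string (hydrogens ignored)."""
    return [(m.group(1) or m.group(2)).capitalize() for m in ATOM.finditer(smiles)]

def curate(records):
    """Apply crude stand-ins for the curation steps to (name, smiles) pairs."""
    kept, seen = [], set()
    for name, smiles in records:
        if not smiles:                    # no SMILES string available
            continue
        if "C" not in elements(smiles):   # crude "inorganic" test: no carbon atom
            continue
        # Keep the largest dot-separated fragment, which drops simple counter-ions.
        parent = max(smiles.split("."), key=lambda frag: len(elements(frag)))
        if parent in seen:                # exact-string duplicate (no canonicalization)
            continue
        seen.add(parent)
        kept.append((name, parent))
    return kept

# Hypothetical input records, for illustration only.
records = [
    ("zinc chloride", "Cl[Zn]Cl"),
    ("sulfur", "[S]"),
    ("no structure", None),
    ("sodium acetate", "CC(=O)[O-].[Na+]"),
    ("acetic acid", "CC(=O)O"),
    ("acetic acid (duplicate)", "CC(=O)O"),
]
curated = curate(records)
```

Because the sketch does not neutralize charges, acetate and acetic acid remain distinct entries; a real standardizer would merge them.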
Data transformation for the revised Venn diagram
The species profile of each of the 951 compounds was retrieved automatically
from the original data with a program written in Delphi.
For the cheminformatics analysis, we assumed that each compound had been
tested in all species, i.e., humans, rodents and non-rodents.
“1” = known liver effect
“0” = no liver effect
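A minimal sketch of how such binary species profiles could be derived from exported assertions is given here. It is written in Python rather than Delphi, the compound names and assertion list are hypothetical, and a compound with no assertion for a species is scored "0", following the assumption that every compound was tested in all species.

```python
from collections import Counter

GROUPS = ("human", "rodent", "non-rodent")

def species_profiles(assertions):
    """Map each compound to a (human, rodent, non-rodent) tuple of 0/1 flags."""
    seen = {}
    for compound, group in assertions:
        seen.setdefault(compound, set()).add(group)
    return {c: tuple(int(g in groups) for g in GROUPS) for c, groups in seen.items()}

# Hypothetical assertions: (compound, species group in which a liver effect was reported)
assertions = [
    ("compound_1", "human"),
    ("compound_1", "rodent"),
    ("compound_2", "rodent"),
    ("compound_3", "human"),
    ("compound_3", "non-rodent"),
    ("compound_2", "rodent"),  # repeated evidence does not change the profile
]

profiles = species_profiles(assertions)
# Number of compounds per profile, i.e. per region of a three-set Venn diagram.
region_counts = Counter(profiles.values())
```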
The Venn Diagram of the Curated Dataset
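Liver effect reported in          Compounds
Human only                        236
Rodent only                       257
Non-rodent only                   18
Human and rodent only             292
Human and non-rodent only         12
Rodent and non-rodent only        26
Human, rodent and non-rodent      110
Set totals: HUMAN (650), RODENT (685), NON-RODENT (166)
Total number of compounds: 951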
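As a consistency check, the seven region sizes of the diagram can be summed back to the set totals. The assignment of each number to a region in the sketch below is an assumption (the figure itself is not available in this text), but it is the assignment that reproduces all three set totals and the overall count of 951.

```python
# Region sizes keyed by (human, rodent, non-rodent) flags; assignment is assumed.
regions = {
    (1, 0, 0): 236,  # human only
    (0, 1, 0): 257,  # rodent only
    (0, 0, 1): 18,   # non-rodent only
    (1, 1, 0): 292,  # human and rodent only
    (1, 0, 1): 12,   # human and non-rodent only
    (0, 1, 1): 26,   # rodent and non-rodent only
    (1, 1, 1): 110,  # all three groups
}

def species_total(index):
    """Sum every region in which the species at `index` shows a liver effect."""
    return sum(n for flags, n in regions.items() if flags[index])

human_total = species_total(0)
rodent_total = species_total(1)
non_rodent_total = species_total(2)
total_compounds = sum(regions.values())
```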
1. Clustering of compounds in the chemistry space*
C*C*C-C=O
Calculation of fragment
descriptors
C*C-C=O
C-C=O
C-C
C=O
C*C
Sequences of Atoms/Bonds
Inputs for
clustering
algorithm
*ISIDA is developed in the group
of Prof. A Varnek, Univ. of
Strasbourg.
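The idea of "sequences of atoms/bonds" as descriptors can be illustrated with a small path-enumeration sketch. This is not the ISIDA implementation: the graph encoding, the length limits and the rule of keeping the alphabetically smaller of the two reading directions are assumptions made for illustration, and the example molecule (acetamide, heavy atoms only) is hypothetical.

```python
from collections import Counter

def sequence_fragments(atoms, bonds, min_len=2, max_len=4):
    """Count linear atom/bond sequences containing min_len..max_len atoms."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b, symbol in bonds:
        adjacency[a].append((b, symbol))
        adjacency[b].append((a, symbol))

    counts = Counter()

    def extend(path, tokens):
        # Each undirected path is found from both ends; count it only once.
        if len(path) >= min_len and path[0] < path[-1]:
            forward = "".join(tokens)
            backward = "".join(reversed(tokens))
            counts[min(forward, backward)] += 1
        if len(path) == max_len:
            return
        for neighbour, symbol in adjacency[path[-1]]:
            if neighbour not in path:  # simple paths only
                extend(path + [neighbour], tokens + [symbol, atoms[neighbour]])

    for start in range(len(atoms)):
        extend([start], [atoms[start]])
    return counts

# Acetamide heavy-atom graph: CH3-C(=O)-NH2
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1, "-"), (1, 2, "="), (1, 3, "-")]
fragments = sequence_fragments(atoms, bonds, min_len=2, max_len=3)
```

Each key of `fragments` is one descriptor and its count is the descriptor value; stacking these counts for all compounds gives the kind of matrix that can be passed to a clustering algorithm.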
1. Clustering of 951 compounds in the chemistry space
For cluster analysis we used fragment descriptors, a hierarchical algorithm,
Euclidean distance between compounds, and complete linkage between clusters.
Small clusters with high levels of similarity between compounds were identified.
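A naive sketch of agglomerative clustering with Euclidean distance and complete linkage follows. The stopping threshold and the toy descriptor vectors are assumptions; the software and settings actually used for the 951 compounds are not specified beyond the description above.

```python
from math import dist

def complete_linkage(vectors, threshold):
    """Merge clusters while the complete-linkage distance stays within threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        candidates = [
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        ]
        distance, x, y = min(candidates)
        if distance > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy descriptor vectors (e.g. fragment counts), for illustration only.
vectors = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters = complete_linkage(vectors, threshold=2.0)
```

With a tight threshold only very similar compounds are grouped, which corresponds to the small, highly similar clusters described above.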
1. Clustering of compounds in chemical space
Example 1: Barbiturate derivatives; sedation/anaesthesia
Compounds a, b, c, d (ID = 45, 76, 93, 543), each with:
HUMAN = 0
RODENT = 1
NON-RODENT = 0
Example 2: a = cladribine, b = clofarabine, c = cordycepin; all anticancer drugs
a. ID = 201: HUMAN = 1, RODENT = 0, NON-RODENT = 0
b. ID = 208: HUMAN = 1, RODENT = 0, NON-RODENT = 0
c. ID = 223: HUMAN = 0 (???), RODENT = 1 (???), NON-RODENT = 0
1. Example 1: Assessing potential data gaps
Compounds a, b, c, d: Allobarbital, Aprobarbital, Barbital, Methohexital, each with:
HUMAN = 0
RODENT = 1
NON-RODENT = 0
• Recent mining of MEDLINE did not identify any evidence for these
compounds having human liver effects
• Basic searches in Google (e.g. barbital, human, hepatotoxicity) did not
reveal evidence for these compounds having human liver effects
• The apparent lack of human liver effects may be due to these
compounds being used for sedation/anaesthesia where lower doses
and shorter exposures may be used than in animal studies
1. Example 2: Assessing potential data gaps
a. Cladribine: HUMAN = 1, RODENT = 0, NON-RODENT = 0
b. Clofarabine: HUMAN = 1, RODENT = 0, NON-RODENT = 0
c. Cordycepin: HUMAN = 0 (???), RODENT = 1 (???), NON-RODENT = 0
• Recent mining of MEDLINE did not identify any new evidence for a and b
having rodent liver effects
• However, EMEA EPAR data in the Safety Intelligence System did identify b
as having rodent liver effects (no rodent liver effects identified for a)
• Recent mining of MEDLINE did identify an effect of c in a human
hepatocellular cell line
1. Clustering of compounds in chemical space
Example 3: a. amiodarone (antiarrhythmic agent), b. benzarone (used for treatment of
peripheral vascular disorders), c. benzbromarone (uricosuric agent, used for gout),
d. benziodarone (vasodilator).
a. ID = 60: HUMAN = 1, RODENT = 1, NON-RODENT = 1
b. ID = 98: HUMAN = 1, RODENT = 1, NON-RODENT = 0
c. ID = 99: HUMAN = 1, RODENT = 1, NON-RODENT = 0
d. ID = 100: HUMAN = 0, RODENT = 1, NON-RODENT = 0
Does compound d lack human liver effects?
1. Example 3: Assessing potential data gaps
d. Benziodarone: HUMAN = 0, RODENT = 1, NON-RODENT = 0
Does this compound lack human liver effects?
• Recent mining of MEDLINE did not identify any new evidence for d having
human liver effects
• However, a basic search in Google (e.g. benziodarone, human,
hepatotoxicity) did reveal that the drug caused hepatotoxicity in humans
(inferred)
1. Clustering of compounds in chemical space
Example 4: Estrogen-like compounds
a. 2-methoxyestradiol (ID = 8): HUMAN = 1, RODENT = 1, NON-RODENT = 0
b. Estradiol (ID = 329): HUMAN = 1, RODENT = 1, NON-RODENT = 1
c. Estriol (ID = 332): HUMAN = 0, RODENT = 1, NON-RODENT = 0
d. Estrone (ID = 333): HUMAN = 1, RODENT = 1, NON-RODENT = 0
e. Ethinyl estradiol (ID = 338): HUMAN = 1, RODENT = 1, NON-RODENT = 1
1. Example 4: Assessing potential data gaps
c. Estriol: HUMAN = 0, RODENT = 1, NON-RODENT = 0
• Recent mining of MEDLINE and a basic search in Google (e.g. estriol,
human, hepatotoxicity) did not identify any new evidence for estriol (c)
having human liver effects
1. Clustering of compounds in chemical space
Some clusters have been identified in which compounds share highly similar
molecular structures and also toxicity profiles for humans (H), rodents (R)
and non-rodents (NR).
This information is highly important to identify chemotypes that define
species-specific DILI effects.
However, in some clusters, similar compounds appear to display
different toxicity profiles.
These cases may correspond to missing or unreported data, and
highlight areas for gap-spotting or additional experimental
investigation.
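One simple way to flag such candidate data gaps is to compare each compound's flag for a species with the majority flag in its cluster. The majority rule below is an illustrative assumption rather than the procedure used in the study; the profiles are those reported for the four compounds of Example 3 (IDs 60, 98, 99 and 100).

```python
from collections import Counter

# (HUMAN, RODENT, NON-RODENT) profiles of the Example 3 cluster, keyed by compound ID
cluster = {60: (1, 1, 1), 98: (1, 1, 0), 99: (1, 1, 0), 100: (0, 1, 0)}

def discordant(cluster, species_index):
    """Return IDs whose flag for one species differs from the cluster majority."""
    flags = [profile[species_index] for profile in cluster.values()]
    # On a tie, most_common returns the flag that was encountered first.
    majority, _ = Counter(flags).most_common(1)[0]
    return sorted(cid for cid, profile in cluster.items()
                  if profile[species_index] != majority)

human_gaps = discordant(cluster, 0)       # candidates for missing human data
rodent_gaps = discordant(cluster, 1)
non_rodent_gaps = discordant(cluster, 2)
```

A flagged compound is only a prompt for further literature mining or experiments, not evidence of an effect.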
2. Analysis of chemical fragment distribution
A (HUMAN ONLY): Compounds found to show liver effects for humans only
B (RODENT ONLY): Compounds lacking liver effects for humans
Are there some differences in fragment distributions between
compounds displaying human vs. rodent specific effects?
STRUCTURE REPRESENTATION
naphthalen-1-amine
Viewed by chemists
Viewed by computers
Viewed by another molecule
MOL File
Graphs are widely used to represent and differentiate chemical structures,
where atoms are vertices and bonds are expressed as edges connecting these
vertices.
Molecular graphs allow the computation of numerous indices (molecular
descriptors) to compare them quantitatively.
2. Analysis of fragment distributions within sets A and B
Fragment type      FA     FB     ΔF
C-N-C              71.6   49.0   22.6
C-C-C-N-C          50.0   28.0   22.0
C-C-C-N            58.9   37.4   21.5
C-C-N-C            64.0   43.6   20.4
C-C-N-C-C          39.8   20.6   19.2
C-N                86.4   67.7   18.7
C-C-N              76.3   59.1   17.1
C-N-C-C-N          24.2    7.8   16.4
C-C-C-N-C-C        30.9   15.2   15.8
C-N-C-C-N-C        21.2    5.8   15.3
N-C-C-N            24.6    9.7   14.8
C*N                35.2   20.6   14.5
C*C                80.1   66.1   13.9
C-C-N-C-C-O        22.0    8.6   13.5
C-C-N-C=O          29.2   16.0   13.3
C*C*N              33.1   19.8   13.2
C-C-N-C-C-N        18.6    6.2   12.4
S-C                23.3   10.9   12.4
C-C-N-C-C-N-C      17.8    5.8   12.0
C-S-C              15.3    3.5   11.8
C-N-C-C-O          29.2   17.5   11.7
C-N-C=O            37.7   26.1   11.6
C*C*C*C            70.8   59.1   11.6
C-S-C-C            13.6    1.9   11.6
C-C-N-C-C=O        17.4    5.8   11.5
O-C-C-N-C=O        15.7    4.3   11.4
C=C-N              15.3    3.9   11.4
C-N-C-C=O          19.9    8.6   11.4
C-N-C=C            14.0    2.7   11.3
C*C*C              75.0   63.8   11.2
C-C-C              86.9   75.9   11.0
N-C-C-N-C-C-O      12.7    1.9   10.8
C-C-C=O            47.9   37.4   10.5
O=C-C-N-C=O        15.7    5.4   10.2
C-C-C-N-C-C-N      14.8    4.7   10.2
S-C-C              14.4    4.3   10.1
N-C=O              42.8   32.7   10.1
C*C*C*N            23.3   13.2   10.1
C*N*C              29.7   19.8    9.8
C-C-C-C-N          33.1   23.3    9.7
C-C-C-N-C-C=O      13.1    3.5    9.6
N-C*N              15.7    6.2    9.5
C-C=C-N            12.7    3.5    9.2
N-C-C-N-C-C=O      11.4    2.3    9.1
C=C-C-O            14.4    5.4    9.0
C-C-C-N-C-C-C      14.4    5.4    9.0
C-C=C-N-C          11.4    2.7    8.7
S-C-C-C            11.4    2.7    8.7
N-C-C=O            20.8   12.1    8.7
C-C-C-C-N-C        27.1   18.7    8.4
C-C*N              17.4    8.9    8.4
Etc.
FA = Fragment Frequency (%) in set A (Human Only, 236 compounds)
FB = Fragment Frequency (%) in set B (Rodent Only, 257 compounds)
2. Differential fragment frequency distribution
FA = Fragment Frequency in A
FB = Fragment Frequency in B
ΔF = ( FA - FB)
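The differential frequency can be computed directly from per-compound fragment lists. In this sketch the frequency of a fragment is taken to be the percentage of compounds in a set that contain it at least once, which is an assumption about how the slide's percentages were obtained; the two toy sets are hypothetical.

```python
def fragment_frequency(compounds, fragment):
    """Percentage of compounds (each a set of fragments) containing `fragment`."""
    return 100.0 * sum(fragment in c for c in compounds) / len(compounds)

def differential_frequencies(set_a, set_b):
    """Return (fragment, FA, FB, FA - FB) rows sorted by decreasing difference."""
    fragments = set().union(*set_a, *set_b)
    rows = []
    for fragment in fragments:
        fa = fragment_frequency(set_a, fragment)
        fb = fragment_frequency(set_b, fragment)
        rows.append((fragment, fa, fb, fa - fb))
    return sorted(rows, key=lambda row: (-row[3], row[0]))

# Hypothetical fragment sets for four compounds in each class.
set_a = [{"C-N", "C-C"}, {"C-N"}, {"C-C", "C=O"}, {"C-N", "C=O"}]
set_b = [{"C-C"}, {"C-C", "C-N"}, {"C=O"}, {"C-C"}]
table = differential_frequencies(set_a, set_b)
```

Fragments at the top of `table` are enriched in set A relative to set B; negative differences mark fragments enriched in set B.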
3. Binary QSAR based classification
Class A (248), HUMAN ONLY: Compounds known to affect liver in humans only
Class B (283), RODENT ONLY: Compounds NOT affecting liver in humans
Can we predict the compound class from its structure only?
Principle of QSAR/QSPR modeling
Introduction
[Figure: COMPOUNDS are encoded as DESCRIPTORS, which are related to a PROPERTY
through Quantitative Structure Property Relationships. Values shown in the
figure: 0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256,
0.799, 1.195, 1.005]
3. QSAR based classification
Using SUPPORT VECTOR MACHINES (SVM)
Accuracy (%) = 100 × (number of compounds correctly predicted) / (total number of compounds)
Fold
Modeling set
5 fold CV
1
62.3%
62.9%
88.2%
77.6%
71.0%
67.3%
217
fragments
162
Dragon
64.9%
67.5%
81.2%
81.2%
64.2%
55.7%
112
fragments
197
Dragon
62.4%
65.2%
91.3%
91.1%
64.2%
61.3%
194
fragments
198
Dragon
4
64.9%
99.3%
72.6%
208
fragments
84.9%
82.6%
68.9%
68.9%
151
Dragon
5
62.1%
63.3%
205
fragments
61.9%
94.4%
70.8%
175
Dragon
2
3
Modeling set
Accuracy
NB: Preliminary results; could be improved.
External set
Accuracy
Model ID
Descriptors
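The accuracy definition and a 5-fold cross-validation loop can be sketched with the standard library alone. The fold assignment (every k-th compound) and the placeholder learner are assumptions made so the example runs without external packages; the placeholder is explicitly not an SVM, and in practice `fit` would wrap an SVM trained on fragment or Dragon descriptors.

```python
def accuracy(y_true, y_pred):
    """Percentage of compounds whose class is predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; fold i holds out items i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

def cross_validate(fit, X, y, k=5):
    """Predict every compound once, from a model that never saw it, and score."""
    predictions = [None] * len(y)
    for train, test in k_fold_indices(len(y), k):
        predict = fit([X[j] for j in train], [y[j] for j in train])
        for j in test:
            predictions[j] = predict(X[j])
    return accuracy(y, predictions)

def fit_majority(X_train, y_train):
    """Placeholder learner (NOT an SVM): always predicts the most common class."""
    majority = max(sorted(set(y_train)), key=y_train.count)
    return lambda x: majority

# Hypothetical data: ten compounds, one dummy descriptor each.
X = [[float(i)] for i in range(10)]
y = [1] * 7 + [0] * 3
cv_accuracy = cross_validate(fit_majority, X, y, k=5)
```

The gap between modeling-set accuracy and cross-validated or external accuracy is what the table reports for each fold; a majority-class baseline like this one gives a floor against which those values can be read.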
3. QSAR based classification
Class A
Class B
(248)
(283)
HUMAN
ONLY
18
EXTERNAL SET
(18 compounds reporting
no liver effects
in humans or rodents)
QSAR MODELS
RODENT
ONLY
3. QSAR based classification
Compounds in external set: 18
Model ID   Descriptors   Modeling set 5 fold CV   Modeling set Accuracy   External set Accuracy
206        Fragments     62.9%                    92.5%                   77.8%
141        Dragon        64.0%                    97.9%                   66.7%
14 of 18 compounds are predicted
to lack liver effects for humans.
4 compounds are predicted to have human
liver effects. BUT:
Missing/unreported data ???
Sulfadoxine (ID=820)
Human = 0
Rodent = 0
IN THE MODELING SET:
Sulfadimethoxine (ID=819)
Human = 1
Rodent = 0
3. Sulfadoxine: Assessing potential data gap
Sulfadoxine
Human = 0
Rodent = 0
Missing/unreported
data?
• Recent mining of MEDLINE did identify evidence for
pyrimethamine/sulfadoxine (Fansidar) causing hepatitis in patients
• Normally, combinations would be excluded from these analyses