Slides 2(ppt) - Laboratoire d'Infochimie

advertisement
Ligand-Based Virtual Screening:
Extraction of Knowledge from Experimentally Confirmed
Ligands, and the Quest for other Candidates Matching the
Known.
Dragos Horvath
Laboratoire d’InfoChimie, UMR 7177
CNRS – Université de Strasbourg
6700 Strasbourg, France
horvath@chimie.u-strasbg.fr
1. « Ceci n’est pas une molécule »
Molecular Models and Descriptors in Chemoinformatics:
Numerical Encoding of Structural Information
Molecular Descriptors or Fingerprints
• Need to represent a structure by a characteristic bunch
(vector) of numbers (descriptors).
– Example: (Molecular Mass, Number of N Atoms, Total Charge,
Number of Aromatic Rings, Radius of Gyration)
• Should include property-relevant aspects:
– the “nature” of atoms, including information on their neighborhood-induced properties, and their relative arrangement.
– Number of N Atoms  (Primary Amino Groups, Secondary
Amino Groups, … , … , Amide, … , Pyridine N, …)
– … unless being a H bond acceptor is the key (O or N alike)!
– Arrangement in space (3D, conformation-dependent distances in
Å) or in the molecular graph (2D, topological distance =
separating bond count)
Example 1: ISIDA Sequence Counts
O-C*C*C*C-N
O-C*N*C*C-N
…
(1,1,…)
1 12
1 0
…
(2,0,…)
Example 2: Fuzzy mapping of pharmacophore
triplets (2D-FPT)
Atoms labeled by their pharmacophore types:
• Hydrophobic, Aromatic
• Hydrogen Bond Donor, Cation
• Hydrogen Bond Acceptor, Anion
3
3
3
3
4
4
5
3
5
5
4
4
5
Di(m) = total occupancy of basis triplet i in molecule m.
7
6
Example 2: Fuzzy mapping of pharmacophore
triplets (2D-FPT)
Atoms labeled by their pharmacophore types:
• Hydrophobic, Aromatic
• Hydrogen Bond Donor, Cation
• Hydrogen Bond Acceptor, Anion
3
3
3
3
0
4
5
0
…
4
0
0
…
+6
4
7
5
5
4
0
5
3
…
…
+3
6
…
…
…
Di(m) = total occupancy of basis triplet i in molecule m.
…
0
…
Chemically Relevant Typing: Proteolytic
equilibrium dependence of 2D-FPT
12%
88%
Chemically Relevant Typing: Proteolytic
equilibrium dependence of 2D-FPT
?
12%
88%
Chemically Relevant Typing: Proteolytic
equilibrium dependence of 2D-FPT
?
12%
88%
Reference Atoms
3: A 3D Example: Overlay-Based ComPharm
Pharmacophore Field Intensities
1
Pharmacophoric Features
Alk. Aro. HBA HDB (+)
(-)
X11 X12 X13 X14 X15 X16
2
X21
X22
X23
X24
X25
X26
3
X31
X32
X33
X34
X35
X36
4
X41
X42
X43
X44
X45
X46
5
X51
X52
X53
X54
X55
X56
• A descriptor of the nature of the
molecule’s pharmacophoric neighborhood “seen” by every reference
atom, assuming an optimal overlay
of the molecule on the reference...
2. Computer-Aided Ligand-Based Design:
the « Medicinal Chemistry » of Ligand
Fingerprints
« Similar molecules have similar properties » →
« Molecules with similar fingerprints have similar
properties »
« Structure-Activity Relationships » →
« Fingerprint-Activity Relationships » (or Quantitative
Structure-Activity Relationships, QSAR)
2.1 Molecular Similarity in Chemoinformatics
Molecular
Similarity
Expressed by
Fingerprint
Similarity
The Similarity Principle – Neighborhood
Behavior
Property Dissimilarity
Molecule Pairs M,m
*
* *
*
*
*
*
*
*
*
*
*
*
**
*
Pairs with different Properties
* L(m,M)=|P(m)-P(M)| ≥l
* *
*
*
*
* *
*
*
*
*
*
* *
* * *
* *
*
*
*
*
*
* Pairs with similar
* * <l
L(m,M)=|P(m)-P(M)|
*
*
* ** Properties
*
*
*
*
* *
*
*
*
*
Some Random Ranking Criterion for pairs (m,M)
*
The Similarity Principle – Neighborhood
Behavior
Molecule Pairs M,m
Property Dissimilarity
*
*
*
* *
** *
*
*
*
*
*
*
*
False Positives
True Negatives (TN)
*
*
*
*
* (FP)
*
*
*
*
*
*
*
*
* *
* ** *
* *
*
*
*
*
*
*
* *(FN)
True
(!) False
* Negatives
* Positives* ** Potentially
*
(TP)
*
*
*
*
*
*
*
*
*
*
Calculated Structural Dissimilarity S(m,M)
Similarity-Based Virtual Screening…
Active Reference
Nearest Neighbors
Reference
Fingerprint
Superposition-based Similarity Scoring
Automated
Fingerprint
Matching...
Ligand Candidate
Fingerprint Library
Best Matching Candidates
Strenght & Limitations of Similarity-based VS


(+) Only need ONE active ligand to seek for more like it…
(+) With appropriate descriptors, calculated similarity may be
complementary to the scaffold-based similarity perceived by
medicinal chemists
 → « Scaffold Hopping »: bypassing synthetic bottlenecks and/or
pharmacokinetic property problems, patent space evasion, etc.

?
(--) Within the reference ligand, « all
groups are equal, but some are more
equal than others » when it comes to
controlling activity… so what if we
mismatch the latter??
2.2: So, we need to LEARN the features that
really matter – building QSARs
A QSAR model expresses
observed correlations between
certain descriptors and activity
Mol
M1
Act
A1
D1
D2
D3
d11
d21
d31
…
…
Dn
dn1
M2
A2
d12
d22
d32
…
dn2
M3
A3
d13
d23
d33
…
dn3
M4
Ac4
d14
…
dn4
M..
Ac..
d1… d2… d3… …
dn…
Mm
Acm
d1m
…
dnm
d24
d2m
d34
d3m
a ´ D
A=
Model
Fitting
i
linear
A
neural net
i
D1
D2
Dn
M (D1,D2,D3,…,Dn)
oui
Di<?
(M active)
non
(M inactive)
?
decision
tree
neighborhood model
Correlations: The Cornerstone of QSAR
Philosophy (or, perhaps, Religion?)
MolID
Activity
Class
Phe
Ring
Count
HB
Acceptor
Count
Count of
C-N pairs
at 5 bonds
1
1
1
3
2
2
1
2
2
1
3
1
1
4
2
4
1
1
3
2
5
1
2
2
3
6
0
1
1
5
7
0
1
0
4
8
0
2
2
3
9
0
2
1
5
10
0
1
0
4
Correlations: The Cornerstone of QSAR
Philosophy (or, perhaps, Religion?)
MolID
Activity
Class
Phe
Ring
Count
HB
Acceptor
Count
Count of
C-N pairs
at 5 bonds
1
1
1
3
2
2
1
2
2
1
3
1
1
4
1
1
3
2
5
1
2
2
3
6
0
1
1
5
7
0
1
0
4
8
0
2
2
3
9
0
2
1
5
10
0
1
0
4
More 4is better
2
Correlations: The Cornerstone of QSAR
Philosophy (or, perhaps, Religion?)
MolID
Activity
Class
Phe
Ring
Count
HB
Acceptor
Count
Count of
C-N pairs
at 5 bonds
1
1
1
3
2
2
1
2
2
1
3
1
1
4
1
1
3
2
5
1
2
2
3
6
0
1
1
5
7
0
1
0
4
8
0
2
2
Less is3better
9
0
2
1
5
10
0
1
0
4
More 4is better
2
Correlations: The Cornerstone of QSAR
Philosophy (or, perhaps, Religion?)
I always end up in this deplorable state,
MolID
1
2
3
4
5
Activity
Phe
HB
Count of
no
matter
whether
I
drink:
Class
Ring
Acceptor C-N pairs
Count
Count
at 5 bonds

Vodka-Soda

1

1

1
Martini-Soda
1
3
Gin-Soda
2
2
Whisky-Soda…
1
2
1
More 4is better
2
… therefore,
as of tomorrow,
I2 decided to
1
1
3
stop
SODA
1 drinkin’
2
2 !
3
6
0
1
1
5
7
0
1
0
4
8
0
2
2
Less is3better
9
0
2
1
5
10
0
1
0
4
Correlation is not Causality - an Obvious, but
Inconvenient Truth…

SAR sets are always limited in diversity and therefore may (and
always will) accomodate coincidental relationships between
different features:
Diverse library of 16x6x10=960 compounds… with NPC=NHD
Why Lucky Correlations may Outperform more
Rigorous Modeling…
• Rule-based pharmacophore feature assignment has a hard
time with the imine group =N–. In rule-based triplets of
benzodiazepine receptor ligands, it was flagged as cation.
• Proper pH-sensitive flagging corrected this error… and
dramatically reduced model quality!
• Labelling as ‘cation’ was a way to ‘highlight’ that N group –
and, since preferentially seen in actives, highlighting made
QSAR learning easier
CAUSAL QSAR… but not in the way you’d
expect it! A Psychedelic µ Receptor Model…
• Training set: small combinatorial carbamate library, of 240
compounds obtained by robotized synthesis, LC/MS purity
control and µ receptor affinity (pIC50) measurement
(proof-of-concept study, CEREP 1997)
-OR, -NR2
• A successful ComPharm QSAR model (R2≈0.8) was built
to explain the measured pIC50 values (btw. µ- and mM)
−
HB-acceptor in para of benzyl alcohol enhances µ receptor
affinity
…based on wrong experimental data!
• The most “active” carbamates of the training set turned out
to be contaminated with ‰ traces of decarboxylation
product, featuring the opioid ligand specific tertiary amine
and having nanomolar potencies!
+
• Our QSAR actually explained… the decarboxylation
mechanism: p-OR or –NR2 stabilizes the intermediate
carbocation… thus rendering contamination possible!
Not seeing the Scaffold because of the
Molecules: why Water is a Thrombin Inhibitor.
• Non-linear Thrombin affinity model, with R2train=0.92,
Q2=0.84 and R2validation=0.61.
–
pKi = 2.2e-4Ar4Hp14PC122 –3.8e-5Ar4Ar12HA102 –1.4HD8Hp6PC42 +2.45exp[-(Ar6Ar10HD14-19.8)2/1362]
+4.36/{1+exp[-(Ar10Ar12Hp8-71.3)/119]}
+0.77exp[-3(Ar12HD8Hp12-13.3)2/1042]
–1.05/{1+exp[(Ar12Hp4Hp10-185.1)/327]} –2.26{1+exp[-(Ar12Hp6NC8-3.2)/13]} –4.71exp[-(HA4HD4Hp4-43.4)2/1222]
+7.3e-4HA4Hp4Hp6 –2.2e-3HA10HD4Hp8 +5.94exp[-(HA12Hp14PC12-1.2)2/152]
• If all population levels are zero, the calculated pKi is of
5.9, mostly due to contributions from the highlighted
Gaussians.
– Absence of pharmacophore triplets automatically qualifies any
small molecule as micromolar thrombin inhibitor!
• The model has learned that, for benzamidines – the
chemical class represented in the training set, the presence
of underlined triplets coincides with a loss of activity.
Beware of “Antipharmacophores”!
• Ar12-HD8-Hp12 is an “antipharmacophore” of the
Thrombin model:
– Absence of this pharmacophore triplet means an enhanced
activity, its presence correlates with an observed affinity loss
• The undesirable Presence, in the above context, implicitly
means presence in specific points of the ubiquitous
benzamidine scaffold
– Fragments or pharmacophore elements that are genuinely “bad”
for activity, no matter where they are located, are rare. An
“antipharmacophore” rather reflects poor training set diversity!
Actives
Inactives
Actives, but not available for training
The Phantom Scaffold… materializes when
adding diverse inactives to the training set
• All the training molecules – both actives and inactives –
being benzamidines, the QSAR model cannot possibly
learn the importance of the benzamidine moiety!
• After enriching the thrombin data set with inactives, one
model out of ~2000 was able to predict the activity of
unrelated thrombin ligands of known binding geometries.
– Scaffold Hopping: Yes, we can!
• Triplets entering the model successfully corroborate some
of the ligand features involved in binding.
– These include features entering the pockets P1 and P2, but not the
aromatic moiety binding in pocket P3.
Enhanced Training Set Diversity leads to
Models with “Scaffold Hopping” abilities
3.24
3.92
1.93
2.55
HA8-Hp6-PC4
Enhanced Training Set Diversity leads to
Models with “Scaffold Hopping” abilities
Missing P3: deleting a Phe does not lead to ‘the
same molecule, but with one phenyl less’.
• The training set included both examples of actives, with
required aromatic P3 substituent and inactives, without this
key moiety. So, why did the model not learn about the
importance of P3?
• In all training examples, however, the removal of the P3
moiety was always done by deacylating the
phenylalanine…
– thus, compounds missing P3 substitution were systematically
compounds having one excess protonable group
– the model actually chose to learn the ‘bogus’ rule that one
more cation causes an activity loss.
Missing P3: deleting a Phe does not lead to ‘the
same molecule, but with one phenyl less’.
• The training set included both examples of actives, with
required aromatic P3 substituent and inactives, without this
P3 not learn about the
key moiety. So, why did the model
importance of P3?
• In all training examples, however, the removal of the P3
moiety was always done by deacylating the
? P3
phenylalanine…
– thus, compounds missing P3 substitution were systematically
compounds having one excess protonable group
– the model actually chose to learn the ‘bogus’ rule that one
more cation causes an activity loss.
“Let s be a representative sample of the set S…”
• It takes a sample of ~104 individuals to extrapolate the
voting intentions of a population of ~107. What’s the
representative subset size of 1025 drug-like compounds?
– If we ever dared to publish QSARs trained on fewer compounds,
shame on us!
• If given N=3 coordinate pairs (Y,X), not even Carl
Friedrich Gauss could come up with a model more
sophisticated than Y=aX2+bX+c
– Don’t listen when they say that Support Vector Machines have
very few “tunable parameters”!
• May your model apply to one million and one molecules –
it may still fail for the one million and second!
– One cannot validate QSAR – but just fail to invalidate it!
We are Medicinal Chemists – tell us about
Pharmacophore Models, forget QSAR!!
No knowledge of the
active site – need
alternative overlay
hypotheses !
• Bad news: Pharmacophore
models are just a peculiar
type of 3D-QSAR:
– use overlay models to “bind”
descriptors to specific spots in
space
– Pharmacophore hot spots are
defined by the consensual
presence of groups of similar
type, throughout the series of
known actives
– Descriptors are occupancy
levels of these spots
Kama Sutra with Ligands: Match As Many
Equivalent Pharmacophore Features You May!
?
? +
+ ??
methotrexate
dihydrofolate
+
?
+
+ ++
Böhm, Klebe, Kubinyi, “Wirkstoffdesign” (1999)
The Cox-2 Scenario: An Ideal
Pharmacophore Model
• As exhaustive and diverse as possible a set of
CycloOxygenase-II (Cox2) inhibitors, including:
– A set of 1914 marketed drugs of the U.S. Pharmacopeia, tested
on Cox2 by the Cerep laboratories (BioPrintTM).
– A set of 326 inhibitors compiled from literature by N. Baurin
(thanks!), including co-crystallized selective Cox2 ligand SC558 and related medicinal chemistry series.
– Potencies, expressed as pIC50 = -log IC50[mol/l] can be directly
compared (cross-check on compounds present in both subsets)
Minimalistic Overlay-based Model
– Training RMS=0.712, R2=0.712
0.191
Hp@Atom#11
0.179
HA@Atom#20
0.430
HD@Atom#20
-0.428
zexp(Ar-HA5)
1.414
zsig3(#PC)
1.414
Intercept
0.000
Aromatic Requested
Fo
rb
idd
en
Ar@Atom#2
Do
no
r
Coeff.
Hy
Descriptor
dro
ph
ob
eR
eq
ue
ste
d
– validation RMS=0.698, Q2=0.724
Acceptor Requested
HipHop Acceptor!
Furthermore, it supports « Scaffolfd
Hopping » !
• it manages to explain the Cox2 activities of the apparently
unrelated nonspecific Cox1/Cox2 inhibitors:
• This is an ideal scenario – scaffold-independent model
trained on thousands of compounds: so maybe the
overlay models are mechanistically relevant !
… or maybe not!
Val 116
Val 349
?
Trp 387

Arg 120
SC-558
Arg 513
?
His 90
!
Tyr 355
Predictive? Yes!
Enlightening? No!
• Overlay-based models correctly explained the behavior of
the two distinct Cox2 binder classes… on hand of an
erroneous working hypothesis!!
– The ‘correct’ overlay asks an ‘anion’ (flurbiprofene –COO-) to
be aligned atop of a ‘hydrophobe’ (CF3) – a heresy in
pharmacophore matching! (Well, is CF3 a hydrophobe??)
– QSAR building is never safe from correlation artifacts, not even
in models with 6 variables versus 2200 observables and
excellent statistics!
– Such model may be very successful in selecting database subsets
enriched in new Actives - but QSAR alone would never have
elucidated the binding mechanism to Cox2!
QSAR – a Bookkeeping Tool !?
• “Bookkeeping” QSAR: a quantitative way to wrap up the
information contained in your training set compounds
– Models are scaffold-bound, heavily populated by scores of bogus
antipharmacophores
– It will typically tell you things you’d notice by simply looking at
the molecules
– Sampling of all the possible models fitting the observations
allows to enumerate all the alternative working hypotheses that
still await to be discarded…
– …or validated. The model might highlight not yet tried
combinations of known features with better activities!
– Think positive: this provides a rational plan to challenge
these hypotheses, and thus learn more from better planned
experiments.
3. Did we forget something? Each Model has
its Limited Applicability Domain…
… even General Relativity and Quantum Physics!
Defining the Conditions of Applicability of QSAR Models
– and respecting them – might help!
The Applicability Domain (AD) – A
Compromise…
• Restrict the applicability of a QSAR model to a welldefined subset of the chemical space – the one populated
by the training molecules. Insufficient sampling of
chemotypes outside this AD is then irrelevant.
– How do we define this subset of chemical space to be as large as
possible, while nevertheless densely enough populated by
training molecules?
Feature count 1
Example: the Feature
Control Approach
Feature count 2
* * * *
* * *
* **
*
*
*
* *
*
… but no miraculous solution! A real-life
inspired (Gedanken)Experiment
• Modeling of metal ligation propensities, with a training set
composed of three subfamilies, R being alkyl chains:
• pKbind= aAnilin/AcidNAnil + aPyr/AcidNPrd+ gsizeF(R) + CAcid
• AD requirements: the molecule should contain




NAnil=0 or 1 aniline fragment
NAcid=0 or 1 benzoic acid fragment
NPrd=0 or 1 pyridine fragment
Alkyl chains of size as seen in training set
Contributions of a good programmer, but lousy
chemist, to the understanding of QSAR!
• Would you like to know whether propane is a potent metal
binder ?
– Yes, it is: pKbind= gsizeF(C3) + CAcid (same as for p-propyl
benzoic acid)
– But it can’t possibly be within the AD, can it?
 NAnil=0 or 1 aniline fragment
 NAcid=0 or 1 benzoic acid fragment
 NPrd=0 or 1 pyridine fragment
 Alkyl chains of size as seen in training set
Contributions of a good programmer, but lousy
chemist, to the understanding of QSAR!
• Would you like to know whether propane is a potent metal
binder ?
– Yes, it is: pKbind= gsizeF(C3) + CAcid (same as for p-propyl
Feature count 1
benzoic acid)
– But it can’t possibly be within the AD, can it?
Example:
NAnil=0
theor
Feature
1 aniline fragment
Control Approach
*
 NAcid=0 or 1 benzoic acid fragment
* * *
* *
*
 NPrd=0 or 1 pyridine fragment * *
*
*
*
* **
*
 Alkyl chains of size as seen in training set
Feature count 2
Building Up Trust from Consensus
•
•
•
•
pKbind= aAnil/AcidNAnil + aPyr/AcidNPrd+ gsizeF(R) + CAcid
pKbind= aAnil/PrdNAnil + aAcid/PrdNAcid+ gsizeF(R) + CPrd
pKbind= aAcid/AnilNAcid + aPrd/AnilNPrd+ gsizeF(R) + CAnil
These three alternative models are perfectly equivalent –
or “redundant” – as far as training set molecules are
concerned
– Identical prediction for each training molecule, identical
statistical parameters
• However, they cease to be redundant when it comes to
propane: CAcid ≠ CAnil ≠ CPrd
– Divergent prediction by allegedly “redundant” models is a clear
signal of AD violation!
Conclusions…
• In Ligand-based knowledge extraction, the single most
important piece of hardware is a BRAIN
• Correlation is not causality… it’s correlation!
– So, if correlations observed within the training set do apply to
other molecules, forget metaphysical afterthoughts and exploit
them, in successful virtual screening
– However, an in-depth analysis of the model – if feasible – may
reveal intrinsic limitations and pitfalls, and help to better delimit
the AD.
• Training set diversity is the key!
– Do not hesitate to add bogus inactives, in order to teach your
model proper “border conditions” such as “cosmic vacuum is
inactive”. Absence of features cannot provide activity…
Conclusions…
• If a big pharma manager asks you “So, is QSAR useful?”,
please reply “Compared to what?”
– A wrong QSAR model may nevertheless ring a bell in a
medicinal chemist’s brain, and help to make right decisions
– There are moments in life when one should rely on the
accumulated knowledge, and use QSAR to discover new
combinations of known features:
• A properly trained model, within its AD, stands fair chances to discover
new actives and sensibly decrease synthesis/testing effort, maybe find a
new scaffold porting the known binding pharmacophore (lead hopping)
– There are moments when one should put known things aside, and
venture out for random search of new paradigm-breaking ligands
– new scaffold, new binding mode, new action mechanism, but...
Conclusions…
• If a big pharma manager asks you “So, is QSAR useful?”,
please reply “Compared to what?”
– A wrong QSAR model may nevertheless ring a bell in a
medicinal chemist’s brain, and help to make right decisions
– There are moments in life when one should rely on the
accumulated knowledge, and use QSAR to discover new
combinations of known features:
–
• A properly trained model, within its AD, stands fair chances Proprietary
to discover
& patented
new actives and sensibly decrease synthesis/testing effort, maybe
find a
algorithm
new scaffold porting the knownLaymen
binding pharmacophore
(lead
hopping)
will appreciate the
and fittable
exotic
resonance
of
the
name,
There are moments when one should put known things aside,
andn
coefficient
evoking some oriental wisdom
venture
for random
of new paradigm-breaking ligands
Planck’s out
constant
– a flavorsearch
of
from high-tech Japan
(and mode, new action mechanism, but...
–fundamental
new scaffold,science
new binding
conveniently small)
Download