chemically intelligent data mining
multiparameter optimisation for medicinal chemists
how to handle petabytes of data – Google Chemistry!
activity prediction
Dr Tony Wood
VP, Head of Worldwide Medicinal Chemistry
Pfizer Global Research and Development anthony.wood@pfizer.com
2002-2005 the primary causes of attrition were safety and pharmacology
results of an analysis of 349 studies on 315 compounds covering 90 targets at 985 doses with >10,000 organ evaluations in 4 species
PK known for all cases, with strong correlation between AUC and Cmax
compound set has similar diversity to Pfizer file
[Figure: exposure scale from Low Concentration to High Concentration, divided into Clean Exposures (below CmaxHiCln), Uncertain (between CmaxHiCln and CmaxLowTox) and Toxic Exposures (above CmaxLowTox)]
CmaxLowTox: minimum exposure with observed toxicity. Set to an arbitrarily high number if no toxic event is observed at any dose.
CmaxHiCln: maximum exposure without observed toxicity. Set to zero if toxicity was observed at all doses assessed.
[Figure: histogram, y-axis 0 to 300, against Threshold (Total Drug) of 100nM, 1uM, 10uM, 100uM and 1000uM, split into Clean, Uncertain and Toxic]
exposure thresholds were chosen to obtain a balance of toxic and non-toxic outcomes
the total-drug threshold was set to 10uM: approx 40% of evaluations are above the threshold and 40% below
a similar analysis for free drug levels gives a threshold of 1 uM
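The two exposure limits defined above can be sketched in a few lines. This is a minimal illustration, not the analysis code behind the slides: the function names, the input format and the three-way classification rule against the 10 uM threshold are assumptions made for the example.

```python
import math

def exposure_limits(observations):
    """Return (CmaxLowTox, CmaxHiCln) for one compound.

    observations: list of (cmax_uM, toxicity_observed) pairs, one per dose.
    The input format is hypothetical.
    """
    toxic = [c for c, is_toxic in observations if is_toxic]
    clean = [c for c, is_toxic in observations if not is_toxic]
    # "Arbitrarily high number" when no toxic event is seen at any dose.
    cmax_low_tox = min(toxic) if toxic else math.inf
    # Zero when toxicity is observed at all doses assessed.
    cmax_hi_cln = max(clean) if clean else 0.0
    return cmax_low_tox, cmax_hi_cln

def classify(cmax_low_tox, cmax_hi_cln, threshold_uM=10.0):
    """Assumed rule: compare both limits with the exposure threshold."""
    if cmax_low_tox <= threshold_uM:
        return "toxic"       # toxicity seen at or below the threshold
    if cmax_hi_cln >= threshold_uM:
        return "clean"       # clean at or above the threshold
    return "uncertain"       # threshold falls between the two limits
```

For example, a compound that is clean at 1 and 5 uM and toxic at 30 uM has limits (30.0, 5.0) and is "uncertain" at the 10 uM threshold.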
the y-axis here is a generalized odds, i.e., the ratio of the probability of a compound with a given parameter value being toxic to the probability of it not being toxic
combining low TPSA and high cLogP exacerbates the risk
ratio of toxic to non-toxic outcomes (numbers in parentheses indicate number of outcomes in database); holds for both free-drug and total-drug thresholds
ratio of promiscuous to non-promiscuous compounds

            TPSA>75      TPSA<75
ClogP<3     0.25 (25)    0.80 (18)
ClogP>3     0.44 (13)    6.25 (29)

promiscuity defined as >50% activity in >2 Bioprint assays out of a set of 48 (selected for data coverage only)
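The generalized odds used on these slides is the probability of an outcome divided by the probability of its complement, which reduces to a ratio of counts. A minimal sketch follows; the function names and the counts in the usage note are hypothetical and are not taken from the database.

```python
def odds_from_probability(p):
    """Odds = P(event) / P(not event)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return p / (1.0 - p)

def odds_from_counts(n_event, n_not_event):
    """Same quantity from counts: (n_e / n) / (n_ne / n) = n_e / n_ne."""
    if n_not_event == 0:
        raise ValueError("odds are undefined with no non-events")
    return n_event / n_not_event
```

With hypothetical counts of 25 toxic and 4 non-toxic outcomes, `odds_from_counts(25, 4)` gives 6.25, the same as `odds_from_probability(25 / 29)` up to rounding.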
does a good cell viability profile increase the probability of a compound being a CNS CAN without organ tox in the risky ClogP group (ClogP > 3)?
[Figure: CNS CANs set, organ tox or not attrition; panels for ClogP > 3 and ClogP < 3 show the split between No Organ Tox and Organ Tox in each THLE Cv bin (x < 25 uM, 25 < x < 100 uM, x > 100 uM)]
“a place to store toxicological knowledge”
knowledge-based expert system
broad range of toxicity endpoints covered
identifies structural alert
provides literature-based rationale for prediction
qualitative or semi-quantitative predictions
now has an API for integration into 3rd party software products
main strengths are mutagenicity, chromosome damage, carcinogenicity and skin sensitization
some recent efforts in hepatotoxicity and teratogenicity
[Figure: bar chart by Endpoint, y-axis 0 to 100]
these relationships were determined using a small, well characterised data set
much more data lies in non-curated data sets with no structure keys
we need chemically intelligent data mining to derive knowledge, including SAR, from this resource
[Figure: property distributions for Drugs and CANs, showing medians and 95% range; the LE panel is annotated "90% ≥ 0.36"]
Median values:

Property            Drugs    CANs
LE                  0.52     0.47
LLE (lipE)          6.2      6.3
ClogP               2.9      3.4
ClogD (LOGD_7.4)    1.8      2.3
TPSA                47       53
MW                  305      360
HBD                 1.0      1.0
pKa (BASIC1PKA)     8.4      8.4
for design (prospective, accurate and constant) increasing CNS MPO enhances the probability of candidate survival and alignment of in-vitro
CNS MPO desirability: each of six parameters is scored on a scale from 0 to 1

Parameter    Attributes    Plot x-axis
ClogP        C, P, S       cLogP, -2 to 6
ClogD        C, S          LOGD7.4, -2 to 5
TPSA         P, S          PSA, 0 to 120
MW           P             MW, 100 to 600
HBD          P             HBDONORCNT, 0 to 5
pKa          P, S          pKa (1), 2 to 10

C = Clearance; P = Permeability (including efflux); S = Safety (including high risk space)
↑ CNS MPO increases the probability of successfully aligning attributes
[Figure: CNS MPO desirability by binned class (Drugs, CANs, Pre CAN, Leads)]
design now rests on a probabilistic basis using complex MPO relationships
we need transparent, easy to construct and easy to understand methods to perform multiparameter optimisation
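One transparent way to build such a score is to sum simple piecewise-linear desirability functions, one per parameter. The sketch below is an illustration only: the inflection points are taken from the published description of CNS MPO and are not stated on these slides, so treat every threshold as an assumption. The function names are hypothetical.

```python
def falling(x, full_at_or_below, zero_at_or_above):
    """Desirability 1 up to the first point, falling linearly to 0 at the second."""
    if x <= full_at_or_below:
        return 1.0
    if x >= zero_at_or_above:
        return 0.0
    return (zero_at_or_above - x) / (zero_at_or_above - full_at_or_below)

def hump(x, zero_lo, full_lo, full_hi, zero_hi):
    """Desirability 1 inside [full_lo, full_hi], falling to 0 on both sides."""
    if x <= zero_lo or x >= zero_hi:
        return 0.0
    if x < full_lo:
        return (x - zero_lo) / (full_lo - zero_lo)
    if x <= full_hi:
        return 1.0
    return (zero_hi - x) / (zero_hi - full_hi)

def cns_mpo(clogp, clogd, mw, tpsa, hbd, pka):
    """Sum of six desirabilities, range 0 to 6 (thresholds are assumptions)."""
    return (
        falling(clogp, 3.0, 5.0)
        + falling(clogd, 2.0, 4.0)
        + falling(mw, 360.0, 500.0)
        + hump(tpsa, 20.0, 40.0, 90.0, 120.0)
        + falling(hbd, 0.5, 3.5)
        + falling(pka, 8.0, 10.0)
    )
```

Because each term is a plain ramp, a chemist can read off which parameter costs a compound its score, for example a ClogP of 4 contributing 0.5 instead of 1.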
data sources: Pfizer in house; Inpharmatica StARLITe; Cerep BioPrint; Thomson IDDB
unified db:
4.8 M structures
275k active compounds
600k activities (IC50, etc)
3k targets
800 human targets
target classes: Serine proteases; Cysteine proteases; Aspartyl proteases; Metalloproteases; Kinases; Phosphodiesterases; Ion Channels; Aminergic GPCRs; Peptide GPCRs; GPCRs (others: classes A, B & C); Enzymes (hydrolases, transferases, oxidoreductases & others); Nuclear hormone receptors; Miscellaneous
[Figure: network of targets; node: target, edge: compound]
Rev Thomas Bayes, ca 1702 - 1761
data set (assay data): “good” actives and “bad” inactives
fingerprint bits ~ substructures
fingerprints are calculated for each molecule
check how often each fingerprint bit is observed, and how often it is observed in a “good” compound
assign a weighting factor taking into account both activity ratio and sampling size; for instance, a “good”/total ratio of 90/100 is statistically more relevant than 9/10
model distinguishes “good” from “bad”
predict the likelihood that a molecule is “good”
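The weighting step can be sketched with a Laplacian-corrected estimate, a common choice for fingerprint-based Bayesian models. The slides do not give the formula used, so the estimator, the function names and the toy data below are assumptions made for illustration.

```python
import math
from collections import Counter

def bit_weight(n_good_with_bit, n_with_bit, base_rate):
    """Laplacian-corrected log weight for one fingerprint bit (assumed form).

    Rarely seen bits are pulled towards zero, so 90/100 outweighs 9/10.
    """
    return math.log((n_good_with_bit + 1.0) / (n_with_bit * base_rate + 1.0))

def train(fingerprints, labels):
    """fingerprints: iterable of sets of bit ids; labels: True for 'good'."""
    base_rate = sum(labels) / len(labels)
    seen, good = Counter(), Counter()
    for bits, is_good in zip(fingerprints, labels):
        for bit in set(bits):
            seen[bit] += 1
            if is_good:
                good[bit] += 1
    return {bit: bit_weight(good[bit], seen[bit], base_rate) for bit in seen}

def score(weights, bits):
    """Sum the weights of the bits present; unseen bits contribute nothing."""
    return sum(weights.get(bit, 0.0) for bit in set(bits))
```

With a base rate of 0.5, `bit_weight(90, 100, 0.5)` is larger than `bit_weight(9, 10, 0.5)` even though both ratios are 0.9, which is the sampling-size effect described above.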
[Figure: Bayesian model versus confirmed measurement, LE HTA, both axes 0.05 to 0.45]
HTA+ > HTA: false positive HTA+ or false negative HTA predictions; all false negative HTA colored by Bayesian score (red: high confidence, blue: low confidence)
HTA > HTA+: false positive HTA or false negative HTA+ predictions (red: false HTA+ negative, blue: false HTA positive)
238k actives (10 µM), human target
mw < 1000
pass reactivity filter
10 actives / target
90% / 214k
3870 compounds with
10,806 predictions
FCFP_6
698 models
Bayesian score
BIG LEAP: searching the Pfizer liquid and virtual compound collections
Pfizer global virtual library: ~10^12 compounds in 5000 libraries (real: 0.000025%)
liquid screening file: 2.5M compounds
1.2M singletons
derive a Bayesian model that distinguishes library 1 from 2, from 3, etc.
predict the 16 libraries to which a compound could belong
search only these libraries, in real and virtual compound space
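The pruning step amounts to ranking libraries by model score and keeping the top few. A minimal sketch, assuming the per-library scores for a query compound have already been computed; the dictionary of scores and the library names are hypothetical.

```python
import heapq

def top_libraries(library_scores, k=16):
    """Return the k library ids with the highest score, best first."""
    return heapq.nlargest(k, library_scores, key=library_scores.get)

def fraction_searched(k=16, n_libraries=5000):
    """Share of libraries that still needs to be enumerated and searched."""
    return k / n_libraries
```

Keeping 16 of 5000 libraries means only 0.32% of the libraries are searched in detail for each query.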
[Figure: matrix of library products formed from acid monomers (A1, A2) and amine monomers (B1, B2, B4), with areas 1 and 2 marked]
model is built from synthesized compounds (yellow squares)
nearly all fingerprint features of any virtual compound (square marked with “?”) are shared by at least one compound from the training set (squares marked with “X”)
virtual products in area 1 share at least one monomer with a compound from the training set; for compound “O”, the new monomer B2 is very close to the previously used B1
compounds from area 2 can be considered outside the scope of the model because they have few fingerprint features in common with the existing products, as shown for compound “*”, where monomers A2 and B4 are unlike previously used monomers
[Figure labels: acidic; ex-PR library (four times); new]
a framework for computational scientists to publish services (protocols, models) that can be immediately leveraged by project teams
a knowledge repository for computational scientists to capture and share their best practices
when protocols are published they are automatically wrapped as a new PLP component
we are not short of idea generators!
easy to construct vast virtual libraries
we need ways of rapidly scoring and searching petabytes of data
training set: 98,155 compounds (80%)
validation set: 19,577 compounds (20%)
test set: 9,241 compounds
training: Kappa 0.61, Concordance 80%, Sensitivity 81%, Selectivity 80%
test: Kappa 0.46, Concordance 74%, Sensitivity 75%, Selectivity 74%
[Figure: histogram (counts 0 to 2000) of model score (-80 to 40); legend: No Dofetilide, Dofetilide, Inactives, Actives; confidence bands labelled >60%, >70%, >85% and >95% on either side of a central “Grey zone” of uncertain prediction]
prediction is checked against the activity of at least 3 nearest neighbours to generate an additional confidence measure
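For reference, the four statistics quoted for the training and test sets can all be derived from a 2x2 confusion matrix. The sketch below uses the standard definitions; it assumes "Selectivity" on the slide means what is usually called specificity, and the counts in the usage note are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Concordance, sensitivity, specificity and Cohen's kappa from counts."""
    n = tp + fp + tn + fn
    concordance = (tp + tn) / n          # fraction predicted correctly
    sensitivity = tp / (tp + fn)         # actives found
    specificity = tn / (tn + fp)         # inactives found (assumed "Selectivity")
    # Agreement expected by chance from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (concordance - expected) / (1.0 - expected)
    return {
        "concordance": concordance,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }
```

With hypothetical counts tp=40, fp=10, tn=40, fn=10, concordance, sensitivity and specificity are all 0.8 while kappa is about 0.6, which shows how kappa discounts the agreement expected by chance.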
HLM stability: statistical fingerprint-based model (FCFP-6, Scitegic)
unstable well predicted, stable not
[Figure: HLM stability, Prediction versus Experiment, in classes Stable, Moderately stable and Unstable]
[Figure: Series 1 and Series 2 plotted against CLOGP (-1.5 to 2.5); y-axis 0 to 100, from long t1/2 (low) to short t1/2 (high); annotations: "most compounds stable within this cLogP range" and "Stable but likely to be poorly absorbed orally"]
synthetic effort weighted by desirability
[Figure: two structures, annotated with the changes made: linker replacement, aryl switch, side chain deletion/replacement, N needed]

                First compound    Second compound
V1a Ki          28 nM             780 nM
MW              441               334
cLogP           5.5               1.0
t1/2 HLM        6 mins            120 mins
LE              0.31              0.32
LiPE            1.9               4.8

(HLM = Human Liver Microsomes)
LE (Ligand Efficiency) = -1.4 log (IC50) / n Heavy Atoms
“how efficient each heavy atom is”
Class dependent: 0.3 – 0.5
LiPE (Lipophilic Efficiency) = -log (IC50) - cLogP
“how efficient each lipophilic fragment is”
[Figure: pIC50 (5 to 8, i.e. IC50 from 10 µM down to 10 nM) versus cLogP (2 to 5), with diagonal lines of constant LiPE from LiPE=2 to LiPE=6]
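Both efficiency metrics are one-line calculations once IC50 is expressed in mol/L. A minimal sketch of the two formulas as written on the slides; the function names and the example values are hypothetical.

```python
import math

def pic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 in mol/L."""
    return -math.log10(ic50_molar)

def ligand_efficiency(ic50_molar, n_heavy_atoms):
    """LE = -1.4 log(IC50) / number of heavy atoms (kcal/mol per heavy atom)."""
    return 1.4 * pic50(ic50_molar) / n_heavy_atoms

def lipophilic_efficiency(ic50_molar, clogp):
    """LiPE = -log(IC50) - cLogP."""
    return pic50(ic50_molar) - clogp
```

For a hypothetical compound with IC50 = 10 nM, 28 heavy atoms and cLogP 2, this gives LE = 0.4 and LiPE = 6, which sits on the LiPE=6 line of the plot at pIC50 8.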
[Figure: plot of experimental affinity, ΔG(expt) (-10 to 0 kcal/mol), versus calculated enthalpy (-40 to 0 kcal/mol)]
for reference:
2 kcals = 26-fold off
4.2 kcals = 1000-fold off
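The two reference figures follow from the rule of thumb that one log unit of affinity is worth about 1.4 kcal/mol (roughly 2.303 RT near physiological temperature), the same factor that appears in the LE formula. A minimal sketch under that assumption; the function name is hypothetical.

```python
KCAL_PER_LOG_UNIT = 1.4  # assumed: approximately 2.303 * R * T

def fold_error(delta_g_error_kcal):
    """Fold error in affinity caused by an error in the binding free energy."""
    return 10.0 ** (abs(delta_g_error_kcal) / KCAL_PER_LOG_UNIT)
```

An error of 4.2 kcal/mol is three log units, hence 1000-fold, and 2 kcal/mol is about 1.43 log units, which rounds to the 26-fold quoted on the slide.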
"Improving Accuracy in Protein-Ligand Affinity Calculations"
Paper #104, ACS meeting in Philadelphia (Aug 2004)
Michael K. Gilson, Center for Advanced Research in Biotechnology,
Rockville, MD
ΔG° = Δ<U+W> - TΔS°config
magnitudes shown under the three terms, in the order written: -5 to -10 kcals; 15 to -25 kcal; 15 to -25 kcal
we usually focus on the interactions, Δ<U+W>:
potential energy, force field (CHARMM, AMBER, etc.): van der Waals; Coulombic; Hydrogen-bonding
solvation, generalized Born/Poisson-Boltzmann: surface area term (Hydrophobicity/organophilicity); Desolvation of polar groups; Coulomb screening
we always neglect TΔS°config:
flexibility, entropy terms: sampling/Sum over energy wells; Preorganization/Strain; Entropy losses on binding (rotational, translational, conformational)
we count on cancellation of errors within series, or other corrections, which leads to scattered data.
we are not short of idea generators!
easy to construct vast virtual libraries
we need more accurate activity prediction to allow filtering and selection
build project teams around Sharepoint/OneNote
implement an RSS strategy around Newsgator
create a literature knowledge sharing culture
use Wiki type technology to share knowledge
Pfizerpedia
BSA: David Price, Simon Bailey, Julian Blagg, Nigel Greene
CNS MPO: Patrick Verhoest, Travis Wager, Anabella Villalobos, Spiros Liras
activity prediction: Marcel de Groot, Martin Edwards, Alex Alex, Jeff Howe, Ben Burke
VLS: Giai Paolini, Willem Van Hoorn, Enoch Huang, Jeff Howe
web 2: Jerry Lanfear