In Silico Methods for ADMET and Solubility Prediction

Dr John Mitchell
University of St Andrews
Outline
• Part 1: Computational Toxicology
• Part 2: Aqueous Solubility
1. Toxicological Relationships Between Proteins Obtained From a Molecular Spam Filter
Florian Nigsch (now at Novartis Institutes, Boston) & John Mitchell
Spam
• Unsolicited (commercial) email
• Approx. 90% of all email traffic is spam
• Where are the legitimate messages?
• Filtering
Analogy to Drug Discovery
• Huge number of possible candidates
• Virtual screening to help in selection process
Properties of Drugs
• High affinity to protein target
• Soluble
• Permeable
• Absorbable
• High bioavailability
• Specific rate of metabolism
• Renal/hepatic clearance?
• Volume of distribution?
• Low toxicity
• Plasma protein binding?
• Blood-Brain-Barrier penetration?
• Dosage (once/twice daily?)
• Synthetic accessibility
• Formulation (important in development)
Multiobjective Optimisation
[Figure: a drug must balance bioactivity, toxicity, synthetic accessibility, solubility, metabolism and permeability]
Huge number of candidates … most of which are useless!
Feature Space - Chemical Space
m = (f1, f2, …, fn)
[Figure: molecules plotted in feature space (axes f1, f2, f3), with regions labelled CDK1, CDK2, COX2 and DHFR]
Feature spaces of high dimensionality
Features of Molecules
Based on circular fingerprints
Combinations of Features
Combinations of
molecular features to
account for synergies.
Winnow Algorithm
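A minimal sketch of the classical two-class Winnow update (Littlestone's multiplicative rule), assuming each molecule is given as a set of binary fingerprint-feature ids. The slides only name the algorithm, so the promotion factor `alpha`, the threshold choice and the toy examples are illustrative assumptions; the multi-class variant used for target prediction is not reproduced here.

```python
from collections import defaultdict


def train_winnow(examples, n_features, alpha=2.0, epochs=10):
    """Train two-class Winnow.

    examples: list of (active_feature_ids, label), label is 0 or 1.
    n_features: size of the feature space (used for the threshold).
    """
    weights = defaultdict(lambda: 1.0)  # every weight starts at 1
    threshold = float(n_features)       # assumed textbook choice
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            score = sum(weights[f] for f in active)
            predicted = 1 if score >= threshold else 0
            if predicted == label:
                continue  # weights change only on mistakes
            mistakes += 1
            # Promote active features on a missed positive,
            # demote them on a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            for f in active:
                weights[f] *= factor
        if mistakes == 0:
            break
    return weights, threshold


def predict_winnow(weights, threshold, active):
    """Return 1 if the summed weights of active features reach the threshold."""
    return 1 if sum(weights[f] for f in active) >= threshold else 0


# Toy data (hypothetical): feature 0 marks the "binders".
toy_examples = [
    ({0, 1, 2}, 1), ({0, 3}, 1), ({1, 2, 3}, 0),
    ({4, 5}, 0), ({0, 5, 6}, 1), ({2, 6, 7}, 0),
]
weights, threshold = train_winnow(toy_examples, n_features=8)
print(predict_winnow(weights, threshold, {0, 7}))
```

Because updates are multiplicative and touch only the active features, the rule suits sparse, high-dimensional fingerprints, which is presumably part of its appeal both for spam filtering and for this setting.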
Protein Target Prediction
• Which protein does a given molecule bind to?
• Virtual Screening
• Multiple endpoint drugs - polypharmacology
• New targets for existing drugs
• Prediction of adverse drug reactions (ADR)
– Computational toxicology
Predicted Protein Targets
• Selection of 233 classes from the MDL Drug Data Report
• ~90,000 molecules
• 15 independent 50%/50% splits into training/test sets
Predicted Protein Targets
Cumulative probability of correct prediction within the
three top-ranking predictions: 82.1% (±0.5%)
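The "within the three top-ranking predictions" figure is a top-k accuracy. A short sketch of how such a number could be computed, assuming each molecule comes with a ranked list of predicted target classes; the function name and the toy rankings are hypothetical, not data from the study.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of molecules whose true class is among the k top-ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy rankings (hypothetical), best prediction first.
toy_ranked = [
    ["CDK2", "CDK1", "DHFR", "COX2"],
    ["COX2", "DHFR", "CDK1", "CDK2"],
    ["DHFR", "CDK2", "COX2", "CDK1"],
    ["CDK1", "COX2", "CDK2", "DHFR"],
]
toy_truth = ["CDK1", "COX2", "CDK1", "DHFR"]

print(top_k_accuracy(toy_ranked, toy_truth, k=3))
```

Repeating this over each of the 15 training/test splits and taking the mean and spread would give a figure of the form quoted on the slide.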
Computational Toxicology
• Model for target prediction
• Annotated library of toxic molecules
– MDL Toxicity database
– ~150,000 molecules
• For each molecule we predict the likely target
• Correlations between predicted protein targets and known toxicity codes
– Canonical (23)
– Full (490)
Toxicological Relationships Outline (1)
• Protein target prediction allows us to link
(predictively) 150,000 toxic organic molecules to
233 specific protein targets
• Each target is treated as a single protein, although it may be a set of related proteins
• Toxicological databases link (experimentally)
these 150,000 molecules to 23 toxicity classes
• Combining these two sources of data matches
the 233 proteins with the 23 toxicity classes
Toxicity Annotations
FULL TOXICITY CODES (490)
Y41 : Glycolytic < Metabolism (intermediary) < Biochemical
CANONICAL TOXICITY CODES (23)
Toxicological Relationships Outline (2)
• For each protein target, we have a profile of
association with the 23 toxicity classes
• Proteins with similar profiles are clustered
together
• We demonstrate that these clusters of
proteins can be physiologically meaningful.
Predictions Obtained
• Target prediction: for each toxic molecule, the highest-ranking prediction IS the predicted protein target (protein code j)
• Each molecule carries known toxicity codes i, for example:
  L70 - Changes in liver weight < Liver
  Y07 - Hepatic microsomal oxidase < Enzyme inhibition
  M30 - Other changes < Kidney, Ureter, and Bladder
  L30 - Other changes < Liver
• Result matrix R = (rij), with toxicity codes i as rows and protein targets j as columns
• rij is incremented for each prediction that pairs toxicity code i with protein target j
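The result matrix R = (rij) can be accumulated with a simple counter. This is a sketch under stated assumptions: each toxic molecule is reduced to its top-ranked predicted protein plus its list of canonical toxicity codes, and a code is counted once per molecule. The function name and the toy records are hypothetical.

```python
from collections import Counter


def build_result_matrix(molecules):
    """Count co-occurrences of toxicity code i and predicted protein j.

    molecules: iterable of (predicted_protein, toxicity_codes).
    Returns a Counter keyed by (toxicity_code, protein), i.e. a sparse R.
    """
    r = Counter()
    for protein, tox_codes in molecules:
        for code in set(tox_codes):  # assumption: count each code once per molecule
            r[(code, protein)] += 1
    return r


# Toy records (hypothetical): G = cardiac, H = vascular.
toy_molecules = [
    ("Adrenergic alpha2", ["G", "H"]),
    ("Dopamine (D2)", ["H"]),
    ("Adrenergic alpha2", ["G"]),
]
R = build_result_matrix(toy_molecules)
print(R[("G", "Adrenergic alpha2")])
```

Reading R along a column gives the toxicity profile of one protein; reading it along a row ranks the proteins for one toxicity class.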
Proteins by Toxicity
• Cardiac - G
1. Kainic acid receptor
2. Adrenergic alpha2
3. Phosphodiesterase III
4. cAMP Phosphodiesterase
5. O6-Alkylguanine-DNA alkyltransferase
• Vascular - H
1. Angiotensin II AT2
2. Dopamine (D2)
3. Bombesin
4. Adrenergic alpha2
5. 5-HT antagonist
Top 5 Proteins by Toxicity
68 distinct proteins for 23 toxicity classes, i.e., about 3 distinct proteins per canonical toxicity code.
Lanosterol 14alpha-Methyl Demethylase 5
Glucose-6-phosphate Translocase 4
IL-6 4
Benzodiazepine Antagonist 3
Kainic Acid Receptor 3
Proteins and their connectivities
Correlation Between Proteins
Correlations between proteins: 233 by 233 correlation matrix
Cluster 1 (proteins 6-11)
• Within-cluster
correlation (without
auto-correlation)
r = 0.95
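A sketch of how the protein-by-protein correlation matrix and a within-cluster correlation could be computed from per-protein toxicity profiles. It assumes Pearson correlation and a plain mean over off-diagonal pairs; the slides do not state which coefficient or averaging was used, and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_matrix(profiles):
    """profiles: dict protein -> list of counts per toxicity class."""
    names = list(profiles)
    return {(p, q): pearson(profiles[p], profiles[q]) for p in names for q in names}


def mean_within_cluster(corr, cluster):
    """Mean correlation over distinct pairs, leaving out the diagonal (auto-correlation)."""
    pairs = [(p, q) for p in cluster for q in cluster if p != q]
    return sum(corr[pq] for pq in pairs) / len(pairs)


# Toy profiles over four toxicity classes (hypothetical counts).
toy_profiles = {
    "A": [10, 2, 0, 1],
    "B": [20, 4, 0, 2],
    "C": [0, 1, 9, 3],
}
corr = correlation_matrix(toy_profiles)
print(mean_within_cluster(corr, ["A", "B"]))
```

With 233 proteins and 23 canonical toxicity classes, the same calculation yields the 233 by 233 matrix; clusters are then groups of proteins with high mutual correlation.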
Proteins involved in breast cancer
• Carbonic Anhydrase
Inhibitor
• Estrogen Receptor
Modulator
• LHRH Agonist
• Aromatase Inhibitor
• Cysteine Protease
Inhibitor
• DHFR Inhibitor
Literature-based links between these proteins
[Diagram: nodes ER, CA, Aromatase, LHRH, DHFR and Cysteine Prot., linked by the literature below]
• Zaichuk 2007: Tissue-specific transcripts of human steroid sulfatase are under control of estrogen signaling pathways in breast carcinoma
• Caldarelli 2005: “aim of this study was to characterize carbonic anhydrase II (CA2), as novel estrogen responsive gene”
• Marieke von Lindern 1998: The Transactivation Domain AF-2 but not the DNA-Binding Domain of the Estrogen Receptor Is Required to Inhibit Differentiation of Avian Erythroid Progenitors. This led to premature expression of CAII, a possible explanation for the toxic effects of overexpressed ER.
• Rabaglio 2007: Controversies of adjuvant endocrine treatment for breast cancer and recommendations of the 2007 St Gallen conference
• Merchenthaler 2005
• Goss 2007: Summary of aromatase inhibitor trials: The past and future
• Sriraman 2004: Cathepsin L Gene Expression and Promoter Activation in Rodent Granulosa Cells; showed that cathepsin L expression in granulosa cells of small, growing follicles increased in periovulatory follicles after human chorionic gonadotropin stimulation.
• Furuyama 2000: Regulation of collagenolytic cysteine protease synthesis by estrogen in osteoclasts
• Thibodeau 1998: Induction by estrogens of methotrexate resistance in MCF-7 breast cancer cells
Antimalarials?
Breast Cancer Proteins
Cluster 4
This cluster links treatment of stomach ulcers to loss of bone
mass!
Proton Pump Inhibitors etc.
[Figure: cluster 4 proteins linked by pairwise correlations above 0.98 and above 0.99]
PTH = Parathyroid hormone (84 aa mini-protein)
• Proton pump inhibitors are used to limit production of gastric acid
• PTH is important in the development/regulation of osteoclasts (cells for bone resorption)
• PTH controls levels of Ca2+ in the blood; increased PTH levels are associated with age-related decrease of bone mass
Recent clinical studies showed increased risk of hip fractures
resulting from long-term use of proton pump inhibitors. Hence
the link between PTH and proton pump inhibitors.
Conclusions from Part 1
• Successful adaptation of an algorithm not previously used in
chemoinformatics
• Can find correct protein targets for molecules
• Hence link proteins together via ligand-binding properties and
associations of ligands with toxicities
• Identify clinically relevant toxicological relationships between
proteins
2. In silico calculation of aqueous solubility
Our Methods …
(a) Random Forest (informatics)
Our Random Forest Model …
We want to construct a model that will predict
solubility for druglike molecules …
We don’t expect our model either to use real
physics and chemistry or to be easily interpretable
…
We do expect it to be fast and reasonably accurate
…
Random Forest
Machine Learning Method
Random Forest for Predicting Solubility
A Forest of Regression Trees
• Dataset is partitioned into consecutively smaller subsets (of similar solubility)
• Each partition is based upon the value of one descriptor
• The descriptor used at each split is selected so as to minimise the MSE
• High predictive accuracy
• Includes descriptor selection
• No training problems – largely immune from overfitting
• “Out-of-bag” validation – using those molecules not in the bootstrap samples.
Leo Breiman, “Random Forests”, Machine Learning 45, 5-32 (2001).
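Two ingredients from the list above can be sketched in a few lines: choosing the single descriptor and threshold that minimise the MSE of a split, and drawing a bootstrap sample with its out-of-bag molecules. This is an illustration only, not the study's implementation; a full Random Forest also restricts each split to a random subset of descriptors and averages many trees. Function names and toy values are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (descriptor_index, threshold, mse) of the best single binary split."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        # Candidate thresholds lie midway between consecutive descriptor values.
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            mse = (sse(left) + sse(right)) / n
            if best is None or mse < best[2]:
                best = (j, thr, mse)
    return best


def bootstrap_and_oob(n, rng):
    """Sample n indices with replacement; the unsampled indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


# Toy data (hypothetical): two descriptors, logS as the response.
toy_X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
toy_y = [-1.0, -1.2, -4.0, -4.1]
split = best_split(toy_X, toy_y)
in_bag, oob = bootstrap_and_oob(10, random.Random(0))
print(split[0], split[1])
```

Each tree is grown on its own bootstrap sample, so the out-of-bag molecules act as a built-in validation set for that tree.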
Dataset
Literature Data
• Compiled from Huuskonen dataset and AquaSol database – pharmaceutically
relevant molecules
• All molecules solid at room temperature
• n = 988 molecules
• Training = 658 molecules
• Test = 330 molecules
• MOE descriptors 2D/3D
Intrinsic aqueous solubility – the thermodynamic solubility of the neutral form in unbuffered water at 25 °C
● Datasets compiled from diverse literature data may have significant random and systematic errors.
Random Forest: Solubility Results
Set                 RMSE    r2      Bias
Training (tr)       0.27    0.98    0.005
Out-of-bag (oob)    0.68    0.90    0.01
Test (te)           0.69    0.89    -0.04
DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
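The three statistics reported for each set can be computed as follows. The slides do not define them, so this sketch assumes r2 is the squared Pearson correlation between observed and predicted logS, and Bias is the mean signed error (predicted minus observed). The toy values are hypothetical.

```python
from math import sqrt


def regression_stats(observed, predicted):
    """Return (rmse, r2, bias) for paired observed/predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n  # assumed sign convention: predicted minus observed
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    vo = sum((o - mo) ** 2 for o in observed)
    vp = sum((p - mp) ** 2 for p in predicted)
    r2 = cov * cov / (vo * vp)  # squared Pearson correlation
    return rmse, r2, bias


# Toy logS values (hypothetical).
toy_observed = [-1.0, -2.0, -3.0, -4.0]
toy_predicted = [-1.5, -2.0, -2.5, -4.0]
rmse, r2, bias = regression_stats(toy_observed, toy_predicted)
print(rmse, r2, bias)
```

A training RMSE far below the out-of-bag and test RMSE, as in the table, is expected for Random Forest; the out-of-bag and test figures are the ones that estimate predictive error.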
Part 2a, Solubility by Random Forest: Conclusions
● Random Forest gives an RMS error of 0.69 logS units.
● These results are competitive with any other informatics or QSPR
solubility prediction method.
● The nature of the model is predictive, without offering much
insight.
Our Methods …
(b) Thermodynamic Cycle (A hybrid of
theoretical chemistry & informatics)
Our Thermodynamic Cycle method …
We want to construct a theoretical model that will
predict solubility for druglike molecules …
We expect our model to use real physics and chemistry
and to give some insight …
We may need to include some empirical parameters…
We don’t expect it to be fast by informatics or QSPR
standards, but it should be reasonably accurate …
For this study Toni Llinàs measured 30 solubilities using
the CheqSol method and took another 30 from other
high quality studies (Bergstrom & Rytting).
We use a Sirius glpKa instrument
Can we use theoretical chemistry to calculate
solubility via a thermodynamic cycle?
Gsub comes mostly from lattice energy minimisation
based on the experimental crystal structure.
Gsolv comes from a continuum solvation
model (SCRF at the B3LYP/6-31G* level in Jaguar)
This is likely to be the least accurate term in our equation.
We also tried SM5.4 with AM1 & PM3 in Spartan, with similar results.
Gtr comes from ClogP
ClogP is a fragment-based (informatics) method of
estimating the octanol-water partition coefficient.
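For orientation, the generic relations behind such a cycle can be written out. This is a sketch of standard thermodynamics, not the exact equation from the slides, which is not reproduced in the text. The symbols follow the slide names (Gsub, Gsolv, Gsol); the standard state S° and the way Gtr enters the final model are assumptions left open here.

```latex
% Solid -> gas -> solution: sublimation followed by solvation
\Delta G_{\mathrm{sol}} = \Delta G_{\mathrm{sub}} + \Delta G_{\mathrm{solv}}

% Sublimation term split into enthalpic and entropic parts
\Delta G_{\mathrm{sub}} = \Delta H_{\mathrm{sub}} - T\,\Delta S_{\mathrm{sub}}

% Solution free energy related to solubility S (relative to a standard state S^{\circ})
\log_{10}\!\left(\frac{S}{S^{\circ}}\right) = -\,\frac{\Delta G_{\mathrm{sol}}}{RT \ln 10}
```

The last relation is why an error in Gsol translates directly into an error in logS.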
What Error is Acceptable?
• For typically diverse sets of druglike
molecules, a “good” QSPR will have an RMSE ≈
0.7 logS units.
• An RMSE > 1.0 logS unit is probably
unacceptable.
• An RMSE of 0.7 to 1.0 logS units corresponds to an error range of 4.0 to
5.7 kJ/mol in Gsol (at 25 °C).
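The conversion between logS units and kJ/mol follows from multiplying by RT ln 10. A short check, assuming T = 298.15 K (25 °C); the function name is hypothetical.

```python
from math import log

R_GAS = 8.314462618  # molar gas constant, J/(mol K)


def logs_error_to_kj_per_mol(delta_log_s, temperature_k=298.15):
    """Free-energy equivalent (kJ/mol) of an error of delta_log_s log10 units."""
    return R_GAS * temperature_k * log(10) * delta_log_s / 1000.0


for err in (0.7, 1.0):
    print(err, round(logs_error_to_kj_per_mol(err), 1))
```

One log unit is worth about 5.7 kJ/mol at room temperature, so a theoretical method needs free energies accurate to a few kJ/mol to compete with a good QSPR.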
What Error is Acceptable?
• A useless model would have an RMSE close to
the SD of the test set logS values: ~ 1.4 logS
units;
• The best possible model would have an RMSE
close to the SD resulting from the
experimental error in the underlying data:
~ 0.5 logS units?
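The "useless model" baseline can be made concrete: a model that always predicts the mean logS has an RMSE equal to the (population) standard deviation of the values. A minimal check on hypothetical numbers:

```python
from math import sqrt
from statistics import mean, pstdev


def rmse(observed, predicted):
    """Root-mean-square error between paired values."""
    n = len(observed)
    return sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


# Toy logS values (hypothetical).
toy_log_s = [-1.0, -2.5, -3.0, -4.5, -5.0]
constant_guess = [mean(toy_log_s)] * len(toy_log_s)

baseline_rmse = rmse(toy_log_s, constant_guess)
print(baseline_rmse, pstdev(toy_log_s))
```

Any model worth using must therefore beat the SD of the test set, and none can be expected to beat the noise in the experimental data.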
Results from Theoretical Calculations
● Direct calculation was a nice idea, but didn’t quite
work – errors larger than QSPR
● “Why not add a correction factor to account for
the difference between the theoretical methods?”
● This was originally intended to calibrate the
different theoretical approaches, but
…
…
● Within a week this had become a hybrid method,
essentially a QSPR with the theoretical energies as
descriptors
Results from Hybrid Model
This regression equation gives r2=0.77 and RMSE=0.71
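A hybrid model of this kind is, in effect, a multiple linear regression on the calculated energies. The sketch below fits ordinary least squares through the normal equations. It is not the regression equation from the slide, which is not reproduced in the text: the two descriptor columns and all numbers are synthetic placeholders.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    # Augmented system (A^T A | A^T y).
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for k in range(p):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (placeholder descriptors).
toy_X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
toy_y = [1, 3, -2, 0, 2]
coefficients = fit_ols(toy_X, toy_y)
print(coefficients)
```

In the real setting the columns would be the theoretical terms, and the fitted coefficients would absorb systematic errors in each of them.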
How Well Did We Do?
• For a training-test split of 34:26, we obtain
an RMSE of 0.71 logS units for the test set.
• This is comparable with the performance
of “pure” QSPR models.
• This corresponds to an error of about 4.0
kJ/mol in Gsol.
Drug Disc. Today, 10 (4), 289 (2005)
[Figure labels: Gsolv & ClogP; Ssub & b_rotR; Ulatt]
Part 2b, Solubility by TD Cycle: Conclusions
● We have a hybrid part-theoretical, part-empirical method.
● An interesting idea, but relatively low throughput - and an
experimental (or possibly predicted?) crystal structure is needed.
● Similarly accurate to pure QSPR for a druglike set.
● Instructive to compare with literature of theoretical solubility
studies.
Thanks
• Unilever
• Dr Florian Nigsch
• Pfizer & PIPMS
• Dr Dave Palmer
• Pfizer (Dr Iñaki Morao, Dr Nick Terrett & Dr Hua Gao)