Quantum Chemical and Machine Learning Calculations of the Intrinsic Aqueous Solubility of Druglike Molecules Dr John Mitchell University of St Andrews How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics. The Two Faces of Computational Chemistry Informatics Theoretical Chemistry Theoretical Chemistry “The problem is difficult, but by making suitable approximations we can solve it at reasonable cost based on our understanding of physics and chemistry” Theoretical Chemistry • Calculations and simulations based on real physics. • Calculations are either quantum mechanical or use parameters derived from quantum mechanics. • Attempt to model or simulate reality. • Usually Low Throughput. Drug Disc.Today, 10 (4), 289 (2005) Existing Theoretical Approaches • Thus far, although theoretical methods have shown promise, they have not matched the accuracy of QSPR. • There is no theoretical method that deals directly with solubility, so the problem has to be broken down into parts. • There are several different ways of doing this. Our First Principles Method • We present one such approach and believe this to be the world’s most costeffective first principles solubility method. Thermodynamic Cycle Different kinds of theoretical method are used for each part Gsub from lattice energy minimisation Ghydr from Reference Interaction Site Model (RISM) Different kinds of theoretical method are used for each part Gsub from lattice energy & a phonon entropy term; DMACRYS using B3LYP/6-31G(d,p) multipoles and FIT repulsion-dispersion potential. Ghydr from Reference Interaction Site Model with Universal Correction (RISM/UC). OUR DATASET (25 molecules) We have experimental logS for all 25 molecules, but can only subdivide into ΔGsub and ΔGhydr for 10 of them. Thermodynamic Cycle Gas Crystal Solution Sublimation Free Energy Gas Crystal Sublimation Free Energy Gas Crystal Sublimation Free Energy Gas Crystal Calculating ΔGsub is a standard procedure in crystal structure prediction Crystal Structure Prediction • Given the structural diagram of an organic molecule, predict the 3D crystal structure. Br N S O O Slide after SL Price, Int. Sch. Crystallography, Erice, 2004 CSP Methodology • Based around minimising lattice energy of trial structures. • Enthalpy comes from lattice energy and intramolecular energy (DFT), which need to be well calibrated against each other: trade-off of lattice vs conformational energy. • Entropy comes from phonon modes (crystal vibrations); can get Free Energy. These methods can get relative lattice energies of different structures correct, probably to within a few kJ/mol. Absolute energies are harder. Volume/molecule (Å3) 149 150 151 152 153 154 155 -69 Lattice Energy (kJ/mol) -70 -71 -72 -73 -74 P1 P21 Cc C2/c P2/c P21212 P212121 Pna21 Pbca Pma21 BETA P_1 P21/c C2 Pm P21/m Pc Pca21 Pbcn Pmn21 ALPHA GAMMA Additional possible benefit for solubility: if we don’t know the crystal structure, we could reasonably use best structure from crystal structure prediction. Volume/molecule (Å3) 149 150 151 152 153 154 155 -69 Lattice Energy (kJ/mol) -70 -71 -72 -73 -74 P1 P21 Cc C2/c P2/c P21212 P212121 Pna21 Pbca Pma21 BETA P_1 P21/c C2 Pm P21/m Pc Pca21 Pbcn Pmn21 ALPHA GAMMA Results for ΔGsub Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles. Results for ΔGsub Reasonable prediction of ΔGsub, but small number of molecules. To see the trends in errors, we need to look at more molecules. ΔH sub experimental v ΔH sub predicted 200.00 y = 1.0743x + 3.6056 R² = 0.5313 Predicted ΔH sub kJ mol-1 180.00 160.00 140.00 dH Vs exp dH Linear (dH Vs exp dH) 120.00 RMSE = 20.4 kJ/mol (46 molecules) 100.00 80.00 80.00 90.00 100.00 110.00 120.00 130.00 Experimental ΔH sub kJ mol-1 140.00 150.00 160.00 The 46 compound set shown here has a larger error, mostly due to some large outliers. Error statistics vary with dataset. ΔH sub experimental v ΔH sub predicted 200.00 y = 1.0743x + 3.6056 R² = 0.5313 Predicted ΔH sub kJ mol-1 180.00 160.00 140.00 dH Vs exp dH Linear (dH Vs exp dH) 120.00 RMSE = 20.4 kJ/mol (46 molecules) 100.00 80.00 80.00 90.00 100.00 110.00 120.00 130.00 Experimental ΔH sub kJ mol-1 140.00 150.00 160.00 TΔS sub experimental v TΔS sub predicted 68.00 66.00 TΔS predicted kJ mol-1 64.00 y = 0.0857x + 56.078 R² = 0.0941 62.00 TdS Vs exp TdS Linear (TdS Vs exp TdS) 60.00 58.00 RMSE = 9.3 kJ/mol (46 molecules) 56.00 54.00 0 10 20 30 40 50 TΔS experimental kJ mol-1 60 70 80 90 ΔG sub experimental v ΔG sub predicted 120.00 y = 0.7422x + 28.11 R² = 0.3129 100.00 ΔG sub Predicted 80.00 60.00 G Experimental Vs Predicted Linear (G Experimental Vs Predicted) 40.00 RMSE = 22.4 kJ/mol (46 molecules) 20.00 0.00 0.00 20.00 40.00 60.00 ΔG sub Experimental 80.00 100.00 120.00 • The predicted ΔHsub is much better correlated with experiment than is TΔSsub. • However, ΔHsub has a much larger range of values and contributes more to the RMS error. Thermodynamic Cycle Gas Crystal Solution Hydration Free Energy We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important etc. Reference Interaction Site Model (RISM) • Combines features of explicit and implicit solvent models. • Solvent density is modelled, but no explicit molecular coordinates or dynamics. ~45 CPU mins per compound RISM Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the reference interaction site model. The Journal of Chemical Physics, 2010. 133(4): p. 044104-11. Results for ΔGhyd Perhaps surprisingly, error in Ghyd is smaller than in Gsub. Other Hydration Energy Approaches An alternative methodology here is just to try the various different continuum solvent models available in Gaussian. logS from Thermodynamic Cycle Gas Crystal Solution Add the two terms to get ΔGsol and hence logS. Results for ΔGsol Conclusions: Theory Must calculate Gsub & Ghyd separately; Expt data sparse and errors may be large; RISM is efficient & fairly accurate for Ghyd; Dataset size and composition make comparisons of methods hard; • Not yet matched accuracy of informatics. • • • • Informatics Approaches “The problem is too difficult to solve using physics and chemistry, so we will design a black box to link structure and solubility” Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model. • Do not attempt to simulate reality. • Usually High Throughput. What Error is Acceptable? • For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. • A RMSE > 1.0 logS unit is probably unacceptable. • This corresponds to an error range of 4.0 to 5.7 kJ/mol in Gsol. What Error is Acceptable? • A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; • The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units? Random Forest Machine Learning Method Random Forest: Solubility Results RMSE(oob)=0.68 r2(oob)=0.90 Bias(oob)=0.01 RMSE(te)=0.69 r2(te)=0.89 Bias(te)=-0.04 DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007) Ntrain = 658; Ntest = 300 Support Vector Machine SVM: Solubility Results et al., RMSE(te)=0.94 r2(te)=0.79 Ntrain = 150 + 50; Ntest = 87 100 Compound Cross-Validation Theoretical energies don’t seem to improve descriptor models. Replicating Solubility Challenge (post hoc) CDK descriptors: RF, RF, PLS, SVM RMSE(te)=1.09; 1.00; 0.89; 1.08 r2(te)=0.39; 0.49; 0.58; 0.41 10; 12; 12; 13/28 correct within 0.5 logS units Ntrain 94; Ntest 28 Replicating Solubility Challenge (post hoc) CDK descriptors: RF, RF, PLS, SVM Although the test dataset is small, it is a standard set. Ntrain 94; Ntest 28 Conclusions: Informatics • Expt data: errors unknown, but limit possible accuracy of models; • CheqSol - step in right direction; • Dataset size and composition hinder comparisons of methods; • Solubility Challenge – step in right direction. Thanks • SULSA • James McDonagh, Dr Tanja van Mourik, Neetika Nath (St Andrews) • Prof. Maxim Fedorov, Dr Dave Palmer (Strathclyde) • Laura Hughes, Dr Toni Llinas • James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk, William Walton (U/G project)