Machine Learning in Drug Design
David Page
Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences

Collaborators
Michael Waddell, Paul Finn, Ashwin Srinivasan, John Shaughnessy, Bart Barlogie, Frank Zhan, Stephen Muggleton, Arno Spatola, Sean McIlwain, Brian Kay

Outline
- Overview of Drug Design
- How Machine Learning Fits Into the Process
- Target Search: Single Nucleotide Polymorphisms (SNPs)
- Machine Learning from Feature Vectors
  - Decision Trees
  - Support Vector Machines
  - Voting/Ensembles
- Predicting Molecular Activity: Learning from Structure

Drugs Typically Are…
- Small organic molecules that…
- Modulate disease by binding to some target protein…
- At a location that alters the protein’s behavior (e.g., antagonist or agonist).
- Target protein might be human (e.g., ACE for blood pressure) or belong to an invading organism (e.g., surface protein of a bacterium).

Example of Binding
[Figure]

So To Design a Drug:
- Identify Target Protein
  - Knowledge of proteome/genome
  - Relevant biochemical pathways
- Determine Target Site Structure
  - Crystallography, NMR
  - Difficult if Membrane-Bound
- Synthesize a Molecule that Will Bind
  - Imperfect modeling of structure
  - Structures may change at binding
  - And even then…

Molecule Binds Target But May:
- Bind too tightly or not tightly enough.
- Be toxic.
- Have other effects (side-effects) in the body.
- Break down as soon as it gets into the body, or not leave the body soon enough.
- Not get to where it should in the body (e.g., across the blood-brain barrier).
- Not diffuse from gut to bloodstream.

And Every Body is Different:
- Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).
- A molecule may work for some people but not others.
- A molecule may cause harmful side-effects in some people but not others.
Places to use Machine Learning
- Finding target proteins.
- Inferring target site structure.
- Predicting who will respond positively/negatively.

Healthy vs. Disease
[Figure: two panels labeled Healthy and Diseased]

If We Could Sequence DNA Quickly and Cheaply, We Could:
- Sequence DNA of people taking a drug, and use ML to identify consistent differences between those who respond well and those who do not.
- Sequence DNA of cancer cells and healthy cells, and use ML to detect dangerous mutations… proteins these genes code for may be useful targets.
- Sequence DNA of people who get a disease and those who don’t, and use ML to determine genes related to susceptibility… proteins these genes code for may be useful targets.

Problem: Can’t Sequence Quickly
- Can quickly test single positions where variation is common: Single Nucleotide Polymorphisms (SNPs).
- Can quickly test degree to which every gene is being transcribed: Gene Expression Microarrays (e.g., Affymetrix Gene Chips™).
- Can (moderately) quickly test which proteins are present in a sample (Proteomics).

Example of SNP Data

  Person     SNP 1   SNP 2   SNP 3   ...   CLASS
  Person 1   C T     A G     T T     ...   old
  Person 2   C C     A G     C T     ...   young
  Person 3   T T     A A     C C     ...   old
  Person 4   C T     G G     T T     ...   young
  ...        ...     ...     ...     ...   ...
Problem: SNPs are not Genes
- If we find a predictive SNP, it may not be part of a gene… we can only infer that the SNP is “near” a gene that may be involved in the disease.
- Even if the SNP is part of a gene, it may be another nearby gene that is the key gene.

Problem: Even SNPs are Costly
- Typically cannot use all known SNPs.
- Can focus on a particular chromosome and area if knowledge permits that.
- Can use a scattering of SNPs, since SNPs that are very close together may be redundant… use one SNP per haplotype block, or region where recombination is rare.

Why Machine Learning?
- There may be no single SNP in our data that distinguishes disease vs. healthy.
- Still may be possible to have some combination of SNPs to predict.
- Can gain insight from this combination.

Decision Trees in One Picture
[Figure: a tree node testing "SNP1 has A", with Yes and No branches and leaves labeled Young and Old]

Naïve Bayes in One Picture
[Figure: nodes labeled Age, SNP 1, SNP 2, ..., SNP 3000]

Voting Approach
- Score SNPs using information gain.
- Choose top 1% scoring SNPs.
- To classify a new case, let these SNPs vote (majority or weighted majority vote). We use majority vote here.

Task: Predict Early Onset Disease From SNP Data
- Only 3000 SNPs, coarsely sampled over entire genome.
- 80 patients (examples), 40 with early onset.
- Using technology from Orchid.
- Can a predictor be learned that performs significantly better than chance on unseen data?

Results
- Use all data, only top 1% of features, or only top 10% of features (according to decision tree’s purity measure).
- Use Trees, SVMs, Voting.
- SVMs with top 10% achieve 71% accuracy. Significantly better than chance (50%).

Lessons
- Feature selection is important for performance.
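The scoring-and-voting procedure from the Voting Approach slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function names, the genotype-string encoding, and the toy data are all hypothetical, and the per-SNP vote (the majority class seen for that genotype among the training examples) is one assumed reading of "let these SNPs vote".

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Drop in label entropy after splitting the examples on one SNP's genotype."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def train_voter(X, y, fraction=0.01):
    """Keep the top-scoring fraction of SNPs (at least one).

    For each kept SNP, store the majority class observed for each genotype
    (assumed voting rule). Returns (vote tables, default class).
    """
    n_snps = len(X[0])
    scores = [information_gain([row[j] for row in X], y) for j in range(n_snps)]
    k = max(1, round(fraction * n_snps))
    top = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)[:k]
    default = Counter(y).most_common(1)[0][0]
    tables = {}
    for j in top:
        by_genotype = {}
        for row, label in zip(X, y):
            by_genotype.setdefault(row[j], []).append(label)
        tables[j] = {g: Counter(ls).most_common(1)[0][0]
                     for g, ls in by_genotype.items()}
    return tables, default

def predict(model, row):
    """Unweighted majority vote of the selected SNPs for one new case."""
    tables, default = model
    votes = Counter(tables[j].get(row[j], default) for j in tables)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 6 people, 3 SNPs; only the second SNP tracks the class.
toy_X = [("CT", "AA", "TT"), ("CC", "AA", "CT"), ("TT", "AA", "CC"),
         ("CT", "GG", "TT"), ("CC", "GG", "CT"), ("TT", "GG", "CC")]
toy_y = ["old", "old", "old", "young", "young", "young"]

toy_model = train_voter(toy_X, toy_y)
print(predict(toy_model, ("CT", "GG", "CC")))
```

Under cross-validation, `train_voter` would be called on the training portion of each fold only, so that the information-gain scores never see the held-out cases.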
- Methodology note for machine learning specialists: the entire process, feature selection included, must be repeated on each fold of cross-validation or results will be overly optimistic.
- SNP approach is promising… get funding to measure more SNPs.
- More work on SVM comprehensibility.

Places to use Machine Learning
- Finding target proteins.
- Inferring target site structure.
- Predicting who will respond positively/negatively.

Typical Practice when Target Structure is Unknown
- Test many molecules (1,000,000) to find some that bind to target (ligands).
- Infer (induce) shape of target site from 3D structural similarities.
- Shared 3D substructure is called a pharmacophore.
- Perfect example of a machine learning task with spatial target.

An Example of Structure Learning
[Figure: example molecules labeled Inactive and Active]

Inductive Logic Programming
- Represents data points in mathematical logic
- Uses Background Knowledge
- Returns results in logic

The Logical Representation of a Pharmacophore

  Active(X) if:
      has-conformation(X,Conf),
      has-hydrophobic(X,A),
      has-hydrophobic(X,B),
      distance(X,Conf,A,B,3.0,1.0),
      has-acceptor(X,C),
      distance(X,Conf,A,C,4.0,1.0),
      distance(X,Conf,B,C,5.0,1.0).

This logical clause states that a molecule X is active if it has some conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C such that the following holds: in conformation Conf of molecule X, the distance between A and B is 3 Angstroms, the distance between A and C is 4 Angstroms, and the distance between B and C is 5 Angstroms (each plus or minus 1 Angstrom).

Background Knowledge I
Information about atoms and bonds in the molecules

  atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
  atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
  atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
  bond(m1,a1,a2,1).
  bond(m1,a2,a3,1).
Background knowledge II
Definition of distance equivalence

  % True when the distance between Atom1 and Atom2 is Dist, within Error.
  dist(Drug,Atom1,Atom2,Dist,Error) :-
      number(Error),
      coord(Drug,Atom1,X1,Y1,Z1),
      coord(Drug,Atom2,X2,Y2,Z2),
      euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
      Diff is Dist1-Dist,
      absolute_value(Diff,E1),
      E1 =< Error.

  % Euclidean distance between two 3D points.
  euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
      Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
      D is sqrt(Dsq).

Central Idea: Generalize by searching a lattice

Lattice of Hypotheses

  active(X)
  active(X) if has-hydrophobic(X,A)
  active(X) if has-donor(X,A)
  active(X) if has-acceptor(X,A)
  active(X) if has-hydrophobic(X,A), has-donor(X,B), distance(X,A,B,5.0)
  active(X) if has-donor(X,A), has-donor(X,B), distance(X,A,B,4.0)
  active(X) if has-acceptor(X,A), has-donor(X,B), distance(X,A,B,6.0)
  etc.

Conformational model
- Conformational flexibility modelled as multiple conformations:
  - Sybyl randomsearch
  - Catalyst

Pharmacophore description
- Atom and site centred
- Hydrogen bond donor
- Hydrogen bond acceptor
- Hydrophobe
- Site points (limited at present)
- User definable
- Distance based

Example 1: Dopamine agonists
- Agonists taken from Martin data set on QSAR society web pages
- Examples (5-50 conformations/molecule)
[Figure: chemical structures of example dopamine agonists]

Pharmacophore identified
Molecule A has the desired activity if:
- in conformation B molecule A contains a hydrogen acceptor at C, and
- in conformation B molecule A contains a basic nitrogen group at D, and
- the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
- in conformation B molecule A contains a hydrogen acceptor at E, and
- the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
- the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
- in conformation B molecule A contains a hydrophobic group at F, and
- the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
- the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
- the distance between E and F is 2.74602 +/- 0.75 Angstroms.
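The distance test from Background knowledge II, and the way a pharmacophore clause is checked against a molecule's conformations, can be sketched roughly as below. This is an illustration only, not how an ILP system evaluates clauses: the function names, the `(feature_type, xyz)` representation of a conformation, and the toy coordinates are hypothetical. The three-point pattern is the one from the earlier slide on the logical representation (hydrophobic A and B, acceptor C, distances 3, 4 and 5 Angstroms, each plus or minus 1).

```python
import math
from itertools import product

def within_tolerance(p, q, dist, error):
    """Analogue of dist/5: is the Euclidean distance p-q within error of dist?"""
    return abs(math.dist(p, q) - dist) <= error

def matches(conformation, point_types, constraints, error):
    """Does one conformation contain the pharmacophore?

    conformation: list of (feature_type, (x, y, z)) pairs (assumed format)
    point_types:  feature type required at each pharmacophore point
    constraints:  (i, j, dist) triples over indices into point_types
    """
    candidates = [[xyz for kind, xyz in conformation if kind == t]
                  for t in point_types]
    # Try every assignment of molecule features to pharmacophore points.
    for assignment in product(*candidates):
        if all(within_tolerance(assignment[i], assignment[j], d, error)
               for i, j, d in constraints):
            return True
    return False

def is_active(conformations, point_types, constraints, error):
    """A molecule is predicted active if any of its conformations matches."""
    return any(matches(c, point_types, constraints, error)
               for c in conformations)

# Pattern from the slide: hydrophobic A, hydrophobic B, acceptor C.
types = ["hydrophobic", "hydrophobic", "acceptor"]
cons = [(0, 1, 3.0), (0, 2, 4.0), (1, 2, 5.0)]

# Hypothetical conformations (coordinates in Angstroms).
conf_ok = [("hydrophobic", (0, 0, 0)), ("hydrophobic", (3, 0, 0)),
           ("acceptor", (0, 4, 0))]
conf_bad = [("hydrophobic", (0, 0, 0)), ("hydrophobic", (3, 0, 0)),
            ("acceptor", (0, 9, 0))]

print(is_active([conf_bad, conf_ok], types, cons, 1.0))
```

The exhaustive search over assignments is fine for a handful of pharmacophore points; it grows quickly with the number of candidate features per point.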
Example II: ACE inhibitors
- 28 angiotensin converting enzyme inhibitors taken from literature
- D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 316, (1987)
[Figure: chemical structures of example ACE inhibitors]

Experiment 1
- Attempt to identify pharmacophore using original Mayer et al. data (final conformations).
- Initial failed attempt traced to “bugs” in background knowledge definition.
- 4 pharmacophores found with corrected code (variations on common theme)

ACE pharmacophore
Molecule A is an ACE inhibitor if:
- molecule A contains a zinc-site B,
- molecule A contains a hydrogen acceptor C,
- the distance between B and C is 7.899 +/- 0.750 Å,
- molecule A contains a hydrogen acceptor D,
- the distance between B and D is 8.475 +/- 0.750 Å,
- the distance between C and D is 2.133 +/- 0.750 Å,
- molecule A contains a hydrogen acceptor E,
- the distance between B and E is 4.891 +/- 0.750 Å,
- the distance between C and E is 3.114 +/- 0.750 Å,
- the distance between D and E is 3.753 +/- 0.750 Å.

Pharmacophore discovered

  Distance   Progol   Mayer
  A          4.9      5.0
  B          3.8      3.8
  C          8.5      8.6

[Figure: zinc site and H-bond acceptor, with the distances A, B and C marked]

Experiment 2
- Definition of “zinc ligand” added to background knowledge based on crystallographic data
- Multiple conformations
  - Sybyl RandomSearch

Experiment 2
- Original pharmacophore rediscovered plus one other
  - different zinc ligand position
  - similar to alternative proposed by Ciba-Geigy
[Figure: pharmacophore with distances 4.0, 3.9 and 7.3]

Example III: Thermolysin inhibitors
- 10 inhibitors for which crystallographic data is available in PDB
- Conformationally challenging molecules
- Experimentally observed superposition

Key binding site interactions
[Figure: binding site showing Asn112-NH, Asn112 C=O, S2’, Arg203-NH, S1’, Ala113 C=O and Zn]

Interactions made by inhibitors
[Table: interactions (Asn112-NH, S2’, Asn112 C=O, Arg203 NH, S1’, Ala113-C=O, Zn) against inhibitors (1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN)]

Pharmacophore Identification
- Structures considered: 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMN
- Conformational analysis using “Best” conformer generation in Catalyst
- 98-251 conformations/molecule

Thermolysin Results
- 10 5-point pharmacophores identified, falling into 2 groups (7/10 molecules)
  - 3 “acceptors”, 1 hydrophobe, 1 donor
  - 4 “acceptors”, 1 donor
- Common core of Zn ligands, Arg203 and Asn112 interactions identified
- Correct assignments of functional groups
- Correct geometry to 1 Angstrom tolerance

Thermolysin results
- Increasing tolerance to 1.5 Angstroms finds common 6-point pharmacophore including one extra interaction

Example IV: Antibacterial peptides
- Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa
  - 6 actives <64mg/ml IC50
  - 5 inactives

Pharmacophore Identified
A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that:
- M has a hydrophobic group C,
- M has a hydrogen acceptor D,
- the distance between C and D in conformation B is 11.7 Angstroms,
- M has a positively-charged atom E,
- the distance between C and E in conformation B is 4 Angstroms,
- the distance between D and E in conformation B is 9.4 Angstroms,
- M has a positively-charged atom F,
- the distance between C and F in conformation B is 11.1 Angstroms,
- the distance between D and F in conformation B is 12.6 Angstroms,
- the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance 1.5 Angstroms

Ongoing ILP developments (pharmacophores)
- Continue to extend method validation
- Extending to combinatorial mixtures
- Quantitative models
- Mixing different datatypes in background knowledge
- Developing graphical front-end

Ongoing developments (Other)
- Analysis of HTS datasets
- Analysis of “drug-likeness”
- Derivation of new descriptors, e.g., empirical binding functions