Lecture 7: Statistical learning methods - BIDD

CZ5226: Advanced Bioinformatics
Lecture 7: Statistical Learning Methods

Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore

Classification of Drugs or Proteins by SVM

• A drug or a protein is classified as either belonging (+) or not belonging (-) to a class.
  Examples of drug classes: inhibitor of a protein, BBB penetrating, genotoxic.
  Examples of protein classes: enzyme EC3.4 family, DNA-binding.
• By screening against all classes, the property of a drug or the function of a protein can be identified.

[Diagram: a drug or protein is screened by the Class-1 SVM (-), the Class-2 SVM (-) and the Class-3 SVM (+); the drug or protein belongs to Family-3.]

Classification of Drugs or Proteins by SVM

What is SVM?
• Support vector machines (SVMs) are a statistical machine learning method that learns from examples and classifies objects into one of two classes.

Advantages of SVM:
• Tolerates diversity among class members (members of a class need not resemble one another).
• Uses structure-derived physico-chemical features as the basis for drug or protein classification (the algorithm requires no structure similarity or sequence similarity).

SVM References

• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes (http://www.cs.unr.edu/~bebis/MathMethods/SVM/lecture.pdf)
• Publications of SVM drug prediction:
  – J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
  – J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
  – Toxicol. Sci. 79, 170 (2004)

SVM References

• Publications of SVM protein function prediction:
  – Bioinformatics 2002; 18, 147
  – Nucleic Acids Res 2003; 31, 3692
  – Proteins 2004; 55, 66
  – RNA 2004; 10, 355
  – J Biol Chem 2004; 279, 23262
  – Nucleic Acids Res. 2004; 32(21): 6437-6444
  – Virology 2005; 331(1):136-143
• Publications of SVM peptide-binder prediction:
  – BMC Bioinformatics. 2002 Sep 11;3(1):25
  – Bioinformatics. 2003 Oct 12;19(15):1978-84
  – Protein Sci. 2004 Mar;13(3):596-607
  – Genome Inform Ser Workshop Genome Inform. 2004;15(1):198-212

Other MHC-Peptide Prediction References

  – J Comput Biol. 2004;11(4):683-94
  – Methods. 2004 Dec;34(4):454-9
  – Methods. 2004 Dec;34(4):444-53
  – Methods. 2004 Dec;34(4):436-43
  – Org Biomol Chem. 2004 Nov 21;2(22):3274-83
  – Immunogenetics. 2004 Sep;56(6):405-19
  – J Immunol. 2004 Jun 15;172(12):7495-502
  – J Immunol. 2004 Jun 1;172(11):6783-9
  – Appl Bioinformatics. 2003;2(1):63-6
  – Appl Bioinformatics. 2003;2(3):155-8
  – Bioinformatics. 2004 Jun 12;20(9):1388-97
  – Proteins. 2004 Feb 15;54(3):534-56
  – Novartis Found Symp. 2003;254:102-20; discussion 120-5, 216-22, 250-2
  – Hum Immunol. 2003 Dec;64(12):1123-43
  – J Mol Graph Model. 2004 Jan;22(3):195-207
  – Neural Comput. 2003 Dec;15(12):2931-42
  – Tissue Antigens. 2003 Nov;62(5):378-84
  – Bioinformatics. 2003 Sep 22;19(14):1765-72
  – Hybrid Hybridomics. 2003 Aug;22(4):229-34
  – Nucleic Acids Res. 2003 Jul 1;31(13):3621-4
  – Bioinformatics. 2003 May 22;19(8):1009-14
  – Methods. 2003 Mar;29(3):236-47
  – J Proteome Res. 2002 May-Jun;1(3):263-72
  – J Mol Biol. 2003 Feb 28;326(4):1157-74
  – BMC Bioinformatics. 2002 Sep 11;3(1):25
  – Hum Immunol. 2002 Sep;63(9):701-9
  – J Comput Biol. 2002;9(3):527-39
  – Mol Med. 2002 Mar;8(3):137-48
  – Immunol Cell Biol. 2002 Jun;80(3):280-5
  – Immunol Cell Biol. 2002 Jun;80(3):270-9
  – BMC Struct Biol. 2002 May 13;2(1):2
  – Biologicals. 2001 Sep-Dec;29(3-4):179-81
  – Bioinformatics. 2001 Dec;17(12):1236-7
  – Bioinformatics. 2001 Oct;17(10):942-8
  – J Med Chem. 2001 Oct 25;44(22):3572-81
  – J Comput Aided Mol Des. 2001 Jun;15(6):573-86
  – Protein Sci. 2000 Sep;9(9):1838-46

Machine Learning Method

Inductive learning: example-based learning.
[Figure: positive examples and negative examples, each characterized by descriptors.]

Machine Learning Method

Feature vectors:
A=(1, 1, 1)  B=(0, 1, 1)  C=(1, 1, 1)  D=(0, 1, 1)  E=(0, 0, 0)  F=(1, 0, 1)
[Figure: the descriptors of each positive and negative example are converted into a feature vector.]

SVM Method

Feature vectors in input space: the feature vectors A to F listed above are plotted as points.
[Figure: points A, B, E and F in the input space, with axes X, Y and Z.]

SVM Method

[Figure: protein family members and nonmembers, with a border in the input space and a new border after projection to a higher-dimensional space.]

SVM Method

[Figures: the new border between protein family members and nonmembers, with the support vectors marked.]

Best Linear Separator?

[Figures: candidate linear separators between the two classes.]

Find Closest Points in Convex Hulls

[Figure: c and d are the closest points in the convex hulls of the two classes.]

Plane Bisects Closest Points

$x \cdot w = b$, with $w = d - c$

Find using quadratic program

$$\min_{\alpha}\ \frac{1}{2}\,\lVert c - d \rVert^{2}, \qquad c = \sum_{i \in \text{class } 1} \alpha_i x_i, \quad d = \sum_{i \in \text{class } -1} \alpha_i x_i$$

$$\text{s.t.}\quad \sum_{i \in \text{class } 1} \alpha_i = 1, \quad \sum_{i \in \text{class } -1} \alpha_i = 1, \quad \alpha_i \ge 0,\quad i = 1, \dots, \ell$$

Many existing and new solvers.

Best Linear Separator: Supporting Plane Method

Maximize the distance between two parallel supporting planes:

$x \cdot w = b + 1$ and $x \cdot w = b - 1$

Distance = "Margin" = $2 / \lVert w \rVert$

SVM Method

Border line is nonlinear.

SVM Method

Non-linear transformation: use of kernel function.
[Figures: non-linear transformation of the input space.]

SVM for Classification of Drugs

How to represent a drug?
• Each structure is represented by a specific feature vector assembled from structural and physico-chemical properties:
  – Simple molecular properties (molecular weight, no. of rotatable bonds, etc.; 18 in total)
  – Molecular connectivity and shape (28 in total)
  – Electro-topological state polarity (84 in total)
  – Quantum chemical properties (electric charge, polarizability, etc.; 13 in total)
  – Geometrical properties (molecular size vector, van der Waals volume, molecular surface, etc.; 16 in total)

J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).

SVM-based drug design and property prediction software

Useful for inhibitor/activator/substrate prediction, drug safety and pharmacokinetic prediction.

[Flow diagram: which class does your drug belong to? Your drug's chemical structure is sent to the classifier by one of two options: input the structure through the internet (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) or input the structure on a local machine (computer loaded with SVMProt). A support vector machine classifier for every drug class returns the identified classes, giving the drug designed or property predicted.]

J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).

SVM Drug Prediction Results

Protein inhibitor/activator/substrate prediction:
• 86% of the 129 estrogen receptor activators and 84% of 101 non-activators correctly predicted.
• 81% of 116 P-glycoprotein substrates and 79% of 85 non-substrates correctly predicted.

Drug toxicity prediction:
• 97% of 102 TdP+ and 84% of 243 TdP- agents correctly predicted.
• 73% of 229 genotoxic and 93% of 631 non-genotoxic agents correctly predicted.

Pharmacokinetics prediction:
• 95% of 276 BBB+ and 82% of 139 BBB- agents correctly predicted.
• 90% of 131 human intestine absorption and 80% of 65 non-absorption agents correctly predicted.

J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).

SVM for Classification of Proteins

How to represent a protein?
• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
  – Amino acid composition
  – Hydrophobicity
  – Normalized van der Waals volume
  – Polarity
  – Polarizability
  – Charge
  – Surface tension
  – Secondary structure
  – Solvent accessibility
• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.

Nucleic Acids Res. 2003; 31: 3692-3697

SVM for Classification of Proteins

How to represent a protein?
From protein sequence to feature vector:
(C_amino acid composition, T_amino acid composition, D_amino acid composition, C_hydrophobicity, T_hydrophobicity, D_hydrophobicity, …)
[Figures: conversion of a protein sequence into its feature vector.]

Nucleic Acids Res. 2003; 31: 3692-3697

Protein function prediction software SVMProt

Useful for functional prediction of novel proteins, distantly-related proteins, and homologous proteins of different functions.

[Flow diagram: which functional families does your protein belong to? Your protein sequence is sent to the classifier by one of two options: input the sequence through the internet (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) or input the sequence on a local machine (computer loaded with SVMProt). A support vector machine classifier for every protein functional family returns the identified functional families, giving protein functional indications.]

Protein families covered: 46 enzyme families, 3 receptor families, 4 transporter and channel families, 6 DNA- and RNA-binding families, 8 structural families, 2 regulator/factor families.

SVMProt web version at: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi

[Figure/table: SVMProt prediction score versus probability of correct prediction.]

Nucl. Acids Res. 31, 3692-3697 (2003)

SVMProt Protein Functional Family Prediction Results

Overall prediction accuracies:
• 87% of the 34,582 proteins correctly assigned to their respective functional family.
• 97% of the 310,000 non-member proteins correctly predicted.

Novel enzymes:
• 67% of the 12 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of NR databases) are correctly assigned.
• 83% of the 29 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of SwissProt database) are correctly assigned.
• 70% of the 20 pairs of homologous enzymes of different functions are correctly assigned.

NR databases include all non-redundant GenBank CDS translations, PDB, SwissProt, PIR, and PRF databases.
92% of 12,900 enzymes correctly assigned by BLAST in 1997.

Nucleic Acids Res 2003; 31, 3692; Proteins 2004; 55, 66
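
The composition (C), transition (T) and distribution (D) descriptors named in the protein-representation slides can be illustrated for a single property. The following is only a minimal sketch, not the SVMProt implementation: the three-class hydrophobicity grouping, the function names and the exact percentile convention are assumptions introduced here, since the slides do not spell them out.

```python
from typing import Dict, List

# ASSUMPTION: a three-class hydrophobicity grouping (1 = polar, 2 = neutral,
# 3 = hydrophobic). The slides do not list the groups; this table is illustrative.
HYDROPHOBICITY_CLASSES: Dict[str, int] = {}
for _label, _residues in ((1, "RKEDQN"), (2, "GASTPHY"), (3, "CLVIMFW")):
    for _aa in _residues:
        HYDROPHOBICITY_CLASSES[_aa] = _label


def encode(sequence: str, classes: Dict[str, int] = HYDROPHOBICITY_CLASSES) -> List[int]:
    """Replace each residue by its class label (1, 2 or 3)."""
    return [classes[aa] for aa in sequence.upper()]


def composition(encoded: List[int]) -> List[float]:
    """C: percentage of residues that fall in each class."""
    n = len(encoded)
    return [100.0 * encoded.count(k) / n for k in (1, 2, 3)]


def transition(encoded: List[int]) -> List[float]:
    """T: percentage of adjacent residue pairs that switch between two classes."""
    pairs = list(zip(encoded, encoded[1:]))
    result = []
    for a, b in ((1, 2), (1, 3), (2, 3)):
        hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
        result.append(100.0 * hits / len(pairs))
    return result


def distribution(encoded: List[int]) -> List[float]:
    """D: position (as % of sequence length) of the first, 25%, 50%, 75%
    and last residue of each class; 0.0 if the class is absent."""
    n = len(encoded)
    result = []
    for k in (1, 2, 3):
        positions = [i + 1 for i, label in enumerate(encoded) if label == k]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                result.append(0.0)
                continue
            idx = max(int(fraction * len(positions)) - 1, 0)
            result.append(100.0 * positions[idx] / n)
    return result


def ctd_vector(sequence: str) -> List[float]:
    """Concatenate C (3 values), T (3 values) and D (15 values) for one property."""
    if len(sequence) < 2:
        raise ValueError("need at least two residues to compute transitions")
    encoded = encode(sequence)
    return composition(encoded) + transition(encoded) + distribution(encoded)
```

Calling `ctd_vector("RKGACL")` on this hypothetical six-residue sequence returns 21 numbers for the one property; repeating the same encoding for each tabulated property and concatenating the results would give a fixed-length feature vector regardless of sequence length, which is what an SVM needs as input.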

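
Returning to the supporting-plane slide ("Best Linear Separator: Supporting Plane Method"), the margin formula can be checked numerically on the toy feature vectors A to F from the "Machine Learning Method" slides. This is a sketch under stated assumptions: the slides do not say which of A to F are positive examples, so the labels below are hypothetical, and the plane parameters `W` and `B` are chosen by hand rather than obtained from a quadratic-program solver.

```python
import math
from typing import Sequence

# Toy feature vectors from the slides.
POINTS = {
    "A": (1, 1, 1), "B": (0, 1, 1), "C": (1, 1, 1),
    "D": (0, 1, 1), "E": (0, 0, 0), "F": (1, 0, 1),
}
# ASSUMPTION: A-D are positive examples and E-F negative; the slides do not say.
LABELS = {"A": 1, "B": 1, "C": 1, "D": 1, "E": -1, "F": -1}

# ASSUMPTION: hand-picked separating plane x.w = b (not the output of a solver).
W = (0.0, 2.0, 0.0)
B = 1.0


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def decision_value(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """x.w - b: +1 on one supporting plane, -1 on the other, 0 on the border."""
    return dot(x, w) - b


def classify(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Assign +1 or -1 according to the side of the plane x.w = b."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def margin(w: Sequence[float]) -> float:
    """Distance between the supporting planes x.w = b + 1 and x.w = b - 1."""
    return 2.0 / math.sqrt(dot(w, w))


if __name__ == "__main__":
    for name, x in POINTS.items():
        print(name, decision_value(x, W, B), classify(x, W, B))
    print("margin:", margin(W))
```

With these assumed labels every point lies exactly on one of the two supporting planes (decision value +1 or -1), so all six would act as support vectors, and the margin 2/||w|| equals 1, the distance between the planes where the second coordinate is 1 and where it is 0.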