Lecture 7: Statistical learning methods - BIDD

CZ5226: Advanced Bioinformatics
Lecture 7: Statistical Learning Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1,
National University of Singapore
Classification of Drugs or Proteins by SVM
• A drug or a protein is classified as either belonging (+) or not belonging (-) to a class.
Examples of drug classes: inhibitor of a protein, blood-brain barrier (BBB) penetrating, genotoxic.
Examples of protein classes: enzyme EC3.4 family, DNA-binding.
• By screening against all classes, the property of a drug or the function of a protein can be identified.
[Figure: a drug or protein is screened by a separate SVM for each class. The Class-1 and Class-2 SVMs return -, the Class-3 SVM returns +, so the drug or protein belongs to Family-3.]
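The screening idea above can be sketched in a few lines of Python. This is only an illustration: the three linear decision functions, their weights, and the names `CLASSIFIERS`, `screen`, and `predicted_classes` are hypothetical stand-ins for trained SVMs, chosen so that the output reproduces the -, -, + pattern of the figure.

```python
# Hypothetical linear decision functions, one binary classifier per class.
# The weights and biases are made-up illustration values, not trained models.
CLASSIFIERS = {
    "Class-1": ([1.0, -1.0, 0.0], -0.5),
    "Class-2": ([-1.0, 0.0, 1.0], -0.5),
    "Class-3": ([0.0, 1.0, 1.0], -1.5),
}


def decision_value(weights, bias, features):
    """Signed score of a linear classifier: w . x + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def screen(features, classifiers=CLASSIFIERS):
    """Return '+' or '-' for every class by asking each binary classifier."""
    return {
        name: "+" if decision_value(w, b, features) > 0 else "-"
        for name, (w, b) in classifiers.items()
    }


def predicted_classes(features, classifiers=CLASSIFIERS):
    """Names of the classes whose classifier answers '+'."""
    signs = screen(features, classifiers)
    return [name for name, sign in signs.items() if sign == "+"]


query = [1.0, 1.0, 1.0]  # feature vector of the drug or protein to screen
print(screen(query))
print(predicted_classes(query))
```

Because every class has its own classifier, a query may be assigned to several classes or to none.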
Classification of Drugs or Proteins by SVM
What is SVM?
• Support vector machines (SVMs) are a statistical machine learning method that learns from examples and classifies objects into one of two classes.
Advantages of SVM:
• Class members can be diverse (members need not resemble one another).
• Use of structure-derived physico-chemical features as basis for drug
or protein classification (no structure-similarity or sequence-similarity
required in the algorithm).
SVM References
• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes (http://www.cs.unr.edu/~bebis/MathMethods/SVM/lecture.pdf)
• Publications of SVM drug prediction:
– J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
– J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
– Toxicol. Sci. 79, 170 (2004)
SVM References
• Publications of SVM protein function prediction:
– Bioinformatics 2002; 18, 147
– Nucleic Acids Res 2003; 31, 3692
– Proteins 2004; 55, 66
– RNA 2004; 10, 355
– J Biol Chem 2004; 279, 23262
– Nucleic Acids Res. 2004; 32(21): 6437-6444
– Virology 2005; 331(1):136-143
• Publications of SVM peptide-binder prediction:
– BMC Bioinformatics. 2002 Sep 11;3(1):25
– Bioinformatics. 2003 Oct 12;19(15):1978-84
– Protein Sci. 2004 Mar;13(3):596-607
– Genome Inform Ser Workshop Genome Inform. 2004;15(1):198-212
Other MHC-Peptide Prediction References
– J Comput Biol. 2004;11(4):683-94
– Methods. 2004 Dec;34(4):454-9
– Methods. 2004 Dec;34(4):444-53
– Methods. 2004 Dec;34(4):436-43
– Org Biomol Chem. 2004 Nov 21;2(22):3274-83
– Immunogenetics. 2004 Sep;56(6):405-19
– J Immunol. 2004 Jun 15;172(12):7495-502
– J Immunol. 2004 Jun 1;172(11):6783-9
– Appl Bioinformatics. 2003;2(1):63-6
– Appl Bioinformatics. 2003;2(3):155-8
– Bioinformatics. 2004 Jun 12;20(9):1388-97
– Proteins. 2004 Feb 15;54(3):534-56
– Novartis Found Symp. 2003;254:102-20; discussion 120-5, 216-22, 250-2
– Hum Immunol. 2003 Dec;64(12):1123-43
– J Mol Graph Model. 2004 Jan;22(3):195-207
– Neural Comput. 2003 Dec;15(12):2931-42
– Tissue Antigens. 2003 Nov;62(5):378-84
Other MHC-Peptide Prediction References
– Bioinformatics. 2003 Sep 22;19(14):1765-72
– Hybrid Hybridomics. 2003 Aug;22(4):229-34
– Nucleic Acids Res. 2003 Jul 1;31(13):3621-4
– Bioinformatics. 2003 May 22;19(8):1009-14
– Methods. 2003 Mar;29(3):236-47
– J Proteome Res. 2002 May-Jun;1(3):263-72
– J Mol Biol. 2003 Feb 28;326(4):1157-74
– BMC Bioinformatics. 2002 Sep 11;3(1):25
– Hum Immunol. 2002 Sep;63(9):701-9
– J Comput Biol. 2002;9(3):527-39
– Mol Med. 2002 Mar;8(3):137-48
– Immunol Cell Biol. 2002 Jun;80(3):280-5
– Immunol Cell Biol. 2002 Jun;80(3):270-9
– BMC Struct Biol. 2002 May 13;2(1):2
– Biologicals. 2001 Sep-Dec;29(3-4):179-81
– Bioinformatics. 2001 Dec;17(12):1236-7
– Bioinformatics. 2001 Oct;17(10):942-8
– J Med Chem. 2001 Oct 25;44(22):3572-81
– J Comput Aided Mol Des. 2001 Jun;15(6):573-86
– Protein Sci. 2000 Sep;9(9):1838-46
Machine Learning Method
Inductive learning: example-based learning
[Figure: examples characterized by descriptors and divided into positive examples and negative examples.]
Machine Learning Method
Feature vectors:
A=(1, 1, 1)
B=(0, 1, 1)
C=(1, 1, 1)
D=(0, 1, 1)
E=(0, 0, 0)
F=(1, 0, 1)
[Figure: the descriptors of each example are encoded as a feature vector; the examples are divided into positive examples and negative examples.]
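One way to read the feature vectors above is as presence/absence flags for three descriptors. The sketch below assumes exactly that reading; the descriptor names and the helper `to_feature_vector` are hypothetical and only serve to show how an example is turned into a vector.

```python
# Hypothetical descriptor names; the slide does not say what the three
# vector components stand for.
DESCRIPTORS = ["descriptor_1", "descriptor_2", "descriptor_3"]


def to_feature_vector(present, descriptors=DESCRIPTORS):
    """Encode an example as 1/0 flags: 1 if the descriptor is present."""
    return tuple(1 if name in present else 0 for name in descriptors)


# Descriptor sets chosen so that the vectors match A, B, E, and F above.
examples = {
    "A": {"descriptor_1", "descriptor_2", "descriptor_3"},
    "B": {"descriptor_2", "descriptor_3"},
    "E": set(),
    "F": {"descriptor_1", "descriptor_3"},
}

for label, present in examples.items():
    print(label, to_feature_vector(present))
```

Fixing the order of `DESCRIPTORS` matters: every example must place the same descriptor in the same vector position so that the vectors are comparable.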
SVM Method
Feature vectors in input space:
[Figure: the feature vectors A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), and F=(1, 0, 1) plotted as points in a three-dimensional input space with axes X, Y, and Z.]
SVM Method
Project to a higher dimensional space
[Figure: protein family members and nonmembers with the border between them in the input space, and the new border obtained after projection to a higher dimensional space.]
SVM method
[Figures: the new border separates protein family members from nonmembers; the support vectors are marked on either side of the border.]
Best Linear Separator?
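Find Closest Points in Convex Hulls
[Figure: the convex hulls of the two classes, with c and d marking the closest pair of points, one on each hull.]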
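Plane Bisects Closest Points
$$x \cdot w = b, \qquad w = d - c$$
[Figure: the separating plane bisects the segment joining the closest points c and d.]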
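As a small worked example, the plane can be computed directly from the two closest points. The sketch assumes that "bisect" means the plane passes through the midpoint of c and d, which gives b = w · (c + d) / 2; the function name `bisecting_plane` and the sample points are illustrative only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bisecting_plane(c, d):
    """Return (w, b) of the plane x . w = b with w = d - c.

    Assumption: the plane passes through the midpoint of c and d,
    so b = w . (c + d) / 2.
    """
    w = [di - ci for ci, di in zip(c, d)]
    midpoint = [(ci + di) / 2 for ci, di in zip(c, d)]
    return w, dot(w, midpoint)


c = [0.5, 0.5]  # closest point on the hull of one class (sample value)
d = [3.0, 3.0]  # closest point on the hull of the other class (sample value)
w, b = bisecting_plane(c, d)
print(w, b)
print(dot(w, c) - b, dot(w, d) - b)  # opposite signs, equal magnitude
```

Points with x · w greater than b fall on the side of d, and points with x · w less than b fall on the side of c.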
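Find using quadratic program
$$\min_{\alpha} \; \frac{1}{2} \lVert c - d \rVert^2$$
$$c = \sum_{i \in \text{Class } 1} \alpha_i x_i, \qquad d = \sum_{i \in \text{Class } -1} \alpha_i x_i$$
$$\text{s.t.} \quad \sum_{i \in \text{Class } 1} \alpha_i = 1, \qquad \sum_{i \in \text{Class } -1} \alpha_i = 1, \qquad \alpha_i \ge 0, \quad i = 1, \ldots, \ell$$
Many existing and new solvers.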
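For intuition, the closest-points problem can be approximated without a general quadratic programming package. The sketch below uses a Frank-Wolfe style iteration, which is a swapped-in technique and not the solver the lecture has in mind; the function name `closest_hull_points` and the sample points are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def closest_hull_points(class_pos, class_neg, max_iter=1000, tol=1e-12):
    """Approximate the closest points c and d of two convex hulls.

    c stays a convex combination of class_pos and d of class_neg.
    Each pass moves (c, d) toward the pair of data points that most
    reduces the squared distance, using an exact line search.
    """
    c = list(class_pos[0])
    d = list(class_neg[0])
    for _ in range(max_iter):
        u = [ci - di for ci, di in zip(c, d)]             # current c - d
        xi = min(class_pos, key=lambda x: dot(x, u))      # best point for c
        xj = max(class_neg, key=lambda x: dot(x, u))      # best point for d
        v = [a - b for a, b in zip(xi, xj)]               # candidate xi - xj
        gap = dot(u, u) - dot(u, v)                       # Frank-Wolfe gap
        if gap <= tol:
            break                                         # no further improvement
        diff = [vi - ui for ui, vi in zip(u, v)]
        step = min(1.0, gap / dot(diff, diff))            # line search, clipped to [0, 1]
        c = [(1 - step) * ci + step * a for ci, a in zip(c, xi)]
        d = [(1 - step) * di + step * b for di, b in zip(d, xj)]
    return c, d


class_pos = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # sample points, one class
class_neg = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]  # sample points, other class
c, d = closest_hull_points(class_pos, class_neg)
print(c, d)
```

For these sample points the iteration stops at c = (0.5, 0.5) and d = (3, 3). The weights α are never stored explicitly here: c and d remain convex combinations because each update is itself a convex combination of the current point and a data point.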
Best Linear Separator: Supporting Plane Method
Maximize the distance between two parallel supporting planes:
$$x \cdot w = b + 1$$
$$x \cdot w = b - 1$$
Distance = "Margin" = $\frac{2}{\lVert w \rVert}$
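The margin formula is easy to check numerically. In the sketch below the weight vector, the offset, and the convention that the class on the x · w ≥ b + 1 side is labelled +1 are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin(w):
    """Distance between the planes x . w = b + 1 and x . w = b - 1."""
    return 2.0 / math.sqrt(dot(w, w))


def side(x, w, b):
    """+1 on or beyond x . w = b + 1, -1 on or beyond x . w = b - 1, else 0."""
    value = dot(x, w) - b
    if value >= 1:
        return 1
    if value <= -1:
        return -1
    return 0  # strictly between the two supporting planes


w, b = [3.0, 4.0], 2.0     # sample plane parameters, not from the lecture
print(margin(w))           # ||w|| = 5, so the margin is 2 / 5
print(side([1.0, 1.0], w, b), side([0.0, 0.0], w, b), side([0.5, 0.25], w, b))
```

Since the margin is 2 / ||w||, maximizing the margin is the same as minimizing ||w|| while keeping every training point on or outside its own supporting plane.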
SVM Method
Border line is nonlinear
SVM method
Non-linear transformation: use of kernel function
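A kernel function returns the dot product that two points would have after the non-linear transformation, without computing the transformation itself. The sketch below checks this for a degree-2 polynomial kernel on two-dimensional inputs and also shows a Gaussian kernel; the choice of kernels, the feature map `phi`, and the parameter `gamma` are illustrative assumptions, since the slide does not name a specific kernel.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs (maps to 3-D)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = [1.0, 2.0], [3.0, 4.0]
print(poly2_kernel(x, z))        # kernel evaluated in the input space
print(dot(phi(x), phi(z)))       # same value via the explicit 3-D mapping
print(gaussian_kernel(x, x))     # identical points give 1.0
```

The first two printed values agree up to rounding, which is the point of the kernel: a linear border in the transformed space corresponds to a non-linear border in the input space.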
SVM for Classification of Drugs
How to represent a drug?
• Each structure is represented by a specific feature vector assembled from structural and physico-chemical properties:
– Simple molecular properties (molecular weight, number of rotatable bonds, etc.; 18 in total)
– Molecular connectivity and shape (28 in total)
– Electro-topological state polarity (84 in total)
– Quantum chemical properties (electric charge, polarizability, etc.; 13 in total)
– Geometrical properties (molecular size vector, van der Waals volume, molecular surface, etc.; 16 in total)
J. Chem. Inf. Comput. Sci. 44,1630 (2004)
J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
Toxicol. Sci. 79,170 (2004).
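Adding up the group sizes listed above gives the length of the drug feature vector: 18 + 28 + 84 + 13 + 16 = 159. The sketch below assembles such a vector by concatenating the groups in a fixed order; the function `assemble_feature_vector`, the shortened group names, and the all-zero placeholder values are assumptions, and no real descriptor calculation is performed.

```python
# Descriptor group sizes as listed on the slide; the names are shortened.
GROUP_SIZES = {
    "simple molecular properties": 18,
    "molecular connectivity and shape": 28,
    "electro-topological state": 84,
    "quantum chemical properties": 13,
    "geometrical properties": 16,
}


def assemble_feature_vector(groups):
    """Concatenate per-group descriptor values in a fixed group order."""
    vector = []
    for name, size in GROUP_SIZES.items():
        values = groups[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(values)
    return vector


# Placeholder values (all zeros) stand in for real computed descriptors.
placeholder = {name: [0.0] * size for name, size in GROUP_SIZES.items()}
vector = assemble_feature_vector(placeholder)
print(len(vector))  # 18 + 28 + 84 + 13 + 16
```

Checking each group's length before concatenation guards against a silent shift of descriptor positions, which would make vectors from different drugs incomparable.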
SVM-based drug design and property prediction software
Useful for inhibitor/activator/substrate prediction, drug safety and pharmacokinetic prediction.
[Figure: workflow for answering "Which class does your drug belong to?". Option 1: input the chemical structure through the internet at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Option 2: input the structure on a local machine (computer loaded with SVMProt). The structure is sent to the classifier, which holds a support vector machine classifier for every drug class; the identified classes give the drug designed or the property predicted.]
J. Chem. Inf. Comput. Sci. 44,1630 (2004)
J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
Toxicol. Sci. 79,170 (2004).
SVM Drug Prediction Results
Protein inhibitor/activator/substrate prediction:
• 86% of the 129 estrogen receptor activators and 84% of 101 non-activators correctly predicted.
• 81% of 116 P-glycoprotein substrates and 79% of 85 non-substrates correctly predicted.
Drug toxicity prediction:
• 97% of 102 TdP+ and 84% of 243 TdP- agents correctly predicted.
• 73% of 229 genotoxic and 93% of 631 non-genotoxic agents correctly predicted.
Pharmacokinetics prediction:
• 95% of 276 BBB+ and 82% of 139 BBB- agents correctly predicted.
• 90% of 131 human intestine absorption and 80% of 65 non-absorption agents correctly predicted.
J. Chem. Inf. Comput. Sci. 44,1630 (2004)
J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
Toxicol. Sci. 79,170 (2004).
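Each result above reports two numbers, the fraction of positive agents and the fraction of negative agents predicted correctly. A single overall accuracy can be estimated by weighting the two fractions by their group sizes. The sketch below does this for the figures quoted above; because the percentages are rounded, the results are approximations, and the function name `overall_accuracy` is an assumption.

```python
def overall_accuracy(pos_rate, n_pos, neg_rate, n_neg):
    """Size-weighted average of the positive and negative accuracies."""
    return (pos_rate * n_pos + neg_rate * n_neg) / (n_pos + n_neg)


# (positive accuracy, positives, negative accuracy, negatives) from the slide
results = {
    "estrogen receptor activators": (0.86, 129, 0.84, 101),
    "P-glycoprotein substrates": (0.81, 116, 0.79, 85),
    "TdP": (0.97, 102, 0.84, 243),
    "genotoxicity": (0.73, 229, 0.93, 631),
    "BBB penetration": (0.95, 276, 0.82, 139),
    "human intestine absorption": (0.90, 131, 0.80, 65),
}

for name, (pos_rate, n_pos, neg_rate, n_neg) in results.items():
    print(f"{name}: {overall_accuracy(pos_rate, n_pos, neg_rate, n_neg):.3f}")
```

Reporting the two fractions separately remains more informative when the groups are unbalanced, as in the genotoxicity set, where the larger negative group dominates the weighted average.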
SVM for Classification of Proteins
How to represent a protein?
• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility
• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
Nucleic Acids Res. 2003; 31: 3692-3697
SVM for Classification of Proteins
How to represent a protein?
From protein sequence:
To feature vector:
(C_amino acid composition, T_amino acid composition, D_amino acid composition, C_hydrophobicity, T_hydrophobicity, D_hydrophobicity, …)
Nucleic Acids Res. 2003; 31: 3692-3697
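The composition, transition, and distribution descriptors can be sketched for a single property. The code below assumes a three-class hydrophobicity grouping of the amino acids and one common convention for the distribution positions (first, 25%, 50%, 75%, and last occurrence); both are assumptions for illustration and may differ in detail from the encoding used in the cited paper.

```python
import math

# Assumed three-class hydrophobicity grouping (polar, neutral, hydrophobic).
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
LOOKUP = {aa: g for g, members in GROUPS.items() for aa in members}


def encode(sequence):
    """Replace every residue by the index of its property class."""
    return "".join(LOOKUP[aa] for aa in sequence)


def composition(encoded):
    """C: fraction of residues in each class."""
    return [encoded.count(g) / len(encoded) for g in "123"]


def transition(encoded):
    """T: fraction of adjacent pairs that switch between two given classes."""
    pairs = list(zip(encoded, encoded[1:]))
    result = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        switches = sum(1 for p in pairs if p in ((a, b), (b, a)))
        result.append(switches / len(pairs))
    return result


def distribution(encoded):
    """D: relative position of the first, 25%, 50%, 75%, and last residue of each class."""
    result = []
    for g in "123":
        positions = [i + 1 for i, ch in enumerate(encoded) if ch == g]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                result.append(0.0)  # class absent from the sequence
                continue
            k = max(1, math.ceil(fraction * len(positions)))
            result.append(positions[k - 1] / len(encoded))
    return result


encoded = encode("RGCKAL")  # short made-up sequence
features = composition(encoded) + transition(encoded) + distribution(encoded)
print(encoded, len(features))
```

With three classes, one property contributes 3 + 3 + 15 = 21 values; repeating the encoding for the other properties and concatenating the results gives a fixed-length vector regardless of sequence length.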
Protein function prediction software SVMProt
Useful for functional prediction of novel proteins, distantly-related proteins, and homologous proteins of different functions.
[Figure: workflow for answering "Which functional families does your protein belong to?". Option 1: input the protein sequence through the internet at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Option 2: input the sequence on a local machine (computer loaded with SVMProt). The sequence is sent to the classifier, which holds a support vector machine classifier for every protein functional family; the identified functional families give the protein functional indications.]
Nucl. Acids Res. 31, 3692-3697 (2003)
Protein function prediction software SVMProt
Protein families covered: 46 enzyme families, 3 receptor families, 4 transporter and channel families, 6 DNA- and RNA-binding families, 8 structural families, 2 regulator/factor families.
SVMProt web version at: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi
Nucl. Acids Res. 31, 3692-3697 (2003)
Protein function prediction software SVMProt
[Table: prediction score and the corresponding probability of correct prediction.]
Nucl. Acids Res. 31, 3692-3697 (2003)
SVMProt Protein Functional Family Prediction Results
Overall prediction accuracies:
• 87% of the 34,582 proteins correctly assigned to their respective functional family.
• 97% of the 310,000 non-member proteins correctly predicted.
Novel enzymes:
• 67% of the 12 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of NR databases) are correctly assigned.
• 83% of the 29 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of the SwissProt database) are correctly assigned.
• 70% of the 20 pairs of homologous enzymes of different functions are correctly assigned.
NR databases include all non-redundant GenBank CDS translations, PDB, SwissProt, PIR, and PRF databases.
92% of 12,900 enzymes correctly assigned by BLAST in 1997.
Nucleic Acids Res 2003; 31, 3692
Proteins 2004; 55, 66