Functional Annotation of Proteins with Known Structure
by Structure and Sequence Similarity,
DNA-protein Interaction Patterns
and GO Framework
Ilya Shindyalov, UCSD/SDSC
PhD, Group Leader, Protein Science Research
DIMACS 2005-06-13
DIMACS 2005-06-13 ILYA SHINDYALOV, UCSD
Essential Dataflow in Protein Science
Protein: Sequence, Structure, Function
Diagram rows: Data, Methods, Results
Methods:
Sequence similarity: (i) BLAST, (ii) fold recognition, (iii) homology modeling, …
Structure similarity: (i) DALI, (ii) VAST, (iii) CE, …
Do we know the function, if we know the structure?
COVERAGE RATIO FOR FUNCTIONAL ANNOTATION
                   Disease   Biological Process   Cell Component   Molecular Function
PDB STRUCTURES     0.758     0.396                0.371            0.335
SG TARGETS         0.355     0.315                0.452            0.259
PDB+SG             0.822     0.528                0.593            0.477
HOMOLOGY MODELS    0.984     0.792                0.839            0.821
The Subjects of my Talk
3 Approaches to Using Structure Similarity to
Infer Protein Function:
#1: Assigning function from known to unknown – CASE
STUDY – Prediction of calcium binding in Acetylcholinesterase
– Projection on SNP responsible for Autism.
#2: Classification of DNA-binding protein domains involving
(in addition to structure similarity) DNA-protein
interaction patterns and sequence similarity.
#3: Extending GO annotation using structure similarity – how
reliable can it be?
#4 [BONUS]: Why is ontology so important for humans?
#1: Assigning function from known to unknown – CASE
STUDY – Prediction of calcium binding in Acetylcholinesterase
– Projection on SNP responsible for Autism.
CE
Protein structure comparison by Combinatorial Extension
of the optimal path (Shindyalov and Bourne, 1998).
http://cl.sdsc.edu
CE
Step 1. Heuristic search for initial path.
AFP = Aligned Fragment Pair
[Figure: distance between two fragments, AFP1 and AFP2; alignment path; axes labeled Protein A and Protein B]
CE
Step 2. Iterative dynamic programming on starting superposition from step 1.
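A minimal sketch of the "distance between two fragments" idea from Step 1. It assumes the distance between two AFPs is the average absolute difference of the corresponding inter-residue distances in Protein A and Protein B, which is one of the measures described in the CE paper. The function name, the argument layout and the fragment length `m = 8` are illustrative assumptions, not the CE source code.

```python
import math


def afp_distance(coords_a, coords_b, i_a, j_a, i_b, j_b, m=8):
    """Average |d_A - d_B| over all residue pairs of two aligned fragment pairs.

    AFP i pairs coords_a[i_a:i_a+m] with coords_b[i_b:i_b+m];
    AFP j pairs coords_a[j_a:j_a+m] with coords_b[j_b:j_b+m].
    coords_* are lists of (x, y, z) C-alpha positions (assumed input format).
    """
    total = 0.0
    for k in range(m):
        for l in range(m):
            d_a = math.dist(coords_a[i_a + k], coords_a[j_a + l])
            d_b = math.dist(coords_b[i_b + k], coords_b[j_b + l])
            total += abs(d_a - d_b)
    return total / (m * m)


# Toy example: a straight chain compared with itself and with a stretched copy.
chain = [(float(i), 0.0, 0.0) for i in range(20)]
stretched = [(2.0 * i, 0.0, 0.0) for i in range(20)]
same = afp_distance(chain, chain, 0, 10, 0, 10)
different = afp_distance(chain, stretched, 0, 10, 0, 10)
```

Two AFPs that fit the same rigid-body superposition give a small distance, so a path can be extended only with AFPs whose distance to the AFPs already on the path stays below a cutoff.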
CE vs. other Algorithms ???
Novotny, M., Madsen, D., and Kleywegt, G.J. 2004.
Evaluation of protein fold comparison servers. Proteins
54: 260-270.
Acetylcholinesterase vs. Troponin C
2ACE vs. 1TN4: RMSD = 4.6 Å, Z-score = 4.6, LALI = 86, LGAP = 8, Seq. Identity = 3.5%
#2: Classification of DNA-binding protein domains involving
(in addition to structure similarity) DNA-protein
interaction patterns and sequence similarity.
Data and algorithms used:
• PDB - Protein Data Bank of February 13, 2002 with 17,304 entries was used
as the source of original structural data.
- The DNA fragment size is at least 5 bp long.
- At least 5 different protein residues are involved in the interaction
with DNA.
- The contact distance cutoff between interacting atoms was < 5Å.
- We did not take into account the different types of DNA (A, B, Z)
because of the insufficient level of this annotation in the PDB
• PDP – Protein Domain Parser (Alexandrov, Shindyalov, Bioinformatics,
submitted)
• CE – Protein structure alignment by Combinatorial Extension
(Shindyalov, Bourne, 1998)
• SCOP - Structure Classification of Proteins (Murzin et al., 1995)
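The selection criteria listed under the PDB bullet above (DNA fragment of at least 5 bp, at least 5 protein residues in contact, contact distance below 5 Å) can be sketched as follows. The input format (tuples of residue id and coordinates) and all function names are assumptions made for illustration; parsing of PDB files is left out.

```python
import math

CONTACT_CUTOFF = 5.0  # Å, contact distance cutoff between interacting atoms
MIN_RESIDUES = 5      # minimum number of protein residues in contact with DNA
MIN_DNA_LENGTH = 5    # minimum DNA fragment size, in base pairs


def contact_residues(protein_atoms, dna_atoms, cutoff=CONTACT_CUTOFF):
    """Return ids of residues with at least one atom closer than `cutoff` to DNA.

    protein_atoms: iterable of (residue_id, (x, y, z))
    dna_atoms: iterable of (x, y, z)
    """
    dna_atoms = list(dna_atoms)
    residues = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in residues:
            continue  # residue already known to be in contact
        if any(math.dist(xyz, d) < cutoff for d in dna_atoms):
            residues.add(residue_id)
    return residues


def is_dna_binding(protein_atoms, dna_atoms, dna_length_bp):
    """Apply the three selection criteria to one protein chain."""
    if dna_length_bp < MIN_DNA_LENGTH:
        return False
    return len(contact_residues(protein_atoms, dna_atoms)) >= MIN_RESIDUES


# Toy example: six residues sitting 4 Å above six DNA atoms.
dna = [(3.0 * i, 0.0, 0.0) for i in range(6)]
near = [(i, (3.0 * i, 0.0, 4.0)) for i in range(6)]
far = [(i, (3.0 * i, 0.0, 40.0)) for i in range(6)]
```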
Building representative set of domains:
PDB
Selection of DNA-binding protein chains by analyzing DNA-protein contacts
Parsing of DNA-binding protein chains into domains using PDP
Selection of DNA-binding protein domains by analyzing DNA-protein contacts
All-against-all structural alignment of DNA-binding protein domains using CE
Selection of representative (non-redundant) set of DNA-binding protein domains
Calculating classification of DNA-binding protein domains
Parameters measuring structural similarity:
• Rmsd, root mean squared deviation between two aligned and compared protein
domains > 2.0 Å;
• Z-score, statistical score obtained from CE is < 4.5;
• Rnar, ratio of the number of aligned residues to the smallest domain length <
90%;
Note: sequence identity in the alignment < 90%;
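[Figure: alignment of DNA-protein domain complexes A and B; marker rows above A and below B flag residues, vertical bars connect matched positions]
A: YKLAAVGTE--FCCILLNIVKLPDGT
B: ASQL—AVREERAFA---GGKAPDQQD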
(1) Parameters measuring structural similarity: Rmsd, Z-score, Rnar;
(2) Parameter measuring the match between DNA-protein contact patterns,
Rmat;
A and B - DNA-protein domain complexes;
Rmat = min{RmatA, RmatB}
RmatX - ratio of the number of matched residues to the total number of
residues involved in contacts with DNA in the DNA-protein complex X.
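A minimal sketch of Rmat as defined above. It assumes that a "matched" residue is an aligned residue pair in which both residues contact DNA in their own complex; the slide does not spell this out, and the function and argument names are illustrative.

```python
def rmat(aligned_pairs, contacts_a, contacts_b):
    """Rmat = min(RmatA, RmatB) for two aligned DNA-protein domain complexes.

    aligned_pairs: iterable of (residue_in_A, residue_in_B) from the alignment
    contacts_a, contacts_b: sets of residues in contact with DNA in A and in B
    """
    if not contacts_a or not contacts_b:
        raise ValueError("both complexes need at least one DNA-contacting residue")
    # Assumption: a pair is matched when both aligned residues contact DNA.
    matched = sum(1 for i, j in aligned_pairs if i in contacts_a and j in contacts_b)
    rmat_a = matched / len(contacts_a)
    rmat_b = matched / len(contacts_b)
    return min(rmat_a, rmat_b)


# Toy example: 3 matched pairs, 4 contact residues in A, 5 in B.
pairs = [(1, 1), (2, 2), (3, 3), (4, 5)]
value = rmat(pairs, {1, 2, 4, 9}, {1, 2, 5, 6, 7})
```

Here RmatA = 3/4 and RmatB = 3/5, so Rmat is 0.6; taking the minimum keeps a small contact patch in one complex from masking a poor match in the other.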
Realignment using a scoring function that takes into account structural
similarity between two protein domains and the protein-DNA contact
pattern:
$$S_{ij} = S^{dist}_{ij} + S^{cont}_{ij}$$
Structure similarity term:
$$S^{dist}_{ij} = \begin{cases} C_1 - d_{ij}, & \text{if } C_1 - d_{ij} \ge C_2 \\ C_2, & \text{otherwise} \end{cases}$$
Protein-DNA contact pattern term:
$$S^{cont}_{ij} = C_3 \cdot K^{A}_{i} \cdot K^{B}_{j}$$
where
$$K^{X}_{m} = \begin{cases} 1, & \text{if protein residue } m \text{ is involved in contact with DNA} \\ 0, & \text{otherwise} \end{cases}$$
m denotes a protein residue, X a protein-DNA complex; C3 is a scaling constant.
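A minimal sketch of the scoring function above, reading it as the sum of a distance term floored at C2 and a contact term that is non-zero only when both aligned residues contact DNA. The constant values used in the example are hypothetical; the slide does not give them.

```python
def pair_score(d_ij, in_contact_a, in_contact_b, c1, c2, c3):
    """Score for aligning residue i of domain A with residue j of domain B.

    d_ij: distance between the two residues after superposition (Å)
    in_contact_a / in_contact_b: True if the residue contacts DNA (K = 1)
    c1, c2, c3: constants of the scoring function (values are assumptions)
    """
    s_dist = max(c1 - d_ij, c2)                           # structure similarity term
    s_cont = c3 * int(in_contact_a) * int(in_contact_b)   # contact pattern term
    return s_dist + s_cont


# Hypothetical constants, for illustration only.
C1, C2, C3 = 10.0, 0.0, 5.0
close_both_contact = pair_score(3.0, True, True, C1, C2, C3)
far_one_contact = pair_score(15.0, True, False, C1, C2, C3)
```

A score of this kind can be fed to a standard dynamic programming alignment, so that residue pairs sharing a DNA contact are favored over pairs that are only close in space.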
• If Rmsd > 5.0 Å or Rnar < 70% or Z-score < 3.5, then domains are not considered
as similar;
• If Rmsd ≤ 3.0 Å and Rnar ≥ 80%, then domains are considered as similar;
• If Rmat ≥ Rmat_threshold and either 3.0 Å < Rmsd ≤ 5.0 Å and Rnar ≥ 70%, or
70% ≤ Rnar < 80% and Rmsd ≤ 5.0 Å, then domains are considered similar.
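The three rules can be sketched as one decision function. This assumes the comparison signs lost in the slide text are ≤ and ≥, and that any pair in the intermediate zone with Rmat below the threshold is not similar. The function name and the threshold value in the example are hypothetical.

```python
def domains_similar(rmsd, rnar, z_score, rmat, rmat_threshold):
    """Decide whether two DNA-binding domains are similar.

    rmsd in Å; rnar, rmat and rmat_threshold in percent.
    """
    # Rule 1: clearly dissimilar.
    if rmsd > 5.0 or rnar < 70.0 or z_score < 3.5:
        return False
    # Rule 2: clearly similar by structure alone.
    if rmsd <= 3.0 and rnar >= 80.0:
        return True
    # Rule 3: intermediate zone (3.0 < Rmsd <= 5.0 with Rnar >= 70, or
    # 70 <= Rnar < 80 with Rmsd <= 5.0); the contact pattern match decides.
    return rmat >= rmat_threshold


# Hypothetical threshold of 80% for the contact pattern match.
borderline_good_contacts = domains_similar(4.0, 75.0, 4.0, 85.0, 80.0)
borderline_poor_contacts = domains_similar(4.0, 75.0, 4.0, 60.0, 80.0)
```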
Comparison of the classification for all 338 DNA-binding domain
representatives with SCOP at various threshold parameters
Final classification of DNA-binding protein domains (fragment):
[Figure: similarity regions in the Rnar (ticks at 70 and 80) vs. Rmsd (ticks at 3 and 5) plane, labeled “Not similar”, “Similar”, “Similar if Rmat<80”, and “Not similar”]
SPDC – Structural Protein Domain Classification
http://spdc.sdsc.edu
#3: Extending GO annotation using structure similarity – how
reliable can it be?
Why do we need the ontology?
• Quantitative data explosion (e.g. exponential growth
of sequence data, doubling every 7 months)
• Qualitative data explosion (new experimental
methods and new kinds of data appear, e.g. microarrays, interfering RNA).
• Lack of adequate means for information storage and
exchange between:
- scientists,
- computers,
- scientists and computers
(what’s published in scientific journals is de facto not
reaching the community).
GO can serve as a language which can be easily
read by both humans and computers. By using GO
we ultimately learn to talk in one universal
language.
The goal of this work is to further realize
the potential of GO.
What is GO?
• Controlled dictionaries for:
- Molecular Function
- Biological Process
- Cellular Component
• Acyclic graph
• “is-a”, “part-of” (“has-a”) relationships
[Diagram: BMW “is-a” CAR; Wheel “part-of” CAR (CAR “has-a” Wheel)]
The GO Annotation (GOA) resources providing annotation of gene products with GO terms

Resource              Organism / dataset          Biological Process   Molecular Function   Cellular Component   Total Gene Products
                                                  All       non-IEA*   All       non-IEA    All       non-IEA    Associated
                                                  codes     codes      codes     codes      codes     codes
SGD                   Saccharomyces cerevisiae    6446      6446       6434      6434       6435      6435       6448
FlyBase               Drosophila melanogaster     4439      4428       6795      6789       3942      3918       7938
MGI                   Mus musculus                9594      5776       10523     6642       9691      7300       12694
TAIR                  Arabidopsis thaliana        6724      1979       7782      5450       13544     1891       18482
WormBase              Caenorhabditis elegans      5115      1563       5754      285        3056      652        6925
RGD                   Rattus norvegicus           1060      248        1234      260        835       146        1448
Gramene               Oryza sativa                6728      4495       12878     12273      6018      2799       14709
ZFIN                  Danio rerio                 782       0          917       0          687       0          983
DictyBase             Dictyostelium discoideum    1309      100        1600      117        927       117        1781
TIGR                  Pseudomonas syringae DC3000 2941      2941       3101      3101       263       263        3137
TIGR                  Trypanosoma brucei chr 2    291       291        289       289        278       278        292
TIGR                  Bacillus anthracis Ames     4416      4416       4418      4418       199       199        4418
TIGR                  Arabidopsis thaliana        3001      3001       6463      6463       1563      1563       6801
TIGR                  Coxiella burnetii RSA 493   1359      1359       1349      1349       176       176        1365
TIGR                  Gene Index                  80031     0          100151    0          78400     0          126556
TIGR                  Shewanella oneidensis MR-1  3696      3696       3696      3696       241       241        3696
TIGR                  Vibrio cholerae             2923      2923       2728      2728       191       191        2924
[unlabeled]                                       631750    0          631105    0          640209    0          658168
GO Annotations @ EBI  Human                       16526     7663       18901     7172       13833     6602       20673
GO Annotations @ EBI  PDB                         16871     0          18392     0          10417     0          18890
                      SwissPROT/TrEMBL            531757    19499      643062    22548      409822    16125      738665
                      Leishmania major            64        64         82        82         26        26         89
                      Plasmodium falciparum       2047      2047       2097      2097       2061      1227       2406

[The resource label “Compugen” also appears on the slide; its row could not be matched.]
* The IEA code, Inferred from Electronic Annotation, means no human involvement in the assignment.
Extending GO annotation of PDB chains using
structural and sequence similarity
34,698 protein chains were taken from the PDB of February, 2003 with
the exception of theoretical models, short chains (less than 30 Cα atoms),
and chains which don’t form domains (no domains detected by PDP
algorithm).
EBI has assigned GO annotation to 25,835 of these 34,698 PDB protein
chains.
Rmsd, root mean squared deviation between two structurally aligned
polypeptides, it characterizes distances between C , C and mainchain O
atoms of aligned residues.
Z-score, statistically founded score, it characterizes significance of the
alignment.
Rnar, ratio of the number of aligned residues to the length of the
shortest polypeptide, it measures overlap between aligned polypeptides.
Rseq, sequence identity calculated for the structurally aligned residues.
For two polypeptides A and B with all calculated parameter values (Rmsd,
Z-score, Rnar, Rseq) and given threshold values (Rmsd_threshold, Z-score_threshold, Rnar_threshold, Rseq_threshold) we define:
$$SSC_{AB} = (Rmsd < Rmsd_{threshold}) \wedge (Z\text{-}score > Z\text{-}score_{threshold}) \wedge (Rnar > Rnar_{threshold}) \wedge (Rseq > Rseq_{threshold})$$
∧ denotes logical AND.
SSC_AB can take only two values: true or false.
If SSC_AB is true, then A and B are similar.
If SSC_AB is false, then A and B are not similar.
The chains were clustered such that the above condition (SSC_AB is true)
holds for every two chains in each cluster.
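A minimal sketch of the similarity condition and of a clustering in which every two chains of a cluster satisfy it. The slide does not say which clustering algorithm was used; the greedy pass below is an assumption chosen only because it guarantees the all-pairs property. Names and default thresholds are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    rmsd: float = 5.0     # Å
    z_score: float = 3.8
    rnar: float = 70.0    # percent
    rseq: float = 90.0    # percent


def ssc(params, t):
    """SSC_AB: logical AND of the four threshold tests.

    params: dict with keys 'rmsd', 'z_score', 'rnar', 'rseq' for one alignment.
    """
    return (params["rmsd"] < t.rmsd
            and params["z_score"] > t.z_score
            and params["rnar"] > t.rnar
            and params["rseq"] > t.rseq)


def cluster_chains(chains, similar):
    """Greedy clustering (assumed): a chain joins the first cluster in which
    it is similar to every member, otherwise it starts a new cluster."""
    clusters = []
    for chain in chains:
        for cluster in clusters:
            if all(similar(chain, member) for member in cluster):
                cluster.append(chain)
                break
        else:
            clusters.append([chain])
    return clusters


# Toy example with a hand-written similarity relation.
similar_pairs = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}
clusters = cluster_chains("abcd", lambda x, y: frozenset((x, y)) in similar_pairs)
```

Chain d is similar to c but not to a or b, so it stays a singleton; this is the difference between requiring similarity for every pair and simply linking neighbours.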
Specificity Criteria:
For the clusters where GO terms were available for at least two chains
we define:
“positive cluster” - where all chains have the same GO terms;
“negative cluster” - where chains have different GO terms (more
specific definitions for three criteria will be given further);
TP (true positives) - the number of chains with GO terms in the
positive clusters;
FP (false positives) - the number of chains with GO terms in the
negative clusters;
ppv (positive predictive value), or specificity, is the ratio
TP/(TP+FP).
Specificity Criteria (cont.):
{t_i1, …, t_ik(i)} is the set of k(i) GO terms for the i-th chain.
Each specificity is defined for clusters with at least two annotated chains.
Specificity-1 (the most rigorous) - a “positive” cluster must have the same set of GO terms for every pair of chains (i, j):
$$t_{in} = t_{jn},\ n = 1, \dots, k(i),\ k(i) = k(j),\ \forall (i, j),\ i \in \{1, \dots, N\},\ j \in \{1, \dots, N\}$$
Specificity-2 (less rigorous than specificity-1) - in a “positive” cluster, for every
pair of chains (i, j) with a different number of GO terms, all terms of the chain
with the smaller number of terms must be present among the terms of the
chain with the larger number of GO terms:
$$\{t_{i1}, \dots, t_{ik(i)}\} \subseteq \{t_{j1}, \dots, t_{jk(j)}\}\ \text{if } k(i) \le k(j),\quad i \in \{1, \dots, N\},\ j \in \{1, \dots, N\}$$
Specificity-3 (less rigorous than specificity-2) - a “positive” cluster must have a common
set of terms {t_1, …, t_L} for all N chains within the cluster:
$$\{t_1, \dots, t_L\} \subseteq \{t_{i1}, \dots, t_{ik(i)}\},\ i = 1, \dots, N;\quad \{t_1, \dots, t_L\} \ne \emptyset$$
Further detailing of specificity (Specificity-4) should involve the
semantic distance (e.g. Lord et al, 2003) between terms in judging a
cluster to be “positive”.
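A minimal sketch of the three specificity criteria and of the ppv ratio, using Python sets of GO term numbers. The function names are illustrative; the example term sets are the ones shown later in the talk for the Runt-related transcription factor 1 and cytochrome c-L clusters, and the ppv example uses the first row of the results table (20799 chains, 8255 of them FP under specificity-1).

```python
def is_positive(term_sets, level):
    """Classify a cluster as "positive" under specificity-1, -2 or -3.

    term_sets: one set of GO terms per annotated chain (at least two chains).
    """
    if level == 1:
        # Every chain carries exactly the same set of terms.
        return all(s == term_sets[0] for s in term_sets)
    if level == 2:
        # For every pair, the smaller set is contained in the larger one.
        return all(a <= b or b <= a for a in term_sets for b in term_sets)
    if level == 3:
        # All chains share at least one common term.
        return bool(set(term_sets[0]).intersection(*term_sets[1:]))
    raise ValueError("level must be 1, 2 or 3")


def ppv(tp, fp):
    """Positive predictive value (called specificity in the talk)."""
    return tp / (tp + fp)


seven = {3677, 3700, 5524, 5634, 6355, 7275, 8151}
four = {3677, 5524, 5634, 6355}
runt_cluster = [seven] * 10 + [four] * 10
cytochrome_cluster = [{5489, 6118, 15945}] + [{16021, 16032}] * 8
specificity_1 = ppv(20799 - 8255, 8255)
```

The Runt cluster is negative under specificity-1 but positive under specificity-2 and specificity-3, while the cytochrome cluster has no common term and is negative even under specificity-3.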
Clusterization of PDB chains and the accuracy of GO annotation
at different threshold values of structural similarity parameters.

Clusters and chains:
Rseq   Rnar   Rmsd, Å   Z-score   Clusters and   Clusters with at   Clusters with at least   Chains in clusters with at
                                  singletons     least two chains   two chains with GO       least two chains with GO
0%     90%    2.0       4.5       9940           2919               2435                     20799
25%    90%    2.0       4.5       9959           2923               2440                     20797
35%    90%    2.0       4.5       10069          2995               2490                     20768
50%    90%    2.0       4.5       10368          3180               2643                     20719
70%    90%    2.0       4.5       10867          3515               2886                     20606
90%    90%    2.0       4.5       11478          3834               3033                     20493
0%     70%    5.0       3.8       3401           1962               1687                     24805
25%    70%    5.0       3.8       4261           2533               2142                     24610
35%    70%    5.0       3.8       4778           2885               2431                     24507
50%    70%    5.0       3.8       5455           3330               2787                     24357
70%    70%    5.0       3.8       6199           3819               3152                     24196
90%    70%    5.0       3.8       7031           4269               3359                     23984

Performance of GO annotation (specificity):
Rseq   Rnar   Rmsd, Å   Z-score   FP chains         Specificity-1,   FP chains         Specificity-2,   FP chains         Specificity-3,
                                  (specificity-1)   %                (specificity-2)   %                (specificity-3)   %
0%     90%    2.0       4.5       8255              60.3             4463              78.5             158               99.2
25%    90%    2.0       4.5       8091              61.1             4410              78.8             155               99.3
35%    90%    2.0       4.5       7534              63.7             3972              80.9             113               99.5
50%    90%    2.0       4.5       5618              72.9             2686              87.0             64                99.7
70%    90%    2.0       4.5       3759              81.8             1137              94.5             42                99.8
90%    90%    2.0       4.5       1536              92.5             517               97.5             29                99.8
0%     70%    5.0       3.8       17730             28.5             15163             38.9             5604              77.4
25%    70%    5.0       3.8       13683             44.4             9608              61.0             734               97.0
35%    70%    5.0       3.8       11606             52.6             7299              70.2             328               98.7
50%    70%    5.0       3.8       8200              66.3             4063              83.3             85                99.7
70%    70%    5.0       3.8       5042              79.2             1567              93.5             58                99.8
90%    70%    5.0       3.8       2235              90.7             734               97.0             29                99.9

Performance of GO annotation (coverage and new annotation):
Rseq   Rnar   Rmsd, Å   Z-score   Coverage,   Newly annotated   New chain-GO term   New added chain-GO   Chains with added
                                  %           chains            associations        term associations    GO terms
0%     90%    2.0       4.5       60.9        5397              170372              3531                 1953
25%    90%    2.0       4.5       60.6        5367              169864              3386                 1893
35%    90%    2.0       4.5       59.9        5310              167254              3281                 1841
50%    90%    2.0       4.5       57.4        5089              160523              2937                 1376
70%    90%    2.0       4.5       52.3        4639              153801              2062                 1015
90%    90%    2.0       4.5       45.2        4002              147162              861                  359
0%     70%    5.0       3.8       83.8        7426              266757              2858                 1318
25%    70%    5.0       3.8       78.5        6956              229366              3961                 2153
35%    70%    5.0       3.8       75.9        6728              215972              3962                 2285
50%    70%    5.0       3.8       71.7        6351              197765              4164                 1960
70%    70%    5.0       3.8       64.9        5749              187239              3187                 1440
90%    70%    5.0       3.8       56.5        5007              178146              1566                 588
Assignment of GO annotation with structural similarity parameters
(Rmsd ≤ 5.0Å, Z-score ≥ 3.8, Rnar ≥ 70%, Rseq ≥ 90%).
Red dot denotes newly annotated chains, red arrow denotes new “GO term – chain”
associations assigned for newly annotated chains. Purple line denotes new “GO term –
chain” associations assigned for chains previously annotated (by EBI). Black arrow denotes
existing “GO term – chain” associations assigned by EBI.

PDB chains (34,698):
- 5,007 newly annotated chains (this work)
- 588 chains annotated by EBI with added new GO terms (this work)
- 25,247 annotated by EBI
- 3,856 not annotated anywhere

"GO term - chain" associations (335,322):
- 178,675 new associations for newly annotated chains (this work)
- 1,661 new associations added for chains annotated by EBI (this work)
- 154,986 annotated by EBI

New "GO term - chain" associations (178,675): Process 33%, Function 53%, Cellular component 14%
New added "GO term - chain" associations for previously annotated chains (1,661): Function 42%, Cellular component 6%, Process 52%
The example of a “negative” cluster by the definition of specificity-1 and a “positive” cluster by the definitions of
specificity-2 and specificity-3. Seven GO terms could be assigned to chains 1h9dA, 1h9dC (Rmsd ≤ 5.0Å, Z-score ≥
3.8, Rnar ≥ 70%, Rseq ≥ 90%).
The cluster consists of chains of the same protein, Runt-related
transcription factor 1 (synonyms: core-binding factor alpha
subunit, acute myeloid leukemia 1 protein etc.).

Chains                                                                    GO terms
1e50A, 1e50C, 1e50E, 1e50G, 1e50Q, 1e50R, 1cmoA, 1co1A, 1ljmA, 1ljmB      (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1hjbC, 1hjbF, 1hjcA, 1hjcD, 1io4C, 1eanA, 1eaoA, 1eaoB, 1eaqA, 1eaqB      (4) 3677, 5524, 5634, 6355
1h9dA, 1h9dC                                                              no GO terms

GO term   Ontology   Name
3677      (F)        DNA binding
3700      (F)        transcription factor activity
5524      (F)        ATP binding
5634      (C)        nucleus
6355      (P)        regulation of transcription, DNA-dependent
7275      (P)        development
8151      (P)        cell growth and/or maintenance
The example of a “positive” cluster by the definition of specificity-1.
Phospholipase A2.

Chains                                                                    GO terms
1cl5A, 1cl5B, 1fb2A, 1fb2B, 1fv0A, 1fv0B, 1jq8A, 1jq8B, 1jq9A, 1jq9B      (5) 4623, 5509, 15070, 16042, 16787
1kpmB                                                                     no GO terms

GO term   Ontology   Name
4623      (F)        phospholipase A2 activity
5509      (F)        calcium ion binding
15070     (F)        toxin activity
16042     (P)        lipid catabolism
16787     (F)        hydrolase activity
Only four “negative” clusters have occurred by the definition of specificity-3.
An example of missed GO terms for 2mtaC and other chains of Cytochrome c-L
(cytochrome c551i):

Chains                                                        GO terms
2mtaC                                                         (3) 5489, 6118, 15945
1mg2D, 1mg2H, 1mg2L, 1mg2P, 1mg3D, 1mg3H, 1mg3L, 1mg3P        (2) 16021, 16032

GO term   Ontology   Name
5489      (F)        electron transporter activity
6118      (P)        electron transport
15945     (P)        methanol metabolism
16021     (C)        integral to membrane
16032     (P)        viral life cycle
#4 [BONUS]: Why is ontology so important for humans?
Evolution of complex systems:
Computers: complexity doubles every 18 months per $$$
(Moore’s Law)
Human Brain: very slow (complexity doubles in ~100,000 years)

System                   Short Term Storage   Long Term Storage   Speed        Cost
PC cluster (256 units)   65 GB                5 TB                256 GFLOP    $130K
Human Brain (Average)    57 TB                1137 TB             4.4 TFLOP    $130K

Complexity = Speed x Memory
Computer = 5 TB x 256 GFLOP ≈ 10^24 memory FLOPs
Brain = 1137 TB x 4.4 TFLOP ≈ 5x10^27 memory FLOPs
Brain/Computer = 5x10^3 or 3.7 log units
Moore’s Law: 3.5 years/log unit
Human brain capacity for computers will be reached: 2000 + 3.7 x 3.5 ≈ 2013
Based on (Ramsey, 1997)
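The arithmetic on this slide can be replayed in a few lines. The sketch takes the slide's own inputs (long-term storage times speed, 3.5 years per log unit, year 2000 as the starting point) at face value; the variable names are illustrative.

```python
import math

# Complexity = Speed x Memory, using long-term storage as on the slide.
computer_exact = 5e12 * 256e9      # 5 TB x 256 GFLOP = 1.28e24
brain_exact = 1137e12 * 4.4e12     # 1137 TB x 4.4 TFLOP, about 5.0e27

# The slide rounds these to 10^24 and 5x10^27 before taking the ratio.
gap_slide = math.log10(5e27 / 1e24)                    # about 3.7 log units
gap_exact = math.log10(brain_exact / computer_exact)   # about 3.6 log units

YEARS_PER_LOG_UNIT = 3.5  # value stated on the slide
START_YEAR = 2000

year_slide = START_YEAR + round(gap_slide, 1) * YEARS_PER_LOG_UNIT
year_exact = START_YEAR + gap_exact * YEARS_PER_LOG_UNIT
```

With the rounded values the result is 2012.95, which the slide reports as 2013; the unrounded products give a slightly smaller gap and the same year after rounding.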
The accuracy of predicting the future
for the next 2 years equals 10%
Credits:
Julia Ponomarenko (she did #2 and #3)
Phil Bourne (discussions, conceptualizations, logistics)
Lei Xie (PDB statistics)
NIH Grant GM63208
NSF Grants DBI 9808706, DBI 0111710
Gift from Ceres Inc.