Functional Annotation of Proteins with Known Structure by Structure and Sequence Similarity, DNA-protein Interaction Patterns and GO Framework

Ilya Shindyalov, PhD, Group Leader, Protein Science Research, UCSD/SDSC
DIMACS 2005-06-13

DIMACS 2005-06-13 ILYA SHINDYALOV, UCSD

Essential Dataflow in Protein Science

[Diagram with three columns: Protein Data, Methods, Results]
Protein Data: Sequence, Structure, Function
Methods:
Sequence similarity: (i) BLAST, (ii) fold recognition, (iii) homology modeling …
Structure similarity: (i) DALI, (ii) VAST, (iii) CE …

Do we know the function, if we know the structure?

COVERAGE RATIO FOR FUNCTIONAL ANNOTATION

                  Disease   Biological Process   Cell Component   Molecular Function
PDB STRUCTURES    0.758     0.396                0.371            0.335
SG TARGETS        0.355     0.315                0.452            0.259
PDB+SG            0.822     0.528                0.593            0.477
HOMOLOGY MODELS   0.984     0.792                0.839            0.821

The Subjects of my Talk

3 Approaches of Using Structure Similarity to Infer Protein Function:
#1: Assigning function from known to unknown - CASE STUDY - Prediction of calcium binding in Acetylcholine Esterase - Projection on SNP responsible for Autism.
#2: Classification of DNA-binding protein domains involving (in addition to structure similarity) DNA-protein interaction patterns and sequence similarity.
#3: Extending GO annotation using structure similarity - how reliable can it be?
#4 [BONUS]: Why is ontology so important for humans?

CE

Protein structure comparison by Combinatorial Extension of the optimal path (Shindyalov and Bourne, 1998).
http://cl.sdsc.edu

CE Step 1. Heuristic search for initial path.
AFP = Aligned Fragment Pair
[Figure: two aligned fragment pairs, AFP1 and AFP2, between Protein A and Protein B, with the distance between two fragments marked; the alignment path is drawn in a Protein A versus Protein B matrix]

CE Step 2. Iterative dynamic programming on starting superposition from step 1.

CE vs. other Algorithms ???
Novotny, M., Madsen, D., and Kleywegt, G.J. 2004. Evaluation of protein fold comparison servers. Proteins 54: 260-270.

Acetylcholinesterase vs. Troponin C
2ACE vs. 1TN4:
RMSD = 4.6 Å
Z-score = 4.6
LALI = 86
LGAP = 8
Seq. Identity = 3.5%

#2: Classification of DNA-binding protein domains involving (in addition to structure similarity) DNA-protein interaction patterns and sequence similarity.

Data and algorithms used:
• PDB - Protein Data Bank of February 13, 2002 with 17,304 entries was used as the source of original structural data.
- The DNA fragment size is at least 5 bp long.
- At least 5 different protein residues are involved in the interaction with DNA.
- The contact distance cutoff between interacting atoms was < 5 Å.
- We did not take into account the different types of DNA (A, B, Z) because of the insufficient level of this annotation in the PDB.
• PDP - Protein Domain Parser (Alexandrov, Shindyalov, Bioinformatics, submitted)
• CE - Protein structure alignment by Combinatorial Extension (Shindyalov, Bourne, 1998)
• SCOP - Structural Classification of Proteins (Murzin et al., 1995)

Building representative set of domains:
[Flowchart, top to bottom]
PDB
Selection of DNA-binding protein chains by analyzing DNA-protein contacts
Parsing of DNA-binding protein chains into domains using PDP
Selection of DNA-binding protein domains by analyzing DNA-protein contacts
All-against-all structural alignment of DNA-binding protein domains using CE
Selection of representative (non-redundant) set of DNA-binding protein domains
Calculating classification of DNA-binding protein domains

Parameters measuring structural similarity:
• Rmsd, root mean squared deviation between two aligned and compared protein domains > 2.0 Å;
• Z-score, statistical score obtained from CE is < 4.5;
• Rnar, ratio of the number of aligned residues to the smallest domain length < 90%;
Note: sequence identity in the alignment < 90%;

[Figure: alignment of the sequences of complexes A and B, with rows of asterisks above and below the sequences and vertical bars between them]
A: YKLAAVGTE--FCCILLNIVKLPDGT
B: ASQL--AVREERAFA---GGKAPDQQD

(1) Parameters measuring structural similarity: Rmsd, Z-score, Rnar;
(2) Parameter measuring the match between DNA-protein contact patterns, Rmat;
A and B - DNA-protein domain complexes;
Rmat = min{RmatA, RmatB}
RmatX - ratio of the number of matched residues to the total number of residues involved in contacts with DNA in the DNA-protein complex X.
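
The Rmat definition above can be sketched in Python. This is only an illustration, not the authors' implementation: the function name `rmat`, the representation of contacts as sets of residue indices, and the reading of a "matched" residue as an aligned pair in which both residues contact DNA are all assumptions made for the example.

```python
def rmat(contacts_a, contacts_b, aligned_pairs):
    """Contact-pattern match between two DNA-protein complexes A and B.

    contacts_a, contacts_b: sets of residue indices in contact with DNA
        (hypothetical representation).
    aligned_pairs: list of (i, j) pairs, residue i of A aligned to
        residue j of B, each residue appearing at most once.
    """
    if not contacts_a or not contacts_b:
        return 0.0
    # Assumption: a residue is "matched" when it and its aligned partner
    # both contact DNA.
    matched = sum(1 for i, j in aligned_pairs
                  if i in contacts_a and j in contacts_b)
    rmat_a = matched / len(contacts_a)  # RmatA
    rmat_b = matched / len(contacts_b)  # RmatB
    return min(rmat_a, rmat_b)          # Rmat = min{RmatA, RmatB}


# Toy data: A has 4 contact residues, B has 2, and 2 aligned pairs match.
example = rmat({1, 2, 3, 4}, {10, 11}, [(1, 10), (2, 11), (3, 12), (5, 13)])
```

Taking the minimum means a small contact site that is fully covered by a much larger one still scores low, so both patterns have to agree.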

Realignment using scoring function taking into account structural similarity between two protein domains and protein-DNA contact pattern

S_{ij} = S^{dist}_{ij} + S^{cont}_{ij}

Structure similarity term:
S^{dist}_{ij} = \begin{cases} C_1 - d_{ij}, & \text{if } C_1 - d_{ij} > C_2 \\ C_2, & \text{otherwise} \end{cases}

Protein-DNA contact pattern term:
S^{cont}_{ij} = C_3 K^{A}_{i} K^{B}_{j}

where
K^{X}_{m} = \begin{cases} 1, & \text{if protein residue is involved in contact with DNA} \\ 0, & \text{otherwise} \end{cases}

m denotes protein residue, X denotes protein-DNA complex; C3 is a scaling constant.

• If Rmsd > 5.0 Å or Rnar < 70% or Z-score < 3.5, then domains are not considered as similar;
• If Rmsd ≤ 3.0 Å and Rnar ≥ 80%, then domains are considered as similar;
• If Rmat ≥ Rmat_threshold and either
3.0 Å < Rmsd ≤ 5.0 Å and Rnar ≥ 70%
or
70% ≤ Rnar < 80% and Rmsd ≤ 5.0 Å,
then domains are considered similar;

Comparison of the classification for all 338 DNA-binding domain representatives with SCOP at various threshold parameters
[Figure]

Final classification of DNA-binding protein domains (fragment):
[Figure]

[Diagram: similarity regions in the plane of Rnar (ticks at 70 and 80) and Rmsd (ticks at 3 and 5), with regions labeled "Not similar", "Similar" and "Similar if Rmat<80"]

SPDC - Structural Protein Domain Classification
http://spdc.sdsc.edu

#3: Extending GO annotation using structure similarity - how reliable can it be?

Why do we need the ontology?
• Quantitative data explosion (e.g.
exponential growth of sequence data, doubling every 7 months)
• Qualitative data explosion (new experimental methods and new kinds of data appear, e.g. microarrays, interfering RNA).
• Lack of adequate means for information storage and exchange between:
- scientists,
- computers,
- scientists and computers (what’s published in scientific journals is de facto not reaching the community).

GO can serve as a language which can be easily read by both humans and computers.
By using GO we ultimately learn to talk in one universal language.
The goal of this work is to further realize the potential of GO.

What is GO?
• Controlled dictionaries for:
- Molecular Function
- Biological Process
- Cellular Component
• Acyclic graph
• “is-a”, “part-of” (“has-a”) relationships
[Example diagram: BMW “is-a” CAR; Wheel is “part-of” CAR (CAR “has-a” Wheel)]

The GO Annotation (GOA) resources providing annotation of gene products with GO terms
(for each ontology: all evidence codes / non-IEA* codes; last column: Total Gene Products Associated)

GO Annotation Resource              Biological Process  Molecular Function  Cellular Component  Total
                                    All      non-IEA*   All      non-IEA    All      non-IEA
SGD Saccharomyces cerevisiae        6446     6446       6434     6434       6435     6435       6448
FlyBase Drosophila melanogaster     4439     4428       6795     6789       3942     3918       7938
MGI Mus musculus                    9594     5776       10523    6642       9691     7300       12694
TAIR Arabidopsis thaliana           6724     1979       7782     5450       13544    1891       18482
WormBase Caenorhabditis elegans     5115     1563       5754     285        3056     652        6925
RGD Rattus norvegicus               1060     248        1234     260        835      146        1448
Oryza sativa                        6728     4495       6018     2799       12878    12273      14709
ZFIN Danio rerio                    782      0          917      0          687      0          983
DictyBase Dictyostelium discoideum  1309     100        1600     117        927      117        1781
Pseudomonas syringae DC3000         2941     2941       3101     3101       263      263        3137
Trypanosoma brucei chr 2            291      291        289      289        278      278        292
Bacillus anthracis Ames             4416     4416       4418     4418       199      199        4418
Arabidopsis thaliana                3001     3001       6463     6463       1563     1563       6801
Coxiella burnetii RSA 493           1359     1359       1349     1349       176      176        1365
Gene Index                          80031    0          100151   0          78400    0          126556
Shewanella oneidensis MR-1          3696     3696       3696     3696       241      241        3696
Vibrio cholerae                     2923     2923       2728     2728       191      191        2924
(unlabeled row)                     631750   0          631105   0          640209   0          658168
Human                               16526    7663       18901    7172       13833    6602       20673
PDB                                 16871    0          18392    0          10417    0          18890
SwissPROT/TrEMBL                    531757   19499      643062   22548      409822   16125      738665
Leishmania major                    64       64         82       82         26       26         89
Plasmodium falciparum               2047     2047       2097     2097       2061     1227       2406

Other resource names on the slide: Gramene, TIGR, Compugen, GO Annotations @ EBI.
* IEA, Inferred from Electronic Annotation: this code means no human involvement in the assignment.

Extending GO annotation of PDB chains using structural and sequence similarity

34,698 protein chains were taken from the PDB of February, 2003, excluding theoretical models, short chains (fewer than 30 Cα atoms), and chains which don’t form domains (no domains detected by the PDP algorithm).
EBI has assigned GO annotation to 25,835 of the 34,698 PDB protein chains.

Rmsd, root mean squared deviation between two structurally aligned polypeptides; it characterizes distances between C , C and mainchain O atoms of aligned residues.
Z-score, statistically founded score; it characterizes the significance of the alignment.
Rnar, ratio of the number of aligned residues to the length of the shortest polypeptide; it measures the overlap between aligned polypeptides.
Rseq, sequence identity calculated for the structurally aligned residues.

For two polypeptides A and B with all calculated parameter values (Rmsd, Z-score, Rnar, Rseq) and given threshold values (Rmsd_threshold, Z-score_threshold, Rnar_threshold, Rseq_threshold) we define:

SSC_{AB} = (Rmsd < Rmsd_{threshold}) \wedge (Z\text{-}score > Z\text{-}score_{threshold}) \wedge (Rnar > Rnar_{threshold}) \wedge (Rseq > Rseq_{threshold})

where \wedge denotes logical AND.
SSC_AB can take only two values: true or false.
If SSC_AB is true, then A and B are similar.
If SSC_AB is false, then A and B are not similar.
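
The similarity condition SSC_AB above translates directly into a boolean predicate. The sketch below is illustrative only: the names `Thresholds` and `ssc` are hypothetical, and the example parameter values are invented. The threshold set used is one of the combinations reported in the talk (Rmsd 5.0 Å, Z-score 3.8, Rnar 70%, Rseq 90%).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    rmsd: float    # upper bound, in Å
    zscore: float  # lower bound
    rnar: float    # lower bound, in %
    rseq: float    # lower bound, in %


def ssc(rmsd, zscore, rnar, rseq, t):
    """SSC_AB: True only if all four conditions hold (logical AND)."""
    return (rmsd < t.rmsd and zscore > t.zscore
            and rnar > t.rnar and rseq > t.rseq)


t = Thresholds(rmsd=5.0, zscore=3.8, rnar=70.0, rseq=90.0)
similar = ssc(2.1, 5.2, 95.0, 98.0, t)      # all four conditions met
not_similar = ssc(2.1, 5.2, 95.0, 40.0, t)  # fails only on Rseq
```

Because the four terms are joined by AND, one failing parameter is enough to make two chains "not similar", however good the others are.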
The chains were clustered such that for every two chains in each cluster the condition above holds (SSC_AB is true).

Specificity Criteria:
For the clusters where GO terms were available for at least two chains we define:
“positive cluster” - where all chains have the same GO terms;
“negative cluster” - where chains have different GO terms (more specific definitions for three criteria are given below);
TP (true positives) - the number of chains with GO terms in the positive clusters;
FP (false positives) - the number of chains with GO terms in the negative clusters;
ppv (positive predictive value) or specificity is the ratio TP/(TP+FP).

Specificity Criteria (cont.):
{t_{i1}, …, t_{ik(i)}} is the set of k(i) GO terms for the i-th chain. Each specificity is defined for clusters with at least two annotated chains.

Specificity-1 (the most rigorous) - a “positive” cluster must have every pair of chains (i, j) with the same set of GO terms:
t_{in} = t_{jn},\ n = 1, \dots, k(i),\ k(i) = k(j),\ \forall (i, j),\ i \in \{1, \dots, N\},\ j \in \{1, \dots, N\}.

Specificity-2 (less rigorous than specificity-1) - in a “positive” cluster, for every pair of chains (i, j) with a different number of GO terms, all terms of the chain with the smaller number of terms must be present among the terms of the chain with the larger number of GO terms:
\{t_{i1}, \dots, t_{ik(i)}\} \subseteq \{t_{j1}, \dots, t_{jk(j)}\},\ \text{if } k(i) \le k(j);\ i \in \{1, \dots, N\},\ j \in \{1, \dots, N\}.

Specificity-3 (less rigorous than specificity-2) - a “positive” cluster must have a common set of terms {t_1, …, t_L} for all N chains within the cluster:
\{t_1, \dots, t_L\} \subseteq \{t_{i1}, \dots, t_{ik(i)}\},\ i = 1, \dots, N;\ \{t_1, \dots, t_L\} \neq \emptyset.

Further detailing of specificity (Specificity-4) should involve the semantic distance (e.g. Lord et al., 2003) between terms in judging a cluster to be “positive”.

Clustering of PDB chains and the accuracy of GO annotation at different threshold values of structural similarity parameters.
Threshold values: Rseq, Rnar, Rmsd (Å), Z-score
Clusters and chains:
A = clusters and singletons
B = clusters with at least two chains
C = clusters with at least two chains with GO
D = chains in clusters with at least two chains with GO
Performance of GO annotation:
FP-1, FP-2, FP-3 = FP chains (specificity-1, -2, -3)
Sp-1, Sp-2, Sp-3 = specificity-1, -2, -3, %
Cov = coverage, %
E = newly annotated chains
F = new chain-GO term associations
G = new added chain-GO term associations
H = chains with added GO terms

Rseq    Rnar    Rmsd    Z-score A       B       C       D       FP-1    Sp-1    FP-2    Sp-2    FP-3    Sp-3    Cov     E       F       G       H
0%      90%     2.0     4.5     9940    2919    2435    20799   8255    60.3    4463    78.5    158     99.2    60.9    5397    170372  3531    1953
25%     90%     2.0     4.5     9959    2923    2440    20797   8091    61.1    4410    78.8    155     99.3    60.6    5367    169864  3386    1893
35%     90%     2.0     4.5     10069   2995    2490    20768   7534    63.7    3972    80.9    113     99.5    59.9    5310    167254  3281    1841
50%     90%     2.0     4.5     10368   3180    2643    20719   5618    72.9    2686    87.0    64      99.7    57.4    5089    160523  2937    1376
70%     90%     2.0     4.5     10867   3515    2886    20606   3759    81.8    1137    94.5    42      99.8    52.3    4639    153801  2062    1015
90%     90%     2.0     4.5     11478   3834    3033    20493   1536    92.5    517     97.5    29      99.8    45.2    4002    147162  861     359
0%      70%     5.0     3.8     3401    1962    1687    24805   17730   28.5    15163   38.9    5604    77.4    83.8    7426    266757  2858    1318
25%     70%     5.0     3.8     4261    2533    2142    24610   13683   44.4    9608    61.0    734     97.0    78.5    6956    229366  3961    2153
35%     70%     5.0     3.8     4778    2885    2431    24507   11606   52.6    7299    70.2    328     98.7    75.9    6728    215972  3962    2285
50%     70%     5.0     3.8     5455    3330    2787    24357   8200    66.3    4063    83.3    85      99.7    71.7    6351    197765  4164    1960
70%     70%     5.0     3.8     6199    3819    3152    24196   5042    79.2    1567    93.5    58      99.8    64.9    5749    187239  3187    1440
90%     70%     5.0     3.8     7031    4269    3359    23984   2235    90.7    734     97.0    29      99.9    56.5    5007    178146  1566    588

Assignment of GO annotation with structural similarity parameters (Rmsd ≤ 5.0 Å, Z-score ≥ 3.8, Rnar ≥ 70%, Rseq ≥ 90%).
[Figure legend] Red dot denotes newly annotated chains; red arrow denotes new “GO term - chain” associations assigned for newly annotated chains. Purple line denotes new “GO term - chain” associations assigned for chains previously annotated (by EBI).
Black arrow denotes existing “GO term - chain” associations assigned by EBI.

PDB chains (34,698):
25,247 annotated by EBI
588 chains annotated by EBI with added new GO terms (this work)
5,007 newly annotated chains (this work)
3,856 not annotated anywhere

"GO term - chain" associations (335,322):
154,986 annotated by EBI
178,675 new associations for newly annotated chains (this work)
1,661 new associations added for chains annotated by EBI (this work)

New added "GO term - chain" associations for previously annotated chains (1,661): Process 52%, Function 42%, Cellular component 6%
New "GO term - chain" associations (178,675): Function 53%, Process 33%, Cellular component 14%

The example of a “negative” cluster by the definition of specificity-1 and a “positive” cluster by the definitions of specificity-2 and specificity-3. Seven GO terms could be assigned to chains 1h9dA, 1h9dC (Rmsd ≤ 5.0 Å, Z-score ≥ 3.8, Rnar ≥ 70%, Rseq ≥ 90%).

Chains                                                              GO terms
1e50A 1e50C 1e50E 1e50G 1e50Q 1e50R 1cmoA 1co1A 1ljmA 1ljmB   (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1hjbC 1hjbF 1hjcA 1hjcD 1io4C 1eanA 1eaoA 1eaoB 1eaqA 1eaqB   (4) 3677, 5524, 5634, 6355
1h9dA 1h9dC                                                         no GO terms

3677 (F) - DNA binding
3700 (F) - transcription factor activity
5524 (F) - ATP binding
5634 (C) - nucleus
6355 (P) - regulation of transcription, DNA-dependent
7275 (P) - development
8151 (P) - cell growth and/or maintenance

All chains in the cluster belong to the same protein, Runt-related transcription factor 1 (synonyms: core-binding factor alpha subunit, acute myeloid leukemia 1 protein etc.).

The example of a “positive” cluster by the definition of specificity-1. Phospholipase A2.

Chains                                                              GO terms
1cl5A 1cl5B 1fb2A 1fb2B 1fv0A 1fv0B 1jq8A 1jq8B 1jq9A 1jq9B   (5) 4623, 5509, 15070, 16042, 16787
1kpmB                                                               no GO terms

4623 (F) - phospholipase A2 activity
5509 (F) - calcium ion binding
15070 (F) - toxin activity
16042 (P) - lipid catabolism
16787 (F) - hydrolase activity

Only four “negative” clusters have occurred by the definition of specificity-3.
An example of missed GO terms for 2mtaC and other chains of Cytochrome c-L (cytochrome c551i):

Chains                                            GO terms
2mtaC                                             (3) 5489, 6118, 15945
1mg2D 1mg2H 1mg2L 1mg2P 1mg3D 1mg3H 1mg3L 1mg3P   (2) 16021, 16032

5489 (F) - electron transporter activity
6118 (P) - electron transport
15945 (P) - methanol metabolism
16021 (C) - integral to membrane
16032 (P) - viral life cycle
#4 [BONUS]: Why is ontology so important for humans?

Evolution of complex systems:
Computers: complexity doubles every 18 months per $$$ (Moore’s Law)
Human Brain: very slow (complexity doubles in ~100,000 years)

System                   Short Term Storage   Long Term Storage   Speed       Cost
PC cluster (256 units)   65 GB                5 TB                256 GFLOP   $130K
Human Brain (Average)    57 TB                1137 TB             4.4 TFLOP   $130K

Complexity = Speed x Memory
Computer = 5 TB x 256 GFLOP = 10^24 memory FLOPs
Brain = 1137 TB x 4.4 TFLOP = 5x10^27 memory FLOPs
Brain/Computer = 5x10^3 or 3.7 log units
Moore’s Law: 3.5 years/log unit
Human brain capacity for computers will be reached: 2000 + 3.7 x 3.5 = 2013
Based on (Ramsey, 1997)

The accuracy of predicting the future for the next 2 years equals 10%

Credits:
Julia Ponomarenko (she did #2 and #3)
Phil Bourne (discussions, conceptualizations, logistics)
Lei Xie (PDB statistics)
NIH Grant GM63208
NSF Grants DBI 9808706, DBI 0111710
Gift from Ceres Inc.
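
The complexity estimate in the #4 slides (Complexity = Speed x Memory) can be reproduced with a few lines of Python. This is a sketch of the slide arithmetic only: the variable names are hypothetical, and decimal units (1 TB = 10^12, 1 GFLOP = 10^9) are assumed. It shows that the slide rounds the computer figure of 1.28x10^24 down to 10^24 before taking the ratio, and that the exact ratio (about 3.9x10^3) leads to the same year once rounded.

```python
import math

TB, GFLOP, TFLOP = 1e12, 1e9, 1e12  # assumed decimal units

computer = 5 * TB * 256 * GFLOP     # long-term storage x speed
brain = 1137 * TB * 4.4 * TFLOP

ratio_slide = brain / 1e24          # slide rounds the computer to 10^24
ratio_exact = brain / computer      # without that rounding

log_units = round(math.log10(ratio_slide), 1)  # slide: 3.7 log units
years_per_log_unit = 3.5                        # figure quoted on the slide
year_slide = 2000 + log_units * years_per_log_unit
year_exact = 2000 + math.log10(ratio_exact) * years_per_log_unit

print(f"computer = {computer:.2e}, brain = {brain:.2e}")
print(f"ratio (slide rounding) = {ratio_slide:.0f}, exact = {ratio_exact:.0f}")
print(f"year (slide) = {year_slide:.2f}, year (exact ratio) = {year_exact:.2f}")
```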