Exploiting Structural and Comparative Genomics to Reveal Protein Functions

- How many domain families can we find in the genomes and can we predict the functions of relatives?
- Exploiting protein structure to predict protein functions
- Using correlated phylogenetic profiles based on CATH domains to reveal functional associations

CATH: Domain families of known structure
Gene3D: Protein families and domain annotations for completed genomes

CATHEDRAL
Oliver Redfern and Andrew Harrison
- Combines a rapid graph theory secondary structure filter with dynamic programming for accurate residue alignment
- An SVM is used to combine scores and assess the significance of a match
- CATH version 3.0: 1100 fold groups, 2100 homologous superfamilies, 86,000 domains

[Figure: Fold Recognition Performance. % correct fold (0.8 to 1) against rank (0 to 25) for CATHEDRAL, CE, DALI, LSQMAN, STRUCTAL, SSAP and SSAP DDP]

Gene3D: Domain annotations in genome sequences
- >2 million protein sequences from 300 completed genomes and UniProt
- Scan against a library of HMM models (~2000 CATH, ~9000 Pfam)
- Assign domains to CATH and Pfam superfamilies
- Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs

Gene3D: Domain annotations in genome sequences
- DomainFinder: structural domains from CATH take precedence

[Figure: protein chain from N terminus to C terminus with assigned domains Pfam-1, CATH-1, NewFam and Pfam-2]

[Figure: percentage of all domain family sequences against domain families ranked by size (number of domain sequences), for CATH superfamilies of known structure, Pfam families of unknown structure and NewFam families of unknown structure]
- ~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families
- <100 families account for 50% of domain sequences of known fold

[Figure: a structural superfamily (CATH) divided into subfamilies of relatives F1 to F5; relatives within a subfamily are likely to have similar functions]
- Only ~3% of diverse sequences in large CATH domain families have known structures

Gene3D: Domain mappings
for 300 Completed Genomes
- 300 genomes, >2 million sequences including UniProt and RefSeq
- Structural domain assignments from CATH
- Functional domain assignments from Pfam
- Iterative Profile Search Methodology
- Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct
http://www.biochem.ucl.ac.uk:8080/Gene3D
Russell Marsden, Corin Yeats, Michael Maibaum, David Lee
Nucleic Acids Res. 2006; Yeats et al., Nucleic Acids Res. 2006.

Conservation of enzyme function in homologous domains with same multidomain architecture (MDA) in Gene3D

[Figure: two proteins, Protein 1 and Protein 2, sharing the domain architecture Pfam-1, CATH-1, NewFam, Pfam-2]

[Figure: Function conservation of domains in same architectures. Conservation of EC (%) to 3 levels (3rd number level EC string match), 0 to 100, against sequence identity (11-20 to 91-100), with one series per domain overlap level: no overlap, then 10% to 100% overlap in steps of 10%]

Sequence identity thresholds for 95% conservation of enzyme function (to 3 EC levels)
- 332 highly conserved families
- 60 highly variable families

[Figure: number of families (number of superfamilies, 0 to 200) and number of sequences (number of domain relatives, 1 to 1,000,000, log scale) against sequence identity thresholds, 11-20% to 91-100%]
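The "conservation of EC to 3 levels" measure used on the slides above can be illustrated with a minimal sketch. It assumes that conservation means agreement on the first three fields of the EC number and that pairs are grouped into 10% sequence identity bins as on the figure axis. The function names, the treatment of unassigned fields ("-") and the example pairs are hypothetical and are not taken from Gene3D.

```python
import math
from collections import defaultdict


def ec_conserved(ec_a: str, ec_b: str, levels: int = 3) -> bool:
    """Return True if two EC numbers agree on their first `levels` fields.

    Assumption: an unassigned field ('-') within those levels counts as
    not conserved, because agreement cannot be confirmed.
    """
    a = ec_a.split(".")[:levels]
    b = ec_b.split(".")[:levels]
    if len(a) < levels or len(b) < levels or "-" in a or "-" in b:
        return False
    return a == b


def identity_bin(pct: float) -> str:
    """Map a percentage identity to a 10% bin label such as '21-30'."""
    hi = max(10, math.ceil(pct / 10) * 10)
    return f"{hi - 9}-{hi}"


def conservation_by_bin(pairs):
    """Percentage of pairs with conserved EC (3 levels) per identity bin.

    `pairs` is an iterable of (percent_identity, ec_a, ec_b) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [conserved, total]
    for pct, ec_a, ec_b in pairs:
        tally = counts[identity_bin(pct)]
        tally[0] += ec_conserved(ec_a, ec_b, levels=3)
        tally[1] += 1
    return {b: 100.0 * c / n for b, (c, n) in counts.items()}


# Hypothetical homologous domain pairs, for illustration only.
example_pairs = [
    (25.0, "1.1.1.1", "1.1.1.2"),    # same to 3 levels
    (28.0, "1.1.1.1", "2.7.1.1"),    # different at level 1
    (45.0, "3.4.21.4", "3.4.21.5"),  # same to 3 levels
    (45.5, "3.4.21.4", "3.4.22.1"),  # different at level 3
    (95.0, "2.7.1.1", "2.7.1.2"),    # same to 3 levels
]
result = conservation_by_bin(example_pairs)
print(result)
```

A threshold for 95% conservation, as on the last slide above, would then be the lowest identity bin at and above which every bin reaches 95%. That step is left out here because it depends on how the bins are pooled.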
Exploiting protein structure to predict protein functions

Conservation of Enzyme Function in CATH Domain Families

[Figure: structural similarity (SSAP) score (40 to 100) against pairwise sequence identity (0 to 100%), for pairs with the same function and pairs with different functions]

Correlation of structural variability with number of different functional groups

[Figure: COGs vs SSGs. Number of COGs (0 to 90) against number of structural sub-groups (number of diverse structural clusters within family, 0 to 60); series 0-25, 25-50, 50-75 and 75-100; P-loop hydrolases labelled (COG-270, SSG-67)]

Some families show great structural diversity
Gabrielle Reeves
- Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments
- 2DSEC algorithm
- In 117 superfamilies, relatives are expanded by 2-fold or more
- These families represent more than half the genome sequences of known fold

Structural embellishments can modify the active site
- Galectin binding superfamily

Structural embellishments can modulate domain interactions
- Side orientation, face orientation
- Glucose 6-phosphate dehydrogenase, dihydrodipicolinate reductase
- Additional secondary structures shown at (a) are involved in subunit interactions

Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
- ATP-grasp superfamily: biotin carboxylase, D-alanine-D-alanine ligase
- Dimer of biotin carboxylase

Secondary structure insertions are distributed along the chain but aggregate in 3D

[Figure: frequency (%) against size of indel (number of secondary structures, 1 to 12); labelled values for indels with frequency < 1%: 0.85%, 0.38%, 0.23%, 0.11%, 0.06%, 0.02%]
- 85% of residue insertions comprise only 1
or 2 secondary structures
- 60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments
- In 80% of domains, 1 or more embellishments contact other domains or subunits

~80% of variable families adopt regular layered architectures
- 2 Layer Alpha Beta Sandwich
- 2 Layer Beta Sandwich
- Alpha / Beta Barrel
- 3 Layer Alpha Beta Sandwich

Function prediction to Guide Target Selection for Structural Genomics

[Figure: a structural superfamily (CATH) divided into subfamilies F1 to F5 of close relatives with the same MDA; relatives within a subfamily are likely to have similar functions]
- Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures

Conservation of Enzyme Function in Homologous Domains

[Figure: % frequency of conservation of EC levels (0 to 100) against structure similarity (SSAP) score (50-60 to 90-100), with series Not Conserved, Less than 3 EC, EC3 and EC4]

FLORA: structural templates for assigning structures to functional subgroups in CATH
- Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily
- Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seeds)
- Explore the local structural environment of seed residues to find conserved structural motifs
- Dataset of 84 enzyme superfamilies in CATH, of which 21 are functionally very diverse

Finding conserved residue positions (seeds) - Scorecons
- Multiple sequence alignment of relatives from a functional family, guided by structure alignment
- Identify the most highly conserved residue positions (seed positions) using Scorecons (Valdar and Thornton, 2001)

FLORA Algorithm for Identifying Structural Homologues with Similar Functions
- assign conserved sequence seeds
- expand to local environment of 12 Å
- new structures are scanned against a library of FLORA templates and SVMs are used to assess the significance of matches
- identify structurally conserved
residue cliques and generate template

Performance of FLORA vs Global Structure Comparison (SSAP)

[Figure: coverage (0 to 1) against error rate (0 to 0.2) for SSAP and FLORA]

Using correlated phylogenetic profiles based on CATH domains to reveal functional associations

Eisenberg Phylogenetic Profiles for Detecting Functional Associations

Presence or absence of superfamily in organism:

                 sp1  sp2  sp3  sp4
Superfamily 1     1    0    1    0
Superfamily 2     1    0    1    0
Superfamily 3     0    0    1    1

Superfamily 1 and Superfamily 2 have the same profile: functionally linked.

Gene3D Phylogenetic Occurrence Profiles

Number of relatives from CATH domain superfamily in organism:

                 sp1  sp2  sp3  sp4
Superfamily 1    12   13   14   11
Superfamily 2    35    0   12   60
Superfamily 3     6    0    0    0

Phylogenetic Occurrence Profiles Based on Domain Superfamily and Subfamilies in Gene3D

[Figure: a superfamily subdivided into 30%, 40% and 50% sequence identity clusters]

Phylogenetic Profiles for Families and Subfamilies
Juan Ranea and Corin Yeats
- Domains clustered at different levels of sequence similarity: superfamily, 30%, 40%, 50%, 60%, … 100%

Phylogenetic occurrence profile matrix:

             Sp1  Sp2  Sp3  Sp4   …   Spn
Cluster 1     3    3    5    7    …    5
Cluster 2     0    2    4    5    …    4
Cluster 3     1    0    1    0    …    1
Cluster 4     0    2    0    0    …    6
Cluster 5     1    0    2    1    …    0
Cluster 6     0    3    1    2    …    1
Cluster 7     0    0    0    1    …    2
…             …    …    …    …    …    …
Cluster n     0    1    0    1    …    0

Comparison of Pairs of Phylogenetic Profiles

             Sp1  Sp2  Sp3  Sp4  Sp5   …   Spn
Cluster 1     6    9    6    9    5    …    9
Cluster 2     4    3    7    5    3    …    5
Cluster 3     1    0    1    0    2    …    1
Cluster 4     0    2    0    0    1    …    6
Cluster 5     1    4    1    4    1    …    4
Cluster 6     0    3    1    2    0    …    1
Cluster 7     4    8    4    8    4    …    8
…             …    …    …    …    …    …    …
Cluster n     0    1    0    1    1    …    0

[Figure: occurrence profiles across Sp1 to Spn plotted for pairs of clusters: Cluster 1 with Cluster 2, Cluster 1 with Cluster 5 (Euclidean distance E1) and Cluster 1 with Cluster 7 (Euclidean distance E2)]

Euclidean distance: E1 >> E2

Statistical Significance of Correlated Pairs (Comparison against 3 randomised models)

[Figure: distribution (0 to 80) over Pearson correlation coefficient bins, from (-0.3)-(-0.2) to (0.9)-(1.0), for the real matrix and random matrices I, II and III]

Domain Associations Network from 13 Eukaryotes
- Actin & VCP-like ATPases
- DNA replication and repair
- Chaperones and cytoskeleton
- DNA topoisomerase & elongation factor G

[Figure: DNA topoisomerase & Elongation Factor G. Number of domain relatives (0 to 10) against species (1 to 13)]

Highly correlated profiles correspond to pairs of families with significant similarity in GO functions

[Figure: biological processes. Frequency of significant GO semantic similarity scores (%Frq and %Sum_SS/Frq, 0 to 60) against distances of correlated profile scores, binned from (0)-(1) to (>=19)]

Summary
- On average 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam
- Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain-based homologies
- Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved
- Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families

Acknowledgements
CATH: Lesley Greene, Alison Cuff, Ian Sillitoe, Tony Lewis, Mark Dibley, Oliver Redfern, Tim Dallman
Gene3D: Corin Yeats, Sarah Addou, Russell Marsden, David
Lee, Alastair Grant, Ilhem Diboun, Juan Garcia Ranea

http://www.biochem.ucl.ac.uk/bsm/cath_new

Medical Research Council, Wellcome Trust, NIH, EU-funded Biosapiens, EU-funded Embrace, BBSRC
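
Worked sketch: comparing phylogenetic occurrence profiles

The profile comparison described on the "Comparison of Pairs of Phylogenetic Profiles" and "Statistical Significance of Correlated Pairs" slides can be illustrated with a few lines of Python. This is a minimal sketch, not the Gene3D implementation. The profile values are read column by column from the flattened example matrix on the slide and use only the six species columns that are shown. Pairing E1 with Clusters 1 and 5 and E2 with Clusters 1 and 7 is an assumption based on the order of labels in the figure. The function names are hypothetical.

```python
import math

# Occurrence profiles (number of domain relatives per species).
# Assumption: values are read column-wise from the slide's example matrix,
# using only the columns shown (Sp1-Sp5 and Spn); elided species are omitted.
profiles = {
    "Cluster 1": [6, 9, 6, 9, 5, 9],
    "Cluster 2": [4, 3, 7, 5, 3, 5],
    "Cluster 5": [1, 4, 1, 4, 1, 4],
    "Cluster 7": [4, 8, 4, 8, 4, 8],
}


def euclidean(p, q):
    """Euclidean distance between two occurrence profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pearson(p, q):
    """Pearson correlation coefficient between two occurrence profiles."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / math.sqrt(var_p * var_q)


e1 = euclidean(profiles["Cluster 1"], profiles["Cluster 5"])
e2 = euclidean(profiles["Cluster 1"], profiles["Cluster 7"])
r_17 = pearson(profiles["Cluster 1"], profiles["Cluster 7"])
r_12 = pearson(profiles["Cluster 1"], profiles["Cluster 2"])

print(f"E1 (Cluster 1 vs 5) = {e1:.2f}, E2 (Cluster 1 vs 7) = {e2:.2f}")
print(f"Pearson r (Cluster 1 vs 7) = {r_17:.2f}, (Cluster 1 vs 2) = {r_12:.2f}")
```

On these values the sketch reproduces the slide's E1 >> E2. The Pearson coefficient is high for Cluster 1 against Cluster 7 and close to zero for Cluster 1 against Cluster 2. The Euclidean distance is sensitive to the absolute number of relatives, whereas the correlation coefficient responds only to the shape of the profile.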