Class, Architecture, Topology or Fold Group, Homologous Superfamily
CATH domain database
Orengo & Thornton 1994

• The CATH domain database and associated resources: DHS, Gene3D
• How do we determine domain boundaries?
• How do we identify fold groups and evolutionary superfamilies?
• What is the distribution of the CATH domain families in the PDB and in the genomes?

Multidomain proteins
• ~20,000 chains from the Protein Data Bank (PDB)
• ~50,000 domains in the CATH structure database
• ~40% of the entries in CATH are multidomain
• Domains are important evolutionary units: analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

[Figure: Carboxypeptidase G2 (1cg2A) and Carboxypeptidase A (2ctc)]
~30% of multidomains in CATH are discontinuous

Algorithms for Recognising Domain Boundaries
• DETECTIVE (Swindells, 1995): each domain should have a recognisable hydrophobic core
• DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones
• PUU (Holm & Sander, 1994): parser for protein folding units; maximal interaction within domains and minimal interaction between domains
Consensus is sought between the three methods; on average this occurs about 20% of the time

[Chart labels: 74%; 29%; Close homologues; 21% Twilight zone; 4% Midnight zone; 11% Homologues/analogues]

Algorithms for Recognising Homologues
Sequence Based Methods
close homologues:
- BLAST (Altschul et al.)
- SSEARCH (Smith & Waterman)
remote homologues:
- SAM-T99 (Karplus et al.)
Structure Based Methods
close & remote homologues:
- CATHEDRAL (Harrison, Thornton, Orengo)
- SSAP (Taylor & Orengo)
- CORA (Orengo)

[Chart labels: 74%; Close homologues, SSEARCH; 29%; 21% Twilight zone, HMMs, SSAP; 4% Midnight zone, CATHEDRAL, SSAP; 11% Homologues/analogues, CATHEDRAL, SSAP]

Hidden Markov Models (HMMs)
• SAM-T99 (Karplus Group)
• SAMOSA (Orengo Group)
[Figure: a query sequence is scanned against the non-redundant GenBank database to collect hits]
These methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)

Percentage of PDB structures classified in CATH by different methods over the last 2 years
[Pie chart. Labels: near-identical (SSEARCH); close homologues (>30%) (SSEARCH); remote homologues (<30%) (HMMs); remote homologues (SSAP), 8.6; analogues, 1.9; novel folds, 2.0. Values shown: 1.9, 8.6, 7.6, 20.7, 59.2, 2.0]

Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years
[Pie chart. Labels: near-identical (SSEARCH); close homologues (>30%) (SSEARCH); remote homologues (<30%) (HMMs); remote homologues (SSAP); analogues (SSAP); novel folds. Values shown: 11.8, 7.7, 8.0, 22.0, 28.4, 22.0]

Structure Based Algorithms for Recognising Homologues
• CATHEDRAL: pairwise alignment, secondary structure comparison
• SSAP: pairwise alignment, residue comparison
• CORA: multiple alignment, residue comparison

[Chart labels: 74%; Close homologues, SSEARCH; 29%; 21% Twilight zone, HMMs; 4% Midnight zone, CATHEDRAL, SSAP; 11% Homologues/analogues, CATHEDRAL, SSAP]

Structure is much more highly conserved than sequence
[Figure: cholera toxin, heat labile enterotoxin, pertussis toxin. Structure similarity (SSAP) scores: 97, 81. Sequence identities: 79%, 12%]

Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families
[Plot: structure similarity (SSAP) score against sequence identity (%), for pairs with the same function and pairs with different function]

• Residue insertions in the loops connecting secondary structures
• Shifts in the orientations of secondary structures

Structural variation in the P-loop Hydrolase Superfamily
Structures shown: Yeast elongation factor complex, Yeast
guanylate kinase, helicase domain of bacteriophage T7, ATP phosphorylase

Structural variation in the Galectin Binding Superfamily

Fast Structure Comparison Method (CATHEDRAL)
Andrew Harrison et al., JMB, 2002
• ignore the variable loop regions and only compare the secondary structures
• derive vectors through secondary structure elements
• compare closest approach distances and vector orientations using graph theory
[Figure: two secondary structure vectors a and b separated by closest approach distance d, with $\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}| \cos\theta$, plus dihedral angle and chirality]

CATHEDRAL: CATH's Existing Domain Recognition ALgorithm
[Figure: protein graph in which each node is a secondary structure (H) and each edge is labelled with distance d, angle, dihedral angle and chirality]
Compares graphs of proteins

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif
[Figure: graphs of two proteins with nodes A, B, C and a, b, c, d; the overlap graph with nodes A,a; B,c; C,d has a structural motif of 3 secondary structures]

Graphs are compared using the Bron Kerbosch algorithm to find the largest common graph
• In this example the common graph contains 5 nodes.
• 1000 times faster than residue based methods (e.g. SSAP)

Performance
• statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures
• $\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size protein1} \cdot \text{size protein2})^{1/2}}$
• scores for unrelated structures exhibit an extreme value distribution:
  $F = A e^{-b \cdot \text{score}}$
  $\log F = \log A - b \cdot \text{score}$
• this allows you to calculate the probability (P-value, E-value) of obtaining any score by chance

Using CATHEDRAL to Identify Domain Boundaries
Graph based secondary structure comparison is very fast: 1000 times faster than residue based methods.
New multi-domain structures can be rapidly scanned against the library of CATH domains.
E-values can be used to identify significant matches.
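The scoring and significance steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the CATHEDRAL implementation. It assumes that the score is the common graph size divided by the geometric mean of the two graph sizes, and that F is the fraction of unrelated comparisons scoring at or above a given score. The function names, the least-squares fit and the example numbers are all hypothetical.

```python
import math


def normalised_score(common_nodes, size1, size2):
    """Common graph size divided by sqrt(size1 * size2) (assumed normalisation)."""
    if size1 <= 0 or size2 <= 0:
        raise ValueError("graph sizes must be positive")
    return common_nodes / math.sqrt(size1 * size2)


def fit_log_linear(scores, freqs):
    """Least-squares fit of log F = log A - b * score; returns (A, b)."""
    xs = list(scores)
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def p_value(score, A, b):
    """Chance probability of a score this high, F = A * exp(-b * score), capped at 1."""
    return min(1.0, A * math.exp(-b * score))


def e_value(score, A, b, library_size):
    """Expected number of chance hits when scanning a library of the given size."""
    return library_size * p_value(score, A, b)


if __name__ == "__main__":
    # Hypothetical background: frequencies observed for unrelated pairs.
    bg_scores = [0.2, 0.4, 0.6, 0.8]
    bg_freqs = [0.5, 0.2, 0.08, 0.03]
    A, b = fit_log_linear(bg_scores, bg_freqs)
    s = normalised_score(common_nodes=5, size1=6, size2=8)
    print(s, p_value(s, A, b), e_value(s, A, b, library_size=1500))
```

The fit is done on the logarithm of the frequencies because the relation is linear in that form, so A and b can be read off the intercept and slope of a straight line.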
85-90% of domains in new multi-domain structures have relatives in CATH

[Figure: CATHEDRAL secondary structure match by graph, followed by SSAP residue alignment. Residues in CATH domain family 1 (Fold A) and residues in CATH domain family 2 (Fold B) are matched against residues in the new multi-domain structure]

SSAP
Taylor & Orengo, J. Mol. Biol. 1989
• residue based structure comparison method using dynamic programming
• scores range from 0-100
[Figure: alignment matrix of residues in protein A against residues in protein B]

CATHEDRAL
One third of known multi-domain structures are discontinuous

Reasons for Structural Similarity
• Divergence
  - similarity arises due to divergent evolution from a common ancestor
  - structure much more highly conserved than sequence
• Convergence
  - similarity due to there being a limited number of ways of packing helices and strands in 3D space

Domain structure database
Orengo & Thornton 1994
Class, Architecture, Topology or Fold Group, Homologous Superfamily
• ~50,000 domains in PDB
• ~1500 domain superfamilies in CATH

CATH domain database (~50,000 domains)
C  Class: 3
A  Architecture: ~36
T  Topology or Fold Group: ~810
H  Homologous Superfamily (Domain Family): ~1500
S  Sequence Family (35%, 60%, 95%)
[Figure labels: 40,000 domain entries; ~50,000 domain entries]

DHS: Dictionary of Homologous Superfamilies
http://www.biochem.ucl.ac.uk/bsm/dhs
• Description of structural and functional characteristics for each superfamily
• Variation in Secondary Structures Across Superfamily
• Functional annotations from GO, EC, COGs, KEGG
• Multiple structure alignments with conserved residues highlighted

http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

Population of CATH
Families and Structural Groups
~50,000 structural domains
S  ~4000 sequence families (35%): cluster proteins with similar sequences
H  ~1,500 homologous superfamilies: cluster proteins with similar structures and functions
T  ~810 fold groups: cluster proteins with similar structures
A  ~36 architectures
C  3 major protein classes

CATH
nearly one third of the superfamilies belong to <10 fold groups
[Figure labels: Arc repressor-like, Up-down, Rossmann fold, SH3-like, OB fold, Immunoglobulin, Alpha-beta plait, Jelly roll, TIM barrel]

CATH numbering scheme
2.40.50.100
Class         2    Mainly beta
Architecture  40   Barrel
Topology      50   OB Fold
Homology      100  Heat labile enterotoxin superfamily

CATH
http://www.biochem.ucl.ac.uk/bsm/cath
Pages shown:
• CATH domain structure database
• CATH class level
• CATH architecture level
• CATH topology or fold group level
• CATH homologous superfamilies in each fold group
• CATH homologous superfamily level
• CATH sequence families (>=35% identity) in each superfamily
• CATH classification information for individual domains
• CATH structural relatives listed for each domain

CATH server
http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl
structural matches and statistics listed for query domain

Expanding CATH with sequence relatives from genomes
• Library of HMMs built for representative sequences from each CATH domain superfamily
• protein sequences from genomes
• Scan against CATH HMM library
• assign domains to CATH superfamilies

Expanding CATH
~1400 Domain Structure
Superfamilies
[Figure: a homologous superfamily (H) with sequence families S1 to S3 (~50,000 sequences, ~4,000 sequence families) is expanded using CATH-HMMs, with sequences added from GenBank, genomes and SWISS-PROT/TrEMBL, to sequence families S1 to S5 (~600,000 sequences, ~24,000 sequence families)]
Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

Gene3D
[Figure labels: Arc repressor-like, Up-down, Four helix bundle, Alpha horseshoe, SH3-type barrel, OB fold, Rossmann fold, Immunoglobulin-like, Jelly roll, Alpha-beta plait, TIM barrel]

Gene3D
http://www.biochem.ucl.ac.uk/bsm/Gene3D
Pages shown:
• CATH domain structure annotations for complete genomes
• Individual genome statistics
• Assignment of sequences to Gene3D protein families
• Functional annotations for individual sequences
• Domain annotations for individual sequences

Summary
• CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB
• These domain families contain over 600,000 domain sequences from the genomes and sequence databases
• Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading

Acknowledgements
Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee, Janet Thornton
http://www.biochem.ucl.ac.uk/bsm/cath
Medical Research Council, Wellcome Trust, NIH, Biotechnology and Biological Sciences Research Council