Protein Domains In: Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, Wiley Interscience (in press, 2005) Jaap Heringa Centre for Integrative Bioinformatics (IBIVU) Faculty of Sciences / Faculty of Earth and Life Sciences Vrije Universiteit Amsterdam, The Netherlands Email: heringa@cs.vu.nl Tel: +31 20 4447649 Fax: +31 20 4447653 1 KEYWORDS: Protein, domain, domain structure, protein module, domain motion, domain linker, domain prediction, domain database ABSTRACT Many cellular processes involve proteins comprised of multiple domains. The modular nature afforded by breaking up a protein’s three-dimensional structure into discrete domains has many evolutionary advantages. First, it allows proteins to evolve to complex cooperative functions that require elaborate, large and stable structural scaffolds, which would be impossible to form otherwise within the structural constraints associated with the protein folding process. Second, it can bring active sites in proximity in a fixed stoichiometric ratio. Third, otherwise unstable intermediates can be accommodated within inter-domain clefts. Fourth, the flexibility afforded by protein domain linker regions, e.g. hinge regions, can greatly aid the activity of the cooperative catalytic process carried out by a multi-domain enzyme. Fifth, the repeated and combinatorial use of domains in multi-domain proteins has provided nature both an evolutionary economic and versatile system to maintain and enhance cellular activity. INTRODUCTION 2 Protein domains are independent or semi-independent folding units of protein structure that serve as structural building blocks of proteins. Their size varies from 36 amino acids in the protein E-selectin to 692 residues in lipoxgenase-1 (Jones et al., 1998). Most domains observed in nature have less than 200 residues, with an average domain size of about 100 residues. While smaller domains of less than 50 residues are often structurally reinforced by disulfide bonds or coordinated metal ions, large domains comprising more than 300 residues are expected to contain multiple hydrophobic cores. Domains are genetically mobile units, and multi-domain families are found throughout the three kingdoms (Archaea, Bacteria and Eukarya). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multi-domain proteins created as a result of gene duplication events (Apic et al., 2001). Genetic mechanisms influencing the layout of multi-domain proteins include gross rearrangements such as inversions, translocations, deletions and duplications, homologous recombination, and slippage of DNA polymerase during replication (Bork et al., 1992). Although genetically conceivable, the transition from two single domain proteins to a two-domain protein requires that both domains fold correctly and, that they accomplish to bury a fraction of the previously solvent-exposed surface area in a newly generated inter-domain surface, the size of which is depending on the constraints imposed by the domain linker region and the scale of domain interaction. Protein structure is intrinsically hierarchic in its internal organization. It progresses from secondary structure elements at the lowest level, via supersecondary structures and via 3 domains to complete proteins and assemblies of proteins at the highest level. At the domain level, the connectivity of the polypeptide backbone between the domains appears to be less important, as the protein retains a stable structure irrespective of the sequential ordering of the domains. Moreover, evolution has led to various forms of domain arrangements across the protein sequence that can be classified by their connectivity (Das and Smith, 2000). These arrangements go beyond sequential permutations of complete domain structures and can affect the internal domain organization. Figure 1 displays the observed modes of connectivity between domains ordered by the degree of complexity. Based on their connectivity, domains can be divided into two main classes: continuous domains, where the associated coding sequence is an uninterrupted stretch of amino acids, and discontinuous domains, where the coding sequence is interrupted by subsequences encoding alternative structures, such as inserted domains. An example of a multi-domain structure with a combination of continous and discontinuous domains is the 3-domain structure Pyruvate kinase, a member of the phosphotransferase family, of which the structure and connectivity is given in Figure 2. A protein module has been defined as a continuous domain structure that is structurally independent enough to serve as a modular plug-in to many multi-domain proteins. Modules are believed to have spread largely by exon shuffling and are observed in many different multi-domain proteins with permuted or altogether different domain combinations. The fact that modules often show the N- and C-termini in close proximity appears to facilitate their addition to other protein structures, since their insertion in loop regions would not require gross rearrangements. 4 A structurally viable mechanism for forming oligomeric domain assemblies is provided by domain swapping (Bennett et al., 1995). In domain swapping, a secondary or tertiary element of a monomeric protein is replaced by the same element of another protein. Domain swapping can range from secondary structure elements to whole structural domains in bringing together single-domain or multi-domain structures. It also represents a model of evolution for functional adaptation by oligomerization, e.g. of oligomeric enzymes that have their active site at sub-unit interfaces (Heringa and Taylor, 1997). Predicting domain boundaries in multidomain proteins A large number of structural classification methods have been developed to group the large number of protein domain folds deposited in the Protein Data Bank (PDB). The most widely used structural databases are CATH (Orengo et al., 1997), SCOP (Lo Conte et al., 2000), FSSP (Holm and Sander, 1996) and 3Dee (Siddiqui et al., 2001). Links to web servers for these databases are provided in Table 1. For each database, a unique protocol is followed to classify the protein structures at the domain level. Since a common operational definition of a domain is that of a compact globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983), two characteristics are commonly exploited in boundary prediction methods: extent of isolation and compactness (Tsai and Nussinov, 1997). Measures of local compactness have been used in a number of recent domain assignment methods (Holm and Sander, 1994; Islam et al., 1995; Sowdhamini and Blundell, 1995; Siddiqui and Barton, 1995; Nicols et al., 1995; Zehfus, 1997; Hinsen et al., 1999; Taylor, 1999; Xu et 5 al., 2000; Siddiqui et al., 2001; Guo et al., 2003), albeit these approaches generally cannot handle discontinuous or closely interacting domains very well. As a result, Hadley and Jones (1999) observed discrepancies between assignments made by the above structural domain databases. A large number of profile and HMM methods are available to delineate the domain structure in a query sequence. These methods make use of domain classifications in protein domain sequence databases, and use the deposited sequences of each domain family to derive a profile or HMM. Each profile or HMM is then used to scan the query sequence, after which the highest scoring ones are used to delineate the putative domains and their boundaries. Recent developments for these search techniques is the addition of contextual information such as co-occurrence of domains in a sequence (Coin et al., 2003) and the taxonomic distribution (Coin et al., 2004). Although a very difficult task, some ab initio computational methods are also available for the prediction of domain boundaries from protein sequence information alone. The method SnapDRAGON (George and Heringa, 2002a) is primarily based on the hydrophobic collapse during folding and folds a 3D protein structure based on the notion of conserved hydrophobicity and information from secondary structure prediction methods. The method takes as input a multiple alignment of the query sequence and a set of homologues, as well as the predicted secondary structure. It then builds 100 models of the protein structure, which each are assigned domain boundaries using the aforementioned method of Taylor (1999). A final prediction of the domain boundaries is 6 achieved by assessing the consistency of the domain boundaries over the 100 predicted folds and by determining statistically significant boundaries . Domain boundary identification by comparative sequence analysis is performed by the method DOMAINATION (George and Heringa, 2002b), which is based on the homology search method PSI-BLAST (Altschul et al., 1997) to identify homologues to a query sequence. The distribution of N- and C-terminal positions of the local alignments obtained by PSI-BLAST is used to identify putative domain boundaries by a process of chopping and joining domains and domain segments in an attempt to reconstruct the evolutionary loss and gain of domains. The method follows an iterative protocol of onthe-fly cutting and joining of domain fragments and subjecting newly formed fragments to PSI-BLAST. It is able to recognize continuous and discontinuous domains and increases the accuracy of PSI-BLAST by about 15%. Domain fusion Domains in multi-domain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multi-domain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993). As an example of gene fusion, vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes glycinamide ribonucleotide synthetase (GARs), aminoimidazole ribonucleotide synthetase (AIRs), and GAR transformylase (GARt) 1. In insects, the corresponding polypeptide encodes the four-domain structure GARs-(AIRs)2-GARt. However, GARsAIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded separately (Henikoff et al., 1997). 7 Marcotte et al. (1999) introduced the ‘Rosetta Stone’ computational method for inferring protein function and interactions from genome sequences based on the observation that some pairs of interacting proteins have homologs in another organism fused into a single protein chain. A comparison of sequence homologs from multiple organisms can reveal these fused sequences, which are called Rosetta Stone sequences because the existence of these multi-domain proteins implies the putative interactions between the separate protein pairs in other organisms. Another genomic method to predict domain function is phylogenetic profiling (Pellegrini et al., 1999). The method checks for simultaneous presence or absence of pairs of genes across a wide spectrum of genomes, and thereby reveals sets of proteins that are likely to have coevolved and thus can be expected to act in the same cellular process. The protein phylogenetic profiling and the Rosetta Stone methods have been very useful in predicting the function of large numbers of genes without experimental functional annotation. Domain evolution and gene duplication The evolution and spread of domains through different protein families is likely to be the result of gene duplication, leading to one or more paralogous genes. After such duplication events, the selection pressure becomes less for the paralogous copies of the gene, so that these are freer to manufacture adaptive functions over time. For more recently evolved protein domains, a correspondence between the sequential boundaries of 8 exons and those of protein domains is observed. It is therefore believed that exon shuffling during evolution has enabled organisms to develop new proteins with new functions by the addition of exons from other parts of the genome that form new domains as part of an existing protein. A multi-domain architecture offers the potential of reuse trough domain repeats. Domain duplication is essential for the formation of the active sites of enzymes such as in the aspartic and serine proteinases, porphobilinogen deaminase or the binding sites of transport proteins such as lactoferrin, transferrin and sulfate binding protein. Domain repeats also result in duplication of active sites or binding sites such as the calcium binding sites in calmodulin or the 9 DNA-binding zinc fingers of the transcription factor TFIIID. Recent methods to delineate repeats in protein sequences are RADAR (Holm and Heger, 2000), REPRO (George and Heringa, 2000; Romein et al., 2003) and TRUST (Szklarczyk and Heringa, 2004). Taylor et al. (2002) devised a method based on Fourier analysis to automatically recognize repeated substructures in protein structures. Non-orthologous displacement An increasing number of cases are identified of non-orthologous displacement (Koonin et al., 1996), where enzymes carrying out an identical function in different organisms belong to entirely different protein families, and thus are not expected to show any sequence similarity. Examples of non-orthologous displacement include ornithine decarboxylase in E. coli and S. serevisiae, where the isozymes speF and speC are 9 responsible for this function in E. coli and share the same structure comprising three domains (ornithine decarboxylase N-terminal “wing” domain, PLP-dependent transferase and ornithine decarboxylase C-terminal domain). The corresponding enzyme spe1 in S. cerevisiae is a two-domain protein with entirely different domain structures (PLP-binding barrel and alanine racemase-like domain). A problem with non-orthologous displacement is that it cannot be recognised by sequence comparison methods, which are aimed at recognising divergent evolution by mutation and insertions/deletions, including changes of gene structure by gene fusion or fission. However, the techniques are not able to trace evolutionary cases of horizontal gene transfer or functional displacement of one gene by another one within a genome. Suppose that a set of proteins is studied that correspond to a linear metabolic pathway, and that all proteins are found for a given organism, except for one corresponding to an intermediate step. It can then be expected that the missing activity is performed by some hitherto undetected protein, in which case sequence comparison techniques can lead to the identification of the missing protein by a homology search using a homologous domain sequence in another genome that has the appropriate functional annotation. In the absence of any putative homolog, non-orthologous gene displacement could be suspected, in which case direct sequence analysis is of no use. However, functional ontologies, such as GO (Ashburner et al., 2000), can aid the process, in that various nonorthologous sequences can be identified using a GO functional descriptor. This leads to a direct candidate whenever a protein domain is identified in the target genome. Otherwise, if the ontology identifies a putative non-orthologous gene with the desired function in 10 another genome, a homology search using the latter gene can reveal putative nonorthologous genes for the intermediate metabolic step in the query organism. In the case of a prokaryotic organism, the fact can be exploited that gene sequences in close proximity in the genome tend to be functionally related. Given a confirmed protein domain sequence associated with the above metabolic pathway, the neighbourhood of its homologs in complete prokaryotic genomes can be analysed using the STRING server (Snel et al., 2000). The latter server has been used, for example, for the re-annotation of the Mycoplasma pneumoniae genome (Dandekar, et al., 2000). Homology regions As a result of a homology search operation, for example by using the program PSIBLAST (Altschul et al., 1997), regions with significant similarity between otherwise unrelated proteins in different species can abound. Such so-called homology regions can be sustained within different sequence contexts, and often add a specific functionality to their constituent proteins. In some cases, the function of regional homology domains can be inferred by comparing the properties of the host proteins, when experimental functional annotation is available. However, also the detection of these domains in uncharacterized sequences can provide valuable clues for the protein’s function. In other cases where the biological role of a local homology domain is not known, the delineation and comparison of these regions might still be useful e.g. for finding conserved residues as targets for mutagenesis or for structural studies. Assessing functional specificity of domain families 11 Predicting or assessing the functional specificity of a given domain sequence is possible when a set of homologues is available for a query sequence, and the homologues are suspected to have a roughly identical function but with subtle variations (e.g. dehydrogenase domain structures acting on different substrates). Insight can then be gained by aligning the members of the family and deriving a phylogenetic tree. If the family is organised into a tree that correlates well with the functions described (or derived) for the proteins, then one can draw relatively reliable conclusions regarding the query’s function. If the query falls into a clear subgroup of the tree, one can assume as a function that of the branch. If, however, the functional annotation of the branch holding the query is heterogeneous, or the query falls in between two branches, then one should assume a less specific description in accordance with the neighbouring branches (e.g. “alcohol dehydrogenase”), or resort to the generic functional description of the whole family (e.g. “dehydrogenase”). The transfer of function from one protein to another should only be done when the protein sequences match over their total lengths in the pairwise sequence alignment, without any large unaligned sequence fragments that might correspond to an additional domain. If some fragments do not align, one has to carefully check whether the function that is transferred really corresponds to the part that is common. The use of domain databases and methods for domain identification (vide supra) can be of great help at this point. If a query sequence contains a domain that is not matched in the pairwise alignment, this should be noted. Conversely, if the protein used as source of the annotation contains an 12 unmatched domain, the annotation should just state the presence of the domain(s) of the query without transferral of the complete functional description of the source. Protein sequence clustering Protein sequence cluster databases group related proteins together and are derived automatically from protein sequence databases using different clustering algorithms. Each database uses a different clustering algorithm aimed to group the sequences generally in homologous families and superfamilies. Assignment of a query sequence without annotation to any of the clusters can provide valuable clues about the protein’s function. Since they are derived automatically from protein sequence databases without manual crafting and validation of family discriminators, these databases are relatively comprehensive, although the biological relevance of clusters can be ambiguous and can sometimes be an artifact of particular thresholds. Multidomain protein sequences are a particular source of error, in that the clustering algorithm can establish a link between two sequences A and B based on a particular domain, while a third sequence C might be linked to sequence B based on another domain sequence. If unchecked, this would lead to an incorrect clustering of sequences A, B and C. Table 1 provides a number of links to websites of protein sequence clustering databases. Interaction domains A large and growing body of evidence suggests that proteins involved in the regulation of cellular events such as the cell cycle, signal transduction, protein trafficking, targeted proteolysis, cytoskeletal organization and gene expression are typically multi-domain 13 proteins comprising a catalytic domain and one or more interaction domains. Interaction domains induce the interaction of signaling polypeptides and specific multi-protein complexes. For example, they can couple a cell surface receptor to an intracellular biochemical pathway that controls the cellular responses to an external signal. Interaction domains can be involved in a series of protein-protein interactions, designed to localise signaling proteins to appropriate subcellular locations or to determine the enzyme specificity. A well studied example is the family of protein kinases in relation to their substrates. Typically, protein-protein interaction domains are relatively small and independently folding modules of 35-150 amino acids, that are able to fold in isolation from their host proteins while retaining their intrinsic ability to bind their physiological partners. Their N- and C-termini are usually close together in space, while their ligandbinding surface lies on the opposite face of the domain. This arrangement facilitates their evolutionary use as a modular plug-in since it allows the domain to be easily inserted into a host protein while projecting its ligand-binding site to contact other polypeptides. Protein-protein interaction domains can be clustered into separate families, based either on their sequence or ligand-binding properties. Interaction domains are often used repeatedly by a large number of multi-domain proteins that display very different domain structures, where they mediate a particular type of molecular recognition. 1. An important class of interaction domains recognises peptides that have been post-translationally modified. These modifications include phosphorylation, acetylation or methylation. The dynamic control of cellular behavior exerted by 14 covalent protein modifications therefore appears largely mediated by interaction domains that regulate the associations of signaling protein cascades. a. Phosphorylation: i. An important member of this class is the SH2 domain, which is anticipated to be encoded in the human genome at least 120 times. A large number of cytoplasmic proteins contain one or two SH2 domains that directly recognize phosphotyrosine-containing motifs that are found for example on surfaces of activated receptors for cytokines, growth factors and antigens. SH2 domains all recognize phosphotyrosines, but have different preferences for amino acids immediately following the phosphorylated residue, allowing a degree of specificity in signaling by tyrosine kinases. ii. Phosphotyrosine-containing motifs are also recognized by the PTB domain, a very different structure than the SH2 domain. PTB domains are found as constituents of binding proteins such as the IRS-1 substrate of the insulin receptor. iii. Phosphoserine and phosphothreonine motifs are targeted by a growing family of interaction domains, including FHA domains, 14-3-3 proteins and WD40-repeat domains. Each of the subfamilies bind and proteins contained therein specific phosphoserine/threonine motifs, and thus mediate the biological activities of protein-serine/threonine kinases. 15 b. In addition to phosphorylation, acetylation and methylation have recently been implicated in inducing modular protein-protein interactions. For example, acetylation or methylation of lysine residues on the surfaces of histone structures respectively create effective binding sites for Bromo and Chromo domain constituents of multi-domain proteins involved in chromatin remodeling. 2. A large group of interaction domains bind proline-rich motifs, including SH3, WW and EVH1 domains. The complexes that are recognised by these domains are therefore not dependent on post-translational modifications, and hence are more stable partners than those that, for example, need to be phosporylated to enable binding by SH2 domains. SH3 domains interact with a large number of different domain types and are therefore referred to as "promiscuous" domains. 3. PDZ domains bind the C-termini of their interaction partner domains, ion channels and receptors, in a fashion that appears important for the localization of their targets to particular subcellular sites, as well as for downstream signaling. In addition to binding short C-terminal peptide motifs, PDZ domains can form heterodimers. 4. In addition to binding specific peptide motifs, a growing number of interaction modules are known to recognize a variety of phospholipids, particularly phosphoinositides (PI). For example, PH domains are constituents of lipid kinases and phosphatases, where they affect cellular functioning by binding promiscuously to either the phosphoinositide PI-4,5-P2 or PI-3,4,5-P3. Based on their phospholipid-binding, PH domains can localise signaling proteins to specific 16 subregions of the plasma membrane, where they regulate the enzymatic activities of their host proteins. A predominant feature of interaction domains is their apparent versatility. Though PTB domains were originally discovered based on their ability to bind phosphorylated tyrosines within Asn-Pro-X-Tyr (NPXY) motifs, many PTB domains appear to be able also to recognize unphosphorylated NPXY-related peptides. It therefore appears that PTB domains originally evolved to bind unphosphorylated peptides, but have subsequently developed a capacity to recognize phosphotyrosine in relatively few specific cases. Furthermore, some individual PTB domains, e.g. those within the FRS-2 and Numb proteins, have evolved to recognize two quite different peptide ligands. Although many interaction domains bind very different substrates, a number of them have similar tertiary structures, such as for example the PTB, EVH1 and PH domains, which respectively recognize phosphotyrosine-containing peptide motifs, proline-rich peptides and phospholipids (vide supra). The PTB/PH/EVH1 fold is shared by other types of interaction domains as well. It seems therefore that this domain fold provides a stable framework that can be used for modulating very many different intermolecular interactions. On the other hand, many signaling proteins contain a number of different interaction domains to yield a multi-domain protein that can mediate multiple proteinprotein and protein-phospholipid interactions. This modular organization of signaling proteins can target proteins to the appropriate site within the cell, and direct binding to cell surface receptors and downstream targets. 17 A number of bioinformatics approaches have been developed for the prediction of protein–protein interactions based on the correspondence between the phylogenetic trees of interacting proteins (Goh et al., 2000; Pazos and Valencia, 2001) and detection of correlated mutations between pairs of proteins (Pollock and Taylor, 1997; Pazos et al., 1997: Pollock et al., 1999). Spacer domains and domain linkers A number of protein domains are thought to have a purely structural role, allowing functional domains to be advantageously situated or providing the overall protein with the necessary flexibility. It must be kept in mind that a domain may be considered to have a purely structural role simply because knowledge about its true function is missing. The immunoglobulin (Ig) family provides examples of this, as the Ig constant heavy 1 and 2 domains and the Ig constant light domain act as “spacer” domains to allow the Ig variable light and heavy domains to function optimally. Multi-functional enzymes are generally composed of a number of discrete domains that are connected by inter-domain linkers. A number of site-directed mutagenesis experiments have shown that linkers play a crucial role in the enzymatic activity of multidomain complexes (George and Heringa, 2003). In general, mutating or altering the length of linkers has been shown to affect protein stability, folding rates and domaindomain orientation. The linkers provide dynamic structural bridges for the movement of ligands across the modules. Many linkers act as rigid spacers separating two domains, and provide a scaffold to prevent unfavorable interactions between folding domains. George and Heringa (2003) found that the two main types of linker in natural proteins: 18 helical linkers and non-helical linkers. Helical linkers derive their rigidity from their helical conformation, while non-helical linkers are rich in proline residues, which also leads to structural rigidity. They should be flexible, keeping domains apart while allowing them to move as part of their catalytic function. Linkers should furthermore be invulnerable to host proteases, as they are often the targets for degradation. Domain movements Domain motions are important for a large repertoire of functions including ligand binding, catalysis, regulation, formation of protein assemblies and transport of metabolites (Gerstein et al., 1994). Often, the binding site is situated at the interface of a pair of domains in the ‘closed’ conformation, corresponding to the inactive form of the associated multi-domain enzyme, while the ‘open’ conformation of the domains exposes the binding site and thus provides the active form of the enzyme. Domain motions are often responsible for the mechanism of ‘induced fit’ observed in protein-protein recognition. Protein flexibility often corresponds to large relative movements of domains. In aspartic proteinases, for example, the two main domains forming the lobes of the structure show large relative shifts that are crucial for the activity of the constituent enzyme (Sali et al., 1992). Another example is the so-called swivelling domain in pyruvate phosphate dikinase (Herzberg et al., 1996), which brings an intermediate enzymatic product over about 45 Å from the active site of one domain to that of another. Such movements have been determined by a conformational analysis of protein structures, where the 19 identification of for example hinge axes and their corresponding rotation angles permits a useful representation of protein domain movements (Gerstein et al., 1994). Three main mechansisms of domain movements are discernible (Gerstein et al., 1994): 1. Intrinsic flexibility: These movements leads to deformations of the domain structure. Several small movements in secondary structures which have a cumulative effect. These motions are due to small changes in backbone torsion angles of strands and/or helices, kinks involving prolines in helices as in adenylate kinase (Gerstein et al., 1994), or interconversion of helix and extended conformations as in calmodulin (Ikura et al. 1992). 2. Hinged domain movements: Of special interest are hinge-bending movements, where rigid domains are connected by flexible joints which tether the domains and constrain their movement. Hinge-bending is believed to allow an induced fit of molecular surfaces in protein assembly and ligand docking. These movements are caused by rotations of a domain around a pivotal point, often afforded by a linker region in between domains that changes its conformation to allow the rotation. For example, to attain icosahedral symmetry, the S and P domains of tobacco bushy stunt virus perform hinge movement mainly by dihedral angle changes in the peptide that forms a linker between the two domains (Olsen et al., 1983). In lactoferrin, the open and closed forms of the two domains involve a 53 degree rotation which is achieved by conformational changes within two strands that connect the protein’s two domains (Jameson et al., 1998). 20 3. Ball-and-socket motion: The variable (light and heavy chain) domains of antibodies rotate about 50 degrees with respect to the constant (light and heavy chain) domains by means of a combination of hinge and shear motions (Lesk and Chothia, 1988). The method HINGEFIND (Wriggers and Schulten, 1997) identifies domain movements and characterizes and visualizes the effective rotation axes (hinges). The program compares a pair of known structures (e.g. two different crystal structures of a protein or the results of molecular dynamics or Monte Carlo simulations) and partitions the protein with a prespecified tolerance in preserved subdomains. It then determines effective rotation axes which characterize the domain movements with respect to the reference domain ("rigid core"). Hayward and Berendsen (1998) developed a method to analyse comparative domain movements over larger numbers of structures, which is useful, for example, for studying trajectories of molecular dynamics simulations. CONCLUSION The analysis and prediction methods that elucidate the evolutionary development and function of domains and multi-domain complexes have recently been enriched by the integration of genomic and dynamic contextual information. Many successes are to be derived from the ongoing developments in spawning an integration web between genomics, phylogenetic, sequence, structure, structural dynamics, and function data. This will shed more light on the evolution of the structural and functional versatility of protein 21 domains, which will benefit biotechnical 22 and biomedical applications. REFERENCES Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402. Apweiler R et al. (2000) InterPro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics. 16, 1145-1150. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37–40. Ashburner M et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25, 25-29. Bennet MJ, Schlunegger MP, Eisenberg D (1995) 3D domain swapping: a mechanism for oligomeric assembly. Protein sci., 4, 2455-2468. 23 Coin L, Bateman A and Durbin R (2003) Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc. Natl. Acad. Sci. USA, 100, 4516-4520. Coin L, Bateman A and Durbin R (2004) Enhanced protein domain discovery using taxonomy. BMC Bioinformatics, 5, 56-65. Dandekar T, Huynen M, Regula JT, Zimmermann CU, Ueberle B, Andrade MA, Doerks T, Sánchez-Pulido L, Snel B, Suyama M, Yuan YP, Herrmann R and Bork P (2000) Reannotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames. Nucleic Acids Research, 28, 3278-3288. Das S and Smith T (2000) Identifying nature’s protein LEGO set. Adv. Protein Chem., 54, 159-183. Davidson J, Chen K, Jamison R, Musmanno L and Kern C (1993) The evolutionary history of the first three enzymes in pyrimidine biosynthesis, Bioessays, 15, 157-164. George RA and Heringa J (2000) The REPRO server: finding protein internal sequence repeats through the Web. Trends in Biochem. Sci. 25, 515-517. George RA and Heringa J (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST, Proteins Struct. Func. Gen. 48, 672-681. 24 George RA and Heringa J (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851. George RA and Heringa J (2003) An analysis of protein domain linkers: their classification and role in protein folding, Prot. Engineer. 15, 871-879. George RA, Kleinjung J and Heringa J (2003) Predicting protein structural domains from sequence data. In: Bioinformatics and Genomes - Current Perspectives. Pp. 1-26. Horizon Scientific Press, Norfolk, England. ISBN 1-898486-47-6. Gerstein M, Lesk A and Chothia C (1994). Structural mechanisms for domain movements in proteins. Biochemistry 33, 6739-6749. Goh CS, et al. (2000). Co-evolution of proteins with their interaction partners. J. Mol. Biol. 299, 283–293. Guo JT, Xu D, Kim D and Xu Y (2003) Improving the performance of DomainParser for structural domain partition using neural network, Nucl. Acids Res., 31, 944-952. Hadley C and Jones D (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure Fold. Des., 7, 1099-1112. 25 Hayward S and Berendsen HJC (1998) Systematic Analysis of Domain Motions in Proteins from Conformational Change; New Results on Citrate Synthase and T4 Lysozyme, Protein, Struct. Func. Genet., 30, 144-154. Heger A and Holm L (2000) Automatic detection and alignment of repeats in protein sequences. Prot. Struct. Func. Genet., 41, 224-237 Henikoff S, Greene E, Pietrokovski S, Bork P, Attwood T and Hood L (1997) Gene families: the taxonomy of protein paralogs and chimeras, Science, 278, 609-614. Heringa J and Taylor WR (1997) Three-dimensional domain duplication, swapping and stealing. Curr. Opin. Struct. Biol., 7, 416-421. Hinsen K, Thomas A, Field MJ (1999) Analysis of domain motions in large proteins, Prot. Struct. Func. Genet., 34, 369-382. Holm L and Sander C (1994) Parser for protein folding units. Proteins Struct. Func. Genet., 19, 231-234. Holm L and Sander C (1998) Dictionary of recurrent domains in protein structures. Prot. Struct. Func. Genet., 33, 88-96. 26 Ikura M, Clore GM, Gronenborn AM, Zhu G, Klee CB and Bax A (1992) Solution Structure of a Calmodulin-Target Peptide Complex by Multidimensional NMR. Science 256, 632-638. Islam S, Luo J and Sternberg M (1995) Identification and analysis of domains in proteins. Prot. Engin., 8, 513-525. Jameson GB, Anderson BF, Norris GE., Thomas DH, Baker BN (1998) Structure of human apolactoferrin at 2.0-A resolution. Refinement and analysis of ligand-induced conformational change, Acta Crystallogr. D, 54, 1319-1335. Janin J and Wodak S (1983) Structural domains in proteins and their role in the dynamics of protein function. Prog. Biophys. Molec. Biol. , 42, 21-78. Koonin EV, Mushegian AR, Bork P (1996). Non-orthologous gene displacement. Trends Genet., 12, 334-336. Lesk, AM and Chothia C (1988) Elbow motion in the immunoglobulins involves a molecular ball-and-socket joint Nature, 355, 188-190. Lo Conte L, Ailey B, Hubbard TJ, Chothia C (2000) A structural classification of proteins database, Nucl.Acids Res., 28, 257-259. 27 Marcotte EM, Pellegrini M, Ng H-L, Rice DW, Yeates T and Eisenberg D (1999) Detecting Protein Function and Protein-Protein Interactions from Genome Sequences, Science, 285, 751-753. Nicols WL, Rose GD, Ten Eyck LF and Zimm BH (1995) Rigid domains in proteins: an algorithmic approach to their identification, Prot. Struct. Func. Genet., 23, 38-48. Olsen AJ, Bricogne G and Harrison SC (1983). Structure of tomato bushy stunt virus IV. The virus particle at 2.9 A resolution J. Mol. Biol. 171: 61-93. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB and Thornton JM (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093-1108. Pazos F, Valencia A (2001). Similarity of phylogenetic trees as indicator of proteinprotein interaction. Protein Eng. 14, 609–614. Pazos F, et al. (1997). Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271, 511–523. Pellegrini M., Marcotte EM, Thompson MJ, Eisenberg D and Yeates TO (1999) Functions by Comparative Genome Analysis: Protein Phylogenetic Profiles, Proc. Natl. Acad. Sci. USA, 96, 4285-4288. 28 Pollock, DD and Taylor WR (1997) Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Prot. Engin., 10, 647-657. Pollock, DD, Taylor WR and Goldman N (1999) Coevolving protein residues: maximum likelihood identefication and relationship to structure. J. Mol. Biol., 287, 187-198. Romein JW, Heringa J, and Bal HE (2003) A million-fold speed improvement in genomic repeats detection, SuperComputing'03, Phoenix, AZ, November, 2003. Schultz J, Copley RR, Doerks T, Ponting CP, and Bork P (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231-234 Sidiqui A and Barton G (1995) Continuous and discontinuous domains – an algorithm for the automatic generation of reliable protein domain definitions. Prot. Sci., 4, 872-884. Sidiqui AS, Dengler U and Barton GJ (2001) 3Dee: a database of protein structural domains. Bioinformatics, 17, 200-201. Snel B, Lehmann G, Bork P, and Huynen MA (2000) STRING: A web-server to retrieve and display the conserved neighbourhood of genes. Nucleic Acids Res., 28, 3442-3444. 29 Sowdhamini R and Blundell TL (1995) An automatic method involving cluster analysis of secondary structures for the identification of domains in proteins, Prot. Sci., 4, 506520. Szklarczyk R and Heringa J (2004) Tracking repeats using significance and transitivity. Bioinformatics 20 Suppl. 1, i311-i317. Tsai CJ and Nussinov R (1997) Hydrophobic folding units derived from dissimilar monomer structures and their interactions. Protein Sci., 6, 24-42. Taylor WR (1999) Protein structure domain identification. Prot. Engin. 12, 203-216. Taylor WR, Heringa J, Baud F and Flores TP (2002) A Fourier analysis of symmetry in proteins, Protein Engin. 15, 79-89. Wriggers W and Schulten K (1997) Protein Domain Movements: Detection of Rigid Domains and Visualization of Effective Rotations in Comparisons of Atomic Coordinates. Proteins Struct., Funct. Genet., 29, 1-14. Xu Y, Xu D, Gabow HN (2000) Proetin domain-decomposition using a graph-theoretic approach, Bioinformatics, 16, 1091-1104. 30 Zehfus M (1997) Identification of compact, hydrophobically stabilized domains and modules containing multiple peptide chains. Protein Sci., 6, 1210-1219. 31 Table 1. WWW addresses of domain-related databases of interest Name WWW Sequence-based classification databases: InterPro http://www.ebi.ac.uk/interpro/ PROCLASS http://www-nbrf.georgetown.edu/gfserver/proclass.html PIR http://www-nbrf.georgetown.edu/pirwww/pirhome.shtml PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/ PFAM http://www.sanger.ac.uk/Pfam/ SMART http://smart.embl-heidelberg.de/ ProtFam List http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-family.html MetaFam http://metafam.ahc.umn.edu/ ProDom http://prodes.toulouse.inra.fr/prodom/doc/prodom.html Structural classification databases: SCOP http://scop.mrc-lmb.cam.ac.uk/scop FSSP http://www2.embl-ebi.ac.uk/dali/fssp/fssp.html CATH http://www.biochem.ucl.ac.uk/bsm/cath new/index.html 3Dee http://www.compbio.dundee.ac.uk/3Dee/ Protein sequence clustering databases: SYSTERS http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform ProtoMap http://www.protomap.cs.huji.ac.il/ CluSTr http://www.ebi.ac.uk/clustr/ COG http://www.ncbi.nlm.nih.gov/COG Domain linker databases: LinkerDB http://ibivu.cs.vu.nl/programs/linkerdbwww/ Protein domain motions databases: ProtMotDB http://hyper.stanford.edu/~mbg/ftp/ProtMotDB/ProtMotDB.main.html 32 FIGURE CAPTIONS Figure 1. Schematic illustration of different types of backbone connectivity in multidomain protein structures (right) and their coding peptide sequences (left). Note that continuous domains are encoded by uninterrupted sequences, whereas discontinuous domain sequences show inserted sequence fragments. Figure 2. Pyruvate kinase structure and domain connectivity. (a) The structure shows three domains: domain 1 (discontinuous) is a barrel regulatory domain; domain 2 is a discontinuous / barrel catalytic substrate binding domain; and domain 3 corresponds to a continuous / nucleotide binding domain. (b) Schematic representation of the corresponding amino acid sequence, showing the segmental arrangement of the three domains. The intercalated and interlaced structures have N- and C-termini indicated (designated ‘N’ and ‘C’, respectively) for clarity. 33 Figure 1 Sequence Structure Single Continuous domains Concatenated Intercalated C N C Interlaced N = linker peptide 34 Discontinuous domains Figure 2. a) Domain 1 Domain 2 Domain 3 b) 12 41 42 115 116 218 219 N 384 385 529 C Domain 3 Domain 2 Domain 1 35