Module 5 Protein modules

Aims

Objectives

Introduction

Before we look at how protein domains can be studied, we need to consider the evolutionary history of genes and the proteins they encode. When the amino acid sequences of two proteins are compared and found to exhibit significant similarity, they are assumed to be evolutionarily related, i.e. they are homologues. We can distinguish two classes of homologue, orthologues and paralogues, by considering the genes that encode them. Orthologous genes are descended from a single ancestral gene, and their divergence from the corresponding genes in other organisms simply parallels speciation. In contrast, paralogous genes are descended from copies of a gene that duplicated within a single ancestral genome.

It is now widely accepted that a substantial proportion of all proteins are composed of more than one domain. A domain is defined as a stretch of sequentially consecutive residues in a protein that can fold independently of the other parts of the protein. Crystallographers commonly refer to domains as folds, and the term module is also sometimes used. Some people would go as far as saying that the domain is the fundamental unit of protein structure, since inter-domain splicing, fusion, deletion, duplication and shuffling have occurred frequently during evolution, whereas intra-domain rearrangements have occurred rarely (Saier, 1996).

It was clear from Module 4 that when two homologous proteins are aligned, there are one or more regions where sequence identity is very high, and these constitute motifs or signature sequences. Any particular domain may have one or more characteristic motifs. Domains, motifs and signature sequences constitute the content of many secondary databases and are of enormous value in attempting to predict the function and structure of new proteins.
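The idea that motifs appear as short stretches of very high identity within an alignment can be illustrated with a small sketch. The function names, the window width, the threshold and the two aligned sequences below are all made up for illustration; real motif discovery works on multiple alignments and uses substitution scores rather than simple identity.

```python
def window_identity(seq_a, seq_b, width=5):
    """Fraction of identical, non-gap positions in each window of a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # True where both sequences carry the same residue (gap characters never count)
    matches = [a == b and a != "-" for a, b in zip(seq_a, seq_b)]
    return [sum(matches[i:i + width]) / width
            for i in range(len(matches) - width + 1)]


def conserved_windows(seq_a, seq_b, width=5, threshold=0.8):
    """Start positions (0-based) of windows whose identity reaches the threshold."""
    return [i for i, frac in enumerate(window_identity(seq_a, seq_b, width))
            if frac >= threshold]


# Hypothetical toy alignment: a conserved core flanked by divergent residues
aligned_a = "ACDGPPGKTEFH"
aligned_b = "MCEGPPGKTLYH"

if __name__ == "__main__":
    for start in conserved_windows(aligned_a, aligned_b, width=5, threshold=1.0):
        print(start, aligned_a[start:start + 5])
```

Windows that pass the threshold mark candidate motif positions; lowering the threshold or widening the window trades specificity for sensitivity.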
Low complexity regions

The individual domains of multidomain proteins are frequently separated from each other by regions of low complexity, also referred to as linker sequences. Long stretches of repeated residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences. The program SEG (see below) is designed to detect such low complexity regions and can be used as part of BLAST to mask off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g. hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Coiled-coils

Coiled-coils are another structural feature of proteins which sometimes separates domains. The COILS server permits the prediction of such regions in query proteins, as does MULTICOIL.

Signal peptides

SignalP predicts signal peptides and cleavage sites in Gram-positive, Gram-negative and eukaryotic amino acid sequences.
http://www.cbs.dtu.dk/services/SignalP/caution.html

Transmembrane segments

Transmembrane segments of proteins are comparatively easy to predict and can be of value in separating the intracellular and extracellular domains of a protein. There are a number of programs for analysing protein queries for such segments. They include:

PredictProtein includes prediction of transmembrane helix location and topology amongst much else (uses the PHDhtm and PHDtopology algorithms).
TMAP predicts transmembrane segments on multiply aligned sequences.
Tmpred makes a prediction of membrane-spanning regions and their orientation using a combination of several weight matrices for scoring.
DAS uses the Dense Alignment Surface method to predict transmembrane helices.

Secondary (pattern) databases

Two approaches which frequently help to establish the function and/or structure of an unknown protein involve deriving either motifs or profiles from the primary databases.

Analysis of the primary protein sequence databases, primarily through the generation of multiple sequence alignments, has led to the identification of sequence patterns (or motifs) common to homologous proteins. These motifs, typically 10-20 amino acids in length, usually correspond to key functional or structural elements and are extremely useful in identifying such features in new, uncharacterized proteins. There are a number of such secondary databases, in which the information has been derived from different primary databases by different analytical methods. All these databases are based, though, on the same principle. The sequence of an unknown protein is often too distantly related to any protein of known sequence for the resemblance to be detected by overall sequence alignment, but the protein can potentially be identified by the occurrence in its sequence of a particular cluster of amino acid residues, variously known as patterns, motifs, signatures, blocks or fingerprints. Usually the motifs do not overlap but are separated along the sequence, though they may be contiguous in 3D space.

Analysis of the primary protein sequence databases and the production of multiple sequence alignments can also lead to the construction of profiles. Profiles are scoring tables that summarize the information in an alignment. The profile specifies which residues are allowed at each position, which positions are conserved or degenerate, which positions can tolerate insertions, and so on. Unknown proteins can then be scored against the profile to see if they fit.
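The notion of a profile as a scoring table can be made concrete with a minimal sketch. It assumes an ungapped alignment, a uniform background frequency for the 20 amino acids and a simple pseudocount; the function names and the toy alignment are hypothetical, and real profile methods (including those that model insertions) are considerably more elaborate.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a log-odds scoring table (one dict per column) from an ungapped alignment.

    Assumes a uniform background frequency of 1/20 for every amino acid.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # The pseudocount stops unseen residues from scoring minus infinity
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Slide the profile along a sequence; return (best score, 0-based start)."""
    width = len(profile)
    return max((score_segment(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical toy alignment of a short conserved region (made-up data)
toy_alignment = ["GPPGTGKT", "GPPGCGKT", "GSPGTGKS", "GPAGSGKT"]
toy_profile = build_profile(toy_alignment)

if __name__ == "__main__":
    query = "MAEKLVGPPGTGKTLLARA"  # hypothetical query sequence
    score, start = best_hit(toy_profile, query)
    print(start, query[start:start + len(toy_profile)], round(score, 2))
```

Conserved columns give large positive scores to their usual residue and negative scores to everything else, while degenerate columns score several residues moderately, which is how a profile captures more of the alignment than a single consensus sequence would.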
There are a number of programs which allow an unknown protein to be searched against databases of motifs, of profiles, or of both. Some commonly used programs are listed below:

Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL.

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 400 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, as well as search parameters and taxonomic information, is stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite database. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context provided by motif neighbours.

BLOCKS: blocks are short, multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins.
The rationale behind searching a database of blocks is that information from multiply aligned sequences is present in a concentrated form, reducing background and increasing sensitivity to distant relationships. This information is represented in a position-specific scoring table or "profile" (4), in which each column of the alignment is converted to a column of a table representing the frequency of occurrence of each of the 20 amino acids.

IDENTIFY: motifs are derived from the BLOCKS and PRINTS databases.

INTERPRO SEARCH: integrated search in PROSITE, Pfam, ProDom, PRINTS and SWISS-PROT+TrEMBL.

CD-SEARCH at NCBI employs the reverse position-specific BLAST algorithm to search the Conserved Domain Database (CDD), which is at present composed of SMART and Pfam, plus contributions from colleagues at NCBI.

Exercises

The slr0228 gene of the cyanobacterium encodes a multidomain homologue of the E. coli protein FtsH. Carry out the following tasks:

1. Retrieve the protein sequence from NCBI using Entrez.
2. Analyse the FtsH sequence for transmembrane segments using PredictProtein.
3. Analyse the FtsH sequence for coiled-coils using the COILS server and MULTICOIL. Are there any differences between the predictions? Is this protein likely to have any coiled-coil regions?
4. Analyse the domain structure of the FtsH homologue using Pfam, IDENTIFY and SMART. Are there any differences between the predictions?

References and Links

Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer ELL (1999) Pfam 3.1: 1313 multiple alignments match the majority of proteins. Nucleic Acids Research 27, 260-262.

Saier MH (1996) Phylogenetic approaches to the identification and characterization of protein families and superfamilies. Microbial and Comparative Genomics 1, 129-150.

Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: a Web-based tool for the study of genetically mobile domains. Nucleic Acids Research 28, 231-234.