Module 5

Protein modules
Aims



Objectives


Introduction
We need to consider the evolutionary history of genes and the proteins they
encode before we look at how protein domains can be studied. When the amino acid
sequences of two proteins are compared and found to exhibit significant similarity, they are
assumed to be evolutionarily related, i.e. they are homologues. We can distinguish two classes
of homologue (orthologues and paralogues) by considering the genes that encode them. Orthologous
genes are descended from a single ancestral gene, and their divergence in
different organisms simply parallels speciation. In contrast, paralogous genes are
descended from copies of a gene that was duplicated within a single ancestral genome.
It is now widely accepted that a substantial proportion of all proteins are composed of
more than one domain. A domain is defined as sequentially consecutive residues in a protein
that can fold up independently of other parts of the protein. Crystallographers commonly refer
to domains as folds and the term module is also sometimes used. Some people would go as far
as saying that the domain is the fundamental unit of protein structure, since inter-domain
splicing, fusion, deletion, duplication and shuffling have occurred frequently during
evolution, whereas intra-domain rearrangements have occurred rarely (Saier, 1996).
It was clear from Module 4 that when two homologous proteins are aligned, there are
one or more regions where sequence identity is very high, and these constitute motifs or
signature sequences. Any particular domain may have one or more characteristic motifs.
Domains, motifs and signature sequences constitute the content of many secondary databases
and are of enormous value in attempting to predict the function and structure of new proteins.
Low complexity regions
The individual domains of multidomain proteins are frequently separated from each other by
regions of low complexity, also referred to as linker sequences. Long stretches of repeated
residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences.
The program SEG is designed to detect such low complexity regions and can be
used as part of BLAST to mask off segments of the query sequence that have low
compositional complexity. Filtering can eliminate statistically significant but biologically
uninteresting reports from the BLAST output (e.g. hits against common acidic-, basic- or
proline-rich regions), leaving the more biologically interesting regions of the query sequence
available for specific matching against database sequences.
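The idea of masking by compositional complexity can be illustrated with a short sketch. This is not the SEG algorithm itself: it simply measures the Shannon entropy of the residue composition in a sliding window and masks windows that fall at or below a cut-off. The function names, the window length of 12 and the threshold of 2.2 bits are illustrative assumptions, not values taken from this module.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_regions(seq, window=12, threshold=2.2):
    """Return merged half-open (start, end) spans covered by low-entropy windows.

    window and threshold are illustrative assumptions, not SEG's own procedure.
    """
    spans = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            if spans and i <= spans[-1][1]:
                # Overlaps or touches the previous span, so extend it.
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans


def mask(seq, spans, char="X"):
    """Replace every residue inside the given spans with the mask character."""
    out = list(seq)
    for start, end in spans:
        out[start:end] = char * (end - start)
    return "".join(out)


query = "MKTAYIAKQR" + "P" * 12 + "DVLGHEWNCF"
print(mask(query, low_complexity_regions(query)))
```

Because whole windows are masked, the masked span extends a few residues beyond the proline run itself; a smaller window or a stricter threshold would trim this at the cost of missing shorter biased regions.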
Coiled-coils
Coiled-coils are another structural feature of proteins that sometimes separates domains. Both
the COILS server and MULTICOIL predict such regions in query proteins.
Signal peptides
SignalP predicts signal peptides and their cleavage sites in Gram-positive, Gram-negative and
eukaryotic amino acid sequences. http://www.cbs.dtu.dk/services/SignalP/caution.html
Transmembrane segments
Transmembrane segments of proteins are comparatively easy to predict and can be of value in
separating intracellular and extracellular domains of a protein. There are a number of
programs for analysing protein queries for such segments, including:
PredictProtein includes prediction of transmembrane helix location and topology amongst
much else (uses PHDhtm and PHDtopology algorithms)
TMAP predicts transmembrane segments on multiply aligned sequences
Tmpred makes a prediction of membrane-spanning regions and their orientation using a
combination of several weight-matrices for scoring.
DAS uses the Dense Alignment Surface method to predict transmembrane helices
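To see why transmembrane segments are comparatively easy to predict, it helps to look at the simplest possible approach: a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle hydropathy scale and is not the method used by any of the programs listed above, which rely on more elaborate algorithms. The function names are hypothetical, and the window of 19 residues and cut-off of 1.6 are commonly quoted starting values, used here as assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full-length window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]  # KeyError on non-standard residues
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, cutoff=1.6):
    """Return merged half-open (start, end) spans of windows scoring >= cutoff.

    window and cutoff are assumed starting values, not tuned parameters.
    """
    spans = []
    for i, mean in enumerate(hydropathy_profile(seq, window)):
        if mean >= cutoff:
            if spans and i <= spans[-1][1]:
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans


query = "MKRSDEQNGS" + "LLIVALLIVALLIVALLIVAL" + "KRDESNQGKR"
print(candidate_tm_segments(query))
```

A plain hydropathy average says nothing about the orientation of a helix in the membrane, which is one reason the dedicated programs above are preferred for real analyses.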
Secondary (pattern) databases
Two approaches that frequently help to establish the function and/or structure of an
unknown protein involve deriving either motifs or profiles from the primary databases. Analysis
of the primary protein sequence databases, primarily through the generation of multiple sequence
alignments, has led to the identification of sequence patterns (or motifs) common to homologous
proteins. These motifs, typically 10-20 amino acids in length, usually correspond to key
functional or structural elements and are extremely useful in identifying such features in new,
uncharacterized proteins. There are a number of such secondary databases, in which the
information has been derived from different primary databases by different analytical
methods. All these databases are based, though, on the same principle. The sequence of an
unknown protein is often too distantly related to any protein of known sequence for the
resemblance to be detected by overall sequence alignment, but the protein can potentially be
identified by the occurrence in its sequence of a particular cluster of amino acid residues,
variously known as a pattern, motif, signature, block or fingerprint. Usually the motifs do not
overlap, but are separated along a sequence, though they may be contiguous in 3D space.
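Scanning a sequence for a motif can be sketched in a few lines. The example below assumes the motif is written in PROSITE-style pattern syntax (elements separated by hyphens, x for any residue, square brackets for allowed residues, curly brackets for forbidden residues, and round brackets for repeat counts) and translates it into a regular expression. The function names are hypothetical and the translation is deliberately simplified; for instance, it does not handle anchors placed inside square brackets.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_motif(seq, pattern):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # A lookahead lets the scan restart at every position of the sequence.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]


# Pattern commonly given for N-glycosylation sites: N, not P, S or T, not P.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_motif("MKNGSAPNPTQ", "N-{P}-[ST]-{P}."))
```

A short pattern such as this one matches many unrelated proteins by chance, which illustrates why a hit against a single motif should be treated as a hint rather than as proof of function.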
Analysis of the primary protein sequence databases and the production of multiple
sequence alignments can also lead to the construction of profiles. Profiles are scoring tables
that summarize the information in an alignment. The profile determines which residues are
allowed at each point, which residues are conserved or degenerate, which positions can
tolerate insertions etc. Unknown proteins can then be scored against the profile to see if they
fit.
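A minimal sketch of a profile, assuming an ungapped alignment of equal-length segments: each column is turned into log-odds scores for the 20 amino acids, and an unknown sequence is scored by sliding the table along it. The uniform background frequency, the pseudocount of 1 and the function names are assumptions made for illustration. Real profiles also use position-specific gap penalties to model insertions, which this sketch omits.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a log-odds scoring table from equal-length, ungapped aligned segments."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_match(seq, profile):
    """Slide the profile along seq and return (score, start) of the best window."""
    width = len(profile)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the profile")
    scored = [
        (sum(profile[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scored, key=lambda pair: pair[0])


# Toy alignment invented for this example.
toy_alignment = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]
toy_profile = build_profile(toy_alignment)
print(best_match("MAAGKSTLQE", toy_profile))
```

Conserved columns give large positive scores to the residue seen in the alignment and mildly negative scores to everything else, whereas degenerate columns spread smaller scores over several residues. This is how a profile records which positions are conserved and which are variable.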
There are a number of programs that allow an unknown protein to be searched
against databases of motifs, profiles, or both. Some commonly used programs
are listed below:
Pfam is a collection of multiple alignments and profile hidden Markov models of protein
domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL.
SMART (a Simple Modular Architecture Research Tool) allows the identification and
annotation of genetically mobile domains and the analysis of domain architectures. More than
400 domain families found in signalling, extracellular and chromatin-associated proteins are
detectable. These domains are extensively annotated with respect to phyletic distributions,
functional class, tertiary structures and functionally important residues. Each domain found in
a non-redundant protein database as well as search parameters and taxonomic information are
stored in a relational database system. User interfaces to this database allow searches for
proteins containing specific combinations of domains in defined taxa.
PROSITE is a database of protein families and domains. It consists of biologically significant
sites, patterns and profiles that help to reliably identify to which known protein family (if any)
a new sequence belongs.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs
used to characterise a protein family; its diagnostic power is refined by iterative scanning of a
SWISS-PROT/TrEMBL composite database. Fingerprints can encode protein folds and
functionalities more flexibly and powerfully than can single motifs, full diagnostic potency
deriving from the mutual context provided by motif neighbours.
BLOCKS is a database of blocks, which are short multiply aligned ungapped segments
corresponding to the most highly conserved regions of proteins. The rationale behind searching
a database of blocks is that information from multiply aligned sequences is present in a
concentrated form, reducing background and increasing sensitivity to distant relationships.
This information is represented in a position-specific scoring table or "profile", in which
each column of the alignment is converted to a column of a table representing the frequency of
occurrence of each of the 20 amino acids.
IDENTIFY: motifs are derived from the BLOCKS and PRINTS databases.
INTERPRO SEARCH: integrated search in PROSITE, Pfam, ProDom, PRINTS and SWISS-PROT+TrEMBL.
CD-SEARCH at NCBI employs the reverse position-specific BLAST algorithm to search the
Conserved Domain Database (CDD), which is at present composed of SMART and Pfam, plus
contributions from colleagues at NCBI.
Exercises
The slr0228 gene of the cyanobacterium Synechocystis sp. PCC 6803 encodes a multidomain homologue of the E. coli
protein FtsH. Carry out the following tasks:
1. Retrieve the protein sequence from NCBI using Entrez.
2. Analyse the FtsH sequence for transmembrane segments using PredictProtein.
3. Analyse the FtsH sequence for coiled-coils using the COILS server and MULTICOIL.
Are there any differences between the predictions? Is this protein likely to have any
coiled-coil regions?
4. Analyse the domain structure of the FtsH homologue using Pfam, IDENTIFY and
SMART. Are there any differences between the predictions?
References and Links
Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer ELL (1999) Pfam 3.1:
1313 multiple alignments match the majority of proteins. Nucleic Acids Research 27, 260-262.
Saier MH (1996) Phylogenetic approaches to the identification and characterization of protein
families and superfamilies. Microbial and Comparative Genomics 1, 129-150.
Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: a Web-based tool for the
study of genetically mobile domains. Nucleic Acids Research 28, 231-234.