Protein sequence analysis

Introduction to Sequence Analysis
Protein Sequence Analysis
Part II
Osvaldo Graña
ograna@cnio.es
CNIO Bioinformatics Unit
[web page here]
22 Feb. 2012
Introduction
•Determination of protein/peptide sequences is a basic requirement for biomedical research, including cancer
research. It is absolutely essential for characterising and identifying proteins or peptides.
•The UniProt Knowledgebase (UniProtKB) is a central database of protein sequence and function. It consists of two parts:
UniProtKB/Swiss-Prot, a section of fully manually annotated, non-redundant records built from information extracted
from the literature and curator-evaluated computational analyses, and UniProtKB/TrEMBL, a section of
computationally analysed records awaiting full manual annotation.
More information about UniProtKB: http://www.uniprot.org/help/uniprotkb
Searching against a protein sequence database with NCBI-BLAST2
We are going to search a protein database with a nucleotide query (sequence 2) using NCBI BLAST2
(http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/checkingForVectorContamination/)
and look for similar peptide/protein sequences in UniProtKB/Swiss-Prot. The protein encoded by the query is a real entry in this
database, so we expect to find a perfect match to our test sequence. We also expect to find similar peptide/protein
sequences, either the same protein from closely related animals or closely related proteins.
Because the query is a nucleotide sequence and the database contains proteins, we select the BLASTX option and the
UniProtKB/Swiss-Prot database:
http://www.ebi.ac.uk/Tools/blastall/
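BLASTX compares a nucleotide query with a protein database by translating the query in all six reading frames. The following is a minimal sketch of that translation step only, assuming the standard genetic code; the function names are hypothetical and this is not the BLAST implementation.

```python
# Six-frame translation sketch (standard genetic code assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Map frame number (+1..+3, -1..-3) to its translated sequence."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in sorted(six_frames("CATGGCCTAA").items()):
        print(frame, protein)
```

In this toy query the open reading frame starts at the second base, so only frame +2 yields the peptide beginning with methionine.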
Database
Choose here the databases you wish to search with your query sequence.
Selecting Blast parameters to search
Matrix
You may choose from a list of matrices that cover different evolutionary distances. Substitutions accumulate in
sequences as they diverge during evolution, and each matrix is tailored to a particular
evolutionary distance. The default matrix for BLAST is BLOSUM62 (Blocks Substitution Matrix, 62% identity), which is generally the
most effective of the available matrices for detecting weak protein similarities. PAM (Point Accepted Mutation) matrices are also
traditionally used for amino acid sequences. A PAM matrix with a larger number suits alignments of sequences separated by larger
evolutionary distances, whereas a BLOSUM matrix with a larger number suits sequences with a higher percentage identity.
The default value is BLOSUM62.
Expected threshold
The expected threshold sets the statistical significance threshold for reporting database sequence matches. The default
value is 10, meaning that 10 matches are expected to be found merely by chance. Lower expected thresholds are more
stringent, so fewer chance matches are reported. Raising the expected threshold reports less stringent matches and
is recommended for searches with short sequences: a short query is more likely to occur by chance in
the database than a longer one, so even a perfect ungapped match can have low statistical significance and may not be
reported. Raising the threshold lets you look farther down the hit list and see matches that would normally
be discarded because of low statistical significance. A value of up to 1000 is generally enough to see results.
The default value is 10.
[Figure: E-value scale for judging homology, marked at 0, 10^-5 and 10^-2, with regions labelled Very Reliable, Reliable and Risky.]
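To see why short queries often need a higher expected threshold, the sketch below uses the standard relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length and n the database length. The database size and the figure of 2 bits per identical residue are illustrative assumptions, not values from the slides, and the effective-length corrections applied by BLAST are ignored.

```python
# E-value from a bit score: E = m * n * 2**(-S').
# DB_RESIDUES and BITS_PER_RESIDUE are hypothetical, for illustration only.
DB_RESIDUES = 2e8        # assumed total residues in the database
BITS_PER_RESIDUE = 2.0   # assumed bit score gained per identical residue

def expected_hits(bit_score, query_len, db_len):
    """Number of alignments with at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bit_score)

if __name__ == "__main__":
    for query_len in (10, 100):
        # Best case: the whole query matches perfectly, without gaps.
        best_bits = BITS_PER_RESIDUE * query_len
        e_value = expected_hits(best_bits, query_len, DB_RESIDUES)
        print(f"query length {query_len:3d}: perfect-match E-value = {e_value:.3g}")
```

Under these assumptions a perfect 10-residue match still has an E-value above 1000, so it would be reported only with a raised threshold, while a perfect 100-residue match is far below any usual cut-off.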
Filter
The filter option, if set to true, masks segments of the query sequence that are non-specific for sequence similarity
searches. Filtering can eliminate statistically significant but biologically uninteresting reports from
the output, for example hits against common acidic-, basic- or proline-rich regions, leaving the more biologically interesting
regions of the query sequence available for specific matching against database sequences. Filtering is applied only to the
query sequence, not to database sequences. Nucleotide query sequences are filtered with the DUST program,
written by R. L. Tatusov and D. J. Lipman. Low-complexity regions in amino acid
query sequences are filtered with the SEG program, written by J. C. Wootton and S. Federhen.
The default is true.
Default Filters (When Filter set to true):
* BLASTp: SEG
* BLASTx: SEG
* BLASTn: DUST
N.B. If you have "UniProt Clusters 100% (SEG filtered)" selected, you will not be able to set a filter, as a filter is already applied.
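The idea behind low-complexity filtering can be illustrated with a simple sliding-window entropy mask. This is a simplified stand-in, not the SEG or DUST algorithm; the window size, the entropy threshold and the function names are illustrative assumptions.

```python
# Simplified low-complexity masking by Shannon entropy (NOT the SEG algorithm).
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = True
    return "".join("X" if flag else aa for aa, flag in zip(seq, masked))

if __name__ == "__main__":
    query = "MKTAYIAKQRDE" + "P" * 12 + "GHILNSTVWYCF"
    print(mask_low_complexity(query))
```

The proline-rich stretch is masked together with a few flanking residues, while the compositionally diverse ends stay available for matching.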
Drop off
This is the amount a score must drop before extension of word hits is halted.
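A minimal sketch of the drop-off rule for an ungapped extension to the right of a word hit. The +1/-1 match/mismatch scores and the function name are assumptions made for illustration; BLAST scores protein extensions with the substitution matrix and also extends to the left.

```python
# Ungapped extension with a drop-off (X-drop) rule, toy scoring.
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend rightwards; stop once the score falls more than x_drop below the best.

    Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # the score has dropped too far: halt the extension
    return best, best_len

if __name__ == "__main__":
    print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=2))
```

With a larger drop-off the extension is allowed to cross longer mismatching stretches before it is abandoned.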
Open gap
The gap open penalty is the score subtracted when a gap is initiated in the alignment. To make matches more
significant you can try increasing the gap open penalty. This decreases the number of gaps, and a good alignment
without many gaps will have a higher Z-score.
The default is 11.
Extend gap
The gap extension penalty is added to the gap open penalty for each base or residue in the gap. This is how long gaps are
penalised: to discourage long gaps, increase the gap extension penalty. Usually you expect a few long gaps rather than
many short gaps, so the gap extension penalty should be lower than the gap open penalty. An exception is where one or both
sequences are single reads with possible sequencing errors, in which case you would expect many single-base gaps. You can
allow for this by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.
The default is 1.
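As a small worked example of the two penalties, the sketch below assumes the convention described above: a gap costs the open penalty plus the extension penalty for each position in the gap. The function name is hypothetical, and other programs may count the first gap position differently.

```python
# Affine gap cost: open penalty plus extension penalty per gapped position.
def gap_cost(length, gap_open=11, gap_extend=1):
    """Total penalty for one gap of the given length (0 means no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

if __name__ == "__main__":
    one_long_gap = gap_cost(3)          # a single gap spanning 3 residues
    three_short_gaps = 3 * gap_cost(1)  # three separate 1-residue gaps
    print("one gap of length 3:", one_long_gap)
    print("three gaps of length 1:", three_short_gaps)
```

With the defaults, one gap of three residues costs 14 while three single-residue gaps cost 36, which is why alignments tend to contain a few long gaps rather than many short ones.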
Gap align
This true/false option tells the program whether to perform optimised alignments within regions involving gaps. If set to true,
the program produces gapped alignments. If set to false, it reports only the individual HSPs where the two
sequences match each other, and thus does not produce alignments with gaps. The default is true.
(N.B. HSP stands for High-scoring Segment Pair: a local alignment with no gaps that achieves one of the top alignment scores in
a given search.)
Substitution matrices
Alignment of protein sequences can take account of the differential rates at which amino acids substitute for
each other. These rates are captured in two families of matrices: PAM and BLOSUM.
PAM (Percent Accepted Mutations, also expanded as Point Accepted Mutation): by comparing many pairs of very similar protein
sequences (at least 85% identical, i.e. homologous sequences), Margaret Dayhoff constructed a mutation
probability matrix giving the empirical frequencies with
which each amino acid is replaced by the others during evolution.
Examples are PAM1, PAM10, PAM25, PAM50, PAM100, PAM125, PAM250.
PAM10 = (PAM1)^10
...
PAM250 = (PAM1)^250
The PAM1 matrix can be multiplied by itself N times to give transition matrices for comparing sequences with
lower and lower levels of similarity due to separation over longer periods of evolutionary history.
Thus, the commonly used PAM250 matrix represents a level of 250% change (250 accepted mutations per 100 residues) expected in
2500 million years. Although this amount of change seems very large, sequences at this level of divergence still have about 20%
similarity (Bioinformatics, D. W. Mount, page 96).
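The repeated multiplication can be illustrated with a toy example. The 2-state matrix below is a hypothetical stand-in with a 1% change per step; it is not the real 20x20 PAM1 matrix, and the helper names are illustrative.

```python
# Toy illustration of PAM-N = (PAM1)^N with a hypothetical 2-state matrix.
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result

# Each row gives the probability of keeping or replacing a residue in one step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    for steps in (1, 10, 250):
        kept = mat_pow(TOY_PAM1, steps)[0][0]
        print(f"after {steps:3d} steps, P(residue unchanged) = {kept:.4f}")
```

Even after 250 steps the probability of seeing the original residue stays above the value expected for unrelated states, because repeated and reverse changes at the same position are not counted twice. This mirrors why sequences at PAM250 distance still retain some similarity.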
The empirical frequency with which amino acid type i is replaced by type j (or vice versa) is written as M_{i,j} in the
matrix. Scores for aligned pairs are added: the score of aligning two Ys in the alignment YY/YY is 10+10=20, a very significant
score, whereas that of YY/TP is 0-5=-5.
Recommendations
Which PAM matrix should I use?
The percentage similarity or difference between two sequences cannot be known
until an alignment is done, so a trial alignment must be done first. Once an initial similarity score has been
obtained, a more representative score can be obtained by using another PAM matrix
designed specifically for sequences at that level of similarity.
BLOSUM (Blocks Substitution Matrix): the PAM matrices introduced by Dayhoff are constructed from the
amino acid replacements inferred from alignments of protein sequences that are at least 85% identical.
Henikoff & Henikoff (1992) instead considered blocks, or highly conserved regions, in aligned protein sequences. The
BLOSUM matrix scores for amino acid pairs are based on the frequency of amino acid substitutions in aligned
sequence motifs (blocks) from a related family of proteins, regardless of the overall degree of similarity
between the protein sequences.
The BLOSUM62 substitution matrix is widely used for scoring protein sequence alignments. The matrix values
are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid
blocks representing more than 500 families of related proteins.
BLOSUM62 -> based on blocks that are 62% identical
BLOSUM80 -> based on blocks that are 80% identical
BLOSUM62 example: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
PAM vs BLOSUM
The PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM
matrices are based on substitutions and conserved positions in blocks, which represent the most similar
common regions in related sequences.
The PAM model is thus designed to track the evolutionary origins of proteins, whereas the BLOSUM model is
designed to find their conserved domains. The choice of matrix depends on the goals of the
investigator.
Still, there are approximate equivalences between PAM and BLOSUM matrices.
We also have to consider insertions and deletions. These require opening gaps in the alignment, so we have
to recalculate the scores, penalizing for:
a) Opening a gap in the alignment
b) Extending the gap in the alignment
Values vary depending on the program we are using, but as a general rule opening a new gap is penalized much
more than extending an existing one.
It is more frequent to find long gaps than bunches of “1 base” gaps.
Example 1 “bunch of gaps”:
ATCG_ATCG_ATCG_ATCG
ATCGTATCGTATCGTATCG
Example 2 “long gap”:
ATCG___ATCG
ATCGTCGATCG
Example of scoring a sequence alignment with a gap penalty under BLOSUM62.
BLOSUM62 matrix: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
Sequence 1:   V    D    S    -    C    Y
Sequence 2:   V    E    S    L    C    Y
Score:        4    2    4  -11    9    7
Total score = (∑ amino acid pair scores) minus (single gap penalty) = 26 - 11 = 15
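The same calculation can be sketched in a few lines. Only the five BLOSUM62 values needed for this example are included, and the function name and the flat gap penalty of 11 per gapped column are assumptions that follow the example above.

```python
# Score the example alignment with the BLOSUM62 values it uses.
# Only the pairs needed here are listed; a full matrix has 210 distinct pairs.
BLOSUM62_SUBSET = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}

def score_alignment(seq1, seq2, gap_penalty=11):
    """Sum pair scores over the columns, subtracting gap_penalty per gapped column."""
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += BLOSUM62_SUBSET[(a, b)]
    return total

if __name__ == "__main__":
    print(score_alignment("VDS-CY", "VESLCY"))  # 26 - 11 = 15
```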
Searching against a protein sequence database with NCBI-BLAST2
We now return to the search of a protein database with a nucleotide query (sequence 2) using NCBI BLAST2
(http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/checkingForVectorContamination/),
looking for similar peptide/protein sequences in UniProt. The protein encoded by the query is a real entry in this database, so we
expect to find a perfect match to our test sequence. We also expect to find similar peptide/protein sequences, either the same
protein from closely related animals or closely related proteins.
We select the BLASTX option and the Swiss-Prot database to search a protein database with a nucleotide query:
http://www.ebi.ac.uk/Tools/blastall/
Searching against a protein sequence database with NCBI-BLAST2: Results
Summary
NOTE: clicking ‘Show Alignments’ shows that the hits are found in frame 2.
This tells us that at least the second frame is a coding frame.
Showing the alignments
NOTE: all the hits are found in frame 2. This tells us that at least the second frame is
a coding frame.
Visual output (results)
Note that the part of the mouse fosB mRNA that aligns with the FosB protein sequence is the one that
belongs to the CDS, from the first methionine (translation start site) to the stop codon (translation stop site).
Functional predictions (results)
Description of Uniprot entry
Pairwise local/global alignment: differences
Global alignment: we try to align the sequences over their whole length. It is only useful for homologous proteins with a high
percentage of identity.
Local alignment: we try to align only the regions of the sequences that match well. This is useful when dealing with
domains.
Are these proteins homologues?
Globally: no, they are very different, and the score would be very low.
Locally: yes, there is a homologous domain, the grey one.
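The difference can be made concrete with a small dynamic programming sketch that computes a global (Needleman-Wunsch) score and a local (Smith-Waterman) score for the same pair of sequences. The scoring scheme (+1 match, -1 mismatch, -1 per gap position) and the toy sequences are assumptions chosen for readability; EMBOSS needle and water use a substitution matrix and affine gap penalties.

```python
# Global vs local alignment scores with a simple linear gap penalty.
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[-1])
        prev = cur
    return best

if __name__ == "__main__":
    # Two toy sequences sharing only a short central "domain" (CGT).
    seq_a = "AAAACGTAAAA"
    seq_b = "TTTTCGTTTTT"
    print("global:", global_score(seq_a, seq_b))
    print("local: ", local_score(seq_a, seq_b))
```

The global score is negative because the unrelated flanks must be aligned too, while the local score reflects only the shared region.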
Pairwise local/global alignment: Running an EMBOSS-Align alignment
We are going to use the EMBOSS-Align tool (http://www.ebi.ac.uk/Tools/emboss/align/).
•We run 2 jobs, one with the EMBOSS global alignment program (needle) and one with the local alignment program (water).
•As we are comparing 2 protein sequences, the molecule type is left as protein.
•The default BLOSUM62 matrix is used, together with the default gap open penalty of 10 and gap extend penalty of 0.5.
Let's align these two sequences: http://pfam.sanger.ac.uk/family?acc=PF00071
http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/protEMBOSSalign/sequence10.txt
http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/protEMBOSSalign/sequence11.txt
Pairwise local/global alignment: needle GLOBAL result
Pairwise local/global alignment: water LOCAL result
The Smith-Waterman algorithm is more suitable for identifying related proteins of limited sequence similarity than FASTA
and BLAST in a database search (Bioinformatics, D. W. Mount, page 259).
Pairwise local/global alignment: Results of EMBOSS-Align alignments
Identical amino acids are connected with a "|" symbol. Unrelated pairs of amino acids (mismatches) are
connected with a space. A gap is represented with a "-" symbol. Similar pairs (e.g. leucine vs methionine) are connected
with a ":" symbol. Less similar pairs are indicated with ".".
The %id is the percentage of identical matches between the two sequences over the reported aligned region.
The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring
matrix value is greater than or equal to 0.0.
The Overall %id and Overall %similarity are calculated in the same way, but over the length of the
longer of the two sequences.
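A minimal sketch of the two identity figures, computed from a pair of already aligned sequences. The function names are hypothetical, gaps are assumed to be written as "-", and %similarity is left out because it needs the full scoring matrix.

```python
# Percent identity over the aligned region and over the longer sequence.
def count_identical(aligned1, aligned2):
    """Number of columns where both sequences carry the same residue."""
    return sum(a == b and a != "-" for a, b in zip(aligned1, aligned2))

def percent_identity(aligned1, aligned2):
    """%id over the reported aligned region (all alignment columns)."""
    return 100.0 * count_identical(aligned1, aligned2) / len(aligned1)

def overall_percent_identity(aligned1, aligned2):
    """Overall %id over the length of the longer ungapped sequence."""
    longest = max(len(aligned1.replace("-", "")), len(aligned2.replace("-", "")))
    return 100.0 * count_identical(aligned1, aligned2) / longest

if __name__ == "__main__":
    top, bottom = "VDS-CY", "VESLCY"
    print(f"%id = {percent_identity(top, bottom):.1f}")
    print(f"Overall %id = {overall_percent_identity(top, bottom):.1f}")
```

For a local alignment the two figures differ more, since the aligned region can be much shorter than the longer sequence.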
ClustalW can build multiple sequence alignments (MSA)
ClustalW (http://www.ebi.ac.uk/Tools/clustalw2/) is a general-purpose global multiple sequence alignment program for DNA or
proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match
for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary
relationships can be viewed as cladograms or phylograms.
Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is the
identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of
specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.
Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for
pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions), while local alignments
can avoid them by aligning the regions between gaps. ClustalW is fully automatic: the alignment is progressive and takes
sequence redundancy into account. Trees can also be calculated from
multiple alignments. The program has some adjustable parameters with reasonable defaults.
ClustalW (Higgins et al. 1996)
It is designed to provide an adequate alignment of a large number of
closely related sequences and a reliable indication of the domain
structure of those sequences.
The steps used by ClustalW include:
1. Perform pairwise alignments of all the sequences.
2. Use the alignment scores to produce a phylogenetic tree.
3. Progressive multiple sequence alignment: this reduces the
construction of the MSA to a series of pairwise alignments. Initially,
a dynamic programming alignment is made between the two most
similar sequences, and the resulting alignment is then extended to
include other, less similar sequences.
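The ordering idea behind the steps above can be sketched as follows. This is a deliberate simplification and not ClustalW itself: pairwise alignment scores are replaced by the fraction of identical positions between equal-length toy sequences, and the guide tree is replaced by a greedy order in which the most similar pair comes first. All names and sequences are hypothetical.

```python
# Greedy "most similar first" ordering as a stand-in for a guide tree.
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions (sequences assumed to have equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def progressive_order(seqs):
    """Return sequence names in the order they would join the alignment."""
    names = list(seqs)
    # Step 1: score every pair of sequences.
    sim = {frozenset(pair): identity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(names, 2)}
    # Step 2 (simplified): start from the most similar pair.
    first = max(combinations(names, 2), key=lambda pair: sim[frozenset(pair)])
    order = list(first)
    remaining = [n for n in names if n not in order]
    # Step 3: repeatedly add the sequence closest to any sequence already included.
    while remaining:
        nxt = max(remaining,
                  key=lambda n: max(sim[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

EXAMPLE = {
    "s1": "MKTAYIAKQR",
    "s2": "MKTAYIAKQK",
    "s3": "MKTAYLSKQK",
    "s4": "MRSGYLSNHK",
}

if __name__ == "__main__":
    print(progressive_order(EXAMPLE))
```

A real progressive aligner would align each newly added sequence (or group of sequences) to the growing alignment with dynamic programming at this point.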
Building a MSA: 1) get protein homologs with Blast
We select all the hits obtained from the previous search results and then click download fasta.
Building a MSA: 2) Copy all the downloaded sequences
We then copy all the downloaded sequences to the ClustalW2 tool
Building a MSA: 3) ClustalW MSA results
Analyzing ClustalW results
We can now see that the first 3 sequences are very similar, while from the fourth sequence onwards the MSA
shows differences.
The branch lengths on the phylogram are proportional to the evolutionary
distance between species. In the cladogram, however, the branches are normalized
and therefore do not represent the distance between species.
Interpretation of ClustalW results
Consensus symbols
An alignment displays by default the following symbols denoting the degree of conservation observed in each column:
"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
":" means that conserved substitutions (similar) have been observed, according to the COLOUR table below.
"." means that semi-conserved substitutions (less similar) are observed.
Colour
This option only works when you have chosen ALN or GCG as the output format. The colouring of residues takes place according
to the following physicochemical criteria:
Other examples of MSA programs
T-Coffee: combines information from global and local alignments to produce a global
MSA (http://www.ebi.ac.uk/Tools/t-coffee/index.html)
Muscle: builds global MSA (http://www.ebi.ac.uk/Tools/muscle/)
Mafft: generates global MSA (http://www.ebi.ac.uk/Tools/mafft/index.html)
DiAlign: produces global and local MSA (http://bibiserv.techfak.uni-bielefeld.de/dialign/)
Hmmer: generates local MSA (http://hmmer.janelia.org/)
Meme: builds local MSA (http://meme.sdsc.edu/meme4_1/cgi-bin/meme.cgi)
Searching protein families with InterPro
What is InterPro? http://www.ebi.ac.uk/interpro/user_manual.html
•InterPro is an integrated documentation resource for protein families, domains and sites. InterPro is a consortium of
member databases (PROSITE, Pfam, Prints, ProDom, SMART and TIGRFAMs). Each member database devises
methods that can be applied computationally to assign a score to a protein according to how well it matches a given
signature. For some types of methods the classification is binary (i.e. hit or miss); in other cases a numerical value is
produced and a cut-off point is chosen to separate hits from misses. Different member databases create methods/signatures
in different ways: some groups build them from manually studied alignments, others use automatic processes with some
human input and correction, and ProDom uses an entirely automatic method.
•Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined
InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry,
literature references and links to the relevant member database(s). All UniProtKB protein sequences that have
matches to a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the
InterPro graphical views. The graphical views, which can be sorted by UniProtKB accession number, structure or
taxonomy, show the position of the signatures on the protein. Mousing over a signature brings up a pop-up box giving the
accession, name and position.
•InterPro graphically represents the location of a protein domain and information pertaining to the origin of that domain and
the proteins that contain it. Families are also defined and may contain several InterPro domains, which are often, but not
always, in the same order. Through the InterPro Domain Architecture view, the composition and order of the different
domains within a family are clearly displayed for easy comparison, as well as for simple navigation between the entries for
individual domains.
•InterPro entries are linked to one another through PARENT/CHILD and CONTAINS/FOUND IN relationships.
PARENT/CHILD relationships indicate superfamily/family/subfamily relationships, as well as domain hierarchies, where
sequences can be subdivided into more specific sub-sets. CONTAINS/FOUND IN relationships apply to domains, repeats
and sites within families, and are used to describe the composition of protein sequences.
Going back to our sequence 2:
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[swissprot-id:FOSB_MOUSE]|[swissprot-acc:FOSB_MOUSE]+-noSession
•We move down through the page to the section ‘Database cross-references’ until we find the following link:
•There are 3 InterPro entries in this case: usually InterPro defines one entry for each member database that contains a
definition for this domain. The first entry is a PFAM annotation with type ‘Domain’. The second InterPro entry is an
annotation from PRINTS; in this case the type is ‘Family’. The third entry comes from SMART and PROSITE, and
the type is again ‘Domain’.
•InterPro entries can have associations such as parent/child (different levels defined by InterPro methods) or contains/found in.
•This particular domain is named differently in the three entries. The only reason is that each
member database has its own naming preferences.
Thanks for your attention!
I would also like to acknowledge the work of the 2Can initiative at the EBI. Some of the
slides shown in this tutorial were selected from the 2Can Support Portal.