Introduction to Sequence Analysis
Protein Sequence Analysis, Part II
Osvaldo Graña (ograna@cnio.es), CNIO Bioinformatics Unit
22 Feb. 2012

Introduction
• Determining protein and peptide sequences is a basic requirement for biomedical research, including cancer research. It is essential for characterising and identifying proteins or peptides.
• The UniProt Knowledgebase (UniProtKB) is a central database of protein sequence and function. It consists of two parts: UniProtKB/Swiss-Prot, a section of fully manually annotated, non-redundant records built from information extracted from the literature and from curator-evaluated computational analyses; and UniProtKB/TrEMBL, a section of computationally analysed records awaiting full manual annotation.
More information about UniProtKB: http://www.uniprot.org/help/uniprotkb

Searching against a protein sequence database with NCBI-BLAST2
We are going to search a protein database using a nucleotide query (sequence 2) with NCBI BLAST2 (http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/checkingForVectorContamination/) and look for similar peptide/protein sequences in UniProtKB/Swiss-Prot. The protein encoded by this query is a real entry in the database, so we expect to find a sequence that is a perfect match to our test sequence. We also expect to find similar peptide/protein sequences, perhaps from closely related animals or from closely related proteins.
We are going to select the BLASTX option and the UniProtKB/Swiss-Prot database, because BLASTX searches a protein database using a nucleotide query: http://www.ebi.ac.uk/Tools/blastall/

Searching against a protein sequence database with NCBI-BLAST2: Database
Choose here the databases you wish to run your sequence against.

Selecting BLAST parameters to search

Searching against a protein sequence database with NCBI-BLAST2: Matrix
You may choose from a list of matrices that cover various evolutionary constraints. Substitutions accumulate in sequences as they diverge during evolution, and each matrix is tailored to a particular evolutionary distance. The default matrix for BLAST is BLOSUM62 (Blocks Substitution Matrix, 62% identity), which is generally among the best of the available matrices for detecting weak protein similarities. PAM (Point Accepted Mutation) matrices are also traditionally used for amino acid sequences. A PAM matrix with a larger number suits alignments of sequences separated by larger evolutionary distances, whereas a BLOSUM matrix with a larger number suits sequences with a higher percentage identity. The default value is BLOSUM62.

Expected threshold
The expected threshold sets the statistical significance threshold for reporting database sequence matches. The default value is 10, meaning that 10 matches are expected to be found merely by chance. Lower expected thresholds are more stringent, leading to fewer chance matches being reported.
Increasing the expected threshold reports less stringent matches and is recommended when searching with short sequences: a short query is more likely to occur by chance in the database than a longer one, so even a perfect match (no gaps) can have low statistical significance and may not be reported. Increasing the expected threshold lets you look farther down the hit list and see matches that would normally be discarded because of low statistical significance. Generally a value of up to 1000 is enough to see results. The default value is 10.
[Figure: E-value scale with marks at 0, 10^-5 and 10^-2 and the labels "Very Reliable", "Reliable", "Homology" and "Risky".]

Searching against a protein sequence database with NCBI-BLAST2: Filter
The filter option, if set to true, masks segments of the query sequence that are nonspecific for sequence similarity searches. Filtering can eliminate statistically significant but biologically uninteresting reports from the output, for example hits against common acidic-, basic- or proline-rich regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence, not to database sequences.
The program used for nucleotide query sequences is DUST, written by Tatusov, R. L., and Lipman, D. J. The SEG program, written by Wootton, J. C., and Federhen, S., filters low-complexity regions in amino acid query sequences. The default is true.
Default filters (when Filter is set to true):
* BLASTp: SEG
* BLASTx: SEG
* BLASTn: DUST
N.B. If you have "UniProt Clusters 100% (SEG filtered)" selected, you will not be able to set a filter, as a filter is already applied.

Drop off
This is the amount a score must drop before extension of word hits is halted.
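As a rough illustration of the drop-off idea, the following Python sketch extends an ungapped hit to the right and stops once the running score has fallen more than a chosen amount below the best score seen so far. The function name `extend_right`, the match/mismatch scores and the example strings are illustrative assumptions; this is not BLAST's own implementation, which scores residue pairs with the selected substitution matrix.

```python
def extend_right(query, subject, q_pos, s_pos, x_drop, match=1, mismatch=-2):
    """Ungapped extension to the right of a hit starting at (q_pos, s_pos).

    Returns (best_score, best_length). Extension halts once the running
    score has dropped more than x_drop below the best score seen so far.
    The match/mismatch values are toy scores chosen for this sketch.
    """
    score = 0
    best_score = 0
    best_length = 0
    length = 0
    while q_pos + length < len(query) and s_pos + length < len(subject):
        same = query[q_pos + length] == subject[s_pos + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            # New best: remember how far the extension has reached.
            best_score, best_length = score, length
        elif best_score - score > x_drop:
            # Score fell too far below the best: stop extending.
            break
    return best_score, best_length


if __name__ == "__main__":
    q = "AAAAXXAAAAA"
    s = "AAAAYYAAAAA"
    # A small drop-off stops at the two mismatches ...
    print(extend_right(q, s, 0, 0, x_drop=3))
    # ... a larger one lets the extension cross them.
    print(extend_right(q, s, 0, 0, x_drop=5))
```

With a small drop-off the extension gives up at the mismatched pair; with a larger one it continues and picks up the matching region beyond it, at the cost of more computation.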
Searching against a protein sequence database with NCBI-BLAST2: Open gap
The gap open penalty is the score taken away for the initiation of a gap in sequence or in structure. To make the match more significant you can try making the gap penalty larger: this decreases the number of gaps, and a good alignment without many gaps will have a higher Z-score. The default is 11.

Extend gap
The gap extension penalty is added to the gap open penalty for each base or residue in the gap. This is how long gaps are penalised. If you do not want long gaps, increase the gap extension penalty. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap open penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single-base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. The default is 1.

Gap align
This true/false option tells the program whether to perform optimised alignments within regions involving gaps. If set to true, the program will perform an alignment using gaps. If set to false, it will report only individual HSPs where two sequences match each other, and thus will not produce alignments with gaps. The default is true.
(N.B. HSP means High-scoring Segment Pair: a local alignment with no gaps that achieves one of the top alignment scores in a given search.)

Substitution matrices
Alignment of protein sequences can take account of the differential rates at which amino acids substitute for each other.
These rates are captured by two types of substitution matrix: PAM and BLOSUM.
PAM (Percent Accepted Mutations, also expanded as Point Accepted Mutation): on the basis of comparisons among many pairs of very similar protein sequences (at least 85% identical, i.e. homologous sequences), Margaret Dayhoff constructed a mutation probability matrix giving the empirical frequencies with which one amino acid is replaced by others during evolution. Examples are PAM1, PAM10, PAM25, PAM50, PAM100, PAM125 and PAM250.
$\mathrm{PAM10} = (\mathrm{PAM1})^{10}, \;\ldots,\; \mathrm{PAM250} = (\mathrm{PAM1})^{250}$
The PAM1 matrix can be multiplied by itself N times to give transition matrices for comparing sequences with lower and lower levels of similarity due to separation over longer periods of evolutionary history. Thus, the commonly used PAM250 matrix represents a level of 250% of change expected in 2500 million years. Although this amount of change seems very large, sequences at this level of divergence still have about 20% similarity (Bioinformatics, D. W. Mount, page 96).

Substitution matrices
The empirical frequency with which amino acid type i is replaced by type j (or vice versa) is written as $M_{i,j}$ in the matrix. For example, the score for aligning two Ys in the alignment YY/YY is 10 + 10 = 20, a very significant score, whereas that of YY/TP is 0 - 5 = -5.

Substitution matrices: Recommendations
Which PAM matrix should I use? The percentage similarity or difference between two sequences cannot be known until an alignment has been made, so a trial alignment must be done first. Once the initial similarity score has been obtained with these matrices, a more representative score can be obtained by using another PAM matrix designed specifically for sequences at that level of similarity.
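The relation PAM250 = (PAM1)^250 described above can be tried out on a small scale. The sketch below is an assumption-laden toy: it uses an invented 4-state matrix in which each state changes with probability 0.01 per step, spread evenly over the other states, whereas the real Dayhoff PAM1 is a 20 x 20 matrix estimated from protein alignments. The helper names (`toy_pam1`, `mat_mul`, `mat_pow`) are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def toy_pam1(n_states=4, change=0.01):
    """Toy 'PAM1': each state changes with probability `change` per step.

    The values are invented for illustration, not Dayhoff's data.
    """
    off_diagonal = change / (n_states - 1)
    return [[1.0 - change if i == j else off_diagonal for j in range(n_states)]
            for i in range(n_states)]


if __name__ == "__main__":
    pam1 = toy_pam1()
    pam250 = mat_pow(pam1, 250)
    # Probability that a position still shows its original state.
    print("unchanged after 1 step:   ", round(pam1[0][0], 4))
    print("unchanged after 250 steps:", round(pam250[0][0], 4))
```

Even after 250 steps of 1% change, the probability of observing the original state stays above the value expected by chance, because repeated and reverse substitutions at the same position are counted as change but do not all show up as differences.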
Substitution matrices
BLOSUM (Blocks Substitution Matrix): the PAM matrices introduced by Dayhoff are constructed from the amino acid replacements inferred from alignments of protein sequences that are at least 85% identical. Henikoff & Henikoff (1992) instead considered blocks, or highly conserved regions, in aligned protein sequences. The BLOSUM matrix scores for amino acid pairs are based on the frequency of amino acid substitutions in aligned sequence motifs (blocks) from a related family of proteins, regardless of the overall degree of similarity between the protein sequences.
The BLOSUM62 substitution matrix is widely used for scoring protein sequence alignments. The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid blocks representing more than 500 families of related proteins.
BLOSUM62 -> based on blocks that are 62% identical
BLOSUM80 -> based on blocks that are 80% identical
BLOSUM62 example: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm

Substitution matrices: PAM vs BLOSUM
The PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most similar common regions in related sequences. The PAM model is thus designed to track the evolutionary origins of proteins, whereas the BLOSUM model is designed to find their conserved domains. The choice of which matrix to use depends on the goals of the investigator.
Still, there are some equivalences between PAM and BLOSUM matrices.

Substitution matrices
We also have to consider insertions and deletions. This means opening gaps in the alignment, so the scores must be recalculated with penalties for:
a) Opening a gap in the alignment
b) Extending the gap in the alignment
Values vary depending on the program we are using, but as a general rule opening a new gap is penalised much more heavily than extending an existing one, because long gaps are found more frequently than bunches of single-base gaps.
Example 1, "bunch of gaps":
ATCG_ATCG_ATCG_ATCG
ATCGTATCGTATCGTATCG
Example 2, "long gap":
ATCG___ATCG
ATCGTCGATCG

Substitution matrices
Example of scoring a sequence alignment with a gap penalty and under BLOSUM62.
BLOSUM62 matrix: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
Sequence 1:   V    D    S    -    C    Y
Sequence 2:   V    E    S    L    C    Y
Score:        4    2    4  -11    9    7
Total score = (sum of amino acid pair scores) minus (single gap penalty) = (4 + 2 + 4 + 9 + 7) - 11 = 15

Searching against a protein sequence database with NCBI-BLAST2
We are going to search a protein database using a nucleotide query (sequence 2) with NCBI BLAST2 (http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/checkingForVectorContamination/) and look for similar peptide/protein sequences in UniProt. The protein encoded by this query is a real entry in the database, so we expect to find a sequence that is a perfect match to our test sequence. We also expect to find similar peptide/protein sequences, perhaps from closely related animals or from closely related proteins.
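Searching a protein database with a nucleotide query works because BLASTX compares conceptual translations of the query, in all six reading frames, against the protein sequences. The Python sketch below shows what such a six-frame translation looks like using the standard genetic code. The function names and the frame labels (1 to 3 for the forward strand, -1 to -3 for the reverse complement) are assumptions made for illustration, and the short test sequence is invented; this is not code taken from BLAST.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate DNA starting at `offset`; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    protein = []
    for pos in range(offset, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


def six_frames(dna):
    """Return a dict mapping frame label (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rev, offset)
    return frames


if __name__ == "__main__":
    # Invented example: the ATG start codon sits in the second frame.
    for label, protein in sorted(six_frames("AATGGCCTAA").items()):
        print(label, protein)
```

In a real search, only the frame or frames that encode protein are expected to produce significant hits, which is how the reported frame of a hit hints at the coding frame of the query.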
We are going to select the BLASTX option and the Swiss-Prot database, because BLASTX searches a protein database using a nucleotide query: http://www.ebi.ac.uk/Tools/blastall/

Searching against a protein sequence database with NCBI-BLAST2: Results Summary
NOTE: by clicking 'Show Alignments' we find that the hits are caught with frame 2. This tells us that at least the second frame is a coding frame.

Showing the alignments
NOTE: all the hits are caught with frame 2 (see 'Show Alignments'), which tells us that at least the second frame is a coding frame.

Visual output (results)
Note that the part of the mouse fosB mRNA that aligns with the FosB protein sequence is the part belonging to the CDS, from the first methionine (translation start site) to the stop codon (translation stop site).

Functional predictions (results)

Description of UniProt entry

Pairwise local/global alignment: differences
Global alignment: we try to align the whole sequence. It is mainly useful for homologous proteins with a high percentage of identity.
Local alignment: we try to align locally as much of the sequence as we can. This is useful when dealing with domains.
Are these proteins homologues?
Globally: no, they are very different, and the score would be very low.
Locally: there is a homologous domain, the grey one.
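The difference between the two approaches can be seen in a small Python sketch that computes only the best alignment score, globally (Needleman-Wunsch) and locally (Smith-Waterman). It assumes a deliberately simple scoring scheme (match 2, mismatch -1, linear gap penalty -2) rather than a substitution matrix with separate gap open and gap extend penalties, and the two example sequences are invented: they share only the central "domain" WCHKW.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score for aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of subsequences of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0: the alignment may restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    seq_a = "GGGGGWCHKWGGGGG"
    seq_b = "TTTTTWCHKWTTTTT"
    print("global:", global_score(seq_a, seq_b))
    print("local: ", local_score(seq_a, seq_b))
```

The global score is dragged down by the unrelated flanking regions, while the local score reflects only the shared segment, which mirrors the "globally no, locally yes" answer above.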
Pairwise local/global alignment: Running an EMBOSS-Align alignment
We are going to use the EMBOSS-Align tool (http://www.ebi.ac.uk/Tools/emboss/align/).
• There are 2 jobs to execute: one with the EMBOSS global alignment program (needle) and one with the local alignment program (water).
• As we are comparing 2 protein sequences, the molecule type is left on protein.
• The default BLOSUM62 matrix is used, together with the default gap open penalty of 10 and gap extend penalty of 0.5.

Pairwise local/global alignment: differences
Let's align these two sequences:
http://pfam.sanger.ac.uk/family?acc=PF00071
http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/protEMBOSSalign/sequence10.txt
http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/protEMBOSSalign/sequence11.txt

Pairwise local/global alignment: needle GLOBAL result

Pairwise local/global alignment: water LOCAL result
The Smith-Waterman algorithm is more suitable than FASTA and BLAST for identifying related proteins of limited sequence similarity in a database search (Bioinformatics, D. W. Mount, page 259).

Pairwise local/global alignment: Results of EMBOSS-Align alignments
Note that identical amino acids are connected with a "|" symbol. Unrelated pairs of amino acids (mismatches) are connected with a space. A gap is represented with a "-" symbol. Similar pairs (e.g. leucine vs methionine) are connected with a ":" symbol. Less similar ones are indicated with ".".
The %id is the percentage of identical matches between the two sequences over the reported aligned region.
The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0.
The Overall %id and Overall %similarity are calculated in a similar manner, using the number of matches over the length of the longer of the two sequences.

ClustalW can build multiple sequence alignments (MSA)
ClustalW (http://www.ebi.ac.uk/Tools/clustalw2/) is a general-purpose global multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen through cladograms or phylograms.
Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is the identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.
Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions), while local alignments can avoid them, aligning regions between gaps.
ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences. The alignment is progressive and takes sequence redundancy into account. Trees can also be calculated from multiple alignments. The program has some adjustable parameters with reasonable defaults.
ClustalW (Higgins et al.
1996) is designed to provide an adequate alignment of a large number of more closely related sequences and a reliable indication of the domain structure of those sequences. The steps used by ClustalW include:
1. Perform pairwise alignments of all the sequences.
2. Use the alignment scores to produce a phylogenetic tree.
3. Progressive multiple sequence alignment: this reduces the construction of the MSA to a series of pairwise alignments. Initially, a dynamic programming alignment is made between the two most similar sequences, and the resulting alignment is then extended to include other, less similar sequences.

Building a MSA: 1) Get protein homologs with BLAST
We select all the hits obtained from the previous search results, and then click 'Download FASTA'.

Building a MSA: 2) Copy all the downloaded sequences
We then copy all the downloaded sequences to the ClustalW2 tool.

Building a MSA: 3) ClustalW MSA results

Analyzing ClustalW results
We can now see that the first 3 sequences are very similar, while from the fourth to the last the MSA introduces differences.
The branch lengths on the phylogram are proportional to the evolutionary distance between species; in the cladogram, however, the branches are normalized and therefore do not represent the distance between species.

Interpretation of ClustalW results
Consensus symbols
By default an alignment displays the following symbols, denoting the degree of conservation observed in each column:
"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
":" means that conserved substitutions (similar) have been observed, according to the COLOUR table below.
"."
means that semi-conserved substitutions (less similar) are observed.
Colour
This option only works when you have chosen ALN or GCG as the output format. The colouring of residues takes place according to the following physicochemical criteria:

Other examples of MSA programs
• T-Coffee: combines information from global and local alignments to produce a global MSA (http://www.ebi.ac.uk/Tools/t-coffee/index.html)
• Muscle: builds global MSA (http://www.ebi.ac.uk/Tools/muscle/)
• Mafft: generates global MSA (http://www.ebi.ac.uk/Tools/mafft/index.html)
• DiAlign: produces global and local MSA (http://bibiserv.techfak.uni-bielefeld.de/dialign/)
• Hmmer: generates local MSA (http://hmmer.janelia.org/)
• Meme: builds local MSA (http://meme.sdsc.edu/meme4_1/cgi-bin/meme.cgi)

Searching protein families with InterPro
What is InterPro? http://www.ebi.ac.uk/interpro/user_manual.html
• InterPro is an integrated documentation resource for protein families, domains and sites. InterPro is a consortium of member databases (PROSITE, Pfam, Prints, ProDom, SMART and TIGRFAMs). Each member database devises methods that can be applied computationally to assign a score to a protein according to how well it matches a given signature. For some types of methods the classification is binary (i.e. hit or miss); in other cases a numerical value is produced and a cut-off point is chosen to separate hits from misses. Different member databases create methods/signatures in different ways: some groups build them from manually studied alignments, others use automatic processes with some human input and correction, and ProDom uses an entirely automatic method.
• Signatures describing the same protein family or domain are grouped into unique InterPro entries.
Each combined InterPro entry has a unique accession number, an abstract describing the features of the proteins associated with the entry, literature references, and links to the relevant member database(s). All UniProtKB protein sequences that match a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views. The graphical views, which can be sorted by UniProtKB accession number, structure or taxonomy, show the position of the signatures on the protein; mousing over a signature brings up a pop-up box giving its accession, name and position.
• InterPro graphically represents the location of a protein domain, together with information about the origin of that domain and the proteins that contain it. Families are also defined and may contain several InterPro domains, which are often, but not always, in the same order. The InterPro Domain Architecture view clearly displays the composition and order of the different domains within a family, for easy comparison and for simple navigation between the entries for individual domains.
• InterPro entries are linked to one another through PARENT/CHILD and CONTAINS/FOUND IN relationships. PARENT/CHILD relationships indicate superfamily/family/subfamily relationships, as well as domain hierarchies, where sequences can be subdivided into more specific subsets. CONTAINS/FOUND IN relationships apply to domains, repeats and sites within families, and are used to describe the composition of protein sequences.
Searching protein families with InterPro
Going back to our sequence 2: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[swissprot-id:FOSB_MOUSE]|[swissprot-acc:FOSB_MOUSE]+-noSession
• We move down the page to the section 'Database cross-references' until we find the following link:
• There are 3 InterPro entries in this case: usually InterPro defines one entry for each member database that contains a definition for this domain. The first entry is a PFAM annotation with type 'Domain'. The second InterPro entry is an annotation from PRINTS; in this case the type is 'Family'. The third entry comes from SMART and PROSITE, and the type is again 'Domain'.
• InterPro entries can have associations such as parent/child (different levels defined by InterPro methods) or contains/found in.
• This particular domain is named differently in the three entries. The only reason is the naming preferences of each member database.

Thanks for your attention!
I would also like to acknowledge the work of the 2Can initiative at the EBI. Some of the slides shown in this tutorial were selected from the 2Can Support Portal.