Protein sequence analysis


Part II


Osvaldo Graña

CNIO Bioinformatics Unit

ograna@cnio.es

March 2013


Introduction

• Determining protein and peptide sequences is a basic requirement in biomedical research, including cancer research. It is essential for identifying and characterising proteins and peptides.

The UniProt Knowledgebase is a central database of protein sequence and function. It consists of two sections: a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis, and a section with computationally analyzed records that await full manual annotation. The two sections are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively.

More information about UniProtKB: http://www.uniprot.org/help/uniprotkb


Searching against a protein sequence database with NCBI-BLAST2

We are going to search a protein database with a nucleotide query (sequence 2: http://bioinfo.cnio.es/people/ograna/public_html/cursos/Sequence_analysis_course_data/sequence2.txt ) using NCBI BLAST2, looking for similar peptide/protein sequences in UniProtKB/Swiss-Prot. The protein encoded by this sequence is a real entry in the database, so we expect to find a sequence that is a perfect match to our test sequence. We also expect to find similar peptide/protein sequences, perhaps from closely related animals or from closely related proteins.

To search a protein database with a nucleotide query, we select the BLASTX option and the Swiss-Prot database: http://www.ebi.ac.uk/Tools/services/web/toolform.ebi?tool=ncbiblast&context=protein


Database

Choose here the databases you wish to run your protein sequence against.



Selecting Blast parameters to search


Matrix

You may choose from a list of matrices covering various evolutionary constraints, because substitutions accumulate in sequences as they diverge during evolution. Each matrix is tailored to a particular evolutionary distance.

The default matrix for BLAST is BLOSUM62 (Blocks Substitution Matrix, 62% identity), which is among the best of the available matrices for detecting weak protein similarities. PAM (Point Accepted Mutation) matrices are also traditionally used for amino acid sequences. A PAM matrix with a larger number suits alignments of sequences separated by larger evolutionary distances, whereas a BLOSUM matrix with a larger number suits sequences with a higher percentage identity.

The default value is BLOSUM62.

Expected threshold

The expected threshold sets the statistical significance threshold for reporting database sequence matches. The default value is 10, meaning that 10 matches are expected to be found merely by chance. Lower expected thresholds are more stringent, so fewer chance matches are reported.

Increasing the expected threshold reports less stringent matches and is recommended for searches with short sequences. A short query is more likely to occur by chance in the database than a longer one, so even a perfect ungapped match can have low statistical significance and may not be reported.

Increasing the expected threshold lets you look farther down the hit list and see matches that would normally be discarded because of low statistical significance. A value of up to 1000 is generally enough to see such results.

The default value is 10.

E-value scale for inferring homology: from 0 to 10^-5, very reliable; from 10^-5 to 10^-2, reliable; above 10^-2, risky.
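To illustrate why a short query can fail the expected threshold even with a perfect match, here is a minimal sketch based on the standard relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. The database size and the two bit scores are hypothetical values chosen for illustration; real BLAST also applies effective-length corrections that are ignored here.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


DB_RESIDUES = 2 ** 27  # hypothetical database size (about 134 million residues)

# A short perfect match can only reach a modest bit score (hypothetical: 25 bits),
# whereas a long match can reach a much higher one (hypothetical: 120 bits).
short_hit = expected_hits(bit_score=25.0, query_len=12, db_len=DB_RESIDUES)
long_hit = expected_hits(bit_score=120.0, query_len=400, db_len=DB_RESIDUES)

for label, e_value in (("short query", short_hit), ("long query", long_hit)):
    reported = e_value <= 10  # default expected threshold
    print(f"{label}: E = {e_value:.3g}, reported at threshold 10: {reported}")
```

With these assumed numbers the short hit is only reported once the expected threshold is raised above its E-value.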


Filter

The filter option, if set to true, masks segments of the query sequence that are nonspecific for sequence similarity searches.

Filtering can eliminate statistically significant but biologically uninteresting reports from the output, for example hits against common acidic-, basic- or proline-rich regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Filtering is only applied to the query sequence, not to database sequences.

For nucleotide query sequences the program used is DUST, written by Tatusov, R. L., and Lipman, D. J. For protein query sequences, low complexity regions are filtered with the SEG program, written by Wootton, J. C., and Federhen, S.

The default is true.

Default Filters (When Filter set to true):

* BLASTp: SEG

* BLASTx: SEG

* BLASTn: DUST

Drop off

This is the amount a score must drop before extension of word hits is halted.


Open gap

The gap open penalty is the score subtracted when a gap is initiated in a sequence or structure alignment. To make a match more significant you can try a larger gap open penalty: it decreases the number of gaps, and a good alignment without many gaps will have a higher Z-score.

The default is 11.

Extend gap

The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalised: if you want to discourage long gaps, increase the gap extension penalty.

Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap open penalty.

An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single-base gaps. You can allow for this by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.

The default is 1.


Gap align

This is a true/false option that tells the program whether to perform optimised alignments within regions involving gaps. If set to true, the program performs an alignment using gaps. If set to false, it reports only the individual HSPs where two sequences match each other, and thus does not produce alignments with gaps. The default is true.

(N.B. HSP means High-Scoring Segment Pair: a local alignment with no gaps that achieves one of the top alignment scores in a given search.)


Substitution matrices

Alignment of protein sequences can take into account the differential rates at which amino acids substitute for each other. These rates are captured in two types of matrices: PAM and BLOSUM.

PAM (Point Accepted Mutation): by comparing many pairs of very similar protein sequences (at least 85% identical, i.e. homologous sequences), Margaret Dayhoff constructed a mutation probability matrix giving the empirical frequencies with which one amino acid is replaced by others during evolution.

Examples are PAM1, PAM10, PAM25, PAM50, PAM100, PAM125, PAM250.

$\mathrm{PAM10} = (\mathrm{PAM1})^{10}$

……

$\mathrm{PAM250} = (\mathrm{PAM1})^{250}$

The PAM1 matrix can be multiplied by itself N times to give transition matrices for comparing sequences with lower and lower levels of similarity, reflecting separation over longer periods of evolutionary history.

Thus, the commonly used PAM250 matrix represents a level of 250% change (250 accepted point mutations per 100 residues) expected in 2500 million years.

Although this amount of change seems very large, sequences at this level of divergence still have about 20% similarity (Bioinformatics, D. W. Mount, page 96).
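The repeated multiplication can be sketched with a toy example. The 3-letter alphabet and the probabilities in TOY_PAM1 below are hypothetical and chosen only so that each row sums to 1 with roughly 1% change per step; the real PAM1 matrix is 20 x 20 and derived from Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical mutation probability matrix for a 3-letter alphabet.
# Entry [i][j] is the probability that letter i is replaced by letter j in one step.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the probability of a letter remaining unchanged (the diagonal) is far lower than in the one-step matrix, yet each row is still a valid probability distribution.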


The empirical frequency with which amino acid type i is replaced by type j (or vice versa) determines the score written as Mi,j in the matrix. The score of an alignment is the sum of the matrix entries for its aligned pairs: aligning YY with YY scores 10 + 10 = 20, a very significant score, whereas aligning YY with TP scores -3 - 5 = -8.
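The additive scoring above can be reproduced with a small sketch. The dictionary holds only the three matrix entries quoted in the text; the function names are illustrative.

```python
# Only the matrix entries quoted in the text; a full matrix covers all residue pairs.
SCORES_FROM_TEXT = {("Y", "Y"): 10, ("Y", "T"): -3, ("Y", "P"): -5}


def pair_score(a, b, matrix):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix):
    """Sum the substitution scores over aligned positions (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(ungapped_score("YY", "YY", SCORES_FROM_TEXT))  # 20
print(ungapped_score("YY", "TP", SCORES_FROM_TEXT))  # -8
```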

Recommendations

Which PAM matrix should I use?

The percentage similarity or difference between two sequences cannot be known until they have been aligned, so a trial alignment must be done first. Once an initial similarity score has been obtained from the trial alignment, a more representative score can be obtained by repeating the alignment with a PAM matrix designed specifically for sequences at that level of similarity.


BLOSUM (Blocks Substitution Matrix): the PAM matrices introduced by Dayhoff are constructed from the amino acid replacements inferred from alignments of protein sequences that are at least 85% identical.

Henikoff & Henikoff (1992) instead considered blocks, or highly conserved regions, in aligned protein sequences. The BLOSUM matrix scores for amino acid pairs are based on the frequency of amino acid substitutions in aligned sequence motifs (blocks) from a related family of proteins, regardless of the overall degree of similarity between the protein sequences.

The BLOSUM62 substitution matrix is widely used for scoring protein sequence alignments. The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid blocks representing more than 500 families of related proteins.

BLOSUM62 -> derived from blocks in which sequences at least 62% identical were clustered together

BLOSUM80 -> derived from blocks in which sequences at least 80% identical were clustered together

BLOSUM62 example: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm


PAM vs BLOSUM

The PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most similar regions shared by related sequences.

The PAM model is thus designed to track the evolutionary origins of proteins, whereas the BLOSUM model is designed to find their conserved domains. The choice of which matrix to use depends on the goals of the investigator.

Still there are some equivalences between PAM and BLOSUM matrices:


GAPs in the alignment

We also have to consider insertions and deletions. This means opening gaps in the alignment, so the scores must be recalculated with penalties for:
a) opening a gap in the alignment
b) extending a gap in the alignment

Values vary depending on the program, but as a general rule opening a new gap is penalized much more than extending an existing one.

Long gaps are found more frequently than bunches of single-base gaps.

Example 1, “bunch of gaps”:

ATCG_ATCG_ATCG_ATCG

ATCGTATCGTATCGTATCG

Example 2, “long gap”:

ATCG___ATCG

ATCGTCGATCG
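The effect of the two penalties on these examples can be sketched as follows. This assumes the common affine scheme in which a gap of length k costs gap_open + k * gap_extend, with the open and extend values 11 and 1 quoted earlier as defaults; other programs define the cost slightly differently.

```python
def gap_lengths(aligned_seq, gap_char="_"):
    """Lengths of the consecutive runs of gap characters in one aligned sequence."""
    lengths, run = [], 0
    for ch in aligned_seq:
        if ch == gap_char:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def affine_gap_cost(lengths, gap_open=11, gap_extend=1):
    """Total penalty when each gap of length k costs gap_open + k * gap_extend."""
    return sum(gap_open + k * gap_extend for k in lengths)


bunch = gap_lengths("ATCG_ATCG_ATCG_ATCG")  # three gaps of length 1
long_gap = gap_lengths("ATCG___ATCG")       # one gap of length 3

print(affine_gap_cost(bunch))     # 3 * (11 + 1) = 36
print(affine_gap_cost(long_gap))  # 11 + 3 * 1 = 14
```

Both alignments contain three gap positions, but the single long gap is penalized far less than the three separate ones.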


Example

Example of scoring a sequence alignment with a gap penalty under BLOSUM62.

BLOSUM62 matrix: http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm

Sequence 1:   V   D   S   -   C   Y
Sequence 2:   V   E   S   L   C   Y
Score:        4   2   4  -11  9   7

Total score = (∑ amino acid pair scores) minus (single gap penalty) = 26 - 11 = 15
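The same calculation can be written as a short sketch. The dictionary contains only the BLOSUM62 entries needed for this example, and the gap penalty of 11 is the value used in the example; the function name is illustrative.

```python
# BLOSUM62 entries needed for this example only.
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
    ("C", "C"): 9, ("Y", "Y"): 7,
}
GAP_PENALTY = 11  # the single gap penalty used in the example


def score_alignment(row1, row2, matrix, gap_penalty, gap_char="-"):
    """Sum pair scores column by column, subtracting gap_penalty for each gap column."""
    total = 0
    for a, b in zip(row1, row2):
        if gap_char in (a, b):
            total -= gap_penalty
        elif (a, b) in matrix:
            total += matrix[(a, b)]
        else:
            total += matrix[(b, a)]  # the matrix is symmetric
    return total


print(score_alignment("VDS-CY", "VESLCY", BLOSUM62_SUBSET, GAP_PENALTY))  # 15
```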


Searching against a protein sequence database with NCBI-BLAST2: Results

Summary

NOTE: by clicking 'Show Alignments' we find that the hits are obtained in frame 2.

This tells us that at least the second frame is a coding frame.


Showing the alignments

NOTE: all the hits are obtained in frame 2 (see ‘Show Alignments’). This tells us that at least the second frame is a coding frame.

Visual output (results)

Why does the alignment span only a small region of the query sequence while it spans the full hit?

The part of the mouse fosB mRNA that Blast can align with the FosB protein sequence is the CDS, from the first methionine (translation start site) to the stop codon (translation stop site).

Functional predictions (results)

Description of UniProt entry

Pairwise local/global alignment: differences

Global alignment: we try to align the whole sequence. It is only useful for homologous proteins with a high percentage of identity.

Local alignment: we try to align locally as much of the sequence as we can. This is useful when dealing with domains.

Are these proteins homologues?


Globally: no, they are very different, the score would be very low.

Locally: there is a homologous domain, the grey one.
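The contrast between the two answers can be reproduced with a minimal dynamic-programming sketch: Needleman-Wunsch for the global score and Smith-Waterman for the local one. The scoring values (match 2, mismatch -1, linear gap penalty -2) and the two toy sequences are assumptions made for illustration; real tools use a substitution matrix and affine gap penalties.

```python
def align_score(seq1, seq2, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False gives the global (Needleman-Wunsch) score,
    local=True gives the local (Smith-Waterman) score.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,
                       table[i - 1][j] + gap,
                       table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Hypothetical proteins sharing one domain (GREYDQMAIN) in unrelated contexts.
protein_a = "LLLLLLLL" + "GREYDQMAIN"
protein_b = "GREYDQMAIN" + "KKKKKKKKKKKK"

global_score = align_score(protein_a, protein_b, local=False)
local_score = align_score(protein_a, protein_b, local=True)
print("global:", global_score)
print("local:", local_score)
```

The global score is dragged down by the unrelated flanking regions, while the local score reflects only the shared domain.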


Pairwise local/global alignment: Running an EMBOSS-Align alignment

We are going to use the EMBOSS-Align tool ( http://www.ebi.ac.uk/Tools/psa/ ).

• We run 2 jobs: one with the EMBOSS global alignment program (needle) and one with the local alignment program (water).

• As we are comparing 2 protein sequences, the molecule type is left as protein.

• The default BLOSUM62 matrix is used, together with the default gap open penalty of 10 and gap extend penalty of 0.5.

Let's align these two sequences (see http://pfam.sanger.ac.uk/family?acc=PF00071 ):

http://bioinfo.cnio.es/people/ograna/public_html/cursos/Sequence_analysis_course_data/Q4RD65_TETNG.txt

http://bioinfo.cnio.es/people/ograna/public_html/cursos/Sequence_analysis_course_data/RACA_DICDI.txt


Pairwise local/global alignment: needle GLOBAL result


Pairwise local/global alignment: water LOCAL result

The Smith-Waterman algorithm is more suitable for identifying related proteins of limited sequence similarity than FASTA and BLAST in a database search (Bioinformatics, D. W. Mount, page 259).

Pairwise local/global alignment: results of EMBOSS-Align alignments

Note that identical amino acids are connected with a "|" symbol. Unrelated pairs of amino acids (mismatches) are connected with a space. A gap is represented with a "-" symbol. Similar pairs (e.g. leucine vs methionine) are connected with a ":" symbol. Less similar ones are indicated with ".".

The %id is the percentage of identical matches between the two sequences over the reported aligned region.

The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0.

The Overall %id and Overall %similarity are calculated in a similar manner, as the number of matches over the length of the longer of the two sequences.
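As a rough sketch of these two definitions, the percentages can be computed from a pair of aligned rows. The toy alignment is hypothetical, the dictionary holds only the BLOSUM62 entries it needs, and dividing by the full alignment length (gap columns included) is an assumption about how the aligned region is measured.

```python
# BLOSUM62 entries for the pairs in the toy alignment only.
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
    ("W", "G"): -2, ("C", "C"): 9, ("Y", "Y"): 7,
}


def matrix_score(a, b, matrix):
    """Look up a symmetric substitution score."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def alignment_stats(row1, row2, matrix, gap_char="-"):
    """Return (%id, %similarity) over the aligned region, gap columns included."""
    length = len(row1)
    identical = similar = 0
    for a, b in zip(row1, row2):
        if gap_char in (a, b):
            continue  # gap columns count towards the length only
        if a == b:
            identical += 1
        if matrix_score(a, b, matrix) >= 0:
            similar += 1
    return 100.0 * identical / length, 100.0 * similar / length


pct_id, pct_sim = alignment_stats("VDSW-CY", "VESGLCY", BLOSUM62_SUBSET)
print(f"%id = {pct_id:.1f}, %similarity = {pct_sim:.1f}")
```

Here 4 of the 7 columns are identical and 5 have a matrix value of at least 0, so the similarity is higher than the identity.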


ClustalW can build multiple sequence alignments (MSA)

ClustalW ( http://www.ebi.ac.uk/Tools/msa/clustalw2/ ) is a general purpose global multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen through Cladograms or Phylograms.

Multiple alignments of protein sequences are important tools in studying sequences.

The basic information they provide is identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.

Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions) while local alignments can avoid them, aligning regions between gaps. ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences. The alignment is progressive and takes sequence redundancy into account. Trees can also be calculated from multiple alignments. The program has some adjustable parameters with reasonable defaults.

ClustalW (Higgins et al. 1996) is designed to provide an adequate alignment of a large number of closely related sequences and a reliable indication of the domain structure of those sequences.

The steps used by ClustalW include:

1. Perform pair-wise alignments of all the sequences.

2. Use the alignment scores to produce a phylogenetic tree.

3. Progressive multiple sequence alignment: it reduces the construction of the MSA to a series of pair-wise alignments. Initially, a dynamic programming alignment is made between the two most alike sequences, and the resulting alignment is then extended to include other, less alike sequences.
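The order in which sequences join a progressive alignment can be sketched as below. This is a deliberate simplification and not ClustalW's actual tree-building method: the toy sequences are hypothetical and already of equal length, plain percent identity stands in for a real pairwise alignment score, and a greedy "closest to those already included" rule stands in for the tree.

```python
from itertools import combinations


def identity(seq1, seq2):
    """Fraction of identical positions; stands in for a pairwise alignment score."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)


def progressive_order(seqs):
    """Closest pair first, then each remaining sequence by its best
    similarity to any sequence already included."""
    scores = {frozenset(pair): identity(seqs[pair[0]], seqs[pair[1]])
              for pair in combinations(seqs, 2)}
    order = sorted(max(scores, key=scores.get))  # the two most alike sequences
    remaining = set(seqs) - set(order)
    while remaining:
        nxt = max(sorted(remaining),
                  key=lambda name: max(scores[frozenset((name, done))]
                                       for done in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical toy sequences of equal length.
TOY = {
    "seqA": "MKVLAAGIVG",
    "seqB": "MKVLAAGLVG",
    "seqC": "MRVLSAGLVA",
    "seqD": "MQILSPGLTA",
}
print(progressive_order(TOY))
```

In this toy set seqA and seqB are aligned first, and the more divergent seqC and seqD are added afterwards.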

Building a MSA: 1) get protein homologs with Blast

We select all the hits obtained from the previous search results, and then click download fasta


Building a MSA: 2) Copy all the downloaded sequences

We then copy all the downloaded sequences to the ClustalW2 tool


Building a MSA: 3) ClustalW MSA results

Analyzing ClustalW results

The branch lengths on the phylogram are proportional to the evolutionary distance between species. In the cladogram, however, the branches are normalized and therefore do not represent the distance between species.


Interpretation of ClustalW results

Consensus symbols

An alignment displays by default the following symbols denoting the degree of conservation observed in each column:

"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.

":" means that conserved substitutions (similar) have been observed, according to the COLOUR table below.

"." means that semi-conserved substitutions (less similar) are observed.

Colour

This option only works when you have chosen ALN or GCG as the output format. The residues are coloured according to the following physicochemical criteria:


Other examples of MSA programs

T-Coffee: combines information from global and local alignments to produce a global MSA ( http://www.ebi.ac.uk/Tools/t-coffee/index.html )

Muscle: builds global MSA ( http://www.ebi.ac.uk/Tools/muscle/ )

Mafft: generates global MSA ( http://www.ebi.ac.uk/Tools/mafft/index.html )

DiAlign: produces global and local MSA ( http://bibiserv.techfak.uni-bielefeld.de/dialign/ )

Hmmer: generates local MSA ( http://hmmer.janelia.org/ )

Meme: builds local MSA ( http://meme.sdsc.edu/meme4_1/cgi-bin/meme.cgi )


Searching protein families with InterPro


What is InterPro?

http://www.ebi.ac.uk/interpro/user_manual.html

InterPro is an integrated documentation resource for protein families, domains and sites. InterPro is a consortium of member databases (PROSITE, Pfam, Prints, ProDom, SMART and TIGRFAMs).

Each member database devises methods that can be applied computationally to assign a score to a protein according to how well it matches a given signature. For some types of methods the classification is binary (i.e. hit or miss); in other cases a numerical value is produced and a cut-off point is chosen to separate hits from misses.

Different member databases create methods/signatures in different ways: some groups build them from manually studied alignments, others use automatic processes with some human input and correction, and ProDom uses an entirely automatic method.

Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry, literature references and links to the relevant member database(s).

All UniProtKB protein sequences that have matches to a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views.

The graphical views, which can be sorted by UniProtKB accession number, structure or taxonomy, show the position of the signatures on the protein. Mousing over a signature brings up a pop-up box giving the accession, name and position.

InterPro graphically represents the location of a protein domain and information pertaining to the origin of that domain and the proteins that contain it.

Families are also defined and may contain several InterPro domains, which are often, but not always, in the same order. Through the InterPro Domain Architecture view, the composition and order of the different domains within a family are clearly displayed for easy comparison, as well as for simple navigation between the entries for individual domains.

InterPro entries are linked to one another through PARENT/CHILD and CONTAINS/FOUND IN relationships.

PARENT/CHILD relationships indicate superfamily/family/subfamily relationships, as well as domain hierarchies, where sequences can be subdivided into more specific sub-sets. CONTAINS/FOUND IN relationships apply to domains, repeats and sites within families, and are used to describe the composition of protein sequences.


Going back to our sequence 2: http://www.uniprot.org/uniprot/p13346

We move down through the page to the section ‘Database cross-references’ until we find the following link:

There are 2 InterPro entries in this case. The first entry is an annotation with type ‘Domain’. The second InterPro entry is an annotation defined as ‘Family’.

InterPro entries can have associations like parent/child (different levels defined by InterPro methods) or contains/found in.


Thanks for your attention!

I would also like to acknowledge the work of the 2Can initiative at the EBI. Some of the slides shown in this tutorial were selected from the 2Can Support Portal.
