Last lecture summary


The outline of sequence alignment

1. How to recognize which sequence alignment is better.

• Scoring system

Scoring DNA alignment

Scoring protein alignment – substitution matrices (PAM, BLOSUM)

2. How to perform sequence alignment.

• Algorithm

• Dot plot, dynamic programming, heuristic algorithms (BLAST)

• Flavors of sequence alignment

• Homology

• Scoring DNA alignment, gaps

• Substitution matrix

• Scoring protein alignment

• PAM matrices, PAM1, higher PAM

New stuff

PAM 120

Figure legend (amino acid groups): small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic.

Zvelebil, Baum, Understanding bioinformatics.

Positive score – frequency of substitutions is greater than would have occurred by random chance.

Zero score – frequency is equal to that expected by chance.

Negative score – frequency is less than would have occurred by random chance.
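The sign convention of the scores can be illustrated with a log-odds calculation. This is a minimal sketch, assuming a score of the form log2(observed / expected); the function name and the frequencies are hypothetical illustration values, not entries taken from PAM120.

```python
import math

def log_odds_score(observed_freq, expected_freq):
    """Log-odds score: log2 of the observed pair frequency over the chance-expected one."""
    return math.log2(observed_freq / expected_freq)

# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds_score(0.02, 0.01))   # observed more often than chance: positive
print(log_odds_score(0.01, 0.01))   # observed as often as chance: zero
print(log_odds_score(0.005, 0.01))  # observed less often than chance: negative
```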

PAM matrices assumptions

• Mutation of amino acid is independent of previous mutations at the same position.

• Only PAM1 was “measured”; all other PAM matrices are extrapolated from it.

• Each amino acid position is equally mutable.

• Mutations are assumed to be independent of surrounding residues.

• The forces responsible for sequence evolution over short times are the same as those over longer times.

• PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins)

• A newer generation of Dayhoff-type matrices exists, e.g. PET91.
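The extrapolation from PAM1 to higher PAM matrices can be sketched as repeated matrix multiplication: the mutation probability matrix for n PAM units is the PAM1 probability matrix raised to the n-th power. The example below assumes a hypothetical two-letter alphabet with made-up probabilities (the real matrix is 20 × 20), and the helper names are illustrative only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# Hypothetical "PAM1-like" matrix: each row gives the probability that a residue
# stays the same or changes after one unit of evolutionary distance.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])  # each row still sums to 1
```

Substitution scores are then derived from these probabilities; only the extrapolation step is shown here.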

How to calculate score?

Selzer, Applied bioinformatics .

2 substitution matrix

Protein vs. DNA sequences

• Given the choice of aligning DNA or protein, it is often more informative to compare protein sequences.

• There are several reasons for this:

• Many changes in DNA do not change the amino acid that is specified.

• Many amino acids share related biophysical properties. Though these amino acids are not identical, they can be more easily substituted for each other. These relationships are accounted for by scoring systems.

Similarity vs. identity

• Similarity refers to the percentage of aligned residues that can be more readily substituted for each other.

• have similar physicochemical characteristics and

• the selective pressure results in some mutations being accepted and others being eliminated

$$S = \frac{2 L_s}{L_a + L_b} \times 100$$

where $L_s$ is the number of aligned residues with similar characteristics, and $L_a$ and $L_b$ are the total lengths of each sequence.
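The similarity formula can be sketched in code. The rule deciding which residues count as "similar" is an assumption here: `toy_similar` is a hypothetical rule (identical residues, or both acidic) used only for illustration, and in practice the grouping would come from a scoring system.

```python
def percent_similarity(aligned_a, aligned_b, similar):
    """S = [(L_s * 2) / (L_a + L_b)] * 100 for two aligned sequences.

    aligned_a, aligned_b: equal-length strings with '-' marking gaps.
    similar: function (x, y) -> bool deciding whether two residues are similar.
    """
    # L_s: aligned (non-gap) positions whose residues count as similar
    l_s = sum(1 for x, y in zip(aligned_a, aligned_b)
              if x != "-" and y != "-" and similar(x, y))
    # L_a, L_b: sequence lengths without gap characters
    l_a = len(aligned_a.replace("-", ""))
    l_b = len(aligned_b.replace("-", ""))
    return 2 * l_s / (l_a + l_b) * 100

def toy_similar(x, y):
    """Hypothetical rule: identical residues, or both acidic (D/E)."""
    return x == y or {x, y} <= {"D", "E"}

print(round(percent_similarity("VDS-CY", "VESLCY", toy_similar), 1))
```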

Homology vs. similarity

• Two sequences are homologous when they descended from a common ancestor sequence.

• Similarity can be quantified: “two sequences share 40% similarity”.

• But NOT “two sequences share 40% homology”. Just “two sequences are homologous”

• Qualitative statement

• And it is a conclusion about a common ancestral relationship drawn from sequence similarity comparison

Gaps

• How will I score this alignment?

V D S - C Y

V E S L C Y

• The gaps can’t be inserted freely.

• Indels are relatively slow evolutionary processes.

• And alignments with large gaps do not make biological sense.

• Each gap is penalized – a gap penalty

• The gap penalty is an adjustable parameter.

• Let’s use a gap penalty of -11.

V  D  S  -    C  Y

V  E  S  L    C  Y

4  2  4  -11  9  7

S = 4 + 2 + 4 – 11 + 9 + 7 = 15
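The scoring of the example alignment can be sketched as follows. The dictionary holds only the substitution scores that appear in the example above (a full matrix has 20 × 20 entries), and the function name is illustrative.

```python
# Only the substitution scores used in the example; not a complete matrix.
SCORES = {("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4, ("C", "C"): 9, ("Y", "Y"): 7}
GAP_PENALTY = -11

def score_alignment(row_a, row_b, scores=SCORES, gap=GAP_PENALTY):
    """Sum substitution scores column by column; a column with '-' costs the gap penalty."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        else:
            # The matrix is symmetric, so look the pair up in either order.
            total += scores[(x, y)] if (x, y) in scores else scores[(y, x)]
    return total

print(score_alignment("VDS-CY", "VESLCY"))  # 15
```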

Gap penalty

• Affine gap penalty

• The penalty for opening a gap differs from the penalty for extending it; the extension penalty is constant for each additional position.

• The gap penalty is high – fewer gaps will be inserted

• If you’re searching for sequences that are a strict match for your query sequence, the gap penalty should be set high.

• This will retrieve regions with very closely related sequences.

• The gap penalty is low – more and larger gaps will be inserted

• If you are searching for similarity between distantly related sequences, the gap penalty should be set low.
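A minimal sketch of an affine gap penalty, compared with a plain per-position (linear) penalty. The extension value of -1 is a hypothetical choice, and conventions differ on whether the opening penalty already covers the first gap position; here it is assumed that it does.

```python
def linear_gap_penalty(length, per_position=-11):
    """Every gap position costs the same."""
    return per_position * length

def affine_gap_penalty(length, gap_open=-11, gap_extend=-1):
    """Opening a gap costs gap_open (covering the first position);
    each further position costs gap_extend."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

for n in (1, 2, 5):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With an affine penalty one long gap is cheaper than several short ones of the same total length, which matches the view that a single indel event can insert or delete several residues at once.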

(A) High gap penalty. Gaps have been inserted only at the beginning and end. Percentage identity = 10%

(B) Low gap penalty. More gaps. Percentage identity = 18%

Zvelebil, Baum, Understanding bioinformatics .

Protein substitution matrices – BLOSUM

BLOSUM matrices I

• BLOck SUbstitution Matrix by Henikoff and Henikoff, 1992.

• Blocks are ungapped sequence motifs. A sequence motif is a conserved stretch of amino acids conferring a specific function on a protein.

• Any given protein can contain one or more blocks corresponding to its structural/functional motifs.

• BLOCKS database – contains multiple alignments of ungapped segments ( blocks ).

• These alignments correspond to highly conserved regions.

• BLOCKS database was used to construct BLOSUM matrices

Blocks


BLOSUM matrices II

• The Henikoffs focused on substitution patterns only in the most conserved regions of a protein. These regions are (presumably) least prone to change.

• The substitution patterns of 2000 blocks were examined and BLOSUM matrices were generated.

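• For BLOSUM62, sequences within a block that are at least 62% identical were clustered and weighted as a single sequence, so the matrix reflects substitutions between sequences sharing no more than 62% identity.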

Short and clear explanation of the BLOSUM62 derivation: Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004; 22(8):1035-6. PMID: 15286655.

BLOSUM matrices III

• BLOSUM matrices are based on an entirely different type of sequence analysis (local rather than global alignments) and on a much larger data set than PAM.

• All BLOSUM matrices are based on observed alignments. They are not based on extrapolations (an evolutionary model) like PAM.

• The BLOSUM numbering runs in the opposite direction to the PAM numbering.

• The lower the BLOSUM number, the more divergent the sequences it represents.

PAM vs. BLOSUM I – PAM

• At the time the PAM matrices were derived, most known proteins were small, globular and hydrophilic. If a researcher believes the protein contains substantial hydrophobic regions, PAM matrices are less useful.

• Most widely used is PAM250.

• It is capable of detecting similarities in the 30% range of identity.

• Another point of view – PAM250 provides the best lookback in evolutionary time.

• PAM250 is most effective if the goal is to know the widest possible range of proteins similar to the given protein.

Selecting the Right Similarity-Scoring Matrix, William R. Pearson, Curr Protoc Bioinformatics. 2013; 43: 3.5.1–3.5.9, http://europepmc.org/articles/PMC3848038/

PAM vs. BLOSUM II – PAM

• Assume a protein is a known member of the serine protease family.

• Using the protein as a query against protein databases with PAM250 will detect virtually all serine proteases, but also a considerable number of irrelevant hits.

• In this case, the PAM120 matrix should be used. It detects similarities in the 50% to 60% identity range.

• To find only those proteins most similar (identity: 70% to 90%) to the query protein, use PAM40.

Let’s summarize:

• Locate all potential similarities – PAM250

• Determine if the protein belongs to the protein family, database searches – PAM120

• Determine the most similar proteins – PAM40

PAM vs. BLOSUM III – BLOSUM

• Most widely used is BLOSUM62.

• BLOSUM62 appears to be superior to PAM250 in detecting distant relationships even if the PAM method is updated with current data sets.

• BLOSUM62 is capable of accurately detecting similarities down to about 30% identity.

• Determine if the protein belongs to a protein family – BLOSUM80 (detects identities at the 50% level)

• Determine the most similar proteins – BLOSUM90

Selecting an Appropriate Matrix

Matrix     Best use

PAM40      Short, highly similar alignments

PAM120     Detecting members of a protein family, database searches

PAM250     Longer alignments of more divergent sequences, suspected homology

BLOSUM90   Short, highly similar alignments

BLOSUM80   Detecting members of a protein family

BLOSUM62   Most effective in finding all potential similarities

Sequence alignment algorithms

Pairwise alignment algorithms

• Dot plot (dot matrix)

• Graphical way of comparing two sequences

• Dynamic programming

• Slow, but guaranteed to find the optimal alignment

• Heuristic methods

• Efficient, but not as thorough

• Word (also k-tuple) methods

• Used in database searches

Dot plot


• Graphical method that allows the comparison of two biological sequences and the identification of regions of a close similarity between them.

• Also used for finding direct or inverted repeats in sequences.
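In its simplest form, a dot plot marks every pair of positions where the two sequences carry the same residue. A minimal text-mode sketch (the function name and the example sequence are illustrative only):

```python
def dot_matrix(seq_a, seq_b):
    """Return the rows of a dot matrix: '*' where residues match, ' ' otherwise.

    seq_a runs down the rows, seq_b runs across the columns.
    """
    return ["".join("*" if a == b else " " for b in seq_b) for a in seq_a]

# Self-comparison: the main diagonal is always filled.
for row in dot_matrix("GATTACA", "GATTACA"):
    print(row)
```

Dots off the main diagonal in this unfiltered plot are matches of single residues, many of them random.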

Self-similarity dot plot I

The DNA sequence EU127468.1 compared against itself.

Introduction to dot-plots, Jan Schulz http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

Figure labels: background noise; runs of matched residues.

Self-similarity dot plot II

The DNA sequence EU127468.1 compared against itself. Window size = 16. Linear color mapping.

Introduction to dot-plots, Jan Schulz http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

Improving dot plot

• Sliding window – window size (let’s say 11)

• Stringency (let’s say 7) – a dot is printed only if at least 7 of the next 11 positions in the two sequences are identical

• Color mapping

• Scoring matrices can be used to assign a score to each substitution. These numbers then can be converted to gray/color.
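The sliding window and stringency filter can be sketched as below. This is a simplified illustration, not the implementation of any particular program; it assumes a dot is placed at the start position of each window and that "7 out of 11" means at least 7 matches. The example sequence is made up.

```python
def windowed_dot_matrix(seq_a, seq_b, window=11, stringency=7):
    """Place a dot at (i, j) if at least `stringency` of the `window` positions
    starting at i in seq_a and at j in seq_b are identical."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append("*" if matches >= stringency else " ")
        rows.append("".join(row))
    return rows

seq = "ATGCGTACGTTAGCCTAGGATCCA"  # hypothetical sequence
for row in windowed_dot_matrix(seq, seq):
    print(row)
```

For color mapping, the match count (or a sum of substitution scores over the window) would be kept as a number instead of being reduced to a dot or a blank.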

Interpretation of dot plot I

1. Plot two homologous sequences of interest. If they are similar, a diagonal line will appear (matches).

2. Frame shifts:

a) mutations – gaps in the diagonal

b) insertions – shift of the main diagonal

c) deletions – shift of the main diagonal

http://ugene.unipro.ru/documentation/manual/plugins/dotplot/interpret_a_dotplot.html

Interpretation of dot plot II

• Identify repeat regions ( direct repeats , inverted repeats )

– lines parallel to the diagonal line in self-similarity plot

• Microsatellites and minisatellites (these are also called low-complexity regions ) can be identified as “squares”.

• Palindromic sequences are shown as lines perpendicular to the main diagonal.

• Palindromic sequence: VELIPSESPILEV
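Why a palindrome appears as a line perpendicular to the main diagonal can be checked directly: in a self-comparison, the anti-diagonal cell (i, n-1-i) compares position i with its mirror position. A minimal sketch for a text palindrome like the example above (function names are illustrative; for DNA inverted repeats one would compare against the reverse complement instead):

```python
def is_palindrome(seq):
    """True if the sequence reads the same in both directions."""
    return seq == seq[::-1]

def anti_diagonal_filled(seq):
    """True if every cell (i, n-1-i) of the self-comparison dot matrix is a match."""
    n = len(seq)
    return all(seq[i] == seq[n - 1 - i] for i in range(n))

seq = "VELIPSESPILEV"
print(is_palindrome(seq), anti_diagonal_filled(seq))  # True True
```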

Bioinformatics explained: Dot plots, http://www.clcbio.com/index.php?id=1330&manual=BE_Dot_plots.html

Repeats in dot plot

Self-similarity dot plot of the NA sequence of human LDL receptor, window 23, stringency 7. Figure labels: minisatellites, direct repeats, inverted repeats.

From the book Bioinformatics, David W. Mount.

Interpretation of dot plot – summary

Figure labels: perfect match, palindrome, repeats, partial palindrome, microsatellites, minisatellites, homologous, indel.

http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

Dot plot of the human genome

A. M. Campbell, L. J. Heyer, Discovering genomics, proteomics and bioinformatics

Dot plot rules

• A larger window size is used for DNA sequences because the number of random matches is much greater due to the presence of only four characters in the alphabet.

• A typical window size for DNA is 15, with stringency 10.

• For proteins, the matrix does not have to be filtered at all, or a window of 2 or 3 with stringency 2 can be used.

• If two proteins are expected to be related but to have long regions of dissimilar sequence with only a small proportion of identities, such as similar active sites, a large window, e.g., 20, and a small stringency, e.g., 5, should be useful for seeing any similarity.
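The effect of alphabet size, window and stringency on random dots can be estimated with a binomial tail probability. This is a rough sketch under the simplifying assumption that residues are independent and uniformly distributed (real sequences are not); the function name is illustrative.

```python
from math import comb

def random_dot_probability(window, stringency, alphabet_size):
    """Probability that at least `stringency` of `window` positions match by chance,
    assuming independent, uniformly distributed residues."""
    p = 1 / alphabet_size
    return sum(comb(window, k) * p**k * (1 - p)**(window - k)
               for k in range(stringency, window + 1))

print(random_dot_probability(1, 1, 4))    # DNA, unfiltered
print(random_dot_probability(1, 1, 20))   # protein, unfiltered
print(random_dot_probability(15, 10, 4))  # DNA, window 15, stringency 10
```

Under this assumption an unfiltered DNA plot has a random dot in about one cell out of four, while the window 15 / stringency 10 setting brings the chance of a random dot well below one percent.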

Dot plot advantages/disadvantages

• Advantages:

• All possible matches of residues between two sequences are found. It’s just up to you to choose the most significant ones.

• Readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by other, more automated methods.

• Disadvantages:

• Most dot matrix computer programs do not show an actual alignment.

• A dot plot does not return a score to indicate how ‘optimal’ a given alignment is (no statistical significance that could be tested).
