Computational Biology, Part 7
Similarity Functions and
Sequence Comparison with Dot
Matrices
Robert F. Murphy
Copyright © 1996, 1999-2001.
All rights reserved.
Similarity Functions
Used to facilitate comparison of two sequence elements
• logical valued (true or false, 1 or 0)
    • test whether first argument matches (or could match) second argument
• numerical valued
    • test degree to which first argument matches second
Logical valued similarity
functions
Let Search(I)=‘A’ and Sequence(J)=‘R’
• A Function to Test for Exact Match
    • MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R
• A Function to Test for Possibility of a Match using IUB codes for Incompletely Specified Bases
    • MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G
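
The two logical-valued functions above can be sketched in Python. This is a minimal illustration, not the course's own implementation: the names match_exact and match_wild and the IUB_CODES table layout are assumptions chosen to mirror MatchExact and MatchWild.

```python
# IUB codes mapped to the set of concrete bases each one may represent.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def match_exact(a: str, b: str) -> bool:
    """Return True only if the two symbols are the same character."""
    return a == b


def match_wild(a: str, b: str) -> bool:
    """Return True if the two symbols could stand for the same base."""
    return bool(set(IUB_CODES[a]) & set(IUB_CODES[b]))


print(match_exact("A", "R"))  # False: A is not R
print(match_wild("A", "R"))   # True: R can be either A or G
```

Treating each code as a set of possible bases makes the wildcard test a simple set intersection, so it is symmetric in its two arguments.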
Numerical valued similarity
functions

• return value could be probability (for DNA)
    • Let Search(I) = 'A' and Sequence(J) = 'R'
    • SimilarNuc(Search(I),Sequence(J)) could return 0.5
    • since chances are 1 out of 2 that a purine is adenine
• return value could be similarity (for protein)
    • Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
    • SimilarProt(Seq1(I),Seq2(J)) could return 0.8
    • since lysine is similar to arginine
• usually use integer values for efficiency
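
One way a probability-valued SimilarNuc might be written is sketched below. It assumes that every base allowed by an IUB code is equally likely and that the two positions are independent; both assumptions, and the names similar_nuc and similar_nuc_int, are illustrative rather than taken from the slides.

```python
# IUB codes mapped to the set of concrete bases each one may represent.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def similar_nuc(a: str, b: str) -> float:
    """Probability that a and b denote the same base.

    Assumes each base permitted by a code is equally likely and that
    the two symbols are resolved independently.
    """
    bases_a, bases_b = IUB_CODES[a], IUB_CODES[b]
    shared = len(set(bases_a) & set(bases_b))
    return shared / (len(bases_a) * len(bases_b))


def similar_nuc_int(a: str, b: str, scale: int = 10) -> int:
    """Integer version of the score, scaled and rounded for efficiency."""
    return round(scale * similar_nuc(a, b))


print(similar_nuc("A", "R"))      # 0.5
print(similar_nuc_int("A", "R"))  # 5
```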
Scoring (similarity) matrices
For each pair of characters in alphabet,
value is proportional to degree of similarity
(or other scoring criterion) between them
 For proteins, most frequently used is
Mutation Data Matrix from Dayhoff, 1978
(MDM78)

Dayhoff PAM250 similarity matrix (partial)

       A    B    C    D    E    F    G    H
  A    2    0   -2    0    0   -4    1   -1
  B    0    0   -4    3    2   -5    0    1
  C   -2   -4   12   -5   -5   -4   -3   -3
  D    0    3   -5    4    3   -6    1    1
  E    0    2   -5    3    4   -5    0    1
  F   -4   -5   -4   -6   -5    9   -5   -2
  G    1    0   -3    1    0   -5    5   -2
  H   -1    1   -3    1    1   -2   -2    6
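
A scoring matrix like this is normally used as a lookup table. The sketch below stores only the partial values shown on this slide; the names SYMBOLS, ROWS, PAM250_PARTIAL, and score are hypothetical.

```python
# Partial PAM250 values exactly as given on the slide (symbols A through H).
SYMBOLS = "ABCDEFGH"
ROWS = [
    [2, 0, -2, 0, 0, -4, 1, -1],      # A
    [0, 0, -4, 3, 2, -5, 0, 1],       # B
    [-2, -4, 12, -5, -5, -4, -3, -3], # C
    [0, 3, -5, 4, 3, -6, 1, 1],       # D
    [0, 2, -5, 3, 4, -5, 0, 1],       # E
    [-4, -5, -4, -6, -5, 9, -5, -2],  # F
    [1, 0, -3, 1, 0, -5, 5, -2],      # G
    [-1, 1, -3, 1, 1, -2, -2, 6],     # H
]

# Dictionary keyed by (row symbol, column symbol) for constant-time lookup.
PAM250_PARTIAL = {
    (r, c): ROWS[i][j]
    for i, r in enumerate(SYMBOLS)
    for j, c in enumerate(SYMBOLS)
}


def score(a: str, b: str) -> int:
    """Similarity score for a pair of one-letter amino acid codes."""
    return PAM250_PARTIAL[a, b]


print(score("D", "E"))  # 3
print(score("C", "C"))  # 12
print(score("F", "D"))  # -6
```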
Origin of PAM 250 matrix
• Take aligned set of closely related proteins
• For each position in the set, find the most common amino acid observed there
• Calculate the frequency with which each other amino acid is observed at that position
• Combine frequencies from all positions to give table showing frequencies for each amino acid changing to each other amino acid
• Take logarithm and normalize for frequency of each amino acid
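
The last two steps (combine frequencies, then take the logarithm and normalize by amino acid frequency) can be illustrated with a toy log-odds calculation. This is only a sketch on a two-letter alphabet with made-up counts: the function name log_odds_scores, the base-10 logarithm, and the scale factor of 10 are assumptions, and the sketch does not include the extrapolation to 250 PAMs that the full Dayhoff procedure involves.

```python
import math


def log_odds_scores(counts, scale=10.0):
    """Turn a symmetric table of change counts into integer log-odds scores.

    counts[a][b] is the number of times a is observed changing to b
    (with counts[a][a] the number of times a is unchanged). Every count
    must be positive in this toy version.
    """
    symbols = sorted(counts)
    total = sum(counts[a][b] for a in symbols for b in symbols)
    # Background frequency of each symbol: its share of all observations.
    freq = {a: sum(counts[a].values()) / total for a in symbols}
    scores = {}
    for a in symbols:
        row_total = sum(counts[a].values())
        for b in symbols:
            p_change = counts[a][b] / row_total  # P(a is replaced by b)
            # Normalize by how common b is, then take the logarithm.
            scores[a, b] = round(scale * math.log10(p_change / freq[b]))
    return scores


toy_counts = {"A": {"A": 8, "B": 2}, "B": {"A": 2, "B": 8}}
scores = log_odds_scores(toy_counts)
print(scores["A", "A"])  # 2
print(scores["A", "B"])  # -4
```

Pairs that are observed more often than their background frequencies predict receive positive scores, and rarer pairs receive negative scores, which matches the sign pattern of the PAM250 table.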
Sequence comparison with dot
matrices

Goal: Graphically display regions of
similarity between two sequences (e.g.,
domains in common between two proteins
of suspected similar function)

Basic Method: For two sequences of
lengths M and N, lay out an M by N grid
(matrix) with one sequence across the top
and one sequence down the left side. For
each position in the grid, compare the
sequence elements at the top (column) and
to the left (row). If and only if they are the
same, place a dot at that position.
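
The basic method translates almost directly into code. The following is a minimal sketch; the function names dot_matrix and render and the example sequences are illustrative only.

```python
def dot_matrix(top, left):
    """Build an M by N grid of booleans for sequences of length M and N.

    Rows follow the sequence down the left side, columns follow the
    sequence across the top; a cell is True if and only if the two
    sequence elements are the same.
    """
    return [[row_char == col_char for col_char in top] for row_char in left]


def render(top, left):
    """Return a text picture of the grid with a dot at each match."""
    lines = ["  " + top]
    for row_char, row in zip(left, dot_matrix(top, left)):
        cells = "".join("." if hit else " " for hit in row)
        lines.append(row_char + " " + cells)
    return "\n".join(lines)


print(render("GATTACA", "GATCA"))
```

In the printed grid, the shared prefix GAT shows up as a short diagonal run starting at the top left corner.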
Sequence comparison with dot matrices - References
• W.M. Fitch. An improved method of testing for evolutionary homology. J. Mol. Biol. 16:9-16 (1966)
• W.M. Fitch. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem. Genet. 3:99-108 (1969)
• A.J. Gibbs & G.A. McIntyre. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16:1-11 (1970)
• A.D. McLachlan. Test for comparing related amino acid sequences: cytochrome c and cytochrome c551. J. Mol. Biol. 61:409-424 (1971)
• J. Pustell & F.C. Kafatos. A high speed, high capacity homology matrix: zooming through SV40 and polyoma. Nucleic Acids Res. 10:4765-4782 (1982)
• J. Pustell & F.C. Kafatos. A convenient and adaptable package of computer programs for DNA and protein sequence management, analysis and homology determination. Nucleic Acids Res. 12:643-655 (1984)

Examples for protein sequences
(Demonstration A5, Sequence 1 vs. 2)
 (Demonstration A5, Sequence 2 vs. 3)

Interpretation of dot matrices
Regions of similarity appear as diagonal
runs of dots
 Reverse diagonals (perpendicular to
diagonal) indicate inversions
 Reverse diagonals crossing diagonals (Xs)
indicate palindromes

 (Demonstration A5,
Sequence 4 vs. 4)

• Can link or "join" separate diagonals to form alignment with "gaps"
    • Each amino acid or base can only be used once
    • Can't trace vertically or horizontally
    • Can't double back
    • A gap is introduced by each vertical or horizontal skip
Uses for dot matrices
• Can use dot matrices to align two proteins or two nucleic acid sequences
• Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
    • Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
    • (Demonstration A5, Sequence 5 vs. 6)
Can use to find self base-pairing of an RNA
(e.g., tRNA) by comparing a sequence to
itself complemented and reversed
 Excellent approach for finding sequence
transpositions
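
The RNA self-comparison described above can be sketched as follows. The sketch assumes only Watson-Crick pairs (A-U and G-C) and ignores G-U wobble pairs; the function names and the example hairpin sequence are made up for illustration.

```python
# Watson-Crick complements for RNA (G-U wobble pairing is not modeled).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the sequence complemented and reversed."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def self_pairing_matrix(rna):
    """Dot matrix of an RNA against its own reverse complement.

    Cell [i][j] is True when base i can pair with the base that sits
    j positions from the 3' end, so a base-paired stem appears as a
    diagonal run.
    """
    rc = reverse_complement(rna)
    return [[base == rc_base for rc_base in rc] for base in rna]


hairpin = "GGGAAAUCCC"
print(reverse_complement(hairpin))  # GGGAUUUCCC
matrix = self_pairing_matrix(hairpin)
print([matrix[i][i] for i in range(3)])  # [True, True, True]
```

Here the three leading G residues can pair with the three trailing C residues, which produces a run of three dots along the diagonal.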

Filtering to remove “noise”
• A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A matching another A)
• Solution: use a window and a threshold
    • compare character by character within a window (have to choose window size)
    • require certain fraction of matches within window in order to display it with a “dot”
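
A window-and-threshold filter along these lines might look like the sketch below. The names windowed_dot_matrix, window, and min_fraction are hypothetical, and placing the dot at the start of the window (rather than its center) is an arbitrary choice made here for simplicity.

```python
def windowed_dot_matrix(top, left, window=3, min_fraction=1.0):
    """Dot matrix filtered with a sliding diagonal window.

    For each pair of window start positions, the characters are compared
    one by one along the diagonal. A dot (True) is recorded only when the
    fraction of matching positions reaches min_fraction.
    """
    n_rows = len(left) - window + 1
    n_cols = len(top) - window + 1
    dots = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(matches / window >= min_fraction)
        dots.append(row)
    return dots


# With a window of 4 and all positions required to match,
# only the two full copies of ACGT in the top sequence give a dot.
print(windowed_dot_matrix("ACGTACGT", "ACGT", window=4, min_fraction=1.0))
# [[True, False, False, False, True]]
```

Setting window=1 and min_fraction=1.0 reproduces the unfiltered dot matrix, so the filter can be introduced gradually.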
Example spreadsheet with
window

(Demonstration A6)
How do we choose a window
size?

• Window size changes with goal of analysis
    • size of average exon
    • size of average protein structural element
    • size of gene promoter
    • size of enzyme active site
How do we choose a threshold
value?

• Threshold based on statistics
    • using shuffled actual sequence
        • find average (m) and s.d. (σ) of match scores of shuffled sequence
        • convert original (unshuffled) scores (x) to Z scores
            • Z = (x - m)/σ
        • use threshold Z of 3 to 6
    • using analysis of other sets of sequences
        • provides “objective” standard of significance
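
The shuffling approach can be sketched in Python as shown below. Several details are assumptions made for illustration: the match score is taken to be the number of identical positions in a window, only the second sequence is shuffled, scores from all shuffles are pooled, and the population standard deviation is used. The function names are hypothetical.

```python
import random
import statistics


def window_scores(a, b, window):
    """Match count for every pair of window start positions."""
    return [
        sum(a[i + k] == b[j + k] for k in range(window))
        for i in range(len(a) - window + 1)
        for j in range(len(b) - window + 1)
    ]


def shuffled_stats(a, b, window, trials=20, seed=0):
    """Mean (m) and s.d. of window scores after shuffling sequence b."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pooled = []
    for _ in range(trials):
        letters = list(b)
        rng.shuffle(letters)
        pooled.extend(window_scores(a, "".join(letters), window))
    return statistics.mean(pooled), statistics.pstdev(pooled)


def z_score(x, m, sd):
    """Z = (x - m) / s.d. for an original (unshuffled) score x."""
    return (x - m) / sd


seq = "ACGTTGCAACGTAGCTAGCT"
m, sd = shuffled_stats(seq, seq, window=8)
z_perfect = z_score(8, m, sd)  # Z score of a window with 8 of 8 matches
print(round(m, 2), round(sd, 2), round(z_perfect, 2))
```

A window would then be displayed as a dot only if its Z score exceeds the chosen threshold (for example, a value between 3 and 6).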
Displaying matrices by Pustell
method with MacVector
Goal: Determine differences in
arrangements of elements of pBluescript
family of vectors
 Starting point: Use sequences of three of the
members of the family: open the first three
files in the Common Vectors: Bluescript
folder.

Dot matrices with MacVector

From Analyze menu select Pustell DNA matrix. Dialog appears.

Select SYNBL2KSM and SYNBL2SKM. Use defaults for all else.

23 regions of homology (“diagonals”) obtained. Request “Matrix map” only (don’t need “Aligned sequences”).

Note inversion near nucleotide 700 (the direction of the polylinker is reversed between the two vectors).

To examine effect of threshold, decrease “min. % score” from 65 to 55.

Now we get many (223) diagonals.

Note presence of many short regions of at least 55% homology.

Now increase threshold to 90%.

Now just 3 diagonals are found.

Note absence of short homologous regions (“noise”).

Now compare SYNBL2KSP to SYNBL2SKM.

22 diagonals found using default settings.

Note second large inversion at one end of sequences.
More dot matrices with
MacVector - DNA homology
Goal: Duplicate Figure 6 of Chapter 3 of
Sequence Analysis Primer
 Get Accession numbers J02289 (Polyoma)
and J02400 (SV40) from Entrez
 Do Pustell DNA Matrix analysis using
parameters similar to those used in text
(window size = 41, %identity = 51)

More dot matrices with
MacVector - protein homology
Goal: Reproduce Figure 15 from Chapter 3
of Sequence Analysis Primer
 Get Accession numbers P17678 (Chicken)
and X17254 (human) erythroid transcription
factors using Entrez
 Do Pustell Protein Matrix Analysis

Reading for next class
B & O, Chapter 7 just pp. 145-155
 Additional optional reading: Sequence
Analysis Primer, pp. 124-134 “Dynamic
Programming Methods” (on web site as
Reading 1)
 (03-510) Durbin et al, Sections 2.1 - 2.4
 Everybody: Look over paper by Needleman
and Wunsch on web site (Reading 2)

Summary, Part 7
Similarity functions or similarity matrices
describe (quantitatively) the degree of
similarity between two sequence elements
(bases or amino acids)
 The Dayhoff MDM78 matrix is a similarity
matrix commonly used to estimate the
degree to which a change from one amino
acid to another can be “tolerated” in a
protein

Dot matrices graphically present regions of
identity or similarity between two sequences
 The use of windows and thresholds can
reduce “noise” in dot matrices
 Inversions, duplications and palindromes
have unique “signatures” in dot matrices
