Searching for Similarities in Sequences

Searching for Similarity in Sequences
Gary Benson
Departments of Computer Science and Biology
Boston University
ISSCB 2003 Benson
Topic 1 Outline
Similarity and Alignment
• Define homology, similarity by descent and
similarity by convergence
• Common mutations and their mathematical
models
• Alignments
• Scoring Alignments
• Gap penalty functions
• Computing the best scoring alignment – the
Longest Common Subsequence (LCS) problem
Similarity and Biomolecules
Similarity is expected among biomolecules that are
descended from a common ancestor. Mutations
cause differences, but survival of the organism requires
that mutations occur in regions that are less critical to
function while important catalytic, regulatory or
structural regions remain similar.
Similarity and Evolution
Evolution has duplicated and shuffled bits and pieces of
molecules to produce new linear arrangements that
combine function in novel ways. Regions of similarity
often suggest an evolutionary tie and/or common
functional properties between very different molecules.
Three common similarity problems
1. Start with a query sequence with unknown properties and
search within a database of millions of sequences to
find those which share similarity with the query.
2. Start with a small set of sequences and identify
similarities and differences among them.
3. In many sequences or very long sequences, detect
commonly occurring patterns.
What is Similarity?
How can we measure it?
Morphology
Morphology is the form and structure of an
organism.
Should shared morphology mean similarity?
Hands
Aquatic Shape
Shared morphology
Shared morphology does not necessarily imply common
ancestry.
The animals with hands have all evolved from a common
ancestor with a hand.
The ichthyosaur, shark and porpoise each evolved sea life
adaptations independently.
Homology
When similarity is due to common ancestry, we call it
homology.
Modern molecular biology seeks to understand cellular
processes through the action of DNA, RNA, and
protein molecules. This will ultimately lead to a
biochemical understanding of:
• The pathogenesis of infectious diseases like AIDS,
hepatitis and SARS.
• The mutagenic properties of environmental toxins and
how they lead to diseases like cancer.
• The etiology of human genetic disease.
• Strategies to prevent and treat diseases through drug
and vaccine design, gene therapy, risk reduction, etc.
How homology helps
Given molecular sequences X and Y:
X ~ Y AND INFO(Y) ==> INFO(X)
(“ ~ ” means similar)
Are the Sequences Similar?
Are the Sequences Similar
• How similar?
• What parts are the most similar?
Remember, the common ancestor of the two sequences
may have existed millions of years ago.
How can we tell if the two sequences are similar?
Similarity judgements should be based on:
• The types of changes or mutations that occur
within sequences.
• Characteristics of those different types of
mutations.
• The frequency of those mutations.
Common mutations in DNA
Substitution:
A C G T T G A C
A C G A T G A C
Deletion:
A C G T T G A C
A C G A C
Insertion:
A C G T T G A C
A C G C A A G T T G A C
Common mutations
Duplication:
A C G T T G A C
A C G T T G A T T G A C
Inversion (double stranded DNA shown):
A C G T T G A C
T G C A A C T G
A C T C A A C C
T G A G T T G G
Frequency of mutations
Substitution > Insertion, Deletion >> Duplication > Inversion
Evolutionary history of sequences
Alignments
There are many ways to align two sequences. We just saw
one way:
T T A C G T A C A G A T T A
T - - G G A A C A - - - T A
Here is another:
T T A C G T - A C A G A T T A
T - - - G G A A C - - A T - A
Which is better? Remember, we cannot choose based on the
evolutionary history, because that is unknown.
Alignments and Paths through the
Alignment Array
[Figure: alignment array with the sequences acgtgaatt and tacgcaa along its two axes.]
Alignments and Paths through the
Alignment Array
[Figure: a path through the alignment array for tacgcaa and acgtgaatt, corresponding to the alignment below.]
t a c g - c a a - -
- a c g t g a a t t
Alignments and Paths:
An Alternate Alignment
[Figure: an alternate path through the same alignment array, corresponding to the alignment below.]
t - - a c g c a - - a
a c g t g - - a a t t
Finding the Best Alignment:
Ranking Alignments by Score
Score an alignment by
• Partitioning it into columns
• Assigning a weight to each column
• Summing the column weights
Distance Scoring
Distance scoring:
• Alignment gets a non-negative score.
• Alignment of identical sequences scores zero,
all others > zero.
• Best alignment has smallest score.
Typical scoring functions are:
• d(a,a) = 0; identity
• d(a,b) = d(b,a) > 0; a ≠ b; substitution
• g = d(a, – ) > 0; indel (gap)
Similarity Scoring
Similarity scoring:
• Alignment scores may be positive, zero, or negative.
• More similar means larger positive score.
• The best alignment has largest score.
Typical scoring functions are:
• s(a,b) is { > 0 if a and b are similar in one or more
characteristics or are observed to substitute
frequently for each other;
≤ 0 otherwise }; substitution
• g = s(a, – ) < 0; indel (gap)
Gap penalty functions
• Single character gap penalty
g(a, – ) = c
(c a constant or a value dependent on a)
• Affine (linear) gap penalty
g(k) = α + βk
(α is a gap opening penalty, β is a gap extension penalty)
• Concave gap penalty
g(k) = α + β(m(k))
m(k) is a function like log(k) which grows more slowly
as k increases.
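The three gap penalty shapes above can be compared with a small sketch. The function names and the default values of c, α and β are illustrative assumptions (chosen negative, as in similarity scoring), and m(k) is taken to be the natural logarithm.

```python
import math

def constant_gap(k, c=-4):
    """Single-character penalty: each of the k gap characters costs c."""
    return c * k

def affine_gap(k, alpha=-5, beta=-4):
    """Affine penalty g(k) = alpha + beta * k (opening plus extension)."""
    return alpha + beta * k

def concave_gap(k, alpha=-5, beta=-4):
    """Concave penalty g(k) = alpha + beta * m(k), here with m(k) = ln(k)."""
    return alpha + beta * math.log(k)

if __name__ == "__main__":
    for k in (1, 2, 10, 100):
        print(k, constant_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With these assumed values, a long gap is penalized far less by the concave function than by the affine one, which is the point of letting m(k) grow slowly.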
Distance Scoring
Alignment parameters:
d(a, a) = 0; d(a, b) = + 2,
g=+4
A - G C C G T A T
A C G A - - T - T
0 4 0 2 4 4 0 4 0   = 18
Similarity Scoring
Scoring parameters:
s(a, a) = + 5, s(a, b) = - 3,
g= -8
 A  -  G  C  C  G  T  A  T
 A  C  G  A  -  -  T  -  T
+5 -8 +5 -3 -8 -8 +5 -8 +5   = -15
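The column-by-column scoring used in the distance and similarity examples can be sketched as follows. The function name `score_alignment` and its arguments are illustrative, not part of the original slides; the parameters are the ones stated in the two examples.

```python
def score_alignment(row_x, row_y, pair_score, gap_score, gap_char="-"):
    """Sum the column weights of an alignment given as two equal-length rows."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == gap_char or b == gap_char:
            total += gap_score          # indel column
        else:
            total += pair_score(a, b)   # identity or substitution column
    return total

row_x = "A-GCCGTAT"
row_y = "ACGA--T-T"

# Distance scoring: d(a, a) = 0, d(a, b) = 2, g = 4
distance = score_alignment(row_x, row_y, lambda a, b: 0 if a == b else 2, 4)

# Similarity scoring: s(a, a) = 5, s(a, b) = -3, g = -8
similarity = score_alignment(row_x, row_y, lambda a, b: 5 if a == b else -3, -8)

if __name__ == "__main__":
    print(distance, similarity)
```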
Similarity scoring with affine gap
Alignment parameters:
s(a, a) = + 5, s(a, b) = - 3,
g(k) = α + βk, α = - 5, β = - 4
 A  -  G  C  C  G  T  A  T
 A  C  G  A  -  -  T  -  T
+5 -8 +5 -3 -8 -4 +5 -8 +5   = -11
Computing the Optimal Alignment:
The LCS Problem as Prototype
The Longest Common Subsequence (LCS) problem is
a method for comparing sequences. Although the solution
does not produce an alignment, it illustrates a method of
dynamic programming that is very similar to that used
by alignment algorithms.
Longest Common Subsequence Problem
Let X be a string of characters. A subsequence X’ of X is
formed by discarding zero or more letters of X. Note
that the letters in X’ maintain their same order as in X.
Let X and Y be two strings. A common subsequence Z is
a subsequence of both. A longest common subsequence
(LCS) is the longest such Z.
Examples:
X = abcdeba
X' = abdb
X = abcdeba
Y = bebdecacd
Z = bdea
LCS Problem
Given: Two sequences X and Y.
Find: An LCS for X and Y.
A divide and conquer solution can be developed by
looking at what happens to the last letters in each
sequence. That is, are they part of the LCS solution
or not?
Possible ways to split the problem
LCS recursion
Filling the dynamic programming array
        b  e  b  d  e  c  a  c  d
     0  0  0  0  0  0  0  0  0  0
  a  0
  b  0
  c  0
  d  0
  e  0
  b  0
  a  0
Filling the dynamic programming array
        b  e  b  d  e  c  a  c  d
     0  0  0  0  0  0  0  0  0  0
  a  0  0  0
  b  0  1  1
  c  0  1  1
  d  0  1  1
  e  0  1  ?
  b  0  1
  a  0  1
The cell marked ? is filled from the necessary values in its adjacent cells.
Completed LCS array
        b  e  b  d  e  c  a  c  d
     0  0  0  0  0  0  0  0  0  0
  a  0  0  0  0  0  0  0  1  1  1
  b  0  1  1  1  1  1  1  1  1  1
  c  0  1  1  1  1  1  2  2  2  2
  d  0  1  1  1  2  2  2  2  2  3
  e  0  1  2  2  2  3  3  3  3  3
  b  0  1  2  3  3  3  3  3  3  3
  a  0  1  2  3  3  3  3  4  4  4
Tracing back for a solution
     0  1  2  3  4  5  6  7  8  9
        b  e  b  d  e  c  a  c  d
     0  0  0  0  0  0  0  0  0  0
  a  0  0  0  0  0  0  0  1  1  1
  b  0  1  1  1  1  1  1  1  1  1
  c  0  1  1  1  1  1  2  2  2  2
  d  0  1  1  1  2  2  2  2  2  3
  e  0  1  2  2  2  3  3  3  3  3
  b  0  1  2  3  3  3  3  3  3  3
  a  0  1  2  3  3  3  3  4  4  4
LCS = bdea
LCS time complexity
There are (n + 1)(m + 1) cells in the LCS score array.
Each cell is filled by examining 3 other cells in constant
time. The time complexity to fill the array is O(nm).
Tracing back for an LCS solution takes at most n + m
steps.
The total time complexity is therefore O(nm).
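A minimal sketch of the array fill and traceback described above, using the textbook LCS recurrence (the recursion slide itself did not survive extraction, so this form is an assumption). The function names are illustrative. When several LCSs exist, the traceback returns one of them, which need not be the bdea shown on the slide.

```python
def lcs_table(x, y):
    """Fill the (n + 1) x (m + 1) LCS score array in O(nm) time."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Last letters match: extend the LCS of the two prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last letter of x or of y.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

def lcs_traceback(x, y, table):
    """Walk back from the last cell in at most n + m steps."""
    i, j = len(x), len(y)
    letters = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))

if __name__ == "__main__":
    x, y = "abcdeba", "bebdecacd"
    t = lcs_table(x, y)
    print(t[-1][-1], lcs_traceback(x, y, t))
```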
Topic 2 Outline
Types of Alignment
Substitution Matrices
• Global vs Local Alignment
• Recursions for Global, time complexity
• Global alignment with affine gap penalty, time complexity
• Similarity scoring and local alignment
• Recursion for local, time complexity
• Finding suboptimal local alignments: declumping
• Substitution Matrices
Global vs Local Alignment
Given two strings, X and Y:
• global alignment produces an alignment that contains all
of X and all of Y.
• local alignment produces an alignment that contains only
the best matching substrings, one from X and one from Y.
Global vs Local Alignment
Global alignment is useful when
• The sequences are known to be related throughout their
length, for example, similar protein sequences from close
species.
Local alignment is useful when
• The sequences are believed to contain parts that are closely
related.
Global Alignment Problem
Given: two sequences X and Y and alignment scoring
functions,
Find: the best scoring alignment that includes all of X and all
of Y.
Solution: Dynamic Programming
Global Alignment
Analysis of global alignment is similar to the LCS.
Alignments can end in one of three ways. In terms of the
prefix strings x1…xi and y1…yj, we have:
1. xi and yj are aligned with each other. (Here it makes no
difference whether xi and yj are the same.)
G[i, j] = G[i – 1, j – 1] + s(xi, yj)
X: C G T
Y: C G C
2. xi is deleted (aligned against a dash).
G[i, j] = G[i – 1, j] + g
X: C A T
Y: C A -
3. yj is deleted (aligned against a dash).
G[i, j] = G[i, j – 1] + g
X: C A -
Y: C A A
Global alignment recursion
(similarity scoring)
Global alignment example
match = +2, mismatch = -3, gap = -4
          C    G    T    A    G    C
     0   -4   -8  -12
C   -4    2   -2   -6
T   -8   -2   -1    ?
A  -12   -6   -5
G  -16  -10   -4
A  -20  -14   -8
Global Alignment Example
match = +2, mismatch = -3, gap = -4
          C    G    T    A    G    C
     0   -4   -8  -12  -16  -20  -24
C   -4    2   -2   -6  -10  -14  -18
T   -8   -2   -1    0   -4   -8  -12
A  -12   -6   -5   -4    2   -2   -6
G  -16  -10   -4   -8   -2    4    0
A  -20  -14   -8   -7   -6    0    1
Global Alignment Example
Tracing back for an alignment
          C    G    T    A    G    C
     0   -4   -8  -12  -16  -20  -24
C   -4    2   -2   -6  -10  -14  -18
T   -8   -2   -1    0   -4   -8  -12
A  -12   -6   -5   -4    2   -2   -6
G  -16  -10   -4   -8   -2    4    0
A  -20  -14   -8   -7   -6    0    1
C G T A G C
C - T A G A
Global alignment time complexity
As with the LCS problem, there are (n + 1) (m + 1) cells in
the dynamic programming array. Each is filled by
examining 3 other cells in constant time. The time
complexity to fill the array is O(nm).
Tracing back for a global alignment takes at most n + m steps.
The total time complexity is therefore O(nm).
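The three cases listed earlier (align xi with yj, delete xi, delete yj) can be combined into a sketch of the fill and traceback. The function name is illustrative, and taking the maximum of the three cases is assumed to be the similarity-scoring recursion, whose slide did not survive extraction. The defaults are the parameters of the worked example.

```python
def global_align(x, y, match=2, mismatch=-3, gap=-4):
    """Global alignment with a single-character gap penalty, O(nm) time."""
    n, m = len(x), len(y)
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        G[i][0] = i * gap               # prefix of x aligned against dashes
    for j in range(1, m + 1):
        G[0][j] = j * gap               # prefix of y aligned against dashes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            G[i][j] = max(G[i - 1][j - 1] + s,   # xi aligned with yj
                          G[i - 1][j] + gap,     # xi deleted
                          G[i][j - 1] + gap)     # yj deleted
    # Trace back from the last cell, preferring the diagonal move.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if G[i][j] == G[i - 1][j - 1] + s:
                row_x.append(x[i - 1]); row_y.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and G[i][j] == G[i - 1][j] + gap:
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
    return G[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))

if __name__ == "__main__":
    print(global_align("CGTAGC", "CTAGA"))
```

Note that this sketch indexes the array as G[i over X][j over Y], the transpose of the layout drawn in the example tables.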
Global alignment and affine gap penalty
Recall the affine gap penalty function
g(k) = α + βk
When xi or yj is deleted, we have to consider that it could be
the last of a string of characters that is deleted as one unit.
And the size of that unit will affect the deletion cost.
Time Complexity (naïve)
with Affine Gap Cost
For each (i, j) in the alignment matrix, there are O(n + m)
possible deletion costs that must be considered in order to
choose the optimal cost. Without any improvements, the
time complexity grows to O(nm(n + m)), which is cubic,
O(n^3), when the two sequences have similar lengths.
Refining the Affine Gap Computation
The regularity of deletion costs helps reduce the time
complexity. Observe the two tables.
Auxiliary Functions for Affine Gap
E[i,j] is the maximum of all possibilities
Auxiliary Functions for Affine Gap
F[i,j] is the maximum of all possibilities
Global Alignment with Affine Gap recursion
(similarity scoring)
Time Complexity with affine gap cost
A total of (n +1)(m + 1) cells must be computed. For each
cell, E, F, and G values must be computed. E and F both
require looking up 2 values. G requires looking up 3
values. Time to compute scores is O(nm).
Tracing back can be done in O(n + m) if the E and F values
are retained. This triples the memory space required for
scoring arrays (E, F, and G).
Total time O(nm).
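One common way to realize the E, F, G scheme is sketched below. The slide formulas did not survive extraction, so the exact recurrences are an assumption: here E holds the best score ending with yj against a dash, F the best score ending with xi against a dash, and g(k) = α + βk with negative α and β. Only the score is computed; traceback is omitted.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=5, mismatch=-3, alpha=-5, beta=-4):
    """Global alignment score with affine gap penalty g(k) = alpha + beta * k."""
    n, m = len(x), len(y)
    G = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    G[0][0] = 0
    for j in range(1, m + 1):
        E[0][j] = alpha + beta * j      # one gap covering y[1..j]
        G[0][j] = E[0][j]
    for i in range(1, n + 1):
        F[i][0] = alpha + beta * i      # one gap covering x[1..i]
        G[i][0] = F[i][0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Two lookups each: extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + beta, G[i][j - 1] + alpha + beta)
            F[i][j] = max(F[i - 1][j] + beta, G[i - 1][j] + alpha + beta)
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Three lookups: aligned pair, or a gap ending in either sequence.
            G[i][j] = max(G[i - 1][j - 1] + s, E[i][j], F[i][j])
    return G[n][m]

if __name__ == "__main__":
    print(affine_global_score("AAATTT", "AAACCTTT"))
```

Setting alpha to 0 reduces this to the single-character gap penalty, which gives a convenient consistency check against the earlier global alignment example.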
Local Alignment Problem
Given: two sequences X and Y and alignment scoring
functions,
Find: the best scoring alignment over all substring pairs, one
from X and one from Y.
Solution: Dynamic Programming
Local Alignment Looks Harder
than Global Alignment
Where global alignment asks for the solution to one problem,
the best alignment of X[1…m] versus Y[1…n], local
alignment asks for the best alignment out of O(n^4)
subproblems, any substring in X versus any substring in Y:
X[h…i] versus Y[k…j]
for 1 ≤ h ≤ i ≤ m, 1 ≤ k ≤ j ≤ n
Instead, we solve O(n^2) subproblems, the best alignment of
any substring ending at xi versus any substring ending at
yj.
Local Alignment and Similarity Scoring
Local alignment uses similarity scoring for the
following reason. When an alignment score is negative,
the alignment is “worse” than no alignment at all. For an
(i, j) pair, it often happens that the best alignment of every
substring ending at xi with every substring ending at yj has
a negative score.
Similarity scoring detects these “bad” alignments and local
alignment discards them. If every alignment score for an
(i, j) cell is negative, then the score is reset to zero.
Local Alignment Recursion
Local Alignment Time Complexity
Proportional to the product of the sequence lengths: O(nm).
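The reset-to-zero rule described above can be sketched as follows. The recursion slide did not survive extraction, so this is the standard local alignment recurrence with a single-character gap penalty, offered as an assumption; the function name and default scores are illustrative.

```python
def local_align(x, y, match=2, mismatch=-3, gap=-4):
    """Best local alignment score and aligned substrings, O(nm) time."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # A negative score is worse than no alignment, so reset to zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    row_x, row_y = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))

if __name__ == "__main__":
    print(local_align("GGGACGTCCC", "TTTACGTAAA"))
```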
Finding Suboptimal Alignments
When computing local alignment, we may want to know
optimal and suboptimal alignments. This can be important
in the case where the sequences contain several parts that
are similar.
High Alignment Scores may not be
Independent
Declump scores by prohibiting match and
substitution pairings from realignment
Source: Michael S. Waterman, 1994
Substitution Matrices
• Used for protein alignments
• Substitution rates for amino acid pairs are determined from
known similar sequences
• Matrices contain log-odds scores
• First matrices designed by Margaret Dayhoff are called
PAM matrices (Point Accepted Mutation)
• Current matrices designed by Henikoff and Henikoff are
called BLOSUM matrices (Blocks Substitution Matrices)
Odds Ratios
Odds is a ratio of probabilities for two events which are
mutually exclusive.
In horse racing, for example, odds is the ratio of the betting
that a horse will lose to the betting that the horse will win.
So, for a horse with odds of 20 to 1, the betting is 20 times
higher that the horse will lose than it will win, while for a
horse with odds of 3 to 2, the betting is only 1.5 times
higher that the horse will lose than win.
Alignment and Odds
A substitution score can be interpreted as an odds ratio. For
an individual pair of aligned amino acids, the events are
• The pair are aligned because they are evolutionarily related
• The pair are aligned merely by chance.
For each pair, the relevant question is:
“What are the odds that amino acid i would be substituted for
amino acid j if they were evolutionarily related?”
If the odds are good, then the pair supports the alignment.
If the odds are bad, then the pair reduces confidence in the
alignment.
Log Odds
The odds for the entire alignment:
“Aligned because the sequences are evolutionarily related
or aligned by chance alone”
can be obtained by multiplying the odds for each aligned pair
of amino acids.
Multiplication is expensive computationally, so logarithms
are used because they can be added.
Log-Odds Substitution Matrices
The BLOSUM and PAM substitution matrices contain log-odds values. The ratios have the basic form
Oij / Eij
where Oij is the observed probability of pairing amino acids i and j in
related sequences and Eij is the expected probability of pairing at random.
BLOSUM Data
DATA:
Ungapped multiple alignments (blocks) taken from 504
families of known related protein sequences in the
PROSITE database. In the original paper (1992), this
produced 2100+ blocks.
BLOSUM Observed Frequencies
A typical block:
R T S C L H
K T S A N P
K W D C L P
K … … … … …
R … … … … …
E … … … … …
First column pairs:
RK: 6
RR: 1
RE: 2
KK: 3
KE: 3
Repeat and accumulate for every column.
Fij = pair (i, j) counts
Oij = Fij / total number of pairs = observed pair frequencies
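The pair counting for a single block column can be sketched like this. The function name is illustrative, and pairs are stored as alphabetically sorted tuples, so the slide's RK appears under the key ("K", "R").

```python
from collections import Counter
from itertools import combinations

def column_pair_counts(column):
    """Count every unordered pair of residues in one block column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts

first_column = "RKKKRE"
counts = column_pair_counts(first_column)

if __name__ == "__main__":
    for pair, c in sorted(counts.items()):
        print("".join(pair), c)
```

A column of 6 residues yields 6 × 5 / 2 = 15 pairs, which matches the total of the counts listed for the first column.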
BLOSUM Background Frequencies
Pi = probability of amino acid i occurring in an amino acid
pair
Pi = Oii + Σ(j ≠ i) Oij / 2
Eij = expected probability of random pairs
Eij = Pi Pj                       if i = j
Eij = Pi Pj + Pj Pi = 2 Pi Pj     if i ≠ j
BLOSUM Log-Odds Values
Sij = log-odds ratio for aligning amino acids i and j.
Sij = 2 log2 (Oij / Eij)
If observed frequencies are
• as expected, Sij ~ 0
• greater than expected, Sij > 0 (positive)
• less than expected, Sij < 0 (negative)
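The chain from pair counts to log-odds values can be sketched as below. The function name is illustrative, and the input is a toy set of counts taken from the single example column (R K K K R E), not real BLOSUM data; the scores are left unrounded.

```python
import math
from collections import defaultdict

def log_odds(pair_counts):
    """Return background frequencies P and scores S = 2 * log2(O / E)."""
    total = sum(pair_counts.values())
    O = {pair: c / total for pair, c in pair_counts.items()}
    P = defaultdict(float)
    for (i, j), o in O.items():
        if i == j:
            P[i] += o                   # Oii counts fully toward Pi
        else:
            P[i] += o / 2               # Oij is split between i and j
            P[j] += o / 2
    S = {}
    for (i, j), o in O.items():
        e = P[i] * P[j] if i == j else 2 * P[i] * P[j]
        S[(i, j)] = 2 * math.log2(o / e)
    return dict(P), S

toy_counts = {("K", "R"): 6, ("R", "R"): 1, ("E", "R"): 2,
              ("K", "K"): 3, ("E", "K"): 3}
P, S = log_odds(toy_counts)

if __name__ == "__main__":
    print(P)
    print({pair: round(score, 2) for pair, score in S.items()})
```

In this toy column, R occurs in 2 of 6 positions and K in 3 of 6, and the computed Pi values reproduce those fractions.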
Related Sequences Must Be Clustered
Overrepresentation of closely related sequences can bias the
matrices.
K … … … … …   (4 sequences more than 80% identical)
K … … … … …
K … … … … …
K … … … … …
R … … … … …   (other related sequences)
E … … … … …
The overrepresentation of the closely related sequences
increases the observed K to K pairing, falsely increasing
the log-odds score for that pair.
Clustering Sequences
Closely related sequences are clustered and the letters in any
cluster are given fractional values. In the cluster below,
each K in the first column counts as ¼ rather than 1, so the
four together count as a single K. In the fourth column the
two R's and two T's likewise count as ¼ each, so the cluster
contributes ½ R and ½ T.
K … … R … …   (4 sequences more than 80% identical, clustered)
K … … R … …
K … … T … …
K … … T … …
R … … L … …
E … … K … …
The BLOSUM Matrix Family
Blosum 80 clusters sequences if they are ≥ 80% identical
(clustering is not transitive).
Blosum 62 clusters sequences if they are ≥ 62% identical.
Lower numbers yield matrices that give higher scores to
alignments of distantly related sequences.
Topic 3 Outline
Database Searching Algorithms
• Sensitivity vs Selectivity vs Running Time
• Dot Plots
• Blast
• High Scoring Segment Pairs (HSP)
• Target Words
• Extending a Hit
• Statistics of HSP scores
• Gapped Blast modifications
Alternatives to Alignment
Alignment is fine when the sequences are relatively short, but
is unusable for longer sequences such as are encountered
in
• database searches
• comparison of genomes
• large repetition identification
because it takes too long. For these, we need alternate
methods.
Dot Plots
Two dimensional array like an alignment array. Put a dot in
each cell where the sequence characters “match”. Long
diagonal runs of dots indicate sequence similarity.
Match rule for proteins might be positive substitution value in
BLOSUM matrix.
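A dot plot is simple to sketch in text form. The function name and the default match rule (exact character equality) are assumptions; a protein match rule could be passed in as a different `match` function.

```python
def dot_plot(x, y, match=lambda a, b: a == b):
    """Return one string per character of x, with '*' where x[i] matches y[j]."""
    return ["".join("*" if match(a, b) else "." for b in y) for a in x]

if __name__ == "__main__":
    for row in dot_plot("ACGTTGAC", "ACGATGAC"):
        print(row)
```

Long diagonal runs of `*` in the printed grid correspond to similar regions.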
Extended Dot Plots
(Nature, 19 June 2003, Vol 423, p 831)
[Figure: dot plot patterns labeled inverted repeat, tandem repeat, and direct repeat.]
Database Search Algorithms:
Sensitivity, Selectivity, Running Time
• Sensitivity – the ability to detect weak similarities between
sequences (often due to long evolutionary separation).
Increasing sensitivity reduces false negatives, those
database sequences similar to the query, but rejected.
• Selectivity – the ability to screen out similarities due to
chance. Increasing selectivity reduces false positives,
those sequences recognized as similar when they are not.
BLAST
The BLAST program is designed for fast sequence to
database search. The basic idea is that a high scoring
alignment between the query and a database sequence will
almost always contain a core part that is well conserved.
This core is called a High-scoring Segment Pair (HSP).
The statistical theory behind BLAST deals with the
probability of finding HSPs.
High-scoring Segment Pair (HSP)
An HSP consists of two equal length substrings, one from the
query and one from the database sequence. When aligned
without gaps, the score, S, is above a specified threshold
and is locally maximal, meaning that extending the
substrings on either end to lengthen the alignment
produces a smaller score.
Suppose we have the following two sequences:
X = LKFSFALCCTIG
Y = ADQHLSRPTWAFYC
One of many segment pairs is (BLOSUM scoring):
S  F  A  L  C  C
T  W  A  F  Y  C
1  1  4  0 -2  9   = 13
High Scoring Segment Pairs
X = LKFSFALCCTIG
Y = ADQHLSRPTWAFYC
S  F  A  L  C  C
T  W  A  F  Y  C
1  1  4  0 -2  9   = 13
This pair is locally maximal because shortening or
lengthening the alignment reduces the score.
 L  K  F  S  F  A  L  C  C
 S  R  P  T  W  A  F  Y  C
-2  2 -4  1  1  4  0 -2  9
Assumptions for the Statistical Theory
We assume
• a simple protein model – the probabilities for the amino
acids appearing at each position in a protein are
independent and identically distributed (iid) and amino
acid i occurs randomly with probability Pi.
• the substitution matrix (such as BLOSUM 62) has at least
one positive score.
• the expected score for amino acid pairs,
Σ pi pj sij,
is negative.
Normalized Scores of HSPs
Given Pi and Sij, the statistical theory yields two parameters,
λ and K, which can be used to normalize an HSP score S
with the formula:
S' = (λS – ln K) / ln 2
The units of S' are bits. When two random protein
sequences are compared, the expected number of HSPs
with a normalized score of at least S' is:
E = N / 2^S'
where N is the product of the sequence lengths.
Score for a Given Statistical Significance
Solving for S' yields:
S' = log2 (N / E)
For example, if
• the query protein has length 250.
• the database has length 50 000 000
• the E-value is .05
then the score required to reach that level of statistical
significance is ~ 38 bits.
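The normalization and the worked example can be checked numerically with a short sketch. The function names are illustrative; λ and K must be supplied by the caller, since their values depend on the scoring system and are not given here.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expected_hsps(N, bits):
    """Expected number of HSPs, E = N / 2^S'."""
    return N / 2 ** bits

def bits_for_evalue(N, E):
    """Bit score needed for a given E-value, S' = log2(N / E)."""
    return math.log2(N / E)

if __name__ == "__main__":
    N = 250 * 50_000_000        # query length times database length
    print(round(bits_for_evalue(N, 0.05), 2))
```

For the example values this gives about 37.9 bits, consistent with the ~38 bits quoted.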
Deficiencies in the BLAST theory
• The result is asymptotic, meaning there is some error for
finite sequences.
• Local variations in residue composition in real sequences
often do not match the model.
• The model assumes that all mutations at all sites in a
protein sequence can be described by a single substitution
matrix.
• It does not apply to gapped alignments.
Finding HSPs
Blast finds HSPs by first looking for well conserved core
alignments, i.e., ungapped alignments between short query
“words” of length 3 and database “words”. Each core
alignment must score greater than a threshold score, T,
which is typically 13 using the BLOSUM matrix.
Core alignments or “hits” are typically found using pattern
matching finite automata, i.e. linear search through the
database.
Query and Target Words
Suppose the query sequence is LVNRKPVVP.
• Chop the query into all possible words of length 3:
LVN VNR NRK RKP KPV PVV VVP
• Collect “target words” associated with each query word.
For example, the word RKP has six target words that score
at least 13 (using BLOSUM 62) when aligned with RKP:
QKP KKP RQP REP RRP RKP
In general, there will be a large set of target words derived
from the query sequence. Each target word and its
associated query word could form the core of an HSP.
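Chopping the query into words and scoring candidate target words can be sketched as follows. The function names are illustrative, and the score table holds only the handful of BLOSUM 62 entries needed for this example (keyed by alphabetically sorted pairs); a real search would use the full matrix and enumerate all candidate words.

```python
def query_words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

# Subset of BLOSUM 62 scores, just enough for the RKP example.
BLOSUM62_SUBSET = {
    ("R", "R"): 5, ("K", "K"): 5, ("P", "P"): 7,
    ("K", "R"): 2, ("Q", "R"): 1, ("K", "Q"): 1, ("E", "K"): 1,
}

def word_score(a, b, scores=BLOSUM62_SUBSET):
    """Ungapped score of two equal-length words."""
    return sum(scores[tuple(sorted(pair))] for pair in zip(a, b))

if __name__ == "__main__":
    print(query_words("LVNRKPVVP"))
    for target in ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]:
        print(target, word_score(target, "RKP"))
```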
Extending a Hit
A hit is extended in either direction to find its locally maximal
segment pair. Extension terminates when the score drops
too far below the highest score found.
For example, suppose the target RRP is found and it occurs in
the sequence
EPGVCRRPLKCTAS
When trying to extend the core against LVNRKPVVP, the
HSP has 6 letters (VNRKPV aligned with VCRRPL) and a score of 16:
 L  V  N  R  K  P  V  V  P
 G  V  C  R  R  P  L  K  C
-3  4 -3  5  2  7  1 -2 -3
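The extension step can be sketched on the column scores of this example. The function name and the drop-off value `x_drop=5` are assumptions for illustration; the slides say only that extension stops when the score drops too far below the highest score found.

```python
def extend_hit(column_scores, core_start, core_end, x_drop=5):
    """Extend a core (inclusive column range) in both directions.

    Returns (start, end, score) of the best segment found.
    """
    def extend(indices):
        best = running = 0
        best_index = None
        for idx in indices:
            running += column_scores[idx]
            if running > best:
                best, best_index = running, idx
            elif best - running > x_drop:
                break                   # dropped too far below the best score
        return best, best_index

    core = sum(column_scores[core_start:core_end + 1])
    right_gain, right_idx = extend(range(core_end + 1, len(column_scores)))
    left_gain, left_idx = extend(range(core_start - 1, -1, -1))
    start = core_start if left_idx is None else left_idx
    end = core_end if right_idx is None else right_idx
    return start, end, core + left_gain + right_gain

# Column scores of LVNRKPVVP against GVCRRPLKC; the core RKP/RRP is columns 3..5.
scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]

if __name__ == "__main__":
    print(extend_hit(scores, 3, 5))
```

With these values the extension keeps columns 1 through 6, the 6-letter HSP with score 16. A much smaller drop-off would stop the leftward extension at the N/C mismatch and miss the V/V match.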
Gapped Blast
Gapped BLAST (1997) has two improvements over the
original BLAST (1990).
• Two hits: a core alignment is extended only when two hits occur
nearby on the same “diagonal”. This involves lowering the
threshold T to retain sensitivity, but it reduces the number of
extensions, which are the most costly step.
• Gapped alignments are computed. This allows raising the
threshold T while retaining selectivity, which speeds the initial
database scan.
Optimal Substitution Matrices
for Distant Homologies
Among HSPs (ungapped) from the comparison of random
sequences, amino acids ai and aj are aligned with
frequency approaching:
qij = pi pj e^(λ sij)
The qij are called “target frequencies” for the given
substitution matrix sij. Among alignments of distantly
related proteins, amino acids tend to be paired with certain
characteristic frequencies. “Only if these correspond to a
matrix’s target frequencies ... can the matrix be optimal for
distinguishing the distant local homologies.” (Altschul –
1991)
Log-Odds Matrices Again
Rearranging qij = pi pj e^(λ sij) yields:
sij = ln(qij / (pi pj)) / λ
where λ acts as a scaling factor for the logarithm. This shows
• that the substitution scores are inherently log-odds scores
for the target and background frequencies, and
• scores may be chosen that correspond to any desired set of
target frequencies.
Topic 4 Outline
Specialized Sequence Alignment Algorithms
• Sim4
• Blat
sim4
Sim4 is a program for aligning a cDNA sequence to a
genomic sequence.
cDNA is complementary to a messenger RNA which is the
RNA molecule after introns have been cut out and the
exons spliced together. The difference between cDNA and
genomic DNA is the absence of the intron sequences in
cDNA.
Sim4 assumes that the differences between the two sequences
are limited to:
• introns in the genomic sequence
• sequencing errors in either sequence
sim4 – Find HSPs
In the first step, sim4 finds HSPs which must have an exact
matching core of 12 nucleotides. The core is extended on
both ends with a score of 1 for match and -5 for mismatch.
[Figure: HSPs between the cDNA and the genomic sequence.]
sim4 – Select chains of HSPs
HSPs are chained with the constraints that
• starting positions in the cDNA are increasing
• HSPs are in nearby diagonals or are in diagonals separated
by plausible intron distances
[Figure: chained HSPs between the cDNA and the genomic sequence; starting positions increase, and the gaps between chained HSPs are typical for introns.]
sim4 – Trim overlaps in cDNA
Exon cores that overlap in the cDNA are trimmed to find
GT ... AG, a common intron signal, in the genomic DNA.
[Figure: overlapping exon cores before and after trimming, with GT and AG marking the intron ends in the genomic DNA.]
sim4 – Filling Gaps in cDNA
Gaps are filled by recursively finding HSPs with smaller
cores (starting with exact matches of length 8). The smaller
cores are chained as before.
BLAT – The BLAST-Like Alignment Tool
BLAT is a specialized program for mRNA to DNA
alignments and cross-species protein alignments. It is
faster than BLAST and sim4 for these tasks. It is not
appropriate for distant homology searching.
BLAT is available at the UC Santa Cruz human genome
website for database searches against the human genome.
BLAT vs BLAST
Both BLAT and BLAST first look for high-scoring pairs:
• BLAST collects short words in the query and does a linear
scan through the database for those words.
• BLAT builds an index of the database (a data structure
suited for rapid detection of word matches) and then scans
linearly through the query. Since the query is much
smaller than the database, the scan phase is very rapid.
The data structure is persistent, meaning that it is built
once and then reused for every query search.
BLAT – Hits
To form the database index, BLAT cuts the database
sequences into non-overlapping words of length k. The query
is cut into k-words, and a hit is an exact match or a
near perfect match (single character mismatch) with
any query subword.
BLAT – Criteria for Extending Hits
BLAT allows different criteria for extending hits:
• Single exact match
• Single near-exact match (one difference)
C A G T G C G A T G A
C A G T A C G A T G A
• Two exact matches on the same
diagonal separated by a small
distance
BLAT – Statistics
The size, k, of matching words is selected to balance
sensitivity and selectivity with running time. BLAT makes
flexible assumptions about the query and database
sequences to select k, the word match size:
• M – the percent matching between homologous regions
(98% for cDNA/genomic alignments, 89% for cross-species protein alignments)
• H – the size of the homologous regions
• G – the size of the database (3 000 000 000 bases)
• Q – the size of a query
• A – the alphabet size (4 for DNA, 20 for protein)
BLAT – Statistics for k Selection
(Exact Match)
The number of non-overlapping k-words in a homologous
region is:
T = floor (H / k)
Sequence letters are assumed to be iid. The probability that:
• a match occurs between a homologous region k-word and
the query is
p = M^k
• no match occurs is:
q = (1 – p)
• at least one homologous region word matches the query is:
Phit = 1 – q^T
BLAT – Statistics for k Selection
(Exact Match)
The probability that two k-words match by chance is:
r = (1/A)^k
The number of:
• query words is qw = Q – k + 1
• database words is dw = G/k
The number of k-words that are expected to match by
chance is:
F = qw · dw · r
BLAT – Statistics for k Selection
With increasing k,
• the probability of a valid hit, Phit, goes down because it
becomes harder to find two words that match due to errors
or mutations.
• the number of false hits, F, goes down because probability
of random word matching decreases as words grow.
For example, in DNA, with M = 0.95, H = 100, and Q = 500,
if
• k = 11, then Phit = 0.999 and F = 32 512
• k = 14, then Phit = 0.991 and F = 399
By decreasing sensitivity slightly, the number of false hits can
be reduced by almost a factor of 100.
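The trade-off can be reproduced with a short sketch of the exact-match formulas. The function name is illustrative; G and A default to the human genome size and DNA alphabet given earlier. With qw = Q – k + 1 the computed F comes out slightly below the quoted figures (roughly 31,900 and 389), which appear to correspond to using Q itself as the number of query words.

```python
def blat_stats(k, M, H, Q, G=3_000_000_000, A=4):
    """Return (Phit, F) for exact k-word matching."""
    T = H // k                          # non-overlapping k-words in the region
    p = M ** k                          # a given k-word matches the query
    p_hit = 1 - (1 - p) ** T            # at least one word in the region matches
    r = (1 / A) ** k                    # two k-words match by chance
    F = (Q - k + 1) * (G / k) * r       # expected chance matches
    return p_hit, F

if __name__ == "__main__":
    for k in (11, 14):
        p_hit, F = blat_stats(k, M=0.95, H=100, Q=500)
        print(k, round(p_hit, 3), round(F))
```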
BLAT indexing
Each k-word (nonoverlapping) in the database is converted to
a number in base four:
T C A G T T A
3 1 0 2 3 3 0 (base 4)
A list, I, of all possible numbers is maintained. For DNA, when k = 7,
this list has 4^7 or 16,384 entries.
I: 3102330 3102331
L: 100 350 600 730 870 930
List I points to a list L, the sorted locations of the k-word in
the database sequences.
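The base-four encoding and the position lists can be sketched as below. The names are illustrative, and a dictionary stands in for the fixed-size list I of all 4^k numbers, which is a simplification rather than BLAT's actual data structure. The digit assignment A=0, C=1, G=2, T=3 follows the TCAGTTA example.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Convert a DNA k-word to its base-four number."""
    value = 0
    for ch in word:
        value = value * 4 + CODE[ch]
    return value

def build_index(database, k):
    """Map each non-overlapping k-word number to its sorted start positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[encode(database[pos:pos + k])].append(pos)
    return index

if __name__ == "__main__":
    print(encode("TCAGTTA"))
    print(dict(build_index("ACGTACGTAC", 2)))
```

Because positions are appended in scanning order, each list is already sorted, matching the description of L.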
Deficiencies in the BLAT Program
• Near exact matching with proteins requires an index which
is too large to fit in memory, so a less efficient hashing
scheme is used instead
• The alignment method is not standard optimal alignment
and
– for DNA does not work well below 90% sequence
identity
– for proteins does not work well with indels in one of the
sequences
Topic 5 Outline
Searching for Repetitive Motifs and Patterns
• Pattern Detection vs Alignment
• Short word methods
• TRF
Pattern Detection
In pattern detection problems:
• there is no query sequence
• we look for a repetitive pattern or motif in one long
sequence or several sequences
• broad characteristics of the target pattern are specified in
advance
Short Word Methods
In short word methods, small matching words are detected. A
cluster of short words indicates a potential pattern. A
typical application is the search for tandem repeats in
DNA sequences. Word spacings may differ.
Short Word Methods
Short word methods are well suited to DNA sequences
because
• the alphabet is small so exact matching words are common
even in homologous regions that have experienced
significant mutation.
• they can handle insertions and deletions that are common
in DNA
They are less well suited to protein sequences because
• the large amino acid alphabet and frequent substitutions make
short matching words uncommon.
ISSCB 2003 Benson
Tandem Repeats
A tandem repeat is any pattern of nucleotides that has been
duplicated so that it appears several times in succession.
For example, the sequence fragment below contains a tandem
repeat of the trinucleotide CGT:
tcgctggtcatacgtcgtcgtcgtcgttacaaacgtcttccgt
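For exact repeats like this one, a regular expression is enough to locate the run. The sketch below is only an illustration for exact copies of a known unit (the function name `find_exact_tandem_repeats` is hypothetical); it is not how approximate repeats are detected.

```python
import re

def find_exact_tandem_repeats(seq, unit, min_copies=2):
    """Return (start, copies) for each run of at least min_copies exact copies of unit."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_copies))
    return [(m.start(), (m.end() - m.start()) // len(unit))
            for m in pattern.finditer(seq)]

fragment = "tcgctggtcatacgtcgtcgtcgtcgttacaaacgtcttccgt"
print(find_exact_tandem_repeats(fragment, "cgt"))  # [(12, 5)]
```

In the fragment above, the repeat starts at 0-based position 12 and contains five copies of CGT.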
ISSCB 2003 Benson
Approximate Tandem Repeats
More typically, the tandem copies are only approximate due to
mutations. Here is an alignment of copies from a tandem repeat
in C. elegans. Shown are the copies and a consensus pattern.
ISSCB 2003 Benson
Tandem Repeats Associated with Human Disease
• Trinucleotide diseases caused by expansion of a trinucleotide
repeat:
– Fragile-X mental retardation
– Myotonic dystrophy
– Huntington’s disease
– Friedreich’s ataxia
• Multilocus diseases linked in some cases to unstable or uncommon
minisatellites:
– Epilepsy
– Diabetes
– Ovarian cancer
ISSCB 2003 Benson
Tandem Repeats Function and Usefulness
Tandem repeats:
• are involved in gene regulation and often contain
putative transcription factor binding sites.
• exhibit copy number polymorphism, making them valuable
genomic markers.
ISSCB 2003 Benson
Tandem Repeats Finder (TRF)
TRF finds tandem repeats in genomic DNA sequences using
short word matches (k-tuple matches). It assumes:
• repeats have on average >80% sequence identity
• insertions and deletions occur on average in < 10% of the
pattern positions
These are average values for program parameter settings.
Repeats with lower sequence identity and higher indel
frequency are also found.
ISSCB 2003 Benson
LOCUS HUMFMR1 3765 bp mRNA PRI 08-NOV-1994
DEFINITION Human Fragile X mental retardation 1 FMR-1 gene, 3' end, clones
   1 gacggaggcg cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg gcggcggcgg
  61 cggcggaggc ggcggcggcg gcggcggcgg cggcggaggc ggcggcggcg gcggcggcgg
 121 cggcggctgg gcctcgagcg cccgcagccc acctctcggg ggcgggctcc cggcgctagc
 181 agggctgaag agaagatgga ggagctggtg gtggaagtgc ggggctccaa tggcgctttc
 241 tacaaggcat ttgtaaagga tgttcatgaa gattcaataa cagttgcatt tgaaaacaac
 301 tggcagcctg ataggcagat tccatttcat gatgtcagat tcccacctcc tgtaggttat
 361 aataaagata taaatgaaag tgatgaagtt gaggtgtatt ccagagcaaa tgaaaaagag
 421 ccttgctgtt ggtggttagc taaagtgagg atgataaagg gtgagtttta tgtgatagaa
 481 tatgcagcat gtgatgcaac ttacaatgaa attgtcacaa ttgaacgtct aagatctgtt
 541 aatcccaaca aacctgccac aaaagatact ttccataaga tcaagctgga tgtgccagaa
 601 gacttacggc aaatgtgtgc caaagaggcg gcacataagg attttaaaaa ggcagttggt
 661 gccttttctg taacttatga tccagaaaat tatcagcttg tcattttgtc catcaatgaa
 721 gtcacctcaa agcgagcaca tatgctgatt gacatgcact ttcggagtct gcgcactaag
 781 ttgtctctga taatgagaaa tgaagaagct agtaagcagc tggagagttc aaggcagctt
 841 gcctcgagat ttcatgaaca gtttatcgta agagaagatc tgatgggtct agctattggt
 901 actcatggtg ctaatattca gcaagctaga aaagtacctg gggtcactgc tattgatcta
 961 gatgaagata cctgcacatt tcatatttat ggagaggatc aggatgcagt gaaaaaagct
1021 agaagctttc tcgaatttgc tgaagatgta atacaagttc caaggaactt agtagtaata
1081 ggaaaaaatg gaaagctgat tcaggagatt gtggacaagt caggagttgt gagggtgagg
1141 attgaggctg aaaatgagaa aaatgttcca caagaagagg aaattatgcc accaaattcc
1201 cttccttcca ataattcaag ggttggacct aatgccccag aagaaaaaaa acatttagat
ISSCB 2003 Benson
LOCUS RATIGCA 4461 bp DNA ROD 18-APR-1994
DEFINITION Rat Ig germline epsilon H-chain gene C-region, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg gtttagcaat agcccaaagc aagctatgca
2941 tccatctcag gcccagaggg atgaggagac cagaatcaag acatacccac gcccatccca
3001 cgcccaacca ccaaccacca gcacatcagg ttcacacacc tgagaccagt ggctcccatc
3061 acacacacac acacacacac acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct ctgaggatca ccaatggcag
3181 agtcggccag cacctcagcc tccaggccaa tccttatact ttggcccact gcaggccatg
3241 agagatggag gaggtggagg cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat gagagatgga ggaggtggag
3361 gcctgagctg tggaaaacca gagacaggaa gatggtctgt atggagagag tagtaaacca
3421 gattataggg agactgaggc aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg ctcctcagat atcacaacca
3541 agattcctgg agccagagtg tgcatgcagg ccctagaaga aatgtggagc ttagagccct
3601 tcctggaggg ccctgggcac tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca gagacacaga cacacacaca
3721 tgcccacaca catgcataca cacatgcaca cacatacaca cacagagaca cagacacaca
3781 catgcccaca cacatgcata cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg tctagcgata gccctgctga
3901 gggtgggaga tactgggtca tggtgggcac cggagtagaa agagggaatg agcagtcagg
3961 gtcaggggaa aaggacatct gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga ctgcctgaga ccagtgaggg
ISSCB 2003 Benson
Basic Assumption
Mutated, adjacent copies of a pattern will contain runs of
exact matches.
[Figure: two sequence fragments, each marked with a distance d]
TATAC G T C GAGAC TTA
T C CAC G GAGATATTTA
Runs of matches are identified using k-tuple matches.
ISSCB 2003 Benson
Using k-tuple matches
For purposes of program efficiency, fixed size k-tuples are
used.
[Figure: k-tuple matches at distance d]
ISSCB 2003 Benson
k-tuple matches
In pattern matching, a k-tuple is a window of length k that
contains text characters. Two windows that contain the
same text form a k-tuple match:
GAACGTTAGGTAACTGCAT
CCTAGTTATACGTTAAC
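A minimal sketch of finding k-tuple matches between two strings follows. It indexes the windows of one string in a dictionary and looks up the windows of the other; the function name `ktuple_matches` is hypothetical and this is not TRF's implementation.

```python
def ktuple_matches(a, b, k):
    """Return all (i, j) such that the length-k windows a[i:i+k] and b[j:j+k] are equal."""
    windows = {}
    for j in range(len(b) - k + 1):
        windows.setdefault(b[j:j + k], []).append(j)
    matches = []
    for i in range(len(a) - k + 1):
        for j in windows.get(a[i:i + k], []):
            matches.append((i, j))
    return matches

# The two fragments shown above share exactly one 6-tuple, ACGTTA.
print(ktuple_matches("GAACGTTAGGTAACTGCAT", "CCTAGTTATACGTTAAC", 6))  # [(2, 9)]
```

Positions are 0-based. For tandem repeat detection the same idea would be applied within a single sequence, where the difference between the two positions gives the distance between the matching windows.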
ISSCB 2003 Benson
Modeling Tandem Repeats
Multiple k-tuple matches suggest the occurrence of a tandem
repeat. The appropriate number of matches depends on how a
tandem repeat is defined. For example, are the following two
aligned sequence fragments two copies of the same
underlying pattern?
TCGGCATCAGTCTATGG
TCAA--TG-GTGT-TGG
ISSCB 2003 Benson
A Stochastic Model
TRF’s stochastic model is based on the probability of
character matching and the frequency of insertion and
deletion (indels) between aligned adjacent copies.
CCACAACC-CGTCAGGCAAGT
CTGCACCATCGTCTGGGAAGT
HTTHHTHTTHHHHTHHTHHHH
Note that the alignment has been converted into a Bernoulli
(coin-toss) sequence.
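The conversion can be sketched as follows: each alignment column becomes H (heads) when the two characters match and T (tails) for a mismatch or an indel. The function name `alignment_to_bernoulli` is a hypothetical label for this illustration.

```python
def alignment_to_bernoulli(top, bottom, gap="-"):
    """Turn two aligned rows into a string of H (match) and T (mismatch or indel)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "H" if a == b and a != gap else "T"
        for a, b in zip(top, bottom)
    )

coins = alignment_to_bernoulli("CCACAACC-CGTCAGGCAAGT",
                               "CTGCACCATCGTCTGGGAAGT")
print(coins)  # HTTHHTHTTHHHHTHHTHHHH
```

For the alignment shown above this reproduces the H/T row, with 14 heads in 21 columns.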
ISSCB 2003 Benson
Model Parameters
PM = the expected frequency of a match
PI = the expected frequency of an indel
The parameters are applied to Bernoulli sequences to
establish criteria for detecting the repeats.
ISSCB 2003 Benson
Number of matches to indicate a repeat
The sum of heads is the minimum number of matches required
for a repeat with
• period n
• match probability p
• tuple size k
Unless k = 1, not all matches will be detected, because only
heads in runs of length k or more are seen.
For example:
HHTHHHHTHTHHHTTHHHT
k    H seen
1    13
2    12
3    10
4    4
ISSCB 2003 Benson
Sum of heads
Suppose a random Bernoulli sequence has length 100 and the
expected number of heads is 75 (PM = 0.75). If we count the
number of heads, then 95% of the time we expect to count at
least 68 heads.
If we count only heads that occur in runs of length 5 or more,
then 95% of the time we expect to count at least 26 heads.
This is the sum of heads criterion.
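Two pieces of this reasoning can be checked with a short sketch. The first function counts heads that lie in runs of length k or more; the second finds the largest count h such that at least h heads occur with the stated probability, using the exact binomial distribution. Both function names are hypothetical, and the threshold for heads restricted to runs of length 5 or more is not computed here.

```python
from math import comb

def heads_in_runs(seq, k):
    """Count heads (H) that belong to runs of at least k consecutive heads."""
    return sum(len(run) for run in seq.split("T") if len(run) >= k)

def min_heads_threshold(n, p, confidence=0.95):
    """Largest h such that P(number of heads >= h) >= confidence for n tosses."""
    tail = 1.0  # P(X >= 0)
    h = 0
    while h < n:
        next_tail = tail - comb(n, h) * p**h * (1 - p)**(n - h)  # P(X >= h + 1)
        if next_tail < confidence:
            break
        tail = next_tail
        h += 1
    return h

example = "HHTHHHHTHTHHHTTHHHT"
print([heads_in_runs(example, k) for k in (1, 2, 3, 4)])  # [13, 12, 10, 4]
print(min_heads_threshold(100, 0.75))                      # 68
```

The first line of output matches the counts for HHTHHHHTHTHHHTTHHHT with k = 1 to 4, and the second matches the figure of at least 68 heads for length 100 and PM = 0.75.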
ISSCB 2003 Benson
Other Criteria
• Apparent Size – Used to distinguish tandem from nontandem repeats.
• Waiting time – Used to pick a suitable tuple size.
• Random walk – Used to accommodate insertions and
deletions.
ISSCB 2003 Benson
The Tandem Repeats Database
The Tandem Repeats Database (TRDB) is:
1. A public database of information on tandem repeats.
2. A private workspace for extended research on tandem
repeats.
ISSCB 2003 Benson
C. elegans Distribution of Repeat Pattern Size, Chr 1
ISSCB 2003 Benson
Human Distribution of Pattern Size, Chr 1
ISSCB 2003 Benson
C. elegans Distribution of Repeat Location, Chr 1
ISSCB 2003 Benson
Human Distribution of Repeat Location, Chr 1
ISSCB 2003 Benson
Human Distribution of Repeat Location, Chr 1,
for Pattern Size >= 5
ISSCB 2003 Benson
Clusters in 41 bp Repeats
ISSCB 2003 Benson
Two identical repeats
ISSCB 2003 Benson
Four repeats from a tandem repeat cluster
ISSCB 2003 Benson
Topic 6 Outline
Composition Alignment
• Sequence composition and composition match
• Composition alignment algorithm
• Composition match scoring functions
• Limiting the length of a composition match
• Growth of local composition alignment scores
• Biological examples
ISSCB 2003 Benson
Sequence Composition
Composition is a vector quantity describing the frequency
of occurrence of each alphabet letter in a particular string.
Let S be a string over Σ. Then
$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \ldots, f_{\sigma_{|\Sigma|}})$
is the composition of S, where $f_{\sigma_i}$ is the fraction of the
characters in S that are $\sigma_i$.
Note that the order of letters is irrelevant as it has no effect on
the composition.
ISSCB 2003 Benson
Composition Example
S = ACTGTACCTGGCGCTATT
C(S) = ( 0.17, 0.28, 0.22, 0.33 )
          A     C     G     T
ISSCB 2003 Benson
Composition Match
Two strings, S and T, have a composition match if their
lengths are equal and C(S) = C(T).
For example, S and T below have a composition match:
S = ACTGTACCTGGCGCTATT
T = AAACCCCCGGGGTTTTTT
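The definitions of composition and composition match can be sketched directly. The function names and the default DNA alphabet order (A, C, G, T) are assumptions made for this illustration.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Fraction of the characters in s that are each alphabet letter, in alphabet order."""
    if not s:
        raise ValueError("composition is undefined for an empty string")
    counts = Counter(s)
    return tuple(counts[letter] / len(s) for letter in alphabet)

def composition_match(s, t):
    """True when s and t have equal length and the same letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(tuple(round(f, 2) for f in composition(S)))  # (0.17, 0.28, 0.22, 0.33)
print(composition_match(S, T))                     # True
```

Comparing letter counts rather than floating-point fractions avoids rounding problems; for strings of equal length the two tests are equivalent.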
ISSCB 2003 Benson
Composition and Sequence Features
• Isochores – Multi-megabase, specifically GC-rich or GC-poor.
GC-rich isochores have greater gene density.
• CpG Islands – Several hundred nucleotides, rich in the
dinucleotide CG which is underrepresented in eukaryotic
genomes. Methylation of the cytosine in these dinucleotides
affects gene expression.
• Protein binding regions – Tens of nucleotides, dinucleotide
composition contributes to DNA flexibility, allowing the
helix to change shape during protein binding.
ISSCB 2003 Benson
Composition Alignment Problem
Given: Two sequences, S of length m, and T of length n,
over an alphabet Σ, and a scoring function cm(s, t) for the
score of a composition match between substrings s and t.
Find: The best scoring alignment (global or local) of S with T
such that the allowed scoring options include composition
match between substrings of S and T as well as the
standard options of single character match, single character
mismatch, insertion and deletion.
ISSCB 2003 Benson
Example of composition alignment
S = AACGTCTTTGAGCTC
T = AGCCTGACTGCCTA
Alignment
AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA
ISSCB 2003 Benson
Algorithm Analysis
Given two sequences, S and T, the best alignment of the
prefix strings
S[1, i] = s1 … si
T[1, j] = t1 … tj
ends in one of four ways: mismatch, insertion, deletion, or
composition match between suffixes of length l, where
1 ≤ l ≤ min(i, j, limit)
i.e., between substrings S[i – l + 1, i] and T[j – l + 1, j].
A single character match is a composition match with l = 1.
ISSCB 2003 Benson
Time Complexity
Computing the optimal composition alignment is done with
dynamic programming and is similar to standard
alignment, except for the composition match scoring
option. The overall time complexity is
O(nmZ)
where Z is the time required per (i, j) pair to find the best
length l for the composition match.
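The recurrence can be sketched as a naive global alignment. This version tests every suffix length l directly by comparing letter counts, so it is slower than the bound above suggests; the function name, the default scores (mismatch and gap of -1) and the default scoring function cm(l) = l are all assumptions chosen for illustration.

```python
from collections import Counter

def composition_alignment_score(s, t, cm=lambda l: float(l),
                                mismatch=-1.0, gap=-1.0, limit=3):
    """Best global composition alignment score of s with t (naive sketch)."""
    m, n = len(s), len(t)
    # score[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(score[i - 1][j - 1] + mismatch,  # single character mismatch
                       score[i - 1][j] + gap,           # character of s against a gap
                       score[i][j - 1] + gap)           # character of t against a gap
            for l in range(1, min(i, j, limit) + 1):    # composition match of length l
                if Counter(s[i - l:i]) == Counter(t[j - l:j]):
                    best = max(best, score[i - l][j - l] + cm(l))
            score[i][j] = best
    return score[m][n]

print(composition_alignment_score("GTC", "CTG", limit=3))  # 3.0
```

With limit = 1 the composition match option reduces to a single character match, so the same function then computes a standard global alignment score.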
ISSCB 2003 Benson
Computing length of the shortest
composition match
Our goal here is to start with two strings, S and T, of equal
length, and for each prefix pair S[1, k], T[1, k], find the
length of the shortest suffixes that have a composition
match. For example, let
S = AACGTCTTTGAGCT
T = AGCCTGACTGCCTA
Then for k = 6, the shortest suffixes which have a
composition match have length = 3:
S = AACGTC
T = AGCCTG
ISSCB 2003 Benson
Composition difference
Composition difference is a vector quantity for two strings x
and y:
$CD(x, y) = (c_{\sigma_1}, \ldots, c_{\sigma_{|\Sigma|}})$
where $c_{\sigma_i}$ is the difference between the number of times $\sigma_i$
occurs in x and in y.
ISSCB 2003 Benson
Using composition difference
Key observation: two identical composition differences at
prefix lengths k and g (with g < k) indicate a composition match of
length k – g.
ISSCB 2003 Benson
Sorting to find shortest
composition matches
Sort on composition difference using stable sort. Adjacent
tuples with the same composition difference identify
shortest composition matches.
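A minimal sketch of this procedure for one pair of equal-length strings: compute the composition difference of every prefix pair, stable-sort the prefix lengths by that difference, and read shortest matches off adjacent equal entries. The function name and the DNA alphabet order are assumptions for illustration.

```python
def shortest_composition_matches(s, t, alphabet="ACGT"):
    """For each prefix length k, length of the shortest suffixes of s[:k] and t[:k]
    that have a composition match, or None if there is none."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    column = {letter: i for i, letter in enumerate(alphabet)}
    diff = [0] * len(alphabet)
    cds = [tuple(diff)]                 # composition difference at prefix length 0
    for a, b in zip(s, t):
        diff[column[a]] += 1
        diff[column[b]] -= 1
        cds.append(tuple(diff))
    # Python's sort is stable, so equal differences stay in increasing prefix length.
    order = sorted(range(len(cds)), key=lambda k: cds[k])
    shortest = [None] * len(cds)
    for g, k in zip(order, order[1:]):
        if cds[g] == cds[k]:
            shortest[k] = k - g         # nearest earlier prefix with the same difference
    return shortest

result = shortest_composition_matches("AACGTCTTTGAGCT", "AGCCTGACTGCCTA")
print(result[6])  # 3
```

For the strings used in the example above, the entry for k = 6 is 3, corresponding to the suffixes GTC and CTG.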
ISSCB 2003 Benson
Time complexity for composition matches
O(nm|Σ|) to find n·m shortest composition match lengths for
two strings of length n and m.
In our work, |Σ| is a small constant (4 for DNA, 16 for
dinucleotides). For larger alphabets, the method of Amir,
Apostolico, Landau and Satta (2003) can be used.
ISSCB 2003 Benson
Composition match scoring functions
Functions based on match length, k:
• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k
where c is a constant.
Functions based on substring composition:
• Function 4: cm(C, B, k) = ck · H(C,B)
where H is the relative entropy function, C is the
composition of the matching substrings and B is a
background composition.
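The three scoring functions can be written out as a short sketch. The function names are hypothetical, the constant c defaults to 1, and the relative entropy is computed with base-2 logarithms, which is an assumption since the base is not stated here.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: score proportional to match length."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: score proportional to the square root of match length."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """H(C, B) = sum of C_i * log2(C_i / B_i); terms with C_i = 0 contribute 0."""
    return sum(ci * math.log2(ci / bi) for ci, bi in zip(C, B) if ci > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: length-scaled relative entropy of composition C against background B."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_entropy((1.0, 0.0, 0.0, 0.0), uniform, 5))  # 10.0
print(cm_entropy(uniform, uniform, 5))               # 0.0
```

Under Function 4, a match whose composition equals the background scores zero, while a strongly skewed composition scores highly.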
ISSCB 2003 Benson
Additive and subadditive scoring functions
The functions based on length are additive or subadditive:
cm(i + j) ≤ cm(i) + cm(j)
Lemma: For additive or subadditive composition match
scoring functions, any best scoring alignment is equivalent
in score to an alignment which contains only shortest
composition matches.
Theorem: Composition alignment with additive or
subadditive match scoring functions and finite alphabet has
time complexity O(nm).
ISSCB 2003 Benson
The limit parameter
Intuitively, allowing scrambled letters to match should
increase the amount of matching between sequences. If
too much matching occurs, alignments will not be
meaningful.
The limit parameter is an upper bound on the length l of the
longest single composition match.
ISSCB 2003 Benson
Limit and fraction of matching characters
random, ungapped alignments
Sequence length = limit
Fraction of matching characters (%):
limit    binary    DNA     Dinucleotide
1        50        25      6.2
2        62.5      30      6.5
5        75.6      37.5    7.1
10       82.4      44.2    7.5
ISSCB 2003 Benson
Limit and fraction of matching characters
random, ungapped alignments
Sequence length = 100, all letters equal probability p = 0.25
limit    DNA (%)
1        25
2        33.7
5        44.4
10       51
Sequence length = 400, all letters equal probability p = 0.25
limit    dinucleotide (%)
1        6.25
2        6.81
10       7.76
50       7.78
ISSCB 2003 Benson
Growth of local alignment score
Function 1
[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 1. Score (0 to 120) versus Sequence Length (100 to 1000, log scale), with curves for Limit = 2, 3 and 4.]
ISSCB 2003 Benson
Global score as a predictor of
local parameter suitability: Function 1
[Figure: Average Global Composition Alignment Scores: DNA Sequences, Function 1. Score (-400 to 100) versus Sequence Length (100 to 900), with curves for Limit = 2, 3, 4 and 5.]
ISSCB 2003 Benson
Growth of local alignment score
Function 2
[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 2. Score (0 to 100) versus Sequence Length (100 to 1000), with curves labeled 6, 10, 20, 30 and 50.]
ISSCB 2003 Benson
Global score as a predictor of
local parameter suitability: Function 2
[Figure: Global Composition Alignment Scores: DNA Sequences, Function 2. Score (-200 to 0) versus Sequence Length (0 to 900), with curves labeled 10, 20, 30 and 50.]
ISSCB 2003 Benson
Limit values for DNA
• Function 1: cm(k) = ck: Limit ≤ 3.
• Function 2: cm(k) = c√k: Limit ≤ 10.
• Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.
ISSCB 2003 Benson
Biological examples
Composition alignment was tested on a set of 1796 promoter
sequences from the Eukaryotic Promoter Database. Each
sequence is 600 nucleotides long, 500 bases upstream and
100 downstream of the transcription initiation site.
Two local alignment scores were produced using function 1,
W using composition alignment and S using standard
alignment. The examples shown have statistically
significant W with W ≥ 3 · S to exclude good standard
alignments.
ISSCB 2003 Benson
Example 1
Composition alignment and standard alignment of two
promoters. Standard alignment is not statistically
significant. Sequences are characteristic of CpG islands.
Composition Alignment:
GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC
Standard Alignment:
CGCCGCCGCCG
CGCCGCCGCCG
ISSCB 2003 Benson
Example 2
Composition alignment of two promoter sequences.
Composition changes at vertical line.
Composition (A, C, G, T):
Left: (0.01, 0.61, 0.30, 0.08)
Right: (0.19, 0.16, 0.56, 0.09)
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->
| |<><>|<-> | |<>|<>|<>||||<-><->|
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG
ISSCB 2003 Benson
Final Slide
Recognize that similarity among biological sequences most
likely exists in ways that we cannot perceive today and
for which we have no detection tools. You are encouraged,
therefore, to think broadly, beyond the embellishment or
refinement of current methods, to new definitions of
similarity and new problems of comparison.
ISSCB 2003 Benson