Sequence Local Alignment using
Directed Acyclic Word Graph
Do Huy Hoang
SEQUENCE ALIGNMENT
Sequence Similarity
• Alignment
– Arrange DNA/protein sequences to show their similarity
• “-” denotes an insertion/deletion event
Other variations
• Edit distance
• Longest common substring
• Affine gap scoring
• Using a scoring matrix (BLOSUM, PAM)
Alignment score computation
• Needleman–Wunsch
– Dynamic programming
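A minimal sketch of the Needleman–Wunsch dynamic program in Python; the match/mismatch/gap values are illustrative assumptions, not scores taken from these slides.

# Minimal Needleman-Wunsch global alignment score (illustrative sketch).
# match/mismatch/gap values are assumptions for the example, not from the slides.
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    m, n = len(s1), len(s2)
    # dp[i][j] = best score aligning s1[:i] with s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap          # s1 prefix vs. empty string
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap          # empty string vs. s2 prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # match / substitution
                           dp[i - 1][j] + gap,       # gap in s2 ("-")
                           dp[i][j - 1] + gap)       # gap in s1 ("-")
    return dp[m][n]

print(needleman_wunsch("GATTACA", "GCATGCT"))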
Other variations

Name          | Problem                                    | Worst time              | Average time | Memory
Four Russians | Edit distance (1,0)                        | MN/log(N)               |              |
Ukkonen       |                                            | ND                      | N+D^2        | D^2
Waterman      | Global edit (linear cost), local alignment | MN                      | MN           | MN
Tree vs. tree | Local alignment                            | M^2N^2 <not good>       |              |
BWTSW         | Meaningful local alignment                 | MN^2 <close to M^2N^2>  | MN^0.68      | MN
Local alignment
• Local alignment
– Find the best alignment between a substring of one sequence and a substring of the other
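The same recurrence, clamped at zero and maximized over all cells, gives Smith–Waterman-style local alignment; again a sketch with assumed scoring values.

# Minimal Smith-Waterman local alignment score (illustrative sketch).
def smith_waterman(s1, s2, match=1, mismatch=-1, gap=-1):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # first row/column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            dp[i][j] = max(0,                        # restart the alignment here
                           dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best                                      # best substring-vs-substring score

print(smith_waterman("ACACACTA", "AGCACACA"))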
BWTSW
• BWTSW
– Motivation
• Scoring tuned for 75% similarity
• Most entries of the local alignment table are zero
• Meaningful alignment
– Suffix tree
– Meaningful alignment
– Meaningful alignment with gaps
– How good is it?
Meaningful alignment (1)
• Sequence similarity sometimes implies functional similarity.
• Biologists are usually NOT interested in sequences with less than 70% similarity.
• BLAST score
– Match = 1
– Mismatch = -3
– Open gap = -5
– Extending gap = -2
Meaningful alignment (2)
• BLAST score
– Match = 1
– Mismatch = -3
– Open Gap = -5
– Extending Gap = -2
– At least 70% of positions must match to get a non-zero score
Meaningful alignment (3)
• BLAST score
– Match = 1
– Mismatch = -3
– Open Gap = -5
– Extending Gap = -2
• How many non-zero entries are there in the local alignment DP table?
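One way to see the answer empirically: count the non-zero Smith–Waterman cells on random DNA under these scores. The snippet is a rough experiment and simplifies the affine gap (open -5 / extend -2) to a flat -5 penalty.

import random

# Count the fraction of non-zero Smith-Waterman cells under BLAST-like scores
# on random DNA.  Simplifying assumption: a flat gap of -5 instead of the
# affine open -5 / extend -2.
def nonzero_fraction(s1, s2, match=1, mismatch=-3, gap=-5):
    m, n = len(s1), len(s2)
    prev = [0] * (n + 1)
    nonzero = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            nonzero += cur[j] > 0
        prev = cur
    return nonzero / (m * n)

random.seed(0)
a = "".join(random.choice("ACGT") for _ in range(300))
b = "".join(random.choice("ACGT") for _ in range(300))
print(nonzero_fraction(a, b))   # most cells are zero; the non-zero ones are
                                # essentially the isolated character matches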
How to improve?
• Idea:
– Do not store zero-score entries
– Use a suffix tree to prune branches early
BWTSW details
• FM-index as the suffix tree representation
• Prune zero entries
• Store the DP vector as a linked list of non-zero entries (sketched below)
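A toy version of those three ideas, assuming a plain dict-based suffix trie and a flat gap penalty in place of the FM-index and affine gaps that BWTSW actually uses: each trie node keeps only its positive DP entries, and a branch is abandoned once none survive.

# Toy sketch of BWTSW-style pruning: align the pattern against a suffix trie of
# the text, keep only positive DP entries at each trie node, and stop expanding
# a branch once no positive entry is left.
def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def pruned_local_alignment(pattern, text, match=1, mismatch=-3, gap=-5):
    root = build_suffix_trie(text)
    m = len(pattern)
    best = 0

    def dfs(node, col):
        nonlocal best
        # col holds only the positive DP entries {pattern_index: score}
        # for the trie path walked so far; missing entries count as 0.
        for c, child in node.items():
            new = {}
            for j in range(1, m + 1):
                sub = match if pattern[j - 1] == c else mismatch
                score = max(col.get(j - 1, 0) + sub,   # consume c and pattern[j-1]
                            col.get(j, 0) + gap,        # consume c, gap in pattern
                            new.get(j - 1, 0) + gap)    # gap in the text
                if score > 0:
                    new[j] = score
                    best = max(best, score)
            if new:                                     # prune empty branches
                dfs(child, new)

    dfs(root, {})
    return best

print(pruned_local_alignment("ACGT", "TTACGTTT"))       # exact match scores 4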
Analysis
• Text length = N
• Pattern length = M
• Alphabet size = σ
Average running time (1)
• Let F(L) be the number of pairs of strings of length L with Score(S1,S2) > 0
– F(L) = |{(S1,S2) : Len(S1)=Len(S2)=L, Score(S1,S2) > 0}|
– F(L) counts the number of pairs with at least 75% identity.
• F(L) = sum(i=0..L/4) Binomial(L,i) * (σ-1)^i
• F(L) ≈ k1 * k2^L
• F(log(N)) ≈ k3 * N^0.68
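With the reconstruction above (alphabet size σ, at most L/4 mismatches), F(L) can be tabulated directly; the snippet below only illustrates the k1*k2^L growth and the exponentially shrinking fraction F(L)/σ^L.

from math import comb

# F(L): for a fixed S1 of length L over an alphabet of size sigma, the number
# of S2 with at most L/4 mismatches against S1 (i.e. >= 75% identity), matching
# the binomial-sum formula on the slide as reconstructed here.
def F(L, sigma=4):
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))

for L in (8, 16, 32, 64):
    print(L, F(L), F(L) / 4 ** L)   # F(L) grows roughly like k1 * k2**L,
                                    # while F(L) / sigma**L shrinks exponentially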
Average running time (2)
• Given S1, Pr(Score(S1,S2) > 0 | S1) = F(L)/σ^L
• For M < log(N)
– The number of non-zero entries is O(M * F(M)) < O(log(N) * F(log(N)))
• For M > log(N)
– O(M * N * F(M)/σ^L)
• On average
– Time = O(M * F(log(N))) = O(M * N^0.68)
DAWG
Possible improvement of BWTSW
• Worst-case running time O(N^2 * M)
– When M = N
– O(M * N^0.68 + M^3) when the pattern is a substring of the text
• What about ST vs. ST?
• What we used in BWTSW is a suffix trie (not a suffix tree).
– (Prove it)
• A suffix trie has O(N^2) nodes
• DAWG is a similar structure with only O(N) nodes
DAWG (1)
DAWG (2)
• DAWG: Directed Acyclic Word Graph
• DAWG is an acyclic automaton that recognizes all the substrings of a given string.
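For reference, a compact online DAWG construction (the classic suffix-automaton algorithm of Blumer et al.); a general-purpose sketch, not code from the slides.

# Online DAWG (suffix automaton) construction, following the classic
# Blumer et al. algorithm.
class DawgState:
    __slots__ = ("trans", "link", "length")
    def __init__(self, length):
        self.trans = {}       # outgoing edges: char -> state
        self.link = None      # suffix link
        self.length = length  # length of the longest substring in this class

def build_dawg(w):
    root = DawgState(0)
    last = root
    for c in w:
        cur = DawgState(last.length + 1)
        p = last
        while p is not None and c not in p.trans:
            p.trans[c] = cur
            p = p.link
        if p is None:
            cur.link = root
        else:
            q = p.trans[c]
            if q.length == p.length + 1:
                cur.link = q
            else:
                clone = DawgState(p.length + 1)   # split the class of q
                clone.trans = dict(q.trans)
                clone.link = q.link
                while p is not None and p.trans.get(c) is q:
                    p.trans[c] = clone
                    p = p.link
                q.link = clone
                cur.link = clone
        last = cur
    return root

def accepts(root, s):
    node = root
    for c in s:
        if c not in node.trans:
            return False
        node = node.trans[c]
    return True

root = build_dawg("abcbc")
print(all(accepts(root, "abcbc"[i:j])             # every substring is recognized
          for i in range(5) for j in range(i, 6)))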
DAWG (3)
• Example:
– DAWG of “abcbc”
[Figure: the DAWG of “abcbc”, with edges labeled a, b, c and nodes corresponding to the substring classes {ab}, {bc, c}, {abc}, {abcb, bcb, cb}, {abcbc, bcbc, cbc}]
DAWG (4)
• End-set view
[Figure: the same DAWG with every node labeled by its end-set: {0,1,2,3,4,5} for the root and {1}, {2}, {2,4}, {3}, {3,5}, {4}, {5} for the substring classes]
Trivial DAWG construction
• Using end-set classes (see the sketch after this slide)
[Figure: the same end-set–labeled DAWG as on the previous slide]
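A direct, deliberately naive implementation of the end-set construction for small strings; it also reports node and edge counts, which for |w| > 2 stay within the 2|w|-1 and 3|w|-4 bounds stated on the next slide.

from collections import defaultdict

# Trivial DAWG construction via end-set classes: substrings with the same
# end-set share a node.  Quadratic/cubic, for small examples only.
def end_set(w, x):
    return frozenset(i + len(x) for i in range(len(w) - len(x) + 1)
                     if w[i:i + len(x)] == x)

def trivial_dawg(w):
    classes = defaultdict(set)          # end-set -> substrings in that class
    substrings = {w[i:j] for i in range(len(w) + 1)
                         for j in range(i, len(w) + 1)}
    for x in substrings:
        classes[end_set(w, x)].add(x)
    # edges: class of x --c--> class of xc, for every substring xc
    edges = set()
    for x in substrings:
        for c in set(w):
            if x + c in substrings:
                edges.add((end_set(w, x), c, end_set(w, x + c)))
    return classes, edges

classes, edges = trivial_dawg("abcbc")
print(len(classes), len(edges))   # node/edge counts; at most 2|w|-1 and 3|w|-4
                                  # for |w| > 2 (cf. the DAWG properties slide)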
DAWG properties
• For |w|>2, the Directed Acyclic Word Graph
for w has at most 2|w|-1 states, and 3|w|-4
edges
D(w) and ST(w^R)
• There is a mapping between the nodes of the DAWG D(w) and the nodes of the implicit suffix tree ST(w^R)
– Example: w = abcbc, w^R = cbcba
[Figure: the DAWG of “abcbc” drawn next to the suffix tree of “cbcba”, showing the correspondence between DAWG nodes (substring classes such as {ab}, {bc, c}, {abc}, {abcb, bcb, cb}, {abcbc, bcbc, cbc}) and suffix tree nodes]
• Store the DAWG using ST(w^R), which uses only O(N) bits
D(w) and ST(w^R) (2)
• Listing all incoming edges of a node q in D(w) using ST(w^R)
Local Alignment using DAWG
• Basis
• Induction
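One possible reading of the basis and induction, sketched in code: keep one DP column per DAWG node (basis: the all-zero column at the source node) and fill each node's column from its predecessors in topological order (induction). The DAWG here is built naively from end-set classes and a flat gap penalty is assumed; both are simplifications for the sketch.

from collections import defaultdict

# Local alignment of a pattern against the DAWG of the text: one DP column per
# DAWG node instead of one per text position.
def build_dawg_graph(w):
    def end_set(x):
        return frozenset(i + len(x) for i in range(len(w) - len(x) + 1)
                         if w[i:i + len(x)] == x)
    subs = {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    longest = {}                                  # class -> its longest member
    for x in subs:
        e = end_set(x)
        if len(x) >= len(longest.get(e, "")):
            longest[e] = x
    in_edges = defaultdict(set)                   # class -> {(char, predecessor)}
    for x in subs:
        for c in set(w):
            if x + c in subs:
                in_edges[end_set(x + c)].add((c, end_set(x)))
    return longest, in_edges

def dawg_local_alignment(pattern, text, match=1, mismatch=-3, gap=-5):
    longest, in_edges = build_dawg_graph(text)
    order = sorted(longest, key=lambda e: len(longest[e]))   # topological order
    m = len(pattern)
    col = {v: [0] * (m + 1) for v in order}       # basis: all columns start at 0
    best = 0
    for v in order:                               # induction over DAWG nodes
        cv = col[v]
        for c, u in in_edges[v]:                  # edges consume text character c
            cu = col[u]
            for j in range(1, m + 1):
                sub = match if pattern[j - 1] == c else mismatch
                cv[j] = max(cv[j], cu[j - 1] + sub, cu[j] + gap)
        for j in range(1, m + 1):                 # gaps in the text: stay at v
            cv[j] = max(cv[j], cv[j - 1] + gap)
            best = max(best, cv[j])
    return best

print(dawg_local_alignment("ACGT", "TTACGTTT"))   # exact match scores 4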
Extensions
• Meaningful alignment using DAWG
– Prune the nodes whose score is less than zero
• Shortest-path-style pruning
• Caching log(N) nodes → the worst-case running time becomes M*N*log(N); the average case is the same for M << N.