BWA Plus: A high performance DNA short read

1
Work @ Fudan University
Chen, Yaoliang
2
• TTS System
• A Chinese Text-To-Speech system
• SafeDB
• Bug backlog
• SMemoHelper
• A small tool that helps learn English words.
• Fraud Detecting
• Time series tech
3
• CGAP-align: A high performance DNA short read
alignment tool
▫ Coauthor with BCM. Bioinformatics in progress
▫ NDBC Demo
• On Encoding Shortest Paths in Large Graphs
▫ Coauthor with Jian Pei. VLDB in progress
▫ Coauthor with Haixun Wang. Sigmod in progress
▫ NDBC
• Other Projects
4
• Baylor College of Medicine
• Sequence alignment and its significance
▫ Reference & Reads
 ACTAGCGATATAACCCTTTCCCTTTCCCTTT
CACGAT
 CACGAT
• Given a threshold z, a reference X, and a read W, we
want to find a substring W’ = X[i, i+1, …, j]
such that EditDistance(W, W’) ≤ z.
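The problem statement above can be illustrated with a plain dynamic-programming scan. This is only a minimal sketch of the definition (a textbook approximate-matching recurrence), not the method BWA or the tool in these slides uses; the function name `approx_find` is hypothetical.

```python
def approx_find(X, W, z):
    """Return every end position j such that some substring X[i:j]
    has edit distance <= z to the read W (j is exclusive)."""
    m = len(W)
    # prev[r]: best edit distance between W[:r] and a substring of X
    # ending at the current position (initially the empty substring).
    prev = list(range(m + 1))
    hits = [0] if prev[m] <= z else []
    for j, x in enumerate(X, start=1):
        cur = [0] * (m + 1)  # row 0 stays 0: a match may start anywhere
        for r in range(1, m + 1):
            cost = 0 if W[r - 1] == x else 1
            cur[r] = min(prev[r - 1] + cost,  # match or substitution
                         prev[r] + 1,         # extra character in X
                         cur[r - 1] + 1)      # extra character in W
        if cur[m] <= z:
            hits.append(j)
        prev = cur
    return hits


X = "ACTAGCGATATAACCCTTTCCCTTTCCCTTT"
print(approx_find(X, "CACGAT", 0))       # no exact occurrence
print(9 in approx_find(X, "CACGAT", 2))  # X[3:9] = "AGCGAT" is within 2 edits
```

The scan costs O(|X|·|W|) per read, which is why index-based tools such as BWA avoid it on large references.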
5
• A human genome sequence
▫ 2000: € 1,000,000,000, in ~10 years
▫ 2008: € 50 - 100,000, in ~4 months
▫ 2010: € 5 - 10,000, in ~2 weeks
▫ ...2015: € 1,000, in ~1 day
▫ ...2020: € 10, in ~1 hour to minutes
[Figure: DNA sequences in GenBank]
6
• Burrows-Wheeler Alignment Tool
▫ A popular tool for aligning gene fragments against a
large reference sequence
• Optimization of BWA
▫ Code level
▫ Algorithm level
• BWA Performance: T = N × Taln
▫ N: number of modified reads produced by enumerating
all mismatches and gaps of the read
▫ Taln: time to locate each modified read in the
reference during the alignment stage
7
• Optimizing Taln: efficiency for matching
▫ Suffix Tarray
• Optimizing N: pruning ability to avoid
enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-Array Calculating
8
• Suffix Tree
• Suffix Array Based on BWT (FM-index)
• Comparison
[Figure: a trie from Root to Leaf (b=2), branching on A, C, G, T; the leaves shown, R(AA), R(TC), and R(TT), link into the FM-index. Ref=ATCTTCAAGA, Read=TAA.]
9
From Yuval Rikover
All rotations of T = mississippi#:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (F is the first column, L is the last column):
F              L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i
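The construction on this slide can be reproduced in a few lines. This is a naive sketch that materializes every rotation, fine for a short example but not how large references are indexed in practice; the function name `bwt` is hypothetical, and `#` is assumed to be a unique end marker that sorts before every letter.

```python
def bwt(text):
    """Return (F, L): first and last columns of the sorted rotations."""
    assert text.endswith("#"), "expects the end marker used in the slides"
    n = len(text)
    # Build every rotation of the text and sort them lexicographically.
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    F = "".join(row[0] for row in rotations)
    L = "".join(row[-1] for row in rotations)
    return F, L


F, L = bwt("mississippi#")
print(F)  # #iiiimppssss
print(L)  # ipssm#pissii
```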
10
Reminder: Recovering T from L
1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
▫ The i’s are in the same order in L and F
▫ As are the rest of the chars
6. i is followed by s: mis
7. And so on…
F: # i i i i m p p s s s s
L: i p s s m # p i s s i i
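The recovery steps can be sketched as follows. The code relies on the property stated on the slide (equal characters keep their relative order in L and F), which a stable sort provides; the name `inverse_bwt` and the assumption of a unique `#` end marker are illustrative.

```python
def inverse_bwt(L, end="#"):
    """Rebuild T from the last column L, reading T left to right."""
    n = len(L)
    # Stable sort of the row numbers by character: order[f] is the row of L
    # that holds the same character occurrence as F[f].
    order = sorted(range(n), key=lambda i: L[i])
    F = [L[i] for i in order]          # step 1: F is L sorted
    row = L.index(end)                 # the row whose rotation is T itself
    out = []
    for _ in range(n):
        out.append(F[row])             # next character of T
        row = order[row]               # jump to that occurrence in L
    return "".join(out)


print(inverse_bwt("ipssm#pissii"))  # mississippi#
```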
11
• Backward-search algorithm
• Uses only L (output of BWT)
• Relies on 2 structures:
▫ C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are
alphabetically smaller than c (including repetitions of chars)
▫ Occ(c,q): number of occurrences of char c in prefix L[1,q]
Example
• C[ ] for T = mississippi#
c:    i m p s
C[c]: 1 5 6 8
• Occ(s, 5) = 2
• Occ(s, 12) = 4
[Figure: column L with ranks 1 to 12, illustrating Occ]
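A small sketch of the two structures, using the same 1-based prefix convention as the slide. Real FM-indexes store sampled Occ tables instead of counting on demand; the helper names `build_c` and `occ` are hypothetical.

```python
from collections import Counter


def build_c(text):
    """C[c] = number of characters in text that are smaller than c."""
    counts = Counter(text)
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return C


def occ(L, c, q):
    """Occurrences of c in the prefix L[1..q] (1-based, inclusive)."""
    return L[:q].count(c)


C = build_c("mississippi#")
L = "ipssm#pissii"  # last column of the sorted rotations
print(C["i"], C["m"], C["p"], C["s"])  # 1 5 6 8
print(occ(L, "s", 5), occ(L, "s", 12))  # 2 4
```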
12
Substring search in T (count the pattern occurrences)
• P = si
• First step: fr and lr delimit the rows prefixed by char “i”
• Inductive step: given fr, lr for P[j+1,p], take c = P[j]
▫ Find the first c in L[fr, lr]
▫ Find the last c in L[fr, lr]
▫ The Occ() oracle is enough
• Result: occ = 2 = lr - fr + 1
[Figure: the sorted rotations of mississippi# with column L and the C array over #, i, m, p, s, marking fr and lr at each step]
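The counting procedure can be sketched with the standard backward-search update on 1-based row numbers. The index is rebuilt naively here so the example is self-contained; `count_occurrences` is a hypothetical name, and the text is assumed to end with a unique `#` that sorts first.

```python
def count_occurrences(text, P):
    """Count occurrences of P in text (text ends with '#') via backward search."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    L = "".join(r[-1] for r in rows)
    C = {c: sum(ch < c for ch in text) for c in set(text)}

    def occ(c, q):
        return L[:q].count(c)  # occurrences of c in L[1..q]

    fr, lr = 1, len(L)         # start with every row
    for c in reversed(P):      # consume the pattern right to left
        if c not in C:
            return 0
        fr = C[c] + occ(c, fr - 1) + 1
        lr = C[c] + occ(c, lr)
        if fr > lr:            # empty range: P does not occur
            return 0
    return lr - fr + 1


print(count_occurrences("mississippi#", "si"))  # 2
```

For P = si the first step yields rows 2 to 5 (prefixed by "i"), and the second step narrows them to rows 9 to 10, giving lr - fr + 1 = 2.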
13
• Backward search
• Store “First” and “Last” (k and l) values
14
• P = CAA
▫ i = 1, 2, 3
▫ c = ‘C’, ‘A’, ‘A’
▫ First(CAA) = C[‘C’] + Occ(‘C’, First(AA) - 1) + 1
▫ Last(CAA) = C[‘C’] + Occ(‘C’, Last(AA))
[Figure: the path Root, A, A in the trie supplies First(AA) and Last(AA) to the FM-index]
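One way to read this slide is that the First and Last values of every b-mer are stored ahead of time, so a search starts from a table lookup and only the remaining characters go through the FM-index. The sketch below illustrates that reading only; the table layout, the function names, and the restriction to patterns of length at least b are all assumptions, not the tool's actual data structure.

```python
from itertools import product

ALPHABET = "ACGT"


def build_fm(ref):
    """Naive FM-index pieces for ref + '#': last column L and array C."""
    text = ref + "#"
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    L = "".join(r[-1] for r in rows)
    C = {c: sum(ch < c for ch in text) for c in ALPHABET}
    return L, C


def extend(L, C, c, first, last):
    """One backward-search step on a 1-based, inclusive row range."""
    return C[c] + L[:first - 1].count(c) + 1, C[c] + L[:last].count(c)


def kmer_table(L, C, b):
    """Precompute (First, Last) for every string of length b."""
    table = {}
    for kmer in product(ALPHABET, repeat=b):
        first, last = 1, len(L)
        for c in reversed(kmer):
            first, last = extend(L, C, c, first, last)
        table["".join(kmer)] = (first, last)
    return table


def count(P, L, C, table, b):
    """Count occurrences of P (len(P) >= b), starting from the stored range."""
    first, last = table[P[-b:]]
    for c in reversed(P[:-b]):
        first, last = extend(L, C, c, first, last)
    return max(0, last - first + 1)


L, C = build_fm("ATCTTCAAGA")
table = kmer_table(L, C, b=2)
print(count("CAA", L, C, table, 2))  # 1
print(count("TAA", L, C, table, 2))  # 0
```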
15
• Optimizing Taln: efficiency for matching
▫ Suffix Tarray
• Optimizing N: pruning ability to avoid
enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-Array Calculating
16
• e(W)
▫ The minimal number of edit operations needed to make W
align exactly onto the reference X.
• D-array
▫ D[i]: lower bound of e(W[0…i])
[Figure: the D-array, indexed 0 … i]
17
• Given a string W and an arbitrary split of W into
substrings W = w1w2…wk, we have $e(W) \ge \sum_{i=1}^{k} e(w_i)$
• D array in BWA
▫ Split W into several small strings W = w1w2…wk
with e(wi) = 1 for all i. The correctness of the algorithm
depends on the inequality $e(W) \ge \sum_{i=1}^{k} e(w_i)$.
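The BWA-style bound can be sketched as a greedy scan: extend the current piece until it no longer occurs in X, count one required edit, and start a new piece. The sketch uses Python's substring test in place of an index lookup, and the read in the example is a hypothetical one chosen for illustration.

```python
def calculate_d(W, X):
    """D[i] = number of pieces of W[0..i] that cannot occur exactly in X.

    Each such piece needs at least one edit, so D[i] is a lower bound
    on e(W[0..i])."""
    D = []
    z = 0  # pieces closed so far
    j = 0  # start of the current piece
    for i in range(len(W)):
        if W[j:i + 1] not in X:  # current piece no longer matches anywhere
            z += 1
            j = i + 1            # start a fresh piece after position i
        D.append(z)
    return D


X = "AACGTATCGACG"
print(calculate_d("CGTAGG", X))  # [0, 0, 0, 0, 1, 1]
```

During the search, a branch that has used more edits than the budget allows given D[i] can be dropped early, which is how a tighter D reduces N.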
18
• Example: Reference X = “AACGTATCGACG”
▫ W: A A C T G G A
▫ D: 0 0 0 1 1 1
• A better segmentation: consider e(·) = 2
▫ W: A A C C T G G A
▫ D: 0 0 0 1 2 2
▫ Computing e(·) takes exponential time
▫ It needs to be precomputed
19
Train Reads
• Fasta file F containing training reads
• Should be similar to the reads in practice
Frequent Patterns
• Mining Frequent Patterns (FPs) with e(w)=2
• State of the Art Methods
• Data-Conscious
• Our solution: A simple DFS on FM-index
▫ Count=Last-First+1
Trie DFA
• Generate prefix trie T for FPs
• Refine T to a DFA GT
20
• Why Trie DFA?
▫ During online alignment, we need to find all the
FPs contained in a read
▫ This operation should cost no more than
O(|W|)
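A standard way to get this behaviour is an Aho-Corasick automaton: a prefix trie whose missing transitions are filled in so that each read character costs one table lookup. The sketch below uses that textbook construction as a stand-in; it is not claimed to be the exact DFA built by the tool, and the function names are hypothetical.

```python
from collections import deque


def build_dfa(patterns, alphabet="ACGT"):
    """Build the prefix trie, then complete it into a DFA."""
    goto, out = [{}], [[]]
    for p in patterns:                      # 1) prefix trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:                     # 2) complete the root
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:                            # 3) breadth-first completion
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                out[t] = out[t] + out[fail[t]]  # inherit shorter matches
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, out


def find_all(W, goto, out):
    """Return (start, pattern) for every pattern occurrence in W."""
    hits, s = [], 0
    for i, ch in enumerate(W):              # one transition per character
        s = goto[s][ch]
        hits.extend((i - len(p) + 1, p) for p in out[s])
    return hits


goto, out = build_dfa(["AA", "AC", "AG", "C", "G", "T"])
print(find_all("CACAT", goto, out))
# [(0, 'C'), (1, 'AC'), (2, 'C'), (4, 'T')]
```

The scan performs exactly one transition per character of W, plus the work of reporting matches; the read is assumed to contain only characters from the alphabet.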
21
Offline Index: Construction
• String Set (FP set)
▫ AA
▫ C
▫ G
▫ T
▫ AC
▫ AG
[Figure: prefix trie rooted at R over the FP set, with nodes labeled 1 (A), 2 (AA), 3 (C), 4 (G), 5 (T), 6 (AC), 7 (AG)]
• The prefix trie is done. We start to construct the DFA.
22
• DFS order: minimize the average hop between
each jump. (7% up)
[Figure: the trie rooted at R with its nodes renumbered in DFS order]
23
Online Query
• String Set (FP set)
▫ AA
▫ AC
▫ AG
▫ C
▫ G
▫ T
• W=“CACAT”
[Figure: the trie DFA rooted at R, traversed on W with transitions on A, C, G, T]
24
• Optimizing Taln: efficiency for matching
▫ Suffix Tarray (20% up)
• Optimizing N: pruning ability to avoid enumerating
unnecessary mismatches and gaps
▫ Data-Conscious D-Array Calculating (0-200% up)
25
• Background
• Consider a graph G = (V,E), where V is a set of
vertices and E ⊆ V×V is a set of edges.
• FH-Partition
26
• 7->10: FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10
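Reading FH(u, v) as the next vertex on a shortest path from u to v (an interpretation suggested by the example, not stated on the slide), a full path is recovered by following the function until the destination is reached. The dictionary below holds only the three values given above; `shortest_path` is a hypothetical helper.

```python
def shortest_path(fh, src, dst):
    """Follow FH lookups from src until dst is reached."""
    path = [src]
    while path[-1] != dst:
        if len(path) > len(fh) + 1:  # guard against an inconsistent table
            raise ValueError("FH table contains a cycle")
        path.append(fh[(path[-1], dst)])
    return path


# The three FH values from the slide's example.
fh = {(7, 10): 9, (9, 10): 2, (2, 10): 10}
print(shortest_path(fh, 7, 10))  # [7, 9, 2, 10]
```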
27
• Numbering Function
28
29
Compute FH-Partitions
Get Numbering Function(s)
• Compute a naïve numbering function
• Reduce to TSP
• Multi numbering functions
Encoding FH-Partitions
• Region tree
• Store the FH-partitions
• Further Compression
• Answering query efficiently
30
31
Thank you!