Slides

advertisement
Presented by
Mario Flores, Xuepo Ma, and Nguyen Nguyen
Outline
• Short-read alignment
– Algorithm
– Results
• Comparisons between short-read and longread alignment
• Long-read alignment
– Algorithm
– Results
Motivation
• Motivation: new DNA sequencing technologies call fast and
accurate read alignment programs.
• MAQ:
 Pros: accurate, feature rich and fast enough to align short reads from
single individual.
 Cons: MAQ does NOT support gapped alignment for single-end reads
=> unsuitable for alignment longer reads where indels may occur
frequently.
• Alignment with BWT :
 efficiently align short sequencing reads against a large reference
sequence
 allowing mismatches and gaps
Burrows Wheeler Transfrom
X: actgct W: gcc Z=1
actgct$
ctgct$a
tgct$ac
gct$act
ct$actg
t$actgc
$actgct
0
1
2
3
4
5
6
i S[i]
$actgc
actgct
ct$act
ctgct$
gct$ac
t$actg
tgct$a
B[i]
Inexact Matching - number of
deference in string W
• Take string W=“gcc” for
example.
• 1. W(0,0)=“g”, “g” is a
substring of X, D(0)=0;
• 2. W(0,1)=“gc”, “gc” is a
substring of X, D(1)=0;
• 3. W(0,2)=“gcc”, “gcc” is
not a substring of X,
D(2)=1.
Inexact Matching - Searching
X: actgct
W: gcc
0,6
t
c
a
6,6
c
a
g
^
4,4
1,1
t
^
g
c
t
1,1
c
3,3
1,1
1,1
1
1,1
3
1,1
2
^
1,1
^
^
1,1
3,3
3,3
1,1
6
a
a
a
6,6
6,6
1,1
^
c
t
5
3,3
3,3
6,6
1,1
4
2,3
1,1
2,3
a
g
Exact Matching
• Let the D(i)=0, then the algorithm can search
for the exact matching
Simulated data
• Accuracy
BWA is more accurate than
Bowtie and SOAPv2 based on
criterion 1.
• Speed
 BWA is the fastest second only
to SOAPv2.
• Memory
 MAQ’s memory footprint is
1GB, but it increases linearly with
the number of reads to be
aligned.
 BWA only uses 2.3 GB for
single-end mapping and 3GB for
paired-end ( as much as Bowtie).
 SOAPv2 uses 5.4 GB.
Differences between short-read and longread alignment
Short-read alignment
Long-read alignment
• Align full-length read
• Efficient for ungapped
alignment or limited
gaps
• Find local matches
• Permissive about
alignment gaps
Fast and accurate long-read alignment with Burrows-Wheeler transform
Motivations
Many programs for short sequencing
Not many for reads>200 bp
BLAT, SSAHA2
New platforms are producing longer sequences:
Roche/454 >400bp, Illumina>100 bp,
Pacific > 1000 bp
New algorithm: Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
Before NGS
After NGS
FASTA 1988
SOAP 2008
BLAST 1997
MAQ 2008
MegaBLAST 2000
Bowtie 2009
SSAHA2 2001
BWA 2009
BLAT 2002
BWA-SW 2010
Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
Overview Algorithm
(1) Build FM-indices for reference and query sequences
(2) Represent reference in a prefix trie
(3) Represents query in prefix in DAWG (directed acyclic word graph) transformed from
the prefix trie of the query sequence
Example:
String GOOGOL
a. 3 nodes has SA interval [4,4]
b. Their parents have interval
[1,2],[1,2] and [1,1]
‘∧’ start of a string
prefix trie
The two numbers in
A node gives the
SA interval of the
node
In prefix DAWG
The [4,4] node has parents [1,2]
and [1,1]
Node [4,4] represents the strings
‘OG’, ‘OGO’, ‘OGOL’
‘
Prefix tree
Prefix DAWG
Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
Overview Algorithm
(4) Dynamic programming with heuristics to accelerate algorithm
Heuristics rules:
A) Restrict the dynamic programming algorithm around good
matches only
B) Report only alignments largely non-overlapping
Result of these heuristics is: Savings in computing time
Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
Heuristic strategies for acceleration
(1) Z best : Traverse G(W) in outer loop and T(X) in inner loop, and at each
node u in G(W) only keep the top Z best scoring nodes in T(X) that match
u rather than keeping all the matching nodes
Where G(W) prefix DAWG of query sequence W
T(X) prefix trie for reference sequence X
u root of G(W)
(2) Take only best few alignments covering each region of the query
sequence
Result
• Implementation of BWA-SW takes a BWA
index and a query FASTA and FASTQ file as
inputs.
• Typical sequencing reads requires less than
4GB. The peak memory is 6.4 GB in total on
one query sequence with 1 million base pairs.
Simulated data
• Speed
BWA-SW is fastest, and its speed is not sensitive to the
read length or error rates.
• Memory
 BWA-SW uses about 4GB (as much as BLAT).
 SSAHA2 uses 2.4GB for >=500 bp reads, and 5.3 GB for
shorter reads.
BWA-SW supports multi-threading while SSAHA2 and BLAT
do not.
• Accuracy
 BWA-SW can detect chimera reads, and produces fewer
false chimeric reads given lower base errors.
Conclusion
• Short-read alignment cannot be used for longread alignment due to:
– Full-length read vs local matches.
– Ungapped or limited gap vs larger number of gaps.
• BWA-short is more accurate, use less memory
and competitively fast.
• BWA-long is the best in market in speed,
accuracy and memory.
Questions ?????
Download