Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen

Outline
• Short-read alignment
  – Algorithm
  – Results
• Comparisons between short-read and long-read alignment
• Long-read alignment
  – Algorithm
  – Results

Motivation
• New DNA sequencing technologies call for fast and accurate read alignment programs.
• MAQ
  – Pros: accurate, feature-rich, and fast enough to align short reads from a single individual.
  – Cons: MAQ does NOT support gapped alignment for single-end reads, so it is unsuitable for aligning longer reads, where indels may occur frequently.
• Alignment with BWT: efficiently aligns short sequencing reads against a large reference sequence, allowing mismatches and gaps.

Burrows-Wheeler Transform
X: actgct    W: gcc    Z = 1 (allowed differences)

Rotations of X$ (in order):
0  actgct$
1  ctgct$a
2  tgct$ac
3  gct$act
4  ct$actg
5  t$actgc
6  $actgct

Sorted rotations (B[i] is the last character of row i):
i  S[i]  sorted rotation  B[i]
0  6     $actgct          t
1  0     actgct$          $
2  4     ct$actg          g
3  1     ctgct$a          a
4  3     gct$act          t
5  5     t$actgc          c
6  2     tgct$ac          c

Inexact Matching - number of differences in string W
• D(i) is a lower bound on the number of differences in W(0,i).
• Take the string W = "gcc" as an example.
• 1. W(0,0) = "g" is a substring of X, so D(0) = 0.
• 2. W(0,1) = "gc" is a substring of X, so D(1) = 0.
• 3. W(0,2) = "gcc" is not a substring of X, so D(2) = 1.

Inexact Matching - Searching
X: actgct    W: gcc
[Figure: search tree for W over X; each node is labeled with its SA interval, starting from the root interval 0,6.]

Exact Matching
• With D(i) = 0 for all i and no differences allowed (Z = 0), the algorithm reduces to exact matching.

Simulated data
• Accuracy: BWA is more accurate than Bowtie and SOAPv2 based on criterion 1.
• Speed: BWA is the second fastest, behind only SOAPv2.
• Memory: MAQ's memory footprint is 1 GB, but it increases linearly with the number of reads to be aligned. BWA uses only 2.3 GB for single-end mapping and 3 GB for paired-end mapping (as much as Bowtie). SOAPv2 uses 5.4 GB.
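The BWT table, the D array, and the interval search from the short-read algorithm slides can be reproduced with a small Python sketch. It is a toy illustration under stated assumptions, not BWA's implementation: the suffix array is built by naive sorting, occurrence counts are recomputed by string slicing, `calculate_d` uses Python's substring test where BWA uses a BWT of the reversed reference, and `inexact_intervals` allows substitutions only (no gaps). All function names are hypothetical.

```python
X = "actgct"  # reference from the slides
W = "gcc"     # query from the slides


def build_index(x):
    """Return (suffix array S, BWT string B, cumulative counts C) for x + '$'."""
    t = x + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive sort, fine for a toy string
    b = "".join(t[i - 1] for i in sa)                 # t[-1] is '$', which covers i == 0
    c = {a: sum(ch < a for ch in t) for a in set(t)}  # symbols in t strictly smaller than a
    return sa, b, c


def step(b, c, a, k, l):
    """Prepend symbol a to the current match; return the new inclusive SA interval."""
    if a not in c:
        return 1, 0                                   # empty interval
    return c[a] + b[:k].count(a), c[a] + b[:l + 1].count(a) - 1


def exact_interval(x, w):
    """Backward search for w in x; return the SA interval or None if w does not occur."""
    _, b, c = build_index(x)
    k, l = 0, len(b) - 1
    for a in reversed(w):
        k, l = step(b, c, a, k, l)
        if k > l:
            return None
    return k, l


def calculate_d(x, w):
    """D[i] = lower bound on the number of differences in w[0..i]."""
    d, z, j = [], 0, 0
    for i in range(len(w)):
        if w[j:i + 1] not in x:   # assumption: plain substring test instead of a reverse BWT
            z += 1
            j = i + 1
        d.append(z)
    return d


def inexact_intervals(x, w, z):
    """SA intervals of substrings of x matching w with at most z substitutions."""
    _, b, c = build_index(x)
    d = calculate_d(x, w)

    def recur(i, z, k, l):
        if z < (d[i] if i >= 0 else 0):   # prune: not enough differences left
            return set()
        if i < 0:
            return {(k, l)}
        found = set()
        for a in sorted(set(x)):
            k2, l2 = step(b, c, a, k, l)
            if k2 <= l2:
                found |= recur(i - 1, z - (a != w[i]), k2, l2)
        return found

    return recur(len(w) - 1, z, 0, len(b) - 1)


sa, b, _ = build_index(X)
print(sa, b)
print(exact_interval(X, "gc"), exact_interval(X, W))
print(calculate_d(X, W), inexact_intervals(X, W, 1))
```

For X = actgct this gives S = [6, 0, 4, 1, 3, 5, 2] and B = t$gatcc, matching the table. The query "gc" maps to the interval [4,4], "gcc" has no exact match, D = [0, 0, 1], and with one substitution allowed "gcc" reaches [4,4], the occurrence of "gct" at position 3.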
Differences between short-read and long-read alignment
Short-read alignment
• Align full-length read
• Efficient for ungapped alignment or limited gaps
Long-read alignment
• Find local matches
• Permissive about alignment gaps

Fast and accurate long-read alignment with Burrows-Wheeler transform
Motivations
• Many programs exist for short sequencing reads.
• Not many exist for reads > 200 bp: BLAT, SSAHA2.
• New platforms are producing longer sequences: Roche/454 > 400 bp, Illumina > 100 bp, Pacific Biosciences > 1000 bp.
• New algorithm: Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW).

Before NGS        After NGS
FASTA 1988        SOAP 2008
BLAST 1997        MAQ 2008
MegaBLAST 2000    Bowtie 2009
SSAHA2 2001       BWA 2009
BLAT 2002         BWA-SW 2010

BWA-SW: Overview of the algorithm
(1) Build FM-indices for the reference and query sequences.
(2) Represent the reference in a prefix trie.
(3) Represent the query in a prefix DAWG (directed acyclic word graph), transformed from the prefix trie of the query sequence.

Example: string GOOGOL
Prefix trie
• '∧' marks the start of a string.
• The two numbers in a node give the SA interval of the node.
a. Three nodes have SA interval [4,4].
b. Their parents have intervals [1,2], [1,2] and [1,1].
Prefix DAWG
• The [4,4] node has parents [1,2] and [1,1].
• Node [4,4] represents the strings 'OG', 'OGO' and 'OGOL'.
[Figure panels: Prefix trie, Prefix DAWG]

BWA-SW: Overview of the algorithm (continued)
(4) Dynamic programming with heuristics to accelerate the algorithm.
Heuristic rules:
A) Restrict dynamic programming to the regions around good matches only.
B) Report only alignments that are largely non-overlapping.
Result of these heuristics: savings in computing time.

BWA-SW: Heuristic strategies for acceleration
(1) Z-best: traverse G(W) in the outer loop and T(X) in the inner loop; at each node u in G(W), keep only the top Z best-scoring nodes in T(X) that match u, rather than all matching nodes.
    where
    G(W): prefix DAWG of the query sequence W
    T(X): prefix trie of the reference sequence X
    u: a node of G(W)
(2) Take only the best few alignments covering each region of the query sequence.

Result
• The BWA-SW implementation takes a BWA index and a query FASTA or FASTQ file as input.
• Typical sequencing reads require less than 4 GB of memory. Peak memory is 6.4 GB in total on one query sequence with 1 million base pairs.

Simulated data
• Speed: BWA-SW is the fastest, and its speed is not sensitive to read length or error rate.
• Memory: BWA-SW uses about 4 GB (as much as BLAT). SSAHA2 uses 2.4 GB for >= 500 bp reads and 5.3 GB for shorter reads. BWA-SW supports multi-threading, while SSAHA2 and BLAT do not.
• Accuracy: BWA-SW can detect chimeric reads, and produces fewer false chimeric reads at lower base error rates.

Conclusion
• Short-read alignment cannot be used for long-read alignment due to:
  – Full-length read vs. local matches.
  – Ungapped or limited gaps vs. a larger number of gaps.
• BWA-short is more accurate, uses less memory, and is competitively fast.
• BWA-long is the best on the market in speed, accuracy, and memory.

Questions?
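The SA intervals quoted in the GOOGOL example of the BWA-SW overview can be checked with a short Python sketch. It assumes a naive suffix array over GOOGOL$ and groups every substring by its SA interval, which mirrors how prefix-trie nodes with identical intervals collapse into one prefix-DAWG node. The function names are hypothetical, and FM-index based code would not enumerate substrings this way.

```python
from collections import defaultdict


def sa_interval(text, pattern):
    """Inclusive SA interval of pattern in text + '$', or None if it does not occur."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive suffix array
    ranks = [r for r, pos in enumerate(sa) if t.startswith(pattern, pos)]
    return (ranks[0], ranks[-1]) if ranks else None


def dawg_nodes(text):
    """Map each SA interval to the set of substrings (trie nodes) that share it."""
    nodes = defaultdict(set)
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            nodes[sa_interval(text, text[i:j])].add(text[i:j])
    return dict(nodes)


nodes = dawg_nodes("GOOGOL")
print(sorted(nodes[(4, 4)]))                                  # strings merged into node [4,4]
print([sa_interval("GOOGOL", p) for p in ("G", "GO", "GOL")])  # parents of those trie nodes
```

The three trie nodes 'OG', 'OGO' and 'OGOL' share the interval [4,4], and their trie parents 'G', 'GO' and 'GOL' have intervals [1,2], [1,2] and [1,1], so the merged DAWG node has the two distinct parents [1,2] and [1,1].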