Slides

Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen

Outline
• Short-read alignment
  – Algorithm
  – Results
• Comparisons between short-read and long-read alignment
• Long-read alignment
  – Algorithm
  – Results

Motivation
• New DNA sequencing technologies call for fast and accurate read alignment programs.
• MAQ:
  – Pros: accurate, feature-rich, and fast enough to align short reads from a single individual.
  – Cons: MAQ does NOT support gapped alignment for single-end reads, which makes it unsuitable for aligning longer reads, where indels may occur frequently.
• Alignment with BWT:
  – efficiently aligns short sequencing reads against a large reference sequence
  – allows mismatches and gaps

Burrows-Wheeler Transform
X: actgct   W: gcc   Z = 1 (maximum number of differences allowed)

Rotations of X$: actgct$, ctgct$a, tgct$ac, gct$act, ct$actg, t$actgc, $actgct

Sorted rotations, with suffix array S[i] and BWT string B[i] (the last character of each sorted rotation):

i   S[i]   Sorted rotation   B[i]
0   6      $actgct           t
1   0      actgct$           $
2   4      ct$actg           g
3   1      ctgct$a           a
4   3      gct$act           t
5   5      t$actgc           c
6   2      tgct$ac           c

Inexact Matching: number of differences in string W
• Take string W = "gcc" as an example.
• 1. W(0,0) = "g": "g" is a substring of X, so D(0) = 0.
• 2. W(0,1) = "gc": "gc" is a substring of X, so D(1) = 0.
• 3. W(0,2) = "gcc": "gcc" is not a substring of X, so D(2) = 1.

Inexact Matching: Searching
X: actgct   W: gcc
[Figure: search tree of suffix array intervals for W against X, starting from the root interval [0,6]; each edge is labeled with a base (a, c, g, t) and each node with its SA interval.]

Exact Matching
• Setting D(i) = 0 for every i and allowing no differences (Z = 0) reduces the algorithm to exact matching.

Simulated data
• Accuracy
  – BWA is more accurate than Bowtie and SOAPv2 based on criterion 1.
• Speed
  – BWA is the second fastest, behind only SOAPv2.
• Memory
  – MAQ's memory footprint is 1 GB, but it increases linearly with the number of reads to be aligned.
  – BWA uses only 2.3 GB for single-end mapping and 3 GB for paired-end mapping (as much as Bowtie).
  – SOAPv2 uses 5.4 GB.
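The Burrows-Wheeler transform and inexact-matching slides above can be reproduced with a small Python sketch for X = actgct and W = gcc. This is an illustration under simplifying assumptions, not BWA's implementation: the suffix array is built by naive sorting, occurrence counts are recomputed by scanning B, D(i) is computed with a plain substring test, and the bounded search allows mismatches only (no gaps). All function names are hypothetical.

```python
from collections import Counter

def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, sa):
    # B[i] is the character that precedes the i-th smallest suffix (cyclically).
    return "".join(text[i - 1] for i in sa)

def build_c(b):
    # c[ch] = number of characters in the text that are smaller than ch.
    c, total = {}, 0
    for ch, n in sorted(Counter(b).items()):
        c[ch] = total
        total += n
    return c

def occ(b, ch, i):
    # Occurrences of ch in B[0..i]; i = -1 gives 0. A real index precomputes this.
    return b[: i + 1].count(ch)

def backward_search(w, b, c):
    # Exact matching: shrink the SA interval while reading W right to left.
    k, l = 0, len(b) - 1
    for ch in reversed(w):
        if ch not in c:
            return None
        k, l = c[ch] + occ(b, ch, k - 1), c[ch] + occ(b, ch, l) - 1
        if k > l:
            return None
    return k, l

def calculate_d(w, x):
    # D[i]: lower bound on the number of differences needed to match W[0..i].
    d, z, j = [], 0, 0
    for i in range(len(w)):
        if w[j : i + 1] not in x:  # simplification: plain substring test
            z += 1
            j = i + 1
        d.append(z)
    return d

def inexact_search(w, b, c, d, z):
    # Mismatch-only bounded search (assumption: gaps are left out for brevity).
    hits = set()

    def recur(i, z, k, l):
        if z < (d[i] if i >= 0 else 0):
            return  # not enough differences left to finish W[0..i]
        if i < 0:
            hits.add((k, l))
            return
        for ch in c:
            if ch == "$":
                continue
            nk = c[ch] + occ(b, ch, k - 1)
            nl = c[ch] + occ(b, ch, l) - 1
            if nk <= nl:
                recur(i - 1, z - (ch != w[i]), nk, nl)

    recur(len(w) - 1, z, 0, len(b) - 1)
    return hits

X = "actgct$"
S = suffix_array(X)
B = bwt(X, S)
C = build_c(B)
D = calculate_d("gcc", "actgct")
exact_ct = backward_search("ct", B, C)
exact_gcc = backward_search("gcc", B, C)
approx_gcc = inexact_search("gcc", B, C, D, 1)
```

With these definitions, `exact_gcc` is `None` because "gcc" does not occur in X, while `approx_gcc` contains the single interval whose suffix array entry points at "gct", the only position reachable with one mismatch.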
Differences between short-read and long-read alignment
Short-read alignment:
• Align full-length read
• Efficient for ungapped alignment or limited gaps
Long-read alignment:
• Find local matches
• Permissive about alignment gaps

Fast and accurate long-read alignment with Burrows-Wheeler transform

Motivations
• Many programs exist for short sequencing reads.
• Few exist for reads > 200 bp: BLAT, SSAHA2.
• New platforms are producing longer sequences: Roche/454 > 400 bp, Illumina > 100 bp, Pacific Biosciences > 1000 bp.
• New algorithm: Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW).

Before NGS          After NGS
FASTA      1988     SOAP     2008
BLAST      1997     MAQ      2008
MegaBLAST  2000     Bowtie   2009
SSAHA2     2001     BWA      2009
BLAT       2002     BWA-SW   2010

BWA-SW: Overview of the algorithm
(1) Build FM-indices for the reference and query sequences.
(2) Represent the reference in a prefix trie.
(3) Represent the query in a prefix DAWG (directed acyclic word graph), transformed from the prefix trie of the query sequence.

Example: string GOOGOL
• '∧' marks the start of a string in the prefix trie.
• The two numbers in a node give the SA interval of the node.
• In the prefix trie:
  a. Three nodes have SA interval [4,4].
  b. Their parents have intervals [1,2], [1,2], and [1,1].
• In the prefix DAWG:
  – The [4,4] node has parents [1,2] and [1,1].
  – Node [4,4] represents the strings 'OG', 'OGO', and 'OGOL'.
[Figure: prefix trie and prefix DAWG of GOOGOL]

BWA-SW: Overview of the algorithm (continued)
(4) Dynamic programming with heuristics to accelerate the algorithm.
Heuristic rules:
A) Restrict the dynamic programming to regions around good matches only.
B) Report only alignments that are largely non-overlapping.
These heuristics save computing time.

BWA-SW: Heuristic strategies for acceleration
(1) Z-best: traverse G(W) in the outer loop and T(X) in the inner loop. At each node u in G(W), keep only the top Z best-scoring nodes in T(X) that match u, rather than keeping all the matching nodes.
Where:
  G(W): prefix DAWG of the query sequence W
  T(X): prefix trie of the reference sequence X
  u: a node of G(W)
(2) Keep only the best few alignments covering each region of the query sequence.

Result
• The BWA-SW implementation takes a BWA index and a query FASTA or FASTQ file as inputs.
• Aligning typical sequencing reads requires less than 4 GB of memory. The peak memory is 6.4 GB in total for one query sequence with 1 million base pairs.

Simulated data
• Speed
  – BWA-SW is the fastest, and its speed is not sensitive to read length or error rate.
• Memory
  – BWA-SW uses about 4 GB (as much as BLAT).
  – SSAHA2 uses 2.4 GB for >= 500 bp reads and 5.3 GB for shorter reads.
  – BWA-SW supports multi-threading, while SSAHA2 and BLAT do not.
• Accuracy
  – BWA-SW can detect chimeric reads, and it produces fewer false chimeric reads when base error rates are lower.

Conclusion
• Short-read alignment cannot be used for long-read alignment because of:
  – Full-length read alignment vs. local matches.
  – Ungapped or limited-gap alignment vs. a larger number of gaps.
• BWA-short is more accurate, uses less memory, and is competitively fast.
• BWA-long is the best available in speed, accuracy, and memory.

Questions?
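As a supplement to the GOOGOL example in the BWA-SW overview, the sketch below recomputes the SA intervals by brute force and groups substrings that share an interval, which is how prefix-trie nodes collapse into a single prefix DAWG node. It is only a check of the numbers on the slide, assuming a naively sorted suffix array rather than an FM-index; `sa_intervals` and `dawg_nodes` are hypothetical helper names.

```python
from collections import defaultdict

def sa_intervals(text):
    # Map every non-empty substring of text to its suffix array interval.
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    suffixes = [t[i:] for i in sa]
    intervals = {}
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            sub = text[start:end]
            # Suffixes sharing a prefix are contiguous in sorted order.
            rows = [r for r, s in enumerate(suffixes) if s.startswith(sub)]
            intervals[sub] = (rows[0], rows[-1])
    return intervals

def dawg_nodes(intervals):
    # Substrings with the same SA interval share one prefix DAWG node.
    nodes = defaultdict(list)
    for sub, interval in intervals.items():
        nodes[interval].append(sub)
    return {iv: sorted(subs, key=len) for iv, subs in nodes.items()}

intervals = sa_intervals("GOOGOL")
nodes = dawg_nodes(intervals)

# In the prefix trie a child prepends one symbol to its parent's string,
# so the parent of a node is found by dropping the first character.
parents_of_44 = [intervals[s[1:]] for s in nodes[(4, 4)]]
```

Here `nodes[(4, 4)]` lists the three strings 'OG', 'OGO', and 'OGOL', and `parents_of_44` gives the intervals [1,2], [1,2], and [1,1], so the merged DAWG node has the two distinct parents [1,2] and [1,1].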
