SNAP: Fast, accurate sequence alignment
enabling biological applications
Ravi Pandya, Microsoft Research
ASHG 10/19/2014
SNAP
SNAP is fast *
Align 50x genome in 1.2 hours
(BWA-MEM = 11.75 hours)
Sort + index + markdup BAM in 2 hours
(samtools+sambamba = 4.25 hours)
SNAP is as accurate as BWA-MEM, Bowtie2, etc.
ROC on simulated data
% aligned on real data
Variant calls on real data
* NA12878:ERR194147, Azure D14 (16 cores, 112GB RAM, 800GB SSD)
Sequence alignment
The problem:
Given a read R and a reference genome G
Find the position p in G that minimizes
EditDistance(R, G[p .. p + |R|])
SNAP solves this quickly and accurately because of:
Efficient system architecture
Reducing the number of comparisons
Reducing the cost of comparisons
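As a baseline for the problem statement above, here is a minimal brute-force sketch: score every position in G with a full edit-distance computation and keep the best. The function names (`edit_distance`, `brute_force_align`) are illustrative and not part of SNAP; this is the slow approach that the following slides improve on.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) Levenshtein distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def brute_force_align(read: str, genome: str) -> tuple[int, int]:
    """Return (p, distance) minimizing EditDistance(R, G[p .. p + |R|])."""
    best_pos, best_dist = -1, len(read) + 1
    for p in range(len(genome) - len(read) + 1):
        d = edit_distance(read, genome[p:p + len(read)])
        if d < best_dist:  # strict: the leftmost best position wins ties
            best_pos, best_dist = p, d
    return best_pos, best_dist
```

This does |G| full comparisons per read, which is why reducing both the number and the cost of comparisons matters.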
System architecture
[Figure: pipeline diagram. Stage labels: async read, align, sort, temp file, mergesort, index, mark duplicates, compress, async write. Buffer labels: empty, full.]
Bill Bolosky, MSR
The sequence alignment problem
The easy part:
97% of 20-mers in the human genome occur only once,
but at only 75% of locations
The hard part:
The other 3% of 20-mers
and 25% of locations
[Figure: CDF of per-read/pair alignment time, NA18705, 169M pairs (using deeper search parameters than current defaults). Axis: 0% to 100%. Series: Single Equally Weighted, Single Time Weighted, Paired Equally Weighted, Paired Time Weighted. Annotations: "10% of reads", "95% of time".]
Bill Bolosky, MSR
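The two percentages on this slide measure different things, and a small sketch may help separate them. It assumes the first figure is the fraction of distinct k-mers that occur exactly once and the second is the fraction of genome positions whose k-mer is unique; that reading and the helper name `kmer_uniqueness` are assumptions, not taken from the slides.

```python
from collections import Counter


def kmer_uniqueness(genome: str, k: int) -> tuple[float, float]:
    """Return (fraction of distinct k-mers that occur once,
               fraction of genome positions whose k-mer occurs once)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for c in counts.values() if c == 1)
    # Each unique k-mer covers exactly one position, so the same count
    # is divided by two different totals.
    return unique / len(counts), unique / len(kmers)


# Toy example: one repeated k-mer ("TTT") occupies most positions.
print(kmer_uniqueness("ACGTTTTTTT", 3))  # (0.75, 0.375)
```

In the toy string, most distinct 3-mers are unique, yet most positions fall inside the repeat, which is the same shape as the 97% versus 75% gap.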
Hash table lookup
Build a multi-valued map (~30GB for hg19)
from all seeds S in G → all locations of S in G
For all seeds in read, all locations of seed in genome,
Score implied alignment of read, keep the best
330 reads/s
Ignore frequent seeds (>300 occurrences)
Only use a few seeds/read
42x, 14k reads/s
Bill Bolosky, MSR
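The seed lookup described on this slide can be sketched with an ordinary dictionary. This is a toy illustration, not SNAP's actual index layout: the names `build_seed_index` and `candidate_locations`, the non-overlapping seed choice, and the `max_seeds` parameter are all assumptions. Only the >300 occurrence cutoff comes from the slide.

```python
from collections import defaultdict
from typing import Optional


def build_seed_index(genome: str, seed_len: int) -> dict[str, list[int]]:
    """Multi-valued map: seed string -> every position where it occurs."""
    index = defaultdict(list)
    for p in range(len(genome) - seed_len + 1):
        index[genome[p:p + seed_len]].append(p)
    return index


def candidate_locations(read: str, index: dict[str, list[int]], seed_len: int,
                        max_hits: int = 300,
                        max_seeds: Optional[int] = None) -> dict[int, int]:
    """Map each implied read start position to its number of seed hits."""
    votes = defaultdict(int)
    offsets = list(range(0, len(read) - seed_len + 1, seed_len))
    if max_seeds is not None:
        offsets = offsets[:max_seeds]      # only use a few seeds per read
    for off in offsets:
        hits = index.get(read[off:off + seed_len], ())
        if len(hits) > max_hits:           # ignore frequent seeds
            continue
        for p in hits:
            start = p - off                # where the read would begin
            if start >= 0:
                votes[start] += 1
    return dict(votes)
```

Each candidate start would then be scored against the read; the hit counts are kept because they can guide which candidates to score first.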
Fast scoring
O(n²) → Ukkonen O(nd), n=len, d=min(limit, actual)
Use limit = best score so far + 2 (for MAPQ)
6.6x, 92k reads/s
Sort candidates by # of seed hits
1.2x, 113k reads/s
Skip locations with #seed misses > limit
1.4x, 154k reads/s (470x overall)
Bill Bolosky, MSR
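A rough sketch of the bounded scoring idea: fill only the diagonals within `limit` of the main diagonal, give up as soon as a whole row exceeds the limit, and tighten the limit to best score + 2 as better candidates are found. This is a banded dynamic program in the spirit of Ukkonen's method, not SNAP's implementation; `bounded_edit_distance` and `best_alignment` are hypothetical names, and second-best tracking for MAPQ and the seed-miss skip are left out.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, limit: int) -> Optional[int]:
    """Edit distance of a and b if it is <= limit, else None. O(len(a) * limit)."""
    if abs(len(a) - len(b)) > limit:
        return None
    inf = limit + 1
    width = 2 * limit + 1            # band slot k holds column j = i + k - limit
    prev = [inf] * width
    for j in range(min(limit, len(b)) + 1):
        prev[j + limit] = j          # row 0: j insertions
    for i in range(1, len(a) + 1):
        curr = [inf] * width
        for k in range(width):
            j = i + k - limit
            if j < 0 or j > len(b):
                continue
            if j == 0:
                curr[k] = i          # column 0: i deletions
                continue
            best = prev[k] + int(a[i - 1] != b[j - 1])   # diagonal
            if k + 1 < width:
                best = min(best, prev[k + 1] + 1)        # cell above
            if k > 0:
                best = min(best, curr[k - 1] + 1)        # cell to the left
            curr[k] = min(best, inf)
        if min(curr) > limit:        # every path is already over the limit
            return None
        prev = curr
    d = prev[len(b) - len(a) + limit]
    return d if d <= limit else None


def best_alignment(read: str, genome: str, candidates: dict[int, int],
                   max_dist: int) -> tuple[Optional[int], Optional[int]]:
    """Score candidates {start: seed_hits}, most seed hits first."""
    best_pos, best, limit = None, None, max_dist
    for start in sorted(candidates, key=candidates.get, reverse=True):
        d = bounded_edit_distance(read, genome[start:start + len(read)], limit)
        if d is not None and (best is None or d < best):
            best_pos, best = start, d
            limit = min(limit, best + 2)   # limit = best score so far + 2
    return best_pos, best
```

Trying well-supported candidates first tends to lower the limit early, so later candidates are rejected after only a few rows.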
Paired-end alignment
Find & score candidate location pairs
C(R1:R2) = C(R1) ∩ C(R2) {± insert size}
Enumerate in O(h log n)
h = |C(R1) ∩ C(R2)|
n = |C(R1)| + |C(R2)|
Increases accuracy by allowing
much higher limit on seed occurrences
(e.g. 4k vs 300)
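One simple way to form candidate pairs is to sort both candidate lists and binary-search the second list for positions within the allowed insert-size window of each first-read candidate. This sketch is an assumption-laden simplification: `candidate_pairs`, `min_gap` and `max_gap` are illustrative names, strand and orientation are ignored, and its cost is O(|C(R1)| log |C(R2)| + h) rather than the O(h log n) enumeration stated on the slide.

```python
from bisect import bisect_left, bisect_right


def candidate_pairs(c1: list[int], c2: list[int],
                    min_gap: int, max_gap: int) -> list[tuple[int, int]]:
    """All (p1, p2) with p1 in c1, p2 in c2 and min_gap <= p2 - p1 <= max_gap."""
    c2 = sorted(c2)
    pairs = []
    for p1 in sorted(c1):
        lo = bisect_left(c2, p1 + min_gap)    # first p2 inside the window
        hi = bisect_right(c2, p1 + max_gap)   # one past the last p2 inside
        pairs.extend((p1, p2) for p2 in c2[lo:hi])
    return pairs


# Only candidates that have a plausible mate survive.
print(candidate_pairs([100, 5000, 90000], [350, 5200, 40000], 100, 500))
# [(100, 350), (5000, 5200)]
```

Because a candidate must have a mate nearby to survive, far more seed hits per read can be tolerated before scoring becomes expensive.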
Results: simulated data
Mason-generated paired-end 100bp reads
Results: real data
NA18507 (Illumina HiSeq 50x)
* AWS cr1.8xlarge (32 cores, 244GB RAM, 2x120GB SSD)
Results: GATK variant calls
Broad GATK pipeline, curated NA12878 variant calls
Results: NIST Genome-in-a-Bottle
Appistry GATK pipeline, GIAB highly confident calls
Longer seeds are much faster, similar precision/recall
ERR194147*.fastq.gz, Azure D14 (16 cores, 112GB RAM, 800GB SSD)
Results: NIST Genome-in-a-Bottle
Lower confidence calls (qual>20, 2 platforms)
Highly confident
Aligner   indel Recall   indel Precision   snp Recall   snp Precision
bwa-mem   97.24%         97.15%            99.57%       99.65%
snap-20   97.04%         97.48%            99.51%       99.57%
snap-24   97.04%         97.46%            99.52%       99.57%
snap-28   97.04%         97.45%            99.53%       99.57%
snap-32   97.00%         97.41%            99.51%       99.57%
Lower confidence
Aligner   indel Recall   indel Precision   snp Recall   snp Precision
bwa-mem   96.38%         96.30%            99.00%       99.32%
snap-20   96.17%         96.68%            98.94%       99.25%
snap-24   96.17%         96.67%            98.95%       99.23%
snap-28   96.16%         96.62%            98.96%       99.21%
snap-32   96.11%         96.55%            98.94%       99.17%
Pathogen ID: SURPI (Charles Chiu, UCSF)
“This analysis of DNA sequences required just 96 minutes. A
similar analysis conducted with the use of previous generations
of computational software on the same hardware platform
would have taken 24 hours or more to complete, Chiu said.”
Charles Chiu, UCSF
SURPI
SNAP enables SURPI with:
Fast filtering mode
64-bit index for >40GB ntDB
Secondary mapping output
Acknowledgements
Microsoft Research
Bill Bolosky
Ravi Pandya
UC San Francisco
Taylor Sittler
Broad Institute
Christopher Hartl
UC Berkeley AMPLab
Matei Zaharia
Kristal Curtis
Armando Fox
Scott Shenker
Ion Stoica
David Patterson
Binaries, source, documentation (Apache 2.0 licensed)
http://snap.cs.berkeley.edu