FastHASH: A New Algorithm for Fast and Comprehensive

advertisement
FastHASH:
A New Algorithm for
Fast and Comprehensive
Next-Generation Sequence Mapping
Hongyi Xin†
Farhad Hormozdiari ‡
Donghyuk Lee†
Can Alkan §
Onur Mutlu†
† Carnegie Mellon University
§ University of Washington
‡ University of California Los Angeles
Next-Generation DNA Sequencing (NGS)
n 
DNA sequencing is important
n 
Basic Approach: Read and map
C—G
T—A
G—C
T—G
C—A
A—C
T—G
G—T
C—A
C—A
n 
Next-generation sequencing technologies produce a large
number of short DNA fragments
q 
q 
q 
n 
Chromosome VII
G—123
A—124
C—125
G—126
A—127
C—128
G—129
T—130
A—131
A—132
More computationally intensive
Harder to map the entire genome
Especially when allowing polymorphism
Goal: Design fast and comprehensive algorithms to analyze
enormous amounts of NGS data
2
Challenge of Existing NGS Mapping Tools
n 
We want Fast and Comprehensive
A—T
A—T
T—122
C—G
T—289
G—C
C—G
G—123
T—260
A—T
T—A
G—290
C—G
C—G
T—122
T—A
A—124
G—361
C—G
G—C
T—732
T—A
A—291
T—A
C—G
G—123
G—C
C—125
T—1133
A—362
T—A
T—G
G—733
C—G
C—292
G—C
T—A
A—124
T—G
G—126
G—1134
C—363
G—C
C—A
A—734
T—A
G—293
T—G
G—C
C—125
C—A C—735
A—127
A—1135
G—364
T—G
A—C
C—G
A—294
C—A
T—G
G—126
A—C
C—128
C—1136
A—365
C—A T—G
T—G
G—736
C—295
A—C
C—A
A—127
T—G A—C
G—129
G—1137
C—366
G—T
A—127
C—A
G—296
T—G
A—C
C—128
G—T
T—130
A—1138
G—367
T—G
C—A
C—738
A—C
T—297
G—T
T—G
G—129
C—A
A—131
C—1139
T—268
G—T
C—A
G—739
G—T
A—298
C—A
G—T
T—130
C—A C—A
A—132
G—1140
A—269
T—740
G—T
A—299
C—A
C—A
A—131
T—1141
A—370A—741
C—A C—A
C—A
A—132A—1142
A—742 C—A
A—1143
n 
Some tools are only fast
q 
q 
n 
à need a fast tool
à need a comprehensive tool
BWA, SOAPv2, Bowtie
Not comprehensive
Some tools are only comprehensive
q 
q 
mrFAST, mrsFAST
Slow
3
FastHASH NGS Mapping Kernel
n 
Goal
q 
n 
Observation
q 
n 
Some tools do unnecessary work to guarantee
comprehensiveness à main cause of slowness
Main Idea of FastHASH
q 
n 
Best of both worlds: Fast and Comprehensive
Cut down unnecessary work by taking advantage of full
knowledge of reference genome
Two main mechanisms
q 
Adjacency Filtering and Cheap Segment Selection
4
Evaluation
Runtime (s)
20000
18369
15000
mrFAST kernel
10000
5000
0
n 
FastHASH kernel
38x speedup compared to state-of-the-art alignment tool
q 
n 
478
No sacrifice of comprehensiveness
Also implemented on GPU for further acceleration
5
Thanks
n 
For more information, please come by our poster!
6
FastHASH:
A New Algorithm for
Fast and Comprehensive
Next-Generation Sequence Mapping
Hongyi Xin†
Farhad Hormozdiari ‡
Donghyuk Lee†
Can Alkan §
Onur Mutlu†
† Carnegie Mellon University
§ University of Washington
‡ University of California Los Angeles
Download