Document 10410159

advertisement
FastHASH: A New Algorithm for Fast and Comprehensive Next-­‐genera1on Sequence Mapping Hongyi Xin1, Donghyuk Lee1, Farhad Hormozdiari2, Can Alkan3, Onur Mutlu1 1 Departments of Computer Science and Electrical and Computer Engineering, Carnegie Mellon University, Pimsburgh, PA 2 Department of Computer Science, University of California Los Angeles, CA
3 Department of Genome Sciences, University of Washington, Seamle, WA Next-­‐genera;on DNA Sequencing and the State-­‐of-­‐the-­‐art Sequence Mapping Tools mrFAST Background: DNA Sequencing Challenge of Next-­‐genera;on DNA Sequencing Exis;ng Mapping Tools •  Goal: Acquire individual’s en1re DNA sequence •  Mechanism: Read DNA fragments and reconstruct it  Break DNA into pieces and store them as strings  Compare the strings to a known reference DNA string -­‐-­‐ Search for matching coordinates in reference DNA  Stitch fragments together in corresponding order •  Difficul;es: Individuals have muta1ons including  Mismatch, inser1ons and dele1ons; must tolerate •  Next-­‐genera;on DNA Sequencing:  Instead of reading fewer long fragments, read many short fragments in parallel  This pushes the challenge to computa1on •  Challenge:  Shorter but many reads: billions of them  Mapping a fragment to en1re reference genome is costly: cost does not reduce vs. a long fragment, and may increase for a shorter fragment  More poten1al mapping loca1ons: harder to search for all possible matches in the reference DNA -­‐-­‐ Even harder when muta1ons are allowed •  Requirement:  Algorithm that is fast and efficient which can process enormous amount of data •  Suffix tree or prefix tree based alignment tools:  Newer tools use Burrows-­‐Wheeler transforma1on -­‐-­‐ Bow1e, BWA, SOAPv2  Advantage -­‐-­‐ Fast in finding the exact match without muta1ons  Disadvantages -­‐-­‐ Very slow when muta1ons are allowed -­‐-­‐ Not comprehensive: does not search for all possible loca1ons •  Hash table based alignment tools:  Use hash table for filtering non-­‐matching coordinates -­‐-­‐ mrFAST, mrsFAST  Advantage -­‐-­‐ Comprehensive, and fast when comprehensive  Disadvantage -­‐-­‐ Slower in searching for just the exact match qq base pair (bp) Mis-­‐
match Reference DNA Fragment mrFAST: Two Key Components Coordinate list 11 12 229 304 AAAAAAAAAAAA AAAAAAAAAAAC AAAAAAAAAAAG AAAAAAAAAAAT 1105 303 400012 798 …. 4991 900321 qq FastHASH Our Goal and FastHASH •  Hash table (HT):  Stores coordinates of segments in reference DNA Segments q 991 TTTTTTTTTTTTTT Effect of Adjacency Filtering 3.E+10 1.E+10 5.E+07 1.E+08 1.E+07 1.E+06 qq •  String comparisons are dras1cally reduced: 3.7x speedup Our Second Observa;on Original 1me (s) String comparison Other Adjacency Filtering Time with AF (s) 0 5000 10000 15000 20000 •  Adjacency Filtering becomes the bomleneck •  We can speed this up by avoiding the probing of long coordinate lists •  Observa;on: Hash table is imbalanced  Cheap segments: Segments that have few coordinates in hash table  Expensive segments: Segments that have many coordinates  lead to slow execu1on during AF •  Idea: Select cheapest segments within a fragment  Selecting the cheapest e+1 segments guarantees comprehensiveness (at least one has no error) •  Example: If e = 1, select the cheapest 2 segments q AAAAAAAAAAAAACGTAACCTTAAAACCCATTTACC Cheap Cheapest •  Effect of CSS: The number of coordinates examined First segments 100% Cheapest segments 6.4% 0 1E+09 2E+09 10000 q 15000 20000 •  Most string comparisons are useless: result in no match 1.E+11 3.E+10 Number of string comparisons conducted 1.E+10 1.E+09 1.E+08 Number of string matches 1.E+07 1.E+07 3E+09 String Compare Run 1me (s) 10000 5000 4 m m+12 4935 478 m+24 q ? AAAAAAAAAAAA
n 303 505 ? m+12 CCCCCCCCCCCC
n+12 557 1033 TTTTTTTTTTTT
n+25 … m m ? m+24 Do > e coordinate lists contain consecu1ve coordinates? coordinate list Preliminary GPU Execu;on Time of FastHASH Intel i7 2600 / 16 GB DRAM q Input string Reference string •  Observa;on: If perfect match, consecu1ve segments should be at consecu1ve coordinates! •  Idea: For a coordinate, check if consecu1ve coordinates are in the coordinate lists of consecu1ve segments  If yes  Do string comparison  If no
 No need for string comparison coordinate CPU Execu;on Time 15000 Reference strings String compare: Compare every base pair
 very slow …AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT…
•  Input fragment set: Run 1me (s)  Fragment length: 108 base-­‐pairs 600  Fragment size: 1 million 500  Number of errors: 3 mismatches, inser1ons or dele1ons 400 18369 q AAAAAAAAAAAACCCCCCCCTCCCTTTTTTTTTTTT
AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTCGAT
AAAAAATAACAACCCCCCCCCCCCTTTTTTTTTTTT
1.E+06 20000 1.  Divide fragment into segments 2.  Check HT to get coordinates segments’ 303 coordinates in 1105 7712 11 991 44 4991 reference DNA 900321 3.  Retrieve reference DNA strings at the Reference DNA coordinates 3 Database 4.  Compare fragment to reference DNA strings segments AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT
Other 0 Expensive 2 Hash Table (HT) Stores coordinates (coord.) of segments in reference DNA String comparison Preliminary Results Cheap Segment Selec;on (CSS) Original mrFAST string comparisons String comparisons aier AF Number of string matches 1 TTTTTTTTTTTT
CCCCCCCCCCCC
AAAAAAAAAAAA
•  Goal: Reduce the number of string comparisons Execu1on 1me (s) 5000 fragment AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTT
Adjacency Filtering (AF) •  String comparisons take too long  95% of execu1on 1me 0 q 4001451 q Our First Observa;on •  Problem with mrFAST:  Slow: 5 hours to process 1M fragments (108 bp) •  Our goal:  Reduce the execu1on 1me while maintaining comprehensiveness • FastHASH Overview: Two key components:  Adjacency Filtering: Reject obviously non-­‐matching coordinates at early stage to avoid unnecessary  Each segment looked up in HT to get coordinate list expensive string comparisons  For each coordinate in the list, look up reference string  Cheap segment selec;on: Reduce the absolute  expensive number of coordinates that are subject to •  String Compare: examina1on  Compare input fragment against reference DNA  Check for muta1ons: mismatches, inser1ons and • Current Result: dele1ons (allow e muta1ons)  38x speedup for 1M fragments compared to mrFAST  Need to compare every base pair  very slow -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ mrFAST Flow Chart mrFAST kernel + Adjacency Filtering + Cheap Segment Selec1on (FastHASH) 478 331 300 200 100 0 q Nvidia Tesla C2070 CPU GPU •  Conclusion  GPU provides 1.44x speedup (early result) •  Conclusions •  Ongoing work  Adjacency Filtering provides 3.7x speedup  Adjacency Filtering + Cheap Segment Selec1on  Schedule work bemer on GPU for higher speedup provides 38x speedup 
Download