Document 10638475

advertisement
FastHASH: A new algorithms for fast and comprehensive next genera8on sequence mapping Hongyi Xin Donghyuk Lee Computer Science Department of School of Electrical and computer engineering department Computer Science, CMU of School of Engineering, CMU DNA sequencing & mrFAST Background: DNA sequencing mrFAST: Hash table for fast loca4on look up •  Goal: people want to know their DNA at a cheap cost •  Mechanism: Read short fragments and reconstruct them -­‐-­‐ Via aligning them against a Reference DNA -­‐-­‐ String Mapping •  Difficul4es: Individuals has muta8ons including -­‐-­‐ Mismatch, Inser8on and Dele8ons base pair Mis-­‐
match Reference DNA mrFAST flow chart •  Hash table -­‐-­‐ fast lookup •  String comparison -­‐-­‐ detailed muta8on informa8on AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTT
Hash table – matches pa:erns with loca8ons Pa:ern Hash Table Segmenta8on Coordinate list TTTTTTTTTTTT
CCCCCCCCCCCC
AAAAAAAAAAAA
1105 303 11 12 229 304 AAAAAAAAAAAA AAAAAAAAAAAC AAAAAAAAAAAG 400012 798 303 1105 7712 11 991 44 4991 900321 …. Reference DNA Database AAAAAAAAAAAACCCCCCCCTCCCTTTTTTTTTTTT
AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTCGAT
AAAAAATAACAACCCCCCCCCCCCTTTTTTTTTTTT
4001451 -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ Problem of mrFAST •  Too slow: 5 hours 40 min for 1M reads •  String Compare is expensive: over 9 μs each •  Extra memory access further slows down the program 900321 4991 991 TTTTTTTTTTTTTT String comparison – Expensive func8on String Compare Adjacency Filtering Observa4on Mechanism •  String comparison takes too much 8me •  Adjacency Filtering – Reducing String Comparison AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT
…AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT…
m •  Low pass rate of String Comparison -­‐-­‐ most string compares are useless 1.E+10 1.E+07 1.E+06 1.E+04 Input string Reference string 1.E+10 m+24 Number of String Compare conducted ? n 1.E+00 n ? n+12 ? •  Lots of memory accesses 3 * 104 / read n+24 coordinate a coordinate c If more than e segments fail? AAAAAAAAAAAA
coordinate b CCCCCCCCCCCC
coordinate s Coordinate t TTTTTTTTTTTT
coordinate z coordinate x coordinate y 3.E+10 Number of String Compare mrFAST 5.E+07 1.E+07 1.E+08 1.E+06 Number of String Compare ager Filtering 1.E+04 For a coordinate n, we search for consecu8ve coordinates for consecu8ve segments Number of passes 1.E+02 m+12 •  Filter out less relevant coordinates: -­‐-­‐ Reducing Edit-­‐distance calcula8on If perfect match, consecu8ve segments should be at consecu8ve loca8ons! 3.E+10 1.E+08 Evalua4on 1.E+02 1.E+00 Execu4on 4me reduced to 3 hours But s4ll not 1000x speed up! •  New bo:leneck: Adjacency Filtering Yes Drop n No Con8nue to String Compare Cheap Key Selec4on Observa4on Mechanism •  Load imbalance of Hash Table Longer entries è more computa8on, higher frequency Expensive to do Adjacency Filtering Evalua4on •  Selec8ng e+1 keys for full coverage: Pigeon Hole Theorem -­‐-­‐ Search for e+1 key is sufficient: at least one of them has no error -­‐-­‐ Sort them based on the their coordinate list’s size, the shorter, the cheaper -­‐-­‐ Select the cheapest e+1 keys •  Benefits of Cheap Key Selec8on. -­‐-­‐ Reducing the number searching in Adjacency Filtering •  Selec8ng cheapest e+1 keys: Reduce searching Total number of Adjacency Filter perform First Keys 2400514006 (100%) Cheap Keys 153632222 (6.4%) 1.0E+00 1.0E+03 1.0E+06 1.0E+09 1 million Fragment searching / 1 chromasom / e=3 AAAAAAAAAAAAACGTAACCTTAAAACCCATTTACC Fragment Large Large Large Small Small Small # of same pa:ern # of coordinate # of computa8on Select as searching key Methodology •  Input fragment set: -­‐-­‐ Fragment length: 108 base-­‐pairs -­‐-­‐ Fragment size:
1 million -­‐-­‐ Fragment composi8on: first 1 million fragments from the first chromosome of reference DNA -­‐-­‐ Endurance:
3 mismatches or inser8ons or dele8ons •  Processing Plaqorm: -­‐-­‐CPU: Intel Sandybridge i7 with 16 GB main memory -­‐-­‐GPU: Nvidia Tesla 2070C with 6 GB local memory (GDDR5) 140 Blocks 64 Threads Evalua4on & Conclusion •  CPU Run 8me Comparison Run 8me (s) 3.E+04 2.E+04 20253 Run 8me (s) 300 250 200 mrFAST 2.E+04 10822 mrFAST + AD 1.E+04 5.E+03 •  CPU GPU Comparison mrFAST + AD + CKS 1442 0.E+00 •  Conclusion -­‐-­‐ Adjacency Filtering Provides 1.87x speed up -­‐-­‐ Adjacency Filtering + Check Key Selec8on Provides 14x speed up 207 150 Calcula8on Time 100 50 0 Memory Transfer Time 50 63 CPU GPU •  Conclusion -­‐-­‐ GPU runs slower than CPU •  Explana8on -­‐-­‐ The code does not explore enough parallelism -­‐-­‐ Bad communica8on to computa8on ra8o 
Download