Sequence Alignment in DNA
Under the Guidance of: Prof. Kolin Paul
Presented By: Lalchand Gaurav Jain

Agenda
• Application Domain & Objective
• General Alignment Procedure
• Scope of Parallelism in BWT
• Selection sort and quick sort implementation
• BWT implementation on GPU
• Comparative study

Application Domain & Objective
• Analyzing gene expression
• Mapping variations between individuals
• Mapping homologous proteins
• Assembling the genome of an organism
Objective: to present an efficient (especially parallel) implementation that helps solve the problem of searching for short sequences in DNA.

Basic Alignment Procedure
[Diagram]
Genome → Indexing (to be parallelized; intermediate size: 10^18)
Reads → Searching (parallelized, O(log G)) → {Location, Occurrence}

Scope of Parallelism in BWT
• With BWT, a string of length w can be found in O(w) time.
• The BWT is closely related to the suffix array.
• Suffix array: the lexicographically sorted list of all suffixes in a genome.
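The link between the suffix array and the BWT can be illustrated with a small sequential sketch. It uses the standard relation Bwt[i] = ref[SA[i] - 1] and a naive sort of all suffixes, so it only suits short strings. It is not the parallel implementation discussed in these slides. The function names `suffix_array` and `bwt_from_suffix_array` are illustrative, and the code assumes the reference ends with `$`, which sorts before A, C, G and T in ASCII.

```python
def suffix_array(ref):
    """Return the start positions of all suffixes of ref in lexicographic order.

    Naive sketch: every comparison materialises a suffix slice.
    Assumes ref ends with '$', which sorts before A, C, G and T.
    """
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def bwt_from_suffix_array(ref, sa):
    """Apply Bwt[i] = ref[SA[i] - 1]; the suffix starting at position 0 yields '$'."""
    return "".join(ref[i - 1] if i > 0 else "$" for i in sa)


ref = "ACGTA$"
sa = suffix_array(ref)
print(sa)                              # [5, 4, 0, 1, 2, 3]
print(bwt_from_suffix_array(ref, sa))  # AT$ACG
```

Positions are 0-based here, so the `$` case is SA[i] = 0.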
BWT
• Bwt[i] = ref[SA[i] - 1]   {Bwt[i] = $ when SA[i] is the first position of ref}

Initial Step - 1
● Implementation of BWT using Selection Sort – OpenMP

Selection Sort - OpenMP
[Figure: BWT creation using Selection Sort. x-axis: File Size in KB; y-axis: Time in Seconds; series: Proc 1, Proc 2, Proc 4, Proc 8]

CPU Statistics
Cores         8
Data cache    L1: 32K, L2: 6M
DRAM          12GB
Proc. Clock   2.9 GHz

Initial Step - 2
● Implementation of BWT using Selection Sort – OpenMP
● Implementation of BWT using Quick Sort – OpenMP

Quick Sort - OpenMP
[Figure: BWT creation using Quick Sort, measured on the CPU described under CPU Statistics]

Initial Step - 3
● Implementation of BWT using Selection Sort – OpenMP
● Implementation of BWT using Quick Sort – OpenMP
● Implementing BWT on GPU – Bitonic sort

Why Bitonic?
• A bitonic sequence is the concatenation of two sub-sequences sorted in opposite directions
  – or a cyclic shift of such a sequence
• Implemented by comparator networks
  – Work in place
  – No communication
• Naturally suitable for SIMD architectures
  – Each thread executes the same code on different data
• O(log² n) time and O(n log² n) work

Burrows-Wheeler Transform
Basic String Sorting Algorithm
Input:   A C G T A $
indices: 0 1 2 3 4 5

Rotations (by start index):
5  $ A C G T A
4  A $ A C G T
3  T A $ A C G
2  G T A $ A C
1  C G T A $ A
0  A C G T A $

Sorted rotations:
5  $ A C G T A
4  A $ A C G T
0  A C G T A $
1  C G T A $ A
2  G T A $ A C
3  T A $ A C G

indices: 5 4 0 1 2 3
Output:  A T $ A C G

Steps Performed
• Copy the genome from host to device memory
• Create an indices array pointing into the reference string
• Compare suffixes based on the indices array
  – Swap indices accordingly
• Sorts n elements in log² n kernel calls
  – Each of O(1) time & O(n) work
• One more step to get the BWT from the suffix array
  – Bwt[i] = ref[SA[i] - 1]   {Bwt[i] = $ when SA[i] is the first position of ref}

CPU – GPU Interaction (BWT)
[Diagram: Genome → Cuda_Memcpy & kernel call → Suffix Array Evaluation; Searching: O(log2G)]

BWT with Bitonic Sort

GPU Statistics
SM                30
Core/SM           8
Cores             240
Data cache (SM)   16 K
DRAM              536 M
Proc. Freq        1.2 MHz

Comparison between Expected (GPU) and Exact result
Expected (GPU) time = (Quick_Sort_time * 2) / 240

              CPU                GPU
Cores         2                  240
Data cache    L1: 32K, L2: 6M    16K (per SM)
DRAM          12GB               536 M
Proc. Clock   2.9 GHz            1.2 MHz

Thanks

Future Work
• Run in limited memory environments
  – Compute in parts
• Use the memory hierarchy of the GPU
  – Sort keys are cached in registers or shared memory
  – Long runs of repeated characters
    • Position indicating end of run
• Can only sort sequences whose length is a power of 2
  – 2^k + 1 → 2^(k+1)
  – Padding with largest symbol
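The index-swapping bitonic sort and the power-of-2 padding described above can be sketched sequentially as follows. This is a Python stand-in under stated assumptions, not the CUDA code behind these slides: each pass of the inner `for i` loop plays the role of one kernel call in which every `i` would be an independent thread, and padding is modelled by extra index values (anything >= len(ref)) that compare larger than every real suffix, standing in for "padding with largest symbol". The names `suffix_greater` and `bitonic_suffix_array` are hypothetical.

```python
def suffix_greater(ref, a, b):
    """True if the suffix at a sorts after the suffix at b.

    Index values >= len(ref) are padding and sort after every real suffix.
    """
    n = len(ref)
    if a >= n or b >= n:
        return a >= n and b < n
    return ref[a:] > ref[b:]


def bitonic_suffix_array(ref):
    """Sort suffix indices with a bitonic network; return (suffix array, passes)."""
    n = len(ref)
    size = 1
    while size < n:              # round up to the next power of two
        size *= 2
    idx = list(range(size))      # entries n..size-1 are padding
    passes = 0
    k = 2
    while k <= size:             # length of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # distance between compared elements
            for i in range(size):            # one "kernel call": all i independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if suffix_greater(ref, idx[i], idx[partner]) == ascending:
                        idx[i], idx[partner] = idx[partner], idx[i]
            passes += 1
            j //= 2
        k *= 2
    return idx[:n], passes       # padding has sorted to the end


sa, passes = bitonic_suffix_array("ACGTA$")
print(sa, passes)
```

For a padded length of 2^m the sketch runs m(m + 1)/2 passes, which is where the log² n count of kernel calls comes from. Only indices are swapped, so the reference string itself is never moved.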