A massively parallel solution to a massively parallel problem
Brian Lam, PhD
HPC4NGS Workshop, May 21-22, 2012
Principe Felipe Research Center, Valencia

Agenda
- Next-generation sequencing and current challenges
- What is GPGPU computing?
- The use of GPGPU in NGS bioinformatics
- How it works – the BarraCUDA project
- System requirements for GPU computing
- What if I want to develop my own GPU code?
- What to look out for in the near future?
- Conclusions

Next-generation sequencing and current challenges

"DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases A, G, C, T in a molecule of DNA." - Wikipedia
- Commercialised in 2005 by 454 Life Sciences
- Sequencing is done in a massively parallel fashion

An example: whole-genome sequencing
- Extract genomic DNA from blood or mouth swabs
- Break it into small DNA fragments of 200-400 bp
- Attach the DNA fragments to a surface (flow cells/slides/microtitre plates) at a high density
- Perform concurrent "cyclic sequencing reactions" to obtain the sequence of each of the attached DNA fragments
- An Illumina HiSeq 2000 can interrogate 825K spots/mm²

[Figure: flow-cell spots with example reads ATTCNGG, GTCCTGA, TATTTTT; not to scale]

The output: billions of short DNA sequences, also called sequence reads, ranging from 25 to 400 bp

A typical NGS analysis pipeline: Image Analysis → Base Calling → Sequence Alignment → Variant Calling / Peak Calling (Source: Genologics)

What is Many-core/GPGPU computing?

- A physical die that contains a 'large number' of processing cores, i.e.
computation can be done in a massively parallel manner
- Modern graphics cards (GPUs) consist of hundreds to thousands of computing cores

CPUs vs GPUs – transistor usage (Source: NVIDIA)
- GPUs are fast, but there is a catch: SIMD – Single Instruction, Multiple Data
- CPUs are powerful multi-purpose processors: MIMD – Multiple Instructions, Multiple Data

SIMD
- The very same (low-level) instruction is applied to multiple data points at the same time, e.g. a GTX 680 can perform an addition on 1536 data points at a time, versus 16 on a 16-core CPU
- Branching results in serialisation
- The ALUs on a GPU are usually much more primitive compared to their CPU counterparts

Why GPUs for scientific computing?
- Scientific computing often deals with large amounts of data and, on many occasions, applies the same set of instructions to those data. Examples:
  ▪ Monte Carlo simulations
  ▪ Image analysis
  ▪ Next-generation sequencing data analysis

Low capital cost and energy efficient
- Dell 12-core workstation: £5,000, ~1 kW
- Dell 40-core computing cluster: £20,000+, ~6 kW
- NVIDIA GeForce GTX 680 (1536 cores): £400, <0.2 kW
- NVIDIA Tesla C2070 (448 cores): £1,000, 0.2 kW
- Many supercomputers also contain multiple GPU nodes for parallel computations

Example GPU speed-ups
- CUDASW++: 6.3x
- MUMmerGPU: 3.5x
- GPU-HMMer: 60-100x

The use of GPGPU in NGS bioinformatics

- The Ion Torrent server (a T7500 workstation) uses GPUs for base calling
- MUMmerGPU – comparisons among genomes
- BarraCUDA, SOAP3 – short-read alignment
How it works – the BarraCUDA project

The NGS pipeline: Image Analysis → Base Calling → Sequence Alignment → Variant Calling / Peak Calling
- Sequence alignment is a crucial step in the bioinformatics pipeline for downstream analyses
- This step often takes many hours of CPU time to perform
- Usually done on HPC clusters

The main objective of the BarraCUDA project is to develop software that runs on GPU/many-core architectures, i.e. to map sequence reads in a massively parallel fashion, the same way as they come out from the NGS instrument.

Data flow between CPU and GPU:
- CPU: copy the read library to the GPU
- CPU: copy the reference genome to the GPU
- GPU: run many alignments concurrently, one per thread
- CPU: copy the alignment results back from the GPU and write them to disk

The Burrows-Wheeler Transform (BWT)
- Originally intended for data compression; performs a reversible transformation of a string
- In 2000, Ferragina and Manzini introduced a BWT-based index data structure for fast substring matching in O(n)
- Substring matching is performed in a tree-traversal-like manner
- Used in major sequencing-read mapping programs, e.g. BWA, Bowtie, SOAP2

[Figure: prefix trie of 'banana'; matching the substring 'banan' follows a single path through the trie. Modified from Li & Durbin, Bioinformatics 2009, 25(14):1754-1760]

BWT exact matching (pseudocode, modified from Li & Durbin, Bioinformatics 2009, 25(14):1754-1760):

BWT_exactmatch(READ, i, k, l) {
    if (i < 0) then return RESULTS;
    k = C(READ[i]) + O(READ[i], k-1) + 1;
    l = C(READ[i]) + O(READ[i], l);
    if (k <= l) then BWT_exactmatch(READ, i-1, k, l);
}

main() {
    Calculate reverse BWT string B from reference string X
    Calculate arrays C(.) and O(.,.)
from B
    Load READS from disk
    For every READ in READS do {
        i = |READ|;   // position
        k = 0;        // lower bound
        l = |X|;      // upper bound
        BWT_exactmatch(READ, i, k, l);
    }
    Write RESULTS to disk
}

Porting to the GPU – simple data parallelism, using the GPU mainly for the matching step:

__device__ BWT_GPU_exactmatch(W, i, k, l) {
    if (i < 0) then return RESULTS;
    k = C(W[i]) + O(W[i], k-1) + 1;
    l = C(W[i]) + O(W[i], l);
    if (k <= l) then BWT_GPU_exactmatch(W, i-1, k, l);
}

__global__ GPU_kernel() {
    W = READS[thread_no];
    i = |W|;   // position
    k = 0;     // lower bound
    l = |X|;   // upper bound
    BWT_GPU_exactmatch(W, i, k, l);
}

main() {
    Calculate reverse BWT string B from reference string X
    Calculate arrays C(.) and O(.,.) from B
    Load READS from disk
    Copy B, C(.) and O(.,.) to the GPU
    Copy READS to the GPU
    Launch GPU_kernel with <<|READS|>> concurrent threads
    Copy the alignment results back from the GPU
    Write RESULTS to disk
}

Very fast indeed: using a Tesla C2050, we can match 25 million 100-bp reads to the BWT in just over 1 minute. But... is this relevant?

Inexact matching
[Figure: prefix trie of 'banana' showing inexact matching of 'anb', where 'b' is substituted with an 'a'; mismatching branches must also be explored. Modified from Li & Durbin, Bioinformatics 2009, 25(14):1754-1760]
- Search space complexity = O(9^n)! (Klus et al., BMC Res Notes 2012, 5:27)

It partially worked! (Klus et al., BMC Res Notes 2012, 5:27)
- 10% faster than 8 Xeon X5472 cores @ 3 GHz
- BWA uses a greedy breadth-first search approach that takes up to 40 MB per thread
- There is not enough workspace for thousands of concurrent kernel threads (at 4 KB each), i.e. reduced accuracy – NOT GOOD ENOUGH!

The search space increases exponentially with longer read lengths (as was explained in Section 1.5).
Even though the programs impose general search-space restrictions, such as a maximum number of differences and seeding (details are covered in Section 3.1), the speed reduction cannot be completely eliminated. There is always a balance between speed and accuracy: the more aggressive the search-space reduction approach is, the lower the accuracy of the alignment becomes.

[Figure 2.2 – Illustration of how thread-execution divergence is handled by the GPU. The left panel shows threads with tasks to perform, where parts of their execution paths differ. The middle panel shows how a multi-core CPU handles such divergence – it does not pose a problem, as the threads are executed independently. On the GPU, diverging threads are padded with data-empty instructions until the paths reconverge (legend: A = data-empty instruction, B = data-storing instruction).]

[Flowchart: the BarraCUDA alignment kernel, per query sequence]
1. If the query sequence has more than three N's, exit.
2. Calculate the boundaries of differences (widths and bids) for the query sequence.
3. Start from the first base of the query and set the full BWT suffix array as the search boundary.
4. Look for all possible hits within the search boundary (perfect match, mismatches, indels).
5. If there is no hit (k > l): if the first base of the query has already been reached, exit; otherwise move one base backwards and look for the next best hit.
6. If there is a hit: if the last base of the query has been reached, record the alignment (k, l and the differences); otherwise choose and store the best hit, set the new search boundary with k & l, proceed to the next base and repeat from step 4.

[Figure: a diverging task (Thread 1, with hits A and B) is split into sub-threads Thread 1.1 and Thread 1.2 through a CPU thread queue. Klus et al., BMC Res Notes 2012, 5:27]

Accuracy (Klus et al., BMC Res Notes 2012, 5:27):

           BWA 0.5.8   BarraCUDA
Mapping    89.95%      91.42%
Error      0.06%       0.05%

[Figure: time taken (min) to map the read set – Xeon 5160, 1 core: 153; Xeon X5670, 1 core: 105; Xeon 5160, 2 cores: 86; Xeon 5160, 4 cores: 52; Xeon X5670, 6 cores: 21; Tesla M2090: 20]

Simple data parallelism, using the GPU mainly for matching (Klus et al., BMC Res Notes 2012, 5:27)
System requirements for GPU computing

Hardware
- The system must have at least one decent GPU
  ▪ An NVIDIA GeForce 210 will not work!
- One or more PCIe x16 slots
- A decent power supply with the appropriate power connectors
  ▪ 550 W for one Tesla C2075, plus 225 W for each additional card
  ▪ Don't use any Molex converters!
- Ideally, dedicate cards to computation and use a separate cheap card for the display

Software
- CUDA
  ▪ CUDA toolkit (for example, CUDA toolkit v4.2)
  ▪ Appropriate NVIDIA drivers
- OpenCL
  ▪ CUDA toolkit (NVIDIA)
  ▪ AMD APP SDK (AMD)
  ▪ Appropriate AMD/NVIDIA drivers

DEMO

What if I want to develop my own GPU code?

In the end, how much effort have we put in so far?
- 3,300 lines of new code for the alignment core
  ▪ Compared to 1,000 lines in BWA
  ▪ Still ongoing!
- 1,000 lines for the suffix-array-to-linear-space conversion

CUDA vs OpenCL
- CUDA (in general) is faster and more complete, but it is tied to NVIDIA hardware
- OpenCL is still new; AMD and Apple are pushing hard on it, and it can run on a variety of hardware, including CPUs

Resources
- NVIDIA developer zone: http://developer.nvidia.com/
- OpenCL developer zone: http://developer.amd.com/zones/OpenCLZone/Pages/default.aspx

Things to look out for
- Different hardware has different capabilities: GT200 is very different from GF100 in terms of cache arrangement and concurrent kernel execution, and GF100 is in turn different from GK110, which can do dynamic parallelism
- Low-level data management is often required, e.g. earmarking data for a particular memory type, coalescing memory accesses

The easy way out!
OpenACC
- A set of compiler directives that allows programmers to use 'accelerators' without explicitly writing GPU code
- Supported by CAPS, Cray, NVIDIA and PGI
- Source: NVIDIA; http://www.openacc-standard.org/

What to look out for in the near future?

The game is changing quickly
- CPUs are also getting more cores: the AMD 16-core Opteron 6200 series
- Many-core platforms are evolving:
  ▪ Intel MIC processors (50 x86 cores)
  ▪ NVIDIA Kepler platform (1536 cores)
  ▪ AMD Radeon 7900 series (2048 cores)
- SIMD nodes are becoming more common in supercomputers
- OpenCL

Conclusions
- Few software packages are available for NGS bioinformatics yet; more are to come
- BarraCUDA is one of the first attempts to accelerate the NGS bioinformatics pipeline
- A significant amount of coding is usually required, but more programming tools are becoming available
- Many-core platforms are still evolving rapidly, and things can be very different in the next 2-3 years

Acknowledgements
- IMS-MRL, Cambridge: Giles Yeo, Petr Klus, Simon Lam
- Gurdon Institute, Cambridge: Nicole Cheung
- NIHR-CBRC, Cambridge: Ian McFarlane, Cassie Ragnauth
- HPCS, Cambridge: Stuart Rankin
- Whittle Lab, Cambridge: Graham Pullan, Tobias Brandvik
- NVIDIA Corporation: Thomas Bradley, Timothy Lanfear
- Microbiology, University College Cork: Dag Lyberg