Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany

[Intro slide: NetSysLab, University of British Columbia. A golf course ... a (nudist) beach ... and 199 days of rain each year.]

Hybrid architectures in the Top 500 [Nov'10]
[Chart: share of hybrid (accelerator-based) systems in the Top 500 list]

• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient [operated today at low efficiency]
• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures

[GPU architecture intuition slides borrowed from a presentation by Kayvon Fatahalian]

Idea #3: Feed the cores with data
The processing elements are data hungry! Wide, high-throughput memory bus.

Idea #4: Hide memory access latency
10,000x parallelism! Hardware-supported multithreading.

The Resulting GPU Architecture (NVIDIA Tesla C2050, 448 cores)
[Diagram: host machine and GPU connected over PCIe; the GPU holds N multiprocessors, each with M cores, registers, an instruction unit, and shared memory; constant, texture, and global memory are shared across multiprocessors.]
Four 'memories':
• Shared: fast (4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
• Texture: read only
• Constant: read only
Host-device link: PCIe x16, 4GB/s

GPUs offer different characteristics
• High peak compute power
• High peak memory bandwidth
• High host-device communication overhead
• Limited memory space
• Complex to program
(A minimal CUDA sketch of this execution model follows.)
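To make the execution model above concrete, here is a minimal CUDA sketch (not from the talk): launch many more threads than cores so hardware multithreading can hide global-memory latency (Idea #4), keep neighboring threads on neighboring addresses so loads coalesce on the wide memory bus (Idea #3), and move data across the PCIe link explicitly. The kernel and array size are illustrative placeholders.

// Minimal CUDA sketch of the execution model: massive oversubscription plus
// coalesced global-memory access. Sizes and the kernel are illustrative only.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        y[i] = a * x[i] + y[i];                      // adjacent threads -> adjacent addresses (coalesced)
}

int main() {
    const int n = 1 << 24;                           // ~16M elements: "10,000x parallelism"
    size_t bytes = n * sizeof(float);
    float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc((void**)&dx, bytes);
    cudaMalloc((void**)&dy, bytes);
    // Host <-> device copies cross the PCIe link (4 GB/s on the slide),
    // which is why minimizing transfers matters later in the talk.
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // tens of thousands of blocks keep the SMs busy
    saxpy<<<blocks, threads>>>(n, 3.0f, dx, dy);
    cudaDeviceSynchronize();

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 5.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}

Compiled with nvcc, this runs one tiny, memory-bound operation per thread; the point is only that the hardware hides the 400-600-cycle global-memory latency by switching among the huge number of resident threads.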
Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)
Porting applications to efficiently exploit GPU characteristics:
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
Middleware runtime support to simplify application development:
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
GPU-optimized building blocks: data structures and libraries:
• GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

Motivating question: How should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
• A string matching problem
• Data intensive (~10^2 GB)
[The rest of the talk follows: Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10]

Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• ~4x speedup (end-to-end) compared to the CPU version
• but more than 50% of the time is overhead [chart: execution-time breakdown (%)]
Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics

Idea: trade off time for space
Use a space-efficient data structure (though from a higher computational complexity class): the suffix array.
Result: 4x speedup compared to the suffix tree-based GPU implementation.
Consequences:
• Significant overhead reduction
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus shifts towards optimizing the compute stage

Outline for the rest of this talk
• Sequence alignment: background and offloading to the GPU
• Space/time trade-off analysis
• Evaluation

Background: the sequence alignment problem
[Figure: short query reads such as CCAT, GGCT..., TAGGC... aligned against a long reference ...CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG...]
Problem: find where each query most likely originated from.
• Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
• Reference: 10^6 to 10^11 symbols (up to ~400GB)

GPU offloading: opportunity and challenges
Opportunity: sequence alignment is easy to partition, memory intensive, and massively parallel; the GPU offers massive parallelism and high memory bandwidth.
Challenges: the problem is data intensive with a large output size, while the GPU has limited memory space and no direct access to other I/O devices (e.g., disk).

GPU offloading: addressing the challenges
• Data-intensive problem and limited memory space → divide and compute in rounds; use search-optimized data structures
• Large output size → compressed output representation (decompressed on the CPU)
High-level algorithm (executed on the host):
  subrefs = DivideRef(ref)
  subqrysets = DivideQrys(qrys)
  foreach subqryset in subqrysets {
      results = NULL
      CopyToGPU(subqryset)
      foreach subref in subrefs {
          CopyToGPU(subref)
          MatchKernel(subqryset, subref)
          CopyFromGPU(results)
      }
      Decompress(results)
  }

Space/Time Trade-off Analysis

The core data structure
A massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))
[Figure: suffix tree of the example string TACACA$]

With the suffix tree, the stages of the host loop above behave as follows: the matching kernel itself is efficient, but copying the large index to the GPU (CopyToGPU(subref), forcing many rounds) and the post-processing (Decompress(results)) are expensive.
(A hedged CUDA sketch of this round-based host loop follows.)
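To ground the high-level host algorithm above, the following is a hedged CUDA sketch of the same round-based control flow: split the reference (and, in a full run, the query set) into chunks that fit device memory, stream them through a match kernel, and collect results per round. Everything here, the Divide helper, the data layout, and the stubbed MatchKernel that merely counts first-character hits, is an illustrative placeholder; this is not the MUMmerGPU or SC'10 implementation, only the control structure the slide describes.

// Hedged sketch of the divide-and-compute-in-rounds host loop (placeholders throughout).
#include <cstdio>
#include <string>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel: thread q scans one sub-reference chunk for query q and
// counts positions matching its first character (a real kernel would walk a
// search-optimized index such as a suffix tree or suffix array instead).
__global__ void MatchKernel(const char *subref, int ref_len,
                            const char *queries, int qry_len, int n_queries,
                            int *results) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= n_queries) return;
    char first = queries[q * qry_len];
    int hits = 0;
    for (int p = 0; p < ref_len; ++p)
        if (subref[p] == first) ++hits;
    results[q] += hits;                       // accumulate across reference chunks
}

// Split a long string into chunks no larger than `budget` bytes (DivideRef/DivideQrys stand-in).
static std::vector<std::string> Divide(const std::string &s, size_t budget) {
    std::vector<std::string> chunks;
    for (size_t off = 0; off < s.size(); off += budget)
        chunks.push_back(s.substr(off, budget));
    return chunks;
}

int main() {
    const int QRY_LEN = 4;
    std::string ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG";   // toy reference
    std::vector<std::string> queries = {"CCAT", "GGCT", "TAGG", "GCGG"};

    std::vector<std::string> subrefs = Divide(ref, 16);          // DivideRef
    // (A real run would also divide the query set; here one batch suffices.)

    int n = (int)queries.size();
    std::string qbuf;
    for (auto &q : queries) qbuf += q;                           // flatten queries

    char *d_qrys, *d_ref; int *d_res;
    cudaMalloc((void**)&d_qrys, qbuf.size());
    cudaMalloc((void**)&d_ref, ref.size());
    cudaMalloc((void**)&d_res, n * sizeof(int));
    cudaMemcpy(d_qrys, qbuf.data(), qbuf.size(), cudaMemcpyHostToDevice);   // CopyToGPU(subqryset)
    cudaMemset(d_res, 0, n * sizeof(int));

    for (auto &subref : subrefs) {                               // compute in rounds
        cudaMemcpy(d_ref, subref.data(), subref.size(), cudaMemcpyHostToDevice);  // CopyToGPU(subref)
        MatchKernel<<<(n + 255) / 256, 256>>>(d_ref, (int)subref.size(),
                                              d_qrys, QRY_LEN, n, d_res);
    }
    std::vector<int> results(n);
    cudaMemcpy(results.data(), d_res, n * sizeof(int), cudaMemcpyDeviceToHost);   // CopyFromGPU(results)
    // Decompress(results) would run here on the CPU in the real pipeline.

    for (int i = 0; i < n; ++i)
        printf("%s: %d first-character hits\n", queries[i].c_str(), results[i]);
    cudaFree(d_qrys); cudaFree(d_ref); cudaFree(d_res);
    return 0;
}

The structure makes the talk's point visible: every extra round costs one full CopyToGPU(subref) over PCIe, so a smaller index means longer sub-references, fewer rounds, and less transfer overhead.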
A better matching data structure?
[Figures: suffix tree and suffix array of the example string TACACA$; the suffix array stores the starting positions of the lexicographically sorted suffixes A$, ACA$, ACACA$, CA$, CACA$, TACACA$.]

                   Suffix Tree                        Suffix Array
Space              O(ref_len), ~20 x ref_len          O(ref_len), ~4 x ref_len
Search (compute)   O(qry_len)                         O(qry_len x log ref_len)
Post-processing    O(4^(qry_len - min_match_len))     O(qry_len - min_match_len)

• Impact 1: Reduced communication (less data to transfer)
• Impact 2: Better data locality, achieved at the cost of additional per-thread processing time (space for longer sub-references => fewer processing rounds)
• Impact 3: Lower post-processing overhead
(An illustrative sketch of array-based search on the GPU appears at the end of this document.)

Evaluation

Evaluation setup
Testbed:
• Low-end: GeForce 9800 GX2 GPU (512MB)
• High-end: Tesla C1060 (4GB)
Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
Success metrics: performance, energy consumption
Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species            Reference sequence length   # of queries   Average read length
HS1 - Human (chromosome 2)    ~238M                       ~78M           ~200
HS2 - Human (chromosome 3)    ~100M                       ~2M            ~700
MONO - L. monocytogenes       ~3M                         ~6M            ~120
SUIS - S. suis                ~2M                         ~26M           ~36

Speedup: array-based over tree-based
[Chart: speedup per workload]

Dissecting the overheads
Significant reduction in data transfers and post-processing.
Workload: HS1, ~78M queries, ~238M reference length, on the GeForce.
[Chart: per-stage time breakdown]

Comparing with CPU performance [baseline: single-core performance]
[Chart comparing suffix tree (CPU and GPU) and suffix array (GPU) implementations]

Summary
• GPUs have drastically different performance characteristics
• Reconsidering the choice of data structure is necessary when porting applications to the GPU
• A good matching data structure ensures:
  – Low communication overhead
  – Data locality (which might be achieved at the cost of additional per-thread processing time)
  – Low post-processing overhead

Code, benchmarks and papers available at: netsyslab.ece.ubc.ca
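As a closing illustration of the data-structure choice argued above, here is a hedged CUDA sketch of array-based matching: the suffix array is built naively on the host, and each GPU thread answers one query with a binary search over it, the O(qry_len x log ref_len) search listed in the comparison table. The toy reference TACACA$ and the three queries are illustrative only; index construction, maximal-match semantics, and output compression are simplified away, and this is not the authors' implementation.

// Hedged sketch: per-thread binary search over a suffix array on the GPU.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>
#include <cuda_runtime.h>

// Compare the query against the suffix starting at `pos`, over at most qry_len
// characters. Returns <0, 0, >0 like strcmp; 0 means the query prefixes the suffix.
__device__ int prefix_cmp(const char *ref, int ref_len, int pos,
                          const char *qry, int qry_len) {
    for (int i = 0; i < qry_len; ++i) {
        char r = (pos + i < ref_len) ? ref[pos + i] : '\0';   // past-the-end sorts low
        if (qry[i] != r) return (qry[i] < r) ? -1 : 1;
    }
    return 0;
}

// One query per thread: binary search the suffix array for a suffix that the
// query prefixes, and record its starting position (or -1 if absent).
__global__ void SearchKernel(const char *ref, int ref_len, const int *sa,
                             const char *qrys, int qry_len, int n_queries,
                             int *match_pos) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= n_queries) return;
    const char *qry = qrys + q * qry_len;
    int lo = 0, hi = ref_len - 1, found = -1;
    while (lo <= hi) {                                        // O(log ref_len) probes
        int mid = (lo + hi) / 2;
        int c = prefix_cmp(ref, ref_len, sa[mid], qry, qry_len);   // O(qry_len) each
        if (c == 0) { found = sa[mid]; break; }
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    match_pos[q] = found;
}

int main() {
    std::string ref = "TACACA$";                    // toy reference from the slides
    int n = (int)ref.size();
    std::vector<int> sa(n);
    for (int i = 0; i < n; ++i) sa[i] = i;
    // Naive O(n^2 log n) construction: fine for a sketch, not for a 400GB reference.
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
    });

    const int QRY_LEN = 3;
    std::string qrys = "ACATCAGGG";                 // queries: ACA, TCA, GGG
    int nq = (int)qrys.size() / QRY_LEN;

    char *d_ref, *d_qrys; int *d_sa, *d_pos;
    cudaMalloc((void**)&d_ref, n);
    cudaMalloc((void**)&d_sa, n * sizeof(int));
    cudaMalloc((void**)&d_qrys, qrys.size());
    cudaMalloc((void**)&d_pos, nq * sizeof(int));
    cudaMemcpy(d_ref, ref.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_sa, sa.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_qrys, qrys.data(), qrys.size(), cudaMemcpyHostToDevice);

    SearchKernel<<<1, 32>>>(d_ref, n, d_sa, d_qrys, QRY_LEN, nq, d_pos);
    std::vector<int> pos(nq);
    cudaMemcpy(pos.data(), d_pos, nq * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nq; ++i)
        printf("query %.*s -> position %d\n", QRY_LEN, qrys.data() + i * QRY_LEN, pos[i]);
    cudaFree(d_ref); cudaFree(d_sa); cudaFree(d_qrys); cudaFree(d_pos);
    return 0;
}

The design point the sketch tries to show: the array is a flat, roughly 4 x ref_len index with no pointer chasing, so more of the reference fits per round and the extra log(ref_len) factor is paid as per-thread compute, which is exactly the trade the talk argues the GPU can afford.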