A GPU Accelerated Storage System
Abdullah Gharaibeh
with: Samer Al-Kiswany, Sathish Gopalakrishnan, Matei Ripeanu
NetSysLab, The University of British Columbia

GPUs Radically Change the Cost Landscape
[Cost comparison chart: $600 vs. $1279 (Source: CUDA Guide)]

Harnessing GPU Power is Challenging
– more complex programming model
– limited memory space
– accelerator / co-processor model

Motivating Question
Does the 10x reduction in computation costs that GPUs offer change the way we design/implement distributed systems?
Context: distributed storage systems.

Distributed Systems: Computationally Intensive Operations
Technique                  Enabling operation
Similarity detection       Hashing
Content addressability     Hashing
Security                   Encryption/decryption
Integrity checks           Hashing
Redundancy                 Erasure coding
Load balancing             Hashing
Summary cache              Membership testing (Bloom filter)
Storage efficiency         Compression
These enabling operations are computationally intensive and can limit performance.

Distributed Storage System Architecture
[Architecture diagram: at the application layer, a client (FS API, metadata manager, access module) divides files into a stream of blocks before they reach the storage nodes. Techniques that improve performance/reliability (redundancy, integrity checks, similarity detection, security, compression) rely on enabling operations (hashing, encoding/decoding, encryption/decryption), which an offloading layer moves from the CPU to the GPU.]

Contributions
– A GPU-accelerated storage system: design and prototype implementation that integrates similarity detection and GPU support.
– End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload.

Challenges
Integration challenges:
– minimizing the integration effort
– transparency
– separation of concerns
Extracting major performance gains:
– hiding memory allocation overheads
– hiding data transfer overheads
– efficient utilization of the GPU memory units
– use of multi-GPU systems

Past Work: Hashing on GPUs
HashGPU: a library that exploits GPUs to support the specialized use of hashing in distributed storage systems; it hashes a stream of blocks.
One performance data point: HashGPU accelerates hashing by up to 5x compared to a single CPU core.
However, significant speedup is achieved only for large blocks (>16MB), so it is not suitable for efficient similarity detection.
"Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems", S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08.
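To make the hashing pattern above concrete, here is a minimal CUDA sketch of hashing a stream of storage blocks on the GPU. This is not HashGPU's API: it substitutes a toy FNV-1a hash for the cryptographic hashes a real system would use, and all names (hashBlocks, blockSize) are illustrative.

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// One thread hashes one storage block with FNV-1a. A toy stand-in for
// HashGPU: it only illustrates the "stream of blocks" processing pattern.
__global__ void hashBlocks(const uint8_t *data, size_t blockSize,
                           size_t numBlocks, uint64_t *digests) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= numBlocks) return;
    const uint8_t *block = data + i * blockSize;
    uint64_t h = 1469598103934665603ULL;          // FNV offset basis
    for (size_t j = 0; j < blockSize; j++) {
        h ^= block[j];
        h *= 1099511628211ULL;                    // FNV prime
    }
    digests[i] = h;
}

int main() {
    const size_t blockSize = 1 << 20;             // 1 MB storage blocks
    const size_t numBlocks = 64;
    uint8_t *d_data; uint64_t *d_digests;
    cudaMalloc(&d_data, blockSize * numBlocks);
    cudaMalloc(&d_digests, numBlocks * sizeof(uint64_t));
    cudaMemset(d_data, 0xAB, blockSize * numBlocks);  // stand-in file data
    hashBlocks<<<(unsigned)((numBlocks + 63) / 64), 64>>>(
        d_data, blockSize, numBlocks, d_digests);
    uint64_t digests[64];
    cudaMemcpy(digests, d_digests, sizeof(digests), cudaMemcpyDeviceToHost);
    printf("digest[0] = %016llx\n", (unsigned long long)digests[0]);
    cudaFree(d_data); cudaFree(d_digests);
    return 0;
}

One thread per data block keeps the sketch short; a real implementation would also parallelize within each block, which is part of why the allocation and transfer overheads profiled next dominate at small block sizes.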
Profiling HashGPU
At least 75% of the total time is overhead. Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

CrystalGPU
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations; it sits between HashGPU and the GPU in the offloading layer.
One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude.

CrystalGPU Opportunities and Enablers
– Opportunity: reusing GPU memory buffers. Enabler: a high-level memory manager.
– Opportunity: overlapping communication and computation. Enabler: double buffering and asynchronous kernel launch.
– Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters). Enabler: a task queue manager.

Experimental Evaluation
– CrystalGPU evaluation
– End-to-end system evaluation

CrystalGPU Evaluation
Testbed: a machine with
– CPU: Intel quad-core 2.66 GHz with PCI Express 2.0 x16 bus
– GPU: NVIDIA GeForce 9800 GX2 (dual-GPU)
Experiment space: HashGPU on top of CrystalGPU vs. the original HashGPU, with three optimizations:
– buffer reuse
– overlapping communication and computation
– exploiting the two GPUs

HashGPU Performance on top of CrystalGPU
Baseline: a single CPU core.
[Throughput chart] The gains enabled by the three optimizations can be realized!

End-to-End System Evaluation
Testbed:
– four storage nodes and one metadata server
– one client with a 9800 GX2 GPU
Three implementations:
– no similarity detection (without-SD)
– similarity detection on the CPU (4 cores @ 2.6 GHz) (SD-CPU)
– similarity detection on the GPU (9800 GX2) (SD-GPU)
Three workloads:
– a real checkpointing workload
– completely similar files: all possible gains in terms of data savings
– completely different files: only overheads, no gains
Success metrics:
– system throughput
– impact on a competing application, compute- or I/O-intensive

System Throughput (Checkpointing Workload)
[Throughput chart: 1.8x improvement] The integrated system preserves the throughput gains on a realistic workload!

System Throughput (Synthetic Workload of Similar Files)
[Throughput chart: room for a 2x improvement] Offloading to the GPU enables close-to-optimal performance!

Impact on a Competing (Compute-Intensive) Application
Workload: writing checkpoints back to back. Result: a 2x throughput improvement with only a 7% reduction in the competing application's performance.
Offloading frees CPU resources for competing applications while preserving the throughput gains!

Summary
– We present the design and implementation of a distributed storage system that integrates GPU power.
– We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications.
– We empirically demonstrate that employing the GPU enables close-to-optimal system performance.
– We shed light on the impact of GPU offloading on competing applications running on the same node.

netsyslab.ece.ubc.ca

Similarity Detection
Hashing divides File A into blocks X, Y, Z and File B into blocks W, Y, Z. Only the first block differs, so only that block needs to be written, potentially improving write throughput (see the sketch at the end of this deck).

Execution Path on GPU (Data Processing Application)
1. Preprocessing (memory allocation)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing
T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProcessing
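CrystalGPU's buffer reuse and double buffering target stages 1, 2, and 4 of this path. Below is a minimal CUDA sketch of that pattern, not CrystalGPU's actual API: two streams alternate over pre-allocated pinned and device buffers so the transfers of one chunk overlap the processing of another; the process kernel is a placeholder.

#include <cstring>
#include <cstdint>
#include <cuda_runtime.h>

// Placeholder kernel standing in for real work such as hashing.
__global__ void process(const uint8_t *in, uint8_t *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] ^ 0xFF;
}

int main() {
    const size_t chunk = 1 << 22;            // 4 MB pipeline stage
    const int numChunks = 16;
    uint8_t *h_in, *h_out, *d_in[2], *d_out[2];
    cudaStream_t s[2];
    // Pinned host memory enables asynchronous copies; device buffers are
    // allocated once and reused, amortizing the allocation overhead that
    // dominates the HashGPU profile.
    cudaMallocHost(&h_in,  chunk * numChunks);
    cudaMallocHost(&h_out, chunk * numChunks);
    memset(h_in, 0xCD, chunk * numChunks);   // stand-in file data
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&d_in[b],  chunk);
        cudaMalloc(&d_out[b], chunk);
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < numChunks; c++) {
        int b = c & 1;                       // alternate the two buffers
        // Stages 2-4, issued asynchronously: the copy of chunk c can
        // overlap the kernel of chunk c-1 running in the other stream.
        cudaMemcpyAsync(d_in[b], h_in + c * chunk, chunk,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(unsigned)((chunk + 255) / 256), 256, 0, s[b]>>>(
            d_in[b], d_out[b], chunk);
        cudaMemcpyAsync(h_out + c * chunk, d_out[b], chunk,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    for (int b = 0; b < 2; b++) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

A task queue on top of this loop could additionally dispatch alternating chunks to the two GPUs of a 9800 GX2, which is the role of CrystalGPU's third enabler.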
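Finally, the similarity-detection comparison above (File A = X, Y, Z vs. File B = W, Y, Z) reduces to comparing per-block digests on the host once the GPU has produced them. A hypothetical sketch; the helper name and digest values are invented for illustration.

#include <cstdio>
#include <cstdint>

// Host-side similarity detection over per-block digests (as produced by a
// GPU hashing pass). Only blocks whose digest differs from the stored
// version are written; the rest are referenced in place.
static size_t blocksToWrite(const uint64_t *newDigests,
                            const uint64_t *storedDigests,
                            size_t numBlocks, size_t *writeList) {
    size_t count = 0;
    for (size_t i = 0; i < numBlocks; i++)
        if (newDigests[i] != storedDigests[i])
            writeList[count++] = i;     // block changed: must be written
    return count;
}

int main() {
    // The slide's example: File A = {X, Y, Z}, File B = {W, Y, Z}.
    uint64_t fileA[] = {0x58, 0x59, 0x5A};   // stand-in digests X, Y, Z
    uint64_t fileB[] = {0x57, 0x59, 0x5A};   // stand-in digests W, Y, Z
    size_t writeList[3];
    size_t n = blocksToWrite(fileB, fileA, 3, writeList);
    printf("%zu of 3 blocks need writing (block %zu)\n", n, writeList[0]);
    return 0;
}

Only changed blocks enter the write list, which is where the potential write-throughput gains come from.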