Rich Miler – www.datacenterknowledge.com Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems Tayler H. Hetheringtonɣ Timothy G. Rogersɣ Lisa Hsu* Mike O’Connor* Tor M. Aamodtɣ ɣUBC *AMD University of British Columbia In Proc. 2012 ACM/IEEE Int’l Symp. On Performance Analysis of Systems and Software (ISPASS) Bruno Giussani – ww.wired.com SIMD Efficiency Motivation 50% 40% 30% 20% 10% 0% Intuition Actual New types ofrequire workloads Programmer’s initial intuition into an application’s Server applications Server farms a lot of power – – Non-HPC Need for efficient, cost-effective solutions behavior – Memcached – applications – Server GPU/APUs Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 2 3 Background Memcached *Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 4 • • • • Memcached - Compatible with GPU? Irregular control flow Irregular memory access patterns Large memory requirements Highly input data dependent Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 5 Porting Memcached Simple key-value lookup Server2 • READ (GET) Hash GET requests on Memory GPU • WRITE (SET) requests on CPU Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Miss Key Comparison Hash chaining Hit Memcached Key-value Store on GPU/APU Return Hit/Miss 6 Porting Memcached - Batching Servern Server2 Server 2 Server 2 GET GET Memory GET GET GET Hash Hash Hash Hash Hash Memory Memory Memory Memory Miss Key Comparison Miss MissMiss Key Comparison Hash chaining Miss Comparison chaining Key Key Comparison HashHash chaining Key Comparison Hash chaining Hit Hit Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Hash chaining Return Return Hit Return Hit Hit/Miss Hit Hit/Miss Return Hit/Miss Hit/Miss Memcached Key-value Store on GPU/APU Return Hit/Miss 7 Porting Memcached • Main Goals – Increase request throughput – Keep request latency reasonable • Main Challenges – Irregular memory access patterns – Irregular control flow – Data transfer overheads Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 8 Methodology • Hardware – AMD Radeon HD 5870 (Discrete) – AMD Llano A8-3850 (Fusion) – AMD Zacate E-350 (Fusion) • Simulators – GPGPU-Sim v3.x – In-house GPU control flow simulator • Testing and Simulation – Traces of Wikipedia accesses Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU Porting Memcached Memory Access • One request per work item • Data accesses for GET requests are input data dependent • Data can be anywhere in memory – Poor performance on GPU? Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 9 Porting Memcached 10 Memory Divergence Percentage of Peak IPC 35% 30% 25% 20% 15% 10% 5% 0% No L1 Cache 8k 8way Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt 32k 8- 64k 8- 128k 8- 256k 8- 1M 8- 1M FA No No way way way way way Mem Mem Latency Stalls Memcached Key-value Store on GPU/APU Porting Memcached Control Flow • Recall the control flow graph Work item ID 1–2–3–4–5 3–4 1–2–5 1–5 2 3–4 • Many branch outcomes are input data dependent Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 11 Porting Memcached 12 Control Flow 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% SIMD Efficiency Breakdown 29% 62% MC (Pes) Overall 15% Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt 51% MC (Aug) MC (Act) 40% Memcached Key-value Store on GPU/APU # Active Work-items 1-4 5-8 9-12 13-16 17-20 21-24 25-28 29-32 13 Porting Memcached Data Management • Dynamic memory manager • Transfer memory regions to device • Virtual addresses different on host and device Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU Porting Memcached Data Transfer Reduction • Fusion Systems – Physical shared memory region between host and device – Zero-copy data • Discrete Systems – Possible transfer reduction techniques • Reduction in unnecessary transfers • Acyclic data transfers (Overlap comm. with comp.) • Automatic data transfer frameworks Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU 14 15 Porting Memcached Performance vs. CPU 35 Performance vs. CPU Speed up (X) Speed up (X) 30 40 3025 2020 1015 100% 80%010 60% 5 40% 20% 0 0% No Data No Data Transfers Transfers Execution Breakdown AMD Radeon HD 5870 Llano A8-3850 Data Data Transfers Transfers Data Zacate E-350 Transfer Execution AMD Radeon HD Llano A8-3850 5870 AMD Radeon HD Llano A8-3850 5870 Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Zacate E-350 Zacate E-350 Memcached Key-value Store on GPU/APU 16 Results Radeon HD 5870 Normalized Throughput (Requests/Second) Normalized Latency 7 Latency - 0.5ms 14 12 6 10 5 8 4 6 3 4 2 1 2 0 0 0 10000 20000 30000 40000 50000 60000 70000 80000 90000 Requests / Batch Normalized Latency Normalized Throughput 8 • ~8000 requests yields highest ratio of throughput to latency Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU Rich Miler – www.datacenterknowledge.com 17 Summary • Programmer intuition doesn’t always paint the whole picture • We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system • Data transfer overheads can have a large impact on overall performance • Thank you – Questions? Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU