Redefining the Role of the CPU in the Era of CPU-GPU Integration
Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden and Dean Tullsen
Computer Science and Engineering, UC San Diego
IEEE Micro, Nov-Dec 2012
Presented at AMD Research, August 20, 2012

Overview
- Motivation
- Benchmarks and Methodology
- Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
- Impact on CPU Design

Historical Progression
[Diagram: general-purpose applications drove multicore CPUs; throughput applications drove energy-efficient GPUs and GPGPU; chip integration (the APU) brings performance/energy/... gains]
- Focus of improvements for the next-gen APU: improved memory systems, improved GPGPU scaling, easier programming, ... and what about the CPU architecture?

The CPU-GPU Era
- AMD APU products and their components:
  - Llano (2011): Husky (K10) CPU + NI GPU; CPU-only parts with the same cores: Phenom/Athlon II (consumer), Barcelona... (server)
  - Trinity (2012): Piledriver CPU + SI GPU; CPU-only parts: Vishera (consumer), Delhi/Abu Dhabi (server)
  - Kaveri (2013): Steamroller CPU + Sea Islands GPU
- APUs have essentially the same CPU cores as the CPU-only parts

Example CPU-GPU Benchmark
- KMeans (implementation from Rodinia):
  - Randomly pick centers
  - Find the closest center for each point -> runs on the GPU (easy data parallelism over each point)
  - Find new centers -> runs on the CPU (few centers, with a possibly different number of points per center)

Properties of KMeans (CPU only -> with GPU)
- Time fraction running kernel code: ~50% -> ~16% (kernel speedup 5x)
- Time spent on the CPU: 100% -> ~84%
- Perfect instruction-level parallelism (window size 128): 7.0 -> 4.8
- "Hard" branches: 2.3% -> 4.6%
- "Hard" loads: 36.2% -> 64.5%
- Application speedup on an 8-core CPU: 1.5x -> 1.0x
- Takeaway: CPU performance remains critical, and adding the GPU drastically changes the properties of the CPU code
- Aim: understand and evaluate this "new" CPU workload

The Need to Rethink CPU Design
- APUs are a prime example of heterogeneous systems
- Heterogeneity means composing cores that each run a subset of the workload well
- The CPU need not be fully general-purpose; it is sufficient to optimize it for the non-GPU code
- Goal: investigate the non-GPU code and use it to guide CPU design

Benchmarks
[Diagram: applications range from CPU-only serial and parallel apps (CPU-Heavy), through partitioned apps (Mixed), to GPU-Heavy apps, spanning the CPU-GPU spectrum]

Benchmarks
- CPU-Heavy (11 apps): important computing apps with no evidence of GPU ports
  - SPEC: Parser, Bzip, Gobmk, MCF, Sjeng, GemsFDTD [serial]
  - Parsec: Povray, Tonto, Facesim, Freqmine, Canneal [parallel]
- Mixed and GPU-Heavy (11 + 11 apps)
  - Rodinia (7 apps)
  - SPEC/Parsec applications mapped to GPUs (15 apps)

Mixed Benchmark Suite (GPU kernels, kernel speedup)
- Kmeans (Rodinia): 2 kernels, 5.0x
- H264 (SPEC): 2 kernels, 12.1x
- SRAD (Rodinia): 2 kernels, 15.0x
- Sphinx3 (SPEC): 1 kernel, 17.7x
- Particlefilter (Rodinia): 2 kernels, 32.0x
- Blackscholes (Parsec): 1 kernel, 13.7x
- Swim (SPEC): 3 kernels, 25.3x
- Milc (SPEC): 18 kernels, 6.0x
- Hmmer (SPEC): 1 kernel, 19.0x
- LUD (Rodinia): 1 kernel, 13.5x
- Streamcluster (Parsec): 1 kernel, 26.0x

GPU-Heavy Benchmark Suite (GPU kernels, kernel speedup)
- Bwaves (SPEC): 1 kernel, 18.0x
- Equake (SPEC): 1 kernel, 5.3x
- Libquantum (SPEC): 3 kernels, 28.1x
- Ammp (SPEC): 2 kernels, 6.8x
- CFD (Rodinia): 5 kernels, 5.5x
- Mgrid (SPEC): 4 kernels, 34.3x
- LBM (SPEC): 1 kernel, 31.0x
- Leukocyte (Rodinia): 3 kernels, 70.0x
- Art (SPEC): 3 kernels, 6.8x
- Heartwall (Rodinia): 6 kernels, 7.9x
- Fluidanimate (Parsec): 6 kernels, 3.9x

Methodology
- We are interested in the non-GPU portions of CPU-GPU code
- Ideal scenario: port every application to the GPU and use hardware counters; this requires man-hours and domain expertise, and yields platform- and architecture-dependent code
- Instead, CPU-GPU partitioning is based on expert information:
  - Publicly available source code (Rodinia)
  - Details of GPU portions from publications and our own implementations (SPEC/Parsec)

Methodology
- Microarchitectural simulations: GPU portions are marked in the application code, and the marked applications are run through Pin-based microarchitectural simulators (ILP, branches, loads and stores); a sketch of such region marking follows below
- Machine measurements: using the marked code (CPU criticality), and using parallel CPU source code when available (TLP studies)
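The slides do not show the marking mechanism itself. The following is a minimal sketch under assumed details: empty marker functions (gpu_region_begin/gpu_region_end are names invented here, not the authors' tooling) that a binary-instrumentation tool such as Pin could locate by symbol name to attribute dynamic instructions to GPU-mapped versus remaining CPU code.

// Illustrative sketch only; the marker names and the toy KMeans step are
// assumptions, not the study's actual instrumentation. Uses GCC/Clang
// __attribute__((noinline)) so the markers survive optimization.
#include <cstdio>
#include <vector>

extern "C" {
// Empty, non-inlined markers: easy for an instrumentation tool to find by name.
__attribute__((noinline)) void gpu_region_begin() { asm volatile(""); }
__attribute__((noinline)) void gpu_region_end()   { asm volatile(""); }
}

// Stand-in for the KMeans "find the closest center for each point" step,
// the data-parallel work that the slides map to the GPU.
static void assign_points(const std::vector<float>& pts,
                          const std::vector<float>& centers,
                          std::vector<int>& label) {
  for (std::size_t i = 0; i < pts.size(); ++i) {
    int best = 0;
    float best_d = 1e30f;
    for (std::size_t c = 0; c < centers.size(); ++c) {
      const float d = (pts[i] - centers[c]) * (pts[i] - centers[c]);
      if (d < best_d) { best_d = d; best = static_cast<int>(c); }
    }
    label[i] = best;
  }
}

int main() {
  const std::vector<float> pts = {0.1f, 0.9f, 0.45f, 0.7f};
  const std::vector<float> centers = {0.0f, 1.0f};
  std::vector<int> label(pts.size());

  gpu_region_begin();                  // instructions from here ...
  assign_points(pts, centers, label);  // ... are attributed to GPU-mapped code
  gpu_region_end();                    // ... until here

  // The "find new centers" step stays unmarked: it is part of the remaining
  // CPU workload that the rest of the talk characterizes.
  for (std::size_t i = 0; i < label.size(); ++i)
    std::printf("point %zu -> center %d\n", i, label[i]);
  return 0;
}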
CPU Criticality

CPU Time
[Chart: proportion of total application time (%) spent on the CPU, shown as CPU-only non-kernel time, with reported kernel speedups, and with conservative kernel speedups, for the Mixed and GPU-Heavy suites]
- Mixed: even though 80% of the code is mapped to the GPU, the CPU is still the bottleneck; more time is spent on the CPU than on the GPU
- The CPU executes 7-14% of the time even for GPU-Heavy apps
- Averages in the rest of the talk are weighted by the conservative CPU time

Instruction Level Parallelism
- Measures the inherent parallelism of the instruction stream
- ILP is measured assuming perfect memory and perfect branch prediction

Instruction Level Parallelism: CPU-Heavy
[Chart: parallel instructions within the instruction window; CPU-Heavy averages 9.6 for a 128-entry window and 12.7 for a 512-entry window]

Instruction Level Parallelism: Mixed and GPU-Heavy
[Chart: parallel instructions within the instruction window, CPU only vs. with GPU, window sizes 128 and 512; annotated overall values: 9.9 -> 9.5 (128) and 13.7 -> 12.2 (512); 10.3 -> 9.2 (128) and 15.3 -> 11.1 (512)]

Instruction Level Parallelism
- ILP dropped in 17 of 22 applications: in the common case by 4% for a 128-entry window and 10.9% for a 512-entry window, and by half for 5 applications
- For Mixed apps, ILP dropped by as much as 27.5%
- Cause: independent loops are mapped to the GPU, leaving less regular, dependence-heavy code on the CPU
- Occasionally the long dependent chains are the ones that move to the GPU, e.g., Blackscholes (5 of the 22 apps are such outliers)
- Potential gains from larger instruction windows will be degraded

Branches
- Branches are categorized into four classes (a sketch of the analogous load/store classification appears after the distribution results below):
  - Biased: > 95% in the same direction
  - Patterned: > 95% accuracy on a very large local predictor
  - Correlated: > 95% accuracy on a very large gshare predictor
  - Hard: the remainder

Branch Distribution: CPU-Heavy
[Chart: percentage of dynamic branches; biased 55.2%, patterned 13.1%, correlated 7.0%, hard 24.7%]

Branch Distribution: Mixed and GPU-Heavy
[Chart: percentage of dynamic branches by class, CPU only vs. with GPU; labeled hard-branch fractions: Mixed 5.1% (CPU) -> 11.3% (+GPU), GPU-Heavy 9.4% (CPU) -> 18.6% (+GPU); callouts: effect of CPU-Heavy apps; effect of data-dependent branches in GPU-Heavy apps]
- Overall: branch predictors tuned for generic CPU execution may not be sufficient

Loads and Stores
- Loads and stores are categorized into four classes:
  - Static: > 95% to the same address
  - Strided: > 95% accuracy on a very large stride predictor
  - Patterned: > 95% accuracy on a very large Markov predictor
  - Hard: the remainder

Distribution of Loads: CPU-Heavy
[Chart: percentage of non-trivial loads; hard 77.5%, strided 16.6%, patterned 5.9%]

Distribution of Stores: CPU-Heavy
[Chart: percentage of non-trivial stores; hard 71.7%, strided 18.1%, patterned 10.2%]

Distribution of Loads: Mixed and GPU-Heavy
[Chart: percentage of non-trivial loads by class, CPU only vs. with GPU; labeled values: Mixed 44.4% (CPU) and 61.6% (+GPU), GPU-Heavy 47.3% (CPU) and 27.0% (+GPU); callout attributes the GPU-Heavy behavior to kernels with irregular accesses moving to the GPU]
- Overall: stride or next-line predictors will struggle

Distribution of Stores: Mixed and GPU-Heavy
[Chart: percentage of non-trivial stores by class, CPU only vs. with GPU; labeled values: Mixed 38.6% (CPU) and 51.3% (+GPU), GPU-Heavy 48.6% (CPU) and 34.9% (+GPU)]
- Overall: slightly less pronounced, but similar results as for loads
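As referenced above, here is a minimal sketch of the load/store classification. The slides give only the categories and the 95% thresholds, so the "very large" stride and Markov predictors are approximated by simple idealized per-instruction predictors; the classify function and the demo address streams are invented for illustration, not taken from the authors' tool.

// Classify one static instruction's address trace as Static, Strided,
// Patterned, or Hard using the > 95% thresholds from the slide above.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

enum class MemClass { Static, Strided, Patterned, Hard };

MemClass classify(const std::vector<uint64_t>& addrs) {
  const std::size_t n = addrs.size();
  if (n < 2) return MemClass::Static;

  // Static: > 95% of accesses touch a single address.
  std::unordered_map<uint64_t, std::size_t> counts;
  std::size_t mode = 0;
  for (uint64_t a : addrs) mode = std::max(mode, ++counts[a]);
  if (mode > 0.95 * n) return MemClass::Static;

  std::size_t stride_hits = 0, markov_hits = 0;
  int64_t last_stride = 0;
  std::unordered_map<uint64_t, uint64_t> next_of;  // Markov: addr -> next addr
  for (std::size_t i = 1; i < n; ++i) {
    const uint64_t prev = addrs[i - 1], cur = addrs[i];
    if (prev + static_cast<uint64_t>(last_stride) == cur) ++stride_hits;
    last_stride = static_cast<int64_t>(cur) - static_cast<int64_t>(prev);
    auto it = next_of.find(prev);
    if (it != next_of.end() && it->second == cur) ++markov_hits;
    next_of[prev] = cur;
  }
  if (stride_hits > 0.95 * (n - 1)) return MemClass::Strided;
  if (markov_hits > 0.95 * (n - 1)) return MemClass::Patterned;
  return MemClass::Hard;
}

int main() {
  std::vector<uint64_t> strided, pointer_chase;
  for (uint64_t i = 0; i < 1000; ++i) strided.push_back(0x1000 + 16 * i);
  uint64_t a = 0x2000;
  for (int i = 0; i < 1000; ++i) {  // pseudo-random walk: no fixed stride,
    a = a * 6364136223846793005ULL + 1442695040888963407ULL;  // no pattern
    pointer_chase.push_back(a);
  }
  const char* names[] = {"Static", "Strided", "Patterned", "Hard"};
  std::printf("strided stream  -> %s\n", names[static_cast<int>(classify(strided))]);
  std::printf("pointer chasing -> %s\n", names[static_cast<int>(classify(pointer_chase))]);
  return 0;
}

Under this scheme a regular array sweep lands in Strided, while a pointer-chasing-like stream falls through to Hard, mirroring the shift the charts above show once regular kernels move to the GPU.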
Vector Instructions: CPU-Heavy
[Chart: percentage of dynamic instructions that are SSE instructions; CPU-Heavy averages 7.3%]

Vector Instructions: Mixed and GPU-Heavy
[Chart: fraction of dynamic instructions that are SSE instructions, CPU only vs. with GPU, for the Mixed and GPU-Heavy suites; labeled values: 16.9%, 15.0%, 9.6%, 8.5%]
- Vector ISA enhancements target the same regions of code as the GPU

Thread Level Parallelism: CPU-Heavy
[Chart: speedup on 8 cores and 32 cores for the CPU-Heavy suite]

Thread Level Parallelism: Mixed and GPU-Heavy
[Chart: speedup on 8 and 32 cores, CPU only vs. with GPU; labeled values include 14.0x and 2.1x]
- The abundant parallelism in GPU-Heavy apps disappears: no gain going from 8 cores to 32 cores
- Mixed: gains drop from 4x to 1.4x
- Overall: only a 10% gain going from 8 cores to 32 cores; 32-core TLP dropped 60%, from 5.5x to 2.2x

CPU Design in the post-GPU Era
- Only modest gains from increasing window sizes
- Considerably increased pressure on the branch predictor, in spite of fewer static branches; adopt techniques that target fewer, more difficult branches (e.g., L-TAGE, Seznec 2007)
- Memory accesses will continue to be a major bottleneck; stride or next-line prefetching becomes significantly less relevant, while a large body of literature was never adopted in real machines (e.g., helper-thread prefetching or mechanisms that target pointer chains)
- SSE is rendered significantly less important; every core need not have it, and cores could share SSE hardware
- Extra CPU cores/threads are not of much use because of the lack of TLP

CPU Design in the post-GPU Era
(1) A clear case for big cores (with a focus on loads, stores, and branches rather than ILP) plus GPUs
(2) A need to start adopting proposals for few-thread performance
(3) Start by revisiting old techniques from a current perspective

Backup

On Using Unmodified Source Code
- The most common memory-layout change when porting to a GPU is AoS -> SoA, which is still just a change in stride value; AoS accesses are well captured by stride/Markov predictors (a small illustration follows below)
- CPU-only code has even better locality, also well captured by stride/Markov predictors, but those locality-enhanced accesses are the ones that map to the GPU
- Minimal impact on the CPU code that remains alongside the GPU: its accesses are still irregular
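A small illustration of the AoS-to-SoA point above (not from the deck; the struct and field names are made up): sweeping one field is a constant-stride access stream in either layout, so a stride predictor or prefetcher captures both, and the layout change only alters the stride value.

// AoS vs. SoA field sweep: both walk memory with a fixed stride.
#include <cstdio>

struct ParticleAoS { float x, y, z, w; };   // array of structures (AoS)

struct ParticlesSoA {                       // structure of arrays (SoA)
  float x[1024], y[1024], z[1024], w[1024];
};

static ParticleAoS aos[1024];
static ParticlesSoA soa;

int main() {
  // Summing the x field:
  //   AoS: consecutive x's are sizeof(ParticleAoS) = 16 bytes apart
  //   SoA: consecutive x's are sizeof(float)       =  4 bytes apart
  float sum_aos = 0.0f, sum_soa = 0.0f;
  for (int i = 0; i < 1024; ++i) sum_aos += aos[i].x;   // stride 16 B
  for (int i = 0; i < 1024; ++i) sum_soa += soa.x[i];   // stride  4 B

  std::printf("AoS x-sweep stride: %zu bytes\n", sizeof(ParticleAoS));
  std::printf("SoA x-sweep stride: %zu bytes\n", sizeof(float));
  std::printf("sums: %f %f\n", sum_aos, sum_soa);
  return 0;
}

The backup slide's point is that such regular, stride-friendly sweeps are exactly the accesses that move to the GPU, so using unmodified (AoS) source code for the analysis does not change the conclusion: the code left on the CPU is dominated by irregular accesses either way.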