Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc 1/26

| Motivation | GPUPerf: Performance Analysis Framework (Performance Advisor, Analytical Model, Frontend Data Collector) | Evaluations | Conclusion 2/26

| GPGPU architectures have become very powerful. | Programmers want to convert CPU applications to GPGPU applications. | Case 1: 10x speed-up (CPU version → GPGPU version). Case 2: 1.1x speed-up (CPU version → GPGPU version). | For case 1, programmers might wonder whether 10x is the best achievable speed-up. | For case 2, programmers might wonder why the benefit is so poor: maybe the algorithm is not parallelizable, or maybe the GPGPU code is not well optimized. Programmers want to optimize code whenever possible! 3/26

| Optimizing parallel programs is difficult! [Chart: normalized performance of one kernel under single optimizations (Baseline, Shared Memory, SFU, Tight, UJAM); Shared Memory is best for this kernel, and it remains the best when combined with another optimization (Shared Memory + SFU, Shared Memory + Tight, Shared Memory + UJAM).] | Most programmers apply optimization techniques one by one. | Programmers want to understand the performance benefit! | Say we try one more optimization on top of Shared Memory: which one should we choose? 4/26

| Providing performance guidance is not easy. Program analysis: obtain as much program information as possible. Performance modeling: have a sophisticated analytical model. User-friendly metrics: convert the performance analysis information into performance guidance. | We propose GPUPerf, a performance analysis framework that quantitatively predicts potential performance benefits. | In this talk, we focus mostly on performance modeling and the potential benefit metrics. 5/26

| Motivation | GPUPerf: Performance Analysis Framework (Performance Advisor, Analytical Model, Frontend Data Collector) | Evaluations | Conclusion 6/26

| What is required for performance guidance?
Program analysis → Frontend Data Collector; performance modeling → Analytical Model; user-friendly metrics → Performance Advisor. A GPGPU kernel is processed by the Frontend Data Collector (ILP, #insts, ...), whose output feeds the Analytical Model, whose output feeds the Performance Advisor (benefit metrics); together these form GPUPerf. For clarity, each component is explained in reverse order. 7/26

| Goal of the performance advisor: convey performance bottleneck information, and estimate the potential gains from reducing those bottlenecks. | The performance advisor provides four potential benefit metrics, computed by our analytical model: Bitilp (benefit of increasing ITILP), Bmemlp (benefit of increasing MLP), Bserial (benefit of removing serialization effects), and Bfp (benefit of improving computing inefficiency). | From these, programmers can get an idea of the potential benefit of a GPGPU kernel. 8/26

| MWP (Memory Warp Parallelism): an indicator of memory-level parallelism, i.e., how many warps can have memory requests in flight at once (example: MWP=4 with 8 warps). | CWP (Compute Warp Parallelism): how many warps can complete their computation during one memory waiting period (example: CWP=3). | Depending on MWP and CWP, the execution time is predicted by the MWP-CWP model [Hong and Kim, ISCA'09]. | The MWP-CWP model can predict general cases. | Problem: it did not model corner cases, which is critical for predicting the benefits of different program optimizations! 9/26
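As a rough illustration only (the slide does not give these formulas; they follow the MWP-CWP model cited above [Hong and Kim, ISCA'09], and `mem_lat`, `departure_delay`, and the cycle counts are assumed inputs), MWP and CWP could be estimated along these lines:

```python
def mwp(mem_lat, departure_delay, n_warps):
    # Memory Warp Parallelism: how many warps can overlap their
    # memory requests, capped by the number of active warps.
    return min(mem_lat / departure_delay, n_warps)

def cwp(comp_cycles, mem_cycles, n_warps):
    # Compute Warp Parallelism: warps whose computation fits into one
    # memory waiting period (counting the waiting warp itself).
    return min((mem_cycles + comp_cycles) / comp_cycles, n_warps)

# Toy numbers reproducing the slide's example of MWP=4, CWP=3 with 8 warps.
print(mwp(400, 100, 8), cwp(100, 200, 8))
```

The cap at `n_warps` matters: with few warps resident, neither memory nor compute parallelism can exceed the number of warps actually scheduled.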
| Our analytical model follows a top-down approach: the model components are easy to interpret and relate directly to performance bottlenecks. Texec = Tcomp + Tmem − Toverlap, where Tcomp is the computation time, Tmem the memory time, Toverlap the overlapped time, and Texec the final execution time. [Timeline example: 4 warps, MWP=2.] 10/26

| Tcomp is the amount of time to execute compute instructions: Tcomp = Wparallel + Wserial, where Wparallel is the work executed in parallel (useful work) and Wserial is the overhead due to serialization effects. 11/26

| Wparallel is the amount of work that can be executed in parallel: Wparallel = total insts × effective inst. throughput. | Effective inst. throughput = f(warp_size, SIMD_width, #pipeline stages) = avg_instruction_latency / ITILP. | ITILP represents the number of instructions that can be executed in parallel in the pipeline. 12/26

| ITILP is inter-thread ILP: ITILP = MIN(ILP × N, ITILPmax), where N is the TLP (number of warps) and ITILPmax = avg_inst_lat / (warp_size / SIMD_width). [Example with ILP = 4/3: TLP=1 → ITILP=4/3, TLP=2 → ITILP=8/3, TLP=3 → ITILP=ITILPmax.] [Chart: execution time (msec) vs. TLP, comparing Actual, Model (MWP-CWP), and Model (New).] As TLP increases, execution time decreases; at TLP=3 the execution latency is already all hidden. 13/26
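A minimal sketch of the ITILP and Wparallel formulas above. The `warp_size=32` and `simd_width=16` defaults and the latency value are illustrative assumptions, not values given on the slide:

```python
def itilp(ilp, tlp, avg_inst_lat, warp_size=32, simd_width=16):
    # ITILP = MIN(ILP x N, ITILPmax): inter-thread ILP grows with TLP
    # until the pipeline latency is fully hidden, then saturates.
    itilp_max = avg_inst_lat / (warp_size / simd_width)
    return min(ilp * tlp, itilp_max)

def w_parallel(n_insts, avg_inst_lat, itilp_value):
    # Effective inst. throughput = avg_inst_lat / ITILP, so more
    # inter-thread ILP means fewer cycles per issued instruction.
    return n_insts * avg_inst_lat / itilp_value
```

With the slide's ILP of 4/3, `itilp(4/3, 2, 20)` gives 8/3, matching the TLP=2 case, while a large TLP is capped at ITILPmax.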
| Wserial represents the overhead due to serialization effects: Wserial = Osync + OSFU + OCFDiv + Obank, where Osync is the synchronization overhead, OSFU the SFU contention overhead, OCFDiv the branch divergence overhead, and Obank the bank-conflict overhead. 14/26

| GPGPUs have SFUs (special function units) on which expensive operations can be executed. With a good ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden; with a high ratio of SFU instructions, it cannot. [Chart: execution time (msec) vs. number of SFU insts. per eight FMA insts., comparing Actual, Model (MWP-CWP), and Model (New); the latency of the SFU instructions is not completely hidden in the high-ratio case.] 15/26

| Tmem represents the amount of time spent on memory requests and transfers: Tmem = effective memory requests × AMAT, where the effective request count accounts for MWP. 16/26
[Timeline example: with four memory requests, MWP=2 gives Tmem = 4·MEM/2, while MWP=1 gives Tmem = 4·MEM/1.]

| Toverlap represents how much of the memory cost can be hidden by multithreading. | Case MWP ≥ CWP (example: MWP=3, CWP=3): Toverlap ≈ Tmem; all of the memory cost is overlapped with computation. 17/26

| Case CWP > MWP (example: MWP=2, CWP=4): Toverlap ≈ Tcomp; the computation cost is hidden by the memory cost. 18/26

| The time metrics are converted into potential benefit metrics. [Potential benefit chart: compute cost vs. memory cost, from a single-thread point down to the optimized kernel, with Tfp (ideal computation cost), Tmem_min (ideal memory cost), Tmem', and Tmem; the gaps give Bfp, Bitilp, Bserial, and Bmemlp.] | Benefit metrics: Bmemlp = benefit of increasing MLP; Bserial = benefit of removing serialization effects; Bitilp = benefit of increasing inter-thread ILP; Bfp = benefit of improving computing efficiency. 19/26

| Frontend Data Collector: a CUDA executable is run through the Ocelot emulator [Diamos et al., PACT'10], profiled with the Compute Visual Profiler (#insts, occupancy), and analyzed from the CUDA binary with the static analysis tools (ILP, MLP, #SFU_insts, ...).
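Putting the model components above together, a hedged end-to-end sketch (all numeric inputs are illustrative; the paper's actual equations carry more detail than this outline):

```python
def t_mem(n_mem_requests, amat, mwp):
    # Effective memory requests = total requests / MWP, since requests
    # from MWP warps overlap (cf. Tmem = 4MEM/2 when MWP=2 above).
    return (n_mem_requests / mwp) * amat

def t_comp(w_parallel, o_sync=0, o_sfu=0, o_cfdiv=0, o_bank=0):
    # Tcomp = Wparallel + Wserial, with Wserial the sum of the four
    # serialization overheads from slide 14.
    return w_parallel + (o_sync + o_sfu + o_cfdiv + o_bank)

def t_exec(tcomp, tmem, mwp, cwp):
    # MWP >= CWP: memory cost hidden behind computation (Toverlap ~ Tmem).
    # CWP >  MWP: computation hidden behind memory cost (Toverlap ~ Tcomp).
    t_overlap = tmem if mwp >= cwp else tcomp
    return tcomp + tmem - t_overlap
```

For example, `t_mem(4, 100, 2)` models the slide's four-request, MWP=2 case, and `t_exec` reduces to max(Tcomp, Tmem) in the two regimes shown on slides 17 and 18.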
The collected information is fed into our analytical model. 20/26

| Detailed information from emulating PTX executions (Ocelot): #insts, global LD/ST requests, cache info. | Architecture-related information from H/W counters (Compute Visual Profiler). | Information from the CUDA binary (CUBIN) rather than PTX, i.e., after low-level compiler optimizations (Instruction Analyzer, IA): ILP, MLP, #SFU insts, #sync insts, loop counters. 21/26

| Motivation | GPUPerf: A Performance Analysis Framework (Performance Advisor, Analytical Model, Frontend Data Collector) | Evaluations | Conclusion 22/26

| NVIDIA C2050 (Fermi architecture). | FMM (Fast Multipole Method): approximation of the n-body problem [winner, 2010 Gordon Bell Prize at Supercomputing]. Optimizations: prefetching, SFU, vector packing, loop unrolling, shared memory, and loop optimization, giving 44 optimization combinations. | Parboil benchmarks and Reduction (in the paper). [Chart: actual speedup over no optimizations for all 44 combinations of pref, rsqrt, shmem, tight, trans, ujam, and vecpack.] The vector packing + shared memory + unroll-jam + SFU combination shows the best performance. 23/26
[Chart: predicted vs. actual speedup over no optimizations for the 44 optimization combinations.] Our model follows the actual speed-up trend quite well, and it correctly pinpoints the best optimization combination for the kernel. 24/26

[Chart: normalized Bfp (computing inefficiency; higher is worse) and actual speedup for baseline, vecpack, vecpack_rsqrt, vecpack_rsqrt_shmem, vecpack_rsqrt_shmem_ujam, and vecpack_rsqrt_shmem_ujam_pref.] | A large Bfp implies that the kernel could still be improved via optimizations. | The small Bfp value for the last combination indicates that adding prefetching (pref) does not lead to further performance improvement. 25/26

| We propose GPUPerf, a performance analysis framework consisting of a frontend data collector, an analytical model, and a performance advisor. | The performance advisor provides potential benefit metrics (Bmemlp, Bserial, Bitilp, Bfp), which can guide performance tuning of GPGPU code. | The 44 optimization combinations in FMM are predicted well. | Future work: the potential benefit metrics can serve as inputs to compilers. 26/26