APOGEE: Adaptive Prefetching on GPU for Energy Efficiency
Ankit Sethia (University of Michigan), Ganesh Dasika (ARM R&D Austin), Mehrzad Samadi (University of Michigan), Scott Mahlke (University of Michigan)
University of Michigan, Electrical Engineering and Computer Science

Introduction
• High throughput: ~1 TeraFLOP
• High energy efficiency
• Programmability
Can efficiency be pushed further?
• Higher performance on mobile devices
• Lower cost in supercomputers

Background
[Figure: SIMT pipeline with warp scheduler, banked register file, ALU/SFU lanes, data cache, and memory controller to global memory]
• Latency is hidden by fine-grained multithreading
• Up to 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead: scheduling, divergence

Motivation I
[Figure: normalized speedup and normalized increase in power for 2, 4, 8, 16, and 32 warps]
Too many warps decrease efficiency.

Motivation II
[Figure: normalized register-file access rate and normalized per-register accesses for 1 to 32 warps]
The hardware added to hide memory latency is underutilized.

APOGEE: Overview
[Figure: SIMT pipeline with a prefetcher added between the warp scheduler and the data cache]
• Prefetch data from memory to the cache
• Less latency to hide
• Less multithreading
• Less register context

Traditional CPU Prefetching
[Figure: 32 warps (W0-W31) each access one 32-element chunk of a 4096-element array per iteration; consecutive accesses at the load PC come from different warps]
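The access pattern on this slide can be simulated to show why a classic per-PC stride prefetcher fails here. The round-robin-with-stalls warp ordering below is a hypothetical model for illustration, not the simulator's actual scheduler:

```python
import random

random.seed(0)
NUM_WARPS, ELEMS_PER_WARP, ELEM_BYTES = 32, 32, 4

def addr(warp, iteration):
    # Each warp loads the next 128-byte chunk of the shared array per iteration.
    return (iteration * NUM_WARPS + warp) * ELEMS_PER_WARP * ELEM_BYTES

stream = []
for it in range(4):
    order = list(range(NUM_WARPS))
    random.shuffle(order)  # memory stalls make the warp order at the load PC irregular
    stream.extend(addr(w, it) for w in order)

# A CPU stride prefetcher trains on deltas between consecutive addresses at one PC.
deltas = {b - a for a, b in zip(stream, stream[1:])}
print(len(deltas) > 1)  # many distinct deltas: no single stride to train on
```

With fewer warps, the same stream revisits each warp's accesses often enough to expose a fixed offset, which is the observation the fixed-offset prefetching slide builds on.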
• A stride cannot be found: consecutive accesses at the load PC come from different warps, so the observed deltas are irregular
• The next line cannot be prefetched in time
Neither traditional prefetching technique works.

Fixed Offset Address Prefetching
[Figure: with 4 warps (W0-W3), successive accesses by the same warp are separated by a fixed offset; e.g., Warp 0 touches elements 0-31, then 128-159]
Fewer warps enable more iterations with a fixed offset (stride * numWarps).

Timeliness in Prefetching
[Figure: timeline of load, prefetch sent, and prefetch received, showing timely, slow, and early prefetches]
Per-entry state machine: 00 -> (prefetch sent to memory) -> 01 -> (prefetch received from memory) -> 10
• Increase the prefetch distance if the next load arrives in state 01 (the prefetch is still in flight, i.e., too slow)
• Decrease the prefetch distance if a correctly prefetched address misses in state 10 (the prefetch was too early and the line was evicted before use)

FOA Prefetching
[Figure: FOA prefetcher datapath: a prefetch-table entry (Load PC, Address, Offset, Confidence, Tid, Distance, PF Type, PF State) combines the miss address, the fixed offset, and the distance to form the prefetch address, which is pushed into the prefetch queue]

Graphics Memory Accesses
[Figure: breakdown of memory accesses (FOA, TIA, Texture, Others) for ES, HS, RS, IT, ST, WT, WF, and the mean]
Three major types of accesses:
• Fixed Offset Access (FOA): data accessed by adjacent threads in an array is separated by a fixed offset.
• Thread Invariant Address (TIA): the same address is accessed by all threads in a warp.
• Texture access: addresses accessed during texturing operations.

Thread Invariant Addresses
[Figure: loop timeline: in early iterations the prefetches for loads PC1 and PC2 arrive too slowly; the TIA prefetch table (Pf PC, Address, Slow bit) marks them, so that by iteration 5 their prefetch addresses are issued early enough to be timely]
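The adaptive-distance mechanism from the Timeliness in Prefetching slide can be sketched in a few lines. This is a simplified software model; the class name, fields, and exact update policy are assumptions for illustration, not the paper's hardware:

```python
# States 00/01/10 from the slide's per-entry state machine.
IDLE, IN_FLIGHT, ARRIVED = 0b00, 0b01, 0b10

class TimelinessEntry:
    def __init__(self):
        self.state = IDLE
        self.distance = 1  # prefetch distance, in fixed-offset steps

    def prefetch_sent(self):      # 00 -> 01
        self.state = IN_FLIGHT

    def prefetch_received(self):  # 01 -> 10
        self.state = ARRIVED

    def demand_load(self, hit_in_cache):
        if self.state == IN_FLIGHT:
            # Load arrived while the prefetch was still in flight: too slow.
            self.distance += 1
        elif self.state == ARRIVED and not hit_in_cache:
            # Data arrived but the load still missed: prefetched too early,
            # the line was evicted before use, so pull the distance back in.
            self.distance = max(1, self.distance - 1)
        self.state = IDLE

entry = TimelinessEntry()
entry.prefetch_sent()
entry.demand_load(hit_in_cache=False)   # slow prefetch: distance grows to 2
entry.prefetch_sent()
entry.prefetch_received()
entry.demand_load(hit_in_cache=False)   # early prefetch: distance shrinks to 1
print(entry.distance)  # 1
```

The same distance feeds the fixed-offset address computation, so a slow prefetch stream automatically reaches further ahead on the next iteration.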
Experimental Evaluation
Benchmarks
o Mesa driver, SPEC Viewperf traces, MV5 GPGPU benchmarks
Performance evaluation
o MV5 simulator: 1 GHz, in-order SIMT, 400-cycle memory latency
o Prefetcher: 32 entries
o Prefetch latency: 10 cycles
o D-cache: 64 kB, 8-way, 32 bytes per line
Power evaluation
o Dynamic power from an analytical model (Hong et al., ISCA 2010)
o Static power from published numbers and tools: FPU, SFU, caches, register file, fetch/decode/schedule

APOGEE Performance
[Figure: speedup of SIMT_32, STRIDE_4, MTA_4, and APOGEE_4 on graphics (HS, WT, WF, IT, ST, ES, RS) and GPGPU (FFT, FT, HP, MS, SP) benchmarks, with geometric mean]
On average, APOGEE improves speedup by 19%.

APOGEE Performance II
[Figure: speedup of SIMT_32, MTA_32, and APOGEE_4 on the same benchmarks]
MTA with 32 warps is within 3% of APOGEE.

Performance and Power
[Figure: normalized power vs. normalized performance for SIMT, MTA, and APOGEE with 1 to 32 warps]
• 20% speedup over SIMT, with 14K fewer registers
• Around 51% perf/Watt improvement
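As a sanity check on the two bullets above: if perf/Watt is performance divided by power, the reported 20% speedup and 51% perf/Watt gain imply that APOGEE draws roughly 21% less power than SIMT. This back-of-envelope derivation is an inference from the slide's numbers, not stated there:

```python
speedup = 1.20          # APOGEE_4 vs. SIMT_32, from the slide
perf_per_watt_gain = 1.51

# perf/Watt ratio = (perf ratio) / (power ratio), so:
power_ratio = speedup / perf_per_watt_gain
print(round(power_ratio, 2))  # 0.79, i.e. about 21% less power than SIMT
```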
Prefetcher Accuracy
[Figure: percentage of correct predictions per benchmark, graphics and GPGPU, with geometric mean]
APOGEE predicts with 93.5% accuracy.

D-Cache Miss Rate
[Figure: percent reduction in D-cache miss rate for MTA_4, MTA_32, and APOGEE per benchmark]
Prefetching yields an 80% reduction in D-cache misses.

Conclusion
• Relying on a high degree of multithreading on GPUs is inefficient.
• Adaptive prefetching exploits the regular memory access patterns of GPU applications:
o Adapts to the timeliness of prefetching
o Prefetches for two different access patterns
• Over 12 graphics and GPGPU benchmarks: a 20% improvement in performance and a 51% improvement in performance/Watt.

Questions?