WarpPool: Sharing Requests with Inter-Warp Coalescing for Throughput Processors
John Kloosterman, Jonathan Beaumont, Mick Wollman, Ankit Sethia, Ron Dreslinski, Trevor Mudge, Scott Mahlke
Computer Engineering Laboratory, University of Michigan

Introduction
• GPUs have high peak performance
• For many benchmarks, memory throughput limits performance
[Chart: % of benchmarks vs. % of cycles stalled on memory: <12%, 12-33%, 33-66%, 66%+]

GPU Architecture
• 32 threads are grouped into SIMD warps; every thread in a warp executes the same instruction stream (e.g., add r1, r2, r3 then load [r1], r2)
• The warp scheduler sends ready warps (warp 0 ... warp 47) to the functional units: the ALUs and the load/store unit

GPU Memory System
• The load/store unit's intra-warp coalescer groups the addresses of one warp's load by cache line
• Coalesced requests access the L1; misses are tracked in the MSHRs and forwarded to the L2 and DRAM

Problem: Divergence
• A divergent load touches many cache lines, so a single warp's load becomes many L1 requests

Problem: Bottleneck at L1
• Loads from many warps (warp 0 ... warp 5) queue at the load/store unit, but the L1 services only one request per cycle

Hazards in Benchmarks
[Chart: cache lines per load/store and waiting loads/stores for the memory-divergent, bandwidth-limited, and cache-limited benchmark groups]

Inter-Warp Spatial Locality
• Spatial locality exists not just within a warp: a load that is divergent inside one warp can touch the same cache lines as loads from other warps (warp 0 ... warp 4)
• Key insight: use this locality to address throughput bottlenecks

Inter-Warp Window
• Baseline: the intra-warp coalescer sends the L1 one cache line from one warp per cycle
• WarpPool: several intra-warp coalescers feed an inter-warp coalescer, which gathers many cache lines from many warps and sends the L1 one cache line that serves many warps
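The two levels of coalescing described above can be sketched as a small Python model. This is an illustrative sketch, not the paper's hardware: the function names are hypothetical, and a 128-byte L1 cache line (as on Fermi-class GPUs) is assumed.

```python
CACHE_LINE_BYTES = 128  # assumed Fermi-class L1 line size


def intra_warp_coalesce(addresses):
    """Group one warp's per-thread addresses by cache line.

    Returns {line_address: [lane indices touching that line]}.
    """
    lines = {}
    for lane, addr in enumerate(addresses):
        line = addr // CACHE_LINE_BYTES * CACHE_LINE_BYTES
        lines.setdefault(line, []).append(lane)
    return lines


def inter_warp_coalesce(per_warp_addresses):
    """Merge cache-line requests across warps.

    Returns {line_address: {warp_id: [lanes]}} -- one L1 access
    can now serve threads from several warps.
    """
    merged = {}
    for warp_id, addresses in enumerate(per_warp_addresses):
        for line, lanes in intra_warp_coalesce(addresses).items():
            merged.setdefault(line, {})[warp_id] = lanes
    return merged


# Two warps striding through the same region: each warp alone is
# divergent (4 lines each), but both warps touch the same 4 lines,
# so inter-warp coalescing needs 4 L1 accesses instead of 8.
warp0 = [i * 16 for i in range(32)]      # bytes 0..496
warp1 = [i * 16 + 8 for i in range(32)]  # same lines, offset lanes
merged = inter_warp_coalesce([warp0, warp1])
```

Here the merged view is exactly the inter-warp window: each entry is one cache line annotated with every (warp, lane) set waiting on it.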
Design Overview
• Warp Scheduler → Intra-Warp Coalescers → Inter-Warp Queues → Selection Logic → L1

Address Generation
• Queue memory instructions before address generation
• Intra-warp coalescers are the same as the baseline
• One request for one cache line exits per cycle

Inter-Warp Coalescer
• Many coalescing queues, each with a small number of tags (cache line address, warp ID, thread mapping)
• Requests are mapped to coalescing queues by address, so requests for the same cache line sort into the same queue
• Insertion is a tag lookup: a request for a line already in the queue merges with that entry; at most one insertion per cycle per queue

Selection Logic
• Select a cache line from the inter-warp queues to send to the L1 cache
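The coalescing-queue behavior above can be sketched as follows. This is a minimal Python model under stated assumptions, not the paper's RTL: the class and field names are invented, the queue count matches the evaluated configuration (32), the 4 tags per queue and the address-to-queue mapping are illustrative, and only the default oldest-first selection policy is shown.

```python
NUM_QUEUES = 32       # matches the evaluated WarpPool configuration
TAGS_PER_QUEUE = 4    # illustrative small tag count per queue
LINE_BYTES = 128      # assumed cache line size


class InterWarpCoalescer:
    """Toy model: address-indexed coalescing queues with tag lookup."""

    def __init__(self):
        self.queues = [[] for _ in range(NUM_QUEUES)]
        self.time = 0  # global age counter for oldest-first selection

    def _queue_for(self, line):
        return self.queues[(line // LINE_BYTES) % NUM_QUEUES]

    def insert(self, line, warp_id, lanes):
        """Tag lookup in the queue selected by address; merge on a hit.

        Returns False when the queue is full (the request must stall).
        """
        self.time += 1
        q = self._queue_for(line)
        for tag in q:
            if tag["line"] == line:  # hit: merge this warp's threads
                tag["warps"].setdefault(warp_id, []).extend(lanes)
                return True
        if len(q) == TAGS_PER_QUEUE:
            return False
        q.append({"line": line, "warps": {warp_id: list(lanes)}, "age": self.time})
        return True

    def select_oldest(self):
        """Default policy: send the oldest merged request to the L1."""
        heads = [q[0] for q in self.queues if q]  # heads are oldest per queue
        if not heads:
            return None
        tag = min(heads, key=lambda t: t["age"])
        self._queue_for(tag["line"]).remove(tag)
        return tag


# Example: two warps requesting line 0, one warp requesting line 128.
iwc = InterWarpCoalescer()
iwc.insert(0, 0, [0, 1])
iwc.insert(128, 1, [2])
iwc.insert(0, 1, [3])         # merges into the existing line-0 tag
oldest = iwc.select_oldest()  # line 0, now serving warps 0 and 1
```

One tag leaving the selector corresponds to a single L1 access whose data is distributed to every (warp, lane) recorded in its thread mapping.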
Selection Logic
• Two strategies:
  • Default: pick the oldest request
  • Cache-sensitive: prioritize one warp
• Switch between strategies based on the miss rate over a quantum

Methodology
• Implemented in GPGPU-Sim 3.2.2
  • GTX 480 baseline, 32 MSHRs, 32 kB cache, GTO scheduler
• Verilog implementation for power and area
• Benchmark criteria
  • Parboil, PolyBench, and Rodinia benchmark suites
  • Memory-throughput-limited: memory requests waiting for more than 90% of execution time
• WarpPool configuration
  • 2 intra-warp coalescers
  • 32 inter-warp queues
  • 100,000-cycle quantum for the request selector
  • Up to 4 inter-warp coalesces per L1 access

Results: Speedup
[Chart: speedup over baseline for an 8-way banked cache, MRPB [1], and WarpPool on the memory-divergent, bandwidth-limited, and cache-limited groups; y-axis clipped at 2x, with off-scale bars labeled 5.16, 2.35, and 3.17; WarpPool's mean speedup is 1.38x]

Results: L1 Throughput
[Chart: requests serviced per L1 access for the 8-way banked cache and WarpPool across the three benchmark groups]
• A banked cache exploits divergence, not locality
• WarpPool merges requests even when loads are not divergent
• No speedup for the banked cache: still limited to 1 miss per cycle

Results: L1 Misses
[Chart: MPKI as a percentage of the baseline for MRPB and WarpPool across the three benchmark groups]
• MRPB has larger queues
• The oldest-first policy sometimes preserves cross-warp temporal locality

Conclusion
• Many kernels are limited by memory throughput
• Key insight: use inter-warp spatial locality to merge requests
• WarpPool improves performance by 1.38x:
  • Merging requests: increases L1 throughput by 8%
  • Prioritizing requests: decreases L1 misses by 23%

[1] MRPB: Memory Request Prioritization for Massively Parallel Processors, HPCA 2014
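The policy switch on the Selection Logic slide can be sketched as a small model. The 100,000-cycle quantum comes from the evaluated configuration; the 10% miss-rate threshold and all names here are illustrative assumptions, since the slides do not specify the switching threshold.

```python
QUANTUM_CYCLES = 100_000    # quantum from the evaluated configuration
MISS_RATE_THRESHOLD = 0.10  # illustrative threshold, not from the slides


class PolicySelector:
    """Switch between oldest-first and cache-sensitive selection
    based on the L1 miss rate observed over the last quantum."""

    def __init__(self):
        self.policy = "oldest"
        self.accesses = 0
        self.misses = 0
        self.cycle = 0

    def record_access(self, was_miss):
        self.accesses += 1
        self.misses += int(was_miss)

    def tick(self):
        self.cycle += 1
        if self.cycle % QUANTUM_CYCLES == 0:
            miss_rate = self.misses / self.accesses if self.accesses else 0.0
            # High miss rate: prioritize one warp so its lines stay cached.
            if miss_rate > MISS_RATE_THRESHOLD:
                self.policy = "cache_sensitive"
            else:
                self.policy = "oldest"
            self.accesses = self.misses = 0


# A quantum in which every access misses drives the selector to the
# cache-sensitive policy at the quantum boundary.
sel = PolicySelector()
for _ in range(100):
    sel.record_access(True)
for _ in range(QUANTUM_CYCLES):
    sel.tick()
```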