Hardware Transactional Memory for GPU Architectures
  Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt
  University of British Columbia
  In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)

Motivation
  Lifetime of GPU Application Development: Functionality, Performance
  E.g. N-Body with 5M bodies
    CUDA SDK: O(n^2) – 1640 s (barrier)
    Barnes Hut: O(n log n) – 5.2 s (locks)
  [Figure: development timelines (axis: Time) labeled Fine-Grained Locking and Transactional Memory (?)]

Are TM and GPUs Incompatible?
  GPUs are different from multi-core CPUs
    1000s of concurrent scalar threads
    Challenges (from a TM perspective)
  Our solution: KILO TM
    Hardware TM for GPUs

Hardware TM for GPUs – Challenge #1: SIMD Hardware
  On GPUs, scalar threads in a warp/wavefront execute in lockstep
  A warp with 4 scalar threads (T0, T1, T2, T3):
    ...
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
    ...
  [Figure: at TxCommit some threads of the warp are Committed and others are Aborted. Branch Divergence!]

KILO TM – Solution to Challenge #1: SIMD Hardware
  Transaction Abort
    Like a loop
    Extend the SIMT stack
    ...
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
    ...
  [Figure: an Abort arrow from TxCommit back to TxBegin]

Hardware TM for GPUs – Challenge #2: Transaction Rollback
  CPU core: register file with 10s of registers
    Checkpoint register file (@ TX entry, @ TX abort)
  GPU core (SM): many warps share one register file
    Checkpoint register file?
    32k registers
    2MB total on-chip storage

KILO TM – Solution to Challenge #2: Transaction Rollback
  SW Register Checkpoint
    Most TX: registers overwritten at first use
    TX in Barnes Hut: checkpoint 2 registers
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
  [Figure: a register in the transaction is marked Overwritten; an Abort arrow returns to TxBegin]

Hardware TM for GPUs – Challenge #3: Conflict Detection
  Existing HTMs use the cache coherence protocol
    Not available on GPUs
    No private data cache per thread
  Signatures?
    1024 bits per thread
    3.8MB for 30k threads

Hardware TM for GPUs – Challenge #4: Write Buffer
  [Figure: GPU core (SM) with many warps sharing one L1 data cache]
  Fermi’s L1 data cache (48kB) = 384 × 128B lines
  1024-1536 threads
  Problem: 384 lines / 1536 threads < 1 line per thread!

KILO TM: Value-Based Conflict Detection
  Global memory: A=1; B=0, then B=2
  TX1: atomic {B=A+1}
    TxBegin
    LD r1,[A]
    ADD r1,r1,1
    ST r1,[B]
    TxCommit
    Private memory: Read-Log A=1; Write-Log B=2
  TX2: atomic {A=B+2}
    TxBegin
    LD r2,[B]
    ADD r2,r2,2
    ST r2,[A]
    TxCommit
    Private memory: Read-Log B=0; Write-Log A=2
  Self-validation + abort: only detects the existence of a conflict (not its identity)

Parallel Validation?
  Data Race!?!
  Global memory: A=1, B=0
  TX1: atomic {B=A+1}
    Private memory: Read-Log A=1; Write-Log B=2
  TX2: atomic {A=B+2}
    Private memory: Read-Log B=0; Write-Log A=2
  Valid outcomes:
    Tx1 then Tx2: A=4, B=2
    OR
    Tx2 then Tx1: A=2, B=3

Serialize Validation?
  [Figure: timeline (axis: Time) at the commit unit in front of global memory: TX1 does V+C while TX2 stalls, then TX2 does V+C]
  V = Validation, C = Commit
  Benefit #1: No data race
  Benefit #2: No livelock
  Drawback: serializes non-conflicting transactions (“collateral damage”)

Solution: Speculative Validation
  Key idea: split conflict detection into two parts
    1. Recently committed TX, in parallel
    2. Concurrently committing TX, in commit order
  [Figure: timeline (axis: Time) at the commit unit in front of global memory: TX1 V+C and TX2 V+C overlap, TX3 stalls and then does V+C; other labels: Approximate, RS]
  V = Validation, C = Commit
  Conflicts are rare, so commit parallelism is good

KILO TM Implementation
  Minimal modification to existing GPU architecture
  [Figure: block diagram. The CPU launches kernels onto SIMT cores. Each SIMT core holds SIMT stacks, a register file, thread blocks, shared memory, an L1 data cache, a texture cache, a constant cache, a TX log unit, and a memory port. An interconnection network links the cores to the memory partitions. Each memory partition holds a commit unit, an atomic op. unit, a last-level cache bank, and an off-chip DRAM channel.]

Evaluation Methodology
  GPGPU-Sim 3.0 (BSD license)
    Detailed: IPC correlation of 0.93 vs. GT200
    KILO TM (timing-driven memory accesses)
  GPU TM applications
    Hash Table (HT-H, HT-L)
    Bank Account (ATM)
    Cloth Physics (CL)
    Barnes Hut (BH)
    CudaCuts (CC)
    Data Mining (AP)

Performance (vs. Serializing TX)
  [Bar chart: speedup over serializing TX (y-axis, 0 to 700) for HT-H, HT-L, ATM, CL, BH, CC, AP, and AVG; series include Ideal TM and FG Lock; data labels 1.14 and 1.04. Higher is better.]
  Serializing TX ≈ coarse-grained locks

Performance (Exec. Time)
  [Bar chart: normalized execution time (y-axis, 0 to 3) for HT-H, HT-L, ATM, CL, BH, CC, and AP; series: Ideal TM, KILO TM, FG Lock. Lower is better.]
  Captures 59% of FG Lock performance

Implementation Complexity
  Logs in private memory @ L1 data cache
  Commit unit
    5kB Last Writer History Unit
    19kB Transaction Status
    32kB Read-Set and Write-Set Buffer
  CACTI 5.3 @ 40nm
    0.40 mm² × 6 memory partitions
    0.5% of 520 mm²

Summary
  KILO TM: Hardware TM for GPUs
    1000s of concurrent scalar TXs
    Handles scalar TX abort
    No cache coherence protocol dependency
    Word-level conflict detection
    Unbounded transactions
  59% of fine-grained locking performance
  128X faster than serializing TX execution
  0.5% area overhead
  Questions?

Backup Slides

ABA Problem?
  Classic example: linked-list-based stack
    top -> A -> B -> C -> Null
  Thread 0 – pop():
    while (true) {
      t = top;
      next = t->Next;
      // thread 2: pop A, pop B, push A
      if (atomicCAS(&top, t, next) == t)
        break; // succeeds!
    }
  [Figure: stack states shown on the slide: t -> A and next -> B as read by thread 0; top -> A -> C -> Null after thread 2 runs; top -> C -> Null; top -> B -> C -> Null]

ABA Problem?
  atomicCAS protects only a single word
    Only part of the data structure
    top -> A -> B -> C -> Null
    while (true) {
      t = top;
      next = t->Next;
      if (atomicCAS(&top, t, next) == t)
        break; // succeeds!
    }
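The interleaving described in the pop() comments above can be replayed deterministically. The following is a minimal Python sketch, not the CUDA code from the slides: `Node`, `Stack`, `compare_and_swap`, and `replay_aba` are hypothetical names, and `compare_and_swap` only models the single-word compare-and-exchange semantics of atomicCAS (it is not itself atomic, which is acceptable for a single-threaded replay).

```python
class Node:
    def __init__(self, name, next=None):
        self.name = name
        self.next = next


class Stack:
    def __init__(self, top):
        self.top = top

    def compare_and_swap(self, expected, new):
        # Model of a single-word CAS on `top`: swap only if `top` is
        # still `expected`, and always return the old value.
        old = self.top
        if old is expected:
            self.top = new
        return old

    def pop(self):
        while True:
            t = self.top
            nxt = t.next
            if self.compare_and_swap(t, nxt) is t:
                return t

    def push(self, node):
        while True:
            t = self.top
            node.next = t
            if self.compare_and_swap(t, node) is t:
                return

    def names(self):
        out, node = [], self.top
        while node is not None:
            out.append(node.name)
            node = node.next
        return out


def replay_aba():
    # Initial stack: top -> A -> B -> C -> Null
    c = Node("C")
    b = Node("B", c)
    a = Node("A", b)
    stack = Stack(a)

    # Thread 0 starts pop(): reads top and top->Next, then is delayed.
    t = stack.top
    nxt = t.next

    # Thread 2 runs to completion: pop A, pop B, push A.
    popped_a = stack.pop()
    stack.pop()
    stack.push(popped_a)
    after_thread2 = stack.names()

    # Thread 0 resumes: top is A again, so the CAS succeeds.
    succeeded = stack.compare_and_swap(t, nxt) is t
    return after_thread2, succeeded, stack.names()


after_thread2, succeeded, final = replay_aba()
print(after_thread2, succeeded, final)  # ['A', 'C'] True ['B', 'C']
```

Under these assumptions the delayed CAS succeeds and leaves `top` pointing at B, a node that thread 2 had already popped, because the CAS compares only the single word `top` and never notices that A's Next pointer changed.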
  Value-based conflict detection protects all relevant parts of the data structure
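As an illustration of value-based conflict detection, the TX1/TX2 example from the "KILO TM: Value-Based Conflict Detection" slide can be modeled in a few lines. This is a minimal Python sketch under stated assumptions, not KILO TM's hardware design: `Transaction`, `load`, `store`, `validate_and_commit`, and `run_example` are hypothetical names, global memory is a plain dict, and validation plus commit is fully serialized as on the "Serialize Validation?" slide.

```python
class Transaction:
    def __init__(self, memory, body):
        self.memory = memory      # shared "global memory" (a dict)
        self.body = body          # function that issues loads and stores
        self.read_log = {}        # address -> value observed at first read
        self.write_log = {}       # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:          # read our own buffered write
            return self.write_log[addr]
        if addr not in self.read_log:       # log the value we observed
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def store(self, addr, value):
        self.write_log[addr] = value        # buffered until commit

    def execute(self):
        self.read_log.clear()
        self.write_log.clear()
        self.body(self)

    def validate_and_commit(self):
        # Value-based validation: every logged read must still match memory.
        # A mismatch reveals that a conflict exists, not who caused it.
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False                # abort
        self.memory.update(self.write_log)  # commit buffered writes
        return True


def run_example():
    memory = {"A": 1, "B": 0}
    tx1 = Transaction(memory, lambda tx: tx.store("B", tx.load("A") + 1))
    tx2 = Transaction(memory, lambda tx: tx.store("A", tx.load("B") + 2))

    # Both transactions execute against the initial memory state.
    tx1.execute()
    tx2.execute()

    # Assumed commit order: TX1, then TX2, one at a time.
    aborts = 0
    for tx in (tx1, tx2):
        while not tx.validate_and_commit():
            aborts += 1
            tx.execute()                    # abort behaves like a loop: retry
    return memory, aborts


memory, aborts = run_example()
print(memory, aborts)  # {'A': 4, 'B': 2} 1
```

With the slide's initial values (A=1, B=0), TX1 commits first, TX2 fails validation because B no longer holds its logged value 0, and the retry produces the slide's "Tx1 then Tx2" outcome of A=4, B=2. In the stack example, a pop() transaction would log both top and A's Next pointer, so the changed Next value would fail validation even though top again points to A.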