Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung
Inderpeet Singh
Andrew Brownsword
Tor M. Aamodt
University of British Columbia
In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)
Motivation

Lifetime of GPU Application Development
[Figure: Functionality and Performance plotted against Time, for Fine-Grained Locking and for Transactional Memory (marked "?")]
E.g. N-Body with 5M bodies
CUDA SDK: O(n²) – 1640 s (barrier)
Barnes Hut: O(n log n) – 5.2 s (locks)
Wilson Fung, Inderpeet Singh,
Andrew Brownsword, Tor Aamodt
Hardware TM for GPU Architectures
Are TM and GPUs Incompatible?
GPUs different from Multi-Core CPUs
  1000s of Concurrent Scalar Threads
  Challenges (from TM perspective)
Our Solution: KILO TM
  Hardware TM for GPUs
Hardware TM for GPUs
Challenge #1: SIMD Hardware

On GPUs, scalar threads in a warp/wavefront execute in lockstep
A Warp with 4 Scalar Threads
...
TxBegin
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
...
[Figure: threads T0 T1 T2 T3 of the warp at TxCommit; some are marked Committed and others Aborted. Branch Divergence!]
KILO TM – Solution to
Challenge #1: SIMD Hardware

Transaction Abort
  Like a Loop
  Extend SIMT Stack
...
TxBegin
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
...
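The "abort is like a loop" idea can be sketched on the host in C++. This is only an illustration under stated assumptions, not the KILO TM hardware mechanism: the function name run_warp_transactions, the retry mask and the try_commit callback are all hypothetical. Lanes whose transaction aborts stay in the retry mask and re-execute the transaction body, while lanes that committed sit out the next pass.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical sketch: one transaction per lane of a warp.
// try_commit(lane, pass) returns true if that lane's transaction commits
// on the given pass and false if it aborts. Assumes warp_size <= 32.
// Returns the number of passes through the transaction body.
int run_warp_transactions(std::uint32_t active_mask, int warp_size,
                          const std::function<bool(int, int)>& try_commit) {
    std::uint32_t retry_mask = active_mask;  // lanes that still have to commit
    int passes = 0;
    while (retry_mask != 0) {                // an abort acts as a loop back-edge
        ++passes;
        std::uint32_t next_retry = 0;
        for (int lane = 0; lane < warp_size; ++lane) {
            if (!(retry_mask & (1u << lane))) continue;  // lane already committed
            if (!try_commit(lane, passes)) {
                next_retry |= (1u << lane);  // aborted: run the body again
            }
        }
        retry_mask = next_retry;             // only aborted lanes re-execute
    }
    return passes;
}
```

With a 4-thread warp in which two lanes abort once, the body runs twice: once for all four lanes and once for the two aborted lanes.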
[Figure: the listing above annotated with an "Abort" arrow]
Hardware TM for GPUs
Challenge #2: Transaction Rollback
CPU Core: Register File with 10s of Registers; Checkpoint Register File (@ TX Entry, @ TX Abort)
GPU Core (SM): Warps share a Register File of 32k Registers; Checkpoint?
2MB Total On-Chip Storage
KILO TM – Solution to
Challenge #2: Transaction Rollback

SW Register Checkpoint
  Most TX: Registers overwritten at first use
  TX in Barnes Hut: Checkpoint 2 registers
Example (r2 is Overwritten by its first use, so it needs no checkpoint):
TxBegin
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
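A software register checkpoint can be sketched in C++ as follows. This is a minimal illustration, not compiler output from the paper: the function increment_with_retry and its aborts_before_commit parameter are hypothetical. Only the value that is read before being overwritten inside the transaction (counter) is saved; the temporary that is written before it is read (tmp) needs no checkpoint.

```cpp
#include <cassert>

// Hypothetical transaction body: counter is live on entry and is clobbered
// inside the transaction, so it is checkpointed in software. tmp is
// overwritten at its first use, so no checkpoint is needed for it.
int increment_with_retry(int counter, int aborts_before_commit) {
    const int saved_counter = counter;  // software checkpoint of one register
    int attempt = 0;
    while (true) {                      // TxBegin
        int tmp = counter + 1;          // tmp is written before it is read
        counter = tmp;                  // the live-in value is overwritten here
        if (attempt++ >= aborts_before_commit) {
            return counter;             // TxCommit succeeded
        }
        counter = saved_counter;        // abort: restore checkpoint, then retry
    }
}
```

Without the restore step, three aborts would leave the counter incremented four times instead of once.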
Hardware TM for GPUs
Challenge #3: Conflict Detection
Existing HTMs use Cache Coherence Protocol
  Not Available on GPUs
  No Private Data Cache per Thread
Signatures?
  1024-bit / Thread
  3.8MB / 30k Threads
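The signature storage estimate on the Challenge #3 slide can be checked with a little arithmetic. The sketch below assumes that "30k" means 30,000 threads and that "MB" means 10^6 bytes; the helper name signature_bytes is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Total signature storage in bytes when each of `threads` threads holds a
// signature of `bits_per_thread` bits.
constexpr std::uint64_t signature_bytes(std::uint64_t threads,
                                        std::uint64_t bits_per_thread) {
    return threads * bits_per_thread / 8;
}
```

Under those assumptions, 30,000 threads at 1024 bits each need 3,840,000 bytes, which rounds to the 3.8MB on the slide.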
Hardware TM for GPUs
Challenge #4: Write Buffer
GPU Core (SM): Warps (1024-1536 Threads) share one L1 Data Cache
Fermi’s L1 Data Cache (48kB) = 384 X 128B Lines
Problem: 384 lines / 1536 threads < 1 line per thread!
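The write-buffer capacity problem is a simple division. The sketch below reproduces the numbers on the slide; the helper names cache_lines and lines_per_thread are hypothetical.

```cpp
#include <cassert>

// Number of lines in a cache of `cache_bytes` bytes with `line_bytes`-byte lines.
constexpr int cache_lines(int cache_bytes, int line_bytes) {
    return cache_bytes / line_bytes;
}

// Average number of cache lines available to each of `threads` threads.
constexpr double lines_per_thread(int lines, int threads) {
    return static_cast<double>(lines) / threads;
}
```

A 48kB cache with 128B lines has 384 lines, so 1536 threads get 0.25 lines each and 1024 threads get 0.375 lines each, which is less than one line per thread in both cases.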
KILO TM:
Value-Based Conflict Detection
Global Memory: A=1, B=0 (later B=2)

TX1: atomic {B=A+1}
  TxBegin
  LD r1,[A]
  ADD r1,r1,1
  ST r1,[B]
  TxCommit
  Private Memory: Read-Log A=1, Write-Log B=2

TX2: atomic {A=B+2}
  TxBegin
  LD r2,[B]
  ADD r2,r2,2
  ST r2,[A]
  TxCommit
  Private Memory: Read-Log B=0, Write-Log A=2

Self-Validation + Abort:
  Only detects existence of conflict (not identity)
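Value-based conflict detection can be sketched in C++ with two logs per transaction. This is a simplified single-threaded model, not the KILO TM hardware: the Transaction struct, the string addresses and validate_and_commit are hypothetical, and validation is assumed to run one transaction at a time. A transaction commits only if every value in its read-log still matches global memory; a mismatch reveals that some conflict happened, but not which transaction caused it.

```cpp
#include <cassert>
#include <map>
#include <string>

using Memory = std::map<std::string, int>;  // address name -> value

struct Transaction {
    Memory read_log;   // values observed in global memory during execution
    Memory write_log;  // values to publish at commit

    // Transactional load: prefer the transaction's own buffered write.
    int load(const Memory& global, const std::string& addr) {
        auto buffered = write_log.find(addr);
        if (buffered != write_log.end()) return buffered->second;
        int value = global.at(addr);
        read_log.insert({addr, value});  // keeps the first observed value
        return value;
    }

    // Transactional store: buffer the value instead of writing global memory.
    void store(const std::string& addr, int value) { write_log[addr] = value; }
};

// Self-validation followed by commit. Returns false (abort) if any logged
// value no longer matches global memory.
bool validate_and_commit(Memory& global, const Transaction& tx) {
    for (const auto& entry : tx.read_log) {
        if (global.at(entry.first) != entry.second) return false;  // conflict
    }
    for (const auto& entry : tx.write_log) {
        global[entry.first] = entry.second;  // publish buffered writes
    }
    return true;
}
```

Replaying the slide's example with A=1 and B=0: TX1 logs A=1 and buffers B=2, TX2 logs B=0 and buffers A=2. If TX1 commits first, B becomes 2, TX2's logged B=0 no longer matches, and TX2 aborts.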
Parallel Validation?
Data Race!?!
Global Memory: A=1, B=0
TX1: atomic {B=A+1}
  Private Memory: Read-Log A=1, Write-Log B=2
TX2: atomic {A=B+2}
  Private Memory: Read-Log B=0, Write-Log A=2

Tx1 then Tx2: A=4,B=2
OR
Tx2 then Tx1: A=2,B=3
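The two serial outcomes on the Parallel Validation slide, and the outcome when both transactions validate before either commits, can be reproduced with a small C++ sketch. The State struct and the function names are hypothetical; the racy case is modelled deterministically on one thread.

```cpp
#include <cassert>

struct State { int a; int b; };

// TX1: atomic { B = A + 1 }
State tx1(State s) { s.b = s.a + 1; return s; }

// TX2: atomic { A = B + 2 }
State tx2(State s) { s.a = s.b + 2; return s; }

// Both transactions execute against the same initial state and both
// validate before either one commits. Nothing in memory has changed at
// validation time, so both pass, and then both write back.
State both_validate_then_commit(State s) {
    int new_b = s.a + 1;  // TX1's write-log entry
    int new_a = s.b + 2;  // TX2's write-log entry
    s.b = new_b;          // TX1 commits
    s.a = new_a;          // TX2 commits
    return s;
}
```

Starting from A=1, B=0, the serial orders give A=4,B=2 and A=2,B=3, while the unsynchronised case gives A=2,B=2, which matches neither serial order.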
Serialize Validation?
[Figure: timeline in the Commit Unit, in front of Global Memory. TX1: V+C. TX2: Stall, then V+C.]
V = Validation
C = Commit
Benefit #1: No Data Race
Benefit #2: No Live Lock
Drawback: Serializes Non-Conflicting Transactions (“collateral damage”)
Solution: Speculative Validation
Key Idea: Split Conflict Detection into two parts
1. Recently Committed TX in Parallel
2. Concurrently Committing TX in Commit Order (Approximate)
[Figure: timeline in the Commit Unit, with links labelled RS to Global Memory. TX1: V+C and TX2: V+C overlap in time. TX3: Stall, then V+C.]
V = Validation
C = Commit
Conflict Rare → Good Commit Parallelism
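The effect on commit parallelism can be illustrated with a small scheduling sketch in C++. It rests on assumptions that are not in the slides: exact set comparison stands in for the approximate hardware check, each validation plus commit takes one time step, and the names CommitRequest and commit_start_times are hypothetical. A transaction stalls only behind earlier transactions in commit order that touch an address it also touches, where at least one of the two writes it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct CommitRequest {
    std::set<int> reads;   // addresses in the read-set
    std::set<int> writes;  // addresses in the write-set
};

static bool overlaps(const std::set<int>& x, const std::set<int>& y) {
    for (int v : x) {
        if (y.count(v)) return true;
    }
    return false;
}

// For each transaction, in commit order, returns the time step at which its
// validation and commit may start.
std::vector<int> commit_start_times(const std::vector<CommitRequest>& txs) {
    std::vector<int> start(txs.size(), 0);
    for (std::size_t i = 0; i < txs.size(); ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            bool hazard = overlaps(txs[i].reads, txs[j].writes) ||
                          overlaps(txs[i].writes, txs[j].writes) ||
                          overlaps(txs[i].writes, txs[j].reads);
            if (hazard) start[i] = std::max(start[i], start[j] + 1);
        }
    }
    return start;
}
```

In the three-transaction case matching the timeline above, TX1 and TX2 touch disjoint addresses and start together, while TX3 reads an address that TX1 writes and therefore starts one step later.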
KILO TM Implementation

Minimal Modification to Existing GPU Arch.
[Figure: GPU architecture block diagram. A CPU performs Kernel Launch onto the SIMT Cores. Each SIMT Core holds Thread Blocks, SIMT Stacks, Register File, Shared Memory, L1 Data Cache, Texture Cache, Constant Cache, a TX Log Unit and a Memory Port. An Interconnection Network links the cores to the Memory Partitions, each with an Atomic Op. Unit, a Commit Unit, a Last-Level Cache Bank and an Off-Chip DRAM Channel.]
Evaluation Methodology

GPGPU-Sim 3.0 (BSD license)
  Detailed: IPC Correlation of 0.93 vs GT200
  KILO TM (Timing-Driven Memory Accesses)
GPU TM Applications
  Hash Table (HT-H, HT-L)
  Bank Account (ATM)
  Cloth Physics (CL)
  Barnes Hut (BH)
  CudaCuts (CC)
  Data Mining (AP)
Performance (vs. Serializing TX)
[Figure: bar chart of Speedup over Serializing Tx (axis 0 to 700) with series Ideal TM and FG Lock, for HT-H, HT-L, ATM, CL, BH, CC, AP and AVG; two bars are labelled 1.14 and 1.04. Higher is Better.]
Serializing TX ≈ Coarse-Grained Locks
Performance (Exec. Time)
[Figure: bar chart of Normalized Exec. Time (axis 0 to 3) with series Ideal TM, KILO TM and FG Lock, for HT-H, HT-L, ATM, CL, BH, CC and AP. Lower is Better.]
Captures 59% of FG Lock Performance
Implementation Complexity


Logs in Private Memory @ L1 Data Cache
Commit Unit
  5kB Last Writer History Unit
  19kB Transaction Status
  32kB Read-Set and Write-Set Buffer
CACTI 5.3 @ 40nm
  0.40mm² x 6 Memory Partition
  0.5% of 520mm²
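The area overhead follows from the per-partition estimate. The sketch below redoes the arithmetic with the figures from the slide; the helper names added_area_mm2 and area_overhead_percent are hypothetical.

```cpp
#include <cassert>

// Total added area when each of `partitions` memory partitions gains
// `per_partition_mm2` square millimetres.
constexpr double added_area_mm2(double per_partition_mm2, int partitions) {
    return per_partition_mm2 * partitions;
}

// Added area as a percentage of the die area.
constexpr double area_overhead_percent(double added_mm2, double die_mm2) {
    return 100.0 * added_mm2 / die_mm2;
}
```

Six partitions at 0.40mm² each add 2.4mm², which is about 0.46% of a 520mm² die and rounds to the quoted 0.5%.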
Summary

KILO TM: Hardware TM for GPUs
  1000s of Concurrent Scalar TXs
  Handles Scalar TX Abort
  No cache coherence protocol dependency
  Word-level conflict detection
  Unbounded Transaction
59% Fine-Grained Locking Performance
128X Faster than Serializing TX Execution
0.5% Area Overhead
Questions?
Backup Slides
ABA Problem?

Classic Example: Linked List Based Stack
top → A → B → C → Null
Thread 0 – pop():
while (true) {
  t = top;
  next = t->Next;
  // thread 2: pop A, pop B, push A
  if (atomicCAS(&top, t, next) == t) break; // succeeds!
}
[Figure: stack states during the interleaving. Thread 0 holds t = A and next = B. After thread 2 runs, top → A → C → Null. After the CAS succeeds, top → B → C → Null, although B was already popped.]
ABA Problem?

atomicCAS protects only a single word
  Only part of the data structure
[Figure: top → A → B → C → Null]
while (true) {
  t = top;
  next = t->Next;
  if (atomicCAS(&top, t, next) == t) break; // succeeds!
}
Value-based conflict detection protects all relevant parts of the data structure
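The ABA interleaving can be replayed deterministically on a single thread. The sketch below is host-side C++ that uses std::atomic in place of CUDA's atomicCAS; the Node and AbaReplay types and the replay_aba function are hypothetical. It contrasts the single-word compare-and-swap on top with a value-based check of every word that thread 0 read (top and t->next).

```cpp
#include <atomic>
#include <cassert>

struct Node { char name; Node* next; };

struct AbaReplay {
    bool cas_succeeded;            // result of the single-word CAS on top
    bool value_validation_passed;  // result of re-checking all words read
    char new_top;                  // node that top points to afterwards
};

AbaReplay replay_aba() {
    Node c{'C', nullptr}, b{'B', &c}, a{'A', &b};
    std::atomic<Node*> top{&a};

    // Thread 0 begins pop(): it reads top and top->next.
    Node* t = top.load();
    Node* next = t->next;  // B

    // Thread 2 runs in between: pop A, pop B, push A.
    top.store(a.next);     // pop A: top = B
    top.store(b.next);     // pop B: top = C
    a.next = top.load();   // push A: A.next = C
    top.store(&a);         // top = A again

    // Value-based validation re-checks every word thread 0 read.
    bool validated = (top.load() == t) && (t->next == next);

    // The single-word CAS only compares top, which is A again.
    Node* expected = t;
    bool cas_ok = top.compare_exchange_strong(expected, next);
    return {cas_ok, validated, top.load()->name};
}
```

In this replay the CAS succeeds and leaves top pointing at the already-popped node B, while the value-based check fails because A's next pointer changed from B to C.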