TokenTM: Token-Based
Hardware Transactional Memory
Jayaram Bobba, Neelam Goyal,
Mark D. Hill, Michael M. Swift, and David A. Wood
Multifacet Project
(www.cs.wisc.edu/multifacet)
Dept. of Computer Sciences
University of Wisconsin-Madison
Executive Summary
• Current Hardware TMs
– Most Transactions Small & Short Running
– Penalize large/long transactions
– Too restrictive for wide-spread TM use?
• Hypothesis
– Must Support Efficient Large/Long Transactions As Well
– Is such an HTM even possible?
• Yes! TokenTM
1. LogTM’s Log to buffer unbounded values
2. Transactional Tokens for unbounded conflict detection
• Conflict state in memory metabits
• Concurrent updates via metastate fission/fusion
7/26/2016
Wisconsin Multifacet Project
Existing HTM Systems
Assumption: Most transactions small & short running
– Optimized for small transactions
– Degrade with large, long-running transactions
• Non-localized Overhead, e.g.,
– LogTM-SE [Yen07]: false conflicts
– OneTM [Blundell07]: serializes
• Complex, Expensive Operations, e.g.,
– XTM [Chung06] & PTM [Chuang06] manipulate page tables
Premature Optimization?
© 2008 Multifacet Project
University of Wisconsin-Madison
Why Large Transactions?
Programmers may want large (>> cache size) and/or
long (>> context-switch interval) transactions
– HLL transactions invoke unpredictable lower-level code
– Replace critical sections containing syscalls or I/O
– Avoid concurrency bugs [Lu08]
• But “Most transactions small & short running”
– Restrict TM to use by gurus (like OS spin locks)?
– Self-fulfilling prophecy?
Must Support Efficient Large/Long Transactions As Well
Toward a Large-Transaction TM
Efficiently detect conflicts between in-flight transactions
using Read/Write Sets
• Unbounded
• Globally accessible
Small Transactions: Low Overhead
– Fast read/write set ops, e.g., add to read set, clear read set
Large Transactions: Localized Overhead
– Accessible read/write set (potentially unbounded)
Minimal Changes to Coherence / VM
– NO heavyweight eviction ops
– NO negative acks
– NO additional page tables
Existing Mechanisms
Synergy between cache coherence and conflict detection
Hence, overload cache coherence
Small Transactions:
Low Overhead
+ Excellent for bounded/small TM
But,
- ‘Virtualization’ on overflows
- Tough to access ‘virtualized’ state
× Minimal Changes to Coherence / VM
× Large Transactions: Localized Overhead
TokenTM: a Large-Transaction TM
• New Conflict Detection Mechanism (this talk)
– Transactional Tokens in Tagged Memory
– Token Coherence [Martin03] at different level
• Version Management
– Save old/new values for unbounded Write set
– LogTM [Moore06] undo log
Outline
• Motivation
• Design
– Token-Based Conflict Detection
– Metadata Storage
• Implementation
• Results
Transactional Tokens
• Challenge: How to efficiently track Read/Write sets?
• Token Coherence [Martin03]
– Read/Write sets for cache coherence
• Solution: Transactional Tokens
– T tokens per memory block
– At least one token to read, All T tokens to write
(token conflict detection)
– Token Metadata
<c0,c1,…,ci,…> where 0≤ci≤T is count of tokens held
by thread with TID i.
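The token rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Block`, its methods, and the value `T = 4` are hypothetical (the slides leave T symbolic), and T is assumed to be at least the number of threads so that every thread can hold one read token at once.

```python
T = 4  # tokens per block; assumed value, chosen >= number of threads


class Block:
    """Logical token metadata <c0, c1, ...> for one memory block."""

    def __init__(self, num_threads):
        self.counts = [0] * num_threads

    def acquire_read(self, tid):
        """Acquire one token; fails if no token is free (e.g. a writer holds all T)."""
        if self.counts[tid] > 0:
            return True  # block is already in this thread's read or write set
        if sum(self.counts) >= T:
            return False
        self.counts[tid] = 1
        return True

    def acquire_write(self, tid):
        """Acquire all T tokens; fails if any other thread holds a token."""
        others = sum(self.counts) - self.counts[tid]
        if others > 0:
            return False
        self.counts[tid] = T
        return True

    def release(self, tid):
        """Return every token this thread holds (commit or abort)."""
        self.counts[tid] = 0


X, Y = 0, 1
A, B = Block(2), Block(2)
results = [
    A.acquire_read(X),   # X: Load A
    B.acquire_write(X),  # X: Store B
    A.acquire_read(Y),   # Y: Load A
    A.acquire_write(Y),  # Y: Store A, conflicts with X's read token
]
```

Two readers share block A, while the store by Y fails because X still holds one of A's tokens.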
Tagged Memory
• Challenge: Where to store Unbounded, Globally
Accessible Token Metadata?
• Virtual Memory
– unbounded and globally accessible
• Solution, similar to OneTM [Blundell07]
– Tag Virtual Memory
– Piggyback on existing Virtual Memory and
Cache Coherence mechanisms
TokenTM Logical Operation
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A (insufficient tokens: ABORT); COMMIT_XACT
Undo Log (Thread X): B: 0x..00..
Shared Memory (successive animation states separated by →)
Block | Data | Metadata <cx, cy, …>
A | 0x..00.. | <0,0,…> → <1,0,…> → <1,1,…>
B | 0x..00.. → 0x..11.. | <0,0,…> → <T,0,…>
C | 0x..10.. | <0,0,…>
Storing Metadata
Software: per-thread Undo Log and Token Log (Thread X, Thread Y)
– Record each thread's token counts (cx, cy)
– Unbounded
– Difficult to access globally
Hardware: Metastate (Sum, TID) in Tagged Memory
– Lossy summary of the metadata
– Concise
– Accessible
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A; COMMIT_XACT
Tagged Memory (initial state)
Block | Data | Metadata <cx, cy, …> | Metastate (Sum, TID)
A | 0x..00.. | <0,0,…> | (0, -)
B | 0x..00.. | <0,0,…> | (0, -)
C | 0x..10.. | <0,0,…> | (0, -)
Hardware Metastate
• Metadata summary (sum, TID)
– sum, total number of tokens acquired
– TID, identify owner when sum = 1 or sum = T (optional)
Some summaries:
<c0, c1, …, ci, …> | (sum, TID)
<0, 0, 0, 0> | (0, -)
<0, 0, 1, 0> | (1, 2)
<0, T, 0, 0> | (T, 1)
<0, 1, 1, 1> | (3, -)
• Concise -> Stored in packed field (e.g., State[1:2], Attr[3:16])
• Fast -> Accessed as part of normal memory operation
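As a rough illustration of how the (sum, TID) summary relates to the full metadata vector, the sketch below derives one from the other. The function name `summarize` and the value `T = 8` are assumptions for this example, and `None` stands in for the "-" entry; threads are indexed from 0, following <c0, c1, …>.

```python
T = 8  # tokens per block; assumed value, the slides leave T symbolic


def summarize(counts):
    """Collapse per-thread token counts <c0, c1, ...> into (sum, TID).

    The TID is kept only when a single thread accounts for the whole sum
    and the sum is 1 or T; otherwise it is None (shown as '-' on the slide).
    """
    total = sum(counts)
    holders = [tid for tid, c in enumerate(counts) if c > 0]
    tid = holders[0] if len(holders) == 1 and total in (1, T) else None
    return (total, tid)


one_reader = summarize([0, 0, 1, 0])     # thread 2 holds one token
one_writer = summarize([T, 0, 0, 0])     # thread 0 holds all T tokens
three_readers = summarize([0, 1, 1, 1])  # owner identity is dropped
```

The summary is lossy: `[1, 1, 0, 0]` and `[0, 1, 1, 0]` both collapse to `(2, None)`, which is why the per-thread token logs are still needed to recover who holds what.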
Token Logs
• Distributed structures for unbounded
Read/Write sets
– per-thread
– stored in program memory (e.g., heap)
– list of <address, num_tokens>
– example token log: <A, 1>, <B, T>
• Accessible to hardware for fast ops
– Add to read set -> Append to token log
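A minimal sketch of a per-thread token log, assuming conflict checks have already passed before a token is acquired. The names `Thread`, `acquire`, and `release_all`, the dictionary used for metadata, and `T = 8` are all hypothetical; in the design described here the append is done by hardware, not by software as below.

```python
T = 8  # tokens per block; assumed value


class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.token_log = []  # list of (address, num_tokens)


def acquire(thread, metadata, addr, n):
    """Record n tokens both in the block's metadata and in the thread's log."""
    metadata[addr][thread.tid] += n
    thread.token_log.append((addr, n))


def release_all(thread, metadata):
    """Walk the log (commit or abort) and hand every logged token back."""
    while thread.token_log:
        addr, n = thread.token_log.pop()
        metadata[addr][thread.tid] -= n


metadata = {"A": [0, 0], "B": [0, 0]}  # per-block counts <cx, cy>
x = Thread(0)
acquire(x, metadata, "A", 1)  # Load A: add to read set
acquire(x, metadata, "B", T)  # Store B: add to write set
log_before_commit = list(x.token_log)
release_all(x, metadata)
```

Because every acquired token is written in two places, the log alone is enough to return exactly the tokens a thread took, without scanning memory.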
Double-entry Bookkeeping
(Keeping Metadata Consistent)
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A; COMMIT_XACT
Software: Token logs
Thread X token log: A: 1
Thread Y token log: A: 1
Hardware: Metastate, shown next to the logical token state (successive animation states separated by →)
Block | Logical Token State: Metadata <cx, cy, …> | Metastate (Sum, TID)
A | <0,0,…> → <1,0,…> → <1,1,…> | (0, -) → (1, X) → (2, -)
B | <0,0,…> | (0, -)
C | <0,0,…> | (0, -)
Outline
• Motivation
• Design
• Implementation
– Metastate Fission/Fusion
• Results
Implementing Hardware Metastate
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A; COMMIT_XACT
Software: Token logs (Thread X: A: 1)
Hardware: [Figure: private caches (Tag, Coherence State, Data, Sum, TID), block directory, and main memory (Data, Metastate (Sum, TID)) while both threads Load A. Messages shown: GETS A, Fwd_GETS A, Data A, Upgrade A. Coherence states shown for A: Not Present, Exclusive, Modified, Owned, Shared. Directory sharers: P1, then P1,P2.]
Shared copies cannot update metastate
Solution: Fission / Fusion
Metastate Fission
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A; COMMIT_XACT
Software: Token logs (Thread X: A: 1; Thread Y: A: 1)
Hardware: [Figure: Thread Y's Load A sends GETS A; the directory forwards Fwd_GETS A to P1, whose copy of A (metastate (1, X)) moves from Modified to Owned. Fission splits the metastate: P1 keeps (1, X), and Data A carries (0, -) to P2, which holds A in Shared and then records (1, Y). Directory: A goes from Exclusive @ P1 to Shared @ P1,P2.]
Metastate Fusion
• Metastate Fusion
– On store, metastate copies fused back
• Why does fission/fusion work?
– Store sees ‘complete’ metastate
– Load sees
• ‘complete’ metastate, if writer exists
• ‘partial’ metastate, otherwise
Hardware Cost
• Additional metabits in caches/memory
– Recoded ECC to cull metabits
• Changes to coherence protocols
– Additional payload on messages
– Minimal changes to protocol logic
• Requires non-silent eviction
Outline
• Motivation
• Design
• Implementation
• Results
Do we meet the two performance goals?
Small Transactions:
Low Overhead
Large Transactions:
Localized Overhead
Evaluation Methodology
• Methodology
– Full System Simulation
– Multifacet GEMS
• Base System
– 32-core CMP system, in-order, single-issue cores
– Private 4-way 32KB writeback split I&D L1 caches
– Shared 8-way 8 MB writeback L2
– On-chip directory @ L2, MESI coherence
– Packet-switched interconnect in a tiled topology
TM Systems
• LogTM-SE [Yen07] variant
– Parallel Bloom Filters for conflict detection
– 4 2Kbit H3 filters
+ Compact, less hardware overhead
- False Conflicts
• LogTM-SE_Perfect
+ No False Conflicts
- Unimplementable
• TokenTM
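To make the "false conflicts" of the Bloom-filter baseline concrete, here is a rough sketch of a parallel Bloom filter signature with H3 hashing. It is not taken from LogTM-SE: the class names, the 32-bit key width, the fixed seed, and the reading of "4 2Kbit" as four banks of 2048 bits each are all assumptions made for this example.

```python
import random


class H3Hash:
    """H3 hash: XOR together one fixed random row per set bit of the key."""

    def __init__(self, key_bits, out_bits, rng):
        self.rows = [rng.getrandbits(out_bits) for _ in range(key_bits)]

    def __call__(self, key):
        h = 0
        for i, row in enumerate(self.rows):
            if (key >> i) & 1:
                h ^= row
        return h


class ParallelBloomSignature:
    """k banks, each indexed by its own hash; one bit is set per bank on insert."""

    def __init__(self, num_filters=4, bits_per_filter=2048, key_bits=32, seed=0):
        rng = random.Random(seed)
        index_bits = bits_per_filter.bit_length() - 1  # power-of-two size assumed
        self.hashes = [H3Hash(key_bits, index_bits, rng) for _ in range(num_filters)]
        self.banks = [0] * num_filters  # each bank is an int used as a bit vector

    def insert(self, block_addr):
        for k, h in enumerate(self.hashes):
            self.banks[k] |= 1 << h(block_addr)

    def may_contain(self, block_addr):
        """True for every inserted address, and possibly for others (false conflict)."""
        return all((self.banks[k] >> h(block_addr)) & 1
                   for k, h in enumerate(self.hashes))


read_set = ParallelBloomSignature()
for addr in range(1000):
    read_set.insert(addr)
```

A signature never misses an address that was inserted, but as more addresses are inserted a growing fraction of other addresses also test positive. Those positives are the false conflicts that LogTM-SE_Perfect removes and that per-block tokens avoid.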
Results
[Figure: bar chart of performance normalized to LogTM-SE_Perfect (y-axis 0 to 1.2) for LogTM-SE, LogTM-SE_Perfect, and TokenTM]
Small Transactions: Low Overhead
– Comparable on small transactions
Large Transactions: Localized Overhead
– Minor degradation with large transactions
TokenTM Conflict Detection
Large Transactions: Localized Overhead
– Accessible read/write set (potentially unbounded)
Small Transactions: Low Overhead
– Fast read/write set ops, e.g., add to read set, clear read set
Minimal Changes to Coherence / VM
– NO heavyweight eviction ops
– NO negative acks
– NO additional page tables
In the paper…
• Fast Token Release
• TM ‘virtualization’ events
– Context Switches, Paging etc.
• System V shared memory
• Long Running Critical Sections in server
workloads
• Fission/Fusion useful for other TM systems
– USTM [Baugh08], set Fault-on-Write UFO bit
without exclusive permission
Executive Summary
• Current Hardware TMs
– Most Transactions Small & Short Running
– Penalize large/long transactions
– Too restrictive for TM use up/down software stack?
• Hypothesis
– Must Support Efficient Large/Long Transactions As Well
– Is such an HTM even possible?
• Yes! TokenTM
1. LogTM’s Log to buffer unbounded values
2. Transactional Tokens for unbounded conflict detection
• Conflict state in memory metabits
• Concurrent updates via metastate fission/fusion
Common Token Ops
Actions by thread X | Before (Sum, TID) | After (Sum, TID)
Acquire One Token | (0, -) | (1, X)
Acquire T Tokens | (0, -) | (T, X)
Release One Token | (1, X) | (0, -)
Release One Token | (v, -) | (v-1, -)
Release T tokens | (T, X) | (0, -)
Conflicting Load | (T, Y), Y≠X | (T, Y), Y≠X
Conflicting Store | (v, -), v≠0 | (v, -), v≠0
Conflicting Store | (T, Y), Y≠X | (T, Y), Y≠X
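The common token operations can be read as a small transition function on the metastate. The sketch below covers the listed rows, plus one more acquire case, (v, -) to (v+1, -), which is an assumption based on the bookkeeping example where a second reader moves block A from (1, X) to (2, -). Function names and `T = 8` are hypothetical, `None` stands for "-", and a thread upgrading its own read token to a write is not modeled.

```python
T = 8  # tokens per block; assumed value, chosen > number of threads


def acquire_one(state, x):
    """Thread x takes one token. Returns (new_state, ok)."""
    total, _tid = state
    if total == T:  # another thread holds all T tokens: conflicting load
        return state, False
    if total == 0:
        return (1, x), True
    return (total + 1, None), True  # owner is forgotten once sum > 1


def acquire_all(state, x):
    """Thread x takes all T tokens. Returns (new_state, ok)."""
    if state == (0, None):
        return (T, x), True
    return state, False  # any outstanding token: conflicting store


def release_one(state):
    """Give back one read token; the result never names an owner."""
    total, _tid = state
    assert 0 < total < T, "no read token to release"
    return (total - 1, None)


def release_all(state, x):
    """Writer x gives back all T tokens."""
    assert state == (T, x), "only the writer can release T tokens"
    return (0, None)
```

Note that a conflicting operation leaves the metastate unchanged, which matches the identical Before and After columns for the conflict rows.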
Workload Characteristics
Benchmark | Input | Unit of Work | Units Measured | Num Xacts | Avg Read-Set | Avg Write-Set | Max Read-Set | Max Write-Set
Barnes | 512 bodies | parallel phase | 1 | 2,553 | 6.1 | 4.2 | 42 | 39
Cholesky | tk14.O | factorization | 1 | 60,203 | 2.4 | 1.7 | 6 | 4
Radiosity | batch | 1 task | 1024 | 21,786 | 1.8 | 1.5 | 25 | 24
Raytrace | teapot | parallel phase | 1 | 47,783 | 5.1 | 2.0 | 594 | 4
Delaunay | gen2.2-m30 | parallel phase | 1 | 16,384 | 51.4 | 38.8 | 507 | 345
Genome | g1024-s32-n65536 | parallel phase | 1 | 100,115 | 14.5 | 2.1 | 768 | 18
Vacation-Low | low contention | parallel phase | 1 | 16,399 | 70.7 | 18.1 | 162 | 75
Vacation-High | high contention | parallel phase | 1 | 16,399 | 99.1 | 18.6 | 331 | 80
TokenTM Overheads
Results
[Figure: bar chart of performance normalized to LogTM-SE_Perfect (y-axis 0 to 1.2) for LogTM-SE_2xH3, LogTM-SE_4xH3, and TokenTM]
– Comparable on small transactions
– Minor degradation with large transactions
Fast Release (optional)
Metastate as seen by Thread X:
(Sum, TID) | R | W | R’ | W’ | R+ | Attr
(0, -) | 0 | 0 | 0 | 0 | 0 | -
(u, -) | 1 | 0 | 0 | 0 | 1 | u-1
(u, -) | 0 | 0 | 0 | 0 | 1 | u
(1, X) | 1 | 0 | 0 | 0 | 0 | X
(1, Y) | 0 | 0 | 1 | 0 | 0 | Y
(T, X) | 0 | 1 | 0 | 0 | 0 | X
(T, Y) | 0 | 0 | 0 | 1 | 0 | Y
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Token log: A: 1, B: T
[Figure: Thread X's cache (Tag, Data, R, W, R’, W’, R+, Attr) holding A (0x..00..) and B (0x..01..), with per-thread registers Token LogPtr, TID (X), and a Fast-Release bit (1); Flash_Clear is applied to the cached metabits.]
Is Fast Release necessary?
[Figure: bar chart of performance normalized to LogTM-SE_Perfect (y-axis 0 to 1.2); bar labels not recoverable]
Token Operations
Double-entry Bookkeeping
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A; COMMIT_XACT
Software: Token logs
Thread X token log: A: 1, B: T
Thread Y token log: A: 1
Hardware: Metastate, shown next to the logical token state (successive animation states separated by →)
Block | Data | Logical Token State: Metadata <cx, cy, cz> | Metastate (Sum, TID)
A | 0x..00.. | <0,0,0> → <1,0,0> → <1,1,0> | (0, -) → (1, X) → (2, -)
B | 0x..00.. → 0x..11.. | <0,0,0> → <T,0,0> | (0, -) → (T, X)
C | 0x..10.. | <0,0,0> | (0, -)
Fission Rules
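Before | After: Copy1 | After: Copy2
(u, -) | (u, -) | (0, -)
(1, X) | (1, X) | (0, -)
(T, X) | (T, X) | (T, X)
• Assume Copy2 sent to new Reader
• Is there a writer? Copy2 answers this for the new reader: (0, -) means no writer, (T, X) means X is writing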
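The fission rules are small enough to state as one function. This is a sketch under assumptions: the name `fission`, the tuple encoding with `None` for "-", and `T = 8` are illustrative choices, not part of the original design.

```python
T = 8  # tokens per block; assumed value


def fission(state):
    """Split a metastate when a copy of the block is sent to a new reader.

    Returns (copy1, copy2): copy1 stays with the current holder and
    copy2 travels with the data to the new reader.
    """
    total, _tid = state
    if total == T:
        # A writer exists: both copies must show it so the reader detects the conflict.
        return state, state
    # No writer: the new reader starts from an empty partial count.
    return state, (0, None)


reader_copies = fission((1, "X"))
writer_copies = fission((T, "X"))
```

Read tokens stay in exactly one copy, so later adding the copies back together does not double-count them, while a writer's (T, X) is replicated so that every reader sees it.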
Fusion Rules
Copy1 \ Copy2 | (v, -) | (1, Y) | (T, Y)
(u, -) | (u + v, -) | (1, Y) if u = 0, else (u + 1, -) | (T, Y) if u = 0, else error
(1, X) | (1, X) if v = 0, else (v + 1, -) | (2, -) | error
(T, X) | (T, X) if v = 0, else error | error | (T, X) if X = Y, else error
• Add the two counts
• Forget token owner if count > 1
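The fusion rules can be sketched as a single function on two metastate copies. Names (`fuse`, `FusionError`), the tuple encoding with `None` for "-", and `T = 8` are assumptions for illustration; T is also assumed to exceed the number of threads, so read tokens alone never add up to T.

```python
T = 8  # tokens per block; assumed value, chosen > number of threads


class FusionError(Exception):
    """Raised for combinations the fusion rules mark as 'error'."""


def fuse(copy1, copy2):
    """Merge two metastate copies of the same block into one."""
    s1, t1 = copy1
    s2, t2 = copy2
    if s1 == T or s2 == T:
        # A writer's copy fuses only with an identical copy or an empty one.
        if copy1 == copy2:
            return copy1
        if s2 == 0:
            return copy1
        if s1 == 0:
            return copy2
        raise FusionError(f"cannot fuse {copy1} with {copy2}")
    total = s1 + s2  # add the two counts
    if total <= 1:
        # Keep the owner from whichever copy actually holds the token.
        return (total, t1 if s1 == 1 else t2)
    return (total, None)  # forget token owner if count > 1


two_readers = fuse((1, "X"), (1, "Y"))
```

In the running example, fusing X's (1, X) with Y's (1, Y) gives (2, -), so Y's store sees the complete count and finds that it cannot collect all T tokens.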
Metastate Fusion
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A (conflict: insufficient tokens); COMMIT_XACT
Software: Token logs (Thread X: A: 1; Thread Y: A: 1)
Hardware: [Figure: Thread Y's Store A sends Upgrade A; the directory sends Inv A to P1, whose copy of A goes from Owned to Invalid and which replies with Ack A carrying its metastate (1, X). Fusion at P2 combines (1, Y) with (1, X) into (2, -), and P2's copy goes from Shared to Modified. Directory: A goes from Shared @ P1,P2 to Modified @ P2. With Sum = 2, the store finds insufficient tokens and signals a conflict.]
Modifying Hardware Metastate (Take 1)
Thread X: BEGIN_XACT; Load A; Store B; COMMIT_XACT
Thread Y: BEGIN_XACT; Load A; Store A; COMMIT_XACT
Software: Token logs (Thread X: A: 1)
Hardware: [Figure: Thread X's Load A sends GETS A; the directory returns DATA A, and A goes from Not Present to Exclusive @ P1. Private caches hold Tag, Coherence State, and Data; the metastate (Sum, TID) is held in main memory and is updated from (0, -) to (1, X).]
Extra main memory access on every metastate update