TokenTM: Token-Based Hardware Transactional Memory
Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood
Multifacet Project (www.cs.wisc.edu/multifacet)
Dept. of Computer Sciences, University of Wisconsin-Madison

Executive Summary
• Current Hardware TMs
  – Most Transactions Small & Short Running
  – Penalize large/long transactions
  – Too restrictive for widespread TM use?
• Hypothesis
  – Must Support Efficient Large/Long Transactions As Well
  – Is such an HTM even possible?
• Yes! TokenTM
  1. LogTM’s Log to buffer unbounded values
  2. Transactional Tokens for unbounded conflict detection
     • Conflict state in memory metabits
     • Concurrent updates via metastate fission/fusion
7/26/2016 Wisconsin Multifacet Project

Existing HTM Systems
Assumption: Most transactions small & short running
Optimized for small transactions
Degrade with large, long-running transactions
• Non-localized Overhead, e.g., LogTM-SE [Yen07] false conflicts, OneTM [Blundel07] serializes
• Complex, Expensive Operations, e.g., XTM [Chung06] & PTM [Chuang06] manipulate page tables
Premature Optimization?
© 2008 Multifacet Project University of Wisconsin-Madison

Why Large Transactions?
• Programmers may want large (>> cache) and/or long (>> ctx switch) transactions
  – HLL transactions invoke unpredictable lower-level code
  – Replace critical sections containing syscalls or I/O
  – Avoid concurrency bugs [Lu08]
• But “Most transactions small & short running”
  – Restrict TM to use by gurus (like OS spin locks)?
  – Self-fulfilling prophecy?
Must Support Efficient Large/Long Transactions As Well

Toward a Large-Transaction TM
Efficiently detect conflicts between in-flight transactions using Read/Write Sets
• Unbounded
• Globally accessible
Goals:
• Small Transactions: Low Overhead
  – Fast read/write set ops.
    E.g., Add to read set, Clear read set
• Large Transactions: Localized Overhead
  – Accessible read/write set (potentially unbounded)
• Minimal Changes to Coherence / VM
  – NO heavyweight eviction ops, negative acks, or additional page tables

Existing Mechanisms
Synergy between cache coherence and conflict detection
Hence, overload cache coherence
+ Excellent for bounded/small TM
But,
- ‘Virtualization’ on overflows
- Tough to access ‘virtualized’ state
Goals:
• Small Transactions: Low Overhead
• × Minimal Changes to Coherence / VM
• × Large Transactions: Localized Overhead

TokenTM: a Large-Transaction TM
• New Conflict Detection Mechanism (This Talk)
  – Transactional Tokens in Tagged Memory
  – Token Coherence [Martin03] at different level
• Version Management
  – Save old/new values for unbounded Write set
  – LogTM [Moore06] undo log

Outline
• Motivation
• Design
  – Token-Based Conflict Detection
  – Metadata Storage
• Implementation
• Results

Transactional Tokens
• Challenge: How to efficiently track Read/Write sets?
• Token Coherence [Martin03]
  – Read/Write sets for cache coherence
• Solution: Transactional Tokens
  – T tokens per memory block
  – At least one token to read, all T tokens to write (token conflict detection)
  – Token Metadata <c0, c1, …, ci, …>, where 0 ≤ ci ≤ T is the count of tokens held by the thread with TID i

Tagged Memory
• Challenge: Where to store Unbounded, Globally Accessible Token Metadata?
• Virtual Memory
  – unbounded and globally accessible
• Solution, similar to OneTM [Blundel07]
  – Tag Virtual Memory
  – Piggyback on existing Virtual Memory and Cache Coherence mechanisms

TokenTM Logical Operation
[Diagram: Thread X runs BEGIN_XACT, Load A, Store B, COMMIT_XACT. Thread Y runs BEGIN_XACT, Load A, Store A, COMMIT_XACT. Each thread has an undo log. Shared memory holds, per block, Data and Metadata <cx, cy, …>.]
• Block A (data 0x..00..): metadata <0,0,…> -> <1,0,…> after X's load -> <1,1,…> after Y's load
• Block B (data 0x..00.. -> 0x..11..): metadata <0,0,…> -> <T,0,…> after X's store, which is recorded in X's undo log
• Block C (data 0x..10..): metadata <0,0,…>
• Y's Store A finds insufficient tokens: ABORT

Storing Metadata
[Diagram: the same two threads, now each with an undo log and a token log in software. Tagged memory in hardware holds, per block, Data and Metastate (Sum, TID); blocks A, B, and C start at (0, -).]
• Software: per-thread token counts (Cx, Cy) are unbounded but difficult to access globally
• Hardware: the metastate is a lossy summary of <cx, cy, …>, concise and accessible

Hardware Metastate
• Metadata summary (sum, TID)
  – sum, total number of tokens acquired
  – TID, identify owner when sum = 1 or sum = T (optional)
Some summaries:
  <c0, c1, …, ci, …>    (sum, TID)
  <0, 0, 0, 0>          (0, -)
  <0, 0, 1, 0>          (1, 2)
  <0, T, 0, 0>          (T, 1)
  <0, 1, 1, 1>          (3, -)
• Concise -> Stored in packed field (e.g., State[1:2], Attr[3:16])
• Fast -> Accessed as part of normal memory operation

Token Logs
• Distributed structures for unbounded Read/Write sets
  – per-thread
  – stored in program memory (e.g., heap)
  – list of <address, num_tokens>, e.g., token log: A: 1, B: T
• Accessible to hardware for fast ops
  – Add to read set -> Append to token log

Double-entry Bookkeeping (Keeping Metadata Consistent)
[Diagram: Thread X runs BEGIN_XACT, Load A, Store B,
COMMIT_XACT; Thread Y runs BEGIN_XACT, Load A, Store A, COMMIT_XACT. Both loads of A are shown.]
• Software: each load appends A: 1 to the loading thread's token log
• Logical token state for A: <0,0,…> -> <1,0,…> -> <1,1,…>
• Hardware metastate (Sum, TID) for A: (0, -) -> (1, X) -> (2, -); B and C remain (0, -)

Outline
• Motivation
• Design
• Implementation
  – Metastate Fission/Fusion
• Results

Implementing Hardware Metastate
[Diagram: private caches (Tag, Coherence State, Data, Sum, TID), a directory, and main memory (Data, Sum, TID). X's Load A issues GETS A; the block arrives with metastate (0, -), and X's cache updates its copy to (1, X). Y's Load A then issues GETS A, which the directory forwards to P1 (Fwd_GETS A); the directory entry for A goes from Not Present to Exclusive @ P1 to Shared @ P1,P2.]
• Shared copies cannot update metastate
• Solution: Fission / Fusion

Metastate Fission
[Diagram: on Y's Load A (GETS A, forwarded to P1 as Fwd_GETS A), P1's metastate (1, X) undergoes fission: P1 keeps (1, X) with its copy (Modified -> Owned) and sends (0, -) with the data. P2 holds A Shared and updates its own copy from (0, -) to (1, Y). Directory entry for A: Exclusive @ P1 -> Shared @ P1,P2. Both token logs hold A: 1.]

Metastate Fusion
• Metastate Fusion
  – On store, metastate copies fused back
• Why does fission/fusion work?
  – Store sees ‘complete’ metastate
  – Load sees
    • ‘complete’ metastate, if writer exists
    • ‘partial’ metastate, otherwise

Hardware Cost
• Additional metabits in caches/memory
  – Recoded ECC to cull metabits
• Changes to coherence protocols
  – Additional payload on messages
  – Minimal changes to protocol logic
• Requires non-silent eviction

Outline
• Motivation
• Design
• Implementation
• Results
Do we meet the two performance goals?
  – Small Transactions: Low Overhead
  – Large Transactions: Localized Overhead

Evaluation Methodology
• Methodology
  – Full System Simulation
  – Multifacet GEMS
• Base System
  – 32-core CMP system, in-order, single-issue cores
  – Private 4-way 32KB writeback split I&D L1 caches
  – Shared 8-way 8 MB writeback L2
  – On-chip directory @ L2, MESI coherence
  – Packet-switched interconnect in a tiled topology

TM Systems
• LogTM-SE [Yen07] variant
  – Parallel Bloom Filters for conflict detection
  – Four 2Kbit H3 filters
  + Compact, less hardware overhead
  - False Conflicts
• LogTM-SE_Perfect
  + No False Conflicts
  - Unimplementable
• TokenTM

Results
[Figure: bar chart of performance normalized to LogTM-SE_Perfect for LogTM-SE, LogTM-SE_Perfect, and TokenTM.]
• Small Transactions: Low Overhead. Comparable on small transactions
• Large Transactions: Localized Overhead. Minor degradation with large transactions

TokenTM Conflict Detection
• Small Transactions: Low Overhead
  – Fast read/write set ops.
    E.g., Add to read set, Clear read set
• Large Transactions: Localized Overhead
  – Accessible read/write set (potentially unbounded)
• Minimal Changes to Coherence / VM
  – NO heavyweight eviction ops, negative acks, or additional page tables

In the paper…
• Fast Token Release
• TM ‘virtualization’ events
  – Context Switches, Paging etc.
• System V shared memory
• Long Running Critical Sections in server workloads
• Fission/Fusion useful for other TM systems
  – USTM [Baugh08], set Fault-on-Write UFO bit without exclusive permission

Executive Summary
• Current Hardware TMs
  – Most Transactions Small & Short Running
  – Penalize large/long transactions
  – Too restrictive for TM use up/down software stack?
• Hypothesis
  – Must Support Efficient Large/Long Transactions As Well
  – Is such an HTM even possible?
• Yes! TokenTM
  1. LogTM’s Log to buffer unbounded values
  2. Transactional Tokens for unbounded conflict detection
     • Conflict state in memory metabits
     • Concurrent updates via metastate fission/fusion

Common Token Ops
  Action by thread X    Before (Sum, TID)    After (Sum, TID)
  Acquire One Token     (0, -)               (1, X)
  Acquire T Tokens      (0, -)               (T, X)
  Release One Token     (1, X)               (0, -)
                        (v, -)               (v-1, -)
  Release T Tokens      (T, X)               (0, -)
  Conflicting Load      (T, Y), Y≠X          (T, Y), Y≠X
  Conflicting Store     (v, -), v≠0          (v, -), v≠0
                        (T, Y), Y≠X          (T, Y), Y≠X

Workload Characteristics
  Benchmark      Input              Unit of Work / Units Measured   Num Xacts   Avg Read-Set   Avg Write-Set   Max Read-Set   Max Write-Set
  Barnes         512 bodies         parallel phase / 1              2,553       6.1            4.2             42             39
  Cholesky       tk14.O             factorization / 1               60,203      2.4            1.7             6              4
  Radiosity      batch              1 task / 1024                   21,786      1.8            1.5             25             24
  Raytrace       teapot             parallel phase / 1              47,783      5.1            2.0             594            4
  Delaunay       gen2.2-m30         parallel phase / 1              16,384      51.4           38.8            507            345
  Genome         g1024-s32-n65536   parallel phase / 1              100,115     14.5           2.1             768            18
  Vacation-Low   low contention     parallel phase / 1              16,399      70.7           18.1            162            75
  Vacation-High  high contention    parallel phase / 1              16,399      99.1           18.6            331            80

TokenTM Overheads
[Figure: TokenTM overheads chart.]

Results
[Figure: bar chart of performance normalized to LogTM-SE_Perfect for LogTM-SE_2xH3, LogTM-SE_4xH3, and TokenTM.]
• Comparable on small transactions
• Minor degradation with large transactions

Fast Release (optional)
Metastate encoding as seen by thread X:
  (Sum, TID)   R   W   R’  W’  R+  Attr
  (0, -)       0   0   0   0   0   -
  (u, -)       1   0   0   0   1   u-1
  (u, -)       0   0   0   0   1   u
  (1, X)       1   0   0   0   0   X
  (1, Y)       0   0   1   0   0   Y
  (T, X)       0   1   0   0   0   X
  (T, Y)       0   0   0   1   0   Y
[Diagram: Thread X runs BEGIN_XACT, Load A, Store B, COMMIT_XACT with token log A: 1, B: T. Per-thread state includes the token log pointer, TID, and a Fast-Release bit. At commit, Flash_Clear resets the R and W bits of X's cached blocks A and B.]

Is Fast Release necessary?
[Figure: bar chart of performance normalized to LogTM-SE_Perfect.]

Token Operations: Double-entry Bookkeeping
[Diagram: Thread X runs BEGIN_XACT, Load A, Store B, COMMIT_XACT; Thread Y runs BEGIN_XACT, Load A, Store A, COMMIT_XACT.]
• Software token logs: X holds A: 1, B: T; Y holds A: 1
• Logical token state <cx, cy, cz>: A <0,0,0> -> <1,0,0> -> <1,1,0>; B <0,0,0> -> <T,0,0>; C <0,0,0>
• Hardware metastate (Sum, TID): A (0, -) -> (1, X) -> (2, -); B (0, -) -> (T, X); C (0, -)
• Data: A 0x..00..; B 0x..00.. -> 0x..11..; C 0x..10..

Fission Rules
  Before    After: Copy1   After: Copy2
  (u, -)    (u, -)         (0, -)          No writer
  (1, X)    (1, X)         (0, -)          No writer
  (T, X)    (T, X)         (T, X)          Is there a writer?
• Assume Copy2 sent to new Reader
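
The fission rules above can be sketched as a small function. This is an illustrative model only, not the hardware design: the `Metastate` tuple, the function name `fission`, and the concrete value `T = 4` are assumptions made for this example, since the slides leave T symbolic.

```python
from typing import NamedTuple, Optional, Tuple

T = 4  # tokens per block; hypothetical value, the slides leave T symbolic


class Metastate(NamedTuple):
    """Hardware summary (sum, TID); tid is None where the slides write '-'."""
    total: int
    tid: Optional[str]


def fission(before: Metastate) -> Tuple[Metastate, Metastate]:
    """Split a metastate when a copy of the block is sent to a new reader.

    Returns (copy1, copy2); copy2 is the one that travels to the reader.
    """
    if before.total == T and before.tid is not None:
        # A writer holds all T tokens: both copies keep (T, X), so the
        # new reader's load still sees the writer.
        return before, before
    # No writer: copy1 keeps every token, the reader's copy starts at (0, -).
    return before, Metastate(0, None)
```

In this sketch the reader-side copy only carries full information when a writer exists, which mirrors the "Load sees 'complete' metastate, if writer exists" point made on the Metastate Fusion slide.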
Fusion Rules
  Copy1 \ Copy2   (v, -)                  (1, Y)                  (T, Y)
  (u, -)          (u + v, -)              (1, Y) if u = 0         (T, Y) if u = 0
                                          (u + 1, -) else         error else
  (1, X)          (1, X) if v = 0         (2, -)                  error
                  (v + 1, -) else
  (T, X)          (T, X) if v = 0         error                   (T, X) if X = Y
                  error else                                      error else
• Add the two counts
• Forget token owner if count > 1

Metastate Fusion
[Diagram: Thread Y's Store A issues Upgrade A to the directory, which sends Inv A to P1. P1's copy goes to Invalid and returns Ack A carrying its metastate (1, X). P2 fuses (1, Y) with (1, X) to get (2, -) and moves A from Shared to Modified. Directory entry for A: Shared @ P1,P2 -> Modified @ P2. Both token logs hold A: 1.]
• The fused metastate (2, -) shows insufficient tokens for Y's store: Conflict

Modifying Hardware Metastate (Take 1)
[Diagram: Thread X's Load A issues GETS A; the directory entry for A goes from Not Present to Exclusive @ P1, and DATA A is returned to P1's cache. The metastate in main memory (Sum, TID) is updated from (0, -) to (1, X).]
• Extra main memory access on every metastate update
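
The Fusion Rules table in these backup slides can be sketched in the same spirit. Again this is a hedged model, not the protocol implementation: `Metastate`, `FusionError`, `fuse`, and the value `T = 4` are names and choices assumed for the example.

```python
from typing import NamedTuple, Optional

T = 4  # tokens per block; hypothetical value, the slides leave T symbolic


class Metastate(NamedTuple):
    """Hardware summary (sum, TID); tid is None where the slides write '-'."""
    total: int
    tid: Optional[str]


class FusionError(Exception):
    """Raised for the combinations the Fusion Rules table marks 'error'."""


def is_writer(m: Metastate) -> bool:
    return m.total == T and m.tid is not None


def fuse(a: Metastate, b: Metastate) -> Metastate:
    """Merge two metastate copies of one block, following the table."""
    if is_writer(a) or is_writer(b):
        if a == b:
            return a  # two fissioned copies of the same writer's (T, X)
        if is_writer(a) and b.total == 0:
            return a
        if is_writer(b) and a.total == 0:
            return b
        raise FusionError(f"cannot fuse {a} with {b}")
    total = a.total + b.total  # add the two counts
    if total == 1:
        return a if a.total == 1 else b  # a single holder keeps its TID
    return Metastate(total, None)  # forget the owner if count > 1
```

For the scenario on the Metastate Fusion slide, `fuse(Metastate(1, "Y"), Metastate(1, "X"))` gives `Metastate(2, None)`, so a store by Y would see tokens held by another thread.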