Dependence-Aware Transactional Memory for Increased Concurrency
Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel
University of Texas, Austin

Concurrency Conundrum
• Challenge: CMP ubiquity (2 cores, 16 cores, 80 cores… neat!)
• Parallel programming with locks and threads is difficult
  – deadlock, livelock, convoys…
  – lock ordering, poor composability
  – performance-complexity tradeoff
• Transactional Memory (HTM)
  – simpler programming model
  – removes pitfalls of locking
  – coarse-grain transactions can perform well under low contention

High contention: optimism warranted?
• TM performs poorly with write-shared data
  – increasing core counts make this worse
• Write sharing is common
  – statistics, reference counters, lists…
• Solutions complicate the programming model
  – open-nesting, boosting, early release, ANTs…
[Quadrant figure: locks vs. transactions under read sharing and write sharing]
TMs need not always serialize write-shared transactions! Dependence-awareness can help the programmer (transparently!)

Outline
• Motivation
• Dependence-Aware Transactions
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion

Two threads sharing a counter
Initially: count == 0
Thread A: load count,r / inc r / store r,count
Thread B: load count,r / inc r / store r,count
Result: count == 2
Schedule:
• dynamic instruction sequence
• models concurrency by interleaving instructions from multiple threads

TM shared counter
Initially: count == 0
Tx A: xbegin / load count,r / inc r / store r,count / xend
Tx B: xbegin / load count,r (Conflict!) / inc r / store r,count / xend
Conflict:
• both transactions access a datum, and at least one access is a write
• the read and write sets of the transactions intersect
Current HTM designs cannot accept such schedules, despite the fact that they yield the correct result!

Does this really matter?
Critsec A: xbegin / <lots of work> / load count,r / inc r / store r,count / xend
Critsec B: xbegin / <lots of work> / load count,r / inc r / store r,count / xend
Common pattern!
• Statistics
• Linked lists
• Garbage collectors
DATM: use dependences to commit conflicting transactions

DATM shared counter
Initially: count == 0
T1: xbegin / load count,r / inc r / store r,count / xend
T2: xbegin / load count,r / inc r / store r,count / xend
Forward speculative data from T1 to satisfy T2’s load, and model the resulting constraints as dependences:
• T1 must commit before T2
• if T1 aborts, T2 must also abort
• if T1 overwrites count, T2 must abort

Conflicts become Dependences
• Read-Write (RW): A commits before B
• Write-Read (WR): forward data; overwrite causes abort; A commits before B
• Write-Write (WW): A commits before B
Dependence Graph
• transactions: nodes
• dependences: edges
• edges dictate commit order
• no cycles: conflict serializable

Enforcing ordering using Dependence Graphs
T1 stores X, then T2 loads X: a WR dependence from T1 to T2, so T1 must serialize before T2. T2’s xend waits for T1 to commit (outstanding dependences).

Enforcing consistency using Dependence Graphs
T1 and T2 both load X, then both store X: an RW edge from T1 to T2 and a WW edge from T2 to T1 form a CYCLE (lost update).
• cycle: not conflict serializable
• restart some transaction to break the cycle
• invoke the contention manager

Theoretical Foundation
• Correctness and optimality results: proof, see [PPoPP 09]
• Serializability: concurrent interleaving equivalent to a serial interleaving
• Conflict-equivalent: same operations, same order of conflicting operations
• Conflict-serializability: conflict-equivalent to a serial interleaving
• Two-phase locking (2PL): a conservative implementation of conflict serializability
Serializable ⊃ Conflict Serializable ⊃ 2PL (LogTM, TCC, RTM, MetaTM, OneTM…)
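The "no cycles: conflict serializable" rule above can be sketched in software. This is a minimal illustrative model, not the DATM hardware: it extracts RW/WR/WW edges from an interleaved schedule and tests the dependence graph for acyclicity with a DFS. All names (`dependence_edges`, `is_conflict_serializable`, the tuple encoding) are invented for this sketch.

```python
def dependence_edges(schedule):
    """schedule: list of (txid, op, addr) with op in {'r', 'w'}.
    Emits an edge (t1, t2) when an access by t1 conflicts with a
    later access by t2 (same address, at least one is a write)."""
    edges = set()
    for i, (t1, op1, a1) in enumerate(schedule):
        for t2, op2, a2 in schedule[i + 1:]:
            if t1 != t2 and a1 == a2 and (op1 == 'w' or op2 == 'w'):
                edges.add((t1, t2))
    return edges

def is_conflict_serializable(schedule):
    """True iff the dependence graph has no cycle (DFS back-edge check)."""
    edges = dependence_edges(schedule)
    nodes = {t for t, _, _ in schedule}
    adj = {n: [d for (s, d) in edges if s == n] for n in nodes}
    state = {n: 0 for n in nodes}      # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True            # back edge: cycle
        state[n] = 2
        return False
    return not any(state[n] == 0 and dfs(n) for n in nodes)

# The forwarding schedule: T1's store is forwarded to T2, T1 commits first.
ok = [('T1', 'r', 'count'), ('T1', 'w', 'count'),
      ('T2', 'r', 'count'), ('T2', 'w', 'count')]
# The lost update: both read before either writes, producing an RW/WW cycle.
bad = [('T1', 'r', 'X'), ('T2', 'r', 'X'),
       ('T1', 'w', 'X'), ('T2', 'w', 'X')]
```

Running the checker on the two schedules from the slides reports the forwarding schedule as serializable and the lost update as not.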
Outline
• Motivation/Background
• Dependence-Aware Model
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion

DATM Requirements
Mechanisms:
• maintain a global dependence graph
  – conservative approximation
  – detect cycles, enforce ordering constraints
• create dependences
• forwarding/receiving
• buffer speculative data
Implementation: coherence protocol, L1 cache, per-core hardware structures

Dependence-Aware Microarchitecture
• Order vector: dependence graph (a topological sort, conservatively approximated)
• TXID, access bits: versioning, conflict detection
• TXSW: transaction status word
• Frc-ND + ND bits: disable dependences (no active dependences)
These are not programmer-visible!
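The order vector above maintains a conservative topological order over live transactions. A minimal software sketch of that idea follows; the class name, API, and the policy of rejecting any dependence that contradicts the current order (rather than re-sorting, which is what makes it conservative) are assumptions of this sketch, not the hardware design.

```python
class OrderVector:
    """Toy model of a per-node order vector: a list of transaction IDs,
    earliest committer first, approximating the dependence graph."""

    def __init__(self):
        self.order = []                 # txids in required commit order

    def add_dependence(self, src, dst):
        """Record 'src must commit before dst'. Returns False when the
        new edge contradicts the existing order, i.e. a potential cycle;
        the caller (contention manager) would then abort a transaction."""
        for tx in (src, dst):
            if tx not in self.order:
                self.order.append(tx)
        return self.order.index(src) < self.order.index(dst)

    def may_commit(self, tx):
        """A transaction may commit only when no one must commit before it."""
        return bool(self.order) and self.order[0] == tx

    def commit(self, tx):
        self.order.remove(tx)
```

For the shared-counter example: after `add_dependence('A', 'B')`, transaction A may commit but B must stall at xend until A commits; a later `add_dependence('B', 'A')` is flagged as a cycle.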
FRMSI protocol: forwarding and receiving
• MSI states
• transactional states (T*): TS, TM, TMM
• forwarded (T*F): TMF, TMRF
• received (T*R*): TR, TMR, TMRF
• committing: CTM
• bus messages: TXOVW, xABT, xCMT
[State diagram: MSI states M, S, I; transactional states TM, TS, TR; modified-by-tx states TMM, CTM; forwarded states TMF, TMRF; received states TMR]

Converting Conflicts to Dependences
core 0 (TxA): xbegin / ld R, cnt / inc cnt / st cnt, R / xend
core 1 (TxB): xbegin / ld R, cnt / inc cnt / st cnt, R / xend (stall: outstanding dependence)
Core 0’s L1 holds cnt (101) in TMF with order vector [A,B]; core 1’s L1 holds the forwarded value in TMR; main memory still holds cnt == 100. Core 1 stalls at xend until core 0’s xCMT(A) arrives.

Outline
• Motivation/Background
• Dependence-Aware Model
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion

Experimental Setup
• Implemented DATM by extending MetaTM; compared against MetaTM [Ramadan 2007]
• Simulation environment
  – Simics 3.0.30 machine simulator
  – x86 ISA with HTM instructions
  – 32KB 4-way transactional L1 cache; 4MB 4-way L2; 1GB RAM
  – 1 cycle/instruction, 24 cycles/L2 hit, 350 cycles/main memory
  – 8, 16 processors
• Benchmarks
  – micro-benchmarks
  – STAMP [Minh ISCA 2007]
  – TxLinux: Transactional OS
[Rossbach SOSP 2007]

DATM speedup, 16 CPUs
pmake 2%, labyrinth 3%, vacation 22%, bayes 39%, counter 15x

Eliminating wasted work (16 CPUs, lower is better)
Restarts per transaction and average backed-out cycles per transaction both drop across pmake (speedup 2%), labyrinth (3%), vacation (22%), bayes (39%), and counter (15x).

Dependence-Aware Contention Management
Cascaded abort? With timestamp contention management, 5 transactions (T7, T6, T3, T4, T1) must restart behind T2. With dependence awareness, the Order Vector (T2, T3, T7, T6, T4, T1) lets the conflicting transactions proceed in commit order instead.

Contention Management
[Bar chart vs. Polka and Eruption: dependence-aware contention management gains 101% on bayes and 14% on vacation, while Polka and Eruption show slowdowns on these workloads]

Outline
• Motivation/Background
• Dependence-Aware Model
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion

Related Work
• HTM: TCC [Hammond 04], LogTM[-SE] [Moore 06], VTM [Rajwar 05], MetaTM [Ramadan 07, Rossbach 07], HASTM, PTM, HyTM, RTM/FlexTM [Shriraman 07, 08]
• TLS: Hydra [Olukotun 99], Stampede [Steffan 98, 00], Multiscalar [Sohi 95], [Garzaran 05], [Renau 05]
• TM model extensions: privatization [Spear 07], early release [Skare 06], escape actions [Zilles 06], open/closed nesting [Moss 85, Menon 07], Galois [Kulkarni 07], boosting [Herlihy 08], ANTs [Harris 07]
• TM + conflict serializability: Adaptive TSTM [Aydonat 08]

Conclusion
• DATM can commit conflicting transactions
• improves write-sharing performance
• transparent to the programmer
• DATM prototype demonstrates performance benefits
Source code available! www.metatm.net

Broadcast and Scalability
Yes. We broadcast.
But it’s less than you might think…
Because each node maintains all dependences, this design uses broadcast for:
• commit
• abort
• TXOVW (broadcast writes)
• forward restarts
• new dependences
These sources of broadcast could be avoided in a directory protocol: keep only the relevant subset of the dependence graph at each node.

FRMSI Transient States
19 transient states; sources:
• forwarding
• overwrite of forwarded lines (TXOVW)
• transitions from MSI to T* states
Ameliorating factors:
• best-effort: local evictions abort, which reduces writeback states
• T*R* states are isolated
• transitions back to MSI (abort, commit) require no transient states
• signatures for forward/receive sets could eliminate five states

Why isn’t this TLS?
TLS, DATM, and TM share mechanisms (forward data between threads; detect when reads occur too early; discard speculative state after violations; retire speculative writes in order; memory renaming; multi-threaded workloads; programmer/compiler transparency), but:
1. DATM transactions can commit in arbitrary order; TLS has the daunting task of maintaining program order for epochs.
2. TLS forwards values only when they are the globally committed value (i.e., through memory).
3. DATM has no need to detect "early" reads (before a forwardable value is produced).
4. The notion of "violation" differs: DATM must roll back when conflict serializability is violated; TLS, when epoch ordering is.
5. Retiring writes in order is a similar issue, but DATM uses the CTM state and needs no speculative write buffers.
6. TLS needs memory renaming to keep older threads from seeing new values; DATM has no such issue.

Doesn’t this lead to cascading aborts?
YES, but cascades are rare in most workloads.

Inconsistent Reads
OS modifications:
• page fault handler
• signal handler

Increasing Concurrency
[Timeline figure: two conflicting transactions T1 and T2 under three designs]
• Eager/Eager (e.g., MetaTM, LogTM): updates in-place, conflict detection at time of reference; conflict, arbitrate, execution serialized
• Lazy/Lazy (e.g., TCC): updates buffered, conflict detection at commit time; conflict at commit, arbitrate, loser retries
• DATM: execution overlapped; no retry!

Design space highlights
• cache granularity: eliminate per-word access bits
• timestamp-based dependences: eliminate the Order Vector
• dependence-aware contention management

Design Points
[Bar chart: normalized performance of bayes, vacation, counter, and counter-tt under Cache-gran, Timestamp-fwd, and full DATM]

Hardware TM Primer
Key ideas:
• critical sections execute concurrently
• conflicts are detected dynamically
• if conflict serializability is violated, roll back
Key abstractions:
• primitives: xbegin, xend, xretry
• conflict
• contention manager: need flexible policy
"Conventional wisdom transactionalization": replace locks with transactions

Hardware TM basics: example
cpu0: xbegin; read A; read B; if(cpu % 2) write C; else read C; …; xend
cpu1: xbegin; read A; read B; if(cpu % 2) write C; else read C; …; xend
CONFLICT: C is in the read set of cpu0 and in the write set of cpu1. Assume the contention manager decides cpu1 wins: cpu0 rolls back, cpu1 commits.
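The read/write-set conflict rule in the example above can be sketched as a few lines of software. This is a simplified model of conventional HTM conflict detection (not DATM, and not the cache-based hardware); the `Tx` class and `conflicts` helper are invented for illustration.

```python
class Tx:
    """Toy transaction tracking its read and write sets."""
    def __init__(self, name):
        self.name, self.rset, self.wset = name, set(), set()

    def read(self, addr):
        self.rset.add(addr)

    def write(self, addr):
        self.wset.add(addr)

def conflicts(t1, t2):
    """True iff some datum is accessed by both transactions and at
    least one of the accesses is a write."""
    return bool(t1.wset & (t2.rset | t2.wset) or t2.wset & t1.rset)

# The slide's example: both cpus read A and B; cpu1 takes the if-branch
# and writes C, while cpu0 takes the else-branch and reads C.
cpu0, cpu1 = Tx('cpu0'), Tx('cpu1')
for t in (cpu0, cpu1):
    t.read('A')
    t.read('B')
cpu1.write('C')
cpu0.read('C')   # C is now in W{cpu1} and R{cpu0}: a conflict
```

A conventional HTM would now invoke the contention manager; DATM would instead record a WR dependence from cpu1 to cpu0 and let both proceed.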