Hybrid Transactional Memory
Nir Shavit
MIT and Tel-Aviv University
Joint work with Alex Matveev (and describing the work of many in this summer school)

Haswell Transactional Memory [HerlihyMoss93]

Transactional Memory
• Memory transactions are collections of reads and writes executed atomically
• Should provide
  – Disjoint access parallelism
• Should maintain internal and external consistency
  – External (serializability): with respect to the interleavings of other transactions
  – Internal (opacity): the transaction itself should operate on a consistent state

External Consistency
[Diagram: application memory with X = 0 and Y = 0]
Transaction A: Read y; Write x = 4; Return x+y
Transaction B: Read x; Write y = 4; Return x+y
Cannot both return 4
Canonical synchronization problem all STM/HTM implementations must solve

Locking STMs
[Diagram: application memory mapped to an array of versioned write-locks (V#)]

Commit Time Locking (Write Buff)
[Diagram: memory locations X and Y with their versioned locks (V#, V#+1)]
1. To read/write: check unlocked, add to read/write set
2. Acquire locks
3. Validate read/write v#’s unchanged
4. Write values
5. Release each lock with v#+1
Timeline: Read/Write, Lock, Validate, Write, Unlock

Internal Inconsistency (Opacity) [GuerraouiKapalka07]
[Diagram: X shows 8 and 4; Y shows 4]
Transaction A: Write x = 4
Transaction B: Read x; Read y; Compute z = 1/(x-y)
Transaction A: Write y = 2
DIV by 0 ERROR!

TL2/TinySTM’s Global Clock [DiceShalevShavit06/RiegelFelberFetzer06]
• Have a shared global version clock
• Incremented by writing transactions (as infrequently as possible)
• Read by all transactions
• Used to validate that the state viewed by a transaction is always opaque

TL2 Style STM
[Diagram: memory locations X and Y, their versioned locks, and the global VClock]
1. Read VClock
2. Read/Write: if unlocked and v# less than clock, add to read/write set
3. Acquire locks
4. Increment clock
5. Validate each v# less than clock
6. Write values
7. Release locks with v# = new clock
Timeline: Read Clock, Read/Write, Lock, Inc, Validate, Write, Unlock

TL2 Style STM
• Advantages
  – Great disjoint access parallelism
• Disadvantages
  – Accessing meta-data is expensive
  – Progress guarantee is only deadlock freedom

NOrec STM [DalessandroSpearScott10]
• Use shared global clock as a seqlock
• Validation in every read if a seqlock change is detected
• Value-based validation: no need for meta-data (local time stamps or locks)

NOrec STM
[Diagram: R/W set holding the values of X, Y, Z; seqlock values 100 to 104]
Timeline: Not odd? seqlock; Read/Write (with validation if seqlock changed); Lock seqlock (set odd), with validation if seqlock changed; Write; Unlock seqlock (set even)

NOrec STM
• Advantages
  – No expensive meta-data
• Disadvantages
  – Poor disjoint access parallelism (all writes are serialized by the clock)
  – Progress guarantee is only starvation freedom

Hardware TM [HerlihyMoss93, IBM/Intel13]
• Advantages
  – Everything in hardware, no meta-data
  – Great disjoint access parallelism
• Disadvantages
  – No progress guarantee; transactions fail because of:
    • Unsupported instructions: system or protected instructions
    • Exceptions: page faults and similar
    • Capacity limit: too many accessed locations

Hybrid TM [Moir, Damron et al., Kumar et al.]
• Fast-path: execute the transaction using best-effort HTM
  – If it aborts because of special instructions or because the transaction is too large, then…
• Slow-path: execute the transaction using STM
Performance of HTM with progress guarantee of STM

Traditional Hybrid TM [DamronFedorovaLevLuchangcoMoirNussbaum06]
Hardware transaction: test the VersionedWrite-Lock in every read/write; update it in every write.
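The hardware-path instrumentation described here can be sketched as a toy model. This is a minimal illustration, not the authors' implementation: no real HTM is used, the abort is simulated with an exception, and the names `Memory`, `hw_read`, `hw_write`, `HtmAbort`, and the low-bit lock encoding are all assumptions made for the example.

```python
LOCKED = 1  # assumption: the low bit of a lock word means "held by a software transaction"


class HtmAbort(Exception):
    """Stands in for a hardware transaction abort (no real HTM is used here)."""


class Memory:
    def __init__(self, n_locks=8):
        self.data = {}              # application memory: address -> value
        self.locks = [0] * n_locks  # versioned write-locks: (version << 1) | LOCKED

    def lock_index(self, addr):
        # Hypothetical mapping from an integer address to its lock word.
        return addr % len(self.locks)


def hw_read(mem, addr):
    # The extra load and branch paid on every fast-path read.
    if mem.locks[mem.lock_index(addr)] & LOCKED:
        raise HtmAbort(f"address {addr} is locked by a software transaction")
    return mem.data.get(addr, 0)


def hw_write(mem, addr, value):
    i = mem.lock_index(addr)
    if mem.locks[i] & LOCKED:
        raise HtmAbort(f"address {addr} is locked by a software transaction")
    mem.locks[i] += 2  # bump the version so software validation sees the change
    mem.data[addr] = value


mem = Memory()
hw_write(mem, 3, 42)
print(hw_read(mem, 3), mem.locks[3])  # prints "42 2": value 42, version 1, unlocked
```

Even in this toy form, every read and write touches the lock table, which is the meta-data cost that the later designs in this deck try to remove from the fast path.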
Software transaction: update locks.
[Diagram: VersionedWrite-Lock bits (0 1) shown under both the hardware and the software transaction]

Traditional Hybrid TM
• Advantages
  – Progress guarantee of STM
• Disadvantages
  – HTM must access meta-data
  – Fast path is actually slow because of the extra load and branch on every read

Phased TM [LevMoirNussbaum07]
• Two modes: all hardware or all software
• Shared global mode indicator
• If some hardware transaction aborts, switch to software mode
• Eventually the mode reverts back to hardware

Phased TM
• Advantages
  – Fast-path is pure HTM: no meta-data accesses
• Disadvantages
  – A single software transaction causes all HTM to switch to the STM slow path
  – Not clear how to tune to avoid frequent mode transitions…

Hybrid NOrec (1st Attempt)
Software NOrec: Not odd? seqlock; Read/Write (with seqlock validation); Lock seqlock (set odd); Validate; Write; Unlock seqlock (set even)
Hardware: Read/Write (no validation); Not odd? seqlock; Write seqlock+2
• Software will fail seqlock validation!
• Hardware will fail seqlock validation!
• Guaranteed external consistency
• Problem: hardware opacity

Internal Inconsistency (Opacity) [GuerraouiKapalka07]
[Diagram: X shows 8 and 4; Y shows 4]
Software A: Lock seqlock+1; Write x = 4; Write y = 2; Unlock seqlock+1
Hardware B: Read x; Read y; Compute z = 1/(x-y); …; Odd? seqlock
DIV by 0 ERROR!

Hybrid NOrec (2nd Attempt)
Guarantee hardware opacity
Software NOrec: Not odd? seqlock; Read/Write (with seqlock validation); Lock seqlock (set odd); Validate; Write; Unlock seqlock (set even)
Hardware: Not odd? seqlock; Read/Write (no validation); Write seqlock+2
Hardware will detect seqlock invalidation!

Hybrid NOrec
• Advantages
  – Fast-path HTM: no meta-data accesses
• Disadvantages
  – Limited disjoint access parallelism
  – Seqlock is in the hardware tracking set throughout the HTM transaction
  – Major sequential bottleneck

Possible Solutions
• Forget opacity, use sandboxing [DalessandroCarougeWhiteLevMoirScottSpear2011]
• Hybrid NOrec 2 [RiegelMarlierNowackFelberFetzer11]: use non-transactional operations in a hardware transaction to read and validate that the seqlock has not changed after every read
But sandboxing is complex… and non-transactional ops are only available in the AMD proposal, not actual IBM or Intel…

Reduced Hardware Approach to HyTM [MatveevShavit13]
• Use short hardware transactions in the software slow-path
• I.e. create a new “mixed” software/hardware path
• Not in order to make the slow-path faster
  – But rather, in order to remove meta-data accesses from the fast path
• Default to all software if the mixed path fails

Transactional Writes Imply Hardware Opacity
[Diagram: X shows 8 and 4; Y shows 4 and 2]
Trans A: Write x = 4
Hardware B: Read x; Read y; Compute z = 1/(x-y)
Trans A: Write y = 2
DIV by 0 ERROR!
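The interleaving on this slide can be replayed deterministically. The sketch below is only an illustration: the initial values x = 8 and y = 4 are read off the diagram residue and are an assumption, and the `run` helper and its step labels are hypothetical names introduced for the example.

```python
def run(schedule):
    """Replay one interleaving of A's two writes and B's computation."""
    mem = {"x": 8, "y": 4}  # assumed initial values, taken from the slide's diagram
    result = None
    for step in schedule:
        if step == "A:x":
            mem["x"] = 4                 # A: write x = 4
        elif step == "A:y":
            mem["y"] = 2                 # A: write y = 2
        elif step == "B":
            try:
                result = 1 / (mem["x"] - mem["y"])  # B: z = 1/(x-y)
            except ZeroDivisionError:
                result = "DIV by 0"
    return result


print(run(["B", "A:x", "A:y"]))   # B before A: 0.25
print(run(["A:x", "A:y", "B"]))   # B after A: 0.5
print(run(["A:x", "B", "A:y"]))   # B between A's writes: DIV by 0
```

Both serial orders give B a consistent state; only the schedule that places B between A's two writes exposes the inconsistent pair x = 4, y = 4.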
If in a hardware transaction this cannot happen…

Reduced Hardware NOrec [MatveevShavit13]
• In slow-path commit, use a small hardware transaction to:
  – Write all values
  – Check seqlock has not changed
  – Write seqlock+1
• In fast-path:
  – Move seqlock test to end, un-instrumented reads/writes

Reduced Hardware NOrec
Guarantee fast-path opacity without having the seqlock in the HTM tracking set for long
Software NOrec: Read/Write (with seqlock validation); In HTM Trans: Write values; Changed? seqlock; Write seqlock+1
[Diagram, partly recoverable: the short HTM transaction overlays the software steps Lock seqlock (set odd), Validate, Write, Unlock seqlock (set even)]
Hardware: Read/Write (no instrumentation); Read seqlock; Changed? seqlock; Write seqlock+1
Hardware will detect write conflict without seqlock!

Reduced Hardware NOrec
• Properties
  – Fast-path: no meta-data; no instrumentation of reads or writes
  – Slow-path:
    – Short hardware transaction: size of write set
    – Can repeatedly attempt the short hardware transaction in commit

Reduced Hardware NOrec
• Advantages
  – Hardware disjoint access parallelism
  – Seqlock accessed only at end of HTM transaction
  – Surprise: 1st HyTM that is obstruction-free and privatizing
• Disadvantages
  – Still a window of possible abort due to seqlock increment

Reduced Hardware NOrec
[Chart: RH-NOREC, 100K Nodes Constant RB-Tree, 20% mutations; series: HTM, Standard HyTM, NOREC, RH1 Fast, RH1 Mix 10, RH1 Mix 100; x-axis ticks 1 to 20; y-axis 0 to 3.50E+06]

Reduced Hardware NOrec
[Chart: RH-NOREC, 100K Nodes Constant RB-Tree, 80% mutations; series: HTM, Standard HyTM, NOREC, RH1 Fast, RH1 Mix 10, RH1 Mix 100; x-axis ticks 1 to 20; y-axis 0 to 2.50E+06]

Reduced Hardware TL2 Style
Hardware will see software
Software TL2 style: Read Clock; Read/Write (validate); Validate; In HTM Trans: Write values
Hardware: Read/Write (no validation); Read Clock; Write values with Clock+1
Hardware will detect write conflict
Problem: if between validate and hardware write, can have inconsistency

Reduced Hardware TL2 Style
Solution: combine validation and writes in a single transaction
Software TL2 style: Read Clock; Read/Write (validate); In HTM Trans: Validate and Write values
Hardware: Read/Write (no validation); Read Clock; Write values with Clock+1
Hardware will detect write conflict

Reduced Hardware TL2 Style
• Advantages
  – Complete disjoint access parallelism
  – GV6 clock incremented on aborts only
  – Obstruction-free
• Disadvantages
  – No privatization
  – Mixed path transaction size of meta-data set

RH1: Reduced Hardware TL2 Style
[Charts: y-axis Total Operations, x-axis number of threads (1 to 20). Panels: 100K Nodes Constant RB-Tree, 20% mutations; 100K Nodes Constant RB-Tree, 80% mutations; 10K Elements Constant Hash Table; 1K Nodes Constant Sorted List. Series: HTM, Standard HyTM, TL2, RH1 Fast, RH1 Mix 10, RH1 Mix 100; some panels also list NOREC, RH NOREC Fast, RH NOREC Slow, RH NOREC Mix 10, RH NOREC Mix 100.]

HyTM: Long Journey
• Combination of ideas:
  – hardware transactions
  – global clocks
  – no meta-data access
  – mixed hardware/software paths
• And there is still room for improvement
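As a recap of how these ideas combine, here is a rough sketch of the Reduced Hardware NOrec slow-path commit described earlier: a write buffer, value-based validation, and a short transaction that checks the seqlock, writes all values, and writes seqlock+1. It is a simplified model under stated assumptions, not the authors' code: the short hardware transaction is simulated with a `threading.Lock` (no real HTM), the example is exercised single-threaded, and the names `TM`, `SlowPathTx`, and `Abort` are hypothetical.

```python
import threading


class Abort(Exception):
    """The software transaction must restart."""


class TM:
    def __init__(self, initial):
        self.mem = dict(initial)
        self.seqlock = 0
        self.htm = threading.Lock()  # stand-in for hardware atomicity, NOT real HTM


class SlowPathTx:
    def __init__(self, tm):
        self.tm = tm
        self.snapshot = tm.seqlock  # seqlock value this transaction is consistent with
        self.reads = {}             # address -> value seen (for value-based validation)
        self.writes = {}            # write buffer, applied only at commit

    def validate(self):
        # Value-based validation: re-read every logged address and compare values.
        now = self.tm.seqlock
        for addr, seen in self.reads.items():
            if self.tm.mem[addr] != seen:
                raise Abort(addr)
        self.snapshot = now

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        value = self.tm.mem[addr]
        if self.tm.seqlock != self.snapshot:  # validate only if the seqlock moved
            self.validate()
            value = self.tm.mem[addr]
        self.reads[addr] = value
        return value

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        while self.writes:  # read-only transactions have nothing to commit
            with self.tm.htm:  # the short "hardware transaction"
                if self.tm.seqlock == self.snapshot:  # check seqlock has not changed
                    self.tm.mem.update(self.writes)   # write all values
                    self.tm.seqlock += 1              # write seqlock+1
                    return
            self.validate()  # seqlock moved: revalidate, then retry the short transaction


tm = TM({"x": 8, "y": 4})
t1 = SlowPathTx(tm)
t1.read("y")
t2 = SlowPathTx(tm)
t2.write("x", 4)
t2.commit()
t1.write("y", 2)
t1.commit()  # seqlock moved, but y is unchanged, so the retry succeeds
print(tm.mem, tm.seqlock)  # {'x': 4, 'y': 2} 2
```

In this model the short transaction covers only the write set plus the seqlock, which mirrors the slide's point that the mixed path needs a hardware transaction the size of the write set rather than of the whole transaction.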