Hardware Support for Efficient Transactional and Supervised Memory Systems
Jayaram Bobba
Dissertation Defense, 1/14/2010
Dept. of Computer Sciences, University of Wisconsin–Madison
Overview: 1) Research Area 2) Challenges/Contributions 3) Big Picture

Research Area
[Diagram: Device Scaling, Abundant Transistors, Emergence of CMPs, Hard to Program, Hardware Support to Improve Productivity. Supervised Systems: Empty/full-bits, Transactional Memory, MemTracker, Deterministic Memory]
7/26/2016 Wisconsin Multifacet Project

Challenges
• Supervised Systems
  – Sequential-consistency only
  – Ad hoc hardware
  – Lack of formalism
• Transactional Memory
  – “Most transactions are small”
    • Self-fulfilling
  – Limited applicability
Contribution 1: Supervised Memory (TSOdata, Safe Supervision)
Contribution 2: TokenTM
Contribution 3: StealthTest

Big Picture
[Diagram: Software (Applications, Tools) and Hardware (Supervised Systems, Supervised Memory), annotated with StealthTest, TokenTM, and TSOdata and Safe Supervision]

Outline (slide count)
• Motivation (4)
• Supervised Memory (18)
• TokenTM (4 / 19)
• StealthTest (16)
• Conclusion (6)

On Software Productivity
[Diagram: Better Hardware, More Software, More Performance (Moore’s Law), More Productivity (Yannis’s “Law”)]
• Yannis’s “Law”: Programmer Productivity doubles every 6 years
• Moore’s Law will continue. But Yannis’s Law?

What has changed?
• “A Fundamental Turn towards Concurrency in Software” [Herb Sutter, 2005]
• Moore’s Law -> Better Computers
  – Sequential Computers (Past)
    • Memory wall, Power wall etc.
  – Attack of the killer CMPs* (Current)
    • How to program? Expose parallelism to software
    • Parallel programs hard to write
* Adapted from “Attack of the killer micros” by Eugene Brooks

Who solves the productivity issue?
• Why, Of course, hardware architects!
• Long live Moore’s Law
  – Spend some transistors on productivity issues
• Architectural Support for Enhancing Productivity
  – for language features
  – for bug avoidance
  – for debugging
  – for performance feedback
  – and so on…

Seriously, Who should solve it?
• HW Architects or SW Engineers?
• ‘software crisis’ in the past too…
  “We must now reconsider the balance of hardware and software and to provide more specialized function in hardware than we have previously, in order to drastically simplify the programming process”
  Edward A. Feustel, IEEE TOC, July 1973, in support of Tagged Memory
• Why HW architects?
  – More bang for the buck (Economic)
    • Software/IT (1,152 billion) vs Hardware (138 billion) [Wen Mei Hwu, Micro-39 Keynote]
  – SW cannot do it alone (Technical)
    • Decades of automatic parallelization efforts
    • Virtual Memory, Tagged Memory for LISP-like languages

Outline
• Motivation
• Supervised Memory
  – Background/Motivation
  – Explore relaxed supervised systems
  – Define Supervised Memory
  – Propose formal models
• TokenTM
• StealthTest
• Conclusion

Why Supervised Systems?
• Synchronization
  – Hardware TM systems
  – Empty/Full-bits
    • [Berry et al 2006] Graph processing algorithms on 4 processor MTA > 64K BG/L
• Controlled non-determinism
  – Deterministic/Interleaving Constrained Multiprocessing
• Debugging
  – Log-based architectures
• Safety
  – Heap checkers, Bounds checkers
• Language Features
  – Hardware-assisted Garbage Collection

What are Supervised Systems?
1) out-of-band metadata per data block
2) monitor & control (supervise) memory accesses to data
3) execute handlers on specific metadata states
• pure software possible, but inefficient
  – shadow memory. E.g., Valgrind.
Mean Slowdown 22X [Nethercote et al., VEE2007]

State-of-the-Art
• Expect Sequentially-Consistent (SC) hardware
  – Most hardware is not
• Ad hoc
  – Whither primitives?
• Informal treatment of memory consistency
  – Ambiguous/Incorrect

Contributions
• Expect Sequentially-Consistent (SC) hardware
  – Most hardware is not
  => Explore relaxed supervised systems
• Ad hoc
  – Whither primitives?
  => Define Supervised Memory
• Informal treatment of memory consistency
  – Ambiguous/Incorrect
  => Propose formal memory models

Outline
• Motivation
• Supervised Memory
  – Background/Motivation
  – Explore relaxed supervised systems
  – Define Supervised Memory
  – Propose formal models
• TokenTM
• StealthTest
• Conclusion

Explore relaxed supervised systems
TSO-lite: A TSO-compliant system
[Figure: a processor (PC, registers r1, r2, r3) with a store buffer in front of memory, running ST 1, [A]; LD [B], r1; ST 2, [C]; LD [C], r3. Memory blocks: A = 0x00, B = 0x01, C = 0x11, each with metadata]

Explore relaxed supervised systems
Empty/Full-Bits on TSO-lite
[Figure: empty/full-bit metastate machine (None, Empty, Full; LD and ST transitions; Exception) and the same program on TSO-lite. Memory blocks: A = 0x00 (Full), B = 0x01 (None), C = 0x11 (Empty)]
I1: NO LOAD BYPASS
I2: LATE EXCEPTIONS

Explore relaxed supervised systems
Deterministic Shared Memory (DMP) [Devietti et al., ASPLOS 2009]
“depending upon the consistency model of the underlying hardware, threads must perform a memory fence at the edge of a quantum”
• Insert a fence after the last operation in the quantum
• Insert a fence before the first shared operation in the quantum
I3: Reordered metabit-reads

Outline
• Motivation
• Supervised Memory
  – Background/Motivation
  – Explore relaxed supervised systems
  – Define Supervised Memory
  – Propose formal models
• TokenTM
• StealthTest
• Conclusion

Define Supervised Memory
What is Supervised Memory?
• Each memory location A,
  – data (A.d)
  – metadata (A.m)
• New operations
  – Supervised Load (sLD A)
  – Supervised Store (sST A)
• Jump on reading special metadata (Optionally)
  – Hardware exception

Define Supervised Memory
Supervised Operations
sLD A =>
Start:   atomic {
           curm = Val[R A.m]                       // Read metadata
           nextm = NEXT(Load, curm)                // Check software-specified FSM
           If nextm == EXCEPTION then Jump to Handler
           R A.d                                   // Read data
           If (nextm != curm) then W A.m, nextm    // Update metadata
         }
Handler: …

Define Supervised Memory
Using Supervised Memory
• Software assigns semantics to metadata
  – Metastates stored as metadata
    • E.g., Initialized, Uninitialized
  – Metastate transition function (NEXT)
• Use supervised operations to monitor/control data operations
  – E.g., catch read access to uninitialized data

Outline
• Motivation
• Supervised Memory
  – Background/Motivation
  – Explore relaxed supervised systems
  – Define Supervised Memory
  – Propose formal models
• TokenTM
• StealthTest
• Conclusion

Propose formal models
TSO Axioms [Hangal et al., ISCA 2004]
Axiom           Description
Order           Total Order on all write accesses
Atomicity       No intervening accesses for atomic operations
Termination     All write accesses eventually complete
Value           Reads return latest value from memory or store buffer
Memory Barrier  No reordering across a barrier
ReadAny         Accesses cannot pass outstanding reads
WriteWrite      Write access cannot pass outstanding writes
[Figure: access pairs Rd A, Rd B; Rd A, Wr B; Wr A, Wr B; Wr A, Rd B]
Reordering Axioms (ReadAny, WriteWrite): Allows store buffers

Propose formal models
TSOall: A Consistency Model for Supervised Memory
TSO axioms applied to all accesses: data and metadata
+ (Simple) Like TSO
- (Slow) Prohibits optimizations
Thread:
  sST A -> [Rd A.m, Wr A.d, Wr A.m]
  sLD B -> [Rd B.m, Rd B.d]
  => Store buffers ineffective
• Tension
  – Ease of Reasoning vs Performance

Propose formal models
Blast from the Past [Adve and Hill, ISCA1990]
• Ease of Reasoning (SC) vs Performance (RC)
• Observation:
  – Simple programs rely only on certain SC orders
  – Ignore non-essential orders. Still appears as SC
• Challenge: Simple? Non-essential orders?
• Solution: Data-race-freedom
  – For data-race-free programs, RC = SC

Propose formal models
Safe Supervision Motivation
• Ease of Reasoning (TSOall) vs Performance (?)
• Observation:
  – Simple supervised programs rely only on certain TSOall orders
  – Ignore non-essential orders. Still appears as TSOall
• Challenge: Simple? Non-essential orders?
• Solution: Safe Supervision
  – For safely supervised programs, ? = TSOall

Safe Supervision
• metadata accesses to location A not used to order operations to a different location B
  Initially, A.m = Empty, B.d = 0
  Thread 1:        Thread 2:
  B.d = 1          While (A.m == Empty);
  A.m = Full       Read B.d
• Most uses of supervision are safely supervised.
E.g.,
  • Heap Checker: Initialized/Uninitialized values
  • Transactional Memory: Conflict Detection information

Propose formal models
TSOdata: Fast Yet Simple
Axiom           Description
Order           Total Order on all write accesses
Atomicity       No intervening accesses for atomic operations
Termination     All write accesses eventually complete
Value           Reads return latest value from memory or store buffer
Memory Barrier  No reordering across a barrier
ReadAny         Data accesses cannot pass outstanding data reads
WriteWrite      Data writes cannot pass outstanding data writes
(ReadAny and WriteWrite are the Reordering Axioms)
Thread:
  sST A -> [Rd A.m, Wr A.d, Wr A.m]
  sLD B -> [Rd B.m, Rd B.d]
  => Store buffers can be used
For safely supervised programs, TSOdata = TSOall

TSOdata on OpenSPARC T2
• Goal: Explore low-level issues on a real design
• Late Exceptions with deferred handlers
  – Dump store buffer entries on exception
  – Enhance store buffer to carry Virtual Address (VA)
  – ~200 cycles to read out 4 entries
• Disable store buffer bypassing for supervised loads
• Low space overhead for adding metabits (~4%)

Supervised Memory Summary
• Expects Sequentially-Consistent (SC) hardware
  – Most hardware is not
  => Explore relaxed memory systems
• Ad hoc
  – Whither primitives?
  => Define Supervised Memory
• Informal treatment of memory consistency
  – Ambiguous/Incorrect
  => Propose formal memory models

Outline
• Motivation
• Supervised Memory
• TokenTM [ISCA 2008]
• StealthTest
• Conclusion

TokenTM Summary
• Current Hardware TMs
  – Most Transactions Small & Short Running
  – Penalize large/long transactions
  – Too restrictive for wide-spread TM use?
• Hypothesis
  – Must Support Efficient Large/Long Transactions As Well
  – Is such an HTM even possible?
• Yes! TokenTM
  1. LogTM’s Log to buffer unbounded values
  2.
Transactional Tokens for unbounded conflict detection
     • Conflict state in memory metabits

Transactional Tokens
• Challenge: How to efficiently track Read/Write sets?
• Token Coherence [Martin03]
  – Read/Write sets for cache coherence
• Solution: Transactional Tokens
  – T tokens per memory block
  – At least one token to read, All T tokens to write (token conflict detection)
  – Token Metadata <c0,c1,…,ci,…> where 0≤ci≤T is count of tokens held by thread with TID i.

Tokens and Supervised Memory
• Challenge: Where to store Unbounded, Globally Accessible Token Metadata?
  – unbounded and globally accessible
• Solution
  – Supervised Memory’s Metadata
  – Piggyback on existing Virtual Memory and Cache Coherence mechanisms

TokenTM: a Large-Transaction TM
• New Conflict Detection Mechanism
  – Transactional Tokens in Supervised Memory
  – Token Coherence [Martin03] at different level
• Version Management
  – Save old/new values for unbounded Write set
  – LogTM [Moore06] undo log

Outline
• Motivation
• Supervised Memory
• TokenTM
• StealthTest [PACT 2009]
• Conclusion

StealthTest Summary (1/2)
The Problem: fork Overhead
• Software testing hard
  – Multithreading makes harder
• Online software testing can help
  – Run tests on deployed software. E.g., Delta Execution for patch testing [Tucek et al., ASPLOS 2009]
  – Non-intrusive mechanisms
    • fork (existing)
[Diagram labels: Functionally Hidden, Low Overhead, Good Scaling]

StealthTest Summary (2/2)
Solution: TM for testing
• Demonstrate two uses
  • Delta Execution
  • In vivo Testing
• Leverage Transactional Memory for online testing
• Non-Intrusive? (Functionally Hidden)
  – transaction { test(); abort }
  – Fast TM mechanisms (Low Overhead, Good Scaling)

Outline
• Motivation
• Supervised Memory
• TokenTM
• StealthTest
  – Online Software Testing
    • E.g., Patch Validation
  – StealthTest: TM for online testing
  – Delta Execution using StealthTest
  – In vivo Testing using StealthTest (Optionally)
• Conclusion

Online Patch Validation
• Bug fixes can introduce more bugs
  – Patches must be validated
• Online Validation [Nagaraja et al., OSDI 2004]
  – Increased resource usage
  – Lockstep execution
[Figure: Input feeds the Production and Testing executions; a Diff of the two produces the Output]

Delta Execution [Tucek et al., ASPLOS 2009]
• Online Patch Validation
  – Most patches are small
  – Patched and Un-patched executions similar
• Delta Execution
  – Run together except when they differ
[Table: Prior Work vs Delta Execution, compared on Increased Resource Usage and Lockstep Execution; the check and cross marks did not survive extraction]

Delta Execution using fork
[Figure: Production and Testing executions over Time]

Multi-threading and fork
[Figure: Production and Testing executions over Time]
• ‘Park’ all other threads
• Stop all threads to get a consistent memory snapshot

fork
• Poor Performance
  – ~9.8ms for split/~106ms for merge [Tucek et al, ASPLOS 2009]
• Poor Scalability
  – Web-server response rate reduced by 43%
Want an alternate mechanism

Outline
• Motivation
• Supervised Memory
• TokenTM
• StealthTest
  – Online Software Testing
    • E.g., Patch Validation
  – StealthTest: TM for online testing
  – Delta Execution using StealthTest
  – In vivo Testing using StealthTest (Optionally)
• Conclusion

Delta Execution using StealthTest
Delta Execution               fork                      StealthTest (Transactional Memory)
Isolate patched execution     Execute on child process  transaction{…}
Introspect patched execution  Page diffing              Version Management: Tracks new/old values
Monitor delta data access     mprotect                  Conflict Detection: Monitor accesses

StealthTest Interface
Delta Execution               StealthTest                                  Transactional Memory
Isolate patched execution     ST_begin_transaction, ST_abort_transaction   transaction{…}
Introspect patched execution  ST_get_old, ST_get_new                       Version Management: Tracks new/old values
Monitor delta data access     ST_protect_set, ST_protect_clear             Conflict Detection: Monitor accesses

Requirements from TM
• Strong Atomicity [Martin et al., CAL 2006]
  Transactions isolated from non-transactions
  => Test transactions isolated from application code
• Flexible Conflict Resolution
  Can abort transactions if necessary
  => Abort tests if they block application
• Communication from within transactions
  => Expose result of a test

Outline
• Motivation
• Supervised Memory
• TokenTM
• StealthTest
  – Online Software Testing
    • E.g., Patch Validation
  – StealthTest: TM for online testing
  – Delta Execution using StealthTest
  – In vivo Testing using StealthTest (Optionally)
• Conclusion

Delta Execution using StealthTest
[Figure: Production and Testing timelines. With fork: fork, patched execution and unpatched execution, compute and isolate delta data, install delta data, merged execution. With StealthTest: the same steps run in Production inside a StealthTest transaction]

Multi-threaded Delta Execution
[Figure: the same fork-based and StealthTest-transaction timelines, with multiple threads]

Evaluation
(1) Effective? (2) Non-intrusive?
• Workloads
  – Collection of multi-threaded server apps
  – Same as Tucek et al., ASPLOS 2009
• Pin-based TM Emulation
• 2-way SMP with 2.4GHz Pentium 4 CPUs and 2.5GB RAM

(1) Effective?
Program  Description  Patch Description  Patch Verified?
(Patch Verified? is reported separately for fork and StealthTest)
Program   Description   Patch Description          fork  StealthTest
Crafty    Chess App     Code refactoring           Yes   Yes
Raytrace  Raytracer     Result reporting fix       Yes   Yes
Tar       Archive Util  Incremental archiving fix  Yes   Yes
Apache1   Web Server    Buffer overflow fix        Yes   Yes
Apache2   Web Server    Buffer overflow fix        Yes   Yes
DNSCache  DNS Cache     Behavior Change            Yes   Yes
MySQL5.0  DB Server     Extra permission checks    Yes   No
OpenSSL   Security Lib  Added bug in TLS handling  Yes   No
Squid     Web Cache     Buffer overflow fix        Yes   No
ATPhttpd  Web Server    Buffer overflow fix        Yes   No
Annotations: Works; Memory allocation, sockets

(2) Non-intrusive?
fork
Program   Description   ForkOverhead(%)  PatchDuration(%)
Crafty    Chess App     0.1              <0.1
Raytrace  Raytracer     0.2              0.5
Tar       Archive Util  41               7.3
Apache1   Web Server    2.8              0.1
Apache2   Web Server    12               0.1
DNSCache  DNS Cache     65               0.1
MySQL5.0  DB Server     4.7              5.0
OpenSSL   Security Lib  12               <0.1
Squid     Web Cache     2.9              0.2
ATPhttpd  Web Server    65               0.8

Outline
• Motivation
• Supervised Memory
• TokenTM
• StealthTest
  – Online Software Testing
    • E.g., Patch Validation
  – StealthTest: TM for online testing
  – Delta Execution using StealthTest
  – In vivo Testing using StealthTest
• Conclusion

StealthTest Summary
• Software testing hard
• Online software testing can help
  – Existing mechanisms inadequate
• StealthTest leverages TM for non-intrusive online testing
• Demonstrate two uses
  – Delta Execution
  – In vivo Testing
[Diagram labels: Functionally Hidden, Low Overhead, Good Scaling]

Outline
• Motivation
• Supervised Memory
• TokenTM
• StealthTest
• Conclusion

Contribution 1: Supervised Memory [Under Submission]
• Supervised Systems
  – Useful, Renewed interest
• Problem
  – SC only, while most systems are not
  – Ad hoc hardware, specific to a supervised system
  – No Formalism, leads to ambiguity/incorrectness
• Contributions
  – Explore non-SC systems
  – General model for supervision: Supervised Memory
  – Formal Specification
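The supervised load and store operations that make up this general model (read the metadata, consult the software-specified NEXT function, raise an exception or access the data, then update the metadata) can be sketched in a few lines. This is an illustrative model only, not the hardware design: it is single-threaded, so the atomic region is implicit, and the two-state initialization checker used as NEXT, along with the class and exception names, are assumptions made for the example.

```python
# Sketch of supervised load/store (sLD/sST) over per-location metadata.
# Assumptions: single-threaded model (the atomic{} region is implicit) and a
# hypothetical two-state initialization checker as the NEXT function.

EXCEPTION = "EXCEPTION"


class SupervisionException(Exception):
    """Stands in for the jump to the software handler."""


# Software-specified metastate transition function: (operation, current) -> next.
NEXT = {
    ("Load", "Uninitialized"): EXCEPTION,
    ("Load", "Initialized"): "Initialized",
    ("Store", "Uninitialized"): "Initialized",
    ("Store", "Initialized"): "Initialized",
}


class SupervisedMemory:
    def __init__(self):
        self.d = {}  # A.d: data per location
        self.m = {}  # A.m: metadata per location

    def _check(self, op, addr):
        curm = self.m.get(addr, "Uninitialized")  # read metadata
        nextm = NEXT[(op, curm)]                  # consult the software FSM
        if nextm == EXCEPTION:
            raise SupervisionException(f"{op} of {addr} in metastate {curm}")
        return curm, nextm

    def sLD(self, addr):
        curm, nextm = self._check("Load", addr)
        value = self.d[addr]                      # read data
        if nextm != curm:
            self.m[addr] = nextm                  # update metadata
        return value

    def sST(self, addr, value):
        curm, nextm = self._check("Store", addr)
        self.d[addr] = value                      # write data
        if nextm != curm:
            self.m[addr] = nextm                  # update metadata
```

With this NEXT table, a supervised load of a location that was never stored raises the exception, which is the "catch read access to uninitialized data" use mentioned earlier in the talk; swapping in a different table would give a different supervised system without changing sLD or sST.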
Contribution 2: TokenTM [Bobba et al., ISCA 2008]
• Transactional Memory, a supervised system
• Problem
  – “Most transactions are small”, Self-fulfilling assumption
  – Penalize large/long transactions
  – Too restrictive for wide-spread TM use?
• Contributions
  – TokenTM
    • First HTM to support efficient large/long transactions as well
    • Follow-up: Purdue’s LiteTM [Jafri et al., HPCA 2010]

Contribution 3: StealthTest [Bobba et al., PACT 2009]
• Using transactional memory for testing
• Problem
  – Existing fork-based mechanisms
    • High overhead
    • Poor scalability
• Contributions
  – StealthTest, low-overhead interface for online testing
  – Two StealthTest-based testing frameworks

Other Research and Contributions
• Performance Pathologies
  – Bobba et al., ISCA 2007
  – Bobba et al., IEEE Micro Top Picks Jan 2008
• LogTM-SE
  – Yen et al., HPCA 2007
• Nested LogTM
  – Moravan et al., ASPLOS 2006
• LogTM
  – Moore et al., HPCA 2006
• GEMS LogTM-SE Implementation
  – Development, Release and Support

Acknowledgments
Advisors: Mark Hill, David Wood
Mike Swift, Ben Liblit, Shan Lu, Karu Sankaralingam
Mikko Lipasti, Jeffrey Naughton
Co-authors: Kevin Moore, Luke Yen, Haris Volos, Michelle Moravan, Weiwei Xiong, Neelam Goyal
Colleagues: Alaa Alameldeen, Arkaprava Basu, Brad Beckmann, Polina Dudnik, Dan Gibson, Mike Marty, Somayeh Sardashti, Rathijit Sen, Cong Wang, Yasuko Watanabe, Min Xu
Matt Allen, Piramanayagam Arumuga Nainar, Siddharth Barman, Koushik Chakraborthy, Venkat Govindraju, Amit Kumar, Srinath Sridharan, Philip Wells

Key Contributions
[Diagram: Software (Applications, Tools) and Hardware (Supervised Systems, Supervised Memory), annotated with StealthTest, TokenTM, and TSOdata and Safe Supervision]

Backup

CPU Trends
From “The Free Lunch Is
Over” by Herb Sutter

The ‘Re-Birth’ of Parallel Programming
• Sequential Computers
  – Memory wall, Power wall etc.
• Attack of the killer CMPs*
  – General-purpose parallel computers
  – How to program? Expose parallelism to software
“The Free Lunch is Over” – Herb Sutter, 2005
* Adapted from “Attack of the killer micros” by Eugene Brooks

Parallel Programming is Hard (currently)
• Hard for programmers
  – Correctness
    • Synchronization, Data races, Atomicity violations
  – Performance
    • Communication, Scheduling, Load-Balancing, Critical Path
• Hard for tools
  – Compilers, Static Analysis
    • Intractable/Inefficient

Houston, We Have a Problem!
• Who should solve this problem?
• Yannis’s Law: Programmer Productivity doubles every 6 years
  – http://ix.cs.uoregon.edu/~yannis/law.html
• Proebsting's Law: Compiler Advances Double Computing Power Every 18 Years
  – http://research.microsoft.com/enus/um/people/toddpro/papers/law.htm

Parallel Algorithms vs Moore’s Law
• “Improvement resulting from … algorithmic speedup is comparable to that resulting from the hardware speedup due to Moore’s Law over the same length of time”
  David E. Keyes, “A Science-Based Case for Large-Scale Simulation”, July 2003.

In the “Landscape of Parallel Computing Research…” [Asanovic et al., 2006]

Why not Tagged Memory?
[Gehringer and Keedy, CAN 1985]
• Type information in tags
• Arguments do not apply to dynamically-typed languages like Lisp
• For other languages,
  – Simpler but more specialized designs
  – Compilers improved to make the proposals moot

Explore relaxed memory systems
Existing proposals assume SC
• Assume SC or don’t deal with multiprocessors
Proposal          Base Architecture  Implementation
WWT               MIPS               SC
Tapeworm          MIPS               SC
LogTM             SPARC              SC
OneTM             SPARC              SC
Informing Memory  MIPS, Alpha        SC
SafeMem           x86                x86
MemTracker        MIPS               SC
DMP               x86                SC

DMP Correctness

Non-TSOall Executions

Propose formal models
TSOdata is Complex
[Figure: empty/full-bits metastate machine with states Empty and Full, sST and sLD transitions, and Exception]
Initial State: A.d = 0, A.m = None; B.d = 0, B.m = Empty
T0:              T1:
dST 1, A         sST B, 1
sLD B            dLD A
Can dLD A return 0?

Safe Supervision

TokenTM Logical Operation
[Figure: Thread X runs BEGIN_XACT; Load A; Store B; COMMIT_XACT and Thread Y runs BEGIN_XACT; Load A; Store A; COMMIT_XACT, each with its own undo log. Shared memory holds, per block, data and token metadata <cx, cy, …>. A goes from <0,0,…> to <1,0,…> to <1,1,…>; B goes from <0,0,…> to <T,0,…>, with the old value of B saved in X’s undo log; C stays at <0,0,…>. Thread Y’s Store A finds insufficient tokens and Y aborts]

Existing HTM Systems
Assumption: Most transactions small & short running
• Optimized for small transactions
• Degrade with large, long running transactions
  • Non-localized Overhead, E.g., LogTM-SE [Yen07] false conflicts, OneTM [Blundel07] serializes
  • Complex, Expensive Operations, E.g., XTM [Chung06] & PTM [Chuang06] manipulate page tables
Premature Optimization?

Why Large Transactions?
• Programmers may want large (>> cache) and/or long (>> ctx switch) transactions
  – HLL transactions invoke unpredictable lower-level code
  – Replace critical sections containing syscalls or I/O
  – Avoid concurrency bugs [Lu08]
• But “Most transactions small & short running”
  – Restrict TM to use by gurus (like OS spin locks)?
  – Self-fulfilling prophecy?
Must Support Efficient Large/Long Transactions As Well

Toward a Large-Transaction TM
Efficiently detect conflicts between in-flight transactions using Read/Write Sets
• Unbounded
• Globally accessible
Requirements:
• Small Transactions: Low Overhead
  – Fast read/write set ops. E.g., Add to read set, Clear read set
• Large Transactions: Localized Overhead
  – Accessible read/write set (potentially unbounded)
• Minimal Changes to Coherence / VM
  – NO heavyweight eviction ops, negative acks, or additional page tables

Existing Mechanisms
Synergy between cache coherence and conflict detection
Hence, overload cache coherence
+ Excellent for bounded/small TM
But,
- ‘Virtualization’ on overflows
- Tough to access ‘virtualized’ state
Requirements scorecard:
  Small Transactions: Low Overhead
× Minimal Changes to Coherence / VM
× Large Transactions: Localized Overhead

TokenTM: a Large-Transaction TM
• New Conflict Detection Mechanism (This Talk)
  – Transactional Tokens in Tagged Memory
  – Token Coherence [Martin03] at different level
• Version Management
  – Save old/new values for unbounded Write set
  – LogTM [Moore06] undo log

Transactional Tokens
• Challenge: How to efficiently track Read/Write sets?
• Token Coherence [Martin03]
  – Read/Write sets for cache coherence
• Solution: Transactional Tokens
  – T tokens per memory block
  – At least one token to read, All T tokens to write (token conflict detection)
  – Token Metadata <c0,c1,…,ci,…> where 0≤ci≤T is count of tokens held by thread with TID i.
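The token rule above (at least one token to read, all T tokens to write) can be sketched as a small per-block model. This is a hedged illustration rather than TokenTM's implementation: T = 4 is an arbitrary value, the class and method names are hypothetical, and token logs, the hardware (sum, TID) summary, and cache coherence are not modeled.

```python
# Sketch of token-based conflict detection for one memory block.
# Assumptions: T = 4 is an arbitrary illustrative value; the acquire/release
# names are hypothetical; token logs and hardware summaries are not modeled.

T = 4  # tokens per memory block


class BlockTokens:
    def __init__(self):
        # Token metadata <c0, c1, ...>: TID -> number of tokens held.
        self.counts = {}

    def _held_by_others(self, tid):
        return sum(c for t, c in self.counts.items() if t != tid)

    def acquire_read(self, tid):
        """A read needs at least one token. Returns False on a conflict."""
        if self.counts.get(tid, 0) >= 1:
            return True                      # already a reader or the writer
        if sum(self.counts.values()) >= T:
            return False                     # no free token left
        self.counts[tid] = 1
        return True

    def acquire_write(self, tid):
        """A write needs all T tokens. Returns False on a conflict."""
        if self._held_by_others(tid) > 0:
            return False                     # insufficient tokens
        self.counts[tid] = T
        return True

    def release(self, tid):
        """Return all of a thread's tokens, as on commit or abort."""
        self.counts.pop(tid, None)
```

Two readers can share a block, but a writer conflicts with any other token holder, which mirrors the deck's logical-operation figure where thread Y's Store A finds insufficient tokens while thread X still holds a read token on A.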
Tagged Memory
• Challenge: Where to store Unbounded, Globally Accessible Token Metadata?
• Virtual Memory
  – unbounded and globally accessible
• Solution, similar to OneTM [Blundel07]
  – Tag Virtual Memory
  – Piggyback on existing Virtual Memory and Cache Coherence mechanisms

TokenTM Logical Operation
[Figure: Thread X runs BEGIN_XACT; Load A; Store B; COMMIT_XACT and Thread Y runs BEGIN_XACT; Load A; Store A; COMMIT_XACT, each with its own undo log. Shared memory holds, per block, data and token metadata <cx, cy, …>. A goes from <0,0,…> to <1,0,…> to <1,1,…>; B goes from <0,0,…> to <T,0,…>, with the old value of B saved in X’s undo log; C stays at <0,0,…>. Thread Y’s Store A finds insufficient tokens and Y aborts]

Storing Metadata
[Figure: in software, each thread keeps an undo log and a token log (Cx, Cy) holding the full metadata <cx, cy, …>: unbounded, but difficult to access globally. In hardware, tagged memory holds per block the data and a metastate (Sum, TID), a lossy summary of the metadata. Initially blocks A, B and C all have metastate (0, -) and metadata <0,0,…>]
(Hardware metastate: Concise, Accessible)

Hardware Metastate
• Metadata summary (sum, TID)
  – sum, total number of tokens acquired
  – TID, identify owner when sum = 1 or sum = T (optional)
Some summaries:
<c0, c1, …, ci, …>   (sum, TID)
<0, 0, 0, 0>         (0, -)
<0, 0, 1, 0>         (1, 2)
<0, T, 0, 0>         (T, 1)
<0, 1, 1, 1>         (3, -)
• Concise -> Stored in packed field (e.g., State[1:2], Attr[3:16])
• Fast -> Accessed as part of normal memory operation

Token Logs
• Distributed structures for unbounded Read/Write sets
  – per-thread
  – stored in program memory (e.g., heap)
  – list of <address, num_tokens>, e.g., Token log: A: 1, B: T
• Accessible to hardware for fast ops
  – Add to read set -> Append to token log

Double-entry Bookkeeping (Keeping Metadata Consistent)
[Figure: Thread X and Thread Y each load A inside a transaction and append A: 1 to their own token logs (software), so the logical token state for A goes from <0,0,…> to <1,0,…> to <1,1,…>. In hardware, the metastate (Sum, TID) for A goes from (0, -) to (1, X) to (2, -), while B and C stay at (0, -)]

Implementing Hardware Metastate
[Figure: Thread X and Thread Y each execute Load A. Private caches hold, per block, tag, coherence state, data, and metastate (Sum, TID); the directory and main memory hold the block’s sharers, data and metastate. Coherence messages shown: GETS A, Fwd_GETS A, DATA A]
Shared copies cannot update metastate
Solution: Fission / Fusion

Metastate Fission
[Figure: Thread X holds A with metastate (1, X) in its private cache. When Thread Y’s Load A sends GETS A and the request is forwarded (Fwd_GETS A), the metastate undergoes fission: X’s copy keeps (1, X), the copy sent to Y starts at (0, -), and Y’s load then records its own token in its copy. The directory lists A as shared at P1, P2]

Metastate Fusion
• Metastate Fusion
  – On store, metastate copies fused back
• Why does fission/fusion work?
  – Store sees ‘complete’ metastate
  – Load sees
    • ‘complete’ metastate, if writer exists
    • ‘partial’ metastate, otherwise

Hardware Cost
• Additional metabits in caches/memory
  – Recoded ECC to cull metabits
• Changes to coherence protocols
  – Additional payload on messages
  – Minimal changes to protocol logic
    • Requires non-silent eviction

Evaluation Methodology
• Methodology
  – Full System Simulation
  – Multifacet GEMS
• Base System
  – 32-core CMP system, in-order, single-issue cores
  – Private 4-way 32KB writeback split I&D L1 caches
  – Shared 8-way 8 MB writeback L2
  – On-chip directory @ L2, MESI coherence
  – Packet-switched interconnect in a tiled topology

TM Systems
• LogTM-SE [Yen07] variant
  Parallel Bloom Filters for conflict detection
  4 2Kbit H3 filters
  + Compact, less hardware overhead
  - False Conflicts
• LogTM-SE_Perfect
  + No False Conflicts
  - Unimplementable
• TokenTM

Results
[Chart: Performance Normalized to LogTM-SE_Perfect (y-axis 0 to 1.2) for LogTM-SE, LogTM-SE_Perfect and TokenTM]
• Large Transactions: Localized Overhead
  – Minor degradation with large transactions
• Small Transactions: Low Overhead
  – Comparable on small transactions

In vivo Testing [Murphy et al. TR 2007, Chu et al. ICST 2008]
• Run unit tests on deployed software
  + More testing
  + More realistic
Catch bugs early

In vivo Testing using StealthTest
    ST_begin_transaction();
    try {
        test();
        ST_begin_escape();
        fprintf(log, "…", success);
        ST_end_escape();
    } catch/except() {
        ST_begin_escape();
        fprintf(log, "…", fail);
        ST_end_escape();
    }
    ST_abort_transaction(NO_RETRY);

Evaluation
• Workloads
  – Bugbench
    • Server Workloads
  – STAMP
    • Transactional Memory benchmarks
• Implementation
  – Intel STM
    • Language-Based TM
  – TL2 STM
    • Library-Based TM
• Quad-core workstation with RHEL5

(1) Effective?
• Built on Intel STM.
• Run tests on Bugbench applications
Program  Description      Size (LOC)  Bug Type         Error Detected?
NCOM     file compress    1.9K        Stack Smash      Yes
POLY     file “unixier”   0.7K        Stack Smash      Yes
GZIP     file compress    8.2K        Buffer Overflow  Yes
MAN      documentation    4.7K        Buffer Overflow  Yes
BC       calculator       17.0K       Buffer Overflow  Yes
HTPD1    web server       224K        Atomicity        Yes
SQUD     proxy cache      93.5K       Buffer Overflow  Possible
CVS      version control  114.5K      Double Free      Possible
MSQL2    DBMS             514K        Atomicity        Possible
MSQL3    DBMS             1028K       Atomicity        Possible
Annotations: Works; Unsupported Library Calls

(2) Non-intrusive?
• Built on TL2 STM.
• Run tests on STAMP applications (1000 tests per min)
[Chart: Normalized Execution Time (y-axis 0 to 4) for None, fork and StealthTest on Genome, Intruder, Vacation and Yada]

Atomicity Violation Bugs?

Degree-2 Transactions
• Isolate only writes.
• Implementation
  – Reads in escape action
  – Early Release
  – Add new type of transaction to TM

StealthTest Wish List
• Hardware Support
• System Calls within Transactions
• Interaction between Locks and Transactions

[Chart: Normalized Execution Time (y-axis 0 to 4) for None, ForkIV-10^3, ForkIV-10^1, StealthIV-10^1 and StealthIV-10^3 on Genome, Intruder, Vacation and Yada]

Related Work
• Binary Translation (SPROCKETS)
• Code Emulation (STEM)
• TLS (Oplinger&LAM, PathExpander)

In vivo Testing Motivation
• Ordering Bug in MySQL
• In vivo Test: Data Consistency Checks
Buggy Code (In mysys/thr_lock.c):
    void thr_lock_delete(THR_LOCK *lock)
    {
        …
        pthread_mutex_destroy(&lock->mutex);
        …
        list_delete(thr_lock_thread_list, &lock->list);
        …
    }