Hardware Memory Race Recording for Deterministic Replay

Mark D. Hill
University of Wisconsin-Madison
August 10, 2007

Based on joint work with Min Xu & Ras Bodik: ISCA 2003, ASPLOS 2006, IEEE Micro Top Picks 2007, & Xu UW Ph.D. 5/2006 (slides updated from defense talk).

Wisconsin Multifacet Project
Seek improved architectures for (mostly) servers that are (mostly) chip multiprocessors (CMPs, multi-core)
Led by Mark Hill & David Wood
LogTM work w/ Ben Liblit & Mike Swift
Funding
• Grants from U.S. National Science Foundation
• Donations from Intel and Sun

Selected Multifacet Results (1 of 2)
Multiprocessor Flight Data Recorder
• Records memory races for deterministic replay
• Piggybacks on coherence protocol & logs 0.001 B/instruction
• Supports SC & TSO
Adaptive L2 Cache & Memory Link Compression
• Cache compression creates level 2½ cache (or 3½)
• Adaptive so as “to do no harm”
• Link compression husbands memory link bandwidth
Multifacet GEMS MP Simulation Infrastructure
• Simics == Correctness; GEMS == Performance
• GPL Distribution

Selected Multifacet Results (2 of 2)
Log-based Transactional Memory (LogTM)
• Accelerates commit by writing new values in place (after saving old values in a per-thread log)
• Gracefully handles cache eviction of TM data
LogTM Signature Edition (LogTM-SE)
• Signatures summarize read/write sets
• HW mechanisms: simple, policy-free, SW accessible
Forthcoming
• Mechanisms to handle thread switching/migration & paging of transactions with OS or OS/VMM

Overview
Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism
Effective Inexpensive Race Recorder
A Case Study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Support both SC & TSO (x86-like)

Contributions
Effective, inexpensive race recorder:
• Small log size: Transitive Reduction & Regulated TR
• Low runtime overhead: Coherence Piggyback
• SC & TSO applicability: Order-Value Hybrid
• Low cost hardware: Set/LRU Approximation

Outline
Motivation & Problem (6 slides)
An Effective and Inexpensive Race Recorder
• TR & RTR Algorithms
• Coherence Piggyback
• Set/LRU Approximation
• Order-Value Hybrid
Evaluation Method & Results
Conclusions, etc.

Motivation & Problem

Multithreaded Debugging
Single-threaded program: the crash reproduces under the debugger.
    % gcc hash.c
    % a.out
    Segmentation fault
    % gdb a.out
    gdb> run
    Program received SIGSEGV.
    In get() at hash.c:45
    45 a = bucket->d;
Multithreaded program: the crash does not reproduce.
    % gcc para-hash.c
    % a.out
    Segmentation fault
    % gdb a.out
    gdb> run
    Program exited normally.
    gdb>
Multithreaded program with race recording: the crash reproduces from the log.
    % gcc para-hash.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;

Applications of Deterministic Replay
Deterministic replay is logically recreating a program execution
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])
• Data Recovery (Continuous Checkpointing)?
See VMware Workstation 6
• Replay included for single-processor guest VM

Race Recording
[Figure: Thread I executes X=1; X++; print(X) while Thread J executes X = X*5. Depending on whether Thread J's multiply runs before or after X++, the program prints 6 or 10. The original run printed X=6; the log lets replay reproduce X=6 rather than X=10.]

Recording for Multithreaded Replay
Race Recording (focus)
• Not an issue for a single thread
• Create the same general & data races
Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not focus
Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not focus

A Good Race Recorder
• Low cost
• Low runtime overhead
• Applicability
    % gcc para-hash.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;
• Long recording: small log

Desired & Existing Race Recorders
[Table: rows are Desired Recorder, InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder, and Strata ASPLOS ’06. Columns are Recording Length (small log size), Applicability (racey code, MP, SC, TSO), Overhead (negligible slowdown), and Cost (little hardware). Surviving cell entries: V, V, V, X, V, “V, but global”.]

Small Log Size
Transitive Reduction & Regulated TR | Coherence Piggyback | Order-Value Hybrid | Set/LRU Approximation

Problem Formulation
[Figure: the same two-thread instruction sequence shown for recording and for replay, connected by the log; conflicts are drawn in red and dependences in black.]
    Thread I    Thread J
    ld A        add
    st B        st C
    st C        ld B
    ld D        st A
    sub         st C
    ld B        st D
Reproduce exact same conflicts: no more, no less

Log All Conflicts
     Thread I    Thread J
  1  ld A        add
  2  st B        st C
  3  st C        ld B
  4  ld D        st A
  5  sub         st C
  6  ld B        st D
Steps: assign IC (logical timestamps), detect conflicts, write log
Log J: 2→3, 1→4, 3→5, 4→6
Log I: 2→3
Dependence log entry: 16 bytes
Log size: 5*16 = 80 bytes (10 integers)
But too many conflicts

Netzer’s Transitive Reduction
Same example; the dependence 1→4 is TR reduced.
Log J: 2→3, 3→5, 4→6
Log I: 2→3
Log size: 64 bytes (8 integers)

The Intuition of the New RTR Algorithm
[Figure: dependences from I to J and from J to I shown as vectors, after reduction; Regulated Transitive Reduction (RTR) regulates replay.]

Stricter Dependences to Aid Vectorization
Same example; a stricter dependence 4→5 (I:4 ld D before J:5 st C) is recorded, and the dependences it implies are reduced.
New reduced log:
Log J: 2→3, 4→5
Log I: 2→3
Log size: 48 bytes (6 integers)

Compress Vectorized Dependencies
Same example with vectorized dependences.
Vectorized log:
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log size: 40 bytes (5 integers)
Reduce log size to KB/core/second

Low Runtime Overhead
Transitive Reduction & Regulated TR | Coherence Piggyback | Order-Value Hybrid | Set/LRU Approximation

Detect Conflicts
Per-address state: A.readers, A.writer
Thread I
    1 ld A:  A.readers.add(I, 1)
    2 st B:  B.writer = (I, 2)
    3 st C:  if (C.writer != I) log(WAW)
             foreach reader in C.readers
                 if (reader != I) log(WAR)
             C.readers.clear()
             C.writer = (I, 3)
Thread J
    1 add
    2 st C:  C.writer = (J, 2)
    3 ld B:  if (B.writer != J) log(RAW)
             B.readers.add(J, 3)
    4 st A:  …
Expensive in software

Use Cache and Cache Coherence
[Figure: Proc J executes ld B. Proc I's cache (Tag, State, Data, Timestamp): A S … 1; B M … 2. Proc J's cache: A S … 3; B I … 2. The timestamps play the role of A.readers/A.writer and B.readers/B.writer. J sends a Get/S request; I's data response carries the timestamp; RAW detected & logged.]
Detect conflict in hardware with little runtime cost

Cache Evictions and Writebacks
[Figure: st A. Proc I's cache (Tag, State, Data, Timestamp): A S … 1; C M … 3; B M … 2. Proc J's cache: A S … 3, upgraded to M; B I … 2. Messages: Get/S, Inv, Ack; WAR detected & logged. Directory of A: Shared(I,J), Owner(). Timestamp?]
OK with nonsilent eviction & directory eviction

Implement TR and RTR in Hardware
Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction
Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies

Hardware Implementation
Cache
• Eviction/writeback: solved, more details later
• Directory protocols: solved
• Snooping protocols: partly solved
• Two-level coherence: not yet solved
Processor
• Out-of-order/prefetching: solved
• Unordered messages: solved
• Counter overflow: solved
• Thread migration: not yet solved

Low Cost Hardware
Transitive Reduction & Regulated TR | Coherence Piggyback | Order-Value Hybrid | Set/LRU Approximation

Timestamp Approximation
[Figure: one set of I's cache (Tag, State, Data, Timestamp): A S … 1; C M … 3; B M … 2. Directory of A: Shared(I). After the three-instruction example (I: ld A, st B, st C; J: add, st C, ld B), Thread I executes ld D and Thread J executes st A. Label: use current IC of thread I.]
Correct, but more evictions mean more logged conflicts
[Figure labels: Hardware Cost, Log Size]

Set/LRU Approximation
[Figure: same cache set and example as above. Labels: use current IC of thread I; LRU guarantees B’s TS > A’s TS.]
Set/LRU better preserves reducibility
Small cache means more misses, but still a small log

Hardware Cost of Timestamps
Coupled timestamp memory
[Figure: cache with Tag, State, Data, Timestamp columns: A S … 1; B M … 2.]
Coupled timestamp memory: overhead proportional to cache size
• Not flexible
• 64B line + 64b (24b) timestamp: 12.5% (4.7%) overhead
• 192 KB for a 4MB L2
Need to modify cache

Decoupled Timestamp Memory
[Figure: the coupled design (Tag, State, Data, Timestamp in one array) next to the decoupled design: an unmodified cache (Tag, State, Data: A S …; B M …) plus a separate timestamp memory (Tag, Timestamp: A 1; B 2).]
Decoupling: small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way: 99% transitive reduction
• Timestamp memory: 24 KB
No need to modify cache
From 192 KB to 24 KB: 8x reduction

SC & TSO Applicability
Transitive Reduction & Regulated TR | Coherence Piggyback | Order-Value Hybrid | Set/LRU Approximation

Recording with Total Store Order (TSO)
Majority of existing MPs are non-SC
TSO is well defined, x86-like
Initially A=B=0
     Thread I    Thread J
  1  st A,1      st B,1
  2  ld B        ld A
Possible executions and the values loaded:
    Order                              Loaded     Allowed under
    st A,1; st B,1; ld B; ld A         A=1 B=1    SC and TSO
    st A,1; ld B; st B,1; ld A         A=1 B=0    SC and TSO
    st B,1; ld A; st A,1; ld B         A=0 B=1    SC and TSO
    ld A; ld B; st A,1; st B,1         A=0 B=0    TSO only

TSO Execution
[Figure: same example. Thread I's store A=1 and Thread J's store B=1 sit in their write buffers (WrBuf) while the memory system still holds A=0, B=0, so both loads return 0: the order ld A, ld B, st A,1, st B,1 with A=0 B=0.]

Order-Value-Hybrid Recording
[Figure: the same TSO example shown during recording and during replay. Labels in the recording: I starts to monitor B, J starts to monitor A, A changed!, I stops monitoring B, WAR omitted, value logged. Label in the replay: value used (A=0).]

Hybrid Recording with TR and RTR
Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al.
’91]
Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively

Evaluation Method & Results

Put-it-together: Determinizer/CMP
[Figure: 4-core CMP (Core 1 to Core 4) with a shared L2 cache (L1 directory). Each core has L1_I$ and L1_D$, an L1 coherence controller, an IC counter, a timestamp memory (TSM), a log, a TR register, and an RTR register.]

Simulation Method
Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
• 1-way in-order issue, 2 GHz, 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving

Log Size: 1 byte/kilo-instr
[Figure: two bar charts of log size for Apache, JBB, OLTP, Zeus, and AVG; one in KB/core/s, the other in byte/core/kilo-instr.]
Well within the capability of current machines
• Long recording (days to months) needs improvement

Runtime Overhead
[Figure: two bar charts for Apache, JBB, OLTP, and Zeus, comparing the baseline against the system with the race recorder: execution time and interconnection message bandwidth.]
Our recorder can be “always-on”

Benefits of RTR and Set/LRU (Log Size)
[Figure: two bar charts of log size for Apache, JBB, OLTP, Zeus, and AVG. “Improvement by RTR” compares Pairwise-TR with our RTR; “Effectiveness of Set/LRU” compares a perfect TSM with a 24KB Set/LRU TSM.]

Why RTR and Set/LRU Work Well?
RTR
• Processors execute instructions at similar speed
• Therefore, we can find “vectorizable” dependencies
Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old enough”

Sensitivity and Scalability
A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when TSM is large
• Partial timestamp: 24 bits is enough
• Associativity: higher is better for RTR
Scalability of the recorder
• Studied with modest processor counts (2p to 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with number of cores

Conclusions, etc.

Conclusions & Future Work
Race recording
• Key to combat nondeterminism
Contributions: effective & inexpensive recorder
• Transitive Reduction & RTR algorithm: small log size
• Coherence piggyback: negligible slowdown
• Timestamp approximation: low hardware cost
• Order-value hybrid: supports SC & TSO
Future work
• Operate with Hardware Transactional Memory
• Seek to eliminate timestamp on acknowledgements

Toward Recording w/ Snooping Protocols
Key problem is combined/implicit response
• Not a problem for AMD Hammer
[Figure: st A. Proc I's cache (Tag, State, Data, Timestamp): A S … 1; B M … 4. Proc J's cache: A S … 3; B I … 2. Messages: Get/X + current IC, Pull Shared; WAR detected & logged.]

Timestamp at L2-Directory or Memory?
[Figure: st A. Proc I's cache (Tag, State, Data, Timestamp): A S … 1; C M … 3; B M … 4. Proc J's cache: A S … 3, upgraded to M with timestamp 4; B I … 2. Messages: Get/S, Ack, Eviction with timestamp; the timestamp is kept in timestamp memory. Directory of A: Shared(J), Owner(), StickyS(I,J).]
Directory eviction: more false conflicts, like snooping

    % gcc para-hash.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;
For references, Google “Mark Hill”
Publications & Talks
• 2007: IEEE Micro Top Picks
• 2006: ASPLOS
• 2003: ISCA
Students & Graduates
• 2006: Xu Ph.D.
Thesis

Backup

Serializability Violation Detector [PLDI’05]
Like a race detector
No a priori annotation requirement
• “Critical sections” are inferred
Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition
[Figure: an inferred “critical section” over shared variables: Read in1, Read in2, Read local, Write local, Write out1, Write out2.]

Race Recording: Key to Determinism
Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but
Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Support only sequential consistency
Need a better race recorder

Recording/Replay & Debugging
[Figure: an online recorder runs on P1 to P4, taking Checkpoints A, B, C and storing logs A, B, C until a crash dumps “core”. The deterministic replayer reads Checkpoint B and replays from logs B and C up to the crash.]

Deterministic Replay & Fault Tolerance
Fault Recovery
• Replay after a failure
Fault Detection
• Replay then compare
(Courtesy of VMware)

Future: Record/Replay & Undo/Redo
Windows XP VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo

Future: Replay-based Synchronization
[Figure: during recording, one thread executes ld A, st B, Unlock() and the other executes lock(), st A, ld B; the log orders them. During replay the same ld A, st B and st A, ld B sequences run from the log.]
Three steps
• Coarse-grain sync. → fine-grain sync. → hardware sync.
Results: higher performance
Works only if static control flow & fixed data addresses
• DSP kernels

Race Recording Related Work
[Table: prior recorders grouped into total-order recorders (low replay parallelism) and partial-order recorders (high replay parallelism). Recorders named: Instant Replay ’87, R&C ’90, Bacon ’91 (hardware), Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04. Mechanisms named: bus transactions, Lamport clocks, scheduling, bus transaction groups, variable version, vector clocks. Attributes named: large or small log; low overhead, low overhead (sync only), low overhead (non-MP), or high overhead.]

Correctness of Order-Value-Hybrid
Removing WAR dependencies
• Say thread I reads, thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value

TR and TSO
TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen
     Thread I    Thread J
  1  st A        st B
  2  st C        st C
  3  ld B        ld A
[Figure label on the recording: must not be reduced.]

RTR and TSO
The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it
     Thread I    Thread J
  1  st A        add
  2  add         sub
  3  st B        ld A
  4  ld C        ld B
[Figure labels on the recording: old win for j:3; new win for j:3; in write buffer; ordered; not allowed by new window.]

Deadlock Avoidance of RTR
     Thread I    Thread J
  1  ld A        add
  2  st B        st C
  3  st C        ld B
  4  ld D        st A
  5  sub         st C
  6  ld B        st D
[Figure: recording and replay of the example; replay cycle through i:4, j:1, j:2, i:3, i:4.]
Avoid deadlock by adhering to an SC total order

Recording Race-free Executions
No data races
• Only need to record synchronization races
Deterministic replay up until the first data race

Replay Parallelism
Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations

Directory Protocols
Add sticky states in the
directory
• Retain states after writebacks
• Need extra acknowledgements
Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements
A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster

Snooping Protocols
Key problem is combined/implicit response
• Not a problem for AMD Hammer
[Figure: st A. Proc I's cache (Tag, State, Data, Timestamp): A S … 1; B M … 4. Proc J's cache: A S … 3; B I … 2. Messages: Get/X + current IC, Pull Shared; WAR detected & logged.]

Nonsilent Evictions
[Figure: st A. Proc I's cache (Tag, State, Data, Timestamp): A S … 1; C M … 3; B M … 4. Proc J's cache: A S … 3, upgraded to M with timestamp 4; B I … 2. Messages: Get/S, Ack, Eviction with timestamp; the timestamp is kept in timestamp memory. Directory of A: Shared(J), Owner(), StickyS(I,J).]
Directory eviction: more false conflicts, like snooping

Out-of-Order & Hardware Prefetching
Speculative execution
• No IC assigned yet
Hardware prefetching
• No IC assigned
Key idea: receive observation
• Can associate a ld/st with the current commit instruction

Unordered Messages in Interconnect
Messages arrive out of order
Can affect reduction
But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas

Integer Overflow
IC and timestamps may overflow
IC: make it 64 bits; will not overflow for a long time
Timestamps: use approximation techniques
• MSB of IC + LSB of timestamps

Varying TSM Size
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus). Y axis: Log Bandwidth (MB/core/second). X axis: Size of the Timestamp Memory (KB), 2 to 2048 (64 ways, Full Timestamps, Set/LRU). Series: 1TS-RTR, 1TS-TR, 2TS-RTR, 2TS-TR.]

Varying Associativity
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus). Y axis: Log Bandwidth (MB/core/second), 0.01 to 10. X axis: Associativity of the Timestamp Memory, 2 to 1024 (64KB, Full R/W Timestamps). Series: CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, SetLRU-RTR.]

Varying Partial Timestamp Width
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus). Y axis: Log Bandwidth (MB/core/second), 0.01 to 10. X axis: Partial Timestamp Width, 10 to 30 (64 sets, 64 ways, Set/LRU). Series: TR, RTR.]

Log Size Scaling
[Figure: Log Size (MB/core/s), 0.0 to 1.0, versus Number of Cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, and Zeus.]
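
Backup sketch: conflict logging and pairwise reduction in software
The per-address reader/writer bookkeeping from the “Detect Conflicts” slide and the reduction from “Netzer’s Transitive Reduction” can be modeled in a few lines of Python. This is only an illustrative software model under stated assumptions, not the hardware mechanism of the talk: the function names `record` and `reduce_pairwise` are hypothetical, the strictly alternating interleaving of the two threads is an assumption (the slides do not give the exact global order), and the reduction is a simplified pairwise check that happens to coincide with full transitive reduction for two threads.

```python
from collections import defaultdict

def record(trace):
    """Log every cross-thread conflict in a globally ordered trace.

    trace: list of (thread, op, addr); op is "ld", "st", or "alu".
    Returns tuples (kind, src_thread, src_ic, dst_thread, dst_ic).
    """
    ic = defaultdict(int)            # per-thread instruction count
    writer = {}                      # addr -> (thread, ic) of the last store
    readers = defaultdict(dict)      # addr -> {thread: ic of its last load}
    deps = []
    for thread, op, addr in trace:
        ic[thread] += 1
        now = ic[thread]
        if op not in ("ld", "st"):
            continue                 # ALU ops touch no memory
        last = writer.get(addr)
        if last is not None and last[0] != thread:
            kind = "RAW" if op == "ld" else "WAW"
            deps.append((kind, last[0], last[1], thread, now))
        if op == "ld":
            readers[addr][thread] = now
        else:
            for r, r_ic in readers[addr].items():
                if r != thread:
                    deps.append(("WAR", r, r_ic, thread, now))
            readers[addr].clear()
            writer[addr] = (thread, now)
    return deps

def reduce_pairwise(deps):
    """Drop a dependence when an earlier kept one from the same source
    thread to the same destination thread already implies it."""
    known = defaultdict(int)         # (dst, src) -> highest src ic already ordered
    kept = []
    for dep in deps:
        _, src, src_ic, dst, _ = dep
        if known[(dst, src)] >= src_ic:
            continue                 # implied by program order plus an earlier entry
        known[(dst, src)] = src_ic
        kept.append(dep)
    return kept

# Two-thread example from the slides; alternating interleaving is assumed.
I_OPS = [("ld", "A"), ("st", "B"), ("st", "C"), ("ld", "D"), ("alu", None), ("ld", "B")]
J_OPS = [("alu", None), ("st", "C"), ("ld", "B"), ("st", "A"), ("st", "C"), ("st", "D")]
trace = []
for i_op, j_op in zip(I_OPS, J_OPS):
    trace.append(("I",) + i_op)
    trace.append(("J",) + j_op)

all_deps = record(trace)
kept = reduce_pairwise(all_deps)
print(len(all_deps) * 16, len(kept) * 16)   # log bytes at 16 B per entry: 80 64
```

Under that assumed interleaving the model logs the same five dependences as the “Log All Conflicts” slide (80 bytes) and drops only the WAR from I:1 to J:4, leaving the four entries of the TR slide (64 bytes). It does not model the stricter dependences or vectorization that RTR adds.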