Effective and Inexpensive (Memory) Race Recording
Min Xu
Thesis Defense, 05/04/2006
Electrical and Computer Engineering Department, UW-Madison
Advisors: Mark Hill, Rastislav Bodik
Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood

Overview
Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism
Effective, inexpensive race recorder: a case study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Support both SC & TSO (x86-like)

Thesis Contributions
Effective & Inexpensive
• RTR Algorithm -> Small Log Size
• Coherence Piggyback -> Low Runtime Overhead
• Set/LRU Approximation -> Low Cost Hardware
• Order-Value Hybrid -> SC & TSO Applicability

Outline
• Motivation & Problem (5 slides)
• An Effective and Inexpensive Race Recorder (21 slides): RTR Algorithm, Coherence Piggyback, Set/LRU Approximation, Order-Value Hybrid
• Evaluation Method & Results (6 slides)
• Conclusion & My Other Research (3 slides)

Motivation & Problem

Multithreaded Debugging
% gcc hash.c
% a.out
Segmentation fault
%
% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
%
% gdb a.out
gdb> run
Program exited normally.
gdb>

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%
% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

Race Recording
[Figure: Thread I and Thread J race on X with the operations X=1, X = X*5, X++, print(X). The original recording prints X=6. A replay without the log can print X=10; a replay that follows the log prints X=6.]

Recording for Multithreaded Replay
Race Recording (focus)
• Not an issue for a single thread
• Create the same general & data races
Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not focus
Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not focus

A Good Race Recorder
• Low cost
• Low runtime overhead
• Applicability
• Long recording: small log

Desired & Existing Race Recorders
Recorders compared: Desired Recorder, InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder
Criteria:
• Recording Length: Small Log Size
• Applicability: Racey Code, MP, SC, TSO
• Overhead: Negligible Slowdown
• Cost: Little Hardware

Small Log Size
(RTR Algorithm, Coherence Piggyback, Order-Value Hybrid, Set/LRU Approximation)

Problem Formulation
Thread I    Thread J
ld A        add
st B        st C
st C        ld B
ld D        st A
sub         st C
ld B        st D
Recording -> Log -> Replay
Legend: Conflicts (red), Dependence (black)
Reproduce exact same conflicts: no more, no less

Log All Conflicts
Thread I    Thread J
1 ld A      1 add
2 st B      2 st C
3 st C      3 ld B
4 ld D      4 st A
5 sub       5 st C
6 ld B      6 st D
Log J (I's IC -> J's IC): 2->3, 1->4, 3->5, 4->6
Log I (J's IC -> I's IC): 2->3
• Assign IC (logical timestamps)
• Detect conflicts
• Write log
Dependence log entry: 16 bytes
Log Size: 5*16 = 80 bytes (10 integers)
But too many conflicts

Netzer’s Transitive Reduction
Same example; the dependence 1->4 (I's ld A before J's st A) is TR reduced.
TR Reduced Log
Log J: 2->3, 3->5, 4->6
Log I: 2->3
Log Size: 64 bytes (8 integers)

The Intuition of the New RTR Algorithm
From I to J Vectors
After Reduction
Regulate Replay (RTR)
From J to I Vectors

Stricter Dependences to Aid Vectorization
Thread I    Thread J
1 ld A      1 add
2 st B      2 st C
3 st C      3 ld B
4 ld D      4 st A
5 sub       5 st C
6 ld B      6 st D
The stricter dependence 4->5 (I's ld D before J's st C) lets both 3->5 and 4->6 be reduced.
New Reduced Log
Log J: 2->3, 4->5
Log I: 2->3
Log Size: 48 bytes (6 integers)

Compress Vectorized Dependencies
Same example, with the dependences written as vectors.
Vectorized Log
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log Size: 40 bytes (5 integers)
Reduce log size to KB/core/second

Low Runtime Overhead
(RTR Algorithm, Coherence Piggyback, Order-Value Hybrid, Set/LRU Approximation)

Detect Conflicts
Per-address metadata: A.readers, A.writer
Thread I
1 ld A:  A.readers.add(I, 1)
2 st B:  B.writer = (I, 2)
3 st C:  if (C.writer != I) log(WAW)
         foreach C.readers
           if (reader != I) log(WAR)
         C.readers.clear( )
         C.writer = (I, 3)
Thread J
1 add
2 st C:  C.writer = (J, 2)
3 ld B:  if (B.writer != J) log(RAW)
         B.readers.add(J, 3)
4 st A:  …
Expensive in software

Use Cache and Cache Coherence
Proc J executes ld B.
Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4
Proc J cache
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2
Cache timestamps take the place of A.readers, A.writer, B.readers, B.writer.
Messages: Get/S Request; Data Response + Timestamp
RAW Detected & Logged
Detect conflict in hardware with little runtime cost

Cache Evictions and Writebacks
Proc J executes st A.
Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
C    M      …     3
B    M      …     4
Proc J cache
Tag  State   Data  Timestamp
A    S -> M  …     3 -> 4
B    I       …     2
Messages: Get/S, Inv, Ack
WAR Detected & Logged
Directory of A: Shared(I,J) Owner()
Timestamp?
OK with nonsilent eviction & directory eviction

Implement TR and RTR in Hardware
Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction
Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies

Hardware Implementation
Cache
• Eviction/writeback: Solved, more details later
• Directory protocols: Solved
• Snooping protocols: Partly solved
• Two-level coherence: Not yet solved
Processor
• Out-of-order/Prefetching: Solved
• Unordered message: Solved
• Counter overflow: Solved
• Thread Migration: Not yet solved

Low Cost Hardware
(RTR Algorithm, Order-Value Hybrid, Coherence Piggyback, Set/LRU Approximation)

Timestamp Approximation
One Set of I’s $
Tag  State  Data  Timestamp
A    S      …     1
C    M      …     3
B    M      …     2
Directory of A: Shared(I)
Thread I    Thread J
1 ld A      1 add
2 st B      2 st C
3 st C      3 ld B
  ld D        st A      (I and J mark each thread's current IC)
Use current IC of thread I
Correct, but more evictions -> more logged conflicts
Tradeoff: Hardware Cost vs. Log Size

Set/LRU Approximation
Same cache set and threads as on the previous slide.
Instead of the current IC of thread I, use the LRU timestamp of the set
• LRU guarantee: B’s TS > A’s TS
Set/LRU better preserves reducibility
Small $ -> more misses, but still a small log

Hardware Cost of Timestamps
Coupled Timestamp Memory
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2
Coupled timestamp memory: overhead proportional to cache size
• Not flexible
• 64B line + 64b (24b) timestamp -> 12.5% (4.7%) overhead
• 192 KB for a 4MB L2
Need to modify cache

Decoupled Timestamp Memory
Cache (after decoupling)
Tag  State  Data
A    S      …
B    M      …
Timestamp Memory
Tag  Timestamp
A    1
B    2
Small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way: 99% transitive reduction
• Timestamp memory: 24 KB
No need to modify cache
From 192 KB to 24 KB: 8x reduction

SC & TSO
Applicability
(RTR Algorithm, Coherence Piggyback, Order-Value Hybrid, Set/LRU Approximation)

Recording with Total Store Order (TSO)
Majority of existing MPs are non-SC
TSO is well defined, x86-like
Initially A=B=0
Thread I    Thread J
1 st A,1    1 st B,1
2 ld B      2 ld A
Memory order                    Loaded values   Allowed by
st A,1; st B,1; ld B; ld A      A=1 B=1         SC, TSO
st A,1; ld B; st B,1; ld A      A=1 B=0         SC, TSO
st B,1; ld A; st A,1; ld B      A=0 B=1         SC, TSO
ld A; ld B; st A,1; st B,1      A=0 B=0         TSO only

TSO Execution
[Figure: the same two threads, each with a write buffer (WrBuf). Thread I's buffer holds A=1 and Thread J's holds B=1 while the memory system still holds A=0, B=0. The loads reach memory before the buffered stores drain, so ld A and ld B both return 0: A=0 B=0.]

Order-Value-Hybrid Recording
[Figure: the same TSO example, shown during recording and during replay. Recording: st A,1 and st B,1 wait in the write buffers while the memory system holds A=0, B=0. Figure labels: Start Monitor A, Start Monitor B, Stop Monitor B, A Changed!, WAR Omitted, Value Logged. Replay: Value Used.]

Hybrid Recording with TR and RTR
Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]
Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively

Evaluation Method & Results

Put-it-together: Determinizer/CMP
[Figure: 4-core CMP (Core 1 to Core 4), each core with L1_I$, L1_D$, and an L1 coherence controller, sharing an L2 cache (L1 Dir). Recorder components shown: IC, TSM, TR Reg, RTR Reg, Log.]

Simulation Method
Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
  • 1-way in-order issue, 2 GHz
  • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving

Log Size: 1 byte/kilo-instr
[Figure: two bar charts of log size, one in KB/core/s and one in byte/core/kilo-instr, for Apache, JBB, OLTP, Zeus, and AVG.]
Well within the capability of current machines
• Long recording (days to months) needs improvement

Runtime Overhead
[Figure: two bar charts, Execution Time and Interconnection Msg. B/W, for Apache, JBB, OLTP, and Zeus. Series: Baseline, With race recorder.]
Our recorder can be “always-on”

Benefits of RTR and Set/LRU (Log Size)
[Figure: two bar charts of Log Size for Apache, JBB, OLTP, Zeus, and AVG. Improvement by RTR: Pairwise-TR vs. Our RTR. Effectiveness of Set/LRU: Perfect TSM vs. 24KB Set/LRU TSM.]

Why RTR and Set/LRU Work Well?
RTR
• Processors execute instructions at similar speed
• Therefore, we can find “vectorizable” dependencies
Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old enough”

Sensitivity and Scalability
A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when TSM is large
• Partial timestamp: 24 bits are enough
• Associativity: higher is better for RTR
Scalability of the recorder
• Studied with modest processor counts (2p to 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with the number of cores

Conclusion & My Other Research

Race Recording
Race recording
• Key to combat nondeterminism
My thesis: an effective & inexpensive recorder
• RTR algorithm -> small log size
• Coherence piggyback -> negligible slowdown
• Timestamp approximation -> low hardware cost
• Order-value hybrid -> support SC & TSO
Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay

Serializability Violation Detector [PLDI’05]
Like a race detector
No a priori annotation requirement
• “Critical sections” are inferred
Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition
[Figure: a “Critical Section” over Shared Variables: Read in1, Read in2, Read local, Write local, Write out1, Write out2.]

Publications
FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)
SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition
RTR, Set/LRU & Hybrid
• Submitted for publication

Thank you!

Acknowledgements
Joint work with my advisors
• Mark Hill, Ras Bodik
Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller
Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen
Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun

Deterministic Replay is Useful
Deterministic replay is logically recreating a program execution
Present applications
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])
Future applications
• Data Recovery
• Replay-based Synchronization

Multicore and Multithreading
Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1
Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time

Race Recording: Key to Determinism
Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but
Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Support only sequential consistency
Need a better race recorder

Recording/Replay & Debugging
[Figure: an Online Recorder on processors P1 to P4 stores log A, log B, and log C alongside Checkpoint A, Checkpoint B, and Checkpoint C. After a crash, a “core” is dumped; the Deterministic Replayer reads Checkpoint B and replays from logs B and C up to the crash.]

Deterministic Replay & Fault Tolerance
Fault Recovery
• Replay after a failure
Fault Detection
• Replay then compare
(Courtesy of VMware)

Future: Record/Replay & Undo/Redo
VM as a software platform (figure: Windows XP)
• Ease software development
• Fine granularity in Undo and Redo

Future: Replay-based Synchronization
[Figure: Recording: ld A, st B, Unlock() in one thread and lock(), st A, ld B in the other produce a Log. Replay: ld A, st B and st A, ld B are ordered from the log.]
Three steps
• Coarse-grain sync. -> fine-grain sync. -> hardware sync.
Results: higher performance
Works only if static control flow & fixed data addr
• DSP kernels

Race Recording Related Work
[Table: related race recorders grouped as total-order recorders (low replay parallelism) and partial-order recorders (high replay parallelism). Recorders: Bacon ’91 (Hardware), RecPlay ’00, JaRec ’04, R&C ’90, Instant Replay ’87, Netzer ’93, Déjà Vu ’98. Techniques: bus transactions, bus transaction groups, Lamport clocks, scheduling, variable version, vector clocks. Each is rated by log size (large or small; sync only) and overhead (low or high; non-MP).]

Correctness of Order-Value-Hybrid
Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value

TR and TSO
TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen
Thread I    Thread J
1 st A      1 st B
2 st C      2 st C
3 ld B      3 ld A
Recording: must not be reduced

RTR and TSO
The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it
Thread I    Thread J
1 st A      1 add
2 add       2 sub
3 st B      3 ld A
4 ld C      4 ld B
Figure labels: old win for j:3, new win for j:3, in write buffer, ordered, Not allowed by new window

Deadlock Avoidance of RTR
Thread I    Thread J
1 ld A      1 add
2 st B      2 st C
3 st C      3 ld B
4 ld D      4 st A
5 sub       5 st C
6 ld B      6 st D
Replay cycle: i:4 -> j:1 -> j:2 -> i:3 -> i:4
Avoid deadlock by adhering to an SC total order

Recording Race-free Executions
No data races
Only need to record synchronization races
Deterministic replay up until the first data race

Replay Parallelism
Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations

Directory Protocols
Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements
Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements
A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster

Snooping Protocols
Key problem is combined/implicit response
• Not a problem for AMD Hammer
Proc J executes st A.
Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4
Proc J cache
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2
Messages: Get/X + Current IC; Pull Shared
WAR Detected & Logged

Nonsilent Evictions
Proc J executes st A.
Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
C    M      …     3
B    M      …     4
Proc J cache
Tag  State   Data  Timestamp
A    S -> M  …     3 -> 4
B    I       …     2
Directory of A: Shared(J) Owner() StickyS(I,J)
Messages: Eviction, Get/S, Ack, Timestamp
Timestamp Memory
Directory eviction: more false conflicts, like snooping

Out-of-Order & Hardware Prefetching
Speculative execution
• No IC assigned yet
Hardware prefetching
• No IC assigned
Key idea: receive observation
• Can associate a ld/st with the current commit instruction

Unordered Messages in Interconnect
Messages arrive out of order
Can affect reduction
Better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas

Integer Overflow
IC and timestamps may overflow
IC: make it 64-bit, will not overflow for a long time
Timestamps: use approximation techniques
• MSB of IC + LSB of Timestamps

Varying TSM Size
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus). Y axis: Log Bandwidth (MB/core/second). X axis: Size of the Timestamp Memory (KB), 2 to 2048 (64 ways, Full Timestamps, Set/LRU). Series per panel: 1TS-RTR, 1TS-TR, 2TS-RTR, 2TS-TR.]

Varying Associativity
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus). Y axis: Log Bandwidth (MB/core/second), 0.01 to 10. X axis: Associativity of the Timestamp Memory, 2 to 1024 (64KB, Full R/W Timestamps). Series per panel: CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, SetLRU-RTR.]

Varying Partial Timestamp Width
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus). Y axis: Log Bandwidth (MB/core/second), 0.01 to 10. X axis: Partial Timestamp Width, 10 to 30 (64 sets, 64 ways, Set/LRU). Series per panel: TR, RTR.]

Log Size Scaling
[Figure: Log Size (MB/core/s), 0.0 to 1.0, versus Number of Cores (2, 4, 8, 16). Series: Apache, SPECjbb, OLTP, Zeus.]

In Retrospect …
What are you most proud of?
• RTR improves TR after 13 years
What would you do differently if doing it again?
• “replaying me is deterministic” (just kidding)
• I wish I had focused on race recording earlier
What should the industry do?
• Implement the recorder as a VMM extension
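
Backup sketch for the "Log All Conflicts" and "Netzer's Transitive Reduction" slides. The Python below is only an illustration under stated assumptions, not the recorder's implementation: the global interleaving in TRACE is a hypothetical one chosen to be consistent with the dependences on the slides, the function names are made up for this example, and the reduction is a simplified two-thread, scalar-timestamp pass in the spirit of Pairwise-TR (it is neither Netzer's full vector-clock algorithm nor RTR). It follows the readers/writer bookkeeping from the "Detect Conflicts" slide.

```python
from collections import defaultdict

# Hypothetical global order of memory operations: (thread, IC, op, address).
# The non-memory instructions (add, sub) are omitted.
TRACE = [
    ("I", 1, "ld", "A"), ("I", 2, "st", "B"), ("J", 2, "st", "C"),
    ("I", 3, "st", "C"), ("J", 3, "ld", "B"), ("I", 4, "ld", "D"),
    ("J", 4, "st", "A"), ("J", 5, "st", "C"), ("J", 6, "st", "D"),
    ("I", 6, "ld", "B"),
]

def detect_conflicts(trace):
    """Log every cross-thread RAW, WAR, and WAW as (kind, source, destination)."""
    writer = {}                   # address -> (thread, IC) of the last store
    readers = defaultdict(dict)   # address -> {thread: IC of its last load}
    log = []
    for thread, ic, op, addr in trace:
        last = writer.get(addr)
        if op == "ld":
            if last and last[0] != thread:
                log.append(("RAW", last, (thread, ic)))
            readers[addr][thread] = ic
        else:  # store
            if last and last[0] != thread:
                log.append(("WAW", last, (thread, ic)))
            for r_thread, r_ic in readers[addr].items():
                if r_thread != thread:
                    log.append(("WAR", (r_thread, r_ic), (thread, ic)))
            readers[addr].clear()
            writer[addr] = (thread, ic)
    return log

def pairwise_reduce(log):
    """Drop dependences implied by an earlier logged one for the same thread pair.

    Assumes entries arrive in the destination thread's program order, so one
    scalar per (source, destination) pair is enough to test for redundancy.
    """
    newest_src = {}   # (source thread, destination thread) -> largest source IC kept
    kept = []
    for kind, (s_thread, s_ic), dst in log:
        key = (s_thread, dst[0])
        if s_ic <= newest_src.get(key, 0):
            continue  # an earlier dependence already orders this source first
        newest_src[key] = s_ic
        kept.append((kind, (s_thread, s_ic), dst))
    return kept

full_log = detect_conflicts(TRACE)
reduced_log = pairwise_reduce(full_log)
print(len(full_log) * 16, "bytes before reduction")    # 80 bytes before reduction
print(len(reduced_log) * 16, "bytes after reduction")  # 64 bytes after reduction
```

With 16 bytes per logged dependence, the sketch reproduces the 80-byte and 64-byte log sizes from the slides; the dropped entry is the WAR from I's ld A (IC 1) to J's st A (IC 4).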