Offline Symbolic Analysis for Multi-Processor Execution Replay

Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, Zijiang James Yang*, and Cristiano L. Pereira‡
University of Michigan, Ann Arbor †
Western Michigan University *
Intel, Inc ‡

Overview
Goal: Deterministic replay for multi-threaded programs
• Debug non-deterministic bugs
Sources of non-determinism
• Program input (interrupt, I/O, DMA, etc.)
• Shared-memory dependencies
Program Input
• Past solutions: Log I/O, signals, DMA, etc.
• Our solution: BugNet [ISCA'05], log loads (cache miss data)
Shared Memory Dependency
• Past solutions: Monitor memory operations (software is slow, hardware is complex)
• Our solution: SAT constraint solver, determine offline before replay

Deterministic Replay Uses
• Reproduce non-deterministic bugs
• Debugging: step backward in time
• Dynamic program analysis: memory leaks, data races, dangling pointers
[Figure: a recorder at a remote site or in-house, and a replayer at the developer site]

Traditional Record-N-Replay Systems
• Checkpoint memory and register state
• Log non-deterministic program input: interrupts, I/O values, DMA, etc.
• Log shared memory dependencies
[Figure: timelines for Thread 1, Thread 2, and Thread 3 with Write, Read, Read operations]

Recording Shared Memory Dependency
Problem
• Need to monitor every memory operation
Software-based replay systems (x10 to x100 slowdown)
• PinSEL (UCSD/Intel)
• iDNA (Microsoft)
Hardware-based replay systems (complex hardware)
• FDR/ReRun (Wisconsin)
• Strata (UCSD)
• DeLorean (UIUC)

Hardware Complexity
Hardware-based solution
• Detect shared memory dependencies by monitoring cache coherence messages
• Transitive optimization to reduce log size
[Figure: example operations W(a), W(b), W(b), R(a)]
Complexity
• Requires changes to coherence sub-system
• Complex to design and verify
• 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al., ICCD’06]

New Direction to Hardware-based Solution
Complexity-effective solution
• Do NOT record shared-memory dependencies at all
• Infer dependencies offline before replay using a Satisfiability Modulo Theories (SMT) solver

Our Approach
• Traditional: checkpoint memory and registers. Ours: checkpoint registers
• Traditional: log non-deterministic program input (interrupts, I/O values, DMA, etc.). Ours: BugNet [ISCA’05] load-based hardware recorder
• Traditional: log shared memory dependency. Ours: SMT solver reconstructs interleaving offline
[Figure: thread timelines with Write, Read, Read operations]

Roadmap
• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion

BugNet [Narayanasamy et al., ISCA’05]
Insight
• Recording initial register state and values of loads is sufficient for deterministic replay
• Implicitly captures the program input from I/O, DMA, interrupts, etc.
• Input and output of other instructions are reproduced during replay
Optimization
• Record a load only if it is the first access to a memory location
Our modification
• Recording data fetched on a cache miss captures first loads
• Any first access to a location results in a cache miss
• May unnecessarily record data due to store misses, but that is OK

Recording Cache Miss Data (First Loads)
Checkpoint
• Register values
• Program counter
Record cache misses
• (Memory count, data)
• Implicitly captures first loads
Deterministic replay
• Input and output (including address) of all instructions are replayed
On a store miss
• Record the old value: data before the store update
• The new value (data after the store update) can be reproduced deterministically
Example (execution over time, with the log-file entry each operation produces after the checkpoint)
  Load A = 0     (cnt1, 0)
  Load B = 5     (cnt2, 5)
  Load A = 0     no entry
  Store C = 1    (cnt3, 0)

BugNet Extension
Self-modifying code
• Treat an instruction read as a load, so instructions are logged
Full system replay
• Continue logging in kernel mode
• See the paper for details on context switches, page faults, etc.
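
The recording scheme on the "Recording Cache Miss Data (First Loads)" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BugNet hardware design: the cache is modeled as a set of addresses already touched, the "memory count" is taken to be the index of the memory operation within the thread, and the function names `record` and `replay` are hypothetical. The replayer is handed the operation trace for simplicity; in the real system the addresses and store values are regenerated by re-executing the instructions.

```python
def record(trace, memory):
    """Run one thread's memory operations and log the data fetched on each miss."""
    touched = set()   # addresses already accessed; a stand-in for the cache
    log = []          # (memory-op count, data) pairs
    for count, (kind, addr, value) in enumerate(trace, start=1):
        if addr not in touched:                 # first access -> cache miss
            log.append((count, memory[addr]))   # old value, before any store
            touched.add(addr)
        if kind == "store":
            memory[addr] = value
    return log


def replay(trace, log):
    """Re-execute the thread using only the log; memory starts out empty."""
    miss_data = dict(log)
    memory = {}
    loaded = []
    for count, (kind, addr, value) in enumerate(trace, start=1):
        if count in miss_data:                  # restore what the miss fetched
            memory[addr] = miss_data[count]
        if kind == "load":
            loaded.append(memory[addr])
        else:
            memory[addr] = value                # new value is recomputed, not logged
    return loaded, memory


# The four operations from the slide: Load A, Load B, Load A, Store C = 1.
trace = [("load", "A", None), ("load", "B", None),
         ("load", "A", None), ("store", "C", 1)]
log = record(trace, {"A": 0, "B": 5, "C": 0})
loaded, final_memory = replay(trace, log)
print(log)
print(loaded, final_memory)
```

The second Load A produces no log entry because it hits, and the store miss logs only the old value of C, which matches the three entries shown on the slide.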

BugNet for Multithreaded Programs
Insight
• BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread
• Recording cache miss data is sufficient for multithreaded programs
• No additional hardware support required for recording dependencies
Reason
• A load dependent on a remote write causes a cache miss to ensure coherence
• BugNet implicitly records load values dependent on remote writes
Effect
• Can replay each thread in isolation (independent of other threads) using BugNet logs

Replaying Each Thread Independently
Cache coherence
• Invalidate cache block to gain exclusive permission
Log cache miss data
• Implicitly records loads dependent on remote writes
• No change to coherence mechanism
Replay each thread
• Independent of others
[Figure: Proc 1 executes Load A=0 (cache miss, logged as (1st, 0)) and Load A=0 again; Proc 2 executes Store A=1 (cache miss, logged as (1st, 0)), which invalidates Proc 1's cache block; Proc 1's next Load A=1 misses and is logged as (3rd, 1)]

Shared Memory Dependency
[Figure: Thread 1 (Load A, Load B, Store A, Store C, Load A, Load B) and Thread 2 (Load A, Store C, Store C, Store A, Load B, Load B), each operation annotated with its old and new value; the dependencies between the threads are unknown; final state: A, B, C]
• SMT solver resolves shared memory dependency
• Billion instructions: offline analysis would not scale
• We need to bound the search space

Roadmap
• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion

Encoding Ordering Constraints (Assume Sequential Consistency)
[Figure: Proc 1 performs x1, x2; Proc 2 performs x3, x4, x5; each operation on x is annotated with its old and new value, ending in xFinal]
Program order constraint
  Proc1: X1 < X2
  AND Proc2: X3 < X4 < X5
Load-store constraint (M→old == M→prev→new)
  X1: X1 < X3
  AND X2: (X3 < X2 < X4 OR X5 < X2)
  AND ...

Multiple Memory Locations (Assume Sequential Consistency)
[Figure: Proc 1 performs y1, x1, x2, y2; Proc 2 performs x3, x4, x5, y3; each operation is annotated with its old and new value, ending in xFinal and yFinal]
Program order constraints
  Proc1: Y1 < X1 < X2 < Y2
  AND Proc2: X3 < X4 < X5 < Y3
Load-store constraints (M→old == M→prev→new)
  X1: X1 < X3
  AND X2: (X3 < X2 < X4 OR X5 < X2)
  AND ...
  AND Y1: Y1 < Y2
  AND Y2: Y1 < Y2 < Y3
  AND ...

Satisfiability Modulo Theories (SMT) Solver
Ordering constraints
  (Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ ...
[Figure: the ordering constraints are fed to the SMT solver, which outputs a total order over y1, x1, x2, y2 and x3, x4, x5, y3]
SMT solver
• Finds one valid total order from multiple solutions
• All solutions could be produced, if needed

Replay Guarantees
• The replayed execution has the same final register and memory states
• Each thread has exactly the same sequence of instructions, along with their inputs and outputs
• Reconstructed shared memory dependencies obey program order and load-store semantics

Bounding Search Space
Record "strata hints"
• Each processor periodically records its memory operation count
• Strata regions have a global order
SMT solver analyzes
• One region at a time
• Start from the last region
• Final state of a region = initial state of the following region
[Figure: Proc 1 and Proc 2 timelines divided every N cycles into Strata Regions 1, 2, and 3 by the recorded counts cnt1 to cnt4; each region has an initial state and a final state]

Strata Hints
Cycle-bound
• After N cycles, each core records its memory operation count
• No communication is required between cores
Problem
• The size of a strata region is not based on the number of shared memory dependencies
• Can we bound based on the number of shared memory dependencies?
Downgrade-bound
• Count coherence downgrade requests
• Requires communication between cores, but reduces offline analysis overhead

Filtering Local & Read-only Accesses
Filter
• Local accesses: no shared-memory dependency
• Read-only accesses: any total order is valid
Effectiveness
• < 1% of memory accesses remain to be analyzed
[Figure: one strata region with Thread 1 and Thread 2 accesses: Load A, Load B, Store B, Store A, Store B, and repeated Load C]

Roadmap
• Motivation
• Record & Replay
• Offline Symbolic Analysis
• Evaluation
  • Strata Hint Size
  • Offline Symbolic Analysis Overhead
• Conclusion

Evaluation
• Simics + cycle-accurate simulator
• Simulate multi-processor execution (2, 4, 8, 16 cores)
• Fast-forward up to known synchronization points
• Trace collected for 500 million instructions
• Benchmarks
  • SPLASH2: barnes, fmm, ocean
  • Parsec 2.0: blackscholes, bodytrack, x264
  • SPEComp: wupwise, swim
  • Apache
  • MySQL
• Yices SMT constraint solver [Dutertre and de Moura, CAV’06]

Strata Hints Size vs. Offline Analysis Overhead
[Figure: offline analysis time (seconds per second of program execution, log scale) and strata log size (MB/sec) for Cycle-bound (10,000), Downgrade-bound (10), and Downgrade-bound (25); annotations: x100, 10%]
• Downgrade-bound scheme is effective
• Offline analysis overhead is a one-time cost (not paid on every replay)

Strata Hints vs. ReRun Log
[Figure: strata log size (MB/sec, log scale) for the proposed downgrade-bound system (d10.c10000) and Rerun (henkins), ReRun [Hower and Hill, ISCA’08]; annotation: x4]
• Strata hints are 4x smaller than the ReRun log
• Significant reduction in hardware complexity

Recording Performance, etc.
• Cache miss data log
  • 290 Mbytes per second of program execution
• Recording performance
  • On average, 0.35% slowdown in IPC
• Scalability results can be found in the paper

Conclusion
• Deterministic replay for multi-threaded programs is critical
• We proposed a complexity-effective solution
  • Use BugNet: record cache miss data
  • No need to record shared memory dependencies
  • Determine shared memory dependencies offline using an SMT constraint solver
• Result
  • < 1% recording overhead
  • Efficient log size (4x smaller than the state-of-the-art scheme ReRun)
  • Can analyze one second of an 8-threaded program in less than 1000 seconds
  • One-time offline analysis cost (not for every replay)

Thank you
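
The constraints on the "Encoding Ordering Constraints" slide can be made concrete with a small sketch. It is not the authors' implementation and it does not call an SMT solver: it brute-forces every interleaving with the Python standard library, which is only feasible for a handful of operations. The old and new values assigned to x1 through x5 are hypothetical, chosen so that the load-store rule (M→old == M→prev→new) yields the same constraints as the slide: X1 < X3, and X3 < X2 < X4 OR X5 < X2.

```python
from itertools import permutations

# Program order per processor, as on the slide.
THREADS = {
    "proc1": ["x1", "x2"],
    "proc2": ["x3", "x4", "x5"],
}

# Hypothetical (old value, new value) per operation on location x.
# A load leaves the value unchanged, so old == new.
VALUES = {
    "x1": (0, 0),  # load that observed 0
    "x2": (1, 1),  # load that observed 1
    "x3": (0, 1),  # store: 0 -> 1
    "x4": (1, 2),  # store: 1 -> 2
    "x5": (2, 1),  # store: 2 -> 1
}
INITIAL, FINAL = 0, 1  # assumed initial and final values of x


def obeys_program_order(order):
    """Each thread's operations must appear in their original order."""
    return all([op for op in order if op in seq] == seq
               for seq in THREADS.values())


def obeys_load_store(order):
    """Each operation's old value must equal the previous operation's new value."""
    current = INITIAL
    for op in order:
        old, new = VALUES[op]
        if old != current:
            return False
        current = new
    return current == FINAL


def valid_total_orders():
    names = [op for seq in THREADS.values() for op in seq]
    return [order for order in permutations(names)
            if obeys_program_order(order) and obeys_load_store(order)]


orders = valid_total_orders()
for order in orders:
    print(" < ".join(order))
```

Two total orders survive, one for each branch of the OR in the X2 constraint. An SMT solver such as Yices would be given the same constraints symbolically and return one satisfying order without enumerating the interleavings.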