Deterministic Multiprocessing
Chris Fallin, David Lewis, Zongwei Zhou

What is Deterministic MP?
- Multiprocessor executes multiple threads
- Threads share resources (e.g., memory)
- Due to bus arbiters, memory controllers, etc., some orderings in shared resources are undefined
- Problem for: debugging (reproducibility), thorough testing (many possible cases)
- Deterministic: same input → same output

Types of Determinism
- Strong: same input → same output, regardless of race conditions
  - Must capture all communicating memory access pairs
- Weak: same input → same output, as long as locking is correct
  - Takes advantage of locks for low SW overhead

Types of Deterministic Execution
- Record/Replay: HW/SW keeps log of program input
  - Single-program: system calls, memory interleavings
  - Full-system: interrupts, I/O, etc.
  - Log allows later replay of a bug
  - However, several executions may still differ outside of replay
- Full-time
  - Ordering of memory accesses follows a statically defined deterministic order: for the same program and the same input, the output is always the same

DMP: Deterministic Shared Memory Multiprocessing
Devietti, Lucia, Ceze, Oskin

Central Idea
- To guarantee deterministic behavior:
  - the direct way is to preserve the same global interleaving of instructions in every execution of a parallel program
  - this is unnecessary and has a significant performance impact
- Insight: only communicating pairs matter

Improve a bit...
- Not all memory accesses communicate
- Can parallelize the communication-free portion of each quantum
- Need to know when communication happens!
- MESI cache coherence protocol provides this for free

DMP Sharing Table
- Tracks info about memory ownership
- Two ownership change possibilities:
  - reading data owned by others
  - writing data to shared memory

Improve a bit more...
- Transactional Memory + deterministic commit order
- TM: atomicity and isolation of each quantum
- Speculation: find quanta not involved in communication
- If communication happens, squash + re-execute
- Potential optimization:
  - forward uncommitted (or speculative) data between quanta
  - could save a large number of squashes

Performance

Discussion
- Speculation
  - similar idea to TLS, but used for the opposite purpose
  - requires complex hardware
  - I/O or parts of the OS cannot execute speculatively
- Dealing with nondeterminism
  - threads can use the OS to communicate
  - nondeterministic OS API calls, e.g. read
- Better way of token-passing?

Kendo: Efficient Deterministic Multithreading in Software
Olszewski, Ansel, Amarasinghe

Definitions
- Strong Determinism
  - Deterministic order of memory accesses to shared data for a particular program input
  - ALWAYS produces the same output for every run with a particular input
  - Not easily providable without hardware support
- Weak Determinism
  - Deterministic order of lock acquisitions for a given program input
  - Produces the same output for every run if race-free
  - Can be guaranteed if all accesses to shared data are protected by locks
- If there are no data races, strong and weak determinism provide the same guarantees!

Introducing Kendo
- Software framework to enforce weak determinism of general lock-based C/C++ code for commodity shared-memory multiprocessors
  - No special hardware necessary!
- Deterministic Logical Time
  - Each thread has its own monotonically increasing deterministic logical clock
  - How to implement? Performance counter events?
- When is it a thread T's turn to use a lock?
  - All threads with tid < T have greater logical clocks
  - All threads with tid ≥ T have greater or equal logical clocks

Simple Locking Mechanism

    function det_mutex_lock(l) {
        pause_logical_clock();
        wait_for_turn();
        lock(l);
        inc_logical_clock();
        resume_logical_clock();
    }

    function det_mutex_unlock(l) {
        unlock(l);
    }

- Simple algorithm for implementing locks
- Pause the logical clock during acquisition and wait for our turn to access the lock (using the heuristic on the previous slide)
- Once in the critical section, resume the clock and continue
- Pros:
  - Easy to implement
- Problems?

Improved Lock

    function det_mutex_lock(l) {
        pause_logical_clock();
        while (true) {  // Loop until we have successfully acquired the lock
            wait_for_turn();  // Wait for our deterministic logical clock to be unique global minimum
            if (try_lock(l)) {  // Check the state of the lock, acquiring it if it is free
                if (l.released_logical_time >= get_logical_clock()) {
                    // Lock is free in physical time, but still acquired in
                    // deterministic logical time so we cannot acquire it yet
                    unlock(l);  // Release the lock
                } else {
                    // Lock is free in both physical and in deterministic logical
                    // time, so it is safe to exit the spin loop
                    break;
                }
            }
            inc_logical_clock();  // Increment our deterministic logical clock and start over
        }
        inc_logical_clock();  // Increment our deterministic logical clock before exiting
        resume_logical_clock();
    }

    function det_mutex_unlock(l) {
        pause_logical_clock();
        l.released_logical_time = get_logical_clock();
        unlock(l);
        inc_logical_clock();
        resume_logical_clock();
    }

Optimizations
- Queuing
  - A queue for each lock guarantees first-come, first-served ordering
- Fast-forwarding
  - While waiting for a lock, a thread can set its logical time to lock.released_logical_time (or +1 if queuing)
- Lazy reads
  - If the application can read out-of-date shared data, there is no need to lock on read (e.g. finding a "best" value)
  - Provide a read window (in logical time): once all threads are past the earliest allowable logical time, the read can succeed

Results

Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
Montesinos, Hicks, King, Torrellas

Capo: Motivation
- Record/replay system for debugging
  - Not intended to be deployed in the field
- Builds on DeLorean [1]
  - Chunk-based record/replay system
  - Terminate chunks at communicating pairs, record chunk commit order only
  - Only half the story
- Capo adds the software side as a Linux implementation:
  - Record syscall results
  - Provide infrastructure to record/replay multiple programs and multiplex hardware record/replay features

[1] P. Montesinos, L. Ceze, and J. Torrellas, "DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently," in ISCA, June 2008.

Capo's Contributions
- Replay Spheres: distinct realms of record/replay
- Defining the hardware-software interface
- Simulated DeLorean hardware (chunk-based recording)
- Linux kernel modifications

Capo Architecture
- Replay Sphere: set of R-threads; isolated environment
- An arbitrary set of processes is inside the sphere
- Replay Sphere Mgr: multiplexes HW support over spheres
- HW: records chunk commit order (DeLorean)
- SW: records system calls
- OS not inside sphere, except copy_to_user()

Hardware Details

Performance: Record, Replay, Log Size

Summary
Helps with…: debugging, testing, replicas, deployment

    System                   Needs HW
    Capo (record/replay)     usually
    Kendo                    no
    DMP (Devietti et al.)    yes

Discussion
- Which is more useful: record/replay or full-time?
  - Debugging only, vs. system design philosophy
  - Tradeoff: cost (log size, overhead) vs. utility
- Strong vs. weak determinism
  - Race conditions are an important class of bugs
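
Sketch: Kendo's turn rule

The turn rule from the Kendo slides (a thread may take a lock only when its deterministic logical clock is the unique global minimum, with ties going to the lower thread ID) can be illustrated with a small single-process simulation. This is a rough sketch under stated assumptions, not Kendo's implementation: the names is_turn and acquisition_order, the work lists, and the seeded random scheduler are all hypothetical, there is only one lock, the critical section is empty, and a finished thread's clock is set to infinity so that it never blocks the others.

```python
import random

INF = float("inf")  # assumption: a finished thread no longer constrains the others


def is_turn(tid, clocks):
    """Turn rule from the slides: lower tids must have strictly greater
    clocks, higher tids must have greater-or-equal clocks."""
    mine = clocks[tid]
    for other, clock in enumerate(clocks):
        if other < tid and clock <= mine:
            return False
        if other > tid and clock < mine:
            return False
    return True


def acquisition_order(work, seed):
    """Simulate threads under an arbitrary (seeded) physical schedule.

    work[t] lists the logical-clock increments thread t accumulates before
    each of its lock acquisitions (hypothetical workload description).
    Returns the order in which threads acquire the single lock.
    """
    rng = random.Random(seed)
    n = len(work)
    clocks = [0 if work[t] else INF for t in range(n)]
    step = [0] * n           # index of each thread's next acquisition
    waiting = [False] * n    # True while a thread waits for its turn
    order = []
    while any(c != INF for c in clocks):
        # The "physical" scheduler picks any unfinished thread at random.
        tid = rng.choice([t for t in range(n) if clocks[t] != INF])
        if not waiting[tid]:
            clocks[tid] += work[tid][step[tid]]  # compute phase
            waiting[tid] = True
        elif is_turn(tid, clocks):
            order.append(tid)                    # acquire (and release) the lock
            clocks[tid] += 1                     # inc_logical_clock()
            step[tid] += 1
            waiting[tid] = False
            if step[tid] == len(work[tid]):
                clocks[tid] = INF                # thread is done
        # otherwise: not our turn yet, spin
    return order


work = [[3, 4], [1, 1, 6], [2, 5]]
orders = {tuple(acquisition_order(work, seed)) for seed in range(20)}
print(orders)  # {(1, 2, 0, 1, 0, 2, 1)}
```

Every seed gives a different physical interleaving, yet the set holds a single acquisition order: acquisitions are ordered by (logical time, tid), which depends only on the program's work and not on scheduling. That is the weak-determinism guarantee in miniature.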
