Deterministic Multiprocessing
Chris Fallin, David Lewis, Zongwei Zhou

What is Deterministic MP?
- A multiprocessor executes multiple threads
- Threads share resources (e.g., memory)
- Due to bus arbiters, memory controllers, etc., some orderings of accesses to shared resources are undefined
- Problem for: debugging (reproducibility), thorough testing (many possible cases)
- Deterministic: same input, same output

Types of Determinism
- Strong: same input, same output, regardless of race conditions
  - Must capture all communicating memory access pairs
- Weak: same input, same output, as long as locking is correct
  - Takes advantage of locks for low SW overhead

Types of Deterministic Execution
- Record/Replay: HW/SW keeps a log of program input
  - Single-program: system calls, memory interleavings
  - Full-system: interrupts, I/O, etc.
  - Log allows later replay of a bug
  - However, separate executions may still differ outside of replay
- Full-time: ordering of memory accesses follows a statically defined deterministic order; for the same program and the same input, the output is always the same

DMP: Deterministic Shared Memory Multiprocessing
Devietti, Lucia, Ceze, Oskin

Central Idea
- To guarantee deterministic behavior:
  - the direct way is to preserve the same global interleaving of instructions in every execution of a parallel program
  - this is unnecessary and has a significant performance impact
- Insight: only communicating pairs matter

Improve a bit...
- Not all memory accesses communicate
- Can parallelize the communication-free portion of each quantum
- Need to know when communication happens!
  - MESI cache coherence protocol provides this for free
- DMP Sharing Table
  - tracks info about memory ownership
  - two ownership change possibilities:
    - reading data owned by others
    - writing data to shared memory

Improve a bit more...
- Transactional Memory + deterministic commit order
- TM: atomicity and isolation of each quantum
- Speculation: find quanta not involved in communication
- If communication happens, squash + re-execute
- Potential optimization: forward uncommitted (speculative) data between quanta
  - could save a large number of squashes

Performance

Discussion
- Speculation
  - similar idea to TLS, but used for the opposite purpose
  - requires complex hardware
  - I/O or parts of the OS cannot execute speculatively
- Dealing with nondeterminism
  - threads can use the OS to communicate
  - nondeterministic OS API calls, e.g. read
- Better way of token-passing?

Kendo: Efficient Deterministic Multithreading in Software
Olszewski, Ansel, Amarasinghe

Definitions
- Strong Determinism
  - Deterministic order of memory accesses to shared data for a particular program input
  - ALWAYS produces the same output for every run with a particular input
  - Not easily provided without hardware support
- Weak Determinism
  - Deterministic order of lock acquisitions for a given program input
  - Produces the same output for every run if race-free
  - Can be guaranteed if all accesses to shared data are protected by locks
- If there are no data races, strong and weak determinism provide the same guarantees!

Introducing Kendo
- Software framework to enforce weak determinism of general lock-based C/C++ code on commodity shared-memory multiprocessors
- No special hardware necessary!

Deterministic Logical Time
- Each thread has its own monotonically increasing deterministic logical clock
  - How to implement? Performance counter events?
- When is it thread T's turn to use a lock?
  - All threads with tid < T have greater logical clocks
  - All threads with tid ≥ T have greater or equal logical clocks

Simple Locking Mechanism

    function det_mutex_lock(l) {
        pause_logical_clock();
        wait_for_turn();
        lock(l);
        inc_logical_clock();
        resume_logical_clock();
    }

    function det_mutex_unlock(l) {
        unlock(l);
    }

- Simple algorithm for implementing locks
- Pause the logical clock during acquisition and wait for our turn to access the lock (using the heuristic on the previous slide)
- Once in the critical section, resume the clock and continue
- Pros:
  - Easy to implement
- Problems?

Improved Lock

    function det_mutex_lock(l) {
        pause_logical_clock();
        while (true) {              // Loop until we have successfully acquired the lock
            wait_for_turn();        // Wait for our deterministic logical clock to be the unique global minimum
            if (try_lock(l)) {      // Check the state of the lock, acquiring it if it is free
                if (l.released_logical_time >= get_logical_clock()) {
                    // Lock is free in physical time, but still acquired in
                    // deterministic logical time, so we cannot acquire it yet
                    unlock(l);      // Release the lock
                } else {
                    // Lock is free in both physical and deterministic logical
                    // time, so it is safe to exit the spin loop
                    break;
                }
            }
            inc_logical_clock();    // Increment our deterministic logical clock and start over
        }
        inc_logical_clock();        // Increment our deterministic logical clock before exiting
        resume_logical_clock();
    }

    function det_mutex_unlock(l) {
        pause_logical_clock();
        l.released_logical_time = get_logical_clock();
        unlock(l);
        inc_logical_clock();
        resume_logical_clock();
    }

Optimizations
- Queuing
  - A queue for each lock guarantees first-come, first-served
- Fast-forwarding
  - While waiting for a lock, a thread can set its logical time to lock.released_logical_time (or +1 if queuing)
- Lazy reads
  - If the application can tolerate reading out-of-date shared data, there is no need to lock on read (e.g.,
finding a "best" value)
  - Provide a read window (in logical time): once all threads are past the earliest allowable logical time, the read can succeed

Results

Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
Montesinos, Hicks, King, Torrellas

Capo: Motivation
- Record/replay system for debugging
  - Not intended to be deployed in the field
- Builds on DeLorean [1]
  - Chunk-based record/replay system
  - Terminate chunks at communicating pairs, record chunk commit order only
  - Only half the story
- Capo adds the software side as a Linux implementation:
  - Record syscall results
  - Provide infrastructure to record/replay multiple programs and multiplex hardware record/replay features

[1] P. Montesinos, L. Ceze, and J. Torrellas, “DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently,” in ISCA, June 2008.

Capo's Contributions
- Replay Spheres: distinct realms of record/replay
- Defining the hardware-software interface
- Simulated DeLorean hardware (chunk-based recording)
- Linux kernel modifications

Capo Architecture
- Replay Sphere: set of R-threads; isolated environment
  - An arbitrary set of processes is inside the sphere
- Replay Sphere Manager: multiplexes HW support over spheres
- HW: records chunk commit order (DeLorean)
- SW: records system calls
- OS is not inside the sphere, except copy_to_user()

Hardware Details

Performance
- Record
- Replay
- Log Size

Summary
Helps with...: debugging, testing, replicas, deployment

                            Needs hw
    Capo (record/replay)    usually
    Kendo                   no
    DMP (Devietti et al.)   yes

Discussion
- Which is more useful: record/replay or full-time?
  - Debugging only vs. system design philosophy
  - Tradeoff: cost (log size, overhead) vs. utility
- Strong vs. weak determinism
  - Race conditions are an important class of bugs
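
Sketch: Kendo's turn rule in a toy simulation

The turn rule from the Deterministic Logical Time slide (lower tids must have strictly greater clocks, higher tids greater or equal) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not Kendo's implementation: the names is_turn, run, and programs are hypothetical, logical time is counted in abstract work units rather than performance counter events, the critical section is treated as instantaneous, and a finished thread's clock is set to infinity so it never blocks the others. A seeded random scheduler stands in for nondeterministic physical timing.

```python
import random


def is_turn(tid, clocks):
    """Turn rule from the slides: every thread with a lower tid must have a
    strictly greater logical clock, and every thread with a higher tid must
    have a greater-or-equal logical clock."""
    mine = clocks[tid]
    for other, clock in enumerate(clocks):
        if other < tid and not clock > mine:
            return False
        if other > tid and not clock >= mine:
            return False
    return True


def run(programs, seed):
    """Simulate threads under a randomly chosen (seeded) schedule.

    programs[t] is a non-empty list of work amounts (an assumption of this
    sketch); after each amount of work, thread t acquires the shared lock once.
    Returns the lock acquisition order as (logical_clock, tid) pairs.
    """
    rng = random.Random(seed)
    n = len(programs)
    clocks = [0] * n                      # deterministic logical clocks
    step = [0] * n                        # index of the current work item
    work = [p[0] for p in programs]       # work units left before next lock
    live = set(range(n))
    order = []
    while live:
        t = rng.choice(sorted(live))      # "physical" scheduling is arbitrary
        if work[t] > 0:
            work[t] -= 1
            clocks[t] += 1                # work advances logical time
        elif is_turn(t, clocks):
            order.append((clocks[t], t))  # acquire + release (instantaneous)
            clocks[t] += 1                # inc_logical_clock()
            step[t] += 1
            if step[t] == len(programs[t]):
                clocks[t] = float("inf")  # finished threads never block others
                live.discard(t)
            else:
                work[t] = programs[t][step[t]]
        # else: not our turn; the clock stays paused and the thread stalls
    return order


if __name__ == "__main__":
    demo = [[3, 1], [1, 4], [2, 2]]
    for seed in range(3):
        print(seed, run(demo, seed))
```

Running the demo with different seeds changes which thread is scheduled at each step, but the printed acquisition order is the same each time, because a thread can only take the lock when its (clock, tid) pair is the global minimum.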