Deterministic Multiprocessing Chris Fallin, David Lewis, Zongwei Zhou

What is Deterministic MP?
 Multiprocessor executes multiple threads
 Threads share resources (e.g., memory)
 Due to bus arbiters, memory controllers, etc., the ordering of some accesses to shared resources is undefined
 This is a problem for debugging (reproducibility) and for thorough testing (many possible cases)
 Deterministic: same input → same output
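To make the problem concrete, here is a minimal sketch that enumerates every interleaving of two threads, each doing a non-atomic increment (load, then store) of a shared counter. The two-operation thread model and all names are illustrative assumptions, not part of any of the systems discussed in these slides.

```python
from itertools import combinations

# Each thread performs a non-atomic increment of a shared counter:
# load the counter into a private register, then store register + 1.
THREAD_OPS = ["load", "store"]

def run(schedule):
    """Execute one interleaving; schedule is a sequence of thread ids."""
    counter = 0
    regs = {0: 0, 1: 0}
    pc = {0: 0, 1: 0}
    for tid in schedule:
        op = THREAD_OPS[pc[tid]]
        if op == "load":
            regs[tid] = counter
        else:
            counter = regs[tid] + 1
        pc[tid] += 1
    return counter

def all_schedules():
    """Yield every interleaving of two threads with two operations each."""
    slots = range(4)
    for t0_slots in combinations(slots, 2):
        yield tuple(0 if i in t0_slots else 1 for i in slots)

# Same program, same input, but the result depends on the interleaving.
outcomes = {run(s) for s in all_schedules()}
print(sorted(outcomes))
```

The set of outcomes has more than one element, which is exactly the nondeterminism a deterministic multiprocessor is meant to remove.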
Types of Determinism
 Strong: same input → same output, regardless of race conditions
 Must capture all communicating memory access pairs
 Weak: same input → same output, as long as locking is correct
 Takes advantage of locks for low SW overhead
Types of Deterministic Execution
 Record/Replay: HW/SW keeps log of program input
 Single-program: system calls, memory interleavings
 Full-system: interrupts, I/O, etc
 Log allows later replay of a bug
 However, separate executions that are not being replayed may still differ from one another
 Full-time
 Ordering of memory accesses follows a statically-defined
deterministic order: for same program and same input, output
is always same
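The record/replay idea can be sketched as a log of nondeterministic inputs that is written during the original run and read back during replay. The `InputLog` class and the `program` function below are hypothetical names chosen for illustration; a real system would log system call results, interrupts, or memory interleavings rather than random numbers.

```python
import random

class InputLog:
    """Records nondeterministic inputs in 'record' mode and replays them later."""

    def __init__(self, mode, entries=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.entries = [] if entries is None else list(entries)
        self.cursor = 0

    def nondet(self, source):
        """Return source() while recording, or the logged value while replaying."""
        if self.mode == "record":
            value = source()
            self.entries.append(value)
            return value
        value = self.entries[self.cursor]
        self.cursor += 1
        return value

def program(log):
    # Stand-in for a program whose output depends on nondeterministic input.
    a = log.nondet(lambda: random.randint(0, 10**9))
    b = log.nondet(lambda: random.randint(0, 10**9))
    return a * 31 + b

recording = InputLog("record")
original = program(recording)

# Replaying from the log reproduces the recorded run exactly.
replayed = program(InputLog("replay", recording.entries))
print(original == replayed)
```

Note that two independent recordings would still be free to differ; only the replay of a given log is deterministic.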
DMP: Deterministic Shared Memory
Multiprocessing
Devietti, Lucia, Ceze, Oskin
Central Idea
 To guarantee deterministic behavior:
- the direct way is to preserve the same global interleaving of instructions in every execution of a parallel program
- but this is unnecessary and has a significant performance impact
 Insight: only communicating pairs matter
Improve a bit...
 Not all memory accesses communicate
 Can parallelize the communication-free portion of each quantum
 Need to know when communication happens!
 The MESI cache coherence protocol provides this for free
DMP Sharing Table
- tracks information about memory ownership
- two ownership change possibilities:
- reading data owned by another thread
- writing data to shared memory
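A rough sketch of how such a sharing table could decide whether an access is communication-free is shown below. The state encoding, the default of treating untracked data as shared, and the exact ownership transitions are assumptions made for illustration; the slides only state that ownership changes on reads of data owned by others and on writes to shared data.

```python
SHARED = "shared"

class SharingTable:
    """Toy ownership table: each address is SHARED or owned by one thread id."""

    def __init__(self):
        self.owner = {}  # address -> thread id, or SHARED

    def can_run_in_parallel(self, tid, op, addr):
        """True if the access is communication-free under current ownership."""
        owner = self.owner.get(addr, SHARED)  # assumption: untracked data is shared
        if op == "read":
            return owner == SHARED or owner == tid
        return owner == tid  # writes are only free on data the thread owns

    def update_in_serial_mode(self, tid, op, addr):
        """Apply the ownership change once the thread has its serial turn."""
        owner = self.owner.get(addr, SHARED)
        if op == "read" and owner not in (SHARED, tid):
            self.owner[addr] = SHARED   # reading another thread's data: now shared
        elif op == "write" and owner != tid:
            self.owner[addr] = tid      # writing: the writer becomes the owner

table = SharingTable()
table.owner[0x10] = 0                                   # thread 0 owns address 0x10
blocked = not table.can_run_in_parallel(1, "read", 0x10)
table.update_in_serial_mode(1, "read", 0x10)            # thread 1 reads in serial mode
print(blocked, table.owner[0x10])
```

Accesses for which `can_run_in_parallel` is true never need to wait; only the ownership-changing accesses are serialized in a deterministic order.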
Improve a bit more...
 Transactional Memory + deterministic commit order
 TM: provides atomicity and isolation for each quantum
 Speculation: find quanta not involved in communication
 If communication happens, squash + re-execute
 Potential optimization:
 forward uncommitted (or speculative) data between quanta
 could save a large number of squashes
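The speculate, commit in order, and squash on conflict loop can be sketched in software as follows. This is a simplified model under stated assumptions: one quantum per thread per round, quanta expressed as hypothetical Python functions that take a `read` callback and return a dict of writes, and no data forwarding between quanta.

```python
def run_round(memory, quanta):
    """Commit one quantum per thread in thread-id order; return squash count."""
    snapshot = dict(memory)  # all quanta speculate against the round's start state
    speculative = []
    for quantum in quanta:
        reads = set()

        def read(addr, reads=reads):
            reads.add(addr)
            return snapshot.get(addr, 0)

        writes = quantum(read)
        speculative.append((quantum, reads, writes))

    squashes = 0
    committed_addrs = set()
    for quantum, reads, writes in speculative:   # deterministic commit order
        if reads & committed_addrs:              # read data an earlier quantum wrote
            squashes += 1
            # Squash and re-execute against the committed state.
            writes = quantum(lambda addr: memory.get(addr, 0))
        memory.update(writes)
        committed_addrs |= set(writes)
    return squashes

memory = {"x": 1}
q0 = lambda read: {"x": read("x") + 1}    # writes x
q1 = lambda read: {"y": read("x") * 10}   # reads x: communicates with q0
q2 = lambda read: {"z": read("w") + 5}    # touches unrelated data only
squashes = run_round(memory, [q0, q1, q2])
print(memory, squashes)
```

Because commits happen in a fixed order and conflicting quanta are re-executed at their commit point, the final state matches a serial execution in thread-id order regardless of how the speculative phase was scheduled.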
Performance
Discussion
 Speculation
 similar idea to thread-level speculation (TLS), but used for the opposite purpose
 requires complex hardware
 I/O or parts of the OS cannot execute speculatively
 Dealing with nondeterminism
 threads can use OS to communicate
 nondeterministic OS API calls, e.g. read
 Better way of token-passing?
Kendo: Efficient Deterministic
Multithreading in Software
Olszewski, Ansel, Amarasinghe
Definitions
 Strong Determinism
 Deterministic order of memory accesses to shared data for particular
program input
 ALWAYS produces same output for every run with a particular input
 Hard to provide without hardware support
 Weak Determinism
 Deterministic order of lock acquisitions for a given program input
 Produces same output for every run if race-free
 Can be guaranteed if all accesses to shared data protected by locks
 If no data-races, strong and weak determinism provide same
guarantees!
Introducing Kendo
 Software framework to enforce weak determinism of general
lock-based C/C++ code for commodity shared-memory
multiprocessors
 No special hardware necessary!
 Deterministic Logical Time
 Each thread has its own monotonically increasing deterministic
logical clock
 How to implement? Performance counter events?
 When is it thread T's turn to use a lock?
 All threads with tid < T have greater logical clocks
 All threads with tid > T have greater or equal logical clocks
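The turn rule above can be written as a small predicate. The function name `is_turn` and the representation of clocks as a list indexed by thread id are assumptions for illustration.

```python
def is_turn(tid, clocks):
    """Return True if thread `tid` holds the turn given all logical clocks.

    Lower-numbered threads must be strictly ahead; higher-numbered threads
    must be at least level. Ties are therefore broken by thread id.
    """
    mine = clocks[tid]
    for other, clock in enumerate(clocks):
        if other < tid and clock <= mine:
            return False
        if other > tid and clock < mine:
            return False
    return True

clocks = [5, 3, 3]
turns = [is_turn(t, clocks) for t in range(len(clocks))]
print(turns)
```

At most one thread satisfies the predicate at a time, since it selects the thread with the smallest (clock, tid) pair.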
Simple Locking Mechanism
function det_mutex_lock(l)
{
    pause_logical_clock();
    wait_for_turn();
    lock(l);
    inc_logical_clock();
    resume_logical_clock();
}

function det_mutex_unlock(l)
{
    unlock(l);
}

• Simple algorithm for implementing locks
• Pause logical clock during acquisition and wait for turn to access lock (using heuristic in previous slide)
• Once in critical section, resume the clock and continue
• Pros:
  o Easy to implement
• Problems?
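The simple scheme can be tried out with ordinary threads. The sketch below is a toy model, not Kendo itself: logical clocks are advanced by explicit `tick()` calls instead of performance counter events, finished threads are simply ignored when checking turns, and the names `DetRuntime` and `run_once` are hypothetical. Because the clock only moves in `tick()`, it is implicitly paused while a thread waits.

```python
import threading
import time

class DetRuntime:
    """Toy runtime: logical clocks are advanced explicitly by tick()."""

    def __init__(self, n_threads):
        self.clocks = [0] * n_threads
        self.finished = [False] * n_threads

    def tick(self, tid, amount=1):
        self.clocks[tid] += amount

    def is_turn(self, tid):
        mine = self.clocks[tid]
        for other, clock in enumerate(self.clocks):
            if other == tid or self.finished[other]:
                continue
            if (other < tid and clock <= mine) or (other > tid and clock < mine):
                return False
        return True

    def det_mutex_lock(self, tid, lock):
        while not self.is_turn(tid):   # wait_for_turn()
            time.sleep(0)              # yield so other threads can advance
        lock.acquire()                 # lock(l)
        self.tick(tid)                 # inc_logical_clock()

    def det_mutex_unlock(self, tid, lock):
        lock.release()                 # unlock(l)

def run_once(work_per_thread):
    """Run all threads once and return the order of lock acquisitions."""
    rt = DetRuntime(len(work_per_thread))
    lock = threading.Lock()
    order = []

    def worker(tid, work):
        for _ in range(3):             # three critical sections per thread
            for _ in range(work):
                rt.tick(tid)           # deterministic "work" between lock calls
            rt.det_mutex_lock(tid, lock)
            order.append(tid)          # critical section
            rt.det_mutex_unlock(tid, lock)
        rt.finished[tid] = True

    threads = [threading.Thread(target=worker, args=(tid, work))
               for tid, work in enumerate(work_per_thread)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return order

runs = [run_once([2, 3, 5]) for _ in range(5)]
print(all(run == runs[0] for run in runs))
```

With a single lock the acquisition order depends only on the logical clocks, not on physical timing. One answer to the "Problems?" question also shows up in this model: a thread blocked inside `lock.acquire()` has a frozen clock, so a lock holder that needs a turn for a second, nested lock could wait forever.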
Improved Lock
function det_mutex_lock(l){
    pause_logical_clock();
    while(true){
        // Loop until we have successfully acquired the lock.
        wait_for_turn(); // Wait for our deterministic logical clock to be unique global minimum
        if (try_lock(l)){
            // Check the state of the lock, acquiring it if it is free
            if(l.released_logical_time >= get_logical_clock()){
                // Lock is free in physical time, but still acquired in
                // deterministic logical time so we cannot acquire it yet
                unlock(l); // Release the lock
            } else {
                // Lock is free in both physical and in deterministic logical
                // time, so it is safe to exit the spin loop
                break;
            }
        }
        inc_logical_clock(); // Increment our deterministic logical clock and start over
    }
    inc_logical_clock(); // Increment our deterministic logical clock before exiting
    resume_logical_clock();
}

function det_mutex_unlock(l){
    pause_logical_clock();
    l.released_logical_time = get_logical_clock();
    unlock(l);
    inc_logical_clock();
    resume_logical_clock();
}
Optimizations
 Queuing
 A queue for each lock guarantees first-come, first-served ordering
 Fast-forwarding
 While waiting for a lock, a thread can set its logical time to lock.released_logical_time (or +1 if queuing)
 Lazy reads
 If the application can tolerate reading out-of-date shared data, there is no need to lock on read (e.g., when searching for a "best" value)
 Provide a read window (in logical time); once all threads are past the earliest allowable logical time, the read can succeed
Results
Capo: A Software-Hardware Interface for
Practical Deterministic Multiprocessor
Replay
Montesinos, Hicks, King, Torrellas
Capo: Motivation
 Record/replay system for debugging
 Not intended to be deployed in the field
 Builds on DeLorean [1]
 Chunk-based record/replay system
 Terminate chunks at communicating pairs, record chunk commit
order only
 Only half the story
 Capo adds software side as a Linux implementation:
 Record syscall results
 Provide infrastructure to record/replay multiple programs and
multiplex hardware record/replay features
[1] P. Montesinos, L. Ceze, and J. Torrellas, “DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently,” in ISCA, June 2008.
Capo's Contributions
 Replay Spheres: distinct realms of record/replay
 Defining hardware-software interface
 Simulated DeLorean hardware (chunk-based recording)
 Linux kernel modifications
Capo Architecture
 Replay Sphere: set of R-threads; isolated environment
 Arbitrary set of processes is inside sphere
 Replay Sphere Mgr: multiplexes HW support over spheres
 HW: records chunk commit order (DeLorean)
 SW: records system calls
 OS not inside sphere, except copy_to_user()
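The hardware half of this split, recording only the chunk commit order, can be illustrated with a small software model. Everything here is an assumption made for illustration: chunks are hypothetical Python functions that mutate a shared dict, a seeded random arbiter stands in for hardware nondeterminism, and the log is a plain list of thread ids rather than DeLorean's actual log format.

```python
import random

def execute(chunks_per_thread, commit_order=None, seed=None):
    """Run chunks atomically, one at a time; return (final_memory, commit_log).

    commit_order=None means record mode: a random arbiter picks which thread
    commits next. Otherwise the given order is replayed exactly.
    """
    memory = {"x": 1}
    next_chunk = {tid: 0 for tid in chunks_per_thread}
    rng = random.Random(seed)
    log = []
    total = sum(len(chunks) for chunks in chunks_per_thread.values())
    for step in range(total):
        if commit_order is None:
            ready = sorted(tid for tid in next_chunk
                           if next_chunk[tid] < len(chunks_per_thread[tid]))
            tid = rng.choice(ready)
        else:
            tid = commit_order[step]
        chunks_per_thread[tid][next_chunk[tid]](memory)  # commit the chunk
        next_chunk[tid] += 1
        log.append(tid)
    return memory, log

def add_one(m): m["x"] += 1
def double(m): m["x"] *= 2
def triple(m): m["x"] *= 3
def sub_one(m): m["x"] -= 1

chunks = {0: [add_one, double], 1: [triple, sub_one]}

recorded_memory, commit_log = execute(chunks, seed=7)         # record
replayed_memory, _ = execute(chunks, commit_order=commit_log)  # replay
print(recorded_memory == replayed_memory)
```

Since each chunk is atomic, the commit order alone is enough to reproduce the final memory state; different commit orders, such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]`, lead to different results for the same chunks.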
Hardware Details
Performance
[Figures not reproduced: Record, Replay, Log Size]
Summary
System                  Needs hw
Capo (record/replay)    usually
Kendo                   no
DMP (Devietti et al)    yes
Helps with…: debugging, testing, replicas, deployment [per-system marks not recoverable]
Discussion
 Which is more useful: record/replay or full-time?
 Debugging only, vs. system design philosophy
 Tradeoff: cost (log size, overhead) vs. utility
 Strong vs. weak determinism
 Race conditions are an important class of bugs