Offline Symbolic Analysis for
Multi-Processor Execution Replay
Dongyoon Lee†, Mahmoud Said*,
Satish Narayanasamy†, Zijiang James Yang*, and
Cristiano L. Pereira‡
University of Michigan, Ann Arbor †
Western Michigan University *
Intel, Inc ‡
-1-
Overview
Goal: Deterministic replay for multi-threaded programs
• Debug non-deterministic bugs
Sources of non-determinism
• Program input (interrupt, I/O, DMA, etc.)
• Shared-memory dependencies
Program Input
• Past Solutions: Log I/O, signals, DMA, etc.
• Our Solution: BugNet [ISCA'05], log loads (cache miss data)
Shared Memory Dependency
• Past Solutions: Monitor memory operations (software is slow, hardware is complex)
• Our Solution: SMT constraint solver, determine offline before replay
-2-
Deterministic Replay Uses
• Reproduce non-deterministic bugs
• Debugging
• Step backward in time
• Dynamic Program Analysis
• Memory Leaks
• Data Races
• Dangling Pointers
-3-
Traditional Record-N-Replay Systems
Recorder (Remote Site OR In-house) → Replayer (Developer Site)
• Checkpoint Memory and Register State
• Log non-deterministic program input: Interrupts, I/O values, DMA, etc.
• Log shared memory dependencies
[Figure: Thread 1, Thread 2, and Thread 3 timelines with one Write followed by two Reads]
-4-
Recording Shared Memory Dependency
Problem
• Need to monitor every memory operation
Software-based Replay System: slow (x10 to x100)
• PinSEL (UCSD/Intel)
• iDNA (Microsoft)
Hardware-based Replay System: complex hardware
• FDR/ReRun (Wisconsin)
• Strata (UCSD)
• DeLorean (UIUC)
-5-
Hardware Complexity
Hardware-based solution
• Detect shared memory dependencies by monitoring cache coherence messages
• Transitive optimization to reduce log size
[Figure: example accesses W(a), W(b), W(b), R(a)]
Complexity
• Requires changes to coherence sub-system
• Complex to design and verify
• 9 design bugs in coherence mechanism of AMD64
[Narayanasamy et al. ICCD’06]
-6-
New Direction to Hardware-based Solution
Complexity-effective solution
• Do NOT record shared-memory dependencies at all
• Infer dependencies offline before replay using
Satisfiability Modulo Theory (SMT) solver
-7-
Our Approach
• Checkpoint Registers (instead of Checkpoint Memory and Registers)
• Log non-deterministic program input (Interrupts, I/O values, DMA, etc.) with BugNet [ISCA’05], a load-based hardware recorder
• Do not log shared memory dependencies: a Satisfiability-Modulo-Theory (SMT) solver reconstructs the interleaving offline
[Figure: thread timelines with one Write followed by two Reads]
-8-
Roadmap
• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion
-9-
BugNet
[Narayanasamy et al, ISCA’05]
Insight
• Recording initial register state and values of loads is sufficient for
deterministic replay
• Implicitly captures the program input from I/O, DMA, interrupts, etc.
• Input and output of other instructions are reproduced during replay
Optimization
• Record a load only if it is the first access to a memory location
Our modification
• Recording data fetched on cache miss captures first loads
• Any first access to a location would result in a cache miss
• May unnecessarily record data due to store misses, but that is OK
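The first-load optimization can be illustrated with a minimal Python sketch. It models the original BugNet rule (log a load only on the first access to a location), not the cache-miss variant. The trace format `(op, addr, value)`, the function names, and the example values are assumptions made for illustration; in a real replayer the addresses and store values would be recomputed by re-executing the instructions.

```python
def record(trace, memory):
    """Run a single-threaded trace and log only first loads."""
    seen = set()   # locations already accessed since the checkpoint
    log = []       # (memory operation count, value) pairs
    for count, (op, addr, value) in enumerate(trace, start=1):
        if op == "load":
            if addr not in seen:
                log.append((count, memory[addr]))
        else:  # store
            memory[addr] = value
        seen.add(addr)
    return log


def replay(trace, log):
    """Reproduce every load value from the log alone (no initial memory)."""
    pending = dict(log)
    memory = {}
    loaded = []
    for count, (op, addr, value) in enumerate(trace, start=1):
        if op == "load":
            if count in pending:          # first load: take the logged value
                memory[addr] = pending[count]
            loaded.append(memory[addr])   # later loads are reproduced locally
        else:
            memory[addr] = value
    return loaded


# Hypothetical trace: Load A, Load B, Load A, Store C = 1, Load C
TRACE = [("load", "A", None), ("load", "B", None), ("load", "A", None),
         ("store", "C", 1), ("load", "C", None)]
LOG = record(TRACE, {"A": 0, "B": 5, "C": 0})
REPLAYED = replay(TRACE, LOG)
```

Only the two first loads are logged; the second load of A and the load of C after the store are reproduced during replay without any log entry.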
- 10 -
Recording Cache Miss Data (First Loads)
Execution (in time order)    Log file
Checkpoint
Load A = 0                   (cnt1, 0)    cache miss, first load
Load B = 5                   (cnt2, 5)    cache miss, first load
Load A = 0
Store C = 1                  (cnt3, 0)    cache miss
Checkpoint
• Register Values
• Program Counter
Record cache misses
• (Memory count, Data)
• Implicitly capture first loads
Deterministic Replay
• Input and output (including address) of all instructions are replayed
On a store miss
• Record old value – data before store update
• New value – data after store update – can be reproduced deterministically
- 11 -
BugNet Extension
Self-modifying code
• Consider instruction read as a load; so instructions are logged
Full system Replay
• Continue logging in kernel mode
• See the paper for details on context switches, page faults, etc.
- 12 -
Roadmap
• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion
- 13 -
BugNet for Multithreaded Programs
Insight
• BugNet recorder (initial register state + loads) for each thread is
sufficient for replaying that thread
Recording cache miss data is sufficient for multithreaded programs
No additional hardware support required for recording dependencies
Reason
• A load dependent on a remote write causes a cache miss to ensure coherence
⇒ BugNet implicitly records load values dependent on remote writes
Effect
• Can replay each thread in isolation (independent of other threads)
using BugNet logs
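A small Python sketch of why per-thread cache-miss logs suffice. It assumes an idealized invalidation protocol with infinite per-processor caches and one location per cache block; the function names, the tuple formats, and the example interleaving are hypothetical.

```python
def record(interleaving, memory, nprocs):
    """Log (memory operation count, data) on every cache miss, per processor."""
    caches = [set() for _ in range(nprocs)]
    counts = [0] * nprocs
    logs = [[] for _ in range(nprocs)]
    for proc, op, addr, value in interleaving:
        counts[proc] += 1
        if addr not in caches[proc]:              # cache miss: log fetched data
            logs[proc].append((counts[proc], memory[addr]))
            caches[proc].add(addr)
        if op == "store":
            memory[addr] = value
            for other in range(nprocs):           # invalidate remote copies
                if other != proc:
                    caches[other].discard(addr)
    return logs


def replay_thread(ops, log):
    """Replay one thread in isolation, using only its own log."""
    pending = dict(log)
    local = {}
    loaded = []
    for count, (op, addr, value) in enumerate(ops, start=1):
        if count in pending:                      # value fetched on a miss
            local[addr] = pending[count]
        if op == "load":
            loaded.append(local[addr])
        else:
            local[addr] = value
    return loaded


# Hypothetical execution: both processors load A=0, processor 0 stores A=1,
# then processor 1 loads A again and misses because its copy was invalidated.
INTERLEAVING = [(0, "load", "A", None), (1, "load", "A", None),
                (0, "store", "A", 1), (1, "load", "A", None)]
LOGS = record(INTERLEAVING, {"A": 0}, nprocs=2)
P1_LOADS = replay_thread([("load", "A", None), ("load", "A", None)], LOGS[1])
```

Processor 1 is replayed without looking at processor 0 at all: the remote write shows up in its log as the data fetched on the second miss.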
- 14 -
Replaying Each Thread Independently
[Figure: Proc 1 and Proc 2 each execute Load A=0 and log (1st, 0). A Store A=1 on one processor sends an invalidation, so the other processor's cache block is invalidated; that processor's next Load A=1 is a cache miss and is logged as (3rd, 1).]
Cache Coherence
• Invalidate cache block to gain exclusive permission
Log cache miss data
• Implicitly records loads dependent on remote writes
• No change to coherence mechanism
Replay each thread
• independent of others
- 15 -
Shared Memory Dependency
[Figure: Thread 1 (Load A, Load B, Store A, Store C, Load A, Load B) and Thread 2 (Load A, Store C, Store C, Store A, Load B, Load B), each access annotated with its old value and new value; the interleaving between the two threads is unknown (?); Final State: A, B, C]
SMT Solver resolves shared memory dependency
Billion instructions
• Offline analysis would not scale
We need to bound search space
- 16 -
Roadmap
• Motivation
• BugNet
• Offline Symbolic Analysis
• Encoding Ordering Constraints
• Bounding Search Space
• Evaluation
• Conclusion
- 17 -
Encoding Ordering Constraints
(Assume Sequential Consistency)
[Figure: Proc 1 accesses x1, x2; Proc 2 accesses x3, x4, x5; each access to x is annotated with its Old Value and New Value; x Final is the final value]
Program Order Constraint
Proc1 : X1 < X2 AND
Proc2 : X3 < X4 < X5 AND
Load-Store Constraint ( M→old == M→prev→new )
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
- 18 -
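The ordering constraints for x1 through x5 are small enough to check by brute force. The Python sketch below enumerates every total order and keeps those that satisfy the program-order and load-store constraints; exhaustive enumeration is a stand-in used here for illustration, not the SMT solver used in the work, and the function names are assumptions.

```python
from itertools import permutations

EVENTS = ["x1", "x2", "x3", "x4", "x5"]


def satisfies(pos):
    """pos maps each access to its position in a candidate total order."""
    program_order = (pos["x1"] < pos["x2"]
                     and pos["x3"] < pos["x4"] < pos["x5"])
    load_store = (pos["x1"] < pos["x3"]
                  and (pos["x3"] < pos["x2"] < pos["x4"]
                       or pos["x5"] < pos["x2"]))
    return program_order and load_store


def valid_total_orders():
    orders = []
    for order in permutations(EVENTS):
        pos = {event: index for index, event in enumerate(order)}
        if satisfies(pos):
            orders.append(list(order))
    return orders


ORDERS = valid_total_orders()
# Two orders satisfy the constraints:
#   x1, x3, x2, x4, x5   and   x1, x3, x4, x5, x2
```

Both surviving orders give every access the old value it observed, so either one is a valid replay.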
Multiple Memory Locations
(Assume Sequential Consistency)
[Figure: Proc 1 accesses y1, x1, x2, y2; Proc 2 accesses x3, x4, x5, y3; each access is annotated with its Old Value and New Value; x Final and y Final are the final values]
Program Order Constraints
Proc1 : Y1 < X1 < X2 < Y2 AND
Proc2 : X3 < X4 < X5 < Y3 AND
Load-Store Constraints ( M→old == M→prev→new )
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
:
Y1: Y1 < Y2 AND
Y2: Y1 < Y2 < Y3 AND
:
- 19 -
Satisfiability-Modulo-Theory (SMT) Solver
Ordering Constraints
(Program Order) ∧
(Load-Store Order for X) ∧
(Load-Store Order for Y) ∧
:
→ SMT Solver → Total Order
[Figure: a total order over the accesses y1, x1, x2, y2 (Proc 1) and x3, x4, x5, y3 (Proc 2)]
SMT solver
• Find one valid total order from multiple solutions
• All solutions could be produced, if needed
- 20 -
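The load-store rule (M→old == M→prev→new) can also be applied directly to logged old and new values. The Python sketch below uses a depth-first search over thread interleavings as a simple substitute for the SMT solver; the old/new values assigned to each access are hypothetical, chosen only to be consistent with the constraints on this example, and the function name is an assumption.

```python
def find_total_order(threads, initial, final):
    """threads: one list per thread of (name, location, old, new) accesses.

    Returns one total order in which every access sees, as its old value,
    the new value of the previous access to the same location.
    """
    def search(indexes, memory, order):
        if all(i == len(t) for i, t in zip(indexes, threads)):
            return order if memory == final else None
        for tid, thread in enumerate(threads):
            i = indexes[tid]
            if i == len(thread):
                continue
            name, loc, old, new = thread[i]
            if memory[loc] != old:        # load-store constraint violated
                continue
            nxt = indexes[:tid] + (i + 1,) + indexes[tid + 1:]
            result = search(nxt, {**memory, loc: new}, order + [name])
            if result is not None:
                return result
        return None

    return search((0,) * len(threads), dict(initial), [])


# Hypothetical values: a load keeps the value (old == new), a store changes it.
PROC1 = [("y1", "y", 0, 1), ("x1", "x", 0, 0), ("x2", "x", 1, 1), ("y2", "y", 1, 1)]
PROC2 = [("x3", "x", 0, 1), ("x4", "x", 1, 2), ("x5", "x", 2, 1), ("y3", "y", 1, 2)]
ORDER = find_total_order([PROC1, PROC2], {"x": 0, "y": 0}, {"x": 1, "y": 2})
```

The search returns the first valid order it finds, which matches the "find one valid total order" behavior; collecting every result instead of returning early would list all solutions.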
Replay Guarantees
• The replayed execution has the same final register and
memory states
• Each thread has exactly the same sequence of
instructions along with input and output
• Reconstructed shared memory dependencies obey
program order and load-store semantics
- 21 -
Roadmap
• Motivation
• BugNet
• Offline Symbolic Analysis
• Encoding Ordering Constraints
• Bounding Search Space
• Evaluation
• Conclusion
- 22 -
Bounding Search Space
[Figure: Proc 1 and Proc 2 timelines divided every N cycles into Strata Region 1, Strata Region 2, and Strata Region 3; memory operation counts cnt 1 through cnt 4 are recorded at the region boundaries; the Final State of each region is the Initial State of the next]
Record “Strata hints”
• Each processor periodically records memory operation count
• Strata regions have a global order
SMT solver analyzes
• One region at a time
• Start from the last region
• Final state of a region = Initial state of the following region
- 23 -
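A rough Python sketch of how Strata hints bound the search. It splits each processor's trace at the recorded memory operation counts and compares the number of possible two-thread interleavings (an upper bound that ignores all ordering constraints) for the whole trace against the per-region counts. The trace contents, hint values, and function names are hypothetical.

```python
from math import comb


def split_into_regions(trace, hints):
    """hints: cumulative memory operation counts recorded at region boundaries."""
    regions, start = [], 0
    for count in hints:
        regions.append(trace[start:count])
        start = count
    regions.append(trace[start:])
    return regions


def interleavings(m, n):
    """Number of ways to interleave two sequences of lengths m and n."""
    return comb(m + n, m)


PROC1 = [f"p1_op{i}" for i in range(1, 7)]   # 6 memory operations
PROC2 = [f"p2_op{i}" for i in range(1, 6)]   # 5 memory operations
REGIONS1 = split_into_regions(PROC1, [2, 5])
REGIONS2 = split_into_regions(PROC2, [3, 4])

WHOLE = interleavings(len(PROC1), len(PROC2))
PER_REGION = [interleavings(len(a), len(b)) for a, b in zip(REGIONS1, REGIONS2)]
```

Because regions are globally ordered and analyzed one at a time, the solver faces the per-region counts separately rather than their product or the whole-trace count.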
Strata Hints
Cycle-bound
• After N cycles, each core records its memory operation count
• No communication is required between cores
Problem
• The size of a Strata region is not based on the number of shared memory dependencies
• Can we bound based on the number of shared memory dependencies?
Downgrade-bound
• Count coherence downgrade requests
• Requires communication between cores, but reduces offline
analysis overhead
- 24 -
Filtering Local & Read-only Accesses
[Figure: accesses by Thread 1 and Thread 2 within a Strata Region: loads and stores to A and B, and repeated loads of C]
Filter
• Local accesses: no shared-memory dependency
• Read-only accesses: any total order is valid
Effectiveness
• < 1% of memory accesses remain to be analyzed
- 25 -
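The two filters can be sketched in a few lines of Python. An access is kept for analysis only if its location is touched by more than one thread and is written at least once within the region. The `(thread, op, location)` tuple format, the function name, and the example region are assumptions for illustration.

```python
def filter_accesses(region):
    """Drop local and read-only accesses from one Strata region."""
    threads_by_loc = {}
    written = set()
    for thread, op, loc in region:
        threads_by_loc.setdefault(loc, set()).add(thread)
        if op == "store":
            written.add(loc)
    return [
        (thread, op, loc)
        for thread, op, loc in region
        if len(threads_by_loc[loc]) > 1    # not local to a single thread
        and loc in written                 # not read-only in this region
    ]


# Hypothetical region: A is local to thread 1, C is read-only, B is shared.
REGION = [(1, "load", "A"), (1, "store", "A"),
          (2, "load", "C"), (1, "load", "C"),
          (1, "load", "B"), (2, "store", "B")]
REMAINING = filter_accesses(REGION)
```

Only the two accesses to B reach the solver; the rest can be placed anywhere consistent with program order.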
Roadmap
• Motivation
• Record & Replay
• Offline Symbolic Analysis
• Evaluation
  • Strata Hint Size
  • Offline Symbolic Analysis Overhead
• Conclusion
- 26 -
Evaluation
• Simics + cycle accurate simulator
• Simulate multi-processor execution (2, 4, 8, 16 cores)
• Fast-forward up to known synchronization points
• Trace collected for 500 million instructions
• Benchmarks
  • SPLASH2: barnes, fmm, ocean
  • Parsec 2.0: blackscholes, bodytrack, x264
  • SPEComp: wupwise, swim
  • Apache
  • MySQL
• Yices SMT constraint solver [Dutertre and de Moura, CAV’06]
- 27 -
Strata Hints Size vs. Offline Analysis Overhead
[Figure: offline analysis time (secs per sec of program execution, log scale from 1 to 1,000,000) and Strata log size (MB/sec, axis from 2.7 to 3.3) for Cycle-bound (10,000), Downgrade-bound (10), and Downgrade-bound (25); annotations: x100 and 10%]
• Downgrade-bound
scheme is effective
• Offline analysis overhead is one-time cost (not for every replay)
- 28 -
Strata hints vs. ReRun log
[Figure: Strata log size (MB/sec, log scale from 1 to 100) for the proposed system with downgrade-bound hints (d10.c10000) vs. ReRun [Hower and Hill, ISCA’08]; annotation: x4]
• Strata hints are 4x smaller than the ReRun log
• Significant reduction in hardware complexity
- 29 -
Recording Performance, etc.
• Cache Miss Data Log
• 290 MB per second of program execution
• Recording Performance
• On average, 0.35% slowdown in IPC
• Scalability results can be found in the paper
- 30 -
Conclusion
• Deterministic replay for multi-threaded programs is critical
• We proposed a complexity-effective solution
• Use BugNet : Record cache miss data
• No need to record shared memory dependencies
• Determine shared memory dependency using SMT constraint solver
offline
• Result
• < 1% recording overhead
• Efficient log size (4x smaller than state-of-the-art scheme ReRun)
• Can analyze one second of 8-threaded program in less than 1000
seconds
• One-time offline analysis cost (not for every replay)
- 31 -
Thank you
- 32 -