Effective and Inexpensive (Memory) Race Recording Min Xu

Effective and Inexpensive
(Memory) Race Recording
Min Xu
Thesis Defense
05/04/2006
Electrical and Computer Engineering Department, UW-Madison
Advisors: Mark Hill, Rastislav Bodik
Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood
Overview
1
Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism
An effective and inexpensive race recorder: a case study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Supports both SC & TSO (x86-like)
Thesis Contributions
2
An effective and inexpensive recorder:
• Small log size: RTR algorithm
• Low runtime overhead: coherence piggyback
• Low-cost hardware: Set/LRU approximation
• SC & TSO applicability: order-value hybrid
Outline
3
• Motivation & Problem (5 slides)
• An Effective and Inexpensive Race Recorder (21 slides): RTR algorithm, coherence piggyback, Set/LRU approximation, order-value hybrid
• Evaluation Method & Results (6 slides)
• Conclusion & My Other Research (3 slides)
Motivation & Problem
Multithreaded Debugging
5
% gcc hash.c
% a.out
Segmentation fault
%
% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45        a = bucket->d;
% gcc para-hash.c
% a.out
Segmentation fault
%
% gdb a.out
gdb> run
Program exited normally.
gdb>
% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%
% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67        a = bucket->d;
Race Recording
6
Thread I: X=1; X++; print(X)
Thread J: X = X*5
Original (recording): X=1 → X = X*5 → X++ → print(X) gives X=6; the order goes to the log
Replay without the log: X=1 → X++ → X = X*5 → print(X) gives X=10
Replay with the log: X=6
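The example above can be sketched in a few lines of Python. This is only an illustration, assuming the three operations execute atomically and that print(X) runs after all of them; the function name `run` and the schedule lists are hypothetical and not part of the recorder.

```python
def run(schedule):
    """Run the slide's two threads in the given global order and return the printed X."""
    ops = {
        ("I", 0): lambda x: 1,       # thread I: X = 1
        ("I", 1): lambda x: x + 1,   # thread I: X++
        ("J", 0): lambda x: x * 5,   # thread J: X = X*5
    }
    x = 0
    next_op = {"I": 0, "J": 0}
    for thread in schedule:
        x = ops[(thread, next_op[thread])](x)
        next_op[thread] += 1
    return x  # value seen by thread I's print(X), assumed to run last


recorded = ["I", "J", "I"]   # order observed (and logged) in the original run
unlogged = ["I", "I", "J"]   # another legal order a replay could take without the log

original_output = run(recorded)         # 6
divergent_output = run(unlogged)        # 10
replayed_output = run(list(recorded))   # 6 again, because the logged order is enforced
```

Replaying the logged order reproduces the original output, while the unlogged order does not.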
Recording for Multithreaded Replay
7
Race Recording (focus)
• Not an issue for a single thread
• Create the same general & data races
Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not focus
Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not focus
A Good Race Recorder
8
• Long recording: small log
• Low runtime overhead
• Low cost
• Applicability
% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%
% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67        a = bucket->d;
Desired & Existing Race Recorders
9
[Table: recorders compared on recording length (small log size), applicability (racey code, MP, SC, TSO), overhead (negligible slowdown) and cost (little hardware). Rows: Desired Recorder, InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder.]
Small Log Size: RTR Algorithm
Problem Formulation
11
Thread I: ld A; st B; st C; ld D; sub; ld B
Thread J: add; st C; ld B; st A; st C; st D
[Figure: the two instruction streams drawn twice, once during recording with their conflicts (red) and once during replay with the logged dependences (black).]
Recording → Log → Replay
Reproduce exact same conflicts: no more, no less
Log All Conflicts
12
IC  Thread I  Thread J
1   ld A      add
2   st B      st C
3   st C      ld B
4   ld D      st A
5   sub       st C
6   ld B      st D
Dependence log:
Log J: 2→3, 1→4, 3→5, 4→6
Log I: 2→3
Log size: 5 × 16 bytes = 80 bytes (10 integers)
Recording: assign IC (logical timestamps) → detect conflicts → write log
But too many conflicts
Netzer’s Transitive Reduction
13
IC  Thread I  Thread J
1   ld A      add
2   st B      st C
3   st C      ld B
4   ld D      st A
5   sub       st C
6   ld B      st D
TR-reduced log (1→4 is TR reduced):
Log J: 2→3, 3→5, 4→6
Log I: 2→3
Log size: 64 bytes (8 integers)
The Intuition of the New RTR Algorithm
14
[Figure: the dependences from I to J and from J to I that remain after reduction are regulated into vectors (Regulate Replay, RTR).]
Stricter Dependences to Aid Vectorization
15
IC  Thread I  Thread J
1   ld A      add
2   st B      st C
3   st C      ld B
4   ld D      st A
5   sub       st C
6   ld B      st D
New reduced log (the stricter dependence 4→5 replaces 3→5 and 4→6, which become reduced):
Log J: 2→3, 4→5
Log I: 2→3
Log size: 48 bytes (6 integers)
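Why one stricter dependence can stand in for two original ones can be checked with a small predicate. This is a sketch, assuming dependences between a fixed pair of threads are written as (source IC, destination IC) and that each thread replays in program order; the name `implies` is hypothetical.

```python
def implies(strict, dep):
    """True if enforcing `strict` also enforces `dep`.

    Both are (src_ic, dst_ic) pairs for the same source and destination threads.
    Program order gives dep.src <= strict.src and strict.dst <= dep.dst.
    """
    s_src, s_dst = strict
    d_src, d_dst = dep
    return d_src <= s_src and s_dst <= d_dst


original = [(3, 5), (4, 6)]   # I:3 -> J:5 and I:4 -> J:6 from the TR-reduced log
stricter = (4, 5)             # I:4 -> J:5, the stricter dependence on the slide

covered = all(implies(stricter, dep) for dep in original)   # True
tr_case = implies((2, 3), (1, 4))                           # True: why 1->4 was TR reduced
```

The check covers only implication; whether a stricter dependence is safe to add (for example, without creating a replay cycle) is a separate question.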
Compress Vectorized Dependencies
16
IC  Thread I  Thread J
1   ld A      add
2   st B      st C
3   st C      ld B
4   ld D      st A
5   sub       st C
6   ld B      st D
Vectorized log:
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log size: 40 bytes (5 integers)
Reduce log size to KB/core/second
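The vectorized encoding can be illustrated as follows. This is a simplified sketch, assuming x is the destination IC, ∆ is the destination IC minus the source IC, and each integer takes 8 bytes (consistent with 16 bytes per dependence pair on the earlier slide); the function names are hypothetical and the real RTR encoding may differ.

```python
from collections import defaultdict


def vectorize(deps):
    """Group (src_ic, dst_ic) dependences that share the same delta = dst_ic - src_ic."""
    groups = defaultdict(list)
    for src_ic, dst_ic in deps:
        groups[dst_ic - src_ic].append(dst_ic)
    return {delta: sorted(xs) for delta, xs in groups.items()}


def size_in_integers(vectors):
    """Each vector stores its x values plus one shared delta."""
    return sum(len(xs) + 1 for xs in vectors.values())


log_j = vectorize([(2, 3), (4, 5)])   # {1: [3, 5]}  i.e. x=3,5 with delta 1
log_i = vectorize([(2, 3)])           # {1: [3]}     i.e. x=3 with delta 1

total_integers = size_in_integers(log_j) + size_in_integers(log_i)   # 5
total_bytes = total_integers * 8                                      # 40
```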
Low Runtime Overhead: Coherence Piggyback
Detect Conflicts
18
Each address keeps its readers and its last writer (e.g., A.readers, A.writer)
Thread I
• ld A: A.readers.add(I, 1)
• st B: B.writer = (I, 2)
• st C: if (C.writer != I) log(WAW); foreach C.readers: if (reader != I) log(WAR); C.readers.clear( ); C.writer = (I, 3)
Thread J
• st C: C.writer = (J, 2)
• ld B: if (B.writer != J) log(RAW); B.readers.add(J, 3)
• …
Expensive in software
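The bookkeeping on this slide can be written out as a small software detector. This is a sketch under assumptions: the trace format, the function name `detect_conflicts` and the particular global order below are illustrative (the order is one interleaving consistent with the dependences on the earlier slides), and a real recorder would do this in hardware.

```python
def detect_conflicts(trace):
    """trace: (thread, ic, op, addr) tuples in global execution order; op is 'ld' or 'st'."""
    writer = {}    # addr -> (thread, ic) of the last store
    readers = {}   # addr -> {thread: ic} of loads since the last store
    log = []       # (kind, src_thread, src_ic, dst_thread, dst_ic)
    for thread, ic, op, addr in trace:
        last = writer.get(addr)
        if op == "ld":
            if last is not None and last[0] != thread:
                log.append(("RAW", last[0], last[1], thread, ic))
            readers.setdefault(addr, {})[thread] = ic
        else:
            if last is not None and last[0] != thread:
                log.append(("WAW", last[0], last[1], thread, ic))
            for r_thread, r_ic in readers.get(addr, {}).items():
                if r_thread != thread:
                    log.append(("WAR", r_thread, r_ic, thread, ic))
            readers[addr] = {}
            writer[addr] = (thread, ic)
    return log


trace = [
    ("I", 1, "ld", "A"), ("I", 2, "st", "B"), ("J", 2, "st", "C"),
    ("I", 3, "st", "C"), ("J", 3, "ld", "B"), ("J", 4, "st", "A"),
    ("I", 4, "ld", "D"), ("J", 5, "st", "C"), ("I", 6, "ld", "B"),
    ("J", 6, "st", "D"),
]
conflicts = detect_conflicts(trace)
```

On this trace the detector logs five dependences, the same five that the "Log All Conflicts" slide lists.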
Use Cache and Cache Coherence
19
Proc J executes ld B
Proc I cache (Tag, State, Data, Timestamp): A S … 1; B M … 4
Proc J cache (Tag, State, Data, Timestamp): A S … 3; B I … 2
The cached timestamps play the role of A.readers, A.writer, B.readers, B.writer
Messages: Get/S request; data response + timestamp
RAW detected & logged
Detect conflict in hardware with little runtime cost
Cache Evictions and Writebacks
20
[Figure: two-cache diagram for st A. Proc I: A S … 1; C M … 3; B M … 4. Proc J: A S … 3 (then M, 4); B I … 2. Directory of A: Shared(I,J) Owner(). Messages: Get/S, Inv, Ack. WAR detected & logged. The slide asks which timestamp ("Timestamp?") is available after an eviction or writeback.]
OK with nonsilent eviction & directory eviction
Implement TR and RTR in Hardware
Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction
Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies
21
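A pairwise reduction with scalar timestamps can be sketched as below. This is one possible reading of "Pairwise-TR", not necessarily the hardware scheme of the thesis: it assumes ICs start at 1, that dependences arrive in the order their destination accesses execute, and it only drops a dependence when an earlier logged dependence between the same two threads already implies it. The name `pairwise_tr` is hypothetical.

```python
def pairwise_tr(deps):
    """deps: (src_thread, src_ic, dst_thread, dst_ic), ordered by destination execution."""
    newest = {}   # (src_thread, dst_thread) -> largest src_ic already ordered before dst_thread
    kept = []
    for src, src_ic, dst, dst_ic in deps:
        if src_ic <= newest.get((src, dst), 0):
            continue   # implied by an earlier logged dependence plus program order
        newest[(src, dst)] = src_ic
        kept.append((src, src_ic, dst, dst_ic))
    return kept


all_deps = [
    ("J", 2, "I", 3),
    ("I", 2, "J", 3),
    ("I", 1, "J", 4),   # implied by I:2 -> J:3
    ("I", 3, "J", 5),
    ("I", 4, "J", 6),
]
reduced = pairwise_tr(all_deps)
```

For the running example this keeps four of the five dependences, the same result as the TR-reduced log shown earlier. Dependences implied only through a third thread are not removed, which is what separates this from ideal TR with vector timestamps.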
Hardware Implementation
22
Cache
• Eviction/writeback: solved, more details later
• Directory protocols: solved
• Snooping protocols: partly solved
• Two-level coherence: not yet solved
Processor
• Out-of-order/prefetching: solved
• Unordered message: solved
• Counter overflow: solved
• Thread migration: not yet solved
Low Cost Hardware: Set/LRU Approximation
Timestamp Approximation
24
One set of I’s cache (Tag, State, Data, Timestamp): A S … 1; C M … 3; B M … 2
Directory of A: Shared(I)
Use current IC of thread I
[Figure: recording of the Thread I / Thread J example (ld A, st B, st C, ld D against add, st C, ld B, st A), with the current ICs of I and J used at ld D and st A.]
Correct, but more evictions → more logged conflicts
[Figure: hardware cost vs. log size.]
Set/LRU Approximation
26
One set of I’s cache (Tag, State, Data, Timestamp): A S … 1; C M … 3; B M … 2
Use current IC of thread I
LRU guarantees B’s TS > A’s TS
[Figure: recording of the Thread I / Thread J example (ld A, st B, st C, ld D against add, st C, ld B, st A).]
Set/LRU better preserves reducibility
Small $ → more misses → but still small log
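One way to picture the Set/LRU idea is a set-associative timestamp store in which a missing entry is bounded by the timestamp of the current LRU entry of its set. This is a sketch under assumptions: timestamps passed to `touch` never decrease, every access refreshes both the timestamp and the LRU position, and the class name, block numbering and set-index function are hypothetical.

```python
from collections import OrderedDict


class SetLRUTimestamps:
    """Hypothetical timestamp memory with `num_sets` sets of `ways` entries, LRU replacement."""

    def __init__(self, num_sets, ways):
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.ways = ways

    def _set_for(self, block):
        return self.sets[block % len(self.sets)]

    def touch(self, block, ic):
        """Record an access to `block` at instruction count `ic`."""
        entries = self._set_for(block)
        entries.pop(block, None)
        entries[block] = ic                # most recently used entry goes last
        if len(entries) > self.ways:
            entries.popitem(last=False)    # evict the least recently used entry

    def lookup(self, block):
        """Exact timestamp if present; otherwise an upper bound, or None if never evicted."""
        entries = self._set_for(block)
        if block in entries:
            return entries[block]
        if len(entries) == self.ways:
            # Everything still in the set was touched no earlier than any evicted entry.
            return next(iter(entries.values()))
        return None                        # set never filled, so nothing was evicted from it


tsm = SetLRUTimestamps(num_sets=1, ways=3)
A, B, C, D = 0, 1, 2, 3
for block, ic in [(A, 1), (B, 2), (C, 3), (D, 4)]:
    tsm.touch(block, ic)                   # touching D evicts A

bound_for_a = tsm.lookup(A)                # 2: B's timestamp, older than the current IC of 4
exact_for_c = tsm.lookup(C)                # 3
```

The bound for the evicted block A is B's timestamp rather than the current IC, so it is safe but closer to A's real timestamp.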
Hardware Cost of Timestamps
27
Coupled timestamp memory (Tag, State, Data, Timestamp): A S … 1; B M … 2
Coupled timestamp memory: overhead ∝ cache size
• Not flexible
• 64B line + 64b (24b) timestamp → 12.5% (4.7%) overhead
• 192 KB for a 4MB L2
Need to modify cache
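The overhead figures follow from simple arithmetic, shown here as a short calculation. It assumes one timestamp per 64-byte line and counts only the timestamp bits; the variable names are illustrative.

```python
LINE_BYTES = 64


def coupled_overhead(timestamp_bits):
    """Timestamp storage as a fraction of the data storage of one cache line."""
    return (timestamp_bits / 8) / LINE_BYTES


full_overhead = coupled_overhead(64)      # 0.125    -> 12.5%
partial_overhead = coupled_overhead(24)   # 0.046875 -> about 4.7%

lines_in_l2 = (4 * 1024 * 1024) // LINE_BYTES      # 65536 lines in a 4MB L2
timestamp_kb = lines_in_l2 * 24 // 8 // 1024       # 192 KB of 24-bit timestamps
```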
Decoupled Timestamp Memory
28
Coupled timestamp memory (Tag, State, Data, Timestamp): A S … 1; B M … 2
Decoupled: cache (Tag, State, Data): A S …; B M …, plus a separate timestamp memory (Tag, Timestamp): A 1; B 2
Decoupling → small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way → 99% transitive reduction
• Timestamp memory → 24 KB
No need to modify cache
From 192 KB to 24 KB: 8x reduction
29
SC & TSO Applicability: Order-Value Hybrid
Recording with Total Store Order (TSO)
30
The majority of existing MPs are non-SC
TSO is well defined, x86-like
Initially A=B=0
IC  Thread I  Thread J
1   st A,1    st B,1
2   ld B      ld A
Outcomes under SC:
• st A,1; st B,1; ld B; ld A → A=1, B=1
• st A,1; ld B; st B,1; ld A → A=1, B=0
• st B,1; ld A; st A,1; ld B → A=0, B=1
Additional outcome under TSO:
• ld A; ld B; st A,1; st B,1 → A=0, B=0
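The difference between the two models on this example can be checked by brute force. This is a sketch: it enumerates every SC interleaving of the two threads and, separately, models one TSO execution in which both stores wait in write buffers; the function names and the (A, B) result encoding are assumptions for illustration.

```python
from itertools import permutations

THREAD_I = [("st", "A"), ("ld", "B")]
THREAD_J = [("st", "B"), ("ld", "A")]


def sc_outcomes():
    """Return the set of (A seen by J, B seen by I) over all SC interleavings."""
    outcomes = set()
    for order in set(permutations("IIJJ")):
        mem = {"A": 0, "B": 0}
        pc = {"I": 0, "J": 0}
        seen = {}
        for t in order:
            op, addr = (THREAD_I if t == "I" else THREAD_J)[pc[t]]
            pc[t] += 1
            if op == "st":
                mem[addr] = 1
            else:
                seen[addr] = mem[addr]
        outcomes.add((seen["A"], seen["B"]))
    return outcomes


def tso_store_buffer_outcome():
    """Both stores are still buffered when the loads read memory."""
    mem = {"A": 0, "B": 0}
    seen_b = mem["B"]            # thread I: ld B passes its own buffered st A
    seen_a = mem["A"]            # thread J: ld A passes its own buffered st B
    mem["A"], mem["B"] = 1, 1    # the write buffers drain afterwards
    return (seen_a, seen_b)


sc = sc_outcomes()
tso = tso_store_buffer_outcome()
```

The SC enumeration never yields (0, 0), while the write-buffer execution does, which is why a recorder built only for SC cannot reproduce this run.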
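TSO Execution
31
Initially A=B=0; Thread I: st A,1 then ld B; Thread J: st B,1 then ld A
[Figure: each store waits in its thread's write buffer (I: A=1, J: B=1) while the memory system still holds A=0, B=0. Sequence shown: st A,1 and st B,1 issued, ld A and ld B performed, then st A,1 and st B,1 reach memory. Result: A=0, B=0.]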
Order-Value-Hybrid Recording
32
[Figure: the TSO example during recording and replay. Labels: WAR omitted, value logged; write buffers holding A=1 (Thread I) and B=1 (Thread J); memory system A=0, B=0; monitors on A and B started and stopped; “A Changed!”; value used: A=0.]
Hybrid Recording with TR and RTR
33
Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]
Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively
Evaluation Method & Results
Put-it-together: Determinizer/CMP
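35
[Figure: block diagram of a 4-core CMP (cores 1 to 4) with L1 I$ and D$, an L1 coherence controller and a shared L2 cache (L1 directory). Recorder additions: instruction counter (IC), timestamp memory (TSM), TR and RTR registers, and the log.]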
Simulation Method
36
Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
  • 1-way in-order issue, 2 GHz
  • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
Commercial server software
• Apache – static web serving
• SpecJBB – middleware
• OLTP – TPC-C like
• Zeus – static web serving
Log Size: 1 byte/kilo-instr
37
[Figure: two bar charts over Apache, JBB, OLTP, Zeus and AVG showing log size in byte/core/kilo-instr (axis 0.0 to 2.0) and in KB/core/s (axis 0 to 200).]
Well within the capability of current machines
• Long recording (days to months) needs improvement
Runtime Overhead
38
[Figure: two bar charts over Apache, JBB, OLTP and Zeus comparing the baseline with the race recorder: execution time and interconnection message bandwidth (both axes 0 to 100).]
Our recorder can be “always-on”
Benefits of RTR and Set/LRU (Log Size)
39
[Figure: two bar charts of log size (axis 0 to 100) over Apache, JBB, OLTP, Zeus and AVG. “Improvement by RTR” compares Pairwise-TR with our RTR; “Effectiveness of Set/LRU” compares a perfect TSM with a 24KB Set/LRU TSM.]
Why Do RTR and Set/LRU Work Well?
RTR
• Processors execute instructions at similar speed
• Therefore, we can find “vectorizable” dependencies
Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old-enough”
40
Sensitivity and Scalability
41
A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when TSM is large
• Partial timestamp: 24-bit enough
• Associativity: higher is better for RTR
Scalability of the recorder
• Studied with modest processors (2p – 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with the number of cores
Conclusion & My Other Research
Race Recording
43
Race recording → key to combating nondeterminism
My thesis → an effective & inexpensive recorder
• RTR algorithm → small log size
• Coherence piggyback → negligible slowdown
• Timestamp approximation → low hardware cost
• Order-value hybrid → support for SC & TSO
Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay
Serializability Violation Detector [PLDI’05]
44
Like a race detector
No a priori annotation requirement
• “critical sections” are inferred
Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition
[Figure: an inferred “critical section” over shared variables: Read in1, Read in2, Read local, Write local, Write out1, Write out2.]
Publications
FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)
SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition
RTR, Set/LRU & Hybrid
• Submitted for publication
45
Thank you!
% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%
% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67        a = bucket->d;
Acknowledgements
Joint work with my advisors
• Mark Hill, Ras Bodik
Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller
Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen
Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun
47
Deterministic Replay is Useful
Deterministic replay logically recreates a program execution
Present applications
• Cyclic Debugging ([Pancake & Netzer ‘93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])
Future applications
• Data Recovery
• Replay-based Synchronization
48
Multicore and Multithreading
Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1
Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time
49
Race Recording: Key to Determinism
50
Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but
Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Some support only sequential consistency
Need a better race recorder
Recording/Replay & Debugging
51
[Figure: an online recorder on processors P1 to P4 takes Checkpoints A, B and C and stores logs A, B and C until a crash, then dumps “core”. The deterministic replayer reads Checkpoint B and replays from logs B and C up to the crash.]
Deterministic Replay & Fault Tolerance
Fault Recovery
• Replay after a failure
Fault Detection
• Replay then compare
(Courtesy of VMware)
52
Future: Record/Replay & Undo/Redo
Windows XP
VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo
53
Future: Replay-based Synchronization
54
[Figure: recording of two threads, one executing ld A, st B, Unlock() and the other lock(), st A, ld B, produces a log; replay shows the same ld A, st B and st A, ld B.]
Three steps
• Coarse-grain sync. → fine-grain sync. → hardware sync.
Results: higher performance
Works only if static control flow & fixed data addresses
• DSP kernels
Race Recording Related Work
Total-order recorders
Bacon ’91 RecPlay ’00
(Hardware) JaRec ’04
Bus
Lamport Clocks
transactions
Large log
Low
overhead
Small log
55
Partial-order recorders
R&C’90
Bacon ’91
Instant Replay ’87 Netzer ’93
Déjà Vu ’98 (Hardware)
Scheduling
Small log
Low overhead Low overhead
(sync only)
(non-MP)
Low replay parallelism
Bus
transaction
groups
Large log
Low
overhead
Variable version
Vector clocks
Large log
Small log
High overhead
High
overhead
High replay parallelism
Correctness of Order-Value-Hybrid
Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value
56
TR and TSO
57
TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen
IC  Thread I  Thread J
1   st A      st B
2   st C      st C
3   ld B      ld A
[Figure: recording of this example, with one dependence marked “Must not be reduced”.]
RTR and TSO
58
The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it
IC  Thread I                    Thread J
1   st A                        add
2   add                         sub
3   st B (in write buffer)      ld A
4   ld C                        ld B
[Figure: recording of this example, showing the old and the new window for j:3, the ordered loads, and a dependence marked “Not allowed by new window”.]
Deadlock Avoidance of RTR
59
IC  Thread I  Thread J
1   ld A      add
2   st B      st C
3   st C      ld B
4   ld D      st A
5   sub       st C
6   ld B      st D
Replay cycle: i:4 → j:1 → j:2 → i:3 → i:4
Avoid deadlock by adhering to an SC total order
Recording Race-free Executions
No data races
Only need to record synchronization races
Deterministic replay up until the first data race
60
Replay Parallelism
Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations
61
Directory Protocols
62
Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements
Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements
A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster
Snooping Protocols
63
Key problem is combined/implicit response
• Not a problem for AMD Hammer
[Figure: two-cache diagram for st A. Proc I: A S … 1; B M … 4. Proc J: A S … 3; B I … 2. Messages: Get/X + current IC; Pull Shared. WAR detected & logged.]
Nonsilent Evictions
64
[Figure: two-cache diagram for st A with an eviction. Proc I: A S … 1; C M … 3; B M … 4. Proc J: A S … 3 (then M, 4); B I … 2. Messages: eviction + timestamp, Get/S, Ack + timestamp; the timestamp is kept in memory. Directory of A: Shared(J) Owner() StickyS(I,J).]
Directory eviction: more false conflicts, like snooping
Out-of-Order & Hardware Prefetching
65
Speculative execution
• No IC assigned yet
Hardware prefetching
• No IC assigned
Key idea: receive observation
• Can associate a ld/st with current commit instruction
Unordered Messages in Interconnect
Messages arrive out of order
Can affect reduction
But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas
66
Integer Overflow
IC and timestamps may overflow
IC: make it 64-bit; it will not overflow for a long time
Timestamps: use approximation techniques
• MSB of IC + LSB of timestamps
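Combining the high bits of the IC with the stored low bits of a timestamp might look like the sketch below. The slide only names the idea; the wrap-around handling here is an assumption, valid only when the timestamp is no newer than the current IC and less than 2^bits older. The function name `reconstruct` and the 24-bit default are illustrative.

```python
def reconstruct(current_ic, partial_ts, bits=24):
    """Rebuild a full timestamp from its low `bits` bits and the current IC.

    Assumes 0 <= current_ic - timestamp < 2**bits.
    """
    span = 1 << bits
    candidate = (current_ic & ~(span - 1)) | partial_ts   # MSB of IC + LSB of timestamp
    if candidate > current_ic:
        candidate -= span   # the low bits wrapped since the timestamp was taken
    return candidate


current_ic = 0x1000005
same_epoch = reconstruct(current_ic, 0x1000003 & 0xFFFFFF)   # 0x1000003
wrapped = reconstruct(current_ic, 0x0FFFFFE & 0xFFFFFF)      # 0x0FFFFFE
```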
67
Varying TSM Size
68
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second) against the size of the timestamp memory in KB, in powers of two up to 2048 (64 ways, full timestamps, Set/LRU). Series per panel: 1TS-RTR, 1TS-TR, 2TS-RTR, 2TS-TR.]
Varying Associativity
69
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second, log scale from 0.01 to 10) against the associativity of the timestamp memory, from 2 to 1024 (64KB, full R/W timestamps). Series per panel: CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, SetLRU-RTR.]
Varying Partial Timestamp Width
70
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second, log scale from 0.01 to 10) against the partial timestamp width, from 10 to 30 (64 sets, 64 ways, Set/LRU). Series per panel: TR, RTR.]
Log Size Scaling
71
[Figure: log size (MB/core/s, axis 0.0 to 1.0) against the number of cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP and Zeus.]
In Retrospect …
What are you most proud of?
• RTR improves TR after 13 years
What would you do differently if doing it again?
• “replaying me is deterministic” (just kidding)
• I wish I focused on race recording earlier
What should the industry do?
• Implement the recorder as a VMM extension
72