Hardware Memory Race Recording
for Deterministic Replay
Mark D. Hill
University of Wisconsin—Madison
August 10, 2007
Based on joint work with Min Xu & Ras Bodik:
ISCA 2003, ASPLOS 2006, IEEE Micro Top Picks 2007,
& Xu UW Ph.D. 5/2006 (slides updated from defense talk).
Wisconsin Multifacet Project
Seek improved architectures for (mostly) servers that
are (mostly) chip multiprocessors (CMPs, multi-core)
Led by Mark Hill & David Wood
LogTM work w/ Ben Liblit & Mike Swift
Funding
• Grants from U.S. National Science Foundation
• Donations from Intel and Sun
Selected Multifacet Results (1 of 2)
Multiprocessor Flight Data Recorder
• Records memory races for deterministic replay
• Piggybacks on the coherence protocol & logs about 0.001 B/instruction
• Supports SC & TSO
Adaptive L2 Cache & Memory Link Compression
• Cache compression creates level 2½ cache (or 3½)
• Adaptive so as “to do no harm”
• Link compression husbands memory link bandwidth
Multifacet GEMS MP Simulation Infrastructure
• Simics==Correctness; GEMS==Performance
• GPL Distribution
Selected Multifacet Results (2 of 2)
Log-based Transactional Memory (LogTM)
• Accelerates commit by writing new values in place
(after saving old values in a per-thread log)
• Gracefully handles cache eviction of TM data
LogTM Signature Edition (LogTM-SE)
• Signatures summarize read/write sets
• HW mechanisms: simple, policy-free, SW accessible
Forthcoming
• Mechanisms to handle thread switching/migration &
paging of transactions with OS or OS/VMM
Overview
Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism
A Case Study: an Effective & Inexpensive Race Recorder
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Supports both SC & TSO (x86-like)
Contributions
Effective & Inexpensive Race Recorder
• Small log size: Transitive Reduction & Regulated TR
• Low runtime overhead: coherence piggyback
• Low-cost hardware: Set/LRU approximation
• SC & TSO applicability: order-value hybrid
Outline
Motivation & Problem
An Effective and Inexpensive Race Recorder
• TR & RTR algorithms
• Coherence piggyback
• Set/LRU approximation
• Order-value hybrid
Evaluation Method & Results
Conclusions, etc.
Motivation & Problem
Multithreaded Debugging
Sequential program: the crash reproduces under the debugger
  % gcc hash.c
  % a.out
  Segmentation fault
  % gdb a.out
  gdb> run
  Program received SIGSEGV.
  In get() at hash.c:45
  45  a = bucket->d;
Multithreaded program: the crash does not reproduce
  % gcc para-hash.c
  % a.out
  Segmentation fault
  % gdb a.out
  gdb> run
  Program exited normally.
  gdb>
Multithreaded program with race recording: the log makes the crash reproducible
  % gcc para-hash.c
  % a.out
  Segmentation fault
  Race recorded in “log”
  % gdb a.out log
  gdb> run
  Program received SIGSEGV.
  In get() at para-hash.c:67
  67  a = bucket->d;
Applications of Deterministic Replay
Deterministic replay is logically recreating a program execution
Cyclic Debugging ([Pancake & Netzer ’93])
Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
Intrusion Analysis (ReVirt [Dunlap et al. ’02])
Data Recovery (Continuous Checkpointing)?
See VMware Workstation 6
• Replay included for single-processor guest VM
Race Recording
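[Figure: threads I and J race on X, initially X=1. One thread executes X++ and the other executes X = X*5, followed by print(X). In the original run, which is recorded to the log, the result is X=6. On replay, the other order gives X=10; replay that follows the log gives X=6.]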
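The race in this example can be reproduced in a few lines. This is only a minimal sketch, not the recorder: the assignment of X++ to thread I and X = X*5 to thread J, the `OPS` table, and the `run` helper are assumptions made for illustration, and the "log" is simply the recorded order of the two racing updates.

```python
# Two threads race on X (initially 1).
# Assumption: thread I runs X++ and thread J runs X = X*5.
OPS = {
    "I": lambda x: x + 1,   # X++
    "J": lambda x: x * 5,   # X = X*5
}

def run(order, x=1):
    """Apply the racing updates in the given thread order; return the final X."""
    for thread in order:
        x = OPS[thread](x)
    return x

# Original run: J's update happened first, and that order is what gets logged.
log = ["J", "I"]
original = run(log)

# A replay that ignores the log may pick the other order.
unconstrained = run(["I", "J"])

# A replay that follows the log reproduces the original result.
replayed = run(log)

print(original, unconstrained, replayed)
```

With a single thread there is only one order, which is why race recording is not an issue there.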
Recording for Multithreaded Replay
Race Recording (focus)
• Not an issue for a single thread
• Recreate the same general & data races
Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not focus
Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not focus
A Good Race Recorder
• Low cost
• Low runtime overhead
• Applicability
• Long recording: small log
Desired & Existing Race Recorders
[Table: desired recorder vs. existing race recorders. Criteria: recording length (small log size); applicability (racy code, MP, SC & TSO); overhead (negligible slowdown); cost (little hardware). Recorders compared: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Strata (ASPLOS ’06), and our recorder.]
Small Log Size: Transitive Reduction & Regulated TR
Problem Formulation
[Figure: the same two-thread trace shown twice, once with conflicts marked (red) and once with the resulting dependences (black).]
  Thread I    Thread J
  ld A        add
  st B        st C
  st C        ld B
  ld D        st A
  sub         st C
  ld B        st D
Recording → Log → Replay
Reproduce exact same conflicts: no more, no less
Log All Conflicts
     Thread I       Thread J
  1  ld A        1  add
  2  st B        2  st C
  3  st C        3  ld B
  4  ld D        4  st A
  5  sub         5  st C
  6  ld B        6  st D
Dependence log (logical timestamps, 16 bytes per dependence)
  Log J: 2→3, 1→4, 3→5, 4→6
  Log I: 2→3
Log Size: 5*16=80 bytes (10 integers)
Assign IC → Detect conflicts → Write log
But too many conflicts
Netzer’s Transitive Reduction
     Thread I       Thread J
  1  ld A        1  add
  2  st B        2  st C
  3  st C        3  ld B
  4  ld D        4  st A
  5  sub         5  st C
  6  ld B        6  st D
TR Reduced Log (the dependence 1→4 is TR reduced)
  Log J: 2→3, 3→5, 4→6
  Log I: 2→3
Log Size: 64 bytes (8 integers)
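The reduction on this example can be checked with a short sketch. It is not Netzer's full algorithm: it assumes only two threads, takes the five dependences of the example as input, and uses a simple pairwise rule (a dependence is redundant when another one in the same direction starts no earlier and ends no later). The names `ALL_DEPS`, `implied_by`, and `transitive_reduction` are hypothetical.

```python
# Dependences of the running example: (src_thread, src_ic, dst_thread, dst_ic).
ALL_DEPS = [
    ("I", 2, "J", 3), ("I", 1, "J", 4), ("I", 3, "J", 5), ("I", 4, "J", 6),
    ("J", 2, "I", 3),
]

def implied_by(dep, other):
    """True if `other` plus program order already enforces `dep`.

    Two-thread assumption: src_ic <= other.src_ic (program order in the
    source thread) and other.dst_ic <= dst_ic (program order in the
    destination thread) make `dep` redundant.
    """
    s_t, s_ic, d_t, d_ic = dep
    o_s_t, o_s_ic, o_d_t, o_d_ic = other
    return (dep != other
            and (s_t, d_t) == (o_s_t, o_d_t)
            and s_ic <= o_s_ic
            and o_d_ic <= d_ic)

def transitive_reduction(deps):
    """Keep only the dependences that no other dependence implies."""
    return [d for d in deps if not any(implied_by(d, o) for o in deps)]

reduced = transitive_reduction(ALL_DEPS)
BYTES_PER_DEP = 16  # two integers per dependence, as on the slide
print(reduced, len(reduced) * BYTES_PER_DEP)
```

Only I:1→J:4 is dropped, because I:2→J:3 already orders instruction 1 of I before instruction 4 of J.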
The Intuition of the New RTR Algorithm
[Figure: after reduction, RTR regulates replay so that the remaining dependences form regular vectors, one set from I to J and one from J to I.]
Stricter Dependences to Aid Vectorization
[Figure: the same two-thread example. A stricter dependence 4→5 from I to J replaces 3→5 and 4→6, which become reduced.]
New Reduced Log
  Log J: 2→3, 4→5
  Log I: 2→3
Log Size: 48 bytes (6 integers)
Compress Vectorized Dependencies
[Figure: the same two-thread example, with the remaining dependences drawn as vector dependences.]
Vectorized Log
  Log J: x=3,5, ∆=1
  Log I: x=3, ∆=1
Log Size: 40 bytes (5 integers)
Reduce log size to KB/core/second
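A small sketch of how such a vectorized log could be decoded and checked. The encoding is an assumption read off the example: an entry x with stride ∆ stands for a dependence from instruction x−∆ of the other thread to instruction x of the logging thread. The helper names are hypothetical.

```python
# Vectorized logs from the slide: (list of x values, delta).
LOG_J = ([3, 5], 1)   # dependences into thread J
LOG_I = ([3], 1)      # dependences into thread I

def decode(vector_log, src, dst):
    """Expand (xs, delta) into (src_thread, src_ic, dst_thread, dst_ic) tuples."""
    xs, delta = vector_log
    return [(src, x - delta, dst, x) for x in xs]

logged = decode(LOG_J, "I", "J") + decode(LOG_I, "J", "I")

# The five conflicts of the original execution.
ORIGINAL = [
    ("I", 2, "J", 3), ("I", 1, "J", 4), ("I", 3, "J", 5), ("I", 4, "J", 6),
    ("J", 2, "I", 3),
]

def enforced(dep, deps):
    """True if some logged dependence plus program order enforces `dep`."""
    s_t, s_ic, d_t, d_ic = dep
    return any((s_t, d_t) == (l[0], l[2]) and s_ic <= l[1] and l[3] <= d_ic
               for l in deps)

# Log size: every x plus one delta per log, 8 bytes per integer.
integers = len(LOG_J[0]) + 1 + len(LOG_I[0]) + 1
size_bytes = integers * 8
print(logged, all(enforced(d, logged) for d in ORIGINAL), size_bytes)
```

The three decoded dependences are stricter than the original five, so replay is more constrained, but every original conflict is still reproduced in the same order.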
Low Runtime Overhead: Coherence Piggyback
Detect Conflicts
Per-address state: A.readers, A.writer
Thread I
  1 ld A:  A.readers.add(I, 1)
  2 st B:  B.writer = (I, 2)
  3 st C:  if (C.writer != I)
             log(WAW)
           foreach reader in C.readers
             if (reader != I)
               log(WAR)
           C.readers.clear( )
           C.writer = (I, 3)
Thread J
  1 add
  2 st C:  C.writer = (J, 2)
  3 ld B:  if (B.writer != J)
             log(RAW)
           B.readers.add(J, 3)
  4 st A:  …
Recording
Expensive in software
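The bookkeeping above can be written as a small software detector. This is a sketch under stated assumptions: the global interleaving in `TRACE` is one order consistent with the example's dependences (the slides do not give it explicitly), and `detect_conflicts` is a hypothetical name.

```python
from collections import defaultdict

# Assumed interleaving of the running example: (thread, IC, op, address).
TRACE = [
    ("I", 1, "ld", "A"), ("I", 2, "st", "B"),
    ("J", 1, "add", None), ("J", 2, "st", "C"),
    ("I", 3, "st", "C"), ("J", 3, "ld", "B"),
    ("I", 4, "ld", "D"), ("J", 4, "st", "A"),
    ("J", 5, "st", "C"), ("J", 6, "st", "D"),
    ("I", 5, "sub", None), ("I", 6, "ld", "B"),
]

def detect_conflicts(trace):
    """Return (kind, src_thread, src_ic, dst_thread, dst_ic) for each conflict."""
    writer = {}                   # address -> (thread, ic) of the last writer
    readers = defaultdict(dict)   # address -> {thread: ic} since the last write
    log = []
    for thread, ic, op, addr in trace:
        if op == "ld":
            w = writer.get(addr)
            if w is not None and w[0] != thread:
                log.append(("RAW", w[0], w[1], thread, ic))
            readers[addr][thread] = ic
        elif op == "st":
            w = writer.get(addr)
            if w is not None and w[0] != thread:
                log.append(("WAW", w[0], w[1], thread, ic))
            for r_thread, r_ic in readers[addr].items():
                if r_thread != thread:
                    log.append(("WAR", r_thread, r_ic, thread, ic))
            readers[addr].clear()
            writer[addr] = (thread, ic)
    return log

conflicts = detect_conflicts(TRACE)
for entry in conflicts:
    print(entry)
```

Every load and store pays for a table lookup and update here, which is the software cost the slide refers to.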
Use Cache and Cache Coherence
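[Figure: Proc J issues ld B. Proc I's cache holds A (S, timestamp 1) and B (M, timestamp 2); Proc J's cache holds A (S, timestamp 3) and B (I, timestamp 2). The per-line timestamps stand in for A.readers, A.writer, B.readers, B.writer. J's Get/S request is answered by a data response that carries the timestamp, so the RAW is detected & logged.]
Detect conflict in hardware with little runtime cost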
Cache Evictions and Writebacks
[Figure: st A with cache evictions and writebacks. One cache holds A (S, 1), C (M, 3), B (M, 2); the other holds A (S, 3) and B (I, 2). Directory of A: Shared(I,J) Owner(). The Inv is answered by an Ack, and the WAR is detected & logged; the open question is which timestamp to return once the line has been evicted.]
OK with nonsilent eviction & directory eviction
Implement TR and RTR in Hardware
Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction
Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies
Hardware Implementation
Cache
• Eviction/writeback: solved, more details later
• Directory protocols: solved
• Snooping protocols: partly solved
• Two-level coherence: not yet solved
Processor
• Out-of-order/prefetching: solved
• Unordered messages: solved
• Counter overflow: solved
• Thread migration: not yet solved
Low Cost Hardware: Set/LRU Approximation
Timestamp Approximation
[Figure: one set of I's cache holds A (S, 1), C (M, 3), B (M, 2); Directory of A: Shared(I). When the timestamp of a line is no longer available, use the current IC of thread I. Shown on the running two-thread example during recording.]
Correct, but more evictions → more logged conflicts
Tradeoff: hardware cost vs. log size
Set/LRU Approximation
[Figure: one set of I's cache holds A (S, 1), C (M, 3), B (M, 2). Instead of "use current IC of thread I", rely on the LRU guarantee: B's TS > A's TS. Shown on the running two-thread example during recording.]
Set/LRU better preserves reducibility
Small $ → more misses → but still small log
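One way to read the LRU guarantee is sketched below. This is an illustration under assumptions, not the hardware design: the class `SetLRUTimestampMemory` and its methods are hypothetical, blocks are plain integers, and a missing block in a set that has evicted something is answered with the timestamp of that set's current LRU entry. With timestamps assigned in access order, that value is never smaller than the timestamp of any evicted entry, so the answer is conservative.

```python
from collections import OrderedDict

class SetLRUTimestampMemory:
    """Set-associative timestamp store with per-set LRU replacement (sketch)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]  # oldest entry first
        self.has_evicted = [False] * num_sets

    def touch(self, block, ic):
        """Record that `block` was accessed at instruction count `ic`."""
        idx = block % self.num_sets
        entries = self.sets[idx]
        if block in entries:
            entries.move_to_end(block)        # now the most recently used
        elif len(entries) == self.ways:
            entries.popitem(last=False)       # evict the LRU entry
            self.has_evicted[idx] = True
        entries[block] = ic

    def lookup(self, block):
        """Return an exact timestamp, or a conservative upper bound on a miss."""
        idx = block % self.num_sets
        entries = self.sets[idx]
        if block in entries:
            return entries[block]
        if self.has_evicted[idx]:
            lru_block = next(iter(entries))
            return entries[lru_block]         # >= timestamp of anything evicted
        return 0                              # never seen: no earlier access

# One 2-way set, mirroring the slide: A touched at 1, B at 2, C at 3.
A, B, C = 0, 1, 2
tsm = SetLRUTimestampMemory(num_sets=1, ways=2)
tsm.touch(A, 1)
tsm.touch(B, 2)
tsm.touch(C, 3)   # evicts A
print(tsm.lookup(A), tsm.lookup(B), tsm.lookup(C))
```

The bound returned for A is B's timestamp rather than the thread's current IC, so it stays closer to A's true timestamp and fewer dependences become irreducible.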
Hardware Cost of Timestamps
Coupled Timestamp Memory
  Tag  State  Data  Timestamp
  A    S      …     1
  B    M      …     2
Coupled timestamp memory: overhead proportional to cache size
• Not flexible
• 64B line + 64b (24b) timestamp → 12.5% (4.7%) overhead
• 192 KB for a 4MB L2
Need to modify cache
Decoupled Timestamp Memory
Coupled Timestamp Memory (timestamps inside the cache)
  Tag  State  Data  Timestamp
  A    S      …     1
  B    M      …     2
Decoupled: cache unchanged, separate Timestamp Memory
  Cache                 Timestamp Memory
  Tag  State  Data      Tag  Timestamp
  A    S      …         A    1
  B    M      …         B    2
Decoupling → Small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way → 99% transitive reduction
• Timestamp Memory → 24 KB
No need to modify cache
From 192 KB to 24 KB: 8x reduction
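The numbers on the last two slides can be checked with a few lines of arithmetic. The sketch assumes that the 192 KB figure corresponds to 24-bit timestamps on every line of the 4MB L2 (the slide does not say which width it uses) and that KB and MB are binary units; the variable names are hypothetical.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def coupled_overhead(ts_bits):
    """Fraction of extra storage when each cache line carries a timestamp."""
    return ts_bits / (LINE_BYTES * 8)

lines = L2_BYTES // LINE_BYTES                  # lines in a 4MB L2
coupled_kb = lines * 24 // 8 // 1024            # 24-bit timestamp per line

entries = 32 * 64                               # 32-set, 64-way timestamp memory
decoupled_kb = 24                               # size quoted on the slide
bytes_per_entry = decoupled_kb * 1024 // entries

print(f"{coupled_overhead(64):.1%} {coupled_overhead(24):.1%}")
print(coupled_kb, decoupled_kb, coupled_kb // decoupled_kb, bytes_per_entry)
```

The decoupled structure spends more bytes per entry, since it stores its own tags, but has far fewer entries.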
SC & TSO Applicability: Order-Value Hybrid
Recording with Total Store Order (TSO)
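Majority of existing MPs are non-SC
TSO is well defined, x86-like
Initially A=B=0
     Thread I       Thread J
  1  st A,1      1  st B,1
  2  ld B        2  ld A
Possible executions and the values the loads observe
  Model         Order                         Result
  SC (and TSO)  st A,1; st B,1; ld B; ld A    A=1, B=1
  SC (and TSO)  st A,1; ld B; st B,1; ld A    A=1, B=0
  SC (and TSO)  st B,1; ld A; st A,1; ld B    A=0, B=1
  TSO only      ld A; ld B; st A,1; st B,1    A=0, B=0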
TSO Execution
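[Figure: the same two threads, each with a write buffer (WrBuf) in front of the memory system. I's st A,1 and J's st B,1 wait in the write buffers (I: A=1, J: B=1) while memory still holds A=0, B=0, so ld A and ld B both read 0. The stores reach memory afterwards. Result: A=0, B=0.]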
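The difference between the two models on this example can be checked by brute force. The sketch below is a toy model, not a full TSO definition: each thread has one write buffer whose single store drains at some arbitrary later point, and `sc_outcomes` and `tso_outcomes` are hypothetical names. Outcomes are the values seen by (ld A, ld B).

```python
from itertools import permutations

# Store-buffer test from the slide; A and B start at 0.
PROGRAM = {
    "I": [("st", "A", 1), ("ld", "B")],
    "J": [("st", "B", 1), ("ld", "A")],
}

def sc_outcomes():
    """Values seen by (ld A, ld B) over every SC interleaving."""
    results = set()
    for order in set(permutations("IIJJ")):
        mem, pc, seen = {"A": 0, "B": 0}, {"I": 0, "J": 0}, {}
        for t in order:
            op = PROGRAM[t][pc[t]]
            pc[t] += 1
            if op[0] == "st":
                mem[op[1]] = op[2]
            else:
                seen[op[1]] = mem[op[1]]
        results.add((seen["A"], seen["B"]))
    return results

def tso_outcomes():
    """Same test when each store may sit in a write buffer past the load."""
    results = set()
    shapes = (["st", "drain", "ld"], ["st", "ld", "drain"])
    for shape_i in shapes:
        for shape_j in shapes:
            steps = {"I": shape_i, "J": shape_j}
            for order in set(permutations("IIIJJJ")):
                mem, pc, seen = {"A": 0, "B": 0}, {"I": 0, "J": 0}, {}
                buf = {"I": {}, "J": {}}
                for t in order:
                    step = steps[t][pc[t]]
                    pc[t] += 1
                    st, ld = PROGRAM[t]
                    if step == "st":
                        buf[t][st[1]] = st[2]       # store waits in the write buffer
                    elif step == "drain":
                        mem.update(buf[t])          # store becomes globally visible
                        buf[t].clear()
                    else:
                        # a load checks its own write buffer first, then memory
                        seen[ld[1]] = buf[t].get(ld[1], mem[ld[1]])
                results.add((seen["A"], seen["B"]))
    return results

print(sorted(sc_outcomes()))
print(sorted(tso_outcomes() - sc_outcomes()))
```

The only outcome added by the write buffers is the one where both loads read 0, which no interleaving of the four instructions can produce; this is why an order-only log is not enough under TSO.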
Order-Value-Hybrid Recording
[Figure: order-value hybrid recording on the TSO example (A=B=0; Thread I: 1 st A,1, 2 ld B; Thread J: 1 st B,1, 2 ld A). Recording: the stores sit in the write buffers (I: A=1, J: B=1) while the loads read A=0 and B=0 from memory. After its load, I starts to monitor B and J starts to monitor A; monitoring stops later. When a monitored location changes ("A changed!"), the WAR is omitted from the order log and the loaded value is logged instead. Replay: the logged value is used (A=0).]
Hybrid Recording with TR and RTR
Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]
Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively
Evaluation Method & Results
Put-it-together: Determinizer/CMP
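[Figure: a 4-core CMP. Each of cores 1 to 4 has L1 I$ and D$, an instruction counter (IC), a timestamp memory (TSM), and a log with TR and RTR registers next to its L1 coherence controller. The cores share an L2 cache that holds the L1 directory.]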
Simulation Method
Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
  • 1-way in-order issue, 2 GHz
  • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
Commercial server software
• Apache – static web serving
• SpecJBB – middleware
• OLTP – TPC-C like
• Zeus – static web serving
Log Size: 1 byte/kilo-instr
[Figure: two bar charts of log size for Apache, JBB, OLTP, Zeus, and AVG, one in byte/core/kilo-instr and one in KB/core/s.]
Well within the capability of current machines
• Long recording (days – months) needs improvement
Runtime Overhead
[Figure: two bar charts for Apache, JBB, OLTP, and Zeus, normalized from 0 to 100: execution time and interconnection message bandwidth. Bars: baseline vs. with race recorder.]
Our recorder can be “always-on”
Benefits of RTR and Set/LRU (Log Size)
[Figure: two bar charts of normalized log size (0 to 100) for Apache, JBB, OLTP, Zeus, and AVG. "Improvement by RTR": Pairwise-TR vs. our RTR. "Effectiveness of Set/LRU": perfect TSM vs. 24KB Set/LRU TSM.]
Why RTR and Set/LRU Work Well?
RTR
• Processors execute instructions at similar speed
• Therefore, we can find “vectorizable” dependencies
Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old-enough”
Sensitivity and Scalability
A design space of the timestamp memory (TSM)
• Size: smaller TSM → larger log
• Read/write timestamps: should be used when the TSM is large
• Partial timestamp: 24 bits is enough
• Associativity: higher is better for RTR
Scalability of the recorder
• Studied with modest processor counts (2p – 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with the number of cores
Conclusions, etc.
Conclusions & Future Work
Race recording → Key to combating nondeterminism
Contributions → Effective & inexpensive recorder
• Transitive Reduction & RTR algorithm → small log size
• Coherence piggyback → negligible slowdown
• Timestamp approximation → low hardware cost
• Order-value hybrid → support SC & TSO
Future work
• Operate with Hardware Transactional Memory
• Seek to Eliminate Timestamp on Acknowledgements
Toward Recording w/ Snooping Protocols
Key problem is combined/implicit response
• Not a problem for AMD Hammer
[Figure: snooping example for st A. One cache holds A (S, 1) and B (M, 4); the other holds A (S, 3) and B (I, 2). The Get/X request carries the requester's current IC; the sharer pulls the shared line, and the WAR is detected & logged.]
Timestamp at L2-Directory or Memory?
[Figure: st A after an eviction. One cache holds A (S, 1), C (M, 3), B (M, 4); the other holds A (S, 3) and B (I, 2), with a memory timestamp of 4. Messages shown: Get/S, Ack, eviction timestamp, memory timestamp. Directory of A: Shared(J) Owner() StickyS(I,J).]
Directory eviction: more false conflicts, like snooping
For references, Google “Mark Hill”
Publications & Talks
2007 - IEEE Micro Top Picks
2006 – ASPLOS
2003 – ISCA
Students & Graduates
2006 - Xu Ph.D. Thesis
Backup
Serializability Violation Detector [PLDI’05]
Like a race detector
No a priori annotation requirement
• “Critical sections” are inferred
Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition
[Figure: an inferred “critical section” over shared variables: Read in1, Read in2, Read local, Write local, Write out1, Write out2.]
Race Recording: Key to Determinism
Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but
Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Support only sequential consistency
Need a better race recorder
Recording/Replay & Debugging
[Figure: the online recorder runs processors P1 to P4, taking checkpoints A, B, C and storing logs A, B, C until a crash. The deterministic replayer dumps “core”, reads checkpoint B, and replays from logs B and C up to the crash.]
Deterministic Replay & Fault Tolerance
Fault Recovery
• Replay after a failure
Fault Detection
• Replay then compare
(Courtesy of VMware)
Future: Record/Replay & Undo/Redo
Windows XP
VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo
Future: Replay-based Synchronization
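[Figure: during recording, one thread runs "ld A; st B; Unlock()" and the other runs "lock(); st A; ld B", and the order is written to a log. During replay, the threads run "ld A; st B" and "st A; ld B", ordered by the log.]
Three steps
• Coarse-grain sync. → fine-grain sync. → hardware sync.
Results: higher performance
Works only if static control flow & fixed data addr
• DSP kernels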
Race Recording Related Work
[Table: related race recorders, each rated on log size (small/large), overhead (low/high), and replay parallelism (low/high).
Total-order recorders: Bacon ’91 (hardware; bus transactions), RecPlay ’00 and JaRec ’04 (Lamport clocks).
Partial-order recorders: R&C ’90 and Déjà Vu ’98 (scheduling; non-MP), Bacon ’91 (hardware; bus transaction groups), Instant Replay ’87 (variable version), Netzer ’93 (vector clocks).]
Correctness of Order-Value-Hybrid
Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value
TR and TSO
TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen
[Figure: during recording, Thread I executes 1 st A, 2 st C, 3 ld B and Thread J executes 1 st B, 2 st C, 3 ld A. One dependence is marked "Must not be reduced".]
RTR and TSO
The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it
[Figure: during recording, Thread I executes 1 st A, 2 add, 3 st B (in write buffer), 4 ld C and Thread J executes 1 add, 2 sub, 3 ld A, 4 ld B. The old window for j:3 is replaced by a new, smaller window for j:3; the ordering between the ordered loads is not allowed by the new window.]
Deadlock Avoidance of RTR
[Figure: the two-thread example during recording (Thread I: ld A, st B, st C, ld D, sub, ld B; Thread J: add, st C, ld B, st A, st C, st D), with a replay cycle through i:4, j:1, j:2, i:3, i:4.]
Avoid deadlock by adhering to an SC total order
Recording Race-free Executions
No data races
Only need to record synchronization races
Deterministic replay up until the first data race
Replay Parallelism
Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations
Directory Protocols
Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements
Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements
A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster
Nonsilent Evictions
[Figure: st A after a nonsilent eviction. One cache holds A (S, 1), C (M, 3), B (M, 4); the other holds A (S, 3) and B (I, 2), with a memory timestamp of 4. Messages shown: Get/S, Ack, eviction timestamp, memory timestamp. Directory of A: Shared(J) Owner() StickyS(I,J).]
Directory eviction: more false conflicts, like snooping
Out-of-Order & Hardware Prefetching
Speculative execution
• No IC assigned yet
Hardware prefetching
• No IC assigned
Key idea: receive observation
• Can associate a ld/st with the currently committing instruction
Unordered Messages in Interconnect
Messages arrive out of order
• Can affect reduction
But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas
Integer Overflow
IC and timestamps may overflow
IC: make it 64-bit; it will not overflow for a long time
Timestamps: use approximation techniques
• MSB of IC + LSB of timestamps
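A sketch of how a partial timestamp might be widened again from the current IC, plus a rough check of the 64-bit claim. Both parts rest on assumptions that the slides do not state: the true timestamp is taken to be no larger than the current IC and less than 2^width behind it, and the core is taken to commit at most one instruction per cycle at 2 GHz. The `expand` helper is hypothetical.

```python
def expand(partial, current_ic, width):
    """Rebuild a full timestamp from its low `width` bits and the current IC.

    Assumes true_ts <= current_ic and current_ic - true_ts < 2**width.
    """
    span = 1 << width
    full = (current_ic & ~(span - 1)) | partial   # MSBs of IC + LSBs of timestamp
    if full > current_ic:                         # low bits wrapped since then
        full -= span
    return full

WIDTH = 24
MASK = (1 << WIDTH) - 1
current_ic = (5 << WIDTH) + 10

recent = (5 << WIDTH) + 3           # same upper bits as the current IC
older = (4 << WIDTH) + 0xFFFFF0     # recorded before the low bits wrapped

print(expand(recent & MASK, current_ic, WIDTH) == recent)
print(expand(older & MASK, current_ic, WIDTH) == older)

# Rough time for a 64-bit IC to overflow at 2 GHz, one instruction per cycle.
years_to_overflow = 2**64 / 2e9 / (365 * 24 * 3600)
print(round(years_to_overflow))
```

Timestamps older than the window cannot be told apart from newer ones, so the scheme only works if stale entries are treated conservatively.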
Varying TSM Size
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second) against the size of the timestamp memory (KB, from 2 to 2048; 64 ways, full timestamps, Set/LRU). Series per panel: 1TS-RTR, 1TS-TR, 2TS-RTR, 2TS-TR.]
Varying Associativity
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second, log scale from 0.01 to 10) against the associativity of the timestamp memory (from 2 to 1024; 64KB, full R/W timestamps). Series per panel: CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, SetLRU-RTR.]
Varying Partial Timestamp Width
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second, log scale from 0.01 to 10) against the partial timestamp width (from 10 to 30; 64 sets, 64 ways, Set/LRU). Series per panel: TR, RTR.]
Log Size Scaling
[Figure: log size (MB/core/s, from 0.0 to 1.0) against the number of cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, and Zeus.]