Dependence-Aware Transactional Memory for Increased Concurrency
Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel
University of Texas at Austin
Concurrency Conundrum
• Challenge: CMP ubiquity
• Parallel programming with locks and threads is difficult
  – deadlock, livelock, convoys…
  – lock ordering, poor composability
  – performance-complexity tradeoff
• Transactional Memory (HTM)
  – simpler programming model
  – removes pitfalls of locking
  – coarse-grain transactions can perform well under low contention
[Figure: CMPs with 2, 16, and 80 cores (neat!)]
High contention: optimism warranted?
• TM performs poorly with write-shared data
  – increasing core counts make this worse
• Write sharing is common
  – statistics, reference counters, lists…
• Solutions complicate the programming model
  – open-nesting, boosting, early release, ANTs…
[Figure: locks vs. transactions performance under read sharing and write sharing]
TMs need not always serialize write-shared transactions!
Dependence-awareness can help the programmer (transparently!)
Outline
• Motivation
• Dependence-Aware Transactions
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion
Two threads sharing a counter
Initially: count == 0

Thread A            Thread B
load count,r
inc r
store r,count
                    load count,r
                    inc r
                    store r,count

Memory: count goes 0 → 1 → 2. Result: count == 2
Schedule:
• dynamic instruction sequence
• models concurrency by interleaving instructions from multiple threads
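The schedule above can be reproduced in ordinary threaded code. Here is a minimal C/pthreads sketch (not from the talk) of two threads incrementing an unsynchronized counter; with the slide's interleaving the result is 2, but other schedules can lose an update:

/* Two unsynchronized increments of a shared counter. With the slide's
 * schedule (A's store happens before B's load) the result is 2; if both
 * threads load before either stores, an update is lost and it is 1. */
#include <pthread.h>
#include <stdio.h>

static int count = 0;

static void *increment(void *arg)
{
    (void)arg;
    int r = count;   /* load count,r  */
    r = r + 1;       /* inc r         */
    count = r;       /* store r,count */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("count == %d\n", count);
    return 0;
}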
TM shared counter
Initially: count == 0

Tx A                Tx B
xbegin              xbegin
load count,r
inc r
store r,count
                    load count,r   ← Conflict!
xend                inc r
                    store r,count
                    xend

Memory: count: 0 → 1
Conflict:
• both transactions access a datum, and at least one access is a write
• i.e., the read and write sets of the transactions intersect
Current HTM designs cannot accept such schedules, despite the fact that they yield the correct result!
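As a sketch of the conflict definition above (illustrative, not the paper's hardware), conflict detection reduces to intersecting read and write sets, here modeled as address-hash bitmaps:

/* Illustrative conflict check: a conflict exists when one transaction's
 * write set intersects the other's read or write set. Sets are modeled
 * as 64-bit address-hash bitmaps (Bloom-filter style), an assumption
 * for this sketch rather than the paper's mechanism. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t read_set;   /* hashed addresses read    */
    uint64_t write_set;  /* hashed addresses written */
} tx_sets_t;

bool conflicts(const tx_sets_t *a, const tx_sets_t *b)
{
    return (a->write_set & (b->read_set | b->write_set)) ||
           (b->write_set & a->read_set);
}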
Does this really matter?
Both critical sections do a long computation, then update a shared counter:

Critsec A            Critsec B
xbegin               xbegin
load X,r0            load X,r0
<lots of work>       <lots of work>
inc r0               inc r0
<lots of work>       <lots of work>
store r0,X           store r0,X
load count,r         load count,r
inc r                inc r
store r,count        store r,count
xend                 xend

The brief counter update at the end makes otherwise mostly independent transactions conflict.
Common pattern!
• Statistics
• Linked lists
• Garbage collectors
DATM: use dependences to commit conflicting transactions
DATM shared counter
Initially: count == 0

T1                  T2
xbegin              xbegin
load count,r
inc r
store r,count
                    load count,r   ← forward speculative data from T1 to satisfy T2's load
                    inc r
                    store r,count
xend                xend

Memory: count: 0 → 1 → 2
Model these constraints as dependences:
• T1 must commit before T2
• if T1 aborts, T2 must also abort
• if T1 overwrites count, T2 must abort
Conflicts become Dependences
[Figure: Tx A reads X, writes Y, writes Z; Tx B writes X (RW), writes Y (WW), reads Z (WR)]
• Read-Write: A commits before B
• Write-Read: forward data; overwrite → abort; A commits before B
• Write-Write: A commits before B
Dependence Graph:
• transactions: nodes
• dependences: edges
• edges dictate commit order
• no cycles → conflict serializable
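A software sketch (assumed structures, not the hardware encoding) of the dependence graph just described: conflicts add commit-order edges, and an edge that would close a cycle means the schedule is no longer conflict serializable:

/* Transactions are nodes; edge[a][b] means "a must commit before b".
 * Before adding an edge, check whether it would close a cycle; if so,
 * the contention manager must restart a transaction instead. */
#include <stdbool.h>
#include <string.h>

#define MAX_TX 64

static bool edge[MAX_TX][MAX_TX];

static void add_dependence(int before, int after)
{
    edge[before][after] = true;
}

/* DFS: is there a path from 'node' to 'target'? */
static bool reaches(int node, int target, bool *visited)
{
    if (node == target)
        return true;
    visited[node] = true;
    for (int next = 0; next < MAX_TX; next++)
        if (edge[node][next] && !visited[next] && reaches(next, target, visited))
            return true;
    return false;
}

/* Adding before -> after closes a cycle iff 'after' already reaches 'before'. */
static bool would_cycle(int before, int after)
{
    bool visited[MAX_TX];
    memset(visited, 0, sizeof visited);
    return reaches(after, before, visited);
}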
Enforcing ordering using Dependence Graphs

T1                  T2
xbegin              xbegin
load X,r0
store r0,X
                    load X,r0    (WR dependence: T1 → T2)
                    store r0,X
                    xend → wait for T1 (outstanding dependences!)
xend
                    (T2 commits)

T1 must serialize before T2.
Enforcing Consistency using Dependence Graphs

T1                  T2
xbegin              xbegin
load X,r0
                    load X,r0
store r0,X
                    store r0,X
xend                xend

Dependences point in both directions between T1 and T2 (RW, WW): CYCLE! (lost update)
• cycle → not conflict serializable
• restart some transaction to break the cycle
• invoke the contention manager
Theoretical Foundation
• Serializability: a concurrent interleaving equivalent to a serial interleaving
• Conflict: two transactions access the same datum, and at least one access is a write
• Conflict-equivalent: same operations, same order of conflicting operations
• Conflict-serializability (CS): conflict-equivalent to a serial interleaving
• Two-phase locking (2PL): a conservative implementation of CS
Correctness and optimality results: proofs in the paper [PPoPP 09]
[Figure: Venn diagram — 2PL (LogTM, TCC, RTM, MetaTM, OneTM…) ⊂ Conflict Serializable ⊂ Serializable]
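A worked example of these definitions (transaction and datum names illustrative): consider the interleaving S1 = read_A(X), write_B(X), read_A(Y). Its only conflicting pair is read_A(X) before write_B(X), so S1 is conflict-equivalent to the serial order A then B and hence conflict serializable, even though the operations interleave. By contrast, S2 = read_A(X), read_B(X), write_A(X), write_B(X) (the lost-update schedule from the previous slide) orders read_B(X) before write_A(X) (B before A) but write_A(X) before write_B(X) (A before B); the resulting dependence cycle means S2 is not conflict serializable.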
Outline
• Motivation/Background
• Dependence-Aware Model
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion
DATM Requirements
Mechanisms:
• Maintain a global dependence graph
  – conservative approximation
  – detect cycles, enforce ordering constraints
• Create dependences
• Forwarding/receiving
• Buffer speculative data
Implementation: coherence protocol, L1 cache, per-core hardware structures
Dependence-Aware Microarchitecture
• Order vector: dependence graph (topological sort, conservative approximation)
• TXID, access bits: versioning, conflict detection
• TXSW: transaction status word
• Frc-ND + ND bits: disable dependences / no active dependences
These structures are not programmer-visible!
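A software model (assumed helper structures, not the actual hardware) of how the Order Vector gates commit: a transaction's xend must stall while any transaction ordered before it is still running:

/* Commit gating sketch: the Order Vector is a topological order over
 * dependent transactions. A transaction may retire at xend only when
 * every transaction ordered before it has resolved (committed or
 * aborted); otherwise it stalls on the outstanding dependence. */
#include <stdbool.h>

#define MAX_TX 64

typedef enum { TX_RUNNING, TX_COMMITTED, TX_ABORTED } tx_state_t;

typedef struct {
    int        order[MAX_TX];  /* topological order of dependent txns */
    int        len;
    tx_state_t state[MAX_TX];  /* indexed by transaction id */
} order_vector_t;

static bool can_commit(const order_vector_t *ov, int txid)
{
    for (int i = 0; i < ov->len; i++) {
        int t = ov->order[i];
        if (t == txid)
            return true;           /* all predecessors resolved */
        if (ov->state[t] == TX_RUNNING)
            return false;          /* outstanding dependence: stall */
    }
    return true;                   /* not in the vector: no dependences */
}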
FRMSI protocol: forwarding and receiving
• MSI states
• Transactional states (T*)
• Forwarded: (T*F)
• Received: (T*R*)
• Committing (CTM)
• Bus messages: TXOVW, xABT, xCMT
[Figure: FRMSI state diagram — MSI states (M, S, I); transactional states modified by TX (TM, TMM, TMF, TMR, TMRF); transactional shared and received states (TS, TR); committing (CTM)]
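For concreteness, the stable states named in the diagram can be written as a C enum; the interpretations in the comments are inferred from the slide's legend and naming scheme, not from the protocol specification:

/* FRMSI stable line states (encoding illustrative). Suffix meanings
 * are assumptions from the slide: M/S/I are plain MSI, T* are
 * transactional, *F forwarded, *R received, CTM committing. */
typedef enum {
    FRMSI_M,     /* Modified                                    */
    FRMSI_S,     /* Shared                                      */
    FRMSI_I,     /* Invalid                                     */
    FRMSI_TM,    /* transactionally modified                    */
    FRMSI_TMM,   /* transactionally modified, was Modified      */
    FRMSI_TMF,   /* transactionally modified, forwarded         */
    FRMSI_TMR,   /* transactionally modified, received          */
    FRMSI_TMRF,  /* received, then modified and forwarded       */
    FRMSI_TS,    /* transactionally shared                      */
    FRMSI_TR,    /* received                                    */
    FRMSI_CTM    /* committing transactional modified           */
} frmsi_state_t;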
Converting Conflicts to Dependences
Both cores run the counter transaction:
xbegin
ld R, cnt
inc cnt
st cnt, R
xend
[Figure: core 0 runs TxA, and its L1 holds cnt = 101 in the forwarded state TMF (the pre-transaction value 100 is in TS); core 1 runs TxB, which received the speculative value and holds cnt = 102 in TMR/TR. Both cores' Order Vectors hold [A,B]. Core 1 reaches xend first and stalls on the outstanding dependence; when core 0 commits, the xCMT(A) bus message lets TxB commit. Main memory still holds cnt = 100 until commit.]
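A simplified software model of the forwarding step in this walkthrough (names and the helper are illustrative, not the coherence implementation): a load that hits another transaction's speculatively modified line receives the forwarded value and records a WR dependence:

/* Forwarding sketch: when a load hits a line another transaction holds
 * in a T-modified state, the producer forwards its speculative value,
 * both sides record the WR dependence (Order Vector), and the receiver
 * caches the line in a received state (TMR/TR) so it can be squashed
 * if the producer aborts (xABT) or overwrites the line (TXOVW). */
#include <stdint.h>

typedef struct {
    int      owner_tx;    /* transaction holding the line in TMF/TMM */
    uint64_t spec_value;  /* speculative (uncommitted) data          */
} cache_line_t;

/* assumed helper: records "before must commit before after" */
static void order_vector_add(int before, int after)
{
    (void)before; (void)after;  /* stub for illustration */
}

static uint64_t forward_load(const cache_line_t *line, int receiver_tx)
{
    order_vector_add(line->owner_tx, receiver_tx);  /* WR dependence */
    return line->spec_value;  /* receiver caches this in TMR/TR state */
}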
Outline
• Motivation/Background
• Dependence-Aware Model
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion
Experimental Setup
• Implemented DATM by extending MetaTM
  – compared against MetaTM [Ramadan 2007]
• Simulation environment
  – Simics 3.0.30 machine simulator
  – x86 ISA w/ HTM instructions
  – 32KB 4-way transactional L1 cache; 4MB 4-way L2; 1GB RAM
  – 1 cycle/instruction, 24 cycles/L2 hit, 350 cycles/main memory
  – 8, 16 processors
• Benchmarks
  – micro-benchmarks
  – STAMP [Minh ISCA 2007]
  – TxLinux: transactional OS [Rossbach SOSP 2007]
DATM speedup, 16 CPUs
[Figure: bar chart of DATM speedup over MetaTM — pmake 2%, labyrinth 3%, vacation 22%, bayes 39%, counter 15x]
Eliminating wasted work
[Figure: restarts/tx and avg bkcyc/tx under DATM relative to MetaTM, 16 CPUs, lower is better — pmake (speedup 2%), labyrinth (3%), vacation (22%), bayes (39%), counter (15x)]
Dependence-Aware Contention Management
• Timestamp contention management: a conflict triggers a cascaded abort; here 5 transactions must restart!
[Figure: dependence graph over transactions T1–T7]
• Dependence-aware: the Order Vector (T2, T3, T7, T6, T4, T1) serializes the conflicting transactions instead of restarting them
Contention Management
[Figure: performance of Polka, Eruption, and Dependence-Aware contention management on bayes and vacation; y-axis from −60% to 120%, with callouts at 101% and 14%]
Outline
• Motivation/Background
• Dependence-Aware Model
• Dependence-Aware Hardware
• Experimental Methodology/Evaluation
• Conclusion
Related Work
• HTM
  – TCC [Hammond 04], LogTM[-SE] [Moore 06], VTM [Rajwar 05], MetaTM [Ramadan 07, Rossbach 07], HASTM, PTM, HyTM, RTM/FlexTM [Shriraman 07, 08]
• TLS
  – Hydra [Olukotun 99], Stampede [Steffan 98, 00], Multiscalar [Sohi 95], [Garzaran 05], [Renau 05]
• TM model extensions
  – privatization [Spear 07], early release [Skare 06], escape actions [Zilles 06], open/closed nesting [Moss 85, Menon 07], Galois [Kulkarni 07], boosting [Herlihy 08], ANTs [Harris 07]
• TM + conflict serializability
  – Adaptive TSTM [Aydonat 08]
Conclusion
• DATM can commit conflicting transactions
• Improves write-sharing performance
• Transparent to the programmer
• DATM prototype demonstrates performance benefits
Source code available: www.metatm.net
Broadcast and Scalability
Yes, we broadcast, but it is less than you might think.
Because each node maintains all dependences, this design uses broadcast for:
• commit
• abort
• TXOVW (broadcast writes)
• forward restarts
• new dependences
These sources of broadcast could be avoided in a directory protocol by keeping only the relevant subset of the dependence graph at each node.
FRMSI Transient States
19 transient states, from three sources:
• forwarding
• overwrite of forwarded lines (TXOVW)
• transitions from MSI → T* states
Ameliorating factors:
• best-effort: local evictions → abort, reduces WB states
• T*R* states are isolated
• transitions back to MSI (abort, commit) require no transient states
• signatures for the forward/receive sets could eliminate five states
Why isn't this TLS?
[Table: feature comparison of TLS, DATM, and TM — forwarding data between threads; detecting reads that occur too early; discarding speculative state after violations; retiring speculative writes in order; memory renaming; multi-threaded workloads; programmer/compiler transparency]
1. Transactions can commit in arbitrary order; TLS has the daunting task of maintaining program order for epochs.
2. TLS forwards values only once they are the globally committed value (i.e., through memory).
3. No need to detect "early" reads (before a forwardable value is produced).
4. The notion of "violation" differs: we roll back when conflict serializability is violated; TLS rolls back when epoch ordering is.
5. Retiring writes in order is a similar issue: we use the CTM state and don't need speculative write buffers.
6. TLS uses memory renaming to avoid older threads seeing new values; we have no such issue.
Doesn't this lead to cascading aborts?
Yes, but cascading aborts are rare in most workloads.
Inconsistent Reads
OS modifications:
• Page fault handler
• Signal handler
Increasing Concurrency
[Figure: timelines of two conflicting transactions T1 and T2 under three designs]
• Eager/Eager (MetaTM, LogTM): updates in-place, conflict detection at time of reference. On conflict, arbitrate; the loser retries, so the transactions serialize.
• Lazy/Lazy (e.g. TCC): updates buffered, conflict detection at commit time. The loser restarts when it tries to commit; execution overlaps, but work is wasted.
• DATM: the conflict becomes a dependence; no retry, and T2 simply commits after T1.
Design space highlights
• Cache granularity: eliminate per-word access bits
• Timestamp-based dependences: eliminate the Order Vector
• Dependence-aware contention management
Design Points
[Figure: relative performance of design points (Cache-gran, Timestamp-fwd, DATM) for bayes, vacation, counter, and counter-tt; y-axis 0 to 2.5]
Hardware TM Primer
Key ideas:
• Critical sections execute concurrently
• Conflicts are detected dynamically
• If conflict serializability is violated, roll back
Key abstractions:
• Primitives: xbegin, xend, xretry
• Conflict
• Contention manager: needs a flexible policy
"Conventional wisdom transactionalization": replace locks with transactions
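A sketch of that conventional-wisdom transformation, with xbegin/xend modeled as no-op stubs since the real primitives are ISA instructions (the lock names in the comments are illustrative):

/* Illustrative stubs: on real hardware these are instructions, and a
 * conflict rolls execution back to xbegin rather than returning. */
static void xbegin(void) { /* begin speculative region */ }
static void xend(void)   { /* attempt commit */ }

static long stats_count;

static void update_stats(void)
{
    xbegin();        /* was: pthread_mutex_lock(&stats_lock);   */
    stats_count++;   /* critical section runs speculatively     */
    xend();          /* was: pthread_mutex_unlock(&stats_lock); */
}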
Hardware TM basics: example
Both CPUs run:
0: xbegin
1: read A
2: read B
3: if(cpu % 2)
4:   write C
5: else
6:   read C
7: …
8: xend

cpu0 takes the else branch and reads C: working set R{A,B,C}, W{}.
cpu1 takes the if branch and writes C: working set R{A,B}, W{C}.
CONFLICT: C is in the read set of cpu0 and in the write set of cpu1.
Assume the contention manager decides cpu1 wins: cpu0 rolls back, cpu1 commits.