Nir Shavit, MIT Computer Science and Artificial Intelligence

Hybrid Transactional Memory
Nir Shavit
MIT and Tel-Aviv University
Joint work with Alex Matveev
(and describing the work of many in this
summer school)
Haswell
Transactional Memory
[HerlihyMoss93]
Transactional Memory
• Memory Transactions are collections of
reads and writes executed atomically
• Should Provide
– Disjoint Access Parallelism
• Should maintain internal and external
consistency
– External (Serializability): with respect to the
interleavings of other transactions.
– Internal (Opacity): the transaction itself
should operate on a consistent state.
External Consistency
Initially X = 0, Y = 0
Transaction A:
Read y
Write x = 4
Return x+y
Transaction B:
Read x
Write y = 4
Return x+y
Cannot both return 4
Canonical synchronization problem all
STM/HTM implementations must solve
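The claim that both transactions cannot return 4 can be checked by enumerating the serial orders. The following is a minimal sketch, assuming the initial values X = 0 and Y = 0; the function names are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def txn_a(mem):
    y = mem["y"]         # Read y
    mem["x"] = 4         # Write x = 4
    return mem["x"] + y  # Return x+y


def txn_b(mem):
    x = mem["x"]         # Read x
    mem["y"] = 4         # Write y = 4
    return x + mem["y"]  # Return x+y


def serial_outcomes():
    """Return the set of (A result, B result) pairs over both serial orders."""
    outcomes = set()
    for order in permutations([("A", txn_a), ("B", txn_b)]):
        mem = {"x": 0, "y": 0}
        results = {name: fn(mem) for name, fn in order}
        outcomes.add((results["A"], results["B"]))
    return outcomes


def racy_outcome():
    """Both transactions read before either writes: no serial order matches."""
    mem = {"x": 0, "y": 0}
    y_seen_by_a = mem["y"]
    x_seen_by_b = mem["x"]
    mem["x"] = 4
    mem["y"] = 4
    return (mem["x"] + y_seen_by_a, x_seen_by_b + mem["y"])


print(serial_outcomes())  # the only serializable results
print(racy_outcome())     # the result a TM must rule out
```

In either serial order one transaction sees the other's write, so one of them returns 8; the pair (4, 4) arises only from the unsynchronized interleaving.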
Locking STMs
[Figure: application memory mapped onto an array of VersionedWrite-Locks, each holding a version number V#]
Commit Time Locking (Write Buff)
[Figure: memory words X and Y with their versioned write-locks; the lock bit is set during commit and the version number goes from V# to V#+1 on release]
1. To Read/Write: check unlocked, add to Read/Write set
2. Acquire locks
3. Validate read/write v#’s unchanged
4. Write values
5. Release each lock with v#+1
Timeline: Read/Write → Lock → Validate → Write → Unlock
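The five steps above can be sketched as follows. This is a single-threaded simulation in which interleavings are driven by hand, not a thread-safe implementation: a real STM would acquire each lock with an atomic compare-and-swap. All class and method names (VersionedLock, Memory, Transaction, try_commit) are hypothetical.

```python
class Conflict(Exception):
    """Raised when a transaction must abort."""


class VersionedLock:
    def __init__(self):
        self.version = 0
        self.locked = False


class Memory:
    def __init__(self, **values):
        self.values = dict(values)
        self.locks = {addr: VersionedLock() for addr in values}


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.versions = {}   # addr -> v# seen at first access (read or write)
        self.write_set = {}  # addr -> buffered value

    def _track(self, addr):
        lock = self.mem.locks[addr]
        if lock.locked:                       # step 1: check unlocked
            raise Conflict(addr)
        self.versions.setdefault(addr, lock.version)

    def read(self, addr):
        self._track(addr)
        if addr in self.write_set:            # read our own buffered write
            return self.write_set[addr]
        return self.mem.values[addr]

    def write(self, addr, value):
        self._track(addr)
        self.write_set[addr] = value          # buffered until commit

    def commit(self):
        held = []
        try:
            for addr in sorted(self.write_set):   # step 2: acquire locks
                lock = self.mem.locks[addr]
                if lock.locked:
                    raise Conflict(addr)
                lock.locked = True
                held.append(lock)
            for addr, seen in self.versions.items():  # step 3: validate v#'s
                lock = self.mem.locks[addr]
                foreign = lock.locked and addr not in self.write_set
                if foreign or lock.version != seen:
                    raise Conflict(addr)
            self.mem.values.update(self.write_set)    # step 4: write values
            for lock in held:                         # step 5: v#+1 ...
                lock.version += 1
        finally:
            for lock in held:                         # ... and release
                lock.locked = False


def try_commit(txn):
    try:
        txn.commit()
        return True
    except Conflict:
        return False
```

Run against the External Consistency example, the second committer fails validation because the version of the word it read has advanced, so the two transactions cannot both return 4.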
Internal Inconsistency (Opacity)
[GuerraouiKapalka07]
Initially X = 8, Y = 4
Transaction A:
Write x = 4
Transaction B:
Read x
Read y
Compute z = 1/(x-y)
DIV by 0 ERROR!
Transaction A:
Write y = 2
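The interleaving on this slide can be replayed directly. This is a minimal sketch, assuming the initial values X = 8 and Y = 4 as read off the slide figure; the function name is hypothetical.

```python
def run_interleaving():
    mem = {"x": 8, "y": 4}   # assumed initial state
    mem["x"] = 4             # Transaction A: Write x = 4
    x = mem["x"]             # Transaction B: Read x (sees A's new value)
    y = mem["y"]             # Transaction B: Read y (sees the old value)
    try:
        outcome = 1 / (x - y)    # Transaction B: Compute z = 1/(x-y)
    except ZeroDivisionError:
        outcome = "DIV by 0"
    mem["y"] = 2             # Transaction A: Write y = 2
    return outcome, mem


print(run_interleaving())
```

Transaction B would eventually fail commit-time validation, but the error is raised before it gets that far. This is why opacity requires that even a doomed transaction sees only consistent states.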
TL2/TinySTM’s Global Clock
[DiceShalevShavit06/RiegelFelberFetzer06]
• Have a shared global version clock
• Incremented by writing transactions (as
infrequently as possible)
• Read by all transactions
• Used to validate that the state viewed by a
transaction is always consistent (opaque)
TL2 Style STM
[Figure: memory words X and Y with their versioned locks and the shared global version clock (VClock)]
1. Read VClock
2. Read/Write: if unlocked and v# is less than the clock, add to Read/Write set
3. Acquire locks
4. Increment clock
5. Validate each v# is less than the clock
6. Write values
7. Release locks with v# = new clock
Timeline: Read/Write → Lock → Inc → Validate → Write → Unlock
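The seven steps can be sketched as a single-threaded simulation. This is not a thread-safe implementation, and all names are hypothetical. One assumption is worth stating loudly: the sketch interprets "v# less than clock" as "v# is not newer than the clock value read in step 1", which is the usual TL2 rule.

```python
class Abort(Exception):
    """Raised when a transaction must abort."""


class TL2Memory:
    def __init__(self, **values):
        self.clock = 0                            # shared global version clock
        self.values = dict(values)
        self.version = {addr: 0 for addr in values}
        self.locked = {addr: False for addr in values}


class TL2Txn:
    def __init__(self, mem):
        self.mem = mem
        self.rv = mem.clock                       # step 1: read VClock
        self.reads = set()
        self.writes = {}

    def _check(self, addr):
        # step 2: usable only if unlocked and not newer than the clock we read
        if self.mem.locked[addr] or self.mem.version[addr] > self.rv:
            raise Abort(addr)

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        self._check(addr)
        self.reads.add(addr)
        return self.mem.values[addr]

    def write(self, addr, value):
        self._check(addr)
        self.writes[addr] = value

    def commit(self):
        mem, held = self.mem, []
        try:
            for addr in sorted(self.writes):      # step 3: acquire locks
                if mem.locked[addr]:
                    raise Abort(addr)
                mem.locked[addr] = True
                held.append(addr)
            mem.clock += 1                        # step 4: increment clock
            new_version = mem.clock
            for addr in self.reads:               # step 5: validate read set
                foreign = mem.locked[addr] and addr not in self.writes
                if foreign or mem.version[addr] > self.rv:
                    raise Abort(addr)
            for addr, value in self.writes.items():
                mem.values[addr] = value          # step 6: write values
                mem.version[addr] = new_version   # step 7: v# = new clock
        finally:
            for addr in held:                     # step 7: release locks
                mem.locked[addr] = False
```

In the opacity example, a reader that started before the writer committed aborts on its second read, because that word's version is newer than the clock value the reader sampled. It never reaches the division.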
TL2 Style STM
• Advantages
– Great Disjoint Access Parallelism
• Disadvantages
– Accessing Meta-Data is Expensive
– Progress guarantee is only deadlock
freedom
NOrec STM
[DalessandroSpearScott10]
• Use shared global clock as a seqlock
• Validation in every read if a seqlock
change is detected
• Value-based validation: no need for
meta-data (local time stamps or locks)
NOrec STM
[Figure: R/W set recording values (=X, =Y, Z) against memory words X, Y, Z; seqlock values 100 through 104]
Timeline: Not odd? seqlock → Read/Write (with validation if seqlock changed) → Lock seqlock (set odd), with validation if seqlock changed → Write → Unlock seqlock (set even)
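A minimal single-threaded sketch of the NOrec scheme follows. It simplifies in two labeled ways: where a real implementation would spin while the seqlock is odd, this sketch aborts, and where a real implementation would take the seqlock with a compare-and-swap, this sketch increments it directly. All names are hypothetical.

```python
class Abort(Exception):
    """Raised when a transaction must abort."""


class NOrecMemory:
    def __init__(self, **values):
        self.seqlock = 0            # even = free, odd = a writer is committing
        self.values = dict(values)


class NOrecTxn:
    def __init__(self, mem):
        self.mem = mem
        self.snapshot = self._sample()
        self.reads = {}             # addr -> value seen (no per-word meta-data)
        self.writes = {}

    def _sample(self):
        if self.mem.seqlock % 2:    # "Not odd?" check
            raise Abort("writer in progress")
        return self.mem.seqlock

    def _validate(self):
        # Value-based validation: every word read must still hold that value.
        snapshot = self._sample()
        for addr, seen in self.reads.items():
            if self.mem.values[addr] != seen:
                raise Abort(addr)
        self.snapshot = snapshot

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        value = self.mem.values[addr]
        if self.mem.seqlock != self.snapshot:   # validate only on change
            self._validate()
            value = self.mem.values[addr]
        self.reads[addr] = value
        return value

    def write(self, addr, value):
        self.writes[addr] = value               # buffered until commit

    def commit(self):
        if not self.writes:
            return                              # read-only: nothing to publish
        if self.mem.seqlock != self.snapshot:
            self._validate()
        self.mem.seqlock += 1                   # lock seqlock (set odd)
        self.mem.values.update(self.writes)     # write
        self.mem.seqlock += 1                   # unlock seqlock (set even)
```

A writer to an unrelated word forces the reader to revalidate but not to abort, since the values it read are unchanged. A writer to a word the reader has already read causes the abort.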
NOrec STM
• Advantages
– No Expensive Meta-Data
• Disadvantages
– Poor Disjoint Access Parallelism (all writes
are serialized by clock)
– Progress guarantee is only starvation
freedom
Hardware TM
[HerlihyMoss93,IBM/Intel13]
• Advantages
– Everything in Hardware, No Meta Data
– Great Disjoint Access Parallelism
• Disadvantages
– No Progress Guarantee; Fail because of:
• Unsupported instructions: system or protected
instructions
• Exceptions: page faults and similar
• Capacity limit: too many accessed locations
Hybrid TM
[Moir, Damron et al., Kumar et al.]
• Fast-Path: Execute Trans Using Best Effort
HTM
– If it Aborts because of Special Instructions
or Transaction Too Large, then…
• Slow-Path: Execute Trans Using STM
Performance of HTM with progress
guarantee of STM
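The fast-path/slow-path control flow can be sketched as below. Python cannot issue hardware transactions, so the fast path here is any callable that raises a hypothetical HTMAbort exception. The retry limit and the "retryable" flag are assumptions made for illustration, not part of any real HTM interface.

```python
class HTMAbort(Exception):
    """Hypothetical stand-in for a best-effort hardware transaction abort."""

    def __init__(self, reason, retryable):
        super().__init__(reason)
        self.reason = reason
        self.retryable = retryable


def run_hybrid(fast_path, slow_path, max_hw_attempts=3):
    """Try the hardware fast path a few times, then fall back to software."""
    for _ in range(max_hw_attempts):
        try:
            return "hardware", fast_path()
        except HTMAbort as abort:
            if not abort.retryable:   # e.g. special instruction, too large
                break
    return "software", slow_path()


def make_flaky_fast_path(aborts):
    """Build a fast path that raises the given aborts, then succeeds."""
    pending = list(aborts)
    attempts = []

    def fast_path():
        attempts.append(1)
        if pending:
            raise pending.pop(0)
        return 42

    return fast_path, attempts
```

A transient conflict is retried in hardware, while a capacity abort goes straight to the software path, which is what supplies the progress guarantee.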
Traditional Hybrid TM
[DamronFedorovaLevLuchangcoMoirNussbaum06]
Hardware Transaction:
Test VersionedWrite-Lock in every Read/Write. Update in Write.
Software Transaction:
Update locks
Traditional Hybrid TM
• Advantages
– Progress Guarantee of STM
• Disadvantages
– HTM must access meta data
– Fast path is actually slow because of extra
load and branch on every read
Phased TM
[LevMoirNussbaum07]
• Two modes: all hardware or all software
• Shared global mode indicator
• If some hardware transaction aborts
switch to software mode
• Eventually mode reverts back to
hardware
Phased TM
• Advantages
– Fast-path Pure HTM: No Meta Data
Accesses
• Disadvantages
– Single Software Transaction Causes all
HTM to switch to STM slow path
– Not clear how to tune to avoid frequent
mode transitions…
Hybrid Norec (1st Attempt)
Software Norec:
Not odd? seqlock → Read/Write (with seqlock validation) → Lock seqlock (set odd) → Validate → Write → Unlock seqlock (set even)
Hardware:
Read/Write (no validation) → Not odd? seqlock → Write seqlock +2
• Software will fail seqlock validation! (after a hardware commit writes seqlock +2)
• Hardware will fail seqlock validation! (when the seqlock is odd)
• Guaranteed External Consistency
• Problem: hardware opacity
Internal Inconsistency (Opacity)
[GuerraouiKapalka07]
Initially X = 8, Y = 4
Software A:
Lock seqlock +1
Write x = 4
Write y = 2
Unlock seqlock+1
Hardware B:
Read x
Read y
Compute z = 1/(x-y)
…
Odd? Seqlock
DIV by 0 ERROR!
Hybrid Norec (2nd Attempt)
Guarantee hardware opacity
Software Norec:
Not odd? seqlock → Read/Write (with seqlock validation) → Lock seqlock (set odd) → Validate → Write → Unlock seqlock (set even)
Hardware:
Not odd? seqlock → Read/Write (no validation) → Write seqlock +2
Hardware will detect seqlock invalidation!
Hybrid NOrec
• Advantages
– Fast-path HTM: No Meta Data Accesses
• Disadvantages
– Limited Disjoint Access Parallelism
– Seqlock is in hardware tracking set
throughout HTM transaction
– Major sequential bottleneck
Possible Solutions
• Forget Opacity, Use sandboxing
[DalessandroCarougeWhiteLevMoirScottSpear2011]
• Hybrid Norec 2
[RiegelMarlierNowackFelberFetzer11]:
use non-transactional operations in a
hardware transaction to read and
validate seqlock has not changed after
every read
But sandboxing is complex…and non-transactional ops only available in AMD
proposal, not actual IBM or Intel …
Reduced Hardware Approach to
HyTM
[MatveevShavit13]
• Use short hardware transactions in the
software slow-path
• I.e. create new “mixed”
software/hardware path
• Not in order to make slow-path faster
– But rather, in order to remove meta-data
accesses from fast path
• Default to all software if mixed path fails
Transactional Writes Imply
Hardware Opacity
Initially X = 8, Y = 4
Trans A:
Write x = 4
Write y = 2
Hardware B:
Read x
Read y
Compute z = 1/(x-y)
DIV by 0 ERROR!
If A's writes are in a hardware transaction, this cannot happen…
Reduced Hardware NOrec
[MatveevShavit13]
• In Slow-path commit, use a small
hardware transaction to:
– Write all values
– Check seqlock has not changed
– Write seqlock+1
• In Fast-path:
– Move seqlock test to end, un-instrumented
read/writes
Reduced Hardware NOrec
Software Norec:
Changed? seqlock → Read/Write (with validation) → In HTM Trans: Changed? seqlock, Write values, Write seqlock +1
Hardware:
Read/Write (no instrumentation) → Read seqlock → Write seqlock +1
Hardware will detect write conflict without seqlock!
Guarantee fast-path opacity without having seqlock in tracking set for long
Reduced Hardware NOrec
• Properties
– Fast-path: No Meta Data; No
instrumentation of reads or writes
– Slow-path:
  – short hardware transaction: size of write set
  – can repeatedly attempt short hardware transaction in commit
Reduced Hardware NOrec
• Advantages
– Hardware Disjoint Access Parallelism
– seqlock accessed only at end of HTM
transaction
– Surprise: 1st HyTM that is Obstruction-free
and Privatizing
• Disadvantages
– Still window of possible abort due to
seqlock increment
Reduced Hardware NOrec
[Figure: 100K Nodes Constant RB-Tree, 20% mutations; series: RH-NOREC, HTM, Standard HyTM, NOREC, RH1 Fast, RH1 Mix 10, RH1 Mix 100; x-axis 1 to 20, y-axis 0.00E+00 to 3.50E+06]
Reduced Hardware NOrec
[Figure: 100K Nodes Constant RB-Tree, 80% mutations; series: RH-NOREC, HTM, Standard HyTM, NOREC, RH1 Fast, RH1 Mix 10, RH1 Mix 100; x-axis 1 to 20, y-axis 0.00E+00 to 2.50E+06]
Reduced Hardware TL2 Style
Hardware Will See Software
Software TL2 style:
Read Clock → Read/Write (validate) → Validate → In HTM Trans: Write values
Hardware:
Read/Write (no validation) → Read Clock → Write values With Clock +1
Hardware will detect write conflict
Problem: if between validate and hardware write, can have inconsistency
Reduced Hardware TL2 Style
Solution: combine validation and writes in single transaction
Software TL2 style:
Read Clock → Read/Write (validate) → In HTM Trans: Validate and Write values
Hardware:
Read/Write (no validation) → Read Clock → Write values With Clock +1
Hardware will detect write conflict
Reduced Hardware TL2 Style
• Advantages
– Complete Disjoint Access Parallelism
– GV6 clock incremented on aborts only
– Obstruction-free
• Disadvantages
– No privatization
– Mixed-path transaction size: size of meta-data
set
RH1: Reduced Hardware TL2 Style
[Figures: Total Operations vs. number of threads (1 to 20). Panels: 100K Nodes Constant RB-Tree, 20% mutations; 100K Nodes Constant RB-Tree, 80% mutations; 10K Elements Constant Hash Table; 1K Nodes Constant Sorted List. Series: HTM, Standard HyTM, TL2, NOREC, RH1 Fast, RH1 Mix 10, RH1 Mix 100, RH NOREC Fast, RH NOREC Slow, RH NOREC Mix 10, RH NOREC Mix 100]
HyTM: Long Journey
• Combination of ideas:
– hardware transactions,
– global clocks,
– no meta data access,
– mixed hardware software paths
• And there is still room for improvement