Kevin E. Moore, Jayaram Bobba, Michelle J.
Moravan, Mark D. Hill & David A. Wood
Presented by: Eduardo Cuervo

Previous TM systems abort fast, commit slow
◦ Old values “in place”
◦ New values somewhere else

Commit is the common case!
◦ Remember Amdahl’s Law

Conflicts usually solved by hardware
◦ Fast but myopic
◦ Trapping to SW if needed for careful resolution
Version management vs. conflict detection

                            Lazy conflict detection   Eager conflict detection
Lazy version management     OCC DBMSs, TCC            LTM, VTM
Eager version management    none                      CCC DBMSs, UTM, LogTM

Eager version management
◦ Puts new values in place for faster commits
◦ No data moves even on cache overflow

Eager conflict detection
◦ Detects offending ld/st immediately
◦ Fast conflict detection on evicted blocks
◦ Fast commit by lazy reset of directory state

Handle aborts by SW
◦ Aborts are much less common than commits

Per-thread log in cacheable virtual memory
◦ On a store, logs the address and previous contents of the block

Write bit
◦ Tracks if a block has been stored and logged

Faster commits
◦ Clear W bits and reset log (pointer)

Slower aborts
◦ Also has to write old values back
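
The log and W bit described above can be sketched in software as follows. This is only a toy model under stated assumptions: the class name `LogTMThread`, the dictionary standing in for memory, and the set standing in for the per-block W bits are all hypothetical and are not taken from the LogTM hardware design.

```python
class LogTMThread:
    """Toy model of eager version management with a per-thread undo log."""

    def __init__(self, memory):
        self.memory = memory   # dict: block address -> value (new values live here)
        self.log = []          # undo log of (address, old value) pairs
        self.w_bits = set()    # blocks already stored to and logged

    def store(self, addr, value):
        if addr not in self.w_bits:
            # First store to this block: log the old contents before overwriting.
            self.log.append((addr, self.memory[addr]))
            self.w_bits.add(addr)
        self.memory[addr] = value  # the new value goes in place

    def commit(self):
        # Fast path: no data moves, just reset the log and clear the W bits.
        self.log.clear()
        self.w_bits.clear()

    def abort(self):
        # Slow path: walk the log backwards and write the old values back.
        while self.log:
            addr, old = self.log.pop()
            self.memory[addr] = old
        self.w_bits.clear()


# Usage: two stores followed by an abort restore the original contents.
mem = {0x00: 12, 0x40: 23, 0xC0: 34}
tx = LogTMThread(mem)
tx.store(0xC0, 56)
tx.store(0x40, 24)
tx.abort()
```

The asymmetry is visible in the method bodies: `commit` does constant bookkeeping, while `abort` does work proportional to the number of logged blocks.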
Example: log and R/W bits during one transaction (LogBase = 1000)

Step                Block 00 (R W)  Block 40 (R W)  Block c0 (R W)  Log contents            LogPtr  TMcount
1. Begin            12... (0 0)     ...23 (0 0)     34... (0 0)     empty                   1000    1
2. Load 00          12... (1 0)     ...23 (0 0)     34... (0 0)     empty                   1000    1
3. Store c0         12... (1 0)     ...23 (0 0)     56... (0 1)     c0: 34...               1048    1
4. Load, store 40   12... (1 0)     ...24 (1 1)     56... (0 1)     c0: 34...; 40: ...23    1090    1
5a. Commit          12... (0 0)     ...24 (0 0)     56... (0 0)     stale entries           1000    0
5b. Abort           12... (0 0)     ...23 (0 0)     34... (0 0)     stale entries           1000    0

Steps 5a and 5b are alternative outcomes after step 4.



Coherence requests sent to directory
Directory will forward to other processor(s)
Processors will detect conflict
◦ Using local state
◦ Ack/Nack as response
◦ Requester resolves any conflict
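
The request flow above can be illustrated with a small sketch. It assumes the usual read/write conflict rule (any request conflicts with a transactional write, and an exclusive request also conflicts with a transactional read); the names `Processor`, `respond`, and `directory_request` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    read_set: set = field(default_factory=set)    # blocks whose R bit is set
    write_set: set = field(default_factory=set)   # blocks whose W bit is set

    def respond(self, addr, exclusive):
        """Check local state and answer a forwarded request with ACK or NACK."""
        if addr in self.write_set:
            return "NACK"   # any access conflicts with a transactional write
        if exclusive and addr in self.read_set:
            return "NACK"   # a write request conflicts with a transactional read
        return "ACK"


def directory_request(targets, addr, exclusive):
    """Forward a request to the given processors and combine their responses."""
    responses = [p.respond(addr, exclusive) for p in targets]
    # The requester sees a NACK if any processor reports a conflict.
    return "NACK" if "NACK" in responses else "ACK"
```

In this sketch the directory only forwards and combines; what to do after a NACK (stall, retry, or abort) is left to the requester.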


Adds read bit to each cache block
Extends MOESI protocol
◦ “Sticky” states

Works even after cache overflow
◦ Forwards conflicting requests to “interested”
processors

Adds a per processor overflow bit
◦ The transactional block can be updated
◦ Requests will still be redirected to the processor
◦ Processor can Nack on conflict


Depends on MOESI state
M: Replace with transactional writeback
◦ Sets state as “Sticky@Processor”
◦ Requests are forwarded to the processor

S: Silently replaced
◦ Adds processor to sharer list
◦ Requests forwarded to all sharers

O: Write back to directory
◦ Add itself to sharer list, same as S if requested
exclusively

E: Same as O
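
A rough sketch of how sticky directory state and the overflow bit could interact is given below. It is heavily simplified and rests on assumptions: the directory tracks only which processors must still see requests for a block (no full MOESI states), a processor with its overflow bit set conservatively NACKs requests for blocks it no longer caches, and a `CLEAN` response lets the directory drop stale sticky state. All names are hypothetical.

```python
class Proc:
    def __init__(self, name):
        self.name = name
        self.cache = {}         # addr -> {"R": bool, "W": bool}
        self.in_tx = False
        self.overflow = False   # set when a transactional block is evicted

    def evict(self, addr, directory):
        bits = self.cache.pop(addr)
        if self.in_tx and (bits["R"] or bits["W"]):
            self.overflow = True
            # Sticky state: the directory keeps forwarding requests to us.
            directory.sticky.setdefault(addr, set()).add(self)

    def respond(self, addr, exclusive):
        bits = self.cache.get(addr)
        if bits is not None:
            conflict = bits["W"] or (exclusive and bits["R"])
            return "NACK" if conflict else "ACK"
        if self.in_tx and self.overflow:
            return "NACK"       # block may have overflowed: be conservative
        return "CLEAN"          # sticky state is stale

    def commit(self):
        for bits in self.cache.values():
            bits["R"] = bits["W"] = False
        self.in_tx = False
        self.overflow = False


class Directory:
    def __init__(self):
        self.sticky = {}        # addr -> processors that must still see requests

    def request(self, addr, exclusive, requester):
        targets = [p for p in self.sticky.get(addr, set()) if p is not requester]
        for p in targets:
            answer = p.respond(addr, exclusive)
            if answer == "NACK":
                return "NACK"
            if answer == "CLEAN":
                self.sticky[addr].discard(p)   # clean up lazily, on demand
        return "ACK"
```

Note that `commit` never touches the directory: stale sticky entries are removed only when a later request happens to reach them.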
Example: directory and cache state for one block (P uses it in a transaction, Q requests it)

Step                               Directory    P: state (R W) [data], TMcount, Overflow   Q: state (R W) [data], TMcount, Overflow
1. P starts a transaction          Idle [old]   I (- -) [none], 1, 0                       -
2. P reads and writes the block    M@P [old]    M (R W) [new], 1, 0                        -
3. Q requests the block: NACK      M@P [old]    M (R W) [new], 1, 0                        I (- -) [ ], 1, 0
4. P evicts the block              M@P [new]    I (- -) [ ], 1, 1                          -
5. Q requests again: NACK          M@P [new]    I (- -) [ ], 1, 1                          I (- -) [ ], 1, 0
6. P has committed; Q gets block   E@Q [new]    I (- -) [ ], 0, 0                          E (R -) [new], 1, 0

Lazy cleanup is the better choice if overflow is rare
◦ Can be improved otherwise (e.g., by using Bloom filters)
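
One way to read the Bloom filter suggestion is to replace the single overflow bit with a small filter over evicted block addresses, so that a processor NACKs only requests for blocks that may actually have overflowed. The sketch below is an assumption about how that could look, not a description of LogTM; `OverflowFilter` and its parameters are hypothetical.

```python
import hashlib


class OverflowFilter:
    """Tiny Bloom filter over block addresses (illustrative sizes only)."""

    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = 0   # bit vector packed into a Python int

    def _positions(self, addr):
        # Derive several bit positions per address from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, addr):
        """Record that a transactional block at addr was evicted."""
        for pos in self._positions(addr):
            self.array |= 1 << pos

    def may_contain(self, addr):
        """False means definitely not overflowed; True means possibly."""
        return all((self.array >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        """Reset at commit or abort."""
        self.array = 0
```

A Bloom filter can report false positives but never false negatives, so a processor using it would still be conservative, only less often than with a single bit.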

Ambiguities handled conservatively
◦ A refetched block may have overflowed during the same transaction or an earlier one
◦ Set R&W bits
◦ Log old values

When two transactions conflict
◦ At least one must stall or abort
◦ Quick myopic decision by HW
◦ Slow and careful by SW

Hybrid approach:
◦ HW seeks fast solution, traps to software if problem
persists


Distributed timestamp
Trap to conflict handler (SW)
◦ Transaction could cause deadlock
◦ Logically later than transaction in conflict

Per-processor possible-cycle flag
◦ Conflict if a nack is received from a logically earlier
transaction while the possible-cycle flag is set
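
The timestamp and possible-cycle rule can be sketched as below. The sketch assumes that a transaction sets its flag when it nacks a logically earlier transaction, and traps to the software handler only when a logically earlier transaction nacks it while the flag is set. The names `Tx`, `send_nack_to`, `receive_nack_from`, and `nack` are hypothetical.

```python
class Tx:
    def __init__(self, timestamp):
        self.timestamp = timestamp      # smaller value = logically earlier
        self.possible_cycle = False

    def send_nack_to(self, requester):
        # Nacking an earlier transaction means we might be part of a cycle.
        if requester.timestamp < self.timestamp:
            self.possible_cycle = True

    def receive_nack_from(self, responder):
        # Trap only if an earlier transaction nacks us while our flag is set.
        if responder.timestamp < self.timestamp and self.possible_cycle:
            return "trap"
        return "stall"                  # otherwise stall and retry


def nack(responder, requester):
    """Model one nacked request and return what the requester does next."""
    responder.send_nack_to(requester)
    return requester.receive_nack_from(responder)


# Usage: the later transaction ends up trapping, the earlier one only stalls.
old, young = Tx(1), Tx(2)
first = nack(young, old)    # young nacks old; young sets its flag
second = nack(old, young)   # old nacks young while young's flag is set
```

Under these assumptions only the logically later transaction ever traps, so the earlier transaction in a potential deadlock keeps making progress.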

Target System
◦ SPARC Solaris, 32 processors, 1 GHz
◦ L1: 16 KB 4-way split, 1-cycle latency
◦ L2: 4 MB 4-way unified, 12-cycle latency
◦ Memory: 4 GB, 80-cycle latency
◦ Directory: Full-bit vector sharer list, migratory sharing optimization, directory cache, 6-cycle latency
◦ Interconnection: Hierarchical switch topology, 14-cycle link latency

Simulated using Simics
◦ LogTM interface added by “magic” instructions


Shared counter micro-benchmark
Compared to
◦ Exponential Backoff
◦ MCS locks

LogTM outperforms them
LogTM does not abort transactions




Evaluated using a subset of SPLASH-2
Used two versions of raytrace (with/without false sharing)
False sharing has significant impact!
Performance gains from moderate to large

LogTM must read a block before writing it to
the log
◦ Benchmarks showed that data is usually read
anyway


LogTM is more sensitive to false sharing than
lock approaches
The log is required to be valid only when an abort occurs
◦ A k-block log write buffer eliminates most log writes, as
shown in the benchmarks.
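
The idea behind a k-block log write buffer can be sketched as follows: log entries are held in a small buffer and reach the in-memory log only when the buffer fills, so a transaction that writes at most k blocks and then commits never writes its log to memory. This is a minimal sketch under that reading; `BufferedLog` and its fields are hypothetical names.

```python
class BufferedLog:
    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.buffer = []         # up to k pending (address, old value) entries
        self.memory_log = []     # entries actually written to the in-memory log
        self.memory_writes = 0   # number of log writes that reached memory

    def append(self, addr, old):
        if len(self.buffer) == self.k:
            # Buffer full: spill the oldest entry to the in-memory log.
            self.memory_log.append(self.buffer.pop(0))
            self.memory_writes += 1
        self.buffer.append((addr, old))

    def commit(self):
        # The log is no longer needed: drop buffered entries unwritten.
        self.buffer.clear()
        self.memory_log.clear()

    def entries_for_abort(self):
        # An abort needs every entry, newest first.
        return list(reversed(self.memory_log + self.buffer))
```

For example, with `k = 2` a transaction that stores to two blocks and commits leaves `memory_writes` at zero; a third store would spill exactly one entry.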

TCC
◦ Lazy version management (slow commits)
◦ Lazy conflict detection (detect on commit)

LTM
◦ On overflow, stores new values in an uncacheable in-memory hash table
◦ LogTM allows both old and new versions cached

UTM
◦ Logs blocks targeted by both loads and stores
◦ More complete conflict detection
◦ Must walk log on certain coherence requests

VTM
◦ Per address space virtual mode for cache evictions,
paging, context switches
◦ Virtualized VTM uses micro-code for conflict
detection. (LogTM uses MOESI extension)



Presents a TM implementation designed to
speed up the common case
Efficiently handles cache evictions
Requires simple architectural changes
◦ Registers, state, directory extension



Work towards hybrid conflict detection
No paging or context switch support
Very sensitive to false sharing