Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood
Presented by: Eduardo Cuervo

Previous TM systems abort fast, commit slow
◦ Old values “in place”
◦ New values somewhere else
Commit is the common case!
◦ Remember Amdahl’s Law
Conflicts usually solved by hardware
◦ Fast but myopic
◦ Trapping to SW if needed for careful resolution

                        Version management
Conflict detection      Lazy               Eager
Lazy                    OCC DBMSs, TCC     none
Eager                   LTM, VTM           CCC DBMSs, UTM, LogTM

Eager version management
◦ Puts new values in place for faster commits
◦ No data moves even on cache overflow
Eager conflict detection
◦ Detects offending ld/st immediately
◦ Fast conflict detection on evicted blocks
◦ Fast commit by lazy reset of directory state
Handle aborts by SW
◦ Aborts are much less common than commits

Per-thread log in cacheable virtual memory
◦ On a store, logs the address and previous contents of the block
Write bit
◦ Tracks whether a block has been stored to and logged
Faster commits
◦ Clear W bits and reset log (pointer)
Slower aborts
◦ Also has to write old values back

[Figure sequence: log example in six frames. Each frame shows three cache blocks (virtual addresses 00, 40, c0) with their data and R/W bits, the log at addresses 1000, 1040, 1080, and the LogBase and LogPtr registers. LogBase stays at 1000; LogPtr advances from 1000 to 1048 and then 1090 as stores are logged, and is reset to 1000 in the last two frames.]

Coherence requests sent to directory
Directory forwards them to other processor(s)
Processors detect conflicts
◦ Using local state
◦ Ack/Nack as response
◦ Requester resolves any conflict

Adds read bit to each cache block
Extends MOESI protocol
◦ “Sticky” states
Works even after cache overflow
◦ Forwards conflicting requests to “interested” processors
Adds a per-processor overflow bit
◦ The transactional block can be updated
◦ Requests will still be redirected to the processor
◦ Processor can Nack on conflict

Depends on MOESI state
M: Replace with transactional writeback
◦ Sets state as “Sticky@Processor”
◦ Requests are forwarded to the processor
S: Silently replaced
◦ Adds processor to sharer list
◦ Requests forwarded to all sharers
O: Write back to directory
◦ Adds itself to sharer list; same as S if requested exclusively
E: Same as O

[Figure sequence: directory and cache state during a conflict and after an overflow. Each frame shows the directory state (Idle, M@P, E@Q), the cache state and R/W bits of processors P and Q, and each processor's TMcount and Overflow registers. Conflicting requests from Q are answered with a NACK, both before and after P's Overflow bit is set.]

Lazy cleanup is better if overflow is rare
◦ Can be improved otherwise (e.g. using Bloom filters)
Ambiguities handled conservatively
◦ Refetch during the same vs. an earlier transaction
◦ Set R&W bits
◦ Log old values

When two transactions conflict
◦ At least one must stall or abort
◦ Quick myopic decision by HW
◦ Slow and careful by SW
Hybrid approach:
◦ HW seeks fast solution, traps to software if problem persists

Distributed timestamp
Trap to conflict handler (SW) when the transaction
◦ Could cause a deadlock
◦ Is logically later than the transaction it conflicts with
Per-processor possible cycle flag
◦ Conflict if a Nack is received from a logically earlier transaction while the possible cycle flag is set

Target system
◦ SPARC, Solaris
◦ 32 processors, 1 GHz
◦ L1: 16 KB 4-way split, 1-cycle latency
◦ L2: 4 MB 4-way unified, 12-cycle latency
◦ Memory: 4 GB, 80-cycle latency
◦ Directory: full-bit-vector sharer list, migratory sharing optimization, directory cache, 6-cycle latency
◦ Interconnection: hierarchical switch topology, 14-cycle link latency
Simulated using Simics
◦ LogTM interface added by “magic” instructions

Shared counter micro-benchmark
Compared to
◦ Exponential backoff
◦ MCS locks
LogTM outperforms them
LogTM does not abort transactions

Evaluated using a subset of SPLASH-2
Used two versions of raytrace (with/without false sharing)
False sharing has significant impact!
Performance gains range from moderate to large

LogTM must read a block before writing it to the log
◦ Benchmarks showed that data is usually read anyway
LogTM is more sensitive to false sharing than lock approaches
The log needs to be valid only when an abort occurs
◦ A k-block log write buffer removes most log writes, as shown in the benchmarks
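
The eager version management described earlier (per-thread log, W bits, fast commit, slower abort) can be sketched in a few lines. This is a toy software model, not the hardware design: the class name `LogTMThread`, the dictionary used as memory, and the block values are all illustrative assumptions. Walking the log newest-entry-first on abort is also an assumption of this sketch.

```python
class LogTMThread:
    """Toy model: new values go in place, old values go to a per-thread log."""

    def __init__(self, memory):
        self.memory = memory   # block address -> block contents, updated in place
        self.log = []          # undo log of (address, previous contents)
        self.w_bits = set()    # blocks already stored to and logged

    def store(self, addr, value):
        if addr not in self.w_bits:
            # First store to this block: read the old contents and log them.
            self.log.append((addr, self.memory[addr]))
            self.w_bits.add(addr)
        self.memory[addr] = value  # the new value is written in place

    def commit(self):
        # Fast path: no data moves, only clear W bits and reset the log.
        self.w_bits.clear()
        self.log.clear()

    def abort(self):
        # Slow path: write old values back, newest log entry first.
        while self.log:
            addr, old = self.log.pop()
            self.memory[addr] = old
        self.w_bits.clear()


def demo():
    memory = {0x00: (1, 2), 0x40: (2, 3), 0xC0: (3, 4)}
    tx = LogTMThread(memory)
    tx.store(0xC0, (5, 6))
    tx.store(0x40, (2, 4))
    tx.store(0x40, (2, 5))  # W bit already set, so nothing new is logged
    logged = len(tx.log)
    tx.abort()              # restores the original contents
    return logged, dict(memory)
```

Calling `demo()` stores to two blocks, logs each of them once, and then aborts. Replacing `tx.abort()` with `tx.commit()` would leave the new values in memory and only discard the log, which is why commit is the cheap operation in this scheme.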

TCC
◦ Lazy version management (slow commits)
◦ Lazy conflict detection (detect on commit)
LTM
◦ On overflow, stores new values in an uncacheable in-memory hash table
◦ LogTM allows both old and new versions to be cached
UTM
◦ Logs blocks targeted by both loads and stores
◦ More complete conflict detection
◦ Must walk log on certain coherence requests
VTM
◦ Per-address-space virtual mode for cache evictions, paging, context switches
◦ Virtualized VTM uses microcode for conflict detection (LogTM uses a MOESI extension)

Presents a TM implementation designed to speed up the common case
Efficiently handles cache evictions
Requires simple architectural changes
◦ Registers, state, directory extension
Work towards hybrid conflict detection
No paging or context switch support
Very sensitive to false sharing
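
The hybrid conflict resolution policy (distributed timestamps plus a per-processor possible cycle flag) can be sketched as follows. This is a minimal model under stated assumptions: the `Processor` class, its method names, and the string results are hypothetical, and the rule that the flag is set when a processor Nacks a logically earlier transaction is an assumption not spelled out in the slides.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    timestamp: int                # smaller timestamp = logically earlier transaction
    possible_cycle: bool = False  # per-processor possible cycle flag

    def on_conflicting_request(self, requester_ts):
        """This processor holds the block transactionally, so it Nacks."""
        if requester_ts < self.timestamp:
            # Assumption: Nacking a logically earlier transaction may close a cycle.
            self.possible_cycle = True
        return "NACK"

    def on_nack(self, nacker_ts):
        """Decide what to do after this processor's own request was Nacked."""
        if nacker_ts < self.timestamp and self.possible_cycle:
            return "trap"   # hand over to the software conflict handler
        return "retry"      # hardware stalls and retries the request
```

With an earlier transaction on P and a later one on Q, each waiting for a block the other holds, P simply retries while Q traps to software once it has both Nacked P and been Nacked by P. Only the logically later transaction is sent to the slow path.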