Overview of POWER HTM
Maged Michael
IBM T. J. Watson Research Center
WTTM 2014, 15 July 2014

Outline
- POWER HTM features
- Use cases
- Performance results
Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto.
Any errors in describing POWER HTM features and performance in this presentation are my own.

POWER HTM Features

Basic Transactional Instructions
- TBEGIN: Begins an outermost transaction (or increments nesting level)
- TEND: Commits an outermost transaction (or decrements nesting level)
- TBEGIN sets a condition register to indicate success or failure
- TEND sets a condition register to indicate whether it was executed in a transaction or not (i.e., extraneous TEND)
- Transaction failure transfers control to the instruction following TBEGIN
Basic example:

    tbegin.                          # begin transaction
    beq  failure_handler             # branch to failure handler if failure code is set
    ...
    tend.
    bgt  was_not_in_a_transaction    # (optional) check if tend was extraneous

Features of Basic Transactions
- No hardware progress guarantee. Failure handlers must include an alternative non-HTM software path.
- Strong isolation. Hardware detection of conflicts with non-transactional accesses.
- Flat nesting. Transaction failure transfers control to the instruction following the outermost TBEGIN.
- Order guarantee for successful transactions among three groups of (cacheable write-back) memory accesses:
  – Before TBEGIN
  – Inside the transaction
  – After TEND
Example: Initially X == Y == 0. The outcome r1 == r2 == 0 is not allowed.

    Thread 1          Thread 2
    st X = 1          st Y = 1
    tbegin.           tbegin.
    ld r1 = Y         ld r2 = X
    tend.             tend.

Transaction Abort
- TABORT: Causes transaction failure
- Unconditional variants with and without 8-bit code
- Conditional variants with 32/64-bit register or immediate parameters
Example: Transactional lock elision entry

    tbegin.
    beq-  tle_failure_handler
    ld    r=LOCK                # load lock
    cmpi  r==FREE               # compare with free value
    beq+  $+8                   # if free, start critical section
    tabort.                     # if not free, abort TLE transaction
    <critical section>

The same entry using a conditional abort:

    tbegin.
    beq-  tle_failure_handler
    ld    r=LOCK                # load lock
    tabort[wd]ci. r!=FREE       # if not free, abort TLE transaction
    <critical section>

Transactional Registers and Failure Causes
- TFHAR: Address of failure handler, i.e., outermost TBEGIN + 4
- TFIAR: Address of failure instruction when applicable
- TEXASR: Transaction exception and status register. Includes cause of transaction failure.
- The TEXASR register contains a summary bit that provides a hint of whether the cause of failure is likely to be persistent or transient
- The TEXASR register also contains an 8-bit software code that may have been provided with a TABORT instruction
- Failure causes include conflicts, abort instructions, footprint overflow, I/O, access to non-write-back memory, nesting level overflow, and disallowed instructions (e.g., sleep, cache invalidation).

Suspending/Resuming Transactional State
- TSUSPEND: Suspends the current transaction, i.e., transitions from transactional state to suspended state
- TRESUME: Resumes the suspended transaction
- Loads and stores in suspended state are performed non-speculatively as they occur and do not use hardware transactional resources
- No new transactions can be initiated in suspended state
- Transaction failure is recorded, but failure handling is deferred until the transaction is resumed
- Loads in suspended state from locations written transactionally return the written values as long as the transaction has not failed
- Stores in suspended state to locations accessed transactionally cause transaction failure
- TCHECK: Checks for transaction failure and validity of prior memory operations.
  (May be used in transactional state too.)

Rollback Only Transactions (ROT)
- Intended for single thread speculation
- Not intended for shared data
- No conflict detection
- Keeps track only of transactional stores
- No order guarantees
- May be nested with atomic transactions

Use Cases

Transactional Lock Elision
Transactional lock elision - Entry

    pthread_mutex_lock(mutex) {
      if (do_tle(mutex)) {               // Check TLE state and collect stats if needed
        attempts = 0;                    // Count TLE attempts for current
      TRY_TLE:
        if (__TM_begin()) {              // Inside HW transaction
          if (!is_free(mutex))
            __TM_abort();                // If mutex is busy abort HW transaction
          return 0;                      // return SUCCESS
        }
        // HW transaction failed
        // Failure handler:
        //   Decide to retry TLE or fallback on conventional implementation
        //   based on number of failed attempts, cause of failure, and lock recursion
        //   May update TLE stats for the mutex
        if (decide_to_try_TLE_again(mutex, ++attempts, __TM_is_failure_persistent())) {
          wait_until_free(mutex);
          backoff(attempts);
          goto TRY_TLE;
        }
      }
      <Fallback on conventional non-TLE lock acquisition implementation>
    }

Transactional Lock Elision
Transactional lock elision - Exit

    pthread_mutex_unlock(mutex) {
      if (is_free(mutex))
        if (__TM_end())                  // End TLE transaction
          return 0;                      // return success
      <Follow conventional non-TLE path>
    }

Path Length Reduction
Example: java.util.concurrent ConcurrentLinkedQueue.offer(), critical path of the CAS-based implementation

No TM:

        l     t=[tail]
        isync
        l     s=[t.next]
        isync
        l     r=[tail]
        isync
        cmp   r,t
        bne   start_over
        cmpi  s,0
        bne   fix_tail
        hwsync
    L1: larx  r=[t.next]
        cmp   r,s
        bne   start_over
        stcx  [t.next]=n
        bne-  L1
        hwsync
    L2: larx  r=[tail]
        cmp   r,t
        bne   skip_stcx
        stcx  [tail]=n
        bne-  L2
        isync

Path Length Reduction
CLQ with TM

TM:

        tbegin
        beq-  failure_handler
        l     t=[tail]
        l     s=[t.next]
        cmpi  s,0
        beq+  L1                # skip next instruction
        mr    t=s               # not common case
    L1: st    [t.next]=n
        st    [tail]=n
        tend

- Fallback on conventional CAS-based implementation in case of TM failure
- Aggregation of memory barriers

Other Use Case Examples
- Hybrid HW/SW high-level transactions. E.g., HTM commit acceleration, spin-waiting in suspended state.
- Thread-level speculation with commit ordering using suspended-mode accesses
- Single thread speculation using Rollback-Only Transactions. Assume an optimization is safe and roll back if it was unsafe.

Performance

Single Thread
- An empty Pthreads TLE critical section is 6% faster than a conventional Pthreads critical section.
- 71% reduction in execution time (warm caches) of CLQ offer()/poll() pairs using TM path length reduction and memory barrier aggregation
- The execution time of an empty transaction with suspend/resume is 3.4x that of an empty transaction without suspend/resume

Pthreads TLE - Microbenchmarks
- Pattern 1: high contention, no conflicts, data set fits in TM capacity
- Pattern 2: high contention, data set that overflows TM capacity
- Pattern 3: Mixed pattern
  – 80% high contention, no conflict, fits in TM capacity
  – 20% medium contention, overflows TM capacity
[Figure: three plots, one per pattern, of speedup vs. number of threads (0 to 96), comparing TLE and Locking.]

Pthreads TLE - Memcached
- Memcached server with varying number of threads
- Client running on the same machine. 96 hardware threads. 12 cores.
  SMT 8.
- Best TLE throughput (on 16 threads) is 26.9% higher than best locking throughput (on 12 threads)
- On 16 threads, TLE throughput is 37.5% higher than locking throughput
[Figure: speedup vs. number of Memcached server threads (0 to 48), comparing TLE and Locking.]

Summary
- POWER HTM Instruction Set
- Suspend / Resume
- Rollback Only Transactions
- Low HTM overheads
- Caution not to learn wrong lessons from specific implementations of specific HTM architectures. E.g., POWER HTM and BG/Q HTM

Thank You
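
Supplementary sketch: TLE retry policy

The transactional lock elision entry path leaves the retry decision and the backoff to helper routines (decide_to_try_TLE_again and backoff) without giving their bodies. The following is a minimal sketch of what such a policy might look like, not the policy used in the measured implementation. The names should_retry_tle, backoff_spins, and tle_policy_example, as well as both constants, are hypothetical; the sketch also ignores lock recursion and per-mutex statistics, which the slides list as inputs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tuning constants; the slides give no concrete values. */
enum { TLE_MAX_ATTEMPTS = 5, TLE_MAX_BACKOFF_SPINS = 1024 };

/* Hypothetical policy: retry only while the hardware hint says the
 * failure looks transient and the attempt budget is not used up.
 * `attempts` is the number of failed TLE attempts so far. */
static bool should_retry_tle(int attempts, bool failure_persistent)
{
    if (failure_persistent)
        return false;               /* hint says a retry is unlikely to help */
    return attempts < TLE_MAX_ATTEMPTS;
}

/* Hypothetical capped exponential backoff: 8 spins doubled once per
 * failed attempt, never exceeding TLE_MAX_BACKOFF_SPINS. */
static int backoff_spins(int attempts)
{
    int spins = 8;
    for (int i = 0; i < attempts && spins < TLE_MAX_BACKOFF_SPINS; i++)
        spins *= 2;
    return spins;
}

/* Small illustration of how the two helpers behave together. */
static void tle_policy_example(void)
{
    assert(should_retry_tle(1, false));    /* first transient failure: retry */
    assert(!should_retry_tle(1, true));    /* persistent failure: fall back */
    assert(backoff_spins(1) == 16);        /* one doubling of the base 8 */
}
```

In the entry path, the persistence flag would come from the TEXASR summary bit, and falling back as soon as it is set avoids spending attempts on a transaction that is unlikely to ever commit.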