An Integrated Hardware-Software Approach to Transactional Memory Sean Lie 6.895 Theory of Parallel Systems Monday December 8th, 2003 Transactional Memory Transactional memory provides atomicity without the problems associated with locks. Locks if (i<j) { a = i; b = j; } else { a = j; b = i; } Transactional Memory Lock(L[a]); Lock(L[b]); StartTransaction; Flow[i] = Flow[i] – X; Flow[i] = Flow[i] – X; Flow[j] = Flow[j] + X; Flow[j] = Flow[j] + X; Unlock(L[b]); Unlock(L[a]); EndTransaction; I propose an integrated hardware-software approach to transactional memory. Hardware Transactional Memory HTM: Transactional memory can be implemented in hardware using the cache and cache coherency mechanism. [Herlihy & Moss] Uncommitted transactional data is stored in the cache. Transactional data is marked in the cache using an additional bit per cache line. HTM has very low overhead but has size and length limitations. Trans Tag Status 1 0 1 0 Data Software Transactional Memory Original Object Trans Version Readers List 248 ‘a’ FLAG NULL NULL NULL 48 77 “Mass. Ave.” NULL NULL FLEX Software Transaction System STM: Transactional memory can be implemented in software using compiler and library support. Uncommitted transactional data is stored in a copy of the object. Transactional data is marked by flagging the object field. STM does not have size or length limitations but has high overhead. Results An integrated approach gives the best of both worlds. Common z case: HTM mode - Small/short transactions run fast. Uncommon z case: STM mode - Large/long transactions are slower but possible. An integrated hardware-software transactional memory system was implemented and evaluated. HTM was implemented in the UVSIM software simulator. A subset of STM functionality was implemented for the benchmark applications. HTM was modified to be software-compatible. Hardware vs. Software HTM has much lower overhead than STM. A network flow µbenchmark (node-push) was implemented for evaluating overhead. 1 processor overheads: Atomicity “Worst” “More Realistic” Mechanism Cycles (% of Base) Locks 505% 136% HTM 153% 104% 1879% 206% STM “worst” case: Back-to-back small transactions “more realistic” case: Some processing between small transactions However, HTM has 2 serious limitations. Hardware Limitation: Cache Capacity HTM uses the cache to hold all transactional data. Therefore, HTM aborts transactions larger than the cache. Restricting transaction size is awkward and not modular. Size will depend on associativity, block size, etc. in addition to cache size. Cache configuration change from processor to processor. Trans Tag Status 1 1 1 1 Data Hardware Limitation: Context Switches Cache Trans data for thread a The cache is the only transactional buffer for all threads. Processor Therefore, HTM aborts transactions on context switches. Restricting context switches is awkward and not modular. Thread a Thread b Context switches occur regularly in modern systems (e.g.. TLB exceptions). HSTM: An Integrated Approach Transactions are switched from HTM to STM when necessary. When a transaction aborts in HTM, it is restart in STM. HTM is modified to be software-compatible. Trans Tag Status 1 0 Data Readers List 248 ‘a’ FLAG 77 “Mass. Ave.” 1 0 Software-Compatible HTM Original Object Readers List 248 ‘a’ FLAG 77 “Mass. Ave.” Store Check null null?? Y Load Check flag flag?? N Y Normal Memory Operation N 1 new instruction: xAbort Additional checks are performed in software: On loads, check if the memory location is set to FLAG. On stores, check if there is a readers list. Abort Software checks are slow. Performing checks adds a 2.2x performance overhead over pure HTM in “worst” case (1.1x in “more realistic” case). Overcoming Size Limitations The node-push benchmark was modified to touch more nodes to evaluate size limitations. HSTM uses HTM when possible and STM when necessary. 300 HTM STM Cycles per Node 250 200 150 HTM Transactions stop fitting after this point 100 50 0 1 5 9 13 17 21 25 29 33 37 Transaction Size (Number of Nodes x 100) 41 45 49 Overcoming Size Limitations The node-push benchmark was modified to touch more nodes to evaluate size limitations. HSTM uses HTM when possible and STM when necessary. 300 HTM STM Cycles per Node 250 HSTM 200 150 HTM Transactions stop fitting after this point 100 50 0 1 5 9 13 17 21 25 29 33 37 Transaction Size (Number of Nodes x 100) 41 45 49 Overcoming Context Switching Limitations Context switches occur on TLB exceptions. The node-push benchmark was modified to choose from a larger set of nodes. More nodes Æ higher probability of TLB miss (Pabort). HSTM behaves like HTM when Pabort is low and like STM when Pabort is high. 3000 6 STM Avg. Cycles per Iteration 2500 5 HTM 2000 4 1500 3 1000 2 500 1 0 0 1 4 7 10 13 16 19 Number of Nodes (x1000) 22 25 28 31 Avg. Aborts per HTM Transaction TLB Aborts Overcoming Context Switching Limitations Context switches occur on TLB exceptions. The node-push benchmark was modified to choose from a larger set of nodes. More nodes Æ higher probability of TLB miss (Pabort). HSTM behaves like HTM when Pabort is low and like STM when Pabort is high. 3000 STM HTM Avg. Cycles per Iteration 2500 HSTM 2000 1500 1000 500 0 1 4 7 10 13 16 19 Number of Nodes (x1000) 22 25 28 31 Conclusions An integrated approach gives the best of both worlds. Common z case: HTM mode - Small/short transactions run fast. Uncommon z case: STM mode - Large/long transactions are slower but possible. Trade-offs: “STM z This is acceptable since it is uncommon. “HTM z mode” is not has fast as pure STM. mode” is not has fast as pure HTM. Is this acceptable? Future Work Full implementation of STM in UVSIM Integration compiler of software-compatible HTM into the FLEX Evaluate how software-compatible HTM performs for parallel applications Should software-compatible modifications be moved into hardware? Can a transaction be transferred from hardware to software during execution?