Concurrent Critical Section Access Acceleration: A Hybrid System of Speculative Lock Elision and Asymmetric CMP Architectures

Yu Wang, Daniel Lin, Wei Lin
September 27, 2010

Abstract

In multithreaded programs, locks are often used to guarantee exclusive access to shared data; however, concurrent accesses to a shared memory region do not necessarily conflict. In fact, many operations can be performed on a shared memory region concurrently as long as they operate on disjoint portions of it. With clever programming, this is an opportunity to improve performance without compromising functional correctness. However, programmers often lack either the skill or the development time budget to exploit it. This project proposes a hardware solution that sidesteps this software development problem by detecting locked critical sections of code, dynamically determining whether each lock is necessary, and expediting the sections of code in which locking is truly required.

1. Problem Definition

Conservative critical section locking and memory contention among multiple threads are among the major performance bottlenecks in modern systems and deserve attention. The primary cause of this phenomenon is conservative locking practices by software developers. While most developers recognize that concurrent accesses to a critical region of memory do not necessarily warrant a lock, most believe that modification usually does. Even then, many recognize that only modification of shared data that is actually accessed concurrently really necessitates a lock. Thus, locking cannot be abandoned altogether; what must be promoted instead is the sparing use of locks. Unfortunately, the direct method of addressing this situation is prohibitively expensive: retraining programmers costs project development time and considerable educational resources.
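To make the distinction between holding a lock and truly conflicting concrete, the following is a minimal Python sketch; it is an illustration only, not part of the proposed hardware. It assumes each critical section execution can be summarized by the set of locations it reads and the set it writes. The names `Access` and `truly_conflicts` are hypothetical. Two executions truly conflict only if one of them writes a location that the other reads or writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Hypothetical summary of one critical section execution."""
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def truly_conflicts(a: Access, b: Access) -> bool:
    """Return True if a and b touch a common location and one of them writes it.

    Read-read sharing is safe, and so are accesses to disjoint locations,
    even when both executions are guarded by the same lock.
    """
    a_touched = a.reads | a.writes
    b_touched = b.reads | b.writes
    return bool(a.writes & b_touched) or bool(b.writes & a_touched)


# Two threads update different slots of the same lock-protected table.
t0 = Access(reads=frozenset({"table[0]"}), writes=frozenset({"table[0]"}))
t1 = Access(reads=frozenset({"table[7]"}), writes=frozenset({"table[7]"}))
# A third thread also updates slot 0.
t2 = Access(reads=frozenset({"table[0]"}), writes=frozenset({"table[0]"}))

disjoint_conflict = truly_conflicts(t0, t1)   # the lock was not needed here
same_slot_conflict = truly_conflicts(t0, t2)  # the lock was genuinely needed
```

In this sketch only the second pair conflicts, even though all three executions would acquire the same lock under conservative locking.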
Unfortunately, true critical sections of code do exist and impose a bottleneck on the performance of multithreaded programs. Given the current trend of incorporating more processors/cores to further exploit thread-level parallelism, the penalty of shared data contention will almost certainly grow. Much research has been put forth to solve this problem in hardware; some notable efforts are described under Related Work below. If each technique yields tangible performance gains alone, what benefits can be achieved by a hybrid? The goal of this project, then, is to combine the strategies of Speculative Lock Elision (SLE)[2] and Asymmetric Chip Multiprocessor Critical Section Acceleration (ACMPCSA)[1] to create a hybrid algorithm that uses SLE to remove unnecessary locks while using ACMPCSA to offer truly critical sections of code more capable hardware, reducing their execution time. By combining the best of both worlds, this hybrid algorithm can be a solution to conservative locking in multithreaded application execution.

2. Related Work

Improving the performance of critical section execution is an important topic in computer architecture, and several research efforts have attempted to solve the problem. Among them are:

2.1 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures[1]

The idea of ACMPCSA originates from the paper “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures” by Suleman et al. Our approach of adaptively accelerating critical section accesses that truly collide will be based on the idea presented in that work.

2.2 Speculative lock elision: enabling highly concurrent multithreaded execution[2]

The idea of SLE originates from the paper “Speculative lock elision: enabling highly concurrent multithreaded execution” by Rajwar and Goodman.
This paper proposes and evaluates a general technique to identify lock acquisitions, elide the critical section lock, speculatively allow concurrent access to the shared memory region, and restore state when misspeculation occurs. We will build our solution for fast, effective critical section access on top of its SLE technique.

2.3 Hardware acceleration for lock-free data structures and software-transactional memory[3]

This paper by Diestelhorst and Hohmuth reports an AMD64 architecture/instruction-set extension geared toward critical section lock removal and low-overhead implementation of software transactional memory. This is an interesting approach, but its solution is tied to a particular architecture and requires an ISA extension. Our proposal, on the other hand, is more general and can be used across a spectrum of architectures; it also requires no software modification and is thus expected to be compatible with existing multithreaded programs. The implementation details are subject to change, but the basic framework will be based on the implementations in references [1] and [2].

3. Proposed Solution

We will build asymmetric CMP acceleration on top of SLE. The acceleration would therefore be applied precisely to the threads identified as having encountered “true critical section collisions,” rather than to every thread in a critical section. The key idea is as follows: the fast (big) core sleeps while the small cores execute using the SLE protocol. Counters track the number of (a) critical section accesses and (b) true critical section collisions, which require speculative restores. When the fraction of true collisions among critical section accesses rises above a certain threshold, it implies that the critical section is slowing down concurrent execution. In this case the thread is migrated to the fast core, which is the most powerful and has the highest memory access priority, to speed up resolving the contention.
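The counter-and-threshold policy described in this section can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware design: the class name `CollisionMonitor`, the method `record`, and the threshold value 0.5 are all hypothetical, and the counters are assumed to be cumulative and never reset, since the proposal does not specify a window.

```python
class CollisionMonitor:
    """Hypothetical counters driving the small-core/fast-core decision."""

    def __init__(self, threshold: float):
        self.threshold = threshold   # assumed tunable; the proposal fixes no value
        self.accesses = 0            # (a) critical section accesses
        self.collisions = 0          # (b) true collisions requiring a restore
        self.on_fast_core = False    # threads start on a small core under SLE

    def record(self, collided: bool) -> bool:
        """Count one critical section access and update core placement.

        Returns True if the thread should now run on the fast core.
        """
        self.accesses += 1
        if collided:
            self.collisions += 1
        ratio = self.collisions / self.accesses
        if ratio > self.threshold:
            # Contention is high: migrate to the fast core.
            self.on_fast_core = True
        elif ratio < self.threshold:
            # Contention has subsided: return to the original small core.
            self.on_fast_core = False
        # Exactly at the threshold the current placement is kept.
        return self.on_fast_core


# Example run: six critical section accesses, two of which truly collide.
monitor = CollisionMonitor(threshold=0.5)
placements = [monitor.record(c) for c in (False, True, True, False, False, False)]
```

Keeping the current placement when the ratio equals the threshold is a design choice of this sketch; it follows the wording that migration happens when the ratio rises above the threshold and reverses only when it falls below it.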
Once the ratio of true collisions to total critical section accesses falls below the threshold, the thread migrates back to its original core, and the fast core either handles other contended threads or sleeps.

4. Methodology

We plan to modify an asymmetric CMP simulation environment with one large core and 12 small cores by integrating SLE into the small cores [1][2]. We will first re-implement hardware support for SLE in each individual processor, and then work on exploiting the asymmetric architecture. We will build our simulation environment on top of the SuperESCalar (SESC) simulator, a cycle-accurate architectural simulator developed at UIUC that models a variety of architectures, including CMPs and thread-level speculation (http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/).

5. Research Plan

Over the course of this project, we intend to meet the following milestones:

October 13th - Simulator and environment set up and functional, traces selected, and baseline benchmarks run (Milestone 1)
November 1st - Modifications to the simulator complete; data collection in full swing (Milestone 2)
November 19th - Data collection complete; analysis in progress with basic conclusions drawn (Milestone 3)
November 29th - Final presentation given
December 12th - Final report complete

References

1. M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt. “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures”. In ASPLOS 2009.
2. Ravi Rajwar and James R. Goodman. “Speculative lock elision: enabling highly concurrent multithreaded execution”. In MICRO 2001.
3. Stephan Diestelhorst and Michael Hohmuth. “Hardware acceleration for lock-free data structures and software-transactional memory”. AMD, Inc., 2008.