Concurrent Critical Section Access Acceleration: a Hybrid System of Speculative
Lock Elision and Asymmetric CMP architectures
Yu Wang, Daniel Lin, Wei Lin
September 27, 2010
In multithreaded programs, locks are often used to guarantee exclusive access to shared data; however, concurrent
accesses to a shared memory region do not necessarily conflict. In fact, many operations can be performed on a shared
memory region concurrently as long as they operate on disjoint portions of the shared memory resource. With
clever programming, this is a perfect opportunity to improve performance without compromising functional correctness.
However, programmers often lack either the programming skill or the development time budget to
exploit this phenomenon. This project proposes a hardware solution that sidesteps this software
development problem by detecting locked critical sections of code, dynamically determining whether each lock is
necessary, and expediting the sections of code in which locking is truly required.
1. Problem Definition
Conservative critical section locking and contention in
memory among multiple threads are among the major
bottlenecks to performance in modern systems and
deserve our attention. The primary cause of this
phenomenon is conservative locking practices by
software developers. While most recognize that
concurrent accesses to a critical region of memory do
not necessarily warrant a lock, most believe that
modification usually does. Even then, many recognize
that only modification of a concurrently accessed
portion of that region really necessitates a lock. Thus,
locking cannot be abandoned altogether; rather, more
selective locking must be promoted.
Unfortunately, the direct method of addressing this
situation is prohibitively expensive: retraining
programmers costs project development time and
considerable educational resources. Moreover, true critical
sections of code do exist and impose a bottleneck on the
performance of multithreaded programs. As more
microprocessors/cores are added to further exploit
thread-level parallelism, the penalty of shared data
contention will almost certainly worsen.
Much research has been devoted to solving this problem
in hardware. Some notable efforts are described below
under Related Work. If each technique yields tangible
performance gains on its own, what benefits can be
achieved from a hybrid algorithm?
The goal of this project, then, is to combine the
strategies of Speculative Lock Elision (SLE)[2] and
Asymmetric Chip Multiprocessor Critical Section
Acceleration (ACMPCSA)[1] to create a hybrid algorithm
that removes unnecessary locks, as SLE does,
while offering true critical sections of code
more powerful hardware to reduce execution time,
as the ACMPCSA algorithm does. By combining the
benefits of both worlds, this hybrid algorithm can be
a solution to conservative locking in multithreaded
application execution.
2. Related Work
Improving the performance of critical-section-related
execution is an important topic in computer
architecture, and several research efforts have
attempted to solve the problem. Among
them are:
2.1 Accelerating Critical Section Execution with
Asymmetric Multi-Core Architectures[1]
The idea of ACMPCSA originates from the paper
“Accelerating Critical Section Execution with
Asymmetric Multi-Core Architectures” by Suleman et al.
Our approach of adaptively accelerating truly colliding
critical section accesses will be based on the idea
presented in that paper.
2.2 Speculative lock elision: enabling highly concurrent
multithreaded execution[2]
The idea of SLE originates from the paper “Speculative
lock elision: enabling highly concurrent multithreaded
execution” by Rajwar and Goodman. The paper
proposes and evaluates a general heuristic to identify lock
acquisitions, elide critical section locking, speculatively
allow concurrent access to the shared memory region,
and restore state when misspeculation occurs. We will
build our solution for fast, effective critical section access
on top of this SLE technique.
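The elide, speculate, and restore cycle described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware mechanism of the SLE paper: the `Section` type, the read/write address sets, and the `run_elided` function are hypothetical names introduced here, and conflict detection is modeled as a simple set intersection instead of the cache-level detection real hardware would use.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One dynamic critical section, described by the addresses it touches."""
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def data_conflict(a: Section, b: Section) -> bool:
    """True if one section writes an address the other reads or writes."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


def run_elided(sections):
    """Model one round of lock elision over sections that overlap in time.

    Returns (committed, restarted) as lists of indices. A section commits
    if it has no data conflict with any section that already committed;
    otherwise it counts as a misspeculation that must restore its state
    and restart (assumed here to acquire the lock for real).
    """
    committed, restarted = [], []
    for i, sec in enumerate(sections):
        if any(data_conflict(sec, sections[j]) for j in committed):
            restarted.append(i)
        else:
            committed.append(i)
    return committed, restarted


if __name__ == "__main__":
    demo = [
        Section(writes={0x10}),                # touches only 0x10
        Section(writes={0x20}),                # disjoint: no lock needed
        Section(reads={0x10}, writes={0x30}),  # reads what section 0 wrote
    ]
    print(run_elided(demo))  # ([0, 1], [2])
```

In this model the first two sections touch disjoint addresses and both commit without the lock, while the third is a true collision and is the only one that pays the cost of a restart.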
2.3 Hardware acceleration for lock-free data structures
and software-transactional memory[3]
This paper by Diestelhorst and Hohmuth reports an AMD64
architecture/instruction-set extension geared toward
critical section lock removal and low-overhead
implementation of software transactional memory.
This is an interesting approach, but its solution is tied
to a particular architecture and requires an extension
of the ISA. Our proposal, on the other hand, is more general
and can be used across a spectrum of architectures; it
also requires no software modification and should thus
be compatible with existing
multithreaded programs.
The implementation details are subject to change, but
the basic framework will be based on the
implementations in references [1] and [2].
3. Proposed Solution
We will build the feature of asymmetric CMP
acceleration on top of SLE. The acceleration will
therefore be applied precisely to the threads that are
identified as having encountered “true critical section
collisions,” rather than to any thread in a critical section.
The key ideas are as follows:
• The fast/big core sleeps while the small cores
  execute using the SLE protocol.
• Counters track the number of (a) critical
  section accesses and (b) true critical section collisions,
  i.e. those that require a speculative restore.
• When the percentage of true collisions among
  critical section accesses grows above a certain
  threshold, it implies that the critical section is
  slowing down concurrent execution. In this
  case the thread is migrated to the fast core,
  which is the most powerful and has the highest
  memory access priority, to speed up resolving the
  contention.
• Once the ratio of true collisions to total critical section
  accesses falls below the threshold, the thread migrates
  back to its original core, and the fast core either handles
  other threads that are sources of contention or sleeps.
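The counter-and-threshold policy in the list above can be sketched in a few lines. This is a minimal illustration, not the proposed hardware design: the `CollisionMonitor` class, the threshold of 0.5, and the `min_samples` warm-up are all assumptions made for the example, since the proposal does not fix any of these values.

```python
from dataclasses import dataclass


@dataclass
class CollisionMonitor:
    """Per-thread counters driving the migration decision (hypothetical)."""
    threshold: float = 0.5   # assumed value; the proposal leaves it open
    min_samples: int = 8     # assumed warm-up before any decision is made
    accesses: int = 0        # (a) critical section accesses
    collisions: int = 0      # (b) true collisions that forced a restore
    on_fast_core: bool = False

    def record(self, collided: bool) -> None:
        """Count one critical section access and whether it truly collided."""
        self.accesses += 1
        if collided:
            self.collisions += 1

    def collision_rate(self) -> float:
        """Fraction of critical section accesses that were true collisions."""
        return self.collisions / self.accesses if self.accesses else 0.0

    def update_placement(self) -> bool:
        """Return True if the thread should run on the fast core."""
        if self.accesses < self.min_samples:
            return self.on_fast_core      # too few samples to decide
        rate = self.collision_rate()
        if rate > self.threshold:
            self.on_fast_core = True      # migrate to the fast core
        elif rate < self.threshold:
            self.on_fast_core = False     # migrate back to the small core
        return self.on_fast_core


if __name__ == "__main__":
    monitor = CollisionMonitor(threshold=0.5, min_samples=4)
    for _ in range(4):
        monitor.record(collided=True)
    print(monitor.update_placement())     # True: rate is 1.0
    for _ in range(6):
        monitor.record(collided=False)
    print(monitor.update_placement())     # False: rate is 0.4
```

The sketch uses cumulative counters for simplicity; a real design would more likely use a sliding window or periodically reset counters so that the ratio can fall quickly once contention subsides.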
4. Methodology
We plan to modify an asymmetric CMP
simulation environment with one large core and 12
small cores by integrating SLE into the small cores [1][2].
We will first re-implement hardware support for SLE
techniques in each individual processor, and then work
on exploiting the asymmetric architecture.
We will build our simulation environment on top of the
SuperESCalar (SESC) simulator, a cycle-accurate
architectural simulator developed at UIUC that models
a variety of architectures, including CMPs and
thread-level speculation.
5. Research Plan
Over the course of this project, we intend to meet the
following milestones:
October 13th - Simulator and environment set up and
functional, traces selected, and baseline benchmarks run
(Milestone 1)
November 1st - Modifications to the simulator
complete, data collection in full swing
(Milestone 2)
November 19th - Data collection complete, analysis
in progress with basic conclusions drawn
(Milestone 3)
November 29th - Final presentation given
December 12th - Final report complete
References
1. M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt. “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures”. In ASPLOS 2009.
2. Ravi Rajwar and James R. Goodman. “Speculative lock elision: enabling highly concurrent multithreaded execution”. In MICRO 2001.
3. Stephan Diestelhorst and Michael Hohmuth. “Hardware acceleration for lock-free data structures and software-transactional memory”. AMD, Inc.