Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

M. Aater Suleman*   Onur Mutlu†   Moinuddin K. Qureshi‡   Yale N. Patt*
*The University of Texas at Austin   †Carnegie Mellon University   ‡IBM Research

Background
• To leverage CMPs:
  – Programs must be split into threads
• Mutual exclusion:
  – Threads are not allowed to update shared data concurrently
  – Accesses to shared data are encapsulated inside critical sections
  – Only one thread can execute a critical section at a given time

Example of a Critical Section from MySQL
[Figure: a shared list of open tables (A–E), with marks showing which of Threads 0–3 has each table open. Thread 2 calls OpenTables(D, E) and later CloseAllTables(), both of which update the shared list.]

The critical section runs at the end of a transaction, guarded by the global LOCK_open (a C sketch follows below):

    LOCK_openAcquire()
    foreach (table opened by thread)
        if (table.temporary)
            table.close()
    LOCK_openRelease()
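A minimal C rendering of this critical section, assuming pthread-style locking and a hypothetical table_close() helper; the types and names below are illustrative stand-ins, not MySQL's actual API:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Table {
        int temporary;                 /* non-zero for temporary tables */
        struct Table *next;
    } Table;

    /* Global lock guarding the shared list of open tables. */
    static pthread_mutex_t LOCK_open = PTHREAD_MUTEX_INITIALIZER;

    static void table_close(Table *t) { (void)t; /* stub: release the table */ }

    /* End of transaction: close this thread's temporary tables. */
    void close_temporary_tables(Table *tables_opened_by_thread)
    {
        pthread_mutex_lock(&LOCK_open);        /* LOCK_openAcquire() */
        for (Table *t = tables_opened_by_thread; t != NULL; t = t->next) {
            if (t->temporary)
                table_close(t);
        }
        pthread_mutex_unlock(&LOCK_open);      /* LOCK_openRelease() */
    }

Because every thread must acquire the single global LOCK_open to update the table list, this critical section serializes all transactions that reach it.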
Contention for Critical Sections
[Figure: execution timelines of Threads 1–4 over t1–t7, with segments marked Parallel, Critical Section, and Idle; a second timeline shows the same execution when critical sections run 2x faster.]
• Accelerating critical sections not only helps the thread executing the critical section, but also the waiting threads

Impact of Critical Sections on Scalability
• Contention for critical sections increases with the number of threads and limits scalability
[Figure: speedup vs. chip area in cores (0–32) for MySQL (oltp-1).]

Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

The Asymmetric Chip Multiprocessor (ACMP)
[Figure: one large core occupying the area of several Niagara-like cores, alongside many Niagara-like small cores.]
• ACMP approach:
  – Provide one large core and many small cores
  – Execute the parallel part on the small cores for high throughput
  – Accelerate the serial part using the large core

Conventional ACMP
Example critical section: EnterCS(); PriorityQ.insert(…); LeaveCS()
1. P2 encounters a critical section
2. P2 sends a request for the lock
3. P2 acquires the lock
4. P2 executes the critical section
5. P2 releases the lock
[Figure: cores P1–P4 connected by the on-chip interconnect; the core executing the critical section is P2.]

Accelerating Critical Sections (ACS)
[Figure: the ACMP augmented with a Critical Section Request Buffer (CSRB) attached to the large core.]
• ACS approach:
  – Accelerate Amdahl's serial part and critical sections using the large core

Accelerated Critical Sections (ACS)
Example critical section: EnterCS(); PriorityQ.insert(…); LeaveCS()
1. P2 encounters a critical section
2. P2 sends a CSCALL request to the CSRB
3. P1 executes the critical section
4. P1 sends a CSDONE signal
[Figure: large core P1, holding the CSRB, executes the critical section on behalf of small core P2; P1–P4 communicate over the on-chip interconnect.]

Architecture Overview
• ISA extensions:
  – CSCALL LOCK_ADDR, TARGET_PC
  – CSRET LOCK_ADDR
• Compiler/library inserts CSCALL/CSRET (see the software sketch after this list)
• On a CSCALL, the small core:
  – Sends a CSCALL request to the large core
    • Arguments: lock address, target PC, stack pointer, core ID
  – Stalls and waits for CSDONE
• The large core:
  – Buffers requests in the Critical Section Request Buffer (CSRB)
  – Executes the critical section and sends CSDONE to the requesting core
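To make the CSCALL/CSDONE handshake concrete, here is a software model in C. This is a sketch, not the hardware: a function pointer stands in for TARGET_PC, an atomic flag stands in for the CSDONE signal, and csrb_enqueue/csrb_dequeue are hypothetical stand-ins for the hardware CSRB (real ACS also ships the stack pointer and operates below the ISA):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct CSRequest {
        void (*target_pc)(void *);   /* TARGET_PC: critical-section code */
        void *args;                  /* stands in for the shipped stack pointer */
        int core_id;                 /* requesting core's ID */
        atomic_bool done;            /* stands in for the CSDONE signal */
    } CSRequest;

    /* Hypothetical CSRB operations (hardware structures in real ACS). */
    void csrb_enqueue(CSRequest *req);
    CSRequest *csrb_dequeue(void);

    /* Small core: issue a CSCALL, then stall until CSDONE arrives. */
    void cscall(CSRequest *req, void (*pc)(void *), void *args, int core_id)
    {
        req->target_pc = pc;
        req->args = args;
        req->core_id = core_id;
        atomic_store(&req->done, false);
        csrb_enqueue(req);                     /* send CSCALL to the CSRB */
        while (!atomic_load(&req->done))
            ;                                  /* stall, waiting for CSDONE */
    }

    /* Large core: drain the CSRB, executing each critical section. */
    void large_core_loop(void)
    {
        for (;;) {
            CSRequest *req = csrb_dequeue();   /* next CSCALL request */
            req->target_pc(req->args);         /* run the critical section */
            atomic_store(&req->done, true);    /* signal CSDONE to requester */
        }
    }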
False Serialization
• ACS can serialize independent critical sections
• Selective Acceleration of Critical Sections (SEL):
  – Saturating counters track false serialization (see the sketch below)
[Figure: the CSRB at the large core holding CSCALL requests from the small cores for two different locks, A and B, with a saturating counter per lock.]
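A sketch of SEL's decision logic, under stated assumptions: one saturating counter per critical section, incremented when a CSCALL waits in the CSRB behind a request for a different lock and decremented otherwise. The counter width, threshold, and update rule below are illustrative, not the paper's exact policy:

    #include <stdint.h>
    #include <stdbool.h>

    #define SEL_MAX       15   /* 4-bit saturating counter (assumed width) */
    #define SEL_THRESHOLD 8    /* disable acceleration beyond this (assumed) */

    typedef struct {
        uintptr_t lock_addr;   /* identifies the critical section's lock */
        uint8_t counter;       /* saturating false-serialization count */
    } SelEntry;

    /* Update on each CSCALL for this lock: count it as false
       serialization if it waited behind requests for other locks. */
    void sel_update(SelEntry *e, bool waited_behind_other_lock)
    {
        if (waited_behind_other_lock) {
            if (e->counter < SEL_MAX) e->counter++;
        } else {
            if (e->counter > 0) e->counter--;
        }
    }

    /* Accelerate on the large core only while false serialization is
       rare; otherwise fall back to conventional locking on the small core. */
    bool sel_should_accelerate(const SelEntry *e)
    {
        return e->counter < SEL_THRESHOLD;
    }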
Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Performance Trade-Offs
• Fewer threads vs. accelerated critical sections
  – Accelerating critical sections offsets the loss in throughput
  – As the number of cores (threads) on chip increases:
    • The fractional loss in parallel performance decreases
    • Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
  – ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data
  – Cache misses decrease overall if the critical section touches more shared data than private data

Cache Misses for Private Data
• Example from the puzzle benchmark: PriorityHeap.insert(NewSubProblems)
  – Shared data: the priority heap, which stays in the large core's cache under ACS
  – Private data: NewSubProblems, which the large core must fetch from the requesting small core's cache, causing extra misses

Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Experimental Methodology
[Figure: three equal-area chips: SCMP (all small cores), ACMP (one large core plus small cores), and ACS (ACMP with a CSRB at the large core).]
• SCMP:
  – All small cores
  – Conventional locking
• ACMP:
  – One large core (area-equal to 4 small cores) plus small cores
  – Conventional locking
• ACS:
  – ACMP with a CSRB
  – Accelerates critical sections

Experimental Methodology
• Workloads:
  – 12 critical-section-intensive applications from various domains
  – 7 use coarse-grain locks and 5 use fine-grain locks
• Simulation parameters:
  – x86 cycle-accurate processor simulator
  – Large core: similar to the Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage pipeline
  – Small core: similar to the Pentium 1; 2 GHz, in-order, 2-wide issue, 5-stage pipeline
  – Private 32 KB L1 and 256 KB L2 per core, 8 MB shared L3
  – On-chip interconnect: bi-directional ring

Workloads with Coarse-Grain Locks
• Equal-area comparison; the number of threads is set to the best-performing count for each application
• Chip area = 16 small cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
• Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
[Figure: execution time of SCMP and ACS, normalized to ACMP, for the coarse-grain-lock workloads at both area budgets.]

Workloads with Fine-Grain Locks
• Same equal-area configurations and best-thread-count methodology as above
[Figure: execution time of SCMP and ACS, normalized to ACMP, for the fine-grain-lock workloads at both area budgets.]

Equal-Area Comparisons
• Number of threads = number of cores
[Figure: speedup over a single small core vs. chip area in small cores (0–32) for SCMP, ACMP, and ACS on each workload: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache.]

ACS on Symmetric CMP
[Figure: execution time, normalized to SCMP, of ACS and symmACS (the ACS mechanism on a symmetric CMP, with a small core serving critical sections).]
• The majority of ACS's benefit comes from the large core

Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Related Work
• Improving locality of shared data by thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
  – ACS not only improves locality but also uses a large core to accelerate critical section execution
• Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
  – ACS accelerates not only the Amdahl's-law serial bottleneck but also critical sections
• Remote procedure calls (Birrell+)
  – ACS is for critical sections among shared-memory cores

Hiding Latency of Critical Sections
• Transactional memory (Herlihy+)
  – ACS does not require code modification
• Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+)
  – They hide critical section latency by increasing concurrency; ACS reduces the latency of each critical section
  – They overlap execution only of critical sections with no data conflicts; ACS accelerates ALL critical sections
  – They do not improve the locality of shared data; ACS improves it
• ACS outperforms TLR (Rajwar+) by 18% (details in the paper)

Conclusion
• Critical sections reduce performance and limit scalability
• ACS accelerates critical sections by executing them on a powerful core
• ACS reduces average execution time by:
  – 34% compared to an equal-area SCMP
  – 23% compared to an equal-area ACMP
• ACS improves the scalability of 7 of the 12 workloads