Accelerating Critical Section Execution
with Asymmetric Multi-Core Architectures
M. Aater Suleman*, Onur Mutlu†, Moinuddin K. Qureshi‡, Yale N. Patt*
*The University of Texas at Austin
†Carnegie Mellon University
‡IBM Research
Background
• To leverage CMPs:
– Programs must be split into threads
• Mutual Exclusion:
– Threads are not allowed to update shared data concurrently
• Accesses to shared data are encapsulated inside critical sections
• Only one thread can execute a critical section at a given time
Example of Critical Section from MySQL
[Figure: a shared list of open tables (A–E) accessed by Threads 0–3. Thread 2: OpenTables(D, E); Thread 3: CloseAllTables()]
Example Critical Section from MySQL
LOCK_openAcquire()
End of Transaction:
foreach (table opened by thread)
if (table.temporary)
table.close()
LOCK_openRelease()
5
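The end-of-transaction pseudocode can be sketched in Python as follows. This is a minimal illustration, not MySQL's code: `lock_open`, `open_tables`, `open_table`, and `end_of_transaction` are hypothetical names, and tables are plain dictionaries.

```python
import threading

# Hypothetical stand-ins for MySQL's LOCK_open and its list of open tables.
lock_open = threading.Lock()
open_tables = []  # shared data: one dict per open table


def open_table(name, owner, temporary=False):
    """Add a table to the shared list inside the critical section."""
    with lock_open:
        open_tables.append({"name": name, "owner": owner, "temporary": temporary})


def end_of_transaction(owner):
    """Close the temporary tables opened by `owner` while holding the lock."""
    with lock_open:
        keep = [t for t in open_tables
                if not (t["owner"] == owner and t["temporary"])]
        closed = len(open_tables) - len(keep)
        open_tables[:] = keep  # update the shared list in place
    return closed
```

Every thread that opens or closes tables serializes on the same lock, which is why this critical section becomes contended as the thread count grows.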
Contention for Critical Sections
[Figure: execution timelines (t1 to t7) for Threads 1–4, with parallel, critical-section, and idle phases, shown at normal speed and with critical sections executing 2x faster]
Accelerating critical sections not only helps the thread executing the critical sections, but also the waiting threads
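The effect on waiting threads can be checked with a small model. This is a sketch under simplifying assumptions that are not from the slides: each thread runs one parallel phase followed by one critical section, the lock is granted in arrival order, and the durations in the usage note are illustrative.

```python
def makespan(parallel_times, cs_time, cs_speedup=1.0):
    """Finish time when each thread runs a parallel phase, then one critical section.

    parallel_times: length of each thread's parallel phase (hypothetical units).
    cs_time: length of one critical section on a small core.
    cs_speedup: factor by which critical sections are accelerated.
    """
    lock_free_at = 0.0
    finish = 0.0
    for arrival in sorted(parallel_times):      # lock granted in arrival order
        start = max(arrival, lock_free_at)      # idle while the lock is held
        lock_free_at = start + cs_time / cs_speedup
        finish = max(finish, lock_free_at)
    return finish
```

With four threads, `makespan([4.0] * 4, 2.0)` gives 12.0 and `makespan([4.0] * 4, 2.0, cs_speedup=2.0)` gives 8.0. Each critical section is only 1 unit shorter, yet the last thread finishes 4 units earlier because its waiting time shrinks too.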
Impact of Critical Sections on Scalability
• Contention for critical sections increases with the number of threads and limits scalability
[Figure: speedup (0–8) versus chip area (0–32 cores) for MySQL (oltp-1)]
Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary
The Asymmetric Chip Multiprocessor (ACMP)
[Figure: chip with one large core and 12 Niagara-like small cores]
ACMP Approach
• Provide one large core and many small cores
• Execute parallel part on small cores for high throughput
• Accelerate serial part using the large core
Conventional ACMP
EnterCS()
PriorityQ.insert(…)
LeaveCS()
1. P2 encounters a Critical Section
2. Sends a request for the lock
3. Acquires the lock
4. Executes Critical Section
5. Releases the lock
[Figure: cores P1–P4 on the on-chip interconnect, with the core executing the critical section highlighted]
Accelerating Critical Sections (ACS)
[Figure: chip with one large core, which holds the Critical Section Request Buffer (CSRB), and 12 Niagara-like small cores]
ACMP Approach
• Accelerate Amdahl’s serial part and critical sections using the large core
Accelerated Critical Sections (ACS)
1. P2 encounters a Critical Section
2. P2 sends CSCALL Request to CSRB
3. P1 executes Critical Section
4. P1 sends CSDONE signal
EnterCS()
PriorityQ.insert(…)
LeaveCS()
[Figure: cores P1–P4 on the on-chip interconnect; the Critical Section Request Buffer (CSRB) sits at P1, the core executing the critical section]
Architecture Overview
• ISA extensions
– CSCALL LOCK_ADDR, TARGET_PC
– CSRET LOCK_ADDR
• Compiler/Library inserts CSCALL/CSRET
• On a CSCALL, the small core:
– Sends a CSCALL request to the large core
• Arguments: Lock address, Target PC, Stack Pointer, Core ID
– Stalls and waits for CSDONE
• Large Core
– Critical Section Request Buffer (CSRB)
– Executes the critical section and sends CSDONE to the requesting core
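The CSCALL/CSDONE handshake can be mimicked in software. This is a rough analogy, not the hardware design: a thread plays the large core, a `queue.Queue` plays the CSRB, and a Python callable with its arguments stands in for the target PC and stack pointer. The names `csrb`, `large_core`, and `cscall` and the `None` shutdown marker are assumptions made for this sketch.

```python
import queue
import threading

csrb = queue.Queue()  # stands in for the Critical Section Request Buffer


def large_core():
    """Serve requests in arrival order; None is a hypothetical shutdown marker."""
    while True:
        request = csrb.get()
        if request is None:
            break
        critical_section, args, done = request
        critical_section(*args)  # execute the critical section on the "large core"
        done.set()               # CSDONE: wake the requesting core


def cscall(critical_section, *args):
    """Small-core side: send the request, then stall until CSDONE arrives."""
    done = threading.Event()
    csrb.put((critical_section, args, done))
    done.wait()
```

Because only the serving thread ever runs the critical sections, the shared data they touch needs no further locking in this sketch, which loosely mirrors how ACS keeps locks and shared data at the large core.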
False Serialization
• ACS can serialize independent critical sections
• Selective Acceleration of Critical Sections (SEL)
– Saturating counters to track false serialization
[Figure: requests CSCALL (A), CSCALL (A), and CSCALL (B) from the small cores queue in the Critical Section Request Buffer (CSRB) on the way to the large core]
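The slide says only that SEL uses saturating counters to track false serialization. The sketch below shows one way such a policy could look. The update rule, the counter width, and the class and method names are all assumptions made for illustration, not the mechanism from the paper.

```python
class SaturatingCounter:
    """An n-bit counter that sticks at 0 and at 2**bits - 1."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1
        self.value = 0

    def increment(self, amount=1):
        self.value = min(self.max_value, self.value + amount)

    def decrement(self, amount=1):
        self.value = max(0, self.value - amount)


class SelectiveAcceleration:
    """Hypothetical policy: one saturating counter per lock address."""

    def __init__(self, bits=4):
        self.bits = bits
        self.counters = {}

    def on_cscall(self, lock_addr, pending_lock_addrs):
        """Count the pending requests for other locks this request queues behind."""
        counter = self.counters.setdefault(lock_addr, SaturatingCounter(self.bits))
        falsely_serialized = sum(1 for a in pending_lock_addrs if a != lock_addr)
        if falsely_serialized:
            counter.increment(falsely_serialized)
        else:
            counter.decrement()

    def should_accelerate(self, lock_addr):
        """Send the critical section to the large core unless its counter is saturated."""
        counter = self.counters.get(lock_addr)
        return counter is None or counter.value < counter.max_value
```

In this sketch, a CSCALL (B) that repeatedly arrives behind CSCALL (A) requests drives B's counter to saturation, after which B would run locally on its small core.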
Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary
Performance Tradeoffs
• Fewer threads vs. accelerated critical sections
– Accelerating critical sections offsets loss in throughput
– As the number of cores (threads) on chip increases:
• Fractional loss in parallel performance decreases
• Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
– ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data
Cache misses for private data
PriorityHeap.insert(NewSubProblems)
Shared Data: The priority heap
Private Data: NewSubProblems
Puzzle Benchmark
Performance Tradeoffs
• Fewer threads vs. accelerated critical sections
– Accelerating critical sections offsets loss in throughput
– As the number of cores (threads) on chip increases:
• Fractional loss in parallel performance decreases
• Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
– ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data
– Cache misses decrease if shared data > private data
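The first trade-off can be made concrete with simple arithmetic. This sketch assumes that one large core occupies the area of 4 small cores and that the large core runs no parallel threads while it serves critical sections. The function name is hypothetical.

```python
def fractional_throughput_loss(chip_area, large_core_area=4):
    """Share of small-core slots given up to fit one large core.

    chip_area: total chip area, measured in small cores.
    large_core_area: area of the large core, measured in small cores.
    """
    if chip_area <= large_core_area:
        raise ValueError("chip area must exceed the large core's area")
    return large_core_area / chip_area
```

Under these assumptions the loss is 25% at a chip area of 16 and 12.5% at a chip area of 32, so the cost of the large core shrinks as the chip grows while contention for critical sections rises.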
Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary
Experimental Methodology
[Figure: three equal-area chip layouts built from Niagara-like small cores, side by side: SCMP, ACMP, and ACS]
SCMP
• All small cores
• Conventional locking
ACMP
• One large core (area equal to 4 small cores)
• Conventional locking
ACS
• ACMP with a CSRB
• Accelerates Critical Sections
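The equal-area configurations follow directly from the one-large-core-equals-four-small-cores rule. The helper below is an illustrative sketch: the function name and the tuple layout are assumptions, and chip area is measured in small cores.

```python
def equal_area_configs(chip_area, large_core_area=4):
    """Return {name: (large_cores, small_cores)} for one chip-area budget."""
    small_with_large = chip_area - large_core_area
    if small_with_large < 0:
        raise ValueError("chip area is too small to hold a large core")
    return {
        "SCMP": (0, chip_area),         # all small cores
        "ACMP": (1, small_with_large),  # one large core, conventional locking
        "ACS": (1, small_with_large),   # same layout as ACMP, plus a CSRB
    }
```

A chip area of 16 yields 1 large and 12 small cores for ACMP and ACS, and a chip area of 32 yields 1 large and 28 small cores.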
Experimental Methodology
• Workloads
– 12 critical section intensive applications from various domains
– 7 use coarse-grain locks and 5 use fine-grain locks
• Simulation parameters:
– x86 cycle-accurate processor simulator
– Large core: Similar to Pentium-M with 2-way SMT. 2GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage
– Small core: Similar to Pentium 1, 2GHz, in-order, 2-wide issue, 5-stage
– Private 32 KB L1, private 256 KB L2, 8 MB shared L3
– On-chip interconnect: Bi-directional ring
Workloads with Coarse-Grain Locks
Equal-area comparison; Number of threads = Best threads
Chip Area = 16 small cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
Chip Area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
[Figure: two bar charts, one per chip area, of execution time normalized to ACMP (axis 0–130) for SCMP and ACS; off-scale bars are labeled 150 and 210]
Workloads with Fine-Grain Locks
Equal-area comparison; Number of threads = Best threads
Chip Area = 16 small cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
Chip Area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
[Figure: two bar charts, one per chip area, of execution time normalized to ACMP (axis 0–130) for SCMP and ACS]
Equal-Area Comparisons
Number of threads = No. of cores
[Figure: speedup over a small core versus chip area (small cores, 0–32) for SCMP, ACMP, and ACS, in 12 panels: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache]
ACS on Symmetric CMP
[Figure: bar chart of execution time normalized to SCMP (axis 0–130) for ACS and symmACS]
Majority of benefit is from large core
Outline
• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary
Related Work
• Improving locality of shared data by thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
ACS not only improves locality but also uses a large core to accelerate critical section execution
• Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
ACS accelerates not only the Amdahl’s-law serial bottleneck but also critical sections
• Remote procedure calls (Birrell+)
ACS is for critical sections among shared memory cores
Hiding Latency of Critical Sections
• Transactional memory (Herlihy+)
ACS does not require code modification
• Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+)
– Hide critical section latency by increasing concurrency
ACS reduces latency of each critical section
– Overlaps execution of critical sections with no data conflicts
ACS accelerates ALL critical sections
– Does not improve locality of shared data
ACS improves locality of shared data
ACS outperforms TLR (Rajwar+) by 18% (details in paper)
Conclusion
• Critical sections reduce performance and limit scalability
• Accelerate critical sections by executing them on a powerful core
• ACS reduces average execution time by:
– 34% compared to an equal-area SCMP
– 23% compared to an equal-area ACMP
• ACS improves scalability of 7 of the 12 workloads