18-742 Spring 2011
Parallel Computer Architecture
Lecture 8: From Symmetry to Asymmetry
Prof. Onur Mutlu
Carnegie Mellon University
Reviews
Due Sunday (Feb 6)
Tullsen et al., “Exploiting choice: instruction fetch and issue on
an implementable simultaneous multithreading processor,”
ISCA 1996.
Spracklen and Abraham, “Chip Multithreading: Opportunities
and Challenges,” HPCA Industrial Session, 2005.
Last Lecture
Piranha CMP
Evolution of Sun’s multi-core systems
Wimpy to powerful
Niagara to ROCK
More Powerful Cores in Sun ROCK
Advantages
+ Higher single-thread performance (MLP + ILP)
+ Better cache miss tolerance → Can reduce on-chip cache sizes
Disadvantages
- Bigger cores → Fewer cores → Lower parallel throughput (in
terms of threads).
How about each thread’s response time?
- More complex than Niagara cores (but simpler than
conventional out-of-order execution) → Longer design time?
More Powerful Cores in Sun ROCK
Chaudhry talk, Aug 2008.
More Powerful Cores in Sun ROCK
Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline
Architecture Implemented in Sun's ROCK Processor,” ISCA 2009
IBM POWER4
Tendler et al., “POWER4 system microarchitecture,” IBM J
R&D, 2002.
Another symmetric multi-core chip…
But, fewer and more powerful cores
IBM POWER4
2 cores, out-of-order execution
100-entry instruction window in each core
8-wide instruction fetch, issue, execute
Large, local+global hybrid branch predictor
1.5MB, 8-way L2 cache
Aggressive stream based prefetching
IBM POWER5
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE
Micro 2004.
IBM POWER6
Le et al., “IBM POWER6 microarchitecture,” IBM J R&D, 2007.
2 cores, in order, high frequency (4.7 GHz)
8 wide fetch
Simultaneous multithreading in each core
Runahead execution in each core
Similar to Sun ROCK
IBM POWER7
Kalla et al., “Power7: IBM’s Next-Generation Server
Processor,” IEEE Micro 2010.
8 out-of-order cores, 4-way SMT in each core
TurboCore mode
Can turn off cores so that other cores can be run at higher
frequency
Large vs. Small Cores
Large Core
• Out-of-order
• Wide fetch (e.g., 4-wide)
• Deeper pipeline
• Aggressive branch predictor (e.g., hybrid)
• Multiple functional units
• Trace cache
• Memory dependence speculation
Small Core
• In-order
• Narrow fetch (e.g., 2-wide)
• Shallow pipeline
• Simple branch predictor (e.g., gshare)
• Few functional units
Large Cores are power inefficient:
e.g., 2x performance for 4x area (power)
Large vs. Small Cores
Grochowski et al., “Best of both Latency and Throughput,”
ICCD 2004.
Tile-Large Approach
[Figure: chip tiled with four large cores]
“Tile-Large”
Tile a few large cores
IBM Power 5, AMD Barcelona, Intel Core2Quad, Intel Nehalem
+ High performance on single thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)
Tile-Small Approach
[Figure: chip tiled with 16 small cores]
“Tile-Small”
Tile many small cores
Sun Niagara, Intel Larrabee, Tilera TILE (tile ultra-small)
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit)
Can we get the best of both worlds?
Tile Large
+ High performance on single thread, serial code sections (2
units)
- Low throughput on parallel program portions (8 units)
Tile Small
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit),
reduced single-thread performance compared to existing single
thread processors
Idea: Have both large and small on the same chip
Performance asymmetry
Asymmetric Chip Multiprocessor (ACMP)
[Figure: three chips side by side: “Tile-Large” (4 large cores), “Tile-Small” (16 small cores), and ACMP (1 large core plus 12 small cores)]
ACMP
Provide one large core and many small cores
+ Accelerate serial part using the large core (2 units)
+ Execute parallel part on all cores for high throughput (14
units)
Accelerating Serial Bottlenecks
Single thread → Large core
[Figure: ACMP approach, with one large core and 12 small cores; the single serial thread runs on the large core]
Performance vs. Parallelism
Assumptions:
1. Small core takes an area budget of 1 and has
performance of 1
2. Large core takes an area budget of 4 and has
performance of 2
ACMP Performance vs. Parallelism
Area-budget = 16 small cores
[Figure: “Tile-Large” (4 large cores), “Tile-Small” (16 small cores), and ACMP (1 large core plus 12 small cores)]

                       “Tile-Large”   “Tile-Small”   ACMP
Large cores            4              0              1
Small cores            0              16             12
Serial performance     2              1              2
Parallel throughput    2 x 4 = 8      1 x 16 = 16    1 x 2 + 1 x 12 = 14
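The three configurations above follow directly from the two area/performance assumptions. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and not from the lecture. The Amdahl-style `speedup` helper is an added illustration that assumes serial code runs on the fastest core and parallel code uses every core.

```python
# Assumptions from the slides: a small core costs 1 area unit and delivers
# performance 1; a large core costs 4 area units and delivers performance 2.
SMALL_AREA, SMALL_PERF = 1, 1
LARGE_AREA, LARGE_PERF = 4, 2


def chip_metrics(num_large, area_budget=16):
    """Return (num_small, serial_perf, parallel_throughput) for a chip that
    spends the area left after num_large large cores on small cores."""
    remaining = area_budget - num_large * LARGE_AREA
    if remaining < 0:
        raise ValueError("large cores exceed the area budget")
    num_small = remaining // SMALL_AREA
    # Serial code runs on the fastest core available.
    serial_perf = LARGE_PERF if num_large > 0 else SMALL_PERF
    # Parallel code uses every core on the chip.
    parallel = num_large * LARGE_PERF + num_small * SMALL_PERF
    return num_small, serial_perf, parallel


def speedup(parallel_fraction, serial_perf, parallel_throughput):
    """Amdahl-style speedup relative to one small core (illustrative)."""
    serial_time = (1 - parallel_fraction) / serial_perf
    parallel_time = parallel_fraction / parallel_throughput
    return 1 / (serial_time + parallel_time)


configs = {"Tile-Large": 4, "Tile-Small": 0, "ACMP": 1}
for name, num_large in configs.items():
    num_small, serial, parallel = chip_metrics(num_large)
    print(name, num_large, num_small, serial, parallel,
          round(speedup(0.9, serial, parallel), 2))
```

Varying the parallel fraction passed to `speedup` shows where each design wins: Tile-Small only pulls ahead when almost all of the program is parallel.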
Some Analysis
Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE
Computer 2008.
Each Chip Bounded to N BCEs (Base Core Equivalents)
One R-BCE Core leaves N-R BCEs
Use N-R BCEs for N-R Base Cores
Therefore, 1 + N - R Cores per Chip
For an N = 16 BCE Chip:
Symmetric: Four 4-BCE cores
Asymmetric: One 4-BCE core
& Twelve 1-BCE base cores
Amdahl’s Law Modified
Serial Fraction 1-F same, so time = (1 – F) / Perf(R)
Parallel Fraction F
One core at rate Perf(R)
N-R cores at rate 1
Parallel time = F / (Perf(R) + N - R)
Therefore, w.r.t. one base core:
$$\text{Asymmetric Speedup} = \frac{1}{\dfrac{1-F}{\mathrm{Perf}(R)} + \dfrac{F}{\mathrm{Perf}(R) + N - R}}$$
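The asymmetric speedup can be evaluated directly from the serial and parallel times given above. This is a minimal sketch, not the authors' code. The slide leaves Perf(R) abstract; the sketch assumes Perf(R) = sqrt(R), the modeling assumption used in Hill and Marty's paper, and the function names are illustrative.

```python
import math


def perf(r):
    """Performance of one core built from r BCEs.
    ASSUMPTION: Perf(R) = sqrt(R), as in Hill and Marty's paper; the slide
    itself does not fix a particular Perf(R)."""
    return math.sqrt(r)


def asymmetric_speedup(f, n, r):
    """Speedup over one base core for a chip with one r-BCE core plus
    (n - r) base cores, given parallel fraction f."""
    # Serial fraction runs on the large core alone.
    serial_time = (1 - f) / perf(r)
    # Parallel fraction runs on the large core and all n - r base cores.
    parallel_time = f / (perf(r) + n - r)
    return 1 / (serial_time + parallel_time)


# The N = 16 example: one 4-BCE core and twelve 1-BCE base cores.
print(round(asymmetric_speedup(0.9, 16, 4), 2))
```

With sqrt(4) = 2, the N = 16, R = 4 case has serial performance 2 and parallel throughput 14, so F = 0.9 gives a speedup of about 8.75.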
Asymmetric Multicore Chip, N = 256 BCEs
[Figure: Asymmetric Speedup (y-axis, 0 to 250) vs. R BCEs (x-axis, 1 to 256 in powers of two), one curve each for F = 0.5, 0.9, 0.975, 0.99, 0.999]
Number of Cores = 1 (Enhanced) + 256 – R (Base)
Symmetric Multicore Chip, N = 256 BCEs
[Figure: Symmetric Speedup (y-axis, 0 to 250) vs. R BCEs (x-axis, 1 to 256 in powers of two), one curve each for F = 0.5, 0.9, 0.975, 0.99, 0.999]
Recall F=0.9, R=28, Cores=9, Speedup=26.7
Asymmetric Multicore Chip, N = 256 BCEs
[Figure: Asymmetric Speedup (y-axis, 0 to 250) vs. R BCEs (x-axis, 1 to 256 in powers of two), one curve each for F = 0.5, 0.9, 0.975, 0.99, 0.999; values in parentheses below are for the symmetric chip]
F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)
F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)
Asymmetric multi-core provides better speedup than
symmetric multi-core when N is large
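The symmetric and asymmetric numbers quoted for N = 256 can be checked with a short script. This is a sketch under two assumptions that are not stated on these slides: Perf(R) = sqrt(R), and the symmetric speedup model from Hill and Marty's paper (N/R identical cores of R BCEs each). All function names are illustrative.

```python
import math


def perf(r):
    # ASSUMPTION: Perf(R) = sqrt(R), following Hill and Marty.
    return math.sqrt(r)


def symmetric_speedup(f, n, r):
    """N/R identical cores of R BCEs each (Hill and Marty's symmetric model)."""
    serial_time = (1 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1 / (serial_time + parallel_time)


def asymmetric_speedup(f, n, r):
    """One R-BCE core plus N - R base cores."""
    serial_time = (1 - f) / perf(r)
    parallel_time = f / (perf(r) + n - r)
    return 1 / (serial_time + parallel_time)


def best_r(speedup_fn, f, n):
    """Integer R in [1, n] that maximizes the given speedup function."""
    return max(range(1, n + 1), key=lambda r: speedup_fn(f, n, r))


N = 256
for f in (0.9, 0.99):
    r_sym = best_r(symmetric_speedup, f, N)
    r_asym = best_r(asymmetric_speedup, f, N)
    print(f"F={f}: symmetric R={r_sym}, "
          f"speedup={symmetric_speedup(f, N, r_sym):.1f}; "
          f"asymmetric R={r_asym}, "
          f"speedup={asymmetric_speedup(f, N, r_asym):.1f}")
```

The brute-force search over integer R is a simplification; it also lets N/R be fractional in the symmetric case, as the analytical model does.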
Asymmetric vs. Symmetric Cores
Advantages of Asymmetric
+ Can provide better performance when thread parallelism is
limited
+ Can be more energy efficient
+ Schedule computation to the core type that can best execute it
Disadvantages
- Need to design more than one type of core. Always?
- Scheduling becomes more complicated
- What computation should be scheduled on the large core?
- Who should decide? HW vs. SW?
- Managing locality and load balancing can become difficult if
threads move between cores (transparently to software)
- Cores have different demands from shared resources
How to Achieve Asymmetry
Static
Type and power of cores fixed at design time
Two approaches to design “faster cores”:
High frequency
Build a more complex, powerful core with entirely different uarch
Is static asymmetry natural? (chip-wide variations in frequency)
Dynamic
Type and power of cores change dynamically
Two approaches to dynamically create “faster cores”:
Boost frequency dynamically (limited power budget)
Combine small cores to enable a more complex, powerful core
Is there a third, fourth, fifth approach?
Asymmetry via Boosting of Frequency
Static
Due to process variations, cores might have different
frequency
Simply hardwire/design cores to have different frequencies
Dynamic
Annavaram et al., “Mitigating Amdahl’s Law Through EPI
Throttling,” ISCA 2005.
Dynamic voltage and frequency scaling
EPI Throttling
Goal: Minimize execution time of parallel programs while
keeping power within a fixed budget
For best scalar and throughput performance, vary energy
expended per instruction (EPI) based on available
parallelism
P = EPI • IPS
P = fixed power budget
EPI = energy per instruction
IPS = aggregate instructions retired per second
Idea: For a fixed power budget
Run sequential phases on high-EPI processor
Run parallel phases on multiple low-EPI processors
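The relation P = EPI • IPS can be turned around to see what a fixed power budget buys. The sketch below is only an illustration: the power budget and both EPI values are hypothetical numbers, not measurements from the paper.

```python
def aggregate_ips(power_budget_watts, epi_joules):
    """Aggregate instructions per second sustainable under a fixed power
    budget, from P = EPI * IPS."""
    return power_budget_watts / epi_joules


# Hypothetical numbers, chosen only for illustration.
POWER_BUDGET = 100.0   # watts
HIGH_EPI = 10e-9       # joules per instruction on one aggressive core
LOW_EPI = 2.5e-9       # joules per instruction on each simple core

# Sequential phase: the whole budget goes to the single high-EPI core.
sequential_ips = aggregate_ips(POWER_BUDGET, HIGH_EPI)
# Parallel phase: the same budget is spread over low-EPI cores.
parallel_ips = aggregate_ips(POWER_BUDGET, LOW_EPI)

print(sequential_ips, parallel_ips, parallel_ips / sequential_ips)
```

The higher aggregate IPS of the low-EPI configuration is only reachable when there are enough threads to keep those cores busy, which is why the choice depends on the available parallelism.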
EPI Throttling via DVFS
DVFS: Dynamic voltage and frequency scaling
In phases of low thread parallelism
Run a few cores at high supply voltage and high frequency
In phases of high thread parallelism
Run many cores at low supply voltage and low frequency
Possible EPI Throttling Techniques
Grochowski et al., “Best of both Latency and Throughput,”
ICCD 2004.
Boosting Frequency of a Small Core vs. Large Core
Frequency boosting implemented on Intel Nehalem, IBM
POWER7
Advantages of Boosting Frequency
+ Very simple to implement; no need to design a new core
+ Parallel throughput does not degrade when TLP is high
+ Preserves locality of boosted thread
Disadvantages
- Does not improve performance if thread is memory bound
- Does not reduce Cycles per Instruction (remember the
performance equation?)
- Changing frequency/voltage can take longer than switching to a
large core
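The first two disadvantages follow from the performance equation, execution time = instructions × CPI × cycle time. A rough sketch follows, assuming memory stall time is fixed in wall-clock terms (main memory does not get faster when the core clock is boosted). The workload numbers and function names are hypothetical.

```python
def execution_time(instructions, core_cpi, freq_hz, mem_stall_seconds=0.0):
    """Time = instructions * CPI / frequency, plus memory stall time that is
    ASSUMED to be independent of the core frequency."""
    return instructions * core_cpi / freq_hz + mem_stall_seconds


def boost_speedup(instructions, core_cpi, base_hz, boosted_hz,
                  mem_stall_seconds):
    """Speedup from raising the core frequency from base_hz to boosted_hz."""
    base = execution_time(instructions, core_cpi, base_hz, mem_stall_seconds)
    boosted = execution_time(instructions, core_cpi, boosted_hz,
                             mem_stall_seconds)
    return base / boosted


# Hypothetical workload: 1e9 instructions, core CPI of 1, 2 GHz -> 3 GHz.
compute_bound = boost_speedup(1e9, 1.0, 2e9, 3e9, mem_stall_seconds=0.0)
memory_bound = boost_speedup(1e9, 1.0, 2e9, 3e9, mem_stall_seconds=2.0)

print(compute_bound, memory_bound)
```

Boosting shortens only the cycle-time term. A larger core can also lower CPI, which frequency boosting never does.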
EPI Throttling (Annavaram et al., ISCA’05)
Static AMP
Duty cycles set once prior to program run
Parallel phases run on 3P/1.25GHz
Sequential phases run on 1P/2GHz
Affinity guarantees sequential on 1P and parallel on 3P
Benchmarks that rapidly transition between sequential and
parallel phases
Dynamic AMP
Duty cycle changes during program run
Parallel phases run on all or a subset of four processors
Sequential phases of execution on 1P/2GHz
Benchmarks with long sequential and parallel phases
EPI Throttling (Annavaram et al., ISCA’05)
Evaluation on Base SMP: 4-way 2GHz Xeon,
2MB L3, 4GB Memory
Hand-modified programs
OMP threads set to 3 for static AMP
Calls to set affinity in each thread for static AMP
Calls to change duty cycle and to set affinity in dynamic AMP
EPI Throttling (Annavaram et al., ISCA’05)
Frequency boosting AMP improves performance compared
to 4-way SMP for many applications
EPI Throttling
Why does Frequency Boosting (FB) AMP not always
improve performance?
Loss of throughput in static AMP (only 3 processors in
parallel portion)
Rapid transitions between serial and parallel phases
Is this really the best way of using FB-AMP?
Data/thread migration and throttling overhead
Boosting frequency does not help memory-bound phases
Review So Far
Symmetric Multicore
Evolution of Sun’s and IBM’s Multicore systems and design
choices
Niagara, Niagara 2, ROCK
IBM POWERx
Asymmetric multicore
Motivation
Functional vs. Performance Asymmetry
Static vs. Dynamic Asymmetry
EPI Throttling
Design Tradeoffs in ACMP (I)
Hardware Design Effort vs. Programmer Effort
- ACMP requires more design effort
+ Performance becomes less dependent on length of the serial part
+ Can reduce programmer effort: Serial portions are not as bad for
performance with ACMP
Migration Overhead vs. Accelerated Serial Bottleneck
+ Performance gain from faster execution of serial portion
- Performance loss when architectural state is migrated/switched
in when the master changes
Can be alleviated with multithreading and hidden by long serial portion
- Serial portion incurs cache misses when it needs data
generated by the parallel portion
- Parallel portion incurs cache misses when it needs data
generated by the serial portion
Design Tradeoffs in ACMP (II)
Fewer threads vs. accelerated serial bottleneck
+ Performance gain from accelerated serial portion
- Performance loss due to unavailability of L threads in parallel
portion
This need not be the case → Large core can implement
multithreading to improve parallel throughput
As the number of cores (threads) on chip increases, fractional
loss in parallel performance decreases
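The last point can be quantified with the earlier area/performance assumptions (a large core occupies the area of 4 small cores and delivers the throughput of 2). A minimal sketch with illustrative names, assuming the large core runs a single thread:

```python
def parallel_loss_fraction(area_budget, large_area=4, large_perf=2):
    """Fraction of Tile-Small parallel throughput lost when large_area small
    cores are replaced by one large core of throughput large_perf."""
    tile_small = area_budget                       # 1 unit per small core
    acmp = large_perf + (area_budget - large_area)
    return (tile_small - acmp) / tile_small


for budget in (16, 64, 256):
    print(budget, parallel_loss_fraction(budget))
```

Under these assumptions the loss is 2/N of the Tile-Small throughput, so it shrinks from 12.5% at a budget of 16 to under 1% at 256.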
Uses of Asymmetry
So far:
Improvement in serial performance (sequential bottleneck)
What else can we do with asymmetry?
Energy reduction?
Energy/performance tradeoff?
Improvement in parallel portion?