18-742 Spring 2011
Parallel Computer Architecture
Lecture 8: From Symmetry to Asymmetry
Prof. Onur Mutlu
Carnegie Mellon University
Reviews

Due Sunday (Feb 6)

Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” ISCA 1996.
Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005.
2
Last Lecture

Piranha CMP
Evolution of Sun’s multi-core systems
  Wimpy to powerful
  Niagara to ROCK
3
More Powerful Cores in Sun ROCK

Advantages
+ Higher single-thread performance (MLP + ILP)
+ Better cache miss tolerance → Can reduce on-chip cache sizes

Disadvantages
- Bigger cores → Fewer cores → Lower parallel throughput (in terms of threads). How about each thread’s response time?
- More complex than Niagara cores (but simpler than conventional out-of-order execution) → Longer design time?
4
More Powerful Cores in Sun ROCK

Chaudhry talk, Aug 2008.
5
More Powerful Cores in Sun ROCK

Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun’s ROCK Processor,” ISCA 2009.
6
IBM POWER4



Tendler et al., “POWER4 System Microarchitecture,” IBM J R&D, 2002.
Another symmetric multi-core chip…
But, fewer and more powerful cores
7
IBM POWER4






2 cores, out-of-order execution
100-entry instruction window in each core
8-wide instruction fetch, issue, execute
Large, local+global hybrid branch predictor
1.5MB, 8-way L2 cache
Aggressive stream based prefetching
8
IBM POWER5

Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.
9
IBM POWER6

Le et al., “IBM POWER6 Microarchitecture,” IBM J R&D, 2007.
2 cores, in-order, high frequency (4.7 GHz)
8-wide fetch
Simultaneous multithreading in each core
Runahead execution in each core
  Similar to Sun ROCK
10
IBM POWER7

Kalla et al., “Power7: IBM’s Next-Generation Server Processor,” IEEE Micro 2010.
8 out-of-order cores, 4-way SMT in each core
TurboCore mode
  Can turn off cores so that other cores can be run at higher frequency
11
Large vs. Small Cores

Large Core
• Out-of-order
• Wide fetch, e.g. 4-wide
• Deeper pipeline
• Aggressive branch predictor (e.g. hybrid)
• Multiple functional units
• Trace cache
• Memory dependence speculation

Small Core
• In-order
• Narrow fetch, e.g. 2-wide
• Shallow pipeline
• Simple branch predictor (e.g. Gshare)
• Few functional units

Large cores are power inefficient:
e.g., 2x performance for 4x area (power)
12
Large vs. Small Cores

Grochowski et al., “Best of both Latency and Throughput,” ICCD 2004.
13
Tile-Large Approach

[Diagram: four large cores tiled (“Tile-Large”)]

Tile a few large cores
  IBM Power 5, AMD Barcelona, Intel Core2Quad, Intel Nehalem
+ High performance on single thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)
14
Tile-Small Approach

[Diagram: sixteen small cores tiled (“Tile-Small”)]

Tile many small cores
  Sun Niagara, Intel Larrabee, Tilera TILE (tile ultra-small)
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit)
15
Can we get the best of both worlds?

Tile Large
+ High performance on single thread, serial code sections (2 units)
- Low throughput on parallel program portions (8 units)

Tile Small
+ High throughput on the parallel part (16 units)
- Low performance on the serial part, single thread (1 unit), reduced single-thread performance compared to existing single-thread processors

Idea: Have both large and small on the same chip → Performance asymmetry
16
Asymmetric Chip Multiprocessor (ACMP)

[Diagrams: “Tile-Large” (4 large cores), “Tile-Small” (16 small cores), ACMP (1 large core + 12 small cores)]

Provide one large core and many small cores
+ Accelerate serial part using the large core (2 units)
+ Execute parallel part on all cores for high throughput (14 units)
17
Accelerating Serial Bottlenecks

Single thread → Large core

[Diagram: ACMP with one large core and twelve small cores; the single thread runs on the large core]

ACMP Approach
18
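A minimal sketch of the idea on this slide: map the lone thread of a serial phase onto the large core, and spread parallel-phase threads across all cores. The Python below is purely illustrative (hypothetical core IDs and a made-up assign_to_core callback, not from the lecture or any real OS API):

# Hypothetical ACMP layout: core 0 is the large core, cores 1..12 are small cores.
LARGE_CORE = 0
SMALL_CORES = list(range(1, 13))

def schedule(runnable_threads, assign_to_core):
    """Toy ACMP policy: serial phase -> large core, parallel phase -> all cores."""
    if len(runnable_threads) == 1:
        # Serial bottleneck: accelerate the single thread on the large core.
        assign_to_core(runnable_threads[0], LARGE_CORE)
    else:
        # Parallel phase: use every core (large one included) for throughput.
        cores = [LARGE_CORE] + SMALL_CORES
        for i, thread in enumerate(runnable_threads):
            assign_to_core(thread, cores[i % len(cores)])

# Example usage with a stub assignment function:
schedule(["t0"], lambda t, c: print(f"{t} -> core {c}"))
schedule([f"t{i}" for i in range(4)], lambda t, c: print(f"{t} -> core {c}"))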
Performance vs. Parallelism

Assumptions:
1. Small core takes an area budget of 1 and has performance of 1
2. Large core takes an area budget of 4 and has performance of 2
19
ACMP Performance vs. Parallelism
Area-budget = 16 small cores
[Diagrams: “Tile-Large” (4 large cores), “Tile-Small” (16 small cores), ACMP (1 large core + 12 small cores)]

                      “Tile-Large”   “Tile-Small”      ACMP
Large Cores                4               0              1
Small Cores                0              16             12
Serial Performance         2               1              2
Parallel Throughput    2 x 4 = 8      1 x 16 = 16   1x2 + 1x12 = 14
20
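The entries in this table follow directly from the area and performance assumptions on the previous slide. A minimal sketch (Python, not from the lecture) that reproduces them under those assumptions:

# Assumptions from the slides: small core = 1 area unit with performance 1;
# large core = 4 area units with performance 2; total area budget = 16 units.
SMALL_PERF, LARGE_PERF = 1, 2
SMALL_AREA, LARGE_AREA = 1, 4
BUDGET = 16

def config(num_large):
    """Return (serial performance, parallel throughput) for a chip with
    num_large large cores and the remaining area filled with small cores."""
    num_small = BUDGET - num_large * LARGE_AREA
    serial_perf = LARGE_PERF if num_large > 0 else SMALL_PERF
    parallel_throughput = num_large * LARGE_PERF + num_small * SMALL_PERF
    return serial_perf, parallel_throughput

print(config(4))  # "Tile-Large": (2, 8)
print(config(0))  # "Tile-Small": (1, 16)
print(config(1))  # ACMP:         (2, 14)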
Some Analysis

Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008.
Each chip is bounded to N BCEs (Base Core Equivalents)
One R-BCE core leaves N-R BCEs
Use the N-R BCEs for N-R base cores
Therefore, 1 + N - R cores per chip
For an N = 16 BCE chip:
  Symmetric: Four 4-BCE cores
  Asymmetric: One 4-BCE core & twelve 1-BCE base cores
21
Amdahl’s Law Modified

Serial fraction 1-F is the same, so serial time = (1 - F) / Perf(R)

Parallel fraction F
  One core at rate Perf(R)
  N-R cores at rate 1
  Parallel time = F / (Perf(R) + N - R)

Therefore, w.r.t. one base core:

  Asymmetric Speedup = 1 / [ (1 - F) / Perf(R) + F / (Perf(R) + N - R) ]
22
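To make the formula concrete, here is a small Python sketch of both the asymmetric speedup above and the symmetric speedup it is compared against on the following slides. It assumes Perf(R) = sqrt(R), the resource-to-performance model used in the Hill and Marty paper (the slides leave Perf(R) abstract):

import math

def perf(r):
    # Hill & Marty's assumption: a core built from r BCEs has performance sqrt(r).
    return math.sqrt(r)

def asymmetric_speedup(f, r, n):
    """One R-BCE core plus (N - R) base cores, per the slide's formula."""
    serial_time = (1 - f) / perf(r)
    parallel_time = f / (perf(r) + n - r)
    return 1.0 / (serial_time + parallel_time)

def symmetric_speedup(f, r, n):
    """N/R identical R-BCE cores, for comparison."""
    serial_time = (1 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1.0 / (serial_time + parallel_time)

# Reproduces the numbers annotated on the N = 256 plots that follow:
print(round(symmetric_speedup(0.9, 28, 256), 1))    # ~26.7 (9 cores)
print(round(asymmetric_speedup(0.9, 118, 256), 1))  # ~65.6 (139 cores)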
Asymmetric Multicore Chip, N = 256 BCEs

[Plot: Asymmetric Speedup (0-250) vs. R BCEs (1 to 256); curves for F = 0.5, 0.9, 0.975, 0.99, 0.999]

Number of cores = 1 (enhanced) + 256 - R (base)
23
Symmetric Multicore Chip, N = 256 BCEs

[Plot: Symmetric Speedup (0-250) vs. R BCEs (1 to 256); curves for F = 0.5, 0.9, 0.975, 0.99, 0.999]

Recall: F=0.9, R=28, Cores=9, Speedup=26.7
24
Asymmetric Multicore Chip, N = 256 BCEs

[Plot: Asymmetric Speedup (0-250) vs. R BCEs (1 to 256); curves for F = 0.5, 0.9, 0.975, 0.99, 0.999]

F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)
F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)

Asymmetric multi-core provides better speedup than symmetric multi-core when N is large
25
Asymmetric vs. Symmetric Cores

Advantages of Asymmetric
+ Can provide better performance when thread parallelism is limited
+ Can be more energy efficient
+ Schedule computation to the core type that can best execute it

Disadvantages
- Need to design more than one type of core. Always?
- Scheduling becomes more complicated
  - What computation should be scheduled on the large core?
  - Who should decide? HW vs. SW?
- Managing locality and load balancing can become difficult if threads move between cores (transparently to software)
- Cores have different demands from shared resources
26
How to Achieve Asymmetry

Static
  Type and power of cores fixed at design time
  Two approaches to design “faster cores”:
    High frequency
    Build a more complex, powerful core with an entirely different uarch
  Is static asymmetry natural? (chip-wide variations in frequency)

Dynamic
  Type and power of cores change dynamically
  Two approaches to dynamically create “faster cores”:
    Boost frequency dynamically (limited power budget)
    Combine small cores to enable a more complex, powerful core
  Is there a third, fourth, fifth approach?
27
Asymmetry via Boosting of Frequency

Static
  Due to process variations, cores might have different frequencies
  Simply hardwire/design cores to have different frequencies

Dynamic
  Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005.
  Dynamic voltage and frequency scaling
28
EPI Throttling

Goal: Minimize execution time of parallel programs while keeping power within a fixed budget

For best scalar and throughput performance, vary the energy expended per instruction (EPI) based on available parallelism
  P = EPI × IPS
  P = fixed power budget
  EPI = energy per instruction
  IPS = aggregate instructions retired per second

Idea: For a fixed power budget
  Run sequential phases on a high-EPI processor
  Run parallel phases on multiple low-EPI processors
29
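A minimal numeric sketch of the P = EPI × IPS tradeoff under a fixed power budget (the power and EPI values below are made up purely for illustration, not taken from the paper):

# Hypothetical numbers chosen only to illustrate P = EPI * IPS under a fixed budget.
POWER_BUDGET = 40.0  # watts

def ips(power, epi):
    """Instructions per second sustainable at this power: IPS = P / EPI."""
    return power / epi

# Sequential phase: spend the whole budget on one high-EPI (complex, fast) core.
seq_ips = ips(POWER_BUDGET, epi=4.0e-9)  # 4 nJ/instruction -> 10 GIPS on one thread

# Parallel phase: split the budget across 16 low-EPI (simple) cores.
par_ips = sum(ips(POWER_BUDGET / 16, epi=1.0e-9) for _ in range(16))  # 40 GIPS total

print(f"sequential: {seq_ips / 1e9:.0f} GIPS, parallel: {par_ips / 1e9:.0f} GIPS")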
EPI Throttling via DVFS

DVFS: Dynamic voltage and frequency scaling

In phases of low thread parallelism
  Run a few cores at high supply voltage and high frequency
In phases of high thread parallelism
  Run many cores at low supply voltage and low frequency
30
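As a rough illustration of this policy (the voltage/frequency operating points and threshold below are hypothetical, not values from the paper), the per-core V/f setting can be steered by how many threads are runnable:

# Hypothetical operating points; a real chip exposes a small set of V/f pairs.
HIGH_VF = (1.2, 2.00e9)  # (volts, Hz) for low-thread-count, boosted phases
LOW_VF = (0.9, 1.25e9)   # (volts, Hz) for high-thread-count, throughput phases

def pick_operating_points(num_runnable, num_cores=4, boost_threshold=2):
    """Toy DVFS policy: few threads -> few fast cores; many threads -> many slow cores."""
    if num_runnable <= boost_threshold:
        # Power-gate the unused cores (None) and boost the active ones.
        return [HIGH_VF] * num_runnable + [None] * (num_cores - num_runnable)
    return [LOW_VF] * num_cores

print(pick_operating_points(1))  # one core at high V/f, the rest gated off
print(pick_operating_points(4))  # all cores at low V/f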
Possible EPI Throttling Techniques

Grochowski et al., “Best of both Latency and Throughput,” ICCD 2004.
31
Boosting Frequency of a Small Core vs. Large Core

Frequency boosting is implemented on Intel Nehalem and IBM POWER7

Advantages of Boosting Frequency
+ Very simple to implement; no need to design a new core
+ Parallel throughput does not degrade when TLP is high
+ Preserves locality of boosted thread

Disadvantages
- Does not improve performance if the thread is memory bound
- Does not reduce Cycles per Instruction (recall the performance equation: Execution Time = Instruction Count × CPI × Cycle Time; boosting frequency shrinks only the cycle time)
- Changing frequency/voltage can take longer than switching to a large core
32
EPI Throttling (Annavaram et al., ISCA’05)

Static AMP
  Duty cycles set once prior to program run
  Parallel phases run on 3P/1.25GHz
  Sequential phases run on 1P/2GHz
  Affinity guarantees sequential on 1P and parallel on 3P
  Benchmarks that rapidly transition between sequential and parallel phases

Dynamic AMP
  Duty cycle changes during program run
  Parallel phases run on all or a subset of four processors
  Sequential phases of execution on 1P/2GHz
  Benchmarks with long sequential and parallel phases
33
EPI Throttling (Annavaram et al., ISCA’05)

Evaluation on base SMP: 4-way 2GHz Xeon, 2MB L3, 4GB memory
Hand-modified programs
  OMP threads set to 3 for static AMP
  Calls to set affinity in each thread for static AMP
  Calls to change duty cycle and to set affinity in dynamic AMP
34
EPI Throttling (Annavaram et al., ISCA’05)

Frequency boosting AMP improves performance compared to 4-way SMP for many applications
35
EPI Throttling

Why does Frequency Boosting (FB) AMP not always improve performance?
  Loss of throughput in static AMP (only 3 processors in parallel portion)
  Rapid transitions between serial and parallel phases
    Data/thread migration and throttling overhead
  Boosting frequency does not help memory-bound phases

Is this really the best way of using FB-AMP?
36
Review So Far

Symmetric Multicore
  Evolution of Sun’s and IBM’s multicore systems and design choices
  Niagara, Niagara 2, ROCK
  IBM POWERx

Asymmetric Multicore
  Motivation
  Functional vs. Performance Asymmetry
  Static vs. Dynamic Asymmetry
  EPI Throttling
37
Design Tradeoffs in ACMP (I)

Hardware Design Effort vs. Programmer Effort
- ACMP requires more design effort
+ Performance becomes less dependent on the length of the serial part
+ Can reduce programmer effort: serial portions are not as bad for performance with ACMP

Migration Overhead vs. Accelerated Serial Bottleneck
+ Performance gain from faster execution of the serial portion
- Performance loss when architectural state is migrated/switched in when the master changes
  Can be alleviated with multithreading and hidden by a long serial portion
- Serial portion incurs cache misses when it needs data generated by the parallel portion
- Parallel portion incurs cache misses when it needs data generated by the serial portion
38
Design Tradeoffs in ACMP (II)

Fewer Threads vs. Accelerated Serial Bottleneck
+ Performance gain from the accelerated serial portion
- Performance loss due to unavailability of L threads in the parallel portion
  This need not be the case → the large core can implement multithreading to improve parallel throughput
  As the number of cores (threads) on chip increases, the fractional loss in parallel performance decreases
39
Uses of Asymmetry

So far:
  Improvement in serial performance (sequential bottleneck)

What else can we do with asymmetry?
  Energy reduction?
  Energy/performance tradeoff?
  Improvement in parallel portion?
40