18-742 Spring 2011
Parallel Computer Architecture
Lecture 7: Symmetric Multi-Core
Prof. Onur Mutlu
Carnegie Mellon University
Research Project

Submit your proposal via Blackboard by midnight today.
2
Reviews

Due Tuesday (Jan 25)
- Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984.
- Kelm et al., "Cohesion: a hybrid memory model for accelerators," ISCA 2010.

Due Friday (Jan 28)
- Suleman et al., "Data Marshaling for Multi-Core Architectures," ISCA 2010.
3
Review: Multi-Core

- Idea: Put multiple processors on the same die.
- Technology scaling (Moore's Law) enables more transistors to be placed on the same die area.
- What else could you do with the die area you dedicate to multiple processors?
  - Have a bigger, more powerful core
  - Have larger caches in the memory hierarchy
  - Simultaneous multithreading
  - Integrate platform components on chip (e.g., network interface, memory controllers)
4
Review: Large Superscalar vs. Multi-Core

- Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
5
Review: Comparison Points
6
Review: Multi-Core vs. Large Superscalar

Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications

Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
7
Review: Large Superscalar vs. Multi-Core

- Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.

- Technology push
  - Instruction issue queue size limits the cycle time of the superscalar, OoO processor → diminishing performance
  - Quadratic increase in complexity with issue width
  - Large, multi-ported register files to support large instruction windows and issue widths → reduced frequency or longer RF access, diminishing performance

- Application pull
  - Integer applications: little parallelism?
  - FP applications: abundant loop-level parallelism
  - Others (transaction proc., multiprogramming): CMP better fit
8
Review: Why Multi-Core?

Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files, larger issue width (and associated costs) to have many threads → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing at the pipeline and memory system reduces both single-thread and parallel application performance
9
Review: Why Multi-Core?

Alternative: Integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU intensive code sections)
10
Why Multi-Core?

Alternative: More scalable superscalar, out-of-order engines

Clustered superscalar processors (with multithreading)
+ Simpler to design than superscalar, more scalable than simultaneous multithreading (less resource sharing)
+ Can improve both single-thread and parallel application performance
- Diminishing performance returns on single thread: Clustering reduces IPC performance compared to monolithic superscalar. Why?
- Parallel performance limited by shared fetch bandwidth
- Difficult to design
11
Why Multi-Core?

Alternative: Traditional symmetric multiprocessors
+ Smaller die size (for the same processing core)
+ More memory bandwidth (no pin bottleneck)
+ Fewer shared resources → less contention between threads
- Long latencies between cores (need to go off chip) → shared data accesses limit performance → parallel application scalability is limited
- Worse resource efficiency due to less sharing → worse power/energy efficiency
12
Why Multi-Core?

Other alternatives?
- Dataflow?
- Vector processors (SIMD)?
- Integrating DRAM on chip?
- Reconfigurable logic? (general purpose?)
13
Review: Multi-Core Alternatives

- Bigger, more powerful single core
- Bigger caches
- (Simultaneous) multithreading
- Integrate platform components on chip instead
- More scalable superscalar, out-of-order engines
- Traditional symmetric multiprocessors
- Dataflow?
- Vector processors (SIMD)?
- Integrating DRAM on chip?
- Reconfigurable logic? (general purpose?)
- Other alternatives?
14
Piranha Chip Multiprocessor

- Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," ISCA 2000.
- An early example of a symmetric multi-core processor
- Large-scale server based on CMP nodes
- Designed for commercial workloads

Commercial Workload Characteristics

- Memory system is the main bottleneck
  - Very high CPI
  - Execution time dominated by memory stall times
  - Instruction stalls as important as data stalls
  - Fast/large L2 caches are critical
- Very poor Instruction Level Parallelism (ILP) with existing techniques
  - Frequent hard-to-predict branches
  - Large L1 miss ratios
  - Small gains from wide-issue out-of-order techniques
- No need for floating point and multimedia units
16
Piranha Processing Node

Next few slides are from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing."

[Figure: block diagram of one Piranha processing node, built up incrementally over several slides: 8 CPUs with per-core I$ and D$, an intra-chip switch (ICS), per-core L2$ slices, memory controllers (MEM-CTL), protocol engines (HE and RE), and a router with off-chip links (4 links @ 8GB/s; 8 memory banks @ 1.6GB/sec).]

- Alpha core: 1-issue, in-order, 500MHz
- L1 caches: I&D, 64KB, 2-way
- Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
- L2 cache: shared, 1MB, 8-way
- Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec)
- Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving
- System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth (4 links @ 8GB/s)
(A small C summary of these parameters follows this slide.)
25
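As a quick summary of the node parameters above, here is a small, self-contained C sketch. The struct and field names are mine (illustrative only, not from the paper); the numbers are the ones quoted on the slide, and the two printouts simply check that the per-bank and per-link figures add up to the quoted aggregate bandwidths.

    #include <stdio.h>

    /* Hypothetical record of the Piranha node parameters listed above. */
    struct piranha_node {
        int cpus;              /* single-issue, in-order Alpha cores        */
        int core_mhz;          /* 500 MHz                                   */
        int l1_kb, l1_ways;    /* per-core I and D caches: 64 KB, 2-way     */
        int l2_mb, l2_ways;    /* shared L2: 1 MB, 8-way                    */
        int mem_banks;         /* 8 RDRAM banks                             */
        double bank_gbps;      /* 1.6 GB/s per bank                         */
        int links;             /* 4 router links                            */
        double link_gbps;      /* 8 GB/s per link                           */
    };

    int main(void) {
        struct piranha_node p = { 8, 500, 64, 2, 1, 8, 8, 1.6, 4, 8.0 };

        /* 8 banks x 1.6 GB/s = 12.8 GB/s, the slide's memory controller figure */
        printf("memory bandwidth: %.1f GB/s\n", p.mem_banks * p.bank_gbps);

        /* 4 links x 8 GB/s = 32 GB/s total system interconnect bandwidth */
        printf("interconnect bandwidth: %.1f GB/s\n", p.links * p.link_gbps);
        return 0;
    }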
Inter-Node Coherence Protocol Engine
26
Piranha System
27
Piranha I/O Node
28
Sun Niagara (UltraSPARC T1)

- Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro 2005.
29
Niagara Core

- 4-way fine-grain multithreaded, 6-stage, dual-issue in-order
- Round robin thread selection (unless cache miss; a simple sketch of this policy follows the slide)
- Shared FP unit among cores
30
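To make the "round robin unless cache miss" policy concrete, here is a minimal C sketch of one way such thread selection could work. It is only my illustration of the stated policy, not Sun's actual selection logic; the struct and names are hypothetical.

    #define NUM_THREADS 4            /* Niagara 1: 4 threads per core */

    struct thread_state {
        int waiting_on_miss;         /* set while the thread's memory access is outstanding */
    };

    /* Pick the next thread to issue from, rotating round-robin starting after
     * the last selected thread and skipping threads stalled on a cache miss.
     * Returns -1 if every thread is stalled (the core idles that cycle). */
    int select_thread(struct thread_state t[NUM_THREADS], int *last_selected) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int cand = (*last_selected + i) % NUM_THREADS;
            if (!t[cand].waiting_on_miss) {
                *last_selected = cand;
                return cand;
            }
        }
        return -1;
    }

A real fetch stage would invoke this every cycle and also mark threads unavailable on other long-latency events, but round-robin-with-skip is the essential idea.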
Niagara Design Point

Also designed for commercial applications
31
Sun Niagara II (UltraSPARC T2)

- 8 SPARC cores, 8 threads/core, 8 pipeline stages. 16 KB I$ per core, 8 KB D$ per core. FP, graphics, and crypto units per core.
- 4 MB shared L2, 8 banks, 16-way set associative.
- 4 dual-channel FBDIMM memory controllers.
- x8 PCI-Express @ 2.5 Gb/s.
- Two 10G Ethernet ports @ 3.125 Gb/s.
32
Chip Multithreading (CMT)

- Spracklen and Abraham, "Chip Multithreading: Opportunities and Challenges," HPCA Industrial Session, 2005.

- Idea: Chip multiprocessor where each core is multithreaded
  - Niagara 1/2: fine grain multithreading
  - IBM POWER5: simultaneous multithreading

- Motivation: Tolerate memory latency better
  - A simple core stays idle on a cache miss
  - Multithreading enables tolerating cache miss latency when there is TLP (a rough worked example follows this slide)
33
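As a rough back-of-the-envelope illustration of this motivation (hypothetical numbers, not from the slide): if a single thread on a simple in-order core misses the cache once every 25 instructions and each miss stalls the core for 100 cycles, the core does useful work only about 25 / (25 + 100) = 20% of the time. Interleaving four such threads lets the core hide most of one thread's 100-cycle stall behind the other threads' work, pushing utilization toward 4 × 20% = 80% whenever enough TLP exists.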
CMT (CMP + MT) vs. CMP

Advantages of adding multithreading to each core
+ Better memory latency tolerance when there are enough threads
+ Fine-grained multithreading can simplify core design (no need for branch prediction, dependency checking)
+ Potentially better utilization of core, cache, memory resources
+ Shared instructions and data among threads not replicated
+ When one thread is not using a resource, another can

Disadvantages
- Reduced single-thread performance (a thread does not have the core and L1 caches to itself)
- More pressure on the shared resources (cache, off-chip bandwidth) → more resource contention
- Applications with limited TLP do not benefit
34
Sun ROCK

- Chaudhry et al., "Rock: A High-Performance Sparc CMT Processor," IEEE Micro, 2009.
- Chaudhry et al., "Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor," ISCA 2009.

- Goals:
  - Maximize throughput when threads are available
  - Boost single-thread performance when threads are not available and on cache misses

- Ideas:
  - Runahead on a cache miss → ahead thread executes miss-independent instructions, behind thread executes dependent instructions
  - Branch prediction (gshare)
35
Sun ROCK

- 16 cores, 2 threads per core (fewer threads than Niagara 2)
- 4 cores share a 32KB instruction cache
- 2 cores share a 32KB data cache
- 2MB L2 cache (smaller than Niagara 2)
36
Runahead Execution (I)

- A simple pre-execution method for prefetching purposes
- Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," HPCA 2003.

- When the oldest instruction is a long-latency cache miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Speculatively pre-execute instructions
  - The purpose of pre-execution is to generate prefetches
  - L2-miss dependent instructions are marked INV and dropped
- Runahead mode ends when the original miss returns
  - Checkpoint is restored and normal execution resumes
(A pseudocode sketch of this control flow follows the slide.)
37
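The control flow above maps naturally onto pseudocode. The C-style sketch below is only an illustration of the policy on the slide, written against hypothetical helper routines (oldest_in_window, checkpoint_state, miss_returned, and so on) that stand in for a core's internals; it is not the HPCA 2003 implementation.

    /* Pseudocode sketch of runahead; all types and helpers are hypothetical. */
    void cycle(core_t *core) {
        inst_t *oldest = oldest_in_window(core);

        /* Oldest instruction is a long-latency (L2) miss: checkpoint, enter runahead. */
        if (!core->runahead && is_long_latency_miss(oldest)) {
            checkpoint_state(core);
            core->runahead = 1;
        }

        if (core->runahead) {
            inst_t *inst = fetch_and_decode(core);
            if (depends_on_inv(inst))
                mark_inv_and_drop(inst);     /* L2-miss dependent: result unknown  */
            else
                pre_execute(inst);           /* loads/stores here act as prefetches */

            if (miss_returned(oldest)) {     /* original miss is back               */
                restore_checkpoint(core);    /* discard all speculative state       */
                core->runahead = 0;          /* normal execution resumes            */
            }
        } else {
            execute_normally(core);
        }
    }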
Runahead Execution (II)

[Timeline figure comparing the two cases.
Small window: Compute → Load 1 Miss → stall on Miss 1 → Compute → Load 2 Miss → stall on Miss 2 → Compute; the two misses are serviced one after the other.
Runahead: Compute → Load 1 Miss → Runahead (Load 2 Miss is generated during Miss 1) → Compute → Load 1 Hit, Load 2 Hit → Compute; Miss 2 overlaps with Miss 1, yielding Saved Cycles.]
38
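To put hypothetical numbers on this figure: suppose each miss takes 300 cycles to service and the compute between Load 1 and Load 2 takes 50 cycles. With a small window the misses serialize, so from Load 1's miss it takes roughly 300 + 50 + 300 = 650 cycles until Load 2's data is available. With runahead, Load 2's miss is issued during Miss 1's 300 cycles, so after the checkpoint is restored the 50 cycles of compute run and Load 2 (roughly) hits, for about 300 + 50 = 350 cycles, saving on the order of 300 cycles because the second miss is overlapped with the first.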
Runahead Execution (III)

Advantages
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement, most of the hardware is already built in

Disadvantages
-- Extra executed instructions

Limitations
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by available Memory Level Parallelism

Mutlu et al., "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," IEEE Micro Jan/Feb 2006.
39
Performance of Runahead Execution

[Bar chart: micro-operations per cycle (0.0 to 1.3) for four configurations -- no prefetcher/no runahead, only prefetcher (baseline), only runahead, and prefetcher + runahead -- across the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS workload suites and their average (AVG). Per-suite labels give the improvement of prefetcher + runahead over the prefetcher-only baseline, ranging from roughly 12% to 52%.]
40
Sun ROCK Cores

- Load miss in L1 cache starts parallelization using 2 HW threads
- Ahead thread
  - Checkpoints state and executes speculatively
  - Instructions independent of load miss are speculatively executed
  - Load miss(es) and dependent instructions are deferred to behind thread
- Behind thread
  - Executes deferred instructions and re-defers them if necessary
- Memory-Level Parallelism (MLP)
  - Run ahead on load miss and generate additional load misses
- Instruction-Level Parallelism (ILP)
  - Ahead and behind threads execute independent instructions from different points in program in parallel
(A rough sketch of the ahead/behind split follows this slide.)
41
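Here is a rough C-style sketch of the ahead/behind split described above. The types, deferred queue, and helper routines are hypothetical placeholders standing in for the hardware, not Sun's actual design.

    /* Sketch only: core_t, inst_t, and all helpers are hypothetical. */
    void ahead_thread_step(core_t *c) {
        inst_t *i = next_instruction(c);
        if (is_load(i) && l1_miss(i)) {
            if (!c->speculating) {
                checkpoint_state(c);          /* L1 load miss starts speculation */
                c->speculating = 1;
            }
            defer(c->deferred_q, i);          /* park the missing load ...        */
        } else if (depends_on_deferred(i, c->deferred_q)) {
            defer(c->deferred_q, i);          /* ... and its dependents           */
        } else {
            execute(i);  /* miss-independent work continues: generates MLP and ILP */
        }
    }

    void behind_thread_step(core_t *c) {
        inst_t *i = pop_front(c->deferred_q);
        if (!i)
            return;
        if (data_ready(i))
            execute(i);                       /* drain deferred work once data returns */
        else
            defer(c->deferred_q, i);          /* still not ready: re-defer              */
    }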
ROCK Pipeline
42
More Powerful Cores in Sun ROCK

Advantages
+ Higher single-thread performance (MLP + ILP)
+ Better cache miss tolerance → can reduce on-chip cache sizes

Disadvantages
- Bigger cores → fewer cores → lower parallel throughput (in terms of threads). How about each thread's response time?
- More complex than Niagara cores (but simpler than conventional out-of-order execution) → longer design time?
43
More Powerful Cores in Sun ROCK

- Chaudhry talk, Aug 2008.
44
More Powerful Cores in Sun ROCK

- Chaudhry et al., "Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor," ISCA 2009.
45