18-742 Spring 2011 Parallel Computer Architecture Lecture 11: Core Fusion and Multithreading

Prof. Onur Mutlu
Carnegie Mellon University
Announcements

No class Monday (Feb 14)

Interconnection Networks lectures on Wed-Fri (Feb 16, 18)
Reviews

Due Today (Feb 11) midnight
  Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993.
Due Tuesday (Feb 15) midnight
  Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979.
  Dally, “Route packets, not wires: on-chip interconnection networks,” DAC 2001.
  Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010.
Last Lecture



Speculative Lock Elision (SLE)
SLE vs. Accelerated critical sections (ACS)
Data Marshaling
Today


Dynamic Core Combining (Core Fusion)
Maybe start multithreading
How to Build a Dynamic ACMP

Frequency boosting
  DVFS
Core combining: Core Fusion
  Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007.
  Idea: Dynamically fuse multiple small cores to form a single large core
Core Fusion: Motivation


Programs are incrementally parallelized in stages
Each parallelization stage is best executed on a different
“type” of multi-core
Core Fusion Idea

Combine multiple simple cores dynamically to form a larger,
more powerful core
Core Fusion Microarchitecture

Concept: Add enveloping hardware to make cores combinable
How to Make Multiple Cores Operate Collectively as a Single Core

Reconfigurable I-cache
Collective Fetch

Each core fetches two instructions from own i-cache
Fetch Management Unit (FMU) controls redirection
  Cores process branches locally, communicate prediction to FMU
  FMU communicates outcome and GHR updates
  Two-cycle interconnect
Two-cycle bubble per taken branch (+1 if misaligned core)
  Misaligned targets re-align in one cycle
Core “zero” provides RAS
  Return encountered on another core gets its prediction from Core 0’s RAS
FMU updates i-TLBs on a miss
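
The fetch costs listed on this slide can be put into a small cycle-count model. This is only a sketch: the fetch width, the two-cycle bubble, and the one-cycle re-alignment penalty come from the slide, while the function name, its parameters, and the assumption that bubbles simply add to the base fetch time are illustrative and not taken from the paper.

```python
FETCH_WIDTH_PER_CORE = 2   # slide: each core fetches two instructions
TAKEN_BRANCH_BUBBLE = 2    # slide: two-cycle bubble per taken branch
MISALIGN_PENALTY = 1       # slide: +1 cycle when the target is misaligned


def fused_fetch_cycles(num_instructions, num_cores, taken_branches, misaligned_targets):
    """Estimate fetch cycles for a fused group (hypothetical additive model)."""
    if misaligned_targets > taken_branches:
        raise ValueError("misaligned targets cannot exceed taken branches")
    width = FETCH_WIDTH_PER_CORE * num_cores
    base = -(-num_instructions // width)  # ceiling division
    bubbles = (taken_branches * TAKEN_BRANCH_BUBBLE
               + misaligned_targets * MISALIGN_PENALTY)
    return base + bubbles
```

For example, `fused_fetch_cycles(64, 4, 5, 2)` models 64 instructions on four fused cores with five taken branches, two of them to misaligned targets. Under these assumptions the bubbles (12 cycles) outweigh the base fetch time (8 cycles), which is one way to see why taken branches matter so much in fused mode.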
Branching in Fused Mode
[Figure: four cores, each with its own BPred, BTB, GHR, and RAS; one location is labeled B.]
Branching in Fused Mode
[Figure, continued: the same per-core BPred, BTB, GHR, and RAS structures and the label B, with X marks on parts of the diagram.]
Centralized Renaming and Steering



Centralized structure: Cores send predecoded info to Steering Management Unit (SMU)
SMU steers and dispatches regular and copy instructions

Max. two regular + two copy instructions per core per cycle
Eight extra pipeline stages (fused mode only)
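
To make the steering step more concrete, here is a minimal sketch of one dispatch cycle. The per-core limits (two regular, two copy) come from the slide. Everything else is an assumption made for illustration: the `Instr` tuple, the `reg_home` map, the heuristic of preferring the core that already holds the most source operands, and charging each copy instruction to the core that holds the value. It is not the SMU's actual algorithm.

```python
from collections import namedtuple

Instr = namedtuple("Instr", ["dest", "srcs"])  # hypothetical instruction format

MAX_REGULAR = 2  # slide: max. two regular instructions per core per cycle
MAX_COPY = 2     # slide: max. two copy instructions per core per cycle


def steer_cycle(instrs, num_cores, reg_home):
    """Steer one cycle of instructions in program order.

    reg_home maps a register name to the set of cores holding its value and
    is updated in place. Registers absent from reg_home are assumed to be
    available on every core. Returns (regular, copies, leftover), where
    copies[c] lists (register, destination_core) pairs sent out of core c.
    """
    regular = [[] for _ in range(num_cores)]
    copies = [[] for _ in range(num_cores)]
    leftover = []
    for pos, ins in enumerate(instrs):
        # Assumed heuristic: most local sources first, then least loaded core.
        order = sorted(
            range(num_cores),
            key=lambda c: (-sum(c in reg_home.get(s, ()) for s in ins.srcs),
                           len(regular[c]), c),
        )
        placed = False
        for core in order:
            if len(regular[core]) >= MAX_REGULAR:
                continue
            # Sources that live only on other cores need a copy instruction.
            needed = {s: min(reg_home[s]) for s in ins.srcs
                      if reg_home.get(s) and core not in reg_home[s]}
            demand = {}
            for producer in needed.values():
                demand[producer] = demand.get(producer, 0) + 1
            if any(len(copies[p]) + n > MAX_COPY for p, n in demand.items()):
                continue
            for src, producer in needed.items():
                copies[producer].append((src, core))
                reg_home[src].add(core)
            regular[core].append(ins)
            reg_home[ins.dest] = {core}
            placed = True
            break
        if not placed:
            leftover = list(instrs[pos:])  # in-order dispatch: later ones wait
            break
    return regular, copies, leftover
```

A usage note: with two cores and the sequence `r1 = ...`, `r2 = ...`, `r3 = r1 op r2`, `r4 = op r3`, this sketch places the two independent producers on different cores, so the third instruction needs one copy of `r2`, and the fourth needs a copy of `r3` because its preferred core is already full.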
Operand Communication via Copy Instructions
[Figure: operand transfer between cores via copy instructions; labels: Copy-in, Issue, Copy-out, Out, In.]
Collective Commit (No Blocking Case)
[Figure: per-core ROBs holding instructions i0 through i7 interleaved across cores (i0, i2, i4, i6 and i1, i3, i5, i7), with the pre-commit ROB head and the conventional ROB head marked.]
Collective Commit (Blocked Case)
[Figure: per-core ROBs holding instructions i0 through i7 interleaved across cores (i0, i2, i4, i6 and i1, i3, i5, i7), with the pre-commit ROB head and the conventional ROB head marked.]
Collective Load/Store Queue

LD/ST instructions bank-assigned to cores based on effective addresses
  Distributed disambiguation
PC-based steering prediction on which bank the ld/st should access
  Re-steer on misprediction
Core-fusion-aware indexing
  Full utilization in fused and split modes
  Cache coherence avoids flushing or shuffling
Address fields: Tag | Bank ID | Index
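
The bank assignment and the PC-based steering prediction can be sketched as follows. The field order (Tag, Bank ID, Index) follows the slide. The bit widths, the extra line-offset field, the last-bank predictor table, and all function and class names are assumptions chosen for illustration.

```python
OFFSET_BITS = 6  # assumption: 64-byte cache lines
INDEX_BITS = 7   # assumption: 128 sets per bank
BANK_BITS = 2    # assumption: four fused cores, one bank each


def split_address(addr):
    """Split an effective address into (tag, bank, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    bank = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << BANK_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS + BANK_BITS)
    return tag, bank, index, offset


class BankPredictor:
    """PC-indexed table that remembers the last bank each ld/st accessed."""

    def __init__(self, entries=256):
        self.table = [0] * entries

    def _slot(self, pc):
        return (pc >> 2) % len(self.table)

    def predict(self, pc):
        return self.table[self._slot(pc)]

    def update(self, pc, bank):
        self.table[self._slot(pc)] = bank


def steer_memory_op(predictor, pc, addr):
    """Return (actual_bank, resteered) for one load or store."""
    predicted = predictor.predict(pc)
    _, actual, _, _ = split_address(addr)
    predictor.update(pc, actual)
    return actual, predicted != actual  # re-steer on misprediction
```

In this sketch the first access from a given PC may be steered to the wrong bank and re-steered once the effective address is known; later accesses from the same PC to the same bank are steered correctly.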
Dynamic Reconfiguration

Run-time control of granularity
  Serial vs. parallel sections
  Variable granularity in parallel sections
Mechanism: Fusion, fission instructions in the ISA
  Typically encapsulated in macros or directives (e.g., OpenMP sections)
  Can be safely ignored (single execution model)
Reconfiguration actions
  Flush pipelines and i-caches
  Reconfigure i-cache tags
  Transfer architectural state as needed
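
One way to picture how fusion and fission requests might be wrapped around a serial region is the toy model below. It is a sketch under stated assumptions: `FusibleChip`, `serial_section`, and the `supports_fusion` flag are hypothetical names, and the model only records the reconfiguration actions listed on the slide rather than performing them.

```python
from contextlib import contextmanager


class FusibleChip:
    """Toy model of a chip whose cores can be fused (hypothetical)."""

    def __init__(self, supports_fusion=True):
        self.supports_fusion = supports_fusion
        self.fused = False
        self.actions = []  # log of reconfiguration actions taken

    def _reconfigure(self, want_fused):
        # Hardware without fusion support can ignore the request, so the
        # same program still runs correctly (single execution model).
        if not self.supports_fusion or self.fused == want_fused:
            return
        self.actions += [
            "flush pipelines and i-caches",
            "reconfigure i-cache tags",
            "transfer architectural state as needed",
        ]
        self.fused = want_fused

    def fuse(self):
        self._reconfigure(True)

    def fission(self):
        self._reconfigure(False)


@contextmanager
def serial_section(chip):
    """Run the enclosed region fused, then split the cores again."""
    chip.fuse()
    try:
        yield chip
    finally:
        chip.fission()
```

Used as `with serial_section(chip): ...`, the wrapper plays the role of the macro or directive mentioned above: the programmer marks the region, and the reconfiguration cost is paid once on entry and once on exit.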
Core Fusion Evaluation

Ipek et al., “Core Fusion: Accommodating Software
Diversity in Chip Multiprocessors,” ISCA 2007.
[Figure: Single Thread Performance]
[Figure: Parallel Application Performance]
Core Fusion vs. Tile-Small Symmetric CMP

Core Fusion Advantages
+ Better single-thread performance when needed

Disadvantages
- Possibly lower parallel throughput (area spent for glue logic)
- Reconfiguration overhead to dynamically create a large core (can reduce performance)
- More complex design: glue logic between cores, reconfigurable I-cache
Core Fusion vs. Asymmetric CMP

Core Fusion Advantages
+ Cores not fixed at design time: more adaptive
+ Possibly better parallel throughput (assuming ACMP does not use SMT)
+ Potentially higher frequency design: all cores are the same

Disadvantages
- Reconfiguration overhead to dynamically create a large core (can reduce performance)
- Single-thread performance on the fused core less than that on a large core statically optimized for single thread
- Additional stages, fine-grained operand communication between cores, collective operation constraints
- Potentially complex design: glue logic between cores, reconfigurable I-cache
Core Fusion vs. Clustered Superscalar/OoO


Core fusion: build small cores, add glue logic to combine them dynamically
Clustered superscalar/OoO: build a large superscalar core in a scalable fashion (clustered scheduling windows, register files, and execution units)
  Can use SMT to execute multiple threads in different clusters
Both require:
  Steering instructions to different clusters (cores)
  Operand communication between clusters
  Memory disambiguation
Core Fusion vs. Clustered Superscalar/OoO

Some core fusion advantages
+ No resource contention between threads in non-fused mode
+ No need to build a wide fetch engine

Some disadvantages
- Single-thread performance can be less due to additional communication latencies in fetch and commit
- I-cache not shared between threads in non-fused mode
Review: Performance Asymmetry

What to do with it?
  Improve serial performance (accelerate sequential bottleneck)
  Reduce energy consumption – adapt to phase behavior
  Optimize energy delay – adapt to phase behavior
  Improve parallel performance (accelerate critical sections)
How to build it?
  Static
    Multiple different core microarchitectures or frequencies
  Dynamic
    Combine cores
    Adapt frequency
Research in Asymmetric Multi-Core

How to Design Asymmetric Cores
  Static
  Dynamic
    Can you fuse in-order cores easily to build an OoO core?
How to divide the program to best take advantage of asymmetry?
  Explicit vs. transparent
  How to match arbitrary program phases to the best-fitting core? Staged execution models.
  How to minimize code/data migration overhead?
  How to satisfy shared resource requirements of different cores?
Multithreading
Readings: Multithreading

Required
  Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005.
  Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.
  Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996.
  Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007.
Recommended
  Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992.
  Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
  Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006.
  Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990.
Multithreading (Outline)



Multiple hardware contexts
Purpose
Initial incarnations
  CDC 6600
  HEP
  Tera
Levels of multithreading
  Fine-grained (cycle-by-cycle)
  Coarse-grained (multitasking)
  Switch-on-event
  Simultaneous
Uses: traditional + creative (now that we have multiple contexts, why do we not do …)
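
Two of the levels in this outline differ mainly in when the hardware switches contexts, which a short sketch can illustrate. Both functions are simplified models written for this purpose, not descriptions of any particular machine: the round-robin selection policy, the trace encoding (`c` for a compute operation, `M` for a long-latency event such as a cache miss), and the omission of miss latency and switch penalties are all assumptions.

```python
def fine_grained_schedule(num_threads, ready, cycles):
    """Pick a thread every cycle, round-robin among ready threads.

    ready(thread, cycle) -> bool is a caller-supplied (hypothetical) predicate.
    Returns the thread chosen in each cycle, or None if no thread was ready.
    """
    order = []
    last = -1
    for cycle in range(cycles):
        chosen = None
        for step in range(1, num_threads + 1):
            candidate = (last + step) % num_threads
            if ready(candidate, cycle):
                chosen = candidate
                break
        order.append(chosen)
        if chosen is not None:
            last = chosen
    return order


def switch_on_event_schedule(traces):
    """Run one thread until it issues an 'M' operation, then switch.

    traces holds one string per thread. Returns the thread that issued
    each operation, in issue order.
    """
    positions = [0] * len(traces)
    remaining = sum(len(t) for t in traces)
    order = []
    current = 0
    while remaining:
        while positions[current] >= len(traces[current]):
            current = (current + 1) % len(traces)  # skip finished threads
        op = traces[current][positions[current]]
        positions[current] += 1
        remaining -= 1
        order.append(current)
        if op == "M":
            current = (current + 1) % len(traces)  # switch on the event
    return order
```

With three always-ready threads, the fine-grained model interleaves them every cycle. The switch-on-event model instead keeps issuing from one thread until that thread hits an `M` operation.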