CoreGenesis: Erasing Core Boundaries for Robust

43rd International Symposium on Microarchitecture
Erasing Core Boundaries for Robust and
Configurable Performance
Shantanu Gupta
Shuguang Feng
Amin Ansari
Scott Mahlke
University of Michigan, Ann Arbor
December 7, 2010
University of Michigan
Electrical Engineering and Computer Science
Multicore Architectures
• Industry wide move to multicores
► 2 – 16 cores on a single die
• Multiple challenges confront them:
► Single-thread performance
► Reliability
► Power density
► Memory bandwidth
► ….
[Figure: Sun Niagara 2, IBM Cell, Intel 4 Core Nehalem]
• Our hypothesis: A highly configurable architecture
can handle these issues in a unified manner.
Multicore Performance Challenge
1. Good throughput / parallel performance with more cores
2. Stagnating sequential performance
Need flexibility to provide both Parallel and Sequential performance
[Figure: CPU Performance (log scale) versus year, 1985 to 2015, for the 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad, and Core i7, with the power wall marked]
[Figure: Spectrum of Applications, from parallel workloads (scientific computing, newer web browsers, video decoding) to sequential workloads (legacy workloads, most mobile/desktop apps)]
Solution: Configurable Performance
• Assign resources where they are needed…
• In an N core chip:
► Use all N cores for best Parallel Performance
► Group M cores together for Serial Performance (M < N)
[Figure: Parallel / Throughput versus Serial / Sequential configurations. Source: Mark D. Hill]
• Core Fusion, ISCA’07; Composable Lightweight Processors, MICRO’07
Multicore Reliability Challenge
[Figure: sources of silicon failures [Todd Austin, GSRC Sep 08]]
• Parametric Variability: intra-die variations in ILD thickness
• Hard Faults: Electromigration (EM), Oxide breakdown (OBD), Negative Bias Threshold Inversion
• Thermal Runaway: higher power dissipation, increased heating, higher transistor leakage
• Manufacturing Defects That Escape Testing (Inefficient Burn-in Testing)
Need mechanisms for in-field silicon failures
Solution: Isolate Broken Resources
CORE level
• ElastIC, DT’06
• Configurable Isolation, ISCA’07
MODULE level
• Online Diagnosis of Hard Faults, MICRO’05
• Ultra Low-Cost Defect Protection, ASPLOS’06
STAGE level
• StageNet, MICRO’08
• Core Cannibalization, PACT’08
[Figure: three pipelines, each made of Stage1, Stage2, Stage3, …, StageN]
- StageNet decouples the pipeline stages
- Regular fabric, no global interconnections
- Any set of stages can be connected to form a pipeline
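To illustrate why stage level isolation can outlast core level isolation, here is a minimal sketch. It assumes a hypothetical fabric of four pipelines with the four stage types used later in the talk, and it treats any working stage of a given type as interchangeable. The fault pattern and all function names are illustrative, not taken from the StageNet design.

```python
from typing import Dict, List

STAGES = ["Fetch", "Decode", "Issue", "Ex/Mem"]

def pipelines_core_level(broken: List[Dict[str, bool]]) -> int:
    """Core level isolation: a core survives only if none of its stages is broken."""
    return sum(1 for core in broken if not any(core.values()))

def pipelines_stage_level(broken: List[Dict[str, bool]]) -> int:
    """Stage level isolation: any working stage of each type can join a pipeline,
    so the scarcest stage type limits how many pipelines can be formed."""
    return min(sum(1 for core in broken if not core[s]) for s in STAGES)

# Hypothetical fabric: four rows of stages, with three faults in different rows.
fabric = [{s: False for s in STAGES} for _ in range(4)]
fabric[0]["Fetch"] = True
fabric[1]["Decode"] = True
fabric[2]["Issue"] = True

print(pipelines_core_level(fabric))   # 1
print(pipelines_stage_level(fabric))  # 3
```

With three faults spread over three rows, disabling whole cores leaves one pipeline, while reconnecting the surviving stages still yields three.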
Point Solutions: Summary and Limitations
Configurable Performance: fuse cores for higher single-thread performance
• Tightly coupled resources
• Centralized structures for data and control management
Reliability: stage level isolation
• Decoupled resources
• Distributed data and control management
Limitations of point solutions:
1. Solve only one challenge at a time
2. Incur additive overheads, no resource overlap
3. Are incompatible with one another
Our Goal: Design an architectural solution, which
1. Simultaneously targets configurable performance and reliability
2. Overlaps hardware changes, and
3. Resolves any conflicting requirements
The CoreGenesis (CG) Architecture
[Figure: grid of four Fetch, Decode, Issue, Ex/Mem pipelines, with crossbar switches between stages and distributed structures]
• Regular grid of pipeline stages. No explicit core boundary.
• Stages interconnected by full crossbars
• Distributed structures for data and control management
1. Throughput
The CoreGenesis (CG) Architecture
[Figure: the grid of Fetch, Decode, Issue, and Ex/Mem stages configured as a single pipeline processor and as a conjoined pipelines processor]
1. Throughput
2. Reliability
3. Configurable Performance
Advantages:
1. Unified performance / reliability solution
2. Overlaps hardware overheads
3. Regular fabric
4. No centralized resources for fetch, issue, operand copying
CG – Microarchitectural Hurdles
Hurdles: Control Flow, Register Data Flow, Memory Data Flow, Instruction Issue
• Single Pipeline: Solved by the StageNet design, MICRO’08
► Stream Identification bits for Control Flow
► Bypass cache (inside EXEC.-stage) for Register Data Flow
► Memory Data Flow: N/A
► Instruction Issue: N/A
• Conjoined Pipelines:
Control Flow
- Instruction sequence needs to be managed across fetch stages
Register and Memory Data Flow
- Detection of cross pipeline register and memory data flow violations
- Recovery to a consistent architectural state
Instruction Issue
- Segregate data flow chains between pipelines
CG – Overview
Conjoined pipelines processor
[Figure: two pipelines (Fetch, Decode, Issue, Ex/Mem) annotated with: distributed fetch; distributed decode; decentralized instruction issue (broadcasted); detection of data flow violations; in-order writeback]
1. Control Flow
2. Register Data Flow tracking
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue
CG – Control Flow
Distributed Fetch
- Pipelines fetch alternate instructions
- Branch predictors are kept in sync
[Figure: one Fetch/Decode pair receives instructions 1, 3, 5, 7, 9; the other receives 2, 4, 6, 8, 10]
Advantages
- Evenly splits the work (fetch, decode, issue) between two pipelines
- No explicit communication required for control decisions
- Consistent control decisions due to mirrored branch predictors
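The alternate-fetch split can be sketched in a few lines. This is only an illustration of the interleaving shown on the slide: it numbers instructions in program order and ignores branches, and the function name `split_fetch` is hypothetical.

```python
from typing import List

def split_fetch(stream: List[int], n_pipelines: int = 2) -> List[List[int]]:
    """Assign the instruction at program-order position i to pipeline i mod n_pipelines."""
    lanes: List[List[int]] = [[] for _ in range(n_pipelines)]
    for position, instr in enumerate(stream):
        lanes[position % n_pipelines].append(instr)
    return lanes

lanes = split_fetch(list(range(1, 11)))
print(lanes[0])  # [1, 3, 5, 7, 9]
print(lanes[1])  # [2, 4, 6, 8, 10]
```

Because the assignment depends only on an instruction's position in the stream, each fetch stage can compute it locally, which is consistent with the claim that no explicit communication is needed for control decisions.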
CG – Data Flow
Across pipeline dependencies are tricky….
[Figure: instruction stream → split instruction stream → local decisions and execution → compare notes at commit time → replay if any dependency was violated]
Register Data Flow
1. Issue stages locally maintain a table of source registers
2. Issue stages monitor write-backs, and detect if any other pipeline updates a source for an outstanding instruction
3. Missed dependency → initiate a light-weight replay
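The three steps can be modeled roughly as follows. This is a simplified sketch, not the hardware design: the class `IssueStage`, its methods, and the use of program-order instruction numbers to decide which writes are older are all assumptions made for illustration.

```python
from typing import Dict, List, Set

class IssueStage:
    """Hypothetical model of one pipeline's issue stage."""

    def __init__(self, pipeline_id: int) -> None:
        self.pipeline_id = pipeline_id
        # Step 1: local table of instruction number -> source registers already read.
        self.outstanding: Dict[int, Set[str]] = {}

    def issue(self, instr_id: int, sources: Set[str]) -> None:
        """Record the source registers of a locally issued instruction."""
        self.outstanding[instr_id] = set(sources)

    def commit(self, instr_id: int) -> None:
        """Drop an instruction from the table once it has committed."""
        self.outstanding.pop(instr_id, None)

    def observe_writeback(self, writer_id: int, writer_pipeline: int, dest: str) -> List[int]:
        """Step 2: watch a write-back and return the instructions to replay (step 3).

        A replay is needed when an older instruction on another pipeline writes
        a register that a younger outstanding instruction has already read.
        """
        if writer_pipeline == self.pipeline_id:
            return []
        return sorted(i for i, srcs in self.outstanding.items()
                      if i > writer_id and dest in srcs)

stage = IssueStage(pipeline_id=1)
stage.issue(3, {"R2"})  # instruction 3 reads R2 early
print(stage.observe_writeback(writer_id=2, writer_pipeline=2, dest="R2"))  # [3]
print(stage.observe_writeback(writer_id=1, writer_pipeline=1, dest="R1"))  # []
```

All state is local to the issue stage; the only shared information is the stream of write-backs it observes.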
CG – Register Data Flow: Example
Scenario A
1. R1 = ….
2. R2 = ….
3. … = R1
4. … = R2
[Figure: pipeline 1 issues and executes instructions 1 and 3; pipeline 2 issues and executes instructions 2 and 4]
Scenario B
1. R1 = ….
2. R2 = ….
3. … = R2
4. … = R2
Data flow violation!
Pipeline 1 used a stale value of R2
How can we avoid these violations?
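The difference between the two scenarios can be checked with a small sketch. It assumes alternate fetch across two pipelines and flags every read whose most recent producer ran on the other pipeline. In real hardware such a read only becomes a violation if the consumer issues before the producer writes back, so this is a conservative model; the encoding of instructions and the function name are illustrative.

```python
from typing import Dict, List, Optional, Tuple

# (destination register or None, source registers), listed in program order.
Instr = Tuple[Optional[str], List[str]]

def cross_pipeline_reads(program: List[Instr], n_pipelines: int = 2) -> List[Tuple[int, str]]:
    """Return (instruction number, register) pairs whose latest producer
    ran on a different pipeline than the consumer."""
    last_writer_lane: Dict[str, int] = {}
    risky: List[Tuple[int, str]] = []
    for index, (dest, sources) in enumerate(program):
        lane = index % n_pipelines  # alternate fetch between the pipelines
        for reg in sources:
            if reg in last_writer_lane and last_writer_lane[reg] != lane:
                risky.append((index + 1, reg))
        if dest is not None:
            last_writer_lane[dest] = lane
    return risky

scenario_a: List[Instr] = [("R1", []), ("R2", []), (None, ["R1"]), (None, ["R2"])]
scenario_b: List[Instr] = [("R1", []), ("R2", []), (None, ["R2"]), (None, ["R2"])]

print(cross_pipeline_reads(scenario_a))  # []
print(cross_pipeline_reads(scenario_b))  # [(3, 'R2')]
```

In Scenario A every consumer sits on the same pipeline as its producer. In Scenario B, instruction 3 lands on pipeline 1 while its producer (instruction 2) runs on pipeline 2, which is the stale read of R2 called out above.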
CG – Instruction Issue
• Instructions can be:
► Straight steered
► Cross steered
[Figure: two Issue stages and two Ex/Mem stages, with the straight path labeled]
• Objective: match producers and consumers
• Mismatch → Data Flow violation → Replay
• Solution: Use static compiler analysis to generate steering hints
CG – Instruction Issue: Example
[Figure: dependence graph of instructions 1 to 10; in fetch order, one pipeline receives 1, 3, 5, 7, 9 and the other 2, 4, 6, 8, 10, each feeding its own Issue and Ex/Mem stages; a critical cross dependency is marked]
Always straight steering
• Ignores data dependencies
• Number of replays = 5
Compiler orchestrated steering
• Use clustering algorithms
• Accounts for dependencies and communication delays
• Number of replays = 0
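A rough sketch of the two steering policies follows. The slide does not specify the clustering algorithm, so the greedy rule below (follow the first producer, otherwise pick the less loaded pipeline) is only a stand-in, and it ignores communication delays. The dependence graph is hypothetical rather than the one in the figure, and the count of cross-pipeline dependencies is a proxy for potential replays, not a replay count.

```python
from typing import Dict, List

def straight_steering(deps: Dict[int, List[int]], n_pipelines: int = 2) -> Dict[int, int]:
    """Each instruction issues on the pipeline that fetched it (alternate fetch)."""
    return {instr: (instr - 1) % n_pipelines for instr in sorted(deps)}

def hinted_steering(deps: Dict[int, List[int]], n_pipelines: int = 2) -> Dict[int, int]:
    """Greedy stand-in for compiler hints: follow the first producer, else balance load.
    Assumes producers always have smaller instruction numbers than consumers."""
    lane_of: Dict[int, int] = {}
    load = [0] * n_pipelines
    for instr in sorted(deps):
        producers = deps[instr]
        lane = lane_of[producers[0]] if producers else load.index(min(load))
        lane_of[instr] = lane
        load[lane] += 1
    return lane_of

def cross_dependencies(deps: Dict[int, List[int]], lane_of: Dict[int, int]) -> int:
    """Count producer/consumer pairs that ended up on different pipelines."""
    return sum(1 for instr, producers in deps.items()
               for p in producers if lane_of[p] != lane_of[instr])

# Hypothetical graph: instruction -> list of producers; two independent chains.
example_deps = {1: [], 2: [1], 3: [], 4: [3], 5: [2], 6: [4]}

print(cross_dependencies(example_deps, straight_steering(example_deps)))  # 3
print(cross_dependencies(example_deps, hinted_steering(example_deps)))    # 0
```

On this graph the greedy rule keeps each dependence chain on one pipeline while still splitting the work evenly, which is the effect the steering hints aim for.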
CG – Design Summary
[Figure: two pipelines of Fetch, Decode, Issue, and Ex/Mem stages]
1. Control Flow
- Pipelines fetch alternate instructions
- Branch predictors kept in sync
2. Register Data Flow
- Maintain local data flow information
- Check the decisions at writeback
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue
- Steer consumers to producers
- Leverage static compiler analysis
Evaluation Methodology
• Liberty Simulation Infrastructure
► For cycle accurate simulations
• Trimaran Compilation System
► For instruction steering hints
Microarchitectural Parameters
Branch predictor: Global, 16-bit, gshare predictor
Level 1 I/D cache: 4-way, 16KB, 1 cycle latency
Level 2 unified cache: 8-way, 64KB, 5 cycle latency
• Experiments:
► Single-thread performance gain from conjoining
► Throughput improvement from conjoining (at low utilizations)
► Throughput sustainability (in face of failures)
Sequential Performance
[Figure: Normalized IPC for Baseline (1 issue), CG Single Pipeline (1 issue), CG Conjoint Pipelines (2 issue), and Baseline (2 issue)]
Throughput at varying utilization
[Figure: Throughput (IPC) versus system utilization (number of threads / number of cores, 0.25 to 1) for an 8-core traditional multicore and an 8-core CoreGenesis chip]
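The intuition behind conjoining at low utilization can be captured with a toy model. This is not the data behind the figure: it assumes every thread has an IPC of 1 on a single pipeline, that idle pipelines are conjoined pairwise with busy ones, and that each conjoined pair runs 1.4 times faster, reusing the 40% speedup quoted in the conclusions. All names are hypothetical.

```python
def traditional_throughput(threads: int, cores: int = 8) -> float:
    """Traditional multicore: each thread runs on one core at IPC 1; idle cores add nothing."""
    if not 0 <= threads <= cores:
        raise ValueError("threads must be between 0 and the number of cores")
    return float(threads)

def conjoined_throughput(threads: int, pipelines: int = 8,
                         conjoin_speedup: float = 1.4) -> float:
    """Toy model: each idle pipeline is lent to one busy thread."""
    if not 0 <= threads <= pipelines:
        raise ValueError("threads must be between 0 and the number of pipelines")
    conjoined = min(threads, pipelines - threads)  # threads that get a second pipeline
    single = threads - conjoined                   # threads left on one pipeline
    return single * 1.0 + conjoined * conjoin_speedup

for threads in (2, 4, 6, 8):
    print(threads / 8, traditional_throughput(threads), conjoined_throughput(threads))
```

Under these assumptions the two configurations meet at full utilization, where no idle pipelines remain to conjoin, and the gap is largest when half or fewer of the pipelines are busy.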
Throughput Sustainability (Reliability)
[Figure: Throughput (IPC) over time (0 to 9 years) for an 8-core traditional multicore and an 8-core CoreGenesis chip]
Conclusions
• Architectural flexibility can tackle multiple multicore challenges
• CoreGenesis is our attempt at a unified performance and reliability solution
► Decentralized instruction flow management to combine resources for higher single-thread performance
► Decoupled pipeline architecture to allow stage level reconfiguration
• Results:
► Combining two single issue pipelines gives 40% speedup
► Sustains the same throughput for up to 70% longer
► Overheads: 20% area, 17% power
Thank you
Erasing Core Boundaries for Robust and
Configurable Performance
http://cccp.eecs.umich.edu
Back up slides
Traditional Solutions and CoreGenesis (CG)
Traditional point solutions
(A) Dynamic Multicore. Cores can fuse together when sequential performance is needed.
(B) Core Disabling. Isolates broken cores (red). Sustains throughput only in low failure rates.
(C) Heterogeneous CMP. Maintains a variety of cores to offer power-proportional computing.
CoreGenesis Vision
The architecture is composed of a sea of building blocks (B).
These blocks can be configured for:
• Throughput computing: By forming single-issue pipelines
• Single-thread performance: By forming wider-issue pipelines
• Fault-tolerance: By decommissioning broken blocks.
• Customized processing: Heterogeneous building blocks can be introduced in the fabric to form customized pipelines.
CG Instance: A Unified Performance-Reliability Solution
Provides..
1. Configurable Performance: By merging varying number of stages
2. Reliability: By isolating broken stages
Design Characteristics
• Elementary pipeline stages form the building blocks
• Stages interconnected using full crossbars.
• No global flush, stall or forwarding signals.
• No modifications to the cache hierarchy
Summary of Challenges
[Table: Single Pipeline and Conjoint Pipelines versus Control Flow, Register Data Flow, Memory Data Flow, and Instruction Steering]
Decoupling Stages in a Pipeline [MICRO’08]
1. Control Flow: Stream ID (SID)
• Control flow handling
• Eliminates flush signals
2. Data Flow: Bypass $
• Stores previous results
• Fully associative structure
• Emulates data forwarding
3. Transmission Delays: Macro-Ops
• Send instruction bundles
• Amortizes transfer delay
• Increases system utilization
[Figure: decoupled Fetch, Decode, Issue, and Ex/Mem stages with double buffers, SID bits, a bypass cache, a register file, a branch predictor, and a macro-op generator]
Replays
[Figure: breakdown of execution cycles (0% to 100%) into MemFlow replay cycles, RegFlow replay cycles, and normal operation cycles]
Area
Power