A Unified View of Non-monotonic Core
Selection and Application Steering in
Heterogeneous Chip Multiprocessors
Sandeep Navada, Niket K. Choudhary,
Salil Wadhavkar, Eric Rotenberg
Department of Electrical and Computer Engineering
North Carolina State University
Sandeep Navada © 2013
Single-ISA HCMP
• Same ISA
• Different microarchitectures
– Superscalar width
– Structure sizes
– Frequency
• Cores have different performance and
power
• New run-time optimization lever
Monotonic HCMP
• Cores can be ranked independent of application
• Core 1 is faster than Core 2 for any application
[Figure: performance of Core 1 and Core 2 across applications A, B, C, D]
Monotonic HCMP example
HCMP literature
• Focus
– Monotonic cores
– Cores are preordained
– Scheduling
• Single thread
– Minimize energy for given performance
degradation threshold w.r.t. highest ranked core
• Multiple threads
– Maximize throughput/Watt/mm²
Going beyond monotonic HCMP
• Cores can’t be ranked independent of application
• Cores designed from the ground up, not pre-existing
[Figure: performance of Core 1 and Core 2 across applications A, B, C, D]
Non-monotonic HCMP
• High-contention scenario (optimize throughput): Kumar et al., “Core Architecture Optimization for Single-ISA Heterogeneous Multiprocessors”
• Low-contention scenario (optimize latency): our work
Optimize latency
Performance = IPC × frequency
Complexity↑ => IPC↑ frequency↓
[Figure: IPC, frequency, and performance versus complexity, plotted for App A and App B]
This tradeoff plays out differently for different apps and depends on the ILP characteristics of the app.
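The relation Performance = IPC × frequency can be illustrated with a minimal sketch. The IPC values and the two core names below are invented for illustration and are not measurements from the talk; only the formula itself comes from the slide.

```python
def bips(ipc: float, clock_period_ns: float) -> float:
    """Billions of instructions per second: IPC x frequency (GHz = 1 / period in ns)."""
    return ipc / clock_period_ns

# Hypothetical cores: a simple core with a fast clock, a complex core with a slow clock.
clock_period_ns = {"simple": 0.5, "complex": 0.8}

# Hypothetical IPC of two apps on each core.
app_ipc = {
    "app_a": {"simple": 1.0, "complex": 2.4},  # high ILP: IPC grows a lot with complexity
    "app_b": {"simple": 0.9, "complex": 1.1},  # low ILP: IPC barely grows
}

def best_core(app: str) -> str:
    """Return the core with the highest IPC x frequency for this app."""
    perf = {core: bips(app_ipc[app][core], period)
            for core, period in clock_period_ns.items()}
    return max(perf, key=perf.get)

for app in app_ipc:
    print(app, "runs fastest on the", best_core(app), "core")
```

With these made-up numbers the high-ILP app prefers the complex core and the low-ILP app prefers the simple one, which is the non-monotonic behavior the slide describes.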
Non-monotonic HCMP challenges
• Core selection: How to pick the core types comprising the heterogeneous design?
• Application steering: How to steer the applications to the best core?
CORE SELECTION
Core design space
Parameter | Value Range | Number
Front end width | 2, 3, 4, 5, 6, 7, 8 | 7
Issue width | 2, 3, 4, 5, 6, 7, 8 | 7
Physical register file size | 64, 128, 192, 256, 384, 512 | 6
Issue queue size | 16, 24, 32, 48, 64, 96, 128 | 7
Load queue/Store queue size | 8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64 | 8
L1 I$ size | 8, 16, 32, 64, 128 KB | 5
L1 D$ size | 8, 16, 32, 64, 128 KB | 5
L2$ size | 2 MB | 1
Clock period | 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns | 8
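As a rough sense of scale, the size of the unpruned design space follows from the "Number" column above. This sketch assumes every combination of parameter values counts as a candidate before pruning; the slides do not say which combinations the pruning script discards, and the dictionary keys are illustrative names.

```python
from math import prod

# Number of values per parameter, copied from the "Number" column of the table.
choices = {
    "front_end_width": 7,
    "issue_width": 7,
    "physical_register_file_size": 6,
    "issue_queue_size": 7,
    "load_store_queue_size": 8,
    "l1_icache_size": 5,
    "l1_dcache_size": 5,
    "l2_cache_size": 1,
    "clock_period": 8,
}

# Assumption: all combinations are legal before pruning.
unpruned_points = prod(choices.values())
print(f"{unpruned_points:,} design points before pruning")  # 3,292,800
```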
Core selection
[Flow diagram]
• Core design space → pruning script → pruned design space
• SPEC benchmarks → SimPoint tool → 39 10M phases
• FabScalar toolset → IPC, freq, power → performance of every phase on every design point
• Search for N = 1, 2, 3, 4 → optimal 1-, 2-, 3-, and 4-core-type HCMP
N: number of core types
Search for Optimal 4-core-type HCMP
BIPS of each core type on each phase:
Phase | A | B | C | D | E | F | G | H
1 | 1.5 | 3.2 | 1.3 | 2.2 | 1.6 | 1.7 | 1.3 | 2.0
2 | 0.5 | 2.3 | 2.5 | 1.9 | 3.1 | 1.8 | 2.0 | 1.2
Candidate 4-core-type HCMPs:
Core 1 | Core 2 | Core 3 | Core 4 | Performance
A | B | C | D | HMEAN(3.2, 2.5) = 2.81
E | B | C | D | HMEAN(3.2, 3.1) = 3.15
A | F | C | D | HMEAN(2.2, 2.5) = 2.34
E | F | C | D | HMEAN(2.2, 3.1) = 2.57
E | F | G | H | HMEAN(2.0, 3.1) = 2.43
…
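The scoring used in the example above (each phase runs on whichever chosen core type is fastest for it, then the per-phase BIPS values are combined with a harmonic mean) can be sketched as follows. The BIPS values are the ones from the example table. Exhaustive enumeration of combinations is an assumption made here for illustration; the slides do not state how the search is carried out. The function names are hypothetical.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of each core type (A-H) on each phase, from the example table.
bips = {
    1: {"A": 1.5, "B": 3.2, "C": 1.3, "D": 2.2, "E": 1.6, "F": 1.7, "G": 1.3, "H": 2.0},
    2: {"A": 0.5, "B": 2.3, "C": 2.5, "D": 1.9, "E": 3.1, "F": 1.8, "G": 2.0, "H": 1.2},
}

def hcmp_score(core_types):
    """Harmonic mean over phases of the best BIPS any chosen core type achieves."""
    best_per_phase = [max(phase[c] for c in core_types) for phase in bips.values()]
    return harmonic_mean(best_per_phase)

def best_hcmp(n):
    """Assumed exhaustive search: score every n-core-type combination, keep the best."""
    all_types = sorted(bips[1])
    return max(combinations(all_types, n), key=hcmp_score)

print(round(hcmp_score(("A", "B", "C", "D")), 2))  # 2.81, matching the first row
winner = best_hcmp(4)
print(winner, round(hcmp_score(winner), 2))
```

In this two-phase example any combination that contains both B (best on phase 1) and E (best on phase 2) reaches the top score of 3.15, so the remaining two core types are arbitrary.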
Kiviat diagram
• Visualize core parameters
[Kiviat diagram with three axes: Frequency (higher frequency), Width (increase superscalar width), Window (larger structures)]
Optimal 1-core-type HCMP
[Kiviat diagram (Frequency, Width, Window axes): core A]
• “A” core is an average core which strikes a good balance between IPC and frequency.
Optimal 2-core-type HCMP
[Kiviat diagram (Frequency, Width, Window axes): cores A and LW; LW is larger and wider than A]
• “A” core is still selected!
• “LW” core targets window and width bottlenecks in “A” core.
Optimal 3-core-type HCMP
[Kiviat diagram (Frequency, Width, Window axes): cores A, LW, and N]
• “A” core is still selected!!
• “LW” core is still selected.
• “N” core targets frequency bottleneck.
Optimal 4-core-type HCMP
[Kiviat diagram (Frequency, Width, Window axes): cores A, L, W, and N]
• “A” and “N” are selected, again.
• “LW” got split into “L” and “W”, addressing each bottleneck better!
LW split
[Kiviat diagram (Frequency, Width, Window axes): cores A, LW, L, and W]
Optimal HCMP
Core Type | Clock Period (ns) | ILP-extracting buffers | Widths | Caches
A | 0.6 | 32, 128, 128 | 3, 4 | 64, 64
N | 0.5 | 32, 64, 64 | 2, 2 | 16, 16
L | 0.7 | 48, 128, 384 | 4, 4 | 128, 128
W | 0.7 | 32, 128, 128 | 6, 6 | 128, 32
The optimal HCMP consists of:
1. An average core, which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core
APPLICATION STEERING
Bottleneck-driven steering
• Application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when bottlenecks change
– To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
– To the average core if no accelerator meets this condition, or if there are no bottlenecks
Bottleneck-driven steering
1. Track performance counters
2. Diagnose bottlenecks
3. Steer phase
Track performance counters
Counter | Description
Width_ctr | Ready instruction not issued due to limited issue width.
Window_ctr | Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr | Instruction stalled due to instruction cache miss.
D$_ctr | Load instruction stalled due to data cache miss.
Misp_ctr | Mispredicted branch.
L2_ctr | Instruction stalled due to L2 cache miss.
Cycle_ctr | Number of cycles.
Diagnose bottlenecks
• Every 10K instructions, evaluate bottlenecks
using performance counters and thresholds
• Performance counters are normalized with
respect to the cycle count
• If the normalized performance counter value is
above threshold, then the corresponding
resource is a bottleneck
Diagnose bottlenecks
Bottleneck | Expression
bool Width | Width = (Width_ctr > Width_thresh)
bool Window | Window = (Window_ctr > Window_thresh)
bool Frequency | Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$ | I$ = (I$_ctr > I$_thresh)
bool D$ | D$ = (D$_ctr > D$_thresh)
Thresholds are determined empirically using a training process.
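A minimal sketch of the diagnosis step, assuming the counters are read once per interval and normalized by the cycle count as described earlier. The threshold values below are placeholders chosen for illustration only; the talk derives the real thresholds through training and does not list them here. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Counters:
    width: int    # Width_ctr
    window: int   # Window_ctr
    icache: int   # I$_ctr
    dcache: int   # D$_ctr
    misp: int     # Misp_ctr
    l2: int       # L2_ctr
    cycles: int   # Cycle_ctr (assumed non-zero)

# Placeholder thresholds on normalized counters; NOT values from the talk.
THRESH = {"width": 0.30, "window": 0.30, "icache": 0.05,
          "dcache": 0.10, "misp": 0.01, "l2": 0.005}

def diagnose(c: Counters, thresh=THRESH) -> dict:
    """Normalize each counter by the cycle count and compare with its threshold."""
    over = {name: getattr(c, name) / c.cycles > limit
            for name, limit in thresh.items()}
    return {
        "Width": over["width"],
        "Window": over["window"],
        "Frequency": over["misp"] or over["l2"],  # either condition flags Frequency
        "I$": over["icache"],
        "D$": over["dcache"],
    }

sample = Counters(width=5000, window=100, icache=0, dcache=0,
                  misp=10, l2=0, cycles=10000)
print(diagnose(sample))
```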
Steer phase
Core | Bottlenecks relieved | Bottlenecks worsened | Steering logic
W | Width | Frequency | if (Width && !Frequency) W
L | Window | Frequency | else if (Window && !Frequency) L
N | Frequency | Width, Window | else if (Frequency && !(Width || Window)) N
A | n/a | n/a | else A
Paper shows full steering logic with I$ and D$ bottlenecks included.
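The simplified steering logic in the table translates directly into code. This sketch covers only the Width, Window, and Frequency bottlenecks shown on the slide; the I$ and D$ cases from the paper are not reproduced here. The function name is hypothetical.

```python
def steer(width: bool, window: bool, frequency: bool) -> str:
    """Pick a core type (W, L, N, or A) from the diagnosed bottlenecks."""
    if width and not frequency:
        return "W"  # relieves Width; skipped when Frequency is also a bottleneck
    elif window and not frequency:
        return "L"  # relieves Window; skipped when Frequency is also a bottleneck
    elif frequency and not (width or window):
        return "N"  # relieves Frequency; would worsen Width and Window
    else:
        return "A"  # average core: no bottleneck, or no accelerator qualifies

print(steer(width=True, window=False, frequency=False))  # W
print(steer(width=True, window=False, frequency=True))   # A
```

Because the checks run in order, a phase with both Width and Window bottlenecks (and no Frequency bottleneck) goes to W in this simplified version.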
RESULTS
Methodology
• Benchmarks: SPEC 2000
– Simulate first 4 billion instructions
• Metrics
– Performance: BIPS
– Efficiency: BIPS³/Watt
• Migration overhead
– Default: 100 cycles
– Sensitivity study: 1K, 10K cycles
Steering algorithms
Algorithm | Description
Baseline | Run the entire 4B instructions on the average core
Sampling | Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck | Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal | Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle | Run every 10K instruction segment on the best core type
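To make the difference between the last two algorithms concrete, here is a small sketch that schedules a short trace under "Oracle" (best core type for each segment) and under the prior-segment policy the table calls "Optimal". The per-segment BIPS values are invented, the first segment of the prior-segment policy is assumed to start on the average core, and migration overhead is ignored; none of these details come from the talk.

```python
from statistics import harmonic_mean

# Hypothetical BIPS of each core type on four consecutive 10K-instruction segments.
segments = [
    {"A": 2.0, "N": 2.4, "L": 1.8, "W": 1.9},
    {"A": 2.0, "N": 2.3, "L": 1.7, "W": 1.9},
    {"A": 1.8, "N": 1.2, "L": 2.6, "W": 2.0},
    {"A": 1.9, "N": 1.3, "L": 2.5, "W": 2.1},
]

def best(segment):
    """Core type with the highest BIPS on this segment."""
    return max(segment, key=segment.get)

def oracle(segs):
    """Every segment runs on its own best core type."""
    return [best(s) for s in segs]

def prior_best(segs, start="A"):
    """Every segment runs on the best core type of the prior segment."""
    return [start] + [best(s) for s in segs[:-1]]

def achieved_bips(segs, schedule):
    """Segments have equal instruction counts, so overall BIPS is a harmonic mean."""
    return harmonic_mean([s[core] for s, core in zip(segs, schedule)])

for name, policy in (("oracle", oracle), ("prior-best", prior_best)):
    schedule = policy(segments)
    print(name, schedule, round(achieved_bips(segments, schedule), 2))
```

The prior-segment policy loses ground whenever the phase behavior changes between segments, as it does between the second and third segments of this made-up trace.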
4-core-type HCMP
• 4-core HCMP outperforms the homogeneous CMP by up to 76%, and by 15% on average
• Our steering algorithm is able to capture most of this gain
Sampling vs. bottleneck steering
• Sampling performs 8.9% better than the average core
• Bottleneck steering performs 12% better than the average core
Occupancy
Occupancy pattern varies dramatically across different applications
Efficiency
Sampling performs 25% better than the average core
Bottleneck steering performs 33% better than the average core
SUMMARY
Summary
• First proposal to architect and orchestrate
multiple core types for latency reduction.
• With N core types, the optimal HCMP consists of
an average core type coupled with N-1
accelerator core types.
• In the complementary steering algorithm, the
application is continuously diagnosed for
bottlenecks and is migrated to the core type
which relieves the bottlenecks.
Future work
• HCMPs open up a whole new direction of
microarchitecture research.
• Many microarchitecture optimizations don’t
provide universal benefits.
• As each core-type targets a narrow workload
space, HCMP provides a great platform to
reconsider these optimizations.