Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi Onur Mutlu

Coordinated Control of Multiple Prefetchers in Multi-Core Systems
Eiman Ebrahimi*, Onur Mutlu‡, Chang Joo Lee*, Yale N. Patt*
* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
Motivation
- Aggressive prefetching improves memory latency tolerance of many applications when they run alone
- Prefetching for concurrently-executing applications on a CMP can lead to
  - Significant system performance degradation and bandwidth waste
- Problem: Prefetcher-caused inter-core interference
  - Prefetches of one application contend with prefetches and demands of other applications
Potential Performance
[Figure: bar chart of performance normalized to no throttling (y-axis 0 to 2.2) for workloads WL1 to WL14 and Gmean-32; chart annotation: 56%]
System performance improvement of ideally removing all prefetcher-caused inter-core interference in shared resources
Exact workload combinations can be found in paper
Outline
- Background
- Shortcoming of Prior Approaches to Prefetcher Control
- Hierarchical Prefetcher Aggressiveness Control
- Evaluation
- Conclusion
Increasing Prefetcher Accuracy
- Increasing prefetcher accuracy can reduce prefetcher-caused inter-core interference
  - Single-core prefetcher aggressiveness throttling (e.g., Srinath et al., HPCA ’07)
  - Filtering inaccurate prefetches (e.g., Zhuang and Lee, ICPP ’03)
  - Dropping inaccurate prefetches at memory controller (Lee et al., MICRO ’08)
All such techniques operate independently on the prefetches of each application
Feedback-Directed Prefetching (FDP) (Srinath et al., HPCA ’07)
- Uses prefetcher feedback information local to the prefetcher’s core
  - Prefetch accuracy
  - Prefetch timeliness
  - Prefetch cache pollution
- Dynamically adapts the prefetcher’s aggressiveness
  - Prefetch Distance
  - Prefetch Degree
[Figure: Stream Prefetcher Aggressiveness. Labels: Access Stream, addresses A and A+1, prefetch addresses P to P+4, Prefetch Degree, Prefetch Distance]
- Shown to perform better than and consume less bandwidth than static aggressiveness configurations
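The two aggressiveness knobs named on this slide can be illustrated with a small model. This is a minimal sketch, assuming a single ascending stream tracked in cache-line units, where prefetch distance bounds how far ahead of the demand access the prefetcher may run and prefetch degree bounds how many prefetches are issued per triggering access. The class `StreamSketch` and its fields are hypothetical names for illustration, not the FDP hardware design.

```python
class StreamSketch:
    """Toy model of one ascending stream, in cache-line units (hypothetical)."""

    def __init__(self, start_line, distance, degree):
        self.distance = distance              # max lines ahead of the demand access
        self.degree = degree                  # max prefetches per triggering access
        self.next_prefetch = start_line + 1   # next line not yet prefetched

    def on_demand_access(self, line):
        """Return the lines to prefetch in response to a demand access to `line`."""
        limit = line + self.distance          # farthest line we are allowed to prefetch
        # Never prefetch lines the demand stream has already reached.
        self.next_prefetch = max(self.next_prefetch, line + 1)
        issued = []
        while len(issued) < self.degree and self.next_prefetch <= limit:
            issued.append(self.next_prefetch)
            self.next_prefetch += 1
        return issued


stream = StreamSketch(start_line=100, distance=4, degree=2)
print(stream.on_demand_access(100))  # [101, 102]
print(stream.on_demand_access(101))  # [103, 104]
```

Throttling up or down in this model amounts to raising or lowering `distance` and `degree` at run time.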
High Interference caused by Accurate Prefetchers
[Figure: four cores (Core0 to Core3) sending demand (Dem) and prefetch (Pref) requests through the Shared Cache and Memory Controller to DRAM Bank 0 and Bank 1. Legend: Dem X, Addr: Y is a demand request from core X for address Y. Other labels: Miss, Row Buffers, Row Addr., Row Buffer Hit, Requests Being Serviced]
In a CMP system, accurate prefetchers can cause significant interference with concurrently-executing applications
Shortcoming of Per-Core (Local-Only) Prefetcher Aggressiveness Control
[Figure: Core 0 to Core 3 above a Shared Cache. Two prefetchers are shown, each with degree values 2 and 4 and the label FDP Throttle Up. The cache sets (Set 0, Set 1, Set 2, …) hold a mix of Dem, Pref and Used_P lines from different cores]
Local-only prefetcher control techniques have no mechanism to detect inter-core interference
Shortcoming of Local-Only Prefetcher Control
4-core workload example: lbm_06 + swim_00 + crafty_00 + bzip2_00
[Figure: bar charts of Speedup over Alone Run for lbm_06, swim_00, crafty_00 and bzip2_00, and of Hspeedup, comparing No Prefetching, Pref. + No Throttling, Feedback-Directed Prefetching and HPAC]
Our Approach: Use both global and per-core feedback to determine each prefetcher’s aggressiveness
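The chart on this slide reports a per-application speedup over the alone run and an Hspeedup value. The slides do not define Hspeedup; the sketch below assumes it is the harmonic mean of the per-application speedups, which is a common multi-core metric. The function names and the IPC numbers in the usage lines are illustrative only and are not results from the talk.

```python
def speedups_over_alone(ipc_shared, ipc_alone):
    """Per-application speedup: IPC when sharing the CMP divided by IPC when alone."""
    return [shared / alone for shared, alone in zip(ipc_shared, ipc_alone)]


def harmonic_speedup(ipc_shared, ipc_alone):
    """Harmonic mean of per-application speedups (assumed meaning of Hspeedup)."""
    speedups = speedups_over_alone(ipc_shared, ipc_alone)
    return len(speedups) / sum(1.0 / s for s in speedups)


# Illustrative numbers only: one application slowed to half speed, one unaffected.
print(speedups_over_alone([0.5, 1.0], [1.0, 1.0]))  # [0.5, 1.0]
print(harmonic_speedup([0.5, 1.0], [1.0, 1.0]))
```

The harmonic mean is pulled down strongly by the slowest application, so one badly hurt core lowers the metric even when the others run at full speed.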
Hierarchical Prefetcher Aggressiveness Control (HPAC)
- Local control’s goal: Maximize the prefetching performance of core i independently
- Global control’s goal: Keep track of and control prefetcher-caused inter-core interference in shared memory system
- Global Control: accepts or overrides decisions made by local control to improve overall system performance
[Figure: block diagram. Labels: Core i, Pref. i, Local Control, Global Control, Memory Controller, Shared Cache, Accuracy, Bandwidth Feedback, Cache Pollution Feedback, Local Throttling Decision, Final Throttling Decision]
Terminology
- Global feedback metrics used in our mechanism, for each core i:
  - Core i’s prefetcher accuracy: Acc (i)
  - Core i’s prefetcher-caused inter-core cache pollution: Pol (i)
    - Demand cache lines of other cores evicted by this core’s prefetches that are requested subsequent to eviction
  - Bandwidth consumed by core i: BW (i)
    - Accounts for how long requests from this core tie up DRAM banks
  - Bandwidth needed by other cores j != i: BWNO (i)
    - Accounts for how long requests from other cores have to wait for DRAM banks because of requests from this core
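The two bandwidth metrics can be pictured as per-core counters updated every cycle. The sketch below is one possible reading of the descriptions above, not the counters from the paper: it assumes the memory controller can report, for each DRAM bank, which core's request is being serviced and which cores have requests queued for that bank. The function name and the data layout are hypothetical, and counting BWNO (i) once per bank per cycle is also an assumption.

```python
def update_bandwidth_counters(banks, bw, bwno):
    """Advance BW(i) and BWNO(i) by one cycle.

    banks: one (servicing_core, waiting_cores) pair per DRAM bank, where
           servicing_core is None if the bank is idle (hypothetical layout).
    bw, bwno: per-core counter lists, updated in place.
    """
    for servicing_core, waiting_cores in banks:
        if servicing_core is None:
            continue
        # The bank is tied up by a request from servicing_core this cycle.
        bw[servicing_core] += 1
        # Some other core is waiting for this bank because of that request.
        if any(core != servicing_core for core in waiting_cores):
            bwno[servicing_core] += 1


bw, bwno = [0] * 4, [0] * 4
banks = [(0, {1, 2}), (0, {0}), (None, {3}), (3, set())]
update_bandwidth_counters(banks, bw, bwno)
print(bw, bwno)  # [2, 0, 0, 1] [1, 0, 0, 0]
```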
Calculating Inter-Core Cache Pollution
- Prefetch from core i evicts a core j’s demand from shared cache
- Core j experiences a demand cache miss
[Figure: Pollution Filter of core i. Labels: Evicted line’s Address, Missing Address From core j, Hash Function, filter entries with a Pollution bit and a Core id, Increment Pol (i)]
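The two events on this slide suggest a small hashed table per core. The following is a minimal sketch under stated assumptions: the table size, the modulo hash and the choice to clear an entry after it is counted are placeholders that the slide does not specify, and the class and method names are hypothetical.

```python
class PollutionFilter:
    """Sketch of the pollution filter of core i (sizes and hash are assumptions)."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.pollution_bit = [False] * num_entries
        self.core_id = [None] * num_entries
        self.pol = 0  # Pol(i): inter-core pollution caused by core i's prefetches

    def _index(self, line_addr):
        # Placeholder hash function; the real hash is not given on the slide.
        return line_addr % self.num_entries

    def on_prefetch_evicts_demand(self, evicted_line_addr, victim_core):
        """A prefetch from core i evicted a demand line that belongs to victim_core."""
        idx = self._index(evicted_line_addr)
        self.pollution_bit[idx] = True
        self.core_id[idx] = victim_core

    def on_demand_miss(self, missing_line_addr, requesting_core):
        """A demand miss from another core; returns True if Pol(i) was incremented."""
        idx = self._index(missing_line_addr)
        if self.pollution_bit[idx] and self.core_id[idx] == requesting_core:
            self.pol += 1
            self.pollution_bit[idx] = False  # count each eviction once (assumption)
            return True
        return False
```

Because entries are indexed by a hash, two different addresses can share an entry, so a filter like this can occasionally attribute a miss to pollution that did not cause it.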
Hierarchical Prefetcher Aggressiveness Control (HPAC)
- High accuracy
- High pollution
- High bandwidth consumed while other cores need bandwidth
[Figure: HPAC block diagram for the case above. Global Control sees High Acc (i), High Pol (i) from Pol. Filter i at the Shared Cache, and High BW (i) and High BWNO (i) from the Memory Controller. The Local Throttling Decision from Local Control for Pref. i of Core i is Throttle Up; the Final Throttling Decision is Enforce Throttle Down]
Heuristics for Global Control
- Classification of global control heuristics based on interference severity
  - Severe interference
    - Action: Reduce the aggressiveness of interfering prefetcher
  - Borderline interference
    - Action: Prevent prefetcher from transitioning into severe interference
      - Allow local control to only throttle down
  - No interference or moderate interference from an accurate prefetcher
    - Action: Allow local control to maximize local benefits from prefetching
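The three classes above map onto a simple rule for combining the local decision with the global one. This is a minimal sketch, assuming the interference class has already been determined elsewhere. The enum and function names are hypothetical, and treating a blocked throttle-up as "no change" in the borderline case is an assumption, since the slide only says local control may only throttle down.

```python
from enum import Enum


class Interference(Enum):
    SEVERE = "severe"
    BORDERLINE = "borderline"
    NONE_OR_MODERATE = "none_or_moderate"


class Decision(Enum):
    THROTTLE_DOWN = -1
    NO_CHANGE = 0
    THROTTLE_UP = 1


def final_decision(interference, local_decision):
    """Combine the global interference class with the local throttling decision."""
    if interference is Interference.SEVERE:
        # Global control overrides: reduce the interfering prefetcher's aggressiveness.
        return Decision.THROTTLE_DOWN
    if interference is Interference.BORDERLINE:
        # Local control is allowed to throttle down only (hold otherwise: assumption).
        if local_decision is Decision.THROTTLE_DOWN:
            return Decision.THROTTLE_DOWN
        return Decision.NO_CHANGE
    # No or moderate interference: accept the local decision as is.
    return local_decision


print(final_decision(Interference.SEVERE, Decision.THROTTLE_UP))  # Decision.THROTTLE_DOWN
```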
HPAC Control Policies
[Table: decision table over the four global feedback metrics; the row-by-row layout is only partly recoverable from the slide]
Columns: Pol (i) | Acc (i) | BW (i) | BWNO (i) | Interference Class | Action
- Pol (i): Causing Low Pollution / Causing High Pollution
- Acc (i): Inaccurate / Highly Accurate
- BW (i): Low BW Consumption / High BW Consumption
- BWNO (i): Others’ low BW need / Others’ high BW need
- Entries shown: three cases classified as Severe interference, each with the action throttle down
Hardware Cost (4-Core System)
Total hardware cost, local control & global control: 15.14 KB
Additional cost on top of FDP: 1.55 KB
- Additional cost on top of FDP only 1.55 KB
- HPAC does not require any structures or logic that are on the processor’s critical path
Evaluation Methodology
- x86 cycle accurate simulator
- Baseline processor configuration
  - Per core
    - 4-wide issue, out-of-order, 256-entry ROB
    - Stream prefetcher with 32 streams, prefetch degree: 4, prefetch distance: 64
  - Shared
    - 2MB, 16-way L2 cache (4MB, 32-way for 8-core)
    - DDR3 1333MHz
    - 8B wide core to memory bus
    - 128, 256 L2 MSHRs for 4-, 8-core
    - Latency of 15ns per command (tRP, tRCD, CL)
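The 15 ns per-command latency ties back to the row-buffer behavior shown earlier in the talk. As a rough worked example, using the standard DRAM command sequence rather than anything stated on the slide, the access latency depends on the state of the row buffer. The function name is hypothetical, and the model ignores queueing delay and data transfer time.

```python
# Each of these DRAM commands takes 15 ns in the baseline configuration.
T_RP = 15   # precharge: close the currently open row
T_RCD = 15  # activate: open the requested row
CL = 15     # column access: read from the open row


def access_latency_ns(row_state):
    """Command latency for one access, by row-buffer state (simplified model)."""
    if row_state == "hit":        # requested row is already open
        return CL
    if row_state == "closed":     # no row is open
        return T_RCD + CL
    if row_state == "conflict":   # a different row is open
        return T_RP + T_RCD + CL
    raise ValueError(f"unknown row state: {row_state}")


print(access_latency_ns("hit"), access_latency_ns("conflict"))  # 15 45
```

Under this model a row conflict costs three times as much as a row hit, which is why a request from one core that closes another core's open row is a source of interference.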
- HPAC thresholds used
  Acc    BW     Pol    BWNO
  0.6    50k    90     75k
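The thresholds turn the four raw metrics into the high/low labels that global control works with. The sketch below is an illustration under assumptions: it reads "50k" and "75k" as 50,000 and 75,000, treats a value at or above the threshold as high, and uses hypothetical names throughout. The only rule it encodes is the case shown on the earlier HPAC slide (high accuracy, high pollution, high bandwidth consumed while other cores need bandwidth), which leads to an enforced throttle down.

```python
# Threshold values from the slide; "k" is read as thousands (assumption).
THRESHOLDS = {"acc": 0.6, "bw": 50_000, "pol": 90, "bwno": 75_000}


def classify_metrics(acc, bw, pol, bwno, thresholds=THRESHOLDS):
    """Label each global feedback metric of core i as high (True) or low (False)."""
    return {
        "highly_accurate": acc >= thresholds["acc"],
        "high_bw_consumption": bw >= thresholds["bw"],
        "high_pollution": pol >= thresholds["pol"],
        "others_high_bw_need": bwno >= thresholds["bwno"],
    }


def matches_enforced_throttle_down_case(labels):
    """True for the slide's example: accurate, polluting, bandwidth-hungry prefetcher
    while other cores need bandwidth."""
    return all(labels.values())


labels = classify_metrics(acc=0.8, bw=60_000, pol=120, bwno=80_000)
print(matches_enforced_throttle_down_case(labels))  # True
```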
Performance Results
[Figure: bar chart of System Performance Normalized to No Throttling (y-axis 0.7 to 1.5) for WL1 to WL14 and AVG-32, comparing No Prefetching, FDP and HPAC; workloads are grouped into Class 1 to Class 4; chart annotation: 15%]
Exact workload combinations can be found in paper
Summary of Other Results
- Further results and analysis are presented in the paper
  - Results with different types of memory controllers
    - Prefetch-Aware DRAM Controllers (PADC)
    - First-Ready First-Come-First-Served (FR-FCFS)
  - Effect of HPAC on system fairness
  - HPAC performance on 8-core systems
  - Multiple types of prefetchers per core and different local-control policies
  - Sensitivity to system parameters
Conclusion
- Prefetcher-caused inter-core interference can destroy potential performance of prefetching
  - When prefetching for concurrently executing applications in CMPs
  - Did not exist in single-application environments
- Develop one low-cost hierarchical solution which throttles different cores’ prefetchers in a coordinated manner
- The key is to take global feedback into account to determine aggressiveness of each core’s prefetcher
  - Improves system performance by 15% compared to no throttling on a 4-core system
  - Enables performance improvement from prefetching that is not possible without it on many workloads
Thank you!
Questions?