Slides - HPS Research Group

Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems
Eiman Ebrahimi*
Onur Mutlu‡
Yale N. Patt*
* HPS Research Group, University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University

Motivation
• Prefetching can significantly reduce memory latency impact on performance
• Stream prefetching very useful but unable to reduce latency of many misses
• Access patterns that follow pointers in linked data structures (LDS) prevalent in many applications
• High-performance and bandwidth-efficient LDS prefetchers are needed

Potential Performance
[Figure: IPC delta (%) of ideal LDS prefetching over stream prefetching. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bars: gmean, gmean-no-health. The y-axis runs from 0 to 140; one off-scale bar is labeled 615.]

Our Goal
Develop techniques that
1) Enable low-cost and bandwidth-efficient prefetching of linked data structure accesses
2) Efficiently combine such prefetchers with commonly employed stream-based prefetchers

Outline
• Background
• Efficient Content Directed LDS Prefetching
• Managing Multiple Prefetchers in a Hybrid Prefetching System
• Evaluation
• Conclusion

Content-Directed Prefetching (CDP)
(Cooksey et al. ASPLOS ’02)
• Requires no state, which makes it an attractive approach
• Searches for pointers as data is fetched from memory
• Virtual address predictor
  - Compares high-order bits of values within cache line with cache line’s address
  - Generates prefetch requests on a match

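The virtual address predictor described above can be sketched in a few lines of C. This is a minimal illustration, not the hardware design: the 64-byte line of 4-byte values, the function names, and the `compare_bits` parameter are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameters for illustration: a 64-byte line of 4-byte values. */
#define LINE_BYTES 64u
#define WORDS_PER_LINE (LINE_BYTES / sizeof(uint32_t))

/* Returns nonzero if the top `compare_bits` bits of `value` equal those of
 * `line_addr`, i.e. the value looks like a pointer into the same region.
 * Requires 1 <= compare_bits <= 32. */
static int looks_like_pointer(uint32_t value, uint32_t line_addr,
                              unsigned compare_bits)
{
    unsigned shift = 32u - compare_bits;
    return (value >> shift) == (line_addr >> shift);
}

/* Scans a fetched line and writes every candidate prefetch address to `out`
 * (which must have room for WORDS_PER_LINE entries). Returns the count. */
static size_t cdp_scan_line(const uint32_t line[WORDS_PER_LINE],
                            uint32_t line_addr, unsigned compare_bits,
                            uint32_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        if (looks_like_pointer(line[i], line_addr, compare_bits))
            out[n++] = line[i];
    }
    return n;
}
```

With 12 compare bits (bits [31:20]), a value such as x80011100 matches a line address of x80022220, while x40373551 does not. Note that the scan keeps no state between lines, which is the property the slide highlights.
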
Content-Directed Prefetching (CDP)
[Figure: Virtual Address Predictor. As a cache line moves from DRAM into the L2, bits [31:20] of each value in the line (for example x40373551 and x80011100) are compared with bits [31:20] of the address x80022220. A match generates a prefetch.]

Shortcomings of CDP
• CDP prefetches all identified pointers
• Indiscriminate prefetching of all discovered pointers leads to
  - Low prefetch accuracy
  - High cache pollution
  - High bandwidth consumption

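Shortcomings of CDP – An example
struct node {
    int Key;
    int *D1_ptr;
    int *D2_ptr;
    struct node *Next;
};

int *HashLookup(int Key) {
    …
    /* Walk the chain until the key matches or the list ends. */
    for (node = head; node && node->Key != Key; node = node->Next)
        ;
    if (node) return node->D1_ptr;
    return NULL;  /* key not found */
}
[Figure: A chain of nodes, each holding Key, D1_ptr, D2_ptr and Next, with D1_ptr and D2_ptr pointing to separate D1 and D2 data blocks.]
Example from mst
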
Shortcomings of CDP – An example
[Figure: Virtual Address Predictor applied to a fetched cache line. Bits [31:20] of every field in the line (Key, D1_ptr, D2_ptr and Next of each node) are compared with bits [31:20] of the cache line address.]

Efficient Content Directed Prefetching (ECDP) – Basic Idea
• A compiler-guided technique that identifies likely-useful pointer addresses to prefetch
• Compiler profiles and provides hints as to which pointer addresses are likely-useful to prefetch
• Hardware uses hints to prefetch only likely-useful pointers

Terminology – Pointer Group (PG)
LD1: data = node->data;
…
node = node->left;

struct node {
    int data;
    int key;
    struct node *left;
    struct node *right;
};

PG(L, X) = { all pointers at offset X from byte accessed by instruction L }
[Figure: Consecutive nodes laid out as data, key, left, right. In each node LD1 accesses data, and the left pointer sits at offset 8 from that byte; these pointers are labeled P1, P2, etc.]
PG (LD1, 8) = {P1, P2, etc.}

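To make the offset in PG(L, X) concrete, the sketch below computes it from a struct layout. It is only an illustration: `node32` models the slide's node with pointers stored as 4-byte addresses, and `pointer_group`, `PG_OFFSET`, and the `LD1` identifier are hypothetical names introduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the slide's node with 4-byte fields; child pointers are stored
 * as 32-bit addresses (an assumption made for this sketch). */
struct node32 {
    int32_t  data;   /* byte accessed by LD1 */
    int32_t  key;
    uint32_t left;   /* address of the left child */
    uint32_t right;  /* address of the right child */
};

/* A pointer group is identified by a load and a byte offset. */
struct pointer_group {
    int load_id;     /* hypothetical identifier for the static load */
    int offset;      /* bytes from the accessed byte to the pointer */
};

/* Offset of `pointer_field` relative to the field the load accesses. */
#define PG_OFFSET(accessed_field, pointer_field) \
    ((int)offsetof(struct node32, pointer_field) - \
     (int)offsetof(struct node32, accessed_field))

enum { LD1 = 1 };

/* The group formed by LD1 (which reads `data`) and the `left` pointer. */
static struct pointer_group pg_ld1_left(void)
{
    struct pointer_group pg = { LD1, PG_OFFSET(data, left) };
    return pg;
}
```

Under this layout `pg_ld1_left()` yields offset 8, matching PG(LD1, 8) on the slide, and the `right` pointer would fall in a separate group at offset 12.
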
Efficient Content-Directed Prefetching (ECDP)
• The PG definition naturally associates a number of PGs to each load instruction
1) Compile-time profiling classifies PGs into beneficial/harmful
2) Hardware prefetches PGs that are beneficial
- Information conveyed to hardware with hint bit vector embedded into the load instruction

Beneficial vs Harmful PG
[Figure: A linked structure of nodes (data, key, left, right) in which three pointers P1, P2 and P3 form PG1 = {P1, P2, P3}. The profiled prefetch counts for the three pointers are 25 useful / 12 useless, 50 useful / 9 useless, and 33 useful / 10 useless.]
25 + 50 + 33 > 12 + 9 + 10
PG1’s useful prefetches > PG1’s useless prefetches
A pointer group whose majority of prefetches are useful is classified as beneficial

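The majority rule above amounts to summing the profiled counts over all pointers in a group and comparing the totals. The sketch below is one way to write that check; the `prefetch_counts` struct and function name are hypothetical, and how the profiler gathers the counts is outside its scope.

```c
#include <stddef.h>

/* Profiled outcome of the prefetches issued for one pointer. */
struct prefetch_counts {
    unsigned useful;
    unsigned useless;
};

/* Sums the per-pointer counts of a pointer group and applies the majority
 * rule: the group is beneficial if useful prefetches outnumber useless ones. */
static int pg_is_beneficial(const struct prefetch_counts *ptrs, size_t n)
{
    unsigned long useful = 0, useless = 0;
    for (size_t i = 0; i < n; i++) {
        useful += ptrs[i].useful;
        useless += ptrs[i].useless;
    }
    return useful > useless;
}
```

For the counts on the slide (25/12, 50/9, 33/10) the totals are 108 useful against 31 useless, so PG1 is classified as beneficial.
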
ECDP mechanism - Example
LD1’s associated beneficial pointer groups
PG1 = {LD1, 8}
PG2 = {LD1, 24}
PG3 = {LD1, 44}
Assuming 4 byte address values
LD1 hint bit-vector: bit 2 (offset 8), bit 6 (offset 24) and bit 11 (offset 44) are 1; all other bits are 0
[Figure: A cache line holding consecutive nodes (data, key, left, right), with LD1 accessing byte 12 of the line. Pointers at offsets 8, 24 and 44 from the accessed byte are marked Prefetch; pointers at other offsets, such as offset 12, are marked Don’t Prefetch.]

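The hint bit-vector in this example maps each 4-byte offset to one bit, so offset 8 is bit 2, offset 24 is bit 6, and offset 44 is bit 11. A minimal sketch of that encoding follows; the 32-bit vector width, the restriction to non-negative offsets, and the function names are assumptions of the sketch rather than details given on the slide.

```c
#include <stdint.h>

#define ADDR_BYTES 4u  /* "Assuming 4 byte address values" */

/* Marks the pointer group at `offset` bytes from the accessed byte as
 * beneficial. `offset` must be a multiple of ADDR_BYTES and smaller than
 * 32 * ADDR_BYTES (a limit of this sketch, not of the mechanism). */
static uint32_t hint_set(uint32_t hints, unsigned offset)
{
    return hints | (UINT32_C(1) << (offset / ADDR_BYTES));
}

/* Returns nonzero if the hardware should prefetch a pointer found at
 * `offset` bytes from the byte accessed by the load carrying `hints`. */
static int hint_allows_prefetch(uint32_t hints, unsigned offset)
{
    if (offset % ADDR_BYTES != 0 || offset / ADDR_BYTES >= 32u)
        return 0;
    return (int)((hints >> (offset / ADDR_BYTES)) & 1u);
}

/* Hint vector for LD1 in the example: PGs at offsets 8, 24 and 44. */
static uint32_t ld1_hints(void)
{
    return hint_set(hint_set(hint_set(0, 8), 24), 44);
}
```

With `ld1_hints()`, a pointer at offset 8 is prefetched while one at offset 12 is skipped, which is the filtering that distinguishes ECDP from the original CDP.
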
Managing Multiple Prefetchers in a Hybrid Prefetching System
• ECDP can be complementary to a stream prefetcher
• Multiple prefetchers can deny service to each other as they contend for
  - Memory request buffer entries
  - DRAM bus bandwidth and DRAM banks
  - Cache space
• Unmanaged use of multiple prefetchers causes
  - Performance degradation
  - Inability to gain full performance benefit of using multiple prefetchers

Coordinated Throttling of Multiple Prefetchers – Basic Idea
• Dynamic feedback gathered for every prefetcher in the system
• Simple heuristics use feedback to adapt each prefetcher's aggressiveness

Adapting Stream Prefetcher Aggressiveness
• Stream Prefetcher Aggressiveness
  - Prefetch Distance
  - Prefetch Degree
[Figure: An access stream at lines A, A+1 with prefetched lines P, P+1, P+2, P+3, P+4 ahead of it, annotated with Prefetch Distance and Prefetch Degree.]

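One common reading of the two knobs is that distance bounds how far ahead of the demand stream the prefetcher may run, and degree bounds how many lines it issues per triggering access. The sketch below follows that reading; it is a simplified model under those assumptions, with hypothetical names, and it works in units of cache lines rather than byte addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* State of one tracked stream, in units of cache lines. */
struct stream {
    uint64_t next_prefetch;  /* P: next line the stream will prefetch */
};

/* On a demand access to line `a`, issues up to `degree` prefetches starting
 * at P, never running further ahead than line a + distance. Writes the
 * issued line numbers to `out` and returns how many were issued. */
static size_t stream_on_access(struct stream *s, uint64_t a,
                               unsigned distance, unsigned degree,
                               uint64_t *out)
{
    size_t n = 0;
    if (s->next_prefetch <= a)
        s->next_prefetch = a + 1;  /* demand caught up; restart just ahead */
    while (n < degree && s->next_prefetch <= a + distance)
        out[n++] = s->next_prefetch++;
    return n;
}
```

Raising either parameter makes the stream prefetcher more aggressive, and lowering either throttles it, which is the handle the feedback heuristics adjust.
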
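Adapting CDP Aggressiveness
• Each memory request assigned a depth value
  - Demand accessed line assigned depth 0
[Figure: The line fetched by a demand access has Depth = 0 and holds ptr1, ptr2, ptr3, ptr4, …; a line prefetched through one of those pointers has Depth = 1 and holds ptr5, ptr6, ptr7, …; a line prefetched from there has Depth = 2 and holds ptr8, ptr9, ptr10, …]
• CDP Aggressiveness
  - Maximum allowed prefetch depth
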
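The depth rule can be sketched as follows: a prefetch generated from a line carries that line's depth plus one, and it is issued only while the result stays within the maximum allowed depth. The function names, the single-step throttling, and the lower bound of 1 are assumptions of this sketch.

```c
/* Depth of a request generated from a line that was fetched at
 * `parent_depth`. A demand access has depth 0. */
static unsigned child_depth(unsigned parent_depth)
{
    return parent_depth + 1u;
}

/* A pointer found in a line at `parent_depth` may be prefetched only if
 * the resulting request stays within the maximum allowed depth. */
static int depth_allows_prefetch(unsigned parent_depth, unsigned max_depth)
{
    return child_depth(parent_depth) <= max_depth;
}

/* Throttling moves the maximum depth one step within [1, limit]; the step
 * size and the bounds are assumptions of this sketch. */
static unsigned throttle_depth(unsigned max_depth, int up, unsigned limit)
{
    if (up)
        return max_depth < limit ? max_depth + 1u : limit;
    return max_depth > 1u ? max_depth - 1u : 1u;
}
```

A smaller maximum depth stops the recursive pointer chase earlier, trading coverage for lower bandwidth use and less cache pollution.
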
Coordinated Prefetcher Aggressiveness Control Policies
• Each prefetcher adapts its own aggressiveness
[Figure: The Stream Prefetcher and the Content-Directed Prefetcher both send prefetches to the Shared Memory Resources and receive feedback from them. Each is labeled (Deciding) when it adapts its own aggressiveness and (Rival) when the other prefetcher is deciding.]
The goal: Allow the prefetcher most likely to improve performance to use more shared resources
• Deciding prefetcher adapts its own aggressiveness based on
  - Deciding prefetcher coverage and accuracy
  - Rival prefetcher coverage

Coordinated Prefetcher Aggressiveness Control Policies
Case  Deciding Cov  Deciding Acc  Rival Cov  Action         Reason
(a)   Low           Low           -          throttle down  avoid unnecessary bandwidth consumption and cache pollution
(b)   Low           Med           High       throttle down  give rival prefetcher chance to use more shared resources
(c)   Low           Med or High   Low        throttle up    give deciding prefetcher chance to improve coverage
(d)   Low           High          High       do nothing     deciding prefetcher not causing trouble, rival performing well
(e)   High          -             -          throttle up    deciding prefetcher performing well, avoid performance loss

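The five cases (a) through (e) can be written as a small decision function. The sketch below is one possible encoding, not the hardware logic: the type and function names are hypothetical, the thresholds are passed in rather than fixed, and the choice of `>=` versus `>` at each boundary is an assumption.

```c
enum action { THROTTLE_DOWN = -1, DO_NOTHING = 0, THROTTLE_UP = 1 };

struct thresholds {
    double t_coverage;  /* coverage at or above this counts as high */
    double a_low;       /* accuracy below this counts as low */
    double a_high;      /* accuracy at or above this counts as high */
};

/* Thresholds listed on the Evaluation Methodology slide. */
static const struct thresholds SLIDE_THRESHOLDS = { 0.2, 0.4, 0.7 };

/* Decides how the deciding prefetcher adjusts its own aggressiveness. */
static enum action decide(double deciding_cov, double deciding_acc,
                          double rival_cov, const struct thresholds *t)
{
    if (deciding_cov >= t->t_coverage)
        return THROTTLE_UP;       /* (e) deciding prefetcher performing well */
    if (deciding_acc < t->a_low)
        return THROTTLE_DOWN;     /* (a) avoid wasted bandwidth and pollution */
    if (rival_cov < t->t_coverage)
        return THROTTLE_UP;       /* (c) chance to improve coverage */
    if (deciding_acc < t->a_high)
        return THROTTLE_DOWN;     /* (b) give the rival more resources */
    return DO_NOTHING;            /* (d) no trouble, rival performing well */
}
```

Each prefetcher would evaluate this with itself as the deciding prefetcher and the other as the rival, so the same function serves both the stream prefetcher and the content-directed prefetcher.
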
Hardware Cost of all Techniques
Total hardware cost: 2.11 KB
Percentage area overhead (as fraction of the baseline 1MB L2 cache): 0.206%
• Major components
  - ‘prefetched’ bits for each L2 line – used to account for useful/useless prefetches
  - Eleven 16-bit counters to estimate prefetcher coverage and accuracy

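As an illustration of how a handful of 16-bit counters could yield the two feedback metrics, the sketch below uses the usual definitions: accuracy as useful prefetches over prefetches sent, and coverage as useful prefetches over useful prefetches plus remaining demand misses. These definitions, the counter set, and the names are assumptions of the sketch; the slide does not spell them out.

```c
#include <stdint.h>

/* Hypothetical per-prefetcher counters, 16 bits each. */
struct prefetcher_counters {
    uint16_t sent;           /* prefetches issued */
    uint16_t useful;         /* prefetched lines later used by a demand access */
    uint16_t demand_misses;  /* demand misses that no prefetch covered */
};

/* Fraction of issued prefetches that turned out to be useful. */
static double accuracy(const struct prefetcher_counters *c)
{
    return c->sent ? (double)c->useful / c->sent : 0.0;
}

/* Fraction of would-be misses that the prefetcher eliminated. */
static double coverage(const struct prefetcher_counters *c)
{
    unsigned total = (unsigned)c->useful + c->demand_misses;
    return total ? (double)c->useful / total : 0.0;
}
```

The `useful` count is where the per-line ‘prefetched’ bit comes in: a demand hit on a line whose bit is set can increment it, and an eviction with the bit still set marks a useless prefetch.
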
Evaluation Methodology
• x86 cycle accurate simulator
• Baseline processor configuration
  - Per core
    - 4-wide issue, out-of-order, 256-entry ROB
    - 1MB, 8-way L2 cache
    - Stream prefetcher with 32 streams, prefetch degree: 4, prefetch distance: 32
    - Content Directed Prefetcher, compare bits: 8, max depth: 4
  - Shared
    - 450 cycle memory latency
    - 8B wide core to memory bus
    - 32, 64, 128 L2 MSHRs for 1-, 2-, 4-core
• Coordinated prefetcher throttling thresholds
  Tcoverage  Alow  Ahigh
  0.2        0.4   0.7

Overall Performance
[Figure: IPC normalized to stream prefetching for Str Pref. + Orig CDP, Str Pref. + ECDP, and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bars: gmean, gmean-no-health. The y-axis runs from 0 to 1.4; off-scale bars are labeled 2.27, 2.27, 2.58 and 1.75. Callout: 22.5%.]

Memory Bandwidth Consumption
[Figure: Bus accesses per 1K instructions for Str Pref. Only, Str Pref. + Orig. CDP, Str Pref. + ECDP, and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bar: gmean. The y-axis runs from 0 to 100; one off-scale bar is labeled 375.]

Comparison to other LDS/Correlation Prefetchers
[Figure: IPC normalized to stream prefetching for Str Pref. + DBP, Str Pref. + Markov, GHB, and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bars: gmean, gmean-no-health. The y-axis runs from 0 to 1.4; off-scale bars are labeled 2.58, 2.36, 1.73, 1.88 and 1.75.]

Summary of Other Results
• Further comparisons and analysis are presented in the paper
  - Feedback Directed Prefetching: 5% avg. improvement
  - HW Prefetch Filtering: 17% avg. improvement
• Multi-core Results
  - Dual Core (10.4% avg. improvement)
  - Quad Core (9.5% avg. improvement)
• Effects of techniques on prefetcher accuracy and coverage

Conclusion
• Developed a low-cost and bandwidth-efficient HW/SW cooperative linked data structure prefetcher
  - ECDP utilizes compiler hints to prefetch only likely-useful pointers
• Inter-prefetcher interference can destroy potential performance
  - Coordinated throttling manages interference between multiple prefetchers
• Efficient integration of ECDP with stream prefetching
  - Improves average performance by 22% over stream prefetching alone
  - Reduces bandwidth consumption by 25%

Thank you!
Questions?

Comparison to Feedback-Directed Prefetching (Srinath et al. HPCA ’07)
[Figure: IPC normalized to stream prefetching for Str Pref. + ECDP + FDP and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bars: gmean, gmean-no-health. The y-axis runs from 0 to 1.4.]

Comparison to HW Prefetch Filtering (Zhuang and Lee ICPP ’03)
[Figure: IPC normalized to stream prefetching for Str Pref. + Orig. CDP + HW-Filter, Str Pref. + Orig. CDP + HW-Filter + Coord. Thrott., Str Pref. + ECDP, and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bars: gmean, gmean-no-health. The y-axis runs from 0 to 1.4; off-scale bars are labeled 2.27, 2.58, 1.55, 1.61, 1.75 and 1.77.]

Performance on Dual-Core
[Figure: IPC normalized to stream prefetching for Str Pref. Only, Str Pref. + DBP, Str Pref. + Markov, GHB, and Str Pref. + ECDP + Coord. Thrott. Workloads: (mcf, gcc), (xalanc, astar), (gcc, milc), (astar, leslie), (xalanc, namd), (omnetpp, soplex), (astar, mcf), (astar, h264), (pfast, xalanc), (omnetpp, perl), (pfast, leslie), (GemsFDTD, h264); summary bar: gmean. The y-axis runs from 0 to 2.4.]

Performance on Quad-Core
[Figure: IPC normalized to stream prefetching for Str Pref. Only, Str Pref. + DBP, Str Pref. + Markov, GHB, and Str Pref. + ECDP + Coord. Thrott. Workloads: (mcf, astar, xalanc, perl), (omnetpp, gcc, h264, milc), (tonto, soplex, xalanc, pfast), (omnetpp, namd, tonto, gobmk); summary bar: gmean. The y-axis runs from 0 to 3.8.]

Stream Prefetcher Accuracy
[Figure: Stream prefetcher accuracy (0 to 100) for Str Pref. + Orig. CDP, Str Pref. + ECDP, Str Pref. + Orig. CDP + Coord. Thrott., and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bar: amean.]

CDP Accuracy
[Figure: CDP accuracy (0 to 100) for Str Pref. + Orig. CDP, Str Pref. + ECDP, Str Pref. + Orig. CDP + Coord. Thrott., and Str Pref. + ECDP + Coord. Thrott. Benchmarks: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast; summary bar: amean.]

Sensitivity Study