Prefetch-Aware DRAM Controllers
Chang Joo Lee
Onur Mutlu*
Veynu Narasiman
Yale N. Patt
Electrical and Computer Engineering
The University of Texas at Austin
*Microsoft Research and Carnegie Mellon University
1
Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
2
Modern DRAM Systems
- Rows and columns of DRAM cells
- A row buffer in each bank
- Non-uniform access latency:
  - Row-hit: data is already in the row buffer
  - Row-conflict: data is not in the row buffer; the DRAM cells must be accessed
- Row-hit latency < row-conflict latency
- Prioritize row-hit accesses to increase DRAM throughput [Rixner et al., ISCA 2000] (see the sketch below)
(Figure: a DRAM bank with its row buffer connected to the data bus; accesses to the open Row A are row-hits, while an access to Row B is a row-conflict.)
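A minimal C sketch of the row-hit-first idea under a simplified single-bank model; the structures and function names here are illustrative assumptions, not the controller design presented in this talk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model: one DRAM bank tracks which row its row buffer holds. */
typedef struct {
    int64_t open_row;          /* -1 if the row buffer is closed */
} dram_bank_t;

typedef struct {
    int64_t  row;              /* row this request maps to */
    uint64_t arrival_time;     /* for oldest-first tie-breaking */
} mem_request_t;

/* Row-hit: the requested row is already in the row buffer. */
static bool is_row_hit(const dram_bank_t *bank, const mem_request_t *req) {
    return bank->open_row == req->row;
}

/* Row-hit-first pick (FR-FCFS style): prefer a row-hit; otherwise the oldest. */
static const mem_request_t *pick_next(const dram_bank_t *bank,
                                      const mem_request_t *queue, size_t n) {
    const mem_request_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const mem_request_t *r = &queue[i];
        if (best == NULL ||
            (is_row_hit(bank, r) && !is_row_hit(bank, best)) ||
            (is_row_hit(bank, r) == is_row_hit(bank, best) &&
             r->arrival_time < best->arrival_time)) {
            best = r;
        }
    }
    return best;
}
```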
3
Problems of Prefetch Handling
- How should prefetches be scheduled relative to demands?
  - Demand-first: always prioritizes demand requests over prefetch requests
  - Demand-prefetch-equal: always treats them the same
- Neither of these performs best
- Neither takes into account both:
  1. The non-uniform access latency of DRAM systems
  2. The usefulness of prefetches
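The two baseline policies can be written as priority comparators. This is a sketch with assumed field names; it only restates the rules above, not an actual controller implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative request record; field names are assumptions for this sketch. */
typedef struct {
    bool     is_prefetch;
    bool     row_hit;          /* would hit the currently open row */
    uint64_t arrival_time;
} req_t;

/* Demand-first: demands beat prefetches, then row-hits, then oldest. */
static bool demand_first_higher(const req_t *a, const req_t *b) {
    if (a->is_prefetch != b->is_prefetch) return !a->is_prefetch;
    if (a->row_hit != b->row_hit)         return a->row_hit;
    return a->arrival_time < b->arrival_time;
}

/* Demand-prefetch-equal: ignore the prefetch bit; plain row-hit-first. */
static bool demand_pref_equal_higher(const req_t *a, const req_t *b) {
    if (a->row_hit != b->row_hit) return a->row_hit;
    return a->arrival_time < b->arrival_time;
}
```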
4
When Prefetches are Useful
(Figure: DRAM service timeline under demand-first. Row A is initially open in the row buffer; the memory request buffer holds Pref Row A:X, Dem Row B:Y, and Pref Row A:Z; the processor needs Y, X, and Z.)
- Demand-first services the demand for Y (Row B) before the prefetches for X and Z (Row A): 2 row-conflicts, 1 row-hit
- The processor stalls on the later misses to X and Z
5
When Prefetches are Useful
- Demand-pref-equal services the row-hits first: X and Z (Row A, 2 row-hits) before Y (Row B, 1 row-conflict)
- X and Z are in the cache by the time the processor needs them, saving execution cycles
- When prefetches are useful, demand-pref-equal outperforms demand-first
(Figure: DRAM service timelines comparing demand-first and demand-pref-equal for the same three requests.)
6
When Prefetches are Useless
- The processor needs ONLY Y; the prefetches for X and Z are useless
- Demand-first services Y first, saving cycles; demand-pref-equal delays Y behind the useless prefetches to X and Z
- When prefetches are useless, demand-first outperforms demand-pref-equal
(Figure: DRAM service timelines comparing demand-first and demand-pref-equal when only Y is needed.)
7
Demand-first vs. Demand-pref-equal policy
(Figure: IPC normalized to no prefetching for SPEC benchmarks with a stream prefetcher enabled, comparing demand-first and demand-pref-equal. For some benchmarks demand-pref-equal is better; for others demand-first is better.)
- Useless prefetches waste off-chip bandwidth and queue resources and cause cache pollution
- Goal 1: Adaptively schedule prefetches based on prefetch usefulness
- Goal 2: Eliminate useless prefetches
8
Goals
1. Maximize the benefits of prefetching: increase DRAM throughput by adaptively scheduling requests based on prefetch usefulness → increase the timeliness of useful prefetches
2. Minimize the harm of prefetching: adaptively delay the service of useless prefetches and remove them → increase the efficiency of resource utilization
Achieve higher performance and efficiency
9
Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
10
Prefetch-Aware DRAM Controllers (PADC)
- Adaptive Prefetch Scheduling (APS): prioritizes prefetch and demand requests based on prefetch accuracy estimation
- Adaptive Prefetch Dropping (APD): cancels likely-useless prefetches from the memory request buffer based on prefetch accuracy
(Figure: PADC block diagram. APS sets request priorities and APD drops requests in the memory request buffer, driven by per-core prefetch accuracy estimates; prioritized requests are sent to DRAM.)
11
Prefetch Accuracy Estimation
Prefetch accuracy = (# prefetches used) / (# prefetches sent)

- Hardware support:
  - Prefetch bit (per L2 cache line and MSHR entry): indicates whether the block was fetched by a prefetch or a demand
  - Prefetch sent counter (per core)
  - Prefetch used counter (per core)
  - Prefetch accuracy register (per core)
- Accuracy is estimated every 100K cycles
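A small C sketch of the per-core accuracy estimation described above; resetting the counters at each 100K-cycle interval boundary is an assumption of this sketch.

```c
#include <stdint.h>

/* Per-core counters for prefetch accuracy estimation (names are illustrative). */
typedef struct {
    uint32_t pref_sent;     /* prefetches sent to memory in this interval   */
    uint32_t pref_used;     /* prefetched lines later hit by a demand       */
    uint32_t accuracy_pct;  /* accuracy register, updated every interval    */
} pref_accuracy_t;

#define ACCURACY_INTERVAL_CYCLES 100000u  /* slide: estimated every 100K cycles */

/* Called on an interval boundary: accuracy = used / sent, then reset counters. */
static void update_accuracy(pref_accuracy_t *c) {
    if (c->pref_sent != 0)
        c->accuracy_pct = (uint32_t)((100ull * c->pref_used) / c->pref_sent);
    /* else keep the previous estimate */
    c->pref_sent = 0;
    c->pref_used = 0;
}
```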
12
Adaptive Prefetch Scheduling (APS)
1. Adaptively change the priority of prefetch requests:
   - Low prefetch accuracy → prioritize demands from that core
   - High prefetch accuracy → treat demands and prefetches from that core equally
2. In a CMP system, prioritize demand requests from a core that has many useless prefetches:
   - Avoids starving demand requests from a core with low prefetch accuracy → improves system performance
13
Adaptive Prefetch Scheduling (APS)
1. Critical requests
   - All demand requests
   - Prefetch requests from cores whose prefetch accuracy ≥ promotion threshold
2. Urgent requests
   - Demand requests from cores whose prefetch accuracy < promotion threshold
14
Adaptive Prefetch Scheduling (APS)
- Each memory request buffer entry carries priority fields: C | RH | U | FCFS
- Prioritization order (highest first; see the comparator sketch after this list):
1. Critical request (C)
2. Row-hit request (RH)
3. Urgent request (U)
4. Oldest request (FCFS)
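Putting the critical/urgent classification together with this prioritization order, a possible comparator looks like the following C sketch; the struct layout and helper names are illustrative assumptions, and the 85% promotion threshold comes from the configuration slide later in the talk.

```c
#include <stdbool.h>
#include <stdint.h>

#define PROMOTION_THRESHOLD_PCT 85u  /* from the PADC configuration slide */

/* Priority fields kept with each memory request buffer entry (sketch). */
typedef struct {
    bool     critical;       /* C: demand, or prefetch from an accurate core */
    bool     row_hit;        /* RH: hits the currently open row              */
    bool     urgent;         /* U: demand from a core with low accuracy      */
    uint64_t arrival_time;   /* FCFS tie-break                               */
} aps_entry_t;

/* Classify an entry from its type and its core's estimated accuracy.
 * row_hit and arrival_time are assumed to be filled in elsewhere. */
static void aps_classify(aps_entry_t *e, bool is_prefetch, uint32_t core_acc_pct) {
    bool accurate = core_acc_pct >= PROMOTION_THRESHOLD_PCT;
    e->critical = !is_prefetch || accurate;   /* critical requests */
    e->urgent   = !is_prefetch && !accurate;  /* urgent demands    */
}

/* APS prioritization: Critical > Row-hit > Urgent > Oldest (FCFS). */
static bool aps_higher_priority(const aps_entry_t *a, const aps_entry_t *b) {
    if (a->critical != b->critical) return a->critical;
    if (a->row_hit  != b->row_hit)  return a->row_hit;
    if (a->urgent   != b->urgent)   return a->urgent;
    return a->arrival_time < b->arrival_time;
}
```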
15
Adaptive Prefetch Dropping (APD)
- Proactively drops old prefetches based on prefetch accuracy estimation
- Old requests are likely useless:
  - APS prioritizes demand requests when prefetch accuracy is low
  - A prefetch that is hit by a demand is promoted to a demand
- Dropping old, useless prefetches saves resources (bandwidth, queues, caches)
- Saved resources can be used by useful requests
16
Adaptive Prefetch Dropping (APD)
- Each memory request buffer entry carries drop information: P | ID | AGE
  - Prefetch bit (P)
  - Core ID field (ID)
  - Age field (AGE)
- Drop prefetch requests whose AGE > drop threshold
- The drop threshold is dynamically determined from the prefetch accuracy estimate: lower accuracy → lower threshold
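A minimal C sketch of the drop check, assuming AGE is tracked in core cycles and compared against a per-core drop threshold; field names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Drop information kept with each memory request buffer entry (sketch). */
typedef struct {
    bool     is_prefetch;   /* P: prefetch bit                       */
    uint8_t  core_id;       /* ID: core that generated the request   */
    uint64_t age_cycles;    /* AGE: time spent in the request buffer */
} apd_entry_t;

/* A prefetch is dropped once it has aged past its core's drop threshold,
 * which is derived from that core's prefetch accuracy (see the threshold
 * table in the methodology slide). Demands are never dropped. */
static bool apd_should_drop(const apd_entry_t *e, uint64_t drop_threshold) {
    return e->is_prefetch && e->age_cycles > drop_threshold;
}
```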
17
Hardware Cost for 4-core CMP
Component                       Cost (bits)
Prefetch Accuracy Estimation    33,056
APS                             128
APD                             1,536
Total                           34,720

- Total storage: 34,720 bits (~4.25KB)
  - ~4KB of this is the prefetch bit in each cache line
  - If prefetch bits are already implemented: ~228B
- Logic is not on the critical path
  - Scheduling and dropping decisions are made every DRAM bus cycle
18
Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
19
Simulation Methodology
- x86 cycle-accurate simulator
- Baseline processor configuration
  - Per core:
    - 4-wide issue, out-of-order, 256-entry ROB
    - 512KB, 8-way unified L2 cache (1MB for the single-core processor)
    - Stream prefetcher (lookahead, prefetch degree: 4, prefetch distance: 64)
  - Shared:
    - On-chip, demand-first FR-FCFS memory controller
    - 64/128/256 L2 MSHRs and memory request buffer entries for the 1-, 4-, and 8-core systems
    - DDR3-1333, 15-15-15 ns, 4KB row buffer
- PADC configuration
  - Promotion threshold: 85%
  - Drop threshold (see the sketch below):

    Prefetch accuracy (%)      0-10    10-30    30-70    70-100
    Threshold (core cycles)    100     1,500    50,000   100,000
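The drop-threshold table maps directly to a small lookup function. This C sketch assumes the boundary handling shown (e.g., an accuracy of exactly 30% falls into the 30-70 bucket), which the slide does not specify.

```c
#include <stdint.h>

/* Map estimated prefetch accuracy to APD's drop threshold, using the
 * configuration from this slide: lower accuracy -> lower (more aggressive)
 * threshold. Returned values are in core cycles. */
static uint64_t apd_drop_threshold(uint32_t accuracy_pct) {
    if (accuracy_pct < 10) return 100;
    if (accuracy_pct < 30) return 1500;
    if (accuracy_pct < 70) return 50000;
    return 100000;
}
```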
20
Workloads for Evaluation
- Single-core processor:
  - All 55 SPEC 2000/2006 benchmarks (single-threaded)
  - 38 prefetch-sensitive benchmarks
  - 17 prefetch-insensitive benchmarks
- CMP: randomly chosen multiprogrammed workloads from the 55 benchmarks
  - 4-core CMP: 32 workloads
  - 8-core CMP: 21 workloads
21
Performance of PADC
(Figure: average performance normalized to demand-first for no-pref, demand-first, demand-pref-equal, and PADC on single-core, 4-core CMP, and 8-core CMP systems.)
- Single-core: PADC improves average performance by 4.3% over demand-first
- 4-core CMP: 8.2%
- 8-core CMP: 9.9%
22
Bus Traffic of PADC
(Figure: average memory bus traffic in million cache lines for no-pref, demand-first, demand-pref-equal, and PADC on single-core, 4-core CMP, and 8-core CMP systems.)
- Single-core: PADC reduces average bus traffic by 10.4% relative to demand-first
- 4-core CMP: 10.7%
- 8-core CMP: 9.4%
23
Performance with Other Prefetchers
4-core CMP
(Figure: average performance normalized to no prefetching on the 4-core CMP with stride, GHB, and Markov prefetchers, comparing no-pref, demand-first, and PADC. The annotated average improvements across the three prefetcher panels are 6.0%, 6.6%, and 2.2%.)
24
Bus Traffic with Other Prefetchers
4-core CMP
(Figure: average bus traffic in million cache lines on the 4-core CMP with stride, Markov, and GHB prefetchers, comparing no-pref, demand-first, and PADC. The annotated average reductions across the three prefetcher panels are 5.7%, 6.8%, and 10.3%.)
25
Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
26
Conclusions
- Prefetch-Aware DRAM Controllers (PADC)
  - Adaptive Prefetch Scheduling
    - Increases DRAM throughput by exploiting row-buffer locality when prefetches are useful
    - Delays service of prefetches when they are useless
  - Adaptive Prefetch Dropping
    - Together with APS, removes useless prefetches effectively while keeping the benefits of useful prefetches
- PADC improves performance and bandwidth efficiency for both single-core and CMP systems
- Low cost and easily implementable
27
Questions?
28
Performance Detail
- Single-core:
  - 38 prefetch-sensitive benchmarks: 6.2%
    - Prefetch-friendly: 29 benchmarks
    - Prefetch-unfriendly: 9 benchmarks
    - 17 of the 38 are memory intensive (MPKI > 10): 11.8%
  - 17 prefetch-insensitive benchmarks
29
Two Channel Memory Performance
(Figure: average performance normalized to demand-first on the 4-core and 8-core CMPs with two memory channels; bars for 1ch-demand-first, no-pref, demand-first, demand-pref-equal, and PADC. The charts are annotated with 31% and 16% gains relative to the 1ch-demand-first baseline, and PADC average gains of 5.9% and 5.5% over demand-first.)
30
Two Channel Memory Bus Traffic
(Figure: average bus traffic in million cache lines on the 4-core and 8-core CMPs with two memory channels; bars for no-pref, demand-first, demand-pref-equal, and PADC. The annotated average reductions for PADC are 12.9% and 13.2%.)
31
Comparison with Feedback Directed Prefetching (4-core CMP)
(Figure: performance and bus traffic normalized to demand-first for demand-first, fdp-demand-first, apd-demand-first, fdp-demand-pref-equal, fdp-aps, and PADC (aps-apd). The performance chart is annotated with a 6.4% average gain for PADC.)
32
Performance on Single-Core
(Figure: per-benchmark IPC normalized to demand-first on the single-core system for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC), with the geometric mean on the right.)
33
Prefetch Friendly Application: libquantum
(Figure: performance normalized to demand-first and bus traffic broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
34
Prefetch Unfriendly Application: art
(Figure: performance normalized to demand-first and bus traffic broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
35
Average Performance on Single-Core
All 55 SPEC 2000/2006 CPU benchmarks
(Figure: average performance normalized to demand-first and average bus traffic in million cache lines, broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
36
System Performance on 4-Core CMP
32 randomly chosen 4-core workloads
(Figure: system performance, measured by weighted speedup (WS) and harmonic speedup (HS), and average bus traffic in million cache lines for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
37
System Performance on 8-core CMP
21 randomly chosen 8-core workloads
(Figure: system performance, measured by weighted speedup (WS) and harmonic speedup (HS), and average bus traffic in million cache lines for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
38
Prefetch Friendly Application: leslie3d
(Figure: performance normalized to demand-first and bus traffic broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
39
Prefetch Unfriendly Application: ammp
(Figure: performance normalized to demand-first and bus traffic broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
40
Performance on 4-Core
omnetpp, libquantum, galgel, and GemsFDTD on a 4-core CMP
(Figure: system performance (WS, HS) and individual application speedups relative to each application's single-run IPC, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
41
Performance on 4-Core
omnetpp, libquantum, galgel, and GemsFDTD on a 4-core CMP
(Figure: individual application speedups relative to single-run IPC and per-application bus traffic broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
42
System Performance on 4-Core
omnetpp, libquantum, galgel, and GemsFDTD
(Figure: system performance (WS, HS) and bus traffic broken into demand, useful, and useless cache lines, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC).)
43