APOGEE: Adaptive Prefetching on GPU for Energy Efficiency
Ankit Sethia¹, Ganesh Dasika², Mehrzad Samadi¹, Scott Mahlke¹
¹ University of Michigan, Electrical Engineering and Computer Science
² ARM R&D Austin
Introduction
• High throughput: 1 TeraFLOP
• High energy efficiency
• Programmability
Can efficiency be pushed further?
• Higher performance on mobile
• Lower cost in supercomputers
Background
[Figure: Baseline GPU: warp scheduler feeding register file banks, ALU and SFU lanes, with a data cache and memory controller connecting to global memory]
• Hide latency by fine-grained multithreading
• 20,000 in-flight threads
• 2 MB on-chip register file
• Management overhead:
  o Scheduling
  o Divergence
Motivation - I
[Figure: Normalized speedup (left axis) and % increase in power (right axis) as the number of warps grows from 2 to 32]
Too many warps decrease efficiency
Motivation - II
[Figure: Normalized register file access rate and normalized per-register accesses as the number of warps grows from 1 to 32]
The hardware added to hide memory latency is underutilized
APOGEE: Overview
[Figure: The baseline GPU pipeline with a prefetcher added between the data cache and global memory]
• Prefetch data from memory to cache
• Less latency to hide
• Less multithreading
• Smaller register context
Traditional CPU Prefetching
[Figure: 32 warps (W0 to W31) each covering 32 consecutive elements of a 4096-element array; the deltas observed between consecutive misses (e.g. -32, 96) hide the true stride of 64]
• A stride cannot be found
• Next-line prefetching is not timely
Neither traditional prefetching technique works
Fixed Offset Address Prefetching
[Figure: With fewer warps, each warp executes more loop iterations, and consecutive accesses by the same warp differ by a fixed offset (64 in the example)]
Fewer warps let each warp run more iterations, exposing a fixed offset (stride × numWarps)
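As a rough sketch of the idea (the function names and byte values here are illustrative, not the paper's hardware): with `numWarps` warps interleaved over an array, a warp's consecutive accesses land a fixed `stride × numWarps` apart, so the next miss address is predictable.

```python
# Sketch only: how a fixed offset emerges with few warps,
# and how it predicts the next address to prefetch.

def observed_offset(stride: int, num_warps: int) -> int:
    """Offset between consecutive accesses of the *same* warp.

    Each of the num_warps warps covers `stride` bytes per iteration,
    so a warp's next access lands stride * num_warps bytes ahead.
    """
    return stride * num_warps

def foa_prefetch_addr(miss_addr: int, stride: int, num_warps: int,
                      distance: int = 1) -> int:
    """Predict the address to prefetch, `distance` iterations ahead."""
    return miss_addr + observed_offset(stride, num_warps) * distance

# Example (assumed sizes): 4 warps, each 32 threads touching 4-byte
# words, so one warp covers 128 B per iteration. A warp that missed
# at address 0 will next touch 128 * 4 = 512.
next_addr = foa_prefetch_addr(0, stride=128, num_warps=4)
```

With 32 warps the same warp revisits the array so rarely that this offset never stabilizes, which is why the scheme pairs naturally with a small warp count.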
Timeliness in Prefetching
[Figure: Timeline of a load with three outcomes: timely prefetch (data returns before the load), slow prefetch (the load issues while the prefetch is still in flight), early prefetch (data returns so early it may be evicted before use)]
A 2-bit state per load tracks timeliness: 00 (new load) → prefetch sent to memory → 01 → prefetch received from memory → 10
• Increase the prefetch distance if the next load happens in state 01 (the prefetch was slow)
• Decrease the prefetch distance if a correctly prefetched address in state 10 was still a miss (the prefetch was too early and the data was evicted)
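The distance-adjustment rules above can be sketched as a small state machine (the class and method names are mine, and the policy is a plain reading of the slide, not the paper's RTL):

```python
# Per-load 2-bit timeliness state used to tune the prefetch distance.
IDLE, SENT, RECEIVED = 0b00, 0b01, 0b10

class TimelinessTracker:
    def __init__(self, distance: int = 1):
        self.state = IDLE
        self.distance = distance  # iterations ahead to prefetch

    def prefetch_sent(self):
        self.state = SENT

    def prefetch_received(self):
        self.state = RECEIVED

    def load_arrives(self, was_cache_miss: bool):
        if self.state == SENT:
            # Prefetch still in flight when the load arrived:
            # it was slow, so look further ahead next time.
            self.distance += 1
        elif self.state == RECEIVED and was_cache_miss:
            # Data came back but was evicted before use:
            # too early, so pull the distance back in.
            self.distance = max(1, self.distance - 1)
        self.state = IDLE
```

Over a few iterations the distance settles where the prefetch arrives just before the load, without over-fetching.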
FOA Prefetching
[Figure: Prefetch table datapath. Each entry holds Load PC, Address, Offset, Confidence, Tid, Distance, PF Type, and PF State. On a miss, the current PC indexes the table; when the confidence exceeds a threshold, the prefetch address is computed from the miss address, the offset, the number of threads, and the distance, and pushed to the prefetch queue.]
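A software sketch of one table entry might look like the following. The field names follow the slide; the update policy (when confidence is bumped or reset, and the exact address arithmetic) is a plausible assumption on my part, not the paper's exact logic:

```python
# Sketch of a per-load-PC prefetch table entry for FOA prefetching.
class PrefetchEntry:
    def __init__(self, pc: int, addr: int):
        self.pc = pc
        self.addr = addr        # last miss address seen for this PC
        self.offset = 0         # observed fixed offset between misses
        self.confidence = 0     # bumped each time the offset repeats
        self.distance = 1       # iterations ahead to prefetch

    def update(self, miss_addr: int, threshold: int = 2):
        """On a miss at this PC: learn the offset; emit a prefetch
        address (for the prefetch queue) once confidence is high."""
        offset = miss_addr - self.addr
        if offset == self.offset:
            self.confidence += 1
        else:
            self.offset = offset
            self.confidence = 0
        self.addr = miss_addr
        if self.confidence >= threshold:
            return miss_addr + self.offset * self.distance
        return None
```

After a couple of misses with the same offset, the entry starts producing prefetch addresses; an offset change resets confidence so a mispredicting PC goes quiet again.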
Graphics Memory Accesses
[Figure: Breakdown of memory accesses per graphics benchmark (ES, HS, RS, IT, ST, WT, WF, MEAN) into FOA, TIA, Texture, and Others]
Three major types of accesses:
• Fixed Offset Access (FOA): data accessed by adjacent threads in an array has a fixed offset.
• Thread Invariant Address (TIA): the same address is accessed by all threads in a warp.
• Texture Access: addresses accessed during texturing operations.
Thread Invariant Addresses
[Figure: A loop body with loads at PC0 (constant), PC1, PC2, and PC3 across iterations. In early iterations, a prefetch issued at a load's own PC returns too late (slow prefetch). The TIA prefetch table records the prefetch PC, the address, and a slow bit; once a prefetch is marked slow, it is issued at the preceding load's PC, so that by a later iteration it arrives in time (timely prefetch). Prefetch addresses are pushed to the prefetch queue.]
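The slow-bit mechanism described above can be sketched as follows. This is my reading of the slide: the structure and names are illustrative, and only the "move the prefetch one load earlier" behavior is taken from the source.

```python
# Sketch of a TIA prefetch table entry: thread-invariant addresses
# don't change across threads, so the only question is *when* to
# issue the prefetch so it returns in time.
class TIAEntry:
    def __init__(self, pf_pc: str, address: int):
        self.pf_pc = pf_pc      # PC at which the prefetch is issued
        self.address = address  # thread-invariant address to prefetch
        self.slow = False

    def mark_slow(self, prev_pc: str):
        """The prefetch didn't return before its load executed:
        set the slow bit and issue it at the preceding load's PC."""
        self.slow = True
        self.pf_pc = prev_pc
```

Each time a prefetch is observed to be slow, its issue point hoists one load earlier in the loop body, until it becomes timely.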
Experimental Evaluation
• Benchmarks
  o Mesa driver, SPECviewperf traces, MV5 GPGPU benchmarks
• Performance evaluation
  o MV5 simulator, 1 GHz, in-order SIMT, 400-cycle memory latency
  o Prefetcher: 32 entries
  o Prefetch latency: 10 cycles
  o D-cache: 64 kB, 8-way, 32 bytes per line
• Power evaluation
  o Dynamic power from an analytical model (Hong et al., ISCA '10)
  o Static power from published numbers and tools:
    • FPU, SFU, caches, register file, fetch/decode/schedule
APOGEE Performance
[Figure: Normalized performance of SIMT_32, STRIDE_4, MTA_4, and APOGEE_4 across graphics (HS, WT, WF, IT, ST, ES, RS) and GPGPU (FFT, FT, HP, MS, SP) benchmarks, with geometric mean]
On average, APOGEE improves performance by 19%
APOGEE Performance - II
[Figure: Normalized performance of SIMT_32, MTA_32, and APOGEE_4 across graphics (HS, WT, WF, IT, ST, ES, RS) and GPGPU (FFT, FT, HP, MS, SP) benchmarks, with geometric mean]
MTA with 32 warps is within 3% of APOGEE with 4 warps
Performance and Power
[Figure: Normalized power vs. normalized performance for SIMT, MTA, and APOGEE as the number of warps varies from 1 to 32]
• 20% speedup over SIMT, with 14K fewer registers
• Around 51% improvement in performance/Watt
Prefetcher Accuracy
[Figure: % of correct predictions per benchmark across graphics (HS, WT, WF, IT, ST, ES, RS) and GPGPU (FFT, FT, HP, MS, SP) suites, with geometric mean]
APOGEE predicts prefetch addresses with 93.5% accuracy
D-Cache Miss Rate
[Figure: % reduction in D-cache miss rate for MTA_4, MTA_32, and APOGEE across graphics (HS, WT, WF, IT, ST, ES, RS) and GPGPU (FFT, FT, HP, MS, SP) benchmarks]
Prefetching results in an 80% reduction in the D-cache miss rate
Conclusion
• Relying on heavy multithreading on GPUs is inefficient
• Adaptive prefetching exploits the regular memory access patterns of GPU applications:
  o Adapts to the timeliness of prefetching
  o Prefetches for two different access patterns
• Over 12 graphics and GPGPU benchmarks: 20% improvement in performance and 51% improvement in performance/Watt
Questions?