A Case for Core-Assisted
Bottleneck Acceleration in GPUs
Enabling Flexible Data Compression
with Assist Warps
Nandita Vijaykumar
Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick,
Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir,
Todd C. Mowry, Onur Mutlu
Executive Summary

- Observation: Imbalances in execution leave GPU resources underutilized
- Our Goal: Employ underutilized GPU resources to do something useful – accelerate bottlenecks using helper threads
- Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
- Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
  - A new framework to enable helper threading in GPUs
  - Enables flexible data compression to alleviate the memory bandwidth bottleneck
  - Supports a wide set of use cases (e.g., prefetching, memoization)
- Key Result: Using CABA to implement data compression in memory improves performance by 41.7%
GPUs today are used for a wide range of applications …
- Computer Vision
- Data Analytics
- Scientific Simulation
- Medical Imaging
Challenges in GPU Efficiency
[Figure: a GPU Streaming Multiprocessor with its threads, cores, register file, and memory hierarchy; cores sit idle while the register file is full]

- The memory bandwidth bottleneck leads to idle cores
- Thread limits lead to an underutilized register file
Motivation: Unutilized On-chip Memory

[Figure: percentage of unallocated registers per application]

- 24% of the register file is unallocated on average
- Similar trends hold for on-chip scratchpad memory
Motivation: Idle Pipelines

[Figure: breakdown of active vs. stall cycles for memory-bound and compute-bound applications]

- Memory-bound applications (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs): 67% of cycles idle on average
- Compute-bound applications (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc): 35% of cycles idle on average
Motivation: Summary
Heterogeneous application requirements lead to:
- Bottlenecks in execution
- Idle resources
Our Goal

Use idle resources to do something useful: accelerate bottlenecks using helper threads.

[Figure: helper threads run on idle cores, using unallocated register file space and the memory hierarchy]

A flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA)
Helper Threads in GPUs

- Large body of work in CPUs: [Chappell+ ISCA ’99, MICRO ’02], [Yang+ USC TR ’98], [Dubois+ CF ’04], [Zilles+ ISCA ’01], [Collins+ ISCA ’01, MICRO ’01], [Aamodt+ HPCA ’04], [Lu+ MICRO ’05], [Luk+ ISCA ’01], [Moshovos+ ICS ’01], [Kamruzzaman+ ASPLOS ’11], etc.
- However, there are new challenges with GPUs…
Challenge

How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Managing Helper Threads in GPUs

[Figure: where helper threads could be added – at the thread/warp level, managed in hardware, or at the thread-block level, managed in software]

Where do we add helper threads?
Approach #1: Software-only

[Figure: helper threads launched as additional thread blocks alongside the regular threads]

Pros:
- No hardware changes

Cons:
- Coarse-grained
- Synchronization is difficult
- Not aware of runtime program behavior
Where Do We Add Helper Threads?

[Figure: the same granularity choice revisited – warp level (hardware) vs. thread-block level (software)]
Approach #2: Hardware-only

Pros:
- Fine-grained control
  - Synchronization
  - Enforcing priorities

Cons:
- Providing contexts efficiently is difficult

[Figure: in a CPU, each core has its own register file(s) to hold helper-thread contexts; in a GPU, the many warps on each core share one large register file]
CABA: An Overview

In software: “tight coupling” of helper threads and regular threads
- Efficient context management
- Simpler data communication

In hardware: “decoupled management” of helper threads and regular threads
- Dynamic management of threads
- Fine-grained synchronization
CABA: 1. In Software

[Figure: helper threads sharing registers with their regular thread block]

Helper threads:
- Tightly coupled to regular threads
- Simply instructions injected into the GPU pipelines
- Share the same context as the regular threads

This gives efficient context management and simpler data communication.
CABA: 2. In Hardware

Helper threads:
- Decoupled from regular threads
- Tracked at the granularity of a warp – an assist warp
- Each regular (parent) warp can have different assist warps

[Figure: parent warp X with its assist warps A and B]

This enables dynamic management of threads and fine-grained synchronization.
Key Functionalities

- Triggering and squashing assist warps
  - Associating events with assist warps
- Deploying active assist warps
  - Scheduling instructions for execution
- Enforcing priorities
  - Between assist warps and parent warps
  - Between different assist warps
CABA: Mechanism

[Figure: the GPU pipeline (fetch, decode, issue, ALU/memory units, writeback, scoreboard, I-cache) augmented with the CABA structures listed below]

- Assist Warp Store: holds the instructions for the different assist warp routines
- Assist Warp Controller: central point of control – triggers assist warps, squashes them, and tracks the progress of active assist warps
- Assist Warp Buffer: stages instructions from triggered assist warps for execution and helps enforce priorities for different assist warps

A minimal software model of these structures is sketched below.
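The following is a minimal, software-only model of the three structures above, meant only to make their division of responsibilities concrete. The class and method names (on_event, deploy, squash), the event encoding, and the table organization are illustrative assumptions, not the hardware interface from the paper.

```python
# Minimal sketch of the CABA control structures (illustrative only).
from collections import deque

class AssistWarpStore:
    """Holds the instruction sequences for the assist warp routines."""
    def __init__(self):
        self.routines = {}                     # routine name -> list of instructions

    def add_routine(self, name, instructions):
        self.routines[name] = list(instructions)

class AssistWarpBuffer:
    """Stages instructions from triggered assist warps for issue; the issue
    logic drains this queue when regular warps leave pipeline slots idle."""
    def __init__(self):
        self.pending = deque()

    def enqueue(self, parent_warp, instruction):
        self.pending.append((parent_warp, instruction))

class AssistWarpController:
    """Central point of control: associates events with routines, triggers
    and squashes assist warps, and tracks the progress of active ones."""
    def __init__(self, store, buffer):
        self.store = store
        self.buffer = buffer
        self.triggers = {}                     # event name -> routine name
        self.active = {}                       # (parent warp, routine) -> next instr index

    def register_trigger(self, event, routine):
        self.triggers[event] = routine

    def on_event(self, event, parent_warp):
        routine = self.triggers.get(event)
        if routine is not None:
            self.active[(parent_warp, routine)] = 0   # trigger a new assist warp

    def deploy(self):
        """Stage the next instruction of every active assist warp."""
        for (parent_warp, routine), pc in list(self.active.items()):
            instrs = self.store.routines[routine]
            if pc < len(instrs):
                self.buffer.enqueue(parent_warp, instrs[pc])
                self.active[(parent_warp, routine)] = pc + 1
            else:
                del self.active[(parent_warp, routine)]   # routine finished

    def squash(self, parent_warp, routine):
        self.active.pop((parent_warp, routine), None)     # abandon an assist warp
```

For example, an L1 fill of a compressed block could be registered as an event that triggers the decompression routine on behalf of the requesting (parent) warp.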
Other Functionality

In the paper:
- More details on the hardware structures
- Data communication and synchronization
- Enforcing priorities
CABA: Applications

- Data compression
- Memoization
- Prefetching
- …
A Case for CABA: Data Compression

- Data compression can help alleviate the memory bandwidth bottleneck by transmitting data in a more condensed form
- CABA employs idle compute pipelines to perform compression

[Figure: otherwise-idle cores compress and decompress data as it moves through the memory hierarchy]
Data Compression with CABA

Use assist warps to:
- Compress cache blocks before writing them to memory
- Decompress cache blocks before placing them into the cache

CABA flexibly enables various compression algorithms.

Example: BDI compression [Pekhimenko+ PACT ’12]
- Parallelizable across the SIMT width
- Low latency

Others: FPC [Alameldeen+ TR ’04], C-Pack [Chen+ VLSI ’10]. A sketch of the BDI idea follows below.
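BDI exploits the low dynamic range of values in many cache blocks: a block is stored as one base value plus a narrow delta per word, and each delta can be computed by a different SIMT lane, which is what makes the routine both parallelizable and low latency. Below is a minimal sketch; the block layout, word size, and delta width are illustrative assumptions, and real BDI additionally tries several base/delta-size combinations and special-cases all-zero and repeated-value blocks.

```python
# Sketch of the Base-Delta-Immediate (BDI) idea for one cache block
# (parameters are illustrative, not the exact configurations from the paper).

def bdi_compress(words, delta_bytes=1):
    """Try to encode a list of 4-byte words as one base plus narrow deltas.
    Returns (base, deltas) on success, or None if the block is incompressible."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = []
    for w in words:                      # in hardware, one SIMT lane per word
        delta = w - base
        if not -limit <= delta < limit:
            return None                  # a delta does not fit: give up
        deltas.append(delta)
    return base, deltas

def bdi_decompress(base, deltas):
    """Each lane adds its delta back to the shared base."""
    return [base + d for d in deltas]

# Example: 8 pointer-like words that differ only in their low bits.
block = [0x7F3A1000 + off for off in (0, 8, 16, 24, 32, 40, 48, 56)]
compressed = bdi_compress(block)
assert compressed is not None and bdi_decompress(*compressed) == block
# The 32-byte block shrinks to roughly 4 (base) + 8 (deltas) = 12 bytes.
```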
Walkthrough of Decompression

[Figure: on an L1 miss, the compressed block is found in L2 (or fetched from memory); the Assist Warp Controller triggers the decompression routine from the Assist Warp Store, the scheduler deploys it, and the cores decompress the block before it is placed into L1]

The flow is sketched below.
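A minimal sketch of this flow, reusing bdi_decompress from the sketch above; the L1 and L2 caches are modeled as plain dicts keyed by block address, an assumption made purely for illustration.

```python
def gpu_load(addr, l1, l2):
    if addr in l1:
        return l1[addr]                     # L1 hit: blocks in L1 are uncompressed
    tag, payload = l2[addr]                 # L1 miss: L2/memory hold compressed data
    if tag == "bdi":
        data = bdi_decompress(*payload)     # AWC triggers the decompression assist warp
    else:
        data = payload                      # block was stored uncompressed
    l1[addr] = data                         # fill L1 with the uncompressed block
    return data                             # the parent warp's load then completes
```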
Walkthrough of Compression

[Figure: when a block is written back from L1 toward L2 and memory, the Assist Warp Controller triggers the compression routine and the cores compress the block before it is written]

A companion sketch follows below.
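A companion sketch for the compression direction, reusing bdi_compress and the same dict-based cache model; blocks that do not compress are written back unchanged.

```python
def gpu_writeback(addr, block, l2):
    compressed = bdi_compress(block)        # AWC triggers the compression assist warp
    if compressed is not None:
        l2[addr] = ("bdi", compressed)      # store the compact form, saving bandwidth
    else:
        l2[addr] = ("raw", block)           # incompressible: fall back to the raw block
```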
Evaluation Methodology

- Simulator: GPGPU-Sim, GPUWattch
- Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK
- System parameters:
  - 15 SMs, 32 threads/warp
  - 48 warps/SM, 32,768 registers, 32 KB shared memory
  - Core: 1.4 GHz, GTO warp scheduler, 2 schedulers/SM
  - Memory: 177.4 GB/s bandwidth, 6 GDDR5 memory controllers, FR-FCFS scheduling
  - Caches: L1 – 16 KB, 4-way associative; L2 – 768 KB, 16-way associative
- Metrics:
  - Performance: instructions per cycle (IPC)
  - Bandwidth consumption: fraction of cycles the DRAM data bus is busy

Effect on Performance

[Figure: performance of CABA-BDI and No-Overhead-BDI, normalized to the baseline]

- CABA provides a 41.7% performance improvement
- CABA achieves performance close to that of designs with no overhead for compression
Effect on Bandwidth Consumption

[Figure: memory bandwidth consumption of the baseline vs. CABA-BDI]

Data compression with CABA alleviates the memory bandwidth bottleneck.
Different Compression Algorithms

[Figure: normalized performance with CABA-FPC, CABA-BDI, CABA-CPack, and CABA-BestOfAll]

CABA is flexible: it improves performance with different compression algorithms.
Other Results

- CABA's performance is similar to that of pure hardware-based BDI compression
- CABA reduces overall system energy (by 22%) by decreasing the off-chip memory traffic
- Other evaluations:
  - Compression ratios
  - Sensitivity to memory bandwidth
  - Capacity compression
  - Compression at different levels of the hierarchy
Conclusion

- Observation: Imbalances in execution leave GPU resources underutilized
- Our Goal: Employ underutilized GPU resources to do something useful – accelerate bottlenecks using helper threads
- Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
- Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
  - A new framework to enable helper threading in GPUs
  - Enables flexible data compression to alleviate the memory bandwidth bottleneck
  - Supports a wide set of use cases (e.g., prefetching, memoization)
- Key Result: Using CABA to implement data compression in memory improves performance by 41.7%
A Case for Core-Assisted
Bottleneck Acceleration in GPUs
Enabling Flexible Data Compression
with Assist Warps
Nandita Vijaykumar
Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick,
Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir,
Todd C. Mowry, Onur Mutlu
Backup Slides
Effect on Energy

[Figure: normalized energy for CABA-BDI, Ideal-BDI, HW-BDI-Mem, and HW-BDI]

CABA reduces the overall system energy by decreasing the off-chip memory traffic.
Effect on Compression Ratio
[Figure: compression ratios achieved across applications]
Other Uses of CABA

Hardware Memoization
- Goal: avoid redundant computation by reusing previous results over the same/similar inputs
- Idea (sketched below):
  - Hash the inputs at predefined points
  - Use the load/store pipelines to save inputs in shared memory
  - Eliminate redundant computation by loading stored results
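A minimal sketch of this idea; the dictionary stands in for a result table in on-chip shared memory, and expensive_region is a hypothetical placeholder for the computation being skipped (both are assumptions for illustration only).

```python
memo_table = {}

def memoized_call(inputs, expensive_region):
    key = hash(tuple(inputs))               # assist warp hashes the inputs at a predefined point
    if key in memo_table:
        return memo_table[key]              # reuse the stored result, skipping the computation
    result = expensive_region(inputs)       # otherwise compute as usual
    memo_table[key] = result                # save the result for future reuse
    return result

# Example with a trivial stand-in computation:
print(memoized_call([1, 2, 3], lambda xs: sum(x * x for x in xs)))
```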
Prefetching
- Similar to helper-thread prefetching in CPUs