
Fine-grain Task Aggregation and
Coordination on GPUs
Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§
†University of Wisconsin–Madison  §AMD Research
ISCA, June 16, 2014
Executive Summary
SIMT languages (e.g., CUDA and OpenCL) restrict GPU programmers to regular parallelism
‒Compare with CPU abstractions like Pthreads, Cilk, MapReduce, TBB, etc.
Goal: enable irregular parallelism on GPUs
‒Why? More GPU applications
‒How? Fine-grain task aggregation
‒What? Cilk on GPUs
2 | Fine-grain Task Aggregation and Coordination on GPUs | ISCA, June 16, 2014
Outline
Background
‒GPUs
‒Cilk
‒Channel Abstraction
Our Work
‒Cilk on Channels
‒Channel Design
Results/Conclusion
GPUs Today
[Figure: GPU with a control processor (CP) and SIMD units, attached to system memory]
GPU tasks are scheduled by the control processor (CP), a small, in-order programmable core
Today's GPU abstractions are coarse-grain
+ Maps well to SIMD hardware
- Limits fine-grain scheduling
Cilk Background
Cilk extends C for divide-and-conquer parallelism
Adds keywords
‒spawn: schedule a thread to execute a function
‒sync: wait for prior spawns to complete
int fib(int n) {
    if (n <= 2) return 1;
    int x = spawn fib(n - 1);
    int y = spawn fib(n - 2);
    sync;
    return (x + y);
}
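A useful property of these keywords (not shown on the slide): deleting spawn and sync, known as Cilk's "serial elision", leaves an ordinary C program that computes the same result sequentially. A minimal sketch in plain C:

```c
#include <assert.h>

/* Serial elision of the fib example: with "spawn" and "sync"
 * removed, this is valid C and computes the same values the
 * parallel Cilk version would. */
int fib(int n) {
    if (n <= 2) return 1;
    int x = fib(n - 1);   /* was: spawn fib(n - 1) */
    int y = fib(n - 2);   /* was: spawn fib(n - 2) */
    /* was: sync; */
    return x + y;
}
```

The serial elision is a handy way to check a Cilk program's logic before worrying about parallel scheduling.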
Prior Work on Channels
[Figure: GPU with an aggregator (Agg) managing channels in system memory]
The CP, or aggregator (agg), manages channels
Channels are finite task queues, except for:
1. User-defined scheduling
2. Dynamic aggregation
3. One consumption function
Dynamic aggregation enables "CPU-like" scheduling abstractions on GPUs
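To make the "finite task queue" shape concrete, here is a single-threaded sketch of a channel as a bounded, array-based queue of task records (hypothetical names channel_t, chan_push, chan_pop; the paper's actual design is lock-free, non-blocking, and handles SIMT-wide access):

```c
#include <assert.h>

#define CHAN_CAP 64

/* One queued task; for fib this would carry the argument n. */
typedef struct { int arg; } task_t;

/* Bounded, array-based task queue: produce at tail, consume at head.
 * Single-threaded sketch only; the real channel coordinates many
 * SIMT producers and consumers without locks. */
typedef struct {
    task_t   slots[CHAN_CAP];
    unsigned head, tail;
} channel_t;

int chan_push(channel_t *c, task_t t) {
    if (c->tail - c->head >= CHAN_CAP) return 0;   /* full  */
    c->slots[c->tail++ % CHAN_CAP] = t;
    return 1;
}

int chan_pop(channel_t *c, task_t *out) {
    if (c->head == c->tail) return 0;              /* empty */
    *out = c->slots[c->head++ % CHAN_CAP];
    return 1;
}
```

All tasks in one channel are consumed by the same function, which is what lets the aggregator batch them into SIMD-friendly dispatches.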
Enable Cilk on GPUs via Channels: Step 1
Cilk routines are split at sync into sub-routines

Original routine (the spawns are the "pre-sync" work; everything after sync is the "continuation"):

int fib(int n) {
    if (n <= 2) return 1;
    int x = spawn fib(n - 1);
    int y = spawn fib(n - 2);
    sync;
    return (x + y);
}

After the split, the "pre-sync" sub-routine and its "continuation":

int fib(int n) {
    if (n <= 2) return 1;
    int x = spawn fib(n - 1);
    int y = spawn fib(n - 2);
}

int fib_cont(int x, int y) {
    return (x + y);
}
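The split can be sketched in plain C with an explicit join counter (cont_frame and its fields are hypothetical names; the real runtime enqueues such frames into channels and runs the continuation once its last input arrives, rather than recursing directly):

```c
#include <assert.h>

/* Continuation frame: holds the two children's results and a join
 * counter of children still outstanding. */
typedef struct {
    int x, y;
    int pending;
} cont_frame;

/* The "continuation" sub-routine runs only after both spawns finish. */
int fib_cont(cont_frame *f) { return f->x + f->y; }

/* The "pre-sync" sub-routine: each completed child fills its slot in
 * the frame and decrements the join counter; here the children run
 * inline, so the counter reaches zero before fib_cont is invoked. */
int fib(int n) {
    if (n <= 2) return 1;
    cont_frame f = { .pending = 2 };
    f.x = fib(n - 1); f.pending--;
    f.y = fib(n - 2); f.pending--;
    assert(f.pending == 0);   /* this is what sync waits for */
    return fib_cont(&f);
}
```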
Enable Cilk on GPUs via Channels: Step 2
[Figure: fib task tree showing "pre-sync" tasks (ready and done), "continuation" tasks, spawn edges (task A spawned task B), and dependence edges (task B depends on task A)]
Channels instantiated for breadth-first traversal
‒Quickly populates the GPU's tens of thousands of lanes
‒Facilitates coarse-grain dependency management
[Figure: tasks aggregated into a fib channel and a stack of fib_cont channels, with the deepest continuations at the top of the stack]
Bound Cilk’s Memory Footprint
Bound memory to the depth of the Cilk tree by draining channels closer to the base case
‒The amount of work generated dynamically is not known a priori
[Figure: fib task tree; tasks nearest the base case are consumed first]
We propose that GPUs allow SIMT threads to yield
‒Facilitates resolving conflicts on shared resources like memory
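The draining policy amounts to always consuming from the deepest non-empty level; a hypothetical helper sketches the idea (the paper's scheduler operates on per-level channels, not a plain count array):

```c
#include <assert.h>

/* Given per-depth task counts (index 0 = root, last index = base
 * case), pick the deepest non-empty level to drain next. Consuming
 * deep tasks first bounds live work to roughly the tree's depth
 * instead of its (unbounded, dynamically generated) width. */
int pick_channel(const int *count, int depths) {
    for (int d = depths - 1; d >= 0; d--)
        if (count[d] > 0) return d;
    return -1;   /* all channels drained */
}
```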
Channel Implementation
See Paper
Our design accommodates SIMT access patterns
+ array-based
+ lock-free
+ non-blocking
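One reason an array-based layout suits SIMT access can be sketched with an atomic reservation counter (names and structure here are assumptions for illustration; the paper's lock-free, non-blocking design is more involved, including full/empty handling):

```c
#include <stdatomic.h>
#include <assert.h>

#define CAP 128

/* Producers claim array slots with one atomic fetch-add, so the
 * lanes of a wavefront that push together receive consecutive
 * indices: coalesced writes, no locks. Sketch only; wraparound
 * when full is not handled here. */
typedef struct {
    int        slots[CAP];
    atomic_int reserve;   /* next slot to claim      */
    atomic_int commit;    /* count of published slots */
} array_chan;

int chan_claim(array_chan *c) {
    return atomic_fetch_add(&c->reserve, 1) % CAP;
}

void chan_write(array_chan *c, int slot, int task) {
    c->slots[slot] = task;
    atomic_fetch_add(&c->commit, 1);   /* publish for consumers */
}
```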
Methodology
Implemented Cilk on channels on a simulated APU
‒Caches are sequentially consistent
‒Aggregator schedules Cilk tasks
Cilk Scales with the GPU Architecture
[Figure: normalized execution time (0 to 1, lower is better) for Fibonacci, Queens, Sort, and Strassen with 1, 2, 4, and 8 compute units (CUs)]
More compute units → faster execution
Conclusion
We observed that dynamic aggregation enables new GPU programming languages and abstractions
We enabled dynamic aggregation by extending the GPU's control processor to manage channels
We found that breadth-first scheduling works well for Cilk on GPUs
We proposed that GPUs allow SIMT threads to yield for breadth-first scheduling
Future work should focus on how the control processor can enable more GPU applications
Backup
Divergence and Channels
Branch divergence
[Figure: percent of wavefronts with 1-16, 17-32, 33-48, and 49-64 lanes active, for Fibonacci, Queens, Sort, and Strassen]
Memory divergence
+ Data in channels: good
‒Pointers to data in channels: bad
GPU NOT Blocked on Aggregator
[Figure: percent of time for fib, queens, sort, and strassen with four aggregator designs (simple, 2-way light OoO, 2-way OoO, 4-way OoO); the GPU is rarely blocked waiting on the aggregator]
GPU Cilk vs. standard GPU workloads
            LOC reduction   Dispatch rate   Speedup
Strassen    42%             13x             1.06
Queens      36%             12.5x           0.98
Cilk is more succinct than SIMT languages
Channels trigger more GPU dispatches
Same performance, easier to program
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance
Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.