Programming Many-Core Systems with GRAMPS
Jeremy Sugerman
14 May 2010
The single fast core era is over
• Trends:
  – Changing Metrics: ‘scale out’, not just ‘scale up’
  – Increasing diversity: many different mixes of ‘cores’
• Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!
2
High-level programming models
• Two major advantages over threads & locks
– Constructs to express/expose parallelism
– Scheduling support to help manage concurrency,
communication, and synchronization
• Widespread in research and industry:
OpenGL/Direct3D, SQL, Brook, Cilk, CUDA,
OpenCL, StreamIt, TBB, …
3
My biases: workloads
• Interesting applications have irregularity
• Large bundles of coherent work are efficient
• Producer-consumer idiom is important
Goal: Rebuild coherence dynamically by
aggregating related work as it is generated.
4
My target audience
• Highly informed, but (good) lazy
– Understands the hardware and best practices
– Dislikes rote, prefers power versus constraints
Goal: Let systems-savvy developers efficiently
develop programs that efficiently map onto
their hardware.
5
Contributions: Design of GRAMPS
• Programs are graphs of stages and queues
Simple Graphics Pipeline
• Queues:
– Maximum capacities, packet sizes
• Stages:
– No, limited, or total automatic parallelism
– Fixed, variable, or reduction (in-place) outputs
6
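To make these constructs concrete, here is a minimal sketch of how an application might describe its stages and queues. The GraphDesc, StageDesc, and QueueDesc types and all of their fields are hypothetical illustrations, not the actual GRAMPS API.

```cpp
// Illustrative building blocks for a GRAMPS-style graph (not the real API).
#include <cstddef>
#include <string>
#include <vector>

// Stages differ in how much automatic parallelism the runtime may apply.
enum class Parallelism { None /* Thread/Fixed */, Limited, Total /* Shader */ };

// Stages also differ in how they produce output.
enum class OutputMode { Fixed, Variable /* push */, Reduction /* in-place */ };

struct QueueDesc {
    std::string name;
    size_t packetSize;     // size of one packet (application-chosen granularity)
    size_t maxCapacity;    // bound on buffered packets, controls footprint
    bool   preserveOrder;  // optional FIFO ordering guarantee
};

struct StageDesc {
    std::string  name;
    Parallelism  parallelism;
    OutputMode   output;
    std::vector<std::string> inputs;   // names of input queues
    std::vector<std::string> outputs;  // names of output queues
};

// A program is simply a graph: stages wired together by queues.
struct GraphDesc {
    std::vector<QueueDesc> queues;
    std::vector<StageDesc> stages;
};

int main() {
    GraphDesc simplePipeline;
    simplePipeline.queues.push_back({"frags", /*packetSize=*/256, /*maxCapacity=*/64, false});
    simplePipeline.stages.push_back({"Shade", Parallelism::Total, OutputMode::Variable, {"frags"}, {}});
    return 0;
}
```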
Contributions: Implementation
• Broad application scope:
– Rendering, MapReduce, image processing, …
• Multi-platform applicability:
– GRAMPS runtimes for three architectures
• Performance:
– Scale-out parallelism, controlled data footprint
– Compares well to schedulers from other models
• (Also: Tunable)
7
Outline
• GRAMPS overview
• Study 1: Future graphics architectures
• Study 2: Current multi-core CPUs
• Comparison with schedulers from other parallel programming models
8
GRAMPS Overview
GRAMPS
• Programs are graphs of stages and queues
– Expose the program structure
– Leave the program internals unconstrained
10
Writing a GRAMPS program
• Design the application graph and queues:
Cookie Dough Pipeline
• Design the stages
• Instantiate and launch.
Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html
11
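A minimal sketch of those three steps for a cookie-dough-style pipeline. The Graph, Queue, and Stage types, the stage names, and the launch() stub are hypothetical; the real GRAMPS runtime supplies its own graph-construction and launch calls.

```cpp
// Sketch of: design the graph and queues, design the stages, instantiate and launch.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct Queue { std::string name; size_t packetSize; size_t capacity; };
struct Stage { std::string name; std::string in, out; };
struct Graph { std::vector<Queue> queues; std::vector<Stage> stages; };

void launch(const Graph& g) {                 // stand-in for the runtime's launch
    for (const auto& s : g.stages)
        std::printf("stage %s: %s -> %s\n", s.name.c_str(), s.in.c_str(), s.out.c_str());
}

int main() {
    Graph dough;
    // 1. Design the application graph and queues.
    dough.queues = {{"orders", 64, 16}, {"batches", 128, 8}};
    // 2. Design the stages (named only here; real stages supply code).
    dough.stages = {{"Mixer", "orders", "batches"}, {"Dispenser", "batches", "-"}};
    // 3. Instantiate and launch.
    launch(dough);
    return 0;
}
```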
Queues
• Bounded size, operate at “packet” granularity
– “Opaque” and “Collection” packets
• GRAMPS can optionally preserve ordering
– Required for some workloads, adds overhead
12
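A sketch of the two packet flavors and a bounded queue, assuming sizes are chosen by the application; the type names and fields are illustrative, not the real GRAMPS definitions.

```cpp
// Illustrative packet and queue shapes (hypothetical, not the GRAMPS types).
#include <cstddef>
#include <cstdint>
#include <vector>

// Opaque packets: a fixed-size blob the runtime buffers and hands over
// without looking inside.
struct OpaquePacket {
    std::vector<uint8_t> bytes;       // exactly packetSize bytes
};

// Collection packets: a count of independent elements, so the runtime can
// fan elements out to Shader instances and coalesce results.
struct CollectionPacket {
    uint32_t numElements;             // how many valid elements follow
    std::vector<uint8_t> elements;    // numElements * elementSize bytes
};

// Queues are bounded: the runtime never buffers more than maxPackets,
// which is what keeps worst-case footprint under control.
struct BoundedQueue {
    size_t maxPackets;
    bool   preserveOrder;             // optional, costs extra footprint
    std::vector<OpaquePacket> buffered;
    bool full() const { return buffered.size() >= maxPackets; }
};

int main() { BoundedQueue q{8, false, {}}; return q.full(); }
```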
Thread (and Fixed) stages
• Preemptible, long-lived, stateful
– Often merge, compare, or repack inputs
• Queue operations: Reserve/Commit
• (Fixed: Thread stages in custom hardware)
13
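A sketch of what a long-lived Thread stage's main loop might look like against hypothetical Reserve/Commit calls. The stubs below fake a couple of packets so the example runs; the real runtime would block or preempt the stage at those calls.

```cpp
// Illustrative Thread stage: stateful loop built on reserve/commit.
#include <cstdio>
#include <optional>
#include <vector>

struct Packet { std::vector<int> data; };

// --- hypothetical runtime calls (stubbed for illustration) -----------------
std::optional<Packet> reserveInput(int queueId) {           // may preempt here
    static int handed = 0;
    if (handed++ < 2) return Packet{{queueId, queueId + 1}};
    return std::nullopt;                                     // upstream finished
}
Packet* reserveOutput(int /*queueId*/) { static Packet out; return &out; }
void commit(Packet* /*pkt*/) {}                              // may preempt here
// ---------------------------------------------------------------------------

void mergeStage() {
    long packetsSeen = 0;                   // stage state persists across packets
    while (auto in = reserveInput(/*queueId=*/0)) {
        Packet* out = reserveOutput(/*queueId=*/1);
        out->data = in->data;               // "merge/compare/repack" placeholder
        commit(out);
        ++packetsSeen;
    }
    std::printf("merge stage done after %ld packets\n", packetsSeen);
}

int main() { mergeStage(); return 0; }
```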
Shader stages:
• Automatically parallelized:
– Horde of non-preemptible, stateless instances
– Pre-reserve/post-commit
• Push: Variable/conditional output support
– GRAMPS coalesces elements into full packets
14
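A sketch of a Shader-style stage with conditional output via push, plus a toy collector that mimics how pushed elements could be coalesced into full packets; all names are illustrative, not the GRAMPS API.

```cpp
// Illustrative Shader stage: stateless per-element work with push output.
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr size_t kPacketSize = 4;              // elements per output packet

struct Collector {                              // runtime-side coalescing
    std::vector<int> pending;
    std::vector<std::vector<int>> fullPackets;
    void push(int element) {                    // the push primitive
        pending.push_back(element);
        if (pending.size() == kPacketSize) {    // emit only full packets
            fullPackets.push_back(pending);
            pending.clear();
        }
    }
};

// Stateless per-element work: keep only even inputs (variable output).
void shade(int element, Collector& out) {
    if (element % 2 == 0) out.push(element * element);
}

int main() {
    Collector out;
    for (int e = 0; e < 20; ++e) shade(e, out);   // the "horde" of instances
    std::printf("coalesced %zu full packets\n", out.fullPackets.size());
    return 0;
}
```

Emitting only full packets is what rebuilds coherence: downstream stages always see large, related bundles of work rather than scattered single elements.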
Queue sets: Mutual exclusion
Cookie Dough Pipeline
• Independent exclusive (serial) subqueues
– Created statically or on first output
– Densely or sparsely indexed
• Bonus: Automatically instanced Thread stages
15
Queue sets: Mutual exclusion
Cookie Dough (with queue set)
• Independent exclusive (serial) subqueues
– Created statically or on first output
– Densely or sparsely indexed
• Bonus: Automatically instanced Thread stages
16
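A sketch of the queue-set idea, assuming outputs are routed by an application-chosen key: each keyed subqueue is drained serially, so same-key work never needs an application lock. Types and names are illustrative.

```cpp
// Illustrative queue set: one exclusive subqueue per key, created sparsely.
#include <cstdio>
#include <map>
#include <vector>

struct QueueSet {
    std::map<int, std::vector<int>> subqueues;   // created on first output
    void emit(int key, int value) { subqueues[key].push_back(value); }
};

int main() {
    QueueSet qs;
    for (int i = 0; i < 10; ++i)
        qs.emit(/*key=*/i % 3, /*value=*/i);     // e.g., key = dough flavor

    // The runtime could instance one Thread stage per subqueue; here we just
    // drain each subqueue serially to show the mutual exclusion.
    for (auto& [key, work] : qs.subqueues) {
        long sum = 0;
        for (int v : work) sum += v;             // no lock needed within a key
        std::printf("subqueue %d handled %zu items (sum %ld)\n",
                    key, work.size(), sum);
    }
    return 0;
}
```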
A few other tidbits
• In-place Shader stages /
coalescing inputs
• Instanced Thread stages
• Queues as barriers /
read all-at-once
17
Formative influences
• The Graphics Pipeline, early GPGPU
• “Streaming”
• Work-queues and task-queues
18
Study 1: Future Graphics
Architectures
(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan;
appeared in ACM Transactions on Graphics, January 2009)
Graphics is a natural first domain
• Table stakes for commodity parallelism
• GPUs are full of heterogeneity
• Poised to transition from fixed/configurable
pipeline to programmable
• We have a lot of experience in it
20
The Graphics Pipeline in GRAMPS
• Graph, setup are (application) software
– Can be customized or completely replaced
• Like the transition to programmable shading
– Not (unthinkably) radical
• Fits current hw: FIFOs, cores, rasterizer, …
21
Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware
22
The Experiment
• Three renderers:
– Rasterization, Ray Tracer, Hybrid
• Two simulated future architectures
– Simple scheduler for each
23
Scope: Two(-plus) renderers
(Graphs: Rasterization Pipeline (with ray tracing extension), Ray Tracing Graph, Ray Tracing Extension)
24
Platforms: Two simulated systems
CPU-Like: 8 Fat Cores, Rast
GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched
25
Performance— Metrics
“Maximize machine utilization while
keeping working sets small”
• Priority #1: Scale-out parallelism
– Parallel utilization
• Priority #2: ‘Reasonable’ bandwidth / storage
– Worst case total footprint of all queues
– Inherently a trade-off versus utilization
26
Performance— Scheduling
Simple prototype scheduler (both platforms):
• Static stage priorities (figure: stages annotated from Lowest to Highest)
• Only preempt on Reserve and Commit
• No dynamic weighting of current queue sizes
27
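A sketch of that policy: a fixed per-stage priority and a pick that is re-evaluated only at Reserve/Commit boundaries, with no weighting by queue depth. The Stage fields and pickNext() are illustrative, not the simulator's actual code.

```cpp
// Illustrative static-priority pick, consulted only at Reserve/Commit.
#include <cstdio>
#include <vector>

struct Stage {
    const char* name;
    int  priority;         // static, assigned per stage from the graph
    bool hasRunnableWork;  // input available and output space free
};

// No dynamic weighting of current queue sizes: just highest runnable priority.
const Stage* pickNext(const std::vector<Stage>& stages) {
    const Stage* best = nullptr;
    for (const auto& s : stages)
        if (s.hasRunnableWork && (!best || s.priority > best->priority))
            best = &s;
    return best;
}

int main() {
    std::vector<Stage> stages = {
        {"Vertex", 0, true}, {"Rasterize", 1, false}, {"Shade", 2, true}};
    if (const Stage* s = pickNext(stages)) std::printf("run %s\n", s->name);
    return 0;
}
```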
Performance— Results
• Utilization: 95+% for all but rasterized fairy (~80%)
• Footprint: < 600KB CPU-like, < 1.5MB GPU-like
• Surprised how well the simple scheduler worked
• Maintaining order costs footprint
28
Study 2: Current Multi-core CPUs
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez,
Richard Yoo; submitted to PACT 2010)
Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware
30
The Experiment
• 9 applications, 13 configurations
• One (more) architecture: multi-core x86
– It’s real (no simulation here)
– Built with pthreads, locks, and atomics
• Per-pthread task-priority-queues with work-stealing
– More advanced scheduling
31
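A sketch, in the spirit of the description above, of per-worker task queues bucketed by stage priority with a steal path for idle workers; it is a single-threaded toy with no locks or atomics shown, not the actual runtime.

```cpp
// Illustrative per-worker task-priority-queues with work stealing.
#include <cstdio>
#include <deque>
#include <map>
#include <optional>

struct Task { int stage; int packetId; };

struct WorkerQueues {
    // Tasks are bucketed by stage priority; higher priorities are served first.
    std::map<int, std::deque<Task>> byPriority;

    void push(int priority, Task t) { byPriority[priority].push_back(t); }

    std::optional<Task> popLocal() {
        for (auto it = byPriority.rbegin(); it != byPriority.rend(); ++it)
            if (!it->second.empty()) {
                Task t = it->second.back();     // LIFO locally
                it->second.pop_back();
                return t;
            }
        return std::nullopt;
    }

    std::optional<Task> steal() {
        // A thief takes the oldest task from the highest-priority non-empty bucket.
        for (auto it = byPriority.rbegin(); it != byPriority.rend(); ++it)
            if (!it->second.empty()) {
                Task t = it->second.front();    // FIFO when stealing
                it->second.pop_front();
                return t;
            }
        return std::nullopt;
    }
};

int main() {
    WorkerQueues victim;
    victim.push(2, {2, 0});
    victim.push(1, {1, 1});
    if (auto t = victim.popLocal()) std::printf("owner runs stage %d\n", t->stage);
    if (auto t = victim.steal())    std::printf("idle worker steals stage %d work\n", t->stage);
    return 0;
}
```

Popping LIFO locally keeps caches warm, while stealing FIFO grabs older work, the usual work-stealing trade-off.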
Scope: Application bonanza
• GRAMPS: Ray tracer (0, 1 bounce), Spheres
  (No rasterization, though)
• Cilk(-like): Mergesort
• CUDA: Gaussian, SRAD
• MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
• StreamIt: FM, TDE
32
Scope: Many different idioms
(Application graphs: Ray Tracer, MapReduce, Merge Sort, FM, SRAD)
33
Platform: 2xQuad-core Nehalem
Native: 8 HyperThreaded Core i7 cores
• Queues: copy in/out, global (shared) buffer
• Threads: user-level scheduled contexts
• Shaders: create one task per input packet
34
Performance— Metrics (Reminder)
“Maximize machine utilization while
keeping working sets small”
• Priority #1: Scale-out parallelism
• Priority #2: ‘Reasonable’ bandwidth / storage
35
Performance– Scheduling
• Static per-stage priorities (still)
• Work-stealing task-priority-queues
• Eagerly create one task per packet (naïve)
• Keep running stages until a low watermark
  – (Limited dynamic weighting of queue depths)
36
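A sketch of the low-watermark heuristic: once scheduled, a stage keeps draining its input while the queue stays above a threshold, rather than re-running the full priority pick per packet. The threshold and names are illustrative.

```cpp
// Illustrative "run until a low watermark" policy for a scheduled stage.
#include <cstddef>
#include <cstdio>
#include <deque>

struct InputQueue { std::deque<int> packets; };

void runStageUntilLowWatermark(InputQueue& q, size_t lowWatermark) {
    size_t processed = 0;
    while (q.packets.size() > lowWatermark) {
        int packet = q.packets.front();
        q.packets.pop_front();
        (void)packet;                     // ... the stage's real work goes here ...
        ++processed;
    }
    std::printf("processed %zu packets, %zu left buffered\n",
                processed, q.packets.size());
}

int main() {
    InputQueue q;
    for (int i = 0; i < 20; ++i) q.packets.push_back(i);
    runStageUntilLowWatermark(q, /*lowWatermark=*/4);
    return 0;
}
```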
Performance– Good Scale-out
(Plot: parallel speedup versus hardware threads)
• (Footprint: Good; detail a little later)
37
Performance– Low Overheads
(Plot: execution time breakdown at 8 cores / 16 hyperthreads, as a percentage of execution)
• ‘App’ and ‘Queue’ time are both useful work.
38
Comparison with Other
Schedulers
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez,
Richard Yoo; submitted to PACT 2010)
Three archetypes
• Task-Stealing: (Cilk, TBB)
  + Low overhead with fine granularity tasks
  – No producer-consumer, priorities, or data-parallel
• Breadth-First: (CUDA, OpenCL)
  + Simple scheduler (one stage at a time)
  – No producer-consumer, no pipeline parallelism
• Static: (StreamIt / Streaming)
  + No runtime scheduler; complex schedules
  – Cannot adapt to irregular workloads
GRAMPS is a natural framework
(Table: GRAMPS, Task-Stealing, Breadth-First, and Static compared on Shader
Support, Producer-Consumer, Structured ‘Work’, and Adaptive scheduling;
GRAMPS provides all four, while each archetype covers only a subset.)
41
The Experiment
• Re-use the exact same application code
• Modify the scheduler per archetype:
– Task-Stealing: Unbounded queues, no priority,
(amortized) preempt to child tasks
– Breadth-First: Unbounded queues, stage at a
time, top-to-bottom
– Static: Unbounded queues, offline per-thread
schedule using SAS / SGMS
42
Seeing is believing (ray tracer)
(Execution visualizations: GRAMPS, Breadth-First, Task-Stealing, Static (SAS))
43
Comparison: Execution time
(Plot: time breakdown for GRAMPS, Task-Stealing, and Breadth-First, as a percentage of time)
• Mostly similar: good parallelism, load balance
44
Comparison: Execution time
(Plot: time breakdown for GRAMPS, Task-Stealing, and Breadth-First, as a percentage of time)
• Breadth-first can exhibit load-imbalance
45
Comparison: Execution time
(Plot: time breakdown for GRAMPS, Task-Stealing, and Breadth-First, as a percentage of time)
• Task-stealing can ping-pong, cause contention
46
Comparison: Footprint
(Plot: packet footprint relative to GRAMPS, log scale)
• Breadth-First is pathological (as expected)
47
Footprint: GRAMPS & Task-Stealing
(Plots: relative packet footprint and relative task footprint)
48
Footprint: GRAMPS & Task-Stealing
GRAMPS gets insight from the graph:
• (Application-specified) queue bounds
  (Graphs: Ray Tracer, MapReduce)
• Group tasks by stage for priority, preemption
  (Graphs: Ray Tracer, MapReduce)
49
Static scheduling is challenging
(Plots: execution time and packet footprint)
• Generating good Static schedules is *hard*.
• Static schedules are fragile:
– Small mismatches compound
– Hardware itself is dynamic (cache traffic, IRQs, …)
• Limited upside: dynamic scheduling is cheap!
50
Discussion (for multi-core CPUs)
• Adaptive scheduling is the obvious choice.
– Better load-balance / handling of irregularity
• Semantic insight (app graph) gives a big
advantage in managing footprint.
• More cores, development maturity → more
complex graphs and thus more advantage.
51
Conclusion
Contributions Revisited
• GRAMPS programming model design
– Graph of heterogeneous stages and queues
• Good results from actual implementation
– Broad scope: Wide range of applications
– Multi-platform: Three different architectures
– Performance: High parallelism, good footprint
53
Anecdotes and intuitions
• Structure helps: an explicit graph is handy.
• Simple (principled) dynamic scheduling works.
• Queues impedance-match heterogeneity.
• Graphs with cycles and push both paid off.
• (Also: Paired instrumentation and visualization help enormously)
54
Conclusion: Future trends revisited
• Core counts are increasing
– Parallel programming models
• Memory and bandwidth are precious
– Working set, locality (i.e., footprint) management
• Power, performance driving heterogeneity
– All ‘cores’ need to communicate, interoperate
GRAMPS fits them well.
55
Thanks
• Eric, for agreeing to make this happen.
• Christos, for throwing helpers at me.
• Kurt, Mendel, and Pat, for, well, a lot.
• John Gerth, for tireless computer servitude.
• Melissa (and Heather and Ada before her)
56
Thanks
• My practice audiences
• My many collaborators
• Daniel, Kayvon, Mike, Tim
• Supporters at NVIDIA, ATI/AMD, Intel
• Supporters at VMware
• Everyone who entertained, informed, challenged me, and made me think
57
Thanks
• My funding agencies:
– Rambus Stanford Graduate Fellowship
– Department of the Army Research
– Stanford Pervasive Parallelism Laboratory
58
Q&A
• Thank you for listening!
• Questions?
59
Extra Material (Backup)
Data: CPU-Like & GPU-Like
61
Footprint Data: Native
62
Tunability
• Diagnosis:
– Raw counters, statistics, logs
– Grampsviz
• Optimize / Control:
– Graph topology (e.g., sort-middle vs. sort-last)
– Queue watermarks (e.g., 10x win for ray tracing)
– Packet size: Match SIMD widths, share data
63
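A sketch of these knobs gathered into one hypothetical configuration record; the field names and defaults are illustrative, not a real GRAMPS interface.

```cpp
// Illustrative tuning knobs: topology, packet size, queue bounds/watermarks.
#include <cstddef>

enum class Topology { SortMiddle, SortLast };   // alternative graph designs

struct TuningConfig {
    Topology topology      = Topology::SortMiddle;
    size_t   packetSize    = 8;     // e.g., match the SIMD width
    size_t   queueCapacity = 64;    // bound on the critical queue
    size_t   lowWatermark  = 8;     // when a consumer stage stops draining
};

int main() {
    TuningConfig cfg;
    cfg.packetSize = 16;            // widen packets for a 16-wide SIMD target
    return cfg.packetSize == 16 ? 0 : 1;
}
```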
Tunability– Grampsviz (1)
• GPU-Like: Rasterization pipeline
64
Tunability– Grampsviz (2)
• CPU-Like: Histogram (MapReduce), with the Reduce and Combine stages labeled
65
Tunability– Knobs
• Graph topology/design:
Sort-Middle
Sort-Last
• Sizing critical queues:
66
Alternatives
A few other tidbits
Image Histogram Pipeline
• In-place Shader stages /
coalescing inputs
• Instanced Thread stages
• Queues as barriers /
read all-at-once
69
Performance– Good Scale-out
(Plot: parallel speedup versus hardware threads)
• (Footprint: Good; detail a little later)
70
Seeing is believing (ray tracer)
(Execution visualizations: GRAMPS, Task-Stealing, Breadth-First, Static (SAS))
71
Comparison: Execution time
(Plot: time breakdown for GRAMPS, Task-Stealing, and Breadth-First, as a percentage of time)
• Small ‘Sched’ time, even with large graphs
72