Programming Many-Core Systems with GRAMPS
Jeremy Sugerman
14 May 2010

The single fast core era is over
• Trends:
  – Changing metrics: 'scale out', not just 'scale up'
  – Increasing diversity: many different mixes of 'cores'
• Today's (and tomorrow's) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!

High-level programming models
• Two major advantages over threads & locks:
  – Constructs to express/expose parallelism
  – Scheduling support to help manage concurrency, communication, and synchronization
• Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

My biases: workloads
• Interesting applications have irregularity
• Large bundles of coherent work are efficient
• The producer-consumer idiom is important
Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

My target audience
• Highly informed, but (good) lazy
  – Understands the hardware and best practices
  – Dislikes rote; prefers power over constraints
Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

Contributions: Design of GRAMPS
• Programs are graphs of stages and queues
  [Figure: Simple Graphics Pipeline]
• Queues:
  – Maximum capacities, packet sizes
• Stages:
  – No, limited, or total automatic parallelism
  – Fixed, variable, or reduction (in-place) outputs

Contributions: Implementation
• Broad application scope:
  – Rendering, MapReduce, image processing, …
• Multi-platform applicability:
  – GRAMPS runtimes for three architectures
• Performance:
  – Scale-out parallelism, controlled data footprint
  – Compares well to schedulers from other models
• (Also: tunable)

Outline
• GRAMPS overview
• Study 1: Future graphics architectures
• Study 2: Current multi-core CPUs
• Comparison with schedulers from other parallel programming models

GRAMPS Overview

GRAMPS
• Programs are graphs of stages and queues
  – Expose the program structure
  – Leave the program internals unconstrained

Writing a GRAMPS program
• Design the application graph and queues:
  [Figure: Cookie Dough Pipeline]
• Design the stages
• Instantiate and launch (a minimal code sketch of these steps follows below)
Image credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html
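To make those three steps concrete, here is a minimal C++ sketch of what building and launching such a graph could look like. The deck defines the concepts (stages, queues, bounded capacities, packet sizes), but every name below (QueueSpec, StageSpec, GrampsGraph, the commented-out launchGraph) is a hypothetical binding invented for illustration, not the actual GRAMPS API.

```cpp
// Hypothetical sketch of the three steps above; names and signatures are
// invented, only the concepts (stages, queues, packets, capacities) come
// from the GRAMPS design.
#include <cstddef>
#include <string>
#include <vector>

// Step 1: describe the queues: bounded capacities and packet sizes.
struct QueueSpec {
    std::string name;
    std::size_t packetBytes;   // granularity of all queue operations
    std::size_t maxPackets;    // application-chosen capacity bound
};

// Step 2: describe the stages and how they attach to queues.
enum class StageType { Thread, Shader, Fixed };
struct StageSpec {
    std::string name;
    StageType   type;
    std::vector<std::string> inputs;   // queue names consumed
    std::vector<std::string> outputs;  // queue names produced
};

// Step 3: instantiate and launch (runtime handle is also hypothetical).
struct GrampsGraph {
    std::vector<QueueSpec> queues;
    std::vector<StageSpec> stages;
};

int main() {
    GrampsGraph g;
    g.queues = { {"input",  4096, 64},
                 {"shaped", 4096, 64} };
    g.stages = { {"Producer", StageType::Thread, {},         {"input"}},
                 {"Worker",   StageType::Shader, {"input"},  {"shaped"}},
                 {"Consumer", StageType::Thread, {"shaped"}, {}} };
    // launchGraph(g);  // hand the finished graph to the GRAMPS runtime
    return 0;
}
```

With a declarative description like this, the scheduler has the whole graph (capacities, packet sizes, stage types) available up front, which is the "semantic insight" the comparison slides later credit for GRAMPS's footprint control.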
Queues
• Bounded size, operate at "packet" granularity
  – "Opaque" and "Collection" packets
• GRAMPS can optionally preserve ordering
  – Required for some workloads, adds overhead

Thread (and Fixed) stages
• Preemptible, long-lived, stateful
  – Often merge, compare, or repack inputs
• Queue operations: Reserve/Commit
• (Fixed: Thread stages in custom hardware)

Shader stages
• Automatically parallelized:
  – Horde of non-preemptible, stateless instances
  – Pre-reserve / post-commit
• Push: variable/conditional output support
  – GRAMPS coalesces elements into full packets

Queue sets: Mutual exclusion
  [Figures: Cookie Dough Pipeline; Cookie Dough (with queue set)]
• Independent exclusive (serial) subqueues
  – Created statically or on first output
  – Densely or sparsely indexed
• Bonus: automatically instanced Thread stages

A few other tidbits
• In-place Shader stages / coalescing inputs
• Instanced Thread stages
• Queues as barriers / read all-at-once

Formative influences
• The Graphics Pipeline, early GPGPU
• "Streaming"
• Work-queues and task-queues

Study 1: Future Graphics Architectures
(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

Graphics is a natural first domain
• Table stakes for commodity parallelism
• GPUs are full of heterogeneity
• Poised to transition from a fixed/configurable pipeline to a programmable one
• We have a lot of experience in it

The Graphics Pipeline in GRAMPS
• Graph and setup are (application) software
  – Can be customized or completely replaced
• Like the transition to programmable shading
  – Not (unthinkably) radical
• Fits current hardware: FIFOs, cores, rasterizer, …

Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware

The Experiment
• Three renderers:
  – Rasterization, ray tracer, hybrid
• Two simulated future architectures
  – Simple scheduler for each

Scope: Two(-plus) renderers
  [Figures: Rasterization Pipeline (with ray tracing extension); Ray Tracing Graph; Ray Tracing Extension]

Platforms: Two simulated systems
• CPU-Like: 8 fat cores, rasterizer
• GPU-Like: 1 fat core, 4 micro cores, rasterizer, hardware scheduler

Performance: Metrics
"Maximize machine utilization while keeping working sets small"
• Priority #1: Scale-out parallelism
  – Parallel utilization
• Priority #2: 'Reasonable' bandwidth / storage
  – Worst-case total footprint of all queues
  – Inherently a trade-off versus utilization

Performance: Scheduling
Simple prototype scheduler (both platforms):
• Static stage priorities (lowest to highest across the graph)
• Only preempt on Reserve and Commit
• No dynamic weighting of current queue sizes (a sketch of this policy follows below)
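As a rough illustration of that policy, the sketch below picks the runnable stage with the highest static priority and only revisits the decision at Reserve/Commit boundaries. The types, the toy pipeline, and the concrete priority assignment are assumptions for illustration; the slides describe the policy, not its code.

```cpp
// Illustrative sketch of the simple scheduler policy above: static per-stage
// priorities, rescheduling only at Reserve/Commit, and no weighting by current
// queue occupancy. All types here are invented.
#include <iostream>
#include <vector>

struct Stage {
    const char* name;
    int priority;      // fixed when the graph is built
    int pendingWork;   // stand-in for "input packets are available"
};

// Pick the runnable stage with the highest static priority. A core consults
// this only when its current stage reaches a Reserve or Commit call.
Stage* pickNextStage(std::vector<Stage>& stages) {
    Stage* best = nullptr;
    for (Stage& s : stages)
        if (s.pendingWork > 0 && (best == nullptr || s.priority > best->priority))
            best = &s;
    return best;       // nullptr: idle until upstream stages produce work
}

int main() {
    // Toy three-stage pipeline. Downstream stages are given higher priority so
    // queues drain and footprint stays bounded; the deck only shows the
    // priority ordering pictorially, so treat this assignment as an assumption.
    std::vector<Stage> stages = { {"Generate", 0, 3}, {"Shade", 1, 2}, {"Output", 2, 1} };
    while (Stage* s = pickNextStage(stages)) {
        std::cout << "run " << s->name << " until its next Reserve/Commit\n";
        --s->pendingWork;   // pretend that run consumed one packet
    }
    return 0;
}
```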
Performance: Results
• Utilization: 95+% for all but rasterized Fairy (~80%)
• Footprint: < 600KB CPU-like, < 1.5MB GPU-like
• Surprised how well the simple scheduler worked
• Maintaining order costs footprint

Study 2: Current Multi-core CPUs
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, and Richard Yoo; submitted to PACT 2010)

Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware

The Experiment
• 9 applications, 13 configurations
• One (more) architecture: multi-core x86
  – It's real (no simulation here)
  – Built with pthreads, locks, and atomics
• Per-pthread task-priority-queues with work-stealing
  – More advanced scheduling

Scope: Application bonanza
• GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
• Cilk(-like): Mergesort
• CUDA: Gaussian, SRAD
• MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
• StreamIt: FM, TDE

Scope: Many different idioms
  [Figures: application graphs for Ray Tracer, MapReduce, Merge Sort, FM, SRAD]

Platform: 2x quad-core Nehalem
• Native: 8 hyperthreaded Core i7 cores
• Queues: copy in/out, global (shared) buffer
• Threads: user-level scheduled contexts
• Shaders: create one task per input packet

Performance: Metrics (Reminder)
"Maximize machine utilization while keeping working sets small"
• Priority #1: Scale-out parallelism
• Priority #2: 'Reasonable' bandwidth / storage

Performance: Scheduling
• Static per-stage priorities (still)
• Work-stealing task-priority-queues (sketched in the code below)
• Eagerly create one task per packet (naïve)
• Keep running stages until a low watermark
  – (Limited dynamic weighting of queue depths)

Performance: Good scale-out
  [Figure: parallel speedup versus hardware threads]
• (Footprint: good; detail a little later)

Performance: Low overheads
  [Figure: execution time breakdown as percentage of execution, 8 cores / 16 hyperthreads]
• 'App' and 'Queue' time are both useful work
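Here is a sketch of the "per-pthread task-priority-queues with work-stealing" mechanism those runs rely on: each worker keeps its tasks ordered by static stage priority and steals from another worker when its own queue is empty. The classes, the mutex-based locking, and the naive victim scan below are simplifications assumed for illustration, not the thesis runtime (which is built with pthreads, locks, and atomics).

```cpp
// Sketch of per-worker task-priority-queues with work-stealing. A single
// mutex per queue stands in for the more careful lock/atomic machinery a
// real runtime would use.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

struct Task {
    int stagePriority;             // static priority of the producing stage
    std::function<void()> run;     // process one input packet
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.stagePriority < b.stagePriority;   // max-heap on stage priority
    }
};

class WorkerQueue {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> g(m_);
        q_.push(std::move(t));
    }
    std::optional<Task> pop() {    // used for both local pops and steals
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Task t = q_.top();
        q_.pop();
        return t;
    }
private:
    std::mutex m_;
    std::priority_queue<Task, std::vector<Task>, ByPriority> q_;
};

// One scheduling step for worker `self`: prefer local work, else steal.
bool runOne(std::size_t self, std::vector<WorkerQueue>& workers) {
    if (auto t = workers[self].pop()) { t->run(); return true; }
    for (std::size_t v = 0; v < workers.size(); ++v) {   // naive victim scan
        if (v == self) continue;
        if (auto t = workers[v].pop()) { t->run(); return true; }
    }
    return false;   // nothing anywhere: the caller can back off or sleep
}

int main() {
    std::vector<WorkerQueue> workers(2);
    workers[0].push({1, [] { std::puts("task from a low-priority stage"); }});
    workers[0].push({3, [] { std::puts("task from a high-priority stage"); }});
    while (runOne(1, workers)) {}   // worker 1 has no local work, so it steals
    return 0;
}
```

Because stolen tasks still come off a priority-ordered queue, the stealing path respects the same stage priorities as local execution, which is what lets the runtime keep draining downstream stages even under load imbalance.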
Comparison with Other Schedulers
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, and Richard Yoo; submitted to PACT 2010)

Three archetypes
• Task-Stealing (Cilk, TBB):
  – Low overhead with fine-granularity tasks
  – No producer-consumer, priorities, or data-parallel support
• Breadth-First (CUDA, OpenCL):
  – Simple scheduler (one stage at a time)
  – No producer-consumer, no pipeline parallelism
• Static (StreamIt / streaming):
  – No runtime scheduler; complex schedules
  – Cannot adapt to irregular workloads

GRAMPS is a natural framework
  [Table: GRAMPS, Task-Stealing, Breadth-First, and Static compared on Shader support, producer-consumer, structured 'work', and adaptive scheduling]

The Experiment
• Re-use the exact same application code
• Modify the scheduler per archetype:
  – Task-Stealing: unbounded queues, no priorities, (amortized) preempt to child tasks
  – Breadth-First: unbounded queues, one stage at a time, top-to-bottom
  – Static: unbounded queues, offline per-thread schedule using SAS / SGMS

Seeing is believing (ray tracer)
  [Figure: execution visualizations for GRAMPS, Breadth-First, Task-Stealing, and Static (SAS)]

Comparison: Execution time
  [Figure: time breakdown as percentage of time, for GRAMPS, Task-Stealing, and Breadth-First]
• Mostly similar: good parallelism, load balance
• Breadth-First can exhibit load imbalance
• Task-Stealing can ping-pong and cause contention

Comparison: Footprint
  [Figure: relative packet footprint versus GRAMPS (log scale)]
• Breadth-First is pathological (as expected)

Footprint: GRAMPS & Task-Stealing
  [Figures: relative packet footprint; relative task footprint]
GRAMPS gets insight from the graph:
• (Application-specified) queue bounds
  [Figures: Ray Tracer, MapReduce]
• Group tasks by stage for priority and preemption
  [Figures: Ray Tracer, MapReduce]

Static scheduling is challenging
  [Figures: execution time; packet footprint]
• Generating good static schedules is *hard*.
• Static schedules are fragile:
  – Small mismatches compound
  – The hardware itself is dynamic (cache traffic, IRQs, …)
• Limited upside: dynamic scheduling is cheap!

Discussion (for multi-core CPUs)
• Adaptive scheduling is the obvious choice.
  – Better load balance / handling of irregularity
• Semantic insight (the application graph) gives a big advantage in managing footprint.
• More cores and development maturity → more complex graphs, and thus more advantage.

Conclusion

Contributions revisited
• GRAMPS programming model design
  – Graph of heterogeneous stages and queues
• Good results from an actual implementation
  – Broad scope: wide range of applications
  – Multi-platform: three different architectures
  – Performance: high parallelism, good footprint

Anecdotes and intuitions
• Structure helps: an explicit graph is handy.
• Simple (principled) dynamic scheduling works.
• Queues impedance-match heterogeneity.
• Graphs with cycles and push both paid off.
• (Also: paired instrumentation and visualization help enormously)

Conclusion: Future trends revisited
• Core counts are increasing
  – Parallel programming models
• Memory and bandwidth are precious
  – Working set and locality (i.e., footprint) management
• Power and performance are driving heterogeneity
  – All 'cores' need to communicate and interoperate
GRAMPS fits them well.

Thanks
• Eric, for agreeing to make this happen.
• Christos, for throwing helpers at me.
• Kurt, Mendel, and Pat, for, well, a lot.
• John Gerth, for tireless computer servitude.
• Melissa (and Heather and Ada before her)

Thanks
• My practice audiences
• My many collaborators
• Daniel, Kayvon, Mike, Tim
• Supporters at NVIDIA, ATI/AMD, Intel
• Supporters at VMware
• Everyone who entertained, informed, and challenged me, and made me think

Thanks
• My funding agencies:
  – Rambus Stanford Graduate Fellowship
  – Department of the Army Research
  – Stanford Pervasive Parallelism Laboratory

Q&A
• Thank you for listening!
• Questions?

Extra Material (Backup)

Data: CPU-Like & GPU-Like
  [Figure: utilization and footprint data for the simulated platforms]

Footprint Data: Native
  [Figure: footprint data for the native multi-core runs]

Tunability
• Diagnosis:
  – Raw counters, statistics, logs
  – Grampsviz
• Optimize / control:
  – Graph topology (e.g., sort-middle vs. sort-last)
  – Queue watermarks (e.g., a 10x win for ray tracing)
  – Packet size: match SIMD widths, share data

Tunability: Grampsviz (1)
• GPU-Like: rasterization pipeline
  [Figure: Grampsviz screenshot]

Tunability: Grampsviz (2)
• CPU-Like: Histogram (MapReduce)
  [Figure: Grampsviz screenshot, Reduce and Combine stages]

Tunability: Knobs
• Graph topology/design: sort-middle vs. sort-last
  [Figures: Sort-Middle; Sort-Last]
• Sizing critical queues
(A small sketch of these knobs appears at the end of this deck.)

Alternatives

A few other tidbits (backup)
  [Figure: Image Histogram Pipeline]
• In-place Shader stages / coalescing inputs
• Instanced Thread stages
• Queues as barriers / read all-at-once

Performance: Good scale-out (backup)
  [Figure: parallel speedup versus hardware threads]
• (Footprint: good; detail a little later)

Seeing is believing (ray tracer) (backup)
  [Figure: execution visualizations for GRAMPS, Task-Stealing, Breadth-First, and Static (SAS)]

Comparison: Execution time (backup)
  [Figure: time breakdown as percentage of time, for GRAMPS, Task-Stealing, and Breadth-First]
• Small 'Sched' time, even with large graphs
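Finally, to make the "Tunability: Knobs" backup slides slightly more concrete, here is a small hedged sketch of how those per-queue knobs might be expressed. Only the knobs themselves (packet size matched to SIMD width, a bounded queue capacity, a scheduler watermark) come from the slides; the struct, field names, and values are invented.

```cpp
// Hypothetical illustration of the per-queue tuning knobs from the backup
// slides. None of these names come from the real GRAMPS API.
#include <cstddef>

struct QueueTuning {
    std::size_t packetElements;   // e.g. chosen to match an 8- or 16-wide SIMD target
    std::size_t maxPackets;       // bounded capacity: caps worst-case footprint
    std::size_t lowWatermark;     // below this, the scheduler keeps the producer running
};

// Sort-middle vs. sort-last is a graph-topology knob rather than a per-queue
// knob: the same stages are simply connected through differently placed queues.
constexpr QueueTuning rayQueueTuning{16, 64, 8};   // illustrative values only

int main() {
    return rayQueueTuning.maxPackets >= rayQueueTuning.lowWatermark ? 0 : 1;
}
```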