Jeremy Sugerman
Stanford PPL Retreat
November 21, 2008
Collaborators: Kayvon Fatahalian, Solomon Boulos,
Kurt Akeley, Pat Hanrahan
Initial work appearing in ACM TOG in January 2009
Our starting point:
CPU, GPU trends… and collision?
Two research areas:
– HW/SW Interface, Programming Model
– Future Graphics API
Problem Statement / Requirements:
Build a programming model / primitives / building blocks to drive efficient development for and usage of future many-core machines.
Handle homogeneous, heterogeneous, programmable cores, and fixed-function units.
Status Quo:
GPU Pipeline (Good for GL, otherwise hard)
CPU / C run-time (No guidance, fast is hard)
[Figure: two example GRAMPS graphs]
Raster Graphics: Rasterize → Shade → FB Blend, linked by an input fragment queue and an output fragment queue.
Ray Tracer: Camera → Intersect → Shade → FB Blend, linked by a ray queue, a ray hit queue, and a fragment queue.
Legend: thread stage, shader stage, fixed-func stage, queue, stage output.
Apps: Graphs of stages and queues
Producer-consumer, task, data-parallelism
Initial focus on real-time rendering
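To make the graph abstraction concrete, here is a minimal C++ sketch that describes the ray tracer above as stages connected by queues. The Graph, addStage, addQueue, and connect names, and the packet/depth numbers, are hypothetical illustrations, not the actual GRAMPS API:

    // Hypothetical, GRAMPS-like graph description; names are illustrative only.
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    enum class StageType { Thread, Shader, FixedFunction };

    struct Stage { std::string name; StageType type; };
    struct Queue { std::string name; std::size_t packetSize; std::size_t maxDepth; };

    struct Graph {
        struct Edge { std::string producer, queue, consumer; };
        std::vector<Stage> stages;
        std::vector<Queue> queues;
        std::vector<Edge> edges;

        void addStage(std::string n, StageType t) { stages.push_back({std::move(n), t}); }
        void addQueue(std::string n, std::size_t pkt, std::size_t depth) {
            queues.push_back({std::move(n), pkt, depth});
        }
        void connect(std::string p, std::string q, std::string c) {
            edges.push_back({std::move(p), std::move(q), std::move(c)});
        }
    };

    int main() {
        // The ray tracer example: Camera -> Intersect -> Shade -> FB Blend.
        Graph g;
        g.addStage("Camera",    StageType::Thread);
        g.addStage("Intersect", StageType::Shader);       // could be fixed-function instead
        g.addStage("Shade",     StageType::Shader);
        g.addStage("FBBlend",   StageType::FixedFunction);
        g.addQueue("RayQueue",      /*packetSize=*/256, /*maxDepth=*/4096);
        g.addQueue("RayHitQueue",   256, 4096);
        g.addQueue("FragmentQueue", 256, 4096);
        g.connect("Camera",    "RayQueue",      "Intersect");
        g.connect("Intersect", "RayHitQueue",   "Shade");
        g.connect("Shade",     "FragmentQueue", "FBBlend");
        return 0;
    }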
Large Application Scope: preferable to roll-your-own
High Performance: competitive with roll-your-own
Optimized Implementations: informs HW design
Multi-Platform: suits a variety of many-core systems
Also:
Tunable: expert users can optimize their apps
Not (unthinkably) radical for ‘graphics’
Like fixed → programmable shading
– Pipeline undergoing massive shake up
– Diversity of new parameters and use cases
Bigger picture than ‘graphics’
– Rendering is more than GL/D3D
– Compute is more than rendering
– Some ‘GPUs’ are losing their innate pipeline
Sounds like streaming:
Execution graphs, kernels, data-parallelism
Streaming: “squeeze out every FLOP”
– Goals: bulk transfer, arithmetic intensity
– Intensive static analysis, custom chips (mostly)
– Bounded space, data access, execution time
GRAMPS: “interesting apps are irregular”
– Goals: Dynamic, data-dependent code
– Aggregate work at run-time
– Heterogeneous commodity platforms
Streaming techniques fit naturally when applicable
A ‘graphics pipeline’ is now an app!
Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers.
Compared to status quo:
– More flexible, lower level than a GPU pipeline
– More guidance than bare metal
– Portability in between
– Not domain specific
Data access via windows into queues/memory
Queues: Dynamically allocated / managed
– Ordered or unordered
– Specified max capacity (could also spill)
– Two types: Opaque and Collection
Buffers: Random access, Pre-allocated
– RO, RW Private, RW Shared (Not Supported)
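A hedged sketch of the data declarations above, using hypothetical QueueDesc/BufferDesc types; the field names and example values are illustrative only, not the real GRAMPS interface:

    // Hypothetical queue/buffer descriptors echoing the attributes above.
    #include <cstddef>

    enum class QueueType  { Opaque, Collection };        // the two queue types
    enum class QueueOrder { Ordered, Unordered };
    enum class BufferMode { ReadOnly, ReadWritePrivate }; // RW Shared: not supported

    struct QueueDesc {
        QueueType   type;
        QueueOrder  order;
        std::size_t packetSize;   // per-queue packet ('chunk') size
        std::size_t maxCapacity;  // specified max depth (an implementation could also spill)
    };

    struct BufferDesc {
        BufferMode  mode;         // random access, pre-allocated
        std::size_t bytes;
    };

    int main() {
        QueueDesc  rayQueue { QueueType::Collection, QueueOrder::Unordered, 256, 4096 };
        BufferDesc sceneBvh { BufferMode::ReadOnly, std::size_t(64) << 20 };  // e.g. a read-only scene
        (void)rayQueue; (void)sceneBvh;
        return 0;
    }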
Queue Sets: Independent sub-queues
– Instanced parallelism plus mutual exclusion
– Hard to fake with just multiple queues
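As a rough illustration of queue-set semantics, this sketch keys work into sub-queues and lets a consumer claim one sub-queue at a time, giving per-key mutual exclusion while different keys proceed in parallel. The QueueSet class and its methods are assumptions of the sketch, not GRAMPS code:

    // Sketch of a queue set: keyed sub-queues with per-key mutual exclusion.
    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <vector>

    template <typename T>
    class QueueSet {
    public:
        explicit QueueSet(std::size_t numSubQueues) : bins_(numSubQueues) {}

        // Producers push into a sub-queue chosen by key (e.g. screen tile).
        void push(std::size_t key, const T& item) {
            Bin& b = bins_[key % bins_.size()];
            std::lock_guard<std::mutex> lock(b.m);
            b.items.push_back(item);
        }

        // A consumer instance claims one sub-queue exclusively and drains it;
        // instances working on different keys run in parallel.
        template <typename Fn>
        void consumeSubQueue(std::size_t key, Fn&& fn) {
            Bin& b = bins_[key % bins_.size()];
            std::lock_guard<std::mutex> lock(b.m);   // mutual exclusion per key
            while (!b.items.empty()) {
                fn(b.items.front());
                b.items.pop_front();
            }
        }

    private:
        struct Bin { std::mutex m; std::deque<T> items; };
        std::vector<Bin> bins_;
    };

    int main() {
        QueueSet<int> samplesPerTile(16);            // 16 independent sub-queues
        samplesPerTile.push(3, 42);                  // sample destined for tile 3
        samplesPerTile.consumeSubQueue(3, [](int s) { (void)s; /* blend into tile 3 */ });
        return 0;
    }

A single plain queue, by contrast, either serializes all consumers or loses per-key exclusion, which is why multiple separate queues are a poor substitute.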
Large Application Scope
High Performance
Optimized Implementations
Multi-Platform
(Tunable)
Static Inputs:
– Application graph topology
– Per-queue packet (‘chunk’) size
– Per-queue maximum depth / high-watermark
Dynamic Inputs (currently ignored):
– Current per-queue depths
– Average execution time per input packet
Simple Policy: Run consumers, pre-empt producers
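One plausible reading of the simple policy, sketched below: give each stage a static priority by its depth in the graph so consumers outrank producers, and stall a producer once its output queue reaches its high-watermark. The structures and the occupancy test are assumptions of this sketch, not the actual GRAMPS scheduler:

    // One reading of "run consumers, pre-empt producers"; illustrative only.
    #include <cstddef>
    #include <vector>

    struct StageState {
        int         graphDepth;      // static: distance from the graph's inputs
        std::size_t inputDepth;      // packets waiting in the stage's input queue
        std::size_t outputDepth;     // packets sitting in its output queue
        std::size_t outputWatermark; // static: per-queue high-watermark
    };

    // Pick the next stage for an idle execution unit, or -1 if nothing is runnable.
    int pickNextStage(const std::vector<StageState>& stages) {
        int best = -1;
        for (int i = 0; i < static_cast<int>(stages.size()); ++i) {
            const StageState& s = stages[i];
            bool hasInput   = s.inputDepth > 0;
            bool outputFull = s.outputDepth >= s.outputWatermark;  // pre-empt producers
            if (!hasInput || outputFull)
                continue;
            // Deeper stages are consumers of shallower ones; prefer them.
            if (best < 0 || s.graphDepth > stages[best].graphDepth)
                best = i;
        }
        return best;
    }

    int main() {
        std::vector<StageState> stages = {
            {0, 5, 10, 8},   // producer: has input, but its output queue is over the watermark
            {1, 3,  0, 8},   // consumer: runnable, deeper in the graph
        };
        return pickNextStage(stages) == 1 ? 0 : 1;   // expect the consumer to win
    }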
Tiered Scheduler: Tier-N, Tier-1, Tier-0
Tier-N only wakes idle units, no rebalancing
All Tier-1s compete for all queued work.
‘Fat’ cores: software tier-1 per core, tier-0 per thread
‘Micro’ cores: single shared hardware tier-1+0
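A highly simplified, speculative sketch of that division of labor: each tier-1 competes for the shared pool of queued work on behalf of its core, and each tier-0 loop runs whatever its tier-1 hands it until it goes idle (to be re-woken by tier-N). All names and structures here are assumptions:

    // Highly simplified tier sketch; all names and structure are assumptions.
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    using WorkItem = std::function<void()>;

    // Stand-in for "all queued work" that every tier-1 competes for.
    struct GlobalQueues {
        std::mutex m;
        std::deque<WorkItem> ready;
    };

    // Tier-1: one per 'fat' core (in software); fetches packets for its threads.
    class Tier1 {
    public:
        explicit Tier1(GlobalQueues& g) : global_(g) {}
        std::optional<WorkItem> fetch() {
            std::lock_guard<std::mutex> lock(global_.m);   // all tier-1s compete here
            if (global_.ready.empty()) return std::nullopt;
            WorkItem w = std::move(global_.ready.front());
            global_.ready.pop_front();
            return w;
        }
    private:
        GlobalQueues& global_;
    };

    // Tier-0: the per-hardware-thread loop; runs whatever its tier-1 hands it.
    // It returns when nothing is available; tier-N would wake it again later.
    void tier0Loop(Tier1& tier1) {
        while (auto w = tier1.fetch())
            (*w)();
    }

    int main() {
        GlobalQueues global;
        global.ready.push_back([] { /* run one packet's worth of shader work */ });
        Tier1 tier1(global);
        tier0Loop(tier1);
        return 0;
    }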
Direct3D Pipeline (with Ray-tracing Extension)
[Figure: the Direct3D pipeline and a ray tracer expressed as GRAMPS graphs]
Direct3D pipeline: IA 1..N → Input Vertex Queues 1..N → VS 1..N → Primitive Queues 1..N → RO → Primitive Queue → Rast → Fragment Queue → PS → Fragment Queue → OM.
Ray-tracing extension: PS pushes rays into a Ray Queue → Trace → Ray Hit Queue → PS2, which feeds a Sample Queue Set back into the pipeline.
Ray-tracing graph: Tiler → Tile Queue → Sampler → Sample Queue → Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend.
Legend: thread stage, shader stage, fixed-func stage, queue, stage output, push output.
Queues are small (< 600 KB CPU, < 1.5 MB GPU)
Parallelism is good (at least 80%, all but one 95+%)
Is GRAMPS a suitable GPU evolution?
– Enable pipeline competitive with bare metal?
– Enable innovation: advanced / alternative methods?
Is GRAMPS a good parallel compute model?
– Does it fulfill our design goals?
Simulation / Hardware fidelity improvements
– Memory model, locality
GRAMPS Run-Time improvements
– Scheduling, run-time overheads
GRAMPS API extensions
– On-the-fly graph modification, data sharing
More applications / workloads
– REYES, physics, finance, AI, …
– Lazy/adaptive/procedural data generation
Application Scope: okay; only (multiple) renderers
High Performance: so-so; limited simulation detail
Optimized Implementations: good
Multi-Platform: good
(Tunable: good, but that’s a separate talk)
Strategy: Broaden available apps and use them to drive performance and simulation work for now.
Task (Divide) and Data (Conquer)
Subdivide algorithm into a DAG (or graph) of kernels.
Data is long-lived, manipulated in-place.
Kernels are ephemeral and stateless.
Kernels only get input at entry/creation.
Producer-Consumer (Pipeline) Parallelism
Data is ephemeral: processed as it is generated.
Bandwidth or storage costs prohibit accumulation.
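A minimal bounded-queue example of the producer-consumer idiom: items are consumed as they are produced, and the fixed capacity keeps storage bounded instead of accumulating intermediate data. This is plain C++ for illustration, unrelated to the GRAMPS run-time:

    // Bounded-queue producer-consumer example in plain C++ (not GRAMPS code).
    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : cap_(capacity) {}

        void push(T v) {
            std::unique_lock<std::mutex> lock(m_);
            notFull_.wait(lock, [&] { return q_.size() < cap_; });  // producer blocks when full
            q_.push(std::move(v));
            notEmpty_.notify_one();
        }

        T pop() {
            std::unique_lock<std::mutex> lock(m_);
            notEmpty_.wait(lock, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            notFull_.notify_one();
            return v;
        }

    private:
        std::size_t cap_;
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable notFull_, notEmpty_;
    };

    int main() {
        BoundedQueue<int> queue(8);                       // bounded storage
        std::thread producer([&] {
            for (int i = 0; i < 100; ++i) queue.push(i);  // data is ephemeral...
            queue.push(-1);                               // sentinel: no more work
        });
        std::thread consumer([&] {
            for (int v = queue.pop(); v != -1; v = queue.pop())
                std::printf("consumed %d\n", v);          // ...processed as it is generated
        });
        producer.join();
        consumer.join();
        return 0;
    }

The blocking push is the per-queue high-watermark in miniature: the producer is pre-empted whenever the consumer falls behind.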
“App” 1: MapReduce Run-time
– Popular parallelism-rich idiom
– Enables a variety of useful apps
App 2: Cloth Simulation (Rendering Physics)
– Inspired by the PhysBAM cloth simulation
– Demonstrates basic mechanics, collision detection
– Graph is still very much a work in progress…
App 3: Real-time REYES-like Renderer (Kayvon)
“ProduceReduce”: Minimal simplifications / constraints
Produce/Split (1:N)
Map (1:N)
(Optional) Combine (N:1)
Reduce (N:M, where M << N or M=1 often)
Sort (N:N conceptually, implementations vary)
(Aside: REYES is MapReduce, OpenGL is MapCombine)
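To make the arities concrete, here is a small sequential word count in C++; it shows the shape of Produce (1:N), Map (1:N), and Reduce (N:M), with the optional Combine and the Sort folded into the reduction. It is not the GRAMPS MapReduce run-time:

    // Sequential word count; shows the arities only, not the GRAMPS run-time.
    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Produce/Split (1:N): one input document expands into many initial tuples.
    std::vector<std::string> produce(const std::string& doc) {
        std::vector<std::string> words;
        std::istringstream in(doc);
        for (std::string w; in >> w; ) words.push_back(w);
        return words;
    }

    // Map (1:N): here each word yields exactly one (word, 1) intermediate tuple.
    std::pair<std::string, int> mapWord(const std::string& word) { return {word, 1}; }

    // Reduce (N:M, M << N): collapse all tuples for a key into one final tuple.
    // Combine (the optional N:1 pre-sum) and Sort both fold into std::map here.
    std::map<std::string, int> reduceAll(const std::vector<std::pair<std::string, int>>& tuples) {
        std::map<std::string, int> counts;
        for (const auto& t : tuples) counts[t.first] += t.second;
        return counts;
    }

    int main() {
        std::vector<std::pair<std::string, int>> intermediate;
        for (const std::string& w : produce("to be or not to be"))
            intermediate.push_back(mapWord(w));
        for (const auto& kv : reduceAll(intermediate))
            std::printf("%s: %d\n", kv.first.c_str(), kv.second);
        return 0;
    }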
[Figure: MapReduce as a GRAMPS graph]
Produce → Initial Tuples → Map → Intermediate Tuples → Combine (optional) → Intermediate Tuples → Reduce → Final Tuples → Sort.
Legend: thread stage, shader stage, queue, stage output, push output.
Map output is a dynamically instanced queue set.
Combine might motivate a formal reduction shader.
Reduce is an (automatically) instanced thread stage.
Sort may actually be parallelized.
[Figure: cloth simulation graph]
Proposed Update: an Update stage operating on the Mesh.
Collision Detection: Broad Collide → Candidate Pairs → Narrow Collide, with a BVH Nodes queue.
Resolution: Collisions → Resolve → Moved Nodes → Fast Recollide.
Update is not producer-consumer!
Legend: thread stage, shader stage, queue, stage output, push output.
Broad Phase will actually be either a (weird) shader or multiple thread instances.
Fast Recollide details are also TBD.
Thank you for listening. Any questions?
Actively interested in new collaborators
– Owners or experts in some application domain (or engine / run-time system / middleware).
– Anyone interested in scheduling or details of possible hardware / core configurations.
TOG Paper: http://graphics.stanford.edu/papers/gramps-tog/
Efficiency requires “large chunks of coherent work”
Stages separate coherency boundaries
– Frequency of computation (fan-out / fan-in)
– Memory access coherency
– Execution coherency
Queues allow repacking, re-sorting of work from one coherency regime to another.
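For example, a queue boundary can re-bin mixed work into coherent packets. This sketch, with made-up RayHit/material types, regroups ray hits by material so the next stage shades coherent batches:

    // Re-binning mixed work into coherent packets; types are made up for illustration.
    #include <unordered_map>
    #include <vector>

    struct RayHit { int materialId; float t; };

    // Re-sort a mixed packet of hits into per-material packets for coherent shading.
    std::unordered_map<int, std::vector<RayHit>>
    repackByMaterial(const std::vector<RayHit>& mixed) {
        std::unordered_map<int, std::vector<RayHit>> bins;
        for (const RayHit& h : mixed)
            bins[h.materialId].push_back(h);
        return bins;
    }

    int main() {
        std::vector<RayHit> mixed = { {2, 1.0f}, {7, 0.5f}, {2, 3.0f} };
        auto bins = repackByMaterial(mixed);
        return bins[2].size() == 2 ? 0 : 1;   // both material-2 hits end up together
    }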
Host/Setup: Create execution graph
Thread: Stateful, singleton
Shader: Data-parallel, auto-instanced
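A sketch of what the two programmable stage types might look like; the signatures and names are hypothetical, meant only to contrast stateful, singleton threads with stateless, auto-instanced shaders:

    // Hypothetical stage signatures, only to contrast the two types above.
    #include <cstddef>
    #include <vector>

    struct Packet { std::vector<float> elements; };

    // Thread stage: stateful singleton; loops over whole packets and may keep
    // state (counters, reorder windows, ...) across them.
    class ThreadStage {
    public:
        void run(const Packet& in, Packet& out) {
            ++packetsSeen_;        // persistent state across packets
            out = in;              // placeholder: pass the work through
        }
    private:
        std::size_t packetsSeen_ = 0;
    };

    // Shader stage: stateless and data-parallel; the run-time auto-instances
    // this per-element function across every element of every input packet.
    float shaderStage(float element) {
        return element * 2.0f;     // placeholder per-element work
    }

    int main() {
        ThreadStage stage;
        Packet in{{1.0f, 2.0f}}, out;
        stage.run(in, out);
        float shaded = shaderStage(out.elements[0]);
        return shaded == 2.0f ? 0 : 1;
    }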
Portability really means performance.
Less portable than GL/D3D
– GRAMPS graph is (more) hardware sensitive
More portable than bare metal
– Enforces modularity
– Best case, just works
– Worst case, saves boilerplate
Better scheduling
– Less bursty, better slot filling
– Dynamic priorities
– Handle graphs with loops better
More detailed costs
– Bill for scheduling decisions
– Bill for (internal) synchronization
More statistics
Important: Graph modification (state change)
Probably: Data sharing / ref-counting
Maybe: Blocking inter-stage calls (join)
Maybe: Intra/inter-stage synchronization primitives
REYES, hybrid graphics pipelines
Image / video processing
Game Physics
– Collision detection or particles
Physics and scientific simulation
AI, finance, sort, search or database query, …
Heavy dynamic data manipulation
– k-D tree / octree / BVH build
– lazy/adaptive/procedural tree or geometry