Many-Core Programming with GRAMPS

Jeremy Sugerman

Stanford PPL Retreat

November 21, 2008

Introduction

• Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
• Initial work appearing in ACM TOG in January 2009

Our starting point:
• CPU, GPU trends… and collision?
• Two research areas:
  – HW/SW Interface, Programming Model
  – Future Graphics API

Background

Problem Statement / Requirements:
• Build a programming model / primitives / building blocks to drive efficient development for and usage of future many-core machines.
• Handle homogeneous, heterogeneous, programmable cores, and fixed-function units.

Status Quo:
• GPU Pipeline (good for GL, otherwise hard)
• CPU / C run-time (no guidance; fast is hard)

GRAMPS

[Diagram: two example GRAMPS graphs.
Raster Graphics: Rasterize → Shade → FB Blend, linked by an Input Fragment Queue and an Output Fragment Queue.
Ray Tracer: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend.
Legend: thread stage, shader stage, fixed-function stage, queue, stage output.]

• Apps: graphs of stages and queues (see the sketch below)
• Producer-consumer, task, and data-parallelism
• Initial focus on real-time rendering
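A minimal sketch in plain C (not the real GRAMPS API; the Edge struct and the queue listing are hypothetical) of what "a graph of stages and queues" boils down to, using the ray tracer from the figure:

/* Hypothetical sketch: the ray tracer from the figure above written down as
 * a graph of stages connected by named queues (not the real GRAMPS API).   */
#include <stdio.h>

typedef struct {
    const char *stage;      /* stage name from the figure             */
    const char *in_queue;   /* NULL for a source stage (no input)     */
    const char *out_queue;  /* NULL for a sink stage (no output)      */
} Edge;

static const Edge ray_tracer[] = {
    { "Camera",    NULL,             "Ray Queue"      },
    { "Intersect", "Ray Queue",      "Ray Hit Queue"  },
    { "Shade",     "Ray Hit Queue",  "Fragment Queue" },
    { "FB Blend",  "Fragment Queue", NULL             },
};

int main(void) {
    /* Print the producer-consumer topology the run-time would schedule. */
    for (int i = 0; i < (int)(sizeof ray_tracer / sizeof ray_tracer[0]); ++i)
        printf("%-9s  in: %-14s  out: %s\n", ray_tracer[i].stage,
               ray_tracer[i].in_queue  ? ray_tracer[i].in_queue  : "(source)",
               ray_tracer[i].out_queue ? ray_tracer[i].out_queue : "(sink)");
    return 0;
}

The run-time's job is then to keep packets flowing along these edges while respecting each queue's capacity.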

Design Goals

• Large Application Scope: preferable to roll-your-own
• High Performance: competitive with roll-your-own
• Optimized Implementations: informs HW design
• Multi-Platform: suits a variety of many-core systems

Also:
• Tunable: expert users can optimize their apps


As a Graphics Evolution

• Not (unthinkably) radical for ‘graphics’
• Like the fixed → programmable shading transition
  – Pipeline undergoing a massive shake-up
  – Diversity of new parameters and use cases
• Bigger picture than ‘graphics’
  – Rendering is more than GL/D3D
  – Compute is more than rendering
  – Some ‘GPUs’ are losing their innate pipeline

As a Compute Evolution (1)

• Sounds like streaming: execution graphs, kernels, data-parallelism
• Streaming: “squeeze out every FLOP”
  – Goals: bulk transfer, arithmetic intensity
  – Intensive static analysis, custom chips (mostly)
  – Bounded space, data access, execution time

As a Compute Evolution (2)

• GRAMPS: “interesting apps are irregular”
  – Goals: dynamic, data-dependent code
  – Aggregate work at run-time
  – Heterogeneous commodity platforms
• Streaming techniques fit naturally when applicable

GRAMPS’ Role

• A ‘graphics pipeline’ is now an app!
• Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers
• Compared to the status quo:
  – More flexible, lower level than a GPU pipeline
  – More guidance than bare metal
  – Portability in between
  – Not domain specific

GRAMPS Entities (1)

• Data access via windows into queues/memory (see the sketch below)
• Queues: dynamically allocated / managed
  – Ordered or unordered
  – Specified max capacity (could also spill)
  – Two types: Opaque and Collection
• Buffers: random access, pre-allocated
  – RO, RW Private, RW Shared (not supported)
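A rough, single-threaded sketch of the "windows into queues" idea; the Queue/Window types and the reserve/commit names here are assumptions for illustration, not the GRAMPS interface:

#include <stdio.h>

#define CAPACITY 16

/* Hypothetical fixed-capacity queue accessed through reserve/commit windows. */
typedef struct { int data[CAPACITY]; int head, tail; } Queue;
typedef struct { int *ptr; int count; } Window;   /* a view into reserved slots */

/* Reserve up to 'want' contiguous slots for writing (no wrap, for brevity). */
static Window reserve(Queue *q, int want) {
    int free_slots = CAPACITY - q->tail;
    Window w = { q->data + q->tail, want < free_slots ? want : free_slots };
    return w;
}

/* Commit makes the reserved slots visible to consumers. */
static void commit(Queue *q, Window w) { q->tail += w.count; }

int main(void) {
    Queue rays = { {0}, 0, 0 };
    Window w = reserve(&rays, 4);      /* producer asks for room for 4 rays  */
    for (int i = 0; i < w.count; ++i)  /* fills the window in place          */
        w.ptr[i] = 100 + i;
    commit(&rays, w);                  /* then publishes them all at once    */
    printf("queue depth after commit: %d\n", rays.tail - rays.head);
    return 0;
}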


GRAMPS Entities (2)

• Queue Sets: independent sub-queues (sketched below)
  – Instanced parallelism plus mutual exclusion
  – Hard to fake with just multiple queues
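A toy, single-threaded illustration of what a queue set buys over plain multiple queues; all names below are hypothetical:

#include <stdio.h>

#define SUBQUEUES 4
#define DEPTH     8

/* Hypothetical queue set: one independent sub-queue per key bin. */
typedef struct { int items[DEPTH]; int count; } SubQueue;
typedef struct { SubQueue sub[SUBQUEUES]; } QueueSet;

/* Producer side: bin each item by key so all work on a key lands together. */
static void push(QueueSet *qs, int key, int item) {
    SubQueue *sq = &qs->sub[key % SUBQUEUES];
    if (sq->count < DEPTH) sq->items[sq->count++] = item;
}

int main(void) {
    QueueSet samples = { 0 };
    push(&samples, 7, 70); push(&samples, 2, 20); push(&samples, 7, 71);

    /* Consumer side: each instance owns one sub-queue -- mutual exclusion by
     * construction, with SUBQUEUES-way instanced parallelism across bins.    */
    for (int s = 0; s < SUBQUEUES; ++s)
        for (int i = 0; i < samples.sub[s].count; ++i)
            printf("consumer %d handles item %d\n", s, samples.sub[s].items[i]);
    return 0;
}

Because a consumer instance owns its sub-queue outright, work that must be serialized (say, blending samples that land on one screen-space bin) needs no locking, while independent bins still run in parallel.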


Design Goals (Reminder)

• Large Application Scope
• High Performance
• Optimized Implementations
• Multi-Platform
• (Tunable)


What We’ve Built (System)

GRAMPS Scheduling

• Static inputs:
  – Application graph topology
  – Per-queue packet (‘chunk’) size
  – Per-queue maximum depth / high-watermark
• Dynamic inputs (currently ignored):
  – Current per-queue depths
  – Average execution time per input packet
• Simple policy: run consumers, pre-empt producers (sketched below)
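A toy sketch of the "run consumers, pre-empt producers" policy over the static inputs above; the three-stage linear graph and all numbers are made up for illustration:

#include <stdio.h>

/* Hypothetical per-stage bookkeeping for a linear graph: stage i consumes
 * queue i-1 and produces queue i. Depths/watermarks are the static inputs. */
#define STAGES 3
static const char *name[STAGES]    = { "Camera", "Intersect", "Shade" };
static int depth[STAGES]           = { 0, 0, 0 };   /* current output-queue depth */
static const int watermark[STAGES] = { 8, 8, 8 };   /* per-queue high watermark   */

/* Pick the most-downstream stage with input available; otherwise let the
 * source run, but only if its output queue is still below its watermark.  */
static int pick_stage(void) {
    for (int i = STAGES - 1; i > 0; --i)
        if (depth[i - 1] > 0) return i;             /* consumers first        */
    return (depth[0] < watermark[0]) ? 0 : -1;      /* else (maybe) producer  */
}

int main(void) {
    for (int step = 0; step < 12; ++step) {
        int s = pick_stage();
        if (s < 0) break;                           /* everything is full/idle */
        if (s > 0) depth[s - 1]--;                  /* consume one packet      */
        depth[s]++;                                 /* ... and produce one     */
        printf("step %2d: run %s\n", step, name[s]);
    }
    return 0;
}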


GRAMPS Scheduler Organization

• Tiered scheduler: Tier-N, Tier-1, Tier-0 (sketched below)
• Tier-N only wakes idle units; no rebalancing
• All Tier-1s compete for all queued work
• ‘Fat’ cores: software Tier-1 per core, Tier-0 per thread
• ‘Micro’ cores: single shared hardware Tier-1+0
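A toy, single-threaded simulation of the tier split described above; the counts and the wake/dispatch bookkeeping are invented for illustration:

#include <stdio.h>

/* Hypothetical simulation: all Tier-1 schedulers compete for work from the
 * shared queued pool, and Tier-N's only job is to wake idle Tier-1s.       */
#define TIER1_COUNT 3

static int shared_work = 10;                 /* packets queued graph-wide    */
static int idle[TIER1_COUNT] = { 0, 1, 1 };  /* Tier-1s 1 and 2 start asleep */

static void tierN_wake_idle(void) {          /* no rebalancing, only wake-ups */
    for (int i = 0; i < TIER1_COUNT; ++i)
        if (idle[i] && shared_work > 0) {
            idle[i] = 0;
            printf("Tier-N wakes Tier-1 #%d\n", i);
        }
}

int main(void) {
    while (shared_work > 0) {
        tierN_wake_idle();
        for (int i = 0; i < TIER1_COUNT; ++i) {   /* Tier-1s compete for work */
            if (idle[i] || shared_work == 0) continue;
            shared_work--;                        /* hand one packet down     */
            printf("Tier-1 #%d dispatches a packet to its Tier-0s (%d left)\n",
                   i, shared_work);
        }
    }
    return 0;
}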

What We’ve Built (Apps)

Direct3D Pipeline (with Ray-tracing Extension)

[Diagram: IA 1..N → Input Vertex Queue 1..N → VS 1..N → Primitive Queue 1..N → RO → Primitive Queue → Rast → Fragment Queue → PS → Sample Queue Set → OM, with the ray-tracing extension PS → Ray Queue → Trace → Ray Hit Queue → PS2.]

Ray-tracing Graph

[Diagram: Tiler → Tile Queue → Sampler → Sample Queue → Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend.
Legend: thread stage, shader stage, fixed-function stage, queue, stage output, push output.]


Initial Renderer Results

• Queues are small (< 600 KB CPU, < 1.5 MB GPU)
• Parallelism is good (at least 80%, all but one 95+%)

Scheduling can clearly improve.

Taking Stock: High-level Questions

• Is GRAMPS a suitable GPU evolution?
  – Does it enable a pipeline competitive with bare metal?
  – Does it enable innovation: advanced / alternative methods?
• Is GRAMPS a good parallel compute model?
  – Does it fulfill our design goals?

Possible Next Steps

• Simulation / hardware fidelity improvements
  – Memory model, locality
• GRAMPS run-time improvements
  – Scheduling, run-time overheads
• GRAMPS API extensions
  – On-the-fly graph modification, data sharing
• More applications / workloads
  – REYES, physics, finance, AI, …
  – Lazy/adaptive/procedural data generation


Design Goals (Revisited)

• Application Scope: okay, but only (multiple) renderers so far
• High Performance: so-so, limited by simulation detail
• Optimized Implementations: good
• Multi-Platform: good
• (Tunable: good, but that’s a separate talk)

Strategy: broaden the available apps and use them to drive the performance and simulation work for now.


Digression: Some Kinds of Parallelism

Task (Divide) and Data (Conquer)
• Subdivide the algorithm into a DAG (or graph) of kernels.
• Data is long-lived, manipulated in place.
• Kernels are ephemeral and stateless.
• Kernels only get input at entry/creation.

Producer-Consumer (Pipeline) Parallelism
• Data is ephemeral: processed as it is generated.
• Bandwidth or storage costs prohibit accumulation.


Three New Graphs

• “App” 1: MapReduce run-time
  – Popular parallelism-rich idiom
  – Enables a variety of useful apps
• App 2: Cloth Simulation (Rendering Physics)
  – Inspired by the PhysBAM cloth simulation
  – Demonstrates basic mechanics, collision detection
  – Graph is still very much a work in progress…
• App 3: Real-time REYES-like Renderer (Kayvon)


MapReduce: Specific Flavour

“ProduceReduce”: minimal simplifications / constraints (sketched as function shapes below)
• Produce/Split (1:N)
• Map (1:N)
• (Optional) Combine (N:1)
• Reduce (N:M, where M << N or often M = 1)
• Sort (N:N conceptually; implementations vary)

(Aside: REYES is MapReduce, OpenGL is MapCombine)
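As a sketch, the stage shapes above can be read straight off plain C function types; the names and the tiny sum-of-squares instantiation below are hypothetical, not the GRAMPS MapReduce code:

#include <stdio.h>

/* Hypothetical C shapes for the "ProduceReduce" stages above; emit() stands
 * in for writing a tuple to the stage's output queue. (Combine is optional
 * and omitted from the toy run below.)                                      */
typedef void (*emit_fn)(int tuple);
typedef void (*produce_fn)(emit_fn e);                /* 1:N                 */
typedef void (*map_fn)(int tuple, emit_fn e);         /* 1:N                 */
typedef int  (*reduce_fn)(const int *tuples, int n);  /* N:M (here M = 1)    */

/* A toy instantiation: produce 1..4, map x -> x*x, reduce by summing. */
static int out[16];
static int out_n = 0;
static void emit(int tuple) { out[out_n++] = tuple; }

static void produce(emit_fn e)           { for (int i = 1; i <= 4; ++i) e(i); }
static void map_square(int x, emit_fn e) { e(x * x); }
static int  reduce_sum(const int *t, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += t[i];
    return sum;
}

int main(void) {
    produce(emit);                                   /* Produce: 1:N          */
    int initial[16], n = out_n;
    for (int i = 0; i < n; ++i) initial[i] = out[i];
    out_n = 0;                                       /* next queue starts empty */
    for (int i = 0; i < n; ++i)
        map_square(initial[i], emit);                /* Map: 1:N per tuple    */
    printf("reduced: %d\n", reduce_sum(out, out_n)); /* Reduce: N:1           */
    return 0;
}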


MapReduce Graph

[Diagram: Produce → Initial Tuples → Map → Intermediate Tuples → Combine (optional) → Intermediate Tuples → Reduce → Final Tuples → Sort.
Legend: thread stage, shader stage, queue, stage output, push output.]

• Map output is a dynamically instanced queue set.
• Combine might motivate a formal reduction shader.
• Reduce is an (automatically) instanced thread stage.
• Sort may actually be parallelized.


Cloth Simulation Graph

[Diagram: Update Mesh stage with a Proposed Update queue; Collision Detection: Broad Collide and Narrow Collide stages with Candidate Pairs and BVH Nodes queues; Resolution: Resolve and Fast Recollide stages with Collisions and Moved Nodes queues.
Legend: thread stage, shader stage, queue, stage output, push output.]

• Update is not producer-consumer!
• Broad Phase will actually be either a (weird) shader or multiple thread instances.
• Fast Recollide details are also TBD.


That’s All Folks

• Thank you for listening. Any questions?
• Actively interested in new collaborators:
  – Owners of or experts in some application domain (or engine / run-time system / middleware)
  – Anyone interested in scheduling or the details of possible hardware / core configurations
• TOG paper: http://graphics.stanford.edu/papers/gramps-tog/


Backup Slides / More Details


Designing A Good Graph

• Efficiency requires “large chunks of coherent work”
• Stages separate coherency boundaries:
  – Frequency of computation (fan-out / fan-in)
  – Memory access coherency
  – Execution coherency
• Queues allow repacking and re-sorting of work from one coherency regime to another (sketched below).
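A toy sketch of repacking at a queue boundary (all names hypothetical): incoherent ray hits are binned by material so the downstream shading stage sees large coherent chunks:

#include <stdio.h>

/* Hypothetical repacking step: hits arrive in arbitrary order, but get
 * binned by material id so shading runs over one coherent chunk at a time. */
#define MATERIALS 3
#define MAX_HITS  8

typedef struct { int material; int hit_id; } RayHit;

int main(void) {
    const RayHit incoming[MAX_HITS] = {          /* incoherent producer order */
        {2,0},{0,1},{1,2},{2,3},{0,4},{2,5},{1,6},{0,7}
    };
    RayHit bins[MATERIALS][MAX_HITS];
    int    depth[MATERIALS] = { 0 };

    for (int i = 0; i < MAX_HITS; ++i) {         /* repack: bin by material   */
        const RayHit *h = &incoming[i];
        bins[h->material][depth[h->material]++] = *h;
    }
    for (int m = 0; m < MATERIALS; ++m) {        /* shade one coherent chunk  */
        printf("material %d chunk:", m);         /* per material at a time    */
        for (int i = 0; i < depth[m]; ++i) printf(" %d", bins[m][i].hit_id);
        printf("\n");
    }
    return 0;
}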


GRAMPS Interfaces

• Host/Setup: create the execution graph
• Thread: stateful, singleton
• Shader: data-parallel, auto-instanced (thread vs. shader sketched below)
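A rough sketch (hypothetical C, not the actual GRAMPS interface) of how the two stage flavours differ for the author of a graph:

#include <stdio.h>

/* --- Shader stage: stateless, data-parallel, auto-instanced ------------- */
/* The run-time may invoke many instances of this callback concurrently,    */
/* one per input element (or packet); it keeps no state between calls.      */
static int shade(int ray_hit) { return ray_hit * 2; /* stand-in shading */ }

/* --- Thread stage: stateful singleton ------------------------------------ */
/* One long-lived instance that loops over its input queue and may carry    */
/* state across packets (here: a running count of blended fragments).       */
static void fb_blend(const int *fragments, int n) {
    static int blended_total = 0;                 /* persistent stage state  */
    for (int i = 0; i < n; ++i) blended_total++;
    printf("fb_blend has processed %d fragments so far\n", blended_total);
}

int main(void) {
    int hits[4] = { 1, 2, 3, 4 }, fragments[4];
    for (int i = 0; i < 4; ++i)                   /* run-time would do this  */
        fragments[i] = shade(hits[i]);            /* in parallel per element */
    fb_blend(fragments, 4);                       /* single serialized stage */
    return 0;
}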

GRAMPS Graph Portability

• Portability really means performance.
• Less portable than GL/D3D
  – A GRAMPS graph is (more) hardware sensitive
• More portable than bare metal
  – Enforces modularity
  – Best case: it just works
  – Worst case: it saves boilerplate

Possible Next Steps: Implementation

• Better scheduling
  – Less bursty, better slot filling
  – Dynamic priorities
  – Handle graphs with loops better
• More detailed costs
  – Bill for scheduling decisions
  – Bill for (internal) synchronization
• More statistics

Possible Next Steps: API

• Important: graph modification (state change)
• Probably: data sharing / ref-counting
• Maybe: blocking inter-stage calls (join)
• Maybe: intra-/inter-stage synchronization primitives

Possible Next Steps: New Workloads

• REYES, hybrid graphics pipelines
• Image / video processing
• Game physics
  – Collision detection or particles
• Physics and scientific simulation
• AI, finance, sort, search or database query, …
• Heavy dynamic data manipulation
  – k-D tree / octree / BVH build
  – Lazy/adaptive/procedural tree or geometry
