Graphics on GRAMPS Jeremy Sugerman Kayvon Fatahalian FLASHG 15 Oct 2007

Background
• Context: a broader research investigation into generalizing GPU/Cell/“compute” cores and combining them with CPUs.
• Fundamental beliefs:
– Real data-parallel apps still have performance-critical non-data-parallel pieces
– Existing parallel programming models are too constrained (GPUs) or too hard/vague (CPUs)
– Queues are an excellent idiom for capturing producer-consumer parallelism, both thread and data
– Fixed-function execution units are not a problem, but fixed control paths are
Compute Cores
• CPUs are designed for a single thread per core
– Minimal FLOPS per core
• Compute cores are designed for lots of math per core
– Many “threads” per core
– Sometimes wider SIMD per thread
– (SIMD width × number of hardware threads) ops per core
• And more compute cores than CPU cores fit per chip
• Many examples: GPU, Cell, Niagara, Larrabee
Simplified Direct3D Pipeline

Application launches some drawing…
1. Vertex Assembly (Fixed, Non-Data Parallel)
2. Vertex Processing (Programmable, Data Parallel)
3. Primitive Assembly (Fixed, Non-Data Parallel)
4. Primitive Processing (Programmable, Data Parallel)
5. Fragment Assembly (Fixed, Non-Data Parallel)
6. Fragment Processing (Programmable, Data Parallel)
7. Pixel / Image Assembly (Fixed, Non-Data Parallel)

Only Data Parallel stages are programmable!
Direct3D Pipeline Properties
• There is a reason only data-parallel stages are programmable.
• ‘Shader’ stages are inherently per-element (e.g. vertex / primitive / fragment) and keep no state between elements.
• ‘Assembly’ stages also run on many elements, but they have inter-element dependencies:
– State can be remembered (vertex caching)
– Inputs can be used by multiple outputs (strips)
• Programmable ‘Assembly’ requires heavier (more serial) threads than ‘Shaders’.
Question
• Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?
• This does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units…)
This Talk
• GRAMPS: our current model for programming compute cores.
• Implementing Direct3D 10 “in software” with GRAMPS.
• (Potentially) thoughts about how REYES and ray tracers map to GRAMPS.
No explicit discussion of heterogeneous cores.
No fancy scheduling algorithms (yet?)
Example: Simple 3D Pipeline
[Figure: pipeline of alternating data and stages]
Input Vertices → Vertex Shading → Transformed Vertices → Primitive Assembly → Primitives → Rasterize (Assemble) → Fragments → Fragment Shading → Shaded Fragments → Image Assembly → Framebuffer Pixels
GRAMPS
• General Runtime/Architecture for Multicore Parallel Systems
• Models an execution graph of queues connected by threads
• Graph is specified by the host program
• Simulator for exploring compute cores
– Currently conflates “hardware” and runtime
– Number of cores, thread contexts, and SIMD width are all parameters
Simple GRAMPS core
• T - threads/core
• S - SIMD ALUs/core
• R - registers/thread
• 1 thread runs in each clock
• Threads issue vector instructions (think S-wide SSE)
[Figure: one core with thread contexts Thread 0 … Thread T-1 (R registers each), SIMD ALUs ALU 0 … ALU S-1, and an L1 data cache (or scratchpad)]
D3D10 Setup
1. App defines 3 shading environments
– Vertex, geometry, fragment
– Attach programs and resources
2. App configures fixed-function units
– Fixed number of “modes”
– Attach resources
3. App submits work (vertices) to pipeline
4. Graphics runtime executes until completion
GRAMPS Setup
1. App defines a set of queues
2. App defines a set of thread environments
3. App attaches queues as thread inputs and outputs
4. App bootstraps computation by inserting data into a queue
5. Runtime executes threads until completion
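
The five steps above can be sketched with a toy, single-threaded stand-in for the runtime. Everything here is hypothetical and is not the GRAMPS API: the `queue` and `thread_env` structs, `q_push`, `run_to_completion`, and the two example stages `twice` and `accumulate` are illustration only. Real GRAMPS threads operate on chunks and run concurrently on hardware contexts.

```c
#include <assert.h>

enum { QCAP = 32 };

/* Step 1: a queue, reduced here to a small FIFO of ints (hypothetical). */
typedef struct { int data[QCAP]; int head, tail; } queue;

typedef int (*stage_fn)(int);

/* Steps 2 and 3: a thread environment with its attached queues. */
typedef struct {
    stage_fn fn;    /* per-item work */
    queue   *in;    /* input queue */
    queue   *out;   /* output queue, or NULL for a sink */
} thread_env;

void q_push(queue *q, int v)       { assert(q->tail < QCAP); q->data[q->tail++] = v; }
static int q_empty(const queue *q) { return q->head == q->tail; }
static int q_pop(queue *q)         { assert(!q_empty(q)); return q->data[q->head++]; }

/* Two example stages: a transform and a sink that accumulates a sum. */
int g_sum = 0;
int twice(int x)      { return 2 * x; }
int accumulate(int x) { g_sum += x; return 0; }

/* Step 5: run any thread whose input queue has data until nothing is left.
 * Returns the number of stage invocations performed. */
int run_to_completion(thread_env *t, int n) {
    int executed = 0, progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n; i++) {
            if (q_empty(t[i].in)) continue;
            int r = t[i].fn(q_pop(t[i].in));
            if (t[i].out) q_push(t[i].out, r);
            executed++;
            progress = 1;
        }
    }
    return executed;
}
```

Step 4, the bootstrap, is then a handful of `q_push` calls on the first queue before calling `run_to_completion`.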
GRAMPS Entities: Execution
• Threads: Assemble, Shader, Fixed
– Assemble: Stateful, akin to a regular thread
– Fixed: Special-purpose hardware wrapped to appear as an Assemble thread
– Shader: Stateless and data parallel
GRAMPS Entities: Data
• Queues for producer-consumer parallelism
• Queues for aggregating coherent work
• Queues support push and reserve/commit for in-place Assembly
• Chunks are the units (granularity) at which Queues are manipulated.
GRAMPS Scheduling
• GRAMPS assigns Threads to hardware contexts
– Based on the graph and current Queue contents
• Tiered scheduling model
– Tier-0: Trivially puts threads onto hardware threads
– Tier-1: Builds schedules for Tier-0
– Tier-N: Arbitrarily clever. Doesn’t exist.
System
(how it works today)
D3D10 on GRAMPS
[Figure: the D3D10 pipeline expressed as a GRAMPS graph. Legend: shader thread, assemble thread, fixed function in GPU.]
Front end: Index queue, idxVtxAssemble, preVtxShade queue, vtxShade, postVtxShade queue, prePrimAssemble queue, primAssemble, prePrimShade queue, primShade, postPrimShade queue, rastAssemble, preRast queue, tri setup / clip / cull.
Back end, replicated per triangle queue (tri queue 0 … tri queue N): rasterize, preFragShade queue, fragShade, postFragShade queue, blend / ztest.
Internal Queues
• Queues are just memory plus a state struct (see below)
– For now: Queues are finite
– Queues are a contiguous array of chunks
• Chunks = granularity of manipulation
queue {
    BYTE ptr[num_chunks * chunk_byte_width];
    int  num_chunks;
    int  chunk_byte_width;
    int  head;
    int  tail;
    int  reclaim;
    bool done[num_chunks];
};
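
One possible reading of the queue struct above, sketched in C. The slides do not define what `head`, `tail`, `reclaim`, and `done[]` mean, so the roles assigned here are assumptions: chunks are consumed in order, may finish out of order, and their storage is reclaimed in order. The helper names `queue_init`, `queue_push`, `queue_take`, and `queue_finish` are hypothetical, and the counters grow monotonically with integer overflow ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char BYTE;

/* Field names follow the slide; their exact roles are assumed here. */
typedef struct {
    BYTE *ptr;              /* num_chunks * chunk_byte_width bytes */
    int   num_chunks;
    int   chunk_byte_width;
    int   head;             /* next chunk to hand to a consumer (assumed) */
    int   tail;             /* next chunk a producer fills (assumed) */
    int   reclaim;          /* oldest chunk whose storage is still held (assumed) */
    bool *done;             /* done[slot]: the consumer finished that slot */
} queue;

/* Allocate storage for a finite queue; returns false on allocation failure. */
bool queue_init(queue *q, int num_chunks, int chunk_byte_width) {
    q->ptr  = malloc((size_t)num_chunks * (size_t)chunk_byte_width);
    q->done = calloc((size_t)num_chunks, sizeof(bool));
    q->num_chunks = num_chunks;
    q->chunk_byte_width = chunk_byte_width;
    q->head = q->tail = q->reclaim = 0;
    return q->ptr != NULL && q->done != NULL;
}

void queue_destroy(queue *q) { free(q->ptr); free(q->done); }

/* Copy one chunk in; returns false when every slot is still held. */
bool queue_push(queue *q, const void *chunk) {
    if (q->tail - q->reclaim == q->num_chunks) return false;
    int slot = q->tail % q->num_chunks;
    memcpy(q->ptr + (size_t)slot * (size_t)q->chunk_byte_width,
           chunk, (size_t)q->chunk_byte_width);
    q->done[slot] = false;
    q->tail++;
    return true;
}

/* Hand the next unconsumed chunk to a consumer; returns its id, or -1 if none. */
int queue_take(queue *q) {
    if (q->head == q->tail) return -1;
    return q->head++;
}

/* Mark chunk `id` finished, then reclaim every leading finished chunk. */
void queue_finish(queue *q, int id) {
    assert(id >= q->reclaim && id < q->head);
    q->done[id % q->num_chunks] = true;
    while (q->reclaim < q->head && q->done[q->reclaim % q->num_chunks]) {
        q->done[q->reclaim % q->num_chunks] = false;
        q->reclaim++;
    }
}
```

Under this reading, a chunk that finishes early does not free space until all older chunks have finished as well, which keeps the occupied region contiguous.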
Ex: GRAMPS has chunks
[Figure: Index queue, idxVtxAssemble, preVtxShade queue, vtxShade, postVtxShade queue]
• index_queue chunks contain vertex indices
• preVtxShade_queue chunks contain 16 pre-transformed vertices
• postVtxShade_queue chunks contain 16 transformed vertices
Ex: GRAMPS has chunks
[Figure: rasterize → preFragShade queue → fragShade]
preFragShade_queue chunks contain:
– Interpolated inputs for 16 fragments
– Liveness mask per fragment
– x,y position per quad
– Uniform data shared across all fragments
Queue API
• Window = view into a contiguous range of chunks, for assemble threads
• Symmetric for producing/consuming access

qwin {
    BYTE* ptr;
    int   num;
    int   id;
};

• Shader threads just have “push”
Queue manipulation
(All threads)
    void  produce()    “push”

(Assemble threads only)
    qwin* reserve(qwin* q, int num_chunks)
    qwin* commit(qwin* q, int num_chunks)
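
A minimal sketch of how `reserve` and `commit` might move a producer-side window, written to match the two signatures above. The backing store, its flat non-wrapping layout, the constants, and the reading of `qwin.id` as a queue index are all assumptions made for illustration. Where a real runtime would block the assemble thread, this sketch returns NULL instead.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

enum { CHUNK_BYTES = 64, NUM_CHUNKS = 8, NUM_QUEUES = 1 };

/* Hypothetical backing store per queue: a flat array that never wraps. */
typedef struct {
    BYTE data[NUM_CHUNKS * CHUNK_BYTES];
    int  committed;   /* chunks already handed downstream */
} backing;

backing g_queues[NUM_QUEUES];

/* Window fields follow the slide; `id` is assumed to name the queue. */
typedef struct {
    BYTE *ptr;   /* first chunk in the window */
    int   num;   /* chunks currently in the window */
    int   id;    /* which queue the window views (assumed meaning) */
} qwin;

/* Grow the window by n chunks at its far end.
 * Returns NULL where a real runtime would block until space is free. */
qwin *reserve(qwin *w, int n) {
    assert(w->id >= 0 && w->id < NUM_QUEUES);
    backing *b = &g_queues[w->id];
    if (n < 0 || b->committed + w->num + n > NUM_CHUNKS) return NULL;
    w->ptr = b->data + (size_t)b->committed * CHUNK_BYTES;
    w->num += n;
    return w;
}

/* Release the n oldest chunks of the window downstream and slide it forward. */
qwin *commit(qwin *w, int n) {
    assert(w->id >= 0 && w->id < NUM_QUEUES);
    backing *b = &g_queues[w->id];
    if (n < 0 || n > w->num) return NULL;
    b->committed += n;
    w->num -= n;
    w->ptr = b->data + (size_t)b->committed * CHUNK_BYTES;
    return w;
}
```

An assemble thread would fill chunks through `w->ptr`, call `commit(w, 1)` as each chunk completes, and call `reserve` again when the window runs short.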
Internal threads
• Defines a “type” of thread

ThreadEnv {
    type = {shader, assemble, fixed-func}
    Program Code
    uniforms/constant data
    sampler/texture/resource id bindings
    List of input queues
    List of output queues
};
Shader threads
• Shading language unchanged (HLSL)
– Still write shaders in terms of single elements
– Compilation produces code to operate on chunks

void hlsl_likefn(const element* inputEl,
                 element*       outputEl,
                 const sampler  foo,
                 const tex3d    tex)
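
To illustrate the step from per-element shaders to chunk-processing code, here is a sketch of what the compiled wrapper might look like. The `element`, `chunk`, and `env` layouts and the names `scale_element` and `scale_chunk` are hypothetical; the 16-element chunk size is taken from the chunk examples in the slides.

```c
#include <assert.h>

enum { CHUNK_ELEMS = 16 };   /* the slides use 16-element chunks */

typedef struct { float v[4]; } element;                         /* hypothetical layout */
typedef struct { int count; element el[CHUNK_ELEMS]; } chunk;   /* hypothetical layout */
typedef struct { float scale; } env;                            /* stand-in for uniforms */

/* What the shader author writes: one element in, one element out. */
static void scale_element(const element *in, element *out, const env *e) {
    for (int i = 0; i < 4; i++) out->v[i] = in->v[i] * e->scale;
}

/* What compilation might emit: the same body applied across a chunk.
 * Elements are independent, so this loop is the data-parallel dimension. */
void scale_chunk(const chunk *in, chunk *out, const env *e) {
    assert(in->count >= 0 && in->count <= CHUNK_ELEMS);
    out->count = in->count;
    for (int i = 0; i < in->count; i++)
        scale_element(&in->el[i], &out->el[i], e);
}
```

On an S-wide SIMD core the loop over elements would be mapped onto vector lanes rather than run serially; the serial loop is used here only to keep the sketch portable.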
Internal shader threads
• Shader thread code processes chunks
• Input:
– GRAMPS pre-reserved chunks from in/out queues
– Environment info (uniforms, consts, etc.)

void shaderFn(const chunk* in_chunks[],
              chunk*       out_chunks[],
              const env*   env)

• Dispatched shader threads run to completion
• Completion implies:
– in_chunks are released
– out_chunks are committed
Assemble threads
• Assemble threads build chunks
• Access queue data via windows
• Commit/reserve/consume may block the thread

void assembleFn(qwin*      in_win[],
                qwin*      out_win[],
                const env* env)
Ex: primitive assembly
• Input chunks = 16 verts
• Output chunks = 16 prims
• Prim structure depends on the type of prim
– Points, lines, triangles, triangles w/ adjacency, etc.
• Creating prims from verts depends on topology
– Strips or lists
– Triangle strip: data for an output chunk comes from multiple input chunks
[Figure: prePrimAssemble queue → primAssemble → prePrimShade queue]
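
The triangle-strip case shows why primitive assembly needs a stateful thread: the first triangles built from one input chunk reuse vertices from the previous chunk. The sketch below carries that state explicitly. The types `vert_chunk`, `tri`, and `strip_state` and the function `strip_assemble` are hypothetical, vertices are reduced to integer ids, and output goes to a plain array rather than a queue window.

```c
#include <assert.h>

enum { CHUNK_VERTS = 16 };

typedef struct { int count; int vert[CHUNK_VERTS]; } vert_chunk;   /* vertex ids */
typedef struct { int v[3]; } tri;

/* State the assemble thread keeps between input chunks. */
typedef struct {
    int prev[2];   /* last two strip vertices seen */
    int seen;      /* vertices consumed so far in this strip */
} strip_state;

/* Consume one input chunk of a triangle strip and write triangles to `out`,
 * which must have room for in->count entries. Returns the number written.
 * Odd triangles swap their first two vertices to keep a consistent winding. */
int strip_assemble(strip_state *s, const vert_chunk *in, tri *out) {
    int n = 0;
    assert(in->count >= 0 && in->count <= CHUNK_VERTS);
    for (int i = 0; i < in->count; i++) {
        int v = in->vert[i];
        if (s->seen >= 2) {
            int odd = s->seen & 1;
            tri *t = &out[n++];
            t->v[0] = s->prev[odd];
            t->v[1] = s->prev[!odd];
            t->v[2] = v;
            s->prev[0] = s->prev[1];   /* slide the two-vertex history */
            s->prev[1] = v;
        } else {
            s->prev[s->seen] = v;      /* still priming the strip */
        }
        s->seen++;
    }
    return n;
}
```

Feeding the strip 0, 1, 2, 3, 4 as two chunks, {0, 1, 2} and then {3, 4}, yields one triangle from the first call and two from the second, and the second call's first triangle uses vertices 1 and 2 from the earlier chunk.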
Ex: frag assembly (rast)
for (each input triangle) {
    while (triangle not done) {
        add triangle uniform data to chunk
        while (chunk not full && triangle not done) {
            rasterize next tile of quads…
            for (each nonempty quad) {
                add 4 fragments to chunk
                add quad description to chunk
            }
        }
        if (chunk is full) {
            qwin_out = commit(qwin_out, 1);
            grow window with reserve() if necessary…
        }
    }
}

Building chunks:
1. Compact valid quads
2. Data at various frequencies
Execution: Tier 1
[Figure: Tier 1 scheduling. Elements shown: queues; shader and assemble threadEnvs; ShaderThr dispatch and AssembleThr resume messages; the Tier 1 to Tier 0 FIFO; a core with thread slots T0, T1, T2, … T(T-1) and an L1 cache; and the calls Thread_Done() (implicit commit), Produce(), Reserve(), Commit().]
Execution: Tier 0
• Each cycle: round robin over runnable threads
• Thread stalls: place on wait list
• When a thread completes:
1. Pull the next thread from the FIFO and assign it to the empty thread slot
2. Send a completion message to Tier 1
[Figure: the Tier 1 to Tier 0 FIFO feeding the Tier 0 scheduler of a core with thread contexts Thread 0 … Thread T-1 (R registers each), ALU 0 … ALU S-1, and an L1 data cache (or scratchpad)]
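
The per-cycle round-robin choice can be sketched as below. This is an illustration under assumptions, not the simulator's code: the `core` struct, the `slot_state` values, and `tier0_pick` are hypothetical names, and stalled threads are modelled simply as slots in a waiting state that the picker skips.

```c
#include <assert.h>

enum { MAX_SLOTS = 8 };

typedef enum { SLOT_EMPTY, SLOT_RUNNABLE, SLOT_WAITING } slot_state;

typedef struct {
    slot_state state[MAX_SLOTS];
    int num_slots;   /* T: hardware thread contexts on this core */
    int last;        /* slot issued on the previous cycle */
} core;

/* Pick the slot to issue this cycle: the first runnable slot after `last`,
 * wrapping around. Returns -1 when no thread is runnable (an idle cycle). */
int tier0_pick(core *c) {
    assert(c->num_slots > 0 && c->num_slots <= MAX_SLOTS);
    assert(c->last >= 0 && c->last < c->num_slots);
    for (int step = 1; step <= c->num_slots; step++) {
        int slot = (c->last + step) % c->num_slots;
        if (c->state[slot] == SLOT_RUNNABLE) {
            c->last = slot;
            return slot;
        }
    }
    return -1;
}
```

Starting the search one past the previously issued slot is what makes the policy fair: a thread that just ran is considered last on the next cycle.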
Validation
• “Fat enough” cores for assemble threads can deliver sufficient FLOPS
• Assemble threads can keep compute cores + fixed-function units busy
• Can give up domain-specific heuristics in the scheduling