Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman

Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman
Pat Hanrahan
February 10th, 2003
Brook: general purpose streaming language
• developed for PCA Program/Merrimac
– compiler: RStream
• Reservoir Labs
– DARPA PCA Program
• Stanford: SmartMemories
• UT Austin: TRIPS
• MIT: RAW
[Figure: block diagram. Labels: Scalar Execution Unit, Stream Execution Unit, Stream Register File, Memory System, Network Interface, Network, DRDRAM]
– Brook version 0.2 spec: http://merrimac.stanford.edu
– Brook for GPUs: http://brook.sourceforge.net
February 11th, 2004
Brook: general purpose streaming language
• stream programming model
– enforce data parallel computing
• streams
– encourage arithmetic intensity
• kernels
• C with streams
Brook for GPUs
• demonstrate GPU as a streaming coprocessor
– make programming GPUs easier
• hide texture/pbuffer data management
• hide graphics based constructs in Cg/HLSL
• hide rendering passes
• virtualize resources
– performance!
• … on applications that matter
– highlight GPU areas for improvement
• features required for general purpose stream computing
system outline
• .br: Brook source files
• brcc: source to source compiler
• brt: Brook run-time library
Brook language
streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …
float3 positions<200>;
float3 velocityfield<100,100,100>;
– encourage data parallelism
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
                 out float result<>) {
  result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);
/* the call above behaves like this C loop over plain arrays */
for (i=0; i<100; i++)
  c[i] = a[i]+b[i];
– no dependencies between stream elements
• encourage high arithmetic intensity
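The kernel semantics on this slide can be mimicked in plain C to make the "no dependencies between stream elements" point concrete. This is only a sketch: `foo_element` and `foo_apply` are hypothetical names, not part of Brook, and the sequential loop stands in for whatever order the hardware actually uses.

```c
#include <stddef.h>

/* Per-element body, standing in for the Brook kernel `foo`.
   It sees exactly one element of each input stream and writes
   exactly one output element, so no element depends on another. */
static void foo_element(float a, float b, float *result) {
    *result = a + b;
}

/* Hypothetical driver: applies the per-element body across whole
   streams. Because elements are independent, the iterations could
   run in any order, or all at once, with the same result. */
void foo_apply(const float *a, const float *b, float *result, size_t n) {
    for (size_t i = 0; i < n; i++) {
        foo_element(a[i], b[i], &result[i]);
    }
}
```

Calling `foo_apply(a, b, c, 100)` on three 100-element arrays corresponds to `foo(a,b,c)` on the slide.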
Brook language
kernels
• Ray Triangle Intersection
kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                 RayState oldraystate<>,
                                 GridTrilist trilist[],
                                 out Hit candidatehit<>) {
  float idx, det, inv_det;
  float3 edge1, edge2, pvec, tvec, qvec;
  if(oldraystate.state.y > 0) {
    idx = trilist[oldraystate.state.w].trinum;
    edge1 = tris[idx].v1 - tris[idx].v0;
    edge2 = tris[idx].v2 - tris[idx].v0;
    pvec = cross(ray.d, edge2);
    det = dot(edge1, pvec);
    inv_det = 1.0f/det;
    tvec = ray.o - tris[idx].v0;
    candidatehit.data.y = dot( tvec, pvec ) * inv_det;
    qvec = cross( tvec, edge1 );
    candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
    candidatehit.data.x = dot( edge2, qvec ) * inv_det;
    candidatehit.data.w = idx;
  } else {
    candidatehit.data = float4(0,0,0,-1);
  }
}
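The arithmetic inside the kernel can be checked on the CPU with a small C sketch for a single ray and a single triangle. The struct and function names here (`Vec3`, `Candidate`, `intersect_candidate`) are assumptions made for illustration, and the reading of `data.x`, `data.y`, `data.z` as hit distance and two barycentric coordinates is an interpretation of the slide. Like the kernel, the sketch only computes a candidate: it does not test for a zero determinant or check that the hit lies inside the triangle.

```c
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 o, d; } Ray;               /* origin, direction */
typedef struct { Vec3 v0, v1, v2; } Triangle;
typedef struct { float t, u, v; } Candidate;     /* data.x, data.y, data.z */

static Vec3 sub(Vec3 a, Vec3 b) {
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static Vec3 cross(Vec3 a, Vec3 b) {
    Vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Mirrors the body of the kernel's "if" branch, step for step. */
Candidate intersect_candidate(Ray ray, Triangle tri) {
    Vec3 edge1 = sub(tri.v1, tri.v0);
    Vec3 edge2 = sub(tri.v2, tri.v0);
    Vec3 pvec = cross(ray.d, edge2);
    float det = dot(edge1, pvec);
    float inv_det = 1.0f / det;      /* unchecked, as in the kernel */
    Vec3 tvec = sub(ray.o, tri.v0);
    Vec3 qvec = cross(tvec, edge1);
    Candidate c;
    c.u = dot(tvec, pvec) * inv_det;   /* candidatehit.data.y */
    c.v = dot(ray.d, qvec) * inv_det;  /* candidatehit.data.z */
    c.t = dot(edge2, qvec) * inv_det;  /* candidatehit.data.x */
    return c;
}
```

For a ray starting at (0.25, 0.25, -1) and pointing along +z at the triangle (0,0,0), (1,0,0), (0,1,0), the sketch yields t = 1 and u = v = 0.25.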
Brook language
additional features
• reductions
– scalar
– stream
• stride & repeat
• GatherOp & ScatterOp
– ScatterOp: a[i] += p
– GatherOp: p = a[i]++
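The two expressions on this slide, `a[i] += p` and `p = a[i]++`, can be emulated sequentially in C to show what each one does to the indexed array. This is a sketch of the semantics only: the function names `scatter_add` and `gather_inc` are hypothetical, and a sequential loop sidesteps the question of how colliding indices are ordered on real hardware.

```c
#include <stddef.h>

/* Read-modify-write scatter: for each stream element k,
   a[idx[k]] += p[k]. Repeated indices accumulate. */
void scatter_add(float *a, const size_t *idx, const float *p, size_t n) {
    for (size_t k = 0; k < n; k++) {
        a[idx[k]] += p[k];
    }
}

/* Fetch-then-update gather: for each stream element k,
   p[k] = a[idx[k]]++. The old value is returned, then the
   array entry is incremented. */
void gather_inc(float *a, const size_t *idx, float *p, size_t n) {
    for (size_t k = 0; k < n; k++) {
        p[k] = a[idx[k]];
        a[idx[k]] += 1.0f;
    }
}
```

Both differ from a plain gather or scatter in that the indexed array is modified in place rather than only read or overwritten.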
brcc compiler
infrastructure
• based on ctool
– http://ctool.sourceforge.net
• parser
– build code tree
– extend C grammar to accept Brook
• convert
– tree transformations
• codegen
– generate Cg & HLSL code
– call cgc, fxc
– generate stub function
Applications
Ray-tracer
FFT
Segmentation
Linear Algebra:
– BLAS, LINPACK, LAPACK
Brook Performance
GPU Gotchas
[Figure: NVIDIA NV3x: Register usage vs. Time. Axes: Registers Used, Time]
GPU Gotchas
NVIDIA:
• Register Penalty
• Render to Texture Limitation
– Requires explicit copy or heavy pbuffer solution
– Superbuffer extension needed
http://mirror.ati.com/developer/SIGGRAPH03/Percy_OpenGL_Extensions SIG03.pdf
GPU Gotchas
ATI Radeon 9800 Pro
• Limited dependent texture lookup
• 96 instructions
• 24-bit floating point
– s16e7
– Integers up to 131,072 (s23e8: 16,777,216)
[Figure: four numbered stages (1 to 4), each labeled "Memory Refs" and "Math Ops"]
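The integer limits quoted for the two float formats follow from the mantissa width. A short C sketch of the arithmetic, assuming each format has an implicit leading one in addition to its explicit mantissa bits and enough exponent range to reach the limit; the function names are hypothetical.

```c
#include <stdint.h>

/* Largest N such that every integer in [0, N] is exactly
   representable: with m explicit mantissa bits plus one implicit
   leading bit, that is 2^(m+1).
   s16e7 (m = 16) gives 2^17; s23e8 (m = 23) gives 2^24. */
uint32_t max_contiguous_int(unsigned mantissa_bits) {
    return (uint32_t)1 << (mantissa_bits + 1);
}

/* Checks the s23e8 case directly against the host's 32-bit float:
   returns 1 if n survives a round trip through float, else 0. */
int is_exact_in_float(uint32_t n) {
    return (uint32_t)(float)n == n;
}
```

Past the limit, consecutive integers start to collide: 16,777,217 does not survive a round trip through a 32-bit float, and by the same reasoning 131,073 would not survive the 24-bit format. This matters when floats are used as array indices, as `idx` is in the ray-triangle kernel.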
GPU Catch-Up!
• Integer & Bit Ops & Double Precision
• Memory Addressing
• CGC/FXC Performance
– Hand code performance critical code
• No native reduction support
• No native scatter support
– p[i] = a (indirect write)
• No programmable blend
– GatherOp / ScatterOp
• Limited 4x4 output
– Brook virtualized kernel outputs
• Readback still slow
– NV35 OpenGL: 600 MB/sec download, 170 MB/sec readback
– ATI DirectX: 550 MB/sec download, 50 MB/sec readback
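Without native reduction support, a sum over a stream has to be built from ordinary element-wise passes. One common way to do this is a multipass pairwise reduction, sketched below in C. This is an illustration of the general technique and not a description of how the Brook runtime implements reductions; `reduce_pass` and `reduce_sum` are hypothetical names, and the two buffers stand in for the textures a GPU would alternate between.

```c
#include <stddef.h>

/* One pass: each output element combines two neighbouring inputs,
   so the pass itself has no dependencies between output elements.
   An odd trailing element is copied through. Returns the new length. */
size_t reduce_pass(const float *in, size_t n, float *out) {
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {
        out[i] = in[2 * i] + in[2 * i + 1];
    }
    if (n % 2 != 0) {
        out[half] = in[n - 1];
        return half + 1;
    }
    return half;
}

/* Repeats passes, swapping source and destination buffers, until a
   single value remains. Overwrites both `data` and `scratch`;
   `scratch` must hold at least (n + 1) / 2 elements. */
float reduce_sum(float *data, float *scratch, size_t n) {
    if (n == 0) {
        return 0.0f;
    }
    float *src = data;
    float *dst = scratch;
    while (n > 1) {
        n = reduce_pass(src, n, dst);
        float *tmp = src;
        src = dst;
        dst = tmp;
    }
    return src[0];
}
```

A stream of n elements needs about log2(n) passes, which is one reason native reduction support appears on the wish list.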
GPUs of the future (we hope)
• Complete Instruction Sets
– Integers, Bit Ops, Doubles, Mem Access
• Integration
– Streaming coprocessor not just a rendering device
• Streaming architectures
[Figure: streaming architecture block diagram. Labels: SDRAM (x4), Stream Register File, ALU Cluster (x3)]
Brook for GPUs
• Release v0.3 available on Sourceforge
• Project Page
– http://graphics.stanford.edu/projects/brook
• Source
– http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Questions?
Fly-fishing fly images from The English Fly Fishing Shop