Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan
February 10th, 2003

Brook: general purpose streaming language
• developed for PCA Program/Merrimac
  – compiler: RStream
    • Reservoir Labs
  – DARPA PCA Program
    • Stanford: SmartMemories
    • UT Austin: TRIPS
    • MIT: RAW
  – Brook version 0.2 spec: http://merrimac.stanford.edu
  – Brook for GPUs: http://brook.sourceforge.net
[Figure: block diagram; labels: Scalar Execution Unit, Stream Execution Unit, Stream Register File, Memory System, Network Interface, Network, DRDRAM]
February 11th, 2004

Brook: general purpose streaming language
• stream programming model
  – enforce data parallel computing
    • streams
  – encourage arithmetic intensity
    • kernels
• C with streams

Brook for GPUs
• demonstrate GPU streaming coprocessor
  – make programming GPUs easier
    • hide texture/pbuffer data management
    • hide graphics-based constructs in Cg/HLSL
    • hide rendering passes
    • virtualize resources
  – performance!
    • … on applications that matter
  – highlight GPU areas for improvement
    • features required for general purpose stream computing

System outline
  .br    Brook source files
  brcc   source-to-source compiler
  brt    Brook run-time library

Brook language: streams
• streams
  – collection of records requiring similar computation
    • particle positions, voxels, FEM cell, …

      float3 positions<200>;
      float3 velocityfield<100,100,100>;

  – encourage data parallelism

Brook language: kernels
• kernels
  – functions applied to streams
    • similar to for_all construct

      kernel void foo (float a<>, float b<>, out float result<>) {
        result = a + b;
      }

      float a<100>;
      float b<100>;
      float c<100>;

      foo(a,b,c);

      /* equivalent C loop */
      for (i=0; i<100; i++)
        c[i] = a[i]+b[i];

  – no dependencies between stream elements
    • encourage high arithmetic intensity

Brook language: kernels
• Ray Triangle Intersection

      kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                       RayState oldraystate<>,
                                       GridTrilist trilist[],
                                       out Hit candidatehit<>) {
        float idx, det, inv_det;
        float3 edge1, edge2, pvec, tvec, qvec;

        if(oldraystate.state.y > 0) {
          idx = trilist[oldraystate.state.w].trinum;
          edge1 = tris[idx].v1 - tris[idx].v0;
          edge2 = tris[idx].v2 - tris[idx].v0;
          pvec = cross(ray.d, edge2);
          det = dot(edge1, pvec);
          inv_det = 1.0f/det;
          tvec = ray.o - tris[idx].v0;
          candidatehit.data.y = dot( tvec, pvec ) * inv_det;
          qvec = cross( tvec, edge1 );
          candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
          candidatehit.data.x = dot( edge2, qvec ) * inv_det;
          candidatehit.data.w = idx;
        } else {
          candidatehit.data = float4(0,0,0,-1);
        }
      }

Brook language: additional features
• reductions
  – scalar
  – stream
• stride & repeat
• GatherOp & ScatterOp
  – a[i] += p
  – p = a[i]++

brcc compiler infrastructure
• based on ctool
  – http://ctool.sourceforge.net
• parser
  – build code tree
  – extend C grammar to accept Brook
• convert
  – tree transformations
• codegen
  – generate Cg & HLSL code
  – call cgc, fxc
  – generate stub function

Applications
• Ray-tracer
• FFT
• Segmentation
• Linear Algebra:
  – BLAS, LINPACK, LAPACK

Brook Performance

GPU Gotchas
[Figure: NVIDIA NV3x: Register usage vs. Time; axes: Registers Used, Time]

GPU Gotchas
NVIDIA:
• Register Penalty
• Render to Texture Limitation
  – Requires explicit copy or heavy pbuffer solution
  – Superbuffer extension needed
    http://mirror.ati.com/developer/SIGGRAPH03/Percy_OpenGL_Extensions SIG03.pdf

GPU Gotchas
ATI Radeon 9800 Pro
• Limited dependent texture lookup
• 96 instructions
• 24-bit floating point
  – s16e7
    Integers up to 131,072 (s23e8: 16,777,216)
[Figure: four numbered stages (1 to 4), each pairing Memory Refs with Math Ops]

GPU Catch-Up!
• Integer & Bit Ops & Double Precision
• Memory Addressing
• CGC/FXC Performance
  – Hand code performance-critical code
• No native reduction support
• No native scatter support
  – p[i] = a (indirect write)
• No programmable blend
  – GatherOp / ScatterOp
• Limited 4x4 output
  – Brook virtualized kernel outputs
• Readback still slow
  – NV35 OpenGL: 600 MB/sec Download, 170 MB/sec Readback
  – ATI DirectX: 550 MB/sec Download, 50 MB/sec Readback

GPUs of the future (we hope)
• Complete Instruction Sets
  – Integers, Bit Ops, Doubles, Mem Access
• Integration
  – Streaming coprocessor, not just a rendering device
• Streaming architectures
[Figure: block diagram; labels: SDRAM (x4), Stream Register File, ALU Cluster (x3)]

Brook for GPUs
• Release v0.3 available on Sourceforge
• Project Page
  – http://graphics.stanford.edu/projects/brook
• Source
  – http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Questions?
Fly-fishing fly images from The English Fly Fishing Shop