Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan GCafe December 10th, 2003 Brook: general purpose streaming language • developed for stanford streaming supercomputing project – architecture: Merrimac Scalar Execution Unit – compiler: RStream • Reservoir Labs Stream Execution Unit Stream Register File Memory System Network Network Interface text text DRDRAM – Center for Turbulence Research – NASA – DARPA PCA Program • Stanford: SmartMemories • UT Austin: TRIPS • MIT: RAW – Brook version 0.2 spec: http://merrimac.stanford.edu December 10th, 2003 2 why graphics hardware? GeForce FX December 10th, 2003 3 why graphics hardware? Pentium 4 SSE theoretical* 3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS GeForce FX 5900 (NV35) fragment shader obtained: MULR R0, R0, R0: 20 GFLOPS equivalent to a 10 GHz P4 GeForce FX and getting faster: 3x improvement over NV30 (6 months) December 10th, 2003 *from Intel P4 Optimization Manual 4 gpu: data parallel & arithmetic intensity – data parallelism • each fragment shaded independently • better alu usage • hide memory latency December 10th, 2003 5 gpu: data parallel & arithmetic intensity – data parallelism • each fragment shaded independently • better alu usage • hide memory latency – arithmetic intensity • compute-to-bandwidth ratio • lots of ops per word transferred • app limited by alu performance, not off-chip bandwidth • more chip real estate for alu’s, not caches December 10th, 2003 64bit fpu 6 Brook: general purpose streaming language • stream programming model – enforce data parallel computing • streams – encourage arithmetic intensity • kernels • C with streams December 10th, 2003 7 Brook for gpus • demonstrate gpu streaming coprocessor – explicit programming abstraction December 10th, 2003 8 Brook for gpus • demonstrate gpu streaming coprocessor – make programming gpus easier • hide texture/pbuffer data management • hide graphics based constructs in CG/HLSL • hide rendering passes • virtualize resources December 10th, 2003 9 Brook for gpus • demonstrate gpu streaming coprocessor – make programming gpus easier • hide texture/pbuffer data management • hide graphics based constructs in CG/HLSL • hide rendering passes • virtualize resources – performance! • … on applications that matter December 10th, 2003 10 Brook for gpus • demonstrate gpu streaming coprocessor – make programming gpus easier • hide texture/pbuffer data management • hide graphics based constructs in CG/HLSL • hide rendering passes • virtualize resources – performance! • … on applications that matter – highlight gpu areas for improvement • features required general purpose stream computing December 10th, 2003 11 system outline .br Brook source files brcc source to source compiler brt Brook run-time library December 10th, 2003 12 Brook language streams • streams – collection of records requiring similar computation • particle positions, voxels, FEM cell, … float3 positions<200>; float3 velocityfield<100,100,100>; December 10th, 2003 13 Brook language streams • streams – collection of records requiring similar computation • particle positions, voxels, FEM cell, … float3 positions<200>; float3 velocityfield<100,100,100>; – similar to arrays, but… • index operations disallowed: position[i] • read/write stream operators streamRead (positions, p_ptr); streamWrite (velocityfield, v_ptr); – encourage data parallelism December 10th, 2003 14 Brook language kernels • kernels – functions applied to streams • similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } December 10th, 2003 15 Brook language kernels • kernels – functions applied to streams • similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } float a<100>; float b<100>; float c<100>; foo(a,b,c); December 10th, 2003 16 Brook language kernels • kernels – functions applied to streams • similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } float a<100>; float b<100>; float c<100>; foo(a,b,c); December 10th, 2003 for (i=0; i<100; i++) c[i] = a[i]+b[i]; 17 Brook language kernels • kernels – functions applied to streams • similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } – no dependencies between stream elements • encourage high arithmetic intensity December 10th, 2003 18 Brook language kernels • kernels arguments – input/output streams kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } a,b: Read-only input streams result: Write-only output stream December 10th, 2003 19 Brook language kernels • kernels arguments – input/output streams – constant parameters kernel void foo (float a<>, float b<>, float t, out float result<>) { result = a + t*b; } float a<100>; float b<100>; float c<100>; foo(a,b,3.2f,c); December 10th, 2003 20 Brook language kernels • kernels arguments – input/output streams – constant paramters – gather streams kernel void foo (float a<>, float b<>, float t, float array[], out float result<>) { result = array[a] + t*b; } float float float float a<100>; b<100>; c<100>; array<25> foo(a,b,3.2f,array,c); December 10th, 2003 gpu bonus 21 Brook language kernels • kernels arguments – – – – input/output streams constant parameters gather streams iterator streams kernel void foo (float a<>, float b<>, float t, float array[], iter float n<>, out float result<>) { result = array[a] + t*b + n; } float a<100>; float b<100>; float c<100>; float array<25> iter float n<100> = iter(0, 10); gpu bonus foo(a,b,3.2f,array,n,c); December 10th, 2003 22 Brook language kernels • example – position update in velocity field kernel void updatepos (float2 pos<>, float2 vel[100][100], float timestep, out float2 newpos<>) { newpos = pos + vel[pos]*timestep; } updatepos (positions, velocityfield, 10.0f, positions); December 10th, 2003 23 Brook language reductions December 10th, 2003 24 Brook language reductions • reductions – compute single value from a stream reduce void sum (float a<>, reduce float r<>) r += a; } float a<100>; float r; sum(a,r); December 10th, 2003 25 Brook language reductions • reductions – compute single value from a stream reduce void sum (float a<>, reduce float r<>) r += a; } float a<100>; float r; sum(a,r); December 10th, 2003 r = a[0]; for (int i=1; i<100; i++) r += a[i]; 26 Brook language reductions • reductions – associative operations only (a+b)+c = a+(b+c) • sum, multiply, max, min, OR, AND, XOR • matrix multiply December 10th, 2003 27 Brook language reductions • multi-dimension reductions – stream “shape” differences resolved by reduce function December 10th, 2003 28 Brook language reductions • multi-dimension reductions – stream “shape” differences resolved by reduce function reduce void sum (float a<>, reduce float r<>) r += a; } float a<20>; float r<5>; sum(a,r); December 10th, 2003 29 Brook language reductions • multi-dimension reductions – stream “shape” differences resolved by reduce function reduce void sum (float a<>, reduce float r<>) r += a; } float a<20>; float r<5>; sum(a,r); December 10th, 2003 for (int i=0; i<5; i++) r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j]; 30 Brook language reductions • multi-dimension reductions – stream “shape” differences resolved by reduce function reduce void sum (float a<>, reduce float r<>) r += a; } float a<20>; float r<5>; sum(a,r); December 10th, 2003 for (int i=0; i<5; i++) r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j]; 31 Brook language stream repeat & stride • kernel arguments of different shape – resolved by repeat and stride December 10th, 2003 32 Brook language stream repeat & stride • kernel arguments of different shape – resolved by repeat and stride kernel void foo (float a<>, float b<>, out float result<>); float a<20>; float b<5>; float c<10>; foo(a,b,c); December 10th, 2003 33 Brook language stream repeat & stride • kernel arguments of different shape – resolved by repeat and stride kernel void foo (float a<>, float b<>, out float result<>); float a<20>; float b<5>; float c<10>; foo(a,b,c); December 10th, 2003 foo(a[0], foo(a[2], foo(a[4], foo(a[6], foo(a[8], foo(a[10], foo(a[12], foo(a[14], foo(a[16], foo(a[18], b[0], b[0], b[1], b[1], b[2], b[2], b[3], b[3], b[4], b[4], c[0]) c[1]) c[2]) c[3]) c[4]) c[5]) c[6]) c[7]) c[8]) c[9]) 34 Brook language matrix vector multiply kernel void mul (float a<>, float b<>, out float result<>) { result = a*b; } reduce void sum (float a<>, reduce float result<>) { result += a; } float float float float matrix<20,10>; vector<1, 10>; tempmv<20,10>; result<20, 1>; mul(matrix,vector,tempmv); sum(tempmv,result); December 10th, 2003 M V V V = T 35 Brook language matrix vector multiply kernel void mul (float a<>, float b<>, out float result<>) { result = a*b; } reduce void sum (float a<>, reduce float result<>) { result += a; } float float float float matrix<20,10>; vector<1, 10>; tempmv<20,10>; result<20, 1>; mul(matrix,vector,tempmv); sum(tempmv,result); December 10th, 2003 T sum R 36 brcc compiler infrastructure December 10th, 2003 37 brcc compiler infrastructure • based on ctool – http://ctool.sourceforge.net • parser – build code tree – extend C grammar to accept Brook • convert – tree transformations • codegen – generate cg & hlsl code – call cgc, fxc – generate stub function December 10th, 2003 38 brcc compiler kernel compilation kernel void updatepos (float2 pos<>, float2 vel[100][100], float timestep, out float2 newpos<>) { newpos = pos + vel[pos]*timestep; } float4 main (uniform float4 _workspace : register (c0), uniform sampler _tex_pos : register (s0), float2 _tex_pos_pos : TEXCOORD0, uniform sampler vel : register (s1), uniform float4 vel_scalebias : register (c1), uniform float timestep : register (c2)) : COLOR0 { float4 _OUT; float2 pos; float2 newpos; pos = tex2D(_tex_pos, _tex_pos_pos).xy; newpos = pos + tex2D(vel,(pos).xy*vel_scalebias.xy+vel_scalebias.zw).xy * timestep; _OUT.x = newpos.x; _OUT.y = newpos.y; _OUT.z = newpos.y; _OUT.w = newpos.y; return _OUT; } December 10th, 2003 39 brcc compiler kernel compilation static const char __updatepos_ps20[] = "ps_2_0 ..... static const char __updatepos_fp30[] = "!!fp30 ..... void updatepos (const __BRTStream& pos, const __BRTStream& vel, const float timestep, const __BRTStream& newpos) { static const void *__updatepos_fp[] = { "fp30", __updatepos_fp30, "ps20", __updatepos_ps20, "cpu", (void *) __updatepos_cpu, "combine", 0, NULL, NULL }; static __BRTKernel k(__updatepos_fp); k->PushStream(pos); k->PushGatherStream(vel); k->PushConstant(timestep); k->PushOutput(newpos); k->Map(); } December 10th, 2003 40 brcc runtime streams December 10th, 2003 41 brt runtime streams • streams separate texture per stream December 10th, 2003 vel texture 1 pos texture 2 42 brt runtime kernels • kernel execution – set stream texture as render target – bind inputs to texture units – issue screen size quad • texture coords provide stream positions vel kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } a foo b result December 10th, 2003 43 brt runtime reductions • reduction execution – multipass execution – associativity required December 10th, 2003 44 research directions • demonstrate gpu streaming coprocessor – compiling Brook to gpus – evaluation – applications December 10th, 2003 45 research directions • applications – linear algebra – image processing – molecular dynamics (gromacs) – – – – – FEM multigrid raytracer volume renderer SIGGRAPH / GH papers December 10th, 2003 46 research directions • virtualize gpu resources – texture size and formats • packing streams to fit in 2D segmented memory space float matrix<8096,10,30,5>; December 10th, 2003 47 research directions • virtualize gpu resources – texture size and formats • support complex formats typedef struct { float3 pos; float3 vel; float mass; } particle; kernel void foo (particle a<>, float timestep, out particle b<>); float a<100>[8][2]; December 10th, 2003 48 research directions • virtualize gpu resources – multiple outputs • simple: let cgc or fxc do dead code elimination kernel void foo (float3 a<>, float3 b<>, …, out float3 x<>, out float3 y<>) kernel void foo1(float3 a<>, float3 b<>, …, out float3 x<>) kernel void foo2(float3 a<>, float3 b<>, …, out float3 y<>) • better: compute intermediates separately December 10th, 2003 49 research directions • virtualize gpu resources – limited instructions per kernel • generalize RDS algorithm for kernels – compute ideal # of passes for intermediate results – hard ??? December 10th, 2003 50 research directions • auto vectorization kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } kernel void foo_faster (float4 a<>, float4 b<>, out float4 result<>) { result = a + b; } December 10th, 2003 51 research directions • Brook v0.2 support – stream operators • stencil, group, domain, repeat, stride, merge, … – building and manipulating data structures • scatterOp a[i] += p • gatherOp p = a[i]++ • gpu primitives December 10th, 2003 52 research directions • gpu areas of improvement – – – – reduction registers texture constraints scatter capabilities programmable blending • gatherOp, scatterOp December 10th, 2003 53 Brook status • team – – – – Jeremy Sugerman Daniel Horn Tim Foley Ian Buck • beta release – December 15th – sourceforge December 10th, 2003 54 Questions? December 10th, 2003 Fly-fishing fly images from The English Fly Fishing Shop 55