Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan
GCafe, December 10th, 2003

Brook: general purpose streaming language
• developed for the Stanford streaming supercomputing project
  – architecture: Merrimac
  – compiler: RStream
    • Reservoir Labs
  – Center for Turbulence Research
  – NASA
  – DARPA PCA Program
    • Stanford: SmartMemories
    • UT Austin: TRIPS
    • MIT: RAW
  – Brook version 0.2 spec: http://merrimac.stanford.edu
[Diagram labels: Scalar Execution Unit, Stream Execution Unit, Stream Register File, Memory System, Network Interface, Network, DRDRAM]

why graphics hardware?
[Image: GeForce FX]
• Pentium 4 SSE, theoretical*
      3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS
• GeForce FX 5900 (NV35) fragment shader, obtained
      MULR R0, R0, R0: 20 GFLOPS
      equivalent to a 10 GHz P4
• and getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual

gpu: data parallel & arithmetic intensity
– data parallelism
  • each fragment shaded independently
  • better alu usage
  • hide memory latency
– arithmetic intensity
  • compute-to-bandwidth ratio
  • lots of ops per word transferred
  • app limited by alu performance, not off-chip bandwidth
  • more chip real estate for alu’s, not caches
[Image label: 64bit fpu]

Brook: general purpose streaming language
• stream programming model
  – enforce data parallel computing
    • streams
  – encourage arithmetic intensity
    • kernels
• C with streams

Brook for gpus
• demonstrate gpu streaming coprocessor
  – explicit programming abstraction
  – make programming gpus easier
    • hide texture/pbuffer data management
    • hide graphics based constructs in CG/HLSL
    • hide rendering passes
    • virtualize resources
  – performance!
    • … on applications that matter
  – highlight gpu areas for improvement
    • features required for general purpose stream computing

system outline
• .br: Brook source files
• brcc: source to source compiler
• brt: Brook run-time library

Brook language: streams
• streams
  – collection of records requiring similar computation
    • particle positions, voxels, FEM cell, …

      float3 positions<200>;
      float3 velocityfield<100,100,100>;

  – similar to arrays, but…
    • index operations disallowed: position[i]
    • read/write stream operators

      streamRead (positions, p_ptr);
      streamWrite (velocityfield, v_ptr);

  – encourage data parallelism

Brook language: kernels
• kernels
  – functions applied to streams
    • similar to for_all construct

      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

      float a<100>;
      float b<100>;
      float c<100>;

      foo(a,b,c);

    equivalent C loop:

      for (i=0; i<100; i++)
        c[i] = a[i]+b[i];

  – no dependencies between stream elements
    • encourage high arithmetic intensity

• kernel arguments
  – input/output streams

      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

      a, b: read-only input streams
      result: write-only output stream

  – constant parameters

      kernel void foo (float a<>, float b<>,
                       float t, out float result<>) {
        result = a + t*b;
      }

      float a<100>;
      float b<100>;
      float c<100>;

      foo(a,b,3.2f,c);

  – gather streams

      kernel void foo (float a<>, float b<>,
                       float t, float array[],
                       out float result<>) {
        result = array[a] + t*b;
      }

      float a<100>;
      float b<100>;
      float c<100>;
      float array<25>;

      foo(a,b,3.2f,array,c);

    [Callout: gpu bonus]

  – iterator streams

      kernel void foo (float a<>, float b<>,
                       float t, float array[],
                       iter float n<>,
                       out float result<>) {
        result = array[a] + t*b + n;
      }

      float a<100>;
      float b<100>;
      float c<100>;
      float array<25>;
      iter float n<100> = iter(0, 10);

      foo(a,b,3.2f,array,n,c);

    [Callout: gpu bonus]

• example
  – position update in velocity field

      kernel void updatepos (float2 pos<>,
                             float2 vel[100][100],
                             float timestep,
                             out float2 newpos<>) {
        newpos = pos + vel[pos]*timestep;
      }

      updatepos (positions, velocityfield, 10.0f, positions);
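The four kinds of kernel argument listed above (input streams, constant parameters, gather streams, iterator streams) can be mimicked on the CPU to see what each one contributes per element. The following is only a sketch in plain Python, not Brook and not the Brook runtime: lists stand in for streams, and `iter_stream` and this Python `foo` are hypothetical helpers. It assumes that `iter(0, 10)` over 100 elements yields evenly spaced values starting at 0 and stopping short of 10.

```python
# CPU sketch of the kernel-argument kinds; plain lists stand in for streams.
# None of these names belong to Brook; they are illustrative only.

def iter_stream(start, stop, n):
    """Hypothetical stand-in for an iterator stream.

    Assumption: n evenly spaced values beginning at start and ending before stop.
    """
    step = (stop - start) / n
    return [start + i * step for i in range(n)]


def foo(a, b, t, array, n):
    """Emulate: result = array[a] + t*b + n, applied to each element independently.

    a, b  : input streams (one value consumed per output element)
    t     : constant parameter (same value for every element)
    array : gather stream (indexed freely, here by the value read from a)
    n     : iterator stream (generated values, one per output element)
    """
    assert len(a) == len(b) == len(n)
    # Each output depends only on the inputs at the same position,
    # so the elements could be computed in any order or in parallel.
    return [array[int(ai)] + t * bi + ni for ai, bi, ni in zip(a, b, n)]


a = [0, 1, 2, 3]
b = [1, 1, 1, 1]
array = [10, 20, 30, 40]
n = iter_stream(0, 4, 4)
c = foo(a, b, 2.0, array, n)
```

Because no element of `c` reads another element of `c`, the list comprehension is a direct analogue of the `for` loop shown on the kernels slide.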
Brook language: reductions
• reductions
  – compute single value from a stream

      reduce void sum (float a<>,
                       reduce float r<>) {
        r += a;
      }

      float a<100>;
      float r;

      sum(a,r);

    equivalent C loop:

      r = a[0];
      for (int i=1; i<100; i++)
        r += a[i];

  – associative operations only
      (a+b)+c = a+(b+c)
    • sum, multiply, max, min, OR, AND, XOR
    • matrix multiply

• multi-dimension reductions
  – stream “shape” differences resolved by reduce function

      reduce void sum (float a<>,
                       reduce float r<>) {
        r += a;
      }

      float a<20>;
      float r<5>;

      sum(a,r);

    equivalent C loop:

      for (int i=0; i<5; i++) {
        r[i] = a[i*4];
        for (int j=1; j<4; j++)
          r[i] += a[i*4 + j];
      }

Brook language: stream repeat & stride
• kernel arguments of different shape
  – resolved by repeat and stride

      kernel void foo (float a<>, float b<>,
                       out float result<>);

      float a<20>;
      float b<5>;
      float c<10>;

      foo(a,b,c);

    expands to:

      foo(a[0],  b[0], c[0])
      foo(a[2],  b[0], c[1])
      foo(a[4],  b[1], c[2])
      foo(a[6],  b[1], c[3])
      foo(a[8],  b[2], c[4])
      foo(a[10], b[2], c[5])
      foo(a[12], b[3], c[6])
      foo(a[14], b[3], c[7])
      foo(a[16], b[4], c[8])
      foo(a[18], b[4], c[9])

Brook language: matrix vector multiply

      kernel void mul (float a<>, float b<>,
                       out float result<>) {
        result = a*b;
      }

      reduce void sum (float a<>,
                       reduce float result<>) {
        result += a;
      }

      float matrix<20,10>;
      float vector<1, 10>;
      float tempmv<20,10>;
      float result<20, 1>;

      mul(matrix,vector,tempmv);
      sum(tempmv,result);

[Diagrams: M combined with repeated V gives T; sum reduces T to R]

brcc compiler: infrastructure
• based on ctool
  – http://ctool.sourceforge.net
• parser
  – build code tree
  – extend C grammar to accept Brook
• convert
  – tree transformations
• codegen
  – generate cg & hlsl code
  – call cgc, fxc
  – generate stub function

brcc compiler: kernel compilation

      kernel void updatepos (float2 pos<>,
                             float2 vel[100][100],
                             float timestep,
                             out float2 newpos<>) {
        newpos = pos + vel[pos]*timestep;
      }

    compiled output:

      float4 main (uniform float4 _workspace    : register (c0),
                   uniform sampler _tex_pos     : register (s0),
                   float2 _tex_pos_pos          : TEXCOORD0,
                   uniform sampler vel          : register (s1),
                   uniform float4 vel_scalebias : register (c1),
                   uniform float timestep       : register (c2)) : COLOR0
      {
        float4 _OUT;
        float2 pos;
        float2 newpos;

        pos = tex2D(_tex_pos, _tex_pos_pos).xy;
        newpos = pos + tex2D(vel,(pos).xy*vel_scalebias.xy+vel_scalebias.zw).xy * timestep;
        _OUT.x = newpos.x;
        _OUT.y = newpos.y;
        _OUT.z = newpos.y;
        _OUT.w = newpos.y;
        return _OUT;
      }

    generated stub function:

      static const char __updatepos_ps20[] = "ps_2_0 .....
      static const char __updatepos_fp30[] = "!!fp30 .....

      void updatepos (const __BRTStream& pos,
                      const __BRTStream& vel,
                      const float timestep,
                      const __BRTStream& newpos) {
        static const void *__updatepos_fp[] = {
          "fp30", __updatepos_fp30,
          "ps20", __updatepos_ps20,
          "cpu", (void *) __updatepos_cpu,
          "combine", 0,
          NULL, NULL };
        static __BRTKernel k(__updatepos_fp);
        k->PushStream(pos);
        k->PushGatherStream(vel);
        k->PushConstant(timestep);
        k->PushOutput(newpos);
        k->Map();
      }

brt runtime: streams
• streams
  – separate texture per stream
[Diagram: vel in texture 1, pos in texture 2]

brt runtime: kernels
• kernel execution
  – set stream texture as render target
  – bind inputs to texture units
  – issue screen size quad
    • texture coords provide stream positions

      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

[Diagram: a and b feed foo, which writes result]

brt runtime: reductions
• reduction execution
  – multipass execution
  – associativity required

research directions
• demonstrate gpu streaming coprocessor
  – compiling Brook to gpus
  – evaluation
  – applications

• applications
  – linear algebra
  – image processing
  – molecular dynamics (gromacs)
  – FEM
  – multigrid
  – raytracer
  – volume renderer
  – SIGGRAPH / GH papers

• virtualize gpu resources
  – texture size and formats
    • packing streams to fit in 2D segmented memory space

      float matrix<8096,10,30,5>;

    • support complex formats

      typedef struct {
        float3 pos;
        float3 vel;
        float mass;
      } particle;

      kernel void foo (particle a<>,
                       float timestep,
                       out particle b<>);

      float a<100>[8][2];

  – multiple outputs
    • simple: let cgc or fxc do dead code elimination

      kernel void foo (float3 a<>, float3 b<>, …,
                       out float3 x<>, out float3 y<>)

      kernel void foo1(float3 a<>, float3 b<>, …,
                       out float3 x<>)
      kernel void foo2(float3 a<>, float3 b<>, …,
                       out float3 y<>)

    • better: compute intermediates separately
  – limited instructions per kernel
    • generalize RDS algorithm for kernels
      – compute ideal # of passes for intermediate results
      – hard ???

• auto vectorization

      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

      kernel void foo_faster (float4 a<>, float4 b<>,
                              out float4 result<>) {
        result = a + b;
      }

• Brook v0.2 support
  – stream operators
    • stencil, group, domain, repeat, stride, merge, …
  – building and manipulating data structures
    • scatterOp   a[i] += p
    • gatherOp    p = a[i]++
    • gpu primitives

• gpu areas of improvement
  – reduction registers
  – texture constraints
  – scatter capabilities
  – programmable blending
    • gatherOp, scatterOp

Brook status
• team
  – Jeremy Sugerman
  – Daniel Horn
  – Tim Foley
  – Ian Buck
• beta release
  – December 15th
  – sourceforge

Questions?

Fly-fishing fly images from The English Fly Fishing Shop
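The shape rules from the reduction, repeat & stride, and matrix vector multiply slides can be checked with a small CPU model. This is a sketch in plain Python under stated assumptions, not the Brook runtime: `reduce_sum`, `resize`, `map_kernel`, and `matvec` are hypothetical names, streams are one-dimensional lists (or lists of rows for the matrix), and lengths are assumed to divide evenly, as they do in the slide examples.

```python
# CPU model of stream shape matching; all function names are hypothetical.

def reduce_sum(a, out_len):
    """Reduce stream a to out_len elements.

    Each output element sums len(a) // out_len consecutive inputs,
    matching the nested loop shown for a<20> reduced into r<5>.
    """
    assert len(a) % out_len == 0
    group = len(a) // out_len
    return [sum(a[i * group:(i + 1) * group]) for i in range(out_len)]


def resize(stream, out_len):
    """Match a stream to the output length by stride or repeat."""
    n = len(stream)
    if n >= out_len:
        assert n % out_len == 0
        return stream[::n // out_len]               # stride: keep every k-th element
    assert out_len % n == 0
    rep = out_len // n
    return [x for x in stream for _ in range(rep)]  # repeat: each element rep times


def map_kernel(fn, out_len, *streams):
    """Apply fn elementwise after resizing every input to the output shape."""
    resized = [resize(s, out_len) for s in streams]
    return [fn(*args) for args in zip(*resized)]


def matvec(matrix, vector):
    """Matrix vector multiply as mul followed by sum.

    mul: the vector is repeated for every row and multiplied elementwise.
    sum: each row of the temporary is reduced to a single value.
    """
    temp = [[m * v for m, v in zip(row, vector)] for row in matrix]
    return [sum(row) for row in temp]


# Index pairs chosen for a<20>, b<5> when the output is c<10>.
pairs = map_kernel(lambda x, y: (x, y), 10, list(range(20)), list(range(5)))
```

With inputs `0..19` and `0..4`, `pairs` starts `(0, 0), (2, 0), (4, 1)`, which is the same pairing as the `foo(a[0], b[0], c[0])`, `foo(a[2], b[0], c[1])`, `foo(a[4], b[1], c[2])` expansion on the repeat & stride slide.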
