Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman

advertisement
Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman
Pat Hanrahan
GCafe December 10th, 2003
Brook: general purpose streaming language
• developed for stanford streaming supercomputing project
– architecture: Merrimac
Scalar
Execution
Unit
– compiler: RStream
• Reservoir Labs
Stream
Execution Unit
Stream
Register File
Memory
System
Network
Network
Interface
text
text
DRDRAM
– Center for Turbulence Research
– NASA
– DARPA PCA Program
• Stanford: SmartMemories
• UT Austin: TRIPS
• MIT: RAW
– Brook version 0.2 spec: http://merrimac.stanford.edu
December 10th, 2003
2
why graphics hardware?
GeForce FX
December 10th, 2003
3
why graphics hardware?
Pentium 4 SSE theoretical*
3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS
GeForce FX 5900 (NV35) fragment shader obtained:
MULR R0, R0, R0:
20 GFLOPS
equivalent to a 10 GHz P4
GeForce FX
and getting faster: 3x improvement over NV30 (6 months)
December 10th, 2003
*from Intel P4 Optimization Manual
4
gpu: data parallel & arithmetic intensity
– data parallelism
• each fragment shaded independently
• better alu usage
• hide memory latency
December 10th, 2003
5
gpu: data parallel & arithmetic intensity
– data parallelism
• each fragment shaded independently
• better alu usage
• hide memory latency
– arithmetic intensity
• compute-to-bandwidth ratio
• lots of ops per word transferred
• app limited by alu performance, not
off-chip bandwidth
• more chip real estate for alu’s, not
caches
December 10th, 2003
64bit fpu
6
Brook: general purpose streaming language
• stream programming model
– enforce data parallel computing
• streams
– encourage arithmetic intensity
• kernels
• C with streams
December 10th, 2003
7
Brook for gpus
• demonstrate gpu streaming coprocessor
– explicit programming abstraction
December 10th, 2003
8
Brook for gpus
• demonstrate gpu streaming coprocessor
– make programming gpus easier
• hide texture/pbuffer data management
• hide graphics based constructs in CG/HLSL
• hide rendering passes
• virtualize resources
December 10th, 2003
9
Brook for gpus
• demonstrate gpu streaming coprocessor
– make programming gpus easier
• hide texture/pbuffer data management
• hide graphics based constructs in CG/HLSL
• hide rendering passes
• virtualize resources
– performance!
• … on applications that matter
December 10th, 2003
10
Brook for gpus
• demonstrate gpu streaming coprocessor
– make programming gpus easier
• hide texture/pbuffer data management
• hide graphics based constructs in CG/HLSL
• hide rendering passes
• virtualize resources
– performance!
• … on applications that matter
– highlight gpu areas for improvement
• features required general purpose stream
computing
December 10th, 2003
11
system outline
.br
Brook source files
brcc
source to source compiler
brt
Brook run-time library
December 10th, 2003
12
Brook language
streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …
float3 positions<200>;
float3 velocityfield<100,100,100>;
December 10th, 2003
13
Brook language
streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …
float3 positions<200>;
float3 velocityfield<100,100,100>;
– similar to arrays, but…
• index operations disallowed: position[i]
• read/write stream operators
streamRead (positions, p_ptr);
streamWrite (velocityfield, v_ptr);
– encourage data parallelism
December 10th, 2003
14
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
December 10th, 2003
15
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);
December 10th, 2003
16
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);
December 10th, 2003
for (i=0; i<100; i++)
c[i] = a[i]+b[i];
17
Brook language
kernels
• kernels
– functions applied to streams
• similar to for_all construct
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
– no dependencies between stream elements
• encourage high arithmetic intensity
December 10th, 2003
18
Brook language
kernels
• kernels arguments
– input/output streams
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
a,b:
Read-only input streams
result:
Write-only output stream
December 10th, 2003
19
Brook language
kernels
• kernels arguments
– input/output streams
– constant parameters
kernel void foo (float a<>, float b<>,
float t,
out float result<>) {
result = a + t*b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,3.2f,c);
December 10th, 2003
20
Brook language
kernels
• kernels arguments
– input/output streams
– constant paramters
– gather streams
kernel void foo (float a<>, float b<>,
float t, float array[],
out float result<>) {
result = array[a] + t*b;
}
float
float
float
float
a<100>;
b<100>;
c<100>;
array<25>
foo(a,b,3.2f,array,c);
December 10th, 2003
gpu
bonus
21
Brook language
kernels
• kernels arguments
–
–
–
–
input/output streams
constant parameters
gather streams
iterator streams
kernel void foo (float a<>, float b<>,
float t, float array[],
iter float n<>,
out float result<>) {
result = array[a] + t*b + n;
}
float a<100>;
float b<100>;
float c<100>;
float array<25>
iter float n<100> = iter(0, 10);
gpu
bonus
foo(a,b,3.2f,array,n,c);
December 10th, 2003
22
Brook language
kernels
• example
– position update in velocity field
kernel void updatepos (float2 pos<>,
float2 vel[100][100],
float timestep,
out float2 newpos<>) {
newpos = pos + vel[pos]*timestep;
}
updatepos (positions, velocityfield,
10.0f, positions);
December 10th, 2003
23
Brook language
reductions
December 10th, 2003
24
Brook language
reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>,
reduce float r<>)
r += a;
}
float a<100>;
float r;
sum(a,r);
December 10th, 2003
25
Brook language
reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>,
reduce float r<>)
r += a;
}
float a<100>;
float r;
sum(a,r);
December 10th, 2003
r = a[0];
for (int i=1; i<100; i++)
r += a[i];
26
Brook language
reductions
• reductions
– associative operations only
(a+b)+c = a+(b+c)
• sum, multiply, max, min, OR, AND, XOR
• matrix multiply
December 10th, 2003
27
Brook language
reductions
• multi-dimension reductions
– stream “shape” differences resolved by reduce
function
December 10th, 2003
28
Brook language
reductions
• multi-dimension reductions
– stream “shape” differences resolved by reduce
function
reduce void sum (float a<>,
reduce float r<>)
r += a;
}
float a<20>;
float r<5>;
sum(a,r);
December 10th, 2003
29
Brook language
reductions
• multi-dimension reductions
– stream “shape” differences resolved by reduce
function
reduce void sum (float a<>,
reduce float r<>)
r += a;
}
float a<20>;
float r<5>;
sum(a,r);
December 10th, 2003
for (int i=0; i<5; i++)
r[i] = a[i*4];
for (int j=1; j<4; j++)
r[i] += a[i*4 + j];
30
Brook language
reductions
• multi-dimension reductions
– stream “shape” differences resolved by reduce
function
reduce void sum (float a<>,
reduce float r<>)
r += a;
}
float a<20>;
float r<5>;
sum(a,r);
December 10th, 2003
for (int i=0; i<5; i++)
r[i] = a[i*4];
for (int j=1; j<4; j++)
r[i] += a[i*4 + j];
31
Brook language
stream repeat & stride
• kernel arguments of different shape
– resolved by repeat and stride
December 10th, 2003
32
Brook language
stream repeat & stride
• kernel arguments of different shape
– resolved by repeat and stride
kernel void foo (float a<>, float b<>,
out float result<>);
float a<20>;
float b<5>;
float c<10>;
foo(a,b,c);
December 10th, 2003
33
Brook language
stream repeat & stride
• kernel arguments of different shape
– resolved by repeat and stride
kernel void foo (float a<>, float b<>,
out float result<>);
float a<20>;
float b<5>;
float c<10>;
foo(a,b,c);
December 10th, 2003
foo(a[0],
foo(a[2],
foo(a[4],
foo(a[6],
foo(a[8],
foo(a[10],
foo(a[12],
foo(a[14],
foo(a[16],
foo(a[18],
b[0],
b[0],
b[1],
b[1],
b[2],
b[2],
b[3],
b[3],
b[4],
b[4],
c[0])
c[1])
c[2])
c[3])
c[4])
c[5])
c[6])
c[7])
c[8])
c[9])
34
Brook language
matrix vector multiply
kernel void mul (float a<>, float b<>,
out float result<>) {
result = a*b;
}
reduce void sum (float a<>,
reduce float result<>) {
result += a;
}
float
float
float
float
matrix<20,10>;
vector<1, 10>;
tempmv<20,10>;
result<20, 1>;
mul(matrix,vector,tempmv);
sum(tempmv,result);
December 10th, 2003
M
V
V
V
=
T
35
Brook language
matrix vector multiply
kernel void mul (float a<>, float b<>,
out float result<>) {
result = a*b;
}
reduce void sum (float a<>,
reduce float result<>) {
result += a;
}
float
float
float
float
matrix<20,10>;
vector<1, 10>;
tempmv<20,10>;
result<20, 1>;
mul(matrix,vector,tempmv);
sum(tempmv,result);
December 10th, 2003
T
sum
R
36
brcc compiler
infrastructure
December 10th, 2003
37
brcc compiler
infrastructure
• based on ctool
– http://ctool.sourceforge.net
• parser
– build code tree
– extend C grammar to accept Brook
• convert
– tree transformations
• codegen
– generate cg & hlsl code
– call cgc, fxc
– generate stub function
December 10th, 2003
38
brcc compiler
kernel compilation
kernel void updatepos (float2 pos<>,
float2 vel[100][100],
float timestep,
out float2 newpos<>) {
newpos = pos + vel[pos]*timestep;
}
float4 main (uniform float4 _workspace
: register (c0),
uniform sampler _tex_pos
: register (s0),
float2 _tex_pos_pos
: TEXCOORD0,
uniform sampler vel
: register (s1),
uniform float4 vel_scalebias : register (c1),
uniform float timestep
: register (c2)) : COLOR0 {
float4 _OUT; float2 pos; float2 newpos;
pos = tex2D(_tex_pos, _tex_pos_pos).xy;
newpos = pos
+ tex2D(vel,(pos).xy*vel_scalebias.xy+vel_scalebias.zw).xy
* timestep;
_OUT.x = newpos.x; _OUT.y = newpos.y;
_OUT.z = newpos.y; _OUT.w = newpos.y;
return _OUT;
}
December 10th, 2003
39
brcc compiler
kernel compilation
static const char __updatepos_ps20[] = "ps_2_0 .....
static const char __updatepos_fp30[] = "!!fp30 .....
void
updatepos (const __BRTStream& pos,
const __BRTStream& vel,
const float timestep,
const __BRTStream& newpos) {
static const void *__updatepos_fp[] = {
"fp30", __updatepos_fp30,
"ps20", __updatepos_ps20,
"cpu", (void *) __updatepos_cpu,
"combine", 0,
NULL, NULL };
static __BRTKernel k(__updatepos_fp);
k->PushStream(pos);
k->PushGatherStream(vel);
k->PushConstant(timestep);
k->PushOutput(newpos);
k->Map();
}
December 10th, 2003
40
brcc runtime
streams
December 10th, 2003
41
brt runtime
streams
• streams
separate texture per stream
December 10th, 2003
vel
texture 1
pos
texture 2
42
brt runtime
kernels
• kernel execution
– set stream texture as render target
– bind inputs to texture units
– issue screen size quad
• texture coords provide stream positions
vel
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
a
foo
b
result
December 10th, 2003
43
brt runtime
reductions
• reduction execution
– multipass execution
– associativity required
December 10th, 2003
44
research directions
• demonstrate gpu streaming coprocessor
– compiling Brook to gpus
– evaluation
– applications
December 10th, 2003
45
research directions
• applications
– linear algebra
– image processing
– molecular dynamics (gromacs)
–
–
–
–
–
FEM
multigrid
raytracer
volume renderer
SIGGRAPH / GH papers
December 10th, 2003
46
research directions
• virtualize gpu resources
– texture size and formats
• packing streams to fit in 2D segmented memory space
float matrix<8096,10,30,5>;
December 10th, 2003
47
research directions
• virtualize gpu resources
– texture size and formats
• support complex formats
typedef struct {
float3 pos;
float3 vel;
float mass;
} particle;
kernel void foo (particle a<>,
float timestep,
out particle b<>);
float a<100>[8][2];
December 10th, 2003
48
research directions
• virtualize gpu resources
– multiple outputs
• simple: let cgc or fxc do dead code elimination
kernel void foo (float3 a<>,
float3 b<>, …,
out float3 x<>,
out float3 y<>)
kernel void foo1(float3 a<>,
float3 b<>, …,
out float3 x<>)
kernel void foo2(float3 a<>,
float3 b<>, …,
out float3 y<>)
• better: compute intermediates separately
December 10th, 2003
49
research directions
• virtualize gpu resources
– limited instructions per kernel
• generalize RDS algorithm for kernels
– compute ideal # of passes for intermediate results
– hard ???
December 10th, 2003
50
research directions
• auto vectorization
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
kernel void foo_faster (float4 a<>, float4 b<>,
out float4 result<>) {
result = a + b;
}
December 10th, 2003
51
research directions
• Brook v0.2 support
– stream operators
• stencil, group, domain, repeat, stride, merge, …
– building and manipulating data structures
• scatterOp a[i] += p
• gatherOp p = a[i]++
• gpu primitives
December 10th, 2003
52
research directions
• gpu areas of improvement
–
–
–
–
reduction registers
texture constraints
scatter capabilities
programmable blending
• gatherOp, scatterOp
December 10th, 2003
53
Brook status
• team
–
–
–
–
Jeremy Sugerman
Daniel Horn
Tim Foley
Ian Buck
• beta release
– December 15th
– sourceforge
December 10th, 2003
54
Questions?
December 10th, 2003
Fly-fishing fly images from The English Fly Fishing Shop
55
Download