GPGPU Toolkit
SlabOps
SlabOps were created by Mark Harris (UNC, NVIDIA)
Main Issue with GPU Programming
- The main issue is not writing the code for the graphics card
- The main issue is interfacing with the graphics card
Issues with Interfacing with GPUs
1. You forget to do something
   1. Forget to initialize FBOs
   2. Forget to enable the Cg program
   3. Forget to set the viewport correctly
   4. …
2. Graphics-based GPGPU algorithms are hacks
   1. You're rendering a quad to perform an algorithm on an array
3. It's not object oriented
Using SlabOps
- The GPGPU methods covered previously are fine for running one or two programs
- What about managing ten or twenty programs performing hundreds of passes?
- SlabOps to the rescue!
Using SlabOps
- SlabOps were created by Mark Harris while getting his PhD at the University of North Carolina.
- He used them in his GPU fluid simulator to manage the large number of fragment programs required for each pass.
Using SlabOps
3 Parts
1. Define
   1. Define the type of SlabOp that you need (more on this later)
2. Initialization
   1. Initialize the program to load
   2. Initialize the parameters to connect
   3. Initialize the output
3. Run
   1. Update any parameters that might have changed
   2. Call Compute() to run the program
Initialization
void initSlabOps() {
    // Load the program
    g_addMatrixfp.InitializeFP(cgContext, "addMatrix.cg", "main");
    // Set the texture parameters
    g_addMatrixfp.SetTextureParameter("tex1", inputYTexID);
    g_addMatrixfp.SetTextureParameter("tex2", inputXTexID);
    // Set the texture coordinates and output rectangle
    g_addMatrixfp.SetTexCoordRect( 0,0, texSizeX, texSizeY);
    g_addMatrixfp.SetSlabRect( 0,0, texSizeX, texSizeY);
    // Set the output texture
    g_addMatrixfp.SetOutputTexture(outputTexID, texSizeX, texSizeY,
                                   textureParameters.texTarget, GL_COLOR_ATTACHMENT0_EXT);
}
Run
g_addMatrixfp.Compute();
One line to run the program:
- Sets the variables
- Enables the program
- Sets the viewport
- Builds the geometry to perform the processing
- Performs the computation
- Gets the output into the buffer or texture
- Disables the program
- Resets the viewport
Comparing Saxpy (SlabOp)
SlabOp
// Do calculations
for(int i = 0; i < numIterations; i++) {
    g_saxpyfp.SetTextureParameter("textureY", yTexID[readTex]);
    g_saxpyfp.SetOutputTexture(yTexID[writeTex], texSize, texSize,
                               textureParameters.texTarget, attachmentpoints[writeTex]);
    g_saxpyfp.Compute();
    swap();
}
Comparing Saxpy (Non-SlabOp 1)
// attach two textures to FBO
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, attachmentpoints[writeTex],
                          textureParameters.texTarget, yTexID[writeTex], 0);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, attachmentpoints[readTex],
                          textureParameters.texTarget, yTexID[readTex], 0);
// check if that worked
if (!checkFramebufferStatus()) {
printf("glFramebufferTexture2DEXT():\t [FAIL]\n");
PAUSE();
exit (ERROR_FBOTEXTURE);
} else if (mode == 0) {
printf("glFramebufferTexture2DEXT():\t [PASS]\n");
}
// enable fragment profile
cgGLEnableProfile(fragmentProfile);
// bind saxpy program
cgGLBindProgram(fragmentProgram);
// enable texture x (read-only, not changed during the iteration)
cgGLSetTextureParameter(xParam, xTexID);
cgGLEnableTextureParameter(xParam);
// enable scalar alpha (same)
cgSetParameter1f(alphaParam, alpha);
// Calling glFinish() is only necessary to get accurate timings,
// and we need a high number of iterations to avoid timing noise.
glFinish();
Comparing Saxpy (Non-SlabOp 2)
for (int i=0; i<numIterations; i++) {
    // set render destination
    glDrawBuffer (attachmentpoints[writeTex]);
    // enable texture y_old (read-only)
    cgGLSetTextureParameter(yParam, yTexID[readTex]);
    cgGLEnableTextureParameter(yParam);
    // and render multitextured viewport-sized quad
    // depending on the texture target, switch between
    // normalised ([0,1]^2) and unnormalised ([0,w]x[0,h])
    // texture coordinates
    // make quad filled to hit every pixel/texel
    // (should be default but we never know)
    glPolygonMode(GL_FRONT,GL_FILL);
    // and render the quad
    if (textureParameters.texTarget == GL_TEXTURE_2D) {
        // render with normalized texcoords
        glBegin(GL_QUADS);
        glTexCoord2f(0.0, 0.0);
        glVertex2f(0.0, 0.0);
        glTexCoord2f(1.0, 0.0);
        glVertex2f(texSize, 0.0);
        glTexCoord2f(1.0, 1.0);
        glVertex2f(texSize, texSize);
        glTexCoord2f(0.0, 1.0);
        glVertex2f(0.0, texSize);
        glEnd();
    } else {
        // render with unnormalized texcoords
        glBegin(GL_QUADS);
        glTexCoord2f(0.0, 0.0);
        glVertex2f(0.0, 0.0);
        glTexCoord2f(texSize, 0.0);
        glVertex2f(texSize, 0.0);
        glTexCoord2f(texSize, texSize);
        glVertex2f(texSize, texSize);
        glTexCoord2f(0.0, texSize);
        glVertex2f(0.0, texSize);
        glEnd();
    }
    // swap role of the two textures (read-only source becomes
    // write-only target and the other way round):
    swap();
}
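Both Saxpy versions rely on the same ping-pong idea: read from one copy of y, write to the other, then swap roles. The following is a minimal CPU-only sketch of that pattern, assuming plain std::vector buffers stand in for the two textures. The struct PingPongSaxpy and its members are hypothetical names for illustration and are not part of the SlabOp code.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// CPU stand-in for the two y textures: one buffer is read while the other
// is written, then swap() exchanges their roles (hypothetical sketch).
struct PingPongSaxpy {
    std::vector<float> x;                    // read-only input, like xTexID
    std::array<std::vector<float>, 2> y;     // the two ping-pong buffers
    int readTex;
    int writeTex;

    // Assumes xs and ys have the same length.
    PingPongSaxpy(const std::vector<float>& xs, const std::vector<float>& ys)
        : x(xs), y{{ys, std::vector<float>(ys.size())}}, readTex(0), writeTex(1) {}

    // The read-only source becomes the write target and vice versa.
    void swap() { std::swap(readTex, writeTex); }

    // Each iteration computes y_new = y_old + alpha * x into the write buffer.
    void iterate(float alpha, int numIterations) {
        for (int i = 0; i < numIterations; ++i) {
            const std::vector<float>& src = y[readTex];
            std::vector<float>& dst = y[writeTex];
            for (std::size_t k = 0; k < x.size(); ++k)
                dst[k] = src[k] + alpha * x[k];
            swap();
        }
    }

    // After the final swap, the latest result sits in the read buffer.
    const std::vector<float>& result() const { return y[readTex]; }
};
```

Starting from x = (1, 2, 3) and y = (0, 0, 0), calling iterate(2.0f, 3) leaves result() holding 6 times x.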
Comparing Saxpy
- OK, that looked a little worse than it really is
- But… using SlabOps did look a little easier
- Saxpy only had one program being run for multiple iterations.
- What about something more complicated…
- Fluid Flow
Fluids
- Follow Stam's method
  - "Fast Fluid Dynamics Simulation on the GPU", Mark Harris. In GPU Gems.
- We're not going to cover how to do fluids so much as the program flow and how SlabOps help contain the problem
1. Advection
2. Impulse
3. Vorticity Confinement
4. Viscous Diffusion
5. Project Divergent Velocity
   1. Compute Divergence
   2. Compute Pressure Disturbances
   3. Subtract gradient(p) from u
6. Display
Let's not forget Boundary Conditions
- Boundaries and interior are computed in separate passes and may require separate programs
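To make the program flow concrete, here is a hedged sketch that only records the order in which passes would be issued for one timestep. It assumes that viscous diffusion and the pressure computation are the two iterative solves, and it leaves out boundary passes entirely. The function fluidTimestep and the pass names are hypothetical; in a real solver each recorded entry would be one SlabOp Compute() call.

```cpp
#include <string>
#include <vector>

// Records the pass order for one timestep (illustration only, no GPU work).
// Assumption: diffusion and pressure are each solved by repeated passes.
std::vector<std::string> fluidTimestep(int solverIterations)
{
    std::vector<std::string> passes;
    auto run = [&passes](const std::string& name) { passes.push_back(name); };

    run("advection");
    run("impulse");
    run("vorticity confinement");
    for (int i = 0; i < solverIterations; ++i)
        run("viscous diffusion");
    // Projection: divergence, pressure, then subtract gradient(p) from u
    run("divergence");
    for (int i = 0; i < solverIterations; ++i)
        run("pressure");
    run("subtract gradient");
    run("display");
    return passes;
}
```

Even this stripped-down flow issues 6 + 2 * solverIterations passes per timestep, which is why managing each pass by hand quickly becomes unwieldy.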
Implementation
- Harris' implementation contained 15 GPU programs (including 4 for display)
- The simulation takes about 20 passes for each timestep (not including two 50-pass runs of the Poisson solver)
- Switch to code (note: the code can be found in GPU Gems 1)
Point:
- Creating something as complex as a fluid solver would be very difficult without some kind of abstraction
So what's so special about SlabOps?
- Versatility
- Policy-Based Design
SlabOp Versatility
- Remember we skipped over how to define a SlabOp.
- Each SlabOp is actually composed of 6 objects working together.
- Each of the six objects can be replaced according to the specific task.
- In other words, to alter a SlabOp to display to the screen instead of the back buffer, I just replace the Update object.
The 6 objects that define a SlabOp
- Render Target Policy
  - Sets up / shuts down any special render target functionality needed by the SlabOp
- GL State Policy
  - Sets and unsets the GL state needed for the SlabOp
- Vertex Pipe Policy
  - Sets up / shuts down vertex programs
- Fragment Pipe Policy
  - Sets up / shuts down fragment programs
- Compute Policy
  - Performs the computation (usually via rendering)
- Update Policy
  - Performs any copies or other update functions after the computation has been performed
Defining a SlabOp
- Luckily you do not need to create each of those objects.
- You just need to replace one when it doesn't do what you want.
- Harris created 3 predefined SlabOps
  - DefaultSlabOp – performs a simple fragment program rendered to a quad
  - BCSlabOp – performs a boundary condition fragment program rendered as lines
  - DisplayOp – displays a texture to the screen
More complex SlabOps
Objects defined to perform:
- Flat 3D texture computations
  - computing for voxel grids
  - Flat3DTexComputePolicy
  - Flat3DBoundaryComputePolicy
  - Flat3DVectorizedTexComputePolicy
  - Copy3DTexGLUpdatePolicy
- Multi-texture output
  - rendering with multiple texture outputs
  - MultiTextureGLComputePolicy
- Volume computations
  - rendering with multiple texture coordinates
  - VolumeComputePolicy
  - VolumeGLComputePolicy
Defining a SlabOp
typedef SlabOp < NoopRenderTargetPolicy, NoopGLStatePolicy,
                 NoopVertexPipePolicy, GenericCgGLFragmentPipePolicy,
                 SingleTextureGLComputePolicy, CopyTexGLUpdatePolicy >
        DefaultSlabOp;
- Include a Noop where a policy is not used
- Include the preferred policy where one is needed
Next Generation SlabOps?
- The version on the course website has been extracted from Harris' fluid simulator and updated to use frame buffer objects instead of render texture
- It would be easy to update SlabOps to use the geometry processor as well
- Additional policies could be created to render to non-quad surfaces, e.g. an object
How do SlabOps work?
- The rest of this lecture will explain policy-based design. There will be no more GPU talk during the remainder of the lecture
Why?
- SlabOps were a good implementation of Policy Based Design
- You should have some exposure to design patterns and templates
Where did Policy Based Design Come from?
Modern C++ Design: Generic Programming and Design Patterns Applied
By: Andrei Alexandrescu
- Excellent bedtime reading
  - Asleep within 2 pages
- Contains unique implementations of design patterns using templates
What is a design pattern?
Design Pattern: A general repeatable solution to a commonly occurring problem in software design.
- Wikipedia (The irrefutable source on everything)

The most commonly known design pattern?
The Singleton
- One of the simplest and most useful design patterns
- Goal: To only have one instance of an object, no matter where it is created in the program
The Singleton
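class Singleton {
public:
    // Returns the single shared instance, creating it on first use
    static Singleton & Instance();
    ~Singleton();
private:
    // Private constructor: only Instance() may create the object
    Singleton();
    static Singleton * m_singleton;
};
Singleton & Singleton::Instance() {
    if(m_singleton == nullptr)
        m_singleton = new Singleton();
    return *m_singleton;
}
// in Cpp file
Singleton * Singleton::m_singleton = nullptr;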
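As a usage illustration, here is a self-contained sketch of the same goal using a different, commonly used technique: a function-local static (often called a Meyers singleton) instead of the pointer check shown above. The class Counter and its Next() method are hypothetical names chosen for the example.

```cpp
// Singleton via a function-local static: the instance is constructed the
// first time Instance() is called and the same object is returned afterwards.
class Counter {
public:
    static Counter& Instance() {
        static Counter instance;
        return instance;
    }

    // Returns 1, 2, 3, ... across every caller in the program.
    int Next() { return ++m_count; }

    // Copying is disabled so that no second instance can be made.
    Counter(const Counter&) = delete;
    Counter& operator=(const Counter&) = delete;

private:
    Counter() : m_count(0) {}   // private: only Instance() constructs
    int m_count;
};
```

Two calls to Counter::Instance() from anywhere in the program yield references to the same object, so a count advanced in one place is visible in every other.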
C++ Templates
- Templates – functions and classes that can operate with generic types
- The STL is a library of templates, hence its name: Standard Template Library
- Example Templates:
  - cout, cin (objects of stream class templates)
  - vector<int>
  - string (a typedef of basic_string<char>)

Example Template:
template <class myType>
myType GetMax (myType a, myType b)
{ return (a>b?a:b); }
Example Template Use:
int x = 3, y = 7;
int larger = GetMax <int> (x,y);
Modern C++ Design – Book on design patterns using templates
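A short sketch of what "generic types" buys you: the same GetMax template, repeated here so the block is self-contained, works unchanged for any type that supports operator>. The particular values in the usage note are only illustrative.

```cpp
#include <string>

// One definition; the compiler generates a separate function for each
// type it is used with (int, double, std::string, ...).
template <class myType>
myType GetMax(myType a, myType b)
{
    return (a > b ? a : b);
}
```

GetMax<int>(3, 7) selects the type explicitly, while GetMax(2.5, 1.5) lets the compiler deduce double from the arguments. With std::string arguments the comparison is lexicographic.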
Policy Based Design
- Defines a class with complex behavior out of many little classes (called policies), each of which takes care of one behavioral or structural aspect.
- You can mix and match policies to achieve a combinatorial set of behaviors from a small core of elementary components
How it works
- Multiple Inheritance
  - One class that inherits the properties of numerous other classes
- Templates
  - Systems that operate with generic types
Multiple Inheritance + Templates => Policy Based Design
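A minimal, non-GPU sketch of the combination may help here. The host class below inherits from whatever policies it is given as template parameters, and two typedefs mix the same building blocks in different ways. All names (Greeter, EnglishPolicy, and so on) are hypothetical and chosen only for illustration.

```cpp
#include <string>

// Language policies: each supplies Message().
struct EnglishPolicy { std::string Message() const { return "Hello, World!"; } };
struct GermanPolicy  { std::string Message() const { return "Hallo Welt!"; } };

// Output policies: each supplies Format().
struct PlainOutputPolicy {
    std::string Format(const std::string& s) const { return s; }
};
struct BracketOutputPolicy {
    std::string Format(const std::string& s) const { return "[" + s + "]"; }
};

// Host class: templates choose the policies, multiple inheritance
// brings their methods into a single object.
template <class LanguagePolicy, class OutputPolicy>
class Greeter : public LanguagePolicy, public OutputPolicy {
public:
    std::string Run() const {
        return OutputPolicy::Format(LanguagePolicy::Message());
    }
};

// Mix and match: two behaviors built from four small classes.
typedef Greeter<EnglishPolicy, PlainOutputPolicy>  PlainEnglish;
typedef Greeter<GermanPolicy, BracketOutputPolicy> BracketGerman;
```

With two language policies and two output policies, four distinct Greeter types are available without writing any new class bodies.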
Policies
- Each policy is a simple class that implements one aspect of the overall goal
- Policies do not need to be templates (in many cases they're not)
- Policies do need to implement specific, known functions
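The "specific known functions" requirement can be checked at compile time. The sketch below is one possible way to do it, assuming a state policy is expected to provide SetState() and ResetState(). The trait HasStateInterface and the struct BrokenPolicy are hypothetical helpers, not part of SlabOps, and the NoopGLStatePolicy shown is a stand-in written for this example.

```cpp
#include <type_traits>
#include <utility>

// Chosen when T has both SetState() and ResetState(); otherwise the
// overload below is used instead (SFINAE).
template <class T>
auto hasStateInterface(int)
    -> decltype(std::declval<T&>().SetState(),
                std::declval<T&>().ResetState(),
                std::true_type());

template <class T>
std::false_type hasStateInterface(...);

// True if T could be plugged in as a state policy.
template <class T>
struct HasStateInterface : decltype(hasStateInterface<T>(0)) {};

// A policy that does nothing but still offers the expected interface.
struct NoopGLStatePolicy {
    void SetState() {}
    void ResetState() {}
};

// Missing ResetState(), so it does not satisfy the interface.
struct BrokenPolicy {
    void SetState() {}
};

static_assert(HasStateInterface<NoopGLStatePolicy>::value,
              "a no-op policy still satisfies the interface");
static_assert(!HasStateInterface<BrokenPolicy>::value,
              "a policy without ResetState() is rejected");
```

Without such a check, plugging in BrokenPolicy would still fail to compile, but only deep inside the host class where the missing function is called.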
Encapsulation Class

- One class needs to use multiple inheritance to combine all the policies together
template
<
    class RenderTargetPolicy,
    class GLStatePolicy,
    class VertexPipePolicy,
    class FragmentPipePolicy,
    class ComputePolicy,
    class UpdatePolicy
>
class SlabOp : public RenderTargetPolicy,
               public GLStatePolicy,
               public VertexPipePolicy,
               public FragmentPipePolicy,
               public ComputePolicy,
               public UpdatePolicy
{
public:
    SlabOp() {}
    ~SlabOp() {}
    void Compute();
};
The Compute Method
// The only method of the SlabOp host class is Compute(), which
// uses the inherited policy methods to perform the slab computation.
// Note that this also defines the interfaces that the policy classes
// must have.
void Compute()
{
    // Activate the output slab, if necessary
    // (this-> lets a standard-conforming compiler find the inherited policy method)
    this->ActivateRenderTarget();
    // Set the necessary state for the slab operation
    GLStatePolicy::SetState();
    VertexPipePolicy::SetState();
    FragmentPipePolicy::SetState();
    this->SetViewport();
    // Perform the slab operation
    ComputePolicy::Compute();
    // Put the results of the operation into the output slab.
    this->UpdateOutputSlab();
    this->ResetViewport();
    // Reset state
    FragmentPipePolicy::ResetState();
    VertexPipePolicy::ResetState();
    GLStatePolicy::ResetState();
    // Deactivate the output slab, if necessary
    this->DeactivateRenderTarget();
}
};
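To see the host class and its Compute() sequence in action without any OpenGL, here is a hedged mock in which each policy only records what it would do. Every class here is a stand-in written for this sketch: MiniSlabOp, Trace(), the Logging policies and RunAndGetTrace are hypothetical, and placing SetViewport() and ResetViewport() in the compute policy is an assumption.

```cpp
#include <string>
#include <vector>

// Shared log so the mock policies can record their actions.
inline std::vector<std::string>& Trace() {
    static std::vector<std::string> trace;
    return trace;
}

struct NoopRenderTargetPolicy { void ActivateRenderTarget() {} void DeactivateRenderTarget() {} };
struct NoopGLStatePolicy      { void SetState() {} void ResetState() {} };
struct NoopVertexPipePolicy   { void SetState() {} void ResetState() {} };
struct LoggingFragmentPipePolicy {
    void SetState()   { Trace().push_back("fragment program on"); }
    void ResetState() { Trace().push_back("fragment program off"); }
};
struct LoggingComputePolicy {
    void SetViewport() {}
    void ResetViewport() {}
    void Compute() { Trace().push_back("compute"); }
};
struct LoggingCopyUpdatePolicy { void UpdateOutputSlab() { Trace().push_back("copy result"); } };
struct NoopUpdatePolicy        { void UpdateOutputSlab() {} };

// Host class with the same call sequence as the Compute() method above.
template <class RenderTargetPolicy, class GLStatePolicy, class VertexPipePolicy,
          class FragmentPipePolicy, class ComputePolicy, class UpdatePolicy>
class MiniSlabOp : public RenderTargetPolicy, public GLStatePolicy,
                   public VertexPipePolicy, public FragmentPipePolicy,
                   public ComputePolicy, public UpdatePolicy {
public:
    void Compute() {
        this->ActivateRenderTarget();
        GLStatePolicy::SetState();
        VertexPipePolicy::SetState();
        FragmentPipePolicy::SetState();
        this->SetViewport();
        ComputePolicy::Compute();
        this->UpdateOutputSlab();
        this->ResetViewport();
        FragmentPipePolicy::ResetState();
        VertexPipePolicy::ResetState();
        GLStatePolicy::ResetState();
        this->DeactivateRenderTarget();
    }
};

// Two ops that differ only in their update policy.
typedef MiniSlabOp<NoopRenderTargetPolicy, NoopGLStatePolicy, NoopVertexPipePolicy,
                   LoggingFragmentPipePolicy, LoggingComputePolicy,
                   LoggingCopyUpdatePolicy> CopyingOp;
typedef MiniSlabOp<NoopRenderTargetPolicy, NoopGLStatePolicy, NoopVertexPipePolicy,
                   LoggingFragmentPipePolicy, LoggingComputePolicy,
                   NoopUpdatePolicy> DirectOp;

// Runs one Compute() on a fresh op and returns what the policies recorded.
template <class Op>
std::vector<std::string> RunAndGetTrace() {
    Trace().clear();
    Op op;
    op.Compute();
    return Trace();
}
```

RunAndGetTrace<CopyingOp>() records the copy step between the compute and the fragment program shutdown, while RunAndGetTrace<DirectOp>() omits it. Only the last template argument changed between the two typedefs.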
The Other Methods
- But wait, what about all the other functions that we called inside our GPU program?
- Those exist in the individual policies
- Example:
InitializeFP(CGcontext context,
             string fpFileName,
             string entryPoint)
exists in the FragmentPipePolicy
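Because the host inherits publicly from its policies, any public method of a policy can be called directly on the host object. The sketch below illustrates this with simplified stand-ins: FileFragmentPipePolicy, RectComputePolicy and TinyOp are hypothetical, and the InitializeFP shown here drops the CGcontext argument so that the example needs no Cg headers.

```cpp
#include <string>

// Stand-in fragment pipe policy: only remembers which program was requested.
struct FileFragmentPipePolicy {
    void InitializeFP(const std::string& fpFileName, const std::string& entryPoint) {
        m_fileName = fpFileName;
        m_entryPoint = entryPoint;
    }
    std::string ProgramName() const { return m_fileName + ":" + m_entryPoint; }
private:
    std::string m_fileName;
    std::string m_entryPoint;
};

// Stand-in compute policy: only remembers the output rectangle size.
struct RectComputePolicy {
    void SetSlabRect(int x, int y, int width, int height) {
        (void)x; (void)y;            // origin is not needed in this sketch
        m_width = width;
        m_height = height;
    }
    int SlabArea() const { return m_width * m_height; }
private:
    int m_width = 0;
    int m_height = 0;
};

// The host adds no methods of its own; everything comes from the policies.
template <class FragmentPipePolicy, class ComputePolicy>
class TinyOp : public FragmentPipePolicy, public ComputePolicy {};

typedef TinyOp<FileFragmentPipePolicy, RectComputePolicy> TinyAddOp;
```

A TinyAddOp object accepts op.InitializeFP("addMatrix.cg", "main") and op.SetSlabRect(0, 0, 4, 4) even though TinyOp declares neither method.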
Templates in CUDA
- CUDA is C based, but has support for templates
- Two uses for templates:
  - Create one kernel for multiple data types
  - Evaluate if statements at compile time
Example of CUDA templates
template <unsigned int blockSize>
__global__ void
reduce5(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockSize];
    __syncthreads();
    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
#ifndef __DEVICE_EMULATION__
    if (tid < 32)
#endif
    {
        if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; EMUSYNC; }
        if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; EMUSYNC; }
        if (blockSize >= 16) { sdata[tid] += sdata[tid + 8]; EMUSYNC; }
        if (blockSize >= 8) { sdata[tid] += sdata[tid + 4]; EMUSYNC; }
        if (blockSize >= 4) { sdata[tid] += sdata[tid + 2]; EMUSYNC; }
        if (blockSize >= 2) { sdata[tid] += sdata[tid + 1]; EMUSYNC; }
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
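The compile-time "if" idea can be tried on the CPU as well. The following host-side C++ sketch emulates one block of the reduction above, with a sequential loop standing in for the threads. The functions reduceBlockOnHost and demoReduce are hypothetical names, and this is an emulation for illustration, not the CUDA kernel itself.

```cpp
#include <vector>

// Emulates one block of the tree reduction on the host.
// blockSize is a template parameter, so every "blockSize >= N" test is a
// constant the compiler can evaluate, letting it drop the unused steps.
template <unsigned int blockSize>
int reduceBlockOnHost(const int* in)
{
    static_assert(blockSize >= 2 && blockSize <= 512 &&
                  (blockSize & (blockSize - 1)) == 0,
                  "blockSize must be a power of two between 2 and 512");

    // First level: each "thread" adds two input elements into shared data.
    int sdata[blockSize];
    for (unsigned int tid = 0; tid < blockSize; ++tid)
        sdata[tid] = in[tid] + in[tid + blockSize];

    // One reduction step: the lower 'stride' entries absorb the upper ones.
    auto step = [&sdata](unsigned int stride) {
        for (unsigned int tid = 0; tid < stride; ++tid)
            sdata[tid] += sdata[tid + stride];
    };

    if (blockSize >= 512) step(256);
    if (blockSize >= 256) step(128);
    if (blockSize >= 128) step(64);
    if (blockSize >= 64)  step(32);
    if (blockSize >= 32)  step(16);
    if (blockSize >= 16)  step(8);
    if (blockSize >= 8)   step(4);
    if (blockSize >= 4)   step(2);
    if (blockSize >= 2)   step(1);
    return sdata[0];
}

// Sums the integers 0..15 with a block size of 8 (reads 2 * 8 elements).
int demoReduce()
{
    std::vector<int> input(16);
    for (int i = 0; i < 16; ++i)
        input[i] = i;
    return reduceBlockOnHost<8>(input.data());
}
```

For reduceBlockOnHost<8>, only the steps with strides 4, 2 and 1 can ever run; the larger ones are dead code that an optimizing compiler is free to remove.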
Conclusion
- SlabOps are one of many GPGPU abstractions
  - They happen to be my favorite because they are the most versatile and are easy to use
- Issues:
  - They do not include basic GPGPU functions such as Reduce()
  - There is a learning curve
  - It is difficult to find out where things are actually going on