GPGPU Toolkit
SlabOps
SlabOps were created by Mark Harris (UNC, NVIDIA)

Main Issue with GPU Programming
- The main issue is not writing the code for the graphics card
- The main issue is interfacing with the graphics card

Issues with Interfacing with GPUs
1. You forget to do something
   1. Forget to initialize FBOs
   2. Forget to enable the Cg program
   3. Forget to set the viewport correctly
   4. ….
2. Graphics-based GPGPU algorithms are hacks
   1. You’re rendering a quad to perform an algorithm on an array
3. It’s not object oriented

Using SlabOps
- The GPGPU methods covered previously are fine for running 1 or 2 programs
- What about trying to manage ten or twenty programs performing hundreds of passes?
- SlabOps to the rescue!

Using SlabOps
- SlabOps were created by Mark Harris while getting his PhD at the University of North Carolina.
- Used in his GPU Fluid Simulator to manage the large number of fragment programs required for each pass.

Using SlabOps: 3 Parts
1. Define
   1. Define the type of SlabOp that you need (more on this later)
2. Initialization
   1. Initialize the program to load
   2. Initialize the parameters to connect
   3. Initialize the output
3. Run
   1. Update any parameters that might have changed
   2. Call Compute() to run the program

Initialization

    void initSlabOps()
    {
        // Load the program
        g_addMatrixfp.InitializeFP(cgContext, "addMatrix.cg", "main");

        // Set the texture parameters
        g_addMatrixfp.SetTextureParameter("tex1", inputYTexID);
        g_addMatrixfp.SetTextureParameter("tex2", inputXTexID);

        // Set the texture coordinates and output rectangle
        g_addMatrixfp.SetTexCoordRect(0, 0, texSizeX, texSizeY);
        g_addMatrixfp.SetSlabRect(0, 0, texSizeX, texSizeY);

        // Set the output texture
        g_addMatrixfp.SetOutputTexture(outputTexID, texSizeX, texSizeY,
                                       textureParameters.texTarget,
                                       GL_COLOR_ATTACHMENT0_EXT);
    }

Run

    g_addMatrixfp.Compute();

One line to run the program:
- Sets the variables
- Enables the program
- Sets the viewport
- Builds the geometry to perform the processing
- Performs the computation
- Gets the output into the buffer or texture
- Disables the program
- Resets the viewport

Comparing Saxpy (SlabOp)

    // Do calculations
    for (int i = 0; i < numIterations; i++)
    {
        g_saxpyfp.SetTextureParameter("textureY", yTexID[readTex]);
        g_saxpyfp.SetOutputTexture(yTexID[writeTex], texSize, texSize,
                                   textureParameters.texTarget,
                                   attachmentpoints[writeTex]);
        g_saxpyfp.Compute();
        swap();
    }

Comparing Saxpy (Non-SlabOp 1)

    // attach two textures to FBO
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, attachmentpoints[writeTex],
                              textureParameters.texTarget, yTexID[writeTex], 0);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, attachmentpoints[readTex],
                              textureParameters.texTarget, yTexID[readTex], 0);

    // check if that worked
    if (!checkFramebufferStatus())
    {
        printf("glFramebufferTexture2DEXT():\t [FAIL]\n");
        PAUSE();
        exit(ERROR_FBOTEXTURE);
    }
    else if (mode == 0)
    {
        printf("glFramebufferTexture2DEXT():\t [PASS]\n");
    }

    // enable fragment profile
    cgGLEnableProfile(fragmentProfile);

    // bind saxpy program
    cgGLBindProgram(fragmentProgram);

    // enable texture x (read-only, not changed during the iteration)
    cgGLSetTextureParameter(xParam, xTexID);
    cgGLEnableTextureParameter(xParam);

    // enable scalar alpha (same)
    cgSetParameter1f(alphaParam, alpha);

    // Calling glFinish() is only necessary to get accurate timings,
    // and we need a high number of iterations to avoid timing noise.
    glFinish();

Comparing Saxpy (Non-SlabOp 2)

    for (int i = 0; i < numIterations; i++)
    {
        // set render destination
        glDrawBuffer(attachmentpoints[writeTex]);

        // enable texture y_old (read-only)
        cgGLSetTextureParameter(yParam, yTexID[readTex]);
        cgGLEnableTextureParameter(yParam);

        // and render multitextured viewport-sized quad
        // depending on the texture target, switch between
        // normalised ([0,1]^2) and unnormalised ([0,w]x[0,h])
        // texture coordinates

        // make quad filled to hit every pixel/texel
        // (should be default but we never know)
        glPolygonMode(GL_FRONT, GL_FILL);

        // and render the quad
        if (textureParameters.texTarget == GL_TEXTURE_2D)
        {
            // render with normalized texcoords
            glBegin(GL_QUADS);
            glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
            glTexCoord2f(1.0, 0.0); glVertex2f(texSize, 0.0);
            glTexCoord2f(1.0, 1.0); glVertex2f(texSize, texSize);
            glTexCoord2f(0.0, 1.0); glVertex2f(0.0, texSize);
            glEnd();
        }
        else
        {
            // render with unnormalized texcoords
            glBegin(GL_QUADS);
            glTexCoord2f(0.0, 0.0);         glVertex2f(0.0, 0.0);
            glTexCoord2f(texSize, 0.0);     glVertex2f(texSize, 0.0);
            glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
            glTexCoord2f(0.0, texSize);     glVertex2f(0.0, texSize);
            glEnd();
        }

        // swap role of the two textures (read-only source becomes
        // write-only target and the other way round):
        swap();
    }

Comparing Saxpy
- OK, that looked a little worse than we know it is
- But… using SlabOps did look a little easier
- Saxpy only had one program being run for multiple iterations.
- What about something more complicated…
  - Fluid Flow

Fluids
- Follow Stam’s method
- We’re not going to cover how to do fluids so much as the program flow and how SlabOps help contain the problem
- “Fast Fluid Dynamics Simulation on the GPU”, Mark Harris.
  In GPU Gems.

1. Advection
2. Impulse
3. Vorticity Confinement
4. Viscous Diffusion
5. Project Divergent Velocity
   1. Compute Divergence
   2. Compute Pressure Disturbances
   3. Subtract gradient(p) from u
6. Display

Let’s not forget Boundary Conditions
- Boundaries and interior are computed in separate passes and may require separate programs

Implementation
- Harris’ implementation contained 15 GPU programs (including 4 for display)
- The simulation takes about 20 passes for each timestep (not including two 50-pass runs for the Poisson solver)

Switch to code: (Note, code can be found in GPU Gems 1)

Point:
- Creating something as complex as a fluid solver would be very difficult without some kind of abstraction

So what’s so special about SlabOps
- Versatility
- Policy-Based Design

SlabOp Versatility
- Remember we skipped over how to define a SlabOp.
- Each SlabOp is actually composed of 6 objects working together.
- Each of the six objects can be replaced according to the specific task
- In other words, to alter a SlabOp to display to the screen instead of the back buffer, I just replace the Update object.

The 6 objects that define a SlabOp
- Render Target Policy
  - Sets up / shuts down any special render target functionality needed by the SlabOp
- GL State Policy
  - Sets and unsets the GL state needed for the SlabOp
- Vertex Pipe Policy
  - Sets up / shuts down vertex programs
- Fragment Pipe Policy
  - Sets up / shuts down fragment programs
- Compute Policy
  - Performs the computation (usually via rendering)
- Update Policy
  - Performs any copies or other update functions after the computation has been performed

Defining a SlabOp
- Luckily you do not need to create each of those objects.
- You just need to replace one when it doesn’t do what you want.
- Harris created 3 predefined SlabOps
  - DefaultSlabOp: performs a simple fragment program rendered to a quad
  - BCSlabOp: performs a boundary condition fragment program rendered as lines
  - DisplayOp: displays a texture to the screen

More complex SlabOps
Objects defined to perform:
- Flat 3D texture computations - computing for voxel grids
  - Flat3DTexComputePolicy
  - Flat3DBoundaryComputePolicy
  - Flat3DVectorizedTexComputePolicy
  - Copy3DTexGLUpdatePolicy
- Multi-texture output - rendering with multiple texture outputs
  - MultiTextureGLComputePolicy
- Volume computations - rendering with multiple texture coordinates
  - VolumeComputePolicy, VolumeGLComputePolicy

Defining a SlabOp

    typedef SlabOp
    <
        NoopRenderTargetPolicy,
        NoopGLStatePolicy,
        NoopVertexPipePolicy,
        GenericCgGLFragmentPipePolicy,
        SingleTextureGLComputePolicy,
        CopyTexGLUpdatePolicy
    >
    DefaultSlabOp;

- Include a Noop where a policy is not used
- Include the preferred policy where one is needed

Next Generation SlabOps?
- The version on the course website has been extracted out of Harris’ fluid simulator and updated to use frame buffer objects instead of render texture
- Easy to update SlabOps to use the geometry processor also
- Additional policies could be created to render to non-quad surfaces, i.e. an object

How do SlabOps work?
- The rest of this lecture will explain policy based design.
- There will be no more GPU talk during the remainder of the lecture

Why?
- SlabOps were a good implementation of Policy Based Design
- You should have some exposure to design patterns and templates

Where did Policy Based Design Come from?
Modern C++ Design: Generic Programming and Design Patterns Applied
By: Andrei Alexandrescu
- Excellent Bedtime reading - Asleep within 2 pages
- Contains unique implementations of design patterns using templates

What is a design pattern?
- Design Pattern: A general repeatable solution to a commonly occurring problem in software design.
  - Wikipedia (The irrefutable source on everything)
- The most commonly known design pattern?

The Singleton
- One of the simplest and most useful design patterns
- Goal: To only have one instance of an object, no matter where it is created in the program

The Singleton

    class Singleton
    {
    public:
        static Singleton & Instance();
        ~Singleton();
    private:
        Singleton();   // private, so Instance() is the only way to create one
        static Singleton * m_singleton;
    };

    Singleton & Singleton::Instance()
    {
        if (m_singleton == nullptr)
            m_singleton = new Singleton();
        return *m_singleton;
    }

    // in Cpp file
    Singleton * Singleton::m_singleton = nullptr;

C++ Templates
- Templates: functions (and classes) that can operate with generic types
- The STL is a library of templates, hence its name: Standard Template Library
- Example Templates:
  - cout, cin
  - vector<int>
  - string

Example Template:

    template <class myType>
    myType GetMax (myType a, myType b)
    {
        return (a > b ? a : b);
    }

Example Template Use:

    int x, y;
    GetMax <int> (x, y);

- Modern C++ Design: a book on design patterns using templates

Policy Based Design
- Defines a class with a complex behavior out of many little classes (called policies), each of which takes care of one behavioral or structural aspect.
- You can mix and match policies to achieve a combinatorial set of behaviors by using a small core of elementary components

How it works
- Multiple Inheritance
  - One class that inherits the properties of numerous other classes
- Templates
  - Systems that operate with generic types
- Multiple Inheritance + Templates => Policy Based Design

Policies
- Each policy is a simple class that implements one aspect of the overall goal
- Policies do not need to be templates (in many cases they’re not)
- Policies do need to have specific known functions that they implement

Encapsulation Class
- One class needs to use multiple inheritance to combine all the policies together

    template
    <
        class RenderTargetPolicy,
        class GLStatePolicy,
        class VertexPipePolicy,
        class FragmentPipePolicy,
        class ComputePolicy,
        class UpdatePolicy
    >
    class SlabOp : public RenderTargetPolicy,
                   public GLStatePolicy,
                   public VertexPipePolicy,
                   public FragmentPipePolicy,
                   public ComputePolicy,
                   public UpdatePolicy
    {
    public:
        SlabOp() {}
        ~SlabOp() {}
        void Compute();
    };

The Compute Method

    // The only method of the SlabOp host class is Compute(), which
    // uses the inherited policy methods to perform the slab computation.
    // Note that this also defines the interfaces that the policy classes
    // must have.
    void Compute()
    {
        // Activate the output slab, if necessary
        ActivateRenderTarget();

        // Set the necessary state for the slab operation
        GLStatePolicy::SetState();
        VertexPipePolicy::SetState();
        FragmentPipePolicy::SetState();
        SetViewport();

        // Perform the slab operation
        ComputePolicy::Compute();

        // Put the results of the operation into the output slab.
        UpdateOutputSlab();

        // Reset state
        ResetViewport();
        FragmentPipePolicy::ResetState();
        VertexPipePolicy::ResetState();
        GLStatePolicy::ResetState();

        // Deactivate the output slab, if necessary
        DeactivateRenderTarget();
    }

The Other Methods
- But wait, what about all the other functions that we called inside our GPU program?
- Those exist in the individual policies
- Example: InitializeFP(CGcontext context, string fpFileName, string entryPoint)
  - Exists in the FragmentPipePolicy

Templates in CUDA
- CUDA is C based, but has support for templates
- Two uses for templates:
  - Create one kernel for multiple data types
  - Evaluate if statements at compile time

Example of CUDA templates

    template <unsigned int blockSize>
    __global__ void reduce5(int *g_idata, int *g_odata)
    {
        extern __shared__ int sdata[];

        // perform first level of reduction,
        // reading from global memory, writing to shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
        sdata[tid] = g_idata[i] + g_idata[i+blockSize];
        __syncthreads();

        // do reduction in shared mem
        if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    #ifndef __DEVICE_EMULATION__
        if (tid < 32)
    #endif
        {
            if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; EMUSYNC; }
            if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; EMUSYNC; }
            if (blockSize >= 16) { sdata[tid] += sdata[tid +  8]; EMUSYNC; }
            if (blockSize >=  8) { sdata[tid] += sdata[tid +  4]; EMUSYNC; }
            if (blockSize >=  4) { sdata[tid] += sdata[tid +  2]; EMUSYNC; }
            if (blockSize >=  2) { sdata[tid] += sdata[tid +  1]; EMUSYNC; }
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Conclusion
- SlabOps are one of many GPGPU abstractions
- They happen to be my favorite because they are the most versatile and are easy to use
- Issues:
  - Do not include basic GPGPU functions such as Reduce()
  - There is a learning curve
  - Difficult to find out where things are actually going on
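
Since it can be hard to see where things actually happen in a policy-based class, here is a minimal sketch of the pattern with the GPU removed. It is not Harris' code: MiniSlabOp, Trace, and every policy name below are hypothetical, there are three policies instead of six, and the compute policy adds two arrays on the CPU as a stand-in for a fragment program rendered over a quad. The host class only sequences the methods it inherits, which is the same role SlabOp's Compute() plays.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: a global event list so we can see which policy ran when.
struct Trace {
    static std::vector<std::string>& Events() {
        static std::vector<std::string> events;
        return events;
    }
};

// State policies must provide SetState() and ResetState().
struct NoopStatePolicy {
    void SetState() {}
    void ResetState() {}
};

struct TracingStatePolicy {
    void SetState()   { Trace::Events().push_back("set state"); }
    void ResetState() { Trace::Events().push_back("reset state"); }
};

// Compute policies must provide Compute(). This one adds two arrays on the
// CPU (an assumption for illustration; the real policies render geometry).
struct AddArraysComputePolicy {
    std::vector<float> x, y, result;
    void Compute() {
        assert(x.size() == y.size());
        result.resize(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) result[i] = x[i] + y[i];
        Trace::Events().push_back("compute");
    }
};

// Update policies must provide UpdateOutputSlab().
struct NoopUpdatePolicy {
    void UpdateOutputSlab() {}
};

struct TracingUpdatePolicy {
    void UpdateOutputSlab() { Trace::Events().push_back("update output"); }
};

// Host class: inherits from every policy and only sequences their methods.
template <class StatePolicy, class ComputePolicy, class UpdatePolicy>
class MiniSlabOp : public StatePolicy,
                   public ComputePolicy,
                   public UpdatePolicy {
public:
    void Compute() {
        StatePolicy::SetState();
        ComputePolicy::Compute();
        UpdatePolicy::UpdateOutputSlab();
        StatePolicy::ResetState();
    }
};

// Swapping behavior means swapping a template argument, nothing else.
typedef MiniSlabOp<NoopStatePolicy, AddArraysComputePolicy, NoopUpdatePolicy>
    QuietAddOp;
typedef MiniSlabOp<TracingStatePolicy, AddArraysComputePolicy, TracingUpdatePolicy>
    TracedAddOp;
```

With x = {1, 2} and y = {10, 20}, calling Compute() on a QuietAddOp leaves {11, 22} in result. A TracedAddOp produces the same sums but also records "set state", "compute", "update output", "reset state" in that order, and the only code that differs between the two is the typedef.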
