Panini: A GPU-Aware Array Class
Dr. Santosh Ansumali (JNCASR) & Priyanka Sah (NVIDIA)

Heterogeneous Computing
— CPU: multicore, multiprocessor, clusters of multicores
— GPU
— CPU + GPU
— MIC

Background: Programming Efficiency vs. Performance
— Scalar and parallel code
— MATLAB / C++ STL
  Merit: easy to write code
  Demerit: performance issues; the code does not scale
— Array classes (C/C++): a smarter way of writing code
  Combine MATLAB-style object orientation with template metaprogramming

Background: Template Metaprogramming
— Blitz++: fast, accurate numerical computing in C++
  Advanced containers (vector, array, matrix)
  FORTRAN-level performance with MATLAB-like syntax; scalable
  Disadvantage: a large code base
— POOMA: MPI-based
— Panini: combines the MATLAB-style interface with the Blitz++ approach

Vector Initialization and Expression Evaluation
— Vector initialization, MATLAB style: a = 1, 2, 3
— Expression evaluation: a = αb + βc
  MATLAB way (one temporary per operation):
    T1[i] = β*c[i]
    T2[i] = α*b[i]
    T3[i] = T1[i] + T2[i]
    a[i]  = T3[i]
  Blitz way (lazy evaluation, a single fused loop):
    a[i] = α*b[i] + β*c[i]
— Small array (double a[20]): complete loop unrolling
— Large array (double *a): complete unrolling would cause register spilling
— a = αb + βc + µd requires type checking of the expression
  CRTP avoids run-time type checking without resorting to virtual functions
  bsum(a,b) holds { &a[i], &b[i] }; a + b + c becomes bsum of X + X -> tupleSum

Vector Initialization (device code)

    __device__ commaHelper<dataType, N> operator=(dataType val)
    {
        for (int i = 0; i < N; i++)
            data[i] = val;
        return commaHelper<dataType, N>(&data[1]);
    }

Comma-Initialization Helper

    template <class T, int len>
    class commaHelper
    {
    public:
        __device__ commaHelper() : vPtr(0) {}
        __device__ commaHelper(T *ptr) : vPtr(ptr) {}
        __device__ commaHelper &operator,(T val)
        {
            *vPtr++ = val;
            return *this;
        }
    private:
        T *vPtr;
    };

Expression Node: Scalar Multiplication

    template <class lhs, typename dataType>
    class scalarMultR : public baseET<dataType, scalarMultR<lhs, dataType> >
    {
    public:
        __device__ scalarMultR(const lhs &l, const dataType &r)
            : lhs_(l), rhs_(r) {}
        __device__ dataType value(int i) const { return lhs_.value(i) * rhs_; }
    private:
        const lhs &lhs_;
        const dataType rhs_;
    };

Objectives
Programmer productivity
— Rapidly develop complex applications
— Leverage parallel primitives
— Encourage generic programming

High performance
— With minimal programmer effort

Interoperability
— Integrates with CUDA C/C++ code

Panini Library
— A generic parallel array class built on advanced generic-programming techniques; the details of parallelization are hidden inside the array class itself.
— Allows a user to work with high-level physical abstractions for scientific computation.
— Expression templates: the basis of high-performance numerical libraries, expressing abstract mathematical notation via operator overloading in C++.
— Efficiently parallelizable for large-scale scientific code.
— The expression-template mechanism is implemented with the "curiously recurring template pattern" (CRTP) in C++.

What Is the Panini Library?
— A C++ template library for CUDA
— Supported data structures:
  1D, 2D, and complex vectors on CUDA
  1D and 2D grids with multidimensional data on CUDA
  SoA as well as AoS layouts
— Templates and operator overloading
— Loop unrolling for small vectors
— Lazy evaluation
— Common subexpression elimination

Containers/Objects Supported by Panini
— Small vectors on the device
— Large vectors on the device
— Complex arrays and 1D/2D grids

    using namespace Panini;
    int main()
    {
        int nX = 200;
        int nY = 200;
        vectET<double> coordX(nX, 0.0);
        vectET<double> coordY(nY, 0.0);
        gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
        gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
        gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
        gridFlow2D<1> pressureM(nX, nY, 1, 1);
        gridFlow2D<1> pressureN(nX, nY, 1, 1);
        gridFlow2D<1> pressure(nX, nY, 1, 1);
        gridFlow2D<1> potentialO(nX, nY, 1, 1);
        // ...
    }

Basic Features of the vectTiny/vectET Classes
— Direct assignment:
    vectTiny<dataType, N> array;
    array = 1.2, 3.5, 5.6;
— Binary operations
— Scalar arithmetic operations
— Math operations on vectTiny objects
— No run-time type checking required
— Supports single- and double-precision floating point, complex numbers, Booleans, and 32-bit signed and unsigned integers
— Supports manipulating vectors, matrices, and N-dimensional
arrays

Best Practice
— Structure of arrays
  Ensures memory coalescing (unlike an array of structures)
— Implicit sequences
  Avoid explicitly storing and accessing regular patterns
  Eliminate memory accesses and storage

Data Types
— vectTiny: small arrays whose size the user knows in advance
    vectTiny<float, 100> a;
— vectET: large arrays, or 1D/N-dimensional grids
    e.g. a grid of size 100 where every point carries a fixed array of size 3:
    vectET< vectTiny<myReal, 3> > a(100);
— gridFlow2D: 2D (or N-dimensional) grids
    gridFlow2D<T, FLOW_FIELD_2D> **myGrid;

Allowed Operations
Three modes of initialization are provided:
— vectTiny<double, 3> a = 2;
— vectTiny<double, 3> a; a = 1, 2, 3;
— vectTiny<double, 3> b; b = a;
All math, binary, and scalar operations:
— vectTiny<double, 3> a = 0.1, b, c;
  b = sin(a); c = a + b; c = 0.5*c;
All vector operations:
— vectTiny<double, 3> a, b, c, d;
  b = a + sin(c) + 0.3*cos(d);
Vector operations rely on the following optimizations: loop unrolling (by hand) and lazy evaluation.
A further optimization: keep a reference to each object until the final iteration.

Allowed Operations (continued)
Lazy evaluation:
— vectTiny<double, 3> a, b, c, d;
— b = sin(c) + 0.3*cos(d);
A typical operator-overloading + virtual-function approach would evaluate this in the following sequence:
— for(i=1,N) tmp[i]  = cos(d[i]);
— for(i=1,N) tmp1[i] = 0.3*tmp[i];
— for(i=1,N) tmp3[i] = sin(c[i]);
— for(i=1,N) b[i]    = tmp3[i] + tmp1[i];
Panini instead generates optimized, Fortran-style code:
— for(i=1,N) b[i] = sin(c[i]) + 0.3*cos(d[i]);

Structure of Arrays
— Coalescing improves memory efficiency
— Accesses to arrays of arbitrary structures won't coalesce
— Reordering into a structure of arrays ensures coalescing
    struct float3 { float x; float y; float z; };
    float3 *aos; ... aos[i].x = 1.0f;

    struct float3_soa { float *x; float *y; float *z; };
    float3_soa soa; ... soa.x[i] = 1.0f;

Array of structures:
    struct Velocity { int ux; int uy; };
    Velocity<FLOW_FIELD_2D> obj_vel(nX, nY, 1, 1);

Structure of arrays (best practice):
    struct Pressure {
        float *pressure;
        float *pressureM;
        float *pressureN;
    };
    gridFlow2D<1> pressureM(nX, nY, 1, 1);
    gridFlow2D<1> pressureN(nX, nY, 1, 1);
    gridFlow2D<1> pressure(nX, nY, 1, 1);

PlaceHolder Objects
— Implicit sequences: placeHolder IX(nX);
— We often need ranges that follow a sequential pattern:
  Constant ranges [1, 1, 1, 1, ...]
  Incrementing ranges [0, 1, 2, 3, ...]

How Panini Differs from ArrayFire
— Static resolution: the approach used in ArrayFire is easier to develop as a library, but it carries a performance penalty.
Panini itself is still at a very early stage of development.

Navier-Stokes Example: How Easy It Is to Write Scientific Code Using Panini

Data structures:
    vectET<double> coordX(nX, 0.0);
    vectET<double> coordY(nY, 0.0);
    gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
    gridFlow2D<1> pressureM(nX, nY, 1, 1);
    gridFlow2D<1> pressureN(nX, nY, 1, 1);
    gridFlow2D<1> pressure(nX, nY, 1, 1);
    gridFlow2D<1> potentialO(nX, nY, 1, 1);

Initial condition:
    myGridN(iX,iY).value(UX) = -2.0*M_PI*kY*phi*cos(coordX[iX]*kX)*sin(coordY[iY]*kY);
    myGridN(iX,iY).value(UY) =  2.0*M_PI*kX*phi*sin(coordX[iX]*kX)*cos(coordY[iY]*kY);
    pressure(iX,iY) = -M_PI*M_PI*phi*phi*(kY*kY*cos(2.0*coordX[iX]*kX)
                                        + kX*kX*cos(2.0*coordY[iY]*kY));
    pressureM(iX,iY) -= 0.5*(myGridN(iX,iY).value(UX)*myGridN(iX,iY).value(UX)
                           + myGridN(iX,iY).value(UY)*myGridN(iX,iY).value(UY));

Laplacian equation:
    template <int N>
    void getLaplacian(gridFlow2D<N> gridVar, gridFlow2D<N> &gridLap,
                      double c3, double c4, int iX, int iY)
    {
        gridLap(iX,iY) = gridVar(iX,iY)
            + c4*(gridVar(iX,iY+1) - 2.0*gridVar(iX,iY) + gridVar(iX,iY-1))
            + c3*(gridVar(iX+1,iY) - 2.0*gridVar(iX,iY) + gridVar(iX-1,iY));
    }

Serial Code – CPU Timing
    No. of grid points   100 iterations (s)   200 iterations (s)
    1.00E+04             0.00126063           0.00126172
    4.00E+04             0.00625661           0.00624585
    1.60E+05             0.0439781            0.044118
    2.50E+05             0.0781446            0.0785149

CPU Timing – MPI Version of the Panini Code
    No. of processors    100 iterations (s)   200 iterations (s)
    1                    0.0743643            0.0755658
    2                    0.0580707            0.0579703
    4                    0.054078             0.0507001
    5                    0.0447405            0.0420167
    10                   0.0382128            0.0365341
    16                   0.0379704            0.0372657
    20                   0.0367649            0.0390902
    25                   0.0472415            0.0589682
    30                   0.0645379            0.0627601
CPU Timing – MPI Version of the Panini Code
    No. of processors    100 iterations (s)   200 iterations (s)
    1                    0.00741906           0.00716424
    2                    0.00583018           0.00561896
    4                    0.00725028           0.0067555
    5                    0.00634362           0.0071692
    10                   0.00856              0.010103
    16                   0.0102652            0.0102073
    20                   0.0114238            0.0103288
    25                   0.0104092            0.0103344
    30                   0.011299             0.0118635

CPU Timing vs. GPU Timing
    No. of grid points   CPU time, 100 iter. (s)   GPU time, 100 iter. (s)   Speedup
    100 x 100            0.001260                  0.000441                  2.72x
    200 x 200            0.006256                  0.001279                  4.89x
    400 x 400            0.043978                  0.004311                  10.09x

Curiously Recurring Template Pattern
This is the core of the vector design. The basic idea is that every input class derives from this template base class, where the template parameter is the derived class itself:

    namespace Panini
    {
        template <typename dataType, class input>
        class baseET
        {
        public:
            typedef const input & inputRef;

            // Return a reference to the derived object
            inline operator inputRef() const
            {
                return *static_cast<const input *>(this);
            }

            inputRef getInputRef() const
            {
                return static_cast<inputRef>(*this);
            }

            // Every derived class provides a member value(i)
            __device__ dataType value(const int i) const
            {
                return static_cast<inputRef>(*this).value(i);
            }
        };
    } // namespace Panini