Panini: A GPU-aware Array Class
Dr. Santosh Ansumali(JNCASR) & Priyanka Sah (NVIDIA)
Heterogeneous Computing
CPU
— Multicore
— Multiprocessor
— Cluster of Multicore
GPU
— CPU + GPU
MIC
Background
Programming Efficiency
Performance
— Scalar
— Parallel
MATLAB/ STL-C++
Merit: easy to write code
Demerit: performance issues; the code does not scale
Array classes (C/C++): a smarter way of writing code
Combine MATLAB-style object orientation with template metaprogramming
Background
Template Metaprogramming
— Blitz++: fast, accurate numerical computing in C++
Advanced features (vector, array, matrix)
FORTRAN
MATLAB
Scalable
Disadvantage: large code size
POOMA: MPI-based
Panini: combines features of MATLAB and Blitz
Vector Initialization in MATLAB: a = 1, 2, 3
Expression Evaluation: a = αb + βc
MATLAB way (one temporary and one loop per operation):
  t1[i] = β*c[i]
  t2[i] = α*b[i]
  t3[i] = t1[i] + t2[i]
  a[i]  = t3[i]
Blitz way (lazy evaluation, single fused loop):
  a[i] = α*b[i] + β*c[i]
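The fused Blitz-style evaluation above can be sketched with a minimal host-side expression-template toy (the types `Vec`, `Add`, `Scale`, and `assign` are illustrative, not Panini's actual classes):

```cpp
#include <cassert>
#include <cstddef>

// The expression alpha*b + beta*c is built as a lightweight object tree;
// nothing is computed until assign() walks it element-by-element, so the
// whole expression is evaluated in one loop with no temporary arrays.
struct Vec {
    double data[3];
    double operator[](std::size_t i) const { return data[i]; }
};

template <class L, class R>
struct Add {
    const L &l; const R &r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class E>
struct Scale {
    double s; const E &e;
    double operator[](std::size_t i) const { return s * e[i]; }
};

template <class L, class R>
Add<L, R> operator+(const L &l, const R &r) { return Add<L, R>{l, r}; }

template <class E>
Scale<E> operator*(double s, const E &e) { return Scale<E>{s, e}; }

template <class E>
void assign(Vec &out, const E &expr) {
    for (std::size_t i = 0; i < 3; ++i)   // single fused loop
        out.data[i] = expr[i];
}
```

The expression objects only hold references and a scalar, so building them costs nothing at run time; the compiler inlines the whole tree into the loop body.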
— Small arrays (double a[20]): complete loop unrolling
— Large arrays (double *a): loop unrolling would cause register spilling
a = αb + βc + µd requires type checking
CRTP: avoids runtime type checking without resorting to virtual functions
bsum(a, b) – { &a[i], &b[i] }
a + b + c: bsum of X + X -> tupleSum
Vector Initialization
__device__ commaHelper<dataType, N> operator=(dataType val)
{
    // Broadcast the first value to every element ...
    for(int i = 0; i < N; i++)
        data[i] = val;
    // ... then return a helper pointing at element 1 so that subsequent
    // comma-separated values overwrite the remaining slots.
    return commaHelper<dataType, N>(&data[1]);
}
Type Checking
template< class T, int len>
class commaHelper
{
public:
__device__ commaHelper() : vPtr(0){}
__device__ commaHelper(T * ptr) : vPtr(ptr) { }
__device__ commaHelper & operator,(T val)
{
*vPtr++ = val;
return *this;
}
private:
T * vPtr;
};
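The two snippets above combine into the comma-initialization idiom. A host-side sketch of the same pattern (`__device__` dropped, names `TinyVec`/`CommaHelper` illustrative) shows how `a = 1.2, 3.5, 5.6;` works:

```cpp
#include <cassert>

// "a = 1.2, 3.5, 5.6;" first calls operator=, which fills the whole array
// with the first value and returns a helper pointing at element 1; each
// subsequent overloaded comma writes the next slot and advances.
template <class T, int N>
class CommaHelper {
public:
    explicit CommaHelper(T *ptr) : vPtr(ptr) {}
    CommaHelper &operator,(T val) {
        *vPtr++ = val;
        return *this;
    }
private:
    T *vPtr;
};

template <class T, int N>
struct TinyVec {
    T data[N];
    CommaHelper<T, N> operator=(T val) {
        for (int i = 0; i < N; ++i)
            data[i] = val;                  // broadcast the first value
        return CommaHelper<T, N>(&data[1]); // commas fill from slot 1
    }
};
```

Note that `=` binds tighter than `,`, so the assignment runs first and the comma operator is then invoked on the returned helper.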
template< class lhs, typename dataType>
class scalarMultR: public baseET<dataType, scalarMultR<lhs,dataType> > {
public:
__device__ scalarMultR(const lhs &l,const dataType & r): lhs_(l),rhs_(r){}
__device__ dataType value(int i)const{
return lhs_.value(i) * rhs_;
}
private:
const lhs &lhs_;
const dataType rhs_;
};
Objectives
Programmer productivity
— Rapidly develop complex applications
— Leverage parallel primitives
Encourage generic programming
High performance
— With minimal programmer effort
Interoperability
— Integrates with CUDA C/C++ code
Panini Library
Generic parallel array class built on advanced generic-programming methodologies,
where the details of parallelization are hidden inside the array class itself.
Allows a user to work with high-level physical abstractions for scientific computation.
Expression templates: a technique from high-performance numerical libraries that
provides abstract mathematical notation via operator overloading in C++.
Efficiently parallelizable for large-scale scientific codes.
The expression-template mechanism is implemented using the "Curiously
Recurring Template Pattern" (CRTP) in C++.
What is the Panini Library?
C++ template library for CUDA
Supported data structures:
— 1D, 2D and complex vectors on CUDA
— 1D, 2D grids with multidimensional data on CUDA
SoA as well as AoS data structures
Templates & operator overloading
Loop unrolling for small vectors
Lazy evaluation
Common subexpression elimination
Containers/Objects Supported by Panini
Small size vector on device
Large vector on device
Create complex array , grid 1d, 2d
using namespace Panini;
int main()
{
int nX = 200;
int nY = 200;
vectET < double > coordX(nX,0.0);
vectET < double > coordY(nY,0.0);
gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1,1);
gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1,1);
gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1,1);
gridFlow2D<1> pressureM(nX, nY,1,1);
gridFlow2D<1> pressureN(nX, nY,1,1);
gridFlow2D<1> pressure(nX, nY,1,1);
gridFlow2D<1> potentialO(nX, nY, 1,1);
Basic Feature of vectTiny/vectET Class
Direct assignment
vectTiny<dataType, N> array;
array = 1.2, 3.5, 5.6;
Binary operations
Scalar arithmetic operations
Math operations on vectTiny objects
Type checking not required
Supports single- and double-precision floating-point values, complex numbers,
Booleans, and 32-bit signed and unsigned integers
Supports manipulating vectors, matrices, and N-dimensional arrays
Best Practice
Structure of Arrays
— Ensures memory coalescing
Array of Structures
Implicit Sequences
— Avoid explicitly storing and accessing regular patterns
— Eliminate memory accesses and storage
DataType
vectTiny: small arrays where the user knows the size in advance
— vectTiny<float, 100> a;
vectET: large arrays, or 1D/N-dimensional grids
— e.g. a grid of size 100 where every point carries a fixed array of size 3:
— vectET< vectTiny<myReal, 3> > a(100);
gridFlow2D: 2D or N-dimensional grids
— gridFlow2D<T, FLOW_FIELD_2D> **myGrid;
Allowed Operation
Three modes of initialization are provided
— vectTiny < double, 3> a=2;
— vectTiny < double, 3> a; a=1,2,3;
— vectTiny < double, 3> b; b =a;
All Math operations, Binary operation and Scalar operation.
— vectTiny < double, 3> a=0.1, b, c; b = sin(a) ; c = a + b ; c = 0.5*c;
All vectors operations
— vectTiny < double, 3> a, b, c,d; b = a+sin(c)+0.3*cos(d)
Vector operations rely on the following optimizations: loop unrolling (by hand)
and lazy evaluation.
References to operand objects are kept until the last iteration.
Allowed Operation…
Lazy Evaluation
— vectTiny < double, 3> a, b, c,d;
— b = sin(c)+0.3*cos(d)
A typical operator-overloading + virtual-function approach would evaluate it in
the following sequence:
— for(i=1,N) tmp[i] = cos(d[i]);
— for(i=1,N) tmp1[i] = 0.3*tmp[i];
— for(i=1,N) tmp3[i] = sin(c[i]);
— for(i=1,N) b[i] = tmp3[i] + tmp1[i];
Panini instead generates optimized Fortran-style code:
— for(i=1,N)
—     b[i] = sin(c[i]) + 0.3*cos(d[i]);
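The two strategies are semantically equivalent, which a small host-side sketch can check directly (function names are illustrative; the temporary-based version mirrors the sequence above, the fused version mirrors the single loop):

```cpp
#include <cassert>
#include <cmath>

// Compare the two evaluation strategies for b = sin(c) + 0.3*cos(d).
// Both produce identical results; the fused version needs no temporary
// arrays and writes each b[i] exactly once.
const int N = 4;

void evalWithTemporaries(const double *c, const double *d, double *b) {
    double tmp[N], tmp1[N], tmp3[N];
    for (int i = 0; i < N; ++i) tmp[i]  = std::cos(d[i]);
    for (int i = 0; i < N; ++i) tmp1[i] = 0.3 * tmp[i];
    for (int i = 0; i < N; ++i) tmp3[i] = std::sin(c[i]);
    for (int i = 0; i < N; ++i) b[i]    = tmp3[i] + tmp1[i];
}

void evalFused(const double *c, const double *d, double *b) {
    for (int i = 0; i < N; ++i)          // one pass, no temporaries
        b[i] = std::sin(c[i]) + 0.3 * std::cos(d[i]);
}
```

On a GPU the fused form also maps naturally to one thread per element, whereas the temporary-based form would launch one kernel (and one global-memory round trip) per intermediate.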
Structure of Arrays
Coalescing improves memory efficiency
Accesses to arrays of arbitrary structures won’t coalesce
Reordering into structure of arrays ensures coalescing
— struct float3 { float x; float y; float z; };
  float3 *aos; ... aos[i].x = 1.0f;
— struct float3_soa { float *x; float *y; float *z; };
  float3_soa soa; ... soa.x[i] = 1.0f;
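The coalescing argument comes down to address strides, which a host-side sketch can make concrete (struct names are illustrative): with AoS, consecutive threads reading `.x` touch addresses a full struct apart; with SoA, the x components are adjacent floats.

```cpp
#include <cassert>
#include <cstddef>

// AoS: thread i reads aos[i].x, so neighbouring threads are
// sizeof(Float3Aos) bytes apart and a warp's loads cannot coalesce.
struct Float3Aos { float x, y, z; };

// SoA: the x components live in one contiguous array, so neighbouring
// threads read adjacent floats and the loads coalesce into few transactions.
struct Float3Soa {
    float *x, *y, *z;
};
```
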
Array of structures
struct Velocity
{
int ux;
int uy;
};
Velocity<FLOW_FIELD_2D> obj_vel(nX, nY, 1,1);
Structure of arrays (Best Practice)
struct Pressure
{
float *pressure;
float *pressureM;
float *pressureN ;
};
gridFlow2D<1> pressureM(nX, nY, 1, 1);
gridFlow2D<1> pressureN(nX, nY, 1, 1);
gridFlow2D<1> pressure(nX, nY, 1, 1);
PlaceHolder Object
Implicit sequences
— placeHolder IX(nX)
Often we need ranges following a sequential pattern:
Constant ranges: [1, 1, 1, 1, ...]
Incrementing ranges: [0, 1, 2, 3, ...]
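A placeholder range can be sketched as a class whose `value(i)` is computed from the index itself, so the sequence occupies no storage and costs no memory traffic (the names `IndexRange`/`ConstRange` are illustrative, not Panini's `placeHolder` API):

```cpp
#include <cassert>

// Incrementing range [0, 1, 2, 3, ...]: the i-th value is just i,
// so nothing is ever stored or loaded from memory.
struct IndexRange {
    int n;                                   // logical length
    int value(int i) const { return i; }
};

// Constant range [c, c, c, ...]: every element is the same scalar.
struct ConstRange {
    int n, c;
    int value(int i) const { return c; }
};
```

Because both classes expose the same `value(i)` interface as a stored vector, they can participate in expression templates like any other operand.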
How Panini Differs from ArrayFire
Static resolution.
The approach used in ArrayFire is easier to implement on the library side, but
it pays a penalty in performance.
Panini is at a very early stage.
Navier-Stokes Example: How Easy It Is to Write
a Scientific Code Using Panini
Data Structure
vectET <double> coordX(nX,0.0);
vectET <double> coordY(nY,0.0);
gridFlow2D<FLOW_FIELD_2D> myGridN (nX, nY, 1, 1);
gridFlow2D<FLOW_FIELD_2D> myGridO ( nX, nY, 1, 1);
gridFlow2D<FLOW_FIELD_2D> myGridM ( nX, nY, 1, 1);
gridFlow2D<1> pressureM (nX, nY, 1, 1);
gridFlow2D<1> pressureN (nX, nY,1, 1);
gridFlow2D<1> pressure (nX, nY, 1, 1);
gridFlow2D<1> potentialO(nX, nY, 1,1);
Initial Condition
myGridN(iX,iY).value(UX) = -2.0*M_PI*kY*phi*cos(coordX[iX]*kX)*sin(coordY[iY]*kY);
myGridN(iX,iY).value(UY) = 2.0*M_PI*kX*phi*sin(coordX[iX]*kX)*cos(coordY[iY]*kY);
pressure(iX,iY) = -M_PI*M_PI*phi*phi*(kY*kY*cos(2.0*coordX[iX]*kX)+kX*kX*cos(2.0*coordY[iY]*kY));
pressureM(iX,iY) -= 0.5*(myGridN(iX,iY).value(UX)*myGridN(iX,iY).value(UX)+myGridN(iX,iY).value(UY)*myGridN(iX,iY).value(UY));
Laplacian Equation
template<int N>
void getLaplacian(gridFlow2D<N> gridVar,gridFlow2D<N> &gridLap, double c3, double c4, int iX, int iY)
{
gridLap(iX,iY) = gridVar(iX,iY) + c4*(gridVar(iX,iY+1) -2.0*gridVar(iX,iY) +gridVar(iX,iY-1))
+ c3*(gridVar(iX+1,iY) -2.0*gridVar(iX,iY) + gridVar(iX-1,iY));
}
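The stencil in getLaplacian can be checked on the host with a plain flat array standing in for gridFlow2D (indexing `iX*NY + iY` is an assumption about the layout, chosen only for this sketch): for f(x, y) = x² on a unit-spaced grid, the centered second difference in x is exactly 2 and the y term vanishes.

```cpp
#include <cassert>

// Host-side version of the 5-point stencil from getLaplacian, with
// gridVar(iX,iY) replaced by indexing into a flat buffer of size NX*NY.
const int NX = 5, NY = 5;

double lap(const double *f, double c3, double c4, int iX, int iY) {
    int id = iX * NY + iY;
    return f[id]
        + c4 * (f[iX * NY + iY + 1] - 2.0 * f[id] + f[iX * NY + iY - 1])
        + c3 * (f[(iX + 1) * NY + iY] - 2.0 * f[id] + f[(iX - 1) * NY + iY]);
}
```
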
Serial Code – CPU Timing

No. of Grid Points    Time for 100 Iterations (sec)    Time for 200 Iterations (sec)
1.00E+04              0.00126063                       0.00126172
4.00E+04              0.00625661                       0.00624585
1.60E+05              0.0439781                        0.044118
2.50E+05              0.0781446                        0.0785149
CPU Timing – MPI Version
MPI Version of Panini Code

No. of Processors    100 Iterations (sec)    200 Iterations (sec)
1                    0.0743643               0.0755658
2                    0.0580707               0.0579703
4                    0.054078                0.0507001
5                    0.0447405               0.0420167
10                   0.0382128               0.0365341
16                   0.0379704               0.0372657
20                   0.0367649               0.0390902
25                   0.0472415               0.0589682
30                   0.0645379               0.0627601
CPU Timing – MPI Version
MPI Version of Panini Code

No. of Processors    100 Iterations (sec)    200 Iterations (sec)
1                    0.00741906              0.00716424
2                    0.00583018              0.00561896
4                    0.00725028              0.0067555
5                    0.00634362              0.0071692
10                   0.00856                 0.010103
16                   0.0102652               0.0102073
20                   0.0114238               0.0103288
25                   0.0104092               0.0103344
30                   0.011299                0.0118635
CPU Timing vs GPU Timing

Grid Size    CPU Timing (sec, 100 iterations)    GPU Timing (sec, 100 iterations)    SpeedUp
100 x 100    0.001260                            0.000441                            2.72x
200 x 200    0.006256                            0.001279                            4.89x
400 x 400    0.043978                            0.004311                            10.09x
Curiously recurring template pattern
This is the core of the vector design.
The basic idea is that every input class
derives from this template base class,
where the template parameter is the
derived class itself.
namespace Panini {
template <typename dataType, class input>
class baseET {
public:
    typedef const input& inputRef;
    // Return a reference to the derived (input) object
    inline operator inputRef() const {
        return *static_cast<const input*>(this);
    }
    inputRef getInputRef() const {
        return static_cast<inputRef>(*this);
    }
    // Every derived class provides value(i); the base forwards to it,
    // resolved at compile time with no virtual dispatch
    __device__ dataType value(const int i) const {
        return static_cast<inputRef>(*this).value(i);
    }
};
} // namespace Panini
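A host-side sketch shows the CRTP dispatch in action (`__device__` dropped and names simplified: `BaseET`, `Vec3`, and `ScalarMult` are illustrative stand-ins for baseET, vectTiny, and scalarMultR):

```cpp
#include <cassert>

// The base's value(i) static_casts *this to the derived type named by the
// template parameter, so the call is bound at compile time: expression
// nodes compose without any virtual table or runtime type checking.
template <typename T, class Input>
struct BaseET {
    T value(int i) const {
        return static_cast<const Input &>(*this).value(i);
    }
};

// A concrete operand, analogous to a small vector.
struct Vec3 : BaseET<double, Vec3> {
    double d[3];
    double value(int i) const { return d[i]; }
};

// An expression node analogous to scalarMultR: lhs.value(i) * scalar.
template <class Lhs>
struct ScalarMult : BaseET<double, ScalarMult<Lhs> > {
    const Lhs &lhs;
    double s;
    ScalarMult(const Lhs &l, double r) : lhs(l), s(r) {}
    double value(int i) const { return lhs.value(i) * s; }
};
```

Calling `value(i)` through a `BaseET` reference forwards to the derived node's `value(i)`, so nested expressions flatten into a single chain of inlined calls.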