Panini: A GPU Aware Array Class - GPU Technology Conference 2012

Panini: A GPU aware Array class
Dr. Santosh Ansumali(JNCASR) & Priyanka Sah (NVIDIA)
Heterogeneous Computing
 CPU
— Multicore
— Multiprocessor
— Cluster of Multicore
 GPU
— CPU + GPU
 MIC
Background
 Programming Efficiency
 Performance
— Scalar
— Parallel
 MATLAB / STL-C++
 Merit – easy to write code
 Demerit – performance issues; the code does not scale
 Array classes (C/C++) – a smarter way of writing code
 Use the object-oriented features of MATLAB and template metaprogramming
Background
 Template metaprogramming
— Blitz++: Fast, Accurate Numerical Computing in C++
 Advanced features (vector, array, matrix)
 FORTRAN
 MATLAB
 Scalable
 Disadvantage – large code size
 POOMA – MPI
Panini – combining parts of MATLAB and Blitz
 Vector initialization as in MATLAB: a = 1, 2, 3
 Expression evaluation: a = αb + βc
— MATLAB way (temporaries are created for each sub-expression):
t1[i] = β*c[i]
t2[i] = α*b[i]
t3[i] = t1[i] + t2[i]
a[i] = t3[i]
— Blitz way (lazy evaluation, sketched in code at the end of this slide):
a[i] = α*b[i] + β*c[i]
— Small array (double a[20]) – complete loop unrolling
— Large array (double *a) – loop unrolling would cause register spilling
 a = αb + βc + µd – requires type checking of the expression
 CRTP – avoids type checking without resorting to virtual functions
 bsum(a,b) – { &a[i], &b[i] }
 a+b+c: bsum of X + X -> tuplesum
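To make the Blitz-style contrast concrete, here is a minimal, self-contained sketch of how an expression object defers a = α*b + β*c to a single loop. The class names (AxpbyExpr, Vec) are illustrative and not part of Panini:

#include <cstddef>

// Lazy node representing alpha*b[i] + beta*c[i]: nothing is computed at
// construction time; value(i) is evaluated only when the assignment loop runs.
struct AxpbyExpr {
    double alpha, beta;
    const double *b, *c;
    double value(std::size_t i) const { return alpha * b[i] + beta * c[i]; }
};

struct Vec {
    double *data;
    std::size_t n;
    // Single fused loop, no temporary arrays.
    Vec& operator=(const AxpbyExpr& e) {
        for (std::size_t i = 0; i < n; ++i) data[i] = e.value(i);
        return *this;
    }
};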
 Vector Initialization
__device__ commaHelper<dataType, N> operator=(dataType val)
{
    for(int i = 0; i < numELEM; i++)
        data[i] = val;
    return commaHelper<dataType, N>(&data[1]);
}
 Type Checking
template<class T, int len>
class commaHelper
{
public:
    __device__ commaHelper() : vPtr(0) {}
    __device__ commaHelper(T *ptr) : vPtr(ptr) {}
    __device__ commaHelper& operator,(T val)
    {
        *vPtr++ = val;
        return *this;
    }
private:
    T *vPtr;
};
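To show how the comma chain is driven, here is a host-side sketch of the enclosing array class. The __device__ qualifiers are dropped and the tinyVec name is illustrative; it mirrors the operator= shown above, which broadcasts the first value and then hands &data[1] to the helper:

template <class T, int N>
class tinyVec {
public:
    // "v = 2;" broadcasts 2 to every slot; "v = 1.2, 3.5, 5.6;" first broadcasts
    // 1.2, then the returned commaHelper overwrites data[1] and data[2] in order.
    commaHelper<T, N> operator=(T val) {
        for (int i = 0; i < N; ++i) data[i] = val;
        return commaHelper<T, N>(&data[1]);
    }
    T operator[](int i) const { return data[i]; }
private:
    T data[N];
};
// Usage: tinyVec<double, 3> v;  v = 1.2, 3.5, 5.6;   // -> {1.2, 3.5, 5.6}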
template<class lhs, typename dataType>
class scalarMultR : public baseET<dataType, scalarMultR<lhs, dataType> > {
public:
    __device__ scalarMultR(const lhs &l, const dataType &r) : lhs_(l), rhs_(r) {}
    __device__ dataType value(int i) const {
        return lhs_.value(i) * rhs_;
    }
private:
    const lhs &lhs_;
    const dataType rhs_;
};
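For scalarMultR to be created by an expression such as c*0.5, a free operator* must wrap the operands into the node. The overload below is a plausible sketch, not necessarily Panini's exact signature; a mirrored overload would handle the scalar-on-the-left case 0.5*c:

// Builds the lazy (expression * scalar) node; no arithmetic happens here.
// value(i) is only invoked when the result is finally assigned element-wise.
template <class lhs, typename dataType>
__device__ scalarMultR<lhs, dataType>
operator*(const baseET<dataType, lhs>& l, const dataType& r)
{
    return scalarMultR<lhs, dataType>(static_cast<const lhs&>(l), r);
}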
Objectives
 Programmer productivity
— Rapidly develop complex applications
— Leverage parallel primitives
 Encourage generic programming
 High performance
— With minimal programmer effort
 Interoperability
— Integrates with CUDA C/C++ code
Panini Library
 A generic parallel array class built on advanced generic-programming methodologies, where the details of parallelization are hidden inside the array class itself.
 Allows the user to work with high-level physical abstractions for scientific computation.
 Expression templates – the technique behind high-performance numerical libraries, in which abstract mathematical notation is expressed via operator overloading in C++.
 Efficiently parallelizable for large-scale scientific codes.
 The expression-template mechanism is implemented with the "Curiously Recurring Template Pattern" (CRTP) in C++.
What is the Panini Library?
 C++ template library for CUDA
 Supported data structures:
— 1D, 2D and complex vectors on CUDA
— 1D, 2D grids with multidimensional data on CUDA
 SoA as well as AoS data structures
 Templates & operator overloading
 Loop unrolling for small-size vectors (see the sketch after this list)
 Lazy evaluation
 Common sub-expression elimination
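A minimal sketch of how compile-time loop unrolling for small, fixed-size vectors can be done with template recursion; this is illustrative only, not Panini's internal code:

// Compile-time unrolled copy: the recursion is resolved by the compiler, so
// assigning an expression to a size-3 vector emits three straight assignments
// instead of a runtime loop.
template <int I>
struct Unroller {
    template <class Dest, class Expr>
    static void assign(Dest& d, const Expr& e) {
        Unroller<I - 1>::assign(d, e);
        d[I - 1] = e.value(I - 1);
    }
};
template <>
struct Unroller<0> {
    template <class Dest, class Expr>
    static void assign(Dest&, const Expr&) {}
};
// Usage inside a small-vector operator= (N known at compile time):
//   Unroller<N>::assign(*this, expr);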
Containers/Objects Supported by Panini
 Small-size vectors on the device
 Large vectors on the device
 Complex arrays, 1D and 2D grids
using namespace Panini;

int main()
{
    int nX = 200;
    int nY = 200;
    vectET<double> coordX(nX, 0.0);
    vectET<double> coordY(nY, 0.0);
    gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
    gridFlow2D<1> pressureM(nX, nY, 1, 1);
    gridFlow2D<1> pressureN(nX, nY, 1, 1);
    gridFlow2D<1> pressure(nX, nY, 1, 1);
    gridFlow2D<1> potentialO(nX, nY, 1, 1);
Basic Features of the vectTiny/vectET Classes
 Direct assignment
— vectTiny<dataType, N> array;
— array = 1.2, 3.5, 5.6;
 Binary operations
 Scalar arithmetic operations
 Math operations on vectTiny objects
 Type checking not required
 Supports single- and double-precision floating-point values, complex numbers, Booleans, and 32-bit signed and unsigned integers
 Supports manipulating vectors, matrices, and N-dimensional arrays
Best Practice
 Structure of Arrays
— Ensures memory coalescing
 Array of Structures
 Implicit Sequences
— Avoid explicitly storing and accessing regular patterns
— Eliminate memory accesses and storage
Data Types
 vectTiny: small fixed-size arrays – the user knows the size in advance
— vectTiny<float, 100> a;
 vectET: large arrays, or 1D / N-dimensional grids (a combined example follows this list)
— e.g. a grid of size 100 where every point carries a fixed array of size 3:
— vectET< vectTiny<myReal, 3> > a(100);
 gridFlow2D: 2D or N-dimensional grids
— gridFlow2D<T, FLOW_FIELD_2D>** myGrid;
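Putting the three container kinds side by side; the declarations follow the slides, while myReal and FLOW_FIELD_2D are assumed to be defined by the application:

vectTiny<myReal, 3>           node;                    // size fixed at compile time
vectET<myReal>                coordX(100, 0.0);        // large 1D vector on the device
vectET< vectTiny<myReal, 3> > field(100);              // 100 grid points, 3 values per point
gridFlow2D<FLOW_FIELD_2D>     myGrid(100, 100, 1, 1);  // 2D grid with a flow field per node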
Allowed Operations
 Three modes of initialization are provided
— vectTiny<double, 3> a = 2;
— vectTiny<double, 3> a; a = 1, 2, 3;
— vectTiny<double, 3> b; b = a;
 All math operations, binary operations and scalar operations
— vectTiny<double, 3> a = 0.1, b, c; b = sin(a); c = a + b; c = 0.5*c;
 All vector operations
— vectTiny<double, 3> a, b, c, d; b = a + sin(c) + 0.3*cos(d);
 Vector operations rely on the following optimizations: loop unrolling (by hand) and lazy evaluation
 References to the operand objects are kept until the final evaluation loop
Allowed Operations (contd.)
 Lazy evaluation
— vectTiny<double, 3> a, b, c, d;
— b = sin(c) + 0.3*cos(d);
 A typical operator-overloading + virtual-function approach would evaluate this in the following sequence:
— for(i=1,N) tmp[i] = cos(d[i]);
— for(i=1,N) tmp1[i] = 0.3*tmp[i];
— for(i=1,N) tmp3[i] = sin(c[i]);
— for(i=1,N) b[i] = tmp3[i] + tmp1[i];
 Panini instead generates optimized Fortran-style code (see the CUDA sketch below):
— for(i=1,N)
—   b[i] = sin(c[i]) + 0.3*cos(d[i]);
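On the GPU the same fusion means the whole right-hand side is evaluated in one kernel, one thread per element, with no temporary arrays in global memory. A plain-CUDA sketch of the generated pattern (illustrative, not Panini-generated code):

// One fused pass: b[i] = sin(c[i]) + 0.3*cos(d[i]) per thread,
// with no intermediate arrays written to global memory.
__global__ void fusedEval(double* b, const double* c, const double* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = sin(c[i]) + 0.3 * cos(d[i]);
}
// Launch example:
//   fusedEval<<<(n + 255) / 256, 256>>>(b, c, d, n);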
Structure of Arrays
 Coalescing improves memory efficiency
 Accesses to arrays of arbitrary structures won’t coalesce
 Reordering into a structure of arrays ensures coalescing
— struct float3 { float x; float y; float z; };
  float3 *aos; ... aos[i].x = 1.0f;
— struct float3_soa { float *x; float *y; float *z; };
  float3_soa soa; ... soa.x[i] = 1.0f;
 Array of structures
struct Velocity
{
    int ux;
    int uy;
};
Velocity<FLOW_FIELD_2D> obj_vel(nX, nY, 1, 1);
 Structure of arrays (best practice)
struct Pressure
{
    float *pressure;
    float *pressureM;
    float *pressureN;
};
gridFlow2D<1> pressureM(nX, nY, 1, 1);
gridFlow2D<1> pressureN(nX, nY, 1, 1);
gridFlow2D<1> pressure(nX, nY, 1, 1);
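A small CUDA sketch of why the SoA layout coalesces while the AoS layout does not; the struct names mirror the earlier slide and the kernel is illustrative:

struct float3_soa { float *x, *y, *z; };

// Consecutive threads read x[i], x[i+1], ... from consecutive addresses,
// so accesses to each component array coalesce into few memory transactions.
__global__ void scaleX_soa(float3_soa soa, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) soa.x[i] *= s;
}

// With the AoS layout (struct float3 { float x, y, z; }; float3* aos;),
// the same access pattern touches only every third float of each cache line,
// wasting memory bandwidth.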
PlaceHolder Object
 Implicit Sequences
— placeHolder IX(nX)
 Often we need ranges that follow a regular pattern (a sketch follows below)
— Constant ranges: [1, 1, 1, 1, ...]
— Incrementing ranges: [0, 1, 2, 3, ...]
How Panini Differs from ArrayFire
 Static resolution.
 The approach used in ArrayFire is easier to develop as a library, but it carries a performance penalty.
 Panini is still at a very early stage.
Navier-Stokes Example: how easy it is to write a scientific code using Panini
 Data Structure
vectET<double> coordX(nX, 0.0);
vectET<double> coordY(nY, 0.0);
gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
gridFlow2D<1> pressureM(nX, nY, 1, 1);
gridFlow2D<1> pressureN(nX, nY, 1, 1);
gridFlow2D<1> pressure(nX, nY, 1, 1);
gridFlow2D<1> potentialO(nX, nY, 1, 1);
 Initial Condition
myGridN(iX,iY).value(UX) = -2.0*M_PI*kY*phi*cos(coordX[iX]*kX)*sin(coordY[iY]*kY);
myGridN(iX,iY).value(UY) = 2.0*M_PI*kX*phi*sin(coordX[iX]*kX)*cos(coordY[iY]*kY);
pressure(iX,iY) = -M_PI*M_PI*phi*phi*(kY*kY*cos(2.0*coordX[iX]*kX)+kX*kX*cos(2.0*coordY[iY]*kY));
pressureM(iX,iY) -= 0.5*(myGridN(iX,iY).value(UX)*myGridN(iX,iY).value(UX)+myGridN(iX,iY).value(UY)*myGridN(iX,iY).value(UY));
 Laplacian Equation
template<int N>
void getLaplacian(gridFlow2D<N> gridVar, gridFlow2D<N> &gridLap, double c3, double c4, int iX, int iY)
{
    gridLap(iX,iY) = gridVar(iX,iY)
                   + c4*(gridVar(iX,iY+1) - 2.0*gridVar(iX,iY) + gridVar(iX,iY-1))
                   + c3*(gridVar(iX+1,iY) - 2.0*gridVar(iX,iY) + gridVar(iX-1,iY));
}
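How the stencil above might be applied over the grid, as a sketch of the call pattern only: the interior loop bounds, the ghost-layer convention, and which grids are passed are assumptions, and the constants c3, c4 come from the discretization.

// Apply the Laplacian stencil at every interior point; the outermost layer
// is assumed to hold boundary values and is skipped here.
for (int iX = 1; iX < nX - 1; ++iX)
    for (int iY = 1; iY < nY - 1; ++iY)
        getLaplacian(pressure, pressureN, c3, c4, iX, iY);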
Serial Code – CPU Timing
No. of Grid Points    Time for 100 Iterations (sec)    Time for 200 Iterations (sec)
1.00E+04              0.00126063                       0.00126172
4.00E+04              0.00625661                       0.00624585
1.60E+05              0.0439781                        0.044118
2.50E+05              0.0781446                        0.0785149
CPU Timing – MPI Version
 MPI Version of Panini Code
No. of Processors    100 Iterations    200 Iterations
1                    0.0743643         0.0755658
2                    0.0580707         0.0579703
4                    0.054078          0.0507001
5                    0.0447405         0.0420167
10                   0.0382128         0.0365341
16                   0.0379704         0.0372657
20                   0.0367649         0.0390902
25                   0.0472415         0.0589682
30                   0.0645379         0.0627601
CPU Timing – MPI Version
 MPI Version of Panini Code
No. of Processors    100 Iterations    200 Iterations
1                    0.00741906        0.00716424
2                    0.00583018        0.00561896
4                    0.00725028        0.0067555
5                    0.00634362        0.0071692
10                   0.00856           0.010103
16                   0.0102652         0.0102073
20                   0.0114238         0.0103288
25                   0.0104092         0.0103344
30                   0.011299          0.0118635
CPU Timing vs GPU Timing
No. of Grid Points    CPU Timing (sec), 100 iterations    GPU Timing (sec), 100 iterations    SpeedUp
100 x 100             0.001260                            0.000441                            2.72x
200 x 200             0.006256                            0.001279                            4.89x
400 x 400             0.043978                            0.004311                            10.09x
Curiously Recurring Template Pattern (CRTP)
 This is the core of the vector design.
 The basic idea is that every input class derives from this templated base class, where the template parameter is the derived class itself.
namespace Panini {

template <typename dataType, class input>
class baseET {
public:
    typedef const input& inputRef;
    // Return a reference to the derived object
    inline operator inputRef() const {
        return *static_cast<const input*>(this);
    }
    inputRef getInputRef() const {
        return static_cast<inputRef>(*this);
    }
    // Every derived class provides a member value(i)
    __device__ dataType value(const int i) const {
        return static_cast<inputRef>(*this).value(i);
    }
};

} // namespace Panini
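A derived expression class plugs into this pattern by naming itself as the second template argument of baseET, exactly as scalarMultR did earlier. The sketch below uses an illustrative vecSum class, not an actual Panini type:

// CRTP: vecSum derives from baseET<dataType, vecSum<...> >, so baseET can
// static_cast itself to the derived type and call its value(i) with no
// virtual dispatch at run time.
template <class lhs, class rhs, typename dataType>
class vecSum : public baseET<dataType, vecSum<lhs, rhs, dataType> > {
public:
    __device__ vecSum(const lhs& l, const rhs& r) : lhs_(l), rhs_(r) {}
    __device__ dataType value(int i) const {
        return lhs_.value(i) + rhs_.value(i);
    }
private:
    const lhs& lhs_;
    const rhs& rhs_;
};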