Templates and All That

Balint Joo

Jefferson Lab, Newport News, VA

for HackLatt'09, Edinburgh, May 2009

Outline

• C++ Templates in General

– Basic Templates

– Specializations

• Funky Templates

– Traits

– Type computations

– Expression Templates, and PETE

• Optimizing Expressions in QDP++

– Site by site specializations

– evaluate() specializations

– Hooking in threading

C++ Templates in General

• Templates allow you to perform 'substitutions' in code at compile time

• Typical Use: Containers for several types

template< typename T >
class Bag {
public:
  void insert( const T& item ) {
    // Code goes here ...
  }
};

int main(int argc, char *argv[])
{
  Bag<float> BagOfFloats;
  Bag<int>   BagOfInts;

  int i=5; float f=6.0;

  BagOfFloats.insert( f );
  BagOfInts.insert( i );
  BagOfInts.insert( f ); // NONONO! Unless automatic conversion
}

'Value' Templates

• Bag<T> was templated only on types. Can also template on values

• This is useful e.g. if I want to pick a container size at compile time, but it will stay fixed after that

template< typename T, int N >
class FixedSizedBag {
public:
  void insert( const T& item, int position ) {
    // Code goes here ...
  }
};

int main(int argc, char *argv[])
{
  FixedSizedBag<float,2> BagOfTwoFloats;

  float f1=6.0;
  float f2=3.0;

  BagOfTwoFloats.insert(f1, 0); // Insert at position 0
  BagOfTwoFloats.insert(f2, 1); // Insert at position 1
}
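• As an aside, one natural way to use the value parameter inside such a class (my own sketch filling in the 'code goes here' stub, not code from any particular library) is as a fixed array dimension:

template< typename T, int N >
class FixedSizedBag {
public:
  void insert( const T& item, int position ) {
    data[position] = item;   // position is assumed to lie in [0, N)
  }
private:
  T data[N];                 // size fixed at compile time by the value parameter N
};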

Templating Classes, Functions

• We can template functions as well as classes

#include <iostream>

template< typename T >
void doSomethingWithT( const T& input )
{
  std::cout << "T is " << input << std::endl;
}

• NB: a definition of operator<< must be available that can print a type T – so called 'duck typing'

class MyFunnyType {} ; // Empty class

int main(int argc, char *argv[])
{
  int i=6 ;       doSomethingWithT(i);  // OK: << handles int
  float f=5.0 ;   doSomethingWithT(f);  // OK: << handles float
  MyFunnyType mf; doSomethingWithT(mf); // BARF!!!!!!
}
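• To make the last call compile, it is enough to supply a matching operator<<; a minimal sketch (the overload and its message are my own illustration):

#include <iostream>

class MyFunnyType {}; // Empty class, as above

// Supplying this overload satisfies the 'duck typing' requirement:
// doSomethingWithT(mf) now compiles and prints the message below.
std::ostream& operator<<(std::ostream& os, const MyFunnyType&)
{
  return os << "a MyFunnyType with nothing in it";
}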

Class Specialization

• One can 'customize' the code of a template for specific template values – this is called specialization

template<class T>
class FancyPrinter {
public:
  FancyPrinter(const T& input) {
    std::cout << "Fancy Printer: " << input << std::endl;
  }
};

// SPECIALIZE The Entire Class
template<>
class FancyPrinter< MyFunnyType > {
public:
  FancyPrinter(const MyFunnyType& input) {
    std::cout << "Fancy Printer: trying to print FunnyType"
              << std::endl;
  }
};
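• A quick usage sketch (my own illustration) of which version the compiler picks:

int main(int argc, char *argv[])
{
  FancyPrinter<int> p1(42);          // generic template: prints the value
  MyFunnyType mf;
  FancyPrinter<MyFunnyType> p2(mf);  // specialization above: prints the fixed message
}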

Class Member Specialization

• I can specialize individual member functions of a class template, e.g.:

template<class T>
class FancyPrinter {
public:
  FancyPrinter(const T& input) {
    std::cout << "Fancy Printer: " << input << std::endl;
  }
};

// SPECIALIZE The constructor
template<>
FancyPrinter< MyFunnyType >::FancyPrinter(const MyFunnyType& input)
{
  std::cout << "Fancy Printer: trying to print FunnyType"
            << std::endl;
}

A word about the STL & Boost

• At this point it is worth mentioning that various container types have already been implemented for you in the Standard Template Library (STL) and the Boost libraries

– NB: parts of Boost have been adopted into TR1 and may make it into the standard C++ library when it is next revised

• In QDP++, because of their parallel nature, we didn't use the STL but tried to keep some of its feel...

– STL prescribes a lot of detail for things like 'iterators' which allow generic traversal of STL containers. This is a lot of overhead for us, so we left it out....

• The full template rules are complex; see Stroustrup, the Web, etc. if you want the minutiae.

Traits...

• Classes can 'export' internal types:

class SomeClass {
public:
  typedef float MyFloatType;
};

// This declares a float really
SomeClass::MyFloatType t=5.6;

• MyFloatType is a 'type trait' of SomeClass

• A Traits Class can hold several traits:

class SomeClass {
public:
  typedef float MyFloatType;
  static const int FloatSize = sizeof(MyFloatType);
};

std::cout << "The Size of SomeClass::MyFloatType is: "
          << SomeClass::FloatSize << std::endl;

More useful traits

• Traits become powerful when combined with templating:

template <typename T>
class MyTraits {
public:
  typedef T MyType;
  static const int MyTypeSize = sizeof(T);
};

• Can now write generic code in terms of 'MyType' and control, say, memory copies using 'MyTypeSize' – see the sketch below

• Traits are heavily used in the STL and standard libraries.
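• For instance, a minimal sketch (my own illustration, not QDP++ code) of a raw copy routine driven entirely by the traits class:

#include <cstring>

// Copy 'count' elements, taking the element size from the traits class
// rather than from sizeof(T) directly.
template <typename T>
void rawCopy(T* dest, const T* src, int count)
{
  std::memcpy(dest, src, count * MyTraits<T>::MyTypeSize);
}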

Type Computing Traits

• One can construct a set of templates to 'compute' about types:

template< typename T >        // Catchall case... Empty...
struct DoublePrecType {       // Will cause a compiler error if one tries
};                            // to access its nonexistent members

template<>
struct DoublePrecType<float> {   // Specialization for floats
  typedef double Type_t;         // this is the trait...
};

template<>
struct DoublePrecType<double> {  // Specialization for doubles
  typedef double Type_t;         // This is the trait
};

std::cout << "Sizeof DP(float): "
          << sizeof( DoublePrecType<float>::Type_t ) << std::endl;

std::cout << "But not for ints: " << sizeof( DoublePrecType<int>::Type_t )
          << std::endl; // This'll barf: matches the struct with no Type_t
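• The payoff is that generic code can name the computed type directly; a minimal sketch of that idea (my own illustration):

// Promote a supported type to its double precision version,
// as computed by DoublePrecType above.
template<typename T>
typename DoublePrecType<T>::Type_t promote(const T& x)
{
  return static_cast<typename DoublePrecType<T>::Type_t>(x);
}

// promote(1.5f) yields a double; promote(1) fails to compile,
// since DoublePrecType<int> exports no Type_t.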

Recursive Traits

• Type computations can be recursive:

#include <vector>
using namespace std;

template< typename T >          // Catchall case. Hopefully never
struct DoublePrecType {         // instantiate this
};

template<>                      // Base case for floats
struct DoublePrecType<float> {
  typedef double Type_t;
};

template<typename T>            // Recursive case for vectors
struct DoublePrecType< vector< T > > {
  typedef vector< typename DoublePrecType< T >::Type_t > Type_t;
};

DoublePrecType< vector< float > >::Type_t doublevec(5);
cout << "doublevec[0] has size=" << sizeof( doublevec[0] ) << endl;
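• Because the vector specialization refers back to DoublePrecType<T>, the recursion nests to any depth; one more line following the same pattern (my own illustration):

// vector< vector<float> > maps to vector< vector<double> >:
// the outer specialization recurses into the inner one.
DoublePrecType< vector< vector< float > > >::Type_t nestedvec;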

Some QDP++ Traits

• Basic QDP++ Traits:

– WordType<T>::Type_t - the innermost word (int, float)

– SinglePrecType<T>::Type_t single precision version of T

– DoublePrecType<T>::Type_t double precision versions of T

– QIOStringTraits<T > - holds strings needed by QIO functions

• QIOStringTraits<T>::tname = “Lattice” or “Scalar”

• QIOStringTraits<T>::tprec = “U” “I” “F” etc...

• More advanced traits that arise in Expression Templates

– UnaryReturn< T, Op>::Type_t - the return type produced by a Unary function Op acting on T

– BinaryReturn<T1, T2, Op>::Type_t - the return type produced by a Binary Operator Op acting on inputs of type T1 and T2 respectively

• In all these cases, if we try to use a trait for which there is no specialization (only the 'catchall' case), we will get horrible compiler errors...
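• As a flavour of how such traits are used in generic code, here is a small usage sketch; it assumes only what the bullets above say about WordType and is not copied from QDP++:

// Report the size in bytes of the innermost word of any QDP++ type T,
// e.g. to size a communications or I/O buffer.
template<typename T>
int bytesPerWord()
{
  return sizeof( typename WordType<T>::Type_t );
}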

Expression Templates

• Basic Idea:

– Want to 'wrap up' expressions at compile time, so that they become 'straightforward' to evaluate

– Want to be able to 'specialize' some expressions using 'faster' code

[Figure: the expression y = a*x is captured as a QDPExpr; its tree is a BinaryNode tagged with OpMultiply whose children are a and x, and the result is assigned to y]

The Basic QDP++ Expression

• The basic QDP++ Expression Template is deceptively simple (qdp++/include/qdp_qdpexpr.h)

template<class T, class C>
class QDPExpr {
public:
  typedef T Subtype_t;     //! Type of the first argument.
  typedef T Expression_t;  //! Type of the expression.
  typedef C Container_t;   //! Type of the container class

  //! Construct from an expression.
  QDPExpr(const T& expr) : expr_m(expr)
  { }

  //! Accessor that returns the expression.
  const Expression_t& expression() const
  {
    return expr_m;
  }

private:
  T expr_m;
};

• QDPExpr 'anonymizes' expressions of type T

• The class exports 3 type traits and, apart from its constructor, only one public function (expression())

For PETE's Sake

• PETE provides 'Tree node' classes to help build the expression tree.

– Reference<T> - holds a reference to a type

• const T& reference() const; allows access to held value

– UnaryNode<Op,Child> - wraps a Unary Expression

• DeReference<Child>::Return_t child() const; can be used to access 'child' subexpression

– BinaryNode<Op, Left, Right> - Wraps a binary expression

• DeReference<Left>::Return_t left() const;

• DeReference<Right>::Return_t right() const;

– can be used to access left and right subexpressions
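• Putting these node classes together, the tree for a product like a*x has, schematically, the following type (an illustration of the pattern, using the TScal/TVec shorthand that reappears in the evaluate() specialization later in these slides):

// Schematic tree type for  a * x : a BinaryNode tagged with OpMultiply
// whose children are Reference wrappers around the two operands.
typedef BinaryNode< OpMultiply,
                    Reference< QDPType< TScal, OScalar< TScal > > >,
                    Reference< QDPType< TVec,  OLattice< TVec > > > > AxTree_t;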

More PETE-isms

• PETE also provides (specifies) two classes to wrap arbitrary expressions into QDPExpr classes:

– MakeReturn< T, C >

• Takes an expression T and wraps it into QDPExpr<T,C> using its publicly declared 'make()' function

• To allow generic coding, MakeReturn<T,C> exports an Expression_t type, which is just QDPExpr<T,C>

– CreateLeaf< QDPExpr<T, C> >

• Takes the expression and extracts T from it using its publicly declared 'make()' function

• To allow generic coding, CreateLeaf< QDPExpr<T,C> > exports a Leaf_t type (which is really just T in a roundabout way)

Building The Tree

• The expression tree is built by overloading functions and operators like operator*(), e.g.:

template<class T1,class C1,class T2,class C2>
inline typename MakeReturn< BinaryNode<OpMultiply,
  typename CreateLeaf<QDPType<T1,C1> >::Leaf_t,
  typename CreateLeaf<QDPType<T2,C2> >::Leaf_t>,
  typename BinaryReturn<C1,C2,OpMultiply>::Type_t >::Expression_t
operator*(const QDPType<T1,C1>& l, const QDPType<T2,C2>& r)
{
  // This is the node for *
  typedef BinaryNode<OpMultiply,
    typename CreateLeaf<QDPType<T1,C1> >::Leaf_t,
    typename CreateLeaf<QDPType<T2,C2> >::Leaf_t> Tree_t;

  // Return type = container
  typedef typename BinaryReturn<C1,C2,OpMultiply>::Type_t Container_t;

  return MakeReturn< Tree_t, Container_t >::make( Tree_t(
    CreateLeaf<QDPType<T1,C1> >::make(l),
    CreateLeaf<QDPType<T2,C2> >::make(r)) );
}

What the ?????

• Despite its extremely complex looks, the function is simple.

– Take 2 QDPType-s

– Use CreateLeaf< QDPType<T,C> >::make() to turn them into leaf expressions

– Use BinaryNode< OpMultiply, Leaf1, Leaf2 > (aliased to Tree_t) to wrap the leaves into a binary node for multiplication

– Use the MakeReturn< Tree_t, Container_t >::make() function to wrap the binary node into a QDPExpr and return it

• If this is simple, what's complicated?

– This is kind of hard to read and error prone to write.

– PETE does the hard work by writing this for us

When and how do we evaluate?

• Evaluation can happen in two ways:

– one calls evaluate() on the QDPExpr

• in operators =, +=, -=, etc...

• this can be in architecture specific files eg:

– qdp_scalar_specific.h, qdp_parscalar_specific.h

– during the creation/fusing of QDPExpr-s

• E.g.: in the generic sum routine

template<class RHS, class T>
typename UnaryReturn<OScalar<T>, FnSum>::Type_t
sum(const QDPExpr<RHS, OScalar<T> >& s1, const Subset& s)
{
  typename UnaryReturn<OScalar<T>, FnSum>::Type_t d;

  evaluate(d, OpAssign(), s1, all); // since OScalar, no global sum needed
  return d;
}
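• For orientation, the generic evaluate() that such calls land in is essentially a loop over sites that asks PETE to evaluate the tree at each site. The following is a schematic sketch of that shape using PETE's forEach/EvalLeaf1/OpCombine machinery, not the literal code from qdp_scalar_specific.h (which also handles unordered subsets):

// Schematic: apply 'op' (e.g. OpAssign) between each site of 'dest'
// and the value of the expression tree 'rhs' evaluated at that site.
template<class T, class Op, class RHS, class C>
void evaluate_sketch(OLattice<T>& dest, const Op& op,
                     const QDPExpr<RHS, C>& rhs, const Subset& s)
{
  for(int i = s.start(); i <= s.end(); ++i) {
    // forEach walks the tree: EvalLeaf1(i) tells each leaf which site
    // to read, OpCombine applies the operation tags at the nodes.
    op( dest.elem(i), forEach(rhs, EvalLeaf1(i), OpCombine()) );
  }
}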

Final note on Expression Templates

• This is of course, not the whole story.

– I haven't really explained why the MakeReturn, CreateLeaf etc are needed

– I haven't really explained why they export the types they export.

– These are beyond the scope of a simple lecture

• Hopefully I have shown you the main types (nodes, etc)

• Armed with this knowledge we know enough to hook in optimizations.

• For more details on PETE you can read:

– "Disambiguated Glommable Expression Templates" by G. Furnish, Computers in Physics, Vol. 11, Issue 3, 1997, pages 263-269

– Online version available here

– NERSC still has a web page on PETE where there is some documentation

– We have a copy of the PETE source code in QDP++ with some tweaks...

How do I slot in optimized routines?

• Typically you will want to specialize either an evaluate() or a particular operator (e.g. +, -, etc.) for a site.

• There are 3 main sets of 'optimizations' for so-called 'scalarsite' targets (scalar nodes, possibly connected):

– include/scalarsite_sse/

• SSE Optimizations

– include/scalarsite_bagel_qdp/

• BAGEL Optimizations

– include/scalarsite_generic/

• Generic C Optimizations

• Toplevel control is through include files:

– include/qdp_scalarsite_sse.h - #includes files from scalarsite_sse

– include/qdp_scalarsite_generic.h - #includes files from scalarsite_generic

– include/qdp_scalarsite_bagel_qdp.h

• Which of these is used is controlled by Autoconf/Automake #defines in qdp.h

Beware: This can get messy

• Bagel QDP optimizations are perhaps the cleanest

– (all functions called live in an external library)

• SSE Optimizations are perhaps the messiest

– Some called functions live in .h files in scalarsite_sse

– Some functions live in lib/scalarsite_sse/

– Some functions live in a separate library (libintrin) provided in qdp++/other_libs/libintrin/

Quasi-Conventions

• In scalarsite_xxx/ we typically have the following .h files:

– qdp_scalarsite_xxx_linalg.h

• SU(3) Matrix* SU(3) Matrix, SU(3) Matrix* Color Vec etc.

– qdp_scalarsite_xxx_blas.h

• a*x + y, a*x+b*y, norm2(), innerProduct() etc, a,b real

– qdp_scalarsite_xxx_blas_g5.h

• a*x + gamma_5 y type operations

– qdp_scalarsite_xxx_cblas.h

• a*x+y, a*x + b*y etc, a, b are complex.

– Plus potentially double prec. versions

(qdp_scalarsite_xxx_blas_double.h)

– Plus potentially threading wrappers

(qdp_scalarsite_xxx_blas_wrapper.h)

Specializing Site By Site

• Example Matrix*Matrix specialization from qdp_scalarsite_sse_linalg_double.h

#include "sse_linalg_mm_su3_double.h" inline BinaryReturn<PMatrix<RComplex<REAL64>,3,PColorMatrix>,

PMatrix<RComplex<REAL64>,3,PColorMatrix>, OpMultiply>::Type_t operator*( const PMatrix<RComplex<REAL64>,3,PColorMatrix>& l , const PMatrix<RComplex<REAL64>,3,PColorMatrix>& r )

{

BinaryReturn<PMatrix<RComplex<REAL64>,3,PColorMatrix>,

PMatrix<RComplex<REAL64>,3,PColorMatrix>, OpMultiply>::Type_t d;

REAL64* lm = (REAL64 *) &(l.elem(0,0).real());

REAL64* rm = (REAL64 *) &(r.elem(0,0).real());

REAL64* dm = (REAL64 *) &(d.elem(0,0).real());

Return value ssed_m_eq_mm_u(dm,lm,rm,1);

return d;

}

Call 'optimized' function

Unwrap 

QDP Types

Notes on Site by Site Optimization

• Site result is returned on the stack

– This can cause problems...

– In principle, a good compiler can pass in the return value (return value optimization / copy elision).

• Your optimized routine had better be good

– We pass pointers to the operands

– Potential aliasing hazard

– Can prevent compiler from optimizing your 'leaf routine'

– Use a compiler flag to tell the compiler to assume your pointers do not alias (eg for gcc: -fargument-noalias-global)

• Where does one get a chance to prefetch?

– Only potentially in a surrounding loop...

Specializing on evaluate()

• AXPY operation (y += a*x) from qdp_scalarsite_bagel_qdp_blas.h

inline void evaluate( OLattice< TVec >& d,
                      const OpAddAssign& op,
                      const QDPExpr< BinaryNode< OpMultiply,
                          Reference< QDPType< TScal, OScalar< TScal > > >,
                          Reference< QDPType< TVec, OLattice< TVec > > > >,
                        OLattice< TVec > >& rhs,
                      const Subset& s)
{
  // Unwrap the binary node... get a & x
  const OLattice< TVec >& x
    = static_cast<const OLattice< TVec >&>( rhs.expression().right() );
  const OScalar< TScal >& a
    = static_cast<const OScalar< TScal >&>( rhs.expression().left() );

  REAL ar = a.elem().elem().elem().elem(); REAL* aptr = &ar;

  if( s.hasOrderedRep() ) {
    // Ordered (contiguous) subset: get pointers and call the
    // optimized function on a whole subset's worth of work
    REAL* xptr = (REAL *)&(x.elem(s.start()).elem(0).elem(0).real());
    REAL* yptr = &(d.elem(s.start()).elem(0).elem(0).real());
    int n_3vec = (s.end()-s.start()+1)*4 ;

    qdp_vaxpy3(yptr, aptr, xptr, yptr, n_3vec);
  }
  else {
    // ... and the unordered half: must loop through the subset's site table
    const int* tab = s.siteTable().slice();
    for(int j=0; j < s.numSiteTable(); j++) {
      int i = tab[j];

      // Unwrap pointers for site i in the table
      REAL* xptr = (REAL *)&(x.elem(i).elem(0).elem(0).real());
      REAL* yptr = &(d.elem(i).elem(0).elem(0).real());

      // Call the optimized function, for one site
      qdp_vaxpy3(yptr, aptr, xptr, yptr, 4);
    }
  }
}

• Note: This structure is quite generic

– Two branches for 'contiguous' and 'non-contiguous' subsets

• For non-contiguous cases, the same caveats about pointer aliasing apply (see above).

Hooking In Threads...

• Wrap the optimized function so that it can be dispatched to threads...

– The QMT style of dispatch is (and OpenMP can do this too):

qmt_call( (qmt_userfunc_t)func, int N /* # sites */, (void *)args );

– The QMT user function looks like:

void func( int lo,             // low site index for this thread
           int hi,             // high site index; qmt_call() computes lo/hi from N
           int myId,           // my thread ID (given to me by qmt_call())
           MyArgType* args );  // user arguments wrapped in a struct

Hooking in Threads

• So I need to define a func() and an Argument struct

• e.g.: for AXPZ (qdp_scalarsite_sse_blas_wrapper.h)

struct ordered_sse_vaxOpy3_user_arg {
  REAL32* Out;      // y
  REAL32* scalep;   // a
  REAL32* InScale;  // x
  REAL32* Opt;      // z
  void (*func)(REAL32*, REAL32*, REAL32*, REAL32*, int);
};

inline void ordered_sse_vaxOpy3_evaluate_function(
  int lo, int hi, int myId, ordered_sse_vaxOpy3_user_arg* a)
{
  // Unwrap the args
  REAL32* Out     = a->Out;     REAL32* scalep = a->scalep;
  REAL32* InScale = a->InScale; REAL32* Opt    = a->Opt;
  void (*func)(REAL32*, REAL32*, REAL32*, REAL32*, int) = a->func;

  // Set up pointers and n_3vec from hi/lo
  int n_3vec = (hi - lo); int index = lo;
  InScale = &InScale[index];
  Opt = &Opt[index];
  Out = &Out[index];

  // Call the optimized func
  func(Out, scalep, InScale, Opt, n_3vec);
}

Hooking In Threads

• Now that we have the user function and the args structure, we hook them into the specialized expression templates

• The Example below is from qdp_scalarsite_sse_blas.h

– I've reproduced just the bits for the 'contiguous' AXPY case below...

// Get pointers
REAL32 ar = a.elem().elem().elem().elem(); REAL32* aptr = &ar;

if( s.hasOrderedRep() ) {
  REAL32* xptr = (REAL32 *)&(x.elem(s.start()).elem(0).elem(0).real());
  REAL32* yptr = &(d.elem(s.start()).elem(0).elem(0).real());
  int total_n_3vec = (s.end()-s.start()+1)*24;

  // Set up the user args...
  ordered_sse_vaxOpy3_user_arg arg = {yptr, aptr, xptr, yptr, vaxpy3};

  // Dispatch
  dispatch_to_threads(total_n_3vec, arg,
                      ordered_sse_vaxOpy3_evaluate_function);

  // ... (unordered branch omitted)
}

Collectives

• Threading collectives is a little more involved

– Each thread reduces into its own slot of norm2_results (for norm2()), and then we sum over those slots at the end...

– norm2_results is allocated at startup, sized to the maximum number of threads

int n_real = (s.end() - s.start() + 1)*24;

ordered_sse_norm_single_user_arg arg;
arg.vptr = (REAL32*) &(s1.elem(s.start()).elem(0).elem(0).real());
arg.results = ThreadReductions::norm2_results;   // REAL64 array[N_THREADS]
arg.func = local_sumsq_24_48;

// Threads compute partial results
dispatch_to_threads(n_real, arg, ordered_norm_single_func);

// Sum the partial results
REAL64 ltmp = arg.results[0];
for(int i=1; i < qdpNumThreads(); i++) {
  ltmp += arg.results[i];
}

Internal::globalSum(ltmp);

Thanks & Miscellany

• We'd like to thank Xu Guo and EPCC for the majority of the work on integrating the OpenMP and QMT threading into QDP++

– I merely added the norm2() and innerProduct() collectives and replicated the structure for some of the double precision routines...

• Templates and g++ have one annoying aspect

– The template definition and declaration seem as though they must appear in the same compilation unit → typically this means template classes must live entirely in header files (splitting the code off into .cc files can produce linkage errors)

– This behaviour is essentially compiler dependent (e.g.: xlC can create a 'template registry' instead.)

– This may be 'fixed' in the next C++ standard...
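• One common workaround (my own note, not from the slides, and only viable when the needed instantiations are known up front) is explicit instantiation in the .cc file:

// Bag.cc: member definitions of Bag<T> live here instead of the header.
// Explicitly instantiating the template for the types actually used forces
// the compiler to emit their code into this compilation unit,
// so the linker can find it.
template class Bag<float>;
template class Bag<int>;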

A template trick...

• Sometimes you can hide the fact that something is templated.

• E.g.: Suppose you want a solver to work for both single and double precision. You can declare both in the .h file as non-templated functions

– (from chroma/lib/actions/ferm/invert/invcg2.h)

// Single precision

SystemSolverResults_t

InvCG2(const LinearOperator<LatticeFermionF>& M,

const LatticeFermionF& chi,

LatticeFermionF& psi,

const Real& RsdCG,int MaxCG);

// Double precision

SystemSolverResults_t

InvCG2(const LinearOperator<LatticeFermionD>& M,

const LatticeFermionD& chi,

LatticeFermionD& psi,

const Real& RsdCG,int MaxCG);

A template trick...

• Then in the .cc file, define the templated solver, and wrap it for both declared interfaces:

template<typename T, typename C>
SystemSolverResults_t
InvCG2_a(const C& M, const T& chi, T& psi,
         const Real& RsdCG, int MaxCG)
{ ... code omitted for lack of space ... }

// Single precision Wrapper as declared in .h file.

SystemSolverResults_t

InvCG2(const LinearOperator<LatticeFermionF>& M,

const LatticeFermionF& chi,

LatticeFermionF& psi,

const Real& RsdCG,int MaxCG)

{

return InvCG2_a(M, chi, psi, RsdCG, MaxCG);

}

// And likewise for LatticeFermionD ...

Summary

• We discussed templates in general including template specialization

• We considered using public template class members and exported types to encode 'traits' into traits classes.

• We considered Expression Templates and how these can be assembled using operator overloading and particularly complicated templates

• We discussed where to hook in specializations into QDP++ to optimize expression templates at the site and evaluate level

• We discussed how the current threading implementation works in QDP++

• At this point you should be well equipped to optimize and thread in QDP++ to your heart's content!

Suggested Reading

• If you have the time:

– "Disambiguated Glommable Expression Templates" by G. Furnish, Computers in Physics, Vol. 11, Issue 3, 1997, pages 263-269, or here.

– The PETE documentation: here .

– Alexandrescu, “Modern C++ Design”

• Also available online from Google Books

– The QDP++ Source Code...
