Balint Joo
Jefferson Lab, Newport News, VA for
HackLatt'09, Edinburgh, May 2009
• C++ Templates in General
– Basic Templates
– Specializations
• Funky Templates
– Traits
– Type computations
– Expression Templates, and PETE
• Optimizing Expressions in QDP++
– Site by site specializations
– evaluate() specializations
– Hooking in threading
• Templates allow you to perform 'substitutions' in code at compile time
• Typical Use: Containers for several types
template< typename T >
class Bag {
public:
  void insert( const T& item ) {
    // Code goes here …
  }
};
int main(int argc, char *argv[])
{
Bag<float> BagOfFloats;
Bag<int> BagOfInts;
int i=5; float f=6.0;
BagOfFloats.insert( f );
BagOfInts.insert( i );
BagOfInts.insert( f ); // NONONO! Compiles only via the automatic float-to-int conversion
}
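To make the Bag idea concrete, here is a minimal sketch that fills in the elided body. It assumes the items are kept in a std::vector, which is an illustrative choice (the slide does not say how Bag stores anything), and the size() and get() members are hypothetical additions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical Bag: the std::vector storage is an assumption for illustration.
template <typename T>
class Bag {
public:
  // Add one item to the bag.
  void insert(const T& item) { items_.push_back(item); }

  // Number of items inserted so far.
  std::size_t size() const { return items_.size(); }

  // Read back the i-th inserted item (bounds checked in debug builds).
  const T& get(std::size_t i) const {
    assert(i < items_.size());
    return items_[i];
  }

private:
  std::vector<T> items_;
};
```

The compiler generates a separate class for each T that is used, so Bag<float> and Bag<int> are unrelated types.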
• Bag<T> was templated only on types. Can also template on values
• This is useful e.g. if I want to pick a container size at compile time, but it will stay fixed after that
template< typename T, int N >
class FixedSizedBag {
public:
  void insert( const T& item, int position ) {
    // Code goes here …
  }
};
int main(int argc, char *argv[])
{
FixedSizedBag<float,2> BagOfTwoFloats;
float f1=6.0;
float f2=3.0;
BagOfTwoFloats.insert(f1, 0); // Insert at position 0
BagOfTwoFloats.insert(f2, 1); // Insert at position 1
}
• We can template functions as well as classes
#include <iostream>
template< typename T >
void doSomethingWithT( const T& input )
{
  std::cout << "T is " << input << std::endl;
}
• NB: a definition of operator<< must be available that can print a type T – so called 'duck typing'
class MyFunnyType {} ; // Empty class
int main(int argc, char *argv[])
{
  int i=6 ; doSomethingWithT(i); // OK: << handles int
  float f=5.0 ; doSomethingWithT(f); // OK: << handles float
  MyFunnyType mf; doSomethingWithT(mf); // BARF!!!!!! No operator<< for MyFunnyType
}
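One way to make the MyFunnyType call compile is to supply the missing operator<<. The sketch below assumes a made-up printed form ("<MyFunnyType>") and adds a hypothetical output-stream parameter and describe() helper so that the result can be captured in a string.

```cpp
#include <iostream>
#include <sstream>
#include <string>

class MyFunnyType {};  // Empty class, as on the slide

// Hypothetical printer: the printed text is an arbitrary choice.
std::ostream& operator<<(std::ostream& os, const MyFunnyType&) {
  return os << "<MyFunnyType>";
}

// Same template as on the slide, with the stream made a parameter.
template <typename T>
void doSomethingWithT(const T& input, std::ostream& os = std::cout) {
  os << "T is " << input << std::endl;
}

// Hypothetical helper: run the template and return what it printed.
template <typename T>
std::string describe(const T& input) {
  std::ostringstream oss;
  doSomethingWithT(input, oss);
  return oss.str();
}
```

The template itself is unchanged in spirit: it works for any T that "quacks" like something printable.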
• One can 'customize' the code of a template for specific template values – this is called specialization
template<class T>
class FancyPrinter {
public:
  FancyPrinter(const T& input) {
    std::cout << "Fancy Printer: " << input << std::endl;
  }
};
// SPECIALIZE The Entire Class
template<>
class FancyPrinter< MyFunnyType > {
public:
  FancyPrinter(const MyFunnyType& input) {
    std::cout << "Fancy Printer: trying to print FunnyType"
              << std::endl;
  }
};
• I can specialize individual member functions of a class template e.g.:
template<class T>
class FancyPrinter {
public:
  FancyPrinter(const T& input) {
    std::cout << "Fancy Printer: " << input << std::endl;
  }
};
// SPECIALIZE The constructor
template<>
FancyPrinter< MyFunnyType >::FancyPrinter(const MyFunnyType& input)
{
  std::cout << "Fancy Printer: trying to print FunnyType"
            << std::endl;
}
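The same specialization mechanism can be checked without looking at console output. This sketch uses a hypothetical FancyFormatter (not part of the slides) that returns a string, with a full specialization for MyFunnyType.

```cpp
#include <sstream>
#include <string>

class MyFunnyType {};  // Empty class with no operator<<

// Primary template: needs operator<< to exist for T.
template <class T>
struct FancyFormatter {
  static std::string format(const T& input) {
    std::ostringstream oss;
    oss << "Fancy Printer: " << input;
    return oss.str();
  }
};

// Full specialization: chosen whenever T is MyFunnyType, so the
// primary template body is never instantiated for that type.
template <>
struct FancyFormatter<MyFunnyType> {
  static std::string format(const MyFunnyType&) {
    return "Fancy Printer: trying to print FunnyType";
  }
};
```

Since the primary template is only instantiated for types that actually use it, the missing operator<< for MyFunnyType never causes an error here.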
• At this point it is worth mentioning that various container types have been implemented for you in the Standard Template Library and the Boost Library
– NB: Parts of Boost have been included in TR1 and may make it into the 'standard C++ library' in the next revision of the standard
• In QDP++, because of their parallel nature, we didn't use the STL but tried to keep some of its feel...
– STL prescribes a lot of detail for things like 'iterators' which allow generic traversal of STL containers. This is a lot of overhead for us, so we left it out....
• The full template rules are complex, see Stroustrup, the Web etc etc etc if you want minutiae..
• Classes can 'export' internal types:
class SomeClass {
public:
  typedef float MyFloatType;
};
// This really declares a float
SomeClass::MyFloatType t=5.6;
• MyFloatType is a 'type trait' of SomeClass
• A Traits Class can hold several traits:
class SomeClass {
public:
  typedef float MyFloatType;
  static const int FloatSize = sizeof(MyFloatType);
};
std::cout << "The Size of SomeClass::MyFloatType is :"
          << SomeClass::FloatSize << std::endl;
• Traits become powerful when combined with templating:
template <typename T>
class MyTraits {
public:
  typedef T MyType;
  static const int MyTypeSize = sizeof(T);
};
• Can now write generic code in terms of 'MyType' and control say memory copies using 'MyTypeSize'
• Traits are heavily used in the STL and standard libraries.
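As an illustration of "controlling memory copies using MyTypeSize", here is a small sketch. The rawCopy() function is hypothetical, and it is only valid for types that can safely be copied byte by byte (plain floats, ints and the like).

```cpp
#include <cstring>

template <typename T>
class MyTraits {
public:
  typedef T MyType;
  static const int MyTypeSize = sizeof(T);
};

// Hypothetical generic copy of n elements from src to dst.
// Assumption: T is a plain type for which a bytewise copy is valid.
template <typename T>
void rawCopy(T* dst, const T* src, int n) {
  // The byte count comes from the trait rather than a hard-wired size.
  std::memcpy(dst, src, n * MyTraits<T>::MyTypeSize);
}
```

The generic code never names float or double directly: everything it needs to know about T is looked up through the traits class.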
• One can construct a set of templates to 'compute' about types:
template< typename T > // Catchall case... Empty...
struct DoublePrecType { // Will cause compiler error if one tries
}; // to access its nonexistent members
template<>
struct DoublePrecType<float> { // Specialization for floats
  typedef double Type_t; // This is the trait...
};
template<>
struct DoublePrecType<double> { // Specialization for doubles
  typedef double Type_t; // This is the trait
};
std::cout << "Sizeof DP(float)"
          << sizeof( DoublePrecType<float>::Type_t ) << std::endl;
std::cout << "But not for ints"
          << sizeof( DoublePrecType<int>::Type_t )
          << std::endl; // This'll barf, matches struct with no Type_t
• Type computations can be recursive:
#include <vector>
using namespace std;
template< typename T > // Catchall case. Hopefully never
struct DoublePrecType { // instantiate this
};
template<> // Base case for floats
struct DoublePrecType<float> {
  typedef double Type_t;
};
template<typename T> // Recursive case for vectors
struct DoublePrecType< vector< T > > {
  typedef vector< typename DoublePrecType< T >::Type_t > Type_t;
};
DoublePrecType< vector< float > >::Type_t doublevec(5);
cout << "doublevec[0] has size=" << sizeof( doublevec[0] ) << endl;
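The result of a type computation can also be checked at compile time rather than by printing sizes. The sketch below assumes a C++11 compiler (static_assert and std::is_same come from C++11, which postdates these slides) and repeats the DoublePrecType definitions so that it stands alone.

```cpp
#include <type_traits>
#include <vector>

template <typename T> struct DoublePrecType {};  // Catchall: no Type_t
template <> struct DoublePrecType<float>  { typedef double Type_t; };
template <> struct DoublePrecType<double> { typedef double Type_t; };

// Recursive case: peel off one vector layer and recurse on the element type.
template <typename T>
struct DoublePrecType< std::vector<T> > {
  typedef std::vector< typename DoublePrecType<T>::Type_t > Type_t;
};

// One level of recursion: vector<float> becomes vector<double>.
static_assert(std::is_same< DoublePrecType< std::vector<float> >::Type_t,
                            std::vector<double> >::value,
              "vector<float> -> vector<double>");

// Two levels: the recursion bottoms out at the float base case.
static_assert(std::is_same<
                DoublePrecType< std::vector< std::vector<float> > >::Type_t,
                std::vector< std::vector<double> > >::value,
              "nested vectors recurse down to the base case");
```

If either computation produced the wrong type, the file would fail to compile with the message given in the static_assert.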
• Basic QDP++ Traits:
– WordType<T>::Type_t - the innermost word (int, float)
– SinglePrecType<T>::Type_t single precision version of T
– DoublePrecType<T>::Type_t double precision versions of T
– QIOStringTraits<T > - holds strings needed by QIO functions
• QIOStringTraits<T>::tname = “Lattice” or “Scalar”
• QIOStringTraits<T>::tprec = “U” “I” “F” etc...
• More advanced traits that arise in Expression Templates
– UnaryReturn< T, Op>::Type_t - the return type produced by a Unary function Op acting on T
– BinaryReturn<T1, T2, Op>::Type_t - the return type produced by a
Binary Operator Op acting on inputs with type T1 and T2 respectively
• In all these cases, if we try to use a trait for which there is neither a matching specialization nor a usable 'catchall' case, we'll get horrible compiler errors...
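To show how a return-type trait of this kind can be used, here is a toy sketch. It is not the QDP++ definition: the OpMultiply tag, the four specializations and the multiply() function are simplified stand-ins written for illustration, and decltype in the check assumes C++11.

```cpp
#include <type_traits>

struct OpMultiply {};  // Tag type naming the operation (toy stand-in)

// Catchall: no Type_t, so unsupported combinations fail to compile.
template <typename T1, typename T2, typename Op>
struct BinaryReturn {};

// Supported combinations: mixed precision promotes to double.
template <> struct BinaryReturn<float,  float,  OpMultiply> { typedef float  Type_t; };
template <> struct BinaryReturn<float,  double, OpMultiply> { typedef double Type_t; };
template <> struct BinaryReturn<double, float,  OpMultiply> { typedef double Type_t; };
template <> struct BinaryReturn<double, double, OpMultiply> { typedef double Type_t; };

// The trait computes the return type from the argument types.
template <typename T1, typename T2>
typename BinaryReturn<T1, T2, OpMultiply>::Type_t
multiply(const T1& l, const T2& r) {
  return l * r;
}

static_assert(std::is_same<decltype(multiply(2.0f, 3.0)), double>::value,
              "float * double is computed to return double");
```

Calling multiply(1, 2) would be rejected, because BinaryReturn<int,int,OpMultiply> matches only the empty catchall.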
• Basic Idea:
– Want to 'wrap up' expressions at compile time, so that they become 'straightforward' to evaluate
– Want to be able to 'specialize' some expressions using 'faster' code
[Figure: expression tree for y = a*x ; a QDPExpr assigned to y wraps a Binary Node (OpMultiply) whose children are a and x]
• The basic QDP++ Expression Template is deceptively simple (qdp++/include/qdp_qdpexpr.h)
– QDPExpr 'anonymizes' expressions of type T
– Class exports 3 type Traits and only one public function
template<class T, class C>
class QDPExpr {
public:
  typedef T Subtype_t; // Type of the first argument.
  typedef T Expression_t; //! Type of the expression.
  typedef C Container_t; //! Type of the container class
  //! Construct from an expression.
  QDPExpr(const T& expr) : expr_m(expr)
  { }
  //! Accessor that returns the expression.
  const Expression_t& expression() const
  {
    return expr_m;
  }
private:
  T expr_m;
};
• PETE provides 'Tree node' classes to help build the
expression tree.
– Reference<T> - holds a reference to a type
• const T& reference() const; allows access to held value
– UnaryNode<Op,Child> - wraps a Unary Expression
• DeReference<Child>::Return_t child() const; can be used to access 'child' subexpression
– BinaryNode<Op, Left, Right> - Wraps a binary expression
• DeReference<Left>::Return_t left() const;
• DeReference<Right>::Return_t right() const;
– can be used to access left and right subexpressions
• PETE also provides (specifies) two classes to wrap arbitrary expressions into QDPExpr classes:
– MakeReturn< T, C >
• Takes expression T and wraps it to QDPExpr<T,C>
using the publicly declared 'make()' function
• To allow generic coding MakeReturn<T,C> exports an
Expression_t type which is just QDPExpr<T,C>
– CreateLeaf< QDPExpr<T, C> >
• Takes the expression and extracts T from it using its publicly declared 'make()' function.
• To allow generic coding use, CreateLeaf<QDPExpr<T,C> > exports a Leaf_t type (which is really just T in a roundabout way)
• The expression tree is built by overloading functions and operators like operator*() eg:
template<class T1,class C1,class T2,class C2>
inline typename MakeReturn< BinaryNode<OpMultiply,
    typename CreateLeaf<QDPType<T1,C1> >::Leaf_t,
    typename CreateLeaf<QDPType<T2,C2> >::Leaf_t> ,
  typename BinaryReturn<C1,C2,OpMultiply>::Type_t
>::Expression_t
operator* (const QDPType<T1,C1> & l,const QDPType<T2,C2> & r)
{
  // This is the node for *
  typedef BinaryNode<OpMultiply,
    typename CreateLeaf<QDPType<T1,C1> >::Leaf_t,
    typename CreateLeaf<QDPType<T2,C2> >::Leaf_t> Tree_t;
  // Return type = container
  typedef typename BinaryReturn<C1,C2,OpMultiply>::Type_t
    Container_t;
  return MakeReturn< Tree_t , Container_t >::make( Tree_t(
    CreateLeaf<QDPType<T1,C1> >::make(l),
    CreateLeaf<QDPType<T2,C2> >::make(r)) ) ;
}
• Despite its extremely complex looks the function is simple.
– Take 2 QDPType-s
– Use CreateLeaf<QDPType<T,C> >::make() to make them Leaf expressions
– Use the BinaryNode<OpMultiply, Left, Right> (aliased to Tree_t) to wrap the leaves into a binary node for multiplication
– Use the MakeReturn<T,C>::make() function to wrap the Binary Node into a QDPExpr<T,C> and return it
• If this is simple what's complicated?
– This is kind of hard to read and error prone to write.
– PETE does the hard work by writing this for us
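Stripped of the PETE machinery, the underlying idea fits in a few lines. The following is a toy expression template, not PETE or QDP++ code: Vec, Scalar, VecRef and the eval() convention are all invented for illustration, and only scalar*Vec and (expression)+Vec are supported.

```cpp
#include <cstddef>
#include <vector>

// Operation tags carry the elementwise operation.
struct OpMultiply { static double apply(double l, double r) { return l * r; } };
struct OpAdd      { static double apply(double l, double r) { return l + r; } };

class Vec {
public:
  explicit Vec(std::size_t n, double init = 0.0) : data_(n, init) {}
  std::size_t size() const { return data_.size(); }
  double eval(std::size_t i) const { return data_[i]; }

  // Assigning an expression runs ONE loop and creates no temporary Vec.
  template <class Expr>
  Vec& operator=(const Expr& e) {
    for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = e.eval(i);
    return *this;
  }

private:
  std::vector<double> data_;
};

// Leaves of the tree: a scalar value and a reference to an existing Vec.
struct Scalar { double value; double eval(std::size_t) const { return value; } };
struct VecRef { const Vec* vec; double eval(std::size_t i) const { return vec->eval(i); } };

// Interior node: holds its children and applies Op element by element.
template <class Op, class Left, class Right>
class BinaryNode {
public:
  BinaryNode(const Left& l, const Right& r) : left_(l), right_(r) {}
  double eval(std::size_t i) const { return Op::apply(left_.eval(i), right_.eval(i)); }
private:
  Left left_;
  Right right_;
};

// Operators build tree nodes instead of computing anything.
inline BinaryNode<OpMultiply, Scalar, VecRef> operator*(double a, const Vec& x) {
  Scalar s = { a };
  VecRef r = { &x };
  return BinaryNode<OpMultiply, Scalar, VecRef>(s, r);
}

template <class Op, class L, class R>
BinaryNode<OpAdd, BinaryNode<Op, L, R>, VecRef>
operator+(const BinaryNode<Op, L, R>& e, const Vec& y) {
  VecRef r = { &y };
  return BinaryNode<OpAdd, BinaryNode<Op, L, R>, VecRef>(e, r);
}
```

With these definitions, y = 3.0 * x + z builds a BinaryNode<OpAdd, BinaryNode<OpMultiply, Scalar, VecRef>, VecRef> and evaluates it in a single pass inside Vec::operator=. The full type of the expression is visible to the compiler, which is what makes it possible to write a specialized evaluation for one particular tree shape.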
• Evaluation can happen in two ways:
– one calls evaluate() on the QDPExpr
• in operators =, +=, -=, etc...
• this can be in architecture specific files eg:
– qdp_scalar_specific.h, qdp_parscalar_specific.h
– during the creation/fusing of QDPExpr-s
• Eg: in the generic sum routine
template<class RHS, class T>
typename UnaryReturn<OScalar<T>, FnSum>::Type_t
sum(const QDPExpr<RHS,OScalar<T> >& s1, const Subset& s)
{
typename UnaryReturn<OScalar<T>, FnSum>::Type_t d;
evaluate(d,OpAssign(),s1,all); // since OScalar, no global sum needed
return d;
}
• This is of course, not the whole story.
– I haven't really explained why the MakeReturn, CreateLeaf etc are needed
– I haven't really explained why they export the types they export.
– These are beyond the scope of a simple lecture
• Hopefully I have shown you the main types (nodes, etc)
• Armed with this knowledge we know enough to hook in optimizations.
• For more details on PETE you can read:
– “Disambiguated Glommable Expression Templates” by G. Furnish,
Computers in Physics, Vol 11, Issue 3, 1997 pages 263-269
– An online version is also available
– NERSC still has a web page on PETE where there is some documentation
– We have a copy of the PETE source code in QDP++ with some tweaks...
• Typically you'll want to specialize either an evaluate() or a particular operator (eg +,-, etc) for a site.
• There are 3 main 'optimizations' for so called 'scalarsite' targets (scalar nodes, possibly connected)
– include/scalarsite_sse/
• SSE Optimizations
– include/scalarsite_bagel_qdp/
• BAGEL Optimizations
– include/scalarsite_generic/
• Generic C Optimizations
• Toplevel control is through include files:
– include/qdp_scalarsite_sse.h - #includes files from scalarsite_sse
– include/qdp_scalarsite_generic.h - #includes files from scalarsite_generic
– include/qdp_scalarsite_bagel_qdp.h
• Which of these is used, is controlled by Autoconf/Automake #defines in qdp.h
• Bagel QDP optimizations are perhaps the cleanest
– (all functions called live in an external library)
• SSE Optimizations are perhaps the messiest
– Some called functions live in .h files in scalarsite_sse
– Some functions live in lib/scalarsite_sse/
– Some functions live in a separate library (libintrin) provided in qdp++/other_libs/libintrin/
• The scalarsite_xxx/ directories typically have the following .h files
– qdp_scalarsite_xxx_linalg.h
• SU(3) Matrix* SU(3) Matrix, SU(3) Matrix* Color Vec etc.
– qdp_scalarsite_xxx_blas.h
• a*x + y, a*x+b*y, norm2(), innerProduct() etc, a,b real
– qdp_scalarsite_xxx_blas_g5.h
• a*x + gamma_5 y type operations
– qdp_scalarsite_xxx_cblas.h
• a*x+y, a*x + b*y etc, a, b are complex.
– Plus potentially double prec. versions
(qdp_scalarsite_xxx_blas_double.h)
– Plus potentially threading wrappers
(qdp_scalarsite_xxx_blas_wrapper.h)
• Example Matrix*Matrix specialization from qdp_scalarsite_sse_linalg_double.h
#include "sse_linalg_mm_su3_double.h"
inline BinaryReturn<PMatrix<RComplex<REAL64>,3,PColorMatrix>,
    PMatrix<RComplex<REAL64>,3,PColorMatrix>, OpMultiply>::Type_t
operator*( const PMatrix<RComplex<REAL64>,3,PColorMatrix>& l ,
           const PMatrix<RComplex<REAL64>,3,PColorMatrix>& r )
{
  // Return value
  BinaryReturn<PMatrix<RComplex<REAL64>,3,PColorMatrix>,
    PMatrix<RComplex<REAL64>,3,PColorMatrix>, OpMultiply>::Type_t d;
  // Unwrap QDP Types
  REAL64* lm = (REAL64 *) &(l.elem(0,0).real());
  REAL64* rm = (REAL64 *) &(r.elem(0,0).real());
  REAL64* dm = (REAL64 *) &(d.elem(0,0).real());
  // Call 'optimized' function
  ssed_m_eq_mm_u(dm,lm,rm,1);
  return d;
}
• Site result is returned on the stack
– This can cause problems...
– In principle, a good compiler can construct the result directly in the caller's storage instead of copying it (return value optimization, a form of copy elision).
• Your optimized routine had better be good
– We pass pointers to the operands
– Potential aliasing hazard
– Can prevent compiler from optimizing your 'leaf routine'
– Use a compiler flag to tell the compiler to assume your pointers do not alias (eg for gcc: -fargument-noalias-global)
• Where does one get a chance to prefetch?
– Only potentially in a surrounding loop...
• AXPY operation (y += a*x) from qdp_scalarsite_bagel_qdp_blas.h
inline void evaluate( OLattice< TVec >& d,
  const OpAddAssign & op,
  const QDPExpr< BinaryNode< OpMultiply,
      Reference < QDPType< TScal, OScalar < TScal > > >,
      Reference< QDPType< TVec, OLattice< TVec > > > >,
    OLattice< TVec > > &rhs,
  const Subset& s)
{ // Unwrap Binary Node... get a & x
  const OLattice< TVec >& x
    = static_cast<const OLattice< TVec > &>( rhs.expression().right() );
  const OScalar< TScal >& a
    = static_cast<const OScalar< TScal > &> ( rhs.expression().left() );
  REAL ar = a.elem().elem().elem().elem(); REAL* aptr = &ar;
  if( s.hasOrderedRep() ) {
    // For ordered (contiguous) subsets call a subset's worth of work
    // Get pointers
    REAL* xptr = (REAL *)&(x.elem(s.start()).elem(0).elem(0).real());
    REAL* yptr = &(d.elem(s.start()).elem(0).elem(0).real());
    int n_3vec = (s.end()-s.start()+1)*4 ;
    // Call optimized function
    qdp_vaxpy3(yptr, aptr, xptr, yptr, n_3vec);
  }
  else {
    // Unordered Subset: must loop through subset site table
    const int* tab = s.siteTable().slice();
    for(int j=0; j < s.numSiteTable(); j++) {
      int i = tab[j];
      // Unwrap pointers for site i in the table
      REAL* xptr = (REAL *)&(x.elem(i).elem(0).elem(0).real());
      REAL* yptr = &(d.elem(i).elem(0).elem(0).real());
      // Call optimized function, for one site
      qdp_vaxpy3(yptr, aptr, xptr, yptr, 4);
    }
  }
}
• Note: This structure is quite generic
– Two branches for 'contiguous' and 'non-contiguous' subsets
• For non-contiguous cases, same caveats about pointer aliasing (eg above).
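The two-branch structure can be reduced to a toy sketch outside QDP++. Everything here is a hypothetical stand-in: ToySubset mimics only the ordered/site-table distinction, toy_axpy plays the role of the optimized kernel, and a "site" is assumed to hold 4 reals.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an optimized kernel: y[i] += a * x[i] for n reals.
inline void toy_axpy(double* y, double a, const double* x, int n) {
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

// Hypothetical subset: either a contiguous site range or a site table.
struct ToySubset {
  bool ordered;
  int start, end;              // inclusive site range, used when ordered
  std::vector<int> siteTable;  // explicit list of sites, used otherwise
};

const int kRealsPerSite = 4;   // assumption: 4 reals stored per site

inline void toy_evaluate(std::vector<double>& y, double a,
                         const std::vector<double>& x, const ToySubset& s) {
  if (s.ordered) {
    // Contiguous: one kernel call covers the whole subset.
    int n = (s.end - s.start + 1) * kRealsPerSite;
    toy_axpy(&y[s.start * kRealsPerSite], a, &x[s.start * kRealsPerSite], n);
  } else {
    // Non-contiguous: one kernel call per site in the table.
    for (std::size_t j = 0; j < s.siteTable.size(); ++j) {
      int i = s.siteTable[j];
      toy_axpy(&y[i * kRealsPerSite], a, &x[i * kRealsPerSite], kRealsPerSite);
    }
  }
}
```

The contiguous branch amortizes the call overhead over the whole subset, while the site-table branch pays it once per site, which is why the ordered case is the one worth optimizing first.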
• Wrap the optimized function so that it can be dispatched to threads...
– QMT Style of dispatch is (and OpenMP can do this too)
qmt_call( (qmt_userfunc_t)func , int N , (void *)args ) // N = # sites
– The QMT user function looks like:
void func ( int lo, int hi , int myId, MyArgType* args );
• lo, hi: low and high site indices for this thread. qmt_call() computes these from N
• myId: My Thread ID (given to me by qmt_call())
• args: User Arguments wrapped in a struct
• So I need to define a func() and an Argument struct
• eg: for AXPZ (qdp_scalarsite_sse_blas_wrapper.h)
struct ordered_sse_vaxOpy3_user_arg{
  REAL32* Out; // y
  REAL32* scalep; // a
  REAL32* InScale; // x
  REAL32* Opt; // z
  void (*func)(REAL32*, REAL32*, REAL32*, REAL32*, int);
};
inline void ordered_sse_vaxOpy3_evaluate_function (int lo, int hi, int myId, ordered_sse_vaxOpy3_user_arg* a) {
  // Unwrap Args
  REAL32* Out = a->Out; REAL32* scalep = a->scalep;
  REAL32* InScale = a->InScale; REAL32* Opt = a->Opt;
  void (*func)(REAL32*, REAL32*, REAL32*, REAL32*, int) = a->func;
  // Setup Pointers, and n_3vec from hi/low
  int n_3vec = (hi - lo); int index = lo;
  InScale = &InScale[index];
  Opt = &Opt[index];
  Out = &Out[index];
  // Call optimized func
  func(Out, scalep, InScale, Opt, n_3vec);
}
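To see how the (lo, hi, myId, args) convention fits together, here is a sketch with a serial stand-in for the dispatcher. It does not use QMT or OpenMP: toy_dispatch() is hypothetical and simply calls the user function once per "thread" in sequence, taking hi as one past the last index, consistent with n_3vec = hi - lo above.

```cpp
#include <vector>

// Hypothetical argument struct: everything the user function needs.
struct ToyAxpyArgs {
  float* y;
  float a;
  const float* x;
};

// User function in the (lo, hi, myId, args) style: works on [lo, hi).
inline void toy_axpy_userfunc(int lo, int hi, int myId, ToyAxpyArgs* args) {
  (void)myId;  // not needed for this operation
  for (int i = lo; i < hi; ++i) args->y[i] += args->a * args->x[i];
}

// Serial stand-in for a threaded dispatcher (assumption: a real one would
// run these calls concurrently). Splits [0, n) into numThreads chunks.
template <typename Arg>
void toy_dispatch(int n, int numThreads, Arg* args,
                  void (*func)(int, int, int, Arg*)) {
  int chunk = n / numThreads;
  int rem = n % numThreads;
  int lo = 0;
  for (int id = 0; id < numThreads; ++id) {
    // The first 'rem' chunks get one extra element each.
    int hi = lo + chunk + (id < rem ? 1 : 0);
    func(lo, hi, id, args);
    lo = hi;
  }
}
```

Because each call touches a disjoint range [lo, hi), the calls are independent, which is what makes it safe for a real dispatcher to hand them to different threads.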
• Now we have the userfunc, and args structure, we hook it into specialized expression templates
• The Example below is from qdp_scalarsite_sse_blas.h
– I've reproduced just the bits for the 'contiguous' AXPY case below...
// Get Pointers
REAL32 ar = a.elem().elem().elem().elem(); REAL32* aptr = &ar;
if( s.hasOrderedRep() ) {
  REAL32* xptr = (REAL32 *)&(x.elem(s.start()).elem(0).elem(0).real());
  REAL32* yptr = &(d.elem(s.start()).elem(0).elem(0).real());
  int total_n_3vec = (s.end()-s.start()+1)*24;
  // Setup User Args...
  ordered_sse_vaxOpy3_user_arg arg = {yptr, aptr, xptr, yptr, vaxpy3};
  // Dispatch
  dispatch_to_threads(total_n_3vec, arg,
                      ordered_sse_vaxOpy3_evaluate_function);
}
• Threading collectives is a little more involved
– Each thread reduces to its own location in norm2_results (for norm2()). And then sum over those at end...
– norm2_results allocated at startup to the max no of threads.
int n_real = (s.end() - s.start() + 1)*24;
ordered_sse_norm_single_user_arg arg;
arg.vptr = (REAL32*) &(s1.elem(s.start()).elem(0).elem(0).real());
arg.results = ThreadReductions::norm2_results; // REAL64 array[N_THREADS]
arg.func = local_sumsq_24_48;
// Threads compute partial results
dispatch_to_threads(n_real, arg, ordered_norm_single_func);
// Sum partial results
REAL64 ltmp=arg.results[0];
for(int i=1; i < qdpNumThreads(); i++) {
  ltmp += arg.results[i];
}
Internal::globalSum(ltmp);
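The reduction pattern (each thread writes its own slot, then the slots are summed) can be sketched without any threading library. In this toy version the "threads" run one after another, the names are hypothetical, and the final sum across nodes that Internal::globalSum() performs is left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical argument struct for the reduction.
struct ToyNormArgs {
  const float* v;
  double* results;  // one slot per thread, indexed by thread id
};

// Each "thread" accumulates in double precision into its own slot,
// so no two threads ever write the same location.
inline void toy_norm2_userfunc(int lo, int hi, int myId, ToyNormArgs* args) {
  double sum = 0.0;
  for (int i = lo; i < hi; ++i) {
    sum += double(args->v[i]) * double(args->v[i]);
  }
  args->results[myId] = sum;
}

// Serial stand-in for dispatch plus the final sum over partial results.
inline double toy_norm2(const std::vector<float>& v, int numThreads) {
  assert(numThreads >= 1);
  std::vector<double> results(numThreads, 0.0);
  ToyNormArgs args = { v.data(), results.data() };
  int n = static_cast<int>(v.size());
  for (int id = 0; id < numThreads; ++id) {
    int lo = (n * id) / numThreads;        // this thread's range is [lo, hi)
    int hi = (n * (id + 1)) / numThreads;
    toy_norm2_userfunc(lo, hi, id, &args);
  }
  double total = results[0];
  for (int i = 1; i < numThreads; ++i) total += results[i];
  return total;
}
```

Giving every thread a private slot avoids locks or atomics in the hot loop; the price is the short serial sum at the end.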
• We'd like to thank Xu Guo and EPCC for the majority of work on integrating the OpenMP and QMT threading into
QDP++
– I merely added the norm2(), and innerProduct() collectives and replicated the structure for some of the double precision routines...
• Templates and G++ have one annoying side
– The template definition and declaration seem as though they must appear in the same compilation unit → typically this means template classes must live entirely in header files (splitting the code off into .cc files can produce linkage errors).
– This behaviour is essentially compiler dependent (e.g.: xlC can create a 'template registry' instead.)
– This may be 'fixed' in the next C++ standard...
• Sometimes you can hide the fact that something is templated.
• E.g.: Suppose you want a solver to work for both single and double precisions. You can declare both in the .h file as non-templated functions
– (from chroma/lib/actions/ferm/invert/invcg2.h)
// Single precision
SystemSolverResults_t
InvCG2(const LinearOperator<LatticeFermionF>& M,
const LatticeFermionF& chi,
LatticeFermionF& psi,
const Real& RsdCG,int MaxCG);
// Double precision
SystemSolverResults_t
InvCG2(const LinearOperator<LatticeFermionD>& M,
const LatticeFermionD& chi,
LatticeFermionD& psi,
const Real& RsdCG,int MaxCG);
• Then in the .cc file, define the templated solver, and wrap it for both declared interfaces:
template<typename T, typename C>
SystemSolverResults_t
InvCG2_a(const C& M, const T& chi, T& psi,
         const Real& RsdCG, int MaxCG)
{… code omitted for lack of space … }
// Single precision Wrapper as declared in .h file.
SystemSolverResults_t
InvCG2(const LinearOperator<LatticeFermionF>& M,
const LatticeFermionF& chi,
LatticeFermionF& psi,
const Real& RsdCG,int MaxCG)
{
return InvCG2_a(M, chi, psi, RsdCG, MaxCG);
}
// And likewise for LatticeFermionD ...
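The same pattern in miniature, as a single-file sketch: the comments mark which part would live in the .h file and which in the .cc file. The countPositive() functions are invented for illustration and have nothing to do with Chroma.

```cpp
#include <cstddef>
#include <vector>

// ---- What would go in the .h file: plain, non-templated declarations ----
int countPositive(const std::vector<float>& v);
int countPositive(const std::vector<double>& v);

// ---- What would go in the .cc file ----
namespace {
// The templated worker is private to this file, so callers never see it
// and its definition does not need to live in a header.
template <typename T>
int countPositive_a(const std::vector<T>& v) {
  int n = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    if (v[i] > T(0)) ++n;
  }
  return n;
}
}  // namespace

// Single precision wrapper, as declared in the .h part.
int countPositive(const std::vector<float>& v) { return countPositive_a(v); }

// Double precision wrapper.
int countPositive(const std::vector<double>& v) { return countPositive_a(v); }
```

Both instantiations are created inside the .cc file where the wrappers call the template, so users of the header only ever link against ordinary functions.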
• We discussed templates in general including template specialization
• We considered using public template class members and exported types to encode 'traits' into traits classes.
• We considered Expression Templates and how these can be assembled using operator overloading and particularly complicated templates
• We discussed where to hook in specializations into QDP++ to optimize expression templates at the site and evaluate level
• We discussed how the current threading implementation works in QDP++
• At this point you should be well equipped to optimize and thread in QDP++ to your heart's content!
• If you have the time:
– “Disambiguated Glommable Expression Templates” by G. Furnish, Computers in Physics, Vol 11, Issue 3, 1997 pages 263-269 (also available online)
– The PETE documentation
– Alexandrescu, “Modern C++ Design”
• Available online from Google Books
– The QDP++ Source Code...