QDP++ Internals
Balint Joo
Jefferson Lab, Newport News
for HackLatt'09, Edinburgh, May 2009

Rough Outline
• Code Layout, Bundled Libraries
• Best Supported Architectures
• QDP++ Nested Template Types
• Back Door Type Manipulations
• Sets, Subsets
• Layouts

Code Layout
[Diagram: the qdp++ tree contains include/ (with template specialization optimization subdirectories scalarsite_sse, scalarsite_generic, scalarsite_bagel_qdp, scalarvecsite_sse and scalarsite_qcdoc, some of them marked 'obsoleted' or 'not actively maintained'), lib/, and other_libs/ (the 'standalone' libraries distributed with QDP++: qio, xpath_reader, filedb, libintrin). Labels mark which parts belong to this talk and which belong mostly to the template talk.]

What lives where?
• include/
  – most header files to do with templates, configuration etc
  – subdirectories contain template specialization optimizations...
• lib/
  – encoding of layouts, architecture specific startup, shifts, allocators, io etc, and library code for some SSE BLAS-like routines
• other_libs/
  – bundled libraries

A note on parallel architecture names
• Scalarsite – sites are in 'scalar order' (outermost index)
• Scalarvecsite – vectorized over sites (innermost index)
• Parscalar – parallel machine made up of nodes on which sites are in scalar order (ie scalarsite)
  – typical cluster/MPP machine
• Scalar – scalar machine on which sites are in scalar order (ie scalarsite)
• Site order/architecture specific files (both .h and .cc) are typically named like qdp_scalarsite_sse.h, qdp_scalar_specific.cc etc.

Bundled Libraries
• We bundle the following in other_libs:
  – QIO : You know it and (possibly) love it. The full SciDAC QIO
  – xpath_reader : For reading/writing XML
  – filedb : Recent library to write 'database'-like files for faster binary output of correlators etc
  – libintrin : A single precision SSE library based on the MILC SSE routines.
• We don't bundle, but you will need to have installed
  – QMP : for parallel builds
  – libxml2 : the engine under xpath_reader
  – bagel_qdp : for BLAS on the BlueGene/QCDOC

Supported Architectures
• Scalarsite SSE
  – Sites in scalar order, SSE based
  – Clusters, Crays, Laptops, Desktops, anything supporting Intel SSE intrinsics.
  – Best supported. The most common architecture, and the one we have most access to.
• Scalarsite BAGEL QDP
  – Double precision BLAS only in bagel_qdp library
  – BlueGenes/QCDOC
  – Best Effort Supported, depending on access/allocations
  – Works on BlueGene/L/P. For BG/L disable filedb (no threads)
• Scalarsite Generic
  – BLAS-like routines in C/C++. Good for new architectures/compilers
  – Supported, and mixed in with SSE for routines where we have no SSE version.

Supported Architectures
• Scalarsite QCDOC
  – Was original QCDOC architecture
  – Now 'merged' with scalarsite BAGEL QDP
  – Essentially obsoleted: UK QCDOC turned off, not expecting US QCDOC allocation
• Scalarvecsite SSE
  – Vectorized over sites
  – Apparently, large Nc calculations on workstations use this
  – I have never really developed this... so I claim not to support it... but who knows...

QDP++ Types
• Scalarsite Templated Type definitions live in include/qdp_scalarsite_defs.h:

    typedef OLattice< PScalar< PColorVector< RComplex<REAL>, Nc> > > LatticeColorVector;
    typedef OLattice< PSpinVector< PScalar< RComplex<REAL> >, Ns> > LatticeSpinVector;
    typedef OLattice< PScalar< PColorMatrix< RComplex<REAL>, Nc> > > LatticeColorMatrix;
    typedef OLattice< PSpinMatrix< PScalar< RComplex<REAL> >, Ns> > LatticeSpinMatrix;
    typedef OLattice< PSpinVector< PColorVector< RComplex<REAL>, Nc>, Ns> > LatticeFermion;
    // Half the spin components; the shift is parenthesized so that '>>' is
    // never read as two closing template brackets
    typedef OLattice< PSpinVector< PColorVector< RComplex<REAL>, Nc>, (Ns>>1) > > LatticeHalfFermion;

• These types are 'floating' types. The precision 'floats' depending on whether you compile in single/double precision.
  – Single Prec: REAL → float
  – Double Prec: REAL → double
• There are also FIXED precision types:

    typedef OLattice< PSpinVector< PColorVector< RComplex<REAL64>, Nc>, 4> > LatticeDiracFermionD;
    typedef OLattice< PSpinVector< PColorVector< RComplex<REAL64>, 3>, 4> > LatticeDiracFermionD3;

• NB Fixed Nc: LatticeDiracFermionD3
• Useful for when concrete precisions, Nc etc are needed

Accessing members of the types...
• OLattice<>, PScalar<>, etc export an elem() function
  – OLattice<T>::elem(int site);
  – PScalar<T>::elem(); // NB No indices
  – PSpinVector<T,N>::elem(int i);
  – PSpinMatrix<T,N>::elem(int i, int j);
  – PColorVector<T,N>::elem(int i);
  – PColorMatrix<T,N>::elem(int i, int j);
  – RScalar<T>::elem(); // No indices
  – RComplex<T>::real(); RComplex<T>::imag();
• So to get at an element of a LatticeFermion or a LatticeColorMatrix, you nest these:

    LatticeFermion ferm;
    LatticeColorMatrix u;
    REAL element_of_ferm = ferm.elem(site).elem(spin).elem(color).real();
    REAL elem_of_u = u.elem(site).elem().elem(row,col).real();

• NB: This kind of access is architecture dependent!
  – Eg. it works for scalarsite, but not for scalarvecsite, where the order of the templates may be different... BEWARE!!! Use the poke/peek routines if you want portability across those kinds of architectures...

Accessing members of types
• The elem() etc methods are a good way of getting hold of pointers to lattice wide data, for passing to optimized routines (see talk on templates and optimization)
  – These kinds of optimizations may well be architecture specific anyway....

    LatticeDiracFermionD3 x, y;
    gaussian(x);
    gaussian(y);
    Double a = 0.5;
    // Take the address of the first real number of each object
    REAL64* xptr=(REAL64*)&(x.elem(0).elem(0).elem(0).real());
    REAL64* yptr=(REAL64*)&(y.elem(0).elem(0).elem(0).real());
    REAL64* aptr=(REAL64*)&(a.elem().elem().elem().elem());
    int n_sites = all.end() - all.start()+1;
    optimized_func(aptr, xptr, yptr, &n_sites); // could even call F***TRAN

Memory Ordering etc etc
• Primitive datatypes (PSpinVector, PColorVector, etc) 'allocate' their own memory by holding fixed size arrays, eg:

    template <class T, int N>
    class PSpinVector {
    private:
      T F[N]; // Data member F is an N element array of T
    };

• Outer lattice types holding primitive types allocate all the data up front in a contiguous chunk, eg. (qdp_outer.h, OLattice::alloc_mem):

    slow=(T*) QDP::Allocator::theQDPAllocator::Instance().allocate(
            sizeof(T)*Layout::sitesOnNode(), QDP::Allocator::DEFAULT);
    F=slow;

Memory Ordering etc.
• Using the QDP::Allocator ensures that lattice data will start suitably aligned, but:
  – If the data type has a size that is not a multiple of an 'alignment unit', successive elements may not be appropriately aligned; padding may be needed
  – Adding padding may confuse some optimized routines which do not expect it...
• Fast/Slow pointers are a QCDOC abstraction...
• multi1d<> arrays will create an array of objects
  – a multi1d<> of OLattice<T> will create an array of OLattice<T> objects.
  – OLattice<T> objects have dynamically allocated data held via pointers..
  – The objects will be contiguous, but the memories they point to may end up being non-contiguous... (the 'array of pointers' problem)

Memory Ordering etc...
[Diagram: a multi1d<OLattice<T> > with elements [0] [1] [2] [3] [4]. Neighbouring OLattice<T> elements are 'contiguous'; each points to its own dynamically allocated data memory, which is not necessarily contiguous.]
• multi1d<T> has a 'placement' constructor where the user can give a pointer to an allocated area, and the multi1d<> will use that internally...
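
The 'array of pointers' problem can be illustrated outside QDP++ with a toy class. The sketch below is only an illustration under stated assumptions: ToyLattice, make_pooled and data_is_contiguous are hypothetical names, not part of QDP++, and the 'view' constructor merely mimics the spirit of handing an object pre-allocated storage. It shows that carving the data of several lattices out of one pool makes the site data contiguous, which self-allocating objects cannot promise.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for OLattice<T>: the object itself is small and only holds a
// pointer to its site data (hypothetical class, NOT the QDP++ implementation).
struct ToyLattice {
    std::unique_ptr<double[]> owned;  // non-null only if the object allocated for itself
    double* data = nullptr;           // where the site data actually lives
    std::size_t n_sites = 0;

    // Self-allocating form: every object makes its own heap request, so the
    // data blocks of neighbouring objects need not be adjacent.
    explicit ToyLattice(std::size_t n)
        : owned(new double[n]()), data(owned.get()), n_sites(n) {}

    // View form: use caller-supplied storage instead of allocating.
    ToyLattice(double* storage, std::size_t n) : data(storage), n_sites(n) {}
};

// Build 'count' lattices whose data are carved out of one contiguous pool.
// The pool must outlive the returned objects.
std::vector<ToyLattice> make_pooled(std::vector<double>& pool,
                                    std::size_t count, std::size_t n_sites) {
    pool.assign(count * n_sites, 0.0);
    std::vector<ToyLattice> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.emplace_back(pool.data() + i * n_sites, n_sites);
    }
    return out;
}

// True if each element's data block starts exactly where the previous one ends.
bool data_is_contiguous(const std::vector<ToyLattice>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i].data != v[i - 1].data + v[i - 1].n_sites) return false;
    }
    return true;
}
```

With the pooled form, an optimized routine can be handed pool.data() and count * n_sites and sweep over all the lattices in one pass. With the self-allocating form, data_is_contiguous may return either result, depending on what the heap allocator happened to do.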

Memory Ordering etc
• In nested primitive types, ordering will follow the nesting, with 'innermost fastest'. Eg:

    PSpinVector<PColorMatrix<RComplex<REAL64>, 3>, 4>

• will run as
  – complex fastest (real then imag)
  – color indices after
  – spin indices slowest
• For OLattice types, the site index will run slowest, in the order dictated by the Layout in use.
  – x fastest, y, z, t in lexicographic layout
  – x/2 (fastest), y, z, t, cb in cb2, checkerboarded layout
  – cb2 checkerboarded layout is the default...

Floating vs. Fixed Types
• Typically, we worked with floating types except in SSE (where we needed to know whether we were in 32 or 64 bits to select the right instructions). On BG/L we always built in 64 bits.
• Now we are interested in mixed precision (and mixed precision solvers). We want to have both 32 and 64 bit optimized code in scope if possible.
• SSE has been more or less cleaned.
• BG/L we haven't started yet (still mostly floating)
• Generics is mostly still floating.
• My next QDP++ maintenance activity is to clean this up.

Subsets...
• A lot of QDP++ operations can be done under a subset
• For hooking in optimizations it is useful to know that
  – a subset can be contiguous or not
    • hasOrderedRep() member function (true => contiguous)
  – a subset has a site table you can access
    • siteTable() member function gives the site table as a multi1d<int>&
    • numSiteTable() gives the number of elements in the site table.
  – a subset can give you
    • the index of its first element, with the start() member function
    • the index of its last element, with the end() member function

Typical Subset Site Loops
• Typically, working under a subset, the loop will look like this:

    if( s.hasOrderedRep() ) {
      // Contiguous case
      // Get pointer to first element, and size.
      REAL64* ptr = (REAL64 *)&(stuff.elem(s.start()).elem(0).elem(0).elem());
      // s.start() and s.end() are both inclusive
      int n_sites = s.end()-s.start()+1;
      optimized_func(ptr, n_sites);
    }
    else {
      // Non contiguous case
      // Grab a pointer to the site table
      const int* tab = s.siteTable().slice();
      // Explicitly loop through all elements of the site table for the subset
      for(int sites = 0; sites < s.numSiteTable(); sites++) {
        int i = tab[sites];
        REAL64* ptr=(REAL64*)&(stuff.elem(i).elem(0).elem(0).elem());
        optimized_func(ptr, 1);
      }
    }

Layouts
• To specify a site layout, you typically need 6 functions or so.
• The 3 primary ones are:
  – int linearSiteIndex(const multi1d<int>& coord)
    • maps a global coordinate (coord) into the local site index
    • the site index is only useful on the node holding the site
  – int nodeNumber(const multi1d<int>& coord)
    • returns the number of the node that holds the site with global coordinate coord
  – multi1d<int> siteCoords(int node, int linear)
    • given a node number and a linear site index, returns the global site coordinates

Layouts...
[Diagram: Global Coordinates (x,y,z,t) map through Layout::nodeNumber() to the Node Number and through Layout::linearSiteIndex() to the Local Linear Site Index; Layout::siteCoords() maps the (Node Number, Local Linear Site Index) pair back to the Global Coordinates.]
• Layout specific versions are defined in qdp_scalar_layout.cc, qdp_parscalar_layout.cc, qdp_parscalarvec_layout.cc
• Autoconf defined macros select which are used:
  – QDP_USE_CB2_LAYOUT, QDP_USE_LEXICO_LAYOUT etc..

Layouts
• The 3 main layout functions are exported so that optimized kernels can learn about the QDP++ layout. The exported names are:
  – void QDPXX_getSiteCoords(int coord[], int node, int linear);
  – int QDPXX_getLinearSiteIndex(const int coord[]);
  – int QDPXX_nodeNumber(const int coord[]);
• NB: const multi1d<>& references replaced with arrays.
• Declaration/definition in include/qdp_layout.h

Layouts
• The three remaining functions needed are:
  – multi1d<int> minimalLayoutMapping()
    • returns the smallest local volume allowed per node
  – multi1d<int> crtesn(int pos, const multi1d<int>& lsize)
    • computes the cartesian coordinates of site pos in a lattice of size lsize
  – int local_site(const multi1d<int>& coord, const multi1d<int>& lsize);
    • given a coordinate (coord) and a lattice size (lsize), computes the local site index.
• From the layout point of view, crtesn and local_site were strictly internal utility functions, but they have escaped (eg into writeOLattice, readArchiv etc)

Layouts
• Layout code is a little messy right now
  – Primarily in lib/qdp_layout.cc & architecture specific layout files (qdp_scalar_layout.cc, qdp_parscalar_layout.cc etc). Just grep for LAYOUT...
  – The various layouts are separated by #ifdef-s, and we use conditional compilation
  – Autoconf defines the #ifdef macro for the layout as a result of using the --enable-layout= configure option
  – The macro is defined in qdp_config_internal.h, and is visible from qdp_config.h
  – At some point this should be tidied up...

Summary
• You've now seen most of QDP++'s dirty linen
  – Talked briefly about code layout, bundled libraries
  – Talked briefly about where the recursive types are defined, and how to access their elements.
  – Talked about how the memory is laid out
    • small objects have fixed size data
    • OLattice<> objects malloc their data, try to align etc
    • multi1d<> objects
  – Talked about Layouts, how their chief functions work to constrain site ordering.
• Most of what remains is to do with Templates, specialization and threading, in the next talk entitled “Templates and all that”
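
Backup sketch for the Layouts slides: the two utility functions crtesn and local_site are described above only in words. The code below is a minimal sketch of what such a pair could look like, assuming a plain lexicographic order with index 0 (x) running fastest. The names toy_crtesn and toy_local_site are hypothetical, std::vector<int> stands in for multi1d<int>, and nothing here is taken from the QDP++ source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for crtesn / local_site (NOT the QDP++ source).
// Assumption: lexicographic order, index 0 (x) fastest, last index slowest.

// Coordinates of linear site 'pos' in a lattice with extents 'lsize'.
std::vector<int> toy_crtesn(int pos, const std::vector<int>& lsize) {
    std::vector<int> coord(lsize.size());
    for (std::size_t mu = 0; mu < lsize.size(); ++mu) {
        coord[mu] = pos % lsize[mu];  // peel off the fastest remaining index
        pos /= lsize[mu];
    }
    return coord;
}

// Linear site index of 'coord' in a lattice with extents 'lsize'.
// This is the inverse of toy_crtesn.
int toy_local_site(const std::vector<int>& coord, const std::vector<int>& lsize) {
    int index = 0;
    // Horner form: start from the slowest index and work towards the fastest.
    for (std::size_t mu = lsize.size(); mu-- > 0;) {
        index = index * lsize[mu] + coord[mu];
    }
    return index;
}
```

For a 4x4x4x8 lattice, the site (1,2,3,5) maps to 1 + 4*2 + 16*3 + 64*5 = 377 under this ordering, and toy_crtesn(377, lsize) recovers (1,2,3,5). A checkerboarded layout such as cb2 would need an extra step to split sites by parity, which this sketch does not attempt.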