
QDP++ Internals
Balint Joo
Jefferson Lab, Newport News
for HackLatt'09, Edinburgh, May 2009
Rough Outline
• Code Layout, Bundled Libraries
• Best Supported Architectures
• QDP++ Nested Template Types
• Back Door Type Manipulations
• Sets, Subsets
• Layouts
Code Layout
[Figure: directory tree of the qdp++ source.
Top level: include, lib, other_libs.
Architecture subdirectories (labelled 'Template specialization optimizations'): scalarsite_sse, scalarsite_generic, scalarvecsite_sse, scalarsite_bagel_qdp, scalarsite_qcdoc.
Under other_libs ('standalone' libraries distributed with QDP++): qio, xpath_reader, filedb, libintrin.
Other labels in the figure: "This talk...", "Template talk mostly", "obsoleted", "not actively maintained".]
What lives where?
• include/
– most header files to do with templates, configuration
etc
– subdirectories contain template specialization
optimizations...
• lib/
– encoding of layouts, architecture-specific startup,
shifts, allocators, I/O etc. and library code for some
SSE BLAS-like routines
• other_libs/
– bundled libraries
A note on parallel architecture names
• Scalarsite
– sites are in 'scalar order' (outermost index)
• Scalarvecsite
– vectorized over sites (innermost index)
• Parscalar
– parallel machine made up of nodes on which sites are
in scalar order (ie scalarsite)
• typical cluster/MPP machine
• Scalar
– Scalar machine on which sites are in scalar order (ie
scalarsite)
• Site order/architecture specific files (both .h and .cc)
typically go as qdp_scalarsite_sse.h,
qdp_scalar_specific.cc etc etc.
Bundled Libraries
• We bundle the following in other_libs:
– QIO: You know it and (possibly) love it. The full
SciDAC QIO
– xpath_reader : For reading/writing XML
– filedb : Recent library to write 'database' like files for
faster binary output of correlators etc
– libintrin : A single precision SSE library based on the
MILC SSE routines.
• We don't bundle, but you will need to have installed
– QMP : for parallel builds
– libxml2 : the engine under xpath_reader
– bagel_qdp : for BLAS on the BlueGene/QCDOC
Supported Architectures
• Scalarsite SSE
– Sites in scalar order, SSE based
– Clusters, Crays, Laptops, Desktops, anything supporting Intel SSE
intrinsics.
– Best supported: the most common platform and the one we have most access to
• Scalarsite BAGEL QDP
– Double precision BLAS only in bagel_qdp library
– BlueGenes/QCDOC
– Best Effort Supported, depending on access/allocations
– Works on BlueGene/L/P. For BG/L disable filedb (no threads)
• Scalarsite Generic
– BLAS like routines in C/C++. Good for new architectures/compilers
– Supported and mixed with SSE for routines where we have no
SSE.
Supported Architectures
• Scalarsite QCDOC
– Was original QCDOC architecture
– Now 'merged' with scalarsite BAGEL QDP
– Essentially obsoleted: UK QCDOC turned off, not
expecting US QCDOC allocation
• Scalarvecsite SSE
– Vectorized over sites
– Apparently, large Nc calculations on workstations use
this
– I have never really developed this... so I claim not to
support it... but who knows...
QDP++ Types
• Scalarsite Templated Type definitions live in
include/qdp_scalarsite_defs.h:
typedef OLattice< PScalar< PColorVector< RComplex<REAL>, Nc> > > LatticeColorVector;
typedef OLattice< PSpinVector< PScalar< RComplex<REAL> >, Ns> > LatticeSpinVector;
typedef OLattice< PScalar< PColorMatrix< RComplex<REAL>, Nc> > > LatticeColorMatrix;
typedef OLattice< PSpinMatrix< PScalar< RComplex<REAL> >, Ns> > LatticeSpinMatrix;
typedef OLattice< PSpinVector< PColorVector< RComplex<REAL>, Nc>, Ns> > LatticeFermion;
typedef OLattice< PSpinVector< PColorVector< RComplex<REAL>, Nc>, (Ns>>1) > > LatticeHalfFermion;
• These types are 'floating' types. The precision 'floats'
depending on whether you compile in single/double precision.
– Single Prec: REAL → float
– Double Prec: REAL → double
• There are also FIXED precision types:
typedef OLattice< PSpinVector< PColorVector< RComplex<REAL64>, Nc>, 4> > LatticeDiracFermionD;
typedef OLattice< PSpinVector< PColorVector< RComplex<REAL64>, 3>, 4> > LatticeDiracFermionD3;
• NB Fixed Nc: LatticeDiracFermionD3
• Useful for when concrete precisions, Nc etc are needed
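A minimal sketch of the floating vs. fixed idea, using stand-in templates rather than the real QDP++ classes. The `*Sketch` templates, the `SKETCH_DOUBLE_PRECISION` macro and `words_per_site` are assumptions made for illustration; only the nesting pattern and the names REAL, REAL32, REAL64, Nc, Ns follow the slides.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in templates that only mimic the nesting; NOT the QDP++ classes.
template <class T> struct RComplexSketch { T re, im; };
template <class T, int N> struct PColorVectorSketch { T F[N]; };
template <class T, int N> struct PSpinVectorSketch { T F[N]; };

// Build-time precision choice; the macro name is hypothetical.
#ifdef SKETCH_DOUBLE_PRECISION
typedef double REAL;
#else
typedef float REAL;
#endif
typedef float  REAL32;
typedef double REAL64;

const int Nc = 3;
const int Ns = 4;

// 'Floating' site type: precision follows REAL.
typedef PSpinVectorSketch<PColorVectorSketch<RComplexSketch<REAL>, Nc>, Ns> FermionSite;
// 'Fixed' site types: precision (and here also Nc, Ns) pinned.
typedef PSpinVectorSketch<PColorVectorSketch<RComplexSketch<REAL32>, 3>, 4> FermionSiteF3;
typedef PSpinVectorSketch<PColorVectorSketch<RComplexSketch<REAL64>, 3>, 4> FermionSiteD3;

// Number of words of type Word held by one object of type Site.
template <class Site, class Word>
std::size_t words_per_site() { return sizeof(Site) / sizeof(Word); }
```

With the fixed types, a kernel can rely on the element type and count (24 words of REAL64 per site for FermionSiteD3) regardless of how the library was configured.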
Accessing members of the types...
• OLattice<>, PScalar<>, etc export elem() function
– OLattice<T>::elem(int site);
– PScalar<T>::elem(); // NB No indices
– PSpinVector<T,N>::elem(int i);
– PSpinMatrix<T,N>::elem(int i, int j);
– PColorVector<T,N>::elem(int i);
– PColorMatrix<T,N>::elem(int i, int j);
– RScalar<T>::elem(); // No indices
– RComplex<T>::real(); RComplex<T>::imag();
• So to get at element of a LatticeColorMatrix, you nest these:
LatticeFermion ferm;
LatticeColorMatrix u;
REAL element_of_ferm = ferm.elem(site).elem(spin).elem(color).real();
REAL elem_of_u = u.elem(site).elem().elem(row,col).real();
• NB: This kind of access is architecture dependent!
– E.g. it works for scalarsite, but not for scalarvecsite, where the order of
templates may be different... BEWARE!!! Use poke/peek routines
if you want portability across those kinds of architectures...
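The nesting of accessors can be tried out without QDP++ using stand-in classes. This is only a sketch of the pattern: the `*Sketch` classes and `site_norm2` are hypothetical, and only the accessor names elem(), real() and imag() are taken from the slides.

```cpp
#include <cassert>
#include <vector>

// Stand-ins that mimic the elem()/real()/imag() accessors; not the QDP++ classes.
template <class T> class RComplexSketch {
public:
  T& real() { return re; }
  T& imag() { return im; }
private:
  T re, im;
};

template <class T, int N> class PVectorSketch {   // plays PSpinVector / PColorVector
public:
  T& elem(int i) { return F[i]; }
private:
  T F[N];
};

template <class T> class OLatticeSketch {          // plays OLattice
public:
  explicit OLatticeSketch(int n_sites) : F(n_sites) {}
  T& elem(int site) { return F[site]; }
private:
  std::vector<T> F;                                // zero-initialized site data
};

typedef PVectorSketch<PVectorSketch<RComplexSketch<double>, 3>, 4> FermionSite;
typedef OLatticeSketch<FermionSite> LatticeFermionSketch;

// Sum of |component|^2 over one site: walks spin, then color, then re/im.
double site_norm2(LatticeFermionSketch& f, int site) {
  double sum = 0.0;
  for (int spin = 0; spin < 4; ++spin) {
    for (int color = 0; color < 3; ++color) {
      RComplexSketch<double>& c = f.elem(site).elem(spin).elem(color);
      sum += c.real() * c.real() + c.imag() * c.imag();
    }
  }
  return sum;
}
```

The chain of calls mirrors the template nesting from the outside in, which is exactly why it breaks if another architecture nests the templates in a different order.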
Accessing members of types
• The elem() etc methods are a good way of getting
hold of pointers to lattice wide data, for passing to
optimized routines (see talk on templates and
optimization)
– These kinds of optimizations may well be architecture
specific anyway....
LatticeDiracFermionD3 x, y;
gaussian(x); gaussian(y);
Double a = 0.5;
REAL64* xptr=(REAL64*)&(x.elem(0).elem(0).elem(0).real());
REAL64* yptr=(REAL64*)&(y.elem(0).elem(0).elem(0).real());
REAL64* aptr=(REAL64*)&(a.elem().elem().elem().elem());
int n_sites = all.end() - all.start()+1;
optimized_func(aptr, xptr, yptr, &n_sites); //could even call F***TRAN
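The same pointer trick, as a self-contained sketch with plain structs in place of the QDP++ types. `ComplexD`, `ColorVec3D`, `FermionSiteD3` (redefined here as a plain struct), `optimized_axpy` and `axpy` are all stand-ins invented for illustration, and the sketch assumes the compiler adds no padding inside the nested structs, which the assert checks.

```cpp
#include <cassert>
#include <vector>

// Plain-old-data stand-ins with the same nesting as a fixed-precision fermion.
struct ComplexD { double re, im; };
struct ColorVec3D { ComplexD F[3]; };
struct FermionSiteD3 { ColorVec3D F[4]; };

// Stand-in for an optimized kernel working on a flat array: y <- a*x + y.
// Pointer arguments mirror the calling style used above.
void optimized_axpy(const double* a, const double* x, double* y, const int* n_sites) {
  const int n_reals = (*n_sites) * 4 * 3 * 2;   // spin * color * (re, im) per site
  for (int i = 0; i < n_reals; ++i) y[i] += (*a) * x[i];
}

// Wrapper: take the address of the first real number and hand it over.
void axpy(double a, const std::vector<FermionSiteD3>& x, std::vector<FermionSiteD3>& y) {
  assert(sizeof(FermionSiteD3) == 24 * sizeof(double));  // no padding assumed
  assert(x.size() == y.size());
  int n_sites = static_cast<int>(x.size());
  if (n_sites == 0) return;
  const double* xptr = &(x[0].F[0].F[0].re);
  double* yptr = &(y[0].F[0].F[0].re);
  optimized_axpy(&a, xptr, yptr, &n_sites);
}
```

The kernel never sees the nested types, only a base pointer and a site count, so it could equally be written in C, assembler or Fortran.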
Memory Ordering etc etc
• Primitive datatype (PSpinVector, PColorVector, etc)
'allocate' their own memory by holding fixed size
arrays eg:
template <class T, int N> class PSpinVector
{
private:
  T F[N]; // Data member F is an N element array of T
};
• Outer lattice types holding primitive types allocate all
the data up front in a contiguous chunk eg.
(qdp_outer.h, OLattice::alloc_mem):
slow=(T*) QDP::Allocator::theQDPAllocator::Instance().allocate(
sizeof(T)*Layout::sitesOnNode(),
QDP::Allocator::DEFAULT);
F=slow;
Memory Ordering etc.
• Using the QDP::Allocator ensures that lattice data will start
suitably aligned but:
– If the data type has a size that is not a multiple of an 'alignment
unit', successive elements may not be appropriately aligned –
padding may be needed
– Adding padding may confuse some optimized routines which do
not expect it...
• Fast/Slow pointers are a QCDOC abstraction...
• multi1d<> arrays will create an array of objects
– a multi1d<> of OLattice<T> will create an array
of OLattice<T> objects.
– OLattice<T> objects have dynamically allocated data held via
pointers..
– The objects will be contiguous, but the memories they point to
may end up being non-contiguous... - array of pointers problem
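The alignment caveat above can be made concrete with a little arithmetic. This sketch assumes a 16-byte alignment unit (as needed for aligned SSE loads), 4-byte floats, and no compiler-inserted padding inside a site object; the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// True if every element of an array of 'elem_size'-byte objects starts on an
// 'align'-byte boundary, given that the array itself starts on one.
bool all_elements_aligned(std::size_t elem_size, std::size_t align) {
  return elem_size % align == 0;
}

// Element size rounded up to the next multiple of 'align' (i.e. with padding).
std::size_t padded_size(std::size_t elem_size, std::size_t align) {
  return ((elem_size + align - 1) / align) * align;
}

// Sizes of two single-precision site types, assuming no internal padding.
const std::size_t kColorVectorF = 3 * 2 * sizeof(float);      // 3 complex floats
const std::size_t kFermionF     = 4 * 3 * 2 * sizeof(float);  // 4 spins x 3 colors
```

Under these assumptions a single-precision color vector (24 bytes) lands on a 16-byte boundary only at every second site, while a single-precision fermion (96 bytes) is aligned at every site.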
Memory Ordering etc...
[Figure: a multi1d<OLattice<T> > with elements [0] [1] [2] [3] [4]. Neighbouring OLattice<T> elements are 'contiguous'; the dynamically allocated data memory each one points to is not necessarily contiguous.]
• multi1d<T> has a 'placement' constructor where user
can give a pointer to an allocated area, and the
multi1d<> will use that internally...
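One way around the array-of-pointers problem is to allocate a single pool and let each array element view a slice of it. The sketch below shows that idea with standard containers only; it does not use the multi1d<> interface, and `FieldView`, `carve` and `is_contiguous` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A view onto 'n' doubles that live in memory owned by someone else.
struct FieldView { double* data; std::size_t n; };

// Size one contiguous pool and carve it into 'n_fields' views of 'n_sites' each.
std::vector<FieldView> carve(std::vector<double>& pool,
                             std::size_t n_fields, std::size_t n_sites) {
  pool.assign(n_fields * n_sites, 0.0);
  std::vector<FieldView> views(n_fields);
  for (std::size_t i = 0; i < n_fields; ++i) {
    views[i].data = pool.data() + i * n_sites;
    views[i].n = n_sites;
  }
  return views;
}

// True if each view's data ends exactly where the next one begins.
bool is_contiguous(const std::vector<FieldView>& v) {
  for (std::size_t i = 0; i + 1 < v.size(); ++i)
    if (v[i].data + v[i].n != v[i + 1].data) return false;
  return true;
}
```

Separately allocated blocks carry no such guarantee, so a kernel that wants to stream over all five fields in one go needs storage arranged like this.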
Memory Ordering etc
• In nested primitive types, ordering will follow the
nesting, with 'innermost fastest' Eg:
PSpinVector<PColorMatrix<RComplex<REAL64>, 3>, 4>
• will run as
– complex fastest (real then imag)
– color indices after
– spin indices slowest
• For OLattice types, site index will run slowest as
dictated by the Layout in use.
– x fastest, then y, z, t in lexicographic layout
– x/2 (fastest), y, z, t, cb in cb2, checkerboarded layout
– cb2 checkerboarded layout is the default...
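The orderings above translate into simple index arithmetic. This is a sketch under stated assumptions: the color matrix is taken to be stored row-major (the slide only says the color indices come after the complex components), the cb2 formula is one reading of "x/2 (fastest), y, z, t, cb" with the checkerboard defined as the parity of x+y+z+t, and all function names are hypothetical.

```cpp
#include <cassert>

// Offset, in units of one real number, of component (spin, row, col, reim)
// inside one PSpinVector<PColorMatrix<RComplex<REAL64>, 3>, 4>-shaped object.
// reim: 0 = real part, 1 = imaginary part. Row-major color matrix assumed.
int site_offset(int spin, int row, int col, int reim) {
  const int Nc = 3;
  return ((spin * Nc + row) * Nc + col) * 2 + reim;
}

// Offset of the same component in a lattice-wide array: site index is slowest.
int lattice_offset(int site, int spin, int row, int col, int reim) {
  const int reals_per_site = 4 * 3 * 3 * 2;   // spin * row * col * (re, im)
  return site * reals_per_site + site_offset(spin, row, col, reim);
}

// Lexicographic site index on a lattice of size L: x fastest, t slowest.
int lexico_index(const int c[4], const int L[4]) {
  return c[0] + L[0] * (c[1] + L[1] * (c[2] + L[2] * c[3]));
}

// cb2 reading: x/2 fastest, then y, z, t, checkerboard slowest. L[0] must be even.
int cb2_index(const int c[4], const int L[4]) {
  const int half[4] = { L[0] / 2, L[1], L[2], L[3] };
  const int cc[4]   = { c[0] / 2, c[1], c[2], c[3] };
  const int cb      = (c[0] + c[1] + c[2] + c[3]) % 2;
  const int vol_cb  = half[0] * half[1] * half[2] * half[3];
  return lexico_index(cc, half) + cb * vol_cb;
}
```

In the cb2 reading all sites of one checkerboard occupy the first half of the index range and the other checkerboard the second half, so a single-checkerboard subset is a contiguous range.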
Floating vs. Fixed Types
• Typically, we worked with floating types except in
SSE (where we needed to know whether we were in 32
or 64 bits to select the right instructions). On BG/L we
always built in 64 bits.
• Now we are interested in mixed precision (and mixed
precision solvers). Want to have both 32 and 64 bit
optimized code in scope if possible.
• SSE has been more or less cleaned.
• BG/L we haven't started yet (still mostly floating)
• Generics is mostly still floating.
• My next QDP++ maintenance activity is to clean this
up.
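Having both precisions in scope can be as simple as overloading kernels on the fixed types. The sketch below illustrates that idea only; `axpy_kernel` and `add_correction` are hypothetical names, and the loop bodies stand in for precision-specific optimized code.

```cpp
#include <cassert>

typedef float  REAL32;
typedef double REAL64;

// Two fixed-precision kernels, both visible at once. In real code each body
// would use precision-specific instructions; here they are plain loops.
void axpy_kernel(REAL32 a, const REAL32* x, REAL32* y, int n) {
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}
void axpy_kernel(REAL64 a, const REAL64* x, REAL64* y, int n) {
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

// Mixed-precision step: accumulate a single-precision correction into a
// double-precision vector, x64 <- x64 + dx32.
void add_correction(REAL64* x64, const REAL32* dx32, int n) {
  for (int i = 0; i < n; ++i) x64[i] += static_cast<REAL64>(dx32[i]);
}
```

Overload resolution picks the kernel from the pointer types, so neither version depends on how the floating REAL was configured.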
Subsets...
• A lot of QDP++ operations can be done under a subset
• For hooking in optimizations it is useful to know that
– a subset can be contiguous/not
• hasOrderedRep() member function (true=>contiguous)
– a subset has a site-table you can access
• siteTable() member func gives site table as a multi1d<int>&
• numSiteTable() gives number of elements in site table.
– a subset can give you
• index of its first element: with start() member func;
• index of its last element: with end() member func.
Typical Subset Site Loops
• Typically working under a subset, the loop will look
like this:
if( s.hasOrderedRep() ) {
  // Contiguous case
  // Get pointer to first element, and size.
  REAL64* ptr = (REAL64 *)&(stuff.elem(s.start()).elem(0).elem(0).elem());
  int n_sites = s.end()-s.start()+1; // s.start() and s.end() both inclusive
  optimized_func(ptr, n_sites);
}
else {
  // Non contiguous case
  const int* tab = s.siteTable().slice(); // Grab a pointer to site table
  // explicitly loop through all elements of the site table for
  // subset (the loop must sit inside the else so that tab is in scope)
  for(int sites = 0; sites < s.numSiteTable(); sites++) {
    int i = tab[sites];
    REAL64* ptr=(REAL64*)&(stuff.elem(i).elem(0).elem(0).elem());
    optimized_func(ptr, 1);
  }
}
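The two branches can be exercised without QDP++ using a stand-in subset. In this sketch only the member names (hasOrderedRep, start, end, numSiteTable, siteTable) follow the slides; `SubsetSketch`, its constructor, the one-double-per-site field and the doubling kernel are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a subset; only the member names follow the slides.
class SubsetSketch {
public:
  explicit SubsetSketch(const std::vector<int>& sites) : tab_(sites), ordered_(true) {
    for (std::size_t i = 1; i < tab_.size(); ++i)
      if (tab_[i] != tab_[i - 1] + 1) ordered_ = false;   // gap => not contiguous
  }
  bool hasOrderedRep() const { return ordered_; }
  int start() const { return tab_.front(); }
  int end() const { return tab_.back(); }                 // inclusive
  int numSiteTable() const { return static_cast<int>(tab_.size()); }
  const std::vector<int>& siteTable() const { return tab_; }
private:
  std::vector<int> tab_;
  bool ordered_;
};

// Stand-in kernel: doubles n_sites consecutive values starting at ptr.
void optimized_func(double* ptr, int n_sites) {
  for (int i = 0; i < n_sites; ++i) ptr[i] *= 2.0;
}

// Apply the kernel to the sites of subset s (one double per site here).
void apply_on_subset(std::vector<double>& field, const SubsetSketch& s) {
  if (s.numSiteTable() == 0) return;
  if (s.hasOrderedRep()) {
    // Contiguous: one call over the whole range.
    optimized_func(&field[s.start()], s.end() - s.start() + 1);
  } else {
    // Not contiguous: one call per entry of the site table.
    const std::vector<int>& tab = s.siteTable();
    for (int j = 0; j < s.numSiteTable(); ++j) optimized_func(&field[tab[j]], 1);
  }
}
```

The contiguous branch makes a single kernel call, which is why layouts that keep common subsets contiguous matter for performance.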
Layouts
• To specify a site layout, you typically need 6 functions
or so.
• The 3 primary ones are:
– int linearSiteIndex(const multi1d<int>& coord)
• maps a global coordinate (coord) into the local site index
• site index only useful on the node holding the site
– int nodeNumber(const multi1d<int>& coord)
• returns the number of the node that holds the site with global coordinate coord
– multi1d<int> siteCoords(int node, int linear)
• given a node number and a linear site index, return the
global site coordinates
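How the three functions fit together can be sketched for a simple case. The layout below is hypothetical: a lattice split evenly over a grid of nodes, with lexicographic ordering (index 0 fastest) for both sites and nodes. It uses std::vector in place of multi1d<int> and is not the QDP++ implementation; only the three function names follow the slides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical layout: global lattice split evenly over a grid of nodes.
struct LayoutSketch {
  std::vector<int> latt;    // global lattice size per dimension
  std::vector<int> nodes;   // number of nodes per dimension (must divide latt)

  int nd() const { return static_cast<int>(latt.size()); }
  int localSize(int mu) const { return latt[mu] / nodes[mu]; }

  // Node holding the site with global coordinate coord.
  int nodeNumber(const std::vector<int>& coord) const {
    int node = 0;
    for (int mu = nd() - 1; mu >= 0; --mu)
      node = node * nodes[mu] + coord[mu] / localSize(mu);
    return node;
  }

  // Index of the site within its node's local sub-lattice.
  int linearSiteIndex(const std::vector<int>& coord) const {
    int idx = 0;
    for (int mu = nd() - 1; mu >= 0; --mu)
      idx = idx * localSize(mu) + coord[mu] % localSize(mu);
    return idx;
  }

  // Inverse map: (node, local index) -> global coordinate.
  std::vector<int> siteCoords(int node, int linear) const {
    std::vector<int> coord(latt.size());
    for (int mu = 0; mu < nd(); ++mu) {
      const int ls = localSize(mu);
      coord[mu] = (node % nodes[mu]) * ls + linear % ls;
      node /= nodes[mu];
      linear /= ls;
    }
    return coord;
  }
};
```

For a 4x4x4x8 lattice on a 1x1x2x2 node grid, the site (3,2,3,5) lives on node 3 at local index 59, and siteCoords(3, 59) returns (3,2,3,5) again.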
Layouts...
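[Figure: Layout::nodeNumber() maps global coordinates (x,y,z,t) to a node number, and Layout::linearSiteIndex() maps them to a local linear site index; Layout::siteCoords() maps a node number and local linear site index back to global coordinates.]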
• Layout specific versions defined in qdp_scalar_layout.cc,
qdp_parscalar_layout.cc, qdp_parscalarvec_layout.cc
• Autoconf defined macros select which are used :
– QDP_USE_CB2_LAYOUT, QDP_USE_LEXICO_LAYOUT etc..
Layouts
• The 3 main layout functions are exported so that
optimized kernels can learn about the QDP++ layout.
The exported names are:
– void QDPXX_getSiteCoords(int coord[], int node,
int linear)
– int QDPXX_getLinearSiteIndex(const int coord[]);
– int QDPXX_nodeNumber(const int coord[]);
• NB: const multi1d<>& references replaced with
arrays.
• Declaration/definition in include/qdp_layout.h
Layouts
• The three remaining functions needed are:
– multi1d<int> minimalLayoutMapping()
• returns the smallest local volume allowed per node
– multi1d<int> crtesn(int pos, const multi1d<int>& lsize)
• compute cartesian coordinates of site pos in a lattice of size lsize
– int local_site(const multi1d<int>& coord,
const multi1d<int>& lsize);
• given a coordinate (coord) and a lattice size (lsize)
compute the local site index.
• from the layout point of view, crtesn and local_site were
strictly internal utility functions, but they have escaped
(eg into writeOLattice, readArchiv etc)
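The two utilities are inverses of each other, which a short sketch makes explicit. This version uses std::vector instead of multi1d<int> and assumes index 0 runs fastest; the `_sketch` names and `round_trip_ok` are hypothetical, and this is not the library source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cartesian coordinates of site 'pos' in a lattice of size 'lsize',
// with index 0 running fastest.
std::vector<int> crtesn_sketch(int pos, const std::vector<int>& lsize) {
  std::vector<int> coord(lsize.size());
  for (std::size_t mu = 0; mu < lsize.size(); ++mu) {
    coord[mu] = pos % lsize[mu];
    pos /= lsize[mu];
  }
  return coord;
}

// Inverse of crtesn_sketch: lexicographic index of 'coord' in 'lsize'.
int local_site_sketch(const std::vector<int>& coord, const std::vector<int>& lsize) {
  int index = 0;
  for (int mu = static_cast<int>(lsize.size()) - 1; mu >= 0; --mu)
    index = index * lsize[mu] + coord[mu];
  return index;
}

// True if the two functions are mutual inverses over the whole lattice.
bool round_trip_ok(const std::vector<int>& lsize) {
  int vol = 1;
  for (std::size_t mu = 0; mu < lsize.size(); ++mu) vol *= lsize[mu];
  for (int pos = 0; pos < vol; ++pos)
    if (local_site_sketch(crtesn_sketch(pos, lsize), lsize) != pos) return false;
  return true;
}
```

Because both depend only on a position and a lattice size, they are handy outside layout code too, which may be how they "escaped".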
Layouts
• Layout code is a little messy right now
– Primarily in lib/qdp_layout.cc & architecture specific
layout files (qdp_scalar_layout.cc,
qdp_parscalar_layout.cc etc). Just grep for LAYOUT...
– Various layouts separated by #ifdef-s and we use
conditional compilation
– Autoconf defines the #ifdef macro for the layout as a result of
using the --enable-layout= configure option
– Macro is defined in qdp_config_internal.h, and is
visible from qdp_config.h
– At some point this should be tidied up...
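The conditional-compilation pattern looks roughly like the sketch below. The two macro names are the ones quoted earlier in the talk; everything else (`layout_name`, `n_checkerboards`, defining the macro by hand instead of including the generated config header) is an assumption made so the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstring>

// Normally the configure step would put one of these macros into the
// generated config header; it is defined by hand here so the sketch compiles.
#define QDP_USE_CB2_LAYOUT 1

#if defined(QDP_USE_LEXICO_LAYOUT)
// Lexicographic ordering: a single block of sites.
const char* layout_name() { return "lexicographic"; }
const int n_checkerboards = 1;
#elif defined(QDP_USE_CB2_LAYOUT)
// cb2 ordering: sites grouped into two checkerboards.
const char* layout_name() { return "cb2"; }
const int n_checkerboards = 2;
#else
#error "no layout macro defined"
#endif
```

Only one branch is ever compiled, so switching layouts means reconfiguring and rebuilding rather than choosing at run time.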
Summary
• You've now seen most of QDP++'s dirty linen
– Talked briefly about code layout, bundled libraries
– Talked briefly about where the recursive types are
defined, and how to access their elements.
– Talked about how the memory is laid out
• small objects have fixed size data
• OLattice<> objects malloc their data, try to align etc
• multi1d<> objects are contiguous arrays of objects, but the data those objects point to may not be contiguous
– Talked about Layouts, how their chief functions work to
constrain site ordering.
• Most of what remains has to do with Templates,
specialization and threading, in the next talk entitled
“Templates and all that”