
QDP++ and Chroma
Robert Edwards
Jefferson Lab
Collaborators: Balint Joo
Lattice QCD – extremely uniform
Dirac operator (continuum):

  D_\mu \psi(x) = \left( \partial_\mu + i g A_\mu(x) \right) \psi(x)
 Periodic or very simple boundary conditions
 SPMD: identical sublattices per processor
Lattice operator (nearest-neighbour covariant difference):

  D \psi(x) = \frac{1}{2a} \sum_\mu \left[ U_\mu(x)\, \psi(x+\hat\mu)
              - U_\mu^\dagger(x-\hat\mu)\, \psi(x-\hat\mu) \right]

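Because the same nearest-neighbour stencil runs at every site, the sweep is perfectly uniform. A toy illustration in plain C++ (a 1-D U(1) model rather than full QCD; the names and setup here are illustrative, not QDP++):

#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

// Toy 1-D "dslash": D phi(x) = [ U(x) phi(x+1) - conj(U(x-1)) phi(x-1) ] / (2a).
// Every site executes the identical stencil with periodic boundaries --
// the uniformity that makes the SPMD sublattice decomposition work.
void dslash_1d(std::vector<cplx>& Dphi, const std::vector<cplx>& U,
               const std::vector<cplx>& phi, double a)
{
    const std::size_t L = phi.size();
    for (std::size_t x = 0; x < L; ++x) {
        const std::size_t xp = (x + 1) % L;       // forward neighbour
        const std::size_t xm = (x + L - 1) % L;   // backward neighbour
        Dphi[x] = (U[x] * phi[xp] - std::conj(U[xm]) * phi[xm]) / (2.0 * a);
    }
}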
Software Infrastructure Goals: Create a unified
software environment that will enable the US lattice
community to achieve very high efficiency on diverse
multi-terascale hardware.
TASKS  LIBRARIES:
I.   QCD Data Parallel API           QDP
II.  Optimized Message Passing       QMP
III. Optimized QCD Linear Algebra    QLA
IV.  I/O, Data Files and Data Grid   QIO
V.   Optimized Physics Codes         CPS/MILC/Chroma/etc.
VI.  Execution Environment           unify BNL/FNAL/JLab
Overlapping communications and computations

C(x) = A(x) * shift(B, +mu), with the data laid out over processors:
– Send the face forward (non-blocking) to the neighboring node.
– Receive the face into a pre-allocated buffer.
– Meanwhile, do A*B on the interior sites.
– "Wait" on the receive, then do A*B on the face.

Lazy Evaluation (C style):
Shift(tmp, B, + mu);
Mult(C, A, tmp);
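A minimal sketch of this overlap pattern, written against plain MPI rather than QMP (the kernels multiply_interior and multiply_face, and the buffer/neighbour names, are hypothetical placeholders for the local A*B work and the layout):

// Assumed declared elsewhere: send_buf/recv_buf (packed faces), face_len,
// and the neighbour ranks nbr_plus/nbr_minus along direction mu.
MPI_Request sreq, rreq;

MPI_Irecv(recv_buf, face_len, MPI_DOUBLE, nbr_minus, 0, MPI_COMM_WORLD, &rreq);
MPI_Isend(send_buf, face_len, MPI_DOUBLE, nbr_plus,  0, MPI_COMM_WORLD, &sreq);

multiply_interior(C, A, B);           // meanwhile: A*B on interior sites

MPI_Wait(&rreq, MPI_STATUS_IGNORE);   // "wait" on the receive ...
multiply_face(C, A, recv_buf);        // ... then A*B on the face
MPI_Wait(&sreq, MPI_STATUS_IGNORE);   // send buffer is reusable again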
SciDAC Software Structure

Level 3: Optimised Dirac operators and inverters
         (Wilson op and DWF inverter for P4; Wilson and staggered op for QCDOC)

Level 2: QDP (QCD Data Parallel) – lattice-wide operations, data shifts
         QIO – XML I/O, LIME
         (exists in C and C++, scalar and MPP, using QMP)

Level 1: QLA (QCD Linear Algebra)
         QMP (QCD Message Passing)
         (exists; implemented over MPI, GM, gigE and on QCDOC)
QMP Simple Example

char buf[size];
QMP_msgmem_t mm;
QMP_msghandle_t mh;

mm = QMP_declare_msgmem(buf, size);      // multiple calls possible
mh = QMP_declare_send_relative(mm, +x);
QMP_start(mh);
// Do computations
QMP_wait(mh);

The receiving node follows the same steps, except:
mh = QMP_declare_receive_from(mm, -x);
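On a node that both sends its face forward and receives one from behind, the two handles can be in flight simultaneously while the interior is computed (a sketch following the same simplified signatures as above):

QMP_msghandle_t send_mh = QMP_declare_send_relative(send_mm, +x);
QMP_msghandle_t recv_mh = QMP_declare_receive_from(recv_mm, -x);

QMP_start(send_mh);
QMP_start(recv_mh);
// Do computations on interior sites
QMP_wait(recv_mh);   // face data available: compute on the face
QMP_wait(send_mh);   // send buffer reusable again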
Data Parallel QDP/C, C++ API

 Hides architecture and layout
 Operates on lattice fields across sites
 Linear algebra tailored for QCD
 Shifts and permutation maps across sites
 Reductions
 Subsets
 Entry/exit – attach to existing codes
(A flavour of the API is sketched below.)
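A few illustrative lines (a sketch in the spirit of the API; see the worked examples on the following slides for the precise forms):

LatticeFermion psi;
gaussian(psi);                   // platform-independent random fill
Double n  = norm2(psi);          // reduction over the whole lattice
Double ne = norm2(psi, even);    // the same, restricted to a subset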
QDP++ Type Structure

Lattice fields have various kinds of indices –
Color: U^{ab}(x)   Spin: \Gamma_{\alpha\beta}   Mixed: \psi^{a}_{\alpha}(x), Q^{ab}_{\alpha\beta}(x)

The tensor product of the indices forms the type:

                Lattice   Color        Spin         Complexity
Gauge fields:   Lattice   Matrix(Nc)   Scalar       Complex
Fermions:       Lattice   Vector(Nc)   Vector(Ns)   Complex
Scalars:        Scalar    Scalar       Scalar       Scalar
Propagators:    Lattice   Matrix(Nc)   Matrix(Ns)   Complex
Gamma:          Scalar    Scalar       Matrix(Ns)   Complex

QDP++ forms these types via nested C++ templating.
Formation of new types (e.g. half fermion) is possible.
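A stripped-down illustration of the nesting (simplified stand-in structs, not the actual QDP++ class names):

#include <vector>

// Each template layer contributes one index of the tensor product.
template<class T>        struct RComplex { T re, im; };            // complexity
template<class T, int N> struct PVector  { T elem[N]; };           // vector index
template<class T, int N> struct PMatrix  { T elem[N][N]; };        // matrix index
template<class T>        struct OLattice { std::vector<T> site; }; // lattice index

// Fermion    = Lattice x SpinVector(Ns=4) x ColorVector(Nc=3) x Complex
typedef OLattice< PVector< PVector< RComplex<float>, 3 >, 4 > > Fermion;

// Propagator = Lattice x SpinMatrix(Ns)   x ColorMatrix(Nc)   x Complex
typedef OLattice< PMatrix< PMatrix< RComplex<float>, 3 >, 4 > > Propagator;

// New types, e.g. a half fermion with two spin components, fall out for free:
typedef OLattice< PVector< PVector< RComplex<float>, 3 >, 2 > > HalfFermion;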
Data-parallel Operations

 Unary and binary: -a; a-b; …
 Unary functions: adj(a), cos(a), sin(a), …
 Random numbers: random(a), gaussian(a)   // platform independent
 Comparisons (booleans): a <= b, …
 Broadcasts: a = 0, …
 Reductions: sum(a), …

Fields have various types (indices), forming a tensor product –
Lattice: A(x)   Color: U^{ij}(x)   Spin: \Gamma_{\alpha\beta}   Mixed: \psi^{i}_{\alpha}(x), Q^{ij}_{\alpha\beta}(x)
QDP Expressions

Can create expressions:

  c^{i}_{\alpha}(x) = U^{ij}(x)\, b^{j}_{\alpha}(x+\hat\mu) + 2\, d^{i}_{\alpha}(x),
  \qquad \forall\, x \in \mathrm{Even}

QDP/C++ code:

multi1d<LatticeColorMatrix> u(Nd);
LatticeDiracFermion b, c, d;
int mu;
c[even] = u[mu] * shift(b,mu) + 2 * d;

PETE: Portable Expression Template Engine
 Temporaries eliminated, expressions optimised
Linear Algebra Implementation

// Lattice operation
A = adj(B) + 2 * C;

// Naive evaluation: lattice temporaries
t1 = 2 * C;
t2 = adj(B);
t3 = t2 + t1;
A = t3;

// PETE evaluation: one merged lattice loop
for (i = ... ; ... ; ...) {
  A[i] = adj(B[i]) + 2 * C[i];
}

 Naïve ops involve lattice temporaries – inefficient
 PETE eliminates the lattice temporaries
 Allows further combining of operations (adj(x)*y)
 Overlaps communications and computations
 Full performance – expressions at site level
QDP++ Optimization

Optimizations "under the hood":
 Select numerically intensive operations through template specialization.
 PETE recognises expression templates like z = a*x + y from type
  information at compile time (sketched below).
 Calls a machine-specific optimised routine (axpyz).
 The optimized routine can use assembler, reorganize loops, etc.
 Optimized routines can be selected at configuration time.
 Unoptimized fallback routines exist for portability.
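The mechanism in miniature (a self-contained sketch of the idea, not the real PETE machinery): the overloads build a type that encodes a*x + y, and assignment from exactly that type dispatches to the fused kernel.

#include <cstddef>

struct Vec;                                                 // forward declaration
struct ScaledVec { double a; const Vec& x; };               // encodes a*x
struct AxpyExpr  { double a; const Vec& x; const Vec& y; }; // encodes a*x + y

struct Vec {
  double* d; std::size_t n;
  Vec& operator=(const AxpyExpr& e);   // assignment pattern-matches the expression
};

inline ScaledVec operator*(double a, const Vec& x)
{ ScaledVec s = { a, x }; return s; }

inline AxpyExpr operator+(const ScaledVec& s, const Vec& y)
{ AxpyExpr e = { s.a, s.x, y }; return e; }

// The fused kernel: z = a*x + y with no temporaries and one merged loop.
// In QDP++ this slot is filled by an optimised (possibly assembler) axpyz.
inline void axpyz(double* z, double a, const double* x, const double* y,
                  std::size_t n)
{ for (std::size_t i = 0; i < n; ++i) z[i] = a * x[i] + y[i]; }

Vec& Vec::operator=(const AxpyExpr& e)
{ axpyz(d, e.a, e.x.d, e.y.d, n); return *this; }

// Usage:  z = a*x + y;  builds an AxpyExpr at compile time; the assignment
// then calls axpyz directly.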
Performance Test Case – Wilson Conjugate Gradient

LatticeFermion psi, p, r, mp, mmp;
Real c, cp, a, d, b;
Subset s;

for(int k = 1; k <= MaxCG; ++k)
{
  // c = | r[k-1] |**2
  c = cp;

  // a[k] := | r[k-1] |**2 / < M p[k], M p[k] >
  // Mp = M(u) * p
  M(mp, p, PLUS);       // Dslash

  // d = | mp |**2  (norm square)
  d = norm2(mp, s);
  a = c / d;

  // psi[k] += a[k] p[k]
  psi[s] += a * p;      // VAXPY operation

  // r[k] -= a[k] M^dag.M.p[k]
  M(mmp, mp, MINUS);
  r[s] -= a * mmp;      // VAXPY operation

  cp = norm2(r, s);
  if ( cp <= rsd_sq ) return;

  // b[k+1] := |r[k]|**2 / |r[k-1]|**2
  b = cp / c;

  // p[k+1] := r[k] + b[k+1] p[k]
  p[s] = r + b*p;       // VAXPY operation
}

 In C++ there is significant room for performance degradation.
 The performance limits are in the linear algebra ops (VAXPY) and the norms.
 Optimization: functions return a container holding the function type and
  its operands; at "=", the expression is replaced with optimized code by
  template specialization.
 Performance: QDP overhead is ~1% of peak.
  Wilson on QCDOC: 283 Mflops/node @ 350 MHz, 4^4 sites per node.
Chroma

A lattice QCD toolkit/library built on top of QDP++.
The library is a module – it can be linked with other codes.

Features:
 Utility libraries (gluonic measurements, smearing, etc.)
 Fermion support (DWF, Overlap, Wilson, Asqtad)
 Applications: spectroscopy, propagators & 3-pt functions, eigenvalues,
  heatbath, HMC
  (e.g. McNeile: computes propagators with CPS, measures pions with Chroma,
  all in the same code)
 Optimization hooks – level 3 Wilson-Dslash for Pentium, QCDOC, BG/L and
  IBM SP-like nodes (via Bagel)
Software Map

[Figure: directory structure of the distribution]
Chroma Lib Structure

Chroma Lattice Field Theory library
• Support for gauge and fermion actions
  – Boson action support
  – Fermion action support
    • Fermion actions
    • Fermion boundary conditions
    • Inverters
    • Fermion linear operators
    • Quark propagator solution routines
  – Gauge action support
    • Gauge actions
    • Gauge boundary conditions
• IO routines
  – Enums
• Measurement routines
  – Eigenvalue measurements
  – Gauge fixing routines
  – Gluonic observables
  – Hadronic observables
  – Inline measurements
    • Eigenvalue measurements
    • Glue measurements
    • Hadron measurements
    • Smear measurements
  – Psibar-psi measurements
  – Schroedinger functional
  – Smearing routines
  – Trace-log support
• Gauge field update routines
  – Heatbath
  – Molecular dynamics support
    • Hamiltonian systems
    • HMC trajectories
    • HMD integrators
    • HMC monomials
    • HMC linear system solver initial guess
• Utility routines
  – Fermion manipulation routines
  – Fourier transform support
  – Utility routines for manipulating color matrices
  – Info utilities
Fermion Actions

 Actions are factory objects (foundries)
 They do not hold gauge fields – only parameters
 Factory/creation functions take a gauge field argument:
  – takes a gauge field – creates a State & applies the fermion BCs
  – takes a State – creates a Linear Operator (dslash)
  – takes a State – creates quark propagator solvers
 Linear Ops are function objects (see the sketch below)
  – E.g., class Foo { int operator()(int x); } fred;   // int z = fred(1);
  – Argument to CG, MR, etc. – simple functions
 Created with XML
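A sketch of the shape of these interfaces (hypothetical names, not Chroma's actual class hierarchy) – the inverter only ever sees operator(), never the gauge field captured behind it:

// The linear operator: a function object, like Foo above, but lattice-wide.
template<class Field>
struct LinOp {
  virtual void operator()(Field& out, const Field& in, int isign) const = 0;
  virtual ~LinOp() {}
};

// An inverter is then just a simple function taking the operator:
template<class Field>
int invertCG(const LinOp<Field>& M, Field& psi, const Field& chi,
             double rsd, int max_iter);   // declaration only – a sketch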
Fermion Actions – XML

 The FermAct tag is the key into a lookup map of constructors
 During construction, the action reads its XML
 The FermBC tag invokes another lookup
<FermionAction>
<FermAct>WILSON</FermAct>
<Kappa>0.11</Kappa>
<FermionBC>
<FermBC>SIMPLE_FERMBC</FermBC>
<boundary>1 1 1 -1</boundary>
</FermionBC>
<AnisoParam>
<anisoP>false</anisoP>
<t_dir>3</t_dir>
<xi_0>1.0</xi_0>
<nu>1.0</nu>
</AnisoParam>
</FermionAction>
XPath used in chroma/mainprogs/main/propagator.cc
/propagator/Params/FermionAction/FermAct
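Roughly how that XPath is consumed (a hedged sketch following the QDP++ XML reader conventions; the file name "DATA" is assumed):

XMLReader xml_in("DATA");                 // the input parameter file
std::string fermact;
read(xml_in, "/propagator/Params/FermionAction/FermAct", fermact);
// fermact (here "WILSON") is then the key into the constructor lookup map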
HMC and Monomials

 HMC is built on Monomials
 Monomials define Nf, the gauge pieces, etc.
 A Monomial provides only its action S(U) and its force deriv(U) for the
  momentum update; pseudofermions are not visible outside it.
 Nf=2 and rational Nf=1 monomials exist
 Both 4D and 5D versions
<Monomials>
<elem>
<Name>TWO_FLAVOR_WILSON_FERM_MONOMIAL</Name>
<FermionAction>
<FermAct>WILSON</FermAct>
……
</FermionAction>
<InvertParam>
<invType>CG_INVERTER</invType>
<RsdCG>1.0e-7</RsdCG>
<MaxCG>1000</MaxCG>
</InvertParam>
<ChronologicalPredictor>
<Name>MINIMAL_RESIDUAL_EXTRAPOLATION_4D_PREDICTOR</Name>
<MaxChrono>7</MaxChrono>
</ChronologicalPredictor>
</elem>
<elem> …. </elem>
</Monomials>
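The interface implied here is tiny; a sketch (hypothetical signatures, not the exact Chroma declarations):

// P = momenta type, Q = gauge field type. Each monomial exposes only its
// contribution to the action and to the force; any pseudofermion fields
// live privately inside the monomial.
template<class P, class Q>
struct Monomial {
  virtual double S(const Q& u) = 0;          // action piece S(U)
  virtual void   dsdq(P& F, const Q& u) = 0; // force contribution dS/dU
  virtual ~Monomial() {}
};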
Gauge Monomials

 Gauge monomials: plaquette, rectangle, parallelogram
 The Monomial constructor invokes the constructor registered for the
  Name tag inside GaugeAction
<Monomials>
<elem> …. </elem>
<elem>
<Name>WILSON_GAUGEACT_MONOMIAL</Name>
<GaugeAction>
<Name>WILSON_GAUGEACT</Name>
<beta>5.7</beta>
<GaugeBC>
<Name>PERIODIC_GAUGEBC</Name>
</GaugeBC>
</GaugeAction>
</elem>
</Monomials>
Chroma – Inline Measurements

 HMC has inline measurements
 chroma.cc is an inline-only code
 Former main programs are now inline measurements
 Measurements are registered with a constructor call
 A measurement is given the gauge field – there is no return value
 Measurements communicate with each other only via disk (maybe memory
  buffers in the future?)
<InlineMeasurements>
<elem>
<Name>MAKE_SOURCE</Name>
<Param>...</Param>
<Prop>
<source_file>./source_0</source_file>
<source_volfmt>MULTIFILE</source_volfmt>
</Prop>
</elem>
<elem>
<Name>PROPAGATOR</Name>
<Param>...</Param>
<Prop>
<source_file>./source_0</source_file>
<prop_file>./propagator_0</prop_file>
<prop_volfmt>MULTIFILE</prop_volfmt>
</Prop>
</elem>
<elem>….</elem>
</InlineMeasurements>
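The registration pattern behind the Name tags, in miniature (a sketch; the real Chroma factories are more elaborate):

#include <map>
#include <string>

struct InlineMeas {
  virtual void operator()() = 0;   // given the gauge field; writes to disk
  virtual ~InlineMeas() {}
};

typedef InlineMeas* (*Creator)();  // constructor-style creation function

std::map<std::string, Creator>& theInlineFactory() {
  static std::map<std::string, Creator> f;   // Name tag -> creator
  return f;
}

// A measurement registers itself once, e.g. from a static initializer:
//   theInlineFactory()["PROPAGATOR"] = &createPropagatorMeas;   // hypothetical
// chroma.cc then walks <InlineMeasurements>, looks up each <Name>,
// creates the measurement, and invokes it.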
Binary File/Interchange Formats

 Metadata – data describing data, e.g. physics parameters
 Use XML for metadata
 File formats:
  – files are mixed mode – XML (ASCII) + binary
  – DIME (similar to e-mail MIME) is used for packaging
  – BinX (Edinburgh) is used to describe the binary
 Replica-catalog web-archive repositories
For More Information

 U.S. Lattice QCD Home Page: http://www.usqcd.org/
 The JLab Lattice Portal: http://lqcd.jlab.org/
 High Performance Computing at JLab: http://www.jlab.org/hpc/