QDP++ and Chroma
Robert Edwards, Jefferson Lab
Collaborators: Balint Joo

Lattice QCD – extremely uniform
  – Dirac operator: D\psi(x) = \gamma_\mu\,(\partial_\mu + i g A_\mu(x))\,\psi(x)
  – Periodic or very simple boundary conditions
  – SPMD: identical sublattices per processor
  – Lattice operator:
      D\psi(x) = \frac{1}{2a} \sum_\mu \left[ U_\mu(x)\,\psi(x+\hat\mu) - U^\dagger_\mu(x-\hat\mu)\,\psi(x-\hat\mu) \right]

Software Infrastructure
Goal: create a unified software environment that will enable the US lattice community to achieve very high efficiency on diverse multi-terascale hardware.
Tasks and libraries:
  I.   QCD Data Parallel API – QDP
  II.  Optimized Message Passing – QMP
  III. Optimized QCD Linear Algebra – QLA
  IV.  I/O, Data Files and Data Grid – QIO
  V.   Optimized Physics Codes – CPS/MILC/Chroma/etc.
  VI.  Execution Environment – unify BNL/FNAL/JLab

Overlapping communications and computations
C(x) = A(x) * shift(B, +mu), with the data laid out over processors:
  – Send the face forward, non-blocking, to the neighboring node.
  – Receive the face into a pre-allocated buffer.
  – Meanwhile do A*B on the interior sites.
  – "Wait" on the receive, then do A*B on the face.
Lazy evaluation (C style):
  Shift(tmp, B, +mu);
  Mult(C, A, tmp);

SciDAC Software Structure
  Level 3: Optimised Dirac operators and inverters
           (Wilson op. and DWF inverter for Pentium 4; Wilson and staggered op. for QCDOC)
  Level 2: QDP (QCD Data Parallel) – lattice-wide operations, data shifts;
           QIO – XML I/O, LIME.
           Exists in C and C++, scalar and MPP, using QMP.
  Level 1: QLA (QCD Linear Algebra); QMP (QCD Message Passing).
           Exists; implemented over MPI, GM, gigE and QCDOC.

QMP Simple Example
  char buf[size];
  QMP_msgmem_t mm;
  QMP_msghandle_t mh;

  mm = QMP_declare_msgmem(buf, size);      // multiple calls possible
  mh = QMP_declare_send_relative(mm, +x);
  QMP_start(mh);
  // Do computations
  QMP_wait(mh);
The receiving node performs the same steps, except
  mh = QMP_declare_receive_from(mm, -x);

Data Parallel QDP/C, C++ API
  – Hides architecture and layout
  – Operates on lattice fields across sites
  – Linear algebra tailored for QCD
  – Shifts and permutation maps across sites
  – Reductions
  – Subsets
  – Entry/exit – attach to existing codes

QDP++ Type Structure
Lattice fields have various kinds of indices:
  Color: U^{ab}(x); Spin: \Gamma_{\alpha\beta}; Mixed: \psi^{a}_{\alpha}(x), Q^{ab}_{\alpha\beta}(x)
The tensor product of the indices forms the type:

                   Lattice    Color        Spin         Complexity
  Gauge fields:    Lattice    Matrix(Nc)   Scalar       Complex
  Fermions:        Lattice    Vector(Nc)   Vector(Ns)   Complex
  Scalars:         Scalar     Scalar       Scalar       Scalar
  Propagators:     Lattice    Matrix(Nc)   Matrix(Ns)   Complex
  Gamma:           Scalar     Scalar       Matrix(Ns)   Complex

QDP++ forms these types via nested C++ templating.
Formation of new types (e.g. half fermion) is possible.

Data-parallel Operations
  – Unary and binary: -a; a-b; ...
  – Unary functions: adj(a), cos(a), sin(a), ...
  – Random numbers: random(a), gaussian(a)   // platform independent
  – Comparisons (booleans): a <= b, ...
  – Broadcasts: a = 0, ...
  – Reductions: sum(a), ...
Fields have various types (indices), formed as tensor products:
  Lattice: A(x); Color: U^{ij}(x); Spin: \Gamma_{\alpha\beta}; \psi^{i}_{\alpha}(x), Q^{ij}_{\alpha\beta}(x)

QDP Expressions
Expressions such as
  c^{i}_{\alpha}(x) = \sum_j U^{ij}_{\mu}(x)\, b^{j}_{\alpha}(x+\hat\mu) + 2\, d^{i}_{\alpha}(x),   x even
are written directly in QDP/C++ code:
  multi1d<LatticeColorMatrix> u(Nd);
  LatticeDiracFermion b, c, d;
  int mu;

  c[even] = u[mu] * shift(b, mu) + 2 * d;
PETE: Portable Expression Template Engine.
Temporaries are eliminated and expressions optimised.

Linear Algebra Implementation
  // Lattice operation
  A = adj(B) + 2 * C;

  // Lattice temporaries
  t1 = 2 * C;
  t2 = adj(B);
  t3 = t2 + t1;
  A = t3;

  // Merged lattice loop
  for (i = ... ; ... ; ...)
  {
    A[i] = adj(B[i]) + 2 * C[i];
  }
  – Naïve operations involve lattice temporaries – inefficient.
  – PETE eliminates the lattice temporaries.
  – Allows further combining of operations (adj(x)*y).
  – Overlap communications and computations.
  – Full performance – expressions at the site level.
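To make the temporary elimination concrete, here is a minimal, self-contained expression-template sketch in the spirit of PETE. It is illustrative only – Field, AddExpr and ScaleExpr are invented names, not QDP++/PETE classes – but it shows the mechanism: the overloaded operators build a type that encodes the expression, and operator= evaluates the whole tree in a single fused loop over sites.

  #include <cstddef>
  #include <iostream>
  #include <vector>

  // Expression node: element-wise sum of two sub-expressions
  template <typename L, typename R>
  struct AddExpr {
      const L& lhs;
      const R& rhs;
      double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  };

  // Expression node: scalar multiple of a sub-expression
  template <typename E>
  struct ScaleExpr {
      double a;
      const E& expr;
      double operator[](std::size_t i) const { return a * expr[i]; }
  };

  // Concrete container standing in for a lattice-wide field
  struct Field {
      std::vector<double> data;
      explicit Field(std::size_t n, double v = 0.0) : data(n, v) {}
      double  operator[](std::size_t i) const { return data[i]; }
      double& operator[](std::size_t i)       { return data[i]; }

      // operator= walks the expression tree once per site:
      // no lattice-sized temporaries are ever created.
      template <typename E>
      Field& operator=(const E& e) {
          for (std::size_t i = 0; i < data.size(); ++i)
              data[i] = e[i];
          return *this;
      }
  };

  // Operator overloads build the expression tree at compile time
  template <typename L, typename R>
  AddExpr<L, R> operator+(const L& l, const R& r) { return {l, r}; }

  template <typename E>
  ScaleExpr<E> operator*(double a, const E& e) { return {a, e}; }

  int main() {
      Field A(8), B(8, 1.0), C(8, 2.0);
      A = B + 2.0 * C;           // evaluated in a single fused loop
      std::cout << A[0] << '\n'; // prints 5
  }

The real PETE constrains which types the operators accept and applies the same idea to lattice-wide objects carrying color and spin structure, together with subsets and shifts.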
QDP++ Optimization
Optimizations "under the hood":
  – Numerically intensive operations are selected through template specialization.
  – PETE recognises expression templates such as z = a*x + y from type information at compile time.
  – It then calls a machine-specific optimised routine (axpyz).
  – The optimized routine can use assembler, reorganize loops, etc.
  – Optimized routines can be selected at configuration time.
  – Unoptimized fallback routines exist for portability.

Performance Test Case: Wilson Conjugate Gradient
  LatticeFermion psi, p, r;
  Real c, cp, a, d;
  Subset s;
  // (mp, mmp, b, rsd_sq and MaxCG are declared elsewhere in the full routine)

  for(int k = 1; k <= MaxCG; ++k)
  {
    // c = | r[k-1] |**2
    c = cp;

    // a[k] := | r[k-1] |**2 / < M p[k], M p[k] >
    // Mp = M(u) * p
    M(mp, p, PLUS);          // Dslash

    // d = | mp |**2
    d = norm2(mp, s);        // norm squared
    a = c / d;

    // psi[k] += a[k] p[k]
    psi[s] += a * p;         // VAXPY operation

    // r[k] -= a[k] M^dag M p[k]
    M(mmp, mp, MINUS);
    r[s] -= a * mmp;

    cp = norm2(r, s);
    if ( cp <= rsd_sq ) return;

    // b[k+1] := |r[k]|**2 / |r[k-1]|**2
    b = cp / c;

    // p[k+1] := r[k] + b[k+1] p[k]
    p[s] = r + b * p;        // VAXPY operation
  }
In C++ there is significant room for performance degradation; the limitations are in the linear algebra operations (VAXPY) and the norms.
Optimization:
  – Functions return a container holding the function type and the operands.
  – At "=", the expression is replaced with optimized code by template specialization.
Performance:
  – QDP overhead ~ 1% of peak.
  – Wilson: QCDOC 283 Mflops/node at 350 MHz, 4^4 sites/node.

Chroma
A lattice QCD toolkit/library built on top of QDP++.
The library is a module – it can be linked with other codes
(e.g. McNeile computes propagators with CPS and measures pions with Chroma).
Features:
  – Utility libraries (gluonic measurements, smearing, etc.)
  – Fermion support (DWF, Overlap, Wilson, Asqtad)
Applications:
  – Spectroscopy, propagators & 3-pt functions, eigenvalues – all in the same code
  – Heatbath, HMC
  – Optimization hooks – level 3 Wilson-Dslash for Pentium, QCDOC, BG/L and IBM SP-like nodes (via Bagel)

Software Map
[Figure: directory structure of the software modules.]

Chroma Lib Structure
Chroma Lattice Field Theory library:
• Support for gauge and fermion actions
  – Boson action support
  – Fermion action support
    • Fermion actions
    • Fermion boundary conditions
    • Inverters
    • Fermion linear operators
    • Quark propagator solution routines
  – Gauge action support
    • Gauge actions
    • Gauge boundary conditions
• IO routines
• Enums
• Measurement routines
  – Eigenvalue measurements
  – Gauge fixing routines
  – Gluonic observables
  – Hadronic observables
  – Inline measurements
    • Eigenvalue measurements
    • Glue measurements
    • Hadron measurements
    • Smear measurements
  – Psibar-psi measurements
  – Schroedinger functional
  – Smearing routines
  – Trace-log support
• Gauge field update routines
  – Heatbath
  – Molecular dynamics support
    • Hamiltonian systems
    • HMC trajectories
    • HMD integrators
    • HMC monomials
    • HMC linear system solver initial guess
• Utility routines
  – Fermion manipulation routines
  – Fourier transform support
  – Utility routines for manipulating color matrices
  – Info utilities

Fermion Actions
  – Actions are factory objects ("foundries"); they do not hold gauge fields – only parameters.
  – Factory/creation functions take a gauge-field argument:
    • Takes a gauge field – creates a State and applies the fermion boundary conditions.
    • Takes a State – creates a Linear Operator (dslash).
    • Takes a State – creates quark propagator solvers.
  – Linear operators are function objects, e.g.
      class Foo { int operator()(int x); } fred;   // int z = fred(1);
    and are passed as arguments to CG, MR, etc. – simple functions.
  – Created with XML.
A minimal sketch of this pattern follows.
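The sketch below illustrates the factory / function-object pattern with invented class and map names (not Chroma's actual ones) and a fake "dslash", so that it stays self-contained: the action holds only parameters, the gauge field enters through a state, and the resulting operator is an ordinary callable that a solver can take as an argument.

  #include <functional>
  #include <iostream>
  #include <map>
  #include <memory>
  #include <string>
  #include <vector>

  // Toy stand-ins for QDP++/Chroma types (illustrative only)
  using GaugeField = std::vector<double>;
  using Fermion    = std::vector<double>;

  // A "state" couples the gauge field with its boundary conditions
  struct ConnectState {
      GaugeField u;   // gauge field with fermion BCs already applied
  };

  // Linear operators are function objects: psi = M(chi)
  struct LinearOperator {
      virtual Fermion operator()(const Fermion& chi) const = 0;
      virtual ~LinearOperator() = default;
  };

  // A fermion action holds only its parameters, never the gauge field
  struct FermionAction {
      virtual ~FermionAction() = default;
      virtual ConnectState createState(const GaugeField& u) const = 0;
      virtual std::unique_ptr<LinearOperator> linOp(const ConnectState& s) const = 0;
  };

  // Concrete example: a fake "Wilson" action parameterised by Kappa
  struct WilsonAction : FermionAction {
      double kappa;
      explicit WilsonAction(double k) : kappa(k) {}

      ConnectState createState(const GaugeField& u) const override {
          return ConnectState{u};   // real code would apply fermion BCs here
      }

      std::unique_ptr<LinearOperator> linOp(const ConnectState& s) const override {
          struct WilsonOp : LinearOperator {
              double kappa;
              const ConnectState& state;
              WilsonOp(double k, const ConnectState& st) : kappa(k), state(st) {}
              Fermion operator()(const Fermion& chi) const override {
                  // Placeholder for the real dslash stencil
                  Fermion psi(chi.size());
                  for (std::size_t i = 0; i < chi.size(); ++i)
                      psi[i] = chi[i] - kappa * state.u[i % state.u.size()] * chi[i];
                  return psi;
              }
          };
          return std::make_unique<WilsonOp>(kappa, s);
      }
  };

  // Lookup map keyed by the FermAct tag read from the XML input
  using Creator = std::function<std::unique_ptr<FermionAction>(double /*kappa*/)>;

  std::map<std::string, Creator>& theFermActFactory() {
      static std::map<std::string, Creator> factory;
      return factory;
  }

  int main() {
      // Registration (in Chroma this happens via static constructor calls)
      theFermActFactory()["WILSON"] =
          [](double kappa) { return std::make_unique<WilsonAction>(kappa); };

      // "Parsed" XML: <FermAct>WILSON</FermAct> <Kappa>0.11</Kappa>
      auto action = theFermActFactory().at("WILSON")(0.11);

      GaugeField u(16, 1.0);
      ConnectState state = action->createState(u);
      auto M = action->linOp(state);

      Fermion chi(16, 1.0);
      Fermion psi = (*M)(chi);     // the operator is just a function object
      std::cout << psi[0] << "\n"; // prints 0.89
  }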
Fermion Actions – XML
  – The FermAct tag is the key into a lookup map of constructors.
  – During construction, the action reads its XML.
  – The FermBC tag invokes another lookup.

  <FermionAction>
    <FermAct>WILSON</FermAct>
    <Kappa>0.11</Kappa>
    <FermionBC>
      <FermBC>SIMPLE_FERMBC</FermBC>
      <boundary>1 1 1 -1</boundary>
    </FermionBC>
    <AnisoParam>
      <anisoP>false</anisoP>
      <t_dir>3</t_dir>
      <xi_0>1.0</xi_0>
      <nu>1.0</nu>
    </AnisoParam>
  </FermionAction>

XPath is used in chroma/mainprogs/main/propagator.cc:
  /propagator/Params/FermionAction/FermAct

HMC and Monomials
  – HMC is built on Monomials.
  – Monomials define Nf, gauge, etc. They only provide the force for the momentum update, deriv(U), and the action S(U); pseudofermions are not visible.
  – There are Nf=2 and rational Nf=1 monomials, in both 4D and 5D versions.

  <Monomials>
    <elem>
      <Name>TWO_FLAVOR_WILSON_FERM_MONOMIAL</Name>
      <FermionAction>
        <FermAct>WILSON</FermAct>
        ......
      </FermionAction>
      <InvertParam>
        <invType>CG_INVERTER</invType>
        <RsdCG>1.0e-7</RsdCG>
        <MaxCG>1000</MaxCG>
      </InvertParam>
      <ChronologicalPredictor>
        <Name>MINIMAL_RESIDUAL_EXTRAPOLATION_4D_PR...</Name>
        <MaxChrono>7</MaxChrono>
      </ChronologicalPredictor>
    </elem>
    <elem> .... </elem>
  </Monomials>

Gauge Monomials
  – Gauge monomials: plaquette, rectangle, parallelogram.
  – The monomial constructor invokes the constructor for the Name given inside GaugeAction.

  <Monomials>
    <elem> .... </elem>
    <elem>
      <Name>WILSON_GAUGEACT_MONOMIAL</Name>
      <GaugeAction>
        <Name>WILSON_GAUGEACT</Name>
        <beta>5.7</beta>
        <GaugeBC>
          <Name>PERIODIC_GAUGEBC</Name>
        </GaugeBC>
      </GaugeAction>
    </elem>
  </Monomials>

Chroma – Inline Measurements
  – HMC has inline measurements; chroma.cc is an inline-only code.
  – The former main programs are now inline measurements.
  – Measurements are registered with a constructor call.
  – Measurements are given the gauge field and return no value.
  – They communicate with each other only via disk (maybe memory buffers??).

  <InlineMeasurements>
    <elem>
      <Name>MAKE_SOURCE</Name>
      <Param>...</Param>
      <Prop>
        <source_file>./source_0</source_file>
        <source_volfmt>MULTIFILE</source_volfmt>
      </Prop>
    </elem>
    <elem>
      <Name>PROPAGATOR</Name>
      <Param>...</Param>
      <Prop>
        <source_file>./source_0</source_file>
        <prop_file>./propagator_0</prop_file>
        <prop_volfmt>MULTIFILE</prop_volfmt>
      </Prop>
    </elem>
    <elem>....</elem>
  </InlineMeasurements>

Binary File/Interchange Formats
  – Metadata – data describing data, e.g. physics parameters. XML is used for the metadata.
  – File formats:
    – Files are mixed mode – XML ascii + binary.
    – DIME (similar to e-mail MIME) is used to package the records.
    – BinX (Edinburgh) is used to describe the binary.
    (An illustrative I/O sketch follows the links at the end of this document.)
  – Replica-catalog web-archive repositories.

For More Information
  U.S. Lattice QCD Home Page: http://www.usqcd.org/
  The JLab Lattice Portal: http://lqcd.jlab.org/
  High Performance Computing at JLab: http://www.jlab.org/hpc/
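As a closing illustration of the mixed ascii+binary format described under Binary File/Interchange Formats above, here is a sketch of writing a gauge field together with its XML metadata through QDP++'s QIO-based file writer. The XMLBufferWriter push/write/pop calls are standard QDP++, but the exact QDPFileWriter constructor arguments are an assumption modelled on Chroma's gauge I/O and may differ between QDP++ versions.

  // Sketch: package XML metadata (ascii) and a lattice field (binary)
  // into one LIME-style file via QDP++'s QIO layer.
  #include "qdp.h"

  using namespace QDP;

  void writeConfigWithMetadata(const multi1d<LatticeColorMatrix>& u,
                               const std::string& file,
                               double beta)           // example physics parameter
  {
    // File-level and record-level metadata, both expressed as XML
    XMLBufferWriter file_xml, record_xml;
    push(file_xml, "gaugeConfig");
    write(file_xml, "beta", beta);
    pop(file_xml);

    push(record_xml, "gaugeField");
    write(record_xml, "Nd", Nd);
    pop(record_xml);

    // Assumed argument list: volume format, serial/parallel I/O, open mode
    QDPFileWriter to(file_xml, file, QDPIO_SINGLEFILE, QDPIO_SERIAL, QDPIO_OPEN);
    write(to, record_xml, u);   // binary payload, described by record_xml
    close(to);
  }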