The PACS-CS Project and the Improved Wilson Program
Akira Ukawa, Center for Computational Sciences, University of Tsukuba
ILFTN-II at Edinburgh, 10 March 2005

- A bit of history, naming, and all that
- PACS-CS
- Improved Wilson program
- Summary

Lattice QCD in (Tsukuba) Japan
KEK (JLQCD Collaboration):
- Supercomputer installations since 1985: S810 (315 Mflops, 1985), S820 (1 Gflops), VPP500/80 (128 Gflops), SR8000F1/100 (1.2 Tflops, 2000), next system >20 Tflops (2006)
- Instrumental for LQCD development in Japan
- Regular funding from the government; upgrade every 5-6 years
University of Tsukuba (CP-PACS Collaboration):
- Development of dedicated computers for scientific applications: QCDPAX (14 Gflops, 1989), CP-PACS (614 Gflops, 1996.10), PACS-CS (10-20 Tflops, 2006.3)
- Individual funding (lots of hustle...)
- Center for Computational Physics (1992), Center for Computational Sciences (2004)

CP-PACS/JLQCD members
(map of locations: Tsukuba and KEK, about 5 km apart, with Tokyo, Kyoto, Hiroshima)
- U. Tsukuba / CCS: Ishikawa T., Taniguchi, Kuramashi, Ishizuka, Yoshie, Ukawa, Baer, Aoki, Kanaya
- KEK: Yamada, Matsufuru, Hashimoto
- Hiroshima: Ishikawa K., Okawa
- Kyoto: Onogi
- Iwasaki: now President of the University of Tsukuba (2004~)
- Now elsewhere: Okamoto (FNAL), Lesk (Imp. Coll.), Noaki (Southampton), Ejiri (Bielefeld), Nagai (Zeuthen), Aoki Y. (Wuppertal), Izubuchi (Kanazawa), Ali Khan (Berlin), Manke, Shanahan (London), Burkhalter (Zurich)

KEK supercomputer upgrade
- Current system: Hitachi SR8000F1/100, 1.2 Tflops peak / 35% sustained for the PHMC code; will terminate operation by the end of 2005
- Next system: government supercomputer procurement in progress; decision by formal bidding in fall 2005; start of operation in March 2006; more than 20 times the performance of the current system targeted; users in various areas, but mostly lattice QCD

University of Tsukuba: 25 years of R&D of parallel computers
  year   name      speed
  1978   PACS-9    7 kflops
  1980   PAXS-32   500 kflops
  1983   PAX-128   4 Mflops
  1984   PAX-32J   3 Mflops
  1989   QCDPAX    14 Gflops
  1996   CP-PACS   614 Gflops
[Plot: peak speed versus year for the PACS/PAX series, compared with CRAY-1, the Earth Simulator, and BlueGene/L]

Naming
- PACS: Processor Array for Continuum Simulation / Parallel Array Computer System
- PAX: Parallel Array eXperiment / Processor Array eXperiment
- CP-PACS: Computational Physics with Parallel Array Computer System

Collaboration with computer scientists
CP-PACS project members (1996.3.9): Watase, Oyanagi, Kanaya, Yamashita, Sakai, Boku, Nakamura, Okawa, Aoki, Ukawa, Hoshino, Iwasaki, Nakazawa, Nakata, Yoshie

CP-PACS run statistics, 1996.4 - 2003.12
[Plot: monthly usage rate = (hours used for jobs)/(physical hours of the month)]
[Plot: monthly run hours broken down by partition]

Organization
- Center for Computational Physics (1992.4 - 2004.3), 11 faculty members: particle physics 2, astrophysics 2, condensed matter physics 2, biophysics 2, parallel computer engineering 3
- Center for Computational Sciences (2004.4~), 34 faculty members: particle and astrophysics 6, materials and life sciences 11, earth and biological sciences 3, high performance computing 5, computational informatics 6

Path followed by Tsukuba
- Collaboration with computer scientists
- Collaboration with vendors: Anritsu Ltd. for QCDPAX (1989), Hitachi Ltd. for CP-PACS (1996)
- Institutionalization of research activity: Center for Computational Physics (1992), Center for Computational Sciences (2004)
- Expansion of research areas outside of QCD: astrophysics (1992), solid state physics and others (2004)

PACS-CS
- Parallel Array Computer System for Computational Science
- Successor of CP-PACS for lattice QCD
- Also to be used for density functional theory calculations in solid state physics
- Astrophysics has its own cluster project

Background considerations (summer 2003)
What kind of system should we aim at, and how could we get it? Options:
- Purchase of an appropriate system: not a real option for us
- CP-PACS-style (or Columbia-style) development: vendor problem (expensive to get them interested), time problem (takes very long)
- Clusters, i.e., systems built out of commodity processors and networks: a possible option (cost, time, ...), but.....

An early attempt toward post-CP-PACS (1997-1999)
SCIMA: Software Controlled Integrated Memory Architecture
- Basic idea: addressable on-chip memory (SRAM) alongside the L1 cache, with page load/store to external DRAM, to overcome the memory-wall problem
- Carried out: concept of SCIMA, basic design, compiler/simulator, benchmarks for QCD, some hardware design, discussions with vendors, etc.
- Did not come through, for various reasons...
[Diagram: ALU/FPU/registers, L1 cache, on-chip memory (SRAM), MMU, page load/store, DRAM memory, network interface (NIA), network]

Cluster option
A "standard" general-purpose cluster:
- Nodes with single or dual commodity processors (P4/Xeon/Opteron) per node
- Connected by Gigabit Ethernet through one or several big switches
Problems:
- Inadequate processor/memory bandwidth: a dual P4 at 3 GHz gives 12 Gflops against 6.4 GB/s
- Inadequate network bandwidth: 12 Gflops against 1 Gbps of Gigabit Ethernet
- Switch cost rises rapidly for larger systems
- Faster networks are expensive: Myrinet 250 MB/s, InfiniBand (x4) 1 GB/s

Our choice
Build a "semi-dedicated" cluster appropriate for lattice QCD (and a few other applications):
- Single CPU per node with the fastest bus available
- Judicious choice of network topology (3-dimensional hyper-crossbar)
- Multiple Gigabit Ethernet links from each node for high aggregate bandwidth
- A large number of medium-size switches to cut switch cost
- Motherboard designed to accommodate these features

PACS-CS hardware specifications
Node:
- Single low-voltage Xeon 2.8 GHz, 5.6 Gflops
- 2 GB PC3200 memory with FSB800, 6.4 GB/s
- 160 GB disk (RAID-1 mirror)
Network:
- 3-dimensional hyper-crossbar topology
- Dual Gigabit Ethernet for each direction, i.e., 0.25 GB/s per link and an aggregate 0.75 GB/s per node (better than InfiniBand (x4) shared by dual CPUs)
System size:
- At least 2048 CPUs (16x16x8, 11.5 Tflops / 4 TB), and hopefully up to 3072 CPUs (16x16x12, 17.2 Tflops / 6 TB)

3-dimensional hyper-crossbar network
- Nodes are arranged in a 16 x 16 x (8-12) array; each node connects to one X-switch, one Y-switch and one Z-switch, with dual links per direction for bandwidth
- Nodes that share two coordinates communicate via a single switch; otherwise the message passes through more than one switch (a small illustration follows below)
[Diagram: computing nodes, X-, Y- and Z-crossbar switches, single-switch and multi-switch communication paths]
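As a rough illustration of the routing implied by this topology (this is not the actual PACS-CS system software, just a sketch under the assumption that a message crosses one crossbar switch for every coordinate in which the source and destination nodes differ):

```c
/* Sketch: estimating switch traversals on a 3-d hyper-crossbar.
 * Assumption (not taken from the PACS-CS design documents): a message
 * crosses one crossbar switch per coordinate in which source and
 * destination differ, so nodes differing in a single coordinate
 * communicate through a single switch.
 */
#include <stdio.h>

typedef struct { int x, y, z; } node_t;   /* node coordinates: 0..15, 0..15, 0..7 */

/* number of crossbar switches a message must traverse */
static int switch_hops(node_t a, node_t b)
{
    return (a.x != b.x) + (a.y != b.y) + (a.z != b.z);
}

int main(void)
{
    node_t a = { 3, 7, 2 };
    node_t b = { 3, 7, 5 };   /* same X and Y: one Z-switch         */
    node_t c = { 9, 1, 5 };   /* differs in all three coordinates   */

    printf("a->b: %d switch(es)\n", switch_hops(a, b));   /* prints 1 */
    printf("a->c: %d switch(es)\n", switch_hops(a, c));   /* prints 3 */
    return 0;
}
```

Under this assumption, a lattice QCD domain decomposition mapped directly onto the 16x16x(8-12) node array keeps every nearest-neighbour exchange on a single switch, since such partners differ in exactly one coordinate.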
Board layout: 2 nodes per 1U board
Each node on the 1U board carries:
- CPU, memory, and chipset
- 2 HDDs in RAID-1 (Serial ATA, IDE, or SCSI)
- 6 Gigabit Ethernet ports for the 3-d hyper-crossbar: x0/x1 (dual link to the X-crossbar), y0/y1 (Y-crossbar), z0/z1 (Z-crossbar)
- An I/O (GbE) port for the file I/O network
- A RAS (GbE) port for system diagnostics and control
A power unit is shared by the two nodes (unit-0 and unit-1).

File server and external I/O
- Separate tree network for file I/O
- File servers with external RAID disks
- Tree built from GbE x 4 and GbE x 2 links between the file servers and the nodes

PACS-CS software
OS:
- Linux
- SCore, the cluster middleware developed by the PC Cluster Consortium (http://www.pccluster.org/index.html.en)
- 3-d HXB driver based on SCore PM (under development)
Programming:
- MPI for communication
- Library for the 3-d HXB network
- Fortran, C, C++ with MPI (see the sketch after this list)
Job execution:
- System partitions (256 nodes, 512 nodes, 1024 nodes, ...)
- Batch queue using PBS
- Job scripts for file I/O
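As a minimal sketch of what application-level communication looks like with plain MPI, assuming user code simply maps its lattice decomposition onto a 3-d Cartesian communicator matching the node array (the dedicated PACS-CS library for the 3-d HXB network is not shown here, and the buffer size is a placeholder):

```c
/* Minimal sketch of a 3-d Cartesian halo exchange with standard MPI.
 * Not the PACS-CS HXB library; the process grid is chosen by
 * MPI_Dims_create (16 x 16 x 8 on the full 2048-node machine).
 */
#include <mpi.h>
#include <stdlib.h>

#define HALO_WORDS 1024   /* words per face exchanged, placeholder */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = { 0, 0, 0 };
    MPI_Dims_create(nprocs, 3, dims);     /* 3-d process grid          */

    int periods[3] = { 1, 1, 1 };         /* periodic lattice directions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    double *sendbuf = malloc(HALO_WORDS * sizeof(double));
    double *recvbuf = malloc(HALO_WORDS * sizeof(double));
    for (int i = 0; i < HALO_WORDS; i++) sendbuf[i] = 1.0;

    /* exchange one face with the forward neighbour in each of X, Y, Z;
       each direction maps onto its own crossbar, so the three exchanges
       use independent sets of links */
    for (int dir = 0; dir < 3; dir++) {
        int prev, next;
        MPI_Cart_shift(cart, dir, 1, &prev, &next);
        MPI_Sendrecv(sendbuf, HALO_WORDS, MPI_DOUBLE, next, dir,
                     recvbuf, HALO_WORDS, MPI_DOUBLE, prev, dir,
                     cart, MPI_STATUS_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```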
Current schedule
Timeline (April 2003 - April 2008):
- Basic design, then detailed design and verification, test-system build-up and testing (R&D in progress)
- System production: 2048-node system by early fiscal 2006, final system by early fiscal 2007
- R&D of system software and development of application programs in parallel
- Begin operation with the 2048-node system, full-system operation thereafter
- At KEK, the SR8000F1 is replaced by the new system
- October 2006 marks 10 years of CP-PACS operation (Center for Computational Sciences)

Nf=2+1 Improved Wilson Program on PACS-CS
- Current status
- Physics prospects
- Algorithm

CP-PACS/JLQCD joint effort toward Nf=2+1 (since 2001)
Strategy:
- Iwasaki RG gauge action; Wilson-clover quark action
- Fully O(a) improved via Schroedinger functional determination of c_SW (JLQCD/CP-PACS, K. Ishikawa et al., Lattice '03)
- NP Z factors for operators via Schroedinger functional determination
[Plot: nonperturbative c_SW for Nf=3 versus g0^2, between roughly 1.0 and 1.8, compared with the 1-loop fitted curve; JLQCD, K. Ishikawa et al., PRD]
Algorithm:
- Polynomial HMC for the strange quark (a generic sketch of the polynomial kernel follows below)
- Standard HMC for the up and down quarks
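The numerical heart of the polynomial HMC is the application of a polynomial in the (Hermitian, positive) lattice Dirac operator to a pseudofermion field, approximating an inverse power of that operator. The following is only a generic illustration, not the CP-PACS/JLQCD implementation: a Chebyshev-type polynomial applied with the standard three-term recurrence, using placeholder coefficients and a toy diagonal operator in place of D†D.

```c
/* Generic sketch: apply p(A) v = sum_k c_k T_k(B) v, where
 * B = (2A - (hi+lo)) / (hi - lo) maps the spectrum of A into [-1,1].
 * Operator, spectral bounds and coefficients are placeholders; in a
 * PHMC code A would be (a preconditioned) D^dagger D, and the
 * polynomial order of the actual runs is around 300.
 */
#include <stdio.h>

#define N     8        /* toy vector length   */
#define ORDER 16       /* toy polynomial order */

static const double lo = 0.1, hi = 4.0;   /* assumed spectral bounds of A */

/* toy "Dirac operator": a fixed diagonal matrix */
static void apply_A(const double *in, double *out)
{
    for (int i = 0; i < N; i++)
        out[i] = (lo + (hi - lo) * i / (N - 1)) * in[i];
}

/* y = B x with B = (2A - (hi+lo)) / (hi - lo) */
static void apply_B(const double *x, double *y)
{
    apply_A(x, y);
    for (int i = 0; i < N; i++)
        y[i] = (2.0 * y[i] - (hi + lo) * x[i]) / (hi - lo);
}

int main(void)
{
    double c[ORDER + 1];                    /* placeholder coefficients */
    for (int k = 0; k <= ORDER; k++) c[k] = 1.0 / (k + 1.0);

    double t0[N], t1[N], t2[N], res[N];
    for (int i = 0; i < N; i++) {           /* T_0(B) v = v             */
        t0[i]  = 1.0;                       /* source vector v = (1,..) */
        res[i] = c[0] * t0[i];
    }

    apply_B(t0, t1);                        /* T_1(B) v = B v           */
    for (int i = 0; i < N; i++) res[i] += c[1] * t1[i];

    for (int k = 2; k <= ORDER; k++) {      /* T_k = 2 B T_{k-1} - T_{k-2} */
        apply_B(t1, t2);
        for (int i = 0; i < N; i++) {
            t2[i]  = 2.0 * t2[i] - t0[i];
            res[i] += c[k] * t2[i];
            t0[i]  = t1[i];
            t1[i]  = t2[i];
        }
    }

    printf("first component of p(A) v: %g\n", res[0]);
    return 0;
}
```

The same recurrence carries over unchanged if apply_A is replaced by an actual lattice operator; only the spectral bounds and the coefficients change.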
Machines and run parameters
Three lattice spacings at a fixed physical volume of about (2.0 fm)^3:
- beta = 2.05, a ~ 0.07 fm, 28^3 x 56, 2000 trajectories (in progress)
- beta = 1.90, a ~ 0.10 fm, 20^3 x 40, 8000 trajectories (finished)
- beta = 1.83, a ~ 0.12 fm, 16^3 x 32, 8000 trajectories (finished)
Machines: Earth Simulator @ JAMSTEC, SR8000/F1 @ KEK, CP-PACS @ Tsukuba, SR8000/G1 @ Tsukuba, VPP5000 @ Tsukuba

Light hadron results: meson hyperfine splitting (T. Ishikawa, this workshop)
[Plots: K* and φ meson masses versus a^2 for K-input and φ-input, compared with experiment]

Light quark masses (T. Ishikawa, this workshop)
[Plots: m_ud and m_s in the MSbar scheme at μ = 2 GeV versus a^2, for the VWI and AWI definitions with K- and φ-input]

Relativistic heavy quark scaling tests (I) (Y. Kayaba, this workshop)
Charmonium hyperfine splitting, Nf=2
[Plot: ΔM(J/ψ - η_c) versus a(r0), for Mpole and Mkin, anisotropic Nf=0 results and experiment, with a linear extrapolation, χ²/d.o.f. = 2.3]

Relativistic heavy quark scaling tests (II) (Y. Kuramashi, this workshop)
Nf=0 calculation on the Iwasaki gauge action
[Plots: Upsilon hyperfine splitting ΔM(Υ - η_b) and Bs meson decay constant f_Bs versus a, compared with NRQCD and Nf=2 results]

Plan for PACS-CS
Limitation of the current runs:
- Three lattice spacings, a² ≈ 0.015, 0.01, 0.005 fm²
- But in the light quark masses we reach only down to m_π/m_ρ ≈ 0.6, i.e., m_ud/m_s ≈ 0.5
- Wish to go down to m_π/m_ρ ≈ 0.4, i.e., m_ud/m_s ≈ 0.2, or less...

Physics and methods (I)
Fundamental constants:
- Light quark masses from the meson spectrum
- Strong coupling constant
Hadron physics:
- Flavor-singlet mesons and topology
- Baryon mass spectrum (larger spatial size needed)
- ...
Methods (S. Takeda, this workshop):
- Wilson ChPT to control the chiral behavior
- Schroedinger functional methods for RG running and NP renormalization of operators

Physics and methods (II)
CKM-related issues:
- Light hadron sector: π and K form factors
- Heavy hadron sector: D and B decay constants and box parameters; D and B form factors
Methods:
- Relativistic heavy quark action to control the large quark mass
- Wilson ChPT to control the light-quark chiral behavior

Physics and methods (III)
Open issues:
- Is it possible to study B_K? Probably yes, using chiral Ward identity methods for NP renormalization of the 4-quark operators (Kuramashi et al., Phys. Rev. D60 (1999) 034511)
- K -> ππ decays in the I=2 channel: yes, as for B_K (Ishizuka et al., Phys. Rev. D58 (1998) 054503)
- K -> ππ decays in the I=0 channel??? Divergent renormalization of the 4-quark operators??? tmQCD???

Performance: benchmark assumptions
Target runs:
- Lattice sizes 24^3 x 48 at 1/a ≈ 2 GeV (a ≈ 0.1 fm) and 32^3 x 64 at 1/a ≈ 2.83 GeV (a ≈ 0.07 fm)
- 10000 trajectories
- Polynomial order 300
Machine assumptions:
- System size 2048 CPUs; job partition 8^3 = 512 CPUs x 4
- CPU performance 2 Gflops
- Network performance 0.2 GB/s per link (3 directions overlapped)
- Network latency 20 μs

Time estimate for standard HMC (Nf=2+1)
                                                 time/traj (hr)          10000 traj
  1/a (GeV)  lattice     m_π/m_ρ   N_inv   1/dt    calc    comm    total    (days)
  2          24^3 x 48   0.6         517    116    0.12    0.13     0.25        26
                         0.5         870    189    0.30    0.32     0.62        65
                         0.4        1507    322    0.84    0.89     1.73       180
                         0.3        2884    611    2.93    3.11     6.03       629
                         0.2        8591   1806   25.00   26.57    51.57      5372
  2.83       32^3 x 64   0.6         719    155    0.67    0.46     1.13       118
                         0.5        1218    252    1.72    1.19     2.91       303
                         0.4        2118    430    4.86    3.39     8.25       860
                         0.3        4066    814   17.16   11.99    29.14      3036
                         0.2       12143   2408  148.23  103.66   251.89     26238
(The day counts work out if the four 512-CPU partitions run in parallel, i.e., days ≈ time/traj x 10000 / (4 x 24).)
Rather dismal number of days.....

Acceleration of HMC via domain decomposition (M. Luescher, hep-lat/0409106)
The lattice is covered by non-overlapping blocks Λ_n, colored alternately; Ω = ∪_n Λ_n and Ω* = ∪_n Λ*_n denote the unions of the two sets of blocks, with
  D_Ω = Σ_n D_Λn ,   D_Ω* = Σ_n D_Λ*n     (Dirichlet b.c. on each block)
  D_∂Ω = Σ_n D_∂Λn , D_∂Ω* = Σ_n D_∂Λ*n   (hopping terms across the block boundaries)
The quark determinant is then represented as
  det(D†D) = ∫ dφ_Ω† dφ_Ω dφ_Ω*† dφ_Ω* dφ_R† dφ_R
             exp[ − φ_Ω† (D_Ω† D_Ω)^{-1} φ_Ω − φ_Ω*† (D_Ω*† D_Ω*)^{-1} φ_Ω* − φ_R† (R† R)^{-1} φ_R ]
with
  R = 1 − θ_∂Ω* D_Ω^{-1} D_∂Ω D_Ω*^{-1} D_∂Ω* ,   R^{-1} = 1 − θ_∂Ω* D^{-1} D_∂Ω*
where θ_∂Ω* projects onto field values on the boundaries of the domains.

Acceleration possibilities (III)
Crucial observation (Luescher): decompose the MD force into
  F0 : force from the gauge action
  F1 : force from D_Ω and D_Ω*
  F2 : force from R
[Plot from M. Luescher, hep-lat/0409106: <||F_k(x,μ)||> for k = 0, 1, 2]
Numerically, at m_π/m_ρ ≈ 0.7-0.4 and a^{-1} ≈ 2.4 GeV,
  F0 : F1 : F2 ≈ 5 : 1 : 0.2 = 25 : 5 : 1
so one can use larger step sizes for the quark forces:
  dτ0 : dτ1 : dτ2 ≈ 1 : 5 : 25
(a sketch of such a nested integrator follows below)
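To make the multiple-time-scale idea concrete, here is a minimal sketch of a nested leapfrog with three force levels, in the spirit of the Sexton-Weingarten scheme. Everything is a toy: a single degree of freedom with harmonic stand-ins for the forces, sized roughly as F0 : F1 : F2 = 25 : 5 : 1, and a fixed sub-step ratio of 5 per level so that dτ0 : dτ1 : dτ2 = 1 : 5 : 25. This is not the PACS-CS implementation.

```c
/* Toy nested leapfrog with three force levels and step ratio 1 : 5 : 25.
 * In the real algorithm F0 is the gauge force, F1 the force from the
 * block operators D_Ω and D_Ω*, and F2 the force from R.
 */
#include <stdio.h>

#define NSUB 5                 /* sub-steps of the level below per step */

static double q, p;            /* toy coordinate and momentum */

/* toy stand-ins for the three force contributions (F0 > F1 > F2) */
static double force(int level, double x)
{
    static const double k[3] = { 25.0, 5.0, 1.0 };
    return -k[level] * x;
}

/* one nested leapfrog step of size dt at the given level */
static void update(int level, double dt)
{
    if (level == 0) {                        /* innermost: gauge force  */
        p += 0.5 * dt * force(0, q);
        q += dt * p;
        p += 0.5 * dt * force(0, q);
    } else {                                 /* coarse level: half-kick,
                                                finer evolution, half-kick */
        p += 0.5 * dt * force(level, q);
        for (int i = 0; i < NSUB; i++)
            update(level - 1, dt / NSUB);
        p += 0.5 * dt * force(level, q);
    }
}

static double hamiltonian(void)
{
    return 0.5 * p * p + 0.5 * (25.0 + 5.0 + 1.0) * q * q;
}

int main(void)
{
    q = 1.0; p = 0.0;
    double h0 = hamiltonian();

    double dtau2 = 0.25;                     /* outermost step size */
    for (int i = 0; i < 20; i++)             /* trajectory length 5  */
        update(2, dtau2);

    printf("dH over the trajectory: %g\n", hamiltonian() - h0);
    return 0;
}
```

Because the expensive forces sit on the coarse levels, the block operators are inverted only once every ~5 gauge-force steps and R only once every ~25, which is where the cost reduction quantified below comes from.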
Advantages with domain-decomposed HMC
For the block inversions D_Λn -> D_Λn^{-1}, D_Λ*n -> D_Λ*n^{-1}:
- Inversion in each domain only once every ~5 MD steps
- Dirichlet boundary conditions, hence easier to invert
- No inter-node communication if the domain fits within a node
For R -> R^{-1} = 1 − θ_∂Ω* D^{-1} D_∂Ω*:
- Full inversion only once every ~25 MD steps
Both the floating-point and the communication requirements are reduced......

Acceleration possibility
                                  standard HMC           domain-decomposed HMC
                                    10000 traj   #steps       time/traj (hr)      10000 traj  accele-
  1/a (GeV)  lattice     m_π/m_ρ      (days)    N0 N1 N2    calc   comm   total     (days)    ration
  2          24^3 x 48   0.6              26     4  5  5   0.031  0.005   0.037         4        7
                         0.5              65     4  5  6   0.058  0.010   0.068         7        9
                         0.4             180     4  5  7   0.110  0.019   0.129        13       13
                         0.3             629     4  5  8   0.230  0.041   0.271        28       22
                         0.2            5372     4  5  9   0.747  0.132   0.880        92       59
  2.83       32^3 x 64   0.6             118     5  6  6   0.181  0.018   0.199        21        6
                         0.5             303     5  6  7   0.333  0.033   0.366        38        8
                         0.4             860     5  6  9   0.713  0.071   0.784        82       11
                         0.3            3036     5  6 10   1.475  0.147   1.622       169       18
                         0.2           26238     5  6 11   4.739  0.473   5.213       543       48
Only a paper estimate, but more than encouraging..... Implementation in progress.

Summary
- A "dedicated" large-scale cluster (10-20 Tflops peak) is expected by early summer of 2006 at the University of Tsukuba
- We plan to finish off the improved Wilson program, exploiting the domain-decomposition acceleration idea
- A separate system comes at KEK; its physics program is under discussion