Session 8, 11:30 - A. Ukawa, "The PACS-CS Project and the improved Wilson Program"

advertisement
ILFTN-II at Edinburgh
10 March 2005
The PACS-CS Project and
the improved Wilson Program
Akira Ukawa
Center for Computational Sciences
University of Tsukuba
- A bit of history, naming, and all that
- PACS-CS
- Improved Wilson Program
- Summary
1
Lattice QCD in (Tsukuba) Japan

[Timeline figure, 1980-2005]

KEK (JLQCD Collaboration): supercomputer installation since 1985, instrumental for LQCD development in Japan; regular funding from government, upgrade every 5-6 years.
  - S810, 315 Mflops (1985)
  - S820, 1 Gflops
  - VPP500/80, 128 Gflops
  - SR8000F1/100, 1.2 Tflops
  - ??, >20 Tflops

Univ. of Tsukuba (CP-PACS Collaboration): development of dedicated computers for scientific applications; individual funding (lots of hustle...); Center for Computational Physics (1992), now Center for Computational Sciences (2004).
  - QCDPAX, 14 Gflops (1989)
  - CP-PACS, 614 Gflops (1996.10)
  - PACS-CS, 10-20 Tflops (2006.3)
2
CP-PACS/JLQCD members

[Map figure showing member locations; KEK and U. Tsukuba are about 5 km apart]

U. Tsukuba (CCS members): Ishikawa T., Taniguchi, Kuramashi, Ishizuka, Yoshie, Ukawa, Baer, Aoki, Kanaya; Iwasaki (President of UT, 2004~)
KEK: Yamada, Matsufuru, Hashimoto
Hiroshima: Ishikawa K., Okawa
Kyoto: Onogi
Now elsewhere: Okamoto (FNAL), Lesk (Imperial College), Noaki (Southampton), Ejiri (Bielefeld), Nagai (Zeuthen), Aoki Y. (Wuppertal), Izubuchi (Kanazawa), Ali Khan (Berlin), Manke, Shanahan (London), Burkhalter (Zurich)
3
KEK supercomputer upgrade

- Current system
  - Hitachi SR8000F1/100
  - 1.2 Tflops peak / 35% sustained for PHMC code
  - Will terminate operation by the end of 2005
- Next system
  - Government supercomputer procurement in progress
  - Decision by formal bidding in fall 2005
  - Start operation in March 2006
  - >20 times the performance of the current system targeted
  - Users in various areas, but mostly lattice QCD
4
University of Tsukuba:
25 years of R&D of Parallel Computers

  year   name      speed
  1978   PACS-9    7 kflops
  1980   PACS-32   500 kflops
  1983   PAX-128   4 Mflops
  1984   PAX-32J   3 Mflops
  1989   QCDPAX    14 Gflops
  1996   CP-PACS   614 Gflops

[Log-scale plot of peak speed (Gflops/Tflops) versus year, 1975-2010, showing the PAX/PACS/QCD-PAX/CP-PACS machines alongside CRAY-1, the Earth Simulator, and BlueGene/L]
5
naming

PACS: Processor Array for Continuum Simulation / Parallel Array Computer System
PAX: Parallel Array Experiment / Processor Array Experiment
CP-PACS: Computational Physics with Parallel Array Computer System
6
Collaboration with computer scientists

[Photo: CP-PACS Project members, 1996.3.9 - Watase, Oyanagi, Kanaya, Yamashita, Sakai, Boku, Nakamura, Okawa, Aoki, Ukawa, Hoshino, Iwasaki, Nakazawa, Nakata, Yoshie]
7
CP-PACS run statistics 1996.4 – 2003.12
Monthly usage rate = (hours used for jobs)/(physical hours of month)
8
CP-PACS run statistics 1996.4 – 2003.12
Monthly run hours broken down by partition
9
Organization

Center for Computational Physics (1992.4 ~ 2004.3), 11 faculty members:
  Particle physics 2, Astrophysics 2, Condensed matter physics 2, Biophysics 2,
  Parallel computer engineering 3

Center for Computational Sciences (2004.4 ~), 34 faculty members:
  Particle and astrophysics 6, Materials and life sciences 11,
  Earth and biological sciences 3, High performance computing 5,
  Computational informatics 6
10
Path followed by Tsukuba

- Collaboration with computer scientists
- Collaboration with vendor
  - Anritsu Ltd. for QCDPAX (1989)
  - Hitachi Ltd. for CP-PACS (1996)
- Institutionalization of research activity
  - Center for Computational Physics (1992)
  - Center for Computational Sciences (2004)
- Expansion of research area outside of QCD
  - Astrophysics (1992)
  - Solid state physics and others (2004)
11
PACS-CS
Parallel Array Computer System for Computational Science

- Successor of CP-PACS for lattice QCD
- Also to be used for density functional theory calculations in solid state physics
- Astrophysics has its own cluster project
12
Background considerations - summer 2003 -

What kind of system should we aim at, and how could we get it?

- Options
  - Purchase of an appropriate system: not a real option for us
  - CP-PACS style (or Columbia style) development:
    vendor problem (expensive to get them interested), time problem (takes very long)
  - Clusters, i.e., systems built out of commodity processors and network:
    a possible option (cost, time, ...), but .....
13
An early attempt toward post CP-PACS (1997~1999)

SCIMA: Software Controlled Integrated Memory Architecture

- Basic idea: addressable on-chip memory (SRAM) alongside the L1 cache, with page load/store
  between on-chip memory and DRAM, to overcome the memory wall problem

[Block diagram, "Concept of SCIMA": ALU/FPU/registers, L1 cache, MMU, on-chip memory (SRAM), NIA, off-chip DRAM, network]

- Carried out
  - Basic design
  - Compiler/simulator
  - Benchmark for QCD
  - Some hardware design
  - Discussion with vendor, etc.
- Did not come through for various reasons...
14
Cluster option

"Standard" general-purpose cluster:
- Nodes with single or dual commodity processors (P4/Xeon/Opteron)
- Connected by Gigabit Ethernet through one or several big switches

Problems (a rough byte-per-flop comparison follows after this list):
- Inadequate processor/memory bandwidth: dual P4 3 GHz gives 12 Gflops against 6.4 GB/s
- Inadequate network bandwidth: 12 Gflops against 1 Gbps Gigabit Ethernet
- Switch cost rapidly increases for larger systems
- Expensive to use a faster network: Myrinet 250 MB/s, InfiniBand (x4) 1 GB/s
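To make the imbalance concrete, here is a back-of-the-envelope check using only the numbers quoted above; it is my own sketch, and the remark in the comment that Wilson-Dirac kernels want roughly 1-2 bytes per peak flop in double precision without cache reuse is a rough, commonly quoted figure, not something stated on the slide.

```python
# Bytes available per peak flop for the "standard cluster" numbers quoted above.
# For comparison (assumption, not from the slide): Wilson-Dirac matrix multiplication
# needs roughly 1-2 bytes/flop in double precision without cache reuse.

configs = {
    # name: (peak Gflops per node, memory bandwidth GB/s, network bandwidth GB/s)
    "dual P4 3GHz + GbE":           (12.0, 6.4, 0.125),   # 1 Gbps ~ 0.125 GB/s
    "dual P4 3GHz + Myrinet":       (12.0, 6.4, 0.25),
    "dual P4 3GHz + InfiniBand x4": (12.0, 6.4, 1.0),
}

for name, (gflops, mem_bw, net_bw) in configs.items():
    print(f"{name}: memory {mem_bw/gflops:.2f} B/flop, network {net_bw/gflops:.3f} B/flop")
```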
15
Our choice

Build a "semi-dedicated" cluster appropriate for lattice QCD (and a few other applications):
- Single CPU per node with the fastest bus available
- Judicious choice of network topology (3-dimensional hyper-crossbar)
  - Multiple Gigabit Ethernet links from each node for high aggregate bandwidth
  - Large number of medium-size switches to cut switch cost
- Motherboard designed to accommodate these features
16
PACS-CS hardware specifications

- Node
  - Single low-voltage Xeon 2.8 GHz (5.6 Gflops)
  - 2 GB PC3200 memory with FSB800 (6.4 GB/s)
  - 160 GB disk (RAID-1 mirror)
- Network
  - 3-dimensional hyper-crossbar topology
  - Dual Gigabit Ethernet for each direction, i.e., 0.25 GB/s per link and an
    aggregate 0.75 GB/s per node (better than InfiniBand (x4) shared by dual CPUs)
- System size (a quick arithmetic check follows below)
  - At least 2048 CPUs (16x16x8, 11.5 Tflops / 4 TB), and
    hopefully up to 3072 CPUs (16x16x12, 17.2 Tflops / 6 TB)
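The system-size figures follow directly from the per-node numbers above; a minimal sketch of the arithmetic (the 0.125 GB/s per Gigabit Ethernet link is the usual conversion, the rest is taken from the slide):

```python
# Peak-performance and memory totals implied by the node specification above.
PEAK_PER_NODE_GFLOPS = 5.6     # low-voltage Xeon 2.8 GHz
MEM_PER_NODE_GB = 2
GBE_LINK_GBPS = 1.0            # one Gigabit Ethernet link ~ 0.125 GB/s

for nodes, shape in [(2048, (16, 16, 8)), (3072, (16, 16, 12))]:
    assert nodes == shape[0] * shape[1] * shape[2]
    print(f"{shape}: {nodes * PEAK_PER_NODE_GFLOPS / 1000:.1f} Tflops peak, "
          f"{nodes * MEM_PER_NODE_GB / 1024:.1f} TB memory")

# Per-node network: dual GbE per direction, 3 directions
per_direction_GBps = 2 * GBE_LINK_GBPS / 8          # 0.25 GB/s per direction
print(f"per direction {per_direction_GBps:.2f} GB/s, "
      f"aggregate {3 * per_direction_GBps:.2f} GB/s per node")
```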
17
3-dimensional hyper-crossbar network

[Figure: X=16, Y=16, Z=8~12 array of computing nodes; each node connects to an X-switch, a Y-switch, and a Z-switch, with dual links per direction for bandwidth. Nodes sharing an axis communicate via a single switch; other pairs communicate via multiple switches.]
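As an illustration of the routing this topology implies, here is a toy sketch; the coordinate-to-switch mapping and the switch-counting rule are inferred from the figure ("single switch" vs. "multiple switches"), and the code is mine, not any PACS-CS software.

```python
# Toy model of the 3D hyper-crossbar: node (x, y, z) hangs off
# X-switch (y, z), Y-switch (x, z) and Z-switch (x, y).
# Two nodes that differ in only one coordinate share a switch (single-switch hop);
# otherwise the message is forwarded dimension by dimension,
# traversing one switch per coordinate that differs.

def switches(node):
    x, y, z = node
    return {"X": ("X", y, z), "Y": ("Y", x, z), "Z": ("Z", x, y)}

def num_switches(a, b):
    return sum(1 for ca, cb in zip(a, b) if ca != cb)

if __name__ == "__main__":
    a, b = (0, 3, 5), (7, 3, 2)          # differ in x and z: two switches
    print(switches(a))
    print(num_switches(a, b), "switch hop(s) between", a, "and", b)
```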
18
Board layout: 2 nodes / 1U board

[Board diagram: each 1U board carries two node units (unit-0, unit-1) and a power unit. Per node: CPU, memory, chip-set, HDD (RAID-1; Serial ATA, IDE or SCSI), six GbE ports for the 3D HXB (x0, x1, y0, y1, z0, z1: dual links for the X-, Y- and Z-crossbars), one GbE port for the file I/O network, and one GbE RAS port for system diagnostics and control.]
19
File server and external I/O

[Figure: separate tree network for file I/O connecting the nodes to file servers with external RAID disks; GbE x 4 and GbE x 2 links at different levels of the tree]
20
PACS-CS software

- OS
  - Linux
  - SCore (cluster middleware developed by the PC Cluster Consortium,
    http://www.pccluster.org/index.html.en)
  - 3D HXB driver based on SCore PM (under development)
- Programming
  - MPI for communication
  - Library for the 3D HXB network
  - Fortran, C, C++ with MPI (see the sketch after this list)
- Job execution
  - System partitions (256 nodes, 512 nodes, 1024 nodes, ...)
  - Batch queue using PBS
  - Job scripts for file I/O
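A minimal sketch of how an MPI application might lay its ranks out on the 3D machine grid; mpi4py is used here only for brevity (the slide lists Fortran/C/C++ with MPI), and the script name and rank count in the comment are arbitrary examples, not PACS-CS software.

```python
# Lay MPI ranks out on a 3D Cartesian grid matching the hyper-crossbar and
# find nearest neighbours for halo exchange. Small test run, e.g.:
#   mpiexec -n 8 python cart_demo.py        (the full machine would use 2048 ranks)
from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 3)           # e.g. [16, 16, 8] for 2048 ranks
cart = comm.Create_cart(dims, periods=[True] * 3, reorder=True)

coords = cart.Get_coords(cart.Get_rank())
for direction in range(3):                             # X, Y, Z
    src, dst = cart.Shift(direction, 1)                # neighbours for forward halo exchange
    if cart.Get_rank() == 0:
        print(f"dir {direction}: coords {coords}, send to {dst}, receive from {src}")
```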
21
Current schedule

Center for Computational Sciences, PACS-CS (timeline April 2003 - April 2008):
- Basic design
- Detailed design and verification
- Test-system build-up and testing (R&D in progress)
- R&D of system software
- System production: 2048-node system by early fiscal 2006, final system by early fiscal 2007
- Development of application programs
- Begin operation; operation with the full system
- October 2006: 10 years of CP-PACS operation

KEK: SR8000F1 (current), then the new system.
22
Nf=2+1 Improved Wilson Program on PACS-CS

- Current status
- Physics prospects
- Algorithm
23
CP-PACS/JLQCD joint effort toward Nf=2+1 (since 2001)

Strategy:
- Iwasaki RG gauge action
- Wilson-clover quark action
  - Fully O(a) improved via Schroedinger functional determination of c_sw
  - NP Z factors for operators via Schroedinger functional determination
- Algorithm
  - Polynomial HMC for the strange quark
  - Standard HMC for the up and down quarks

[Plot: c_SW for Nf=3 versus g0² (0.0-3.0); the fitted nonperturbative determination compared with the 1-loop result, c_SW ≈ 1.0-1.8. JLQCD/CP-PACS, K. Ishikawa et al., Lattice '03; JLQCD, K. Ishikawa et al., PRD]
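For orientation (not from the slide): the coefficient c_SW plotted above multiplies the standard Sheikholeslami-Wohlert (clover) term; in a convention commonly used with Schroedinger functional determinations (normalizations vary between papers),

$$
S = S_{\rm Wilson} + c_{\rm SW}\,\frac{i}{4}\,a^{5}\sum_{x}\bar\psi(x)\,\sigma_{\mu\nu}\,\hat F_{\mu\nu}(x)\,\psi(x),
\qquad
\sigma_{\mu\nu}=\frac{i}{2}\,[\gamma_\mu,\gamma_\nu],
$$

with $\hat F_{\mu\nu}$ the clover-leaf discretization of the field strength; tuning c_SW nonperturbatively removes the O(a) cutoff effects of the Wilson quark action.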
24
Machines and run parameters

Fixed physical volume ~ (2.0 fm)^3

  β = 1.83:  a ~ 0.12 fm, 16^3 x 32, 8000 trajectories (finished)
  β = 1.90:  a ~ 0.10 fm, 20^3 x 40, 8000 trajectories (finished)
  β = 2.05:  a ~ 0.07 fm, 28^3 x 56, 2000 trajectories (in progress)

Machines: Earth Simulator @ JAMSTEC, SR8000/F1 @ KEK, CP-PACS @ Tsukuba, SR8000/G1 @ Tsukuba, VPP5000 @ Tsukuba
25
T. Ishikawa, this workshop
Light hadron results: meson hyperfine splitting

[Two panels: meson masses [GeV] (K*, φ in one panel; K*, K in the other) versus a² [fm²] (0-0.02), for K-input and φ-input, compared with experiment in the continuum limit]
26
T. Ishikawa, this workshop
Light quark masses

[Two panels: m_s^MSbar(μ=2 GeV) [MeV] (range ~80-110; VWI and AWI, K-input and φ-input) and m_ud^MSbar(μ=2 GeV) [MeV] (range ~2.5-4; AWI, K-input) versus a² [fm²] (0-0.02)]
27
Y. Kayaba, this workshop
Relativistic heavy quark scaling tests (I)

Charmonium hyperfine splitting, Nf=2

[Plot: ΔM(J/ψ-ηc) [GeV] versus a(r0) [fm] (0-0.25), comparing Mpole and Mkin definitions and the anisotropic Nf=0 result with experiment; linear extrapolation, χ²/d.o.f. = 2.3]
28
Y. Kuramashi, this workshop
Relativistic heavy quark scaling tests (II)

Nf=0 calculation with the Iwasaki gauge action

[Two panels versus a [fm]: Upsilon hyperfine splitting ΔM(Υ-ηb) [GeV] (Nf=0: NRQCD (A), NRQCD (B), A4) and the Bs meson decay constant fBs [GeV] (Nf=0 NRQCD; Nf=2 Mpole, Mkin)]
29
Plan for PACS-CS

- Limitation of the current run
  - Three lattice spacings: a² ≈ 0.015, 0.01, 0.005 fm²
  - But, in light quark masses, only down to mπ/mρ ≈ 0.6, i.e., mud/ms ≈ 0.5
- Wish to go down to mπ/mρ ≈ 0.4, i.e., mud/ms ≈ 0.2, or less...
30
Physics and methods (I)

- Fundamental constants
  - Light quark masses from the meson spectrum
  - Strong coupling constant
- Hadron physics
  - Flavor-singlet mesons and topology
  - Baryon mass spectrum (larger spatial size needed)
  - ...

Methods (S. Takeda, this workshop):
- Wilson ChPT to control the chiral behavior
- Schroedinger functional methods for RG running and NP renormalization of operators
31
Physics and methods (II)

- CKM-related issues
  - Light hadron sector: π and K form factors
  - Heavy hadron sector: D and B decay constants and box parameters; D and B form factors

Methods:
- Relativistic heavy quark action to control the large quark mass
- Wilson ChPT to control the light-quark chiral behavior
32
Physics and methods (III)

- Open issues: is it possible to study
  - BK: probably yes, using chiral Ward identity methods for NP renormalization of
    4-quark operators (Kuramashi et al., Phys. Rev. D60 (1999) 034511)
  - K → ππ decays in the I=2 channel: yes, as for BK
    (Ishizuka et al., Phys. Rev. D58 (1998) 054503)
  - K → ππ decays in the I=0 channel??? Divergent renormalization of 4-quark operators???
    tmQCD???
33
Performance benchmark assumptions

- Target runs
  - lattice size: 24^3 x 48 at a ≈ 0.1 fm (1/a ≈ 2 GeV) and 32^3 x 64 at 1/a ≈ 2.83 GeV
  - # trajectories: 10000
  - polynomial order: 300
- Machine assumptions
  - system size: 2048 CPUs
  - job partition: 8^3 = 512 CPUs x 4
  - CPU performance: 2 Gflops
  - network performance: 0.2 GB/s/link (3 directions overlapped)
  - network latency: 20 μs
34
Time estimate for standard HMC (Nf=2+1)

Standard HMC, 1/a = 2 GeV, lattice 24^3 x 48:

  mπ/mρ    Ninv    1/dt   time/traj (hr)            10000 traj
                          calc    comm    total     (days)
  0.6       517     116    0.12    0.13    0.25         26
  0.5       870     189    0.30    0.32    0.62         65
  0.4      1507     322    0.84    0.89    1.73        180
  0.3      2884     611    2.93    3.11    6.03        629
  0.2      8591    1806   25.00   26.57   51.57       5372

Standard HMC, 1/a = 2.83 GeV, lattice 32^3 x 64:

  mπ/mρ    Ninv    1/dt   time/traj (hr)            10000 traj
                          calc    comm    total     (days)
  0.6       719     155    0.67    0.46    1.13        118
  0.5      1218     252    1.72    1.19    2.91        303
  0.4      2118     430    4.86    3.39    8.25        860
  0.3      4066     814   17.16   11.99   29.14       3036
  0.2     12143    2408  148.23  103.66  251.89      26238
Rather dismal number of days .....
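The slide does not spell out how the "(days)" column is obtained, but the numbers are consistent with the 10000 trajectories being spread over the 4 partitions of 512 CPUs assumed on the previous slide; a minimal sketch of that conversion (my inference, verified against the table only numerically):

```python
# "(days)" column from "total time/traj (hr)", assuming the 10000 trajectories
# run concurrently on the 4 partitions of 512 CPUs listed in the assumptions.

N_TRAJ = 10000
N_PARTITIONS = 4

def days(total_hr_per_traj):
    return total_hr_per_traj * N_TRAJ / N_PARTITIONS / 24.0

# 24^3 x 48 lattice: reproduces 26, 65, 180, 629, 5372 days up to rounding
for mpi_mrho, total_hr in [(0.6, 0.25), (0.5, 0.62), (0.4, 1.73), (0.3, 6.03), (0.2, 51.57)]:
    print(f"mpi/mrho = {mpi_mrho}: {days(total_hr):7.0f} days")
```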
35
Acceleration of HMC via domain decomposition
M. Luescher, hep-lat/0409106

$$
\det\!\left(D^\dagger D\right) =
\int d\phi_\Omega^\dagger\, d\phi_\Omega\,
   \exp\!\left(-\,\phi_\Omega^\dagger\, \frac{1}{D_\Omega^\dagger D_\Omega}\, \phi_\Omega\right)
\cdot
\int d\phi_{\Omega^*}^\dagger\, d\phi_{\Omega^*}\,
   \exp\!\left(-\,\phi_{\Omega^*}^\dagger\, \frac{1}{D_{\Omega^*}^\dagger D_{\Omega^*}}\, \phi_{\Omega^*}\right)
\cdot
\int d\phi_R^\dagger\, d\phi_R\,
   \exp\!\left(-\,\phi_R^\dagger\, \frac{1}{R^\dagger R}\, \phi_R\right)
$$

The lattice is divided into domains $\Lambda_n$ and $\Lambda_n^*$, with
$\Omega = \bigcup_n \Lambda_n$, $\Omega^* = \bigcup_n \Lambda_n^*$, and

$$
D_\Omega = \sum_n D_{\Lambda_n}, \qquad
D_{\Omega^*} = \sum_n D_{\Lambda_n^*} \qquad \text{(Dirichlet b.c. on the domain boundaries)},
$$
$$
D_{\partial\Omega} = \sum_n D_{\partial\Lambda_n}, \qquad
D_{\partial\Omega^*} = \sum_n D_{\partial\Lambda_n^*},
$$
$$
R = 1 - \theta_{\partial\Omega^*}\, D_\Omega^{-1} D_{\partial\Omega}\, D_{\Omega^*}^{-1} D_{\partial\Omega^*},
\qquad
R^{-1} = 1 - \theta_{\partial\Omega^*}\, D^{-1} D_{\partial\Omega^*},
$$

where $R$ takes values only on the boundary of the domains.
36
Acceleration possibilities (III)

Crucial observation (Luescher):
  F0: force due to the gauge action
  F1: force due to D_Ω and D_Ω*
  F2: force due to R

[Plot from M. Luescher, hep-lat/0409106: average force ⟨||Fk(x,μ)||⟩ versus distance d (1-5) for k = 0, 1, 2, showing F0 >> F1 >> F2]

Numerically, at mπ/mρ ≈ 0.7 - 0.4 and a⁻¹ ≈ 2.4 GeV:
  F0 : F1 : F2 ≈ 5 : 1 : 0.2 = 25 : 5 : 1

Hence use larger step sizes for the quark forces:
  dτ0 : dτ1 : dτ2 ≈ 1 : 5 : 25
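The step-size hierarchy above is what a nested (Sexton-Weingarten style) multiple-time-scale integrator implements. The sketch below is my own generic toy illustration, not PACS-CS code: scalar "forces" merely stand in for F0, F1, F2, and the 1:5:25 ratio comes out of the recursion with five substeps per level.

```python
# Toy nested leapfrog with three time scales, illustrating dτ0 : dτ1 : dτ2 = 1 : 5 : 25.
# forces[0..2] are placeholders for F2 (Schur complement R), F1 (block Dirac), F0 (gauge),
# ordered outermost (largest step, smallest force) to innermost; a real HMC updates
# gauge links and momenta, not a single scalar pair (q, p).

def nested_leapfrog(q, p, forces, n_steps, dt, level=0, substeps=5):
    """Integrate level `level` with step dt; each position update recurses one level down."""
    for _ in range(n_steps):
        p += 0.5 * dt * forces[level](q)                  # half kick with this level's force
        if level + 1 < len(forces):
            # finer level: `substeps` smaller steps driven by the next force
            q, p = nested_leapfrog(q, p, forces, substeps, dt / substeps, level + 1, substeps)
        else:
            q += dt * p                                    # innermost level: drift
        p += 0.5 * dt * forces[level](q)
    return q, p

if __name__ == "__main__":
    # magnitudes mimic F2 : F1 : F0 ≈ 0.2 : 1 : 5; F0 is evaluated most often
    forces = [lambda q: -0.2 * q, lambda q: -1.0 * q, lambda q: -5.0 * q]
    print(nested_leapfrog(1.0, 0.0, forces, n_steps=1, dt=1.0))
```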
37
Advantages with domain-decomposed HMC

- Block inversions D_Λn → D_Λn⁻¹, D_Λn* → D_Λn*⁻¹
  - Inversion in each domain once every 5 MD steps or so
  - Dirichlet boundary conditions, hence
    - easier to invert
    - no inter-node communication if the domain is within the node
- R → R⁻¹ = 1 − θ_∂Ω* D⁻¹ D_∂Ω*
  - Full inversion once every 25 MD steps or so
- Both floating-point and communication requirements are reduced......
38
Acceleration possibility

1/a = 2 GeV, lattice 24^3 x 48:

  mπ/mρ   standard HMC        domain-decomposed HMC
          10000 traj (days)   #steps          time/traj (hr)            10000 traj   accel.
                              N0   N1   N2    calc    comm    total     (days)
  0.6           26             4    5    5    0.031   0.005   0.037         4           7
  0.5           65             4    5    6    0.058   0.010   0.068         7           9
  0.4          180             4    5    7    0.110   0.019   0.129        13          13
  0.3          629             4    5    8    0.230   0.041   0.271        28          22
  0.2         5372             4    5    9    0.747   0.132   0.880        92          59

1/a = 2.83 GeV, lattice 32^3 x 64:

  mπ/mρ   standard HMC        domain-decomposed HMC
          10000 traj (days)   #steps          time/traj (hr)            10000 traj   accel.
                              N0   N1   N2    calc    comm    total     (days)
  0.6          118             5    6    6    0.181   0.018   0.199        21           6
  0.5          303             5    6    7    0.333   0.033   0.366        38           8
  0.4          860             5    6    9    0.713   0.071   0.784        82          11
  0.3         3036             5    6   10    1.475   0.147   1.622       169          18
  0.2        26238             5    6   11    4.739   0.473   5.213       543          48

- Only a paper estimate, but more than encouraging .....
- Implementation in progress
39
summary

- "Dedicated" large-scale cluster (10-20 Tflops peak) expected by early summer of 2006 at the University of Tsukuba
- Plan to finish off the improved Wilson program exploiting the domain decomposition acceleration idea
- Separate system at KEK; physics program under discussion
40