Status of QCDOC

Peter Boyle, for the QCDOC collaboration

ILFTN, Edinburgh 2005
QCDOC Collaboration

Columbia: N. Christ, S. Cohen, C. Cristian, C. Kim, L. Levkova, X. Liao, G. Liu, R. Mawhinney, A. Yamaguchi
UKQCD: P. Boyle, M. Clark, B. Joo
RIKEN-Brookhaven: S. Ohta (KEK), T. Wettig (Regensburg)
SciDAC: C. Jung, K. Petrov
IBM Research: A. Gara, D. Chen
Problem characteristics

Krylov solvers dominate: Dirac matrix multiply + global summation
• Simple nearest neighbour stencil
• Communications are nearest neighbour on a toroidal network
• Very large machines ⇒ small local sub-lattice
• Challenge is to provide the data fast enough
• Interconnect as aggressive as a local DRAM system
• Communications repetitive
• Communicate in multiple directions simultaneously
• Predictable memory access patterns ⇒ prefetch

Exploit simplifications: special purpose machines increase scalability
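To make the latency-hiding strategy above concrete, here is a minimal sketch of how a nearest-neighbour stencil sweep can overlap simultaneous multi-direction communication with interior computation. The types and helper functions (Field, Comms, start_face_exchange, wait_face_exchange, apply_stencil_interior, apply_stencil_boundary) are hypothetical placeholders for illustration, not the QCDOC or CPS interfaces.

/* Hedged sketch of comms/compute overlap for a nearest-neighbour stencil.
   All names below are illustrative placeholders. */
#define NDIM 4

typedef struct Field Field;   /* local sub-lattice data       */
typedef struct Comms Comms;   /* per-direction DMA/link state */

extern void start_face_exchange(Comms *c, const Field *in, int dir);
extern void wait_face_exchange(Comms *c, int dir);
extern void apply_stencil_interior(Field *out, const Field *in);
extern void apply_stencil_boundary(Field *out, const Field *in);

void stencil_sweep(Field *out, const Field *in, Comms *comms)
{
    /* Post transfers in all 2*NDIM directions at once; the pattern is
       identical every iteration, so descriptors can be reused. */
    for (int dir = 0; dir < 2 * NDIM; dir++)
        start_face_exchange(comms, in, dir);

    /* Interior sites need no remote data: compute them while faces arrive. */
    apply_stencil_interior(out, in);

    /* Complete the exchange, then finish the surface sites. */
    for (int dir = 0; dir < 2 * NDIM; dir++)
        wait_face_exchange(comms, dir);
    apply_stencil_boundary(out, in);
}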
For example, the Wilson Dirac Operator

D_W = Σ_µ [ U_µ(x) (1 − γ_µ) δ_{y,x+µ} + U_µ†(x − µ) (1 + γ_µ) δ_{y,x−µ} ]

Operation       Flops/site (chiral basis)    Loads                      Stores
Proj            96 adds                      24                         (12 + 12) × 4
Su3             4 × (12 mul + 60 madd)       (18 + 12) × 4              12 × 4
Su3             4 × (12 mul + 60 madd)       (18 + 12) × 4              12 × 4
Rec             7 × 24 adds                  (12 + 12) × 4              24
Face exchange   8 packets S+R                12 × L³/2 × (8 + 8) words

1320 flops and, off-core, 264 loads and 120 stores ⇒ 2.3 GB/s per Gflop/s

Must perform comms in the time the CPU takes to process L⁴/2 sites.
Generalisations to 5-d are obvious.
Can trade off footprint versus latency tolerance and bandwidth.
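As a quick check of the bandwidth figure quoted above, assuming the off-core traffic is in 8-byte double-precision words (an assumption consistent with the quoted 2.3 GB/s per Gflop/s):

/* Sanity check of the bytes-per-flop estimate quoted above.
   Assumes the 264 loads and 120 stores are 8-byte (double precision) words. */
#include <stdio.h>

int main(void)
{
    const double flops = 1320.0;          /* flops per site          */
    const double words = 264.0 + 120.0;   /* off-core loads + stores */
    const double bytes = words * 8.0;     /* 8 bytes per word        */

    /* bytes/flop equals GB/s required per sustained Gflop/s */
    printf("%.2f bytes/flop -> %.2f GB/s per Gflop/s\n",
           bytes / flops, bytes / flops);  /* prints 2.33 */
    return 0;
}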
QCDOC Architecture

• 6 dimensional torus
• QCD can be maximally spread out in four/five dimensions
• Extra dimensions allow partitioning
• Over 10k nodes ⇒ small subvolume per node
• Fast on-chip memory
• High performance nearest neighbour communication
• 24 DMA FIFOs to nearest neighbours
• Frequent global summation: hardware assist
• Communication performance ≈ memory performance
QCDOC ASIC Technology

• IBM SA27E 0.18 µm mixed logic + embedded DRAM process
• 420 MHz IBM PowerPC SoC (440 CPU + FPU)
• 5 Watts, allowing high density
• Custom 4MB on-chip embedded memory controller (IBM Research, Yorktown Heights)
• Custom communication network and DMA (Columbia)
• Custom Ethernet-JTAG boot/diagnostic protocol (IBM Research, Yorktown Heights)
• Hardware and software co-designed
Serial Communications Unit

• 12 × 420 Mbit/s LVDS serial links using IBM HSSL transceivers
• Single-bit error detection, ACK/NACK, hardware retransmit, checksums
• Block-strided, descriptor-based DMA
• Runs all links concurrently and efficiently
• Hardware assist for global summation

Prefetching Edram Controller

• Multiple ports: efficient concurrent access for comms and compute
• Each PEC port linearly prefetches two streams
• Double buffers gather writes
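To illustrate what a block-strided descriptor means in practice, here is a hypothetical descriptor layout; the struct and field names are illustrative only and are not the actual QCDOC SCU descriptor format.

/* Hypothetical block-strided DMA descriptor, for illustration only:
   gather `blocks` blocks of `block_len` bytes, starting `stride` bytes
   apart, into one packet stream.  This matches the access pattern of a
   face exchange (fixed-size spinors at a regular stride through the
   sub-lattice), which is why one descriptor per direction can be reused
   on every solver iteration. */
#include <stdint.h>

struct dma_descriptor {
    uint32_t base;       /* start address of the first block          */
    uint32_t block_len;  /* bytes per contiguous block                */
    uint32_t stride;     /* byte distance between consecutive blocks  */
    uint32_t blocks;     /* number of blocks to gather/scatter        */
};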
QCDOC ASIC, 1 Gflop/s
QCDOC Daughterboard 2 Gflop/s, $450
Two independent compute nodes + Ethernet system
QCDOC Motherboard, $17k
64 Gflop/s, 32 Daughterboards, 64 nodes, very dense
768 Gigabit/s internode, 800 Mbit/s Ethernet.
QCDOC Motherboard
4096 node prototype, 4Tflop/s, $1.6 million
(Air cooled with chilled water heat exchanger)
12288 node UKQCD machine on the BNL machine floor.
10.3 Tflop/s peak (current clock speed), $5 million.
UKQCD machine disassembled and in shipping;
12288 node RBRC machine nearly complete;
12288 node DOE SciDAC machine due by March.
Serial Communications Data Eye
DDR memory data eye (write path)
Ethernet Data Eye (three level signal)
Performance
• Memory system performance
• Network performance
• Application performance
Streams performance
Compiler/Options/Code                                     Comment             Memory   MB/s
xlc -O5 -qarch=440                                        vanilla source      Edram     747
gcc-3.4.1 -O6 -funroll-all-loops -fprefetch-loop-arrays   vanilla source      Edram     747
gcc-3.4.1 -O6 -funroll-all-loops -fprefetch-loop-arrays   builtin prefetch    Edram    1024
Assembly                                                  auto-generated asm  Edram    1670
xlc -O5 -qarch=440                                        vanilla source      DDR       260
gcc-3.4.1 -O6 -funroll-all-loops -fprefetch-loop-arrays   vanilla source      DDR       280
gcc-3.4.1 -O6 -funroll-all-loops -fprefetch-loop-arrays   builtin prefetch    DDR       447
Assembly                                                  auto-generated asm  DDR       606
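The "builtin prefetch" rows refer to source annotated with GCC's __builtin_prefetch. The benchmark source is not reproduced in these slides; the loop below is only a minimal sketch of that kind of annotation, and the prefetch distance PFDIST is an arbitrary tuning assumption.

/* Minimal streams-triad-style loop with explicit software prefetch.
   Illustrative only; not the benchmark code behind the table above. */
#define PFDIST 8   /* prefetch distance in elements: a tunable assumption */

void triad(double *a, const double *b, const double *c,
           double scalar, int n)
{
    for (int i = 0; i < n; i++) {
        /* Hint the load streams a few iterations ahead so the memory
           system can overlap the fetch with the arithmetic. */
        __builtin_prefetch(&b[i + PFDIST], 0);
        __builtin_prefetch(&c[i + PFDIST], 0);
        a[i] = b[i] + scalar * c[i];
    }
}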
QCDOC multi-wire ping-pong bandwidth (420 MHz)

[Figure: bandwidth (MB/s) versus packet size (bytes) for 1 to 11 bidirectional links, compared with Infiniband, Myrinet and TCP PMB results (NCSA).]
• Single link half-RTT latency: 690 ns.
• Single link ping-pong obtains 50% of maximum bandwidth on 32 byte packets.
• SoC: outstanding interconnect with modest node cost.
• 2⁴ local volume Wilson operator in 20 µs, with 16 distinct DMAs.
• Latency particularly important for further improvements to AsqTad.
QCDOC single-wire latency (420 MHz)

[Figure: latency (microseconds) versus packet size (bytes), showing single-direction latency and multi-directional effective latency.]
Global reduction

• Hardware acceleration for "All-to-All" along an axis.
• CPU performs arithmetic.
• GlobalSum → 300 ns × (1/2) D (Nproc)^(1/D)
• Slope of log-log plot is asymptotically 1/D
• 1024 node global sum in under 16 µs.
• 20-30 µs (estimated) for D ≥ 3 on 12k nodes
[Figure: global sum time (µs) versus number of processors, for 1d through 6d global sums.]
Application code performance

Various discretisations, 4⁴ local volume:

Action   Nodes   Sparse Matrix   CG Performance
Wilson    512        44%             39%
AsqTad    128        42%             40%
DWF       512        46%             42%
Clover    512        54%             47%

On 2⁴ the surface to volume ratio is 4.
QCDOC nodes can apply the Wilson sparse matrix in around 20 µs at 44% of peak,
with 16 distinct communications.
Scaling

16⁴ on 1k nodes (equivalent to 32⁴ on 16k nodes)

[Figure: Clover scaling on 16⁴ up to 1024 processors, sustained Gigaflops versus processors, with ideal Edram and ideal DDR curves and the Edram threshold marked.]
Expect 4 to 5 Tflop/s sustained on large machines
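A rough consistency check of that expectation, using only numbers quoted in these slides (10.3 Tflop/s peak for 12288 nodes at 420 MHz, and the 39-47% CG efficiencies above); the 2 flops/cycle figure, i.e. one fused multiply-add per cycle on the 440 FPU, is an assumption consistent with the quoted peak:

/* Back-of-envelope estimate of sustained performance on a 12288-node
   machine, using figures quoted in these slides. */
#include <stdio.h>

int main(void)
{
    const double nodes     = 12288.0;
    const double clock_hz  = 420.0e6;
    const double flops_clk = 2.0;    /* assumed: one FMA per cycle        */
    const double peak      = nodes * clock_hz * flops_clk; /* ~10.3 Tflop/s */

    const double cg_low = 0.39, cg_high = 0.47;             /* table above  */
    printf("peak      %.1f Tflop/s\n", peak / 1e12);
    printf("sustained %.1f - %.1f Tflop/s\n",
           peak * cg_low / 1e12, peak * cg_high / 1e12);    /* ~4.0 - 4.9   */
    return 0;
}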
Partitioning

• Machine is a 2ᵏ × 2ˡ × 2ᵐ × 2 × 2 × 2 hyper-torus.
• Partitions are 6d hyper-rectangular slices of the machine.
• Locally redefine the application-to-machine axis map.
• Present a 1, 2, 3, 4, 5 or 6 dimensional torus to applications.
• Logical nearest neighbour is physically the nearest neighbour.

Remapped slice of machine: lower dimensional torus.
• Fold any two dimensions independently of the others
• Can iterate to fold multiple dimensions together (a sketch of one such folding follows)
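As an illustration of dimension folding, the function below maps a 2-d torus coordinate onto a single ring using a serpentine path, so that neighbours in the folded dimension remain physical nearest neighbours. This is one possible mapping chosen for illustration, not necessarily the exact scheme used by the QCDOC run-time; the ring closes when the folded-over dimension has even size, as it does here since the machine dimensions are powers of two.

/* Fold two torus dimensions of sizes dim_a and dim_b into one ring of
   length dim_a * dim_b, using a serpentine (boustrophedon) path so that
   consecutive ring positions are physical nearest neighbours.
   Illustrative mapping only; assumes dim_b is even so the ring closes. */
int fold_2d_to_ring(int a, int b, int dim_a)
{
    /* Even rows traverse a in increasing order, odd rows in decreasing
       order; stepping the ring index by 1 moves by exactly one hop in
       either the a or the b direction of the underlying 2-d torus. */
    int a_in_row = (b % 2 == 0) ? a : (dim_a - 1 - a);
    return b * dim_a + a_in_row;
}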
Operating System

Custom node kernel and host daemon (v2.5.9)
• Eliminates Unix overhead and integrates diagnostics
• POSIX compliant single-process subset (Cygwin/newlib libc)
• Custom Ethernet driver
• Custom NFS client provides I/O to host and a trivially parallel file system
• O/S advances the state of the art for special purpose machines
• 2 minute boot time

Compile Chain & Programming Environment
• C/C++ with message passing extensions
• GCC, XLC (gfortran?)
• QMP 2.0 (Jung) is the SciDAC cross-platform message-passing library for QCD
• CPS uses an internal interface
• Multi-heap memory allocation extensions for Edram
• Standard compliant: benefits everyone; documentation is already written and code is portable
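For flavour, here is a minimal sketch of what a nearest-neighbour face exchange plus a global sum looks like through QMP. The calls are recalled from the public QMP 2.0 interface and should be treated as an approximation, not verified QCDOC code; argument order and error handling deserve a check against qmp.h.

/* Minimal QMP 2.0 sketch: exchange one face with the +x neighbour and
   perform a global sum.  Recalled from the QMP documentation; treat the
   details as assumptions. */
#include <qmp.h>

void face_exchange_and_sum(double *send_buf, double *recv_buf,
                           size_t nbytes, double *local_dot)
{
    QMP_msgmem_t send_mem = QMP_declare_msgmem(send_buf, nbytes);
    QMP_msgmem_t recv_mem = QMP_declare_msgmem(recv_buf, nbytes);

    /* Relative sends/receives address the neighbour along a logical
       machine axis: here axis 0, forward (+1) and backward (-1). */
    QMP_msghandle_t send_h = QMP_declare_send_relative(send_mem, 0, +1, 0);
    QMP_msghandle_t recv_h = QMP_declare_receive_relative(recv_mem, 0, -1, 0);

    QMP_start(recv_h);
    QMP_start(send_h);
    QMP_wait(send_h);
    QMP_wait(recv_h);

    /* Global reduction across all nodes (hardware-assisted on QCDOC). */
    QMP_sum_double(local_dot);

    QMP_free_msghandle(send_h);
    QMP_free_msghandle(recv_h);
    QMP_free_msgmem(send_mem);
    QMP_free_msgmem(recv_mem);
}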
Reliability

• Careful and controlled boot process
• Log and test each memory, device, bus and bridge before use
• Discovers and checks machine topology
• Counts, reports and databases DDR and Edram ECC errors
• Counts, reports and databases SCU link errors
• Counts, reports and databases Ethernet problems
• Reports and databases checksum mismatches on any link in the machine
• Optionally terminate jobs on any correctable errors
• Application level: run 10% reproducibility testing long term
• 100% reproducibility testing during shakeout phase
Disks, Bandwidths and all that

• /host globally shared
• /pfs per-node unique (8 way striping across boxes)

Tune Linux kernel parameters: O(12 - 50) MB/s to each RAID box.
Each Linux/RAID box is supporting 4096 mounts.
p655/AIX has supported 12k over-mounts of internal disks.

[Figure: file system layout. N = M × D nodes; each of D Linux/RAID boxes exports per-node /pfs directories (Disk1:/node_1/pfs ... DiskD:/node_N/pfs) to M nodes, /host is globally shared, and nodes are connected by Ethernet and HSSL.]
UKQCD: 400 MHz
12 racks in use 24 × 7 last week (ramped up since December)
CPS: Understanding the 2+1 DWF/DBW2 parameter space - 16³ × 32 × 8
CPS: Cross check 2+1 R-algorithm UK-QCDOC/CU-QCDOC/CU-QCDSP
CPS: Cross check 2+1 RHMC vs. R-algorithm on UK-QCDOC (Clark)
CHROMA: Up and running with BAGEL assembler (Joo)
CHROMA: Inverter agreement with CPS

RBRC: 420 MHz
2 CU racks stable since September
12 racks assembled at BNL, 2 racks stable
Speed sorting the remainder at 420 MHz to eliminate reproducibility problems
CPS: Cross check CU-QCDOC R-algorithm 2+1 DWF with QCDSP
CPS: Study of gauge action rectangle effect on mres

DOE:
Front-end set up and configured
(Empty) racks on BNL floor and cabled
First 21 motherboards (1344 nodes) received this week
Assembly and shakeout planned by May
Outstanding Issues

• Bit errors when running AsqTad
  Driven by coupling to the Ethernet system; workarounds avoid the trigger
  BNL: using additional NICs to minimise Ethernet traffic
  UKQCD: hack to turn off Ethernet when running optimised code
  ≥ O(100)× longer up time, to a tolerable 12 hour level
• Reproducibility/speed distribution
  FPU very infrequently drops the last few bits in double precision
  UK lowered clock speed & temperature (1 minute → 14 days between errors)
  BNL investigating speed sorting ASICs using a synthetic DWF HMC diagnostic
  How to obtain the safety margin needed for scientific prudence?

Recently solved issues
• Ethernet system lock-up mode
• DDR memory timing
• DDR PLB slave errors
Summary

• UKQCD machine has been shaken out and is performing well
• $1 per megaflop at 5 Tflop/s sustained
• Safe clock speed is lower in our volume build
• Must fix this (speed sorting or a lower clock); 100% reproducibility testing for now
• Minor hacks mitigate the PLL problem that shows up in optimised code
• Linux/RAID disk system has been performing well after tweaking
• Larger machines will be cabled in response to demand
• Real status presented by Chris, Balint and Mike
• All that remains is to exploit 30 Tflop/s for physics
• Thanks to the tolerant UKQCD early adopters
  (Antonio, Clark, Joo, Maynard, Tweedie, Yamaguchi)