Status of QCDOC
Peter Boyle, for the QCDOC collaboration
ILFTN, Edinburgh 2005
UKQCD / RBRC / Columbia

QCDOC Collaboration

Columbia: N. Christ, S. Cohen, C. Cristian, C. Kim, L. Levkova, X. Liao, G. Liu, R. Mawhinney, A. Yamaguchi
UKQCD: P. Boyle, M. Clark, B. Joo
Riken-Brookhaven: S. Ohta (KEK), T. Wettig (Regensburg)
SciDAC: C. Jung, K. Petrov
IBM Research: A. Gara, D. Chen

Problem characteristics

Krylov solvers dominate: Dirac matrix multiply + global summation.
• Simple nearest-neighbour stencil.
• Communications are nearest neighbour on a toroidal network.
• Very large machines ⇒ small local sub-lattice.
• The challenge is to provide the data fast enough.
• Interconnect as aggressive as a local DRAM system.
• Communications are repetitive.
• Communicate in multiple directions simultaneously.
• Predictable memory access patterns ⇒ prefetch.
Exploit these simplifications: special-purpose machines increase scalability.

For example, the Wilson Dirac operator:

D_W = Σ_µ [ U_µ(x) (1 − γ_µ) δ_{y,x+µ} + U_µ†(x − µ) (1 + γ_µ) δ_{y,x−µ} ]

Operation     | Flops/site (chiral basis) | Loads                    | Stores
Proj          | 96 adds                   | 24                       | (12+12)×4
SU(3) (fwd)   | 4×(12 mul + 60 madd)      | (18+12)×4                | 12×4
SU(3) (bwd)   | 4×(12 mul + 60 madd)      | (18+12)×4                | 12×4
Rec           | 7×24 adds                 | (12+12)×4                | 24
Face exchange | 8 packets sent + received | 12 × L³ × (8+8) / 2 words |

1320 flops and, off-core, 264 loads and 120 stores ⇒ 2.3 GB/s per Gflop/s.
Must perform comms in the time the CPU takes to process L⁴/2 sites.
Generalisations to 5-d are obvious.
Can trade off footprint versus latency tolerance and bandwidth.
QCDOC Architecture

• 6-dimensional torus.
• QCD can be maximally spread out in four/five dimensions.
• Extra dimensions allow partitioning.
• Over 10k nodes ⇒ small subvolume per node.
• Fast on-chip memory.
• High-performance nearest-neighbour communication.
• 24 DMA FIFOs to nearest neighbours.
• Frequent global summation: hardware assist.
• Communication performance ≈ memory performance.

QCDOC ASIC Technology

• IBM SA27E 0.18 µm mixed logic + embedded DRAM process.
• 420 MHz IBM PowerPC SoC (440 CPU + FPU).
• 5 Watts, allowing high density.
• Custom 4 MB on-chip embedded memory controller (IBM Research, Yorktown Heights).
• Custom communication network and DMA (Columbia).
• Custom Ethernet-JTAG boot/diagnostic protocol (IBM Research, Yorktown Heights).
• Hardware and software co-designed.

Serial Communications Unit

• 12 × 420 Mbit/s LVDS serial links using IBM HSSL transceivers.
• Single-bit error detect, ACK/NACK, hardware retransmit, checksums.
• Block-strided, descriptor-based DMA.
• Runs all links concurrently and efficiently.
• Hardware assist for global summation.

Prefetching Edram Controller

• Multiple ports: efficient concurrent access for comms and compute.
• Each PEC port linearly prefetches two streams.
• Double buffers gather writes.

QCDOC ASIC: 1 Gflop/s.

QCDOC Daughterboard: 2 Gflop/s, $450.
Two independent compute nodes + Ethernet system.
QCDOC Motherboard: 64 Gflop/s, $17k.
32 daughterboards, 64 nodes, very dense.
768 Gbit/s internode bandwidth, 800 Mbit/s Ethernet.

4096-node prototype: 4 Tflop/s, $1.6 million.
(Air cooled with a chilled-water heat exchanger.)

12288-node UKQCD machine on the BNL machine floor: 10.3 Tflop/s peak (at the current clock speed), $5 million.
UKQCD machine disassembled and in shipping; 12288-node RBRC machine nearly complete; 12288-node DOE SciDAC machine by March.

[Figures: serial communications data eye; DDR memory data eye (write path); Ethernet data eye (three-level signal).]

Performance

• Memory system performance
• Network performance
• Application performance

Streams performance:

Compiler/Options/Code                                   | Comment            | Memory | MB/s
gcc-3.4.1 -O6                                           | vanilla source     | Edram  |  747
xlc -O5 -qarch=440                                      | vanilla source     | Edram  |  747
gcc-3.4.1 -O6 -funroll-all-loops -fprefetch-loop-arrays | builtin prefetch   | Edram  | 1024
Assembly                                                | auto-generated asm | Edram  | 1670
gcc-3.4.1 -O6                                           | vanilla source     | DDR    |  260
xlc -O5 -qarch=440                                      | vanilla source     | DDR    |  280
gcc-3.4.1 -O6 -funroll-all-loops -fprefetch-loop-arrays | builtin prefetch   | DDR    |  447
Assembly                                                | auto-generated asm | DDR    |  606

QCDOC multi-wire ping-ping bandwidth (420 MHz):
[Figure: bandwidth vs. packet size for 1 to 11 bidirectional links, compared with Infiniband, Myrinet and TCP PMB results (NCSA).]

• Single-link half-RTT latency: 690 ns.
• Single-link ping-pong obtains 50% of maximum bandwidth on 32-byte packets.
• SoC: outstanding interconnect with modest node cost.
• 2⁴ Wilson operator in 20 µs, with 16 distinct DMAs.
• Latency is particularly important; further improvements help AsqTad.

[Figure: QCDOC single-wire latency (420 MHz) vs. packet size, single-direction and multi-directional effective latency.]

Global reduction

• Hardware acceleration for "all-to-all" along an axis.
• CPU performs the arithmetic: GlobalSum → 300 ns × D × Nproc^(1/D).
• Slope of the log-log plot is asymptotically 1/D.
• 1024-node global sum in under 16 µs.
• 20-30 µs (estimated) for D ≥ 3 on 12k nodes.

[Figure: global sum time (µs) vs. number of processors for 1d to 6d global sums.]

Application code performance

Various discretisations, 4⁴ local volume:

Action | Nodes | Sparse matrix | CG performance
Wilson |  512  | 44%           | 39%
AsqTad |  128  | 42%           | 40%
DWF    |  512  | 46%           | 42%
Clover |  512  | 54%           | 47%

On 2⁴ the surface-to-volume ratio is 4; QCDOC nodes can apply the Wilson sparse matrix in around 20 µs at 44% of peak, with 16 distinct communications.

Scaling

16⁴ on 1k nodes (equivalent to 32⁴ on 16k nodes).
[Figure: Clover scaling on 16⁴ up to 1024 nodes against ideal Edram and ideal DDR curves; performance reaches roughly 300 Gflop/s, with the Edram threshold marked.]
Expect 4 to 5 Tflop/s sustained on large machines.

Partitioning

• The machine is a 2^k × 2^l × 2^m × 2 × 2 × 2 hyper-torus.
• Partitions are 6-d hyper-rectangular slices of the machine.
• Locally redefine the application-to-machine axis map.
• Present a 1-, 2-, 3-, 4-, 5- or 6-dimensional torus to applications.
• The logical nearest neighbour is physically the nearest neighbour.

Remapped slice of the machine: a lower-dimensional torus.

• Fold any two dimensions independently of the others.
• Can iterate to fold multiple dimensions together.

Operating System

Custom node kernel and host daemon (v2.5.9):
• Eliminates Unix overhead and integrates diagnostics.
• POSIX-compliant single-process subset (Cygwin/newlib libc).
• Custom Ethernet driver.
• Custom NFS client provides I/O to the host and a trivially parallel file system.
• The O/S advances the state of the art for special-purpose machines.
• 2-minute boot time.

Compile Chain & Programming Environment

• C/C++ with message-passing extensions.
• GCC, XLC (gfortran?).
• QMP 2.0 (Jung) is the SciDAC cross-platform message-passing library for QCD.
• CPS uses an internal interface.
• Multi-heap memory allocation extensions for Edram.
• Standards-compliant: benefits everyone; the documentation is already written and the code is portable.

Reliability

• Careful and controlled boot process.
• Log and test each memory, device, bus and bridge before use.
• Discovers and checks machine topology.
• Counts, reports and databases DDR and Edram ECC errors.
• Counts, reports and databases SCU link errors.
• Counts, reports and databases Ethernet problems.
• Reports and databases checksum mismatches on any link in the machine.
• Optionally terminates jobs on any correctable error.
• Application level: run 10% reproducibility testing long term.
• 100% reproducibility testing during the shakeout phase.

Disks, Bandwidths and all that

• /host globally shared.
• /pfs per-node unique (8-way striping across boxes).

Tune Linux kernel parameters: O(12-50) MB/s to each RAID box.
Each Linux/RAID box supports 4096 mounts; p655/AIX has supported over 12k mounts of internal disks.

[Diagram: N = M·D nodes. Each of D Linux/RAID disk boxes exports per-node /pfs file systems (Disk1:/node_1/pfs ... DiskD:/node_N/pfs) to a group of M nodes over Ethernet; /host is globally shared over Ethernet/HSSL.]

UKQCD: 400 MHz

12 racks in use 24 × 7 last week (ramped up since December).
CPS: understanding the 2+1 DWF/DBW2 parameter space, 16³ × 32 × 8.
CPS: cross-check 2+1 R-algorithm on UK-QCDOC/CU-QCDOC/CU-QCDSP.
CPS: cross-check 2+1 RHMC vs. R-algorithm on UK-QCDOC (Clark).
CHROMA: up and running with the BAGEL assembler (Joo).
CHROMA: inverter agreement with CPS.

RBRC: 420 MHz

2 CU racks stable since September.
12 racks assembled at BNL, 2 racks stable.
Speed-sorting the remainder at 420 MHz to eliminate reproducibility problems.
CPS: cross-check CU-QCDOC R-algorithm 2+1 DWF with QCDSP.
CPS: study of the gauge-action rectangle effect on mres.

DOE

Front-end set up and configured; (empty) racks on the BNL floor and cabled.
First 21 motherboards (1344 nodes) received this week.
Assembly and shakeout planned by May.

Outstanding Issues

• Bit errors when running AsqTad.
  Driven by coupling to the Ethernet system; hacks avoid the trigger.
  BNL: using additional NICs to minimise Ethernet traffic.
  UKQCD: hack to turn off Ethernet when running optimised code.
  Up time improved by a factor ≥ O(100), to a tolerable 12-hour level.
• Reproducibility/speed distribution.
  FPU very infrequently drops the last few bits in double precision.
  UK lowered clock speed and temperature (1 minute → 14 days between errors).
  BNL investigating speed-sorting ASICs using a synthetic DWF HMC diagnostic.
  How to obtain the safety margin needed for scientific prudence?
Recently solved issues

• Ethernet system lock-up mode.
• DDR memory timing.
• DDR PLB slave errors.

Summary

• UKQCD machine has been shaken out and is performing well.
• $1 per megaflop at 5 Tflop/s.
• Safe clock speed is lower in our volume build.
• Must fix this (sorting or a lower clock); 100% reproducibility testing for now.
• Minor hacks work around the PLL problem that shows up in optimised code.
• Linux/RAID disk system has been performing well after tweaking.
• Larger machines will be cabled in response to demand.
• Real status presented by Chris, Balint and Mike.
• All that remains is to exploit 30 Tflop/s for physics.
• Thanks to the tolerant UKQCD early adopters (Antonio, Clark, Joo, Maynard, Tweedie, Yamaguchi).