Slide 1: Toward Mega-Scale Computing with pMatlab
Chansup Byun and Jeremy Kepner, MIT Lincoln Laboratory
Vipin Sachdeva and Kirk E. Jordan, IBM T.J. Watson Research Center
HPEC 2010
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Slide 2: Outline
• Introduction
  – What is Parallel Matlab (pMatlab)
  – IBM Blue Gene/P System
  – BG/P Application Paths
  – Porting pMatlab to BG/P
• Performance Studies
• Optimization for Large Scale Computation
• Summary

Slide 3: Parallel Matlab (pMatlab)
[Diagram: layered architecture — Application (input, analysis, output); Library Layer (pMatlab): vector/matrix, parallel computation, task/conduit, user interface; Kernel Layer: math (MATLAB/Octave), messaging (MatlabMPI), hardware interface; Parallel Hardware]
• Layered architecture for parallel computing
• The kernel layer does single-node math and parallel messaging
• The library layer provides a parallel data and computation toolbox to Matlab users

Slide 4: IBM Blue Gene/P System
• LLGrid: ~1K cores
• Blue Gene/P: ~300K cores at 850 MHz

Slide 5: Blue Gene Application Paths
[Diagram: the Blue Gene environment offers High Throughput Computing (HTC) for serial and pleasantly parallel applications, and High Performance Computing (MPI) for highly scalable message-passing applications]
• High Throughput Computing (HTC)
  – Enables a BG partition to run many single-node jobs
  – Ideal for "pleasantly parallel" applications

Slide 6: HTC Node Modes on BG/P
• Symmetric Multiprocessing (SMP) mode
  – One process per compute node
  – Full node memory available to the process
• Dual mode
  – Two processes per compute node
  – Half of the node memory per process
• Virtual Node (VN) mode
  – Four processes per compute node (one per core)
  – One quarter of the node memory per process

Slide 7: Porting pMatlab to the BG/P System
• Requesting and booting a BG partition in HTC mode
  – Execute the "qsub" command, defining the number of processes, the runtime, and the HTC boot script:
        htcpartition --trace 7 --boot --mode dual \
                     --partition $COBALT_PARTNAME
  – Wait until the partition boot completes
• Running jobs
  – Create and execute a Unix shell script that runs a series of "submit" commands such as:
        submit -mode dual -pool ANL-R00-M1-512 \
               -cwd /path/to/working/dir -exe /path/to/octave \
               -env LD_LIBRARY_PATH=/home/cbyun/lib \
               -args "--traditional MatMPI/MatMPIdefs523.m"
• The two steps are combined behind a single pMatlab call (see the sketch after the Performance Studies outline below):
        eval(pRUN('m_file', Nprocs, 'bluegene-smp'))

Slide 8: Outline
• Introduction
• Performance Studies
  – Single Process Performance
  – Point-to-Point Communication
  – Scalability
• Optimization for Large Scale Computation
• Summary

Slide 9: Performance Studies
• Single Processor Performance
  – MandelBrot
  – ZoomImage
  – Beamformer
  – Blurimage
  – Fast Fourier Transform (FFT)
  – High Performance LINPACK (HPL)
• Point-to-Point Communication
  – pSpeed
• Scalability
  – Parallel Stream Benchmark: pStream
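Before the benchmark results, here is a minimal sketch of the launch idiom from Slide 7 together with the pMatlab programming model from Slide 3. It is not the benchmark code used in this study: the script name pStreamSketch, the process count, the array size, and the local update are illustrative assumptions; pRUN, map, local, put_local, agg, Np, and Pid are standard pMatlab names, and 'bluegene-smp' is the profile quoted on Slide 7.

    % Hedged sketch of a pMatlab job on BG/P; illustrative, not the study's code.
    % Launcher: one call replaces the qsub/submit steps shown on Slide 7.
    eval(pRUN('pStreamSketch', 64, 'bluegene-smp'));   % launch 64 Octave processes

    % pStreamSketch.m -- what each of the Np processes executes.
    PARALLEL = 1;                       % common pMatlab example convention
    N = 2^20;                           % illustrative global vector length
    Xmap = 1;                           % serial map when PARALLEL = 0
    if PARALLEL
      Xmap = map([Np 1], {}, 0:Np-1);   % distribute rows across all Np processes
    end
    A = zeros(N, 1, Xmap);              % distributed array
    Aloc = local(A);                    % this process's piece
    Aloc = 3.0 * Aloc + Pid;            % purely local, STREAM-like update
    A = put_local(A, Aloc);             % write the piece back
    Awhole = agg(A);                    % gather onto the leader process (Pid 0)
    if Pid == 0
      disp(size(Awhole));               % only the leader holds the full array
    end

The same script runs serially when PARALLEL is set to 0, which is the usual pMatlab development path before scaling up on a booted partition.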
Slide 10: Single Process Performance: Intel Xeon vs. IBM PowerPC 450
[Bar chart: time relative to Matlab 2009b (lower is better, scale 0-14) for MandelBrot, ZoomImage*, Beamformer, Blurimage*, FFT, and HPL; series: Matlab 2009b (LLGrid), Octave 3.2.2 (LLGrid), Octave 3.2.2 (IBM BG/P), and Octave 3.2.2 (IBM BG/P, clock normalized); annotated reference run times range from 1.5 s to 26.4 s]
* The conv2() performance issue in Octave has been improved in a subsequent release

Slide 11: Octave Performance with Optimized BLAS
[Chart: DGEMM performance comparison — MFLOPS vs. matrix size (N x N)]

Slide 12: Single Process Performance: Stream Benchmark
[Bar chart: performance relative to Matlab 2009b (higher is better) for the Scale (b = q*c), Add (c = a + b), and Triad (a = b + q*c) kernels; series: Matlab 2009b (LLGrid), Octave 3.2.2 (LLGrid), Octave 3.2.2 (IBM BG/P), and Octave 3.2.2 (IBM BG/P, clock normalized); annotated bandwidths of 944, 996, and 1208 MB/s]

Slide 13: Point-to-Point Communication
• pMatlab example: pSpeed
  – Each process sends messages to and receives messages from its neighbor (see the sketch after the Mode M results below)
  – Messages are files in pMatlab
[Diagram: ring of processes Pid = 0, 1, 2, 3, ..., Np-1]

Slide 14: Filesystem Consideration
• A single NFS-shared disk (Mode S)
• A group of cross-mounted, NFS-shared disks to distribute messages (Mode M)
[Diagram: the same process ring backed by one shared disk (Mode S) and by cross-mounted disks (Mode M)]

Slide 15: pSpeed Performance on LLGrid: Mode S
[Four panels: bandwidth (bytes/sec, higher is better) vs. message size (8*2^N bytes, N = 1-24) over four iterations for Matlab 2009b at 1x2, 2x1, 4x1, and 8x1]

Slide 16: pSpeed Performance on LLGrid: Mode M
[Panels: bandwidth (bytes/sec, higher is better) and time (seconds, lower is better) vs. message size (8*2^N bytes) for Matlab 2009b and Octave 3.2.2 at 4x1 and 8x1]
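To make the pSpeed plots concrete, here is a hedged sketch of the neighbor exchange pattern they measure, written against the MatlabMPI layer named on Slide 3. It is not the actual pSpeed source: the ring topology, tag values, size sweep, and timing loop are illustrative assumptions.

    % Hedged sketch of a pSpeed-style neighbor exchange (not the pSpeed source).
    % MatlabMPI messages are files, so the measured bandwidth depends on the
    % file system configuration (Slides 14-16).
    MPI_Init;
    comm    = MPI_COMM_WORLD;
    nprocs  = MPI_Comm_size(comm);
    my_rank = MPI_Comm_rank(comm);

    right = mod(my_rank + 1, nprocs);      % neighbor in a ring
    left  = mod(my_rank - 1 + nprocs, nprocs);
    tag   = 100;                           % illustrative tag base

    for n = 0:23                           % message sizes 8*2^N bytes, as in the plots
      msg = zeros(2^n, 1);                 % 8*2^n bytes of doubles
      tic;
      MPI_Send(right, tag + n, comm, msg); % write the message file for the neighbor
      recvd = MPI_Recv(left, tag + n, comm);  % read the file sent by the other neighbor
      t = toc;
      if my_rank == 0
        fprintf('N=%d  approx %.3e bytes/sec\n', n, 8*2^n / t);
      end
    end
    MPI_Finalize;

Because each MPI_Send writes a file that the matching MPI_Recv later reads, the bandwidth curves above track file system behavior, which is why Mode S and Mode M differ.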
Slide 17: pSpeed Performance on BG/P
• BG/P file system: GPFS
[Two panels: bandwidth (bytes/sec) and time (seconds) vs. message size (8*2^N bytes, N = 1-24) for Octave 3.2.2 at 2x1, 4x1, and 8x1]

Slide 18: pStream Results with Scaled Size
• SMP mode: initial global array size of 2^25 for Np = 1
  – The global array size scales proportionally as the number of processes increases (up to 1024x1)
[Chart: Scale, Add, and Triad bandwidth (MB/sec) vs. number of processes (2^(N-1)); 563 GB/sec with 100% efficiency at Np = 1024]

Slide 19: pStream Results with Fixed Size
• Global array size of 2^30
  – The number of processes scaled up to 16384 (4096x4)
[Chart results:]
  – SMP: 2.208 TB/sec, 98% efficiency at Np = 4096
  – DUAL: 4.333 TB/sec, 96% efficiency at Np = 8192
  – VN: 8.832 TB/sec, 101% efficiency at Np = 16384

Slide 20: Outline
• Introduction
• Performance Studies
• Optimization for Large Scale Computation
  – Aggregation
• Summary

Slide 21: Current Aggregation Architecture
• The leader process receives all of the distributed data from the other processes.
• All other processes send their portion of the distributed data to the leader process.
• The process is inherently sequential.
  – The leader receives Np-1 messages.
[Diagram: star aggregation for Np = 8 — processes 1 through 7 each send directly to the leader, process 0 (Np: total number of processes)]

Slide 22: Binary-Tree Based Aggregation
• BAGG: distributed message collection using a binary tree
  – At each stage, the odd-numbered processes send a message to their even-numbered neighbor, and the even-numbered processes receive a message from their odd-numbered neighbor.
• The maximum number of messages a process sends or receives is N, where Np = 2^N.
[Diagram: binary reduction tree for Np = 8 — (0,1) (2,3) (4,5) (6,7) → (0,2) (4,6) → (0,4) → 0]
(A hedged code sketch of this pattern follows the BAGG()/HAGG() comparison below.)

Slide 23: BAGG() Performance
• Two-dimensional data and process distribution
• Two different file systems are used for the performance comparison
  – IBRIX: file system for users' home directories
  – LUSTRE: parallel file system for all computation
[Chart: time (seconds) vs. number of CPUs (2^N) for agg() and bagg() on IBRIX and LUSTRE; bagg() is 10x faster on IBRIX and 8x faster on LUSTRE at Np = 128]

Slide 24: BAGG() Performance, 2
• Four-dimensional data and process distribution
• GPFS file system on the IBM Blue Gene/P system (ANL's Surveyor)
  – From 8 processes to 1024 processes
[Chart: time (seconds) vs. number of CPUs (2^(N+2)) for agg() and bagg() on GPFS; bagg() is 2.5x faster at Np = 1024]

Slide 25: Generalizing Binary-Tree Based Aggregation
• HAGG: extend the binary tree to the next power of two
  – Suppose Np = 6; the next power of two is Np* = 8
  – Skip any messages from/to the fictitious Pids
[Diagram: the same binary tree as for Np = 8, with Pids 6 and 7 fictitious]

Slide 26: BAGG() vs. HAGG()
• HAGG() generalizes BAGG()
  – It removes the restriction (Np = 2^N) of BAGG()
  – There are additional costs associated with bookkeeping
• Performance comparison on two-dimensional data and process distribution
[Chart: time (seconds) vs. number of CPUs (2^N) for agg(), bagg(), and hagg() on IBRIX; ~3x faster at Np = 128]
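As a companion to the aggregation slides, here is a hedged sketch of the binary-tree collection pattern from Slide 22, written against the same MatlabMPI calls as the earlier sketch. It is not the pMatlab bagg() source: it assumes Np is a power of two (the restriction that hagg() removes), assumes the per-process pieces can simply be concatenated, and the function name and tags are illustrative.

    % Hedged sketch of binary-tree aggregation (Slide 22); not the bagg() source.
    function data = tree_agg(my_piece, my_rank, Np, comm)
      data = my_piece;                         % start with this process's piece
      step = 1;
      while step < Np
        if mod(my_rank, 2*step) == 0
          partner = my_rank + step;            % even side at this stage: receive
          remote  = MPI_Recv(partner, 1000 + step, comm);
          data    = [data, remote];            % merge step (application-specific)
        else
          partner = my_rank - step;            % odd side at this stage: send
          MPI_Send(partner, 1000 + step, comm, data);
          return;                              % this process is done after sending
        end
        step = 2*step;
      end
    end
    % Only rank 0 finishes the loop holding the full data, and each process sends
    % or receives at most log2(Np) messages, matching the claim on Slide 22.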
Slide 27: BAGG() vs. HAGG(), 2
• Performance comparison on four-dimensional data and process distribution
• The performance difference is marginal in a dedicated environment
  – SMP mode on the IBM Blue Gene/P system
[Chart: time (seconds) vs. number of CPUs (2^(N+2)) for bagg() and hagg() on GPFS]

Slide 28: BAGG() Performance with Crossmounts
• Significant performance improvement by reducing resource contention on the file system
  – Performance is jittery because a production cluster is used for the performance test
[Chart: total runtime (seconds, lower is better) vs. number of processes (2^(N-1)) for the old agg() on IBRIX, the new agg() on IBRIX, and the new agg() on a hybrid of IBRIX and crossmounts]

Slide 29: Summary
• pMatlab has been ported to the IBM Blue Gene/P system
• Clock-normalized, single-process performance of Octave on the BG/P system is on par with Matlab
• For pMatlab point-to-point communication (pSpeed), file system performance is important
  – Performance is as expected with GPFS on BG/P
• The Parallel Stream Benchmark scaled to 16384 processes
• A new pMatlab aggregation function using a binary tree was developed to scale beyond 1024 processes