Toward Mega-Scale Computing with pMatlab
Chansup Byun and Jeremy Kepner
MIT Lincoln Laboratory
Vipin Sachdeva and Kirk E. Jordan
IBM T.J. Watson Research Center
HPEC 2010
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory
Slide 1
Outline
• Introduction
– What is Parallel Matlab (pMatlab)
– IBM Blue Gene/P System
– BG/P Application Paths
– Porting pMatlab to BG/P
• Performance Studies
• Optimization for Large Scale Computation
• Summary
Slide 2
Parallel Matlab (pMatlab)
[Figure: layered architecture for parallel computing. The Application (Input, Analysis, Output) reaches the Library Layer (pMatlab; parallel library with Vector/Matrix, Comp, Task, Conduit) through the User Interface. The Library Layer sits on the Kernel Layer (Messaging: MatlabMPI; Math: MATLAB/Octave), which reaches the Parallel Hardware through the Hardware Interface.]
• Kernel layer does single-node math & parallel messaging
• Library layer provides a parallel data and computation toolbox to Matlab users
Slide 3
IBM Blue Gene/P System
• LLGrid
– Core counts: ~1K cores
• Blue Gene/P
– Core counts: ~300K cores
– Core speed: 850 MHz
Slide 4
Blue Gene Application Paths
[Figure: the Blue Gene Environment offers two application paths. High Throughput Computing (HTC) serves serial and pleasantly parallel apps; High Performance Computing (MPI) serves highly scalable message passing apps.]
• High Throughput Computing (HTC)
– Enabling BG partition for many single-node jobs
– Ideal for “pleasantly parallel” type applications
Slide 5
HTC Node Modes on BG/P
• Symmetrical Multiprocessing (SMP) mode
– One process per compute node
– Full node memory available to the process
• Dual mode
– Two processes per compute node
– Half of the node memory per process
• Virtual Node (VN) mode
– Four processes per compute node (one per core)
– One quarter of the node memory per process
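The three node modes can be summarized as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and the default node memory of 2048 MB is an assumption that does not come from the slides.

```python
# Processes per compute node for each BG/P HTC node mode, as listed above.
PROCS_PER_NODE = {"smp": 1, "dual": 2, "vn": 4}

def partition_layout(nodes, mode, node_memory_mb=2048):
    """Return (total processes, memory per process in MB) for a partition.

    node_memory_mb is an assumed value, not taken from the slides.
    """
    per_node = PROCS_PER_NODE[mode]
    return nodes * per_node, node_memory_mb / per_node

for mode in ("smp", "dual", "vn"):
    print(mode, partition_layout(4096, mode))
```

With 4096 nodes this gives 4096, 8192, and 16384 processes for the SMP, dual, and VN modes respectively.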
Slide 6
Porting pMatlab to BG/P System
• Requesting and booting a BG partition in HTC mode
– Execute the “qsub” command
   Define the number of processes, the runtime, and the HTC boot script
   (htcpartition --trace 7 --boot --mode dual \
   --partition $COBALT_PARTNAME)
– Wait for the partition to be ready (until the boot completes)
• Running jobs
– Create and execute a Unix shell script that runs a series of “submit” commands, including
   submit -mode dual -pool ANL-R00-M1-512 \
       -cwd /path/to/working/dir -exe /path/to/octave \
       -env LD_LIBRARY_PATH=/home/cbyun/lib \
       -args "--traditional MatMPI/MatMPIdefs523.m"
• Combine the two steps
   eval(pRUN('m_file', Nprocs, 'bluegene-smp'))
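As a rough illustration of the "series of submit commands" step, the sketch below builds one command line per process using only the flags shown above. The helper name is hypothetical, and the per-rank file naming (MatMPI/MatMPIdefs<rank>.m, ranks starting at 0) is an assumption inferred from the example file name; it is not a description of how pRUN generates the script.

```python
import shlex

def build_submit_commands(nprocs, mode, pool, cwd, exe, lib_path):
    """Build one `submit` command line per process rank (0 .. nprocs-1)."""
    commands = []
    for rank in range(nprocs):
        # Assumption: each rank starts from its own MatMPI/MatMPIdefs<rank>.m file.
        octave_args = f"--traditional MatMPI/MatMPIdefs{rank}.m"
        argv = [
            "submit", "-mode", mode, "-pool", pool,
            "-cwd", cwd, "-exe", exe,
            "-env", f"LD_LIBRARY_PATH={lib_path}",
            "-args", octave_args,
        ]
        commands.append(shlex.join(argv))  # quotes the -args value for the shell
    return commands

cmds = build_submit_commands(
    nprocs=4, mode="dual", pool="ANL-R00-M1-512",
    cwd="/path/to/working/dir", exe="/path/to/octave",
    lib_path="/home/cbyun/lib",
)
print("\n".join(cmds))
```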
Slide 7
Outline
• Introduction
• Performance Studies
– Single Process Performance
– Point-to-Point Communication
– Scalability
• Optimization for Large Scale Computation
• Summary
Slide 8
Performance Studies
• Single Processor Performance
– MandelBrot
– ZoomImage
– Beamformer
– Blurimage
– Fast Fourier Transform (FFT)
– High Performance LINPACK (HPL)
• Point-to-Point Communication
– pSpeed
• Scalability
– Parallel Stream Benchmark: pStream
Slide 9
Single Process Performance: Intel Xeon vs. IBM PowerPC 450
[Figure: bar chart of time relative to Matlab (lower is better) for MandelBrot, ZoomImage*, Beamformer, Blurimage*, FFT, and HPL. Series: Matlab 2009b, LLGrid; Octave 3.2.2, LLGrid; Octave 3.2.2, IBM BG/P; Octave 3.2.2, IBM BG/P (Clock Normalized). Absolute-time labels on the chart: 5.2 s, 1.5 s, 6.1 s, 18.8 s, 11.9 s, 26.4 s.]
* conv2() performance issue in Octave has been improved in a subsequent release
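The slides do not spell out how the "Clock Normalized" series is computed. One plausible reading, shown here purely as an assumption, is that the measured BG/P time is scaled by the ratio of the two clock rates. The function name, the measured time, and the reference clock below are all hypothetical; only the 850 MHz BG/P core speed comes from the slides.

```python
def clock_normalized_time(measured_seconds, measured_clock_mhz, reference_clock_mhz):
    """Scale a run time to the reference clock rate.

    Assumes run time is inversely proportional to clock frequency.
    """
    return measured_seconds * measured_clock_mhz / reference_clock_mhz

bgp_seconds = 40.0        # hypothetical measured run time on BG/P
reference_mhz = 3200.0    # hypothetical clock of the comparison processor
print(clock_normalized_time(bgp_seconds, 850.0, reference_mhz))
```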
Slide 10
Octave Performance With Optimized BLAS
[Figure: DGEMM performance comparison. x-axis: matrix size (N x N); y-axis: MFLOPS.]
Slide 11
Single Process Performance: Stream Benchmark
[Figure: bar chart of performance relative to Matlab (higher is better) for three kernels: Scale (b=qc), Add (c=a+b), and Triad (a=b+qc). Series: Matlab 2009b, LLGrid; Octave 3.2.2, LLGrid; Octave 3.2.2, IBM BG/P; Octave 3.2.2, IBM BG/P (Clock Normalized). Bandwidth labels on the chart: 996 MB/s, 1208 MB/s, 944 MB/s.]
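For reference, the three kernels can be written out as follows, using the standard STREAM definition of Add (c = a + b). This is a plain-Python sketch rather than the pMatlab benchmark code; the function names and the byte-counting convention (8-byte elements, every array read or written counted once) are assumptions.

```python
def scale(c, q):
    """Scale: b = q*c"""
    return [q * x for x in c]

def add(a, b):
    """Add: c = a + b"""
    return [x + y for x, y in zip(a, b)]

def triad(b, c, q):
    """Triad: a = b + q*c"""
    return [x + q * y for x, y in zip(b, c)]

# Number of arrays read or written by each kernel, used to count bytes moved.
ARRAYS_TOUCHED = {"scale": 2, "add": 3, "triad": 3}

def bandwidth_mb_per_s(kernel, n, seconds, bytes_per_element=8):
    """Bandwidth in MB/s for one kernel pass over n-element arrays."""
    return ARRAYS_TOUCHED[kernel] * n * bytes_per_element / seconds / 1e6

print(triad([1.0, 2.0], [3.0, 4.0], 2.0))
print(bandwidth_mb_per_s("triad", 10**6, 0.024))
```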
Slide 12
Point-to-Point Communication
• pMatlab example: pSpeed
– Send/Receive messages to/from the neighbor.
– Messages are files in pMatlab.
[Figure: processes Pid = 0, 1, 2, 3, ..., Np-1, each exchanging messages with its neighbor.]
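The idea that "messages are files" can be illustrated with a single-process simulation of a neighbor exchange. This is a minimal sketch and not the MatlabMPI implementation: the file naming, the use of pickle, the rename step, and the wrap-around from the last Pid back to Pid 0 are all assumptions made for the example.

```python
import os
import pickle
import tempfile

def send(comm_dir, src, dest, tag, data):
    """Post a message as a file; the file name encodes sender, receiver, and tag."""
    final = os.path.join(comm_dir, f"p{src}_to_p{dest}_t{tag}.msg")
    partial = final + ".tmp"
    with open(partial, "wb") as f:
        pickle.dump(data, f)
    os.replace(partial, final)  # rename so a reader never sees a half-written file

def recv(comm_dir, src, dest, tag):
    """Read a posted message; a real receiver would poll until the file appears."""
    path = os.path.join(comm_dir, f"p{src}_to_p{dest}_t{tag}.msg")
    with open(path, "rb") as f:
        return pickle.load(f)

def ring_exchange(comm_dir, num_procs, n):
    """Each Pid sends 8*2^n bytes to its right neighbor, then reads from its left one."""
    payload = bytes(8 * 2 ** n)
    for pid in range(num_procs):
        send(comm_dir, pid, (pid + 1) % num_procs, n, payload)
    return [len(recv(comm_dir, (pid - 1) % num_procs, pid, n))
            for pid in range(num_procs)]

with tempfile.TemporaryDirectory() as comm_dir:
    print(ring_exchange(comm_dir, 4, 3))
```

Because every send and receive is a file operation, the measured bandwidth depends directly on the file system holding the message directory.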
Slide 13
Filesystem Consideration
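• A single NFS-shared disk (Mode S)
• A group of cross-mounted, NFS-shared disks to distribute messages (Mode M)
[Figure: two diagrams of processes Pid = 0, 1, 2, 3, ..., Np-1, one for Mode S (all processes use one shared disk) and one for Mode M (messages are spread across the cross-mounted disks).]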
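The difference between the two modes comes down to where a message file is written. The sketch below is a guess at the general shape of that choice, not pMatlab's actual logic: the function name, the directory names, and the rule of picking a disk from the destination Pid are all hypothetical.

```python
def message_dir(mode, dest_pid, shared_root, crossmount_roots):
    """Pick the directory that holds messages addressed to dest_pid."""
    if mode == "S":
        # Mode S: every message goes through the single shared disk.
        return f"{shared_root}/MatMPI"
    if mode == "M":
        # Mode M (assumed rule): spread messages over the cross-mounted disks
        # by destination Pid, so no single disk serves all the traffic.
        root = crossmount_roots[dest_pid % len(crossmount_roots)]
        return f"{root}/MatMPI"
    raise ValueError(f"unknown mode: {mode!r}")

disks = ["/mnt/node0", "/mnt/node1"]   # hypothetical mount points
print(message_dir("S", 3, "/shared", disks))
print(message_dir("M", 3, "/shared", disks))
```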
Slide 14
pSpeed Performance on LLGrid: Mode S
[Figure: four panels for Matlab 2009b with 1x2, 2x1, 4x1, and 8x1 process layouts. x-axis: message sizes, 8*2^N bytes, N = 1 to 24; y-axis: bandwidth, bytes/sec (log scale); one curve per iteration (Iterations 1 to 4). Higher is better.]
Slide 15
pSpeed Performance on LLGrid: Mode M
[Figure: four panels comparing Matlab 2009b and Octave 3.2.2 for the 4x1 and 8x1 process layouts. For each layout, one panel shows bandwidth, bytes/sec (log scale, higher is better) and one shows time, seconds (log scale, lower is better). x-axis: message sizes, 8*2^N bytes, N = 1 to 24.]
Slide 16
pSpeed Performance on BG/P
• BG/P Filesystem: GPFS
[Figure: two panels for Octave 3.2.2 with 2x1, 4x1, and 8x1 process layouts. One panel shows bandwidth, bytes/sec (log scale) and the other shows time, seconds (log scale). x-axis: message sizes, 8*2^N bytes, N = 1 to 24.]
Slide 17
pStream Results with Scaled Size
• SMP mode: Initial global array size of 2^25 for Np=1
– Global array size scales proportionally as the number of processes increases (1024x1)
[Figure: bandwidth, MB/sec (log scale) versus number of processes, 2^(N-1) for N = 1 to 11, for Scale, Add, and Triad. Annotation: 563 GB/sec, 100% efficiency at Np = 1024.]
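The scaled-size setup and the efficiency figure can be expressed in a few lines. This is a sketch under stated assumptions: efficiency is taken to mean aggregate bandwidth divided by Np times the single-process bandwidth, and 1 GB is taken as 1000 MB. Neither definition is given on the slide, and the function names are hypothetical.

```python
def scaled_global_size(num_procs, base_size=2 ** 25):
    """Global array size when each process keeps the Np=1 problem size."""
    return base_size * num_procs

def parallel_efficiency(aggregate_bw, single_bw, num_procs):
    """Aggregate bandwidth relative to perfect scaling of one process."""
    return aggregate_bw / (num_procs * single_bw)

aggregate_mb_s = 563 * 1000             # 563 GB/sec from the slide, assuming 1 GB = 1000 MB
per_process = aggregate_mb_s / 1024     # per-process bandwidth implied by 100% efficiency
print(scaled_global_size(1024) == 2 ** 35)
print(per_process)
print(parallel_efficiency(aggregate_mb_s, per_process, 1024))
```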
Slide 18
pStream Results with Fixed Size
• Global array size of 2^30
– The number of processes scaled up to 16384 (4096x4)
• Results by node mode
– SMP: 2.208 TB/Sec, 98% efficiency at Np=4096
– DUAL: 4.333 TB/Sec, 96% efficiency at Np=8192
– VN: 8.832 TB/Sec, 101% efficiency at Np=16384
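A quick arithmetic check on the reported figures: dividing each aggregate bandwidth by its process count gives the per-process bandwidth, and dividing the fixed global size by Np gives the local array length. The sketch assumes 1 TB = 10^6 MB; the names are hypothetical.

```python
# mode: (aggregate TB/sec, Np), as reported on the slide
REPORTED = {
    "SMP": (2.208, 4096),
    "DUAL": (4.333, 8192),
    "VN": (8.832, 16384),
}
GLOBAL_SIZE = 2 ** 30

def per_process_summary(mode, mb_per_tb=1_000_000):
    """Local array length and per-process bandwidth (MB/s) for one node mode."""
    tb_per_s, num_procs = REPORTED[mode]
    return {
        "local_elements": GLOBAL_SIZE // num_procs,
        "mb_per_s_per_process": tb_per_s * mb_per_tb / num_procs,
    }

for mode in REPORTED:
    print(mode, per_process_summary(mode))
```

Under these assumptions all three modes come out at roughly 530 to 540 MB/s per process, which is consistent with the near-100% efficiencies quoted.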
Slide 19
Outline
• Introduction
• Performance Studies
• Optimization for Large Scale Computation
– Aggregation
• Summary
Slide 20
Current Aggregation Architecture
• The leader process receives all the distributed data from other processes.
• All other processes send their portion of the distributed data to the leader process.
• The process is inherently sequential.
– The leader receives Np-1 messages.
[Figure: example with Np = 8 (Np: total number of processes), in which every other process sends its message directly to the leader.]
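The sequential pattern described above can be simulated in a few lines. This is an illustrative model only, not the pMatlab agg() source: data movement is replaced by list concatenation, and Pid 0 is assumed to be the leader.

```python
def agg(local_parts):
    """Gather every process's part on the leader (Pid 0), one message at a time.

    local_parts[pid] is the data held by process pid.
    Returns (gathered data, number of messages the leader received).
    """
    gathered = list(local_parts[0])
    messages_received = 0
    for pid in range(1, len(local_parts)):  # receives happen one after another
        gathered.extend(local_parts[pid])
        messages_received += 1
    return gathered, messages_received

print(agg([[pid] for pid in range(8)]))
```

For Np = 8 the leader handles 7 messages, so the cost at the leader grows linearly with Np.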
Slide 21
Binary-Tree Based Aggregation
• BAGG: Distributed message collection using a binary tree
– Each even-numbered process sends a message to its odd-numbered neighbor.
– Each odd-numbered process receives a message from its even-numbered neighbor.
• The maximum number of messages a process may send/receive is N, where Np = 2^N.
[Figure: binary tree for Np = 8. Pids 0 to 7 are at the leaves; Pids 0, 2, 4, 6 at the next level; then Pids 0 and 4; Pid 0 at the root.]
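A binary-tree collection of this kind can be sketched as a loop over doubling strides. The model below is a hedged illustration, not the bagg() source: it follows the tree in the figure, where Pids 0, 2, 4, 6 survive the first round and Pid 0 ends up with everything, and it replaces file-based messages with list concatenation.

```python
def bagg(local_parts):
    """Binary-tree gather onto Pid 0 for a power-of-two number of processes.

    Returns (gathered data, largest number of messages any one process received).
    """
    num_procs = len(local_parts)
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("this sketch requires Np to be a power of two")
    data = {pid: list(part) for pid, part in enumerate(local_parts)}
    received = {pid: 0 for pid in range(num_procs)}
    stride = 1
    while stride < num_procs:
        # In each round, the process at `receiver + stride` hands its data over.
        for receiver in range(0, num_procs, 2 * stride):
            sender = receiver + stride
            data[receiver].extend(data.pop(sender))
            received[receiver] += 1
        stride *= 2
    return data[0], max(received.values())

print(bagg([[pid] for pid in range(8)]))
```

For Np = 8 = 2^3 no process receives more than 3 messages, compared with 7 at the leader in the sequential scheme, and the transfers within one round are independent of each other.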
Slide 22
BAGG() Performance
• Two dimensional data and process distribution
• Two different file systems are used for performance comparison
– IBRIX: file system for users’ home directories
– LUSTRE: parallel file system for all computation
[Figure: time, seconds (log scale) versus number of CPUs, 2^N for N = 1 to 7, for agg() and bagg() on IBRIX and on LUSTRE. Annotations: IBRIX: 10x faster at Np=128; LUSTRE: 8x faster at Np=128.]
Slide 23
BAGG() Performance, 2
• Four dimensional data and process distribution
• With GPFS file system on IBM Blue Gene/P System (ANL’s Surveyor)
– From 8 processes to 1024 processes
[Figure: time, seconds (log scale) versus number of CPUs, 2^(N+2) for N = 1 to 8, for agg() and bagg() on GPFS. Annotation: 2.5x faster at Np=1024.]
Slide 24
Generalizing Binary-Tree Based Aggregation
• HAGG: Extend the binary tree to the next power of two number
– Suppose that Np = 6
   The next power of two number: Np* = 8
– Skip any messages from/to the fictitious Pid’s.
[Figure: the same binary tree drawn for Np* = 8 (Pids 0 to 7), in which Pids 6 and 7 are fictitious when Np = 6.]
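The generalization can be sketched by running the same stride loop over Np* and skipping any pairing that involves a Pid at or beyond Np. As with the other sketches, this is an illustrative model and not the hagg() source; messages are replaced by list concatenation and Pid 0 is assumed to be the root.

```python
def hagg(local_parts):
    """Binary-tree gather onto Pid 0 for any number of processes.

    Returns (gathered data, total messages sent, Np* used for the tree).
    """
    num_procs = len(local_parts)
    if num_procs < 1:
        raise ValueError("need at least one process")
    np_star = 1
    while np_star < num_procs:      # next power of two at or above Np
        np_star *= 2
    data = {pid: list(part) for pid, part in enumerate(local_parts)}
    messages = 0
    stride = 1
    while stride < np_star:
        for receiver in range(0, np_star, 2 * stride):
            sender = receiver + stride
            if sender >= num_procs:
                continue            # fictitious Pid: there is nothing to send
            data[receiver].extend(data.pop(sender))
            messages += 1
        stride *= 2
    return data[0], messages, np_star

print(hagg([[pid] for pid in range(6)]))
```

For Np = 6 the tree is built for Np* = 8 and the pairings involving Pids 6 and 7 are skipped, leaving 5 real messages in total. The extra check on every pairing is one example of the bookkeeping cost that the generalization adds.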
Slide 25
BAGG() vs. HAGG()
• HAGG() generalizes BAGG()
– Removes the restriction (Np = 2^N) in BAGG()
– Additional costs associated with bookkeeping
• Performance comparison on two dimensional data and process distribution
[Figure: time, seconds (log scale) versus number of CPUs, 2^N for N = 1 to 7, for agg, bagg, and hagg on IBRIX. Annotation: ~3x faster at Np=128.]
Slide 26
BAGG() vs. HAGG(), 2
• Performance comparison on four dimensional data and process distribution
• Performance difference is marginal on a dedicated environment
– SMP mode on IBM Blue Gene/P System
[Figure: time, seconds (log scale) versus number of CPUs, 2^(N+2) for N = 1 to 7, for bagg() and hagg() on GPFS.]
Slide 27
BAGG() Performance with Crossmounts
• Significant performance improvement by reducing resource contention on file system
– Performance is jittery because production cluster is used for performance test
[Figure: total runtime, seconds (0 to 50, lower is better) versus number of processes, 2^(N-1) for N = 1 to 8. Series: Old AGG, IBRIX; New AGG, IBRIX; New AGG, Hybrid (IBRIX+Crossmounts).]
Slide 28
Summary
• pMatlab has been ported to the IBM Blue Gene/P system
• Clock-normalized, single-process performance of Octave on the BG/P system is on par with Matlab
• For pMatlab point-to-point communication (pSpeed), file system performance is important.
– Performance is as expected with GPFS on BG/P
• Parallel Stream Benchmark scaled to 16384 processes
• Developed a new pMatlab aggregation function using a binary tree to scale beyond 1024 processes
Slide 29