300x Matlab
Dr. Jeremy Kepner
MIT Lincoln Laboratory
September 25, 2002
HPEC Workshop
Lexington, MA
This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract
F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and
are not necessarily endorsed by the Department of Defense.
Outline
• Introduction
• Motivation
• Challenges
• Approach
• Performance Results
• Future Work and Summary
Motivation: DoD Need
[Figure: relative cost of Matlab code vs. deployed DoD code]
• DoD has a clear need to rapidly develop, test and deploy new techniques for analyzing sensor data
– Most DoD algorithm development and simulations are done in Matlab
– Sensor analysis systems are implemented in other languages
– Transformation involves years of software development, testing and system integration
• MatlabMPI allows any Matlab program to become a high performance parallel program
Challenges: Why Has This Been Hard?
• Productivity
– Most users will not touch any solution that requires other languages (even cmex)
• Portability
– Most users will not use a solution that could potentially make their code non-portable in the future
• Performance
– Most users want to do very simple parallelism
– Most programs have long latencies (do not require low-latency solutions)
Outline
• Introduction
• Approach
– Basic Requirements
– File I/O based messaging
• Performance Results
• Future Work and Summary
Modern Parallel Software Layers
[Figure: layered parallel software stack: an Application (Input, Analysis, Output) built on a Parallel Library (Vector/Matrix, Task/Comp, Conduit), which rests on a Messaging Kernel and a Math Kernel, mapped through user and hardware interfaces onto cluster hardware (Intel cluster, workstation, PowerPC cluster)]
• Can build any parallel application/library on top of a few basic messaging capabilities
• MatlabMPI provides this Messaging Kernel
MatlabMPI “Core Lite”
• Parallel computing requires a small core set of capabilities:
– MPI_Run launches a Matlab script on multiple processors
– MPI_Comm_size returns the number of processors
– MPI_Comm_rank returns the id of each processor
– MPI_Send sends Matlab variable(s) to another processor
– MPI_Recv receives Matlab variable(s) from another processor
– MPI_Init called at beginning of program
– MPI_Finalize called at end of program
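A minimal sketch of how these functions fit together in a simple data-parallel script; the block decomposition and the summing "work" are illustrative assumptions, not part of MatlabMPI itself:

MPI_Init;                             % Start MatlabMPI.
comm = MPI_COMM_WORLD;                % Default communicator.
Np = MPI_Comm_size(comm);             % Number of processors.
my_rank = MPI_Comm_rank(comm);        % This processor's id (0..Np-1).

N = 1000;                             % Total work items (illustrative).
my_start = floor(my_rank*N/Np) + 1;   % Each rank takes a contiguous block.
my_stop  = floor((my_rank+1)*N/Np);
my_result = sum(my_start:my_stop);    % Stand-in for real per-block work.

if (my_rank > 0)
  MPI_Send(0, my_rank, comm, my_result);       % Workers send results to rank 0.
else
  total = my_result;
  for src = 1:(Np-1)
    total = total + MPI_Recv(src, src, comm);  % Rank 0 gathers and combines.
  end
  disp(total);
end

MPI_Finalize;                         % Shut down MatlabMPI.
exit;                                 % Exit Matlab.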
Key Insight: File I/O based messaging
• Any messaging system can be implemented using file I/O
[Figure: processes exchanging messages through a shared file system]
• File I/O provided by Matlab via load and save functions
– Takes care of complicated buffer packing/unpacking problem
– Allows basic functions to be implemented in ~250 lines of Matlab code
MatlabMPI: Point-to-point Communication
MPI_Send (dest, tag, comm, variable);
variable = MPI_Recv (source, tag, comm);
[Figure: sender and receiver exchanging a Matlab variable through a Data file and a Lock file on a shared file system]
• Sender saves variable in Data file, then creates Lock file
• Receiver detects Lock file, then loads Data file
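To make the protocol concrete, here is a rough sketch of how a file-based send and receive could be written; the file-naming scheme, the comm.rank field, and the polling loop are illustrative assumptions rather than the actual MatlabMPI internals:

function MatMPI_Send_sketch(dest, tag, comm, data)
  % Write the variable to a data file, then create a lock file to signal it.
  buf_file  = sprintf('p%d_p%d_t%d_buffer.mat', comm.rank, dest, tag);
  lock_file = sprintf('p%d_p%d_t%d_lock.mat',   comm.rank, dest, tag);
  save(buf_file, 'data');        % Matlab save handles all the packing.
  save(lock_file, 'tag');        % Lock file appears only after the data is complete.
end

function data = MatMPI_Recv_sketch(source, tag, comm)
  % Poll the shared file system for the lock file, then load the data file.
  buf_file  = sprintf('p%d_p%d_t%d_buffer.mat', source, comm.rank, tag);
  lock_file = sprintf('p%d_p%d_t%d_lock.mat',   source, comm.rank, tag);
  while (exist(lock_file, 'file') ~= 2)
    pause(0.01);                 % Wait for the sender to finish.
  end
  s = load(buf_file);            % Matlab load handles all the unpacking.
  data = s.data;
end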
Example: Basic Send and Receive
• Initialize
• Get processor ranks
• Execute send
• Execute receive
• Finalize
• Exit

MPI_Init;                             % Initialize MPI.
comm = MPI_COMM_WORLD;                % Create communicator.
comm_size = MPI_Comm_size(comm);      % Get size.
my_rank = MPI_Comm_rank(comm);        % Get rank.
source = 0;                           % Set source.
dest = 1;                             % Set destination.
tag = 1;                              % Set message tag.

if (comm_size == 2)                   % Check size.
  if (my_rank == source)              % If source.
    data = 1:10;                      % Create data.
    MPI_Send(dest, tag, comm, data);  % Send data.
  end
  if (my_rank == dest)                % If destination.
    data = MPI_Recv(source, tag, comm); % Receive data.
  end
end

MPI_Finalize;                         % Finalize Matlab MPI.
exit;                                 % Exit Matlab.
• Uses standard message passing techniques
• Will run anywhere Matlab runs
• Only requires a common file system
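For reference, a hedged example of launching this script from an interactive Matlab session; the script name 'basic_send_recv' is hypothetical, and the exact MPI_Run calling convention (script name, number of processors, machine list, returning commands to eval) is an assumption based on the description of MPI_Run earlier in these slides:

% Run the script 'basic_send_recv.m' on 2 processors; an empty machine list
% is assumed here to mean "run on the local machine".
eval( MPI_Run('basic_send_recv', 2, {}) );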
MatlabMPI Additional Functionality
• Important MPI convenience functions
– MPI_Abort kills all jobs
– MPI_Bcast broadcasts a message (exploits symbolic links to allow a true multi-cast)
– MPI_Probe returns a list of all incoming messages (allows more dynamic message reading; see the sketch after this list)
• MatlabMPI-specific functions
– MatMPI_Delete_all cleans up all files after a run
– MatMPI_Save_messages toggles deletion of messages (individual messages can be inspected for debugging)
– MatMPI_Comm_settings lets the user set MatlabMPI internals (rsh or ssh, location of Matlab, unix or windows, …)
• Other
– Processor-specific directories
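A hedged sketch of dynamic message reading with MPI_Probe; the return values (lists of source ranks and tags of pending messages) and the '*' wildcards are assumptions based on the descriptions above, not a confirmed MatlabMPI signature:

% Drain whatever messages have already arrived, from any source and tag.
[ranks, tags] = MPI_Probe('*', '*', comm);   % Assumed: lists of pending messages.
for i = 1:length(ranks)
  msg = MPI_Recv(ranks(i), tags(i), comm);   % Receive each pending message.
  disp(msg);                                 % Stand-in for real processing.
end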
Outline
• Introduction
• Approach
• Performance Results
– Bandwidth
– Parallel Speedup
• Future Work and Summary
MatlabMPI vs MPI bandwidth
[Figure: bandwidth (Bytes/sec) vs. message size (1 KB to 32 MB) on an SGI Origin2000, comparing C MPI and MatlabMPI]
• Bandwidth matches native C MPI at large message sizes
• Primary difference is latency (35 milliseconds vs. 30 microseconds)
MatlabMPI bandwidth scalability
Linux w/Gigabit Ethernet
[Figure: bandwidth (Bytes/sec) vs. message size (1 KB to 16 MB) on Linux with Gigabit Ethernet, for 2 and 16 processors]
• Bandwidth scales to multiple processors
• Cross mounting eliminates bottlenecks
Image Filtering Parallel Performance
[Figure: parallel performance of MatlabMPI image filtering: speedup vs. number of processors (1 to 64) for a fixed problem size (SGI Origin2000), and Gigaflops vs. number of processors for a scaled problem size (IBM SP2), each compared with linear scaling]
• Achieved “classic” super-linear speedup on fixed problem
• Achieved speedup of ~300 on 304 processors on scaled problem
Productivity vs. Performance
[Figure: lines of code vs. peak performance for several implementations of the image filtering application (Matlab, MatlabMPI, Parallel Matlab*, PVL, VSIPL, VSIPL/OpenMP, VSIPL/MPI, C, C++), spanning single-processor, shared-memory, and distributed-memory approaches, and spanning current practice and current research]
• Programmed image filtering several ways
• MatlabMPI provides high productivity and high performance
Current MatlabMPI deployment
• Lincoln Signal processing (7.8 on 8 cpus, 9.4 on 8 duals)
• Lincoln Radar simulation (7.5 on 8 cpus, 11.5 on 8 duals)
• Lincoln Hyperspectral Imaging (~3 on 3 cpus)
• MIT LCS Beowulf (11 Gflops on 9 duals)
• MIT AI Lab Machine Vision
• OSU EM Simulations
• ARL SAR Image Enhancement
• Wash U Hearing Aid Simulations
• So. Ill. Benchmarking
• JHU Digital Beamforming
• ISL Radar simulation
• URI Heart modelling

• Rapidly growing MatlabMPI user base
• Web release may create hundreds of users
Outline
• Introduction
• Approach
• Performance Results
• Future Work and Summary
Future Work: Parallel Matlab Toolbox
• Parallel Matlab need has been identified
– HPCMO (OSU)
• Required user interface has been demonstrated
– Matlab*P (MIT/LCS)
– PVL (MIT/LL)
• Required hardware interface has been demonstrated
– MatlabMPI (MIT/LL)
[Figure: Parallel Matlab Toolbox architecture: High Performance Matlab Applications (DoD Sensor Processing, DoD Mission Planning, Scientific Simulation, Commercial Applications) use the toolbox through a user interface; the toolbox uses a hardware interface to Parallel Computing Hardware]
• Allows parallel programs with no additional lines of code
Future Work: Scalable Perception
• Data explosion
– Advanced perception techniques must process vast (and rapidly growing) amounts of sensor data
• Component scaling
– Research has given us high-performance algorithms and architectures for sensor data processing
– But these systems are not modular, reflective, or radically reconfigurable to meet new goals in real time
• Scalable perception
– Will require a framework that couples the computationally daunting problems of real-time multimodal perception to the infrastructure of modern high-performance computing, algorithms, and systems
• Such a framework must exploit:
– High-level languages
– Graphical / linear algebra duality
– Scalable architectures and networks
Summary
• MatlabMPI has the basic functions necessary for parallel programming
– Size, rank, send, receive, launch
– Enables complex applications or libraries
• Performance can match native MPI at large message sizes
• Demonstrated scaling into hundreds of processors
• Demonstrated productivity
• Available on HPCMO systems
• Available on Web
Acknowledgements
• Support
– Charlie Holland DUSD(S&T) and John Grosh OSD
– Bob Bond and Ken Senne (Lincoln)
• Collaborators
– Stan Ahalt and John Nehrbass (Ohio St.)
– Alan Edelman, John Gilbert and Ron Choy (MIT LCS)
• Lincoln Applications
– Gil Raz, Ryan Haney and Dan Drake
– Nick Pulsone and Andy Heckerling
– David Stein
• Centers
– Maui High Performance Computing Center
– Boston University
Download
http://www.ll.mit.edu/MatlabMPI