300x Matlab

Dr. Jeremy Kepner
MIT Lincoln Laboratory

September 25, 2002
HPEC Workshop, Lexington, MA

This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.


Outline

• Introduction
  – Motivation
  – Challenges
• Approach
• Performance Results
• Future Work and Summary


Motivation: DoD Need

(Figure: development cost vs. lines of DoD code)

• DoD has a clear need to rapidly develop, test, and deploy new techniques for analyzing sensor data
  – Most DoD algorithm development and simulations are done in Matlab
  – Sensor analysis systems are implemented in other languages
  – Transformation involves years of software development, testing, and system integration
• MatlabMPI allows any Matlab program to become a high performance parallel program


Challenges: Why Has This Been Hard?

• Productivity
  – Most users will not touch any solution that requires other languages such as C, C++, or F77 (even cmex)
• Portability
  – Most users will not use a solution that could potentially make their code non-portable in the future
• Performance
  – Most users want to do very simple parallelism
  – Most programs have long latencies (do not require low-latency solutions)


Outline

• Introduction
• Approach
  – Basic Requirements
  – File I/O based messaging
• Performance Results
• Future Work and Summary


Modern Parallel Software Layers

(Diagram: layered software stack — Application (Input, Analysis, Output); Parallel Library (Vector/Matrix, Comp, Task, Conduit); Messaging Kernel and Math Kernel; User Interface and Hardware Interface; Hardware (Workstation, Intel Cluster, PowerPC Cluster))

• Can build any parallel application/library on top of a few basic messaging capabilities
• MatlabMPI provides this Messaging Kernel


MatlabMPI "Core Lite"

• Parallel computing requires eight capabilities
  – MPI_Run launches a Matlab script on multiple processors
  – MPI_Comm_size returns the number of processors
  – MPI_Comm_rank returns the id of each processor
  – MPI_Send sends Matlab variable(s) to another processor
  – MPI_Recv receives Matlab variable(s) from another processor
  – MPI_Init called at beginning of program
  – MPI_Finalize called at end of program


Key Insight: File I/O based messaging

• Any messaging system can be implemented using file I/O over a shared file system
• File I/O provided by Matlab via load and save functions
  – Takes care of the complicated buffer packing/unpacking problem
  – Allows basic functions to be implemented in ~250 lines of Matlab code


MatlabMPI: Point-to-point Communication

  MPI_Send(dest, tag, comm, variable);
  variable = MPI_Recv(source, tag, comm);

(Diagram: sender and receiver connected through a shared file system; the sender saves the variable to a data file and creates a lock file, and the receiver detects the lock file and loads the data file)

• Sender saves variable in Data file, then creates Lock file
• Receiver detects Lock file, then loads Data file
• A minimal sketch of this file-based protocol is shown below
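
The protocol above can be sketched in a few lines of Matlab. The sketch below is illustrative only, not the actual MatlabMPI source: the helper names (sketch_send, sketch_recv), the file-naming convention, and the polling interval are assumptions, and real MatlabMPI also encodes the source rank and communicator in its file names and handles message cleanup.

  % Illustrative file-I/O messaging sketch.  Each function is assumed to be
  % saved as its own .m file in a directory visible to both processes.

  function sketch_send(dest, tag, variable)
    % Write the variable first, then create the lock file to signal "ready".
    data_file = sprintf('message_to_%d_tag_%d.mat',  dest, tag);  % assumed naming
    lock_file = sprintf('message_to_%d_tag_%d.lock', dest, tag);
    save(data_file, 'variable');    % save the variable to the data file
    fclose(fopen(lock_file, 'w'));  % create an empty lock file
  end

  function variable = sketch_recv(my_rank, tag)
    % The receiver looks for a message addressed to its own rank.
    data_file = sprintf('message_to_%d_tag_%d.mat',  my_rank, tag);
    lock_file = sprintf('message_to_%d_tag_%d.lock', my_rank, tag);
    while ~exist(lock_file, 'file') % wait until the lock file appears
      pause(0.01);
    end
    buf = load(data_file);          % load the saved variable
    variable = buf.variable;
  end

Because the data file is written before the lock file is created, the receiver never loads a partially written message, which is the key property of the protocol described above.
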
Example: Basic Send and Receive

• Initialize
• Get processor ranks
• Execute send
• Execute receive
• Finalize
• Exit

  MPI_Init;                               % Initialize MPI.
  comm = MPI_COMM_WORLD;                  % Create communicator.
  comm_size = MPI_Comm_size(comm);        % Get size.
  my_rank = MPI_Comm_rank(comm);          % Get rank.
  source = 0;                             % Set source.
  dest = 1;                               % Set destination.
  tag = 1;                                % Set message tag.

  if (comm_size == 2)                     % Check size.
    if (my_rank == source)                % If source.
      data = 1:10;                        % Create data.
      MPI_Send(dest, tag, comm, data);    % Send data.
    end
    if (my_rank == dest)                  % If destination.
      data = MPI_Recv(source, tag, comm); % Receive data.
    end
  end

  MPI_Finalize;                           % Finalize Matlab MPI.
  exit;                                   % Exit Matlab.

• Uses standard message passing techniques
• Will run anywhere Matlab runs (a launch sketch follows below)
• Only requires a common file system
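
MPI_Run, listed on the "Core Lite" slide, launches a Matlab script on multiple processors. The call below is a sketch of how the example above might be launched on two processors; the script name basic_send_receive and the machine names are assumptions for illustration, and the exact argument conventions should be checked against the MatlabMPI release in use.

  % Assumed: the example above is saved as basic_send_receive.m on the
  % shared file system.  Machine names are illustrative.
  MPI_Run('basic_send_receive', 2, {'node1', 'node2'});
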
MatlabMPI Additional Functionality

• Important MPI convenience functions
  – MPI_Abort kills all jobs
  – MPI_Bcast broadcasts a message (exploits symbolic links to allow for true multi-cast)
  – MPI_Probe returns a list of all incoming messages (allows more dynamic message reading)
• MatlabMPI-specific functions
  – MatMPI_Delete_all cleans up all files after a run
  – MatMPI_Save_messages toggles deletion of messages (individual messages can be inspected for debugging)
  – MatMPI_Comm_settings lets the user set MatlabMPI internals (rsh or ssh, location of Matlab, unix or windows, …)
• Other
  – Processor-specific directories


Outline

• Introduction
• Approach
• Performance Results
  – Bandwidth
  – Parallel Speedup
• Future Work and Summary


MatlabMPI vs MPI bandwidth

(Plot: bandwidth in bytes/sec vs. message size from 1 KB to 32 MB for C MPI and MatlabMPI on an SGI Origin2000)

• Bandwidth matches native C MPI at large message sizes
• Primary difference is latency (35 milliseconds vs. 30 microseconds)


MatlabMPI bandwidth scalability

(Plot: bandwidth in bytes/sec vs. message size from 1 KB to 16 MB for 2 and 16 processors on a Linux cluster with Gigabit Ethernet)

• Bandwidth scales to multiple processors
• Cross mounting eliminates bottlenecks


Image Filtering Parallel Performance

(Plots: speedup vs. number of processors for a fixed problem size on an SGI Origin2000, and gigaflops vs. number of processors for a scaled problem size on an IBM SP2, each compared against linear speedup)

• Achieved "classic" super-linear speedup on the fixed problem
• Achieved speedup of ~300 on 304 processors on the scaled problem


Productivity vs. Performance

(Plot: peak performance vs. lines of code for the image filtering implementations — Matlab, VSIPL, VSIPL/OpenMP, VSIPL/MPI, PVL, MatlabMPI, and Parallel Matlab* — spanning single-processor, shared-memory, and distributed-memory approaches, and distinguishing current practice from current research)

• Programmed image filtering several ways
  – Matlab, VSIPL, VSIPL/OpenMP, VSIPL/MPI, PVL, MatlabMPI
• MatlabMPI provides
  – High productivity
  – High performance


Current MatlabMPI deployment

• Lincoln Signal processing (7.8 on 8 cpus, 9.4 on 8 duals)
• Lincoln Radar simulation (7.5 on 8 cpus, 11.5 on 8 duals)
• Lincoln Hyperspectral Imaging (~3 on 3 cpus)
• MIT LCS Beowulf (11 Gflops on 9 duals)
• MIT AI Lab Machine Vision
• OSU EM Simulations
• ARL SAR Image Enhancement
• Wash U Hearing Aid Simulations
• So. Ill. Benchmarking
• JHU Digital Beamforming
• ISL Radar simulation
• URI Heart modelling

• Rapidly growing MatlabMPI user base
• Web release may create hundreds of users


Outline

• Introduction
• Approach
• Performance Results
• Future Work and Summary


Future Work: Parallel Matlab Toolbox

(Diagram: the Parallel Matlab Toolbox provides the user interface to high performance Matlab applications — DoD sensor processing, DoD mission planning, scientific simulation, commercial applications — and the hardware interface to parallel computing hardware)

• Parallel Matlab need has been identified
  – HPCMO (OSU)
• Required user interface has been demonstrated
  – Matlab*P (MIT/LCS)
  – PVL (MIT/LL)
• Required hardware interface has been demonstrated
  – MatlabMPI (MIT/LL)
• Allows parallel programs with no additional lines of code


Future Work: Scalable Perception

• Data explosion
  – Advanced perception techniques must process vast (and rapidly growing) amounts of sensor data
• Component scaling
  – Research has given us high-performance algorithms and architectures for sensor data processing
  – But these systems are not modular, reflective, or radically reconfigurable to meet new goals in real time
• Scalable perception
  – Will require a framework that couples the computationally daunting problems of real-time multimodal perception to the infrastructure of modern high-performance computing, algorithms, and systems
• Such a framework must exploit:
  – High-level languages
  – Graphical / linear algebra duality
  – Scalable architectures and networks


Summary

• MatlabMPI has the basic functions necessary for parallel programming
  – Size, rank, send, receive, launch
  – Enables complex applications or libraries
• Performance can match native MPI at large message sizes
• Demonstrated scaling into hundreds of processors
• Demonstrated productivity
• Available on HPCMO systems
• Available on the Web


Acknowledgements

• Support
  – Charlie Holland DUSD(S&T) and John Grosh OSD
  – Bob Bond and Ken Senne (Lincoln)
• Collaborators
  – Stan Ahalt and John Nehrbass (Ohio St.)
  – Alan Edelman, John Gilbert and Ron Choy (MIT LCS)
• Lincoln Applications
  – Gil Raz, Ryan Haney and Dan Drake
  – Nick Pulsone and Andy Heckerling
  – David Stein
• Centers
  – Maui High Performance Computing Center
  – Boston University


http://www.ll.mit.edu/MatlabMPI