Parallel Matlab: RTExpress™ on 64-bit SGI Altix with SCSL and MPT Data Collection – 2D FFT Tests What is RTExpress™ • Development and Runtime Environment allowing MATLAB scripts to be compiled and executed on real-time/parallel High Performance Computers (HPC) SGI® Altix™ 3000 64-Processor System Cornerturn shows excellent scaling up to Memory Bandwidth • FFT Benchmark tests – Computation and Communication • 1D FFT, transpose (cornerturn), 1D FFT in-place – User does not require detailed knowledge of parallel programming • Embedded parallel architectures such as Mercury 80.0 100.0 90.0 • Results of Shared Memory interconnect shows improved scaling on cornerturn – Supports: Corner Turn 1D FFT • Testing on SGI Altix Linux servers with Shared Memory Interconnect – Provides a flexible means to harness the power of a HPC using MATLAB 70.0 80.0 60.0 Improvement SGI® NUMAlink™ Interconnect Fabric • SUN Network of Workstations • High performance Linux PC Servers – Support for FPGA functions 60.0 1K 2K 50.0 4k 8k 40.0 • Library of FPGA functions directly callable from MATLAB source Improvement 70.0 50.0 1K 2K 40.0 4k 8k 30.0 30.0 • Now porting to SGI Altix systems 20.0 20.0 Intel / SGI Development Agreement Itanium 64-bit processing Shared Memory Architecture New RTExpress for SGI release expected by fall/winter of 2004 editor for editing a user's MATLAB® code. the user with the ability to assign the MATLAB® software tasks to physical hardware nodes. 20 30 40 50 60 70 0 10 20 30 40 50 60 70 Processors 1D FFT in-place Composite 2D FFT Time 70.0 120.0 60.0 100.0 50.0 80.0 1K 2K 60.0 4k 8k 1K 40.0 2K 4k 30.0 8k 40.0 20.0 20.0 • Please note that all timing information gathered is not intended to provide a recommendation for any particular hardware, but to illustrate parallel operation with various combinations of processors and interconnect systems mapit: Provides 10 Processors • RTExpress is used to run the MATLAB script on varying numbers of processors in data-parallel • Elapsed times are computed, averaging time over several iterations • First iteration is not counted editm: Enhanced text 0.0 0 matrix = ones(fftsize, fftsize) + j * ones(fftsize, fftsize) loop store time t1 a = fft(init_matrix) store time t2 a = a’ store time t3 a = fft(a) store time t4 end loop ® The RTExpress RTExpress™ environment contains four subsub-tools which assist a user in parallelizing MATLAB® code. 0.0 • A Matlab script performs the 2D complex FFT Parallelization with RTExpress™ is Flexible and Efficient splitm: Provides the user RTExpress™ with the ability to define how the MATLAB algorithm can Tool Windows be parallelized. 10.0 10.0 2D FFT Benchmark Test using RTExpress Improvement http://www.sgi.com/newsroom/press_releases/2004/june/altix_tcep.html – – – – Improvement – 10.0 0.0 0.0 0 10 20 30 40 50 60 70 0 10 20 30 Processors 40 50 60 70 Processors 64p 1.7GHz/9MB cache Altix (note: production systems are 1.6GHz) • Equipment used in the following tests may no longer be the hardware vendor’s current offering • Maximum RTExpress performance may be gained by fully using vector operations in MATLAB rather than using sequential loops bedit: Board editor graphical configuration tool to depict compute elements. • “Improvement,” as compared to first-processor performance used to scaling rather absolute timing Improves Cornerturn Comparison toexamine other Collected Datathan – Shared Memory FFT Timing (1024 x 1024 out of place) Comparison to other Collected Data – SGI Shared Memory Improves Cornerturn FFT Timing (1024 x 1024 out of place) 1024 x 1024 Cornerturn Timing 1024 x 1024 Cornerturn Timing 10 0.45 0.2 20 0.4 18 0.35 Scripts RTExpress / 1.2GHz Athlon 0.1 RTEpxress/ Dual Athlon 1.2GHz 0.08 Myrinet / IBM x300 0.06 Translator Compiler/ Linker editm RTExpress / 1.2GHz Athlon RTEpxress/ Dual Athlon 1.2GHz 0.2 Myrinet / IBM x300 0.15 Scali DolphinNet SGI Altix / Shared Mem 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 RTExpress / 1.2GHz Athlon 10 RTEpxress/ Dual Athlon 1.2GHz 8 Myrinet / IBM x300 6 Scali DolphinNet 02 03 04 05 06 07 Incorporate Real-Time Toolbox and Features 08 09 10 11 12 13 14 15 RTExpress / 1.2GHz Athlon 5 RTEpxress/ Dual Athlon 1.2GHz 4 Myrinet / IBM x300 3 Scali DolphinNet 1 0 0 02 03 04 05 06 16 07 08 09 10 11 12 13 14 15 SGI Altix / Shared Mem 16 01 02 03 04 05 06 NODES 07 08 10 11 12 13 14 15 16 2D FFT Timing (1024 x 1024) 25 0.7 09 NODES FFT Timing (1024 x 1024 in-place) 2D FFT Timing (1024 x 1024) 0.2 Race++ / 400 MHz PPC 6 2 SGI Altix / Shared Mem NODES FFT Timing (1024 x 1024 in-place) 7 2 01 01 NODES Source Code for Compilation, Linking and Execution 12 0 01 Race++ / 400 MHz PPC 14 4 0.05 0 12 0.18 0.6 RTExpress / 1.2GHz Athlon 0.1 RTEpxress/ Dual Athlon 1.2GHz 0.08 Myrinet / IBM x300 0.06 Scali DolphinNet Race++ / 400 MHz PPC 0.4 RTExpress / 1.2GHz Athlon RTEpxress/ Dual Athlon 1.2GHz 0.3 Myrinet / IBM x300 0.2 Scali DolphinNet SGI Altix / Shared Mem 0.04 Race++ / 400 MHz PPC RTExpress / 1.2GHz Athlon 15 RTEpxress/ Dual Athlon 1.2GHz Myrinet / IBM x300 10 Scali DolphinNet Improvement Race++ / 400 MHz PPC 0.12 Improvement 0.14 Race 200 MHz 603 PPC 0.5 time (sec) Execute on Host Target Platform Race 200 MHz 603 PPC 10 20 0.16 Automatically Generate Embedded Code time (sec) FEATURES • Automatic data redistribution with scaling • Near real-time data plotting and image viewing • Greater life cycle coverage due to MATLAB® integration • Legacy software integration • Automatic parallel C code generation • Supports MPI • Compiler selectable performance monitoring and program instrumentation • Support for parallelization alternatives splitm 0.25 0.1 SGI Altix / Shared Mem 0.02 RTExpress Resource Mapper Race++ / 400 MHz PPC Scali DolphinNet 0.04 RTExpress™ Development Environment 0.3 Improvement Race++ / 400 MHz PPC 0.12 time (sec) time (sec) Utilize Non-Real-Time MATLAB® Scripts MATLAB® Race 200 MHz 603 PPC Race 200 MHz 603 PPC 0.14 8 16 0.16 Improvement 0.18 Developing Parallel Applications with RTExpress™ 9 Race++ / 400 MHz PPC 8 RTExpress / 1.2GHz Athlon RTEpxress/ Dual Athlon 1.2GHz 6 Myrinet / IBM x300 4 Scali DolphinNet SGI Altix / Shared Mem 5 SGI Altix / Shared Mem 2 SGI Altix / Shared Mem 0.1 0.02 0 0 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 0 01 0 01 01 02 03 NODES 04 05 06 07 08 09 10 11 12 13 14 15 16 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 01 02 03 04 05 NODES Integrated Sensors, Inc. (315)798-1377 www.sensors.com 07 08 09 10 NODES NODES Altix350 1.4 Ghz/3MB L3 cache 06 Altix350 1.4 Ghz/3MB L3 cache 11 12 13 14 15 16