Parallelization with the Matlab® Distributed Computing Server (MDCS) @ CBI cluster

Overview
• Parallelization with Matlab using the Parallel Computing Toolbox (PCT)
• Matlab Distributed Computing Server introduction
• Benefits of using the MDCS
• Hardware/Software/Utilization @ CBI
• MDCS usage scenarios
• Hands-on training

Parallelization with Matlab PCT
• The Matlab Parallel Computing Toolbox provides access to multi-core, multi-system (MDCS), and GPU parallelism.
• Many built-in Matlab functions directly support parallelism transparently (e.g. FFT).
• Parallel constructs such as going from for loops to parfor loops.
• Allows handling of many different types of parallel software development challenges.
• The MDCS allows scaling of locally developed parallel-enabled Matlab applications.

Parallelization with Matlab PCT
• Distributed/parallel algorithm characteristics
  – Memory usage & CPU usage
    • e.g. load a 4 gigabyte file into memory, then calculate averages
  – Communication/data I/O patterns
    • e.g. read file 1 (10 gigabytes), then run a function
    • e.g. worker B sends data to worker A; worker A runs a function and returns data to worker B
  – Dependencies
    • Function 1 → Function 2 → Function 3
  – Hardware resource contention (e.g. 16 cores each trying to read/write a set of files; bandwidth limitations on RAM)
  – Managing large numbers of small files leads to filesystem contention

Parallelization with Matlab PCT
• Applications have layers of parallelism: for an optimal solution, the application must be looked at as a whole.
• The Matlab PCT + MDCS framework automates much of the complexity of developing parallel & distributed applications.
• Scalability: use as many workers as possible in an efficient manner.
• Layers: clusters; CPUs/multi-cores; GPU cards/external accelerator cards.

Parallelization with Matlab PCT & MDCS
• CPUs, multi-cores
• Distributed loops: parfor
• Interactive development mode (matlabpool/pmode)
• Distributed arrays (spmd)
• MDCS cluster: scale out with the MDCS cluster in batch job submission mode

MDCS Benefits
• MDCS worker processes (a.k.a. "Labs")
  – The workers never request regular Matlab or toolbox licenses.
  – The only license an MDCS worker ever uses is an MDCS worker license (of which we have up to 64).
  – Toolboxes are unlocked to an MDCS worker based on the licenses owned by the client during the job submission process.
  – Wonderful parallel algorithm development environment with the superior visualization & profiling capabilities of the Matlab environment.
  – Many built-in functions are parallel enabled: fft, lu, svd, ...
  – Distributed arrays allow the development of data-parallel algorithms.
  – Enables the scaling of codes that cannot be compiled using the Matlab Compiler Toolbox.
  – Allows you to go from development on a laptop directly to running on up to 64 MDCS Labs. (Some simulations can go from years of runtime to days of runtime on 64 MDCS Labs.)

MDCS Structure

Hardware/Software/Utilization @ CBI
• MDCS worker processes run on 4 physical servers.
• Dell PowerEdge M910: four 16-core systems with 64 GB RAM each; 2x Intel Xeon 2.26 GHz per system, 8 cores per processor.
• Total of 64 cores, with 256 GB of total RAM distributed among the systems.
• A maximum of 64 MDCS worker licenses are available.
• Subsets of MDCS workers can be created based on project needs.

Usage scenarios
• Local system: interactive use (matlabpool / spmd / pmode / mpiprofile)
  – Use the local system (e.g. one of the workstations @ CBI) as part of initial algorithm development.
• MDCS: non-interactive use, job & task based
  – 2 main types: independent vs. communicating jobs
  – Both types can be used with either the local (on a non-cluster workstation) or the MDCS profile.

MDCS Workloads
• 2 main types of workloads can be implemented with the MDCS. A job is logically decomposed into a set of tasks; the job may have 1 or more tasks, and each task may or may not have additional parallelism within it.
• CASE 1: Independent
  – Within the job the parallelism is fully independent, so MDCS workers can be used to offload the independent work units. The code does not use parallel language features such as parfor or spmd. Note: in many cases a parfor loop can be transformed into a set of tasks.
  – createJob() + createTask(), createTask(), ... createTask() (sketched below)
• CASE 2: Communicating
  – Within a single job the parallelism is more complex, either requiring the workers to communicate or using parfor, spmd, or codistributed arrays (language features from the Parallel Computing Toolbox).
  – createCommunicatingJob() + createTask() (sketched below)
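The two cases above map onto a handful of PCT calls. Below is a minimal sketch of both job types, assuming the R2012a-era job interface; the 'local' profile is used here for illustration, and the site's MDCS cluster profile would be substituted when submitting to the cluster.

    % CASE 1: independent job, one task per work unit, no inter-task communication.
    c = parcluster('local');             % swap in the MDCS cluster profile on the cluster
    job = createJob(c);
    createTask(job, @sum, 1, {[1 2 3]});
    createTask(job, @sum, 1, {[4 5 6]});
    createTask(job, @sum, 1, {[7 8 9]});
    submit(job);
    wait(job);
    results = fetchOutputs(job);         % 3x1 cell array: {6; 15; 24}
    delete(job);

    % CASE 2: communicating job, a single task replicated across cooperating
    % workers that can use labindex/numlabs and lab-to-lab communication.
    cjob = createCommunicatingJob(c, 'Type', 'spmd');
    cjob.NumWorkersRange = [2 4];        % run on 2 to 4 MDCS workers ("labs")
    createTask(cjob, @() labindex, 1, {});
    submit(cjob);
    wait(cjob);
    perLab = fetchOutputs(cjob);         % outputs collected from the labs
    delete(cjob);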
MDCS Working Environment

Interactive Mode Sample (parfor)
• For well-mapped workloads, parfor can yield an exceptional performance improvement.
• From years to days, or days to hours, for certain workloads: the ideal cases are long-running jobs with little or no inter-job communication.
• Standard for loop vs. parfor enabled on the MDCS (a sketch of this transformation follows below)
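The original slide shows a standard for loop next to its parfor counterpart as code figures. A minimal sketch of that transformation follows; the kernel function and the pool size are illustrative placeholders.

    % Illustrative stand-in for real per-iteration work (hypothetical kernel).
    expensive_kernel = @(i) sum(svd(rand(200)));

    % Serial version: the iterations are independent of one another.
    N = 1e4;
    out = zeros(1, N);
    for i = 1:N
        out(i) = expensive_kernel(i);
    end

    % Parallel version: open a pool of workers (local or MDCS), then switch to parfor.
    matlabpool open 8                    % R2012a-era command; later releases use parpool
    out = zeros(1, N);
    parfor i = 1:N
        out(i) = expensive_kernel(i);    % iterations are dispatched to the pool workers
    end
    matlabpool close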
MDCS Scaling (Batch Mode)

Summary
• Applied examples of using the MDCS in batch mode are available as part of the hands-on section, or via a consulting appointment for more in-depth MDCS usage information (a minimal batch submission sketch also appears after Appendix B).
• We can allocate a subset of MDCS workers on a per-project basis.

Summary
• Wonderful parallel algorithm design & development environment
• Scale out codes to up to 64 Matlab MDCS workers
  – Both distributed compute & memory
• Standard Matlab + toolbox license usage is minimized
• Many options to approach the parallelization of computational workloads

Acknowledgements
• This project received computational, research & development, and software design/development support from the Computational System Biology Core/Computational Biology Initiative, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) from the National Institutes of Health. URL: http://www.cbi.utsa.edu

Contact Us
http://cbi.utsa.edu

Appendix A

Local Mode: Matlab Worker Process/Thread Structure
• Parallel Computing Toolbox constructs can be tested in local mode; the "lab" abstraction allows the actual process used for a lab to reside either locally or on a distributed server node.
• MPI is used for inter-process communication between "Labs" (Matlab worker processes).

Local Mode Scaling Sample (parfor)

Interactive Mode Sample (pmode/spmd)
• Each lab handles a piece of the data. Results are gathered on lab 1. The client session then requests the complete data set to be sent to it using lab2client. (A sketch follows below.)
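A minimal sketch of the pmode/spmd pattern described above, assuming a worker pool is already open (e.g. via matlabpool); the 8x8 magic square stands in for real data.

    spmd
        D = codistributed(magic(8));     % 8x8 array split across the labs
        myPart = getLocalPart(D);        % the columns owned by this lab
        colSums = sum(D, 1);             % many built-ins operate on codistributed arrays
        onLab1 = gather(colSums, 1);     % gather the full result onto lab 1 only
    end
    total = gather(colSums);             % or gather back to the client session
    % In an interactive pmode session, transferring a lab variable to the client
    % session is done with the lab2client command instead.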
Local vs. MDCS Mode Compare (parfor)

Appendix B: MDCS Access
• Access to the MDCS is provided via the Cheetah cluster.
• On Linux:
  – ssh -Y username@cheetah.cbi.utsa.edu
  – qlogin
  – matlab &
• On Windows (using PuTTY + Xming with X11 forwarding):
  – qlogin
  – matlab &
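Tying the batch-mode scaling slides to the access steps above: once logged in (or from a properly configured client), a script can be submitted to the MDCS non-interactively with batch. This is a minimal sketch; the profile name 'CBI_MDCS' and the script mySweep.m are hypothetical placeholders.

    c = parcluster('CBI_MDCS');               % hypothetical MDCS cluster profile name
    j = batch(c, 'mySweep', 'matlabpool', 8); % run mySweep.m with a pool of 8 MDCS labs
                                              % ('matlabpool' is the R2012a-era option name)
    wait(j);                                  % block until the job finishes
    diary(j);                                 % show the command-window output from the cluster
    load(j);                                  % bring the script's workspace into this session
    delete(j);                                % remove the job's data from the cluster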