Parallel & GPU computing in MATLAB
ITS Research Computing
Lani Clough

Objectives
• Introductory-level MATLAB course for people who want to learn parallel and GPU computing in MATLAB.
• Help participants determine when to use parallel computing and how to use MATLAB parallel & GPU computing on their local computer & on the Research Computing clusters (Killdevil/Kure)

Logistics
• Course Format
• Overview of MATLAB topics with Lab Exercises
• UNC Research Computing
  – http://its.unc.edu/research

Agenda
• Parallel computing (1hr 10min)
  – What is it?
  – Why use it?
  – How to write MATLAB code in parallel (1hr)
• GPU computing (20 min)
  – What is it & why use it?
  – How to write MATLAB code for GPU computing (15 min)
• How to run MATLAB parallel & GPU code on the UNC cluster (20 min)
  – Quick introduction to the UNC cluster (Kure)
  – Bsub commands and what they mean
• Questions (10 min)

Parallel Computing

What is Parallel Computing?
• Generally, computer code is written in serial
  – 1 task is completed after another until the script is finished, with only 1 task running at any time
  – Conceptually, the computer has only 1 CPU
Source: https://computing.llnl.gov/tutorials/parallel_comp/

What is Parallel Computing? (cont.)
• Parallel Computing: Using multiple computer processing units (CPUs) to solve a problem at the same time
• The compute resources might be a computer with multiple processors or a set of networked computers
Source: https://computing.llnl.gov/tutorials/parallel_comp/

Why use Parallel Computing
• Save time & money (commodity components)
• Provide concurrency
• Solve larger problems
• Use non-local resources
  – UNC compute cluster
  – SETI: 2.9 million computers
  – Folding (Stanford): 450,000 cpus
Source: https://computing.llnl.gov/tutorials/parallel_comp/

How to write code in parallel
• The computational problem should be able to:
  – Be broken into discrete parts that can be solved simultaneously and independently
  – Be solved in less time with multiple compute resources than with a single compute resource.

Parallel Computing in MATLAB

Parallel Computing in MATLAB
• MATLAB Parallel Computing Toolbox (available for use at UNC)
  – Provides twelve workers (MATLAB computational engines) to execute applications on a multicore system.
  – Built-in functions for parallel computing
    • parfor loop (for running task-parallel algorithms on multiple processors)
    • spmd (handles large datasets and data-parallel algorithms)

MATLAB Distributed Computing Toolbox
• Allows MATLAB to run as many workers on a remote cluster of computers as licensing allows.
• OR run more than 12 workers on a local machine.
• UNC does not have a license for this toolbox; it is extremely expensive
• More information: http://www.mathworks.com/products/distriben/
• Course will not go over this toolbox

Primary Parallel Commands
• findResource
• matlabpool
  – open
  – close
  – size
• parfor (for loop)
• spmd (distributed computing for datasets)
• batch jobs (run job in background)

findResource
• Find available parallel computing resources
• out = findResource()

findResource Examples
• lsf_sched = findResource('scheduler','type','LSF')
  – Find the Platform LSF scheduler on the network.
• local_sched = findResource('scheduler','type','local')
  – Create a local scheduler that will start workers on the client machine for running your job.
• jm1 = findResource('scheduler','type','jobmanager','Name','ClusterQueue1');
  – Find a particular job manager by its name.

More Resources for findResource
• http://www.mathworks.com/help/toolbox/distcomp/findresource.html

Matlabpool
• matlabpool open
  – Begins a parallel work session
• Options for matlabpool open

Matlabpool open
• These three examples of matlabpool open each have the same result: each opens a local pool of 4 workers
  [the three example commands are not reproduced here]

Matlabpool
• matlabpool(x)
  – Request the number of workers you'd like, e.g. matlabpool(4)
• matlabpool('size')
  – Tells you the number of workers in the open matlabpool

Matlabpool
• Request too many workers, get an error
  Can only request 4 workers on this machine!

Matlabpool Close
• Use matlabpool close to end the parallel session
• Options
  – matlabpool close force
    • deletes all pool jobs for the current user in the cluster specified by the default profile (including running jobs)
  – matlabpool close force <profilename>
    • deletes all pool jobs run in the specified profile

Parallel for Loops (parfor)
• parfor loops can execute for-loop-like code in parallel to significantly improve performance
• The loop body must consist of discrete parts that can be solved simultaneously (i.e. it can't be serial)

Parfor example
• Will work in parallel, loop increments are not dependent on each other:
  matlabpool open local 2
  j=zeros(100,1); %pre-allocate vector
  parfor i=2:100 %parfor makes the loop run in parallel
      j(i,1)=5*i;
  end;
  matlabpool close

Serial Loop example
• Won't work in parallel, it's serial:
  j=zeros(100,1); %pre-allocate vector
  j(1)=5;
  for i=2:100
      j(i,1)=j(i-1)+5; %j(i-1) is needed to calculate j(i,1): serial!!!
  end;
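The difference between the two loops above can be checked directly. The following is a minimal sketch, not part of the original slides: the variable names are illustrative, and the observation that this particular recurrence has the closed form 5*i is added here as an assumption for the sake of the example. parfor returns the same values whether or not a pool is open.

```matlab
n = 100;

% Independent iterations: each jPar(i) depends only on i,
% so the iterations may run in any order on any worker.
jPar = zeros(n,1);          % pre-allocate the sliced output
parfor i = 1:n
    jPar(i) = 5*i;
end

% Serial recurrence: iteration i needs the result of iteration i-1,
% so it has to stay an ordinary for loop.
jSer = zeros(n,1);
jSer(1) = 5;
for i = 2:n
    jSer(i) = jSer(i-1) + 5;
end

% Both loops fill the vector with the same values.
sameResult = isequal(jPar, jSer);
```

When a recurrence can be rewritten so that each iteration depends only on the loop index, as here, the rewritten loop becomes a candidate for parfor; when it cannot, the loop stays serial.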
Parallel for Loops (parfor)
• Cannot nest parfor loops within parfor loops
  parfor i=1:10
      parfor j=1:10
          x(i,j)=1;
      end;
  end;

Parallel for Loops (parfor)
• If a function with multiple outputs is called within a parfor loop and its outputs are assigned directly into indexed parts of a variable, MATLAB will have difficulty figuring out how to run the parfor loop, e.g.
  parfor i=1:10
      [x{i}(:,1), x{i}(:,2)]=functionName(z,w)
  end

Parallel for Loops (parfor)
• Use this instead
  parfor i=1:10
      [x1, x2]=functionName(z,w);
      x{i}=[x1 x2];
  end

Parallel for Loops (parfor)
For parallel computing to be worth your time, the task must be solved in less time with multiple compute resources than with a single compute resource.

Test the efficiency of your parallel code
• Use MATLAB's tic & toc functions
  – tic starts a timer
  – toc tells you the number of seconds since the tic function was called

Tic & Toc Simple Example
  tic;
  parfor i=1:10
      z(i)=10;
  end;
  toc

Check efficiency of simple parfor loop
  clear; clc;
  %testParfor opens and closes its own pool of 4 workers
  k=(zeros(10,3)); m=1; i=1;
  while i<1e8
      [time1 time2]=testParfor(i);
      k(m,:)= [i time1 time2];
      m=m+1;
      i=i*10;
  end;

Check efficiency of simple parfor loop
  function [t1e, t2e]=testParfor(x)
  A=ones(x,1).*4;
  B=zeros(x,1);
  t1s=tic;
  matlabpool(4)
  parfor i = 1:length(A)
      B(i) = sqrt(A(i));
  end
  t1e=toc(t1s);
  matlabpool close
  B=zeros(x,1);
  t2s=tic;
  for i = 1:length(A)
      B(i) = sqrt(A(i));
  end
  t2e=toc(t2s);

Result of Check Efficiency of parfor
• The for loop is much more efficient than the parfor loop: more resources do not necessarily equate to a faster run time!!

Parfor Efficiency
• The previous example is not an effective use of a parfor loop because it takes more time to evaluate than a for loop.
  – Data transfer is the issue
  – Parfor is more effective with long-running calculations within the loop
  – Generally, more iterations increase the efficiency of a parfor loop

Lab Exercise with parfor
• Lab exercise:
  – Turn a non-parallel function into a function that can run in parallel
  – Go through each section of the function and determine if it can be written in parallel and if so, how?
(%% denotes a new section)

Lab Exercise
  function N=calcNeighNp(neighPoly,manzPoly,manzPop93, manzPop05, manzID)
  matlabpool(x) %start matlabpool
  %Find the manzanas which don't have an associated population in 93, but a population in 05
  %parfor can't be used here because it's serial
  j=1;
  for i=1:length(manzPop93)
      if manzPop93(i,1)==0 && manzPop05(i,1)>0
          no93manzPopID(j,1)=manzID(i,1);
          j=j+1;
      end;
  end;
  %%

Lab Exercise
  %%
  %parfor can't be used here because it's serial
  %Calculate the average monthly population change (excluding the data pts with no pop in 1993);
  MonthsC=(2005-1993);
  j=1; count=0; TotalPopC=0;
  for i=1:length(manzPop93)
      if manzID(i,1)~=no93manzPopID(j,1)
          TotalPopC=TotalPopC+((manzPop05(i,1)-manzPop93(i,1))/MonthsC);
          count=count+1;
      else
          j=j+1;
      end;
  end;
  %%

Lab Exercise
  %%
  meanPopChangeM=TotalPopC/count;
  PopChangeMmanz=zeros(length(manzPop05),1);
  %Calculate the monthly population change for all the manzanas
  parfor i=1:length(manzPop05) %serial version: for i=1:length(manzPop05)
      for j=1:length(no93manzPopID)
          if manzID(i,1)==no93manzPopID(j,1)
              PopChangeMmanz(i,1)=meanPopChangeM;
              %break; %% break must be deleted, not permitted in parfor
          else
              PopChangeMmanz(i,1)=(manzPop05(i,1)-manzPop93(i,1))/MonthsC;
          end;
      end;
  end;
  %%

Lab Exercise
  %%
  %Now calculate the midpoint population
  midPop=manzPop93+(PopChangeMmanz*9.5);
  %turn the neighs and manz clockwise to calc pop
  %serial version:
  %  for i=1:length(neighPoly)
  %      [neighClock{i}(:,1) neighClock{i}(:,2)]=poly2cw(neighPoly{i}(:,1),neighPoly{i}(:,2));
  %  end;
  parfor i=1:length(neighPoly)
      [temp1, temp2] = poly2cw(neighPoly{i}(:,1),neighPoly{i}(:,2));
      neighClock{i}=[temp1 temp2];
  end;
  %serial version:
  %  for i=1:length(manzPoly)
  %      [manzClock{i}(:,1) manzClock{i}(:,2)]=poly2cw(manzPoly{i}(:,1),manzPoly{i}(:,2));
  %  end;
  parfor i=1:length(manzPoly)
      [temp1, temp2] = poly2cw(manzPoly{i}(:,1),manzPoly{i}(:,2));
      manzClock{i}=[temp1 temp2];
  end;
  %%

Lab Exercise
  %%
  %calculate the areas of the manzanas;
  polyAreaR=zeros(length(manzClock),1);
  parfor i=1:length(manzClock) %serial version: for i=1:length(manzClock)
      polyAreaR(i,1)=calcArea(manzClock{i}(:,1),manzClock{i}(:,2));
  end;
  %%

Lab Exercise
  %calculate the population for each of the neighs as function of the manzanas & sum calculated pop
  N=zeros(length(neighClock),1); %pre-allocate the vector;
  parfor i=1:length(neighClock) %serial version: for i=1:length(neighClock)
      m=0;
      Ntemp=zeros(length(manzClock),1);
      for j=1:length(manzClock)
          [tempx tempy]=polybool('intersection',neighClock{i}(:,1),neighClock{i}(:,2),manzClock{j}(:,1),manzClock{j}(:,2));
          if isempty(tempx)==0;
              m=m+1;
              Ntemp(m,1)=(calcArea(tempx,tempy)/polyAreaR(j,1))*midPop(j);
          end;
      end;
      N(i,1)=(sum(Ntemp));
  end;

More parfor resources
• Blog entry on parfor by Loren Shure (MATLAB engineer)
  – http://blogs.mathworks.com/loren/2009/10/02/using-parfor-loops-getting-up-and-running/
• Advanced parfor topics (MATLAB online help)
  – http://www.mathworks.com/help/toolbox/distcomp/brdqtjj-1.html#bq_of7_-1

Functions to support parfor performance
• All functions are included in the online Parallel MATLAB program files
• User-created codes
  – Parfor progress monitor
    http://www.mathworks.com/matlabcentral/fileexchange/24594-parfor-progress-monitor
  – partictoc: use this function to examine the efficiency of your parallel code
    Download at: http://www.mathworks.com/matlabcentral/fileexchange/27472-partictoc
• Parallel Profiler (built-in function)
  – http://www.mathworks.com/help/toolbox/distcomp/bra51qt-1.html#brcrm_t

Spmd
• Used to partition large data sets
• Excellent when you want to work with an array too large for your computer's memory

Spmd
• Spmd distributes the array among MATLAB workers (each worker contains a part of the array)
• However, you can still operate on the entire array as 1 entity
• Workers automatically transfer data between themselves when necessary, e.g. for matrix multiplication.

Spmd Format
• Format
  matlabpool(4)
  spmd
      statements
  end
• Simple Example
  matlabpool(4)
  spmd
      j=zeros(1e7,1);
  end;

Spmd Examples
• Result: j is a Composite with 4 parts!

MATLAB Composites
• A Composite is an object used for data distribution in MATLAB
• A Composite object has one entry for each worker
  – matlabpool(12) creates a 12x1 Composite
  – matlabpool(6) creates a 6x1 Composite

MATLAB Composites
• You can create a Composite in two ways:
  – spmd
  – c = Composite();
    • This creates a Composite that does not contain any data, just placeholders for data
    • Also, one element per matlabpool worker is created for the Composite
    • Use spmd or indexing to populate a Composite created this way

MATLAB Composites
• Example
  c = Composite(); % One element per lab in the pool
  for ii = 1:length(c)
      % Set the entry for each lab to zero
      c{ii} = 0; % Value stored on each lab
  end

Composite indexing
• Using the j Composite from the previous slide

Composite indexing
• Assign the values of a Composite to a matrix
• All Composites are turned into MATLAB cell arrays

Another spmd Example: creating graphs
  %Perform a simple calculation in parallel, and plot the results:
  matlabpool(4)
  spmd
      % build magic squares in parallel
      q = magic(labindex + 2); %labindex: index of the lab/worker (e.g. 1)
  end
  for ii=1:length(q)
      % plot each magic square
      figure, imagesc(q{ii}); %plot a matrix as an image
  end
  matlabpool close

Another spmd Example: creating graphs
• Results

MATLAB help documents on spmd
• Extensive documentation online for using spmd and Composites
  – http://www.mathworks.com/help/toolbox/distcomp/brukbno-1.html
  – Spmd-specific documentation:
    • http://www.mathworks.com/help/toolbox/distcomp/spmd.html

Run jobs in batch
• Run independent parallel jobs on a worker, not on a compute cluster
  – Batch in cluster language ≠ batch in MATLAB language

Run jobs in batch
  %Construct a parallel job object using the default configuration.
  pjob = createParallelJob();
  %Add the task to the job.
  createTask(pjob, 'rand', 1, {4});
  %Set the number of workers required for parallel execution.
  set(pjob,'MinimumNumberOfWorkers',4);
  set(pjob,'MaximumNumberOfWorkers',4);
  %Run the job.
  submit(pjob);
  %Wait for the job to finish running, and retrieve the job results.
  waitForState(pjob);
  out = getAllOutputArguments(pjob);
  %Display the random matrices.
  celldisp(out);
  %Destroy the job.
  destroy(pjob);

Running jobs in batch
• Results from previous batch job

Run jobs in batch
• More information at: http://www.mathworks.com/help/toolbox/distcomp/f1-6010.html#f1-7659

GPU Computing

What is GPU computing?
• GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate performance
• Offloads compute-intensive portions of an application to the GPU, while the remainder of the code runs on the CPU

What is GPU computing?
• CPUs consist of a few cores optimized for serial processing
• GPUs consist of thousands of smaller cores designed for parallel performance (i.e. more memory bandwidth and cores)
Source: http://www.nvidia.com/object/what-is-gpu-computing.html

What/Why GPU computing?
• Serial portions of the code run on the CPU while parallel portions run on the GPU
• From a user's perspective, applications in general run significantly faster

Write GPU computing codes in MATLAB
• Transfer data between the MATLAB workspace & the GPU
  – Accomplished by a GPUArray
    • Data stored on the GPU.
  – Use the gpuArray function to transfer an array from the MATLAB workspace to the GPU

Write GPU computing codes in MATLAB
• Examples
  N = 6;
  M = magic(N);
  G = gpuArray(M); %create an array stored on GPU
• G is a MATLAB GPUArray object representing magic square data on the GPU.
  X = rand(1000);
  G = gpuArray(X); %array stored on GPU

Write GPU computing codes in MATLAB
• gpuArray requires nonsparse data types: 'single', 'double', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', or 'logical'.

Static GPUArrays
• Static GPUArrays allow users to directly construct arrays on GPUs, without transfers
• Include:

Static Array Examples
• Construct an Identity Matrix on the GPU
  II = parallel.gpu.GPUArray.eye(1024,'int32');
  size(II)
      1024 1024
• Construct a Multidimensional Array on the GPU
  G = parallel.gpu.GPUArray.ones(100, 100, 50);
  size(G)
      100 100 50
  classUnderlying(G)
      double %double is default, so you don't need to specify it

More Resources for GPU Arrays
• For a complete list of available static methods in any release, type
  methods('parallel.gpu.GPUArray')
• For help on any one of the constructors, type
  help parallel.gpu.GPUArray/functionname
• For example, to see the help on the colon constructor, type
  help parallel.gpu.GPUArray/colon

Retrieve Data from the GPU
• Use the gather function
  – Makes data stored on the GPU available in the MATLAB workspace (CPU)
• Use isequal to verify that you get the correct data back:

Retrieve Data from the GPU
• Example
  G = gpuArray(ones(100, 'uint32')); %array stored only on GPU
  D = gather(G); %bring D to CPU/MATLAB workspace
  OK = isequal(D, ones(100, 'uint32')) %check to see if the array on the GPU is the same as the array brought to the CPU

GPUArray Characteristics
• You can also examine the underlying characteristics of a GPUArray using the following built-in functions:

GPUArray Characteristics
• Example
  – To examine the size of the GPUArray object G, type:
  G = gpuArray(rand(100));
  s = size(G)
      100 100

Calling Functions with GPU Objects
• The example uses the fft and real functions and the arithmetic operators + and *.
• Calculations are performed on the GPU; gather retrieves data from the GPU to the workspace.
  Ga = gpuArray(rand(1000, 'single')); %array on GPU & next operations performed on GPU
  Gfft = fft(Ga);
  Gb = (real(Gfft) + Ga) * 6;
  G = gather(Gb); %brings G to the CPU

Calling Functions with GPU Objects
• The whos command is instructive for showing where each variable's data is stored.
  whos
    Name    Size         Bytes      Class
    G       1000x1000    4000000    single
    Ga      1000x1000    108        parallel.gpu.GPUArray
    Gb      1000x1000    108        parallel.gpu.GPUArray
    Gfft    1000x1000    108        parallel.gpu.GPUArray
• All arrays are stored on the GPU (GPUArray), except G, because it was "gathered"

Running Functions on GPU
• Call arrayfun with a function handle to the MATLAB function as the first input argument:
  result = arrayfun(@myFunction, arg1, arg2);
• Subsequent arguments provide inputs to the MATLAB function.
• Input arguments can be workspace data or GPUArray.
  – If any input argument is a GPUArray, arrayfun runs on the GPU and returns a GPUArray.
  – Otherwise arrayfun executes on the CPU

Running Functions on GPU
Example: function applies correction to an array
  function c = myCal(rawdata, gain, offst)
  c = (rawdata .* gain) + offst;
• The function performs only element-wise operations when applying a gain factor and offset to each element of the rawdata array.

Running Functions on GPU
• Create some nominal measurement:
  meas = ones(1000)*3; % 1000-by-1000 matrix
• The function allows the gain and offset to be arrays of the same size as rawdata, so unique corrections can be applied to individual measurements.
• Typically keep the correction data on the GPU so you do not have to transfer it for each application:

Running Functions on GPU
  % Runs on the GPU because the input arguments gn and offs are in GPU memory;
  gn = gpuArray(rand(1000))/100 + 0.995;
  offs = gpuArray(rand(1000))/50 - 0.01;
  corrected = arrayfun(@myCal, meas, gn, offs);
  % Retrieve the corrected results from the GPU to the MATLAB workspace;
  results = gather(corrected);

Identify & Select GPU
• If you have only one GPU in your computer, that GPU is the default.
• If you have more than one GPU card in your computer, you can use the following functions to identify and select which card you want to use:

Identify & Select GPU
• This example shows how to identify and select a GPU for your computations
  – First, determine the number of GPU devices on your computer using gpuDeviceCount

Identify & Select GPU
• In this case, you have 2 devices, thus the first is the default.
  – To examine its properties, type gpuDevice

Identify & Select GPU
• If the previous GPU is the device you want to use, then you can just proceed with the default
• To use another device, call gpuDevice with the index of the card and view its properties to verify you want to use it. Here is an example where the second device is chosen

More Resources for GPU computing
• MATLAB's extensive online help documents for GPU computing
  – http://www.mathworks.com/help/toolbox/distcomp/bsic3by.html

Parallel & GPU Computing on the cluster

Cluster Jargon
• Node
  – A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores. Nodes are networked together to comprise a cluster.
• Processor / Core
  – Individual CPUs are subdivided into multiple "cores", each being a unique execution unit (processor).
• The result is a node with multiple CPUs, each containing multiple cores.

Using MATLAB on the computer Cluster
• What??
  – UNC provides researchers and graduate students with access to extremely powerful computers to use for their research.
  – Kure is a Linux-based computing system with >1,800 cores
  – Killdevil is a Linux-based computing system with >6,000 cores
• Why??
  – The cluster is an extremely fast and efficient way to run LARGE MATLAB programs (fewer "Out of Memory" errors!)
  – You can get more done! Your programs run on the cluster, which frees your computer for writing and debugging other programs!!!

Using MATLAB on the computer Cluster
• Where and When??
  – The cluster is available 24/7 and you can run programs remotely from anywhere with an internet connection!

Using MATLAB on the computer Cluster
• Overview of how to use the computer cluster
  – It would be helpful to take the following courses:
    • Getting Started on Kure & Killdevil
    • Introduction to Linux
  – For presentations & help documents, visit:
    • Course presentations: http://its2.unc.edu/divisions/rc/training/scientific/
    • Help documents: http://its.unc.edu/research/its-researchcomputing/computing-resources/

Using MATLAB on the computer Cluster
• Run your job on the cluster (1 job, not parallel)
  1. Log in to the SSH file transfer client
  2. Transfer the files you want to work with
  3. Log into the SSH client
  4. Change your working directory to the folder you want to work in, e.g. cd /netscr/myoynen/
  5. Type ls to make sure your program is located in the correct folder
  6. Type bmatlab <yourProgram.m>
• Optional: to see your program running, type bhist or bjobs

Parallel MATLAB on Cluster
• Have access to:
  – 8 workers on Kure
  – 12 workers for each job on Killdevil

Bsub commands for parallel & GPU
• Start a cluster job with this command, which gives you 1 job that is NOT parallel OR GPU
  – bsub /nas02/apps/matlab-2011a/matlab -nodisplay -nosplash -singleCompThread -r <filename>
    o "filename" is the name of your MATLAB script with the .m extension left off
    o -singleCompThread
      o ALWAYS use this option unless you are requesting an entire node for a serial (i.e. not using the Parallel Computing Toolbox) MATLAB job or using GPUs!!!!!!
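A minimal sketch of a script that the serial bsub command above could run with -r. The script name serialJob.m, the workload, and the output file name are hypothetical and not from the course materials. maxNumCompThreads is used only to log how many computational threads MATLAB is allowed to use, which is expected to be 1 when -singleCompThread is passed.

```matlab
% serialJob.m (hypothetical name); submitted as: ... -singleCompThread -r serialJob
nThreads = maxNumCompThreads;   % number of computational threads MATLAB may use
fprintf('Computational threads: %d\n', nThreads);

A = rand(200);                  % stand-in for the real workload
s = svd(A);                     % singular values of A
largest = max(s);

% Write the result to disk so it is available after the job ends
% (hypothetical file name).
save('serialJobResult.mat', 'largest');
```

MATLAB does not necessarily quit when a script started with -r finishes, so scripts meant only for batch use often end with exit. It is left out here so the snippet can also be pasted into an interactive session.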
Bsub commands for parallel & GPU
• Log file options (a log file is always created for jobs)
  • The log file is sent to your email by default; it is possible to output it to a file located in your job's current working directory instead.
• ALWAYS PUT additional BSUB OPTIONS AFTER bsub & BEFORE the executable name!!!!!!!
• See examples on next slides!!

Bsub commands for parallel & GPU
• Add these additional log file options:
  o -o logfile.%J
    o Does not send your MATLAB log file to your email; it instead puts this information in a file called logfile.%J, where %J is the job's ID number.
    o Use this when your MATLAB output (all the resulting unsuppressed output from your job) is too large to send over email.

Bsub commands for parallel & GPU
• Add these additional options:
  o -x
    o Requests the use of an entire node
  o -M
    o Requests more than 4GB of memory for your job
  o -n
    o Requests the number of workers you'd like for your job

Bsub commands for parallel & GPU
• ALL MATLAB jobs must run on 1 host!
• LSF option to use with parallel MATLAB jobs:
  o -R "span[hosts=1]"
    o Sends your job to one host.

More information on using the cluster
• Intermediate & Introductory MATLAB course PPTs have step-by-step instructions to get started using the cluster & using a basic MATLAB command to run a simple job
  – http://its2.unc.edu/divisions/rc/training/scientific/
• UNC cluster help files (LSF and the file sharing system; LSF tells the cluster how to run MATLAB jobs, and all of the bsub commands above are LSF commands)
  – https://help.unc.edu/6273
  – Must use onyen to log in!

Bsub Parallel MATLAB Exercise
• Run a parallel MATLAB job with 96GB of memory and 12 workers on 1 host
  – This can only run on KillDevil!!!
  – bsub -n12 -M96 -R "span[hosts=1]" /nas02/apps/matlab-2011a/matlab -nodisplay -nosplash -singleCompThread -r <filename>
• Run a parallel MATLAB job on 2 hosts, with 8 workers
  – Can't DO this!! All parallel MATLAB jobs must run on 1 host!!!
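For the 12-worker submission above, the script named by -r opens a pool of matching size itself. The following is a minimal sketch under stated assumptions: the script name parallelJob.m, the workload, and the output file name are hypothetical. matlabpool is the pool command used throughout this course (MATLAB 2011a); later MATLAB releases replace it with parpool.

```matlab
% parallelJob.m (hypothetical name); pairs with: bsub -n12 ... -r parallelJob
nWorkers = 12;                 % assumption: same value as the bsub -n option
matlabpool(nWorkers)           % open the pool, as on the matlabpool slides

nTasks = 48;
result = zeros(nTasks, 1);     % pre-allocate the sliced output
parfor k = 1:nTasks
    % stand-in for a long-running calculation that is independent of
    % every other iteration
    result(k) = max(svd(rand(100)));
end

matlabpool close               % release the workers before the job ends

% Write the result to disk so it is available after the job ends
% (hypothetical file name).
save('parallelJobResult.mat', 'result');
```

Keeping the worker count in one variable makes it easier to keep the script and the bsub -n value in step when the job is resubmitted with a different size.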
Bsub Parallel MATLAB Exercise
• Run a parallel MATLAB job on 1 host, with 8 workers
  – Use either Kure or KillDevil
    • bsub -n8 -R "span[hosts=1]" /nas02/apps/matlab-2011a/matlab -nodisplay -nosplash -singleCompThread -r <filename>

Bsub Parallel MATLAB Exercise
• Run a parallel MATLAB job with 6 workers, give the log file a specified name, don't send output to email, and don't include .m on the filename
  – Use either Kure or KillDevil
    • bsub -o out.%J -n6 -R "span[hosts=1]" /nas02/apps/matlab-2011a/matlab -nodisplay -nosplash -singleCompThread -r <filename> -logfile <logName>

Bsub GPU MATLAB commands
• Can only use KillDevil to run GPU jobs
• bsub script is straightforward
• Only request 1 CPU because you are only using 1 CPU and the multiple GPU processors
  – Use -q gpu -a gpuexcl_t
  – E.g.
    • bsub -q gpu -a gpuexcl_t /nas02/apps/matlab-2011a/matlab -nodisplay -nosplash -r <filename>

Bsub GPU MATLAB commands
• Will not use the following options
  o -x
  o -M
  o -n
  o -singleCompThread
• Can use all other bsub commands introduced
• More information:
  – https://help.unc.edu/CCM3_034792

Cluster Command Reminders!
• Make sure your MATLAB code includes the following commands:
  – matlabpool(x) before the parallel section
  – matlabpool close after it

Questions?

Questions and Comments?
• For assistance with MATLAB, please contact the Research Computing Group:
  Email: research@unc.edu
  Phone: 919-962-HELP
  Submit help ticket at http://help.unc.edu