Parallel Programming Workshop
HPC 470
August 2015

LSA IT ARS / cja © 2015, 10 Aug 2015

Credits
  Contributors:
    - Dr Charles Antonelli (LSA IT)
    - Mark Champe (LSA IT)
    - Bennet Fauber (ARC)
    - Dr Alexander Gaenko (ARC)
    - Nancy Herlocher (LSA IT)
    - Seth Meyer (LSA IT)
    - Todd Raeker (ARC)
  Brought to you under the auspices of Advanced Research Computing, U-M Office of Research

Roadmap
  Morning sessions run 10 AM - 1 PM; afternoon sessions run 2 PM - 5 PM.

  Monday August 10
    Morning, Session 1: Introduction/roadmap (Antonelli)
      - git (Herlocher)
      - Data copy intro (Antonelli): scp/sftp, flux-xfer, Globus Connect
      - Intro to parallelism (Antonelli)
    Afternoon, Session 2: Parallel R (Fauber)
      - Basic R functions, list applicable functions, and converting list applicable functions to parallel execution.

  Tuesday August 11
    Morning, Session 3: Parallel Python (Champe)
    Afternoon, Session 4: Parallel MATLAB (Fauber)
      - Two examples will be shown, one involving processing many input files, the other a Monte Carlo simulation.

  Wednesday August 12
    Morning, Session 5: Parallel C & Fortran (Antonelli)
      - Profiling & debugging
      - MPI (message-passing)
    Afternoon, Session 6: Parallel C & Fortran (Antonelli)
      - OpenMP (multi-core)
      - OpenACC (accelerators)

  Thursday August 13
    Morning, Session 7: Accelerator Parallelism (Meyer)
      - CUDA
    Afternoon, Session 8: Accelerator Parallelism (Meyer)
      - CUDA (continued)
      - Intel Xeon Phi

  Friday August 14
    Morning, Session 9: Intro to Globus+ (Raeker)
      - Will go through examples of the sharing features in particular.
      - Intro to Cloud computing (Raeker): cloud-based compute sources such as AWS, Azure, Google Compute Cloud, ...
      - Lab (all)
    Afternoon, Session 10: Lab (all)
      - Lab report-outs (participants)

  Please see our course web page for more information and registration information.
  http://arc-ts.umich.edu/hpc470/

Copying data
  - From Linux or Mac OS X, use scp or sftp
    - Non-interactive (scp)
          scp localfile uniqname@flux-xfer.engin.umich.edu:remotefile
          scp -r localdir uniqname@flux-xfer.engin.umich.edu:remotedir
          scp uniqname@flux-login.engin.umich.edu:remotefile localfile
    - Use "." as destination to copy to your Flux home directory:
          scp localfile uniqname@flux-xfer.engin.umich.edu:.
    - ... or to your Flux scratch directory:
          scp localfile uniqname@flux-xfer.engin.umich.edu:/scratch/allocname/uniqname
    - Interactive (sftp)
          sftp uniqname@flux-xfer.engin.umich.edu
  - From Windows, use WinSCP
    - U-M Blue Disc: http://www.itcs.umich.edu/bluedisc/

Globus Online
  - Features
    - High-speed data transfer, much faster than SCP or SFTP
    - Reliable & persistent
    - Minimal client software: Mac OS X, Linux, Windows
  - GridFTP Endpoints
    - Gateways through which data flow
    - Exist for XSEDE, OSG, …
    - UMich: umich#flux, umich#nyx
    - Add your own client endpoint!
    - Add your own server endpoint: contact flux-support@umich.edu
  - More information
    http://arc.umich.edu/flux-and-other-hpc-resources/flux/usingflux/transferring-files-with-globus-gridftp/

Parallelism Review

Compute Node
  [Figure: a compute node, with labels P (processor), RAM, process, and local disk]

Fine-grained parallelism
  [Figure: a single node, with labels P, RAM, cores, and local disk]

Programming Models (1)
  - Fine-grained parallelism
    - The parallel application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
    - Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
    - "Symmetric multiprocessing (SMP)" or "Shared-memory parallelism" or "multi-threaded parallelism" or …
    - Implemented using compilers and software libraries
      - OpenMP (Open Multi-Processing)

Coarse-grained parallelism
  [Figure: coarse-grained parallelism]

Programming Models (2)
  - Coarse-grained parallelism
    - The parallel application consists of several processes running on different nodes and communicating with each other over the network
    - Used when the data are too large to fit on a single node, and simple synchronization is adequate
    - "Message-passing"
    - Implemented using software libraries
      - MPI (Message Passing Interface)

Good parallel
  - Embarrassingly parallel
    - Folding@home, RSA Challenges, Bitcoin mining, password cracking, …
    - http://en.wikipedia.org/wiki/List_of_distributed_computing_projects

Amdahl's Law
  - If you enhance a fraction f of a computation by a speedup S, the overall speedup is:
        speedup = 1 / ((1 - f) + f / S)

MPI

Crib sheet
  - Login to Stampede
        ssh trainNNN@stampede.tacc.utexas.edu
  - Get and compile the code
        cp -r ~cja/hpc470/jacobi .
        cd hpc470/jacobi
        make
  - Get on a compute node
        idev -m 30
  - Run the vanilla version
        cd hpc470/jacobi
        mpirun -bootstrap fork -np N ./oned
  - Excellent user guide
    https://portal.tacc.utexas.edu/user-guides/stampede

Debugging & profiling

Debugging with GDB
  - Command-line debugger
    - Start programs or attach to running programs
    - Display source program lines
    - Display and change variables or memory
    - Plant breakpoints, watchpoints
    - Examine stack frames
  - Excellent tutorial documentation
    http://www.gnu.org/s/gdb/documentation/

Compiling for GDB
  - Debugging is easier if you ask the compiler to generate extra source-level debugging information
    - Add the -g flag to your compilation:
          icc -g serialprogram.c -o serialprogram
      or
          mpicc -g mpiprogram.c -o mpiprogram
  - GDB will work without symbols
    - But you need to be fluent in machine instructions and hexadecimal
  - Be careful using -O with -g
    - Some compilers won't optimize code when debugging
    - Most will, but you sometimes won't recognize the resulting source code at optimization level -O2 and higher
    - Use -O0 -g to suppress optimization

Running GDB
  - Two ways to invoke GDB:
    - Debugging a serial program:
          gdb ./serialprogram
    - Debugging an MPI program:
          mpirun -np N xterm -e gdb ./mpiprogram
      This gives you N separate GDB sessions, each debugging one rank of the program
  - Remember to use the -X or -Y option to ssh when connecting to Flux, or you can't start xterms there

Useful GDB commands
    gdb exec                start gdb on executable exec
    gdb exec core           start gdb on executable exec with core file core
    l [m,n]                 list source
    disas                   disassemble function enclosing current instruction
    disas func              disassemble function func
    b func                  set breakpoint at entry to func
    b line#                 set breakpoint at source line#
    b *0xaddr               set breakpoint at address addr
    i b                     show breakpoints
    d bp#                   delete breakpoint bp#
    r [args]                run program with optional args
    bt                      show stack backtrace
    c                       continue execution from breakpoint
    step                    single-step one source line
    next                    single-step, don't step into function
    stepi                   single-step one instruction
    p var                   display contents of variable var
    p *var                  display value pointed to by var
    p &var                  display address of var
    p arr[idx]              display element idx of array arr
    x 0xaddr                display hex word at addr
    x *0xaddr               display hex word pointed to by addr
    x/20x 0xaddr            display 20 words in hex starting at addr
    i r                     display registers
    i r ebp                 display register ebp
    set var = expression    set variable var to expression
    q                       quit gdb

Debugging with DDT
  - Allinea's Distributed Debugging Tool is a comprehensive graphical debugger designed for the complex task of debugging parallel code
  - Advantages include
    - Provides a GUI for debugging
      - Capabilities similar to those of, e.g., Eclipse or Visual Studio
    - Supports parallel debugging of MPI programs
    - Scales much better than GDB

Running DDT
  - Compile with -g:
        mpicc -g mpiprogram.c -o mpiprogram
  - Load the DDT module:
        module load ddt
  - Start DDT:
        ddt mpiprogram
  - This starts a DDT session, debugging all ranks concurrently
  - Remember to use the -X or -Y option to ssh when connecting to Flux, or you can't start ddt there
  - http://arc-ts.umich.edu/software/
  - http://content.allinea.com/downloads/userguide.pdf

Application Profiling with MAP
  - Allinea's MAP Tool is a statistical application profiler designed for the complex task of profiling parallel code
  - Advantages include
    - Provides a GUI for profiling
      - Observe cumulative results, drill down for details
    - Supports parallel profiling of MPI programs
    - Handles most of the details under the covers

Running MAP
  - Compile with -g:
        mpicc -g mpiprogram.c -o mpiprogram
  - Load the MAP module:
        module load ddt
  - Start MAP:
        map mpiprogram
  - This starts a MAP session
    - Runs your program, gathers profile data, displays summary
      statistics
  - Remember to use the -X or -Y option to ssh when connecting to Flux, or you can't start map there
  - http://content.allinea.com/downloads/userguide.pdf

OpenMP

OpenACC

Crib sheet
  - Login to Flux
        ssh flux-login.arc-ts.umich.edu
  - Get on a GPU node
        qsub -I -V -X -l nodes=1:gpus=1 -q fluxg -l qos=flux -A hpc470_fluxg -l walltime=4:00:00
  - Get, compile, run the code
        cp -r ~cja/hpc470/saxpy .
        cd saxpy
        module load cuda
        module load pgi
        pgcc -ta=nvidia,cc11 -acc -Minfo=accel saxpy.c -o saxpy
        ./saxpy

Resources
  - http://arc-ts.umich.edu/flux/
    U-M Advanced Research Computing Flux pages
  - http://arc.research.umich.edu/software/
    Flux Software Catalog
  - http://arc-ts.umich.edu/flux/flux-faqs/
    Flux FAQs
  - http://www.youtube.com/user/UMCoECAC
    ARC-TS YouTube channel
  - For assistance: flux-support@umich.edu
    - Read by a team of people including unit support staff
    - Cannot help with programming questions, but can help with operational Flux and basic usage questions
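
Worked example: Amdahl's Law
  Returning to Amdahl's Law from the parallelism review, the sketch below works one case through by hand. The numbers (f = 0.9, S = 10) are hypothetical values chosen for illustration and do not come from the workshop material. The first equation is the standard form of the law, the second plugs in the numbers, and the third shows the ceiling that the unenhanced (serial) fraction places on the speedup no matter how large S becomes.

```latex
% Amdahl's Law: a fraction f of the computation is sped up by a factor S.
\[
  \mathrm{speedup}(f, S) = \frac{1}{(1 - f) + \dfrac{f}{S}}
\]

% Hypothetical case: 90% of the work is enhanced (f = 0.9) by a factor S = 10.
\[
  \mathrm{speedup}(0.9, 10) = \frac{1}{0.1 + 0.09} = \frac{1}{0.19} \approx 5.26
\]

% As S grows without bound, the remaining serial 10% caps the overall speedup.
\[
  \lim_{S \to \infty} \mathrm{speedup}(0.9, S) = \frac{1}{1 - 0.9} = 10
\]
```

  Under these assumed numbers, a tenfold enhancement of 90% of the computation yields only about a 5.3x overall speedup, and no enhancement of that same 90% can exceed 10x.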