MPI support in gLite
Enol Fernández, CSIC

MPI on the Grid
• Submission/Allocation (handled by CREAM/WMS)
  – Definition of job characteristics
  – Search for and selection of adequate resources
  – Allocation (or co-allocation) of resources for the job
• Execution (handled by MPI-Start)
  – File distribution
  – Batch system interaction
  – MPI implementation details

Allocation / Submission
• The process count is specified with the CPUNumber attribute:

  Type = "Job";
  CPUNumber = 23;
  Executable = "my_app";
  Arguments = "-n 356 -p 4";
  StdOutput = "std.out";
  StdError = "std.err";
  InputSandbox = {"my_app"};
  OutputSandbox = {"std.out", "std.err"};
  Requirements = Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment);

MPI-Start
• Provide a unique interface to the upper layer for running an MPI job
• Allow support of new MPI implementations without modifications to the Grid middleware
• Support "simple" file distribution
• Provide some support to help the user manage data
• Layered view: Grid Middleware → MPI-Start → MPI → Resources

MPI-Start Design Goals
• Portable
  – The program must be able to run under any supported operating system
• Modular and extensible architecture
  – Plugin/component architecture
• Relocatable
  – Must be independent of absolute paths, to adapt to different site configurations
  – Remote "injection" of mpi-start along with the job
• "Remote" debugging features

MPI-Start Architecture
[Diagram: a CORE component with three plugin families: Scheduler plugins (PBS/Torque, SGE, LSF), Execution plugins (Open MPI, MPICH2, LAM, PACX), and Hooks (file distribution, compiler, user, local)]

Using MPI-Start (I)
• JDL requesting 4 slots and the MPI-START and OPENMPI runtime tags:

  JobType = "Normal";
  CpuNumber = 4;
  Executable = "starter.sh";
  InputSandbox = {"starter.sh"};
  StdOutput = "std.out";
  StdError = "std.err";
  OutputSandbox = {"std.out", "std.err"};
  Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
              && Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment);

• starter.sh:

  $ cat starter.sh
  #!/bin/sh
  # This is a script to call mpi-start

  # Set environment variables needed
  export I2G_MPI_APPLICATION=/bin/hostname
  export I2G_MPI_APPLICATION_ARGS=
  export I2G_MPI_TYPE=openmpi
  export I2G_MPI_PRECOMMAND=time

  # Execute mpi-start
  $I2G_MPI_START

• Sample output:

  stdout:
  Scientific Linux CERN SLC release 4.5 (Beryllium)
  Scientific Linux CERN SLC release 4.5 (Beryllium)
  lflip30.lip.pt
  lflip31.lip.pt

  stderr:
  real  0m0.731s
  user  0m0.021s
  sys   0m0.013s

Using MPI-Start (II)
• A generic wrapper script receives the application name and MPI flavour as arguments:

  ...
  CpuNumber    = 4;
  Executable   = "mpi-start-wrapper.sh";
  Arguments    = "userapp OPENMPI some app args...";
  InputSandbox = {"mpi-start-wrapper.sh"};
  Environment  = {"I2G_MPI_START_VERBOSE=1", ...};
  ...

• mpi-start-wrapper.sh:

  #!/bin/bash
  MY_EXECUTABLE=$1
  shift
  MPI_FLAVOR=$1
  shift
  export I2G_MPI_APPLICATION_ARGS=$*

  # Convert flavor to lowercase for passing to mpi-start.
  MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`

  # Pull out the correct paths for the requested flavor.
  eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`

  # Ensure the prefix is correctly set. Don't rely on the defaults.
  eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
  export I2G_${MPI_FLAVOR}_PREFIX

  # Setup for mpi-start.
  export I2G_MPI_APPLICATION=$MY_EXECUTABLE
  export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER

  # Invoke mpi-start.
  $I2G_MPI_START
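Using MPI-Start: submission sketch (illustrative)
• As an illustration only: one way the wrapper-based job above might be submitted with the gLite WMS user interface. The commands are the standard WMS UI tools, but the exact options and the file names used here (wrapper.jdl, jobids.txt, ./job-output) are assumptions and may differ per installation.

  # Submit the JDL shown above; -a delegates the proxy automatically,
  # -o stores the returned job identifier in a file.
  glite-wms-job-submit -a -o jobids.txt wrapper.jdl

  # Check the job status until it reaches Done.
  glite-wms-job-status -i jobids.txt

  # Retrieve the output sandbox (std.out / std.err).
  glite-wms-job-output -i jobids.txt --dir ./job-output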
MPI-Start Hooks (I)
• File distribution methods
  – Copy the files needed for execution using the most appropriate method (shared filesystem, scp, mpiexec, …)
• Compiler flag checking
  – Checks the correctness of compiler flags for 32/64 bits and changes them accordingly
• User hooks:
  – build applications
  – data staging (a data-staging hook is sketched at the end of this document)

MPI-Start Hooks (II)
• Example pre-run hook that compiles the application:

  #!/bin/sh
  pre_run_hook () {
      # Compile the program.
      echo "Compiling ${I2G_MPI_APPLICATION}"

      # Actually compile the program.
      cmd="mpicc ${MPI_MPICC_OPTS} -o ${I2G_MPI_APPLICATION} ${I2G_MPI_APPLICATION}.c"
      $cmd
      if [ ! $? -eq 0 ]; then
          echo "Error compiling program. Exiting..."
          exit 1
      fi

      # Everything's OK.
      echo "Successfully compiled ${I2G_MPI_APPLICATION}"
      return 0
  }

• The hook file is shipped with the job and announced via the environment:

  ...
  InputSandbox = {..., "myhooks.sh", ...};
  Environment  = {..., "I2G_MPI_PRE_HOOK=myhooks.sh"};
  ...

MPI-Start: more features
• Remote injection
  – mpi-start can be sent along with the job: just unpack, set the environment and go!
• Interactivity
  – A pre-command can be used to "control" the mpirun call:
    $I2G_MPI_PRECOMMAND mpirun ...
  – This command can:
    • redirect I/O
    • redirect network traffic
    • perform accounting
• Debugging
  – Three debugging levels:
    • VERBOSE: basic information
    • DEBUG: internal flow information
    • TRACE: set -x at the beginning; full trace of the execution

Future work (I)
• New JDL description for parallel jobs (proposed by the EGEE MPI TF):
  – WholeNodes (True/False): whether or not full nodes should be reserved
  – NodeNumber (default = 1): number of nodes requested
  – SMPGranularity (default = 1): minimum number of cores per node
  – CPUNumber (default = 1): number of job slots (processes/cores) to use
• The CREAM team is working on how to support them (an illustrative JDL sketch appears at the end of this document)

Future work (II)
• Management of non-MPI jobs
  – new execution environments (OpenMP)
  – generic parallel job support
• Support for new schedulers
  – Condor and SLURM support
• Explore support for new architectures:
  – FPGAs, GPUs, …

More Info…
• gLite MPI PT:
  – https://twiki.cern.ch/twiki/bin/view/EMI/GLiteMPI
• MPI-Start trac:
  – http://devel.ifca.es/mpi-start
  – contains user, admin and developer docs
• MPI Wiki @ TCD:
  – http://www.grid.ie/mpi/wiki
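MPI-Start Hooks: data-staging sketch (illustrative)
• To complement the compilation hook in "MPI-Start Hooks (II)", this is a minimal sketch of a post-run hook for data staging. The post_run_hook function name and the I2G_MPI_POST_HOOK variable are assumptions modelled on the pre-hook shown above and may differ between mpi-start versions; the archive name, file pattern and storage destination are placeholders.

  #!/bin/sh
  post_run_hook () {
      echo "Staging out results of ${I2G_MPI_APPLICATION}"

      # Pack whatever result files the application left in the working
      # directory (placeholder file pattern).
      tar czf output.tar.gz *.dat

      # Copy the archive to a storage element with the site's data
      # management tools, e.g. (placeholder destination, kept commented):
      # lcg-cr --vo myvo -d se.example.org file://$PWD/output.tar.gz

      return 0
  }

• The hook would be announced analogously to the pre-hook, e.g. Environment = {..., "I2G_MPI_POST_HOOK=myhooks.sh"}; (variable name assumed).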
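Future work: illustrative JDL sketch
• As a closing illustration of the attributes proposed by the EGEE MPI TF in "Future work (I)": a sketch of how a whole-node MPI job description might look once the attributes are supported. The attribute names come from the proposal above; their combination and the concrete values are assumptions, not a description of behaviour supported at the time.

  JobType        = "Normal";
  // Proposed attributes (EGEE MPI TF):
  WholeNodes     = True;
  NodeNumber     = 2;
  SMPGranularity = 4;
  CPUNumber      = 8;
  // Rest of the job description as in the earlier examples:
  Executable     = "mpi-start-wrapper.sh";
  Arguments      = "userapp OPENMPI some app args...";
  InputSandbox   = {"mpi-start-wrapper.sh", "userapp.c"};
  StdOutput      = "std.out";
  StdError       = "std.err";
  OutputSandbox  = {"std.out", "std.err"};
  Requirements   = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
                && Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment);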