Runtime Optimization of Application Level Communication Patterns
Edgar Gabriel and Shuo Huang
Department of Computer Science, University of Houston
gabriel@cs.uh.edu
HIPS 2007, Long Beach

Motivation
Finite Difference code on a PC cluster using IB and GE interconnects
Execution time for 200 iterations of the solver on 32 processes/processors
[Figure: execution time [sec], y-axis 0 to 30, for the implementations fcfs, fcfs-pack, ordered and overlap; test cases 128x128x64 IB, 128x128x128 IB, 128x128x64 TCP, 128x128x128 TCP]

How to implement the required communication pattern efficiently?
• Dependence on platform
  – Some functionality is only supported (efficiently) on certain platforms or with certain network interconnects
• Dependence on MPI library
  – Does the MPI library support all available methods?
  – Efficiency in overlapping communication and computation
  – Quality of the support for user-defined data types
• Dependence on application
  – Problem size
  – Ratio of communication to computation

• Problem: How can an (average) user understand the myriad of implementation options and their impact on the performance of the application?
• (Honest) Answer: no way
  – Abstract interfaces for application level communication operations required: ADCL
  – Statistical tools required to detect correlations between parameters and application performance

ADCL - Adaptive Data and Communication Library
• Goals:
  – Provide abstract interfaces for frequently occurring application level communication patterns
    • Collective operations
    • Not covered by the MPI specification
  – Provide a wide variety of implementation possibilities and decision routines which choose the fastest available implementation (at runtime)
• Not replacing MPI, but add-on functionality
  – Uses many features of MPI

ADCL terminology
  ADCL object     Functionality
  Attribute       Abstraction for a characteristic of an implementation, represented by the set of its possible values
  Attribute-set   Group of attributes
  Function        Implementation of a particular operation (optionally including an attribute-set and values)
  Function-set    Set of functions providing the same functionality (they have to have the same attribute-set)
  Vector          Abstraction for a multi-dimensional data object
  Topology        Abstraction for a process topology
  Request         Handle for a tuple of <topology, vector, function-set>

Code sample
    ADCL_Vector vec;
    ADCL_Topology topo;
    ADCL_Request request;

    /* Generate a 2-D process topology */
    MPI_Cart_create ( comm, 2, cart_dims, periods, 0, &cart_comm );
    ADCL_Topology_create ( cart_comm, &topo );

    /* Register a 2D vector with ADCL */
    ADCL_Vector_register ( ndims, vec_dims, HALO_WIDTH, MPI_DOUBLE, vector, &vec );

    /* Match process topology, data item and function-set */
    ADCL_Request_create ( vec, topo, ADCL_FNCTSET_NEIGHBORHOOD, &request );

    for ( i=0; i<NIT; i++ ) {
        /* Main application loop */
        ADCL_Request_start ( request );
        …
    }

Runtime selection logic: brute force search (I)
Implementation no.
1 2 3 4 5 6 7
Using the fastest implementation for the rest of the application

Runtime selection logic: brute force search (II)
• Test each function of a given function set a given number of times
  – Store the execution time for each execution per process
• Filter the list of execution times in order to exclude outliers
• Determine the avg. execution time per function i and process j
• Determine the max. execution time for function i across all processes
    $f_i^{\max} = \max_j \left( f_i^{j} \right), \quad j = 0, \dots, \mathrm{nprocs} - 1$
  – Requires communication (e.g. MPI_Allreduce)

Runtime selection logic: brute force search (III)
• Determine the function with the minimal max. execution time across all processes
    $f_{\mathrm{winner}} = \min_i \left( f_i^{\max} \right), \quad i = 0, \dots, \mathrm{nfuncs} - 1$
• Use this function for the rest of the application lifetime

Runtime selection logic: performance hypothesis (I)
• Assumptions:
  – Every implementation can be characterized by a set of attributes which impact its performance, e.g. for neighborhood communication
    • Communication pattern/degree
    • Handling of non-contiguous data
    • Data transfer primitive
    • Overlapping communication and computation
  – The fastest implementation will also have the optimal values for these attributes

Runtime selection logic: performance hypothesis (II)
• Approach: determine the optimal value for an attribute by comparing the execution time of functions differing in only a single attribute

                           Function a   Function b   Function c
  Value for attribute 1        1            2            3
  Value for attribute 2        X            X            X
  Value for attribute 3        Y            Y            Y
  Value for attribute 4        z            z            z

  – E.g.
if function c had the lowest execution time across all processes:
    • Hypothesis: value 3 optimal for attribute 1
    • Confidence value in this hypothesis: 1

Runtime selection logic: performance hypothesis (III)
• Evaluate a different set of functions differing in one other attribute, e.g.

                           Function c   Function d   Function e
  Value for attribute 1        1            2            3
  Value for attribute 2       X+1          X+1          X+1
  Value for attribute 3        Y            Y            Y
  Value for attribute 4        z            z            z

  – If this set of measurements leads to the same optimal value for attribute 1:
    • Increase the confidence value for this hypothesis by 1
  – Else decrease the confidence value by 1

Runtime selection logic: performance hypothesis (IV)
• If the confidence value for an attribute reaches a given threshold
  – Remove all functions not having the required value for this attribute from the function-set
• If the value(s) for the attribute(s) do not converge, this algorithm degenerates to the brute force search
• Advantage: potentially fewer functions have to be evaluated to determine the winner

Currently available implementations for neighborhood communication

  Name                    Comm. pattern   Handling of non-cont. data   Data transfer primitive
  IsendIrecv_aao          aao             ddt                          MPI_Isend/Irecv/Waitall
  IsendIrecv_pair         pair            ddt                          MPI_Isend/Irecv/Waitall
  SendIrecv_aao           aao             ddt                          MPI_Send/Irecv/Waitall
  SendIrecv_pair          pair            ddt                          MPI_Send/Irecv/Wait
  IsendIrecv_aao_pack     aao             ddt                          MPI_Isend/Irecv/Waitall
  IsendIrecv_pair_pack    pair            Pack/unpack                  MPI_Isend/Irecv/Waitall
  SendIrecv_aao_pack      aao             ddt                          MPI_Send/Irecv/Waitall
  SendIrecv_pair_pack     pair            Pack/unpack                  MPI_Send/Irecv/Wait
  SendRecv_pair           pair            ddt                          MPI_Send/Recv
  Sendrecv_pair           pair            ddt                          MPI_Send/Recv
  SendRecv_pair_pack      pair            Pack/unpack                  MPI_Send/Recv
  Sendrecv_pair_pack      pair            Pack/unpack                  MPI_Send/Recv
  WinfencePut_aao         aao             ddt                          MPI_Put/MPI_Win_fence
  WinfenceGet_aao         aao             ddt                          MPI_Get/MPI_Win_fence
  PostStartPut_aao        aao             ddt                          MPI_Put/MPI_Win_post/start
  PostStartGet_aao        aao             ddt                          MPI_Get/MPI_Win_post/start
  WinfencePut_pair        pair            ddt                          MPI_Put/MPI_Win_fence
  WinfenceGet_pair        pair            ddt                          MPI_Get/MPI_Win_fence
  PostStartPut_pair       pair            ddt                          MPI_Put/MPI_Win_post/start
  PostStartGet_pair       pair            ddt                          MPI_Get/MPI_Win_post/start

Performance results (I)
InfiniBand, 32 processes, small problem size
[Figure: execution time [sec], y-axis 10.4 to 12.4, for the hypothesis (hypo) and brute force (brute) selection logics and for the individual IsendIrecv, SendIrecv, SendRecv and Sendrecv implementations, with and without _pack]

Performance results (II)
InfiniBand, 32 processes, large problem size
[Figure: execution time [sec], y-axis 72.5 to 77.5, same set of bars as in (I)]

Performance results (III)
TCP over Fast Ethernet, 32 processes, small problem size
[Figure: execution time [sec], y-axis 0 to 400, same set of bars as in (I)]
Performance results (IV)
TCP over Fast Ethernet, 32 processes, large problem size
[Figure: execution time [sec], y-axis 0 to 450, for the hypothesis (hypo) and brute force (brute) selection logics and for the individual IsendIrecv, SendIrecv, SendRecv and Sendrecv implementations, with and without _pack]

Limitations of ADCL
• Reproducibility of measurements is a challenging topic, even on dedicated compute nodes
  – Hyper-threading
  – Processor frequency scaling
• Network often shared between multiple jobs
• Hierarchical networks
  – Process placement by the batch scheduler
• Performance hypothesis
  – Attributes should not be correlated
• Users have to modify their code
  – How much longer will we have to deal with MPI?

Advantages of ADCL
• Provides close to optimal performance in many scenarios
• Simplifies the development of parallel code for many applications
• Simplifies the development of adaptive parallel code
• Currently ongoing work:
  – Improving (nearly) all components of ADCL
    • Data filtering
    • Increase the parameter space and the set of implementations
    • Experiment with other runtime selection algorithms
      – Historic learning, game theory, genetic algorithms
  – Integration with a CFD solver in cooperation with Dr. Garbey
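
Supplementary sketch: brute force search
The steps of the brute force slides (II) and (III) can be sketched in plain C: filter outliers, average per function and process, take the max. across processes, then the min. across functions. This is only an illustration and not ADCL's code. The names filtered_avg and select_winner are hypothetical. The outlier rule (drop samples above factor times the median) is an assumption, since the slides do not state which filter is used. The averages of all processes are assumed to sit in one local array; in an MPI program the inner max. loop would typically be replaced by MPI_Allreduce with MPI_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Average of the samples after discarding outliers.
 * ASSUMPTION: a sample is an outlier if it exceeds `factor` times the
 * (upper) median; the slides do not specify the filter.
 * The input array is sorted in place.
 */
double filtered_avg(double *samples, size_t n, double factor)
{
    assert(n > 0 && factor >= 1.0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    double median = samples[n / 2];
    double sum = 0.0;
    size_t kept = 0;
    for (size_t k = 0; k < n; k++) {
        if (samples[k] <= factor * median) {
            sum += samples[k];
            kept++;
        }
    }
    return sum / (double)kept;
}

/*
 * avg[i * nprocs + j] holds the filtered average time of function i on
 * process j. Returns the index i that minimizes max_j avg[i][j], i.e.
 * the function whose slowest process is fastest.
 */
size_t select_winner(const double *avg, size_t nfuncs, size_t nprocs)
{
    assert(nfuncs > 0 && nprocs > 0);
    size_t winner = 0;
    double best = 0.0;
    for (size_t i = 0; i < nfuncs; i++) {
        double fmax = avg[i * nprocs];
        for (size_t j = 1; j < nprocs; j++)
            if (avg[i * nprocs + j] > fmax)
                fmax = avg[i * nprocs + j];
        if (i == 0 || fmax < best) {
            best = fmax;
            winner = i;
        }
    }
    return winner;
}
```

Usage: call filtered_avg once per function and process on the stored execution times, collect the results in avg, and use the function returned by select_winner for the rest of the application lifetime.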
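
Supplementary sketch: performance hypothesis
The confidence bookkeeping of the performance hypothesis slides (II) to (IV) might look as follows. Again this is a sketch under stated assumptions, not ADCL's implementation. The type hypothesis_t and the functions fastest, update_hypothesis and prune are hypothetical names. The sketch also assumes that a new hypothesis is formed whenever the confidence value is 0, which the slides do not specify.

```c
#include <assert.h>

/* State of the hypothesis for one attribute (hypothetical type). */
typedef struct {
    int best_value;  /* attribute value currently believed to be optimal */
    int confidence;  /* 0 means no hypothesis has been formed yet */
} hypothesis_t;

/*
 * Index of the smallest entry in times[0..n-1]; times[k] is the max.
 * execution time across all processes of the k-th function in the round.
 */
int fastest(const double *times, int n)
{
    assert(n > 0);
    int best = 0;
    for (int k = 1; k < n; k++)
        if (times[k] < times[best])
            best = k;
    return best;
}

/*
 * Update the hypothesis after measuring one group of functions that
 * differ only in this attribute; values[k] is the attribute value of
 * the k-th function. Returns 1 once the confidence reaches `threshold`.
 */
int update_hypothesis(hypothesis_t *h, const int *values,
                      const double *times, int n, int threshold)
{
    int observed = values[fastest(times, n)];
    if (h->confidence == 0) {
        /* ASSUMPTION: (re)start the hypothesis from the observed value */
        h->best_value = observed;
        h->confidence = 1;
    } else if (observed == h->best_value) {
        h->confidence++;
    } else {
        h->confidence--;
    }
    return h->confidence >= threshold;
}

/*
 * Compact the function-set in place, keeping only the functions whose
 * attribute value equals `required`; attr[k] belongs to function ids[k].
 * Returns the new number of functions.
 */
int prune(int *ids, int *attr, int n, int required)
{
    int kept = 0;
    for (int k = 0; k < n; k++) {
        if (attr[k] == required) {
            ids[kept] = ids[k];
            attr[kept] = attr[k];
            kept++;
        }
    }
    return kept;
}
```

Usage: call update_hypothesis after each group of measurements; once it returns 1, call prune with h.best_value to shrink the function-set. If no attribute ever reaches the threshold, every function ends up being measured, which corresponds to the brute force search.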