Final Report
Iowa State University
Glenn Luecke, Jeff Kuehn, Steve Poole
July 2011

This is the final report for contract DE-AC05-00OR22726, subcontract modification 4000078135, modification 4. The following lists the section numbers and deliverables in this contract.

Section 2.1.A. “A preliminary report of the performance evaluation of OpenSHMEM implementations”. DESCOPED.

Section 2.1.B. “The initial delivery of the Non-public Alpha release of the Unified test suite for V&V and performance (of OpenSHMEM)”. DESCOPED.

Section 2.2.A. “OpenSHMEM-CHECK design document.” Delivered to ORNL in December 2010. This deliverable is included in the appendix of this document.

Section 2.2.B.1. “An implementation of deadlock detection using ROSE.” Delivered to ORNL in June 2011. See below for details.

Section 2.2.B.2. “An implementation of UPC function argument error checking using ROSE.” Delivered to ORNL in June 2011. See below for details.

Section 2.2.B.3. “ISU will deliver all ROSE bugs found using simple programs when possible.” Delivered to ORNL in June 2011.

Section 2.2.B.4. In June 2011, ISU delivered to ORNL the UPC-CHECK user documentation (User’s Guide and Installation Guide) and the UPC-CHECK Tutorial.

UPC-CHECK is a tool designed for the automatic detection of run-time errors in programs written in Unified Parallel C (UPC). UPC is an extension of the C programming language designed for high performance on parallel computers. Currently, UPC-CHECK provides automatic run-time error detection for deadlocks and for argument errors in UPC functions. UPC-CHECK instruments UPC source code by inserting argument and deadlock checks before UPC function calls. The instrumented UPC code is then compiled and executed using any UPC compiler available on the system. At run-time, UPC-CHECK detects errors and issues messages to help programmers quickly fix the errors. Instrumentation is done using the ROSE Toolkit from Lawrence Livermore National Laboratory.
UPC-CHECK detects deadlocks using a novel method. A manuscript presenting this work is currently in preparation.

UPC-CHECK Testing. UPC-CHECK has been extensively tested using tests written during this project and also using the RTED for UPC, which was written for a US government agency several years ago. The RTED evaluation tool for UPC is a collection of scripts that run thousands of tests, compare the actual message with the expected message, and then assign a score of 0, 1, 2, 3, 4 or 5 to the message generated by each test. Scoring was done as follows:

- A score of 5 is given for a detailed error message that will assist a programmer in fixing the error.
- A score of 4 is given for error messages with more information than is needed for a score of 3 but less than is needed for a score of 5. This is tailored for each test.
- A score of 3 is given for error messages with the correct error name, line number and the name of the file where the error occurred.
- A score of 2 is given for error messages with the correct error name and line number where the error occurred but not the file name where the error occurred.
- A score of 1 is given for error messages with the correct error name.
- A score of 0 is given when the error was not detected.

Error Category     argument errors     deadlocks
Cray                    0.38              0.00
Berkeley                0.04              0.58
HP                      0.00              0.36
GNU                     0.00              0.27
UPC-CHECK               4.89*             5.00

* The argument error category in the UPC RTED tests includes 3 tests for upc-forall which are not argument error tests. (They probably should have been put in a miscellaneous error category.) Excluding these 3 tests, UPC-CHECK scores 5.00 in the argument error category.

Appendix

Preliminary Design Document for SHMEM-CHECK: a Tool for Detecting OpenSHMEM Run-time Errors

Glenn Luecke, Director
James Coyle, James Hoekstra, Marina Kraeva
High Performance Computing Group
Iowa State University, Ames, Iowa
December 15, 2010

1. Introduction

The SHMEM programming model consists of library functions that provide low-latency, high-bandwidth communication for use in highly parallelized scalable programs. The one-sided and collective functions in the SHMEM application programming interface (API) provide a programming model for exchanging data between cooperating parallel processes. The resulting programs are similar in style to Message Passing Interface (MPI) programs. The SHMEM API can be used either alone or in combination with MPI functions in the same parallel program.

Several versions of SHMEM have been developed by SGI, Cray, and Quadrics. Since there was no single SHMEM standard, SHMEM programs written over the years are platform dependent. In an attempt to standardize the SHMEM interface, OpenSHMEM was created based on SGI’s SHMEM.

Some researchers prefer to write scientific applications using SHMEM for communications rather than using MPI since SHMEM usually provides better performance. Providing a productive programming environment for OpenSHMEM will encourage new scientific applications to be written in OpenSHMEM.

Since debugging OpenSHMEM programs can be time consuming, it is important to have OpenSHMEM tools and run-time systems that detect run-time errors and issue messages that help programmers quickly fix errors. Having high quality error messages can greatly increase programmer productivity by reducing debugging time when developing and maintaining application programs. This is especially important when developing applications for petascale computing. In fact, detecting and correctly identifying run-time errors would likely enable these errors to be fixed without a debugger.

We recommend the development of a SHMEM run-time error detection tool that we suggest be named SHMEM-CHECK. (The MPI-CHECK tool was developed by ISU’s High Performance Computing Group for detecting and reporting MPI run-time errors in Fortran programs. The UPC-CHECK tool is being developed using the ROSE toolkit by LLNL and ISU’s High Performance Computing Group for detecting and reporting run-time errors in UPC programs.) We recommend that SHMEM-CHECK initially be designed to detect errors at run-time and later be enhanced to detect errors at compile-time.

This document presents our recommendations for the design of SHMEM-CHECK. This work has been supported with funding from the US Department of Defense and from the Extreme Scale System Center at Oak Ridge National Laboratory.

2. Design of SHMEM-CHECK

2.1. Preprocessor Design

SHMEM-CHECK will be designed as a preprocessor using the ROSE toolkit from Lawrence Livermore National Laboratory (LLNL). SHMEM-CHECK will be a source-to-source translator, i.e. SHMEM-CHECK will take OpenSHMEM C code as input, insert run-time checking using the ROSE toolkit, and then output instrumented OpenSHMEM code. This output can then be compiled by any C compiler, linked with a SHMEM library, and executed. When the program is executed, run-time errors will be detected and error messages issued.

For example, suppose one has a SHMEM program named program.c. Then issuing the following

    SHMEM-CHECK Ccompiler [compiler|link options] program.c

will cause the program program.c to be instrumented and compiled with the compiler named “Ccompiler”. When the program is executed, run-time error messages will be issued.

2.2. Modularity

SHMEM-CHECK will be implemented using a modular approach, i.e. an independent module will be dedicated to each error category, namely:

- uninitialized data used in SHMEM functions
- out-of-bounds array accesses in SHMEM functions
- symmetric heap memory related errors
- argument errors in SHMEM functions
- deadlocks and potential deadlocks
- incorrect order of SHMEM functions
- race conditions

2.3. Additional Design Parameters

The following lists additional design recommendations for SHMEM-CHECK.
- SHMEM-CHECK will be designed to detect run-time errors that are due to the incorrect usage of SHMEM function calls, and will not be expected to detect incorrect usage of C statements.
- SHMEM-CHECK will assume that the program syntax is correct, i.e. the SHMEM program to be analyzed must compile without syntax errors.
- The design plan for SHMEM-CHECK assumes that the compiler or tool already has a mechanism for detecting out-of-bounds accesses and uninitialized variables in the statements which do not contain SHMEM functions. It will be assumed that the method used will be extended to cover the SHMEM functions.
- When the instrumented program detects that an error would occur, an error message containing the type of error, the error location, and other useful information will be issued before the error can occur. Program execution will be terminated by calling exit after waiting a user-settable amount of time to allow other PEs to issue messages for the different errors that they detected.
- SHMEM-CHECK will be designed to minimize both the preprocessing and run-time overhead introduced.
- SHMEM-CHECK will be designed to allow users to turn off checking for each of the different error categories, for performance, memory considerations and flexibility during the debugging process.
- SHMEM-CHECK will be designed to allow users to choose whether program execution will continue or stop when an error is detected. The default behavior will be to stop on an error, but the user will be able to override this.

2.4. Testing and Verification

Testing and verification will be an important part of the development of SHMEM-CHECK. We propose that OpenSHMEM run-time error tests be written to test and verify that SHMEM-CHECK detects the error in each test and issues a quality error message for it. ISU has written such run-time error detection suites for serial, MPI and OpenMP errors in Fortran, C and C++ programs and for UPC errors in UPC programs.
In addition, larger SHMEM programs should be used to test the ability of SHMEM-CHECK to detect errors in complex programs. Errors will need to be inserted in these SHMEM applications so they can be used for testing SHMEM-CHECK.

3. Error Detection Methodology for each Error Category

The following presents how we plan to implement run-time error detection for each run-time error category. Appendix A presents a detailed listing of the SHMEM run-time error categories and the sub-categories.

In the following descriptions, comparisons which depend on arguments passed to a SHMEM function will be written in terms of the formal arguments as given in the function prototypes. For example, the 3rd argument to shmem_put is the number of elements to send to another PE, which appears as len in the function prototype. When the function (and hence function prototype) involved is known, rather than refer to this as the value of the 3rd argument, we simply refer to it as len in any calculations.

3.1. Detecting uninitialized objects used in SHMEM functions

Detecting uninitialized objects used in SHMEM functions would require an amount of work greater than that for implementing checks for uninitialized variables in a standard C program without SHMEM functions. Thus, the design plan for SHMEM-CHECK assumes that the compiler or tool already has a mechanism for detecting uninitialized variables in the statements which do not contain SHMEM functions. This method should be extended to cover the SHMEM functions. In particular, the checks for uninitialized variables should be added before the calls to SHMEM functions that have a source argument.

The source array has to be symmetric, i.e. either it has to be a non-stack variable or it has to be allocated by shmalloc. In the first case the variable will be zero-initialized, but SHMEM-CHECK has to issue a warning message if not all elements of the array in the source argument are explicitly initialized in the program prior to the SHMEM function call. If the source array is allocated by shmalloc and is never initialized before the call to the SHMEM function, an error message should be issued.

If the serial C-CHECK tool or compiler cannot be modified to support detection of uninitialized objects in SHMEM functions, the following instrumentation can be done:

- For the SHMEM functions that have a source argument, except remote read functions (get/iget/g), add the following on the same line as the call to the function, right before the call to the function:

      {TYPE SHC_array[M]; for(int i=0; i<M; i++) SHC_array[i]=source[i];

  where M is either nlong, nreduce, len or len*sst depending on the function (see section 3.2, "Detecting out-of-bounds array accesses in SHMEM functions", for details) and TYPE is the type of the source array. The closing curly bracket '}' should be added after the call to the function (on the same line).
- For SHMEM broadcast functions, the following condition should be inserted right before the 'for' statement above:

      if(_my_pe()==PE_root)

  since only the values in the source array on the root PE are important.

3.2. Detecting out-of-bounds array accesses in SHMEM functions

Out-of-bounds array accesses in SHMEM functions that will not be detected by a serial C-CHECK tool/compiler are caused by a wrong combination of values of function arguments. Detecting out-of-bounds array accesses in SHMEM functions would require an amount of work greater than that for implementing checks for out-of-bounds array accesses in a standard C program without SHMEM functions. Thus, the design plan for SHMEM-CHECK assumes that the compiler or tool already has a mechanism for detecting out-of-bounds array accesses in the statements which do not contain SHMEM functions. This method should be extended to cover the SHMEM functions.
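To make "a wrong combination of argument values" concrete, the following is a minimal sketch of a size check that compares the number of elements a call would touch with the number of elements the array actually has. The function names are hypothetical, and the sketch assumes the tool already knows the array size (for example from its declaration or from the size recorded at allocation time). The len*stride rule for strided calls follows the sizing requirements listed in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: number of array elements one argument of a SHMEM
 * call needs. Pass stride = 1 for contiguous calls (put/get, broadcast,
 * reductions) and the tst or sst value for strided calls (iput/iget),
 * following the len*stride sizing rule used in this document. */
static size_t shc_required_elems(size_t count, size_t stride)
{
    assert(stride >= 1);   /* a zero stride is not handled by this sketch */
    return count * stride;
}

/* Returns 1 when an array of array_elems elements is large enough for the
 * call; otherwise reports the problem on stderr and returns 0.
 * Assumption: array_elems is already known to the tool. */
static int shc_check_bounds(const char *func, const char *arg,
                            size_t array_elems, size_t count, size_t stride,
                            const char *file, int line)
{
    size_t need = shc_required_elems(count, stride);

    if (need > array_elems) {
        fprintf(stderr,
                "%s:%d: %s: argument '%s' needs %zu elements, array has %zu\n",
                file, line, func, arg, need, array_elems);
        return 0;
    }
    return 1;
}
```

For instance, a strided write of 40 elements with tst equal to 3 into a 100-element target would be reported by shc_check_bounds("shmem_long_iput", "target", 100, 40, 3, __FILE__, __LINE__), because 120 elements are required.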
Below is the list of possible out-of-bounds array accesses in SHMEM functions that should be detected by SHMEM-CHECK. Each item is the use of an array that is too small for the named argument; the size given is the minimum number of elements that the OpenSHMEM manual requires for the memory block starting at the address passed in that argument.

- pSync argument of a collective function: at least _SHMEM_BARRIER_SYNC_SIZE elements
- source argument of a broadcast function: at least nlong elements
- target argument of a broadcast function: at least nlong elements
- source argument of a collect function: at least nlong elements
- target argument of a collect function: at least N elements, where N is the sum of the values of the nlong argument on the PEs in the active set
- source argument of a reduction collective function: at least nreduce elements
- target argument of a reduction collective function: at least nreduce elements
- pWrk argument of a reduction collective function: at least max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements
- target argument of a remote write (put) function: at least len elements
- source argument of a remote write (put) function: at least len elements
- target argument of a remote read (get) function: at least len elements
- source argument of a remote read (get) function: at least len elements
- target argument of a remote strided write (iput) function: at least len*tst elements
- source argument of a remote strided write (iput) function: at least len*sst elements
- target argument of a remote strided read (iget) function: at least len*tst elements
- source argument of a remote strided read (iget) function: at least len*sst elements

If the serial C-CHECK tool or compiler cannot be modified to support detection of out-of-bounds array accesses in SHMEM functions, the following instrumentation can be done:

- For the SHMEM functions that have a source argument, add the following on the same line as the call to the function, right before the call to the function:

      {TYPE SHC_array[M]; for(int i=0; i<M; i++) SHC_array[i]=source[i];

  where M is either nlong, nreduce, len or len*sst depending on the function (see the array size requirements above) and TYPE is the type of the source array. The closing curly bracket '}' should be added after the call to the function.
- For SHMEM broadcast functions, the following condition should be inserted right before the 'for' statement above:

      if(_my_pe()==PE_root)

  since only the values in the source array on the root PE are important.
- For the SHMEM functions that have a target argument, add the following on the same line as the call to the function, right before the call to the function:

      {TYPE SHC_array[M]; for(int i=0; i<M; i++) target[i]=SHC_array[i];

  where M is either nlong, nreduce, len or len*tst depending on the function (see the array size requirements above) and TYPE is the type of the target array. The closing curly bracket '}' should be added after the call to the function.
- For the SHMEM functions that have a pSync argument, add the following on the same line as the call to the function, right before the call to the function:

      for(int i=0; i<M; i++) if (pSync[i]!=K) issue_error();

  where M is equal to _SHMEM_BARRIER_SYNC_SIZE and K is equal either to _SHMEM_SYNC_VALUE (in SHMEM reduction functions and broadcast/collect functions) or to 0 (in shmem_barrier).
- For the SHMEM functions that have a pWrk argument, add the following on the same line as the call to the function, right before the call to the function:

      for(int i=0; i<M; i++) pWrk[i]=(TYPE)i;

  where M = max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) and TYPE is the type of the target array.

3.3. Detection of SHMEM memory related errors

If the compiler or tool already has a mechanism for detecting memory leaks and use of dangling pointers, then this method should be extended to cover the SHMEM symmetric heap memory functions.
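As orientation for the tracking method that the remainder of this section describes (the method used when the compiler or tool has no such mechanism), the two tables it maintains, SHMB and PI, might be declared and updated roughly as follows. This is only a sketch: the struct layouts, field names, capacities, the NPI counter and the shc_* function names are assumptions made for illustration, and only the bookkeeping after an allocation and after shfree is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHC_MAX_BLOCKS 1024   /* assumed capacity of SHMB */
#define SHC_MAX_PTRS   1024   /* assumed capacity of PI */

struct shc_loc { const char *file; int line; };

struct shc_block {                 /* one element of SHMB */
    intptr_t start;                /* starting address, -1 after deallocation */
    size_t size;                   /* size of the block in bytes */
    int nptrs;                     /* pointers that point into the block */
    struct shc_loc alloc_loc;      /* where the block was allocated */
    struct shc_loc free_loc;       /* where the block was deallocated */
};

struct shc_ptr {                   /* one element of PI */
    void **addr;                   /* address of the pointer variable */
    int memblock;                  /* index into SHMB, -1 if none */
    struct shc_loc assign_loc;     /* last assignment to the pointer */
};

static struct shc_block SHMB[SHC_MAX_BLOCKS];
static int NSHMA;                  /* number of initialized SHMB elements */
static struct shc_ptr PI[SHC_MAX_PTRS];
static int NPI;                    /* number of PI entries (hypothetical) */

/* Returns the PI index for a pointer variable, or -1 if it is not recorded. */
static int shc_find_ptr(void **addr)
{
    for (int i = 0; i < NPI; i++)
        if (PI[i].addr == addr)
            return i;
    return -1;
}

/* Bookkeeping after "*addr = shmalloc(size)" (or shrealloc/shmemalign). */
static void shc_record_alloc(void **addr, size_t size, const char *file, int line)
{
    int p = shc_find_ptr(addr);

    assert(NSHMA < SHC_MAX_BLOCKS);
    if (p >= 0 && PI[p].memblock >= 0)
        SHMB[PI[p].memblock].nptrs--;      /* pointer leaves its old block */
    if (p < 0) {
        assert(NPI < SHC_MAX_PTRS);
        p = NPI++;                         /* first time this pointer is seen */
    }

    SHMB[NSHMA].start = (intptr_t)*addr;
    SHMB[NSHMA].size = size;
    SHMB[NSHMA].nptrs = 1;
    SHMB[NSHMA].alloc_loc = (struct shc_loc){ file, line };

    PI[p].addr = addr;
    PI[p].memblock = NSHMA;
    PI[p].assign_loc = (struct shc_loc){ file, line };
    NSHMA++;
}

/* Bookkeeping after "shfree(*addr)". */
static void shc_record_free(void **addr, const char *file, int line)
{
    int p = shc_find_ptr(addr);

    if (p < 0 || PI[p].memblock < 0)
        return;                            /* not tracked: an error case */
    struct shc_block *b = &SHMB[PI[p].memblock];
    b->start = -1;                         /* marks the block as freed */
    b->free_loc = (struct shc_loc){ file, line };
    b->nptrs--;
    PI[p].memblock = -1;
}
```

A linear search over PI keeps the sketch short; a real tool would more likely use a hash table keyed by the pointer address so that the per-statement overhead stays small.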
If no such mechanism is available, the following method will be used. To detect SHMEM memory related errors, the calls to SHMEM symmetric heap memory functions will be tracked. A global array SHMB (Symmetric Heap Memory Blocks) will be used to store the following information about symmetric heap memory blocks:

- the starting address
- the size of the memory block in bytes
- the number of pointers that point to a location within the memory block
- the location of the memory allocation call (file name and line number)
- the location of the memory deallocation call (file name and line number)

Array SHMB should be big enough to hold information about all calls to SHMEM symmetric heap memory allocation functions in the program. The number of initialized elements in SHMB will be stored in the global variable NSHMA (number of symmetric heap memory allocations).

The following information about pointers in the program that point to symmetric heap memory will be stored in the global array PI on each PE:

- the address of the pointer
- memblock: the index of the element in SHMB that describes the memory block to which the pointer points
- the location of the last assignment to the pointer (file name and line number)

After each call to shmalloc, shrealloc or shmemalign the following will be done:

- The following information will be stored in SHMB:
    o the starting address
    o the size of the memory block in bytes
    o the location of the memory allocation call
    o the number of pointers that point to a location within the memory block, initialized with 1
- The following information will be stored in PI (if the address of the pointer on the left side of the call is already recorded in PI and memblock >= 0, then first decrease the number of pointers that point to a location within the memory block by 1 in element memblock in SHMB):
    o the address of the pointer on the left side of the call
    o memblock, initialized with the value of NSHMA
    o the location of the call (file name and line number)
- The value of NSHMA will be increased by 1.

After each call to shfree or shrealloc, the following will be done:

- The following changes will be made to the corresponding element in SHMB:
    o the starting address will be replaced with -1
    o the location of the memory deallocation call will be recorded
    o the number of pointers that point to a location within the memory block will be decreased by 1
- The following changes will be made to the corresponding element in PI:
    o memblock will be set to -1

Before each pointer assignment statement of kind ptr1=ptr2[+i] the following will be done, after the check for memory leak described in section 3.3.3 below:

- if the address of ptr1 is recorded in PI and memblock_ptr1>=0, in element memblock_ptr1 in SHMB decrease the number of pointers that point to a location within the memory block by 1; then
    - if the address of ptr2 is recorded in PI and memblock_ptr2>=0,
        o make the following changes in PI: initialize memblock_ptr1 with the value of memblock_ptr2, and record the location of the assignment statement (file name and line number)
        o in element memblock_ptr2 in SHMB increase the number of pointers that point to a location within the memory block by 1
    - else (if either the address of ptr2 is not in PI, or memblock_ptr2<0)
        o make the following changes in PI: set memblock_ptr1 = -1, and record the location of the assignment statement (file name and line number)
- if the address of ptr1 is not in PI or memblock_ptr1<0,
    - if the address of ptr2 is recorded in PI and memblock_ptr2>=0,
        o store the following information in PI (either in a new element, or replacing the existing information if the address of ptr1 already appears in PI): the address of the pointer ptr1; memblock_ptr1, initialized with the value of memblock_ptr2; the location of the assignment statement (file name and line number)
        o in element memblock_ptr2 in SHMB increase the number of pointers that point to a location within the memory block by 1

Before every return statement in a program (except return statements in the main function) and at the end of all basic blocks, array PI will be searched for the addresses of the non-static pointers declared in the current functions or blocks. For these entries memblock will be set to -1, and in the corresponding elements in SHMB the number of pointers that point to a location within the memory block will be decreased.

The following subsections describe how the specified memory related errors will be detected.

3.3.1. Call heap memory management functions on pointers that do not point to a memory block allocated via a call to shmalloc, shmemalign or shrealloc

To check for this error, before every call to shrealloc(ptr,..) or shfree(ptr), search PI for the address of ptr:

- if there is no entry for ptr in PI, the pointer ptr never pointed to a symmetric heap memory block, thus issue an error message
- if there is an entry for ptr in PI
    o if memblock == -1, the pointer ptr no longer points to a symmetric heap memory block, thus issue an error message
    o otherwise compare the value of ptr with the address in the element memblock in SHMB; if they do not match, the pointer does not point to the beginning of the memory block, thus issue an error message.

3.3.2. Use of dangling pointers

The use of dangling pointers (pointers that point to freed memory) will be detected as follows: whenever a pointer ptr appears on the right-hand side of an assignment statement, or is dereferenced, search PI for the address of ptr. If the address of ptr is found in PI and memblock>=0, check the starting address of the memory block memblock in SHMB. If it is equal to -1, the memory block was already freed, thus issue an error message.

3.3.3. Memory Leak - reassigning a pointer before deallocating

This memory leak will be detected as follows.

When a pointer ptr1 appears on the left-hand side of an assignment statement of kind ptr1=ptr2[+i], the following checks will be done prior to execution of this statement:

- If there is no entry for ptr1 in PI, then this error did not occur and no further checking for this error is needed.
- If there is an entry for ptr1 in PI and the value of memblock for ptr2 is equal to the value of memblock for ptr1, then this error does not occur and no further checking for this error is needed.
- If there is an entry for ptr1 in PI and the value of memblock for ptr2 is different from the value of memblock for ptr1, check in SHMB how many pointers point to a location within the memory block memblock for ptr1. If the number is equal to 1, then after the assignment statement no pointers will be pointing to the memory block, thus issue an error message.

When a pointer ptr appears on the left-hand side of an assignment statement of kind ptr=f(..), where f is one of shmalloc, shrealloc or shmemalign, the following checks will be done prior to execution of this statement:

- If there is no entry for ptr in PI, then this error did not occur and no further checking for this error is needed.
- If there is an entry for ptr in PI and memblock >= 0, check in SHMB how many pointers point to a location within the memory block memblock. If the number is equal to 1, then after the assignment statement no pointers will be pointing to the memory block, thus issue an error message.

3.3.4. Memory Leak - leaving a block before freeing memory that was allocated using non-static pointer declared and allocated within the block

As described above in 3.3, before every return statement in a program (except return statements in the main function) and at the end of all basic blocks, array PI will be searched for the addresses of the non-static pointers declared in the current functions or blocks.
Before setting memblock to -1 for these entries, and before decreasing the number of pointers that point to a location within the memory block in the corresponding elements in SHMB, it will be checked whether the number of pointers that point to a location within the memory block is equal to 1. If so, then after returning from the function or leaving the basic block no pointers will be pointing to the memory block, thus an error message will be issued.

3.4. Detecting Argument Errors in SHMEM Functions

To allow all arguments to be checked in a single run of the SHMEM program, all of the argument checks listed below will be made before aborting the program. The following subsections describe how the specified argument errors will be detected.

3.4.1. Use non-symmetric data objects as arguments that are required to be remotely accessible

To detect this error, before every call to a SHMEM function that has any of the 'source', 'target', 'pSync' and 'pWrk' arguments that are required to be remotely accessible, it will be checked that the addresses in those arguments are either associated with static variables or are within one of the blocks listed in the array SHMB defined in section 3.3, "Detection of SHMEM memory related errors".

3.4.2. Errors in pSync argument

According to the OpenSHMEM manual, elements of the pSync array have to be initialized either with the value _SHMEM_SYNC_VALUE (in SHMEM reduction functions and broadcast/collect functions) or with 0 (in shmem_barrier). Thus before the calls to the SHMEM reduction functions and broadcast/collect functions the values of the elements of the pSync array have to be compared with _SHMEM_SYNC_VALUE, and before the calls to shmem_barrier the values of the elements of the pSync array have to be compared with 0.
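The comparison just described can be sketched as a small helper that reports which element is wrong, so that the error message can name it. The function name is hypothetical. To keep the sketch independent of any particular shmem.h, the array length and the expected value are passed in by the caller, which would supply _SHMEM_BARRIER_SYNC_SIZE and either _SHMEM_SYNC_VALUE or 0 as stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first pSync element that differs from the
 * expected value, or -1 when all n elements match.
 * Assumption: the caller passes n = _SHMEM_BARRIER_SYNC_SIZE and the
 * expected value required for the function being called. */
static long shc_first_bad_psync(const long *pSync, size_t n, long expected)
{
    assert(pSync != NULL);
    for (size_t i = 0; i < n; i++)
        if (pSync[i] != expected)
            return (long)i;
    return -1;
}
```

Returning the index rather than a yes/no answer is a design choice made here so that a message such as "pSync[2] is 0, expected _SHMEM_SYNC_VALUE" can be produced.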
The same instrumentation as in section 3.2, "Detecting out-of-bounds array accesses in SHMEM functions", will be used to detect this error: for the SHMEM functions that have a pSync argument, add the following on the same line as the call to the function, right before the call to the function:

    for(int i=0; i<M; i++) if (pSync[i]!=K) issue_error();

where M is equal to _SHMEM_BARRIER_SYNC_SIZE and K is equal either to _SHMEM_SYNC_VALUE (in SHMEM reduction functions and broadcast/collect functions) or to 0 (in shmem_barrier).

3.4.3. Use overlapping (but not the same) arrays as source and target arguments in SHMEM collective reduction functions

To detect this error the values of the source, target and nreduce arguments will be examined. An error will be issued in the following cases:

    source<target && source+nreduce>target
    source>target && target+nreduce>source

3.4.4. Wrong values of the pe, PE_root, PE_size, PE_stride and PE_start arguments

To detect these errors, before the call to a SHMEM function, the following will be checked:

- the value of the pe argument in SHMEM put, get, fetch-op and atomic memory operation functions is not negative and is less than the number of PEs
- the value of the PE_start argument in SHMEM collective functions is not negative and is less than the number of PEs
- the value of the PE_size argument in SHMEM collective functions is positive and is not greater than the number of PEs
- the value of the PE_stride argument in SHMEM collective functions is not negative
- the values of the PE_start, PE_stride and PE_size arguments in SHMEM collective functions satisfy the following condition: PE_start+(2**PE_stride)*(PE_size-1) < number of PEs
- the value of the PE_root argument in SHMEM broadcast functions is not negative and is less than the value of the PE_size argument

An error message will be issued when any one of these conditions is not satisfied.

3.4.5. Call a collective function by a PE not in the active set

To detect this error, the following check will be inserted before the call to a SHMEM collective function: check whether there is a whole number 'i', 0<=i<PE_size, such that the rank of the PE that calls the SHMEM collective function is equal to PE_start+(2**PE_stride)*i. If there is no such 'i', an error message will be issued.

3.5. Detecting actual and potential deadlocks

An actual deadlock occurs when something is waiting for an event that will never happen. We say that a SHMEM program has a potential deadlock when it will produce an actual deadlock under some valid OpenSHMEM implementation. This means that a potential deadlock may not be an actual deadlock for some OpenSHMEM implementations. Both actual and potential deadlocks are errors. In this document the word "deadlock" refers to both actual and potential deadlocks.

The OpenSHMEM manual does not explicitly place a constraint on the order of calls to the collective functions. However, under some valid OpenSHMEM implementations a program may deadlock or produce other side-effects when collective functions are not called in the same order by all PEs in the active set. Thus the tool will detect such cases and issue an error message.

The following subsections describe how the specified deadlock errors will be detected.

3.5.1. Not every process in the active group calls a barrier function, a symmetric heap memory management function or a collective function with identical argument(s) that are required to be single-valued in the same order

The proposed method requires synchronization of all PEs in the active set with the root PE before every call to a collective function. In this method the instrumented program will issue a warning message if some PEs are waiting at the call for too long. The length of the waiting period, and whether the program should be stopped, will be set before executing the program.
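Both the membership test of section 3.4.5 and the root-PE synchronization proposed here have to work out which PEs belong to the active set. The following is a minimal sketch of that arithmetic, using this document's convention that consecutive members are 2**PE_stride ranks apart. The function names are hypothetical, an empty active set is treated as invalid (an assumption), and overflow for very large PE_stride values is not handled.

```c
#include <assert.h>

/* Rank of the i-th PE (0 <= i < PE_size) of the active set. */
static int shc_active_rank(int PE_start, int PE_stride, int i)
{
    return PE_start + (1 << PE_stride) * i;
}

/* Section 3.4.4: the last PE of the active set must exist.
 * Assumption: PE_size < 1 (an empty active set) is treated as invalid. */
static int shc_active_set_valid(int PE_start, int PE_stride, int PE_size, int npes)
{
    if (PE_start < 0 || PE_stride < 0 || PE_size < 1)
        return 0;
    return shc_active_rank(PE_start, PE_stride, PE_size - 1) < npes;
}

/* Section 3.4.5: returns 1 when pe == PE_start + 2**PE_stride * i
 * for some whole number i with 0 <= i < PE_size, otherwise 0. */
static int shc_in_active_set(int pe, int PE_start, int PE_stride, int PE_size)
{
    assert(PE_stride >= 0);        /* expected to be validated beforehand */
    int step = 1 << PE_stride;
    int offset = pe - PE_start;

    if (offset < 0 || offset % step != 0)
        return 0;
    return offset / step < PE_size;
}
```

With PE_start = 2, PE_stride = 1 and PE_size = 3 the active set is PEs 2, 4 and 6, so a call from PE 5 or PE 8 would be reported, and the set is only valid when at least 7 PEs exist.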
The deadlock check should be performed after the checks described in the "Detecting Argument Errors in SHMEM Functions" chapter. A detailed description of this method follows.
Before every collective function, a hand-shaking check will be inserted to test in a while loop whether all PEs in the active set have arrived at the same collective function with the same arguments when required by the OpenSHMEM manual.
The hand-shaking check will be done by first declaring global arrays shmem_cf_notify and shmem_cf_wait. To record the name of the collective function and argument information, as well as the location (the file name and line number) of the call, a global structure object info_shmem_cf will be declared. Therefore, the following declaration statements will appear in the instrumented program outside the main function:
    struct SHMEM_cf_and_args info_shmem_cf;
    int shmem_cf_notify[PES];
    int shmem_cf_wait[PES];
Before the call to a collective function the following will be inserted:
Set root_PE to PE_start for those functions that have a PE_start argument, or to 0 for barrier, shmem_barrier_all and symmetric heap memory management functions.
If the rank of the calling PE is not equal to root_PE Then {
    record the collective function name and argument information in info_shmem_cf
    while (shmem_cf_notify[myPE] on root_PE != 0) {}
    set shmem_cf_notify[myPE] on root_PE to 1
    while (shmem_cf_wait[root_PE] != 1) {
        after specified amount of time issue a warning
        stop execution after certain number of warnings or when prompted by user
    }
    shmem_cf_wait[root_PE] = 0;
} Else {
    count=1;
    while (count < PE_size) {
        for each rank_PE in the active set except root_PE {
            if (shmem_cf_notify[rank_PE] == 1) {
                compare function name and argument information in info_shmem_cf on rank_PE with own call;
                if it matches {set shmem_cf_wait[root_PE] to 1 on rank_PE}
                else {issue an error message and stop execution}
                shmem_cf_notify[rank_PE]=2;
                count++;
            }
        }
        after specified amount of time issue a warning
        stop execution after certain number of warnings
    }
    for each rank_PE in the active set except root_PE
        shmem_cf_notify[rank_PE]=0;
}
3.5.2. Deadlock at the call to a wait function or shmem_set_lock
SHMEM wait functions wait for argument ivar to be changed by a remote write or atomic swap issued by a different processor. If the awaited change does not happen, the process will be waiting at the call until the program execution is interrupted. This is one source of deadlock. Another source of deadlock is incorrect use of the SHMEM lock functions. A PE will deadlock at the call to shmem_set_lock(&lock) if another PE that is holding the lock never releases it by calling shmem_clear_lock(&lock).
To let the user know about a PE waiting at the call to a SHMEM wait function or to shmem_set_lock, the following will be done in addition to the algorithm described in the previous section:
1. Four (global) objects will be declared:
    a. wait_counter to keep count of waiting PEs,
    b. a structure wait_loc to record the location (line number and file name) of the calls to SHMEM wait functions,
    c.
a structure set_lock_loc to record the location of the calls to shmem_set_lock,
    d. an array of structures list_locks to keep track of locks being held.
2. Before every call to a SHMEM wait function
    a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, then the program would deadlock if the current PE were to call the wait function, so issue the deadlock error message. Otherwise
    b. record the location (line number and file name) of the call in the wait_loc object and
    c. increase the wait_counter on PE0.
3. After the call to a SHMEM wait function
    a. clear the wait_loc object and
    b. decrease the wait_counter on PE0.
4. Before every call to shmem_set_lock(&lock)
    a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, then check in list_locks (on PE0) whether the lock is being held by another PE. If it is, the program would deadlock if the current PE were to call shmem_set_lock, so issue the deadlock error message. Otherwise
    b. record the location (line number and file name) of the call in the set_lock_loc object and
    c. increase the wait_counter on PE0.
5. After the call to shmem_set_lock,
    a. record the address of the variable lock and the PE number in list_locks (on PE0),
    b. clear the set_lock_loc object and
    c. decrease the wait_counter on PE0.
6. After every call to shmem_test_lock, check the return value; if it is equal to 0, record the address of the variable lock and the PE number in list_locks (on PE0).
7. After every call to shmem_clear_lock, remove the address of the variable lock and the PE number from list_locks (on PE0).
8. Before every return statement in the main() function
    a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, then the program would deadlock if the current PE were to return from the program, so issue the error message. Otherwise
    b. increase the wait_counter on PE0.
9.
Before every call to a barrier, symmetric heap memory function or a collective function, if the rank of the calling PE is not equal to root_PE, perform the following before the "hand-shaking" algorithm described in the previous section:
    a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, then the program would deadlock if the current PE were to call the barrier, symmetric heap memory function or collective function (since it is not the last PE in the active set of PEs: root_PE is still left), so issue the error message. Otherwise
    b. record the location (line number and file name) of the call in the info_shmem_cf object and
    c. increase the wait_counter on PE0.
10. After the call to the barrier, symmetric heap memory function or collective function, decrease the wait_counter on PE0.
Since the detection of a deadlock may be deferred until some PE reaches the end of the program, the above method may be enhanced by making PEs waiting at a call to a SHMEM wait function or to shmem_set_lock issue a warning after a specified period of time (similar to the "hand-shaking" algorithm described in the previous section).
3.6. Detecting Wrong Order of SHMEM Functions
3.6.1. Call a SHMEM function before the call to start_pes
The start_pes routine identifies the number of processes for a program. This statement must be the first statement in a program that uses distributed, shared memory (SHMEM) communication routines. Even though the manual does not explicitly prohibit multiple calls to start_pes, such calls may still lead to problems. To detect those cases when start_pes is called more than once or when a SHMEM function is called before the call to start_pes, the following will be done:
• A global variable start_pes_called will be defined and initially set to 0.
• Before a call to start_pes, the executing PE will check the value of start_pes_called. If it is equal to 1, an error message will be issued.
If the value is equal to 0, execution will continue.
• After a call to start_pes, start_pes_called will be assigned the value 1.
• Before every call to a SHMEM function except start_pes, the executing PE will check the value of start_pes_called. If it is equal to 0, an error message will be issued. If the value is equal to 1, execution will continue.
3.6.2. Unlock lock being held by another PE
The OpenSHMEM manual does not explicitly state that only the PE that holds the lock can clear it; however, such a requirement would make sense. Therefore SHMEM-CHECK will check for this error. To detect the error, the array list_locks described in section "Deadlock at the call to a wait function or shmem_set_lock" will be used. As described in that section, after a PE successfully sets the lock, it records its rank and the address of the lock variable in list_locks on PE0. To detect the wrong order error, the following will be done:
1. After the call to shmem_set_lock, record the address of the variable lock and the PE number in list_locks on PE0.
2. After every call to shmem_test_lock, check the return value; if it is equal to 0, record the address of the variable lock and the PE number in list_locks on PE0.
3. Before every call to shmem_clear_lock, check in list_locks on PE0 whether the calling PE holds the lock; if not, issue an error message.
4. After every call to shmem_clear_lock, remove the address of the variable lock and the PE number from list_locks on PE0.
Items 1, 2, and 4 above are already done in the section "Deadlock at the call to a wait function or shmem_set_lock", and are included here for clarity.
3.6.3. Call two collective SHMEM routines with the same pSync and/or pWrk arguments and no shmem_barrier or shmem_barrier_all call in between
With the exception of shmem_barrier, it is erroneous to use the same pSync array in two consecutive calls to SHMEM collective functions without intervening barrier synchronization.
In addition, a pWrk array can be reused in a subsequent reduction routine call only if none of the PEs in the active set are still processing a prior reduction routine call that used the same pWrk array. To detect when a pSync or pWrk array is being reused without intervening barrier synchronization, the following will be done:
• A global variable pes_synchronized will be defined and initially set to 0.
• After every call to shmem_barrier, barrier, shmem_barrier_all or a symmetric heap memory management function (which calls shmem_barrier_all), pes_synchronized will be assigned the value 1.
• Four global variables ba_pSync, ne_pSync, ba_pWrk and ne_pWrk, which record the beginning addresses and number of elements of the arrays in the pSync and pWrk arguments of the SHMEM collective function, will be defined and initialized with 0.
• Before every call to a SHMEM collective function except shmem_barrier, check the value of pes_synchronized. If it is equal to 0:
    o check that the array in the pSync argument is different from the one used in the previous call to a SHMEM collective function (recorded in ba_pSync and ne_pSync); an error will be issued in the following cases:
        pSync == ba_pSync
        pSync < ba_pSync && pSync + size > ba_pSync
        pSync > ba_pSync && pSync < ba_pSync + ne_pSync
      where size is one of
        _SHMEM_REDUCE_SYNC_SIZE
        _SHMEM_COLLECT_SYNC_SIZE
        _SHMEM_BCAST_SYNC_SIZE
      depending on the SHMEM collective function.
    o for the reduction functions, also check that the array in the pWrk argument is different from the one used in the previous call to a SHMEM collective reduction function (recorded in ba_pWrk and ne_pWrk); an error will be issued in the following cases:
        pWrk == ba_pWrk
        pWrk < ba_pWrk && pWrk + size > ba_pWrk
        pWrk > ba_pWrk && pWrk < ba_pWrk + ne_pWrk
      where size == max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE)
• After every call to a SHMEM collective function except shmem_barrier, do the following:
    o for reduction functions:
        ba_pSync = pSync
        ne_pSync = _SHMEM_REDUCE_SYNC_SIZE
        ba_pWrk = pWrk
        ne_pWrk = max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE)
    o for broadcast functions:
        ba_pSync = pSync
        ne_pSync = _SHMEM_BCAST_SYNC_SIZE
    o for collect/fcollect functions:
        ba_pSync = pSync
        ne_pSync = _SHMEM_COLLECT_SYNC_SIZE
    o for all SHMEM collective functions set pes_synchronized to 0.
The above method will catch errors related to reuse of pSync and pWrk in successive calls without synchronization.
3.7. Race Conditions
The methodology used for detecting race conditions is based on the fact that a race condition will occur only when both of the following conditions are satisfied:
• at least two different PEs access the same shared memory location, with at least one PE writing to this memory location
• the order of PEs execution is not controlled.
In OpenSHMEM, PEs can access only symmetric memory locations on a remote PE, and only through SHMEM function calls. A race condition will occur when two PEs access a symmetric memory location on a third PE without synchronization, or when one PE accesses symmetric memory locally and another PE accesses the same memory location using a SHMEM function.
Note that sometimes the final result will not depend on the order of accesses only because the data happens to be the same, e.g. when all PEs happen to write the same value to a symmetric memory location. However, we will still report that as a race condition.
The shmem_inc and shmem_finc functions add a third type of access, which we will call increase. These functions only act on variables of type int, long and longlong. A PE can also increase a symmetric variable x via a statement of the form x = x+expr, where expr is an expression that does not depend on x. Statements of this form involving symmetric variables of type int, long and longlong must also be treated as a shmem_inc.
Since the shmem_finc function not only increases target, but also returns the previous contents of target, it should be treated as a pair of accesses: "read" and "increase". However, if the returned value of shmem_finc is not assigned to any variable and is not used as an argument in any function, it should be considered only as a single access of type "increase". Similarly, accessing a variable through shmem_swap should be treated as "read" and "write". However, if the returned value of shmem_swap is not assigned to any variable and is not used as an argument in any function, it should be considered only as a single access of type "write".
So we have three types of access:
• read (e.g. when a variable appears on the right side of the assignment statement, in the source argument of a SHMEM function, passed by value to any function or used in the expression of an IF statement)
• write (e.g. when a variable appears on the left side of an assignment statement or in the target argument of a SHMEM function except shmem_inc)
• increase (i.e.
when the access to x can be represented as x = x+expr, where expr is an expression that does not depend on x, or when x appears in the target argument of shmem_inc); this type of access is valid only for variables of types int, long and longlong
Race conditions occur when the following pairs of accesses are executed on two different PEs:
• increase and read
• increase and write
• read and write
• write and write
Note that if between synchronizations all accesses to a symmetric memory location are of type "read", then there is no race condition at this memory location. In addition, if between synchronizations all accesses to a symmetric memory location are of type "increase", then there is no race condition at this memory location, since the final result of all increase operations will always be the same. If only one PE accesses a symmetric memory location multiple times, then there is no race condition at this memory location, since the order of those accesses is guaranteed.
Since race conditions only occur in regions between barrier synchronizations, accesses to symmetric memory need only be tracked between barrier synchronizations in the SHMEM program.
Let S be the set of all pairs of accesses to the same symmetric memory location by two different PEs. The procedure described below will remove the members of S that can be guaranteed not to be a race condition. When the program is executed, race condition messages will be issued for the remaining members of S. This method will find all race conditions as defined above.
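The pairwise rule above can be written down as a small predicate. This is only an illustrative sketch in plain C: the enum access_t and the function is_race are hypothetical names chosen here, and the sketch assumes that both accesses target the same symmetric memory location with no intervening synchronization.

```c
#include <assert.h>
#include <stdbool.h>

/* The three access types defined in the text (names are hypothetical). */
typedef enum { ACC_READ, ACC_INCREASE, ACC_WRITE } access_t;

/* Returns true when two unsynchronized accesses to the same symmetric
 * memory location, issued by PEs pe_a and pe_b, form a race condition. */
static bool is_race(access_t a, int pe_a, access_t b, int pe_b)
{
    assert(pe_a >= 0 && pe_b >= 0);

    if (pe_a == pe_b)
        return false;   /* one PE only: the order of accesses is guaranteed */
    if (a == ACC_READ && b == ACC_READ)
        return false;   /* read/read never conflicts */
    if (a == ACC_INCREASE && b == ACC_INCREASE)
        return false;   /* increases give the same final result in any order */

    /* Remaining pairs: increase/read, increase/write, read/write, write/write. */
    return true;
}
```

For example, a write on PE 0 and a read on PE 1 are reported, while two increases from different PEs are not.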
To detect race conditions, first create a global array SMA (Symmetric Memory Accesses) on each PE that, for each address in the symmetric memory on that PE, will contain the following information about the current state of accesses to the symmetric memory location:
• unaccessed
• {read, p}, where p is the PE that accessed the variable
• {increased, p}, where p is the PE that accessed the variable
• {written, p}, where p is the PE that accessed the variable
• {read and increased, p}, where p is the PE that accessed the variable
• read by more than one PE
• increased by more than one PE
For each address in the symmetric memory the line number and file name of up to two accesses will also be saved. This information will be saved every time the access state for a symmetric location is modified.
Race conditions will be detected as follows. Before each statement that accesses a symmetric memory location, a call to a function check_race_condition will be made for each symmetric variable which is referenced in the statement. This function will contain a region controlled by the locks (one lock per symmetric memory location) which can be accessed only by one PE at a time (i.e. a critical region).
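The per-location state listed above, together with the transition rules given in the algorithm that follows, can be modelled in a few lines of C. This is a single-process sketch under stated assumptions: the type and function names (state_t, sma_entry, sma_update) are hypothetical, the per-location lock and the saved file/line information are omitted, and a real check_race_condition would operate on symmetric memory through SHMEM calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACC_READ, ACC_INCREASE, ACC_WRITE } access_t;

/* One state per entry of the list above (names are hypothetical). */
typedef enum {
    ST_UNACCESSED, ST_READ, ST_INCREASED, ST_WRITTEN,
    ST_READ_INCREASED, ST_READ_MANY, ST_INCREASED_MANY
} state_t;

typedef struct { state_t state; int pe; } sma_entry;  /* pe: last single accessor */

/* Applies one access by PE `pe`; returns true when it is a race condition. */
static bool sma_update(sma_entry *e, access_t access, int pe)
{
    assert(e != NULL && pe >= 0);
    bool same = (e->pe == pe);

    switch (e->state) {
    case ST_UNACCESSED:              /* first access: remember type and PE */
        e->state = access == ACC_READ ? ST_READ
                 : access == ACC_INCREASE ? ST_INCREASED : ST_WRITTEN;
        e->pe = pe;
        return false;
    case ST_READ_MANY:               /* only further reads are safe */
        return access != ACC_READ;
    case ST_INCREASED_MANY:          /* only further increases are safe */
        return access != ACC_INCREASE;
    case ST_WRITTEN:                 /* same PE: no action; other PE: race */
        return !same;
    case ST_READ:
        if (same) {
            if (access == ACC_INCREASE) e->state = ST_READ_INCREASED;
            else if (access == ACC_WRITE) e->state = ST_WRITTEN;
            return false;
        }
        if (access == ACC_READ) { e->state = ST_READ_MANY; return false; }
        return true;
    case ST_INCREASED:
        if (same) {
            if (access == ACC_READ) e->state = ST_READ_INCREASED;
            else if (access == ACC_WRITE) e->state = ST_WRITTEN;
            return false;
        }
        if (access == ACC_INCREASE) { e->state = ST_INCREASED_MANY; return false; }
        return true;
    case ST_READ_INCREASED:          /* any access from another PE is a race */
        if (same) {
            if (access == ACC_WRITE) e->state = ST_WRITTEN;
            return false;
        }
        return true;
    }
    return false;
}
```

Resetting an entry to ST_UNACCESSED corresponds to the clean-up that the design performs after barrier synchronization.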
Inside this critical region the access state will be changed according to the following algorithm, where PE is the current process number:
if (access == read) {
    switch (state) {
    case unaccessed:
        This is the first time this location is accessed, set state to "{read, PE}"; break;
    case {read, PE}:
        Last access was read from this same PE, OK, no action; break;
    case {written, PE}:
        This location was written from this same PE only, OK, no action; break;
    case {increased, PE}:
        Last access was increase from this same PE, set state to "{read and increased, PE}"; break;
    case {read, p}, p!=PE:
        Last access was a read from another PE, set state to "read by more than one PE"; break;
    case read by more than one PE:
        Already have state set to "read by more than one PE", OK, no action; break;
    case {written, p}, p!=PE:
        Last access was a write from another PE, error - read/write without synchronization; break;
    case {increased, p}, p!=PE:
        Last access was an increase from another PE, error - read/increase error; break;
    case {read and increased, PE}:
        Already has state set to "{read and increased, PE}", OK, no action; break;
    case increased by more than one PE:
        This location was increased by another PE, error - read/increase error; break;
    case {read and increased, p}, p!=PE:
        This location was increased by another PE, error - read/increase error; break;
    }
} else if (access == increase) {
    switch (state) {
    case unaccessed:
        This is the first time this location is accessed, set state to "{increased, PE}"; break;
    case {read, PE}:
        Last access was read from this same PE, set state to "{read and increased, PE}"; break;
    case {written, PE}:
        This location was written from this same PE only, OK, no action; break;
    case {increased, PE}:
        Last access was increase from this same PE, OK, no action; break;
    case {read, p}, p!=PE:
        Last access was a read from another PE, error - read/increase error; break;
    case read by more than one PE:
        This location was read by another PE, error - read/increase error; break;
    case {written, p}, p!=PE:
        Last access was a write from another PE, error - increase/write without synchronization; break;
    case {increased, p}, p!=PE:
        Last access was an increase from another PE, set state to "increased by more than one PE"; break;
    case {read and increased, PE}:
        Already has state set to "{read and increased, PE}", OK, no action; break;
    case increased by more than one PE:
        Already has state set to "increased by more than one PE", OK, no action; break;
    case {read and increased, p}, p!=PE:
        This location was read by another PE, error - read/increase error; break;
    }
} else if (access == write) {
    switch (state) {
    case unaccessed:
        This is the first time this location is accessed, set state to "{written, PE}"; break;
    case {read, PE}:
        Last access was read from this same PE, set state to "{written, PE}"; break;
    case {written, PE}:
        This location was written from this same PE only, OK, no action; break;
    case {increased, PE}:
        Last access was increase from this same PE, set state to "{written, PE}"; break;
    case {read, p}, p!=PE:
        Last access was a read from another PE, error - read/write error; break;
    case read by more than one PE:
        This location was read by another PE, error - read/write error; break;
    case {written, p}, p!=PE:
        This location was written to by another PE, error - write/write without synchronization; break;
    case {increased, p}, p!=PE:
        Last access was an increase from another PE, error - increase/write without synchronization; break;
    case {read and increased, PE}:
        The location was accessed by this PE only, set state to "{written, PE}"; break;
    case increased by more than one PE:
        This location was increased by another PE, error - increase/write error; break;
    case {read and increased, p}, p!=PE:
        This location was read by another PE, error - read/write error; break;
    }
}
After every call to barrier, shmem_barrier_all or a symmetric heap memory management function (which calls shmem_barrier_all), the type of accesses in the entire SMA on each PE will be set to "unaccessed" and all PEs will be synchronized with an additional call to shmem_barrier_all.
The shmem_barrier function is used to synchronize a subset of PEs. It ensures that all local stores and remote memory updates issued by any of the PEs in the active set prior to shmem_barrier are complete before returning. Thus after every call to shmem_barrier, the SMA arrays on all PEs in the active set should be "cleaned" of the accesses performed by PEs in the active set.
Appendix A: SHMEM Run-time Error Categories
1. Uninitialized Data Used in SHMEM Functions
• Source argument in collective functions
• Source argument in remote write (put) functions
2. Out-of-bounds array accesses in SHMEM Functions
Arrays can be declared too small, or an address in the middle of a larger array may be passed into a SHMEM function so that the rest of the array starting at this address is too small; both errors should be detected.
• Use too small pSync array in collective functions
• Use too small array in the target argument of a collective function (the size of the array will depend on the values of the nreduce, nlong and PE_size arguments)
• Use too small array in the source argument of a collective function (the size of the array will depend on the values of the nreduce, nlong and PE_size arguments)
• Use too small pWrk array in collective reduction functions (the size of the array will depend on the value of the nreduce argument)
• Use too small array in the target argument of a remote write (put) function (the size of the array will depend on the value of the len and tst arguments)
• Use too small array in the source argument of a remote write (put) function (the size of the array will depend on the value of the len and sst arguments)
• Use too small array in the target argument of a remote read (get) function (the size of the array will depend on the value of the len and tst arguments)
• Use too small array in the source argument of a remote read (get) function (the size of the array will depend on the value of
the len and sst arguments)
3. Symmetric Heap Memory Related Errors
• Call heap memory management functions on pointers that do not point to a memory block allocated via a call to shmalloc, shmemalign or shrealloc
• Use of dangling pointers
• Memory Leak - reassigning a pointer before deallocating
• Memory Leak - leaving a block before freeing memory that was allocated using non-static pointer declared and allocated within the block
4. Argument Errors in SHMEM Functions
• Use non-symmetric data objects as arguments that are required to be remotely accessible
• Errors in pSync argument
• Use overlapping (but not the same) arrays as source and target arguments in SHMEM collective reduction functions
• Use a value larger than 'number of PEs -1' or use a negative value in the pe, PE_size and PE_start arguments
• Use invalid combination of values of PE_size, PE_stride and PE_start arguments
• Call a collective function by a PE not in the active set
5. Deadlocks and potential deadlocks:
• Not every process in the active set calls a barrier function, a symmetric heap memory management function or a collective function with identical argument(s) that are required to be single-valued in the same order
• Deadlock at the call to a wait function
• Deadlocks when using SHMEM lock functions
6. Incorrect Order of SHMEM Functions
• Call a SHMEM function before the call to start_pes
• Unlock lock being held by another PE
• Call two collective SHMEM routines with the same pSync and/or pWrk arguments and no shmem_barrier or shmem_barrier_all call in between
7. Race Conditions