Final Report Iowa State University

Final Report
Iowa State University
Glenn Luecke, Jeff Kuehn, Steve Poole
July 2011
This is the final report for contract DE-AC05-00OR22726, subcontract modification 4000078135,
modification 4. The following lists the section numbers and deliverables in this contract.

Section 2.1.A. “A preliminary report of the performance evaluation of OpenSHMEM
implementations”. DESCOPED.

Section 2.1.B. “The initial delivery of the Non-public Alpha release of the Unified test suite
for V&V and performance (of OpenSHMEM)”. DESCOPED.

Section 2.2.A. “OpenSHMEM-CHECK design document.” Delivered to ORNL in
December 2010. This deliverable is listed in the appendix of this document.

Section 2.2.B.1 “An implementation of deadlock detection using ROSE.” Delivered to
ORNL in June 2011. See below for details.

Section 2.2.B.2. “An implementation of UPC function argument error checking using
ROSE.” Delivered to ORNL in June 2011. See below for details.

Section 2.2.B.3. “ISU will deliver all ROSE bugs found using simple programs when
possible.” Delivered to ORNL in June 2011.

Section 2.2.B.4. In June 2011, ISU delivered to ORNL the UPC-CHECK user
documentation (User’s Guide and Installation Guide) and the UPC-CHECK Tutorial:
UPC-CHECK is a tool designed for the automatic detection of run-time errors for programs written
in Unified Parallel C (UPC). UPC is an extension of the C programming language designed for
high performance on parallel computers. Currently, UPC-CHECK provides automatic run-time
error detection for deadlocks and for argument errors in UPC functions. UPC-CHECK instruments
UPC source code by inserting argument and deadlock checks before UPC function calls. The
instrumented UPC code is then compiled and executed using any UPC compiler available on the
system. During run-time UPC-CHECK detects errors and issues messages to help programmers
quickly fix the errors. Instrumentation is done using the ROSE Toolkit from Lawrence Livermore
National Laboratory. A novel method for deadlock detection is presented. A manuscript presenting
this work is currently under preparation.
UPC-CHECK Testing. UPC-CHECK has been extensively tested using tests written during this
project and also using the RTED for UPC written for a US government agency several years ago.
The RTED evaluation tool for UPC is a collection of scripts for running thousands of tests,
comparing the actual message with the expected message, and then assigning a score of 0, 1, 2, 3, 4
or 5 to the message generated by each test. Scoring was done as follows:
- A score of 5 is given for a detailed error message that will assist a programmer to fix the error.
- A score of 4 is given for error messages with more information than a score of 3 and less than
  5. This is tailored for each test.
- A score of 3 is given for error messages with the correct error name, line number and the
  name of the file where the error occurred.
- A score of 2 is given for error messages with the correct error name and line number where
  the error occurred but not the file name where the error occurred.
- A score of 1 is given for error messages with the correct error name.
- A score of 0 is given when the error was not detected.
Error Category     Cray   Berkeley   HP     GNU    UPC-CHECK
argument errors    0.38   0.04       0.00   0.00   4.89*
deadlocks          0.00   0.58       0.36   0.27   5.00
* The argument error category in the UPC RTED tests includes 3 tests for upc-forall which are not
argument error tests. (They probably should have been put in a miscellaneous error category.)
Excluding these 3 tests, UPC-CHECK scores 5.00 in the argument error category.
Appendix
Preliminary Design Document for SHMEM-CHECK:
a Tool for Detecting OpenSHMEM Run-time Errors
Glenn Luecke, Director
James Coyle, James Hoekstra, Marina Kraeva
High Performance Computing Group
Iowa State University, Ames, Iowa
December 15, 2010
1. Introduction
The SHMEM programming model consists of library functions that provide low-latency, high-bandwidth communication for use in highly parallelized scalable programs. The one-sided and
collective functions in the SHMEM application programming interface (API) provide a
programming model for exchanging data between cooperating parallel processes. The resulting
programs are similar in style to Message Passing Interface (MPI) programs. The SHMEM API can
be used either alone or in combination with MPI functions in the same parallel program.
Several versions of SHMEM have been developed by SGI, Cray, and Quadrics. Since there was no
single SHMEM standard, SHMEM programs written over the years are platform dependent. In an
attempt to standardize the SHMEM interface, OpenSHMEM was created based on SGI’s SHMEM.
Some researchers prefer to write scientific applications using SHMEM for communications rather
than using MPI since SHMEM usually provides better performance. Providing a productive
programming environment for OpenSHMEM will encourage new scientific applications to be
written in OpenSHMEM. Since debugging OpenSHMEM programs can be time consuming, it is
important to have OpenSHMEM tools and run-time systems that detect run-time errors and issue
messages that help programmers quickly fix errors. Having high quality error messages can greatly
increase programmer productivity by reducing debugging time when developing and maintaining
application programs. This is especially important when developing applications for petascale
computing. In fact, detecting and correctly identifying run-time errors would likely enable these
errors to be fixed without a debugger.
We recommend the development of a SHMEM run-time error detection tool that we suggest be
named SHMEM-CHECK. (The MPI-CHECK tool was developed by ISU’s High Performance
Computing Group for detecting and reporting MPI run-time errors in Fortran programs. The UPC-CHECK tool is being developed using the ROSE toolkit by LLNL and ISU’s High Performance
Computing Group for detecting and reporting run-time errors in UPC programs.) We recommend
that initially SHMEM-CHECK be designed to detect errors at run-time and then at a later time
SHMEM-CHECK be enhanced to detect errors at compile-time.
This document presents our recommendations for the design of SHMEM-CHECK. This work has
been supported with funding from the US Department of Defense and from the Extreme Scale
System Center at Oak Ridge National Laboratory.
2. Design of SHMEM-CHECK
2.1. Preprocessor Design
SHMEM-CHECK will be designed as a preprocessor using the ROSE toolkit from Lawrence
Livermore National Laboratory (LLNL). SHMEM-CHECK will be a source-to-source translator,
i.e. SHMEM-CHECK will take OpenSHMEM C code as input, insert run-time checking using
the ROSE toolkit, and then output instrumented OpenSHMEM code. This output can then be
compiled by any C compiler and linked with a SHMEM library. When executed, run-time errors will be detected and error messages issued. For example, suppose one has a SHMEM
program named program.c. Then issuing the following
SHMEM-CHECK Ccompiler [compiler|link options] program.c
will cause the program program.c to be instrumented and compiled with the compiler named
“Ccompiler”. When the resulting program is executed, run-time error messages will be issued.
2.2. Modularity
SHMEM-CHECK will be implemented using a modular approach, i.e. an independent module will
be dedicated to each error category, namely:

- uninitialized data used in SHMEM functions
- out-of-bounds array accesses in SHMEM functions
- symmetric heap memory related errors
- argument errors in SHMEM functions
- deadlocks and potential deadlocks
- incorrect order of SHMEM functions
- race conditions
2.3. Additional Design Parameters
The following lists additional design recommendations for SHMEM-CHECK.

SHMEM-CHECK will be designed to detect run-time errors that are due to the incorrect
usage of SHMEM function calls, and will not be expected to detect incorrect usage of C
statements.

The SHMEM-CHECK will assume that the program syntax is correct, i.e. the SHMEM
program to be analyzed must compile without syntax errors.

The design plan for SHMEM-CHECK assumes that the compiler or tool already has a
mechanism for detecting out-of-bounds accesses and uninitialized variables in the statements
which do not contain SHMEM functions. It will be assumed that the method used will be
extended to cover the SHMEM functions.

When the instrumented program detects that an error would occur, an error message
containing the type of error, error location, and other useful information will be issued
before the error can occur. Program execution will be terminated by calling exit after
waiting a settable specified amount of time to allow other PEs to issue messages for the
different errors that they detected.

SHMEM-CHECK will be designed to minimize both the preprocessing and run-time
overhead introduced.

SHMEM-CHECK will be designed to allow users to turn off checking for each of the
different error categories for performance, memory considerations and flexibility during the
debugging processes.

SHMEM-CHECK will be designed to allow users to choose whether program execution will
continue or stop when an error is detected. The default behavior will be to stop on an error,
but the user will be able to over-ride this.
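The reporting behavior described in the list above could look roughly like the following sketch. It is an illustration only: the names shc_config, shc_report and the category enum are hypothetical and do not come from any SHMEM-CHECK deliverable, and the settable delay before exit is modeled with the POSIX sleep() call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical error categories, one per module listed in section 2.2. */
enum shc_category {
    SHC_UNINIT, SHC_BOUNDS, SHC_HEAP, SHC_ARGS,
    SHC_DEADLOCK, SHC_ORDER, SHC_RACE, SHC_NCATEGORIES
};

/* Hypothetical user-settable options. */
struct shc_config {
    int      enabled[SHC_NCATEGORIES]; /* 0 turns a category off        */
    int      stop_on_error;            /* default 1: terminate on error */
    unsigned exit_delay_sec;           /* wait so other PEs can report  */
};

static struct shc_config shc_cfg = { {1, 1, 1, 1, 1, 1, 1}, 1, 5 };

/* Report an error found before the faulty call executes.
 * Returns 1 if a message was issued, 0 if the category is disabled. */
static int shc_report(enum shc_category cat, const char *what,
                      const char *file, int line)
{
    assert(cat < SHC_NCATEGORIES);
    if (!shc_cfg.enabled[cat])
        return 0;
    fprintf(stderr, "SHMEM-CHECK error: %s at %s:%d\n", what, file, line);
    if (shc_cfg.stop_on_error) {
        sleep(shc_cfg.exit_delay_sec); /* let other PEs issue messages */
        exit(EXIT_FAILURE);
    }
    return 1;
}
```

Keeping the category switches and the stop/continue choice in one structure means an instrumented program only needs a single call site per check.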
2.4. Testing and Verification
Testing and verification will be an important part of the development of SHMEM-CHECK. We
propose that OpenSHMEM run-time error tests be written for testing and verifying that SHMEM-CHECK can detect and issue quality error messages for all these tests. ISU has written such run-time error detection suites for serial, MPI and OpenMP errors in Fortran, C and C++ programs and
UPC errors in UPC programs. In addition, larger SHMEM programs should be used to test the
ability of SHMEM-CHECK to detect errors in complex programs. Errors will need to be inserted in
these SHMEM applications so they can be used for testing SHMEM-CHECK.
3. Error Detection Methodology for each Error Category
The following presents how we plan to implement run-time error detection for each run-time error
category. Appendix A presents a detailed listing of the SHMEM run-time error categories and the
sub-categories. In the following descriptions, comparisons which depend on arguments passed to
a SHMEM function will be written in terms of the formal arguments as given in the function
prototypes. E.g., the 3rd argument to shmem_put is the number of elements to send to another PE,
which appears as len in the function prototype. When the function (and hence function prototype)
involved is known, rather than refer to this as the value of the 3rd argument, we simply refer to this
as len in any calculations.
3.1. Detecting uninitialized objects used in SHMEM functions
To detect uninitialized objects used in SHMEM functions would require an amount of work greater
than that for implementing checks for uninitialized variables in a standard C program without
SHMEM functions.
Thus, the design plan for SHMEM-CHECK assumes that the compiler or tool already has a
mechanism for detecting uninitialized variables in the statements which do not contain SHMEM
functions. This method should be extended to cover the SHMEM functions.
In particular, the checks for uninitialized variables should be added before the calls to SHMEM
functions that have a source argument. The source array has to be symmetric, i.e. either it has to be a
non-stack variable or it has to be allocated by shmalloc. In the first case the variable will be zero-initialized, but SHMEM-CHECK has to issue a warning message unless all elements of the array in the
source argument are explicitly initialized in the program prior to the SHMEM function call. If the
source array is allocated by shmalloc and is never initialized before the call to the SHMEM
function, an error message should be issued.
In case the serial C-CHECK tool or compiler cannot be modified to support detection of
uninitialized objects in SHMEM functions, the following instrumentation can be done:

for the SHMEM functions that have a source argument, except remote read functions
(get/iget/g), add the following in the same line as the call to the function, right before the call
to the function:
{TYPE SHC_array[M]; for(int i=0; i<M; i++) SHC_array[i]=source[i];
where M is either nlong, nreduce, len or len*sst depending on the function (see Detecting
out-of-bounds array accesses in SHMEM functions section for details), TYPE is the type of
the source array. The closing curly bracket '}' should be added after the call to the function (in
the same line).

for SHMEM broadcast functions the following condition should be inserted right before the
'for' statement above:
if(_my_pe()==PE_root)
since only the values in the source array on the root PE are important.
3.2. Detecting out-of-bounds array accesses in SHMEM functions
Out-of-bounds array accesses in SHMEM functions that will not be detected by a serial C-CHECK
tool/compiler are caused by a wrong combination of values of function arguments. Detecting out-of-bounds array accesses in SHMEM functions would require an amount of work greater than that for
implementing checks for out-of-bounds array accesses in a standard C program without SHMEM
functions.
Thus, the design plan for SHMEM-CHECK assumes that the compiler or tool already has a
mechanism for detecting out-of-bounds array accesses in the statements which do not contain
SHMEM functions. This method should be extended to cover the SHMEM functions.
Below is the list of possible out-of-bounds array accesses in SHMEM functions that should be
detected by SHMEM-CHECK:

Use too small pSync array in collective functions (The OpenSHMEM manual requires that
the memory block starting at the address passed in pSync argument should be at least
_SHMEM_BARRIER_SYNC_SIZE elements long)

Use too small array in the source argument of a broadcast function (The OpenSHMEM
manual requires that the memory block starting at the address passed in source argument
should be at least nlong elements long)

Use too small array in the target argument of a broadcast function (The OpenSHMEM
manual requires that the memory block starting at the address passed in target argument
should be at least nlong elements long)

Use too small array in the source argument of a collect function (The OpenSHMEM manual
requires that the memory block starting at the address passed in source argument should be
at least nlong elements long)

Use too small array in the target argument of a collect function (The OpenSHMEM manual
requires that the memory block starting at the address passed in target argument should be at
least N elements long, where N is the sum of values in nlong argument on the PEs in the
active set)

Use too small array in the source argument of a reduction collective function (The
OpenSHMEM manual requires that the memory block starting at the address passed in
source argument should be at least nreduce elements long)

Use too small array in the target argument of a reduction collective function (The
OpenSHMEM manual requires that the memory block starting at the address passed in target
argument should be at least nreduce elements long)

Use too small pWrk array in collective reduction functions (The OpenSHMEM manual
requires that the memory block starting at the address passed in pWrk argument should be at
least max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements long)

Use too small array in the target argument of a remote write (put) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in target
argument should be at least len elements long)

Use too small array in the source argument of a remote write (put) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in
source argument should be at least len elements long)

Use too small array in the target argument of a remote read (get) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in target
argument should be at least len elements long)

Use too small array in the source argument of a remote read (get) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in
source argument should be at least len elements long)

Use too small array in the target argument of a remote strided write (iput) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in target
argument should be at least len*tst elements long)

Use too small array in the source argument of a remote strided write (iput) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in
source argument should be at least len*sst elements long)

Use too small array in the target argument of a remote strided read (iget) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in target
argument should be at least len*tst elements long)

Use too small array in the source argument of a remote strided read (iget) function (The
OpenSHMEM manual requires that the memory block starting at the address passed in
source argument should be at least len*sst elements long)
In case the serial C-CHECK tool or compiler cannot be modified to support detection of out-of-bounds array accesses in SHMEM functions, the following instrumentation can be done:

for the SHMEM functions that have a source argument, add the following in the same line as
the call to the function, right before the call to the function:
{TYPE SHC_array[M]; for(int i=0; i<M; i++) SHC_array[i]=source[i];
where M is either nlong, nreduce, len or len*sst depending on the function (see above the
requirements for the array sizes), TYPE is the type of the source array. The closing curly
bracket '}' should be added after the call to the function.

for SHMEM broadcast functions the following condition should be inserted right before the
'for' statement above:
if(_my_pe()==PE_root)
since only the values in the source array on the root PE are important.

for the SHMEM functions that have a target argument, add the following in the same line as
the call to the function, right before the call to the function:
{TYPE SHC_array[M]; for(int i=0; i<M; i++) target[i]=SHC_array[i];
where M is either nlong, nreduce, len or len*tst depending on the function (see above the
requirements for the array sizes), TYPE is the type of the target array. The closing curly
bracket '}' should be added after the call to the function.

for the SHMEM functions that have pSync argument add the following in the same line as
the call to the function, right before the call to the function:
for(int i=0; i<M; i++) if(pSync[i]!=K) issue_error();
where M is equal to _SHMEM_BARRIER_SYNC_SIZE, K is equal either to
_SHMEM_SYNC_VALUE (in SHMEM reduction functions and broadcast/collect
functions), or to 0 (in shmem_barrier).

for the SHMEM functions that have pWrk argument add the following in the same line as
the call to the function, right before the call to the function:
for(int i=0; i<M; i++) pWrk[i]=(TYPE)i;
where M = max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE), TYPE is
the type of the target array.
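The pWrk sizing rule and the touch loop above can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, the element type is fixed to long, and the constant _SHMEM_REDUCE_MIN_WRKDATA_SIZE is passed in as the parameter min_wrk so that the sketch does not depend on a SHMEM header.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pWrk elements required for a reduction over nreduce elements:
 * max(nreduce/2 + 1, min_wrk). min_wrk stands in for
 * _SHMEM_REDUCE_MIN_WRKDATA_SIZE (an assumption of this sketch). */
static size_t shc_pwrk_elems(size_t nreduce, size_t min_wrk)
{
    size_t need = nreduce / 2 + 1;
    return need > min_wrk ? need : min_wrk;
}

/* Write every required pWrk element, so that a serial out-of-bounds
 * checker sees an access to each index the reduction may use. */
static void shc_touch_pwrk(long *pWrk, size_t nreduce, size_t min_wrk)
{
    assert(pWrk != NULL);
    size_t m = shc_pwrk_elems(nreduce, min_wrk);
    for (size_t i = 0; i < m; i++)
        pWrk[i] = (long)i;
}
```

The loop itself detects nothing; it relies on the serial bounds checker assumed in this section to flag the access when pWrk is too small.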
3.3. Detection of SHMEM memory related errors
If the compiler or tool already has a mechanism for detecting memory leaks and use of dangling
pointers, then this method should be extended to cover the SHMEM symmetric heap memory
functions. Otherwise the following method will be used.
To detect SHMEM memory related errors, the calls to SHMEM symmetric heap memory functions
will be tracked.
A global array SHMB (Symmetric Heap Memory Blocks) will be used to store the following
information about symmetric heap memory blocks:

the starting address

the size of the memory block in bytes

number of pointers that point to a location within the memory block

the location of the memory allocation call (file name and line number)

the location of the memory deallocation call (file name and line number)
Array SHMB should be big enough to hold information about all calls to SHMEM symmetric heap
memory allocation functions in the program.
The number of initialized elements in SHMB will be stored in the global variable NSHMA (number
of symmetric heap memory allocations).
The following information about pointers in the program that point to symmetric heap memory will
be stored in the global array PI on each PE:

the address of the pointer

memblock - the index of the element in SHMB that describes the memory block to which
the pointer points

the location of the last assignment to the pointer (file name and line number)
After each call to shmalloc, shrealloc or shmemalign the following will be done:

The following information will be stored in SHMB:
o the starting address
o the size of the memory block in bytes
o the location of the memory allocation call
o number of pointers that point to a location within the memory block will be
initialized with 1
The following information will be stored in PI (if the address of the pointer on the left side of the
call is already recorded in PI and memblock >=0, then first decrease the number of pointers that
point to a location within the memory block by 1 in element memblock in SHMB):
o the address of the pointer on the left side of the call
o memblock will be initialized with the value of NSHMA
o the location of the call (file name and line number)

The value of NSHMA will be increased by 1
After each call to shfree or shrealloc, the following will be done:

The following changes will be done for the corresponding element in SHMB:
o the starting address will be replaced with -1
o the location of the memory deallocation call will be recorded
o number of pointers that point to a location within the memory block will be
decreased by 1

The following changes will be done for the corresponding element in PI:
o memblock will be set to -1
Before each pointer assignment statement of kind ptr1=ptr2[+i] the following will be done after the
check for memory leak described in section 3.3.3 below:
if the address of ptr1 is recorded in PI and memblock_ptr1>=0,

in element memblock_ptr1 in SHMB decrease the number of pointers that point to a location
within the memory block by 1

if the address of ptr2 is recorded in PI and memblock_ptr2>=0,
o do the following changes in PI:

initialize memblock_ptr1 with the value of memblock_ptr2

record the location of the assignment statement (file name and line number)
o in element memblock_ptr2 in SHMB increase the number of pointers that point to a
location within the memory block by 1

else (if either the address of ptr2 is not in PI, or if memblock_ptr2<0)
o do the following changes in PI:

memblock_ptr1 = -1

record the location of the assignment statement (file name and line number)
if the address of ptr1 is not in PI or if memblock_ptr1<0,

if the address of ptr2 is recorded in PI and memblock_ptr2>=0,
o store the following information in PI (either in the new element or replace the existing
information if address of ptr1 already appears in PI)

the address of the pointer ptr1

initialize memblock_ptr1 with the value of memblock_ptr2

the location of the assignment statement (file name and line number)
o in element memblock_ptr2 in SHMB increase the number of pointers that point to a
location within the memory block by 1
Before every return statement in a program (except return statements in the main function) and at
the end of all basic blocks, array PI will be searched for the addresses of the non-static pointers
declared in the current functions or blocks. For these entries memblock will be set to -1 and in the
corresponding elements in SHMB the number of pointers that point to a location within the memory
block will be decreased.
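A minimal sketch of the SHMB and PI tables and of the allocation and deallocation bookkeeping described above follows. The struct layouts, the fixed capacities and the helper names are assumptions made for illustration. One deliberate deviation is labeled in the code: instead of overwriting the starting address with -1 on deallocation, the sketch keeps the address and sets a freed flag.

```c
#include <assert.h>
#include <stddef.h>

#define SHC_MAX_BLOCKS   1024   /* assumed capacity of SHMB */
#define SHC_MAX_POINTERS 1024   /* assumed capacity of PI   */

struct shc_loc { const char *file; int line; };

struct shc_block {              /* one element of SHMB */
    void  *start;               /* starting address */
    size_t size;                /* size in bytes */
    int    nptrs;               /* pointers into the block */
    int    freed;               /* stands in for "address replaced with -1" */
    struct shc_loc alloc_loc, free_loc;
};

struct shc_ptr {                /* one element of PI */
    void **addr;                /* address of the pointer variable */
    int    memblock;            /* index into SHMB, or -1 */
    struct shc_loc assign_loc;
};

static struct shc_block SHMB[SHC_MAX_BLOCKS];
static struct shc_ptr   PI[SHC_MAX_POINTERS];
static int NSHMA, NPI;

/* Return the PI index for a pointer variable, or -1 if not recorded. */
static int shc_find_ptr(void **addr)
{
    for (int i = 0; i < NPI; i++)
        if (PI[i].addr == addr)
            return i;
    return -1;
}

/* Bookkeeping after "*addr = shmalloc(size)" (or shmemalign). */
static void shc_record_alloc(void **addr, size_t size,
                             const char *file, int line)
{
    int p = shc_find_ptr(addr);
    if (p >= 0 && PI[p].memblock >= 0)
        SHMB[PI[p].memblock].nptrs--;      /* pointer leaves its old block */
    if (p < 0) {
        assert(NPI < SHC_MAX_POINTERS);
        p = NPI++;
        PI[p].addr = addr;
    }
    assert(NSHMA < SHC_MAX_BLOCKS);
    SHMB[NSHMA] = (struct shc_block){ *addr, size, 1, 0,
                                      { file, line }, { NULL, 0 } };
    PI[p].memblock = NSHMA;
    PI[p].assign_loc = (struct shc_loc){ file, line };
    NSHMA++;
}

/* Bookkeeping after "shfree(*addr)". */
static void shc_record_free(void **addr, const char *file, int line)
{
    int p = shc_find_ptr(addr);
    if (p < 0 || PI[p].memblock < 0)
        return;                            /* left to the check in 3.3.1 */
    struct shc_block *b = &SHMB[PI[p].memblock];
    b->freed = 1;
    b->free_loc = (struct shc_loc){ file, line };
    b->nptrs--;
    PI[p].memblock = -1;
}
```

A linear search over PI keeps the sketch short; a real tool would likely need a hash table keyed on the pointer address to keep run-time overhead low.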
The following subsections describe how the specified memory related errors will be detected.
3.3.1. Call heap memory management functions on pointers that do not point to a memory block allocated via a call to shmalloc, shmemalign or shrealloc
To check for this error, before every call to shrealloc(ptr,..) or shfree(ptr), search PI for the address
of ptr

if there is no entry for ptr in PI, that means that the pointer ptr never pointed to a symmetric
heap memory block, thus issue an error message

if there is an entry for ptr in PI
o if memblock == -1, that means that the pointer ptr no longer points to a symmetric
heap memory block, thus issue an error message
o otherwise compare the value of ptr with the address in the element memblock in
SHMB; if they do not match, that means that the pointer does not point to the
beginning of the memory block, thus issue an error message.
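The three cases above map onto a small decision function. The sketch below uses reduced, hypothetical versions of the SHMB and PI elements (only the fields this check reads) and returns a code instead of printing a message. The caller is assumed to pass NULL when the pointer has no entry in PI.

```c
#include <assert.h>
#include <stddef.h>

struct shc_block { void *start; int freed; };     /* reduced SHMB element */
struct shc_ptr   { void **addr; int memblock; };  /* reduced PI element   */

enum shc_heap_arg {
    SHC_HEAP_ARG_OK,
    SHC_NEVER_SYMMETRIC,      /* no entry in PI                      */
    SHC_NO_LONGER_SYMMETRIC,  /* entry exists but memblock == -1     */
    SHC_NOT_BLOCK_START       /* points inside a block, not at start */
};

/* Check the pointer argument of shrealloc(ptr, ..) or shfree(ptr).
 * entry is the PI element for the pointer variable, or NULL if none. */
static enum shc_heap_arg
shc_check_heap_arg(const struct shc_ptr *entry,
                   const struct shc_block *shmb, const void *ptr)
{
    assert(shmb != NULL);
    if (entry == NULL)
        return SHC_NEVER_SYMMETRIC;
    if (entry->memblock == -1)
        return SHC_NO_LONGER_SYMMETRIC;
    if (shmb[entry->memblock].start != ptr)
        return SHC_NOT_BLOCK_START;
    return SHC_HEAP_ARG_OK;
}
```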
3.3.2. Use of dangling pointers
The use of dangling pointers (when a pointer points to freed memory) will be detected as follows:
whenever a pointer appears on the right hand side of an assignment statement, or is dereferenced,
search PI for the address of ptr. If the address of ptr is found in PI and memblock>=0, check the
beginning address of the memory block memblock in SHMB. If it's equal to -1, that means that the
memory block was already freed, thus issue an error message.
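As a sketch, the dangling-pointer test reduces to one lookup. The struct and function names are hypothetical, and a freed flag stands in for the starting address having been replaced with -1.

```c
#include <assert.h>
#include <stddef.h>

struct shc_block { void *start; int freed; };     /* reduced SHMB element */
struct shc_ptr   { void **addr; int memblock; };  /* reduced PI element   */

/* Returns 1 when the pointer described by entry still refers to a block
 * that has already been freed. entry is NULL when the pointer is not
 * recorded in PI. */
static int shc_is_dangling(const struct shc_ptr *entry,
                           const struct shc_block *shmb)
{
    assert(shmb != NULL);
    if (entry == NULL || entry->memblock < 0)
        return 0;
    return shmb[entry->memblock].freed;
}
```

Under the bookkeeping in section 3.3, the pointer passed to shfree has its own memblock reset to -1, so this test fires for other pointers that were assigned from it and still reference the freed block.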
3.3.3. Memory Leak - reassigning a pointer before deallocating
This memory leak will be detected as follows:
When a pointer ptr1 appears on the left-hand-side of an assignment statement of kind ptr1=ptr2[+i]
the following checks will be done prior to execution of this statement:

If there is no entry for ptr1 in PI, then this error did not occur and no further checking for
this error is needed.

If there is an entry for ptr1 in PI and if the value of memblock for ptr2 is equal to the value
of memblock for ptr1 then this error does not occur and no further checking for this error is
needed.

If there is an entry for ptr1 in PI and if the value of memblock for ptr2 is different from the
value of memblock for ptr1, check in SHMB how many pointers point to a location within
the memory block memblock for ptr1. If it's equal to 1, that means that after assignment
statement, no pointers will be pointing to the memory block, thus issue an error message.
When a pointer ptr appears on the left-hand-side of an assignment statement of kind ptr=f(..), where
f is one of shmalloc, shrealloc or shmemalign, the following checks will be done prior to execution
of this statement:

If there is no entry for ptr in PI, then this error did not occur and no further checking for
this error is needed.

If there is an entry for ptr in PI and memblock >= 0, check in SHMB how many pointers
point to a location within the memory block memblock. If it's equal to 1, that means that
after assignment statement, no pointers will be pointing to the memory block, thus issue an
error message.
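Both assignment forms above can be covered by one predicate, sketched below with hypothetical names and reduced SHMB and PI elements. The right-hand side is passed as NULL for the ptr=f(..) form and for a ptr2 that is not recorded in PI. Treating a left-hand pointer whose memblock is negative as "no leak" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct shc_block { void *start; int nptrs; };     /* reduced SHMB element */
struct shc_ptr   { void **addr; int memblock; };  /* reduced PI element   */

/* Returns 1 if executing the assignment would leave the block currently
 * referenced by the left-hand pointer without any pointer to it.
 * lhs/rhs are PI elements, or NULL when the pointer is not in PI. */
static int shc_leak_on_assign(const struct shc_ptr *lhs,
                              const struct shc_ptr *rhs,
                              const struct shc_block *shmb)
{
    assert(shmb != NULL);
    if (lhs == NULL || lhs->memblock < 0)
        return 0;                        /* nothing to lose */
    int rhs_block = (rhs != NULL) ? rhs->memblock : -1;
    if (rhs_block == lhs->memblock)
        return 0;                        /* still points into same block */
    return shmb[lhs->memblock].nptrs == 1;
}
```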
3.3.4. Memory Leak - leaving a block before freeing memory that was allocated using non-static pointer declared and allocated within the block
As described above in 3.3, before every return statement in a program (except return statements in
the main function) and at the end of all basic blocks, array PI will be searched for the addresses of
the non-static pointers declared in the current functions or blocks. Before setting memblock to -1
for these entries, and before decreasing the number of pointers that point to a location within the
memory block in the corresponding elements in SHMB, it will be checked whether the number of
pointers that point to a location within the memory block is equal to 1. If it is the case, that means
that after returning from the function or basic block, no pointers will be pointing to the memory
block, thus an error message will be issued.
3.4. Detecting Argument Errors in SHMEM Functions
To allow all arguments to be checked in a single run of the SHMEM program, all of the argument
checks listed below will be made before aborting the program. The following subsections describe
how the specified argument errors will be detected.
3.4.1. Use non-symmetric data objects as arguments that are required to be remotely accessible
To detect this error, before every call to a SHMEM function that has any of the 'source', 'target',
'pSync' and 'pWrk' arguments, which are required to be remotely accessible, it will be checked that
the addresses in those arguments are either associated with static variables or are within one of the
blocks listed in the array SHMB defined in section 3.3 Detection of SHMEM memory related
errors.
3.4.2. Errors in pSync argument
According to the OpenSHMEM manual, elements of pSync array have to be initialized either with
the value _SHMEM_SYNC_VALUE (in SHMEM reduction functions and broadcast/collect
functions), or with 0 (in shmem_barrier). Thus before the calls to the SHMEM reduction functions
and broadcast/collect functions the values of elements of pSync array have to be compared with
_SHMEM_SYNC_VALUE, and before the calls to shmem_barrier the values of elements of pSync
array have to be compared with 0.
The same instrumentation as in Detecting out-of-bounds array accesses in SHMEM functions
section will be used to detect this error:
for the SHMEM functions that have pSync argument add the following in the same line as the call
to the function, right before the call to the function:
for(int i=0; i<M; i++) if(pSync[i]!=K) issue_error();
where M is equal to _SHMEM_BARRIER_SYNC_SIZE, K is equal either to
_SHMEM_SYNC_VALUE (in SHMEM reduction functions and broadcast/collect functions), or to
0 (in shmem_barrier).
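The pSync comparison can be sketched as a small helper. The function name is hypothetical; m and expected are parameters standing in for _SHMEM_BARRIER_SYNC_SIZE and for _SHMEM_SYNC_VALUE or 0, so the sketch does not require a SHMEM header.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first pSync element that differs from the
 * expected initial value, or -1 if all m elements match. */
static long shc_first_bad_psync(const long *pSync, size_t m, long expected)
{
    assert(pSync != NULL);
    for (size_t i = 0; i < m; i++)
        if (pSync[i] != expected)
            return (long)i;
    return -1;
}
```

Returning the offending index rather than a plain yes/no lets the error message name the element that was not initialized correctly.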
3.4.3. Use overlapping (but not the same) arrays as source and target arguments in SHMEM collective reduction functions
To detect this error the values of the source, target and nreduce arguments will be examined. An
error will be issued in the following cases:

source<target && source+nreduce>target

source>target && target+nreduce >source
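The two conditions above can be written as one predicate. The sketch assumes long elements and a hypothetical function name, and it compares addresses as integers (uintptr_t) because relational comparison of pointers into unrelated objects is not defined by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if source and target overlap without being the same array,
 * where both hold nreduce elements of type long. */
static int shc_reduce_overlap(const long *source, const long *target,
                              size_t nreduce)
{
    assert(source != NULL && target != NULL);
    uintptr_t s = (uintptr_t)source, t = (uintptr_t)target;
    uintptr_t bytes = (uintptr_t)(nreduce * sizeof *source);
    if (s < t)
        return s + bytes > t;
    if (s > t)
        return t + bytes > s;
    return 0;   /* identical arrays are allowed */
}
```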
3.4.4. Wrong values of the pe, PE_root, PE_size, PE_stride and PE_start arguments
To detect these errors, before the call to a SHMEM function, the following will be checked:

the value of pe argument in SHMEM put, get, fetch-op and atomic memory operation
functions is not negative and is less than the number of PEs

the value of PE_start argument in SHMEM collective functions is not negative and is less
than the number of PEs

the value of PE_size argument in SHMEM collective functions is positive and does not
exceed the number of PEs

the value of PE_stride argument in SHMEM collective functions is not negative

the values of PE_start, PE_stride and PE_size arguments in SHMEM collective functions
satisfy the following condition: PE_start+(2**PE_stride)*(PE_size-1) < number of PEs

the value of PE_root argument in SHMEM broadcast functions is not negative and is less
than the value of PE_size argument
An error message will be issued when any one of these conditions is not satisfied.
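A sketch of these range checks follows, with hypothetical helper names and the number of PEs passed in as npes. PE_stride is treated as the base-2 logarithm of the stride, matching the 2**PE_stride expression above. The sketch accepts PE_size values from 1 up to and including the number of PEs, so that an active set containing every PE passes.

```c
#include <assert.h>

/* pe argument of put, get, fetch-op and atomic memory operations. */
static int shc_pe_ok(int pe, int npes)
{
    assert(npes > 0);
    return pe >= 0 && pe < npes;
}

/* PE_start, PE_stride and PE_size arguments of a collective call. */
static int shc_active_set_ok(int PE_start, int PE_stride, int PE_size,
                             int npes)
{
    assert(npes > 0);
    if (PE_start < 0 || PE_start >= npes)
        return 0;
    if (PE_stride < 0 || PE_stride > 30)   /* keep the shift well defined */
        return 0;
    if (PE_size < 1 || PE_size > npes)
        return 0;
    /* Rank of the last PE in the active set must be a valid PE. */
    long long last = PE_start + (1LL << PE_stride) * (long long)(PE_size - 1);
    return last < npes;
}

/* PE_root argument of a broadcast, relative to the active set. */
static int shc_root_ok(int PE_root, int PE_size)
{
    return PE_root >= 0 && PE_root < PE_size;
}
```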
3.4.5. Call a collective function by a PE not in the active set
To detect this error, the following check will be inserted before the call to a SHMEM collective
function: check whether there is a whole number 'i', 0<=i<PE_size, so that the rank of the PE that
calls the SHMEM collective function is equal to PE_start+(2**PE_stride)*i . If there is no such 'i',
the error message will be issued.
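The membership test can be sketched without an explicit search over i by solving me = PE_start + (2**PE_stride)*i for i. The function name and the parameter me (the rank of the calling PE) are hypothetical.

```c
#include <assert.h>

/* Returns 1 if PE rank me belongs to the active set described by
 * PE_start, PE_stride (log2 of the stride) and PE_size. */
static int shc_in_active_set(int me, int PE_start, int PE_stride,
                             int PE_size)
{
    assert(PE_stride >= 0 && PE_stride <= 30);
    long step = 1L << PE_stride;
    long off  = (long)me - PE_start;
    if (off < 0 || off % step != 0)
        return 0;                  /* before the start or between strides */
    return off / step < PE_size;   /* i = off/step must be below PE_size  */
}
```

This gives the same answer as looping over 0<=i<PE_size, but in constant time, which matters because the check runs before every collective call.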
3.5. Detecting actual and potential deadlocks
An actual deadlock occurs when something is waiting for an event that will never happen. We say
that a SHMEM program has a potential deadlock when it will produce an actual deadlock under at
least one valid OpenSHMEM implementation. This means that a potential deadlock may not be an actual
deadlock for some OpenSHMEM implementations. Both actual and potential deadlocks are errors.
In this document the word "deadlock" refers to both actual and potential deadlocks.
The OpenSHMEM manual does not explicitly place a constraint on the order of calls to the
collective functions. However it is possible for some valid OpenSHMEM implementations that a
program will deadlock or produce other side-effects when collective functions are not called by all
PEs in the active set in the same order. Thus the tool will detect such cases and issue an error
message.
The following subsections describe how the specified deadlock errors will be detected.
3.5.1. Not every process in the active group calls a barrier function, a symmetric heap memory management function or a collective function with identical argument(s) that are required to be single-valued in the same order
The proposed method requires synchronization of all PEs in the active set with the root PE before
every call to a collective function. In this method the instrumented program will issue a warning
message if some PEs are waiting at the call for too long. The length of the waiting period, and
whether the program should be stopped, will be set before executing the program. The deadlock
check should be performed after the checks described in the "Detecting Argument Errors in SHMEM
Functions" chapter. A detailed description of this method follows.
Before every collective function, a hand-shaking check will be inserted to test in a while loop
whether all PEs in the active set have arrived at the same collective function with the same
arguments when required by the OpenSHMEM manual.
The hand-shaking check will be done by first declaring global arrays shmem_cf_notify and
shmem_cf_wait. To record the name of the collective function and argument information, as well
as the location (the file name and line number) of the call, a global structure object info_shmem_cf
will be declared. Therefore, the following declaration statements will appear in the instrumented
program outside the main function:
struct SHMEM_cf_and_args info_shmem_cf;
int shmem_cf_notify[PES];
int shmem_cf_wait[PES];
Before the call to a collective function the following will be inserted:
Set root_PE to PE_start for those functions that have PE_start argument or to 0 for barrier,
shmem_barrier_all and symmetric heap memory management functions.
If the rank of the calling PE is not equal to root_PE
Then {
    record the collective function name and argument information in info_shmem_cf
    while (shmem_cf_notify[myPE] on root_PE != 0) {}
    set shmem_cf_notify[myPE] on root_PE to 1
    while (shmem_cf_wait[root_PE] != 1) {
        after specified amount of time issue a warning
        stop execution after certain number of warnings or when prompted by user
    }
    shmem_cf_wait[root_PE] = 0;
}
Else {
    count=1;
    while (count < PE_size) {
        for each rank_PE in the active set except root_PE {
            if (shmem_cf_notify[rank_PE] ==1) {
                compare function name and argument information in
                info_shmem_cf on rank_PE with own call;
                if it matches
                    {set shmem_cf_wait[root_PE] to 1 on rank_PE}
                else
                    {issue an error message and stop execution}
                shmem_cf_notify[rank_PE]=2;
                count++;
            }
        }
        after specified amount of time issue a warning
        stop execution after certain number of warnings
    }
    for each rank_PE in the active set except root_PE
        shmem_cf_notify[rank_PE]=0;
}
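The comparison performed by the root PE can be illustrated with a single-process C sketch. Everything here is an assumption made for illustration: the layout of struct cf_call (the document does not define the members of info_shmem_cf), the helper names, the choice of PE 0 as root, and the use of local arrays in place of remote reads and writes of symmetric data. Timeouts and warnings are omitted.

```c
#include <string.h>

#define MAX_ARGS 4

/* Hypothetical layout for the record kept in info_shmem_cf. */
struct cf_call {
    const char *name;      /* collective function name */
    int nargs;             /* number of single-valued arguments recorded (<= MAX_ARGS) */
    long args[MAX_ARGS];   /* values of those arguments */
};

/* Return 1 when both records name the same function with the same arguments. */
int cf_call_matches(const struct cf_call *a, const struct cf_call *b)
{
    if (strcmp(a->name, b->name) != 0 || a->nargs != b->nargs)
        return 0;
    for (int i = 0; i < a->nargs; i++)
        if (a->args[i] != b->args[i])
            return 0;
    return 1;
}

/* One pass of the root's scan over PEs 1..npes-1 (root assumed to be PE 0).
   notify[pe] == 1 means the PE has arrived and has not been checked yet.
   Returns the number of newly matched PEs, or -1 on the first mismatch. */
int root_scan_once(const struct cf_call *own, const struct cf_call info[],
                   int notify[], int release[], int npes)
{
    int matched = 0;
    for (int pe = 1; pe < npes; pe++) {
        if (notify[pe] != 1)
            continue;              /* not arrived yet, or already checked */
        if (!cf_call_matches(own, &info[pe]))
            return -1;             /* the tool would issue an error and stop */
        release[pe] = 1;           /* stands in for setting shmem_cf_wait on that PE */
        notify[pe] = 2;            /* mark as checked */
        matched++;
    }
    return matched;
}
```

In the instrumented program the root would repeat such a pass until count reaches PE_size, issuing warnings when the wait exceeds the configured time.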
3.5.2. Deadlock at the call to a wait function or shmem_set_lock
SHMEM wait functions wait for argument ivar to be changed by a remote write or atomic swap
issued by a different processor. If the awaited change does not happen, the process will be waiting
at the call until the program execution is interrupted. This is one source of deadlock.
Another source of deadlock is caused by incorrect use of shmem lock functions. A PE will deadlock
at the call to shmem_set_lock(&lock) if another PE that is holding the lock never releases it by
calling shmem_clear_lock(&lock).
To let the user know about a PE waiting at the call to a SHMEM wait function or to shmem_set_lock,
the following will be done in addition to the algorithm described in the previous section:
1. Four (global) objects will be declared:
a. wait_counter to keep count of waiting PEs,
b. a structure wait_loc to record the location (line number and file name) of the calls to
SHMEM wait functions,
c. a structure set_lock_loc to record the location of the calls to shmem_set_lock,
d. an array of structures list_locks to keep track of locks being held.
2. Before every call to a SHMEM wait function
a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, that
means that if the current PE were to call the wait function, the program would
deadlock, thus issue the deadlock error message. Otherwise
b. record the location (line number and file name) of the call in the wait_loc object
and
c. increase the wait_counter on PE0.
3. After the call to a SHMEM wait function
a. clear the wait_loc object and
b. decrease the wait_counter on PE0.
4. Before every call to shmem_set_lock(&lock)
a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, then
check in list_locks (on PE0) whether the lock is being held by another PE. If it is,
that means that if the current PE were to call shmem_set_lock, the program would
deadlock, thus issue the deadlock error message. Otherwise
b. record the location (line number and file name) of the call in the set_lock_loc object
and
c. increase the wait_counter on PE0.
5. After the call to shmem_set_lock,
a. record the address of the variable lock and the PE number in list_locks (on PE0),
b. clear the set_lock_loc object and
c. decrease the wait_counter on PE0.
6. After every call to shmem_test_lock, check the return value; if it is equal to 0, record the
address of the variable lock and the PE number in list_locks (on PE0).
7. After every call to shmem_clear_lock, remove the address of the variable lock and the PE
number from list_locks (on PE0).
8. Before every return statement in the main() function
a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, that
means that if the current PE were to return from the program, the program would
deadlock, thus issue the error message. Otherwise
b. increase the wait_counter on PE0.
9. Before every call to a barrier, symmetric heap memory function or a collective function, if
the rank of the calling PE is not equal to root_PE, perform the following before the "hand-shaking" algorithm described in the previous section:
a. check the value of wait_counter on PE0. If wait_counter == number of PEs -1, that
means that if the current PE were to call the barrier, symmetric heap memory function
or a collective function, the program would deadlock (since it is not the last PE in the
active set of PEs - still there is root_PE left), thus issue the error message. Otherwise
b. record the location (line number and file name) of the call in the info_shmem_cf
object and
c. increase the wait_counter on PE0.
10. After the call to the barrier, symmetric heap memory function or collective function
decrease the wait_counter on PE0.
Since the detection of a deadlock may be deferred until some PE reaches the end of the program, the
above method may be enhanced by making PEs waiting at a call to a SHMEM wait function or to
shmem_set_lock issue a warning after a specified period of time (similar to the "hand-shaking"
algorithm described in the previous section).
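The wait_counter bookkeeping in items 2 and 3 can be sketched in C as below. This is a single-process model under stated assumptions: struct wait_tracker and both function names are hypothetical, and the counter is a plain local integer, whereas in the tool it would live on PE0 and would presumably have to be updated with atomic memory operations. The extra list_locks test required for shmem_set_lock (item 4) is not modeled.

```c
#include <stdbool.h>

/* Hypothetical local model of the wait_counter kept on PE0. */
struct wait_tracker {
    int npes;          /* total number of PEs */
    int wait_counter;  /* PEs currently blocked in a wait or already finished */
};

/* Inserted before a blocking SHMEM call. Returns false when every other PE
   is already waiting, i.e. blocking now would deadlock the program;
   otherwise counts the caller as waiting and returns true. */
bool before_blocking_call(struct wait_tracker *t)
{
    if (t->wait_counter == t->npes - 1)
        return false;          /* the tool would issue the deadlock error here */
    t->wait_counter++;
    return true;
}

/* Inserted after the blocking call returns: the caller is running again. */
void after_blocking_call(struct wait_tracker *t)
{
    t->wait_counter--;
}
```

With three PEs, the first two callers are counted as waiting and the third is reported as a deadlock, because no PE would remain to perform the awaited update.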
3.6. Detecting Wrong Order of SHMEM Functions
3.6.1. Call a SHMEM function before the call to start_pes
The start_pes routine identifies the number of processes for a program. This statement must be the
first statement in a program that uses distributed, shared memory (SHMEM) communication
routines. Even though the manual does not explicitly prohibit multiple calls to start_pes, doing so
may still lead to problems. To detect those cases when start_pes is called more than once or when a
SHMEM function is called before the call to start_pes, the following will be done:

A global variable start_pes_called will be defined and initially set to 0.

Before a call to start_pes, the executing PE will check the value of start_pes_called. If it is
equal to 1, an error message will be issued. If the value is equal to 0, execution will
continue.

After a call to start_pes, start_pes_called will be assigned the value 1.

Before every call to a SHMEM function except start_pes, the executing PE will check the
value of start_pes_called. If it is equal to 0, an error message will be issued. If the value is
equal to 1, execution will continue.
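A minimal C sketch of these three checks follows. Only the variable start_pes_called comes from the design; the function names and the convention of returning 1 for an error are assumptions made for illustration.

```c
/* Flag from the design: 0 until start_pes has been called, 1 afterwards. */
static int start_pes_called = 0;

/* Inserted before a call to start_pes.
   Returns 1 (error) if start_pes has already been called, 0 otherwise. */
int check_before_start_pes(void)
{
    return start_pes_called == 1;
}

/* Inserted after a call to start_pes. */
void mark_start_pes_called(void)
{
    start_pes_called = 1;
}

/* Inserted before every other SHMEM call.
   Returns 1 (error) if start_pes has not been called yet, 0 otherwise. */
int check_before_shmem_call(void)
{
    return start_pes_called == 0;
}
```

The instrumented program would issue the corresponding error message whenever one of the check functions returns 1.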
3.6.2. Unlock lock being held by another PE
The OpenSHMEM manual does not explicitly state that only the PE that holds the lock can clear it;
however, such a requirement would make sense. Therefore SHMEM-CHECK will check for this
error. To detect the error, the array list_locks described in section "Deadlock at the call to a wait
function or shmem_set_lock" will be used. As described in that section, after a PE successfully sets
the lock, it records its rank and the address of the lock variable in list_locks on PE0. To detect
this wrong-order error, the following will be done:
1. After the call to shmem_set_lock, record the address of the variable lock and the PE number
in list_locks on PE0.
2. After every call to shmem_test_lock, check the return value; if it is equal to 0, record the
address of the variable lock and the PE number in list_locks on PE0.
3. Before every call to shmem_clear_lock, check in list_locks on PE0 whether the calling PE holds
the lock; if not, issue an error message.
4. After every call to shmem_clear_lock, remove the address of the variable lock and the PE
number from list_locks on PE0.
Items 1, 2, and 4 above are being done in the section "Deadlock at the call to a wait function or
shmem_set_lock", and are included here for clarity.
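The list_locks bookkeeping can be sketched in C as below. The sketch assumes a fixed-size local table; the struct layout, the capacity MAX_LOCKS and all function names are hypothetical, and in the tool the table would reside on PE0 and be accessed remotely.

```c
#include <stddef.h>

#define MAX_LOCKS 64   /* assumed capacity, chosen only for this sketch */

/* Hypothetical entry of list_locks: which PE holds which lock. */
struct held_lock {
    const long *lock;  /* address of the lock variable */
    int pe;            /* rank of the PE holding it */
};

static struct held_lock list_locks[MAX_LOCKS];
static size_t n_locks = 0;

/* Return the rank of the PE holding `lock`, or -1 if it is not held. */
int lock_holder(const long *lock)
{
    for (size_t i = 0; i < n_locks; i++)
        if (list_locks[i].lock == lock)
            return list_locks[i].pe;
    return -1;
}

/* After shmem_set_lock, or after shmem_test_lock returned 0.
   Returns 0 on success, -1 when the table is full. */
int record_lock(const long *lock, int pe)
{
    if (n_locks == MAX_LOCKS)
        return -1;
    list_locks[n_locks].lock = lock;
    list_locks[n_locks].pe = pe;
    n_locks++;
    return 0;
}

/* Before shmem_clear_lock: 1 if `pe` holds the lock, 0 otherwise (error). */
int may_clear_lock(const long *lock, int pe)
{
    return lock_holder(lock) == pe;
}

/* After shmem_clear_lock: remove the entry by moving the last one into its slot. */
void remove_lock(const long *lock)
{
    for (size_t i = 0; i < n_locks; i++)
        if (list_locks[i].lock == lock) {
            list_locks[i] = list_locks[--n_locks];
            return;
        }
}
```

A call to shmem_clear_lock by a PE for which may_clear_lock returns 0 is the error this section reports.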
3.6.3. Call two collective SHMEM routines with the same pSync and/or pWrk arguments and no shmem_barrier or shmem_barrier_all call in between
With the exception of shmem_barrier, it is erroneous to use the same pSync array in two
consecutive calls to SHMEM collective functions without intervening barrier synchronization. In
addition, a pWrk array can be reused in a subsequent reduction routine call only if none of the PEs
in the active set are still processing a prior reduction routine call that used the same pWrk array. To
detect when a pSync or pWrk array is being reused without intervening barrier synchronization, the
following will be done:

A global variable pes_synchronized will be defined and initially set to 0.

After every call to shmem_barrier, barrier, shmem_barrier_all or a symmetric heap memory
management function (which calls shmem_barrier_all), pes_synchronized will be assigned
the value 1.

Four global variables ba_pSync, ne_pSync, ba_pWrk and ne_pWrk to record the beginning
addresses and number of elements of the arrays in pSync and pWrk arguments of the
SHMEM collective function will be defined and initialized with 0.

Before every call to a SHMEM collective function except shmem_barrier, check the value
of pes_synchronized. If it is equal to 0
  o check that the array in pSync argument is different from the one used in the previous call
    to a SHMEM collective function (recorded in ba_pSync and ne_pSync); an error will be
    issued in the following cases:
      pSync == ba_pSync
      pSync < ba_pSync && pSync+size > ba_pSync
      pSync > ba_pSync && pSync < ba_pSync+ne_pSync
    where size is one of the following:
      _SHMEM_REDUCE_SYNC_SIZE
      _SHMEM_COLLECT_SYNC_SIZE
      _SHMEM_BCAST_SYNC_SIZE
    depending on the SHMEM collective function.
  o for the reduction functions also check that the array in pWrk argument is different from
    the one used in the previous call to a SHMEM collective reduction function (recorded in
    ba_pWrk and ne_pWrk); an error will be issued in the following cases:
      pWrk == ba_pWrk
      pWrk < ba_pWrk && pWrk+size > ba_pWrk
      pWrk > ba_pWrk && pWrk < ba_pWrk+ne_pWrk
    where size == max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE)

After every call to a SHMEM collective function except shmem_barrier, do the following:
  o for reduction functions:
      ba_pSync = pSync
      ne_pSync = _SHMEM_REDUCE_SYNC_SIZE
      ba_pWrk = pWrk
      ne_pWrk = max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE)
  o for broadcast functions:
      ba_pSync = pSync
      ne_pSync = _SHMEM_BCAST_SYNC_SIZE
  o for collect/fcollect functions:
      ba_pSync = pSync
      ne_pSync = _SHMEM_COLLECT_SYNC_SIZE
  o for all SHMEM collective functions set pes_synchronized to 0.
The above method will catch errors related to reuse of pSync and pWrk in successive calls without
synchronization.
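The three error cases listed for pSync (and likewise for pWrk) amount to a test for overlap between the new array and the previously recorded one. A possible C sketch is given below; the function name is hypothetical, and the comparison is done on integer addresses (an assumption of this sketch) so that pointers into unrelated arrays are never compared directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: do the arrays [p, p+size) and [ba, ba+ne) overlap?
   p/size describe the array passed to the current call, ba/ne the array
   recorded after the previous collective call (both 0 initially). */
int sync_arrays_overlap(const long *p, size_t size, const long *ba, size_t ne)
{
    uintptr_t p_lo = (uintptr_t)p;
    uintptr_t p_hi = p_lo + size * sizeof *p;   /* one past the end of the new array */
    uintptr_t b_lo = (uintptr_t)ba;
    uintptr_t b_hi = b_lo + ne * sizeof *ba;    /* one past the end of the recorded array */

    /* Two half-open ranges overlap when each starts before the other ends.
       For non-empty arrays this covers the equal, lower and higher cases. */
    return p_lo < b_hi && b_lo < p_hi;
}
```

With the initial values ba = 0 and ne = 0 the recorded range is empty, so the first collective call is never reported.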
3.7. Race Conditions
The methodology used for detecting race conditions is based on the fact that a race condition will
occur only when both of the following conditions are satisfied:

at least two different PEs access the same shared memory location with at least one PE
writing to this memory location

the order in which the PEs execute these accesses is not controlled.
In OpenSHMEM, PEs can access only symmetric memory locations on a remote PE, and only
through SHMEM function calls. A race condition will occur when two PEs access a symmetric
memory location on a third PE without synchronization, or when one PE accesses symmetric
memory locally and another PE accesses the same memory location using a SHMEM function.
Note that sometimes the final result will not depend on the order of accesses only because the data
happens to be the same, e.g. when all PEs happen to write the same value to a symmetric memory
location. However, we will still report that as a race condition.
The shmem_inc and shmem_finc functions add a third type of access, which we will call increase.
These functions only act on variables of type int, long and long long. A PE can also increase a
symmetric variable x via a statement of the form x = x+expr, where expr is an expression that does
not depend on x. Statements of this form involving symmetric variables of type int, long and
long long must also be treated as a shmem_inc.
Since the shmem_finc function not only increases target, but also returns the previous contents of
target, it should be treated as a pair of accesses: "read" and "increase". However if the returned
value of shmem_finc is not assigned to any variable and is not used as an argument in any function,
it should be considered only as a single access of type "increase". Similarly, accessing a variable
through shmem_swap should be treated as "read" and "write". However if the returned value of
shmem_swap is not assigned to any variable and is not used as an argument in any function, it
should be considered only as a single access of type "write".
So we have three types of access:

read (e.g. when a variable appears on the right side of the assignment statement, in the
source argument of a SHMEM function, passed by value to any function or used in the
expression of an IF statement)

write (e.g. when a variable appears on the left side of an assignment statement or in the
target argument of a SHMEM function except shmem_inc)

increase (i.e. when the access to x can be represented as x = x+expr, where expr is an
expression that does not depend on x, or when x appears in the target argument of
shmem_inc); this type of access is valid only for variables of types int, long and long long
Race conditions occur when the following pairs of accesses are executed on two different PEs:

increase and read

increase and write

read and write

write and write
Note that if between synchronizations all accesses to a symmetric memory location are of type
"read", then there is no race condition at this memory location. In addition if between
synchronizations all accesses to a symmetric memory location are of type "increase", then there is
no race condition at this memory location since the final result of all increase operations will always
be the same. If only one PE accesses a symmetric memory location multiple times, then there is no
race condition at this memory location, since the order of those accesses is guaranteed.
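The pairing rules above can be condensed into a small C predicate. This is an illustrative sketch; the enum and function names are hypothetical, and the function assumes the two accesses come from different PEs with no synchronization between them.

```c
/* The three access types defined in this section. */
typedef enum { ACCESS_READ, ACCESS_INCREASE, ACCESS_WRITE } access_kind;

/* Return 1 when accesses a and b to the same symmetric memory location,
   performed by two different PEs between synchronizations, form a race. */
int accesses_conflict(access_kind a, access_kind b)
{
    if (a == ACCESS_WRITE || b == ACCESS_WRITE)
        return 1;      /* a write conflicts with a read, an increase or another write */
    return a != b;     /* read/increase conflicts; read/read and increase/increase do not */
}
```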
Since race conditions only occur in regions between barrier synchronizations, accesses to symmetric
memory need only be tracked between barrier synchronizations in the SHMEM program. Let
S be the set of all pairs of accesses to the same symmetric memory location by two different PEs.
The procedure described below will remove the members of S that can be guaranteed not to be a
race condition. When the program is executed, race condition messages will be issued for the
remaining members of S. This method will find all race conditions as defined above.
To detect race conditions, first create a global array SMA (Symmetric Memory Accesses) on each
PE, that for each address in the symmetric memory on that PE will contain the following
information about the current state of accesses to the symmetric memory location:

unaccessed

{read, p}, where p is the PE that accessed the variable

{increased, p}, where p is the PE that accessed the variable

{written, p}, where p is the PE that accessed the variable

{read and increased, p}, where p is the PE that accessed the variable

read by more than one PE

increased by more than one PE
For each address in the symmetric memory the line number and file name of up to two accesses will
also be saved. This information will be saved every time the access state for a symmetric location is
modified.
Race conditions will be detected as follows. Before each statement that accesses a symmetric
memory location, a call to a function check_race_condition will be made for each symmetric
variable which is referenced in the statement. This function will contain a critical region, i.e. a
region protected by locks (one lock per symmetric memory location) that only one PE can execute
at a time. Inside this critical region the access state will be changed according to the
following algorithm, where PE is the current process number:
if (access == read) {
    switch (state) {
    case unaccessed:
        This is the first time this location is accessed, set state to "{read, PE}";
        break;
    case {read, PE}:
        Last access was read from this same PE, OK, no action;
        break;
    case {written, PE}:
        This location was written from this same PE only, OK, no action;
        break;
    case {increased, PE}:
        Last access was increase from this same PE, set state to "{read and increased, PE}";
        break;
    case {read, p}, p!=PE:
        Last access was a read from another PE, set state to "read by more than one PE";
        break;
    case read by more than one PE:
        Already have state set to "read by more than one PE", OK, no action;
        break;
    case {written, p}, p!=PE:
        Last access was a write from another PE, error - read/write without synchronization;
        break;
    case {increased, p}, p!=PE:
        Last access was an increase from another PE, error - read/increase error;
        break;
    case {read and increased, PE}:
        Already has state set to "{read and increased, PE}", OK, no action;
        break;
    case increased by more than one PE:
        This location was increased by another PE, error - read/increase error;
        break;
    case {read and increased, p}, p!=PE:
        This location was increased by another PE, error - read/increase error;
        break;
    }
}
else if (access == increase) {
    switch (state) {
    case unaccessed:
        This is the first time this location is accessed, set state to "{increased, PE}";
        break;
    case {read, PE}:
        Last access was read from this same PE, set state to "{read and increased, PE}";
        break;
    case {written, PE}:
        This location was written from this same PE only, OK, no action;
        break;
    case {increased, PE}:
        Last access was increase from this same PE, OK, no action;
        break;
    case {read, p}, p!=PE:
        Last access was a read from another PE, error - read/increase error;
        break;
    case read by more than one PE:
        This location was read by another PE, error - read/increase error;
        break;
    case {written, p}, p!=PE:
        Last access was a write from another PE, error - increase/write without synchronization;
        break;
    case {increased, p}, p!=PE:
        Last access was an increase from another PE, set state to "increased by more than one PE";
        break;
    case {read and increased, PE}:
        Already has state set to "{read and increased, PE}", OK, no action;
        break;
    case increased by more than one PE:
        Already has state set to "increased by more than one PE", OK, no action;
        break;
    case {read and increased, p}, p!=PE:
        This location was read by another PE, error - read/increase error;
        break;
    }
}
else if (access == write) {
    switch (state) {
    case unaccessed:
        This is the first time this location is accessed, set state to "{written, PE}";
        break;
    case {read, PE}:
        Last access was read from this same PE, set state to "{written, PE}";
        break;
    case {written, PE}:
        This location was written from this same PE only, OK, no action;
        break;
    case {increased, PE}:
        Last access was increase from this same PE, set state to "{written, PE}";
        break;
    case {read, p}, p!=PE:
        Last access was a read from another PE, error - read/write error;
        break;
    case read by more than one PE:
        This location was read by another PE, error - read/write error;
        break;
    case {written, p}, p!=PE:
        This location was written to by another PE, error - write/write without synchronization;
        break;
    case {increased, p}, p!=PE:
        Last access was an increase from another PE, error - increase/write without synchronization;
        break;
    case {read and increased, PE}:
        The location was accessed by this PE only, set state to "{written, PE}";
        break;
    case increased by more than one PE:
        This location was increased by another PE, error - increase/write error;
        break;
    case {read and increased, p}, p!=PE:
        This location was read by another PE, error - read/write error;
        break;
    }
}
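The state transitions above can be written compactly in C. The sketch below is one possible encoding, not the tool's implementation: the type and enumerator names are hypothetical, the per-location lock that forms the critical region is omitted, and the function only returns 1 where the algorithm above reports an error instead of printing a message.

```c
typedef enum { ACC_READ, ACC_INCREASE, ACC_WRITE } sma_access;

typedef enum {
    UNACCESSED, READ_ONE, INCREASED_ONE, WRITTEN_ONE,
    READ_INCREASED_ONE,          /* "{read and increased, p}" */
    READ_MANY, INCREASED_MANY    /* "... by more than one PE" */
} sma_state;

/* Hypothetical SMA entry for one symmetric memory location. */
typedef struct {
    sma_state state;
    int pe;        /* accessing PE; meaningful only for the single-PE states */
} sma_entry;

/* Apply one access by PE `pe`. Returns 0 if no race is detected, 1 otherwise.
   The entry is left unchanged when a race is reported. */
int check_race_condition(sma_entry *e, sma_access access, int pe)
{
    int single = e->state == READ_ONE || e->state == INCREASED_ONE ||
                 e->state == WRITTEN_ONE || e->state == READ_INCREASED_ONE;

    if (e->state == UNACCESSED) {            /* first access to this location */
        e->state = access == ACC_READ ? READ_ONE
                 : access == ACC_INCREASE ? INCREASED_ONE : WRITTEN_ONE;
        e->pe = pe;
        return 0;
    }
    if (single && e->pe == pe) {             /* only this PE has accessed so far */
        if (e->state == WRITTEN_ONE)
            return 0;                        /* "written" absorbs later accesses */
        if (access == ACC_WRITE)
            e->state = WRITTEN_ONE;
        else if ((e->state == READ_ONE && access == ACC_INCREASE) ||
                 (e->state == INCREASED_ONE && access == ACC_READ))
            e->state = READ_INCREASED_ONE;
        return 0;
    }
    /* At least one other PE has accessed: only read/read and
       increase/increase combinations are free of races. */
    if (access == ACC_READ && (e->state == READ_ONE || e->state == READ_MANY)) {
        e->state = READ_MANY;
        return 0;
    }
    if (access == ACC_INCREASE &&
        (e->state == INCREASED_ONE || e->state == INCREASED_MANY)) {
        e->state = INCREASED_MANY;
        return 0;
    }
    return 1;
}
```

For instance, a location read and then increased by PE 0 ends up in the state "{read and increased, 0}", and a subsequent read by PE 1 is reported as a race.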
After every call to barrier, shmem_barrier_all or a symmetric heap memory management function
(which calls shmem_barrier_all), the access state of every entry in the SMA on each PE will be set to
"unaccessed" and all PEs will be synchronized with an additional call to shmem_barrier_all.
The shmem_barrier function is used to synchronize a subset of PEs. It ensures that all local stores and remote
memory updates issued by any of the PEs in the active set prior to shmem_barrier are complete
before returning. Thus after every call to shmem_barrier, the SMA arrays on all PEs in the active set
should be "cleaned" of the accesses performed by PEs in the active set.
Appendix A:
SHMEM Run-time Error Categories
1. Uninitialized Data Used in SHMEM Functions

Source argument in collective functions

Source argument in remote write (put) functions
2. Out-of-bounds array accesses in SHMEM Functions
Arrays can be declared too small, or an address in the middle of a larger array may be passed
into a SHMEM function so that the rest of the array starting at this address is too small - both errors
should be detected

Use too small pSync array in collective functions

Use too small array in the target argument of a collective function (the size of the array will
depend on the values of the nreduce, nlong and PE_size arguments)

Use too small array in the source argument of a collective function (the size of the array will
depend on the values of the nreduce, nlong and PE_size arguments)

Use too small pWrk array in collective reduction functions (the size of the array will depend
on the value of the nreduce argument)

Use too small array in the target argument of a remote write (put) function (the size of the
array will depend on the value of the len and tst arguments)

Use too small array in the source argument of a remote write (put) function (the size of the
array will depend on the value of the len and sst arguments)

Use too small array in the target argument of a remote read (get) function (the size of the
array will depend on the value of the len and tst arguments)

Use too small array in the source argument of a remote read (get) function (the size of the
array will depend on the value of the len and sst arguments)
3. Symmetric Heap Memory Related Errors

Call heap memory management functions on pointers that do not point to a memory block
allocated via a call to shmalloc, shmemalign or shrealloc

Use of dangling pointers

Memory Leak - reassigning a pointer before deallocating

Memory Leak - leaving a block before freeing memory that was allocated using non-static
pointer declared and allocated within the block
4. Argument Errors in SHMEM Functions

Use non-symmetric data objects as arguments that are required to be remotely accessible

Errors in pSync argument

Use overlapping (but not the same) arrays as source and target arguments in SHMEM
collective reduction functions

Use a value larger than 'number of PEs -1' or use a negative value in the pe, PE_size and
PE_start arguments

Use invalid combination of values of PE_size, PE_stride and PE_start arguments

Call a collective function by a PE not in the active set
5. Deadlocks and potential deadlocks:

Not every process in the active set calls a barrier function, a symmetric heap memory
management function or a collective function with identical argument(s) that are required to
be single-valued in the same order

Deadlock at the call to a wait function

Deadlocks when using SHMEM lock functions
6. Incorrect Order of SHMEM Functions

Call a SHMEM function before the call to start_pes

Unlock lock being held by another PE

Call two collective SHMEM routines with the same pSync and/or pWrk arguments and no
shmem_barrier or shmem_barrier_all call in between
7. Race Conditions