Power Point Tutorial on UPC-CHECK

UPC-Check Tutorial *
High Performance Computing Group
Glenn Luecke(director), James Coyle, James Hoekstra,
Marina Kraeva and Indranil Roy
Iowa State University
Aug 30, 2011
* This work was supported by the United States Department of
Defense and used resources of the Extreme Scale Systems Center
at Oak Ridge National Laboratory.
UPC-CHECK Tutorial Outline
• UPC-CHECK Design
• Current Functionality of UPC-CHECK
• UPC-CHECK syntax
• How to use UPC-CHECK to find and correct program errors (6 examples)
• Efficiency of UPC-CHECK
• Scalability of UPC-CHECK
• Memory overhead of UPC-CHECK
UPC-CHECK Design
[Figure: Original UPC Program -> UPC to UPC Translator -> UPC program
with error checking -> UPC Compiler (together with the UPC-CHECK
Support Routines) -> Executable with error checking]
Current Functionality of UPC-CHECK
• Argument checking for UPC functions
• Deadlock detection
UPC-CHECK Syntax
• Use upc-check the same as your UPC compiler, e.g. instead of
      upcc -O -T 3 a.upc r.o
  issue:
      upc-check -O -T 3 a.upc r.o
• In a Makefile, change UPC=upcc to UPC=upc-check
• Note: the -T compiler option must be used with the upc-check command
  since ROSE currently requires that the number of threads be known at
  compile time (UPC-CHECK uses the ROSE Toolkit from Lawrence
  Livermore National Laboratory to instrument UPC source code).
Run-Time Errors Detected by UPC-CHECK
• UPC-CHECK detects Argument Errors in UPC
Functions and Deadlocks in UPC programs.
• UPC-CHECK will not test the single-valued
requirement of upc_forall statements.
• Since UPC-CHECK works on UPC source
programs, it cannot detect deadlocks within
library functions.
• Currently, UPC-CHECK requires that programs
do not define the 'main' function in a header file.
Quantifying the quality of a tool which detects UPC run-time errors
• Iowa State University has a Test Suite that scores the ability of UPC
  compilers/tools to detect run-time errors: see
  http://rted.public.iastate.edu/UPC/
• This Test Suite uses the following scoring system:
  – A score of 5 is given for a detailed error message that will assist a
    programmer to quickly correct the error.
  – A score of 4 is given for error messages with more information than
    a score of 3 and less than 5.
  – A score of 3 is given for error messages with the correct error name,
    line number and the name of the file where the error occurred.
  – A score of 2 is given for error messages with the correct error name
    and line number where the error occurred but not the file name
    where the error occurred.
  – A score of 1 is given for error messages with the correct error name.
  – A score of 0 is given when the error was not detected.
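The rubric above can be read as a small decision procedure. The following is a minimal C sketch of that reading, not code from the Test Suite: the struct, its field names and the three-level `extra_detail` value are hypothetical, and two cases the slide does not cover (an error reported under the wrong name, and a message with a file name but no line number) are assumed to score 0 and 1 respectively.

```c
#include <stdbool.h>

/* Features of one error message, as judged by a human grader.
   All names are hypothetical; only the 0-5 scale comes from the slide. */
typedef struct {
    bool detected;      /* was any error reported at all?               */
    bool error_name;    /* correct error name                           */
    bool line_number;   /* correct line number                          */
    bool file_name;     /* correct file name                            */
    int  extra_detail;  /* 0 = none, 1 = some, 2 = enough to fix fast   */
} message_features;

/* Map message features to a score on the 0-5 scale. */
int rted_score(message_features m)
{
    if (!m.detected || !m.error_name) return 0;  /* assumption for wrong name */
    if (!m.line_number) return 1;                /* name only                 */
    if (!m.file_name) return 2;                  /* name and line             */
    if (m.extra_detail <= 0) return 3;           /* name, line and file       */
    return m.extra_detail == 1 ? 4 : 5;          /* more detail than a 3      */
}
```

The boundary between a 4 and a 5 is a judgment call in the rubric, which is why the sketch takes it as an input rather than computing it.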
How UPC-CHECK compares
• Results from ISU’s test suite:
http://rted.public.iastate.edu/UPC/RESULTS/result_table.html
• UPC-CHECK gets the highest score for
Deadlocks and the highest score for all but 3
tests in the Argument Errors section.
Compiler        Argument Errors    Deadlocks
UPC-CHECK       4.89               5.00
Berkeley UPC    0.04               0.58
Cray            0.38               0.00
HP              0.00               0.36
GNU             0.00               0.27
Additional Checks
• While collecting the information necessary to
instrument the Argument Errors and Deadlock checks,
UPC-CHECK sometimes detects a different error.
Whichever error would occur first is reported with a
meaningful error message. (E.g. a collective routine
within a upc_forall.)
• Due to this, more categories in ISU’s RTED_UPC test
suite showed improvement when using UPC-CHECK,
see:
http://rted.public.iastate.edu/UPC/RESULTS/result_table.html
• In addition, some errors are detected and reported at
translation/compile time.
Examples illustrating how to use UPC-CHECK to find and correct
program errors *
* All examples use the Berkeley UPC compiler.
Example 1:
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex1.upc
This program contains the call:
upc_all_broadcast(arrA, arrB,
sizeof(int)*sh_val,
UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
where sh_val is declared as
static shared int sh_val;
However, the program never assigns a value to sh_val.
Because of its declaration, sh_val has an initial value of zero.
Therefore the third argument (nbytes) of the above broadcast call is zero.
This is not allowed by the UPC specification.
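The kind of check involved can be illustrated in plain C, where a file-scope static int also starts at zero. This is only a sketch of the idea, not UPC-CHECK's implementation: `check_nbytes` and `broadcast_nbytes` are hypothetical helpers, and the diagnostic text is made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Objects with static storage duration start at zero in C, much as the
   slide describes for a static shared int in UPC. */
static int sh_val;

/* Hypothetical check: returns 1 if nbytes is acceptable; otherwise
   prints a diagnostic to stderr and returns 0. */
int check_nbytes(size_t nbytes, const char *func, int line, const char *file)
{
    if (nbytes == 0) {
        fprintf(stderr, "invalid nbytes of 0 passed to %s at line %d in file %s\n",
                func, line, file);
        return 0;
    }
    return 1;
}

/* The size expression from the slide, evaluated with the current sh_val. */
size_t broadcast_nbytes(void)
{
    return sizeof(int) * (size_t)sh_val;
}
```

Calling `check_nbytes(broadcast_nbytes(), ...)` before any assignment to `sh_val` fails the check; after an assignment such as `sh_val = 4;` it passes.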
When issuing:
upcc -T 4 -o ex1 ex1.upc;
upcrun -n 4 ./ex1;
the program executes without any error messages being issued.
When issuing:
upc-check -T 4 -o ex1 ex1.upc;
upcrun -n 4 ./ex1;
the following message is issued:
Thread 0 encountered invalid arguments in function
upc_broadcast at line 26 in file
/home/jjc/ex1.upc.
Error: Parameter (((sizeof(int )) *(sh_val)))
passes non-positive value of 0 to nbytes argument
Variable sh_val was declared at line 10 in file
/home/jjc/ex1.upc.
Correcting Example 1
Seeing that sizeof(int)*sh_val was zero, the
programmer can see that sh_val still has the default value
of zero due to its declaration in line 10. (Static shared
variables are initialized to zero according to the UPC
Spec.) Thus, an assignment of a value to sh_val before line
26 is missing.
Inserting the statement
sh_val=BLOCK_SIZE ;
at line 16 fixes this error.
Example 2:
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex2.upc
This program contains:
numhints = 1;
fd = upc_all_fopen("upcio1.txt",
UPC_INDIVIDUAL_FP|UPC_WRONLY|
UPC_CREATE, numhints, hints);
but the program does not allocate space for the structure
hints.
When issuing:
upcc -T 4 -o ex2 ex2.upc;
upcrun -n 4 ./ex2;
the following is printed from the printf in the program:
File not open.
When issuing:
upc-check -T 4 -o ex2 ex2.upc; upcrun -n 4 ./ex2;
the following message is issued:
Thread 0 encountered invalid arguments in function
upc_all_fopen at line 13 in file /home/jjc/ex2.upc.
Error: Parameter numhints passes non-zero value of 1 to
'numhints' argument while target of parameter (hints) passed
to 'hints' argument is unallocated.
Variable numhints was declared at line 7 in file
/home/jjc/ex2.upc.
Variable hints was declared at line 9 in file
/home/jjc/ex2.upc
Correcting Example 2
The argument hints is not used unless numhints is
positive. hints may be used to convey information about a
file in hopes of more efficient I/O. Therefore, example 2
can be corrected by either
1) setting numhints to 0, or
2) allocating hints and assigning correct values to it.
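The consistency rule behind this message (a non-zero numhints requires an allocated hints argument) can be sketched in plain C. This is a simplified model under a stated assumption: "unallocated" is represented here by a NULL pointer, whereas the real tool tracks allocations; `check_hints` and the enum names are hypothetical.

```c
#include <stddef.h>

typedef enum { HINTS_OK, HINTS_UNALLOCATED } hints_status;

/* hints is only consulted when numhints is positive, so a missing
   hints target is an error only in that case. */
hints_status check_hints(size_t numhints, const void *hints)
{
    if (numhints > 0 && hints == NULL)
        return HINTS_UNALLOCATED;
    return HINTS_OK;
}
```

Both corrections listed above satisfy this check: `check_hints(0, NULL)` is accepted, and so is a positive numhints with a valid hints pointer.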
Example 3:
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex3.upc
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex3_s.upc
In this program, upc_barrier is not executed by
all threads, which causes a deadlock.
This error is difficult to find since the barrier is contained
inside a function which is called from within an if block.
When issuing:
upcc -T 4 -o ex3 ex3.upc ex3_s.upc; upcrun -n 4 ./ex3;
a deadlock occurs and the upcrun command never returns.
When issuing:
upc-check -T 4 -o ex3 ex3.upc ex3_s.upc; upcrun -n 4 ./ex3;
the following message is issued:
Runtime error: Deadlock condition detected: One or more
threads have finished executing while other threads are
waiting at a collective routine
Status of threads
=================
Thread id:Status:Presently waiting at line number:of file
-------------------------------------------------------
0:waiting at upc_barrier: 7: /home/jjc/ex3_s.upc
1:reached end of execution through: 39: /home/jjc/ex3.upc
2:waiting at upc_barrier: 7: /home/jjc/ex3_s.upc
3:waiting at upc_barrier: 7: /home/jjc/ex3_s.upc
Correcting Example 3
The upc_barrier is called from funcA. Two of the
three possible paths through the two nested if
statements appear and contain a upc_barrier, but
the third possible (else) path is missing.
This error can be corrected by creating the missing
else block and placing in it either a call to funcA or a
upc_barrier statement.
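The condition reported for Example 3 (some threads finished while others wait at a collective) can be modeled in a few lines of plain C. This is an illustrative sketch only: it scans all thread states from one place, which is not how UPC-CHECK is described as working, and the enum and function names are hypothetical.

```c
/* Hypothetical per-thread states; the names are illustrative only. */
typedef enum { RUNNING, WAITING_AT_COLLECTIVE, FINISHED } thread_state;

/* Returns 1 if some thread has finished while another thread is still
   waiting at a collective routine, and 0 otherwise. */
int finished_while_waiting(const thread_state *state, int nthreads)
{
    int any_finished = 0;
    int any_waiting = 0;
    for (int i = 0; i < nthreads; i++) {
        if (state[i] == FINISHED) any_finished = 1;
        if (state[i] == WAITING_AT_COLLECTIVE) any_waiting = 1;
    }
    return any_finished && any_waiting;
}
```

With the thread status from the error message (threads 0, 2 and 3 waiting, thread 1 finished) the function reports a deadlock; once every path reaches the barrier, all threads wait together and it does not.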
Example 4:
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex4.upc
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex4_s.upc
In this program, not all threads call the UPC collective
function upc_all_fsync.
When issuing:
upcc -T 4 -o ex4 ex4.upc ex4_s.upc; upcrun -n 4 ./ex4;
the upcrun command never completes.
When issuing:
upc-check -T 4 -o ex4 ex4.upc ex4_s.upc; upcrun -n 4 ./ex4;
the following message is issued:
Runtime error: Deadlock condition detected: Different
threads waiting at different collective routines
Status of threads
=================
Thread id:Status:Presently waiting at line number:of file
--------------------------------------------------------
0:waiting at upc_all_fsync on file pointer fd: 9: /home/jjc/ex4_s.upc
1:waiting at upc_all_fclose on file pointer fd: 52: /home/jjc/ex4.upc
2:waiting at upc_all_fsync on file pointer fd: 9: /home/jjc/ex4_s.upc
3:waiting at upc_all_fsync on file pointer fd: 9: /home/jjc/ex4_s.upc
Correcting Example 4
This is another case where a UPC collective (in
this case upc_all_fsync) is not called by all
threads, as required. This is detected when one
set of threads executes upc_all_fsync, while
another set executes upc_all_fclose.
Inserting an else clause with the statement
upc_all_fsync(fd) corrects the problem.
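The second deadlock condition (threads waiting at different collective routines) can be modeled the same way. The sketch below is a plain C illustration with hypothetical names, not UPC-CHECK's algorithm: each thread records which collective it is waiting at, and a mismatch between any two waiting threads is reported.

```c
/* Which collective a thread is currently waiting at (hypothetical ids). */
typedef enum { NOT_WAITING, AT_FSYNC, AT_FCLOSE } collective_id;

/* Returns 1 if two waiting threads are at different collectives. */
int waiting_at_different_collectives(const collective_id *at, int nthreads)
{
    collective_id seen = NOT_WAITING;
    for (int i = 0; i < nthreads; i++) {
        if (at[i] == NOT_WAITING) continue;   /* thread is not blocked    */
        if (seen == NOT_WAITING) seen = at[i]; /* first waiting thread    */
        else if (at[i] != seen) return 1;      /* mismatch: deadlock      */
    }
    return 0;
}
```

For the status shown in the error message, thread 1 is at upc_all_fclose while the others are at upc_all_fsync, so the function returns 1.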
Example 5:
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex5.upc
In this program, all of the threads call the UPC collective function
upc_all_reduceI, but they call with different source arrays,
which is not allowed by the UPC specification.
Without UPC-CHECK, when issuing:
upcc -T 4 -o ex5 ex5.upc;
upcrun -n 4 ./ex5;
the following is printed:
sumA=120
When issuing:
upc-check -T 4 -o ex5 ex5.upc; upcrun -n 4 ./ex5;
the following message is issued:
Runtime error: Unspecified behavior condition detected, may lead to
deadlock : One or more threads have different values for single_valued
parameters.
Status of threads
=================
Thread id:Status:Presently waiting at line number:of file
--------------------------------------------------------
0:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
1:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
2:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
3:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
Mismatch in parameter: src.
Thread no.
===================================================================
0:ptrA points to memory location 0x2b7dd810dff0.
Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
1:ptrA points to memory location 0x2b7dd810dfe0.
Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
2:ptrA points to memory location 0x2b7dd810dfc0.
Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
3:ptrA points to memory location 0x2b7dd810dfd0.
Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
Correcting Example 5
The error message on the previous slide reports that threads have
different values of the src parameter of function upc_all_reduceI.
ptrA, declared at line 7 of file ex5.upc, points to different memory
locations. Looking at the ptrA declaration, we see that ptrA is a
private pointer-to-shared.
Later in the code ptrA is assigned the value returned by the call to
upc_global_alloc. This function is not collective. If it's called by
multiple threads, all threads which make the call get different
allocations.
Changing upc_global_alloc to upc_all_alloc corrects the
problem since now ptrA will have the same value on every thread.
Note that with the current version of the Berkeley UPC compiler, the value of
sumA will be the same in either case, but this behavior is not
guaranteed for the test above.
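A single-valued parameter must have the same value on every thread, so the check amounts to comparing the per-thread values. The plain C sketch below illustrates that comparison on the addresses from the error message; it is a model with a hypothetical function name, not the tool's implementation, and it gathers all values in one array for simplicity.

```c
#include <stdint.h>

/* Returns -1 if every thread passed the same value, otherwise the index
   of the first thread whose value differs from thread 0's. */
int first_mismatch(const uint64_t *value_on_thread, int nthreads)
{
    for (int i = 1; i < nthreads; i++) {
        if (value_on_thread[i] != value_on_thread[0])
            return i;
    }
    return -1;
}
```

With the four different ptrA addresses reported above the first mismatch is thread 1; after the change to upc_all_alloc all threads hold the same value and no mismatch is found.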
Example 6
Example 6 is the Dining Philosophers problem, a classic deadlock
problem.
http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex6.upc
Without UPC-CHECK, when issuing:
upcc -T 3 -o ex6 ex6.upc; upcrun -n 3 ./ex6;
the output produced varies from run to run. For one run the following
output was produced:
philosopher # 0 got the left fork
philosopher # 0 got the right fork
philosopher # 0 got the left fork
philosopher # 1 got the left fork
philosopher # 2 got the left fork
the program then deadlocks and no additional output is issued.
When issuing:
upc-check -T 3 -o ex6 ex6.upc; upcrun -n 3 ./ex6;
the program exits after issuing the following message:
Runtime error: Deadlock condition detected: Found cycle of
hold-and-wait dependencies for acquiring locks:
Thread 2 is waiting at upc_lock function at line 18 of file
/home/jjc/ex6.upc to acquire lock forks[((MYTHREAD ) + 1) % 3]
pointing to location 0x9f40.
Lock forks[((MYTHREAD ) + 1) % 3] was already acquired as
forks[MYTHREAD ] by thread 0 with 'upc_lock' at line 16 of file
/home/jjc/ex6.upc.
Thread 0 is waiting at upc_lock function at line 18 of file
/home/jjc/ex6.upc to acquire lock forks[((MYTHREAD ) + 1) % 3]
pointing to location 0x9f20.
Lock forks[((MYTHREAD ) + 1) % 3] was already acquired as
forks[MYTHREAD ] by thread 1 with 'upc_lock' at line 16 of file
/home/jjc/ex6.upc.
Thread 1 is waiting at upc_lock function at line 18 of file
/home/jjc/ex6.upc to acquire lock forks[((MYTHREAD ) + 1) % 3]
pointing to location 0x9f00.
Lock forks[((MYTHREAD ) + 1) % 3] was already acquired as
forks[MYTHREAD ] by thread 2 with 'upc_lock' at line 16 of file
/home/jjc/ex6.upc.
Correcting Example 6
The error message on the previous slide shows
• where the deadlock is occurring (line 18 of the indicated file),
• which locks are involved,
• who holds which locks,
• and what lock each thread is waiting on.
The deadlock can be avoided by numbering the forks, and picking up
the even fork first, then another fork, and putting them down in the
reverse order.
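One common way to realize a fixed pick-up order is to always take the lower-numbered fork first. Note that this is the standard resource-ordering rule, swapped in here for illustration; it is not necessarily the exact rule used in the corrected ex6.upc. In the sketch below, the left fork is forks[MYTHREAD] and the right fork is forks[(MYTHREAD + 1) % THREADS], as in the error message; `fork_order` and `pick_order` are hypothetical names.

```c
/* Indices of the two forks in the order they should be locked. */
typedef struct { int first; int second; } fork_order;

/* Lower-numbered fork first: no cycle of hold-and-wait can form,
   because every thread acquires locks in increasing index order. */
fork_order pick_order(int philosopher, int nphilosophers)
{
    int left = philosopher;
    int right = (philosopher + 1) % nphilosophers;
    fork_order o;
    if (left < right) { o.first = left;  o.second = right; }
    else              { o.first = right; o.second = left;  }
    return o;
}
```

With three philosophers, philosophers 0 and 1 still take their left fork first, but philosopher 2 now takes fork 0 before fork 2, which breaks the cycle shown in the error message. The forks would then be released in the reverse order (second, then first).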
Efficiency of UPC-CHECK
UPC-CHECK has been carefully designed to minimize the overhead
when executing the instrumented UPC program. Using the UPC
implementation of the NAS Parallel CG benchmark, we timed both the
instrumented and non-instrumented executables using 4 threads for the
three smallest problem sizes (S, A, and B), again using the
Berkeley UPC compiler. The measured difference is at most 2.3% in
either direction, i.e. essentially zero overhead.

Size    Wall time (sec.)                   Overhead
        No UPC-CHECK    With UPC-CHECK
S       7.36            7.41               0.7%
A       9.06            9.12               0.7%
B       85.03           83.04              -2.3%
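The overhead column is the relative difference between the two wall times. A minimal C sketch of that arithmetic follows; the function name is hypothetical, and the formula is the usual relative-change definition, which reproduces the table's values.

```c
/* Relative overhead in percent: positive when the instrumented run is
   slower than the original, negative when it is faster. */
double overhead_percent(double t_original, double t_instrumented)
{
    return 100.0 * (t_instrumented - t_original) / t_original;
}
```

For size S this gives 100 * (7.41 - 7.36) / 7.36, about 0.7%, and for size B 100 * (83.04 - 85.03) / 85.03, about -2.3%. A negative value here most plausibly reflects run-to-run timing variation rather than a real speedup.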
Scalability of UPC-CHECK checks

Type of check                       Overhead (for T threads)
Argument checking                   O(1)
Deadlocks: Collective routines      O(1)
Deadlocks: UPC_Locks                O(L), L<=T

Where L is the length of the longest hold-and-wait chain.
For a program that does not use upc_locks, the overhead in using UPC-CHECK
does not depend on the number of threads. This is because all checking can be
done via values local to a thread and its neighboring threads. The O(1) deadlock checking
for collective routines will be described in a paper that is being prepared.
A program that uses upc_locks may have overhead that depends on the number
of threads because there may be a chain of lock dependencies (a deadlock) which spans all
threads.
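The O(L) bound for locks can be illustrated by following a hold-and-wait chain one link at a time. The plain C sketch below is a model under stated assumptions, not UPC-CHECK's algorithm: the `waits_for` array (thread t is blocked on a lock held by thread waits_for[t]) and the function name are hypothetical, and each thread is assumed to wait for at most one lock.

```c
#define NO_THREAD (-1)

/* Follows the chain start -> waits_for[start] -> ... and returns the
   cycle length L if the chain leads back to start, or 0 if it does not.
   At most nthreads links are followed, so L <= T. */
int hold_and_wait_cycle(const int *waits_for, int nthreads, int start)
{
    int t = waits_for[start];
    int steps = 1;
    while (t != NO_THREAD && steps <= nthreads) {
        if (t == start) return steps;   /* chain closed: deadlock */
        t = waits_for[t];
        steps++;
    }
    return 0;                           /* chain ended or never returned */
}
```

For Example 6, thread 0 waits on thread 1, thread 1 on thread 2, and thread 2 on thread 0, so the walk from thread 0 closes after 3 links, which is the L = T worst case.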
Overhead on a Cray XT using the Cray compiler
128 threads

NAS Benchmark    Execution Time for        Execution Time for            Slowdown
                 Original Program (sec.)   Instrumented Program (sec.)
CG-A             4.912                     4.99                          1.02
CG-B             54.183                    54.239                        1.00
CG-C             58.309                    58.281                        1.00
EP-A             1.417                     1.427                         1.01
EP-B             7.116                     7.128                         1.00
EP-C             11.19                     11.17                         1.00
IS-A             3.56                      3.658                         1.03
IS-B             8.752                     8.776                         1.00
IS-C             10.089                    10.073                        1.00
Total            159.528                   159.742                       1.00
Memory overhead of UPC-CHECK
The memory overhead per thread consists of three components:
1) Extra context variables allocated to support checks: approximately
128 KB.
2) Extra information about the call stack, if call stack tracking is requested:
1/2 KB per call level per thread.
3) Executable size: the support routines add less than 1 MB, and each
UPC routine adds about 3.5 KB.
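These figures combine into a simple estimate. The C sketch below only restates the arithmetic from the list above; the function names are hypothetical, 1 MB is taken as 1024 KB, and since the support routines are only bounded by "less than 1 MB", the executable figure is an upper estimate.

```c
/* Per-thread run-time overhead in KB: about 128 KB of context variables
   plus 0.5 KB per call level when call stack tracking is requested. */
double per_thread_overhead_kb(int call_depth, int track_call_stack)
{
    double kb = 128.0;
    if (track_call_stack)
        kb += 0.5 * call_depth;
    return kb;
}

/* Upper estimate of executable growth in KB: under 1 MB (taken as
   1024 KB) of support routines plus about 3.5 KB per UPC routine. */
double executable_growth_upper_kb(int n_upc_routines)
{
    return 1024.0 + 3.5 * n_upc_routines;
}
```

For example, a call depth of 10 with tracking enabled gives about 128 + 5 = 133 KB per thread.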