UPC-CHECK: A scalable tool for detecting run-time errors in Unified Parallel C
Indranil Roy
High Performance Computing (HPC) group
Segmentation error. Core dumped.
A good error message
Thread 0 encountered invalid arguments in function upc_all_broadcast at line 26 in file /home/jjc/ex1.upc.
Error: Parameter (sizeof(int) * sh_val) passes nonpositive value of 0 to nbytes argument.
Variable sh_val was declared at line 10 in file /home/jjc/ex1.upc.
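A report like this one can be produced by a check inserted in front of the call. The following is a minimal sketch in Python rather than UPC, written only to illustrate the idea. The helper name `check_nbytes`, its parameters and the assumption that `sizeof(int)` is 4 are all hypothetical and are not taken from UPC-CHECK itself.

```python
def check_nbytes(thread, func, line, path, expr, value, decl_notes=()):
    """Return an error report if `value` is not a positive byte count, else None.

    Hypothetical helper: the name, the parameters and the report layout are
    assumptions modelled on the sample message, not UPC-CHECK's actual code.
    """
    if value > 0:
        return None  # argument is valid, nothing to report
    report = [
        f"Thread {thread} encountered invalid arguments in function "
        f"{func} at line {line} in file {path}.",
        f"Error: Parameter ({expr}) passes nonpositive value of {value} "
        f"to nbytes argument.",
    ]
    # Point the programmer at the declarations of the variables involved.
    for name, decl_line in decl_notes:
        report.append(
            f"Variable {name} was declared at line {decl_line} in file {path}."
        )
    return "\n".join(report)


sh_val = 0  # the faulty value from the example
message = check_nbytes(
    thread=0,
    func="upc_all_broadcast",
    line=26,
    path="/home/jjc/ex1.upc",
    expr="sizeof(int) * sh_val",
    value=4 * sh_val,  # assumes sizeof(int) == 4
    decl_notes=[("sh_val", 10)],
)
print(message)
```

A valid argument returns `None`, so the instrumented program only pays for the comparison when nothing is wrong.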
Outline
• Understanding Unified Parallel C
• UPC-CHECK 1.0 tool
▫ How does it work?
▫ Usability
▫ Error coverage and quality of error reports generated
▫ Testing
▫ Overheads
▫ Scalability
▫ Known limitations
• Challenges in argument error detection
• Deadlock detection algorithm
• Demo
Understanding Unified Parallel C
• Shared memory model
• Distributed memory model
• Unified Parallel C
▫ Distributed Shared Memory Model or Partitioned Global Address Space Model
UPC-CHECK v1.0
• Source to source translator
• Pre-compiler
• Error handling
▫ Argument errors
▫ Deadlocks
UPC-CHECK: Usability
• Portable
▫ Machine independent
▫ Compiler independent
• Ease of use
▫ Easy to install
install_UPC-CHECK
▫ Easy to run
• Freely available
wget http://hpcgroup.public.iastate.edu/UPCCHECK/UPC-CHECK.tar.gz
UPC-CHECK 1.0: Usability
• Usage
upc-check [compiler options] [--upccheck:flag [-upccheck:flag] ...] -c sourcefile.upc
-a|-d_argument_check          disables argument checking (enabled by default)
-d|-d_deadlock_check          disables deadlock checking (enabled by default)
-s|-e_track_func_call_stack   enables tracing of function call stack (disabled by default)
-h|--h|-help                  prints help for UPC-CHECK
• Just replace your compile-command with upc-check.
Quality of error reports generated
• Coyle, J., Hoekstra, J., Kraeva, M., Luecke, G. R., Kleiman, R., Srinivas, V., Tripathi, A., Weiss, O., Wehe, A., Xu, Y., Yahya, M. (2008). UPC Run-Time Error Detection Test Suite. http://kraeva.public.iastate.edu/rted/UPC.TestPlan.pdf, Iowa State University, High Performance Computing Group.
▫ A score of 5 is given for a detailed error message that will assist a programmer to fix the error.
▫ A score of 4 is given for error messages with more information than a score of 3 and less than 5. This is tailored for each test.
▫ A score of 3 is given for error messages with the correct error name, line number and the name of the file where the error occurred.
▫ A score of 2 is given for error messages with the correct error name and line number where the error occurred but not the file name where the error occurred.
▫ A score of 1 is given for error messages with the correct error name.
▫ A score of 0 is given when the error was not detected.
Run-time environment   Argument errors   Deadlocks
Cray                   0.38              0
Berkeley               0.04              0.58
HP                     0                 0.36
GNU                    0                 0.27
UPC-CHECK              4.89              5
UPC-CHECK 1.0: Testing
• 400 error test-cases
• 1800 false-positive cases
• Additional testing for deadlocks
• Testing across application programs
UPC-CHECK 1.0: Overhead
• Base memory requirement
▫ ~128 KB per thread
▫ With every acquired or requested shared memory lock, the requirement goes up by around 256 B
▫ While tracking the function call stack, with every level of nested function call, the memory requirement goes up by around 512 B
• Increase of code section
▫ ~100 lines of instrumentation per UPC operation
▫ ~12000 lines from support files
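The memory figures combine into a simple per-thread estimate. The sketch below is a back-of-the-envelope calculation in Python, assuming the per-lock and per-call-level costs are additive and that the approximate figures quoted above are exact. The function name and its parameters are hypothetical.

```python
BASE_BYTES = 128 * 1024      # ~128 KB base requirement per thread
BYTES_PER_LOCK = 256         # per acquired or requested shared memory lock
BYTES_PER_CALL_LEVEL = 512   # per nested call level, only with call-stack tracking


def estimated_overhead_bytes(locks, call_depth=0, track_call_stack=False):
    """Approximate per-thread memory added by the instrumentation."""
    total = BASE_BYTES + locks * BYTES_PER_LOCK
    if track_call_stack:
        total += call_depth * BYTES_PER_CALL_LEVEL
    return total


# Example: 8 tracked locks, call depth 10, call-stack tracking enabled.
example = estimated_overhead_bytes(locks=8, call_depth=10, track_call_stack=True)
print(example)  # 131072 + 2048 + 5120 = 138240
```

Even with many locks and deep call stacks the estimate stays close to the 128 KB base, since the variable terms are counted in hundreds of bytes.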
Efficiency overhead
            Berkeley UPC                          Cray UPC
Benchmark   Original   Instrumented   Overhead    Original   Instrumented   Overhead
CG-S        7.329      7.393          1.009       7.34       7.924          1.080
CG-W        7.554      7.613          1.008       7.576      8.344          1.101
CG-A        8.531      9.378          1.099       8.372      9.133          1.091
CG-B        73.619     74.222         1.008       56.376     63.239         1.122
CG-C        171.36     173.036        1.010       132.997    140.317        1.055
EP-S        8.048      8.581          1.066       5.319      5.307          0.998
EP-W        8.944      10.179         1.138       6.039      6.019          0.997
EP-A        19.71      25.193         1.278       14.755     14.743         0.999
EP-B        57.366     92.385         1.610       44.706     46.567         1.042
EP-C        211.214    349.248        1.654       164.289    163.929        0.998
FT-S        7.529      7.74           1.028       4.97       4.918          0.990
FT-W+       7.651      7.68           1.004       5.135      5.151          1.003
FT-A*+      15.34      14.312         0.933       9.173      9.084          0.990
FT-B*+      83.981     77.339         0.921       50.621     50.613         1.000
FT-C*+      -          -              0.000       200.947    220.111        1.095
IS-S        7.257      7.389          1.018       4.954      4.894          0.988
IS-W        7.409      7.441          1.004       5.099      5.006          0.982
IS-A        8.526      8.435          0.989       5.787      5.799          1.002
IS-B        12.038     12.115         1.006       8.604      8.54           0.993
IS-C*+      25.397     25.69          1.012       19.662     21.655         1.101
MG-S        7.298      7.45           1.021       4.798      4.79           0.998
MG-W        7.631      7.499          0.983       5.239      5.815          1.110
MG-A*+      11.32      10.73          0.948       6.979      12.038         1.725
MG-B*+      19.083     19.1           1.001       11.718     16.88          1.441
MG-C*+      118.651    118.965        1.003       68.33      107.597        1.575
Maximum                               1.654                                 1.725
Average                               1.073                                 1.099
UPC-CHECK 1.0: Scalability
• CROW cluster
• Cray compiler
• Cray run-time environment
• 128 threads
Benchmark   Original   Instrumented   Slowdown
CG-S        4.742      4.942          1.042
CG-W        15.664     15.708         1.003
CG-A        4.912      4.99           1.016
CG-B        54.183     54.239         1.001
CG-C        58.309     58.281         1.000
EP-S        1.145      1.145          1.000
EP-W        6.247      6.243          0.999
EP-A        1.417      1.427          1.007
EP-B        7.116      7.128          1.002
EP-C        11.19      11.17          0.998
FT-S        DNR        DNR            0.000
FT-W        DNR        DNR            0.000
FT-A        DNR        DNR            0.000
FT-B        15.528     15.556         1.002
FT-C*       22.855     22.735         0.995
IS-S        3.541      3.594          1.015
IS-W        10.422     10.961         1.052
IS-A        3.56       3.658          1.028
IS-B        8.752      8.776          1.003
IS-C        10.089     10.073         0.998
MG-S        DNR        DNR            0.000
MG-W        8.288      8.308          1.002
MG-A        DNR        DNR            0.000
MG-B        9.293      9.341          1.005
MG-C        13.551     13.579         1.002
Maximum                               1.052
Average                               1.008
UPC-CHECK v1.0: Known limitations
• UPC-CHECK will not test the single-valued requirement of upc_forall statements.
• Since UPC-CHECK works on UPC source programs, it will be unable to handle any deadlocks which are created in a library that a user might be using.
• UPC-CHECK should not be used for programs where the 'main' function lies within a header file.
▫ Best effort will be made, but may lead to memory leaks at end of execution.
Challenges in checking argument errors
• Engineering challenges
▫ Exhaustiveness
▫ Argument checks against multiple functions
▫ Handling vector arguments
▫ Dependency of one argument on another argument
▫ Data-structures used
▫ Displaying the errors
A novel Deadlock Detection Algorithm
• Dynamic
• Optimal
▫ O(1) for deadlocks created by collective routines
▫ O(n) for deadlocks created by locks
• Distributed
• Scalable
A few more terms: “collective” operations
• “Collective” is a constraint placed on some language
operations which requires evaluation of such
operations to be matched across all threads. The
behavior of collective operations is undefined unless
all threads execute the same sequence of collective
operations.
• “Single valued” refers to an operand to a collective
operation, which has the same value on every
thread. The behavior of the operation is otherwise
undefined.
Central idea
• The collective requirement simply states a
relative ordering property of calls to collective
operations that must be maintained in the
parallel execution trace for all executions of any
legal program.
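The ordering property can be stated as a check over recorded traces. The sketch below, in Python rather than UPC, compares the sequence of collective calls made by each thread and reports the first position at which they disagree. It is an offline, global comparison meant only to illustrate the requirement, not the detection algorithm used by the tool, and the function name and trace format are assumptions.

```python
def first_collective_mismatch(traces):
    """Return (position, calls) for the first position where threads disagree.

    `traces[t]` is the ordered list of collective operation names executed
    by thread t. Returns None when every thread ran the same sequence.
    """
    longest = max(len(trace) for trace in traces)
    for pos in range(longest):
        # None marks a thread that finished before reaching this position.
        calls = [trace[pos] if pos < len(trace) else None for trace in traces]
        if len(set(calls)) > 1:
            return pos, calls
    return None


legal = [["upc_barrier", "upc_all_broadcast"]] * 4
illegal = [
    ["upc_barrier", "upc_all_broadcast"],
    ["upc_barrier", "upc_all_gather"],   # thread 1 calls a different collective
    ["upc_barrier"],                     # thread 2 ends early
]
print(first_collective_mismatch(legal))    # None
print(first_collective_mismatch(illegal))  # (1, ['upc_all_broadcast', 'upc_all_gather', None])
```

This comparison touches every thread at every position, which is the cost a run-time detector has to avoid.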
[Figure: execution trace; axes: threads, time]
Deadlocks in UPC
1. Not all threads are waiting at the same collective routine.
2. Some threads are waiting at the same collective routine when at least one of the threads has reached end-of-execution.
3. One of the threads at a collective routine is holding a lock that at least one other thread is trying to acquire.
5. Circular dependency for acquiring locks amongst threads.
Definition: A thread i is dependent on another thread j if thread i is trying to acquire a lock held by thread j.
6. Chain of dependency for acquiring locks leads to a thread which is waiting at a collective routine.
7. Chain of dependency for acquiring locks leads to a thread which has reached end-of-execution.
[Figures: one diagram per case, showing threads 0, 1, 2, …, i, …, j, …, T-2, T-1 against time, with end-of-execution marked where it applies]
Algorithm: Get all the threads in the picture
[Figure: threads 0, 1, 2, 3, …, i-1, i, i+1, i+2, …, j, …, T-3, T-2, T-1]
Validation method: A basic block
[Figure: two panels with axes threads and time, showing threads i-1 and i and a point labelled R]
Implementation: Algorithm 1
shared [1] deadlock_ctxt_t unified_deadlock_ctxt[THREADS];
[Figure sequence, six steps: the state and desired_state fields of unified_deadlock_ctxt for threads 0, 1, …, i-1, i, i+1, …, T-1, with entries labelled nk and un]
Atomicity and serialization of status checks
• One centralized lock solution
▫ Efficiency hit – complete serialization
• Decentralized lock solution – one lock per thread
▫ shared [1] upc_lock_t upc_check_deadlock_detection_lock[THREADS];
[Figure: threads 0, 1, 2, …, i, i+1, …, T-3, T-2, T-1]
Avoiding deadlocks created by the checks
[Figure: threads 0, 1, 2, …, i, i+1, …, T-3, T-2, T-1]
Scheme 1 of acquiring locks
• Even thread: lock[i] then lock[(i+1) % THREADS]
• Odd thread: lock[(i+1) % THREADS] then lock[i]
[Figure: threads 0, 1, 2, …, i, i+1, …, T-3, T-2, T-1; the legend marks the first and the second lock acquired by each thread]
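The even/odd rule of Scheme 1 can be checked with a small simulation. The sketch below, in Python rather than UPC, builds the "first lock, then second lock" graph for every thread and tests it for cycles: a circular wait among the checking threads would need a cycle in that graph. For contrast it also tests a naive order in which every thread takes lock[i] first. The function names are hypothetical, and the check covers thread counts of 2 and above.

```python
def scheme1_order(i, threads):
    """Locks taken by thread i, in acquisition order, under Scheme 1."""
    own, nxt = i, (i + 1) % threads
    return (own, nxt) if i % 2 == 0 else (nxt, own)


def naive_order(i, threads):
    """Contrast case: every thread takes its own lock first."""
    return (i, (i + 1) % threads)


def has_cycle(threads, order):
    """True if the 'first lock -> second lock' graph contains a cycle."""
    edges = {}
    for i in range(threads):
        first, second = order(i, threads)
        edges.setdefault(first, set()).add(second)

    visiting, done = set(), set()

    def visit(lock):
        if lock in done:
            return False
        if lock in visiting:
            return True  # back edge: a circular wait is possible
        visiting.add(lock)
        found = any(visit(nxt) for nxt in edges.get(lock, ()))
        visiting.discard(lock)
        done.add(lock)
        return found

    return any(visit(lock) for lock in range(threads))


print(has_cycle(8, scheme1_order), has_cycle(8, naive_order))  # False True
```

With the naive order the graph is a ring over all locks, so all threads can end up holding one lock and waiting for the next. Alternating the order by thread parity breaks that ring.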
Scheme 1: Maximum latency of acquiring locks for even number of threads
[Figure: longest dependency chains when i is odd and when i is even, drawn over neighbouring threads i-2 … i+2, with 1 and 2 marking the first and second lock acquired]
Maximum latency is 3 or O(1)
Maximum latency: when total number of threads is odd
Maximum latency is 4 or O(1)
Efficiency
• The number of threads for which any thread has to wait before entering its critical section is O(1).
• The number of remote memory accesses is O(1), as any thread i only accesses memory related to the state of thread i and thread (i+1)%THREADS.
• Optimal!
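The constant cost follows from each thread comparing itself only with its neighbour. The sketch below is one plausible reading of such a neighbour check, written in Python as a sequential simulation. It is not taken from the slides: the class name, the matching rules and the use of `None` for an unset entry are assumptions; only the field names `state` and `desired_state` come from the deck. It also assumes every collective is synchronizing, so threads proceed in rounds, and it ignores atomicity because the simulation runs one thread at a time.

```python
UNKNOWN = None  # assumed marker for "nothing recorded"


class RingChecker:
    """Each thread reads and writes only its own entry and its successor's."""

    def __init__(self, threads):
        self.threads = threads
        self.state = [UNKNOWN] * threads          # op a thread is waiting in
        self.desired_state = [UNKNOWN] * threads  # op the predecessor expects

    def enter_collective(self, i, op):
        """Thread i reaches collective `op`; return an error string or None."""
        # Step 1: compare with what the predecessor recorded for thread i.
        expected = self.desired_state[i]
        if expected is UNKNOWN:
            self.state[i] = op                    # predecessor not here yet
        elif expected != op:
            prev = (i - 1) % self.threads
            return f"thread {i} called {op}, thread {prev} called {expected}"
        else:
            self.desired_state[i] = UNKNOWN       # matched with predecessor

        # Step 2: compare with the successor, the only remote entry touched.
        j = (i + 1) % self.threads
        if self.state[j] is UNKNOWN:
            self.desired_state[j] = op            # successor not here yet
        elif self.state[j] != op:
            return f"thread {i} called {op}, thread {j} called {self.state[j]}"
        else:
            self.state[j] = UNKNOWN               # matched with successor
        return None


def run_round(checker, ops, arrival_order):
    """Feed one round of collective calls; stop at the first error."""
    for i in arrival_order:
        error = checker.enter_collective(i, ops[i])
        if error:
            return error
    return None


checker = RingChecker(4)
ok = run_round(checker, ["upc_barrier"] * 4, [2, 0, 3, 1])
bad = run_round(
    checker,
    ["upc_barrier", "upc_barrier", "upc_all_broadcast", "upc_barrier"],
    [0, 1, 2, 3],
)
print(ok)
print(bad)
```

Any mismatch across the whole set of threads shows up as a mismatch between some pair of neighbours, so whichever thread of that pair arrives second reports it without looking at any other thread.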
When a thread reaches a upc_lock
• Track requested locks and acquired locks
• Look out for cyclical hold-and-wait conditions
• Look out for chain of hold-and-wait conditions which
lead to a thread blocked at a collective routine
▫ If a thread has reached a collective routine, check if there is a
request for a lock that the thread is holding
• Look out for chain of hold-and-wait conditions which
lead to a thread which has reached end-of-execution
▫ If a thread is exiting without freeing all locks held by it, then
check if there is a request for a lock that the thread is holding
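These checks amount to following a chain of hold-and-wait relations from the requested lock. The sketch below, in Python rather than UPC, walks that chain and stops when it returns to a thread already seen, reaches a thread blocked at a collective routine, or reaches a thread that has finished. The bookkeeping dictionaries, the status names and the function name are hypothetical. Since each thread waits on at most one lock, the walk visits each thread at most once.

```python
RUNNING, AT_COLLECTIVE, FINISHED = "running", "at_collective", "finished"


def check_lock_request(requester, lock, holder_of, waiting_for, status):
    """Follow the hold-and-wait chain started when `requester` asks for `lock`.

    holder_of:   dict lock -> thread currently holding it
    waiting_for: dict thread -> lock that thread is blocked on
    status:      dict thread -> RUNNING, AT_COLLECTIVE or FINISHED
    Returns (kind, chain_of_threads) for a deadlock, or None if none is found.
    """
    chain = [requester]
    seen = {requester}
    while lock in holder_of:
        holder = holder_of[lock]
        chain.append(holder)
        if holder in seen:
            return ("cycle", chain)               # circular dependency
        seen.add(holder)
        if status[holder] == AT_COLLECTIVE:
            return ("blocked_at_collective", chain)
        if status[holder] == FINISHED:
            return ("holder_finished", chain)
        if holder not in waiting_for:
            return None                           # holder is running, may unlock
        lock = waiting_for[holder]
    return None                                   # the lock is free


holder_of = {"A": 0, "B": 1, "C": 2}
waiting_for = {1: "C"}
status = {0: RUNNING, 1: RUNNING, 2: AT_COLLECTIVE}
at_collective = check_lock_request(0, "B", holder_of, waiting_for, status)
print(at_collective)  # ('blocked_at_collective', [0, 1, 2])

status[2] = RUNNING
waiting_for[2] = "A"
cycle = check_lock_request(0, "B", holder_of, waiting_for, status)
print(cycle)  # ('cycle', [0, 1, 2, 0])
```

The same lookup, run from the other side, covers a thread that reaches a collective routine or exits while holding locks: it only needs to ask whether any of its held locks appears in `waiting_for`.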
Papers
1. Coyle, J., Hoekstra, J., Kraeva, M., Luecke, G. R., Kleiman, R., Roy, I. (2009). UPC Compile-Time Error Detection Test Suite. http://kraeva.public.iastate.edu/rted/UPCct.TestPlan.pdf, Iowa State University, High Performance Computing Group.
2. Roy, I., Luecke, G. R., Coyle, J., Kraeva, M., Hoekstra, J. (2011). UPC-CHECK: A run-time error detection tool for programs written in UPC. Preprint.
3. Roy, I., Luecke, G. R., Coyle, J., Kraeva, M., Hoekstra, J. (2011). An O(1) algorithm to detect deadlocks in collective routines in the distributed shared memory model. Preprint.
Thank you