UPC-CHECK: A scalable tool for detecting run-time errors in Unified Parallel C
Indranil Roy
High Performance Computing (HPC) group

Segmentation error. Core dumped.

A good error message
Thread 0 encountered invalid arguments in function upc_all_broadcast at line 26 in file /home/jjc/ex1.upc.
Error: Parameter (sizeof(int) * sh_val) passes nonpositive value of 0 to nbytes argument
Variable sh_val was declared at line 10 in file /home/jjc/ex1.upc.

Outline
▫ Understanding Unified Parallel C
▫ UPC-CHECK 1.0 tool
    How does it work?
    Usability
    Error coverage and quality of error reports generated
    Testing
    Overheads
    Scalability
    Known limitations
▫ Challenges in argument error detection
▫ Deadlock detection algorithm
▫ Demo

Understanding Unified Parallel C
• Shared memory model
• Distributed memory model
• Unified Parallel C
  ▫ Distributed Shared Memory Model or Partitioned Global Address Space Model

UPC-CHECK v1.0
• Source to source translator
• Pre-compiler
• Error handling
  ▫ Argument errors
  ▫ Deadlocks

UPC-CHECK: Usability
• Portable
  ▫ Machine independent
  ▫ Compiler independent
• Ease of use
  ▫ Easy to install
      install_UPC-CHECK
  ▫ Easy to run
• Freely available
      wget http://hpcgroup.public.iastate.edu/UPCCHECK/UPC-CHECK.tar.gz

UPC-CHECK 1.0: Usability
• Usage
      upc-check [compiler options] [--upccheck:flag [-upccheck:flag] ...] -c sourcefile.upc

      -a|-d_argument_check          disables argument checking (enabled by default)
      -d|-d_deadlock_check          disables deadlock checking (enabled by default)
      -s|-e_track_func_call_stack   enables tracing of function call stack (disabled by default)
      -h|--h|-help                  prints help for UPC-CHECK
• Just replace your compile-command with upc-check.

Quality of error reports generated
• Coyle, J., Hoekstra, J., Kraeva, M., Luecke, G. R., Kleiman, R., Srinivas, V., Tripathi, A., Weiss, O., Wehe, A., Xu, Y., Yahya, M. (2008). UPC Run-Time Error Detection Test Suite.
  http://kraeva.public.iastate.edu/rted/UPC.TestPlan.pdf, Iowa State University, High Performance Computing Group.
  ▫ A score of 5 is given for a detailed error message that will assist a programmer to fix the error.
  ▫ A score of 4 is given for error messages with more information than a score of 3 and less than 5. This is tailored for each test.
  ▫ A score of 3 is given for error messages with the correct error name, line number and the name of the file where the error occurred.
  ▫ A score of 2 is given for error messages with the correct error name and line number where the error occurred but not the file name where the error occurred.
  ▫ A score of 1 is given for error messages with the correct error name.
  ▫ A score of 0 is given when the error was not detected.

Run-time environments   Argument errors   Deadlocks
Cray                    0.38              0
Berkeley                0.04              0.58
HP                      0                 0.36
GNU                     0                 0.27
UPC-CHECK               4.89              5

UPC-CHECK 1.0: Testing
• 400 error test-cases
• 1800 false-positive cases
• Additional testing for deadlocks
• Testing across application programs

UPC-CHECK 1.0: Overhead
• Base memory requirement
  ▫ ~128 KB per thread
  ▫ With every acquired or requested shared memory lock, the requirement goes up by around 256 B
  ▫ While tracking the function call stack, with every level of nested function call, the memory requirement goes up by around 512 B
• Increase of code section
  ▫ ~100 lines of instrumentation per UPC operation
  ▫ ~12000 lines from support files

Efficiency overhead
            Berkeley UPC                         Cray UPC
Benchmark   Original   Instrumented   Overhead   Original   Instrumented   Overhead
CG-S        7.329      7.393          1.009      7.34       7.924          1.080
CG-W        7.554      7.613          1.008      7.576      8.344          1.101
CG-A        8.531      9.378          1.099      8.372      9.133          1.091
CG-B        73.619     74.222         1.008      56.376     63.239         1.122
CG-C        171.36     173.036        1.010      132.997    140.317        1.055
EP-S        8.048      8.581          1.066      5.319      5.307          0.998
EP-W        8.944      10.179         1.138      6.039      6.019          0.997
EP-A        19.71      25.193         1.278      14.755     14.743         0.999
EP-B        57.366     92.385         1.610      44.706     46.567         1.042
EP-C        211.214    349.248        1.654      164.289    163.929        0.998
FT-S        7.529      7.74           1.028      4.97       4.918          0.990
FT-W+       7.651      7.68           1.004      5.135      5.151          1.003
FT-A*+      15.34      14.312         0.933      9.173      9.084          0.990
FT-B*+      83.981     77.339         0.921      50.621     50.613         1.000
FT-C*+      -          -              0.000      200.947    220.111        1.095
IS-S        7.257      7.389          1.018      4.954      4.894          0.988
IS-W        7.409      7.441          1.004      5.099      5.006          0.982
IS-A        8.526      8.435          0.989      5.787      5.799          1.002
IS-B        12.038     12.115         1.006      8.604      8.54           0.993
IS-C*+      25.397     25.69          1.012      19.662     21.655         1.101
MG-S        7.298      7.45           1.021      4.798      4.79           0.998
MG-W        7.631      7.499          0.983      5.239      5.815          1.110
MG-A*+      11.32      10.73          0.948      6.979      12.038         1.725
MG-B*+      19.083     19.1           1.001      11.718     16.88          1.441
MG-C*+      118.651    118.965        1.003      68.33      107.597        1.575
Maximum                               1.654                                1.725
Average                               1.073                                1.099

UPC-CHECK 1.0: Scalability
• CROW cluster
• Cray compiler
• Cray run-time environment
• 128 threads

Benchmark   Original   Instrumented   Slowdown
CG-S        4.742      4.942          1.042
CG-W        15.664     15.708         1.003
CG-A        4.912      4.99           1.016
CG-B        54.183     54.239         1.001
CG-C        58.309     58.281         1.000
EP-S        1.145      1.145          1.000
EP-W        6.247      6.243          0.999
EP-A        1.417      1.427          1.007
EP-B        7.116      7.128          1.002
EP-C        11.19      11.17          0.998
FT-S        DNR        DNR            0.000
FT-W        DNR        DNR            0.000
FT-A        DNR        DNR            0.000
FT-B        15.528     15.556         1.002
FT-C*       22.855     22.735         0.995
IS-S        3.541      3.594          1.015
IS-W        10.422     10.961         1.052
IS-A        3.56       3.658          1.028
IS-B        8.752      8.776          1.003
IS-C        10.089     10.073         0.998
MG-S        DNR        DNR            0.000
MG-W        8.288      8.308          1.002
MG-A        DNR        DNR            0.000
MG-B        9.293      9.341          1.005
MG-C        13.551     13.579         1.002
Maximum                               1.052
Average                               1.008

UPC-CHECK v1.0: Known limitations
• UPC-CHECK will not test the single-valued requirement of upc_forall statements.
• Since UPC-CHECK works on UPC source programs, it cannot detect deadlocks created inside a library that a user might be using.
• UPC-CHECK should not be used for programs where the 'main' function lies within a header file
  ▫ Best effort will be made, but may lead to memory leaks at end of execution.
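
To make the argument checks concrete, here is a minimal sketch in plain C of a check on an nbytes argument that builds a report in the style of the good error message shown on the opening slide. The type call_site_t and the function check_nbytes are hypothetical names chosen for illustration; the slides do not show UPC-CHECK's actual instrumentation, and a real check would run inside a UPC program with MYTHREAD supplying the thread number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record of where a checked call sits in the source. */
typedef struct {
    const char *func;   /* name of the UPC routine being checked      */
    const char *file;   /* source file containing the call            */
    int line;           /* line number of the call                    */
    int thread;         /* calling thread (MYTHREAD in a UPC program) */
} call_site_t;

/*
 * Check the value passed as nbytes to a collective routine.
 * expr is the source text of the argument expression.
 * Returns 1 if the value is positive. Otherwise returns 0 and writes a
 * report into buf (at most buflen bytes, always NUL-terminated).
 */
static int check_nbytes(const call_site_t *site, const char *expr,
                        long long value, char *buf, size_t buflen)
{
    assert(site != NULL && expr != NULL && buf != NULL && buflen > 0);
    buf[0] = '\0';
    if (value > 0)
        return 1;
    snprintf(buf, buflen,
             "Thread %d encountered invalid arguments in function %s "
             "at line %d in file %s.\n"
             "Error: Parameter (%s) passes nonpositive value of %lld "
             "to nbytes argument",
             site->thread, site->func, site->line, site->file,
             expr, value);
    return 0;
}
```

A caller would fill a call_site_t with the routine name, file and line recorded at instrumentation time, pass the argument's source text and run-time value, and print buf when the function returns 0. The declaration-site line of the report ("Variable sh_val was declared at ...") is left out of this sketch.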

Challenges in checking argument errors
• Engineering challenges
  ▫ Exhaustiveness
  ▫ Argument checks against multiple functions
  ▫ Handling vector arguments
  ▫ Dependency of one argument on another argument
  ▫ Data-structures used
  ▫ Displaying the errors

A novel Deadlock Detection Algorithm
• Dynamic
• Optimal
  ▫ O(1) for deadlocks created by collective routines
  ▫ O(n) for deadlocks created by locks
• Distributed
• Scalable

A few more terms: “collective” operations
• “Collective” is a constraint placed on some language operations which requires evaluation of such operations to be matched across all threads. The behavior of collective operations is undefined unless all threads execute the same sequence of collective operations.
• “Single valued” refers to an operand to a collective operation, which has the same value on every thread. The behavior of the operation is otherwise undefined.

Central idea
• The collective requirement simply states a relative ordering property of calls to collective operations that must be maintained in the parallel execution trace for all executions of any legal program.
[Figure: threads versus time]

Deadlocks in UPC
[Figures: each case is drawn as a threads-versus-time diagram over threads 0, 1, 2, …, i, …, j, …, T-2, T-1]
1. Not all threads are waiting at the same collective routine.
2. Some threads are waiting at the same collective routine when at least one of the threads has reached end-of-execution.
3. One of the threads at a collective routine is holding a lock that at least one of the other threads is trying to acquire.
5. Circular dependency for acquiring locks amongst threads.
   Definition: A thread i is dependent on another thread j if thread i is trying to acquire a lock held by thread j.
6. Chain of dependency for acquiring locks leads to a thread which is waiting at a collective routine.
6. Chain of dependency for acquiring locks leads to a thread which has reached end of execution.

Algorithm: Get all the threads in the picture
[Figure: threads 0, 1, 2, 3, …, i-1, i, i+1, i+2, …, j, …, T-3, T-2, T-1]

Validation method: A basic block
[Figure: threads i-1 and i over time, with regions R i-1 and R i]

Implementation: Algorithm 1
      shared [1] deadlock_ctxt_t unified_deadlock_ctxt[THREADS];
[Figure, six animation frames: the state and desired_state fields of unified_deadlock_ctxt for threads 0, 1, …, i-1, i, i+1, …, T-1, with entries marked nk and un]

Atomicity and serialization of status checks
• One centralized lock solution
  ▫ Efficiency hit: complete serialization
• Decentralized lock solution: one lock per thread
  ▫ shared [1] upc_lock_t upc_check_deadlock_detection_lock[THREADS];
[Figure: threads 0, 1, 2, …, i, i+1, …, T-3, T-2, T-1]

Avoiding deadlocks created by the checks
[Figure: threads 0, 1, 2, …, i, i+1, …, T-3, T-2, T-1]

Scheme 1 of acquiring locks
• Even thread: lock[i] then lock[(i+1) % THREADS]
• Odd thread: lock[(i+1) % THREADS] then lock[i]
[Figure: threads 0, 1, 2, …, i, i+1, …, T-3, T-2, T-1; legend: first lock acquired, second lock acquired]

Scheme 1: Maximum latency of acquiring locks for even number of threads
[Figure: longest dependency chains when i is odd and when i is even, over threads i-2, i-1, i, i+1]
• Maximum latency is 3, or O(1)

Maximum latency: when the total number of threads is odd
[Figure: dependency chain over threads i-1, i, i+1, i+2]
• Maximum latency is 4, or O(1)

Efficiency
• The number of
threads for which any thread has to wait before entering its critical section is O(1).
• The number of remote memory accesses is O(1), as any thread i accesses only memory related to the state of thread i and thread (i+1)%THREADS.
• Optimal!

When a thread reaches a upc_lock
• Track requested locks and acquired locks
• Look out for cyclical hold-and-wait conditions
• Look out for chains of hold-and-wait conditions which lead to a thread blocked at a collective routine
  ▫ If a thread has reached a collective routine, check if there is a request for a lock that the thread is holding
• Look out for chains of hold-and-wait conditions which lead to a thread which has reached end-of-execution
  ▫ If a thread is exiting without freeing all locks held by it, then check if there is a request for a lock that the thread is holding

Papers
1. Coyle, J., Hoekstra, J., Kraeva, M., Luecke, G. R., Kleiman, R., Roy, I. (2009). UPC Compile-Time Error Detection Test Suite. http://kraeva.public.iastate.edu/rted/UPCct.TestPlan.pdf, Iowa State University, High Performance Computing Group.
2. Roy, I., Luecke, G. R., Coyle, J., Kraeva, M., Hoekstra, J. (2011). UPC-CHECK: A run-time error detection tool for programs written in UPC. Preprint.
3. Roy, I., Luecke, G. R., Coyle, J., Kraeva, M., Hoekstra, J. (2011). An O(1) algorithm to detect deadlocks in collective routines in the distributed shared memory model. Preprint.

Thank you
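
Supplementary sketch: Scheme 1 of acquiring locks
The lock ordering of Scheme 1 (even threads take lock[i] first, odd threads take lock[(i+1) % THREADS] first) can be sketched in plain C as below. The fragment only computes lock indices and checks whether the resulting lock-order graph has a cycle; it does not call upc_lock. The names lock_order_t, scheme1_order, naive_order, lock_order_has_cycle and MAX_THREADS are illustrative assumptions and are not taken from UPC-CHECK. The sketch assumes at least two threads.

```c
#include <assert.h>

#define MAX_THREADS 64   /* illustrative bound for the fixed-size arrays */

/* Order in which one thread takes the two per-thread locks it needs. */
typedef struct {
    int first;    /* index of the lock acquired first  */
    int second;   /* index of the lock acquired second */
} lock_order_t;

/* Scheme 1: even threads take lock[i] then lock[(i+1) % threads];
 * odd threads take the same two locks in the opposite order. */
static lock_order_t scheme1_order(int i, int threads)
{
    lock_order_t o;
    int next;
    assert(threads >= 2 && i >= 0 && i < threads);
    next = (i + 1) % threads;
    if (i % 2 == 0) { o.first = i;    o.second = next; }
    else            { o.first = next; o.second = i;    }
    return o;
}

/* For contrast: every thread takes its own lock first. */
static lock_order_t naive_order(int i, int threads)
{
    lock_order_t o;
    assert(threads >= 2 && i >= 0 && i < threads);
    o.first = i;
    o.second = (i + 1) % threads;
    return o;
}

/* Build the lock-order graph (one edge first -> second per thread) and
 * return 1 if it contains a directed cycle, 0 otherwise. A cycle means
 * a circular wait among the checking locks is possible. */
static int lock_order_has_cycle(int threads, lock_order_t (*order)(int, int))
{
    int indeg[MAX_THREADS] = {0};
    int removed[MAX_THREADS] = {0};
    lock_order_t edges[MAX_THREADS];
    int remaining, progress, k, t;

    assert(threads >= 2 && threads <= MAX_THREADS);
    for (t = 0; t < threads; t++) {
        edges[t] = order(t, threads);
        indeg[edges[t].second]++;
    }
    /* Kahn's algorithm: repeatedly remove locks with no incoming edge. */
    remaining = threads;
    do {
        progress = 0;
        for (k = 0; k < threads; k++) {
            if (!removed[k] && indeg[k] == 0) {
                removed[k] = 1;
                remaining--;
                progress = 1;
                for (t = 0; t < threads; t++)
                    if (edges[t].first == k)
                        indeg[edges[t].second]--;
            }
        }
    } while (progress);
    return remaining > 0;   /* locks left over all lie on or behind a cycle */
}
```

Calling lock_order_has_cycle(threads, naive_order) reports a cycle, while the same call with scheme1_order does not for the even and odd thread counts tried in this sketch, which matches the slides' point that alternating the order keeps the checks themselves from deadlocking.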