UPC/SHMEM PAT Report – First Iteration
January 4th, 2005
UPC Group
University of Florida

Abstract

Due to the complex nature of parallel and distributed computing systems and applications, the optimization of UPC programs can be a significant challenge without proper tools for performance analysis. The UPC group at the University of Florida is investigating key concepts and developing a comprehensive high-level design for a performance analysis tool (PAT), or suite of tools, that will directly support analysis and optimization of UPC programs on a variety of HPC platforms, with an emphasis on usability and productivity. This report details the approach we are taking to design our performance tool, along with pertinent information we have obtained related to the design and functionality of performance tools in general.

Table of Contents

1 Introduction
2 Approach
3 Programming practices
  3.1 Algorithm descriptions
    3.1.1 Differential cryptanalysis for the CAMEL cipher
    3.1.2 Mod 2^n inverse - NSA benchmark 9
    3.1.3 Convolution
    3.1.4 Concurrent wave equation
    3.1.5 Depth-first search (DFS)
  3.2 Code overview
    3.2.1 CAMEL
    3.2.2 Mod 2^n inverse
    3.2.3 Convolution
    3.2.4 Concurrent wave equation
    3.2.5 Depth-first search
  3.3 Analysis
    3.3.1 CAMEL (MPI and UPC)
    3.3.2 Mod 2^n inverse (C, MPI, UPC, and SHMEM)
    3.3.3 Convolution (C, MPI, UPC, and SHMEM)
    3.3.4 Concurrent wave equation (C and UPC)
    3.3.5 Depth-first search (C and UPC)
  3.4 Conclusions
  3.5 References
4 Performance tool strategies
5 Analytical performance modeling
  5.1 Formal performance models
    5.1.1 Petri nets
    5.1.2 Process algebras
    5.1.3 Queuing theory
    5.1.4 PAMELA
  5.2 General analytical performance models
    5.2.1 PRAM
    5.2.2 BSP
    5.2.3 LogP
    5.2.4 Other techniques
  5.3 Predictive performance models
    5.3.1 Lost cycles analysis
    5.3.2 Adve's deterministic task graph analysis
    5.3.3 Simon and Wierum's task graphs
    5.3.4 ESP
    5.3.5 VFCS
    5.3.6 PACE
    5.3.7 Convolution method
    5.3.8 Other techniques
  5.4 Conclusion and recommendations
  5.5 References
6 Experimental performance measurement
  6.1 Instrumentation
    6.1.1 Instrumentation overhead
    6.1.2 Profiling and tracing
    6.1.3 Manual vs. automatic
    6.1.4 Number of passes
    6.1.5 Levels of instrumentation
    6.1.6 References
  6.2 Measurement
    6.2.1 Performance factors
    6.2.2 Measurement strategies
    6.2.3 Factor list + experiments
    6.2.4 References
  6.3 Analysis
  6.4 Presentation
    6.4.1 Usability
    6.4.2 Presentation methodology
    6.4.3 References
  6.5 Optimization
    6.5.1 Optimization techniques
    6.5.2 Performance bottleneck identification
    6.5.3 References
7 Language analysis
8 Tool design
9 Tool evaluation strategies
  9.1 Pre-execution issues
    9.1.1 Cost
    9.1.2 Installation
    9.1.3 Software support (libraries/compilers)
    9.1.4 Hardware support (platform)
    9.1.5 Heterogeneity support
    9.1.6 Learning curve
  9.2 Execution-time issues
    9.2.1 Stage 1: instrumentation issues
    9.2.2 Stage 2: measurement issues
    9.2.3 Stage 3: analysis issues
    9.2.4 Stage 4: presentation issues
    9.2.5 Stage 5: optimization issues
    9.2.6 Response time
  9.3 Other issues
    9.3.1 Extendibility
    9.3.2 Documentation quality
    9.3.3 System stability
    9.3.4 Technical support
    9.3.5 Multiple executions
    9.3.6 Searching
  9.4 References
10 Tool evaluations
11 Conclusion

1 Introduction

To be written.
2 Approach

To be written. (Hybrid approach; borrow from whitepaper + new info + new strategies on tool framework/approach.)

3 Programming practices

To effectively research and develop a useful PAT for UPC and SHMEM, it is necessary to understand the various aspects of the languages and their supporting environments. To accomplish this goal, we coded several commonly used algorithms in sequential C. After writing the sequential versions, we created parallel versions of the same algorithms using MPI, UPC, or SHMEM. We ran the parallel versions on our available hardware and compared the performance of the different implementations. In addition, for a few of the algorithms we tried UPC-specific hand optimizations in an attempt to gather a list of techniques that can be used to improve UPC program performance. This section summarizes our experiences writing the codes and outlines the problems we encountered while writing and testing them.

The rest of this section is structured as follows. Section 3.1 contains brief descriptions of the algorithms used. Section 3.2 gives an overview of the coding process used for each algorithm. In Section 3.3, the performance results are shown and analyzed. Finally, Section 3.4 gives the conclusions we drew from these programming practices.

3.1 Algorithm descriptions

In this section, we present overviews of the different algorithms implemented.

3.1.1 Differential cryptanalysis for the CAMEL cipher

CAMEL, or Chris And Matt's Encryption aLgorithm, is an encryption algorithm developed by Matt Murphy and Chris Conger of the HCS lab. The algorithm was created as a test case while studying the effects of hardware changes on the performance of cryptanalysis programs running on high-performance computing hardware. The algorithm is based on the S-DES cipher [3.1]. An overview of the cipher function is shown in Figure 3.1.

A sequential C program was written by members of the HCS lab that performed a differential cryptanalysis on the algorithm. The program first encrypted a block of text using a user-specified key, and then performed a differential attack on the text using only the S-boxes used by the algorithm and the encrypted text itself. The original sequential version of the code contained about 600 lines of C. The algorithm used during the differential attack phase constituted most of the program's overall execution time, so we decided it would be beneficial to create a parallel version of the differential attack and see how much speedup we could obtain. For this algorithm, we implemented versions in UPC and MPI. Since the sequential code is fairly large and the data flow in the program is not trivial, we felt it would give us an excellent vehicle for comparing UPC and MPI.

Figure 3.1 - CAMEL cipher diagram

3.1.2 Mod 2^n inverse - NSA benchmark 9

While searching for other algorithms to implement, we examined the NSA benchmark suite and decided it would be worthwhile to implement the Mod 2^n inverse benchmark. This benchmark was selected because it is a very bandwidth- and memory-intensive program, even though its sequential implementation is small. We were interested to see how much efficiency we could obtain and how well each language coped with the difficulties presented by the benchmark.
The basic idea of this benchmark is: given a list A of 64-bit integers whose values range from 0 to 2^j - 1, compute two lists:

- List B, where B[i] = A[i] "right justified."
- List C, such that (B[i] * C[i]) % 2^j = 1.

List C is constructed using an iterative algorithm (which is discussed in Section 3.2.2). In our implementation, we also included a "checking" phase in which one processor traverses lists B and C and double-checks that the lists were computed correctly. The computation of the lists is embarrassingly parallel (although memory-intensive), but the check phase can be extremely bandwidth-intensive, especially on architectures that do not have shared memory. We implemented C, UPC, SHMEM, and MPI versions of this benchmark. In addition, we decided to try a few optimizations on the UPC version to see how different optimization techniques impacted overall program performance.

3.1.3 Convolution

Convolution is a simple operation often performed in signal processing applications. The basic definition of the convolution of two discrete sequences X and H is:

  C[n] = Σ_k X[k] * H[n - k]

The algorithmic complexity of direct convolution is order N^2, which results in slow computation for even moderately sized sequences. In practice, convolution is rarely computed directly: it can instead be computed by taking the Fast Fourier Transform of both sequences, multiplying them, and taking an inverse Fourier transform of the result, which yields an algorithm of complexity N*log2(N). Our implementation used the N^2 algorithm, as this allowed us to more easily measure the effect of our optimizations on the parallel versions.

For this application, we implemented C, UPC, MPI, and SHMEM versions. We also decided to try the same optimizations used in our Mod 2^n inverse UPC implementation on the UPC version. Convolution has different computational properties than Mod 2^n inverse, so this was an ideal test to see whether the same UPC optimization strategies have similar effects on a totally different type of code.

3.1.4 Concurrent wave equation

The wave equation is an important partial differential equation which describes waves of all kinds, such as sound waves, light waves, and water waves [3.2]. It arises in many different fields such as acoustics, electromagnetics, and fluid dynamics. Variations of the wave equation are also found in quantum mechanics and general relativity. The general form of the wave equation is:

  ∂²u/∂t² = c² ∇²u

Here, c is the speed of the wave's propagation and u = u(p, t) describes the wave's amplitude at position p and time t. The one-dimensional form can be used to represent a flexible string stretched between two points on the x-axis. When specialized to one dimension, the wave equation takes the form:

  ∂²u/∂t² = c² ∂²u/∂x²

We developed two UPC implementations to solve the wave equation in one dimension as described in Chapter 5 of [3.3]. The sequential C version of the program is readily available on the web [3.5], and this formed the basis of our UPC implementations. One version of our UPC code is derived from the unoptimized code found on the web. The other version is derived from a modified version of the sequential C code that employs several hand optimizations.
These optimizations include the removal of redundant calculations and the use of global and temporary variables to store intermediate results, which combined to produce a 30% speedup over the original sequential code on our Xeon cluster.

3.1.5 Depth-first search (DFS)

Many programmers use tree data structures for their storage efficiency, so efficient tree-searching algorithms are a necessity. Depth-first search is one such commonly used algorithm. In our search algorithm, the target data is first matched against the root node of the tree, which has a depth level of 1. The search stops if a match is found. Otherwise, all children of the root node (which are at depth level 2) are matched against the target data. If no match is found, the nodes at the next depth are searched. This process continues for increasing depth levels until a match is made at a particular level. The algorithmic complexity of this algorithm is order N on sequential machines and order log(N) in a parallel environment. For this algorithm, we first implemented a sequential version. We then coded two UPC versions, one using a upc_forall loop and one using a manual for loop for work distribution.

3.2 Code overview

In this section, we give an overview of the code for each of the algorithms implemented. Any difficulties we encountered that resulted from limitations imposed by specific languages are also presented here.

3.2.1 CAMEL

The original code for the sequential cryptanalysis program can be broken up into three distinct phases:

- An initialization phase, which initializes the S-boxes and computes the optimal difference pair based on the chosen values for the S-boxes.
- A main computational phase, which first gets a list of possible candidate keys and then checks those candidate keys using brute-force methods in concert with the optimal difference pair that was previously computed.
- A wrap-up phase, which combines the results of the cryptanalysis phase and returns data to the user, including the broken cipher key.

The first and third phases of the program are dwarfed by the execution time of the main computational phase, which can take hours to generate all possible candidate key pairs. To keep the computation times under control, we chose keys from a limited range and adjusted the main computational loop to search over only a subset of the possible key space. This kept the runtime of the main phase within a reasonable time for our evaluation purposes while retaining similar (but scaled-down) performance characteristics of a full run.

We decided to use coarse-grained parallelism for our UPC port of the CAMEL differential cryptanalysis program. After experimenting with small pieces of the program under different platforms and UPC runtimes, we concluded that this would keep the performance of the resulting parallel program high. Coarse-grained code also represents the type of code that UPC is well suited for. In fact, once we adopted this strategy, creating the UPC version of the application became very straightforward. We restructured the application slightly so it would lend itself better to parallelization, used the upc_forall construct in several key places, and added some synchronization code to complete the parallelization process. Restructuring of the application was necessary so that the corresponding for loops in the original C code could be easily converted to upc_forall loops.
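As a simple, hypothetical illustration of this kind of conversion (the array and function names below are placeholders and are not taken from the CAMEL source), a data-parallel C loop can be turned into a upc_forall loop whose affinity expression assigns each iteration to the thread that owns the element being written:

#include <upc_relaxed.h>

#define PER_THREAD 1024
shared int data[PER_THREAD * THREADS];      /* shared input array (illustrative) */
shared int result[PER_THREAD * THREADS];    /* shared output array (illustrative) */

int compute(int x) { return x * x; }        /* placeholder for the real work */

void convert_example(void)
{
    int i;

    /* Original sequential form:
     *   for (i = 0; i < PER_THREAD * THREADS; i++)
     *       result[i] = compute(data[i]);
     *
     * UPC form: each thread executes only the iterations whose affinity
     * expression (&result[i]) refers to an element with affinity to it.
     */
    upc_forall (i = 0; i < PER_THREAD * THREADS; i++; &result[i]) {
        result[i] = compute(data[i]);
    }
}

Because the affinity expression keeps each thread working on locally resident elements, such a loop usually needs only a barrier afterwards rather than explicit communication.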
Listed below is the pseudocode for the original main computation phase of the application.

for (each possible key pair) {
    if (possible candidate key pair) {
        count++;
        if (count < 3) {
            iterate over whole key space and add keys to list
            if they match with this key pair
        } else {
            only iterate over candidate keys previously added and
            check if they match with this key pair
        }
    }
}

This code was restructured into C code implementing the pseudocode shown below.

for (each possible key pair, done in parallel) {
    if (possible candidate key pair) {
        add to global list
    }
}

for (each key in global list) {
    if (count < 3) {
        iterate over whole key space in parallel and add keys to list
        if they match with this key pair
    } else {
        only iterate over candidate keys previously added, in parallel, and
        check if they match with this key pair
    }
}

The UPC implementation of the first for loop of the restructured computation phase is shown below. Since each thread in the UPC application has access to the cresult and cresultcnt variables, they are protected with a lock.

upc_forall(input = 0; input < NUMPAIRS; input++; input) {
    // grab all crypts that match up
    docrypt(key32, input, 0, &R1X, &R1Y, &C, &C2);              // perform 2 encryptions
    curR1Y = R1Y;                                               // per iteration of loop
    docrypt(key32, (input ^ R1XCHAR), 1, &R1X, &R1Y, &C, &C2);
    if ((R1Y ^ curR1Y) == (R1YCHAR)) {
        // lock the result array & stick it in there
        upc_lock(resultlock);
        cresult[cresultcnt].r1y = R1Y;
        cresult[cresultcnt].curr1y = curR1Y;
        cresult[cresultcnt].c = C;
        cresult[cresultcnt].c2 = C2;
        cresultcnt++;
        upc_unlock(resultlock);
    }
}

The UPC implementation of the pseudocode that iterates over the key space in parallel is shown below. Since all threads have access to the PK2 and sharedindex variables, they are protected with a lock. The macro MAINKEYLOOP controls how much of the key space the differential search iterates over.

upc_forall(m = 0; m < MAINKEYLOOP; m++; m) {
    upc_forall(k = 0; k < 1048576; k++; continue) {
        testKey = (1048576 * m) + k;
        if ((lastRound(curR1Y, testKey) == C) && (lastRound(R1Y, testKey) == C2)) {
            upc_lock(resultlock);
            PK2[sharedindex] = testKey;
            sharedindex++;
            upc_unlock(resultlock);
        }
    }
}

Finally, the UPC code that iterates over the previously found candidate keys (which is executed after the third main loop iteration) is shown below. Again, shared variables that require atomic updates are protected with locks.

upc_forall(m = 0; m < shared_n; m++; &keyArray[m]) {
    if ((lastRound(curR1Y, keyArray[m]) == C) && (lastRound(R1Y, keyArray[m]) == C2)) {
        upc_lock(resultlock);
        PK2[sharedindex] = keyArray[m];
        sharedindex++;
        upc_unlock(resultlock);
    }
}

The translation of the code from C to UPC was very straightforward, especially since we were able to reuse almost all of the original C code without major modifications. After the correctness of our UPC implementation was verified, an MPI implementation using the master-worker paradigm was written using the same parallel decomposition strategy as in the UPC version.

3.2.2 Mod 2^n inverse

The C code for the basic computations needed in this benchmark is shown below.
/** num is the number to right justify, N is the number of bits it has **/
UINT64 rightjustify(UINT64 num, unsigned int N)
{
    while (((num & 1) == 0) && (num != 0) && (N > 0)) {
        num = num >> 1;
        N--;
    }
    return num;
}

/** this computes the mod 2^N inverse when num is odd, such that
    num * result = 1 mod 2^N */
#define INVMOD_ITER_BITS 3
#define INVMOD_INIT      8
UINT64 invmod2n(UINT64 num, unsigned int N)
{
    UINT64 val = num;
    UINT64 modulo = INVMOD_INIT;
    int j = INVMOD_ITER_BITS;
    while (j < N) {
        modulo = modulo << j;
        j = j * 2;
        val = (val * (2 - num * val)) % modulo;
    }
    return val;
}

Our sequential implementation starts by reading in the parameters of the benchmark from the command-line arguments given to the program. The program then allocates (mallocs) space for the lists A, B, and C. List A is filled with random integers whose values range from 0 to 2^j - 1. Lists B and C are then computed using the rightjustify and invmod2n functions shown above. Finally, lists B and C are traversed, and (B[i] * C[i]) % 2^j is checked to make sure it is equal to 1.

In our parallel implementations, each thread "owns" a piece of A and computes the corresponding parts of B and C by itself. A is initialized to random numbers as before, and after B and C are calculated, the first thread traverses all values in B and C to ensure they are correct. The main parts of our basic UPC implementation are shown below.

// populate A
upc_forall(i = 0; i < listsize; i++; &A[i]) {
    A[i] = (rand() & (sz - 1)) + 1;
}

// compute B & C
upc_forall(i = 0; i < listsize; i++; &B[i]) {
    B[i] = rightjustify(A[i], numbits);
    C[i] = invmod2n(B[i], numbits);
}

// have main thread do check
if (MYTHREAD == 0) {
    for (i = 0; i < listsize; i++) {
        if (((B[i] * C[i]) % sz) != 1) {
            printf("FAILED i=%d for A[i]=%d (Got B[i]=%d, C[i]=%d) for thread %d\n",
                   i, A[i], B[i], C[i], MYTHREAD);
        }
    }
}

Since the UPC implementation was straightforward, we decided to experiment with different optimizations to see how they impacted overall program performance. The first optimization was to write our own for loop instead of using the upc_forall construct. For our second optimization, we cast shared variables that pointed to private data before using them whenever possible (pointer privatization). For our third optimization, we had the main thread issue upc_memget calls to the appropriate threads to bring the other threads' data into its private address space before initiating the checking of B and C.

The code for our SHMEM implementation was almost identical to the sequential version for calculating A, B, and C; the only differences were that each thread operated on a fraction of the list, and the lists were created using gpshalloc instead of malloc. The code for the checking phase is a bit more complex, however. In the SHMEM version, the first thread starts by checking that the data it generated is correct. It then issues gpshmem_getmem calls to the other threads to bring in their copies of B and C before checking their data. The code for the check routine is shown below.
// do check
gpshmem_barrier_all();     // make sure everyone is done
if (myproc == 0) {
    // check local results on master thread
    for (i = 0; i < mysize; i++) {
        if (((B[i] * C[i]) % sz) != 1) {
            printf("FAILED i=%d for A[i]=%lld (Got B[i]=%lld, C[i]=%lld)\n",
                   i, A[i], B[i], C[i]);
            fflush(stdout);
        }
    }
    // now check the rest
    for (i = 1; i < nump; i++) {
        int recvsize = mysize;
        if (i == nump - 1) {
            recvsize = lastsize;
        }
        gpshmem_getmem(B, B, recvsize * sizeof(UINT64), i);
        gpshmem_getmem(C, C, recvsize * sizeof(UINT64), i);
        // now do local check
        for (j = 0; j < recvsize; j++) {
            if (((B[j] * C[j]) % sz) != 1) {
                printf("FAILED i=%d j=%d (Got B[i]=%lld, C[i]=%lld)\n",
                       i, j, B[j], C[j]);
                fflush(stdout);
            }
        }
    }
}

The code for the MPI implementation is similar to the SHMEM version, except that it is complicated by the lack of one-sided functions in our available MPI library implementations. During the check loop, each processor sends its data to the first processor in turn using a combination of for loops, MPI barriers, and MPI send and receive calls. In the interest of brevity, the MPI code for the check routine has been omitted from this section, since it is roughly twice as long as the check code for the SHMEM version.

3.2.3 Convolution

The sequential C code for the direct computation of the convolution of two arrays A and B is shown below. The macro INTEGER is set at compile time to a double-precision floating-point type, a 32-bit integer type, or a 64-bit integer type.

void conv(INTEGER* A, INTEGER* B, INTEGER* C,
          long lena, long lenb, long cstart, long cend)
{
    long n, k;
    for (n = cstart; n <= cend; n++) {
        INTEGER s, e;
        s = (n >= lenb) ? n - lenb + 1 : 0;
        e = (n - lena < 0) ? n + 1 : lena;
        C[n] = 0;
        for (k = s; k < e; k++) {
            C[n] += A[k] * B[n - k];
        }
    }
}

The UPC code for our unoptimized version was almost identical to the code above, except that the for loop was replaced with a upc_forall loop using &C[n] as the affinity expression. Close examination of the code shows that, depending on how the UPC compiler blocks the different arrays, the amount of work each thread has to do may vary by a large amount. Computing the values of C near the middle of the sequence takes much longer than computing the values near the edges of C, since more multiplications over A and B need to be performed. Therefore, a small block size near 1 (a cyclic distribution) should result in a nearly even work distribution, since each thread is then likely to receive a uniform mix of elements of C. Based on this observation, our implementation used a block size of 1.

We used the same three optimizations for our UPC implementation that we used in our Mod 2^n inverse implementation. The first optimization was to write our own for loops manually instead of using the upc_forall construct; this optimization also incorporated the second optimization, casting shared variables that pointed to private data before using them whenever possible (pointer privatization). For our last optimization, we had the main thread call upc_memget to bring the other threads' data into its private address space before initiating the checks on the A and B arrays. The manual work distribution using the for optimization complicated the code, since array offsets had to be calculated by hand, but each of the other optimizations added only a few lines to the UPC code.
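The sketch below illustrates what the pointer-privatization and upc_memget optimizations look like in general. It is a minimal, hypothetical example: the array name, element type, and sizes are placeholders rather than the actual benchmark code.

#include <upc_relaxed.h>

#define PER_THREAD 1000                            /* elements per thread (illustrative) */
shared [PER_THREAD] long A[PER_THREAD * THREADS];  /* blocked shared array */

long optimized_example(void)
{
    long i, sum = 0;

    /* Pointer privatization: cast the shared address of this thread's own
       block to an ordinary C pointer, so local accesses avoid the overhead
       of shared-pointer arithmetic. */
    long *localA = (long *) &A[MYTHREAD * PER_THREAD];
    for (i = 0; i < PER_THREAD; i++) {
        sum += localA[i];
    }

    /* Bulk transfer: instead of many fine-grained shared reads, thread 0
       pulls each remote block into a private buffer with one upc_memget
       call per thread. */
    if (MYTHREAD == 0) {
        static long buffer[PER_THREAD];
        int t;
        for (t = 1; t < THREADS; t++) {
            upc_memget(buffer, &A[t * PER_THREAD], PER_THREAD * sizeof(long));
            for (i = 0; i < PER_THREAD; i++) {
                sum += buffer[i];                  /* ... process the copied block ... */
            }
        }
    }
    return sum;
}

In the actual codes, this same pattern was combined with manually partitioned for loops in place of upc_forall.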
The SHMEM and MPI versions of this code were nearly identical to each other, with the only differences arising from the different communication functions. However, since MPI and SHMEM do not have a built-in array blocking mechanism, the computation code was more complex, because array offsets had to be computed manually. Even so, the code was no more complex than the for optimization in the UPC version.

3.2.4 Concurrent wave equation

The implementation of the concurrent wave equation must calculate the amplitude of points along a vibrating string for a specified number of time intervals. The equation that is solved is shown below, where the variable i represents a point on the line.

new[i] = 2 * amp[i] - old[i] + sqtau * (amp[i - 1] - 2 * amp[i] + amp[i + 1])

The amp array holds the current amplitudes. Note that the new amplitude for a point depends on the current values at neighboring points. Each process is assigned a contiguous block of N/P points (block decomposition), where N is the number of points and P is the number of processes. When the points are assigned this way, each processor has all the data needed to update its interior points. To update its endpoints, a processor must read values for the points bordering the block assigned to it. Once the boundary values are calculated, these new boundary values must also be updated in a shared array so other processors may use them. In our implementation, a upc_forall loop was used to initiate this communication.

3.2.5 Depth-first search

The most generic sequential DFS algorithm in C uses pointers to construct the tree. If a thread-spawning facility is available, a DFS using pointers can be expressed with the following pseudocode:

node found = NULL;

int DFS(node current, int target)
{
    if (target == current->data)
        found = current;
    else if (found == NULL) {    // keep searching only while no match has been found
        spawn DFS(current->child1, target);
        spawn DFS(current->child2, target);
        ...
    }
}

At first glance, parallelization of this implementation in UPC appears trivial, since each node can be treated like a thread. However, this does not work because there is no construct in UPC that allows remote spawning of tasks. Because of this, we restricted the DFS algorithm to work only with n-degree trees (trees whose nodes have a maximum of n children), which allows an array representation of the tree to be used. By doing so, the dynamic spawning of child processes is changed into matching the target data against the nodes in a certain range of the array. The parallelization process then amounts to changing the for loop into a upc_forall loop and making the array that represents the tree globally accessible. Shown below is the UPC version of this implementation.

shared tree[N];    // global array representing the tree
int level = 1, left_node = 0, right_node = 0;
bool found = FALSE;

int DFS(int target)
{
    do {
        upc_forall (i = left_node; i <= right_node; i++; &tree[i]) {
            if (tree[i].data == target)
                found = TRUE;    // perform task with a match found
        }
        // advance to the range of indices holding the next tree level
        left_node = right_node + 1;
        right_node = right_node + pow(max_degree, level);  // max_degree^level nodes on the next level
        level++;
        upc_barrier;
    } while (found == FALSE);
}

3.3 Analysis

In this section, we present an analysis of each algorithm's runtime performance. We also discuss any differences between the versions of each application coded in MPI, UPC, or SHMEM.
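The execution times reported in the following subsections were gathered with ad-hoc instrumentation of the kind mentioned in Section 3.4 (total execution time and time spent within particular functions). As a minimal, hypothetical sketch of such timing (not the actual measurement harness used for these runs), a region of code can be bracketed with wall-clock timestamps:

#include <stdio.h>
#include <sys/time.h>

/* Return the current wall-clock time in seconds. */
static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

/* Bracket a region of interest and report its elapsed time. */
void timed_region(void (*kernel)(void), const char *label)
{
    double start = wall_time();
    kernel();                                  /* the code being measured */
    printf("%s: %.6f s\n", label, wall_time() - start);
}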
3.3.1 CAMEL (MPI and UPC)

Since we used the master-worker paradigm in our MPI implementation, to allow a fair comparison between our MPI and UPC implementations we placed both the MPI master thread and the first worker thread on the same CPU during execution. However, after some experimentation it became evident that the MPI implementation for our InfiniBand network uses a spin lock in its implementation of blocking MPI send calls in order to keep latencies low. This spin lock wasted CPU cycles and destroyed the performance of any worker thread that happened to be paired with the master thread on the same CPU. Because of this, we rewrote our application to use a more traditional, distributed-style coordination among all computing processes. This increased overall performance by lessening the impact of the spin lock, at the expense of much more complex MPI code. This version of the MPI code, which had performance comparable to the UPC version, was 113 lines of code longer than our UPC implementation. For data-parallel applications such as this one, UPC is a much more attractive language than MPI. The usefulness of UPC is especially evident on shared-memory machines, where performance differences between well-written MPI programs and UPC programs are less pronounced than on cluster architectures.

We ran our UPC and MPI implementations using a value of 256 for MAINKEYLOOP, which results in about 1/16th of the key space being searched. Our UPC and MPI implementations were tested on our four-processor AlphaServer machine and our 16-processor Opteron cluster. For the AlphaServer, we used the native MPI and UPC compilers available (HP UPC compiler, version 2.3). For the Opteron cluster, we used Voltaire's MPI implementation over 4x InfiniBand, and the VAPI conduit over 4x InfiniBand with the Berkeley UPC compiler (v2.0.1). The results from these runs are shown in Figure 3.2. As can be seen, the performance of the MPI and UPC versions was comparable, and the overall parallel efficiencies of our implementations were very high. On the Opteron cluster, both the UPC and MPI implementations had over 95% efficiency on 16 processors. On the AlphaServer, both implementations had an efficiency of over 98% when run with the maximum number of available processors (4).

Figure 3.2 - CAMEL performance (execution time in seconds vs. number of processors for the AlphaServer UPC, AlphaServer MPI, Opteron VAPI MPI, and Opteron VAPI UPC versions)

3.3.2 Mod 2^n inverse (C, MPI, UPC, and SHMEM)

As can be inferred from Section 3.2.2, given the sequential C code for this benchmark, the parallel version was almost trivial to code in UPC. However, the straightforward UPC implementation of this benchmark resulted in a program with poor performance. Adding the three optimizations previously mentioned makes the code slightly more complex, but has a large impact on overall performance. Even with the added complexity of the optimizations, the UPC code's length and complexity was about on par with the SHMEM version when all optimizations were implemented. The MPI code was again more complex, longer, and harder to write than the SHMEM and UPC versions. On the plus side, MPI does have access to a rich library of non-blocking communication operations which can improve performance in these types of applications, although non-blocking operations are becoming available to UPC programmers in the form of language library extensions.
We did not use the non-blocking operations in our MPI implementation.

We first examine the effect that each optimization had on overall UPC performance. The parameters used for the runs in this benchmark were a list size of 5,000,000 elements and n = 48 bits.

Figure 3.3 - Effect of optimizations on the 4-processor AlphaServer (bench9 execution time in seconds for the sequential code compiled with cc and upc and for UPC with 1 and 4 threads, under the forall, forall cast, for, for cast, get forall, get forall cast, get for, and get for cast optimization combinations)

The results of the different optimizations on our AlphaServer are summarized in Figure 3.3. Each column signifies which combination of optimizations was used when executing the UPC implementation. Also shown are the times taken by the sequential version when compiled with the C compiler (cc) and with the UPC compiler (upc); these sequential results are replicated across the columns of the graph to enhance readability. Notice that compiling the same code with the UPC compiler instead of the C compiler (using the same optimization flags) results in about a 20% drop in performance. We suspect that the source code transformations applied by the UPC compiler may be responsible for this decrease. In addition, the UPC compiler may resort to less aggressive memory access optimizations to ensure correctness in a parallel environment, which could also have contributed to the slowdown.

In terms of the effects the different optimizations had on UPC program performance, Figure 3.3 shows that applying all three optimizations concurrently resulted in the best overall performance. Using a manual for loop instead of a forall loop decreased performance unless casting shared variables locally (pointer privatization) was also employed. Using upc_memget to copy blocks of memory from remote nodes into local memory also resulted in an appreciable performance gain. The effects of the optimizations on our Opteron cluster are not included; they were similar to those on the AlphaServer, although the Opteron cluster proved to be more sensitive to the optimizations. Like the AlphaServer, the Opteron cluster also performed best when all optimizations were used concurrently.

Figure 3.4 - Overall bench9 performance, AlphaServer (execution time in seconds vs. number of threads, 1 to 4, for the marvel upc (get for cast), marvel gpshmem, and marvel mpi versions)

The overall performance of our MPI, SHMEM, and UPC implementations on the AlphaServer is shown in Figure 3.4. The figure shows that the MPI implementation had the best overall performance and the SHMEM version the worst. We believe the difference in performance between the MPI and UPC versions can be attributed to the slowdown caused by compiling our code with the UPC compiler instead of the regular C compiler (which is how MPI programs are compiled on the AlphaServer). Nevertheless, even with the initial handicap imposed by the UPC compiler, our UPC implementation performs comparably to our MPI implementation with 4 processors. The SHMEM implementation lags behind both the UPC and MPI versions in almost all cases; however, we are using a freely available version of SHMEM (gpshmem) because we do not have access to a vendor-supplied version for our AlphaServer. In any case, the performance of the SHMEM version was not drastically worse than the performance of the other two implementations.
Figure 3.5 - Overall bench9 performance, Opteron cluster (execution time in seconds vs. number of threads, 1 to 16, for the lambda gpshmem, lambda mpi, lambda bupc-vapi (get for cast, BS=100), and lambda bupc-vapi (get for cast, BS=MAX) versions)

The performance on the Opteron cluster is shown in Figure 3.5. The check phase is a very bandwidth-intensive task, which results in poor performance due to the limited CPU-to-CPU bandwidth in a cluster environment. Our MPI implementation performs best here, which is not surprising given that MPI is well suited to cluster environments. Also, the MPI implementation explicitly defines how data is communicated from each processor to the next. While writing explicit communication patterns is programmer-intensive and tedious, it usually results in the best overall utilization of the network hardware.

Figure 3.5 also illustrates the effect that adjusting the block size of UPC shared arrays has when combined with the use of the upc_memget function. Using a larger block size for our shared arrays resulted in better performance of our UPC implementation. This is logical, since most networks perform best when transferring larger messages. When the block size for the shared arrays is set to the maximum size allowed by the UPC compiler, overall UPC performance moves quite a bit closer to the performance obtained by the MPI version. As with the AlphaServer, the SHMEM implementation has the worst overall performance. We had expected this; SHMEM is designed for shared-memory machines, so it makes sense that it does not perform well in a cluster environment.

3.3.3 Convolution (C, MPI, UPC, and SHMEM)

The parameters used in this benchmark were two sequences containing 100,000 double-precision floating-point elements. As with our Mod 2^n inverse codes, we decided to examine the impact that each of our three UPC optimizations had on our UPC code; a reduced data set size was used for these tests. The results from executing the code with different optimizations enabled on our AlphaServer are shown in Figure 3.6. Each column indicates the optimizations used when running the UPC program. The columns labeled "naïve" did not use the upc_memget function to bring in local copies of A and B before starting computation. The rest of the optimizations listed in the columns correspond to the optimizations previously mentioned, with the for optimization also including the casting optimization. The results were similar to the effects the optimizations had on the Mod 2^n inverse UPC code: in all cases, applying all optimizations led to the best performance.

Figure 3.6 - Effect of optimizations on the 4-processor AlphaServer (integer convolution execution time in seconds for the sequential code and for UPC with 1 and 4 threads, under the naïve forall, naïve for, get forall, and get for optimization combinations)

Not shown are the results from applying the optimizations on the Opteron cluster. These results also agreed with our previous results from Mod 2^n inverse; in general, UPC performance on the Opteron cluster was very sensitive to the optimizations used. As with the AlphaServer, in all cases applying all optimizations led to the best performance.

The overall performance for double-precision floating-point convolution is shown in Figure 3.7. The performance of the MPI, UPC, and SHMEM versions of the code was comparable on both the Opteron cluster and the four-processor AlphaServer.
The parallel efficiencies for the UPC, SHMEM, and MPI versions on the Opteron cluster were over 97.5%, and the parallel efficiencies for the UPC, MPI, and SHMEM versions on the AlphaServer were over 99%.

Figure 3.7 - Overall convolution performance (double-precision floating-point convolution; execution time in seconds vs. number of threads, 1 to 16, for the AlphaServer UPC, AlphaServer GPSHMEM, AlphaServer MPI, Opteron UPC, Opteron GPSHMEM, and Opteron MPI versions)

An interesting phenomenon is also apparent in Figure 3.7: the floating-point performance of the UPC implementation was significantly better than that of the MPI and SHMEM versions of the code. On our AlphaServer, both MPI and SHMEM are made available to C programmers through C libraries that are linked with the user's application. We suspect that the UPC compiler on the AlphaServer has more intimate knowledge of the available floating-point hardware; it appears the UPC compiler was better able to schedule the use of the floating-point units than the sequential C compiler paired with parallel programming libraries. It is also worth noting that integer performance was not improved by using the UPC compiler on the AlphaServer, although it was not degraded either, as it was with our bench9 UPC implementation. The performance of the convolution produced by the Berkeley UPC compiler on our Opteron cluster was degraded; again, we attribute this to the source-to-source transformations interfering with the ability of the GCC compiler to perform the same optimizations it was able to apply to the MPI and SHMEM code.

3.3.4 Concurrent wave equation (C and UPC)

Figure 3.8 summarizes the execution times for the various implementations of the code. The modified sequential version was 30% faster than the baseline on the Xeon cluster, but only 17% faster on the Opteron cluster. Computations take up more of the total execution time on the Xeon cluster, so the optimizations necessarily have more of an impact there. Since the algorithm is memory-intensive, obtaining execution times for larger data sets is intractable due to the physical limitations of main memory.

Figure 3.8 - Concurrent wave performance (execution time in seconds vs. number of points, 0.5 to 3 million, for the Xeon-sequential, Xeon-sequential mod, Xeon-upc-1, Xeon-upc-2, Xeon-upc-4, Opteron-sequential, Opteron-sequential mod, Opteron upc-1, Opteron upc-2, and Opteron upc-4 versions)

The UPC versions of the code exhibit near-linear speedup. This result is more meaningful when one considers that the UPC code was fairly straightforward to port from the sequential code. Once the code was written, we could focus our attention on determining the most efficient language constructs to use for a given situation. We found that for smaller data sets, the affinity expression array+j performed slightly better than &(array[j]). This is probably due to the different implementations of each construct in the UPC compiler. By gaining more information about the implementations of various language constructs, we hope to be able to exploit more construct-specific performance benefits.

3.3.5 Depth-first search (C and UPC)

Figure 3.9 shows the average performance of the DFS algorithm on a tree with 1,000,000 elements, run on our Xeon cluster using SCI. The tree was set up so that the data contained in each node had the same value as its array index.
Values from 1 to 1,000,000 were used as search keys, and the average time over all of the searches was recorded. As can be seen from the data, the UPC versions perform much more slowly than the sequential version. This is primarily due to the extra synchronization needed by the UPC versions. However, the UPC versions start to perform better when a long delay is added to the matching process (this data is not shown). This is intuitive, because adding the delay results in less frequent synchronization, which increases the efficiency of the parallelization. Again, we found that using a for loop instead of upc_forall results in better performance, and also scales better as the number of threads is increased.

Figure 3.9 - DFS performance on the Xeon cluster with SCI (execution time in milliseconds for the sequential, UPC with for_all, and UPC with for versions on 1, 2, and 4 nodes)

We also tried increasing the block size of the global array to see if this would affect overall performance. In our implementation, adding this optimization actually decreased performance because it creates a load imbalance. To illustrate why, assume there are two processors and a block size of 5 is used on the global array representing the search tree. Because of the resulting distribution of array elements, the first node ends up searching all of the first two tree levels while the second node sits idle. The extra idle time spent by the second node decreases the overall efficiency of the parallelization, which results in lower overall performance. This effect worsens as the depth of the search increases. In addition, we investigated other optimizations, including casting local variables before using them (pointer privatization) and using other language constructs where applicable. These optimizations did not improve the performance of the application, so we have excluded their results from this section.

3.4 Conclusions

Our implementation of the CAMEL differential cryptanalysis program gave us useful experience with both UPC and MPI. Our UPC implementation was easily constructed from the original sequential code, while the MPI implementation took more thought due to our MPI vendor's implementation of blocking receives with a spin lock. We were able to achieve high efficiency with both implementations. One interesting fact we learned from working with our CAMEL implementation is that both UPC compilers can sometimes give slightly better or slightly worse performance for the same code compared with the MPI or sequential C compilers. We suspect that since the UPC source code may be transformed or altered depending on the UPC compiler implementation, the final code that gets assembled may admit different optimizations than the original version. In this respect, since MPI compilers generally do not perform any additional reorganization of the source code before compiling and linking against the available MPI libraries, overall MPI performance usually matches the performance of sequential versions more closely than UPC does. This is especially true for the Berkeley UPC compiler, which utilizes source-to-source transformations during compilation.

Our Mod 2^n inverse implementation was another useful vehicle for comparing MPI, SHMEM, and UPC. While we could have examined an existing UPC implementation of Mod 2^n inverse, writing implementations from scratch turned out to be an excellent learning experience.
The simplicity of the Mod 2^n inverse code allowed us to experiment with different UPC optimizations on a variety of platforms and UPC runtimes. We found that three commonly employed optimizations (as evidenced by the GWU UPC benchmark suite) can make a considerable difference in performance. Specifically, the combination of using upc_memget/upc_memput to transfer contiguous blocks of memory, manually partitioning work using for loops instead of upc_forall blocks, and casting shared variables to local variables where possible resulted in the best performance on all of our UPC runtimes and compilers.

Our implementations of convolution in UPC, SHMEM, and MPI show that each language offers similar performance for applications with large computation requirements and moderate communication requirements. The AlphaServer's UPC compiler was able to squeeze more performance out of its floating-point units, even when compared to the sequential C compiler. UPC's notion of blocked arrays also made uniform work sharing easier to implement in this application. Finally, our chosen optimizations had a positive impact on overall UPC performance on both the AlphaServer and our Opteron cluster, and the effects of the optimizations agreed with the results we obtained from our Mod 2^n inverse implementation.

The adjustments required to port the wave equation code from the original sequential code to UPC were fairly intuitive. However, the overhead of running the UPC code with one process relative to the sequential code is quite substantial. Even more noteworthy is that when the sequential code is passed through the UPC compiler and run on one processor, a similar overhead is incurred. We believe the reasons for this overhead are similar to the reasons mentioned for the overhead observed in the CAMEL application.

In our DFS implementation, we again verified that the use of certain constructs instead of the built-in alternatives (for versus upc_forall) can have an impact on program performance. We also verified that the computation-to-communication ratio of a program has a significant effect on parallel efficiency. More importantly, we discovered that it was necessary to change the underlying algorithm and restrict the original problem when creating a parallel version of the application in order to gain efficiency. Programmers need to understand the limitations of the language they use, because understanding those limitations is a prerequisite to becoming efficient in the language. By examining the capabilities of a language, a set of "good" performance guidelines might be formed.

In general, this task provided us with an opportunity to become more familiar with UPC and SHMEM. The optimization process was beneficial, as it forced us to view performance analysis tools from a user's perspective. It gave us an understanding of the information that is needed when optimizing parallel programs. In addition, it provided us with experience using ad-hoc methods for collecting such basic metrics as total execution time and total time spent within a function. Finally, the programming practice confirmed that the use of optimizations can have a significant performance impact.

3.5 References

[3.1] K. S. Ooi and Brain Chin Vito, "Cryptanalysis of S-DES," Cryptology ePrint Archive, Report 2002/045, April 2002.
[3.2] "Wave Equation," Wikipedia: The Free Encyclopedia.
[3.3] Fox et al., "Solving Problems on Concurrent Processors," Vol. 1, 1988.
[3.4] http://www.new-npac.org/projects/html/projects/cdroms/cewes-199906-vol1/cps615course/mpi-examples/note5.html
[3.5] http://super.tit.ac.kr/workshop/html/samples/exercises.html#wave

4 Performance tool strategies

Writing a parallel program can often be a daunting task. In addition to the normal issues raised during sequential programming, programmers working on parallel codes have to contend with data partitioning schemes, synchronization, and work distribution among processors, among other concerns. Recently, several programming languages such as UPC have been created which aim to improve programmer productivity by providing simplified coding styles and convenient machine abstractions. Even with these improved programming environments, deciding how to optimally code a program can still be a trial-and-error process. Evaluation of parallel code is usually accomplished via one, or a combination, of the following three methods:

Simulation – Detailed models are created of the parallel code and hardware that the user wishes to evaluate. These models are simulated, and information collected during the simulation is stored for later analysis or shown to the user immediately through a Graphical User Interface (GUI).

Analytical models – Mathematical formulas are used in conjunction with specific parameters describing the parallel code and the hardware upon which it will be executed. This gives an approximation of what will happen when the code is actually run by the user on the target hardware.

Experimental – Instrumentation code is added that records data during the program's execution on real hardware.

Detailed simulation models can provide extremely detailed information to the user. However, creating and validating models of existing hardware is a very labor-intensive process. The most accurate models may take an impractically long time to simulate, while coarse-grained models that run in a reasonable amount of time usually have poor accuracy. In addition, the models created are usually tied closely to particular architectures or runtime systems, and adapting them to updated or drastically different architectures usually involves substantial work. Given that parallel architectures can change dramatically in short amounts of time, and given the large execution cost of accurate models, simulative models are usually only used in cases where the detailed information they provide is absolutely required. In this respect, they are invaluable tools for ensuring the correctness of execution of mission-critical systems, but their usefulness in the implementation of a PAT is limited.

Analytical models can be thought of as extremely simplified versions of models created for simulative analysis. While they lack the accuracy of detailed simulation techniques, often the information they provide is sufficient to warrant their use in a PAT. For example, parallel performance models can provide insight into the characteristics of existing parallel machines by giving the programmer a "mental picture" of parallel hardware. In addition, some performance models may even predict how a given program will perform on other available hardware or on hardware not yet available.
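As a toy illustration of this analytical style (our own example, not a model advocated elsewhere in this report), the predicted execution time of a simple data-parallel code on p processors might be written as

\[
T(p) \approx \frac{T_{\text{comp}}}{p} + N_{\text{msg}}(p)\left(s + \frac{m}{B}\right),
\]

where T_comp is the sequential computation time, N_msg(p) is the number of messages each processor sends, s is the per-message startup latency, m is the message size, and B is the network bandwidth. Fitting even such a crude formula to a few measured runs can indicate whether computation or communication dominates as p grows.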
The experimental approach (direct execution of a user's code) provides the most accurate performance information to the user, but using this strategy by itself encourages the "measure-modify" approach, in which incremental changes are made to the code after measuring a program's runtime performance. This modified code is then run and measured again, and the process repeats until the user obtains the desired performance from their code. A major drawback of this process is that it is usually very time consuming; in addition, many actions are performed on a trial-and-error basis, so getting better performance from codes may also involve an element of chance.

5 Analytical performance modeling

In the context of a PAT, a performance model has many possible uses. A performance model may be used to give a user suggestions on how to improve the performance of their application, or it may allow the user to perform simple tradeoff studies to aid them in obtaining the maximum performance from their program. Additionally, studying existing performance models will give us an idea of which particular metrics ("performance factors") researchers have deemed important enough to include in models that evaluate or predict the performance of parallel code. By compiling a list of the performance factors used in a wide variety of existing performance models, we can justify the inclusion of measurement tools for these particular metrics in our PAT. If we are able to characterize the performance of an application on real hardware using a handful of performance factors, we will have a better understanding of how each performance factor affects a program's overall performance.

The rest of this section reviews existing performance models in an attempt to determine their applicability to a PAT. To evaluate each model, we will use the following set of criteria:

Machine-friendliness – For a performance model to be useful in a PAT, the PAT must be able to evaluate a user's program with little or no help from the user. If a user must expend a great deal of effort to use the model, the user will most likely avoid the model in favor of other, easier-to-use features provided by the PAT.

Accuracy – A very inaccurate performance model is not useful to the user. We seek models that have a reasonable amount of accuracy (20-30% error). In general, we wish this requirement to be flexible; if a model has slightly worse than 30% accuracy but has other redeeming features, we may still assign it a good evaluation.

Speed – If evaluating a given code under a performance model takes longer than evaluating it on actual hardware, it will generally be more productive to use the actual hardware instead of a model that approximates it. Therefore, we have chosen the time taken to re-run the code on the actual hardware as an extreme upper bound on the time taken to evaluate a performance model. To quantify this interval, models that are able to give evaluations in seconds to tens of seconds are highly desirable. Models that take minutes to evaluate need to have high accuracy or some other desirable feature to make up for their speed deficiency. If a model takes one hour or longer to evaluate code, we will not consider it a feasible choice for inclusion in our PAT.

We have divided the existing performance models we will be evaluating into three categories. First, models that use formalized mathematical methods are grouped under the title "formal performance models." These models are presented and evaluated in Section 5.1.
Second, general performance models that are used to give mental pictures to programmers or that provide general programming strategies are categorized as "general analytical performance models," and are summarized and evaluated in Section 5.2. Performance models that are designed specifically to predict the performance of parallel codes on existing or future architectures are categorized as "predictive performance models," and are presented and evaluated in Section 5.3. Finally, Section 5.4 gives our recommendations on how to incorporate an analytical model into our PAT.

5.1 Formal performance models

In this section, we briefly introduce some of the more widely-used formal methods for evaluating the performance of parallel codes. This category of performance models encompasses many different methods and techniques. Because formalized methods generally require extensive user interaction, and thus are not readily applicable for inclusion in a PAT based on our previously mentioned criteria, we present only a brief overview of them. Also, formal performance models use highly abstracted views of parallel machines, so we cannot extract which performance factors are considered when evaluating performance, as these metrics are not used directly. For completeness, we have included a generalized evaluation of formal performance models below. These comments are applicable to all the formal models discussed in this section.

Formal models summary:
Parameters used – varies; usually an abstract representation of processes and resources
Machine-friendliness – very low; relies on the user's ability to create abstract models of the systems they wish to study
Average error – varies; can be arbitrarily low or high depending on how systems are modeled
Speed – creating and verifying the models used by formal methods can be time consuming

5.1.1 Petri nets

Petri nets originated in the work Carl Petri performed for his PhD thesis [5.5.1]. Petri nets are specialized graphs (in the computer science sense) which are used to graphically represent processes and systems. In some sense they are generalized versions of finite state machines, as finite state machines may be represented using Petri nets. The original Petri nets proposed by Carl Petri had no notion of time and allowed only limited modeling of complex systems. Several improvements to the original Petri nets have been proposed, including colored Petri nets, which allow more complicated transitions, and timed Petri nets, which introduce time parameters. Using colored, timed Petri nets, it is possible to model arbitrarily complex systems. Since Petri nets are strongly grounded in mathematical theory, it is often possible to make strong assertions about the systems modeled with them. Specialized versions of Petri nets have even been created specifically for performance analysis of distributed systems [5.5.2].

Petri nets can be thought of as basic tools that provide a fundamental framework for modeling. However, they are often very difficult to create for general parallel systems and codes. If included in a PAT, the PAT would need a great deal of help from the user in the form of hints on how to construct the Petri nets (or the provision of graphical tools to help the user create them). Therefore, because of this large dependence on user interaction, we suggest not using Petri nets as the basis for a performance model in our PAT.
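For readers unfamiliar with the formalism, the following small C sketch (our own illustration, not drawn from [5.5.1] or [5.5.2]) encodes a trivial one-transition Petri net and its firing rule; timed and colored Petri nets build their performance semantics on top of exactly this token-flow core.

#include <stdio.h>

/* A tiny Petri net: tokens sit in places, and a transition fires when
   every one of its input places holds enough tokens. */
#define NPLACES 2

typedef struct {
    int input[NPLACES];   /* tokens consumed from each place when firing */
    int output[NPLACES];  /* tokens produced in each place when firing   */
} transition_t;

static int marking[NPLACES] = { 1, 0 };   /* initial marking: one token in place 0 */

/* "work done" transition: consumes a token from place 0, puts one in place 1 */
static transition_t work_done = { { 1, 0 }, { 0, 1 } };

static int enabled(const transition_t *t)
{
    for (int p = 0; p < NPLACES; p++)
        if (marking[p] < t->input[p])
            return 0;
    return 1;
}

static void fire(const transition_t *t)
{
    for (int p = 0; p < NPLACES; p++)
        marking[p] += t->output[p] - t->input[p];
}

int main(void)
{
    if (enabled(&work_done))
        fire(&work_done);
    printf("marking: (%d, %d)\n", marking[0], marking[1]);  /* prints (0, 1) */
    return 0;
}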
5.1.2 Process algebras

Process algebras represent another formal modeling technique strongly rooted in mathematics (especially algebra). Some of the more popular instances of process algebras are Milner's "A Calculus of Communicating Systems" [5.5.3] and Hoare's "Communicating Sequential Processes" [5.5.4]. Process algebras abstractly model parallel processes and the events that happen between them. These techniques are rich in applicability for studying concurrent systems but are quite complex; entire books have been written on these subjects (for example, Hoare has an entire textbook dedicated to his "Communicating Sequential Processes" process algebra [5.5.5]). Process algebras excel at providing a formalized mathematical theory of concurrent systems, but in general they are difficult to apply to real parallel application codes and real systems. They are useful for verifying certain properties of parallel systems (e.g., deadlock freedom), but are not immediately useful to a PAT. Therefore, we recommend excluding process algebras from our PAT.

5.1.3 Queuing theory

Another formalized modeling technique we discuss here is the application of queuing theory to parallel systems. As with process algebras and Petri nets, queuing theory is strongly rooted in mathematics and provides a general way to describe and evaluate systems that involve queuing. Queuing theory comprises an entire field in and of itself; it has been successfully applied in the past to evaluate parallel systems (for a summary paper, see [5.5.6]). However, like process algebras, queuing theory is a general tool that is sometimes difficult to apply to real-world problems. In addition, some parallel codes might not be readily modeled using queuing terminology; this is even more problematic when working with languages that provide high-level abstractions of communication to the user. While queuing theory is a useful tool for solving some parallel computing problems (notably load balancing), it is not appropriate for inclusion in our PAT.

5.1.4 PAMELA

PAMELA is a generic PerformAnce ModEling LAnguage invented by van Gemund [5.5.7, 5.5.8]. PAMELA is an imperative, C-style language extended with constructs to support concurrent and time-related operations. PAMELA code is intended to be fed into a simulator, and the language has calculus operators defined for it that allow programs to be reduced in order to speed their evaluation. PAMELA has much in common with general process algebras, although it is oriented towards direct simulation of the resulting codes in the PAMELA language/process algebra. During evaluation of code written in the PAMELA language, serialization analysis is used to provide lower and upper bounds on the effects of contention on a program's performance. Even though PAMELA itself is strictly a symbolic language, van Gemund developed a prototype system for the Spar/Java data-parallel programming language [5.5.9]. This system generates a PAMELA program model based on comments provided in the source code. While the system was not fully automated, due to its dependence on special comments in the source code, the generated models had an average of 15% error for matrix multiplication, numerical integration, Gaussian elimination, and parallel sample sort codes. The models generated by the system were in general many times larger than the actual source code, although the sizes shrank dramatically after the performance models were reduced.
PAMELA provides a convenient description of parallel programs, but automatically creating models from actual programs, especially programs coded in UPC, would be too difficult to perform efficiently without interaction from the user. The performance calculus used is simple enough to allow for machine evaluation of PAMELA models; however, because of the necessary user interaction required in creating accurate PAMELA models, we cannot recommend the use of PAMELA in our PAT. 5.2 General analytical performance models In this section, we present an overview of general analytical performance models, which we define as performance models that are meant to provide a 40 programmer with a mental model of parallel hardware or general strategies to follow while creating parallel codes. 5.2.1 PRAM Perhaps one of the most popular general analytical performance models to come out of academia is Fortune and Wylie’s Parallel Random Access Machine (PRAM) model [5.5.10]. The PRAM model is an extremely straightforward model to work with. In the model, an infinite number of ideal CPUs are available to the programmer. Parallelism is accomplished using a fork-join style of programming. Each processor in the PRAM model has a local memory and all processors have access to a global memory store. Each processor may access its local memory and the global memory store, but may not access the private memory of the other processors. In addition, local and global memory accesses have the same cost. Because of its simplicity, many programmers use the model to determine the algorithmic complexity of their parallel algorithms. However, no real machine exists today that matches the ideal characteristics of the PRAM model, so the complexities determined using the model are strict lower bounds. Researchers have created several different variations of the model to try to more closely match the model with existing hardware by restricting the types of memory operations that can be performed. For example, different variations of the model can be obtained by specifying whether reads and writes to the global memory store can be performed concurrently or exclusively. CREW-PRAM (concurrent read, exclusive write PRAM) is one particular variation of the model that is widely used. However, even with the enhancements to the model, the model is still too simplistic to predict performance of parallel codes to a reasonable degree of accuracy, especially in non-uniform memory access machines. In addition, synchronization costs are not directly captured by the model, which can contribute greatly to the cost of running parallel codes on today’s architectures. PRAM is a very useful algorithmic analysis and programming model. It is an attractive method because it is a very straightforward model to deal with. However, its low accuracy prohibits its use in our PAT. 
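As a standard textbook illustration of the kind of reasoning PRAM supports (our own example, not taken from [5.5.10]), consider summing n values with p processors by computing local partial sums and then combining them in a binary tree:

\[
T_{\text{PRAM}}(n,p) = O\!\left(\frac{n}{p} + \log p\right).
\]

Because the model charges local and global memory accesses equally and ignores synchronization, such bounds are best viewed as lower bounds on what any real machine can achieve.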
PRAM summary:
Parameters used – coarse model of memory access; sequential code speed used for predicting overall execution time of parallel code
Machine-friendliness – medium to low; in order to accurately model algorithms using PRAM, either the source code must be processed in some way or extra information must be provided by the user
Average error – generally accurate to within an order of magnitude, although accuracy can be much worse depending on the nature of the code being evaluated (excessive synchronization is problematic for the model)
Speed – extremely fast (entirely analytic)

5.2.2 BSP

In 1990, Leslie Valiant introduced an analytical performance model he termed the Bulk-Synchronous Parallel (BSP) model [5.5.11]. The BSP model aims to provide a bridging tool between hardware and programming models, as the von Neumann model of computing has done for sequential programming. In the BSP model, computation is broken up into several supersteps. In each superstep, a processor first performs local computation. Then, it is allowed to receive or send at most h messages. These messages are charged a cost of g·h + s, where g is the throughput of the communication network between the processors and s is the startup latency for performing communication. Finally, at the end of each superstep, a global barrier synchronization operation is performed. The barrier synchronizations occur with a frequency determined by a model parameter L; that is, synchronization is performed every L time units. If a superstep has not completed at that point, the next superstep is allocated for finishing the computation of the current superstep. Breaking computation up into supersteps and limiting the network communication that can occur within a superstep makes it easier to predict the overall time taken for that superstep, which will be the sum of the maximum computation time, the communication time, and the time needed to synchronize all the processors participating in that superstep.

A few variants of the BSP model also exist, including the Extended BSP (E-BSP) model [5.5.12]. The E-BSP model enhances the original BSP model by introducing a new parameter M which models the pipelining of messages from processor to processor. In addition, the E-BSP model adds the notions of locality and unbalanced communication by also considering network topology, something that is ignored in the general BSP model. In 1996, Juurlink and Wijshoff [5.5.13] performed an evaluation of BSP by tuning the BSP and E-BSP models to three actual machines and using their tuned models to predict the performance of several parallel applications: matrix multiplication, bitonic sort, local sort, sample sort, and all-pairs shortest path. The three machines used were a 1024-processor MasPar MP-1, a 64-processor CM-5, and a 64-node GCel. For each code, the BSP and E-BSP models showed good accuracy, within 5% in most cases. However, error grew to 50-70% in the cases of poor communication schedules (when several processors try to send to one processor at the same time), poor data locality (high cache miss rates), and sending a large number of tiny messages. A similar drop in accuracy was reported for the BSP model when working on networks with low average bisection bandwidth but high point-to-point bandwidth between processors near each other in the interconnection network.
The BSP model uses the maximum number of messages sent by a group of processors to predict performance, assuming all processors send exactly the same number of messages; if only one processor sends a larger number of messages than the others, this cost will be overestimated. The problem is compounded on networks with low bisection bandwidth, since network saturation decreases performance more quickly on these types of networks. It is also important to note that in these cases the programs were coded with the BSP and E-BSP models in mind from the start, so they most likely represent optimistic accuracies. In addition, the BSP model itself imposes restrictions on the structure of the program, although the restrictions themselves are not that hard to live with. To this end, a group of researchers created a BSP programming library named BSPLib to ease the implementation of BSP-style algorithms [5.5.14]. The library provides operations similar to those MPI provides, albeit on a much smaller scale. BSPLib also includes the ability to generate a trace of program execution that can be analyzed using a profiling tool. One interesting feature of the profiling tool is that it is able to predict the performance of the application on other machines, given their BSP parameters. The accuracy of the predictions made by the profiling tool on a complex Computational Fluid Dynamics (CFD) program was within 20%, which is fairly respectable considering the trace was performed on a shared-memory machine and the predicted platform was a loosely-coupled cluster. One other contribution of the BSPLib effort was the change of g into a function g(x), which allows the model to capture the different latencies associated with different message sizes, a common property of most available networks today. This addition does increase the complexity of the model, but the additional complexity may be mitigated by the use of profiling utilities.

The BSP model provides a convenient upgrade in accuracy from the PRAM model, although it requires a specific programming style, and the model ignores some important effects including unbalanced communication, cache locality, and the reduced overhead for large messages on some networks. Some of these deficiencies have been addressed by the E-BSP model at the expense of added model complexity. In general, if processing trace files created by instrumented code is not too costly, the supersteps and h parameters may be extrapolated from existing code by examining when messages were sent and received relative to the times barrier synchronization operations were performed. In addition, microbenchmarks may be used to automatically record the g and L parameters of existing systems. Using these two methods together, applying the BSP model to a program after it has been executed can probably be entirely automated. The simplicity of the model is also an attractive feature, although it remains to be seen whether the model can be applied to more finely-grained applications that do not make use of many barrier synchronizations.
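Using the parameter names above, and accounting for the rounding-to-L behavior described earlier, the cost of a single superstep in which processor i performs w_i units of local computation and at most h messages are exchanged can be written as

\[
T_{\text{superstep}} = L \cdot \left\lceil \frac{\max_i w_i + g \cdot h + s}{L} \right\rceil ,
\]

with the predicted running time of a program being the sum of its superstep costs. This is our own restatement of the model for reference, not a formula quoted from [5.5.11].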
BSP summary:
Parameters used – network bandwidth, network latency, sequential code performance
Machine-friendliness – medium to high; the supersteps of an algorithm may be detected automatically by analyzing a trace file or the source code directly
Average error – within 20% for nominal cases; error can reach 70% in cases where parameters not captured by the model negatively affect performance or where costs are overestimated by the model
Speed – extremely fast (entirely analytic)

5.2.3 LogP

The LogP model [5.5.15] was created after its authors noticed a trend in parallel computer architecture in which most machines were converging towards thousands of loosely-coupled processors connected by robust communication networks. Since these types of architectures will scale to thousands of nodes (but not millions), the authors theorized that a large number of data elements will need to be assigned to each processor. Also, since network technology generally lags significantly behind the speed available for processor-memory interconnects, the cost of network operations is very high compared to the cost of local operations. In addition, since adaptive routing techniques such as wormhole and cut-through routing make topology less important to overall performance, a parallel performance model need not take overall topology into account. Finally, due to the range of programming methodologies in use, the authors decided to make their performance model applicable to all styles of parallel programming.

The parameters used in the LogP model are entirely network-centric: an upper bound on network latency L, the overhead of sending a message o, the minimum gap between messages that can be sent by a processor g, and the number of processors P. In addition, the network is assumed to have a limited capacity, with a maximum number of messages allowed to be in flight at once; processors are delayed if the network is saturated. Many extensions to LogP have been proposed by various researchers. LogGP [5.5.16] handles longer messages in a different manner, reflecting the common practice of switching communication protocols for larger messages. LoPC [5.5.17] and LogGPC [5.5.18] incorporate a parameter that captures contention in an effort to model systems that use active messages. LognP [5.5.19] attempts to capture the overhead of communicating messages that do not reside in contiguous memory. LogP has even been applied to modeling memory hierarchies [5.5.20], although the memory LogP model only accurately predicts regular memory access patterns.

Due to its simplicity and limited set of parameters, the LogP model is very approachable and easy to work with. Also, the model encourages such things as contention-free communication patterns, which other models such as BSP and PRAM ignore. In general, the accuracy of LogP in predicting network performance is usually good; Culler et al. predicted sorting performance on a 512-node CM-5 machine to within 12% accuracy [5.5.21]. Unlike the BSP model, LogP captures details of every message that is transferred on the network, and care must be taken to ensure that the capacity of the network is respected. Also, no interdependence between messages is directly captured by the LogP model, which increases the cost of automated analysis. LogP in and of itself has no provisions for predicting the computational performance of algorithms. In addition, if we are tailoring our PAT mainly to shared-memory architectures, the LogP model may not be a good fit for evaluating these machines.
Therefore, in order for LogP to be useful for our PAT, it must be adapted for use with shared-memory machines and supplemented with another model that can give a good representation of computational performance. LogP summary: Parameters used – network latency, overhead, gap (bandwidth), number of processors Machine-friendliness – high for network-specific evaluations, medium to low for general evaluations; the model does not directly capture interdependencies between messages, and the model does not provide a way to evaluate computational performance Average error – within 10% for predicting network performance, specialized versions of the model may be necessary to accurately model specific hardware Speed – extremely fast (entirely analytic), although entire communication trace files will need to be processed which may slow things down 5.2.4 Other techniques In this section, a brief overview of other general analytical models will be presented and briefly evaluated. The models presented here are not useful for our PAT, but are included for completeness. 46 Clement and Quinn created a general analytical performance model for use with the Dataparallel C language which they describe in [5.5.22]. Their model uses a more complex form of Amdahl’s law which takes into account communication cost, memory cost, and compiler overhead introduced by Dataparallel C compilers. While their model is too simplistic to be useful, it is interesting that they model compiler overhead directly, although it is characterized by a simple scalar slowdown factor. A performance model tailored to a specific application is presented in [5.5.23]. While high accuracy was obtained for the model, creating application-specific models is both time consuming and difficult. This illustrates that applicationspecific models do not present a tractable design strategy, especially in the context of a PAT. Sun and Zhu present a case study of creating a performance model and using it to evaluate architecture tradeoffs and predict performance of a Householder transformation matrix algorithm [5.5.24]. The model they presented used simple formulas to describe performance, and the resulting accuracy for the model was not very high. In addition, the authors only presented results for a particular architecture and did not attempt to predict the performance of their application on different hardware. The case study, however, is still useful for its pedagogical value. Kim et al. present a detailed performance model for parallel computing in [5.5.25]. Their method involves a general strategy for the creation of highly accurate models, but the method requires considerable modeling effort on the part of the user. They categorize the models created with their strategy as “Parametric Micro-level Performance Models.” They use matrix multiplication, LU decomposition, and FFT kernels to evaluate their modeling strategy and find that their models were accurate to within 2%. However, this high accuracy comes at the price of very complex models – the end of their paper contains seven full pages of equations that represent the models created for the scientific kernels. This illustrates a difficult problem associated with the creation of analytical models from scratch: in order to get high accuracy in analytical models, it is often necessary to use complicated statistical techniques to capture the nonlinear performance of parallel machines. 
47 5.3 Predictive performance models In this section, we present predictive performance models, which we define as models that are used to predict performance of parallel codes on existing hardware. We differentiate the models presented here from the models presented in Section 5.2 by specifying that the models presented here are created specifically for predicting performance of parallel codes. 5.3.1 Lost cycles analysis Crovella and LeBlanc invented lost cycles analysis [5.5.26, 5.5.27] as a way to determine all sources of overhead in a parallel program. The authors of this paper noted that analytic performance models were not being widely used as a development tool. They reasoned that because analytical models generally emphasize asymptotic performance, assume that particular overheads dominate over many cases, and may be difficult to work with, parallel programmers avoid using them. To remedy these problems, they created a tool set to aid programmers in applying their lost cycles analysis method on real-world applications. In lost cycles analysis, a distinction is made between pure computation and all other aspects of parallel programming. Any part of a parallel program that is not directly related to computation is deemed overhead and is labeled as “lost cycles.” To further classify these lost cycles, the authors came up with the following categories of lost cycles: load imbalance (idle cycles spent when work could be done), insufficient parallelism (idle cycles spent when no work is available), synchronization loss (cycles spent waiting for locks or barriers), communication loss (cycles spent waiting for data), and resource contention (cycles spent waiting for access to a shared resource). The authors make one interesting assertion: they insist the categories chosen to classify the lost cycles must be complete and orthogonal. That is, the categories must be able to classify all sources of lost cycles, and must not overlap at all. They also assert the five categories they chose (which are mentioned above) fulfill these properties. Measurement of lost cycles is accomplished via the pp tool. Instrumentation code is inserted into the parallel application code that sets flags at appropriate times. The instrumentation code comes in the form of flags which report the 48 current state of execution of the processor. Example flags are Work_Exists, Solution_Found, Busy, Spinning, and Idle. In an early implementation [5.28], the programmer is expected to add the instrumentation code to set the flags at appropriate places throughout the code, although in a later implementation these predicates were handled via general library calls inserted at various places in the code (e.g., before and after parallel loops). During the execution of the parallel code, the values of the flags are either sampled or logged to a file for later analysis. Some flags are necessarily measured in machine-specific ways. For example, to measure communication loss on the shared-memory machine used in the paper, dedicated hardware was used to count the number of second-level cache misses (which incur communication) and record the time necessary to service them. Resource contention was measured indirectly by dividing the time taken to service the second-level cache miss by the optimal time taken to service a second-level cache miss on an unloaded machine. Once the values for the flags are recorded, they are fed into predicates which allow the computation of lost cycles. 
An example predicate is Load_Imbalance(x) ≡ Work_Exists ∧ Processors_Idle(x). The values for the lost cycles in each category, along with the values recorded for pure computation, are reported back to the user in units of seconds of execution. The lca program is then used to analyze data from different runs in which one system parameter was varied (e.g., number of processors or data set size). After lca finishes its analysis, it provides the user with a set of equations describing the effect of changing the parameter that was varied, along with goodness-of-fit information that gives statistical confidence intervals on the equations presented. It is important to note that the lca program does not support multiple variables directly; it is up to the user to correlate separate variables based on the information lca reports when varying each variable.

Although lost cycles analysis is presented as a method for predicting program performance, it is restricted to predicting the performance of a code on a single machine. The accuracy of this relatively simple method is surprisingly good; for a 2D FFT, the authors were able to obtain an average prediction accuracy of 12.5%. The accuracy was high enough for the authors to productively evaluate two different FFT implementations. This is no small feat, as FFT implementations tend to be orders of magnitude more complex than the general program codes used as illustrative examples by most performance models. In general, it seems lost cycles analysis is useful as an analysis and tradeoff-study technique, but as a performance model it is limited.

Lost cycles summary:
Parameters used – load imbalance, insufficient parallelism, synchronization cost, communication cost, resource contention
Machine-friendliness – medium to high; library calls must be inserted but are generally easy to add and may be automated, although recording some flags, such as communication costs for shared-memory machines and whether a processor is blocked, may be tricky depending on what facilities the architecture provides
Average error – 12.5% for a nontrivial example
Speed – fast, but requires several initial execution runs to gain accuracy

5.3.2 Adve's deterministic task graph analysis

Adve introduced what he termed deterministic task graph analysis as part of his PhD thesis work [5.5.29]. In Adve's method, a task graph is created for a program which precisely represents the parallelism and synchronization inherent in that program. Task graphs are visual representations of parallel program codes: each node in the graph represents a task that needs to be accomplished, and edges in the graph represent communication dependencies. The task graph assumes that execution of the program will be deterministic. As motivation for this assumption, Adve noted that while many stochastic program representations can capture the non-determinism of application codes, keeping the analysis of such models tractable requires either simplifying assumptions about the nature of the work distribution or restrictions on the possible task graphs that may be represented. He suggests using mean execution times as representations of stochastic processes, instead of analyzing them directly using complicated stochastic methods. He also argues that while stochastic behavior can affect overall performance, its effect on overall execution time is usually negligible.
Adve’s performance models are composed of two levels: the first level contains lower-level (system-level, possibly stochastic) system resource usage model, and the second level contains a higher-level deterministic model of program behavior. He created a system-level model for a Sequent Symmetry platform [5.5.30], and created high-level models for several program codes, including a particle simulation program, a network simulation program, a VLSI routing tool, a polynomial root finder with arbitrary precision, and a gene sequence aligning program. The sizes of these test cases ranged from 1800 to 7200 lines of C code, and the task graphs ranged from 348 to 40963 tasks. Each task graph took under 30 seconds to evaluate, so the authors assert that task graphs are efficient representations of parallel codes. Adve also asserts that the task graphs can be used to evaluated overall program structure; in this manner, they may be also used as an analysis tool for parallel codes. Adve was able to obtain very good accuracies from his task graph models of the programs he presented. His models predicted accuracy on 1024- and 4096processor systems to within 5% accuracy, although the models used in the analysis have been in existence since Adve’s original PhD thesis, a time period of almost 10 years, so they may have been refined several times. In addition, no coherent methods for automatically obtaining task graphs from existing programs are mentioned. However, since evaluating a task graph is much easier than generating one, if we are able to generate task graphs based on information contained in a program’s trace file, this method may provide a tractable way of including a performance model in our PAT, as long as we are also able to invent a method for generating system models. In general, the method Adve presents is open-ended enough to be used in many situations, although the quality of predictions made from this method is highly dependent on the quality of the models used to represent the programs and systems being evaluated. Adve’s deterministic task graph analysis summary: Parameters used – no specific set required, although for modeling the Sequent Symmetry Adve used weighted averages of memory access characteristics in addition to other unspecified metrics 51 Machine-friendliness – medium to low; no unified procedure for creating system-level models is given (although it may be feasible to automate this at the expense of accuracy), and generating application task graphs may require human interaction although it may be possible to derive a task graph from a program’s trace Average error – less than 5% for nontrivial examples Speed – moderate to fast, 30 seconds for a large program 5.3.3 Simon and Wierum’s task graphs Simon and Wierum noted that the BSP, LogP, and PRAM models ignored multilevel memory hierarchies, which they felt was a large omission because node design is an important aspect of parallel computer performance. They decided to create a performance model [5.5.31] based on task graphs which uses a distributed memory architecture consisting of shared-memory, specifically Symmetric Multi-Processor (SMP), nodes. In their model, parallel programs are represented by task graphs. Communication between processors is modeled via a function that uses parameters for message size and distance between the sender and receiver. Multiprocessing in each SMP node is modeled by an ideal abstracted scheduler. 
Finally, resource contention inside SMP nodes is modeled by a closed queuing network. The authors assume that tasks are mapped to processors statically and cannot move during the execution of the program. Control flow is approximated using mean-value analysis. Since task graphs for real programs can be overwhelmingly large, they are reduced by allowing loops to be expressed compactly. In addition, nodes that are not on the critical path of execution may be eliminated from task graphs to reduce their size. Once a task graph is created for a program, microbenchmarks are used to evaluate particular performance metrics of the target hardware. These microbenchmarks measure the time taken for a mixture of load, store, and arithmetic operations of various sizes. The authors mention that the benchmarks investigate all levels of the machine's memory hierarchy, but unfortunately they do not provide details on the specifics of the benchmarks used or the exact metrics collected. After the microbenchmark data is collected, it is used to evaluate the task graph, and a prediction of the runtime of the application is given.

As a test case, the authors apply their model to the LU decomposition code from the Linpack benchmark suite. The LU decomposition code is a compact scientific computation kernel, so one would expect the model to be straightforward to apply to this code. Unfortunately, the authors' test case reveals a significant problem with their model: it is very difficult to use. Constructing a task graph from code is a labor-intensive process, and when coupled with high-level languages it is not clear whether it is possible at all to automatically generate task graphs. While the authors were able to achieve high prediction accuracy for their test case (within 6.5%), significant manual analysis was required to obtain that accuracy. The detail captured in their models does an excellent job of predicting the performance of codes, but it makes the model difficult to work with. Also, one potentially major problem is the use of mean-value analysis to approximate algorithm control flow. This worked well for the LU computation kernel, but for more nondeterministic codes it may prove to be a source of model error. For these reasons, we do not suggest incorporating this model into our PAT.

Simon and Wierum's task graph summary:
Parameters used – memory performance, integer performance, floating-point performance, message size, distance between sender and receiver, resource contention for SMP nodes via a closed queuing network
Machine-friendliness – very low
Average error – excellent (below 6.5%), although mean-value analysis may introduce more significant errors in nondeterministic code
Speed – n/a; evaluation of the models is fast, but creation of the task graphs is tedious and cannot be easily automated, even if a program's trace is available

5.3.4 ESP

Parashar and Hariri have implemented a performance prediction tool as part of a development environment for High-Performance Fortran systems named ESP [5.5.32, 5.5.33]. Their tool operates at the source-code level, interpreting the source code to predict its performance on an iPSC/860 hypercube system. In their interpretive environment, the application is abstracted by extracting properties from the source code. The system on which the user wishes to simulate the application is modeled by a System Abstraction Graph (SAG), whose nodes abstract parts of the machine's overall performance.
An interesting idea presented here is that each node in the SAG uses a well-defined interface, so each node may use whatever technique it wishes to report predicted information. The interface used for each node is composed of four components: processing, memory, communication/synchronization, and input/output. Deeper nodes in the SAG represent progressively finer models of specific components of the target architecture being evaluated. Applications are modeled in a similar way using an Application Abstraction Graph (AAG). Nodes in the AAG represent actions the program takes, such as starting, stopping, sequential code execution, synchronization, and communication. Information from the AAG is merged with the SAG to apply system-specific characteristics to the application, forming the Synchronized Application Abstraction Graph (SAAG). The SAAG is then fed into the interpreter, which uses the features provided by the SAG to estimate the performance of the application. Most of the nodes in the SAG use simple analytical formulas to predict the performance of specific actions taken by an HPF program.

The overall accuracies obtained by ESP are respectable; errors for the test cases used by the authors were at most 20%, with most errors being under 10%. The model itself is not immediately useful to our PAT, but ESP illustrates that it is feasible to successfully incorporate a predictive performance model into a PAT. In addition, the notion of using a standard interface for system models is an interesting one, since it allows increased accuracy for parts of the models by permitting simulation where needed.

ESP summary:
Parameters used – raw arithmetic speeds, overhead for loops and conditional branches, function call overhead, memory parameters (cache size, associativity, block size, memory size, cache miss penalties for reads and writes, main memory costs for reading and writing data, TLB miss overhead), network parameters (topology, bandwidth, router buffer sizes, communication startup overhead, per-hop overhead, receive overhead, synchronization overhead, broadcast and multicast algorithms)
Machine-friendliness – medium to low; requires a Fortran source code parser
Average error – below 10% in most cases
Speed – medium to slow; interpretive models may take a long time to evaluate depending on how many simulative components exist in the system models

5.3.5 VFCS

In the early 1990s, Fahringer, Blasko, and Zima from the University of Vienna created a specialized Fortran compiler they dubbed the Vienna Fortran Compilation System (VFCS). VFCS is a Fortran compilation system that automatically parallelizes Fortran77 codes based on a predictive performance model and sequential execution characteristics. Fahringer et al. also refer to the VFCS as SUPERB-2, indicating that it is a second-generation version of an earlier compilation system developed at the University of Vienna named SUPERB. In an earlier implementation of the VFCS [5.5.34], an Abstract Program Graph (APG) is constructed for each Fortran program to be parallelized. The unmodified Fortran77 code is instrumented and run via the Weight Finder, which captures information such as conditional branch probabilities, loop iteration counts, and estimated execution times for basic code blocks. The Weight Finder requires a representative data set for the profiling stage, or the effectiveness of the VFCS is limited.
The information obtained from the Weight Finder is merged into the APG and simulated using a discrete-event simulator, which also uses machine-specific metrics like network topology, pipeline behavior, and number of processors. The simulator then outputs the predicted performance for the user. 55 More recent versions of the VFCS incorporate P3T [5.5.35], which is the Parameter-based Performance Prediction Tool. Instead of using simulation to perform performance predictions on a program graph, an analytical model is used to predict performance based on certain metrics collected in the profiling stage by the Weight Finder. These metrics are classified into two categories. The first category, which represents machine-independent information, contains the work distribution of a program (based on an owner-compute rule and a block decomposition of arrays in the Fortran program), number of data transfers performed by the program, and the amount of data transferred. The second category, which represents machine-dependent information, contains a basic model of network contention, network transfer times, and number of cache misses for the program. It is important to note that scalar values are reported for each of the metrics listed above. The accuracies the authors reported for their analytical models were within 10%, although the codes they evaluated were small scientific kernels consisting of at most tens of lines of code. Fahringer has also incorporated a graphical user interface (GUI) for P3T which helps a user tune the performance of their Fortran code, and also extended it to work on a subset of HPF code [5.5.36]. Blasko has also refined the earlier discrete-event simulation model by the addition of the PEPSY (PErformance Prediction SYstem) [5.5.37], keeping the tradition of the esoteric naming conventions for VFCSrelated components. The VFCS system and its related components illustrate a variety of interesting techniques for predicting parallel program performance and applying it to guide automatic parallelization and performance tuning. However, the authors never successfully demonstrate their system for anything but small scientific computation kernels. In addition, their tool is limited to working with Fortran code. Therefore, even though it is interesting to examine the techniques used in the VFCS set of tools, they are not directly applicable to our PAT. VFCS summary: Parameters used – conditional branch probability, loop iteration counts, and estimated execution times for basic code blocks for Weight Finder; network topology, pipeline behavior, and number of processors for simulations; work distribution of a program, number of data transfers performed by the program, the amount of data transferred, network 56 contention, network transfer times, and number of cache misses for the program for the analytical model Machine-friendliness – very high, but requires integration with Fortran compiler; specific to Fortran Average error – below 10% for simple scientific kernels Speed – fast for analytical model, unspecified for simulative model 5.3.6 PACE PACE, or PerformaAnCE analysis environment, is a tool for predicting and analyzing parallel performance for models of program codes [5.5.38]. PACE uses a hierarchical classification technique similar to our layers, in which software and hardware layers are separated with a parallel template layer. Parallel programs are represented in the template layer via a task graph, similar to the task graphs used by Adve and Simon and Wierum. 
The PACE environment offers a graphical setting in which to evaluate parallel programs. Abstract task graphs are realized using a performance modeling language called CHIP3S, and the models are compiled and used to predict the performance of the application. An updated version of PACE has been targeted as a generic tool for modeling parallel grid computing applications, rather than a system that automates such modeling [5.5.39]. PACE provides a few domain-specific modeling languages, including PSL (Performance Specification Language) and HMCL (Hardware Model Configuration Language), but like Adve's deterministic task graph method, no standard way of modeling specific architectures is presented. Based on the papers available describing PACE, it seems as though PACE is largely an integration of several existing performance tools into a system for predicting the performance of grid applications. However, PACE also introduces a novel idea: predictive traces [5.5.40]. In this approach, trace files are created when running application models. These predictive trace files use general formats such as Pablo's Self-Defining Data Format (SDDF) or ParaGraph's Portable Instrumented Communication Library (PICL). This increases the value of the predictions to the user, since the predictions can be viewed in exactly the same manner as regular performance traces. Since PACE is targeted at grid computing, we will not specifically consider its inclusion in our PAT. However, predictive traces are an interesting idea that could potentially add value to a performance model included in a PAT.

PACE summary:
Parameters used – application execution characteristics (for loops, etc.) for program models; unspecified for system models
Machine-friendliness – medium to low; generation of models requires much user interaction
Average error – within 9% for the examples given, although specifics of the system models used are not given, so it is difficult to gauge this accuracy
Speed – fast (single-second times were reported in [5.5.38])

5.3.7 Convolution method

The Performance Evaluation Research Center (PERC) has created an interesting method for predicting performance which is based on the idea of convolution [5.5.41]. In this method, an "application signature" is collected from an application and combined (convolved) with a machine profile to give an estimate of execution performance. The convolution method uses existing tools to gather application signatures and machine profiles. Machine profiles are gathered using MAPS, which measures the memory performance of a machine, and PMB, which measures the network performance of a machine. Application signatures are captured using the MetaSim tracer, which collects memory operations from a program, and MPIDtrace, which collects the MPI operations performed by a program. The communication-specific portions of the application signature and machine profile are convolved together using DIMEMAS, and the memory-specific portions of the signature and profile are convolved using the MetaSim convolver. Currently, the MetaSim tracer is limited to the Alpha platform, but the authors suggest that future versions may use Paradyn's DynInst API, which would open the door to more platforms. The convolution method provides accuracies within 25%, which is surprising considering that the model only takes into account memory operations, network communications, and floating-point performance.
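As a deliberately simplified caricature of the convolution idea (our own restatement, not a formula from [5.5.41]), the predicted time can be viewed as an inner product of application counts and machine rates:

\[
T_{\text{pred}} \approx \sum_{j} c_j(\text{application}) \cdot t_j(\text{machine}),
\]

where each c_j is a count taken from the application signature (e.g., loads that hit a given level of the memory hierarchy, or messages of a given size) and each t_j is the corresponding per-operation time taken from the machine profile.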
In general, the model accuracy was good for using weak scaling (problem size increases with system size) but suffered when strong scaling was applied to the problem (problem size stays fixed with increases in system size). However, one drawback to this method is that it requires programs to be executed before performance predictions may be carried out, which limits the efficiency reported by the tool to the quality of workloads used during profiling runs. Simulation is used to determine performance of network and computational performance. Because of this, time to perform the convolutions may vary widely based on the properties of the simulators used. In addition, no standard set of convolutions is suggested to use; the authors used application-specific convolution equations to predict performance of different parallel application codes. Because of the possible large cost of the simulative aspects of this method, it seems unlikely that it will be feasible to use it in a PAT. In addition, the general approach used seems to be used in almost all other performance prediction frameworks under a different name; the nomenclature used to describe the separation of machine- and application-specific models is new, but the general idea is old. Convolution method summary: Parameters used – network usage, memory performance, floating point performance Machine-friendliness – high as long as tools are supported on platforms of interest Average error – within 25% for matrix multiplication Speed – not reported; possibly slow due to inclusion of simulative techniques 59 5.3.8 Other techniques In this section, we will present a brief overview of other predictive performance models and quickly evaluate them. The models presented here are not useful for our PAT, but are included for completeness. Howell presents an MPI-specific performance prediction tool in his PhD thesis [5.5.42]. He uses microbenchmarks to gather performance data for MPI codes. He then provides several equations that can be used to predict the performance of MPI-related function calls. These equations may be used directly (a simple graphing package is also provided), or used in the context of his “reverse profiling” library. Howell’s reverse profiling library is an interesting idea; in it, MPI code is linked against an instrumented library and run. When MPI commands are intercepted by the library, the equations derived earlier are used to simulate delays, and the delays are reported back to the user. This interesting hybrid between an analytical model, profiling, and simulation allows a user to quickly predict the performance of their MPI codes on different network hardware. In addition, Howell also provides a simulation tool to directly simulate the MPI code and provide traces. Reverse profiling is an interesting technique, but would be much harder to implement for languages that have implicit communication (such as UPC). Grove also presents an MPI-specific performance prediction technique in his PhD thesis [5.5.43]. He creates what he terms a Performance Evaluating Virtual Parallel Machine (PEVPM), which processes special comments in the source code to quickly predict the performance of an MPI program. Sequential sections of program code are labeled with comments indicating how long they take to evaluate, and MPI communication calls are simulated using a probabilistic general communication model that takes into account the type of MPI call, the number of messages in transit, and the size of data being transferred. 
As with Howell's work, microbenchmarks are used to predict communication performance (in addition to other information collected by the execution simulator). Again, this approach would be difficult to apply to UPC because it requires that the user know explicitly when communication is taking place. In addition, the details of sequential code performance are abstracted away entirely, requiring the user to specify how long each sequential code statement takes. This approach does represent a low-cost method of performance modeling and prediction for SHMEM, though, and may be useful for a prototype system for SHMEM.

Models specific to cluster architectures are presented by Qin and Baer in [5.44] and by Yan and Zhang in [5.45]. Qin and Baer created simple analytical equations for distributed shared-memory systems operating in a cluster environment. However, their analytical models also require detailed hardware and network models to be used with trace-driven simulation. In addition, they did not calibrate their models to real systems, so the accuracy of their approach is unknown. Yan and Zhang use a model of bus-based Ethernet clusters to predict the performance of applications run over non-dedicated clusters. Their communication model assumes communication times are nondeterministic, but in practice uses averages to approximate the communication costs. Their approach sustained an accuracy of 15%, but the sequential code time prediction they used requires an instruction mix for each program to be evaluated. Such instruction mixes can be costly or impossible to obtain for large parallel codes due to the non-determinism and large scales associated with them.

A simulation tool for parallel codes named COMPASS is presented in [5.46]; it is being used in the POEMS (Performance Oriented End-to-end Modeling System) project. The simulation tool is able to attain high accuracy (within 2% for the Sweep3D application), but requires the use of multiple nodes during simulation to keep simulation times manageable. In addition, the tool requires detailed network and hardware models to be created, and so does not represent a viable strategy for inclusion in our PAT, since the work required to create accurate models can easily become unwieldy.

An earlier simulation tool is presented by Poulsen and Yew in [5.47]. This tool is execution-driven and supports three types of simulation: critical-path simulation, execution-driven trace generation, and trace-driven simulation. Of the three, critical-path simulation is the most interesting. In critical-path simulation, the minimum parallel execution time of a program is obtained by instrumenting the code to compute the earliest time at which each task in the parallel code could execute, given parameters about the hardware and network being used. However, like all simulation techniques, this requires detailed models, and it also requires an initial run with instrumented code. In this respect, it seems that related techniques such as lost cycles analysis can accomplish a similar objective with less overhead.

Brehm presents a prediction tool for massively parallel systems named PerPreT in [5.48]. This prediction tool operates on task graphs represented in a C-like language. It incorporates an analytical model to estimate communication performance, although the analytical model ignores network contention.
PerPreT can be thought of as a specific version of Adve's general deterministic task graph method. The method has decent accuracy (within 20% for matrix multiplication and conjugate gradient codes), but requires expressing the programs being evaluated in PerPreT's specialized language. Since it is not likely this can be automated, PerPreT is not an ideal candidate for inclusion in our PAT.

A different approach to performance prediction, taken by Kuhnemann et al., is to use only the source code to predict performance [5.49]. Their approach uses the SUIF (Stanford University Intermediate Format) system to generate parse trees of C code. The parse trees are analyzed and costs are associated with each node. Computation is modeled using simple machine-specific parameters, and communication is modeled using analytical equations with parameters derived from microbenchmarks. While this approach is interesting, it only supports MPI codes. However, it does validate the usefulness of the SUIF source code parser.

5.4 Conclusion and recommendations

In this section, we have presented an overview of performance modeling and prediction techniques. We categorized the existing performance models into three categories: formal models, general analytical models, and predictive models. For each model presented, we evaluated it against the criteria set forth at the beginning of this section.

In general, most of the analytical models presented in this section do not apply well within the context of a PAT. The formal models presented in Section 5.1 require too much user interaction to be useful. Most of the analytical models presented in Section 5.2 are either too simplistic or would take too much work to turn into working implementations. The predictive models in Section 5.3 are the most useful for a PAT, but many of them need user interaction in order to be useful, or they rely on detailed simulation models that take too long to create and run. We feel it is essential to choose a method that creates models for the user automatically, or at least with the help of minimal comments in the user's source code. Creating detailed, accurate models is feasible for researchers working in the area of performance modeling and prediction, but most users will not have the time or the desire to do this. If our performance tool requires any input at all from the user, that input should be as quick and painless to provide as possible.

There are a few attractive options that we believe would fit well within a PAT. Lost cycles analysis is an especially promising method, since it can not only predict performance but also help the user improve their program code, because it doubles as an analysis strategy. In addition, it does not seem that it would be especially difficult to implement, particularly when compared with other techniques that depend on the generation of deterministic task graphs. If we are going to be creating instrumented forms of UPC runtime environments, or methods that accomplish the same goal, we should be able to easily insert extra code to record the simple states needed for lost cycles analysis. Lost cycles analysis also presents an easy-to-understand metaphor to the user that illustrates where performance is being lost in a parallel application.
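As a rough illustration of how cheaply the states needed for lost cycles analysis could be recorded, the sketch below accumulates wall-clock time into a handful of lost-cycle categories from inside a hypothetical instrumented runtime. The category names follow Crovella and LeBlanc's taxonomy; the timer choice, hook placement, and reporting are assumptions made only for illustration, not a design of our actual instrumentation.

/* Hedged sketch: accumulating "lost cycles" by category.
 * Categories follow Crovella and LeBlanc; everything else is hypothetical. */
#include <stdio.h>
#include <sys/time.h>

enum lost_category {
    LC_LOAD_IMBALANCE,
    LC_INSUFFICIENT_PARALLELISM,
    LC_SYNCHRONIZATION,
    LC_COMMUNICATION,
    LC_RESOURCE_CONTENTION,
    LC_NUM_CATEGORIES
};

static double lost_seconds[LC_NUM_CATEGORIES];

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* A runtime hook (e.g., around a barrier or a remote get) would call these. */
static double charge_begin(void) { return now(); }

static void charge_end(enum lost_category cat, double start)
{
    lost_seconds[cat] += now() - start;
}

static void report(double total_time)
{
    static const char *names[LC_NUM_CATEGORIES] = {
        "load imbalance", "insufficient parallelism", "synchronization",
        "communication", "resource contention"
    };
    for (int i = 0; i < LC_NUM_CATEGORIES; i++)
        printf("%-25s %8.3f s (%5.1f%%)\n", names[i], lost_seconds[i],
               100.0 * lost_seconds[i] / total_time);
}

int main(void)                      /* toy driver standing in for the runtime */
{
    double t0 = now();
    double s = charge_begin();      /* pretend we wait at a barrier here */
    charge_end(LC_SYNCHRONIZATION, s);
    report(now() - t0 + 1e-9);      /* guard against a zero-length toy run */
    return 0;
}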
However, one problem that we will need to address before implementing this strategy is how to report a lost cycles analysis at a granularity finer than the application level, since lost cycles will be more useful if we can report data at the function- or loop-level granularity. If our PAT is going to support tracing (which in all likelihood it will), the simplest solution to this problem is to perform the lost cycles analysis entirely post-mortem, based on the information provided in the trace files. As long as our trace files can relate data back to the source code, we can give a lost cycles analysis for arbitrary subsections of the source code. In any case, the lost cycles analysis should be an optional feature the user can turn on and off at will, in case the method introduces measurement overhead.

Another possible way to include a performance model in our PAT is to use one of the simpler methods that rely on comments inserted in the source code of the user's program. This should not impose too much overhead on the user, especially if our PAT is able to provide detailed timing information back to the user for subsections of code. This method would be attractive because it represents a low-cost solution: parsing source code comments is not difficult provided they are in a well-defined format, and the data collected could be plugged into any modeling technique we wish, be it a simple analytical model or a more detailed simulative model. This could provide an "upgrade path" in which either an analytical model or a simulative model could be used, depending on how much accuracy the user needs and how long an evaluation time they can tolerate. One possible wrinkle in the source code annotation plan is how to handle modeling the implicit communication in UPC. One way to deal with this is to analyze the trace files generated by an actual run and correlate them with the source code. If this could be implemented, it probably would not be too hard to also include computational metrics from the trace files in the modeling process, thus obviating the need for the user to do anything at all in order to use the performance model. However, such a system would undoubtedly be extremely difficult to implement, and as such is out of the scope of our project.

Finally, if we are able to automate the generation of task graphs and system models for a particular program and architecture, using Adve's deterministic task graphs may provide a relatively low-cost solution for performance prediction. The task graphs generated may also be a useful visualization tool, because they provide a high-level view of the parallelism and synchronization that occurs during a program's execution. Adve's method seems like it would be best used if complemented with other models (such as LogP and memory hierarchy models), since it provides a general framework in which to represent a program's structure and predict its execution time.

In summary, lost cycles analysis provides a useful performance model that can be implemented with relatively low cost (under a few key assumptions). Since lost cycles analysis also doubles as an analysis strategy, we feel it would fit very well in our PAT. Therefore, we recommend it as the most likely candidate if we are to include a performance model in our PAT.
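To give a feel for how lightweight such source code annotations could be, the fragment below sketches one possible comment format attached to a loop. The annotation keyword and fields are purely hypothetical; no annotation syntax has been decided for our PAT, and the cost expression shown is an arbitrary placeholder.

/* Hypothetical annotation format only; no comment syntax has been decided. */
#include <stdio.h>

#define N 1024

void scale(double *a, const double *b, double alpha)
{
    /* PAT-MODEL: block=scale_loop  seq_cost=2.5e-9*N  comm=none */
    for (int i = 0; i < N; i++)
        a[i] = alpha * b[i];
}

int main(void)
{
    double a[N], b[N];
    for (int i = 0; i < N; i++)
        b[i] = (double)i;
    scale(a, b, 2.0);
    printf("a[10] = %g\n", a[10]);
    return 0;
}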
Table 5.1 - Summary of performance models

Model name | Model type | Parameters used | Machine-friendliness | Accuracy | Speed | Section
Adve's task graphs | Predictive | No specific set required | Medium to low | Less than 5% | Moderate | 5.3.2
BSP | General analytical | Network bandwidth and latency, sequential code performance | Medium to high | Within 20% for most cases | Fast | 5.2.2
Convolution | Predictive | Network usage, memory performance, floating point performance | High | Within 25% | Not reported; possibly slow | 5.3.7
ESP | Predictive | (many) | Medium to low | Within 10% | Medium to slow | 5.3.4
Formal models | Formal | Varies | Very low | Varies | Low | 5.1
LogP | General analytical | Network latency, overhead, gap (bandwidth), number of processors | High for network-specific evaluations, low in general case | Within 10% for predicting network performance | Fast | 5.2.3
Lost cycles | Predictive | Load imbalance, insufficient parallelism, synchronization cost, communication cost, resource contention | Medium to high | 12.5% for FFT | Fast; requires several initialization runs | 5.3.1
PACE | Predictive | Application execution characteristics, unspecified system model parameters | Medium to low | Within 9% | Fast | 5.3.6
PRAM | General analytical | Coarse model of memory access and sequential execution time | Medium to low | Within an order of magnitude | Fast | 5.2.1
Simon and Wierum's task graphs | Predictive | Memory performance, floating point performance, integer performance, message size, distance between sender and receiver, resource contention | Very low | Within 6.5% | Fast for evaluation, very slow for creating models | 5.3.3
VFCS | Predictive | (many) | Very high; requires integration with Fortran compiler | Within 10% | Medium to fast | 5.3.5

5.5 References

[5.1] C. A. Petri, Kommunikation mit Automaten. PhD thesis, Universitat Bonn, 1962.
[5.2] S. Gilmore, J. Hillston, L. Kloul, and M. Ribaudo, "Software performance modelling using PEPA nets," in WOSP '04: Proceedings of the fourth international workshop on Software and performance, pp. 13–23, ACM Press, 2004.
[5.3] R. Milner, A Calculus of Communicating Systems, vol. 92 of Lecture Notes in Computer Science. Springer, 1980.
[5.4] C. A. R. Hoare, "Communicating sequential processes," Communications of the ACM, vol. 21, no. 8, pp. 666–677, 1978.
[5.5] C. A. R. Hoare, Communicating Sequential Processes. Prentice Hall International, 1985.
[5.6] O. Boxma, G. Koole, and Z. Liu, "Queueing-theoretic solution methods for models of parallel and distributed systems," in 3rd QMIPS Workshop: Performance Evaluation of Parallel and Distributed Systems, pp. 1–24, 1994.
[5.7] A. J. C. van Gemund, "Performance prediction of parallel processing systems: the PAMELA methodology," in ICS '93: Proceedings of the 7th international conference on Supercomputing, pp. 318–327, ACM Press, 1993.
[5.8] A. J. C. van Gemund, "Compiling performance models from parallel programs," in ICS '94: Proceedings of the 8th international conference on Supercomputing, pp. 303–312, ACM Press, 1994.
[5.9] A. J. C. van Gemund, "Symbolic performance modeling of parallel systems," IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 2, pp. 154–165, 2003.
[5.10] S. Fortune and J. Wyllie, "Parallelism in random access machines," in STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing, pp. 114–118, ACM Press, 1978.
[5.11] L. G. Valiant, "A bridging model for parallel computation," Commun. ACM, vol. 33, no. 8, pp. 103–111, 1990.
[5.12] B. H. H. Juurlink and H. A. G. Wijshoff, "The E-BSP model: Incorporating general locality and unbalanced communication into the BSP model," in Proc. Euro-Par'96, vol. II (LNCS 1124), pp. 339–347, 1996.
[5.13] B. Juurlink and H. A. G. Wijshoff, "A quantitative comparison of parallel computation models," in Proc. 8th ACM Symp. on Parallel Algorithms and Architectures (SPAA'96), pp. 13–24, January 1996.
[5.14] J. M. D. Hill, P. I. Crumpton, and D. A. Burgess, "Theory, practice, and a tool for BSP performance prediction," in Euro-Par, Vol. II, pp. 697–705, 1996.
[5.15] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a realistic model of parallel computation," in PPOPP, pp. 1–12, 1993.
[5.16] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, "LogGP: incorporating long messages into the LogP model - one step closer towards a realistic model for parallel computation," in SPAA '95: Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, pp. 95–105, ACM Press, 1995.
[5.17] M. I. Frank, A. Agarwal, and M. K. Vernon, "LoPC: modeling contention in parallel algorithms," in PPOPP '97: Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 276–287, ACM Press, 1997.
[5.18] C. A. Moritz and M. I. Frank, "LoGPC: modeling network contention in message-passing programs," in SIGMETRICS '98/PERFORMANCE '98: Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, pp. 254–263, ACM Press, 1998.
[5.19] K. W. Cameron and R. Ge, "Predicting and evaluating distributed communication performance," in Supercomputing '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, 2004.
[5.20] K. W. Cameron and X.-H. Sun, "Quantifying locality effect in data access delay: Memory LogP," in 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), pp. 48–55, 2003.
[5.21] A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, "Fast parallel sorting under LogP: Experience with the CM-5," IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 8, pp. 791–805, 1996.
[5.22] M. J. Clement and M. J. Quinn, "Analytical performance prediction on multicomputers," in Supercomputing '93: Proceedings of the 1993 ACM/IEEE conference on Supercomputing, pp. 886–894, ACM Press, 1993.
[5.23] D. J. Kerbyson, H. J. Alme, A. Hoisie, F. Petrini, H. J. Wasserman, and M. Gittings, "Predictive performance and scalability modeling of a large-scale application," in Supercomputing '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), pp. 37–48, ACM Press, 2001.
[5.24] X.-H. Sun and J. Zhu, "Performance prediction of scalable computing: a case study," in HICSS, pp. 456–469, 1995.
[5.25] Y. Kim, M. Fienup, J. C. Clary, and S. C. Kothari, "Parametric microlevel performance models for parallel computing," Tech. Rep. TR-9423, Department of Computer Science, Iowa State University, December 1994.
[5.26] M. E. Crovella and T. J. LeBlanc, "Parallel performance using lost cycles analysis," in Supercomputing '94: Proceedings of the 1994 conference on Supercomputing, pp. 600–609, IEEE Computer Society Press, 1994.
[5.27] J. Wagner Meira, "Modeling performance of parallel programs," Tech. Rep. 589, Computer Science Department, University of Rochester, June 1995.
[5.28] M. Crovella and T. J. LeBlanc, "Performance debugging using parallel performance predicates," in Workshop on Parallel and Distributed Debugging, pp. 140–150, 1993.
[5.29] V. S. Adve, Analyzing the behavior and performance of parallel programs. PhD thesis, Department of Computer Sciences, University of Wisconsin-Madison, December 1993.
[5.30] V. S. Adve and M. K. Vernon, "Parallel program performance prediction using deterministic task graph analysis," ACM Trans. Comput. Syst., vol. 22, no. 1, pp. 94–136, 2004.
[5.31] J. Simon and J.-M. Wierum, "Accurate performance prediction for massively parallel systems and its applications," in Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing, Volume II, pp. 675–688, Springer-Verlag, 1996.
[5.32] M. Parashar and S. Hariri, "Compile-time performance prediction of HPF/Fortran 90D," IEEE Parallel Distrib. Technol., vol. 4, no. 1, pp. 57–73, 1996.
[5.33] M. Parashar and S. Hariri, "Interpretive performance prediction for high performance application development," in HICSS (1), pp. 462–471, 1997.
[5.34] T. Fahringer, R. Blasko, and H. P. Zima, "Automatic performance prediction to support parallelization of Fortran programs for massively parallel systems," in ICS '92: Proceedings of the 6th international conference on Supercomputing, pp. 347–356, ACM Press, 1992.
[5.35] T. Fahringer and H. P. Zima, "A static parameter based performance prediction tool for parallel programs," in ICS '93: Proceedings of the 7th international conference on Supercomputing, pp. 207–219, ACM Press, 1993.
[5.36] T. Fahringer, "Estimating and optimizing performance for parallel programs," Tech. Rep. TR 96-1, Institute for Software Technology and Parallel Systems, University of Vienna, March 1996.
[5.37] R. Blasko, "Hierarchical performance prediction for parallel programs," in Proceedings of the 1995 International Symposium and Workshop on Systems Engineering of Computer Based Systems, pp. 398–405, March 1995.
[5.38] D. J. Kerbyson, J. S. Harper, A. Craig, and G. R. Nudd, "PACE: A toolset to investigate and predict performance in parallel systems," in Proc. of the European Parallel Tools Meeting, (Châtillon, France), Oct. 1996.
[5.39] J. Cao, D. J. Kerbyson, E. Papaefstathiou, and G. R. Nudd, "Modelling of ASCI high performance applications using PACE," in 15th Annual UK Performance Engineering Workshop, pp. 413–424, 1999.
[5.40] D. J. Kerbyson, E. Papaefstathiou, J. S. Harper, S. C. Perry, and G. R. Nudd, "Is predictive tracing too late for HPC users?," in High-Performance Computing, pp. 57–67, Kluwer Academic, 1999.
[5.41] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha, "A framework for performance modeling and prediction," in Supercomputing '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing, pp. 1–17, IEEE Computer Society Press, 2002.
[5.42] F. Howell, Approaches to parallel performance prediction. PhD thesis, Dept. of Computer Science, University of Edinburgh, 1996.
[5.43] D. P. Grove, Performance modelling of message-passing parallel programs. PhD thesis, Dept. of Computer Science, University of Adelaide, 2003.
[5.44] X. Qin and J.-L. Baer, "A performance evaluation of cluster architectures," in SIGMETRICS '97: Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pp. 237–247, ACM Press, 1997.
[5.45] Y. Yan, X. Zhang, and Y. Song, "An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW," J. Parallel Distrib. Comput., vol. 38, no. 1, pp. 63–80, 1996.
[5.46] R. Bagrodia, E. Deelman, S. Docy, and T. Phan, "Performance prediction of large parallel applications using parallel simulations," in PPoPP '99: Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 151–162, ACM Press, 1999.
[5.47] D. K. Poulsen and P.-C. Yew, "Execution-driven tools for parallel simulation of parallel architectures and applications," in Supercomputing '93: Proceedings of the 1993 ACM/IEEE conference on Supercomputing, pp. 860–869, ACM Press, 1993.
[5.48] J. Brehm, M. Madhukar, E. Smirni, and L. W. Dowdy, "PerPreT - a performance prediction tool for massively parallel systems," in MMB, pp. 284–298, 1995.
[5.49] M. Kuhnemann, T. Rauber, and G. Runger, "A source code analyzer for performance prediction," in IPDPS, 2004.

6 Experimental performance measurement

The experimental performance modeling process includes five stages (Figure 6.1). In the instrumentation stage, measurement code is inserted into the original application code. This is when the user (manually) or the PAT (automatically) decides what kinds of measurements are needed, where to put the measurement code, and how to collect the results. After this, in the measurement stage, the PAT collects the raw data that feed into the analysis stage, where they are transformed into a set of meaningful performance data. This set is then organized and presented to the user (as plain text, visualizations, etc.) in a meaningful and intuitive way in the presentation stage. Finally, in the optimization stage, the user or the PAT discovers the source of the performance bottleneck and modifies the original code to alleviate the problem. Other terms have been used for many of these stages (e.g., monitoring for the instrumentation and measurement stages, or filtering and aggregation for the analysis stage), but the goals of the stages remain the same.

Among the methods proposed, by far the most dominant, and perhaps the only one implemented in today's PATs, is the event-based approach. (Another available approach is flow analysis, but work on this approach is generally aimed at determining the correctness of the program and has little correlation with performance analysis. Furthermore, no tool that we know of is based on this approach, except for the inclusion of call graph trees, which probably stem from the flow analysis world.) An event is a measurable behavior of the program that is of interest. It can be as simple as an occurrence of a cache miss, or a collection of simple events such as the cache miss count. It was found that 95% of these events can be classified as frequency (probability of an action), distance (time between occurrences of the same event), duration (length in time of the actual event), count (number of times an event occurred), or access value (program-related values such as message length) [6.1.10]. Using appropriate events, the dynamic behavior of the program can be reconstructed, thus aiding in the identification of performance bottlenecks. The challenge in this approach is the determination of the meaningful events (what we define as factors) and how to accurately measure them. Many PAT developers simply use what is available, what is used by other tools, or depend on users' requests to determine what factors to put into their PAT.
Although these are great sources, support is needed to justify their importance. Ultimately, we need to establish a "standard" set of important factors along with support for why they are important. In addition, the best way to measure them, and to what degree performance is affected by them, should be determined.

Figure 6.1 - Experimental performance modeling stages

6.1 Instrumentation

This section covers issues and options important in the instrumentation stage, along with what we believe to be the approach to try in our PAT. Note that the presenter of the SC03 tutorial on principles of performance analysis stated that 89% of the development effort is directly or indirectly related to writing instrumentation software [6.1.2]. Because of this, we should try to extend existing instrumentation software if we find factors that are not currently being measured.

6.1.1 Instrumentation overhead

Instrumentation overhead can be broken down into two categories. One is the additional work a PAT user needs to do (the manual overhead), and the other is the performance overhead, the extra program running time due to the insertion of instrumentation code. The question is then: what levels of overhead are acceptable in each of these areas? It is ideal to minimize the manual overhead, but doing so limits the usefulness of a PAT, as only the events built into the tool are available. This is a tradeoff between the effectiveness of a PAT and the extra effort needed from the user. Will the user deem a PAT too tedious to work with? Unfortunately, these issues are rarely evaluated by PAT developers. Low manual overhead is often cited as one of the benefits of dynamic instrumentation (see Section 6.1.2), but that is as far as the discussion usually goes.

The PAT performance overhead, on the other hand, is often mentioned. This is important, since excessive instrumentation will alter the behavior of the original program, thus making the analysis invalid. PAT developers and users are aware of the impact this has on the usefulness of a PAT, but determining what level of overhead is acceptable is still arbitrary. Generally, less than a 30% increase in overall program execution time is viewed as acceptable (from the developer's point of view; what is acceptable to the user is yet to be determined), and this may be just a consensus reached without experimental support. Nonetheless, it does provide a tangible level for us to use. However, this is not sufficient to determine whether the instrumentation code perturbs the original program behavior. It is possible for an instrumented version of the code to run only slightly slower than the original version but, because of where and how it was instrumented, still alter the program's behavior. This should be studied, and perhaps determining the overhead level of each instrumentation step, along with some simple modeling, may provide an answer.

6.1.2 Profiling and tracing

Profiling refers to the process of collecting statistical event data at runtime and performing data aggregation and filtration when the program terminates. It can be based on sampled process timing, where a hardware timer periodically interrupts the program execution to trigger the measurements (i.e., a daemon running independently of the program being analyzed collects data on the state of the machine), or on measured process timing, in which instrumentation is triggered at the start and end of the event.
In profiling, when and what to instrument can be determined statically, and the instrumentation is inserted prior to program execution. The performance data generally require little storage space, since occurrences of the same event can be aggregated into a single event collection. Because of the statistical nature of the method, it is difficult, and often impossible, to accurately reconstruct program behavior from profiling data alone.

Tracing, on the other hand, is more useful when accurate reconstruction of program behavior is desirable. Events are time-stamped, or at least ordered with respect to program execution. This enables the user to see exactly when and what occurred. Partial, meaningful event data can be presented to the user at runtime so the user can detect bottlenecks while tracing takes place (partial data can also be presented at runtime with profiling, but profiling data, due to their statistical nature, are often meaningless mid-run). Lastly, it is possible to calculate the profiling data from the tracing data after the program terminates. Tracing is thus able to provide a more accurate view of program behavior. However, for a long-running program, tracing requires an enormous amount of storage space, which can also complicate the analysis process.

Two ideas have been proposed to alleviate the problem of large trace files. The first approach is to use a trace file format that is compact and scalable (popular formats include SDDF, EPILOG, CLOG/SLOG, VTF/STF, and XML-based formats). This has been shown to be effective to some extent, and new formats are constantly being devised to address these issues along with other desirable features such as interoperability between tools (a big selling point of XML). There is also an effort to develop a trace file database [6.1.4], which could be a great source for factor discovery. The second approach is to vary the degree of instrumentation depending on the program behavior; this is part of the motivation behind dynamic instrumentation, in which the PAT decides when and where instrumentation takes place (e.g., DynInst from Paradyn, DPCL from IBM). The system examines the available data and instruments the original code heavily only if it discovers a possible bottleneck. Heavy instrumentation is then scaled back to the normal amount (a predefined set of events needed for problem discovery that does not impose too much overhead) once the program behaves normally again. Schemes have been devised (e.g., the W3 search model of Paradyn and knowledge-based discovery systems) to identify all possible performance bottlenecks without wasting too much time on false positives. However, research in this area is still very primitive.

We should probably use tracing as our data collection strategy, but we need to be very careful not to over-trace the program. Dynamic instrumentation and a scalable trace format are both important, and we should use them in our PAT, but we might need to extend or even come up with a better bottleneck discovery system.

6.1.3 Manual vs. automatic

As mentioned in Section 6.1, the instrumentation process can be done manually or automatically by the system. It is debatable which one is more useful, and tools incorporating one or both of these methods (though often only one method is used at a time) have been shown to be successful.
It is interesting to note that users are not entirely against manual instrumentation, so the ideal approach seems to be for the PAT to provide as much automatic instrumentation as possible while still allowing manual instrumentation, so that its usefulness is not limited. Finally, it would be beneficial to classify factors according to their appropriateness for manual or automatic instrumentation. We believe some factors are better left for the user to instrument (e.g., user-defined functions), while others should be automatically inserted (e.g., memory profiles) to lessen the effort needed from the user.

6.1.4 Number of passes

The number of passes refers to how many times a tool needs to execute the instrumented code in order to gather the necessary profiling/tracing information. A one-pass PAT is desirable, since it minimizes the time the user needs to wait for feedback. More importantly, it is sometimes critical to have a one-pass system, because a multi-pass system is simply not viable for long-running programs. On the other hand, more accurate data can be obtained with a multiple-pass approach, which helps the performance bottleneck identification process. Later passes can use the data collected in the first pass (generally profiling) to fine-tune their task (one use is in dynamic instrumentation, where the system analyzes the statistical data from the first pass and then traces only at the hot spots). Finally, a hybrid strategy has been introduced to overcome the shortfall of the one-pass approach. The basic principle is to periodically analyze data while event data are being collected and to use the analysis as a basis for future instrumentation. This is very interesting if we decide to incorporate dynamic instrumentation into our PAT, but we need to be aware that this method does not yield exactly the same result as a multiple-pass approach. Finally, it is worth reinforcing that this issue is really only applicable to tracing, as virtually all profiling events are statistically oriented, so the necessary instrumentation can be determined a priori.

6.1.5 Levels of instrumentation

Instrumentation can be done at various levels in the programming cycle. Typical levels include the source (application), runtime/middleware (library and compiler), operating system, and binary (executable) levels.

6.1.5.1 Source level instrumentation

Instrumentation at the source level involves inserting the desired measurements into the source code. This strategy is commonly used, as it gives the PAT user great control over the instrumentation process. It is easy to correlate the measurements with the program constructs of the source code, giving the user a better idea of where a problem is occurring. In addition, source code instrumentation is system independent (while being language specific), and thus portable across platforms. Unfortunately, some source level instrumentation will hinder compiler optimization (the instrumented version turns off an optimization that would normally be applied to the original version), and it does not always provide the most accurate measurements, due to other system activities (for example, when timing a function, a context switch can occur while the function executes, so the measured running time can be longer than the actual running time).
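As a simple illustration of manual source-level instrumentation, the sketch below wraps a region of interest with wall-clock timer calls. The timer choice and the region being timed are arbitrary; this is only meant to show the kind of code a user (or a pre-processor) would insert, not a proposed PAT interface.

/* Minimal manual source-level instrumentation: time one code region.
 * gettimeofday() is used only as an example timer; the region is arbitrary. */
#include <stdio.h>
#include <sys/time.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double sum = 0.0;

    double t_start = wall_seconds();        /* inserted by hand */
    for (long i = 1; i <= 10000000L; i++)   /* region of interest */
        sum += 1.0 / (double)i;
    double t_stop = wall_seconds();         /* inserted by hand */

    printf("region took %.6f s (sum = %f)\n", t_stop - t_start, sum);
    return 0;
}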
Source level instrumentation can be done in two ways: manual insertion by the user, or automatic insertion by a pre-processor (we consider the use of an instrumentation language to be manual, since the user still has to do the instrumentation work, although it generally requires less effort). The tradeoff between these two approaches is the same as that of manual vs. automatic instrumentation discussed above.

Figure 6.2 - Levels of instrumentation (adapted from Fig. 1 of [6.1.15])

6.1.5.2 System level instrumentation

Instrumentation at this level (platform specific) is possible through the use of a library or by directly incorporating instrumentation into the compiler. In the library approach, wrapper functions for the routines of interest are created. An example of this is the MPI profiling interface (PMPI), in which the MPI library exports each function under a second, profiling name (e.g., PMPI_Send in addition to MPI_Send). A profiling library can then define instrumented versions of the MPI functions (e.g., MPI_Send) that record measurements and call the corresponding PMPI routines to do the actual work. Instrumentation of these functions is turned on simply by linking the program against the profiling library. This approach is very convenient for users, as they do not need to modify their source code manually, but it is also limited, since only those functions defined in the library are available.

The compiler approach works similarly, except that it is the compiler's job to correctly add the instrumentation code. It is more versatile than the library approach, as it can also instrument constructs other than functions (e.g., a single statement). However, instrumentation can still only be applied to those sets of constructs predefined by the developer. In addition, this approach requires access to the compiler's source code, and the problem of obtaining inaccurate measurements still exists.

In many ways, the library instrumentation approach can be viewed as a variation of automatic source level instrumentation. In both situations, instrumentation code is automatically added based on a predefined set of rules. The only difference is that the library approach uses a library, whereas automatic source level instrumentation uses a pre-processor. Because of this, the two can be grouped together into a single method that uses a pre-processor. We believe this is more appropriate because it enables the incorporation of constructs other than function calls (although the wrapper function approach is easier to implement). Finally, we do not think the compiler approach is viable, because compiler developers will probably not be willing to put in the effort to incorporate the sets of events we deem important.

6.1.5.3 Operating system level instrumentation

OS level instrumentation involves the use of existing system calls to extract information on program behavior. These calls vary from system to system and are quite often limited, making this technique impractical in most instances. An exception is the use of hardware performance counters; a package like PAPI can be considered one of these, though it can also be treated as a library. Because of this, we should not consider this level of instrumentation in our project.
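To make the library wrapper approach of Section 6.1.5.2 concrete, the sketch below shows a minimal PMPI-style wrapper for MPI_Send that records the time spent in the call before handing off to the underlying implementation. It is only a sketch of the general pattern; a real profiling library would cover many routines and aggregate its measurements more carefully.

/* Minimal sketch of a PMPI-style wrapper library: intercept MPI_Send, record
 * the elapsed time, and forward to PMPI_Send. Linking this object ahead of the
 * MPI library enables the instrumentation without changing application code.
 * Note: the MPI-1/2 signature is shown; newer MPI versions declare buf as
 * const void *. */
#include <mpi.h>
#include <stdio.h>

static double send_time = 0.0;   /* total seconds spent in MPI_Send */
static long   send_calls = 0;

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    /* Report before shutting MPI down; a real library would aggregate by rank. */
    printf("MPI_Send: %ld calls, %.6f s total\n", send_calls, send_time);
    return PMPI_Finalize();
}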
6.1.5.4 Binary level instrumentation

The lowest level of instrumentation is at the binary level. Executables specific to a target system are examined, and a predefined set of rules is used to facilitate the automatic insertion of event-measuring code (statically inserting code, or dynamically inserting, removing, and changing code). This approach avoids the problem of obtaining inaccurate measurements, and a single implementation works with many different programming languages. Unfortunately, although it is simple to correlate the instrumentation with low-level events, applying it to more complex events is not trivial. Because of this, it is difficult to associate the instrumentation back to the source code, making it harder for the user to pinpoint the problem. Furthermore, instrumentation must be applied to the executables on all the nodes in the system. If a heterogeneous environment is used to execute the parallel program, the instrumentation done on each machine might not yield the same useful set of measurements. In general, binary instrumentation is done automatically.

6.1.5.5 Level of instrumentation conclusion

Although many tools use one or several of these levels of instrumentation, they are generally used in separate runs. Due to the nature of the factors, we believe that some factors will naturally be best measured at one particular level. As in the case of manual vs. automatic instrumentation, it would be beneficial to categorize factors with regard to instrumentation level as well. In addition, our PAT should use several levels of instrumentation together in a single run (according to this categorization), or at the very least include source and binary level instrumentation.

6.1.6 References

[6.1.1] Andrew W. Appel et al., "Profiling in the Presence of Optimization and Garbage Collection", Nov. 1988.
[6.1.2] Luiz DeRose, Bernd Mohr and Kevin London, "Performance Tools 101: Principles of Experimental Performance Measurement and Analysis", SC2003 Tutorial M-11.
[6.1.3] Luiz DeRose, Bernd Mohr and Seetharami Seelam, "An Implementation of the POMP Performance Monitoring Interface for OpenMP Based on Dynamic Probes".
[6.1.4] Ken Ferschweiler et al., "A Community Databank for Performance Tracefiles".
[6.1.5] Thomas Fahringer and Clovis Seragiotto Junior, "Modeling and Detecting Performance Problems for Distributed and Parallel Programs with JavaPSL", University of Vienna.
[6.1.6] Seon Wook Kim et al., "VGV: Supporting Performance Analysis of Object-Oriented Mixed MPI/OpenMP Parallel Applications".
[6.1.7] Anjo Kolk, Shari Yamaguchi and Jim Viscusi, "Yet Another Performance Profiling Method", Oracle Corporation, June 1999.
[6.1.8] Yong-fong Lee and Barbara G. Ryder, "A Comprehensive Approach to Parallel Data Flow Analysis", Rutgers University.
[6.1.9] Allen D. Malony and Sameer Shende, "Performance Technology for Complex Parallel and Distributed Systems", University of Oregon.
[6.1.10] Bernd Mohr, "Standardization of Event Traces Considered Harmful or Is an Implementation of Object-Independent Event Trace Monitoring and Analysis Systems Possible?", Advances in Parallel Computing, Vol. 6, pp. 103-124, 1993.
[6.1.11] Philip Mucci et al., "Automating the Large-Scale Collection and Analysis of Performance Data on Linux Clusters", University of Tennessee, Knoxville and National Center for Supercomputing Applications.
[6.1.12] Stephan Oepen and John Carroll, "Parser Engineering and Performance Profiling", Natural Language Engineering 6 (1): 81-97, Feb. 2000.
[6.1.13] Daniel A. Reed et al., "Performance Analysis of Parallel Systems: Approaches and Open Problems", University of Illinois-Urbana.
[6.1.14] Lambert Schaelicke, Al Davis and Sally A. McKee, "Profiling I/O Interrupts in Modern Architectures", University of Utah.
[6.1.15] Sameer Shende, "Profiling and Tracing in Linux", University of Oregon.
[6.1.16] Sameer Shende et al., "Portable Profiling and Tracing for Parallel, Scientific Applications using C++", University of Oregon and Los Alamos National Laboratory.
[6.1.17] Sameer Shende, Allen D. Malony and Robert Ansell-Bell, "Instrumentation and Measurement Strategies for Flexible and Portable Empirical Performance Evaluation", University of Oregon.
[6.1.18] Hong-Linh Truong and Thomas Fahringer, "On Utilizing Experiment Data Repository for Performance Analysis of Parallel Applications", University of Vienna.
[6.1.19] Jeffrey Vetter, "Performance Analysis of Distributed Applications using Automatic Classification of Communication Inefficiencies", ACM International Conference on Supercomputing 2000.
[6.1.20] Jurgen Vollmer, "Data Flow Analysis of Parallel Programs", University of Karlsruhe.
[6.1.21] Youfeng Wu, "Efficient Discovery of Regular Stride Patterns in Irregular Programs and Its Use in Compiler Prefetching", Intel Labs.

6.2 Measurement

6.2.1 Performance factor

The ever-increasing desire for high computational power, supported by advances in microprocessor and communications technology as well as in algorithms, has led to the rapid deployment of a wide range of parallel and distributed systems. Many factors affect the performance of these systems, including the processors and hardware architecture, the communication network, the various system software components, and the mapping of user applications and their algorithms to the architecture [6.2.1]. Analyzing performance factors in such systems has proven to be a challenging task that requires innovative performance analysis tools and methods to keep up with the rapid evolution and ever-increasing complexity of these systems. This section first provides a formal definition of the term performance factor, followed by a discussion of what constitutes a good performance factor, including the distinction between means-based and ends-based factors. A three-step approach for determining whether a factor is good is then presented, followed by conclusions.

6.2.1.1 Definition of a performance factor

Before we can begin to design a performance analysis tool, we must determine what things are interesting and useful to measure. The basic characteristics of a parallel computer system that a user of a performance tool typically wants to measure are [6.2.2]: a count of how many times an event occurs, the duration of some time interval, and the size of some parameter. For instance, a user may want to count how many times a processor initiates an I/O request. They may also be interested in how long each of these requests takes. Finally, it is probably also useful to determine the amount of data transmitted and stored. From these types of measured values, a performance analysis tool can derive the actual value that the user wants in order to describe the performance of the system. This value is called a performance factor. If the user is interested specifically in the time, count, or size value measured, we can use that value directly as the performance factor. Often, however, the user is interested in normalizing event counts to a common time basis to provide a speed metric such as instructions executed per second.
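As a small, concrete illustration of deriving such a speed metric from raw measurements, the sketch below turns an event count and an elapsed time into a rate; the event being counted and the workload are arbitrary stand-ins.

/* Deriving a rate-style performance factor from raw measurements:
 * rate = event count / elapsed time. The counted event here (loop iterations)
 * and the workload are arbitrary placeholders. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    long long iterations = 200000000LL;   /* the "event count" */
    volatile double x = 0.0;

    clock_t t0 = clock();
    for (long long i = 0; i < iterations; i++)
        x += 1.0;                         /* one event per iteration */
    double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double rate = (double)iterations / elapsed;   /* events per second */
    printf("%lld events in %.3f s -> %.3g events/s\n",
           iterations, elapsed, rate);
    return 0;
}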
This type of factor is referred to in the literature as a rate factor, or throughput, and is calculated by dividing the count of the number of events that occur in a given interval by the length of the interval over which the events occur. Since a rate factor is normalized to a common time basis, such as seconds, it is useful for comparing measurements made over different time intervals.

Conceptually, the information provided by the PAT should consist of two groups. The first group is the information gathered from low-level system monitoring facilities such as hardware counters. This group provides the raw performance data of the system. The user will be able to use this set of data to gather fine-grained information about the performance issues of the system. This raw information will be most useful to users with extensive experience with parallel systems and may not be as applicable to the novice user. The second group of information will be at a higher level, to provide the user with an overall view of performance. The information in this group will be derived from the first group and may include such information as speedup, parallel efficiency, and other high-level information that users have come to identify as important in diagnosing parallel systems. This conceptual separation of data seems quite natural and may suggest a straightforward approach to the design of the PAT. Many tools currently follow this approach, using a hardware counter library such as PAPI to provide low-level information that is then abstracted into higher-level information more readily understood by the user.

6.2.1.2 Characteristics of a good performance factor

There are many different metrics that have been used to describe the performance of a computer system [6.2.5, 6.2.9, 6.2.11, 6.2.12]. Some of these metrics are commonly used throughout the field, such as MIPS and MFLOPS, whereas others are invented for new situations as they are needed. Although the designers of a tool will primarily be concerned with performance factors that are specific to parallel systems, factors that contribute to sequential performance need to be explored as well. Experience has shown that not all metrics are 'good', in the sense that using a particular metric can sometimes lead to erroneous or misleading conclusions. Consequently, it is useful to understand the characteristics of a 'good' performance metric. Naturally, the designers of a performance tool will be interested in providing the most useful and unambiguous information to the user about the desired performance factors. Understanding the characteristics of a good performance factor will be useful when deciding which of the existing performance metrics to use for a particular situation, and when developing a new performance metric would be more appropriate.

Since many performance factors provide information that may be misleading to the user, the designers of a performance tool must be careful to inform users of the limits of the applicability of a given performance factor. For example, the number of instructions executed may not correspond to the total execution time of a given application. If the user relies solely on the number of instructions executed as a performance factor, the user may end up applying performance optimizations that actually degrade performance rather than improve it.
Although this is a simplistic example, one can envision more complicated scenarios in which the data provided by a performance tool may be misleading or ambiguous, which may lead to the misapplication of performance optimizations and, ultimately, poor performance of the user's application. This would be detrimental to the acceptance and usefulness of the tool regardless of the correctness of the information provided.

Many tools provide the user with the ability to define their own performance factors in order to provide a more customized analysis of their application. For example, Paradyn [6.2.8] provides the user with the Metric Description Language (MDL) [6.2.10] to enable user-defined metrics. However, since not all performance factors are equally applicable, if we decide to allow user-defined metrics in the PAT, we should take measures to ensure the proper use of these user-defined factors. Warning the user of the potential misapplication of any user-defined factor may be sufficient, since this functionality is probably intended for advanced users who are fully aware of the issues involved. However, it is still necessary for us to be aware of the characteristics of a good performance factor, to ensure that at least the default performance factors will be useful to the user.

A performance factor that satisfies all of the following requirements is generally considered 'useful' in the literature, in that it allows accurate and detailed comparisons of different measurements. These criteria have been developed by observing the results of numerous performance analyses over many years [6.2.2]. Although some caution that following these guidelines is not necessarily a recipe for success, it is generally accepted that using a factor that does not satisfy these requirements can often lead to erroneous conclusions [6.2.6].

Reliability - A performance factor is considered to be reliable if system A always outperforms system B when the corresponding values of the factor for both systems indicate that system A should outperform system B, assuming all other factors are the same. A general test of reliability is therefore to check whether better values of the factor consistently correspond to better observed performance. While this requirement would seem to be so obvious as to be unnecessary to state explicitly, several commonly used performance factors do not in fact satisfy it. The MIPS metric, for instance, is notoriously unreliable. Specifically, it is not unusual for one processor to have a higher MIPS rating than another processor while the second processor actually executes a specific program in less time than the processor with the higher value of the metric. Such a factor is essentially useless for summarizing performance, and is unreliable. Thus, if a performance tool provides information about an unreliable factor to the user, it should be only one of a number of factors rather than a single number summarizing the total performance of the application [6.2.3]. It should be noted that unreliable factors need not necessarily be removed from a performance tool; rather, they should be used within the context of a larger, more comprehensive data set.

One of the problems with many of the metrics discussed earlier (such as MIPS) that makes them unreliable is that they measure what was done whether or not it was useful. What makes a performance factor reliable, however, is that it accurately and consistently measures progress towards a goal.
Metrics that measure what was done, useful or not, are referred to in the literature as means-based metrics, whereas ends-based metrics measure what is actually accomplished. For the high-level information provided by a performance tool to the user, ends-based factors will be more appropriate and useful. However, much of the low-level information provided by the tool must necessarily be means-based, such as the information provided by hardware counters. Also, many performance factors, such as network latency, are not adequately described as either ends- or means-based. Nevertheless, an effort should be made on the part of tool designers to provide the user with ends-based performance factors whenever possible. This will help to ensure that the most relevant and reliable data are presented to the user.

Repeatability - A performance factor is repeatable if the same value of the factor is measured each time the same experiment is performed. This also implies that a good metric is deterministic. Although a particular factor may depend on characteristics of a program that are essentially non-deterministic from the point of view of the program, such as user input, a repeatable factor should yield the same results for two identical runs of the application.

Ease of measurement - If a factor is not easy to measure, it is unlikely that anyone will actually use it. Furthermore, the more difficult a factor is to measure directly, or to derive from other measured values, the more likely it is that the factor will be determined incorrectly [6.2.7]. Since a performance tool has a finite development time, the majority of the implementation effort should be spent providing the user with the easier, more useful factors, since efforts to provide factors that are more difficult to obtain may not provide the user with more utility.

Consistency - A consistent performance factor is one for which the units of the factor and its precise definition are the same across different systems and different configurations of the same system. If the units of a factor are not consistent, it is impossible to use the factor to compare the performance of different systems. Although the concepts of reliability and consistency are similar, reliability refers to a factor's ability to predict the relative performance of two systems, whereas consistency refers to a factor's ability to provide the same information about different systems. Since one of the goals of our performance tool is to be portable, the issue of consistency is particularly important. Many users will port their code from one system to another, and a good performance tool should try to provide consistent information for both systems whenever possible. Therefore, it is highly desirable that a performance tool provide semantically similar, if not identical, performance factors across a wide variety of platforms. While the necessity of this characteristic would also seem obvious, it is not satisfied by many popular metrics, such as MIPS and MFLOPS.

Our general strategy for determining whether a performance factor meets these four criteria is as follows. First, on each supported platform, determine whether the factor is measurable and, if so, how easily its value can be obtained. Obviously, if a factor is not measurable on a given platform, it will not be supported for that system, though it may be for other systems. After the factor has been determined to be easy to measure, it will be straightforward to determine whether the factor is repeatable.
The determination of reliability and consistency, however, will require a more involved, three-step approach. First, if the factor can be modified on a real system, it should be tested directly to determine whether it is reliable and consistent, by applying the definitions given above. For example, changing the network of a given system will determine whether network latency and bandwidth are reliable and consistent factors. However, most factors cannot be modified on a real system as easily. Factors such as cache size and associativity will require a different approach, since they cannot be readily changed on a real system. For these, justification from the literature will be needed to determine whether the factor is reliable and consistent. Lastly, if information regarding reliability and consistency is not available in the literature, the information will need to be generated through the use of predictive performance models. If there are no suitable performance models to use, the factor may still be included in the PAT as long as the user is informed of its limitations.

In summary, in order to determine whether a proposed performance factor satisfies the four requirements of a good performance factor, we will perform the following tests.
1. On each platform, determine ease of measurement.
2. Determine repeatability.
3. Determine reliability and consistency by one of the following methods:
(a) Modify the factor using real hardware.
(b) Find justification in the literature.
(c) Derive the information from performance models.

6.2.1.3 Conclusions

In this section, we provided a formal definition of the term performance factor and then presented a discussion of what constitutes a good performance factor. This was followed by a three-step approach to determine which of our proposed factors are good. In order to provide the most useful data to the user of a performance analysis tool, we must ensure that the data presented by the PAT for each performance factor satisfy the constraints of a good factor. Namely, the factor must be reliable, repeatable, easy to measure, and consistent. Whenever possible, a factor should be ends-based in order to avoid ambiguity and reduce misapplication of the tool. Also, there should be mechanisms in place to warn users of the potential pitfalls of using a poor performance factor when a tool supports user-defined factors. Finally, the natural separation between high-level factors and the low-level, measurable factors upon which they are based leads to an intuitive implementation strategy adopted by many parallel performance tools.

6.2.2 Measurement strategies

To be written.

6.2.3 Factor list + experiments

To be written.

6.2.4 References

[6.2.1] L. Margetts, "Parallel Finite Element Analysis", Ph.D. thesis, University of Manchester.
[6.2.2] D. Lilja, "Measuring Computer Performance: A Practitioner's Guide", Cambridge University Press.
[6.2.3] J.E. Smith, "Characterizing Computer Performance with a Single Number", Communications of the ACM, October 1988, pp. 1202-1206.
[6.2.4] L.A. Crowl, "How to Measure, Present, and Compare Parallel Performance", IEEE Parallel and Distributed Technology, Spring 1994, pp. 9-25.
[6.2.5] J.L. Gustafson and Q.O. Snell, "HINT: A New Way to Measure Computer Performance", Hawaii International Conference on System Sciences, 1995.
[6.2.6] R. Jain, "The Art of Computer Systems Performance Analysis", John Wiley and Sons, Inc., 1991.
[6.2.7] A.D. Malony and D.A. Reed, "Performance Measurement Intrusion and Perturbation Analysis", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 4, July 1992.
Reed, “Performance Measurement Intrusion and Perturbation Analysis”, IEEE Transactions on Parallel Distributed Systems, Vol. 3, No. 4, July 1992. [6.2.8] B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall. “The Paradyn Parallel Performance Measurement Tools”, IEEE Computer, Vol. 28, No. 11, November 1995. [6.2.9] J. Hollingsworth, B.P. Miller, “Parallel Program Performance Metrics: A Comparison and Validation”. Proceedings of the 1992 ACM/IEEE conference on Supercomputing, ACM/IEEE. [6.2.10] M. Goncalves, “MDL: A Language and Compiler for Dynamic Program Instrumentation”. [6.2.11] S. Sahni, V. Thanvantri, “Parallel Computing: Performance Metrics and Models”. Research Report, Computer Science Department, University of Florida, 1995. [6.2.12] H. Truong, T. Fahringer, G. Madsen, A.D. Malony, H. Moritsch, and S. Shende. “On Using SCALEA for Performance Analysis of Distributed 92 and Parallel Programs”. In Proceeding of the 9th IEEE/ACM HighPerformance Networking and Computing Conference (SC’2001), Denver, USA, November 2001. 93 6.3 Analysis This is probably one of the later things to do as it involve using a lot of other findings, 6.4 Presentation 6.4.1 Usability Parallel performance tools have been in existence for some time now and a great deal of research effort has been directed at tools for improving the performance of parallel applications. Significant progress has been made, however, parallel tools are still lacking widespread acceptance [6.4.3]. The most common complaints about parallel performance tools concern their lack of usability [6.4.6]. They are criticized for being too hard to learn, too complex to use in most programming tasks, and unsuitable for the size and structure of real world parallel applications. Users are skeptical about how much value tools can really provide in typical program development situations. They are reluctant to invest time in learning to apply tools effectively because they are uncertain that the investment will pay off. A tool is only perceived as valuable when it clearly assists the user to accomplish the tasks to which it is applied. Thus, a tool’s propensities for being misapplied or its failure to support the kinds of tasks users wish to perform constitute serious impediments to tool acceptance. Therefore, it is essential that the PAT be easy to use as well as provide information in such a way as to be useful to the user. This is where usability comes into play. 6.4.1.1 Usability factors There are several dimensions of usability presented in the literature, of which four seem particularly important for parallel performance tools. 6.4.1.1.1 Ease-of-learning 94 The first factor, ease-of-learning, is particularly important for attracting new users. The interface presents the user with an implicit model of the underlying software. This shapes the user's understanding of what can be done with the software, and how. Learning a new piece of software requires that the user discover, or invent, a mapping function from his/her logical understanding of the software's domain, to the implicit model established by the interface [6.4.7, 6.4.19]. Designers often fail to take into account the fact that the interface is really the only view of the software that a user ever sees. Each inconsistency, ambiguity, and omission in the interface model can lead to confusion and misunderstanding during the learning process. 
For example, providing default settings for some objects, but not for all, hinders learning because it forces users to recognize subtle distinctions when they are still having to make assumptions about the larger patterns; a common result is the misinterpretation of what object categories mean or what defaults are for. In fact, any place the interface deviates from what users already know about parallel performance tools, or about any other software with which they are familiar is a likely source of error [6.4.16]. As PAT designers, it should be our goal to create an internally as well as externally consistent tool. In order to ensure internal consistency, we should provide the most uniform interface possible to the user. We should also make an effort to use the established conventions set by other performance analysis tools when a clear precedent has been set. By leveraging the conventions set by other performance analysis tools, we will ensure an externally consistent tool that will be easier for the user to learn. It is also important to recognize that the time a user invests to learn a library or tool will not be warranted unless it can be amortized across many applications of the interface. If the interface is not intuitive or difficult to understand, lack of regular use forces them to re-learn the interface many times over. The short lifespan of HPC machines exacerbates the problem. Parallel programmers will generally end up porting their applications across several machine platforms over 95 the course of time. The investment in learning a software package may not be warranted unless it is supported, and behaves consistently, across multiple platforms. Therefore, a successful PAT will need to be intuitive as well as consistent across platforms in order to establish a strong user-base. 6.4.1.1.2 Ease-of-use Once an interface is familiar to the user, other usability factors begin to dominate. Ease-of-use refers to the amount of attention and effort required to accomplish a specific task using the software. In general, the more a user has to memorize about using the interface, the more effort will be required to apply that remembered knowledge [6.4.11]. This is why mnemonic names, the availability of menus listing operations, and other mechanisms aimed at prodding the user's memory will serve to improve the usability of the PAT. Interface simplicity is equally important, since it allows users to organize their actions in small, meaningful units; complexity forces users to pause and re-consider their logic at frequent intervals, which is detrimental to the user’s experience with the PAT. Ease-of-use also suffers dramatically when features and operations are indirect or "hidden" at other levels of the interface. For example, the need to precede a desired action by some apparently unrelated action forces the user to expend extra effort, both to recognize the dependency, and to memorize the sequencing requirement. Thus for the PAT, operations should be concrete and logical; there should be a clear correlation between an action the user takes and its desired effect. 6.4.1.1.3 Usefulness Where ease-of-use refers to how easy it is to figure out what actions are needed to accomplish some task, usefulness focuses on how directly the software supports the user's own task model. That is, as the user formulates goals and executes a series of actions to reach each goal, how direct is the mapping between what the user wants to do and what they must do within the constraints imposed by the interface? 
If a lengthy sequence of steps must be carried out to 96 accomplish even very common goals such as determining execution times, the usefulness of a performance tool is low. On the other hand, if the most common user tasks are met through simple, direct operations, usefulness will be high (in spite of the fact that long sequences may be required for tasks that occur only rarely). It should be our goal as PAT designers to optimize the usability of the common case whenever possible. Another aspect of usefulness is how easily users can apply the software to new task situations. If the implicit model presented by the interface is clear, it should be possible to infer new uses with a low incidence of error. For instance, if a user has been focused on alleviating the performance problems associated with the network, a similar set of operations should be able to be used to diagnose problems with the memory hierarchy. 6.4.1.1.4 Throughput Since the inherent goal of software is to increase user productivity, throughput is also important. This measure reflects the degree to which the tool contributes to user productivity in general. It includes the efficiency with which the software can be applied to find and correct performance problems, as well as the negative influences exerted by frequent errors and situations where corrections are difficult or time consuming. For performance tools with graphical interfaces or other software start-up costs, throughput also measures the amount of time required to invoke the tool and begin accomplishing useful work. 6.4.1.1.5 Summary It should be clear that all four dimensions mentioned in this section contribute to how quickly and generally users will adopt a software package. It should be equally clear that users are the only ones who will have the insight needed to accurately identify which interface features contribute to usability, and which represent potential sources of confusion or error. The basis for usability lies in 97 how responsive the software interface is to user needs and preferences which is something that can generally only be determined with the help of actual users. 6.4.1.2 User centered design User-centered design is based on the premise that usability will be achieved only if the software design process is user-driven. The tool designer must make a conscious effort to understand the target users, what they will want to accomplish with the tool, and how they will use the tool to accomplish those tasks [6.4.13]. The concept that usability should be the driving factor in software design and implementation is not particularly new; it has appeared in the literature under the guises of usability engineering, participatory design, and iterative design, as well as user-centered design [6.4.1]. There is not yet a firm consensus on what methodology is most appropriate, nor on the frequency with which users should be involved in design decisions [6.4.2]. What is clear is that the tradition of soliciting user feedback only during the very early and very late stages of development is not adequate for assessing and improving usability. While developing the PAT, we should take this fact into consideration and try to incorporate user feedback into the design throughout the development process in order to avoid common mistakes. During early stages, the design will be too amorphous for a user to really assess how the interface structure might enhance or constrain task performance. 
During late stages such as alpha testing, the software structure will be solidified so much that user impact will be largely cosmetic. User involvement and feedback will be needed throughout the design process, since different types of usability problems will be caught and corrected at different points. Moreover, it will be important to work with at least a few individual users on a sustained basis in order to ensure continuity and to gauge the progress of the PAT over time. The introduction of any performance tool does more than replace a sequence of manual operations (such as printf statements) by automated ones. Only by understanding how a user interacts 98 with the interface as they become familiar with it can tool designers understand the real issues affecting usability [6.4.4]. A four-step model for incorporating user-centered design in the development of a parallel performance tool is suggested by [6.4.1]. (1) Ensure that the initial functionality provided by the tool is based on demonstrable user needs. Soliciting input directly from the user community is the best way to identify this information. Specifically, a tool will not be useful unless it facilitates tasks that the user already does and that are time-consuming, tedious, error-prone, or impossible to perform manually. If, instead, design of the PAT is driven by the kind of functionality that we are ready or able to provide, it will miss the mark. Because of the nature of the PAT project, direct user contact may be difficult to obtain. However, the general principles regarding the basic functionality of a parallel performance analysis tool are the same for all parallel languages. Therefore, the information gained from the MPI user community as well as tool developers on this matter will be useful. In order to ensure that the functionality provided by our PAT is what users want, we should contact users from the MPI community to obtain justification and general feedback on our preliminary design. This information will only be helpful up to a point so additional information regarding the UPC/SHMEM specific aspects of the preliminary design should be obtained from any end-users we can find as well as people that can speak on behalf of our end-users; the meta-user. (2) Analyze how users identify and correct performance problems in a wide variety of environments. A preliminary step in the design of the PAT is to study UPC and SHMEM users in their world, where the tool ultimately will be used. The point is to analyze how users go about organizing their efforts to improve the performance of their parallel application. For example, the realization that users write down or sketch out certain information for themselves provides important clues about how visual representations should be structured and how the 99 interface should be documented, as well as indicating the need for additional functionality. Some of this information was gained through our programming practices; however, it will be useful to gain insight from other real-world UPC/SHMEM users. Again because of our limited access to end-users from governmental agencies, we will need to find as many other users in the UPC/SHMEM community as possible such as the graduate students at GWU. From these users, we can gain a better idea of how the tool will actually be used on real programs. The information we gain from these users will then be presented to the meta-users to be presented to end-users for them to critique and provide feedback. 
In this way, we can ensure that we have input from a variety of sources and thus a comprehensive analysis of the preliminary design.

(3) Implement incrementally. Based on information gained from users, we should begin to organize the proposed interface so that the most useful features are the best supported. Prototypes or preliminary implementations can be constructed to show basic interface concepts. This allows user evaluation and feedback to begin long before efforts have been invested in features that will have low usability payoffs. Again, this information will only be obtained by talking with users and applying changes based on their feedback. This will also allow time to gain information about the users' instinctive reactions, one piece at a time. For example, the user might be presented with a few subroutine names and asked to guess what each does or what arguments are required for each. Early reactions might suggest major changes in thrust that will ultimately have repercussions throughout the PAT interface. It will definitely be easier to make these types of changes early in the design process rather than later. Obviously, this type of user interaction will be impossible with some of our users, so it will be key to maintain a strong relationship with the few other users to whom we have access. These users will provide us with information on the small changes that we need to make to the interface and suggest larger, more fundamental changes that will be verified by the meta-users and end-users.

(4) Have users evaluate every aspect of the tool's interface structure and behavior. As the implementation of the PAT progresses, user tests should be performed at many points along the way. This permits feature-by-feature refinement in response to specific sources of user confusion or frustration. Feedback gained from this approach will also provide valuable insight into sources of user error and what might be done to minimize the opportunity for errors or ameliorate their effects. By following this four-step model, the designers of a performance tool can prevent some of the common problems cited by users and presented in the next section.

6.4.1.3 Common problems of usability in PAT

Throughout the literature, several basic problems are cited regarding the usability of parallel performance tools. Some of this data was compiled through case studies, while other data comes from user surveys and general principles from the field of Human Computer Interaction (HCI).

Inconsistency: Lack of symmetry or consistency among elements within a given tool. The most blatant inconsistencies (e.g., spelling, naming of elements, or punctuation) can be caught through careful checking by the tool developers themselves. Nevertheless, there are generally additional inconsistencies that are likely to impede the user's learning of the tool. In many cases, the tool developers may be able to cite practical justifications for the inconsistencies, but with real users they may cause errors and confusion. We need to be aware of this potential and change the interface in order to facilitate the use of the tool.

Ambiguity: The choice of interface names leads to user misinterpretation. For example, as a case study in [6.4.11], users were confused by the fact that both "reduce" and "combine" were routines, where one referred to the operation the users traditionally call reduction or combination (i.e., acquiring values from each of multiple nodes and calculating the sum, minimum, etc.)
and the other was a 101 shortcut referring to that same operation, followed immediately by a broadcasting of the result to all nodes. The users complained that it would be very hard to remember which meaning went with which name. Whenever possible, the PAT should use concrete, clear names for everything. Incompatibility: The interface specification contradicts or overrides accepted patterns of usage established by other tools. Whereas consistency compares elements within an interface, compatibility assesses how well the interface conforms to "normal practice." Since there already exist many parallel performance tools, designers for a new performance tool should conform to established practices if possible. Indirection: One operation must be performed as a preliminary to another. Indirection as described in the literature most often involved some sort of table lookup operation, so that the index (rather than the name assigned by the user) must be supplied as an argument to some other operation. We recommend that user-defined names be used whenever possible throughout the PAT interface. In this way, the user will be spared any confusion regarding naming conventions, thus increasing efficiency. Fragility: Subtleties in syntax or semantics those are likely to result in errors. For example, if all blocking operations involve routines whose names begin with b (such as bsend and brecv) but one routine beginning with that letter (bcast) is non-blocking [6.4.11]. Fragility increases when the errors are essentially undetectable (that is, the software will still work, but results will be incorrect or unexpected). Therefore, it is essential for tool designers to be extremely careful when applying the naming conventions to be used in order to prevent any ambiguity. However, problems regarding fragility should be easy to find and correct early in the design process through a careful review of the interface by the tool designers as well as outside users. 102 Ergonomic Problems: Too many (or too clumsy) physical movements must be performed by the user. In most cases, problems occur because users are forced to do unnecessary or redundant typing. The tool should have measures in place to facilitate the menial tasks associated with using it. For some aspects of the tool’s interface, it may be more appropriate to develop a GUI rather than a command line interface for the user to interact with the tool. However, if the relative simplicity and intuitiveness of a command line interface is desired by users, we recommend the tool have a scrollable command history to prevent ergonomic problems. Problems stemming from inconsistency and incompatibility clearly make it harder for the novice to develop a well-formulated mental model of how the tool operates, and therefore can be said to impede ease-of-learning. Incongruity and ambiguity, on the other hand, are likely to cause problems in remembering how to apply the software, thereby affecting ease-of-use. Fragility has more impact on the user's ability to apply the software efficiently and with few errors, and so, impedes usefulness. Finally, indirection and ergonomic problems are wasteful of the user's movements and necessitate more actions/operations, thereby affecting throughput. By avoiding these common problems, a tool will be able to address the most relevant issues regarding the usability of parallel performance tools. The next section provides strategies that may be used to avoid the common problems presented in this section. 
6.4.1.4 General solutions This section provides an outline as to the types of changes that will be required based on user feedback. This information may be useful to the designers of a performance tool contemplating using a user-centered design approach in order to justify the implementation costs of using such a strategy. The designers may wish to know how the use of a user-centered approach generally affects the implementation time of a performance tool. If the implementation time using a user-centered approach is much greater than without, designers may wish to use 103 an alternative approach. Therefore, this section provides the justification for the use of a user-centered design methodology in terms of implementation time. Solutions to the common issues of usability found in parallel tools presented in the previous section can generally be placed in one of six categories. Superficial change: modification limited to the documentation and/or procedure prototypes (e.g., to change the names of arguments) Trivial syntactic change: modification limited to the name of the operation Syntactic change: modification of the order of arguments or operands Trivial semantic change: modification of the number of arguments or operands Semantic change: relatively minor modification of the operation's meaning Fundamental change: addition of a new feature to the interface and/or major changes in operational semantics However, evidence from case studies suggests that the overwhelming majority of usability issues fall into the categories of superficial or trivial syntactic changes. That is, simple changes in the names used to refer to operations or operands were sufficient to eliminate the problem, from the users' perspective. Only rarely do problems fall into the category of fundamental changes, requiring significant implementation work. Thus, the actual cost of modifying a tool based on user response is generally low. Therefore, since the implementation cost should be low and the benefit to the usability of the tool will be high, as PAT designers, we should feel justified in adopting a user-centered design methodology. Experience has shown that development time for software created with user input is not significantly increased [6.4.5, 6.4.8]. In some cases, development time is actually shortened because features that would have required major implementation effort turn out to be of no real interest to users. Essentially, more developer time is spent earlier in the design cycle, rather than making adjustments once the design is solidified. Generally speaking, the earlier that input is solicited from users, the easier and less expensive it is to make changes 104 particularly semantic or fundamental changes. We should take this information into account while designing the PAT in order to prevent lost time later in development. Finally, the design of software with usability in mind engenders real interest and commitment on the part of users [13, 16]. From their perspective, the developers are making a serious attempt to be responsive to their needs, rather than making half-hearted, cosmetic changes when it's too late to do any good anyway [14, 15]. By involving users in the design process, we may be able to establish a strong user-base, which will be crucial to the success of the PAT project. Also, users may be able to help in ways that go well beyond interface evaluation, such as developing example applications or publicizing the tool among their colleagues. 
6.4.1.5 Conclusions In this section, we provided a discussion on the factors influencing the usability of performance tools followed by an outline of how to incorporate user-centered design into the PAT. We then talked about common problems seen in performance tools followed by solutions. From the preceding sections, it is clear that how users perceive and respond to a performance tool is critical to its success. Creating an elegant, powerful performance tool does not guarantee that it will be accepted by users. Ease-of-learning, ease-of-use, usefulness, and throughput are important indicators of the tool’s usability, and they depend heavily on the interface provided by the tool. It is essential that we capitalize on the observation that users are the ones who are best qualified to determine how the PAT should be used. We need to be sure that the PAT reflects user needs, that the interface organization correlates well with established conventions set by other performance tools, and that interface features and terminology are clear and efficient for users. The best way for the PAT to meet these usability goals is to involve users in the entire 105 development process. The feedback gained from our most accessible users should be presented to our meta-users and end-users for verification of large changes in the interface. This will ensure that a wide variety of users will have supported and verified the changes made to the interface. By using this strategy, we feel that we can create a performance tool that has a high level of usability. 6.4.2 Presentation methodology To be written. 106 6.4.3 References [6.4.1] S.Musil, G.Pigel, M.Tscheligi, “User Centered Monitoring of Parallel Programs with InHouse”, HCI Ancillary Proceedings, 1994. [6.4.2] J. Grudin, “Interactive Systems: Bridging the Gaps between Developers and Users”, IEEE Computer, April 1991, pp. 59-69. [6.4.3] C. Pancake. “Applying Human Factors to the Design of Performance Tools”. [6.4.4] B. Curtis. “Human Factors in Software Development: A Tutorial 2nd Edition”, IEEE Computer Society Press. [6.4.5] B.P. Miller, et al. “The Paradyn Parallel Performance Measurement Tools”. IEEE Computer, 1995. [6.4.6] C. Pancake and C. Cook. “What Users Need in Parallel Tool Support: Survey Results and Analysis” [6.4.7] J. Grudin, “Interactive Systems: Bridging the Gaps between Developers and Users”. [6.4.8] R. Jeffries, J. R. Miller, C. Wharton and K. M. Uyeda, “User Interface Evaluation in the Real World: A Comparison of Four Techniques”. [6.4.9] J. Kuehn, `”NCAR User Perspective”, Proc. 1992 Supercomputing Debugger Workshop, Jan 1993. [6.4.10] J. Whiteside, J. Bennett and K. Holtzblatt, “Usability Engineering: Our Experience and Evolution”, in Handbook of Human-Computer Interaction. [6.4.11] C. Pancake. “Improving the Usability of Numerical Software through User-Centered Design”. [6.4.12] D. Szafron, J. Schaeffer, A Edmonton.” Interpretive Performance Prediction for Parallel Application Development”. [6.4.13] G.V. Wilson, J. Schaeffer, D Szafron, A Edmonton. “Enterprise in Context: Assessing the Usability of Parallel Programming Environments”. [6.4.14] S. MacDonald et al. “From Patterns to Frameworks to Parallel Programs”. 107 [6.4.15] S. Utter. “Enhancing the Usability of Parallel Debuggers”. [6.4.16] C. Pancake, D. Hall. “Can Users Play an Effective Role in Parallel Tools Research”. [6.4.17] T.R.G. Green, M. Petre. “Usability analysis of visual programming environments: a 'cognitive dimensions' framework”. [6.4.18] T.E. Anderson, E.D. Lazowska. 
“Quartz: A Tool for Tuning Parallel Program Performance” [6.4.19] M. Parashar, S. Hariri. ”Interpretive Performance Prediction for Parallel Application Development.” 108 6.5 Optimization 109 6.5.1 Optimization techniques When a programmer does not obtain the desired performance they require from their program codes, they often turn to optimization techniques to enhance their program’s performance. Optimization techniques for programs have been wellresearched and come in many forms. In this section, we examine optimization techniques in the following categories: 1. Sequential compiler optimizations – In this class of optimizations, we study techniques that have been used to enhance the performance of sequential programs. We restrict our study to techniques that are used by sequential compilers. These optimizations are presented in Section 6.5.1.1. 2. Pre-compilation optimizations for parallel codes – These optimizations are specific to parallel programs and are meant to be applied to a program before being compiled. These optimizations deal with high-level issues such as data placement and load balancing. These optimizations are presented in Section 6.5.1.2. 3. Compile-time optimizations for parallel codes – Optimizing compilers play a large role in the performance of codes generated from implicitly parallel languages such as High-Performance Fortran (HPF) and OpenMP. Here we examine techniques used by such compilers. These optimizations are presented in Section 6.5.1.3. 4. Runtime optimizations for parallel codes – Many times optimizations cannot be applied until a program’s runtime for various reasons. These types of optimizations are presented in Section 6.5.1.4. 5. Post-runtime optimizations for parallel codes – The optimizations presented in this category require that a program be executed in some manner before the optimizations can be applied. These optimizations are presented in Section 6.5.1.5. We also present the following information for each optimization category presented: Purpose of optimizations – What do these optimizations attempt to do? 110 Metrics optimizations examine – Which particular metrics or program characteristics are examined by the optimization? The aim of this study is to create a catalog of existing optimizations. We hope that by studying existing optimization techniques, we will have a better understanding of what kind of information needs to be reported to the user of our PAT in order for them to increase the performance of their applications. 6.5.1.1 Sequential compiler optimizations Many types of optimizations are performed by sequential compilers. In today’s superscalar architectures, sequential compilers are expected to extrapolate instruction-level parallelism to keep the hardware busy. A summary of these techniques is presented below, which is mainly taken from an excellent survey paper written by Bacon et al. [6.5.1]. In order for these techniques to be applied, the compiler must have a method of correctly analyzing dependencies of statements in the source code. We do not present the methods of dependency analysis here; instead, we focus on the transformations the compiler performs that increase program performance. While most of these transformations can also be used for parallel codes, there are additional restrictions that must be enforced. Midkiff and Padua [6.5.2] give specific examples of these types of problems, and provide an analysis technique that can be used to ensure the correctness of the transformations. 
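To make the role of dependence analysis concrete, consider the following C fragment (an illustrative example of ours, not taken from [6.5.1] or [6.5.2]); the first loop carries a dependence across iterations and cannot safely be reordered or parallelized, while the second has no loop-carried dependence and can be transformed freely:

    /* Loop-carried (true) dependence: iteration i reads a[i-1], which was
       written by iteration i-1, so the iterations must execute in order. */
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];

    /* No loop-carried dependence: each iteration touches only its own
       elements, so the compiler may reorder, vectorize, or parallelize. */
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];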
6.5.1.1.1 Loop transformations

Purpose of optimizations – To restructure or modify loops to reduce their computational complexity, to increase their parallelism, or to improve their memory access characteristics

Metrics optimizations examine – Computational cost of repeated or unnecessary statements, dependencies between code statements, and memory access characteristics

Loop-based strength reduction – In this method, a compiler substitutes a statement in a loop that depends on the loop index with a less-costly computation. For example, if the variable i is a loop index, the statement

    a[i] = a[i] + c*i;

can be replaced with the pair of statements

    a[i] = a[i] + t;
    t += c;

(where t is a temporary variable initialized to the value of c*i for the first iteration).

Induction variable elimination – Here, instead of using an extra register to keep track of the loop index, the exit condition of the loop is rewritten in terms of a strength-reduced loop index variable. This frees up a register and reduces the work done during the loop. For example, the following code:

    for (i = 0; i < 10; i++)
        a[i]++;

can be reduced by using a register to point to a[0] and stopping iteration after the pointer reaches a[9], instead of incrementing i and recomputing the address of a[i] in every loop iteration.

Loop-invariant code motion – Loop-invariant code motion moves expressions evaluated in a loop that do not change between loop iterations outside that loop. This reduces the computation cost of each iteration in the loop.

Loop unswitching – This technique is related to loop-invariant code motion, except that instead of singular expressions, conditionals that do not change during loop execution are moved outside of the loop. The loop is duplicated as necessary. Assuming x does not depend on the loop below, this optimization changes

    for (i = 0; i < 10; i++) {
        if (x <= 25) {
            a[i]++;
        } else {
            a[i]--;
        }
    }

to

    if (x <= 25) {
        for (i = 0; i < 10; i++)
            a[i]++;
    } else {
        for (i = 0; i < 10; i++)
            a[i]--;
    }

Since the conditional is only evaluated once in the second form, it saves a bit of time compared with the first form.

Loop interchange – Loop interchange swaps the iteration order of nested loops. This method can change

    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++) {
            a[i][j] = sin(i + j);
        }
    }

to

    for (j = 0; j < 10; j++) {
        for (i = 0; i < 10; i++) {
            a[i][j] = sin(i + j);
        }
    }

By interchanging loop orders, it is often possible to create larger independent loops in order to increase parallelism in loops by increasing their granularity. In addition, interchanging loop orders can also increase vectorization of code by creating larger sections of independent code.

Loop skewing – Loop skewing is often employed in wavefront applications that have updates that “ripple” through the arrays. Skewing works by adding a multiplying factor to the outside loop and subtracting as needed throughout the inner code body. This can create a larger degree of parallelism when codes have dependencies between both indices.

Loop reversal – Loop reversal is related to loop skewing. In loop reversal, the loop is iterated in reverse order, which may allow dependencies to be evaluated earlier.

Strip mining – In strip mining, the granularity of an operation is increased by splitting a loop up into larger pieces. The purpose of this operation is to split code up into “strips” that can be converted to vector operations.
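As a rough sketch of strip mining (our own example; the strip length of 64 is arbitrary and would normally be chosen to match the vector length of the target hardware), a single loop over n elements can be rewritten as an outer loop over strips with a short inner loop that is a candidate for vector execution:

    /* Original loop. */
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];

    /* Strip-mined form: the inner loop operates on one strip at a time
       and can be converted to vector operations. */
    for (is = 0; is < n; is += 64) {
        upper = (is + 64 < n) ? (is + 64) : n;
        for (i = is; i < upper; i++)
            a[i] = a[i] + b[i];
    }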
Cycle shrinking – Cycle shrinking is a special case of strip mining where the strip-mined code is transformed into an outer serial loop and an inner loop with instructions that only depend on the current iteration. This allows instructions in the inner loop to be performed in parallel.

Loop tiling – Loop tiling is the multidimensional equivalent of cycle shrinking. It is primarily used to improve data locality, and is accomplished by breaking the iteration space up into equal-sized “tiles.”

Loop distribution – This method breaks up a loop into many smaller loops which have the same iteration space as the original loop, but contain a subset of the instructions performed in the main loop. Loop distribution is employed to reduce the use of registers within a loop and improve instruction cache locality.

Loop fusion – This is the reverse operation of loop distribution, where a series of loops that have the same bounds are fused together. This method reduces loop overhead, can increase spatial locality, and can also improve the load balancing of parallel loops.

Loop unrolling – Loop unrolling replicates the body of a loop several times in an effort to reduce the overhead of the loop bookkeeping code. In addition, loop unrolling can increase instruction-level parallelism since more instructions are available which do not have loop-carried dependencies.

Software pipelining – This method breaks a loop up into several stages and pipelines the execution of these stages. For example, the loop for (i = 0; i < 1000; i++) a[i]++; can be broken into three pipeline stages: a load stage, an increment stage, and a store stage. The instructions are executed in a pipelined fashion such that the load for iteration 3 is executed, followed by the increment for iteration 2, and then the store for iteration 1. This helps reduce the overhead of loop-carried dependencies by allowing multiple iterations of the loop to be executed simultaneously.

Loop coalescing – This method combines a nested loop into one large loop, using indices computed from the outermost loop variable. The expressions generated by loop coalescing are often reduced so that the overhead of the original nested loops is reduced.

Loop collapsing – Loop collapsing can be used to reduce the cost of iterating over multidimensional variables stored in a contiguous memory space. This method collapses loops iterating over contiguous memory spaces by controlling iteration with a single loop and substituting expressions for the other indices.

Loop peeling – Loop peeling removes a small number of iterations from the beginning or end of a loop. It is used to eliminate loop dependencies caused by dependent instructions in the beginning or end of a loop, and is also used to enable loop fusion.

Loop normalization – Loop normalization converts all loops so that the looping index starts at zero or one and is incremented by one each iteration. This is used to bring the loop into a standard form, which some dependency analysis methods require.

Loop spreading – This method takes code from one serial loop and inserts it into another loop so that both loops can be executed in parallel.

Reduction recognition – In reduction recognition, the compiler recognizes reduction operations (operations that are performed on each element in an array) and transforms them to be vectorized or executed in parallel. This works well for or, and, min, and max operations since they are fully associative and can be fully parallelized.
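For instance, a compiler performing reduction recognition could identify the accumulations below (a simple sketch of ours) as sum and max reductions; the only cross-iteration dependence is on the accumulator itself, so partial results can be computed in parallel and then combined:

    /* Sum reduction. */
    sum = 0.0;
    for (i = 0; i < n; i++)
        sum += a[i];

    /* Max reduction; min, and, and or follow the same pattern and are
       fully associative. */
    m = a[0];
    for (i = 1; i < n; i++)
        if (a[i] > m)
            m = a[i];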
Loop idiom recognition – In this optimization, the compiler recognizes a loop that has certain properties which allow the use of special instructions on the hardware. For example, some architectures have special reduction instructions that can be used on a range of memory locations.

6.5.1.1.2 Memory access transformations

Purpose of optimizations – To reduce the cost of memory operations, or to restructure a program to reduce the number of memory accesses it performs

Metrics optimizations examine – Memory access characteristics

Array padding – This is used to pad elements in an array so they fit nicely into cache lines. Array padding can increase the effectiveness of caches by aligning memory access locations to sizes well-supported by the architecture.

Scalar expansion – Scalar expansion is used in vector compilers to eliminate antidependencies. Compiler-generated temporary loop variables are converted to arrays so that each loop iteration has a “private” copy.

Array contraction – This method reduces the space needed for compiler-generated temporary variables. Instead of using n-dimensional temporary arrays for loops nested n times, a single temporary array is used.

Scalar replacement – Here, a commonly-used memory location is “pegged” into a register to prevent repeated loads from the same memory location. It is useful when an inner loop reads the same variable repeatedly and a new value is written back to that location after the inner loop has completed.

Code co-location – In code co-location, commonly-used code is replicated throughout a program. Code co-location is meant to increase instruction cache locality by keeping related code closer together in memory.

Displacement minimization – In most architectures, a branch is specified as an offset from the current location, and a finite number of bits is used to specify the value of the offset. This means that the range of a branch is often limited, and performing branches to larger offsets requires additional instructions and/or overhead. Displacement minimization aims to minimize the number of long branches by restructuring the code to keep branches closer together.

6.5.1.1.3 Reduction transformations

Purpose of optimizations – To reduce the cost of statements in the code by eliminating duplicated work or transforming individual statements into equivalent statements of lesser cost

Metrics optimizations examine – Computation time and number of registers used

Constant propagation – In constant propagation, constants are propagated throughout a program, allowing the code to be analyzed more effectively. For example, constants can affect loop iteration counts or cause some conditional branches to always evaluate to a specific outcome.

Constant folding – Here, operations performed on constant values are evaluated at compile time and substituted into the code, instead of being evaluated at run time. For example, a compiler may replace all instances of 3 + 2 with 5 in order to save a few instructions throughout the program.

Copy propagation – In this method, multiple copies of a variable are eliminated such that all references are changed back to the original variable. This reduces the number of registers used.

Forward substitution – This generalization of copy propagation substitutes a variable with its defining expression as appropriate. Forward substitution can change the dependencies between statements, which can increase the amount of parallelism in the code.
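A small sketch of how several of these reduction transformations compose (our illustration, not an example from [6.5.1]; the variable names are placeholders):

    n = 100;           /* constant propagation: every later use of n is 100 */
    m = n * 4;         /* constant folding: m becomes the constant 400      */
    t = b[i] * 2;
    c[i] = t + 1;      /* forward substitution: c[i] = b[i] * 2 + 1         */
    d[i] = t + 2;      /* forward substitution: d[i] = b[i] * 2 + 2         */

After forward substitution, the last two statements no longer share the temporary t and can be scheduled or parallelized independently; if t is not used elsewhere, it becomes dead and can be removed.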
Algebraic simplification & strength reduction – Some mathematical expressions may be substituted with other, less costly ones. For example, x^2 can be replaced with x*x, and x*2 can be replaced by x+x. A compiler can exploit this during compilation to reduce some algebraic expressions to cheaper equivalents.

Unreachable code elimination – In this method, the compiler eliminates code that can never be reached. This method may also create more opportunities for constant propagation; if a compiler knows a section of code will never be reached, then a variable which previously was modified there may be turned into a constant if no other code modifies its value.

Dead-variable elimination – Variables that are declared but whose values are never used end up wasting a register. Dead-variable elimination removes these variables and can free up registers.

Common subexpression elimination – Here, common subexpressions which are evaluated multiple times are coalesced and evaluated only once. This can speed up code at the cost of using another register.

Short-circuiting – Some boolean expressions can be made faster by creating code that short-circuits after one expression evaluates to a certain value. For example, in a compound and statement, if a single expression evaluates to false, the rest of the expressions can be ignored since the whole expression can never be true. In general, short-circuiting can also affect a program's correctness if the programmer makes an assumption about whether a statement in a boolean expression is always evaluated or not, but most languages (such as C) use short-circuiting to speed up code.

6.5.1.1.4 Function transformations

Purpose of optimizations – To reduce the overhead of function calls

Metrics optimizations examine – Computation time and function call overhead

Leaf function optimization – Some hardware architectures use a specific register to store the location of the return address in a function call. If a function never calls another function, this register is never modified, and the instructions that save the value of the return address at the beginning of the function and restore the original return address back into the register after the function is done can be eliminated.

Cross-call register allocation – If a compiler knows which registers are used by a calling function and which functions this particular function calls, it may be able to use temporary registers efficiently between function calls. For example, if function A calls function B, but function B only uses registers R10 through R12 and function A only uses registers R5 through R9, then function B does not have to save those registers on its call stack.

Parameter promotion – In functions that take call-by-reference arguments, it is much faster to allow a function to take in arguments from registers instead of storing them to memory and reloading them from within the function. Parameter promotion uses registers to pass arguments and can lead to frame collapsing for function calls (where no values are placed on the stack at all).

Function inlining – Function inlining attempts to reduce the overhead of a function call by replicating code instead of using a branch statement. This increases register contention, but eliminates the overhead of a function call.

Function cloning – This method creates specialized versions of functions based on the values obtained by constant propagation.
The specialized versions of the functions can take advantage of algebraic and strength reductions to make them faster than the general versions. Loop pushing – Loop pushing moves a loop that calls a function several times inside a cloned version of that function. It allows the cost of the function call to be paid only once, instead of several times as in the original loop. This also may increase the parallelism within the loop itself. Tail recursion elimination – This method reduces the function call costs associated with recursive programs by exploiting properties of tail-recursive functions. Tail-recursive functions compute values as the recursion depth increases, and once the recursion stops the value is returned. This returned value propagates up the call stack until it is finally returned to the caller. Tail recursion elimination converts a tail-recursive function into an iterative one, thus reducing the overhead of several function calls. Function memoization – Function memoization is a technique often used with dynamic programming. In this technique, values from function calls which have no side effects on the program’s execution are cached before they are returned. If the function is called with the same arguments, the cached value is returned instead of being computed over again. 6.5.1.2 Pre-compilation optimizations for parallel codes In this section, we present optimization strategies that are meant to be used during the coding phase of an application. The methods here are general enough to apply to many programming paradigms, although some are difficult to use and require specialized skills in combinatorial mathematics. 6.5.1.2.1 Tiling Purpose of optimizations – Automatically parallelize sequential code that can be sectioned into atomic tiles 119 Metrics optimizations examine – Rough approximations of communication and computation; some tile placement strategies also consider load balancing Tiling is a general strategy for decomposing the work of programs that use nested for loops. Tiling groups iterations within those nested for loops into “tiles” which can be independently executed. These tiles are then assigned to processors, and any communication necessary to satisfy dependencies is also scheduled. For tiling mappings of programs that have inherent loop-carried dependencies, a systolic architecture is used to exploit the maximum amount of parallelism possible. In general, tiling may use any type of “tile shape,” which controls which elements are grouped into which tiles. The term shape is used, since the iteration space is visualized as an n-dimensional space broken into different tiles which can be visualized into shapes based on the decomposition used. A technique based on iteration space tiling is presented by Andonov and Rajopadhye in [6.5.3]. Iteration space tiling is used to vary the size of the tiles for a given tile mapping in an effort to find the optimal tile size based on a simple systolic communication and computation model. In the technique presented by Andonov and Rajopadhy, the systolic model is relaxed by providing a general (but still highly-abstracted) way to describe the communication and computation that occurs within a tile. It is assumed that the optimal tile shape has already been found through another method. 
Once a relaxed systolic model for the problem has been created, it is combined with the general communication and computation functions and turned into a two-variable optimization problem whose free variables are the tile period (computation time) and inter-tile latency (communication time). The authors then present a complicated algorithm that solves this optimization problem, which gives the optimal tile size to be used with the chosen tiling shape.

Goumas et al. present a method for automatically generating parallel code for nested loops in [6.5.4] that is based on tiling. In their method, the sequential code is analyzed and an optimal tiling is chosen, which is then parallelized. Since the body of each tile can be executed independently provided the prerequisite computations have already been completed and sent to this processor, the parallelization step is very straightforward. The authors then test their parallelization method on some small computation kernels, and find that their method is able to perform the tiling quickly and give reasonable performance.

Lee and Kedem present another data and computation decomposition strategy that is based on tiling in [6.5.5]. In their method, they first determine the optimal data distribution to be used for the code that is to be parallelized, and then decide how to map the tiles to a group of processors to minimize communication. Their method does not make any assumptions about tile sizes or shapes, but is extremely complicated.

6.5.1.2.2 ADADs

Purpose of optimizations – Optimally parallelize DO loops in Fortran D

Metrics optimizations examine – Data placement of array elements and communication caused by loop dependencies

Hovland and Ni propose a method for parallelizing Fortran D code in [6.5.6] which uses augmented data access descriptors. Augmented data access descriptors (ADADs) provide compact representations of the dependencies inherent in loops. Traditionally, compilers rely on pessimistic heuristics to determine whether a loop can be fully parallelized or not. Instead of directly analyzing dependencies between code statements within a loop, ADADs represent how sections of a program can influence each other. ADADs also allow dependencies to be determined by performing simple union and intersection operations. Armed with the ADADs, Hovland and Ni compute data placement for the array elements modified in the loop based on a heuristic. In addition, they show that they are able to apply loop transformations such as loop unrolling and loop fusion directly on their ADADs, which may help a parallelizing compiler choose which optimizations to apply.

6.5.1.3 Compile-time optimizations for parallel codes

In this section, we present optimizations that can be performed during the compilation of a program. In many cases, the methods described here may also be performed prior to the compilation of the code; however, because these methods can be automated, they are incorporated into the compilation process. Therefore, the methods presented here generally have low-cost implementations (in terms of memory and CPU usage).

6.5.1.3.1 General compile-time strategies

Purpose of optimizations – To eliminate unnecessary communication, or reduce the cost of necessary communication between processors

Metrics optimizations examine – Dependency transformations which allow privatization of scalars or arrays, overhead of small message sizes vs.
large message sizes, cache line size for shared-memory systems, dependencies between computation and communication statements (for latency hiding) The summary provided by Bacon et al. [6.5.1] also contains a short overview of general compile-time optimizations performed by parallel compilers. Since they are general strategies, we group them together here. Scalar privatization – This is the parallel generalization of the “scalar expansion” presented in Section 6.5.1.1.2, in which temporary variables used during execution of a loop are kept local instead of shared across all processors evaluating that loop. This decreases the amount of unnecessary communication between threads, and is only needed in parallelizing compilers for implicitly parallel languages (e.g., HPF). Array privatization – Certain parallel codes operate on arrays that are written once (or statically initialized) at the beginning of a program’s execution and are only read from after that. Distributing these arrays to every processor (if necessary) reduces a program’s overall execution significantly, since much unnecessary communication is eliminated. Cache alignment – On shared-memory and distributed shared-memory machines, aligning objects to cache sizes can reduce false sharing of two unrelated elements that get placed on the same page by padding them to keep them in separate cache lines. This is analogous to the “array padding” mentioned in Section 6.5.1.1.2. Message vectorization, coalescing, & aggregation – In these methods, data is grouped together before being sent to another processor. Since the overhead of 122 sending a large amount of small messages usually dominates the overall communication cost, using larger messages whenever possible allows for more efficient communication. Message pipelining – Message pipelining is another name for the overlapping of communication and computation, which can allow a machine to “hide” much of the latency that communication normally incurs. 6.5.1.3.2 PARADIGM compiler Purpose of optimizations – Reduce communication overhead and increase parallelism in automatically parallelized Fortran77 and HPF source code Metrics optimizations examine – Communication and load balancing PARADIGM is a compiler developed by Banerjee et al. that automatically parallelizes Fortran77 or HPF source code [6.5.7]. The authors describe their compilation system as one that uses a “unified approach.” PARADIGM uses an abstract model of a multidimensional mesh of processors in order to determine the best data distribution for the source code fed into it. The parallelizer uses task graphs internally to schedule tasks to processors, and supports regular computations through static scheduling and irregular computations through a runtime system. The PARADIGM compiler can also generate code that overlaps communication with computation. An earlier version of PARADIGM was only able to determine data partitionings for HPF code [6.5.8], but later versions were also able to parallelize the code and apply optimizations to the parallelized code. To perform automatic data partitioning, a communication and computation cost model is created for specific architectures, which is controlled via architectural parameters. The data partitioning happens in several phases: a phase to align array dimensions to a dimensions in the abstract processor mesh, a pass to determine if a block or cyclic layout distribution should be used, a block size selection pass, and a mesh configuration pass where data is mapped to a mesh. 
PARADIGM also uses a few optimization techniques such as message coalescing, message vectorization, message aggregation, and coarse-grained pipelining to maximize the amount of parallelism and minimize the communication costs. A short description of these optimizations is given below:

Message coalescing – Eliminates redundant messages that reference, at different times, data that has not been modified

Message vectorization – Vectorizes nonlocal accesses within a loop iteration into one larger message

Message aggregation – Messages with identical source/destination pairs are merged together into a single message (may also merge vectorized messages)

Coarse-grained pipelining – Overlaps loop iterations to increase parallelism in loops which cannot be parallelized due to dependencies between iterations of the loop

6.5.1.3.3 McKinley's algorithm

Purpose of optimizations – Optimize regular Fortran code to increase parallelism and exploit spatial and temporal locality on shared-memory machines

Metrics optimizations examine – Spatial and temporal locality (memory performance), parallel granularity, loop overhead (for loop transformations)

In [6.5.9], McKinley describes an algorithm for optimizing and parallelizing Fortran code which is meant to be used with “dusty deck” codes. McKinley argues that the only way to get decent performance out of sequential programs not written with parallelism in mind is to combine algorithmic parallelization and compiler optimizations. McKinley's algorithm is designed for use with shared-memory multiprocessor systems. The algorithm considers many aspects of the program and many possible optimizations, and is divided into four steps: an optimization step, a fusing step, a parallelization step, and an enabling step. The optimization step uses loop permutation and tiling (which incorporates strip mining) on a single loop nest to exploit data locality and parallelism. The fusing step performs loop fusion and loop distribution to increase the granularity of parallelism across multiple loop nests. The parallelization step combines the results from the optimization and fusing steps to optimize single loop nests within procedures. Finally, the enabling step uses interprocedural analysis and transformations to optimize loop nests containing function calls and spanning function calls by applying loop embedding, loop extraction, and procedure cloning as needed. The algorithm uses a memory model to determine the placement and ordering of loop iterations for maximum parallelism and locality. A simple loop tiling algorithm chooses the largest tile size possible given a number of processors, which enhances the previous optimizations that try to maximize the spatial locality of a loop. The authors used these optimization techniques on code obtained from scientists working at Argonne National Laboratory. The speedups they obtained surpassed the speedup obtained with hand-coded parallel routines in some cases, and matched or closely matched it in all others, which implies that the strategies and optimizations used in their compiler worked as intended.

6.5.1.3.4 ASTI compiler

Purpose of optimizations – Ease parallelization of existing sequential code by extending a sequential compiler with parallel transformations and “loop outlining”

Metrics optimizations examine – Work distribution, memory contention, loop overheads

ASTI is a sequential compiler that has been extended by IBM to parallelize codes for SMP machines [6.5.10].
6.5.1.3.4 ASTI compiler

Purpose of optimizations – Ease parallelization of existing sequential code by extending a sequential compiler with parallel transformations and "loop outlining"

Metrics optimizations examine – Work distribution, memory contention, loop overheads

ASTI is a sequential compiler that has been extended by IBM to parallelize codes for SMP machines [6.5.10]. The compiler uses many high-level sequential code optimizations such as loop interchange and loop unrolling, as well as general techniques for estimating cache and TLB access costs. The extended version of ASTI also uses loop coalescing and iteration reordering to enhance parallelization. In addition, the compiler uses a method known as function outlining, which simplifies storage management for threads by transforming sections of code into function calls; storage management is simplified because, within a function call, each thread gets a separate copy of the local variables allocated in its own stack frame. The compiler also uses a parallel runtime library that employs dynamic self-scheduling for load balancing. The "chunk sizes" the runtime library uses to distribute work can be changed at runtime, and the library also lets the user choose between busy waits and sleeps when a thread is blocked waiting for data. The busy-wait method can decrease performance because it increases memory contention, while the sleep method avoids that contention but incurs higher wake-up latency. The authors tested their ASTI compiler on a four-processor machine and found that it generated positive speedups, although the parallel efficiency of their implementation left much to be desired.

6.5.1.3.5 dHPF compiler

Purpose of optimizations – Reduce communication between processors and keep work distribution even

Metrics optimizations examine – Communication cost vs. computation cost, cost of small vs. large messages, dependencies between statements to allow for overlapping of communication and computation

dHPF is a High Performance Fortran compiler that automatically parallelizes HPF code [6.5.11]. dHPF uses the owner-computes rule to guide its computation partitioning model. For a given loop, the compiler chooses a data partitioning that minimizes communication costs by evaluating possible data mappings with a simple communication model (based on the number of remote references). The compiler also includes several optimizations, including message vectorization, message coalescing, message combining, and loop-splitting transformations that enable communication to be overlapped with computation. The authors found that while these optimizations worked well on simple kernels and benchmarks, more optimizations were needed for more realistic codes such as the NAS benchmarks. Therefore, the dHPF compiler also incorporates other optimizations to increase its ability to parallelize the NAS benchmarks. The additional optimizations needed for the NAS benchmarks are listed below:

Array privatization – Same technique as discussed in Section 6.5.1.3.1

Partial replication of computation – In order to decrease communication costs, code sections can be marked with a LOCALIZE directive, which indicates that the statements will be performed by all processors instead of having one processor evaluate them and all other processors read the results

With these optimizations in place, the authors were able to achieve competitive performance on the NAS benchmarks.

6.5.1.4 Runtime optimizations for parallel codes

In this section, we present optimization methods that are not employed until runtime. In general, it is more efficient to perform optimizations during development or at compile time. However, irregular code, nondeterministic code, and code that cannot be statically analyzed cannot be optimized at compile time and must be handled at runtime, so it is useful to study these techniques.
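As a concrete example of code that defeats static analysis, consider an indirection-based update loop (a hypothetical fragment of our own): the compiler cannot determine which iterations touch remote data, or whether two iterations conflict, until the index array is known at runtime.

    #include <stdio.h>

    /* Irregular update typical of unstructured-mesh and sparse solvers: the access
       pattern of x depends on the contents of edge[], which is only known at runtime,
       so the compiler cannot partition the iterations or vectorize the resulting
       communication for this loop at compile time. */
    static void scatter_add(double *x, const double *contrib, const int *edge, int nedges) {
        for (int e = 0; e < nedges; e++)
            x[edge[e]] += contrib[e];
    }

    int main(void) {
        double x[4] = { 0.0, 0.0, 0.0, 0.0 };
        double contrib[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
        int edge[5] = { 2, 0, 2, 3, 1 };   /* runtime data: unknown to the compiler */
        scatter_add(x, contrib, edge, 5);
        printf("x = %.1f %.1f %.1f %.1f\n", x[0], x[1], x[2], x[3]);
        return 0;
    }

Loops of this form are exactly the target of the inspector/executor scheme described next.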
6.5.1.4.1 Inspector/executor scheme

Purpose of optimizations – Perform runtime load balancing for irregular applications by generating a schedule based on runtime information (which may be generated in parallel) and executing that schedule

Metrics optimizations examine – Data access patterns during runtime

Saltz et al. invented a general method for runtime parallelization by splitting the problem into two pieces: computing the parallel iteration schedule and then executing the computed schedule [6.5.12]. They call their method the inspector/executor scheme, where the job of the inspector is to generate an optimal (or near-optimal) iteration schedule and the job of the executor is to carry out the execution of that schedule. The general idea is straightforward but can be very effective for irregular programs, although it only works for problems whose data access patterns can be predicted before a loop is executed (which covers many explicit unstructured-mesh and multigrid solvers, along with many sparse iterative linear system solvers). The authors later implemented a general library for their method under the name PARTI [6.5.13].

The overhead of generating the schedule at runtime can be very high. This prompted Leung and Zahorjan to parallelize the inspection algorithm itself in order to speed up generation of the schedule and thus gain parallel efficiency [6.5.14]. They provided two parallel variations of the original inspection algorithm: one that assigns sections of code to be examined by each processor and merges those sections using the serial inspection algorithm, and another that "bootstraps" itself by applying the inspection algorithm to itself. The authors tested their method and found that the sectioning variant worked best in most cases.

6.5.1.4.2 Nikolopoulos' method

Purpose of optimizations – To redistribute work based on load-balancing information obtained after a few initial probing iterations of an irregular application code

Metrics optimizations examine – Number of floating-point operations performed per processor, data placement

Nikolopoulos et al. present another method for runtime load balancing of irregular applications, one that works with the unmodified OpenMP API, in [6.5.15]. Their method uses a few probing iterations in which information about the state of each processor is obtained through instrumented library calls. After these probing iterations are done, the collected metrics are analyzed, and data and loop migration are performed to minimize data communication and maximize parallelism. The algorithm used to reallocate iterations is very simple; it is based on a greedy method that redistributes work from overloaded processors to less busy processors. The authors tested their method and found that they achieved performance within 33% of a hand-coded MPI version.

6.5.1.5 Post-runtime optimizations for parallel codes

In this section, we explore methods for optimizing a program's performance based on tracing data obtained during a program's run. Many optimization strategies in this category can be described as using the "measure-modify" approach, in which optimization is performed iteratively: the user makes an improvement and reruns the application in the hope that the new measurements show an improvement.
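The measure-modify cycle often begins with nothing more elaborate than wall-clock timing around a suspect region; a minimal C sketch using standard POSIX timing (our own illustration, not part of any particular tool) is shown below:

    #include <stdio.h>
    #include <sys/time.h>

    /* Return wall-clock time in seconds. */
    static double wtime(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Placeholder for the region being tuned. */
    static void suspect_region(void) { /* ... computation and communication ... */ }

    int main(void) {
        double t0 = wtime();
        suspect_region();
        double t1 = wtime();
        /* The user records this number, changes one parameter (block size,
           thread count, message size, ...), reruns, and compares. */
        printf("suspect region took %.6f s\n", t1 - t0);
        return 0;
    }

Each of the techniques in this section tries to replace, or at least guide, this manual loop of measuring, changing one thing, and rerunning.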
6.5.1.5.1 Ad-hoc methods

Purpose of optimizations – To optimize performance using trial-and-error methods

Metrics optimizations examine – Varies

A large number of optimization methods currently fall under the "ad-hoc" category, including the methods used with most performance analysis tools. The defining characteristic of these methods is that they rely on providing the user with detailed information gathered during the program's execution, with the expectation that this information will enable the user to optimize their code. Usually this involves changing one parameter of the program at a time and rerunning to see the impact of the change, although analytical and simulative models may be used to gain insight into which code changes the user should try. Aside from improving the algorithm used by a program, this class of optimizations usually yields the most speedup. However, it is heavily dependent on the skill (and sometimes luck) of the user.

6.5.1.5.2 PARADISE

Purpose of optimizations – To automate the optimization process by analyzing trace files generated during runtime

Metrics optimizations examine – Communication overhead and load balancing

Krishnan and Kale present a general method for automating trace file analysis in [6.5.16]. The idea is based on the general ad-hoc method, except that the improvements the user implements after each run of the program are suggested by the PARADISE (PARallel programming ADvISEr) system. In addition, the improvements suggested by PARADISE can be implemented automatically, without user intervention, through a "hints file" given to the runtime library used in the system. PARADISE works with the Charm++ language, a variant of C++ that has distributed objects. Because of this, some of the optimizations it deals with (method invocation scheduling and object placement) are of a different nature than those used in traditional SPMD-style programs. However, the system does incorporate a few well-known techniques for improving communication: message pipelining and message aggregation.

6.5.1.5.3 KAPPA-PI

Purpose of optimizations – To suggest optimizations based on trace files generated during runtime

Metrics optimizations examine – Information extracted from trace files (time between send/receive pairs, time taken for barrier synchronizations, send and receive communication patterns)

KAPPA-PI, a tool written by Espinosa as part of his PhD research [6.5.17], aims to provide the user with suggestions for improving their program. KAPPA-PI is a knowledge-based program that classifies performance problems using a rule-based system. A catalog of common performance problems is associated with a set of rules; when a rule evaluates to true, the corresponding problem is presented to the user and correlated with the source code location at which it occurs. The performance problems used in the tool are categorized as blocked sender, multiple output (which represents extra serialization caused by many messages being sent), excessive barrier synchronization time, a master/worker scheme that generates idle cycles, and load imbalance. Based on the detected problems, the program also has a method for producing recommendations to the user on how to fix them. KAPPA-PI is an interesting technique, but it is aimed squarely at the novice user. As such, it would not be helpful for users who understand the system better than the tool does.
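To give a flavor of how such a rule base operates, the fragment below (a deliberately simplified sketch of our own, not KAPPA-PI's actual rules or trace format) classifies a communication event as a "blocked sender" when the matching receive is posted long after the send:

    #include <stdio.h>

    /* Simplified trace record; real trace formats carry much more information. */
    struct comm_event {
        int    src, dst;     /* sender and receiver thread ids              */
        double send_time;    /* time the send was issued                    */
        double recv_time;    /* time the matching receive was posted        */
        int    src_line;     /* source line of the send, for correlation    */
    };

    /* Hypothetical rule: flag the event if the sender waited longer than a
       threshold for its receiver, i.e., a "blocked sender" pattern. */
    static int is_blocked_sender(const struct comm_event *e, double threshold) {
        return (e->recv_time - e->send_time) > threshold;
    }

    int main(void) {
        struct comm_event e = { 0, 1, 1.00, 1.75, 42 };
        if (is_blocked_sender(&e, 0.5))
            printf("blocked sender: thread %d waited %.2f s for thread %d (line %d)\n",
                   e.src, e.recv_time - e.send_time, e.dst, e.src_line);
        return 0;
    }

A real rule base would combine many such predicates and rank the reported problems by their estimated cost.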
6.5.1.6 Conclusions

In this section we have presented many different optimization techniques that are applied at different stages of a program's lifetime. The most common metrics examined by the optimizations we presented are loop overhead, data locality (spatial and temporal), parallel granularity, load balancing, communication overhead of small messages, communication overhead of messages that can be eliminated or merged, and placement strategies for shared data. The metrics relevant to our PAT are the communication, load-balancing, and cache characteristics of programs. Nearly all of the optimization techniques presented here can be applied based on the information provided by these three main categories of program information; if we make sure to include them in our PAT, we enable programmers to use optimization techniques that are already developed.

One interesting phenomenon we have observed is the lack of optimizations for parallel programs in commercial compilers. Arch Robison, a programmer with KAI Software (a division of Intel), attributes this to the fact that compiler optimizations are a small part of the larger system a customer expects when purchasing a compiler [6.5.18]. In addition to the compiler itself, customers also expect libraries, support, and other tools to be provided with their purchase. Therefore, optimizations must compete with other aspects of a compiler package for attention from customers. Because no "one size fits all" optimizations have been discovered that drastically improve performance for all types of parallel applications, their limited applicability makes it harder to integrate them into a commercial product. To remedy this situation, Robison advocates separating optimizations from the compiler, so that third-party developers may develop optimizations for specific situations as appropriate.

6.5.2 Performance bottleneck identification

To be written.

6.5.3 References

[6.5.1] D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler transformations for high-performance computing," ACM Computing Surveys, vol. 26, no. 4, pp. 345–420, 1994.

[6.5.2] S. Midkiff and D. Padua, "Issues in the compile-time optimization of parallel programs," in 19th Int'l Conf. on Parallel Processing, August 1990.

[6.5.3] R. Andonov and S. Rajopadhye, "Optimal Tiling of Two-dimensional Uniform Recurrences," Tech. Rep. 97-01, LIMAV, Université de Valenciennes, Le Mont Houy - B.P. 311, 59304 Valenciennes Cedex, France, January 1997. (Submitted to JPDC; supersedes "Optimal Tiling," IRISA PI-792, January 1994. Part of these results were presented at CONPAR 94 - VAPP VI, Lecture Notes in Computer Science 854, pp. 701–712, Springer-Verlag, 1994.)

[6.5.4] G. Goumas, N. Drosinos, M. Athanasaki, and N. Koziris, "Automatic parallel code generation for tiled nested loops," in SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 1412–1419, ACM Press, 2004.

[6.5.5] P. Lee and Z. M. Kedem, "Automatic data and computation decomposition on distributed memory parallel computers," ACM Trans. Program. Lang. Syst., vol. 24, no. 1, pp. 1–50, 2002.

[6.5.6] P. D. Hovland and L. M. Ni, "A model for automatic data partitioning," Tech. Rep. MSU-CPS-ACS-73, Department of Computer Science, Michigan State University, October 1992.

[6.5.7] P. Banerjee, J. A. Chandy, M. Gupta, E. W. H. IV, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su, "The PARADIGM compiler for distributed-memory multicomputers," Computer, vol. 28, no. 10, pp.
37–47, 1995.

[6.5.8] M. Gupta and P. Banerjee, "PARADIGM: A compiler for automatic data distribution on multicomputers," in International Conference on Supercomputing, pp. 87–96, 1993.

[6.5.9] K. S. McKinley, "A compiler optimization algorithm for shared-memory multiprocessors," IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 8, pp. 769–787, 1998.

[6.5.10] J.-H. Chow, L. E. Lyon, and V. Sarkar, "Automatic parallelization for symmetric shared-memory multiprocessors," in CASCON '96: Proceedings of the 1996 Conference of the Centre for Advanced Studies on Collaborative Research, p. 5, IBM Press, 1996.

[6.5.11] V. Adve, G. Jin, J. Mellor-Crummey, and Q. Yi, "High performance Fortran compilation techniques for parallelizing scientific codes," in Supercomputing '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM), pp. 1–23, IEEE Computer Society, 1998.

[6.5.12] J. Saltz, H. Berryman, and J. Wu, "Multiprocessors and runtime compilation," in Proc. International Workshop on Compilers for Parallel Computers, 1990.

[6.5.13] A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy, and K. Crowley, "PARTI primitives for unstructured and block structured problems," Computing Systems in Engineering, vol. 3, no. 1–4, pp. 73–86, 1992.

[6.5.14] S.-T. Leung and J. Zahorjan, "Improving the performance of runtime parallelization," in PPOPP '93: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 83–91, ACM Press, 1993.

[6.5.15] D. S. Nikolopoulos, C. D. Polychronopoulos, and E. Ayguadé, "Scaling irregular parallel codes with minimal programming effort," in Supercomputing '01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), pp. 16–16, ACM Press, 2001.

[6.5.16] S. Krishnan and L. V. Kale, "Automating parallel runtime optimizations using post-mortem analysis," in ICS '96: Proceedings of the 10th International Conference on Supercomputing, pp. 221–228, ACM Press, 1996.

[6.5.17] A. E. Morales, Automatic Performance Analysis of Parallel Programs. PhD thesis, Departament d'Informàtica, Universitat Autònoma de Barcelona, 2000.

[6.5.18] A. D. Robison, "Impact of economics on compiler optimization," in JGI '01: Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande, pp. 1–10, ACM Press, 2001.

7 Language analysis

To be written.

8 Tool design

To be written. (This section might not have enough information on its own, and the implemented portion is probably covered under tool evaluation as well.)

9 Tool evaluation strategies

Before users can decide which PAT best fits their needs, a set of guidelines needs to be established so that the available tools can be compared in a systematic fashion. In addition, as new PAT developers, we need a method for evaluating existing tools so we can decide whether a new tool is needed. In this section, features that should be considered are introduced, along with a brief description of why each is important. An Importance Rating, which indicates the relative importance of a feature compared to the others, is then applied to each feature. The possible values are minor (nice to have), average (should include), and critical (needed). Finally, features are categorized according to their influence on usability, productivity, portability, and miscellaneous aspects.

9.1 Pre-execution issues

This section of the report deals with features that are generally considered before obtaining the software.
9.1.1 Cost

Commercial PATs, given their business nature, are generally more appealing to users because they often provide familiar features (that is, they try to appeal to users by giving them what they want). However, users often need to search for alternative tools simply because they cannot afford the commercial product. Even worse, they may decide not to use any PAT at all if what they can afford does not fulfill their needs. Because of this, any useful tool evaluation needs to take cost into consideration. Generally, the user will want the minimum-cost product that accommodates most of their needs; there is often a tradeoff between optional desirable features and their additional cost.

Importance Rating: Average. By itself, cost is of average importance. However, it should be evaluated in conjunction with the PAT features the user considers desirable. Together, these two aspects become critical in determining which tool to deploy.

Category: Miscellaneous.

9.1.2 Installation

Ease of installation is another feature users look at when deciding which tool to use. Sometimes a tool that is suitable in terms of cost and features will not be used because the user cannot install it. Due to modularization, many tools require the installation of multiple components to enable the complete set of features the tool can provide. Furthermore, in order to support features provided by other systems (other tools, or packages such as a visualization kit), these components must be installed separately and may require multiple components of their own. As more and more features are added to a given PAT, it seems inevitable that more and more components need to be installed, since they are often developed by different groups. With productivity in mind, a tool should try to incorporate as many useful components as possible, but it should also simplify the installation process as much as possible. The use of environment-detection scripts with few options is desirable. With parallel languages such as UPC and SHMEM in mind, a tool should also be able to automatically install required components on all nodes of the target system. Finally, it is best to minimize the number of separate installations required, perhaps by having a master script that installs all desired components (for example, by bundling only the versions of external components that work with the tool version and installing them as part of the tool).

Importance Rating: Minor. Because the installation process is usually done only once, this feature is not as important as the others.

Category: Usability.

9.1.3 Software support (libraries/compilers)

A good tool needs to support the libraries and compilers that the user intends to use. If this support is limited, the user is restricted to using the tool with only a few particular system configurations (i.e., only on machines with a certain set of libraries installed). This greatly hinders the usability of the tool and can also limit productivity. However, support should be limited to a selective set of key libraries, as too much library support can introduce unnecessary overhead. Tool developers should decide on a set of core libraries (from the user's point of view) and support all implementations of them. A tool, however, should try to support all available compilers for the language.

Importance Rating: Critical. Without the necessary software support, a tool is virtually useless.

Category: Usability, productivity.
9.1.4 Hardware support (platform)

Another important aspect of environmental support is hardware support. Again, a good tool needs to work well on the platforms of interest without significant effort from the user when moving from platform to platform. With longer software development times (due to larger applications) and rapid improvements in hardware technology, this feature is becoming increasingly important: software development can sometimes outlast the hardware on which it was originally developed. Because of this need to port from one generation of a system to the next, it is important for the tool to support all of these systems equally well. However, it is not necessary (though it is desirable) to support every platform on which the program can run; the additional development time should be balanced against the breadth of platform support.

Importance Rating: Critical. A tool must support the most widely used current machines and support future ones; a core set needs to be identified.

Category: Usability, portability.

9.1.5 Heterogeneity support

Support for heterogeneity is related to both software and hardware support. This feature deals with using the tool on a single application that runs simultaneously on nodes with different software and/or hardware configurations. With an increasing desire to run applications in heterogeneous environments, tool developers should take this into consideration.

Importance Rating: Minor. This feature is not very important at this time, especially since UPC and SHMEM implementations still do not support heterogeneous environments well.

Category: Miscellaneous.

9.1.6 Learning curve

Many powerful systems (both software and hardware) never become widely accepted because they are too difficult to learn. Users are able to recognize the usefulness of the features these systems provide but are unable to exploit them efficiently. The same principle applies to PATs: users will not use a PAT that has many desirable features but is difficult to learn, because the benefit gained from the tool is not worth the learning effort. Given that desirable features are key to a productive PAT, a good approach is to provide a basic set of features that most users appreciate and that requires little learning time, and to make the other features optional. This way, a novice user can quickly see the benefit of using the PAT, while a more advanced user can enjoy the more sophisticated features.

Importance Rating: Critical. No matter how powerful a tool is, it is useless if it is too difficult to use.

Category: Usability, productivity.

9.2 Execution-time issues

This section covers issues that arise while the tool is in use. The issues are broken into subsections corresponding to the five stages of experimental performance modeling mentioned in Section 6.

9.2.1 Stage 1: Instrumentation

9.2.1.1 Manual instrumentation overhead

How much effort the user needs to put into running the PAT is a big determinant of how often they will use the tool (assuming they decide to use it at all). It is possible that using a sophisticated PAT is not as effective as simply using printf statements to pinpoint performance bottlenecks. If a tool requires too much manual instrumentation, the user might consider the effort not worthwhile when they can obtain similarly useful information by inserting printf statements. It is up to the tool developers to minimize the amount of manual overhead while maximizing the benefit the tool provides.
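To illustrate the burden being weighed here, the following shows the kind of hand-inserted timing probe a user would otherwise have to scatter through the code (a hypothetical sketch of our own in plain C, not any particular tool's API):

    #include <stdio.h>
    #include <sys/time.h>

    static double wtime(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Hand-inserted probes: every region of interest must be bracketed manually,
       which quickly becomes tedious in a large parallel code. */
    #define PROBE_START(name) double probe_##name = wtime()
    #define PROBE_STOP(name)  printf("region %-12s: %.6f s\n", #name, wtime() - probe_##name)

    int main(void) {
        PROBE_START(init);
        /* ... initialization ... */
        PROBE_STOP(init);

        PROBE_START(solve);
        /* ... main computation and communication ... */
        PROBE_STOP(solve);
        return 0;
    }

Every region of interest must be bracketed by hand and kept consistent as the code evolves, which is precisely the overhead a PAT should absorb.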
For this reason, an ideal tool should perform automatic instrumentation as much as possible and allow manual instrumentation for extendibility (see Section 6.1.3).

Importance Rating: Average. This is important when introducing a tool to new users but becomes less critical as users grow accustomed to the PAT. Advanced users are often willing to put in extra effort to obtain useful information.

Category: Usability, productivity.

9.2.1.2 Profiling/tracing support

It is highly recommended that a tool support tracing, since profiling data can be extracted from tracing data (see Section 6.1.2). The tracing technique, trace file format, and mechanisms for turning tracing on and off are important because they directly affect the amount of storage needed and the perturbation caused by the tool (i.e., the effect of instrumentation on the correctness of the program's behavior). An ideal tool should use strategies that minimize the storage requirement while gathering all necessary information. These strategies should not affect the original program's behavior and should be compatible with other popular tools. Finally, the performance overhead issue needs to be considered (see the profiling/tracing sub-report for a detailed discussion).

Importance Rating: Critical. This choice significantly affects how useful the PAT is, for the reasons mentioned.

Category: Productivity, portability, scalability.

9.2.1.3 Available metrics

Core to the PAT are the metrics (events) it is able to track. Intuitively, this set should cover all the key aspects of the software and hardware system that are critical to performance analysis (details will be covered in the sub-report on important factors for UPC/SHMEM).

Importance Rating: Critical.

Category: Productivity.

9.2.2 Stage 2: Measurement issues

9.2.2.1 Measurement accuracy

Measurements must be taken so that they represent the exact behavior of the original program. An ideal tool needs to ensure that its measuring strategy provides accurate information in the most efficient manner under the various software and hardware environments (the set of measurements could be the same or different between systems).

Importance Rating: Critical. Accurate event data is vital to a PAT's usefulness.

Category: Productivity, portability.

9.2.2.2 Interoperability

Once multiple tools are available for a particular language, it is a good strategy for a tool to store its event data in a format that is portable to other tools, since people may be accustomed to the way a particular tool presents its data. Furthermore, it saves time to use components developed elsewhere, as this avoids reinventing the wheel. This helps the acceptability of the tool and can sometimes reduce development cost.

Importance Rating: Average. Since no existing tool supports UPC and SHMEM, there is no need to consider this beyond storing the data in a way that will be portable in the future. Also, since it is highly likely that we will end up using one tool as the major backbone for our PAT, this is a minor issue. However, it may be worth using a format that is compatible with visualization packages, because GUIs are time-consuming to produce.

Category: Portability.
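As a small example of what storing event data in a portable way might mean in practice (our own sketch; real trace formats are considerably richer), a self-describing record with a fixed, documented field order already goes a long way:

    #include <stdio.h>

    /* Minimal self-describing event record: one line of text per event, with a
       fixed field order, so that other tools (or a future GUI) can parse it easily. */
    struct trace_event {
        double timestamp;    /* seconds since program start        */
        int    thread;       /* UPC thread / SHMEM PE id           */
        const char *kind;    /* e.g., "barrier", "put", "get"      */
        long   bytes;        /* payload size, 0 if not applicable  */
    };

    static void emit_event(FILE *f, const struct trace_event *e) {
        fprintf(f, "%.9f %d %s %ld\n", e->timestamp, e->thread, e->kind, e->bytes);
    }

    int main(void) {
        struct trace_event e = { 0.001234567, 3, "put", 4096 };
        emit_event(stdout, &e);
        return 0;
    }

A plain-text schema like this trades compactness for portability; a production tool would more likely adopt an established binary trace format, but the same principle of a documented, stable layout applies.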
9.2.3 Stage 3: Analysis issues

9.2.3.1 Filtering and aggregation

Assuming that most users apply PATs to long-running programs (a likely assumption, since little benefit can be gained from improving short-running programs unless they are part of a larger application), it is important for a tool to provide filtering abilities (which also apply to systems with many nodes). Filtering removes excess event information that can get in the way of performance bottleneck identification by flushing out the large number of "normal" events that provide no insight into identifying bottlenecks. Aggregation is used to organize existing data and produce new, meaningful event data (commonly from one event, although aggregating two or more related events is also useful). These aggregate events lead to a better understanding of program behavior because they provide a higher-level view of it. An ideal tool should provide various degrees of filtering to help identify abnormalities depending on the problem space and system size. In addition, aggregation should be applied to filtered data to better match the user's needs (filtering followed by aggregation is faster than the reverse and can provide the same information).

Importance Rating: Critical. Event data needs to be organized to be meaningful for bottleneck detection.

Category: Productivity, scalability.

9.2.3.2 Multiple analyses

There are often multiple ways to analyze a common problem. For example, several groups have proposed different ways to identify synchronization overhead, yet there is no consensus as to which of these methods is more useful than the others. Because of this, it is helpful for a tool to support all of the useful methods so the user can switch between them based on their preference.

Importance Rating: Average. Although this feature is nice to have, trying to provide too much can significantly impact the usability of the tool. It is perhaps best to select a few major issues and then provide a couple of views for each.

Category: Usability.

9.2.4 Stage 4: Presentation issues

9.2.4.1 Multiple views

Ideally, a tool should have multiple levels of presentation (text, simple graphics). It should also provide a few different presentation methods for some of its displays. A zooming capability (i.e., displaying data for a particular range) is also desirable.

Importance Rating: Critical. This is the only stage relating to what the user sees, and what the user sees completely determines how useful the tool will be.

Category: Usability, productivity.

9.2.4.2 Source code correlation

It is important to correlate the presentation of performance data with the source code. This is vital in facilitating the task of performance bottleneck identification.

Importance Rating: Critical.

Category: Usability, productivity.

9.2.5 Stage 5: Optimization issues

9.2.5.1 Performance bottleneck identification

It is always beneficial if the tool can identify performance bottlenecks and provide suggestions on how to fix them. However, it should avoid false positives.

Importance Rating: Minor to average. This is a nice feature to have but not critical. However, the identification part should probably be investigated, as it is related to dynamic instrumentation.

Category: Productivity.
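A very simple form of such identification (a sketch of our own, not a recommendation of any particular heuristic) is to flag any routine whose share of total measured time exceeds a threshold:

    #include <stdio.h>

    struct routine_profile {
        const char *name;
        double seconds;     /* inclusive time measured for this routine */
    };

    /* Flag routines that consume more than 'share' of the total measured time.
       Choosing the threshold too low is exactly the false-positive risk noted above. */
    static void flag_bottlenecks(const struct routine_profile *p, int n, double share) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += p[i].seconds;
        for (int i = 0; i < n; i++)
            if (total > 0.0 && p[i].seconds / total > share)
                printf("possible bottleneck: %s (%.1f%% of total time)\n",
                       p[i].name, 100.0 * p[i].seconds / total);
    }

    int main(void) {
        struct routine_profile prof[] = {
            { "exchange_halo", 12.4 }, { "compute_flux", 3.1 }, { "reduce", 0.6 }
        };
        flag_bottlenecks(prof, 3, 0.5);
        return 0;
    }

Setting the threshold too low produces exactly the false positives warned about above; more sophisticated approaches compare measured behavior against a model of expected behavior.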
9.2.6 Response time

Another important issue, applicable to the entire tool-utilization phase, is the response time of the tool. A tool should provide useful information back to the user as quickly as possible; a tool that takes too long to provide feedback will deter the user from using it. In general, it is best to provide partial useful information as it becomes available and to update that information periodically. This issue is related to the profiling/tracing technique and to performance bottleneck identification.

Importance Rating: Minor to average. As long as the response time is not too terrible, it should be fine.

Category: Productivity.

9.3 Other issues

This section covers issues that do not fit well into the other two phases.

9.3.1 Extendibility

An important factor for our project, though generally less important to the user, is how easily the tool can be extended to support new languages or new metrics. This is of great importance because we need to weigh the existing capabilities of tools against the required development effort in order to decide whether it is better to design a tool from scratch or to build on existing tool(s).

Importance Rating: Critical. An ideal tool should not require a significant amount of effort to extend.

Category: Miscellaneous.

9.3.2 Documentation quality

A tool should provide clear documentation on its design, how it can be used, and how best to use it to learn its features. Good documentation often determines whether the tool will ever be used (as discussed in the installation section). For tool developers such as ourselves, however, this is not as important.

Importance Rating: Minor.

Category: Miscellaneous.

9.3.3 System stability

A tool should not crash too often. This is difficult to quantify, and what constitutes an acceptable crash rate is somewhat arbitrary.

Importance Rating: Average.

Category: Usability, productivity.

9.3.4 Technical support

Tool developers should be responsive to the user; if the user can get a response regarding the tool within a few days, the support is acceptable. The tool should also provide clear error messages to the user.

Importance Rating: Minor to average. As long as the tool itself is easy to use, there is little need for this. It is more important for us, however, because we will definitely need developer support if we decide to build on an existing tool. Tool developers are generally willing to work with others to extend their tools, though.

Category: Usability.

9.3.5 Multiple executions

Since we are dealing with parallel programs that involve multiple processing nodes, it is sometimes beneficial to compare the performance of the same program running on different numbers of nodes (e.g., performance on 2 nodes vs. 4 nodes). This helps identify a program's scalability trend.

Importance Rating: Minor to average, depending on what factors we deem important (if program scalability is important to show, this is of average importance; otherwise it need not be considered).

Category: Productivity.

9.3.6 Searching

Another helpful feature is the ability to search the gathered performance data using some criteria. This is helpful when a user wishes to see a particular piece of event data.

Importance Rating: Minor. Searching is a nice feature to have, but implementing it probably isn't worth the effort for the prototype version of our PAT.

Category: Productivity.

Table 9.1 - Tool evaluation summary table
* To use this table, fill out all the information for all the features (plus any other comments). Then, based on that information, provide a rating of 1-5 (5 being the best) for how good the tool is. Some explanation of why you chose that rating is also helpful.
Feature (section) | Information to gather | Categories | Importance Rating
Available metrics (9.2.1.3) | Metrics it can provide (function, hw, ...) | Productivity | Critical
Cost (9.1.1) | How much | Miscellaneous | Average
Documentation quality (9.3.2) | Clear document? Helpful document? | Miscellaneous | Minor
Extendibility (9.3.1) | 1. Estimate of how easy it is to extend to UPC/SHMEM 2. How easy is it to add new metrics | Miscellaneous | Critical
Filtering and aggregation (9.2.3.1) | Does it provide filtering? Aggregation? | Productivity, Scalability | Critical
Hardware support (9.1.4) | Platform support | Usability, Portability | Critical
Heterogeneity support (9.1.5) | Support running in a heterogeneous environment? | Miscellaneous | Minor
Installation (9.1.2) | 1. How to get the software 2. How hard to install the software 3. Components needed 4. Estimated number of hours needed for installation | Usability | Minor
Interoperability (9.2.2.2) | List of other tools that can be used with this, and to what degree | Portability | Average
Learning curve (9.1.6) | Estimated learning time for basic set of features and complete set of features | Usability, Productivity | Critical
Manual overhead (9.2.1.1) | 1. Method for manual instrumentation (source code, instrumentation language, etc.) 2. Automatic instrumentation support | Usability, Productivity | Average
Measurement accuracy (9.2.2.1) | Evaluation of the measuring method (probably difficult to do; might leave until the measuring method report is done) | Productivity, Portability | Critical
Multiple analyses (9.2.3.2) | Provide multiple analyses? Useful analyses? | Usability | Average
Multiple executions (9.3.5) | Support multiple executions? | Productivity | Minor to average
Multiple views (9.2.4.1) | Provide multiple views? Intuitive views? | Usability, Productivity | Critical
Performance bottleneck identification (9.2.5.1) | Support automatic bottleneck identification? How? | Productivity | Minor to average
Profiling/tracing support (9.2.1.2) | 1. Profiling? Tracing? 2. Trace format 3. Trace strategy 4. Mechanism for turning tracing on and off | Productivity, Portability, Scalability | Critical
Response time (9.2.6) | How long does it take to get back useful information | Productivity | Average
Searching (9.3.6) | Support data searching? | Productivity | Minor
Software support (9.1.3) | 1. Libraries it supports 2. Languages it supports | Usability, Productivity | Critical
Source code correlation (9.2.4.2) | Able to correlate performance data to source code? | Usability, Productivity | Critical
System stability (9.3.3) | Crash rate | Usability, Productivity | Average
Technical support (9.3.4) | 1. Time to get a response from developer 2. Quality/usefulness of system messages | Usability | Minor to average
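The rating procedure described in the note above can also be made mechanical; the sketch below (our own illustration, with made-up feature names, weights, and scores) turns per-feature 1-5 scores into a single overall rating by weighting each feature by its Importance Rating:

    #include <stdio.h>

    enum importance { MINOR = 1, AVERAGE = 2, CRITICAL = 3 };

    struct feature_score {
        const char *name;
        enum importance weight;   /* from the Importance Rating column      */
        int score;                /* evaluator's 1-5 score for this feature */
    };

    /* Importance-weighted average of the per-feature scores (1 = worst, 5 = best). */
    static double overall_rating(const struct feature_score *f, int n) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            num += (double)f[i].weight * f[i].score;
            den += (double)f[i].weight;
        }
        return den > 0.0 ? num / den : 0.0;
    }

    int main(void) {
        struct feature_score tool[] = {
            { "Available metrics",     CRITICAL, 4 },
            { "Learning curve",        CRITICAL, 3 },
            { "Documentation quality", MINOR,    5 },
        };
        printf("overall rating: %.1f / 5\n", overall_rating(tool, 3));
        return 0;
    }

Such a weighted score is only a starting point; the written justification called for above remains the more valuable part of the evaluation.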
10 Tool evaluations

We will either keep this material in PowerPoint format or convert it to Word.

11 Conclusion

To be written.