UPC/SHMEM PAT Report – First Iteration
January 4th, 2005
UPC Group
University of Florida
Abstract
Due to the complex nature of parallel and distributed computing systems and
applications, the optimization of UPC programs can be a significant challenge
without proper tools for performance analysis. The UPC group at the University
of Florida is investigating key concepts and developing a comprehensive high-level design for a performance analysis tool (PAT) or suite of tools that will
directly support analysis and optimization of UPC programs, with an emphasis on
usability and productivity, on a variety of HPC platforms. This report details the
approach we are taking to design our performance tool, and any pertinent
information we have obtained related to the design or functionality of
performance tools in general.
Table of Contents
1 INTRODUCTION ................................................................................................................................ 6
2 APPROACH ......................................................................................................................................... 7
3 PROGRAMMING PRACTICES .......................................................................................................... 8
3.1 ALGORITHM DESCRIPTIONS .......................................................................................................... 8
3.1.1 Differential cryptanalysis for the CAMEL cipher ................................................................... 8
3.1.2 Mod 2n inverse - NSA benchmark 9 ........................................................................................ 9
3.1.3 Convolution ........................................................................................................................... 10
3.1.4 Concurrent wave equation ..................................................................................................... 11
3.1.5 Depth-first search (DFS) ....................................................................................................... 12
3.2 CODE OVERVIEW ......................................................................................................................... 12
3.2.1 CAMEL .................................................................................................................................. 12
3.2.2 Mod 2n inverse ....................................................................................................................... 15
3.2.3 Convolution ........................................................................................................................... 18
3.2.4 Concurrent wave equation ..................................................................................................... 19
3.2.5 Depth-first search .................................................................................................................. 20
3.3 ANALYSIS ................................................................................................................................... 21
3.3.1 CAMEL (MPI and UPC) ........................................................................................................ 21
3.3.2 Mod 2n inverse (C, MPI, UPC, and SHMEM) ....................................................................... 22
3.3.3 Convolution (C, MPI, UPC, and SHMEM) ............................................................................ 26
3.3.4 Concurrent wave equation (C and UPC) ............................................................................... 29
3.3.5 Depth-first search (C and UPC) ............................................................................................ 30
3.4 CONCLUSIONS ............................................................................................................................. 31
3.5 REFERENCES ............................................................................................................................... 33
4 PERFORMANCE TOOL STRATEGIES............................................................................................34
5 ANALYTICAL PERFORMANCE MODELING ...............................................................................36
5.1 FORMAL PERFORMANCE MODELS ................................................................................................ 37
5.1.1 Petri nets ................................................................................................................................ 38
5.1.2 Process algebras .................................................................................................................... 39
5.1.3 Queuing theory ...................................................................................................................... 39
5.1.4 PAMELA ................................................................................................................................ 40
5.2 GENERAL ANALYTICAL PERFORMANCE MODELS ......................................................................... 40
5.2.1 PRAM .................................................................................................................................... 41
5.2.2 BSP ........................................................................................................................................ 42
5.2.3 LogP ...................................................................................................................................... 45
5.2.4 Other techniques .................................................................................................................... 46
5.3 PREDICTIVE PERFORMANCE MODELS ........................................................................................... 48
5.3.1 Lost cycles analysis ............................................................................................................... 48
5.3.2 Adve’s deterministic task graph analysis ............................................................................... 50
5.3.3 Simon and Wierum’s task graphs .......................................................................................... 52
5.3.4 ESP ........................................................................................................................................ 54
5.3.5 VFCS ...................................................................................................................................... 55
5.3.6 PACE ..................................................................................................................................... 57
5.3.7 Convolution method ............................................................................................................... 58
5.3.8 Other techniques .................................................................................................................... 60
5.4 CONCLUSION AND RECOMMENDATIONS ...................................................................................... 62
5.5 REFERENCES ............................................................................................................................... 67
6 EXPERIMENTAL PERFORMANCE MEASUREMENT .................................................................73
6.1 INSTRUMENTATION ..................................................................................................................... 74
6.1.1 Instrumentation overhead ...................................................................................................... 75
6.1.2 Profiling and tracing ............................................................................................................. 75
6.1.3 Manual vs. automatic ............................................................................................................ 77
6.1.4 Number of passes ................................................................................................................... 77
6.1.5 Levels of instrumentation ....................................................................................................... 78
6.1.6 References .............................................................................................................................. 82
6.2 MEASUREMENT ........................................................................................................................... 84
6.2.1 Performance factor ................................................................................................................ 84
6.2.2 Measurement strategies ......................................................................................................... 91
6.2.3 Factor List + experiments ..................................................................................................... 91
6.2.4 References .............................................................................................................................. 92
6.3 ANALYSIS ................................................................................................................................... 94
6.4 PRESENTATION ............................................................................................................................ 94
6.4.1 Usability ................................................................................................................................ 94
6.4.2 Presentation methodology ................................................................................................... 106
6.4.3 References ............................................................................................................................ 107
6.5 OPTIMIZATION .......................................................................................................................... 109
6.5.1 Optimization techniques ...................................................................................................... 110
6.5.2 Performance bottleneck identification ................................................................................. 131
6.5.3 References ............................................................................................................................ 131
7 LANGUAGE ANALYSIS .................................................................................................................134
8 TOOL DESIGN .................................................................................................................................135
9 TOOL EVALUATION STRATEGIES .............................................................................................136
9.1 PRE-EXECUTION ISSUES ............................................................................................................. 136
9.1.1 Cost ...................................................................................................................................... 136
9.1.2 Installation ........................................................................................................................... 137
9.1.3 Software support (libraries/compilers) ................................................................................ 137
9.1.4 Hardware support (platform) .............................................................................................. 138
9.1.5 Heterogeneity support .......................................................................................................... 138
9.1.6 Learning curve ..................................................................................................................... 139
9.2 EXECUTION-TIME ISSUES ........................................................................................................... 139
9.2.1 Stage 1: Instrumentation ...................................................................................................... 139
9.2.2 Stage 2: measurement issues ............................................................................................... 141
9.2.3 Stage 3: analysis issues ........................................................................................................ 141
9.2.4 Stage 4: presentation issues ................................................................................................. 142
9.2.5 Stage 5: optimization issues ................................................................................................. 143
9.2.6 Response time ...................................................................................................................... 143
9.3 OTHER ISSUES ........................................................................................................................... 144
9.3.1 Extendibility ......................................................................................................................... 144
9.3.2 Documentation quality ......................................................................................................... 144
9.3.3 System stability .................................................................................................................... 144
9.3.4 Technical support ................................................................................................................ 145
9.3.5 Multiple executions .............................................................................................................. 145
9.3.6 Searching ............................................................................................................................. 145
9.4 REFERENCES ............................................................................................................................. 149
10 TOOL EVALUATIONS ....................................................................................................................150
11 CONCLUSION ..................................................................................................................................151
1 Introduction
To be written.
2 Approach
To be written
(Hybrid approach, borrow from whitepaper + new info + new strategies on tool
framework/approach)
3 Programming practices
To effectively research and develop a useful PAT for UPC and SHMEM, it is
necessary to understand the various aspects of the languages and their
supporting environment. To accomplish this goal, we coded several commonly used algorithms in sequential C. After writing the sequential versions, we
created parallel versions of the same algorithms using MPI, UPC, or SHMEM.
We ran the parallel versions on our available hardware, and compared any
performance differences between our different implementations. In addition, for
a few of the algorithms we tried UPC-specific hand optimizations in an attempt to
gather a list of possible techniques that can be used to improve UPC program
performance.
This section summarizes our experiences writing these codes and outlines the problems we encountered while writing and testing them. The rest of this section is structured as follows. Section 3.1 contains brief descriptions of the algorithms used. Section 3.2 gives an overview of the coding process used for each algorithm. In Section 3.3, the performance results are shown and analyzed. Finally, Section 3.4 gives the conclusions we drew from these programming practices.
3.1 Algorithm descriptions
In this section, we present overviews of the different algorithms implemented.
3.1.1 Differential cryptanalysis for the CAMEL cipher
CAMEL, or Chris And Matt’s Encryption aLgorithm, is an encryption algorithm
developed by Matt Murphy and Chris Conger of the HCS lab. This algorithm was
created as a test case while studying the effects of hardware changes on the performance of cryptanalysis programs running on high-performance computing hardware.
The algorithm is based on the S-DES cipher [3.1]. An overview of the cipher
function is shown in Figure 3.1.
A sequential C program was written by members of the HCS lab that performed a
differential cryptanalysis on the algorithm. The program first encrypted a block of
text using a user-specified key, and then performed a differential attack on the
text using only the S-boxes used by the algorithm and the encrypted text itself.
The original sequential version of the code contained about 600 lines of C code.
The algorithm used during the differential attack phase constituted most of the
program’s overall execution time, and so we decided it would be beneficial to
create a parallel version of the differential attack and see how much speedup we
could obtain.
For this algorithm, we implemented parallel versions in UPC and MPI. Since the size of the
sequential code is fairly large and the nature of data flow in the program is not
trivial, we felt it would give us an excellent vehicle to compare UPC and MPI.
Figure 3.1 - CAMEL cipher diagram
3.1.2 Mod 2n inverse - NSA benchmark 9
While searching for other algorithms to implement, we examined the NSA
benchmark suite and decided it would be worthwhile to implement the Mod 2n
inverse benchmark. This benchmark was selected because it is a very
bandwidth- and memory-intensive program, even though its sequential
implementation is small. We were interested to see how much efficiency we
could obtain and how well each language coped with difficulties presented by the
benchmark.
The basic idea of this benchmark is: given a list A of 64-bit integers whose values range from 0 to 2^j − 1, compute two lists:

•	List B, where Bi = Ai “right justified” (each element shifted right until its lowest set bit reaches the least-significant position).

•	List C, such that (Bi * Ci) % 2^j = 1.
List C is constructed using an iterative algorithm (which is discussed in Section
3.2.2). In our implementation, we also included a “checking” phase in which one
processor traverses lists B and C and double-checks that they were computed correctly. The computation of the lists is embarrassingly parallel
(although it is memory-intensive), but the check phase can be extremely
bandwidth-intensive, especially on architectures that do not have shared
memory.
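As a small worked example (our own illustration, not part of the benchmark specification), take j = 4 and an element Ai = 12:

A_i = 12 = 1100_2 \;\Rightarrow\; B_i = 11_2 = 3, \qquad B_i \cdot C_i = 3 \cdot 11 = 33 \equiv 1 \pmod{2^4}, \;\text{so } C_i = 11.

Right-justifying strips the trailing zero bits of Ai, and Ci is the multiplicative inverse of the resulting odd value modulo 2^j.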
We implemented C, UPC, SHMEM, and MPI versions of this benchmark. In
addition, we decided to try a few optimizations on our implementation of the UPC
version to see how different optimization techniques impacted overall program
performance.
3.1.3 Convolution
Convolution is a simple operation often performed during signal processing
applications. The basic definition of the convolution of two discrete sequences X
and H is:
C[n] = \sum_{k=-\infty}^{\infty} X[k] \, H[n-k]
The algorithmic complexity of convolution is order N^2, which results in slow computation for even moderately sized sequences. In practice, convolution is rarely computed directly, since it can instead be computed by taking the Fast Fourier Transform of both sequences, multiplying them pointwise, and taking an inverse Fourier transform of the result. Computing convolution in this manner results in an algorithm of complexity N log2(N). Our implementation used the N^2 algorithm, as this allowed us to more easily measure the effect of our optimizations on the parallel versions.
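For a rough sense of the gap (our own back-of-the-envelope arithmetic, using the stated complexities and the 100,000-element sequences used later in Section 3.3.3):

N^2 = (10^5)^2 = 10^{10} \quad\text{versus}\quad N \log_2 N \approx 10^5 \times 16.6 \approx 1.7 \times 10^6,

i.e., the direct method performs on the order of several thousand times more basic operations than the FFT-based approach.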
For this application, we implemented C, UPC, MPI, and SHMEM versions. We also applied the same optimizations used in our Mod 2n inverse UPC implementation to the UPC version. Convolution has different computational properties than Mod 2n inverse, so this was an ideal test of whether the same UPC optimization strategies have similar effects on a totally different type of code.
3.1.4 Concurrent wave equation
The wave equation is an important partial differential equation which generally
describes all kinds of waves, such as sound waves, light waves and water waves
[3.2]. It arises in many different fields such as acoustics, electromagnetics, and
fluid dynamics. Variations of the wave equation are also found in quantum
mechanics and general relativity. The general form of the wave equation is:

\frac{\partial^2 u}{\partial t^2} = c^2 \nabla^2 u

Here, c is the speed of the wave’s propagation and u = u(p, t) describes the wave’s amplitude at position p and time t. The one-dimensional form can be used to represent a flexible string stretched between two points on the x-axis. When specialized to one dimension, the wave equation takes the form:

\frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2}
We developed two implementations of UPC programs to solve the wave equation
in one dimension as described in Chapter 5 of [3.3]. The sequential C version of
the program is readily available on the web [3.5], and this formed the basis of our
implementations in UPC. One version of our UPC code is derived from the unoptimized code found on the web. The other version is derived from a modified
version of the sequential C code that employs several hand optimizations. These
optimizations include the removal of redundant calculations and the use of global
and temporary variables to store intermediate results, which combined to
produce a 30% speedup in execution time compared to the original sequential
code on our Xeon cluster.
3.1.5 Depth-first search (DFS)
Many programmers use tree data structures for their storage efficiency, so efficient tree-searching algorithms are a necessity. Depth-first search is an
efficient tree-searching algorithm that is commonly used. In the depth-first
search algorithm, target data is first matched against the root node of the tree,
which has a depth level of 1. The search stops if a match was found. Otherwise,
all children of the root nodes at level 1 of the tree (which are in depth level 2) are
matched against the target data. If no match is found, then nodes at the next
depth are searched. This process continues for increasing depth levels until a match has been made on a particular level. The algorithmic complexity of this algorithm is order N on sequential machines and order log(N) in a parallel environment.
For this algorithm, we first implemented a sequential version. We then coded
two UPC versions which used an upc_forall loop and a manual for loop for
work distribution.
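The contrast between the two work-distribution styles is sketched below. This is our own minimal illustration (the array name and size are placeholders), not the DFS code itself; the manual version stripes iterations across threads with MYTHREAD and THREADS to mimic the default cyclic layout.

    #include <upc.h>

    #define CHUNK 1024                      /* assumed per-thread element count */
    shared int data[CHUNK * THREADS];       /* default (cyclic) distribution    */

    /* Style 1: upc_forall; the affinity expression &data[i] gives each
       iteration to the thread that owns data[i]. */
    void process_forall(void) {
        int i;
        upc_forall (i = 0; i < CHUNK * THREADS; i++; &data[i])
            data[i] = 2 * data[i];
    }

    /* Style 2: manual for loop; each thread explicitly walks the indices it
       would own under the cyclic layout (MYTHREAD, MYTHREAD + THREADS, ...). */
    void process_for(void) {
        int i;
        for (i = MYTHREAD; i < CHUNK * THREADS; i += THREADS)
            data[i] = 2 * data[i];
    }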
3.2 Code overview
In this section, we overview the code for each of the algorithms implemented.
Any difficulties we encountered that resulted from limitations imposed in specific
languages are also presented here.
3.2.1 CAMEL
The original code for the sequential cryptanalysis program can be broken up into three distinct phases:

•	An initialization phase, which initializes the S-boxes and computes the optimal difference pair based on the chosen values for the S-boxes.

•	A main computational phase, which first gets a list of possible candidate keys and then checks those candidate keys using brute-force methods in concert with the optimal difference pair that was previously computed.

•	A wrap-up phase, which combines the results of the cryptanalysis phase and returns data to the user, including the broken cipher key.
The first and third phases of the program are dwarfed by the execution time of
the main computational phase, which can take hours to generate all possible
candidate key pairs. To keep the computation times under control, we chose
keys from a limited range and adjusted the main computational loop to only
search over a subset of the possible key space. This kept the runtime of the
main phase to within a reasonable time for our evaluation purposes while
retaining similar (but scaled-down) performance characteristics of a full run.
We decided to use coarse-grained parallelism for our UPC port of the CAMEL
differential analysis program. After experimenting with small pieces of the
program under differing platforms and UPC runtimes, we concluded that this
would keep the performance of the resulting parallel program high. Coarse-grained code also represents the type of code that UPC is well suited for. In fact,
once we adopted this strategy, creating the UPC version of the application
became very straightforward. We restructured the application slightly to better
lend itself to parallelization, used the upc_forall construct in several key
places, and added some synchronization code to complete the parallelization
process.
Restructuring of the application was necessary so that the
corresponding for loops in the original C code could be easily converted to
upc_forall loops. Listed below is the pseudocode for the original main
computation phase of the application.
for (each possible key pair) {
    if (possible candidate key pair) {
        count++;
        if (count < 3) {
            iterate over whole key space and add keys to list if they
            match with this key pair
        } else {
            only iterate over candidate keys previously added and check
            if they match with this key pair
        }
    }
}
This code was restructured into C code implementing the pseudocode shown below.
for (each possible key pair done in parallel) {
    if (possible candidate key pair) {
        add to global list
    }
}

for (each key in global list) {
    if (count < 3) {
        iterate over whole key space in parallel and add keys
        to list if they match with this key pair
    } else {
        only iterate over candidate keys previously added
        in parallel and check if they match with this key pair
    }
}
The UPC implementation of the first for loop of the restructured computation
loop is shown below. Since each thread in the UPC application has access to
the cresult and cresultcnt variables, they are protected with a lock.
upc_forall(input = 0; input < NUMPAIRS; input++; input) {
    // grab all crypts that match up
    docrypt(key32, input, 0, &R1X, &R1Y, &C, &C2);   // perform 2 encryptions
    curR1Y = R1Y;                                    // per iteration of loop
    docrypt(key32, (input ^ R1XCHAR), 1, &R1X, &R1Y, &C, &C2);
    if ((R1Y ^ curR1Y) == (R1YCHAR)) {
        // lock the result array & stick it in there
        upc_lock(resultlock);
        cresult[cresultcnt].r1y = R1Y;
        cresult[cresultcnt].curr1y = curR1Y;
        cresult[cresultcnt].c = C;
        cresult[cresultcnt].c2 = C2;
        cresultcnt++;
        upc_unlock(resultlock);
    }
}
The UPC implementation of the pseudocode that iterates over the key space in parallel is shown below. Since all threads have access to the PK2 and sharedindex variables, they are protected with a lock. The macro MAINKEYLOOP controls how much of the key space the differential search iterates over.
upc_forall(m = 0; m < MAINKEYLOOP; m++; m) {
    upc_forall(k = 0; k < 1048576; k++; continue) {
        testKey = (1048576 * m) + k;
        if ((lastRound(curR1Y, testKey) == C)
            && (lastRound(R1Y, testKey) == C2)) {
            upc_lock(resultlock);
            PK2[sharedindex] = testKey;
            sharedindex++;
            upc_unlock(resultlock);
        }
    }
}
Finally, the UPC code that iterates over the candidate keys found by the previous loop (which is executed after the third main loop iteration) is shown below. Again, shared variables which need atomic updates are protected with locks.
upc_forall(m = 0; m < shared_n; m++; &keyArray[m]) {
    if ((lastRound(curR1Y, keyArray[m]) == C)
        && (lastRound(R1Y, keyArray[m]) == C2)) {
        upc_lock(resultlock);
        PK2[sharedindex] = keyArray[m];
        sharedindex++;
        upc_unlock(resultlock);
    }
}
The translation of code from C to UPC was very straightforward, especially since
we were able to reuse almost all of the original C code without making major
modifications. After the correctness of our UPC implementation was verified, an
MPI implementation using the master-worker paradigm was written using the
same parallel decomposition strategy as in the UPC version.
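For reference, a heavily condensed sketch of the master-worker structure is given below. This is our own illustrative skeleton with placeholder work and result types, not the actual CAMEL MPI source; in the real code a work unit is a slice of the key space and the results are lists of candidate keys.

    #include <mpi.h>

    #define TAG_WORK 1
    #define TAG_DONE 2

    static void master(int nworkers, int nunits) {
        int unit, done = 0;
        long result;                    /* placeholder result type */
        MPI_Status st;

        /* hand out one work unit per incoming request */
        for (unit = 0; unit < nunits; unit++) {
            MPI_Recv(&result, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            MPI_Send(&unit, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        }
        /* collect the final result from each worker and tell it to stop */
        while (done < nworkers) {
            MPI_Recv(&result, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            MPI_Send(&unit, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
            done++;
        }
    }

    static void worker(void) {
        int unit;
        long result = 0;                /* nothing to report on the first request */
        MPI_Status st;

        for (;;) {
            MPI_Send(&result, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&unit, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE)
                break;
            result = 0;                 /* placeholder: search key slice 'unit' here */
        }
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) master(size - 1, 256);
        else           worker();
        MPI_Finalize();
        return 0;
    }

The point of the skeleton is simply that all distribution logic lives in the master process, which is what made the spin-lock behavior described in Section 3.3.1 costly when the master shared a CPU with a worker.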
3.2.2 Mod 2n inverse
The C code for doing the basic computations needed in this benchmark is shown
below.
/**
 num is the number to right justify, N is the number of bits it has
**/
UINT64 rightjustify(UINT64 num, unsigned int N) {
    while (((num & 1) == 0) && (num != 0) && (N > 0)) {
        num = num >> 1;
        N--;
    }
    return num;
}

/**
 this computes the Mod 2^n inverse when num is odd
 such that num * result = 1 mod 2^N
*/
#define INVMOD_ITER_BITS 3
#define INVMOD_INIT 8

UINT64 invmod2n(UINT64 num, unsigned int N) {
    UINT64 val = num;
    UINT64 modulo = INVMOD_INIT;
    int j = INVMOD_ITER_BITS;
    while (j < N) {
        modulo = modulo << j;
        j = j * 2;
        val = (val * (2 - num * val)) % modulo;
    }
    return val;
}
Our sequential implementation starts by reading in the parameters of the benchmark from the command-line arguments given to the program. The program then allocates (mallocs) space for the lists A, B, and C. List A is filled with random integers whose values range from 0 to 2^j − 1. Then lists B and C are computed using the rightjustify and invmod2n functions previously shown. Finally, lists B and C are traversed, and (Bi * Ci) % 2^j is checked to make sure it is equal to 1.
In our parallel implementations, each thread “owns” a piece of A and computes
the corresponding parts of B and C by itself. A is initialized to random numbers
as before, and after B and C are calculated, the first thread traverses all values in
B and C to ensure they are correct. The main parts of our basic UPC
implementation are shown below.
// populate A
upc_forall(i = 0; i < listsize; i++; &A[i]) {
    A[i] = (rand() & (sz - 1)) + 1;
}

// compute B & C
upc_forall(i = 0; i < listsize; i++; &B[i]) {
    B[i] = rightjustify(A[i], numbits);
    C[i] = invmod2n(B[i], numbits);
}

// have main thread do check
if (MYTHREAD == 0) {
    for (i = 0; i < listsize; i++) {
        if (((B[i] * C[i]) % sz) != 1) {
            printf("FAILED i=%d for A[i]=%lld (Got B[i]=%lld, C[i]=%lld) for thread %d\n",
                   i, A[i], B[i], C[i], MYTHREAD);
        }
    }
}
Since the UPC implementation was straightforward, we decided to experiment with different optimizations to see how they impacted overall program performance. The first optimization was to write our own for loop instead of using the upc_forall construct. For our second optimization, we cast pointers to shared data with local affinity into private pointers before using them whenever possible (pointer privatization). For our third optimization, we had the main thread use upc_memget to bring the other threads' data into its private address space before initiating the checking phase.
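The fragment below sketches what the second and third optimizations look like in practice. It is our own simplified illustration (the array name, block size, and list length are placeholders), not the exact benchmark code.

    #include <upc.h>

    #define PERTHREAD 1000                    /* assumed per-thread list length */
    typedef unsigned long long UINT64;

    /* one contiguous block of B per thread */
    shared [PERTHREAD] UINT64 B[PERTHREAD * THREADS];

    void check_sketch(void) {
        int i, t;

        /* Optimization 2 (pointer privatization): the block with affinity to
           this thread is accessed through an ordinary private pointer. */
        UINT64 *localB = (UINT64 *) &B[MYTHREAD * PERTHREAD];
        for (i = 0; i < PERTHREAD; i++)
            localB[i] = (UINT64) i;           /* stand-in for the real per-element work */

        upc_barrier;

        /* Optimization 3 (bulk transfer): thread 0 pulls each remote block into
           a private buffer with one upc_memget instead of many fine-grained reads. */
        if (MYTHREAD == 0) {
            static UINT64 buf[PERTHREAD];
            for (t = 1; t < THREADS; t++) {
                upc_memget(buf, &B[t * PERTHREAD], PERTHREAD * sizeof(UINT64));
                /* ... run the checking loop over buf locally ... */
            }
        }
    }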
The code for our SHMEM implementation was almost identical to the sequential
version for calculating A, B and C; the only differences were that each thread
operated on a fraction of the list, and the lists they operated on were created
using gpshalloc instead of malloc.
The code for the checking phase is a bit more complex, however. In the SHMEM
version, the first thread starts off by checking that the data it generated was
correct. The first thread then issues gpshmem_getmem calls to the other
threads to bring in their copies of B and C before checking their data. The code
for the check routine is shown below.
// do check
gpshmem_barrier_all();   // make sure everyone is done
if (myproc == 0) {
    // check local results on master thread
    for (i = 0; i < mysize; i++) {
        if (((B[i] * C[i]) % sz) != 1) {
            printf("FAILED i=%d for A[i]=%lld (Got B[i]=%lld, C[i]=%lld)\n",
                   i, A[i], B[i], C[i]);
            fflush(stdout);
        }
    }
    // now check the rest
    for (i = 1; i < nump; i++) {
        int recvsize = mysize;
        if (i == nump - 1) {
            recvsize = lastsize;
        }
        gpshmem_getmem(B, B, recvsize * sizeof(UINT64), i);
        gpshmem_getmem(C, C, recvsize * sizeof(UINT64), i);
        // now do local check
        for (j = 0; j < recvsize; j++) {
            if (((B[j] * C[j]) % sz) != 1) {
                printf("FAILED i=%d j=%d (Got B[i]=%lld, C[i]=%lld)\n",
                       i, j, B[j], C[j]);
                fflush(stdout);
            }
        }
    }
}
The code for the MPI implementation is similar to the code for the SHMEM
version, except that it is complicated due to the lack of one-sided MPI functions
in our available MPI library implementations. During the check loop, each
processor sends its data to the first processor in turn using a combination of
for loops, MPI barriers, and MPI send and receive calls. In the interests of
brevity, the MPI code for the check routine has been omitted from this section,
since it is roughly twice as long as the check code for the SHMEM version.
3.2.3 Convolution
The sequential C code for the direct computation of the convolution of two arrays
A and B is shown below. The macro INTEGER is set at compile time to a double-precision floating point type, a 32-bit integer type, or a 64-bit integer type.
void conv(INTEGER* A, INTEGER* B, INTEGER* C, long lena,
          long lenb, long cstart, long cend) {
    long n, k;
    for (n = cstart; n <= cend; n++) {
        INTEGER s, e;
        s = (n >= lenb) ? n - lenb + 1 : 0;
        e = (n - lena < 0) ? n + 1 : lena;
        C[n] = 0;
        for (k = s; k < e; k++) {
            C[n] += A[k] * B[n - k];
        }
    }
}
The UPC code for our un-optimized version was almost identical to the above code, except that the outer for loop was replaced with an upc_forall loop using &C[n] as the affinity expression. Closely examining the code above tells us that, depending on how the UPC compiler blocks the different arrays, the work each thread has to do may vary by a large amount. Computing the values of C that are near the middle of the sequence takes much longer than computing the values near the edges of C, since more multiplications need to be performed on A and
B. Therefore, using a small block size near 1 (a cyclic distribution) should result
in nearly even work distribution, since each process is likely to have a uniform
mix of elements of C. Based on this observation, our implementation used a
block size of 1.
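A minimal sketch of this layout and loop is shown below. It is our own illustration with placeholder lengths and a simplified lower-triangular inner loop, and it assumes the static-THREADS compilation environment; it is not the exact code used in our implementation.

    #include <upc.h>

    #define LEN 1024              /* illustrative sequence length (static THREADS assumed) */
    typedef double INTEGER;       /* one of the types selectable at compile time           */

    /* default block size of 1: consecutive elements of C go to consecutive
       threads, so each thread gets a mix of cheap and expensive C[n] values */
    shared INTEGER A[LEN], B[LEN], C[LEN];

    void conv_upc(void) {
        long n, k;
        upc_forall (n = 0; n < LEN; n++; &C[n]) {   /* owner of C[n] computes it */
            C[n] = 0;
            for (k = 0; k <= n; k++)                /* work grows with n */
                C[n] += A[k] * B[n - k];
        }
    }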
We used the same three optimizations for our UPC implementation that we used in our Mod 2n benchmark implementation. The first optimization was to write our own for loops manually instead of using the upc_forall construct. This optimization also incorporated our second optimization, which was casting pointers to shared data with local affinity into private pointers before using them whenever possible (pointer privatization). For our last optimization, we had the main thread call upc_memget to bring the other threads' portions of the A and B arrays into its private address space before performing its computations. The manual work distribution using the for optimization complicated the code, since array offsets had to be manually calculated, but each of the other optimizations only added a few lines to the UPC code.
The SHMEM and MPI versions of this code were nearly identical, with the only differences being due to the different communication functions. Since MPI and SHMEM do not have a built-in array blocking mechanism, the code for the computation was more complex, as array offsets had to be computed manually. However, the code was no more complex than the for optimization in the UPC version of the code.
3.2.4 Concurrent wave equation
The implementation of the concurrent wave equation needs to calculate the
amplitude of points along a vibrating string for a specified number of time
intervals. The equation that is solved is shown below. In the equation, the
variable i represents a point on the line.
new[i] = 2 * amp[i] - old[i]
       + sqtau * (amp[i - 1] - 2 * amp[i] + amp[i + 1])
The amp array holds the current amplitudes. Note that the new amplitude for the
point will depend on the current values at neighboring points. Each process is
assigned a contiguous block of N/P points (block decomposition) where N is the
number of points and P is the number of processes. When the points are
assigned this way, each processor has all the data needed to update its interior
points. To update its endpoints, a processor must read values for the points
bordering the block assigned to it. Once the boundary values are calculated,
these new boundary values must also be updated in a shared array so other
processors may use them. In our implementation, an upc_forall loop was
used to initiate this communication.
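A condensed sketch of one time step, including the boundary exchange through a shared array, is shown below. This is our own illustration with placeholder names and sizes; our actual code performs the exchange with an upc_forall loop, whereas the sketch uses explicit per-thread assignments to make the data movement obvious.

    #include <upc.h>

    #define NPTS  1000                 /* points per thread (illustrative)  */
    #define SQTAU 0.09                 /* stand-in for (c * dt / dx)^2      */

    /* each thread keeps its block of the string privately, with 2 ghost cells */
    double amp[NPTS + 2], oldval[NPTS + 2], newval[NPTS + 2];

    /* each thread publishes its two boundary amplitudes in a shared array */
    shared [2] double edge[2 * THREADS];

    void timestep(void) {
        int i;

        /* publish my left and right boundary points */
        edge[2 * MYTHREAD]     = amp[1];
        edge[2 * MYTHREAD + 1] = amp[NPTS];
        upc_barrier;

        /* read the neighbours' boundary values into the ghost cells */
        if (MYTHREAD > 0)
            amp[0] = edge[2 * (MYTHREAD - 1) + 1];      /* left neighbour's right edge */
        if (MYTHREAD < THREADS - 1)
            amp[NPTS + 1] = edge[2 * (MYTHREAD + 1)];   /* right neighbour's left edge */

        /* apply the finite-difference update to all interior points */
        for (i = 1; i <= NPTS; i++)
            newval[i] = 2 * amp[i] - oldval[i]
                      + SQTAU * (amp[i - 1] - 2 * amp[i] + amp[i + 1]);
        upc_barrier;
    }

A real implementation would also rotate the oldval/amp/newval arrays between steps and hold the physical ends of the string fixed.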
3.2.5 Depth-first search
The most generic sequential DFS algorithm applicable to C uses pointers to
construct the tree. If a thread-spawning ability is available, a DFS using pointers
can be implemented as follows:
node *found = NULL;

void DFS(node *current, int target)
{
    if (current->data == target) {
        found = current;
    } else if (found == NULL) {    // keep searching only while no match has been found
        spawn DFS(current->child1, target);
        spawn DFS(current->child2, target);
        ...
    }
}
At first glance, parallelization of this implementation in UPC appears trivial, as each node can be handled by its own thread. However, this does not work because there is no construct in UPC that allows remote spawning of tasks. Because of this, we restricted the DFS algorithm to work only with n-degree trees (trees whose nodes have a maximum of n children), which allows an array representation of the tree to be used. By doing so, the dynamic spawning of child processes is changed to matching data against nodes in a certain range of the array. The parallelization process then becomes one of changing the for loop into an upc_forall loop and making the array that represents the tree globally accessible. Shown below is the UPC version of this implementation.
shared struct node tree[N];    // global shared array representing the tree
shared int found = FALSE;

void DFS(int target)
{
    int i;
    int left_node = 0, right_node = 0;   // index range of the current level
    int level_size = 1;                  // number of nodes on the current level

    do {
        upc_forall (i = left_node; i <= right_node; i++; &tree[i]) {
            if (tree[i].data == target) found = TRUE;
            // perform task with a match found
        }
        // advance to the next (wider) level of the tree
        left_node = right_node + 1;
        level_size = level_size * max_degree;
        right_node = right_node + level_size;
        upc_barrier;
    } while (found == FALSE);
}
3.3 Analysis
In this section, we present an analysis of each algorithm’s runtime performance.
We also discuss any differences between versions of applications coded in MPI,
UPC, or SHMEM.
3.3.1 CAMEL (MPI and UPC)
Since we used the master-worker paradigm in our MPI implementation, to have a
fair comparison between our MPI and UPC implementations we put both the
main MPI master thread and the first worker thread on the same CPU during
execution. However, after some experimentation, it became evident that the MPI
implementation for our InfiniBand network used a spinlock for its implementation
of blocking MPI send calls in order to keep latencies low. This spin lock wasted
several CPU cycles and destroyed the performance of any worked thread that
happened to be paired with the master thread on the same CPU. Because of
this, we rewrote our application to use a more traditional, distributed-style
coordination between all computing processes.
This increased overall
performance by lessening the impact of the spin lock at the expense of creating
much more complex MPI code. This version of the MPI code, which had
comparable performance to the UPC version, was 113 lines of code longer than
our UPC implementation. Clearly, for data-parallel applications, UPC is a much
more attractive language than MPI. The usefulness of UPC is especially evident
on shared-memory machines, where performance differences between well-written MPI programs and UPC programs are less pronounced than on cluster
architectures.
We ran our UPC and MPI implementations using a value of 256 for
MAINKEYLOOP, which results in about 1/16th of the key space being searched.
Our UPC and MPI implementations were tested on our four-processor
AlphaServer machine and 16-processor Opteron cluster. For the AlphaServer,
we used the native MPI and UPC compilers available (HP UPC compiler, version
2.3). For the Opteron cluster, we used Voltaire’s MPI implementation over 4x
InfiniBand, and we used the VAPI conduit over 4x InfiniBand on the Berkeley
UPC compiler (v2.0.1). The results from these runs are shown in Figure 3.2. As
can be seen, performance between the MPI and UPC versions was comparable,
and the overall parallel efficiencies of our implementations were very high. On
the Opteron cluster, both the UPC and MPI implementations had over 95%
efficiency for 16 processors. On the AlphaServer, both of our implementations
had an efficiency of over 98% when run with the maximum number of available
processors (4).
[Figure: execution time in seconds vs. number of processors (1–16) for the AlphaServer UPC, AlphaServer MPI, Opteron VAPI MPI, and Opteron VAPI UPC versions of CAMEL.]
Figure 3.2 - CAMEL performance
3.3.2 Mod 2n inverse (C, MPI, UPC, and SHMEM)
As can be inferred from Section 3.2.2, given the sequential C code for this
benchmark, the parallel version was almost trivial to code in UPC. However, the
straightforward UPC implementation of this benchmark resulted in a program that
has poor performance. Adding the three optimizations previously mentioned
makes the code slightly more complex, but has a large impact on overall
performance. Even with the added complexity of the optimizations in the UPC
code, the UPC code length and complexity was about on par with the SHMEM
version when all optimizations were implemented.
The MPI code was again more complex, longer, and harder to write than the
SHMEM and UPC versions. On the plus side, MPI does have access to a rich
library of non-blocking communication operations which can improve performance in these types of applications, although non-blocking operations will soon become available to UPC programmers in the form of language library extensions.
We did not use the non-blocking operations in our MPI
implementation.
We first examine the effect that each optimization had on overall UPC
performance. The parameters used for the runs in this benchmark were a list
size of 5,000,000 elements and n = 48 bits.
[Figure: time in seconds for each optimization combination (forall, forall cast, for, for cast, get forall, get forall cast, get for, get for cast), for 1 and 4 UPC threads, compared against the sequential code compiled with cc and with upc.]
Figure 3.3 - Effect of optimizations on 4-processor AlphaServer
The effects that the different optimizations had on our AlphaServer are summarized in Figure 3.3. Each column signifies which combination of optimizations was used when executing the UPC implementation. Also in the
figure are the times taken by the sequential version when compiled by the C
compiler (cc) and the UPC compiler (upc). The sequential code performance
obtained from using the C and UPC compilers is replicated across the columns in
the graph to enhance readability. Notice that compiling the same code using the
UPC compiler instead of the C compiler (using the same optimization flags)
results in about a 20% drop in performance. We again suspect that the source
code transformations applied by the UPC compiler may be responsible for the
decrease in performance. In addition, we suspect that the UPC compiler may
resort to using less aggressive memory access optimizations to ensure
correctness when executed in a parallel environment. This could also have
contributed to the slowdown.
In terms of the effects the different optimizations had on UPC program
performance, Figure 3.3 shows that applying all three optimizations concurrently
resulted in the best overall performance. Using a manual for loop instead of a
forall loop resulted in decreased performance, unless casting shared
variables locally (pointer privatization) was also employed.
Using upc_memget to copy blocks of memory from remote nodes into local memory also resulted in an appreciable performance gain. The effects the optimizations had
on our Opteron cluster are not included; however, the effects were similar to the
AlphaServer, although the Opteron cluster proved to be more sensitive to the
optimizations. Like the AlphaServer, the Opteron cluster also performed best
when all optimizations were used concurrently.
[Figure: time in seconds vs. number of threads (1–4) for the UPC (get for cast), GPSHMEM, and MPI versions of bench9 on the AlphaServer.]
Figure 3.4 - Overall bench9 performance, AlphaServer
The overall performance of our MPI, SHMEM, and UPC implementations on the AlphaServer is shown in Figure 3.4. The figure shows that the MPI implementation had the best overall performance, and the SHMEM version had the worst. We
believe the difference in performance between the MPI and UPC versions can be
attributed to the slowdown caused by compiling our code using the UPC compiler
instead of the regular C compiler (which is how MPI programs are compiled on
the AlphaServer). Nevertheless, even with the initial handicap imposed by the
UPC compiler, our UPC implementation performs comparably to our MPI
implementation with 4 processors. The SHMEM implementation lags behind
both the UPC and MPI versions in almost all cases; however, we are using a
freely-available version of SHMEM (gpshmem) because we do not have access
to a vendor-supplied version for our AlphaServer. In any case, the performance
of the SHMEM version was not drastically worse than the performance of the
other two implementations.
[Figure: time in seconds vs. number of threads (1–16) for the GPSHMEM, MPI, and Berkeley UPC VAPI (get for cast, BS=100 and BS=MAX) versions of bench9 on the Opteron cluster.]
Figure 3.5 - Overall bench9 performance, Opteron cluster
The performance on the Opteron cluster is shown in Figure 3.5. The check phase is a very bandwidth-intensive task; this results in poor performance due to the
limited bandwidth capabilities from CPU to CPU in a cluster environment. Our
MPI implementation performs the best here, which is not surprising given that
MPI is well-suited to cluster environments. Also, the MPI implementation
explicitly defines how data is communicated from each processor to the next.
While writing explicit communication patterns is programmer-intensive and tedious, it usually results in the best overall utilization of the network hardware.
Figure 3.5 also illustrates the effect that adjusting the block size of UPC shared
arrays has when combined with use of the upc_memget function. Using a larger
block size for our shared arrays resulted in better performance of our UPC
implementation. This is logical, since most networks perform best when
transferring larger messages. When the block size for the shared arrays is set to
the maximum size allowed by the UPC compiler, overall UPC performance
moves quite a bit closer to the performance obtained by the MPI version. As with the AlphaServer, the SHMEM implementation has the worst overall performance. We had expected this; SHMEM is designed for shared-memory machines, so it makes sense that it does not perform well in a cluster environment.
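For illustration (hypothetical declarations, not the exact ones from our benchmark code), the block size is part of a shared array's declaration, so the two configurations in Figure 3.5 correspond roughly to declarations along these lines:

    typedef unsigned long long UINT64;
    #define LISTSIZE (1000 * THREADS)          /* placeholder list length */

    shared [100] UINT64 B_small[LISTSIZE];     /* blocks of 100 elements (the BS=100 runs)   */
    shared [*]   UINT64 B_max[LISTSIZE];       /* one maximal contiguous block per thread    */
                                               /* (approximately the BS=MAX configuration)   */

With the maximal block size, each thread's portion of the array is contiguous, so a single upc_memget per thread moves it in one large message, which is consistent with the trend seen in Figure 3.5.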
3.3.3 Convolution (C, MPI, UPC, and SHMEM)
The parameters used in this benchmark were two sequences containing 100,000
double-precision floating point elements. As with our Mod 2n inverse codes, we
decided to examine the impact that each of our three UPC optimizations had on
our UPC code. A reduced data set size was used for these tests. The results
from the execution of the code with different optimizations enabled on our
AlphaServer are shown in Figure 3.6. Each column illustrates the optimizations
used when running the UPC program. In the figure, the columns labeled “naïve”
did not use the upc_memget function to bring in local copies of A and B before
starting computation. The rest of the optimizations listed in the columns
correspond to the optimizations previously mentioned, with the for optimization
also including the casting optimization. The results we obtained were similar to the effects the optimizations had on the Mod 2n inverse UPC code. In all cases,
applying all optimizations led to the best performance.
[Figure: time in seconds for each optimization combination (naïve forall, naïve for, get forall, get for) for the sequential code and for 1 and 4 UPC threads.]
Figure 3.6 - Effect of optimizations on 4-processor AlphaServer
Not shown are the results from applying the optimizations to the Opteron cluster.
These results also agreed with our previous results from Mod 2n inverse; in
general, UPC performance on the Opteron cluster was very sensitive to the
optimizations used.
As with the AlphaServer, in all cases applying all
optimizations led to the best performance.
The overall performance for performing double-precision floating point
convolution is shown in Figure 3.7. The performance of the MPI, UPC, and
SHMEM versions of the code was comparable on both the Opteron cluster and
the four-processor AlphaServer. The parallel efficiencies for running the UPC,
SHMEM, and MPI versions on the Opteron cluster were over 97.5%. In addition,
the parallel efficiencies for the UPC, MPI, and SHMEM versions on the
AlphaServer were also over 99%.
[Figure: time in seconds vs. number of threads (1–16) for the UPC, GPSHMEM, and MPI versions of double-precision floating point convolution on the AlphaServer and the Opteron cluster.]
Figure 3.7 - Overall convolution performance
An interesting phenomenon is also diagrammed in Figure 3.7: floating point
performance for the UPC implementation was significantly better than the floating
point performance obtained by the MPI and SHMEM versions of the code. On
our AlphaServer, both MPI and SHMEM are made available to C programmers
through C libraries that are linked with the user’s application. We suspect that
the UPC compiler on the AlphaServer has more intimate knowledge of the
available floating point hardware; it seems the UPC compiler was better able to
schedule the use of the floating point units than the sequential C compiler paired
with parallel programming libraries. It is also worthy to note that integer
performance was not improved by using the UPC compiler on the AlphaServer,
although performance was not degraded by using the UPC compiler as it was
with our bench9 UPC implementation. The performance of the convolution given
by the Berkeley UPC compiler on our Opteron was degraded; again, we attribute
this to the source-to-source transformations interfering with the ability of the GCC
compiler to perform the same optimizations that it was able to on the MPI and
SHMEM code.
3.3.4 Concurrent wave equation (C and UPC)
Figure 3.8 summarizes the execution times for various implementations of the
code. The modified sequential version was 30% faster than the baseline for the
Xeon cluster, but only 17% faster for the Opteron cluster. Computations take up
more of the total execution time on the Xeon cluster, so the optimizations will
necessarily have more of an impact. Since the algorithm is memory-intensive,
obtaining execution times for larger data sets is intractable due to the physical
limitations of main memory.
[Figure: execution time in seconds vs. number of points (0.5–3 million) for the sequential, modified sequential, and 1-, 2-, and 4-thread UPC versions on the Xeon and Opteron clusters.]
Figure 3.8 - Concurrent wave performance
The UPC versions of the code exhibit near-linear speedup. This fact is more
meaningful when considering that the UPC code was fairly straightforward to port
from the sequential code. Once the code was written, we could focus our
attention on determining the most efficient language constructs to use for a given
situation. We found that for smaller data sets, the affinity expression array+j
performed slightly better than &(array[j]). This is probably due to the
different implementations of each construct by the UPC compiler. By gaining
more information about the implementations of various language constructs, we
hope to be able to exploit more construct-specific performance benefits.
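The two affinity expressions being compared look as follows (a sketch with a hypothetical shared array; both forms designate the thread that owns element j):

    #include <upc.h>

    #define N (100 * THREADS)
    shared double array[N];          /* hypothetical array, default cyclic layout */

    void touch(void) {
        int j;
        /* pointer-arithmetic form of the affinity expression */
        upc_forall (j = 0; j < N; j++; array + j)
            array[j] = 0.0;

        upc_barrier;

        /* address-of-element form; semantically equivalent */
        upc_forall (j = 0; j < N; j++; &(array[j]))
            array[j] = 1.0;
    }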
3.3.5 Depth-first search (C and UPC)
Figure 3.9 shows the average performance of the DFS algorithm on a tree with 1,000,000 elements run on our Xeon cluster using SCI. The tree was set up so that the data contained in each node was the same value as its array index. Values
from 1 to 1,000,000 were used as search keys, and the average time for all of the
searches was recorded. As can be seen from the data, the UPC versions
perform much slower than the sequential version. This is primarily due to the
extra synchronization needed by the UPC versions. However, the UPC versions
start to perform better when a long delay was added to the matching process
(this data is not shown). This is intuitive, because adding the delay results in less
frequent synchronization, which increases the efficiency of the parallelization.
Again, we found that using the for loop instead of the upc_forall results in better performance, and also scales better as the number of threads is increased.
[Figure: execution time in milliseconds for the sequential, UPC with upc_forall, and UPC with for versions, on 1, 2, and 4 nodes.]
Figure 3.9 - DFS performance on Xeon Cluster with SCI
We also tried increasing the block size of the global array to see if this would
affect overall performance. In our implementation, adding this optimization
actually decreased the performance because it creates a load imbalance. To
illustrate why this is, assume there are two processors and a block size of 5 is
used on the global array representing the search tree. Because of the
distribution of array elements, the first node ends up searching all of the first two
tree levels while the second node is idle. The extra idle time spent by the second
node decreases the overall efficiency of the parallelization, which results in lower
overall performance. This effect worsens as the depth of the search increases.
In addition, we also investigated other optimizations, including casting shared pointers with local affinity to private pointers before using them (pointer privatization) and using other language constructs where applicable.
These optimizations did not improve the
performance of the application, so we have excluded their results from this
section.
3.4 Conclusions
Our implementation of the CAMEL differential cryptanalysis program gave us
useful experience with both the UPC and MPI programming languages. Our
UPC implementation was easily constructed from the original sequential code,
while the MPI implementation took a little more thought due to our MPI vendor’s
implementation of a blocking receive with a spin lock. We were able to achieve
high efficiency on both of our implementations. One interesting fact we learned
from working with our CAMEL implementation is that both UPC compilers can
sometimes give slightly better or slightly worse performance for the same code
compared with the MPI or sequential C compilers. We suspect that since the
UPC source code may be slightly transformed or altered depending on the UPC
compiler implementation, the final code that gets assembled may have different
optimizations that can be applied to it as compared with the original version. In
this respect, since MPI compilers generally do not perform any additional code
reorganization of the source code before compiling and linking against the
available MPI libraries, overall MPI performance usually matches the
performance of sequential versions more closely than UPC. This is especially
true for the Berkeley UPC compiler, which utilizes source-to-source
transformations during compilation.
Our Mod 2n inverse implementation was another useful tool that allowed us to
compare MPI, SHMEM, and UPC. While we could have examined an existing
UPC implementation of Mod 2n, writing implementations from scratch turned out
to be an excellent learning experience. The simplicity of the code for Mod 2n
inverse allowed us to experiment with different UPC optimizations on a variety of
platforms and UPC runtimes. We found that three commonly-employed
optimizations (as evidenced by the GWU UPC benchmark suite) can make a
considerable difference in performance. Specifically, the combination of using
upc_memget/upc_memput to transfer contiguous blocks of memory, manually partitioning work using the for construct instead of the upc_forall construct, and casting shared pointers to private pointers where necessary resulted in the best performance on all of our UPC runtimes and
compilers.
Our implementations of convolution in UPC, SHMEM, and MPI show that each
language offers similar performance for applications with large computation
requirements and moderate communication requirements. The AlphaServer’s
UPC compiler was able to squeeze more performance out of its floating point
units, even when compared to the sequential C compiler. UPC’s notion of
blocked arrays also made uniform work sharing easier to implement in this
application. Finally, our chosen optimizations had a positive impact on overall
UPC performance for both the AlphaServer and our Opteron cluster, and the
effects of our optimizations also agreed with the results we obtained on our Mod
2n implementation.
The adjustments required to port the wave equation code from the original
sequential code to UPC were fairly intuitive. However, the overhead for running
the UPC code for one process relative to the sequential code is quite substantial.
Even more noteworthy is that when the sequential code is passed through the UPC compiler and run on one processor, a similar overhead is incurred. We believe the
reasons for this overhead are similar to the reasons we mentioned for the
overhead observed in the CAMEL application.
In our DFS implementation, we again verified that use of certain constructs
instead of the built-in alternatives (for versus upc_forall) can make an
impact on the program performance. We also verified that the computation to
communication ratio of a program has significant effect on parallel program
efficiency. More importantly, we discovered that it was necessary to change the
underlying algorithm and restrict the original problem when creating a parallel
version of an application in order to gain efficiency. Programmers must
understand the limitations of the language they use, because that understanding
is a prerequisite to writing efficient code in it. By examining the capabilities
of a language, a set of “good” performance guidelines can be formed.
In general, this task provided us with an opportunity to become more familiar with
UPC and SHMEM. The optimization process was beneficial, as it forced us to
view performance analysis tools from a user’s perspective. It caused us to gain
an understanding of the information that is needed when optimizing parallel
programs. In addition, it provided us with experience using ad-hoc methods for
collecting such basic metrics as total execution time and total time spent within a
function.
Finally, the programming practice confirmed that the use of
optimizations can result in a significant performance impact.
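As an illustration of the kind of ad-hoc measurement referred to above, the sketch below wraps a simple wall-clock timer around a routine of interest. The gettimeofday-based timer and the placeholder compute_kernel routine are illustrative choices and are not specific to any of our benchmark codes.

#include <stdio.h>
#include <sys/time.h>

/* Simple wall-clock timer. */
static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

/* Placeholder for the routine whose time we want to isolate. */
static void compute_kernel(void)
{
    volatile double x = 0.0;
    int i;
    for (i = 0; i < 1000000; i++)
        x += i * 0.5;
}

int main(void)
{
    double t_start = wall_time();

    double t0 = wall_time();
    compute_kernel();
    double t_kernel = wall_time() - t0;   /* time spent within one function */

    /* ... remainder of the program ... */

    printf("kernel: %f s, total: %f s\n", t_kernel, wall_time() - t_start);
    return 0;
}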
3.5 References
[3.1] K. S. Ooi and Brain Chin Vito, “Cryptanalysis of S-DES,” Cryptology ePrint Archive, Report 2002/045, April 2002.
[3.2] “Wave Equation,” Wikipedia: The Free Encyclopedia.
[3.3] Fox et al., Solving Problems on Concurrent Processors, Vol. 1, 1988.
[3.4] http://www.new-npac.org/projects/html/projects/cdroms/cewes-199906-vol1/cps615course/mpi-examples/note5.html
[3.5] http://super.tit.ac.kr/workshop/html/samples/exercises.html#wave
4 Performance tool strategies
Writing a parallel program can often be a daunting task. In addition to the normal
issues raised during sequential programming, programmers working on parallel
codes must also contend with data partitioning schemes, synchronization, and
work distribution among processors, among other issues. Recently, several
programming languages such as UPC have been created that aim to improve
programmer efficiency by providing simplified coding styles and convenient
machine abstractions. Even with these improved programming environments,
deciding how to optimally code a program can still be a trial-and-error process.
Evaluation of parallel code is usually accomplished via one or a combination of
the following three methods:
• Simulation – Detailed models are created of the parallel code and hardware
that the user wishes to evaluate. These models are simulated, and information
collected during the simulation is either stored for later analysis or shown to
the user immediately through a Graphical User Interface (GUI).
• Analytical models – Mathematical formulas are used in conjunction with
specific parameters describing the parallel code and the hardware upon which it
will be executed. This gives an approximation of what will happen when the code
is actually run by the user on the target hardware.
• Experimental – Instrumentation code is added that records data during the
program’s execution on real hardware.
Creating detailed simulation models can provide extremely detailed information
to the user. However, creating and validating models of existing hardware is a
very labor-intensive process. The most accurate models may take prohibitively long to
simulate, while coarse-grained models that run in a reasonable amount of time
usually have poor accuracy. In addition, the models created are usually tied
closely to particular architectures or runtime systems, and modifying them to
updated architectures or drastically different architectures usually involves
substantial work. Given that parallel architectures can vary widely within short
amounts of time, and given the large execution cost of accurate models, simulative
models are usually only used in cases where the detailed information they
provide is absolutely required. In this respect, they are invaluable tools for
ensuring the correctness of execution of mission-critical systems, but their
usefulness in the implementation of a PAT is limited.
Analytical models can be thought of as extremely simplified versions of models
created for simulative analysis. While they lack the accuracy of detailed
simulation techniques, often the information they provide is sufficient to
warrant their use in a PAT. For example, parallel performance models can
provide the programmer with insight into the characteristics of existing parallel
machines by giving the programmer a “mental picture” of parallel hardware. In
addition, some performance models may even predict how a given program will
perform on other available hardware or hardware not yet available.
The experimental approach (direct execution of a user’s code) provides the most
accurate performance information to the user, but using this strategy by itself
encourages the use of the “measure-modify” approach, in which incremental
changes are made to the code after measuring a program’s runtime
performance. This modified code is then run and measured again. This process
repeats until the user obtains the desired performance from their code. A major
drawback of this process is that it is usually very time consuming; in addition,
many actions are performed on a trial-and-error basis, so getting better
performance from codes may also involve an element of chance.
5 Analytical performance modeling
In the context of a PAT, a performance model has many possible uses. A
performance model may be used to give suggestions on how to improve the
performance of a user’s application, or may allow the user to perform simple tradeoff
studies that aid in obtaining the maximum performance from their program.
Additionally, studying existing performance models will give us an idea of which
particular metrics (“performance factors”) researchers deem important enough
to include in models that evaluate or predict performance of
parallel code. By compiling a list of the performance factors used in a wide
variety of existing performance models, we can justify the inclusion of
measurement tools for these particular metrics in our PAT. If we are able to
characterize performance of an application on real hardware using a handful of
performance factors, we will have a better understanding of how each
performance factor affects a program’s overall performance.
The rest of this section will review existing performance models in an attempt to
determine their applicability to a PAT. To evaluate each model, we will use the
following set of criteria:
• Machine-friendliness – For a performance model to be useful in a PAT, the PAT
must be able to evaluate a user’s program with little or no help at all from the
user. If a user must expend a great deal of effort to use the model, the user will
most likely avoid the model in favor of other, easier-to-use features provided by
the PAT.
• Accuracy – A very inaccurate performance model is not useful to the user. We
seek models that have a reasonable amount of accuracy (20-30% error). In
general, we wish this requirement to be flexible; if a model has slightly worse
than 30% error but has other redeeming features, we may still assign it a good
evaluation.
• Speed – If evaluating a given code under a performance model takes longer
than evaluating it on actual hardware, it will generally be more productive to use
the actual hardware instead of a model that approximates it. Therefore, we have
chosen the time taken to re-run the code on the actual hardware as an extreme
upper bound on the time taken to evaluate a performance model. To quantify
this interval, models that are able to give evaluations in seconds to tens of
seconds are highly desirable. Models that take minutes to evaluate need to have
high accuracy or some other desirable feature to make up for their speed
deficiency. If a model takes one hour or longer to evaluate code, we will not
consider it a feasible choice for inclusion in our PAT.
We have divided the existing performance models we will be evaluating into
three categories. First, models that use formalized mathematical methods are
grouped under the title “formal performance models.” These models are
presented and evaluated in Section 5.1. Second, general performance models
that are used to give mental pictures to programmers or that provide general
programming strategies are categorized as “general analytical performance
models,” and are summarized and evaluated in Section 5.2. Performance
models that are designed to specifically predict the performance of parallel codes
on existing or future architectures are categorized as “predictive performance
models,” and are presented and evaluated in Section 5.3. Finally, Section 5.4
gives our recommendations on how to incorporate an analytical model into our
PAT.
5.1 Formal performance models
In this section, we will briefly give an introduction to some of the more widely-used
formal methods for evaluating performance of parallel codes. This category
of performance models encompasses many different methods and techniques.
Because formalized methods generally require extensive user interaction, and
thus are not readily applicable for inclusion into a PAT based on our previously
mentioned criteria, we will only present a brief overview of them. Also, formal
performance models use highly abstracted views of parallel machines, so we
cannot extract which performance factors are considered when evaluating
performance as these metrics are not used directly.
For completeness, we have included a generalized evaluation of formal
performance models below. These comments are applicable to all the formal
models discussed in this section.
Formal models summary:
• Parameters used – varies; usually an abstract representation of processes and
resources
• Machine-friendliness – very low; relies on the user’s ability to create abstract
models of the systems they wish to study
• Average error – varies; can be arbitrarily low or high depending on how
systems are modeled
• Speed – creating and verifying the models used by formal methods can be
time consuming
5.1.1 Petri nets
Petri nets came around as a result of the work that Carl Petri performed while
working on his PhD thesis [5.5.1]. Petri nets are specialized graphs (in the
computer science sense) which are used to graphically represent processes and
systems. In some sense they are more generalized versions of finite state
machines, as finite state machines may be represented using Petri nets. The
original Petri nets proposed by Carl Petri had no notion of time and only allowed
limited modeling of complex systems.
Several improvements have been proposed to the original Petri nets, including
colored Petri nets, which allow more complicated transitions, and timed Petri nets,
which introduce time parameters. Using colored, timed Petri
nets, it is possible to model arbitrarily complex systems. Since Petri nets are
strongly grounded in mathematical theory, it is often possible to make strong
assertions about the systems modeled using Petri nets. Specialized versions of
Petri nets have even been created specifically for performance analysis of
distributed systems [5.5.2].
Petri nets can be thought of as basic tools that provide a fundamental framework
for modeling. However, they are often very difficult to create for general parallel
systems and codes. If included in a PAT, the PAT would need a lot of help from
the user in the form of hints on how to construct the Petri nets (or the provision of
graphical tools to help the user create the Petri nets). Therefore, because of the
large dependence on user interaction for Petri nets, we suggest not using Petri
nets as a basis for a performance model in our PAT.
5.1.2 Process algebras
Process algebras represent another formal modeling technique strongly rooted in
mathematics (especially algebra). Some of the more popular instances of process
algebras are Milner’s “A Calculus of Communicating Systems” [5.5.3] and
Hoare’s “Communicating Sequential Processes” [5.5.4]. Process algebras
abstractly model parallel processes and events that happen between them.
These techniques are rich in applicability for studying concurrent systems but are
quite complex. Entire books have been written on these subjects (for example,
Hoare has an entire textbook dedicated to his “Communicating Sequential
Processes” process algebra [5.5.5]).
Process algebras excel at giving formalized mathematical theory behind
concurrent systems, but in general are difficult to apply to real parallel application
codes and real systems. They are useful for verifying certain properties of
parallel systems (e.g., deadlock-free algorithms), but are not immediately useful
to a PAT. Therefore, we recommend excluding process algebras from our PAT.
5.1.3 Queuing theory
Another formalized modeling technique we discuss here is the application of
queuing theory to parallel systems. As with process algebras and petri nets,
queuing theory is strongly rooted in mathematics and provides a general way to
describe and evaluate systems that involve queuing. Queuing theory comprises
an entire field in and of itself; it has been successfully applied in the past to
evaluate parallel systems (for a summary paper, see [5.5.6]).
However, like process algebras, queuing theory is a general tool that is
sometimes difficult to apply to real-world problems. In addition, some parallel
codes might not be readily modeled using queuing terminology; this is even more
problematic when working with languages that provide high-level abstractions of
communication to the user. While queuing theory is a useful tool for solving
some parallel computing problems (notably load balancing), it is not appropriate
for inclusion into our PAT.
5.1.4 PAMELA
PAMELA is a generic PerformAnce ModEling Language invented by van
Gemund [5.5.7, 5.5.8]. PAMELA is an imperative, C-style language extended
with constructs to support concurrent and time-related operations. PAMELA
code is intended to be fed into a simulator, and the language defines calculus
operators that allow programs to be reduced to speed up their
evaluation. PAMELA has much in common with general process
algebras, although PAMELA is oriented towards direct simulation of the resulting
codes in the PAMELA language/process algebra. During evaluation of code
written in the PAMELA language, serialization analysis is used to provide a lower
and upper bound on the effects of contention that affect a program’s
performance.
Even though PAMELA itself is strictly a symbolic language, van Gemund
developed a prototype system that was written for the Spar/Java data-parallel
programming language [5.5.9]. This system generates a PAMELA program
model based on comments provided in the source code to programs. While the
system used was not fully automated due to the dependence on special
comments in the source code, the generated models had an average of 15%
error for matrix multiplication, numerical integration, Gaussian elimination, and
parallel sample sort codes. The models generated by their system were in
general many times larger than the actual source code, although the sizes
shrunk dramatically after the performance models were reduced.
PAMELA provides a convenient description of parallel programs, but
automatically creating models from actual programs, especially programs coded
in UPC, would be too difficult to perform efficiently without interaction from the
user. The performance calculus used is simple enough to allow for machine
evaluation of PAMELA models; however, because of the necessary user
interaction required in creating accurate PAMELA models, we cannot
recommend the use of PAMELA in our PAT.
5.2 General analytical performance models
In this section, we present an overview of general analytical performance
models, which we define as performance models that are meant to provide a
programmer with a mental model of parallel hardware or general strategies to
follow while creating parallel codes.
5.2.1 PRAM
Perhaps one of the most popular general analytical performance models to come
out of academia is Fortune and Wyllie’s Parallel Random Access Machine
(PRAM) model [5.5.10]. The PRAM model is an extremely straightforward model
to work with. In the model, an infinite number of ideal CPUs are available to the
programmer. Parallelism is accomplished using a fork-join style of programming.
Each processor in the PRAM model has a local memory and all processors have
access to a global memory store. Each processor may access its local memory
and the global memory store, but may not access the private memory of the
other processors. In addition, local and global memory accesses have the same
cost.
Because of its simplicity, many programmers use the model to determine the
algorithmic complexity of their parallel algorithms. However, no real machine
exists today that matches the ideal characteristics of the PRAM model, so the
complexities determined using the model are strict lower bounds. Researchers
have created several different variations of the model to try to more closely
match the model with existing hardware by restricting the types of memory
operations that can be performed. For example, different variations of the model
can be obtained by specifying whether reads and writes to the global memory
store can be performed concurrently or exclusively. CREW-PRAM (concurrent
read, exclusive write PRAM) is one particular variation of the model that is widely
used. However, even with the enhancements to the model, the model is still too
simplistic to predict performance of parallel codes to a reasonable degree of
accuracy, especially on non-uniform memory access machines. In addition,
synchronization costs, which can contribute greatly to the cost of running parallel
codes on today’s architectures, are not directly captured by the model.
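As an illustration of the kind of bound the model yields (a standard textbook example, not a result quoted from [5.5.10]), consider summing n values with p processors: each processor first reduces its n/p values locally, and the p partial sums are then combined in a binary tree of depth log p, giving

T_{\mathrm{PRAM}}(n, p) \;=\; O\!\left(\frac{n}{p} + \log p\right),

a bound that no real machine attains once memory latency and synchronization costs are included, which is precisely the weakness noted above.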
PRAM is a very useful algorithmic analysis and programming model, and it is
attractive because it is so straightforward to work with. However, its low
accuracy prohibits its use in our PAT.
PRAM summary:
• Parameters used – coarse model of memory access; sequential code speed
used for predicting overall execution time for parallel code
• Machine-friendliness – medium to low; in order to accurately model algorithms
using PRAM, either the source code must be processed in some way or extra
information must be provided by the user
• Average error – generally accurate to within an order of magnitude, although
accuracy can be much worse depending on the nature of the code being
evaluated (excessive synchronization is problematic for the model)
• Speed – extremely fast (entirely analytic)
5.2.2 BSP
In 1990, Leslie Valiant introduced an analytical performance model he termed the
Bulk-Synchronous Parallel (BSP) model [5.5.11]. The BSP model aims to
provide a bridging tool between hardware and programming models, as the von
Neumann model of computing has done for sequential programming. In the BSP
model, computation is broken up into several supersteps. In each superstep, a
processor first performs local computation. Then, it is allowed to receive or send
at most h messages. These messages are charged a cost of g·h + s, where g is
the throughput of the communication network between the processors and s is
the startup latency for communication. Finally, at the end of
each superstep, a global barrier synchronization operation is performed. The
barrier synchronizations are performed with frequency determined by a model
parameter L. It is important to note that synchronization is performed every L
time units. If a superstep has not completed at this time, the next superstep is
allocated for finishing the computation of the current superstep. Breaking
computation up into supersteps and limiting the network communication that can
occur within a superstep makes it easier to predict the overall time taken for that
superstep, which will be the sum of the maximum computation time,
communication time, and the time needed to synchronize all the processors
participating in that superstep. A few variants of the BSP model also exist,
including the Extended BSP (E-BSP) model [5.5.12]. The E-BSP model
enhances the original BSP model by introducing a new parameter M which
models the pipelining of messages from processor to processor. In addition, the
E-BSP model also adds the notions of locality and unbalanced communication
by taking network topology into account, something that
is ignored in the general BSP model.
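One way to write the per-superstep cost implied by this description, with w_i denoting the local computation performed by processor i and T_sync the barrier cost governed by the periodicity parameter L, is

T_{\mathrm{superstep}} \;=\; \max_i w_i \;+\; g \cdot h + s \;+\; T_{\mathrm{sync}},
\qquad
T_{\mathrm{program}} \;=\; \sum_{\mathrm{supersteps}} T_{\mathrm{superstep}}.

This is only a sketch of the accounting; published formulations of BSP differ in how the startup and synchronization terms are folded together.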
In 1996, Juurlink and Wijshoff [5.5.13] performed an evaluation of BSP by tuning
the BSP and E-BSP models to three actual machines and using their tuned
models to predict performance of several parallel applications: matrix
multiplication, bitonic sort, local sort, sample sort, and all-pairs shortest path.
The three machines used were a 1024-processor MasPar MP-1, 64-processor
CM-5, and a 64-node GCel. For each code, the BSP model and E-BSP model
showed good accuracy, within 5% in most cases. However, error grew to 50-70% in the cases of poor communication schedules (when several processors try
to send to one processor at the same time), poor data locality (high cache
misses), and sending a large number of tiny messages. A similar drop in
accuracy was reported for the BSP model when working on networks with low
average bisection bandwidth but high point-to-point bandwidth between processors
near each other in the interconnection network. The BSP model uses the
maximum number of messages sent by a group of processors to predict
performance by assuming all processors send the exact same number of
messages, but if there is only one processor that sends a larger number of
messages than the others this cost will be overestimated. The problem is
compounded by networks having low bisection bandwidth, since network
saturation decreases performance more quickly on these types of networks. It is
also important to note in these cases the programs were coded with the BSP and
E-BSP models in mind from the start, so they most likely represent optimistic
accuracies.
In addition, the BSP model itself imposes restrictions in the structure of the
program, although the restrictions themselves are not that hard to live with. To
this end, a group of researchers created a BSP programming library named
BSPLib to ease the implementation of BSP-style algorithms [5.5.14]. The library
provides similar operations that MPI provides, albeit on a much smaller scale.
The BSPLib tool also includes the ability to generate a trace of program
execution that can be analyzed using a profiling tool. One interesting feature of
the profiling tool is that it is able to predict performance of the application on
other machines, given their BSP parameters. The accuracy of the predictions
made by the profiling tool on a complex Computational Fluid Dynamics (CFD)
program were within 20%, which is fairly respectable considering the trace was
performed on a shared-memory machine and the predicted platform was a
loosely-coupled cluster. One other contribution of the BSPLib effort was the
change of g to a function g(x), which allows the model to capture the different
latencies associated with different message sizes, a common property
of most available networks today. This addition does increase the complexity of
the model, but this additional complexity may be mitigated by the use of profiling
utilities.
The BSP model provides a convenient upgrade in accuracy from the PRAM
model, although it requires a specific programming style and the model ignores
some important features including unbalanced communication, cache locality,
and reduced overhead for large messages on some networks. Some of these
deficiencies have been addressed by the E-BSP model at the expense of added
model complexity. In general, if processing trace files created by instrumented
code is not too costly, supersteps and h parameters may be extrapolated from
existing code by examining when messages were sent and received in relation to
the time barrier synchronization operations were performed. In addition,
microbenchmarks may be used to automatically record the g and L parameters of
existing systems. Using these two methods together, applying the BSP model to
a program after it has been executed can probably be entirely automated. In
addition, the simplicity of the model is also an attractive feature, although it
remains to be seen whether the model can be applied to more finely-grained
applications that do not make use of many barrier synchronizations.
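A rough sketch of such a microbenchmark in UPC is shown below: it estimates g from the sustained per-byte cost of bulk remote puts and L from the average cost of a barrier. The transfer size, repetition count, and choice of upc_memput are our own illustrative assumptions; a production benchmark would also need warm-up iterations and attention to timer resolution.

#include <upc_relaxed.h>
#include <stdio.h>
#include <sys/time.h>

#define NBYTES 65536                /* size of each bulk transfer     */
#define NREPS  100                  /* repetitions used for averaging */

shared [NBYTES] char buf[NBYTES * THREADS];   /* one block per thread */
char local_buf[NBYTES];

static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

int main(void)
{
    int i, peer = (MYTHREAD + 1) % THREADS;
    double t0, g_est, L_est;

    /* Estimate g: sustained cost per byte of bulk remote transfers. */
    upc_barrier;
    t0 = wall_time();
    for (i = 0; i < NREPS; i++)
        upc_memput(&buf[peer * NBYTES], local_buf, NBYTES);
    upc_fence;                      /* wait for the puts to complete  */
    g_est = (wall_time() - t0) / ((double) NREPS * NBYTES);

    /* Estimate L: average cost of a full barrier synchronization. */
    upc_barrier;
    t0 = wall_time();
    for (i = 0; i < NREPS; i++)
        upc_barrier;
    L_est = (wall_time() - t0) / NREPS;

    if (MYTHREAD == 0)
        printf("g ~ %.3e s/byte, L ~ %.3e s\n", g_est, L_est);
    return 0;
}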
BSP summary:
• Parameters used – network bandwidth, network latency, sequential code
performance
• Machine-friendliness – medium to high; supersteps of an algorithm may be
detected automatically by analyzing a trace file or the source code directly
• Average error – within 20% for nominal cases; error can grow to 70% in cases
where parameters not captured by the model negatively affect performance or
costs are overestimated by the model
• Speed – extremely fast (entirely analytic)
5.2.3 LogP
The LogP model [5.5.15] was created after its authors noticed a trend in parallel
computer architecture in which most machines were converging towards
thousands of loosely-coupled processors connected by robust communication
networks. Since these types of architectures will scale to the thousands of nodes
(but not millions), the authors theorized that a large number of data elements will
need to be assigned to each processor. Also, since network technology
generally lags significantly behind the speed available for processor-memory
interconnects, the cost of network operations is very high compared to the cost of
local operations. In addition, since adaptive routing techniques such as
wormhole and cut-through routing make topologies less important to overall
performance, a parallel performance model need not take into account overall
topology. Finally, due to the range of programming methodologies in use, the
authors decided to make their performance model applicable to all styles of
parallel programming.
The parameters used in the LogP model are entirely network-centric: an upper
bound on network latency L, the overhead of sending a message o, the minimum
gap between messages that can be sent by a processor g, and the number of
processors P. In addition, the network is assumed to have a limited capacity
where a maximum number of messages are allowed to be transferred at once.
Processors are delayed if a network is saturated.
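A standard illustration of how the four parameters combine (not a result quoted from [5.5.15]) is the cost of point-to-point communication: a single small message costs one send overhead, the network latency, and one receive overhead, while a train of k messages injected by one sender is limited by the gap:

T_1 \;=\; L + 2o,
\qquad
T_k \;\approx\; L + 2o + (k - 1)\,g \quad (\text{assuming } g \ge o).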
Many extensions to LogP have been proposed by various researchers. LogGP
[5.5.16] handles longer messages in a different manner, reflecting the common
practice of switching communication protocols for larger messages. LoPC
[5.5.17] and LogGPC [5.5.18] incorporate a contention parameter that captures
contention in an effort to model systems that use active messages. LognP
[5.5.19] attempts to capture the overhead of communicating messages that are
not in contiguous memory space. LogP has even been applied to modeling
memory hierarchies [5.5.20], although the memory LogP model only accurately
predicts regular memory access patterns.
Due to its simplicity and limited parameters, the LogP model is very
approachable and easy to work with. Also, the model encourages such things as
contention-free communication patterns which other models such as BSP and
PRAM ignore.
In general, the accuracy of LogP in predicting network
performance is usually good; Culler et al. predicted sorting performance on a
512-node CM-5 machine to within 12% accuracy [5.5.21].
Unlike the BSP model, LogP captures details of every message that is
transferred on the network, and care must be taken to ensure that the capacity of
the network is respected. Also, no interdependence between messages is
directly captured using the LogP model. This increases the cost of automated
analysis.
LogP in and of itself has no provisions to predict computational performance of
algorithms. In addition, if we are tailoring our PAT to mainly shared-memory
architectures, the LogP model may not be a good fit for evaluating these
machines. Therefore, in order for LogP to be useful for our PAT, it must be
adapted for use with shared-memory machines and supplemented with another
model that can give a good representation of computational performance.
LogP summary:
• Parameters used – network latency, overhead, gap (bandwidth), number of
processors
• Machine-friendliness – high for network-specific evaluations, medium to low for
general evaluations; the model does not directly capture interdependencies
between messages, and it does not provide a way to evaluate computational
performance
• Average error – within 10% for predicting network performance; specialized
versions of the model may be necessary to accurately model specific hardware
• Speed – extremely fast (entirely analytic), although entire communication trace
files may need to be processed, which can slow things down
5.2.4 Other techniques
In this section, a brief overview of other general analytical models will be
presented and evaluated. The models presented here are not useful for
our PAT, but are included for completeness.
Clement and Quinn created a general analytical performance model for use with
the Dataparallel C language which they describe in [5.5.22]. Their model uses a
more complex form of Amdahl’s law which takes into account communication
cost, memory cost, and compiler overhead introduced by Dataparallel C
compilers. While their model is too simplistic to be useful, it is interesting that
they model compiler overhead directly, although it is characterized by a simple
scalar slowdown factor.
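To show the general shape such a model takes — this is our own illustrative form, not Clement and Quinn's actual equations — an Amdahl-style speedup expression can be augmented with the costs they mention, where f is the serial fraction, T_1 the sequential execution time, c a scalar compiler slowdown factor, and T_comm and T_mem the communication and memory costs:

\mathrm{Speedup}(p) \;=\;
  \frac{T_1}
       {\,c \left( f\,T_1 + \frac{(1 - f)\,T_1}{p} \right)
        + T_{\mathrm{comm}}(p) + T_{\mathrm{mem}}(p)\,}.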
A performance model tailored to a specific application is presented in [5.5.23].
While high accuracy was obtained for the model, creating application-specific
models is both time consuming and difficult. This illustrates that application-specific models do not present a tractable design strategy, especially in the
context of a PAT.
Sun and Zhu present a case study of creating a performance model and using it
to evaluate architecture tradeoffs and predict performance of a Householder
transformation matrix algorithm [5.5.24]. The model they presented used simple
formulas to describe performance, and the resulting accuracy for the model was
not very high. In addition, the authors only presented results for a particular
architecture and did not attempt to predict the performance of their application on
different hardware. The case study, however, is still useful for its pedagogical
value.
Kim et al. present a detailed performance model for parallel computing in
[5.5.25]. Their method involves a general strategy for the creation of highly
accurate models, but the method requires considerable modeling effort on the
part of the user. They categorize the models created with their strategy as
“Parametric Micro-level Performance Models.” They use matrix multiplication, LU
decomposition, and FFT kernels to evaluate their modeling strategy and find that
their models were accurate to within 2%. However, this high accuracy comes at
the price of very complex models – the end of their paper contains seven full
pages of equations that represent the models created for the scientific kernels.
This illustrates a difficult problem associated with the creation of analytical
models from scratch: in order to get high accuracy in analytical models, it is often
necessary to use complicated statistical techniques to capture the nonlinear
performance of parallel machines.
5.3 Predictive performance models
In this section, we present predictive performance models, which we define as
models that are used to predict performance of parallel codes on existing
hardware. We differentiate the models presented here from the models
presented in Section 5.2 by specifying that the models presented here are
created specifically for predicting performance of parallel codes.
5.3.1 Lost cycles analysis
Crovella and LeBlanc invented lost cycles analysis [5.5.26, 5.5.27] as a way to
determine all sources of overhead in a parallel program. The authors of this
paper noted that analytic performance models were not being widely used as a
development tool. They reasoned that because analytical models generally
emphasize asymptotic performance, assume that particular overheads dominate
over many cases, and may be difficult to work with, parallel programmers avoid
using them. To remedy these problems, they created a tool set to aid
programmers in applying their lost cycles analysis method on real-world
applications.
In lost cycles analysis, a distinction is made between pure computation and all
other aspects of parallel programming. Any part of a parallel program that is not
directly related to computation is deemed overhead and is labeled as “lost
cycles.” To further classify these lost cycles, the authors came up with the
following categories of lost cycles: load imbalance (idle cycles spent when work
could be done), insufficient parallelism (idle cycles spent when no work is
available), synchronization loss (cycles spent waiting for locks or barriers),
communication loss (cycles spent waiting for data), and resource contention
(cycles spent waiting for access to a shared resource). The authors make one
interesting assertion: they insist the categories chosen to classify the lost cycles
must be complete and orthogonal. That is, the categories must be able to
classify all sources of lost cycles, and must not overlap at all. They also assert
the five categories they chose (which are mentioned above) fulfill these
properties.
Measurement of lost cycles is accomplished via the pp tool. Instrumentation
code is inserted into the parallel application code that sets flags at appropriate
times. The instrumentation code comes in the form of flags which report the
current state of execution of the processor. Example flags are Work_Exists,
Solution_Found, Busy, Spinning, and Idle. In an early implementation
[5.5.28], the programmer is expected to add the instrumentation code to set the
flags at appropriate places throughout the code, although in a later
implementation these predicates were handled via general library calls inserted
at various places in the code (e.g., before and after parallel loops). During the
execution of the parallel code, the values of the flags are either sampled or
logged to a file for later analysis.
Some flags are necessarily measured in machine-specific ways. For example, to
measure communication loss on the shared-memory machine used in the paper,
dedicated hardware was used to count the number of second-level cache misses
(which incur communication) and record the time necessary to service them.
Resource contention was measured indirectly by dividing the time taken to
service the second-level cache miss by the optimal time taken to service a
second-level cache miss on an unloaded machine.
Once the values for the flags are recorded, they are fed into predicates which
allow the computation of lost cycles. An example predicate is:
Load_Imbalance(x) ≡ Work_Exists ∧ Processors_Idle(x).
The values for the lost cycles in each category along with the values recorded for
pure computation are reported back to the user in units of seconds of execution.
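The sketch below shows how sampled flag values could be folded into the lost-cycles categories. The structure and function names are hypothetical and do not reflect the actual interface of the pp tool; they only illustrate that classification amounts to evaluating simple predicates over the recorded state.

#include <stdio.h>

/* Hypothetical per-sample state recorded by the instrumentation. */
struct sample {
    int    work_exists;      /* unfinished work is available somewhere       */
    int    processor_idle;   /* this processor is idle during the sample     */
    int    spinning;         /* this processor is spinning on a lock/barrier */
    double seconds;          /* wall-clock time covered by the sample        */
};

/* Accumulators for three of the five lost-cycles categories. */
static double load_imbalance_s, insufficient_parallelism_s, sync_loss_s;

static void classify(const struct sample *s)
{
    if (s->spinning)                               /* synchronization loss     */
        sync_loss_s += s->seconds;
    else if (s->processor_idle && s->work_exists)  /* load imbalance           */
        load_imbalance_s += s->seconds;
    else if (s->processor_idle)                    /* insufficient parallelism */
        insufficient_parallelism_s += s->seconds;
    /* communication loss and resource contention would be handled
       analogously, using machine-specific measurements */
}

int main(void)
{
    struct sample demo = { 1, 1, 0, 0.010 };  /* idle while work exists */
    classify(&demo);
    printf("load imbalance: %.3f s\n", load_imbalance_s);
    return 0;
}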
The lca program is then used to analyze data from different runs when one
system parameter was varied (e.g., number of processors or data set size). After
lca finishes its analysis, it provides the user with a set of equations describing
the effect of changing the parameter that was varied earlier, along with
goodness-of-fit information that gives statistical confidence intervals on the
equations that were presented. It is important to note the lca program does not
support multiple variables directly. It is up to the user to correlate separate
variables based on the information lca reports when varying each variable.
Although lost cycles analysis is presented as a method for predicting program
performance, it is restricted to predicting performance of that code on a single
machine. The accuracy of this relatively simple method is surprisingly good; for a
2D FFT, the authors were able to obtain an average prediction error of
12.5%. The accuracy was high enough for the authors to productively evaluate
two different FFT implementations with success. This is no small feat, as FFT
implementations tend to be orders of magnitude more complex than the general
program codes used as illustrative examples by most performance models. In
general, it seems lost cycles analysis is useful as an analysis tradeoff study
technique, but as a performance model it is limited.
Lost cycles summary:
• Parameters used – load imbalance, insufficient parallelism, synchronization
cost, communication cost, resource contention
• Machine-friendliness – medium to high; library calls must be inserted but are
generally easy to add and may be automated, although recording some flags,
such as communication costs on shared-memory machines or whether a
processor is blocked, may be tricky depending on what facilities the architecture
provides
• Average error – 12.5% for a nontrivial example
• Speed – fast, but requires several initial execution runs to gain accuracy
5.3.2 Adve’s deterministic task graph analysis
Adve introduced what he termed deterministic task graph analysis as part of his
PhD thesis work [5.5.29]. In Adve’s method, a task graph for a program is
created which precisely represents the parallelism and synchronization inherent
in that program. Task graphs are visual representations of parallel program
codes. Each node in the graph represents a task that needs to be accomplished,
and edges in the graph represent communication dependencies. The task graph
assumes that execution of the program will be deterministic. As motivation
behind this assumption, Adve noted that while many stochastic program
representations were able to capture non-determinism of application codes, in
order to keep the analysis of the models tractable, simplifying assumptions about
the nature of the work distribution or a restriction of the possible task graphs that
may be represented were employed. He suggests using mean execution times
as representations for stochastic processes, instead of analyzing them directly
using complicated stochastic methods. He also argues that while stochastic
behavior can affect overall performance, the effect of this is usually negligible on
overall execution time.
Adve’s performance models are composed of two levels: the first level contains
a lower-level (system-level, possibly stochastic) system resource usage model,
and the second level contains a higher-level deterministic model of program
behavior. He created a system-level model for a Sequent Symmetry platform
[5.5.30], and created high-level models for several program codes, including a
particle simulation program, a network simulation program, a VLSI routing tool, a
polynomial root finder with arbitrary precision, and a gene sequence aligning
program. The sizes of these test cases ranged from 1800 to 7200 lines of C
code, and the task graphs ranged from 348 to 40963 tasks. Each task graph
took under 30 seconds to evaluate, so the authors assert that task graphs are
efficient representations of parallel codes. Adve also asserts that the task graphs
can be used to evaluate overall program structure; in this manner, they may be
also used as an analysis tool for parallel codes.
Adve was able to obtain very good accuracies from his task graph models of the
programs he presented. His models predicted performance on 1024- and 4096-processor systems to within 5%, although the models used in the
analysis have been in existence since Adve’s original PhD thesis, a time period
of almost 10 years, so they may have been refined several times. In addition, no
coherent methods for automatically obtaining task graphs from existing programs
are mentioned. However, since evaluating a task graph is much easier than
generating one, if we are able to generate task graphs based on information
contained in a program’s trace file, this method may provide a tractable way of
including a performance model in our PAT, as long as we are also able to invent
a method for generating system models. In general, the method Adve presents
is open-ended enough to be used in many situations, although the quality of
predictions made from this method is highly dependent on the quality of the
models used to represent the programs and systems being evaluated.
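To illustrate why evaluating a deterministic task graph is inexpensive, the sketch below computes the predicted makespan of a small graph in a single pass over the tasks in topological order. It captures only the dependence-and-cost aspect of the approach and uses our own hypothetical data structures, not Adve's.

#include <stdio.h>

#define MAX_PRED 4

struct task {
    double cost;            /* mean execution time of the task */
    int    npred;           /* number of predecessor tasks     */
    int    pred[MAX_PRED];  /* indices of the predecessors     */
    double finish;          /* computed earliest finish time   */
};

/* Tasks are assumed to be listed in topological order: each task starts
   when its last predecessor finishes. */
static double evaluate(struct task *t, int n)
{
    double makespan = 0.0;
    for (int i = 0; i < n; i++) {
        double start = 0.0;
        for (int j = 0; j < t[i].npred; j++)
            if (t[t[i].pred[j]].finish > start)
                start = t[t[i].pred[j]].finish;
        t[i].finish = start + t[i].cost;
        if (t[i].finish > makespan)
            makespan = t[i].finish;
    }
    return makespan;
}

int main(void)
{
    /* Fork-join example: task 0 forks tasks 1 and 2, task 3 joins them. */
    struct task g[4] = {
        { 1.0, 0, {0},    0.0 },
        { 2.0, 1, {0},    0.0 },
        { 3.0, 1, {0},    0.0 },
        { 0.5, 2, {1, 2}, 0.0 },
    };
    printf("predicted makespan: %.2f\n", evaluate(g, 4));   /* prints 4.50 */
    return 0;
}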
Adve’s deterministic task graph analysis summary:
• Parameters used – no specific set required, although for modeling the Sequent
Symmetry Adve used weighted averages of memory access characteristics in
addition to other unspecified metrics
• Machine-friendliness – medium to low; no unified procedure for creating
system-level models is given (although it may be feasible to automate this at the
expense of accuracy), and generating application task graphs may require
human interaction, although it may be possible to derive a task graph from a
program’s trace
• Average error – less than 5% for nontrivial examples
• Speed – moderate to fast; 30 seconds for a large program
5.3.3 Simon and Wierum’s task graphs
Simon and Wierum noted that the BSP, LogP, and PRAM models ignored
multilevel memory hierarchies, which they felt was a large omission because
node design is an important aspect of parallel computer performance. They
decided to create a performance model [5.5.31] based on task graphs that
targets a distributed-memory architecture consisting of shared-memory, specifically
Symmetric Multi-Processor (SMP), nodes.
In their model, parallel programs are represented by task graphs.
Communication between processors is modeled via a function that uses
parameters for message size and distance between the sender and receiver.
Multiprocessing in each SMP node is modeled by an ideal abstracted scheduler.
Finally, resource contention inside SMP nodes is modeled by a closed system of
queuing networks.
The authors assume that tasks are mapped to processors statically and cannot
change during the execution of the program. Control flows are approximated
using mean-value analysis. Since task graphs for real programs can be
overwhelmingly large, they are reduced by allowing loops to be expressed
compactly. In addition, nodes that are not in the critical path of execution may be
eliminated from task graphs to reduce their size.
Once a task graph is created for a program, microbenchmarks are used to
evaluate particular performance metrics of the target hardware. These
microbenchmarks measure the time taken for a mixture of load, store, and
arithmetic operations of various sizes. The authors mention the benchmarks
investigate all levels of the machine’s memory hierarchy. Unfortunately, the
authors do not provide details on the specifics of the benchmarks they used or
the exact metrics collected. After the data for the microbenchmarks is collected,
the data is used to evaluate the task graph, and a prediction of the runtime of the
application is given.
As a test case, the authors apply their model to the LU decomposition code from
the Linpack benchmark suite. The LU decomposition code is a compact
scientific computation kernel, so one would expect the model would be
straightforward to apply to this code. Unfortunately, the authors’ test case
reveals a significant problem with their model: it is very difficult to use.
Constructing a task graph from code is a labor-intensive process, and when
coupled with high-level languages it is not clear if it is possible at all to
automatically generate task graphs. While the authors were able to achieve high
prediction accuracy for their test case (within 6.5%), significant manual analysis
was required to obtain that accuracy. The detail presented in their models does
an excellent job of predicting performance of codes, but makes the model difficult
to work with. Also, one potential major problem is the use of mean-value
analysis to approximate algorithm control flow. This worked well for the LU
computation kernel, but for more nondeterministic codes it may prove to be a
source of model errors. Because of these reasons, we do not suggest
incorporating this model into our PAT.
Simon and Wierum’s task graph summary:
• Parameters used – memory performance, integer performance, floating point
performance, message size, distance between sender and receiver, resource
contention for SMP nodes via a closed queuing network
• Machine-friendliness – very low
• Average error – excellent (below 6.5%), although mean-value analysis may
introduce more significant errors in nondeterministic code
• Speed – n/a; evaluation of the models is fast, but creation of task graphs is
tedious and cannot be easily automated, even if a program’s trace is available
5.3.4 ESP
Parashar and Hariri have implemented a performance prediction tool as part of a
development environment for High-Performance Fortran systems named ESP
[5.5.32, 5.5.33]. Their tool operates at the source-code level, interpreting the
source code to predict its performance on an iPSC/860 hypercube system.
In their interpretive environment, the application is abstracted by extracting
properties from the source code. The system on which the user wishes to
simulate the application is modeled by a System Abstraction Graph (SAG),
whose nodes abstract part of the machine’s overall performance. An interesting
idea presented here is that each node in the SAG uses a well-defined interface,
so each node may use any technique it wishes to report predicted information.
The interface that is used for each node is composed of four components:
processing, memory, communication/synchronization, and input/output. Deeper
nodes in the SAG represent finer and finer models of specific components of the
target architecture to be evaluated. Applications are modeled in a similar way
using an Application Abstraction Graph (AAG). Nodes in the AAG represent
actions the program takes, such as starting, stopping, sequential code execution,
synchronization, and communication.
Information from the AAG is merged with the SAG to apply system-specific
characteristics to the application to form the Synchronized Application
Abstraction Graph (SAAG). The SAAG is then fed into the interpreter, which
uses the features provided by the SAG to estimate the performance of the
application. Most of the nodes in the SAG graph use simple analytical formulas
to predict performance of specific actions taken by a HPF program.
The overall accuracies obtained by ESP are respectable. Errors for the test
cases used by the authors were at most 20%, with most errors being under 10%.
The model itself is not immediately useful to our PAT, but ESP illustrates that it is
feasible to successfully incorporate a predictive performance model into a PAT.
In addition, the notion of using a standard interface for system models is an
interesting one, since it allows increased accuracy for parts of the models by
allowing simulation where needed.
ESP summary:
• Parameters used – raw arithmetic speeds, overhead for loops and conditional
branches, function call overhead, memory parameters (cache size, associativity,
block size, memory size, cache miss penalties for reads and writes, main
memory costs for reading and writing data, TLB miss overhead), network
parameters (topology, bandwidth, router buffer sizes, communication startup
overhead, per-hop overhead, receive overhead, synchronization overhead,
broadcast and multicast algorithms)
• Machine-friendliness – medium to low; requires a Fortran source code parser
• Average error – below 10% in most cases
• Speed – medium to slow; interpretive models may take a long time to evaluate
depending on how many simulative components exist for the system models
5.3.5 VFCS
In the early 1990s, Fahringer, Blasko, and Zima from the University of Vienna
created a specialized Fortran compiler they dubbed the Vienna Fortran
Compilation System (VFCS). VFCS is a Fortran compilation system that
automatically parallelizes Fortran77 codes based on a predictive performance
model and sequential execution characteristics. Fahringer et al. also refer to the
VFCS as SUPERB-2, illustrating that it is a second-generation version of an
earlier compilation system developed at the University of Vienna named
SUPERB.
In an earlier implementation of the VFCS [5.5.34], an Abstract Program Graph
(APG) is constructed for each Fortran program to be parallelized. The
unmodified Fortran77 code is instrumented and run via the Weight Finder, which
captures information such as conditional branch probability, loop iteration counts,
and estimated execution times for basic code blocks. The Weight Finder
requires a representative set of data for the profiling stage, or the effectiveness of
the VFCS is limited. The information obtained from the Weight Finder is merged
into the APG and simulated using a discrete-event simulator, which also uses
machine-specific metrics like network topology, pipeline behavior, and number of
processors. The simulator then outputs the predicted performance for the user.
More recent versions of the VFCS incorporate P3T [5.5.35], which is the
Parameter-based Performance Prediction Tool. Instead of using simulation to
perform performance predictions on a program graph, an analytical model is
used to predict performance based on certain metrics collected in the profiling
stage by the Weight Finder. These metrics are classified into two categories.
The first category, which represents machine-independent information, contains
the work distribution of a program (based on an owner-compute rule and a block
decomposition of arrays in the Fortran program), number of data transfers
performed by the program, and the amount of data transferred. The second
category, which represents machine-dependent information, contains a basic
model of network contention, network transfer times, and number of cache
misses for the program. It is important to note that scalar values are reported for
each of the metrics listed above. The accuracies the authors reported for their
analytical models were within 10%, although the codes they evaluated were
small scientific kernels consisting of at most tens of lines of code. Fahringer has
also incorporated a graphical user interface (GUI) for P3T which helps a user
tune the performance of their Fortran code, and also extended it to work on a
subset of HPF code [5.5.36]. Blasko has also refined the earlier discrete-event
simulation model with the addition of PEPSY (PErformance Prediction SYstem)
[5.5.37], keeping with the tradition of esoteric naming conventions for VFCS-related components.
The VFCS system and its related components illustrate a variety of interesting
techniques for predicting parallel program performance and applying it to guide
automatic parallelization and performance tuning. However, the authors never
successfully demonstrate their system for anything but small scientific
computation kernels. In addition, their tool is limited to working with Fortran
code. Therefore, even though it is interesting to examine the techniques used in
the VFCS set of tools, they are not directly applicable to our PAT.
VFCS summary:
• Parameters used – conditional branch probability, loop iteration counts, and
estimated execution times for basic code blocks for the Weight Finder; network
topology, pipeline behavior, and number of processors for simulations; work
distribution of a program, number of data transfers performed by the program,
the amount of data transferred, network contention, network transfer times, and
number of cache misses for the program for the analytical model
• Machine-friendliness – very high, but requires integration with a Fortran
compiler; specific to Fortran
• Average error – below 10% for simple scientific kernels
• Speed – fast for the analytical model, unspecified for the simulative model
5.3.6 PACE
PACE, or PerformAnCE analysis environment, is a tool for predicting and
analyzing parallel performance for models of program codes [5.5.38]. PACE
uses a hierarchical classification technique similar to our layers, in which
software and hardware layers are separated with a parallel template layer.
Parallel programs are represented in the template layer via a task graph, similar
to the task graphs used by Adve and Simon and Wierum. The PACE
environment offers a graphical environment in which to evaluate parallel
programs. Abstract task graphs are realized using a performance modeling
language CHIP3S, and models are compiled and used to predict performance of
the application.
An updated version of PACE has been targeted as a generic tool for modeling
parallel grid computing applications, rather than a specific system for
automatically doing such [5.5.39]. PACE provides a few domain-specific
modeling languages, including PSL (Performance Specification Language) and
HMCL (Hardware Model Configuration Language), but like Adve’s
deterministic task graph method, no standard way of modeling specific
architectures is presented.
Based on the papers available describing PACE, it seems as though PACE is
just an integration of several existing performance tools to provide a system for
prediction the performance of grid applications. However, PACE also introduces
a novel idea: predictive traces [5.5.40]. In this approach, trace files are created
when running application models. These predictive trace files use general
formats such as Pablo’s Self-Defining Data Format (SDDF) or Paragraph’s
Portable Instrumented Communication Library (PICL). This increases the value
of predictions to the user, since the predictions can be viewed in the exact same
manner as regular performance traces.
Since PACE is targeted for grid computing, we will not specifically consider its
inclusion in our PAT. However, predictive traces are an interesting idea that
could potentially add value to a performance model included in a PAT.
PACE summary:
• Parameters used – application execution characteristics (for loops, etc.) for
program models, unspecified for system models
• Machine-friendliness – medium to low; generation of models requires much
user interaction
• Average error – within 9% for the examples given, although specifics of the
system models used are not given, so it is difficult to gauge this accuracy
• Speed – fast (single-second times were reported in [5.5.38])
5.3.7 Convolution method
The Performance Evaluation Research Center (PERC) has created an
interesting method for predicting performance which is based on the idea of
convolution [5.5.41]. In this method, an “application signature” is collected from
an application, which is combined (convolved) with a machine profile to give an
estimate of execution performance.
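In its simplest illustrative form — a simplification of the idea rather than PERC's published equations — the convolution combines operation counts n_j taken from the application signature with measured rates r_j taken from the machine profile:

T_{\mathrm{predicted}} \;\approx\;
  \sum_{j \in \{\mathrm{memory},\,\mathrm{network},\,\mathrm{fp}\}} \frac{n_j}{r_j}.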
The convolution method uses existing tools to gather application signatures and
machine profiles. Machine profiles are gathered using MAPS, which gathers
memory performance of a machine, and PMB, which gathers network
performance of a machine. Application signatures are captured using MetaSim
tracer, which collects memory operations from a program, and MPIDtrace, which
collects MPI operations performed by a program. The communication-specific
portions of the application signature and machine profile are convolved together
using DIMEMAS, and the memory-specific portions of the signature and profile
are convolved using MetaSim convolver. Currently, MetaSim tracer is limited to
the Alpha platform, but the authors suggest that future versions of MetaSim
tracer may use Paradyn’s DynInst API, which would open the door to more
platforms.
The convolution method provides accuracies within 25%, which is surprising
considering that the model only takes into account memory operations, network
communications, and floating point performance. In general, the model accuracy
was good when weak scaling was used (problem size increases with system size) but
suffered when strong scaling was applied to the problem (problem size stays
fixed with increases in system size). However, one drawback to this method is
that it requires programs to be executed before performance predictions may be
carried out, which ties the quality of the tool’s predictions to the quality of the
workloads used during the profiling runs.
Simulation is used to determine network and computational
performance. Because of this, the time to perform the convolutions may vary widely
based on the properties of the simulators used. In addition, no standard set of
convolutions is suggested; the authors used application-specific
convolution equations to predict performance of different parallel application
codes. Because of the possible large cost of the simulative aspects of this
method, it seems unlikely that it will be feasible to use it in a PAT. In addition,
the general approach used seems to be used in almost all other performance
prediction frameworks under a different name; the nomenclature used to
describe the separation of machine- and application-specific models is new, but
the general idea is old.
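To make the convolution idea concrete, the sketch below shows the simplest possible form of it: a predicted execution time computed as the dot product of per-category operation counts (the application signature) and per-category costs (the machine profile). The category set, the linear form, and the placeholder numbers are illustrative assumptions on our part, not the actual equations used by the PERC tools.

/* Minimal convolution sketch: predicted time is the sum of per-category
 * operation counts (application signature) weighted by per-category costs
 * from the machine profile. */
#include <stdio.h>

#define NUM_CATEGORIES 3   /* e.g., memory ops, network ops, floating point */

double predict_time(const double counts[NUM_CATEGORIES],
                    const double cost_per_op[NUM_CATEGORIES])
{
    double t = 0.0;
    for (int i = 0; i < NUM_CATEGORIES; i++)
        t += counts[i] * cost_per_op[i];
    return t;
}

int main(void)
{
    double signature[NUM_CATEGORIES] = { 1.0e9, 2.0e6, 5.0e8 }; /* placeholder counts from tracing */
    double profile[NUM_CATEGORIES]   = { 5e-9, 2e-5, 1e-9 };    /* placeholder costs from benchmarks */
    printf("predicted time: %.2f s\n", predict_time(signature, profile));
    return 0;
}

In practice the method replaces fixed per-operation costs with simulated memory and network behavior, which is where the large evaluation cost noted above comes from.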
Convolution method summary:
- Parameters used – network usage, memory performance, floating point performance
- Machine-friendliness – high as long as tools are supported on platforms of interest
- Average error – within 25% for matrix multiplication
- Speed – not reported; possibly slow due to inclusion of simulative techniques
5.3.8 Other techniques
In this section, we will present a brief overview of other predictive performance
models and quickly evaluate them. The models presented here are not useful for
our PAT, but are included for completeness.
Howell presents an MPI-specific performance prediction tool in his PhD thesis
[5.42]. He uses microbenchmarks to gather performance data for MPI codes.
He then provides several equations that can be used to predict the performance
of MPI-related function calls. These equations may be used directly (a simple
graphing package is also provided), or used in the context of his “reverse
profiling” library. Howell’s reverse profiling library is an interesting idea; in it, MPI
code is linked against an instrumented library and run. When MPI commands
are intercepted by the library, the equations derived earlier are used to simulate
delays, and the delays are reported back to the user. This interesting hybrid
between an analytical model, profiling, and simulation allows a user to quickly
predict the performance of their MPI codes on different network hardware. In
addition, Howell also provides a simulation tool to directly simulate the MPI code
and provide traces. Reverse profiling is an interesting technique, but would be
much harder to implement for languages that have implicit communication (such
as UPC).
Grove also presents an MPI-specific performance prediction technique in his
PhD thesis [5.43]. He creates what he terms a Performance Evaluating Virtual
Parallel Machine (PEVPM), which processes special comments in the source
code to quickly predict the performance of an MPI program. Sequential sections
of program code are labeled with comments indicating how long they take to
evaluate, and MPI communication calls are simulated using a probabilistic
general communication model that takes into account the type of MPI call, the
number of messages in transit, and the size of data being transferred. As with
Howell’s work, microbenchmarks are used to predict communication
performance (in addition to other information collected by the execution
simulator). Again, this approach would be difficult to apply to UPC because it
requires that the user know explicitly when communication is taking place. In
addition, the details of sequential code performance are abstracted away entirely,
requiring the user to specify how long each sequential code statement takes.
This approach does represent a low-cost method of performance modeling and
prediction for SHMEM, though, and may be useful for a SHMEM prototype system.
Models specific to cluster architectures are presented by Qin and Baer in [5.44]
and by Yan and Zhang in [5.45]. Qin and Baer created simple analytical
equations for distributed shared-memory systems operating in a cluster
environment. However, their analytical models also require detailed hardware
and network models to be used with trace-driven simulation. In addition, they did
not calibrate their models to real systems, so the accuracy of their approach is
unknown. Yan and Zhang use a model of bus-based Ethernet clusters to predict
the performance of applications run over non-dedicated clusters. Their
communication model treats communication times as nondeterministic, but in
practice uses averages to approximate the communication costs. Their
approach achieved accuracy within about 15%, but the sequential code time prediction
they used requires an instruction mix for each program to be evaluated.
Such instruction mixes can sometimes be costly or impossible to obtain for large
parallel codes due to the non-determinism and large scales associated with
them.
A simulation tool for parallel codes named COMPASS is presented in [5.46],
which is being used in the POEMS (Performance Oriented End-to-end Modeling
System) project. The simulation tool is able to achieve high accuracy (within 2% for
the Sweep3D application), but requires the use of multiple nodes during
simulation to keep simulation times manageable. In addition, their simulation
tool requires detailed network and hardware models to be created and so does
not represent a viable strategy for inclusion into our PAT, since the work required
to create accurate models can easily become unwieldy.
An earlier simulation tool is presented by Poulsen and Yew in [5.47]. This tool
is execution-driven and supports three types of simulation: critical-path
simulation, execution-driven trace generation, and trace-driven simulation. Of
the three, critical-path simulation is the most interesting: instrumentation
computes the earliest time at which each task in a given parallel code could
execute, given parameters about the hardware and network being used, which
yields the minimum parallel execution time for the program. However, like all
simulation techniques, this requires detailed models, and it also requires an initial
run with instrumented code. In this respect, it seems that related techniques
such as lost cycles analysis can accomplish a similar objective with less overhead.
Brehm presents a prediction tool for massively parallel systems named PerPreT
in [5.48]. This prediction tool operates on task graphs represented in a C-like
language. It incorporates an analytical model to estimate communication
performance, although the analytical model ignores network contention. PerPreT
can be thought of as a specific version of Adve’s general deterministic task graph
method. The method has decent accuracy (within 20% for matrix multiplication
and conjugate gradient codes), but requires expressing the programs being
evaluated in PerPreT’s specialized language. Since it is not likely this can be
automated, PerPreT is not an ideal candidate for inclusion into our PAT.
A different approach to performance prediction, taken by Kuhnemann et al., is to
predict performance from source code alone [5.49]. Their approach uses the
SUIF system (Stanford University Intermediate Format) to generate parse trees of
C code. The parse trees of these codes are analyzed and costs are associated
with each node in the parse tree. Computation is modeled using simple
machine-specific parameters, and communication is modeled using analytical
equations with parameters derived from microbenchmarks. While this approach
is interesting, it only supports MPI codes. However, this approach does validate
the usefulness of the SUIF source code parser.
5.4 Conclusion and recommendations
In this section, we have presented an overview of performance modeling and
predictive techniques. We categorized the existing performance models into
three categories: formal models, general analytical models, and predictive
models. For each model presented, we evaluated it against our criteria set forth
in the beginning of this section.
In general, most of the analytical models presented in this section do not apply
well within the context of a PAT. The formal models presented in Section 5.1
require too much user interaction for them to be useful. Most of the analytical
models presented in Section 5.2 are either too simplistic or would take too much
work to provide working implementations. The predictive models in Section 5.3
are the most useful for a PAT, but many of the models presented need user
interaction in order to be useful, or they use detailed simulation models
which will take too long to create and run.
We feel it is entirely necessary to choose a method that creates models for the
user automatically or at least with the help of minimal comments in the user’s
source code. Creating detailed, accurate models is feasible for researchers
studying in the area of performance modeling and prediction, but most users will
not have the time or desire to do this. If our performance tool needs to require any
input at all from the user, it should be as quick and painless as possible for the
user to provide.
There are a few attractive options that we believe would fit well within a PAT.
Lost cycles analysis is an especially promising method, since it is a model that
can not only predict performance, but help the user improve their program code
because it doubles as an analysis strategy. In addition, it does not seem it would
be that difficult to implement, especially when compared with other techniques
that depend on the generation of deterministic task graphs. If we are going to be
creating instrumented forms of UPC runtime environments or methods that
accomplish the same goal, we should be able to easily insert extra code to
record the simple states needed for lost cycles analysis.
Lost cycles analysis also presents an easy-to-understand metaphor to the user
that illustrates where performance is being lost in a parallel application.
However, one problem that we will need to address before implementing this
strategy is how to report a lost cycles analysis with granularity finer than at the
application level, since lost cycles will be more useful if we can report data at the
function- or loop-level granularity. If our PAT is going to support tracing (which in
all likelihood it will), the simplest solution to this problem is to perform the lost cycles
analysis entirely post-mortem, based on information provided in the trace files.
As long as our trace files can relate data back to the source code, we can give a
lost cycles analysis for arbitrary subsections of the source code. In any case, the
lost cycles analysis should be an optional feature the user can turn on and off at
will in case the method introduces measurement overhead.
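As a rough illustration of why we expect the implementation cost to be low, the following sketch shows a post-mortem lost cycles tally over trace data. It assumes, hypothetically, that each trace interval already carries an overhead category assigned at instrumentation time; the record layout and function names are our own, not Crovella and LeBlanc's.

/* Post-mortem lost cycles tally: accumulate time per overhead category
 * from trace intervals and report each category's share of execution time. */
#include <stdio.h>

typedef enum {
    LC_LOAD_IMBALANCE,
    LC_INSUFFICIENT_PARALLELISM,
    LC_SYNCHRONIZATION,
    LC_COMMUNICATION,
    LC_RESOURCE_CONTENTION,
    LC_NUM_CATEGORIES
} lc_category;

typedef struct {
    double start, end;     /* interval boundaries, in seconds */
    lc_category category;  /* overhead class assigned at instrumentation time */
} trace_interval;

void lost_cycles_report(const trace_interval *iv, int n, double total_time)
{
    double lost[LC_NUM_CATEGORIES] = { 0 };
    for (int i = 0; i < n; i++)
        lost[iv[i].category] += iv[i].end - iv[i].start;
    for (int c = 0; c < LC_NUM_CATEGORIES; c++)
        printf("category %d: %.1f%% of execution time\n",
               c, 100.0 * lost[c] / total_time);
}

Restricting the same tally to the intervals belonging to one function or loop is what would give the finer-grained reporting discussed above.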
Another possible way to include a performance model in our PAT is to use one of
the simpler methods which rely on comments inserted in the source code of the
user’s program. This should not impose too much overhead on the user,
especially if our PAT is able to provide detailed timing information back to the
user for subsections of code. This method would be attractive because it
represents a low-cost solution: parsing source code comments is not difficult
provided they are in a well-defined format, and the data collected could be
plugged into any arbitrary modeling technique we wish, be it a simple analytical
model or a more detailed simulative model. This could provide an “upgrade path”
where either an analytical model or a simulative model could be used, depending on
how much accuracy the user needs and how long an evaluation time they can
tolerate.
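The sketch below illustrates the kind of structured comment such an approach might use. The @model comment syntax is purely hypothetical, something we would have to define ourselves; a pre-processor would parse these comments, associate the estimates (or measured times) with the named region, and feed them to whichever analytical or simulative model is in use.

/* Hypothetical structured comments marking a region for the model. */
void compute_phase(double *a, const double *b, const double *c,
                   const double *d, int n)
{
    /* @model region=compute_phase  est_time_per_iter=2.5us */
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
    /* @model end region=compute_phase */
}

Because the annotations live in ordinary comments, the code compiles unchanged when the pre-processor is not used.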
One possible wrinkle in the source code annotation plan is how to handle
modeling implicit communications in UPC. One possible way to deal with this is
to analyze the trace files generated by an actual run and correlate them with the
source code. If this could be implemented, it probably would not be too hard to
also include computational metrics from the trace files in the modeling process,
thus obviating the need for the user to do anything at all in order to use the
performance model. However, such a system would undoubtedly be extremely
difficult to implement, and as such is out of the scope of our project.
Finally, if we are able to automate the generation of task graphs and system
models for a particular program and architecture, using Adve’s deterministic task
graphs may provide a relatively low-cost solution for performance prediction.
The task graphs generated may also be a useful visualization tool, because they
provide a high-level view of the parallelism and synchronization that occurs
during a program’s execution. Adve’s method seems like it would be best used if
complemented with other models (such as LogP and memory hierarchy models),
since it provides a general framework in which to represent a program’s structure
and predict its execution time.
In summary, lost cycles analysis provides a useful performance model that can
be implemented with relatively low cost (under a few key assumptions). Since
lost cycles analysis also doubles as an analysis strategy, we feel it would fit very
well in our PAT. Therefore, we are recommending it as the most likely candidate
if we are to include a performance model in our PAT.
Table 5.1 - Summary of performance models

Model name | Model type | Parameters used | Machine-friendliness | Accuracy | Speed | Section
Adve's task graphs | Predictive | No specific set required | Medium to low | Less than 5% | Moderate | 5.3.2
BSP | General analytical | Network bandwidth and latency, sequential code performance | Medium to high | Within 20% for most cases | Fast | 5.2.2
Convolution | Predictive | Network usage, memory performance, floating point performance | High | Within 25% | Not reported; possibly slow | 5.3.7
ESP | Predictive | (many) | Medium to low | Within 10% | Medium to slow | 5.3.4
Formal models | Formal | Varies | Very low | Varies | Low | 5.1
LogP | General analytical | Network latency, overhead, gap (bandwidth), number of processors | High for network-specific evaluations, low in general case | Within 10% for predicting network performance | Fast | 5.2.3
Lost cycles | Predictive | Load imbalance, insufficient parallelism, synchronization cost, communication cost, resource contention | Medium to high | 12.5% for FFT | Fast; requires several initialization runs | 5.3.1
PACE | Predictive | Application execution characteristics, unspecified system model parameters | Medium to low | Within 9% | Fast | 5.3.6
PRAM | General analytical | Coarse model of memory access and sequential execution time | Medium to low | Within an order of magnitude | Fast | 5.2.1
Simon and Wierum's task graphs | Predictive | Memory performance, floating point performance, integer performance, message size, distance between sender and receiver, resource contention | Very low | Within 6.5% | Fast for evaluation, very slow for creating models | 5.3.3
VFCS | Predictive | (many) | Very high; requires integration with Fortran compiler | Within 10% | Medium to fast | 5.3.5
5.5 References
[5.1] C. A. Petri, Kommunikation mit Automaten. PhD thesis, Universität Bonn, 1962.
[5.2] S. Gilmore, J. Hillston, L. Kloul, and M. Ribaudo, “Software performance modelling using PEPA nets,” in WOSP '04: Proceedings of the fourth international workshop on Software and performance, pp. 13–23, ACM Press, 2004.
[5.3] R. Milner, A Calculus of Communicating Systems, vol. 92 of Lecture Notes in Computer Science. Springer, 1980.
[5.4] C. A. R. Hoare, “Communicating sequential processes,” Communications of the ACM, vol. 21, no. 8, pp. 666–677, 1978.
[5.5] C. A. R. Hoare, Communicating Sequential Processes. Prentice Hall International, 1985.
[5.6] O. Boxma, G. Koole, and Z. Liu, “Queueing-theoretic solution methods for models of parallel and distributed systems,” in 3rd QMIPS workshop: Performance Evaluation of Parallel and Distributed Systems, pp. 1–24, 1994.
[5.7] A. J. C. van Gemund, “Performance prediction of parallel processing systems: the PAMELA methodology,” in ICS '93: Proceedings of the 7th international conference on Supercomputing, pp. 318–327, ACM Press, 1993.
[5.8] A. J. C. van Gemund, “Compiling performance models from parallel programs,” in ICS '94: Proceedings of the 8th international conference on Supercomputing, pp. 303–312, ACM Press, 1994.
[5.9] A. J. C. van Gemund, “Symbolic performance modeling of parallel systems,” IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 2, pp. 154–165, 2003.
[5.10] S. Fortune and J. Wyllie, “Parallelism in random access machines,” in STOC '78: Proceedings of the tenth annual ACM symposium on Theory of computing, pp. 114–118, ACM Press, 1978.
[5.11] L. G. Valiant, “A bridging model for parallel computation,” Commun. ACM, vol. 33, no. 8, pp. 103–111, 1990.
[5.12] B. H. H. Juurlink and H. A. G. Wijshoff, “The E-BSP model: Incorporating general locality and unbalanced communication into the BSP model,” in Proc. Euro-Par'96, vol. II, (LNCS 1124), pp. 339–347, 1996.
[5.13] B. Juurlink and H. A. G. Wijshoff, “A quantitative comparison of parallel computation models,” in Proc. 8th ACM Symp. on Parallel Algorithms and Architectures (SPAA'96), pp. 13–24, January 1996.
[5.14] J. M. D. Hill, P. I. Crumpton, and D. A. Burgess, “Theory, practice, and a tool for BSP performance prediction,” in Euro-Par, Vol. II, pp. 697–705, 1996.
[5.15] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken, “LogP: Towards a realistic model of parallel computation,” in PPOPP, pp. 1–12, 1993.
[5.16] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, “LogGP: incorporating long messages into the LogP model – one step closer towards a realistic model for parallel computation,” in SPAA '95: Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, pp. 95–105, ACM Press, 1995.
[5.17] M. I. Frank, A. Agarwal, and M. K. Vernon, “LoPC: modeling contention in parallel algorithms,” in PPOPP '97: Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 276–287, ACM Press, 1997.
[5.18] C. A. Moritz and M. I. Frank, “LoGPC: modeling network contention in message-passing programs,” in SIGMETRICS '98/PERFORMANCE '98: Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, pp. 254–263, ACM Press, 1998.
[5.19] K. W. Cameron and R. Ge, “Predicting and evaluating distributed communication performance,” in Supercomputing '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, 2004.
[5.20] K. W. Cameron and X.-H. Sun, “Quantifying locality effect in data access delay: Memory LogP,” in 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), pp. 48–55, 2003.
[5.21] A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, “Fast parallel sorting under LogP: Experience with the CM-5,” IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 8, pp. 791–805, 1996.
[5.22] M. J. Clement and M. J. Quinn, “Analytical performance prediction on multicomputers,” in Supercomputing '93: Proceedings of the 1993 ACM/IEEE conference on Supercomputing, pp. 886–894, ACM Press, 1993.
[5.23] D. J. Kerbyson, H. J. Alme, A. Hoisie, F. Petrini, H. J. Wasserman, and M. Gittings, “Predictive performance and scalability modeling of a large-scale application,” in Supercomputing '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), pp. 37–48, ACM Press, 2001.
[5.24] X.-H. Sun and J. Zhu, “Performance prediction of scalable computing: a case study,” in HICSS, pp. 456–469, 1995.
[5.25] Y. Kim, M. Fienup, J. C. Clary, and S. C. Kothari, “Parametric micro-level performance models for parallel computing,” Tech. Rep. TR-9423, Department of Computer Science, Iowa State University, December 1994.
[5.26] M. E. Crovella and T. J. LeBlanc, “Parallel performance using lost cycles analysis,” in Supercomputing '94: Proceedings of the 1994 conference on Supercomputing, pp. 600–609, IEEE Computer Society Press, 1994.
[5.27] J. Wagner Meira, “Modeling performance of parallel programs,” Tech. Rep. 589, Computer Science Department, University of Rochester, June 1995.
[5.28] M. Crovella and T. J. LeBlanc, “Performance debugging using parallel performance predicates,” in Workshop on Parallel and Distributed Debugging, pp. 140–150, 1993.
[5.29] V. S. Adve, Analyzing the behavior and performance of parallel programs. PhD thesis, Department of Computer Sciences, University of Wisconsin-Madison, December 1993.
[5.30] V. S. Adve and M. K. Vernon, “Parallel program performance prediction using deterministic task graph analysis,” ACM Trans. Comput. Syst., vol. 22, no. 1, pp. 94–136, 2004.
[5.31] J. Simon and J.-M. Wierum, “Accurate performance prediction for massively parallel systems and its applications,” in Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing - Volume II, pp. 675–688, Springer-Verlag, 1996.
[5.32] M. Parashar and S. Hariri, “Compile-time performance prediction of HPF/Fortran 90D,” IEEE Parallel Distrib. Technol., vol. 4, no. 1, pp. 57–73, 1996.
[5.33] M. Parashar and S. Hariri, “Interpretive performance prediction for high performance application development,” in HICSS (1), pp. 462–471, 1997.
[5.34] T. Fahringer, R. Blasko, and H. P. Zima, “Automatic performance prediction to support parallelization of Fortran programs for massively parallel systems,” in ICS '92: Proceedings of the 6th international conference on Supercomputing, pp. 347–356, ACM Press, 1992.
[5.35] T. Fahringer and H. P. Zima, “A static parameter based performance prediction tool for parallel programs,” in ICS '93: Proceedings of the 7th international conference on Supercomputing, pp. 207–219, ACM Press, 1993.
[5.36] T. Fahringer, “Estimating and optimizing performance for parallel programs,” Tech. Rep. TR 96-1, Institute for Software Technology and Parallel Systems, University of Vienna, March 1996.
[5.37] R. Blasko, “Hierarchical performance prediction for parallel programs,” in Proceedings of the 1995 International Symposium and Workshop on Systems Engineering of Computer Based Systems, pp. 398–405, March 1995.
[5.38] D. J. Kerbyson, J. S. Harper, A. Craig, and G. R. Nudd, “PACE: A toolset to investigate and predict performance in parallel systems,” in Proc. of the European Parallel Tools Meeting, (Châtillon, France), Oct. 1996.
[5.39] J. Cao, D. J. Kerbyson, E. Papaefstathiou, and G. R. Nudd, “Modelling of ASCI high performance applications using PACE,” in 15th Annual UK Performance Engineering Workshop, pp. 413–424, 1999.
[5.40] D. J. Kerbyson, E. Papaefstathiou, J. S. Harper, S. C. Perry, and G. R. Nudd, “Is predictive tracing too late for HPC users,” in High-Performance Computing, pp. 57–67, Kluwer Academic, 1999.
[5.41] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha, “A framework for performance modeling and prediction,” in Supercomputing '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing, pp. 1–17, IEEE Computer Society Press, 2002.
[5.42] F. Howell, Approaches to parallel performance prediction. PhD thesis, Dept of Computer Science, University of Edinburgh, 1996.
[5.43] D. P. Grove, Performance modelling of message-passing parallel programs. PhD thesis, Dept of Computer Science, University of Adelaide, 2003.
[5.44] X. Qin and J.-L. Baer, “A performance evaluation of cluster architectures,” in SIGMETRICS '97: Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pp. 237–247, ACM Press, 1997.
[5.45] Y. Yan, X. Zhang, and Y. Song, “An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW,” J. Parallel Distrib. Comput., vol. 38, no. 1, pp. 63–80, 1996.
[5.46] R. Bagrodia, E. Deeljman, S. Docy, and T. Phan, “Performance prediction of large parallel applications using parallel simulations,” in PPoPP '99: Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 151–162, ACM Press, 1999.
[5.47] D. K. Poulsen and P.-C. Yew, “Execution-driven tools for parallel simulation of parallel architectures and applications,” in Supercomputing '93: Proceedings of the 1993 ACM/IEEE conference on Supercomputing, pp. 860–869, ACM Press, 1993.
[5.48] J. Brehm, M. Madhukar, E. Smirni, and L. W. Dowdy, “PerPreT - a performance prediction tool for massively parallel systems,” in MMB, pp. 284–298, 1995.
[5.49] M. Kuhnemann, T. Rauber, and G. Runger, “A source code analyzer for performance prediction,” in IPDPS, 2004.
6 Experimental performance measurement
The experimental performance modeling process includes five stages (Figure
6.1). In the instrumentation stage, measurement code is inserted into the original
application code. This is when the user (manually) or the PAT (automatically)
decides what kinds of measurements are needed, where to put the instrumentation
code, and how to collect the results. After this, in the measurement stage, the PAT
collects the raw data, which feeds into the analysis stage, where it is transformed
into a set of meaningful performance data. This set is then organized and presented
(plain text, visualization, etc.) to the user in a meaningful and intuitive way in the
presentation stage. Finally, in the optimization stage, the user or the PAT
discovers the source of the performance bottleneck and modifies the original
code to alleviate the problem. Other terms have been used for many of these
stages (e.g., monitoring for the instrumentation and measurement stages, filtering and
aggregation for the analysis stage), but the goals of the stages remain the same.
Among the methods proposed, by far the most dominant, and perhaps the only
one implemented in today's PATs, is based on events. (Another available
approach is flow analysis, but the literature on that approach is generally
concerned with determining the correctness of the program and has little
correlation with performance analysis. Furthermore, no tool that we know of is
based on this approach, except for the inclusion of call graph trees, which probably
stem from the flow-analysis world.) An event is a measurable behavior of the
program that is of interest. It can be as simple as an occurrence of a cache miss or
a collection of simple events such as the cache miss count. It was found that 95%
of these events can be classified as frequency (probability of an action), distance
(time between occurrences of the same event), duration (length in time of the actual
event), count (number of times an event occurred), or access value (program-related
values such as message length) [6.1.10]. Using appropriate events, the
dynamic behavior of the program can be reconstructed, thus aiding in the
identification of the performance bottleneck. The challenge in this approach is
the determination of the meaningful events (what we define as factors) and how
to accurately measure these events. Many PAT developers simply use what is
available, what is used by other tools, or depend on user requests to determine
what factors to put into their PAT. Although these are great sources, support is
needed to justify their importance. Ultimately, we need to establish a “standard”
set of important factors along with support for why they are important. In
addition, the best way to measure them, and to what degree performance is
affected by them, should be determined.
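For reference, the five event classes from [6.1.10] can be captured with a simple tag attached to each recorded value, as in the illustrative sketch below; the enum and struct are our own, not an existing trace format.

/* Illustrative classification of recorded events into the five classes. */
typedef enum {
    EV_FREQUENCY,    /* probability of an action */
    EV_DISTANCE,     /* time between occurrences of the same event */
    EV_DURATION,     /* length in time of the event itself */
    EV_COUNT,        /* number of times an event occurred */
    EV_ACCESS_VALUE  /* program-related value, e.g. message length */
} event_class;

typedef struct {
    int         event_id;  /* which event (cache miss, barrier, send, ...) */
    event_class ev_class;  /* how the recorded value is to be interpreted */
    double      value;     /* the recorded measurement */
} event_record;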
[Figure 6.1 - Experimental performance modeling stages: original code → (instrumentation) → instrumented code → (measurement) → raw data → (analysis) → meaningful set of data → (presentation) → bottleneck detection → (optimization) → improved code]
6.1 Instrumentation
This section covers issues and options important in the instrumentation stage
along with what we believe to be the approach to try in our PAT. Note that the
presenter of an SC2003 tutorial on principles of performance analysis stated that 89%
of the development effort is directly or indirectly related to writing instrumentation
software [6.1.2]. Because of this, we should try to extend existing
instrumentation software if we find factors that are not currently being
measured.
6.1.1 Instrumentation overhead
Instrumentation overhead can be broken down into two categories. One is the
additional work a PAT user needs to do, the manual overhead, and the other is
the performance overhead, the extra program running time due to the insertion of
instrumentation code. The question, then, is what levels of overhead are
acceptable in each of these areas. It is ideal to minimize the manual overhead,
but doing so limits the usefulness of a PAT, as only events built into the tool are
available. This is a tradeoff between the effectiveness of a PAT and the extra effort
needed from the user: will the user deem a PAT too tedious to work with?
Unfortunately, these issues are rarely evaluated by PAT developers. Low manual
overhead is often cited as one of the benefits of dynamic instrumentation (see Section
6.1.2), but that is as far as the discussion usually goes.
The PAT performance overhead, on the other hand, is often mentioned. This is
important since excessive instrumentation will alter the behavior of the original
program, thus making the analysis invalid. PAT developers and users are aware
of the impact this will have on the usefulness of the PAT but determining what
level of overhead is acceptable is still arbitrary. Generally, less than a 30%
increase in overall program execution time is viewed as acceptable (from the
developer's point of view; what is acceptable to the user is yet to be determined),
and this may be just a consensus people hold without experimental support.
Nonetheless, it does provide a tangible level for us to use. However, this is not
sufficient to determine whether or not the instrumentation code perturbs the original
program behavior. It is possible for an instrumented version of the code to run only
slightly slower than the original version yet, because of where and how it was
instrumented, still alter the program's behavior. This should be studied; perhaps
determining the overhead of each instrumentation step, along with some simple
modeling, can provide an answer.
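One simple experiment along these lines is to measure the cost of a single instrumentation probe directly, by invoking it back-to-back many times and dividing the elapsed time by the iteration count. In the sketch below, probe() is a trivial stand-in for whatever event-recording hook our PAT would insert.

/* Estimate the per-call cost of an instrumentation probe. */
#include <stdio.h>
#include <sys/time.h>

static volatile long event_count = 0;
static void probe(void) { event_count++; }   /* trivial stand-in probe */

double probe_cost_seconds(long iterations)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long i = 0; i < iterations; i++)
        probe();
    gettimeofday(&t1, NULL);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    return elapsed / iterations;
}

int main(void)
{
    printf("estimated cost per probe: %.3g s\n", probe_cost_seconds(10000000L));
    return 0;
}

Multiplying this per-probe cost by the expected probe counts at each instrumentation point would give a first-order model of where the measurement overhead comes from.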
6.1.2 Profiling and tracing
Profiling refers to the process of collecting statistical event data at runtime and
performing data aggregation and filtration when the program terminates. It can
be based on sampled process timing, where a hardware timer periodically
interrupts the program execution to trigger the measurements (i.e., a
daemon running independently of the program being analyzed collects data on the
state of the machine), or measured process timing, in which instrumentation is
triggered at the start and end of the event. In profiling, when and what to
instrument can be determined statically, and the instrumentation is inserted prior
to program execution. The performance data generally require little storage space,
as repeated occurrences of the same event can be accumulated into a single
aggregate record. Because of the statistical nature of the method, it is more
difficult, and often impossible, to accurately reconstruct program behavior based
on profiling.
Tracing, on the other hand, is more useful when accurate reconstruction of
program behavior is desirable. Events are often time-stamped or at least
ordered with respect to program execution. This enables the user to see exactly
when and what occurred. Partial, meaningful event data can be presented to the
user at runtime so the user can detect bottleneck while tracing takes place
(partial data can also be presented to the user at runtime using profiling, but the
profiling data, due to their statistical nature, are often meaningless). Lastly, it is
possible to calculate the profiling data from tracing data after the program
terminates. Tracing is able to provide a more accurate view of the program
behavior. However, with a long running program, tracing requires an enormous
amount of storage space and because of this it can also complicate the analysis
process.
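The storage difference between the two strategies can be illustrated with the following sketch: profiling keeps one fixed-size aggregate record per event type, while tracing appends a time-stamped record for every occurrence, which is why trace volume grows with running time. Both layouts are illustrative assumptions rather than any particular tool's format.

/* Profiling: one aggregate record per event type. */
typedef struct {
    long   count;
    double total_time;
    double min_time, max_time;
} profile_entry;

/* Tracing: one record appended per event occurrence. */
typedef struct {
    double timestamp;   /* when the event happened */
    int    event_id;    /* what happened */
    int    thread;      /* which UPC thread / MPI rank */
    long   payload;     /* e.g. bytes transferred */
} trace_record;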
Two ideas have been proposed to alleviate the problem with large trace files.
The first approach is to use a trace file format that is compact and scalable
(popular formats include SDDF, EPILOG, CLOG/SLOG, VTF/STF, XML). This
has been shown to be effective to some extent and new formats are constantly
being devised to address these issues along with other desirable features such
as interoperability between tools (a big selling point of XML). There is also an
effort to develop a trace file database [6.1.4] which could be a great source for
factor discovery. The second approach is to vary the degree of instrumentation
depending on the program behavior; this is part of the motivation behind dynamic
instrumentation, where the PAT decides when and where instrumentation takes
place (e.g., Paradyn's DynInst, IBM's DPCL). The system examines the
available data and instruments the original code heavily only if it discovers a
possible bottleneck. The heavy instrumentation is then scaled back to the normal
amount (a predefined set of events needed for problem discovery that does not
impose too much overhead) once the program behaves normally again.
Schemes have been devised (e.g., Paradyn's W3 search model and knowledge-based
discovery systems) to identify all possible performance bottlenecks without
wasting too much time on false positives. However, the research in this area
is still very primitive.
We should probably use tracing as our data collecting strategy but we need to be
very careful not to over-trace the program. Dynamic instrumentation and
a scalable trace format are both important, and we should use them in our PAT, but
we might need to extend or even come up with a better bottleneck discovery
system.
6.1.3 Manual vs. automatic
As mentioned in Section 6.1, the instrumentation process can be done manually
or automatically by the system. It is debatable which one is more useful,
and tools incorporating one or both of these methods (though often only one
method is used at a time) have been shown to be successful. It is interesting to note
that users are not entirely against manual instrumentation, so the ideal approach
seems to be for the PAT to provide as much automatic instrumentation as
possible while still allowing manual instrumentation, so that its usefulness is not
limited. Finally, it would be beneficial to classify factors according to their
appropriateness for manual or automatic instrumentation. We believe some of the
factors are better left for the user to instrument (e.g., user-defined functions) while
others should be automatically inserted (e.g., memory profiles) to lessen the effort
needed from the user.
6.1.4 Number of passes
The number of passes refers to how many times a tool needs to execute the
instrumented code in order to gather the necessary profiling/tracing information.
A one-pass PAT is desirable since it minimizes the time the user needs to wait
for feedback. More importantly, it is sometimes critical to have a one-pass
system, because a multi-pass system is simply not viable for long-running
programs. On the other hand, more accurate data, which helps the performance
bottleneck identification process, can be obtained using a multi-pass approach.
Later passes can use the data collected in the first pass (generally
profiling) to fine-tune their task (one use is in dynamic instrumentation, where the
system analyzes the statistical data from the first pass and then traces only the
hot spots). Finally, a hybrid strategy has been introduced to overcome the shortfalls
of the one-pass approach. The basic principle is to periodically analyze data while
collecting event data and to use the analysis as a basis for future instrumentation.
This is very interesting if we decide to incorporate dynamic instrumentation into
our PAT, but we need to be aware that this method does not yield exactly the same
result as a multi-pass approach would. Finally, it is worth reinforcing that
this issue really only applies to tracing, as virtually all profiling events are
statistical in nature, so the necessary instrumentation can be determined a
priori.
6.1.5 Levels of instrumentation
Instrumentation can be done at various levels in the programming cycle. Typical
levels include the source (application), runtime/middleware (library and compiler),
operating system and binary (executable) levels.
6.1.5.1 Source level instrumentation
Instrumentation at the source level involves inserting the desired measurement code
into the source code. This strategy is commonly used, as it gives the PAT user great
control over the instrumentation process. It is easy to correlate the
measurements with the program constructs of the source code, giving the user a
better idea of where the problem is occurring. In addition, source code
instrumentation is system independent (while being language specific) and thus
portable across platforms. Unfortunately, some source-level instrumentation will
hinder the compiler optimization process (the instrumented version may turn off an
optimization that would normally be applied to the original version), and it does
not always provide the most accurate measurements because of other system
activity (e.g., when timing a function, a context switch can occur while the
function executes, so the measured running time can be longer than the actual
running time).
Source-level instrumentation can be done in two ways: manual insertion by the user
or automatic insertion by a pre-processor (we consider the use of an instrumentation
language to be manual, since the user still has to do the instrumentation work,
although it generally requires less of it). The tradeoff between these two
approaches is the same as that of manual vs. automatic instrumentation
discussed above.
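A minimal example of manual source-level instrumentation is shown below. The region_start/region_stop calls are hypothetical PAT routines, not an existing API; recording __FILE__ and __LINE__ is what lets the measurement be reported against the program construct it brackets.

/* Manual source-level instrumentation with hypothetical PAT calls. */
#include <sys/time.h>
#include <stdio.h>

static struct timeval region_t0;

static void region_start(void) { gettimeofday(&region_t0, NULL); }

static void region_stop(const char *file, int line, const char *name)
{
    struct timeval t1;
    gettimeofday(&t1, NULL);
    double dt = (t1.tv_sec - region_t0.tv_sec)
              + (t1.tv_usec - region_t0.tv_usec) * 1e-6;
    printf("%s (%s:%d): %.6f s\n", name, file, line, dt);
}

void solve(double *x, int n)
{
    region_start();
    for (int i = 0; i < n; i++)   /* region of interest */
        x[i] = x[i] * 0.5 + 1.0;
    region_stop(__FILE__, __LINE__, "solve_loop");
}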
[Figure 6.2 – Levels of instrumentation: source code → pre-processor → instrumented code → compiler → object code → linker (with libraries) → executables → execution (adapted from Fig. 1 of [6.1.15])]
6.1.5.2 System level instrumentation
Instrumentation at this level (platform specific) is possible through the use of a
library or by directly incorporating instrumentation into the compiler. In the library
approach, wrapper functions are created for the routines of interest. An example of this
is the PMPI (MPI profiling) interface, in which a profiling library defines instrumented
wrappers for the MPI functions (e.g., MPI_Send). The underlying MPI library also
exports each function under a PMPI name (e.g., PMPI_Send), which the wrapper
calls to carry out the actual operation. Instrumentation of these functions can then
be turned on simply by linking the program with the profiling library. This approach
is very convenient for users, as they do not need to modify their source code
manually, but it is also limited, as only those functions
defined in the library are available.
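The sketch below shows what such a wrapper might look like for MPI_Send (using the MPI-1/2 prototype; newer MPI versions declare the buffer as const). The counters kept here are just an example of the instrumentation a profiling library could add before forwarding the call to PMPI_Send.

/* PMPI-based wrapper: record statistics, then forward to the real send. */
#include <mpi.h>

static long   send_calls = 0;
static double send_time  = 0.0;

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}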
The compiler approach works similarly, except that it is the compiler's job to
correctly add the instrumentation code. It is more versatile than the library approach,
as it can also instrument constructs other than functions (e.g., a single statement).
However, instrumentation can still only be applied to those sets of constructs
predefined by the developer. In addition, this approach requires access to the
compiler's source code, and the problem of obtaining inaccurate measurements still
exists.
In many ways, the library instrumentation approach can be viewed as a variation of
automatic source-level instrumentation. In both cases, instrumentation code is
automatically added to the source code based on a predefined set of rules. The only
difference is that the library approach uses a library whereas the
automatic source-level approach uses a pre-processor. Because of this, the two can be
grouped together into a single method that uses a pre-processor. We believe
this is more appropriate because it enables the incorporation of constructs other
than function calls (although the wrapper function approach is easier to implement).
Finally, we do not think the compiler approach is viable, because compiler
developers will probably not be willing to put in the effort to incorporate the sets
of events we deem important.
6.1.5.3 Operating system level instrumentation
OS-level instrumentation involves the use of existing system calls to extract
information on program behavior. These calls vary from system to system and
are quite often limited, making this technique impractical in most instances. An
exception is the use of hardware performance counters; a package like PAPI
can be considered part of this level, but it can also be treated as a library. Because
of this, we will not consider this level of instrumentation separately in our project.
6.1.5.4 Binary level instrumentation
The lowest level of instrumentation is at the binary level. Executables specific to
a target system are examined and a pre-defined set of rules is used to facilitate
the automatic insertion of event-measuring code (statically inserting code, or
dynamically inserting, removing, and changing it). This approach avoids
the problem of obtaining inaccurate measurements, and a single implementation
works with many different programming languages. Unfortunately, although it is
simple to correlate the instrumentation with low-level events, applying it to
more complex events is not trivial. Because of this, it is difficult to associate
the instrumentation back to the source code, making it harder for the user to
pinpoint the problem. Furthermore, instrumentation must be applied to all
executables on all the nodes in the system; if a heterogeneous environment is
used for the execution of the parallel program, the instrumentation done on each
machine might not yield the same useful set. In general, this level of
instrumentation is applied automatically.
6.1.5.5 Level of instrumentation conclusion
Although many tools use one or more of these levels of instrumentation, they are
generally used in separate runs. Due to the nature of the factors, we believe that
some factors will naturally be best measured at one of these levels. As in the
case of manual vs. automatic instrumentation, it would be beneficial to categorize
factors with regard to the instrumentation levels as well. In addition, our PAT
should use several levels of instrumentation together in a single run (according
to the categorization), or at the very least include source- and binary-level
instrumentation.
6.1.6 References
[6.1.1] Andrew W. Appel et al., “Profiling in the Presence of Optimization and Garbage Collection”, Nov 1988.
[6.1.2] Luiz DeRose, Bernd Mohr and Kevin London, “Performance Tools 101: Principles of Experimental Performance Measurement and Analysis”, SC2003 Tutorial M-11.
[6.1.3] Luiz DeRose, Bernd Mohr and Seetharami Seelam, “An Implementation of the POMP Performance Monitoring Interface for OpenMP Based on Dynamic Probes”.
[6.1.4] Ken Ferschweiler et al., “A Community Databank for Performance Tracefiles”.
[6.1.5] Thomas Fahringer and Clovis Seragiotto Junior, “Modeling and Detecting Performance Problems for Distributed and Parallel Programs with JavaPSL”, University of Vienna.
[6.1.6] Seon Wook Kim et al., “VGV: Supporting Performance Analysis of Object-Oriented Mixed MPI/OpenMP Parallel Applications”.
[6.1.7] Anjo Kolk, Shari Yamaguchi and Jim Viscusi, “Yet Another Performance Profiling Method”, Oracle Corporation, June 1999.
[6.1.8] Yong-fong Lee and Barbara G. Ryder, “A Comprehensive Approach to Parallel Data Flow Analysis”, Rutgers University.
[6.1.9] Allen D. Malony and Sameer Shende, “Performance Technology for Complex Parallel and Distributed Systems”, University of Oregon.
[6.1.10] Bernd Mohr, “Standardization of Event Traces Considered Harmful or Is an Implementation of Object-Independent Event Trace Monitoring and Analysis Systems Possible?”, Advances in Parallel Computing, Vol. 6, pp. 103-124, 1993.
[6.1.11] Philip Mucci et al., “Automating the Large-Scale Collection and Analysis of Performance Data on Linux Clusters”, University of Tennessee-Knoxville and National Center for Supercomputing Applications.
[6.1.12] Stephan Oepen and John Carroll, “Parser Engineering and Performance Profiling”, Natural Language Engineering 6 (1): 81-97, Feb 2000.
[6.1.13] Daniel A. Reed et al., “Performance Analysis of Parallel Systems: Approaches and Open Problems”, University of Illinois-Urbana.
[6.1.14] Lambert Schaelicke, Al Davis and Sally A. McKee, “Profiling I/O Interrupts in Modern Architectures”, University of Utah.
[6.1.15] Sameer Shende, “Profiling and Tracing in Linux”, University of Oregon.
[6.1.16] Sameer Shende et al., “Portable Profiling and Tracing for Parallel, Scientific Applications using C++”, University of Oregon and Los Alamos National Laboratory.
[6.1.17] Sameer Shende, Allen D. Malony and Robert Ansell-Bell, “Instrumentation and Measurement Strategies for Flexible and Portable Empirical Performance Evaluation”, University of Oregon.
[6.1.18] Hong-Linh Truong and Thomas Fahringer, “On Utilizing Experiment Data Repository for Performance Analysis of Parallel Applications”, University of Vienna.
[6.1.19] Jeffrey Vetter, “Performance Analysis of Distributed Applications using Automatic Classification of Communication Inefficiencies”, ACM International Conference on Supercomputing 2000.
[6.1.20] Jurgen Vollmer, “Data Flow Analysis of Parallel Programs”, University of Karlsruhe.
[6.1.21] Youfeng Wu, “Efficient Discovery of Regular Stride Patterns in Irregular Programs and Its Use in Compiler Prefetching”, Intel Labs.
6.2 Measurement
6.2.1 Performance factor
The ever increasing desire for high computation power, supported by advances
in microprocessor and communications technology as well as algorithms, has led
to the rapid deployment of a wide range of parallel and distributed systems.
Many factors affect the performance of these systems, including the processors
and hardware architecture, the communication network, the various system
software components, and the mapping of user applications and their algorithms
to the architecture [6.2.1]. Analyzing performance factors in such systems has
proven to be a challenging task that requires innovative performance analysis
tools and methods to keep up with the rapid evolution and ever increasing
complexity of such systems.
This section first provides a formal definition of the term performance factor,
followed by a discussion of what constitutes a good performance factor,
including the distinction between means-based and ends-based factors. Then a
three-step approach to determine if a factor is good is presented, followed by
conclusions.
6.2.1.1 Definition of a performance factor
Before we can begin to design a performance analysis tool, we must determine
what things are interesting and useful to measure. The basic characteristics of a
parallel computer system that a user of a performance tool typically wants to
measure are [6.2.2]:
- a count of how many times an event occurs
- the duration of some time interval
- the size of some parameter.
For instance, a user may want to count how many times a processor initiates an
I/O request. They may also be interested in how long each of these requests
takes.
Finally, it is probably also useful to determine the amount of data
transmitted and stored. From these types of measured values, a performance
analysis tool can derive the actual value that the user wants in order to describe
the performance of the system. This value is called a performance factor. If the
user is interested specifically in the time, count, or size value measured, we can
use that value directly as the performance factor. Often, however, the user is
interested in normalizing event counts to a common time basis to provide a
speed metric such as instructions executed per second. This type of factor is
referred to in the literature as a rate factor or throughput and is calculated by
dividing the count of the number of events that occur in a given interval by the
time interval over which the events occur. Since a rate factor is normalized to a
common time basis, such as seconds, it is useful for comparing different
measurements made over different time intervals.
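As a trivial example of the definition, a rate factor is just the event count divided by the interval over which the events were observed; the helper below makes that concrete.

/* rate factor = event count normalized by the measurement interval */
double rate_factor(long long event_count, double interval_seconds)
{
    return (double)event_count / interval_seconds;
}

/* Example: 2.5e9 floating point operations observed over 4.0 seconds gives
 * rate_factor(2500000000LL, 4.0) = 6.25e8 ops/s, i.e. 625 MFLOPS. */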
Conceptually, the information provided by the PAT should consist of two groups.
The first group is the information gathered from low-level system monitoring tools
such as hardware counters.
This group of information will provide the raw
performance data of the system. The user will be able to use this set of data to
gather fine-grained information about the performance issues of the system.
This raw information will be most useful to users with extensive experience with
parallel systems and may not be as applicable to the novice user. However, the
second group of information will be at higher level to provide the user with an
overall view of performance.
The information provided by this group will be
derived from the other group and may include such information as speedup,
parallel efficiency, and other high level information which users have come to
identify as important in diagnosing parallel systems. This conceptual separation
of data seems quite natural and may suggest a straightforward approach to the
design of the PAT. Many tools currently follow this approach with the use of a
hardware counter library such as PAPI to provide low-level information that is
then abstracted into higher level information which is more readily understood by
the user.
6.2.1.2 Characteristics of a good performance factor
There are many different metrics that have been used to describe the
performance of a computer system [6.2.5, 6.2.9, 6.2.11, 6.2.12]. Some of these
metrics are commonly used throughout the field, such as MIPS and MFLOPS,
whereas others are invented for new situations as they are needed. Although the
designers of a tool will primarily be concerned with performance factors that are
specific to parallel systems, factors that contribute to sequential performance
need to be explored as well. Experience has shown that not all metrics are
‘good' in the sense that sometimes using a particular metric can lead to
erroneous or misleading conclusions. Consequently, it is useful to understand
the characteristics of a ‘good' performance metric. Naturally, the designers of a
performance tool will be interested in providing the most useful and unambiguous
information to the user about the desired performance factors. Understanding the
characteristics of a good performance factor will be useful when deciding which
of the existing performance metrics to use for a particular situation, and when
developing a new performance metric would be more appropriate.
Since many performance factors provide information that may be misleading to
the user, designers of a performance tool must be careful to inform users of the
limits of the applicability of a given performance factor. For example, the number
of instructions executed may not correspond to the total execution time of a given
application. If the user relies solely on the number of instructions executed as a
performance factor, the user may end up applying performance optimizations
that actually degrade performance rather than improve it.
Although this is a
simplistic example, one can envision more complicated scenarios in which data
provided by a performance tool may be misleading or ambiguous, which may
lead to the misapplication of performance optimizations and ultimately, the poor
performance of the user’s application. This would be detrimental to the
acceptance and usefulness of the tool regardless of the correctness of the
information provided.
Many tools provide the user with the ability to define their own performance
factors in order to provide a more customized analysis of their application. For
example, Paradyn [6.2.8] provides the user with the Metric Description Language
(MDL) [6.2.10] to enable user-defined metrics.
However, since not all
performance factors are equally applicable, if we decide to allow user-defined
metrics in the PAT, we should take measures in order to ensure the proper use
of these user factors. Warning the user of the potential misapplication of any user-defined factor may be sufficient, since this functionality is probably intended for
advanced users who are fully aware of the issues involved. However, it is still
necessary for us to be aware of the characteristics of a good performance factor
to ensure that at least the default performance factors will be useful to the user.
A performance factor that satisfies all of the following requirements is generally
considered ‘useful’ by the literature in allowing accurate and detailed
comparisons of different measurements. These criteria have been developed by
observing the results of numerous performance analyses over many years
[6.2.2]. Although some caution that following these guidelines is not necessarily
a recipe for success, it is generally regarded that using a factor that does not
satisfy these requirements can often lead to erroneous conclusions [6.2.6].
Reliability - A performance factor is considered to be reliable if system A always
outperforms system B when the corresponding values of the factor for both
systems indicate that system A should outperform system B assuming all other
factors are the same. While this requirement would seem so obvious as to be
unnecessary to state explicitly, several commonly used performance factors do
not in fact satisfy it. The MIPS metric, for instance, is notoriously unreliable.
Specifically, it is not unusual for one processor to have a higher MIPS rating than
another processor while the second processor actually executes a specific
program in less time than does the processor with the higher value of the metric.
Such a factor is essentially useless for summarizing performance, and is
unreliable. Thus if a performance tool provides information about an unreliable
factor to the user, it should only be one of a number of factors rather than a
single number summarizing the total performance of the application [6.2.3]. It
should be noted that unreliable factors need not necessarily be removed from a
performance tool; rather, they should be used within the context of a larger, more
comprehensive data set.
One of the problems with many of the metrics discussed earlier (such as MIPS)
that makes them unreliable is that they measure what was done, whether or not it
was useful. What makes a performance factor reliable, however, is that it
accurately and consistently measures progress towards a goal. Metrics that
measure what was done, useful or not, are referred to in the literature as
means-based metrics, whereas ends-based metrics measure what is actually
accomplished. For the high-level information provided by a performance tool to
the user, ends-based factors will be more appropriate and useful. However, much
of the low-level information provided by the tool must necessarily be means-based,
such as the information provided by hardware counters. Also, many
performance factors, such as network latency, are not adequately described as
either ends- or means-based. Nonetheless, an effort should be made on the part of tool
designers to provide the user with ends-based performance factors whenever
possible. This will help to ensure that the most relevant and reliable data are
presented to the user.
Repeatability - A performance factor is repeatable if the same value of the factor
is measured each time the same experiment is performed. This also implies that
a good metric is deterministic. Although a particular factor may be dependent
upon characteristics of a program that are essentially non-deterministic from the
point of view of the program such as user input, a repeatable factor should yield
the same results for two identical runs of the application.
Ease of measurement - If a factor is not easy to measure, it is unlikely that
anyone will actually use it. Furthermore, the more difficult a factor is to measure
directly, or to derive from other measured values, the more likely it is that the
factor will be determined incorrectly [6.2.7]. Since a performance tool has a finite
development time, the majority of the implementation effort should be used to
provide the user with the easier, more useful factors since efforts to provide
factors which are more difficult to obtain may not provide the user with more
utility.
Consistency - A consistent performance factor is one for which the units of the
factor and its precise definition are the same across different systems and
different configurations of the same system.
If the units of a factor are not
consistent, it is impossible to use the factor to compare the performances of the
different systems.
Although the concepts of reliability and consistency are
similar, reliability refers to a factor’s ability to predict the relative performances of
two systems whereas consistency refers to a factor’s ability to provide the same
information about different systems. Since one of the goals of our performance
tool is to be portable, the issue of consistency is particularly important. Many
users will port their code from one system to another and a good performance
tool should try to provide consistent information for both systems whenever
possible.
Therefore, it is highly desirable that a performance tool provide
semantically similar if not identical performance factors across a wide variety of
platforms. While the necessity for this characteristic would also seem obvious, it
is not satisfied by many popular metrics, such as MIPS and MFLOPS.
Our general strategy for determining if a performance factor meets these four
criteria is as follows. First, on each supported platform, determine if the factor is measurable and, if so, how easily its value can be obtained. Obviously,
if a factor is not measurable for a given platform, it will not be supported for that
system, but it may be for other systems. After the factor has been determined to
be easy to measure, it will be straightforward to determine if the factor is
repeatable.
The determination of reliability and consistency, however, will
require a more involved, three-step approach. First, if the factor can be modified
on a real system, the factor should be tested directly to determine if it is reliable
and consistent by applying the definitions of each listed above. For example,
changing the network of a given system can show whether network latency and bandwidth are reliable and consistent factors. However, most factors cannot be modified on a real system so easily. Factors such as cache size and associativity require a different approach since they cannot be readily changed on a real system. Second, for such factors, justification from the literature will be needed to determine whether the factor is reliable and consistent.
Lastly, if
information regarding the reliability and consistency is not available in the
literature, the information will need to be generated through the use of predictive
performance models. If there are no suitable performance models to use, the
factor may still be included in the PAT as long as the user is informed of its
limitations. In summary, in order to determine if a proposed performance factor
satisfies the four requirements of a good performance factor, we will perform the
following tests.
1. On each platform, determine ease of measurement.
2. Determine repeatability.
3. Determine reliability and consistency by one of the following methods.
(a) Modify the factor using real hardware.
(b) Find justification in the literature.
(c) Derive the information from performance models.
6.2.1.3 Conclusions
In this section, we provided a formal definition of the term performance factor and
then presented a discussion as to what constitutes a good performance factor.
This was followed by a three-step approach to determine which of our proposed
factors are good.
In order to provide the most useful data to the user of a performance analysis
tool, we must ensure that the data presented by the PAT about each
performance factor satisfy the constraints of a good factor. Namely, the factor
must be reliable, repeatable, easy to measure, and consistent.
Whenever possible, a factor should be ends-based in order to avoid ambiguity and reduce
misapplication of the tool. Also, there should be mechanisms in place to warn
users of the potential pitfalls of using a poor performance factor when a tool
supports user-defined factors. Finally, the natural separation between high-level
factors and the low-level, measurable factors upon which they are based leads to
an intuitive implementation strategy adopted by many parallel performance tools.
6.2.2 Measurement strategies
To be written.
6.2.3 Factor List + experiments
To be written.
6.2.4 References
[6.2.1] L. Margetts. “Parallel Finite Element Analysis”, Ph.D. Thesis, University of Manchester.
[6.2.2] D. Lilja. “Measuring Computer Performance: A Practitioner's Guide”. Cambridge University Press.
[6.2.3] J.E. Smith, “Characterizing Computer Performance with a Single Number”, Communications of the ACM, October 1988, pp. 1202-1206.
[6.2.4] L.A. Crowl, “How to Measure, Present, and Compare Parallel Performance”, IEEE Parallel and Distributed Technology, Spring 1994, pp. 9-25.
[6.2.5] J.L. Gustafson and Q.O. Snell, “HINT: A New Way to Measure Computer Performance”, Hawaii International Conference on System Sciences, 1995.
[6.2.6] R. Jain, “The Art of Computer Systems Performance Analysis”, John Wiley and Sons, Inc., 1991.
[6.2.7] A.D. Malony and D.A. Reed, “Performance Measurement Intrusion and Perturbation Analysis”, IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 4, July 1992.
[6.2.8] B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall. “The Paradyn Parallel Performance Measurement Tools”, IEEE Computer, Vol. 28, No. 11, November 1995.
[6.2.9] J. Hollingsworth and B.P. Miller, “Parallel Program Performance Metrics: A Comparison and Validation”, Proceedings of the 1992 ACM/IEEE Conference on Supercomputing.
[6.2.10] M. Goncalves, “MDL: A Language and Compiler for Dynamic Program Instrumentation”.
[6.2.11] S. Sahni and V. Thanvantri, “Parallel Computing: Performance Metrics and Models”, Research Report, Computer Science Department, University of Florida, 1995.
[6.2.12] H. Truong, T. Fahringer, G. Madsen, A.D. Malony, H. Moritsch, and S. Shende. “On Using SCALEA for Performance Analysis of Distributed and Parallel Programs”, in Proceedings of the 9th IEEE/ACM High-Performance Networking and Computing Conference (SC’2001), Denver, USA, November 2001.
6.3 Analysis
This is probably one of the later items to address, as it involves drawing on many of our other findings.
6.4 Presentation
6.4.1 Usability
Parallel performance tools have been in existence for some time now and a great
deal of research effort has been directed at tools for improving the performance
of parallel applications. Significant progress has been made; however, parallel tools still lack widespread acceptance [6.4.3].
The most common complaints about parallel performance tools concern their
lack of usability [6.4.6].
They are criticized for being too hard to learn, too
complex to use in most programming tasks, and unsuitable for the size and
structure of real world parallel applications. Users are skeptical about how much
value tools can really provide in typical program development situations. They
are reluctant to invest time in learning to apply tools effectively because they are
uncertain that the investment will pay off. A tool is only perceived as valuable
when it clearly assists the user to accomplish the tasks to which it is applied.
Thus, a tool's propensity for being misapplied, or its failure to support the kinds of tasks users wish to perform, constitutes a serious impediment to tool acceptance.
Therefore, it is essential that the PAT be easy to use as well as provide
information in such a way as to be useful to the user. This is where usability
comes into play.
6.4.1.1 Usability factors
There are several dimensions of usability presented in the literature, of which
four seem particularly important for parallel performance tools.
6.4.1.1.1 Ease-of-learning
The first factor, ease-of-learning, is particularly important for attracting new users.
The interface presents the user with an implicit model of the underlying software.
This shapes the user's understanding of what can be done with the software, and
how. Learning a new piece of software requires that the user discover, or invent,
a mapping function from his/her logical understanding of the software's domain,
to the implicit model established by the interface [6.4.7, 6.4.19]. Designers often
fail to take into account the fact that the interface is really the only view of the
software that a user ever sees. Each inconsistency, ambiguity, and omission in
the interface model can lead to confusion and misunderstanding during the
learning process. For example, providing default settings for some objects, but
not for all, hinders learning because it forces users to recognize subtle
distinctions when they are still having to make assumptions about the larger
patterns; a common result is the misinterpretation of what object categories
mean or what defaults are for. In fact, any place the interface deviates from what users already know about parallel performance tools, or about any other software with which they are familiar, is a likely source of error [6.4.16].
As PAT designers, it should be our goal to create an internally as well as
externally consistent tool. In order to ensure internal consistency, we should
provide the most uniform interface possible to the user. We should also make an
effort to use the established conventions set by other performance analysis tools
when a clear precedent has been set. By leveraging the conventions set by
other performance analysis tools, we will ensure an externally consistent tool that
will be easier for the user to learn.
It is also important to recognize that the time a user invests to learn a library or
tool will not be warranted unless it can be amortized across many applications of
the interface. If the interface is unintuitive or difficult to understand, lack of regular use forces users to re-learn the interface many times over. The short
lifespan of HPC machines exacerbates the problem. Parallel programmers will
generally end up porting their applications across several machine platforms over
the course of time. The investment in learning a software package may not be
warranted unless it is supported, and behaves consistently, across multiple
platforms.
Therefore, a successful PAT will need to be intuitive as well as
consistent across platforms in order to establish a strong user-base.
6.4.1.1.2 Ease-of-use
Once an interface is familiar to the user, other usability factors begin to dominate.
Ease-of-use refers to the amount of attention and effort required to accomplish a
specific task using the software. In general, the more a user has to memorize
about using the interface, the more effort will be required to apply that
remembered knowledge [6.4.11]. This is why mnemonic names, the availability
of menus listing operations, and other mechanisms aimed at prodding the user's
memory will serve to improve the usability of the PAT. Interface simplicity is
equally important, since it allows users to organize their actions in small,
meaningful units; complexity forces users to pause and re-consider their logic at
frequent intervals, which is detrimental to the user’s experience with the PAT.
Ease-of-use also suffers dramatically when features and operations are indirect
or "hidden" at other levels of the interface. For example, the need to precede a
desired action by some apparently unrelated action forces the user to expend
extra effort, both to recognize the dependency, and to memorize the sequencing
requirement. Thus for the PAT, operations should be concrete and logical; there
should be a clear correlation between an action the user takes and its desired
effect.
6.4.1.1.3 Usefulness
Where ease-of-use refers to how easy it is to figure out what actions are needed
to accomplish some task, usefulness focuses on how directly the software
supports the user's own task model. That is, as the user formulates goals and
executes a series of actions to reach each goal, how direct is the mapping
between what the user wants to do and what they must do within the constraints
imposed by the interface? If a lengthy sequence of steps must be carried out to
accomplish even very common goals such as determining execution times, the
usefulness of a performance tool is low. On the other hand, if the most common
user tasks are met through simple, direct operations, usefulness will be high (in
spite of the fact that long sequences may be required for tasks that occur only
rarely). It should be our goal as PAT designers to optimize the usability of the
common case whenever possible.
Another aspect of usefulness is how easily users can apply the software to new
task situations. If the implicit model presented by the interface is clear, it should
be possible to infer new uses with a low incidence of error. For instance, if a
user has been focusing on alleviating performance problems associated with the network, it should be possible to use a similar set of operations to diagnose problems with the memory hierarchy.
6.4.1.1.4 Throughput
Since the inherent goal of software is to increase user productivity, throughput is
also important. This measure reflects the degree to which the tool contributes to
user productivity in general. It includes the efficiency with which the software can
be applied to find and correct performance problems, as well as the negative
influences exerted by frequent errors and situations where corrections are
difficult or time consuming. For performance tools with graphical interfaces or
other software start-up costs, throughput also measures the amount of time
required to invoke the tool and begin accomplishing useful work.
6.4.1.1.5 Summary
It should be clear that all four dimensions mentioned in this section contribute to
how quickly and generally users will adopt a software package. It should be
equally clear that users are the only ones who will have the insight needed to
accurately identify which interface features contribute to usability, and which
represent potential sources of confusion or error. The basis for usability lies in
how responsive the software interface is to user needs and preferences, which is something that can generally only be determined with the help of actual users.
6.4.1.2 User centered design
User-centered design is based on the premise that usability will be achieved only
if the software design process is user-driven. The tool designer must make a
conscious effort to understand the target users, what they will want to accomplish
with the tool, and how they will use the tool to accomplish those tasks [6.4.13].
The concept that usability should be the driving factor in software design and
implementation is not particularly new; it has appeared in the literature under the
guises of usability engineering, participatory design, and iterative design, as well
as user-centered design [6.4.1]. There is not yet a firm consensus on what
methodology is most appropriate, nor on the frequency with which users should
be involved in design decisions [6.4.2]. What is clear is that the tradition of
soliciting user feedback only during the very early and very late stages of
development is not adequate for assessing and improving usability.
While
developing the PAT, we should take this fact into consideration and try to
incorporate user feedback into the design throughout the development process in
order to avoid common mistakes. During early stages, the design will be too
amorphous for a user to really assess how the interface structure might enhance
or constrain task performance. During late stages such as alpha testing, the
software structure will be solidified so much that user impact will be largely
cosmetic. User involvement and feedback will be needed throughout the design
process, since different types of usability problems will be caught and corrected
at different points. Moreover, it will be important to work with at least a few
individual users on a sustained basis in order to ensure continuity and to gauge
the progress of the PAT over time. The introduction of any performance tool
does more than replace a sequence of manual operations (such as printf
statements) by automated ones. Only by understanding how a user interacts
with the interface as they become familiar with it can tool designers understand
the real issues affecting usability [6.4.4].
A four-step model for incorporating user-centered design in the development of a
parallel performance tool is suggested by [6.4.1].
(1) Ensure that the initial functionality provided by the tool is based on
demonstrable user needs. Soliciting input directly from the user community is the
best way to identify this information. Specifically, a tool will not be useful unless
it facilitates tasks that the user already does and that are time-consuming,
tedious, error-prone, or impossible to perform manually. If, instead, design of the
PAT is driven by the kind of functionality that we are ready or able to provide, it
will miss the mark. Because of the nature of the PAT project, direct user contact
may be difficult to obtain. However, the general principles regarding the basic
functionality of a parallel performance analysis tool are the same for all parallel
languages. Therefore, the information gained from the MPI user community as
well as tool developers on this matter will be useful. In order to ensure that the
functionality provided by our PAT is what users want, we should contact users
from the MPI community to obtain justification and general feedback on our
preliminary design.
This information will only be helpful up to a point, so additional information regarding the UPC/SHMEM-specific aspects of the preliminary design should be obtained from any end-users we can find, as well as from people who can speak on behalf of our end-users: the meta-users.
(2) Analyze how users identify and correct performance problems in a wide
variety of environments. A preliminary step in the design of the PAT is to study
UPC and SHMEM users in their world, where the tool ultimately will be used.
The point is to analyze how users go about organizing their efforts to improve the
performance of their parallel application. For example, the realization that users
write down or sketch out certain information for themselves provides important
clues about how visual representations should be structured and how the
interface should be documented, as well as indicating the need for additional
functionality.
Some of this information was gained through our programming
practices; however, it will be useful to gain insight from other real-world
UPC/SHMEM users. Again, because of our limited access to end-users from governmental agencies, we will need to find as many other users in the UPC/SHMEM community as possible, such as the graduate students at GWU.
From these users, we can gain a better idea of how the tool will actually be used
on real programs.
The information we gain from these users will then be passed to the meta-users, who will in turn present it to our end-users for critique and feedback. In this way, we can ensure that we will have input from a
variety of sources to ensure a comprehensive analysis of the preliminary design.
(3) Implement incrementally. Based on information gained from users, we should
begin to organize the proposed interface so that the most useful features are the
best supported. Prototypes or preliminary implementations can be constructed to
show basic interface concepts. This allows user evaluation and feedback to
begin long before efforts have been invested in features that will have low
usability payoffs. Again, this information will only be obtained by talking with
users and applying changes based on their feedback. This will also allow time
to gain information about the users' instinctive reactions, one piece at a time. For
example, the user might be presented with a few subroutine names and asked to
guess what each does or what arguments are required for each. Early reactions
might suggest major changes in thrust that will ultimately have repercussions
throughout the PAT interface. It will definitely be easier to make these types of
changes early in the design process rather later. Obviously, this type of user
interaction will be impossible with some of our users so it will be key to maintain
a strong relationship with the few other users with whom we have access. These
users will provide us with information on the small changes that we need to make
to the interface and suggest larger, more fundamental changes that will be
verified by the meta-users and end-users.
(4) Have users evaluate every aspect of the tool’s interface structure and
behavior. As the implementation of the PAT progresses, user tests should be
performed at many points along the way.
This permits feature-by-feature
refinement in response to specific sources of user confusion or frustration.
Feedback gained from this approach will also provide valuable insight into
sources of user error and what might be done to minimize the opportunity for
errors or ameliorate their effects.
By following this four-step model, the designers of a performance tool can
prevent some of the common problems cited by users, which are presented in the next section.
6.4.1.3 Common problems of usability in PAT
Throughout the literature, several basic problems are cited for the usability of
parallel performance tools. Some of this data was compiled through case studies, while other data comes from user surveys and from general principles in the field of Human-Computer Interaction (HCI).
Inconsistency: Lack of symmetry or consistency among elements within a given
tool. The most blatant inconsistencies (e.g., spelling, naming of elements, or
punctuation) can be caught through a careful checking by the tool developers
themselves. Nevertheless, there are generally additional inconsistencies that are
likely to impede the user’s learning of the tool.
In many cases, the tool developers may be able to cite practical justifications for the inconsistencies, but for real users these inconsistencies may cause errors and confusion. We need to be aware of
this potential and change the interface in order to facilitate the use of the tool.
Ambiguity: the choice of interface names leads to user misinterpretation. For
example, as a case study in [6.4.11], users were confused by the fact that both
"reduce" and "combine" were routines, where one referred to the operation the
users traditionally call reduction or combination (i.e., acquiring values from each
of multiple nodes and calculating the sum, minimum, etc.) and the other was a
shortcut referring to that same operation, followed immediately by a broadcasting
of the result to all nodes. The users complained that it would be very hard to
remember which meaning went with which name. Whenever possible, the PAT
should use concrete, clear names for everything.
Incompatibility: The interface specification contradicts or overrides accepted
patterns of usage established by other tools. Whereas consistency compares
elements within an interface, compatibility assesses how well the interface
conforms to "normal practice."
Since many parallel performance tools already exist, the designers of a new performance tool should conform to established practices whenever possible.
Indirection: One operation must be performed as a preliminary to another.
Indirection as described in the literature most often involved some sort of table
lookup operation, so that the index (rather than the name assigned by the user)
must be supplied as an argument to some other operation. We recommend that
user-defined names be used whenever possible throughout the PAT interface. In
this way, the user will be spared any confusion regarding naming conventions,
thus increasing efficiency.
Fragility: Subtleties in syntax or semantics that are likely to result in errors. For example, fragility arises if all blocking operations involve routines whose names begin with b (such as bsend and brecv) but one routine beginning with that letter (bcast) is non-blocking [6.4.11].
Fragility increases when the errors are essentially
undetectable (that is, the software will still work, but results will be incorrect or
unexpected). Therefore, it is essential for tool designers to be extremely careful
when applying the naming conventions to be used in order to prevent any
ambiguity. However, problems regarding fragility should be easy to find and
correct early in the design process through a careful review of the interface by
the tool designers as well as outside users.
Ergonomic Problems: Too many (or too clumsy) physical movements must be
performed by the user. In most cases, problems occur because users are forced
to do unnecessary or redundant typing. The tool should have measures in place
to facilitate the menial tasks associated with using it. For some aspects of the
tool’s interface, it may be more appropriate to develop a GUI rather than a
command line interface for the user to interact with the tool. However, if the
relative simplicity and intuitiveness of a command line interface is desired by
users, we recommend the tool have a scrollable command history to prevent
ergonomic problems.
Problems stemming from inconsistency and incompatibility clearly make it harder
for the novice to develop a well-formulated mental model of how the tool
operates, and therefore can be said to impede ease-of-learning. Incongruity and
ambiguity, on the other hand, are likely to cause problems in remembering how
to apply the software, thereby affecting ease-of-use. Fragility has more impact
on the user's ability to apply the software efficiently and with few errors, and so,
impedes usefulness. Finally, indirection and ergonomic problems are wasteful of
the user's movements and necessitate more actions/operations, thereby affecting
throughput. By avoiding these common problems, a tool will be able to address
the most relevant issues regarding the usability of parallel performance tools.
The next section provides strategies that may be used to avoid the common
problems presented in this section.
6.4.1.4 General solutions
This section provides an outline as to the types of changes that will be required
based on user feedback. This information may be useful to the designers of a
performance tool contemplating using a user-centered design approach in order
to justify the implementation costs of using such a strategy. The designers may
wish to know how the use of a user-centered approach generally affects the
implementation time of a performance tool. If the implementation time using a
user-centered approach is much greater than without, designers may wish to use
an alternative approach. Therefore, this section provides the justification for the
use of a user-centered design methodology in terms of implementation time.
Solutions to the common issues of usability found in parallel tools presented in
the previous section can generally be placed in one of six categories.
• Superficial change: modification limited to the documentation and/or procedure prototypes (e.g., to change the names of arguments)
• Trivial syntactic change: modification limited to the name of the operation
• Syntactic change: modification of the order of arguments or operands
• Trivial semantic change: modification of the number of arguments or operands
• Semantic change: relatively minor modification of the operation's meaning
• Fundamental change: addition of a new feature to the interface and/or major changes in operational semantics
However, evidence from case studies suggests that the overwhelming majority of
usability issues fall into the categories of superficial or trivial syntactic changes.
That is, simple changes in the names used to refer to operations or operands
were sufficient to eliminate the problem, from the users' perspective. Only rarely
do problems fall into the category of fundamental changes, requiring significant
implementation work. Thus, the actual cost of modifying a tool based on user
response is generally low. Therefore, since the implementation cost should be
low and the benefit to the usability of the tool will be high, as PAT designers, we
should feel justified in adopting a user-centered design methodology.
Experience has shown that development time for software created with user
input is not significantly increased [6.4.5, 6.4.8]. In some cases, development
time is actually shortened because features that would have required major
implementation effort turn out to be of no real interest to users. Essentially, more
developer time is spent earlier in the design cycle, rather than making
adjustments once the design is solidified. Generally speaking, the earlier that
input is solicited from users, the easier and less expensive it is to make changes, particularly semantic or fundamental changes. We should take this information
into account while designing the PAT in order to prevent lost time later in
development.
Finally, the design of software with usability in mind engenders real interest and
commitment on the part of users [6.4.13, 6.4.16]. From their perspective, the developers
are making a serious attempt to be responsive to their needs, rather than making
half-hearted, cosmetic changes when it’s too late to do any good anyway [6.4.14, 6.4.15]. By involving users in the design process, we may be able to establish a
strong user-base, which will be crucial to the success of the PAT project. Also,
users may be able to help in ways that go well beyond interface evaluation, such
as developing example applications or publicizing the tool among their
colleagues.
6.4.1.5 Conclusions
In this section, we provided a discussion on the factors influencing the usability of
performance tools followed by an outline of how to incorporate user-centered
design into the PAT.
We then discussed common problems seen in performance tools, followed by general solutions. From the preceding sections, it is clear
that how users perceive and respond to a performance tool is critical to its
success. Creating an elegant, powerful performance tool does not guarantee
that it will be accepted by users. Ease-of-learning, ease-of-use, usefulness, and
throughput are important indicators of the tool’s usability, and they depend
heavily on the interface provided by the tool.
It is essential that we capitalize on the observation that users are the ones who
are best qualified to determine how the PAT should be used. We need to be
sure that the PAT reflects user needs, that the interface organization correlates
well with established conventions set by other performance tools, and that
interface features and terminology are clear and efficient for users. The best way
for the PAT to meet these usability goals is to involve users in the entire
development process. The feedback gained from our most accessible users
should be presented to our meta-users and end-users for verification of large
changes in the interface. This will ensure that a wide variety of users will have
supported and verified the changes made to the interface. By using this strategy,
we feel that we can create a performance tool that has a high level of usability.
6.4.2 Presentation methodology
To be written.
6.4.3 References
[6.4.1] S. Musil, G. Pigel, and M. Tscheligi, “User Centered Monitoring of Parallel Programs with InHouse”, HCI Ancillary Proceedings, 1994.
[6.4.2] J. Grudin, “Interactive Systems: Bridging the Gaps between Developers and Users”, IEEE Computer, April 1991, pp. 59-69.
[6.4.3] C. Pancake, “Applying Human Factors to the Design of Performance Tools”.
[6.4.4] B. Curtis, “Human Factors in Software Development: A Tutorial”, 2nd Edition, IEEE Computer Society Press.
[6.4.5] B.P. Miller, et al., “The Paradyn Parallel Performance Measurement Tools”, IEEE Computer, 1995.
[6.4.6] C. Pancake and C. Cook, “What Users Need in Parallel Tool Support: Survey Results and Analysis”.
[6.4.7] J. Grudin, “Interactive Systems: Bridging the Gaps between Developers and Users”.
[6.4.8] R. Jeffries, J.R. Miller, C. Wharton, and K.M. Uyeda, “User Interface Evaluation in the Real World: A Comparison of Four Techniques”.
[6.4.9] J. Kuehn, “NCAR User Perspective”, Proc. 1992 Supercomputing Debugger Workshop, Jan 1993.
[6.4.10] J. Whiteside, J. Bennett, and K. Holtzblatt, “Usability Engineering: Our Experience and Evolution”, in Handbook of Human-Computer Interaction.
[6.4.11] C. Pancake, “Improving the Usability of Numerical Software through User-Centered Design”.
[6.4.12] D. Szafron and J. Schaeffer, “Interpretive Performance Prediction for Parallel Application Development”.
[6.4.13] G.V. Wilson, J. Schaeffer, and D. Szafron, “Enterprise in Context: Assessing the Usability of Parallel Programming Environments”.
[6.4.14] S. MacDonald, et al., “From Patterns to Frameworks to Parallel Programs”.
[6.4.15] S. Utter, “Enhancing the Usability of Parallel Debuggers”.
[6.4.16] C. Pancake and D. Hall, “Can Users Play an Effective Role in Parallel Tools Research”.
[6.4.17] T.R.G. Green and M. Petre, “Usability analysis of visual programming environments: a 'cognitive dimensions' framework”.
[6.4.18] T.E. Anderson and E.D. Lazowska, “Quartz: A Tool for Tuning Parallel Program Performance”.
[6.4.19] M. Parashar and S. Hariri, “Interpretive Performance Prediction for Parallel Application Development”.
6.5 Optimization
6.5.1 Optimization techniques
When programmers do not obtain the performance they require from their program, they often turn to optimization techniques to enhance the program's performance. Optimization techniques for programs have been well-researched and come in many forms. In this section, we examine optimization techniques in the following categories:
1. Sequential compiler optimizations – In this class of optimizations, we study
techniques that have been used to enhance the performance of sequential
programs. We restrict our study to techniques that are used by sequential
compilers. These optimizations are presented in Section 6.5.1.1.
2. Pre-compilation optimizations for parallel codes – These optimizations are
specific to parallel programs and are meant to be applied to a program
before being compiled. These optimizations deal with high-level issues
such as data placement and load balancing. These optimizations are
presented in Section 6.5.1.2.
3. Compile-time optimizations for parallel codes – Optimizing compilers play
a large role in the performance of codes generated from implicitly parallel
languages such as High-Performance Fortran (HPF) and OpenMP. Here
we examine techniques used by such compilers. These optimizations are
presented in Section 6.5.1.3.
4. Runtime optimizations for parallel codes – Many times optimizations
cannot be applied until a program’s runtime for various reasons. These
types of optimizations are presented in Section 6.5.1.4.
5. Post-runtime optimizations for parallel codes – The optimizations
presented in this category require that a program be executed in some
manner before the optimizations can be applied. These optimizations are
presented in Section 6.5.1.5.
We also present the following information for each optimization category
presented:
• Purpose of optimizations – What do these optimizations attempt to do?
• Metrics optimizations examine – Which particular metrics or program characteristics are examined by the optimization?
The aim of this study is to create a catalog of existing optimizations. We hope
that by studying existing optimization techniques, we will have a better
understanding of what kind of information needs to be reported to the user of our
PAT in order for them to increase the performance of their applications.
6.5.1.1 Sequential compiler optimizations
Many types of optimizations are performed by sequential compilers. In today’s
superscalar architectures, sequential compilers are expected to extract instruction-level parallelism to keep the hardware busy. A summary of these
techniques is presented below, which is mainly taken from an excellent survey
paper written by Bacon et al. [6.5.1]. In order for these techniques to be applied,
the compiler must have a method of correctly analyzing dependencies of
statements in the source code. We do not present the methods of dependency
analysis here; instead, we focus on the transformations the compiler performs
that increase program performance.
While most of these transformations can also be used for parallel codes, there
are additional restrictions that must be enforced. Midkiff and Padua [6.5.2] give
specific examples of these types of problems, and provide an analysis technique
that can be used to ensure the correctness of the transformations.
6.5.1.1.1 Loop transformations
• Purpose of optimizations – To restructure or modify loops to reduce their computational complexity, to increase their parallelism, or to improve their memory access characteristics
• Metrics optimizations examine – Computational cost of repeated or unnecessary statements, dependencies between code statements, and memory access characteristics
Loop-based strength reduction – In this method, a compiler substitutes a
statement in a loop that depends on the loop index with a less-costly
computation. For example, if the variable i is a loop index, the statement
    a[i] = a[i] + c*i
can be replaced with the statement
    a[i] = a[i] + T; T += c
(where T is a temporary variable initialized to c and the loop index starts at 1).
Induction variable elimination – Here, instead of using an extra register to
keep track of the loop index, the exit condition of the loop is rewritten in terms of
a strength-reduced loop index variable. This frees up a register and reduces the
work done during the loop. For example, the following code:
for (i = 0; i < 10; i++) a[i]++;
can be reduced by using a register to point to a[0] and stopping iteration after
the pointer reaches a[9], instead of incrementing i and recomputing the
address of a[i] in every loop iteration.
Loop-invariant code motion – Loop-invariant code motion moves expressions
evaluated in a loop that do not change between loop executions outside that
loop. This reduces the computation cost of each iteration in the loop.
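As a minimal sketch (the variables a, b, x, y, and t here are hypothetical), a compiler might transform
for (i = 0; i < 10; i++) {
    a[i] = b[i] * (x * y);
}
into
t = x * y;   /* the invariant product is now computed only once */
for (i = 0; i < 10; i++) {
    a[i] = b[i] * t;
}
assuming x and y are not modified anywhere inside the loop.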
Loop unswitching – This technique is related to loop-invariant code motion,
except that instead of singular expressions, conditionals that do not change
during loop execution are moved outside of the loop. The loop is duplicated as
necessary. Assuming x does not depend on the loop below, this optimization
changes
for (i = 0; i < 10; i++) {
    if (x <= 25) {
        a[i]++;
    } else {
        a[i]--;
    }
}
to
if (x <= 25) {
    for (i = 0; i < 10; i++) a[i]++;
} else {
    for (i = 0; i < 10; i++) a[i]--;
}
Since the conditional is only evaluated once in the second form, it saves a bit of
time compared with the first form.
Loop interchange – Loop interchanging interchanges the iteration order of
nested loops. This method can change
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        a[i][j] = sin(i + j);
    }
}
to
for (j = 0; j < 10; j++) {
    for (i = 0; i < 10; i++) {
        a[i][j] = sin(i + j);
    }
}
By interchanging loop orders, it is often possible to create larger independent
loops in order to increase parallelism in loops by increasing their granularity. In
addition, interchanging loop orders can also increase vectorization of code by
creating larger sections of independent code.
Loop skewing – Loop skewing is often employed in wavefront applications that
have updates that “ripple” through the arrays. Skewing works by adding a
multiplying factor to the outside loop and subtracting as needed throughout the
inner code body. This can create a larger degree of parallelism when codes
have dependencies between both indices.
Loop reversal – Loop reversal is related to loop skewing. In loop reversal, the
loop is iterated in reverse order, which may allow dependencies to be evaluated
earlier.
Strip mining – In strip mining, the granularity of an operation is increased by
splitting a loop up into larger pieces. The purpose of this operation is to split
code up into “strips” of code that can be converted to vector operations.
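As an illustrative sketch, assuming a hypothetical vector length of 64 elements, the loop
for (i = 0; i < 1000; i++) a[i]++;
might be strip mined into
for (is = 0; is < 1000; is += 64) {
    /* each strip of at most 64 iterations can become a single vector operation */
    for (i = is; i < is + 64 && i < 1000; i++)
        a[i]++;
}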
Cycle shrinking – Cycle shrinking is a special case of strip mining where the
strip-mined code is transformed into an outer serial loop and an inner loop with
instructions that only depend on the current iteration. This allows instructions in
the inner loop to be performed in parallel.
Loop tiling – Loop tiling is the multidimensional equivalent of cycle shrinking. It
is primarily used to improve data locality, and is accomplished by breaking the
iteration space up into equal-sized “tiles.”
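A minimal sketch of tiling, assuming a hypothetical tile size B and an N-by-N array transpose:
/* original loop nest */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        b[j][i] = a[i][j];
/* tiled loop nest: each B-by-B tile is finished before moving on, improving data reuse */
for (ii = 0; ii < N; ii += B)
    for (jj = 0; jj < N; jj += B)
        for (i = ii; i < ii + B && i < N; i++)
            for (j = jj; j < jj + B && j < N; j++)
                b[j][i] = a[i][j];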
Loop distribution – This method breaks up a loop into many smaller loops
which have the same iteration space as the original loop, but contain a subset of
instructions performed in the main loop. Loop distribution is employed to reduce
the use of registers within a loop and improve instruction cache locality.
Loop fusion – This is the reverse operation of loop distribution, where a series
of loops that have the same bounds are fused together. This method reduces
loop overhead, can increase spatial locality, and can also increase the load
balancing of parallel loops.
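As a simple sketch (array names hypothetical), the two loops
for (i = 0; i < 10; i++) a[i] = 0;
for (i = 0; i < 10; i++) b[i] = a[i] + 1;
can be fused into
for (i = 0; i < 10; i++) {
    a[i] = 0;            /* a[i] is still defined before it is used below */
    b[i] = a[i] + 1;
}
which pays the loop overhead only once and allows a[i] to be kept in a register between the two statements.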
Loop unrolling – Loop unrolling replicates the body of a loop several times in an
effort to reduce the overhead of the loop bookkeeping code. In addition, loop
unrolling can increase instruction-level parallelism since more instructions are
available which do not have loop-carried dependencies.
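For example (a sketch, assuming the unroll factor evenly divides the trip count), the loop
for (i = 0; i < 100; i++) a[i]++;
might be unrolled by a factor of four:
for (i = 0; i < 100; i += 4) {
    /* four independent updates are performed per loop test and branch */
    a[i]++; a[i + 1]++; a[i + 2]++; a[i + 3]++;
}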
Software pipelining – This method breaks a loop up into several stages and
pipelines the execution of these stages. For example, the loop
for (i = 0; i < 1000; i++)
    a[i]++;
can be broken into three pipeline stages: a load stage, an increment stage, and a
store stage. The instructions are executed in a pipelined fashion such that the
load for iteration 3 is executed, followed by the increment for iteration 2, and then
the store for iteration 1. This helps reduce the overhead of loop-carried
dependencies by allowing multiple iterations of the loop to be executed
simultaneously.
Loop coalescing – This method breaks up a nested loop into one large loop,
using indices computed from the outermost loop variable. The expressions
generated by loop coalescing are often reduced so that the overhead of the
original nested loops is reduced.
Loop collapsing – Loop collapsing can be used to reduce the cost of iteration of
multidimensional variables stored in a contiguous memory space. This method
collapses loops iterating over contiguous memory spaces by controlling iteration
with a single loop and substituting expressions for the other indices.
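A minimal sketch of loop collapsing, assuming a contiguous 10-by-20 array a of doubles:
/* original nested iteration */
for (i = 0; i < 10; i++)
    for (j = 0; j < 20; j++)
        a[i][j] *= 2.0;
/* collapsed form: a single loop walks the contiguous storage */
double *p = &a[0][0];
for (k = 0; k < 10 * 20; k++)
    p[k] *= 2.0;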
Loop peeling – Loop peeling removes a small number of iterations from the
beginning or end of a loop. It is used to eliminate loop dependencies caused by
dependent instructions in the beginning or end of a loop, and is also used to
enable loop fusion.
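As a sketch, a loop whose first iteration is a special case, such as
for (i = 0; i < 10; i++) {
    if (i == 0)
        a[i] = b[0];
    else
        a[i] = a[i] + b[i];
}
can have that iteration peeled off so that the remaining loop body contains no conditional:
a[0] = b[0];                 /* peeled first iteration */
for (i = 1; i < 10; i++)
    a[i] = a[i] + b[i];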
Loop normalization – Loop normalization converts all loops so that the looping
index starts at zero or one and is incremented by one each iteration. This is
used to bring the loop into a standard form, which some dependency analysis
methods require.
Loop spreading – This method takes code from one serial loop and inserts it
into another loop so that both loops can be executed in parallel.
Reduction recognition – In reduction recognition, the compiler recognizes
reduction operations (operations that combine the elements of an array into a single value)
and transforms them to be vectorized or executed in parallel. This works well for
or, and, min, and max operations since they are fully associative and can be
fully parallelized.
Loop idiom recognition – In this optimization, the compiler recognizes a loop
that has certain properties which allow use of special instructions on the
hardware. For example, some architectures have special reduction instructions
that can be used on a range of memory locations.
6.5.1.1.2 Memory access transformations
• Purpose of optimizations – To reduce the cost of memory operations, or restructure a program to reduce the number of memory accesses it performs
• Metrics optimizations examine – Memory access characteristics
Array padding – This is used to pad elements in an array so they fit nicely into cache lines. Array padding can be used to increase the effectiveness of
caches by aligning memory access locations to sizes well-supported by the
architecture.
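As an illustrative sketch, consider a matrix whose row length is a power of two; on a direct-mapped cache, column-wise accesses may repeatedly map to the same cache sets. Padding each row with one unused element (the array name and sizes here are hypothetical) changes that mapping:
double a[1024][1024];          /* before padding: column accesses can conflict in the cache   */
double a[1024][1024 + 1];      /* after padding: rows are offset, spreading accesses across sets */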
Scalar expansion – Scalar expansion is used in vector compilers to eliminate
antidependencies. Compiler-generated temporary loop variables are converted
to arrays so that each loop iteration has a “private” copy.
Array contraction – This method reduces the space needed for compiler-generated temporary variables. Instead of using n-dimensional temporary arrays
for loops nested n times, a single temporary array is used.
Scalar replacement – Here, a commonly-used memory location is “pegged” into a register to prevent repeated loads from the same memory location. It is useful when an inner loop reads the same variable repeatedly and a new value is written back to that location after the inner loop has completed.
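As a minimal sketch (names hypothetical), the row-sum loop
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        s[i] = s[i] + a[i][j];
can be rewritten so that s[i] is held in a register during the inner loop:
for (i = 0; i < n; i++) {
    t = s[i];                 /* load once before the inner loop */
    for (j = 0; j < n; j++)
        t = t + a[i][j];
    s[i] = t;                 /* single write-back after the inner loop */
}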
Code co-location – In code co-location, commonly-used code is replicated
throughout a program. Code co-location is meant to increase instruction cache
locality by keeping related code closer together in memory.
Displacement minimization – In most architectures, a branch is specified as an
offset from the current location and a finite number of bits are used to specify the
value of the offset. This means that the range of a branch is often limited, and
performing branches to larger offsets requires additional instructions and/or
overhead. Displacement minimization aims to minimize the number of long
branches by restructuring the code to keep branches closer together.
6.5.1.1.3 Reduction transformations
• Purpose of optimizations – To reduce the cost of statements in the code by eliminating duplicated work or transforming individual statements to equivalent statements of lesser cost
• Metrics optimizations examine – Computation time and number of registers used
Constant propagation – In constant propagation, constants are propagated
throughout a program, allowing the code to be more effectively analyzed. For
example, constants can affect loop iterations or cause some conditional
branches to always evaluate to a specific outcome.
Constant folding – Here, operations performed on constant values are
evaluated at compile time and substituted in the code, instead of being evaluated
at run time. For example, a compiler may replace all instances of 3 + 2 with 5
in order to save a few instructions throughout the program.
Copy propagation – In this method, multiple copies of a variable are eliminated
such that all references are changed back to the original variable. This reduces
the number of registers used.
Forward substitution – This generalization of copy propagation substitutes a
variable with its defining expression as appropriate. Forward substitution can
change the dependencies between statements, which can increase the amount
of parallelism in the code.
Algebraic simplification & strength reduction – Some math expressions may
be substituted for other, less costly ones. For example, x² can be replaced with
x*x and x*2 can be replaced by x+x. A compiler can exploit this during
compilation and reduce some algebraic expressions to less costly ones.
Unreachable code elimination – In this method, the compiler eliminates code
that can never be reached. This method may also create more opportunities for
constant propagation; if a compiler knows a section of code will never be
reached, then a variable which previously was modified there may be turned into
a constant if no other code modifies its value.
Dead-variable elimination – Variables that are declared but whose values are
never used end up wasting a register. Dead-variable elimination ignores these
variables and can free up registers.
Common subexpression elimination – Here, common subexpressions which
are evaluated multiple times are coalesced and evaluated only once. This can
speed up code at the cost of using another register.
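For example (a sketch with hypothetical variables), the statements
d = (a + b) * c;
e = (a + b) / 2;
can be rewritten as
t = a + b;        /* the common subexpression is evaluated once and held in a register */
d = t * c;
e = t / 2;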
Short-circuiting – Some boolean expressions can be made faster by creating
code that short-circuits after one expression evaluates to a certain value. For
example, in a compound and statement, if a single expression evaluates to false,
the rest of the expressions can be ignored since the whole expression will never
be true. In general, short-circuiting can also affect a program’s correctness if the
programmer makes an assumption about whether a statement in a boolean
expression is always evaluated or not, but most languages (such as C) use
short-circuiting to speed up code.
6.5.1.1.4 Function transformations
• Purpose of optimizations – To reduce the overhead of function calls
• Metrics optimizations examine – Computation time and function call overhead
Leaf function optimization – Some hardware architectures use a specific
register to store the location of the return address in a function call. If a function
never calls another function, this register is never modified, and instructions that
save the value of the return address at the beginning of the function and restore
the original return address back into the register after the function is done can be
eliminated.
Cross-call register allocation – If a compiler knows which registers are used by
a calling function and which functions this particular function calls, it may be able
to efficiently use temporary registers between function calls. For example, if
function A calls function B, but function B only uses registers R10 through R12
and function A only uses registers R5 through R9, then function B doesn't have to
save those registers on its call stack.
Parameter promotion – In functions that take call-by-reference arguments, it is
much faster to allow a function to take in arguments from registers instead of
storing them to memory and reloading them from within the function. Parameter
promotion uses registers to pass arguments and can lead to frame collapsing for
function calls (where no values are placed on the stack at all).
Function inlining – Function inlining attempts to reduce the overhead of a
function call by replicating code instead of using a branch statement. This
increases register contention, but eliminates the overhead of a function call.
Function cloning – This method creates specialized versions of functions based
on the values obtained by constant propagation. The specialized versions of the
functions can take advantage of algebraic and strength reductions to make them
faster than the general versions.
Loop pushing – Loop pushing moves a loop that calls a function several times
inside a cloned version of that function. It allows the cost of the function call to
be paid only once, instead of several times as in the original loop. This also may
increase the parallelism within the loop itself.
Tail recursion elimination – This method reduces the function call costs
associated with recursive programs by exploiting properties of tail-recursive
functions. Tail-recursive functions compute values as the recursion depth
increases, and once the recursion stops the value is returned. This returned
value propagates up the call stack until it is finally returned to the caller. Tail
recursion elimination converts a tail-recursive function into an iterative one, thus
reducing the overhead of several function calls.
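A minimal sketch, using a hypothetical accumulator-style factorial routine:
/* tail-recursive form: the recursive call is the last operation performed */
int fact(int n, int acc) {
    if (n <= 1) return acc;
    return fact(n - 1, acc * n);
}
/* equivalent iterative form that a compiler could generate */
int fact_iter(int n, int acc) {
    while (n > 1) {
        acc = acc * n;
        n = n - 1;
    }
    return acc;
}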
Function memoization – Function memoization is a technique often used with
dynamic programming. In this technique, values from function calls which have
no side effects on the program’s execution are cached before they are returned.
If the function is called with the same arguments, the cached value is returned
instead of being computed over again.
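A minimal sketch of memoization, using a hypothetical Fibonacci routine and assuming n stays below the cache size:
int fib(int n) {
    static int cache[64];        /* zero-initialized; 0 doubles as "not yet computed" */
    if (n < 2) return n;
    if (cache[n] == 0)
        cache[n] = fib(n - 1) + fib(n - 2);
    return cache[n];
}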
6.5.1.2 Pre-compilation optimizations for parallel codes
In this section, we present optimization strategies that are meant to be used
during the coding phase of an application. The methods here are general
enough to apply to many programming paradigms, although some are difficult to
use and require specialized skills in combinatorial mathematics.
6.5.1.2.1 Tiling
• Purpose of optimizations – Automatically parallelize sequential code that can be sectioned into atomic tiles
• Metrics optimizations examine – Rough approximations of communication and computation; some tile placement strategies also consider load balancing
Tiling is a general strategy for decomposing the work of programs that use
nested for loops. Tiling groups iterations within those nested for loops into
“tiles” which can be independently executed. These tiles are then assigned to
processors, and any communication necessary to satisfy dependencies is also
scheduled. For tiling mappings of programs that have inherent loop-carried
dependencies, a systolic architecture is used to exploit the maximum amount of
parallelism possible. In general, tiling may use any type of “tile shape,” which
controls which elements are grouped into which tiles. The term shape is used,
since the iteration space is visualized as an n-dimensional space broken into
different tiles whose shapes depend on the decomposition used.
A technique based on iteration space tiling is presented by Andonov and
Rajopadhye in [6.5.3]. Iteration space tiling is used to vary the size of the tiles for
a given tile mapping in an effort to find the optimal tile size based on a simple
systolic communication and computation model. In the technique presented by
Andonov and Rajopadhye, the systolic model is relaxed by providing a general
(but still highly-abstracted) way to describe the communication and computation
that occurs within a tile. It is assumed that the optimal tile shape has already
been found through another method. Once a relaxed systolic model for the
problem has been created, it is combined with the general communication and
computation functions and turned into a two-variable optimization problem whose
free variables are the tile period (computation time) and inter-tile latency
(communication time). The authors then present a complicated algorithm that
solves this optimization problem, which gives the optimal tile size to be used
with the chosen tiling shape.
Goumas et al. present a method for automatically generating parallel code for
nested loops in [6.5.4] that is based on tiling. In their method, the sequential
code is analyzed and an optimal tiling is chosen, which is then parallelized.
Since the body of each tile can be executed independently provided the
prerequisite computations have already been completed and sent to this
processor, the parallelization step is very straightforward. The authors then test
their parallelization method on some small computation kernels, and find that
their method is able to perform the tiling quickly and give reasonable
performance.
Lee and Kedem present another data and computation decomposition strategy
that is based on tiling in [6.5.5]. In their method, they first determine the optimal
data distribution to be used for the code that is to be parallelized, and then decide how to map the tiles to a group of processors to minimize communication. Their method does not make any assumptions about tile sizes or shapes, but is
extremely complicated.
6.5.1.2.2 ADADs
• Purpose of optimizations – Optimally parallelize do loops in Fortran D
• Metrics optimizations examine – Data placement of array elements and communication caused by loop dependencies
Hovland and Ni propose a method for parallelizing Fortran D code in [6.5.6]
which uses augmented data access descriptors. Augmented data access
descriptors (ADADs) provide compact representations of dependencies inherent
in loops. Traditionally, compilers rely on pessimistic heuristics to determine
whether a loop can be fully parallelized or not. Instead of directly analyzing
dependencies between code statements within a loop, ADADs represent how sections of a program can influence each other.
ADADs also allow
dependencies to be determined by performing simple union and intersection
operations. Armed with the ADADs, Hovland and Ni compute data placement for
the array elements modified in the loop based on a heuristic. In addition, they
show that they are able to apply such loop transformations as loop unrolling and
loop fusion directly on their ADADs, which may help a parallelizing compiler
choose which optimizations to apply.
6.5.1.3 Compile-time optimizations for parallel codes
In this section, we present optimizations that can be performed during the
compilation of a program. In many cases, the methods described here may also
be performed prior to the compilation of the code; however, because these
methods can be automated, they are incorporated into the compilation process.
Therefore, the methods presented here generally have low-cost implementations
(in terms of memory and CPU usage).
6.5.1.3.1 General compile-time strategies
• Purpose of optimizations – To eliminate unnecessary communication, or reduce the cost of necessary communication between processors
• Metrics optimizations examine – Dependency transformations which allow privatization of scalars or arrays, overhead of small message sizes vs. large message sizes, cache line size for shared-memory systems, and dependencies between computation and communication statements (for latency hiding)
The summary provided by Bacon et al. [6.5.1] also contains a short overview of
general compile-time optimizations performed by parallel compilers. Since they
are general strategies, we group them together here.
Scalar privatization – This is the parallel generalization of the “scalar
expansion” presented in Section 6.5.1.1.2, in which temporary variables used
during execution of a loop are kept local instead of shared across all processors
evaluating that loop. This decreases the amount of unnecessary communication
between threads, and is only needed in parallelizing compilers for implicitly
parallel languages (e.g., HPF).
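The following minimal C sketch illustrates the source-level effect of scalar privatization; the loop and variable names are hypothetical and chosen only for illustration.

#include <stdio.h>

#define N 1000

double x[N], y[N];

/* Before privatization: a single temporary t is reused by every
 * iteration, so a parallelizing compiler must treat it as a shared
 * variable and a dependence between iterations. */
void scale_shared_temp(double c)
{
    double t;
    for (int i = 0; i < N; i++) {
        t = c * x[i];              /* t is reused across iterations */
        y[i] = t + t * t;
    }
}

/* After scalar privatization: t is declared inside the loop body, so
 * each iteration (and, in a parallel version, each thread) owns a
 * private copy.  The iterations are now independent and the loop can
 * be distributed across processors with no communication for t. */
void scale_private_temp(double c)
{
    for (int i = 0; i < N; i++) {
        double t = c * x[i];       /* private per-iteration temporary */
        y[i] = t + t * t;
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) x[i] = i;
    scale_private_temp(2.0);
    printf("y[10] = %f\n", y[10]);
    return 0;
}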
Array privatization – Certain parallel codes operate on arrays that are written
once (or statically initialized) at the beginning of a program’s execution and are
only read from after that. Distributing these arrays to every processor (if
necessary) reduces a program’s overall execution time significantly, since much
unnecessary communication is eliminated.
Cache alignment – On shared-memory and distributed shared-memory
machines, aligning and padding objects to cache-line boundaries can reduce false sharing, which occurs when two unrelated elements are placed on the same cache line (or coherence page) and updates to one needlessly invalidate the other. This is analogous to the “array padding”
mentioned in Section 6.5.1.1.2.
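The C sketch below illustrates the padding idea; the 64-byte cache-line size and the per-thread counters are assumptions made only for illustration and would differ between machines.

#include <stdio.h>

#define CACHE_LINE 64             /* assumed cache-line size in bytes */
#define NTHREADS   4

/* Unpadded: the per-thread counters are packed together, so several of
 * them share one cache line and concurrent updates cause false sharing
 * (the line ping-pongs between processors). */
struct counters_packed {
    long count[NTHREADS];
};

/* Padded/aligned: each counter occupies its own cache-line-sized slot,
 * so updates by different threads touch different lines. */
struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};

struct counters_padded {
    struct padded_counter c[NTHREADS];
};

int main(void)
{
    printf("packed size: %zu bytes\n", sizeof(struct counters_packed));
    printf("padded size: %zu bytes\n", sizeof(struct counters_padded));
    return 0;
}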
Message vectorization, coalescing, & aggregation – In these methods, data is
grouped together before being sent to another processor. Since the overhead of sending a large number of small messages usually dominates the overall
communication cost, using larger messages whenever possible allows for more
efficient communication.
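As a rough illustration of the difference between many small messages and one vectorized message, the MPI sketch below sends the same array both ways; the array size and tags are arbitrary, and the program assumes it is run with at least two processes.

#include <mpi.h>
#include <stdio.h>

#define N 1000

/* Element-by-element transfer: N small messages, so per-message overhead
 * (latency, headers, software setup) dominates the communication cost. */
void send_elementwise(double *buf, int dest)
{
    for (int i = 0; i < N; i++)
        MPI_Send(&buf[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* Vectorized transfer: the same data grouped into one large message,
 * paying the per-message overhead only once. */
void send_vectorized(double *buf, int dest)
{
    MPI_Send(buf, N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int rank, size;
    double buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) buf[i] = i;

    if (size >= 2) {
        if (rank == 0) {
            send_elementwise(buf, 1);
            send_vectorized(buf, 1);
        } else if (rank == 1) {
            for (int i = 0; i < N; i++)
                MPI_Recv(&buf[i], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d elements both ways\n", N);
        }
    }

    MPI_Finalize();
    return 0;
}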
Message pipelining – Message pipelining is another name for the overlapping
of communication and computation, which can allow a machine to “hide” much of
the latency that communication normally incurs.
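The following MPI sketch illustrates the overlap idea with a nonblocking send/receive pair; the buffer size and the computation performed while the message is in flight are made up for illustration.

#include <mpi.h>
#include <stdio.h>

#define N 100000

int main(int argc, char **argv)
{
    static double local[N], halo[N];
    double sum = 0.0;
    int rank, size;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

    /* Start the transfer without waiting for it to complete. */
    if (size >= 2 && rank == 0)
        MPI_Isend(local, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (size >= 2 && rank == 1)
        MPI_Irecv(halo, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    /* Independent computation proceeds while the message is in flight,
     * hiding (part of) the communication latency. */
    for (int i = 0; i < N; i++)
        sum += local[i] * local[i];

    /* Block only when the transferred data is actually needed. */
    if (size >= 2 && rank <= 1)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d: local sum = %f\n", rank, sum);
    MPI_Finalize();
    return 0;
}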
6.5.1.3.2 PARADIGM compiler
• Purpose of optimizations – Reduce communication overhead and increase parallelism in automatically parallelized Fortran77 and HPF source code
• Metrics optimizations examine – Communication and load balancing
PARADIGM is a compiler developed by Banerjee et al. that automatically
parallelizes Fortran77 or HPF source code [6.5.7]. The authors describe their
compilation system as one that uses a “unified approach.” PARADIGM uses an
abstract model of a multidimensional mesh of processors in order to determine
the best data distribution for the source code fed into it. The parallelizer uses
task graphs internally to schedule tasks to processors, and supports regular
computations through static scheduling and irregular computations through a
runtime system. The PARADIGM compiler can also generate code that overlaps
communication with computation. An earlier version of PARADIGM was only
able to determine data partitionings for HPF code [6.5.8], but later versions were
also able to parallelize the code and apply optimizations to the parallelized code.
To perform automatic data partitioning, a communication and computation cost
model is created for specific architectures, which is controlled via architectural
parameters. The data partitioning happens in several phases: a phase to align
array dimensions to a dimensions in the abstract processor mesh, a pass to
determine if a block or cyclic layout distribution should be used, a block size
selection pass, and a mesh configuration pass where data is mapped to a mesh.
PARADIGM also uses a few optimization techniques such as message
coalescing, message vectorization, message aggregation, and coarse-grained
pipelining to maximize the amount of parallelism and minimize the
communication costs. A short description of these optimizations is shown below:
Message coalescing – Eliminates redundant messages that reference, at different times, data that has not been modified
Message vectorization – Vectorizes nonlocal accesses within a loop iteration
into one larger message
Message aggregation – Messages with identical source/destination pairs are
merged together into a single message (may also merge vectorized messages)
Coarse-grained pipelining – Overlaps loop iterations to increase parallelism in loops which cannot be fully parallelized due to dependencies between iterations of the loop
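The sketch below illustrates the coarse-grained pipelining idea on a simple two-dimensional recurrence with its columns block-distributed across processes. It is our own illustrative MPI code, not PARADIGM output, and it assumes the row count is divisible by both the strip height and the number of processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512          /* rows (assumed divisible by B and by nprocs) */
#define B 64           /* strip (block) height: the pipelining grain   */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ncols = N / size;                    /* columns owned by this rank */
    double *a = calloc((size_t)N * (ncols + 1), sizeof *a); /* col 0 = halo */
    double left[B];                          /* incoming boundary segment  */

    /* Pipelined sweep of a[i][j] = f(a[i-1][j], a[i][j-1]): rows are
     * processed in strips of B; after finishing a strip, this rank forwards
     * its rightmost column segment so the next rank can start on that strip
     * while this rank moves on to the following one. */
    for (int i0 = 0; i0 < N; i0 += B) {
        if (rank > 0)
            MPI_Recv(left, B, MPI_DOUBLE, rank - 1, i0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        for (int i = i0; i < i0 + B; i++) {
            a[i * (ncols + 1) + 0] = (rank > 0) ? left[i - i0] : 1.0;
            for (int j = 1; j <= ncols; j++) {
                double up = (i > 0) ? a[(i - 1) * (ncols + 1) + j] : 1.0;
                a[i * (ncols + 1) + j] =
                    0.5 * (up + a[i * (ncols + 1) + j - 1]);
            }
        }
        if (rank < size - 1) {
            double right[B];
            for (int i = i0; i < i0 + B; i++)
                right[i - i0] = a[i * (ncols + 1) + ncols];
            MPI_Send(right, B, MPI_DOUBLE, rank + 1, i0, MPI_COMM_WORLD);
        }
    }

    printf("rank %d: last value = %f\n", rank,
           a[(N - 1) * (ncols + 1) + ncols]);
    free(a);
    MPI_Finalize();
    return 0;
}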
6.5.1.3.3 McKinley’s algorithm
• Purpose of optimizations – Optimize regular Fortran code to increase parallelism and exploit spatial and temporal locality on shared-memory machines
• Metrics optimizations examine – Spatial and temporal locality (memory performance), parallel granularity, loop overhead (for loop transformations)
In [6.5.9], McKinley describes an algorithm for optimizing and parallelizing
Fortran code which is meant to be used with “dusty deck” codes. McKinley
argues that the only way to get decent performance out of sequential programs
not written with parallelism in mind is to combine algorithmic parallelization and
compiler optimizations. McKinley’s algorithm is designed for use with shared-memory multiprocessor systems.
The algorithm considers many aspects of program optimization and is divided
into four steps: an optimization step, a fusing step, a parallelization step, and an
enabling step. The optimization step uses loop permutation and tiling (which
incorporates strip mining) on a single loop nest to exploit data locality and
parallelism. The fusing step performs loop fusion and loop distribution to
increase the granularity of parallelism across multiple loop nests.
The
parallelization step combines the results from the optimization and fusing steps to
optimize single loop nests within procedures. Finally, the enabling step uses
interprocedural analysis and transformations to optimize loop nests containing
function calls and spanning function calls by applying loop embedding, loop
extraction, and procedure cloning as needed. The algorithm uses a memory
model to determine placement and ordering of loop iterations for maximum
parallelism and locality. A simple loop tiling algorithm chooses the largest tile
size possible given a number of processors, which enhances the previous
optimizations that try to maximize the spatial locality of a loop.
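As a small illustration of the fusing step, the C sketch below shows two loops before and after fusion; the arrays and computations are hypothetical.

#include <stdio.h>

#define N 1000000

double a[N], b[N], c[N];

/* Separate loops: two sweeps over the arrays, so a[] is read twice and
 * each loop is a separate (small-granularity) parallel region. */
void unfused(void)
{
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
}

/* Fused loop: one sweep, better temporal locality for a[] and b[], and a
 * single, larger-granularity loop to parallelize. */
void fused(void)
{
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
        c[i] = a[i] + b[i];
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) a[i] = i;
    fused();                       /* compare against unfused() */
    printf("c[123] = %f\n", c[123]);
    return 0;
}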
McKinley applied these optimization techniques to code obtained from scientists working at Argonne National Laboratory. The speedups obtained surpassed those of hand-coded parallel routines in some cases, and matched or closely matched them in all others, which implies that the strategies and optimizations used in the compiler worked as intended.
6.5.1.3.4 ASTI compiler
• Purpose of optimizations – Ease parallelization of existing sequential code by extending a sequential compiler with parallel transformations and “loop outlining”
• Metrics optimizations examine – Work distribution, memory contention, loop overheads
ASTI is a sequential compiler that has been extended by IBM to parallelize codes
for SMP machines [6.5.10]. The compiler uses many high-level sequential code
optimizations such as loop interchanges and loop unrolling, as well as general
techniques for estimating cache and TLB access costs. The extended version of
ASTI also uses loop coalescing and iteration reordering to enhance
parallelization. In addition, the compiler uses a method known as function
outlining which simplifies storage management for threads by transforming
sections of code into function calls. This does simplify storage management,
since in a function call each thread gets a separate copy of local variables
allocated in its stack frame. The compiler also uses a parallel runtime library that
employs dynamic self-scheduling for load balancing. The “chunk sizes” that are
used to distribute work by the runtime library can be changed at runtime, and the
library also lets the user choose between busy waits and sleeps when a thread is
blocked for data. The busy-wait method can decrease performance since it increases memory contention, while the sleep method avoids that contention but can have higher wake-up latency. The authors tested their ASTI compiler on a four-processor machine
and found that their compiler generated positive speedups, although the parallel
efficiency of their implementation left much to be desired.
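The C sketch below illustrates the general idea of outlining a loop into a function whose invocations can be handed to worker threads in chunks; it is only an approximation of the transformation described above, and all names and chunk sizes are our own.

#include <stdio.h>

#define N 1000

double x[N], y[N];

/* Outlined loop: the parallelizable region has been turned into a function
 * that takes its iteration bounds as arguments.  A runtime library can hand
 * (lo, hi) "chunks" to worker threads, and any temporaries declared inside
 * the function live in each thread's own stack frame, which is what
 * simplifies storage management. */
void outlined_body(int lo, int hi, double c)
{
    for (int i = lo; i < hi; i++) {
        double t = c * x[i];       /* per-invocation (per-thread) local */
        y[i] = t * t;
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) x[i] = i;

    /* Sequentially invoke the outlined function in chunks; a parallel
     * runtime would instead dispatch each call to a different thread. */
    int chunk = 250;
    for (int lo = 0; lo < N; lo += chunk)
        outlined_body(lo, lo + chunk, 3.0);

    printf("y[999] = %f\n", y[999]);
    return 0;
}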
6.5.1.3.5 dHPF compiler
• Purpose of optimizations – Reduce communication between processors and keep work distribution even
• Metrics optimizations examine – Communication cost vs. computation cost, cost of small vs. large messages, dependencies between statements to allow for overlapping of communication and computation
dHPF is a High Performance Fortran compiler that automatically parallelizes HPF code [6.5.11]. dHPF uses the owner-computes rule to guide its
computation partitioning model. For a given loop, the compiler chooses a data
partitioning that minimizes communication costs by evaluating possible data
mappings with a simple communication model (based on the number of remote
references).
The compiler also includes several optimizations, including message vectorization, message coalescing, message combining, and loop-splitting transformations that enable communication to be overlapped with computation. The authors found that while these optimizations worked well on simple kernels and benchmarks, more optimizations were needed for more realistic codes such as the NAS benchmarks.
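The following C sketch illustrates the owner-computes rule on a block-distributed array. It simulates the SPMD execution of several processors in a single address space purely for illustration; it is not dHPF-generated code, and the distribution and loop are hypothetical.

#include <stdio.h>

#define N      1000
#define NPROCS 4

/* Block distribution: element i is owned by processor i / (N / NPROCS). */
int owner(int i)
{
    return i / (N / NPROCS);
}

double a[N], b[N];

/* Owner-computes partitioning of a simple loop: each (simulated) processor
 * executes only the assignments whose left-hand-side element it owns; all
 * other iterations are skipped on that processor. */
void update(int myrank)
{
    for (int i = 1; i < N; i++) {
        if (owner(i) != myrank)
            continue;              /* not my data: another processor computes it */
        a[i] = 0.5 * (b[i - 1] + b[i]);  /* on a real machine b[i-1] may be remote */
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) b[i] = i;

    /* Simulate the SPMD execution of all processors in one address space. */
    for (int p = 0; p < NPROCS; p++)
        update(p);

    printf("a[500] = %f\n", a[500]);
    return 0;
}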
Therefore, the dHPF compiler also incorporates other optimizations to increase
its ability to parallelize the NAS benchmarks. The additional optimizations
needed for the NAS benchmarks are listed below:
• Array privatization – Same technique as discussed in Section 6.5.1.3.1
• Partial replication of computation – In order to decrease communication costs, code sections can be marked with a LOCALIZE directive which indicates that the statements will be performed by all processors instead of having one processor evaluate them and all other processors read them
With these optimizations in place, the authors were able to achieve competitive
performance on the NAS benchmarks.
6.5.1.4 Runtime optimizations for parallel codes
In this section, we present optimization methods that are not employed until
runtime. In general, it is more efficient to perform optimizations during development or at compile time. However, irregular code, nondeterministic code, and code that
cannot be statically analyzed cannot be optimized at compile time and must be
handled at runtime, so it is useful to study these techniques.
6.5.1.4.1 Inspector/executor scheme
• Purpose of optimizations – Perform runtime load balancing for irregular applications by generating a schedule based on runtime information (which may be generated in parallel) and executing that schedule
• Metrics optimizations examine – Data access patterns during runtime
Saltz et al. invented a general method for runtime parallelization by splitting the
problem up into two pieces: computing the parallel iteration schedule and then
executing the computed schedule [6.5.12].
They title their method the
inspector/executor scheme, where the job of the inspector is to generate an
optimal (or near-optimal) iteration schedule and the job of the executor is to carry out that schedule. The general idea is very straightforward, but can
be very effective for irregular programs, although it only works for problems
whose data access patterns can be predicted before a loop is executed (which
covers many unstructured mesh explicit and multigrid solvers, along with many
sparse iterative linear systems solvers). The authors later implemented a
general library for their method under the name of PARTI [6.5.13].
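The C sketch below illustrates the inspector/executor split on a toy irregular loop. The "gather" is simulated with a local copy; a PARTI-style library would instead group the needed elements by owning processor and fetch them remotely. The index values and array sizes are made up.

#include <stdio.h>

#define N 16

/* Irregular access pattern: x[idx[i]] is not known until runtime, so the
 * communication needed to fetch remote elements cannot be planned at
 * compile time. */
int    idx[N] = { 3, 7, 1, 7, 0, 9, 3, 12, 5, 7, 14, 2, 9, 1, 8, 4 };
double x[N], y[N];

/* Inspector: scan the index array once and build a "schedule" -- here
 * simply the list of distinct elements that must be gathered before the
 * loop runs. */
int build_schedule(int *needed)
{
    int seen[N] = { 0 }, n = 0;
    for (int i = 0; i < N; i++) {
        if (!seen[idx[i]]) {
            seen[idx[i]] = 1;
            needed[n++] = idx[i];
        }
    }
    return n;
}

/* Executor: perform the gathers prescribed by the schedule (simulated by a
 * local copy here), then run the computation loop with purely local
 * accesses. */
void run_schedule(const int *needed, int n)
{
    double gathered[N];
    for (int k = 0; k < n; k++)
        gathered[needed[k]] = x[needed[k]];  /* stands in for a remote fetch */
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * gathered[idx[i]];
}

int main(void)
{
    int needed[N];
    for (int i = 0; i < N; i++) x[i] = i;

    int n = build_schedule(needed);   /* inspector (can itself be parallelized) */
    run_schedule(needed, n);          /* executor                               */

    printf("%d distinct elements gathered; y[0] = %f\n", n, y[0]);
    return 0;
}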
The overhead of generating the schedule during runtime can be very high. This
prompted Leung and Zahorjan to parallelize the inspection algorithm in order to speed up the generation of the schedule and thus gain parallel efficiency [6.5.14]. They provided two parallel variations of the original inspection algorithm:
one that works by assigning sections of code to be examined by each processor
and merging those sections using the serial inspection algorithm, and another
that “bootstraps” itself by using the inspection algorithm on itself. The authors
tested their method and found that the sectioning method worked best in most
cases.
6.5.1.4.2 Nikolopoulos’ method
• Purpose of optimizations – To redistribute work based on load-balancing information obtained after a few initial probing iterations of an irregular application code
• Metrics optimizations examine – Number of floating-point operations performed per processor, data placement
Nikolopoulos et al. present another method for runtime load balancing for
irregular applications that works with unmodified OpenMP APIs in [6.5.15]. Their
method works by using a few probing iterations in which information about the
state of each processor is obtained through instrumented library calls. After
these probing iterations are done, the collected metrics are analyzed and data
and loop migration are performed to minimize data communication and maximize
parallelism. The algorithm used to reallocate iterations is very simple: it is based on a greedy method that redistributes work from overloaded processors to
less busy processors. The authors tested their method and found that they
achieved performance within 33% of a hand-coded MPI version.
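The toy C sketch below conveys the flavor of such a greedy redistribution step using made-up work measurements; it is not the authors' actual algorithm, and the numbers and cost model are our own assumptions.

#include <stdio.h>

#define NPROCS 4

/* Work (e.g., floating-point operations) measured per processor during a
 * few probing iterations; the values are invented for illustration. */
double work[NPROCS]  = { 900.0, 300.0, 250.0, 550.0 };
int    iters[NPROCS] = { 250,   250,   250,   250   };  /* current shares */

/* One step of a simple greedy rebalancing: shift iterations from the most
 * loaded processor to the least loaded one, in proportion to how far the
 * most loaded processor is above the average load. */
void rebalance_once(void)
{
    int max = 0, min = 0;
    double total = 0.0;

    for (int p = 0; p < NPROCS; p++) {
        total += work[p];
        if (work[p] > work[max]) max = p;
        if (work[p] < work[min]) min = p;
    }

    double avg = total / NPROCS;
    double per_iter = work[max] / iters[max];  /* estimated cost per iteration */
    int move = (int)((work[max] - avg) / per_iter);

    iters[max] -= move;
    iters[min] += move;
    printf("moving %d iterations from processor %d to processor %d\n",
           move, max, min);
}

int main(void)
{
    rebalance_once();
    for (int p = 0; p < NPROCS; p++)
        printf("processor %d: %d iterations\n", p, iters[p]);
    return 0;
}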
6.5.1.5 Post-runtime optimizations for parallel codes
In this section, we explore methods for optimizing a program’s performance
based on tracing data obtained during a program’s run. Many optimization
strategies in this category can be categorized as using the “measure-modify”
approach, where optimization is performed iteratively by making an improvement and rerunning the application in the hope that the new measurements show an improvement.
6.5.1.5.1 Ad-hoc methods
• Purpose of optimizations – To optimize performance using trial-and-error methods
• Metrics optimizations examine – Varies
A large number of optimization methods currently fall under the “ad-hoc”
category, including methods used with most performance analysis tools. The
defining characteristic of these methods is that they rely on providing the user
with detailed information gathered during the program’s execution, and they
expect that this information will enable the user to optimize their code. Usually
this involves changing one parameter of the program at a time and rerunning to
see the impact of the change, although analytical and simulation models may be used to gain insight into which code changes the user should try.
Aside from improving the algorithm used for a program, this class of
optimizations usually yields the most speedup for a program. However, it is
heavily dependent on the skill (and sometimes luck) of the user.
6.5.1.5.2 PARADISE
• Purpose of optimizations – To automate the optimization process by analyzing trace files generated during runtime
• Metrics optimizations examine – Communication overhead and load balancing
Krishnan and Kale present a general method for automating trace file analysis in
[6.5.16]. The idea they present builds on the ad-hoc approach described above, except that the improvements the user implements after each run of the program
are suggested by the PARADISE (PARallel programming ADvISEr) system. In
addition, the improvements suggested by PARADISE can be automatically
implemented without user intervention through the use of a “hints file” given to
the runtime library used in the system.
PARADISE works with the Charm++ language, a variant of C++ that has
distributed objects. Because of this, some of the optimizations it deals with
(method invocation scheduling and object placement) are of a different nature
than those used in traditional SPMD-style programs. However, the system does
incorporate a few well-known techniques for improving communications:
message pipelining and message aggregation.
6.5.1.5.3 KAPPA-PI
• Purpose of optimizations – To suggest optimizations based on trace files generated during runtime
• Metrics optimizations examine – Information extracted from trace files (time between send/receive pairs, time taken for barrier synchronizations, send and receive communication patterns)
KAPPA-PI, a tool written by Espinosa while performing research for his PhD
[6.5.17], aims to provide a user with suggestions for improving their program.
KAPPA-PI is a knowledge-based program that classifies performance problems
using a rule-based system. A catalog of common performance problems is associated with a set of rules; when a rule evaluates to true, the corresponding problem is presented to the user and correlated with the source code location at which it occurs. The performance problems used in the tool are categorized as blocked sender, multiple output (which represents extra serialization caused by many messages being sent), excessive barrier synchronization time, a master/worker scheme that generates idle cycles, and load imbalance. Based on the problems detected, the tool also generates recommendations on how the user can fix them.
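The C sketch below gives a feel for this style of rule evaluation using a single, made-up "blocked sender" rule over hypothetical trace records; KAPPA-PI's real rule catalog, thresholds, and recommendations are more elaborate.

#include <stdio.h>

/* One simplified trace record for a blocking receive. */
struct recv_event {
    const char *source_file;
    int         line;
    double      wait_usec;   /* time spent blocked before the message arrived */
};

struct recv_event events[] = {
    { "wave.c",  88,   120.0 },
    { "wave.c", 143, 98000.0 },
    { "main.c",  52,   300.0 },
};

int main(void)
{
    double blocked_sender_threshold = 10000.0;   /* rule parameter (invented) */
    int n = sizeof events / sizeof events[0];

    /* Evaluate one rule from a hypothetical catalog: a receive that waits
     * "too long" is classified as a blocked-sender problem and reported
     * together with the source location it occurred at. */
    for (int i = 0; i < n; i++) {
        if (events[i].wait_usec > blocked_sender_threshold)
            printf("blocked sender suspected at %s:%d (waited %.0f usec); "
                   "consider sending the data earlier or using a "
                   "non-blocking receive\n",
                   events[i].source_file, events[i].line,
                   events[i].wait_usec);
    }
    return 0;
}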
KAPPA-PI is an interesting technique, but it is aimed squarely at the novice user.
As such, it would not be helpful for users who have a better understanding of the
systems than it does.
6.5.1.6 Conclusions
In this section we have presented many different optimization techniques which
are applied at different stages of a program’s lifetime. The most common metrics
examined by the optimizations we presented are loop overhead, data locality
(spatial and temporal), parallel granularity, load balancing, communication
overhead of small messages, communication overhead of messages that can be
eliminated or merged, and placement strategies for shared data. The metrics
relevant to our PAT are the communication, load balancing, and cache
characteristics of programs. Nearly all of the optimization techniques presented
here can be applied based on the information provided by these three main
categories of program information; if we make sure to include them in our PAT,
we enable programmers to use these optimization techniques which are already
developed.
One interesting phenomenon we have observed is the lack of optimizations for
parallel programs in commercial compilers. Arch Robison, a programmer with
KAI Software (a division of Intel), attributes this to the fact that compiler
optimizations are a small part of a larger system that a customer expects when
purchasing a compiler [6.5.18]. In addition to the compiler itself, customers also
expect libraries, support, and other tools to be provided with their purchase.
Therefore, optimizations must compete with other aspects of a compiler package
for attention from customers. Because no “one size fits all” optimizations have
been discovered that drastically improve the performance for all types of parallel
applications, it becomes harder to integrate them into a commercial product due
to their limited applicability. To remedy this situation, Robison advocates the
separation of optimizations from the compiler, so that third-party developers may
develop optimizations for specific types of situations as appropriate.
6.5.2 Performance bottleneck identification
To be written.
6.5.3 References
[6.5.1]
D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations
for high-performance computing,” ACM Computing Surveys, vol. 26,
no. 4, pp. 345–420, 1994.
[6.5.2]
S. Midkiff and D. Padua, “Issues in the compile-time optimization of
parallel programs,” in 19th Int’l Conf. on Parallel Processing, August
1990.
[6.5.3]
R. Andonov and S. Rajopadhye, “Optimal Tiling of Two-dimensional
Uniform Recurrences,” Tech. Rep. 97-01, LIMAV, Université de Valenciennes, Le Mont Houy - B.P. 311, 59304 Valenciennes Cedex, France, January 1997 (submitted to JPDC). This report supersedes Optimal Tiling (IRISA, PI-792), January 1994. Part of these results were presented at CONPAR 94-VAPP VI, Lecture Notes in Computer Science 854, pp. 701–712, Springer Verlag, 1994.
[6.5.4]
G. Goumas, N. Drosinos, M. Athanasaki, and N. Koziris, “Automatic
parallel code generation for tiled nested loops,” in SAC ’04:
Proceedings of the 2004 ACM symposium on Applied computing,
pp. 1412–1419, ACM Press, 2004.
[6.5.5]
P. Lee and Z. M. Kedem, “Automatic data and computation
decomposition on distributed memory parallel computers,” ACM Trans.
Program. Lang. Syst., vol. 24, no. 1, pp. 1–50, 2002.
[6.5.6]
P. D. Hovland and L. M. Ni, “A model for automatic data partitioning,”
Tech. Rep. MSU-CPS-ACS-73, Department of Computer Science,
Michigan State University, October 1992.
[6.5.7]
P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges IV, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su, “The PARADIGM compiler for distributed-memory multicomputers,” Computer, vol. 28, no. 10, pp. 37–47, 1995.
[6.5.8]
M. Gupta and P. Banerjee, “PARADIGM: A compiler for automatic data
distribution on multicomputers,” in International Conference on
Supercomputing, pp. 87–96, 1993.
[6.5.9]
K. S. McKinley, “A compiler optimization algorithm for shared-memory
multiprocessors,” IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 8,
pp. 769–787, 1998.
[6.5.10]
J.-H. Chow, L. E. Lyon, and V. Sarkar, “Automatic parallelization for
symmetric shared-memory multiprocessors,” in CASCON ’96:
Proceedings of the 1996 conference of the Centre for Advanced
Studies on Collaborative research, p. 5, IBM Press, 1996.
[6.5.11]
V. Adve, G. Jin, J. Mellor-Crummey, and Q. Yi, “High performance
Fortran compilation techniques for parallelizing scientific codes,” in
Supercomputing ’98: Proceedings of the 1998 ACM/IEEE conference
on Supercomputing (CDROM), pp. 1–23, IEEE Computer Society,
1998.
[6.5.12]
J. Saltz, H. Berryman, and J. Wu, “Multiprocessors and runtime
compilation,” in Proc. International Workshop on Compilers for Parallel
Computers, 1990.
[6.5.13]
A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy,
and K. Crowley, “PARTI primitives for unstructured and block
structured problems,” Computing Systems in Engineering, vol. 3, no. 1–4, pp. 73–86, 1992.
[6.5.14]
S.-T. Leung and J. Zahorjan, “Improving the performance of runtime
parallelization,” in PPOPP ’93: Proceedings of the fourth ACM
SIGPLAN symposium on Principles and practice of parallel
programming, pp. 83–91, ACM Press, 1993.
[6.5.15]
D. S. Nikolopoulos, C. D. Polychronopoulos, and E. Ayguadé, “Scaling
irregular parallel codes with minimal programming effort,” in
Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference
on Supercomputing (CDROM), pp. 16–16, ACM Press, 2001.
[6.5.16]
S. Krishnan and L. V. Kale, “Automating parallel runtime optimizations
using post-mortem analysis,” in ICS ’96: Proceedings of the 10th
international conference on Supercomputing, pp. 221–228, ACM
Press, 1996.
[6.5.17]
A. E. Morales, Automatic Performance Analysis of Parallel Programs.
PhD thesis, Departament d’Informàtica, Universitat Autònoma de Barcelona, 2000.
[6.5.18]
A. D. Robison, “Impact of economics on compiler optimization,” in JGI
’01: Proceedings of the 2001 joint ACM-ISCOPE conference on Java
Grande, pp. 1–10, ACM Press, 2001.
7 Language analysis
To be written.
8 Tool design
To be written.
(this might not have enough info, and the implemented portion is probably covered in tool evaluation as well)
9 Tool evaluation strategies
Before the users can decide which PAT best fits their needs, a set of guidelines
needs to be established so users can compare the available tools in a systematic
fashion. In addition, as new PAT developers, we need to develop a method to
evaluate existing tools so we can decide if a new tool is needed or not. In this
section, features that should be considered are introduced, along with a brief
description of why they are important. An Importance Rating, which indicates the
relative importance of the feature compared to the other features, is then applied
to each of these features. The possible values are minor (nice to have), average
(should include) and critical (needed).
Finally, features are categorized
according to their influence on usability, productivity, portability and
miscellaneous aspects.
9.1 Pre-execution issues
This section of the report deals with features that are generally considered before obtaining the software.
9.1.1 Cost
Commercial PATs, given their business nature, are generally appealing to users because they tend to provide the features users want. However, users
often need to search for alternative tools simply because they cannot afford to
pay for the commercial product. Even worse, they may decide not to use any
PAT if what they can afford does not fulfill their need. Because of this, any useful
tool evaluation needs to take cost into consideration. Generally, the user will
desire the minimum-cost product that accommodates most of their needs. There is often a tradeoff between optional desirable features and their additional cost.
Importance Rating: Average. By itself, cost is of average importance. However,
this should be evaluated in conjunction with the desirable PAT features from the
user’s perspective. Together, these two aspects become critical in determining which tool to deploy.
Category: Miscellaneous
9.1.2 Installation
Ease of installation is another feature that users consider when deciding which tool to use. Sometimes a suitable tool, in terms of cost and features, will not be used because the user cannot install the system. Due to modularization, many tools require installation of multiple components to enable the complete
set of features the tool can provide. Furthermore, in order to support features
provided by other systems (other tools or other systems such as a visualization
kit), these components must be installed separately and they may require
multiple components of their own. As more and more features are added to a
given PAT, it seems inevitable that more and more components need to be
installed as they are often developed by different groups.
With productivity in mind, a tool should try to incorporate as many useful
components as possible. However, it should simplify the installation process as
much as possible. The use of environment detection scripts with few options is
desirable. With parallel languages such as UPC and SHMEM in mind, a tool should
also have the ability to automatically install required components on all nodes in
the targeted system. Finally, it is optimal to minimize the number of separate
installations required, perhaps by having a master script that installs all desired
components (maybe by including only the versions of external components that work with the tool version and having them installed as part of the tool).
Importance Rating: Minor. Because the installation process is usually done
once, this feature is not as important as the others.
Category: Usability.
9.1.3 Software support (libraries/compilers)
A good tool needs to support libraries and compilers that the user intends to use.
If the support is limited, the user is restricted to using the tool only with a few particular system settings (i.e., they can only use the tool on machines with a certain set of libraries installed). This greatly hinders the usability of the tool and
could also limit productivity. However, the support should only be applied to a
selective set of key libraries, as too much library support can introduce
unnecessary overhead. Tool developers should decide on a set of core libraries (from the user’s point of view) and support all implementations of them. A tool, however, should try to support all available language compilers.
Importance Rating: Critical. Without the necessary software support, a tool is
virtually useless.
Category: Usability, productivity.
9.1.4 Hardware support (platform)
Another important aspect of environmental support is hardware support. Again,
a good tool needs to execute well on platforms of interest without significant effort from the user when going from platform to platform. With longer software
development times (due to larger applications) and fast improvements in
hardware technology, this feature is becoming increasingly important. Software
development can sometimes outlast the hardware that it was originally developed
on. Because of this need to port from one generation of a system to the next, it
is important for the tool to provide support on all these systems equally well.
However, it is not necessary (although it is desirable) to provide support for every platform on which the program can run. A balance between additional development time and broader platform support should be considered.
Importance Rating: Critical. A tool must provide support for the most widely used
current machines and support future ones. A core set needs to be identified.
Category: Usability, portability.
9.1.5 Heterogeneity support
Support for heterogeneity is related to both software and hardware support. This
feature deals with using the tool simultaneously on a single application running
on nodes with different software and/or hardware configurations. With an
increasing desire to run applications in a heterogeneous environment, tool
developers should take this into consideration.
Importance Rating: Minor. This feature is not very important at this time,
especially since UPC and SHMEM implementations still do not support
heterogeneous environments well.
Category: Miscellaneous.
9.1.6 Learning curve
Many powerful systems (both software and hardware) never become widely
accepted because they are too difficult to learn. Users are able to recognize the
usefulness of the features these systems provide but are unable to utilize them in
an efficient manner. The same principle applies to PATs. Users will not use a PAT that has many desirable features but is difficult to learn, because the benefit gained from using the tool is not worth the learning effort. Given that desirable features are key to a productive PAT, a good approach is to provide a basic set of features that most users appreciate and that requires little learning time, and to make the other features optional. This way, a novice user can quickly see
the benefit of using the PAT while a more advanced user can enjoy all the more
sophisticated features.
Importance Rating: Critical. No matter how powerful a tool is, it is useless if it is
too difficult for users to use.
Category: Usability, productivity.
9.2 Execution-time issues
This section covers issues that arise while the tool is in use. The issues presented are broken into subsections corresponding to the five stages in experimental performance modeling mentioned in Section 6.
9.2.1 Stage 1: Instrumentation
9.2.1.1 Manual instrumentation overhead
How much effort the user needs to put into running the PAT is a big determinant of how often they will use the tool (assuming that they decide to use the tool
at all). It is possible that using a sophisticated PAT is not as effective as simply
using the printf statement to pinpoint performance bottlenecks. If a tool
requires too much manual instrumentation, the user might consider the effort not worthwhile if they can obtain similarly useful information by inserting printf statements. It is up to the tool developers to minimize the amount of manual overhead while maximizing the benefit the tool provides. For this reason, an ideal
tool should perform automatic instrumentation as much as possible and allow
manual instrumentation for extendibility (see Section 6.1.3).
Importance Rating: Average. This is important for the introduction of a tool to new users but is less critical as users become more accustomed to using the PAT.
Advanced users are often willing to put in the extra effort to obtain useful
information.
Category: Usability, productivity.
9.2.1.2 Profiling/tracing support
It is highly recommended for a tool to support tracing as profiling data can be
extracted from the tracing data (see Section 6.1.2). Tracing technique, trace file
format, and mechanisms for turning tracing on and off are important, as they directly impact the amount of storage needed and the perturbation caused by the tool (i.e., the effect of instrumentation on the program’s behavior). An ideal tool should deploy strategies to minimize the storage
requirement while gathering all necessary information. The strategies should
also not affect the original program’s behavior and should be compatible with
other popular tools. Finally, the performance overhead issue needs to be
considered (see the profiling/tracing sub-report for a detailed discussion).
Importance Rating: Critical. The choice significantly affects how useful the PAT is, for the reasons mentioned above.
Category: Productivity, portability, scalability.
9.2.1.3 Available metrics
Core to a PAT are the metrics (events) it is able to track. Intuitively, this set should cover all the key aspects of the software and hardware system critical to
performance analysis (details about these will be covered in the sub-report
regarding important factors for UPC/SHMEM).
Importance Rating: Critical.
Category: Productivity.
9.2.2 Stage 2: measurement issues
9.2.2.1 Measurement accuracy
Measurements must be made so that they represent the exact behavior of the original program. An ideal tool needs to ensure that the measuring strategy provides
accurate information in the most efficient manner under the various software and
hardware environments (the set of measurements could be the same or different
between systems).
Importance Rating: Critical. Accurate event data is vital to a PAT’s usefulness.
Category: Productivity, portability.
9.2.2.2 Interoperability
Once multiple tools are available for a particular language, it is a good strategy
for a tool to store its event data in a format that is portable to other tools as
people might be accustomed to a particular way a tool presents its data.
Furthermore, it saves time to use components developed elsewhere, as this
avoids reinventing the wheel. This helps the acceptability of the tool and can
sometimes reduce the development cost.
Importance Rating: Average. Since no tool in existence supports UPC and
SHMEM, there is no need to consider this except to store the data in a way that
is portable in the future. Also, since it is highly likely that we will end up using one tool as the major backbone for our PAT, this is a minor issue. However, it might be good to consider using a format that is compatible with visualization packages, because a GUI is time-consuming to produce.
Category: Portability.
9.2.3 Stage 3: analysis issues
9.2.3.1 Filtering and aggregation
With the assumption that most users use PATs with long-running programs (a
likely assumption since little benefit can be gained from improving short-running
programs unless they are part of a bigger application), it is important for a tool to
provide filtering abilities (also applicable to systems with multiple nodes). Filtering removes excess event information that can get in the way of performance bottleneck identification by discarding the large number of “normal” events that do not provide insight into identifying performance bottlenecks. Aggregation is used to organize existing data and produce new, meaningful event data (commonly from one event, but aggregating two or more related events is also useful). These aggregate events lead to a better understanding of program behavior because they provide a higher-level view of it. An ideal tool should provide various degrees of filtering that can help in the identification of abnormalities depending on the problem space and system size. In addition, aggregation should be applied to filtered data to better match the user’s needs (filtering followed by aggregation is faster than the reverse and provides the same information); a small sketch of these ideas appears at the end of this subsection.
Importance Rating: Critical. Event data needs to be organized to be meaningful
for bottleneck detection.
Category: Productivity, scalability.
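The small C sketch below illustrates the filtering and aggregation ideas on a handful of made-up trace records; the event names, threshold, and record layout are our own assumptions, and a real tool would operate on far larger traces with configurable criteria.

#include <stdio.h>
#include <string.h>

/* A toy trace record: which routine, on which node, and how long. */
struct event {
    const char *name;
    int         node;
    double      usec;
};

struct event trace[] = {
    { "upc_memget", 0,  12.0 }, { "compute",  0,  900.0 },
    { "upc_memget", 1, 480.0 }, { "barrier",  0, 1500.0 },
    { "upc_memget", 0,  11.0 }, { "barrier",  1,   40.0 },
};

int main(void)
{
    int n = sizeof trace / sizeof trace[0];
    double threshold = 100.0;        /* filter: drop short, "normal" events     */
    double memget_total = 0.0;       /* aggregate: total time in one event type */

    for (int i = 0; i < n; i++) {
        if (strcmp(trace[i].name, "upc_memget") == 0)
            memget_total += trace[i].usec;        /* aggregation */
        if (trace[i].usec < threshold)
            continue;                             /* filtering   */
        printf("node %d: %-12s %8.1f usec\n",
               trace[i].node, trace[i].name, trace[i].usec);
    }
    printf("aggregate: upc_memget total = %.1f usec\n", memget_total);
    return 0;
}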
9.2.3.2 Multiple analyses
There are often multiple ways to analyze a common problem. For example,
multiple groups have proposed different ways to identify synchronization
overhead. However, there is no consensus as to which of these methods are
more useful than the others. Because of this, it is helpful for a tool to provide
support for all the useful methods so the user can switch between them based on their preference.
Importance Rating: Average. Although it is nice to have this feature, trying to
provide too much can significantly impact the usability of the tool. It is perhaps
best to select a few major issues and then provide a couple of views for those
issues.
Category: Usability.
9.2.4 Stage 4: presentation issues
9.2.4.1 Multiple views
Ideally, a tool should have multiple levels of presentation (text, simple graphic). It
should also provide a few different presentation methods for some of its displays.
It is also desirable to have a zooming capability (i.e. display data on a particular
range).
Importance Rating: Critical. This is the only stage relating to what the user sees,
and what the user sees completely determines how useful the tool will be.
Category: Usability, productivity.
9.2.4.2 Source code correlation
It is important to correlate the presentation of performance data with the source
code. This is vital in facilitating the task of performance bottleneck identification.
Importance Rating: Critical.
Category: Usability, productivity.
9.2.5 Stage 5: optimization issues
9.2.5.1 Performance bottleneck identification
It is always beneficial if the tool can identify performance bottlenecks and provide
suggestions on how to fix them. However, it should avoid false positives.
Importance Rating: Minor to average. This is a nice feature to have but not
critical. However, the identification part should probably be investigated as it is
related to dynamic instrumentation.
Category: Productivity.
9.2.6 Response time
Another important issue to consider that is applicable to the entire tool utilization
phase is the response time of the tool. A tool should provide useful information
back to the user as quickly as possible. A tool that takes too long to provide
feedback will deter the user from using it. In general, it is best to provide partial but useful information when it becomes available and to update the information in a
periodic fashion. This is related to the profiling/tracing technique and
performance bottleneck identification.
Importance Rating: Minor to average. As long as the response time is not too
terrible, it should be fine.
Category: Productivity.
9.3 Other issues
This section covers all issues that do not fit well into the other two phases.
9.3.1 Extendibility
An important factor for our project, though generally not as important to the user, is how easily the tool can be extended to support new languages or to add new metrics. This is of great importance, as we need to evaluate the existing
capabilities of tools against the development effort to decide if it is better to
design a tool from scratch or use existing tool(s).
Importance Rating: Critical. An ideal tool should not require a significant amount
of effort to extend.
Category: Miscellaneous.
9.3.2 Documentation quality
A tool should provide clear documentation on its design, how it can be used, and
how to best use it to learn its features. A good document often determines if the
tool will ever be used or not (as discussed in the installation section). However, from our perspective as tool developers, this is not as important.
Importance Rating: Minor.
Category: Miscellaneous.
9.3.3 System stability
A tool should not crash too often. This is difficult to quantify, however, and what counts as an acceptable crash rate is somewhat arbitrary.
Importance Rating: Average.
Category: Usability, productivity.
9.3.4 Technical support
Tool developers should be responsive to the user. If the user can get a response
regarding the tool within a few days, then the support is acceptable. The tool
should also provide clear error messages to the user.
Importance Rating: Minor to average. As long as the tool itself is easy to use,
there is little need for this. It is more important for us, however, because we will definitely need developer support if we decide to build on an existing tool. Tool developers are generally willing to work with others to extend their tools, though.
Category: Usability.
9.3.5 Multiple executions
Since we are dealing with parallel programs that involve multiple processing
nodes, it is sometimes beneficial to compare the performance of the same
program running on different numbers of nodes (i.e. performance on 2 nodes vs.
4 nodes). This helps to identify a program’s scalability trend.
Importance Rating: Minor to average. This depends on which factors we deem important: if program scalability is important to show, this is of average importance; otherwise, it need not be considered.
Category: Productivity.
9.3.6 Searching
Another helpful feature to include is the ability to search the performance data
gathered using some criteria. This is helpful when users wish to examine data for a particular event.
Importance Rating: Minor. Searching is a nice feature to have but implementing
it probably isn’t worth the effort for the prototype version of our PAT.
Category: Productivity.
Table 9.1 - Tool evaluation summary table
* To use this table, fill out the information for all the features (plus any other comments). Then, based on that information, provide a rating of 1-5 (5 being the best) on how good the tool is. Some explanation as to why the rating was chosen is also helpful.
Feature (section) | Information to gather | Categories | Importance Rating
----------------- | --------------------- | ---------- | -----------------
Available metrics (9.2.1.3) | Metrics it can provide (function, hw, …) | Productivity | Critical
Cost (9.1.1) | How much | Miscellaneous | Average
Documentation quality (9.3.2) | Clear document? Helpful document? | Miscellaneous | Minor
Extendibility (9.3.1) | 1. Estimate of how easy it is to extend to UPC/SHMEM; 2. How easy is it to add new metrics | Miscellaneous | Critical
Filtering and aggregation (9.2.3.1) | Does it provide filtering? Aggregation? | Productivity, Scalability | Critical
Hardware support (9.1.4) | Platform support | Usability, Portability | Critical
Heterogeneity support (9.1.5) | Support running in a heterogeneous environment? | Miscellaneous | Minor
Installation (9.1.2) | 1. How to get the software; 2. How hard to install the software; 3. Components needed; 4. Estimate number of hours needed for installation | Usability | Minor
Interoperability (9.2.2.2) | List of other tools that can be used with this, and to what degree | Portability | Average
Learning curve (9.1.6) | Estimate learning time for basic set of features and complete set of features | Usability, Productivity | Critical
Manual overhead (9.2.1.1) | 1. Method for manual instrumentation (source code, instrumentation language, etc.); 2. Automatic instrumentation support | Usability, Productivity | Average
Measurement accuracy (9.2.2.1) | Evaluation of the measuring method (probably going to be difficult to do; might leave until the measuring method report is done) | Productivity, Portability | Critical
Multiple analyses (9.2.3.2) | Provide multiple analyses? Useful analyses? | Usability | Average
Multiple executions (9.3.5) | Support multiple executions? | Productivity | Minor to Average
Multiple views (9.2.4.1) | Provide multiple views? Intuitive views? | Usability, Productivity | Critical
Performance bottleneck identification (9.2.5.1) | Support automatic bottleneck identification? How? | Productivity | Minor to Average
Profiling / tracing support (9.2.1.2) | 1. Profiling? Tracing? 2. Trace format; 3. Trace strategy; 4. Mechanism for turning on and off tracing | Productivity, Portability, Scalability | Critical
Response time (9.2.6) | How long does it take to get back useful information | Productivity | Average
Searching (9.3.6) | Support data searching? | Productivity | Minor
Software support (9.1.3) | 1. Libraries it supports; 2. Languages it supports | Usability, Productivity | Critical
Source code correlation (9.2.4.2) | Able to correlate performance data to source code? | Usability, Productivity | Critical
System stability (9.3.3) | Crash rate | Usability, Productivity | Average
Technical support (9.3.4) | 1. Time to get a response from developer; 2. Quality/usefulness of system messages | Usability | Minor to Average
10 Tool evaluations
We will either keep these in PowerPoint format or convert them to Word.
11 Conclusion
To be written