Computer Science Technical Report
Predicting Remote Reuse Distance Patterns in
Unified Parallel C Applications
by
Steven Vormwald, Steven Carr,
Steven Seidel and Zhenlin Wang
Computer Science Technical Report
CS-TR-09-02
December 18, 2009
Department of Computer Science
Houghton, MI 49931-1295
www.cs.mtu.edu
ABSTRACT*
Productivity is becoming increasingly important in high performance computing. Parallel
systems, as well as the problems they are being used to solve, are becoming dramatically larger
and more complicated. Traditional approaches to programming for these systems, such as MPI,
are being regarded as too tedious and too tied to particular machines. Languages such as Unified
Parallel C attempt to simplify programming on these systems by abstracting the communication
with a global shared memory, partitioned across all the threads in an application. These
Partitioned Global Address Space, or PGAS, languages offer the programmer a way to specify
programs in a much simpler and more portable fashion.
However, the performance of PGAS applications has tended to lag behind that of applications implemented in a more traditional way. It is hoped that cache optimizations can benefit UPC applications as they have benefited single-threaded applications, closing this performance gap.
Memory reuse distance is a critical measure of how much an application will benefit from a
cache, as well as an important piece of tuning information for enabling effective cache
optimization.
This research explores extending existing reuse distance analysis to remote memory accesses in
UPC applications. Existing analyses store a very good approximation of the reuse distance
histogram for each memory access in a program efficiently. Reuse data are collected for small
test runs, and then used to predict program behavior during full runs by curve fitting the patterns
seen in the training runs to a function of the problem size. Reuse data are kept for each UPC
thread in a UPC application, and these data are used to predict the data for each UPC thread in a
larger run. Both scaling up the problem size and increasing the total number of UPC threads
are explored for prediction. Results indicate that good predictions can be made using existing
prediction algorithms. However, it is noted that choice of training threads can have a dramatic
effect on the accuracy of the prediction. Therefore, a simple algorithm is also presented that
partitions threads into groups with similar behavior to select threads in the training runs that will
lead to good predictions in the full run.
*This work is partially supported by NSF grant CCF-0833082.
CHAPTER 1
Introduction
1.1 Motivation
High performance computing is becoming an increasingly important part of our daily lives. It is used to determine where oil companies drill for oil, to figure out what the weather will be like for the next week, and to design safer buildings and vehicles. Companies save millions of dollars every year by simulating product designs instead of creating physical prototypes. The movie industry relies heavily on special effects rendered on large clusters. Scientists rely on simulations to understand nuclear reactions without having to perform dangerous experiments with nuclear materials.
In addition, the machines used to carry out these computations are becoming drastically larger and more complicated. The Top500 list claims that the fastest supercomputer in the world has over 200,000 cores [1]. It is made up of thousands of six-core Opteron processors in compute blades networked together. The compute blades can each be considered a computer in its own right, working together with the others to act as one large supercomputer. This clustering model of supercomputer is now the dominant force in high performance computing.
Traditionally, applications written for these clusters required the programmer to explicitly manage the communication needs of the program across the various nodes of the cluster. It was thought that the performance needs of such applications could only be met by a human programmer carefully designing the program to minimize the necessary communication costs. This approach to programming for supercomputers is quickly becoming unwieldy. The productivity cost of requiring the application programmer to manage and tune an application's communication for these complex systems is simply too high.
Partitioned global address space languages, such as Co-Array Fortran [2] and Unified Parallel C [3], attempt to address these productivity concerns by building a shared memory programming model for programmers to work with, delegating the task of optimizing the necessary communication to the language implementor.
While these languages do offer productivity improvements, implementations haven't been able to match the performance of more traditional message passing setups. Various approaches to catching up have been tried. The UPC implementation from the University of California Berkeley [4] uses the GASNet network library [5], which attempts to optimize communication using various methods such as message coalescing [6]. Many implementations try to split a synchronous operation into an asynchronous operation and a corresponding wait, then spread these as far apart as possible to hide communication latency. These optimizations can lead to impressive performance gains, but there is still a performance gap for some applications.
One approach that hasn't been commonly used in implementations is software caching of remote memory operations. Caching has been used to great effect in many situations to hide the cost of expensive operations. Programmers are also accustomed to working with caches, since they are so prevalent in today's CPUs. As Marc Snir pointed out in his keynote address to the PGAS2009 conference [7], programmers would like to see some kind of caching in these languages' implementations. He demonstrated a caching scheme implemented entirely in the application. However, it is desirable that the caching be done at the level of the language implementation, to avoid forcing the programmer to deal with the complexities of communication that these languages were designed to hide.
This research takes an initial look at the possibility of using existing algorithms, designed to predict patterns in the reuse distances of memory operations in single-threaded applications, to predict patterns in the reuse distances of remote memory operations in Unified Parallel C applications. It is hoped that this information could be used to tune cache behavior for caching remote references, and to enable other optimizations that rely on this information, and that have been used successfully with single-threaded applications, to work with multi-threaded UPC applications.
1.2 Thesis Outline
The rest of this document is organized as follows. Chapter 2 gives a broad background in instruction-based reuse distance analysis and Unified Parallel C. Chapter 3 introduces the instrumentation, test kernels, and models used to predict remote memory behavior. Chapter 4 shows the prediction results obtained. Finally, Chapter 5 summarizes the results, looks at ways this prediction model can be used, and considers possible future work to overcome some of this model's weaknesses.
CHAPTER 2
Background
2.1 Parallel Computation
When one wishes to solve problems faster, there are generally only three things to do. First, one can try to find a better algorithm to solve the problem at hand. While this can lead to massive performance benefits, most common problems already have known optimal solutions. Second, one can increase the rate at which a program executes. If a program requires the execution of 10000 instructions, a machine that can execute an instruction every millisecond will finish the program much sooner than one that can only execute an instruction every second. Finally, one can execute the program in parallel, solving multiple pieces of the problem at the same time. Just as having multiple checkout lanes speeds consumers through their purchases, having the ability to work on multiple pieces of a problem can speed its solution.
There are many different ways to think about solving a problem in parallel. Flynn's taxonomy classifies parallel programming models along two axes, one based on the program's control, the other based on its data [8, 9]. This creates four classes of programs: single instruction single data, multiple instruction single data, single instruction multiple data, and multiple instruction multiple data. These are henceforth referred to by the acronyms SISD, MISD, SIMD, and MIMD, respectively.
The SISD model is the classic, non-parallel programming model where each instruction is executed serially and works on a single piece of data. While modern architectures actually do work in parallel, this remains the most commonly used abstraction for software developers.
The MISD model is widely considered nonsensical, as it refers to multiple instructions being executed in parallel operating on the same data. While this usually makes little sense, the term has been used to refer to redundant systems, which use parallelism not to speed up a computation, but rather to prevent program failures.
The SIMD model sees widespread use in many special purpose accelerators. In this model, a single instruction is executed in parallel across large chunks of data. The most well-known use of this is probably in the graphics processing units, or GPUs, that have become standard on modern personal computers. Many architectures also include SIMD extensions to their SISD instruction set that enable certain types of applications to perform much better.
The MIMD model is the most general, where multiple instructions are run in parallel, each on different data. This model is perhaps the most commonly used abstraction for software developers writing parallel applications, as this model is used in the threading libraries included with many operating systems.
2.1.1 SPMD Model
The Single Program, Multiple Data, or SPMD, model of parallel programming is a subset of the MIMD model from Flynn's taxonomy. The MIMD model can be thought of as multiple SISD threads executing in parallel that have some means of communicating amongst themselves. The SPMD model is a special case where each thread is executing the same program, only working with different data. The threads need not operate in lock-step, nor all follow the same control flow through the program.
2.1.2 Communication and the Shared Memory Model
Flynn's taxonomy describes how a problem can be broken up and solved in parallel. It does not specify how the various threads of execution that are being run in parallel communicate. From a programmer's perspective, there are two major models of parallel communication: message passing and shared memory.
The message passing model, as its name suggests, requires that threads send messages to one another to communicate. In general, these messages must be paired, in that the sender must explicitly send a message and the receiver must explicitly receive that message.
In the shared memory model, all threads share some amount of memory, and communication occurs through this shared memory. This naturally simplifies communication, as the programmer no longer needs to explicitly send data back and forth between threads. The programmer is also no longer responsible for ensuring threads send messages in the correct order to avoid deadlocks, nor for checking for errors in communication.
However, as memory is a shared resource, the programmer must be aware of the consistency guarantees that the model allows. In the simplest case, there is a strict ordering on all memory accesses across all threads, which is easy for the programmer to understand, but is usually quite expensive for the implementation to enforce. There are various ways of relaxing the semantics to improve the performance of the program by permitting threads to see operations occur in different orders, eliminating unnecessary synchronization.
2.1.3 Partitioned Global Address Space
While the shared memory programming model offers the programmer many advantages, it is often difficult to implement efficiently on today's large distributed systems. One major difficulty comes from the fact that it is usually orders of magnitude more expensive to access shared memory that is off-node than it is to access on-node memory. If the programmer has no way of differentiating between on-node and off-node memory, it becomes difficult to write programs that run efficiently on these modern machines.
Partitioned Global Address Space, or PGAS, languages try to address this problem by introducing the concept of affinity [10]. The shared memory space is partitioned among the threads such that every object in shared memory has affinity to one and only one thread. This allows programmers to write code that takes advantage of the location of the object.
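As a concrete illustration (a minimal sketch; the block size, array length, and printing loop are arbitrary choices, not taken from this thesis), the UPC declaration below distributes an array in blocks across threads, and the standard upc_threadof() function reports the thread to which each element has affinity:

#include <stdio.h>
#include <upc.h>

/* Thread 0 has affinity to A[0..3], thread 1 to A[4..7], and so on,
   wrapping back to thread 0 after every THREADS blocks. */
shared [4] int A[8 * THREADS];

int main()
{
    if (MYTHREAD == 0) {
        int i;
        for (i = 0; i < 8 * THREADS; ++i)
            printf("A[%d] has affinity to thread %d\n",
                   i, (int) upc_threadof(&A[i]));
    }
    return 0;
}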
2.1.4 Unified Parallel C
Unified Parallel C, henceforth UPC, is a parallel extension to the C programming language [3]. It uses the SPMD programming model, where a fixed number of UPC threads execute the same UPC program. Each UPC thread has its own local stack and local heap, but there is also a global memory space that all threads have access to. As UPC is a PGAS language, this global memory is partitioned amongst all the UPC threads.
It is important to note that accesses to shared memory in UPC applications do not require any special libraries or syntax. Once a variable is declared as shared, it can be referenced just as any local variable, at least from the programmer's point of view. In many implementations, including MuPC, these accesses are directly translated into runtime library calls that perform the read or write as necessary.
For example, the program in Figure 2.1 prints hello from each UPC thread, records how many characters each thread printed, and exits with EXIT_FAILURE if any thread had an error (printf() returns a negative value on errors).
One important performance feature of UPC is the ability to specify that memory accesses use relaxed memory consistency. The default, strict, memory consistency requires that all shared memory accesses be ordered, and that all threads see the same global order of memory accesses. In particular, if two threads write to the same variable at the same time, all threads will see the same written value after both writes have occurred. Consider the code segment in Figure 2.2.
For threads other than thread 0, there are only three possible outputs at the end of the code: (a = 0, b = 0), (a = 1, b = 0), or (a = 1, b = 2).
By contrast, relaxed memory consistency provides no such guarantee. Different threads may see operations occur in different orders. Consider the same code segment using relaxed variables instead of strict ones, shown in Figure 2.3.
For threads other than thread 0, there are now four possible outputs at the end of the code: (a = 0, b = 0), (a = 1, b = 0), (a = 1, b = 2), or (a = 0, b = 2). The additional outcome, (a = 0, b = 2), is permitted because the relaxed semantics allow threads other than thread 0 to see the assignment to sb occur before the assignment to sa, while the strict semantics require that they occur in program order.
#include <stdio.h>
#include <stdlib.h>
#include <upc.h>

/* Each UPC thread has 1 element of this array. */
shared [1] int printed[THREADS];

int main()
{
    int i, exit_value = EXIT_SUCCESS;

    /* Initialize the local part of the printed array to 0. */
    printed[MYTHREAD] = 0;

    /* Wait for everyone to finish initialization. */
    upc_barrier(1);

    /* Record the number of characters printed. */
    printed[MYTHREAD] = printf("Hello from UPC thread %d of %d.\n",
                               MYTHREAD, THREADS);

    /* Wait for everyone to finish printing. */
    upc_barrier(2);

    /* Verify all the threads printed something. */
    for (i = 0; i < THREADS; ++i)
        if (printed[i] < 0)
            exit_value = EXIT_FAILURE;

    exit(exit_value);
}

Figure 2.1. Hello World in UPC
Since the implementation is allowed to reorder relaxed operations, it is also permitted to cache the values without having to worry about keeping the caches coherent until a strict access or collective occurs. Despite this capability, relatively few UPC implementations cache remote accesses, and those that do use relatively simple caches. At the time of this writing, only the MuPC reference implementation from Michigan Technological University [11] and the commercial implementation from Hewlett Packard [12] are known to implement caching of remote references.
2.1.5 MuPC
MuPC is a reference UPC implementation that is built on top of MPI and POSIX threads. It currently supports Intel x86 and x86-64 based clusters running Linux, as well as Alpha clusters running Tru64. Each UPC thread is implemented as a single OS process using two pthreads, one for managing communication, the other to run the UPC program's computation. The compile script first translates the UPC code into C code with calls into the MuPC runtime library to handle communication and synchronization. This is then compiled with the system MPI compiler, and the resulting binary can be run as if it were an MPI program.
strict shared int sa = 0, sb = 0;
int la = 0, lb = 0;

if (MYTHREAD == 0) {
    sa = 1;
    sb = 2;
} else {
    lb = sb;
    la = sa;
}
printf("Thread %d: a = %d, b = %d\n", MYTHREAD, la, lb);

Figure 2.2. Example of Strict Semantics in UPC
relaxed shared int sa = 0, sb = 0;
int a = 0, b = 0;

if (MYTHREAD == 0) {
    sa = 1;
    sb = 2;
} else {
    b = sb;
    a = sa;
}
printf("Thread %d: a = %d, b = %d\n", MYTHREAD, a, b);

Figure 2.3. Example of Relaxed Semantics in UPC
MuPC implements a cache for remote references in the communication thread. The cache is divided into THREADS - 1 sections, one for each non-local UPC thread. The size of the cache is determined by the user via a configuration file. The user can also change the size of a cache line. The defaults set up a 2 MB cache with 64-byte cache lines.
2.2 Reuse Distance Analysis
Reuse distance is defined as the number of distinct memory locations that are accessed between two accesses to a given memory address. This information is generally used to determine, predict, or optimize cache behavior.
Forward reuse distance answers the question "How many distinct memory locations will be accessed before the next time this address is accessed?". It scans forward in an execution, counting the number of memory locations accessed until the given address is found. This information can be useful for determining whether or not caching should be performed for a memory reference, among other things.
Backward reuse distance answers the question "How many distinct memory locations were accessed since the last time this address was accessed?". It scans backward in an execution, counting the number of memory locations accessed until the given address is found. This information can be useful for determining whether or not a memory reference should be prefetched, among other things.
A[1] = 1;
A[2] = 2;
A[1] = 3;
A[2] = 4;
A[3] = 5;
A[1] = 6;
A[2] = 7;

Figure 2.4. Reuse Distance Example
For example, consider the second reference to A[2] in the short code segment in Figure 2.4. Because only A[1] was accessed since the last access to A[2], the backward reuse distance is 1. The forward reuse distance is 2, because both A[1] and A[3] are accessed before A[2] is accessed again.
There is also a distinction between temporal reuse and spatial reuse. Temporal reuse refers to reuse of a single location in memory, as in the example above. Spatial reuse considers larger sections of memory, as the cache in many systems pulls in more than one element at a time. Assuming the cache lines in the system can hold two array elements, and that A[1] is aligned to the start of a cache line, the backward spatial reuse distance of the second access to A[2] is 0, because there were no intervening accesses to different cache lines since the last access. The forward spatial reuse distance is 1, however, because A[3] does not share a cache line with A[1] and A[2].
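To make the definitions concrete, the following C fragment computes backward temporal reuse distances over the access trace of Figure 2.4. It is a naive, quadratic sketch of the definition itself, not the efficient splay-tree technique discussed in the next section, and the encoding of the trace (array indices standing in for addresses) is ours:

#include <stdio.h>

/* Backward temporal reuse distance of access t in trace[]: the number
   of distinct addresses touched since the previous access to the same
   address, or -1 if this is the first access to it. */
static int backward_rd(const int *trace, int t)
{
    int prev, i, j, distinct = 0;

    for (prev = t - 1; prev >= 0; --prev)
        if (trace[prev] == trace[t])
            break;
    if (prev < 0)
        return -1;                    /* first access: infinite distance */

    for (i = prev + 1; i < t; ++i) {
        for (j = prev + 1; j < i; ++j)
            if (trace[j] == trace[i])
                break;
        if (j == i)                   /* first occurrence in the window */
            distinct++;
    }
    return distinct;
}

int main()
{
    int trace[] = { 1, 2, 1, 2, 3, 1, 2 };  /* the accesses of Figure 2.4 */
    int t;

    for (t = 0; t < 7; ++t)
        printf("access %d (A[%d]): backward reuse distance = %d\n",
               t, trace[t], backward_rd(trace, t));
    return 0;
}

For the second access to A[2] (the fourth line of Figure 2.4), this prints a distance of 1, matching the discussion above.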
2.2.1 Instruction Based Reuse Distance
Analyses generally create histograms of either the forward or backward reuse distances encountered in a program trace. The histograms are generally associated either with a particular memory address or with a particular memory operation. When associated with a memory operation, the data are referred to as instruction based reuse distances [13, 14].
Studying the reuse distances associated with operations can provide many useful insights into an application's behavior. For example, the maximum reuse distance seen by any operation can tell how large a cache will be beneficial to the application. Critical instructions, operations that suffer from a disproportionately large number of the cache misses in a program, can be identified as well [14].
It is generally expensive to record the reuse distances over an entire program execution exactly, as there can be trillions of memory operations encountered. However, it has been shown that highly accurate approximations of the reuse distance can be stored efficiently using a splay tree to record when memory locations are encountered [15]. This information can then be used to create histograms describing the reuse distances for a given application.
2.2.2 Predicting Reuse Distance Patterns
input: the set of memory-distane bins B
output: the set of loality patterns P
for eah memory referene r
;
Pr =
false; p
; down =
f
=
null;
for (i = 0; i < numBins; i++)
i
if (Br .size > 0)
k
i
(Br :min
i
1
(down&&Br
:f req <
if (p ==
g
null
p:max > p:max
f
i
Br :f req))
i
p = new pattern; p.mean = Br .mean;
i
i
p.min = Br .min; p.max = Br .max;
i
i
p.freq = Br .freq; p.maxf = Br .freq;
Pr = Pr
p; down = false;
i
k
Br :min)
[
else
f
i
i
p.max = Br .max; p.freq += Br .freq;
i
(Br .freq
if
g
> p.maxf )
i
i 1 :f req
if (!down && Br
g
down =
else
g
f
i
p.mean = Br .mean; p.maxf = Br .maxf;
p =
true;
i
> Br :f req)
null;
Figure 2.5. Pattern-formation Algorithm
Using profiling data, it has been shown that the memory behavior of a program can be predicted by using curve fitting to model the reuse distance as a function of the data size of an application [14].
First, patterns are identified for each memory operation in the instrumented training runs. Histograms storing the reuse data for memory operations are used to identify these patterns. For each bin in the histogram, a minimum distance, maximum distance, mean distance, and frequency are recorded. Then, adjacent bins are merged using the algorithm in Figure 2.5. The locality patterns for an operation are defined as the sets of merged bins. Finally, the prediction algorithm uses curve fitting with each of a memory operation's patterns in two training runs to predict the corresponding pattern in the predicted run [14].
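As a sketch of the fitting step, suppose a pattern's reuse distance is modeled as d(s) = a + b * s^e in the data size s, with the exponent e standing in for whichever candidate form the fitting of [14] selects; the helper below and the sample numbers are illustrative assumptions, not the published algorithm. Two training runs then determine a and b, and the fitted curve extrapolates the pattern to a larger run:

#include <math.h>
#include <stdio.h>

/* Fit d(s) = a + b * pow(s, e) through two training observations
   (s1, d1) and (s2, d2), then evaluate the fit at size s3. */
static double predict(double s1, double d1, double s2, double d2,
                      double e, double s3)
{
    double b = (d2 - d1) / (pow(s2, e) - pow(s1, e));
    double a = d1 - b * pow(s1, e);
    return a + b * pow(s3, e);
}

int main()
{
    /* Hypothetical pattern whose mean distance grows linearly (e = 1):
       mean 100 at size 1000 and mean 200 at size 2000 in training. */
    printf("predicted mean at size 8000: %.1f\n",
           predict(1000.0, 100.0, 2000.0, 200.0, 1.0, 8000.0));
    return 0;
}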
2.2.3 Prediction Accuracy Model
There are two important measures of the reuse distance predictions. First is the coverage, which is defined as the percentage of operations in the reference run that can be predicted. An operation can be predicted if it occurs in both training runs, and all of its patterns are regular. A pattern is regular if it occurs in both training runs and the reuse distance does not decrease as the problem size grows.
The accuracy then is the percentage of covered operations that are predicted correctly. An operation is predicted correctly if the predicted patterns exactly match the observed patterns, or they overlap by at least 90%. The overlap for two patterns A and B is defined as

    (A.max - max(A.min, B.min)) / max(B.max - B.min, A.max - A.min)
The overlap factor of 90% was used in this work because this factor worked well in prior work with sequential applications [14].
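For example, with a predicted pattern A = [100, 200] and an observed pattern B = [120, 220] (made-up numbers), the overlap is (200 - 120) / max(100, 100) = 80%, which falls below the 90% threshold and would be counted as an incorrect prediction. A direct transcription of the formula:

/* Overlap of patterns A = [a_min, a_max] and B = [b_min, b_max],
   as defined above; a value of 1.0 means complete overlap. */
static double max2(double x, double y) { return x > y ? x : y; }

static double overlap(double a_min, double a_max,
                      double b_min, double b_max)
{
    return (a_max - max2(a_min, b_min)) /
           max2(b_max - b_min, a_max - a_min);
}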
CHAPTER 3
Predicting Remote Reuse Distance
3.1 Instrumentation
In order to get the raw cache reuse data for the predictions, it was important to have a working base compiler to which instrumentation could be added to generate that data. The MuPC compiler and runtime was used for the data collection, with a number of modifications to support runtime remote reuse data collection.
Initially, it was necessary to update MuPC to add support for x86-64 systems, both to ensure MuPC continues to function in the future and to support our new cluster. Because the vendor-provided MPI libraries were 64-bit only, it was not possible to simply use the 32-bit support in the OS. Therefore, the MuPC compiler was modified to generate correct code for the new platform. The bulk of this work was merely increasing the size of primitive types and enabling 64-bit macros and typedefs in the EDG front-end.
The other changes were all to support recording cache reuse in UPC programs. First, generic instrumentation was added to many of the runtime functions. This allows a programmer to register functions that get called whenever a particular runtime function is called, enabling a programmer to inspect the program's runtime behavior. To test this instrumentation, a simple function was written to create a log of all remote memory accesses, recording the operation (put or get), the remote address, the size of the access, and the location in the program source that initiated the access.
Once this functionality was working, existing instrumentation for the Atom simulator [16] was modified for use with MuPC. This code uses splay trees to store cache reuse data per instruction. Since it was originally designed to work with cache reuse in hardware, associating the reuse data with an instruction works fine. However, for this project, there is no simple instruction to associate the reuse data with, as remote accesses are complicated operations. It was finally decided that the return address (back into the application code) would work as a substitute, since it would produce the desired mapping back to the application source.
Additionally, the Atom instrumentation had to be modified to deal with differing sizes of addresses. In particular, shared memory addresses are a struct in MuPC, containing a 64-bit address, a thread, and a phase. The phase was not important to this research, but both the thread and the address were. To properly store these values, the instrumentation was modified to store addresses as 64-bit values instead of 32-bit values, and the thread was stored in the upper six bits of the address, since they were unused due to memory alignment.
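A sketch of that packing follows; the shift amount and the helper names are ours, chosen so that six high-order bits hold a thread number (enough for the at most 48 threads used in this work) while the remaining bits keep the address:

#include <stdint.h>

#define THREAD_SHIFT 58                                 /* 64 - 6 bits */
#define ADDR_MASK    ((UINT64_C(1) << THREAD_SHIFT) - 1)

/* Pack a UPC thread number into the six unused high-order bits of a
   64-bit address, so one integer can key the reuse-distance trees. */
static uint64_t pack(uint64_t addr, unsigned thread)
{
    return (addr & ADDR_MASK) | ((uint64_t) thread << THREAD_SHIFT);
}

static unsigned unpack_thread(uint64_t key) { return (unsigned)(key >> THREAD_SHIFT); }
static uint64_t unpack_addr(uint64_t key)   { return key & ADDR_MASK; }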
Unfortunately, the implementation does require modifications to the source program. The modifications are quite small, and can easily be disabled with the preprocessor. The macros and global variables shown in Figure 3.1 were defined in the upc.h header.
/* Added to support tracing of remote accesses. */
extern char *mupc_trace_func;
extern char *mupc_trace_func_prev;
char *mupc_get_name();

/* Macros for tracing. */
#define TRACE_FUNC \
    mupc_trace_func_prev = mupc_trace_func; \
    mupc_trace_func = mupc_get_name();

#define TRACE_FUNC_RET \
    mupc_trace_func = mupc_trace_func_prev;

/* Must be used exactly once per program! */
void mupc_trace_init_file();
void mupc_trace_init_rd();

#define MUPC_TRACE_NONE void (*mupc_trace_init)() = NULL;
#define MUPC_TRACE_FILE void (*mupc_trace_init)() = mupc_trace_init_file;
#define MUPC_TRACE_RD   void (*mupc_trace_init)() = mupc_trace_init_rd;

Figure 3.1. Added MuPC Macros
These macros set up information in global variables that is used by special functions that wrap calls into the MuPC runtime. The calls save the return address of the call and the name of the function that it was called from, so they can be recorded by the instrumentation.
The macros TRACE_FUNC and TRACE_FUNC_RET should be used at the beginning and end of all functions with remote accesses. While these macros are not strictly necessary, they enable the instrumentation to track the function name that an operation originated from without having to figure it out from the return address.
The MUPC_TRACE_* macros initialize the tracing code. The MUPC_TRACE_RD macro configures the tracing to store per-instruction shared memory reuse data. It must be included exactly once in the program's source.
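A short sketch of an instrumented source file, using the macros of Figure 3.1 (the function and shared array here are arbitrary examples, not code from MuPC or the test kernels):

#include <upc.h>

MUPC_TRACE_RD            /* select per-instruction reuse tracing; once per program */

shared int data[THREADS];

int sum_all()
{
    int i, sum = 0;

    TRACE_FUNC           /* entering a function with remote accesses */
    for (i = 0; i < THREADS; ++i)
        sum += data[i];  /* each read may be a logged remote access */
    TRACE_FUNC_RET       /* leaving the function */
    return sum;
}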
During an instrumented run, all remote accesses are logged, and each place there is a call into the runtime gets associated with a reuse distance histogram. Barriers and strict accesses cause the last-use data to be dropped, to force all later references to act as if no addresses had yet been seen. When the program completes, these histograms are written out to disk, one file per thread.
3.2 Test Kernels
Since there are no standard benchmark applications for UPC, a small number of kernels were written or modified from existing applications to model program behavior. These are described below.
3.2.1 Matrix Multiplication
Matrix multiplication is a well studied problem in parallel computation. Each index (i, j) in the resulting matrix is defined as the sum of the products of the elements of row i from the first matrix with the corresponding elements of column j from the second. A simple UPC implementation to solve C = A * B, where A, B, and C are N x N matrices, is shown in Figure 3.2.
for (int i = 0; i < N; ++i) {
    upc_forall (int j = 0; j < N; ++j; &C[i][j]) {
        C[i][j] = 0;
        for (int k = 0; k < N; ++k) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Figure 3.2. Matrix Multiplication in UPC
The Matrix Multiplication kernel simply multiplies two large (square) arrays together. The nodes are arranged in a 2-d grid, and each node has an N x N block of the array. The multiplication is performed using a naive implementation with a triple-nested loop, where each thread calculates its portion of the final matrix, working a block at a time. The problem size in this kernel is the local size of the three matrices. In testing, this kernel was run with 4, 9, and 16 threads, with 4 elements per thread up to 262144 elements per thread, in increasing powers of 4. The complete source for the kernel used can be found in Appendix A.1.
3.2.2 Jacobi
The Jacobi method is an iterative method that finds an approximate solution to a series of linear equations. While it does not find exact solutions, it can give a solution that is within a desired delta of the exact solution for most problems. However, its most important feature is its numerical stability, which is critical when working with inexact values. Since the floating point format most commonly used to represent real numbers is inexact, this stability makes the Jacobi method quite useful in a variety of applications.
The Jacobi kernel simulates a Jacobi iterative solver for a large array distributed in a block cyclic fashion. The kernel only simulates the remote access pattern for a Jacobi solver; it does not actually attempt to solve the generated array. Like the Matrix Multiplication kernel, every thread performs essentially the same task. The problem size in this kernel is the number of iterations the solver runs for. In testing, this kernel was run with 2 through 24 threads, with iteration counts as a power of 2 up to 8192. The complete source for the kernel used can be found in Appendix A.2.

Figure 3.3. LU Decomposition of a Matrix
3.2.3 LU Decomposition
LU decomposition is another important operation for many scientific applications. It decomposes a matrix into the product of a lower triangular matrix with an upper triangular matrix, as shown in Figure 3.3. This can be used to find both the determinant and inverse of a matrix, as well as to solve a system of linear equations.
The LU kernel, which comes from the test suite of Berkeley's UPC compiler [4] and is based on a program from Stanford University [17], performs an LU decomposition on a large array that is distributed across nodes in a block cyclic fashion. Square blocks of B x B elements (B = 8, 16, 32 were tested) are distributed to each thread until the entire array has been allocated. The problem size is the total size of the array. In testing, this kernel was run with 2 through 36 threads, with problem sizes from 1024 elements to 16777216 elements. The complete source for the kernel used can be found in Appendix A.3.
3.2.4 Stencil
Stencil problems are problems where elements of an array are iteratively updated based on the pattern of their neighbors. A common example of this is John Conway's Game of Life, where the life or death of a cell at a given time step is determined by the life or death of neighboring cells in the previous time step. The stencil describes which neighboring cells are used to update a given cell. This is often used in engineering when modeling the flow of heat or air.
The stencil kernels are a family of kernels that apply a 2-d or 3-d stencil to an array. These kernels were added as an example of a class of problems where threads displayed differing behavior based on the logical layout of threads. The problem size for these is the total size of the array the stencil operates over. These kernels were developed later than the others and did not have as many test runs as the other kernels due to time and hardware constraints. Therefore, exhaustive results are not available for these kernels; results are available only for the four- and eight-point 2-d stencil kernels.
While initially instrumenting the stencil code, it was observed that in some cases there were additional, unexpected remote memory accesses that did not match up to the theoretical communication behavior of the programs.
void stencil(int i, int j)
{
    A(i, j) += A(i - 1, j);
    A(i, j) += A(i + 1, j);
    A(i, j) += A(i, j - 1);
    A(i, j) += A(i, j + 1);
    A(i, j) /= 5.0;
}

Figure 3.4. Failing Stencil Code
Assuming that A(i, j) is local to the calling thread, one would expect that the stencil function in Figure 3.4 generates at most four operations (calls to the runtime) that could be remote accesses. Without that assumption, at most fourteen operations could be remote accesses. The code actually generated has at most twenty-three operations that could be remote accesses. This was causing the thread-scaling prediction to fail, as the pattern of which of these operations is used varies with thread count.
The culprit was determined to be the '+=' operator and its interaction with a shared variable. Changing the stencil code sample from Figure 3.4 to that shown in Figure 3.5 eliminates the superfluous operations.
This fixed the problem because the UPC to C translator uses nested conditional operators to check for local accesses when working with shared variables. This nesting caused multiple accesses to be generated for a single source access when there are multiple accesses to shared variables in a statement. Since this is a limitation of the MuPC compiler, and common practice is to use local temporaries to avoid multiple remote accesses, the stencil code was updated to use the latter form. The complete source for the kernel used can be found in Appendix A.4.
void stencil(int i, int j)
{
    double t;

    t = A(i, j);
    t += A(i - 1, j);
    t += A(i + 1, j);
    t += A(i, j - 1);
    t += A(i, j + 1);
    t /= 5.0;
    A(i, j) = t;
}

Figure 3.5. Corrected Stencil Code
3.3 Thread Partitioning
Since a prediction is being performed for each UPC thread in a run, the choice of which threads' data are used to train becomes important. If the number of threads is kept constant, the obvious choice is to use the same thread's data for the training. However, that doesn't work when the number of threads increases. Therefore, it was necessary to come up with a way of selecting the threads used in the predictions.
In searching for a suitable algorithm for partitioning threads for training, it was noted that all of the kernels tested worked with large square arrays, and the communication pattern was based on the distribution of these arrays. Because the communication pattern is based on the geometric layout of the data, an algorithm was used that matches the training data with a pattern describing this geometric layout, and then chooses threads for prediction based on it. As a special exception, thread 0 is assumed to be used for various extraneous tasks, such as initialization, and is therefore always predicted with thread 0 from each of the training runs.
The algorithm is split into three parts. First, the threads in the two training runs are partitioned into groups based on their communication behavior, as shown in Figure 3.6. It is assumed that the data for each thread in each training run has associated with it the pattern data, represented as t.patterns, and the instructions encountered (patterns are associated with an instruction), represented as t.inst. Each thread's patterns are tested against those in the existing groups. The thread is included in a group if all the patterns are present in both, and there is at most 5% difference between the min, max, freq, and mean values for the thread and the average of the values for all the threads in the group. Each group tracks the number of members and the running arithmetic average of the min, max, freq, and mean of each pattern. Once the groups are generated, they are associated with the run as T.groups.
input: the set of training runs containing pattern data for each thread in the run
output: the set of training runs is updated with the set of groups G

for each training run T
    T.groups = {}
    for each thread t ∈ T
        t.group = NULL
        for each group g ∈ T.groups
            if patterns_match(g, t)
                t.group = g
                break
        if t.group == NULL
            t.group = new group
            t.group.vals = NULL
            t.group.inst = t.inst
            t.group.patterns = t.patterns
            t.group.nthr = 1
            T.groups = T.groups ∪ {t.group}
        else
            for each pattern pg ∈ t.group.patterns
                let pt be the corresponding pattern in t.patterns
                pg.min = (pg.min * t.group.nthr + pt.min) / (t.group.nthr + 1)
                pg.max = (pg.max * t.group.nthr + pt.max) / (t.group.nthr + 1)
                pg.freq = (pg.freq * t.group.nthr + pt.freq) / (t.group.nthr + 1)
                pg.mean = (pg.mean * t.group.nthr + pt.mean) / (t.group.nthr + 1)
            t.group.nthr = t.group.nthr + 1

Figure 3.6. Partitioning Algorithm
input: a group g and a thread t
output: true if the patterns of t match those of g, false otherwise

if g.inst ≠ t.inst return false
for each pattern pg ∈ g.patterns
    let pt be the corresponding pattern in t.patterns
    if no such pt return false
    if |pg.min - pt.min| / max(pg.min, pt.min) > 0.05 return false
    if |pg.max - pt.max| / max(pg.max, pt.max) > 0.05 return false
    if |pg.freq - pt.freq| / max(pg.freq, pt.freq) > 0.05 return false
    if |pg.mean - pt.mean| / max(pg.mean, pt.mean) > 0.05 return false
return true

Figure 3.7. patterns_match() Function
Then these groups are matched against a function describing the expected partitioning of threads, as shown in Figure 3.8. The pattern function is used to generate the set of values associated with the threads in each group. The training run matches the pattern function if the set of values associated with each group is disjoint from that of every other group. Additionally, a map associating each value with one of the threads that produced it (the lowest, if the threads are tested in ascending order) is kept for each training run, for later use when picking which threads to use in the prediction.
input: the set T of training runs with the associated group assignments
output: true if the training runs match the pattern, false otherwise

for each training run Ti
    for each thread t ∈ Ti
        t.group.vals = t.group.vals ∪ {f(t, Ti.numthreads)}
        if !Mi.containsKey(f(t, Ti.numthreads))
            Mi.insert(f(t, Ti.numthreads), t)
    for each unordered pair of groups g1, g2 ∈ Ti.groups
        if (g1.vals ∩ g2.vals) ≠ {}
            return false
return true

Figure 3.8. Matching Algorithm
Unfortunately, the pattern itself is not automatically detected, but was chosen because the kernels tested all used a square thread layout. The pattern function used is shown in Figure 3.9. Note that the pattern partitions a square into regions based on geometric properties. Because the algorithm merges regions based on observed behaviors, the pattern actually defines a large number of subpatterns, which are automatically matched by the algorithm. The ability to automatically generate a pattern function for a given problem would greatly increase the generality of this analysis, an important avenue for future work. However, it is possible to create general patterns that match a wide variety of problems by utilizing the subsetting inherent to this algorithm.
Finally, if both training runs match the pattern, threads are predicted with threads from the training runs that share the same group as determined by the matched function. The algorithm selecting the pairs is shown in Figure 3.10. It uses the maps generated while matching the pattern to choose threads in the training runs with the same value returned by the pattern function.
As an example, consider using equation 3.1 as a pattern to match against the LU kernel run with 16 threads in the test dataset, 25 threads in the train dataset, and 36 threads in the reference dataset, with a fixed per-thread data size. In this case, the threads with data on the diagonal of the array have to do extra work. Equation 3.1 returns 0 for threads on the diagonal and 1 for threads not on the diagonal, assuming the threads are laid out in a square grid. This should perfectly match the thread layout of the LU kernel when run with a number of threads that is a perfect square, such as in this example.
Figure 3.9. Pattern Function
input: the set of threads Tpred in the prediction run
input: for each training run i, a map Mi mapping a return value of the pattern function f(t, T) to a thread in that training run that generates that return value
output: for each thread in the prediction run, a pair of threads from the training runs to use

for each thread t ∈ Tpred
    t.preds = (M0.get(f(t, Tpred.numthreads)), M1.get(f(t, Tpred.numthreads)))

Figure 3.10. Training Thread Selection Algorithm
For clarity, the matching of reuse patterns is simplified such that threads with data on the diagonal perfectly match only other threads with data on the diagonal, and likewise for threads without data on the diagonal.
f(t, n) = 0 if floor(t / sqrt(n)) = t mod sqrt(n), and 1 otherwise.    (3.1)
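In C, equation 3.1 amounts to the following small function (a sketch that assumes n is a perfect square):

#include <math.h>

/* Equation 3.1: returns 0 for threads on the diagonal of a
   sqrt(n) x sqrt(n) thread grid, and 1 otherwise. */
static int f(int t, int n)
{
    int side = (int) sqrt((double) n);
    return (t / side == t % side) ? 0 : 1;
}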
First, the threads in the test and train datasets are sorted into groups. In both cases there are two groups: the threads that have data on the diagonal, and therefore have extra work, and those that don't. For the test dataset, the groups are g0 = {t5, t10, t15} and g1 = {t1, t2, t3, t4, t6, t7, t8, t9, t11, t12, t13, t14}. For the train dataset, the groups are g0 = {t6, t12, t18, t24} and g1 = {t1, t2, t3, t4, t5, t7, t8, t9, t10, t11, t13, t14, t15, t16, t17, t19, t20, t21, t22, t23}. As noted earlier, t0 is excluded as a special case.
Next, for each group, the set vals of results of f(ti, n), where n is the number of threads, is computed over the ti in the group. This gives g0.vals = {0} and g1.vals = {1} for both the test and train datasets. Because g0.vals ∩ g1.vals is empty, the pattern is matched in both datasets.

Table 3.1. Thread Grouping for 36-Thread Reference Dataset

Ref  Test  Train     Ref  Test  Train
 0    0     0        18    1     1
 1    1     1        19    1     1
 2    1     1        20    1     1
 3    1     1        21    5     6
 4    1     1        22    1     1
 5    1     1        23    1     1
 6    1     1        24    1     1
 7    5     6        25    1     1
 8    1     1        26    1     1
 9    1     1        27    1     1
10    1     1        28    5     6
11    1     1        29    1     1
12    1     1        30    1     1
13    1     1        31    1     1
14    5     6        32    1     1
15    1     1        33    1     1
16    1     1        34    1     1
17    1     1        35    5     6
Since the pattern was matched, it can be used to select the pairs of threads used for prediction of the reference dataset. For each thread in the reference dataset, a pair of threads from the test and train datasets is chosen based on which set they are in. Since threads are grouped by behavior, it doesn't matter which thread in a set is used for the prediction. In the solution shown in Table 3.1, the lowest thread in the set is used for the predictions.
CHAPTER 4
Prediction Results
To determine whether or not it would be possible to model the behavior of remote memory accesses, the four kernels were instrumented and run with a number of different data sizes and numbers of threads. All tests were run with the instrumented MuPC compiler on a 24-node dual dual-core Opteron cluster with an InfiniBand interconnect.
Due to hardware and software limitations, the testing was restricted to a maximum of 48 UPC threads. In practice, anything more than 24 threads ran quite slowly, so there are relatively few results with more than 24 threads. While it is recognized that these are relatively small systems in the world of high performance computing, it is believed that the results would hold as the problem size and thread count increase, since the problems encountered were due to either changes in data layout or using training data from runs that were too small.
4.1 Problem Size Scaling
As expected, the prediction accuracy was very high when holding the number of threads constant and just increasing the problem size. The prediction followed the same pattern as shown in earlier work, which makes sense, as there is very little to distinguish size scaling with constant threads from size scaling with one thread as far as the prediction is concerned.
However, problems can arise when the growth of the problem size causes the distribution of shared data to change, which in turn causes the communication pattern between threads to change. This behavior is seen in the LU kernel, where prediction accuracy and coverage drop steeply in a few cases because the distribution of the array changed.
Table 4.1. Problem Size Scaling Accuracy Distribution

Kernel                    Predictions  Minimum  Average  Maximum
Matrix Multiplication     1547         3.08%    96.98%   100.00%
Jacobi Solver             14234        99.87%   100.00%  100.00%
LU Decomposition, 8x8     7027         6.36%    94.20%   100.00%
LU Decomposition, 16x16   5809         8.54%    95.91%   100.00%
LU Decomposition, 32x32   5839         0.00%    96.29%   100.00%
Table 4.2. Problem Size Scaling Coverage Distribution

Kernel                    Predictions  Minimum  Average  Maximum
Matrix Multiplication     1547         53.45%   99.45%   100.00%
Jacobi Solver             14234        38.80%   100.00%  100.00%
LU Decomposition, 8x8     7027         0.00%    76.82%   100.00%
LU Decomposition, 16x16   5809         0.00%    63.17%   100.00%
LU Decomposition, 32x32   5839         0.00%    52.97%   100.00%
Tables 4.1 and 4.2 show that both coverage and accuracy are quite high in most cases. In the matrix multiplication and Jacobi kernels, the low minimum accuracies are seen only when training with very small problem sizes. Of the 1612 predictions on the matrix multiplication kernel where the accuracy is less than 60%, 1610 of them occur when the smallest training size is 256 or less, and 1332 when the smallest training size is 16 or less. Likewise, the low minimum coverage percentages are seen only when the training sizes are small enough that some operations disappear.
The LU decomposition kernel is a bit more problematic. Consider the results in Table 4.3. It charts the percentage of predictions that have greater than 80, 90, and 95% accuracy for each of the three blocking factors tested, first for all predictions, then only for those where the coverage was 100%.
Table 4.3. LU Problem Size Scaling Accuracy by Coverage

Kernel                   Predictions  >80% Acc  >90% Acc  >95% Acc
8x8, All Predictions     7027         56.07%    50.39%    33.56%
16x16, All Predictions   5809         50.20%    49.63%    40.37%
32x32, All Predictions   5839         47.58%    45.83%    45.71%
8x8, 100% Coverage       3503         96.97%    85.64%    53.04%
16x16, 100% Coverage     2486         97.30%    96.46%    74.82%
32x32, 100% Coverage     1844         88.72%    85.36%    84.98%
The large difference between predictions with 100% coverage and the others stems from the behavior of the kernel. The data distribution amongst the threads is determined by the problem size, thus changing the problem size changes the data distribution. The data distribution in turn determines which remote memory operations a thread encounters, which also changes when the problem size changes. This results in low coverage. The altered data distribution also changes the behavior of a couple of remote memory operations, as the number of threads encountering the operation changes. This has the effect of reducing the accuracy of the prediction for those operations.
It is clear that this model is restricted to using training data from runs that have communication patterns similar to the run being predicted. An interesting question for future work is whether or not a model of the data distribution can be used to model how the scaling will affect the application's communication pattern, and if that can in turn be used to enable high coverage and prediction accuracy when the data distribution does change.
4.2 Thread Scaling
Table 4.4. Thread Scaling Accuracy Distribution by Kernel

Kernel                    Predictions  Minimum  Average  Maximum
Matrix Multiplication     61056        3.08%    97.75%   100.00%
Jacobi Solver             103600       94.02%   99.87%   100.00%
LU Decomposition, 8x8     862683       12.85%   94.71%   100.00%
LU Decomposition, 16x16   346905       19.57%   91.73%   99.98%
LU Decomposition, 32x32   94600        13.11%   82.49%   99.31%
2d Stencil                18000        100.00%  100.00%  100.00%
Table 4.5. Thread Scaling Coverage Distribution by Kernel

Kernel                    Predictions  Minimum  Average  Maximum
Matrix Multiplication     61056        27.62%   99.29%   100.00%
Jacobi Solver             103600       55.66%   91.26%   100.00%
LU Decomposition, 8x8     862683       0.00%    41.73%   100.00%
LU Decomposition, 16x16   346905       0.00%    37.43%   100.00%
LU Decomposition, 32x32   94600        0.00%    20.96%   98.56%
2d Stencil                18000        0.00%    57.99%   100.00%
Tables 4.4 and 4.5 show the prediction results for varying the number of threads, predicting with all possible pairs of training threads from the available training data. Both coverage and accuracy are quite high in most cases. In the matrix multiplication and Jacobi kernels, the low minimum accuracies are seen only when training with very small problem sizes, the same effect that occurs when scaling the problem size. These results are for exhaustively predicting every thread in the reference set with every possible combination of threads in the two training sets. The high accuracy and coverage of the matrix multiplication and Jacobi kernels in these tables indicate that the prediction works well regardless of the threads chosen for training.
The prediction coverage and accuracy on the LU kernel is much more dependent on the choice of threads used for the training, however. Consider the accuracy of the prediction for thread 9 of 25, when using threads from runs with 4 and 16 threads for the training. The prediction accuracy by training pairs is shown in Figure 4.1.
Figure 4.1. LU 32x32 4-16-25 Thread Scaling Example Thread (prediction accuracy for thread 9, plotted against the train thread used, with one series per test thread 0 through 3)
For most of the pairs, the prediction accuracy is quite good. However, there are a number of pairs that result in terrible accuracy. This is a result of the behavioral differences between threads in the LU kernel. Since certain threads (those that contain blocks on the diagonal) have to do additional communication, and thread 9 when run with 25 threads is not one of them, pairs where both training threads are on the diagonal show dramatic decreases in accuracy.
However, these results show that if the threads can be partitioned in such a way that threads with similar behaviors are in similar groups, high accuracy predictions can be made. Thus, consider what happens when partitioning the training threads as described in Section 3.3 with the pattern shown in Figure 3.9.
Figure 4.2. LU 32x32 4-16-25 Accuracy Using Thread Selection Algorithm (prediction accuracy by predicted thread when the training pairs are chosen by the partitioning algorithm; all values lie between 99.5% and 100%)
The equation for Figure 3.9 is a complicated function that simply partitions a square into groups based on geometric location. Each corner is in its own group, and the edges (minus the corners) are each in their own groups. The diagonal, and the upper and lower triangles, also have their own groups. Figure 4.2 shows the accuracy results when using the thread partitioning to select training pairs.
As expected, by predicting threads on the diagonal with threads also on the diagonal, it is possible to avoid the pits seen in Figure 4.1. However, it is desirable that the pattern that is matched not be specific to the LU kernel. Thus, the square 2-d stencils were used to verify that the pattern would also work for an application with a very different communication pattern than the LU kernel.
Like the LU kernel, the data distribution of the shared array determines the control flow through the program in the stencil kernels. Unlike the LU kernel, threads on the corners and edges exhibit differing behavior. This is due to the lack of communication on one or more sides of the stencil. In turn, this causes low prediction coverage if threads along the edges are used to predict for threads in the middle, hence the low minimum and average coverage values in Table 4.5. The accuracy of covered operations is not impacted, however, because the skipped operations have minimal impact on the reuse patterns. This is shown by the very high accuracy of predictions seen in Table 4.4.
Table 4.6. Thread Scaling Accuracy Distribution with Partitioning

Kernel                    Minimum  Average  Maximum
Matrix Multiplication     92.38%   99.44%   100.00%
Jacobi Solver             94.69%   99.87%   100.00%
LU Decomposition, 8x8     51.07%   91.24%   99.86%
LU Decomposition, 16x16   56.26%   80.36%   99.33%
LU Decomposition, 32x32   48.95%   91.67%   99.31%
2d Stencil                100.00%  100.00%  100.00%
Table 4.7. Thread Scaling Coverage Distribution with Partitioning

Kernel                    Minimum  Average  Maximum
Matrix Multiplication     100.00%  100.00%  100.00%
Jacobi Solver             100.00%  100.00%  100.00%
LU Decomposition, 8x8     97.45%   98.79%   99.90%
LU Decomposition, 16x16   97.56%   98.97%   100.00%
LU Decomposition, 32x32   96.34%   98.26%   98.56%
2d Stencil                100.00%  100.00%  100.00%
Using thread partitioning, the threads in the corners, on each edge, and in the middle are separately grouped. This provides 100% coverage for all threads, as threads in the training runs that skip operations are used to predict for threads that will also skip those instructions due to the data distribution.
Tables 4.6 and 4.7 show the results of using the presented partitioning algorithm to select training threads for all kernels. As expected, the coverage and accuracy both show marked improvement for all kernels. The only unexpected result is the decline in the average accuracy for the LU kernels. On closer inspection, this is because the results in Table 4.4 are padded by the large number of combinations that match threads not on the diagonal. Additionally, because the number of predictions is so much smaller, the minimums carry more weight.
CHAPTER 5
Conclusions
In summary, it has been shown that it is possible to predict the remote reuse distances of UPC applications with a high degree of accuracy and coverage, though there are a number of important limitations.
First, it is necessary to choose training data that match the behavior of the desired prediction size to achieve high accuracy and coverage. Changes in the data distribution caused by increases in the problem size or number of threads can cause significant drops in both accuracy and coverage.
Choice of training threads is also critically important for prediction when scaling up the number of threads. The prediction results can vary from extremely poor to excellent merely by the choice of which threads were used for the prediction. It is therefore necessary to match threads' behaviors in the training data to patterns that predict which threads will perform similarly in the scaled-up runs.
5.1 Applications
One promising application of this research is automatically adjusting cache parameters such as the size and thread affinity of cache lines. This includes things like bypassing the cache for operations that are likely to result in a cache miss, or disabling the cache entirely for applications that won't make much use of it. Consider the code segment in Figure 5.1, and assume do_something is a function that performs no communication.
void example_sub(shared float *p1, shared float *p2, int len)
{
    int i, j;

    upc_forall (i = 0; i < len; ++i; p1 + i) {
        for (j = 0; j < len; ++j) {
            do_something(p1[i], p2[j], i, j);
        }
    }
}

Figure 5.1. Sample UPC Function
Since elements of p2 are accessed in order, repeatedly for each element of p1, this particular segment of code would work well with a cache large enough to hold at least len elements. This work could be used to predict how large a cache is needed for the application to cache all the elements this function sees, based on the reuse distance patterns observed.
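A sketch of how such a prediction might drive the sizing decision (the policy and the names are illustrative assumptions, not part of MuPC's interface):

/* Choose a cache size, in lines, from the largest predicted remote
   reuse distance (measured in distinct cache lines) among the
   operations the application is expected to execute. */
static long pick_cache_lines(long max_predicted_distance, long max_lines)
{
    long want = max_predicted_distance + 1;  /* hold the whole reuse window */
    return want < max_lines ? want : max_lines;
}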
Another option is warning the user about operations that show poor cache performance, perhaps as part of a larger performance analysis/profiling tool. Going back to the previous code, assume this time that the maximum size the cache can grow to in an implementation is CACHE_LEN. If the prediction indicates that len is likely to be larger than this, it might warn the user that the accesses to elements of p2 are likely to result in cache misses because the cache is too small. Then the user could change the code, perhaps to something like the code in Figure 5.2, to take better advantage of the cache, or perhaps simply change the compiler options used to tell the compiler to do so automatically.
void example_sub(shared float *p1, shared float *p2, int len)
{
    int i, j, k;

    for (k = 0; k < len / CACHE_LEN; ++k) {
        upc_forall (i = 0; i < len; ++i; p1 + i) {
            for (j = 0; j < CACHE_LEN; ++j) {
                do_something(p1[i], p2[k * CACHE_LEN + j],
                             i, k * CACHE_LEN + j);
            }
        }
    }
    upc_forall (i = 0; i < len; ++i; p1 + i) {
        for (j = 0; j < len % CACHE_LEN; ++j) {
            do_something(p1[i], p2[k * CACHE_LEN + j],
                         i, k * CACHE_LEN + j);
        }
    }
}

Figure 5.2. Sample UPC Function Tuned for Cache Size
Inserting prefetches prior to operations that are likely to result in a cache miss, and likewise avoiding unnecessary prefetches, would also be a good use of these predictions. Since network congestion can cause dramatic performance penalties on large clusters, avoiding unnecessary communication can be just as important as requesting data before it is actually needed.
5.2 Future Work
This research shows that it is possible to predict the remote reuse distance behavior of UPC applications. There are nonetheless a number of weaknesses that should be addressed in future work. Foremost among these is the ability to model the data distribution of an application, and to use that model to avoid problems such as those seen with the LU kernel, where changing the data distribution between training runs causes poor accuracy. Since the data distribution is known at runtime, it should be possible to store it and use it to tune the prediction by modeling how the distribution will change with an increase in problem size.
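A minimal sketch of what such a runtime-recorded model might look like follows. The struct layout, the names dist_model and scale_dist, and the proportional-scaling rule are all assumptions made here for illustration, not part of the implemented system.

/* Hypothetical record of a block-cyclic distribution observed during a
   training run, and a rescaling of it to a larger run. */
struct dist_model {
    long block_size;   /* elements per block */
    long elem_count;   /* total elements in the run */
    int  threads;      /* UPC threads in the run */
};

struct dist_model scale_dist(struct dist_model m,
                             long new_elems, int new_threads)
{
    struct dist_model s;
    /* One plausible rule: keep the number of blocks per thread fixed,
       so the block size grows with the per-thread share of the data. */
    long blocks_per_thread = m.elem_count / (m.block_size * m.threads);
    s.block_size = new_elems / (blocks_per_thread * new_threads);
    s.elem_count = new_elems;
    s.threads    = new_threads;
    return s;
}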
Another weakness of this research is that the function used as a pattern during thread partitioning must be chosen manually. As the thread grouping is largely based on the data distribution, it seems natural to expect that a model of how the data distribution changes as the number of threads grows would also enable a better partitioning of threads, and thereby improved prediction accuracy. Another possibility is looking at the program in a more abstract fashion, choosing a pattern based on the type of problem being solved. A list of such abstract problem types and their associated communication patterns, such as Berkeley's Dwarfs [18], could be used as a starting point; a small example of a grid-based pattern function appears below.
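As a sketch, a partitioner might group threads by a pattern function drawn from such a catalog. The function below is hypothetical: it assumes a square thread grid, as the matrix multiply and LU kernels use, and separates corner, edge, and interior threads, which see different remote access patterns.

/* Hypothetical pattern function: map a thread to one of four behavior
   groups on an n x n thread grid.  Threads in the same group would
   share training data when predicting scaled-up runs. */
int group_of(int thread, int n)
{
    int row = thread / n;
    int col = thread % n;
    int row_edge = (row == 0 || row == n - 1);
    int col_edge = (col == 0 || col == n - 1);
    return row_edge*2 + col_edge;   /* 0 = interior, 1/2 = edge, 3 = corner */
}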
Since UPC is meant to increase productivity on large systems, it will also be necessary to improve the scalability of this work. In particular, storing reuse patterns for every thread in the two training sets, and predicting for every thread, generated a large amount of data even for the relatively small test applications used here. If this analysis were used in a production environment, there would need to be some way of compressing the data or of skipping threads whose patterns are similar to another thread's. This could perhaps extend into exploring a global view of cache behavior, where the data are not kept on a per-thread basis but are instead taken over all the threads in an application.
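One plausible compression scheme keeps a full histogram only for the first thread of each group of near-identical threads and stores a reference for the rest. The bin count, tolerance, and names below are assumptions for illustration only.

#define NBINS   32     /* assumed number of histogram bins */
#define EPSILON 0.05   /* assumed per-bin similarity tolerance */

/* Nonzero if two normalized reuse histograms agree within EPSILON. */
int similar(const double *h1, const double *h2)
{
    int b;
    for (b = 0; b < NBINS; b++) {
        double d = h1[b] > h2[b] ? h1[b] - h2[b] : h2[b] - h1[b];
        if (d > EPSILON)
            return 0;
    }
    return 1;
}

/* After this runs, rep[t] names the thread whose histogram stands in
   for thread t; only threads with rep[t] == t keep their own data. */
void compress_histograms(double hist[][NBINS], int threads, int *rep)
{
    int t, r;
    for (t = 0; t < threads; t++) {
        rep[t] = t;
        for (r = 0; r < t; r++) {
            if (rep[r] == r && similar(hist[t], hist[r])) {
                rep[t] = r;   /* reuse thread r's stored histogram */
                break;
            }
        }
    }
}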
Finally, this work explored only temporal reuse. Spatial reuse, where "nearby" data are pulled in along with the requested data, provides a substantial share of the benefit of caching for many serial applications, and it is likely to do the same for many UPC applications. The prediction scheme used here for temporal reuse has been shown to work well for spatial reuse in serial applications. However, spatial locality in UPC shared memory can be cross-thread or within a single thread, depending on how the application steps through memory and on the way the data are laid out.
LIST OF REFERENCES
[1] TOP500.Org. "ORNL's Jaguar Claws its Way to Number One, Leaving Reconfigured Roadrunner Behind in Newest TOP500 List of Fastest Supercomputers." Nov. 2009. [Online]. Available: http://top500.org/lists/2009/11/press-release

[2] R. Numrich and J. Reid, "Co-Array Fortran for parallel programming," Rutherford Appleton Laboratory, Tech. Rep. RAL-TR-1998-060, 1998.

[3] The UPC Consortium. "UPC language specification, v1.2." June 2005. [Online]. Available: http://www.gwu.edu/upc/docs/upc_specs_1.2.pdf

[4] University of California Berkeley. "The Berkeley UPC Compiler." 2002. [Online]. Available: http://upc.lbl.gov

[5] University of California Berkeley. "GASNet Communication System." 2002. [Online]. Available: http://gasnet.cs.berkeley.edu/

[6] W.-Y. Chen, C. Iancu, and K. Yelick, "Communication optimizations for fine-grained UPC applications," in PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 2005, pp. 267-278.

[7] M. Snir, "Shared memory programming on distributed memory systems," 2009, keynote address at PGAS 2009. [Online]. Available: http://www2.hpcl.gwu.edu/pgas09/tutorials/PGAS_Snir_Keynote.pdf

[8] M. J. Flynn, "Very high-speed computing systems," in Proceedings of the IEEE, vol. 54, no. 12. IEEE Computer Society, Dec. 1966, pp. 1901-1909.

[9] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948-960, Sept. 1972.

[10] B. Carlson, T. El-Ghazawi, R. Numrich, and K. Yelick. "Programming in the Partitioned Global Address Space Model." 2003. [Online]. Available: http://crd.lbl.gov/UPC/images/b/b5/PGAS_Tutorial_sc2003.pdf

[11] J. Savant and S. Seidel, "MuPC: A Run Time System for Unified Parallel C," Michigan Technological University, Tech. Rep. CS-TR-02-03, Sept. 2002. [Online]. Available: http://upc.mtu.edu/papers/CS.TR.2.3.pdf

[12] Hewlett Packard. "HP Unified Parallel C."

[13] C. Fang, S. Carr, S. Onder, and Z. Wang, "Reuse-distance-based miss-rate prediction on a per instruction basis," in Proceedings of the 2nd ACM Workshop on Memory System Performance, June 2004, pp. 60-68.

[14] C. Fang, S. Carr, S. Onder, and Z. Wang, "Instruction based memory distance analysis and its application to optimization," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.

[15] C. Ding and Y. Zhong, "Predicting whole-program locality through reuse distance analysis," SIGPLAN Not., vol. 38, no. 5, pp. 245-257, 2003.

[16] A. Srivastava and A. Eustace, "ATOM: A system for building customized program analysis tools." ACM, 1994, pp. 196-205.

[17] Stanford University. "Parallel dense blocked LU factorization." 1994.

[18] University of California Berkeley. "The Landscape of Parallel Computing Research: A View From Berkeley." [Online]. Available: http://view.eecs.berkeley.edu/wiki/Main_Page
APPENDIX
Test Kernel Sources
A.1 Matrix Multiplication

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <upc.h>
#include <upc_relaxed.h>

#ifdef MUPC_TRACE_RD
MUPC_TRACE_RD
#endif

shared [] int *shared [1] *A;
shared [] int *shared [1] *B;
shared [] int *shared [1] *C;

#define idx(arr, i, j) ((arr)[((i)/N)*n + ((j)/N)] + ((i)%N)*N + ((j)%N))

/* Initialize will set n to sqrt(THREADS), the number of rows and columns
   of the thread grid.  Threads are laid out as an n x n square, and each
   thread locally has an NxN block of each array. */
int n, N;

void initialize(char *argv)
{
    int i, j, k;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Seed the random number generator. */
    srand(MYTHREAD);

    /* Verify that the number of threads is a square. */
    n = floor(sqrt((double)THREADS));
    assert((n*n) == THREADS);

    /* Read in the size of the local blocks. */
    N = atoi(argv);

    /* Allocate memory for the arrays: a directory of one pointer per
       thread, plus one NxN local block per thread per array. */
    A = (shared [] int *shared [1] *)
        upc_all_alloc(THREADS, sizeof(shared [] int *));
    B = (shared [] int *shared [1] *)
        upc_all_alloc(THREADS, sizeof(shared [] int *));
    C = (shared [] int *shared [1] *)
        upc_all_alloc(THREADS, sizeof(shared [] int *));
    assert((A != NULL) && (B != NULL) && (C != NULL));

    if (MYTHREAD == 0)
        printf("Using %dx%d blocks.\n", N, N);

    A[MYTHREAD] = (shared [] int *)
        (((shared [1] int *)upc_all_alloc(THREADS, N*N*sizeof(int))) + MYTHREAD);
    B[MYTHREAD] = (shared [] int *)
        (((shared [1] int *)upc_all_alloc(THREADS, N*N*sizeof(int))) + MYTHREAD);
    C[MYTHREAD] = (shared [] int *)
        (((shared [1] int *)upc_all_alloc(THREADS, N*N*sizeof(int))) + MYTHREAD);

    /* Fill in arrays A and B. */
    upc_forall (i = 0; i < THREADS; i++; i) {
        assert(A[i] != NULL);
        assert(B[i] != NULL);
        assert(C[i] != NULL);
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                *(A[i] + j*N + k) = i;
                *(B[i] + j*N + k) = i;
                *(C[i] + j*N + k) = 0;
            }
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void print_array(shared [] int *shared [1] *A)
{
    int i, j;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Only thread 0 prints.  Everyone else just returns. */
    if (MYTHREAD != 0)
        return;

    for (i = 0; i < n*N; i++)
        printf("\t%d", i);
    putchar('\n');
    for (i = 0; i < n*N; i++) {
        printf("%d", i);
        for (j = 0; j < n*N; j++) {
            // printf("\t%d", *(A[(i/N)*n + (j/N)] + (i%N)*N + (j%N)));
            printf("\t%d", *idx(A, i, j));
        }
        putchar('\n');
    }
    putchar('\n');
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void calc_block(int idx)
{
    int i, j, k, rowt, colt;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    rowt = (MYTHREAD/n)*n + idx;
    colt = (MYTHREAD%n) + (idx*n);
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                *(C[MYTHREAD] + i*N + k) +=
                    (*(A[rowt] + i*N + j)) * (*(B[colt] + j*N + k));
            }
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void mult_kernel()
{
    int i, j;

    upc_forall (i = 0; i < THREADS; i++; i) {
        for (j = 0; j < n; j++) {
            calc_block(j);
        }
    }
}

int main(int argc, char **argv)
{
    /* Must have 1 argument, the size of N. */
    if (argc != 2)
        exit(EXIT_FAILURE);

    /* Initialize the arrays. */
    initialize(argv[1]);
    upc_barrier(0);

    /* Print out A and B. */
    print_array(A);
    print_array(B);
    upc_barrier(1);

    /* Compute C = A*B. */
    mult_kernel();
    upc_barrier(2);

    /* Print out the result. */
    print_array(C);

    return 0;
}
A.2 Jacobi Solver

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <upc.h>
#include <upc_relaxed.h>

#ifdef MUPC_TRACE_RD
MUPC_TRACE_RD
#endif

/* Default number of unknowns per thread. */
#ifndef N
#define N 100
#endif
#define SIZE (N*THREADS)

shared [N] double A[SIZE][SIZE];
shared [N] double X[2][SIZE];
shared [N] double B[SIZE];
shared [N] double D[SIZE];
shared double maxD;

// double epsilon;
long int MAX_ITER;

void initialize(char *argv)
{
    int i, j;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Seed the random number generator. */
    srandom(MYTHREAD);

    /* Read in the desired epsilon. */
    // epsilon = atof(argv);
    MAX_ITER = atoi(argv);
    // if (MYTHREAD == 0) printf("Using epsilon = %g\n", epsilon);
    if (MYTHREAD == 0)
        printf("Using MAX_ITER = %ld\n", MAX_ITER);

    /* Fill in A and B.  Initialize X[i] to B[i]. */
    upc_forall (i = 0; i < SIZE; i++; &B[i]) {
        X[0][i] = X[1][i] = B[i] =
            ((double)random())/((double)RAND_MAX);
        for (j = 0; j < SIZE; j++) {
            A[j][i] = ((double)random())/((double)RAND_MAX);
            if (j == i)
                A[j][i] += (double)SIZE;
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

unsigned int jacobi_kernel()
{
    unsigned int iter;
    int i, j;
    double sum;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (iter = 0; iter < MAX_ITER; iter++) {
        /* Update X[]. */
        upc_forall (i = 0; i < SIZE; i++; &B[i]) {
            sum = 0.0;
            for (j = 0; j < i; j++)
                sum += A[i][j]*X[iter%2][j];
            for (j = i+1; j < SIZE; j++)
                sum += A[i][j]*X[iter%2][j];
            X[(iter+1)%2][i] = (B[i] - sum)/A[i][i];
        }
        upc_barrier;

        /* Compute maximum deltas on each thread. */
        upc_forall (i = 0; i < SIZE; i++; &D[i]) {
            sum = 0.0;
            for (j = 0; j < SIZE; j++)
                sum += (A[i][j]*X[(iter+1)%2][j]);
            D[i] = fabs(sum - B[i]);
        }
        upc_barrier;

        /* Check for termination. */
        if (MYTHREAD == 0) {
            for (maxD = i = 0; i < SIZE; i++)
                maxD = (D[i] > maxD) ? D[i] : maxD;
            // fprintf(stderr, "maxD: %8g\n", maxD);
        }
        // if (maxD < epsilon) return iter;
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
    return iter;
}

void print_A()
{
    int i, j;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD != 0)
        return;
    puts("A[][]:");
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++)
            printf("\t%8g", A[i][j]);
        putchar('\n');
    }
    putchar('\n');
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void print_B()
{
    int i;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD != 0)
        return;
    puts("B[]:");
    for (i = 0; i < SIZE; i++)
        printf("\t%8g\n", B[i]);
    putchar('\n');
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void print_X(int iter)
{
    int i;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD != 0)
        return;
    puts("X[]:");
    for (i = 0; i < SIZE; i++)
        printf("\t%8g\n", X[iter%2][i]);
    putchar('\n');
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

int main(int argc, char **argv)
{
    int iter;

    /* Must have 1 argument, the desired precision. */
    if (argc != 2)
        exit(EXIT_FAILURE);

    /* Initialize the arrays. */
    initialize(argv[1]);
    upc_barrier(1);

    /* Print out the randomly generated A and B. */
    print_A();
    print_B();
    upc_barrier(2);

    /* Find X such that AX = B. */
    iter = jacobi_kernel();
    upc_barrier(3);

    /* Print out the result. */
    print_X(iter);

    return 0;
}
A.3 LU Decomposition

/*
 * Copyright (c) 1994 Stanford University.  All rights reserved.
 *
 * Permission is given to use, copy, and modify this software for any
 * non-commercial purpose as long as this copyright notice is not
 * removed.  All other uses, including redistribution in whole or in
 * part, are forbidden without prior written permission.
 *
 * This software is provided with absolutely no warranty and no support.
 *
 * Parallel dense blocked LU factorization (no pivoting).
 *
 * This version contains one-dimensional arrays in which the matrix
 * to be factored is stored.
 *
 * Command line options:
 *
 *   -nN : Decompose NxN matrix.
 *   -pP : P = number of processors.
 *   -bB : Use a block size of B.  BxB elements should fit in cache for
 *         good performance.  Small block sizes (B=8, B=16) work well.
 *   -s  : Print individual processor timing statistics.
 *   -t  : Test output.
 *   -o  : Print out matrix values.
 *   -h  : Print out command line options.
 *
 * Note: This version works under both the FORK and SPROC models.
 */

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>     /* for getopt() */
#include <sys/time.h>
#include "upc.h"
#include "lu.h"

//MAIN_ENV

#define MAXRAND   32767.0
#define DEFAULT_N 128
#define DEFAULT_P 1
#define DEFAULT_B 16
#define min(a, b) ((a) < (b) ? (a) : (b))

#ifdef MUPC_TRACE_RD
MUPC_TRACE_RD
#endif

// Everything in global memory; types correspond to the shared struct.
shared double t_in_fac[THREADS];
shared double t_in_solve[THREADS];
shared double t_in_mod[THREADS];
shared double t_in_bar[THREADS];
shared double completion[THREADS];
shared struct timeval rf;
shared struct timeval rs;
shared struct timeval done;
shared int id;
upc_lock_t *idlock;

/*
struct GlobalMemory {
    double *t_in_fac;
    double *t_in_solve;
    double *t_in_mod;
    double *t_in_bar;
    double *completion;
    struct timeval starttime;
    struct timeval rf;
    struct timeval rs;
    struct timeval done;
    int id;
    //BARDEC(start)
    //LOCKDEC(idlock)
    upc_lock_t *idlock;
} *Global;
*/

struct LocalCopies {
    double t_in_fac;
    double t_in_solve;
    double t_in_mod;
    double t_in_bar;
};

shared int n;           /* The size of the matrix */
shared int block_size;  /* Block dimension */
int nblocks;            /* Number of blocks in each dimension */
int num_rows;           /* Number of processors per row of processor grid */
int num_cols;           /* Number of processors per col of processor grid */

shared [1] double *a;   /* a = lu; l and u both placed back in a */
shared [1] double *rhs;
// double *a;
// double *rhs;

int proc_bytes;         /* Bytes to malloc per processor to hold blocks of A */
int test_result = 0;    /* Test result of factorization? */
int doprint = 0;        /* Print out matrix values? */
int dostats = 0;        /* Print out individual processor statistics? */

void InitA();
void SlaveStart();
void OneSolve(int, int, shared [1] double *, int, int);
void lu0(shared [1] double *, int, int);
void bdiv(shared [1] double *, shared [1] double *, int, int, int, int);
void bmodd(shared [1] double *, shared [1] double *, int, int, int, int);
void bmod(shared [1] double *, shared [1] double *, shared [1] double *,
          int, int, int, int);
void daxpy(shared [1] double *, shared [1] double *, int, double);
int BlockOwner(int, int);
void lu(int, int, int, struct LocalCopies *, int);
double TouchA(int, int);
void PrintA();
void CheckResult();
void printerr(const char *);

#define CLOCK(x) gettimeofday(&(x), NULL)

float calc_time(struct timeval tp_1st, struct timeval tp_2nd)
{
    return ((tp_2nd.tv_sec - tp_1st.tv_sec)*1000000.0 +
            (tp_2nd.tv_usec - tp_1st.tv_usec))/1000000.0;
}

int main(int argc, char *argv[])
{
    int i, j;
    int ch;
    double mint, maxt, avgt;
    double min_fac, min_solve, min_mod, min_bar;
    double max_fac, max_solve, max_mod, max_bar;
    double avg_fac, avg_solve, avg_mod, avg_bar;
    int proc_num;
    struct timeval start;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD == 0) {
        n = DEFAULT_N;
        block_size = DEFAULT_B;
    }
    CLOCK(start);

    if (!MYTHREAD) {
        while ((ch = getopt(argc, argv, "n:p:b:stoh")) != -1) {
            switch (ch) {
            case 'n': n = atoi(optarg); break;
            case 'b': block_size = atoi(optarg); break;
            case 's': dostats = 1; break;
            case 't': test_result = !test_result; break;
            case 'o': doprint = !doprint; break;
            case 'h':
                printf("Usage: LU <options>\n\n");
                printf("options:\n");
                printf("  -nN : Decompose NxN matrix.\n");
                printf("  -bB : Use a block size of B. BxB elements should fit in cache for\n");
                printf("        good performance. Small block sizes (B=8, B=16) work well.\n");
                printf("  -c  : Copy non-locally allocated blocks to local memory before use.\n");
                printf("  -s  : Print individual processor timing statistics.\n");
                printf("  -t  : Test output.\n");
                printf("  -o  : Print out matrix values.\n");
                printf("  -h  : Print out command line options.\n\n");
                printf("Default: LU -n%1d -p%1d -b%1d\n",
                       DEFAULT_N, DEFAULT_P, DEFAULT_B);
                exit(0);
                break;
            }
        }
        printf("\n");
        printf("Blocked Dense LU Factorization\n");
        printf("     %d by %d Matrix\n", n, n);
        printf("     %d Processors\n", THREADS);
        printf("     %d by %d Element Blocks\n", block_size, block_size);
        printf("\n");
    }
    upc_notify;
    upc_wait;

    num_rows = (int) sqrt((double) THREADS);
    for (;;) {
        num_cols = THREADS/num_rows;
        if (num_rows*num_cols == THREADS)
            break;
        num_rows--;
    }
    nblocks = n/block_size;
    if (block_size*nblocks != n) {
        nblocks++;
    }
    if (!MYTHREAD) {
        printf("num_rows = %d\n", num_rows);
        printf("num_cols = %d\n", num_cols);
        printf("nblocks  = %d\n", nblocks);
    }

    a = (shared [1] double *) upc_all_alloc(n*n, sizeof(double));
    rhs = (shared [1] double *) upc_all_alloc(n, sizeof(double));
    // a = (double *) G_MALLOC(n*n*sizeof(double));
    // rhs = (double *) G_MALLOC(n*sizeof(double));

    /*
    Global = (struct GlobalMemory *) G_MALLOC(sizeof(struct GlobalMemory));
    Global->t_in_fac = (double *) G_MALLOC(P*sizeof(double));
    Global->t_in_solve = (double *) G_MALLOC(P*sizeof(double));
    Global->t_in_mod = (double *) G_MALLOC(P*sizeof(double));
    Global->t_in_bar = (double *) G_MALLOC(P*sizeof(double));
    Global->completion = (double *) G_MALLOC(P*sizeof(double));
    */

    /* POSSIBLE ENHANCEMENT: Here is where one might distribute the
       matrix data across physically distributed memories in a
       round-robin fashion as desired. */

    //BARINIT(Global->start);
    //LOCKINIT(Global->idlock);
    idlock = upc_all_lock_alloc();
    if (MYTHREAD == 0)
        id = 0;
    // Global->id = 0;

    /* Fork off the slave processes; unnecessary under the spmd model. */
    // for (i = 1; i < P; i++) {
    //     CREATE(SlaveStart)
    // }

    InitA();
    if (MYTHREAD == 0 && doprint) {
        printf("Matrix before decomposition:\n");
        PrintA();
    }

    // SlaveStart(MyNum);
    SlaveStart();

    upc_barrier;
    //WAIT_FOR_END(P-1)

    if (MYTHREAD == 0) {
        if (doprint) {
            printf("\nMatrix after decomposition:\n");
            PrintA();
        }

        if (dostats) {
            maxt = avgt = mint = completion[0];
            for (i = 1; i < THREADS; i++) {
                if (completion[i] > maxt)
                    maxt = completion[i];
                if (completion[i] < mint)
                    mint = completion[i];
                avgt += completion[i];
            }
            avgt = avgt/THREADS;

            min_fac = max_fac = avg_fac = t_in_fac[0];
            min_solve = max_solve = avg_solve = t_in_solve[0];
            min_mod = max_mod = avg_mod = t_in_mod[0];
            min_bar = max_bar = avg_bar = t_in_bar[0];
            for (i = 1; i < THREADS; i++) {
                if (t_in_fac[i] > max_fac) max_fac = t_in_fac[i];
                if (t_in_fac[i] < min_fac) min_fac = t_in_fac[i];
                if (t_in_solve[i] > max_solve) max_solve = t_in_solve[i];
                if (t_in_solve[i] < min_solve) min_solve = t_in_solve[i];
                if (t_in_mod[i] > max_mod) max_mod = t_in_mod[i];
                if (t_in_mod[i] < min_mod) min_mod = t_in_mod[i];
                if (t_in_bar[i] > max_bar) max_bar = t_in_bar[i];
                if (t_in_bar[i] < min_bar) min_bar = t_in_bar[i];
                avg_fac += t_in_fac[i];
                avg_solve += t_in_solve[i];
                avg_mod += t_in_mod[i];
                avg_bar += t_in_bar[i];
            }
            avg_fac = avg_fac/THREADS;
            avg_solve = avg_solve/THREADS;
            avg_mod = avg_mod/THREADS;
            avg_bar = avg_bar/THREADS;
        }

        printf("\n                          PROCESS STATISTICS\n");
        printf("  Proc     Total      Diagonal    Perimeter    Interior     Barrier\n");
        printf("           Time         Time        Time         Time        Time\n");
        printf("    0  %10.6f  %10.6f  %10.6f  %10.6f  %10.6f\n",
               completion[0], t_in_fac[0], t_in_solve[0],
               t_in_mod[0], t_in_bar[0]);
        if (dostats) {
            for (i = 1; i < THREADS; i++)
                printf("  %3d  %10.6f  %10.6f  %10.6f  %10.6f  %10.6f\n",
                       i, completion[i], t_in_fac[i], t_in_solve[i],
                       t_in_mod[i], t_in_bar[i]);
            printf("  Avg  %10.6f  %10.6f  %10.6f  %10.6f  %10.6f\n",
                   avgt, avg_fac, avg_solve, avg_mod, avg_bar);
            printf("  Min  %10.6f  %10.6f  %10.6f  %10.6f  %10.6f\n",
                   mint, min_fac, min_solve, min_mod, min_bar);
            printf("  Max  %10.6f  %10.6f  %10.6f  %10.6f  %10.6f\n",
                   maxt, max_fac, max_solve, max_mod, max_bar);
        }
        printf("\n");

        printf("                          TIMING INFORMATION\n");
        // printf("Start time                        : %16d\n", starttime);
        // printf("Initialization finish time        : %16d\n", rs);
        // printf("Overall finish time               : %16d\n", rf);
        printf("Total time with initialization    : %4.6f\n",
               calc_time(start, rf));
        printf("Total time without initialization : %4.6f\n",
               calc_time(rs, rf));
        printf("\n");

        if (test_result) {
            printf("                          TESTING RESULTS\n");
            CheckResult();
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
    return 0;
}

void SlaveStart()
{
    /* POSSIBLE ENHANCEMENT: Here is where one might pin processes to
       processors to avoid migration. */
    OneSolve(n, block_size, a, MYTHREAD, dostats);
}

void OneSolve(n, block_size, a, MyNum, dostats)
int n;
int block_size;
shared [1] double *a;
int MyNum;
int dostats;
{
    unsigned int i;
    struct timeval myrs, myrf, mydone;
    struct LocalCopies *l;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    l = (struct LocalCopies *) malloc(sizeof(struct LocalCopies));
    if (l == NULL) {
        fprintf(stderr, "Proc %d could not malloc memory for l\n", MyNum);
        exit(-1);
    }
    l->t_in_fac = 0.0;
    l->t_in_solve = 0.0;
    l->t_in_mod = 0.0;
    l->t_in_bar = 0.0;

    /* barrier to ensure all initialization is done */
    //BARRIER(Global->start, P);
    upc_barrier;

    /* to remove cold-start misses, all processors begin by touching a[] */
    TouchA(block_size, MyNum);

    //BARRIER(Global->start, P);
    upc_barrier;

    /* POSSIBLE ENHANCEMENT: Here is where one might reset the
       statistics that one is measuring about the parallel execution. */

    if ((MyNum == 0) || (dostats))
        CLOCK(myrs);

    lu(n, block_size, MyNum, l, dostats);

    if ((MyNum == 0) || (dostats))
        CLOCK(mydone);

    //BARRIER(Global->start, P);
    upc_barrier;

    if ((MyNum == 0) || (dostats))
        CLOCK(myrf);

    t_in_fac[MyNum] = l->t_in_fac;
    t_in_solve[MyNum] = l->t_in_solve;
    t_in_mod[MyNum] = l->t_in_mod;
    t_in_bar[MyNum] = l->t_in_bar;
    completion[MyNum] = calc_time(myrs, mydone);

    if (MyNum == 0) {
        rs = myrs;
        done = mydone;
        rf = myrf;
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void lu0(a, n, stride)
shared [1] double *a;
int n;
int stride;
{
    int j;
    int k;
    int length;
    double alpha;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < n; k++) {
        /* modify subsequent columns */
        for (j = k+1; j < n; j++) {
            a[k+j*stride] /= a[k+k*stride];
            alpha = -a[k+j*stride];
            length = n-k-1;
            daxpy(&a[k+1+j*stride], &a[k+1+k*stride], n-k-1, alpha);
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void bdiv(a, diag, stride_a, stride_diag, dimi, dimk)
shared [1] double *a;
shared [1] double *diag;
int stride_a;
int stride_diag;
int dimi;
int dimk;
{
    int j;
    int k;
    double alpha;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < dimk; k++) {
        for (j = k+1; j < dimk; j++) {
            alpha = -diag[k+j*stride_diag];
            daxpy(&a[j*stride_a], &a[k*stride_a], dimi, alpha);
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void bmodd(a, c, dimi, dimj, stride_a, stride_c)
shared [1] double *a;
shared [1] double *c;
int dimi;
int dimj;
int stride_a;
int stride_c;
{
    int i;
    int j;
    int k;
    int length;
    double alpha;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < dimi; k++) {
        for (j = 0; j < dimj; j++) {
            c[k+j*stride_c] /= a[k+k*stride_a];
            alpha = -c[k+j*stride_c];
            length = dimi-k-1;
            daxpy(&c[k+1+j*stride_c], &a[k+1+k*stride_a], dimi-k-1, alpha);
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void bmod(a, b, c, dimi, dimj, dimk, stride)
shared [1] double *a;
shared [1] double *b;
shared [1] double *c;
int dimi;
int dimj;
int dimk;
int stride;
{
    int i;
    int j;
    int k;
    double alpha;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < dimk; k++) {
        for (j = 0; j < dimj; j++) {
            alpha = -b[k+j*stride];
            daxpy(&c[j*stride], &a[k*stride], dimi, alpha);
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void daxpy(a, b, n, alpha)
shared [1] double *a;
shared [1] double *b;
double alpha;
int n;
{
    int i;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (i = 0; i < n; i++) {
        a[i] += alpha*b[i];
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

int BlockOwner(I, J)
int I;
int J;
{
    return ((I%num_cols) + (J%num_rows)*num_cols);
}

void lu(n, bs, MyNum, l, dostats)
int n;
int bs;
int MyNum;
struct LocalCopies *l;
int dostats;
{
    int i, il, j, jl, k, kl;
    int I, J, K;
    shared [1] double *A, *B, *C, *D;
    // double *A, *B, *C, *D;
    int dimI, dimJ, dimK;
    int strI;
    // unsigned int t1, t2, t3, t4, t11, t22;
    struct timeval t1, t2, t3, t4, t11, t22;
    int diagowner;
    int colowner;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    strI = n;
    for (k = 0, K = 0; k < n; k += bs, K++) {
        kl = k+bs;
        if (kl > n) {
            kl = n;
        }

        if ((MyNum == 0) || (dostats)) {
            CLOCK(t1);
        }

        /* factor diagonal block */
        diagowner = BlockOwner(K, K);
        if (diagowner == MyNum) {
            A = &(a[k+k*n]);
            lu0(A, kl-k, strI);
        }

        if ((MyNum == 0) || (dostats)) {
            CLOCK(t11);
        }

        //BARRIER(Global->start, P);
        upc_barrier;

        if ((MyNum == 0) || (dostats)) {
            CLOCK(t2);
        }

        /* divide column k by diagonal block */
        D = &(a[k+k*n]);
        for (i = kl, I = K+1; i < n; i += bs, I++) {
            if (BlockOwner(I, K) == MyNum) {  /* parcel out blocks */
                il = i+bs;
                if (il > n) {
                    il = n;
                }
                A = &(a[i+k*n]);
                bdiv(A, D, strI, n, il-i, kl-k);
            }
        }

        /* modify row k by diagonal block */
        for (j = kl, J = K+1; j < n; j += bs, J++) {
            if (BlockOwner(K, J) == MyNum) {  /* parcel out blocks */
                jl = j+bs;
                if (jl > n) {
                    jl = n;
                }
                A = &(a[k+j*n]);
                bmodd(D, A, kl-k, jl-j, n, strI);
            }
        }

        if ((MyNum == 0) || (dostats)) {
            CLOCK(t22);
        }

        //BARRIER(Global->start, P);
        upc_barrier;

        if ((MyNum == 0) || (dostats)) {
            CLOCK(t3);
        }

        /* modify subsequent block columns */
        for (i = kl, I = K+1; i < n; i += bs, I++) {
            il = i+bs;
            if (il > n) {
                il = n;
            }
            colowner = BlockOwner(I, K);
            A = &(a[i+k*n]);
            for (j = kl, J = K+1; j < n; j += bs, J++) {
                jl = j+bs;
                if (jl > n) {
                    jl = n;
                }
                if (BlockOwner(I, J) == MyNum) {  /* parcel out blocks */
                    B = &(a[k+j*n]);
                    C = &(a[i+j*n]);
                    bmod(A, B, C, il-i, jl-j, kl-k, n);
                }
            }
        }
        if ((MyNum == 0) || (dostats)) {
            CLOCK(t4);
            l->t_in_fac += calc_time(t1, t11);
            l->t_in_solve += calc_time(t2, t22);
            l->t_in_mod += calc_time(t3, t4);
            l->t_in_bar += calc_time(t11, t2) + calc_time(t22, t3);
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void InitA()
// void InitA(double *rhs)
{
    int i, j;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    srand48((long) 1);
    upc_forall (j = 0; j < n; j++; j) {
        for (i = 0; i < n; i++) {
            a[i+j*n] = (double) lrand48()/MAXRAND;
            if (i == j) {
                a[i+j*n] *= 10;
            }
        }
    }

    upc_forall (j = 0; j < n; j++; j) {
        rhs[j] = 0.0;
    }
    for (j = 0; j < n; j++) {
        upc_forall (i = 0; i < n; i++; i) {
            rhs[i] += a[i+j*n];
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

double TouchA(bs, MyNum)
int bs;
int MyNum;
{
    int i, j, I, J;
    double tot = 0.0;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (J = 0; J*bs < n; J++) {
        for (I = 0; I*bs < n; I++) {
            if (BlockOwner(I, J) == MyNum) {
                for (j = J*bs; j < (J+1)*bs && j < n; j++) {
                    for (i = I*bs; i < (I+1)*bs && i < n; i++) {
                        tot += a[i+j*n];
                    }
                }
            }
        }
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
    return (tot);
}

void PrintA()
{
    int i, j;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            printf(" %8.1f ", a[i+j*n]);
        }
        printf("\n");
    }
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void CheckResult()
{
    int i, j, bogus = 0;
    double *y, diff, max_diff;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    y = (double *) malloc(n*sizeof(double));
    if (y == NULL) {
        printerr("Could not malloc memory for y\n");
        exit(-1);
    }
    for (j = 0; j < n; j++) {
        y[j] = rhs[j];
    }
    for (j = 0; j < n; j++) {
        y[j] = y[j]/a[j+j*n];
        for (i = j+1; i < n; i++) {
            y[i] -= a[i+j*n]*y[j];
        }
    }
    for (j = n-1; j >= 0; j--) {
        for (i = 0; i < j; i++) {
            y[i] -= a[i+j*n]*y[j];
        }
    }

    max_diff = 0.0;
    for (j = 0; j < n; j++) {
        diff = y[j] - 1.0;
        if (fabs(diff) > 0.00001) {
            bogus = 1;
            max_diff = diff;
        }
    }
    if (bogus) {
        printf("TEST FAILED: (%.5f diff)\n", max_diff);
    } else {
        printf("TEST PASSED\n");
    }
    free(y);
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void printerr(const char *s)
{
    fprintf(stderr, "ERROR: %s\n", s);
}
A.4 Stencil

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <upc.h>

#ifdef MUPC_TRACE_RD
MUPC_TRACE_RD
#endif

#define N 6
#define ITERS 100

shared [N] double a[N][THREADS*N];
shared [1] double dmax[THREADS];
int n;

#define A(i, j) a[(i)%N][((i)/N)*N*n + (j)]

void stencil(int i, int j)
{
    double t;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    t = A(i, j);
#ifdef EIGHT_PT_STENCIL
    t += A(i-1, j-1);
    t += A(i-1, j);
    t += A(i-1, j+1);
    t += A(i, j-1);
    t += A(i, j+1);
    t += A(i+1, j-1);
    t += A(i+1, j);
    t += A(i+1, j+1);
    t /= 9.0;
#else
    t += A(i-1, j);
    t += A(i+1, j);
    t += A(i, j-1);
    t += A(i, j+1);
    t /= 5.0;
#endif
    A(i, j) = t;
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

int main()
{
    int i, j, iter;
#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Ensure the number of threads is a square. */
    n = (int) sqrt((double) THREADS);
    assert((n*n) == THREADS);

    /* Initialize A. */
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            a[i][N*MYTHREAD + j] = (double) MYTHREAD;

    for (iter = 0; iter < ITERS; ++iter) {
        /* Update dmax. */
        dmax[MYTHREAD] = (double) iter + 1.0;
        upc_barrier;

        /* Update all points with the 4/8 pt stencil. */
        for (i = 1; i < (N*n) - 1; ++i) {
            for (j = 1; j < (N*n) - 1; ++j) {
                if (MYTHREAD == upc_threadof(&A(i, j)))
                    stencil(i, j);
            }
        }
        upc_barrier;

        /* Check dmax for completion. */
        for (i = 0; i < THREADS; ++i) {
            if (dmax[i] < 0.0)
                goto end;
        }
    }
end:
#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
    return 0;
}