Computer Science Technical Report

Predicting Remote Reuse Distance Patterns in Unified Parallel C Applications

by Steven Vormwald, Steven Carr, Steven Seidel and Zhenlin Wang

Computer Science Technical Report CS-TR-09-02
December 18, 2009

Department of Computer Science
Houghton, MI 49931-1295
www.cs.mtu.edu

ABSTRACT*

Productivity is becoming increasingly important in high performance computing. Parallel systems, as well as the problems they are being used to solve, are becoming dramatically larger and more complicated. Traditional approaches to programming these systems, such as MPI, are regarded as too tedious and too tied to particular machines. Languages such as Unified Parallel C attempt to simplify programming on these systems by abstracting communication behind a global shared memory that is partitioned across all the threads in an application. These Partitioned Global Address Space, or PGAS, languages offer the programmer a way to specify programs in a much simpler and more portable fashion. However, the performance of PGAS applications has tended to lag behind that of applications implemented in more traditional ways. It is hoped that cache optimizations can close this performance gap by providing the same benefits to UPC applications that they have given single-threaded applications.

Memory reuse distance is a critical measure of how much an application will benefit from a cache, as well as an important piece of tuning information for enabling effective cache optimization. This research explores extending existing reuse distance analysis to remote memory accesses in UPC applications. Existing analyses efficiently store a very good approximation of the reuse distance histogram for each memory access in a program. Reuse data are collected for small test runs and then used to predict program behavior during full runs by fitting the patterns seen in the training runs to a function of the problem size. Reuse data are kept for each UPC thread in a UPC application, and these data are used to predict the data for each UPC thread in a larger run. Both scaling up the problem size and increasing the total number of UPC threads are explored for prediction.

Results indicate that good predictions can be made using existing prediction algorithms. However, the choice of training threads can have a dramatic effect on the accuracy of the prediction. Therefore, a simple algorithm is also presented that partitions threads into groups with similar behavior, so that the threads chosen from the training runs lead to good predictions in the full run.

*This work is partially supported by NSF grant CCF-0833082.

CHAPTER 1

Introduction

1.1 Motivation

High performance computing is becoming an increasingly important part of our daily lives. It is used to determine where oil companies drill for oil, to figure out what the weather will be like for the next week, and to design safer buildings and vehicles. Companies save millions of dollars every year by simulating product designs instead of creating physical prototypes. The movie industry relies heavily on special effects rendered on large clusters. Scientists rely on simulations to understand nuclear reactions without having to perform dangerous experiments with nuclear materials.

In addition, the machines used to carry out these computations are becoming drastically larger and more complicated. The Top500 list claims that the fastest supercomputer in the world has over 200000 cores [1]. It is made up of thousands of six-core Opteron processors in compute blades networked together.
The compute blades can each be considered a computer in its own right, working together with the others to act as one large supercomputer. This clustering model of supercomputer is now the dominant force in high performance computing.

Traditionally, applications written for these clusters required the programmer to explicitly manage the communication needs of the program across the various nodes of the cluster. It was thought that the performance needs of such applications could only be met by a human programmer carefully designing the program to minimize the necessary communication costs. This approach to programming for supercomputers is quickly becoming unwieldy. The productivity cost of requiring the application programmer to manage and tune an application's communication for these complex systems is simply too high.

Partitioned global address space languages, such as Co-Array Fortran [2] and Unified Parallel C [3], attempt to address these productivity concerns by building a shared memory programming model for programmers to work with, delegating the task of optimizing the necessary communication to the language implementor. While these languages do offer productivity improvements, implementations haven't been able to match the performance of more traditional message passing setups.

Various approaches to catching up have been tried. The UPC implementation from the University of California Berkeley [4] uses the GASNet network library [5], which attempts to optimize communication using various methods such as message coalescing [6]. Many implementations try to split a synchronous operation into an asynchronous operation and a corresponding wait, then spread these as far apart as possible to hide communication latency. These optimizations can lead to impressive performance gains, but there is still a performance gap for some applications.

One approach that hasn't been commonly used in implementations is software caching of remote memory operations. Caching has been used to great effect in many situations to hide the cost of expensive operations. Programmers are also accustomed to working with caches, since they are so prevalent in today's CPUs. As Marc Snir pointed out in his keynote address to the PGAS2009 conference [7], programmers would like to see some kind of caching in these languages' implementations. He demoed a caching scheme implemented entirely in the application. However, it is desirable that the caching be done at the level of the language implementation, to avoid forcing the programmer to deal with the complexities of communication that these languages were designed to hide.

This research takes an initial look at the possibility of using existing algorithms, designed to predict patterns in the reuse distances for memory operations in single-threaded applications, to predict patterns in the reuse distances for remote memory operations in Unified Parallel C applications. It is hoped that this information could be used to tune cache behavior when caching remote references, and to enable other optimizations that rely on this information and have been used successfully with single-threaded applications to work with multi-threaded UPC applications.

1.2 Thesis Outline

The rest of this document is organized as follows. Chapter 2 gives a broad background in instruction-based reuse distance analysis and Unified Parallel C. Chapter 3 introduces the instrumentation, test kernels and models used to predict remote memory behavior. Chapter 4 shows the prediction results obtained.
Finally, Chapter 5 summarizes the results, looks at ways this prediction model can be used, and discusses possible future work to overcome some of this model's weaknesses.

CHAPTER 2

Background

2.1 Parallel Computation

When one wishes to solve problems faster, there are generally only three things to do. First, one can try to find a better algorithm to solve the problem at hand. While this can lead to massive performance benefits, most common problems already have known optimal solutions. Second, one can increase the rate at which a program executes. If a program requires the execution of 10000 instructions, a machine that can execute an instruction every millisecond will finish the program much sooner than one that can only execute an instruction every second. Finally, one can execute the program in parallel, solving multiple pieces of the problem at the same time. Just as having multiple checkout lanes speeds consumers through their purchases, having the ability to work on multiple pieces of a problem can speed its solution.

There are many different ways to think about solving a problem in parallel. Flynn's taxonomy classifies parallel programming models along two axes, one based on the program's control, the other based on its data [8, 9]. This creates four classes of programs: single instruction single data, multiple instruction single data, single instruction multiple data, and multiple instruction multiple data, henceforth referred to by the acronyms SISD, MISD, SIMD, and MIMD respectively.

The SISD model is the classic, non-parallel programming model where each instruction is executed serially and works on a single piece of data. While modern architectures actually do work in parallel, this remains the most commonly used abstraction for software developers.

The MISD model is widely considered nonsensical, as it refers to multiple instructions being executed in parallel operating on the same data. While this usually makes little sense, the term has been used to refer to redundant systems, which use parallelism not to speed up a computation, but rather to prevent program failures.

The SIMD model sees widespread use in many special purpose accelerators. In this model, a single instruction is executed in parallel across large chunks of data. The most well-known use of this is probably in the graphics processing units, or GPUs, that have become standard on modern personal computers. Many architectures also include SIMD extensions to their SISD instruction set that enable certain types of applications to perform much better.

The MIMD model is the most general, where multiple instructions are run in parallel, each on different data. This model is perhaps the most commonly used abstraction for software developers writing parallel applications, as it is the model used in the threading libraries included with many operating systems.

2.1.1 SPMD Model

The Single Program, Multiple Data, or SPMD, model of parallel programming is a subset of the MIMD model from Flynn's taxonomy. The MIMD model can be thought of as multiple SISD threads executing in parallel that have some means of communicating amongst themselves. The SPMD model is a special case where each thread executes the same program, only working with different data. The threads need not operate in lock-step, nor all follow the same control flow through the program.

2.1.2 Communication and the Shared Memory Model

Flynn's taxonomy describes how a problem can be broken up and solved in parallel. It does not specify how the various threads of execution that are being run in parallel communicate.
From a programmer's perspective, there are two major models of parallel communication: message passing and shared memory. The message passing model, as its name suggests, requires that threads send messages to one another to communicate. In general, these messages must be paired, in that the sender must explicitly send a message and the receiver must explicitly receive that message.

In the shared memory model, all threads share some amount of memory, and communication occurs through this shared memory. This naturally simplifies communication, as the programmer no longer needs to explicitly send data back and forth between threads. The programmer is also no longer responsible for ensuring threads send messages in the correct order to avoid deadlocks, nor for checking for errors in communication. However, as memory is a shared resource, the programmer must be aware of the consistency that the model allows. In the simplest case, there is a strict ordering on all memory accesses across all threads, which is easy for the programmer to understand, but is usually quite expensive for the implementation to enforce. There are various ways of relaxing the semantics to improve the performance of the program by permitting threads to see operations occur in different orders, eliminating unnecessary synchronization.

2.1.3 Partitioned Global Address Space

While the shared memory programming model offers the programmer many advantages, it is often difficult to implement efficiently on today's large distributed systems. One major difficulty comes from the fact that it is usually orders of magnitude more expensive to access shared memory that is off-node than it is to access on-node memory. If the programmer has no way of differentiating between on-node and off-node memory, it becomes difficult to write programs that run efficiently on these modern machines. Partitioned Global Address Space, PGAS, languages try to address this problem by introducing the concept of affinity [10]. The shared memory space is partitioned among the threads such that every object in shared memory has affinity to one and only one thread. This allows programmers to write code that takes advantage of the location of the object.

2.1.4 Unified Parallel C

Unified Parallel C, henceforth UPC, is a parallel extension to the C programming language [3]. It uses the SPMD programming model, where a fixed number of UPC threads execute the same UPC program. Each UPC thread has its own local stack and local heap, but there is also a global memory space that all threads have access to. As UPC is a PGAS language, this global memory is partitioned amongst all the UPC threads.

It is important to note that accesses to shared memory in UPC applications do not require any special libraries or syntax. Once a variable is declared as shared, it can be referenced just as any local variable, at least from the programmer's point of view. In many implementations, including MuPC, these accesses are directly translated into runtime library calls that perform the read or write as necessary. For example, the program in Figure 2.1 prints hello from each UPC thread, records how many characters each thread printed, and exits with EXIT_FAILURE if any thread had an error (printf() returns a negative value on errors).

One important performance feature of UPC is the ability to specify that memory accesses use relaxed memory consistency. The default, strict, memory consistency requires that all shared memory accesses be ordered, and that all threads see the same global order of memory accesses.
In particular, if two threads write to the same variable at the same time, all threads will see the same written value after both writes have occurred. Consider the code segment in Figure 2.2. For threads other than thread 0, there are only three possible outputs at the end of the code: a = 0, b = 0 or a = 1, b = 0 or a = 1, b = 2.

By contrast, relaxed memory consistency provides no such guarantee. Different threads may see operations occur in different orders. Consider the same code segment using relaxed variables instead of strict ones, shown in Figure 2.3. For threads other than thread 0, there are now four possible outputs at the end of the code: a = 0, b = 0 or a = 1, b = 0 or a = 1, b = 2 or a = 0, b = 2. The additional outcome, a = 0, b = 2, is permitted because the relaxed semantics allow threads other than thread 0 to see the assignment to sb occur before the assignment to sa, while the strict semantics require that they occur in program order.

    #include <stdio.h>
    #include <stdlib.h>
    #include <upc.h>

    /* Each UPC thread has 1 element of this array. */
    shared [1] int printed[THREADS];

    int main()
    {
        int i, exit_value = EXIT_SUCCESS;

        /* Initialize the local part of the printed array to 0. */
        printed[MYTHREAD] = 0;

        /* Wait for everyone to finish initialization. */
        upc_barrier(0);

        /* Record the number of characters printed. */
        printed[MYTHREAD] =
            printf("Hello from UPC thread %d of %d.\n", MYTHREAD, THREADS);

        /* Wait for everyone to finish printing. */
        upc_barrier(1);

        /* Verify all the threads printed something. */
        for (i = 0; i < THREADS; ++i)
            if (printed[i] < 0)
                exit_value = EXIT_FAILURE;

        exit(exit_value);
    }

    Figure 2.1. Hello World in UPC

Since the implementation is allowed to reorder relaxed operations, it is also permitted to cache the values, without having to worry about keeping the caches coherent until a strict access or collective occurs. Despite this capability, relatively few UPC implementations cache remote accesses, and those that do use relatively simple caches. At the time of this writing, only the MuPC reference implementation from Michigan Technological University [11] and the commercial implementation from Hewlett Packard [12] are known to implement caching of remote references.

2.1.5 MuPC

MuPC is a reference UPC implementation built on top of MPI and POSIX threads. It currently supports Intel x86 and x86-64 based clusters running Linux, as well as Alpha clusters running Tru64. Each UPC thread is implemented as a single OS process using two pthreads, one for managing communication, the other to run the UPC program's computation. The compile script first translates the UPC code into C code with calls into the MuPC runtime library to handle communication and synchronization. This is then compiled with the system MPI compiler, and the resulting binary can be run as if it were an MPI program.

    strict shared int sa = 0, sb = 0;
    int la = 0, lb = 0;

    if (MYTHREAD == 0) {
        sa = 1;
        sb = 2;
    } else {
        lb = sb;
        la = sa;
    }
    printf("Thread %d: a = %d, b = %d\n", MYTHREAD, la, lb);

    Figure 2.2. Example of Strict Semantics in UPC

    relaxed shared int sa = 0, sb = 0;
    int a = 0, b = 0;

    if (MYTHREAD == 0) {
        sa = 1;
        sb = 2;
    } else {
        b = sb;
        a = sa;
    }
    printf("Thread %d: a = %d, b = %d\n", MYTHREAD, a, b);

    Figure 2.3. Example of Relaxed Semantics in UPC

MuPC implements a cache for remote references in the communication thread. The cache is divided into THREADS - 1 sections, one for each non-local UPC thread. The size of the cache is determined by the user via a configuration file. The user can also change the size of a cache line. The defaults set up a 2 MB cache with 64-byte cache lines.
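To make this organization concrete, the sketch below shows one plausible way of indexing such a per-peer software cache. The names (section_of, slot_of) and the direct-mapped geometry are illustrative assumptions, not MuPC's actual implementation.

    #include <stdint.h>

    #define LINE_SIZE   64                  /* bytes per cache line  */
    #define CACHE_SIZE  (2 * 1024 * 1024)   /* total cache size      */

    typedef struct {
        uint64_t tag;                       /* which remote line this slot holds */
        int      valid;
        char     data[LINE_SIZE];
    } cache_line_t;

    /* Section for a remote thread: threads below MYTHREAD shift down
     * by one, because no section is kept for the local thread. */
    static int section_of(int owner, int mythread)
    {
        return (owner < mythread) ? owner : owner - 1;
    }

    /* Direct-mapped slot within a section for a remote byte address. */
    static int slot_of(uint64_t addr, int nthreads)
    {
        int lines_per_section = CACHE_SIZE / LINE_SIZE / (nthreads - 1);
        return (int)((addr / LINE_SIZE) % lines_per_section);
    }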
2.2 Reuse Distance Analysis

Reuse distance is defined as the number of distinct memory locations that are accessed between two accesses to a given memory address. This information is generally used to determine, predict, or optimize cache behavior.

Forward reuse distance answers the question "How many distinct memory locations will be accessed before the next time this address is accessed?" It scans forward in an execution, counting the number of memory locations accessed until the given address is found. This information can be useful for determining whether or not caching should be performed for a memory reference, among other things.

Backward reuse distance answers the question "How many distinct memory locations were accessed since the last time this address was accessed?" It scans backward in an execution, counting the number of memory locations accessed until the given address is found. This information can be useful for determining whether or not a memory reference should be prefetched, among other things.

    A[1] = 1;
    A[2] = 2;
    A[1] = 3;
    A[2] = 4;
    A[3] = 5;
    A[1] = 6;
    A[2] = 7;

    Figure 2.4. Reuse Distance Example

For example, consider the second reference to A[2] in the short code segment in Figure 2.4. Because only A[1] was accessed since the last access to A[2], the backward reuse distance is 1. The forward reuse distance is 2, because both A[1] and A[3] are accessed before A[2] is accessed again.

There is also a distinction between temporal reuse and spatial reuse. Temporal reuse refers to reuse of a single location in memory, as in the example above. Spatial reuse considers larger sections of memory, as the cache in many systems pulls in more than one element at a time. Assuming the cache lines in the system can hold two array elements and that A[1] is aligned to the start of a cache line, the backward spatial reuse distance of the second access to A[2] is 0, because there were no intervening accesses to different cache lines since the last access. The forward spatial reuse distance is 1, however, because A[3] does not share a cache line with A[1] and A[2].

2.2.1 Instruction Based Reuse Distance

Analyses generally create histograms of either the forward or backward reuse distances encountered in a program trace. The histograms are generally associated either with a particular memory address or with a particular memory operation. When associated with a memory operation, the data are referred to as instruction based reuse distances [13, 14]. Studying the reuse distances associated with operations can provide many useful insights into an application's behavior. For example, the maximum reuse distance seen by any operation indicates how large a cache must be to benefit the application. Critical instructions, operations that suffer from a disproportionately large number of the cache misses in a program, can be identified as well [14].

It is generally expensive to record the reuse distances over an entire program execution exactly, as there can be trillions of memory operations encountered. However, it has been shown that highly accurate approximations of the reuse distance can be stored efficiently using a splay tree to record when memory locations are encountered [15]. This information can then be used to create histograms describing the reuse distances for a given application.
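The definition above translates into code directly. The following brute-force sketch computes the backward reuse distance of one access in a recorded address trace; it is quadratic in the trace length and is meant only to make the definition concrete, whereas the splay-tree technique of [15] maintains the same information efficiently.

    #include <stdio.h>

    /* Backward reuse distance of the access at position pos in trace:
     * the number of distinct addresses touched since the previous
     * access to trace[pos], or -1 if this is the first access. */
    static int backward_reuse(const unsigned long *trace, int pos)
    {
        int i, j, distinct = 0;

        /* Find the previous access to the same address. */
        for (i = pos - 1; i >= 0; i--)
            if (trace[i] == trace[pos])
                break;
        if (i < 0)
            return -1;

        /* Count distinct addresses strictly between the two accesses. */
        for (j = i + 1; j < pos; j++) {
            int k, seen = 0;
            for (k = i + 1; k < j; k++)
                if (trace[k] == trace[j]) { seen = 1; break; }
            if (!seen)
                distinct++;
        }
        return distinct;
    }

    int main(void)
    {
        /* The trace from Figure 2.4: A[1] A[2] A[1] A[2] A[3] A[1] A[2] */
        unsigned long trace[] = { 1, 2, 1, 2, 3, 1, 2 };
        printf("%d\n", backward_reuse(trace, 3));  /* prints 1: only A[1]  */
        printf("%d\n", backward_reuse(trace, 6));  /* prints 2: A[3], A[1] */
        return 0;
    }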
2.2.2 Predicting Reuse Distance Patterns

    input:  the set of memory-distance bins B
    output: the set of locality patterns P_r for each memory reference r

    for each memory reference r {
        P_r = empty; p = null; down = false;
        for (i = 0; i < numBins; i++) {
            if (B_r[i].size > 0) {
                if (p == null || B_r[i].min > k * p.max
                        || (down && B_r[i-1].freq < B_r[i].freq)) {
                    p = new pattern;
                    p.mean = B_r[i].mean;
                    p.min  = B_r[i].min;   p.max  = B_r[i].max;
                    p.freq = B_r[i].freq;  p.maxf = B_r[i].freq;
                    P_r = P_r + {p};
                    down = false;
                } else {
                    p.max  = B_r[i].max;
                    p.freq += B_r[i].freq;
                    if (B_r[i].freq > p.maxf) {
                        p.mean = B_r[i].mean; p.maxf = B_r[i].freq;
                    }
                    if (!down && B_r[i-1].freq > B_r[i].freq)
                        down = true;
                }
            } else
                p = null;
        }
    }

    Figure 2.5. Pattern-formation Algorithm

Using profiling data, it has been shown that the memory behavior of a program can be predicted by using curve fitting to model the reuse distance as a function of the data size of an application [14]. First, patterns are identified for each memory operation in the instrumented training runs. Histograms storing the reuse data for memory operations are used to identify these patterns. For each bin in the histogram, a minimum distance, maximum distance, mean distance, and frequency are recorded. Then, adjacent bins are merged using the algorithm in Figure 2.5. The locality patterns for an operation are defined as the sets of merged bins. Finally, the prediction algorithm uses curve fitting with each of a memory operation's patterns in two training runs to predict the corresponding pattern in the predicted run [14].

2.2.3 Prediction Accuracy Model

There are two important measures of the reuse distance predictions. First is the coverage, which is defined as the percentage of operations in the reference run that can be predicted. An operation can be predicted if it occurs in both training runs and all of its patterns are regular. A pattern is regular if it occurs in both training runs and the reuse distance does not decrease as the problem size grows. The accuracy then is the percentage of covered operations that are predicted correctly. An operation is predicted correctly if the predicted patterns exactly match the observed patterns, or they overlap by at least 90%. The overlap for two patterns A and B is defined as

    overlap(A, B) = (A.max - max(A.min, B.min)) / max(B.max - B.min, A.max - A.min)

The overlap factor of 90% was used in this work because this factor worked well in prior work with sequential applications [14].

CHAPTER 3

Predicting Remote Reuse Distance

3.1 Instrumentation

In order to get the raw cache reuse data for the predictions, it was important to have a working base compiler to which instrumentation could be added. The MuPC compiler and runtime were used for the data collection, with a number of modifications to support runtime remote reuse data collection.

Initially, it was necessary to update MuPC to add support for x86-64 systems, both to ensure MuPC continues to function in the future and to support our new cluster. Because the vendor-provided MPI libraries were 64-bit only, it was not possible to simply use the 32-bit support in the OS. Therefore, the MuPC compiler was modified to generate correct code for the new platform. The bulk of this work was merely increasing the size of primitive types and enabling 64-bit macros and typedefs in the EDG front-end.

The other changes were all to support recording cache reuse in UPC programs. First, generic instrumentation was added to many of the runtime functions. This allows a programmer to register functions that get called whenever a particular runtime function is called, enabling the programmer to inspect the program's runtime behavior.
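As an illustration of this kind of hook mechanism, the sketch below shows a minimal registration scheme. The names (trace_register, trace_fire) and the fixed-size hook table are assumptions made for illustration; they are not the actual MuPC interface.

    #include <stddef.h>

    #define MAX_HOOKS 8
    enum { OP_GET, OP_PUT };

    typedef void (*trace_hook_t)(int op, void *addr, size_t size);

    static trace_hook_t hooks[MAX_HOOKS];
    static int n_hooks;

    /* A program registers a function to be called on every traced
     * runtime operation. */
    void trace_register(trace_hook_t fn)
    {
        if (n_hooks < MAX_HOOKS)
            hooks[n_hooks++] = fn;
    }

    /* The runtime invokes this from each instrumented entry point. */
    void trace_fire(int op, void *addr, size_t size)
    {
        int i;
        for (i = 0; i < n_hooks; i++)
            hooks[i](op, addr, size);
    }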
To test this instrumentation, a simple function was written to create a log of all remote memory accesses, recording the operation (put or get), the remote address, the size of the access, and the location in the program source that initiated the access.

Once this functionality was working, existing instrumentation for the Atom simulator [16] was modified for use with MuPC. This code uses splay trees to store cache reuse data per instruction. Since it was originally designed to work with cache reuse in hardware, associating the reuse data with an instruction works fine. However, for this project there is no single instruction to associate the reuse data with, as remote accesses are complicated operations. It was finally decided that the return address (back into the application code) would work as a substitute, since it produces the desired mapping back to the application source.

Additionally, the Atom instrumentation had to be modified to deal with differing sizes of addresses. In particular, shared memory addresses are a struct in MuPC, containing a 64-bit address, a thread, and a phase. The phase was not important to this research, but both the thread and the address were. To properly store these values, the instrumentation was modified to store addresses as 64-bit values instead of 32-bit values, and the thread was stored in the upper six bits of the address, since they were unused due to memory alignment.

Unfortunately, the implementation does require modifications to the source program. The modifications are quite small, and can easily be disabled with the preprocessor. The macros and global variables shown in Figure 3.1 were defined in the upc.h header.

    /* Added to support tracing remote accesses. */
    extern char *mupc_trace_func;
    extern char *mupc_trace_func_prev;

    /* Macros for tracing. Must be used exactly once per program! */
    #define MUPC_TRACE_NONE \
        void (*mupc_trace_init)() = NULL;
    #define MUPC_TRACE_FILE \
        void mupc_trace_init_file(); \
        void (*mupc_trace_init)() = mupc_trace_init_file;
    #define MUPC_TRACE_RD \
        void mupc_trace_init_rd(); \
        void (*mupc_trace_init)() = mupc_trace_init_rd;

    #define TRACE_FUNC \
        mupc_trace_func_prev = mupc_trace_func; \
        mupc_trace_func = __func__;
    #define TRACE_FUNC_RET \
        mupc_trace_func = mupc_trace_func_prev;

    Figure 3.1. Added MuPC Macros

These macros set up information in global variables that is used by special functions that wrap calls into the MuPC runtime. The calls save the return address of the call and the name of the function it was called from, so they can be recorded by the instrumentation. The macros TRACE_FUNC and TRACE_FUNC_RET should be used at the beginning and end of all functions with remote accesses. While these macros are not strictly necessary, they enable the instrumentation to track the function name that an operation originated from without having to figure it out from the return address. The MUPC_TRACE_* macros initialize the tracing code. The MUPC_TRACE_RD macro configures the tracing to store per-instruction shared memory reuse data. It must be included exactly once in the program's source.

During an instrumented run, all remote accesses are logged, and each place there is a call into the runtime gets associated with a reuse distance histogram. Barriers and strict accesses cause the last-use data to be dropped, forcing all later references to act as if no addresses had yet been seen. When the program completes, these histograms are written out to disk, one file per thread.
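The encoding of the thread number into the upper bits of the 64-bit address, described above, can be sketched as follows. The helper names and the exact bit positions are illustrative assumptions, given byte addresses that never use the top six bits.

    #include <stdint.h>
    #include <assert.h>

    /* Pack a UPC thread number into the top six bits of a 64-bit
     * shared address so one key identifies (thread, address). */
    static uint64_t pack_key(uint64_t addr, unsigned thread)
    {
        assert(thread < 64);            /* six bits available        */
        assert((addr >> 58) == 0);      /* top bits must be unused   */
        return addr | ((uint64_t)thread << 58);
    }

    static unsigned key_thread(uint64_t key)
    {
        return (unsigned)(key >> 58);
    }

    static uint64_t key_addr(uint64_t key)
    {
        return key & ((UINT64_C(1) << 58) - 1);
    }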
3.2 Test Kernels

Since there are no standard benchmark applications for UPC, a small number of kernels were written or modified from existing applications to model program behavior. These are described below.

3.2.1 Matrix Multiplication

Matrix multiplication is a well studied problem in parallel computation. Each index (i, j) in the resulting matrix is defined as the sum of the products of the elements of row i from the first matrix with the corresponding elements of column j from the second. A simple UPC implementation to solve C = A * B, where A, B, and C are N x N matrices, is shown in Figure 3.2.

    for (i = 0; i < N; ++i) {
        upc_forall (j = 0; j < N; ++j; &C[i][j]) {
            C[i][j] = 0;
            for (k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    Figure 3.2. Matrix Multiplication in UPC

The Matrix Multiplication kernel simply multiplies two large (square) arrays together. The threads are arranged in a 2-d grid, and each thread has an NxN block of each array. The multiplication is performed using a naive implementation with a triple-nested loop, where each thread calculates its portion of the final matrix, working a block at a time. The problem size in this kernel is the local size of the three matrices. In testing, this kernel was run with 4, 9, and 16 threads, with 4 elements per thread up to 262144 elements per thread, in increasing powers of 4. The complete source for the kernel can be found in Appendix A.1.

3.2.2 Jacobi

The Jacobi method is an iterative method that finds an approximate solution to a system of linear equations. While it does not find exact solutions, it can give a solution that is within a desired delta of the exact solution for most problems. However, its most important feature is its numerical stability, which is critical when working with inexact values. Since the floating point format most commonly used to represent real numbers is inexact, this stability makes the Jacobi method quite useful in a variety of applications.

The Jacobi kernel simulates a Jacobi iterative solver for a large array distributed in a block cyclic fashion. The kernel only simulates the remote access pattern of a Jacobi solver; it does not actually attempt to solve the generated system. Like the Matrix Multiplication kernel, every thread performs essentially the same task. The problem size in this kernel is the number of iterations the solver runs for. In testing, this kernel was run with 2 through 24 threads, with iteration counts as powers of 2 up to 8192. The complete source for the kernel can be found in Appendix A.2.

3.2.3 LU Decomposition

LU decomposition is another important operation for many scientific applications. It decomposes a matrix into the product of a lower triangular matrix with an upper triangular matrix, as shown in Figure 3.3. This can be used to find both the determinant and the inverse of a matrix, as well as to solve a system of linear equations.

    Figure 3.3. LU Decomposition of a Matrix

The LU kernel, which comes from the test suite of Berkeley's UPC compiler [4] and is based on a program from Stanford University [17], performs an LU decomposition on a large array that is distributed across nodes in a block cyclic fashion. Square blocks of B x B elements (B = 8, 16, 32 were tested) are distributed to each thread until the entire array has been allocated. The problem size is the total size of the array. In testing, this kernel was run with 2 through 36 threads, with problem sizes from 1024 elements to 16777216 elements. The complete source for the kernel can be found in Appendix A.3.
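For concreteness, the sketch below shows the arithmetic of a simple block-cyclic placement of BxB blocks dealt out to threads in round-robin order. It illustrates the kind of distribution just described; the kernel's actual layout macro may differ in details such as traversal order.

    /* Owner of block (bi, bj) in an nblocks x nblocks grid of blocks,
     * dealt round-robin across nthreads threads. */
    static int block_owner(int bi, int bj, int nblocks, int nthreads)
    {
        return (bi * nblocks + bj) % nthreads;
    }

    /* Owner of element (i, j), given the block size B. */
    static int element_owner(int i, int j, int B, int nblocks, int nthreads)
    {
        return block_owner(i / B, j / B, nblocks, nthreads);
    }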
3.2.4 Stencil

Stencil problems are problems where elements of an array are iteratively updated based on the pattern of their neighbors. A common example of this is John Conway's Game of Life, where the life or death of a cell at a given time step is determined by the life or death of neighboring cells in the previous time step. The stencil describes which neighboring cells are used to update a given cell. This is often used in engineering when modeling the flow of heat or air.

The stencil kernels are a family of kernels that apply a 2-d or 3-d stencil to an array. These kernels were added as an example of a class of problems where threads display differing behavior based on the logical layout of threads. The problem size for these is the total size of the array the stencil operates over. These kernels were developed later than the others and did not have as many test runs, due to time and hardware constraints. Therefore, exhaustive results are not available for these kernels; results are available only for the four and eight point 2-d stencil kernels.

While initially instrumenting the stencil code, it was observed that in some cases there were additional, unexpected remote memory accesses that did not match up to the theoretical communication behavior of the programs.

    void stencil(int i, int j)
    {
        A(i, j) += A(i-1, j);
        A(i, j) += A(i+1, j);
        A(i, j) += A(i, j-1);
        A(i, j) += A(i, j+1);
        A(i, j) /= 5.0;
    }

    Figure 3.4. Failing Stencil Code

Assuming that A(i, j) is local to the calling thread, one would expect that the stencil function in Figure 3.4 generates at most four operations (calls into the runtime) that could be remote accesses. Without that assumption, at most fourteen operations could be remote accesses. The code actually generated has at most twenty-three operations that could be remote accesses. This was causing the thread-scaling prediction to fail, as the pattern of which of these operations is used varies with thread count.

The culprit was determined to be the '+=' operator and its interaction with shared variables. Changing the stencil code from Figure 3.4 to that shown in Figure 3.5 eliminates the superfluous operations. This fixes the problem because the UPC-to-C translator uses nested conditional operators to check for local accesses when working with shared variables. This nesting causes multiple accesses to be generated for a single source access when there are multiple accesses to shared variables in a statement, as sketched after the corrected code below. Since this is a limitation of the MuPC compiler, and common practice is to use local temporaries to avoid multiple remote accesses, the stencil code was updated to use the latter form. The complete source for the kernel can be found in Appendix A.4.

    void stencil(int i, int j)
    {
        double t;

        t  = A(i, j);
        t += A(i-1, j);
        t += A(i+1, j);
        t += A(i, j-1);
        t += A(i, j+1);
        t /= 5.0;
        A(i, j) = t;
    }

    Figure 3.5. Corrected Stencil Code
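The following self-contained mock illustrates why nested conditional expansion multiplies runtime calls. Here rt_get and rt_put merely stand in for the MuPC runtime entry points; the names and the expansion shown are illustrative, not actual translator output.

    #include <stdio.h>

    static int runtime_calls;

    static int  rt_get(const int *p) { runtime_calls++; return *p; }
    static void rt_put(int *p, int v) { runtime_calls++; *p = v; }

    int main(void)
    {
        int A = 10, B = 3;
        int a_is_local = 0, b_is_local = 0;   /* pretend both are remote */

        /* A += B, expanded with a locality test per shared access, the
         * way the '+=' in Figure 3.4 was: the target is read through
         * the runtime as well as written, so one source access turns
         * into several potential runtime calls. */
        if (a_is_local)
            A += b_is_local ? B : rt_get(&B);
        else
            rt_put(&A, (a_is_local ? A : rt_get(&A))
                     + (b_is_local ? B : rt_get(&B)));

        printf("A = %d after %d runtime calls\n", A, runtime_calls); /* 3 */
        return 0;
    }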
3.3 Thread Partitioning

Since a prediction is being performed for each UPC thread in a run, the choice of which thread's data are used for training becomes important. If the number of threads is kept constant, the obvious choice is to use the same thread's data for the training. However, that doesn't work when the number of threads increases. Therefore, it was necessary to come up with a way of selecting the threads used in the predictions.

In searching for a suitable algorithm for partitioning threads for training, it was noted that all of the kernels tested worked with large square arrays, and the communication pattern was based on the distribution of these arrays. Because the communication pattern is based on the geometric layout of the data, an algorithm was used that matches the training data against a pattern describing this geometric layout, and then chooses threads for prediction based on it. As a special exception, thread 0 is assumed to be used for various extraneous tasks, such as initialization, and is therefore always predicted with thread 0 from each of the training runs.

The algorithm is split into three parts. First, the threads in the two training runs are partitioned into groups based on their communication behavior, as shown in Figure 3.6. It is assumed that the data for each thread in each training run has associated with it the pattern data, represented as t.patterns, and the instructions encountered (patterns are associated with an instruction), represented as t.inst. Each thread's patterns are tested against those in the existing groups. The thread is included in a group if all the patterns are present in both, and there is at most a 5% difference between the min, max, freq, and mean values for the thread and the average of the values for all the threads in the group. Each group tracks the number of members and the running arithmetic average of the min, max, freq, and mean of each pattern. Once the groups are generated, they are associated with the run as T.groups.

    input:  the set of training runs containing pattern data for each thread
    output: the set of training runs is updated with the set of groups G

    for each training run T
        T.groups = empty
        for each thread t in T
            t.group = NULL
            for each group g in T.groups
                if patterns_match(g, t)
                    t.group = g
                    break
            if t.group == NULL
                t.group = new group
                t.group.vals = NULL
                t.group.inst = t.inst
                t.group.patterns = t.patterns
                t.group.nthr = 1
                T.groups = T.groups + {t.group}
            else
                for each pattern pg in t.group.patterns
                    let pt be the corresponding pattern in t.patterns
                    pg.min  = (pg.min  * t.group.nthr + pt.min)  / (t.group.nthr + 1)
                    pg.max  = (pg.max  * t.group.nthr + pt.max)  / (t.group.nthr + 1)
                    pg.freq = (pg.freq * t.group.nthr + pt.freq) / (t.group.nthr + 1)
                    pg.mean = (pg.mean * t.group.nthr + pt.mean) / (t.group.nthr + 1)
                t.group.nthr = t.group.nthr + 1

    Figure 3.6. Partitioning Algorithm

    input:  a group g and a thread t
    output: true if the patterns of t match those of g, false otherwise

    if g.inst != t.inst
        return false
    for each pattern pg in g.patterns
        let pt be the corresponding pattern in t.patterns
        if no such pt
            return false
        if |pg.min  - pt.min|  / max(pg.min,  pt.min)  > 0.05 return false
        if |pg.max  - pt.max|  / max(pg.max,  pt.max)  > 0.05 return false
        if |pg.freq - pt.freq| / max(pg.freq, pt.freq) > 0.05 return false
        if |pg.mean - pt.mean| / max(pg.mean, pt.mean) > 0.05 return false
    return true

    Figure 3.7. patterns_match() Function
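The 5% test in Figure 3.7 transcribes directly into C, reading each check as a relative difference:

    #include <math.h>

    /* The tolerance test applied to each of min, max, freq, and mean
     * when matching a thread's pattern against a group's running
     * average, as in Figure 3.7. */
    static int within_tolerance(double group_val, double thread_val)
    {
        double denom = fmax(fabs(group_val), fabs(thread_val));
        return denom == 0.0 || fabs(group_val - thread_val) / denom <= 0.05;
    }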
input: the set T of training runs with the assoiated group assignments output: true if training runs math the pattern, false otherwise for eah training run Ti for eah thread t 2 Ti [ t.group.vals = t.group.vals f (t; T .numthreads) if !Mi .ontainsKey(f (t; T .numthreads)) Mi .insert(f (t; T .numthreads),t) for eah unordered pair of groups g1 ; g2 if (g1 .vals return true \ 6 ; g2 .vals) = return false 2 T .groups Figure 3.8. Mathing Algorithm Unfortunately, the pattern itself is not automatially deteted, but was hosen beause the kernels tested all used a square thread layout. The pattern funtion used is shown in Figure 3.9. Note that the pattern partitions a square into regions based on geometri properties. Beause the algorithm merges regions based on observed behaviors, the pattern atually denes a large number of subpatterns, whih are automatially mathed by the algorithm. The ability to automatially generate a pattern funtion for a given problem would inrease the generality of this analysis greatly, an important avenue for future work. However, it is possible to reate general patterns that math a wide variety of problems by utilizing the subsetting inherent to this algorithm. Finally, if both training runs math the pattern, threads are predited with threads from the training runs who share the same group as determined by the mathed funtion. The algorithm seleting the pairs is shown in Figure 3.10. It uses the maps generated while mathing the pattern to hoose threads in the training runs with the same value returned by the pattern funtion. As an example, onsider using equation 3.1 as a pattern to math against the LU kernel run with 16 threads in the test dataset, 25 threads in the train dataset, and 36 threads in the referene dataset with a xed per-thread data size. In this ase, the threads with data on the diagonal of the array have to do extra work. Equation 3.1 returns 0 for threads on the diagonal and 1 for threads not on the diagonal, assuming the threads are laid out in a square grid. This should perfetly math the thread layout of the LU kernel when run with a number of threads that 18 Figure 3.9. Pattern Funtion input: input: the set of threads T in the predition run pred for eah training run, a map Mi mapping a return value from the pattern f (t; T ) output: to a thread in the training run that generates that return value for eah thread in the predition run, a pair of threads from the training runs to use for eah thread t pred 2 T t.preds = (M0 .get(f (t pred pred ;T pred .numthreads)),M1 .get(f (t ;T .numthreads))) pred pred Figure 3.10. Training Thread Seletion Algorithm 19 is a perfet square, suh as in this example. For larity, the mathing of reuse patterns is simplied suh that threads with data on the diagonal perfetly math only other threads with data on the diagonal, and likewise for threads without data on the diagonal. ( p 0 if b ptn = (t mod n) (3.1) f (t; n) = 1 o.w. First the threads in the test and train datasets are sorted into groups. In both ases there are two groups, those threads that have data on the diagonal { and therefore have extra work, and those that don't. For the test dataset, the groups are g = ft ; t ; t g and g = ft ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t g. For the train dataset, the groups are g = ft ; t ; t ; t g and g = ft ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t ; t g. As noted earlier, t is exluded as a speial ase. 
Next, for each group, the set vals of results of f(ti, n), where n is the number of threads, is computed over each ti in the group. This gives g0.vals = {0} and g1.vals = {1} for both the test and train datasets. Because the intersection of g0.vals and g1.vals is empty, the pattern is matched in both datasets.

    Ref  Test  Train        Ref  Test  Train
     0     0     0          18    1     1
     1     1     1          19    1     1
     2     1     1          20    1     1
     3     1     1          21    5     6
     4     1     1          22    1     1
     5     1     1          23    1     1
     6     1     1          24    1     1
     7     5     6          25    1     1
     8     1     1          26    1     1
     9     1     1          27    1     1
    10     1     1          28    5     6
    11     1     1          29    1     1
    12     1     1          30    1     1
    13     1     1          31    1     1
    14     5     6          32    1     1
    15     1     1          33    1     1
    16     1     1          34    1     1
    17     1     1          35    5     6

    Table 3.1. Thread Grouping for 36-Thread Reference Dataset

Since the pattern was matched, it can be used to select the pairs of threads used for prediction of the reference dataset. For each thread in the reference dataset, a pair of threads from the test and train datasets is chosen based on which set it is in. Since threads are grouped by behavior, it doesn't matter which thread in a set is used for the prediction. In the solution shown in Table 3.1, the lowest thread in each set is used for the predictions.

CHAPTER 4

Prediction Results

To determine whether or not it would be possible to model the behavior of remote memory accesses, the four kernels were instrumented and run with a number of different data sizes and numbers of threads. All tests were run with the instrumented MuPC compiler on a 24-node cluster with two dual-core Opterons per node and an InfiniBand interconnect. Due to hardware and software limitations, testing was restricted to a maximum of 48 UPC threads. In practice, runs with more than 24 threads were quite slow, so there are relatively few results with more than 24 threads. While it is recognized that these are relatively small systems in the world of high performance computing, it is believed that the results would hold as the problem size and thread count increase, since the problems encountered were due to either changes in data layout or using training data from runs that were too small.

4.1 Problem Size Scaling

As expected, the prediction accuracy was very high when holding the number of threads constant and just increasing the problem size. The prediction followed the same pattern as shown in earlier work, which makes sense, as there is very little to distinguish size scaling with constant threads from size scaling with one thread as far as the prediction is concerned. However, problems can arise when the growth of the problem size causes the distribution of shared data to change, which in turn causes the communication pattern between threads to change. This behavior is seen in the LU kernel, where prediction accuracy and coverage drop steeply in a few cases because the distribution of the array changed.

    Table 4.1. Problem Size Scaling Accuracy Distribution
    Kernel                      Predictions  Minimum   Average   Maximum
    Matrix Multiplication          1547        3.08%    96.98%   100.00%
    Jacobi Solver                 14234       99.87%   100.00%   100.00%
    LU Decomposition, 8x8          7027        6.36%    94.20%   100.00%
    LU Decomposition, 16x16        5809        8.54%    95.91%   100.00%
    LU Decomposition, 32x32        5839        0.00%    96.29%   100.00%

    Table 4.2. Problem Size Scaling Coverage Distribution
    Kernel                      Predictions  Minimum   Average   Maximum
    Matrix Multiplication          1547       53.45%    99.45%   100.00%
    Jacobi Solver                 14234       38.80%   100.00%   100.00%
    LU Decomposition, 8x8          7027        0.00%    76.82%   100.00%
    LU Decomposition, 16x16        5809        0.00%    63.17%   100.00%
    LU Decomposition, 32x32        5839        0.00%    52.97%   100.00%

Tables 4.1 and 4.2 show that both coverage and accuracy are quite high in most cases.
In the matrix multiplication and Jacobi kernels, the low minimum accuracies are seen only when training with very small problem sizes. Of the 1612 predictions on the matrix multiplication kernel where the accuracy is less than 60%, 1610 occur when the smallest training size is 256 or less, and 1332 when the smallest training size is 16 or less. Likewise, the low minimum coverage percentages are seen only when the training sizes are small enough that some operations disappear.

The LU decomposition kernel is a bit more problematic. Consider the results in Table 4.3. It charts the percentage of predictions that have greater than 80%, 90%, and 95% accuracy for each of the three blocking factors tested, first for all predictions, then only for those where the coverage was 100%.

    Table 4.3. LU Problem Size Scaling Accuracy by Coverage
    Kernel                      Predictions  > 80% A   > 90% A   > 95% A
    8x8, All Predictions           7027       56.07%    50.39%    33.56%
    16x16, All Predictions         5809       50.20%    49.63%    40.37%
    32x32, All Predictions         5839       47.58%    45.83%    45.71%
    8x8, 100% Coverage             3503       96.97%    85.64%    53.04%
    16x16, 100% Coverage           2486       97.30%    96.46%    74.82%
    32x32, 100% Coverage           1844       88.72%    85.36%    84.98%

The large difference between predictions with 100% coverage and the others stems from the behavior of the kernel. The data distribution amongst the threads is determined by the problem size, so changing the problem size changes the data distribution. The data distribution in turn determines which remote memory operations a thread encounters, which therefore also changes when the problem size changes. This results in low coverage. The altered data distribution also changes the behavior of a couple of remote memory operations, as the number of threads encountering the operation changes. This has the effect of reducing the accuracy of the prediction for those operations.

It is clear that this model is restricted to using training data from runs that have communication patterns similar to the run being predicted. An interesting question for future work is whether or not a model of the data distribution can be used to model how scaling will affect the application's communication pattern, and whether that can in turn be used to enable high coverage and prediction accuracy when the data distribution does change.

4.2 Thread Scaling

    Table 4.4. Thread Scaling Accuracy Distribution by Kernel
    Kernel                      Predictions  Minimum   Average   Maximum
    Matrix Multiplication         61056        3.08%    97.75%   100.00%
    Jacobi Solver                103600       94.02%    99.87%   100.00%
    LU Decomposition, 8x8        862683       12.85%    94.71%   100.00%
    LU Decomposition, 16x16      346905       19.57%    91.73%    99.98%
    LU Decomposition, 32x32       94600       13.11%    82.49%    99.31%
    2d Stencil                    18000      100.00%   100.00%   100.00%

    Table 4.5. Thread Scaling Coverage Distribution by Kernel
    Kernel                      Predictions  Minimum   Average   Maximum
    Matrix Multiplication         61056       27.62%    99.29%   100.00%
    Jacobi Solver                103600       55.66%    91.26%   100.00%
    LU Decomposition, 8x8        862683        0.00%    41.73%   100.00%
    LU Decomposition, 16x16      346905        0.00%    37.43%   100.00%
    LU Decomposition, 32x32       94600        0.00%    20.96%    98.56%
    2d Stencil                    18000        0.00%    57.99%   100.00%

Tables 4.4 and 4.5 show the prediction results for varying the number of threads, predicting with all possible pairs of training threads from the available training data. Both coverage and accuracy are quite high in most cases. In the matrix multiplication and Jacobi kernels, the low minimum accuracies are seen only when training with very small problem sizes, the same as occurs when scaling the problem size. These results are for exhaustively predicting every thread in the reference set with every possible combination of threads in the two training sets.
The high accuracy and coverage of the matrix multiplication and Jacobi kernels in these tables indicate that the prediction works well regardless of the threads chosen for training. The prediction coverage and accuracy on the LU kernel is much more dependent on the choice of threads used for the training, however. Consider the accuracy of the prediction for thread 9 of 25 when using threads from runs with 4 and 16 threads for the training. The prediction accuracy by training pair is shown in Figure 4.1.

    [Figure 4.1 plots prediction accuracy (0-100%) against train thread
    (0-15), with one curve for each of test threads 0-3.]

    Figure 4.1. LU 32x32 4-16-25 Thread Scaling Example Thread

For most of the pairs, the prediction accuracy is quite good. However, there are a number of pairs that result in terrible accuracy. This is a result of the behavioral differences between threads in the LU kernel. Certain threads (those that contain blocks on the diagonal) have to do additional communication, and thread 9 in a 25-thread run is not one of them, so pairs where both training threads are on the diagonal show dramatic decreases in accuracy. However, these results show that if the threads can be partitioned in such a way that threads with similar behaviors are in the same groups, high accuracy predictions can be made. Thus, consider what happens when partitioning the training threads as described in Section 3.3 with the pattern shown in Figure 3.9.

    [Figure 4.2 plots prediction accuracy (99.5-100%) for each predicted
    thread 0-24 when the thread selection algorithm chooses the training
    pairs.]

    Figure 4.2. LU 32x32 4-16-25 Accuracy Using Thread Selection Algorithm

The function behind Figure 3.9 is a complicated function that simply partitions a square into groups based on geometric location. Each corner is in its own group, and the edges (minus the corners) are each in their own group. The diagonal, and the upper and lower triangles, also have their own groups. Figure 4.2 shows the accuracy results when using the thread partitioning to select training pairs. As expected, by predicting threads on the diagonal with threads also on the diagonal, it is possible to avoid the pits seen in Figure 4.1.

However, it is desirable that the pattern that is matched not be specific to the LU kernel. Thus, the square 2-d stencils were used to verify that the pattern would also work for an application with a very different communication pattern than the LU kernel. Like the LU kernel, the data distribution of the shared array determines the control flow through the program in the stencil kernels. Unlike the LU kernel, threads on the corners and edges exhibit differing behavior. This is due to the lack of communication on one or more sides of the stencil. In turn, this causes low prediction coverage if threads along the edges are used to predict for threads in the middle, hence the low minimum and average coverage values in Table 4.5. The accuracy of covered operations is not impacted, however, because the skipped operations have minimal impact on the reuse patterns. This is shown by the very high accuracy of predictions seen in Table 4.4.

    Table 4.6. Thread Scaling Accuracy Distribution with Partitioning
    Kernel                      Minimum   Average   Maximum
    Matrix Multiplication        92.38%    99.44%   100.00%
    Jacobi Solver                94.69%    99.87%   100.00%
    LU Decomposition, 8x8        51.07%    91.24%    99.86%
    LU Decomposition, 16x16      56.26%    80.36%    99.33%
    LU Decomposition, 32x32      48.95%    91.67%    99.31%
    2d Stencil                  100.00%   100.00%   100.00%
    Table 4.7. Thread Scaling Coverage Distribution with Partitioning
    Kernel                      Minimum   Average   Maximum
    Matrix Multiplication       100.00%   100.00%   100.00%
    Jacobi Solver               100.00%   100.00%   100.00%
    LU Decomposition, 8x8        97.45%    98.79%    99.90%
    LU Decomposition, 16x16      97.56%    98.97%   100.00%
    LU Decomposition, 32x32      96.34%    98.26%    98.56%
    2d Stencil                  100.00%   100.00%   100.00%

Using thread partitioning, the threads in the corners, on each edge, and in the middle are grouped separately. This provides 100% coverage for all threads, as threads in the training runs that skip operations are used to predict for threads that will also skip those operations due to the data distribution. Tables 4.6 and 4.7 show the results of using the partitioning algorithm to select training threads for all kernels. As expected, coverage and accuracy both show marked improvement for all kernels. The only unexpected result is the decline in the average accuracy for the LU kernels. On closer inspection, this is because the results in Table 4.4 are padded by the large number of combinations that match threads not on the diagonal. Additionally, because the number of predictions is so much smaller, the minimums carry more weight.

CHAPTER 5

Conclusions

In summary, it has been shown that it is possible to predict the remote reuse distance for UPC applications with a high degree of accuracy and coverage, though there are a number of important limitations. First, it is necessary to choose training data that match the behavior at the desired prediction size to achieve high accuracy and coverage. Changes in the data distribution caused by increases in the problem size or number of threads can cause significant drops in both accuracy and coverage. The choice of training threads is also critically important for prediction when scaling up the number of threads. The prediction results can vary from extremely poor to excellent merely by the choice of which threads were used for the prediction. It is therefore necessary to match threads' behaviors in the training data to patterns that predict which threads will perform similarly in the scaled-up runs.

5.1 Applications

One promising application of this research is automatically adjusting cache parameters such as size and thread affinity of cache lines. This includes things like bypassing the cache for operations that are likely to result in a cache miss, or disabling the cache entirely for applications that won't make much use of it. Consider the code segment in Figure 5.1. Assume do_something is a function that performs no communication.

    void example_sub(shared float *p1, shared float *p2, int len)
    {
        int i, j;

        upc_forall (i = 0; i < len; ++i; p1 + i) {
            for (j = 0; j < len; ++j) {
                do_something(p1[i], p2[j], i, j);
            }
        }
    }

    Figure 5.1. Sample UPC Function

Since elements of p2 are accessed in order, repeatedly for each element of p1, this particular segment of code would work well with a cache large enough to hold at least len elements. This work could be used to predict how large a cache is needed for the application to cache all the elements this function sees, based on the reuse distance patterns seen.
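One way such a prediction could feed a cache-sizing decision is sketched below. It assumes a fully associative cache with one element per line and reuses the min/max/freq/mean pattern fields from Chapter 2; the rule of thumb is simply to cover the largest predicted reuse distance.

    #include <stddef.h>

    struct pattern { double min, max, freq, mean; };

    /* Smallest cache (in lines) that covers the largest predicted
     * reuse distance among an operation's patterns. */
    static size_t lines_needed(const struct pattern *pats, size_t npats)
    {
        size_t i;
        double worst = 0.0;

        for (i = 0; i < npats; i++)
            if (pats[i].max > worst)
                worst = pats[i].max;
        /* A reuse at distance d survives in a cache of d+1 lines. */
        return (size_t)worst + 1;
    }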
Another option is warning the user about operations that show poor cache performance, perhaps as part of a larger performance analysis or profiling tool. Going back to the previous code, assume this time that the maximum size the cache can grow to in an implementation is CACHE_LEN. If the prediction indicates that len is likely to be larger than this, the tool might warn the user that the accesses to elements of p2 are likely to result in cache misses because the cache is too small.

Then the user could change the code, perhaps to something like the code in Figure 5.2, to take better advantage of the cache, or perhaps simply change the compiler options to tell the compiler to do so automatically.

    void example_sub(shared float *p1, shared float *p2, int len)
    {
        int i, j, k;

        for (k = 0; k < len / CACHE_LEN; ++k) {
            upc_forall (i = 0; i < len; ++i; p1 + i) {
                for (j = 0; j < CACHE_LEN; ++j) {
                    do_something(p1[i], p2[k * CACHE_LEN + j],
                                 i, k * CACHE_LEN + j);
                }
            }
        }
        upc_forall (i = 0; i < len; ++i; p1 + i) {
            for (j = 0; j < len % CACHE_LEN; ++j) {
                do_something(p1[i], p2[k * CACHE_LEN + j],
                             i, k * CACHE_LEN + j);
            }
        }
    }

    Figure 5.2. Sample UPC Function Tuned for Cache Size

Inserting prefetches prior to operations that are likely to result in a cache miss, and likewise avoiding unnecessary prefetches, would also be a good use of these predictions. Since network congestion can cause dramatic performance penalties on large clusters, avoiding unnecessary communication can be just as important as requesting data before it is actually needed.

5.2 Future Work

This research shows that it is possible to predict the remote reuse distance behavior of UPC applications. There are a number of weaknesses that should be addressed in future work. Foremost among these is the ability to model the data distribution of an application, and to use that model to avoid problems such as are seen with the LU kernel, where changing the data distribution between training runs causes poor accuracy. Since the data distribution is known at runtime, it should be possible to store it and use it to tune the prediction by modeling how the distribution will change with an increase in problem size.

Another weakness of this research is that the function used as a pattern during thread partitioning must be chosen manually. As the thread grouping is largely based on the data distribution, it seems natural to expect that a model of how the data distribution changes when the number of threads grows would also enable a better partitioning of threads for improved prediction accuracy. Another possibility is looking at the program in a more abstract fashion, choosing a pattern based on the type of problem being solved. A list of such abstract problem types and their associated communication patterns, such as Berkeley's Dwarfs [18], could be used as a starting point.

Since UPC is meant to increase productivity on large systems, it will also be necessary to improve the scalability of this work. In particular, storing reuse patterns for every thread in the two training sets, and predicting for every thread, generated a large amount of data even for the relatively small test applications used. If this were to be used in a production environment, there would need to be some way of compressing the data or of skipping threads whose patterns are similar to another thread's. This could perhaps extend into exploring a global view of cache behavior, where the data are not kept on a per-thread basis, but rather taken over all the threads in an application.

Finally, this work only explored temporal reuse. Spatial reuse, where "nearby" data is pulled in along with requested data, provides quite a bit of performance for many serial applications, and it is likely that it would work similarly for many UPC applications. The same prediction scheme used for temporal reuse has been shown to work well for spatial reuse in serial applications as well.

However, spatial locality in UPC shared memory can be cross-thread or on the same thread, depending on how the application steps through memory and the way the data is laid out.

LIST OF REFERENCES

[1] TOP500.Org. "ORNL's Jaguar Claws its Way to Number One, Leaving Reconfigured Roadrunner Behind in Newest TOP500 List of Fastest Supercomputers." Nov. 2009. [Online]. Available: http://top500.org/lists/2009/11/press-release

[2] R. Numrich and J. Reid, "Co-Array Fortran for parallel programming," Rutherford Appleton Laboratory, Tech. Rep. RAL-TR-1998-060, 1998.

[3] The UPC Consortium. "UPC language specification, v1.2." June 2005. [Online]. Available: http://www.gwu.edu/upc/docs/upc_specs_1.2.pdf

[4] University of California Berkeley. "The Berkeley UPC Compiler." 2002. [Online]. Available: http://upc.lbl.gov

[5] University of California Berkeley. "GASNet Communication System." 2002. [Online]. Available: http://gasnet.cs.berkeley.edu/

[6] W.-Y. Chen, C. Iancu, and K. Yelick, "Communication optimizations for fine-grained UPC applications," in PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 2005, pp. 267-278.

[7] M. Snir, "Shared memory programming on distributed memory systems," 2009, keynote address at PGAS 2009. [Online]. Available: http://www2.hpcl.gwu.edu/pgas09/tutorials/PGAS_Snir_Keynote.pdf

[8] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.

[9] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948-960, Sept. 1972.

[10] B. Carlson, T. El-Ghazawi, R. Numrich, and K. Yelick. "Programming in the Partitioned Global Address Space Model." 2003. [Online]. Available: http://crd.lbl.gov/UPC/images/b/b5/PGAS_Tutorial_sc2003.pdf

[11] J. Savant and S. Seidel, "MuPC: A Run Time System for Unified Parallel C," Michigan Technological University, Tech. Rep. CS-TR-02-03, Sept. 2002. [Online]. Available: http://upc.mtu.edu/papers/CS.TR.2.3.pdf

[12] Hewlett Packard. "HP Unified Parallel C."

[13] C. Fang, S. Carr, S. Onder, and Z. Wang, "Reuse-distance-based miss-rate prediction on a per instruction basis," in Proceedings of the 2nd ACM Workshop on Memory System Performance, June 2004, pp. 60-68.

[14] C. Fang, S. Carr, S. Onder, and Z. Wang, "Instruction based memory distance analysis and its application to optimization," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.

[15] C. Ding and Y. Zhong, "Predicting whole-program locality through reuse distance analysis," SIGPLAN Not., vol. 38, no. 5, pp. 245-257, 2003.

[16] A. Srivastava and A. Eustace, "ATOM: A system for building customized program analysis tools." ACM, 1994, pp. 196-205.

[17] Stanford University. "Parallel dense blocked LU factorization." 1994.

[18] University of California Berkeley. "The Landscape of Parallel Computing Research: A View From Berkeley." [Online]. Available: http://view.eecs.berkeley.edu/wiki/Main_Page

APPENDIX

Test Kernel Sources

A.1 Matrix Multiplication

    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>
    #include <assert.h>
    #include <upc.h>
    #include <upc_relaxed.h>

    #ifdef MUPC_TRACE_RD
    MUPC_TRACE_RD
    #endif
Since UPC is meant to increase productivity on large systems, it will also be necessary to improve the scalability of this work. In particular, storing reuse patterns for every thread in the two training sets, and predicting for every thread, generated a large amount of data even for the relatively small test applications used here. If this analysis were used in a production environment, there would need to be some way of compressing the data or of skipping threads whose patterns are similar to another thread's. This could perhaps extend into exploring a global view of cache behavior, where the data are not kept on a per-thread basis but are instead aggregated over all the threads in an application.

Finally, this work only explored temporal reuse. Spatial reuse, where "nearby" data is pulled in along with requested data, provides quite a bit of performance for many serial applications, and it is likely that it would work similarly for many UPC applications. The same prediction scheme used for temporal reuse has been shown to work well for spatial reuse in serial applications. However, spatial locality in UPC shared memory can be cross-thread or on the same thread, depending on how the application steps through memory and on the way the data is laid out.

LIST OF REFERENCES

[1] TOP500.Org. "ORNL's Jaguar Claws its Way to Number One, Leaving Reconfigured Roadrunner Behind in Newest TOP500 List of Fastest Supercomputers." Nov. 2009. [Online]. Available: http://top500.org/lists/2009/11/press-release
[2] R. Numrich and J. Reid, "Co-Array Fortran for parallel programming," Rutherford Appleton Laboratory, Tech. Rep. RAL-TR-1998-060, 1998.
[3] The UPC Consortium. "UPC language specification, v1.2." June 2005. [Online]. Available: http://www.gwu.edu/~upc/docs/upc_specs_1.2.pdf
[4] University of California, Berkeley. "The Berkeley UPC Compiler." 2002. [Online]. Available: http://upc.lbl.gov
[5] University of California, Berkeley. "GASNet Communication System." 2002. [Online]. Available: http://gasnet.cs.berkeley.edu/
[6] W.-Y. Chen, C. Iancu, and K. Yelick, "Communication optimizations for fine-grained UPC applications," in PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 2005, pp. 267-278.
[7] M. Snir, "Shared memory programming on distributed memory systems," keynote address at PGAS 2009, 2009. [Online]. Available: http://www2.hpcl.gwu.edu/pgas09/tutorials/PGAS_Snir_Keynote.pdf
[8] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
[9] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948-960, Sept. 1972.
[10] B. Carlson, T. El-Ghazawi, R. Numrich, and K. Yelick. "Programming in the Partitioned Global Address Space Model." 2003. [Online]. Available: http://crd.lbl.gov/UPC/images/b/b5/PGAS_Tutorial_sc2003.pdf
[11] J. Savant and S. Seidel, "MuPC: A Run Time System for Unified Parallel C," Michigan Technological University, Tech. Rep. CS-TR-02-03, Sept. 2002. [Online]. Available: http://upc.mtu.edu/papers/CS.TR.2.3.pdf
[12] Hewlett-Packard. "HP Unified Parallel C."
[13] C. Fang, S. Carr, S. Onder, and Z. Wang, "Reuse-distance-based miss-rate prediction on a per instruction basis," in Proceedings of the 2nd ACM Workshop on Memory System Performance, June 2004, pp. 60-68.
[14] C. Fang, S. Carr, S. Onder, and Z. Wang, "Instruction based memory distance analysis and its application to optimization," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.
[15] C. Ding and Y. Zhong, "Predicting whole-program locality through reuse distance analysis," SIGPLAN Not., vol. 38, no. 5, pp. 245-257, 2003.
[16] A. Srivastava and A. Eustace, "ATOM: A system for building customized program analysis tools." ACM, 1994, pp. 196-205.
[17] Stanford University. "Parallel dense blocked LU factorization." 1994.
[18] University of California, Berkeley. "The Landscape of Parallel Computing Research: A View From Berkeley." [Online]. Available: http://view.eecs.berkeley.edu/wiki/Main_Page

APPENDIX

Test Kernel Sources

A.1 Matrix Multiplication

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <upc.h>
#include <upc_relaxed.h>

#ifdef MUPC_TRACE_RD
/* TRACE_FUNC and TRACE_FUNC_RET hooks for the MuPC reuse-distance
   instrumentation are declared here. */
#endif

/* The matrix is laid out as an n x n set of blocks, with
   n = sqrt(THREADS). Each thread holds one NxN block of each of
   A, B and C, stored locally as a linear array. */
shared [] int * shared [1] *A;
shared [] int * shared [1] *B;
shared [] int * shared [1] *C;
#define idx(arr, i, j) \
    (*((arr)[((i)/N)*n + ((j)/N)] + ((i)%N)*N + ((j)%N)))

int n, N;

void initialize(char *argv)
{
    int i, j, k;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Seed the random number generator. */
    srand(MYTHREAD);

    /* Verify that the number of threads is a square. */
    n = (int) floor(sqrt((double) THREADS));
    assert((n*n) == THREADS);

    /* Read in the size of the local blocks. */
    N = atoi(argv);
    if (MYTHREAD == 0)
        printf("Using a block size of %dx%d\n", N, N);

    /* Allocate the shared memory for the arrays. */
    A = (shared [] int * shared [1] *)
        upc_all_alloc(THREADS, sizeof(shared [] int *));
    B = (shared [] int * shared [1] *)
        upc_all_alloc(THREADS, sizeof(shared [] int *));
    C = (shared [] int * shared [1] *)
        upc_all_alloc(THREADS, sizeof(shared [] int *));
    assert((A != NULL) && (B != NULL) && (C != NULL));
    A[MYTHREAD] = (shared [] int *)
        (((shared [1] int *) upc_all_alloc(THREADS, N*N*sizeof(int))) + MYTHREAD);
    B[MYTHREAD] = (shared [] int *)
        (((shared [1] int *) upc_all_alloc(THREADS, N*N*sizeof(int))) + MYTHREAD);
    C[MYTHREAD] = (shared [] int *)
        (((shared [1] int *) upc_all_alloc(THREADS, N*N*sizeof(int))) + MYTHREAD);

    /* Fill in arrays A and B. */
    upc_forall (i = 0; i < THREADS; i++; i) {
        assert(A[i] != NULL);
        assert(B[i] != NULL);
        assert(C[i] != NULL);
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                *(A[i] + j*N + k) = i;
                *(B[i] + j*N + k) = i;
                *(C[i] + j*N + k) = 0;
            }
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void print_array(shared [] int * shared [1] *A)
{
    int i, j;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Only thread 0 prints. Everyone else just returns. */
    if (MYTHREAD != 0)
        return;
    for (i = 0; i < n*N; i++)
        printf("\t%d", i);
    putchar('\n');
    for (i = 0; i < n*N; i++) {
        printf("%d", i);
        for (j = 0; j < n*N; j++)
            //printf("\t%d", *(A[(i/N)*n+(j/N)] + (i%N)*N + (j%N)));
            printf("\t%d", idx(A, i, j));
        putchar('\n');
    }
    putchar('\n');

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void calc_block(int idx)
{
    int i, j, k, rowt, colt;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    rowt = (MYTHREAD/n)*n + idx;
    colt = (MYTHREAD%n) + (idx*n);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                *(C[MYTHREAD] + i*N + k) +=
                    (*(A[rowt] + i*N + j)) * (*(B[colt] + j*N + k));

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void mult_kernel()
{
    int i, j;

    upc_forall (i = 0; i < THREADS; i++; i)
        for (j = 0; j < n; j++)
            calc_block(j);
}

int main(int argc, char **argv)
{
    /* Must have 1 argument, the block size N. */
    if (argc != 2)
        exit(EXIT_FAILURE);

    /* Initialize arrays. */
    initialize(argv[1]);
    upc_barrier(0);

    /* Print out A and B. */
    print_array(A);
    print_array(B);
    upc_barrier(1);

    /* Compute C = A*B. */
    mult_kernel();
    upc_barrier(2);

    /* Print out the result. */
    print_array(C);
    return 0;
}

A.2 Jacobi Solver

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <upc.h>
#include <upc_relaxed.h>

#ifdef MUPC_TRACE_RD
/* TRACE_FUNC and TRACE_FUNC_RET hooks for the MuPC reuse-distance
   instrumentation are declared here. */
#endif
#ifndef N
#define N 100           /* Default number of unknowns per thread. */
#endif
#define SIZE (N*THREADS)

shared [N] double A[SIZE][SIZE];
shared [N] double X[2][SIZE];
shared [N] double B[SIZE];
shared [N] double D[SIZE];
double maxD;
//double epsilon;
long int MAX_ITER;

void initialize(char *argv)
{
    int i, j;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    /* Seed the random number generator. */
    srandom(MYTHREAD);

    /* Read in the desired epsilon. */
    //epsilon = atof(argv);
    MAX_ITER = atoi(argv);
    if (MYTHREAD == 0) {
        //printf("Using epsilon = %g\n", epsilon);
        printf("Using MAX_ITER = %ld\n", MAX_ITER);
    }

    /* Fill in A and B. Initialize X[i] to B[i]. */
    upc_forall (i = 0; i < SIZE; i++; &B[i]) {
        X[0][i] = X[1][i] = B[i] =
            ((double) random())/((double) RAND_MAX);
        for (j = 0; j < SIZE; j++) {
            A[j][i] = ((double) random())/((double) RAND_MAX);
            if (j == i)
                A[j][i] += (double) SIZE;  /* make A diagonally dominant */
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

unsigned int jacobi_kernel()
{
    unsigned int iter;
    int i, j;
    double sum;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (iter = 0; iter < MAX_ITER; iter++) {
        /* Update X[]. */
        upc_forall (i = 0; i < SIZE; i++; &B[i]) {
            sum = 0.0;
            for (j = 0; j < i; j++)
                sum += A[i][j]*X[iter%2][j];
            for (j = i+1; j < SIZE; j++)
                sum += A[i][j]*X[iter%2][j];
            X[(iter+1)%2][i] = (B[i] - sum)/A[i][i];
        }
        upc_barrier;

        /* Compute maximum deltas on each thread. */
        upc_forall (i = 0; i < SIZE; i++; &D[i]) {
            sum = 0.0;
            for (j = 0; j < SIZE; j++)
                sum += A[i][j]*X[(iter+1)%2][j];
            D[i] = fabs(sum - B[i]);
        }
        upc_barrier;

        /* Check for termination. */
        if (MYTHREAD == 0) {
            maxD = 0;
            for (i = 0; i < SIZE; i++)
                maxD = (D[i] > maxD) ? D[i] : maxD;
            fprintf(stderr, "maxD: %8g\n", maxD);
        }
        //if (maxD < epsilon)
        //    return iter;
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
    return iter;
}

void print_A()
{
    int i, j;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD != 0)
        return;
    puts("A[][]:");
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++)
            printf("\t%8g", A[i][j]);
        putchar('\n');
    }
    putchar('\n');

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void print_B()
{
    int i;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD != 0)
        return;
    puts("B[]:");
    for (i = 0; i < SIZE; i++)
        printf("\t%8g\n", B[i]);
    putchar('\n');

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void print_X(int iter)
{
    int i;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    if (MYTHREAD != 0)
        return;
    puts("X[]:");
    for (i = 0; i < SIZE; i++)
        printf("\t%8g\n", X[iter%2][i]);
    putchar('\n');

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

int main(int argc, char **argv)
{
    int iter;

    /* Must have 1 argument, the desired precision. */
    if (argc != 2)
        exit(EXIT_FAILURE);

    /* Initialize arrays. */
    initialize(argv[1]);
    upc_barrier(1);

    /* Print the randomly generated A and B. */
    print_A();
    print_B();
    upc_barrier(2);

    /* Find X such that AX=B. */
    iter = jacobi_kernel();
    upc_barrier(3);

    /* Print out the result. */
    print_X(iter);
    return 0;
}

A.3 LU Decomposition
/*
 * Copyright (c) 1994 Stanford University
 *
 * All rights reserved.
 *
 * Permission is given to use, copy, and modify this software for any
 * non-commercial purpose as long as this copyright notice is not
 * removed. All other uses, including redistribution in whole or in
 * part, are forbidden without prior written permission.
 *
 * This software is provided with absolutely no warranty and no
 * support.
 */

/*
 * Parallel dense blocked LU factorization (no pivoting)
 *
 * This version contains one-dimensional arrays in which the matrix
 * to be factored is stored.
 *
 * Command line options:
 *
 *   -nN : Decompose NxN matrix.
 *   -pP : P = number of processors.
 *   -bB : Use a block size of B. BxB elements should fit in cache for
 *         good performance. Small block sizes (B=8, B=16) work well.
 *   -s  : Print individual processor timing statistics.
 *   -t  : Test output.
 *   -o  : Print out matrix values.
 *   -h  : Print out command line options.
 *
 * Note: This version works under both the FORK and SPROC models.
 */

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>   /* getopt */
#include <sys/time.h>
#include "upc.h"
#include "lu.h"
//MAIN_ENV

#define MAXRAND   32767.0
#define DEFAULT_N 128
#define DEFAULT_P 1
#define DEFAULT_B 16
#define min(a,b) ((a) < (b) ? (a) : (b))

#ifdef MUPC_TRACE_RD
/* TRACE_FUNC and TRACE_FUNC_RET hooks for the MuPC reuse-distance
   instrumentation are declared here. */
#endif

/* Everything that was in global memory in the original corresponds
   to shared types here. */
shared double t_in_fac[THREADS];
shared double t_in_solve[THREADS];
shared double t_in_mod[THREADS];
shared double t_in_bar[THREADS];
shared double completion[THREADS];
shared struct timeval rf;
shared struct timeval rs;
shared struct timeval done;
shared int id;
upc_lock_t *idlock;

struct GlobalMemory {
    double *t_in_fac;
    double *t_in_solve;
    double *t_in_mod;
    double *t_in_bar;
    double *completion;
    struct timeval starttime;
    struct timeval rf;
    struct timeval rs;
    struct timeval done;
    int id;
    //BARDEC(start)
    //LOCKDEC(idlock)
};
//shared struct GlobalMemory *Global;

struct LocalCopies {
    double t_in_fac;
    double t_in_solve;
    double t_in_mod;
    double t_in_bar;
};

int n;              /* The size of the matrix                           */
int block_size;     /* Block dimension                                  */
int nblocks;        /* Number of blocks in each dimension               */
int num_rows;       /* Number of processors per row of processor grid   */
int num_cols;       /* Number of processors per col of processor grid   */

shared double *a;   /* a = lu; l and u both placed back in a            */
shared double *rhs;
//int *proc_bytes;  /* Bytes to malloc per processor to hold blocks of A */

int test_result = 0;  /* Test result of factorization?                  */
int doprint = 0;      /* Print out matrix values?                       */
int dostats = 0;      /* Print out individual processor statistics?     */

void InitA();
void SlaveStart();
void OneSolve(int, int, shared double *, int, int);
void lu0(shared double *, int, int);
void bdiv(shared double *, shared double *, int, int, int, int);
void bmodd(shared double *, shared double *, int, int, int, int);
void bmod(shared double *, shared double *, shared double *,
          int, int, int, int);
void daxpy(shared double *, shared double *, int, double);
int  BlockOwner(int, int);
void lu(int, int, int, struct LocalCopies *, int);
double TouchA(int, int);
void PrintA();
tv use ) se ) timeval tp 2nd ) 1000000.0 f + ; 1000000.0; har arg , argv [ ℄ ) f int int double double double double int strut #ifdef #endif if i , j ; h ; mint , min maxt , fa , max fa , fa , avg avgt ; min solve , min mod , min bar ; max solve , max mod , max bar ; avg avg mod , avg solve , bar ; pro num ; timeval start ; TRACE FUNC TRACE FUNC ; f (MYTHREAD==0) n=DEFAULT N ; b l o k s i z e =DEFAULT B ; g CLOCK( s t a r t ) ; if f ( !MYTHREAD) while swith ase ase ase ase ase ase ( ( h = g e t o p t ( arg , argv , f ( h ) 'n ' : n = 'b ' : blok a t o i ( optarg ) ; 's ' : dostats size 't ' : test 'o ' : doprint 'h ' : p r i n t f ( " Usage : LU printf (" options : n = < break = break ! test >n n n result ; printf (" nN : Deompose printf (" bB : Use good performane . : Copy non printf (" s : Print printf (" t : Test printf (" o : Print out matrix printf (" h : Print out ommand for n printf (" n ; ; matrix . size of n n" ) ; B. BxB elements should fit Small blok sizes (B=8 , in loal B=16) loally alloated bloks to ahe individual proessor timing statistis . LU output . n%1d n n" ) ; p%1d values . line b%1d n n n" , DEFAULT N, DEFAULT P , DEFAULT B ) ; break exit (0); g ; printf (" n n" ) ; options . n" ) ; 43 n n n n" ) ; work memory n" ) ; p r i n t f (" Default : g f n n" ) ; printf (" use . NxN blok 1) n" ) ; n" ) ; a break break != ; ; ! doprint ; options ; a t o i ( optarg ) ; 1; result = break = "n : p : b : s t o h " ) ) n n" ) ; well . before n n n" ) ; p r i n t f ( " Bloked Dense LU printf (" %d by %d Matrix printf (" %d Proessors printf (" %d by %d printf (" g up for if n break int nbloks ) sqrt (( num ols size n" , b l o k size , blok size ); ) THREADS ) ; == THREADS) size ; nbloks != f n) printf (" n u m r o w s = %d printf (" num ols = %d printf (" nbloks = %d printf (" up n n rhs double ) G MALLOC( n ( shared = ( strut > > > > > Global Global Global Global / t up in ) up matrix round = ( double = ( double t in solve t in bar aross > id all = = // F o r k ( ) G MALLOC(P is where ) G MALLOC(P ) G MALLOC(P ) G MALLOC(P one ( i =1; ode s i z e o f ( double ) ) ; s i z e o f ( double ) ) ; s i z e o f ( double ) ) ; might / allo (); i < P; unneessary i ++) CREATE( S l a v e S t a r t ) due to spmd f 44 distribute memories 0) is GlobalMemory ) ) ; s i z e o f ( double ) ) ; 0; off )); s i z e o f ( double ) ) ; ) G MALLOC(P distributed desired . 
    //Global->idlock = G_MALLOC(sizeof(struct lock));
    //BARINIT(Global->start);
    //LOCKINIT(Global->idlock);
    idlock = upc_all_lock_alloc();
    if (MYTHREAD == 0)
        id = 0;

    /* POSSIBLE ENHANCEMENT: Here is where one might distribute the
       matrix data across the physically distributed memories in a
       round-robin fashion, if desired. */

    InitA();
    if (MYTHREAD == 0 && doprint) {
        printf("Matrix before decomposition:\n");
        PrintA();
    }

    //for (i = 1; i < P; i++)      /* unnecessary due to the spmd model */
    //    CREATE(SlaveStart)
    //SlaveStart(MyNum);
    SlaveStart();

    upc_barrier;
    //WAIT_FOR_END(P-1)

    if (MYTHREAD == 0) {
        if (doprint) {
            printf("\nMatrix after decomposition:\n");
            PrintA();
        }

        if (dostats) {
            maxt = avgt = mint = completion[0];
            for (i = 1; i < THREADS; i++) {
                if (completion[i] > maxt)
                    maxt = completion[i];
                if (completion[i] < mint)
                    mint = completion[i];
                avgt += completion[i];
            }
            avgt = avgt/THREADS;

            min_fac = max_fac = avg_fac = t_in_fac[0];
            min_solve = max_solve = avg_solve = t_in_solve[0];
            min_mod = max_mod = avg_mod = t_in_mod[0];
            min_bar = max_bar = avg_bar = t_in_bar[0];
            for (i = 1; i < THREADS; i++) {
                if (t_in_fac[i] > max_fac)     max_fac = t_in_fac[i];
                if (t_in_fac[i] < min_fac)     min_fac = t_in_fac[i];
                if (t_in_solve[i] > max_solve) max_solve = t_in_solve[i];
                if (t_in_solve[i] < min_solve) min_solve = t_in_solve[i];
                if (t_in_mod[i] > max_mod)     max_mod = t_in_mod[i];
                if (t_in_mod[i] < min_mod)     min_mod = t_in_mod[i];
                if (t_in_bar[i] > max_bar)     max_bar = t_in_bar[i];
                if (t_in_bar[i] < min_bar)     min_bar = t_in_bar[i];
                avg_fac   += t_in_fac[i];
                avg_solve += t_in_solve[i];
                avg_mod   += t_in_mod[i];
                avg_bar   += t_in_bar[i];
            }
            avg_fac   = avg_fac/THREADS;
            avg_solve = avg_solve/THREADS;
            avg_mod   = avg_mod/THREADS;
            avg_bar   = avg_bar/THREADS;
        }
        printf("\n                  PROCESS STATISTICS\n");
        printf("       Total    Diagonal   Perimeter   Interior    Barrier\n");
        printf(" Proc   Time      Time       Time        Time       Time\n");
        printf("    0 %10.6f %10.6f %10.6f %10.6f %10.6f\n",
               completion[0], t_in_fac[0], t_in_solve[0],
               t_in_mod[0], t_in_bar[0]);
        if (dostats) {
            for (i = 1; i < THREADS; i++)
                printf(" %3d %10.6f %10.6f %10.6f %10.6f %10.6f\n",
                       i, completion[i], t_in_fac[i], t_in_solve[i],
                       t_in_mod[i], t_in_bar[i]);
            printf(" Avg %10.6f %10.6f %10.6f %10.6f %10.6f\n",
                   avgt, avg_fac, avg_solve, avg_mod, avg_bar);
            printf(" Min %10.6f %10.6f %10.6f %10.6f %10.6f\n",
                   mint, min_fac, min_solve, min_mod, min_bar);
            printf(" Max %10.6f %10.6f %10.6f %10.6f %10.6f\n",
                   maxt, max_fac, max_solve, max_mod, max_bar);
        }
        printf("\n");

        printf("                  TIMING INFORMATION\n");
        //printf("Start time                        : %16d\n", starttime);
        //printf("Initialization finish time        : %16d\n", rs);
        //printf("Overall finish time               : %16d\n", rf);
        printf("Total time with initialization    : %4.6f\n",
               calc_time(start, rf));
        printf("Total time without initialization : %4.6f\n",
               calc_time(rs, rf));
        printf("\n");

        if (test_result) {
            printf("                  TESTING RESULTS\n");
            CheckResult();
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
    return 0;
}

void SlaveStart()
{
    /* POSSIBLE ENHANCEMENT: Here is where one might pin processes to
       processors to avoid migration. */
    OneSolve(n, block_size, a, MYTHREAD, dostats);
}

void OneSolve(n, block_size, a, MyNum, dostats)
int n;
int block_size;
shared double *a;
int MyNum;
int dostats;
{
    struct timeval myrs, myrf, mydone;
    struct LocalCopies *lc;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    lc = (struct LocalCopies *) malloc(sizeof(struct LocalCopies));
    if (lc == NULL) {
        fprintf(stderr, "Proc %d could not malloc memory for lc\n", MyNum);
        exit(-1);
    }
    lc->t_in_fac = 0.0;
    lc->t_in_solve = 0.0;
    lc->t_in_mod = 0.0;
    lc->t_in_bar = 0.0;

    /* barrier to ensure all initialization is done */
    //BARRIER(Global->start, P);
    upc_barrier;

    /* to remove cold-start misses, all processors begin by touching a[] */
    TouchA(block_size, MyNum);

    //BARRIER(Global->start, P);
    upc_barrier;

    /* POSSIBLE ENHANCEMENT: Here is where one might reset the
       statistics that one is measuring about the parallel execution. */

    if ((MyNum == 0) || (dostats))
        CLOCK(myrs);

    lu(n, block_size, MyNum, lc, dostats);

    if ((MyNum == 0) || (dostats))
        CLOCK(mydone);

    //BARRIER(Global->start, P);
    upc_barrier;

    if ((MyNum == 0) || (dostats))
        CLOCK(myrf);

    t_in_fac[MyNum]   = lc->t_in_fac;
    t_in_solve[MyNum] = lc->t_in_solve;
    t_in_mod[MyNum]   = lc->t_in_mod;
    t_in_bar[MyNum]   = lc->t_in_bar;
    completion[MyNum] = calc_time(myrs, mydone);
    if (MyNum == 0) {
        rs = myrs;
        done = mydone;
        rf = myrf;
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void lu0(a, n, stride)
shared double *a;
int n;
int stride;
{
    int j, k, length;
    double alpha;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < n; k++) {
        /* modify subsequent columns */
        for (j = k+1; j < n; j++) {
            a[k+j*stride] /= a[k+k*stride];
            alpha = -a[k+j*stride];
            length = n-k-1;
            daxpy(&a[k+1+j*stride], &a[k+1+k*stride], length, alpha);
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void bdiv(a, diag, stride_a, stride_diag, dimi, dimk)
shared double *a;
shared double *diag;
int stride_a;
int stride_diag;
int dimi;
int dimk;
{
    int j, k;
    double alpha;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < dimk; k++) {
        for (j = k+1; j < dimk; j++) {
            alpha = -diag[k+j*stride_diag];
            daxpy(&a[j*stride_a], &a[k*stride_a], dimi, alpha);
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void bmodd(a, c, dimi, dimj, stride_a, stride_c)
shared double *a;
shared double *c;
int dimi;
int dimj;
int stride_a;
int stride_c;
{
    int j, k, length;
    double alpha;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (k = 0; k < dimi; k++) {
        for (j = 0; j < dimj; j++) {
            c[k+j*stride_c] /= a[k+k*stride_a];
            alpha = -c[k+j*stride_c];
            length = dimi-k-1;
            daxpy(&c[k+1+j*stride_c], &a[k+1+k*stride_a], length, alpha);
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void bmod(a, b, c, dimi, dimj, dimk, stride)
shared double *a;
shared double *b;
shared double *c;
int dimi;
int dimj;
int dimk;
int stride;
{
    int j, k;
    double alpha;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif
    for (k = 0; k < dimk; k++) {
        for (j = 0; j < dimj; j++) {
            alpha = -b[k+j*stride];
            daxpy(&c[j*stride], &a[k*stride], dimi, alpha);
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

void daxpy(a, b, n, alpha)
shared double *a;
shared double *b;
int n;
double alpha;
{
    int i;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (i = 0; i < n; i++)
        a[i] += alpha*b[i];

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

int BlockOwner(I, J)
int I;
int J;
{
    return ((I%num_cols) + (J%num_rows)*num_cols);
}

void lu(n, bs, MyNum, lc, dostats)
int n;
int bs;
int MyNum;
struct LocalCopies *lc;
int dostats;
{
    int i, il, j, jl, k, kl;
    int I, J, K;
    shared double *A, *B, *C, *D;
    int strI;
    struct timeval t1, t2, t3, t4, t11, t22;
    int diagowner, colowner;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    strI = n;
    for (k = 0, K = 0; k < n; k += bs, K++) {
        kl = k+bs;
        if (kl > n)
            kl = n;

        if ((MyNum == 0) || (dostats))
            CLOCK(t1);

        /* factor diagonal block */
        diagowner = BlockOwner(K, K);
        if (diagowner == MyNum) {
            A = &(a[k+k*n]);
            lu0(A, kl-k, strI);
        }

        if ((MyNum == 0) || (dostats))
            CLOCK(t11);

        //BARRIER(Global->start, P);
        upc_barrier;

        if ((MyNum == 0) || (dostats))
            CLOCK(t2);

        /* divide column k by diagonal block */
        D = &(a[k+k*n]);
        for (i = kl, I = K+1; i < n; i += bs, I++) {
            if (BlockOwner(I, K) == MyNum) {  /* parcel out blocks */
                il = i+bs;
                if (il > n)
                    il = n;
                A = &(a[i+k*n]);
                bdiv(A, D, strI, n, il-i, kl-k);
            }
        }

        /* modify row k by diagonal block */
        for (j = kl, J = K+1; j < n; j += bs, J++) {
            if (BlockOwner(K, J) == MyNum) {  /* parcel out blocks */
                jl = j+bs;
                if (jl > n)
                    jl = n;
                A = &(a[k+j*n]);
                bmodd(D, A, kl-k, jl-j, n, strI);
            }
        }

        if ((MyNum == 0) || (dostats))
            CLOCK(t22);

        //BARRIER(Global->start, P);
        upc_barrier;

        if ((MyNum == 0) || (dostats))
            CLOCK(t3);

        /* modify subsequent block columns */
        for (i = kl, I = K+1; i < n; i += bs, I++) {
            il = i+bs;
            if (il > n)
                il = n;
            colowner = BlockOwner(I, K);
            A = &(a[i+k*n]);
            for (j = kl, J = K+1; j < n; j += bs, J++) {
                jl = j+bs;
                if (jl > n)
                    jl = n;
                if (BlockOwner(I, J) == MyNum) {  /* parcel out blocks */
                    B = &(a[k+j*n]);
                    C = &(a[i+j*n]);
                    bmod(A, B, C, il-i, jl-j, kl-k, n);
                }
            }
        }

        if ((MyNum == 0) || (dostats)) {
            CLOCK(t4);
            lc->t_in_fac   += calc_time(t1, t11);
            lc->t_in_solve += calc_time(t2, t22);
            lc->t_in_mod   += calc_time(t3, t4);
            lc->t_in_bar   += calc_time(t11, t2) + calc_time(t22, t3);
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

//void InitA(double *rhs)
void InitA()
{
    int i, j;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    srand48((long) 1);
    upc_forall (j = 0; j < n; j++; j) {
        for (i = 0; i < n; i++) {
            a[i+j*n] = (double) lrand48()/MAXRAND;
            if (i == j)
                a[i+j*n] *= 10;
        }
    }

    upc_forall (j = 0; j < n; j++; j)
        rhs[j] = 0.0;
    upc_forall (i = 0; i < n; i++; i) {
        for (j = 0; j < n; j++)
            rhs[i] += a[i+j*n];
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
}

double TouchA(bs, MyNum)
int bs;
int MyNum;
{
    int i, j, I, J;
    double tot = 0.0;

#ifdef TRACE_FUNC
    TRACE_FUNC;
#endif

    for (J = 0; J*bs < n; J++) {
        for (I = 0; I*bs < n; I++) {
            if (BlockOwner(I, J) == MyNum) {
                for (j = J*bs; j < (J+1)*bs && j < n; j++)
                    for (i = I*bs; i < (I+1)*bs && i < n; i++)
                        tot += a[i+j*n];
            }
        }
    }

#ifdef TRACE_FUNC_RET
    TRACE_FUNC_RET;
#endif
( I +1) && bs j f < < && n; i j ++) n; n ℄; TRACE FUNC RET TRACE FUNC RET ; 53 f i ++) f #endif return g void f int #ifdef #endif for for ( tot ) ; PrintA ( ) i , j ; TRACE FUNC TRACE FUNC ; ( i =0; i n; j printf (" #ifdef #endif n f i ++) < n; f j ++) p r i n t f ( " %8.1 f g g < ( j =0; " , a [ i+j n ℄); n" ) ; TRACE FUNC RET TRACE FUNC RET ; g void ChekResult ( ) f int double #ifdef #endif double if f i , j , bogus y, = diff , 0; max diff ; TRACE FUNC TRACE FUNC ; y = ( ) mallo (n ( y == NULL) p r i n t e r r ( " Could g exit ( g y[ j ℄ for = g g j g for if max n; j n; = j ++) 1; j i diff = = ( j =0; i < n; > < for y < n; j ) n℄ ( fabs ( d i f f ) f f y[ j ℄; f j ++) = y[ j ℄ y[ j ℄; i ++) a [ i+j f i ++) 0.0; j f n ℄; n℄ =0; j ; a [ i+j ( i =0; y[ i ℄ diff memory )); f j ++) < ( i=j +1; ( j =n g < = y [ j ℄ / a [ j+j y[ i ℄ for for double rhs [ j ℄ ; ( j =0; y[ j ℄ mallo ( 1); ( j =0; for for not sizeof 1.0; > 0.00001) f 54 n n" ) ; bogus g if g g max g ( bogus ) else f = diff 1; = diff ; f p r i n t f ( "TEST FAILED : p r i n t f ( "TEST PASSED n (%.5 f diff ) n n" , max n" ) ; free (y ); #ifdef #endif TRACE FUNC RET TRACE FUNC RET ; g void f g printerr ( onst har f p r i n t f ( s t d e r r , "ERROR: s) %s n n" , s ) ; 55 diff ) ; A.4 Stenil #inlude < #inlude < #inlude < #inlude < #inlude < #ifdef #endif #define #define > > > > > stdio .h stdlib .h math . h assert .h up . h MUPC TRACE RD MUPC TRACE RD shared N 6 ITERS 100 [N℄ int #define void f double #ifdef #endif shared [1℄ n; double double A( i , j ) stenil ( a [ N ℄ [ THREADS N ℄ ; dmax [ THREADS ℄ ; a [ ( i )%N ℄ [ ( ( i ) /N) int i , int N n+( j ) ℄ j ) t ; TRACE FUNC TRACE FUNC ; t = A( i , j ) ; #ifdef EIGHT PT STENCIL t += A( i 1, j t += A( i 1, j ) ; 1); t += A( i 1, j +1); t += A( i , j t += A( i , j + 1 ) ; t += A( i +1 , j t += A( i +1 , j ) ; t += A( i +1 , j + 1 ) ; #else t /= 1); 1); 9.0; t += A( i t += A( i +1 , j ) ; t += A( i , j t += A( i , j + 1 ) ; #endif t /= 1); 5.0; A( i , j ) #ifdef #endif 1, j ) ; = t ; TRACE FUNC RET TRACE FUNC RET ; g int int #ifdef f main ( ) i , j , iter ; TRACE FUNC 56 #endif / TRACE FUNC ; int Ensure n = ( assert ( for for / of sqrt ( (n n) Initialize ( i =0; i double number ) < < ( threads == THREADS A. for f ( j =0; j Update ( i =1; i f g g all / points (N n) ( double ) MYTHREAD; for if 4/8 pt stenil . (N n) 1;++ j ) t h r e a d o f (&A( i , j ) ) ) dmax . = / ( double ) iter + 1.0; barrier ; Chek return with stenil (i , j ); ( i =0; i g = 1;++ i ) (MYTHREAD==u p Update up for < ompletion . goto / THREADS;++ i ) ( dmax [ i ℄ < 0.0) end ; end : g / ITERS;++ i t e r ) dmax [MYTHREAD℄ / ); < < < ( j =1; j / square . ); N;++ j ) ( i t e r =0; i t e r g a barrier ; for f for f if / is THREADS N;++ i ) a [ i ℄ [ N MYTHREAD+ j ℄ up ) 0; 57 /