A Simple Load Balancing Scheme for Task Allocation in Parallel Machines

Larry Rudolph
Department of Computer Science, The Hebrew University, Jerusalem, Israel
(currently visiting IBM T.J. Watson Research Center, Yorktown Heights, NY)

Miriam Slivkin-Allalouf
Jerusalem, Israel

Eli Upfal
IBM Almaden Research Center, San Jose, CA
and Department of Applied Mathematics, The Weizmann Institute of Science, Rehovot, Israel

Abstract

A collection of local workpiles (task queues) and a simple load balancing scheme are well suited for scheduling tasks in shared-memory parallel machines. A single, globally accessible workpile usually carries overheads that grow with the number of processors, so a distributed scheme is desirable; on the other hand, purely local scheduling lets the loads drift apart. The scheme introduced in this paper is simple and fully distributed: whenever a processor accesses its local workpile, it performs a load balancing operation with probability inversely proportional to the size of the workpile. The operation consists of examining the workpile of a random processor and exchanging tasks so as to equalize the sizes of the two workpiles; only tasks awaiting execution are moved. A probabilistic analysis of the scheme proves that each processor receives its fair share of the computation: the expected size of each local workpile is within a small constant factor of the average, i.e. the total number of tasks in the system divided by the number of processors. The scheme thus achieves, at low cost, scheduling comparable to that of a global workpile.

1 Introduction

Scheduling the activities of a parallel machine so as to balance the load can significantly influence its performance: a poor scheduling policy may leave many processors idle while others have more work than they can complete in comparable time. The goal is to balance the load across the processors while keeping the cost of the scheduling itself small.

Many load balancing schemes have been proposed and widely studied. They differ in their assumptions about the machine (shared memory or message passing), in how much must be known about the application, and in how much effort the user is willing to invest. For applications such as finite difference calculations or particle-in-cell methods it is natural to equalize the number of grid points or particles allocated to each processor, since only information at the boundaries is exchanged ([DG89]). Such schemes require a detailed knowledge of the application and of its mapping onto the machine. On the other hand, many users of today's parallel machines are willing to let the system consume some CPU cycles in order to be relieved of the Herculean effort of constructing a clever, machine-specific mapping, and to keep their code portable; for them, adaptive schemes that assume little about the application are more appropriate. We focus here on shared-memory architectures and on asynchronous programs whose tasks are created dynamically, in an unpredictable fashion, as the computation proceeds.

The cost of the load balancing itself is as important as the quality of the resulting schedule. The main strategies are to exchange load information at synchronization points, to find the least loaded node on demand when a processor runs out of work, to query only a few neighbors, or to try to keep up-to-date information by polling at periodic intervals (e.g. [CK79], [BKW89], [JW89]).
Sometimes a central server is considered ([BK90]), and there have also been attempts to use simulated annealing ([FFKS89]). Schemes for hypercube architectures are often directly related to the machine topology ([DG90], [DG89]): many of the proposed schemes either balance the load only between adjacent processors or recursively balance subcubes. Similar in spirit to our approach, Hong et al. [HTC89] balance the load between pairs of processors by equalizing their workpiles; the pairs, however, are fixed and predetermined, the balancing occurs at fixed intervals, and all processors participate, so that the effect of a shared global workpile is approximated. In contrast, [KR89] uses random polling, but only when a PE is idle. We are interested in a scheme that works when tasks are created at arbitrary times, that may balance between any pair of processors at any time, and that adds only a minimal cost.

In this paper we introduce and analyze a simple, distributed, adaptive load balancing scheme for shared-memory parallel machines. Unless tasks are moved by the system, the load becomes unequal: one PE may sit idle with an empty workpile while another has a long one. Our scheme ties the balancing activity to the scheduling operations the processors perform anyway, so that the effort invested in balancing is roughly proportional to the need for it. Although the scheme was designed for tightly coupled shared-memory machines, we believe the results also apply to restricted-communication architectures such as hypercubes; moreover, we are able to prove bounds on the expected behavior of the scheme.

The rest of the paper is organized as follows. We first present our assumptions and the task allocation schemes: the scheme based on a single global workpile, and our load balancing scheme based on local workpiles. The behavior of the load balancing scheme is then analyzed, and finally simulation results are presented.

2 Foundations

2.1 The Model

We consider tightly coupled shared-memory parallel machines, in which communication between processors is inexpensive and memory is globally accessible; we assume a flat memory hierarchy in which access to global memory is uniform (e.g. see [R87]). Each processing element (PE) consists of a CPU, possibly some local memory or registers, and I/O ports. A task is a piece of computation that can be executed by any of the processors, independently of where it was created; it is scheduled as a single, indivisible entity. Tasks are generated dynamically by the program: the number of tasks and the times at which they appear are not known in advance, and different tasks may require very different amounts of computation. We assume time-slicing: a task executes on a processor for at most a given time-quantum, after which it either terminates, suspends, or is returned to a workpile to be resumed later. We also assume that the work-load is large enough that the number of tasks in the system exceeds the number of PEs, so there is sufficient parallelism to keep the PEs busy.

The workpile is the data structure that holds the ready-to-run tasks, i.e. the tasks waiting to be executed. We use the term workpile instead of the traditional term task queue because a strict FIFO order is not required: tasks may be inserted and deleted at either end, and the order in which tasks of the same priority are executed is not significant. Tasks may be associated with different priorities maintained within the workpile, and hybrid schemes are possible; here we consider all tasks to have the same priority.

2.2 The Global Workpile

The first task allocation scheme is based on a single global workpile: all tasks in the system are inserted into, and removed from, the same globally accessible data structure. An idle processor removes a task from the global workpile and executes it until the task terminates, suspends, or a time-quantum elapses; a task that has not terminated is then returned to the global workpile and the processor removes another. If the global workpile is empty, the processor remains idle and retries at a later time, until a task is added. Access to the global workpile is sequential: if multiple processors attempt to insert or remove tasks simultaneously, the insertions and deletions are serialized in an arbitrary fashion, and a processor may be idle for a short period while waiting for its turn.
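To make the global-workpile scheme concrete, the following is a minimal sketch of such a centralized, serialized workpile. It is our illustration rather than code from the paper; the names (GlobalWorkpile, Task, worker), the use of Python threads, and the decision to let a worker stop when the pile is empty are assumptions of the sketch.

    # Illustrative sketch (not from the paper) of a single global workpile:
    # every insertion and deletion is serialized through one lock, as in Section 2.2.
    import threading
    from collections import deque

    class Task:
        def __init__(self, quanta):
            self.quanta = quanta          # remaining time-quanta of computation

    class GlobalWorkpile:
        def __init__(self, tasks):
            self.queue = deque(tasks)     # the single FIFO shared by all PEs
            self.lock = threading.Lock()  # all PEs serialize here

        def get(self):
            with self.lock:
                return self.queue.popleft() if self.queue else None

        def put(self, task):
            with self.lock:
                self.queue.append(task)

    def worker(pile):
        while True:
            task = pile.get()
            if task is None:              # empty workpile: this PE gives up
                return                    # (a real PE would idle and retry)
            task.quanta -= 1              # execute the task for one time-quantum
            if task.quanta > 0:
                pile.put(task)            # unfinished tasks return to the global pile

    pile = GlobalWorkpile(Task(3) for _ in range(100))
    threads = [threading.Thread(target=worker, args=(pile,)) for _ in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()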
The global workpile keeps the load perfectly balanced in the sense that no processor is idle while there is a task ready to be executed. This is true, however, only as long as access to the single data structure is not a bottleneck: with a large number of processors, the serialized insertions and deletions, the associated memory traffic, and the context switches become significant.

3 Balancing Local Workpiles

An alternative is for each processor to maintain its own local workpile. The system then consists of m distinct local workpiles, indexed 0 to m - 1, one per processor. Each workpile is organized as a FIFO queue; a processor chooses the next task to execute only from its own workpile, and a task, once begun, is not moved.

[Figure 1 shows a bar chart of the workpile sizes of PE1 through PE8; the plot itself is omitted.]
Figure 1: Each PE has an associated workpile and deletes and executes tasks only from this workpile. A load balancing operation, executed by PE_i with probability 1/l_i, where l_i is the length of its workpile, may move tasks from one workpile to another.

Two types of load balancing are possible. The first is the initial placement of newly created tasks: a task may be inserted into the local workpile of the creating processor, spread round-robin over the workpiles (static placement), or placed on a randomly chosen workpile. The second is dynamic balancing, in which tasks that are already waiting in the system may be moved from one workpile to another whenever the load becomes unequal. A combination of the two is of course possible; we focus mostly on the latter case.
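The simulations in Section 4 compare two initial-placement policies for newly created tasks. The sketch below is ours, not the paper's; the function names and the reading of "static" placement as round-robin assignment are illustrative assumptions.

    # Illustrative sketch of the two initial-placement policies (not from the paper).
    import random

    def make_placer(policy, m, seed=0):
        """Return a function mapping each newly created task to a workpile index 0..m-1."""
        rng = random.Random(seed)
        counter = 0
        def static_placer(task):
            nonlocal counter
            i = counter % m               # round-robin over the m workpiles
            counter += 1
            return i
        def random_placer(task):
            return rng.randrange(m)       # uniformly random workpile
        return static_placer if policy == "static" else random_placer

    workpiles = [[] for _ in range(8)]
    place = make_placer("static", m=8)
    for task in range(20):                # spread 20 newly created tasks
        workpiles[place(task)].append(task)
    print([len(w) for w in workpiles])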
The main conflict in designing a load balancing mechanism is between the desire to have global knowledge of the load throughout the system, so that each processor knows how much work there is and with whom it should share, and the disadvantages of maintaining that knowledge: a master controller on some node, explicit synchronization, a fixed balancing period, and interference with useful work. Load balancing has traditionally been executed in one of two ways: a PE polls the other PEs for work whenever it is idle, or the whole system is polled at periodic intervals. Instead, we propose a solution in which the decision as to when to balance, and with whom, is made locally and independently by each processor, with no central control and no fixed period. The striking feature of the scheme is that it is adaptive: the frequency with which a processor attempts to balance is a function of its own load at the time.

The scheme is as follows. Each PE i executes tasks from its own workpile. Whenever PE i accesses its workpile to obtain the next task, it flips a coin: with probability 1/l_i, where l_i is the current length of its workpile, it performs a load balancing operation (Figure 2). The operation chooses another workpile r uniformly at random; both Workpile_i and Workpile_r are locked; if the difference between their sizes is greater than a certain threshold, tasks are moved from the end of the longer workpile to the end of the shorter one so as to equalize their sizes; the workpiles are then unlocked and PE i continues with its next task.

Figure 2: The load balancing operation executed by PE i when it accesses its workpile.

    with probability 1/l_i do
        r = random number in the range [1, p]
        lock Workpile_i and Workpile_r
        if |size of Workpile_i - size of Workpile_r| > threshold then
            equalize the sizes of Workpile_i and Workpile_r
        unlock Workpile_i and Workpile_r
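The following runnable sketch mirrors the operation of Figure 2. The threshold value, the lock ordering used to avoid deadlock, and the choice to always balance when the workpile is empty are our assumptions; the paper fixes only the 1/l_i trigger probability and the equalization of the two workpiles.

    # Minimal runnable sketch of the local-workpile balancing scheme (our reading of Figure 2).
    import random, threading
    from collections import deque

    p = 8                                      # number of PEs / workpiles
    THRESHOLD = 2                              # assumed size difference needed to move tasks
    workpiles = [deque() for _ in range(p)]
    locks = [threading.Lock() for _ in range(p)]

    def balance(i):
        """The load balancing operation of PE i: pick a random peer and equalize."""
        r = random.randrange(p)
        if r == i:
            return
        a, b = sorted((i, r))                  # fixed lock order avoids deadlock (our choice)
        with locks[a], locks[b]:
            big, small = (i, r) if len(workpiles[i]) >= len(workpiles[r]) else (r, i)
            diff = len(workpiles[big]) - len(workpiles[small])
            if diff > THRESHOLD:
                for _ in range(diff // 2):     # move tasks from the end of the longer
                    workpiles[small].append(workpiles[big].pop())  # pile to the shorter one

    def next_task(i):
        """PE i fetches its next task; first it may perform a balancing operation."""
        n = len(workpiles[i])
        if n == 0 or random.random() < 1.0 / n:   # trigger with probability 1/l_i
            balance(i)
        with locks[i]:
            return workpiles[i].popleft() if workpiles[i] else None

    workpiles[0].extend(range(40))             # badly unbalanced start: all tasks on PE 0
    next_task(3)                               # a lightly loaded PE almost surely balances
    print([len(w) for w in workpiles])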
In summary, let l_{i,t} be the load of processor i at time t, i.e. the number of tasks on its workpile, and let t be a time at which processor i accesses its local workpile, before executing the next task. Processor i then initiates a load balancing operation with probability 1/l_{i,t}. The load of a processor thus dictates the frequency of its balancing attempts, and the fraction of time a processor invests in load balancing is roughly inversely proportional to its load: a heavily loaded processor only rarely attempts a load balance, while a lightly loaded one attempts it frequently. Furthermore, when the workpile sizes are roughly equal it is very rare that one processor accesses the workpile of another, so in a balanced system the scheme causes little interference; task migration is attempted frequently only where it is needed. In our implementation we use an exponential backoff: a processor whose balancing attempt fails, for example because the chosen workpile is currently being accessed, does not wait but backs off before its next attempt.

4 Results

Of prime importance is how well the scheme performs for real workloads. Many factors influence the performance of a scheduling scheme in a real implementation: the type of application, the memory access times, the network traffic, and the details of the implementation. We present three kinds of results: simulations of the scheme on accepted parallel program skeletons, an analytical proof that the expected load of every processor remains within a constant factor of the average, and an example that explains where the decentralized scheme differs from the centralized one.

4.1 Simulations

We present the results of a simulation of an ideal parallel computer. The simulation makes many simplifications:

1. there is no overhead in creating tasks or in accessing a workpile;
2. there is no overhead in moving tasks from the end of one workpile to the end of another;
3. there is no overhead in moving a thread from one processor to another, i.e. no cost for updating caches or pages;
4. there is no contention: insertions and deletions on the global workpile take no time;
5. no starvation occurs: workpiles are accessed in FIFO order, and a task that migrates is moved to a position closer to the head of its new queue, so all tasks are eventually executed.
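A toy discrete-time simulator in the spirit of the idealized assumptions above (queue operations and task moves cost nothing, one quantum per PE per step) is sketched below. It is our illustration, not the simulator used for Figures 3-6; the parameters and the random initial placement are assumptions of the sketch.

    # Toy discrete-time simulation under the idealized assumptions (not the paper's simulator).
    import random
    from collections import deque

    def simulate_local(task_lengths, p, threshold=2, seed=0):
        rng = random.Random(seed)
        piles = [deque() for _ in range(p)]
        for length in task_lengths:
            piles[rng.randrange(p)].append(length)   # random initial placement
        steps = 0
        while any(piles):
            steps += 1
            for i in range(p):                       # each PE executes one quantum per step
                n = len(piles[i])
                if n == 0 or rng.random() < 1.0 / n: # balancing trigger, probability 1/l_i
                    r = rng.randrange(p)
                    big, small = (i, r) if len(piles[i]) >= len(piles[r]) else (r, i)
                    diff = len(piles[big]) - len(piles[small])
                    if diff > threshold:
                        for _ in range(diff // 2):   # moving tasks is free in this model
                            piles[small].append(piles[big].pop())
                if piles[i]:
                    remaining = piles[i].popleft() - 1   # run the task for one quantum
                    if remaining > 0:
                        piles[i].append(remaining)       # unfinished tasks re-enter the pile
        return steps

    print(simulate_local([5] * 100, p=8))        # completion time in quanta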
All of these assumptions favor the global workpile scheme: under them the load balancing takes zero time, and the serialization and contention overheads, which are higher in the global scheme than in any other, are ignored. We wish to argue that even when the global workpile is granted these benefits, the decentralized adaptive scheme is competitive with it. There are, of course, situations in which one or the other is preferable; the ultimate measure depends on how costly these overheads are on a particular machine.

We present results for four experiments based on two application skeletons, run with three scheduling schemes: local workpiles with load balancing (with static and with random initial placement), a single global workpile, and local workpiles with no load balancing. The first application is a parallel version of quicksort: an array of random elements is recursively split into two smaller sets, a task is spawned for each, and the splitting continues until the sets fall below a chosen size. The second is a master/slave skeleton: a master task spawns a number of slave tasks, each applied to an element or a set of elements; the slaves run for a short time and terminate, and once they have all finished the master again spawns a set of slaves; this continues for a number of iterations. In both cases we vary the number of processors and the task sizes and measure the completion time.

Figures 3, 4, and 5 present the results. With uniform memory access costs (Figures 3 and 4) the applications perform poorly without load balancing, and the local load balancing scheme performs at least as well as the global workpile; when the number of tasks is only a little larger than the number of processors, it performs better. The simulation also models memory references: a reference is charged as local, shared, or remote; all the references of a task are at first charged as local, and if the task migrates they are charged as remote, while references to shared values are always charged as shared. With nonuniform memory access costs (Figure 5) the global workpile performs about twice as badly as the load balancing scheme, since a task removed from the global workpile generally executes on a processor other than the one on which it was created, so its references are charged at the highest, remote rate; without any load balancing the performance is just as bad, because the load is unbalanced from the start and processing power is wasted.

We also measured how much the local workpile lengths differ from the global average. At each time quantum we computed the squared differences of the workpile lengths from the average, summed them over the processors, and then averaged over all time steps. We ran 10 master/slave applications together; the results are shown in Figure 6. The load balancing keeps the size of the workpiles very near the average.

4.2 Analysis

We now prove that the load balancing scheme keeps the expected load of every processor within a constant factor of the average load. Assume the system has n processors and that time proceeds in discrete steps. Denote by L_{p,t} the load of processor p at the start of step t, i.e. the number of tasks in its workpile, and by A_{p,t} the change in the load of p during step t due to the creation of new tasks and the termination of executing ones (the number of tasks created at p minus the number that ran to completion). We assume that |A_{p,t}| is bounded by a constant δ. Let L_t = Σ_p L_{p,t} denote the total load of the system at time t and Λ_t = L_t / n the average load, and let P_{p,t} denote the probability that processor p initiates the load balancing procedure at step t. The theorem states that if the system starts balanced, the expected loads of the processors stay within a constant factor of the average load; note that it assumes very little about the schedule by which tasks are created and terminated.

Theorem 1. There are constants α and C, independent of the number of processors n and of the distribution of the A_{p,t} (the schedule of tasks to the processors), such that if the system starts balanced and the load balancing scheme is used, then for every processor p and every time step t,

    E[L_{p,t}] ≤ α Λ_t + C.
Proof: To simplify the analysis we consider a weaker load balancing scheme in which P_{p,t}, the probability that processor p initiates the load balancing procedure at time t, is capped by a constant, and in which a balancing operation never moves more than half of the difference between the two workpiles; a bound proved for this restricted scheme bounds the original one as well. The proof is by induction on t. The claim holds for t = 0 because the system starts balanced. Assume the induction hypothesis E[L_{p,t}] ≤ αΛ_t + C holds at step t; we bound E[L_{p,t+1}]. Write

    E[L_{p,t+1}] = E[L_{p,t}] + E[A_{p,t}] + E[C_1] - E[C_2],

where C_1 is the load p gains from a balancing with a more loaded processor (whether chosen by p or choosing p), and C_2 is the load p loses to a less loaded processor. To bound E[C_1], observe that the balancing operation equalizes the two workpiles, so p never receives more than half of the load of the processor with which it balances, and that the probability that a particular processor z chooses p is 1/(n - 1); together with the induction hypothesis this bounds E[C_1] by a term proportional to Λ_t plus a constant. For E[C_2] we need a lower bound. Let M_t denote the set of processors whose load at time t is less than 2Λ_t; since at most half of the processors can have load of at least twice the average, |M_t| ≥ n/2. For each z in M_t there is probability at least 1/(2(n - 1)) that p and z are matched for a balancing at this step while p is involved in no other balancing, and in that case p loses at least half of the difference between its load and that of z; hence, when L_{p,t} ≥ 2Λ_t,

    E[C_2] ≥ E[ Σ_{z∈M_t} (L_{p,t} - L_{z,t}) / (4(n - 1)) ] ≥ (1/8) E[L_{p,t} - 2Λ_t].

Choosing the constants α and C appropriately, as functions of δ and of the balancing threshold, recalling that Λ_{t+1} ≥ Λ_t - δ, and combining the two bounds with the induction hypothesis yields E[L_{p,t+1}] ≤ αΛ_{t+1} + C, which completes the induction. □
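As an illustration of the quantity the theorem bounds (ours, not part of the paper or its proof), the following sketch tracks the maximum workpile length against the average load Λ_t for an arbitrary bounded arrival/termination process with |A_{p,t}| ≤ δ, using the 1/L_{p,t} balancing trigger; all parameter values are illustrative.

    # Empirical illustration of Theorem 1's claim (not part of the proof).
    import random

    def run(n=16, steps=2000, delta=2, threshold=2, seed=1):
        rng = random.Random(seed)
        load = [10] * n                                  # start balanced, as the theorem assumes
        worst = 0.0
        for _ in range(steps):
            for p in range(n):                           # arbitrary bounded load changes
                load[p] = max(0, load[p] + rng.randint(-delta, delta))
            for p in range(n):                           # balancing trigger with prob. 1/L_{p,t}
                if load[p] == 0 or rng.random() < 1.0 / load[p]:
                    r = rng.randrange(n)
                    if abs(load[p] - load[r]) > threshold:
                        total = load[p] + load[r]
                        load[p], load[r] = total // 2, total - total // 2
            avg = sum(load) / n
            if avg > 0:
                worst = max(worst, max(load) / avg)
        return worst                                     # max over time of max_p L_p / average

    print(run())   # remains a small constant in our runs, in line with the theorem's form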
4.3 An Example

It is interesting to consider a case that, at first blush, seems ideally suited to a centralized queue. Assume p PEs and p + 1 tasks: p short tasks, each requiring s time units, and one long task requiring h units, with s much smaller than h and a time quantum of at least s. With local workpiles there are two possibilities: if the long task is alone in a workpile, the whole computation takes h time units; otherwise, with probability 1/p, the long task shares a workpile with a short task, and the expected completion time is h + s/2, depending on which of the two runs first. The expected completion time with local workpiles is therefore

    (1/p)(h + s/2) + ((p - 1)/p) h = h + s/(2p).

With a single global FIFO task queue the computation completes within

    h - s + ((p + 1)/p) s = h + s/p

time units. Since s is much smaller than h, both schemes complete in essentially h time units; the global task queue does not provide better scheduling even in this case.
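A quick check of the arithmetic in the example, with illustrative values of p, h, and s:

    # Check of the two completion-time expressions from the example above.
    p, h, s = 8, 1000.0, 10.0
    local_piles  = (1/p) * (h + s/2) + ((p - 1)/p) * h    # = h + s/(2p)
    global_queue = h - s + ((p + 1)/p) * s                # = h + s/p
    assert abs(local_piles  - (h + s/(2*p))) < 1e-9
    assert abs(global_queue - (h + s/p)) < 1e-9
    print(local_piles, global_queue, global_queue - local_piles)   # difference = s/(2p)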
5 Conclusions

We believe that local workpiles with the load balancing scheme are well suited for shared-memory parallel machines, since they provide the programmer with a higher level of abstraction: to program the machine one does not need to know the exact number of processors available, nor the current load on them; the load balancing adapts dynamically as the load changes.

It is interesting to consider how the scheme performs when faced with the popular "loosely synchronous" programming paradigm. Here the program consists of a set of tasks that execute in a loosely synchronous fashion: each task alternates periods of computation with periods of communication or information exchange with all the other tasks. If there are enough tasks, that is, if the total number of tasks is larger than the number of processors, then our scheme is expected to perform well. Now suppose that several such programs are running concurrently in a multiprogrammed environment; tasks belonging to the same program may then cluster, being placed in the same workpile, and the balancing scheme spreads them out again. Our experimental results showed little difference between the two placement schemes in terms of the distribution of work among the local workpiles; the results thus encourage the use of local workpiles with dynamic balancing.

One of the weaknesses of our analysis is that it requires the system to begin in a balanced state; the scheme then keeps it balanced as the load changes. It can be shown, however, that the demand-driven balancing also decreases a peak: if the load at one processor gets much higher than at the others, forming a "mountain", the balancing operations thin it out as new tasks are generated and executed. A possible refinement is to set the balancing frequency, or the load balancing probabilities, as a function of the peak load. Another extension is to treat newly created tasks differently from tasks that have already executed: it is cheaper to move a task before it has begun execution, so one might move only such tasks, or base the decision on both the load difference and the cost of the different types of migration. Finally, non-fully-interconnected topologies such as the hypercube should be investigated, where the cost of moving tasks between non-adjacent processors is higher; although our scheme treats the network as completely connected, it is worthwhile to ask whether sampling only nearby processors achieves the same result or not.

References

[BK90] F. Bonomi and A. Kumar, "Adaptive optimal load balancing in a nonhomogeneous multiserver system with a central job scheduler," IEEE Transactions on Computers, Vol. 39, pp. 1232-1250, Oct. 1990.

[BKW89] K. Baumgartner, R. Kling, and B. Wah, "An implementation of GAMMON: an efficient load balancing strategy for a local computer system," Proceedings of the International Conference on Parallel Processing, Vol. 2, pp. 77-80, Aug. 1989.

[CK79] Y.C. Chow and W. Kohler, "Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System," IEEE Transactions on Computers, Vol. C-28, pp. 334-361, May 1979.

[DG89] K. Dragon and J. Gustafson, "A low cost hypercube load balancing algorithm," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, March 1989.

[DG90] F. Dehne and M. Gastaldo, "A note on the load balancing problem for coarse grained hypercube dictionary machines," Parallel Computing, pp. 75-79, Nov. 1990.

[FFKS89] G. Fox, W. Furmanski, J. Koller, and P. Simic, "Physical optimization and load balancing algorithms," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, pp. 591-594, March 1989.
[HTC89] J. Hong, X. Tan, and M. Chen, "Dynamic load balancing on hypercubes," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, pp. 595-598, March 1989.

[JW89] J. Juang and B. Wah, "Load balancing and ordered selections in a computer system with multiple contention buses," Journal of Parallel and Distributed Computing, Vol. 7, pp. 391-415, 1989.

[K89] J. Keller, "Load balancing and the MOOS operating system," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, pp. 599-602, March 1989.

[KR89] V. Kumar and V. Rao, "Load balancing on the hypercube architecture," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, pp. 603-608, March 1989.

[R87] G. Raetz, "Sequent general purpose parallel processing system," Northcon/87, pp. 7/2/1-5, Sept. 1987.

[Figure 3 plots completion time versus the number of processors P (2 to 18) for quicksort of size 500 under uniform memory access costs, for load balancing with static and with random initial placement, the global workpile, and no load balancing; the plot itself is omitted.]
Figure 3: Quicksort of size 500; uniform memory access costs. Without load balancing the application does not yield good performance. Even when memory accesses carry no extra cost the scheduling strategy clearly affects performance, and the local load balancing scheme yields the best results.

[Figure 4 plots completion time versus the number of processors P (2 to 9) for the master/slave application of size 8 under uniform memory access costs; the plot itself is omitted.]
Figure 4: Master/slave (size 8); uniform memory access costs. This data shows that when memory access costs are the same and the number of tasks is only a little larger than the number of processors, our load balancing gives better results than the global workpile.

[Figure 5 plots completion time versus the number of processors P (2 to 18) for the master/slave application of size 64 under nonuniform memory access costs; the plot itself is omitted.]
Figure 5: Master/slave (size 64); nonuniform memory access costs. The master spawns 64 slave tasks, each executing a total of 64 memory accesses; after all the slaves finish, another batch is started. A local memory access is charged 1 unit, an access to shared data 2 units, and a remote access 3 units; when a task migrates, its accesses are charged at the remote rate. Without load balancing the performance degrades due to the poor distribution of the load, and the global workpile scheduling is about twice as bad, since under it nearly all accesses are charged at the highest, remote rate, even though very few tasks actually need to move.
[Figure 6 plots, for 2 to 18 processors, the time-averaged squared deviation of the workpile lengths from the average load for local workpiles with and without load balancing, under static and random initial placement; the plot itself is omitted.]
Figure 6: We ran 10 master/slave applications together and measured how much the local workpile lengths differed from the global average. Plotted is

    Σ_{t∈T} Σ_{i=1..n} (l_{i,t} - Λ_t)²,

where l_{i,t} is the length of the workpile of processor i at time t, Λ_t is the average length of the workpiles at time t, T ranges over all time values, and n is the number of processors. We see that the load balancing keeps the size of the workpiles very near the average.