Parallel Algorithms for Personalized Communication and Sorting With an Experimental Study
(Extended Abstract)

David R. Helman    David A. Bader*    Joseph JaJa†
Institute for Advanced Computer Studies, and
Department of Electrical Engineering,
University of Maryland, College Park, MD 20742
{helman, dbader, joseph}@umiacs.umd.edu

Abstract

A fundamental challenge for parallel computing is to obtain high-level, architecture independent, algorithms which efficiently execute on general-purpose parallel machines. With the emergence of message passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of these communication primitives. While existing primitives allow an assortment of collective communication routines, they do not handle an important communication event when most or all processors have non-uniformly sized personalized messages to exchange with each other. We first present an algorithm for the h-relation personalized communication whose efficient implementation will allow high performance implementations of a large class of algorithms. We then consider how to effectively use these communication primitives to address the general problem of sorting. Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing and irregular communication or multiple rounds of all-to-all personalized communication. We introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicates without the overhead of tagging each element. The algorithms presented in this paper have been coded in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, the Cray Research T3D, the Meiko Scientific CS-2, and the Intel Paragon. Our experimental results are consistent with the theoretical analyses and illustrate the scalability and efficiency of our algorithms across the different platforms. In fact, they seem to outperform all similar algorithms known to the authors on these platforms, and their performance is invariant over the set of input distributions, unlike previous efficient algorithms. Our sorting results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.

Keywords: Parallel Algorithms, Personalized Communication, Sorting, Sample Sort, Radix Sort, Parallel Performance.

1 Introduction

A fundamental challenge for parallel computing is to design high-level, architecture independent algorithms that execute efficiently on general-purpose parallel machines. The aim is to be able to achieve portability and high performance at the same time. There are two factors that currently make this problem more tractable. The first is the emergence of a dominant parallel architecture consisting of a number of powerful off-the-shelf microprocessors interconnected by either a proprietary interconnect or a standard interconnect. The second factor is the emergence of standards, such as the message passing standard MPI [20], for which machine builders try to provide efficient support and which software developers can use to write portable code. It is considerably easier for programmers to achieve high performance by designing algorithms around a standard set of communication primitives (say, using PVM or MPI) than by tuning sophisticated code to the peculiarities of each specific machine. Our work builds on these developments by presenting parallel algorithms that are architecture independent and that achieve portability and high performance by making use of such a standard set of communication primitives.

*The support by NASA Graduate Student Researcher Fellowship NGT-50951 is gratefully acknowledged.
†Supported in part by NSF grant No. CCR-91113135 and NSF HPCC/GCAG grant No. BIR-9318183.
2 Main Results

In this section we state the main results of this paper concerning the two problems of personalized communication and sorting. We start with a brief outline of the framework we use for designing our parallel algorithms and for evaluating their communication and computation costs.

We view a parallel algorithm as a sequence of local computations interleaved with communication steps, and we allow computation and communication to overlap. Assuming no congestion, the transfer of a block of m words between two processors takes O(τ + σm) time, where τ is the latency of the network and σ is the time per word at which a processor can inject or receive data from the network. This cost model can be justified on most current parallel machines, and it is similar in spirit to models that have recently appeared in the literature, such as BSP [24], LogP [13], LogGP [1], and the Block Distributed Memory model [17], given their stated assumptions about memory and congestion. The cost of each algorithm will be evaluated in terms of T_comp(n,p), the maximum amount of time spent by any processor in local computation, and T_comm(n,p), the maximum amount of time spent by any processor in communication, each expressed as a function of the input size n and the number of processors p.

Our algorithms were implemented in SPLIT-C [11], an extension of C for distributed memory machines, and make use of a small set of MPI-like collective communication primitives that we have actually implemented on a number of parallel systems [4, 6, 5]. These primitives, described in detail in [6] (see also, for example, the CCL library for the IBM POWERparallel machines [7]), include transpose, bcast, gather, and scatter. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every other processor, and all the blocks are of the same size. The bcast primitive broadcasts a block of data from a single source to all the remaining processors. The gather primitive coalesces into a single array residing on one processor the blocks distributed across all of the processors, whereas its companion scatter divides an array residing on one processor into equal-sized blocks distributed to the remaining processors.
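To make the transpose primitive concrete, the following minimal sketch shows how such a primitive maps onto the MPI standard discussed above. This is our own illustration, not the authors' SPLIT-C implementation; the function name and the use of MPI_Alltoall over MPI_COMM_WORLD are assumptions.

    #include <mpi.h>

    /* transpose: each processor sends a distinct block of b integers to
     * every processor (block j of sendbuf goes to processor j), and all
     * the blocks have the same size b; recvbuf receives block i from
     * each processor i.  This is exactly MPI's regular all-to-all
     * exchange.                                                        */
    void transpose(const int *sendbuf, int *recvbuf, int b)
    {
        MPI_Alltoall((void *)sendbuf, b, MPI_INT,
                     recvbuf, b, MPI_INT, MPI_COMM_WORLD);
    }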
The two problems we address can be stated as follows.

Problem 1 (h-Relation Personalized Communication): Each processor holds a set of messages, where each message consists of a pair (data, dest) and dest is the index of the processor to which data is to be routed. Let h be the maximum, over all processors, of the total amount of data originating at or destined to a single processor. The problem is to route all of the messages to their final destinations.

Problem 2 (Sorting): Rearrange n input elements, equally distributed amongst the p processors, such that they appear in non-decreasing order starting from the smallest indexed processor.

Brief descriptions of our results are the following:

(1) A novel deterministic algorithm for the h-relation personalized communication that uses two rounds of the transpose collective communication primitive.
Significance: The h-relation personalized communication is an important communication event whose efficient solution is required by a large class of parallel algorithms (e.g. [24, 22]); a similar deterministic two-stage approach has been used for routing and sorting on meshes [18]. Ranka et al. [21] have independently described a similar algorithm, but ours maintains a tighter bound on the intermediate rounds and seems to be very efficient in practice across different platforms. An earlier draft of these results appeared as [3], and experimental results on several MPPs appear in our companion paper [4].

(2) A new variation of sample sort that uses only two rounds of regular all-to-all personalized communication and achieves very good load balancing with virtually no overhead. Additionally, it maintains almost the same communication bounds and execution times in the presence of duplicates, which it handles without the overhead of tagging each element.

(3) A new deterministic algorithm for stable integer sorting (radix sort) that makes efficient use of our personalized communication scheme in (1).
Significance: We have developed a suite of benchmarks related to Problem 2 and conducted extensive experimentation. Our sorting algorithms have consistently performed very well on the different benchmarks and across the different platforms, and they seem to outperform all similar algorithms known to the authors on these platforms. Their performance even compares favorably with the machine-specific implementations reported by some vendors for the Class A NAS IS Benchmark, which only requires the easier task of ranking.

3 Personalized Communication

For ease of presentation, we describe the personalized communication algorithm for the case when the input is initially evenly distributed; the general case, in which the elements are initially distributed in an uneven manner amongst the processors, is handled in [4]. Consider a set of n elements distributed amongst p processors such that each processor holds n/p pairs (data, dest), where dest is the processor to which data is to be routed. There are no assumptions made about the pattern, other than that no processor is the origin or the destination of more than h elements (and, for simplicity, that h is an integral multiple of p).
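To illustrate the parameter h, the sketch below computes it for a concrete message set: each processor counts the messages it originates for every destination, learns via an all-to-all exchange how many messages it will receive, and the global maximum of the two totals is h. The message_t layout and the function names are illustrative assumptions, not code from [4].

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { int data; int dest; } message_t;

    /* Return h, the maximum over all processors of the number of
     * messages a processor sends or receives; an input realizing this
     * pattern is an h-relation.                                       */
    int compute_h(const message_t *msg, int n_local, int p)
    {
        int *send_cnt = calloc(p, sizeof(int));
        int *recv_cnt = malloc(p * sizeof(int));
        int sent = n_local, recvd = 0, h_local, h;

        for (int i = 0; i < n_local; i++)
            send_cnt[msg[i].dest]++;            /* messages we originate */

        /* recv_cnt[j] = number of messages processor j sends to us */
        MPI_Alltoall(send_cnt, 1, MPI_INT, recv_cnt, 1, MPI_INT,
                     MPI_COMM_WORLD);
        for (int j = 0; j < p; j++)
            recvd += recv_cnt[j];               /* messages we receive   */

        h_local = sent > recvd ? sent : recvd;
        MPI_Allreduce(&h_local, &h, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
        free(send_cnt); free(recv_cnt);
        return h;
    }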
3.1 The Algorithm

In our solution, we use two rounds of the transpose communication primitive. In the first round, each element is routed to an intermediate destination, and in the second round, it is routed to its final destination. The pseudocode for our algorithm is as follows:

Step (1): Each processor P_i assigns each of its elements to one of p bins according to the following rule: the elements sharing a common destination j are distributed cyclically amongst the p bins, so that if one such element is placed into bin b, the next element with destination j is placed into bin ((b + 1) mod p). We established in [4] that no bin can receive more than (h/p + p) elements.

Step (2): Each processor P_i routes the contents of bin j to processor P_j, for (0 ≤ i, j ≤ p − 1). Since no bin can have more than (h/p + p) elements, this is equivalent to performing a transpose operation with block size (h/p + p).

Step (3): Each processor P_i rearranges the elements received in Step (2) into p bins according to each element's final destination.

Step (4): Each processor P_i routes the contents of bin j to processor P_j, for (0 ≤ i, j ≤ p − 1). Since, as established in [4], no bin can have more than (h/p + p) elements, this is again equivalent to performing a transpose operation with block size (h/p + p).

The overall complexity of our algorithm is T_comp(n,p) = O(h) and T_comm(n,p) ≤ 2(τ + (h + p²)σ), which is asymptotically optimal with very small constants whenever p² ≤ n.
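The key to Step (1) is the cyclic placement of same-destination elements, which bounds every bin by (h/p + p) and so makes the first transpose regular. A minimal sketch of this binning is given below; the choice of starting bin and the caller-allocated bin arrays are our own simplifications of the scheme in [4].

    #include <stdlib.h>

    typedef struct { int data; int dest; } message_t;  /* as before */

    /* Step (1) of the two-round routing: elements destined for the same
     * processor j are dealt cyclically over the p bins, so no bin fills
     * much beyond h/p.  next[j] remembers the bin that received the
     * last element destined for j.                                     */
    void assign_to_bins(const message_t *msg, int n_local, int p,
                        message_t **bin, int *bin_size)
    {
        int *next = malloc(p * sizeof(int));
        for (int j = 0; j < p; j++) {
            next[j] = j;              /* starting bin: an assumption */
            bin_size[j] = 0;
        }
        for (int i = 0; i < n_local; i++) {
            int j = msg[i].dest;
            int b = next[j];
            bin[b][bin_size[b]++] = msg[i];   /* place the element   */
            next[j] = (b + 1) % p;            /* cyclic rule         */
        }
        free(next);
    }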
3.2 Summary of Experimental Results

We examine the performance of our h-relation personalized communication algorithm using four values of h: h = n/p, 2n/p, 4n/p, and 8n/p, on various configurations of the IBM SP-2 and the Cray T3D. For h = n/p, the data movement is balanced: each processor is the origin and the destination of exactly n/p elements. For h > n/p, instead of an equal number of elements per processor, the input of size n is distributed so that the number of elements for which a given processor is the destination varies from approximately 0 up to h, hence representing an extremely unbalanced case; Figure 1 shows the resulting distribution.

Figure 1: Final distribution of elements for the unbalanced inputs.

As shown in Figure 2, for a given value of h, the time to route an h-relation of a fixed size scales inversely with p, and the communication time is asymptotically optimal for inputs which are large compared with the machine size (p² ≤ n). However, on a varying machine size with smaller inputs, the execution time can be dominated by the O(p²) term, as shown by the results of the 128-processor Cray T3D with n = 256K.¹

Figure 2: Performance of the h-relation personalized communication (h = 4n/p) with respect to machine and problem size, on the Cray T3D and the IBM SP-2.

¹Note that the personalized communication is more general than a transpose primitive and does not make the assumption that data is already held in contiguous, regular sized buffers.

4 Sorting

A well-studied approach to Problem 2 is sample sort. Previous versions of sample sort [16, 8, 12] choose the splitters by randomly sampling the n/p input elements at each processor: each processor randomly selects s samples from its input, the collected sp samples are routed to a single processor, which then sorts them and selects every s-th sample as a splitter. The splitters are broadcast to all the processors, each processor performs a binary search over the splitters for each of its input elements, and the results are used to route every element to its appropriate destination, after which local sorting completes the process.

The first difficulty with this approach is the choice of s, known as the oversampling ratio. A larger value of s results in better load balancing, but it also increases the overhead of gathering and sorting the samples. The second difficulty is that duplicate values are accommodated inefficiently: the load balance deteriorates as the number of duplicates grows, and the only proposed remedy is to tag each item with a unique value, which of course doubles the cost of both memory access and interprocessor communication. The final difficulty is that, no matter how the routing of elements to their destinations is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard [20].

In our sample sort algorithm, we are able to replace the irregular routing with two rounds of the regular transpose primitive, to obtain very good load balancing with high probability, and to efficiently handle the presence of duplicates without resorting to tagging.
4.1 A New Sample Sort Algorithm

Consider the problem of sorting n elements equally distributed amongst p processors, where without loss of generality we assume that p divides n evenly. The idea behind sample sort is to find a set of p − 1 splitters to partition the n input elements into p groups indexed from 0 up to p − 1, such that every element in the i-th group is less than or equal to each of the elements in the (i + 1)-th group, for 0 ≤ i ≤ p − 2. Then the task of sorting each of the p groups can be turned over to the correspondingly indexed processor, after which the n elements will be arranged in sorted order. The efficiency of this algorithm obviously depends on how evenly we divide the input, and this in turn depends on how well we choose the splitters. The pseudocode for our algorithm is as follows:

Step (1): Each processor P_i (0 ≤ i ≤ p − 1) randomly assigns each of its n/p elements to one of p buckets. With high probability, no bucket will receive more than c₁(n/p²) elements, where c₁ is a constant to be defined later.

Step (2): Each processor P_i routes the contents of bucket j to processor P_j, for (0 ≤ i, j ≤ p − 1). Since with high probability no bucket will receive more than c₁(n/p²) elements, this is equivalent to performing a transpose operation with block size c₁(n/p²).

Step (3): Each processor P_i sorts the (≤ c₁(n/p)) values received in Step (2) using an appropriate sequential sorting algorithm. For integers we use the radix sort algorithm, whereas for floating point numbers we use the merge sort algorithm.

Step (4): From its sorted list of (β(n/p) ≤ c₁(n/p)) elements, processor P_0 selects Splitter[j], for (1 ≤ j ≤ p): each Splitter[j] is the (jβ(n/p²))-th element of the sorted list, except that, by default, Splitter[p] is the largest value allowed by the data type used. Additionally, for each Splitter[j], processor P_0 uses binary search to determine Frac[j], the fraction of its elements with the same value as Splitter[j] which also lie at or before index (jβ(n/p²)).

Step (5): Processor P_0 broadcasts the p splitters and their Frac values to the other p − 1 processors using the bcast primitive [6].

Step (6): Each processor P_i uses binary search on its sorted local array to define, for each of the p splitters, a subsequence S_j. The subsequence associated with Splitter[j] contains all those values which are greater than Splitter[j − 1] and less than or equal to Splitter[j], with ties involving the splitter values themselves apportioned according to the fractions Frac[j − 1] and Frac[j].

Step (7): Each processor P_i routes the subsequence associated with Splitter[j] to processor P_j, for (0 ≤ i, j ≤ p − 1). Since with high probability no subsequence will contain more than c₂(n/p²) elements, where c₂ is a constant to be defined later, this is equivalent to performing a transpose operation with block size c₂(n/p²).

Step (8): Each processor P_i merges the p sorted subsequences received in Step (7) to produce the i-th column of the sorted array.

Clearly, the sequential sorting in Step (3) requires O((n/p) log(n/p)) time for the merge sort, the merge of p sorted subsequences in Step (8) requires O((n/p) log p) time, and the binary searches in Steps (4) and (6) require O(p log(n/p)) time, whereas the communication in Steps (2), (5), and (7) is dominated by the cost of the transpose and bcast primitives.
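Steps (4) and (6) boil down to picking evenly spaced elements of a sorted array and binary searching for the splitter boundaries. The sketch below shows both operations on integer keys; it omits the Frac[] bookkeeping for duplicates, so it is a simplified illustration rather than the full algorithm, and the function names are ours.

    #include <limits.h>

    /* Step (4), simplified: pick each (j * n0 / p)-th element of the
     * sorted array on P_0 as a splitter; the last splitter defaults to
     * the largest value of the data type.                              */
    void select_splitters(const int *sorted, int n0, int p, int *splitter)
    {
        for (int j = 1; j < p; j++)
            splitter[j - 1] = sorted[(long)j * n0 / p];
        splitter[p - 1] = INT_MAX;
    }

    /* Step (6), simplified: binary search for the end of the run of
     * values <= s in sorted[0..n-1]; consecutive boundaries delimit the
     * p subsequences routed by the second transpose.                   */
    int upper_bound(const int *sorted, int n, int s)
    {
        int lo = 0, hi = n;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] <= s) lo = mid + 1;
            else hi = mid;
        }
        return lo;   /* number of elements <= s */
    }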
The analysis of our sample sort algorithm rests on bounding the number of elements received by any bucket in Step (1) and by any processor in Step (7).

Theorem 1: With high probability (at least 1 − n^(−ε), for some positive constant ε), no bucket receives more than c₁(n/p²) elements in Step (1), and no processor receives more than α₂(n/p) elements at the completion of Step (7), where, for example, c₁ = 2 and α₂ = 2.62 for inputs without duplicates (α₂ = 4.24 in the presence of duplicates), provided p² < n; the corresponding bounds c₂ on the number of elements exchanged between any two processors in Step (7) are 1.77 and 3.10 times n/p², respectively. The proof of this theorem has been omitted for brevity [14].

Hence, with high probability, the overall complexity of our sample sort algorithm (for floating point numbers) is given by T(n,p) = T_comp(n,p) + T_comm(n,p) = O((n/p) log n + τ + (n/p)σ), which is asymptotically optimal with very small coefficients.

We implemented and tested our algorithm on nine different benchmarks, whose rationale is described in Appendix A. The benchmarks include a variety of input distributions, a number of which contain a wide range of duplicates, and each was run in both a 32-bit integer and a 64-bit double precision floating point version. Table I displays the performance of our sample sort on a 64-node Cray T3D as a function of the input distribution and problem size. The results show that the performance is essentially independent of the input distribution, which we attribute to the randomized bucketing in Step (1).

Table I: Total execution time (in seconds) of sample sorting each of the benchmarks on a 64-node Cray T3D, for problem sizes up to 8M keys.

Table II examines the scalability of our sample sort as a function of machine size. It shows that for a fixed input size n, the execution time scales almost inversely with the number of processors p.

Table II: Total execution time (in seconds) for sample sorting 8M integers [U] on a variety of machines (Cray T3D, IBM SP2-WN, TMC CM-5) and numbers of processors. A hyphen indicates that that platform was unavailable to us.

Figure 3 examines the scalability of our sample sort as a function of problem size, for differing numbers of processors. It shows that for a fixed number of processors, there is an almost linear dependence between the execution time and the total number of elements n.

Figure 3: Scalability of sample sorting integers [U] with respect to problem size, for differing numbers of processors, on the Cray T3D.

Finally, Table III compares our results with the best times reported for the Class A NAS Parallel Benchmark for integer sorting (IS). Note that the name of this benchmark is perhaps somewhat misleading: instead of requiring that the integers be placed in sorted order, as we do, the benchmark only requires that they be ranked without any actual reordering, which is a significantly simpler task. Note also that while the reported times were obtained with machine-specific implementations, our results were obtained with the same high-level, portable code on every platform. We believe that our results, obtained using only high-level communication primitives, compare favorably.

Table III: Comparison of our execution time (in seconds) with the best reported times for the Class A NAS Parallel Benchmark (IS), for differing numbers of processors on the TMC CM-5, the IBM SP-2-WN, and the Cray T3D.
4.2 Sorting by Regular Sampling

A known alternative to sample sort with randomly chosen samples is Parallel Sorting by Regular Sampling (PSRS) [23, 19], which guarantees its load balance deterministically and has reduced memory requirements. This algorithm first sorts the n/p elements at each processor and then selects every (n/p²)-th element from each sorted list as a sample. These samples are routed to a single processor, which sorts them and selects every p-th sample as a splitter. The splitters are then broadcast to all the processors, each processor uses binary search to partition its sorted input into p subsequences, the subsequences are routed to their appropriate destinations, and merging the received subsequences completes the sorting process.

A disadvantage of this approach is the resulting load balance: there exist inputs for which at least one processor will receive as many as (2n/p − n/p² − p + 1) elements at the completion of sorting. This could be reduced by choosing more samples, but that would also increase the overhead of gathering and sorting them. And no matter what is done, previous workers have observed that the load balance will still deteriorate linearly with the number of duplicates [19]. Another difficulty is that, no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, which in turn results in an inefficient use of the communication bandwidth; such irregular communication cannot take advantage of the regular communication primitives proposed under the MPI standard [20].

In our algorithm, which is parameterized by a sampling ratio s (p ≤ s ≤ n/p²), we guarantee that, at the completion of sorting, each processor holds at most (n/p + n/s − p) elements, while incurring no overhead in gathering the samples used to identify the splitters. This bound holds regardless of the number of duplicates present in the input. Moreover, we are able to replace the irregular routing of the subsequences with a regular transpose operation.
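Written out, the two load-balance guarantees contrasted above are as follows (the PSRS worst case on the left, our parameterized bound on the right; increasing s toward its upper limit tightens our bound):

    \[
    \frac{2n}{p} - \frac{n}{p^2} - p + 1
    \qquad \text{versus} \qquad
    \frac{n}{p} + \frac{n}{s} - p,
    \qquad p \le s \le \frac{n}{p^2}.
    \]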
size Step (3): From I+ is the of the each routes (O < i, j the of the in Step element (2), processor as a sample, value ofs, for for (1 s p<s<~ contents of which is subsequences Step (4): quences Step of .sarnples for splitter is the used. Additionally, Pp_I re- and (ks – same is an 1) the largest upper value binary SAk number value then selects kA and () a given of allowed search with bound on the indices (ks)th the been omitted that number at Each Splitter[j] than or equal or equal to Est[j] ( as Splitter[~]. P, routes to no to Splitter[j], x Y most value two ele) the p subsequences processor Pj, for processors will exchange elements, this (O ~ is equivalent ) operation processor sharing P, with block “unshuffled” a common origin the subsequences column ith results of the the are sorted complexity of the following brevity [15]. for Theorem 2: The number processors in Step (7) quently, (for size in all those Step then (2). merged to array. of this theorem, algorithm, whose we proof has at the than (~ Hence, the floating to of elements is at completion of Step + ~ – p) analysis that elements, of our of our point ( (7), for by any two sort . Conse- ) no processor receives n z p3. regular sample numbers) exchanged ~+~–1 most sample algorithm sort algorithm and is given by sample the data to T(n, last forn~p3 compute which (Est[k] p) =O(; logn+r+;cr) type and p~s~~. ( (k – 1)s through Est[k] Note total by include Splitterj establishing the more subse- default, is used samples as Splitter[k]. p sorted each By in a subsequence with greater less then p consolidated is similar the (1 ~ k < p – 1). for the set of samples the each k < s — 1), merges to . (8): need . Processor as a splitter, selects are search received with ) ( ● Pp-l binary sequences associated processor sorted th ceived uses then operation p sorted p spiitters (k mod < p – 1), a transpose asso- ) Before (2): the p splitters a transpose ( sort bin. Step bin into the Q ~z+~—l +:-1 produce order duplicates we use data item P, and Since performing fi sequential the sorted of same with p — 1). more The “dealt” A p — I processors. p sorted which Each subsequences the the – I] and associated ● algorithm, X (l). other processor collectively Step b calls 1) sorts appropriate we use so ) broadcasts p subsequences contain ments is as follows: P, (O ~ i s p– For of Step Est[k] of those Step to the (6): ( ● in PP–I Step i, j our order redistributed between + 1 SAk define and primitive, pseudo is completion in gathering bound present the transpose The will no overhead set Processor Splitter[j incurring to identify (5): S[j,k). a sampling ) of sorting, ments, that – 1) elements ) the their all ratio — 1) x A with Step and [20]. is parameterized input holds (f sorted communica- communication standard the ( communication regular the in the is sched- processors, an irregular (2), processor (Est[k] The variations use of the of the which routing large different an inefficient under algorithm, the rise for advantage proposed our destined in how give Moreover, cannot primitives matter that of elements this scheme no inputs and sample Step each ciated other in SAk each ( de- [19]. the samples share x :) of duplicates 217 ) (3) in Machine Scalability & Problem Regular Size on the T3D (s = 8p) 10 ool~ 51~K Table IM ing 8M Total Nurnbe%’f Sample Sorting Sorting Benchmark Inte$rs [U] lM +16+-32=64! 
Once again, our regular sample sort algorithm is asymptotically optimal with very small coefficients. We implemented and tested it on the same nine benchmarks, for a range of input sizes. Table IV displays the performance of our regular sample sort on a 64-node Cray T3D as a function of the input distribution and problem size, with s = 8p; the performance is essentially independent of the input distribution.

Table IV: Total execution time (in seconds) of regular sample sorting each of the benchmarks on a 64-node Cray T3D (s = 8p).

    Benchmark     1M        4M        16M       64M
    [U]           0.114     0.355     1.17      4.42
    [G]           0.114     0.341     1.16      4.38
    [2-G]         0.0963    0.312     1.11      4.33
    [4-G]         0.0970    0.320     1.12      4.28
    [B]           0.117     0.352     1.17      4.35
    [S]           0.0969    0.304     1.10      4.42
    [Z]           0.0874    0.273     0.963     3.75
    [WR]          0.129     0.365     1.35      5.09
    [RD]          0.0928    0.285     0.997     3.81

Table V examines the scalability of our regular sample sort as a function of machine size. It shows that for a fixed input size n, the execution time scales almost inversely with the number of processors p.

Table V: Total execution time (in seconds) for regular sample sorting 8M integers [U] on a variety of machines (Cray T3D, IBM SP2-WN, TMC CM-5) and numbers of processors (s = 8p). A hyphen indicates that that platform was unavailable to us.

Figure 4 examines the scalability of our regular sample sort as a function of problem size, for differing numbers of processors; it again shows an almost linear dependence between the execution time and the total number of elements n. See [15] for additional performance data and comparisons with other published results.

Figure 4: Scalability of regular sample sorting integers [U] with respect to problem size, for differing numbers of processors, on the Cray T3D (s = 8p).

4.3 An Efficient Radix Sort

Finally, we consider the problem of sorting n integer keys in the range [0, M − 1] that are distributed equally over a p-processor distributed memory machine. An efficient and well-known stable algorithm for sorting integers is radix sort, which decomposes each key into groups of r-bit blocks, for a suitably chosen r, and then sorts the keys by sorting on each of the blocks in sequence, beginning with the block containing the least significant bit positions. Radix sort makes several passes of the following Counting Sort algorithm, which sorts n integers in the range [0, R − 1], for R = 2^r. Counting Sort first determines the rank of each key, that is, its final position in the sorted order, by using R counters to accumulate the number of keys equal to each value; once the ranks are known, we can move the keys to their final positions using our h-relation personalized communication, in this case with h = n/p. Counting Sort is a stable sorting routine: if two keys are identical, their relative order in the final sort remains the same as their initial relative order. The pseudocode for our Counting Sort algorithm uses six steps, as follows:

Step (1): For each processor i, count the frequency of its n/p keys; that is, compute I[i][k], the number of keys equal to k, for (0 ≤ k ≤ R − 1).

Step (2): Apply the transpose primitive to the I array using block size R/p. Hence, at the end of this step, each processor will hold R/p consecutive rows of I.

Step (3): Each processor locally computes the prefix-sums of its rows of the array.

Step (4): Apply the (inverse) transpose primitive to the prefix-sums, augmented by the total count for each corresponding row. The block size of the transpose primitive is 2R/p.

Step (5): Each processor computes the ranks of its local keys.

Step (6): Perform a personalized communication of the keys to their rank locations using our h-relation algorithm with h = n/p.
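Steps (1) and (2) of Counting Sort amount to a local histogram of the current r-bit digit followed by a transpose of the counter array. The sketch below illustrates them, assuming R = 2^r is a power of two divisible by p; the function and parameter names are ours.

    #include <mpi.h>
    #include <string.h>

    /* Counting Sort, Steps (1)-(2): histogram the r-bit digits locally,
     * then transpose the R counters so that each processor holds R/p
     * consecutive rows of the global count matrix (one row per digit
     * value, one column per originating processor).                    */
    void count_and_transpose(const int *keys, int n_loc, int shift,
                             int R, int p,
                             int *I /* R ints */, int *rows /* R ints */)
    {
        memset(I, 0, R * sizeof(int));
        for (int i = 0; i < n_loc; i++)
            I[(keys[i] >> shift) & (R - 1)]++;   /* Step (1): frequencies */

        /* Step (2): block size R/p, so processor j receives the counts
         * for digit values j*R/p .. (j+1)*R/p - 1 from every processor. */
        MPI_Alltoall(I, R / p, MPI_INT, rows, R / p, MPI_INT,
                     MPI_COMM_WORLD);
    }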
Figure 5 examines the scalability of our radix sort with respect to machine and problem size, on the Cray T3D and the IBM SP-2. Table VI presents a comparison of our radix sort with another implementation of radix sort in SPLIT-C, tuned by Alexandrov et al. [1], which is identified as AIS, and with the implementation described in [4], which is referred to as BHJ. The input [R] is random, the input [C] is cyclically sorted, and the input [N] is an approximation of a Gaussian distribution. Additional performance results can be found in [4].

Figure 5: Scalability of radix sort with respect to machine and problem size, on the Cray T3D and the IBM SP-2.

Table VI: Total execution time (in seconds) for radix sorting 32-bit integers, comparing the AIS and BHJ implementations with ours on the TMC CM-5, the IBM SP-2, and the Meiko CS-2, for the [R], [C], and [N] inputs with 4K, 64K, and 512K keys per processor.

References

[1] A. Alexandrov, M. Ionescu, K.E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model - One step closer towards a realistic model for parallel computation. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95-105, Santa Barbara, CA, July 1995.

[2] R.H. Arpaci, D.E. Culler, A. Krishnamurthy, S.G. Steinberg, and K. Yelick. Empirical Evaluation of the CRAY-T3D: A Compiler Perspective. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 320-331, Santa Margherita Ligure, Italy, June 1995.

[3] D. Bader. Randomized and Deterministic Routing Algorithms for h-Relations. ENEE 648X Class Report, April 1, 1994.

[4] D.A. Bader, D.R. Helman, and J. JaJa. Practical Parallel Algorithms for Personalized Communication and Integer Sorting. Technical Report CS-TR-3548 and UMIACS-TR-95-101, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, November 1995. To appear in ACM Journal of Experimental Algorithmics.

[5] D.A. Bader and J. JaJa. Parallel Algorithms for Image Histogramming and Connected Components with an Experimental Study. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-133, Santa Barbara, CA, July 1995. To appear in Journal of Parallel and Distributed Computing.
[6] D.A. Bader and J. JaJa. Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection. Technical Report CS-TR-3494 and UMIACS-TR-95-74, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, July 1995. To be presented at the 10th International Parallel Processing Symposium, Honolulu, HI, April 15-19, 1996.

[7] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6:154-164, 1995.

[8] G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, and M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 3-16, July 1991.

[9] W.W. Carlson and J.M. Draper. AC for the T3D. Technical Report SRC-TR-95-141, Supercomputing Research Center, Bowie, MD, February 1995.

[10] Cray Research, Inc. SHMEM Technical Note for C, Revision 2.3. October 1994.

[11] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy, S. Lumetta, S. Luna, T. von Eicken, and K. Yelick. Introduction to Split-C. Computer Science Division - EECS, University of California, Berkeley, version 1.0 edition, March 6, 1994.

[12] D.E. Culler, A.C. Dusseau, R.P. Martin, and K.E. Schauser. Fast Parallel Sorting Under LogP: From Theory to Practice. In Portability and Performance for Parallel Processing, chapter 4, pages 71-98. John Wiley & Sons, 1994.

[13] D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[14] D.R. Helman, D.A. Bader, and J. JaJa. A Parallel Sorting Algorithm With an Experimental Study. Technical Report CS-TR-3549 and UMIACS-TR-95-102, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, December 1995.

[15] D.R. Helman, D.A. Bader, and J. JaJa. Parallel Algorithms for Personalized Communication and Sorting by Regular Sampling With an Experimental Study. Technical report, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, June 1996. In preparation.

[16] J.S. Huang and Y.C. Chow. Parallel Sorting and Data Partitioning by Sampling. In Proceedings of the 7th International Computer Software and Applications Conference, pages 627-631, November 1983.

[17] J.F. JaJa and K.W. Ryu. The Block Distributed Memory Model for Shared Memory Multiprocessors. In Proceedings of the 8th International Parallel Processing Symposium, pages 752-756, Cancun, Mexico, April 1994. To appear in IEEE Transactions on Parallel and Distributed Systems.

[18] M. Kaufmann, J.F. Sibeyn, and T. Suel. Derandomizing Algorithms for Routing and Sorting on Meshes. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 669-679, 1994.

[19] X. Li, P. Lu, J. Schaeffer, J. Shillington, P.S. Wong, and H. Shi. On the Versatility of Parallel Sorting by Regular Sampling. Parallel Computing, 19:1079-1103, 1993.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical report, University of Tennessee, Knoxville, TN, June 1995. Version 1.1.

[21] S. Ranka, R.V. Shankar, and K.A. Alsabti. Many-to-Many Personalized Communication with Bounded Traffic. In The Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 20-27, McLean, VA, February 1995.

[22] S. Rao, T. Suel, T. Tsantilas, and M. Goudreau. Efficient Communication Using Total-Exchange. In Proceedings of the 9th International Parallel Processing Symposium, pages 544-550, Santa Barbara, CA, April 1995.

[23] H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14:361-372, 1992.

[24] L.G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103-111, 1990.

A Sorting Benchmarks

Our nine sorting benchmarks are defined as follows, in which MAX is (2^31 - 1) for integers and approximately 1.8 x 10^308 for doubles (the g-Group benchmark is instantiated for two values of g, giving nine benchmarks from the eight definitions below):

1. Uniform [U], a uniformly distributed random input, obtained by calling the C library random number generator random(). This function, which returns integers in the range 0 to (2^31 - 1), is seeded by each processor P_i with the value (23 + 1001i). For the double data type, we "normalize" the returned integer values so that they span the allowed range.

2. Gaussian [G], a Gaussian distributed random input, approximated by adding four calls to random() and then dividing the result by four. For the double data type, the values are normalized as in the Uniform benchmark.

3. Zero [Z], a zero entropy input, created by setting every value to a constant such as zero.

4. Bucket Sorted [B], an input that is sorted into p buckets, obtained by setting the first n/p² elements at each processor to be random numbers between 0 and (MAX/p − 1), the second n/p² elements at each processor to be random numbers between MAX/p and (2MAX/p − 1), and so forth.
5. g-Group [g-G], an input created by first dividing the processors into groups of consecutive processors of size g, where g can be any integer which partitions p evenly. If we index these groups in consecutive order from 0 up to (p/g − 1), then for each processor in group j we set the first n/(pg) elements to be random numbers between (((jg + p/2) mod p) × (MAX/p)) and ((((jg + p/2 + 1) mod p) × (MAX/p)) − 1), the second n/(pg) elements to be random numbers between ((((jg + p/2 + 1) mod p) × (MAX/p))) and ((((jg + p/2 + 2) mod p) × (MAX/p)) − 1), and so forth.

6. Staggered [S], created as follows: if the processor index i is less than p/2, then we set all n/p elements at that processor to be random numbers between ((2i + 1) × (MAX/p)) and (((2i + 2) × (MAX/p)) − 1); otherwise, we set all n/p elements to be random numbers between ((2i − p) × (MAX/p)) and (((2i − p + 1) × (MAX/p)) − 1).

7. Worst-Case Regular [WR], an input consisting of values between 0 and MAX designed to induce the worst possible load balance at the completion of our regular sample sort, so that upon sorting one processor holds as many elements as the bound of Theorem 2 allows. See [15] for additional details.

8. Randomized Duplicates [RD], an input of duplicates in which each processor fills an array T with some constant number range of random values between 0 and (range − 1) (range is 32 for our work) whose sum is S. The first T[0]/S × n/p values of the input are then set to a random value between 0 and (range − 1), the next T[1]/S × n/p values of the input are set to another random value between 0 and (range − 1), and so forth.
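As an illustration of how these inputs are produced, the sketch below generates the Bucket Sorted [B] benchmark; the helper name is ours, and the per-processor seeding follows the rule given for [U].

    #include <stdlib.h>

    #define MAX 2147483647L   /* (2^31 - 1), for the integer benchmarks */

    /* Bucket Sorted [B]: the k-th group of n/p^2 elements on every
     * processor is drawn uniformly from bucket k's value range, so the
     * input is already grouped by its final destination processor.    */
    void make_bucket_sorted(int *a, long n_per_proc, int p, int my_id)
    {
        long group = n_per_proc / p;
        srandom(23 + 1001 * my_id);        /* seed rule from benchmark 1 */
        for (int k = 0; k < p; k++) {
            long lo = (long)k * (MAX / p);
            for (long i = 0; i < group; i++)
                a[k * group + i] = (int)(lo + random() % (MAX / p));
        }
    }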
We selected these nine benchmarks for a variety of reasons. First, we wanted to include the simple benchmarks Uniform, Gaussian, and Zero, which are used by most researchers and provide a sense of comparison. But an input such as Uniform, whose values are spread randomly over the whole range, might be expected to produce the best possible behavior for algorithms such as ours, and so we also wanted benchmarks which would challenge them. The Bucket Sorted benchmark was included because an algorithm which naively routes each element directly to its final destination would perform misleadingly well on such an input: the data is already grouped by destination, so the problem of communication contention never arises. The g-Group and Staggered benchmarks were designed to induce exactly this contention. In the g-Group benchmark, with a suitably chosen value of g, all the processors in a group route data to the same subset of g destination processors at the same time, possibly stalling one another; the result is a reduction of the effective communication bandwidth. In the Staggered benchmark, each processor needs to send all of its data to a single destination processor, with the destinations forming a permutation with large strides. Of course, one can always find a permutation tailored to thwart a given deterministic routing scheme; the object of these benchmarks is simply to assess empirically the expected effect of communication contention, not to construct the worst possible case. The Worst-Case Regular benchmark was included to evaluate the worst-case running time of our regular sampling algorithm. Finally, the Randomized Duplicates benchmark was included to rate performance in the presence of duplicate values; we are unaware of another sorting benchmark which performs an equivalent evaluation.

Acknowledgements

We would like to thank Ronald Greenberg of UMCP's Electrical Engineering Department for his valuable comments and encouragement. We would also like to thank the CASTLE/SPLIT-C group at The University of California, Berkeley, especially for the help and encouragement from David Culler, Arvind Krishnamurthy, and Lok Tin Liu. Arvind Krishnamurthy provided additional help with his port of SPLIT-C to the Cray Research T3D [2], which is based on the AC compiler (version 2.6) [9] written by William Carlson and Jesse Draper of the Supercomputing Research Center. We acknowledge the use of the UMIACS 16-node IBM SP-2-TN2, which was provided by an IBM Shared University Research award and by NSF Academic Research Infrastructure Grant No. CDA9401151. The Jet Propulsion Lab/Caltech 256-node Cray T3D used in this investigation was provided by funding from the NASA Offices of Mission to Planet Earth, Aeronautics, and Space Science. We also thank the Numerical Aerodynamic Simulation Systems Division of NASA Ames Research Center for the use of their 160-node IBM SP-2-WN. This work also utilized the CM-5 at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, under grant number ASC960008N.

Please see http://www.umiacs.umd.edu/research/EXPAR for additional performance information. In addition, all the code used in this paper will be freely available for interested parties from our anonymous ftp site, ftp://ftp.umiacs.umd.edu/pub/dbader. We encourage other researchers to compare with our results for similar inputs.