Parallel Algorithms for Personalized Communication and Sorting
With an Experimental Study
(Extended Abstract)

David R. Helman    David A. Bader*    Joseph JáJá†

Institute for Advanced Computer Studies and
Department of Electrical Engineering,
University of Maryland, College Park, MD 20742
{helman, dbader, joseph}@umiacs.umd.edu
Abstract

A fundamental challenge for parallel computing is to obtain high-level, architecture independent, algorithms which execute efficiently on general-purpose parallel machines. With the emergence of message passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of these communication primitives. While existing primitives allow an assortment of collective communication routines, they do not handle an important communication event when most or all processors have non-uniformly sized personalized messages to exchange with each other. We first present an algorithm for the h-relation personalized communication whose efficient implementation will allow high performance implementations of a large class of algorithms. We then consider how to effectively use these communication primitives to address the general problem of sorting. Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing with irregular communication and multiple rounds of all-to-all personalized communication. We introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead; moreover, unlike previous variations, it efficiently handles the presence of duplicates without the overhead of tagging each element. The algorithms have been coded in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, the Cray Research T3D, the Meiko Scientific CS-2, and the Intel Paragon. Our experimental results are consistent with the theoretical analyses and illustrate the scalability and efficiency of our algorithms across the different platforms. In fact, they seem to outperform all similar algorithms known to the authors on these platforms, and their performance is invariant over the set of input distributions, unlike that of previous efficient algorithms. Our sorting results also compare favorably with those reported for the simpler ranking problem posed by the NAS Integer Sorting (IS) Benchmark.

Keywords: Parallel Algorithms, Personalized Communication, Sorting, Sample Sort, Radix Sort, Parallel Performance.
1 Introduction

A fundamental problem in parallel computing is to design high-level, architecture independent, algorithms that execute efficiently on general-purpose parallel machines. The aim is to be able to achieve portability and high performance simultaneously. Two factors currently seem to make this problem more tractable. The first is the emergence of a dominant parallel architecture consisting of a number of powerful microprocessors interconnected by either a proprietary interconnect or a standard off-the-shelf interconnect. The second factor is the emergence of standards, such as the message passing standard MPI [20], for which machine builders try to provide efficient support. Given such standards, it is considerably easier to design portable parallel algorithms: by making use of a standard library (say, MPI or PVM) or of a high-level language (say, SPLIT-C [11]), programmers can express an algorithm once and expect it to execute efficiently on every machine whose builders support the standard, without fine-tuning it to the requirements of each specific architecture. Our work builds on these two developments.
*The support by NASA Graduate Student Researcher Fellowship No. NGT-50951 is gratefully acknowledged.
†Supported in part by NSF grant No. CCR-9113135 and NSF HPCC/GCAG grant No. BIR-9318183.
While the existing standards allow an assortment of collective communication routines, they do not handle an important communication event that arises when most or all of the processors have non-uniformly sized personalized messages to exchange with each other. This is the h-relation personalized communication, in which each processor is the origin or the destination of at most h messages. We present an algorithm for this problem whose efficient implementation will allow high performance implementations of a large class of algorithms.

We then consider how to use these communication primitives to effectively address the problem of sorting, a fundamental problem that seems to require irregular communication. Previous schemes for sorting on general-purpose parallel machines have had to choose between poor load balancing with irregular communication and multiple rounds of all-to-all personalized communication: in sample sort, for example, both the load balance and the communication requirements depend on the variation in how the splitters partition the input. In this paper we introduce a novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead, together with a deterministic variation based on regular sampling that achieves similar bounds. Unlike previous variations, our algorithms efficiently handle the presence of duplicates without the overhead of tagging each element.
Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
SPAA'96, Padua, Italy
© 1996 ACM 0-89791-809-6/96/06...$3.50
In this extended abstract, we first sketch a theoretical framework for designing our parallel algorithms and for analyzing their communication and computation requirements, and then state our main results for two important problems: personalized communication and sorting. The remaining sections outline the algorithms and summarize our experimental study.
2 The Parallel Computation Model

We view a parallel algorithm as a sequence of local computations interleaved with communication steps, and we allow computation and communication to overlap. We account for communication costs as follows. Assuming no congestion, the transfer of a block consisting of m contiguous words between two processors takes O(τ + σm) time, where τ is the latency of the network and σ is the time per word at which a processor can inject or receive data from the network. The cost of each of the collective communication primitives described below will be modeled by O(τ + σ max(m, p)), where m is the maximum amount of data transmitted or received by a processor; such a cost can be justified by the implementation results reported in [17, 24, 6, 5]. Using this cost model, we can evaluate the communication time T_comm(n, p) of an algorithm as a function of the input size n, the number of processors p, and the parameters τ and σ: the coefficient of τ gives the total number of times collective communication primitives are used, and the coefficient of σ gives the maximum total amount of data exchanged between a processor and the remaining processors. This communication model is close to a number of similar models (e.g. the BSP [24], LogP [13], and LogGP [1] models) that have recently appeared in the literature, and it seems well-suited for designing parallel algorithms on current high performance platforms. We define the computation time T_comp(n, p) as the maximum time taken by any processor to perform all of its local computation steps, and the overall performance is T(n, p) = T_comp(n, p) + T_comm(n, p).

Our algorithms are implemented in SPLIT-C [11], an extension of C for distributed memory machines. The algorithms make use of MPI-like communication primitives but do not make any assumptions as to how these primitives are actually implemented; similar primitives are provided, for example, by the collective communication library CCL [7] for the IBM POWERparallel machines. Our collective communication primitives, described in detail in [6], are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is called to broadcast a block of data from a single source to all the remaining processors. The primitives gather and scatter are companion primitives: scatter divides a single array residing on one processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces blocks distributed across the processors into a single array residing on one processor.
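As a concrete illustration of the semantics of these four primitives, the sketch below shows one plausible realization on top of MPI. This mapping is only an assumption for illustration — the implementations used in the paper are written in SPLIT-C and are described in [6] — but each MPI call shown here has exactly the stated collective behavior.

#include <mpi.h>

/* Illustrative mapping of the four collective primitives onto MPI.
 * Each block holds m ints; src names the root for bcast/gather/scatter. */

/* transpose: all-to-all personalized exchange of equal-sized blocks */
void transpose(const int *sendblocks, int *recvblocks, int m) {
    MPI_Alltoall((void *)sendblocks, m, MPI_INT,
                 recvblocks, m, MPI_INT, MPI_COMM_WORLD);
}

/* bcast: one block of data from a single source to all processors */
void bcast(int *block, int m, int src) {
    MPI_Bcast(block, m, MPI_INT, src, MPI_COMM_WORLD);
}

/* scatter: divide an array on src into p equal blocks, one per processor */
void scatter(const int *array, int *myblock, int m, int src) {
    MPI_Scatter((void *)array, m, MPI_INT,
                myblock, m, MPI_INT, src, MPI_COMM_WORLD);
}

/* gather: coalesce the processors' blocks into a single array on src */
void gather(const int *myblock, int *array, int m, int src) {
    MPI_Gather((void *)myblock, m, MPI_INT,
               array, m, MPI_INT, src, MPI_COMM_WORLD);
}

Under the cost model above, each of these calls is charged O(τ + σ max(m, p)) time, with m the maximum amount of data any single processor sends or receives.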
2.1 Main Results

In this section we state our main results around the two problems of personalized communication and sorting.

Problem 1 (h-Relation Personalized Communication): A set of n messages is distributed amongst the p processors such that each processor is the origin or the destination of at most h messages (hence h ≥ n/p). Route all the messages to their final destinations.

Our Main Result: A new deterministic algorithm that routes an h-relation with two rounds of the transpose collective communication primitive. Its communication time is asymptotically optimal with very small coefficients, and the bound is tighter than that achieved by the best previously known algorithm [21]. An earlier draft of intermediate results appeared as [3], and the general case is described in the companion paper [4].

Significance: The h-relation personalized communication is a fundamental communication event, and an efficient, portable implementation immediately yields high performance implementations of a large class of parallel algorithms.

Problem 2 (Sorting): Rearrange n elements equally distributed amongst p processors such that they appear in non-decreasing order starting from the smallest indexed processor.

Our Main Results: (1) A novel variation on sample sort that uses only two rounds of regular all-to-all personalized communication, makes no assumptions on the input distribution, achieves very good load balancing with virtually no overhead, and handles duplicates without tagging each element. (2) A new deterministic algorithm based on regular sampling that maintains almost the same communication bounds and performance as (1) while guaranteeing them deterministically. (3) An efficient implementation of radix sort, a stable integer sort, that makes use of our personalized communication scheme.

Significance: We have developed a suite of benchmarks and conducted extensive experimentation. Our sorting algorithms performed consistently well across the different benchmarks and the different platforms, and outperformed all the similar algorithms known to the authors on these platforms. Their execution times even compare favorably with implementations reported by some machine vendors for the Class A NAS Integer Sort (IS) Benchmark, which requires only the easier task of ranking the elements. Brief descriptions of the algorithms and of the experimental results are given in the following sections; complete details appear in [4, 14, 15].
3 h-Relation Personalized Communication

For ease of presentation, we describe the personalized communication algorithm for the case when the input is initially evenly distributed amongst the processors; the reader is directed to [4] for the general case, in which no assumptions are made about the initial distribution of the data.

Consider a set of n elements evenly distributed amongst the p processors, so that each processor P_i holds n/p elements. Each element consists of a pair (data, dest), where dest is the processor to which the data is to be routed, and the communication pattern is such that no processor is the destination of more than h elements (thus h ≥ n/p). For simplicity of presentation (and without loss of generality), we assume that h is an integral multiple of p. In our solution, we use two rounds of the transpose primitive¹: in the first round, each element is routed to an intermediate destination, and in the second round it is routed to its final destination.
3.1 The Algorithm

The pseudocode for our algorithm is as follows (see also Figure 1):

● Step (1): Each processor P_i (0 ≤ i ≤ p − 1) assigns each of its elements to one of p bins according to the following rule: the first element with destination b is placed into bin b, the second into bin ((b + 1) mod p), and so forth, so that elements with the same destination are dealt cyclically across the p bins. Consequently, no bin holds more than (n/p² + p) elements.

● Step (2): Each processor routes the contents of bin j to processor P_j, for (0 ≤ j ≤ p − 1). Since no bin holds more than (n/p² + p) elements, this operation is equivalent to performing a transpose with block size (n/p² + p).

● Step (3): Each processor rearranges the elements received in Step (2) into p bins according to each element's final destination. We established in [4] that, because of the cyclic dealing in Step (1), no processor now holds more than (h/p + p) elements destined for any single processor.

● Step (4): Each processor routes the contents of bin j to processor P_j, which is equivalent to performing a transpose with block size (h/p + p). At the end of this step, every element resides at its final destination.

Figure 1: The intermediate and final distribution of the elements during the two routing rounds.

The overall complexity of our algorithm is T(n, p) = T_comp(n, p) + T_comm(n, p), where T_comm(n, p) ≤ 2[τ + (h + p²)σ] and T_comp(n, p) = O(h + p²), since each element is handled only a constant number of times. This is asymptotically optimal — Ω(hσ) is a lower bound for routing any h-relation — with very small coefficients whenever h ≥ p².
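The two rounds translate directly into code. The sketch below is a minimal C/MPI rendering of Steps (1)–(4); for clarity it uses MPI_Alltoallv with exact per-destination counts rather than the padded fixed-size transpose assumed by the analysis, and it omits the general (unevenly distributed) case handled in [4]. All names are ours.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int data; int dest; } elem_t;

/* Two-round h-relation routing (Steps (1)-(4) above).  my[] holds this
 * processor's n/p elements; out[] must have room for h elements.
 * Returns the number of elements delivered to this processor. */
int h_relation(const elem_t *my, int nloc, elem_t *out) {
    int p;
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Datatype ELEM;
    MPI_Type_contiguous(2, MPI_INT, &ELEM);
    MPI_Type_commit(&ELEM);

    /* Step (1): deal elements with the same destination cyclically over p bins */
    int *cnt  = calloc(p, sizeof(int));          /* bin sizes            */
    int *seen = calloc(p, sizeof(int));          /* occurrences per dest */
    int *bin  = malloc(nloc * sizeof(int));
    for (int k = 0; k < nloc; k++) {
        bin[k] = (my[k].dest + seen[my[k].dest]++) % p;
        cnt[bin[k]]++;
    }
    int *off = malloc((p + 1) * sizeof(int));    /* pack bins contiguously */
    int *pos = malloc(p * sizeof(int));
    off[0] = 0;
    for (int j = 0; j < p; j++) off[j + 1] = off[j] + cnt[j];
    memcpy(pos, off, p * sizeof(int));
    elem_t *snd = malloc(nloc * sizeof(elem_t));
    for (int k = 0; k < nloc; k++) snd[pos[bin[k]]++] = my[k];

    /* Step (2): first routing round, to the intermediate destinations */
    int *rcnt = malloc(p * sizeof(int)), *roff = malloc((p + 1) * sizeof(int));
    MPI_Alltoall(cnt, 1, MPI_INT, rcnt, 1, MPI_INT, MPI_COMM_WORLD);
    roff[0] = 0;
    for (int j = 0; j < p; j++) roff[j + 1] = roff[j] + rcnt[j];
    elem_t *mid = malloc(roff[p] * sizeof(elem_t));
    MPI_Alltoallv(snd, cnt, off, ELEM, mid, rcnt, roff, ELEM, MPI_COMM_WORLD);
    int nmid = roff[p];

    /* Step (3): rebin the received elements by their final destinations */
    memset(cnt, 0, p * sizeof(int));
    for (int k = 0; k < nmid; k++) cnt[mid[k].dest]++;
    off[0] = 0;
    for (int j = 0; j < p; j++) off[j + 1] = off[j] + cnt[j];
    memcpy(pos, off, p * sizeof(int));
    elem_t *snd2 = malloc(nmid * sizeof(elem_t));
    for (int k = 0; k < nmid; k++) snd2[pos[mid[k].dest]++] = mid[k];

    /* Step (4): second routing round delivers every element to its destination */
    MPI_Alltoall(cnt, 1, MPI_INT, rcnt, 1, MPI_INT, MPI_COMM_WORLD);
    roff[0] = 0;
    for (int j = 0; j < p; j++) roff[j + 1] = roff[j] + rcnt[j];
    MPI_Alltoallv(snd2, cnt, off, ELEM, out, rcnt, roff, ELEM, MPI_COMM_WORLD);

    int nrecv = roff[p];
    MPI_Type_free(&ELEM);
    free(cnt); free(seen); free(bin); free(off); free(pos);
    free(snd); free(rcnt); free(roff); free(mid); free(snd2);
    return nrecv;
}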
3.2 Experimental Results

We examined the performance of our h-relation algorithm on various configurations of the IBM SP-2 and the Cray T3D, using four values of h: h = n/p, 2n/p, 4n/p, and 8n/p. For h = n/p, the input is balanced: each processor sends n/p² of its elements to every other processor. For the larger values of h, the inputs are constructed to produce correspondingly unbalanced data movement: the destinations are skewed so that processor P_0 receives a full h elements while the other processors receive correspondingly fewer. For h = 8n/p, the block sizes received vary from approximately no elements up to 8n/p² elements, which represents an extremely unbalanced case. As shown in Figure 2, for a given input size n the time to route an h-relation scales inversely with p whenever p² ≤ n, since the communication time is then dominated by the hσ term; conversely, for a fixed machine size the time grows linearly with h. The O(p²)σ term becomes noticeable only for small inputs on large configurations, as in the case of the 128-processor Cray T3D with n = 256K.
¹Note that the personalized communication is more general than a transpose primitive: it does not make the assumption that the data is already held in contiguous, regular sized buffers.
Across these experiments, the cost of routing an h-relation remained close to that of two transpose operations of equivalent size, as the analysis predicts.
Figure 2: Performance of h-relation personalized communication (h = 4n/p) with respect to machine and problem size, on the IBM SP-2 and the Cray T3D.

4 Sorting

We next address the problem of sorting, using the communication primitives and the personalized communication described above. A widely studied approach is sample sort, which partitions the work amongst the processors by choosing a set of p − 1 splitters. Previous versions of sample sort [16, 8, 12] choose the splitters by randomly sampling the input: each processor randomly selects s elements from its share of the input, the resulting s·p samples are routed to a single processor, and that processor sorts the samples and selects every s-th sample as a splitter. Each processor then performs a binary search over the splitters for each of its input values, routes each element to the appropriate destination, and finally sorts the values it receives, after which the sorting process is complete.
The first difficulty with this approach is the work involved in gathering and sorting the samples: a larger value of s results in better load balancing, but it also increases the overhead. The second difficulty is that duplicate values cannot be accommodated by the binary search unless each item is tagged with a unique value [8], which of course doubles the cost of both memory access and interprocessor communication. The third difficulty is that, no matter how the communication is scheduled, there exist inputs that give rise to large variations in the number of elements destined for the different processors, which in turn results in an inefficient use of the communication bandwidth; moreover, such irregular communication cannot take advantage of the regular communication primitives proposed in the MPI standard [20].
In the algorithm described below we are able to avoid all three difficulties: the splitters are obtained at virtually no extra cost, duplicates are handled correctly without tagging, and the elements are exchanged using two balanced calls to the transpose primitive.
4.1 A New Sample Sort

Consider the problem of sorting n elements equally distributed amongst p processors, where we assume without loss of generality that p divides n evenly. The idea behind sample sort is to find a set of p − 1 splitters that partition the n input elements into p groups indexed from 0 up to p − 1, such that every element in the i-th group is less than or equal to each of the elements in the (i + 1)-th group, for 0 ≤ i ≤ p − 2. The task of sorting each of the p groups is then turned over to the correspondingly indexed processor, after which the n elements are arranged in sorted order. The efficiency of this scheme obviously depends on how evenly the input is partitioned, which in turn depends on how well the splitters are chosen. In our algorithm, the input is first randomly redistributed, after which the splitters are selected from the sorted share of a single processor; this yields very even partitions while incurring virtually no overhead, and it remains correct in the presence of duplicates. The pseudocode for our algorithm is as follows:
● Step (1): Each processor P_i (0 ≤ i ≤ p − 1) randomly assigns each of its n/p elements to one of p buckets. With high probability, no bucket will receive more than c₁(n/p²) elements, where c₁ is a constant to be defined later.

● Step (2): Each processor P_i routes the contents of bucket j to processor P_j, for (0 ≤ i, j ≤ p − 1). Since with high probability no bucket holds more than c₁(n/p²) elements, this operation is equivalent to performing a transpose with block size c₁(n/p²).

● Step (3): Each processor P_i sorts the (β ≤ c₁(n/p)) elements it received in Step (2) using an appropriate sequential sorting algorithm: for integers we use radix sort, whereas for floating point numbers we use merge sort.

● Step (4): From its sorted list of β elements, processor P_0 selects each (j β/p)-th element as Splitter[j], for (1 ≤ j ≤ p − 1). By default, Splitter[p] is the largest value allowed by the data type used. Additionally, for each Splitter[j], a binary search is used to compute Frac[j], the fraction of the elements at processor P_0 with the same value as Splitter[j] that lie between index ((j − 1)β/p) and ((j β/p) − 1).

● Step (5): Processor P_0 broadcasts the p splitters and their corresponding Frac values to the other p − 1 processors.

● Step (6): Each processor P_i uses binary search on its local sorted array to define, for each of the p splitters, a subsequence S_j. The subsequence associated with Splitter[j] contains all those values which are greater than Splitter[j − 1] and less than Splitter[j], together with Frac[j] of the elements in the array with the same value as Splitter[j].

● Step (7): Each processor P_i routes the subsequence associated with Splitter[j] to processor P_j, for (0 ≤ i, j ≤ p − 1). Since with high probability no subsequence holds more than c₂(n/p²) elements, where c₂ is a constant to be defined later, this operation is equivalent to performing a transpose with block size c₂(n/p²).

● Step (8): Each processor P_i merges the p sorted subsequences it received in Step (7) to produce the i-th column of the sorted array. With high probability, no processor receives more than α₂(n/p) elements.
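The subtle part of the algorithm is the apportioning of duplicates in Steps (4) and (6). The following sequential C fragment is a minimal sketch of one way to compute the subsequence boundaries on a sorted local array, splitting ties on each splitter according to its Frac value; the names and the rounding rule are our assumptions, not the paper's exact code.

/* Given a sorted local array a[0..m-1], the p splitters (splitter[p-1] is
 * the data type's maximum by default), and the tie-splitting fractions
 * frac[j], compute end[j] = one past the last local index belonging to
 * processor j's subsequence, so that subsequence j is
 * a[end[j-1] .. end[j]-1] (with end[-1] taken as 0). */

static int lower_bound(const long *a, int m, long key) {
    int lo = 0, hi = m;                 /* first index with a[i] >= key */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

static int upper_bound(const long *a, int m, long key) {
    int lo = 0, hi = m;                 /* first index with a[i] > key */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] <= key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

void partition_by_splitters(const long *a, int m, const long *splitter,
                            const double *frac, int p, int *end) {
    for (int j = 0; j < p; j++) {
        int lo   = lower_bound(a, m, splitter[j]);  /* values  < splitter[j] */
        int hi   = upper_bound(a, m, splitter[j]);  /* values <= splitter[j] */
        int ties = hi - lo;                         /* local copies of the splitter */
        end[j] = lo + (int)(frac[j] * ties + 0.5);  /* frac[j] of the ties go to P_j */
        if (j > 0 && end[j] < end[j - 1]) end[j] = end[j - 1];
    }
    end[p - 1] = m;                                 /* last subsequence takes the rest */
}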
The bounds asserted in these steps are established by the following theorem, whose proof has been omitted for brevity and can be found in [14].

Theorem 1: With high probability — at least (1 − n^(−ε)) for some positive constant ε — no bucket holds more than c₁(n/p²) elements at the completion of Step (1), no subsequence formed in Step (6) holds more than c₂(n/p²) elements, and no processor receives more than α₂(n/p) elements at the completion of Step (7), for p² < n. The constants involved are small: approximately, c₁ ≈ 2, α₂ ≈ 1.77, and c₂ ≈ 2.62, with c₂ rising to about 4.24 in the presence of duplicates.

The local computation is dominated by the sequential sorting in Step (3) and the merging in Step (8): sorting integers takes O(n/p) time with radix sort, sorting floating point numbers takes O((n/p) log(n/p)) time with merge sort, and merging the p sorted subsequences takes O((n/p) log p) time; Steps (4) and (6) involve only binary searches. The communication is dominated by the two transposes in Steps (2) and (7) and the broadcast in Step (5). Hence, with high probability, the overall complexity of our sample sort algorithm (for floating point numbers) is

T(n, p) = T_comp(n, p) + T_comm(n, p) = O((n/p) log n + τ + (n/p)σ),

for p² < n, which is asymptotically optimal with small coefficients.
We implemented our sample sort in SPLIT-C and evaluated it on nine benchmarks, described in Appendix A, whose input distributions cover a wide variety of cases, including inputs with a large number of duplicates. Each benchmark was run in both a 32-bit integer version and a 64-bit double precision floating point version, as a function of different input sizes. Table I displays the results of the integer version on a 64-node Cray T3D. The performance of our algorithm is essentially independent of the input distribution: the variation across the nine benchmarks is slight and is attributed largely to differences in the cost of the sequential sorting in Step (3).
Table I: Total execution time (in seconds) of the benchmarks for sample sorting 8M integers on a 64-node Cray T3D.
Table II examines the scalability of our sample sort as a function of machine size; it shows that, for a fixed input size n, the execution time scales almost inversely with the number of processors p. Figure 3 examines the scalability as a function of problem size: for a given machine size, the execution time grows almost linearly with the total number of elements n.

Table II: Total execution time (in seconds) for sample sorting 8M integers [U benchmark] on a variety of machines and numbers of processors. A hyphen indicates that that particular platform was unavailable to us.
Figure 3: Scalability of sorting integers [U benchmark] with respect to machine and problem size, for differing numbers of processors, on the Cray T3D and the IBM SP-2-WN.

Table III compares our execution times for sorting 8M integers with the best times reported for the Class A NAS Integer Sort (IS) Benchmark.

Table III: Comparison of our execution time (in seconds) with the best reported times for the Class A NAS IS Benchmark.
Note that the NAS benchmark requires only that the integers be ranked, while we actually place them in sorted order, which requires an additional rearrangement of the data and is a significantly more difficult task. Note also that the reported times were obtained with machine-specific implementations, in some cases perhaps using system-level modifications unavailable to high-level programs, so a direct comparison with our portable code is somewhat misleading; even so, our times compare favorably. See [15] for additional comparisons with previously published results.
4.2 A New Sorting Algorithm by Regular Sampling

A disadvantage of our sample sort algorithm is that its bounds on the load balance and on the memory requirements are guaranteed only with high probability. An alternative with deterministic guarantees is to choose the samples by regular sampling. A previously known algorithm of this type is Parallel Sorting by Regular Sampling (PSRS) [23, 19], which proceeds as follows.
In PSRS, each processor first sorts its n/p input values and then selects every (n/p²)-th element of its sorted list as a sample. These samples are routed to a single processor, which sorts them and selects every p-th sample as a splitter. The splitters are broadcast to every processor, and each processor uses binary search to partition its sorted input into p subsequences. The subsequences are then routed to the appropriate destinations, after which each processor merges the p sorted subsequences it receives to complete the sorting process.
The first difficulty with this approach is the load balance in the final merging: there exist inputs for which, at the completion of sorting, a single processor holds as many as (2n/p − n/p² − p + 1) elements [19]. This could be reduced by choosing more samples, but only at the cost of more overhead in gathering and sorting them; and no matter how many splitters are used, the load balance still deteriorates linearly with the number of duplicates in the input. The second difficulty is that, no matter how the communication in the routing phase is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, which in turn results in an inefficient use of the communication bandwidth; such irregular routing also cannot take advantage of the regular communication primitives proposed in the MPI standard [20]. In our algorithm, which is parameterized by a sampling ratio s (p ≤ s ≤ n/p²), we guarantee that, at the completion of sorting, each processor holds at most (n/p + n/s − p) elements, regardless of the number of duplicates in the input. Moreover, we are able to replace the irregular routing with exactly two calls to our transpose primitive. The pseudocode for our algorithm is as follows:
● Step (1): Each processor P_i (0 ≤ i ≤ p − 1) sorts its n/p input values using an appropriate sequential sorting algorithm (radix sort for integers, merge sort for floating point values) and then "deals" its sorted list into p bins, so that the k-th element of the sorted list is placed into the ⌊k/p⌋-th position of bin (k mod p). Note that each bin is itself sorted.

● Step (2): Each processor P_i routes the contents of bin j to processor P_j, for (0 ≤ i, j ≤ p − 1). Since each bin holds exactly n/p² elements, this operation is equivalent to performing a transpose with block size n/p².

● Step (3): From each of the p sorted subsequences received in Step (2), each processor selects, for a given value of s (p ≤ s ≤ n/p²), each (k n/(ps))-th element as a sample, for (1 ≤ k ≤ s/p), and routes the selected samples to processor P_{p−1}.

● Step (4): Processor P_{p−1} merges the p sorted sequences of samples received in Step (3) and then selects each (ks)-th sample as Splitter[k], for (1 ≤ k ≤ p − 1). By default, the p-th splitter is the largest value allowed by the data type used. Additionally, for each Splitter[k], binary search is used to compute Est[k], the number of samples with the same value as Splitter[k] amongst the samples with indices (k − 1)s through (ks − 1); Est[k] is the upper bound used to apportion the elements equal to Splitter[k].

● Step (5): Processor P_{p−1} broadcasts the p splitters and their associated Est values to the other p − 1 processors.

● Step (6): Each processor P_i uses binary search over each of its p sorted subsequences to identify, for each splitter Splitter[k], the piece S[j, k] of subsequence j that belongs to processor P_k: the values greater than Splitter[k − 1] and less than Splitter[k], together with elements equal to Splitter[k] as apportioned by Est[k].

● Step (7): Each processor P_i routes the pieces associated with Splitter[j] to processor P_j, for (0 ≤ i, j ≤ p − 1). Since no two processors exchange more than (n/p² + n/(ps) − 1) elements, this operation is equivalent to performing a transpose with block size (n/p² + n/(ps) − 1).

● Step (8): Each processor P_i merges the p sorted subsequences it received in Step (7) to produce the i-th column of the sorted array.

The guarantees are summarized by the following theorem, whose proof has been omitted for brevity and can be found in [15].

Theorem 2: The number of elements exchanged by any two processors in Step (7) is at most (n/p² + n/(ps) − 1). Consequently, at the completion of sorting, no processor holds more than (n/p + n/s − p) elements.

The local computation is again dominated by the sequential sorting in Step (1) and the merging in Step (8), and the communication consists of the two transposes, the sample gathering, and the broadcast. Hence, for n ≥ p³ and p ≤ s ≤ n/p², the overall complexity of our regular sample sort algorithm is

T(n, p) = T_comp(n, p) + T_comm(n, p) = O((n/p) log n + τ + (n/p)σ).
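The "dealing" in Step (1), which underlies these deterministic guarantees, is just a strided copy of the locally sorted run; a compact C sketch (names ours):

/* Deal a sorted array a[0..m-1] into p bins so that the k-th element goes
 * to position k/p of bin (k mod p) (Step (1)); m is assumed to be a
 * multiple of p.  Bin j then holds elements j, j+p, j+2p, ... of the
 * sorted order, so each bin is itself sorted and forms a regular sample
 * of the local run. */
void deal_sorted(const long *a, int m, int p, long **bins) {
    for (int k = 0; k < m; k++)
        bins[k % p][k / p] = a[k];
}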
We implemented and tested our regular sampling algorithm on the same nine benchmarks. Table IV displays the results for sorting 8M integers on a 64-node Cray T3D with s = 8p; once again, the execution time is essentially independent of the input distribution.

Table IV: Total execution time (in seconds) of the benchmarks for regular sample sorting 8M integers on a 64-node Cray T3D (s = 8p).
Table V examines the scalability of our regular sample sort as a function of machine size, and Figure 4 examines it as a function of problem size: the execution time scales almost inversely with the number of processors p and almost linearly with the input size n.

Table V: Total execution time (in seconds) for regular sample sorting 8M integers [U benchmark] on a variety of machines and numbers of processors (s = 8p). A hyphen indicates that that particular platform was unavailable to us.

Figure 4: Scalability of regular sample sorting integers [U benchmark] with respect to problem size, for differing numbers of processors, on the Cray T3D and the IBM SP-2-WN (s = 8p).
4.3 Radix Sort

Finally, we consider an efficient sorting algorithm for the case in which the n keys are integers in the range [0, M − 1], equally distributed over a p-processor distributed memory machine: radix sort. The algorithm decomposes each key into groups of r-bit blocks, for a suitably chosen r, and sorts the keys by sorting on each of the blocks in sequence, beginning with the block containing the least significant bit positions, using Counting Sort [14]. Counting Sort sorts n integers in the range [0, R − 1], where R = 2^r, by using R counters to accumulate the number of keys equal to each value i in bucket B_i, followed by determining the rank of each element. Once the rank of each element is known, we can use our h-relation personalized communication to move each element into its correct position, with h = n/p. Counting Sort is a stable sorting routine: if two keys are identical, their relative order in the final sort remains the same as their initial relative order, which makes the radix sort correct.
The pseudocode for our Counting Sort algorithm uses six steps and is as follows:

● Step (1): For each processor i, count the frequency of its n/p keys; that is, compute I[i][k], the number of keys equal to k, for (0 ≤ k ≤ R − 1).

● Step (2): Apply the transpose primitive to the I array so that each processor holds R/p consecutive rows of the array. The block size of the transpose primitive is R/p.

● Step (3): Each processor locally computes the prefix-sums of its rows of the array.

● Step (4): Apply the (inverse) transpose primitive to the R/p rows of prefix-sums, each augmented by the total count for the corresponding row. The block size of the transpose primitive is 2R/p.

● Step (5): Each processor computes the ranks of its local keys; the rank determines the final location of each key.

● Step (6): Perform a personalized communication of the keys to their final destinations using our h-relation algorithm with h = n/p.
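A minimal sketch of Steps (1)-(5) for a single r-bit digit, in C with MPI: the two transposes of the count array become MPI_Alltoall calls. R is assumed to be a multiple of p, the digit is taken from the low-order bits (a later pass would shift the key first), and all names are ours.

#include <mpi.h>
#include <stdlib.h>

/* One Counting Sort pass over an r-bit digit (R = 1 << r), Steps (1)-(5).
 * keys[0..m-1] are the local keys; rank[k] receives the global destination
 * index of keys[k].  Step (6) would then route the keys with h = n/p. */
void counting_ranks(const int *keys, int m, int r, int *rank) {
    int p;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int R = 1 << r, rows = R / p;

    /* Step (1): local frequency count of each digit value */
    int *I = calloc(R, sizeof(int));
    for (int k = 0; k < m; k++) I[keys[k] & (R - 1)]++;

    /* Step (2): transpose; we now own rows [myrank*rows, (myrank+1)*rows),
     * with C[q*rows + t] = count of our t-th owned value on processor q */
    int *C = malloc(R * sizeof(int));
    MPI_Alltoall(I, rows, MPI_INT, C, rows, MPI_INT, MPI_COMM_WORLD);

    /* Step (3): per owned value, prefix-sum across processors plus the total */
    int *PS = malloc(2 * R * sizeof(int));  /* per destination q: prefix, total */
    for (int q = 0; q < p; q++)
        for (int t = 0; t < rows; t++) {
            int sum = 0, before_q = 0;
            for (int u = 0; u < p; u++) {
                if (u == q) before_q = sum;
                sum += C[u * rows + t];
            }
            PS[q * 2 * rows + t]        = before_q;  /* copies on procs < q */
            PS[q * 2 * rows + rows + t] = sum;       /* total copies        */
        }

    /* Step (4): inverse transpose; the block size is 2R/p, as noted above */
    int *B = malloc(2 * R * sizeof(int));
    MPI_Alltoall(PS, 2 * rows, MPI_INT, B, 2 * rows, MPI_INT, MPI_COMM_WORLD);
    int *before = malloc(R * sizeof(int)), *total = malloc(R * sizeof(int));
    for (int q = 0; q < p; q++)
        for (int t = 0; t < rows; t++) {
            before[q * rows + t] = B[q * 2 * rows + t];
            total [q * rows + t] = B[q * 2 * rows + rows + t];
        }
    int *off = malloc(R * sizeof(int));     /* keys with strictly smaller value */
    for (int v = 0, s = 0; v < R; v++) { off[v] = s; s += total[v]; }

    /* Step (5): stable global rank of every local key */
    int *seen = calloc(R, sizeof(int));
    for (int k = 0; k < m; k++) {
        int v = keys[k] & (R - 1);
        rank[k] = off[v] + before[v] + seen[v]++;
    }
    free(I); free(C); free(PS); free(B);
    free(before); free(total); free(off); free(seen);
}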
Table VI presents a comparison of our radix sort implementation (identified as BHJ) with another implementation in SPLIT-C by Alexandrov et al. [1] (identified as AIS), which was tuned for the Meiko CS-2.

Table VI: Total execution time (in seconds) for radix sorting 32-bit integers on the Meiko CS-2, comparing the AIS implementation and ours.

Figure 5: Scalability of radix sort with respect to machine and problem size.
The inputs used in this comparison are those of [1]: [C] is cyclically sorted, [R] is random, and [N] is an approximation of a Gaussian distribution. Figure 5 shows that the performance of our radix sort scales with both machine and problem size; additional performance results are given in [4].
References

[1] A. Alexandrov, M. Ionescu, K.E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model — One step closer towards a realistic model for parallel computation. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95-105, Santa Barbara, CA, July 1995.

[2] R.H. Arpaci, D.E. Culler, A. Krishnamurthy, S.G. Steinberg, and K. Yelick. Empirical Evaluation of the CRAY-T3D: A Compiler Perspective. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 320-331, Santa Margherita Ligure, Italy, June 1995.

[3] D. Bader. Randomized and Deterministic Routing Algorithms for h-Relations. ENEE 648X Class Report, April 1, 1994.

[4] D.A. Bader, D.R. Helman, and J. JáJá. Practical Parallel Algorithms for Personalized Communication and Integer Sorting. Technical Report CS-TR-3548 and UMIACS-TR-95-101, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, November 1995. To appear in ACM Journal of Experimental Algorithmics.

[5] D.A. Bader and J. JáJá. Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection. Technical Report CS-TR-3494 and UMIACS-TR-95-74, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, July 1995. To be presented at the 10th International Parallel Processing Symposium, Honolulu, HI, April 1996.

[6] D.A. Bader and J. JáJá. Parallel Algorithms for Image Histogramming and Connected Components with an Experimental Study. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-133, Santa Barbara, CA, July 1995. To appear in Journal of Parallel and Distributed Computing.

[7] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6:154-164, 1995.

[8] G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, and M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 3-16, July 1991.

[9] W.W. Carlson and J.M. Draper. AC for the T3D. Technical Report SRC-TR-95-141, Supercomputing Research Center, Bowie, MD, February 1995.

[10] Cray Research, Inc. SHMEM Technical Note for C, October 1994. Revision 2.3.

[11] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy, S. Lumetta, S. Luna, T. von Eicken, and K. Yelick. Introduction to Split-C. Computer Science Division — EECS, University of California, Berkeley, version 1.0 edition, March 1994.

[12] D.E. Culler, A.C. Dusseau, R.P. Martin, and K.E. Schauser. Fast Parallel Sorting Under LogP: From Theory to Practice. In Portability and Performance for Parallel Processing, chapter 4, pages 71-98. John Wiley & Sons, 1994.

[13] D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[14] D.R. Helman, D.A. Bader, and J. JáJá. A Randomized Parallel Sorting Algorithm With an Experimental Study. Technical Report CS-TR-3549 and UMIACS-TR-95-102, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, December 1995.

[15] D.R. Helman, D.A. Bader, and J. JáJá. Parallel Algorithms for Personalized Communication and Sorting With an Experimental Study. Technical Report, UMIACS and Electrical Engineering, University of Maryland, College Park, MD, June 1996.

[16] J.S. Huang and Y.C. Chow. Parallel Sorting and Data Partitioning by Sampling. In Proceedings of the 7th Computer Software and Applications Conference, pages 627-631, November 1983.

[17] J.F. JáJá and K.W. Ryu. The Block Distributed Memory Model for Shared Memory Multiprocessors. In Proceedings of the 8th International Parallel Processing Symposium, pages 752-756, Cancún, Mexico, April 1994. To appear in IEEE Transactions on Parallel and Distributed Systems.

[18] M. Kaufmann, J.F. Sibeyn, and T. Suel. Derandomizing Algorithms for Routing and Sorting on Meshes. In Proceedings of the 5th Symposium on Discrete Algorithms, pages 669-679. ACM-SIAM, 1994.

[19] X. Li, P. Lu, J. Schaeffer, J. Shillington, P.S. Wong, and H. Shi. On the Versatility of Parallel Sorting by Regular Sampling. Parallel Computing, 19:1079-1103, 1993.

[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical report, University of Tennessee, Knoxville, TN, June 1995. Version 1.1.

[21] S. Ranka, R.V. Shankar, and K.A. Alsabti. Many-to-many Personalized Communication with Bounded Traffic. In The Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 20-27, McLean, VA, February 1995.

[22] S. Rao, T. Suel, T. Tsantilas, and M. Goudreau. Efficient Communication Using Total-Exchange. In Proceedings of the 9th International Parallel Processing Symposium, pages 544-550, Santa Barbara, CA, April 1995.

[23] H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14:361-372, 1992.

[24] L.G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103-111, 1990.
A Sorting Benchmarks

Our nine sorting benchmarks are defined as follows, in which MAX is (2³¹ − 1) for integers and approximately 1.8 × 10³⁰⁸ for doubles:

1. Uniform [U], a uniformly distributed random input, obtained by calling the C library random number generator random(). This function, which returns integers in the range 0 to (2³¹ − 1), is initialized by each processor P_i with the value (23 + 1001i). For the double data type, we "normalize" the integer benchmark values by first assigning the integer returned by random() a randomly chosen sign bit and then scaling the result by MAX/(2³¹ − 1).

2. Gaussian [G], a Gaussian distributed random input, approximated by adding four calls to random() and dividing the result by four. For the double data type, we normalize the values in the manner described for [U].
3. Zero [Z], a zero entropy input, created by setting every value to a constant such as zero.

4. Bucket Sorted [B], an input that is sorted into p buckets, obtained by setting the first n/p² elements at each processor to be random numbers between 0 and (MAX/p − 1), the second n/p² elements at each processor to be random numbers between MAX/p and (2MAX/p − 1), and so forth.

5. g-Group [g-G], an input created by first dividing the processors into groups of consecutive processors of size g, where g can be any integer which partitions p evenly. If we index these groups in consecutive order, then for the processors in group j we set the first n/(pg) elements to be random numbers between ((jg + g/2) mod p)·MAX/p and ((((jg + g/2) mod p) + 1)·MAX/p − 1), the second n/(pg) elements to be random numbers between ((jg + g/2 + 1) mod p)·MAX/p and ((((jg + g/2 + 1) mod p) + 1)·MAX/p − 1), and so forth.
6. Staggered [S], created as follows: if the processor index i is less than p/2, then we set all n/p elements at that processor to be random numbers between (2i + 1)·MAX/p and ((2i + 2)·MAX/p − 1); otherwise, we set all n/p elements to be random numbers between (i − p/2)·MAX/p and ((i − p/2 + 1)·MAX/p − 1).

7. Worst-Case Regular [WR] — an input consisting of values between 0 and MAX designed to induce the worst possible load balance at the completion of our regular sample sort: at the completion of sorting, the odd-indexed processors hold (n/p + n/s − p) elements, whereas the even-indexed processors hold (n/p − n/s + p) elements.
8. Randomized Duplicates [RD], an input of duplicates in which each processor fills an array T with some constant number range of random values between 0 and (range − 1) (range is 32 for our work) whose sum is S. The first T[1] × (n/p)/S values of the input are then set to a random value between 0 and (range − 1), the next T[2] × (n/p)/S values are set to another random value, and so forth. (The g-Group benchmark is used with g = 2 and g = 4, yielding the [2-G] and [4-G] inputs, for a total of nine benchmarks.)

See [15] for additional details of these benchmarks.
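For concreteness, here is a minimal C sketch of two of these generators, assuming the POSIX random() generator the definitions above are stated in terms of; the function names are ours.

#include <stdlib.h>

/* [G] Gaussian: approximated by averaging four calls to random(), each of
 * which returns an integer in [0, 2^31 - 1]; a 64-bit accumulator avoids
 * overflow during the sum. */
long gaussian_key(void) {
    long long sum = (long long)random() + random() + random() + random();
    return (long)(sum / 4);
}

/* [B] Bucket Sorted: the k-th chunk of nloc/p elements on every processor
 * gets random values from the k-th of p equal intervals of [0, max) */
void bucket_sorted(long *a, int nloc, int p, long max) {
    int chunk = nloc / p;
    for (int k = 0; k < p; k++) {
        long lo = k * (max / p);
        for (int j = 0; j < chunk; j++)
            a[k * chunk + j] = lo + random() % (max / p);
    }
}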
We selected these nine benchmarks for a variety of reasons. Previous researchers have used the Uniform, Gaussian, and Zero benchmarks, and so we too included them for purposes of comparison. But benchmarks should be designed to elicit the worst possible behavior from an algorithm, and the remaining benchmarks were designed with this challenge in mind. Perhaps the most straightforward way to route data is by iterating over the p destinations in the same order at every processor, routing at each iteration the data destined for one particular processor. This naive strategy performs poorly on the Bucket Sorted benchmark, in which the data at each processor is grouped by destination: at every iteration all the processors route data to the same destination, and the resulting contention wastes most of the communication bandwidth. Alternatively, an algorithm might route the data using a sequence of permutations, in which at each iteration every processor sends to a unique destination. The g-Group benchmark was designed to thwart this strategy: the data at each processor is destined for a subset of only g processors, so unless the processors synchronize after each permutation, the processors which share the same subset of g destinations can stall one another, possibly reducing the effective communication bandwidth by a factor of p/g.
The Staggered benchmark was included for a similar reason: every processor needs to send all of its data to a single processor a fixed stride away, so a routing scheme that iterates over the destinations in the same order at every processor leaves most of the communication bandwidth idle. Of course, one can always construct a permutation of the data tailored to defeat a single deterministic algorithm; our benchmarks simply capture communication scenarios that we would expect to be difficult for a variety of reasonable routing strategies. Finally, the Worst-Case Regular benchmark was included to evaluate the worst case running time of our regular sample sort algorithm, and the Randomized Duplicates benchmark was included to assess performance in the presence of duplicate values.
Acknowledgements

We would like to thank Ronald Greenberg of UMCP's Electrical Engineering Department for his valuable comments and encouragement. We would also like to thank the CASTLE/SPLIT-C group at The University of California, Berkeley, especially for the help and encouragement from David Culler, Arvind Krishnamurthy, and Lok Tin Liu. Arvind Krishnamurthy provided additional help with his port of SPLIT-C to the Cray Research T3D [2]. William Carlson and Jesse Draper from the Center for Computing Science (formerly the Supercomputing Research Center) wrote the parallel compiler AC (version 2.6) [9] on which the T3D port of SPLIT-C has been based.

We acknowledge the use of the UMIACS 16-node IBM SP-2-TN2, which was provided by an IBM Shared University Research award and by NSF Academic Research Infrastructure Grant No. CDA9401151. The Jet Propulsion Lab/Caltech 256-node Cray T3D used in this investigation was provided by funding from the NASA Offices of Mission to Planet Earth, Aeronautics, and Space Science. We also acknowledge the use of the 160-node IBM SP-2-WN at the Numerical Aerodynamic Simulation Systems Division of NASA Ames Research Center, and of the CM-5 at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, under grant number ASC960008N.
Please see http://www.umiacs.umd.edu/research/EXPAR for additional performance information. In addition, all the code used in this paper will be freely available for interested parties from our anonymous ftp site, ftp://ftp.umiacs.umd.edu/pub/dbader. We encourage other researchers to compare with our results for similar inputs.