A Simple Load Balancing Scheme for Task Allocation in Parallel Machines

Larry Rudolph
Department of Computer Science, Hebrew University, Jerusalem, Israel
(currently visiting the IBM T.J. Watson Research Center, Yorktown Heights, NY)

Miriam Slivkin-Allalouf
John Bryce Based Industries Ltd., P.O. Box 23838, Jerusalem, Israel

Eli Upfal
IBM Almaden Research Center, San Jose, CA,
and Department of Applied Mathematics, Weizmann Institute, Rehovot, Israel
Abstract

A collection of local workpiles (task queues) and a simple load balancing scheme are well suited for scheduling tasks in shared-memory parallel machines. Task scheduling on such machines is usually done through a single, globally accessible workpile. The simple load balancing scheme achieves performance comparable to that of a global workpile, even though each processor works mostly on its own local workpile and tasks do not migrate frequently. The balancing operation consists of examining the workpile of one other processor and exchanging tasks so as to equalize the sizes of the two workpiles; a processor performs this operation with probability inversely proportional to the size of its own workpile. We prove that under this scheme each processor receives its fair share of the computation: the expected size of each local workpile is within a small constant factor of the average, i.e., the total number of tasks in the system divided by the number of processors.
1 Introduction

The scheduling of the activities of a parallel program onto the processors of a parallel machine can significantly influence its performance. A poor scheduling policy may leave processors idle while tasks wait to be executed, or may unduly consume CPU cycles and memory accesses on the scheduling itself. Many users of today's parallel machines are not willing to invest a Herculean effort in finding a clever mapping of tasks to processors: they want code that is portable, has low development cost, and does not depend on a detailed knowledge of the application or of the machine. We believe our results are applicable to most asynchronous shared memory parallel machines and relieve the user of much of this effort.

Load balancing and task scheduling schemes have been widely studied, but under very different assumptions about either the machine or the application. For example, applications such as particle-in-cell calculations, finite element methods, and finite difference methods are often mapped onto message passing machines by hand: particles or mesh points are moved across processor boundaries so as to equalize the load, and information is exchanged only at natural synchronization points ([DG89]). Such schemes make use of detailed knowledge of the application. More general, adaptive schemes try to find the least loaded node: each processor queries only a small number of neighbors, or up-to-date load information is kept by polling at periodic intervals (e.g. [CK79], [BKW89], [JW89]). Sometimes a central server is considered ([BK90]). There have also been attempts to use simulated
annealing ([FFKS89]). Many schemes have been proposed specifically for hypercube architectures ([DG90], [DG89]); these balance the load either only between adjacent processors or between fixed, predetermined pairs of processors. Another approach [KR89] has a PE poll other PEs on demand when it runs out of work. Hong et al. [HTC89] achieve a global balancing effect by having pairs of neighbors, or recursively larger subcubes, equalize their loads, but the balancing occurs at fixed, predetermined times. In contrast, our scheme couples balancing with the scheduling itself: it may occur at any time, and it is initiated exactly by those processors for which it is cheap and most useful.

We focus on shared memory architectures, where the cost of examining another processor's workpile is comparable to a few local memory accesses or a few messages, and where a task can be moved simply by moving a pointer. On message passing machines, on the other hand, migrating a task can be expensive, and the trade-offs are different.

Our scheme is simple and distributed. Each processor (PE) executes tasks from its own local workpile and does not migrate a task once it has begun executing it. Whenever a PE accesses its workpile it may, with probability inversely proportional to the size of that workpile, perform a balancing operation with a single randomly chosen partner. A PE with a long workpile therefore rarely suffers balancing overhead, while a PE with a short or empty workpile balances frequently. Unless such an operation is performed, tasks stay where they were created, so the scheme preserves locality. There is no global state, no central server, and no predetermined matching of processors; no processor will be idle for long while work exists somewhere in the system.

The rest of the paper is organized as follows. The next section presents our assumptions about the parallel system and the two basic task allocation schemes, one based on a global workpile and one based on local workpiles. Section 3 presents the load balancing scheme. Section 4 presents simulation results and an analysis of the expected behavior of the scheme; in most cases the performance of the local scheme is comparable to that of the global workpile, which is somewhat surprising at first. Finally, Section 5 presents our conclusions.
2 Foundations

We consider tightly coupled, shared memory parallel machines with a flat memory hierarchy and uniform memory access costs, although our results also apply to restricted communication architectures, such as hypercubes. Moreover, we are able to prove a bound on the expected behavior of the scheme.

A task is a piece of computation that can be executed by any of the processors, independently of all other existing tasks. It consists of the code, registers, I/O state, and possibly a portion of the memory of a parallel program. A parallel program consists of many tasks, which may be generated dynamically during the execution of the program. A task is treated as an indivisible unit: a processor that begins executing a task continues until the task terminates, suspends, or a time quantum has elapsed; a processor does not usually give up a task voluntarily, although there may be a preemptive time-slicing mechanism at a higher level (e.g. see [R87]). A task that suspends or is preempted may later be resumed, possibly on a PE different from the one that executed it previously.

Tasks that are ready to be executed are maintained in a data structure that we call a workpile. We use this term instead of the more traditional term "task queue" since the tasks need not be kept in a strict FIFO order; a workpile may, for example, be organized as a set of priority levels or as some hybrid structure. A processing element (PE) consists of a CPU and some local memory, and it executes one task at a time. The number of PEs is fixed throughout the computation.

Two task allocation schemes are considered in this paper. The first is based on a single global workpile; the second is based on a collection of local workpiles, one per processor, together with a load balancing mechanism. Many hybrid organizations are of course possible, but these two distinct strategies capture the main trade-off, and we compare them closely in the remainder of the paper.
2.1 The Global Workpile

In the first scheme, referred to as the Global Workpile scheme, all the ready-to-run tasks are kept in a single, global workpile that is accessible to all the processors. Each processor removes a task from the global workpile and executes it until the task terminates, suspends, or a time quantum has elapsed. If the task has not terminated it is returned to the global workpile, and the processor removes another task and continues. If the global workpile is empty, the processor is idle and retries after a short period of time. Access to the global workpile is serialized: if multiple processors attempt to add or remove tasks simultaneously, the insertions and deletions are processed one at a time, in an arbitrary order.

The global workpile provides load balancing for free, in the sense that no processor is ever idle while there is a ready task in the system, and each processor automatically receives a fair share of the work. Its disadvantage is that access to the single shared data structure is sequential; with a large number of processors accessing it frequently, the global workpile becomes a significant bottleneck, and the serialization delays may dominate the cost of the scheduling itself.
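To make the serialization point concrete, the following is a minimal sketch of how a global workpile might look on a shared-memory machine: a fixed-size circular FIFO protected by one lock. The names (struct task, gw_push, gw_pop) and the fixed capacity are illustrative assumptions made here, not taken from the paper or from any particular system, and error handling is reduced to returning failure when the pile is empty or full.

#include <pthread.h>
#include <stddef.h>

struct task { void (*run)(void *); void *arg; };   /* an indivisible unit of work */

#define GW_CAP 1024

/* A single, globally accessible workpile: every insertion and deletion
 * takes the one lock, so concurrent accesses are serialized. */
static struct {
    pthread_mutex_t lock;
    struct task buf[GW_CAP];
    size_t head, count;                    /* circular FIFO */
} gw = { PTHREAD_MUTEX_INITIALIZER };

int gw_push(struct task t)                 /* returns 0 on success, -1 if full */
{
    int ok = -1;
    pthread_mutex_lock(&gw.lock);
    if (gw.count < GW_CAP) {
        gw.buf[(gw.head + gw.count) % GW_CAP] = t;
        gw.count++;
        ok = 0;
    }
    pthread_mutex_unlock(&gw.lock);
    return ok;
}

int gw_pop(struct task *out)               /* returns 0 on success, -1 if empty */
{
    int ok = -1;
    pthread_mutex_lock(&gw.lock);
    if (gw.count > 0) {
        *out = gw.buf[gw.head];
        gw.head = (gw.head + 1) % GW_CAP;
        gw.count--;
        ok = 0;
    }
    pthread_mutex_unlock(&gw.lock);
    return ok;
}

Every processor's scheduling loop goes through gw_pop and gw_push, so the lock is taken on every scheduling decision; this is exactly the contention that the local-workpile scheme below avoids.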
[Figure 1: Eight workpiles, one per PE (PE1 through PE8), each holding a column of tasks (marked x) of different heights. Each PE has an associated workpile and executes and deletes tasks only from this workpile. A load balancing mechanism may move tasks from one workpile to another: PEi executes a load balancing operation with probability 1/li, where li is the length of its workpile.]
2.2 Local Workpile Queues

An alternative to the global workpile is for each processor to maintain its own local workpile of tasks (Figure 1). Each local workpile is organized as a FIFO queue, and a processor chooses the next task to execute from the head of its own workpile. Newly created tasks are inserted into the local workpile of the processor on which they were created. A task that is preempted at the end of its time quantum is returned to the end of the local workpile, so the tasks of one workpile are, in effect, executed in round-robin fashion. If the local workpile is empty, the processor remains idle. In certain parallel languages and operating systems tasks also carry priorities; the local workpile can then consist of m distinct queues, one for each priority level in the range 0 to m-1, with the same organization within each level. For continuity of presentation we consider only a single FIFO queue per processor.

Keeping the workpiles local has two advantages. First, access to a local workpile does not interfere with the other processors, so there is no single point of contention. Second, a task is likely to be executed on the processor on which it was created, which is preferable when tasks have an affinity to local data. The obvious disadvantage is that the load may become unbalanced: one processor may be idle, or executing only a few tasks, while many tasks wait in the workpile of another. Without some balancing mechanism the initial placement of tasks completely dictates the distribution of the work, and a bad placement can be very expensive. Tasks may be moved from one local workpile to another only by an explicit load balancing operation, which is the subject of the next section.
3 Balancing Scheme

Task allocation can intervene at two points: when a task is created, by deciding on which workpile to place it (initial placement), and during execution, by moving tasks between workpiles (load balancing); combinations of the two are possible. We focus mainly on the latter.

The main conflict is between the desire of each processor to have full control of its own local workpile, executing its tasks without interference, and the desire to keep the load balanced across the whole system. Load balancing has traditionally been executed in one of two ways: either a PE that is idle polls the other PEs looking for work (demand driven), or a balancing step is performed at periodic times using knowledge of the load of the whole system. The first approach helps only once a processor is already idle; the second requires global information and a balancing period that is hard to fix in advance.

We propose a solution that is simple, local, and distributed: no processor has knowledge of the global state, and each processor decides independently when to balance and with whom. The striking feature of the scheme is that it is adaptive. The frequency with which a processor initiates load balancing is inversely proportional to the size of its own workpile, so a heavily loaded processor, for which balancing is of little benefit and the overhead would be significant, hardly ever initiates it, while a lightly loaded or idle processor, for which balancing is cheap and most needed, does so frequently. Load balancing is thus tied to the scheduling activity itself rather than to a fixed period or to an explicit user decision, and the cost of each operation remains small: it involves only one other, randomly chosen workpile, and moves tasks only when the two loads differ.
The load balancing procedure executed by processor i is shown in Figure 2. Processor i locks its own workpile, chooses another processor r uniformly at random, and locks Workpile_r as well. If the two sizes differ by more than one task, tasks are moved from the larger workpile to the smaller one until their sizes are equal; both workpiles are then unlocked. Tasks are taken from the end of the larger workpile, i.e. the tasks least likely to be executed soon, so needless migration of tasks that are about to run is avoided.

[Figure 2: The load balancing procedure executed by processor i.

    lock Workpile_i
    r = random number in [1, p], r != i
    lock Workpile_r
    if |size of Workpile_i - size of Workpile_r| > 1 then
        move tasks from the end of the larger workpile to the end of the
        smaller one, so that the sizes of Workpile_i and Workpile_r are equal
    unlock Workpile_r
    unlock Workpile_i
]
A more detailed explanation: let l_{i,t} denote the length of the workpile of processor i, and let t be a time at which processor i accesses its local workpile, either to remove the next task to execute or to insert a newly created one. Before continuing, processor i flips a coin and, with probability 1/l_{i,t}, executes the load balancing procedure of Figure 2. If l_{i,t} = 0, the processor is idle and executes the procedure with probability 1; in our implementation an idle processor uses an exponential backoff between successive attempts. Each processor has a unique seed for its random number generator, so the choices of partners are independent. A workpile may be involved in only one balancing operation at a time; if the chosen workpile is currently being accessed, the processor either waits briefly or abandons the attempt and tries again the next time it accesses its own workpile.

In summary, load balancing is tightly coupled to scheduling. In a heavily loaded system the workpile sizes are roughly equal, balancing is rarely attempted, and the fraction of time a processor invests in load balancing, and in accessing workpiles other than its own, is very small. In a lightly loaded system balancing is attempted frequently, but it is exactly then that the operation is cheap and that some processor is likely to be idle while another holds waiting tasks. Furthermore, since the probability that a processor initiates a balancing operation is inversely proportional to its current load, heavily loaded processors are disturbed mostly by partners that will relieve them of work, and needless migration between equally loaded processors is rare.
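The following sketch puts the pieces of Sections 2.2 and 3 together in C form. It is only an illustration of the mechanism, not the authors' implementation: the names (struct workpile, balance_with_random_partner, pe_schedule_step), the fixed capacity, and the use of ordered locking (always lock the lower-numbered workpile first) instead of the abandon-on-conflict rule described above are assumptions made here for brevity.

#include <pthread.h>
#include <stdlib.h>

#define NPE 16          /* number of processing elements (illustrative)   */
#define CAP 1024        /* capacity of each local workpile (illustrative) */

struct task { void (*run)(void *); void *arg; };

struct workpile {
    pthread_mutex_t lock;
    struct task buf[CAP];
    int head, count;                       /* FIFO: take from head, add at tail */
};

static struct workpile wp[NPE];

static void wp_add(struct workpile *w, struct task t)      /* append at tail */
{   w->buf[(w->head + w->count++) % CAP] = t; }

static int wp_take(struct workpile *w, struct task *out)   /* remove from head */
{
    if (w->count == 0) return -1;
    *out = w->buf[w->head]; w->head = (w->head + 1) % CAP; w->count--;
    return 0;
}

static int wp_take_tail(struct workpile *w, struct task *out)  /* remove last */
{
    if (w->count == 0) return -1;
    w->count--;
    *out = w->buf[(w->head + w->count) % CAP];
    return 0;
}

/* The balancing operation of Figure 2, with ordered locking to avoid deadlock
 * when two processors happen to pick each other at the same time.           */
static void balance_with_random_partner(int i, unsigned *seed)
{
    int r = rand_r(seed) % (NPE - 1); if (r >= i) r++;      /* r != i, uniform */
    struct workpile *a = &wp[i < r ? i : r], *b = &wp[i < r ? r : i];
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
    struct task t;
    while (a->count > b->count + 1) { wp_take_tail(a, &t); wp_add(b, t); }
    while (b->count > a->count + 1) { wp_take_tail(b, &t); wp_add(a, t); }
    pthread_mutex_unlock(&b->lock);
    pthread_mutex_unlock(&a->lock);
}

/* One scheduling step of PE i: with probability 1/l (or always, when the
 * local workpile is empty) balance first, then run the next local task.     */
void pe_schedule_step(int i, unsigned *seed)
{
    pthread_mutex_lock(&wp[i].lock);
    int len = wp[i].count;
    pthread_mutex_unlock(&wp[i].lock);

    if (len == 0 || rand_r(seed) % len == 0)
        balance_with_random_partner(i, seed);

    struct task t;
    pthread_mutex_lock(&wp[i].lock);
    int got = wp_take(&wp[i], &t);
    pthread_mutex_unlock(&wp[i].lock);
    if (got == 0) t.run(t.arg);            /* otherwise the PE is idle this step */
}

A real scheduler would add capacity checks, task creation, and termination detection; the point of the sketch is only that the balancing decision uses nothing but the local workpile length and one random partner.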
4 Results

We present two different types of results concerning the performance of the scheme: simulation results and an analytical result. The first subsection explains the details of the simulation, the second presents a sample of the simulation results, and the final subsection states and proves the analytical result: the expected workpile size at each processor is within a small constant factor of the average, where the average is the total number of active tasks in the system divided by p, the number of processors. The simulations indicate that in many cases the decentralized scheme performs about as well as the centralized one, and that once memory access costs are taken into account it can outperform the global workpile.
4.1 The Simulation

Many factors influence the performance of a scheduling scheme in a real system: memory access time, network traffic, the type of application, and the workload, among others. There is no single, generally accepted criterion for comparing scheduling strategies, and the ultimate measure is the performance of real applications on a real machine. We therefore present simulation results to suggest the general behavior of the scheme, and an analysis to prove a property of it.

The simulation is of an ideal parallel computer and makes many simplifications:

1. there is no difference in the costs of memory accesses;
2. there is no overhead in moving a task from one processor to another;
3. there is no overhead in updating cache or page table entries;
4. there is no contention on the global workpile, and insertion and deletion take no time;
5. the load balancing operation itself takes zero time.

In some of the experiments assumption 1 is relaxed and memory accesses are charged according to whether they are local, remote, or to shared data. No starvation occurs in any of the schemes: workpiles are accessed in FIFO order, and a task that migrates is moved to a position no farther from the head of its new queue, so every task is eventually executed.

In the simulated local scheme each processor chooses the next task from its own workpile; a partner for a balancing operation is chosen with a random number generator, and tasks are moved only if the difference between the two loads is greater than a threshold. Task lengths and the rate at which new tasks are created are parameters of the application being simulated.

These assumptions clearly favor the global workpile, since in practice its overheads, contention and remote access, are higher than those of any decentralized scheme. We therefore wish to argue that if the decentralized, adaptive load balancing scheme is competitive with the global workpile even without these benefits, it will be at least as attractive in situations where the costs of the centralized structure are significant.
4.2 Simulation Results

We present results for two parallel applications, a master/slave skeleton program and a parallel quicksort, each run under four different scheduling strategies: the global workpile; local workpiles with load balancing, with either a static or a random initial placement of tasks; and local workpiles with no load balancing, again with static or random initial placement.

The master/slave skeleton captures a popular style of parallel program: a master process spawns a set of slave tasks, each slave executes for some amount of time and terminates, and once those slaves finish the master spawns the next set. We vary the number of slave tasks spawned in each round and the lengths of the tasks. In the quicksort application an array is recursively subdivided: a splitting element is chosen at random, the elements smaller than the splitting element form one set and those larger form another, and the two sets are then sorted in parallel as independent tasks. In both cases we measure the time to complete the whole computation as the number of processors varies.

Figures 3, 4 and 5 summarize the simulation results. Without memory access costs (Figure 3), the global workpile does not give the best performance, and the load balancing scheme performs well regardless of the initial placement. When memory accesses are charged at different rates depending on whether they are local, remote, or to shared data (Figures 4 and 5), the load balancing scheme gives better results than the global workpile, and with nonuniform costs the performance of the global workpile is about as bad as using no load balancing at all. It is important to note that without load balancing an initially unbalanced distribution of tasks is never repaired: some processors stay idle while others work through long workpiles, and the wasted processing capacity shows up directly in the completion times.

We also examined how well the scheme keeps the loads equal over time. We ran ten master/slave applications together and measured, at each time step, how much the individual workpile lengths differed from the global average; Figure 6 plots the averaged squared difference. The load balancing scheme keeps the workpile lengths very near the average throughout the run, while without balancing the variance is large.
4.3 Analysis

Denote by L_{p,t} the load of processor p, i.e. the number of tasks in its workpile, at the start of step t, and by D_{p,t} its change at step t ignoring load balancing: the number of new tasks generated at processor p during the step minus the number of tasks that terminate at processor p during the step. The schedule of tasks to processors, that is the values D_{p,t}, is arbitrary, but we require that |D_{p,t}| is bounded by some constant d for every p and t; this is a reasonable assumption for the applications we have in mind. Let L_t = sum over p of L_{p,t} be the total load and A_t = L_t / n the average load at time t, where n is the number of processors. Finally, let P_{p,t} denote the probability that processor p initiates a load balancing procedure at step t; in our scheme P_{p,t} = 1/L_{p,t} when L_{p,t} >= 1 and P_{p,t} = 1 when L_{p,t} = 0.

Theorem 1: There are constants a and C, independent of the number of processors n and of the distribution of the D_{p,t} (the schedule of tasks to the processors), such that when the system starts balanced and the load balancing algorithm is executed, then for each processor p and each step t,

    E[L_{p,t}] <= a A_t + C.

Thus, in expectation, the load of a processor, and therefore the waiting time of the tasks in its workpile, differs from that of an optimally balanced system by only a constant factor. The statement is weak in the sense that it ignores the cost of the balancing operations themselves and the possibility of contention between them, but it holds for an arbitrary task creation and termination pattern, subject only to the bound on |D_{p,t}|.
Proof: To simplify the analysis we consider a weaker load balancing scheme in which P_{p,t} (the probability that processor p initiates a load balancing procedure at time t) is min[r/L_{p,t}, 1/2] for a constant r <= 1. We also assume that a processor that is itself initiating a balancing operation at a given step is not chosen as a partner by any other processor at that step. The performance of the original scheme clearly dominates the performance of this weaker scheme, so a bound proved for the weaker scheme holds for the original scheme as well.

The proof is by induction on t. The claim holds at t = 0, since the system starts balanced. Assume that the induction hypothesis E[L_{p,t}] <= a A_t + C holds at step t; we bound E[L_{p,t+1}]. Write

    E[L_{p,t+1}] = E[L_{p,t}] + E[D_{p,t}] + E[C_1] - E[C_2],

where C_1 is the number of tasks processor p receives at this step from balancing operations initiated by other processors that choose p, and C_2 is the number of tasks p gives away in a balancing operation that p itself initiates.

Processor p never loses tasks in an operation initiated by another processor z. Such a z initiates with probability at most r/L_{z,t}, chooses p with probability 1/(n-1), and in that case p receives at most half of the difference of the two loads, which is at most L_{z,t}/2. Summing over all z != p,

    E[C_1] <= sum over z != p of (1/(n-1)) (r/L_{z,t}) (L_{z,t}/2) = r/2.

For the lower bound on E[C_2], let M be the set of processors whose load at step t is at most 2A_t; since the average load is A_t, |M| >= n/2. If p initiates a balancing operation, its partner is chosen uniformly among the other n-1 processors, and accounting also for the probability that the chosen partner is free to participate, p gives away, in expectation, at least

    (1/(4(n-1))) sum over z in M of E[(L_{p,t} - L_{z,t})/2]

tasks, each term of which is at least E[(L_{p,t} - 2A_t)/2] when p is overloaded.

If E[L_{p,t}] <= a A_t + C - (a d + d + r/2), the claim for step t+1 follows immediately: the expected gain in one step is at most E[D_{p,t}] + E[C_1] <= d + r/2, and A_{t+1} >= A_t - d. Otherwise E[L_{p,t}] is close to the bound a A_t + C and hence far above twice the average; in this case the lower bound on E[C_2] above yields E[C_2] >= C/16 = r/2 + (a+1)d, and therefore

    E[L_{p,t+1}] <= a A_t + C + d + r/2 - C/16 = a A_t + C - a d <= a A_{t+1} + C.

In both cases the induction hypothesis holds at step t+1, which proves the theorem. The constants are fixed with a >= 2 and C = 8r + 16(a+1)d, where r is the constant in the weakened balancing probability. []
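For reference, the skeleton of the induction step can be collected into three displayed relations. This is only a compact LaTeX restatement of the argument sketched above, under the same notation and assumptions ($\rho$ is the cap used in the weakened balancing probability, $\delta$ the bound on the per-step load change):

\[
E[L_{p,t+1}] \;=\; E[L_{p,t}] + E[\Delta_{p,t}] + E[C_1] - E[C_2],
\qquad |\Delta_{p,t}| \le \delta ,
\]
\[
E[C_1] \;\le\; \sum_{z \ne p} \frac{1}{n-1}\,P_{z,t}\,\frac{L_{z,t}}{2} \;\le\; \frac{\rho}{2},
\qquad
E[C_2] \;\ge\; \frac{1}{4(n-1)} \sum_{z \in M} E\!\left[\frac{L_{p,t} - L_{z,t}}{2}\right],
\]
\[
E[L_{p,t+1}] \;\le\; \alpha A_t + C + \delta + \frac{\rho}{2} - E[C_2] \;\le\; \alpha A_{t+1} + C .
\]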
5 Conclusions

We believe that local workpiles combined with the load balancing scheme are well suited for task scheduling on shared memory parallel machines. The scheme gives the programmer a higher level of abstraction: one need not know the exact number of processors available or construct an explicit mapping of tasks to processors, since the scheme dynamically adapts to the current load and to the number of processors. Not keeping a single central queue also removes the main point of contention in the memory system.

It is interesting to note that the global FIFO queue, which at first blush seems to provide the best possible scheduling, does not. Consider p PEs and p + 1 tasks: p short tasks, each requiring s time units, and one long task requiring h time units, with s << h; ideally the whole set completes in about h time units. With a single global queue and a short time quantum, all p + 1 tasks share the p processors, so the short tasks complete only at time ((p + 1)/p)s, and the long task, which still has h - s units of work left, completes the whole set at h - s + ((p + 1)/p)s = h + s/p. Now consider local FIFO workpiles with load balancing. There are two possibilities: either the long task is alone in its workpile, in which case the execution of the whole set takes about h, or the long task shares a workpile with a short task, in which case it is expected to require about h + s/2, since with probability one half the short task is ahead of the long one (adding s), and otherwise an idle processor soon removes the waiting short task in a balancing operation. If the second case occurs with probability 1/p, the expected completion time is

    T = (1/p)(h + s/2) + ((p - 1)/p)h = h + s/(2p),

which is better than the global queue. The global queue always executes in strict FIFO order and time-shares the long task with the short ones, while the local scheme lets the long task run in place and moves only the small remaining amount of work.
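The arithmetic of the example, written out as a check of the two expressions above (the only assumptions are those stated in the example, namely $s \ll h$ and that the long task shares a workpile with probability $1/p$):

\[
\text{global queue:}\qquad
\frac{p+1}{p}\,s + (h - s) \;=\; h + \frac{s}{p},
\]
\[
\text{local workpiles with balancing:}\qquad
\frac{1}{p}\Bigl(h + \frac{s}{2}\Bigr) + \frac{p-1}{p}\,h
\;=\; h + \frac{s}{2p}
\;<\; h + \frac{s}{p}.
\]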
It is also interesting to consider how the scheme behaves with the popular "loosely synchronous" style of parallel program, in which a set of tasks alternate between periods of computation and points at which they exchange information and synchronize. If there are enough tasks, each processor ends up executing about the same number of them, and the tasks of one program need not be mapped explicitly onto processors. Now suppose several such programs are running concurrently in a multiprogrammed environment: the scheme is expected to perform well, since tasks belonging to the same program that are placed together keep their locality, while the balancing operations smooth out the load across programs.

Several extensions deserve further work. Newly created tasks should probably be treated differently from tasks that have already begun executing: a newly created task can be moved cheaply, since it has no working set on its processor, so migration should move newly created tasks first whenever possible. Memory reference costs could also be made part of the balancing decision, so that a task is moved only when it carries enough work to pay for the remote references that the move generates. Finally, the scheme can be extended to architectures that are not fully interconnected, such as the hypercube, either by choosing balancing partners only among neighbors or by paying the extra cost of balancing between non-adjacent processors; it is our belief that it is nevertheless worthwhile to treat such a network as if it were completely connected, although this should be investigated further.
References

[BK90] F. Bonomi and A. Kumar, "Adaptive optimal load balancing in a nonhomogeneous multiserver system with a central job scheduler," IEEE Transactions on Computers, Vol. 39, pp. 1232-1250, Oct. 1990.

[BKW89] K. Baumgartner, R. Kling, and B. Wah, "Implementation of GAMMON: an efficient load balancing strategy for a local computer system," Proceedings of the International Conference on Parallel Processing, Vol. 2, pp. 77-80, Aug. 1989.

[CK79] Y.C. Chow and W. Kohler, "Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System," IEEE Transactions on Computers, Vol. C-28, pp. 334-361, May 1979.

[DG89] K. Dragon and J. Gustafson, "A low cost hypercube load balancing algorithm," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, March 1989.

[DG90] F. Dehne and M. Gastaldo, "A note on the load balancing problem for coarse grained hypercube dictionary machines," Parallel Computing, pp. 75-79, Nov. 1990.

[FFKS89] G. Fox, W. Furmanski, J. Keller, and P. Simic, "Physical optimization and load balancing algorithms," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, pp. 591-594, March 1989.

[HTC89] J. Hong, X. Tan, and M. Chen, "Dynamic load balancing on hypercubes," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, March 1989.

[JW89] J. Juang and B. Wah, "Load balancing and ordered selection in a computer system with multiple contention buses," Journal of Parallel and Distributed Computing, Vol. 7, pp. 391-415, 1989.

[KR89] V. Kumar and V. Rao, "Load balancing on the hypercube architecture," Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, March 1989.

[R87] G. Raetz, "Sequent general purpose parallel processing system," Northcon/87, pp. 7/2/1-5, Sept. 1987.
[Figure 3: Quicksort (size 500) completion times versus the number of processors P, without memory access costs. Four curves are plotted: load balancing with static initial placement, load balancing with random initial placement, the global workpile, and no load balancing. Even without memory access costs the global workpile does not yield the best performance; the load balancing scheme clearly yields good performance, and even the initial placement strategy affects the result.]
[Figure 4: Master-slave application (size 8), uniform memory access costs: completion time versus the number of processors P, for the global workpile, local workpiles with load balancing (static and random initial placement), and no load balancing. When memory access costs are the same for local and global accesses and the number of tasks is only a little more than the number of processors, the load balancing scheme gives better results than the global workpile.]
[Figure 5: Master-slave application (size 64), nonuniform memory access costs: completion time versus the number of processors P, for the global workpile, local workpiles with load balancing, and no load balancing. The master spawns 64 slave tasks; after all of these finish, another batch is started. Each task performs a total of 64 memory accesses. A local memory access is charged 1 unit, a remote access 2 units (twice as expensive), and an access to shared data 3 units; after a task migrates, its accesses are charged as remote, and accesses to the global workpile are charged at the highest rate. The performance of the global workpile degrades because of these shared accesses; without load balancing the performance is about twice as bad, due to the poor load distribution, even though very few accesses are remote.]
[Figure 6: Multiple master-slave applications; variance of queue lengths versus the number of processors P, with load balancing (static and random initial placement) and without load balancing. We ran 10 master-slave applications together and measured how the local workpile lengths differed from the global average. The plotted value is the averaged squared difference

    (1/(nT)) sum over t <= T, i <= n of (l_{i,t} - A_t)^2,

where T ranges over all time values, n is the number of processors, l_{i,t} is the length of the local workpile at processor i at time t, and A_t is the average size of the workpiles at time t. The load balancing scheme keeps the workpile lengths very near the average.]
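As a small illustration of how the quantity plotted in Figure 6 can be computed, the following function evaluates the averaged squared deviation of the workpile lengths from their per-step average. The array layout (lengths indexed by step and processor in one flat array) is an assumption made here for the sketch, not the authors' instrumentation code.

#include <stddef.h>

/* Averaged squared deviation of workpile lengths from the per-step average:
 * (1/(n*T)) * sum over steps t and processors i of (len[t][i] - A_t)^2,
 * where A_t is the mean workpile length at step t. */
double queue_length_imbalance(const int *len, int T, int n)
{
    double total = 0.0;
    for (int t = 0; t < T; t++) {
        const int *row = len + (size_t)t * n;  /* lengths of the n workpiles at step t */
        double avg = 0.0;
        for (int i = 0; i < n; i++) avg += row[i];
        avg /= n;
        for (int i = 0; i < n; i++) {
            double d = row[i] - avg;
            total += d * d;
        }
    }
    return total / ((double)T * n);
}

Evaluating this on the length samples of a run with balancing and a run without gives the kind of comparison shown in Figure 6.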