IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997
Isotach Networks
Paul F. Reynolds, Jr., Member, IEEE Computer Society,
Craig Williams, and Raymond R. Wagner, Jr.
Abstract-We introduce a class of networks called isotach networks designed to reduce the cost of synchronization in parallel computations. Isotach networks maintain an invariant that allows each process to control the logical times at which its messages are received and consequently executed. This control allows processes to pipeline operations without sacrificing sequential consistency and to send isochrons, groups of operations that appear to be received and executed as an indivisible unit. Isochrons allow processes to execute atomic actions without locks. Other uses of isotach networks include ensuring causal message delivery and consistency among replicated data. Isotach networks are characterized by this invariant, not by their topology. They can be implemented in a wide variety of configurations, including NUMA (nonuniform memory access) multiprocessors. Empirical and analytic studies of isotach synchronization techniques show that they outperform conventional techniques, in some cases by an order of magnitude or more. Results presented here assume fault-free systems; we are exploring extension to selected failure models.

Index Terms-Logical time, interprocess coordination, concurrency control, isochronicity, atomicity, sequential consistency, interconnection networks, multiprocessor systems.
1 INTRODUCTION
Isotach networks are a new class of networks designed to reduce synchronization costs in parallel computations by providing stronger guarantees about the order in which messages are delivered. In existing networks, a process can neither predict nor control the order in which its messages are delivered relative to concurrently issued messages. Few existing networks offer ordering guarantees any stronger than FIFO delivery order among messages with the same source and destination. As a result, a process must use delays and higher-level synchronization mechanisms such as locks to ensure that execution is consistent with the program's constraints.

Isotach networks offer a guarantee called the isotach invariant that allows processes to predict and control the logical times at which their messages are received. This invariant serves as the basis for some efficient new synchronization techniques. The invariant is expressed in logical time because a physical time guarantee would be prohibitively expensive, whereas the logical time guarantee can be enforced cheaply, using purely local knowledge. Isotach networks are characterized by this invariant, not by any particular topology or programming model. They can be implemented in a wide variety of configurations, including NUMA (nonuniform memory access) multiprocessors. Our performance studies show that isotach synchronization techniques are more efficient than conventional techniques. As contention for shared data increases, conventional techniques cease to perform acceptably, but isotach techniques continue to perform well.

Our focus in this paper is on shared memory model (SMM) computations on tightly-coupled multiprocessors. There are two reasons for this focus:

1) We assume a fault-free system. This assumption is standard in discussions of synchronization techniques for SMM computations on tightly coupled systems but not in distributed systems, where dealing with faulty communications is a major concern.
2) Unlike many other synchronization techniques, isotach techniques are suitable for the fine-grained synchronization required for SMM computations.

Isotach techniques are also applicable to message based model computations, but their use in distributed computing requires finding a way to relax the assumption of a fault-free system. We address this issue in the discussion of ongoing work in Section 5.

The principal contributions of this paper are in

1) defining isotach logical time, the logical time system that incorporates the isotach invariant;
2) giving an efficient, distributed algorithm for implementing isotach logical time; and
3) showing how to use isotach logical time to enforce two basic synchronization properties, sequential consistency and isochronicity.

An execution is sequentially consistent if operations specified by the program to be executed in a specific order appear to be executed in the order specified [26] and is isochronous if operations specified by the program to be executed at the same time appear to be executed at the same time. As explained in Section 3, isochronicity and atomicity are similar, though not identical, concepts.

Enforcing sequential consistency and isochronicity is trivial, given an assumption of fault-freedom, on systems in which processes communicate over a single shared broadcast medium such as a shared bus. However, such systems do not scale. Point-to-point networks scale, but enforcing sequential consistency and isochronicity is hard on point-to-point networks, even with the assumption of fault-freedom, and existing algorithms are expensive. The isotach algorithms for enforcing isochronicity and sequential consistency are unique in that they are both scalable (they have low message complexity and no intrinsic bottleneck such as a shared bus) and light-weight (they are efficient and require only a single communication round).

P.F. Reynolds, Jr. and C. Williams are with the Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA 22903. E-mail: {reynolds, craigw}@virginia.edu.
R.R. Wagner, Jr. is with The Vermont Software Company, Inc., Baldwin House, PO Box 731, Wells River, VT 05081. E-mail: Roy.Wagner@VTSoftware.com.
Manuscript received Feb. 13, 1995.
For information on obtaining reprints of this article, please send e-mail to: transpds@computer.org, and reference IEEECS Log Number D95263.
1045-9219/97/$10.00 © 1997 IEEE
Section 2 of this paper defines isotach logical time and shows how it can be implemented. Section 3 describes isotach synchronization techniques. Section 4 reports on our performance studies, and Section 5 summarizes the paper and describes ongoing work.
2 ISOTACH LOGICAL TIME
A logical time system is a set of constraints on the way in
which events of interest are ordered, i.e., assigned logical
times. Isotach logical time is an extension of the logical time
system defined by Lamport in his classic paper on ordering
events in distributed systems [25]. We begin by describing
Lamport’s system. We then define isotach logical time and
give an algorithm for implementing isotach logical time
using network switches and interfaces.
In Lamport's system, the events of interest are the sending and receiving of messages. Times assigned to these events are required to be consistent with the happened before relation, a relation over send and receive events that captures the notion of potential causality: Event a happened before event b, denoted a → b, if

1) a and b occur at the same process and a occurs before b;
2) a is the event of sending message m and b is the event of receiving the same message m; or
3) there exists some event c such that a → c and c → b.

In Lamport's system, a → b implies t(a) < t(b). (Refer to Fig. 1 for notation relating to logical time.)
Lamport gives a simple distributed algorithm for assigning times consistent with this constraint. Each process has
its own logical clock, a variable that records the time
assigned to the last local event. When it sends a message, a
process increments its clock and timestamps the message
with the new time. When it receives a message, a process
sets its clock to one more than the maximum of the current
time and the timestamp of the incoming message.
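Lamport's clock rules above can be sketched directly. This is a minimal illustration; the Message and LamportProcess names are ours, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Message:
    payload: str
    timestamp: int

class LamportProcess:
    def __init__(self):
        self.clock = 0  # logical time of the last local event

    def send(self, payload):
        # increment the clock and stamp the message with the new time
        self.clock += 1
        return Message(payload, self.clock)

    def receive(self, msg: Message):
        # advance to one more than the max of local time and the stamp
        self.clock = max(self.clock, msg.timestamp) + 1

p, q = LamportProcess(), LamportProcess()
m = p.send("hello")  # p.clock == 1, m.timestamp == 1
q.receive(m)         # q.clock == 2, so t(send) < t(receive)
```

By construction, a receive event is always assigned a later time than the corresponding send event, which is exactly the consistency with → that the text requires.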
Logical time systems have been proposed that extend Lamport's system by adding the constraint t(a) < t(b) implies a → b (thus t(a) < t(b) if and only if a → b). For a recent survey, see [37]. ISIS [7] is a messaging layer that incorporates this extension to Lamport's logical time system. Isotach logical
time also extends Lamport’s time system, but in a different
way. In isotach logical time, the additional constraint is that
the times assigned be consistent with an invariant relating
the logical distance a message travels to the logical time it
takes to reach its destination.
2.1 Defining Isotach Logical Time
In an isotach logical time system, times are lexicographically ordered n-tuples of nonnegative integers. The first and most significant component is the pulse. The number and interpretation of the remaining components can vary. In this paper, logical times are three-tuples of the form (pulse, pid(m), rank(m)). Isotach logical time extends Lamport's logical time by requiring that the times assigned be consistent with both the → relation and the isotach invariant: Each message m is received exactly d(m) pulses after it is sent.[1] Thus t_s(m) = (i, j, k) implies t_r(m) = (i + d(m), j, k). In this paper, we use the routing distance, i.e., the number of switches through which m is routed, as the measure of logical distance, but other measures can be used. Isotach logical time is so named because all messages travel at the same velocity in logical time: one unit of logical distance per pulse of logical time.

For any event e
  t(e)      logical time assigned to e
For any message m
  t_s(m)    logical time assigned to the event of sending m
  t_r(m)    logical time assigned to the event of receiving m
  d(m)      logical distance traveled by m
  pid(m)    unique identifier of the process that issued m
  rank(m)   issue rank of m; rank(m) = i if m is the ith message issued by pid(m)
  →         Lamport's happened-before relation
Fig. 1. Logical time notation.
As we will show in Section 3, the isotach invariant provides a powerful mechanism for coordinating access to shared data. Given the isotach invariant, a process that knows d(m) for a message m that it sends can determine t_r(m) through its choice of t_s(m). Thus processes can use the isotach invariant to control the logical times at which their messages are received.
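The invariant's use as a control mechanism can be illustrated with a small sketch. The distance table and the target pulse below are hypothetical values chosen for illustration.

```python
def receive_pulse(send_pulse: int, distance: int) -> int:
    # isotach invariant: a message is received exactly d(m) pulses
    # after it is sent (pulse component only)
    return send_pulse + distance

def send_pulse_for(target_receive_pulse: int, distance: int) -> int:
    # invert the invariant: pick the send pulse that yields the
    # desired receive pulse
    return target_receive_pulse - distance

distances = {"A": 2, "B": 3}  # illustrative logical (routing) distances
target = 10                   # desired common receive pulse
sends = {v: send_pulse_for(target, d) for v, d in distances.items()}
# sends == {"A": 8, "B": 7}; both messages arrive in pulse 10
assert all(receive_pulse(s, distances[v]) == target
           for v, s in sends.items())
```

The sender staggers its send pulses to compensate for unequal distances, which is precisely the technique the send order rules of Section 3 build on.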
2.2 Implementing Isotach Logical Time
An isotach network is a network that implements isotach
logical time, i.e., that delivers messages in an order that
allows send and receive events to be assigned logical times
consistent with the isotach invariant and the → relation.
Several networks have been proposed that can, in retrospect, be classified as isotach networks. The alpha-synchronizer network proposed by Awerbuch to execute SIMD graph algorithms on asynchronous networks [3], and a network proposed to support barrier synchronization [5], [17] can be viewed as isotach networks that maintain a logical time system in which each logical time consists of the pulse component only. A network proposed by Ranade [35], [36] as the basis for an efficient concurrent-read, concurrent-write (CRCW) PRAM emulation can be viewed as realizing the more general and powerful n-tuple isotach logical time system. We give an algorithm, the isonet algorithm, for implementing isotach logical time on networks of arbitrary topology. The algorithm localizes the work of implementing isotach logical time in the network interfaces and switches. It is simple but impractical. Later in this section, we show how to make it practical.

1. For any message m for which d(m) = 0, e.g., a cache hit or a message between collocated processes, the isotach invariant requires t_r(m) = t_s(m). For this reason, we relax Lamport's requirement that a → b implies t(a) < t(b). In isotach logical time, a → b implies t(a) ≤ t(b).
2.2.1 Model
We consider a network with arbitrary topology consisting
of a set of interconnected nodes in which each node is either
a switch or a host. Adjacent nodes communicate over FIFO
links. Each host has a network interface called a switch interface unit (SIU) that connects it to a single switch. A host
communicates with its SIU over a FIFO link. All components and interconnecting links are failure-free. To postpone considering the issue of communication deadlock, we
initially assume each switch has infinite buffers. A message
is issued when the process passes it to the SIU, sent when the
SIU passes it to the network, and received when the SIU
accepts the message from the network. We assume that the
destination SIU delivers messages to the destination process in the order in which it received them.
2.2.2 Isonet Algorithm
Each SIU and switch maintains a logical clock, a counter
initialized to zero, that gives the pulse component of the
local logical time. The clocks are kept loosely synchronized
by the exchange of tokens. A token is a signal sent to mark
the end of one pulse of logical time and the beginning of the
next. Initially each switch sends a token wave, i.e., it sends
a token on each output, including the outputs to adjacent
SIUs, if any. Thereafter, each switch sends token wave i + 1
upon receiving the ith token on each input, including the
inputs from adjacent SIUs. Each time a switch sends a token
wave, it increments its clock. When an SIU receives a token,
the SIU increments its clock and returns the token to the
switch.
A message is sent (received/routed) in pulse i if it is sent
(received/routed) while the local clock is i. An SIU may
send no messages or several messages in a pulse. The SIU
tags each message m it sends with the lexicographically ordered pair (pid(m), rank(m)). The messages an SIU sends in a pulse must be sent in pid-rank order. Thus if m and m′ are issued by the same process and rank(m) < rank(m′), the SIU must send m before m′ if it sends m and m′ in the same pulse. However the rule does not prohibit the SIU from
sending m’ in an earlier pulse than m. During each pulse, an
SIU continuously compares the tag of the next message to
be sent with the tag of the next message to be received and
handles (sends or receives) the message with the lowest tag.
If its network input is empty, the SIU waits for it to be
nonempty. The switches in an isotach network route messages in pid-rank order in each pulse. The switch behaves
the same as a switch in a conventional network except that
it always selects for routing the message with the lowest tag
among the messages at the head of its inputs. As in the case
of an SIU, a switch waits if one of its inputs is empty. Pseudocode for the isonet algorithm is given in Fig. 2.
To show that the isonet algorithm implements isotach time, we

1) give a rule for assigning logical times to send and receive events, and
2) prove that this assignment is consistent with the isotach invariant and the → relation.

Let s_pulse(m) denote the pulse in which m is sent and r_pulse(m) the pulse in which m is received. Time assignment rule: For each message m, t_s(m) = (s_pulse(m), pid(m), rank(m)) and t_r(m) = (r_pulse(m), pid(m), rank(m)).

At each switch
  clock = 0;
  repeat forever
    send a token on each output;
    increment clock;
    route all messages up to next token on each input
      in pid-rank order;
    consume a token from each input;

At each SIU
  clock = 0;
  repeat forever
    accept a token from adjacent switch;
    increment clock;
    return the token to the switch;
    send and receive messages in pid-rank order;

Fig. 2. Pseudocode for isonet algorithm.
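The within-pulse routing rule (always forward the message with the lexicographically lowest pid-rank tag among the input heads) amounts to a k-way merge of sorted streams. The following is a toy sketch of one pulse at one switch; it ignores tokens, ghosts, and output selection, and the input contents are illustrative.

```python
import heapq

def route_pulse(inputs):
    """inputs: list of FIFO lists of (pid, rank) tags, each already in
    pid-rank order (as guaranteed by upstream SIUs and switches).
    Returns the order in which the switch routes the messages."""
    # seed the heap with the head of each nonempty input
    heads = [(q[0], i, 0) for i, q in enumerate(inputs) if q]
    heapq.heapify(heads)
    routed = []
    while heads:
        tag, i, j = heapq.heappop(heads)  # lowest tag among input heads
        routed.append(tag)
        if j + 1 < len(inputs[i]):        # expose the next head on input i
            heapq.heappush(heads, (inputs[i][j + 1], i, j + 1))
    return routed

pulse_inputs = [[(1, 0), (3, 5)], [(2, 7), (2, 8)]]
print(route_pulse(pulse_inputs))  # [(1, 0), (2, 7), (2, 8), (3, 5)]
```

Because each input stream is sorted and the switch always emits the minimum head, every output stream is again in pid-rank order, which is the property the inductive step of Theorem 1 relies on.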
THEOREM 1. The isonet algorithm maintains isotach logical time.
PROOF. Since a message routed by a switch in pulse i is routed at the next switch in pulse i + 1, for any message m, r_pulse(m) - s_pulse(m) = d(m). Since t_s(m) and t_r(m) differ only in the pulse component, the isotach invariant holds. Showing that the logical time assignment is consistent with the → relation requires showing

1) for each message m, t_r(m) ≥ t_s(m); and
2) for each SIU, the times assigned to local events are nondecreasing, i.e., local logical time doesn't move backwards.

The first part follows directly from the isotach invariant: since for any message m, d(m) ≥ 0, t_r(m) ≥ t_s(m) by the isotach invariant. Since the messages each SIU sends in each pulse are sent in pid-rank order, and since each switch merges the streams of messages arriving on its inputs so that it maintains pid-rank order on each output, a simple inductive proof over the number of switches through which a message travels shows that messages received at an SIU within each pulse are received in pid-rank order. Since an SIU interleaves sends and receives so as to handle all messages in each pulse in pid-rank order, the times assigned to all events at each SIU are nondecreasing. □
2.2.3 Discussion
An isotach network need not maintain logical clocks at the SIUs and switches or assign logical times to send and receive events. It need only ensure that messages are received in an order consistent with the isotach invariant and the → relation. The logical clocks are auxiliary variables that make the algorithm easy to understand but are not essential to
the implementation. Similarly, the rank component of the
pid-rank tags can be eliminated or reduced to a few bits. The
rank component is used in assigning logical times to send
and receive events, but this role is auxiliary to the algorithm. The rank component is essential to the isonet algorithm only to ensure that messages sent by the same source
to the same destination in the same pulse are received in
the order in which they were issued and sent. If the network topology ensures FIFO delivery order among messages with the same source and destination, the rank component is unnecessary. If it does not, the size of the rank
component can be reduced by limiting the number of messages a process can send to a single destination in a single
pulse, or it can be eliminated by requiring that multiple
messages sent to the same destination in the same pulse be
sent as a single larger message. The bandwidth required for
the pid component of tags is not attributable to maintaining
isotach time because messages are typically required to
carry the identity of the issuing process for reasons unrelated to maintaining isotach time. Token traffic is the only
other source of bandwidth overhead attributable to isotach
time. Tokens can be piggybacked on messages at a cost in
bandwidth of one bit per message.
2.2.4 Latency and Throughput
The principal cost of the isonet algorithm is the additional
waiting implied by the requirement that messages be
routed in pid-rank order. One way to reduce this cost is to
use ghosts. When a switch sends an operation on one output, it sends a ghost [36] or null message [lo] with the same
pid-rank tag on each of its other output(s). A switch receiving
a ghost knows all further operations it receives on the same
input will have a larger tag than the ghost and this
knowledge may allow it to route the operation on its other
inputk). Ghosts improve isotach network performance by
allowing information about logical time to spread more
quickly and are needed in some networks to avoid deadlock. Since a ghost can always be overwritten by a newer
message, ghosts take up only unused bandwidth and buffers.
Another way to reduce waiting in the switches is to add
internal buffers. In a switch with input queueing only, such
as the switch shown in Fig. 3a, a blocked message at the
head of an input queue blocks subsequent messages arriving on the same input, a phenomenon known as head of
line (HOL) blocking. Buffers placed internally, as shown in
Fig. 3b, reduce HOL blocking [21], [24]. Our simulation
study of isotach networks shows that internal buffers are of
significantly more benefit in isotach networks than in conventional networks. One reason for this difference is that a
conventional switch of either design can route messages
onto both outputs in a single switch cycle, but an isotach
switch of the simpler design can route at most one operation per cycle since we allow the router to make only one
pid-rank tag comparison per cycle. An isotach switch with
internal buffers can route messages onto both outputs since
it has two merge units operating in parallel. A more fundamental reason for the difference in benefit is that the
switch with internal buffers offers increased throughput at
the cost of increased delay. Isotach synchronization techniques, because they are more concurrent, e.g., they allow
pipelining, allow processes to take better advantage of increases in potential throughput.
Fig. 3. Switch designs: (a) simple switch, (b) switch with internal buffers.
2.2.5 Deadlock
The requirement that switches route messages in pid-rank
order makes deadlock freedom a harder problem in isotach
networks than in conventional networks. The isonet algorithm uses infinite buffers at the switches to prevent deadlock. One way to bound the number of buffers is to limit the
number of messages sent per SIU per pulse. Although such
bounds may be useful for other reasons, the number of
buffers required to ensure deadlock freedom using this
approach is very large. A better approach is to find routing
schemes that ensure deadlock freedom. We have proven
deadlock free routing schemes for equidistant networks
using one buffer per switch per physical channel and for
arbitrary topologies using O(d) buffers and virtual channels
per switch per physical channel, where d is the network
diameter [46].
The isonet algorithm, as modified, is scalable and efficient: It is completely distributed; the bandwidth attributable to maintaining isotach logical time is one bit per message; and the additional work required at each SIU and
switch is one tag comparison per message. The main cost of
the algorithm is the increased probability that a message
will wait within the network. The cost of maintaining isotach logical time relative to the benefits in reduced synchronization overhead is the subject of Section 4.
3 USING ISOTACH LOGICAL TIME
We show how isotach logical time provides the basis for
enforcing isochronicity and sequential consistency. Our
focus is on SMM computations. Each message is an operation or a response to an operation and each operation is a
read or write access to a variable. Each host is a processing
element (PE) or memory module (MM). The destination of
each operation is an MM. Since synchronization is not an
issue in relation to private variables, we use the term variable to refer to a shared variable.
3.1 Isochronicity and Sequential Consistency
An isochron is a group of operations that can be issued as a batch and that is to be executed indivisibly, i.e., without apparent interleaving with other operations. An execution is isochronous if every isochron appears to be executed indivisibly. Isochrons and isochronicity are closely related to atomic actions and atomicity. An atomic action is a group of operations that is to be executed indivisibly [28], [31]. An
execution is atomic if the operations in each atomic action
appear to be executed indivisibly. Isochrons and atomic
actions differ in two ways:

1) An atomic action, in many contexts, is a unit of recovery from hardware failure. Where a fault-free system is assumed or fault-freedom is not a major concern, atomicity is synonymous with indivisible execution [27], [32]. As stated previously, we assume a fault-free system.
2) An atomic action can have internal dependences preventing the operations in the atomic action from being issued as a batch. An atomic action with such dependences, e.g., A = B, is a structured atomic action. An atomic action with no such dependences is a flat atomic action. Since we assume a fault-free system, flat atomic actions and isochrons are the same.
Isochrons are also similar to totally ordered multicasts
(TOMs). The main distinctions are
1) TOM algorithms are usually fault-tolerant; and
2) the messages making up a TOM are typically identical
and in some implementations constrained to be identical, whereas the operations making up an isochron
are typically different. (Of course, subcomponents of
identical messages can be addressed to different recipients, at the cost of increased message size.)
We compare implementations at the end of this section.
Isochrons are directly useful in obtaining consistent
views, making consistent updates, and maintaining consistency among replicated copies. They are also useful as a
building block for implementing structured atomic actions.
We describe isotach techniques for executing structured
atomic actions later in this section.
SMM computations on tightly-coupled multiprocessors
execute atomic actions using some form of locking. Drawbacks of locking include overhead for lock maintenance in
space, time, and communication bandwidth, unnecessarily
restricted access to variables, and the care that must be
taken when using locks to avoid deadlock and livelock. In
an isotach system, a process can execute flat atomic actions
without synchronizing with other processes and can execute structured atomic actions without acquiring locks or
otherwise obtaining exclusive access rights to the variables
accessed.
The second synchronization property we consider is sequential consistency. An execution is sequentially consistent if operations appear to be executed in an order consistent with the order specified by each individual process's sequential program [26]. This property is so basic it is easily taken for granted, but it is expensive to enforce in non-bus-based systems because stochastic delays within the network can reorder operations. A violation of sequential consistency is shown in Fig. 4. Each circle in the figure represents an operation on a shared variable, A or B. Solid directed lines connect circles representing operations on the same variable and indicate the order in which the operations are executed at memory. Dashed directed lines connect circles representing operations issued by the same process and indicate the order specified by that process's sequential program. In the execution shown, process P1 reads the new value of A (the value written by P2) and subsequently reads the old value of B, in contradiction to P2's program, which specifies that B's value be changed first. This violation could occur even if both processes sent the operations in the correct order, if one operation (in this case the write to B) is delayed in the network. The conventional solution is to prohibit pipelining of operations, i.e., each process delays issuing each operation until it receives an acknowledgement for its previous operation. However, pipelining is an important way to decrease effective memory latency, so this solution is expensive. The high cost of enforcing sequential consistency has led to extensive exploration of weaker memory consistency models, e.g., [16], [43]. These weaker models are harder to reason about and still impose significant restrictions on pipelining, but make sense given the cost of maintaining sequential consistency in a conventional system. In an isotach system, processes can pipeline memory operations without violating sequential consistency.
Fig. 4. A violation of sequential consistency (P1:: read(A); read(B); P2:: write(B); write(A)). Dashed lines indicate program order at each process; solid lines indicate execution order at the MM.
3.2 Enforcing Isochronicity and Sequential Consistency
We consider an SMM computation on an isotach system.
Each process’s program consists of a sequence of isochrons
interspersed with local computation. Further assumptions:
1) Each process issues operations by sending them
through a reliable FIFO channel to the SIU for the PE
on which it is executing;
2) Each process marks the last operation in each isochron;
3) Each MM executes operations in the order in which
they are received; and
4) Each SIU knows the logical distance to each MM with
which it communicates.
Although some applications (e.g., cache coherence) require that isotach logical time be maintained for both operations and responses to operations, the techniques described here require that logical times be assigned only to the send and receive events for operations. Since each MM need only execute operations in the order in which they are received and need not ensure that its responses are received at any given logical time, an MM requires only a conventional network interface. Thus we use SIU in this section to mean the SIU for a PE.
The key to enforcing isochronicity and sequential consistency in isotach systems is the isotach invariant. Given
the isotach invariant and knowledge of the logical distance
for each operation it sends, an SIU can control the logical
times at which the operations it sends are received by controlling the logical times at which they are sent. In particular, the SIU can ensure that its operations appear to be
received at the same time (isochronicity) or in some specified sequence (sequential consistency) by sending operations in accordance with the send order rules:
Isochronicity. Send operations from the same isochron so
they are received in the same pulse.
Sequential Consistency. Send each operation so it is
received in a pulse no earlier than that of the operation
issued before it.
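Under the stated assumptions (the SIU knows the logical distance to each MM), the two send order rules can be sketched as a scheduling function. The program representation, names, and distances below are illustrative, not from the paper.

```python
def schedule(isochrons, distance, start_pulse=0):
    """Assign each operation a send pulse so that (1) all operations of
    an isochron are received in one common pulse and (2) receive pulses
    never decrease across successive isochrons.

    isochrons: list of isochrons, each a list of (destination, op) pairs.
    distance: map from destination MM to logical distance.
    Returns a list of isochrons as (destination, op, send_pulse) triples."""
    plan, receive = [], start_pulse
    for iso in isochrons:
        # earliest common receive pulse: no operation may be sent before
        # the current pulse, and the receive pulse must not precede that
        # of the previous isochron (second send order rule)
        receive = max(receive,
                      start_pulse + max(distance[d] for d, _ in iso))
        # stagger sends so all operations land in the same pulse
        # (first send order rule): send_pulse = receive - d
        plan.append([(d, op, receive - distance[d]) for d, op in iso])
    return plan

dist = {"A": 2, "B": 3}
prog = [[("A", "read"), ("B", "read")], [("A", "write")]]
print(schedule(prog, dist))
# [[('A', 'read', 1), ('B', 'read', 0)], [('A', 'write', 1)]]
```

Note that, as the text observes for nonequidistant topologies, the sketch may send operations in an order different from issue order: here the read of B is sent a pulse before the read of A, yet both are received together.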
Processes need not be aware of how their ordering constraints are enforced. These rules are implemented entirely
by the SIU. In a system with an equidistant network, i.e., a
network in which every message travels the same logical
distance, the rules call for each SIU to send operations in
the order in which they were issued and to send all operations from the same isochron in the same pulse. To avoid
delaying logical time in the event of a context switch or
process failure, the SIU should not send any operation from
an isochron until the last operation in the isochron has been
issued. In systems with nonequidistant topologies, the rules
may require that an SIU send operations in an order that
differs from the order in which they were issued.

EXAMPLE. Assume process P1 is required to read and P2 to write variables A and B and that each process's accesses must be executed indivisibly. In a conventional system, P1 and P2 must obtain locks, either on individual variables or on the section of code through which the variables are accessed. In an isotach system, P1 and P2 can execute their accesses without locks. Consider an isotach system with the nonequidistant topology shown in Fig. 5a. Each circle in Fig. 5a represents a switch and each rectangle an MM or PE. Fig. 5b shows one of many possible correct executions. In accordance with the first send order rule, the SIU for each process sends its operations so that they both arrive in the same pulse, e.g., the SIU for P1 sends the operation on A one pulse after the operation on B since the routing distance from P1 to A is one less than to B. If, as in the execution shown, all four operations happen to be received in the same pulse, operations on each variable will be received and executed in order by pid. As a result, each isochron will appear to be executed without interleaving with other operations.

The correctness proof for the send order rules is similar to serializability proofs of database schedulers; see, e.g., [33]. We show that for any execution on an isotach system that conforms to the send order rules, there is an equivalent serial execution that is isochronous and sequentially consistent. To establish that two executions of the same program are equivalent, it is sufficient to show that each variable is accessed by the same operations and that operations on the same variable are executed in the same order in both executions.

THEOREM 2. Any isotach system execution E that satisfies the send order rules is isochronous and sequentially consistent.

PROOF. Since no two operations issued by the same process have the same issue rank, operations in E are totally ordered by their logical receive times. Let E_s be the serial execution in which operations are executed in order of their logical receive times in E. We show

1) E_s and E are equivalent;
2) E_s is isochronous; and
3) E_s is sequentially consistent.

It follows that E is isochronous and sequentially consistent.

Consider any two operations op1 and op2 on the same variable V. Without loss of generality, assume op1 is received in E before op2. Since logical receive times are unique and consistent with the → relation, t_r(op1) < t_r(op2). Thus op1 is executed before op2 in E_s. Since operations on V are executed in E in the order in which they are received, op1 is also executed before op2 in E.

Since, for any isochron A, all operations in A are received in the same pulse (first send order rule), issued by the same process, and have consecutive issue ranks, the logical receive time of any operation not in A is outside the interval of logical time delimited by the receive events for operations in A. Thus the operations in A are executed in E_s indivisibly, without interleaving with other operations.

For any pair of operations op1 and op2 issued by the same process where rank(op1) < rank(op2), op1 and op2 are either received in the same pulse or op1 is received in an earlier pulse than op2 (second send order rule). In either case, op1 is executed before op2 in E_s. □
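The serialization used in the proof can be illustrated with a small sketch (our illustration, not the paper's implementation; the tuple (pulse, pid, rank) stands in for a logical receive timestamp). Sorting by these tags yields the serial execution E_s, and isochron operations, which share a pulse and pid and have consecutive ranks, end up contiguous in E_s:

```python
# Illustrative sketch: order operations by logical receive time
# (pulse, pid, rank) and check that isochrons are not interleaved.
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    pulse: int   # pulse in which the operation is received
    pid: int     # issuing process id
    rank: int    # issue rank within the process
    var: str     # variable accessed
    kind: str    # 'read' or 'write'

def serialize(ops):
    """Return E_s: ops ordered by logical receive time (pulse, pid, rank)."""
    return sorted(ops, key=lambda op: (op.pulse, op.pid, op.rank))

def isochrons_indivisible(serial, groups):
    """Check each isochron (a list of ops) occupies a contiguous run in E_s."""
    for group in groups:
        positions = sorted(serial.index(op) for op in group)
        if positions != list(range(positions[0], positions[0] + len(group))):
            return False
    return True

# P1 reads A and B; P2 writes A and B; all four ops arrive in pulse 5.
i1 = [Op(5, 1, 0, 'A', 'read'), Op(5, 1, 1, 'B', 'read')]
i2 = [Op(5, 2, 0, 'A', 'write'), Op(5, 2, 1, 'B', 'write')]
es = serialize(i1 + i2)
assert isochrons_indivisible(es, [i1, i2])   # no interleaving in E_s
```

Because ties in a pulse break by pid and then rank, every process's isochron forms an unbroken run, matching the indivisibility argument in the proof.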
Fig. 5. Executing isochrons: (a) network topology, (b) operation send and receive times.

EXAMPLE. Consider again the execution shown in Fig. 5. The space-time diagram in Fig. 6a shows one of the many possible ways in which this execution can occur in
physical time. Fig. 6b shows the serial execution in
which operations are executed in order by their logical
receive times. Solid lines connecting circles indicate the
associated operations are from the same isochron.
Since each variable is accessed by the same operations in the same order in both executions, the executions are equivalent. In discussing a similar pair of
diagrams, Lamport provides an alternative way to
view the equivalence among the executions in the
figure [25]:
Without introducing the concept of time into the system
(which would require introducing physical clocks), there
is no way to decide which of these pictures is the better
representation.
Intuitively, any vertical line in the diagram can be
stretched or compressed in any way that preserves the
relative order of the points on the line and the result
will be equivalent to the original execution. Without
memory-mapped peripherals or other special instrumentation, the executions are indistinguishable. Since
the second of the equivalent executions in Fig. 6 is
isochronous, so is the first, even though it interleaves
execution of isochrons in physical time.
Fig. 6. Space-time diagrams of equivalent executions.
3.2.1 Isochrons and Totally Ordered Multicasts
Algorithms for totally ordered multicasts (TOMs) ensure total ordering without the benefit of the assumption of fault-freedom. TOMs have been used to maintain consistency among replicated copies of objects in distributed systems, e.g., [4], and could theoretically be used to execute flat atomic actions in SMM computations on multiprocessors, and as a basis for techniques to execute structured atomic actions, but we are not aware of any proposals to use TOMs in this way. Since locks are the standard technique for executing atomic actions, we use them as the basis for comparison in the performance studies described in the next section.

Even if we leave out the fault-tolerant aspects, concentrating only on those aspects of TOM implementations designed to ensure that multicasts are received in a consistent order, we find existing TOM implementations are not scalable or are unsuitable for SMM computations for other reasons. Several messaging layers, e.g., [6], [7], [34], use a logical time system in obtaining an ordering property among multicasts called causal ordering, but the logical time systems these messaging layers use do not offer an efficient, scalable way to obtain total ordering.
Existing TOM algorithms introduce some element to
obtain the total ordering that makes them unsuitable for
use in SMM computations. Most use a serialization point,
either in the network itself or in a communication pattern
superimposed on the computation, e.g., [1], [11], [20], to
order multicasts. These algorithms scale poorly. Multicasts
for hierarchically structured networks or for applications
with a hierarchical communication pattern scale more successfully, but still have a potential bottleneck at the root.
The few TOM algorithms that do not use a centralized serialization point [6], [13], [14], [18] are based on worst case
assumptions or require multiple message rounds. Many
TOM implementations, e.g., [7], [15], [34], are group-based.
They send each multicast to the full membership of a process group and use the fact that every process sees every
message sent to its group in obtaining the total ordering.
Group-based protocols have been shown to work well in
distributed message based model computations, but the
process group concept does not translate well to the SMM.
In contrast to existing TOM algorithms, the implementation of isochrons is scalable and light-weight. The underlying implementation of isotach logical time is distributed
and the total ordering among isochrons is achieved without
introducing a centralized serialization point. Comparing
message complexity is complicated by the question of what
cost to assess for token traffic in the isotach network and
the fact that multiple processes can each send multiple isochrons in the same pulse. Since a token is an empty control
signal sent between neighboring switches that can be piggybacked on other messages if there is any real traffic, we
believe a cost of 0 for tokens is reasonable. Assuming a cost
of 0 for tokens, an isochron with k destinations requires one
round of k messages, the minimal cost on a point-to-point
network for sending k unique messages to k destinations.
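The token traffic discussed above delimits pulses between neighboring switches. A minimal sketch of the idea (our simplified framing, not the paper's isonet algorithm): a switch ends its current pulse once it has seen a token from every input neighbor, then forwards tokens downstream. Since tokens carry no payload, an implementation can piggyback them on data messages, which motivates the cost-of-zero assumption:

```python
# Simplified sketch of pulse advancement via tokens. A switch counts
# tokens from its input neighbors; when all have arrived, the current
# pulse is over and the switch emits tokens to its output neighbors.
class Switch:
    def __init__(self, name, n_inputs):
        self.name = name
        self.n_inputs = n_inputs
        self.tokens_seen = 0
        self.pulse = 0

    def receive_token(self):
        """Count a token from one input; advance the pulse when all arrive."""
        self.tokens_seen += 1
        if self.tokens_seen == self.n_inputs:
            self.tokens_seen = 0
            self.pulse += 1       # every input has finished the pulse
            return True           # signal: emit tokens downstream
        return False

sw = Switch('s0', n_inputs=2)
assert sw.receive_token() is False  # first input's token: pulse not over
assert sw.receive_token() is True   # second token ends pulse 0
assert sw.pulse == 1
```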
3.2.2 Other Applications
Other applications of isotach logical time include the following:
Atomicity. Given our assumption of a fault-free system, isotach techniques enforce atomicity for both flat and structured atomic actions. Structured atomic actions cannot be executed using isochrons consisting of ordinary reads and writes because internal dependences prevent issuing all of the operations in a batch, but they can be executed using isochrons consisting of split operations [48]. A split operation, in brief, is half of a read or write. Split operations are used in executing reads and writes in two steps, where the first step schedules the access and the second transfers a value. Dividing accesses into two steps allows a process that has incomplete knowledge about an access due to an unsatisfied dependence to reserve the context for the access even though the value to be transferred has yet to be determined. A process executes a structured atomic action by first issuing an isochron that contains split operations scheduling all of the accesses required for the atomic action. The process executes the assignment steps for these accesses as it determines the values to be assigned. Execution is atomic because the initial isochron reserves a consistent time slice across the histories of the accessed variables. This technique works when the access set, the set of variables accessed by the atomic action, can be determined prior to execution of the atomic action.
We have proposed variations on the technique for
atomic actions with data dependent access sets [48].
Isotach techniques for executing structured atomic actions are similar to techniques used in multiversion concurrency control [38], except that isochrons streamline these techniques, eliminating roll-back and permitting an efficient in-cache implementation of variable histories. The resulting techniques are sufficiently light-weight that they can be used in SMM computations.
Cache Coherence. Isotach-based techniques for enforcing atomicity and sequential consistency extend to systems with caches. The resulting cache coherence protocols [44], [47], [48] are more concurrent than existing protocols for non-bus-based systems in that they support multiple concurrent reads and writes of the same block, allow processes to pipeline memory accesses without sacrificing sequential consistency, and allow processes to read and write multiple shared cache blocks atomically without obtaining exclusive rights to the accessed blocks.
Combining. Combining is a technique for maintaining good performance in the presence of multiple concurrent accesses to the same variable [23]. An isotach network need not implement combining, but if it does, it can combine operations not combinable in other networks, resulting in improved concurrency in accessing shared memory [49].
Other applications include support for causal message delivery, causal and totally ordered multicasting, migration mechanisms, checkpointing, wait-free communication primitives, and highly concurrent access to linked data structures. Important special applications include parallel and distributed databases and production systems [41], [48].
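The split-operation idea described under Atomicity can be sketched concretely. This is our illustration under stated assumptions (the class and method names are ours, not the paper's): a "schedule" half reserves a slot in a variable's history at the isochron's logical time, and an "assign" half later fills the reserved slot; a read scheduled after an unfilled reservation must wait, so the initial isochron fixes a consistent time slice.

```python
# Hedged sketch of split operations over per-variable histories.
class Variable:
    def __init__(self):
        self.history = []          # list of [timestamp, value-or-None]

    def schedule_write(self, ts):
        """First half: reserve a slot at logical time ts, value unknown."""
        slot = [ts, None]
        self.history.append(slot)
        self.history.sort(key=lambda s: s[0])
        return slot

    def assign(self, slot, value):
        """Second half: transfer the value into the reserved slot."""
        slot[1] = value

    def read(self, ts):
        """Latest filled value scheduled at or before ts, or None if an
        earlier reservation is still unfilled (the read must wait)."""
        prior = [s for s in self.history if s[0] <= ts]
        if any(s[1] is None for s in prior):
            return None            # pending assignment blocks the read
        return prior[-1][1] if prior else None

v = Variable()
slot = v.schedule_write(ts=3)      # isochron reserves the access
assert v.read(ts=5) is None        # reader at t=5 waits on the reservation
v.assign(slot, 42)                 # value determined later
assert v.read(ts=5) == 42
```

The reservation is what makes roll-back unnecessary: no operation with a later logical time can slip in between the schedule and assign steps.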
4 PERFORMANCE
We studied the performance of equidistant isotach networks using both simulation [40] and analytic modeling
[46]. The following is a synopsis; the interested reader is
referred to the noted references for more details.
4.1 Simulation Studies
Isotach networks do more work than conventional networks: they route messages and deliver them in an order consistent with the isotach invariant and the → relation. Thus isotach networks can be expected to have less raw power than comparable conventional networks. Raw power, as determined by network throughput and latency, measures the ability of a network to deliver a workload of generic messages without atomicity or sequencing constraints to their destinations. Our studies were designed to answer the following questions:
1) How do the raw power of an isotach network and a
conventional network relate?
2) Under what conditions, if any, do isotach networks
make up for an expected loss in raw power through
more efficient support of synchronization?
We report on the results of three simulation studies: raw power, sequential consistency, and warm-spot traffic model with sequential consistency and atomicity. These studies compared isotach systems to conventional systems that enforce sequential consistency by restricting pipelining and enforce atomicity by locking the variables accessed in each atomic action. Improving the performance of locking in shared memory systems has been the subject of several studies in recent years (e.g., [2], [29], [30]). Although we did not implement caches, and thus could not implement cache-based locks, we implemented a highly efficient locking protocol, described below, that avoids spinning on locks in an alternative way.
We simulated two physical networks, N1 and N2, each composed of 2 x 2 switches, synchronously clocked, and arranged in a reverse baseline topology. The message transmission protocol was store-and-forward using send/acknowledge. N1 and N2 differed only in the design of the individual switches. N1 used the simple switch shown in Fig. 3a; N2 used the internal buffer switch shown in Fig. 3b. We simulated isotach and conventional versions of both N1 and N2. Networks I1 and I2 are the isotach versions of N1 and N2, respectively. Similarly, C1 and C2 are the conventional versions of N1 and N2. Each isotach network used the same switch algorithm as the corresponding conventional network except that the isotach switches selected operations for routing in pid-rank order and sent tokens (piggybacked on operations) and ghosts.
Latency results are reported in cycle units, which represent physical time. Cycle unit duration depends on low-level switch design and technology. We assume switches in C1 and I1 have the same best case time (the time required for an operation to move through a switch assuming no waiting) of one cycle unit, and switches in C2 and I2 have best case times of two cycle units. This assumption does not imply that average switch latency is the same in conventional and isotach networks of comparable design. Average switch latency tends to be longer in isotach switches because operations must be routed in pid-rank order. Throughout the studies, throughput is the average number of operations arriving at memory, per MM, per cycle.

We analyzed C1, C2, I1, and I2 under a variety of synthetic workloads, which ranged from one with no constraints on operation execution order to ones with combinations of atomicity, sequencing, and data dependence constraints.
4.1.1 Raw Power Study
We measured the raw power of each of the four networks, i.e., throughput and delay assuming no atomicity or sequencing constraints. The workload consisted of generic operations; reads and writes were not distinguished. We assumed memory cycle time was the same as switch cycle time: each MM could consume one operation per cycle. Although the workload had no sequencing constraints, there was no way to relax enforcement of sequential consistency in I1 and I2. Thus we actually compared isotach networks that enforce sequential consistency with conventional networks that did not.
Delay was measured from the time an operation was generated until it was delivered to the MM. As expected, C1 and C2 handled higher load with less delay than I1 and I2, respectively. Conventional networks have more raw power because isotach network switches are subject to the additional constraint that they must route operations in isotach time order. Also of significance: C2 and I2 offered higher throughput at the cost of higher latency than C1 and I1.

In all of the remaining studies, C1 outperformed C2 at each data point. C2 performed poorly relative to C1 because its latency was higher than C1's. The increased potential throughput offered by C2 could not be exploited when locks and restrictions on pipelining are used for synchronization. Henceforth we report results only for C1, I1, and I2.
4.1.2 Sequential Consistency Study
We compared networks under a workload requiring sequential consistency but no atomicity. Here and in the warm-spot study, traffic in both the forward (PE → MM) and reverse (MM → PE) directions was simulated. The reverse networks were conventional networks: I2 used C2, and I1 and C1 used C1. In each cycle, each MM (PE) could receive one operation (response) from the network and issue one response (operation).

Fig. 7 shows the high cost of enforcing sequential consistency in a conventional network. In I1 and I2, operations could be pipelined without risk of violating sequential consistency. We assumed no data dependences constrained the issuing of operations. Each PE issued a new operation whenever its output buffer was empty. C1 enforced sequential consistency by limiting each PE to one outstanding operation at a time. Throughput for I1 and I2, shown in Fig. 7, was the same as in the raw power studies, since the isotach networks enforced sequential consistency in both cases, but was much lower for C1. I1 and I2 outperformed C1 in terms of throughput, but with longer delay, partly because pipelining caused I1 and I2 to be more heavily loaded.
We would expect data dependences among operations
issued by the same process to diminish throughput in isotach networks because they prevent the networks from
taking full advantage of pipelining. When we set the data
dependence distance to one, meaning each process needs
the result of its current operation before it can issue its next
Fig. 7. Effect of enforcing sequential consistency. Throughput vs. number of stages for C1, I1, and I2, with and without sequential consistency enforced.
operation, the throughput of the isotach networks was low.
As the data dependence distance increased, the throughput
of I1 and I2 climbed sharply until network saturation. Delay
also increased as the networks became more heavily
loaded. Because its latency is longer and its saturation point
higher, we observed that I2 requires a larger data dependence distance than I1 to take full advantage of its ability to
pipeline. I2 performed less well than I1 at small dependence distances because I2 cannot take advantage of its
higher throughput and is hurt by a higher response time
due to higher latency. Since C1 cannot pipeline regardless
of the data dependence distance, throughput and delay for
C1 do not vary with the data dependence distance. We did
observe that the throughputs of I1 and I2 began to exceed C1's at a data dependence distance of only two.
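The effect of dependence distance on pipelining can be captured in a toy model (our simplification, not the paper's simulator): with a round-trip latency of rtt cycles and a data dependence distance d, a process can keep at most d operations in flight, so per-process throughput is roughly min(d/rtt, peak). C1 behaves like d = 1 regardless of the program's actual dependence distance.

```python
# Toy pipelining model: throughput limited by operations in flight.
def throughput(d, rtt, peak=1.0):
    """Approximate per-process throughput (ops/cycle) for dependence
    distance d, round-trip latency rtt, and network capacity peak."""
    return min(d / rtt, peak)

rtt = 10.0
c1 = throughput(1, rtt)              # conventional: one outstanding op
assert c1 == 0.1
assert throughput(2, rtt) == 2 * c1  # pipelining gains appear by d = 2
assert throughput(20, rtt) == 1.0    # saturates at network capacity
```

The model reproduces the qualitative trend reported above: throughput climbs with dependence distance until saturation, while a network limited to one outstanding operation is flat.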
4.1.3 Warm-Spot Traffic Model Study
We combined a warm-spot traffic model with a workload
with atomicity and sequencing constraints. A warm-spot
model has several warm variables instead of a single hot
variable and is more realistic than either the uniform or hot-spot traffic models. The warm-spot model we used was based on the standard 80/20 rule (see, e.g., [22]), modified to lessen contention for the warmest variables [40].
Atomic actions were assumed to be flat and independent, i.e., there were no data dependences among or within
atomic actions. I1 and I2 enforced atomicity by issuing all
operations from the same atomic action in the same pulse.
C1 enforced atomicity using locks: A lock was associated
with each variable and a process acquired a lock for each
variable accessed by the atomic action before releasing any
of the atomic action’s locks. To avoid deadlock, each process acquired the locks it needed for each atomic action in a
predetermined linear order.
Our lock algorithms were generously efficient. Instead of
spinning, lock requests queued at memory. Reads used
nonexclusive locks and comprised 75% of the workload.
Instead of sending a lock request for an operation, a process
sent the operation itself. Each operation implicitly carried a
request for a lock of the type indicated by the operation.
When it received an operation, the MM enqueued it. If no
conflicting operation was enqueued ahead of it, the operation was executed and a response returned to the source
PE. A PE knew it had acquired a lock when it received the
response. Eliminating explicit lock requesting and granting
reduced traffic and eliminated roundtrip PE/MM message
delays when a lock was granted. When the PE had acquired
all the locks for the atomic action, it sent a lock release for
each lock it held.
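The implicit-lock protocol described above can be sketched as follows (the data structures and names are ours): a PE sends the operation itself; the MM enqueues it and executes it immediately only if no conflicting operation is queued ahead of it, so the response doubles as the lock grant. Reads take nonexclusive locks and therefore never conflict with queued reads. Lock release is omitted for brevity.

```python
# Sketch of operations-as-implicit-lock-requests queued at the MM.
from collections import deque

class MemoryModule:
    def __init__(self):
        self.queues = {}                     # variable -> deque of ops

    def receive(self, pe, var, kind):
        """Enqueue (pe, kind) for var; return True if the operation
        executes now, i.e., the implicit lock is granted immediately."""
        q = self.queues.setdefault(var, deque())
        # A write conflicts with anything queued; a read conflicts only
        # with queued writes (read locks are nonexclusive).
        conflicts = any(kind == 'write' or k == 'write' for _, k in q)
        q.append((pe, kind))
        return not conflicts

mm = MemoryModule()
assert mm.receive(pe=1, var='A', kind='read') is True    # shared lock
assert mm.receive(pe=2, var='A', kind='read') is True    # reads coexist
assert mm.receive(pe=3, var='A', kind='write') is False  # must queue
```

Queueing at memory rather than spinning is what eliminates the explicit request/grant round trip described in the text.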
Execution of our warm-spot model was sequentially
consistent. C1 enforced sequential consistency by limiting
the number of outstanding atomic actions per PE to one. In
I1 and I2, a PE could have any number of active atomic actions without sacrificing sequential consistency.
Fig. 8 shows the effect of varying the average size
(AA-size) of atomic actions under a warm-spot model. Delay
is delay per operation, i.e., the average number of cycles from the time an atomic action is generated until it is completed, divided by AA-size. For C1, the time required to
release locks is not included in delay. A process in C1 generated a new atomic action as soon as it had received all the
responses due in its current atomic action. In I1 and I2, a
process generated a new atomic action whenever its output
buffer was empty. The graphs show C1 performed poorly
for large atomic actions. As the atomic action size increased,
C1's throughput dropped and its delay rose steeply. C1's performance was significantly worse than in a similar study we conducted that assumed a uniform access model, because when the probability of conflicting operations is higher, locks have more impact on performance. As atomic actions grow larger, not only does the number of operations contending for access grow, but also the length of time each lock is held. In I1 and I2, by contrast, delay per operation actually decreased as the size of atomic actions increased, since increasing size meant more concurrency.

As can be seen in Fig. 8, when atomic actions are large and traffic is nonuniform, C1's performance is very poor relative to I1 and I2. For large atomic actions (AA-size = 16), throughput for I2 is about 78 times higher than C1's and its delay about 24 times lower. We observed in other
studies that hot-spot traffic results show the same pattern.
The warm spot study shows that isotach systems perform well in relation to conventional systems under a
workload with atomicity and sequencing constraints. Conventional systems cannot take full advantage of the greater
raw power of their networks because the rate at which
operations can be issued is limited by locks and restrictions
on pipelining. Isotach system performance, by contrast, is
limited only by the raw power of the networks. Though
that of the conventional networks, the synchronization support they provide allows them to outperform the conventional systems by a wide margin.

Finally, we analyzed conventional and isotach systems for a workload of structured atomic actions in which data dependences existed both within and among atomic actions. I1 and I2 continued to outperform C1, though by a smaller margin since the data dependences prevented the isotach networks from taking advantage of their ability to pipeline operations.
4.1.4 Simulation Study Summary
Our study shows conventional networks have higher raw power than isotach networks, but with atomicity and sequencing constraints, isotach networks outperform conventional networks, in some cases by a factor of 10 or more. Conventional systems perform relatively poorly in spite of their greater raw power because they enforce atomicity and sequencing constraints by restricting throughput and imposing delays. Isotach networks perform best in relation to conventional networks when the distance between data-dependent operations within the same process's program is large enough to permit pipelining, atomic actions are large, and contention for shared variables is high. When the workload has two or more of these characteristics, an isotach network performs markedly better than a conventional network. These results indicate that isotach-based synchronization techniques warrant further study. We have extended our simulation to include isotach-based cache coherence protocols [47].
4.2 Analytic Model
Analytic models were developed for networks C1, C2, I1, and I2. That work led to the discovery of novel modeling methods that will be presented in a future paper. We outline the approach and results we obtained here. A detailed presentation appears in [46].

We extended a mean value analysis technique first presented by Jenq [19] for banyan-like MINs. Our extensions enabled the modeling of

1) routing dependencies,
2) a measure of bandwidth we have called information flow, and
3) message types.
Fig. 8. Varying the atomic action size under warm-spot traffic. Throughput and delay vs. AA size for C1, I1, and I2.
Routing dependencies exist in isotach networks in the form of constraints requiring routing in pid-rank order (Jenq did not include this kind of dependence). Information flow is a measure of the bandwidth of a given stage of a MIN. Including information flow allowed us to introduce independence among network stages, gaining tractability with little sacrifice in accuracy. Finally, certain optimizations that could be made in an implementation of an isotach network, for example, piggy-backing of tokens on other messages and the use of ghost messages, needed to be reflected in the analytic model. We succeeded in capturing the effects of these message types.

For networks I1 and I2, the analytic model predicted throughput with no worse than 12% deviation from the simulation results presented earlier. The model tended to be optimistic, which we expected because it didn't account for delays that variations in the uniform distribution of messages would create in the network. Those delays were captured in the simulation. Prediction of delay was even better, with a maximum error of 5%. In both cases, the results were for networks ranging in depth from two to 10 stages.
5 CONCLUSIONS AND ONGOING WORK
We described a new approach to enforcing ordering constraints in a fault-free system in which processes communicate over a point-to-point network. This problem is difficult, even with the assumption of fault-freedom. The conventional solution uses locks and restrictions on pipelining. Solutions to the harder problem, where fault-freedom is not assumed, have been found useful in distributed computations, but these solutions are not suitable for use in SMM computations. We have described an approach that does not require locks or restrictions on pipelining and that is scalable and light-weight enough for use in SMM computations.
The approach is based on isotach logical time, a logical time system that extends Lamport's system by adding an invariant that relates communication time to communication distance. This invariant is a powerful coordinating mechanism that allows the sender to control the order in which its messages are received. We defined isotach logical time, gave an efficient distributed algorithm for implementing it, and showed how to use isotach logical time to enforce ordering constraints. The work of implementing isotach logical time and enforcing the ordering constraints can be localized in the network interfaces and switches. We defined a new class of networks called isotach networks that implement isotach logical time.
Implementing isotach logical time, of course, is not free, and we are well aware of controversy ([8], [12], [39]) over implementing synchronization services at a low level in end-to-end systems [42]. Isotach networks are justifiable on a cost/benefit basis: the guarantees they offer can be implemented cheaply, yet are sufficiently powerful to enforce fundamental synchronization properties. We reported the results of analytic and empirical studies that show that isotach synchronization techniques are far more efficient than conventional techniques.
Ongoing work. We are building an isotach network prototype and studying the use of isotach techniques in implementing distributed shared memory and rule-based systems. One of our goals in building the prototype is to demonstrate that isotach techniques can work in distributed systems and can tolerate some types of faults. Preliminary explorations show that isotach networks can function under a number of different failure models. One result for fail-stop models appears in [46]. Our approach to ensuring a reliable implementation of the isochron on the prototype system will be similar to the approach reported in [45], except that we intend to use a distributed protocol in place of a centralized credit manager for flow control. Other goals in building the prototype include showing that implementing isotach logical time does not significantly interfere with network traffic that does not use isotach logical time, and that isotach logical time can be implemented without blocking messages within the network. The former is important in addressing the end-to-end controversy referred to above. The latter is important to allow the use of off-the-shelf networks and can be accomplished by pushing most of the work that the switches do in the isonet algorithm onto the receiving network interface. Our target platform for the prototype is a nonequidistant cluster of personal computers (PCs) interconnected with Myrinet [9].
ACKNOWLEDGMENTS
This work was supported in part by DARPA Grant
DABT63-95-C-0081, U.S. National Science Foundation
Grant CCR-9503143, and DUSA(OR) Simtech funding
(through the National Ground Intelligence Center).
REFERENCES
[1] Y. Amir, L.E. Moser, P.M. Melliar-Smith, D.A. Agarwal, and P. Ciarfella, "The Totem Single-Ring Ordering and Membership Protocol," ACM Trans. Computer Systems, vol. 13, no. 4, pp. 311-342, Nov. 1995.
[2] T.E. Anderson, "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 1, pp. 6-16, Jan. 1990.
[3] B. Awerbuch, "Complexity of Network Synchronization," J. ACM, vol. 32, no. 4, pp. 804-823, Oct. 1985.
[4] H.E. Bal and A.S. Tanenbaum, "Distributed Programming with Shared Data," Proc. IEEE CS 1988 Int'l Conf. Computer Languages, pp. 82-91, Miami, Fla., Oct. 1988.
[5] Y. Birk, P.B. Gibbons, J.L.C. Sanz, and D. Soroker, "A Simple Mechanism for Efficient Barrier Synchronization in MIMD Machines," Technical Report RJ 7078, IBM, Oct. 1989.
[6] K.P. Birman and T.A. Joseph, "Reliable Communication in the Presence of Failures," ACM Trans. Computer Systems, vol. 5, no. 1, pp. 47-76, Feb. 1987.
[7] K.P. Birman, A. Schiper, and P. Stephenson, "Lightweight Causal and Atomic Group Multicast," ACM Trans. Computer Systems, pp. 272-314, Aug. 1991.
[8] K. Birman, "A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication," OS Review, vol. 28, no. 1, pp. 11-21, Jan. 1994.
[9] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W. Su, "Myrinet: A Gigabit-per-Second Local-Area Network," IEEE Micro, vol. 15, no. 1, pp. 29-36, Feb. 1995.
[10] K. Chandy and J. Misra, "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs," IEEE Trans. Software Eng., vol. 5, no. 5, pp. 440-452, Sept. 1979.
[11] J. Chang and N.F. Maxemchuk, "Reliable Broadcast Protocols," ACM Trans. Computer Systems, vol. 2, no. 3, pp. 251-273, Aug. 1984.
[12] D.R. Cheriton and D. Skeen, "Understanding the Limitations of Causally and Totally Ordered Communication," OS Review, 14th ACM Symp. OS Principles, pp. 44-57, Dec. 1993.
I;. Cristian, ”Synchronous Atomic Broadcast for Redundant
Broadcast Channels,” Technical Report RJ 7203 (67682), IBM,
Dec. 1989.
M. Dasser, ”TOMP: A Total Ordering Multicast Protocol,“ O p ntirig Systems Review,vol. 26, no. 1, Jan. 1992.
H. Garcia-Molina and A. Spauster, “Ordered and llieliable Multicast
Communication,” ACM Trans. Corirputer Systeiris, vol. 9, no. 3,
pp. 242-271, Aug. 1991.
K. Gharachorloo et al., “Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors,” Proc.
Fourth ASPLOS, pp. 245-257, Apr. 1991.
P.B. Gibbons, ”The Asynchronous PRAM: A St.ini-Sytichronous
Model for Shared Memory MIMD Machines,” 89-062, ICSI, Berkeley, Calif., December, 1989.
K.J. Goldman, “Highly Concurrent Logically Synchronous Multicast,” Distributed Computing, pp. 94-108, Berlin-Heidelburg-New
York Springer-Verlag, 1989.
Y. Jenq, “Performance Analysis of a Packet Switch Based on Single-Buffered Banyan Network,” I E E E ]. Selcctecl Areas Comm., vol. 1,
no. 6, pp. 1,014-1,021, Dec. 1983.
M.F. Kaashoek, A.S. Tanenbaum, S.F. Huminell, and H.E. Bal,
”An Efficient Reliable Broadcast Protocol,” OS Review, vol. 23, no. 4,
pp. 5-19, Oct. 1989.
M.J. Karol, M.G. Hluchyj, and S.P. Morgan, “Input Versus Output
Queueing on a Space-Division Packet Switch,” IEEE Trans.
Comm., vol. 35, no. 12, pp. 1,347-1,356, Dec. 1987.
D.E. Knuth, ”Sorting and Searching, “ Fundnmental Algorithms,
vol. 3, p. 397. Addison-Wesley, 1973.
[23] C.P. Kruskal, L. Rudolph, and M. Snir, "Efficient Synchronization on Multiprocessors with Shared Memory," ACM Trans. Programming Languages and Systems, vol. 10, no. 4, pp. 579-601, Oct. 1988.
[24] M. Kumar and J.R. Jump, "Performance Enhancement of Buffered Delta Networks Using Crossbar Switches and Multiple Links," J. Parallel and Distributed Computing, vol. 1, pp. 81-103, 1984.
[25] L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[26] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Computers, vol. 28, pp. 690-691, 1979.
[27] L. Lamport, "On Interprocess Communication," Distributed Computing, vol. 1, no. 2, pp. 77-101, Apr. 1986.
[28] D.B. Lomet, "Process Structuring, Synchronization, and Recovery Using Atomic Actions," SIGPLAN Notices, vol. 12, no. 3, pp. 128-137, Mar. 1977.
[29] J.M. Mellor-Crummey and M.L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," ACM Trans. Computer Systems, vol. 9, no. 1, pp. 21-65, Feb. 1991.
[30] M.M. Michael and M.L. Scott, "Scalability of Atomic Primitives on Distributed Shared Memory Multiprocessors," Technical Report 528, Univ. of Rochester, Computer Science Dept., July 1994.
[31] S. Owicki and D. Gries, "An Axiomatic Proof Technique for Parallel Programs I," Acta Informatica, vol. 6, pp. 319-340, 1976.
[32] S. Owicki and L. Lamport, "Proving Liveness Properties of Concurrent Programs," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 455-495, July 1982.
[33] C. Papadimitriou, Database Concurrency Control. Computer Science Press, 1986.
[34] L.L. Peterson, N.C. Bucholz, and R.D. Schlichting, "Preserving and Using Context Information in Interprocess Communication," ACM Trans. Computer Systems, vol. 7, no. 3, pp. 217-246, Aug. 1989.
[35] A.G. Ranade, S.N. Bhatt, and S.L. Johnsson, "The Fluent Abstract Machine," Technical Report 573, Yale Univ., Dept. of Computer Science, Jan. 1988.
[36] A.G. Ranade, "How to Emulate Shared Memory," IEEE Ann. Symp. Foundations of Computer Science, pp. 185-194, Los Angeles, 1987.
[37] M. Raynal and M. Singhal, "Logical Time: Capturing Causality in Distributed Systems," Computer, vol. 29, no. 2, pp. 49-56, Feb. 1996.
[38] D. Reed, "Implementing Atomic Actions on Decentralized Data," ACM Trans. Computer Systems, vol. 1, no. 1, pp. 3-23, Feb. 1983.
[39] R. van Renesse, "Why Bother With CATOCS?" OS Review, vol. 28, no. 1, pp. 22-27, Jan. 1994.
[40] P.F. Reynolds Jr., C. Williams, and R.R. Wagner Jr., "Empirical Analysis of Isotach Networks," Technical Report 92-19, Univ. of Virginia, Dept. of Computer Science, June 1992.
[41] P.F. Reynolds Jr. and C. Williams, "Asynchronous Rule-Based Systems in CGF," Proc. Fifth Conf. Computer Generated Forces and Behavioral Representation, Orlando, Fla., May 1995.
[42] J.H. Saltzer, D.P. Reed, and D.D. Clark, "End-To-End Arguments in System Design," ACM Trans. Computer Systems, vol. 2, no. 4, pp. 277-288, Nov. 1984.
[43] C. Scheurich and M. Dubois, "Correct Memory Operation of Cache-Based Multiprocessors," Proc. 14th ISCA, pp. 234-243, June 1987.
[44] B.R. Supinski, C. Williams, and P.F. Reynolds Jr., "Performance Evaluation of the Late Delta Cache Coherence Protocol," Technical Report CS-96-05, Univ. of Virginia, Dept. of Computer Science, Mar. 1996.
[45] K. Verstoep, K. Langendoen, and H.E. Bal, "Efficient Reliable Multicast on Myrinet," Technical Report IR-399, Vrije Universiteit, Amsterdam, Jan. 1996.
[46] R.R. Wagner Jr., "On the Implementation of Local Synchrony," Technical Report CS-93-33, Univ. of Virginia, Dept. of Computer Science, June 1993.
[47] C. Williams and P.F. Reynolds Jr., "Delta-Cache Protocols: A New Class of Cache Coherence Protocols," Technical Report 90-34, Univ. of Virginia, Dept. of Computer Science, Dec. 1990.
[48] C.C. Williams, "Concurrency Control in Asynchronous Computations," PhD thesis, No. 9301, Univ. of Virginia, 1993.
[49] C.C. Williams and P.F. Reynolds Jr., "Combining Atomic Actions," J. Parallel and Distributed Computing, pp. 152-163, Feb. 1995.
Paul F. Reynolds, Jr. received the PhD degree from the University of Texas at Arlington in 1979. He is currently an associate professor of computer science at the University of Virginia, where he has been on the faculty since 1980. He has published widely in the area of parallel computation, specifically in parallel simulation and parallel language and algorithm design. He has served on a number of national committees and advisory groups as an expert on parallel computation and, more specifically, as an expert on parallel simulation. He has served as a consultant to numerous corporations and government agencies in the systems and simulation areas.
Craig Williams received her BA degree in economics from the College of William and Mary in 1973, a JD from Columbia Law School in 1975, and her master's degree in computer science in 1987. Her PhD was awarded by the University of Virginia in 1993. She is now a senior scientist in the Computer Science Department at the University of Virginia. Her research interests include parallel data structures and both the hardware and software aspects of parallel computation.
Raymond R. Wagner, Jr. received his BA in computer science from Dartmouth College in 1985, and his MS and PhD degrees in computer science from the University of Virginia in 198- and 1993. He is currently a senior associate and partner in the Vermont Software Company, Inc., in Wells River, Vermont.