IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997 337 Isotach Networks Paul F. Reynolds, Jr., Member, /E€€ Computer Society Craig Williams, and Raymond R. Wagner, Jr. Abstract-We introduce a class of networks called isotach networks designed to reduce the cost of synchronization in parallel computations. Isotach networks maintain an invariant that allows each process to control the logical times at which its messages are received and consequently executed. This control allows processes to pipeline operations without sacrificing sequential consistency and to send isochrons, groups of operations that appear to be received and executed as an indivisible unit. Isochrons allow processes to execute atomic actions without locks. Other uses of isotach networks include ensuring causal message delivery and consistency among replicated data. Isotach networks are characterized by this invariant, not by their topology. They can be implemented in a wide variety of configurations, including NUMA (nonuniform memory access) multiprocessors. Empirical and analytic studies of isotach synchronization techniques show that they outperform conventional techniques, in some cases by an order of magnitude or more. Results presented here assume fault-free systems;we are exploring extension to selected failure models. Index Terms-Logical time, interprocess coordination,concurrency control, isochronicity,atomicity,sequential consistency, interconnection networks, multiprocessor systems. 1 INTRODUCTION continue to perform well. Our focus in this paper is on shared memory model reduce synchronization costs in parallel computations by providing stronger guarantees about the order in which (SMM) computations on tightly-coupled multiprocessors. messages are delivered. In existing networks, a process can There are two reasons for this focus: neither predict nor control the order in which its messages We assume a fault-free system. This assumption is are delivered relative to concurrently issued messages. Few standard in discussions of synchronization techniques existing networks offer ordering guarantees any stronger for SMM computations on tightly coupled systems than FIFO delivery order among messages with the same but not in distributed systems, where dealing with source and destination. As a result, a process must use delays faulty communications is a major concern. and higher-level synchronization mechanisms such as locks Unlike many other synchronization techniques, isoto ensure that execution is consistent with the program’s tach techniques are suitable for the fine-grained synconstraints. chronization required for SMM computations. Isotach networks offer a guarantee called the isotach invariant that allows processes to predict and control the Isotach techniques are also applicable to message based logical times at which their messages are received. This model computations, but their use in distributed computing invariant serves as the basis for some efficient new syn- requires finding a way to relax the assumption of a fault-free chronization techniques. The invariant is expressed in logi- system. We address this issue in the discussion of ongoing cal time because a physical time guarantee would be pro- work in Section 5. The principal contributions of this paper are in hibitively expensive, whereas the logical time guarantee can be enforced cheaply, using purely local knowledge. 1) defining isotach logical time, the logical time system Isotach networks are characterized by this invariant, not by that incorporates the isotach invariant; 2) giving an efficient, distributed algorithm for impleany particular topology or programming model. They can be implemented in a wide variety of configurations, includmenting isotach logical time; and ing NUMA (nonuniform memory access) multiprocessors. 3) showing how to use isotach logical time to enforce Our performance studies show that isotach synchronization two basic synchronization properties, sequential contechniques are more efficient than conventional techniques. sistency and isochronicity. As contention for shared data increases, conventional tech- An execution is sequentidy consistent if operations specified niques cease to perform acceptably, but isotach techniques by the program to be executed in a specific order appear to be executed in the order specified [26] and is isochronous if operations specified by the program to be executed at the P.F. Reynolds, ]r. and C. Williams are with the Departrnent of Computer Science, Thornton Hall, University of Virginia, Charlottesville, V A 22903. same time appear to be executed at the same time. As E-mail: (reynolds, craigwl@virginia.edu. explained in Section 3, isochronicity and atomicity are R.R. Wngner, 1.. is w i t h The Vermovzt Softzuam Cowpnny. Inc., Boldruin similar, though not identical, concepts. House, PO Box 731, Wells River, VT 05081. Enforcing sequential consistency and isochronicity is E-mail: Roy. Wasncu@VTSoftwaue.co,n. trivial, given an assumption of fault-freedom, on systems in Marruscript received Feb. 13,1995. For. iriformatiori on obtaining reprints of this article, please serid e-rrrnil to: which processes communicate over a single shared broad- I sotack networks are a new class of networks designed to - triiirspds~corirputer.ors,and reference IEEECS Log Nirrnber D95263. 1045-9219/97/$10.00 01997 IEEE 338 IEEE TRANSACTIONSON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997 cast medium such as a shared bus. However such systems do not scale. Point-to-point networks scale, but enforcing sequential consistency and isochronicity is hard on pointto-point networks, even with the assumption of faultfreedom, and existing algorithms are expensive. The isotach algorithms for enforcing isochronicity and sequential consistency are unique in that they are both scalable-they have low message complexity and no intrinsic bottleneck such as a shared bus-and light-weight-they are efficient and require only a single communication round. Section 2 of this paper defines isotach logical time and shows how it can be implemented. Section 3 describes isotach synchronization techniques. Section 4 reports on our performance studies and section 5 summarizes the paper and describes ongoing work. 2 ISOTACH LOGICAL TIME A logical time system is a set of constraints on the way in which events of interest are ordered, i.e., assigned logical times. Isotach logical time is an extension of the logical time system defined by Lamport in his classic paper on ordering events in distributed systems 1251. We begin by describing Lamport’s system. We then define isotach logical time and give an algorithm for implementing isotach logical time using network switches and interfaces. In Lamport’s system, the events of interest are the sending and receiving of messages. Times assigned to these events are required to be consistent with the happened before relation, a relation over send and receive events that captures the notion of potential causality: Event n happened before event 6, denoted a + b, if 1) a and b occur at the same process and a occurs before h; 2) a is the event of sending message m and b is the event of receiving the same message m; or 3) there exists some event c such that a -+ c and c + 6. In Lamport’s system, a + b * t(a) < t(6). (Refer to Fig. 1. for notation relating to logical time.) Lamport gives a simple distributed algorithm for assigning times consistent with this constraint. Each process has its own logical clock, a variable that records the time assigned to the last local event. When it sends a message, a process increments its clock and timestamps the message with the new time. When it receives a message, a process sets its clock to one more than the maximum of the current time and the timestamp of the incoming message. Logical time systems have been proposed that extend Lamport’s system by adding the constraint t(a) < t(b) a + b (thus t(n) < f(b)if and only if a -+b). For a recent survey, see [37]. ISIS [7] is a messaging layer that incorporates this extension to Lamport’s logical time system. Isotach logical time also extends Lamport’s time system, but in a different way. In isotach logical time, the additional constraint is that the times assigned be consistent with an invariant relating the logical distance a message travels to the logical time it takes to reach its destination. 2.1 Defining Isotach Logical Time In an isotach logical time system, times are lexicographically ordered n-tuples of nonnegative integers. The first and For any event e t(e) logical time assigned to e For any message m logical time assigned to the event of sending ni logical time assigned to the event of t,(m) receiving in logical distance traveled by m d(m) unique identifier of the process that pid(m) issued m issue rank of m, rank (m) = i if m is rank(m) the ith message issued by pid(m) Lamport’s happened-before relation t,(m) + Fig. 1. Logical time notation. most significant component is the pulse. The number and interpretation of the remaining components can vary. In this paper, logical times are three-tuples of the form (pulse, pid(m), runk(m)). Isotach logical time extends Lamport’s logical time by requiring that the times assigned be consistent with both the -+ relation and the isotach invariant: Each message m is received exactly d(m) pulses after it is sent. 1 Thus t,(m) = (i, j, k ) implies t,(m) = (i + d(m), j , k). In this paper, we use the routing distance, i.e., the number of switches through which rn is routed, as the measure of logical distance, but other measures can be used. Isotach logical time is so named because all messages travel at the same velocity in logical time-one unit of logical distance per pulse of logical time. As we will show in Section 3, the isotach invariant provides a powerful mechanism for coordinating access to shared data. Given the isotach invariant, a process that knows d(m) for a message m that it sends can determine t,(m) through its choice of t,(m). Thus processes can use the isotach invariant to control the logical times at which their messages are received. 2.2 Implementing Isotach Logical Time An isotach network is a network that implements isotach logical time, i.e., that delivers messages in an order that allows send and receive events to be assigned logical times consistent with the isotach invariant and the + relation. Several networks have been proposed that can, in retrospect, be classified as isotach networks. The alphasynchronizer network proposed by Awerbuch to execute SIMD graph algorithms on asynchronous networks (31, and a network proposed to support barrier synchronization [51, [17] can be viewed as isotach networks that maintain a logical time system in which each logical time consists of the pulse component only. A network proposed by Ranade [351, 1361 as the basis for an efficient concurrent-read, concurrent-write (CRCW) PRAM emulation can be viewed as realizing the more general and powerful n-tuple isotach logical time system. We give an algorithm, the isonet algorithm, for implementing isotach logical time on networks of 1. For any message 111 for which d(m) = 0, e.g., a cache hit or a message between collocated processes, the isotach invariant requires t,(nt) = t,(m). For this reason, we relax Lamport‘s requirement that a -+ b t(n) < t(b). In isotach logical time, n + b 3 t(n) s I@). REYNOLDSET AL.: ISOTACH NETWORKS arbitrary topology. The algorithm localizes the work of implementing isotach logical time in the network interfaces and switches. It is simple but impractical. Later in this section, we show how to make it practical. 2.2.1 Model We consider a network with arbitrary topology consisting of a set of interconnected nodes in which each node is either a switch or a host. Adjacent nodes communicate over FIFO links. Each host has a network interface called a switch interface unit (SIU) that connects it to a single switch. A host communicates with its SIU over a FIFO link. All components and interconnecting links are failure-free. To postpone considering the issue of communication deadlock, we initially assume each switch has infinite buffers. A message is issued when the process passes it to the SIU, sent when the SIU passes it to the network, and received when the SIU accepts the message from the network. We assume that the destination SIU delivers messages to the destination process in the order in which it received them. 2.2.2 lsonet Algorithm Each SIU and switch maintains a logical clock, a counter initialized to zero, that gives the pulse component of the local logical time. The clocks are kept loosely synchronized by the exchange of tokens. A token is a signal sent to mark the end of one pulse of logical time and the beginning of the next. Initially each switch sends a token wave, i.e., it sends a token on each output, including the outputs to adjacent SIUs, if any. Thereafter, each switch sends token wave i + 1 upon receiving the ith token on each input, including the inputs from adjaceht SIUs. Each time a switch sends a token wave, it increments its clock. When an SIU receives a token, the SIU increments its clock and returns the token to the switch. A message is sent (received/routed) in pulse i if it is sent (received/routed) while the local clock is i. An SIU may send no messages or several messages in a pulse. The SIU tags each message m it sends with the lexicographically ordered pair (pld(m),runk(m)). The messages an SIU sends in a pulse must be sent in pid-rank order. Thus if m and m‘ are issued by the same process and runk(m) < rank(”), the SIU must send m before m‘ if it sends m and m’ in the same pulse. However the rule does not prohibit the SIU from sending m’ in an earlier pulse than m. During each pulse, an SIU continuously compares the tag of the next message to be sent with the tag of the next message to be received and handles (sends or receives) the message with the lowest tag. If its network input is empty, the SIU waits for it to be nonempty. The switches in an isotach network route messages in pid-rank order in each pulse. The switch behaves the same as a switch in a conventional network except that it always selects for routing the message with the lowest tag among the messages at the head of its inputs. As in the case of an SIU, a switch waits if one of its inputs is empty. Pseudocode for the isonet algorithm is given in Fig. 2. To show that the isonet algorithm implements isotach time, we 1) give a rule for assigning logical times to send and receive events, and 339 At each switch clock = 0; repeat forever send a token on each output; increment clock; route all messages up to next token on each input in pid-rank order consume a token from each input; At each SIU clock = 0; repeat forever accept a token from adjacent switch increment clock; return the token to the switch; send and receive messages in pid-rank order; Fig. 2. Pseudocode for isonet algorithm. 2) prove that this assignment is consistent with the isotach invariant and the + relation. Let sjulse(m) denote the pulse in which m is sent and r_pulse(m) the pulse in which m is received. Time assignment rule: For each message m, t,(m) = (s_pulse(m),pid(m), runk(m))and t,(m) = (r_pulse(m),pid(m), runk(m)). THEOREM 1. The isonet algorithm maintains isotach logical time. PROOF.Since a message routed by a switch in pulse i is routed at the next switch in pulse i + 1, for any message m, r_pulse(m) - s_pulse(m) = d(m).Since t,(m) and t,(m) differ only in the pulse component, the isotach invariant holds. Showing that the logical time assignment is consistent with the -+ relation requires showing 1) for each message m, t,(m) 2 ts(m);and 2) for each SIU, the times assigned to local events are nondecreasing, i.e., local logical time doesn’t move backwards. The first part follows directly from the isotach invariant. Since for any message m, d(m) 2 0, t,(m) 2 t,(m) by the isotach invariant. Since the messages each SIU sends in each pulse are sent in pid-rank order, and since each switch merges the streams of messages arriving on its inputs so that it maintains pid-rank order on each output, a simple inductive proof over the number of switches through which a message travels shows that messages received at an SIU within each pulse are received in pid-rank order. Since an SIU interleaves sends and receives so as to handle the all messages in each pulse in pid-rank order, the times assigned to all events at each SIU are nondecreasing.0 2.2.3 Discussion An isotach network need not maintain logical clocks at the SIUs and switches or assign logical times to send and receive events. It need o,nlyensure that messages are received in an order consistent with the isotach invariant and the -+ relation. The logical clocks are auxiliary variables that make the algorithm easy to understand but are not essential to 340 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997 the implementation. Similarly, the rank component of the pid-rank tags can be eliminated or reduced to a few bits. The rank component is used in assigning logical times to send and receive events, but this role is auxiliary to the algorithm. The rank component is essential to the isonet algorithm only to ensure that messages sent by the same source to the same destination in the same pulse are received in the order in which they were issued and sent. If the network topology ensures FIFO delivery order among messages with the same source and destination, the rank component is unnecessary. If it does not, the size of the rank component can be reduced by limiting the number of messages a process can send to a single destination in a single pulse, or it can be eliminated by requiring that multiple messages sent to the same destination in the same pulse be sent as a single larger message. The bandwidth required for the pid component of tags is not attributable to maintaining isotach time because messages are typically required to carry the identity of the issuing process for reasons unrelated to maintaining isotach time. Token traffic is the only other source of bandwidth overhead attributable to isotach time. Tokens can be piggybacked on messages at a cost in bandwidth of one bit per message. 2.2.4 Latency and Throughput The principal cost of the isonet algorithm is the additional waiting implied by the requirement that messages be routed in pid-rank order. One way to reduce this cost is to use ghosts. When a switch sends an operation on one output, it sends a ghost [36] or null message [lo] with the same pid-rank tag on each of its other output(s). A switch receiving a ghost knows all further operations it receives on the same input will have a larger tag than the ghost and this knowledge may allow it to route the operation on its other inputk). Ghosts improve isotach network performance by allowing information about logical time to spread more quickly and are needed in some networks to avoid deadlock. Since a ghost can always be overwritten by a newer message, ghosts take up only unused bandwidth and buffers. Another way to reduce waiting in the switches is to add internal buffers. In a switch with input queueing only, such as the switch shown in Fig. 3a, a blocked message at the head of an input queue blocks subsequent messages arriving on the same input, a phenomenon known as head of line (HOL) blocking. Buffers placed internally, as shown in Fig. 3b, reduce HOL blocking [211, [241. Our simulation study of isotach networks shows that internal buffers are of significantly more benefit in isotach networks than in conventional networks. One reason for this difference is that a conventional switch of either design can route messages onto both outputs in a single switch cycle, but an isotach switch of the simpler design can route at most one operation per cycle since we allow the router to make only one pid-rank tag comparison per cycle. An isotach switch with internal buffers can route messages onto both outputs since it has two merge units operating in parallel. A more fundamental reason for the difference in benefit is that the switch with internal buffers offers increased throughput at the cost of increased delay. Isotach synchronization techniques, because they are more concurrent, e.g., they allow pipelining, allow processes to take better advantage of increases in potential throughput. Splitter Splitter Fig. 3. Switch designs: (a)simple switch, (b) switch with internal buffers. 2.2.5 Deadlock The requirement that switches route messages in pid-rank order makes deadlock freedom a harder problem in isotach networks than in conventional networks. The isonet algorithm uses infinite buffers at the switches to prevent deadlock. One way to bound the number of buffers is to limit the number of messages sent per SIU per pulse. Although such bounds may be useful for other reasons, the number of buffers required to ensure deadlock freedom using this approach is very large. A better approach is to find routing schemes that ensure deadlock freedom. We have proven deadlock free routing schemes for equidistant networks using one buffer per switch per physical channel and for arbitrary topologies using O(d)buffers and virtual channels per switch per physical channel, where d is the network diameter [46]. The isonet algorithm, as modified, is scalable and efficient: It is completely distributed; the bandwidth attributable to maintaining isotach logical time is one bit per message; and the additional work required at each SIU and switch is one tag comparison per message. The main cost of the algorithm is the increased probability that a message will wait within the network. The cost of maintaining isotach logical time relative to the benefits in reduced synchronization overhead is the subject of Section 4. 3 USINGISOTACH LOGICAL TIME We show how isotach logical time provides the basis for enforcing isochronicity and sequential consistency. Our focus is on SMM computations. Each message is an operation or a response to an operation and each operation is a read or write access to a variable. Each host is a processing element (PE) or memory module (MM). The destination of each operation is an MM. Since synchronization is not an issue in relation to private variables, we use the term vari able to refer to a shared variable. 3.1 lsochronicity and Sequential Consistency An isochron is a group of operations that can be issued as 'batch and that is to be executed indivisibly, i.e., withou' apparent interleaving with other operations. An executioi is isochronous if every isochron appears to be executed indi REYNOLDS ET AL.: ISOTACH NETWORKS visibly. Isochrons and isochronicity are closely related to atomic actions and atomicity. An atomic action is a group of operations that is to be executed indivisibly [28], 1311. An execution is atomic if the operations in each atomic action appear to be executed indivisibly. Isochrons and atomic actions differ in two ways: An atomic action, in many contexts, is a unit of recovery from hardware failure. Where a fault-free system is assumed or fault-freedom is not a major concern, atomicity is synonymous with indivisible execution [27], 1321. As stated previously, we assume a fault-free system. An atomic action can have internal dependences preventing the operations in the atomic action from being issued as a batch. An atomic action with such dependences, e.g., A = B, is a Structured atomic action. An atomic action with no such dependences is a flat atomic action. Since we assume a fault-free system, flat atomic actions and isochrons are the same. Isochrons are also similar to totally ordered multicasts (TOMS).The main distinctions are 1) TOM algorithms are usually fault-tolerant; and 2) the messages making up a TOM are typically identical and in some implementations constrained to be identical, whereas the operations making up an isochron are typically different. (Of course, subcomponents of identical messages can be addressed to different recipients, at the cost of increased message size.) We compare implementations at the end of this section. Isochrons are directly useful in obtaining consistent views, making consistent updates, and maintaining consistency among replicated copies. They are also useful as a building block for implementing structured atomic actions. We describe isotach techniques for executing structured atomic actions later in this section. SMM computations on tightly-coupled multiprocesses execute atomic actions using some form of locking. Drawbacks of locking include overhead for lock maintenance in space, time, and communication bandwidth, unnecessarily restricted access to variables, and the care that must be taken when using locks to avoid deadlock and livelock. In an isotach system, a process can execute flat atomic actions without synchronizing with other processes and can execute structured atomic actions without acquiring locks or otherwise obtaining exclusive access rights to the variables accessed. The second synchronization property we consider is sequential consistency. An execution is sequentially consistent if operations appear to be executed in an order consistent with the order specified by each individual process’s sequential program [26]. This property is so basic it is easily taken for granted, but it is expensive to enforce in non-busbased systems because stochastic delays within the network can reorder operations. A violation of sequential consis- tency is shown in Fig. 4. Each circle in the figure represents an operation on a shared variable, A or B. Solid directed lines connect circles representing operations on the same variable and indicate the order in which the operations are executed at memory. Dashed directed lines connect circles 34I representing operations issued by the same process and indicate the order specified by that process’s sequential program. In the execution shown, process P, reads the new value of A (the value written by P2) and subsequently reads the old value of B, in contradiction to P i s program, which specifies that B‘S value be changed first. This violation could occur even if both processes sent the operations in the correct order, if one operation (in this case the write to B) is delayed in the network. The conventional solution is to prohibit pipelining of operations, i.e., each process delays issuing each operation until it receives an acknowledgement for its previous operation. However, pipelining is an important way to decrease effective memory latency, so this solution is expensive. The high cost of enforcing sequential consistency has led to extensive exploration of weaker memory consistency models, e.g., [161, 1431. These weaker models are harder to reason about and still impose significant restrictions on pipelining, but make sense given the cost of maintaining sequential consistency in a conventional system. In an isotach system, processes can pipeline memory operations without violating sequential consistency. P l :: P2:: write(B); write(A); a W @ Programorder at process Executionorder at M M -- t 3.2 Enforcing lsochronicity and Sequential Consistency We consider an SMM computation on an isotach system. Each process’s program consists of a sequence of isochrons interspersed with local computation. Further assumptions: 1) Each process issues operations by sending them through a reliable FIFO channel to the SIU for the PE on which it is executing; 2) Each process marks the last operation in each isochron; 3) Each MM executes operations in the order in which they are received; and 4) Each SIU knows the logical distance to each MM with which it communicates. Although some applications (e.g., cache coherence) require that isotach logical time be maintained for both operations and responses to operations, the techniques described here require that logical times be assigned only to the send and receive events for operations. Since each MM need only execute operations in the order in which they are received and need not ensure that its responses are received at any given logical time, an MM requires only a conventional network interface. Thus we use SIU in this section to mean the SIU for a PE. The key to enforcing isochronicity and sequential consistency in isotach systems is the isotach invariant. Given the isotach invariant and knowledge of the logical distance for each operation it sends, an SIU can control the logical IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8 , NO. 4, APRIL 1997 342 times at which the operations it sends are received by controlling the logical times at which they are sent. In particular, the SIU can ensure that its operations appear to be received at the same time (isochronicity) or in some specified sequence (sequential consistency) by sending operations in accordance with the send order rules: Isochronicity. Send operations from the same isochron so they are received in the same pulse. Sequential Consistency. Send each operation so it is received in a pulse no earlier than that of the operation issued before it. Processes need not be aware of how their ordering constraints are enforced. These rules are implemented entirely by the SIU. In a system with an equidistant network, i.e., a network in which every message travels the same logical distance, the rules call for each SIU to send operations in the order in which they were issued and to send all operations from the same isochron in the same pulse. To avoid delaying logical time in the event of a context switch or process failure, the SIU should not send any operation from an isochron until the last operation in the isochron has been issued. In systems with nonequidistant topologies, the rules may require that an SIU send operations in an order that differs from the order in which they were issued. that conforms to the send order rules, there is an equivalent serial execution that is isochronous and sequentially consistent. To establish that two executions of the same program are equivalent, it is sufficient to show that each variable is accessed by the same operations and that operations on the same variable are executed in the same order in both executions. THEOREM 2. Any isotach system execution E that satisfies the send order rules is isochronous and sequentially consistent. PROOF.Since no two operations issued by the same process have the same issue rank, operations in E are totally ordered by their logical receive times. Let E, be the serial execution in which operations are executed in order of their logical receive times in E. We show 1) E, and E are equivalent; 2) E, is isochronous; and 3) E, is sequentially consistent. It follows that E is isochronous and sequentially consistent. Consider any two operations op, and op, on the same variable V. Without loss of generality, assume op, is received in E before op,. Since logical receive times are unique and consistent with the + relation, t,(op,) < tr(op,). Thus op, is executed before op, in E,. Since operations on V are executed in E in the EXAMPLE. Assume process P, is required to read and P, to write variables A and B and that each process's accesses must be executed indivisibly. In a conventional system, P , and P, must obtain locks, either on individual variables or on the section of code through which the variables are accessed. In an isotach system, P , and P, can execute their accesses without locks. Consider an isotach system with the nonequidistant topology shown in Fig. 5a. Each circle in Fig. 5a represents a switch and each rectangle an MM or PE. Fig. 5b shows one of many possible correct executions. In accordance with the first send order rule, the SIU for each process sends its operations so that they both arrive in the same pulse, e.g., the SIU for PI sends the operation on A one pulse after the operation on B since the routing distance from PI to A is one less than to B. If, as in the execution shown, all four operations happen to be received in the same pulse, operations on each variable will be received and executed in order by pid. As a result, each isochron will appear to be executed without interleaving with other operations. The correctness proof for the send order rules is similar to serializability proofs of database schedulers, see, e.g., [33]. We show that for any execution on an isotach system order in which they are received, op, is also executed before op, in E. Since, for any isochron A, all operations in A are received in the same pulse (first send order rule), issued by the same process, and have consecutive issue ranks, the logical receive time of any operation not in A is outside the interval of logical timc delimited by the receive events for operations in A Thus the operations in A are executed in E, indivisibly, without interleaving with other operations For any pair of operations op, and op, issued by tht same process where rank(op,) < rank(opJ, op, and op are either received in the same pulse or op, is received in an earlier pulse than op, (second senc order rule). In either case, op, is executed before op in E,. C EXAMPLE.Consider again the execution shown in Fig. 5. Thc space-time diagram in Fig. 6a shows one of the manj possible ways in which this execution can occur in operation PE 1 j readA i reads .......:.................... 2 i write A write B (a) Fig. 5. Executing isochrons: (a) network topology, (b) operation send and receive times. 1, . d . tr i (i+l,l,x) i. 2 .j (i+3,1,x) i (i,l,x+l) i 3 i (i+3,1,x+1) ................. ......& ..................... i (62,~) . 3 f. (i+3,2,y) 0- (i+1,2,y+l) .j i2 (i+3,2,y+l) REYNOLDS ET AL.: ISOTACH NETWORKS 343 physical time. Fig. 6b shows the serial execution in which operations are executed in order by their logical receive times. Solid lines connecting circles indicate the associated operations are from the same isochron. Since each variable is accessed by the same operations in the same order in both executions, the executions are equivalent. In discussing a similar pair of diagrams, Lamport provides an alternative way to view the equivalence among the executions in the figure [251: Without introducing the concept of time into the system (which would require introducing physical clocks), there is no way to decide which of these pictures is the better representation. Intuitively, any vertical line in the diagram can be stretched or compressed in any way that preserves the relative order of the points on the line and the result will be equivalent to the original execution. Without memory-mapped peripherals or other special instrumentation, the executions are indistinguishable. Since the second of the equivalent executions in Fig. 6 is isochronous, so is the first, even though it interleaves execution of isochrons in physical time. i Accessby A B (a) 1 B A (b) Fig. 6. Space-time diagrams of equivalent executions. 3.2.7 Isochrons and Totally Ordered Multicasts Algorithms for totally ordered multicasts (TOMS) ensure total ordering without the benefit of the assumption of fault-freedom. TOMs have been used to maintain consistency among replicated copies of objects in distributed systems, e.g., [4], and could theoretically be used to execute flat atomic actions in SMM computations on multiprocessors, and as a basis for techniques to execute structured atomic actions, but we are not aware of any proposals to use TOMs in this way. Since locks are the standard technique for executing atomic actions we use them as the basis for comparison in the performance studies described in the next section. Even if we leave out the fault-tolerant aspects, concentrating only on those aspects of TOM implementations designed to ensure that multicasts are received in a consistent order, we find existing TOM implementations are not scalable or are unsuitable for SMM computations for other reasons. Several messaging layers, e.g., 161, [71, [34], use a logical time system in obtaining an ordering property ‘imong multicasts called causal ordering, but the logical time systems these messaging layers use does not offer an efficient scalable way to obtain total ordering. Existing TOM algorithms introduce some element to obtain the total ordering that makes them unsuitable for use in SMM computations. Most use a serialization point, either in the network itself or in a communication pattern superimposed on the computation, e.g., [ll, ill], [201, to order multicasts. These algorithms scale poorly. Multicasts for hierarchically structured networks or for applications with a hierarchical communication pattern scale more successfully, but still have a potential bottleneck at the root. The few TOM algorithms that do not use a centralized serialization point [61, [13], [14], [18] are based on worst case assumptions or require multiple message rounds. Many TOM implementations, e.g., [7], [15], [34], are group-based. They send each multicast to the full membership of a process group and use the fact that every process sees every message sent to its group in obtaining the total ordering. Group-based protocols have been shown to work well in distributed message based model computations, but the process group concept does not translate well to the SMM. In contrast to existing TOM algorithms, the implementation of isochrons is scalable and light-weight. The underlying implementation of isotach logical time is distributed and the total ordering among isochrons is achieved without introducing a centralized serialization point. Comparing message complexity is complicated by the question of what cost to assess for token traffic in the isotach network and the fact that multiple processes can each send multiple isochrons in the same pulse. Since a token is an empty control signal sent between neighboring switches that can be piggybacked on other messages if there is any real traffic, we believe a cost of 0 for tokens is reasonable. Assuming a cost of 0 for tokens, an isochron with k destinations requires one round of k messages, the minimal cost on a point-to-point network for sending k unique messages to k destinations. 3.2.2 Other Applications Other applications of isotach logical time include the following: Atomicity. Given our assumption of a fault-free system, isotach techniques enforce atomicity for both flat and structured atomic actions. Structured atomic actions cannot be executed using isochrons consisting of ordinary reads and writes because internal dependences prevent issuing all of the operations in a batch, but they can bc executed using isochrons consisting of split operations [48]. A split operation, in brief, is half of a read or write. Split operations are used in executing reads and writes in two steps, where the first step schedules the access and the second transfers a value. Dividing accesses into two steps allows a process that has incomplete knowledge about an access due to an unsatisfied dependence to reserve the context for the access even though the value to be transferred has yet to be determined. A process executes a structured atomic action by first issuing an isochron that contains split operations scheduling all of the accesses required for the atomic action. The process executes the assignment steps for these accesses as it determines the values to be assigned. Execution is atomic because the initial isochron reserves a consis- IlEEE TRANSACTION:j ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997 344 tent time slice across the histories of the accessed variables. This technique works when the access set, the set of variables accessed by the atomic action, can be determined prior to execution of the ,atomicaction. We have proposed variations on the technique for atomic actions with data dependent access sets [48]. Isotach techniques for executing structured atomic actions are similar to techniques used in multiversion concurrency control I381 except that isochrons streamline the& techniques, eliminating roll-back and permitting an efficient in-cache implementation of variable histories. The resulting techniques are sufficiently light-weight in that they can be used in SMM computations. Cache Coherence. Isotach based techniques for enforcing atomicity and sequential consistency extend to systems with caches. The resulting cache coherence protocols [441, [471, 1481 are more concurrent than existing protocols for non-bus-based systems in that they support multiple concurrent reads and writes of the same block, allow processes to pipeline memory accesses without sacrificing sequential consistency, and allow processes to read and write multiple shared cache blocks atomically without obtaining exclusive rights to the accessed blocks. Combining. Combining is a technique for maintaining good performance in the presence of multiple concurrent accesses to the same variable [231. An isotach network need not implement combirdng, but if it does, it can combine operations not (combinable in other networks, resulting in improved concurrency in accessing shared memory [491. Other applications include support for causal message delivery, causal and totally ordered multicasting, migration mechanisms, checkpointing, wait-free communication primitives, and highly concurrent access to linked data structures. Important special applications include parallel and distributed databases and production systems 1411, [481. 4 PERFORMANCE We studied the performance of equidistant isotach networks using both simulation [40] and analytic modeling [46]. The following is a synopsis; the interested reader is referred to the noted references for more details. 4.1 Simulation Studies Isotach networks do more work than conventional networks-they route messages and deliver them in an order consistent with the isotach invariant and the + relation. Thus isotach networks can be expected to have less raw power than comparable conventional networks. Raw power, as determined by network throughput and latency, measures the ability of a network to deliver a workrload of generic messages without atomicity or sequencing constraints to their destinations. Our studies were designed to answer the following questions: 1) How do the raw power of an isotach network and a conventional network relate? 2) Under what conditions, if any, do isotach networks make up for an expected loss in raw power through more efficient support of synchronization? We report on the results of three simulation studies: raw power, sequential consistency, and warm-spot traffic mode' with sequential consistency and atomicity. These studie: compared isotach systems to conventional systems tha enforce sequential Consistency by restricting pipelining anc enforce atomicity by locking the variables accessed in eacl atomic action. Improving the performance of locking ii shared memory systems has been the subject of severa studies in recent years (e.g., [2], [29], [30]).Although we dic not implement caches, and thus could not implemen cache-based locks, we implemented a highly efficient lock ing protocol, described below, that avoids spinning on locks in an alternative way. We simulated two physical networks, N1 and N2, eacl composed of 2 x 2 switches, synchronously clocked, anc arranged in a reverse baseline topology. The messagi transmission protocol was store-and-forward using send acknowledge. N1 and N2 differed only in the design of tht individual switches. N1 used the simple switch shown ii Fig. 3a; N2 used the internal buffer switch shown in Fig. 3b. We simulated isotach and conventional versions of bot1 N l and N2. Networks I1 and I2 are the isotach version o N1 and N2, respectively. Similarly, C1 and C2 are the conventional versions of N1 and N2. Each isotach networl used the same switch algorithm as the corresponding con ventional network except that the isotach switches selectec operations for routing in pid-rank order and sent token: (piggybacked on operations) and ghosts. Latency results are reported in cycle units, which repre sent physical time. Cycle unit duration depends on low level switch design and technology. We assume switches i; C1 and I1 have the same best case time-the time requirec for an operation to move through a switch assuming nc waiting-of one cycle unit, and switches in C2 and I2 hav best case times of two cycle units. This assumption does nc imply that average switch latency is the same in convei tional and isotach networks of comparable design. Averag switch latency tends to be longer in isotach switches becau: operations must be routed in pid-rank order. Throughoi the studies, throughput is the average number of oper, tions arriving at memory, per MM, per cycle. We analyzed C1, C2,11, and I2 under a variety of synthet. workloads, which ranged from one with no constraints o operation execution order to ones with combinations ( atomicity, sequencing, and data dependence constraints. 4.1.1 Raw Power Study We measured the raw power of each of the four network: i.e., throughput and delay assuming no atomicity c sequencing constraints. The workload consisted of generi operations-reads and writes were not distinguished. M assumed memory cycle time was the same as switch cyc tim-ach MM could consume one operation per cycl Although the workload had no sequencing constraint there was no way to relax enforcement of sequential COI sistency in I1 and 12. Thus we actually compared i s o h networks that enforce sequential consistency with convei tional networks that did not. REYNOLDS ET AL.: ISOTACH NETWORKS 345 Delay was measured from the time an operation was generated until it was delivered to the MM. As expected, C1 and C2 handled higher load with less delay than I1 and 12, respectively. Conventional networks have more raw power because isotach network switches are subject to the additional constraint that they must route operations in isotach time order. Also of significance: C2 and I2 offered higher throughput at the cost of higher latency than C1 and 11. In all of the remaining studies, C1 outperformed C2 at cach data point. C2 performed poorly relative to C1 because its latency was higher than Cl’s. The increased potential throughput offered by C2 could not be exploited when locks and restrictions on pipelining are used for synchronization. Henceforth we report results only for Cl,Il, and 12. 4.1.2 Sequential Consistency Study We compared networks under a workload requiring sequential consistency but no atomicity. Here and in the warm-spot study, traffic in both the forward (PE -+ MM) and reverse (MM -+ PE) directions was simulated. The reverse networks were conventional networks: I2 used C2, and I1 and C1 used C1. In each cycle, each MM (PE) could receive one operation (response) from the network and issue one response (operation). Fig. 7 shows the high cost of enforcing sequential consistency in a conventional network. In I1 and 12, operations could be pipelined without risk of violating sequential consistency. We assumed no data dependences constrained the issuing of operations. Each PE issued a new operation whenever its output buffer was empty. C1 enforced sequential consistency by limiting each PE to one outstanding operation at a time. Throughput for I1 and 12, shown in Fig. 7, was the same as the raw power studies, since the isotach networks enforced sequential consistency in both cases, but was much lower for C1. I1 and I2 outperformed C1 in terms of throughput, but with longer delay, partly because pipelining caused I1 and I2 to be more heavily loaded. We would expect data dependences among operations issued by the same process to diminish throughput in isotach networks because they prevent the networks from taking full advantage of pipelining. When we set the data dependence distance to one, meaning each process needs the result of its current operation before it can issue its next Sequential consistency NOT enforced 0.60 1 } 4 6 8 I 10 # Stages Fig. 7. Effect of enforcing sequential consistency. 0: C1, 0: 11, 0: 12 operation, the throughput of the isotach networks was low. As the data dependence distance increased, the throughput of I1 and I2 climbed sharply until network saturation. Delay also increased as the networks became more heavily loaded. Because its latency is longer and its saturation point higher, we observed that I2 requires a larger data dependence distance than I1 to take full advantage of its ability to pipeline. I2 performed less well than I1 at small dependence distances because I2 cannot take advantage of its higher throughput and is hurt by a higher response time due to higher latency. Since C1 cannot pipeline regardless of the data dependence distance, throughput and delay for C1 do not vary with the data dependence distance. We did observe that the throughputs of I1 and I2 began to exceed Cls at a data dependence distance of only two. 4.1.3 Warm-Spot Traffic Model Study We combined a warm-spot traffic model with a workload with atomicity and sequencing constraints. A warm-spot model has several warm variables instead of a single hot variable and is more realistic than either the uniform or hotspot traffic models. The warm-traffic model we used was based on the standard 80/20 rule (see, e.g., [22]), modified to lessen contention for the warmest variables [40]. Atomic actions were assumed to be flat and independent, i.e., there were no data dependences among or within atomic actions. I1 and I2 enforced atomicity by issuing all operations from the same atomic action in the same pulse. C1 enforced atomicity using locks: A lock was associated with each variable and a process acquired a lock for each variable accessed by the atomic action before releasing any of the atomic action’s locks. To avoid deadlock, each process acquired the locks it needed for each atomic action in a predetermined linear order. Our lock algorithms were generously efficient. Instead of spinning, lock requests queued at memory. Reads used nonexclusive locks and comprised 75% of the workload. Instead of sending a lock request for an operation, a process sent the operation itself. Each operation implicitly carried a request for a lock of the type indicated by the operation. When it received an operation, the MM enqueued it. If no conflicting operation was enqueued ahead of it, the operation was executed and a response returned to the source PE. A PE knew it had acquired a lock when it received the Sequential consistency enforced OBO 0.60 7 i 1 1 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 199- 346 response. Eliminating explicit lock requesting and granting reduced traffic and eliminated roundtrip PE/MM message delays when a lock was granted. When the P,E had acquired all the locks for the atomic action, it sent a lock release for each lock it held. Execution of our warm-spot model was sequentially consistent. C1 enforced sequential consistency by limiting the number of outstanding atomic actions per PE to one. In I1 and 12, a PE could have any number of active atomic actions without sacrificing sequential consistency. Fig. 8 shows the effect of varying the average size (AA-size) of atomic actions under a warm-spot model. Delay is delay per operation, i.e., the average nuimber of cycles from the time an atomic action is generated until it is completed, divided by AA-size. For C1, the tirne required to release locks is not included in delay. A process in C1 generated a new atomic action as soon as it had received all the responses due in its current atomic action. In I1 and 12, a process generated a new atomic action whenever its output buffer was empty. The graphs show C1 performed poorly for large atomic actions. As the atomic action size increased, Cl’s throughput dropped and its delay rose steeply. Cl’s performance was significantly worse than in a similar study we conducted that assumed a uniform access model, because when the probability of conflicting operations is higher, locks have more impact on performance. As atomic actions grow larger, not only does the number of ciperations contending for access grow, but also the length of time each lock is held. In I1 and 12, by contrast, delay per operation actually decreased as the size of atomic actions increased, since increasing size meant more concurrency. As can be seen in Fig. 8, when atomic actions are large and traffic is nonuniform, Cl’s performance is very poor relative to I1 and 12. For large atomic actions (AA-size = 161, throughput for I2 is about 78 times higher than Cl’s and its delay about 24 times lower. We observed in other studies that hot-spot traffic results show the same pattern. The warm spot study shows that isotach systems perform well in relation to conventional systems under a workload with atomicity and sequencing constraints. Conventional systems cannot take full advantage of the greater raw power of their networks because the rate at which operations can be issued is limited by locks and restrictions on pipelining. Isotach system performance, by contrast, is limited only by the raw power of the networks. Though raw power of the isotach networks is somewhat lower tha that of the conventional networks, the synchronizatio support they provide allows them to outperform the con ventional systems by a wide margin. Finally, we analyzed conventional and isotach system for a workload of structured atomic actions in which dat dependences existed both within and among atomic at tions. I1 and I2 continued to outperform C1, though by smaller margin since the data dependences prevented th isotach networks from taking advantage of their ability + pipeline operations. 4.1.4 Simulation Study Summary Our study shows conventional networks have higher ra1 power than isotach networks, but with atomicity ani sequencing constraints, isotach networks outperform cor ventional networks, in some cases by a factor of 10 or morc Conventional systems perform relatively poorly in spite ( their greater raw power because they enforce atomicity an sequencing constraints by restricting throughput and in posing delays. Isotach networks perform best in relation t conventional networks when the distance between dat dependent operations within the same process’s program 1 large enough to permit pipelining; atomic actions are largc and contention for shared variables is high. When th workload has two or more of these characteristics, an isc tach network performs markedly better than a convention, network. These results indicate that isotach based synchrc nization techniques warrant further study. We have e: tended our simulation to include isotach based cache cohei ence protocols [471. 4.2 Analytic Model Analytic models were developed for networks C1, C2, I and 12. That work led to the discovery of novel modelin methods that will be presented in a future paper. We ou’ line the approach and results we obtained here. A detaile presentation appears in [46]. We extended a mean value analysis technique first prl sented by Jenq [19] for banyan-like MINs. Our extensior enabled the modeling of 1) routing dependencies, 2) a measure of bandwidth we have called informatio flow, and 3) message types. 0 50 2500 I 0 40 I t I & I 1 200 0 I I I c I I 1 1 030 4 1 2 - 1 1500 1 I 1000 ~ 14 ~ 1 50 0 0 10 I 00 0 00 2 4 6 8 AA 10 Ske 12 14 16 Fig. 8. Varying the atomic action size-warm-spot traffic. 0: C1, 0: 11, 0: 12 9 2 4 6 8 10 12 AA Size 14 1 16 347 HEYNOLDS ET AL.: ISOTACH NETWORKS $Lltingdependencies exist in isotach networks in the form constraints requiring routing in pid-rank order (Jenq did not include this kind of dependence). Information flow is a measure of the bandwidth of a given stage of a MIN. [Ilcluding information flow allowed us to introduce independence among network stages, gaining tractability with little sacrifice in accuracy. Finally, certain optimizations that ,,wid be made in an implementation of an isotach net*rk, for example, piggy-backing of tokens on messages of itent and ghost messages, needed to be reflected in the ,1i1dytic model. We succeeded in capturing the effects of these message types. For networks I1 and 12, the analytic model predicted throughput with no worse than 12% deviation from the simulation results presented earlier. The model tended to be optimistic, which we expected because it didn't account tor delays that variations in uniform distribution of mes,iges would create in the network. Those delays were cap:d in the simulation. Prediction of delay was even better ,th a maximum error of 5%.In both cases, the results were tor networks ranging in depth from two to 10 stages. 5 CONCLUSIONS AND ONGOING WORK We described a new approach to enforcing ordering constraints in a fault-free system in which processes communicate over a point-to-point network. This problem is difficult, t n with the assumption of fault-freedom. The conventional I tion uses locks and restrictions on pipelining. Solutions to the harder problem, where fault-freedom is not assumed, have been found useful in distributed computations, but these solutions are not suitable for use in SMM computations. We have described an approach that does not require locks or restrictions on pipelining and that is scalable and lightweight enough for use in SMM computations. The approach is based on isotach logical time, a logical time system that extends Lamport's system by adding an mant that relates communication time to communicadistance. This invariant is a powerful coordinating mechanism that allows the sender to control the order in which its messages are received. We defined isotach logical time, gave an efficient distributed algorithm for implementing it, and showed how to use isotach logical time to enforce ordering constraints. The work of implementing isotach logical time and enforcing the ordering constraints can be localized in the network interfaces and switches. We riefined a new class of networks called isofach networks that iplement isotach logical time. Implementing isotach logical time, of course, is not free and we are well aware of controversy (181, 1121, 1391) over implementing synchronization services at a low level in end-to-end systems 1421. Isotach networks are justifiable on a cost/benefit basis-the guarantees they offer can be implemented cheaply, yet are sufficiently powerful to enforce fundamental synchronization properties. We reported the results of analytic and empirical studies that .' ow that isotach synchronization techniques are far more 1 ilcient than conventional techniques. Ongoing w o r k . We are building an isotach network prototype and studying the use of isotach techniques in imple1 I 1 lii menting distributed shared memory and rule-based systems. One of our goals in building the prototype is to demonstrate that isotach techniques can work in distributed systems and can tolerate some types of faults. Preliminary explorations show that isotach networks can function under a number of different failure models. One result €or failstop models appears in 1461. Our approach to ensuring a reliable implementation of the isochron on the prototype system will be similar to the approach reported in [451, exicept that we intend to use a distributed protocol in place of <acentralized credit manager for flow control. Other goals in lbuilding the prototype include showing that implementing iisotach logical time does not significantly interfere with network traffic that does not use isotach logical time, and {hat isotach logical time can be implemented without blocking messages within the network. The former is important in addressing the end-to-end controversy referred to above. The latter is important to allow the use of off-theshelf networks and can be accomplished by pushing most of the work that the switches do in the isonet algorithm onto the receiving network interface. Our target platform for the prototype is a nonequidistant cluster of personal computers (PCs) interconnected with Myrinet [9]. ACKNOWLEDGMENTS This work was supported in part by DARPA Grant DABT63-95-C-0081, U.S. National Science Foundation Grant CCR9503143, and DUSA(0R) Simtech funding (through the National Ground Intelligence Center). REFERENCES Y. Amir, L.E. Moser, P.M. Melliar-Smith, D.A. Agarwal, and P. Ciarfella, "The Totem Single-Ring Ordering and Membership Protocol," ACM Trans. Computer Systems, vol. 13, no. 4, pp. 311342, Nov. 1995. T.E. Anderson, 'The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors," I E E E Trans. Parallel and Distributed Systems, vol. 1, no. 1, pp. 6-16, Jan. 1990. B. Awerbuch, "Complexity of Network Synchronization," 1.ACM, vol. 32, no. 4, pp. 804-823, Oct. 1985. H.E. Bal and AS. Tanenbaum, "Distributed Programming with Shared Data," Proc. I€€€ CS 1988 Int'l Conf. Computer Languages, pp. 82-91, Miami, Fla., Oct. 1988. Y . Birk, P.B. Gibbons, J.L.C. Sanz, and D. Soroker, "A Simple Mechanism for Efficient Barrier Synchronization in MIMD Machines," Technical Report RJ 7078, IBM, Oct. 1989. K.P. Birman and T.A. Joseph, "Reliable Communication in the Presence of Failures," ACM Trans. Computer Systems, vol. 5, no. 1, pp. 47-76, Feb. 1987. K.P. Birman, A. Schiper, and P. Stephenson, "Lightweight Causal and Atomic Group Multicast," ACM Trans. Computer Systems, pp. 272-314, Aug. 1991. K. Birman, "A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication," OS Review,vol. 28, no. 1, pp. 11-21, Jan. 1994. N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W. Su, "Myrinet-A Gigabit-per-Second LocalArea Network," IEEE Micro. vol. 15. no. 1. DD.29-36. Feb. 1995. [101 K. Chandy and J. Misra, "Distributed Sim;ttion; A'Case Study in Design and Verification of Distributed Programs," IEEE Trans. Software Eng., vol. 5, no. 5, pp. 440452, Sept. 1979. [lI] J. Chang and N.F. Maxemchuk, "Reliable Broadcast Protocols," ACM Trans. Computer Systems, vol. 2, no. 3, pp. 251-273, Aug. 1984. [12] D.R. Cheriton and D. Skeen, "Understanding the Limitations of Causally and Totally Ordered Communication," OS Review,14th ACM Symp. OS Principles, pp. 44-57, Dec. 1993. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 4, APRIL 1997 I;. Cristian, ”Synchronous Atomic Broadcast for Redundant Broadcast Channels,” Technical Report RJ 7203 (67682), IBM, Dec. 1989. M. Dasser, ”TOMP: A Total Ordering Multicast Protocol,“ O p ntirig Systems Review,vol. 26, no. 1, Jan. 1992. H. Garcia-Molina and A. Spauster, “Ordered and llieliable Multicast Communication,” ACM Trans. Corirputer Systeiris, vol. 9, no. 3, pp. 242-271, Aug. 1991. K. Gharachorloo et al., “Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors,” Proc. Fourth ASPLOS, pp. 245-257, Apr. 1991. P.B. Gibbons, ”The Asynchronous PRAM: A St.ini-Sytichronous Model for Shared Memory MIMD Machines,” 89-062, ICSI, Berkeley, Calif., December, 1989. K.J. Goldman, “Highly Concurrent Logically Synchronous Multicast,” Distributed Computing, pp. 94-108, Berlin-Heidelburg-New York Springer-Verlag, 1989. Y. Jenq, “Performance Analysis of a Packet Switch Based on Single-Buffered Banyan Network,” I E E E ]. Selcctecl Areas Comm., vol. 1, no. 6, pp. 1,014-1,021, Dec. 1983. M.F. Kaashoek, A.S. Tanenbaum, S.F. Huminell, and H.E. Bal, ”An Efficient Reliable Broadcast Protocol,” OS Review, vol. 23, no. 4, pp. 5-19, Oct. 1989. M.J. Karol, M.G. Hluchyj, and S.P. Morgan, “Input Versus Output Queueing on a Space-Division Packet Switch,” IEEE Trans. Comm., vol. 35, no. 12, pp. 1,347-1,356, Dec. 1987. D.E. Knuth, ”Sorting and Searching, “ Fundnmental Algorithms, vol. 3, p. 397. Addison-Wesley, 1973. C.P. Kruskal, L. Rudolph, and M. Snir, ”Efficient Synchronization on Multiprocessors with Shared Memory,” ACM Trans. Programming Languages and Systems, vol. 10, no. 4, pp. 579-601, Oct. 1988. M. Kumar and J.R. Jump, ”Performance Enhancement of Buffered Delta Networks Using Crossbar Switches and Multiple Links,” ]. Parallel and Distributed Computing, vol. 1, pp. 81-103, 1984. L. Lamport, ”Time, Clocks, and the Ordering of Events in a Distributed System,” Comm. ACM, vol. 21, no. 7, pp. 558-S65, July 1978. L. Lamport, ”How to Make a Multiprocessor Computer That Correctly Executes Multiprocessor Programs,” I E E E Trans. Computers, V O ~ 28, . pp. 690-691,1979. L. Lamport, ”On Interprocess Communication,” Distributed Conputiiig, vol. 1, no. 2, pp. 77-101, Apr. 1986. D.B. Lomet, “Process Structuring, Synchronization, and Recovery Using Atomic Actions,” SIGPLAN Notices, vol. 12, no. 3, pp. 12&137, Mar. 1977. J.M. Mellor-Crummey and M. L. Scott, “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors,” ACM Trans. Computer Systems, vol. 9, no. 1, pp. 21-65, Feb. 1991. M.M. Michael and M.L. Scott, “Scalability of Atomic Primitives on Distributed Shared Memory Multprocessors,” ‘Technical Report 528, Univ. of Rochester, Computer Science Dept., July 1994. S. Owicki and D. Gries, “An Axiomatic Proof Technique for Parallel Programs I,” Acta Infornzatica, vol. 6, pp. 319--340, i976. 1 S. Owicki and L. Lamport, ”Proving Liveness Properties of Concurrent Programs,“ ACM Trans. Programming Lnrigunges nnd Systems, vol. 4, no. 3, pp. 455-495, July 1982. 1331 C. Papadimitriou, Dntabase Concirrrermy Control. Computer Science Press, 1986. I341 L.L. Peterson, N.C. Bucholz, and R.D. Schlichting, “Preserving and Using Context Information in Interprocess Communication,” ACM Tmirs. Conzputer Systeins, vol. 7, no. 3, pp. 217-246, Aug. 1989. [351 A.G. Ranade, S.N. Bhatt, and S.L. Johnsson, “The Fluent Abstract Machine,” Technical Report 573, Yale Univ., Dept. of Computer Science, Jan. 1988. (361 A. Ranade and A.G. Ranade, “How to Emulate Shared Memory,” I E E E Ann. Sytnp. Foirndntions of Currrpirter Science, pp. 185-194, Los Angeles, 1987. 1371 M. Raynal and M. Singhal, “Logical Time: Capturing Causality in Distributed Systems,” Corrrputer, vol. 29, no. 2, pp. 49-56, Feb. 1996 1381 D. Reed, ”Implementing Atomic Actions on Decentralized Data,” ACM Trans. Computer Sysferns, vol. 1, no. 1, pp. 3-23, Feb. 1983. 1391 R. Renesse, “Why Bother With CATOCS?” OS Reziiezu, vol. 28, no. 1, pp. 22-27, Jan. 1994. 1401 P.F. Reynolds Jr., C. Williams, and R.R. Wagtier Jr., “Empirical Analysis of Isotach Networks,” Technical Report 92-19, Univ. of Virginia, Dept. of Computer Science, June 1992. 141I 1’ F Reynolds Jr and C Williams, “Asynchronous Rule-Based Systems 111 CGF,” Proc Fifth Conf Cowputer Generated Forces an0 Behn7m1rnl Representation, Orlando, Fla ,May 1995. 1421 J H Saker, D P Reed, and D D Clark, “End-To-End Arguments System Design,” ACM Trnns Conrputer Systems, vol 2, no. 4, p p 277-288,Nov 1984 1431 C Schcurich and M Dubois, “Correct Memory Operation of CacheBased Multiprocessors,” Proc 14th ISCA, pp 234-243, June 1987. 1441 B R Supinski, C Williams, and P F Reynolds Jr , “Performance Evaluation of the Late Delta Cache Coherence Protocol,” Technp CAI Report CS-96-05, Univ of Virginia, Dept of Computer Science Mar 1996. 1451 K Verstoep, K Langendoen, and H E Bal, “Efficient Reliablc Multicast on Myrinet,” Technical Report IR-399, Vrije Univer siteit, Amsterdam, Jan 1996 1461 R.R Wagner Jr , “On the Implementation of Local Synchrony; CS-93-33, Univ. of Virginia, Dept of Computer Science, June 1993 [471 C. Williams and P.F. Reynolds Jr., “Delta-Cache Protocols: A New Class of Cache Coherence Protocols,” Technical Report 90-34, Univ of Virginia, Dept. of Computer Science, Dec. 1990 [481 C C Williams, “Concurrency Control in Asynchronous Compu tations,” Ph.D. thesis, No 9301, Univ of Virginia, 1993. I491 C C. Williams and P.F. Reynolds Jr , “Combining Atomic Actions,’ ] Parallel and Distributed Computing, pp 152-163, Feb. 1995. Paul F. Reynolds, Jr. received the PhD degree from the University of Texas at Arlington in 1979 He is currently an associate professor of com puter science at the University of Virginia, wherc he has been on the faculty since 1980. He hac published widely in the area of parallel computa tion, specifically in parallel stimulation and par allel language and algorithm design. He ha< served on a number of national committees anc advisory groups as an expert on parallel com putation, and, more specifically, as an expert o parallel stimulation. He has served as a consultant to numerous corpo rations and government agencies in the systems and simulation areas Craig Williams received her BA degree in eco nomics from the College of William and Mary i t 1973, a JD from Columbia Law School in 1975 and her Masters degree in computer science i r 1987. Her PhD was awarded by the University o Virginia in 1993. She is now a senior scientist i r the Computer Science Department at the Un versity of Virginia. Her research interests includ parallel data structures and both the hardwarc and software aspects of parallel computation. Raymond R. Wagner, Jr. received his BA i/ computer science from Dartmouth College ir 1985, and his MS and PhD degrees in compute science from the University of Virginia in 198and 1993. He is currently a senior associate anc partner in the Vermont Software Company, Inc in Wells River, Vermont.