The Totem Single-Ring Ordering and Membership Protocol Y. AMIR, L. E. MOSER, P. M. MELLIAR-SMITH, D. A. AGARWAL, and P. CIARFELLA University of California, Santa Barbara

Fault-tolerant distributed systems are becoming more important, but in existing systems, maintaining the consistency of replicated data is quite expensive. The Totem single-ring protocol supports consistent concurrent operations by placing a total order on broadcast messages. This total order is derived from the sequence number in a token that circulates around a logical ring imposed on a set of processors in a broadcast domain. The protocol handles reconfiguration of the system when processors fail and restart or when the network partitions and remerges. Extended virtual synchrony ensures that processors deliver messages and configuration changes to the application in a consistent, systemwide total order. An effective flow control mechanism enables the Totem single-ring protocol to achieve message-ordering rates significantly higher than the best prior total-ordering protocols.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design—network communications; C.2.2 [Computer-Communication Networks]: Network Protocols—protocol architecture; C.2.4 [Computer-Communication Networks]: Distributed Systems—network operating systems; C.2.5 [Computer-Communication Networks]: Local Networks—rings; D.4.4 [Operating Systems]: Communications Management—network communication; D.4.5 [Operating Systems]: Reliability—fault tolerance; D.4.7 [Operating Systems]: Organization and Design—distributed systems

General Terms: Performance, Reliability

Additional Key Words and Phrases: Flow control, membership, reliable delivery, token passing, total ordering, virtual synchrony

Earlier versions of the Totem single-ring protocol appeared in the Proceedings of the IEEE International Conference on Information Engineering, Singapore (December 1991) and in the Proceedings of the IEEE 13th International Conference on Distributed Computing Systems, Pittsburgh, Pa. (May 1993). This research was supported by NSF Grant no. NCR-9016361, ARPA Contract no. N00174-93-K0097, and Rockwell CMC/State of California MICRO Grant no. 92-101.

Authors' addresses: Y. Amir, Computer Science Department, The Hebrew University of Jerusalem, Israel; email: yairamir@cs.huji.ac.il; L. E. Moser and P. M. Melliar-Smith, Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106; email: {moser, pmms}@ece.ucsb.edu; D. A. Agarwal, Lawrence Berkeley National Laboratory, Berkeley, CA 94720; email: deba@george.LBL.gov; P. Ciarfella, Cascade Communications Corporation, 5 Carlisle Road, Westford, MA 01886; email: ciarfella@alpo.cascade.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 1995 ACM 0734-2071/95/1100-0311 $03.50

ACM Transactions on Computer Systems, Vol. 13, No. 4, November 1995, Pages 311-342.

1. INTRODUCTION

Fault-tolerant distributed systems are becoming more important due to the increasing demand for more-reliable operation and improved performance. Maintaining the consistency of replicated data and coordinating the activities of cooperating processors present substantial problems, which are made more difficult by concurrency, asynchrony, fault tolerance, and real-time performance requirements. Existing fault-tolerant distributed systems that address these problems are difficult to program and are expensive in the number of messages broadcast and/or computations required. Recent protocols for fault-tolerant distributed systems [Amir et al. 1992b; Birman and van Renesse 1994; Kaashoek and Tanenbaum 1991; Melliar-Smith et al. 1990; Peterson et al. 1989] employ the idea of placing a partial or total order on broadcast messages to simplify the application programs and to reduce the communication and computation costs.

The Totem single-ring protocol supports high-performance, fault-tolerant distributed systems that must continue to operate despite network partitioning and remerging and despite processor failure and restart. Totem provides totally ordered message delivery with low overhead, high throughput, and low latency using a logical token-passing ring imposed on a broadcast domain.
The key to its high performance is an effective flow control mechanism. Totem also provides rapid detection of network partitioning and processor failure together with reconfiguration and membership services. Its novel mechanisms prevent delivery of messages in different orders in different components of a partitioned network and provide accurate information about which processors have delivered which messages. Earlier versions of the Totem single-ring protocol are described in Amir et al. [1993] and Melliar-Smith et al. [1991].

Programming the application is considerably simplified if messages are delivered in total order rather than only in causal order or if messages are delivered in causal order rather than only in FIFO order. In prior systems, delivery of messages in total order has been more expensive than delivery of messages in causal order, and delivery of messages in causal order has been more expensive than FIFO delivery. The Totem single-ring protocol can, however, deliver totally ordered messages with high throughput at no greater cost than causally ordered messages or, indeed, than reliable point-to-point FIFO messages. A total order on messages simplifies the application programming by reducing the risk of inconsistency when replicated data are updated and by resolving contention for shared resources within the system, such as the claiming of locks.

In Totem, messages are delivered in agreed order, which guarantees that processors deliver messages in a consistent total order and that, when a processor delivers a message, it has already delivered all prior messages that originated within its current configuration. Totem also provides delivery of messages in safe order, which guarantees additionally that, when a processor delivers a message, it has determined that every processor in the current configuration has received and will deliver the message unless that processor fails. Delivery of a message in agreed or safe order is requested by the originator of the message.

Delivery of messages in a consistent total order is not easy to achieve in distributed systems that are subject to processor failure and network partitioning. A failing processor, or a group of processors that have become isolated, may deliver messages in an order that is different from the order determined by other processors. As long as those processors remain failed or isolated, these inconsistencies are not apparent. However, as soon as a processor is repaired and readmitted to the system, or as soon as the components of a partitioned system are remerged, the inconsistencies in the message order may become manifest, and recovery may be difficult. The Totem protocol cannot guarantee that every processor is able to deliver every message, but it does guarantee that, if two processors deliver a message, they deliver the message in the same total order.

The application programs may also need to know about configuration changes. Different processors may learn of a configuration change at different times, but they must form consistent views of the configuration change and of the messages that precede or follow the configuration change. Birman devised the concept of virtual synchrony [Birman and van Renesse 1994], which ensures that processors deliver messages consistently in the event of processor fail-stop faults. We have generalized this concept to extended virtual synchrony [Moser et al. 1994a], which applies to systems in which the network can partition and remerge and in which processors can fail and restart with stable storage intact.

The Totem single-ring protocol is designed to operate over a single broadcast domain, such as an Ethernet. It uses Unix UDP, which provides a best-effort multicast service over such media.
Other media that provide a best-effort multicast service, such as ATM or the Internet MBone, can be used to construct the broadcast domain needed by Totem.

The software architecture of the Totem single-ring protocol is shown in Figure 1. The arrows on the left represent the passage of messages through the protocol hierarchy, while the arrows on the right represent Configuration Change messages and configuration installation. Using a logical token-passing ring imposed on the physical broadcast domain, the single-ring protocol provides reliable totally ordered message delivery and effective flow control. On detection of token loss, or on receiving a message from a processor not on its ring, a processor invokes the membership protocol to form a new ring using Join messages and a Commit token transmitted over the broadcast domain. The membership protocol activates the recovery protocol with the proposed configuration change. The recovery protocol uses the single-ring ordering protocol to recover missing messages. The processor then installs the new ring by delivering Configuration Change messages to the application.

[Fig. 1. The Totem single-ring protocol hierarchy. Components: Application; Reliable Delivery, Total Ordering, and Flow Control; Membership (token loss, foreign messages); Recovery (configuration changes, install); Best-Effort Broadcast Domain.]

2. RELATED WORK

Our work on the Totem protocol is based on our combined experience with two systems: the Trans and Total reliable, ordered broadcast and membership protocols [Melliar-Smith et al. 1990; Moser et al. 1994b] and the Transis group communication system [Amir et al. 1992a; 1992b]. The Trans protocol uses positive and negative acknowledgments piggybacked onto broadcast messages and exploits the transitivity of positive acknowledgments to reduce the number of acknowledgments required.
The Total protocol, layered on top of the Trans protocol, is a fault-tolerant total-ordering protocol that continues to order messages provided that a resiliency constraint is met. The membership protocol, layered on top of the Total protocol, ensures that each change in the membership occurs at the same logical time in each processor, corresponding to a position in the total order. The Totem protocol was developed to address the computational overhead of Trans and Total and is intended for local area networks with fast and highly reliable communication.

The Transis group communication system provides reliable, ordered group multicast and membership services. Transis based its ordering protocol initially on the Trans protocol but, more recently, has also included the Totem protocol for message ordering. Unlike other prior protocols, the Transis membership protocol supports remerging of a partitioned network and maintains a consensus view of the membership of each component, rather than a global consensus view of the entire system. The Totem membership protocol uses the idea, first proposed for Transis, that the membership can be reduced in size to ensure termination.

Chang and Maxemchuk [1984] described a reliable broadcast and ordering protocol that uses a token-based sequencer strategy. Unlike Totem, which requires that a processor must hold the token to broadcast a message, their protocol allows processors to broadcast messages at any time. The processor holding the token is responsible for broadcasting an acknowledgment message that includes a sequence number for each message acknowledged. A processor that has not received a message sends a negative acknowledgment to request retransmission by the processor that acknowledged the message.
While the latency is good at low loads, it increases at high loads and in the presence of a failed processor.

More closely related to Totem is the TPM protocol of Rajagopalan and McKinley [1989], which also uses a token to control broadcasting and sequencing of messages. The TPM protocol provides the safe delivery but not the agreed delivery that Totem provides. In the absence of processor failure and network partitioning, TPM requires on average two and one-half token rotations for safe delivery, whereas Totem requires two token rotations. In the event of network partitioning, only the component containing a majority of the processors continues to operate; processors in the other components block. In contrast, Totem handles network partitioning and remerging by allowing each component of a partitioned system to continue operating, not just the component that contains a majority of the processors.

Birman's Isis system [Birman and van Renesse 1994] and more recently the Horus system [van Renesse et al. 1994] have focused on process groups and the application program interface. Isis provides BCAST or unordered messages, CBCAST or causally ordered messages, and ABCAST or totally ordered messages. A vector clock strategy is used to ensure causal ordering, and a token-based sequencer, similar to that of Chang and Maxemchuk [1984], is used to provide total ordering. Isis introduced the important idea of virtual synchrony in which configuration change messages are ordered relative to other messages so that a consistent view of the system is maintained as the system changes dynamically. The user interfaces of both Transis and Totem were inspired by the Isis user interface.

The Psync protocol of Peterson et al. [1989] constructs a partial order on messages that can be converted into a total order. Isis, Trans, and Transis employ a similar strategy. In contrast, Totem constructs a total order on messages directly without constructing a partial order first. Mishra et al.
[1991] have developed a membership protocol based on the partial order of Psync.

Kaashoek and Tanenbaum [1991] describe group communication in the Amoeba distributed operating system. One processor, called the sequencer, is responsible for placing a total order on messages. Processors send point-to-point messages to the sequencer, which assigns sequence numbers to messages and broadcasts them to the other processors. Messages are recovered by sending a request to the sequencer for retransmission. Group membership functions are also provided. Performance is excellent for very short messages but deteriorates for long messages and if high resilience to processor failure and network partitioning is required.

The excellent performance of the Totem single-ring protocol is achieved using a flow control strategy that limits message loss due to buffer overflow at the receivers. Related to this flow control strategy are the sliding-window strategy, the FDDI token rotation time limit, and the flow control mechanism of Transis. The use of a token in combination with a window for flow control on a broadcast medium works much better than either mechanism in isolation and, as far as we know, has not been investigated in prior work.

3. THE MODEL

We consider a distributed system built on a broadcast domain consisting of a finite number of processors that communicate by broadcasting messages. We use the term originate to refer to the first broadcast of a message generated by the application. Each broadcast of a message is received immediately or not at all by a processor in the broadcast domain, and thus messages may have to be retransmitted to achieve reliable delivery. A processor receives all of its own broadcast messages. The broadcast domain may become partitioned so that processors in one component of the partitioned system are unable to communicate with processors in another component.
Communication between separated components can subsequently be reestablished. Each processor within the system has a unique identifier and stable storage. Processors can incur fail-stop, timing, or omission faults. A processor that is excessively slow, or that fails to receive a message an excessive number of times, can be regarded as having failed. Failed processors can be repaired and are reconfigured into the system when they restart. If a processor fails and restarts, its identifier does not change, and all or part of its state may have been retained in stable storage. There are no malicious faults, such as faults in which processors generate incorrect messages or in which the communication medium undetectably modifies messages in transit.

Imposed on the broadcast domain is a logical token-passing ring. The token is a special message transmitted point-to-point. The token may be lost by not being received by a processor on the ring. Each ring has a representative, chosen deterministically from the membership when the ring is formed, and an identifier that consists of a ring sequence number and the identifier of the representative. To ensure that ring sequence numbers, and hence ring identifiers, are unique, each processor records its ring sequence number in stable storage.

We use the term ring to refer to the infrastructure of Totem and the term configuration to represent the view provided to the application. The membership of a configuration is a set of processor identifiers. The minimum configuration for a processor consists of the processor itself. A regular configuration has the same membership and identifier as its corresponding ring. A transitional configuration consists of processors that are members of a new ring coming directly from the same old ring; it has an identifier that consists of a "ring" sequence number and the identifier of the representative.

We distinguish between the terms "receive" and "deliver," as follows.
A processor receives messages that were broadcast by processors in the broadcast domain, and a processor delivers messages in total order to the application. Two types of messages are delivered to the application. Regular messages are generated by the application for delivery to the application. Configuration Change messages are generated by the processors for delivery to the application, without being broadcast, to terminate one configuration and to initiate another. The identifiers of the regular and Configuration Change messages consist of configuration identifiers and message sequence numbers.

We define a causal order that is a modification of Lamport's causal order [Lamport 1978] in that it applies to messages rather than events and is constrained to messages that originated within a single configuration. This allows remerging of a partitioned network and joining of a failed processor without requiring all messages in the history to be delivered. Causal and total orders¹ are defined on sets of messages, as follows:

Causal Order for Configuration C. The reflexive transitive closure of the "precedes" relation, which is defined for all processors p that are members of C as follows:

—Message m1 precedes message m2 if processor p originated m1 in configuration C before p originated m2 in C.
—Message m1 precedes message m2 if processor p originated m2 in configuration C and p delivered m1 in C before originating m2.

The causal order is assumed to be antisymmetric and, thus, is a partial order.²

Delivery Order for Configuration C. The reflexive transitive closure of the "precedes" relation, which is defined on the union over all processors p in C of the sets of regular messages delivered in C by p as follows:

—Message m1 precedes message m2 if processor p delivers m1 in C before p delivers m2 in C.
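The two "precedes" rules of the causal order can be made concrete in code. The sketch below is ours, not the paper's: it derives the causal "precedes" edges for one configuration from per-processor event logs, where each event is ("originate", m) or ("deliver", m) for a message identifier m.

```python
# Sketch (ours, not the paper's): the causal "precedes" edges for one
# configuration, computed from per-processor event logs.
def causal_precedes_edges(event_logs):
    """event_logs: {processor: [("originate", m) or ("deliver", m), ...]}"""
    edges = set()
    for events in event_logs.values():
        originated = []   # messages this processor has originated so far
        delivered = []    # messages this processor has delivered so far
        for kind, m in events:
            if kind == "originate":
                # Rule 1: earlier originations by p precede m.
                edges.update((m1, m) for m1 in originated)
                # Rule 2: messages p delivered before originating m precede m.
                edges.update((m1, m) for m1 in delivered)
                originated.append(m)
            else:
                delivered.append(m)
    return edges
```

For a processor that originates a, delivers x, and then originates b, the sketch yields the edges (a, b) and (x, b); the causal order itself is the reflexive transitive closure of these edges.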
Note that some processors in configuration C may not deliver all messages of the delivery order for configuration C.

Global Delivery Order. The reflexive transitive closure of the union of the configuration delivery orders for all configurations and of the "precedes" relation, which is defined on the set of Configuration Change messages and regular messages as follows:

—For each processor p and each configuration C of which p is a member, the Configuration Change message delivered by p that initiates C precedes every message m delivered by p in C.
—For each processor p and each configuration C of which p is a member, every message m delivered by p in C precedes the Configuration Change message delivered by p that terminates C.

¹A total order on a set S is a relation ≤ defined on S that satisfies the reflexive (x ≤ x), transitive (if x ≤ y and y ≤ z, then x ≤ z), antisymmetric (if x ≤ y and y ≤ x, then x = y), and comparable (x ≤ y or y ≤ x) properties. A partial order on a set S is a relation defined on S that satisfies the reflexive, transitive, and antisymmetric properties. To adhere to standard mathematical practice in which partial and total orders are reflexive, "before" must be regarded as nonstrict, i.e., an event occurs before itself.
²This property cannot be proved. That the physical world is antisymmetric must be an assumption.

In Amir et al. [1994] we prove that the delivery order for configuration C is a total order and that the global delivery order is a total order.

4. SERVICES

The objective of the Totem single-ring protocol is to provide the application with reliable, totally ordered message delivery and membership services, as defined below.

4.1 Membership Services

The Totem membership protocol provides the following properties:

Uniqueness of Configurations. Each configuration identifier is unique; moreover, at any time, a processor is a member of at most one configuration.
Consensus. All of the processors that install a configuration determine that the members of the configuration have reached consensus on the membership.³

Termination. If a configuration ceases to exist for any reason, such as processor failure or network partitioning, then every processor of that configuration either installs a new configuration by delivering a Configuration Change message or fails before doing so. The Configuration Change message contains the identifier of the configuration it terminates, the identifier of the configuration it initiates, and the membership of the configuration it initiates.

Configuration Change Consistency. If processors p and q install configuration C2 directly after C1, then p and q both deliver the same Configuration Change message to terminate C1 and initiate C2.

4.2 Reliable, Ordered Delivery Services

The Totem total-ordering protocol provides the following properties, which hold for all configurations C and for all processors p in C:

Reliable Delivery for Configuration C.
—Each message m has a unique identifier.
—If processor p delivers message m, then p delivers m once only. Moreover, if processor p delivers two different messages, then p delivers one of those messages strictly before it delivers the other.
—If processor p originates message m, then p will deliver m or will fail before delivering a Configuration Change message to install a new regular configuration.
—If processor p is a member of regular configuration C, and no configuration change ever occurs, then p will deliver in C all messages originated in C.

³This does not violate the impossibility result of Fischer et al. [1985] because the membership protocol allows the membership to decrease in order to reach consensus.
—If processor p delivers message m originated in configuration C, then p is a member of C, and p has installed C. Moreover, p delivers m in C or in a transitional configuration between C and the next regular configuration it installs.
—If processors p and q are both members of consecutive configurations C1 and C2, then p and q deliver the same set of messages in C1 before delivering the Configuration Change message that terminates C1 and initiates C2.

Reliable delivery defines the basic requirements on message delivery—in particular, which messages a processor must deliver within a configuration.

Delivery in Causal Order for Configuration C.
—Reliable delivery for configuration C.
—If processor p delivers messages m1 and m2, and m1 precedes m2 in the causal order for configuration C, then p delivers m1 before p delivers m2.

Causal delivery imposes an ordering constraint to ensure that the delivery order respects Lamport's causality within a configuration.

Delivery in Agreed Order for Configuration C.
—Delivery in causal order for configuration C.
—If processor p delivers message m2 in configuration C, and m1 is any message that precedes m2 in the delivery order for configuration C, then p delivers m1 in C before p delivers m2.

Agreed delivery requires that all processors deliver messages within a configuration in the same total order. Moreover, when a processor delivers a message, it must have delivered all preceding messages in the total order for the configuration.

Delivery in Safe Order for Configuration C.
—Delivery in agreed order for configuration C.
—If processor p delivers message m in regular configuration C in safe order, then every member of C has installed C.
—If processor p delivers message m in configuration C, and the originator of m requested safe delivery, then p has determined that each processor in C has received m and will deliver m or will fail before installing a new regular configuration.

Delivery of a message in safe order requires that a processor has determined that all of the processors in the configuration have received the message. This determination is typically based on acknowledgments from the processors indicating that they have received the message and all of its predecessors in the total order. Once a processor has acknowledged receipt of a safe message, it is required to deliver the message unless it fails. Note that this requirement does not guarantee that a processor delivers the message in the same configuration as all of the other processors. Totem uses a Configuration Change message to notify the application of the membership of the configuration within which delivery is guaranteed as safe.

Extended Virtual Synchrony.
—Delivery in agreed or safe order as requested by the originator of the message.
—If processor p delivers messages m1 and m2, and m1 precedes m2 in the global delivery order, then p delivers m1 before p delivers m2.

Virtual synchrony was devised by Birman [Birman and van Renesse 1994] to ensure that view (configuration) changes occur at the same point in the message delivery history for all operational processors. Processors that are members of two successive views must deliver exactly the same set of messages in the first view. A failed processor that recovers can only be readmitted to the system as a new processor. Thus, failed processors are not constrained as to the messages they deliver or their order, and messages delivered by a failed processor have no effect on the system.
If the system partitions, only processors in one component, the primary component, continue to operate; all of the other processors are deemed to have failed. Extended virtual synchrony extends the concept of virtual synchrony to systems in which all components of a partitioned system continue to operate and can subsequently remerge and to systems in which failed processors can be repaired and can rejoin the system with stable storage intact. Two processors may deliver different sets of messages when one of them has failed or when they are members of different components, but they must not deliver messages inconsistently. In particular, if processor p delivers message m1 before p delivers message m2, then processor q must not deliver message m2 before q delivers message m1.

Extended virtual synchrony requires that the properties of delivery in agreed and safe order must be satisfied. If processor p delivers message m as safe in configuration C, then every processor in C has received m and will deliver m before it installs a new regular configuration, unless that processor fails. This is achieved by installing a transitional configuration with a reduced membership, within which any remaining messages from the prior configuration are delivered, while honoring the agreed and safe delivery guarantees. Thus, Totem delivers two Configuration Change messages, the first to introduce a smaller transitional configuration and the second to introduce the new regular configuration. In Moser et al. [1994a] we demonstrate that virtual synchrony can be implemented on top of the more-general extended virtual synchrony property provided by Totem.

Proofs of correctness for the Totem single-ring protocol based on the service properties defined above can be found in Amir et al. [1994]. Note, however, that a definition based on pairwise delivery of messages does not suffice. We must ensure that the global delivery order has no cycles of any length.
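The closing point, that pairwise consistency does not suffice and the delivery order must have no cycles of any length, can be illustrated with a small checker. This sketch is ours, not the paper's: the "precedes" edges drawn from per-processor delivery logs yield an antisymmetric (and hence consistent) transitive closure exactly when their graph is acyclic.

```python
# Sketch (ours): delivery logs are consistent iff the union of their
# "precedes" edges is acyclic, i.e., the transitive closure is antisymmetric.
def delivery_precedes_edges(logs):
    """logs: {processor: [message ids in delivery order]}."""
    edges = set()
    for seq in logs.values():
        for i, m1 in enumerate(seq):
            edges.update((m1, m2) for m2 in seq[i + 1:])
    return edges

def is_consistent(logs):
    graph = {}
    for a, b in delivery_precedes_edges(logs):
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def acyclic_from(u):
        # Depth-first search; a back edge to a node on the stack is a cycle.
        visiting.add(u)
        for v in graph.get(u, ()):
            if v in visiting or (v not in done and not acyclic_from(v)):
                return False
        visiting.discard(u)
        done.add(u)
        return True

    return all(acyclic_from(u) for u in list(graph) if u not in done)
```

The logs {'p': ['a', 'b'], 'q': ['b', 'c'], 'r': ['c', 'a']} show why pairwise checks are not enough: no two logs deliver the same pair of messages in opposite orders, yet together they form the cycle a, b, c, a, so no global total order exists and the checker rejects them.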
5. THE TOTAL-ORDERING PROTOCOL

The Totem single-ring ordering protocol provides agreed and safe delivery of messages within a broadcast domain. Imposed on the broadcast domain is a logical token-passing ring. The token controls access to the ring; only the processor in possession of the token can broadcast a message. A processor can broadcast more than one message for each visit of the token, subject to the constraints imposed by the flow control mechanisms described in Section 8. When no processor has a message to broadcast, the token continues to circulate. Each processor has a set of input buffers in which it stores incoming messages. The flow control mechanisms avoid overflow of these input buffers.

Each message header contains a sequence number derived from a field of the token; thus, there is a single sequence of message sequence numbers for all processors on the ring. Delivery of messages in sequence number order is agreed delivery. Safe delivery uses an additional field of the token, the aru field, to determine when all processors on the ring have received a message.

We now describe the Totem single-ring ordering protocol with the assumptions that the token is never lost, that processor failures do not occur, and that the network does not become partitioned; however, messages may be lost. In Section 6 we relax these assumptions and extend the protocol to handle token loss, processor failure and restart, and network partitioning and remerging.

5.1 The Data Structures

Regular Message. Each regular message contains the following fields:
—sender_id: The identifier of the processor originating the message.
—ring_id: The identifier of the ring on which the message was originated, consisting of a ring sequence number and the representative's identifier.
—seq: A message sequence number.
—conf_id: 0.
—contents: The contents of the message.
The ring_id, seq, and conf_id fields comprise the identifier of the message.

Regular Token. To broadcast a message on the ring, a processor must hold the regular token, also referred to as the token. The token contains the following fields:
—type: Regular.
—ring_id: The identifier of the ring on which the token is circulating, consisting of a ring sequence number and the representative's identifier.
—token_seq: A sequence number which allows recognition of redundant copies of the token.
—seq: The largest sequence number of any message that has been broadcast on the ring, i.e., a high-water mark.
—aru: A sequence number (all-received-up-to) used to determine if all processors on the ring have received all messages with sequence numbers less than or equal to this sequence number, i.e., a low-water mark.
—aru_id: The identifier of the processor that set the aru to a value less than the seq.
—rtr: A retransmission request list, containing one or more retransmission requests.

The seq field of the token provides a single total order of messages for all processors on the ring. The aru field is the basic acknowledgment mechanism that determines if a message can be delivered as safe.

Local Variables. Each processor maintains several local variables, including:
—my_token_seq: The value of the token_seq when the processor forwarded the token last.
—my_aru: The sequence number of a message such that the processor has received all messages with sequence numbers less than or equal to this sequence number.
—my_aru_count: The number of times that the processor has received the token with an unchanged aru and with the aru not equal to seq.
—new_message_queue: The queue of messages originated by the application waiting to be broadcast.
—received_message_queue: The queue of messages received from the communication medium waiting to be delivered to the application.
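The fields above can be collected into a sketch of the data structures. The field names follow the paper; the concrete types and default values are our assumptions.

```python
# Sketch of the Totem single-ring data structures (Section 5.1).
# Field names follow the paper; types and defaults are our assumptions.
from dataclasses import dataclass, field
from collections import deque
from typing import List, Optional, Tuple

@dataclass
class RegularMessage:
    sender_id: int
    ring_id: Tuple[int, int]   # (ring sequence number, representative id)
    seq: int                   # message sequence number
    conf_id: int = 0           # 0 for regular messages
    contents: bytes = b""

@dataclass
class RegularToken:
    ring_id: Tuple[int, int]
    token_seq: int             # recognizes redundant copies of the token
    seq: int                   # high-water mark: largest seq broadcast
    aru: int                   # low-water mark: all-received-up-to
    aru_id: Optional[int]      # who set aru below seq (None = null value)
    rtr: List[int] = field(default_factory=list)  # retransmission requests
    type: str = "Regular"

@dataclass
class ProcessorState:
    my_token_seq: int = -1     # token_seq when the token was last forwarded
    my_aru: int = 0            # all messages <= my_aru have been received
    my_aru_count: int = 0      # rotations with unchanged aru != seq
    new_message_queue: deque = field(default_factory=deque)
    received_message_queue: deque = field(default_factory=deque)
```

The identifier of a message is then the triple (ring_id, seq, conf_id), as stated above.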
A processor updates my_token_seq and my_aru_count as it receives tokens and updates my_aru as it receives messages. When it transmits a message, the processor transfers the message from new_message_queue to received_message_queue. When it determines that a message has become safe, the processor no longer needs to retain the message for future retransmission and, thus, can discard the message from received_message_queue.

5.2 The Protocol

On receipt of the token, a processor empties its input buffer completely, either delivering the messages or retaining them until they can be delivered in order. It then broadcasts requested retransmissions and new messages, updates the token, and transmits the token to the next processor on the ring. For each new message that it broadcasts, the processor increments the seq field of the token and sets the sequence number of the message to this value.

Each time a processor receives the token, it compares the aru field of the token with my_aru. If my_aru is smaller, the processor replaces the aru with my_aru and sets the aru_id field of the token to its identifier. If the aru_id equals the processor's identifier, it sets the aru to my_aru. (In this case, the processor had set the aru on the last visit of the token, and no other processor changed the aru during the token rotation.) Whenever the seq and the aru are equal, the processor increments aru and my_aru in step with seq and sets the aru_id to a null value (a value that is not the id of any processor).

If the seq field of the token is greater than its my_aru, the processor has not received all of the messages that have been broadcast on the ring, so it augments the rtr field of the token with the missed messages. If the processor has received messages that appear in the rtr field, it retransmits those messages before broadcasting new messages.
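The aru-update rule above can be sketched as follows; this is our own minimal rendering (the helper name update_aru and the dict representation of the token are ours), following the Operational-state pseudocode of Figure 3.

```python
INVALID = None  # a null value that is not the id of any processor

def update_aru(token, my_id, my_aru):
    """Update the aru/aru_id fields of a token on each visit.

    token: dict with 'seq', 'aru', 'aru_id'. Mutates and returns it.
    """
    if my_aru < token['aru'] or token['aru_id'] == my_id or token['aru_id'] is INVALID:
        token['aru'] = my_aru
        # if the aru has caught up with seq, clear aru_id to the null value
        token['aru_id'] = INVALID if token['aru'] == token['seq'] else my_id
    return token
```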
When it retransmits a message, the processor removes the sequence number of the message from the rtr field.

If a processor has received a message m and has delivered every message with sequence number less than that of m, and if the originator of m requested agreed delivery, then the processor delivers m in agreed order. If, in addition, the processor forwards the token with the aru field greater than or equal to the sequence number of m on two successive rotations and if the originator of m requested safe delivery, then m is safe, and the processor delivers m in safe order.

The total-ordering protocol is, of course, unable to continue when the token is lost; a token retransmission mechanism has been implemented to reduce the probability of token loss. Each time a processor forwards the token, it sets a Token Retransmission timeout. If a processor receives a regular message or the token, it cancels the Token Retransmission timeout. On a Token Retransmission timeout, the processor retransmits the token to the next processor on the ring and then resets the timeout. The token_seq field of the token provides recognition of redundant tokens. A processor accepts the token only if the token_seq field is greater than my_token_seq; otherwise, the token is discarded as redundant. If the token is accepted, the processor increments token_seq and sets my_token_seq to the new value of token_seq. Token retransmission increases the probability that the token will be received at the next processor on the ring and incurs minimal overhead. The membership protocol described in the next section handles the loss of all copies of the token.

6. THE MEMBERSHIP PROTOCOL

The Totem single-ring ordering protocol is optimized for high performance under failure-free conditions but depends on a membership protocol to resolve processor failure, network partitioning, and loss of all copies of the token.
The membership protocol detects such failures and reconstructs a new ring on which the total-ordering protocol can resume operation. The objective of the membership protocol is to ensure consensus, in that every member of the configuration agrees on the membership of the configuration, and termination, in that every processor installs some configuration with an agreed membership within a bounded time unless it fails within that time. The membership protocol also generates a new token and recovers messages that had not been received by some of the processors when the failure occurred.

6.1 The Data Structures

Join Message. A Join message contains two items: a set of identifiers of processors being considered for membership in the new ring by the processor broadcasting the Join message and a set of identifiers of processors that it regards as having failed. These are contained in the proc_set and fail_set fields of the Join message defined below:

—type: Join.
—sender_id: The processor identifier of the sender.
—proc_set: The set of identifiers of processors that the sender is considering for membership in a new ring.
—fail_set: The set of identifiers of processors that the sender has determined to have failed.
—ring_seq: The largest ring sequence number of a ring_id known to the sender.

Join messages differ from regular messages in that a processor may broadcast a Join message without holding the token; moreover, Join messages are not retransmitted or delivered to the application. When a processor broadcasts a Join message, it is trying to achieve consensus on the proc_set and fail_set in the Join message. The fail_set is a subset of the proc_set. The proc_set and fail_set can only increase until a new ring is installed.
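How a receiver folds an incoming Join message into its own sets can be sketched as follows; the helper name merge_join and the dict representation are ours. The sketch follows the Gather-state rule of Figure 4: the sets only grow, and a sender that claims the receiver itself has failed is instead marked failed by the receiver.

```python
def merge_join(my_proc_set, my_fail_set, my_id, msg):
    """Fold a received Join message into this processor's sets.

    msg: dict with 'sender_id', 'proc_set', 'fail_set'.
    Returns the new (proc_set, fail_set); both can only grow.
    """
    proc_set = my_proc_set | msg['proc_set']
    if my_id in msg['fail_set']:
        # the sender believes we failed: treat the sender as failed instead
        fail_set = my_fail_set | {msg['sender_id']}
    else:
        fail_set = my_fail_set | msg['fail_set']
    return proc_set, fail_set
```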
The ring_seq field allows the receiver of a Join message to determine if the sender has abandoned a past round of consensus and is now attempting to form a new membership. It is also used to create unique ring identifiers.

Configuration Change Message. The membership protocol also uses another special type of message, the Configuration Change message, which contains the following fields:

—ring_id: The identifier of the regular configuration if this message initiates a regular configuration or the identifier of the preceding regular configuration if this message initiates a transitional configuration.
—seq: 0, if this message initiates a regular configuration, or the largest sequence number of a message delivered in the preceding regular configuration if this message initiates a transitional configuration.
—conf_id: The identifier of the old transitional configuration from which the processor is transitioning if this message initiates a regular configuration or the identifier of the transitional configuration to which the processor is transitioning if this message initiates a transitional configuration.
—memb: The membership of the configuration that this message initiates.

The ring_id, seq, and conf_id fields comprise the identifier of the message. A Configuration Change message may describe a change from an old configuration to a transitional configuration or from a transitional configuration to a new configuration. Configuration Change messages differ from regular messages in that they are generated locally at each processor and are delivered directly to the application without being broadcast.

Commit Token. Each new ring is initiated by one of its members, the representative, which is a processor chosen deterministically from the members of the ring.
The representative generates a Commit token that differs from the regular token in that its type field is set to Commit, and it contains the following fields in place of the rtr field:

—memb_list: A list containing a processor identifier, old ring ring_id, old ring my_aru, high_delivered, and received_flg fields for each member of the new ring.
—memb_index: The index of the processor in memb_list that last forwarded the Commit token.

For each processor identifier in memb_list, the high_delivered field is the largest sequence number of a message that the processor has delivered on the old ring. The received_flg field indicates that the processor has already received all of the messages possessed by other processors in its transitional configuration. On the first rotation of the Commit token around the new ring, each processor sets its old ring ring_id, old ring my_aru, high_delivered, and received_flg fields. It also updates memb_index. The remaining fields are set by the representative when it creates the Commit token.

Local Variables. Each processor maintains several local variables, including:

—my_ring_id: The ring identifier in the most recently accepted Commit token.
—my_memb: The set of identifiers of processors on the processor's current ring.
—my_new_memb: The set of identifiers of processors on the processor's new ring.
—my_proc_set: The set of identifiers of processors that the processor is considering for membership of a new ring.
—my_fail_set: The set of identifiers of processors that the processor has determined to have failed.
—consensus: A boolean array indexed by processors and indicating whether each processor is committed to the processor's my_proc_set and my_fail_set.

Stable storage is required to store a processor's ring sequence number, my_ring_id.seq. This stable storage is read only when a processor recovers from a failure and is written when a configuration change occurs.
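The Commit token and its per-member memb_list entries could be rendered, again for illustration only, as Python dataclasses; the class names are ours and the fields follow the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MembEntry:
    proc_id: int
    old_ring_id: Optional[Tuple[int, int]] = None
    old_ring_aru: Optional[int] = None    # my_aru on the old ring
    high_delivered: Optional[int] = None  # largest seq delivered on the old ring
    received_flg: bool = False            # has all messages of its transitional config.

@dataclass
class CommitToken:
    ring_id: Tuple[int, int]
    token_seq: int
    seq: int
    memb_list: List[MembEntry] = field(default_factory=list)
    memb_index: int = -1                  # member that last forwarded the token
    type: str = "Commit"
```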
Fig. 2. The finite-state machine for the membership protocol.

6.2 The Protocol

The membership protocol can be described by a finite-state machine with seven events and four states, as illustrated in Figure 2.

6.2.1 The Seven Events of the Membership Protocol.

Receiving a Foreign Message. Such a message was broadcast by a processor that is not a member of the receiving processor's ring and activates the membership protocol in the receiving processor.

Receiving a Join Message. This informs the receiver of the sender's proposed membership and may cause the receiver to enlarge its my_proc_set or my_fail_set.

Receiving a Commit Token. On the first reception of the Commit token, a member of the proposed new ring updates the Commit token. On the second reception, it obtains the updated information that the other members have supplied.

Token Loss Timeout. This timeout (1) indicates that a processor did not receive the token or a regular message within the required amount of time and (2) activates the membership protocol.

Token Retransmission Timeout. This timeout indicates that a processor should retransmit the token because it has not received the token or a regular message broadcast by another processor on the ring.

Join Timeout. This timeout is used to determine the interval after which a Join message is rebroadcast in the Gather or Commit states.

Consensus Timeout. This timeout indicates that a processor participating in the formation of a new ring failed to reach consensus in the required amount of time.

Recognizing Failure to Receive.
If the aru field has not advanced in several rotations of the token, a processor determines that the processor that set the aru has repeatedly failed to receive a message.

6.2.2 The Four States of the Membership Protocol.

Operational State. In the Operational state (Figure 3), messages are broadcast and delivered in agreed or safe order, as requested by the originator of the message. Since processor failure and network partitioning result in loss of the token, the mechanism for detecting failures is the Token Loss timeout. When the Token Loss timeout expires or when a processor receives a Join message or a foreign message, the processor invokes the protocol for the formation of a new ring and shifts to the Gather state (Figure 7).

A processor buffers a message for retransmission until the message has been acknowledged by the other processors on the ring. If a processor fails repeatedly to receive a particular message, then the other processors buffer that message and all subsequent messages until that message is received. Consequently, a processor cannot be allowed to fail to receive messages indefinitely. When its local variable my_aru_count reaches a predetermined constant, a processor determines that some other processor has failed to receive, namely, the processor whose identifier is in the aru_id field of the token. It then includes that processor's identifier in its my_fail_set, shifts to the Gather state, and broadcasts a Join message.

Gather State. In the Gather state (Figure 4), a processor collects information about operational processors and failed processors and broadcasts that information in Join messages. When a processor receives a Join message, the processor updates its my_proc_set and my_fail_set. If its my_proc_set and my_fail_set have changed, the processor abandons its previous membership, broadcasts a Join message containing the updated sets, and resets the Join and Consensus timeouts.
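The fail-to-receive check described for the Operational state can be sketched as follows; the helper name is ours, and FAIL_TO_RCV_CONST stands in for the paper's unnamed "predetermined constant" (its value here is arbitrary).

```python
FAIL_TO_RCV_CONST = 5  # arbitrary stand-in for the predetermined constant

def on_token_aru(aru_unchanged, aru_id, my_id, my_aru_count, my_fail_set):
    """Update my_aru_count on each token visit.

    Returns (new my_aru_count, True if the processor must add the failing
    processor to my_fail_set and shift to the Gather state).
    """
    my_aru_count = my_aru_count + 1 if aru_unchanged else 0
    if my_aru_count > FAIL_TO_RCV_CONST and aru_id is not None and aru_id != my_id:
        my_fail_set.add(aru_id)  # the processor named in aru_id failed to receive
        return my_aru_count, True
    return my_aru_count, False
```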
The Join timeout is shorter than the Consensus timeout and is used to increase the probability that Join messages from all currently operational processors are received during a single round of consensus.

A processor reaches consensus when it has received Join messages with proc_set and fail_set equal to its my_proc_set and my_fail_set, respectively, from every processor in the difference of those sets, i.e., my_proc_set − my_fail_set. It then no longer accepts incoming Join messages. A processor is also considered to have reached consensus when it has received a Commit token with the same membership as my_proc_set − my_fail_set. The processors in that difference constitute the membership of the proposed new ring.

Regular token received:
    if token.ring_id ≠ my_ring_id or token.token_seq ≤ my_token_seq then
        discard token
    else
        determine how many messages I'm allowed to broadcast by flow control
        update retransmission requests
        broadcast requested retransmissions
        subtract retransmissions from allowed to broadcast
        for as many messages as allowed to broadcast do
            get message from new_message_queue
            increment token.seq
            set message header fields and broadcast message
            update my_aru
        if my_aru < token.aru or my_id = token.aru_id or token.aru_id = invalid then
            token.aru := my_aru
            if token.aru = token.seq then token.aru_id := invalid
            else token.aru_id := my_id
        if token.aru = aru in token on last rotation and token.aru_id ≠ invalid then
            increment my_aru_count
        else my_aru_count := 0
        if my_aru_count > fail_to_rcv_const and token.aru_id ≠ my_id then
            add token.aru_id to my_fail_set
            call Shift_to_Gather
        else
            update token.rtr and token flow control fields
            increment token.token_seq
            forward token
            reset Token Loss and Token Retransmission timeouts
            deliver messages that satisfy their delivery criteria

Regular message received:
    cancel Token Retransmission timeout if set
    add message to received_message_queue
    update retransmission request list
    update my_aru
    deliver messages that satisfy their delivery criteria

Token Loss timeout expired:
    call Shift_to_Gather

Token Retransmission timeout expired:
    retransmit token
    reset Token Retransmission timeout

Foreign message from processor q received:
    add message.sender_id to my_proc_set
    call Shift_to_Gather

Join message from processor q received:
    same as in Gather state except call Shift_to_Gather regardless of the Join message's content

Commit token received:
    discard the Commit token

Fig. 3. The pseudocode executed by a processor in the Operational state.

Regular token or regular message received:
    same as in Operational state

Foreign message from processor q received:
    if q not in my_proc_set then
        add message.sender_id to my_proc_set
        call Shift_to_Gather

Join message from processor q received:
    if my_proc_set = message.proc_set and my_fail_set = message.fail_set then
        consensus[q] := true
        if for all r in my_proc_set − my_fail_set consensus[r] = true
                and my_id = smallest id of my_proc_set − my_fail_set then
            token.ring_id.seq := (maximum of my_ring_id.seq and Join ring_seqs) + 4
            token.memb := my_proc_set − my_fail_set
            call Shift_to_Commit
        else return
    else if message.proc_set subset of my_proc_set and message.fail_set subset of my_fail_set then
        return
    else if q in my_fail_set then
        return
    else
        merge message.proc_set into my_proc_set
        if my_id in message.fail_set then
            add message.sender_id to my_fail_set
        else merge message.fail_set into my_fail_set
        call Shift_to_Gather

Commit token received:
    if my_proc_set − my_fail_set = token.memb and token.seq > my_ring_id.seq then
        call Shift_to_Commit

Join timeout expired:
    broadcast Join message with my_proc_set, my_fail_set, ring_seq = my_ring_id.seq
    set Join timeout

Consensus timeout expired:
    if consensus not reached then
        for all r such that consensus[r] ≠ true do add r to my_fail_set
        call Shift_to_Gather
    else
        for all r do consensus[r] := false
        consensus[my_id] := true
        set Token Loss timeout

Token Loss timeout expired:
    execute code for Consensus timeout expired in Gather state
    call Shift_to_Gather

Fig. 4. The pseudocode executed by a processor in the Gather state.

If the Consensus timeout expires before a processor has reached consensus, it adds to my_fail_set all of the processors in my_proc_set from which it has not received a Join message with proc_set and fail_set equal to its own sets, returns to the Gather state, and tries to reach consensus again by broadcasting Join messages.

When a processor has reached consensus, it determines whether it has the lowest processor identifier in the membership and, thus, is the representative of the proposed new ring. If it is the representative, the processor generates a Commit token. It determines the ring_id of the new ring, which is composed of a ring sequence number equal to four plus the maximum of the ring sequence numbers in any of the Join messages used to reach consensus and its own ring sequence number. (The sequence number which is two less than that of the new ring is used as the transitional configuration identifier.) The representative also determines the memb_list of the Commit token, which specifies the membership of the new ring and the order in which the token will circulate, with the representative placed first. It then transmits the Commit token and shifts to the Commit state (Figure 7). When a processor other than the representative has reached consensus, if it has not received the Commit token, the processor sets the Token Loss timeout, cancels the Consensus timeout, and continues in the Gather state waiting for the Commit token.
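The representative's ring-identifier computation (four plus the maximum ring sequence number seen, with the transitional configuration's sequence number two less) can be sketched as follows; the function and parameter names are ours.

```python
def new_ring_id(my_ring_seq, join_ring_seqs, representative_id):
    """Compute the new ring_id chosen by the representative.

    New ring seq = 4 + max over its own ring seq and those in the Join
    messages used to reach consensus. Also returns the transitional
    configuration's sequence number, which is two less.
    """
    ring_seq = max([my_ring_seq] + list(join_ring_seqs)) + 4
    return (ring_seq, representative_id), ring_seq - 2
```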
If the Token Loss timeout expires, the processor returns to the Gather state and tries to reach consensus again. On receiving the Commit token, the processor compares the proposed membership, given by the memb_list field of the Commit token, with my_proc_set − my_fail_set. If they differ, the processor discards the Commit token, returns to the Gather state, and repeats the attempt to form a new ring. If they agree, the processor extracts the ring_id for the new ring from the Commit token, sets the fields in its entry of memb_list, increments the memb_index field, and shifts to the Commit state.

Commit State. In the Commit state (Figure 5), the first rotation of the Commit token around the proposed new ring confirms that all members in the memb_list of the Commit token are committed to the membership. It also collects information needed to determine correct handling of the messages from the old ring that still required retransmission when the membership protocol was invoked. The second rotation of the Commit token disseminates the information collected in the first rotation. On receiving the Commit token for the second time, a processor shifts to the Recovery state (Figure 7) and writes the sequence number, my_ring_id.seq, for its new ring into stable storage.

Recovery State. In the Recovery state (Figure 6), when the representative receives the Commit token after its second rotation, it converts the Commit token into the regular token for the new ring, replacing the memb_list and memb_index fields by the rtr field. At this point, the new ring is formed but not yet installed, and the execution of the recovery protocol begins. The processors use the new ring to retransmit messages from their old rings that must be exchanged to maintain agreed and safe delivery guarantees. In one atomic action, each processor delivers the exchanged messages to
the application along with Configuration Change messages, installs the new ring, and shifts to the Operational state (Figure 7). The recovery protocol is described in more detail in Section 7.

Regular token received:
    discard token

Regular message received:
    same as in Operational state

Foreign message received:
    discard message

Join message from processor q received:
    if q in my_new_memb and message.ring_seq > my_ring_id.seq then
        execute code for receipt of Join message in Gather state
        call Shift_to_Gather

Commit token received:
    if token.seq = my_ring_id.seq then
        call Shift_to_Recovery

Join timeout expired:
    same as in Gather state

Token Loss timeout expired:
    call Shift_to_Gather

Fig. 5. The pseudocode executed by a processor in the Commit state.

When a processor starts or restarts, it first forms and installs a singleton ring, consisting of only the processor itself. The processor then broadcasts a Join message containing the value of my_ring_id.seq, obtained from its stable storage, and shifts to the Gather state.

The membership protocol described above is guaranteed to terminate in bounded time because proc_set and fail_set increase monotonically within a fixed finite domain, because timeouts bound the time that a processor spends in each of the states, and because an additional failure (which increases the fail_set) is forced to prevent the repetition of a proposed membership. In the base case, the membership, proc_set − fail_set, consists of a single processor identifier.

7. THE RECOVERY PROTOCOL

The objective of the recovery protocol is to recover the messages that had not been received when the membership protocol was invoked and to enable the processors transitioning from the same old configuration to the same new configuration to deliver the same messages from the old configuration.
The recovery protocol also provides message delivery guarantees, and thus maintains extended virtual synchrony, during recovery from failures. Maintenance of extended virtual synchrony is essential to applications such as fault-tolerant distributed databases.

Regular token received:
    same as in Operational state except
        get messages from retrans_message_queue instead of new_message_queue
        and before forwarding the token execute:
            if retrans_message_queue is not empty then
                if token.retrans_flg = false then token.retrans_flg := true
            else if token.retrans_flg = true and I set it then
                token.retrans_flg := false
            if token.retrans_flg = false then
                increment my_retrans_flg_count
            else my_retrans_flg_count := 0
            if my_retrans_flg_count = 2 then
                my_install_seq := token.seq
            if my_retrans_flg_count ≥ 2 and my_aru ≥ my_install_seq
                    and my_received_flg = false then
                my_received_flg := true
                my_deliver_memb := my_trans_memb
            if my_retrans_flg_count ≥ 3 and token.aru ≥ my_install_seq
                    on last two rotations then
                call Shift_to_Operational

Regular message received:
    reset Token Retransmission timeout
    add message to received_message_queue
    update my_aru
    if retransmitted message from my old ring then
        add to received_message_queue for old ring
        remove message from retrans_message_queue for old ring

Foreign message from processor q received:
    discard message

Join message from processor q received:
    if q in my_new_memb and message.ring_seq > my_ring_id.seq then
        execute code for receipt of Join message in Commit state
        execute code for Token Loss timeout expired in Recovery state

Commit token received by new representative:
    convert Commit token to regular token
    if retrans_message_queue is not empty then
        token.retrans_flg := true
    else token.retrans_flg := false
    forward regular token
    reset Token Loss and Token Retransmission timeouts

Token Loss timeout expired:
    discard all new messages received on the new ring
    empty retrans_message_queue
    determine current old ring aru (it may have increased)
    call Shift_to_Gather

Token Retransmission timeout expired:
    retransmit token
    reset Token Retransmission timeout

Fig. 6. The pseudocode executed by a processor in the Recovery state.

Shift_to_Gather:
    broadcast Join message containing my_proc_set, my_fail_set, ring_seq = my_ring_id.seq
    cancel Token Loss and Token Retransmission timeouts
    reset Join and Consensus timeouts
    for all r in my_proc_set do consensus[r] := false
    consensus[my_id] := true
    state := Gather

Shift_to_Commit:
    update memb_list in Commit token with my_ring_id, my_aru, my_received_flg,
        my_high_delivered
    my_ring_id := Commit token ring_id
    forward Commit token
    cancel Join and Consensus timeouts
    reset Token Loss and Token Retransmission timeouts
    state := Commit

Shift_to_Recovery:
    forward Commit token for the second time
    my_new_memb := membership in Commit token
    my_trans_memb := members on old ring transitioning to new ring
    if for some processor in my_trans_memb received_flg = false then
        my_deliver_memb := my_trans_memb
    low_ring_aru := lowest aru for old ring for processors in my_deliver_memb
    high_ring_delivered := highest sequence number of message delivered for old ring
        by a processor in my_deliver_memb
    copy all messages from old ring with sequence number > low_ring_aru
        into retrans_message_queue
    my_aru := 0
    my_aru_count := 0
    reset Token Loss and Token Retransmission timeouts
    state := Recovery

Shift_to_Operational:
    deliver messages deliverable on old ring (at least up through high_ring_delivered)
    deliver membership change for transitional configuration
    deliver remaining messages from processors in my_deliver_memb
        in transitional configuration
    deliver membership change for new ring
    my_memb := my_new_memb
    my_proc_set := my_memb
    my_fail_set := empty set
    state := Operational

Fig. 7. The pseudocode executed by a processor when shifting between states.
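The four states and their principal transitions can be abridged into a small transition table; this compact rendering and its event labels are ours, not the paper's, and omit the per-event bookkeeping shown in the pseudocode figures.

```python
# (state, event) -> next state, abridged from Figures 3 through 7
TRANSITIONS = {
    ("Operational", "token_loss_timeout"): "Gather",
    ("Operational", "foreign_message"): "Gather",
    ("Operational", "join_message"): "Gather",
    ("Operational", "fail_to_receive"): "Gather",
    ("Gather", "consensus_reached"): "Commit",
    ("Gather", "consensus_timeout"): "Gather",
    ("Commit", "commit_token_received_again"): "Recovery",
    ("Commit", "token_loss_timeout"): "Gather",
    ("Recovery", "retransmission_complete"): "Operational",
    ("Recovery", "token_loss_timeout"): "Gather",
}

def next_state(state, event):
    # events not listed leave the state unchanged
    return TRANSITIONS.get((state, event), state)
```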
7.1 The Data Structures

The recovery protocol uses the following data structures in addition to those above.

Regular Token Field. The regular token has the following additional field:

—retrans_flg: A flag that is used to determine whether there are any additional old ring messages that must be rebroadcast on the new ring.

Local Variables. The recovery protocol also depends on the following local variables:

—my_new_memb: The set of identifiers of processors on the processor's new ring.
—my_trans_memb: The set of identifiers of processors that are transitioning from the processor's old ring to its new ring.
—my_deliver_memb: The set of identifiers of processors whose messages the processor must deliver in the transitional configuration.
—low_ring_aru: The lowest aru for the old ring of all the processors in my_deliver_memb.
—high_ring_delivered: The highest message sequence number such that some processor delivered the message with that sequence number as safe on the old ring.
—my_install_seq: The highest new sequence number of any old ring message transmitted on the new ring.
—retrans_message_queue: A queue of messages from the old ring awaiting retransmission to ensure that all remaining processors from the old ring have the same set of messages.
—my_retrans_flg_count: The number of successive token rotations in which the processor has received the token with retrans_flg false.

7.2 The Protocol

A processor executing the recovery protocol takes the following steps:

(1) Exchange messages with the other processors that were members of the same old ring to ensure that they have the same set of messages broadcast on the old ring but not yet delivered.
(2) Deliver to the application those messages that can be delivered on the old ring according to the agreed or safe delivery requirements, including all messages with old ring sequence numbers less than or equal to high_ring_delivered.
(3) Deliver the first Configuration Change message, which initiates the transitional configuration.
(4) Deliver to the application further messages that could not be delivered in agreed or safe order on the old ring (because delivery might violate the requirements for agreed or safe delivery) but that can be delivered in agreed or safe order in the smaller transitional configuration.
(5) Deliver the second Configuration Change message, which initiates the new regular configuration.
(6) Shift to the Operational state.

Steps (2) through (6) involve no communication with other processors and are performed as one atomic action. The pseudocode executed by a processor to complete these steps is given in Figure 6.

Exchange of Messages from the Old Ring. In the first step of the recovery protocol, each processor determines the lowest my_aru of any processor from its old ring that is also a member of the new ring. The processor then broadcasts on the new ring every message for the old ring that it has received and that has a sequence number greater than the lowest my_aru. The retrans_flg field in the token is used to determine when all old ring messages have been retransmitted. This exchange of messages ensures that each processor receives as many messages as possible from the old ring.

Each such message is broadcast with a new ring identifier and encapsulates the old ring message with its old ring identifier. The new ring sequence numbers of these messages are used to ensure that messages are received; the old ring sequence numbers are used to order messages as messages of the old ring. Messages from an old ring retransmitted on the new ring are not delivered to the application by any processor that was not a member of the old ring. No new messages originated on the new ring are broadcast in the Recovery state.
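The selection of old-ring messages to rebroadcast can be sketched as follows; the helper name is ours, and the dict representation of the held messages is an assumption for illustration.

```python
def messages_to_retransmit(held, old_ring_arus):
    """Select old-ring messages to rebroadcast on the new ring.

    held: {old-ring seq: message} held by this processor.
    old_ring_arus: my_aru values of old-ring processors that are also
    members of the new ring. Returns messages in old-ring seq order whose
    sequence numbers exceed the lowest my_aru.
    """
    low_ring_aru = min(old_ring_arus)
    return [msg for seq, msg in sorted(held.items()) if seq > low_ring_aru]
```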
Delivery of Messages on the Old Ring. For each message, the processor must determine the appropriate configuration in which to deliver the message. A processor can deliver a message in agreed order for the old ring if it has delivered all messages originated on that ring with lower sequence numbers. A processor can deliver a message in safe order for the old ring (1) if it has received the old ring token twice in succession with the aru at least equal to the sequence number of the message or (2) if some other processor has already delivered the message as safe on the old ring, as indicated by high_ring_delivered.

The processor sorts the messages for the old ring that were broadcast on the new ring into the order of their sequence numbers on the old ring and delivers messages in order, until it encounters a gap in the sorted sequence or a message requiring safe delivery with a sequence number greater than high_ring_delivered. Messages beyond this point cannot be delivered as safe on the old ring but may be delivered in a transitional configuration.

The processor then delivers the first Configuration Change message, which contains the identifier of the old regular configuration, the identifier of the transitional configuration, and the membership of the transitional configuration. The membership of the transitional configuration is my_trans_memb. The identifier of the transitional configuration has a sequence number two less than the sequence number of the new ring, and the representative's identifier is chosen deterministically from my_trans_memb.

Delivery of Messages in the Transitional Configuration. Following the first Configuration Change message, the processor delivers in order all remaining messages that were originated on the old ring by processors in my_deliver_memb.
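The old-ring delivery cut described above can be sketched as follows; the function name and the (needs_safe, payload) encoding are ours.

```python
def old_ring_deliverable(msgs, next_seq, high_ring_delivered):
    """Messages deliverable on the old ring, in old-ring sequence order.

    msgs: {old-ring seq: (needs_safe, payload)}. Delivery stops at a gap in
    the sequence, or at a safe-delivery message whose sequence number exceeds
    high_ring_delivered; later messages may still be delivered in the
    transitional configuration.
    """
    out = []
    seq = next_seq
    while seq in msgs:
        needs_safe, payload = msgs[seq]
        if needs_safe and seq > high_ring_delivered:
            break
        out.append(payload)
        seq += 1
    return out
```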
The processor then delivers a second Configuration Change message, which contains the identifier of the transitional configuration, the identifier of the new regular configuration, and the membership of the new regular configuration. The processor then shifts to the Operational state.

Note that some messages cannot be delivered on the old ring or even in the transitional configuration because delivery of those messages might violate causality. Such messages follow a gap in the message sequence. For example, if processor p originates or delivers message m1 before it originates message m2, and processor q received m2 but did not receive m1 in the message exchange, then q cannot deliver m2 because causality would be violated. Here p is not in the same transitional configuration as q because, if it were, then q would have received all of the messages originated by p before or during the message exchange.

Failure of Recovery. If the recovery fails while the recovery protocol is being executed, some processors may have installed the new ring while others have not. Prior to installation, a processor's old ring is the ring of which it was a member when it was last in the Operational state. Each processor must preserve its old ring identifier until it installs a new ring. When a processor delivers a message in safe order in a transitional configuration, it must have a guarantee that each of the other members of the configuration will deliver the message before it installs the new ring, unless that processor fails. If a processor does not install the new ring, it will proceed in due course to install a different new ring with a corresponding transitional configuration. It must deliver the message in that transitional configuration in order to honor the delivery guarantee.
Thus, if a processor finds the received_flg in the Commit token set to true for every processor in my_trans_memb, it must retain the old-ring messages originated by members of my_deliver_memb and deliver them as safe in my_trans_memb, the transitional configuration for the new ring that it actually installs. Note that my_trans_memb is a subset of my_deliver_memb and that my_trans_memb must decrease on successive passes through the Recovery state before a new ring is installed.

7.3 An Example

As shown in Figure 8, a ring containing processors p, q, r, s, and t partitions, so that p becomes isolated while q, r, s, and t merge into a new ring with u and v. Processors q, r, s, and t successfully complete the recovery protocol and deliver two Configuration Change messages, one to switch from the regular configuration {p, q, r, s, t} to the transitional configuration {q, r, s, t} and one to switch from the transitional configuration {q, r, s, t} to the regular configuration {q, r, s, t, u, v}. Processors q, r, s, and t may not be able to deliver all of the messages originated in the regular configuration {p, q, r, s, t}, because they may not have received some of the messages from p before p became isolated; however, it can be guaranteed that they deliver all of the messages originated by a processor in the transitional configuration {q, r, s, t}. Similarly, processors q, r, s, and t may not be able to deliver a message as safe in the regular configuration {p, q, r, s, t} because they may have no information as to whether p had received the message before it became isolated; however, it can be guaranteed that they deliver the message as safe in the transitional configuration {q, r, s, t}.

Fig. 8. Regular and transitional configurations. The vertical lines represent total orders of messages, and the horizontal lines represent Configuration Change messages.

The first Configuration Change message separates the messages for which delivery guarantees can be provided in the regular configuration {p, q, r, s, t} from the messages for which delivery guarantees apply in the reduced transitional configuration {q, r, s, t}.

Extended virtual synchrony does not, of course, solve all of the problems of maintaining consistency in a fault-tolerant distributed system, but it does provide a foundation upon which these problems can be solved. Consider, for example, the set of messages delivered by processor p. Prior to the first Configuration Change message delivered by p to terminate the regular configuration {p, q, r, s, t}, there are no missing messages. In the transitional configuration, p delivers all remaining messages originated by itself and other messages that have become safe. There may, however, be messages broadcast by processors q, r, s, and t that are not available to, and are not delivered by, processor p. After the second Configuration Change message, p is a member of the regular configuration {p}, and p does not deliver messages from the other processors. When p rejoins the other processors in some subsequent configuration, the application programs must update their states, using application-specific algorithms, to reflect activities that were not communicated while the system was partitioned. The Configuration Change messages warn the application that a membership change occurred, so that the application programs can take appropriate actions based on the membership change. Extended virtual synchrony guarantees a consistent order of message delivery, which is essential if the application programs are to reconcile their states following repair of a failed processor or remerging of a partitioned network.

8.
THE FLOW CONTROL MECHANISM

The Totem protocol is designed to provide high performance under high load. The performance measures we consider are throughput (messages ordered per second) and latency (delay from message origination to delivery in agreed or safe order). Effective flow control is required to achieve the desired performance.

With point-to-point communication, positive acknowledgment protocols, such as the sliding-window protocol, have been refined to provide excellent flow control. However, with broadcast and multicast communication, positive acknowledgment protocols result in excessive numbers of acknowledgments. Rate-controlled protocols have attracted attention recently but have the disadvantage for broadcast and multicast communication that the transmission rate must be set for each processor individually rather than for the multicast group. With bursty communication, the maximum transmission rate for a processor must be set to a value that is unacceptably low, even when other processors have few messages to transmit.

A basic characteristic of reliable, ordered, broadcast and multicast protocols is that the rate of broadcasting messages cannot exceed the rate at which the slowest processor can receive messages. At higher rates of broadcasting, the input buffer of the slowest processor will become full, and messages will be lost. In our experience, this is the primary cause of message loss. Retransmission of lost messages increases the message traffic and reduces the effective transmission rate.

The Totem single-ring protocol uses a simple flow control mechanism to control the maximum number of messages broadcast during one token rotation. If a processor is unable to process messages at the rate at which they are broadcast, one or more messages will be in its input buffer when the token arrives.
Before processing the token and broadcasting the messages, a processor must empty its input buffer. Thus, the rate of broadcasting messages is reduced to the rate at which messages can be handled by the slowest processor. If the maximum number of messages broadcast during any one token rotation is limited by the size of each processor's input buffer, then input buffer overflow cannot occur.

When the rate of broadcasting is not equally spread across all processors, the protocol can be modified to allow the token to visit more than once per token rotation those processors that have the highest transmission rates and to allow those processors to transmit more messages on each visit. To minimize the latency, the rate at which the token visits a processor should be approximately proportional to the square root of the rate at which the processor broadcasts [Boxma et al. 1990].

8.1 The Data Structures

Regular Token Fields. The flow control mechanism depends on two fields of the regular token:

—fcc: A count of the number of messages broadcast by all processors during the previous rotation of the token.
—backlog: The sum of the number of new messages waiting to be transmitted by each processor on the ring at the time at which that processor forwarded the token during the previous rotation.

Flow Control Constants. The flow control mechanism also depends on two global constants:

—window_size: The maximum number of messages that all processors are allowed to broadcast in any token rotation.
—max_messages: The maximum number of messages that each processor is allowed to broadcast during one visit of the token.

These constants can be determined analytically from the number of processors and their characteristics, or they can be negotiated during ring formation. The constant max_messages may be different for different processors.

Local Variables.
Each processor maintains the following local variables:

—my_trc: The number of messages broadcast by this processor on this rotation of the token (my this rotation count).
—my_pbl: The number of new messages waiting to be transmitted by this processor when it forwarded the token on the previous rotation (my previous backlog).
—my_tbl: The number of new messages waiting to be transmitted by this processor when it forwards the token on this rotation (my this backlog).

The values of my_pbl and my_tbl are limited by the amount of buffer space available for messages awaiting transmission.

8.2 The Algorithm

The value of my_trc, the number of messages broadcast by this processor on this token rotation, is subject to the following constraints:

—my_trc ≤ max_messages: The number of messages broadcast by this processor must not exceed the maximum number it is allowed to broadcast during one visit of the token.
—my_trc ≤ window_size − fcc: The number of messages broadcast by this processor must not exceed the window size minus the number of messages broadcast in the previous rotation of the token.
—my_trc ≤ window_size × my_tbl / (backlog + my_tbl − my_pbl): The number of messages broadcast by this processor must not exceed its fair share of the window size, based on the ratio of its backlog to the sum of the backlogs of all the processors as they released the token during the previous rotation.

The backlog mechanism achieves a more-uniform transmission rate for individual processors under high, but not overloaded, traffic conditions than does the simple window-size mechanism used by FDDI.

9. IMPLEMENTATION AND PERFORMANCE

The Totem single-ring ordering and membership protocol has been implemented in the C programming language on a network of Sun 4/IPC workstations connected by an Ethernet. The implementation uses only standard Unix features and is highly portable. It has been transferred from our Sun
workstations to DEC and SGI workstations and has worked with little modification. The implementation uses the UDP broadcast interface in the Unix operating system SunOS 4.1.1 with Sun's default kernel allocations. One UDP socket is used for all broadcast messages, and a separate UDP socket is used by each processor to receive the token from its predecessor on the ring. The input buffers are a combination of the buffers managed by the physical Ethernet controller and those managed by the Unix UDP service. Four bytes of stable storage are required to store the ring sequence number.

Fig. 9. To the left, the throughput as a function of message size. To the right, the latency to agreed and safe delivery as a function of load.

We have measured the performance of our implementation on a network of five Sun 4/IPC workstations, each ready to broadcast at all times with minimal extraneous load on the processors and on the Ethernet. For each measurement, the window_size and max_messages were adjusted to the maximum values for which message loss is negligible in order to maximize throughput. The throughput was measured for equal numbers of messages broadcast by all processors on the ring and for all messages broadcast by a single processor. As the left graph of Figure 9 shows, there was little difference in the throughput. With 1024-byte messages, more than 800 messages are ordered per second. For smaller messages, over 1000 messages are ordered per second. The highest prior rates of message ordering for reliable, totally ordered message delivery for 1024-byte messages are about 300 messages per second for the Transis system using the same equipment and for the Amoeba system using equipment of similar performance.
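The window_size and max_messages values tuned in these measurements feed the per-visit bound of Section 8.2. The three constraints combine into a single minimum; a minimal sketch in C, using the paper's variable names but our own simplified integer arithmetic:

```c
/* Sketch of the per-visit broadcast bound from Section 8.2. */
#include <assert.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Maximum number of messages (my_trc) this processor may broadcast
   on this visit of the token. */
static int my_trc_bound(int max_messages, int window_size,
                        int fcc, int backlog, int my_tbl, int my_pbl)
{
    int denom = backlog + my_tbl - my_pbl;
    /* Fair share of the window, proportional to this processor's
       backlog relative to the total backlog around the ring. */
    int fair = denom > 0 ? window_size * my_tbl / denom : window_size;
    return min3(max_messages, window_size - fcc, fair);
}
```

Which constraint binds depends on load: under light traffic the per-visit cap max_messages dominates, while under heavy traffic the fair-share term throttles processors with small backlogs.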
We have also investigated the latency from origination to delivery of a message in agreed and safe order; a detailed analysis can be found in Moser and Melliar-Smith [1994]. The right graph of Figure 9 shows the mean latency to agreed and safe delivery for Poisson arrivals at lower, more-typical loads for 1024-byte messages. At low loads (e.g., 400 ordered messages per second, which is much more than the maximum throughput for prior protocols), the latency to agreed delivery is under 10 milliseconds. Even at 50% useful utilization of the Ethernet (625 ordered messages per second), the latency to agreed delivery is still only about 13 milliseconds. In general, the latency to agreed delivery is approximately half the token rotation time, and the latency to safe delivery is approximately twice the token rotation time, except at very high loads where the latency is dominated by queuing delays in the buffers.

Other performance characteristics of interest are the time to recover from token loss and the time to execute the membership protocol and reconfigure the system when failures occur. With the token retransmission mechanism enabled, the time to return to normal operation after loss of the token is on average 16 milliseconds. With the token retransmission mechanism disabled, loss of the token triggers a Token Loss timeout and forces a complete reformation of the membership. The time to form a new ring, generate a new token, recover messages from the old ring, and return to normal operation is on average the Token Loss timeout plus 40 milliseconds, with the Token Loss timeout set to 100 milliseconds.

10. CONCLUSION

The Totem single-ring protocol provides fast, reliable, ordered delivery of messages in a broadcast domain where processors may fail and where the network may partition.
A token circulating around a logical ring imposed on the broadcast domain is used to recover lost messages and to order messages on the ring. Delivery of messages in agreed and safe order is provided. The membership protocol handles processor failure and recovery, as well as network partitioning and remerging. Extended virtual synchrony ensures consistent actions by processors that fail and are repaired with stable storage intact and in networks that partition and remerge. A recovery protocol that maintains extended virtual synchrony during recovery after a failure has been provided. The flow control mechanism avoids message loss due to buffer overflow and provides significantly higher throughput than prior total-ordering protocols.

Given the high performance of Totem, there is no need to provide a weaker message-ordering service, such as causally ordered delivery, because totally ordered agreed delivery can be provided at no greater cost. Moreover, applications can be programmed more easily and more reliably with totally ordered messages.

Continuing work on Totem is exploiting the single-ring protocol to provide more-general services. Agreed and safe delivery services, as well as membership services, are being provided to multiple rings interconnected by gateways. It remains to be investigated whether the exceptional performance of the Totem single-ring protocol can be sustained when Totem is extended to multiple rings.

REFERENCES

Amir, Y., Dolev, D., Kramer, S., and Malki, D. 1992a. Membership algorithms in broadcast domains. In Proceedings of the 6th International Workshop on Distributed Algorithms (Haifa, Israel). Springer-Verlag, Berlin, 292–312.

Amir, Y., Dolev, D., Kramer, S., and Malki, D. 1992b. Transis: A communication subsystem for high availability.
In Proceedings of the IEEE 22nd Annual International Symposium on Fault-Tolerant Computing (Boston, Mass.). IEEE, New York, 76–84.

Amir, Y., Moser, L. E., Melliar-Smith, P. M., Agarwal, D. A., and Ciarfella, P. 1993. Fast message ordering and membership using a logical token-passing ring. In Proceedings of the IEEE 13th International Conference on Distributed Computing Systems (Pittsburgh, Pa.). IEEE, New York, 551–560.

Amir, Y., Moser, L. E., Melliar-Smith, P. M., Agarwal, D. A., and Ciarfella, P. 1994. The Totem single-ring ordering and membership protocol. Tech. Rep. 94-19, Dept. of Electrical and Computer Engineering, Univ. of California, Santa Barbara, Calif. Aug.

Birman, K. P. and van Renesse, R. 1994. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, Los Alamitos, Calif.

Boxma, O. J., Levy, H., and Weststrate, J. A. 1990. Optimization of polling systems. In Performance '90, Proceedings of the 14th IFIP WG 7.3 International Symposium on Computer Performance Modelling, Measurement and Evaluation (Edinburgh, U.K.). North-Holland, Amsterdam, 349–361.

Chang, J. M. and Maxemchuk, N. F. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug.), 251–273.

Fischer, M. J., Lynch, N. A., and Paterson, M. S. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr.), 374–382.

Kaashoek, M. F. and Tanenbaum, A. S. 1991. Group communication in the Amoeba distributed operating system. In Proceedings of the IEEE 11th International Conference on Distributed Computing Systems (Arlington, Tex.). IEEE, New York, 222–230.

Lamport, L. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558–565.

Melliar-Smith, P. M., Moser, L. E., and Agarwal, D. A. 1991. Ring-based ordering protocols.
In Proceedings of the IEE International Conference on Information Engineering (Singapore). IEE, Stevenage, Herts, U.K., 882–891.

Melliar-Smith, P. M., Moser, L. E., and Agrawala, V. 1990. Broadcast protocols for distributed systems. IEEE Trans. Parallel Distrib. Syst. 1, 1 (Jan.), 17–25.

Mishra, S., Peterson, L. L., and Schlichting, R. D. 1991. A membership protocol based on partial order. In Proceedings of the 2nd IFIP WG 10.4 International Working Conference on Dependable Computing for Critical Applications (Tucson, Ariz.). Springer-Verlag, Wien, Austria, 309–331.

Moser, L. E. and Melliar-Smith, P. M. 1994. Probabilistic bounds on message delivery for the Totem single-ring protocol. In Proceedings of the IEEE 15th Real-Time Systems Symposium (San Juan, Puerto Rico). IEEE, New York, 238–248.

Moser, L. E., Amir, Y., Melliar-Smith, P. M., and Agarwal, D. A. 1994a. Extended virtual synchrony. In Proceedings of the IEEE 14th International Conference on Distributed Computing Systems (Poznan, Poland). IEEE, New York, 56–65.

Moser, L. E., Melliar-Smith, P. M., and Agrawala, V. 1994b. Processor membership in asynchronous distributed systems. IEEE Trans. Parallel Distrib. Syst. 5, 5 (May), 459–473.

Peterson, L. L., Buchholz, N. C., and Schlichting, R. D. 1989. Preserving and using context information in interprocess communication. ACM Trans. Comput. Syst. 7, 3 (Aug.), 217–246.

Rajagopalan, B. and McKinley, P. K. 1989. A token-based protocol for reliable, ordered multicast communication. In Proceedings of the IEEE 8th Symposium on Reliable Distributed Systems (Seattle, Wash.). IEEE, New York, 84–93.

van Renesse, R., Hickey, T. M., and Birman, K. P. 1994. Design and performance of Horus: A lightweight group communications system. Tech. Rep. 94-1442, Dept. of Computer Science, Cornell Univ., Ithaca, N.Y. Aug.
Received July 1993; revised August 1994; accepted July 1995