Uploaded by angerol Arbet

totem SRP

advertisement
The Totem Single-Ring Ordering and
Membership Protocol
Y. AMIR, L. E. MOSER, P. M. MELLIAR-SMITH, D. A. AGARWAL, and
P. CIARFELLA
University of California, Santa Barbara
Fault-tolerant distr ibuted systems are becoming more important, but in existing systems,
maintaining the consistency of replicated data is quite expensive. The Totem single-ring protocol
suppot ts consistent concurrent operations by placing a total order on broadcast messages. This
total order is derived from the sequence number in a token that circulates around a logical ring
imposed on a set of processors in a broadcast domain. The protocol handles reconfiguration of the
system when processors fail and restart or when the network partitions and remerges. Extended
virtual synchrony ensures that processors deliver messages and configuration changes to the
application in a consistent, systemwide total order. An effective flow control mechanism enables
the Totem single-ring protocol to achieve message-ordering rates significantly higher than the
best prior total-ordering protocols.
Categories and Subject Descriptors: C.2.1 tComputer-Communications Networksl: Network
Architecture and Design—rie/morh com muitications, C.2.2 [Computer-Communication Networksl: Network Protocols—protocol ci rchitectu re,’ C.2.4 [Computer-Communication Networksl Distributed Systems—networh opero ring system s, C.2.5 tComputer-Communication
Networksl: Local Networks—run gs, D.4.4 lOperating SystemsJ: Communications Management—netu:oi-h com niu iitcation; D.4.5 [Operating Systemsl: Reliability—{nti/t tolerance; D.4.7
[Operating Systemsl: Organization and Design—distribiited s ystetn s
General Terms: Performance, Reliability
Additional Key Words and Phrases: Flow control, membership, reliable delivery, token passing,
total orderi rig, virtual synchrony
Earlier versions of the Totem single-ring protocol appeared in the Proceedings of the IEE
International Conference on Information Engineering, Singapore tDecember 1991) and in the
Proceedings of the IEEE 18th International Conference on Distributed Computing Systems,
Pittsburgh, Pa. (May 1993).
This research was supported by NSF Grant no. NCR-9016361, ARPA Contract no. N00174-93-K0097, and Rockwell CMC /State of California MICRO Crrant no. 92-101.
Authors’ addresses: Y. Amir, Computer Science Department, The Hebrew University of
Jerusalem, Israel; email: yairamir mcs.huji.ac.i1; L. E. Moser and P. M. Melliar-Smith, Department of Electrical and Computer Engineering, Univei-sity of C alifornia, Santa Barbara, CA
93106; email: (moser; pmms} 'ece.ucsb.edu: D A. Agarwal, Lawrence Berkeley National Laboratory, Berkeley, CA 94720; email: deba george.LBL.gov; P. Ciat fella, Cascade Communications
Corporation, S Carlisle Road, Westford, MA 01886; email: ciarfella alpo.cascade.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is
given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
sO 1995 ACM 0734-2071 /95/1100-03 11 $03.50
ACM Transactions on Computer Systems. Vol 13, No. 4, November 1995, Pages 311—342
312
•
Y. Amir et al.
1. INTRODUCTION
Fault-tolerant distributed systems are becoming more important due to the
increasing demand for more-reliable operation and improved performance.
Maintaining the consistency of replicated data and coordinating the activities
of cooperating processors present substantial problems, which are made more
difficult by concurrency, asynchrony, fault tolerance, and real-time performance requirements. Existing fault-tolerant distributed systems that address
these problems are difficult to program and are expensive in the number of
messages broadcast and/or computations required. Recent protocols for faulttolerant distributed systems [Amir et al. 1992b; Birman and van Re- nesse
1994; Kaashoek and Tanenbaum 1991; Melliar-Smith et al. 1990; Peterson et al.
1989] employ the idea of placing a partial or total order on broadcast messages
to simplify the application programs and to reduce the communication and
computation costs.
The Totem single-ring protocol supports high-performance, fault-tolerant
distributed systems that must continue to operate despite network partitioning and remerging and despite processor failure and restart. Totem provides
totally ordered message delivery with ]ow overhead, high throughput, and
low latency using a logical token-passing ring imposed on a broadcast domain. The key to its high performance is an effective flow control mechanism.
Totem also provides rapid detection of network partitioning and processor
failure together with reconfiguration and membership services. Its novel
mechanisms prevent delivery of messages in different orders in different
components of a partitioned network and provide accurate information about
which processors have delivered which messages. Earlier versions of the
Totem single-ring protocol are described in Amir et al. [ 1993] and MelliarSmith et al. [ 1991].
Programming the application is considerably simplified if messages are
delivered in total order rather than only in causal order or if messages are
delivered in causal order rather than only in FIFO order. In prior systems,
delivery of messages in total order has been more expensive than delivery of
messages in causal order, and delivery of messages in causal order has been
more expensive than FIFO delivery. The Totem single-ring protocol can,
however, deliver totally ordered messages with high throughput at no greater
cost than causally ordered messages or, indeed, than reliable point-to-point
FIFO messages. A total order on messages simplifies the application programming by reducing the risk of inconsistency when replicated data are
updated and by resolving the contention for shared resources within the
system, such as the claiming of locks.
In Totem, messages are delivered in agreed order, which guarantees that
processors deliver messages in a consistent total order and that, when a
processor delivers a message, it has already delivered all prior messages that
originated within its current configuration. Totem also provides delivery of
messages in safe order, which guarantees additionally that, when a processor
delivers a message, it has determined that every processor in the current
configuration has received and will deliver the message unless that processor
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
313
fails. Delivery of a message in agreed or safe order is requested by the
originator of the message.
Delivery of messages in a consistent total order is not easy to achieve in
distributed systems that are subject to processor failure and network partitioning. A failing processor, or a group of processors that have become
isolated, may deliver messages in an order that is different from the order
determined by other processors. As long as those processors remain failed or
isolated, these inconsistencies are not apparent. However, as soon as a
processor is repaired and readmitted to the system, or as soon as the components of a partitioned system are remerged, the inconsistencies in the
message order may become manifest, and recovery may be difficult. The
Totem protocol cannot guarantee that every processor is able to deliver every
message, but it does guarantee that, if two processors deliver a message, they
deliver the message in the same total order.
The application programs may also need to know about configuration
changes. Different processors may learn of a configuration change at different
times, but they must form consistent views of the configuration change and of
the messages that precede or follow the configuration change. Birman devised
the concept of virtual synchrony [Birman and van Renesse 1994], which
ensures that processors deliver messages consistently in the event of processor fail-stop faults. We have generalized this concept to extended virtual
synchrony [Moser et a1. 1994a], which applies to systems in which the
network can partition and remerge and in which processors can fail and
restart with stable storage intact.
The Totem single-ring protocol is designed to operate over a single broadcast domain, such as an Ethernet. It uses Unix UDP, which provides a besteffort multicast service over such media. Other media that provide a best-effort
multicast service, such as ATM or the Internet MB one, can be used to
construct the broadcast domain needed by Totem.
The software architecture of the Totem single-ring protocol is shown in
Figure 1. The arrows on the left represent the passage of messages through
the protocol hierarchy, while the arrows on the right represent Configuration
Change messages and configuration installation. Using a logical token-passing ring imposed on the physical broadcast domain, the single-ring protocol
provides reliable totally ordered message delivery and effective flow control.
On detection of token loss, or on receiving a message from a processor not on
its ring, a processor invokes the membership protocol to form a new ring
using Join messages and a Commit token transmitted over the broadcast
domain. The membership protocol activates the recovery protocol with the
proposed configuration change. The recovery protocol uses the single-ring
ordering protocol to recover missing messages. The processor then installs the
new ring by delivering Configuration Change messages to the application.
2. RELATED WORK
Our work on the Totem protocol is based on our combined experience with
two systems: the Trans and Total reliable, ordered broadcast and memberACM Transactions on Computer Systems, Vol. 13, No. 4, November 1995.
314
•
Y. Amir et at.
Application
Broadcast
messages
,
Configuration Change
messages
Recovery
Install
Fig. 1. The Totem single-ring protocol
hierarchy.
Reliable Delivery
Total Ordering
Flow Control
Token loss
Configuration changes
Membership
Foreign messages
Best-Effort Broadcast Domain
ship protocols [Melliar-Smith et al. 1990; Moser et al. l994b] and the Transit
group communication system [Amir et al. 1992a; 1992b].
The Trans protocol uses positive and negative acknowledgments piggybacked onto broadcast messages and exploits the transitivity of positive
acknowledgments to reduce the number of acknowledgments required. The
Total protocol, layered on top of the Trans protocol, is a fault-tolerant totalordering protocol that continues to order messages provided that a resiliency
constraint is met. The membership protocol, layered on top of the Total protocol,
ensures that each change in the membership occurs at the same logical time in
each processor, corresponding to a position in the total order. The Totem
protocol was developed to address the computational over- head of Trans and
Total and is intended for local area networks with fast and highly reliable
communication.
The Transis group communication system provides reliable, ordered group
multicast and membership services. Transis based its ordering protocol initially on the Trans protocol but, more recently, has also included the Totem
protocol for message ordering. Unlike other prior protocols, the Transis
membership protocol supports remerging of a partitioned network and maintains a consensus view of the membership of each component, rather than a
global consensus view of the entire system. The Totem membership protocol
uses the idea. first proposed for Transis, that the membership can be reduced
in size to ensure termination.
Chang and Maxemchuk [ 1984] described a reliable broadcast and ordering
protocol that uses a token-based sequencer strategy. Unlike Totem, which
requires that a processor must hold the token to broadcast a message, their
protocol allows processors to broadcast messages at any time. The processor
holding the token is responsible for broadcasting an acknowledgment message that includes a sequence number for each message acknowledged. A
processor that has not received a message sends a negative acknowledgment
ACM Transactions on Computer Systems, Vol 13, No 4, N ove rn bee 1995
The Totem Single-Ring Ordering and Membership Protocol
•
315
to request retransmission by the processor that acknowledged the message.
While the latency is good at low loads, it increases at high loads and in the
presence of a failed processor.
More closely related to Totem is the TPM protocol of Rajagopalan and
McKinley [ 1989], which also uses a token to control broadcasting and sequencing of messages. The TPM protocol provides the safe delivery but not
the agreed delivery that Totem provides. In the absence of processor failure
and network partitioning, TPM requires on average two and one-half token
rotations for safe delivery, whereas Totem requires two token rotations. In
the event of network partitioning, only the component containing a majority
of the processors continues to operate; processors in the other components
block. In contrast, Totem handles network partitioning and remerging by
allowing each component of a partitioned system to continue operating, not
just the component that contains a majority of the processors.
Birman’s Isis system [Birman and van Renesse 1994] and more recently
the Horus system [van Renesse et al. 1994] have focused on process groups
and the application program interface. Isis provides BCAST or unordered
messages, CBCAST or causally ordered messages, and ABCAST or totally
ordered messages. A vector clock strategy is used to ensure causal ordering,
and a token-based sequencer, similar to that of Chang and Maxemchuk
[1984], is used to provide total ordering. Isis introduced the important idea of
virtual synchrony in which configuration change messages are ordered relative to other messages so that a consistent view of the system is maintained
as the system changes dynamically. The user interfaces of both Transis and
Totem were inspired by the Isis user interface.
The Psync protocol of Peterson et al. [ 1989] constructs a partial order on
messages that can be converted into a total order. Isis, Trans, and Transis
employ a similar strategy. In contrast, Totem constructs a total order on
messages directly without constructing a partial order first. Mishra et a1.
[1991] have developed a membership protocol based on the partial order of
Psync.
Kaashoek and Tanenbaum [ 1991] describe group communication in the
Amoeba distributed operating system. One processor, called the sequencer, is
responsible for placing a total order on messages. Processors send point-topoint messages to the sequences, which assigns sequence numbers to messages and broadcasts them to the other processors. Messages are recovered by
sending a request to the sequencer for retransmission. Group membership
functions are also provided. Performance is excellent for very short messages
but deteriorates for long messages and if high resilience to processor failure
and network partitioning is required.
The excellent performance of the Totem single-ring protocol is achieved
using a flow control strategy that limits message loss due to buffer overflow
at the receivers. Related to this flow control strategy are the sliding-window
strategy, the FDDI token rotation time limit, and the flow control mechanism
of Transis. The use of a token in combination with a window for flow control
on a broadcast medium works much better than either mechanism in isolation and, as far as we know, has not been investigated in prior work.
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995.
316
Y. Amir et al
3. THE MODEL
We consider a distributed system built on a broadcast domain consisting of a
finite number of processors that communicate by broadcasting messages. We
use the term origi irate to refer to the first broadcast of a message generated
by the application. Each broadcast of a message is received immediately or
not at all by a processor in the broadcast domain, and thus messages may
have to be retransmitted to achieve reliable delivery. A processor receives all
of its own broadcast messages.
The broadcast domain may become partitioned so that processors in one
component of the partitioned system are unable to communicate with processors in another component. Commnication between separated components
can subsequently be reestablished.
Each processor within the system has a unique identifier and stable
storage. Processors can incur fail-stop, timing, or omission faults. A processor
that is excessively slow, or that fails to receive a message an excessive
number of times, can be regarded as having failed. Failed processors can be
repaired and are reconfigured into the system when they restart. If a processor fails and restarts, its identifier does not change, and all or part of its state
may have been retained in stable storage.
There are no malicious faults, such as faults in which processors generate
incorrect messages or in which the communication medium undetectably
modifies messages in transit.
Imposed on the broadcast domain is a logical token-passing ring. The token
is a special message transmitted point-to-point. The token may be lost by not
being received by a processor on the ring. Each ring has a representative,
chosen deterministically from the membership when the ring is formed, and
an identifier that consists of a ring sequence number and the identifier of the
representative. To ensure that ring sequence numbers and hence that ring
identifiers are unique, each processor records its ring sequence number in
stable storage.
We use the term ring to refer to the infrastructure of Totem and the term
coli figiiratioti to represent the view provided to the application. The meibership of a configuration is a set of processor identifiers. The minimum configuration for a processor consists of the process itself. A regu far configuration
has the same membership and identifier as its corresponding ring. A transifiona/ configuration consists of processors that are members of a new ring
coming directly from the same old ring; it has an identifier that consists of a
“ring” sequence number and the identifier of the representative.
We distinguish between the terms “receive” and “deliver,” as follows. A
processor receives messages that were broadcast by processors in the broadcast domain, and a processor delivers messages in total order to the applicaGon.
Two types of messages are delivered to the application. Regular messages
are generated by the application for delivery to the application. Con figs ration
Cha rige messages are generated by the processors for delivery to the application, without being broadcast, to terminate one configuration and to initiate
ACM Transactions on Computer Systems, Vol 13, No 4, November 1995
•
The Totem Single-Ring Ordering and Membership Protocol
317
another. The identifiers of the regular and Configuration Change messages
consist of configuration identifiers and message sequence numbers.
We define a causal order that is a modification of Lamport’s causal order
[Lamport 1978] in that it applies to messages rather than events and is
constrained to messages that originated within a single configuration. This
allows remerging of a partitioned network and joining of a failed processor
without requiring all messages in the history to be delivered. Causal and
total orders' are defined on sets of messages, as follows:
Causal Order for Configuration C. The reflexive transitive closure of the
“precedes” relation, which is defined for all processors p that are members of
C as follows:
—Message mt precedes message m2 if processor p originated m, in configuration C before p originated m2 in C.
—Message m precedes message m 2 if processor p originated m
ration C and p delivered m l in C before originating m 2 .
2
in configu-
The causal order is assumed to be antisymmetric and, thus, is a partial
order.°
Delivery Order /or Configuration C. The reflexive transitive closure of the
“precedes” relation, which is defined on the union over all processors p in C
of the sets of regular messages delivered in C by p as follows:
Message ml precedes message m2 if processor p delivers m1 in C before p
delivers m2 in C.
Note that some processors in configuration C may not deliver all messages of
the delivery order for Configuration C.
Global Delivery Order. The reflexive transitive closure of the union of the
Configuration Delivery Orders for all configurations and of the “precedes”
relation, which is defined on the set of Configuration Change messages and
regular messages as follows:
For each processor p and each configuration C of which p is a member, the
Configuration Change message delivered by p that initiates C precedes
every message m delivered by p in C.
—For each processor p and each configuration C of which p in a member,
every message m delivered by p in C precedes the Configuration Change
message delivered by p that terminates C.
'A total order on a set iS ie a relatñion
that satisfies the reflexive ( z
x), transitive (if z
y
and y z, then x z), antisymmetric (if x < y and z — y, then y
z), and comparable
propertieñs (z
y or y z 4. A partial order on a set S is a relation
defined on S that satisfies
the reflexive, transitive, and antisymmetric properties. To adhere to standard mathematical
practice in which partial and total orders are reflexive, “before” must be regarded as nonstrict,
i.e., an event occurs before idself.
This property cannot be proved. That the physical world is antisymmetric must be an assumption.
ACM Transactions on Computer Systems, Vol 13, No. 4, November 1995.
3 8
Y. Amir et al.
In Amir et al. [ 1994] we prove that the delivery order for configuration C is a
total order and that the global delivery order is a total order.
4. SERVICES
The objective of the Totem single-ring protocol is to provide the application
with reliable, totally ordered message delivery and membership services, as
defined below.
4.1 Membership Services
The Totem membership protocol provides the following properties:
Un iqueness of Con/igtirntion s. Each configuration identifier is unique;
moreover, at any time, a processor is a member of at most one configuration.
Con senstis. All of the processors that install a configuration determine
that the members of the configuration have reached consensus on the membership.'
tern iiia//on. If a configuration ceases to exist for any reason, such as
processor failure or network partitioning, then every processor of that configuration either installs a new configuration by delivering a Configuration
Change message or fails before doing so. The Configuration Change message
contains the identifier of the configuration it terminates, the identifier of the
configuration it initiates, and the membership of the configuration it initi-
ates.
Con[igui ation Cñaiige Consistency. If processors p and q install configuration C; directly after C 1 , then and q both deliver the same Configuration
Change message to terminate C 1 and initiate C; .
4.2 Reliable, Ordered Delivery Services
The Totem total-ordering protocol provides the following properties, which
hold for all configurations C and for all processors p in C:
Reliable Delivery for Con figuration C.
—Each message m has a unique identifier.
—If processor p delivers message m, then p delivers i once only. Moreover,
if processor p delivers two different messages, then p delivers one of those
messages strictly before it delivers the other.
—If processorp originates message Sri, thenp will delivor m or
will fail
before delivering a Configuration Change message to install a new regular
configuration.
—If processor p is a member of regular configuration C, and no configuration
change ever occurs, then p will deliver in C all messages originated in C.
'This does not rotate the impossibility result of Fischer ct at. [1985] because the membership
protocol allows the membership to decrease in order to reach consensus.
ACDI Transactions on Computer Systems, Vol 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
319
If processor p delivers message in originated in configuration C, then p is
a member of C, and p has installed C. Moreover, p delivers m in C or in a
transitional configuration between C and the next regular configuration it
installs.
—If processors p and q are both members of consecutive configurations Ct
and C 2 , then p and q deliver the same set of messages in C 1 before
delivering the Configuration Change message that terminates C, and
initiates 2
Reliable delivery defines the basic requirements on message delivery—in
particular, which messages a processor must deliver within a configuration.
Delivery in Causal Order for Configuration C.
Reliable delivery for configuration C.
—If processor p delivers messages m and m2 , and m, precedes m 2 in the
causal order for configuration C, then p delivers zn before p delivers m 2 .
Causal delivery imposes an ordering constraint to ensure that the delivery
order respects Lamport’s causality within a configuration.
Delivery in Agreed Order for Con[iguration C.
—Delivery in causal order for configuration C.
—If processor p delivers message rn 2 in configuration C, and m is any
message that precedes m 2 iTl the Delivery Order for Configuration C, then
p delivers m 1 in C before p delivers m 2 .
Agreed delivery requires that all processors deliver messages within a configuration in the same total order. Moreover, when a processor delivers a
message, it must have delivered all preceding messages in the total order for
the configuration.
Delivery in Sa[e Order for Configuration C.
Delivery in agreed order for configuration C.
—If processor p delivers message m in regular configuration C in safe order,
then every member of C has installed C.
If processor p delivers message m in configuration C, and the originator of
rn requested safe delivery, then p has determined that each processor in C
has received m and will deliver m or will fail before installing a new
regular configuration.
Delivery of a message in safe order requires that a processor has determined
that all of the processors in the configuration have received the message. This
determination is typically based on acknowledgments from the processors
indicating that they have received the message and all of its predecessors in
the total order. Once a processor has acknowledged receipt of a safe message,
it is required to deliver the message unless it fails. Note that this requireACM Transactions on Computer Systems, Vol 13, No. 4, November 1995.
320
•
Y. Amir et at.
ment does not guarantee that a processor delivers the message in the same
configuration as all of the other processors. Totem uses a Configuration
Change message to notify the application of the membership of the configuration within which delivery is guaranteed as safe.
Extended Virtual S ynchron y.
—Delivery in agreed or safe order as requested by the originator of the
message.
—If processor p delivers messages m1 a ll d m 2 , £tnd m1 p T’ecedes m2 in the
global delivery order, then p delivers mt before p delivers m2.
Virtual synchrony was devised by Birman [Birman and van Renesse 1994J to
ensure that view (configuration) changes occur at the same point in the
message delivery history for all operational processors. Processors that are
members of two successive views must deliver exactly the same set of
messages in the first view. A failed processor that recovers can only be
readmitted to the system as a new processor. Thus, failed processors are not
constrained as to the messages they deliver or their order, and messages
delivered by a failed processor have no effect on the system. If the system
partitions, only processors in one component, the primary component, continue to operate; all of the other processors are deemed to have failed.
Extended virtual synchrony extends the concept of virtual synchrony to
systems in which all components of a partitioned system continue to operate
and can subsequently remerge and to systems in which failed processors can
be repaired and can rejoin the system with stable storage intact. Two
processors may deliver different sets of messages, when one of them has
failed or when they are members of different components, but they must not
deliver messages inconsistently. In particular, if processor p delivers message
1 before p delivers message 2 then processor q must not deliver message
m before q delivers message m 1.
Extended virtual synchrony requires that the properties of delivery in
agreed and safe order must be satisfied. If processor p delivers message m as
safe in configuration C, then every processor in C has received m and will
deliver m before it installs a new regular configuration, unless that processor
fails. This is achieved by installing a transitional configuration with a reduced membership, within which any remaining messages from the prior
configuration are delivered, while honoring the agreed and safe delivery
guarantees. Thus, Totem delivers two Configuration Change messages, the
first to introduce a smaller transitional configuration and the second to
introduce the new regular configuration. In Moser et al. [ 1994a] we demonstrate that virtual synchrony can be implemented on top of the more-general
extended virtual synchrony property provided by Totem.
Proofs of correctness for the Totem single-ring protocol based on the service
properties defined above can be found in Amir et al. [ 1994].
Note, however, that a definition based on pairwise delivery of messages does not suffice. We
must ensure that the global delivery order has no cycles of any length.
ACM Transactions on Computer Systems, Vol. 13, No. 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
321
5. THE TOTAL-ORDERING PROTOCOL
The Totem single-ring ordering protocol provides agreed and safe delivery of
messages within a broadcast domain. Imposed on the broadcast domain is a
logical token-passing ring. The token controls access to the ring; only the
processor in possession of the token can broadcast a message. A processor can
broadcast more than one message for each visit of the token, subject to the
constraints imposed by the flow control mechanisms described in Section 8.
When no processor has a message to broadcast, the token continues to
circulate. Each processor has a set of input buffers in which it stores
incoming messages. The flow control mechanisms avoid overflow of these
input buffers.
Each message header contains a sequence number derived from a field of
the token; thus, there is a single sequence of message sequence numbers for
all processors on the ring. Delivery of messages in sequence number order is
agreed delivery. Safe delivery uses an additional field of the token, the am
field, to determine when all processors on the ring have received a message.
We now describe the Totem single-ring ordering protocol with the assumptions that the token is never lost, that processor failures do not occur, and
that the network does not become partitioned; however, messages may be
lost. In Section 6 we relax these assumptions and extend the protocol to
handle token loss, processor failure and restart, and network partitioning
and remerging.
5.1 The Data Structures
Regular Message. Each regular message contains the following fields:
—sender id: The identifier of the processor originating the message.
ring_td: The identifier of the ring on which the message was originated,
consisting of a ring sequence number and the representative’s identifier.
—seq: A message sequence number.
—conf id: 0.
contents: The contents of the message.
The ring id, seq, and conf id fields comprise the identifier of the message.
Regular Tohen. To broadcast a message on the ring, a processor must hold
the regular token, also referred to as the token. The token contains the
following fields:
—I ype: Regular.
—ring id: The identifier of the ring on which the token is circulating,
consisting of a ring sequence number and the representative’s identifier.
toben seq: A sequence number which allows recognition of redundant
copies of the token.
seq : The largest sequence number of any message that has been broadcast
on the ring, i.e., a high-water mark.
ACM Transactions on Computer Systems, Vol 13, No. 4, November 1995
322
.
Y Amir et al
—am: A sequence number (all-received-up-to) used to determine if all processors on the ring have received all messages with sequence numbers less
than or equal to this sequence number, i.e., a low-water mark.
am id: The identifier of the processor that set the am to a value less than
the seq.
—rtr: A retransmission request list, containing one or more retransmission
requests.
The seq field of the token provides a single total order of messages for all
processors on the ring. The am field is the basic acknowledgment mechanism
that determines if a message can be delivered as safe.
Local Variables. Each processor maintains several local variables, includ—My tolzen seq : The value of the tohen seq when the processor forwarded
the token last.
—my am: The sequence number of a message such that the processor has
received all messages with sequence numbers less than or equal to this
sequence number.
—my am count: The number of times that the processor has received the
token with an unchanged am and with the am not equal to seq.
—new message queue: The queue of messages originated by the application
waiting to be broadcast.
—received pmessage queue: The queue of messages received from the communication medium waiting to be delivered to the application.
A processor updates my tohen seq and my _ am count as it receives tokens and updates my am as it receives messages. When it transmits a
message, the processor transfers the message from new message queue to
received message queue. When it determines that a message has become
safe, the processor no longer needs to retain the message for future retransmission and, thus, can discard the message from received message queme.
5.2 The Protocol
On receipt of the token, a processor empties its input buffer completely,
either delivering the messages or retaining them until they can be delivered
in order. It then broadcasts requested retransmissions and new messages,
updates the token, and transmits the token to the next processor on the ring.
For each new message that it broadcasts, the processor increments the seq
field of the token and sets the sequence number of the message to this value.
Each time a processor receives the token, it compares the am field of the
token with my am. If my _ar¿i is smaller, the processor replaces the am
with my am and sets the am id field of the token to its identifier. If the
areaid equals the processor’s identifier, it sets the are to my am. (In this
case, the processor had set the am on the last visit of the token, and no other
processor changed the am during the token rotation.) Whenever the seq and
ACM Transactions on Computer Systems, Vol 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
323
the am are equal, the processor increments are and my are in step with
seq and sets the am id Lo a null value (a value that is not the id of any
processor).
If the seq field of the token is greater than its my am, the processor has
not received all of the messages that have been broadcast on the r'ing, so it
augments the rtr field of the token with the missed messages. If the processor has received messages that appear in the rtr field, it retransmits those
messages before broadcasting new messages. When it retransmits a message,
the processor removes the sequence number of the message from the rtr field.
If a processor has received a message m and has delivered every message
with sequence number less than that of m, and if the originator of m
requested agreed delivery, then the processor delivers m in agreed order. If,
in addition, the processor forwards the token with the am field greater than
or equal to the sequence number of m on two successive rotations and if the
originator of m requested safe delivery, then rn is safe, and the processor
delivers m in safe order.
The total-ordering protocol is, of course, unable to continue when the token
is lost; a token retransmission mechanism has been implemented to reduce
the probability of token loss. Each time a processor forwards the token, it sets
a Token Retransmission timeout. If a processor receives a regular message or
the token, it cancels the Token Retransmission timeout. On a Token Retransmission timeout, the processor retransmits the token to the next processor on
the ring and then resets the timeout.
The tohen seq field of the token provides recognition of redundant tokens.
A processor accepts the token only if the tohen seq field is greater than
my tohen seq, otherwise the token is discarded as redundant. If the token is
accepted, the processor increments tohen seq and sets my toheii seq to the
new value of token seq. Token retransmission increases the probability that
the token will be received at the next processor on the ring and incurs
minimal overhead. The membership protocol described in the next section
handles the loss of all copies of the token.
6. THE MEMBERSHIP PROTOCOL
The Totem single-ring ordering protocol is optimized for high performance
under failure-free conditions but depends on a membership protocol to resolve
processor failure, network partitioning, and loss of all copies of the token. The
membership protocol detects such failures and
reconstructs
a
new
ring
on
which the total-ordering protocol can resume operation.
The objective of the membership protocol is to ensure consensus, in that
every member of the configuration agrees on the membership of the configuration, and termination, in that every processor installs some configuration
with an agreed membership within a bounded time unless it fails within that
time. The membership protocol also generates a new token and recovers
messages that had not been received by some of the processors when the
failure occurred.
ACM Transactions on Computer Systems, Vol 13, No. 4, November 1995.
324
•
Y. Amir et at.
6.1 The Data Structures
Join Message. A Join message contains two items: a set of identifiers of
processors being considered for membership in the new ring by the processor
broadcasting the Join message and a set of identifiers of processors that it
regards as having failed. These are contained in the proc set and [ail set
fields of the Join message defined below:
—I ype: Join.
—sended id: The processor identifier of the sender.
—proc set: The set of identifiers of processors that the sender is considering
for membership in a new ring.
—[ail set: The set of identifiers of processors that the sender has determined to have failed.
ring seq: The largest ring sequence number of a ring id known to the
sender.
Join messages differ from regular messages in that a processor may broadcast a Join message without holding the token; moreover, Join messages are
not retransmitted or delivered to the application.
When a processor broadcasts a Join message, it is trying to achieve
consensus on the proc set and fail set in the Join message. The fail set is
a subset of the proc set. The proc set and /ai/ psef can only increase until a
new ring is installed. The ring seq field allows the receiver of a Join
message to determine if the sender has abandoned a past round of consensus
and is now attempting to form a new membership. It is also used to create
unique ring identifiers.
Configuration Change Message. The membership protocol also uses another special type of message, the Configuration Change message, which
contains the following fields:
—ring id: The identifier of the regular configuration if this message initiates a regular configuration or the identifier of the preceding regular
configuration if this message initiates a transitional configuration.
—seq : 0, if this message initiates a regular configuration, or the largest
sequence number of a message delivered in the preceding regular configuration if this message initiates a transitional configuration.
—confp id: The identifier of the old transitional configuration from which the
processor is transitioning if this
message
initiates
a regular
configuration
or the identifier of the transitional configuration to which the processor is
transitioning if this message initiates a transitional configuration.
—memh: The membership of the configuration that this message initiates.
The ring id, seq, and com/_ id fields comprise the identifier of the message. A
Configuration Change message may describe a change from an old configuration
to a transitional configuration or from a transitional configura- tion to a new
configuration. Configuration Change messages differ from
ACM Transactions on Computer Systems, Vol 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
325
regular messages in that they are generated locally at each processor and are
delivered directly to the application without being broadcast.
Commit Tohen. Each new ring is initiated by one of its members, the
representative, which is a processor chosen deterministically from the members of the ring. The representative generates a Commit token that differs
from the regular token in that its t ype field is set to Commit, and it contains
the following fields in place of the rtr field:
mem6 _/isi: A list containing a processor identifier, old ring ring id, old
ring my am, high delivered, and received fig fields for each member of
the new ring.
memb index: The index of the processor in memfi _(tsp that last forwarded
the Commit token.
For each proeessor identifier in memó _//sf, the high deliuered field is the
largest sequence number of a message that the proeessor has delivered on the
old ring. The receiued fÍ g field indicates that the processos has already
received all of the messages possessed by other processors in its transitional
configuration.
On the first rotation of the Commit token around the new ring, each
proeessor sets its old ring id, old ring my am, high deliuered, and receiued flg fields. It also updates memb index. The remaining fields are set
by the representative when it creates the Commit token.
Local Variables. Each processor maintains several local variables, including:
iny ring id: The ring identifier in the most recently accepted Commit
token.
—my pmemb: The set of identifiers of processors on the processor’s current
ring.
—my men› qmemb: The set of identifiers of processors on the processor’s new
ring.
my qproc set: The set of identifiers of processors that the processor is
considering for membership of a new ring.
my fail set: The set of identifiers of processors that the processor has
determined to have failed.
consensus: A boolean array indexed by processors and indicating whether
each processor is committed to the processor’s my proc qset and
my fail set.
Stable storage is required to store a processor’s ring sequence number,
my ring id. seq. This stable storage is read only when a processor recovers
from a failure and is written when a configuration change occurs.
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995.
326
Y Amir et at
Delrmr Confjgurallon Change messages
Operational
Gather
RBCOV8F/
recen/ed
Consensus and Repesentatl'm)
or Corr¥Ttrt token receW
fecehmd
Fig L°
The finite-state machine for the membership protocol.
6.2 The Protocol
The membership protocol can be described by a finite-state machine with
seven events and four states, as illustrated in Figure 2.
6.2.1 the Seven Euetits of the Mem bership Protocol.
Receiving a Foreign Message. Such a message was broadcast by a processor that is not a member of the receiving processor’s ring and activates the
membership protocol in the receiving processor.
fieceiriiig a join Message. This informs the receiver of the sender’s proposed membership and may cause the receiver to enlarge its my qproc set or
my fail set.
fieceit'ing a Commit Notes. On the first reception of the Commit token, a
member of the proposed new ring updates the Commit token. On the second
reception, it obtains the updated information that the other members have
supplied.
Tohen Loss Timeout. This timeout (1) indicates that a processor did not
receive the token or a regular message within the required amount of time
and (2) activates the membership protocol.
Robert Retransmission timeout. This timeout indicates that a processor
should retransmit the token because it has not received the token or a
regular message broadcast by another processor on the ring.
Join Timeout. This timeout is used to determine the interval after which
a Join message is rebroadcast in the Gather or Commit states.
ACM Transactions on Co mputer Systems, Vol 13. No 4„ November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
327
Consensus Timeout. This timeout indicates that a processor participating
in the formation of a new ring failed to reach consensus in the required
amount of time.
Recognizing Failure to Receive. If the am field has not advanced in
several rotations of the token, a processor determines that the processor that
set the am has repeatedly failed to receive a message.
6.2.2 The Four States o[ the Mem bersliip Protocol.
Operational Rstate. In the Operational state (Figure 3), messages are
broadcast and delivered in agreed or safe order, as requested by the originator of the message. Since processor failure and network partitioning result in
loss of the token, the mechanism for detecting failures is the Token Loss
timeout. When the Token Loss timeout expires or when a processor receives a
Join message or a foreign message, the processor invokes the protocol for the
formation of a new ring and shifts to the Gather state (Figure 7).
A processor buffers a message for retransmission until the message has
been acknowledged by the other processors on the ring. If a processor fails
repeatedly to receive a particular message, then the other processors buffer
that message and all subsequent messages until that message is received.
Consequently, a processor cannot be allowed to fail to receive messages
indefinitely. When its local variable my am count reaches a predetermined
constant, a processor determines that some other processor has failed to
receive, namely, the processor whose identifier is in the am id field of the
token. It then includes that processor’s identifier in its my
fail set, shifts to
the Gather state, and broadcasts a Join message.
Gather lstate. In the Gather state (Figure 4), a processor collects information about operational processors and failed processors and broadcasts that
information in Join messages. When a processor receives a Join message, the
processor updates its my proc set and iny fail set. If its iny qproc set
and my fail set have changed, the processor abandons its previous membership, broadcasts a Join message containing the updated sets, and resets
the Join and Consensus timeouts. The Join timeout is shorter than the
Consensus timeout and is used to increase the probability that Join messages
from all currently operational processors are received during a single round
of consensus.
A processor reaches consensus when it has received Join messages with
proc set and [ail set equal to its my proc set and icy _/ni/ set, respectively, from every processor in the difference of those sets, i.e., my pproc set
— m y fail set. It then no longer accepts incoming Join messages. A processor is also considered to have reached consensus when it has received a
Commit token with the same membership as my proc set — in y friil set.
The processors in that difference constitute the membership of the proposed
new ring. If the Consensus timeout expires before a processor has reached
consensus, it adds to my /t2t/ _sef all of the processors in m y proc sef from
ACM Transactions on Computer Systems, Vol 13, No 4, November 1995
328
Y. Amir et al.
Regular token received:
if token.ringed / my ningñd or token.token.seq < m my.token seq then
discard token
else
determine how many messages I’m allowed to broadcast by flow control
update retransmission requests
broadcast requested retransmissions
subtract retransrriissions from allowed to broadcast
for as many messages as allowed to broadcast do
get message from new message ueue
increment token.seq
set message header fields and broadcast message
update mymru
if my mm ( token.aru or my md = token.aruud or token.aruñd — invalid then
token.aru := my mru
if token.aru = token.seq then
token.aruñd := invalid
else token.aruñd :— my id
if token.am = aru in token on last rotation and token.aruñd / invalid then
increment my amount
else my amount := 0
if my-arumount
faiLto.icvmonst and token.aru-id = my id then
add token.aru.id to mydailme. t
call ShiftNo-Gather
else
update token.rtr and token 8ow control fields
increment token.token-seq
forward token
reset Token Loss and Token Retransmission timeouts
deliver messages that satisfy their deli very criteria
Regular message received:
cancel Token Retransmission timeout if set
add message to receivewessage@ueue
update retransmission request list
update myor
deliver messages that satisfy their delivery criteria
Token Loss timeout expired:
call Shift to.Gather
Token Retransmission timeout expired:
retransmit token
reset Token Retransmission timeout
Foreign message from processor q received:
add message.senderñd to my.procmet
call Shift.to-G ather
Join message from processor q recei ved:
same as in Gather state except call Shift Jo.Gather regardless of Join message’s content
Commit token received:
discard the Commit token
Fig. 3. The psoudocode executed by a processor in the Operational state.
ACM Transactions on Computer Systems, Vo] 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
329
Regular token or regular message received:
same as in Operational state
Foreign message from processor q received:
if q not in my.procset then
add message.senderñd to my-proc set
call Shift to-Lather
Join message from processor q received:
if my.procmet — message.procmet and my Jailmet m message.failzet then
consensus(q] := true
if for all r in my.procset — my Jailmet
consensus[r] m true and my id — smallest id of my.procmet — my Jailset then
token.ringed.seq :m (maximum of myningñd.seq and Join ringmeqs) -{- 4
token-memb := my.procmet — my Jailmet
call Shift to-Commit
else return
else if message.proc-set subset of my.procmet and
message.failmet subset of my failset then return
else if q in my failset then return
else
merge message.proc-set into my -proc set
if my id in message.fai1zet then
add message.senderñd to my.failmet
else
merge message.failmet into my Jailnet
call S
I hift o.G at her
Commit token received:
if my-proc set — my failset = token.memb and token.seq
call S
I hift o.Commit
my ringid then
Join timeout expired:
broadcast Join message with my crochet, my dailmet, seq — my ningñd.seq
set Join timeout
Consensus timeout expired:
if consensus not reached then
for all r such that consensus[r] / true do
add r to my fail.set
call Shift.to.Gather
else
for all r do
consensus[r] :— false
consensus[myñd} :m true
set Token Loss timeout
Token Loss timeout expired:
execute code for Consensus timeout expired in Gather state
call Shift.to-G ather
Fig. 4. The pseudocode executed by a processor in the father state.
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1955
330
•
Y. Amm et al.
which it has not received a Join message with proc set and [ail set equal to
its own sets, returns to the Gather state, and tries to reach consensus again
by broadcasting Join messages.
When a processor has reached consensus, it determines whether it has the
lowest processor identifier in the membership and, thus, is the representative
of the proposed new ring. If it is the representative, the processor generates a
Commit token. It determines the ring id of the new ring, which is composed
of a ring sequence number equal to four plus the maximum of the ring
sequence numbers in any of the Join messages used to reach consensus and
its own ring sequence number. (The sequence number which is two less than
that of the new ring is used as the transitional configuration identifier.) The
representative also determines the memb list of the Commit token, which
specifies the membership of the new ring and the order in which the token
will circulate, with the representative placed first. It then transmits the
Commit token and shifts to the Commit state (Figure 7).
When a processor other than the representative has reached consensus, if
it has not received the Commit token, the professor sets the Token Loss
timeout, cancels the Consensus timeout, and continues in the Gather state
waiting for the Commit token. If the Token Loss timeout expires, the processor returns to the Gather state and tries to reach consensus again. On
receiving the Commit token, the processor compares the proposed membership, given by the memb list field of the Commit token, with my proc set
— m y fail set. If they differ, the processor discards the Commit token,
returns to the Gather state, and repeats the attempt to form a new ring. If
they agree, the processor extracts the ring id for the new ring from the
Commit token, sets the fields in its entry of inemb list, increments the
mem6 index field, and shifts to the Commit state.
Commit Strife. In the Commit state (Figure 5), the first rotation of the
Commit token around the proposed new ring confirms that all members in
the memb list of the Commit token are committed to the membership. It
also collects information needed to determine correct handling of the messages from the old ring that still require retransmission when the membership protocol was invoked.
The second rotation of the Commit token disseminates the information
collected in the first rotation. On receiving the Commit token for the second
time, a processor shifts to the Recovery state (Figure 7) and writes the
sequence number, mv ring id. seq, for its new ring into stable storage.
Recovery Citate. In the Recovery state (figure 6), when the representative
receives the Commif token after its second rotation, it converte the Commit
token into the regular token for the new ring, replacing the memò list and
memb index fields by the rtr field. At this point, the new ring is formed but
not yet installed, and the execution of the recovery protocol begins.
The processors use the new ring to retransmit messages from their old
rings that must be exchanged to maintain agreed and safe delivery guarantees. In one atomic action, each processor delivers the exchanged messages to
ACM Transactions on Computer Systems, Vol 13, No. 4, November 1995.
The Totem Single-Ring Ordering and Membership Protocol
•
331
Regular token received: discard token
Regular message received: same as in Operational state
Foreign message received: discard message
Join message from processor q received:
if q in mymewwemb and message.ring>eq T myningñd.seq then
execute code for receipt of Join message in Gather state
call ShiftJo-Gather
Commit token received:
if token.seq = myningñd.seq then
call Shift.to.Ttecovery
Join timeout expired: same as in Gather state
Token Loss timeout expired: call Shift to.Gather
Fig. S. The pseudocode executed by a processor in the Commit state.
the application along with Configuration Change messages, installs the new
ring, and shifts to the Operational state (Figure 7). The recovery protocol is
described in more detail in Section 7.
When a processor starts or restarts, it first forms and installs a singleton
ring, consisting of only the processor itself. The processor then broadcasts a
Join message containing the value of my ring id. seq, obtained from its
stable storage, and shifts to the Gather state.
The membership protocol described above is guaranteed to terminate in
bounded time because proc set and fail set increase monotonically within
a fixed finite domain, because timeouts bound the time that a processor
spends in each of the states, and because an additional failure (which
increases the [atl set) is forced to prevent the repetition of a proposed
membership. In the base case, the membership, proc set — [ail set, consists
of a single processor identifier.
7. THE RECOVERY PROTOGOL
The objective of the recovery protocol is to recover the messages that had not
been received when the membership protocol was invoked and to enable the
processors transitioning from the same old configuration to the same new
configuration to deliver the same messages from the old configuration. The
recovery protocol also provides message delivery guarantees, and thus maintains extended virtual synchrony, during recovery from failures. Maintenance
of extended virtual synchrony is essential to applications such as fault-tolerant
distributed databases.
ACM Transactions on Computer Systems, Vol. 13, No. 4, November 1995.
332
Y. Amir et at
Regular token received:
same as in Operational state except get messages from retranswessage.queue
instead of neivwessage.queue and before forwarding the token execuLe:
if retransmessage queue is not empty then
if token.retrans Mtg — false then
token.retrans Mfg := true
else if token.retransAg = true and I set it then
token.retransAg := false
if token.retransAg = false then
increment my netransJg count
else my netransJg.count :- 0
if my retrans lgcount - 2 then
my ñnstallmeq :m token.seq
if my retrans&g.count
2 and my mru T my Inst allseq and my neceivedAg - false then myreceived kg := true
my Aelivermemb :— my Iransmemb
if my retrans&gmount
3 and token.aru
my installseq on last two rotations then
call Shift to.Operational
Regular message received:
reset Token Ret ransmission t imeout
add message to recei vemessage queue
update my mru
if ret.ransmit ted message from my old ring then
add to recei vemessage queue for old ring
remove message from retransmessage queue for old ring
Foreign message from processor q received:
discard message
Join message from processor q received:
if q in my new memb and message.ringmeq > myningñd.seq then
execute code for receipt of Join message in Commit state
execute code for Token Loss timeout expired in Recovery state
Commit token received by new representative:
convert Cornmi t token to regular token
if retransmessage queue is not empty then
token.retrans Mtg := true
else token.retransAg := false
forward regular token
reset Token Loss and Token Retransmission timeouts
Token Loss timeout expired:
discard all new messages received on the new ring
empty retransmessage queue
determine current old ring aru (it may have increased)
call Shift.to-G ather
Token Retransmission timeout expired:
retransmit token
reset Token Retransmission timeout
Fi g 6 The pseudocode executed by a processor in the Recovery state
ACM Transactions on Computer Systems, Vol 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
333
ShiftJo.Gather:
broadcast Join message containing myyrocmet, my Jailmet, seq = my.ringed.seq
cancel Token Loss timeout and Token Retransmission timeouts
reset Join and Consensus timeouts
for al l t in my.procset do
consensus[r] := false
consensus[myñd) := true
state := Gather
Shift do Commit:
update membJist in Commit token with my ningñd, my mru, my received kg, my..high.delivered
myningñd := Commit token ringed
forward Commit token
cancel Join and Consensus timeouts
reset Token Loss and Token Retransmission timeouts
state := Commit
Shift JoWecovery:
forward Commit token for the second time
mynew wemb := membership in Commit token
my Iranswemb := members on old ring transitioning to new ring
if for some processor in my-transwemb receivedAg = false then
mydeliver.memb := my transwemb
low.ringmru :m lowest am for old ring for processors in my Aeliverwemb
ilighningfielivered :m highest sequence number of message delivered for old ring
by a processor in my Aeliverwemb
copy all messages from old ring wi th sequence number lowningmru
into retranswessage.queue
mymru := 0
mymrumount :m 0
reset Token Loss and Token Retransmission timeouts
state := Recovery
Shift Jo.Operational:
deliver messages deliverable on old ring (at least up through high.ring.delivered)
deliver membership change for transitional configuration
deliver remaining messages from processors in my.deliverwemb in transitional configuration
deliver membership change for new ring
mywemb := mynew -memb
myprocset := my wemb
myJailmet := empty set
state := Operational
Fig. 7. The pseudocode executed by a processor when shifting between states.
7.1 The Data Structures
The recovery protocol uses the following data structures in addition to those
above.
Regular Tohen Field. The regular token has the following additional field:
—retrans fig : A flag that is used to determine whether there are any
additional old ring messages that must be rebroadcast on the new ring.
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995
334 - Y. Amir et at.
Local Vuriables. The recovery protocol also depends on the following local
variables:
my new inemb : The set of identifiers of processors on the processor’s new
my traas merub: The set of identifiers of processors that are transitioning from the processor’s old ring to its new ring.
—iny deliver meinb: The set of identifiers of processors whose messages the
processor must deliver in the transitional configuration.
—low ring am: The lowest am for the old ring of all the processors in
my deliver pmemb.
higli ring delivered: The highest message sequence number such that
some processor delivered the message with that sequence number as safe
on the old ring.
—iny install seq : The highest new sequence number of any old ring message transmitted on the new ring.
retrans message queue: A queue of messages from the old ring awaiting
retransmission to ensure that all remaining processors from the old ring
have the same set of messages.
—m y retraas count: The number of successive token rotations in which the
processor has received the token with retraus flg false.
7.2 The Protocol
A processor executing the recovery protocol takes the following steps:
(1) Exchange messages with the other processors that were members of the
same old ring to ensure that they have the same set of messages broadcast on the old ring but not yet delivered.
(2) Deliver to the application those messages that can be delivered on the old
ring according to the agreed or safe delivery requirements, including all
messages with old ring sequence numbers less than or equal to
high ring delivered.
(3) Deliver the first Configuration Change message, which initiates the
transitional configuration.
(4) Deliver to the application further messages that could not be delivered in
agreed or safe order on the old ring (because delivery might violate the
requirements for agreed or safe delivery) but that can be delivered in
agreed or safe order in the smaller transitional configuration.
(5) Deliver the second Configuration Change message, which initiates the
new regular configuration.
(6) Shift to the Operational state.
Steps (2) through t6) involve no communication with other processors and are
performed as one atomic action. The pseudocode executed by a processor to
complete these steps is given in Figure 6.
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
335
Exchange o[Messages from the Ofd King. In the first step of the recovery
protocol, each processor determines the lowest my arm of any processor from
its old ring that is also a member of the new ring. The processor then
broadcasts on the new ring every message for the old ring that it has received
and that has a sequence number greater than the lowest in y am. The
retransp flg field in the token is used to determine when all old ring messages
have been retransmitted. This exchange of messages ensures that each
processor receives as many messages as possible from the old ring.
Each such message is broadcast with a new ring identifier and encapsulates the old ring message with its old ring identifier. The new ring sequence
numbers of these messages are used to ensure that messages are received;
the old ring sequence numbers are used to order messages as messages of the
old ring. Messages from an old ring retransmitted on the new ring are not
delivered to the application by any processor that was not a member of the
old ring. No new messages originated on the new ring are broadcast in the
Recovery state.
Delivery of Messages on the Old Ri eg. For each message, the processor
must determine the appropriate configuration in which to deliver the message. A processor can deliver a message in agreed order for the old ring if it
has delivered all messages originated on that ring with lower sequence
numbers. A processor can deliver a message in safe order for the old ring (1) if
it has received the old ring token twice in succession with the am at least
equal to the sequence number of the message or (2) if some other processor
has already delivered the message as safe on the old ring as indicated by
ñig/z_ring delivered.
The processor sorts the messages for the old ring that were broadcast on
the new ring into the order of their sequence numbers on the old ring and
delivers messages in order, until it encounters a gap in the sorted sequence or
a message requiring safe delivery with a sequence number greater than
high ring delivered. Messages beyond this point cannot be delivered as safe
on the old ring but may be delivered in a transitional configuration.
The processor then delivers the first Configuration Change message, which
contains the identifier of the old regular configuration, the identifier of the
transitional configuration, and the membership of the transitional configuration. The membership of the transitional configuration is my trans qmemb.
The identifier of the transitional configuration has a sequence number one
less than the sequence number of the new ring, and the representative’s
identifier is chosen deterministically from my _trans pmemb.
Delivery of Messages in the Transitional Configuration. Following the first
Configuration Change message, the processor delivers in order all remaining
messages that were originated on the old ring by processors in
my deliver memb. The processor then delivers a second Configuration
Change message, which contains the identifier of the transitional configuration, the identifier of the new regular configuration, and the membership of
the new regular configuration. The processor then shifts to the Operational
state.
ACM Transactions on Computer Systems, Vol. 13, No. 4, November 1995.
336
•
Y. Amir et al.
Note that some messages cannot be delivered on the old ring or even in the
transitional configuration because delivery of those messages might violate
causality. Such messages follow a gap in the message sequence. For example,
if processor p originates or delivers message m 1 before it originates message
m 2 , fi nd processor q received m 2 but did not receive m in the message
exchange, then q cannot deliver m because causality would be violated.
Here p is not in the same transitional configuration as q because, if it were,
then q would have received all of the messages originated by p before or
during the message exchange.
Failure of Recovery. If the recovery fails while the recovery protocol is
being executed, some processors may have installed the new ring while others
have not. Prior to installation, a processor’s old ring is the ring of which it
was a member when it was last in the Operational state. Each processor must
preserve its old ring identifier until it installs a new ring.
When a processor delivers a message in safe order in a transitional
configuration, it must have a guarantee that each of the other members of the
configuration will deliver the message before it installs the new ring, unless
that processor fails. If a processor does not install the new ring, it will
proceed in due course to install a different new ring with a corresponding
transitional configuration. It must deliver the message in that transitional
configuration in order to honor the delivery guarantee.
Thus, if a processor finds the received flg in the Commit token set to true
for every processor in my _ trans pmemb, it must retain the old ring messages
originated by members of m y deliuer memb and deliver them as safe in
my trans memb, the transitional confIguration for the new ring that it
actually installs. Note that my trans memb is a subset of my deliuer
_memb and that my trans memb must decrease on successive passes
through the Recovery state before a new ring is installed.
7.3 An Example
As shown in Figure 8, a ring containing processors p, q, r, s, and t
partitions, so that p becomes isolated while q, r, s, and f merge into a new
ring with u and u. Processors q, r, s, and f successfully complete the
recovery protocol and deliver two Configuration Change messages, one to
switch from the regular configuration (p, q, r, s, /} to the transitional configurations (q, r, s, t j and one to switch from the transitional configuration
{q, r, s, t] to the regular configuration {q , r, s, I , u , u .
Processors q, r, s, and t may not be able to deliver all of the messages
originated in the regular configuration [p , q, r, s, I›, because they may not
have received some of the messages from p before p became isolated;
however, it can be guaranteed that they deliver all of the messages originated
by a processor in the transitional configuration {q, r, s, I}. Similarly, processors q, r, s, and f may not be able to deliver a message as safe in the regular
configuration ( p , q , r , s, t j because they may have no information as to
whether p had received the message before it became isolated; however, it
can be guaranteed that they deliver the message as safe in the transitional
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995.
The Totem Single-Ring Ordering and Membership Protocol
•
337
pqrst
UV
regular
configuration
qrstuv
transitional
configuration
Fig. 8. Regular and transitional configurations. The vertical lines represent total orders of
messages, and the horizontal lines represent Configuration Change messages.
configuration {q, r, s, I]. The first Configuration Change message separates
the messages for which delivery guarantees can be provided in the regular
configuration (p, q, r, s, I} from the messages for which delivery guarantees
apply in the reduced transitional configuration {q, r, s, I}.
Extended virtual synchrony does not, of course, solve all of the problems of
maintaining consistency in a fault-tolerant distributed system, but it does
provide a foundation upon which these problems can be solved. Consider, for
example, the set of messages delivered by processor p. Prior to the first
Configuration Change message delivered by p Lo terminate the regular
configuration (p, q, r, s, t j, there are no missing messages. In the transitional
configuration p delivers all remaining messages originated by itself and
other messages that have become safe. There may, however, be messages
broadcast by processors q, r, s, and t that are not available to, and are not
delivered by, processor p. After the second Configuration Change message, p
is a member of the regular configuration (p], and p does not deliver messages from the other processors.
When p rejoins the other processors in some subsequent configuration, the
application programs must update their states, using application-specific
algorithms, to reflect activities that were not communicated while the system
was partitioned. The Configuration Change messages warn the application
that a membership change occurred, so that the application programs can
take appropriate actions based on the membership change. Extended virtual
synchrony guarantees a consistent order of message delivery, which is essential if the application programs are to reconcile their states following repair of
a failed processor or remerging of a partitioned network.
8. THE FLOW CONTROL MECHANISM
The Totem protocol is designed to provide high performance under high load.
The performance measures we consider are throughput (messages ordered
per second) and latency (delay from message origination to delivery in agreed
or safe order). Effective flow control is required to achieve the desired
performance.
ACM Transactions on Computer Systems, Vol 13, No. 4, November 1995
338
•
Y. Amir et al.
With point-to-point communication, positive acknowledgment protocols,
such as the sliding-window protocol, have been refined to provide excellent
flow control. However, with broadcast and multicast communication, positive
acknowledgment protocols result in excessive numbers of acknowledgments.
Rate-controlled protocols have attracted attention recently but have the
disadvantage for broadcast and multicast communication that the transmission rate must be set for each processor individually rather than for the
multicast group. With bursty communication, the maximum transmission
rate for a processor must be set to a value that is unacceptably low, even
when other processors have few messages to transmit.
A basic characteristic of reliable, ordered, broadcast and multicast protocols is that the rate of broadcasting messages cannot exceed the rate at which
the slowest processor can receive messages. At higher rates of broadcasting,
the input buffer of the slowest processor will become full, and messages will
be lost. In our experience, this is the primary cause of message loss. Retransmission of lost messages increases the message traffic and reduces the
effective transmission rate.
The Totem single-ring protocol uses a simple flow control mechanism to
control the maximum number of messages broadcast during one token rotation. If a processor is unable to process messages at the rate at which they are
broadcast, one or more messages will be in its input buffer when the token
arrives. Before processing the token and broadcasting the messages, a processor must empty its input buffer. Thus, the rate of broadcasting messages is
reduced to the rate at which messages can be handled by the slowest
processor. If the maximum number of messages broadcast during any one
token rotation is limited by the size of each processor’s input buffer, then
input buffer overflow cannot occur.
When the rate of broadcasting is not equally spread across all processors,
the protocol can be modified to allow the token to visit more than once per
token rotation those processors that have the highest transmission rates and
to allow those processors to transmit more messages on each visit. To
minimize the latency, the rate at which the token visits a processor should be
approximately proportional to the square root of the rate at which the
processor broadcasts [Boxma et a1. 1990).
8.1 The Data Structures
Regular Tohen Fields. The flow control mechanism depends on two fields
of the regular token:
[cc: A count of the number of messages broadcast by all processors during
the previous rotation of the token.
bachlog : The sum of the number of new messages waiting to be transmitted by each processor on the ring at the time at which that processor
forwarded the token during the previous rotation.
ACM Transactions on Computer Systems, Vol. 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
339
Flow Control Constants. The flow control mechanism also depends on two
global constants:
—window size: The maximum number of messages that all processors are
allowed to broadcast in any token rotation.
max messages: The maximum number of messages that each processor is
allowed to broadcast during one visit of the token.
These constants can be determined analytically from the number of processors and their characteristics, or they can be negotiated during ring formation. The constant max pmessages may be different for different processors.
Local Variables.
Each processor maintains the following local variables:
my trc: The number of messages broadcast by this processor on
rotation of the token (my this rotation count).
my qpbl: The number of new messages waiting to be transmitted by
processor when it forwarded the token on the previous rotation
previous backlog).
my tbl: The number of new messages waiting to be transmitted by
processor when it forwards the token on this rotation (my this backlog).
this
this
(my
this
The values of my p6l and my tbl are limited by the amount of buffer space
available for messages awaiting transmission.
8.2 The Algorithm
The value of my trc, the number of messages broadcast by this processor on
this token rotation, is subject to the following constraints:
iny trc max messages: The number of messages broadcast by this processor must not exceed the maximum number it is allowed to broadcast
during one visit of the token.
my trc window size fcc: The number of messages broadcast by this
processor must not exceed the window size minus the number of messages
broadcast in the previous rotation of the token.
—my trc window st•<e X my tbl/(bachlog + my tbl — my pbl): The
number of messages broadcast by this processor must not exceed its fair
share of the window size, based on the ratio of its backlog to the sum of the
backlogs of all the processors as they released the token during the
previous rotation.
The backlog mechanism achieves a more-uniform transmission rate for individual processors under high, but not overloaded, traffic conditions than does
the simple window-size mechanism used by FDDI.
9. IMPLEMENTATION AND PERFORMANCE
The Totem single-ring ordering and membership protocol has been implemented in the C programming language on a network of Sun 4/IPC workstations connected by an Ethernet. The implementation uses only standard
Unix features and is highly portable. It has been transferred from our Sun
ACM Transactions on Computer Systems, Vol 13, No 4, November 1995
•
340
Y. Amir et al.
80
1300
MveSun 4dPC Proceeeom
1200
60
°
1100
.
w
1000
O
900
O 800
{}j 700
•
broadcast equally
er'”*
* 600
soo
0
.00
400
600
800 1000 1200 1400
Message Size in Bytes
o
0
200
400
600
800
1000
Messages Originated per Second
Fig 9 To the left, the throughput as a function of message size. To the right, the latency to
agreed and safe delivery as a function of load.
workstations to DEC and SGI workstations and has worked with little
modification.
The implementation uses the UDP broadcast interface in the Unix operating system SunOS 4.1.1 with Sun’s default kernel allocations. One UDP
socket is used for all broadcast messages, and a separate UDP socket is used
by each processor to receive the token from its predecessor on the ring. The
input buffers are a combination of the buffers managed by the physical
Ethernet controller and those managed by the Unix UDP service. Four bytes
of stable storage are required to store the ring sequence number.
We have measured the performance of our implementation on a network of
five Sun 4/IPC workstations, each ready to broadcast at all times with
minimal extraneous load on the processors and on the Ethernet. For each
measurement, the wiodor size and ma.x pmessages were adjusted to the
maximum values for which message loss is negligible in order to maximize
throughput.
The throughput was measured for equal numbers of messages broadcast by
all processors on the ring and for all messages broadcast by a single processor. As the left graph of Figure 9 shows, there was little difference in the
throughput. With 1024-byte messages, more than 800 messages are ordered
per second. For smaller messages, over 1000 messages are ordered per
second. The highest prior rates of message ordering for reliable, totally
ordered message delivery for 1024-byte messages are about 300 messages per
second for the Transis system using the same equipment and for the Amoeba
system using equipment of similar performance.
We have also investigated the latency from origination to delivery of a
message in agreed and safe order; a detailed analysis can be found in Moser
and Melliar-Smith [ 1994]. The right graph of Figure 9 shows the mean
latency to agreed and safe delivery for Poisson arrivals at lower, more-typical
ACM Transactions or Computer Systems, Vol. 13, No 4, November 1995
The Totem Single-Ring Ordering and Membership Protocol
•
341
loads for 1024-byte messages. At low loads (e.g., 400 ordered messages per
second, which is much more than the maximum throughput for prior protocols), the latency to agreed delivery is under 10 milliseconds. Even at 50%S
useful utilization of the Ethernet (625 ordered messages per second), the
latency to agreed delivery is still only about 13 milliseconds. In general, the
latency to agreed delivery is approximately half the token rotation time, and
the latency to safe delivery is approximately twice the token rotation time,
except at very high loads where the latency is dominated by queuing delays
in the buffers.
Other performance characteristics of interest are the time to recover from
token loss and the time to execute the membership protocol and reconfigure
the system when failures occur. With the token retransmission mechanism
enabled, the time to return to normal operation after loss of the token is on
average 16 milliseconds. With the token retransmission mechanism disabled,
loss of the token triggers a Token Loss timeout and forces a complete
reformation of the membership. The time to form a new ring, generate a new
token, recover messages from the old ring, and return to normal operation is
on average the Token Loss timeout plus 40 milliseconds, with the Token Loss
timeout set to 100 milliseconds.
10. CONCLUSION
The Totem single-ring protocol provides fast, reliable, ordered delivery of
messages in a broadcast domain where processors may fail and where the
network may partition. A token circulating around a logical ring imposed on
the broadcast domain is used to recover lost messages and to order messages
on the ring. Delivery of messages in agreed and safe order is provided.
The membership protocol handles processor failure and recovery, as well as
network partitioning and remerging. Extended virtual synchrony ensures
consistent actions by processors that fail and are repaired with stable storage
intact and in networks that partition and remerge. A recovery protocol that
maintains extended virtual synchrony during recovery after a failure has
been provided.
The flow control mechanism avoids message loss due to buffer overflow and
provides significantly higher throughput than prior total-ordering protocols.
Given the high performance of Totem, there is no need to provide a weaker
message-ordering service, such as causally ordered delivery, because totally
ordered agreed delivery can be provided at no greater cost. Moreover, applica-
tions can be programmed more easily and more reliably with totally ordered
messages.
Continuing work on Totem is exploiting the single-ring protocol to provide
more-general services. Agreed and safe delivery services, as well as membership services, are being provided to multiple rings interconnected by gateways. It remains to be investigated whether the exceptional performance of
the Totem single-ring protocol can be sustained when Totem is extended to
multiple rings.
ACM Transactions on Computer Systems, Vol. 13, No. 4, N osein bee 1995
342
•
Y Amir et al
REFERENCES
his, Y, DOLU\ , D., Ks.WtER, S., ANn MAi.in, D.
1992a.
Membership algorithms in broadcast
domains In Proceeds rigs o/' //ie fi/ñ In ternuttona I iVorhshop on Dr stri buted Af gori thins fHaifa,
Israel). Springer-Verlag, Berlin, 292—312.
AMIR, Y , DOLEV, D. , KnHlEIi, S. , AND MALAI, D. 1592b.
Transis: A communication subsystem
for high availability. In Procc•ed Irigs o[ ltte IEEE 2z*nd An niial In ternutionci I Gympoor uin on
Fu it It-Tolera nt C'omputt ii g ( Boston, Mass). IEEE, New York, 76—84
Ser In, Y , MO>“ER, L E. , MELLIAR-SrIITH, P. M. , AGARWAI>, D. A. , AND CIA11r'ELLA, P.
1993 Fast
message ordering and membership using a logical token-passing ring. In Proceeds ngs o[ the
IEEE 7‘3/Jt Internuti oncil Con[ere rice on Dt stributed Com p uti ng S ysteni s (Pittsburgh, Pa )
IEEE, New York, 5ñ1—560.
Liu, Y , MOSER, L. E , MELrIAn-SsIITH, P. M. , Ac.ARWAL, D. A , zoo C iARFELL.A, P.
1994.
The
Totem single-ring ordering and membership protocol. Tech Rep. 94— 19, Dept of Electrical and
Computer E n nearing, Univ. of C alifornia, Santa Barbara, C alif. Aug.
BIRhIw K. P AND VAN RENESSE, R. 1994 Re Itable Dr strt butech Co r p uti rig wCth the Is Is Toothi t.
IEEE Computer Society Press, Los Alamitos, Calif
Bo.in , O. J. , I>EVâ’ J. , AND NES"fRA"fE, J. A. 1990. Optimization of pollin g systems. In Per[or-
rna rice '90, Proceed Ings o[ the 14th IFIP WG 7. 3 Internationa I S 5'mposi um on Computen
Performance 5Iodelli ng, Meets tireraeti t a net Ether I uati on (E dinburg, U.K J. North-Holland, Am-
sterdam, 349-361.
Can, J. M.AND McUhICHUK, N F.
lsvst. ’2, 3 (Aug.), 251-273.
1984.
Reliable broadcast protocols ACID Tro us Comput
FI.*CIDER, M. J. , LYNCH, N. A. , AND PATERSON, M S.
1985. Impossibility of distributed consen-
sus Cth one faulty process J. ACM ,3z*, 2 (Apr ), 374—382
KAASFIOEFi, M F. all TANENLAULI, A. S. 1991. Crroup communication in the Amoeba distributed operating system. In Proceed ings o[ the IEEE 11th luternationcil Conference on
Dr stributed Com p iiting !S ysteins (Arlington, Tex 4. IEEE, New York, 882-891.
LAMPORT, L 1978. Time, clocks, and the ordering of events in a distributed system. Com m un.
ACT L*7, 7.1JulyJ, 55fi—565.
MzrLIw-SsIITH, P. M. , MOSER, L. E. , AND AGARWAL, D. A. 199.1.
Ring-based ordering protocols.
In Proceeds nz s of the IEE Internati onal C!on[erence on In[ormati oti En gi iieer ing f Singapores
IEE, Stevenage, Herts, U.K , 882—891
MELLIAIt-SMITH, P. M. , MosEu, L. E. , Ann AGRAWALA, V.
1990. Broadcast protocols for dis-
tributed sytems. IEEE Trans. Pura llel Distri b. !S yst. 1, 1 (Jan.), 17—25.
MISHRA, S. , PETERSON, L. L. , AND NCHLICHTING, R. D. 1991. A membership protocol based on
partial order. In Proceeds ngs o[ the 2iid IFIP WG 10.4 Internuttona I Working Con[erence on
Depen dable Co inputi ng [or Cm 1.1ca I A p p Iicci ttons (Tucson, Ariz.). Springer-Verlag, Wien, Aus-
tria, 309-331.
Mosrn, L. E on MELL1AR-NMITFt, P. M. 1994. Probabilistic bounds on message delivery for the
Totem single-ring protocol. In Procee‹1ings of the IEEE lsth Real -Ttme IS stems !S yinpos i um
lsan Juan, Puerto Rico). IEEE, New York, 288—248.
MOSER, L. E. , Avtts, Y. , MELLIAR-SMITH, P M. ,
AowwAr,
D.
A.
1994a.
Extended
virtual
synchrony. In Proceediii gs o[ the IEEE 14th International C'on[eren ce or Distri huted Com p uttng S›'stems (Posnan, Polandâ. IEEE, New York, 56—65.
MosER, L. E. , MELLIw-SvI4’H, P. M. , AND AGRAWALA, V.
1994b. Processor membership in
asynchronous distributed systems IEEE Trans Parallel Distri b. !S yet. 5, 5 (May), 459—473.
PziznGOu, L. L., BUCHHOLz, N. C., on Scrriceieo. It. D. 1989. Preserving and using context
information in interprocess communication. ACM Trains Com p ut. Syst 7, 3 (Aug.I, 217—246.
RAJ.\GOPALAN, B. wn McKINLEY, P. K. 19h9. A token-based protocol for reliable, ordered
multicast communication In Proceeds rigs o[ the IEEE 8th H ympost um on Reliable Distributed
System s ( Seattle, Wash ). IEEE, New York, 84—93.
'w RENEW.8E, R., HIcKCv, T. M. , AND BIRMAN. K. P 1994. Design and performance of Horus: A
lightweight group communications system Tech Rep. 94—1442, Dept. of Computer Science,
Cornell Univ., IU aca, N.Y. Aug.
Received July 1993; revised August 1994: accepted July 1995
ACM Transactions on Computer Systems, Vol 13, No 4, Novem ber 1995
Download