A Fast Delivery Protocol for Total Order Broadcasting

Li Ou, Xubin He, Christian Engelmann, and Stephen L. Scott

Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505, USA. Email: {lou21, hexb}@tntech.edu
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA. Email: {engelmannc, scottsl}@ornl.gov
Department of Computer Science, The University of Reading, Reading, RG6 6AH, UK

Abstract—Sequencer, privilege-based, and communication history algorithms are popular approaches to implementing total ordering. Among them, communication history algorithms are the most suitable for parallel computing systems, because they provide the best performance under heavy work loads. Unfortunately, the post-transmission delay of communication history algorithms is most apparent when a system is idle. In this paper, we propose a fast delivery protocol to reduce the latency of message ordering. The protocol optimizes the total ordering process by waiting for messages only from a subset of the machines in the group, and by fast acknowledging messages on behalf of other machines. Our test results indicate that the fast delivery protocol is suitable for both idle and heavily loaded systems, while reducing the latency of message ordering.

This research was partially sponsored by the Office of Advanced Scientific Computing Research of the U.S. Department of Energy. The work was performed in part at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725. The work performed at Tennessee Tech University was partially supported by the U.S. National Science Foundation under Grant Nos. CNS-0617528 and SCI-0453438. This work was also partially sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory.

I. INTRODUCTION

Total order broadcasting is essential for group communication services [1]–[3], but agreement on a total order usually bears a performance cost: a message is not delivered immediately after being received, but only after all communicating machines reach agreement on a single total order of delivery. Generally, this cost is measured as the latency of totally ordered messages, from the point a message is ready to be sent to the time it is delivered by the sender machine.

Traditionally, three approaches are widely used to implement total ordering: sequencer, privilege-based, and communication history algorithms [3]. In sequencer algorithms, one machine is responsible for ordering messages on behalf of the other machines in the group. Privilege-based algorithms rely on the idea that senders may broadcast messages only when they are granted the privilege to do so. For example, in a token-based algorithm [4], a token is rotated among the machines of a group, and a machine may send messages only while it holds the token. In communication history algorithms, total order messages can be sent by any machine at any time, without a prior enforced order; the total order is ensured by delaying the delivery of messages until enough communication history information has been gathered from the other machines.

All three types of algorithms have advantages and disadvantages. Sequencer and privilege-based algorithms provide good performance when a system is relatively idle. However, when multiple machines are active and constantly send messages, the latency is limited by the time to circulate the token or to obtain an order number from the sequencer.
Communication history algorithms have a post-transmission delay [3], [5]. To collect enough information, the algorithm has to wait for a message from each machine in the group, and then deliver the set of messages that do not causally follow any other, in a predefined order, for example, by sender ID. The length of the delay is set by the slowest machine to respond with a message. The post-transmission delay is most apparent when the system is relatively idle and the algorithm is waiting for responses from all other machines in the group. In the worst case, the delay may be equal to the interval of heart beat messages from an idle machine. On the contrary, if all machines produce messages and the communication in the group is heavy, the regular messages continuously form a total order, and the algorithm offers the potential for low-latency delivery of total order messages.

In a parallel computing system, multiple concurrent requests are expected to arrive simultaneously. A communication history algorithm is preferred to order requests among multiple machines, since such an algorithm performs well under heavy communication loads with concurrent requests. However, in relatively light load scenarios, the post-transmission delay is high. We propose a fast delivery protocol to reduce this post-transmission delay. The fast delivery protocol forms the total order by waiting for messages only from a subset of the machines in the group, and by fast acknowledging a message if necessary; thus it delivers total order messages quickly. To verify our protocol, we implemented a prototype in the Transis [5] group communication system and applied the protocol to an active/active replication scenario.

Let m_1 and m_2 be two total order broadcasting messages:
1) If m_1 causally precedes m_2, all machines that deliver m_1 and m_2 deliver m_1 before m_2.
2) If m_1 and m_2 are concurrent, and a machine p delivers m_1 before m_2, then any machine q that belongs to the same partition as p delivers m_1 before m_2.

Fig. 1. The definition of the total order broadcasting service.

II. TOTAL ORDER BROADCASTING SERVICE

In this section, we briefly discuss the services a total order broadcasting system should provide. We assume that there is a substrate layer providing basic broadcasting services.

A. Basic Broadcasting Services

The machines in the system use a broadcasting service to send messages. A broadcast message is sent once by its source machine and arrives at all destined machines in the system, possibly at different times. The broadcasting service is responsible for the reliable delivery of messages. Internally, the causal delivery order [6] of messages is guaranteed by the service.

(Definition): Message m_1 and message m_2 are concurrent if m_1 does not causally precede m_2, and m_2 does not causally precede m_1.

The basic broadcasting service receives messages from the network. It keeps the causal order of messages and delivers them to our fast delivery protocol. The broadcasting service does not guarantee the same delivery sequence of concurrent messages on all machines in the system.

B. Total Order Broadcasting

On top of the basic broadcasting service, totally ordered broadcasting extends the underlying causal order to a total order for concurrent messages.

(Total Order): If two correct machines p and q both deliver messages m_1 and m_2, then p delivers m_1 before m_2 if and only if q delivers m_1 before m_2.

The total order broadcasting provided by the system does not guarantee the total order across multiple partitions. As long as partitions do not occur, all machines deliver the messages in the same total order. When a partition occurs, the machines in each partition continue to form the same total order within that partition; however, this total order may differ across partitions. The total order broadcasting service of the system is defined in Fig. 1.
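The paper assumes that the substrate layer already provides causal delivery order [6]. As a minimal illustration of the concurrency definition above (our own sketch, not part of the paper's protocol), messages can carry vector timestamps and be compared component-wise; the Message type and its fields below are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Message:
        sender: str
        vc: dict  # vector timestamp: machine ID -> counter (hypothetical field)

    def leq(a, b):
        # a <= b component-wise over the union of their keys
        return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))

    def causally_precedes(m1, m2):
        # m1 causally precedes m2 iff vc(m1) <= vc(m2) and the timestamps differ
        return leq(m1.vc, m2.vc) and m1.vc != m2.vc

    def concurrent(m1, m2):
        # The definition of Section II-A: neither causally precedes the other.
        return not causally_precedes(m1, m2) and not causally_precedes(m2, m1)

For example, concurrent(Message("a", {"a": 1}), Message("d", {"d": 1})) is True: the two messages were sent without knowledge of each other, so the substrate may deliver them in different orders on different machines.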
III. FAST DELIVERY PROTOCOL FOR TOTAL ORDER BROADCASTING

In this section, we describe a fast delivery protocol that provides the total order broadcasting service defined in Section II-B. We assume that the protocol works on top of the basic broadcasting services described in Section II-A. We first consider a static system of machines, i.e., no machine failures, no network partitions and re-merges, and no new machines. These features are considered in Section IV, where we show how to extend the protocol to handle dynamic environments.

A machine does not deliver any messages until it collects a message set from the other machines in the group. The message set must contain enough information to guarantee totally ordered delivery. After receiving enough messages, the machine delivers the set of messages that do not causally follow any other, in a predefined order. Idle machines periodically broadcast heart beat messages with a predefined interval on behalf of other machines. Those heart beat messages are not delivered, but are used by the machines to determine the order of received messages.

A. Notation and Definition

We define a partition to consist of a group of machines P = {p_1, p_2, ..., p_n}. We assume each machine in the group has a distinct ID. For a machine p, the function id(p) returns its ID. If the number of machines in the primary group is n, then

∀ p_i, p_j ∈ P, 1 ≤ i, j ≤ n, i ≠ j: id(p_i) ≠ id(p_j)

We associate with each machine p ∈ P the functions Before(p) and After(p), which are defined as:

1) Before(p) = {q ∈ P : id(q) < id(p)}
2) After(p) = {q ∈ P : id(q) > id(p)}

The input to the fast delivery protocol is a stream of causally ordered messages from the underlying broadcasting service. We denote by m_p^i the ith message sent by machine p; sender(m_p^i) = p. If message m_p^i is delivered before m_q^j in machine r, then deliver_r(m_p^i) < deliver_r(m_q^j).

We define a pending message [7] to be a message that was received by the protocol but has not yet been agreed to a total order, and thus has not been delivered for processing. A pending message that follows only delivered messages is called a candidate message. The set of concurrent candidate messages is called the candidate set. This is the set of messages that are considered for the next slot in the total order.

For example, consider a system with five machines, P = {a, b, c, d, e}. After a certain time, there are no undelivered messages on any machine. Machine a broadcasts a message m_1, and machine d broadcasts a message m_2. All five machines receive both m_1 and m_2, but none of them can deliver the two messages, because, except for the sending machines, no one knows whether messages m_3 and m_4 have been sent by b and c, concurrently with m_2, or not. No machine should deliver m_1 and m_2 until enough information is collected to determine a total order. The message set {m_1, m_2} is called the candidate set, and the messages m_1 and m_2 are called candidate messages.

Let CS_p = {m_1, ..., m_k} be the set of candidate messages in a machine p. We associate with CS_p a function senders:

senders(CS_p) = {sender(m) : m ∈ CS_p}

Let DS_p be the set of messages ready to be delivered in a machine p, with DS_p ⊆ CS_p. DS_p is called the deliver set.
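The notation above maps directly onto simple set operations. The following sketch (hypothetical helper names, not code from the paper's Transis prototype) implements id, Before, After, and senders for a static five-machine group:

    # Sketch of the Section III-A notation, assuming machine IDs are
    # comparable values and the group membership is static.
    GROUP = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}  # machine -> distinct ID

    def mid(p):
        """id(p): the distinct ID of machine p."""
        return GROUP[p]

    def before(p):
        """Before(p) = {q in P : id(q) < id(p)}"""
        return {q for q in GROUP if mid(q) < mid(p)}

    def after(p):
        """After(p) = {q in P : id(q) > id(p)}"""
        return {q for q in GROUP if mid(q) > mid(p)}

    def senders(cs):
        """senders(CS_p) = {sender(m) : m in CS_p}, with messages
        represented as (sender, payload) pairs."""
        return {sender for sender, _ in cs}

With these helpers, before("d") yields {"a", "b", "c"} and after("d") yields {"e"}, matching the five-machine example above.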
When receiving a regular message m in machine p:
1) if (m is a new candidate message)
       add m into candidate set CS_p
2) if (CS_p ≠ ∅)
       for all (m_i ∈ CS_p)
           if (Before(sender(m_i)) ⊆ senders(CS_p))
               add m_i into deliver set DS_p
3) if (DS_p ≠ ∅)
       deliver all messages in DS_p in the order of id(sender(m_i))
4) if (sender(m) ∉ After(p)) return
   if (message m is not a total order message) return
   if (messages are waiting to be broadcast from p) return
   if (∃ m_i ∈ CS_p : id(sender(m_i)) = id(p)) return
   otherwise, fast acknowledge m

Fig. 2. The fast delivery protocol.

B. The Fast Delivery Protocol

The fast delivery protocol is symmetric, and we describe it for a specific machine p (the pseudo code is shown in Fig. 2). The key idea of the fast delivery protocol is that it forms the total order by waiting for messages only from a subset of the machines in the group. Assuming a candidate message m is in the candidate set CS_p, we use the following delivery criterion to define which messages a machine has to wait for before delivering m:

1) Add m into the deliver set DS_p when: Before(sender(m)) ⊆ senders(CS_p)
2) Deliver the messages in DS_p in the following order:
   ∀ m_i, m_j ∈ DS_p: id(sender(m_i)) < id(sender(m_j)) ⟹ deliver_p(m_i) < deliver_p(m_j)

With the same example as in Section III-A, we explain how the protocol works. All five machines can deliver m_1 immediately, because Before(sender(m_1)) = ∅ ⊆ senders(CS_p). The five machines cannot deliver m_2, because Before(sender(m_2)) = {a, b, c}, while senders(CS_p) = {a, d}. The machines have to wait for messages from both b and c, but do not need to wait for a message from e, because e ∉ Before(sender(m_2)).

According to the protocol, if Before(sender(m)) ⊄ senders(CS_p), the machine has to wait for messages from other machines before delivering m. If any of those machines is idle, the waiting time could be the interval of heart beat messages. To speed up the delivery of m, the idle machines should immediately acknowledge m on behalf of the other machines. If a machine p receives a message m, and p is idle, p broadcasts a fast acknowledgment when sender(m) ∈ After(p). In the same example, if b and c are idle, they should fast acknowledge m_2, because d ∈ After(b) and d ∈ After(c). If e is idle, it does not need to send a fast acknowledgment, because d ∉ After(e).

Fast acknowledgment reduces the latency of message delivery; however, it injects more packets into the network. If communication is heavy, fast acknowledgment may burden the network and the machines, and thus increase delivery latency. To reduce the cost of fast acknowledgment, we define the following acknowledgment criterion:

(ACK) A machine p fast acknowledges a message m when:
1) Message m is a total order message.
2) There is no message waiting to be sent from the machine p.
3) ∄ m_i ∈ CS_p : id(sender(m_i)) = id(p)

Condition 1 is straightforward. Condition 2 means that if a machine is sending regular messages, it is not an idle machine, and the regular messages themselves are enough to form a total order. Condition 3 means that if a machine has already sent a regular message which is still in CS_p, that message can be used to form a total order without an additional acknowledgment. In the same example, if a is idle after sending m_1, it does not need to send any acknowledgment (although d ∈ After(a)), because m_1 is still in CS_p.

In a parallel system, when multiple concurrent requests arrive at a machine simultaneously and the system is busy, conditions 2 and 3 are very unlikely to be satisfied simultaneously, so almost no additional acknowledgments are injected into the network when communication is heavy.
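To make the control flow of Fig. 2 concrete, the following sketch implements the delivery criterion and the acknowledgment criterion for one machine. The class and method names (Machine, on_receive, broadcast_ack) are our own; this is a sketch of the criteria under simplifying assumptions, not the paper's Transis implementation:

    # Sketch of Fig. 2 for one machine, assuming: messages are dicts carrying
    # a comparable 'sender' ID and a 'total_order' flag, and broadcast_ack()
    # injects a fast acknowledgment into the underlying causal broadcast.
    class Machine:
        def __init__(self, my_id, group_ids):
            self.my_id = my_id
            self.group_ids = set(group_ids)  # distinct IDs of all machines
            self.cs = []                     # candidate set CS_p
            self.outgoing = []               # messages waiting to be broadcast

        def before(self, mid):
            return {q for q in self.group_ids if q < mid}

        def after(self, mid):
            return {q for q in self.group_ids if q > mid}

        def senders(self):
            return {m["sender"] for m in self.cs}

        def on_receive(self, m):
            # 1) a new candidate message joins the candidate set
            if m not in self.cs:
                self.cs.append(m)
            # 2) candidates whose Before set is covered by senders(CS_p)
            #    move into the deliver set DS_p
            ds = [c for c in self.cs
                  if self.before(c["sender"]) <= self.senders()]
            # 3) deliver DS_p ordered by sender ID (promotion of pending
            #    messages to candidates after delivery is omitted here)
            for c in sorted(ds, key=lambda c: c["sender"]):
                self.cs.remove(c)
                self.deliver(c)
            # 4) fast acknowledgment criterion (ACK)
            if m["sender"] not in self.after(self.my_id):
                return
            if not m.get("total_order"):
                return                       # condition 1
            if self.outgoing:
                return                       # condition 2: p is not idle
            if any(c["sender"] == self.my_id for c in self.cs):
                return                       # condition 3: own message pending
            self.broadcast_ack(m)

        def deliver(self, m): ...
        def broadcast_ack(self, m): ...

In the five-machine example, an idle machine with ID 2 (machine b) that receives m_2 from machine d passes all four checks and broadcasts a fast acknowledgment, while machine e (ID 5) returns at the first check because d ∉ After(e).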
IV. FAST DELIVERY PROTOCOL FOR DYNAMIC SYSTEMS

The fast delivery protocol operates on an asynchronous stream of causal order messages. In this section, we show how to extend the protocol to handle failures, network partitioning and re-merging, and joining machines. The fast delivery protocol is integrated into the group communication service to provide total order delivery of messages on top of the basic broadcasting service. We assume that the system contains a membership service, which maintains a view of the current membership set (CMS) consistent among all machines in the dynamic environment. When machines crash or disconnect, the network partitions and re-merges, or new machines join, the membership service of all connected machines must reconfigure and reach a new agreement on the CMS. After a new agreement is reached, the membership service delivers a view change event indicating a new configuration. All connected machines in the new configuration agree on the set of regular messages that belong to the previous membership and must be delivered before the new view change event. The fast delivery protocol is extended to define how to deliver such messages in the dynamic environment.

We assume that after a new agreement on the CMS, the membership service notifies the fast delivery protocol with a special event. With such an event, the protocol obtains the machine set P' that belongs to the previous configuration but is also included in the new configuration. The new Before(p) and After(p) are calculated based on P':

1) Before_new(p) = Before(p) ∩ P'
2) After_new(p) = After(p) ∩ P'

Using the algorithm described in Section III, the set of regular messages that belong to the previous configuration is delivered before the view change event with the new Before(p) and After(p). Since a new CMS agreement always completes within a finite delay, any total order message can be delivered within a limited time interval.
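Continuing the sketch above (same hypothetical names), a view change handler only needs to intersect the group with the surviving machine set P', which shrinks Before(p) and After(p) accordingly, and then re-run the delivery criterion over the pending candidates:

    # Sketch of the Section IV view change rule, assuming the membership
    # service reports surviving_ids (P'): the machines of the previous
    # configuration that remain in the new one. Hypothetical extension of
    # the Machine sketch above.
    class DynamicMachine(Machine):
        def on_view_change(self, surviving_ids):
            # Restrict the group to P'; before()/after() derive from
            # group_ids, so pending messages stop waiting on crashed or
            # partitioned machines.
            self.group_ids &= set(surviving_ids)
            # Re-apply the delivery criterion of Section III so that the
            # regular messages of the previous membership are delivered
            # before the view change event completes; skip candidates
            # already delivered earlier in this pass.
            for m in list(self.cs):
                if m in self.cs:
                    self.on_receive(m)

For instance, if machine c crashes while m_2 from machine d is pending, Before_new(d) = {a, b}, so hearing from (or being fast acknowledged by) a and b alone now suffices to deliver m_2.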
V. EXPERIMENTAL RESULTS

To verify the above model, a proof-of-concept prototype of the fast delivery protocol has been implemented in Transis [5] and deployed on the XTORC cluster at Oak Ridge National Laboratory, using up to 8 machines in various combinations for functional and performance testing. The computing nodes of the XTORC cluster are IBM IntelliStation M Pro servers. Each node contains an Intel Pentium 2GHz processor, 768MB memory, and a 40GB hard disk. All nodes are connected via Fast Ethernet (100 Mbit/s full duplex) switches. Fedora Core 5 is installed as the operating system. Transis v1.03 with the fast delivery protocol is used to provide group communication services. Failures are simulated by unplugging network cables and by forcibly shutting down individual processes. An MPI-based benchmark is used to send concurrent requests from multiple machines. The latency is measured with blocking requests, and an average latency is calculated from 100 requests of each machine. The results under various configurations, from 1 to 8 machines, are provided for comparison (Table I).

TABLE I
REQUEST LATENCY (μs) COMPARISON OF MULTIPLE MACHINES. "SINGLE" MEANS ONLY ONE MACHINE IN A RELATIVELY IDLE SYSTEM BROADCASTS REQUESTS. "ALL" MEANS ALL MACHINES IN A BUSY SYSTEM BROADCAST REQUESTS.

    # of machines    1     2     3     4     5     6     7     8
    Single         235   952  1031  1089  1158  1213  1290  1348
    All              -   971  1461  1845  2236  2646  3073  3514

In the configuration with only one machine, the latency overhead mainly comes from the processing cost of the group communication service. As the number of machines increases, additional overhead is introduced by the network communication and by the total order communication algorithm to reach agreement among machines. For each configuration, we measured the latency under both idle and busy systems. In an idle system, only a single machine is active and sends requests. In a busy system, all machines send concurrent requests, and the communication traffic is heavy.

The proof-of-concept prototype shows that although the latency increases with the number of machines, the fast delivery protocol keeps the overall overhead acceptable for an HPC system. The overhead is consistent for both idle and busy systems. The fast acknowledgment aggressively acknowledges total order messages to reduce the latency of idle systems when only a single machine is active. The protocol holds its acknowledgments when the network communication is heavy because more machines are involved.

We compared the fast delivery protocol with the traditional communication history algorithm provided by the original Transis system. In an idle system, the post-transmission delay of the traditional communication history algorithm is apparent. It is not a constant value, and in the worst case the latency is equal to the interval of heart beat messages from an idle machine. A typical interval is on the order of hundreds of milliseconds, which is unacceptable compared to the latency of the fast delivery protocol. In a busy system, the latency of the fast delivery protocol is almost the same as that of the traditional communication history algorithm, because the protocol holds unnecessary acknowledgments. We found that when all machines sent concurrent requests, the fast delivery protocol did not acknowledge any broadcast, and the regular messages continuously formed a total order.

VI. RELATED WORK

Group communication systems typically provide different group broadcast services with a variety of ordering and reliability guarantees [1]. A totally ordered service extends the causal service by ensuring that messages sent to a set of processes are delivered by all these processes in the same order. Past research on total ordering algorithms focuses on three popular approaches [3]. Sequencer algorithms [8]–[10] rely on one machine to order the messages on behalf of the other machines in the group. Privilege-based algorithms [4], [11] enforce the total order through competition for a privilege among all machines in the group. Communication history algorithms [5], [7], [12]–[14] ensure the total order by delaying the delivery of messages until enough communication history information has been gathered from the other machines.

The agreement on a total order in communication history algorithms usually bears a performance cost: the post-transmission delay [3], [5]. Several studies have aimed to reduce this cost. Early delivery algorithms [7], [13] reduce the latency by reaching agreement with a subset of the machines in the group. Optimistic delivery algorithms [15], [16] deliver messages before the total order is determined, but notify the applications and cancel the delivery if the final order differs from the delivered order. Researchers have considered using totally ordered broadcasting to provide high availability for HPC systems [17]. Past research in high availability primarily focused on the active/standby model [18]–[21].
Recent research on the symmetric active/active replication model [22], [23] uses multiple redundant service nodes running in virtual synchrony [24] with totally ordered broadcasting.

VII. CONCLUSION

In this paper, we propose a fast delivery protocol to reduce the latency of message ordering in group communication. The protocol optimizes the total ordering process by waiting for messages only from a subset of the machines in the group. The protocol performs well for both idle and busy systems. Furthermore, the fast acknowledgment aggressively acknowledges total order messages to reduce the latency when some machines are idle. The protocol is smart enough to hold the acknowledgments when the network communication is heavy. We implemented a prototype of the fast delivery protocol and discussed an application of this protocol to symmetric active/active replication in parallel computing systems. Our performance results are encouraging: they indicate that the fast delivery protocol is suitable for both idle and heavily loaded systems.

REFERENCES

[1] G. V. Chockler, I. Keidar, and R. Vitenberg, “Group communication specifications: A comprehensive study,” ACM Computing Surveys, 2001.
[2] R. Baldoni, S. Cimmino, and C. Marchetti, “Total order communications: A practical analysis,” Lecture Notes in Computer Science: Dependable Computing, vol. 3463, pp. 38–54, 2005.
[3] X. Defago, A. Schiper, and P. Urban, “Total order broadcast and multicast algorithms: Taxonomy and survey,” ACM Computing Surveys, December 2004.
[4] Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella, “The Totem single-ring ordering and membership protocol,” ACM Trans. Comput. Syst., November 1995.
[5] D. Dolev and D. Malki, “The Transis approach to high availability cluster communication,” Communications of the ACM, 1996.
[6] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Comm. ACM, July 1978.
[7] D. Dolev, S. Kramer, and D. Malki, “Early delivery totally ordered multicast in asynchronous environments,” Symposium on Fault-Tolerant Computing, 1993.
[8] K. P. Birman and R. van Renesse, “Reliable distributed computing with the ISIS toolkit,” IEEE Computer Society Press, 1993.
[9] M. F. Kaashoek and A. S. Tanenbaum, “An evaluation of the Amoeba group communication system,” Proc. 16th Intl. Conf. on Distributed Computing Systems, 1996.
[10] M. K. Reiter, “Distributing trust with the Rampart toolkit,” Communications of the ACM, vol. 39, no. 4, April 1996.
[11] F. Cristian, S. Mishra, and G. Alvarez, “High-performance asynchronous atomic broadcast,” Distributed Systems Engineering, vol. 4, no. 2, June 1997.
[12] M. K. Aguilera and R. E. Strom, “Efficient atomic broadcast using deterministic merge,” Proc. 19th Annual ACM Intl. Symp. on Principles of Distributed Computing, 2000.
[13] Z. Bar-Joseph, I. Keidar, and N. Lynch, “Early-delivery dynamic atomic broadcast,” Proc. 16th Intl. Symp. on Distributed Computing, 2002.
[14] D. Dolev and I. Keidar, “Totally ordered broadcast in the face of network partition,” Dependable Network Computing, Chapter 3.
[15] B. Kemme, F. Pedone, G. Alonso, A. Schiper, and M. Wiesmann, “Using optimistic atomic broadcast in transaction processing systems,” IEEE Trans. Knowl. Data Eng., July 2003.
[16] P. Vicente and L. Rodrigues, “An indulgent uniform total order algorithm with optimistic delivery,” Proc. 21st IEEE Intl. Symp. on Reliable Distributed Systems, 2002.
[17] C. Engelmann and S. L. Scott, “Concepts for high availability in scientific high-end computing,” in Proc. of HAPCW, 2005.
[18] “PBSPro job management system for the Cray XT3 at Altair Engineering, Inc.,” http://www.altair.com/pdf/PBSPro Cray.pdf.
[19] “SLURM at Lawrence Livermore National Laboratory, Livermore, CA, USA,” http://www.llnl.gov/linux/slurm.
[20] C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller, A. G. Shet, and P. Sadayappan, “MOLAR: Adaptive runtime support for high-end computing operating and runtime systems,” ACM SIGOPS Operating Systems Review, 2006.
[21] I. Haddad, C. Leangsuksun, and S. L. Scott, “HA-OSCAR: Towards highly available Linux clusters,” Linux World Magazine, March 2004.
[22] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He, “Active/active replication for highly available HPC system services,” in Proc. of the International Symposium on Frontiers in Availability, Reliability and Security (FARES), Vienna, Austria, April 2006.
[23] K. Uhlemann, C. Engelmann, and S. L. Scott, “JOSHUA: Symmetric active/active replication for highly available HPC job and resource management,” in Proc. of the IEEE International Conference on Cluster Computing, September 2006.
[24] L. Moser, Y. Amir, P. Melliar-Smith, and D. Agarwal, “Extended virtual synchrony,” Proc. of DCS, 1994.