Transparent Failover

Pete Perlegos
UC Berkeley

Abstract

Servers fail more often than users want to know. The clear solution to this is server replication. But how does a redundant server know how to transparently continue the connection? There are already good techniques for monitoring the health of a server and notifying other servers to take over. There are also good techniques for picking the new server. There are currently some ideas on how to migrate connection state by periodically disseminating information to backup servers. My solution differs in that state information is passed only at the beginning of a connection and at failover. The application communicates the application-dependent information and initial TCP state at the beginning of the initial TCP connection. Once a failure occurs, the current sequence number is requested from the client. Simulation shows that connections suffer only a small performance degradation when using this failover technique.

1. Introduction

Servers on the Internet today do not have the necessary reliability for mission-critical services. An effective way to engineer a reliable system out of unreliable components is to use redundancy. Server replication is used to provide reliable and available services on the Web today [4]. Providing reliable, robust service requires the ability to rapidly transition the client to a new server from an unresponsive, overloaded, or failed server during a connection [12] (Figure 1).

Figure 1: Transparent failover.

My design is a component that fits into the applications on the servers and clients so they can easily benefit from transparent failover. When the data being transferred is static or can be made available on redundant servers, my transparent failover performs well. If content is generated dynamically and is not easily reproduced by another server, handoff becomes harder to accomplish. Fortunately, today's servers and overlay networks are more sophisticated and can do dynamic processing recovery [3].

I discuss the components involved in designing a transparent failover system in section 2. Section 3 describes my architecture for transparent server failover. My TCP-based implementation is described in section 4. Section 5 contains simulation and performance analysis showing the effectiveness of the failover mechanism. I conclude with a summary of my contributions in section 6.

2. Components for Transparent Failover

It is important to look at all of the steps to see how my solution fits in. This section discusses the components of a transparent failover system. First, for any connection in progress, there must be a method to determine if and when to move it to another server. Second, there must be a selection process to identify a set of new server candidates. Finally, there must be a mechanism to move the connection and seamlessly resume the data transfer from the new server.

2.1 Health Monitoring

There are already many solutions to health monitoring [5, 10, 13, 14]. One solution has a cluster of network servers in which a front-end directs incoming requests to one of a number of back-ends (Figure 2) [13].

Figure 2: Cluster of servers with front-end.

A node's load is measured by its number of active connections. An overloaded node will fall behind, and the resulting queuing of requests will cause its number of active connections to increase, while the number of active connections at an underloaded node will tend to zero. Monitoring the relative number of active connections allows the front-end to estimate the amount of outstanding work, and thus adjust the relative load on a back-end, without requiring explicit communication with the back-end node. If one of the replicated back-end servers is overloaded or goes down, connections can be offloaded to one of the other servers.
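As a concrete illustration of the load-estimation scheme just described, the following is a minimal sketch of a least-connections dispatch policy. The Backend class, its fields, and the helper functions are hypothetical illustrations of the idea in [13], not code from that system.

# Sketch: least-connections dispatch at a cluster front-end.
# Active-connection counts serve as the load estimate, so the front-end
# needs no explicit load reports from the back-end nodes.

class Backend:
    def __init__(self, name):
        self.name = name
        self.active = 0      # requests assigned but not yet completed
        self.alive = True    # set to False by the health monitor

def pick_backend(backends):
    """Return the live back-end with the fewest active connections."""
    live = [b for b in backends if b.alive]
    if not live:
        raise RuntimeError("no live back-end available")
    return min(live, key=lambda b: b.active)

def on_request(backends):
    b = pick_backend(backends)
    b.active += 1            # queue builds at an overloaded node, so its
    return b                 # count rises and it stops being chosen

def on_complete(backend):
    backend.active -= 1      # an underloaded node's count tends to zero

# Example: two back-ends, one already busy.
servers = [Backend("b1"), Backend("b2")]
servers[0].active = 7
assert on_request(servers).name == "b2"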
Akamai has a Network Operations Command Center (NOCC), which provides proactive, real-time performance monitoring of all servers in Akamai's global network [3]. The NOCC's proactive stance and unique view of the network enable a seamless response to changing network conditions.

2.2 Server Selection

There are also many solutions to server selection [7, 13, 14]. In the previously mentioned solution, the front-end uses the content requested, in addition to information about the load on the back-end nodes, to choose which back-end server will handle a request [13]. One of the important factors in the efficient utilization of replicated servers is the ability to direct client requests to the best server. Many techniques have been used to select a particular server among a set of replicas.

One example is to simply list the servers and have the client pick one, based on geographical proximity or some other criterion the client deems appropriate. But this technique is not transparent to the user. Furthermore, the geographically closest server may not have the least end-to-end delay, and it may not be the least loaded of the servers.

Another example is to use Domain Name System (DNS) modifications [8] to return the IP address of one of a set of servers when the DNS server is queried. This technique is transparent to the client, but the DNS server often uses a round-robin mechanism to allocate servers to clients because it maintains no server performance information on which to base its selection decision.

In [8], the authors use an environment in which servers are distributed across the Internet and clients identify servers using an application-layer anycasting service. The goal in this case is to allocate servers to clients in a way that minimizes a client's response time [8]. A significant response-time improvement can be achieved with this technique over random or other performance-independent allocation mechanisms. A potential problem with any approach that steers clients to the single best server is oscillation among servers.

Akamai supports Edge Side Includes (ESI), an open specification for dynamic assembly and delivery of highly dynamic Web content at the edge of the Internet. ESI provides a mechanism for managing content transparently across application server solutions, content management systems, and content delivery networks [3].

Figure 3: Akamai architecture [16].

There are many varied solutions to the problem of server selection.
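To make performance-based selection concrete, here is a minimal sketch in the spirit of [8]'s goal of minimizing client response time. The data layout, the delay fields, and the slack-band damping against oscillation are my own illustrative assumptions, not the mechanism of [8].

# Sketch: response-time-based server selection in the spirit of [8].
# A candidate's expected response time is modeled as its estimated
# server delay plus the one-way propagation delay to the client.

import random

def estimated_response_time(server, client):
    return server["queue_delay"] + server["rtt_to"][client] / 2

def select_server(servers, client, slack=0.10):
    """Pick among servers within `slack` of the best estimate.

    Choosing randomly within a small slack band, rather than always the
    single best server, damps the oscillation noted in section 2.2.
    """
    best = min(estimated_response_time(s, client) for s in servers)
    near_best = [s for s in servers
                 if estimated_response_time(s, client) <= best * (1 + slack)]
    return random.choice(near_best)

# Example: two replicas serving client "c0" over identical 100 ms paths.
servers = [
    {"name": "s0", "queue_delay": 0.030, "rtt_to": {"c0": 0.100}},
    {"name": "s1", "queue_delay": 0.005, "rtt_to": {"c0": 0.100}},
]
print(select_server(servers, "c0")["name"])   # "s1": lower total estimate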
2.3 Connection Migration and Resumption

Once a connection needs to move and a new server has been selected, the client application should continue seamlessly. The data stream must resume from exactly where it left off at the old server. As a result, the transport-layer state must be moved to the new server, and the application-layer state appropriately synchronized and restarted. There are a variety of methods to accomplish this.

One approach is an application-independent mechanism that uses a secure, transport-layer connection migration mechanism [4]. This method requires periodic updates and the maintenance of transport-layer soft-state at replica servers.

Another approach, which I propose, uses an application-layer connection migration mechanism. All initial application- and transport-layer state is passed to the backup server at the start of a connection. All that is needed to recover is the current sequence number, which is passed to the backup server by the client at failure. There is therefore no need to burden the network with additional traffic.

3. Failover Architecture

In the transport-layer connection migration proposal [4], the authors associate each connection with a subset of the servers in the system. This is the connection's support group: the collection of servers that are collectively responsible for the correct operation of the connection. Each support group uses a soft-state synchronization protocol to distribute weakly consistent information about the connection to each server in the group. The state distribution protocol periodically disseminates, for each connection, the mapping between the transport-layer state and the application-level object being sent to the client.

My architecture preserves the end-to-end semantics of a connection across moves between servers. The application state and initial sequence number are transferred at the beginning of the initial transport-layer connection. Also, if the sequence number wraps around during a long-running connection, the replica server must be informed; fortunately, this is rare. When the server fails, the application at the failover server determines the appropriate point from which to resume transmission by getting the sequence number from the client.

3.1 Support Groups

The larger the support group, the more servers there are to offload clients to, so each redundant server absorbs a smaller sudden load when a server fails. Unfortunately, in the transport-layer connection migration proposal [4], the communication load also increases with group size, as each member of the group must advertise connection state to the others. My solution eliminates this problem, since the current connection state is passed to the backup server only upon failure of the original server.

In large support groups, it will be desirable to limit the number of candidate servers that simultaneously attempt to contact the client, as an implosion of migration requests may swamp the client. Support groups can be behind a Web switch or distributed across the Internet. Clearly, the choice of a live initial server is an important one, and much previous work has addressed methods to select appropriate servers in the wide area. The choice of support group membership, and of the final server that takes over a client from a failed server, should be engineered to avoid this implosion problem.

3.2 State Transfer

Previous proposals have suggested that the transport-layer information be passed via periodic soft-state synchronization [4]. My proposal endorses passing the transport-layer state only at the start of the connection and at failover. This avoids the problem of the soft-state synchronization technique noted above, in which the communication load increases because each member of the group must advertise connection state to the others.

Once a server fails, the application at the failover server requests a new connection from the client and determines the appropriate point from which to resume transmission via a modified 3-way handshake. The acknowledgement number the client sends in its SYN-ACK is the same one from the previous connection. The new server accepts the acknowledgement number sent by the client instead of the sequence number it sent itself, and can then resume the data stream from exactly where it left off at the old server.
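The sketch below walks through this modified handshake. The message records and field names are hypothetical; it is not real TCP code, only the acknowledgement-number bookkeeping described above.

# Sketch: the modified 3-way handshake at failover (section 3.2).
# Message records and field names are illustrative; only the
# acknowledgement-number handling follows the described design.

def client_syn_ack(backup_syn, old_conn):
    """Client answers the backup server's SYN.

    Instead of acknowledging backup_syn["seq"] + 1, the client echoes the
    next byte it expected from the failed connection, telling the backup
    server exactly where the data stream stopped.
    """
    return {
        "type": "SYN-ACK",
        "ack": old_conn["next_expected_seq"],   # resume point from old connection
        "seq": old_conn["client_seq"],
    }

def backup_accept(syn_ack):
    """Backup server adopts the client's ack as its own send sequence,
    discarding the initial sequence number it chose for its SYN."""
    return syn_ack["ack"]

# Example: the client had received bytes up through sequence number 5096.
old_conn = {"next_expected_seq": 5096, "client_seq": 42}
resume_seq = backup_accept(client_syn_ack({"type": "SYN", "seq": 9999}, old_conn))
print(resume_seq)   # 5096: the backup resumes the stream here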
3.3 Connection Failover

By using connection migration, servers can be replicated across the wide area, and there is no requirement for a redirecting device on the path between client and server. The client can select its candidate server of choice. This can be done by simply accepting the first migration message to arrive; the first request to reach the client is likely from the server best equipped to handle it, since the response time of a candidate server is the sum of the delay at the server and the propagation delay of the request to the client. If a more sophisticated decision process is desired, it can be implemented at the candidate servers, at the client application, or both.

Web switches multiplex incoming requests across the servers and rewrite addresses as packets pass through (Figure 4). This enables multiple servers to appear to the external network as one machine, providing client transparency. The obvious drawback of this approach, however, is that all servers share fate with the switch, which may also become a performance bottleneck.

Figure 4: Web switch.

This is a common scenario. Fortunately, such switches have higher reliability than the computers serving Web content, which greatly reduces susceptibility to failure. Some switches are available with full redundancy and no single point of failure [2], providing the reliability usually associated with the public telephone network [17]. In the Web switch case, an optimization can be made so that the window size and other congestion state are maintained across failover. But if the failover servers are distributed across the Internet, the failover connection must begin from slow-start.

3.4 Security

Since the application state is passed to the backup server at the start of a connection, the backup server can be trusted by the application layer at the client to initiate a connection from a different end-point. In the Web switch optimization, an unscrupulous client may try to exploit a failed server to inflate the window size, either to obtain a faster transfer or to intentionally overwhelm a backup server. To thwart such a takeover, the application-layer component must provide security and authentication that the client cannot circumvent.

4. Implementation

I have implemented the transparent failover in ns-2. I pass the initial state from the initial server to the backup server at the start of the connection. This initial state includes the object being transferred, the initial sequence number, and the client IP address and port number. When a wraparound of the sequence number occurs, this is also passed to the backup server.

Once a server fails, the application at the failover server requests a new connection from the client. The application layer is able to start this new connection because it is just as trusted as the original connection. The appropriate point at which to resume is passed from client to server via a modified 3-way handshake: the SYN-ACK from the client to the backup server contains the acknowledgement number from the previous connection, and the backup server accepts this number instead of the sequence number it sent, continuing the transfer from that point.

Figure 5: How my solution fits into server and client applications.

Each connection has its own application-layer management.
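As an illustration of what is handed to the backup server at connection start, here is a minimal sketch of the initial-state record and the resume-offset computation it enables. The field and method names are my own; the ns-2 implementation passes the same information but not this exact structure.

# Sketch: the initial state the first server hands to the backup at
# connection start (section 4). Field names are illustrative.

from dataclasses import dataclass

@dataclass
class FailoverState:
    object_id: str       # the object being transferred
    init_seq: int        # initial TCP sequence number of the transfer
    client_ip: str       # client address for the recovery connection
    client_port: int
    seq_wraps: int = 0   # incremented by a (rare) wraparound notification

    def resume_offset(self, client_ack: int) -> int:
        """Byte offset into the object, given the client's SYN-ACK ack number."""
        return client_ack - self.init_seq + self.seq_wraps * 2**32

# Example: backup learns the state at start, computes the offset at failover.
state = FailoverState("dataobj", init_seq=1000, client_ip="10.0.0.2", client_port=8080)
print(state.resume_offset(client_ack=5096))   # 4096 bytes already delivered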
4.1 Application Layer

The application layer is implemented in Tcl. To give a clear overview of the steps from the start of the initial connection, through failure, to recovery, I explain the actions that must be performed by each actor in my implementation. The needed actors are the initial server (S0), the backup server (S1), and the client (C0). The pseudo code below follows the format

    TCP <source> <destination> (type of packet or data payload)

and the numbers at the left of each line indicate how the order of events correlates across the actors.

S0:
    1: TCP S0 C0 (start sending dataobj)
    1: TCP S0 S1 (start, dataobj, InitialSeqno(), getDestIP(), getSourceIP(), getDport(), getSport())
    2: TCP S0 S1 (sequence number wrap-around)
    7: TCP S0 C0 (finished sending dataobj)
    7: TCP S0 S1 (end, dataobj, getDestIP(), getSourceIP(), getDport(), getSport())

S1:
    1: recvBytes(start of connection: TCP S0 C0 dataobj)
    2: recvBytes(seq # wrap-around: TCP S0 C0 dataobj)
    3: when fail(S0) {
    4:     TCP S1 C0 (SYN)
    6:     TCP S1 C0 (dataobj + Seqno - InitSeqno + (seq # wrap-arounds * 2^32))
       }
    7: recvBytes(end of connection: TCP S0 C0 dataobj)

C0:
    1: recvBytes(dataobj from S0)
    4: when fail(S0) { (S1 SYN received)
    5:     TCP C0 S1 (SYN-ACK, Seqno(TCP S0 C0) + 1)
       }
    6: recvBytes(dataobj from S1)

4.2 Transport Layer

Only two things must be modified at the transport layer to support the application layer. The first is the 3-way handshake (Figure 6) at the client, which must send the old sequence number with the SYN-ACK if a failover connection is requested. The second is the 3-way handshake at the failover server, which must accept the sequence number from the client's SYN-ACK.

Figure 6: TCP 3-way handshake.

The request for a new connection delivered to the client application layer is an indication to the client that the original connection has failed. When the SYN packet is sent by the backup server (Host A) to the client (Host B), the client must have an operation to request the current sequence number from the old connection and an operation to insert a new ACK number into the SYN-ACK packet. So, instead of ACK = x + 1, the SYN-ACK contains ACK = oldacknum + 1. When the backup server receives the SYN-ACK, it must accept the ACK number as its sequence number and discard the sequence number it previously sent. The backup server must also have an operation to return the current sequence number.

5. Simulation

To simulate a set of realistic network conditions, the servers, clients, and network are simulated using the ns-2 simulator. I pinged several popular Internet sites and measured round-trip times (RTTs) of mostly 20 ms to 120 ms. I chose an RTT of 100 ms for my simulation, since this is toward the worst case and latency is a factor in my solution. Each pipe has a bottleneck bandwidth of 384 Kb/s, a typical speed for today's high-speed connections. My simulation topology consists of 2 servers and 1 client connected by a simple network (Figure 7).

Figure 7: Simulation topology. The links R0-R2 and R1-R2 are identical.

I have decided on two simulation scenarios, followed below by a rough estimate of the failover penalty this setup implies:
1) Data transfer for various rates of oscillation (on their own connections).
2) Data transfer with 4 TCP flows without failure and 1 TCP flow with failure (the segment with the cross traffic has a capacity of 1.5 Mb/s).
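Before turning to the results, a back-of-the-envelope estimate of the failover penalty under this topology helps set expectations. The RTT and bottleneck bandwidth come from the setup above; the failure-detection delay and the one-segment slow-start model are assumed placeholders, not measured values.

# Sketch: rough failover penalty for the simulated topology (section 5).
# Detection delay is an assumed placeholder; slow-start is modeled
# crudely as window doubling per RTT from one segment.

import math

RTT = 0.100           # seconds, from the simulation setup
BW = 384_000 / 8      # bottleneck bandwidth in bytes/s (384 Kb/s)
MSS = 1460            # bytes, a common TCP segment size (assumption)

def slow_start_rounds(target_window_bytes):
    """RTTs for the congestion window to double from 1 MSS to the target."""
    return math.ceil(math.log2(max(target_window_bytes / MSS, 1)))

detection = 0.5                      # assumed failure-detection delay (s)
handshake = RTT                      # SYN / SYN-ACK exchange
bdp = BW * RTT                       # bandwidth-delay product = target window
ramp = slow_start_rounds(bdp) * RTT  # climb back to full rate from slow-start

print(f"added delay ~ {detection + handshake + ramp:.2f} s")
# With these numbers: 0.5 + 0.1 + 2 * 0.1 = 0.8 s on top of the transfer.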
5.1 Results

I assume that there is negligible delay in the response time of the servers. The delay involved is the delay of notifying the new server to take over plus the startup delay of a TCP connection, which is just the round-trip time (RTT) of the SYN, SYN-ACK handshake and the ramp-up as the window size increases. Any additional delay makes the transfer correspondingly worse.

Figure 8: Data transfer for various rates of oscillation (simulation scenario 1).

A single server failover has a negligible effect on the transfer. The greater the number of failovers or handoffs, the greater the effect on the transfer.

Figure 9: Data transfer with TCP cross traffic (simulation scenario 2).

TCP cross traffic does not appear to have a significant effect on the transfer, other than the connection having to share the capacity. The relative performance of failover and non-failover transfers is similar whether or not there is competing cross traffic; failover suffers only slightly more in the presence of cross traffic.

6. Conclusion

I described the design and implementation of a transparent failover architecture using an application-layer failover mechanism, with supporting modifications to the transport layer. My architecture is end-to-end and fits into server and client applications.

There are a number of benefits to my transparent failover architecture. It allows a connection to continue in the event of a server failure with only a small performance degradation. It permits wide-area distribution of backup servers. A strong benefit is that it does not require passing state periodically, because state is passed only at the start of a connection and at failover.

There are also some limitations to my design. Applications must be modified to interact with the failover application layer instead of the transport layer. Also, if the object being transferred is dynamically modified during transfer, the higher application layer must have a mechanism for continuing at the backup server.

Acknowledgements

I would like to thank Kevin Lai and Professor Ion Stoica for their assistance in refocusing my project.

References

[1] Douglas E. Comer. Internetworking with TCP/IP: Principles, Protocols, and Architectures, Fourth Edition.
[2] Cisco home page. http://www.cisco.com
[3] Akamai home page. http://www.akamai.com
[4] Alex C. Snoeren, David G. Andersen, and Hari Balakrishnan. Fine-grained failover using connection migration. In Proc. USENIX Symposium on Internet Technologies and Systems (USITS), March 2001.
[5] E. Amir, S. McCanne, and R. Katz. An active service framework and its application to real-time multimedia transcoding. In Proc. ACM SIGCOMM '98, Sept. 1998.
[6] M. Aron, P. Druschel, and W. Zwaenepoel. Efficient support for P-HTTP in cluster-based web servers. In Proc. USENIX '99, June 1999.
[7] Digital Island, Inc. home page. http://www.digitalisland.net
[8] Z. Fei, S. Bhattacharjee, E. W. Zegura, and M. Ammar. A novel server selection technique for improving the response time of a replicated service. In Proc. IEEE Infocom '98, Mar. 1998.
[9] Foundry Networks. ServerIron Internet traffic management switches. http://www.foundrynet.com/PDFs/ServerIron3_00.pdf
[10] A. Fox, S. Gribble, Y. Chawathe, and E. Brewer. Cluster-based scalable network services. In Proc. ACM SOSP '97, Oct. 1997.
[11] G. Hunt, E. Nahum, and J. Tracey. Enabling content-based load distribution for scalable services. Technical report, IBM T.J. Watson Research Center, May 1997.
[12] L. Ong and M. Stillman. Reliable Server Pooling. Working group charter, IETF, Dec. 2000. http://www.ietf.org/html.charters/rserpool-charter.html
[13] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum. Locality-aware request distribution in cluster-based network servers. In Proc. ASPLOS '98, Oct. 1998.
[14] Radware. Web Server Director. http://www.radware.com/archive/pdfs/products/WSD.pdf
[15] A. C. Snoeren and H. Balakrishnan. An end-to-end approach to host mobility. In Proc. ACM/IEEE Mobicom '00, pages 155-166, Aug. 2000.
[16] Fang Yu, Noah Treuhaft, Takashi Suzuki, and Matthew Caesar. Overlay Architecture and API. 2002.
[17] Web infrastructure. http://www.cpusales.com/web_infrastucture.html