CS268 Project Paper

Transparent Failover
Pete Perlegos
UC Berkeley
Abstract
Servers fail more often than users want to know. The clear solution to this is server
replication. But how does a redundant server know how to transparently continue the
connection? There are already good techniques for monitoring the health of a server and
notifying other servers to take over. There are also good techniques for picking the new server.
There are currently some ideas on how to migrate the state by periodically disseminating
information to backup servers. My solution differs in that the state information is passed only at
the beginning of a connection and at failover. The application communicates the
application-dependent information and initial TCP state at the beginning of the initial TCP
connection. Once a failure occurs, the current sequence number is requested from the client.
Simulation shows that connections suffer only a small performance degradation when using this
failover technique.
1. Introduction
Servers on the Internet today do not
have the necessary reliability for mission-critical services. An effective way to
engineer a reliable system out of unreliable
components is to use redundancy. Server
replication is used to provide reliable and
available services on the Web today [4].
Providing reliable, robust service requires
the ability to rapidly transition the client to a
new server from an unresponsive,
overloaded, or failed server during a
connection [12] (Figure 1).
Figure 1: Transparent Failover.
My design is a component that fits
into the applications on the servers and
clients so they can easily benefit from the
transparent failover. When data being
transferred is static or can be made available
on redundant servers, my transparent
failover performs well. If content is being
generated dynamically and is not easily
reproduced by another server, handoff
becomes harder to accomplish. Fortunately,
today's servers and overlay networks are
sophisticated enough to recover dynamic
processing [3].
I discuss the components involved in
designing a transparent failover system in
section 2. Section 3 describes my
architecture for transparent server failover.
My TCP-based implementation is described
in section 4. Section 5 contains simulation
and performance analysis showing the
effectiveness of the failover mechanism. I
conclude with a summary of my
contributions in section 6.
2. Components for Transparent
Failover
It is important to look at all of the
steps to see how my solution fits in. This
section will discuss the components for a
transparent failover system. First, for any
connection in progress, there must be a
method to determine if and when to move it
to another server. Second, there must be a
selection process to identify a set of new
server candidates. Finally, there must also
be a mechanism to move the connection and
seamlessly resume the data transfer from the
new server.
2.1 Health Monitoring
There are already many solutions to
health monitoring [5, 10, 13, 14]. One
solution has a cluster of network servers in
which a front-end directs incoming requests
to one of a number of back-ends (Figure 2)
[13].
Figure 2: Cluster of servers with front-end.
A node's load is measured by the
number of active connections. An
overloaded node will fall behind and the
resulting queuing of requests will cause its
number of active connections to increase,
while the number of active connections at an
underloaded node will tend to zero.
Monitoring the relative number of active
connections allows the front-end to estimate
the amount of outstanding work and thus
adjust the relative load on a back-end
without requiring explicit communication
with the back-end node. If one of the
replicated back-end servers is overloaded or
goes down, connections can be offloaded to
one of the other servers.
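The front-end heuristic above, choosing the back-end with the fewest active connections and treating failed nodes as ineligible, can be sketched in a few lines. The function name and data layout here are illustrative, not from [13]:

```python
# Hypothetical sketch of the front-end heuristic: the back-end with the
# fewest active connections is assumed to be the least loaded, with no
# explicit load reports required from the back-end nodes.

def pick_backend(active_connections):
    """active_connections maps back-end name -> connection count,
    or None for a back-end that is down."""
    live = {b: n for b, n in active_connections.items() if n is not None}
    if not live:
        raise RuntimeError("no live back-end available")
    # An overloaded node queues requests and its count grows;
    # an underloaded node's count tends to zero.
    return min(live, key=live.get)

backends = {"be1": 12, "be2": 3, "be3": None}  # be3 has failed
print(pick_backend(backends))  # -> "be2"
```

When a back-end fails, its connections are simply offloaded by calling the same selector over the remaining live nodes.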
Akamai has a Network Operations
Command Center (NOCC) which provides
proactive, real-time performance monitoring
of all servers in Akamai's global network
[3]. The NOCC's proactive stance and
unique view ensure a seamless response to
network conditions.
2.2 Server Selection
There are also many solutions to
server selection [7, 13, 14]. In the
previously mentioned solution, the front-end
uses the content requested, in addition to
information about the load on the back-end
nodes, to choose which back-end server will
handle this request [13].
One of the important factors in the
efficient utilization of replicated servers is
the ability to direct client requests to the best
server. Many techniques have been used to
select a particular server among a set of
replicated servers. One example is to simply
list the servers and have the client pick one,
based on geographical proximity or some
other criteria the client deems appropriate.
But this technique is not transparent to the
user. Furthermore, the closest server
geographically may not have the least end-to-end delay, or it may not be the least
loaded of the servers. Another example is to
use Domain Name System (DNS)
modifications [8] to return the IP address of
one of a set of servers when the DNS server
is queried. This technique is transparent to
the client, but often the DNS server uses a
round robin mechanism to allocate the
servers to clients because it maintains no
server performance information on which to
base its selection decision. In [8],
the authors use an environment in which
servers are distributed across the Internet
and clients identify servers using an
application-layer anycasting service. The
goal in this case is to allocate servers to
clients in a way that minimizes a client's
response time [8]. A significant response
time improvement can be achieved with this
technique over the use of random, or other
performance-independent allocation
mechanisms. A potential problem with an
approach that identifies the best server to
clients is that of oscillation among servers.
Akamai supports Edge Side Includes
(ESI), an open specification for dynamic
assembly and delivery of highly dynamic
Web content at the edge of the Internet. ESI
provides a mechanism for managing content
transparently across application server
solutions, content management systems, and
content delivery networks [3].
Figure 3: Akamai Architecture [16]
There are many varied solutions to
the problem of server selection.
2.3 Connection Migration and
Resumption
Once a connection needs to move
and a new server has been selected, the
client application should continue
seamlessly. The data stream must resume
from exactly where it left off at the old
server. As a result, the transport-layer state
must be moved to the new server, and the
application-layer state appropriately
synchronized and restarted.
There are a variety of methods to
accomplish this. One approach is an
application-independent mechanism which
uses a secure, transport-layer connection
migration mechanism [4]. This method
requires periodic updates and the
maintenance of soft state for the transport
layer in replicated servers. Another approach,
which I propose, uses an application-layer
connection migration mechanism. All initial
application and transport layer state is
passed to the backup server at the start of a
connection. All that is needed to recover is
the current sequence number, which is
passed to the backup server by the client at
failure. So there is no need to burden the
network with additional traffic.
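Concretely, the backup server's bookkeeping reduces to one small record per connection, written once at setup; the sketch below (all names invented, wraparound handling omitted) shows why no periodic state traffic is needed:

```python
# Minimal sketch of the proposed application-layer scheme.
# ConnState, BackupServer, and the method names are illustrative.
from dataclasses import dataclass

@dataclass
class ConnState:
    obj_name: str     # object being transferred
    init_seqno: int   # initial TCP sequence number
    client_ip: str
    client_port: int

class BackupServer:
    def __init__(self):
        self.conns = {}

    def on_connection_start(self, conn_id, state):
        # All state arrives once, at connection setup -- no periodic
        # soft-state synchronization afterward.
        self.conns[conn_id] = state

    def on_failover(self, conn_id, client_seqno):
        # The only extra information needed at failure is the client's
        # current sequence number, supplied in the modified handshake.
        state = self.conns[conn_id]
        resume_offset = client_seqno - state.init_seqno
        return state.obj_name, resume_offset
```

Between setup and failure the backup server sends and receives nothing for this connection.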
3. Failover Architecture
In the transport-layer connection
migration proposal [4], the authors associate
each connection with a subset of the servers
in the system. This is the connection's
support group, the collection of servers that
are collectively responsible for the correct
operation of the connection. Each support
group uses a soft-state synchronization
protocol to distribute weakly consistent
information about the connection to each
server in the group. The state distribution
protocol periodically disseminates, for each
connection, the mapping between the
transport-layer state and the application-level object being sent to the client.
My architecture preserves the end-to-end semantics of a connection across moves
between servers. The application state and
initial sequence number are transferred at
the beginning of the initial transport-layer
connection. Also, if the sequence number
wraps around during a long-running
connection, the backup server must be
informed. Fortunately, this is a rare
case. When the server fails, the application
in the failover server determines the
appropriate point from which to resume
transmission by getting the sequence
number from the client.
3.1 Support Groups
The larger the support group, the
more servers there are to which clients can
be offloaded, so each redundant server
absorbs a smaller sudden load when a server
fails. Unfortunately, in
the transport-layer connection migration
proposal [4], the communication load also
increases, as each member of the group must
advertise connection state to the others. My
solution avoids this overhead, since state is
passed to the backup server only at the start
of the connection and at failure of the
original server.
It will be desirable to limit the
number of candidate servers that
simultaneously attempt to contact the client
in large support groups, as the implosion of
migration requests may swamp the client.
Support groups can be behind a Web switch
or distributed across the Internet. Clearly,
the choice of a live initial server is an
important one, and much previous work has
addressed methods to select appropriate
servers in the wide area.
The choice of support group
membership and final server that handles a
failed client should be engineered in a
manner that avoids the server implosion
problem.
3.2 State Transfer
Previous proposals have suggested
that the transport-layer information would
be passed via periodic soft-state
synchronization [4]. My proposal endorses
passing the transport-layer state only at the
start and at failover. This will solve a
problem presented by the soft-state
synchronization technique in which the
communication load also increases, as each
member of the group must advertise
connection state to the others.
Once a server fails, the application in
the failover server requests a new
connection from the client and determines
the appropriate point from which to resume
transmission via a modified 3-way
handshake. When the client sends the
SYN-ACK, its acknowledgement number is
the same one from the previous connection.
The new server accepts the
acknowledgement number sent by the client
instead of the sequence number it sent. The
new server can then resume the data stream
from exactly where it left off at the old
server.
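The modified handshake can be traced message by message; the sketch below (names invented for illustration) shows where it departs from a normal TCP open:

```python
# Illustrative message-level walk-through of the modified 3-way
# handshake described above.

def modified_handshake(backup_isn, client_last_ack):
    # 1. The backup server opens toward the client with its own ISN.
    syn = {"SYN": True, "seq": backup_isn}
    # 2. The client's SYN-ACK carries the acknowledgement number from
    #    the *previous* (failed) connection, not backup_isn + 1 as
    #    standard TCP would require.
    syn_ack = {"SYN": True, "ACK": client_last_ack + 1}
    # 3. The backup server adopts that number as its resume point,
    #    discarding the sequence number it originally sent.
    resume_seq = syn_ack["ACK"]
    return resume_seq

# If the client had acknowledged up to byte 5096 on the old connection,
# the backup server resumes the stream at sequence number 5097.
```

From the client application's point of view the byte stream continues without a gap.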
3.3 Connection Failover
By using connection migration,
servers can be replicated across the wide
area, and there is no requirement for a
redirecting device on the path between client
and server. The client can select its
candidate server of choice. This can be done
by simply accepting the first migration
message to arrive. The first request to arrive
at the client is likely from the server best
equipped to handle the message. The
response time of a candidate server is the
sum of the delay at the server and
propagation delay of the request to the
client. If a more sophisticated decision
process is desired it can be implemented
either at the candidate servers, or the client
application, or both.
Web switches multiplex incoming
requests across the servers, and rewrite
addresses as packets pass through (Figure 4).
This enables multiple servers to appear to
the external network as one machine,
providing client transparency. The obvious
drawback of this approach, however, is that
all servers share fate with the switch, which
may become a performance bottleneck.
Figure 4: Web switch. This is a common
scenario.
Fortunately, such switches are more
reliable than the computers serving Web
content, so susceptibility to failure is
greatly reduced. Some switches
are available with full redundancy and no
single point of failure [2], providing the
reliability usually associated with the public
telephone network [17].
In the Web switch example, an
optimization can be made so that the
window size and other congestion state are
maintained. But if the failover servers are
distributed across the Internet, the failover
connection must begin from slow-start.
3.4 Security
Since the application state is passed
to the backup server at the start of a
connection, the backup server can be trusted
by the application layer at the client to
initiate a connection to a different end-point.
In the Web switch optimization an
unscrupulous client may choose to take
advantage of a failed server to increase the
window size. This can be done to either
obtain a faster transfer or to intentionally
overwhelm a backup server. To thwart such
a takeover, the application layer component
must provide security and authentication
that the client cannot circumvent.
4. Implementation
I have implemented the transparent
failover in ns-2. I pass the initial state from
the initial server to the backup server at the
start of the connection. This initial state
includes: the object being transferred, the
initial sequence number, and the client IP
address and port number. When the
sequence number wraps around, this event
is also passed to the backup server.
Once a server fails, the application at
the failover server requests a new
connection from the client. The application
layer can start this new connection because
it is trusted just as the original connection
was. The
appropriate point to resume is passed from
client to server via a modified 3-way
handshake. The SYN-ACK from the client
to the backup server contains the
acknowledgement number from the previous
connection. The backup server accepts this
sequence number instead of the one it sent
and continues the transfer from that point.
Figure 5: This is how my solution fits into
server and client applications.
Each connection has its own application
layer management.
4.1 Application Layer
The application layer is implemented
in TCL. To give a clear overview of the
steps from the start of the initial connection,
through failure, to recovery, the following
pseudo code lists the actions performed by
each actor in my implementation. The
actors are: the
initial server (S0), the backup server (S1),
and the client (C0). The pseudo code
follows the format:
TCP source destination (type of packet or data payload)
The numbers at the left of each line indicate
how events are ordered across the actors.
S0:
1: TCP S0 C0 (start sending dataobj)
1: TCP S0 S1 (start, dataobj, InitialSeqno(),
getDestIP(), getSourceIP(), getDport(),
getSport() )
2: TCP S0 S1 (Sequence number wrap-around)
7: TCP S0 C0 (finished sending dataobj)
7: TCP S0 S1 (end, dataobj, getDestIP(), getSourceIP(),
getDport(), getSport() )
S1:
1: recvBytes(start of connection: TCP S0 C0 dataobj)
2: recvBytes(Seq # wrap-around: TCP S0 C0 dataobj)
3: When fail S0 {
4: TCP S1 C0 SYN
6: TCP S1 C0 (dataobj + Seqno
- InitSeqno
+ (Seq# wrap-around * 2^32) )
}
7: recvBytes(end of connection: TCP S0 C0 dataobj)
C0:
1: recvBytes(dataobj from S0)
4: When fail S0 { (S1 SYN)
5: TCP C0 S1 (SYN-ACK, Seqno(TCP S0 C0)+1 )
}
6: recvBytes(dataobj from S1)
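The offset arithmetic in step 6 of S1's pseudo code, including the wraparound term, can be rendered directly, assuming 32-bit TCP sequence numbers:

```python
# Python rendering of S1's resume arithmetic from the pseudo code:
# dataobj offset = Seqno - InitSeqno + (wrap-arounds * 2^32).

SEQ_SPACE = 2 ** 32  # 32-bit TCP sequence number space

def resume_offset(client_seqno, init_seqno, wraparounds):
    # Byte offset into dataobj at which S1 resumes the transfer:
    # the client's current sequence number minus the initial one,
    # plus one full sequence space per wraparound recorded in step 2.
    return (client_seqno - init_seqno) + wraparounds * SEQ_SPACE

# Example: transfer began at ISN 1000, client last reached seqno 5096,
# no wraparound -> resume 4096 bytes into the object.
print(resume_offset(5096, 1000, 0))  # -> 4096
```

S1 then continues sending dataobj starting at this offset, as in line 6 above.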
4.2 Transport Layer
There are only two things that must
be modified at the transport layer to support
the application layer. The first is the 3-way
handshake (Figure 6) at the client to send the
old sequence number with the SYN-ACK if
a failover connection is requested. Second
is the 3-way handshake at the failover server
to accept the sequence number from the
client’s SYN-ACK.
Figure 6: TCP 3-way handshake.
The request for a new connection to
the client application layer is an indication
to the client that the original connection has
failed. When the SYN packet is sent by the
backup server (Host A) to the client (Host B),
the client must have an operation to request
the current sequence number from the old
connection and an operation to insert a new
ACK number into the SYN-ACK packet.
So, instead of ACK=x+1, the SYN-ACK
contains ACK=oldacknum+1.
When the backup server receives the
SYN-ACK, it must accept the ACK number
as the sequence number and discard the
sequence number that it previously sent.
The backup server must also have an
operation to return the current sequence
number.
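The two client-side operations this section calls for, recalling the old connection's acknowledgement number and substituting it into the SYN-ACK, can be sketched as follows (class and method names are invented for illustration):

```python
# Sketch of the client-side transport-layer modifications: remember the
# acknowledgement number of each live connection, and on failover build
# a SYN-ACK that carries ACK = oldacknum + 1 instead of ACK = x + 1.

class FailoverClient:
    def __init__(self):
        self.last_acknum = {}  # old connection id -> last ack number

    def record_ack(self, conn_id, acknum):
        # Operation to retain the current ack number of a connection.
        self.last_acknum[conn_id] = acknum

    def build_failover_synack(self, old_conn_id):
        # Operation to insert the carried-over number into the SYN-ACK
        # sent to the backup server, replacing the normal ACK = x + 1.
        return {"SYN": True, "ACK": self.last_acknum[old_conn_id] + 1}
```

The backup server's matching change is simply to adopt this ACK number and discard the sequence number it sent in its SYN.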
5. Simulation
To simulate a set of realistic network
conditions, the servers, clients, and network
are simulated using the ns-2 simulator. I
pinged several popular Internet sites and
received round-trip times (RTT) of mostly
20 ms to 120 ms. I chose an RTT of
100 ms for my simulation since this is
toward the worst case and latency is a
factor in my solution. Each pipe has a
bottleneck bandwidth of 384 kb/s, which is
a typical speed of today's high-speed
connections. My simulation topology
consists of 2 servers and 1 client
connected by a simple network (Figure 7).
Figure 7: Simulation Topology
R0-R2, R1-R2 are identical.
I have decided on two simulation
scenarios:
1) Data transfer for various rates of
oscillation (on their own
connections)
2) Data transfer with 4 TCP flows
without failure and 1 TCP flow with
failure (the segment with the cross
traffic has a capacity of 1.5 Mb/s)
5.1 Results
I assume that there is negligible
delay in the response time of the servers.
The delay involved is the delay of notifying
the new server to take over and the startup
delay of a TCP connection, which is just the
round trip time (RTT) delay of the SYN,
SYN-ACK handshake and the startup as
window size increases. Any additional
delay degrades the transfer further.
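As a rough illustration of this cost under the simulation parameters (100 ms RTT, 384 kb/s bottleneck), the following back-of-the-envelope model estimates the failover penalty. It assumes a 1460-byte MSS and classic slow-start doubling; it is a sketch, not the ns-2 model itself:

```python
# Back-of-the-envelope failover cost: one RTT for the SYN/SYN-ACK
# handshake, plus the slow-start ramp on the remaining bytes.
# Parameters mirror the simulation; the 1460-byte MSS is an assumption.

RTT = 0.100           # seconds
MSS = 1460            # bytes per segment (assumed)
BW = 384_000 / 8      # bottleneck bandwidth in bytes per second

def slow_start_time(remaining_bytes):
    """Seconds to send remaining_bytes, doubling cwnd each RTT
    until the window covers the bandwidth-delay product."""
    t, cwnd, sent = 0.0, 1, 0
    while sent < remaining_bytes and cwnd * MSS < BW * RTT:
        sent += cwnd * MSS
        cwnd *= 2
        t += RTT
    # Once the window fills the pipe, the rest goes at line rate.
    if sent < remaining_bytes:
        t += (remaining_bytes - sent) / BW
    return t

failover_delay = RTT + slow_start_time(100_000)
```

For a transfer that is many seconds long, this penalty of a couple of seconds is the "small performance degradation" seen in the results; repeated failovers multiply it.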
Figure 8: Data transfer for various rates of
oscillation (simulation scenario 1)
A single server failover has a
negligible effect on the transfer. The greater
the number of failovers or handoffs, the
greater the effect on the transfer.
Figure 9: Data transfer with TCP cross
traffic (simulation scenario 2)
TCP cross traffic does not appear to
have a significant effect on the transfer,
other than the connection having to share the
capacity. The gap between failover
transfers and non-failover transfers is
similar whether or not there is competing
cross traffic; failover suffers only slightly
more in the presence of cross traffic.
6. Conclusion
I described the design and
implementation of a transparent failover
architecture using an application-layer
failover mechanism, with associated
supporting modifications to the transport
layer. My architecture is end-to-end and fits
into server and client applications.
There are a number of benefits of my
transparent failover architecture. It allows
for a connection to continue in the event of a
server failure with small performance
degradation. It also permits wide-area
distribution of backup servers. A strong
benefit is that it does not require the passing
of state periodically because the state is
passed only at the start and at failover.
There are also some limitations to my
design. The applications must be modified to
interact with the failover application layer
instead of the transport layer. Also, if the
object being transferred is dynamically
modified during transfer, the higher
application layer must have a mechanism
for continuing at the backup server.
Acknowledgements:
I would like to thank Kevin Lai and
Professor Ion Stoica for their assistance in
refocusing my project.
References:
[1] Douglas E. Comer. Internetworking
with TCP/IP: Principles, Protocols,
and Architectures, Fourth Edition.
[2] Cisco Home Page
http://www.cisco.com
[3] Akamai Home Page
http://www.akamai.com
[4] Alex C. Snoeren, David G. Andersen,
and Hari Balakrishnan. Fine-Grained
Failover Using Connection
Migration. In Proc. USENIX
Symposium on Internet Technologies
and Systems (USITS), March 2001.
[5] E. Amir, S. McCanne, and R. Katz.
An active service framework and its
application to real-time multimedia
transcoding. In Proc. ACM
SIGCOMM '98, Sept. 1998.
[6] M. Aron, P. Druschel, and W.
Zwaenepoel. Efficient support for P-HTTP in cluster-based web servers.
In Proc. USENIX '99, June 1999.
[7] Digital Island, Inc. Digital Island, Inc.
Home Page.
http://www.digitalisland.net.
[8] Z. Fei, S. Bhattacharjee, E. W.
Zegura, and M. Ammar. A novel
server selection technique for
improving the response time of a
replicated service. In Proc. IEEE
Infocom '98, Mar. 1998.
[9] Foundry Networks. ServerIron
Internet Traffic Management
Switches.
http://www.foundrynet.com/PDFs/Ser
verIron3_00.pdf.
[10] A. Fox, S. Gribble, Y. Chawathe, and
E. Brewer. Cluster-based scalable
network services. In Proc. ACM
SOSP '97, Oct. 1997.
[11] G. Hunt, E. Nahum, and J. Tracey.
Enabling content-based load
distribution for scalable services.
Technical report, IBM T.J. Watson
Research Center, May 1997.
[12] L. Ong and M. Stillman. Reliable
Server Pooling. Working group
charter, IETF, Dec. 2000.
http://www.ietf.org/html.charters/rser
pool-charter.html.
[13] V. S. Pai, M. Aron, G. Banga, M.
Svendsen, P. Druschel, W.
Zwaenepoel, and E. Nahum. Locality-aware request distribution in cluster-based network servers. In Proc.
ASPLOS '98, Oct. 1998.
[14] Radware. Web Server Director.
http://www.radware.com/archive/pdfs
/products/WSD.pdf
[15] A. C. Snoeren and H. Balakrishnan.
An end-to-end approach to host
mobility. In Proc. ACM/IEEE
Mobicom '00, pages 155-166, Aug.
2000.
[16] Fang Yu, Noah Treuhaft, Takashi
Suzuki, Matthew Caesar. Overlay
Architecture and API. 2002.
[17] Web Infrastructure.
http://www.cpusales.com/web_infrast
ucture.html