
Multi-Root I/O Virtualization Based Redundant Systems

SCIS&ISIS 2014, Kitakyushu, Japan, December 3-6, 2014
Sendren Sheng-Dong Xu1,*, Member, IEEE, Chia-Hong Wang1, Teng-Chang Chang1, and Shun-Feng Su1,2, Fellow, IEEE
Graduate Institute of Automation and Control, National Taiwan University of Science and Technology, Taipei, Taiwan
Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
*E-mail: sdxu@mail.ntust.edu.tw
Abstract—Redundancy, a method designed to prevent failures
caused by software or hardware problems, is one of the most
common applications in fault-tolerant systems. In this paper, we
provide a multi-root I/O virtualization (MR-IOV) based
redundant system architecture which supports high
performance, reliability, and scalability, to improve on the
conventional redundant architecture that uses a hardware
multiplexer for the fail-over function. In the conventional
architecture, a failure of the primary host results in
disconnection from the SR-IOV device. To fix this drawback, we
propose a redundant architecture that saves the SR-IOV status in
shared memory, and the backup system applies the saved state to
take over from the failed primary host. From the experimental
results, we observe that the proposed architecture is feasible
and that it is better than the conventional redundant
architecture.
Keywords—fault tolerance, fault-tolerant systems, multi-root
I/O virtualization (MR-IOV), PCI Express, redundant systems,
single-root I/O virtualization (SR-IOV)
Redundancy is the duplication of critical components or
functions of a system with the intention of increasing the
reliability of the system. Usually in the form of a backup or
fail-safe, it is the quality of systems or elements of a system
that are backed up with secondary resources. Redundant
configurations have been used in research and design to provide
system fault tolerance. Fault tolerance is concerned with the
continuation of correct operation of a system despite an
internal fault [1]. Fault tolerance is a way to design a
reliable system using unreliable components and is achieved by
using different methods: time or temporal redundancy [2]-[3],
information redundancy [4], software redundancy [5], and
hardware redundancy [6]-[8]. Hardware redundancy is one of the
most common applications of fault-tolerant systems, designed to
prevent failures due to hardware components. Typically,
components have multiple backups and are separated into
smaller "segments" that act to contain a fault, and extra
redundancy is built into all physical connectors, power
supplies, fans, etc. [9]. In hardware redundancy, N identical
copies of a program are executed in N hardware channels. For
example, STAR [13] and FTSC [14] have N = 2; C.vmp [15],
FTMP [16], and SIFT [17] have N = 3; the Space Shuttle [18] has
N = 4; and DEDIX [19] has N varying from 2 to 20 [10]. Software
using real-time redundancy for fault tolerance is based on
nullifying programming errors, or filling in static “emergency”
subprograms for the crashed programs. There are many ways to
conduct such fault regulation, depending on the application and
the available hardware [11]. Recently, virtualization has become
one feasible method for redundancy. Virtualization can easily
reproduce the physical functions of hardware, including the CPU,
OS, or peripherals. Hardware virtualization, or platform
virtualization, refers to the creation of a virtual machine that
acts like a real computer with an operating system. Software
executed on these virtual machines is separated from the
underlying hardware resources. However, a failure of a hosting
server becomes a serious problem in consolidated server systems
using virtualization. Virtual machines depend on the physical
devices and the virtualization platform on the hosting server.
When the hosting server goes down due to a failure of any of its
components, none of the virtual machines on this server can
avoid a service outage. The more virtual machines the hosting
server hosts, the more serious the damage caused by a failure of
this hosting server [12].
This paper presents an application of the multi-root I/O
virtualization (MR-IOV) method to a case which establishes a
redundant configuration against host server failures with
multiple host servers. The rest of the paper is organized as
follows. Section II describes some related works, the
configuration, and the requirements for consolidated server
systems using virtualization in a PCI Express switch. Section III
provides a problem definition for determining redundant virtual
machine configurations while minimizing the number of
required hosting servers. Section IV discusses the experiments
and the architecture comparison for redundant applications.
Section V is the summary and conclusion of this paper.
978-1-4799-5955-6/14/$31.00 ©2014 IEEE
The conventional redundant architecture is shown in Fig. 1.
This mechanism uses a hardware multiplexer or a switching
method to achieve host fail-over. It depends on the state pin of
the host to inform the hardware multiplexer. The hardware
multiplexer contains control logic that determines which source
is routed to the destination.
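As a rough illustration of the control logic described above, the following minimal Python sketch models a two-input multiplexer driven by host state pins. The function name `mux_select`, the boolean pin encoding, and the rule that the primary host wins whenever its pin is asserted are all assumptions made for illustration; the paper does not specify the actual selection logic.

```python
from typing import Optional


def mux_select(primary_ok: bool, secondary_ok: bool) -> Optional[str]:
    """Pick which host the multiplexer routes to the device.

    Assumed policy (not from the paper): prefer the primary host
    whenever its state pin reports healthy, otherwise fall back to
    the secondary host, otherwise route nothing.
    """
    if primary_ok:
        return "primary"
    if secondary_ok:
        return "secondary"
    return None


# Normal operation, primary failure, and double failure.
print(mux_select(True, True))
print(mux_select(False, True))
print(mux_select(False, False))
```

The sketch only captures source selection; it keeps no device state, so nothing is carried over from one host to the other when the selection changes.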
Fortunately, the success of PCI Express has been primarily
as a fan-out interconnection, enabling CPU, I/O, and storage
devices, all of which have PCI Express access points, to
communicate. There has been penetration into more
sophisticated applications, such as host failover, and the PCI
Express interconnection standard has even been used as a
backplane to connect PCI Express based subsystems. Due to
the performance of PCI Express at generation 3 and its
widespread adoption on devices, this popular interconnection
has become an attractive alternative to current solutions as a
fabric for data center and cloud computing applications.
The PLX PEX8796 [20] device offers multi-host PCI
Express switching capacity, enabling users to connect multiple
hosts to their respective endpoints via a scalable, high
bandwidth, non-blocking interconnection for a wide variety of
applications. This solution employs an enhanced architecture
which allows users to configure the device in single-host or
multi-host mode. In multi-host mode, the PEX8796 can be
configured with up to four upstream ports to host systems, each
with its own dedicated downstream ports. The PEX8796 allows the
hosts to communicate their status to each other by accessing
special registers (doorbell or mailbox registers).
In a dual-host application, provision is made for both a
primary (or active) host and a secondary (or backup) one.
During normal operation, heartbeat messages are sent from the
primary to the secondary to indicate that the primary is still
alive. Checkpoint messages containing the current state and
transaction history are also sent periodically from the primary
to the secondary. The job of the backup host is to monitor the
state of the primary and, upon detection of its failure, to take
over as primary host, continuing system operation from the last
valid checkpoint. In our model, the secondary or backup host is
connected to the system via a non-transparent bridge, while the
primary or active host is connected via a transparent bridge.
The secondary host connection could be directly to its Root
Complex or through a fabric connection. In the latter case, both
hosts may be active simultaneously. In this case, heartbeat and
checkpoint messages would flow in both directions. The BARs
on both sides of the non-transparent bridge are used to create
tunnels through which each host may send messages to the
other host. The doorbell registers available in the NTB may be
used for heartbeat messages. The memory access tunnels are
used for checkpoint and other data transfer.
Failure of the primary host is detected when the secondary
host fails to receive a certain number of the regularly scheduled
heartbeat messages. As part of the failover process, the
secondary host's port copies the primary host's SR-IOV status in
the management port. Failure of the primary host likely leaves
switch buffers backlogged and device endpoints with
incomplete transactions. During the failover process, the
secondary host causes the buffers to be flushed and terminates
incomplete transactions at the endpoints. It then reconfigures
the system with itself as host and restarts the devices and
applications in some application-specific way, using the
checkpoint data in the management CPU [21].
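The detection rule above (declare failure after a certain number of missed heartbeats) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `HeartbeatMonitor`, the one-second interval, and the threshold of three missed heartbeats are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """State kept by the secondary host (illustrative only)."""
    interval_s: float          # assumed heartbeat period
    max_missed: int            # assumed number of misses that means failure
    last_seen_s: float = 0.0   # time of the most recent heartbeat

    def on_heartbeat(self, now_s: float) -> None:
        # Called whenever a heartbeat from the primary arrives.
        self.last_seen_s = now_s

    def missed(self, now_s: float) -> int:
        # Whole heartbeat periods elapsed since the last heartbeat.
        return int((now_s - self.last_seen_s) // self.interval_s)

    def primary_failed(self, now_s: float) -> bool:
        return self.missed(now_s) >= self.max_missed


monitor = HeartbeatMonitor(interval_s=1.0, max_missed=3)
monitor.on_heartbeat(now_s=10.0)
print(monitor.primary_failed(now_s=12.5))  # two periods missed so far
print(monitor.primary_failed(now_s=13.0))  # third period missed
```

Timestamps are passed in explicitly so the logic can be exercised without a real clock; in a running system they would presumably come from a monotonic timer on the secondary host.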
Fig. 1. The conventional redundant architecture.
Multiple hosts are supported by a non-transparent bridge
and an RDMA-NIC emulating DMA controller at every host
port. The hosts communicate by exchanging ID-routed
vendor-defined messages in the global space, which is isolated
from the hosts by the non-transparent bridges. We use the
vendor-provided PF (Physical Function) and VF (Virtual Function)
drivers to implement MR sharing of SR-IOV endpoint functions.
This is achieved by a CSR redirection process that allows the
management CPU to snoop and intervene on configuration
space transfers and to configure the requisite address and ID
translations transparently to the software running on the
servers.
Our system architecture is illustrated in Fig. 2. It supports
the multi-root sharing of SR-IOV endpoints. A management
port manages the virtual and physical functions in each SR-IOV
endpoint. The virtual functions of multiple SR-IOV endpoints
can be shared among multiple hosts or systems. A physical
function can be shared by several virtual functions in the same
endpoint. The management port is used for I/O management and
fabric routing. It connects to all switches in the fabric via a
separate control plane.
PCI Express at generation 3 retains the non-transparent
bridges but adds significant enhancements. For host-to-host
communications, look-up table address translation in the NTBs
provides more flexibility and improves performance in the
systems, allowing small, successive local windows to be
scattered across the global space and protecting local memory
from external corruption by means of write enable and read
enable permissions and a RID check field in each entry. Finally,
the addition of a DMA messaging engine changes the host-to-host
communications model from a load/store operation to a
networking model, which allows applications written to standard
networking APIs to run over a PCI Express network essentially
The procedure for creating a virtual SR-IOV endpoint for
multiple hosts is described as follows:
Step 1: The host sends a configuration transaction to the
management CPU through the management port.
Step 2: The management CPU receives the configuration
request from memory.
Step 3: After receiving it, the management CPU issues the
configuration transaction to the PCI-E endpoint connected to
the downstream port.
Step 4: The management CPU returns the transaction status to
the host.
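The four steps above can be sketched as a small in-process simulation. This is only a model under stated assumptions: the classes `Endpoint` and `ManagementCpu`, the register name `vf_enable`, and the `"OK"` status string are hypothetical, and a Python queue stands in for the memory from which the management CPU reads the request.

```python
from collections import deque


class Endpoint:
    """Stand-in for a PCI-E endpoint behind a downstream port."""

    def __init__(self):
        self.config_space = {}

    def config_write(self, register, value):
        self.config_space[register] = value
        return "OK"  # assumed status encoding


class ManagementCpu:
    """Relays host configuration requests to the endpoint."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.request_queue = deque()  # models requests held in memory

    def post(self, host_id, register, value):
        # Step 1: the host's request arrives through the management port.
        self.request_queue.append((host_id, register, value))

    def service_one(self):
        # Step 2: fetch the pending request from memory.
        host_id, register, value = self.request_queue.popleft()
        # Step 3: issue the configuration transaction to the endpoint.
        status = self.endpoint.config_write(register, value)
        # Step 4: report the transaction status back to the host.
        return host_id, status


endpoint = Endpoint()
mgmt = ManagementCpu(endpoint)
mgmt.post("host1", "vf_enable", 1)   # hypothetical register name
print(mgmt.service_one())
print(endpoint.config_space)
```

Routing every configuration access through one relay is what lets a single agent keep a consistent view of the endpoint while several hosts use it.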
In this scenario, we assume Host 2 is our redundant or
backup system. We create a shared memory area in Host 2 that
allows the primary system (i.e., Host 1) to save the MR-IOV
status of the virtual function. Our proposed method uses
inter-process communication (IPC) to create the connection
channel between Host 1 and Host 2. After the connection is
established, the hosts periodically synchronize and exchange
these data and statuses in shared memory. When Host 2 receives
the event or alarm signal from the management CPU indicating
that Host 1 has failed, it applies the statuses from shared
memory in place of Host 1's status and processes the remaining
tasks. After Host 1 has been reset or has resumed completely, it
gets these resources back from Host 2. Finally, the management
CPU informs Host 2 to deliver these tasks and statuses to
Host 1.
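The sync, fail-over, and hand-back sequence described above can be illustrated with a minimal single-process sketch. Everything named here is an assumption for illustration: the classes `Host` and `SharedStatusArea`, the functions `fail_over` and `fail_back`, and the `queue_head` status field are hypothetical, and a plain Python object stands in for the shared memory area and IPC channel.

```python
import copy


class Host:
    def __init__(self, name):
        self.name = name
        self.vf_status = {}   # hypothetical per-VF status fields
        self.active = False


class SharedStatusArea:
    """Stand-in for the shared memory area created on Host 2."""

    def __init__(self):
        self.snapshot = {}

    def sync_from(self, host):
        # Periodic sync: the primary saves a copy of its current status.
        self.snapshot = copy.deepcopy(host.vf_status)


def fail_over(shared, backup):
    # The backup applies the last saved status and becomes active.
    backup.vf_status = copy.deepcopy(shared.snapshot)
    backup.active = True


def fail_back(backup, primary):
    # After the primary recovers, the backup hands the status back.
    primary.vf_status = copy.deepcopy(backup.vf_status)
    primary.active, backup.active = True, False


host1, host2 = Host("Host 1"), Host("Host 2")
shared = SharedStatusArea()
host1.active = True
host1.vf_status = {"vf0": {"queue_head": 7}}
shared.sync_from(host1)
host1.vf_status["vf0"]["queue_head"] = 9  # changed after the last sync
host1.active = False                      # Host 1 fails
fail_over(shared, host2)
print(host2.vf_status)                    # status as of the last sync
fail_back(host2, host1)
print(host1.active, host2.active)
```

As the example shows, any change made after the last synchronization is not visible to the backup, so the sync period bounds how much state can be lost on fail-over.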
Fig. 2. The proposed system architecture.
Our experimental environment runs on 64-bit Linux RHEL
6.3. The management software, NIC driver, and RDMA
driver (whose stack supports Open Fabrics 3.5 [22]) are
available from PLX support [20]. The shared I/O driver is HBA vendor
Table I is the architecture comparison for redundant
systems. In comparison with the conventional method, our
architecture is based on the virtualization concept to recover a
failed host system. It provides a mechanism that uses a software
stack to isolate the physical and virtual functions in the
SR-IOV endpoint. It creates a shared memory area through which
the hosts of a multi-host system handshake their statuses.
Table I. Architecture comparison for redundant systems.
Comparison items: system reliability; maintenance effort;
virtualization support; MR/SR-IOV support; system scale;
depending on hardware switching.
In this paper, we discussed a class of redundant systems and
architectures. The conventional redundant architecture adopts a
hardware multiplexer to fail over the connected device from the
host system. It depends on a heartbeat or status signal to
determine the fail-over source. When the primary host fails, it
is disconnected from the SR-IOV device. In order to keep the
state, we proposed a redundant architecture that saves these
statuses in shared memory. The secondary host applies the state
to take over from the failed primary host. Finally, the
experimental results show that the proposed architecture is
feasible and better than the conventional redundant
architecture.
[1] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, “Basic
concepts and taxonomy of dependable and secure computing,” IEEE
Trans. Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan.
[2] S. Borkar, “Designing reliable systems from unreliable components: The
challenges of transistor variability and degradation,” IEEE Micro, vol.
25, no. 6, pp. 10-16, Nov. 2005.
[3] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized
CPU resources to enhance its reliability,” IEEE Trans. Dependable and
Secure Computing, vol. 7, no. 1, pp. 94-109, 2010.
[4] A. Ejlali, B. M. Al-Hashimi, M. T. Schmitz, P. Rosinger, and S. G.
Miremadi, “Combined time and information redundancy for
SEU-tolerance in energy-efficient real-time systems,” IEEE Trans. Very
Large Scale Integration (VLSI) Systems, vol. 14, no. 4, pp. 323-335,
Apr. 2006.
[5] T. Tsai, “Fault tolerance via N-modular software redundancy,” Proc.
28th Int’l Symp. Fault-Tolerant Computing (FTCS-28), pp. 201-206,
[6] S. Mitra, N. R. Saxena, and E. J. McCluskey, “A design diversity metric
and analysis of redundant systems,” IEEE Trans. Computers, vol. 51,
no. 5, pp. 498-510, May 2002.
[7] W. Dabney, L. Etzkorn, and G. W. Cox, “A fault-tolerant approach to
test control utilizing dual-redundant processors,” Advances in Eng.
Software, vol. 39, pp. 371-383, 2008.
[8] R. Samet, “Recovery device for real-time dual-redundant computer
systems,” IEEE Trans. Dependable and Secure Computing, vol. 8, no. 3,
pp. 391-403, May-June 2011.
[9] Fault-tolerant
[10] R. Samet, “Recovery device for real-time dual-redundant computer
systems,” IEEE Trans. Dependable and Secure Computing, vol. 8, no. 3,
pp. 391-403, May-June 2011.
[11] D. K. Pradhan, “Fault-Tolerant Computer System Design Book
Contents,” pp. 221-235, 1996.
[12] F. Machida, M. Kawato, and Y. Maeno, “Redundant virtual machine
placement for fault-tolerant consolidated server clusters,” Network
Operations and Management Symposium (NOMS), vol. 32, no. 39, pp.
19-23, April 2010.
[13] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and
D. K. Rubin, “The STAR (Self-Testing and Repairing) computer: an
investigation of the theory and practice of fault-tolerant computer
design,” IEEE Trans. Computers, vol. 20, no. 11, pp. 1312-1321, Nov.
[14] D. D. Burchby, L. W. Kern, and W. A. Sturm, “Specification of the
fault-tolerant spaceborne computer (FTSC),” Proc. 1976 International
Symposium on Fault-Tolerant Computing, pp. 129-133, June 1976.
[15] D. Siewiorek, M. Canepa, and S. Clark, “C.vmp: the architecture of a
fault-tolerant multiprocessor,” Proc. 1977 International Symposium on
Fault-Tolerant Computing, June 1977.
[16] A. L. Hopkins, T. B. Smith, and J. H. Lala, “FTMP: a highly reliable
fault-tolerant multiprocessor for aircraft,” IEEE Trans. Computers, vol.
66, no. 10, 1978.
[17] J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt,
P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock, “SIFT: design
and analysis of a fault-tolerant computer for aircraft control,” Proc.
IEEE, vol. 66, no. 10, pp. 1240-1255, Oct. 1978.
[18] J. R. Sklaroff, “Redundancy management technique for space shuttle
computers,” IBM J. Research and Development, vol. 20, pp. 20-28,
[19] A. Avizienis, P. Gunningberg, J. P. J. Kelly, R. T. Lyu, L. Strigini, P. J.
Traverse, K. S. Tso, and U. Voges, “The UCLA DEDIX system: a
distributed testbed for multiple version software,” International
Symposium on Fault Tolerant Computing, pp. 126-134, June 1985.
[20] PEX8796 datasheet ver. 1.2, PLX Technology, Inc., 24 July 2012.
[21] J. Regula, “Using Non-transparent Bridging in PCI Express Systems,”
PLX Technology, Inc., 1 June 2004.
[22] Open Fabrics Alliance, https://www.openfabrics.org/index.php