A New Disk I/O Model of Virtualized Cloud Environment

Dingding Li, Xiaofei Liao, Member, IEEE, Hai Jin, Senior Member, IEEE, Bingbing Zhou, and Qi Zhang
Abstract—In a traditional virtualized cloud environment, using asynchronous I/O in the guest file system and synchronous I/O in the host file system to handle an asynchronous user disk write exhibits several drawbacks, such as performance disturbance among different guests and consistency maintenance across guest failures. To address these issues, this paper introduces a novel disk I/O model for virtualized cloud systems, called HypeGear, in which the guest file system uses synchronous operations to deal with guest write requests while the host file system performs asynchronous operations to write the data to the hard disk. A prototype system is implemented on the Xen hypervisor, and our experimental results verify that this new model has many advantages over the conventional asynchronous-synchronous model. We also evaluate the overhead of the asynchronous I/O at the host introduced by our new model; the results demonstrate that it imposes little cost on the host layer.
Index Terms—Virtualization, file system, asynchronous I/O, synchronous I/O.
1 INTRODUCTION
As an emerging IT paradigm that separates computing
functions and technology implementations from physical
hardware, virtualization technology can play a key role
in maintaining flexibility and scalability for cloud deployments [1]. Although having many advantages over
a native, or non-virtualized system, a virtualized cloud
is more complex, and thus, more difficult to understand
and improve [2]. This is often caused by a hypervisor
layer that abstracts the underlying hardware resources
for multiple guest instances.
One particular type of abstraction, which we use
often but have not yet fully understood and improved,
is the file system. A file system in a typical virtualized
system generally presents a two-level nested-structure
in which a host file system maps regular files as virtual
block devices to guest file systems [3]. Compared with a
single-level native file system, this two-level file system
structure is more complex since a disk I/O request from
a guest is required to be processed twice — first in the
guest and then in the host.
Currently, virtualized systems usually use an
asynchronous-synchronous (async-sync) model to handle
user asynchronous writes [4], [5]. With this model, the
guest file system uses asynchronous I/O instructions
to deal with write requests [6], [7]. When dirty data
need to be written to the hard disk drive, the data are
• D. Li, X. Liao, H. Jin, and Q. Zhang are with the Services Computing
Technology and System Lab, Cluster and Grid Computing Lab, School
of Computer Science and Technology, Huazhong University of Science
and Technology, Wuhan, 430074, China. Xiaofei Liao is the corresponding
author.
dingly.hust@gmail.com, xfliao@hust.edu.cn, hjin@hust.edu.cn, cheungrck@gmail.com
• B. Zhou is with the School of Information Technologies, Sydney University,
NSW 2006, Australia. Email: bbz@it.usyd.edu.au
Manuscript received March 1, 2012.
first transferred to the host, and the host file system
then performs synchronous I/O operations to flush the
data to the disk. The advantage of this model is its simplicity: the host can be kept small and simple, and any existing modern OS can serve as the guest without modifications.
However, this model has several disadvantages. When
a guest decides to flush its dirty data (for example, when
the application explicitly invokes a flush operation), the
host file system has to write the data synchronously
back to the disk in real time. If the amount of data to
be transferred is large, the normal operations of other guest systems will be seriously affected by these intensive synchronous I/O operations. As another example, when a guest OS is terminated abnormally, important data stored in the guest file system cache will be lost [8]. Thus, there is a trade-off among performance, data reliability, simplicity, and generality.
Though the host file system is supposed to manage the
I/O operations on behalf of the guest file systems, the
requirement of synchronous I/O makes the management
less effective and less efficient. To alleviate this problem
we propose a new disk I/O working model. Rather
than performing asynchronous I/O at the guest and
synchronous I/O at the host, in our new I/O model the
guest file system uses synchronous operations to handle
disk I/O requests and the host system then performs
asynchronous I/O to transfer data to the hard disk.
With this synchronous-asynchronous I/O arrangement, data stored in the guest file system cache are always up-to-date, and the dirty data are maintained and managed by the host. This arrangement has several potential advantages over the conventional async-sync I/O model.
With synchronous I/O at the guest, data inside its file
system cache are always considered clean, and so there is
no need to flush data from the guest system periodically
or when the guest system is terminated normally or
abnormally. As a result, the I/O management inside the
guest can be made simpler and more efficient.
On the other hand, the guest is usually less stable than the hypervisor layer [9], [10]. This is because the guest OS is directly exposed to user applications, which may contain various fatal bugs [11], causing the whole system to crash. In such a situation,
if dirty data are stored and managed at the guest,
important data will be lost (unless additional complex
mechanisms are available to flush the dirty data from
the guest cache to the hard disk in real time). Using
synchronous I/O operations at the guest, user dirty data
will still be kept in the host when the guest system
crashes. Thus, the system becomes more robust to guest
faults.
Since the host can acquire system-wide information about multiple guest systems, with asynchronous I/O operations at the host, flushing dirty data back to the hard disk can be done when the physical system is not busy [12]. Therefore, the virtualized system becomes more intelligent in handling the concurrent I/O flows from multiple guest OSes. Furthermore, because the hypervisor at the host can directly and accurately acquire information about the physical disk devices [13], asynchronous I/O at the host provides opportunities for disk I/O optimization according to specific device-dependent characteristics.
Based on this new I/O model, we designed a new
architecture called HypeGear. This architecture is built
upon the conventional virtualized system with several
added/modified components. The key added components include one at the guest that converts a normal asynchronous I/O write from the user application into a synchronous one, and a block-level cache at the host, called GearCache, that maintains and manages dirty data. It is
worth noting that a user application could perform an
original synchronous I/O write. In such a case, HypeGear
provides a kind of sync-sync operation to bypass its
block-level cache and make the data in the guest file
system cache directly consistent with the copy on the
hard disk. We have implemented a prototype HypeGear
system on the Xen hypervisor. Extensive experiments
have been conducted to verify the advantages, evaluate
the performance, and measure the overhead of our new
disk I/O model.
The rest of this paper is organized as follows. Section
2 presents the background and motivation of our work.
The architecture of HypeGear is presented in Section
3. Section 4 describes the implementation details of
HypeGear on the Xen hypervisor. The experimental results
are presented in Section 5. The related work is briefly
discussed in Section 6. Finally, Section 7 concludes the
paper.
2 BACKGROUND AND MOTIVATION
The idea of using asynchronous I/O or a caching scheme in the host file system to relieve the disk I/O bottleneck in a virtualized system is not new. By leveraging the file system cache in the host, a common hypervisor can easily enable a caching or asynchronous scheme for guest image files, which yields an async-async model for dealing with guest write requests [14].
However, this async-async model has heavy data redundancy between the file system caches of the guest and the host. Taking a Linux-based hypervisor as an example, when a guest reads/writes a block from/to the physical disk device, the related I/O buffer is not only cached in the guest page cache but also stored in the host page cache. As the number of guests increases, the guests impose more redundant I/O activities upon the host file system and keep the hypervisor layer very busy dealing with them (e.g., allocating/freeing memory to accommodate/release guest data). Considering that the hypervisor layer at the host plays a critical role in hosting multiple guest-OS instances, an overburdened host file system, which is located in the host kernel space, may continuously consume host resources and thus degrade the efficiency of the whole virtualized system. To avoid this problem, current hypervisors generally use direct I/O (an async-sync style) to manipulate the guest image files [4], [5] without caching guest I/O data in the host file system.
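To make the async-sync configuration more concrete, the sketch below is our own illustration (not code from the Blktap source) of how a host-side handler might open a guest image file with O_DIRECT so that guest data bypass the host page cache; the image path is hypothetical.

/* Illustrative only: opening a guest image with O_DIRECT (async-sync
 * style) so writes bypass the host page cache. Not taken from Blktap;
 * error handling is minimal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *image = argc > 1 ? argv[1] : "guest.img"; /* hypothetical path */
    int fd = open(image, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires block-aligned buffers, offsets, and lengths. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
    memset(buf, 0xAB, 4096);

    /* The write goes straight to the device, not into the host page cache. */
    if (pwrite(fd, buf, 4096, 0) != 4096)
        perror("pwrite");

    free(buf);
    close(fd);
    return 0;
}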
To make a case here, we have studied the newest Xen hypervisor 4.1.2 with its Blktap AIO driver on an experimental machine, which has dual quad-core Intel Xeon(R) 1.6GHz processors, 8GB DDR2 RAM, a 160GB 7200 RPM SATA II hard drive (ST3160815AS), and dual full-duplex Intel Pro/1000 Gbit/s NICs. The driver domain (namely the host OS) runs the 64-bit CentOS 6.0 distribution, and the hypervisor is Xen 4.1.2 with the Linux 2.6.18.8-Xen kernel. The guest domains (namely guest OSes) run CentOS 6.0, each with 512MB of memory, 16GB of disk capacity, the Linux 2.6.18.8 kernel, and the ext3 file system in ordered mode.
Figure 1 presents a comparison between the async-sync model (activating the O_DIRECT flag when the Blktap AIO driver opens guest image files) and the async-async model (clearing the O_DIRECT flag). It clearly shows that async-async can improve guest I/O performance by leveraging a page cache in the host file system, but at the cost of extra memory and CPU resources in the hypervisor.
An I/O completion in a native system indicates that a write has been committed. Therefore, using the async-sync model in a virtualized system has another motivation: keeping the guest flush semantics, which the guest file system and guest applications rely on. However, this creates a potential pitfall once the system is virtualized and runs multiple OS instances. If one of them issues massive synchronous flush operations [15] and exploits this strong adherence to the guest flush semantics to dominate the shared disk device, the I/O performance of
other normal guests will be severely disturbed. Figure 2 demonstrates this negative impact under the same experimental environment. It clearly shows that the normal guest suffers a large fluctuation in file I/O latencies, spanning four orders of magnitude on the y-axis, when running alongside a flush-intensive guest.
Fig. 1. Three guests run IOzone simultaneously. Each of them produces asynchronous write-only I/O flows with a 512MB test file size, 4KB block size, and sequential access pattern. Although all guests can finish this test within 260 seconds under the async-async model (about 200 seconds faster than under the async-sync model), it consumes up to 1.5GB of redundant host memory (about the total size of the three test files) and 50% of the host CPU at its peak.
The async-sync model also disregards the specific characteristics of the underlying storage device. Due to its synchronism with guest OSes, each interaction between the host file system and the physical disk is constrained to the view of an individual guest. Unfortunately, in the general case different guests have different image files. When dealing with concurrent I/O requests from different guests, the magnetic head of a rotating disk may move irregularly among the image files, even though these I/O flows are logically sequential from the perspective of their own guests. Eventually, this situation contradicts the physical characteristics of a rotating disk device, and the virtualized system thus delivers poor disk I/O performance.
Fig. 2. Fluctuation of the file I/O latencies of the normal guest; the y-axis is logarithmic. Two guests run simultaneously: one performs normal file I/O, issuing a single read/write request to the disk device every second, while the other plays a disturbance role, using IOzone to produce continuous small (4KB) flush operations from the 131st second to the 245th second. Most latencies on either side of this window are not plotted due to their small values.
Based on the above analysis and observations, we find that the async-sync model keeps the hypervisor simple through its synchronous manner in the host file system. However, synchronous operations in the host may cause a serious problem of performance variation for each individual guest. Although the async-async model can relieve these shortcomings by using asynchronous I/O operations in the host file system, the overhead imposed on the hypervisor or host is very high. Therefore, we expect a hypervisor or host to use a more intelligent asynchronous I/O model that handles guest flush operations at the system level, not just based on the needs of each individual guest. Meanwhile, the simplicity of the hypervisor should also be preserved, rather than introducing a heavyweight scheme such as the simple async-async model just discussed.
Under both the async-sync and async-async models, the guest file system uses asynchronous I/O to deal with application write requests [7]. The upside of these models is high performance and compatibility, since existing modern operating systems can be virtualized without any modifications. However, there are some limitations. First, they complicate data management policies in the guest file system. A Linux-based guest involves complex flush logic to control the flushing of user dirty data [12]. In this way the code base of the guest file system or development library is enlarged and complicated, which may increase the risk of faults and the extra consumption of CPU and memory resources [11]. Second, asynchronous I/O operations in the guest sacrifice the reliability of dirty data, as a fault occurring in the guest OS may cause the dirty data in the guest file system cache to be lost or tampered with [6].
Several works have been proposed for native file systems to address the aforesaid problems [6], [16]. However, these methods are custom-made for native systems. Importing them into current virtualized systems generally requires extra hardware or software mechanisms, sometimes with expensive provisioning1. For a two-level file system, therefore, a general, simple, drop-in solution for the original virtualized environment is highly desirable.
3 HYPEGEAR
3.1 Overview
To alleviate the above-mentioned problems in a typical virtualized system, we propose a new disk I/O model, called HypeGear. It uses synchronous I/O at the guest file system and asynchronous I/O at the host, namely a sync-async I/O model. Synchronous I/O on the guest side simplifies the guest file system and enhances the reliability of user dirty data, while asynchronous I/O on the host side flushes guest data with system-wide optimization based on device-specific characteristics. To realize this idea, an I/O convertor, SignPost, and a block-level cache, GearCache, are added into the guest and the hypervisor, respectively. With these added components, the life cycle of a normal asynchronous write operation in HypeGear takes the following steps: (1) a user application issues an asynchronous write operation; (2) SignPost detects this operation and converts it into a synchronous write; (3) the write request traps into the hypervisor; (4) the hypervisor stores the I/O data in GearCache and then returns an acknowledgment to the guest; (5) GearCache flushes the dirty data to the disk device in an asynchronous manner.
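The sketch below is our own summary of these five steps; every function is a hypothetical stand-in with a stub body, not HypeGear's real interface.

/* Conceptual sketch of the HypeGear write path described above.
 * All functions are hypothetical stand-ins with stub bodies. */
#include <stdio.h>
#include <stddef.h>

struct write_req { unsigned long offset; const void *buf; size_t len; };

/* Steps 1-2: SignPost converts the application's asynchronous write
 * into a synchronous one, so it leaves the guest file system at once. */
static void signpost_submit_sync(const struct write_req *req)
{
    printf("sync write: offset=%lu len=%zu\n", req->offset, req->len);
}

/* Steps 3-4: the request traps into the hypervisor, which stores the data
 * in the guest's vContext inside GearCache and acknowledges immediately. */
static void gearcache_store_req(const struct write_req *req)
{
    (void)req;
    printf("buffered in GearCache, ack returned to guest\n");
}

/* Step 5: FlushCtrl later writes the cached data to disk asynchronously. */
static void flushctrl_flush_async(void)
{
    printf("flushed to disk in the background\n");
}

int main(void)
{
    struct write_req req = { .offset = 0, .buf = "hello", .len = 5 };
    signpost_submit_sync(&req);
    gearcache_store_req(&req);
    flushctrl_flush_async();
    return 0;
}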
The dirty data produced by a user application in our disk model are handled by SignPost in a synchronous style, which allows the data to pass through the relatively less trusted guest OS instantly. By this conversion, the guest file system can also be simplified. GearCache is designed to maintain host simplicity, as in the async-sync model. It is a dedicated block-level write cache, which buffers write data only and manages them in an effective way. By isolating read data, the structure of GearCache can be kept much simpler than a typical file system cache, which greatly reduces the redundant I/O activities imposed on the host or hypervisor.
1. We will discuss them in detail in Section 6.
3.2 SignPost
Depending on the specific implementation of a given file system or application, the conversion work of SignPost can be realized in various concrete ways, such as parameter tweaking inside a file system, self-tuning in an application, or a combination of both. Unfortunately, each async-to-sync conversion invokes a context switch, i.e., from guest context to hypervisor context, in the current virtualized environment. If an application makes frequent write requests, the system is forced to context switch frequently and application performance suffers [17]. Nevertheless, in Section 5.5 we will show that this cost does not severely affect the performance of typical workloads. Furthermore, by reducing CPU consumption in the guest file system (there is no need for the guest file system to use complex flush management), the overall performance of the virtualized system can even be enhanced, especially when multiple guest instances are running. SignPost has little effect on guest read operations. Since it uses the original synchronous I/O path in the guest file system, a converted write operation leaves a copy in the guest for subsequent read operations on the same data. If a read request misses in the guest file system cache, it first looks up the data in GearCache to keep the data consistent, instead of talking directly to the disk.
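As a rough illustration of that read path (our sketch, with hypothetical types and names, not HypeGear's actual code), a read that misses in the guest cache is first matched against the buffered writes before the disk is touched:

/* Sketch of the read-miss path described above: before going to disk,
 * look the block up among the writes buffered in GearCache. */
#include <string.h>
#include <stddef.h>

struct cached_write {
    unsigned long        offset;       /* location in the image file */
    char                 data[4096];
    struct cached_write *next;
};

/* Returns 1 and copies the block if a buffered write covers `offset`,
 * otherwise returns 0 and the caller falls back to a real disk read. */
int gearcache_read_hit(const struct cached_write *head,
                       unsigned long offset, void *out)
{
    for (const struct cached_write *w = head; w != NULL; w = w->next) {
        if (w->offset == offset) {
            memcpy(out, w->data, sizeof w->data);
            return 1;
        }
    }
    return 0;   /* miss: issue the read to the virtual block device */
}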
3.3 GearCache
Popular hypervisors often create an individual address space for hosting and monitoring each specific guest. For example, both QEMU-IO in KVM [5] and Blktap in the Xen hypervisor [18] redirect guest I/O requests into a special, privileged guest instance, which contains specialized user processes that initiate I/O operations on behalf of the guest. These user processes have their own address spaces, isolated from the real guest memory, so the data in such a space can survive a guest crash. Each of these special memory spaces is called a vContext, and their combination forms our GearCache in the hypervisor. GearCache stores each individual guest's writes into the matching vContext and interposes a flush thread into the event handlers for guest crash, migration, and close.
This design motivates the local-based management principle for GearCache, which not only keeps the design logic of GearCache simple but also provides a selective disk I/O framework for a single physical machine that hosts various guests. For example, since GearCache relaxes the adherence to guest flush semantics to avoid the I/O bottleneck, this block-level write cache may not be suitable for all applications. A possible solution under local-based management is to move applications that require strict flush semantics into other guests that retain the original disk I/O path.
GearCache mainly uses an on-demand principle to manage the space for each vContext, but it adopts a static allocation scheme to restrain the memory consumption of aggressive vContexts.
Fig. 3. Crash recovery on HypeGear: transactions TransA (blocks A1, A2), TransB (B1, B2), and TransC (C1, C2) are flushed by FlushCtrl1, FlushCtrl2, and FlushCtrl3 along the time axis, with crashes possible at times t1 and t2.
Specifically, there is a size limit θ for each individual vContext. GearCache deals with incoming write requests in an on-demand manner if the current usage is below θ: when a new write request arrives, GearCache applies for a piece of memory from the hypervisor to serve it, and this memory is freed and returned to the hypervisor directly after the data is flushed to disk. In this way, GearCache avoids wasting memory on guests that issue few writes. If the current usage of a vContext reaches θ, GearCache simply treats this guest as aggressive and uses a static method to handle upcoming write requests. In detail, the vContext first flushes part of the cached data according to a FIFO policy, then clears the related data but preserves their memory space. When a new write arrives, it simply reuses the cleared space for the new data. In this way, GearCache reduces the overhead of frequent memory allocations and releases under intensive writes.
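A compact sketch of that policy follows; θ, the structure, and the helper names are our own illustrative assumptions, not HypeGear's real code.

/* Illustrative sketch of the vContext space policy described above:
 * below θ, memory is allocated on demand and freed after a flush; once θ
 * is reached, FlushCtrl drains part of the cache FIFO-style and the
 * cleared buffers are reused for new writes. */
#include <stdlib.h>

#define THETA_BYTES (64UL * 1024 * 1024)   /* per-vContext limit, e.g. 64 MB */
#define BLOCK_SIZE  4096UL

struct vcontext {
    size_t used_bytes;    /* current cache usage of this guest */
    /* dirty/clean lists omitted for brevity */
};

/* Hypothetical helper: flush the oldest entries (FIFO) and return one of
 * the cleared buffers for reuse; stubbed here. */
static void *flushctrl_fifo_flush_and_reuse(struct vcontext *vc)
{
    (void)vc;
    return NULL;   /* a real implementation would hand back a recycled buffer */
}

void *vcontext_get_block(struct vcontext *vc)
{
    if (vc->used_bytes + BLOCK_SIZE <= THETA_BYTES) {
        /* On-demand path: take fresh memory from the hypervisor,
         * returned as soon as the data has been flushed to disk. */
        void *buf = malloc(BLOCK_SIZE);
        if (buf != NULL)
            vc->used_bytes += BLOCK_SIZE;
        return buf;
    }
    /* Aggressive guest: switch to the static path and reuse space
     * whose data has already been flushed, in FIFO order. */
    return flushctrl_fifo_flush_and_reuse(vc);
}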
3.3.1 FlushCtrl
Data flushing is an important operation, and a dedicated component in GearCache, FlushCtrl, is responsible for it. FlushCtrl is invoked when a guest system is terminated (normally or abnormally) or when more free space is required. FlushCtrl uses a FIFO (first-in-first-out) principle for the actual flush procedure. While a replacement algorithm based on LRU (least-recently-used) might obtain better performance, GearCache must keep the constraint of guest I/O ordering. For example, when a guest writes to an unallocated region of a file, FlushCtrl must flush the file data before writing the metadata; otherwise it risks exposing uninitialized data to users.
FlushCtrl is located inside each vContext, and five kinds of events trigger its flush handler to conduct a flush operation: (1) a vContext is about to be exhausted; (2) a timer in GearCache fires, requiring the user dirty data to be flushed periodically to the disk device; (3) the user or hypervisor closes a guest; (4) the hypervisor migrates a guest; (5) unexpected behavior crashes a guest. In the fifth case, the associated vContext in the hypervisor can catch the crash event via the inherent message-based notification or a polling mechanism.
On the other hand, extra processing can be added into FlushCtrl to provide device-dependent optimizations during the flush procedure. We will give an illustrative example in Section 4.2.
3.4 Crash Recovery
Figure 3 illustrates crash recovery in HypeGear, classified by when the crash occurs and what it involves.
Consider a crash at time t1, at which point FlushCtrl is not running. TransA, which has been completely flushed back by FlushCtrl1, is kept. TransB, which is divided across two consecutive flush procedures, can only be partly flushed back (namely B1 in Figure 3) after this crash. If the crash at t1 only involves the guest OS, FlushCtrl2 will still be triggered to flush the existing guest dirty data, but TransB may remain incomplete since the guest may not have written B2 out before t1.
Since all transactions in Figure 3 belong to guest semantics, HypeGear can leverage guest journaling mechanisms, such as file system journaling and database journals, to maintain their atomicity. Therefore, the uncompleted TransB will be undone when the guest is restarted and rolled back to a clean state, even if the crash at t1 involves the whole physical system.
Next we discuss a crash at time t2, at which point FlushCtrl is working. If this crash only involves the guest OS, the guest can be restarted cleanly by its journal mechanism. If the crash at t2 involves the whole physical system, it may leave the guest image files in an unusable state, because FlushCtrl may reorder guest write requests to improve flush performance (see Section 4.2). For example, C2 in Figure 3 is supposed to be flushed to disk after C1, but FlushCtrl swaps their flush positions due to its disk optimization. After the crash at t2 and a guest restart, the system would wrongly consider TransC to be complete.
To deal with this issue, HypeGear sets a flag before each FlushCtrl run and clears it after the flush procedure completes. If a crash happens at t2, HypeGear checks this flag when restarting the guest OS. If the flag is set, HypeGear must roll back the guest image file to a previous clean state via an image snapshot [19].
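The flag protocol can be sketched as follows; this is our illustration with hypothetical, stubbed helpers, and in practice the flag would have to be persisted so it survives a whole-machine crash.

/* Sketch of the flush-in-progress flag described above. All helpers are
 * hypothetical stubs, not HypeGear's implementation. */
#include <stdio.h>

static int flush_in_progress_flag;          /* persisted on disk in practice */

static void persist_flag(int v)        { flush_in_progress_flag = v; }
static void flush_vcontext(void)       { puts("flushing cached writes"); }
static void rollback_to_snapshot(void) { puts("rolling image back to snapshot"); }

/* Wrap every FlushCtrl run with the flag so that an interrupted,
 * reordered flush can be detected after a crash at time t2. */
void flushctrl_run(void)
{
    persist_flag(1);       /* set before any (possibly reordered) writes */
    flush_vcontext();
    persist_flag(0);       /* cleared only when the whole flush completed */
}

/* On guest restart: if the flag is still set, the image may hold a
 * partially applied, reordered flush, so fall back to a clean snapshot. */
void hypegear_guest_restart(void)
{
    if (flush_in_progress_flag)
        rollback_to_snapshot();
}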
4 IMPLEMENTATION
In this section, we discuss an implementation of HypeGear on the Xen hypervisor atop a rotating disk device. Although our current work is focused on a specific
platform, the main idea can be applied to other systems.
4.1 SignPost in Guest Domain
For ordinary applications that use a common I/O library, SignPost simply moves them onto a special file system to shape the synchronous I/O model. This kind of file system is easily obtained from the original user environment: only an extra parameter is applied in the fstab file of the guest domain. An exception is applications that use their own caching mechanisms to acquire full control of the disk I/O data transfer (e.g., a self-caching application using the O_DIRECT flag to bypass the file system cache in the Linux kernel [7]). Our solution in this case is to tune the application's own configuration parameters to simulate the synchronous disk I/O model. Both methods make little change to the guest application and file system, and a system administrator can handle them without involving end users. An example illustrating the use of SignPost is presented in Section 5.3.
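As a minimal illustration of what the synchronous model means at the system-call level (our example, not SignPost's code), a write issued with O_SYNC returns only after the data has left the guest page cache, which is the behavior SignPost arranges via the sync mount option or application tuning; the file path is hypothetical.

/* Illustration only: synchronous disk I/O at the system-call level.
 * With O_SYNC, each write() returns only after the data has reached the
 * lower storage layer instead of lingering in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "data.log" is just an example path. */
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "one record\n";
    /* Returns only after the record has left the guest file system cache. */
    if (write(fd, rec, strlen(rec)) < 0)
        perror("write");

    close(fd);
    return 0;
}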
4.2 GearCache in Driver Domain
In the current implementation, we add a new protocol into Blktap, which uses Tapdisk as the vContext for caching guest writes into GearCache. GearCache uses linked lists to store information on write requests. Each node in a list corresponds to a specific write and includes an offset field indicating the data location in the image file, a buffer field pointing to the actual data, a size field specifying the data size, and a dirty flag marking whether the node can be reused by new write requests. Data nodes in a linked list are arranged from head to tail according to their arrival times; FlushCtrl always begins from the head and traverses the list forward. In more detail, two linked-list structures make up the storage space for guest writes. One, called the dirty list, stores new write requests, while the other, called the clean list, stores data that have already been flushed to disk. The memory space of a node in the clean list can be reused by a new write request: the reused node is unlinked from the clean list, marked dirty, and then inserted into the dirty list.
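A minimal sketch of the node layout and the dirty/clean list recycling just described is given below; the field and function names are ours, chosen to mirror the text rather than the actual Blktap patch, and it assumes the fixed 4KB segments that Blktap delivers.

/* Sketch of the GearCache bookkeeping described above: one node per guest
 * write, kept in arrival order; flushed nodes move to a clean list and can
 * be recycled for new writes. */
#include <stdlib.h>
#include <string.h>

struct gc_node {
    unsigned long   offset;        /* location of the data in the image file */
    char            buffer[4096];  /* Blktap delivers fixed 4KB segments */
    size_t          size;          /* valid bytes in buffer (<= 4096) */
    int             dirty;         /* set while the node awaits flushing */
    struct gc_node *next;
};

struct gc_list { struct gc_node *head, *tail; };

static void gc_list_append(struct gc_list *l, struct gc_node *n)
{
    n->next = NULL;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
}

static struct gc_node *gc_list_pop(struct gc_list *l)
{
    struct gc_node *n = l->head;
    if (n) { l->head = n->next; if (!l->head) l->tail = NULL; }
    return n;
}

/* Store a new guest write: recycle a clean node if one exists, otherwise
 * allocate, then append to the tail of the dirty list (arrival order). */
void gearcache_store_block(struct gc_list *dirty, struct gc_list *clean,
                           unsigned long offset, const void *data, size_t size)
{
    struct gc_node *n = gc_list_pop(clean);
    if (!n) {
        n = calloc(1, sizeof *n);
        if (!n) return;            /* allocation failure: caller would fall back */
    }
    memcpy(n->buffer, data, size);
    n->offset = offset;
    n->size   = size;
    n->dirty  = 1;
    gc_list_append(dirty, n);
}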
For the implementation of FlushCtrl, we interpose the flush handler into the existing unmap_disk() function in Tapdisk, which triggers the flush procedure when the guest domain is terminated, abnormally or normally. When a flush is triggered, HypeGear applies two device-dependent optimizations, targeting the underlying rotating disk, to improve the flush operation. The first is block data re-ordering: FlushCtrl sorts the writes according to their offsets inside the matched image file (only the ”flush candidates” are involved) and then places them in order into a buffer (the buffer only stores the node addresses). This facilitates the lower disk I/O schedulers, which are optimized for logically sequential disk I/O flows. Currently, we use a heap algorithm to do this sorting
work. The other optimization is aggregation. Since Blktap provides the Tapdisk with a fixed-size, segmented I/O flow (usually 4KB each time), an aggregation procedure over the continuous I/O requests that have been formed in the re-ordering phase can further improve the flush performance. Specifically, FlushCtrl starts from the head of the ordered buffer and scans forward. During this process, if multiple write operations have continuous offsets (stepped by 4KB), FlushCtrl encapsulates them into one unit (by replacing multiple write system calls with a single writev) and issues them in a batch. The remaining non-continuous, scattered writes are delivered through the AIO library. If FlushCtrl encounters write requests with the same offset, only the newest one is written back.
Finally, a periodic flush mechanism is also implemented inside GearCache as a thread, with a mutex in each vContext. The mutex ensures that only one flush handler can operate on a vContext at a time.
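The re-ordering and aggregation steps can be illustrated with the sketch below; it is our own simplification that uses qsort instead of the heap sort mentioned above and a plain pwritev instead of the AIO path, purely to keep the example short.

/* Illustration of FlushCtrl's "re-order then aggregate" idea: sort the
 * flush candidates by offset, then submit each run of 4KB-contiguous
 * writes as one vectored write. Error handling is omitted. */
#define _GNU_SOURCE
#include <limits.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLK 4096UL

struct wnode { unsigned long offset; void *buf; };

static int by_offset(const void *a, const void *b)
{
    const struct wnode *x = *(struct wnode *const *)a;
    const struct wnode *y = *(struct wnode *const *)b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* `nodes` holds pointers to the flush candidates; `fd` is the image file. */
void flush_candidates(int fd, struct wnode **nodes, size_t n)
{
    qsort(nodes, n, sizeof nodes[0], by_offset);           /* re-ordering */

    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        while (j < n && nodes[j]->offset == nodes[j - 1]->offset + BLK)
            j++;                                            /* contiguous run */

        struct iovec iov[IOV_MAX];
        size_t cnt = j - i;
        if (cnt > IOV_MAX) cnt = IOV_MAX;                   /* keep it simple */
        for (size_t k = 0; k < cnt; k++) {
            iov[k].iov_base = nodes[i + k]->buf;
            iov[k].iov_len  = BLK;
        }
        /* aggregation: one vectored write replaces `cnt` single writes */
        pwritev(fd, iov, (int)cnt, (off_t)nodes[i]->offset);
        i += cnt;
    }
}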
5 EVALUATION
5.1 Testing Environment
We compare HypeGear with the Xen hypervisor using the original Blktap driver with the aio protocol (denoted Blktap+AIO below). Blktap+AIO is further divided into two cases, namely the async-sync and async-async models discussed in Section 2. The former is adopted by XenServer [4] as the default block driver configuration due to its high performance, practicability, and simplicity. For an apples-to-apples comparison, under HypeGear we allocate each guest only 448MB of memory and set an upper limit of 64MB (namely θ) on its vContext, while the memory size of each guest on Blktap+AIO is set to 512MB. The rest of the experimental environment is the same as described in Section 2. To enable synchronous I/O in the guest under HypeGear, we mount the test directory of the guest file system in synchronous mode (appending the sync flag in the fstab file), and we keep the default configuration for Blktap+AIO. Finally, the period of the periodic FlushCtrl in HypeGear is set to 30 seconds.
5.2 Performance Interference
We first repeat the scenario of Figure 2 under HypeGear, in which a normal guest is severely disturbed by a malicious guest with intensive flush operations. Figure 4 shows the result. Generally, HypeGear eliminates the high latencies because GearCache asynchronizes the normal guest's I/O. However, async-async suffers a large fluctuation at the end of the disturbance, caused by the busy host file system, which is required to flush the large amount of dirty data produced by the malicious guest.
Fig. 4. Fluctuation of the file I/O latencies of the normal guest; the y-axis is logarithmic. Due to their small values, the latencies of HypeGear, async-sync, and async-async on either side of the disturbance window are not plotted.

TABLE 1
The comparison of data integrity. ”Client” refers to the number of requests issued from clients, while the ”Server” column gives the number of requests actually received by the server; the ”Ratio” column is the resulting completion rate.

VM No.  Disk I/O Model  Client  Server   Ratio
1       async-sync      6093    172      2.82%
2       async-sync      17290   14828    85.7%
3       async-sync      8822    Crashed  NA
4       async-async     2064    98       4.75%
5       async-async     16415   Crashed  NA
6       async-async     748     0        0%
7       HypeGear        7043    7043     100%
8       HypeGear        14497   14497    100%
9       HypeGear        3146    3146     100%
10      HypeGear        25661   25661    100%

5.3 Crash Recovery
We create ten database clients, each connecting to a guest. All guests reside on the same hypervisor, and MySQL server 5.1.40 with the MyISAM engine is installed in each. Three of the guests use the async-sync model, another three use the async-async model, and the remaining four use the HypeGear model.
To allow the MySQL server to work in a synchronous manner on HypeGear, the logging function is enabled by SignPost and the my.cnf file in the /etc directory is configured accordingly. Specifically, the parameter sync_binlog is tuned to trigger synchronization between the log file and the hard disk, so that each new record in the database is forcibly flushed to the hard disk. On the other side, we keep the default configuration on Blktap+AIO.
All clients repeatedly send ”insert” operations (each with 1KB of data) to the corresponding MySQL servers. During the interaction between the servers and their clients, we randomly choose ten occasions to destroy the running guests, one at a time, using the command ”xm destroy dom_ID” in the terminal of the driver domain. Each time, this immediately crashes a guest OS without leaving sufficient response time for any exception handler in the MySQL server or guest.
Table 1 shows the experimental results concerning data reliability. Generally, under the original disk I/O models on the Xen hypervisor (both the async-sync and async-async cases), the completion rate is quite low because data are still pending inside the guest file system cache. In addition, because of metadata inconsistency, data tables on the MySQL server may even become unusable with a high probability (33%) after the sudden crash. A positive case for the conventional I/O model is shown in the second row of Table 1, which achieves an 85.7% completion rate for the client records; this situation may occur when the flush routine in the MySQL server is activated just before the crash.
On the other hand, HypeGear restores all of the dirty
data that had been received by the MySQL servers after their hosting guests crashed. This clearly shows that, by handling dirty data in the guest in a synchronous manner, HypeGear is able to protect unsaved data from being lost when the guest is terminated abnormally. Finally, we also reboot the crashed guests that ran under the HypeGear model and restart their MySQL servers; all of them work normally.
5.4 FlushCtrl Performance
To measure the effects of the device-specific optimizations in FlushCtrl, we use the IOmeter benchmark to repeatedly send disk I/O requests from a client to the corresponding guest, stressing the write request flow on the virtual block device. The test file is 1GB and each write is 4KB in size. Under the HypeGear model, the continuous disk I/O flow repeatedly triggers FlushCtrl to do the real flushing work. We keep all interactions running for 30 minutes and measure the average bandwidth or IOPS. Figure 5 shows the average disk bandwidth (MB/s) when clients keep sending a 100% sequential disk I/O flow, while Figure 6 shows the average IOPS (input/output operations per second) when clients send a 100% random flow.

Fig. 5. Performance for sequential disk I/O. Each value is the mean of the running guests.

Fig. 6. Performance for random disk I/O. Each value is the mean of the running guests.
In the case of sequential disk I/O, HypeGear generally outperforms async-sync, achieving about 14%-61% more bandwidth. On the other side, async-async
has an advantage over HypeGear when the number of running guests is less than or equal to three. However, HypeGear improves performance by 4%-14% over async-async as the number of guests increases. In the random case, HypeGear achieves about a 90%-509% improvement over async-sync, especially when a large number of guests is running. The async-async model also has a performance advantage over the other two models when fewer guests are running, but HypeGear improves performance by 26%-47% once the number of guests reaches three. In summary, although the async-async model has impressive performance with few guests running, it suffers performance degradation as the number of guests increases, because the host file system is required to deal with the large amount of redundant I/O data. In contrast, thanks to the simple design of GearCache as well as the device-specific optimizations, HypeGear shows better performance when the physical server hosts more guest instances.
5.5 Realistic Workloads
5.5.1 Web Server
In the first experiment, the web server workload http_load is used. It produces massive requests that randomly fetch files from a web server (read-intensive). Several guests are created, and each guest is installed with an Apache web server. Five clients, each residing in a client machine, are also created and run http_load to fetch 200,000 files from each Apache web server. The experimental results are shown in Figure 7 for different numbers of guests (and thus web servers) running simultaneously. It can be seen from the figure that HypeGear performs almost as well as its counterparts async-sync and async-async, which indicates that SignPost has little influence on guest read operations. Note that a performance spike appears in the second group of bars because the physical I/O resource is just about to be saturated; performance then drops as the number of guests increases due to I/O resource contention among the guests.
5.5.2 Kernel Compilation
In the second experiment, the performance of Linux kernel compilation (version 2.6.38) is evaluated. By nature, the compilation process is CPU- and memory-intensive, and it also produces disk I/O to load the source files and create many object files. Multiple guests running this workload stress the shared disk device with random access flows. The results are shown in Figure 8. When the number of running guests is less than or equal to two, HypeGear is at most 2 minutes slower than its counterparts (about 4% overhead); we speculate that this overhead derives from the synchronous file I/O inside the guest file system. However, as the number of running guests increases, HypeGear finishes the Linux kernel compilation about 3 to 23 minutes faster than its counterparts. In detail, compared with async-sync, HypeGear brings about a 9.23% to 14% improvement. The enhancement mainly comes from GearCache's ability to cache and aggregate guests' write requests: the data re-ordering and aggregation process in GearCache lets random disk I/O flows be flushed in a logically sequential and batched manner, and thus reduces the mechanical movement of the magnetic disk head. Compared with async-async, HypeGear saves about 13% to 22% of the time. Since the async-async model already asynchronizes guest write requests, the effect of the device-specific optimizations in HypeGear is not apparent here; we speculate that this improvement derives mainly from the simple management in the guest file system (synchronous I/O in the guest) as well as HypeGear's compact design. Specifically, because CPU consumption is saved from the complex management of an asynchronous guest, and the redundant guest I/O activities are avoided in the host file system, HypeGear has more CPU resource to do compilation work in the guest. Therefore, the required time to compile the Linux source is reduced.
5.5.3 File Server
In the third experiment, we use DBench as a file server workload. DBench simulates a network file server with multiple clients by replaying a trace file. We create 20 clients for each file server, and all clients replay the trace file for about 30 minutes. Figure 9 shows the results. In the cases of 1-3 guests running, the async-async model has an impressive advantage over async-sync and HypeGear. This high performance is due to the ample memory resource in the host, which allows a guest to use extra host memory to fit the running workload. However, this gap narrows quickly as the number of running guests increases; HypeGear even achieves a 179.3% improvement over async-async in the case of five guests running. We speculate that this performance penalty on async-async is caused by the intensive I/O activities imposed on the host file system, which frequently consume the CPU and memory resources in the driver domain and thus indirectly degrade the efficiency of the physical machine. On the contrary, synchronous I/O in HypeGear's guest saves more CPU resource to schedule more clients at a time. Furthermore, the device-specific optimization in FlushCtrl also improves the randomized disk I/O when multiple guests are running, so the overall bandwidth is increased. Compared with the async-sync model, HypeGear achieves about a 33.4%-185% improvement across all cases. This further indicates that device-specific optimization at the hypervisor layer plays an important role in enhancing the I/O performance of a virtualized system.
Fig. 7. Performance for the web server. Each value is the mean of the running guests.

Fig. 8. Time for the Linux build (smaller is better). Each value is the mean of the running guests.

Fig. 9. Performance for the file server. Each value is the mean of the running guests.
5.6 Overhead in Driver Domain
To assess the overhead that GearCache imposes in the driver domain, we boot three guests on our experimental machine. For comparison with Figure 1, we use the same workload to measure the overhead in the driver domain. For memory consumption, since a static value of θ is enforced, GearCache always keeps (64 × N) MB as its upper limit, where N is the number of running guests. This prevents host memory from being exhausted by I/O-intensive guests. It should also be noted that, under the HypeGear model, the memory size of a vContext in the driver domain can even be treated as zero, since GearCache is built from memory taken from the guest itself. For CPU consumption, GearCache reaches 15% at most (the average is about 7.8% while IOzone is running). Compared with the async-async model, which presents 17.1% CPU consumption on average, GearCache shows a 54.3% improvement. Using Xenoprof [20] to analyze the distribution of CPU cycles during this process under HypeGear, we find that the memcpy function, which copies the internal buffer of each guest write, accounts for a large proportion (about 23%).
6 RELATED WORK
Similar to our work on GearCache, several prior works explore the benefits of adding a secondary cache in the hypervisor for guests. First, XHive [21] is a cooperative caching system for guests that share a storage
device. This scheme allows guests to collaboratively share block copies that are read from the underlying storage, thereby reducing the number of disk I/O operations while enhancing memory utilization. XHive concentrates on the read direction, while our system mainly focuses on the write direction. Second, by holding dirty data in the hypervisor, Ye et al. propose an energy-saving solution that reduces the number of rotating-disk spin-ups [22], in which the hypervisor can choose a suitable occasion to flush the cached dirty data to disk, depending on whether the rotating disk is working or not. Third, Lu et al. propose a hypervisor-level exclusive buffer cache [23], which allows the user workloads in a guest to be transparently traced while accurately predicting the page miss ratio of the guest without heavy cost. In doing so, the system can acquire guest memory access patterns and then guide guest memory allocation.
Concerning performance isolation in virtualized systems, Seelam et al. propose VIOS [24], a virtual I/O scheduler that provides absolute performance virtualization by fairly sharing storage system resources among operating systems and their applications. Although they share the same intention, HypeGear and VIOS achieve this aim through very different approaches. Specifically, our system adds a block-level cache in the hypervisor to marshal the requests from various guest OSes, whereas VIOS relies on tuning the CPU schedulers of the different guest OSes. Since modern DMA engines operate without CPU intervention, that kind of approach may not be very efficient at handling write requests that carry large buffers. More recently, Gulati et al. present mClock [25], an I/O scheduling algorithm that provides per-guest quality of service (QoS) in the presence of variable overall throughput on the shared storage.
Similar to the synchronous I/O in HypeGear, several works have been proposed to strike a balance between asynchronous I/O and synchronous I/O in a native file system. Chen et al. proposed Rio [6], which uses DEC Alpha workstations (DEC 3000/600) to allow a reset and boot without erasing memory. Rio modifies the kernel by inserting a check instruction before every memory access to provide basic protection for dirty data. Therefore, this method depends on the specific device while requiring a moderate modification to the kernel of the native OS.
7 CONCLUSION AND FUTURE WORK
Although the traditional disk I/O model in a virtualized environment is simple and general for disk I/O operations on the hypervisor, it compromises the efficiency of coordinating and managing multiple guests. To alleviate these potential problems, we propose a new disk I/O model called HypeGear. A prototype system is implemented on the Xen hypervisor, and our experimental results show that the new disk I/O model has many potential advantages over the conventional one.
This paper only describes HypeGear in a single-machine environment. However, it is easy to extend our model to a multiple-machine virtualized cloud environment, where guest OSes usually put their data into a special storage pool over a fast network rather than onto the original local disk. In this environment, SignPost is used in the same way as introduced in Section 3.2, but introducing GearCache is more complex. In detail, there are two possible places to interpose GearCache in the multiple-machine environment: at the hypervisor layer residing in an ordinary physical node, or at the side of the special storage pool. In the first case, the solution is fully compatible with the one presented in Section 3.3, but it degrades the flush performance since FlushCtrl is required to transfer data to a remote storage device. In the second case, GearCache can enhance its reliability because the special storage pool is often more reliable, but the synchronous I/O in the guest file system incurs extra latency due to network I/O. We will explore this trade-off in our future work.
ACKNOWLEDGMENTS
This work is supported by the National High-tech R&D Program of China (863 Program) under grant No. 2012AA010905, the China National Natural Science Foundation (Key Program) under grant No. 61133006, and the China National Natural Science Foundation under grants No. 61272408 and 60973133.
REFERENCES
[1] "Amazon online shopping," 2011, http://www.amazon.com.
[2] M. Rosenblum and C. Waldspurger, "I/O Virtualization," Queue, vol. 9, pp. 30:30–30:39, Nov. 2011.
[3] D. Le, H. Huang, and H. Wang, "Understanding Performance Implications of Nested File Systems in a Virtualized Environment," in USENIX Conference on File and Storage Technologies (FAST), 2012.
[4] "Citrix XenServer: Efficient server virtualization software," 2011, http://www.citrix.com.
[5] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "KVM: the Linux virtual machine monitor," in Proceedings of the Linux Symposium, Ottawa, Canada, 2007, pp. 225–230.
[6] P. Chen, W. Ng, S. Chandra, C. Aycock, G. Rajamani, and D. Lowell, "The Rio file cache: Surviving operating system crashes," in Proceedings of the ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '96). Massachusetts, USA: ACM, 1996, pp. 74–83.
[7] P. Daniel and M. Cesati, Understanding the Linux Kernel. Sebastopol, CA, USA: O'Reilly, 2005, pp. 500–800.
[8] A. Depoutovitch and M. Stumm, "Otherworld: giving applications a chance to survive OS kernel crashes," in Proceedings of the 5th European Conference on Computer Systems (EuroSys 2010). Paris, France: ACM, 2010, pp. 181–194.
[9] P. Chen and B. Noble, "When virtual is better than real," in Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS '01). Elmau/Oberbayern, Germany: IEEE, 2001, pp. 133–138.
[10] T. Garfinkel and M. Rosenblum, "A virtual machine introspection based architecture for intrusion detection," in Proceedings of the ISOC Network and Distributed System Security Symposium (NDSS 2003). San Diego, CA: Internet Society, 2003, pp. 191–206.
[11] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller, "Faults in Linux: ten years later," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '11). New York, NY, USA: ACM, 2011, pp. 305–318.
[12] J. Corbet, "Dynamic writeback throttling," 2010, http://lwn.net/Articles/405076/.
[13] M. Ben-Yehuda, E. Borovik, M. Factor, E. Rom, A. Traeger, and B.-A. Yassour, "Adding advanced storage controller functionality via low-overhead virtualization," in USENIX Conference on File and Storage Technologies (FAST), 2012.
[14] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03). New York, NY: ACM, 2003, pp. 164–177.
[15] T. Harter, C. Dragga, M. Vaughn, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). New York, NY, USA: ACM, 2011, pp. 71–83.
[16] E. Nightingale, K. Veeraraghavan, P. Chen, and J. Flinn, "Rethink the sync," in Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI 2006). Seattle, WA: USENIX, 2006, pp. 1–14.
[17] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, D. Tsafrir, and A. Schuster, "ELI: Bare-metal performance for I/O virtualization," in ACM Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[18] A. Warfield, S. Hand, K. Fraser, and T. Deegan, "Facilitating the development of soft devices," in Proceedings of the USENIX Annual Technical Conference (USENIX ATC 2005). Anaheim, CA: USENIX, 2005, pp. 379–382.
[19] C. Tang, "FVD: A High-Performance Virtual Machine Image Format for Cloud," in Proceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC '11). Berkeley, CA, USA: USENIX Association, 2011.
[20] A. Menon, J. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment," in Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE 2005). Chicago, IL: ACM, 2005, pp. 13–23.
[21] H. Kim, H. Jo, and J. Lee, "XHive: Efficient cooperative caching for virtual machines," IEEE Transactions on Computers, vol. 60, no. 1, pp. 106–119, 2010.
[22] L. Ye, G. Lu, S. Kumar, C. Gniady, and J. Hartman, "Energy-efficient storage in virtual machine environments," in Proceedings of the ACM International Conference on Virtual Execution Environments (VEE 2010). Pittsburgh, PA: ACM, 2010, pp. 75–84.
[23] P. Lu and K. Shen, "Virtual machine memory access tracing with hypervisor exclusive cache," in Proceedings of the 2007 USENIX Annual Technical Conference (USENIX ATC 2007). Santa Clara, CA: USENIX, 2007, pp. 75–84.
[24] S. Seelam and P. Teller, "Virtual I/O scheduler: a scheduler of schedulers for performance virtualization," in Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE 2007). San Diego, CA: ACM, 2007, pp. 105–115.
[25] A. Gulati, A. Merchant, and P. Varman, "mClock: handling throughput variability for hypervisor IO scheduling," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI 2010). Vancouver, BC, Canada: USENIX, 2010, pp. 1–7.
Dingding Li is a Ph.D. student working with Prof. Hai Jin in the Services Computing Technology and System Laboratory (SCTS) at Huazhong University of Science and Technology (HUST). His research is focused on I/O virtualization and cloud computing.
Xiaofei Liao received his Ph.D. degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China, in 2005. He is now an associate professor in the School of Computer Science and Engineering at HUST. He has served as a reviewer for many conferences and journals. His research interests are in the areas of virtualization technology for computing systems, P2P systems, cluster computing, and streaming services. He is a member of the IEEE and the IEEE Computer Society.
Hai Jin received his B.S., M.A., and Ph.D. degrees in computer engineering from Huazhong University of Science and Technology (HUST) in 1988, 1991, and 1994, respectively. He is now a Professor of Computer Science and Engineering at HUST in China and the Dean of the School of Computer Science and Technology at HUST. In 1996, he was awarded a German Academic Exchange Service (DAAD) fellowship for visiting the Technical University of Chemnitz in Germany. He worked for the University of Hong Kong between 1998 and 2000 and participated in the HKU Cluster project, and he worked as a visiting scholar at the University of Southern California between 1999 and 2000. He is the chief scientist of the 973 project ChinaV and of the largest grid computing project in China, ChinaGrid. His research interests include virtualization technology for computing systems, cluster computing and grid computing, peer-to-peer computing, network storage, network security, and high assurance computing. He is a member of the Grid Forum Steering Group (GFSG), a senior member of the IEEE, and a member of the ACM.
Dr. Bing Bing Zhou is an associate professor in the School of Information Technologies at the University of Sydney, Australia (2003-present). He graduated in electronic engineering in 1982 from Nanjing Institute of Technology in China and received his Ph.D. in Computer Science in 1989 at the Australian National University, Australia. Currently he is the Theme Leader for Distributed Computing Applications in the Centre for Distributed and High Performance Computing at the University of Sydney.
Qi Zhang received his master's degree in Computer Science from Huazhong University of Science and Technology (HUST) in March 2012. His research is focused on virtualization and computer architecture.