A New Disk I/O Model of Virtualized Cloud Environment

Dingding Li, Xiaofei Liao, Member, IEEE, Hai Jin, Senior Member, IEEE, Bingbing Zhou, and Qi Zhang

Abstract—In a traditional virtualized cloud environment, using asynchronous I/O in the guest file system and synchronous I/O in the host file system to handle an asynchronous user disk write exhibits several drawbacks, such as performance disturbance among different guests and consistency maintenance across guest failures. To address these issues, this paper introduces a novel disk I/O model for virtualized cloud systems called HypeGear, in which the guest file system uses synchronous operations to deal with guest write requests and the host file system performs asynchronous operations to write the data to the hard disk. A prototype system is implemented on the Xen hypervisor, and our experimental results verify that this new model has many advantages over the conventional asynchronous-synchronous model. We also evaluate the overhead of asynchronous I/O at the host introduced by our new model; the results demonstrate that it imposes little cost on the host layer.

Index Terms—Virtualization, file system, asynchronous I/O, synchronous I/O.

• D. Li, X. Liao, H. Jin, and Q. Zhang are with the Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China. Xiaofei Liao is the corresponding author. dingly.hust@gmail.com, xfliao@hust.edu.cn, hjin@hust.edu.cn, cheungrck@gmail.com
• B. Zhou is with the School of Information Technologies, Sydney University, NSW 2006, Australia. Email: bbz@it.usyd.edu.au

Manuscript received March 1, 2012.

1 INTRODUCTION

As an emerging IT paradigm that separates computing functions and technology implementations from physical hardware, virtualization technology can play a key role in maintaining flexibility and scalability for cloud deployments [1]. Although it has many advantages over a native, or non-virtualized, system, a virtualized cloud is more complex and thus more difficult to understand and improve [2]. This complexity is largely caused by a hypervisor layer that abstracts the underlying hardware resources for multiple guest instances.

One particular type of abstraction, which we use often but have not yet fully understood and improved, is the file system. A file system in a typical virtualized system generally presents a two-level nested structure in which a host file system maps regular files as virtual block devices to guest file systems [3]. Compared with a single-level native file system, this two-level structure is more complex, since a disk I/O request from a guest must be processed twice — first in the guest and then in the host.

Currently, virtualized systems usually use an asynchronous-synchronous (async-sync) model to handle asynchronous user writes [4], [5]. With this model, the guest file system uses asynchronous I/O instructions to deal with write requests [6], [7]. When dirty data need to be written to the hard disk drive, the data are first transferred to the host, and the host file system then performs synchronous I/O operations to flush the data to the disk. The advantage of this model is its simplicity: the host can be kept small and simple, and the guest can accommodate any existing modern OS without modifications. However, this model has several disadvantages.
When a guest decides to flush its dirty data (for example, when an application explicitly invokes a flush operation), the host file system has to write the data synchronously back to the disk in real time. If the amount of data to be transferred is large, the normal operation of other guest systems will be seriously affected by these intensive synchronous I/O operations. As another example, when a guest OS is terminated abnormally, important data stored in the guest file system cache will be lost [8]. Thus, there is a trade-off among performance, data reliability, simplicity, and generality. Although the host file system is supposed to manage I/O operations on behalf of the guest file systems, the requirement of synchronous I/O makes this management less effective and less efficient.

To alleviate this problem, we propose a new disk I/O working model. Rather than performing asynchronous I/O at the guest and synchronous I/O at the host, in our new I/O model the guest file system uses synchronous operations to handle disk I/O requests and the host system then performs asynchronous I/O to transfer data to the hard disk. With this synchronous-asynchronous arrangement, data stored in the guest file system cache are always up to date, and the dirty data are maintained and managed by the host. This arrangement has several potential advantages over the conventional async-sync I/O model.

With synchronous I/O at the guest, data inside its file system cache are always considered clean, so there is no need to flush data from the guest periodically or when the guest system is terminated, normally or abnormally. As a result, the I/O management inside the guest can be made simpler and more efficient.

On the other hand, the guest is usually less stable than the hypervisor layer [9], [10]. This is because the guest OS is directly exposed to user applications, which may contain various fatal bugs [11] that crash the whole system. In such a situation, if dirty data are stored and managed at the guest, important data will be lost (unless additional complex mechanisms are available to flush the dirty data from the guest cache to the hard disk in real time). With synchronous I/O operations at the guest, user dirty data are still kept in the host when the guest system crashes. Thus, the system becomes more robust to guest faults.

Since the host can acquire system-wide information about multiple guest systems, with asynchronous I/O operations at the host, dirty data can be flushed back to the hard disk on occasions when the physical system is not busy [12]. Therefore, the virtualized system handles concurrent I/O flows from multiple guest OSes more intelligently. Furthermore, because the hypervisor at the host can directly and accurately acquire information about the physical disk devices [13], asynchronous I/O at the host provides opportunities for disk I/O optimization according to specific device-dependent characteristics.

Based on this new I/O model, we designed a new architecture called HypeGear. This architecture is built upon the conventional virtualized system with several added or modified components. The key added components are one at the guest that converts a normal asynchronous I/O write from the user application into a synchronous one, and a block-level cache at the host, called GearCache, that maintains and manages dirty data. It is worth noting that a user application could also issue an originally synchronous I/O write.
In such a case, HypeGear provides a sync-sync path that bypasses its block-level cache and makes the data in the guest file system cache directly consistent with the copy on the hard disk.

We have implemented a prototype HypeGear system on the Xen hypervisor. Extensive experiments have been conducted to verify the advantages, evaluate the performance, and measure the overhead of our new disk I/O model.

The rest of this paper is organized as follows. Section 2 presents the background and motivation of our work. The architecture of HypeGear is presented in Section 3. Section 4 describes the implementation details of HypeGear on the Xen hypervisor. The experimental results are presented in Section 5. Related work is briefly discussed in Section 6. Finally, Section 7 concludes the paper.

2 BACKGROUND AND MOTIVATION

The idea of using asynchronous I/O or a caching scheme in the host file system to relieve the disk I/O bottleneck of a virtualized system is not new. By leveraging the file system cache in the host, a common hypervisor can easily enable caching or asynchronous handling for guest image files, which yields an async-async model for dealing with guest write requests [14].

However, this async-async model suffers from heavy data redundancy between the file system caches of the guest and the host. Taking a Linux-based hypervisor as an example, when a guest reads/writes a block from/to the physical disk device, the related I/O buffer is not only cached in the guest page cache but also stored in the host page cache. As the number of guests increases, the guests impose more redundant I/O activity on the host file system and keep the hypervisor layer busy dealing with it (e.g., allocating/freeing memory to accommodate/release guest data). Since the hypervisor layer at the host plays a critical role in hosting multiple guest OS instances, overburdening the host file system, which resides in the host kernel space, may continuously consume host resources and thus degrade the efficiency of the whole virtualized system. To avoid this problem, current hypervisors generally use direct I/O (an async-sync style) to manipulate guest image files [4], [5], without caching guest I/O data in the host file system.

To make the case concrete, we studied the Xen hypervisor 4.1.2 with its Blktap AIO driver on an experimental machine with dual quad-core Intel Xeon 1.6GHz processors, 8GB DDR2 RAM, a 160GB 7200 RPM SATA II hard drive (ST3160815AS), and dual full-duplex Intel Pro/1000 Gbit/s NICs. The driver domain (namely the host OS) runs the 64-bit CentOS 6.0 distribution, and the hypervisor is Xen 4.1.2 with the Linux 2.6.18.8-Xen kernel. The guest domains (namely guest OSes) run CentOS 6.0, each with a 512MB memory allocation, 16GB of disk capacity, the Linux 2.6.18.8 kernel, and the ext3 file system in ordered mode. Figure 1 presents a comparison between the async-sync model (setting the O_DIRECT flag when the Blktap AIO driver opens guest image files) and the async-async model (clearing the O_DIRECT flag). It clearly shows that async-async can improve guest I/O performance by leveraging a page cache in the host file system, but at the cost of extra memory and CPU resources in the hypervisor.
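In code, the difference between the two host-side models reduces to a single open flag. The snippet below is a minimal C sketch of this idea; it is our own illustration, not the actual Blktap code, and the helper name and parameters are hypothetical.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper: open a guest image file either with the host
 * page cache bypassed (async-sync style) or with it enabled
 * (async-async style). */
int open_guest_image(const char *path, bool bypass_host_cache)
{
    int flags = O_RDWR;
    if (bypass_host_cache)
        flags |= O_DIRECT;   /* async-sync: no caching in the host */
    /* otherwise the host page cache buffers guest I/O (async-async) */
    /* note: O_DIRECT also requires suitably aligned buffers for I/O */
    return open(path, flags);
}
```

The point is that the choice between the two models is made entirely on the host side and is invisible to the guest.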
Fig. 1. Three guests run IOzone simultaneously. Each of them produces asynchronous write-only I/O flows with a 512MB testing file size, a 4KB block size, and a sequential access pattern. Although all guests can finish this test within 260 seconds under the async-async model (about 200 seconds faster than under the async-sync model), it consumes up to 1.5GB of redundant host memory (about the total size of the three testing files) and 50% of the host CPU at its peak.

An I/O completion in a native system indicates that a write has been committed. Therefore, using the async-sync model in a virtualized system has another justification: it preserves the guest flush semantics on which the guest file system and guest applications rely. However, this creates a potential pitfall once the system is virtualized and runs multiple OS instances. If one of them issues massive synchronous flush operations [15] and exploits this strict adherence to the guest flush semantics to dominate the shared disk device, the I/O performance of the other, normal guests will be severely disturbed. Figure 2 demonstrates this negative impact under the same experimental environment. It clearly shows that the normal guest experiences large fluctuations in file I/O latency (spanning four orders of magnitude on the y-axis) when running alongside a flush-intensive guest.

Fig. 2. Fluctuating file I/O latencies of the normal guest; the y-axis is logarithmic. Two guests run simultaneously: one performs normal file I/O, issuing a single read/write request to the disk device every second, while the other plays a disturbing role, using IOzone to produce continuous, small (4KB) flush operations from the 131st second to the 245th second. Most latencies on either side of this figure are not plotted due to their small values.

The async-sync model also disregards the specific characteristics of the underlying storage device. Due to its synchrony with the guest OSes, each interaction between the host file system and the physical disk is constrained to the view of an individual guest. Unfortunately, in the general case different guests have different image files. When dealing with concurrent I/O requests from different guests, the magnetic head of a rotating disk may move irregularly among the image files, even though these I/O flows are logically sequential from the perspective of their own guests. This situation contradicts the physical characteristics of a rotating disk device, and the virtualized system thus delivers poor disk I/O performance.

Based on the above analysis and observations, we find that the async-sync model keeps the hypervisor simple through its synchronous handling in the host file system. However, synchronous operations in the host may cause serious performance variation for individual guests. Although the async-async model can relieve these shortcomings by using asynchronous I/O operations in the host file system, the overhead imposed on the hypervisor or host is very high. Therefore, we expect a hypervisor or host to use a more intelligent asynchronous I/O model that handles guest flush operations at the system level, not just according to the needs of each individual guest.
Meanwhile, the simplicity of the hypervisor should also be preserved, rather than introducing a heavy-weight scheme such as the async-async model just discussed.

Under both the async-sync and async-async models, the guest file system uses asynchronous I/O to deal with application write requests [7]. The upside of these models is high performance and compatibility, since existing modern operating systems can be virtualized without any modifications. However, there are some limitations. First, asynchronous I/O complicates data management policies in the guest file system. A Linux-based guest involves complex flush logic to control the flushing of user dirty data [12]. This enlarges and complicates the code base of the guest file system or development library, which may increase the risk of faults and the consumption of CPU and memory resources [11]. Second, asynchronous I/O operations in the guest sacrifice the reliability of dirty data, as a fault in the guest OS may cause the dirty data in the guest file system cache to be lost or corrupted [6].

Several works have been proposed to address these problems for native file systems [6], [16]. However, these methods are custom-made for native systems. Importing them into current virtualized systems generally requires extra hardware or software mechanisms, sometimes with expensive provisioning.1 For a two-level file system, therefore, a general, simple, drop-in solution for the original virtualized environment is highly desirable.

1. We discuss them in detail in Section 6.

3 HYPEGEAR

3.1 Overview

To alleviate the problems discussed above in a typical virtualized system, we propose a new disk I/O model, called HypeGear. It uses synchronous I/O in the guest file system and asynchronous I/O at the host, namely a sync-async I/O model. Synchronous I/O on the guest side simplifies the guest file system and enhances the reliability of user dirty data, while asynchronous I/O on the host side flushes guest data with system-wide optimizations based on device-specific characteristics. To realize this idea, an I/O converter, SignPost, and a block-level cache, GearCache, are added to the guest and the hypervisor, respectively.

With these added components, the life cycle of a normal asynchronous write operation in HypeGear takes the following steps: (1) a user application issues an asynchronous write operation; (2) SignPost detects this operation and converts it into a synchronous write; (3) the write request traps into the hypervisor; (4) the hypervisor stores the I/O data in GearCache and then returns an acknowledgment to the guest; (5) GearCache flushes the dirty data to the disk device in an asynchronous manner.

The dirty data produced by a user application in our disk model are handled by SignPost in a synchronous style, which allows the data to pass through the relatively less trusted guest OS immediately. This conversion also simplifies the guest file system. GearCache is designed to keep the host as simple as under the async-sync model. It is a dedicated block-level write cache, which buffers write data only and manages them effectively. By excluding read data, the structure of GearCache is kept much simpler than that of a typical file system cache, which greatly reduces the redundant I/O activity imposed on the host or hypervisor.
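To make steps (4) and (5) more concrete, the following C sketch shows one plausible shape of the hypervisor-side handler that caches a converted guest write so the guest can be acknowledged immediately; the structure layout and function name are our own assumptions rather than the actual HypeGear code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One cached guest write inside a vContext (hypothetical layout). */
struct gear_write {
    uint64_t offset;              /* location in the guest image file */
    size_t   size;                /* payload size */
    void    *buf;                 /* private copy of the guest data */
    struct gear_write *next;
};

/* Step (4): copy the data into GearCache and return immediately so the
 * guest can be acknowledged; step (5) flushes the list asynchronously. */
int gearcache_handle_write(struct gear_write **dirty_list,
                           uint64_t offset, const void *data, size_t size)
{
    struct gear_write *w = malloc(sizeof(*w));
    if (!w)
        return -1;
    w->buf = malloc(size);
    if (!w->buf) {
        free(w);
        return -1;
    }
    memcpy(w->buf, data, size);
    w->offset = offset;
    w->size = size;
    w->next = *dirty_list;        /* for brevity we prepend here; the
                                     design described above appends so
                                     that FIFO order is preserved */
    *dirty_list = w;
    return 0;                     /* caller now acks the guest */
}
```

The background flusher of step (5) would later walk this list and write the buffers to the image file when the physical system is not busy.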
3.2 SignPost

Depending on the specific implementation of a given file system or application, the conversion work of SignPost can be realized in various concrete ways, such as parameter tweaking inside a file system, self-tuning inside an application, or a combination of both. Unfortunately, each async-to-sync conversion incurs a context switch, i.e., from guest context to hypervisor context, in the current virtualized environment. If an application issues frequent write requests, the system is forced to context switch frequently, which affects application performance [17]. Nevertheless, in Section 5.5 we show that this cost does not severely affect the performance of typical workloads. Furthermore, by reducing CPU consumption in the guest file system (there is no longer any need for complex flush management in the guest), the overall performance of the virtualized system can even be enhanced, especially when multiple guest instances are running.

SignPost has little effect on guest read operations. Since it uses the original synchronous I/O path in the guest file system, a converted write operation leaves a copy in the guest for subsequent read operations on the same data. If a read request misses in the guest file system cache, it first looks for its data in GearCache, instead of going directly to the disk, to keep the data consistent.

3.3 GearCache

Popular hypervisors often create an individual address space for hosting and monitoring each specific guest. For example, both QEMU-IO in KVM [5] and Blktap in the Xen hypervisor [18] redirect guest I/O requests into a special, privileged guest instance, which contains specialized user processes that initiate I/O operations on behalf of the guest. These user processes have their own address spaces, isolated from the real guest memory, so the data in these spaces can survive a guest crash. A combination of these special memory spaces, each named a vContext, forms our GearCache in the hypervisor.

GearCache stores each individual guest's writes in a matching vContext and interposes a flush thread into the event handlers for events such as guest crash, migration, and shutdown. This design motivates a local-based management principle for GearCache, which not only keeps the design logic of GearCache simple, but also provides a selective disk I/O framework on a single physical machine that hosts various guests. For example, since GearCache relaxes adherence to the guest flush semantics to avoid the I/O bottleneck, this block-level write cache may not be suitable for all applications. A possible solution under local-based management is to move applications that require strict flush semantics into other guests, which retain the original disk I/O path.

GearCache mainly uses an on-demand principle to manage the space of each vContext, but it adopts static allocation to restrain the memory consumption of aggressive vContexts. Specifically, there is a size limit θ for each individual vContext. GearCache deals with upcoming write requests in an on-demand manner if the current usage is below θ: when a new write request arrives, GearCache applies for a piece of memory from the hypervisor to serve the request. The memory is then freed and returned to the hypervisor after the data have been flushed to disk. In this way, GearCache avoids wasting memory on guests that carry few writes. If the current usage of a vContext reaches θ, GearCache simply treats this guest as aggressive and uses a static method to handle the upcoming write requests. In detail, the vContext first flushes part of the cached data according to a FIFO policy and then clears the related data, but preserves their memory space. When a new write arrives, it reuses the cleared space for the new data. In this way, GearCache reduces the overhead of frequent memory allocation and release under intensive writes.
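The following C sketch models, under our own naming assumptions, how a vContext could switch between the on-demand and static policies around the limit θ described above; it is a simplified illustration, not the actual GearCache code.

```c
#include <stddef.h>
#include <stdlib.h>

#define THETA (64u * 1024 * 1024)     /* per-vContext limit, e.g. 64MB */

struct vcontext {
    size_t used;                      /* bytes currently cached */
    /* dirty/clean lists omitted for brevity */
};

/* Placeholder implementations; a real GearCache would flush the oldest
 * cached writes (FIFO) and recycle their buffers instead. */
static void vcontext_flush_fifo_part(struct vcontext *vc) { (void)vc; }
static void *vcontext_reuse_cleared_slot(struct vcontext *vc, size_t size)
{
    (void)vc;
    return malloc(size);              /* stand-in for reusing a cleared slot */
}

/* Obtain buffer space for a new guest write of `size` bytes. */
void *vcontext_alloc(struct vcontext *vc, size_t size)
{
    if (vc->used + size <= THETA) {
        /* on-demand mode: take fresh memory from the hypervisor; it is
         * returned again once the data have been flushed to disk */
        void *buf = malloc(size);
        if (buf)
            vc->used += size;
        return buf;
    }
    /* aggressive guest: flush part of the cache in FIFO order, keep the
     * memory, and reuse a cleared slot instead of allocating again */
    vcontext_flush_fifo_part(vc);
    return vcontext_reuse_cleared_slot(vc, size);
}
```

The split between the two branches is what prevents an I/O-intensive guest from growing its cache beyond θ while still keeping allocation cheap for quiet guests.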
3.3.1 FlushCtrl

Data flushing is an important operation, and GearCache contains a dedicated component, FlushCtrl, for this purpose. FlushCtrl is invoked when a guest system is terminated (normally or abnormally) or when more free space is required. FlushCtrl uses a FIFO (first-in-first-out) policy for the actual flush procedure. While a replacement algorithm based on LRU (least recently used) might obtain better performance, GearCache must respect the constraint of guest I/O ordering. For example, when a guest writes to an unallocated region of a file, FlushCtrl must flush the file data before writing the metadata; otherwise it risks exposing uninitialized data to users.

FlushCtrl resides inside each vContext, and five kinds of events trigger its flush handler: (1) a vContext is about to be exhausted; (2) a timer in GearCache fires, which periodically flushes the user dirty data to the disk device; (3) the user or the hypervisor shuts down a guest; (4) the hypervisor migrates a guest; (5) unexpected behavior crashes a guest. In the fifth case, the associated vContext in the hypervisor can catch the crash event through the inherent message-based notification or a polling mechanism. In addition, extra processing can be added to FlushCtrl to provide device-dependent optimizations during the flush procedure. We give an illustrative example in Section 4.2.

3.4 Crash Recovery

Figure 3 illustrates crash recovery in HypeGear according to the crash type, characterized by when and where the crash occurs.

Fig. 3. Crash recovery on HypeGear.

Consider first a crash at time t1, at which point FlushCtrl is not running. TransA, which has been completely flushed back by FlushCtrl1, is preserved. TransB, which is divided into two consecutive flush procedures, can only be partly flushed back (namely B1 in Figure 3) after this crash. If the crash at t1 only involves the guest OS, FlushCtrl2 will still be triggered to flush the existing guest dirty data, but TransB may be incomplete, as the guest may not have written B2 out before t1. Since all transactions in Figure 3 belong to guest semantics, HypeGear can leverage the guest journaling mechanisms, such as file system journaling and database journals, to maintain their atomicity. Therefore, the uncompleted TransB will be undone when the guest is restarted, and the guest can be rolled back to a clean state, even if the crash at t1 involves the whole physical system.

Now consider a crash at time t2, at which point FlushCtrl is working. If this crash only involves the guest OS, the guest can be restarted cleanly by its journaling mechanism. If the crash at t2 involves the whole physical system, it may leave the guest image file in an unusable state, because FlushCtrl may reorder guest write requests to improve flush performance (see Section 4.2). For example, C2 in Figure 3 is supposed to be flushed to disk after C1, but FlushCtrl swaps their flush positions due to its disk optimization. After the crash at t2 happens and the guest is restarted, the system would wrongly assume that TransC has been completed. To deal with this issue, HypeGear sets a flag before each FlushCtrl run and clears it after the flush procedure completes. If a t2 crash happens, HypeGear checks this flag when restarting the guest OS. If the flag is set, HypeGear must roll back the guest image file to a historical clean state via an image snapshot [19].
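A minimal sketch of this flag mechanism, with hypothetical names (the paper does not give the concrete interface), might look as follows; note that a real implementation would have to keep the flag in persistent storage so that it survives a whole-system crash.

```c
#include <stdbool.h>

/* Hypothetical per-image recovery state kept by the hypervisor. */
struct image_state {
    bool flush_in_progress;   /* the "FlushCtrl is working" flag */
};

void flushctrl_run(struct image_state *st, void (*do_flush)(void))
{
    st->flush_in_progress = true;    /* set before reordering any writes */
    do_flush();                      /* reordered, batched flushing */
    st->flush_in_progress = false;   /* cleared only after completion */
}

/* Called when a crashed guest is restarted. Returns true if the image
 * must be rolled back to an earlier snapshot. */
bool needs_snapshot_rollback(const struct image_state *st)
{
    return st->flush_in_progress;    /* crash hit the middle of a flush */
}
```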
4 IMPLEMENTATION

In this section, we discuss an implementation of HypeGear on the Xen hypervisor atop a rotating disk device. Although our current work focuses on a specific platform, the main idea can be applied to other systems.

4.1 SignPost in Guest Domain

For ordinary applications that use a common I/O library, SignPost simply moves them onto a file system mounted for synchronous I/O. Such a file system is easily obtained from the original user environment: only an extra parameter needs to be added to the fstab file of the guest domain. An exception is applications that use their own caching mechanisms to acquire full control of the disk I/O data transfer (e.g., self-caching applications that use the O_DIRECT flag to bypass the file system cache in the Linux kernel [7]). Our solution in this case is to tune the application's own configuration parameters to simulate the synchronous disk I/O model. Both methods require little change to the guest application and file system, and a system administrator can apply them without involving end users. An example illustrating the use of SignPost is presented in Section 5.3.

4.2 GearCache in Driver Domain

In the current implementation, we add a new protocol into Blktap, which uses Tapdisk as the vContext for caching guest writes in GearCache. GearCache uses a linked list to store information about write requests. Each node in the list corresponds to a specific write. It includes an offset field to indicate the data location in the image file, a buffer field that points to the actual data, a size field to specify the data size, and a dirty flag that marks whether the node can be reused by new write requests. Data nodes in the linked list are arranged from head to tail according to their arrival times. FlushCtrl always begins at the head and traverses the linked list forward.

In more detail, two linked-list structures make up the storage space for guest writes. One, called the dirty list, stores new write requests, while the other, called the clean list, stores data that have already been flushed to the disk. The memory space of a node in the clean list can be reused by a new write request; the reused node is unlinked from the clean list, marked dirty, and then inserted into the dirty list.

For the implementation of FlushCtrl, we interpose the flush handler into the existing unmap_disk() function in Tapdisk, which triggers the flush procedure when the guest domain is terminated, normally or abnormally. When a flush is triggered, HypeGear applies two device-dependent optimizations, targeting the underlying rotating disk, to improve the flush operation.

The first is block data re-ordering. FlushCtrl sorts the writes according to their offsets inside the matching image file (only the "flush candidates" are involved) and then places them in order into a buffer (the buffer stores only the node addresses). This helps the underlying disk I/O schedulers, which are optimized for logically sequential disk I/O flows. Currently, we use a heap algorithm for this sorting.

The other is aggregation. Since Blktap feeds Tapdisk a fixed-size, segmented I/O flow (usually 4KB at a time), aggregating the contiguous I/O requests formed in the re-ordering phase can further improve flush performance. Specifically, FlushCtrl starts from the head of the ordered buffer and scans forward. During this process, if multiple write operations have contiguous offsets (stepped by 4KB), FlushCtrl encapsulates them into one unit (replacing multiple write system calls with a single writev) and then issues them in a batch. The remaining non-contiguous, scattered writes are delivered through the AIO library. If FlushCtrl encounters write requests with the same offset, only the newest one is written back.

Finally, a periodic flush mechanism is also implemented inside GearCache as a thread, using a mutex in each vContext. The mutex ensures that only one flush handler can operate on a vContext at a time.
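The sketch below captures the essence of these two optimizations in C under our own simplifying assumptions: it sorts the flush candidates with qsort rather than the heap sort used above, and it issues every batch with pwritev instead of splitting batched and scattered writes between writev and the AIO library; the structure and function names are hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLK     4096                    /* Blktap segment size */
#define MAX_IOV 64                      /* batch limit for one pwritev */

struct gear_write {                     /* hypothetical flush candidate */
    uint64_t offset;                    /* offset inside the image file */
    void    *buf;                       /* 4KB payload */
};

static int by_offset(const void *a, const void *b)
{
    const struct gear_write *x = *(struct gear_write *const *)a;
    const struct gear_write *y = *(struct gear_write *const *)b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Re-order the candidates, then coalesce runs of contiguous 4KB writes
 * into a single vectored write per run. */
void flushctrl_flush(int img_fd, struct gear_write **cand, size_t n)
{
    qsort(cand, n, sizeof(*cand), by_offset);         /* re-ordering */

    for (size_t i = 0; i < n; ) {
        struct iovec iov[MAX_IOV];
        uint64_t start = cand[i]->offset;
        size_t k = 0;
        while (i < n && k < MAX_IOV &&
               cand[i]->offset == start + (uint64_t)k * BLK) {
            iov[k].iov_base = cand[i]->buf;            /* aggregation */
            iov[k].iov_len  = BLK;
            k++;
            i++;
        }
        pwritev(img_fd, iov, (int)k, (off_t)start);    /* one batched write */
    }
}
```

Sorting turns the random guest flows into a logically sequential stream, and coalescing reduces the number of system calls issued per flush.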
5 EVALUATION

5.1 Testing Environment

We compare HypeGear with the Xen hypervisor using the original Blktap driver with the aio protocol (denoted Blktap+AIO below). Blktap+AIO is further divided into two cases, namely the async-sync and async-async models discussed in Section 2. The former is adopted by XenServer [4] as the default block driver configuration due to its high performance, practicality, and simplicity. To have an apples-to-apples comparison, under the HypeGear model we allocate each guest only 448MB of memory and set an upper limit of 64MB (namely θ) on its vContext, while the memory size of each guest on Blktap+AIO is set to 512MB. The rest of the experimental environment is the same as described in Section 2. To form the synchronous I/O in the guest, we mount the testing directory in the guest file system in synchronous mode under HypeGear (appending the sync flag in the fstab file), and keep Blktap+AIO in its default configuration. Finally, the period of the periodic FlushCtrl in HypeGear is set to 30 seconds.

5.2 Performance Interference

We first apply HypeGear directly to the scenario of Figure 2, in which a normal guest is severely disturbed by a malicious guest issuing intensive flush operations. Figure 4 shows the result. Generally, HypeGear eliminates the high latencies because GearCache asynchronizes the normal guest's I/O. In contrast, async-async suffers a large fluctuation at the end of the disturbance process. This is caused by the busy host file system, which is required to flush the large amount of dirty data produced by the malicious guest.

Fig. 4. Fluctuating file I/O latencies of the normal guest; the y-axis is logarithmic. Due to their small values, latencies on either side for HypeGear, async-sync, and async-async are not plotted.

5.3 Data Reliability

We create ten database clients, each connecting to one guest. All guests reside on the same hypervisor, and MySQL server 5.1.40 with the MyISAM engine is installed. Three of the guests use the async-sync model, another three use the async-async model, and the remaining four use the HypeGear model. To allow the MySQL server to work in a synchronous manner on HypeGear, the logging function is enabled in SignPost and the my.cnf file in the /etc directory is configured. Specifically, the parameter sync_binlog is tuned to trigger synchronization between the log file and the hard disk. This configuration forces each new record in the database to be flushed to the hard disk.
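As an illustration, the relevant my.cnf fragment could look like the following; the paper does not give the exact values, so these settings are only a plausible configuration for the behavior described above.

```ini
# /etc/my.cnf (fragment, assumed values)
[mysqld]
log-bin     = mysql-bin   # enable binary logging
sync_binlog = 1           # sync the binary log to disk after every write
```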
On the other side, we keep the configuration on Blktap+AIO at its default parameters. All clients repeatedly send "insert" operations (each with 1KB of data) to the corresponding MySQL servers. During the interactions between the servers and their clients, we randomly choose ten occasions to destroy the running guests, one at a time, using the command "xm destroy domID" in the terminal of the driver domain. Each time, the command immediately crashes a guest OS without leaving sufficient response time for any exception handler in the MySQL server or the guest.

Table 1 shows the experimental results concerning data reliability.

TABLE 1
The comparison of data integrity. "Client" refers to the number of requests issued by the clients, while the "Server" column gives the number of requests actually received by the server.

VM No. | Disk I/O Model | Client | Server  | Crash Recovery Ratio
1      | async-sync     | 6093   | 172     | 2.82%
2      | async-sync     | 17290  | 14828   | 85.7%
3      | async-sync     | 8822   | Crashed | NA
4      | async-async    | 2064   | 98      | 4.75%
5      | async-async    | 16415  | Crashed | NA
6      | async-async    | 748    | 0       | 0%
7      | HypeGear       | 7043   | 7043    | 100%
8      | HypeGear       | 14497  | 14497   | 100%
9      | HypeGear       | 3146   | 3146    | 100%
10     | HypeGear       | 25661  | 25661   | 100%

Generally, under the original disk I/O models on the Xen hypervisor (both the async-sync and async-async cases), the completion rate is quite low because data are pending inside the guest file system cache. In addition, because of metadata inconsistency, data tables on the MySQL server may even become unusable with a high probability (33%) after the sudden crash. A positive case for the conventional I/O model is shown in the second row of Table 1, which achieves an 85.7% completion rate for the client records. This situation may occur when the flush routine in the MySQL server happens to be activated just before the crash. On the other hand, HypeGear restores all of the dirty data that had been received by the MySQL servers when their hosting guests crashed. This clearly shows that, by handling dirty data in the guest in a synchronous manner, HypeGear is able to protect unsaved data from vanishing when the guest is terminated abnormally. Finally, we also reboot the crashed guests that ran under the HypeGear model and restart their MySQL servers; all of them work normally.

5.4 FlushCtrl Performance

To measure the effects of the device-specific optimizations in FlushCtrl, we use the IOmeter benchmark to repeatedly send disk I/O requests from a client to the corresponding guest, stressing the write request flow on the virtual block device. The testing file is 1GB and each write is 4KB in size. Under the HypeGear model, the continuous disk I/O flow repeatedly triggers FlushCtrl to do the real flushing work. We keep all interactions running for 30 minutes and measure the average bandwidth or IOPS.

Fig. 5. Performance for sequential disk I/O. Each value is the mean over the running guests.
Figure 5 shows the average disk bandwidth (MB/s) when the clients keep sending a 100% sequential disk I/O flow, while Figure 6 shows the average IOPS (input/output operations per second) when the clients send a 100% random flow. In the sequential case, HypeGear generally outperforms async-sync, achieving about 14%-61% more bandwidth. On the other side, async-async has an advantage over HypeGear when the number of running guests is less than or equal to three; however, HypeGear improves performance by 4%-14% over async-async as the number of guests increases. In the random case, HypeGear achieves about a 90%-509% improvement over async-sync, especially when a large number of guests is running. The async-async model also has a performance advantage over the other two models when fewer guests are running, but HypeGear improves performance by 26%-47% once the number of guests reaches three.

Fig. 6. Performance for random disk I/O. Each value is the mean over the running guests.

In summary, although the async-async model delivers impressive performance when few guests are running, it suffers performance degradation as the number of guests increases, because the host file system is required to deal with the large amount of redundant I/O data. On the other side, thanks to the simple design of GearCache as well as the device-specific optimizations, HypeGear shows better performance when the physical server hosts more guest instances.

5.5 Realistic Workloads

5.5.1 Web Server

In the first experiment, the web server workload http_load is used. It produces massive requests for randomly fetching files from a web server (read-intensive). Several guests are created, and each guest is installed with an Apache web server. Five clients, residing in a client machine, are also created and run http_load to fetch 200,000 files from each Apache web server. The experimental results are shown in Figure 7 for different numbers of guests (and thus web servers) running simultaneously. The figure shows that HypeGear performs almost as well as its counterparts async-sync and async-async, which indicates that SignPost has little influence on guest read operations. Note that a performance spike appears in the second group of bars because the physical I/O resource is just about to be saturated; the value drops as the number of guests increases due to I/O resource contention among guests.

Fig. 7. Performance for web server. Each value is the mean over the running guests.

5.5.2 Kernel Compilation

In the second experiment, the performance of Linux kernel compilation (version 2.6.38) is evaluated. By nature, the compilation process is CPU- and memory-intensive, and it also produces disk I/O to load the source files and create many object files. Multiple guests running this workload stress the shared disk device with random access flows. The results are shown in Figure 8. When the number of running guests is less than or equal to 2, HypeGear is at most 2 minutes slower than its counterparts (about 4% overhead). We speculate that this overhead derives from the synchronized file I/O inside the guest file system. However, as the number of running guests increases, HypeGear takes about 3 to 23 minutes less for the Linux kernel compilation than its counterparts. In detail, compared with async-sync, HypeGear brings about a 9.23% to 14% improvement. The enhancement mainly comes from GearCache's ability to cache and aggregate guests' write requests.
The data re-ordering and aggregation in GearCache allow random disk I/O flows to be flushed in a logically sequential and batched manner, thus reducing the mechanical movement of the magnetic disk head. Compared with async-async, HypeGear saves about 13% to 22% of the compilation time. Since the async-async model already asynchronizes guest write requests, the effect of the device-specific optimizations in HypeGear is less apparent there; we speculate that this improvement mainly derives from the simpler management in the guest file system (synchronous I/O in the guest) as well as the compact design of HypeGear. Specifically, because CPU cycles are saved from the complex management of an asynchronous guest, and redundant guest I/O activity in the host file system is avoided, HypeGear has more CPU resources available for the compilation work in the guest. Therefore, the time required to compile the Linux source is reduced.

Fig. 8. Time for Linux build (smaller is better). Each value is the mean over the running guests.

5.5.3 File Server

In the third experiment, we use DBench as a file server workload. DBench simulates a network file server with multiple clients by replaying a trace file. We create 20 clients for each file server, and all clients replay the trace file for about 30 minutes. Figure 9 shows the results. In the cases of 1-3 running guests, the async-async model has an impressive advantage over async-sync and HypeGear. This high performance is enabled by the ample memory resources in the host, which allow a guest to use extra host memory to fit the running workload. However, this gap narrows quickly as the number of running guests increases; HypeGear even achieves a 179.3% improvement over async-async in the case of five running guests. We speculate that this performance penalty on async-async is caused by the intensive I/O activity imposed on the host file system, which frequently consumes the CPU and memory resources of the driver domain and thus indirectly degrades the efficiency of the physical machine. On the contrary, the synchronous I/O in HypeGear's guests saves CPU resources, allowing more clients to be scheduled at a time. Furthermore, the device-specific optimizations in FlushCtrl also improve the randomized disk I/O when multiple guests are running, so the overall bandwidth increases. Compared with the async-sync model, HypeGear achieves about a 33.4%-185% improvement across all cases. This further indicates that device-specific optimization at the hypervisor layer plays an important role in enhancing the I/O performance of a virtualized system.

Fig. 9. Performance for file server. Each value is the mean over the running guests.

5.6 Overhead in Driver Domain

To examine the overhead GearCache imposes on the driver domain, we boot three guests on our experimental machine. For comparison with Figure 1, we use the same workload to measure the overhead in the driver domain. For memory consumption, since a static value of θ is enforced, GearCache always keeps (64 × N) MB as its upper limit, where N is the number of running guests. This prevents host memory from being exhausted by I/O-intensive guests. It should be noted that the memory consumed by the vContexts in the driver domain can even be treated as zero, since under the HypeGear model GearCache is built from memory taken from the guests themselves. For CPU consumption, GearCache consumes at most 15% (the average is about 7.8% while IOzone is running). Compared with the async-async model, which shows 17.1% CPU consumption on average, GearCache is a 54.3% improvement. Using Xenoprof [20] to analyze the distribution of CPU cycles during this process under HypeGear, we find that the memcpy function, which copies the internal buffer of each guest write, accounts for a large proportion (about 23%).

6 RELATED WORK

Similar to our work on GearCache, several prior works explore the benefits of adding a secondary cache in the hypervisor for guests. First, XHive [21] is a cooperative caching system for guests that share a storage device. This scheme allows guests to collaboratively share block copies that are read from the underlying storage, thereby reducing the number of disk I/O operations while enhancing memory utilization. XHive concentrates on the read path, while our system mainly focuses on the write path. Second, by holding dirty data in the hypervisor, Ye et al. propose an energy-saving solution that reduces the number of rotating-disk spin-ups [22], in which the hypervisor chooses a suitable occasion to flush the cached dirty data to disk, depending on whether the rotating disk is working or not. Third, Lu et al. propose a hypervisor-level exclusive buffer cache [23], which allows the user workloads in a guest to be transparently traced while accurately predicting the page miss ratio in the guest without heavy costs. In doing so, the system can acquire the guest memory access pattern and then guide guest memory allocation.

Concerning performance isolation in virtualized systems, Seelam et al. propose VIOS [24], a virtual I/O scheduler that provides absolute performance virtualization by being fair in sharing storage system resources among operating systems and their applications. Although they share the same intention, HypeGear and VIOS achieve this aim through very different approaches: our system adds a block-level cache in the hypervisor to marshal requests from the various guest OSes, whereas VIOS relies on tuning the CPU schedulers of the different guest OSes. Since modern DMA engines are decoupled from CPU intervention, that kind of approach may not be very efficient for handling write requests that carry large buffers. More recently, Gulati et al. present mClock [25], an I/O scheduling algorithm that provides per-guest quality of service (QoS) in the presence of variable overall throughput on the shared storage.

Similar to the synchronous I/O in HypeGear, several works have been proposed to strike a balance between asynchronous I/O and synchronous I/O in a native file system. Chen et al. proposed Rio [6], which uses DEC Alpha workstations (DEC 3000/600) to allow a reset and reboot without erasing memory. Rio modifies the kernel by inserting a check instruction before every memory access to provide basic protection for dirty data. Therefore, this method depends on a specific device while requiring a moderate modification to the kernel of the native OS.
7 CONCLUSION AND FUTURE WORK

Although the traditional disk I/O model in a virtualized environment is simple and general for disk I/O operations on the hypervisor, it compromises the efficiency of coordinating and managing multiple guests. To alleviate these problems, we propose a new disk I/O model called HypeGear. A prototype system is implemented on the Xen hypervisor, and our experimental results show that the new disk I/O model has many potential advantages over the conventional one.

This paper only describes HypeGear in a single-machine environment. However, it is easy to extend our model to a multiple-machine virtualized cloud environment, where guest OSes usually put their data into a dedicated storage pool over a fast network rather than onto the original local disk. In this environment, SignPost is used exactly as introduced in Section 3.2, but introducing GearCache is more complex. In detail, there are two candidate places to interpose GearCache in a multiple-machine environment: at the hypervisor layer of an ordinary physical node, or at the side of the dedicated storage pool. In the first case, the solution is fully compatible with the one presented in Section 3.3, but flush performance degrades since FlushCtrl is required to transfer data to a remote storage device. In the second case, GearCache can enhance its reliability because the dedicated storage pool is often more reliable, but the synchronous I/O in the guest file system incurs extra latency due to network I/O. We will explore this trade-off in our future work.

ACKNOWLEDGMENTS

This work is supported by the National High-tech R&D Program of China (863 Program) under grant No. 2012AA010905, the China National Natural Science Foundation (Key Program) under grant No. 61133006, and the China National Natural Science Foundation under grants No. 61272408 and 60973133.

REFERENCES

[1] "Amazon online shopping," 2011, http://www.amazon.com.
[2] M. Rosenblum and C. Waldspurger, "I/O Virtualization," Queue, vol. 9, pp. 30:30–30:39, Nov. 2011.
[3] D. Le, H. Huang, and H. Wang, "Understanding Performance Implications of Nested File Systems in a Virtualized Environment," in USENIX Conference on File and Storage Technologies (FAST), 2012.
[4] "Citrix XenServer: Efficient server virtualization software," 2011, http://www.citrix.com.
[5] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "KVM: the Linux virtual machine monitor," in Proceedings of the Linux Symposium, Ottawa, Canada, 2007, pp. 225–230.
[6] P. Chen, W. Ng, S. Chandra, C. Aycock, G. Rajamani, and D. Lowell, "The Rio file cache: Surviving operating system crashes," in Proceedings of the ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'96). Massachusetts, USA: ACM, 1996, pp. 74–83.
[7] D. P. Bovet and M. Cesati, Understanding the Linux Kernel. Sebastopol, CA, USA: O'Reilly, 2005, pp. 500–800.
[8] A. Depoutovitch and M. Stumm, "Otherworld: giving applications a chance to survive OS kernel crashes," in Proceedings of the 5th European Conference on Computer Systems (EuroSys 2010). Paris, France: ACM, 2010, pp. 181–194.
[9] P. Chen and B. Noble, "When virtual is better than real," in Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS'01). Elmau/Oberbayern, Germany: IEEE, 2001, pp. 133–138.
[10] T. Garfinkel and M. Rosenblum, "A virtual machine introspection based architecture for intrusion detection," in Proceedings of the ISOC Network and Distributed System Security Symposium (NDSS 2003). San Diego, CA: Internet Society, 2003, pp. 191–206.
[11] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller, "Faults in Linux: ten years later," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '11). New York, NY, USA: ACM, 2011, pp. 305–318.
[12] J. Corbet, "Dynamic writeback throttling," 2010, http://lwn.net/Articles/405076/.
[13] M. Ben-Yehuda, E. Borovik, M. Factor, E. Rom, A. Traeger, and B.-A. Yassour, "Adding advanced storage controller functionality via low-overhead virtualization," in USENIX Conference on File and Storage Technologies (FAST), 2012.
[14] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP'03). New York, NY: ACM, 2003, pp. 164–177.
[15] T. Harter, C. Dragga, M. Vaughn, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications," in Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11). New York, NY, USA: ACM, 2011, pp. 71–83.
[16] E. Nightingale, K. Veeraraghavan, P. Chen, and J. Flinn, "Rethink the sync," in Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI 2006). Seattle, WA: USENIX, 2006, pp. 1–14.
[17] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, D. Tsafrir, and A. Schuster, "ELI: Bare-metal performance for I/O virtualization," in ACM Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[18] A. Warfield, S. Hand, K. Fraser, and T. Deegan, "Facilitating the development of soft devices," in Proceedings of the USENIX Annual Technical Conference (USENIX ATC 2005). Anaheim, CA: USENIX, 2005, pp. 379–382.
[19] C. Tang, "FVD: A High-Performance Virtual Machine Image Format for Cloud," in Proceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC'11). Berkeley, CA, USA: USENIX Association, 2011.
[20] A. Menon, J. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment," in Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE 2005). Chicago, IL: ACM, 2005, pp. 13–23.
[21] H. Kim, H. Jo, and J. Lee, "XHive: Efficient cooperative caching for virtual machines," IEEE Transactions on Computers, vol. 60, no. 1, pp. 106–119, 2010.
[22] L. Ye, G. Lu, S. Kumar, C. Gniady, and J. Hartman, "Energy-efficient storage in virtual machine environments," in Proceedings of the ACM International Conference on Virtual Execution Environments (VEE 2010). Pittsburgh, PA: ACM, 2010, pp. 75–84.
[23] P. Lu and K. Shen, "Virtual machine memory access tracing with hypervisor exclusive cache," in Proceedings of the 2007 USENIX Annual Technical Conference (USENIX ATC 2007). Santa Clara, CA: USENIX, 2007, pp. 75–84.
[24] S. Seelam and P. Teller, "Virtual I/O scheduler: a scheduler of schedulers for performance virtualization," in Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE 2007). San Diego, CA: ACM, 2007, pp. 105–115.
[25] A. Gulati, A. Merchant, and P. Varman, "mClock: handling throughput variability for hypervisor IO scheduling," in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI 2010). Vancouver, BC, Canada: USENIX, 2010, pp. 1–7.
Dingding Li is a Ph.D. student working with Prof. Hai Jin in the Services Computing Technology and System Laboratory (SCTS) at Huazhong University of Science and Technology (HUST). His research is focused on I/O virtualization and cloud computing.

Xiaofei Liao received his Ph.D. degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China, in 2005. He is now an associate professor in the School of Computer Science and Engineering at HUST. He has served as a reviewer for many conferences and journal papers. His research interests are in the areas of virtualization technology for computing systems, P2P systems, cluster computing, and streaming services. He is a member of the IEEE and the IEEE Computer Society.

Hai Jin received his B.S., M.A., and Ph.D. degrees in computer engineering from Huazhong University of Science and Technology (HUST) in 1988, 1991, and 1994, respectively. He is now a Professor of Computer Science and Engineering at HUST in China and the Dean of the School of Computer Science and Technology at HUST. In 1996, he was awarded a German Academic Exchange Service (DAAD) fellowship to visit the Technical University of Chemnitz in Germany. He worked for the University of Hong Kong between 1998 and 2000 and participated in the HKU Cluster project. He worked as a visiting scholar at the University of Southern California between 1999 and 2000. He is the chief scientist of the 973 project ChinaV and of ChinaGrid, the largest grid computing project in China. His research interests include virtualization technology for computing systems, cluster computing and grid computing, peer-to-peer computing, network storage, network security, and high assurance computing. He is a member of the Grid Forum Steering Group (GFSG). He is a senior member of the IEEE and a member of the ACM.

Dr. Bing Bing Zhou is an associate professor in the School of Information Technologies at the University of Sydney, Australia (2003-present). He graduated in electronic engineering in 1982 from Nanjing Institute of Technology in China and received his Ph.D. in Computer Science in 1989 from the Australian National University, Australia. Currently he is the Theme Leader for Distributed Computing Applications in the Centre for Distributed and High Performance Computing at the University of Sydney.

Qi Zhang received his master's degree in Computer Science from Huazhong University of Science and Technology (HUST) in March 2012. His research is focused on virtualization and computer architecture.