Future Generation Computer Systems 29 (2013) 2067–2076. doi:10.1016/j.future.2012.12.004

VSA: An offline scheduling analyzer for Xen virtual machine monitor

Zhiyuan Shao a,∗, Ligang He b,∗, Zhiqiang Lu a, Hai Jin a

a Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
b Department of Computer Science, University of Warwick, Coventry, CV4 7AL, United Kingdom

∗ Corresponding authors. E-mail addresses: zyshao@hust.edu.cn (Z. Shao), liganghe@dcs.warwick.ac.uk (L. He), hjin@hust.edu.cn (H. Jin).

Article history: Received 30 April 2012; Received in revised form 29 August 2012; Accepted 7 December 2012; Available online 22 December 2012.

Keywords: Xen; Virtual machine; Scheduling analysis; Co-scheduling

Abstract

Nowadays, it is an important trend in the systems domain to use software-based virtualization technology to build execution environments (e.g., Clouds). After introducing the virtualization layer, there exist two schedulers: one in the hypervisor and the other inside the Guest Operating System (GOS). To fully understand a virtualized system and identify the possible reasons for performance problems incurred by the virtualization technology, it is very important for system administrators and engineers to know the scheduling behavior of the hypervisor, in addition to understanding the scheduler inside the GOS. In this paper, we develop a virtualization scheduling analyzer, called VSA, to analyze the trace data of the Xen virtual machine monitor. With VSA, one can easily obtain the scheduling data associated with virtual processors (i.e., VCPUs) and physical processors (i.e., PCPUs), and further conduct the scheduling analysis for a group of interacting VCPUs running in the same domain.

1. Introduction

It is an important trend nowadays to use software-based virtualization technologies to build private data centers and public cloud systems, such as Amazon EC2 [1] and GoGrid [2]. Among these virtualization technologies, the Xen Virtual Machine Monitor (VMM) [3], which allows users to execute up to hundreds of virtual machines on a single physical machine with low extra overhead, is widely adopted for such purposes [4]. With the introduction of the Xen virtualization layer, traditional operating systems (e.g., Linux) are para-virtualized (i.e., the machine-sensitive and privileged operations are replaced with hypercalls) to run on top of the Xen hypervisor. The operating systems in the virtualized environments are called Guest Operating Systems (GOS). In this new architecture, there exist two schedulers working together: one works inside the Xen hypervisor and the other in the GOS. Examples of the former include the Simple Earliest Deadline First scheduler (SEDF) [5] and the Credit scheduler [6], which schedule the VCPUs to run on top of the physical processors, while examples of the latter include the Completely Fair Scheduler (CFS) [7,8] in Linux, which schedules the application processes to run on the VCPUs. Although there is a long history of research on GOS schedulers and the GOS schedulers are relatively mature, the schedulers inside the hypervisor are still under active research and development.
To help understand the scheduling behaviors inside the hypervisor, Xen provides a powerful tool, named xentrace, to collect system trace data (including scheduling-related data), and a related formatting tool, xentrace_format, to convert the trace data into a human-readable form. However, it is still very difficult to analyze the scheduling behavior of each individual VCPU by using the trace data alone (we explain why in Section 2).

In this work, we design and implement VSA, an offline scheduling analyzer for the Xen VMM (the source code is available at http://code.google.com/p/vsa). The system can reconstruct the scheduling history of each VCPU based on the offline trace data generated by xentrace and xentrace_format. By using the scheduling history data of each VCPU, VSA can render detailed analysis data on VCPUs, including:

• block-to-wake time: the duration between the time when a VCPU is blocked and the time it is woken up from the blocking state
• wake-to-schedule-in time: the duration between the time when a VCPU is woken up and the time when it is scheduled to run
• preemption: whether a VCPU is preempted or gives up the PCPU voluntarily when it is scheduled out by the hypervisor
• migration: the time point at which a VCPU is migrated and the migration frequency of a VCPU.

Moreover, with the rich data about the scheduling history of VCPUs, VSA can conduct further advanced analyses, including:

• utilization of each PCPU
• domain-based analysis: the scheduling analysis for a group of interacting VCPUs running in the same guest domain.

This paper presents the key design issues of VSA, and also a number of case studies using VSA to analyze the scheduling behaviors of a Xen-based virtualized system. The design strategies and the case studies explain the merits of VSA and demonstrate how the analysis data generated by VSA can help understand the behaviors of Xen-based virtualized systems, and therefore help further system optimization.

The rest of the paper is organized as follows: Section 2 introduces the background knowledge for understanding this paper. Section 3 discusses the key design issues of VSA. In Section 4, we present three case studies to demonstrate and explain how the analysis data generated by VSA can be used to reveal the scheduling behavior of VCPUs, PCPUs and domains. Section 5 presents the related work. Section 6 concludes the paper and discusses the future work of VSA.
2. Background

This section briefly discusses the background knowledge required to understand this work, including the scheduling architecture of the Xen VMM, the split driver model, the three schedulers used in the Xen hypervisor, and how trace data are generated and processed in Xen.

2.1. Scheduling architecture in Xen

Fig. 1 shows the scheduling architecture of the Xen VMM. As illustrated in Fig. 1, traditional servers are installed as domains on the Xen VMM, and a domain can be configured to run on one or multiple VCPUs. The hypervisor treats the VCPU as the smallest scheduling entity, and maintains a run_queue of VCPUs for each PCPU. Under the current architecture of Xen, each PCPU works independently to choose a VCPU from its run_queue and run the VCPU according to the scheduling algorithm configured in the hypervisor.

Fig. 1. The scheduling architecture of Xen VMM.

When a guest virtual machine is invoked to run, its configuration (e.g., the number of VCPUs, the size of its virtual memory) will not change unless the administrator makes explicit changes by issuing commands (e.g., "xm vcpu-set"). Therefore, when child processes or threads are created inside the guest system, they will reuse the same set of processors (VCPUs from the viewpoint of the hypervisor) as their parents. The processes running inside the guest virtual machine are scheduled by the scheduler of the guest operating system (i.e., GOS), such as CFS (i.e., the Completely Fair Scheduler) in Linux. From the perspective of the hypervisor, the smallest scheduling entity is the VCPU. Although the scheduling data of the VCPUs (such as the data gathered by xentrace) cannot be used to analyze the scheduling behavior of the GOS directly, these data can be used to deduce the scheduling behavior of the GOS in some cases. For example, when an idle VCPU starts to run on the fly, it can be inferred that a child process has been invoked. Therefore, analyzing the scheduling data of the VCPUs and domains can be used to diagnose performance problems, no matter whether they originate from the VCPUs observable to the hypervisor or from the processes (or threads) inside the GOSes.

2.2. Split driver model

In order to enforce security and isolation among the domains (i.e., if a domain crashes on the fly, it will not destroy the whole system), Xen adopts the split driver model for I/O virtualization. An Independent Driver Domain (IDD) is created that contains the native drivers (called the Backend drivers); domain0 is typically configured to play this role, as shown in Fig. 1. The other domains (i.e., domainUs) contain the split Frontend drivers, which cannot access the hardware devices directly. When the domainUs want to perform I/O operations, they call the Frontend drivers, and the Frontend drivers communicate the I/O requests to the Backend drivers. The communications between these two types of drivers are performed through hypercalls, shared event channels and ring buffers in Xen.

It is important to understand the principle of the split driver model in order to further understand the scheduling behavior of Xen. For example, suppose one VCPU of a domainU hosts an application that performs a ping-pong micro-benchmark communicating with the outside world. A network packet issued by the application is first sent to the Frontend driver, which relays the packet to the Backend driver located in domain0. During this process, the VCPU of the domainU will enter the block state after sending the packet, since it has nothing to do but wait for the arrival of the response packet. Meanwhile, the VCPU of domain0 will be woken up, since it has to receive the packet from the domainU and send it to the outside world. When the response packet from the outside world arrives, the hypervisor works in the opposite direction and transfers the response packet from the Backend driver to the Frontend driver. This is a very simple example; the situation in real applications is much more complicated. For example, the application running on the VCPU of a domainU may need to perform both computation and communication, and therefore the VCPU of the domainU may not enter the block state after sending a packet.
In order to accelerate I/O processing, some schedulers (e.g., the Credit scheduler) assign higher priorities to the VCPUs that are woken up from the block state. We will explain this in more detail when introducing the schedulers of Xen in the next subsection.

2.3. The schedulers of Xen

During the short development history of the Xen hypervisor, at least three scheduling algorithms [9] have been introduced, including Borrowed Virtual Time (BVT) [10], Simple Earliest Deadline First (SEDF) [5] and Credit [6].

Borrowed Virtual Time (BVT) is a fair-share scheduler based on the concept of virtual time. It dispatches the runnable virtual machines (VMs) on a smallest-virtual-time-first basis. The BVT scheduler provides low-latency support for real-time and interactive applications by allowing latency-sensitive clients to "warp" back in virtual time to gain scheduling priority. The client effectively "borrows" virtual time from its future CPU allocation.

Simple Earliest Deadline First (SEDF) is a real-time scheduling algorithm. In the SEDF scheduler, each VM Dom_i specifies its CPU requirements with a tuple (s_i, p_i, x_i), where the slice s_i and the period p_i together represent the CPU share that Dom_i requests: Dom_i needs to receive at least s_i units of time in each period of length p_i. The boolean flag x_i indicates whether Dom_i is eligible to receive extra CPU time. SEDF distributes the spare PCPU time in a fair manner after all runnable VMs receive their CPU shares.

The Credit scheduler is currently the default scheduler in Xen. The scheduler allocates the CPU resources to the VCPUs according to the weight of the guest domain that each VCPU belongs to. It uses credits to track each VCPU's execution time, and each VCPU has its own credits. If a VCPU has more than 0 credits, it gets the UNDER priority. When it is scheduled to run, its credits are deducted by 100 every time it receives a scheduler interrupt, which occurs periodically once every 10 ms (called a tick). If a VCPU's credits drop below 0, its priority is set to OVER. All VCPUs waiting in the run-queue have their credits topped up once every 30 ms, according to their weights: the higher the weight of a domain, the more credits are topped up for its VCPUs each time. An important feature of the Credit scheduler is that it can automatically load-balance the virtual CPUs across the PCPUs on a host with multiple processors. The scheduler on each PCPU can "steal" VCPUs residing in the run-queues of its neighboring PCPUs once there are no VCPUs in its local run-queue.

In order to accelerate I/O processing and maintain I/O fairness among the domainUs, the work in [11] introduces the BOOST priority. The idea essentially works by assigning the BOOST priority (the highest priority) to a VCPU that is woken up from the block state and still has remaining credits (i.e., it was in the UNDER state before blocking), and letting it preempt the currently running VCPU. In this way, the I/O requests can be handled in time.

As the BVT scheduling algorithm is no longer supported by the latest Xen releases, only the SEDF and the Credit schedulers are investigated in this paper.
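To make the priority rules above concrete, the following is a minimal sketch in Python of the Credit accounting and boosting logic as described in this subsection. It is an illustration of the documented behavior, not the actual hypervisor code; the names (VCPU, on_tick, on_wake) are ours.

```python
from dataclasses import dataclass

@dataclass
class VCPU:
    credits: int
    priority: str = "UNDER"   # UNDER, OVER or BOOST

CREDITS_PER_TICK = 100        # debited at every 10 ms scheduler tick

def on_tick(vcpu: VCPU) -> None:
    """Accounting at each tick: debit the running VCPU's credits and
    recompute its priority (UNDER if credits remain, OVER otherwise)."""
    vcpu.credits -= CREDITS_PER_TICK
    vcpu.priority = "UNDER" if vcpu.credits > 0 else "OVER"

def on_wake(woken: VCPU) -> bool:
    """On wake-up from the block state: a VCPU that still has credits
    (i.e., it was UNDER before blocking) is given the BOOST priority.
    Returns True if it should preempt the currently running VCPU."""
    if woken.priority == "UNDER":
        woken.priority = "BOOST"
        return True
    return False
```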
2.4. Processing trace data in Xen

The Xen hypervisor allocates a region of memory (called the trace buffer) to store the trace data on the fly, so as to minimize the overhead of recording the trace data [12]. When the xentrace command is invoked in domain0, the content of the trace buffer is dumped to the hard disk and stored as a binary file (we call it the raw trace file). After that, the xentrace_format command can be invoked to generate a human-readable trace file (we call it the text trace file) by filtering the raw file with the events of interest (e.g., scheduling-related events). Since modern computers often have multiple processors (i.e., multiple cores), the trace records are stored in a round-robin fashion over the physical processors in both the raw trace file and the text trace file, as shown in Fig. 2.

Fig. 2. The round-robin fashion for storing trace data in files.

The problem with the raw trace file and the text trace file is that the files are huge. For example, on a busy virtualized 2-VCPU web server with two PCPUs, the size of a 10 s raw trace file will be about 100 MB, while the size of the text trace file generated using the default filter will be about 400 MB, containing about 4 million lines (which we call records). The size of the text trace file with only the scheduling events can still be about 200 MB, containing 2 million records. It is very difficult, if not impossible, for a human to read such huge text trace files and analyze the scheduling characteristics of a VCPU of interest. Although one can narrow down the scope by using the tuple (DomainID, VCPUID) to further filter the records related to a specific VCPU, the round-robin storage of records in the text trace file and the VCPU migration mechanism of the Credit scheduler eventually make the filtered results very difficult to read (we explain this further in Section 3.2). Moreover, facing such huge data sets, it is almost impossible to compare the scheduling behaviors of different VCPUs running together in the same machine (and therefore recorded in the same trace file), or of VCPUs running in different computers (and therefore recorded in separate trace files). It is all these challenges that motivate us to build VSA, which enables much more effective analysis of the scheduling trace data.

3. System design

In this section, we first explain in Section 3.1 how VSA uses the records in the text trace file to compute the time attributes that can be used to conduct the scheduling analysis, and also the method provided by VSA to extract the durations of interest from the entire trace period. Then, we discuss the key functionalities of VSA: conducting the scheduling analysis for individual VCPUs (Section 3.2) and for a group of interacting VCPUs running in the same domain (Section 3.3).

3.1. Calculating the time attributes and splitting the trace data

The trace files are generated for a certain time period specified by the system administrator. However, when a record is actually created, it is only associated with the value of the Time Stamp Counter (TSC) register (on architectures with multiple processors, the TSCs are synchronized by the Xen hypervisor), which cannot be directly translated to a time within the specified time period. For example, for a 10 s trace file, the tracing period can be expressed as [0, 10 s]. The TSC values of the records in the trace file will be in the range [T_min, T_max], where T_min and T_max denote the minimum and maximum TSC values, respectively, that can be found in the trace file. For a record with the TSC value t, its time offset from the beginning of the trace file, denoted by T_offset, can be calculated using Eq. (1), where ClockRate denotes the CPU frequency, which can be regarded as a constant (on architectures with Dynamic Voltage and Frequency Scaling (DVFS) features, the constant-rate TSC can be used, so ClockRate can still be regarded as a constant):

$T_{offset} = (t - T_{min}) / ClockRate$.  (1)
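As an illustration, the following Python sketch parses a text trace file and applies Eq. (1). The assumed record layout (a PCPU tag, a TSC value and an event name per line) is a simplification for illustration; the actual fields emitted by xentrace_format depend on the formats file used.

```python
# A minimal sketch of parsing the text trace file and applying Eq. (1).
# Assumed (simplified) record layout per line: "<pcpu> <tsc> <event> ...".

def load_records(path):
    """Parse the text trace file into (pcpu, tsc, event) tuples."""
    records = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip malformed or empty lines
            records.append((int(fields[0]), int(fields[1]), fields[2]))
    return records

def add_time_offsets(records, clock_rate_hz):
    """Annotate each record with its offset (in seconds) from the start
    of the tracing period: T_offset = (t - T_min) / ClockRate."""
    t_min = min(tsc for _, tsc, _ in records)
    return [(pcpu, (tsc - t_min) / clock_rate_hz, event)
            for pcpu, tsc, event in records]
```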
With the time offset, many important metrics can be computed, such as the length of each scheduling time slice.

It is often the case that people only want to analyze a particular duration of the tracing period. For example, after further investigation, the system administrator may find that only the duration [2 s, 8 s] of a 10 s tracing sample is of interest. VSA provides a function to help the administrator split the text trace file for further analysis. Assume that a 10 s trace period corresponds to the TSC range [T_min, T_max], and that T'_min and T'_max correspond to the time offsets of 2 and 8 s, respectively. The trace file can then be split so that only the duration of interest, [T'_min, T'_max], is analyzed. VSA uses the TSC values to delimit the durations of interest, since the TSC values can be easily obtained in the domainUs using the RDTSC instruction and they are more precise than the wall clock time.
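The splitting step can then be a simple filter over the parsed records. A small sketch, reusing the (pcpu, tsc, event) tuples from the previous listing; tsc_lo and tsc_hi stand for the TSC values T'_min and T'_max bounding the duration of interest:

```python
def split_trace(records, tsc_lo, tsc_hi):
    """Keep only the records whose TSC values fall inside the window
    [T'_min, T'_max], e.g., the [2 s, 8 s] part of a 10 s sample."""
    return [(pcpu, tsc, event) for pcpu, tsc, event in records
            if tsc_lo <= tsc <= tsc_hi]
```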
3.2. Scheduling analysis for VCPUs

The VCPU is the minimum scheduling entity in a virtualized system. Therefore, the analysis of VCPU scheduling is the foundation for revealing the scheduling behaviors of a virtualized system. In VSA, the number of times that the __enter_scheduling event appears in the text trace file is used to determine the number of times that a VCPU is scheduled (called the scheduled times). The records of this event are also used in VSA to acquire the time points at which a specific VCPU starts and stops occupying a PCPU, and consequently to construct the time slices during which this VCPU occupies a PCPU. Such a time slice is called a time slice record in this paper.

Nevertheless, on computers with multiple processors, although the trace data are collected in time sequence, there is no guarantee that the time slices calculated from the text trace file are ordered in time. For example, in order to improve the throughput of a virtualized system, the Credit scheduler allows VCPUs to migrate to idle PCPUs dynamically. Suppose a VCPU is scheduled to run on PCPU1 for a time slice S_1, and then migrates to PCPU0 and runs for another time slice S_2. In this case, although S_1 is before S_2, there is no guarantee that the records for S_1 appear before those for S_2 in the trace files. This is because the trace files are organized in a round-robin fashion over the PCPUs, as shown in Fig. 2, and the records of PCPU0 for S_2 may appear before those of PCPU1 for S_1. In VSA, therefore, after the time slices are obtained from the text trace files, the time slice records of the same VCPU are reordered in time. The set of reordered time slice records of a VCPU constitutes the scheduling history of the VCPU. With the scheduling history, VSA can perform many further analyses for that VCPU. The main analyses are discussed in the rest of this subsection.

• Blocking times, wake-up times and the block-to-wake time

A VCPU may enter the block state when it issues an I/O request, since the requested I/O operation may not be performed immediately (recall the split driver model introduced in Section 2) and therefore the VCPU has to wait for the result. The VCPU will be woken up (by the event channel of Xen) when the result arrives. The frequency at which a VCPU enters the block state and is woken up can be used to measure the I/O intensity of the workload running on the VCPU. The time interval for the VCPU to change from the block state to the wake-up state can be used to measure how fast the I/O request is served.

A VCPU enters the block state when a do_block event occurs, while a VCPU is woken up when a domain_wake event occurs. In VSA, each time slice record in the VCPU scheduling history is annotated with a block or wake-up event. The number of times that the block event or the wake-up event occurs for a VCPU is the blocking times or the wake-up times of the VCPU, respectively. The interval between a block event and the next wake-up event of a VCPU is defined as the block-to-wake time of the VCPU. Consequently, the block-to-wake frequency and the average block-to-wake interval are calculated. Using these data, the user can measure the I/O intensity and the average I/O latency of the workload running on a VCPU.

• Wake-to-schedule-in time

After a VCPU is woken up, it may not be scheduled to run immediately. There are many reasons for this. One reason is that the scheduler (e.g., Credit) is designed to be proportionally fair to all domains (and eventually to their VCPUs), and therefore it does not allow the VCPUs with I/O-intensive workloads to gain more processing resources by entering the block and wake-up states frequently. For this reason, a woken-up VCPU may not have a positive credit value, and therefore may not be given the BOOST priority to preempt the currently running VCPU. Another reason is as follows. If a VCPU has other CPU workloads to run after issuing an I/O request, it will not enter the block state. Therefore, when such a VCPU is woken up (i.e., the I/O operation is completed and the result is returned), it is already in the run_queue. In this case, the VCPU will not be given the BOOST priority and cannot preempt the currently running VCPU. However, when the workloads have real-time requirements (e.g., media playing), the interval between the time point at which the VCPU is woken up and the time point at which it is scheduled to run (we call this interval the wake-to-schedule-in time) is a very important metric to measure whether the VCPU is suitable for running such workloads. The interval between a wake-up event and the next corresponding __enter_scheduling event in the scheduling history of a VCPU (it can be checked whether the __enter_scheduling event occurs due to scheduling in or scheduling out a VCPU) is calculated as the wake-to-schedule-in time of the VCPU. This analysis can help the administrators of Xen-based virtualized systems diagnose the potential performance problems of the workloads with real-time requirements.
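Both metrics reduce to interval matching over a per-VCPU event timeline. The sketch below assumes a time-ordered list of (time, event) pairs per VCPU, with the event names "block", "wake" and "schedule_in" as illustrative stand-ins for the do_block, domain_wake and __enter_scheduling records:

```python
def block_wake_metrics(timeline):
    """Compute the block-to-wake and wake-to-schedule-in intervals of a
    VCPU from its time-ordered event list of (time, event) pairs."""
    block_to_wake, wake_to_sched_in = [], []
    last_block = last_wake = None
    for time, event in timeline:
        if event == "block":
            last_block = time
        elif event == "wake":
            if last_block is not None:
                block_to_wake.append(time - last_block)    # I/O latency
                last_block = None
            last_wake = time
        elif event == "schedule_in":
            if last_wake is not None:
                wake_to_sched_in.append(time - last_wake)  # dispatch latency
                last_wake = None
    return block_to_wake, wake_to_sched_in
```

Averaging the two lists yields the average block-to-wake interval and the average wake-to-schedule-in time, respectively.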
• Preemption

In Xen-based virtualized systems, multiple domains often run simultaneously on the same computer. Therefore, the VCPUs of the individual domains inevitably compete for the hardware resources of the computer. In order to prevent one VCPU from monopolizing a PCPU and to improve I/O responsiveness, the Xen hypervisor allows a VCPU to preempt the currently running VCPU. For example, when a VCPU is woken up from the block state and given the BOOST priority, it will preempt the currently running VCPU and run immediately. Analyzing VCPU preemption is important because it can reveal how the workload running on a certain VCPU is interrupted. VSA is able to analyze VCPU preemption. The design of this aspect is as follows. A running VCPU may give up possession of its PCPU voluntarily (e.g., enter the block state due to I/O). Therefore, when a running VCPU is scheduled out, the following three rules are used in VSA to determine whether the VCPU has been preempted; if all three rules hold, the VCPU has been preempted:

(1) the state of the VCPU changes from "running" to "runnable";
(2) the state of the VCPU is NOT changed to "block";
(3) after the VCPU is scheduled out, the PCPU does not schedule in an idle VCPU.

The third rule is designed as such because we need to consider the situation where the running VCPU has used up its credits in the non-work-conserving mode [9] of the Credit scheduler. In this case, the VCPU is not allowed to consume more resources, but it is not preempted.

• Migration

Since the Credit scheduler allows VCPUs to migrate among all available PCPUs (if they are not explicitly confined to specific PCPUs), one VCPU may run on different PCPUs in the period during which the trace files are generated. Moreover, as modern multicore processors have shared L2 caches, a migration between processors that do not share the same L2 cache will incur a huge performance penalty (due to, for example, cache warm-up and the loss of the data cached in the previous cache). The analysis of VCPU migration can help system architects to evaluate the performance hindrance of future schedulers (e.g., the forthcoming Credit2), or to analyze the performance problems of the workloads caused by scheduling.

VSA analyses VCPU migrations in the following way. In VSA, every time slice record in a VCPU's scheduling history is associated with the tag number of the PCPU on which the VCPU is running. VSA browses the scheduling history of a VCPU; if the PCPU tag number of one time slice is not the same as that of the next time slice, then a migration has happened. Further, VSA is able to calculate other migration statistics, such as the number of migrations, the migration frequency, etc.

• Utilization

VSA introduces the term usage rate of a VCPU to denote the percentage of a PCPU that is consumed by the VCPU during a specified period. With the scheduling history of the VCPUs, the usage rate of a VCPU can be computed. Suppose a VCPU, VCPU_i, has a scheduling history of m time slices, which are denoted by S_ij, 0 ≤ j ≤ m − 1. The usage rate of VCPU_i during an interval T starting from the time point t_0, denoted by U_i(t_0, T), can be computed by Eq. (2):

$U_i(t_0, T) = \left( \sum_{j=0}^{m-1} \text{duration of } S_{ij} \text{ that falls in } [t_0, t_0 + T] \right) / T$.  (2)

The usage rate of an idle VCPU can also be calculated using the above method. In Xen, an idle VCPU is always confined to running on the PCPU with the same tag number. Therefore, the utilization of a PCPU can be computed by subtracting the usage rate of its corresponding idle VCPU from 100%.
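Given a reordered scheduling history, the migration count and the usage rate of Eq. (2) are both one-pass computations. A sketch, assuming the history is a time-ordered list of (start, end, pcpu) time slice records:

```python
def count_migrations(history):
    """A migration happens whenever two consecutive time slice records
    in a VCPU's scheduling history carry different PCPU tag numbers."""
    return sum(1 for prev, cur in zip(history, history[1:])
               if prev[2] != cur[2])

def usage_rate(history, t0, T):
    """Usage rate of a VCPU over [t0, t0 + T] (Eq. (2)): the summed
    portions of its time slices that fall inside the window, divided
    by the window length T."""
    busy = sum(max(0.0, min(end, t0 + T) - max(start, t0))
               for start, end, _pcpu in history)
    return busy / T
```

The PCPU utilization then follows by applying usage_rate to the idle VCPU pinned to that PCPU and subtracting the result from 100%.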
3.3. Domain-based analysis

The previous subsection discussed the scheduling analysis for individual VCPUs. This subsection presents how to conduct the domain-based analysis, i.e., to analyze the scheduling behaviors of a group of interacting VCPUs running in the same domain.

As today's servers are always equipped with multiple processing units (i.e., multicore processors or even multiple processors), a virtual machine is typically configured with multiple VCPUs to fully exploit the processing potential of the hardware. This is especially true for the workloads that need parallel processing. Such workloads spawn multiple running processes in the domainUs after invocation. When the VCPUs of these processes are scheduled to run on different PCPUs simultaneously, the execution of such workloads can be accelerated. However, due to the competition with other domainUs, the execution of parallel applications may be greatly disturbed, which may result in huge performance degradation and resource waste. Our previous work [13] reveals the reasons for this, as follows. The spawned processes of the parallel application need to communicate with each other (using, for example, OpenMP or MPI). The communication cannot proceed if one of the involved processes (e.g., the message sender) is offline because its container (i.e., the VCPU) is scheduled out. In the virtualized systems based on Xen and its default Credit scheduler, even when the message sender is offline, the receiving processes may keep waiting until their VCPUs have consumed their whole time slices in vain and are scheduled out. This behavior wastes a large amount of processing resources.

In order to prevent such performance degradation and resource waste, it is a promising method to organize the VCPUs in groups and co-schedule [14] the processes (and consequently their VCPUs at the hypervisor layer) on the PCPUs simultaneously. Indeed, co-scheduling in virtualized systems is becoming a hot research topic, and a lot of research works [15–17] have been conducted recently. In order to conduct the co-scheduling analysis for a parallel application, VSA introduces a metric called the overlap rate. By using the scheduling histories of the multiple VCPUs that belong to the same domainU, VSA can compute the overlap rate in the domain-based analysis.

To explain how the overlap rate is computed, we first introduce another term, the overlap history of two VCPUs. Suppose there are two VCPUs, VCPU_i and VCPU_j, and there are m and n time slices in the scheduling histories of VCPU_i and VCPU_j, respectively. The time slices of VCPU_i are denoted by S_ix, x ∈ [0, m − 1], and those of VCPU_j by S_jy, y ∈ [0, n − 1]. The overlap history of VCPU_i and VCPU_j consists of the time slices during which both VCPUs are scheduled simultaneously. The overlap history of these two VCPUs can formally be computed by the function OH(i, j) defined in Eq. (3), where the ∪ operator performs the union of all time slices, and the ∩ operator computes the common subset of two time slices:

$OH(i, j) = \left( \bigcup_{x=0}^{m-1} S_{ix} \right) \cap \left( \bigcup_{y=0}^{n-1} S_{jy} \right)$.  (3)

For example, suppose there are two time slices S_i0 and S_j0; the time slice S_i0 starts at time 0 and stops at the 5th second, while the time slice S_j0 starts at the 3rd second and stops at the 7th second. The common subset of these two time slices is the duration [3 s, 5 s].

For a domainU with N VCPUs, the overlap history of all these VCPUs can be computed by Eq. (4):

$OH_{all} = OH(\cdots OH(OH(0, 1), 2), \ldots, N - 1)$.  (4)

Suppose that there are M time slices in the computed overlap history of all the VCPUs associated with one domainU, denoted by $S_{OH}^{m}$, m ∈ [0, M − 1]. The overlap rate of the domainU during the execution of a parallel application can be computed by Eq. (5):

$OverlapRate = \left( \sum_{m=0}^{M-1} \text{duration of } S_{OH}^{m} \right) / \text{Application execution time}$.  (5)

In Section 4.3, we will show how the overlap rate calculated above correlates with the extent to which a parallel application running inside the domainUs is co-scheduled.
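Eqs. (3)–(5) translate directly into interval-set operations. The following sketch folds the pairwise overlap over the VCPUs of a domainU; the histories are assumed to be time-ordered lists of (start, end) time slices (the PCPU tag is irrelevant here):

```python
def overlap_history(hist_i, hist_j):
    """Overlap history of two VCPUs (Eq. (3)): the common subsets of
    their time slices, i.e., the periods when both run simultaneously."""
    overlaps = []
    for si in hist_i:
        for sj in hist_j:
            start, end = max(si[0], sj[0]), min(si[1], sj[1])
            if start < end:            # the two slices intersect
                overlaps.append((start, end))
    return sorted(overlaps)

def overlap_rate(histories, exec_time):
    """Overlap rate of a domainU (Eqs. (4) and (5)): fold the pairwise
    overlap over all of its VCPUs, then divide the total overlapped
    duration by the application execution time."""
    oh_all = histories[0]
    for hist in histories[1:]:
        oh_all = overlap_history(oh_all, hist)
    return sum(end - start for start, end in oh_all) / exec_time
```

The quadratic pairwise intersection is adequate for a sketch; a production implementation would merge the two sorted histories in linear time.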
Besides the co-scheduling analysis, VSA also provides the block-to-wake, wake-to-schedule-in and preemption analyses at the domain level. They are achieved by summing up the figures associated with each VCPU in the domain.

4. Case studies of using VSA to conduct analyses

In this section, we present three case studies to demonstrate how VSA can be used to analyze the behavior of the schedulers in the Xen hypervisor, and to analyze the scheduling characteristics of different types of applications. In the first case study, by using the VSA data, we compare the behaviors of the SEDF and the Credit schedulers when they handle identical parallel applications. In the second case study, we use VSA to compare the scheduling characteristics of two different types of applications, i.e., a parallel application and a network service application, running on the same Xen-based virtualized system. In the third case study, we conduct the domain-based analysis at the granularity of the domains that host parallel applications. We show that the domain-based analysis can demonstrate the relationship between the overlap rate and the execution time of an application, and thus quantitatively reveal the co-scheduling effect in the Xen VMM.

In all three case studies, the experiments were conducted on a four-core physical host. The host has two Intel Xeon 5110 (Woodcrest) processors, each of which has two 1.6 GHz processing cores with a 4 MB shared L2 cache. It is configured with 4 GB of DDR2 memory, a 160 GB SATA hard disk drive, and one NetXtreme II BCM5708 Gigabit Ethernet interface. Xen 4.0.1 is adopted as the hypervisor. Both domain0 and the domainUs in the experiments are para-virtualized VMs, and run CentOS x86_64 version 6.2 with kernel version 2.6.31.8. Each of the domains is configured with 4 VCPUs, 1 GB of virtual memory and a 4 GB virtual hard disk drive in the tests.

4.1. Comparing the SEDF and the Credit schedulers

In this case study, we use ep.A.4 from the NPB suite of version 3.3 [18] as the parallel application. ep.A.4 invokes four processes that generate a predefined number of random values. It is "embarrassingly parallel" in that little communication is required during the computation. One domainU with 4 VCPUs is employed to host ep.A.4. The hypervisor is configured with the SEDF and the Credit schedulers, respectively, and the scheduling parameters are the default ones for both domain0 and the domainU, i.e., (20, 15, 1) for domain0 and (100, 0, 1) for the domainU under the SEDF scheduler, and weight = 256 and cap = 0 for both domain0 and the domainU under the Credit scheduler.

Using the method presented in Section 3.2, VSA can construct the scheduling histories of the 4 VCPUs of the domainU during the execution of the benchmark program. The scheduling histories of these 4 VCPUs under the SEDF and the Credit schedulers are shown in Fig. 3(a) and (b), respectively.

Fig. 3. Comparison of the scheduling histories of 4 VCPUs under the SEDF and Credit schedulers: (a) SEDF scheduler; (b) Credit scheduler.

These figures should be read in the following way. The marks 0–3 on the Y-axis represent the tag numbers of the PCPUs on which the VCPUs are scheduled to run. The mark labeled SCHED_OUT on the Y-axis represents that the VCPU is scheduled out and does not occupy any PCPU. Each figure plots the location of a particular VCPU (i.e., one of the 4 PCPUs or none of them) at any time point as the execution of the application progresses. It can be observed from this figure that the SEDF and the Credit schedulers have very different scheduling behaviors.
First, as shown in Fig. 3(a), a VCPU always resides on the same PCPU throughout the execution of the application. This suggests that the VCPUs do not migrate between the PCPUs under the SEDF scheduler. However, it can be seen from Fig. 3(b) that the VCPUs of the domainU migrate among the PCPUs frequently under the Credit scheduler. Second, under the SEDF scheduler, a VCPU oscillates very frequently between a PCPU and SCHED_OUT throughout the execution of the program. This suggests that the SEDF scheduler interrupts the execution of the application very frequently. Under the Credit scheduler, however, a VCPU may occupy a PCPU for a long time. For example, VCPU0 stays on PCPU2 for nearly 4 s from 0 to 4 s and then stays on PCPU1 for nearly 4 s from 5.8 to 10 s. Generally, it can be seen from these figures that the SEDF scheduler interrupts the execution of an application more frequently than the Credit scheduler does. For CPU-intensive applications such as EP, fewer interruptions mean a higher utilization of the physical resources.

4.2. Comparison of scheduling data between parallel applications and service applications

In this case study, in addition to the parallel application ep.A.4 used in the previous case study, we use another type of application, a service application, running inside the same 4-VCPU domainU. The service application is a web server using the Apache server of version 2.2.3. We aim to compare the scheduling characteristics of these two applications.

It takes about 14 s to run ep.A.4 on the virtualized system. In order to make the comparison fair, we set up the experiment environment as follows, so that the service application also takes about 14 s: (1) the ab Apache benchmark resides on an isolated client machine that connects to the server via the network, and (2) the ab benchmark completes 70,000 connections (HTTP requests) with a concurrency level of 10. During this experiment, the Xen hypervisor is configured to use the Credit scheduler, and the domainU is configured with the default scheduling parameters of weight = 256 and cap = 0.

In the experiments, the trace data of the virtualized web server are collected during the benchmark testing, and VSA is used to compute the following statistics on VCPU scheduling (including domain0): the scheduled times, the average length of the time slices in the scheduling history, the number of migrations, the wake-up times and the blocking times. These data for the service application are then compared with those collected for ep.A.4 in the previous case study. The comparison is shown in Table 1.

It can be observed from Table 1 that the time slices of the VCPUs running the parallel application are much longer than those of the VCPUs running the service application (see the "Avg. time slice length" column in Table 1). For example, in Domain 1 (i.e., the domainU that hosts the parallel application or the service application), the average time slice length of VCPU0 is 5.04 ms for the parallel application and 0.17 ms for the service application. This can be explained as follows. When the web server provides services, a large number of network I/O requests need to be handled. This results in a large number of blocking and wake-up events for the VCPUs (see the "Wake-up times" and "Blocking times" columns of Table 1), due to the split driver model discussed in Section 2.2. Moreover, when a VCPU is woken up, it will always preempt the currently running VCPU.
Therefore, the average time slice lengths for the service application are much shorter than those for the parallel application.

Another effect of the large number of blocking and wake-up events on the scheduling behaviors is that they cause very frequent migrations of the VCPUs in both the domainU and domain0 (see the "Migration times" column of Table 1). This can be explained as follows. The large number of blocking events suggests that the VCPUs are more likely to enter the block state. The aggressive migration policy employed by the Credit scheduler mandates that a PCPU "steal" VCPUs from its neighboring PCPUs immediately after the currently running VCPU enters the block state (although it will come back soon) and the local run_queue becomes empty. Although this high migration frequency will inevitably result in performance overhead, to the best of our knowledge, there is still no research work in the literature that systematically analyses its negative impact on the performance of the service applications on Xen-based virtualized platforms. We believe that the quantitative analysis data provided by VSA can help researchers work in this direction.

4.3. Co-scheduling analysis

To evaluate the effectiveness of VSA in conducting the co-scheduling analysis, in addition to ep.A.4 used in the previous two case studies, we use two other parallel computing benchmark programs from the NPB suite of version 3.3: is.A.4 (integer sorting, which relies on all-to-all communications to exchange intermediate results) and bt.A.4 (block tridiagonal, which solves three sets of uncoupled systems of equations in the X, Y and Z dimensions in order). The three benchmarks represent three typical types of parallel applications: (1) the communication-intensive applications with little computation, i.e., IS; (2) the CPU-intensive applications with little communication, i.e., EP; and (3) the applications that lie in the middle, i.e., BT.

In the experiments in this case study, the Credit scheduler is used and all guest VMs have the default scheduling parameters (weight = 256 and cap = 0). We set up four experiment scenarios:

(1) The host runs only one domainU; all VCPUs are free to migrate; the benchmark programs run inside the domainU;
(2) The same as Scenario 1, except that the following measure is taken to prohibit VCPU migration: the VCPUs of the domainU and domain0 are confined to running on the PCPUs with the corresponding tag numbers (i.e., VCPU_i, i ∈ [0, 3], of the domainU (and of domain0) is confined to PCPU_i, i ∈ [0, 3]);
(3) The host runs two domainUs; two identical copies of each of the three benchmarks (is.A.4, bt.A.4 and ep.A.4) run simultaneously (invoked at almost the same time) in these two domainUs;
(4) The same as Scenario 3, except that the VCPUs of the two domainUs and of domain0 are confined to running on the PCPUs with the corresponding tag numbers.

The fourth scenario is deployed to imitate the balance scheduling proposed in [17] for co-scheduling KVM virtual machines. In this case study, we want to evaluate the effectiveness of balance scheduling on the Xen platform.
In these four experiment scenarios, the execution times of the benchmark programs are recorded and averaged (for Scenarios 3 and 4), and the trace records are generated to help VSA conduct the co-scheduling analysis. Table 2 presents the execution times of the benchmark programs and the corresponding overlap rates computed by VSA.

Table 1
Comparison of the statistics of VCPU scheduling between ep.A.4 and the web server.

App         DomainID  VCPUID  Scheduled times  Avg. time slice length (ms)  Migration times  Wake-up times  Blocking times
ep.A.4      0         0       2350             0.065                        35               2361           2361
            0         1       411              0.031                        28               410            410
            0         2       16347            0.021                        40               16737          16736
            0         3       1202             0.163                        34               1203           1203
            1         0       2844             5.040                        119              256            256
            1         1       3464             3.972                        73               152            152
            1         2       4154             3.435                        81               148            148
            1         3       4425             2.760                        109              135            135
httpd (ab)  0         0       104603           0.063                        1783             108580         108414
            0         1       57769            0.073                        8084             58273          58231
            0         2       70312            0.054                        8386             70916          70826
            0         3       1442             0.065                        548              1438           1438
            1         0       28278            0.170                        6635             25455          25451
            1         1       26877            0.448                        8689             12943          12937
            1         2       26540            0.137                        6878             23374          23371
            1         3       28790            0.133                        7836             24721          24718

Table 2
Execution times (s) of the benchmark programs and the corresponding overlap rates.

                        One domainU               Two domainUs
                        Scenario 1   Scenario 2   Scenario 3   Scenario 4
is.A.4  Avg. run time   2.003        2.000        7.855        6.221
        Overlap rate    78.2%        80.2%        2.0%         6.1%
bt.A.4  Avg. run time   126.939      126.727      314.393      281.358
        Overlap rate    97.1%        97.2%        1.5%         6.4%
ep.A.4  Avg. run time   13.585       13.629       27.937       27.483
        Overlap rate    96.4%        96.6%        6.1%         6.8%

From Table 2, it can be observed that for all three benchmark programs, the execution times and the overlap rates under Scenario 1 and Scenario 2 are similar. Moreover, in Scenario 3, we found that the execution time of the communication-intensive benchmark program, i.e., is.A.4, increases by about 3 times compared with the one-domainU scenarios. (Actually, the execution time can increase by up to 6 times in real usage, but the tracing process employed in this experiment caused scheduling to occur more frequently, and consequently reduced the increase factor.) We also found that the overlap rate drops substantially (from about 80% to only 2%). This result suggests that in Scenario 3 there are few opportunities for the four processes of is.A.4 to run simultaneously and thereby reduce the communication penalty. This result also reveals the reason for the significant slowdown of the program executions. For the benchmark programs that need little communication during execution, i.e., ep.A.4, the execution time is just about doubled. The substantial decrease in the overlap rate for EP (from about 96% to about 6%) does not have the same level of impact on performance as that for IS. This is because the processes of ep.A.4 need little communication during the execution, and therefore the performance is not affected much if they are not co-scheduled.
In Scenario 4, we found that the balance scheduling (i.e., preventing VCPU migrations) does help increase the opportunity for co-scheduling (a higher overlap rate than that in Scenario 3). Consequently, the benchmark programs that need communications (i.e., is.A.4 and bt.A.4) have shorter execution times. However, the analysis data provided by VSA also reveal that balance scheduling is not optimal for the Xen VMM, although it works fine for KVM according to [17]. When the number of domainUs (and consequently the amount of competing workload) doubles, the ideal execution time of an application should theoretically also double. Therefore, the optimal overlap rate in this scenario should be close to 50% according to Eq. (5). As shown in the analysis data generated by VSA, the overlap rate is only 6%–7% under balance scheduling, which suggests huge improvement room for future co-scheduling algorithms.

Through these three case studies, we demonstrate that the data generated by VSA can be used to analyze the scheduling behaviors of the schedulers in the Xen hypervisor and to determine the scheduling characteristics of different types of applications. The co-scheduling analysis at the granularity of a domain gives an insight into the relationship between the performance and the co-scheduling degree of parallel applications, and also quantitatively indicates the potential improvement room for the future development of co-scheduling algorithms on the Xen VMM.

5. Related work

Xentop and XenMon are two widely used tools to analyze the performance of the Xen VMM.

• Xentop: Xentop is the built-in performance monitoring tool of Xen. It provides real-time data on the usage rates of the system resources (such as memory, virtual network interfaces, virtual block devices, and so on) by the domains and their constituent VCPUs. However, as a general tool, the information provided by Xentop on VCPU scheduling is not sufficient (only the usage rates are provided). Therefore, it is very difficult to use these data to conduct advanced analyses of the scheduling behaviors of VCPUs and domains.

• XenMon [19]: XenMon provides QoS monitoring and performance profiling mechanisms for the Xen VMM. However, with regard to scheduling, XenMon focuses on monitoring the PCPUs and providing real-time data that show how the PCPUs are used by which domains. Although its data can further show the usage rates and the woken-up and blocking periods at the domain granularity, the more detailed analyses of VCPUs, such as the block-to-wake, wake-to-schedule-in and other analyses provided by VSA, are not available in XenMon. Since there are no data about the VCPU scheduling history, it is also impossible for XenMon to conduct the co-scheduling analysis.

VSA adds to the two popular performance monitoring tools discussed above and provides a complementary performance analysis tool for the Xen VMM. VSA is able to help system administrators and developers identify scheduling-related performance problems in virtualized systems using Xen and investigate the potential room for performance improvements.

Some existing research works have been conducted to analyze and evaluate the scheduling algorithms of the Xen hypervisor [9], and to predict workload characteristics by instrumenting Xentrace and analyzing its output [20]. There are also works on improving the scheduling framework of the Xen hypervisor [11,21] and on co-scheduling virtual machines [15–17,13]. Among these existing works, the work presented in [20] adopts an approach similar to ours for investigating workload characteristics. In [20], the running behaviors of the workloads (i.e., whether a workload is CPU-intensive or I/O-intensive) in a VCPU are determined by instrumenting Xentrace and analyzing the trace data of the VCPU scheduling; the scheduling behavior of the VCPUs of interest can then be further predicted. However, there are non-trivial differences between the work in [20] and our work.
The purpose of VSA is to provide a generic tool that can be used to generate data on a wide range of aspects of VCPU scheduling. Actually, the work in [20] can be regarded as one particular application of VSA (corresponding to the case study presented in Section 4.2). Compared with the work in [20], VSA is able to generate much more comprehensive analyses of the scheduling behaviors of VCPUs and domains, and thus helps researchers and engineers understand the scheduling behavior of the hypervisor more comprehensively.

6. Conclusions and future work

An offline scheduling analyzer for the Xen VMM, called VSA, is developed in this paper. By using the trace data, the system can reconstruct the scheduling history of each VCPU and conduct advanced analyses, such as the block-to-wake, wake-to-schedule-in and preemption analyses. This paper discusses the design and functionalities of VSA, and also presents three case studies. The design strategies and the case studies explain the merits of VSA and demonstrate how the analysis data generated by VSA can help understand the behaviors of Xen-based virtualized systems, and therefore help further system optimizations.

The future work for VSA is two-fold. First, the co-scheduling analysis of VSA to date considers only the intra-domain communications in the domain-based analysis. It cannot yet analyze the cases where inter-domain communications are involved, which requires auditing the scheduling of the VCPUs of domain0 (since they are scheduled to relay the network messages for the domainUs), and requires considering more dimensions in addition to the overlap rate. We will further develop VSA to address these issues. Second, we plan to develop a visualization interface for VSA so that the analysis data generated by VSA can be automatically plotted.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under grants No. 60903022 and 612320081, the Research Fund for the Doctoral Program of Higher Education of China under grant No. 20100142120027, and the Leverhulme Trust in the UK (grant number RPG-101).

References

[1] Amazon EC2 Cloud. URL: http://aws.amazon.com/ec2.
[2] GoGrid Cloud. URL: http://www.gogrid.com.
[3] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, R. Neugebauer, Xen and the art of virtualization, in: Proc. of the ACM Symposium on Operating Systems Principles, SOSP'03, 2003, pp. 164–177.
[4] X. Liao, H. Jin, X. Yuan, ESPM: an optimized resource distribution policy in virtual user environment, Future Generation Computer Systems 26 (8) (2010) 1393–1402.
[5] I.M. Leslie, D. McAuley, R. Black, T. Roscoe, P.T. Barham, D. Evers, R. Fairbairns, E. Hyden, The design and implementation of an operating system to support distributed multimedia applications, IEEE Journal on Selected Areas in Communications (1996).
[6] Credit Scheduler. URL: http://wiki.xensource.com/xenwiki/CreditScheduler.
[7] C.S. Pabla, Completely fair scheduler, Linux Journal 184 (2009).
[8] C.S. Wong, I. Tan, R.D. Kumari, F. Wey, Towards achieving fairness in the Linux scheduler, ACM SIGOPS Operating Systems Review 42 (5) (2008) 34–43.
[9] L. Cherkasova, D. Gupta, A. Vahdat, Comparison of the three CPU schedulers in Xen, ACM SIGMETRICS Performance Evaluation Review 35 (2) (2007) 42–51.
[10] K.J. Duda, D.R. Cheriton, Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler, in: Proc. of the 17th ACM Symposium on Operating Systems Principles, SOSP'99, 1999.
[11] D. Ongaro, A.L. Cox, S. Rixner, Scheduling I/O in virtual machine monitors, in: Proc. of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE'08, 2008, pp. 1–10.
[12] P. Heidari, M. Desnoyers, M. Dagenais, Performance analysis of virtual machines through tracing, in: Canadian Conference on Electrical and Computer Engineering, CCECE 2008, 2008, pp. 000261–000266.
[13] Z. Shao, Q. Wang, X. Xie, H. Jin, L. He, Analyzing and improving MPI communication performance in overcommitted virtualized systems, in: IEEE 19th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2011, 2011, pp. 381–389.
[14] J.K. Ousterhout, Scheduling techniques for concurrent systems, in: Proc. of the Third International Conference on Distributed Computing Systems, 1982, pp. 22–30.
[15] C. Weng, Q. Liu, L. Yu, M. Li, Dynamic adaptive scheduling for virtual machines, in: Proceedings of the 20th International Symposium on High Performance Distributed Computing, HPDC '11, ACM, New York, NY, USA, 2011, pp. 239–250.
[16] Y. Yu, Y. Wang, H. Guo, X. He, Hybrid co-scheduling optimizations for concurrent applications in virtualized environments, in: 2011 6th IEEE International Conference on Networking, Architecture and Storage, NAS, 2011, pp. 20–29.
[17] O. Sukwong, H.S. Kim, Is co-scheduling too expensive for SMP VMs?, in: Proceedings of the Sixth Conference on Computer Systems, EuroSys '11, ACM, New York, NY, USA, 2011, pp. 257–272.
[18] NPB: NAS parallel benchmarks. URL: http://www.nas.nasa.gov/Resources/Software/npb.html.
[19] D. Gupta, R. Gardner, L. Cherkasova, XenMon: QoS monitoring and performance profiling tool, Tech. Rep. HPL-2005-187, HP Labs, October 2005.
[20] B.K. Kim, J. Kim, Y.W. Ko, Efficient virtual machine scheduling exploiting VCPU characteristics, International Journal of Security and Its Applications 6 (2) (2012) 415–420.
[21] H. Kim, H. Lim, J. Jeong, H. Jo, J. Lee, Task-aware virtual machine scheduling for I/O performance, in: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, ACM, New York, NY, USA, 2009, pp. 101–110.

Zhiyuan Shao received his Ph.D. in computer science and engineering from Huazhong University of Science and Technology (HUST), China, in 2005. He is now an Associate Professor in the School of Computer Science and Engineering at HUST. He has served as a reviewer for many conferences and journal papers. His research interests are in the areas of operating systems, virtualization technology for computing systems, and computer networks. He is a member of the IEEE and the IEEE Computer Society.

Ligang He is an Associate Professor in the Department of Computer Science at the University of Warwick. He studied for a Ph.D. in Computer Science at the University of Warwick, UK, from 2002 to 2005, and then worked as a post-doctoral researcher at the University of Cambridge, UK. In 2006, he joined the Department of Computer Science at the University of Warwick as an Assistant Professor. His research interests focus on parallel and distributed processing, and Cluster, Grid and Cloud computing. He has published more than 40 papers in international conferences and journals, such as IEEE Transactions on Parallel and Distributed Systems, IPDPS, CCGrid and MASCOTS.
He has been a member of the program committees of many international conferences, and has been a reviewer for a number of international journals, including IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, etc. He is a member of the IEEE.

Zhiqiang Lu is a Master's student in the School of Computer Science and Engineering at Huazhong University of Science and Technology. His research areas are Cloud computing and virtualization.

Hai Jin is a professor of computer science and engineering at the Huazhong University of Science and Technology (HUST) in China. He is now Dean of the School of Computer Science and Technology at HUST. Jin received his Ph.D. in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. Jin worked at the University of Hong Kong between 1998 and 2000, and as a visiting scholar at the University of Southern California between 1999 and 2000. He was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. Jin is the chief scientist of ChinaGrid, the largest grid computing project in China. Jin is a senior member of the IEEE and a member of the ACM. Jin is a member of the Grid Forum Steering Group (GFSG). He has co-authored 15 books and published over 400 research papers. His research interests include computer architecture, virtualization technology, cluster computing and grid computing, peer-to-peer computing, network storage, and network security. Jin is the steering committee chair of the International Conference on Grid and Pervasive Computing (GPC), the Asia-Pacific Services Computing Conference (APSCC), the International Conference on Frontier of Computer Science and Technology (FCST), and the Annual ChinaGrid Conference. Jin is a member of the steering committees of the IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), the IFIP International Conference on Network and Parallel Computing (NPC), the International Conference on Grid and Cooperative Computing (GCC), the International Conference on Autonomic and Trusted Computing (ATC), and the International Conference on Ubiquitous Intelligence and Computing (UIC).