Active CoordinaTion (ACT) – Toward Effectively Managing Virtualized Multicore Clouds
Mukil Kesavan, Adit Ranadive, Ada Gavrilovska, Karsten Schwan
Center for Experimental Research in Computer Systems (CERCS), Georgia Institute of Technology, Atlanta, Georgia USA
{mukil, adit262, ada, schwan}@cc.gatech.edu
This research is partially supported by NSF award No. 0702249, and donations from Intel Corporation, Cisco Systems, and Xsigo Systems.

Abstract—A key benefit of utility data centers and cloud computing infrastructures is the level of consolidation they can offer to arbitrary guest applications, and the substantial savings in operational costs and resources that can be derived in the process. However, significant challenges remain before it becomes possible to effectively and at low cost manage virtualized systems, particularly in the face of the increasing complexity of individual many-core platforms, and given the dynamic behaviors and resource requirements exhibited by cloud guest VMs. This paper describes the Active CoordinaTion (ACT) approach, aimed at a specific issue in the management domain: the fact that management actions must (1) typically touch upon multiple resources in order to be effective, and (2) must be continuously refined in order to deal with the dynamism in platform resource loads. ACT relies on the notion of Class-of-Service, associated with (sets of) guest VMs, based on which it maps VMs onto Platform Units, the latter encapsulating sets of platform resources of different types. Using these abstractions, ACT can perform active management in multiple ways, including a VM-specific approach and a black-box approach that relies on continuous monitoring of the guest VMs' runtime behavior and on an adaptive resource allocation algorithm, termed Multiplicative Increase, Subtractive Decrease Algorithm with Wiggle Room. In addition, ACT permits explicit external events to trigger VM- or application-specific resource allocations, e.g., leveraging emerging standards such as WSDM. The experimental analysis of the ACT prototype, built for Xen-based platforms, uses industry-standard benchmarks, including RUBiS, Hadoop, and SPEC. It demonstrates ACT's ability to efficiently manage the aggregate platform resources according to the guest VMs' relative importance (Class-of-Service), for both the black-box and the VM-specific approach.

I. INTRODUCTION
Virtualization technologies like VMWare's ESX server [1], the Xen hypervisor [2], and IBM's longstanding mainframe-based systems [3] have not only become prevalent solutions for resource consolidation, but are also enabling entirely new functionality in system management. Examples include dealing with bursty application behaviors and providing new reliability or availability solutions for coping with emergencies. Management is particularly important in Utility Data Center or Cloud Computing systems, as with Amazon's Elastic Compute Cloud (EC2) [4], which use virtualization to offer datacenter resources (e.g., clusters or blade servers) to applications run by different customers and must therefore be able to safely provide different kinds of services to diverse codes running on the same underlying hardware (e.g., time-sensitive trading systems jointly with high performance software used for financial analysis and forecasting). Other recent cloud computing infrastructures and their typical uses are described in IBM's Blue Cloud announcement and/or by the developers of the Virtual Computing Initiative [5].
A promised benefit of utility datacenters or cloud computing infrastructures is the level of consolidation they can offer to arbitrary guest applications, packaging them as sets of virtual machines (VMs) and providing such VMs just the resources they need and only when they need them, rather than over-provisioning underlying platforms for worst-case loads. This creates opportunities for reductions in the operational costs of maintaining the physical machine and datacenter infrastructures, as well as the costs of power consumption associated with both [6], [7]. Additional savings can be derived by coupling consolidation with management automation, including both facility and software management, such as managing upgrade and patching processes and other elements of the software lifecycle. Significant challenges remain before it will be possible to effectively and at low cost manage virtualized systems or, more generally, manage entire compute clouds and their potentially rich collections of VM-based applications. Technical elements of these challenges in cloud management range from efficient monitoring and actuation, to effective algorithms and methods that make and enact management decisions, to high-level decision making processes and policy considerations for such processes [8]. The Active CoordinaTion (ACT) approach introduced in this paper addresses a specific issue in the management domain: the fact that management actions must typically touch upon multiple resources in order to be effective. Examples include power management that must address CPUs, memory, and devices; end-to-end performance management that must consider CPU cycles and network bandwidths; and many others. The approach must be 'active' in that it must be able to react to changes in any one resource as it may affect any number of VMs using it. More specifically, a change in the allocation of network bandwidth to a VM may necessitate changing its CPU allocation, to avoid inappropriate (too short or too long) message queues at the network interface. In response, ACT has methods for tuning the use of each of the resources required by a VM or set of VMs, using mechanisms that, for example, change CPU allocations for a compute-intensive workload while maintaining its IO allocation at low levels, or vice versa for workloads reaching an IO-intensive phase. Furthermore, such mechanisms and the associated allocation algorithms should be capable of exploiting the multicore nature of future cloud nodes, but having multiple applications' VMs share a single platform can lead to breaches of VM isolation, due to hardware effects like cache thrashing, IO interference [9], memory bandwidth contention, and similar issues [10]. ACT's technical solutions address this fact. ACT's resource management mechanisms, presented in this paper and developed in our group (see [11]), specifically support active coordination for the runtime management of the resources provided by a compute cloud to a target application. In this paper, we focus on the active coordination of the communication and computational resources on multicore cloud nodes. In other work, we are developing general management infrastructures and methods for coordinated management [12] in cloud or utility computing systems, first addressing power management [7] and then considering reliability and lifecycle management for the set of VMs comprising an application.
In both cases, our solutions are realized using specialized management components residing on each individual multicore node within the cloud. These managers dynamically monitor, assess, and reallocate the resources used by the node's guest VMs, in a manner that meets the VMs' resource requirements or expectations. ACT requires information about the platform's resources and the applications using them. To characterize applications, we draw on related work in power management [13] to formulate the needs of each VM as a multi-dimensional vector, termed Class of Service (CoS). The vector expresses the relative importance, priority, or resource expectations of a cloud's guest VM. Further, for scalability, the needs of a single VM may be derived from a single CoS associated with a set of VMs (e.g., an application comprised of uniform VMs or a uniform set of VMs in an application). We note, however, that CoS specifications will typically not be complete and sometimes not even known, which requires us to continually refine the CoS vector of a running application using online monitoring. Such refinement can also leverage external CoS specifications (e.g., using emerging standards like WSDM [14]). ACT cannot allocate platform resources without information about them, which is provided by the firmware at machine boot time. Using such information, ACT constructs a second abstraction – the Platform Unit (PU) – which encapsulates a collection of platform resources of multiple types. In our current implementation, these units encode both CPU utilization and IO bandwidth. This abstract characterization of the underlying physical hardware helps ACT (1) to efficiently reason about the available platform resources, (2) to perform allocations of VMs to sets of platform components (i.e., cores and IO lanes) instead of to single components, and (3) to dynamically reconfigure current resource allocations. More specifically, ACT uses these two abstractions to dynamically monitor and manage the platform resources and their utilization by individual VMs, and, based on observed behavior, to trigger appropriate adjustments. ACT management can be performed in multiple ways. One way solely uses a black-box approach [15], where ACT uses historic (i.e., observed) information regarding VM behavior to 'guess' its future trends. For such black-box management, the approach chosen in our work uses an algorithm termed Multiplicative Increase, Subtractive Decrease Algorithm with Wiggle Room (MISDA-WR). The algorithm makes adjustments in each of the resources represented in the platform unit. Another way is for ACT to incorporate mechanisms that permit external actuating events to trigger VM- or application-specific resource allocation decisions. Such external events may be provided by the cloud administrator, may be available statically at VM creation time (e.g., the aforementioned financial application may provide information requesting the doubling of its resources on the 15th of each month), or may be dynamically provided to the node's manager by the guest VM itself and the application executing within it [12]. The latter option is relevant should the VM and/or application include a management agent that knows how to export management-related information regarding its resource needs or QoS requirements. Given emerging standards like WSDM, it is foreseeable that some VMs may incorporate such functionality, but ACT does not require it from the guest VMs it is managing. This paper makes several technical contributions.
First, we develop the ACT approach for management of both communication and computational resources in virtualized cloud environments. The same approach can be extended to also consider other types of resources. Second, we introduce the Platform Unit abstraction and develop platform-level management mechanisms that efficiently and actively manage resources on virtualized many-core platforms. The ACT approach is suitable for black-box management, based on historic/observed information and using the general notion of VMs' CoS (Class of Service) with the MISDA-WR algorithm. In addition, ACT allows for explicit external events or management requests, including those generated by VMs' or applications' management agents, if available. The experimental analysis of the ACT prototype, built for Xen-based platforms, uses industry-standard benchmarks, including RUBiS, Hadoop, and SPEC, and demonstrates ACT's ability to efficiently manage the aggregate platform resources according to the guest VMs' relative importance (Class-of-Service), for both the black-box and the VM-specific approach. Specific results demonstrate ACT's ability (1) to respond quickly to changes in application resource requirements, while resulting in only negligible overhead levels, (2) to distribute the aggregate platform resources based on the relative importance (i.e., CoS) of the platform workloads, and (3) to deliver substantial resource consolidation, with up to a 50% reduction in CPU utilization and 63% less required bandwidth as compared to static worst-case resource allocation, while maintaining acceptable VM performance levels.
Remainder of paper. The remainder of this paper is organized as follows. Section II describes the high-level view of the cloud architecture and its use of the ACT management components. The Class of Service and Platform Unit abstractions and their use for representing VMs' resource requirements and available platform resources are described in Section III. Section IV discusses in greater detail the ACT architecture and accompanying mechanisms, and their realization in our prototype system. Description of the experimental setup and discussion of the experimental results appear in Sections V and VI. A brief discussion of related work and concluding remarks appear in the remaining two sections.

II. HIGH LEVEL CLOUD ARCHITECTURE
Fig. 1. Cloud Management Architecture
Figure 1 describes the high-level view of a Cloud infrastructure using Active CoordinaTion (ACT). Similar to existing management approaches for cluster servers, data centers, or virtualized environments [16], [8], a top-tier manager makes high-level admission control and deployment decisions. The top-tier cloud manager is responsible for the deployment of VMs or sets of VMs onto one or more many-core nodes. Per-node resource management is performed by the ACT node manager, which periodically exports to the higher-level cloud manager its view of the available node resources. The actual deployment of a VM or set of VMs on an individual node's cores is performed by the ACT node manager.
With multi-core systems expected to continue to increase the number of cores in the future, and companies such as Intel building 80-core prototype systems [17], this approach establishes a natural hierarchy that will scale more easily to clouds built from future-generation many-core systems. For simplicity, the architecture presented in Figure 1 shows a single-level hierarchy, where a single cloud manager is responsible for all many-core resources in the cloud. This solution can be further improved by extending it to include a deeper hierarchy of managers, using clustering techniques such as those developed for [18], [19], [20], or by replicating and/or partitioning the cloud manager functionality and maintaining some level of consistency among the high-level managers. Since the overall cloud management is not the focus of the current paper, the remainder of our discussion focuses on the ACT management mechanisms for single many-core nodes within the cloud.

III. SPECIFYING RESOURCE DESCRIPTIONS
The ability of both the top-level cloud manager and the separate node managers to perform mapping of VMs to underlying resources (i.e., nodes, or cores within a node, respectively) requires some notion of the type and amount of resources required or expected by the cloud clients and their VMs. In addition, it requires a representation of the underlying platform resources onto which client resource requirements can easily be mapped. Towards this end, we introduce the notion of Class of Service (CoS) to describe VM resource needs, and an abstraction – the Platform Unit (PU) – which describes a unit of platform resources. Class of Service coarsely describes the expectations of a cloud client with respect to the level of service it will attain. The CoS designation may be based on monetary values, i.e., fees that the client is being charged for using the cloud services, or based on other criteria. For instance, VMs can be classified as 'gold', 'silver', and 'bronze', similar to client classifications in enterprise systems. CoS relates to notions such as Service Level Agreements (SLAs), but it is a lower-level representation that directly specifies the platform resources, i.e., CPU, IO, etc., expected by a guest VM. ACT uses a multi-dimensional vector representation for the CoS, similar to other work developed by our group [12]. The vector expresses the relative importance, priority, or resource expectations of a cloud's guest VM. Since VMs are not expected to explicitly specify their resource requirements, or often are unable to do so for the aforementioned reasons, the 'active' CoS of a VM, i.e., the resources allocated to a VM at runtime, is dynamically tuned, based on the relative priority of the VM, the current platform loads, and the runtime observations regarding the VM's resource needs. Further detail on ACT's mechanisms for runtime resource management appears in the following section. The elementary resource unit on the multi-core platform, onto which client VMs are deployed, is referred to as a Platform Unit. Platform units (PUs) encapsulate a collection of resources used in ACT to represent the resource sets allocated to guest VMs as well as the available platform resources. In our current prototype, we consider platform management based on VMs' computation and IO requirements, so that a PU is represented as a quantity of CPU and IO resources, e.g., the equivalent of 60% of a CPU together with a given amount of IO bandwidth. PUs could also describe other resource types, such as specialized hardware accelerators (e.g., graphics cards), or even the number of CPUs for deploying workloads with concurrency requirements. PUs may come in different 'sizes', i.e., may include different levels of the resources they describe.
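For concreteness, the sketch below shows one way the CoS and PU abstractions could be represented; the class and field names are illustrative placeholders rather than ACT's actual data structures, and the cap values are simply the gold/silver/bronze configuration used in our experiments (Section V), not part of the abstractions themselves.

```python
# Illustrative sketch of the CoS and Platform Unit abstractions; the class
# and field names are assumptions, not ACT's actual data structures.
from dataclasses import dataclass
from enum import Enum

class CoS(Enum):
    GOLD = 3
    SILVER = 2
    BRONZE = 1

@dataclass
class PlatformUnit:
    cpu_cap: float       # percent of a CPU, e.g., 80.0 means an 80% cap
    net_cap_mbps: float  # egress network bandwidth cap in Mbps

# Initial (maximum) PU granted per CoS; cap values follow the prototype
# configuration described in Section V.
INITIAL_PU = {
    CoS.GOLD:   PlatformUnit(cpu_cap=80.0, net_cap_mbps=200.0),
    CoS.SILVER: PlatformUnit(cpu_cap=60.0, net_cap_mbps=125.0),
    CoS.BRONZE: PlatformUnit(cpu_cap=40.0, net_cap_mbps=75.0),
}
```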
While the range of PUs supported on a single multicore node may be very diverse, for practical reasons we focus on a set of discrete PU values to represent initial resource allocations to guest VMs. Namely, depending on the VM's CoS, ACT allocates a predefined Platform Unit (i.e., a fixed amount of CPU and IO resources in our prototype implementation). Subsequently, ACT's resource management algorithm further tunes the PU allocated to a given VM based on its runtime behavior, as observed by ACT's monitoring component, or based on direct input from the VM itself. These initial PU values express the maximum resources that can be allocated to a VM based on its CoS. They may correspond to the CoS classification and may similarly be termed 'gold', 'silver' and 'bronze', or they may be defined and matched to different CoSs in some other manner.
Representing Resource State Information. The aggregate platform resources are maintained in a global resource map that describes the platform resources available on a per-core basis. In our case, CPU resources are maintained for each core separately, whereas the network resources are maintained as a single pool available to each of the VMs, given that all VMs can uniformly access the node's network devices. Information regarding currently allocated resources is maintained in a list of per-VM PU allocations, ordered based on the VMs' CoS. This list is updated at each monitoring interval. In addition, each entry contains information regarding the identifiers of the physical resources last used by each VM, and whenever possible, the VM is mapped to the same physical resource components (i.e., cores), to derive cache affinity benefits.
Realizing Platform Units. The node-level implementation of the computational, i.e., CPU, resource units can easily be supported via common credit-based schedulers at the VMM level, such as the Xen credit-based scheduler [21]. The realization of the IO component of the resource unit may be implemented differently based on the IO virtualization solution supported in the cloud environment. Typically, for IO devices virtualized via driver domains, or for drivers residing in a specialized control domain, such as dom0 in Xen virtualization, all IO operations are performed through the dedicated domain (or the VMM if devices are deployed as part of the VMM). Therefore, the device domain can easily be extended both to gather and export the current IO utilization of individual VMs and to enforce resource limitations on IO usage by each VM. For IO devices supporting VMM pass-through, each VM interacts with the IO device directly, thereby bypassing the hypervisor layer or any specialized/dedicated domain. In such cases, the realization of IO reservations has to rely on device and/or fabric level functionality. For instance, interconnects like InfiniBand (IB) provide support for controlling the rate of sending data by way of Service Level to Virtual Lane (SL-VL) Mappings and VLArbitration Tables. Virtual Lanes (VL) are a way of carving the bandwidth into multiple sizes. The VLArbitration tables inform the Subnet Manager of how much 'data' can be sent in unit time. Each IB packet is associated with a Service Level (SL) specifiable by the application, and the SL-VL Mappings allow a packet to be sent on a particular VL. In theory, such hardware support allows for different service levels (or different bandwidths) to be provided to different applications.
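To make this split concrete, the sketch below shows one way the IO-enforcement step could be abstracted over these two virtualization styles; the interface and backend classes are hypothetical illustrations, not components of the ACT prototype.

```python
# Hypothetical sketch: one enforcement interface, two backends matching the
# two IO virtualization styles discussed above.
from abc import ABC, abstractmethod

class IOEnforcer(ABC):
    @abstractmethod
    def set_egress_cap(self, vm_name: str, mbps: int) -> None:
        """Cap the egress bandwidth of a VM's virtual NIC."""

class DriverDomainEnforcer(IOEnforcer):
    """Devices virtualized via a driver domain (e.g., dom0): all IO already
    flows through the dedicated domain, so it can shape traffic directly."""
    def set_egress_cap(self, vm_name: str, mbps: int) -> None:
        print(f"dom0: shaping backend interface of {vm_name} to {mbps} Mbps")

class FabricEnforcer(IOEnforcer):
    """Pass-through devices: push the cap down to the device or fabric,
    e.g., per-VL rate limits configured on an InfiniBand switch."""
    def set_egress_cap(self, vm_name: str, mbps: int) -> None:
        print(f"fabric: limiting {vm_name}'s virtual lane to {mbps} Mbps")
```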
Our concrete implementation is based on the per-VL bandwidth limitations supported in InfiniBand platforms, but instead of being based on direct manipulations of the VLArbitration tables and the Subnet Management state, it relies on the functionality provided by a specialized IO switch, the Xsigo VP780 I/O Director [22]. The Xsigo switch internally implements mechanisms for controlling per-lane resource utilization similar to those of native IB (i.e., it limits the amount of credits issued on a per-VL basis). It allows subnet management and IO provisioning per individual virtual NIC similarly to the native IB solution, but the VNICs are wrapped with a standard Ethernet/IP layer and are exported to the guest VMs as traditional Ethernet devices. The VMs access them through the traditional split-driver approach and dom0 access supported in Xen.

Fig. 2. ACT Software Architecture
IV. ACT - MULTICORE NODE MANAGEMENT
A. Platform Level Components
Figure 2 illustrates the main components of the Active CoordinaTion software architecture, realized for Xen-based virtualized platforms. Its operation relies on information gathered from chip- and device-level hardware performance counters, CPU and IO utilization information, etc., as well as direct input from external administrators, or VM or application management agents. The ACT components, included within the dashed lines in this figure, may be part of the virtual machine monitor (VMM) layer, or they may be deployed within a designated control domain, such as dom0 in a Xen-based system. The software components can be summarized as follows.
Monitoring: The monitoring module periodically gathers information regarding VMs' resource utilization, which includes CPU and network utilization in our prototype implementation. VMs are treated as black boxes, and the monitoring functionality is external to and independent from the VMs [15]. The monitored state includes platform-wide hardware performance counters, as well as per-VM statistics maintained by the virtual machine monitor, represented as XenStat for our Xen-based ACT implementation.
QoS-aware Adaptation Module: The QoS-aware adaptation module implements the resource management policy for the multicore platform. It encapsulates the active management logic which, based on the current monitoring state and the CoS of the workloads deployed on the node, triggers reconfigurations of the resource allocations of VMs' PUs. These reconfigurations include changes in resource allocation, such as CPU and network bandwidth, scheduling parameters, migration of VMs across the cores on the many-core platform, etc. Since our current focus is on node-level mechanisms, we do not consider issues related to VM migration across distinct nodes in the cloud infrastructure. Instead, when a VM is perceived to have insufficient resources, it is temporarily blocked from execution. In a real deployment, however, determining such resource inadequacies would result in requests to the higher-level cloud managers for VM migration. The actual realization of the QoS-aware Adaptation Module may include arbitrary logic, ranging from lookup into rule sets, to control theory feedback loops [23], statistical models, AI or machine learning methods, etc. The management algorithm implemented in our prototype is described later in this section.
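Taken together, these components form a periodic monitor-adapt-enforce cycle. The sketch below outlines that structure only; the get_vm_stats, update, place, and enforce helpers are hypothetical stand-ins (the prototype uses libxenstat for monitoring, the MISDA-WR algorithm described below for adaptation, and libxc plus the Xsigo QoS interface for enforcement).

```python
import time

SAMPLING_INTERVAL_S = 7  # coarse-grained interval, on the order used in our experiments

def management_loop(vms, adaptation, allocator, get_vm_stats):
    """Hypothetical outline of ACT's per-node control cycle.

    vms          : guest VMs on this node, ordered by CoS
    adaptation   : QoS-aware adaptation module (e.g., MISDA-WR)
    allocator    : resource allocator mapping PUs onto cores and vnics
    get_vm_stats : black-box monitoring callback (stands in for libxenstat)
    """
    while True:
        stats = {vm: get_vm_stats(vm) for vm in vms}   # Monitoring
        new_pus = adaptation.update(stats)             # QoS-aware adaptation
        mappings = allocator.place(new_pus)            # PU -> physical resources
        allocator.enforce(mappings)                    # e.g., via libxc and switch QoS
        time.sleep(SAMPLING_INTERVAL_S)
```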
In addition to relying on state provided by the monitoring component or the CoS specifications regarding guest VMs, the Adaptation Module can respond to external events and resource allocation requests. These may come in the form of external triggers, i.e., specific rules provided by the cloud administrator regarding regular and/or anticipated variations in the platform resource allocation policy. An example of such events may be a rule stating that on the 15th of each month the resources allocated to a set of guest VMs executing financial services related to commodity futures trading must be increased by a fixed factor (e.g., beyond the multiplicative increase supported by the currently deployed algorithm). A separate source of explicit external events may be provided directly by the VMs or the applications they encapsulate, for instance through management agents or APIs, similar to those supported by standards such as WSDM. In the event the VMs provide information regarding their perceived quality of service, the adaptation module correlates that information with the ACT resource allocation to both validate and tune its behavior. This is necessary in scenarios where changes in a VM's CPU or network allocations do not translate directly into improvements in application-level quality requirements (i.e., SLAs). Note, however, that we do not require such agents to exist within guest VMs, or any additional information regarding the VM's resource requirements to be provided to the ACT manager beyond the CoS. These external events merely give cloud clients greater ability to influence the resource allocations they receive, thereby increasing the ability to better meet application quality requirements while keeping the overall resource usage (and therefore, client costs) at a lower level.
Resource Allocator: The actual deployment of PUs onto the underlying physical platform components (i.e., cores and vnics) is performed by the Resource Allocator. This module relies on the global resource map to determine the available platform resources and, if necessary, changes the PU-to-physical-resource deployments. Such changes may result in balancing VMs across the available platform resources, or consolidating existing VMs onto fewer physical components, in order to accommodate additional VMs and their PU requests. The mappings determined by the resource allocator are then passed to the platform's VMM; in our case, this is accomplished via the Xen Control interface. Not shown in Figure 2 is the state used by ACT managers, which includes state updated by the monitoring module regarding the resource utilization of individual VMs, external rules or per-VM external inputs, the platform-wide resource map, and the current platform units allocated to VMs and their deployment on the underlying physical resources. In addition, for integration in the overall cloud management hierarchy, ACT nodes include a coordination API, used for interactions with upper-layer managers for the exchange of monitoring updates, new loads (i.e., VMs) and their CoS specifications, external events, or for the implementation of cloud-wide reconfiguration mechanisms, including VM migrations and coordination with peer ACT nodes.
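As a simple illustration of the placement step just described, the following sketch chooses a core for a VM's PU with the affinity preference used by ACT (the core the VM last ran on is checked ahead of other cores); the function and parameter names are hypothetical, and the real allocator additionally manages the shared network pool and consolidation decisions.

```python
def place_vm(vm_name, cpu_needed, free_cpu_per_core, last_core=None):
    """Hypothetical affinity-first core selection for a VM's PU.

    free_cpu_per_core : dict mapping core id -> free CPU percentage
    last_core         : core the VM ran on in the previous interval, if any
    Returns the chosen core id, or None if no core has enough capacity
    (in which case admission control pauses the VM, as in our prototype).
    """
    if not free_cpu_per_core:
        return None
    # Prefer the previous core to retain warm-cache benefits.
    if last_core is not None and free_cpu_per_core.get(last_core, 0.0) >= cpu_needed:
        chosen = last_core
    else:
        # Fall back to the core with the most free capacity.
        chosen = max(free_cpu_per_core, key=free_cpu_per_core.get)
        if free_cpu_per_core[chosen] < cpu_needed:
            return None
    free_cpu_per_core[chosen] -= cpu_needed
    return chosen
```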
B. ACT Resource Management Mechanisms
Monitoring. The monitoring module periodically gathers information regarding VM-level CPU and network usage. The resource usage sampling frequency influences the rate at which ACT initiates changes in CPU and network allocations, and determines the system's ability to adapt. Given the well-understood trade-offs between monitoring accuracy and monitoring overheads, which depend on the frequency of the monitoring operations, this frequency is a configurable parameter. Considering the workloads targeted by our current work and the dynamic behaviors they exhibit, we chose a reasonably coarse-grained sampling interval of a few seconds to curtail monitoring overhead. This is based both on experimental observation and on the assumption that most enterprise applications have longer run times, so that the time taken by the system to adapt to changes is small compared to the lifetime of the application. Our experimental analysis demonstrates that successful adaptation can be achieved without undue monitoring costs.
Reconfiguration & QoS Adaptation. The QoS adaptation module determines the amounts of CPU and network bandwidth allocated to each VM, based on its current and past resource usage pattern. The prototype ACT system implements a QoS adaptation mechanism along the lines of Additive Increase Multiplicative Decrease (AIMD) TCP congestion control, called Multiplicative Increase, Subtractive Decrease Allocation with Wiggle Room (MISDA-WR). The initial resources allocated to a VM are derived from the platform unit corresponding to the VM's CoS. This is the maximum allocation that a VM with a given CoS can receive. MISDA-WR then continues to actively refine this allocation to reduce the wasted residual resource capacity not being utilized by the application VM. This is done in a linear, subtractive-decrease manner that slowly lowers the VM's resource allocation by observing its resource usage over a reasonable amount of time. In essence, if a VM has had constant resource utilization for a long period of time, then it is safe to assume that it will stay longer in that state of utilization. Note that we always slightly overprovision the VM's resources and provide some "wiggle room" to account for sudden small increases in resource usage. When the wiggle room capacity is observed to be fully utilized at a monitoring time point, this usually means that the application is transitioning to a phase with increased CPU and network requirements. ACT detects this condition and reacts to it by multiplicatively increasing the degree of resource overprovisioning to accommodate an impending increase in the rate of computation/communication, with the aim of meeting the application-level quality requirements. An increase in the allocation of one resource, however, might result in an increase in the usage of some other resource, as is the case for network and CPU resources. In response, the algorithm uses a tunable system parameter to account for this perturbation by speculatively increasing the allocation of a resource by a small amount based on an increase in the allocation of another resource. The subtractive decrease mechanism eliminates this increase if it proves unnecessary. It is always possible, of course, that the system might not meet current application-level quality requirements for short periods of time, because of the reasonably coarse monitoring granularity. As mentioned earlier, if this presents a problem, the time to adapt can be controlled by varying the monitoring time period to suit specific deployment scenarios. The MISDA-WR algorithm is described in Figure 3.
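For concreteness, the sketch below gives one possible reading of the per-resource MISDA-WR update; the threshold constant, the function signature, and the final usage-plus-wiggle allocation rule are illustrative assumptions, while Figure 3 gives the precise formulation used in the prototype.

```python
THRESHOLD_SMALL_CHANGE = 5.0  # illustrative: spare capacity below this is "almost used up"

def misda_wr_step(alloc_prev, usage_prev, wiggle_prev, max_alloc,
                  other_res_increased, alpha=0.3, beta=0.2, gamma=0.3):
    """One per-resource update at a monitoring interval (hedged sketch).

    alloc_prev, usage_prev : allocation and observed usage at the last interval
    wiggle_prev            : wiggle room carried over from the last interval
    max_alloc              : cap implied by the VM's CoS (its initial PU)
    other_res_increased    : True if the coupled resource's allocation grew
    """
    # Spare capacity left over from the previous interval.
    wiggle_used = alloc_prev - usage_prev

    # Lower bound on the wiggle room, scaled by alpha (and gamma for coupling);
    # Figure 3 additionally folds recent average usage into this bound.
    min_wiggle = (alpha + alpha * gamma) * (max_alloc - usage_prev)

    if wiggle_used <= THRESHOLD_SMALL_CHANGE:
        # Wiggle room nearly exhausted: multiplicative increase.
        wiggle = wiggle_prev * 2
    else:
        # Usage stable below the allocation: subtractive decrease,
        # never dropping below the computed minimum.
        wiggle = max(wiggle_prev - beta * wiggle_prev, min_wiggle)

    if other_res_increased:
        # Speculative bump when the coupled resource (e.g., network for CPU)
        # was just increased; removed later by subtractive decrease if unneeded.
        wiggle += gamma * wiggle

    # New allocation: observed usage plus wiggle room, capped by the CoS maximum.
    return min(usage_prev + wiggle, max_alloc)
```

The sketch is applied to each resource in the PU independently, with gamma providing the cross-resource coupling described above.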
α, β, and γ are configurable parameters that determine the rate at which the algorithm adjusts the allocated resource levels, and the manner in which changes in the allocation of one resource (e.g., CPU) should influence adjustments in another resource (e.g., network bandwidth). The figure also shows that, in addition, simple affinity-based scheduling (where the processor on which a VM last ran is checked for available capacity ahead of other processors) is used to prevent the thrashing effects of repeatedly pinning VMs to different CPUs (avoiding the cold-cache problem to some extent).

C. Implementation Details
ACT's monitoring and QoS adaptation modules are implemented as a user-level application in dom0. The Xen statistics collection library API (libxenstat) is used to collect the CPU and network monitoring data of VMs, and the Xen control interface API (libxc) is used to set the CPU bounds/placement for VMs. In the prototype implementation, all network resources (i.e., lanes and vnics) are equally accessible to all platform cores and are treated as a single resource pool. In addition, we only manage the egress QoS rate for VMs. We also round off the bandwidth allocation deduced by the algorithm to the nearest multiple of 10, for simplicity of interfacing with the Xsigo switch QoS abstractions. Network QoS is enforced via remote execution of the relevant commands on the admin console of the Xsigo switch in our infrastructure. To reduce the overhead associated with this remote execution in our critical path, we batch QoS enforcement requests for all the running VMs and execute them all at once. At each sampling interval, admission control is performed in case a VM's resource requirement computed by the algorithm cannot be satisfied by the available free resources in the physical machine. In such cases, ideally, the VM should be migrated to another node in the cluster that has the resources to host it, but due to the limitations of our current testbed, we emulate such migration by pausing the VM in question. When more resources become available in the physical machine, the paused VMs are started again, thereby roughly emulating a new VM being migrated into the current physical machine from elsewhere in the cluster.

V. EVALUATION ENVIRONMENT
A. Testbed
The evaluation of the ACT prototype is performed on a testbed consisting of 2 Dell PowerEdge 1950 servers, each with 2 quad-core 64-bit Xeon processors operating at 1.86 GHz. Each server runs the RHEL 4 Update 5 OS (paravirtualized 2.6.18 kernel) in dom0 with the Xen 3.1 hypervisor. Each of the servers has a Mellanox MT25208 InfiniBand HCA. These HCAs are connected through a Xsigo VP780 I/O Director [22] switch. The Xsigo switch is an I/O virtualization solution that allows both Ethernet and Fibre Channel traffic to be consolidated onto the InfiniBand fabric. Current experiments use only the Ethernet capability. The switch provides a hardware-based solution to control network bandwidth for the "vnics" it exports. These vnics behave similarly to virtual ethernet or 'tap' devices in Linux, except that underneath they use the InfiniBand rather than the Ethernet fabric. A Xsigo driver module sits between the vnics and the kernel InfiniBand modules and is responsible for mapping the Ethernet packets to InfiniBand RDMA requests. In the Xen virtualized environment, the network traffic flows between dom0 and the domUs via the Ethernet split-driver model, as explained in [2].
The VMM-bypass virtualization solution for InfiniBand discussed in our previous paper [24] is not used in the current experimental setup. All communication among VMs is carried via the vnic interfaces. The Xsigo switch can be configured with multiple I/O modules (e.g., with multiple 1GigE ports, a 10GigE port, or with Fibre Channel ports). Currently, our switch is outfitted only with a GigE I/O module, which limits the total network capacity to/from the cluster, despite the larger capability of the internal IB fabric. In the future, we expect to both increase the aggregate network capacity of the cluster and also include disk IO in ACT's resource management methods. A combination of network processing elements within the Xsigo switch is responsible for enforcing the QoS policy across separate vnics, and can be dynamically reconfigured via the Xsigo control APIs. The ACT components interact with the Xsigo switch to enforce the reconfiguration of network resources.

RESOURCE-ALLOCATION (MISDA-WR)
1. Start with an unbounded resource allocation initially, for some time, until sufficient historical data is gathered.
2. Calculate:
   wiggle-used[k-1] = res-alloc[k-1] - res-usage[k-1];
   min-wiggle[k] = avg-roi-res[k-1] + (α + α*γ) * (max-res-alloc - res-usage[k-1]);
3. MULTIPLICATIVE INCREASE of WIGGLE ROOM:
   if (wiggle-used[k-1] almost 0) then
     wiggle[k] = wiggle[k-1] * 2;
   SUBTRACTIVE DECREASE of WIGGLE ROOM:
   if (wiggle-used[k-1] > THRESHOLD-SMALL-CHANGE) then
     wiggle-temp[k] = wiggle[k-1] - β*wiggle[k-1];
     wiggle[k] = max(wiggle-temp[k], min-wiggle[k]);
4. res-alloc[k] = res-alloc[k-1] + wiggle[k];
α - Used to vary the lower bound on resource allocation
β - Influences the rate at which the MISDA-WR resource allocation converges to the actual usage
γ - Tunable parameter that increases a resource's allocation by a small amount w.r.t. an increase in another resource's allocation
Fig. 3. Resource Allocation Algorithm

The guest VMs on the servers run the RHEL 4 Update 5 OS, and each is allocated 256 MB of RAM. For running benchmarks like RUBiS, Hadoop, and SPEC, different VMs are created, since each benchmark requires a different runtime environment. For example, RUBiS runs well with JDK 1.4, while Hadoop requires JDK 1.5. Different VMs may run different application components or a combination of components, as explained later in this section. In addition, some of the benchmarks are augmented to provide feedback to the adaptation module regarding the current performance level. In the case of RUBiS, this is specified in terms of the number of requests satisfied per second. This feedback information is used to change the CPU and network settings for the VMs, and it is used to motivate the use of application-provided hints to better tune the resources allocated to applications and the VMs within which they execute. However, such application-level feedback might be available only in certain cases. To demonstrate the effectiveness of the ACT approach, we treat VMs running Hadoop and SPEC as black boxes and only monitor their CPU and network usage.

B. Benchmarks
The workloads used for the experimental analysis of the ACT resource management mechanisms are derived from a number of industry-standard benchmarks. The three primary workloads used are RUBiS (Rice University Bidding System), Hadoop, and the SPEC CPU benchmarks. RUBiS is a well-known enterprise-level benchmark that mimics the performance of a typical 3-tier enterprise application, and includes a Web Server, Application Servers, and Database Servers.
In our experiments, RUBiS is split into its three separate types of servers, and each server is housed in a separate VM. The RUBiS client request generator is housed in a separate VM as well. Hadoop is representative of applications used in cloud environments, as its data-intensive nature, using the MapReduce paradigm, exercises clouds' compute and network capacities. For the Hadoop benchmark, multiple VMs act as slaves, with a single VM as master. The wordcount test is used as a workload. In addition, we use the CPU-intensive H.264 video compression test of the SPEC 2006 benchmark suite within a VM as a representative CPU-bound workload. The Iperf tool is used to generate constant-rate TCP traffic to utilize a portion of the available network resources. Our experiments are based on test cases with multiple virtual machines running various mixes of the above benchmark components. In addition, we assign each VM or set of VMs (i.e., for RUBiS and Hadoop) a Class of Service that determines the amount of compute and network resources in the VM's PU. We use three classes of service, GOLD, SILVER, and BRONZE, to differentiate between VMs' resource requirements. The corresponding PUs have the following resource specifications: Gold VMs have an 80% CPU cap and a 200 Mbps network cap, Silver VMs have 60% CPU and 125 Mbps network caps, and Bronze VMs' caps are 40% and 75 Mbps, respectively.

VI. EXPERIMENTAL RESULTS
Results from the experimental evaluation of the ACT prototype are encouraging, demonstrating both suitable levels of reactivity and improved efficiency in resource usage.
A. ACT Performance
Figure 4 depicts ACT's ability to respond to changes in VM resource needs and to tune resource allocations accordingly. Across a range of sampling periods, we present the monitored (Mon) and allocated (Alloc) usage for CPU (left graph) and network (right graph) resources. The x-axis denotes sampling time points in 7-second intervals. We observe that ACT adapts almost simultaneously to changes in resource utilization, resulting in corresponding updates to the PU values. In addition, the graphs show that ACT continues to maintain 'wiggle room', as seen from the difference between the allocated and monitored values. G, S, and B in the figure denote Gold, Silver, and Bronze VMs, respectively.
Fig. 4. ACT Adaptation Performance for CPU and Network (y-axes: CPU % and Bandwidth (Mbps); x-axis: Sample Points)
Fig. 5. ACT Performance Comparison with Underprovisioning CPU and Network (left: CPU oversubscription 320:300; right: NW oversubscription 1250:1000)
Fig. 6. Application Performance in ACT environment with different tunable parameters (configurations: Static Alloc, ACT(0.3,0.2,0.3,7), ACT(0.5,0.2,0.3,7), ACT(0.7,0.2,0.3,7), ACT(0.3,0.4,0.3,7), ACT(0.3,0.2,0.3,10), ACT(0.3,0.2,0.3,15))
Figure 5 shows ACT's ability to allocate platform resources according to the VMs' CoS on overloaded platforms.
In the left graph in Figure 5, four sets of VMs, each set running one of the benchmark applications, are allocated 3 CPUs. The CoS specifications require an aggregate 320% of CPU while only 300% are available – hence, the CPU resource is underprovisioned. RUBiS and SPEC have a Gold CoS, Iperf is Silver, and Hadoop's master VM is Gold, while its slave VMs are marked as Bronze. The y-axis in these graphs shows normalized application performance, where the performance metrics for the different applications, in these as well as in later experiments, are:
• RUBiS: requests per second - higher is better.
• Hadoop: execution time - lower is better.
• Iperf: throughput - higher is better.
• Spec-h264ref: execution time - lower is better.
We observe that for the Gold VMs RUBiS and SPEC, performance remains almost the same, with only a 5% drop for the RUBiS VMs. Hadoop's performance decreases by 40%. In Hadoop, despite its Gold master VM, this decrease is due to its many Bronze slave VMs. The Silver Iperf VM's performance benefits from the reduction of CPU resources allocated to the Bronze Hadoop slaves, attaining an increase in its network bandwidth usage. This is in fact the desired effect we aim to achieve with ACT – to distribute aggregate platform resources based on the relative importance (CoS) of the platform workloads. Similar observations can be made in the right-hand graph in Figure 5. In this case, we modify the CoS resource specification to overload the network resource (1250 versus 1000 Mbps). The network requirement for Gold is increased to 300, Silver to 200, and Bronze to 125 Mbps. The performance of SPEC is not affected, as is expected due to its CPU-bound nature. The throughput of the Gold RUBiS VMs is only slightly affected. Hadoop execution time increases by about 75%, which is expected due to the Bronze CoS of its slaves. Finally, the Iperf bandwidth measurement shows a 20% increase, which is expected given its constant network load, for which ACT now allocates a higher level (200 vs. 125 Mbps) of maximum resources.

B. Application Performance
In order to evaluate the ability of the ACT system to dynamically tune platform resources to exact VM needs, we compare ACT to a statically provisioned system, where the resource requirements are known a priori and the operating conditions do not change at runtime. Figure 6 compares the normalized performance of applications in the statically provisioned system to several configurations of ACT with different values for the tunable MISDA-WR parameters. We use the notation ACT(ALPHA, BETA, GAMMA, SAMPLING INTERVAL) to denote measured application performance with the particular values of the tunable parameters of our system. The CoS levels of the VMs are set in the same manner as in the previous set of experiments. The graphs demonstrate several facts. First, the dynamic approach in ACT can deliver performance levels comparable to the statically provisioned system, in several configurations with no observable performance difference. Next, for workloads without significant variability in their resource utilization footprint, such as the SPEC workload, the selection of parameters has no impact on ACT's ability to make the appropriate allocation decisions. Third, the allocations for the higher-priority Gold RUBiS workload are more accurate, i.e., closer to the ideal 'Static Alloc' performance, compared to the Silver Iperf workload.
This demonstrates the utility of the Class of Service notion in prioritizing different VMs. The Bronze slave VMs result in the largest degradation of performance, for the Hadoop benchmark. The results demonstrate that ACT is equally effective in managing the platform resources with both black-box monitoring and external triggers.

C. Consolidation Opportunities
The next set of experiments demonstrates an important aspect of ACT, namely the opportunities for reduced resource utilization and consolidation it creates. In the case of the previous ACT measurements, substantial platform resources remain available for executing additional workloads, which is not the case for the statically provisioned platform. Figure 7 illustrates the resource utilization under a static allocation policy vs. that measured with ACT for the first algorithm configuration shown in Figure 6. We observe that with the ACT approach, on average, 50% less CPU and 63% less network bandwidth are required to run the same workload when compared to a statically provisioned environment. The small "wiggle-room" factor α does result in occasional service degradations for some of the Gold VMs; e.g., the SPEC performance is not impacted, while for the RUBiS VM service may degrade by up to 20%, depending on the choice of α. For choices of the algorithm parameters with larger "wiggle room", which do not result in any noticeable service degradation for Gold and Silver VMs, resource utilization is still reduced by up to 30% on average.

D. Monitoring Overhead
Monitoring overheads for different sampling time intervals and workload configurations are negligible, and they do not significantly impact ACT's ability to adequately allocate aggregate platform resources. For brevity, we do not include detailed data on these measurements.

VII. RELATED WORK
Our work is related to many efforts from the HPC/Grid and autonomic computing communities on managing shared infrastructures and datacenters [8], [16], [25], including monitoring and deployment considerations for mixes of batch and interactive VMs on shared cluster resources [26], [27], [23], as well as older work on cluster management and on the co-scheduling and deployment of cluster processes [20], [19]. Similarly, in the context of multi-core platforms, several recent efforts have focused on workload scheduling across platform cores [6], [13], [25]. In comparison, the results presented in this paper concern the coordinated management of multiple types of resources on individual multi-core nodes in such distributed virtualized infrastructures. Several research efforts by our own group as well as others focus on dynamic monitoring and analysis of the behavior of applications deployed on a single system or across distributed nodes [15], [28], [25]. The approach implemented in the current ACT prototype uses historic information regarding VM behavior to 'guess' its future resource requirements. It can easily be replaced with other mechanisms, such as those developed by these or other related efforts.

VIII. CONCLUSIONS
This paper describes the Active CoordinaTion (ACT) approach. ACT addresses a specific issue in the management domain, which is the fact that management actions must (1) typically touch upon multiple resources in order to be effective, and (2) must be continuously refined in order to deal with the dynamism in platform resource loads and application needs or behaviors.
ACT can perform active management using a black-box approach, which relies on continuous monitoring of the guest VMs' runtime behavior and on an adaptive resource allocation algorithm, termed Multiplicative Increase, Subtractive Decrease Algorithm with Wiggle Room. In addition, ACT permits explicit external events to trigger VM- or application-specific resource allocations, e.g., leveraging emerging standards such as WSDM. The experimental analysis of the ACT prototype, built for Xen-based platforms, uses industry-standard benchmarks, including RUBiS, Hadoop, and SPEC. It demonstrates ACT's ability to effectively manage aggregate platform resources according to the guest VMs' relative importance (Class-of-Service), for both the black-box and the VM-specific approach. Experimental results demonstrate ACT's ability (1) to respond quickly to changes in application resource requirements, with negligible overhead, (2) to distribute aggregate platform resources based on the relative importance (i.e., CoS) of platform workloads, and (3) to deliver substantial resource consolidation, with an up to 50% reduction in CPU utilization and a 63% reduction in required bandwidth, while maintaining limited degradation in VM performance.
Fig. 7. Improvements in available resources with ACT (total %CPU and total bandwidth (Mbps) for the workloads over sample points, static vs. dynamic allocation)

ACKNOWLEDGEMENT
We would like to specifically thank Xsigo Systems for their donation of the VP780 I/O Director used in our research, and in particular Kirk Wrigley, Eric Dube, and Ariel Cohen for their support and technical insights.

REFERENCES
[1] "The VMWare ESX Server," http://www.vmware.com/products/esx/.
[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization," in SOSP, 2003.
[3] M. Fair, C. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N. Modi, F. Freier, W. Fischer, and N. E. Weber, "Reliability, Availability, and Serviceability (RAS) of the IBM eServer z990," IBM Journal of Research and Development, 2004.
[4] "Amazon Elastic Compute Cloud (EC2)," aws.amazon.com/ec2.
[5] "Virtual Computing Lab," http://vcl.ncsu.edu/.
[6] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, "No Power Struggles: A Unified Multi-level Power Management Architecture for the Data Center," in ASPLOS, 2008.
[7] R. Nathuji and K. Schwan, "VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems," in SOSP, 2007.
[8] L. Grit, D. Irwin, A. Yumerefendi, and J. Chase, "Virtual Machine Hosting for Networked Clusters: Building the Foundations for 'Autonomic' Orchestration," in VTDC, 2006.
[9] H. Raj and K. Schwan, "High Performance and Scalable I/O Virtualization via Self-Virtualized Devices," in HPDC, 2007.
[10] F. Petrini, D. Kerbyson, and S. Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," in Supercomputing'03, 2003.
[11] "Virtualized Multi-Core Platforms Project," www.cercs.gatech.edu/projects/virtualization/virt.
[12] S. Kumar, V. Talwar, P. Ranganathan, R. Nathuji, and K. Schwan, "M-Channels and M-Brokers: Coordinated Management in Virtualized Systems," in MMCS (joint with HPDC), 2008.
[13] J. W. Strickland, V. W. Freeh, X. Ma, and S. S. Vazhkudai, "Governor: Autonomic Throttling for Aggressive Idle Resource Scavenging," in ICAC, 2005.
[14] "Web services architecture to manage distributed resources," www.oasis-open.org/committees/wsdm.
[15] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox, "Capturing, Indexing, Clustering and Retrieving System History," in SOSP, 2005.
[16] J. Xu, M. Zhao, M. Yousif, R. Carpenter, and J. Fortes, "On the Use of Fuzzy Modeling in Virtualized Data Center Management," in ICAC, 2007.
[17] "Intel Research Advances 'Era of Tera'," Intel News Release, www.intel.com/pressroom/archive/releases/20070204comp.htm.
[18] V. Kumar, Z. Cai, B. F. Cooper, G. Eisenhauer, K. Schwan, M. Mansour, B. Seshasayee, and P. Widener, "Implementing Diverse Messaging Models with Self-Managing Properties using iFLOW," in ICAC, 2006.
[19] M. Silberstein, D. Geiger, A. Schuster, and M. Livny, "Scheduling Mixed Workloads in Multi-grids: The Grid Execution Hierarchy," in HPDC, 2006.
[20] M. S. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, and J. E. Moreira, "Modeling and Analysis of Dynamic Coscheduling in Parallel and Distributed Environments," in SIGMETRICS, 2002.
[21] "Xen Credit Scheduler," wiki.xensource.com/xenwiki/CreditScheduler.
[22] "Xsigo Virtual I/O Overview," www.xsigo.com (whitepaper).
[23] E. Kalyvianaki and T. Charalambous, "On Dynamic Resource Provisioning for Consolidated Servers in Virtualized Data Centers," 2007.
[24] A. Ranadive, M. Kesavan, A. Gavrilovska, and K. Schwan, "Performance Implications of Virtualizing Multicore Cluster Machines," in HPCVirt, 2008.
[25] B. Urgaonkar and P. Shenoy, "Sharc: Managing CPU and Network Bandwidth in Shared Clusters," in IPDPS, 2004.
[26] B. Lin and P. Dinda, "VSched: Mixing Batch and Interactive Virtual Machines Using Periodic Real-time Scheduling," in Proceedings of ACM/IEEE SC 2005 (Supercomputing), 2005.
[27] P. Padala, K. G. Shin, X. Zhu, M. Uysal, et al., "Adaptive Control of Virtualized Resources in Utility Computing Environments," in SIGOPS, 2007.
[28] S. Agarwala and K. Schwan, "SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring," in ICDCS, 2006.