Active CoordinaTion (ACT) – Toward Effectively Managing Virtualized Multicore Clouds

Mukil Kesavan, Adit Ranadive, Ada Gavrilovska, Karsten Schwan
Center for Experimental Research in Computer Systems (CERCS)
Georgia Institute of Technology
Atlanta, Georgia USA
{mukil, adit262, ada, schwan}@cc.gatech.edu
Abstract—A key benefit of utility data centers and cloud
computing infrastructures is the level of consolidation they can
offer to arbitrary guest applications, and the substantial savings
in operational costs and resources that can be derived in the
process. However, significant challenges remain before it becomes
possible to effectively and at low cost manage virtualized systems,
particularly in the face of increasing complexity of individual
many-core platforms, and given the dynamic behaviors and
resource requirements exhibited by cloud guest VMs.
This paper describes the Active CoordinaTion (ACT) approach, aimed at addressing a specific issue in the management
domain: management actions must (1)
typically touch upon multiple resources in order to be effective,
and (2) be continuously refined in order to deal with
the dynamism in the platform resource loads. ACT relies on
the notion of Class-of-Service, associated with (sets of) guest
VMs, based on which it maps VMs onto Platform Units, the
latter encapsulating sets of platform resources of different types.
Using these abstractions, ACT can perform active management
in multiple ways, including a VM-specific approach and a black
box approach that relies on continuous monitoring of the guest
VMs’ runtime behavior and on an adaptive resource allocation
algorithm, termed Multiplicative Increase, Subtractive Decrease
Algorithm with Wiggle Room. In addition, ACT permits explicit
external events to trigger VM or application-specific resource
allocations, e.g., leveraging emerging standards such as WSDM.
The experimental analysis of the ACT prototype, built for Xen-based platforms, uses industry-standard benchmarks, including
RUBiS, Hadoop, and SPEC. It demonstrates ACT's ability to
efficiently manage the aggregate platform resources according to
the guest VMs' relative importance (Class-of-Service), for both
the black-box and the VM-specific approach.
I. INTRODUCTION
Virtualization technologies like VMWare's ESX server [1],
the Xen hypervisor [2], and IBM's longstanding mainframe-based systems [3] have gone beyond being prevalent
solutions for resource consolidation; they are also enabling
entirely new functionality in system management. Examples
include dealing with bursty application behaviors and providing new reliability or availability solutions for coping with
emergencies. Management is particularly important in Utility
Data Center or Cloud Computing systems, as with Amazon’s
Elastic Compute Cloud (EC2) [4], which use virtualization to
offer datacenter resources (e.g., clusters or blade servers) to
applications run by different customers and must therefore
be able to safely provide different kinds of services to diverse codes running on the same underlying hardware (e.g.,
time-sensitive trading systems jointly with high performance
software used for financial analysis and forecasting). Other
recent cloud computing infrastructures and their typical uses
are described in IBM's Blue Cloud announcement and/or by
the developers of the Virtual Computing Initiative [5].
1 This research is partially supported by NSF award No. 0702249, and
donations from Intel Corporation, Cisco Systems, and Xsigo Systems.
A promised benefit of utility datacenters or cloud computing
infrastructures is the level of consolidation they can offer
to arbitrary guest applications, packaging them as sets of
virtual machines (VMs) and providing such VMs just the
resources they need and only when they need them, rather than
over-provisioning underlying platforms for worst case loads.
This creates opportunities for reductions in the operational
costs of maintaining the physical machine and datacenter
infrastructures, as well as the costs of power consumption
associated with both [6], [7]. Additional savings can be derived by coupling consolidation with management automation,
including both facility and software management, such as
managing upgrade and patching processes and other elements
of the software lifecycle.
Significant challenges remain before it will be possible to
effectively and at low cost manage virtualized systems or more
generally, manage entire compute clouds and their potentially
rich collections of VM-based applications. Technical elements
of these challenges in cloud management range from efficient
monitoring and actuation, to effective algorithms and methods
that make and enact management decisions, to high level
decision making processes and policy considerations for such
processes [8].
The Active CoordinaTion (ACT) approach introduced in this
paper addresses a specific issue in the management domain: management actions must
typically touch upon multiple resources in order to be effective.
Examples include power management that must address CPUs,
memory, and devices; end-to-end performance management
that must consider CPU cycles and network bandwidths; and
many others. The approach must be ‘active’ in that it must
be able to react to changes in any one resource as it may
affect any number of VMs using it. More specifically, a change
in allocation of network bandwidth to a VM may necessitate
changing its CPU allocation, to avoid inappropriate (too short
or too long) message queues at the network interface. In
response, ACT has methods for tuning the use of each of the
resources required by a VM or set of VMs, using mechanisms
that change CPU allocations for a compute-intensive workload,
while additional methods are used to maintain its IO allocation
at low levels, or vice versa for workloads reaching an IO-intensive phase. Furthermore, such mechanisms and the associated allocation algorithms should be capable of exploiting
the multicore nature of future cloud nodes, but having multiple
applications’ VMs share a single platform can lead to breaches
of VM isolation, due to hardware effects like cache thrashing,
IO interference [9], memory bandwidth contention, and similar
issues [10]. ACT’s technical solutions will address this fact.
ACT’s resource management mechanisms presented in this
paper and developed in our group (see [11]) specifically
support active coordination for the runtime management of the
resources provided by a compute cloud to a target application.
In this paper, we focus on the active coordination of the communication and computational resources on multicore cloud
nodes. In other work, we are developing general management
infrastructures and methods for coordinated management [12]
in cloud or utility computing systems, first addressing power
management [7] and then considering reliability and lifecycle
management for the set of VMs comprising an application. In
both cases, our solutions are realized using specialized management components residing on each individual multicore
node within the cloud. These managers dynamically monitor,
assess, and reallocate the resources used by the node’s guest
VMs, in a manner that meets the VMs’ resource requirements
or expectations.
ACT requires information about the platform’s resources
and the applications using them. To characterize applications,
we draw on related work in power management [13] to
formulate the needs of each VM as a multi-dimensional vector,
termed Class of Service (CoS). The vector expresses the relative importance, priority, or resource expectations of a cloud’s
guest VM. Further, for scalability, the needs of a single VM
may be derived from a single CoS associated with a set of VMs
(e.g., an application comprised of uniform VMs or a uniform
set of VMs in an application). We note, however, that CoS
specifications will typically not be complete and sometimes
not even known, which requires us to continually refine the
CoS vector of a running application using online monitoring.
Such refinement can also leverage external CoS specifications
(e.g., using emerging standards like WSDM [14]).
ACT cannot allocate platform resources without information
about them provided by the firmware at machine boot time.
Using such information, ACT constructs a second abstraction
– the Platform Unit (PU) – which encapsulates a collection
of platform resources of multiple types. In our current implementation, these units encode both CPU utilization and IO
bandwidth. This abstract characterization of the underlying
physical hardware helps ACT to (1) efficiently reason about
the available platform resources, (2) perform allocations of
VMs to sets of platform components (i.e., cores and IO lanes)
instead of to single components, and (3) dynamically reconfigure current
resource allocations. More specifically, ACT uses these two
abstractions to dynamically monitor and manage the platform
resources and their utilization by individual VMs, and, based
on observed behavior, trigger appropriate adjustments.
ACT management can be performed in multiple ways. One
way solely uses a black box approach [15], where ACT uses
historic (i.e., observed) information regarding VM behavior
to ‘guess’ its future trends. For such black-box management,
the approach chosen in our work uses an algorithm termed
Multiplicative Increase, Subtractive Decrease Algorithm with
Wiggle-Room (MISDA-WR). The algorithm makes adjustments
in each of the resources represented in the platform unit.
Another way is for ACT to incorporate mechanisms that
permit external actuating events to trigger VM- or application-specific resource allocation decisions. Such external events
may be provided by the cloud administrator, may be available statically at VM creation time (e.g., the aforementioned
financial application may provide information requesting the
doubling of its resources on the 15th of each month), or
may be dynamically provided to the node’s manager by
the guest VM itself and the application executing within
it [12]. The latter option is relevant should the VM and/or
application include a management agent that knows how to
export management-related information regarding its resource
needs or QoS requirements. Given emerging standards like
WSDM, it is foreseeable that some VMs may incorporate such
functionality, but ACT does not require it from the guest VMs
it is managing.
This paper makes several technical contributions. First, we
develop the ACT approach for management of both communication and computational resources in virtualized cloud
environments. The same approach can be extended to also consider other types of resources. Second, we introduce the Platform Unit abstraction and develop platform-level management
mechanisms that efficiently and actively manage resources
on virtualized many-core platforms. The ACT approach is
suitable for black-box management, based on historic/observed
information and using the general notion of VMs' CoS (Class
of Service) with the MISDA-WR algorithm. In addition, ACT
allows for explicit external events or management requests, including those generated by VMs’ or applications’ management
agents, if available.
The experimental analysis of the ACT prototype, built
for Xen-based platforms, uses industry-standard benchmarks,
including RUBiS, Hadoop, and SPEC, and demonstrates ACT's
ability to efficiently manage the aggregate platform resources
according to the guest VMs' relative importance (Class-of-Service), for both the black-box and the VM-specific approach.
Specific results demonstrate ACT’s ability (1) to respond
quickly to changes in application resource requirements, while
resulting in only negligible overhead levels, (2) to distribute
the aggregate platform resources based on the relative importance (i.e., CoS) of the platform workloads, and (3) to deliver
substantial resource consolidation, up to 50% reduction in
CPU utilization and 63% less required bandwidth as compared
to static worst case resource allocation, while maintaining
acceptable VM performance levels.
Remainder of paper. The remainder of this paper is organized
as follows. Section II describes the high level view of the cloud
architecture and its use of the ACT management components.
The Class of Service and Platform Unit abstractions and
their use for representing VMs' resource requirements and
available platform resources are described in Section III.
Section IV discusses in greater detail the ACT architecture
and accompanying mechanisms, and their realization in our
prototype system. Description of the experimental setup and
discussion of the experimental results appear in Sections V
and VI. A brief discussion of related work and concluding
remarks appear in the remaining two sections.
II. HIGH LEVEL CLOUD ARCHITECTURE

Fig. 1. Cloud Management Architecture (a top-tier cloud manager coordinating per-node ACT managers)

Figure 1 describes the high-level view of a Cloud infrastructure using Active CoordinaTion (ACT). Similar to existing
management approaches for cluster servers, data centers or
virtualized environments [16], [8], a top tier manager performs
high-level admission control and deployment decisions. The
top tier cloud manager is responsible for deployment of VMs
or sets of VMs onto one or more many-core nodes. Per-node
resource management is performed by the ACT node manager,
which periodically exports to the higher-level cloud manager
its view of the available node resources.
The actual deployment of a VM or set of VMs on individual
node’s cores is performed by the ACT node manager. With
multi-core systems expected to continue to increase the number of cores in the future, and companies such as Intel building
80 core prototype systems [17], this approach achieves a
naturally supported hierarchy which will scale more easily to
clouds with future generation many-core systems.
For simplicity, the architecture presented in Figure 1 shows
a single-level hierarchy, where a single cloud manager is
responsible for all many-core resources in the cloud. This
solution can be further improved by extending it to include a
deeper hierarchy of managers, using clustering techniques such
as those developed for [18], [19], [20], or by replicating and/or
partitioning the cloud manager functionality and maintaining
some level of consistency among the high-level managers.
Since the overall cloud management is not the focus of the
current paper, the remainder of our discussion focuses on the
ACT management mechanisms for single many-core nodes
within the cloud.
III. SPECIFYING RESOURCE DESCRIPTIONS
The ability of both the top-level cloud manager and the
separate node managers to map VMs to underlying resources
(i.e., nodes, or cores within a node, respectively) requires
some notion of the type and amount of resources required
or expected by the cloud clients and their VMs. In addition,
it requires a representation of the underlying platform resources
onto which client resource requirements can easily be mapped.
Toward this end, we introduce the notion of Class of Service
(CoS), which describes VM resource needs, and the Platform
Unit (PU) abstraction, which describes a unit of platform resources.
Class of Service coarsely describes the expectations of
a cloud client with respect to the level of service it will
attain. The CoS designation may be based on monetary values,
i.e., fees that the client is being charged for using the cloud
services, or based on other criteria. For instance, VMs can be
classified as ‘gold’, ‘silver’, and ‘bronze’, similar to client classifications in enterprise systems. CoS relates to notions such
as Service Level Agreements (SLAs), but it is a lower-level
representation that directly specifies the platform resources,
i.e., CPU, IO, etc., expected by a guest VM. ACT uses a
multi-dimensional vector representation for the CoS, similar to
other work developed by our group [12]. The vector expresses
the relative importance, priority, or resource expectations of a
cloud’s guest VM. Since VMs are not expected to explicitly
specify their resource requirements, or often are unable to
do so given the aforementioned reasons, the ‘active’ CoS
of a VM, i.e., the resources allocated to a VM at runtime,
is dynamically tuned, based on the relative priority of the
VM, the current platform loads, and the runtime observations
regarding the VMs’ resource needs. Further detail on ACT’s
mechanisms for runtime resource management appear in the
following section.
The elementary resource unit on the multi-core platform,
onto which client VMs are deployed, is referred to as Platform Unit. Platform units (PUs) encapsulate a collection of
resources used in ACT to represent the resource sets allocated
to guest VMs as well as the available platform resources.
In our current prototype, we consider platform management
based on VMs' computation and IO requirements, so that a
PU is represented as a quantity of CPU and IO resources,
e.g., the equivalent of 60% CPU utilization together with a
corresponding amount of IO bandwidth. Additional resource
types could also be included, e.g., specialized hardware accelerators
such as graphics cards, or even the number of CPUs for deploying
workloads with concurrency requirements.
PUs may come in different ‘sizes’, i.e., may include different
levels of the resources they describe. While the range of PUs
supported on a single multicore node may be very diverse, for
practical reasons, we focus on a set of discrete PU values to
represent initial resource allocations to guest VMs. Namely,
depending on the VM’s CoS, ACT allocates a predefined
Platform Unit (i.e., a fixed amount of CPU and IO resources in
our prototype implementation). Subsequently, ACT’s resource
management algorithm will further tune the PU allocated to
a given VM based on its runtime behavior, as observed by
ACT’s monitoring component, or based on direct input from
the VM itself. These initial PU values express the maximum
resources that can be allocated to a VM based on its CoS. They
may correspond to the CoS classification and may similarly be
termed ‘gold’, ‘silver’ and ‘bronze’, or they may be defined
and matched to different CoSs in some other manner.
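As an illustration of this CoS-to-PU mapping, the following minimal sketch (in Python, with hypothetical class and field names) uses the Gold/Silver/Bronze caps later listed in Section V-B as the predefined maximum PUs; it is not ACT's actual implementation.

# Illustrative sketch (not ACT's code): mapping a CoS label to a predefined
# initial Platform Unit. The caps mirror the Gold/Silver/Bronze values of
# Section V-B; class and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class PlatformUnit:
    cpu_cap_pct: float    # CPU share, in percent of one core
    net_cap_mbps: float   # egress network bandwidth cap, in Mbps

# Predefined PUs acting as the maximum allocation for each CoS.
INITIAL_PU = {
    "GOLD":   PlatformUnit(cpu_cap_pct=80.0, net_cap_mbps=200.0),
    "SILVER": PlatformUnit(cpu_cap_pct=60.0, net_cap_mbps=125.0),
    "BRONZE": PlatformUnit(cpu_cap_pct=40.0, net_cap_mbps=75.0),
}

def initial_allocation(cos):
    """Return the predefined PU for a VM's CoS; MISDA-WR later refines it."""
    return INITIAL_PU[cos]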
Representing Resource State Information. The aggregate
platform resources are maintained in a global resource map
that describes the platform resources available on per-core
basis. In our case, CPU resources are maintained for each
core separately, whereas the network resources are maintained
as a single pool available to each of the VMs, given that all
VMs can uniformly access the node’s network devices.
Information regarding currently allocated resources is maintained in a list of per-VM PU allocations, ordered based on the
VM's CoS. This list is updated at each monitoring interval.
In addition, each entry contains information regarding the
identifiers of the physical resources last used by each VM, and
whenever possible, the VM is mapped to the same physical
resource components (i.e., cores), to derive cache affinity
benefits.
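A minimal sketch of this state representation, assuming hypothetical names, might look as follows; per-core CPU availability is tracked separately, the network is a single pool, and the per-VM list is re-ordered by CoS at each monitoring interval:

# Illustrative sketch (hypothetical names) of ACT's resource state as described
# above: per-core CPU availability, a single pooled network resource, and a
# per-VM allocation list kept ordered by CoS.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ResourceMap:
    free_cpu_per_core: Dict[int, float]   # core id -> free CPU (% of one core)
    free_net_mbps: float                  # single network pool shared by all VMs

@dataclass
class VMAllocation:
    vm_name: str
    cos: str                              # "GOLD" | "SILVER" | "BRONZE"
    pu: object                            # current PU (see the earlier sketch)
    last_cores: List[int] = field(default_factory=list)   # for cache affinity

COS_RANK = {"GOLD": 0, "SILVER": 1, "BRONZE": 2}

def order_by_cos(allocations: List[VMAllocation]) -> List[VMAllocation]:
    """Re-sort the per-VM allocation list by CoS at each monitoring interval."""
    return sorted(allocations, key=lambda a: COS_RANK[a.cos])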
Realizing Platform Units. The node-level implementation of
the computational (i.e., CPU) resource units can easily be supported via common credit-based schedulers at the VMM level,
such as the Xen credit-based scheduler [21]. The realization
of the IO component of the resource unit may be implemented
differently based on the IO virtualization solution supported
in the cloud environment.
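For illustration, the CPU portion of a PU could be enforced through the credit scheduler's cap, e.g., via the xm sched-credit command of the Xen 3.x toolstack (our prototype instead drives the same mechanism programmatically through libxc, as noted in Section IV-C); the sketch below is only indicative.

# Indicative sketch only: capping a domain's CPU consumption with the Xen
# credit scheduler via the Xen 3.x "xm sched-credit" command (the cap is a
# percentage of one physical CPU).
import subprocess

def set_cpu_cap(domain: str, cap_pct: int) -> None:
    subprocess.check_call(["xm", "sched-credit", "-d", domain, "-c", str(cap_pct)])

# Example: restrict a Bronze VM to 40% of a core.
# set_cpu_cap("bronze-vm-1", 40)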
Typically, for IO devices virtualized via driver domains, or
for drivers residing in a specialized control domain, such as
dom0 in Xen virtualization, all IO operations are performed
through the dedicated domain (or the VMM if devices are
deployed as part of the VMM). Therefore, the device domain
can easily be extended both to gather and export the current
IO utilization of individual VMs and to enforce resource
limitations on IO usage by each VM.
For IO devices supporting VMM-pass through, each VM
interacts with the IO device directly, thereby bypassing the
hypervisor layer or any specialized/dedicated domain. In such
cases, the realization of IO reservations has to rely on device
and/or fabric level functionality. For instance, interconnects
like InfiniBand (IB) provide support for controlling the rate
at which data is sent by way of Service Level to Virtual Lane (SL-VL) mappings and VLArbitration tables. Virtual Lanes (VL)
are a way of carving the link bandwidth into multiple shares. The
VLArbitration tables inform the Subnet Manager of how much
data can be sent per unit time. Each IB packet is associated
with a Service Level (SL), specifiable by the application, and
the SL-VL mappings determine the VL on which a packet is
sent. In theory, such hardware support allows for different
service levels (or different bandwidths) to be provided to
different applications.

Fig. 2. ACT Software Architecture (external triggers and CoS inputs feed the QoS-aware Adaptation module; the Resource Allocator maps PUs to physical resources via the Xen control interface; the Performance Monitor gathers per-VM and application statistics via XenStat above the Xen VMM and guest VMs)
Our concrete implementation is based on the per-VL bandwidth limitations supported in InfiniBand platforms, but instead of being based on direct manipulations of the VLArbitration tables and the Subnet Management state, it relies
on the functionality provided by a specialized IO switch,
Xsigo VP780 I/O Director [22]. The Xsigo switch internally
implements similar mechanisms for controlling the per lane
resource utilization as native IB (i.e., it limits the amount of
credits issued on per VL basis). It allows subnet management
and IO provisioning per individual virtual NIC similarly to
the native IB solution, but the VNICs are wrapped with
a standard Ethernet/IP layer and are exported to the guest
VMs as traditional Ethernet devices. The VMs access them
through the traditional split-driver approach and dom0 access
supported in Xen.
IV. ACT - MULTICORE NODE MANAGEMENT
A. Platform Level Components
Figure 2 illustrates the main components of the Active
CoordinaTion (ACT) software architecture, realized for Xen-based
virtualized platforms. Its operation relies on information gathered from chip- and device-level hardware performance counters, CPU and IO utilization information, etc., as well as
direct input from external administrators, or VM or application
management agents. The ACT components, included within
the dashed lines in this figure, may be part of the virtual
machine monitor (VMM) layer, or they may be deployed
within a designated control domain, such as dom0 in a Xen-based system.
The software components can be summarized as follows.
Monitoring: The monitoring module periodically gathers information regarding VMs’ resource utilization, which includes
CPU and network utilization in our prototype implementation.
VMs are treated as black-boxes, and the monitoring functionality is external to and independent from the VMs [15]. The
monitored state includes platform-wide hardware performance
counters, as well as per-VM statistics maintained by the virtual
machine monitor, represented as XenStat for our Xen-based
ACT implementation.
QoS-aware Adaptation Module: The QoS-aware adaptation
module implements the resource management policy for the
multicore platform. It encapsulates the active management
logic which, based on the current monitoring state and the
CoS of the workloads deployed on the node, triggers reconfigurations of the resource allocations of VMs’ PUs. These
reconfigurations include changes in resource allocation, such
as CPU and network bandwidth, scheduling parameters, migration of VMs across the cores on the many-core platform,
etc.
Since our current focus is on node-level mechanisms, we
do not consider issues related to VM migration across distinct
nodes in the cloud infrastructure. Instead, when a VM is
perceived to have insufficient resources, it is temporarily
blocked from execution. In reality, however, determining such
resource inadequacies will result in requests to the higher-level cloud managers for VM migration. The actual realization
of the QoS-aware Adaptation Module may include arbitrary
logic, ranging from lookup into rule sets, to control theory
feedback loops [23], statistical models, AI or machine learning
methods, etc. The management algorithm implemented in our
prototype implementation is described later in this section.
In addition to relying on state provided by the monitoring
component or the CoS specifications regarding guest VMs,
the Adaptation Module can respond to external events and
resource allocation requests. These may come in the form
of external triggers, i.e., specific rules provided by the cloud
administrator regarding regular and/or anticipated variations in
the platform resource allocation policy. An example of such
events may be rules stating that on the 15th of each month
the resources allocated to a set of guest VMs executing financial services related to commodity futures trading must be
increased by a fixed factor (e.g., beyond the multiplicative increase supported by the currently deployed algorithm).
A separate source of explicit external events may be provided
directly by the VMs or the applications they encapsulate,
for instance through management agents or APIs, similar to
those supported by standards such as WSDM. In the event
the VMs provide information regarding their perceived quality
of service, the adaptation module correlates that information
with the ACT resource allocation to both validate and tune
its behavior. This is necessary in scenarios where changes in
a VM's CPU or network allocations do not translate directly
into improvements in application-level quality requirements (i.e.,
SLAs). Note, however, that we do not make any requirements
for such agents to exist within guest VMs or for any additional
information regarding the VM’s resource requirements to be
provided to the ACT manager beyond the CoS. These external
events merely give cloud clients greater ability to influence the
resource allocations they receive, thereby increasing the ability
to better meet application quality requirements, while keeping
the overall resource usage (and therefore, client costs) at a
lower level.
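As a sketch of such an external trigger, the rule format and helper below are hypothetical and only illustrate how a date-based rule (like the 15th-of-the-month example above) could scale the PUs of a designated set of VMs:

# Hypothetical sketch of an external trigger rule, e.g., scale up a set of
# financial-services VMs on the 15th of each month. The rule format and the
# VM names are illustrative only; ACT does not mandate any particular format.
import datetime

EXTERNAL_RULES = [
    # (predicate over the current date, affected VM names, scaling factor)
    (lambda today: today.day == 15, {"fin-vm-1", "fin-vm-2"}, 2.0),
]

def apply_external_triggers(allocations, today=None):
    """Scale the PUs of VMs matched by an external rule."""
    today = today or datetime.date.today()
    for matches, vm_names, factor in EXTERNAL_RULES:
        if not matches(today):
            continue
        for alloc in allocations:
            if alloc.vm_name in vm_names:
                alloc.pu.cpu_cap_pct *= factor     # fields from the PU sketch above
                alloc.pu.net_cap_mbps *= factor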
Resource Allocator: The actual deployment of PUs to the underlying physical platform components (i.e., cores and vnics)
is performed by the Resource Allocator. This module relies
on the global resource map to determine the available platform resources, and if necessary, changes the PU-to-physical
resource deployments. Such changes may result in balancing
VMs across the available platform resources, or consolidating
existing VMs onto fewer physical components, in order to
accommodate additional VMs and their PU requests. The
mappings determined by the resource allocator are then passed
to the platform’s VMM; in our case, this is accomplished via
the Xen Control interface.
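The affinity preference used during such (re)deployments (discussed further in Section IV-B) can be sketched as follows; select_core() is a hypothetical helper, not part of ACT's code:

# Sketch of the affinity-preferring core selection: the cores a VM last ran on
# are checked for spare capacity before any others.
from typing import Dict, List, Optional

def select_core(free_cpu_per_core: Dict[int, float],
                last_cores: List[int],
                needed_cpu_pct: float) -> Optional[int]:
    candidates = last_cores + [c for c in free_cpu_per_core if c not in last_cores]
    for core in candidates:
        if free_cpu_per_core.get(core, 0.0) >= needed_cpu_pct:
            return core
    return None   # no capacity left: pause the VM or request migration upstream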
Not shown in Figure 2 is the state used by ACT managers,
which includes state updated by the monitoring module regarding the resource utilization of individual VMs, external rules
or per-VM external inputs, the platform-wide resource map
and the current platform units allocated to VMs and their deployment on the underlying physical resources. In addition, for
integration in the overall cloud management hierarchy, ACT
nodes include a coordination API, used for interactions with
upper-layer managers for exchange of monitoring updates,
new loads (i.e., VMs) and their CoS specifications, external
events, or for implementation of cloud-wide reconfiguration
mechanisms, including VM migrations and coordination with
peer ACT nodes.
B. ACT Resource Management Mechanisms
Monitoring. The monitoring module periodically gathers information regarding VM-level CPU and network usage. The
resource usage sampling frequency influences the rate at
which ACT initiates changes in CPU and network activity,
and determines the system’s adaptation ability. Given the
well understood trade-offs between monitoring accuracy and
monitoring overheads which depend on the frequency of
the monitoring operations, this frequency is a configurable
parameter. Considering the workloads targeted by our current
work and the dynamic behaviors they exhibit, we chose a
reasonably coarse-grained sampling interval of a few seconds
to curtail monitoring overhead. This is based both on experimental observation and on the assumption that most
enterprise applications have long run times, so that the time
taken by the system to adapt to changes is small compared
to the lifetime of the application. Our experimental analysis
demonstrates that successful adaptation can be achieved without
undue monitoring costs.
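A minimal sketch of this monitoring loop is shown below; sample_vm_usage() stands in for the libxenstat-based collection and is hypothetical, and the 7-second interval mirrors the setting used in our experiments:

# Minimal sketch of the periodic black-box monitoring loop.
import time

SAMPLING_INTERVAL_S = 7

def monitoring_loop(vms, sample_vm_usage, on_sample):
    """Collect per-VM CPU/network usage every interval and hand it to ACT."""
    while True:
        usage = {vm: sample_vm_usage(vm) for vm in vms}   # vm -> (cpu_pct, net_mbps)
        on_sample(usage)                                  # feeds the adaptation module
        time.sleep(SAMPLING_INTERVAL_S)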
Reconfiguration & QoS Adaptation. The QoS adaptation module determines the amounts of CPU and network bandwidth
allocated to each VM, based on its current and past resource usage
pattern. The prototype ACT system implements a QoS adaptation mechanism along the lines of Additive Increase Multiplicative Decrease (AIMD) TCP congestion control, called
Multiplicative Increase, Subtractive Decrease Allocation with
Wiggle Room (MISDA-WR). The initial resources allocated to
a VM are derived based on the platform unit corresponding
to the VM’s CoS. This is the maximum allocation that a VM
with a given CoS can receive. MISDA-WR then continues to
actively refine this allocation to reduce the wasted residual
resource capacity not being utilized by the application VM.
This is done in a linear subtractive decrease manner to slowly
lower the VM resource allocation by observing its resource
usage over a reasonable amount of time. In essence, if a
VM has constant resource utilization for a long period of
time, then it is safe to assume that it will remain in that
utilization state for a while. Note that we always slightly over-provision the VM's resources and provide some “wiggle room”
to account for sudden small increases in resource usage. When
the wiggle room is observed to be fully utilized at a
monitoring time point, this usually means that the application
is transitioning to a phase with increased CPU and network
requirements. ACT detects this condition and reacts to it
by multiplicatively increasing the degree of resource overprovisioning to accommodate an impending increase in the
rate of computation/communication, with the aim of meeting
the application level quality requirements. An increase in
the allocation of one resource, however, might result in an
increase in the usage of some other resource, as is the case for
network and CPU resources. In response, the algorithm uses
a tunable system parameter to account for this perturbation
by speculatively increasing the allocation of one resource by a
small amount based on an increase in the allocation of another
resource. The subtractive decrease mechanism later eliminates
this increase if it proves unnecessary.
It is always possible, of course, that the system might not
meet current application level quality requirements for short
periods of time, because of the reasonably coarse monitoring
granularity. As mentioned earlier, if this presents a problem,
the time to adapt can be controlled by varying the monitoring
time period to suit specific deployment scenarios.
The MISDA-WR algorithm is described in Figure 3. α, β,
and γ are configurable parameters that determine the rate at
which the algorithm adjusts the allocated resource levels, or
the manner in which changes in the allocation of one resource
(e.g., CPU) should influence adjustments in another resource
(e.g., network bandwidth). The figure also shows that, in addition, simple affinity-based scheduling (where the processor
on which a VM last ran is checked for available capacity ahead
of other processors) is used to prevent the thrashing effects of
repeatedly pinning VMs to different CPUs (avoiding the cold
cache problem to some extent).
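For concreteness, the following Python sketch shows one MISDA-WR update step for a single resource, following the description above and the formulas of Figure 3; variable names mirror the figure, while the exact step ordering and the final allocation update (expressed here relative to recent usage, as in the prose) are our interpretation rather than the prototype's exact code:

# Sketch of one MISDA-WR update for a single resource (our interpretation).
def misda_wr_step(prev_alloc, prev_usage, avg_roi_usage, max_res_alloc,
                  prev_wiggle, alpha, beta, gamma=0.0,
                  threshold_small_change=1.0, near_zero=0.5):
    # Wiggle room left unused in the previous interval.
    wiggle_used = prev_alloc - prev_usage

    # Lower bound on the wiggle room; gamma is non-zero only when another
    # resource's allocation has just been increased.
    min_wiggle = avg_roi_usage + (alpha + alpha * gamma) * (max_res_alloc - prev_usage)

    if wiggle_used <= near_zero:
        wiggle = prev_wiggle * 2                                    # multiplicative increase
    elif wiggle_used > threshold_small_change:
        wiggle = max(prev_wiggle - beta * prev_wiggle, min_wiggle)  # subtractive decrease
    else:
        wiggle = prev_wiggle

    new_alloc = prev_usage + wiggle      # slight over-provisioning above recent usage
    return new_alloc, wiggle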
C. Implementation Details
ACT’s monitoring and QoS adaptation modules are implemented as a user-level application in Dom0. The Xen statistics
collection library API (libxenstat) is used to collect the CPU and
network monitoring data of VMs, and the Xen control interface
API (libxc) is used to set the CPU bounds/placement for VMs.
In the prototype implementation all network resources (i.e.,
lanes and vnics) are equally accessible to all platform cores,
and are treated as a single resource pool. In addition, we only
manage the egress QoS rate for VMs. We also round off the
bandwidth allocation deduced by the algorithm to the nearest
multiple of 10 for the sake of simplicity of interfacing with
the Xsigo switch QoS abstractions. Network QoS is enforced
via remote execution of relevant commands on the admin
console of the Xsigo switch in our infrastructure. To reduce the
overhead associated with this remote execution in our critical
path, we batch QoS enforcement requests for all the running
VMs and execute them all at once.
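The rounding and batching steps can be sketched as follows; the remote command syntax shown is a placeholder, not the actual Xsigo CLI:

# Sketch of the network-QoS enforcement path: round each computed bandwidth to
# the nearest multiple of 10 (Mbps) and push all per-vnic settings in one
# batched remote invocation. The remote command format is a placeholder.
import subprocess

def round_bw(mbps):
    return int(round(mbps / 10.0) * 10)

def enforce_network_qos(vnic_bw, admin_host="xsigo-admin"):
    # One command per vnic, sent in a single remote session to keep the
    # remote-execution overhead off the per-sample critical path.
    cmds = ["set vnic %s qos egress %d" % (vnic, round_bw(bw))   # placeholder syntax
            for vnic, bw in vnic_bw.items()]
    subprocess.check_call(["ssh", admin_host, "; ".join(cmds)])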
At each sampling interval, admission control is performed in
case a VM’s resource requirement computed by the algorithm
cannot be satisfied by the available free resources in the
physical machine. In such cases, ideally, the VM should be
migrated to another node in the cluster that has the resources
to host it, but due to the limitations of our current testbed, we
emulate such migration by pausing the VM in question. When
more resources become available in the physical machine, the
paused VMs are started again, thereby roughly emulating a
new VM being migrated into the current physical machine
from elsewhere in the cluster.
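A sketch of this admission-control fallback, using the standard xm pause/unpause commands and hypothetical fits()/reserve() capacity helpers, is given below:

# Sketch of the admission-control fallback: pause a VM whose computed
# requirement cannot be met (emulating migration away), and unpause paused VMs
# once resources free up.
import subprocess

paused = set()

def admission_control(allocations, resource_map, fits, reserve):
    for alloc in allocations:                       # ordered by CoS, Gold first
        if fits(resource_map, alloc.pu):
            reserve(resource_map, alloc.pu)
            if alloc.vm_name in paused:
                subprocess.check_call(["xm", "unpause", alloc.vm_name])
                paused.discard(alloc.vm_name)
        elif alloc.vm_name not in paused:
            subprocess.check_call(["xm", "pause", alloc.vm_name])
            paused.add(alloc.vm_name)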
V. EVALUATION ENVIRONMENT
A. Testbed
The evaluation of the ACT prototype is performed on a
testbed consisting of 2 Dell 1950 PowerEdge servers, each
with 2 Quad-core 64-bit Xeon processors operating at 1.86
GHz. Each server is running the RHEL 4 Update 5 OS
(paravirtualized 2.6.18 kernel) in dom0 with the Xen 3.1
hypervisor. Each of the servers has a Mellanox MT25208
InfiniBand HCA. The HCAs are connected through a Xsigo
VP780 I/O Director [22] switch. The Xsigo switch is an
I/O virtualization solution that allows both Ethernet and Fibre
Channel traffic to be consolidated onto the InfiniBand fabric.
Current experiments use only the Ethernet capability. The
switch provides a hardware-based solution to control network
bandwidth for the “vnics” it exports. These vnics behave
similarly to virtual Ethernet or 'tap' devices in Linux, except
that underneath they use an InfiniBand rather than an Ethernet fabric.
A Xsigo driver module sits between the vnics and the kernel
InfiniBand modules which are responsible for mapping the
Ethernet packets to InfiniBand RDMA requests. In the Xen
virtualized environment, the network traffic flows between
the dom0 and domUs by the Ethernet split-driver model as
explained in [2]. The VMM-bypass virtualization solution for
InfiniBand discussed in our previous paper [24] is not used
for the current experimental setup. All communication among
VMs is carried via the vnic interfaces.
The Xsigo switch can be configured with multiple I/O
modules (e.g., with multiple 1GigE ports, 10GigE ports, or
Fibre Channel ports). Currently, our switch is outfitted
only with a GigE I/O module, which limits the total network
capacity to/from the cluster, despite the larger capability of the
internal IB fabric. In the future, we expect to both increase the
aggregate network capacity of the cluster and also include disk
IO in ACT’s resource management methods. A combination
of network processing elements within the Xsigo switch are
responsible for enforcing the QoS policy across separate vnics,
and can be dynamically reconfigured via the Xsigo control
APIs. The ACT components interact with the Xsigo switch to
enforce the reconfiguration of network resources.
RESOURCE-ALLOCATION (MISDA-WR)

1. Start with unbounded resource allocation initially for some
   time until sufficient historical data is gathered.

2. Calculate:
   wiggle-used[k-1] = res-alloc[k-1] - res-usage[k-1];
   min-wiggle[k] = (avg-roi-res[k-1]) +
                   (α + α*γ) * (max-res-alloc - res-usage[k-1]);

3. MULTIPLICATIVE INCREASE of WIGGLE ROOM:
   if (wiggle-used[k-1] almost 0) then
       wiggle[k] = wiggle[k-1] * 2;

   SUBTRACTIVE DECREASE of WIGGLE ROOM:
   if (wiggle-used[k-1] > THRESHOLD-SMALL-CHANGE) then
       wiggle-temp[k] = wiggle[k-1] - β*wiggle[k-1];
       wiggle[k] = max(wiggle-temp[k], min-wiggle[k]);

4. res-alloc[k] = res-alloc[k-1] + wiggle-room[k];

α - Used to vary the lower bound on resource allocation
β - Influences the rate at which the MISDA-WR resource allocation converges to the actual usage
γ - Tunable parameter that increases a resource's allocation by a small amount w.r.t. an increase in another resource's allocation

Fig. 3. Resource Allocation Algorithm
The guest VMs on the servers are running the RHEL 4
Update 5 OS, and each is allocated 256 MB of RAM. For
running benchmarks like RUBiS, Hadoop, and SPEC, different
VMs are created, since each benchmark requires a different
runtime environment. For example, RUBiS runs well with a
JDK1.4 while Hadoop requires JDK1.5. Different VMs may
run different application components or a combination of
components, as explained later in this section. In addition,
some of the benchmarks are augmented to provide feedback to
the adaptation module regarding the current performance level.
In the case of RUBiS, this is specified in terms of the number
of requests satisfied per second. This feedback information is
used to change the CPU and network settings for the VMs,
and it is used to motivate the use of application-provided hints
to better tune the resources allocated to applications and the
VMs within which they execute. However, such application-level feedback might be available only in certain cases. To
demonstrate the effectiveness of the ACT approach, we treat
VMs running Hadoop and SPEC as black-boxes and only
monitor their CPU and network usage.
B. Benchmarks
The workloads used for the experimental analysis of the
ACT resource management mechanisms are derived from a
number of industry-standard benchmarks. The three primary
workloads used are RUBiS (Rice University Bidding System),
Hadoop, and the SPEC CPU benchmarks. RUBiS is a well-known enterprise-level benchmark that mimics the performance of a typical 3-tier enterprise application, and includes
a Web Server, Application Servers, and Database Servers. In
our experiments, RUBiS is split into its three separate types
of servers, and each server is housed in a separate VM.
The RUBiS client request generator is housed in a separate
VM, as well. Hadoop is representative of applications used
in cloud environments, as its data-intensive nature exercises
clouds' compute and network capacities using the MapReduce paradigm. For the Hadoop benchmark, multiple VMs
act as slaves, with a single VM as master. The wordcount test
is used as a workload. In addition, we use the CPU intensive
H.264 video compression test of the SPEC 2006 benchmark
suite within a VM as a representative CPU-bound workload.
The Iperf tool is used to generate constant rate TCP traffic to
utilize a portion of the available network resources.
Our experiments are based on test cases with multiple virtual machines running various mixes of the above benchmark components. In addition, we assign each VM or set of VMs (i.e.,
for RUBiS and Hadoop) a Class of Service that determines the
amount of compute and network resources in the VM’s PU. We
use three classes of service: GOLD, SILVER and BRONZE,
to differentiate between VMs’ resource requirements. The
corresponding PUs have the following resource specifications:
Gold VMs have an 80% CPU cap and a 200Mbps network
cap, Silver VMs have 60% CPU and 125Mbps network caps,
and Bronze VMs' caps are 40% and 75Mbps, respectively.
VI. EXPERIMENTAL RESULTS
Results from the experimental evaluation of the ACT prototype are encouraging, demonstrating both suitable levels of
reactivity and improved efficiency in resource usage.
A. ACT Performance
Figure 4 depicts ACT’s ability to respond to changes in VM
resource needs and to tune resource allocations accordingly.
Across a range of sampling periods, we present the monitored
(Mon) and allocated (Alloc) usage for CPU (left graph) and
network (right graph) resources. The x-axis denotes sampling
time points in 7 second intervals. We observe that ACT adapts
almost immediately to changes in the resource utilization and
makes corresponding updates to the PU values. In addition,
the graphs show that ACT continues to maintain ‘wiggle
room’, as seen from the difference between the allocated and
monitored values. G, S, B, in the figure, denote Gold, Silver,
and Bronze VMs, respectively.

Fig. 4. ACT Adaptation Performance for CPU and Network

Fig. 5. ACT Performance Comparison with Underprovisioning CPU and Network (left: CPU oversubscribed 320:300, with pinning; right: network oversubscribed 1250:1000 Mbps)

Fig. 6. Application Performance in ACT environment with different tunable parameters (Static Alloc vs. ACT(α, β, γ, sampling interval) configurations)
Figures 5 show ACT’s ability to allocate platform resources
according to the VMs’ CoS in overloaded plaforms. In the
left graph in Figure 5 four sets of VMs, each set running
one of the benchmark applications, are allocated 3 CPUs. The
CoS requirements specify an aggregate of 320% CPU while
only 300% is available; hence, the CPU resource is underprovisioned.
The RUBiS and SPEC VMs have GOLD CoS, Iperf
is SILVER, and Hadoop's master VM is GOLD, while its
slave VMs are marked as BRONZE. The y-axis in these
graphs shows normalized application performance, where the
performance metrics for the different applications in these as
well as in later experiments are:
• RUBiS: Requests per second - higher is better.
• Hadoop: Execution time - lower is better.
• Iperf: Throughput - higher is better.
• Spec-h264ref: Execution time - lower is better.
We observe that for the Gold VMs RUBiS and SPEC,
performance remains almost the same, with only a 5% drop
for the RUBiS VMs. Hadoop’s performance decreases by
40%. In Hadoop, despite its Gold master VM, this decrease
is due to its many Bronze slave VMs. The SILVER Iperf
VM benefits at the cost of the CPU resources allocated to
the Bronze Hadoop slaves, attaining an increase in its network
bandwidth usage. This is in fact
the desired effect we aim to achieve with ACT – to distribute
aggregate platform resources based on the relative importance
(CoS) of the platform workloads.
Similar observations can be made in the right hand-side
graph in Figure 5. In this case, we modify the CoS resource
specification to overload the network resource (1250 versus
1000 Mbps). The network requirement for Gold is increased to
300, Silver to 200, and Bronze to 125Mbps. The performance
of SPEC is not affected, as expected due to its CPU-bound nature. The throughput of the Gold RUBiS VMs is only
slightly affected. Hadoop execution time increases by about
75%, which is expected due to the Bronze CoS of its slaves.
Finally, Iperf bandwidth measurement shows a 20% increase,
which would be expected given its constant network load, for
which ACT now allocates a higher level (200 vs. 125Mbps)
of maximum resources.
B. Application Performance
In order to evaluate the ability of the ACT system to
dynamically tune platform resources to exact VM needs, we
compare ACT to a statically provisioned system, where the
resource requirements are known a priori and the operating
conditions do not change at runtime. Figure 6 compares
the normalized performance of applications in the statically
provisioned system to several configurations of ACT with
different values for the tunable MISDA-WR parameters. We
use the notation ACT(ALPHA,BETA,GAMMA,SAMPLING
INTERVAL) to denote measured application performance with
the particular values of the tunable parameters of our system.
The CoS levels of the VMs are set in the same manner as in
the previous set of experiments.
The graphs demonstrate several facts. First, the dynamic
approach in ACT can deliver performance comparable to
the statically provisioned system, in several configurations with
no observable performance difference. Next, for workloads
without significant variability in their resource utilization footprint, such as the SPEC workload, the selection of parameters
has no impact on ACT's ability to make the appropriate allocation decisions. Third, the allocations for the higher priority
GOLD RUBiS workload are more accurate, i.e., closer to the
ideal ‘Static Alloc’ performance, compared to the SILVER
iperf workload. This demonstrates the utility of the Class
of Service notion in prioritizing different VMs. The bronze
slave VMs result in the largest degradation of performance
for the Hadoop benchmark. The results demonstrate that ACT
is equally effective in managing the platform resources with
both black-box monitoring and with external triggers.
C. Consolidation Opportunities
The next set of experiments demonstrates an important aspect
of ACT: the opportunities it creates for reduced resource utilization and consolidation. Namely, in the case of the
previous ACT measurements, substantial platform resources
remain available for executing additional workloads, which is
not the case for the statically provisioned platform. Figure 7
illustrates the resource utilization under a static allocation
policy vs. as measured with ACT for the first algorithm configuration shown in Figure 6. We observe that with the ACT
approach, on average, 50% less CPU and 63% less network
bandwidth are required to run the same workload when compared
to a statically provisioned environment. The small “wiggle-room” factor α does result in
occasional service degradations for some of the Gold VMs, e.g.,
the SPEC performance is not impacted, while for the RUBiS
VM service may degrade up to 20% depending on the choice
of α. For choices of the algorithm parameters with larger
“wiggle-room”, which do not result in any noticeable service
degradation for Gold and Silver VMs, resource utilization is
still reduced by up to 30% on average.

Fig. 7. Improvements in available resources with ACT (static vs. dynamic CPU and network bandwidth allocation for the workloads)
D. Monitoring Overhead
Monitoring overheads for different sampling time intervals
and workload configurations are negligible, and they do not
significantly impact ACT's ability to adequately allocate aggregate platform resources. For brevity, we do not include
detailed data on these measurements.
VII. RELATED WORK
Our work is related to many efforts from the HPC/Grid
or the autonomic computing community on managing shared
infrastructures and datacenters [8], [16], [25], including monitoring and deployment considerations for mixes of batch and
interactive VMs on shared cluster resources [26], [27], [23], as
well as older work on cluster management and the co-scheduling
and deployment of cluster processes [20], [19]. Similarly, in the
context of multi-core platforms, several recent efforts have
focused on workload scheduling across platform cores [6],
[13], [25]. In comparison, the results presented in this paper
concern the coordinated management of multiple types of
resources on individual multi-core nodes in such distributed
virtualized infrastructures.
Several research efforts by our own group as well as others
focus on dynamic monitoring and analysis of the behavior of
applications deployed on a single system or across distributed
nodes [15], [28], [25]. The approach implemented in the
current ACT prototype uses historic information regarding
VM behavior to ‘guess’ its future resource requirements. It
can easily be replaced with other mechanisms, such as those
developed by these or other related efforts.
VIII. CONCLUSIONS
This paper describes the Active CoordinaTion (ACT) approach. ACT addresses a specific issue in the management
domain, which is the fact that the management actions must (1)
typically touch upon multiple resources in order to be effective,
and (2) must be continuously refined in order to deal with
the dynamism in the platform resource loads and application
needs or behaviors. ACT can perform active management
using a black-box approach, which relies on the continuous
monitoring of the guest VMs’ runtime behavior, and an
adaptive resource allocation algorithm, termed Multiplicative
Increase, Subtractive Decrease Algorithm with Wiggle Room.
In addition, ACT permits explicit external events to trigger VM
or application-specific resource allocations, e.g., leveraging
emerging standards such as WSDM.
The experimental analysis of the ACT prototype, built
for Xen-based platforms, uses industry-standard benchmarks,
including RUBiS, Hadoop, and SPEC. It demonstrates ACT’s
ability to effectively manage aggregate platform resources
according to the guest VMs’ relative importance (Class-ofService), for both the black-box and the VM-specific approach.
Experimental results demonstrate ACT’s ability (1) to respond quickly to changes in application resource requirements,
with negligible overhead, (2) to distribute aggregate platform
resources based on the relative importance (i.e., CoS) of
platform workloads, and (3) to deliver substantial resource
consolidation, with an up to 50% reduction in CPU utilization
and 63% reduction in required bandwidth, while maintaining
limited degradation in VM performance.
ACKNOWLEDGEMENT
We would like to specifically thank Xsigo Systems for their
donation of the VP780 I/O Director used in our research, and,
in particular, Kirk Wrigley, Eric Dube, and Ariel Cohen for
their support and technical insights.
REFERENCES
[1] “The VMWare ESX Server,” http://www.vmware.com/products/esx/.
[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of Virtualization,” in
SOSP 2003, 2003.
[3] M. Fair, C. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C.
Alves, I. N. Modi, F. Freier, W. Fischer, and N. E. Weber, “Reliability,
Availability, and Serviceability (RAS) of the IBM eServer z990,” IBM
Journal of Research and Development, 2004.
[4] “Amazon Elastic Compute Cloud (EC2),” aws.amazon.com/ec2.
[5] “Virtual Computing Lab,” http://vcl.ncsu.edu/.
[6] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, “No
Power Struggles: A Unified Multi-level Power Management Architecture
for the Data Center,” in ASPLOS, 2008.
[7] R. Nathuji and K. Schwan, “VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems,” in SOSP, 2007.
[8] L. Grit, D. Irwin, A. Yumerefendi, and J. Chase, “Virtual Machine Hosting for Networked Clusters: Building the Foundations for ”Autonomic”
Orchestration,” in VTDC, 2006.
[9] H. Raj and K. Schwan, “High performance and scalable i/o virtualization
via self-virtualized devices,” in HPDC, 2007.
[10] F. Petrini, D. Kerbyson, and S. Pakin, “The Case of the Missing
Supercomputer Performance: Achieving Optimal Performance on the
8,192 Processors of ASCI Q,” in Supercomputing’03, 2003.
[11] “Virtualized Multi-Core Platforms Project,” www.cercs.gatech.edu/projects/virtualization/virt.
[12] S. Kumar, V. Talwar, P. Ranganathan, R. Nathuji, and K. Schwan,
“M-Channels and M-Brokers: Coordinated Management in Virtualized
Systems,” in MMCS, joint with HPDC, 2008.
[13] J. W. Strickland, V. W. Freeh, X. Ma, and S. S. Vazhkudai, “Governor:
Autonomic Throttling for Aggressive Idle Resource Scavenging,” in
ICAC, 2005.
[14] “Web services architecture to manage distributed resources,” www.oasis-open.org/committees/wsdm.
[15] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox,
“Capturing, indexing, clustering and retrieving system history,” in SOSP,
2005.
[16] J. Xu, M. Zhao, M. Yousif, R. Carpenter, and J. Fortes, “On the Use
of Fuzzy Modeling in Virtualized Data Center Management,” in ICAC,
2007.
[17] “Intel Research Advances ‘Era of Tera’,” Intel News Release,
www.intel.com/pressroom/archive/releases/20070204comp.htm.
[18] V. Kumar, Z. Cai, B. F. Cooper, G. Eisenhauer, K. Schwan, M. Mansour,
B. Seshasayee, and P. Widener, “Implementing Diverse Messaging
Models with Self-Managing Properties using iFLOW,” in ICAC, 2006.
[19] M. Silberstein, D. Geiger, A. Schuster, and M. Livny, “Scheduling Mixed
Workloads in Multi-grids: The Grid Execution Hierarchy,” in HPDC,
2006.
[20] M. S. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke,
and J. E. Moreira, “Modeling and analysis of dynamic coscheduling in
parallel and distributed environments,” in SIGMETRICS, 2002.
[21] “Xen Credit Scheduler,” wiki.xensource.com/xenwiki/CreditScheduler.
[22] “Xsigo Virtual I/O Overview,” www.xsigo.com (whitepaper).
[23] E. Kalyvianaki and T. Charalambous, “On Dynamic Resource Provisioning for Consolidated Servers in Virtualized Data Centers,” 2007.
[24] A. Ranadive, M. Kesavan, A. Gavrilovska, and K. Schwan, “Performance Implications of Virtualizing Multicore Cluster Machines,” in
HPCVirt, 2008.
[25] B. Urgaonkar and P. Shenoy, “Sharc: Managing CPU and Network
Bandwidth in Shared Clusters,” in IPDPS, 2004.
[26] B. Lin and P. Dinda, “VSched: Mixing Batch and Interactive Virtual
Machines Using Periodic Real-time Scheduling,” in Proceedings of
ACM/IEEE SC 2005 (Supercomputing), 2005.
[27] P. Padala, K. G. Shin, X. Zhu, M. Uysal, et al., “Adaptive control
of virtualized resources in utility computing environments,” in SIGOPS,
2007.
[28] S. Agarwala and K. Schwan, “SysProf: Online Distributed Behavior
Diagnosis through Fine-grain System Monitoring,” in ICDCS, 2006.