Distributed Resource Exchange: Virtualized Resource Management for SR-IOV InfiniBand Clusters

Adit Ranadive, Ada Gavrilovska, Karsten Schwan
Center for Experimental Research in Computer Systems (CERCS)
Georgia Institute of Technology, Atlanta, Georgia
{adit262, ada, schwan}@cc.gatech.edu
Abstract—The commoditization of high performance interconnects, like 40+ Gbps InfiniBand, and the emergence of low-overhead I/O virtualization solutions based on SR-IOV, are enabling the proliferation of such fabrics in virtualized datacenters and cloud computing platforms. As a result, such platforms are better equipped to execute workloads with diverse I/O requirements, ranging from throughput-intensive applications, such as ‘big data’ analytics, to latency-sensitive applications, such as online applications with strict response-time guarantees. Improvements are also seen for the virtualization infrastructures used in datacenter settings, where high virtualized I/O performance supported by high-end fabrics enables more applications to be configured and deployed in multiple VMs – VM ensembles (VMEs) – distributed and communicating across multiple datacenter nodes. A challenge for I/O-intensive VM ensembles is the efficient management of the virtualized I/O and compute resources they share with other consolidated applications, particularly in light of VME-level SLA requirements like those pertaining to low or predictable end-to-end latencies for applications comprised of sets of interacting services.
This paper addresses this challenge by presenting a management solution able to consider such SLA requirements, by supporting diverse SLA-aware policies, such as those maintaining bounded SLA guarantees for all VMEs, or those that minimize the impact of misbehaving VMEs. The management solution, termed Distributed Resource Exchange (DRX), borrows techniques from principles of microeconomics, and uses online resource pricing methods to provide mechanisms for such distributed and coordinated resource management. DRX and its mechanisms allow policies to be deployed on such a cluster in order to provide SLA guarantees to some applications by charging all the interfering VMEs ‘equally’ or based on the ‘hurt’, i.e., the amount of I/O performed by the VMEs. While these mechanisms are general, our implementation specifically targets SR-IOV-based fabrics like InfiniBand and the KVM hypervisor.
Our experimental evaluation uses workloads representative of data-analytics, transactional, and parallel benchmarks. The results demonstrate the feasibility of DRX and its utility in maintaining SLAs for transactional applications. We also show that the impact on the interfering workloads remains within acceptable bounds for certain policies.
I. INTRODUCTION
Current datacenter workloads exhibit diverse resource and performance requirements, ranging from communication- and I/O-intensive applications like parallel HPC tasks, to throughput-sensitive MapReduce-based applications and multi-tier enterprise codes, to latency-sensitive applications like transaction processing, financial trading [28], or VoIP services [17].
Despite the increased popularity of virtualization and the emergence of cloud computing platforms, some of these classes of applications continue to run on non-virtualized, dedicated infrastructures, to reduce overheads and
avoid potential interference effects due to resource sharing.
This is particularly true for these applications’ I/O needs,
since, unlike existing hardware-supported methods for CPU
and memory resources, prevalent I/O devices in current cluster
and datacenter installations continue to introduce substantial
overheads in their shared and virtualized use.
Concerning I/O, the commoditization of high-end fabrics like 40+Gbps InfiniBand and Ethernet, and hardware-level improvements for I/O virtualization like Single Root I/O Virtualization (SR-IOV) [18], are addressing some of these I/O-related challenges, and are further expanding the class of applications able to benefit from virtualization technologies like Xen, Microsoft Hyper-V, and VMware. Although these technology advances provide high levels of aggregate I/O capacity and low-overhead I/O operations, a remaining challenge is the ability to consolidate the above-mentioned highly diverse workloads across multiple shared, virtualized platforms. This is because current systems lack the methods for fine-grained I/O provisioning and isolation needed to control potential interference and noise phenomena [19], [23], [26]. Specifically, while SR-IOV-based devices provide low-overhead device access to multiple VMs consolidated on a single platform, this hardware-supported device resource partitioning is insufficient for providing performance isolation. In fact, we demonstrate that for I/O-intensive applications running across consolidated SR-IOV devices, there remain serious issues with performance variability and lack of isolation. We see evidence of this variability in Figure 1 for a latency-sensitive transactional benchmark when it is running alone versus consolidated with a throughput-intensive one.
Recent efforts, including our own, have addressed these issues on individual virtualized nodes [6], [8], [10], [25], but effective solutions that span multi-node virtualized infrastructures and distributed, multi-VM applications remain unavailable. This is because there are additional challenges with workloads deployed as distributed VM Ensembles (VMEs), which include (i) the timely detection and management of I/O-related interference effects, (ii) in ways that consider all relevant VME components, and (iii) taking into account all of the physical resources and nodes being used. This is particularly the case for environments with high-end fabrics running I/O-intensive workloads, where delays in diagnosing and managing resource congestion and the resulting interference effects have a significant impact on performance degradation [29]. Stated technically, the effectiveness and timeliness of the performance and isolation management operations concerning the I/O use of distributed VM ensembles require coordinated resource management actions across the entire set of relevant distributed platform resources.
To achieve this goal, this paper proposes a resource management framework – the Distributed Resource Exchange (DRX) – for managing the performance interference effects seen by distributed workloads deployed in shared virtualized environments. DRX borrows ideas from microeconomic principles for managing the supply and demand of commodities, by managing virtualized clusters as an exchange where resource allocations are controlled via continuous accounting, charging, and dynamic price adjustment methods. DRX provides the basic mechanisms for the allocation, accounting, charging, and pricing operations that are needed to support a range of resource management policies. Furthermore, these operations are performed with consideration of entire VM ensembles and their resource demands, and take into account inter-ensemble interference effects, thereby improving the efficacy of DRX management processes. Although the ideas used in the DRX design are general, the distributed, coordinated resource allocation actions it enables are particularly important for virtualized clusters with high-end fabrics, where the bandwidth and latency properties of the interconnect make them suitable platforms for the shared deployment of both I/O- and communication-intensive workloads, and where, precisely because of the I/O-sensitive nature of some of these distributed workloads, the need for low-overhead, effective management actions is more pronounced.
The specific contributions made by this research include the following. (1) We present the design of the DRX framework and its implementation for cluster servers interconnected with InfiniBand SR-IOV fabrics and virtualized with the KVM hypervisor. (2) DRX integrates mechanisms for low-overhead accounting of resource usage, usage-based charging, and dynamic resource price adjustment, which make it possible to realize diverse resource management policies. (3) The importance of these mechanisms is illustrated through the implementation of two concrete policies: (i) an Equal-Blame (EB) policy, which, under increased demand, equally limits workloads’ resource allocations, e.g., through platform-wide price configuration, and (ii) a Hurt-Based (HB) policy, where price adjustments are made in a manner proportional to the ‘hurt’ being caused, i.e., the amount of I/O generated by the VMEs. (4) The realization of DRX leverages our own prior work, which developed memory-introspection-based techniques for lightweight accounting, i.e., monitoring of the use of I/O resources in InfiniBand- (and similarly) connected platforms [24], and mechanisms for managing performance interference on single-node platforms through the use of appropriate charging methods [25]. (5) Evaluations use representative application benchmarks corresponding to transactional, data-analytics, and parallel workloads. The results indicate the importance and efficacy of our distributed management solution and make current SR-IOV-based platforms more feasible for SLA-driven workloads.
Fig. 1: Distribution of Request Latencies (request counts vs. request service times in µs) for a Non-Interfered vs. an Interfered Financial Application.
The remainder of the paper is organized as follows. Section II motivates the need for DRX resource management. It also provides background on the PCI passthrough and SR-IOV InfiniBand technologies used in modern virtualization
infrastructures and assumed by DRX methods. In Section III,
we introduce and explain the design of Distributed Resource
Exchange (DRX). We describe the DRX mechanisms and how
they interact with each other in Section IV. In Section V we
describe two policies that use these mechanisms. Section VI describes our experimental methodology and measurement
results. Related work is surveyed in Section VII, followed by
conclusions and future work in Section VIII.
II. BACKGROUND
DRX targets virtualized clusters with high-end fabrics for which Single Root I/O Virtualization (SR-IOV) enables low-overhead I/O operations for the hosted guest VMs. Although SR-IOV creates and provides VMs with access to physical device partitions, it does not provide the fine-grain control needed for performance isolation. This is illustrated with the results shown in Figure 1, where two collocated applications run instances of the Nectere benchmark developed in our own work [11], each with different I/O requirements. The results demonstrate the performance degradation experienced by one of the workloads, despite the use of SR-IOV-enabled devices.
An additional challenge with SR-IOV devices like the InfiniBand adapters used in our work is that, by providing direct access to a subset of physical device resources, SR-IOV techniques also make it difficult to monitor and account for the VMs’ I/O usage, and to insert the fine-grained controls needed to manage the I/O resource (i.e., bandwidth) allocation made to VMs. We leverage our prior work on using memory-introspection techniques to estimate the VMs’ use of IB resources [24], and on using CPU capping as a method to gauge the VMs’ use of I/O resources, thereby also limiting the amount of I/O they can perform and indirectly affecting their I/O allocation.
We next present more detail on the key enabling technologies that drive the design and implementation of DRX.
PCI Passthrough. PCI passthrough allows PCI devices (SR-IOV-capable or standard) to be directly accessible from guest VMs, without the involvement of the hypervisor or host OS, but requires Intel’s VT-d [1] or AMD’s IOMMU [2] extensions for correct address translation from guest physical addresses to machine physical addresses [4]. The hypervisor (e.g., KVM or Xen) is responsible for assigning the PCI device (specified for passthrough) to the guest’s PCI bus and removing it from the management domain’s PCI bus list – i.e., the device is under the full control of the guest domain. While this provides guests with near-native virtualized I/O performance, by bypassing the management domain, i.e., the hypervisor, it becomes challenging to monitor and manage the guest’s I/O behavior.
Single Root I/O Virtualization (SR-IOV) InfiniBand. With
SR-IOV [7], the physical device interface (i.e., the device Physical Function (PF)) and associated resources are ‘partitioned’
and exposed as Virtual Functions (VFs). One or more VFs are
then allocated to guest VMs in a manner that leverages PCI
passthrough functionality.
The Mellanox ConnectX-2 InfiniBand SR-IOV devices used in our work provide SR-IOV support by dividing the available physical resources, i.e., queue pairs (QPs), completion queues (CQs), memory regions (MRs), etc., among VFs and exposing each such subset of resources as a VF. The PF driver running in the management domain is responsible for creating the configured number of VFs (16 in our case). Each of these VFs is assigned to a guest using PCI passthrough. A
Mellanox VF driver residing in each guest is responsible for
device configuration and management. All natively supported
IB transports – RDMA, IPoIB or SDP – are also supported by
the VF driver.
IBMon. To monitor the VMs’ usage of IB, we use a tool called IBMon, developed in our prior research [24]. IBMon asynchronously tracks the VMs’ IB usage via memory introspection of the guest memory pages used by their internal IB (i.e., OFED) stack. In this manner, IBMon gathers information concerning each VM’s QPs, including application-level parameters like the buffer size, the WQE index (to track completed CQEs), and the QP number (which uniquely identifies a VM-to-VM communication). These are used to more accurately depict application IB usage.
KVM Memory Introspection. KVM provides support for the libvirt function virDomainMemoryPeek. The function reads the guest memory at the given guest physical addresses and returns its contents to the caller. IBMon uses this memory snapshot to interpret CQE and QP information.
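For illustration only (this is not the IBMon source), the following Python sketch shows how a monitor could read a page of guest-physical memory through libvirt’s memory-peek interface; the guest name, address, and data layout are hypothetical.

```python
# Hedged sketch: sampling guest-physical memory from a running KVM domain via
# libvirt's memory-peek API. Guest name, address, and data layout are illustrative.
import struct
import libvirt

PAGE = 4096

def peek_guest_page(dom, gpa):
    """Read one page of guest-physical memory from a running KVM domain."""
    # VIR_MEMORY_PHYSICAL selects guest-physical addressing (requires the QEMU driver).
    return dom.memoryPeek(gpa, PAGE, libvirt.VIR_MEMORY_PHYSICAL)

if __name__ == "__main__":
    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("guest-vm-1")      # hypothetical guest name
    cq_page_gpa = 0x7f000000                   # hypothetical CQ buffer address
    raw = peek_guest_page(dom, cq_page_gpa)
    # Interpret the first 8 bytes as a little-endian counter, purely for illustration;
    # real CQE layouts are defined by the guest's OFED/mlx4 driver structures.
    (counter,) = struct.unpack_from("<Q", raw, 0)
    print("sampled value:", counter)
```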
ResourceExchange Model. We abstract the way in which VMs use QPs and CQs by describing the resources assigned to and used by VMs in terms of ‘Resos’, explained in detail in [25], where the Resos allocated to each VM represent its permissible use of some physical resource. Given Resos of different types, it is then possible to charge VMs for their resource usage based on some provider-level policy, where charging can be based on microeconomic theories and their application to resource management [14], [21].
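As a minimal sketch of the Reso abstraction (simplified from [25]; the class and field names are ours, for illustration), a per-VM account could be debited for resource use at the current price:

```python
# Minimal sketch of the Reso abstraction: each VM holds a Reso balance that is
# debited as it consumes resources at the current price. Names are illustrative.
from dataclasses import dataclass

@dataclass
class ResoAccount:
    vm_name: str
    balance: float          # Resos left in the current epoch

    def charge(self, usage_units: float, price_per_unit: float) -> float:
        """Debit the account for `usage_units` of a resource (e.g., MTUs of I/O)
        at the current price; returns the remaining balance (floored at 0)."""
        self.balance = max(0.0, self.balance - usage_units * price_per_unit)
        return self.balance

acct = ResoAccount("vm-hadoop-3", balance=1000.0)
acct.charge(usage_units=250.0, price_per_unit=1.2)   # e.g., 250 MTUs at 1.2 Resos/MTU
print(acct.balance)                                   # 700.0
```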
III. OVERVIEW OF THE DISTRIBUTED RESOURCE EXCHANGE
The DRX architecture illustrated in Figure 2 is a multi-level
software structure spanning the machines and VM ensembles
using them.
Fig. 2: Distributed Resource Exchange Model.
Its top-tier Platform Manager (PM) is responsible for enforcing cluster-wide policies and driving resource
allocation actions for one or more VM Ensembles (VMEs).
Each VME consists of multiple VMs corresponding to a single
cloud tenant – e.g., a single application, such as a multi-tier
enterprise workload, or a distributed MPI or MapReduce job.
For each VME, an Ensemble Manager (EM) monitors the
VME’s resource and performance needs, and drives necessary
interactions with the DRX management layer, such as to report
SLA violations.
EMs interact directly with the DRX Host Agent (HA) components deployed on each host. The HAs maintain accounting
information for the host’s VMs’ resource usage, and manage
(reduce or increase) resource allocations based on the specific
pricing and charging policy established by the PM.
The resource management provided by DRX behaves like
an exchange. Participants (VMEs and their VMs) are allocated
credits called Resos, which in turn determine the resource
allocations made to each VM by the corresponding Host
Agent. DRX includes mechanisms for allocation, accounting,
and charging, performed by each HA in order to enforce
a given resource allocation policy. I/O congestion and the
resulting performance degradation are managed through dynamic resource pricing. Depending on resource supply and
demand (e.g., considering factors such as current price, number
of VMs and VMEs, amount of I/O usage, etc.), and in response to performance interference events, the PM determines
adjustments in the Resos price that a VM/VME should be
charged for I/O resource consumption. Finally, to deal with
dynamism in the workload requirements, DRX uses an epoch-based approach: Resos allocations are determined and renewed
at the start of an epoch, based on overall supply and demand,
per-VM or VME accounting information, etc.; the resource
price, along with the charging function, determine the rate at
which the workload (some VM or component) will be allowed
to consume I/O resources. These mechanisms allow us to treat
physical resources like commodities which can then be bought
or sold from an ‘exchange’. Further, we can use economics-based schemes to control the commodity supply and demand,
thereby affecting resource utilization, and the consequent VM
performance and interference effects. We describe these in
detail in Section IV.
The current DRX implementation divides Resos equally
among all VMs in a VME (additional policies can be supported
easily). Given our focus on data-intensive applications and the
abundance of CPU resources in multicore servers, the current
HA implementation charges its VMs only for their I/O usage,
for the duration of the epoch. I/O usage is obtained from IBMon, which samples each VM’s I/O queues to estimate its current I/O demand. Controlling I/O usage, however, is not easily
done in SR-IOV environments: (1) the VM-device interactions
bypass the hypervisor and prevent its direct intervention; and
(2) SR-IOV IB devices carry out I/O via asynchronous DMA
operations directly to/from application memory. Software that
’wraps’ device calls with additional controls would negate
the low-overhead SR-IOV bypass solution. The current HA,
therefore, relies on the relatively crude method of CPU capping
to indirectly control the I/O allocations available to the VM.
Our prior work prototyped this method for paravirtualized IB
devices [25]. With DRX, we have extended it to SR-IOV
devices.
The DRX infrastructure can be used for many purposes,
including tracking, charging, and usage control. Key to this
paper is its use for ensuring isolation for I/O-intensive distributed datacenter applications. Specifically, isolation must
be provided for a set of VMEs co-running on a cluster of
machines. This requires monitoring for and tracking distributed
interference across VMEs caused by their I/O activities. Such
interference occurs when VMEs share physical links, using
them in ways that cause one VME’s actions to affect the
performance of another. A more formal statement of distributed interference considers VMs communicating across physical links affected by other VMEs, using what we term the Distributed Causal Congestion Relationship (DCCR). For a VMi that experiences reduced performance due to interference, the set of responsible VMEs is:
DCCR(i) = { VMEj | ∃ VMk : VMk ∈ VMEj ∧ VMk ∈ P(VMi) }    (1)
The equation identifies the VME or set of VMEs that is affecting the performance of VMi. Note that those VMEs also contain VMs that are on the same physical machine as VMi, the latter denoted by the function P. Knowledge of this, potentially large, set of ‘culprit’ VMs is the basis on which DRX manages interference. The next step is to identify those VMs in the set that are actually causing the interference being observed, followed by mitigation actions that prevent them from doing so. The next section explains the techniques and steps used in detail.
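To illustrate Eq. (1), the following sketch computes the DCCR set of a degraded VM from a VM-to-host placement map; the data structures and names are hypothetical, not part of the DRX implementation.

```python
# Hedged sketch of Eq. (1): given a degraded VM, find all VMEs that have at
# least one member VM placed on the same physical host (the function P).
# The dictionaries of placements are illustrative only.

def colocated_vms(vm, placement):
    """P(vm): all VMs placed on the same physical host as `vm`."""
    host = placement[vm]
    return {v for v, h in placement.items() if h == host and v != vm}

def dccr(vm, vme_members, placement):
    """Return the set of VMEs with a member in P(vm)."""
    culprits = colocated_vms(vm, placement)
    return {vme for vme, members in vme_members.items()
            if culprits & set(members)}

# Example (hypothetical names): the Nectere VM shares host A with one Hadoop
# and one Linpack VM, so both VMEs end up in its DCCR set.
placement = {"nectere-0": "hostA", "hadoop-3": "hostA",
             "linpack-7": "hostA", "hadoop-1": "hostB"}
vme_members = {"VME-hadoop": ["hadoop-1", "hadoop-3"],
               "VME-linpack": ["linpack-7"]}
print(dccr("nectere-0", vme_members, placement))   # {'VME-hadoop', 'VME-linpack'}
```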
IV. DRX RESOURCE MANAGEMENT MECHANISMS
We use the concept of ‘Resos’, described in [25], as a resource currency with which VMs, or EMs acting on behalf of entire VMEs in the case of DRX, ‘buy’ resources for their execution. In this section, we explain how the components described in the previous sections interact to implement the DRX mechanisms built around Resos.
A. Allocating, Accounting for Resos and Charging Resources
The PM is responsible for the global resource management of the cluster and ensures resources are allocated in an appropriate manner to meet the resource needs of the VMs. However, to improve the scalability of VM resource management, the PM allocates a certain number of Resos per EM, called the ‘EM Allocation’. Each EM Allocation depends on the set of all resources present in the cluster and on the resource management policy (see Section V). It also depends on the number of VMs present in each VME, in order to avoid a completely unfair distribution of resources. Further, each EM is responsible for distributing Resos to its VMs. For the sake of simplicity, we assume EMs distribute Resos to their VMs ‘equally’, unless we state otherwise in our policies. Since we consider only CPU and IB resources for management, we assign Resos to VMs only for these resources.
We use an ‘epoch and interval’-based model for accounting of resources, where one epoch is equal to 60 seconds and each interval is 1 second. A certain number of Resos allows the VM to buy resources from the host. Every epoch, the EM distributes a new allocation of Resos to its VMs. Then, every interval, the Host Agent deducts Resos from the VM’s Resos allocation – i.e., charges the VM – to account for the CPU and I/O consumed by that VM in the interval. Any resource allocation that needs to be applied to a VM is performed based on the resource management policies.
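The epoch/interval accounting described above can be summarized by the following simplified sketch; the 60 s epoch and 1 s interval follow the text, while the sampling callbacks stand in for IBMon and hypervisor queries and are assumptions.

```python
# Simplified sketch of the HA accounting loop: every epoch the EM renews each
# VM's Reso allocation; every interval the HA charges the VM for the CPU and
# I/O it consumed. sample_io()/sample_cpu() are placeholders for IBMon and
# hypervisor queries.
import time

EPOCH_S, INTERVAL_S = 60, 1

def host_agent_loop(vms, em_allocation, price, sample_io, sample_cpu):
    while True:
        balances = dict(em_allocation)            # renew Resos at epoch start
        for _ in range(EPOCH_S // INTERVAL_S):
            time.sleep(INTERVAL_S)
            for vm in vms:
                cost = (sample_io(vm) * price["io"] +
                        sample_cpu(vm) * price["cpu"])
                balances[vm] = max(0.0, balances[vm] - cost)
                # A VM that exhausts its Resos is a candidate for throttling
                # (via CPU capping) under the active policy.
```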
B. Resource Pricing
When VMs consume resources, they spend the Resos allocated to them by their respective EM. In order to control the rate at which applications consume resources, specifically I/O, and to deal with possible congestion, where other applications’ VMs are no longer able to receive their resource share and make adequate progress, DRX dynamically changes the resource price at the granularity of VMEs. By increasing the price of a resource, VMs can only afford a limited quantity of it, since they have a limited number of Resos. According to Congestion Pricing principles [14], [21], this implies that the demand for the resource will drop, which in turn reduces the congestion on that resource. Next, we explain the
key concepts in the DRX resource pricing methods.
First, the price increase intrinsically depends on the amount of congestion-caused Performance Degradation (PD) of a VM/VME. We find the PD for a VM by using IBMon to detect changes in its IB usage. The PD is the percentage change in the CQEs (for RDMA) or I/O bytes (for IPoIB or RDMA port counters) generated by the VM. To maintain a certain SLA, the VM needs to generate a required number of CQEs/bytes. When this CQE rate falls below the SLA, IBMon detects it and the difference between the observed rate and the SLA is reported to the PM.
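A small sketch of this check, under the assumption that PD is simply the percentage shortfall of the observed CQE (or byte) rate relative to the SLA-required rate:

```python
# Sketch: compute Performance Degradation (PD) as the percentage drop of the
# observed CQE (or byte) rate below the SLA-required rate; values are illustrative.
def performance_degradation(observed_rate: float, sla_rate: float) -> float:
    """Return PD in percent; 0 if the VM is meeting its SLA rate."""
    if observed_rate >= sla_rate:
        return 0.0
    return (sla_rate - observed_rate) / sla_rate * 100.0

# e.g., the SLA requires 4000 CQE/s but IBMon currently observes 3400 CQE/s
pd = performance_degradation(3400.0, 4000.0)   # 15.0 -> reported to the PM
```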
Second, pricing is performed on a per-VME basis. This simplifies our ability to track prices across the cluster when pricing for an entire VME and reduces the amount of communication performed between the HAs and the PM. By performing price changes on the entire VME, we can provide a faster response to reduce congestion, rather than repeatedly changing prices per VM. All VMEs that belong to the DCCR set of a VM whose performance degradation triggers a price adjustment will have their price increased. The amount of the price increase for each VME depends on the factors listed above, as well as on the cluster-wide policy.
Third, in order to be more flexible in changing the price based on the policy being enforced, we use two policy-specific parameters, αi and δPi, which affect how the price increases for a VME i. The αi defines the weight or priority of VME i and lies between 0 and 1. The δPi is the policy coefficient for VME i and is defined for each policy in Section V. We also use the Old Price, OPi, of the VME to find the New Price. Using the factors described above, we generally define our Pricing Function as follows:
NPi = f(OPi, PDx, δPi, αi)
C. CPU Capping
Since we do not have explicit control over the RDMA I/O performed by the VMs, we use the rather crude method of CPU capping to reduce the amount of I/O a VM actually performs. We have shown in [25] that by throttling the CPU we can control the amount of I/O the VM performs. Therefore, we again use the CPU capping mechanism provided by the hypervisor to control the VM’s I/O usage. The capping degree depends on the policy being implemented. In general, the CPUCap for a VM depends on the New Price for VME i (NPi), the Old Cap for the VM (OCij), the VME priority (αi), and a CPU-cap policy coefficient, δCij, which defines the conversion of a price into a CPUCap. Generally, we define the new CPU Cap for a VM j belonging to VME i as:
NCij = f(OCij, NPi, αi, δCij)
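With KVM, one plausible way to apply such a CPU cap (a sketch under our assumptions; the paper does not specify which hypervisor interface is used) is to set the CFS quota of the guest’s VCPU through libvirt’s scheduler parameters:

```python
# Hedged sketch: apply a CPU cap (in percent of one PCPU) to a KVM guest by
# setting the CFS vcpu quota via libvirt scheduler parameters. This is one
# possible capping mechanism, not necessarily the one used by the authors.
import libvirt

def set_cpu_cap(dom, cap_percent: int, period_us: int = 100000):
    """Cap the guest's VCPU to cap_percent of a physical CPU."""
    quota_us = max(1000, int(period_us * cap_percent / 100))
    dom.setSchedulerParameters({"vcpu_period": period_us,
                                "vcpu_quota": quota_us})

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("vm-linpack-7")   # hypothetical guest name
set_cpu_cap(dom, 60)                      # e.g., the 60% cap assessed by EB-Dist
```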
V. POLICIES FOR A DISTRIBUTED RESOURCE EXCHANGE
Given the various components and mechanisms of DRX described in Sections III and IV, we now describe various ways in which
these components can interact to provide distributed resource
management.
A. Equal-Blame Policy
This policy is implemented to show a naive method of
charging VMEs when there is congestion. In this case, each VME responsible for the congestion is charged equally, i.e., its price is increased by an equal share. For example, if the performance degradation reported by a HA is 30%, then with 2 VMEs in the DCCR of the congested VME x, each VME’s price is increased by 15%. The goal of this policy is to have a lightweight and simple mechanism by which we can charge VMEs when congestion occurs. We define the Price and CPU Cap functions for a VME i and VM j as follows:
NPi = OPi + (PDx × δPi × αi) × OPi
NCij = OCij − (OCij / 100) × ((NPi − OPi) / OPi × 100) × δCij
δPi = 1 / N(DCCRx)
δCij = ELh / RLij
where N(DCCRx) denotes the number of VMEs in the DCCR set of VME x, and ELh and RLij denote the % of the epoch left on HA h and the % of Resos left for VM j ∈ VME i, respectively.
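Worked through in code, the Equal-Blame update for the example above (a 30% degradation with two VMEs in the DCCR set) looks roughly as follows; α = 1 and the epoch/Resos fractions are illustrative values only.

```python
# Sketch of one Equal-Blame update. With PD_x = 30% and two culprit VMEs,
# delta_P = 1/2, so each VME's price rises by 15% (for alpha = 1).
def eb_new_price(old_price, pd_pct, n_dccr, alpha=1.0):
    delta_p = 1.0 / n_dccr
    return old_price + (pd_pct / 100.0) * delta_p * alpha * old_price

def new_cpu_cap(old_cap, old_price, new_price, delta_c):
    price_rise_pct = (new_price - old_price) / old_price * 100.0
    return old_cap - (old_cap / 100.0) * price_rise_pct * delta_c

np_i = eb_new_price(old_price=1.0, pd_pct=30.0, n_dccr=2)      # 1.15 Resos/MTU
# delta_C = epoch-left% / Resos-left%, e.g. 50% of the epoch left, 80% Resos left
nc_ij = new_cpu_cap(old_cap=100.0, old_price=1.0, new_price=np_i,
                    delta_c=50.0 / 80.0)                        # ~90.6% cap
```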
B. Hurt-Based Policy
To improve on this naive policy, we now also consider the amount of I/O generated by a VME when increasing its price. Price increases for a VME are therefore directly proportional to the I/O generated by the VME and to the performance degradation. Continuing the earlier example, if the VMEs perform I/O in a ratio of 9:1, the effective price increase for VME1 would be 27% and for VME2 3%. The PM can obtain the aggregated I/O from the HAs to compute the I/O ratio between VMEs and adjust prices accordingly. The goal of this policy is to charge the VMEs based on the fraction of the performance degradation they caused. For this policy, δPi and δCij for VME i and VM j are as follows:
δPi = IOi / Σ_{k=0..N} IOk
δCij = ELh / RLij
where N denotes the number of VMEs in the DCCR set of VME x and IOi the amount of I/O performed by VME i. The rest of the formula is the same as for the Equal-Blame policy.
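The corresponding Hurt-Based update differs only in δP; for the 9:1 I/O ratio of the running example and a 30% degradation it yields the 27% and 3% price increases quoted above. A sketch with illustrative numbers:

```python
# Sketch of the Hurt-Based price coefficient: delta_P is each VME's share of
# the I/O generated by all VMEs in the DCCR set. Numbers follow the 9:1 example.
def hb_delta_p(io_by_vme, vme):
    return io_by_vme[vme] / sum(io_by_vme.values())

def hb_new_price(old_price, pd_pct, delta_p, alpha=1.0):
    return old_price + (pd_pct / 100.0) * delta_p * alpha * old_price

io = {"VME1": 9_000_000, "VME2": 1_000_000}             # bytes of I/O, 9:1 ratio
p1 = hb_new_price(1.0, 30.0, hb_delta_p(io, "VME1"))    # 1.27  (+27%)
p2 = hb_new_price(1.0, 30.0, hb_delta_p(io, "VME2"))    # 1.03  (+3%)
```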
VI. EVALUATION
A. Testbed
Our testbed consists of 8 Relion 1752 servers. Each server has dual hexa-core Intel Westmere X5650 CPUs (HT enabled), a 40Gbps Mellanox QDR (MT26428) ConnectX-2 InfiniBand HCA, 1 Gigabit Ethernet, and 48GB of RAM. Our host OS is RHEL 6.3 with KVM, and the guests run RHEL 6.1. Each guest is configured with 1 VCPU (pinned to a PCPU), 2GB of RAM, and an IB VF. We use the beta mlx4_core drivers (based on OFED 1.5) for the hosts and guests, configured to enable 16 VFs, so we can run up to 16 VMs on each host. The IB cards are connected via a 36-port Mellanox IS5030 switch.
B. Workloads
We use three benchmarks, each representing a different type of cluster workload. Nectere [11] is a server-client-based financial transactional workload with low-latency characteristics. We measure its performance in terms of µs for request completion. We use Hadoop’s Terasort [12] with a 10GB dataset as a representative of data-analytic computing and to generate distributed interference. For Hadoop workloads we use the job running time as the performance metric. Linpack [13] (Ns = 300, 1000, 7500) is a characteristic MPI workload for clusters and uses Gflops as its performance metric. We run the workloads in a staggered manner, where we start the Hadoop job and let it run for 30s before starting the Linpack job. Next, we start the Nectere workload and run all the jobs until Nectere completes successfully. This ensures that the workloads are performing sufficient I/O communication before Nectere starts.
The Hadoop and Linpack workloads are configured to use 32 VMs each, while Nectere uses 2 VMs. In the ‘symmetric configuration’, each physical machine runs 4 VMs of each benchmark. Additionally, we also use 2 asymmetric configurations – Asymm1 and Asymm2. In Asymm1, we place more Linpack and Hadoop VMs (up to 16 in total) on the same physical machine as the Nectere VMs. Therefore, this configuration should cause more interference to the Nectere application. In Asymm2, there is only one VM each of Linpack and Hadoop along with the Nectere VM. This explores the other end of the asymmetry, where there is minimal interference.
Each DRX policy ensures that when Nectere is running, it maintains the CQE/s within an SLA limit of 15% from the base value (other limits can easily be configured).
Fig. 3: Effect of Policies on the Performance (Performance Degradation %) of the Nectere, Hadoop, and Linpack Workloads.
Fig. 4: Effect of Policies on the running average of Nectere CQE/s and Latency. High CQE/s denotes Low Latency.
Fig. 5: Comparison of VME Prices (Resos/MTU), CPU Cap, and SLA Difference for the Hadoop and Linpack VMEs: (a) Equal-Blame Policy, (b) Hurt-Based Policy.
We design our experiments to highlight some of the important features of DRX and their impact on the selected workloads.
We also use two CPU Capping-only policies, denoted as CC
and PC. In CC (CPUCap), we decrease the CPU Cap steadily
by 5% (up to 25%) for the interfering VMs. In PC (Proportional
Capping), we decrease the CPU Cap for the interfering VMEs
based on their I/O Ratio metrics. Therefore, each VM would
have its CPU Cap reduced by a fraction of the 5% based on
the I/O Ratio. With these additional policies, we show the
importance of Pricing over only performing CPU Capping. We
have also configured two sub-types of policies for each of the
main policies. In one, all price changes for a VME apply to its
VMs across the entire cluster, termed a ‘Distributed’ policy.
In the other, the price changes for a VM apply only at the
HA that is reporting the congestion, termed a ‘Local’ policy.
Broadly, we divide the results into three different categories:
(i) policy performance, (ii) policy sensitivity to resource usage
patterns, and (iii) limitations and overhead of DRX in different
workload configurations, which we describe next.
C. Policy Performance
Figure 3 shows the impact of the distributed and local
policies on workload performance. The CC-Dist and PC-Dist
policies do help in reducing Nectere latency, but not to its
SLA level. These also have a much greater impact on the performance of Linpack and Hadoop, because of the continuous
capping. The EB-Local and HB-Local policies cannot reduce
the latency for Nectere below the SLA because Linpack VMs
on other machines are actually causing congestion by sending
data to its VMs collocated with Nectere. However, in the case
of the EB-Dist and HB-Dist policies, Nectere can meet its SLA
of 15%. This demonstrates the feasibility of resource pricing
as a vehicle to reduce congestion, as well as the importance
of performing distributed resource management actions, as
enabled by DRX.
Figure 4 shows the impact of each policy on the latency
of the Nectere application (bottom graph). It also shows the
change in the metric used by DRX to manage I/O performance
– CQE/s.
Fig. 6: Comparison of DRX Policy Sensitivity with two different workload sizes.
Both the Equal-Blame and Hurt-Based policies
are effective in reducing contention effects and providing
performance within the guaranteed SLA levels – they reach the
same value for latency, though their impact on the interfering
workload performance is different. Also, the HB-Dist policy
provides the least degradation of Hadoop and Linpack workloads while meeting the SLA for Nectere. The EB-Dist policy
degrades the workloads more since it increases the prices
equally for both VMEs, which negatively impacts how fast the CPUCap is reduced. As a result, the EB-Dist policy is more
reactive or fast-acting to SLA violations as highlighted by the
CPU Cap reductions versus price increases shown in Figure 5a.
EB penalizes interfering VMs more than HB and assesses a
lower CPU Cap for Hadoop and Linpack, at 60. Figure 5b shows
the more slow-acting nature of the HB-Dist policy, where the
CPUCap of the interfering workloads is decreased much more
gradually than EB-Dist. HB-Dist allocates a higher CPU Cap
to Hadoop (71) and lower CPU Cap to Linpack (50) since
these are based on the I/O Ratio between the VMEs. For both
these policies, the PM always responds to SLA violation messages within 5 ms; therefore, DRX always detects and acts
upon congestion in a timely manner.
Essentially, the EB and HB policies represent two different approaches to SLA satisfaction – (1) fast-acting, without graceful degradation of the interfering workloads, and (2) slow-acting, while providing graceful degradation to the other workloads. This result
highlights an important aspect of DRX: multiple policies can
be constructed and configured to meet the SLA values for
applications.
D. DRX Sensitivity to Resources
In order to evaluate the effect of resource usage patterns
on the effectiveness of DRX, we use two more workload
configurations. In one configuration we use an instance of
Nectere along with 2 Linpack instances. In the second, we
use an instance of Nectere along with the Hadoop MRBench
application and a smaller data size for Linpack. We observe
from Figure 6 that when the interfering workloads perform
similar amounts of I/O, both the EB and HB policies behave equally well; however, since EB treats both VMEs similarly at all times, and applies the same cap simultaneously, it achieves a lower latency for Nectere. In the adjacent graph, HB becomes more aggressive than EB and caps both Linpack and MRBench much more. This is because, as HB performs the capping, it leads to oscillations in which VME dominates the I/O Ratio (> 95%), which forces HB to apply large caps alternately to the VMEs. This is not evident in Figure 3
as the difference in the I/O Ratio between Hadoop and Linpack
is smaller.
Fig. 7: Performance of Policies (Local and Distributed) with Asymmetric Workload Configurations. Asymm1 and Asymm2 refer to the types of workload deployment.
Therefore, we find that HB is more sensitive to large
swings in the I/O Ratio while EB is less sensitive to differences
in the generated I/O. Future policies will be extended with
mechanisms to detect such oscillations, and to further limit
their aggressiveness under such circumstances.
E. Limitations and Overhead of DRX
We show in Figure 7 that for two different workload
configurations, the DRX policies affect them differently. When
there is a lot of interference in the Asymm1 configuration, none
of the policies can satisfy the Nectere SLA. This is because,
despite CPU capping, the VMs still generate sufficient I/O to
cause congestion for Nectere. In this case, having more support
from the hardware to control I/O would be very useful. In the
Asymm2 case where Nectere has minimal interference, DRX
ensures that other workloads are perturbed much less or not
at all. Here, both Linpack and Hadoop perform very close to
their baseline values. These results highlight the limited utility
of CPU Capping in extreme interference and also the low
overhead caused by DRX components and their management
actions.
VII. RELATED WORK
In this section we briefly discuss prior research related to
DRX.
Distributed Rate Limiting for Networks. Many recent
efforts have explored distributed control for providing network
guarantees for cloud-based workloads. These have looked
at providing min-max fairness to workloads [22], providing
minimum bandwidth guarantees [9], or using congestion notifications from switches [5]. [20] provides a detailed survey
of these approaches. There are also other efforts that provide
network guarantees for per-tenant [16] and inter-tenant communication [3]. Authors in Gatekeeper [27] enforce limits per
tenant per physical machine by providing exact egress and
ingress bandwidth values. In DRX, by providing prices and
setting CPU Cap limits, we similarly enforce network limits
per tenant per physical machine.
These approaches show that providing distributed control
for networks is becoming important for cloud systems. However, while these approaches may work well for Ethernet-based para-virtualized networks, they do not yet explore high-performance devices like InfiniBand or SR-IOV devices. DRX
borrows some ideas like minimum guarantees and tenant
fairness (VM ensembles are similar to tenants) from these
efforts to show that distributed control for networks is still
required and feasible for hardware-based virtualized networks.
Economics and Resource Management. DRX also relies
on the effects of Congestion Pricing on resource usage and
allocation. These ideas have been explored before in network
congestion avoidance [14], [21], platform energy management [30], as well as in market-based strategies to allocate
resources [15]. However, to our knowledge, ours is the first to use congestion pricing to provide distributed control over InfiniBand network usage.
VIII. CONCLUSIONS AND FUTURE WORK
This paper addresses the unresolved problem of cross-application interference for distributed applications running in virtualized settings. This problem occurs not only with software-virtualized networking but also with newer high-performance fabrics that use hardware virtualization techniques like SR-IOV, which grant VMs direct access to the network. This removes the hypervisor from the communication path, and with it the control over how VMs use the fabric. The resulting performance degradation from co-running sets of VMs is particularly acute for low-latency applications such as those used in computational finance.
In this paper, we describe our approach, called Distributed Resource Exchange (DRX), which offers hypervisor-level methods to mitigate such inter-application interference in SR-IOV-based cluster systems. We monitor VM Ensembles – sets of VMs that are part of a distributed application – which enables controls that apportion interconnect bandwidth across different VMEs by implementing diverse cluster-wide policies. Two policies are implemented in DRX: one assigns ‘Equal Blame’ to the interfering VMEs, and the other considers how much ‘Hurt’ they are causing, thereby demonstrating the feasibility of such distributed controls. The results show that DRX is able to maintain the SLA for low-latency codes to within 15% of the baseline by controlling collocated data-analytic and parallel workloads. Limitations of the DRX approach are primarily due to its current method of mitigating interference, which is to ‘cap’ the VMs that over-use the interconnect and cause ‘hurt’. Our future work, therefore, will consider utilizing the congestion control mechanisms present in current InfiniBand hardware to limit the sending rate of certain QPs, in order to remove our reliance on CPU capping.
REFERENCES
[1] D. Abramson et al. Intel Virtualization Technology for Directed I/O. Intel Technology Journal, 10(3), 2006.
[2] AMD I/O Virtualization Technology. http://tinyurl.com/a6wsdwe.
[3] H. Ballani, K. Jhang, T. Karagiannis, and C. K. et al. Chatty Tenants and the Cloud Network Sharing Problem. In NSDI, 2013.
[4] M. Ben-Yehuda, J. Mason, O. Krieger, J. Xenidis, L. V. Dorn, A. Mallick, J. Nakajima, and E. Wahlig. Utilizing IOMMUs for Virtualization in Linux and Xen. In Ottawa Linux Symposium, 2006.
[5] B. Briscoe and M. Sridharan. Network Performance Isolation in Data Centres using Congestion Exposure (ConEx), 2012. http://datatracker.ietf.org/doc/draft-briscoe-conex-data-centre.
[6] L. Cherkasova and R. Gardner. Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor. In USENIX ATC, 2005.
[7] Y. Dong, Z. Yu, and G. Rose. SR-IOV Networking in Xen: Architecture, Design and Implementation. In Proceedings of WIOV, 2008.
[8] S. Govindan, A. R. Nath, A. Das, B. Urgaonkar, and A. Sivasubramaniam. Xen and co.: Communication-Aware CPU Scheduling for Consolidated Xen-based Hosting Platforms. In VEE, 2007.
[9] C. Guo, G. Lu, H. J. Wang, and S. Y. et al. SecondNet: A Data Center Network Virtualization Architecture with Bandwidth Guarantees. In Proceedings of ACM CoNEXT, 2010.
[10] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat. Enforcing Performance Isolation Across Virtual Machines in Xen. In Proceedings of Middleware, 2006.
[11] V. Gupta, A. Ranadive, A. Gavrilovska, and K. Schwan. Benchmarking Next Generation Hardware Platforms: An Experimental Approach. In Proceedings of SHAW, 2012.
[12] Apache Hadoop. http://hadoop.apache.org/.
[13] High Performance Linpack. http://www.netlib.org/benchmark/hpl.
[14] P. Key, D. McAuley, P. Barham, and K. Laevens. Congestion Pricing for Congestion Avoidance. Technical report, Microsoft Research, 1999.
[15] K. Lai, L. Rasmusson, E. Adar, S. Sorkin, L. Zhang, and B. A. Huberman. Tycoon: An Implementation of a Distributed Market-Based Resource Allocation System. Technical report, HP Labs, Palo Alto, CA, USA, 2004.
[16] T. Lam, S. Radhakrishnan, A. Vahdat, and G. Varghese. NetShare: Virtualizing Data Center Networks across Services. Technical Report CS2010-0957, University of California, San Diego, 2010.
[17] M. Lee, A. S. Krishnakumar, P. Krishnan, N. Singh, and S. Yajnik. Supporting Soft Real-Time Tasks in the Xen Hypervisor. In VEE, 2010.
[18] J. Liu. Evaluating Standard-Based Self-Virtualizing Devices: A Performance Study on 10 GbE NICs with SR-IOV Support. In IPDPS, 2010.
[19] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel. Diagnosing Performance Overheads in the Xen Virtual Machine Environment. In Proceedings of VEE, 2005.
[20] J. Mogul and L. Popa. What We Talk About When We Talk About Cloud Network Performance. ACM CCR, 2012.
[21] R. Neugebauer and D. McAuley. Congestion Prices as Feedback Signals: An Approach to QoS Management. In ACM SIGOPS Workshop, 2000.
[22] L. Popa, G. Kumar, M. Chowdhury, and A. K. et al. FairCloud: Sharing the Network in Cloud Computing. In ACM SIGCOMM, 2012.
[23] X. Pu, L. Liu, Y. Mei, S. Sivathanu, Y. Koh, and C. Pu. Understanding Performance Interference of I/O Workload in Virtualized Cloud Environment. In Proceedings of IEEE Cloud, 2010.
[24] A. Ranadive, A. Gavrilovska, and K. Schwan. IBMon: Monitoring VMM-Bypass InfiniBand Devices using Memory Introspection. In HPC Virtualization Workshop, EuroSys, 2009.
[25] A. Ranadive, A. Gavrilovska, and K. Schwan. ResourceExchange: Latency-Aware Scheduling in Virtualized Environments with High Performance Fabrics. In Proceedings of IEEE Cluster, 2011.
[26] A. Ranadive, M. Kesavan, A. Gavrilovska, and K. Schwan. Performance Implications of Virtualizing Multicore Cluster Machines. In HPC Virtualization Workshop, EuroSys, 2008.
[27] P. V. Soares, J. R. Santos, N. Tolia, and D. Guedes. Gatekeeper: Distributed Rate Control for Virtualized Datacenters. Technical Report HPL-2010-151, HP Labs, 2010.
[28] InterContinental Exchange. http://www.theice.com.
[29] C. Wang, I. A. Rayan, G. Eisenhauer, K. Schwan, et al. VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications. In Middleware, 2012.
[30] H. Zeng, C. S. Ellis, A. R. Lebeck, and A. Vahdat. Currentcy: A Unifying Abstraction for Expressing Energy Management Policies. In Proceedings of the USENIX Annual Technical Conference, 2003.