Performance Implications of Virtualizing Multicore Cluster Machines
Adit Ranadive
Mukil Kesavan
Ada Gavrilovska
Karsten Schwan
Center for Experimental Research in Computer Systems (CERCS)
Georgia Institute of Technology
Atlanta, Georgia, 30332
{adit262, mukil, ada, schwan}@cc.gatech.edu
Abstract
High performance computers are typified by cluster machines constructed from multicore nodes and using high performance interconnects like Infiniband. Virtualizing such ‘capacity computing’
platforms implies the shared use of not only the nodes and node
cores, but also of the cluster interconnect (e.g., Infiniband). This paper presents a detailed study of the implications of sharing these resources, using the Xen hypervisor to virtualize platform nodes and
exploiting Infiniband’s native hardware support for its simultaneous use by multiple virtual machines. Measurements are conducted
with multiple VMs deployed per node, using modern techniques for
hypervisor bypass for high performance network access, and evaluating the implications of resource sharing with different patterns
of application behavior. Results indicate that multiple applications
can share the cluster's multicore nodes without undue effects on the
performance of Infiniband access and use. Higher degrees of sharing are possible with communication-conscious VM placement and
scheduling.
Categories and Subject Descriptors D.4.7 [Operating Systems]:
Organization and Design; C.2.4 [Computer-Communication Networks]: Distributed Systems; C.5.1 [Computer System Implementation]: Large and Medium Computers
General Terms Design, Performance, Management, Reliability
Keywords Virtualization, High-performance Computing, Infiniband
1. Introduction
In the enterprise domain, virtualization technologies like VMWare’s
ESX server [29] and the Xen hypervisor [3] are becoming a prevalent solution for resource consolidation, power reduction, and dealing with bursty application behaviors. Amazon's Elastic Compute
Cloud (EC2) [2], for instance, uses virtualization to offer datacenter resources (e.g., clusters or blade servers) to applications run
by different customers, safely providing different kinds of services
to diverse codes running on the same underlying hardware (e.g.,
trading systems jointly with software used for financial analysis
and forecasting). Virtualization has also been shown to be an effective vehicle for dealing with machine failures, improving application portability, and helping to debug complex application codes.
In high performance systems, research has demonstrated virtualized network interfaces [12], shown the benefits of virtualization
for grid applications [1, 17, 20, 10, 23, 35], and argued the utility of
these technologies for attaining high reliability for large scale machines [26]. Furthermore, key industry providers of HPC technology are actively developing efficient, lightweight virtualization solutions, an example being the close collaboration between vendors
of high performance IO solutions like Infiniband, such as Cisco
and Mellanox, with representatives of the virtualization industry,
including VMWare and Xen. Here, a key motivator is the importance of virtualization for the ‘capacity’ systems in common use in
both the scientific and commercial domains, the latter including financial institutions, retail, telecom and transportation corporations,
providers of web and information services, and gaming applications [25]. In fact, an analysis of the Top500 list demonstrates over
30 application areas, most of which do not belong to the category of traditional HPC scientific codes. Finally, when industry
uses large scale HPC systems, now even including IBM's Blue Gene, as platforms for 'utility' or 'cloud' computing [4, 7], virtualization makes it possible to package client application components
into isolated guest VMs that can be cleanly deployed onto and share
underlying platform resources.
Despite these trends, scientists running traditional high performance codes have been reluctant to adopt virtualization technologies. In part, this is because of their desire to exploit all available
platform resources to attain the performance gains sought by use
of ‘capability’ HPC machines. Perhaps more importantly, however,
this is because resource sharing can degrade the high levels of
performance sought by HPC codes. As a result, the degrees or extent to which virtualization technologies will be adopted in the HPC
domain remain unclear [9, 11].
This paper contributes experimental insights and measurements
to better understand the effects of resource sharing on the performance of HPC applications. Specifically, for multiple virtual machines running on multicore platforms, we evaluate the extent to
which their communications are affected by the fact that they share
a single communication resource, using an Infiniband interconnect
as the concrete instance of such a resource. Stated more precisely,
using standard x86-based quadcore nodes and the Xen hypervisor,
we evaluate the degree of sharing possible via Infiniband under a
range of platform parameters and application characteristics. The
purpose is (1) to understand the performance implications and overheads of supporting multiple VMs on virtualized multicore IB platforms; (2) to explore the performance implications of different inter- and intra-VM interaction patterns on such platforms; and (3) to devise suitable deployment, co-location, and scheduling strategies
for individual VMs onto shared virtualized resources.
Experimental results presented in the paper demonstrate that a
high level of sharing, that is, a significant number of VMs deployed
to each node, is feasible without noticeable performance degradation, despite the fact that VM-VM communications share a single
Infiniband interconnect. Further, sharing is facilitated by methods
for VM deployment and scheduling that are aware of VMs’ communication behaviors (i.e., communication-awareness) and of the
requirements on communications imposed by VMs (i.e., awareness
of the SLAs, or 'Service Level Agreements', sought by VMs). Technically, this involves (1) manipulating hypervisor-level parameters
like scheduling weights, (2) carrying out service-level actions like
mapping VMs’ QoS requirements to Infiniband virtual lanes, and
(3) devising suitable system-level policies for VM migration and
deployment. This paper lays the foundation for such future technical work, by providing experimental insights into the bottlenecks
such mechanisms will need to avoid and/or the performance levels
they can be expected to deliver.
Remainder of paper. The remainder of the paper is organized as
follows. Section 2 describes our experimental testbed and methodology. Sections 3 and 4 discuss the experimental results gathered
with various VM loads and deployments and different inter- and
intra-VM communication patterns, for native RDMA communication and MPI applications, respectively. A brief survey of related
work and concluding remarks appear in the last two sections.
2. Testbed
All experimental evaluations are performed on a testbed consisting of 2 Dell 1950 PowerEdge servers, each with 2 Quad-core
64-bit Xeon processors at 1.86 GHz. The servers have Mellanox
MT25208 HCAs, operating in the MT23108 Tavor compatibility mode,
connected through a Cisco/Topspin 90 switch. Each server is running the RHEL 4 Update 5 OS (paravirtualized 2.6.18 kernel) in
dom0 with the Xen 3.1 hypervisor.
The virtualized Infiniband implementation available on the
Xensource site is based on Xen 3.0 with the BVT scheduler [31]
and uses kernel sockets for the initial Infiniband split driver setup.
Since this implementation does not scale well for multiple VMs,
we changed the initial driver setup to be performed over Xenbus
instead, and we ported the entire implementation to Xen 3.1 to
analyze the new credit scheduler’s [30, 6] impact on Infiniband
performance. The guest kernels are paravirtualized and run the
RHEL 4 Update 5 OS. Each guest is allocated 256 MB of RAM.
For running Infiniband applications within the guests, OFED (Open
Fabrics Enterprise Distribution) 1.1 [18] is modified to be able to
use the virtualized IB driver.
Microbenchmarks include the RDMA benchmarks from the
OFED 1.1 distribution and the Presta MPI suite from Lawrence Livermore National Laboratory [19]. These permit us to evaluate the performance impact of executing multiple VMs on shared virtualized resources, for both native IB RDMA and for MPI communications, as
well as to consider various VM-VM interaction patterns. For running the Presta MPI Suite, OpenMPI 1.1.1 is installed on dom0 and
on domUs.
A specific challenge in communication fabrics that support
asynchronous IO, like Infiniband, is the inability to obtain accurate timing measurements without additional hardware support.
Our results are based on time measurements gathered before posting an IO request and after the corresponding completion event is
detected via a polling interface. This approach has been accepted
in the community as a viable approximation of the exact timings of
various asynchronous IO operations [16, 14].
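As an illustration of this measurement approach, the sketch below (illustrative only, not our benchmark code) times a single RDMA Write using the OFED verbs interface: a timestamp is recorded before posting the work request and again once the corresponding completion is observed by polling the completion queue. The queue pair (qp), completion queue (cq), registered buffer (buf, mr), and remote address and rkey are assumed to have been established beforehand.

    /* Minimal sketch of the timing method: timestamp before ibv_post_send(),
     * busy-poll the CQ for the completion, timestamp after it is seen.
     * Setup of qp, cq, buf, mr, remote_addr and rkey is assumed. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/time.h>
    #include <infiniband/verbs.h>

    static double time_rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                                  void *buf, struct ibv_mr *mr, size_t len,
                                  uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad_wr = NULL;
        struct ibv_wc wc;
        struct timeval t0, t1;

        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        gettimeofday(&t0, NULL);               /* timestamp before posting */
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1.0;
        while (ibv_poll_cq(cq, 1, &wc) == 0)   /* poll for the completion event */
            ;
        gettimeofday(&t1, NULL);               /* timestamp after completion */

        if (wc.status != IBV_WC_SUCCESS)
            return -1.0;
        return (t1.tv_sec - t0.tv_sec) * 1e6 +
               (t1.tv_usec - t0.tv_usec);      /* elapsed time in usec */
    }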
3. Experimental Evaluation - Microbenchmarks
The first set of measurements evaluates the Infiniband RDMA communication layer. We do not include IPoIB measurements, as those
numbers are inferior in performance compared to native RDMA
support. Tests are run with different numbers and deployments of
VMs per core and per IB node and with different scheduling criteria. Measurements are taken for the three basic operations in Infiniband, which are RDMA Write, RDMA Send/Receive, and RDMA
Read, in terms of average bandwidth and latency (RDMA Write
only). Each test consists of 5000 iterations performed for each of
the message sizes, as shown in the graphs (from 2B to 8MB). The
MTU size in these experiments is 2KB.
Basic Benchmarks. For the graphs in Figure 1, the setup of the
Virtual Machines is symmetric, i.e., running an equal number of
VMs on the two physical machines, denoted as a 2VM-2VM test.
The motivation is to understand the performance effects of multiple
VMs sharing the same Infiniband HCA.
The first graph in Figure 1 shows that the differences in average
bandwidth for RDMA Write and Send/Recv tests, achieved running inside a VM vs. in a non-virtualized platform, are practically
negligible. This shows that virtualization does not impose noticeable overheads on IB throughput. Varying the number of VMs on
each machine from 1 to 6, we find that the bandwidths converge approximately to the total maximum bandwidth divided by the number of VMs. This occurs for larger message sizes, where the network link becomes saturated with data. As the number of VMs
increases, saturation occurs at ever smaller message sizes. At the
same time, the total bandwidth perceived by VMs in non-saturated
cases (e.g., up to 64k in the case of 2VMs and 32k for 3VMs) is the
maximum sustainable bandwidth. This implies that 1. the shared
use of IB interconnects by multiple VMs is both viable and reasonable, as long as the total bandwidth required by all simultaneously
running VMs remains below the maximum sustainable bandwidth.
Further, 2. network bandwidth is divided equally among all VMs,
with RDMA Write delivering the highest performance, followed by
RDMA Send/Receive. RDMA Read performs worst, as well documented in other work [16]. Finally, 3. the maximum bandwidth
achieved by any of the RDMA operations is 932 MBps, or approximately 7.5Gbps.
Effects of Scheduling. The next test demonstrates the effects of
pinning VMs to different and/or the same physical CPUs (PCPUs),
thereby controlling the physical resources available to each VM.
The Xen scheduler allows guest VMs to either use a specific CPU
or any CPU that is free when the VM is scheduled. Specifically,
with Xen 3.1’s default Credit Scheduler [30], the same weight
is assigned to each VM that is pinned to the same CPU, so that
each VM receives an equal CPU share. Note that for these and all
subsequent experiments, we show only the results for the RDMA Write
microbenchmark. It consistently delivers the highest performance
compared to other microbenchmarks.
The graphs in Figure 2 show that when all VMs are assigned
to the same physical CPU, the bandwidth attained by each VM
is highly variable. This is due to the fact that the Xen scheduler
shares the CPU by continuously swapping out/swapping in these
VMs. In contrast, 4. the performance attained by VMs pinned
to different PCPUs is both higher and more consistent, in terms
of the average bandwidth achieved by each VM. In both cases,
however, average bandwidth converges to maximum bandwidth
divided by the number of VMs, as with the simple benchmark tests
described above. Furthermore, 5. as the link becomes saturated
with increasing message sizes, the average bandwidth attained by
each VM decreases.
[Figure 1. RDMA Performance Numbers: bandwidth (MBps) vs. message size (bytes, 2B to 8MB) for Native Infiniband vs. 1VM-1VM, and for RDMA Write, RDMA Send/Recv, and RDMA Read with 1 to 6 VMs per node.]

Conclusions derived from these results include the following. First, even when co-locating VMs on the same physical CPU, performance degradation will not occur until total required bandwidth
exceeds available IB resources. Second, the “plateau” in each of the
graphs shows that even for the case of 6VMs per single machine,
we can still achieve the maximum sustainable performance level,
as in the native case. The width of this “plateau” is dependent upon
the number of VMs and the message sizes.
Latency Tests. Figure 3 shows the latencies recorded for different
numbers of VMs. The latencies are measured for pairs of VMs
communicating across two physical nodes. As a baseline, we also
include measurements performed for communications between the
dom0s on the virtualized machines.
Results show that 6. the typical latency for an RDMA Write operation does not change much as the number of VMs increases.
This is because VMM-bypass capable interconnects like Infiniband
avoid the frontend-backend communication overheads experienced
by other Xen devices. However, 7. as message sizes increase, latencies increase sharply due to bandwidth saturation. For
smaller message sizes, the difference in latencies between dom0 and VMs is negligible (on the order of less than 10 usec), thereby demonstrating the effectiveness of Infiniband's VMM-bypass implementation.
4. MPI Benchmarks
For the MPI benchmarks we use the Lawrence Livermore National
Lab Presta MPI benchmark suite. The two benchmark tests used
include (1) the com test, used to analyze the impact of virtualization
on inter-process communication bandwidth and latency, and (2)
the glob test, used to analyze the impact on collective operations
across VMs or processes within a VM.
MPI Com Test. The com test is an indicator of link saturation
between pairs of communicating MPI processes. All of the results
reported below are for the unidirectional test. The various test
configurations and the resulting trends discovered are listed below:
1. Virtualization Overhead Measurement. The com test is run
across two native Linux 2.6.18 kernels and 2 VMs, with one
process per machine, virtual or otherwise.
2. Xen credit scheduler effects on IB-based applications running
on VMs. It is important to analyze the effects of virtual machine scheduling on applications running in VMs. We run one
MPI process per VM, and use two test configurations, where in
one configuration, all VMs are pinned to different physical CPU
cores and in the other, all the VMs are pinned to the same physical CPU. This represents the ‘best’ vs. ‘worst’ cases concerning
the effects of scheduling on communication performance. Tests
are performed with 2 and 4 VMs, respectively, running on the
same physical machine.
3. Latency Variation due to VM load. To measure the variation in
communication latency due to VM load resulting from different
distributions of processes across VMs, we use 2, 4, 8, 16, and 32
communicating MPI processes on 2, 4, and 8 virtual machines,
with a fair distribution of processes across VMs. We devise two
tests: (1) all VMs pertaining to a measurement run are pinned
across two physical cores, i.e., multiple VMs may share the
same physical cpu core; and (2) 8 physical cores are used for
the VMs pertaining to a measurement run, i.e., in some cases,
a VM may have more than one VCPU available to it. The
rationale is that test results make it possible to compare the
effects of load on the native Linux scheduler (the Linux O(1)
scheduler in the kernel version used in our tests) vs. the Xen
credit scheduler. Measurements depict the com latency for a pre-configured number of operations for each pair of VMs vs. processes.
4. Bandwidth variation due to cpu capping. To simulate cases in which a VM running MPI shares the same CPU with other applications, we use cpu caps of 25, 50, 75, and 100, expressed as the percentage availability of the physical cpu. These caps are applied to each VM in a 2 VM and a 4 VM case, where 1 MPI process runs on each of these VMs.

[Figure 2. Effect of CPU Pinning on RDMA Operations in MultiVM: average RDMA Write bandwidth (MBps) vs. message size (bytes) for 2VM-2VM, 3VM-3VM, 4VM-4VM, and 6VM-6VM configurations, comparing VMs pinned to the same PCPU against VMs pinned to different PCPUs.]
[Figure 3. RDMA write latency: latency (usec) vs. message size (bytes) for 1VM-1VM through 4VM-4VM pairs and Dom0-Dom0.]
[Figure 4. Virtualization overhead: unidirectional bandwidth (MBps) vs. message size (bytes) for native IB, 2 dom0s, and 2 VMs.]
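For reference, the sketch below (illustrative only, not the Presta com code itself) shows the basic structure of such a unidirectional bandwidth measurement between a pair of MPI processes; the iteration count and message size are placeholder values.

    /* Minimal sketch of a unidirectional MPI bandwidth measurement between
     * a pair of processes (illustrative only, not the Presta com benchmark). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int ITERS = 5000;              /* iterations per message size */
        const size_t msg_size = 1 << 20;     /* example message: 1 MB */
        char *buf = malloc(msg_size);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);         /* align both ranks before timing */
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0)
                MPI_Send(buf, (int)msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, (int)msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("unidirectional BW: %.1f MBps\n",
                   (ITERS * (double)msg_size) / ((t1 - t0) * 1e6));

        free(buf);
        MPI_Finalize();
        return 0;
    }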
Figure 4 measures the unidirectional inter-process bandwidth
achieved for pairs of MPI processes. Multiple message sizes are
evaluated for a native Linux install, for 2 dom0s (i.e., the base case), and for 2 VMs pinned on different physical processor cores.
[Figure 5. Impact of pinning VMs on the same core: bandwidth (MBps) vs. message size (bytes) for 2, 4, and 8 VMs pinned to different cores and for 2 and 4 VMs pinned to the same core.]
It is evident from the figure that virtualization does not cause
additional overheads for MPI communications. Related work has
already demonstrated the negligible overheads on MPI processes
when deployed in a single VM per node [33, 15].
Figure 5 shows the bandwidths achieved for multiple pairs of
MPI processes, each running in its own VM, where the VM (1) has
its own physical cpu core and (2) is sharing a physical cpu core
with the other VMs. The most notable trend in the graph is that
when multiple VMs are all pinned on the same physical CPU core,
the bandwidths for message sizes greater than 8KB drop drastically
compared to the case when the VMs are pinned to different cores.
This is primarily due to VM scheduling overheads. As the VMs share the single physical CPU for small time slices, 8. smaller messages are sent in a single time slice when the VM is scheduled, so that VM performance is not significantly affected. For larger messages, with sizes greater than 8KB for the tests considered, the VMs are de-scheduled and have to be rescheduled, one or more times, to complete the data transfer.

[Figure 6. Latency comparison for multiple processes and VMs with 8 vs. 2 cores: latency (usec) vs. number of MPI processes (2, 4, 8, 16, 32) for 2, 4, and 8 VMs.]
[Figure 7. Bandwidth variations due to cpu cap: bandwidth (MBps) vs. message size (bytes) under caps of 25%, 50%, 75%, and 100%, for the 2 VM and 4 VM cases.]
The graphs in Figure 6 show the latencies measured for sets of
communicating MPI processes performing a fixed number of operations, on multiple VMs sharing 8 and 2 physical cores, respectively. Details about these measurement include:
• VMs sharing 8 cores: when there are less than 8 VMs, the
number of VCPUs available per VM is increased to distribute
the 8 physical cores evenly amongst VMs.
• VMs sharing 2 cores: when there are more than 2 VMs, we
pin multiple VMs on a single physical cpu core such that the load on each available cpu core is balanced.
Measurements indicate that latency increases as the number of
processes per VM increases. In essence, a heavily loaded VM tends
to perform poorly irrespective of the presence of an RDMA-based
MPI implementation. Further, giving a VM more VCPUs for use by
guest OS processes appears to be less effective than using a larger
number of VMs. This is likely due to the actions of guest OS vs.
VM schedulers.
Figure 7 shows the variation in bandwidth with 2 VMs and
4 VMs, respectively, for different CPU caps and with each VM
running on a different core. Smaller CPU caps result in higher
variations in total achieved bandwidths for large message sizes,
again due to scheduling effects (e.g., VMs losing the CPU while
communicating).
MPI Glob Test. The glob test from the Presta MPI Benchmark is
used to measure the latencies of MPI collectives. The MPI Reduce,
MPI Broadcast, and MPI Barrier collectives are measured, all of
which are frequently used in high performance applications [28].
We perform these measurements to better understand the implications on communication performance of co-deploying interacting
VMs on a virtualized infrastructure. Experimental evaluations consider the following configurations:
1. 1 MPI process / VM, with a varied number of VMs, and compared with dom0 results;
2. 4 processes running on different dom0s vs. 4 processes in 4
VMs (all on same physical multicore machine); and
3. 8 processes, with a varied number of VMs, i.e., 2, 4, and 8 VMs.
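The sketch below (again illustrative, not the Presta glob code) shows how the latency of one such collective can be timed; the broadcast case is shown, and allreduce and barrier are measured analogously. The repetition count and payload size are placeholder values.

    /* Minimal sketch of timing an MPI collective (broadcast shown);
     * allreduce and barrier are measured in the same fashion.
     * Illustrative only, not the Presta glob benchmark itself. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int ITERS = 1000;              /* illustrative repetition count */
        const int count = 256 * 1024;        /* illustrative payload: 1 MB of ints */
        int *buf = malloc(count * sizeof(int));
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);         /* align all ranks before timing */
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg broadcast latency: %.1f usec\n",
                   (t1 - t0) * 1e6 / ITERS);

        free(buf);
        MPI_Finalize();
        return 0;
    }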
Experimental results for the first configuration are depicted by the
graphs in Figure 8. The latencies shown in the broadcast and allreduce graphs are similar to earlier results, demonstrating that for
smaller message sizes, the latencies for the MPI Collectives do not
vary much as the number of VMs increases. Even at a finer-grained scale, the latency differences between the dom0 and 8VM cases are less than 10 usec for message sizes up to 64KB. For larger message sizes, as bandwidth is saturated, the latencies increase. In the barrier test in Figure 8, the number in brackets indicates the number of
MPI processes running across the virtual/physical machines. The
notation 4Dom0 in the figure indicates that 4 MPI processes run in dom0.

[Figure 8. Latencies of collective operations across VMs: broadcast and allreduce latency (usec) vs. message size (bytes) for dom0 and for 2, 4, and 8 VMs (1 MPI process/VM), and barrier time (usec) for the Dom0 (2), 2VMs (2), 4VMs (4), 8VMs (8), and 4Dom0 (4) configurations.]
[Figure 9. Latencies for collective operations for 4 MPI processes within one domain (dom0) or across 4 VMs: broadcast and allreduce latency (usec) vs. message size (bytes).]
[Figure 10. Latencies for collective operations for different number of processes per VM: broadcast and allreduce latency (usec) vs. message size (bytes) for 8 processes on 2, 4, and 8 VMs.]

The increased overheads in the barrier case are expected
because the increased amount of VM-VM interaction is not amortized by any performance gains from improved data movement between processes in the VM and the Infiniband network. We are planning additional tests to gather information from
low-level performance counters, such as VM entry/exit operations,
time spent in the hypervisor, etc., which we believe will help better
explain the observed behaviors for these types of collective operations.
The experiments presented in Figure 9 compare the performance of MPI processes running in multiple VMs versus running
in the same VM. The performance of 4VMs (1 MPI process/VM)
versus 4 processes in dom0 shows little difference in terms of latency for messages up to 64KB in size. In these tests, we use the default Xen scheduling policy. These results demonstrate that based
on the types of interactions between application processes, and the
amount of IO performed, it can be acceptable to structure individual components as separate VMs, all deployed on the same platform. This can be useful in maintaining isolation between different
application components, or to leverage the Xen-level mechanisms
for dynamic VM migration for reliability or load balancing purposes.
Similar tests shown in Figure 10 investigate the impact of varying the number of processes running within a VM. Unlike the barrier case in Figure 8 above, results show that 9. broadcast or allreduce communication patterns benefit if they are structured across a larger number of VMs, particularly for larger message sizes. The best case is the one in which each process runs in its own VM, because this reduces the additional scheduling overheads within the guest VM (the Linux scheduler) and at the VMM level (the
Xen scheduler). These results further strengthen our experimental
demonstration of the fact that multiple VMs can easily share a
single virtualized platform, even in the high performance domain.
5. Related Work
Other research efforts that have analyzed the performance overheads of virtualizing Infiniband platforms with the Xen hypervisor
appear in [15, 24]. Our work differs in that it specifically focuses on
the effects on communication performance when virtualized multicore platforms are shared by many collaborating VMs. For these
purposes, the IB split driver was modified to enable guestVM-dom0
interactions via Xenbus, which made it possible for multiple VMs
to be instantiated in an efficient and scalable manner, thereby enabling the experiments described in this paper.
The opportunities for virtualization in the HPC domain have
been investigated in multiple recent research efforts. The work described in [33, 34] assesses the performance impact of Xen-based
virtualization on high performance codes running MPI, specifically
focusing on the Xen-related overheads. It does not take into account
the effects of any specific platform characteristics, such as the multicore processing nodes or the Infiniband fabric considered by our
work. Other efforts have used virtualization to ease reliability, management, and development and debugging for HPC systems and
applications [26, 27, 8]. The results described in this paper complement these efforts. Finally, many research efforts use virtualization for HPC grid services [35, 23, 17, 20] – our complementary
research focus is to understand the performance factors in deploying multiple VMs on the individual multicore resources and cluster
machines embedded in such grids.
There is much related work on managing shared data centers [5,
32], including considering deployment issues for mixes of batch
and interactive VMs on shared cluster resources [13], cluster management, co-scheduling and deployment of cluster process [22, 21].
Our future research will build on such work to create a QoS-aware
management architecture that controls the shared use of virtualized
high performance resources.
6. Conclusions and Future Work
This paper presents a detailed study of the implications of sharing
high performance multicore cluster machines that use high end interconnection fabrics like Infiniband and that are virtualized with
standard hypervisors like Xen. Measurements are conducted with
multiple VMs deployed per node, using modern techniques for hypervisor bypass for high performance network access. Experiments
evaluate the implications of resource sharing with different patterns
of application behavior, including number of processes deployed
per VM, types of communication patterns, and amounts of available platform resources.
Results indicate that multiple applications can share multicore
virtualized nodes without undue performance effects on Infiniband access and use, with higher degrees of sharing possible with
communication-conscious VM placement and scheduling. Furthermore, depending on the types of interactions between application
processes and the amounts of IO performed, it can be beneficial to
structure individual components as separate VMs rather than plac-
ing them into a single VM. This is because such placements can
avoid undesirable interactions between guest OS-level and VMM-level schedulers. Such placement can also bring additional benefits for maintaining isolation between different application components, or for load-balancing, reliability and fault-tolerance mechanisms that can leverage the existing hypervisor- (i.e., Xen-) level
VM migration mechanisms.
Our future work will derive further insights from the experimental results discussed in Sections 3 and 4 by gathering additional low
level performance information, including time spent in the hypervisor, number of ‘world switches’ between the VMs and the hypervisor, etc., using tools like Xenoprofile. The idea is to attain greater
insights into the implications of shared use of virtualized platforms
and the manner in which the platforms’ resources should be distributed among running VMs. We hope to be able to include select
results from such measurements into the final version of this paper.
In addition, we plan to extend this work to analyze the ability of Infiniband virtualized platforms to meet different QoS requirements
and honor SLAs for sets of collaborating VMs, by manipulating
parameters such as VM deployment onto or across individual platform nodes, resource allocation and hypervisor-level scheduling
parameters on these multicore nodes, and fabric-wide policies for
Service Level (SL) to Virtual Lane (VL) mappings. Certain extensions of our current testbed are necessary to make these measurements possible. The longer term goal of our research is to devise
new management mechanisms and policies for QoS-aware management architectures for shared high performance virtualized infrastructures.
References
[1] S. Adabala, V. Chadha, P. Chawla, R. Figueiredo, J. Fortes, I. Krsul,
A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu.
From virtualized resources to virtual computing grids: the In-VIGO
system. Future Generation Computer Systems, 21(6):896–909, 2005.
[2] Amazon Elastic Compute Cloud (EC2). aws.amazon.com/ec2.
[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,
R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of
Virtualization. In SOSP 2003, 2003.
[4] IBM Research Blue Gene. www.research.ibm.com/bluegene/.
[5] J. Chase, L. Grit, D. Irwin, J. Moore, and S. Sprenkle. Dynamic
Virtual Clusters in a Grid Site Manager. In Twelfth International
Symposium on High Performance Distributed Computing (HPDC12), 2003.
[6] L. Cherkasova, D. Gupta, and A. Vahdat. Comparison of the Three
CPU Schedulers in Xen. ACM SIGMETRICS Performance Evaluation
Review, 35(2):42–51, 2007.
[7] Technology Review: Computer in the Cloud. www.technologyreview.com/Infotech/19397/?a=f.
[8] C. Engelmann, S. L. Scott, H. Ong, G. Vallée, and T. Naughton.
Configurable Virtualized System Environments for High Performance
Computing. In Proceedings of the 1st Workshop on System-level
Virtualization for High Performance Computing (HPCVirt) 2007, in
conjunction with the 2nd ACM SIGOPS European Conference on
Computer Systems (EuroSys) 2007, Lisbon, Portugal, Mar. 20, 2007.
[9] R. Farber. Keeping “Performance” in HPC: A look at the impact of
virtualization and many-core processors. Scientific Computing, 2006.
www.scimag.com.
[10] R. Figueiredo, P. Dinda, and J. Fortes. A Case For Grid Computing
on Virtual Machines. In Proc. of IEEE International Conference on
Distributed Computing Systems, 2003.
[11] A. Gavrilovska, S. Kumar, H. Raj, K. Schwan, V. Gupta, R. Nathuji,
R. Niranjan, A. Ranadive, and P. Saraiya. Scalable Hypervisor
Architectures for High Performance Systems. In Proceedings of the
1st Workshop on System-level Virtualization for High Performance
Computing (HPCVirt) 2007, in conjunction with the 2nd ACM
SIGOPS European Conference on Computer Systems (EuroSys) 2007,
Lisbon, Portugal, Mar. 20, 2007.
[12] W. Huang, J. Liu, and D. Panda. A Case for High Performance
Computing with Virtual Machines. In ICS, 2006.
[13] B. Lin and P. Dinda. VSched: Mixing Batch and Interactive Virtual
Machines Using Periodic Real-time Scheduling. In Proceedings of
ACM/IEEE SC 2005 (Supercomputing), 2005.
[14] J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu,
D. Buntinas, P. Wyckoff, and D. K. Panda. Performance Comparison
of MPI Implementations over InfiniBand, Myrinet and Quadrics. In
Supercomputing’03, 2003.
[15] J. Liu, W. Huang, B. Abali, and D. K. Panda. High Performance
VMM-Bypass I/O in Virtual Machines. In ATC, 2006.
[16] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High
Performance RDMA-Based MPI Implementation over InfiniBand.
In Int’l Conference on Supercomputing (ICS ’03), 2003.
[17] A. Matsunaga, M. Tsugawa, S. Adabala, R. Figueiredo, H. Lam,
and J. Fortes. Science gateways made easy: the In-VIGO approach.
Concurrency and Computation: Practice and Experience, 19(1),
2007.
[18] OpenFabrics Software Stack - OFED 1.1. www.openfabrics.org/.
[19] Presta Benchmark Code. svn.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/mpi/.
[20] P. Ruth, X. Jiang, D. Xu, and S. Goasguen. Virtual Distributed
Environments in a Shared Infrastructure. IEEE Computer, Special
Issue on Virtualization Technologies, 38(5):63–69, 2005.
[21] M. Silberstein, D. Geiger, A. Schuster, and M. Livny. Scheduling
Mixed Workloads in Multi-grids: The Grid Execution Hierarchy.
In Proceedings of the 15th IEEE Symposium on High Performance
Distributed Computing (HPDC), 2006.
[22] M. S. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam,
H. Franke, and J. E. Moreira. Modeling and analysis of dynamic
coscheduling in parallel and distributed environments. In SIGMETRICS, 2002.
[23] A. Sundararaj and P. Dinda. Towards Virtual Networks for Virtual
Machine Grid Computing. In Proceedings of the Third USENIX
Virtual Machine Technology Symposium (VM 2004), 2004.
[24] S. Sur, M. Koop, L. Chai, and D. K. Panda. Performance Analysis
and Evaluation of Mellanox ConnectX InfiniBand Architecture with
Multi-Core Platforms. In 15th Symposium on Hot Interconnects,
2007.
[25] Top500 SuperComputing Sites. www.top500.org.
[26] G. Vallee, T. Naughton, H. Ong, and S. Scott. Checkpoint/Restart of
Virtual Machines Based on Xen. In HAPCW, 2006.
[27] G. Vallée and S. L. Scott. Xen-OSCAR for Cluster Virtualization.
In ISPA Workshop on XEN in HPC Cluster and Grid Computing
Environments (XHPC’06), Dec. 2006.
[28] J. Vetter and F. Mueller. Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures.
In Proc. of Int’l Parallel and Distributed Processing Symposium,
2002.
[29] The VMWare ESX Server. http://www.vmware.com/products/esx/.
[30] Xen Credit Scheduler. wiki.xensource.com/xenwiki/CreditScheduler.
[31] XenSmartIO Mercurial Tree. xenbits.xensource.com/ext/xen-smartio.hg.
[32] J. Xu, M. Zhao, M. Yousif, R. Carpenter, and J. Fortes. On the
Use of Fuzzy Modeling in Virtualized Data Center Management. In
Proceedings of International Conference on Autonomic Computing
(ICAC), Jacksonville, FL, 2007.
[33] L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Evaluating the
Performance Impact of Xen on MPI and Process Execution For HPC
Systems. In International Workshop on Virtualization Technologies
in Distributed Computing (VTDC), with Supercomputing’06, 2006.
[34] L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Paravirtualization for
HPC Systems. In XHPC: Workshop on XEN in High-Performance
Cluster and Grid Computing, 2006.
[35] M. Zhao, J. Zhang, and R. Figueiredo. Distributed File System
Virtualization Techniques Supporting On-Demand Virtual Machine
Environments for Grid Computing. Cluster Computing Journal, 9(1),
2006.