Performance Implications of Virtualizing Multicore Cluster Machines

Adit Ranadive, Mukil Kesavan, Ada Gavrilovska, Karsten Schwan
Center for Experimental Research in Computer Systems (CERCS), Georgia Institute of Technology, Atlanta, Georgia, 30332
{adit262, mukil, ada, schwan}

Abstract

High performance computers are typified by cluster machines constructed from multicore nodes and using high performance interconnects like Infiniband. Virtualizing such ‘capacity computing’ platforms implies the shared use of not only the nodes and node cores, but also of the cluster interconnect (e.g., Infiniband). This paper presents a detailed study of the implications of sharing these resources, using the Xen hypervisor to virtualize platform nodes and exploiting Infiniband’s native hardware support for its simultaneous use by multiple virtual machines. Measurements are conducted with multiple VMs deployed per node, using modern techniques for hypervisor bypass for high performance network access, and evaluating the implications of resource sharing with different patterns of application behavior. Results indicate that multiple applications can share the cluster’s multicore nodes without undue effects on the performance of Infiniband access and use. Higher degrees of sharing are possible with communication-conscious VM placement and scheduling.

Categories and Subject Descriptors D.4.7 [Operating Systems]: Organization and Design; C.2.4 [Computer-Communication Networks]: Distributed Systems; C.5.1 [Computer System Implementation]: Large and Medium Computers

General Terms Design, Performance, Management, Reliability

Keywords Virtualization, High-performance Computing, Infiniband

1. Introduction

In the enterprise domain, virtualization technologies like VMWare’s ESX server [29] and the Xen hypervisor [3] are becoming a prevalent solution for resource consolidation, power reduction, and dealing with bursty application behaviors.
Amazon’s Elastic Compute Cloud (EC2) [2], for instance, uses virtualization to offer datacenter resources (e.g., clusters or blade servers) to applications run by different customers, safely providing different kinds of services to diverse codes running on the same underlying hardware (e.g., trading systems jointly with software used for financial analysis and forecasting). Virtualization has also been shown to be an effective vehicle for dealing with machine failures, improving application portability, and helping to debug complex application codes.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPCVirt ’08 March 31, 2008, Glasgow, Scotland. Copyright © 2008 ACM 1-59593-090-6/05/0007...$5.00

In high performance systems, research has demonstrated virtualized network interfaces [12], shown the benefits of virtualization for grid applications [1, 17, 20, 10, 23, 35], and argued the utility of these technologies for attaining high reliability for large scale machines [26]. Furthermore, key industry providers of HPC technology are actively developing efficient, lightweight virtualization solutions, an example being the close collaboration between vendors of high performance IO solutions like Infiniband, such as Cisco and Mellanox, with representatives of the virtualization industry, including VMWare and Xen. Here, a key motivator is the importance of virtualization for the ‘capacity’ systems in common use in both the scientific and commercial domains, the latter including financial institutions, retail, telecom and transportation corporations, providers of web and information services, and gaming applications [25].
In fact, an analysis of the Top500 list demonstrates over 30 application areas, most of which do not belong to the category of traditional HPC scientific codes. Finally, when industry uses large scale HPC systems, now even including IBM’s Bluegene, as platforms for ‘utility’ or ‘cloud’ computing [4, 7], virtualization makes it possible to package client application components into isolated guest VMs that can be cleanly deployed onto and share underlying platform resources.

Despite these trends, scientists running traditional high performance codes have been reluctant to adopt virtualization technologies. In part, this is because of their desire to exploit all available platform resources to attain the performance gains sought by use of ‘capability’ HPC machines. Perhaps more importantly, however, this is because resource sharing can degrade the high levels of performance sought by HPC codes. As a result, the extent to which virtualization technologies will be adopted in the HPC domain remains unclear [9, 11].

This paper contributes experimental insights and measurements to better understand the effects of resource sharing on the performance of HPC applications. Specifically, for multiple virtual machines running on multicore platforms, we evaluate the extent to which their communications are affected by the fact that they share a single communication resource, using an Infiniband interconnect as the concrete instance of such a resource. Stated more precisely, using standard x86-based quadcore nodes and the Xen hypervisor, we evaluate the degree of sharing possible via Infiniband under a range of platform parameters and application characteristics.
The purpose is (1) to understand the performance implications and overheads of supporting multiple VMs on virtualized multicore IB platforms; (2) to explore the performance implications of different inter- and intra-VM interaction patterns on such platforms; and (3) to devise suitable deployment, co-location, and scheduling strategies for individual VMs onto shared virtualized resources.

Experimental results presented in the paper demonstrate that a high level of sharing, that is, a significant number of VMs deployed to each node, is feasible without noticeable performance degradation, despite the fact that VM-VM communications share a single Infiniband interconnect. Further, sharing is facilitated by methods for VM deployment and scheduling that are aware of VMs’ communication behaviors (i.e., communication-awareness) and of the requirements on communications imposed by VMs (i.e., awareness of the SLAs, or ‘Service Level Agreements’, sought by VMs). Technically, this involves (1) manipulating hypervisor-level parameters like scheduling weights, (2) carrying out service-level actions like mapping VMs’ QoS requirements to Infiniband virtual lanes, and (3) devising suitable system-level policies for VM migration and deployment. This paper lays the foundation for such future technical work, by providing experimental insights into the bottlenecks such mechanisms will need to avoid and/or the performance levels they can be expected to deliver.

Remainder of paper. The remainder of the paper is organized as follows. Section 2 describes our experimental testbed and methodology. Sections 3 and 4 discuss the experimental results gathered with various VM loads and deployments and different inter- and intra-VM communication patterns, for native RDMA communication and MPI applications, respectively. A brief survey of related work and concluding remarks appear in the last two sections.

2.
Testbed All experimental evaluations are performed on a testbed consisting of 2 Dell 1950 PowerEdge servers, each with 2 Quad-core 64-bit Xeon processors at 1.86 GHz. The servers have Mellanox MT25208 HCAs, operating in the 23208 Tavor compatibility mode, connected through a Cisco/Topspin 90 switch. Each server is running the RHEL 4 Update 5 OS (paravirtualized 2.6.18 kernel) in dom0 with the Xen 3.1 hypervisor. The virtualized Infiniband implementation available on the Xensource site is based on Xen 3.0 with the BVT scheduler [31] and uses kernel sockets for the initial Infiniband split driver setup. Since this implementation does not scale well for multiple VMs, we changed the initial driver setup to be performed over Xenbus instead, and we ported the entire implementation to Xen 3.1 to analyze the new credit scheduler’s [30, 6] impact on Infiniband performance. The guest kernels are paravirtualized running the RHEL 4 Update 5 OS. Each guest is allocated 256 MB of RAM. For running Infiniband applications within the guests, OFED (Open Fabrics Enterprise Distribution) 1.1 [18] is modified to be able to use the virtualized IB driver. Microbenchmarks include the RDMA benchmarks from the OFED1.1 distribution and the Presta MPI from Lawrence Livermore National Labs [19]. These permit us to evaluate the performance impact of executing multiple VMs on shared virtualized resources, for both native IB RDMA and for MPI communications, as well as to consider various VM-VM interaction patterns. For running the Presta MPI Suite, OpenMPI 1.1.1 is installed on dom0 and on domUs. A specific challenge in communication fabrics that support asynchronous IO, like Infiniband, is the inability to obtain accurate timing measurements without additional hardware support. Our results are based on time measurements gathered before posting an IO request and after the corresponding completion event is detected via a polling interface. 
This approach has been accepted in the community as a viable approximation of the exact timings of various asynchronous IO operations [16, 14].

3. Experimental Evaluation - Microbenchmarks

The first set of measurements evaluates the Infiniband RDMA communication layer. We do not include IPoIB measurements, as those numbers are inferior in performance compared to native RDMA support. Tests are run with different numbers and deployments of VMs per core and per IB node and with different scheduling criteria. Measurements are taken for the three basic operations in Infiniband, which are RDMA Write, RDMA Send/Receive, and RDMA Read, in terms of average bandwidth and latency (RDMA Write only). Each test consists of 5000 iterations performed for each of the message sizes, as shown in the graphs (from 2B to 8MB). The MTU size in these experiments is 2KB.

Basic Benchmarks. For the graphs in Figure 1, the setup of the Virtual Machines is symmetric, i.e., running an equal number of VMs on the two physical machines, denoted, for example, as a 2VM-2VM test. The motivation is to understand the performance effects of multiple VMs sharing the same Infiniband HCA. The first graph in Figure 1 shows that the differences in average bandwidth for RDMA Write and Send/Recv tests, achieved running inside a VM vs. in a non-virtualized platform, are practically negligible. This shows that virtualization does not impose noticeable overheads on IB throughput. Varying the number of VMs on each machine from 1 to 6, we find that the per-VM bandwidths converge approximately to the total maximum bandwidth divided by the number of VMs. This occurs for larger message sizes, where the network link becomes saturated with data. As the number of VMs increases, saturation occurs at ever smaller message sizes. At the same time, the total bandwidth perceived by VMs in non-saturated cases (e.g., up to 64k in the case of 2VMs and 32k for 3VMs) is the maximum sustainable bandwidth. This implies that 1.
the shared use of IB interconnects by multiple VMs is both viable and reasonable, as long as the total bandwidth required by all simultaneously running VMs remains below the maximum sustainable bandwidth. Further, 2. network bandwidth is divided equally among all VMs, with RDMA Write delivering the highest performance, followed by RDMA Send/Receive. RDMA Read performs worst, as well documented in other work [16]. Finally, 3. the maximum bandwidth achieved by any of the RDMA operations is 932 MBps, or approximately 7.5Gbps. Effects of Scheduling. The next test demonstrates the effects of pinning VMs to different and/or the same physical CPUs (PCPUs), thereby controlling the physical resources available to each VM. The Xen scheduler allows guest VMs to either use a specific CPU or any CPU that is free when the VM is scheduled. Specifically, with Xen 3.1’s default Credit Scheduler [30], the same weight is assigned to each VM that is pinned to the same CPU, so that each VM receives an equal CPU share. Note that for these and all future experiments, we show only the results for the RDMA Write microbenchmark. It consistently delivers the highest performance compared to other microbenchmarks. The graphs in Figure 2 show that when all VMs are assigned to the same physical CPU, the bandwidth attained by each VM is highly variable. This is due to the fact that the Xen scheduler shares the CPU by continuously swapping out/swapping in these VMs. In contrast, 4. the performance attained by VMs pinned to different PCPUs is both higher and more consistent, in terms of the average bandwidth achieved by each VM. In both cases, however, average bandwidth converges to maximum bandwidth divided by the number of VMs, as with the simple benchmark tests described above. Furthermore, 5. as the link becomes saturated with increasing message sizes, the average bandwidth attained by each VM decreases. Conclusions derived from these results include the following. 
First, even when co-locating VMs on the same physical CPU, performance degradation will not occur until total required bandwidth exceeds available IB resources. Second, the “plateau” in each of the graphs shows that even for the case of 6VMs per single machine, we can still achieve the maximum sustainable performance level, as in the native case. The width of this “plateau” is dependent upon the number of VMs and the message sizes.

Figure 1. RDMA Performance Numbers (panels: Native Infiniband v/s 1VM-1VM RDMA Results, with series Write - Native IB, Send/Recv - Native IB, Write - 1 VM, Send/Recv - 1 VM; RDMA Write Results, RDMA Send/Recv Results, and RDMA Read Results, with series 1VM, 2VM, 3VM, 4VM, 6VM; bandwidth in MBps vs. message size in bytes)

Latency Tests. Figure 3 shows the latencies recorded for different numbers of VMs. The latencies are measured for pairs of VMs communicating across two physical nodes. As a baseline, we also include measurements performed for communications between the dom0s on the virtualized machines. Results show that 6. the typical latency for an RDMA Write operation does not change much as the number of VMs increases.
This is because VMM-bypass capable interconnects like Infiniband avoid the frontend-backend communication overheads experienced by other Xen devices. However, 7. as message sizes increase, latencies increase exponentially due to bandwidth saturation. For smaller message sizes, the difference in latencies in dom0 and VMs is negligible (on the order of less than 10 usec), thereby demonstrating the effectiveness of Infiniband’s VMM-bypass implementation. 4. MPI benchmarks For the MPI benchmarks we use the Lawrence Livermore National Lab Presta MPI benchmark suite. The two benchmark tests used include (1) the com test, used to analyze the impact of virtualization on inter-process communication bandwidth and latency, and (2) the glob test, used to analyze the impact on collective operations across VMs or processes within a VM. MPI Com Test. The com test is an indicator of link saturation between pairs of communicating MPI processes. All of the results reported below are for the unidirectional test. The various test configurations and the resulting trends discovered are listed below: 1. Virtualization Overhead Measurement. The com test is run across two native Linux 2.6.18 kernels and 2 VMs, with one process per machine, virtual or otherwise. 2. Xen credit scheduler effects on IB-based applications running on VMs. It is important to analyze the effects of virtual machine scheduling on applications running in VMs. We run one MPI process per VM, and use two test configurations, where in one configuration, all VMs are pinned to different physical CPU cores and in the other, all the VMs are pinned to the same physical CPU. This represents the ‘best’ vs. ‘worst’ cases concerning the effects of scheduling on communication performance. Tests are performed with 2 and 4 VMs, respectively, running on the same physical machine. 3. Latency Variation due to VM load. 
To measure the variation in communication latency due to VM load resulting from different distributions of processes across VMs, we use 2, 4, 8, 16, and 32 communicating MPI processes on 2, 4, and 8 virtual machines, with a fair distribution of processes across VMs. We devise two tests: (1) all VMs pertaining to a measurement run are pinned across two physical cores, i.e., multiple VMs may share the same physical cpu core; and (2) 8 physical cores are used for the VMs pertaining to a measurement run, i.e., in some cases, a VM may have more than one VCPU available to it. The rationale is that test results make it possible to compare the effects of load on the native Linux scheduler (the Linux O(1) scheduler in the kernel version used in our tests) vs. the Xen credit scheduler. Measurements depict the com latency for a pre-configured number of operations for each pair of VMs vs. processes.

4. Bandwidth variation due to cpu capping. To simulate cases in which a VM running MPI shares the same CPU with other applications, we use cpu caps of 25, 75, and 100, expressed as the percentage availability of the physical cpu. These caps are applied to each VM in a 2 VMs and a 4 VMs case, where 1 MPI process runs on each of these VMs.

Figure 2. Effect of CPU Pinning on RDMA Operations in MultiVM (panels: 2VM-2VM, 3VM-3VM, 4VM-4VM, and 6VM-6VM Write BW - CPU Pinning; series: Avg BW - Same PCPU, Avg BW - Diff PCPU; bandwidth in MBps vs. message size in bytes)

Figure 3. RDMA write latency (series: 1VM-1VM, 2VM-2VM, 3VM-3VM, 4VM-4VM, Dom0-Dom0; latency in µsec vs. message size in bytes)

Figure 4. Virtualization overhead (series: Native IB, 2 Dom0s, 2 VMs; bandwidth in MBps vs. message size in bytes)

Figure 5. Impact of pinning VMs on the same core (series: 2 VMs, 4 VMs, and 8 VMs on different cores; 2 VMs and 4 VMs on the same core; bandwidth in MBps vs. message size in bytes)

Figure 4 measures the unidirectional inter-process bandwidth achieved for pairs of MPI processes. Multiple message sizes are evaluated for a native Linux install vs. 2 Dom0s (i.e., the base case) vs. 2 VMs pinned on different physical processor cores. It is evident from the figure that virtualization does not cause additional overheads for MPI communications. Related work has already demonstrated the negligible overheads on MPI processes when deployed in a single VM per node [33, 15].
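The pinning and capping configurations used in the com tests could be set up along the following lines. This is only a sketch under stated assumptions: the paper does not list the commands that were used, so we assume the `xm` toolstack shipped with Xen 3.x (`xm vcpu-pin` and `xm sched-credit`), and the domain names `vm1` to `vm4` are hypothetical. The helpers only build the command lines; they do not execute them.

```python
from typing import List


def pin_commands(domains: List[str], cores: List[int]) -> List[List[str]]:
    """Build `xm vcpu-pin <domain> <vcpu> <cpus>` argument lists.

    Pins VCPU 0 of each domain to the given physical core. Repeating a
    core in `cores` gives the 'same core' configuration.
    """
    if len(domains) != len(cores):
        raise ValueError("need exactly one core per domain")
    return [["xm", "vcpu-pin", dom, "0", str(core)]
            for dom, core in zip(domains, cores)]


def cap_commands(domains: List[str], cap_percent: int) -> List[List[str]]:
    """Build `xm sched-credit -d <domain> -c <cap>` argument lists.

    The credit scheduler expresses a cap as a percentage of one physical
    CPU. A cap of 0 means 'uncapped' in Xen, so this sketch rejects it.
    """
    if not 1 <= cap_percent <= 100:
        raise ValueError("cap must be between 1 and 100 for one VCPU")
    return [["xm", "sched-credit", "-d", dom, "-c", str(cap_percent)]
            for dom in domains]


if __name__ == "__main__":
    vms = ["vm1", "vm2", "vm3", "vm4"]  # hypothetical domain names
    different_cores = pin_commands(vms, [0, 1, 2, 3])  # 'best' case
    same_core = pin_commands(vms, [0, 0, 0, 0])        # 'worst' case
    capped = cap_commands(vms, 25)                     # 25% of one core
    for cmd in different_cores + same_core + capped:
        print(" ".join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to a process launcher on dom0 or to log them next to each measurement run.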
Figure 5 shows the bandwidths achieved for multiple pairs of MPI processes, each running in its own VM, where the VM (1) has its own physical cpu core and (2) is sharing a physical cpu core with the other VMs. The most notable trend in the graph is that when multiple VMs are all pinned on the same physical CPU core, the bandwidths for message sizes greater than 8KB drop drastically compared to the case when the VMs are pinned to different cores. This is primarily due to VM scheduling overheads. 8. As the VMs share the single physical CPU for small time slices, the smaller messages are sent in a single time slice when the VM is scheduled, so that VM performance is not significantly affected. For larger messages, with sizes greater than 8KB for the tests considered, the VMs are de-scheduled and have to be rescheduled, one or more times, to complete the data transfer.

Figure 6. Latency comparison for multiple processes and VMs with 8 vs. 2 cores (series: 2VMs, 4VMs, 8VMs; latency in µsec vs. number of MPI processes, from 2 to 32)

Figure 7. Bandwidth variations due to cpu cap (panels: 2 VMs, 4 VMs; series: BW - 25%, BW - 50%, BW - 75%, BW - 100%; bandwidth in MBps vs. message size in bytes)

The graphs in Figure 6 show the latencies measured for sets of communicating MPI processes performing a fixed number of operations, on multiple VMs sharing 8 and 2 physical cores, respectively.
Details about these measurements include:
• VMs sharing 8 cores: when there are fewer than 8 VMs, the number of VCPUs available per VM is increased to distribute the 8 physical cores evenly amongst VMs.
• VMs sharing 2 cores: when there are more than 2 VMs, we pin multiple VMs on a single physical cpu core such that the load on each available cpu core is balanced.
Measurements indicate that latency increases as the number of processes per VM increases. In essence, a heavily loaded VM tends to perform poorly irrespective of the presence of an RDMA-based MPI implementation. Further, giving a VM more VCPUs for use by guest OS processes appears to be less effective than using a larger number of VMs. This is likely due to the interaction between the guest OS scheduler and the VM scheduler.

Figure 7 shows the variation in bandwidth with 2 VMs and 4 VMs, respectively, for different CPU caps and with each VM running on a different core. Smaller CPU caps result in higher variations in total achieved bandwidths for large message sizes, again due to scheduling effects (e.g., VMs losing the CPU while communicating).

MPI Glob Test. The glob test from the Presta MPI Benchmark is used to measure the latencies of MPI collectives. The MPI Reduce, MPI Broadcast, and MPI Barrier collectives are measured, all of which are frequently used in high performance applications [28]. We perform these measurements to better understand the implications on communication performance of co-deploying interacting VMs on a virtualized infrastructure. Experimental evaluations consider the following configurations:
1. 1 MPI process / VM, with a varied number of VMs, and compared with dom0 results;
2. 4 processes running on different dom0s vs. 4 processes in 4 VMs (all on same physical multicore machine); and
3. 8 processes, with a varied number of VMs, i.e., 2, 4, and 8 VMs.
Experimental results for the first configuration are depicted by the graphs in Figure 8.
The latencies shown in the broadcast and allreduce graphs are similar to earlier results, demonstrating that for smaller message sizes, the latencies for the MPI Collectives do not vary much as the number of VMs increases. Even at a finer grained scale, the latency differences between the dom0 and 8VM cases are less than 10usec, for message sizes up to 64k. For larger message sizes, as bandwidth is saturated, the latencies increase. In the barrier test in Figure 8, the number in the brackets indicates the number of MPI processes running across the virtual/physical machines. The notation 4Dom0 in the figure indicates that 4 MPI processes run in dom0.

Figure 8. Latencies of collective operations across VMs (panels: Glob: Broadcast - 1 MPI process/VM and Glob: AllReduce - 1 MPI process/VM, with series Dom0, 2VMs, 4VMs, 8VMs, latency in µsec vs. message size in bytes; Glob: Barrier Test, time in µsec for configurations Dom0 (2), 2VMs (2), 4VMs (4), 8VMs (8), 4Dom0 (4))

Figure 9. Latencies for collective operations for 4 MPI processes within one domain (dom0) or across 4 VMs (panels: Glob: Broadcast - Using VMs v/s dom0, Glob: AllReduce - Using VMs v/s dom0; series: 4VMs, 4Dom0; latency in µsec vs. message size in bytes)

Figure 10. Latencies for collective operations for different number of processes per VM (panels: Glob: Broadcast - 8 processes Varying #VMs, Glob: AllReduce - 8 processes Varying #VMs; series: 2VMs, 4VMs, 8VMs; latency in µsec vs. message size in bytes)
The increased overheads in the barrier case are expected because the increased amount of VM-VM interaction is not amortized by any gains in performance due to improved ability for data movement between processes in the VM and the Infiniband network. We are planning additional tests to gather information from low-level performance counters, such as VMentry/exit operations, time spent in the hypervisor, etc., which we believe will help better explain the observed behaviors for these types of collective operations.

The experiments presented in Figure 9 compare the performance of MPI processes running in multiple VMs versus running in the same VM. The performance of 4VMs (1 MPI process/VM) versus 4 processes in dom0 shows little difference in terms of latency for messages of up to 64KB. In these tests, we use the default Xen scheduling policy. These results demonstrate that, depending on the types of interactions between application processes and the amount of IO performed, it can be acceptable to structure individual components as separate VMs, all deployed on the same platform. This can be useful in maintaining isolation between different application components, or to leverage the Xen-level mechanisms for dynamic VM migration for reliability or load balancing purposes.

Similar tests shown in Figure 10 investigate the impact of varying the number of processes running within a VM. Unlike the barrier case in Figure 8 above, results show that 9. broadcast or allreduce communication patterns benefit if they are structured across a larger number of VMs, particularly for larger message sizes. The best case is the one in which each process runs in its own VM, because this reduces the additional scheduling overheads within the guest VM (the Linux scheduler) and at the VMM level (the Xen scheduler).
These results further strengthen our experimental demonstration of the fact that multiple VMs can easily share a single virtualized platform, even in the high performance domain.

5. Related Work

Other research efforts that have analyzed the performance overheads of virtualizing Infiniband platforms with the Xen hypervisor appear in [15, 24]. Our work differs in that it specifically focuses on the effects on communication performance when virtualized multicore platforms are shared by many collaborating VMs. For these purposes, the IB split driver was modified to enable guest VM-dom0 interactions via Xenbus, which made it possible for multiple VMs to be instantiated in an efficient and scalable manner, thereby enabling the experiments described in this paper.

The opportunities for virtualization in the HPC domain have been investigated in multiple recent research efforts. The work described in [33, 34] assesses the performance impact of Xen-based virtualization on high performance codes running MPI, specifically focusing on the Xen-related overheads. It does not take into account the effects of any specific platform characteristics, such as the multicore processing nodes or the Infiniband fabric considered by our work. Other efforts have used virtualization to ease reliability, management, and development and debugging for HPC systems and applications [26, 27, 8]. The results described in this paper complement these efforts. Finally, many research efforts use virtualization for HPC grid services [35, 23, 17, 20]; our complementary research focus is to understand the performance factors in deploying multiple VMs on the individual multicore resources and cluster machines embedded in such grids.

There is much related work on managing shared data centers [5, 32], including considering deployment issues for mixes of batch and interactive VMs on shared cluster resources [13], cluster management, and co-scheduling and deployment of cluster processes [22, 21].
Our future research will build on such work to create a QoS-aware management architecture that controls the shared use of virtualized high performance resources.

6. Conclusions and Future Work

This paper presents a detailed study of the implications of sharing high performance multicore cluster machines that use high end interconnection fabrics like Infiniband and that are virtualized with standard hypervisors like Xen. Measurements are conducted with multiple VMs deployed per node, using modern techniques for hypervisor bypass for high performance network access. Experiments evaluate the implications of resource sharing with different patterns of application behavior, including number of processes deployed per VM, types of communication patterns, and amounts of available platform resources. Results indicate that multiple applications can share multicore virtualized nodes without undue performance effects on Infiniband access and use, with higher degrees of sharing possible with communication-conscious VM placement and scheduling. Furthermore, depending on the types of interactions between application processes and the amounts of IO performed, it can be beneficial to structure individual components as separate VMs rather than placing them into a single VM. This is because such placements can avoid undesirable interactions between guest OS-level and VMM-level schedulers. Such placement can also bring additional benefits for maintaining isolation between different application components, or for load-balancing, reliability and fault-tolerance mechanisms that can leverage the existing hypervisor- (i.e., Xen-) level VM migration mechanisms.

Our future work will derive further insights from the experimental results discussed in Sections 3 and 4 by gathering additional low level performance information, including time spent in the hypervisor, number of ‘world switches’ between the VMs and the hypervisor, etc., using tools like Xenoprofile.
The idea is to attain greater insights into the implications of shared use of virtualized platforms and the manner in which the platforms’ resources should be distributed among running VMs. We hope to be able to include select results from such measurements in the final version of this paper.

In addition, we plan to extend this work to analyze the ability of virtualized Infiniband platforms to meet different QoS requirements and honor SLAs for sets of collaborating VMs, by manipulating parameters such as VM deployment onto or across individual platform nodes, resource allocation and hypervisor-level scheduling parameters on these multicore nodes, and fabric-wide policies for Service Level (SL) to Virtual Lane (VL) mappings. Certain extensions of our current testbed are necessary to make these measurements possible.

The longer term goal of our research is to devise new management mechanisms and policies for QoS-aware management architectures for shared high performance virtualized infrastructures.

References

[1] S. Adabala, V. Chadha, P. Chawla, R. Figueiredo, J. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu. From virtualized resources to virtual computing grids: the In-VIGO system. Future Generation Computer Systems, 21(6):896–909, 2005.
[2] Amazon Elastic Compute Cloud (EC2).
[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In SOSP 2003, 2003.
[4] IBM Research Blue Gene.
[5] J. Chase, L. Grit, D. Irwin, J. Moore, and S. Sprenkle. Dynamic Virtual Clusters in a Grid Site Manager. In Twelfth International Symposium on High Performance Distributed Computing (HPDC12), 2003.
[6] L. Cherkasova, D. Gupta, and A. Vahdat. Comparison of the Three CPU Schedulers in Xen. ACM SIGMETRICS Performance Evaluation Review, 35(2):42–51, 2007.
[7] Technology Review: Computer in the Cloud.
[8] C. Engelmann, S. L. Scott, H. Ong, G. Vallée, and T. Naughton.
Configurable Virtualized System Environments for High Performance Computing. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, Mar. 20, 2007.
[9] R. Farber. Keeping “Performance” in HPC: A look at the impact of virtualization and many-core processors. Scientific Computing, 2006.
[10] R. Figueiredo, P. Dinda, and J. Fortes. A Case For Grid Computing on Virtual Machines. In Proc. of IEEE International Conference on Distributed Computing Systems, 2003.
[11] A. Gavrilovska, S. Kumar, H. Raj, K. Schwan, V. Gupta, R. Nathuji, R. Niranjan, A. Ranadive, and P. Saraiya. Scalable Hypervisor Architectures for High Performance Systems. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, Mar. 20, 2007.
[12] W. Huang, J. Liu, and D. Panda. A Case for High Performance Computing with Virtual Machines. In ICS, 2006.
[13] B. Lin and P. Dinda. VSched: Mixing Batch and Interactive Virtual Machines Using Periodic Real-time Scheduling. In Proceedings of ACM/IEEE SC 2005 (Supercomputing), 2005.
[14] J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. K. Panda. Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics. In Supercomputing’03, 2003.
[15] J. Liu, W. Huang, B. Abali, and D. K. Panda. High Performance VMM-Bypass I/O in Virtual Machines. In ATC, 2006.
[16] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In Int’l Conference on Supercomputing (ICS ’03), 2003.
[17] A. Matsunaga, M. Tsugawa, S. Adabala, R. Figueiredo, H. Lam, and J. Fortes. Science gateways made easy: the In-VIGO approach.
Concurrency and Computation: Practice and Experience, 19(1), 2007.
[18] OpenFabrics Software Stack - OFED 1.1.
[19] Presta Benchmark Code.
[20] P. Ruth, X. Jiang, D. Xu, and S. Goasguen. Virtual Distributed Environments in a Shared Infrastructure. IEEE Computer, Special Issue on Virtualization Technologies, 38(5):63–69, 2005.
[21] M. Silberstein, D. Geiger, A. Schuster, and M. Livny. Scheduling Mixed Workloads in Multi-grids: The Grid Execution Hierarchy. In Proceedings of the 15th IEEE Symposium on High Performance Distributed Computing (HPDC), 2006.
[22] M. S. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, and J. E. Moreira. Modeling and analysis of dynamic coscheduling in parallel and distributed environments. In SIGMETRICS, 2002.
[23] A. Sundararaj and P. Dinda. Towards Virtual Networks for Virtual Machine Grid Computing. In Proceedings of the Third USENIX Virtual Machine Technology Symposium (VM 2004), 2004.
[24] S. Sur, M. Koop, L. Chai, and D. K. Panda. Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms. In 15th Symposium on Hot Interconnects, 2007.
[25] Top500 SuperComputing Sites.
[26] G. Vallee, T. Naughton, H. Ong, and S. Scott. Checkpoint/Restart of Virtual Machines Based on Xen. In HAPCW, 2006.
[27] G. Vallée and S. L. Scott. Xen-OSCAR for Cluster Virtualization. In ISPA Workshop on XEN in HPC Cluster and Grid Computing Environments (XHPC’06), Dec. 2006.
[28] J. Vetter and F. Mueller. Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures. In Proc. of Int’l Parallel and Distributed Processing Symposium, 2002.
[29] The VMWare ESX Server.
[30] Xen Credit Scheduler.
[31] XenSmartIO Mercurial Tree. smartio.hg.
[32] J. Xu, M. Zhao, M. Yousif, R. Carpenter, and J. Fortes. On the Use of Fuzzy Modeling in Virtualized Data Center Management. In Proceedings of International Conference on Autonomic Computing (ICAC), Jacksonville, FL, 2007.
[33] L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC Systems. In International Workshop on Virtualization Technologies in Distributed Computing (VTDC), with Supercomputing’06, 2006.
[34] L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Paravirtualization for HPC Systems. In XHPC: Workshop on XEN in High-Performance Cluster and Grid Computing, 2006.
[35] M. Zhao, J. Zhang, and R. Figueiredo. Distributed File System Virtualization Techniques Supporting On-Demand Virtual Machine Environments for Grid Computing. Cluster Computing Journal, 9(1), 2006.