Fido: Fast Inter-Virtual-Machine Communication for Enterprise Appliances
Anton Burtsev†, Kiran Srinivasan, Prashanth Radhakrishnan, Lakshmi N. Bairavasundaram, Kaladhar Voruganti, Garth R. Goodson
†University of Utah, School of Computing; NetApp, Inc.

Enterprise Appliances
• Network-attached storage, routers, etc.
• High performance
• Scalable and highly available access

Example Appliance
• Monolithic kernel
• Kernel components
Problems:
• Fault isolation
• Performance isolation
• Resource provisioning

Split Architecture
[diagram: appliance functionality split across multiple VMs]

Benefits of Virtualization
• High availability
  • Fault isolation
  • Micro-reboots
  • Partial functionality in case of failure
• Performance isolation
  • Resource allocation
• Consolidation and load balancing, VM migration
• Non-disruptive updates
  • Hardware upgrades via VM migration
  • Software updates as micro-reboots
• Computation-to-data migration

Main Problem: Performance
Is it possible to match the performance of a monolithic environment?
• Large amount of data movement between components
  • Mostly cross-core
  • Connection-oriented (established once)
  • Throughput-optimized (asynchronous)
  • Coarse-grained (no one-word messages)
  • Multi-stage data processing
• Main cost contributors
  • Transitions to the hypervisor
  • Memory map/copy operations
  • Not VM context switches (multi-core)
  • Not IPC marshaling

Main Insight: Relaxed Trust Model
• Appliance is built by a single organization
• Components:
  • Pre-tested and qualified
  • Collaborative and non-malicious
• Share memory read-only across VMs!
• Fast inter-VM communication
  • Exchange only pointers to data
  • No hypervisor calls (only cross-core notification)
  • No memory map/copy operations
  • Zero-copy across the entire appliance

Contributions
• Fast inter-VM communication mechanism
• Abstraction of a single address space for traditional systems
• Case study
  • Realistic microkernelized network-attached storage

Design

Design Goals
• Performance
  • High throughput
• Practicality
  • Minimal guest system and hypervisor dependencies
  • No intrusive guest kernel changes
• Generality
  • Support for different communication mechanisms in the guest system

Transitive Zero Copy
• Goal
  • Zero-copy across the entire appliance
  • No changes to the guest kernel
• Observation
  • Multi-stage data processing

Pseudo Global Virtual Address Space
• Insight:
  • CPUs support a 64-bit address space
  • Individual VMs have no need for all of it
[diagram: the 0 to 2^64 virtual address range partitioned into non-overlapping per-VM regions; see the sketch after the design section]

Transitive Zero Copy
[diagram: data handed from VM to VM by reference, never copied]

Fido: High-Level View
• "c" – connection management
• "m" – memory mapping
• "s" – cross-VM signaling

IPC Organization (sketched in code after the design section)
• Shared memory ring
  • Pointers to data
• For complex data structures
  • Scatter-gather array
  • Translate pointers
• Signaling:
  • Cross-core interrupts (event channels)
  • Batching and in-ring polling

Fast Device-Level Communication
• MMNet
  • Link-level
  • Standard network device interface
  • Supports full transitive zero-copy
• MMBlk
  • Block-level
  • Standard block device interface
  • Zero-copy on write
  • Incurs one copy on read
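The pseudo global virtual address space above can be made concrete with a small sketch. The snippet below shows one way a 64-bit address space could be carved into non-overlapping per-VM regions so that a data pointer produced in one VM stays meaningful in every other VM; the region size, base address, and helper names (fido_region_base, fido_ptr_in_region) are illustrative assumptions, not Fido's actual layout.

/*
 * Sketch only: partition the 64-bit virtual address space into
 * per-VM slices so pointers remain valid across VM boundaries.
 * Sizes, base address, and names are assumptions for illustration.
 */
#include <stdint.h>
#include <stdbool.h>

#define FIDO_REGION_SHIFT 40                  /* assumed: 1 TB slice per VM   */
#define FIDO_REGION_SIZE  (1ULL << FIDO_REGION_SHIFT)
#define FIDO_GLOBAL_BASE  (1ULL << 45)        /* assumed base of shared area  */

/* Start of the slice reserved for a given VM. */
static inline uint64_t fido_region_base(unsigned vm_id)
{
    return FIDO_GLOBAL_BASE + (uint64_t)vm_id * FIDO_REGION_SIZE;
}

/* Does a pointer received over IPC fall inside the sender's slice? */
static inline bool fido_ptr_in_region(uint64_t ptr, unsigned sender_vm)
{
    uint64_t base = fido_region_base(sender_vm);
    return ptr >= base && ptr < base + FIDO_REGION_SIZE;
}

Because each VM's memory is mapped read-only into the other VMs at these fixed addresses, a receiver can dereference such a pointer directly instead of copying or remapping the data.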
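The IPC organization above (a shared memory ring carrying only pointers, scatter-gather entries, and cross-core signaling with batching and in-ring polling) can likewise be illustrated with a minimal single-producer/single-consumer ring. The entry layout, ring size, and function names below are assumptions for illustration, not Fido's actual data structures.

/*
 * Minimal SPSC ring sketch: entries carry pointers (global virtual
 * addresses) plus lengths, so only references cross the ring, never data.
 */
#include <stdint.h>
#include <stdbool.h>

#define RING_ENTRIES 256                  /* assumed; must be a power of two */

struct fido_sg {                          /* one scatter-gather element      */
    uint64_t gvaddr;                      /* pointer in the global VA space  */
    uint32_t len;
};

struct fido_ring {
    volatile uint32_t prod;               /* written by the sender           */
    volatile uint32_t cons;               /* written by the receiver         */
    struct fido_sg    slot[RING_ENTRIES];
};

/* Enqueue one descriptor; returns false if the ring is full. */
static bool ring_put(struct fido_ring *r, uint64_t gvaddr, uint32_t len)
{
    uint32_t p = r->prod;
    if (p - r->cons == RING_ENTRIES)
        return false;
    r->slot[p % RING_ENTRIES] = (struct fido_sg){ gvaddr, len };
    __sync_synchronize();                 /* publish the slot before the index */
    r->prod = p + 1;
    return true;                          /* caller batches a cross-core
                                             notification (event channel)     */
}

/* Dequeue one descriptor; the receiver polls this before being signaled. */
static bool ring_get(struct fido_ring *r, struct fido_sg *out)
{
    uint32_t c = r->cons;
    if (c == r->prod)
        return false;
    *out = r->slot[c % RING_ENTRIES];
    __sync_synchronize();                 /* finish the read before freeing the slot */
    r->cons = c + 1;
    return true;
}

In this scheme the sender enqueues a batch of descriptors before raising a single cross-core event, and the receiver keeps polling ring_get() while entries keep arriving, matching the batching and in-ring polling listed above.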
Evaluation

MMNet Evaluation
• Configurations compared: Loop, NetFront, XenLoop, MMNet
• AMD Opteron machine with two 2.1 GHz quad-core CPUs (8 cores total)
• 16 GB RAM
• NVIDIA 1 Gbps NICs
• 64-bit Xen (3.2), 64-bit Linux (2.6.18.8)
• Netperf benchmark (2.4.4)

MMNet: TCP Throughput
[chart: TCP throughput (Mbps) vs. message size (0.5 KB–256 KB) for Monolithic, Netfront, XenLoop, and MMNet]

MMBlk Evaluation
• Configurations compared: Monolithic, XenBlk, MMBlk
• Same hardware
  • AMD Opteron machine with two 2.1 GHz quad-core CPUs (8 cores total)
  • 16 GB RAM
  • NVIDIA 1 Gbps NICs
• VMs are configured with 4 GB and 1 GB RAM
• 3 GB in-memory file system (tmpfs)
• IOZone benchmark

MMBlk Sequential Writes
[chart: throughput (MB/s) vs. record size (4 KB–4 MB) for Monolithic, XenBlk, and MMBlk]

Case Study

Network-Attached Storage
[architecture diagram: the NAS split across multiple VMs communicating over Fido]
• RAM
  • VMs have 1 GB each, except the file-system VM (4 GB)
  • Monolithic system has 7 GB RAM
• Disks
  • RAID5 over three 64 MB/s disks
• Benchmark
  • IOZone reads/writes an 8 GB file over NFS (async)

Sequential Writes
[chart: throughput (MB/s) vs. record size (4 KB–4 MB) for Monolithic, Native-Xen, and MM-Xen]

Sequential Reads
[chart: throughput (MB/s) vs. record size (4 KB–4 MB) for Monolithic, Native-Xen, and MM-Xen]

TPC-C (On-Line Transaction Processing)
[chart: transactions per minute (tpmC) for Monolithic, MM-Xen, and Native-Xen]

Conclusions
• We match monolithic performance
  • "Microkernelization" of traditional systems is possible!
• Fast inter-VM communication
  • The search for VM communication mechanisms is not over
• Important aspects of design
  • Trust model
  • VM as a library (for example, FSVA)
  • End-to-end zero copy
  • Pseudo global virtual address space
• There are still problems to solve
  • Full end-to-end zero copy
  • Cross-VM memory management
  • Full utilization of pipelined parallelism

Thank you.
aburtsev@flux.utah.edu

Backup Slides

Related Work
• Traditional microkernels [L4, EROS, Coyotos]
  • Synchronous (effectively thread migration)
  • Optimized for single-CPU: fast context switches, small messages (often in registers), efficient marshaling (IDL)
• Buffer management [Fbufs, IO-Lite, Beltway Buffers]
  • Shared buffer is a unit of protection
  • FastForward: fast cache-to-cache data transfer
• VMs [Xen split drivers, XWay, XenSocket, XenLoop]
  • Page flipping, later buffer sharing
  • IVC, VMCI
• Language-based protection [Singularity]
  • Shared heap, zero-copy (only pointer transfer)
• Hardware acceleration [Solarflare]
• Multi-core OSes [Barrelfish, Corey, fos]