VMTorrent: Scalable P2P Virtual Machine Streaming
Joshua Reich, Oren Laadan, Eli Brosh, Alex Sherman, Vishal Misra, Jason Nieh, and Dan Rubenstein

VM Basics
• VM: software implementation of a computer
• Implementation stored in a VM image
• VM runs on a VMM
  – Virtualizes HW
  – Accesses the image
[diagram: VM image, VMM, VM]

Where is Image Stored?

Traditionally: Local Storage
[diagram: VM image on local storage, VMM, VM]

IaaS Cloud: on Network Storage
[diagram: VM image on network storage, VMM, VM]

Can Be Primary Network Storage
• VM image accessed in place over NFS/iSCSI
• e.g., OpenStack Glance, Amazon EC2/S3, vSphere network storage

Or Secondary Network Storage
• VM image copied to local storage first
• e.g., Amazon EC2/EBS, vSphere local storage

Either Way, No Problem Here
[diagram: host-side path from VM image to VMM to VM]

Here?
[diagram: network storage serving VM images]
Bottleneck!

Lots of Unique VM Images
• On EC2 alone, 54,784 unique images*
*http://thecloudmarket.com/stats#/totals, 06 Dec 2012

Unpredictable Demand
• Lots of customers
• Spot-pricing
• Cloud-bursting

Don’t Just Take My Word
• “The challenge for IT teams will be finding way to deal with the bandwidth strain during peak demand - for instance when hundreds or thousands of users log on to a virtual desktop at the start of the day - while staying within an acceptable budget” 1
• “scale limits are due to simultaneous loading rather than total number of nodes” 2
• Developer proposals to replace or supplement the VM launch architecture for greater scalability 3
1. http://www.zdnet.com/why-so-many-businesses-arent-ready-for-virtual-desktops7000008229/?s_cid=e539
2. http://www.openstack.org/blog/2011/12/openstack-deployments-abound-at-austin-meetup129
3. https://blueprints.launchpad.net/nova/+spec/xenserver-bittorrent-images

Challenge: VM Launch in IaaS
• Minimize delay in VM execution
• Starting from the time the launch request arrives
• For lots of instances (scale!)

Naive Scaling Approaches
• Multicast
  – Setup, configuration, maintenance, etc. 1
  – ACK implosion
  – “multicast traffic saturated the CPU on [Etsy] core switches causing all of Etsy to be unreachable” 2
1. [El-Sayed et al., 2003; Hosseini et al., 2007]
2. http://codeascraft.etsy.com/2012/01/23/solr-bittorrent-index-replication

Naive Scaling Approaches
• P2P bulk data download (e.g., BitTorrent)
  – Files are big (wastes bandwidth)
  – Must wait until the whole file is available (wastes time)
  – Network primary? Must store the multi-GB image in RAM!
Both Miss Big Opportunity
VM image access is
• Sparse
• Gradual
• Most of the image doesn’t need to be transferred
• Can start w/ just a couple of blocks

VMTorrent Contributions
• Architecture
  – Make (scalable) streaming possible: decouple data delivery from presentation
  – Make scalable streaming effective: profile-based image streaming techniques
• Understanding / Validation
  – Modeling for VM image streaming
  – Prototype & evaluation (not highly optimized)

Talk
• Make (scalable) streaming possible: decouple data delivery from presentation
• Make scalable streaming effective: profile-based image streaming techniques
• VMTorrent prototype & evaluation (modeling along the way)

Decoupling Data Delivery from Presentation (Making Streaming Possible)

Generic Virtualization Architecture
• Virtual Machine Monitor virtualizes hardware
• Conducts I/O to the image through the host file system
[diagram: VM, VMM, host FS, VM image]

Cloud Virtualization Architecture
• Network backend used
  – Either to download the image
  – Or to access it via a remote FS
[diagram: VM, VMM, FS, network backend, VM image]

VMTorrent Virtualization Architecture
• Introduce a custom file system
• Divide the image into pieces
• But provide the appearance of a complete image to the VMM
[diagram: VM, VMM, custom FS holding pieces 0–8, network backend]

Decoupling Delivery from Presentation
• VMM attempts to read piece 1
• Piece 1 is present, so the read completes

Decoupling Delivery from Presentation
• VMM attempts to read piece 0
• Piece 0 isn’t local, so the read stalls
• VMM waits for the I/O to complete; the VM stalls

Decoupling Delivery from Presentation
• FS requests the piece from the backend
• Backend requests it from the network

Decoupling Delivery from Presentation
• Later, the network delivers piece 0
• Custom FS receives it and updates the piece
• Read completes; VMM resumes the VM’s execution

Decoupling Improves Performance (Primary Storage)
• No waiting for the image download to complete

Decoupling Improves Performance (Secondary Storage)
• No more writes or re-reads over the network, as with a remote FS

But Doesn’t Scale
Assuming a single server, the time to download a single piece is

  t = W + S / (r_net / n)

• W: wait time for first bit
• r_net: network speed
• S: piece size
• n: # of clients
Transfer time: each client gets r_net / n of the server’s bandwidth

Read Time Grows Linearly w/ n
Rewriting the same expression,

  t = W + n * S / r_net

so transfer time grows linearly w/ n

This Scenario: “csd”
[diagram: each custom FS fetching pieces on demand from the central network backend]

Decoupling Enables P2P Backend
• Alleviates the network storage bottleneck
• Exchange pieces w/ the swarm
• P2P copy must remain pristine
[diagram: custom FS backed by a P2P manager that exchanges pieces 0–8 with the swarm]

Space Efficient
• FS uses pointers to the P2P image
• FS does copy-on-write

Minimizing Stall Time
• Non-local piece accesses trigger high-priority requests
[diagram: FS asks the P2P manager for piece 4 (“4?”), which demands it from the swarm (“4!”)]
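To make the decoupled read path concrete, below is a minimal sketch, not the actual VMTorrent code, of how a FUSE-style custom FS could stall a read until the backend delivers the piece: a read of a missing piece issues a high-priority (demand) request and blocks on a condition variable, and the P2P/network thread wakes it on delivery. All names here (vmt_read_piece, vmt_deliver_piece, p2p_request_high_priority) and the piece size are illustrative assumptions.

```c
/* Sketch only: illustrates decoupling data delivery from presentation.
 * A reader (the VMM's I/O, arriving via the custom FS) blocks until a
 * piece is present; the P2P/network side wakes it on delivery.
 * Hypothetical names and sizes; not VMTorrent's actual code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NUM_PIECES 9            /* pieces 0..8, as in the slides */
#define PIECE_SIZE (256 * 1024) /* illustrative piece size */

static char image[NUM_PIECES][PIECE_SIZE];   /* pristine local copy of the image */
static bool present[NUM_PIECES];             /* which pieces have arrived */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t arrived = PTHREAD_COND_INITIALIZER;

/* Hypothetical hook into the P2P manager: bump a piece to demand priority. */
static void p2p_request_high_priority(int piece) { (void)piece; }

/* Called from the custom FS read handler (e.g., FUSE's read callback).
 * If the piece is local the read completes immediately; otherwise the
 * calling thread (and hence the VM's I/O) stalls until delivery. */
static void vmt_read_piece(int piece, char *buf)
{
    pthread_mutex_lock(&lock);
    if (!present[piece])
        p2p_request_high_priority(piece);    /* the "4!" demand request */
    while (!present[piece])
        pthread_cond_wait(&arrived, &lock);  /* VM stalls here */
    memcpy(buf, image[piece], PIECE_SIZE);
    pthread_mutex_unlock(&lock);
}

/* Called by the network/P2P thread when a piece finishes downloading. */
static void vmt_deliver_piece(int piece, const char *data)
{
    pthread_mutex_lock(&lock);
    memcpy(image[piece], data, PIECE_SIZE);
    present[piece] = true;
    pthread_cond_broadcast(&arrived);        /* resume any stalled reads */
    pthread_mutex_unlock(&lock);
}

/* Toy driver: piece 1 is already local; piece 0 arrives one second later. */
static void *deliverer(void *arg)
{
    static char data[PIECE_SIZE];
    (void)arg;
    sleep(1);                                /* simulated network delay */
    vmt_deliver_piece(0, data);
    return NULL;
}

int main(void)
{
    static char buf[PIECE_SIZE];
    pthread_t t;
    present[1] = true;
    vmt_read_piece(1, buf);                  /* completes immediately */
    pthread_create(&t, NULL, deliverer, NULL);
    vmt_read_piece(0, buf);                  /* stalls until delivery */
    pthread_join(t, NULL);
    printf("both reads completed\n");
    return 0;
}
```

Compiled with -pthread, the toy main reproduces the two slide scenarios: the read of piece 1 returns at once, while the read of piece 0 stalls until the simulated delivery arrives.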
P2P Helps
Now, the time to download a single piece is

  t = W(d) + S / r_net

• W(d): wait time for first bit, as a function of d, the piece diversity
• r_net: network speed
• S: piece size
• n: # of peers
Transfer time is independent of n; the wait is a function of diversity

High Diversity: Swarm Efficiency
[diagram: peers holding different pieces can trade with one another]

Low Diversity: Little Benefit
• Nothing to share

P2P Helps, But Not Enough
• All peers request the same pieces at the same time
  – Low piece diversity
  – Long wait (gets worse as n grows)
  – Long download times

This Scenario: “p2pd”
[diagram: custom FS backed by a P2P manager, demand requests only]

Profile-based Image Streaming Techniques (Making Streaming Effective)

How to Increase Diversity?
Need to fetch pieces that are
• Rare: not yet demanded by many peers
• Useful: likely to be used by some peer

Profiling
• Need useful pieces
• But only a small % of the VM image is accessed
• So we need to know which pieces are accessed
• And when (needed later for piece selection)

Build Profile
• One profile for each VM/workload
• Run one or more times (even online)
• Use the FS to track
  – Which pieces are accessed
  – When pieces are accessed
• Entries w/ average appearance time, piece index, and frequency

Piece Selection
• Want pieces not yet demanded by many
• Don’t know the piece distribution in the swarm
• Guess that others behave like self
• Profile gives an estimate of when pieces are likely to be needed

Piece Selection Heuristic
• Randomly (rarest first) pick one of the first k pieces in the predicted playback window
• Fetch w/ medium priority (demand wins)
• (a code sketch of this heuristic appears after the evaluation slides)

Profile-based Prefetching
• Increases diversity
• Helps even w/ no peers (when the ideal access rate exceeds the network rate)

Obtain Full P2P Benefit
Profile-based window-randomized prefetch gives

  t = W(d) + S / r_net

with
• High piece diversity
• Short wait (shouldn’t grow much w/ n)
• Quick piece download

Full VMTorrent Architecture: “p2pp”
[diagram: custom FS plus profile-driven P2P manager exchanging pieces 0–8 with the swarm]

Prototype

VMTorrent Prototype
• Custom FS: C, using FUSE
• P2P manager: custom C++ & libtorrent, connected to a BT swarm
[diagram: prototype components mirroring the architecture above]

Evaluation Setup

Testbeds
• Emulab [White et al., 2002]
  – Instances on 100 dedicated hardware nodes
  – 100 Mbps LAN
• VICCI [Peterson et al., 2011]
  – Instances on 64 vserver hardware node slices
  – 1 Gbps LAN

VMs

Workloads
• Short VDI-like tasks
• Some CPU-intensive, some I/O-intensive

Assessment
• Measured total runtime
  – Launch through shutdown
  – (Easy to measure)
• Normalized against memory-cached execution
  – Ideal runtime for that set of hardware
  – Allows easy cross-comparison
    • Different VM/workload combinations
    • Different hardware platforms

Evaluation

100 Mbps Scaling
[scaling graph; annotation: “Starting to increase”]

Due to Decreased Diversity
• As # peers increases
  – More demand requests to the seed
  – Less opportunity to build diversity
  – Longer to reach max swarming efficiency, and a lower max
• We optimized too much for the single-instance case (by letting demand requests take precedence)
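As noted on the piece-selection slide above, here is a minimal sketch of the profile-based, window-randomized prefetch heuristic. It is illustrative only: the profile-entry layout, the function name select_prefetch_piece, and the medium-priority request API it assumes are not taken from the VMTorrent implementation. Each peer, working from the same profile, picks uniformly at random among the first k not-yet-present pieces in its predicted playback window, so different peers fetch different pieces and diversity rises; demand (high-priority) requests still win.

```c
/* Sketch only: profile-driven, window-randomized piece selection.
 * Profile entries record when each piece was accessed during profiling runs;
 * the heuristic randomizes over the next k missing pieces in the window.
 * Names and the profile layout are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct profile_entry {
    double avg_time;   /* average time (seconds after launch) the piece was accessed */
    int    piece;      /* piece index within the image */
    double freq;       /* fraction of profiling runs that touched this piece */
};

/* Return a piece to prefetch, or -1 if nothing in the window needs fetching.
 * profile  : entries sorted by avg_time
 * nentries : number of profile entries
 * present  : which pieces are already local
 * elapsed  : seconds since the VM instance was launched
 * k        : window width (how many candidates to randomize over) */
int select_prefetch_piece(const struct profile_entry *profile, size_t nentries,
                          const bool *present, double elapsed, int k)
{
    int candidates[64];
    int ncand = 0;

    if (k > 64)
        k = 64;
    for (size_t i = 0; i < nentries && ncand < k; i++) {
        if (profile[i].avg_time < elapsed)
            continue;                    /* behind the predicted playback position */
        if (!present[profile[i].piece])
            candidates[ncand++] = profile[i].piece;
    }
    if (ncand == 0)
        return -1;
    /* Uniform random choice within the window: with every peer using the same
     * profile, randomization keeps peers from all fetching the same piece at
     * the same moment (approximating rarest-first). */
    return candidates[rand() % ncand];
}

/* The caller would hand the result to the P2P manager at medium priority,
 * e.g. (hypothetical API): p2p_request_medium_priority(piece);
 * demand requests issued by the custom FS still take precedence. */
```

The strict alternative, prefetching in exact profile order, would recreate the low-diversity problem of the “p2pd” scenario, since every peer would request the same piece at the same time.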
(Some) Future Work
• Piece selection for better diversity
  – (current work is already orders of magnitude better than the state of the art)
• Improved profiling
• DC-specific optimizations

Demo (video omitted for space)

See Paper for More Details
• Modeling
  – Playback process dynamics
  – Buffering (for prefetch)
  – Full characterization of r, incorporating the impact of the centralized and distributed models on W
  – Other elided details
• Plus
  – More architectural discussion!
  – Lots more experimental results!

Summary
• Scalable VM launching is needed
• VMTorrent addresses this by
  – Decoupling data delivery from presentation
  – Profile-based VM image streaming
• Straightforward techniques and implementation, no special optimizations for the DC
• Performance much better than the state of the art
  – Hardware evaluation on multiple testbeds
  – As predicted by modeling