The Design and Evolution of Live Storage Migration in VMware ESX

Ali Mashtizadeh, VMware, Inc.
Emré Celebi, VMware, Inc.
Tal Garfinkel, VMware, Inc.
Min Cai, VMware, Inc.

© 2010 VMware Inc. All rights reserved

Agenda
What is live migration?
Migration architectures
Lessons

What is live migration (vMotion)?
Moves a VM between two physical hosts
No noticeable interruption to the VM (ideally)
Use cases:
• Hardware/software upgrades
• Distributed resource management
• Distributed power management

Live Migration
[Diagram: a virtual machine moving from a source host to a destination host]
Disk is placed on a shared volume (100s of GB to TBs)
CPU and device state is copied (MBs)
Memory is copied (GBs)
• Large and frequently changing → copy iteratively

Live Storage Migration
What is live storage migration?
• Migration of a VM's virtual disks
Why does this matter?
• VMs can be very large
• Array maintenance may mean migrating every VM on an array
• Migration times are measured in hours

Live Migration and Storage
Live migration – a short history:
ESX 2.0 (2003) – Live migration (vMotion)
• Virtual disks must live on shared volumes
ESX 3.0 (2006) – Live storage migration lite (Upgrade vMotion)
• Enabled upgrade of VMFS by migrating the disks
ESX 3.5 (2007) – Live storage migration (Storage vMotion)
• Storage array upgrade and repair
• Manual storage load balancing
• Snapshot based
ESX 4.0 (2009) – Dirty block tracking (DBT)
ESX 5.0 (2011) – IO Mirroring

Agenda
What is live migration?
Migration architectures
Lessons

Goals
Migration Time
• Minimize total end-to-end migration time
• Predictability of migration time
Guest Penalty
• Minimize performance loss
• Minimize downtime
Atomicity
• Avoid dependence on multiple volumes (for replication fault domains)
Guarantee Convergence
• Ideally we want migrations to always complete successfully

Convergence
Migration time vs. downtime:
• More migration time → more guest performance impact
• More downtime → more service unavailability
Factors that affect convergence:
• Block dirty rate
• Available storage network bandwidth
• Workload interactions
Challenges:
• Many workloads interacting on the storage array
• Unpredictable behavior
[Figure: data remaining over time, with the migration time and downtime phases marked]

Migration Architectures
Snapshotting
Dirty Block Tracking (DBT)
• Heat Optimization
IO Mirroring

Snapshot Architecture – ESX 3.5 U1
[Diagram: ESX host with a VMDK on the source volume and a VMDK on the destination volume]

Synthetic Workload
Synthetic Iometer workload (OLTP):
• 30% write / 70% read
• 100% random
• 8 KB IOs
• Outstanding IOs (OIOs) from 2 to 32
Migration setup:
• Migrated both the 6 GB system disk and the 32 GB data disk
Hardware:
• Dell PowerEdge R710: dual Xeon X5570 2.93 GHz
• Two EMC CX4-120 arrays connected via 8 Gb Fibre Channel

Migration Time vs. Varying OIO
[Chart: snapshot migration time (s) vs. workload OIO from 2 to 32]

Downtime vs. Varying OIO
[Chart: snapshot downtime (s) vs. workload OIO from 2 to 32]

Total Penalty vs. Varying OIO
[Chart: snapshot effective lost workload time (s) vs. workload OIO from 2 to 32]

Snapshot Architecture
Benefits
• Simple implementation
• Built on existing and well tested infrastructure
Challenges
• Suffers from snapshot performance issues
• Disk space: up to 3x the VM size
• Not an atomic switch from source to destination
• A problem when spanning replication fault domains
• Downtime
• Long migration times

Snapshot versus Dirty Block Tracking Intuition
Virtual disk level snapshots carry overhead to maintain metadata
They also require lots of disk space
We want to operate more like live migration:
• Iterative copy
• Block level copy rather than disk level – enables optimizations
We need a mechanism to track writes

Dirty Block Tracking (DBT) Architecture – ESX 4.0/4.1
[Diagram: ESX host copying a VMDK from the source volume to the destination volume while tracking guest writes]

Data Mover (DM)
Kernel service
• Provides disk copy operations
• Avoids memory copies (DMAs only)
Operation (default configuration)
• 16 outstanding IOs
• 256 KB IOs
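To make the DBT approach concrete, the following is a minimal sketch of an iterative dirty-block-tracking copy loop. It is illustrative only: the function and parameter names are hypothetical, the real Data Mover issues 256 KB IOs with 16 outstanding IOs and copies via DMA rather than a Python loop, and the convergence policy shown (a fixed pass limit and a block-count downtime budget) is an assumption, not the ESX heuristic.

    # Minimal sketch of a DBT-style iterative copy, assuming `src` and `dst`
    # are block-indexable disk stand-ins and `drain_dirty()` returns (and
    # clears) the set of blocks the guest wrote since the previous call.
    # All names are hypothetical; this is not the ESX implementation.

    def dbt_migrate(src, dst, num_blocks, drain_dirty,
                    downtime_budget_blocks=1024, max_passes=8):
        drain_dirty()                        # start with a clean tracking filter

        # Pass 1: bulk copy the whole disk while the guest keeps running;
        # writes that land behind the copy are recorded by the tracking filter.
        for block in range(num_blocks):
            dst[block] = src[block]

        # Passes 2..N: recopy whatever the guest dirtied during the last pass.
        for _ in range(max_passes):
            dirty = drain_dirty()
            if len(dirty) <= downtime_budget_blocks:
                # Small enough to finish while the VM is briefly suspended,
                # then atomically switch the VM to the destination disk.
                for block in sorted(dirty):
                    dst[block] = src[block]
                return True                  # converged
            for block in sorted(dirty):
                dst[block] = src[block]

        return False                         # dirty rate too high: no convergence

The convergence problem from the Goals and Convergence slides is visible here: if the guest dirties blocks faster than the copy passes can drain them, the loop never fits within the downtime budget.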
Migration Time vs. Varying OIO
[Chart: migration time (s) for Snapshot and DBT vs. workload OIO from 2 to 32]

Downtime vs. Varying OIO
[Chart: downtime (s) for Snapshot and DBT vs. workload OIO from 2 to 32]

Total Penalty vs. Varying OIO
[Chart: effective lost workload time (s) for Snapshot and DBT vs. workload OIO from 2 to 32]

Dirty Block Tracking Architecture
Benefits
• Well understood architecture, similar to live VM migration
• Eliminates the performance issues associated with snapshots
• Enables block level optimizations
• Atomicity
Challenges
• Migrations may not converge (and will not succeed with reasonable downtime)
  • Destination slower than source
  • Insufficient copy bandwidth
• Convergence logic is difficult to tune
• Downtime

Block Write Frequency – Exchange Workload
[Figure: write frequency across the disk's LBAs for an Exchange workload]

Heat Optimization – Introduction
Defer copying data that is frequently written
Detects frequently written blocks:
• File system metadata
• Circular logs
• Application specific data
No significant benefit for:
• Copy-on-write file systems (e.g. ZFS, HAMMER, WAFL)
• Workloads with limited locality (e.g. OLTP)

Heat Optimization – Design
[Figure: write heat tracked across disk LBAs; frequently written regions are deferred to later copy passes]
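A sketch of the heat idea, layered on the DBT loop sketched earlier: blocks the write tracker keeps seeing are deferred instead of being recopied on every pass. The class, the threshold, and all names are illustrative assumptions; the actual ESX heat heuristic is not shown on these slides.

    # Sketch of heat-based deferral on top of dirty block tracking, assuming
    # the same `src`/`dst` block-indexable stand-ins as before.

    from collections import Counter

    class HeatTracker:
        def __init__(self, hot_threshold=3):
            self.writes = Counter()          # per-block write counts ("heat")
            self.hot_threshold = hot_threshold

        def record_write(self, block):       # called from the write-tracking path
            self.writes[block] += 1

        def is_hot(self, block):
            return self.writes[block] >= self.hot_threshold

    def copy_pass_with_heat(src, dst, dirty, heat):
        """Copy one pass of dirty blocks, deferring hot ones to a later pass."""
        deferred = set()
        for block in sorted(dirty):
            if heat.is_hot(block):
                deferred.add(block)          # likely to be dirtied again soon
            else:
                dst[block] = src[block]
        return deferred                      # recopied last, or during downtime

This matches the intuition above: metadata and circular-log blocks keep getting dirtied, so copying them once near the end beats recopying them every pass, while low-locality workloads such as OLTP leave little to defer.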
DBT versus IO Mirroring Intuition
Live migration intuition – intercepting all memory writes is expensive:
• Trapping interferes with the data fast path
• DBT traps only the first write to a page
• Writes are batched to aggregate subsequent writes without trapping
Intercepting all storage writes is cheap:
• The IO stack already processes every IO
IO Mirroring:
• Synchronously mirror all writes
• Single-pass copy of the bulk of the disk

IO Mirroring – ESX 5.0
[Diagram: ESX host mirroring guest writes to VMDKs on both the source volume and the destination volume]

Migration Time vs. Varying OIO
[Chart: migration time (s) for Snapshot, DBT, and Mirror vs. workload OIO from 2 to 32]

Downtime vs. Varying OIO
[Chart: downtime (s) for Snapshot, DBT, and Mirror vs. workload OIO from 2 to 32]

Total Penalty vs. Varying OIO
[Chart: effective lost workload time (s) for Snapshot, DBT, and Mirror vs. workload OIO from 2 to 32]

IO Mirroring
Benefits
• Minimal migration time
• Near-zero downtime
• Atomicity
Challenges
• Complex code to guarantee atomicity of the migration
• Odd guest interactions require code for verification and debugging

Throttling the Source

IO Mirroring to Slow Destination
[Chart: IOPS over roughly 3000 seconds of migration, showing read and write IOPS at the source and write IOPS at the destination, with the start and end of the migration marked]

Agenda
What is live migration?
Migration architectures
Lessons

Recap
In the beginning: live migration
Snapshot:
• Usually has the worst downtime/penalty
• Whole-disk level abstraction
• Snapshot overheads due to metadata
• No atomicity
DBT:
• Manageable downtime (except when OIO > 16)
• Enabled block level optimizations
• Difficult to make convergence decisions
• No natural throttling

Recap – Cont.
Insight: storage is not memory
• Interposing on all writes is practical and performant
IO Mirroring:
• Near-zero downtime
• Best migration time consistency
• Minimal performance penalty
• No convergence logic necessary
• Natural throttling

Future Work
Leverage workload analysis to reduce mirroring overhead:
• Defer mirroring regions with potentially sequential write IO patterns
• Defer hot blocks
• Read ahead for lazy mirroring
Apply mirroring to WAN migrations:
• New optimizations and a hybrid architecture

Thank You!

Backup Slides

Exchange Migration with Heat
[Chart: data copied (MB) per copy iteration (1 through 10), split into hot and cold blocks, against a baseline without heat]

Exchange Workload
Exchange 2010:
• Workload generated by Exchange Load Generator
• 2000 user mailboxes
• Migrated only the 350 GB mailbox disk
Hardware:
• Dell PowerEdge R910: 8-core Nehalem-EX
• EMC CX3-40
• Migrated between two 6-disk RAID-0 volumes

Exchange Results
Type                    Migration Time (s)   Downtime (s)
DBT                     2935.5               13.297
Incremental DBT         2638.9               7.557
IO Mirroring            1922.2               0.220
DBT (2x)                Failed               -
Incremental DBT (2x)    Failed               -
IO Mirroring (2x)       1824.3               0.186

IO Mirroring Lock Behavior
Disk regions:
• I – Mirrored region: write IOs go to both source and destination
• II – Locked region: IOs are deferred until the lock is released
• III – Unlocked region: write IOs go to the source only
Moving the lock region:
1. Wait for non-mirrored in-flight read IOs to complete (queue all IOs)
2. Move the lock range
3. Release the queued IOs
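The following is a sketch of how a guest write might be classified against these three regions, with a data-mover step that advances the lock. The class, the block-granular "disks", and the strictly contiguous lock advance are simplifying assumptions for illustration, and waiting for in-flight IOs is not modeled.

    # Sketch of the mirrored write path with a moving lock region, following
    # the region picture above. `src` and `dst` are block-indexable stand-ins;
    # all names are hypothetical and in-flight IO draining is not modeled.

    class MirrorFilter:
        def __init__(self, src, dst, num_blocks):
            self.src, self.dst = src, dst
            self.num_blocks = num_blocks
            self.lock_start = 0              # below this: region I (mirrored)
            self.lock_end = 0                # [lock_start, lock_end): region II
            self.deferred = []               # guest IOs queued against the lock

        def guest_write(self, block, data):
            if block < self.lock_start:      # region I: mirror synchronously
                self.src[block] = data
                self.dst[block] = data
            elif block < self.lock_end:      # region II: defer until lock moves
                self.deferred.append((block, data))
            else:                            # region III: source only; the bulk
                self.src[block] = data       # copy will reach this block later

        def advance_lock(self, chunk):
            # Data mover copies the currently locked region...
            for block in range(self.lock_start, self.lock_end):
                self.dst[block] = self.src[block]
            # ...then slides the lock forward; the copied region becomes mirrored.
            self.lock_start = self.lock_end
            self.lock_end = min(self.lock_end + chunk, self.num_blocks)
            # Release writes queued against the old locked region; they now fall
            # in the mirrored region and are reissued to both disks.
            queued, self.deferred = self.deferred, []
            for block, data in queued:
                self.guest_write(block, data)

Because every write in the mirrored region completes only when both disks have it, the guest is naturally throttled to the pace of the slower side, which is the behavior visible in the slow-destination chart above.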
Non-trivial Guest Interactions
Guest IOs crossing locked disk regions
Guest buffer cache changing while an IO is in flight
Overlapped IOs

Lock Latency and Data Mover Time
[Charts: lock latency (ms) and Data Mover copy time (ms) over the course of the migration, comparing no contention, lock contention, and locking the second disk]

Source/Destination Valid Inconsistencies
Normal guest buffer cache behavior:
[Timeline: the guest OS issues an IO; the source IO is issued; the guest OS modifies the buffer cache; the destination IO is then issued, so the two disks may receive different data for the same write]
This inconsistency is okay!
• Source and destination are both valid crash-consistent views of the disk

Source/Destination Valid Inconsistencies
Overlapping IOs (synthetic workloads only):
[Diagram: two IOs issued to overlapping LBAs can be reordered, so the source and destination disks may retain different winners for the overlapping range]
• Seen in Iometer and other synthetic benchmarks
• File systems do not generate this

Incremental DBT Optimization – ESX 4.1
[Diagram: guest writes landing on disk blocks; writes to blocks the copy has already passed are marked dirty and recopied, while writes to blocks not yet copied are ignored because the bulk copy will pick them up]
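A minimal sketch of that incremental tracking rule, assuming a single forward copy cursor. The class and method names are hypothetical, and this reflects one reading of the slide rather than the ESX 4.1 code.

    # Sketch of incremental DBT: only writes behind the copy cursor are recorded
    # as dirty; writes ahead of it are ignored because the ongoing bulk copy
    # will carry them anyway. Names are hypothetical, not the ESX implementation.

    class IncrementalTracker:
        def __init__(self):
            self.copy_cursor = 0             # next block the bulk copy will read
            self.dirty = set()               # copied blocks made stale by writes

        def on_guest_write(self, block):
            if block < self.copy_cursor:
                self.dirty.add(block)        # already copied: must be recopied
            # else: ignore -- the bulk copy has not reached this block yet

        def on_bulk_copy_progress(self, next_block):
            self.copy_cursor = next_block    # bulk copy finished [0, next_block)

Compared with plain DBT, this shrinks the dirty set produced during the first pass, which is consistent with the shorter migration time and downtime reported for Incremental DBT in the Exchange results table.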