The Design and Evolution of
Live Storage Migration in VMware ESX
Ali Mashtizadeh, VMware, Inc.
Emré Celebi, VMware, Inc.
Tal Garfinkel, VMware, Inc.
Min Cai, VMware, Inc.
© 2010 VMware Inc. All rights reserved
Agenda
 What is live migration?
 Migration architectures
 Lessons
What is live migration (vMotion)?
 Moves a VM between two physical hosts
 No noticeable interruption to the VM (ideally)
 Use cases:
• Hardware/software upgrades
• Distributed resource management
• Distributed power management
Live Migration
[Diagram: a virtual machine moving from the source host to the destination host]
 Disk is placed on a shared volume (100s of GB to TBs)
 CPU and device state is copied (MBs)
 Memory is copied (GBs)
• Large and changes often → copy iteratively
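To make the iterative copy concrete, here is a minimal sketch of a pre-copy loop (my own illustration with hypothetical callables, not ESX code): copy everything once, then keep re-copying whatever was dirtied until the remainder fits in the downtime budget.

```python
# Sketch of iterative pre-copy: copy memory while the VM keeps running, then
# re-copy pages dirtied during each pass until the remainder is small enough
# to transfer during a brief switch-over pause.
def iterative_precopy(pages, read_page, send_page, drain_dirty_set,
                      switchover_threshold=64):
    to_copy = set(pages)
    while len(to_copy) > switchover_threshold:
        for page in sorted(to_copy):
            send_page(page, read_page(page))     # VM still running
        to_copy = drain_dirty_set()              # pages written during that pass
    return to_copy    # final set, copied while the VM is briefly paused
```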
Live Storage Migration
 What is live storage migration?
• Migration of a VM’s virtual disks
 Why does this matter?
• VMs can be very large
• Array maintenance means you may migrate all VMs in an array
• Migration times can stretch into hours
Live Migration and Storage Live Migration – a short history
 ESX 2.0 (2003) – Live migration (vMotion)
• Virtual disks must live on shared volumes
 ESX 3.0 (2006) – Live storage migration lite (Upgrade vMotion)
• Enabled upgrade of VMFS by migrating the disks
 ESX 3.5 (2007) – Live storage migration (Storage vMotion)
• Storage array upgrade and repair
• Manual storage load balancing
• Snapshot based
 ESX 4.0 (2009) – Dirty block tracking (DBT)
 ESX 5.0 (2011) – IO Mirroring
Agenda
 What is live migration?
 Migration architectures
 Lessons
Goals
 Migration Time
• Minimize total end-to-end migration time
• Predictability of migration time
 Guest Penalty
• Minimize performance loss
• Minimize downtime
 Atomicity
• Avoid dependence on multiple volumes (for replication fault domains)
 Guarantee Convergence
• Ideally we want migrations to always complete successfully
Convergence
 Migration time vs. downtime
• More migration time → more guest performance impact
• More downtime → more service unavailability
 Factors that affect convergence:
• Block dirty rate
• Available storage network bandwidth
• Workload interactions
 Challenges:
• Many workloads interacting on the storage array
• Unpredictable behavior
[Chart: data remaining vs. time, annotated with the migration time and downtime phases]
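As a back-of-the-envelope model of convergence (my own illustration, not numbers from these slides): each pass only re-copies the data dirtied during the previous pass, so the remainder shrinks only when the block dirty rate stays below the effective copy rate.

```python
# Rough model: each pass leaves behind (dirty_rate / copy_rate) of the data the
# previous pass had to move; the migration converges only if that ratio < 1.
def passes_to_converge(disk_bytes, copy_rate, dirty_rate, downtime_budget_bytes):
    if dirty_rate >= copy_rate:
        return None                      # never converges without throttling
    remaining, passes = float(disk_bytes), 0
    while remaining > downtime_budget_bytes:
        remaining *= dirty_rate / copy_rate
        passes += 1
    return passes

# Example: 100 GB disk, 200 MB/s copy rate, 50 MB/s dirty rate, 256 MB budget
print(passes_to_converge(100 << 30, 200 << 20, 50 << 20, 256 << 20))  # -> 5
```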
Migration Architectures
 Snapshotting
 Dirty Block Tracking (DBT)
• Heat Optimization
 IO Mirroring
Snapshot Architecture – ESX 3.5 U1
[Diagram: ESX host with the VMDK being copied from the source volume to the destination volume]
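For intuition only, a generic snapshot-based migration can be sketched as below; the callables are hypothetical and this is not necessarily the exact ESX 3.5 sequence.

```python
# Generic snapshot-based migration sketch: redirect new writes into a snapshot
# delta, copy the now read-only data underneath it, and repeat until the last
# delta is small enough to copy while the VM is suspended.
def snapshot_migrate(take_snapshot, copy_to_destination, active_delta_size,
                     suspend_vm, resume_vm, consolidate, small_enough=64 << 20):
    frozen = take_snapshot()                 # base disk becomes read-only
    copy_to_destination(frozen)
    while active_delta_size() > small_enough:
        frozen = take_snapshot()             # freeze writes accumulated so far
        copy_to_destination(frozen)
    suspend_vm()
    copy_to_destination(take_snapshot())     # final, small delta
    consolidate()                            # collapse the chain on the destination
    resume_vm()
```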
Synthetic Workload
 Synthetic Iometer workload (OLTP):
• 30% Write/70% Read
• 100% Random
• 8KB IOs
• Outstanding IOs (OIOs) from 2 to 32
 Migration Setup:
• Migrated both the 6 GB System Disk and 32 GB Data Disk
 Hardware:
• Dell PowerEdge R710: Dual Xeon X5570 2.93 GHz
• Two EMC CX4-120 arrays connected via 8Gb Fibre Channel
Migration Time vs. Varying OIO
[Chart: migration time (s) for the snapshot architecture at workload OIO of 2, 4, 8, 16, and 32]
Downtime vs. Varying OIO
[Chart: downtime (s) for the snapshot architecture at workload OIO of 2, 4, 8, 16, and 32]
Total Penalty vs. Varying OIO
[Chart: effective lost workload time (s) for the snapshot architecture at workload OIO of 2, 4, 8, 16, and 32]
Snapshot Architecture
 Benefits
• Simple implementation
• Built on existing and well tested infrastructure
 Challenges
• Suffers from snapshot performance issues
• Disk space: Up to 3x the VM size
• Not an atomic switch from source to destination
• A problem when spanning replication fault domains
• Downtime
• Long migration times
Snapshot versus Dirty Block Tracking Intuition
 Virtual disk level snapshots have overhead to maintain metadata
 Requires lots of disk space
 We want to operate more like live migration
• Iterative copy
• Block level copy rather than disk level – enables optimizations
 We need a mechanism to track writes
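A minimal sketch of such a write-tracking mechanism (hypothetical structure; the real implementation sits in the VMkernel IO stack): record dirtied blocks while the data mover copies the disk in passes.

```python
# Dirty block tracking sketch: intercept guest writes, remember which blocks
# changed, and copy the disk in passes -- the whole disk first, then only the
# blocks dirtied since the previous pass.
class DirtyBlockTracker:
    def __init__(self, num_blocks):
        self.dirty = bytearray(num_blocks)      # one flag per block

    def on_guest_write(self, block, data, write_source):
        write_source(block, data)               # guest IO still goes to the source
        self.dirty[block] = 1                   # block must be re-copied

    def drain(self):
        blocks = [b for b, flag in enumerate(self.dirty) if flag]
        for b in blocks:
            self.dirty[b] = 0
        return blocks

def migrate_disk(num_blocks, tracker, copy_block, switchover_threshold=32):
    to_copy = list(range(num_blocks))           # pass 1: bulk copy
    while len(to_copy) > switchover_threshold:
        for b in to_copy:
            copy_block(b)                       # data mover: source -> destination
        to_copy = tracker.drain()               # dirtied during that pass
    return to_copy          # final blocks, copied with the VM briefly suspended
```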
Dirty Block Tracking (DBT) Architecture – ESX 4.0/4.1
[Diagram: ESX host tracking dirty blocks while the VMDK is copied from the source volume to the destination volume]
Data Mover (DM)
 Kernel Service
• Provides disk copy operations
• Avoids memory copy (DMAs only)
 Operation (default configuration)
• 16 Outstanding IOs
• 256 KB IOs
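To illustrate what the defaults mean, the sketch below keeps 16 copy operations of 256 KB each in flight using a thread pool; this is a user-level stand-in, whereas the real data mover issues asynchronous IOs inside the kernel and moves data by DMA without memory copies.

```python
import concurrent.futures

CHUNK = 256 * 1024       # 256 KB per copy IO (default)
OUTSTANDING = 16         # 16 outstanding IOs (default)

def dm_copy(src_path, dst_path, length):
    """Copy `length` bytes, keeping up to OUTSTANDING chunk copies in flight."""
    def copy_chunk(offset):
        # Each worker opens its own handles so seeks do not interfere.
        with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
            src.seek(offset)
            data = src.read(min(CHUNK, length - offset))
            dst.seek(offset)
            dst.write(data)

    with concurrent.futures.ThreadPoolExecutor(max_workers=OUTSTANDING) as pool:
        list(pool.map(copy_chunk, range(0, length, CHUNK)))
```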
Migration Time vs. Varying OIO
[Chart: migration time (s) for snapshot vs. DBT at workload OIO of 2, 4, 8, 16, and 32]
Downtime vs. Varying OIO
[Chart: downtime (s) for snapshot vs. DBT at workload OIO of 2, 4, 8, 16, and 32]
Total Penalty vs. Varying OIO
[Chart: effective lost workload time (s) for snapshot vs. DBT at workload OIO of 2, 4, 8, 16, and 32]
Dirty Block Tracking Architecture
 Benefits
• Well understood architecture, similar to live VM migration
• Eliminated performance issues associated with snapshots
• Enables block level optimizations
• Atomicity
 Challenges
• Migrations may not converge (and will not succeed with reasonable downtime)
• Destination slower than source
• Insufficient copy bandwidth
• Convergence logic difficult to tune
• Downtime
Block Write Frequency – Exchange Workload
[Chart: block write frequency for the Exchange workload]
Heat Optimization – Introduction
 Defer copying data that is frequently written
 Detects frequently written blocks
• File system metadata
• Circular logs
• Application specific data
 No significant benefit for:
• Copy on write file systems (e.g. ZFS, HAMMER, WAFL)
• Workloads with limited locality (e.g. OLTP)
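One way to realize this, as a sketch: count writes per block during the migration and postpone blocks whose count crosses a threshold to the final pass. The threshold and classification policy here are placeholders, not the ESX heuristics.

```python
from collections import Counter

class HeatTracker:
    """Defer copying blocks that are written frequently ("hot")."""
    def __init__(self, hot_threshold=4):
        self.writes = Counter()
        self.hot_threshold = hot_threshold

    def on_guest_write(self, block):
        self.writes[block] += 1                 # accumulate per-block heat

    def split(self, dirty_blocks):
        # Cold blocks are copied in this pass; hot blocks are postponed so they
        # are copied once at the end instead of repeatedly.
        cold = [b for b in dirty_blocks if self.writes[b] < self.hot_threshold]
        hot = [b for b in dirty_blocks if self.writes[b] >= self.hot_threshold]
        return cold, hot
```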
Heat Optimization – Design
[Diagram: write heat across disk LBAs]
DBT versus IO Mirroring Intuition
 Live migration intuition – intercepting all memory writes is expensive
• Trapping interferes with the data fast path
• DBT traps only the first write to a page
• Writes are batched to aggregate subsequent writes without trapping
 Intercepting all storage writes is cheap
• IO stack processes all IOs already
 IO Mirroring
• Synchronously mirror all writes
• Single pass copy of the bulk of the disk
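A minimal sketch of that idea (hypothetical names): writes behind the copy cursor are mirrored synchronously to both disks, writes ahead of it go to the source only, because the single bulk-copy pass will pick them up anyway.

```python
# IO mirroring sketch: one bulk pass copies the disk while a write filter keeps
# the already-copied region identical on the source and the destination.
class MirrorFilter:
    def __init__(self, write_source, write_destination):
        self.copied_up_to = 0                    # copy cursor (block number)
        self.write_source = write_source
        self.write_destination = write_destination

    def on_guest_write(self, block, data):
        self.write_source(block, data)           # source always receives the write
        if block < self.copied_up_to:
            self.write_destination(block, data)  # keep the copied region in sync
        # blocks at/after the cursor are not mirrored: the bulk copy will read
        # their latest contents when it reaches them

def bulk_copy(num_blocks, mirror, copy_block):
    for b in range(num_blocks):                  # single pass over the disk
        copy_block(b)                            # data mover copies block b
        mirror.copied_up_to = b + 1              # grow the mirrored region
```

This sketch ignores races between the data mover and guest writes to the region currently being copied; the lock region described in the backup slides handles that.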
IO Mirroring – ESX 5.0
[Diagram: ESX host mirroring guest writes to VMDKs on both the source and destination volumes]
Migration Time vs. Varying OIO
[Chart: migration time (s) for snapshot, DBT, and IO mirroring at workload OIO of 2, 4, 8, 16, and 32]
Downtime vs. Varying OIO
[Chart: downtime (s) for snapshot, DBT, and IO mirroring at workload OIO of 2, 4, 8, 16, and 32]
Total Penalty vs. Varying OIO
[Chart: effective lost workload time (s) for snapshot, DBT, and IO mirroring at workload OIO of 2, 4, 8, 16, and 32]
IO Mirroring
 Benefits
• Minimal migration time
• Near-zero downtime
• Atomicity
 Challenges
• Complex code to guarantee atomicity of the migration
• Odd guest interactions require code for verification and debugging
Throttling the source
IO Mirroring to Slow Destination
[Chart: read IOPS on the source, write IOPS on the source, and write IOPS on the destination vs. time (0-3000 s), with the start and end of the migration marked]
Agenda
 What is live migration?
 Migration architectures
 Lessons
Recap
 In the beginning: live migration, with disks on shared volumes
 Snapshot:
• Usually has the worst downtime/penalty
• Whole disk level abstraction
• Snapshot overheads due to metadata
• No atomicity
 DBT:
• Manageable downtime (except when OIO > 16)
• Enabled block level optimizations
• Difficult to make convergence decisions
• No natural throttling
Recap – Cont.
 Insight: storage is not memory
• Interposing on all writes is practical and performant
 IO Mirroring:
• Near-zero downtime
• Best migration time consistency
• Minimal performance penalty
• No convergence logic necessary
• Natural throttling
Future Work
 Leverage workload analysis to reduce mirroring overhead
• Defer mirroring regions with potential sequential write IO patterns
• Defer hot blocks
• Read ahead for lazy mirroring
 Apply mirroring to WAN migrations
• New optimizations and hybrid architecture
Thank You!
Backup Slides
Exchange Migration with Heat
[Chart: data copied (MB) per copy iteration (1-10), split into hot and cold blocks, compared with a baseline without the heat optimization]
Exchange Workload
 Exchange 2010:
• Workload generated by Exchange Load Generator
• 2000 User mailboxes
• Migrated only the 350 GB mailbox disk
 Hardware:
• Dell PowerEdge R910: 8-core Nehalem-EX
• EMC CX3-40
• Migrated between two 6 disk RAID-0 volumes
Exchange Results
Type                   Migration Time (s)   Downtime (s)
DBT                    2935.5               13.297
Incremental DBT        2638.9               7.557
IO Mirroring           1922.2               0.220
DBT (2x)               Failed               -
Incremental DBT (2x)   Failed               -
IO Mirroring (2x)      1824.3               0.186
IO Mirroring Lock Behavior
 Moving the lock region
1. Wait for non-mirrored in-flight read IOs to complete (queue all incoming IOs)
2. Move the lock range
3. Release queued IOs
Region I – Mirrored region: write IOs mirrored
Region II – Locked region: IOs deferred until lock release
Region III – Unlocked region: write IOs to source only
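A sketch of the lock-region protocol above (hypothetical synchronization, simplified to drain all in-flight IOs rather than only non-mirrored reads):

```python
import threading

class LockRegion:
    """Defer guest IOs to a block range while the data mover's window moves."""
    def __init__(self):
        self.cond = threading.Condition()
        self.start, self.end = 0, 0          # currently locked block range
        self.frozen = False                  # queue all new IOs while True
        self.inflight = 0

    def guest_io(self, block, do_io):
        with self.cond:
            while self.frozen or self.start <= block < self.end:
                self.cond.wait()             # deferred until lock release
            self.inflight += 1
        try:
            do_io()                          # issue the IO outside the lock
        finally:
            with self.cond:
                self.inflight -= 1
                self.cond.notify_all()

    def advance(self, new_start, new_end):
        with self.cond:
            self.frozen = True               # 1. queue new IOs...
            while self.inflight:             #    ...and wait for in-flight IOs
                self.cond.wait()
            self.start, self.end = new_start, new_end   # 2. move the lock range
            self.frozen = False
            self.cond.notify_all()           # 3. release the queued IOs
```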
Non-trivial Guest Interactions
 Guest IO crossing disk locked regions
 Guest buffer cache changing
 Overlapped IOs
Lock Latency and Data Mover Time
[Chart: data mover copy time (ms) and lock latency (ms) vs. time (s) during an IO mirroring migration, with annotations for lock contention, no contention, and the point where the second disk's lock region starts]
Source/Destination Valid Inconsistencies
 Normal Guest Buffer Cache Behavior
[Timeline: the guest OS issues an IO → the source IO is issued → the guest OS modifies the buffer cache → the destination IO is issued]
 This inconsistency is okay!
• Source and destination are both valid crash consistent views of the disk
Source/Destination Valid Inconsistencies
 Overlapping IOs (Synthetic workloads only)
[Diagram: two overlapping write IOs can be reordered, so the source disk and the destination disk may each complete them in a different order and retain different final contents]
 Seen in Iometer and other synthetic benchmarks
 File systems do not generate this
Incremental DBT Optimization – ESX 4.1
[Diagram: writes to disk blocks during a copy pass – blocks behind the copy offset are marked dirty and re-copied later, blocks ahead of it are ignored because the current pass will copy them anyway]
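A sketch of the decision the diagram suggests, assuming the data mover copies the disk front to back (illustrative only):

```python
# Incremental DBT sketch: a write ahead of the data mover's current offset
# needs no dirty bit, because the in-progress pass will copy that block anyway;
# only writes behind the offset must be recorded and re-copied later.
def on_guest_write(block, copy_offset, dirty_set):
    if block < copy_offset:
        dirty_set.add(block)    # already copied this pass: copy again next pass
    # else: ignore -- the current pass has not reached this block yet
```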