kMemvisor: Flexible System Wide Memory Mirroring in Virtual Environments
Bin Wang, Zhengwei Qi, Haibing Guan, Haoliang Dong, Wei Sun (Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University); Yaozu Dong (Intel China Software Center)

Is your memory error-prone?
Today's memories do become error-prone.
• [B. Schroeder et al., SIGMETRICS 09] Memory failures are common in clusters
  - 8% of DIMMs have correctable errors per year
  - 1.29% uncorrectable errors in the Google testbed

Memory HA in Cloud Computing
• At a 1.29% error rate: 13,000 failures per year, 1,083 failures per month, 35 failures per day, 1.5 failures per hour
• A 99.95% Service Level Agreement allows only 4.38 hours of downtime per year, 21.56 minutes per month, 5.04 minutes per week
• [Andy A. Hwang et al., ASPLOS 12] Memory errors happen at a significant rate in all four sites, with 2.5% to 5.5% of nodes affected per system

Existing Solutions
• Hardware
  - ECC (HP, IBM, Google, et al.)
  - Mirrored memory (HP, IBM, Google, et al.)
• Software
  - Duo-backup (GFS, Amazon Dynamo)
  - System-level tolerance: checkpoint, hot spare + VM migration/replication
  - Application-specific and high overhead (e.g., Remus [NSDI 08] at 40 checkpoints/sec incurs 103% overhead)

Design Guideline
• Low cost
• Efficiency & compatibility
  - Arbitrary platform
  - On the fly + hot spare
• Low maintenance
  - Little impact on other resources (e.g., network utilization)
  - No migration required
• Cloud requirement
  - VM granularity

kMemvisor
• A hypervisor providing system-wide memory mirroring based on hardware virtualization
• Supports VMs with or without the mirrored-memory feature on the same physical machine
• Supports N-modular redundancy for special mission-critical applications

kMemvisor High-Level Architecture
(Diagram: ordinary VMs, app-level HA VMs, and system-wide HA VMs run on top of kMemvisor, which extends CPU management and memory management with code translation management and a modified CPU view where mva = mirror(nva). Code loaded from disk, e.g. "MOV addr, data", is translated in memory into "MOV addr, data; MOV mirror_addr, data", and the page table maps both the native and the mirror address.)

Memory Mirroring
• For each virtual address, kMemvisor creates a native PTE and a mirror PTE
• Every store is duplicated: "mov $2, addr" becomes "mov $2, addr; mov $2, mirror(addr)", so the value 2 lands in both physical pages

Recovering from a Memory Failure
• When the native page is corrupted, the mirror page still holds the data: kMemvisor re-creates the PTE and copies the contents back from the mirror page (a minimal sketch of the duplicated-store and copy-back idea follows below)
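To make the duplicated-store and copy-back idea concrete, here is a minimal, self-contained C sketch. It is not kMemvisor code: REGION_SIZE, MIRROR_OFFSET, and the helper names (mirror_of, mirrored_store64, recover_from_mirror) are illustrative, and a real mirror region lives at the fixed offset in the modified memory layout rather than inside one heap allocation.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed offset between a native virtual address and its mirror,
 * mva = mirror(nva) = nva + MIRROR_OFFSET.  Here we simply carve the mirror
 * out of the second half of one allocation so the sketch runs as an ordinary
 * user program. */
#define REGION_SIZE   (1u << 20)              /* 1 MiB native region */
#define MIRROR_OFFSET ((uintptr_t)REGION_SIZE)

static inline void *mirror_of(void *nva)
{
    return (void *)((uintptr_t)nva + MIRROR_OFFSET);
}

/* What the translated code effectively does for every explicit store:
 * issue the same write to the native address and to its mirror. */
static inline void mirrored_store64(uint64_t *nva, uint64_t value)
{
    *nva = value;                            /* movq value, (nva)       */
    *(uint64_t *)mirror_of(nva) = value;     /* movq value, offset(nva) */
}

/* Recovery step: if the native copy is corrupted, copy the mirror back. */
static void recover_from_mirror(void *nva, size_t len)
{
    memcpy(nva, mirror_of(nva), len);
}

int main(void)
{
    uint8_t *region = calloc(2 * REGION_SIZE, 1);   /* native + mirror halves */
    if (!region)
        return 1;

    uint64_t *slot = (uint64_t *)region;
    mirrored_store64(slot, 2);        /* mov $2, addr + mov $2, mirror(addr) */
    *slot = 0xdeadbeef;               /* simulate corruption of the native copy */
    recover_from_mirror(slot, sizeof *slot);

    printf("recovered value: %llu\n", (unsigned long long)*slot);   /* prints 2 */
    free(region);
    return 0;
}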
Create Mirror Page Table
• Path: malloc() in the application → mmap() system call into the guest OS → update_va_mapping() hypercall into kMemvisor
• kMemvisor then (1) creates the native PTE and (2) creates the mirror PTE, so the hardware maps both the native and the mirror virtual address to physical memory (a toy C model of this path is sketched after the malloc micro-benchmark below)

Modified Memory Layout
• The virtual address space is split into a native space and a mirror space; physical memory is divided between high-quality memory and mirrored memory
• From the top of the address space: kMemvisor space (0xFFFFFC000000-0xFFFFFFFFFFFF), native kernel space (down to 0x800000000000), then mirrored user stack space, native user stack space, mirrored user heap space, and native user heap space (down to 0x000000000000)
• mva = mirror(nva) = nva + offset

Memory Synchronization
• Binary translation is inserted into the normal build pipeline: Sysbench.c → compiler (cc1) → Sysbench.s → binary translation → M_sysbench.s → assembler (as) → Sysbench.o → linker (ld, with modified libraries) → Sysbench.out
• Each explicit memory write gains a mirrored copy:

  Native instructions        Modified instructions
  movq $4, 144(%rdi)         movq $4, 144(%rdi)
                             movq $4, offset+144(%rdi)
  call log_text              call log_text
  addq %rdx, (%rax)          addq %rdx, (%rax)
                             addq %rdx, offset(%rax)
  pushq %rbp                 pushq %rbp
                             movq %rbp, offset(%rsp)
  .Log_text:                 .Log_text:
                             pushq %rax
                             movq 8(%rsp), %rax
                             movq %rax, (offset+8)(%rsp)
                             popq %rax

Explicit and Implicit Instructions
• Instructions such as call also write memory implicitly (the return address is pushed onto the stack)
• Mirror instructions placed after "call grep" would not be executed until the call returns, causing data inconsistency
• Solution: instrument the mirror instructions at the start of the called procedure:

  grep:
      pushl %eax
      movl  %eax, offset(%esp)
      movl  4(%esp), %eax
      movl  %eax, offset+4(%esp)
      popl  %eax

The Stack After an int Instruction Completes
• The processor itself pushes the interrupt frame, bypassing the translated code, so kMemvisor copies the frame to the mirror stack
• Privilege changed: ss, esp, eflags, cs, eip, and the error code are pushed by the processor and copied to the mirror by kMemvisor
• Privilege not changed: eflags, cs, eip, and the error code are pushed above the existing stack data and likewise copied by kMemvisor

Test Environment
• Hardware
  - Dell PowerEdge T610 server: 6-core 2.67 GHz Intel Xeon CPU with 12 MB L3 cache
  - Samsung 8 GB DDR3 RAM with ECC and a 148 GB SATA disk
• Software
  - Hypervisor: Xen 3.4.2
  - Kernel: Linux 2.6.30
  - Guest OS: BusyBox 1.19.2

Malloc Micro-test with Different Block Sizes
• The performance impact is noticeable when the allocated block is smaller than 256 KB, but much more limited for larger sizes
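As promised under "Create Mirror Page Table", here is a toy C model of the mapping path. It is illustrative only, not Xen or kMemvisor source: the page-table array, the frame allocator, MIRROR_OFFSET, and the handler name handle_update_va_mapping are stand-ins for the real hypercall handling that installs the native PTE and the mirror PTE.

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for guest page tables and the hypervisor's frame allocator. */
#define MIRROR_OFFSET 0x100000ULL   /* placeholder, not the real layout offset */
#define MAX_PTES      64

typedef uint64_t vaddr_t;
typedef uint64_t mfn_t;

struct pte { vaddr_t va; mfn_t mfn; int present; };

static struct pte page_table[MAX_PTES];
static mfn_t next_free_frame = 0x1000;

static void set_pte(vaddr_t va, mfn_t mfn)
{
    for (int i = 0; i < MAX_PTES; i++) {
        if (!page_table[i].present) {
            page_table[i] = (struct pte){ .va = va, .mfn = mfn, .present = 1 };
            return;
        }
    }
}

static mfn_t alloc_mirror_frame(void)
{
    return next_free_frame++;
}

/* Modeled handling of the guest's PTE-update hypercall: step 1 installs the
 * native PTE the guest asked for, step 2 installs the mirror PTE at
 * nva + MIRROR_OFFSET, backed by a separate frame. */
static void handle_update_va_mapping(vaddr_t nva, mfn_t native_mfn)
{
    set_pte(nva, native_mfn);                            /* 1: native PTE */
    set_pte(nva + MIRROR_OFFSET, alloc_mirror_frame());  /* 2: mirror PTE */
}

int main(void)
{
    /* malloc() -> mmap() -> update_va_mapping() ends up here. */
    handle_update_va_mapping(0x400000, 0x2000);

    for (int i = 0; i < MAX_PTES; i++)
        if (page_table[i].present)
            printf("va 0x%llx -> mfn 0x%llx\n",
                   (unsigned long long)page_table[i].va,
                   (unsigned long long)page_table[i].mfn);
    return 0;
}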
Memory Sequential Read & Write
• Sequential memory read shows no overhead; sequential memory write shows an average overhead of 33%

XV6 Benchmark
• Usertests (sbrk, bigwrite, mem, fork) and shell commands (echo, mkdir, wc, cat, ls) were compared native vs. modified, measuring normalized latency and the count of memory instructions
• Callouts from the two comparisons: largest overhead 7.2% with 5% average; largest overhead 60% with 43% average

Web Server & Database
• Performance of thttpd: largest overhead 10%
• SQLite (create, insert, select, update, delete): up to double overhead, impacted by lock operations; the cache mechanism causes frequent memory write operations

Compilation Time
• Average compilation-time overhead brought by kMemvisor: 6% (SQLite: 5.64%)
• Implicit write instructions are sometimes fewer in number, but they are usually more complicated and therefore cost extra overhead

Discussion
• Special memory areas
  - Hypervisor + kernel page table
• Other device memory operations
  - I/O operations
• Address conflicts
  - mva = nva + offset
• Challenges in binary translation
  - Self-modifying code & values
• Multi-threading and multi-core
  - Multi-thread: emulate mirrored instructions
  - Multi-core: explicit mutex lock

Conclusion
• A software mirrored-memory solution
  - CPU-intensive tasks are almost unaffected
  - Our stressful memory-write benchmark shows a backup overhead of 55%
  - Average overhead in real-world applications: 30%
• Dynamic binary translation
• Full kernel mirroring

Thanks!