Introduction to QEMU

VM && QEMU
Date:2010/04/09 , rednoah
Outline

Introduction to Virtual Machine
  - VM Overview
  - Interpretation
  - Binary Translation
  - Process VM

Introduction to QEMU
  - QEMU Overview
  - QEMU JIT
  - QEMU other topics
Reference

  - James Smith and Ravi Nair, “Virtual Machines: Versatile Platforms for Systems and Processors”
  - QEMU internals
  - http://wiki.qemu.org/Main_Page
VM Overview

Why do we prefer something virtual over something real?

  - Why virtual memory?
      - sharing, protection, large address space, …
  - Why a Java virtual machine?
      - interoperability, application sharing, protection
  - Why virtual I/O?
      - flexibility, low-cost sharing, better management
  - Why a Virtual Private Network (VPN)?
      - secure communication over an insecure network
VM Overview

Common VMs:
  - IBM VM/CMS
  - VMware GSX
  - Xen
  - Virtual PC
  - JVM, MS CLI (Common Language Infrastructure)
  - Dalvik virtual machine
  - IA-32 EL, Apple Rosetta, HP PA-Aries
  - Transmeta Crusoe
  - QEMU
VM Overview

Virtualization
  - OS: A machine is defined by ISA
  - Compiler: A machine is defined by ABI (User ISA + OS calls)
  - Application: A machine is defined by API (User ISA + Library calls)
VM Overview

Virtual Machines
  - Add virtualization software to a host platform and support a guest process or system on a VM.
VM Overview

Process Virtual Machine
  - Guest processes may intermingle with host processes
  - Execute applications with an ISA different from the HW platform
  - Couple at ABI level via runtime system
  - As a practical matter, guest OS and host OS are often the same
      - Ex. FX!32 (allows x86 Win32 programs to execute on Alpha-based systems running Windows NT)
  - Different OS: ex. Wine (enables running Win32 programs on Linux)
VM Overview

System Virtual Machine
  - Provide a system environment
  - Constructed at ISA level
  - Example: IBM VM/360, VMware
  - Purpose:
      - Server consolidation
      - Secure partitioning
      - Fault isolation
      - Support software development and deployment
      - Cloud computing bandwagon
VM Overview

High-Level Language Virtual Machine
  - Java and MS CLI (Common Language Infrastructure) are current examples.
  - Binary class files are distributed
  - “ISA” (IR) is part of binary class format
  - OS interaction via API (part of VM platform)

VM Overview

High-Level Language Virtual Machine
  - Dalvik bytecode format
VM Overview
QEMU User-mode
QEMU System-mode
Interpretation

Emulation? Simulation?
  - Emulation: to be you. (A method for enabling a (sub)system to present the same interface and characteristics as another.)
  - Simulation: to be like you.

Guest and Host
  - Refer to platforms

Source and Target
  - Refer to ISAs
  - Source ISA: Original instruction set or binary
  - Target ISA: Instruction set being executed by the processor performing emulation.
Interpretation

Ways of implementing emulation
  - Interpretation: one instruction at a time
  - Binary Translation: one block at a time, optimized for repeated instruction executions

Why binaries/executables?
Dynamic or static compilation?
Interpretation

Interpretation State
  - Hold complete source architecture state in the interpreter’s data memory
  - [Figure: CPU state structure from target-arm/cpu.h]
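Interpretation
Decode-Dispatch Interpretation
while (!halt && !interrupt)
{
    inst = code(PC);                 /* fetch the next source instruction */
    opcode = extract(inst, 31, 6);   /* decode the 6-bit opcode field */
    switch (opcode) {                /* dispatch to the emulation routine */
    case LdWord:
        LdWord(inst);
        break;
    case ALU:
        ALU(inst);
        break;
    case Branch:
        Branch(inst);
        break;
    ...
    }
}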
Interpretation
Decode-Dispatch Interpretation
LdWord(inst)
{
    RT = extract(inst, 25, 5);             /* destination register */
    RA = extract(inst, 20, 5);             /* base address register */
    displacement = extract(inst, 15, 16);  /* 16-bit displacement */
    source = regs[RA];
    address = source + displacement;
    regs[RT] = data[address];
    PC = PC + 4;
}
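
The decode-dispatch loop above can be made concrete with a small runnable sketch. The toy instruction encoding, the opcode names, and the `ToyCpu` type below are assumptions made for illustration only; they do not correspond to any real ISA or to QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy encoding (an assumption):
   opcode [31:26], RT [25:21], RA [20:16], immediate [15:0]. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

typedef struct {
    uint32_t regs[32];
    uint32_t pc;       /* byte address of the next instruction */
    int halted;
} ToyCpu;

/* Extract `len` bits whose most significant bit is bit `hi`. */
static uint32_t extract(uint32_t inst, unsigned hi, unsigned len)
{
    return (inst >> (hi - len + 1)) & ((1u << len) - 1u);
}

/* Helper for building test programs in the toy encoding. */
static uint32_t encode(uint32_t op, uint32_t rt, uint32_t ra, uint32_t imm)
{
    return (op << 26) | (rt << 21) | (ra << 16) | (imm & 0xffffu);
}

static void interpret(ToyCpu *cpu, const uint32_t *code, size_t n_insts)
{
    while (!cpu->halted) {
        assert(cpu->pc / 4 < n_insts);            /* stay inside the program */
        uint32_t inst = code[cpu->pc / 4];        /* fetch */
        uint32_t rt = extract(inst, 25, 5);       /* decode */
        uint32_t ra = extract(inst, 20, 5);
        switch (extract(inst, 31, 6)) {           /* dispatch */
        case OP_LOADI:
            cpu->regs[rt] = extract(inst, 15, 16);
            break;
        case OP_ADD:
            cpu->regs[rt] += cpu->regs[ra];
            break;
        case OP_HALT:
        default:
            cpu->halted = 1;
            break;
        }
        cpu->pc += 4;
    }
}
```

Every instruction pays for the fetch, the field extraction, and the switch, which is the overhead that binary translation later removes.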
Interpretation
Decode-Dispatch Interpretation
Interpretation
Decode-Dispatch (Efficiency)
Interpretation

Compiled Emulation
  - Replace each source instruction by a sequence of emulation functions in a high-level language.
  - The in-lined program can then be compiled and optimized by the compiler.
  - After redundancy elimination optimization, the generated code may be similar to the result of static translation.

e.g.
add r1,r2,r3
sub r4,r5,r6
becomes
add(r1,r2,r3);…
sub(r4,r5,r6);…

QEMU uses a similar approach.
Binary Translation

  - Generate custom code for every source instruction. For example, a load instruction in source code could be translated into a corresponding load instruction in native code.
  - Get rid of repeated parsing, decoding, and jumping overhead.
  - Register mapping is needed to reduce loads/stores significantly.
Binary Translation

Example: Binary translation from IA-32 binary to PowerPC binary

Binary Translation

Register mapping to reduce load/store

Register mapping
 Easier
if
Number of target registers > number of source
registers. (e.g. translating x86 binary to RISC)
 May be on a per-block, or per-trace, or per-loop, basis
If the number of target registers is not enough
 Infrequently used registers (Source) may not be
mapped
Binary Translation

Source PC vs. Target PC (program counter)
  - TPC (Target PC) is different from SPC (Source PC)
  - For indirect branches, the registers hold source PCs, so we must provide a way to map SPCs to TPCs.
  - [Figure: incorrect translation that jumps indirectly through ctr while ctr contains an SPC]
Binary Translation

Dynamic Translation
  - First interpret
      - And perform code discovery as a byproduct
  - Translate code
      - Incrementally, as it is discovered
      - Place translated blocks into the code cache
      - Save source-to-target PC mapping in an address lookup table
  - Emulation process
      - Execute translated block to end
      - Look up next source PC in table
      - If translated, jump to target PC
      - Else interpret and translate
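
The emulation process above can be sketched as a dispatch loop around an SPC-to-TPC table. This is a minimal sketch under stated assumptions: translated blocks are modeled as host functions that return the next SPC, and `translate`, `lookup_or_translate`, `run`, the table size, and the guest addresses are all hypothetical names and values, not QEMU's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE 64            /* assumed table size (power of two) */
#define SPC_EXIT 0xffffffffu   /* assumed sentinel: guest program finished */

/* A "translated block" is modeled as a host function returning the next SPC. */
typedef uint32_t (*TranslatedBlock)(int *acc);

typedef struct { uint32_t spc; TranslatedBlock tpc; } MapEntry;
typedef struct { MapEntry slot[MAP_SIZE]; unsigned translations; } Emulator;

/* Stand-ins for translated guest blocks at SPC 0x100 and 0x104. */
static uint32_t block_100(int *acc) { *acc += 1; return 0x104; }
static uint32_t block_104(int *acc) { return *acc < 3 ? 0x100 : SPC_EXIT; }

/* Stand-in for "interpret and translate" of newly discovered code. */
static TranslatedBlock translate(uint32_t spc)
{
    return spc == 0x100 ? block_100 : block_104;
}

static TranslatedBlock lookup_or_translate(Emulator *em, uint32_t spc)
{
    MapEntry *e = &em->slot[(spc >> 2) & (MAP_SIZE - 1)];
    if (e->tpc == NULL || e->spc != spc) {   /* miss: translate, save mapping */
        e->spc = spc;
        e->tpc = translate(spc);
        em->translations++;
    }
    return e->tpc;
}

static int run(Emulator *em, uint32_t entry_spc)
{
    int acc = 0;
    uint32_t spc = entry_spc;
    while (spc != SPC_EXIT) {
        TranslatedBlock tb = lookup_or_translate(em, spc);
        spc = tb(&acc);   /* execute block to its end; it reports the next SPC */
    }
    return acc;
}
```

Each block is translated once and then reused on later visits, but every block boundary still goes through the lookup.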
Binary Translation

The translation system needs to track the SPC at all times
  - Control is shifted as needed between the interpreter, the emulation manager (EM), and translated blocks in the code cache. Each component must have a way to track the SPC.
  - Interpreter uses SPC directly
  - Interpreter passes the next SPC to EM
  - Translated block passes the next SPC to EM using JAL (jump-and-link instruction) or by mapping SPC to a register.
Binary Translation

Control flows between translated block and emulation manager
  - [Figure: translation blocks and the emulation manager, with a context switch on each transfer between them]
Binary Translation
Control flow optimization - Chaining
  - [Figure: numbered control-flow steps among a.out, the interpreter, the SPC-TPC lookup table, the emulator, and the code cache]
Binary Translation
Condition Codes (CC)

  - IA32, PowerPC, SPARC, VAX, and ARM all have CC
  - MIPS, Alpha, and Itanium do not
  - Case 1: Both source and target machines have CC
      - the source machine’s CC must be saved/restored
  - Case 2: Only the source machine has CC
      - the target machine must simulate the CC
      - some source machines set many CC in one instruction
  - Case 3: Only the target machine has CC
      - compare & branch is emulated by two instructions
  - Case 4: Neither target nor source has CC
      - no issues
Binary Translation

CC is set more often than referenced.
  - If a CC is set again before it is used, the earlier CC value does not need to be saved. However, most CC are saved at the end of each translated block.
  - If it can be determined that the following block always sets the CC before use, the current block does not need to save the CC it sets. The payoff is very high (e.g. all x86 ALU operations update CC).
  - Some rarely used CC (V, C) or flags (e.g. parity) can be simulated using lazy evaluation. Instead of saving/restoring the CC or the flag, save the instruction and its operands and recompute the CC/flag when it is needed.
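
Lazy evaluation as described above can be sketched as follows. This is an illustrative sketch, not QEMU's implementation: the `LazyCc` record and the `emu_*` / `cc_*` function names are assumptions, and the carry flag for subtraction follows the x86 borrow convention.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lazy-flag record: remember the last flag-setting operation
   and its operands instead of computing every flag eagerly. */
typedef enum { CC_KIND_ADD, CC_KIND_SUB } CcKind;
typedef struct { CcKind kind; uint32_t a, b, result; } LazyCc;

static uint32_t emu_add(LazyCc *cc, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    cc->kind = CC_KIND_ADD; cc->a = a; cc->b = b; cc->result = r;
    return r;
}

static uint32_t emu_sub(LazyCc *cc, uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    cc->kind = CC_KIND_SUB; cc->a = a; cc->b = b; cc->result = r;
    return r;
}

/* Flags are recomputed only when an instruction actually reads them. */
static int cc_zero(const LazyCc *cc)     { return cc->result == 0; }
static int cc_negative(const LazyCc *cc) { return (int)((cc->result >> 31) & 1u); }

static int cc_carry(const LazyCc *cc)
{
    /* ADD: carry out of bit 31; SUB: borrow (x86 convention, an assumption). */
    return cc->kind == CC_KIND_ADD ? cc->result < cc->a : cc->a < cc->b;
}

static int cc_overflow(const LazyCc *cc)
{
    uint32_t a = cc->a, b = cc->b, r = cc->result;
    /* ADD overflows when operands share a sign that the result lacks;
       SUB overflows when operands differ in sign and the result's sign
       differs from the first operand. */
    uint32_t v = cc->kind == CC_KIND_ADD ? (~(a ^ b) & (a ^ r))
                                         : ((a ^ b) & (a ^ r));
    return (int)((v >> 31) & 1u);
}
```

The ALU operations only store three words and an operation tag; the per-flag arithmetic is paid only on the uncommon paths that test a flag.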
Binary Translation
CC example (IA32 to PowerPC)
Binary Translation

CC Optimizations
  - Combine compare with branch
      - ARM may use two instructions for a branch: a compare (or a TST or TEQ) instruction followed by a branch. For some simple cases, MIPS can simply use a compare-and-branch instruction.
      - In rare cases, the translated code could even be smaller (in number of instructions) than the original ARM code.
  - Map each flag to a dedicated register
      - Example: N: R17, Z: R18, C: R19, V: R20
      - This can reduce the instruction overhead of extracting/depositing target flags from/to the CPSR (Current Program Status Register).
      - If the target architecture has a sufficient number of registers, this optimization should be considered. Otherwise, it may take away three more registers and cause register spilling.
Binary Translation

Other issues of translation
  - Data formats and arithmetic
  - Memory data alignment
  - Byte order
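
As an illustration of the byte-order issue, the sketch below reads guest values byte by byte so that the result does not depend on the host's endianness. The helper names are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit big-endian guest value, independent of host byte order. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a 32-bit little-endian guest value, independent of host byte order. */
static uint32_t load_le32(const uint8_t *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Swap the byte order of a value already held in a host register. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}
```

A translator that emulates a guest with the opposite byte order has to insert the equivalent of `bswap32` on every multi-byte load and store, which is one reason cross-endian emulation is slower.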
Process VM

  - Perform guest/host mapping at the ABI (ISA + system calls) level
  - Encapsulate guest process in process-level runtime
  - Example: QEMU Linux user mode
  - Issues
      - Memory architecture
      - Exception architecture
      - OS call emulation
      - Overall VM architecture
      - High-performance implementation
      - System environments
Process VM – Implementation
Process VM

  - Loader
      - A special loader writes guest code and data into a region holding the guest’s memory image, and loads the runtime code into memory.
  - Initialization
      - Allocate memory for the code cache and other tables
      - Initialize runtime data structures and invoke the OS to establish signal handlers.
  - Emulation engine
      - Emulate guest instructions with interpreter or binary translation
  - Code Cache Manager
      - What translation to flush?
  - OS Call Emulator
      - Translate OS calls and OS responses
  - Exception Emulator
      - Handle signals
      - If registered by the source, pass to the source handler; if not, emulate the host response
      - Form precise state
Process VM

Compatibility
  - A strict definition of compatibility (e.g. bug-for-bug compatible) would exclude many useful process VMs.
  - Intrinsic compatibility
      - Any software, even if written by the most devious programmer, will work in a compatible way
      - Example: Intel strives for intrinsic compatibility when it produces a new x86 microprocessor
  - Extrinsic compatibility
      - Many useful VM applications do not achieve intrinsic compatibility
      - Limited application set: run Microsoft productivity tools (Office)
Process VM

Compatibility issues
  - State mapping: if the guest process uses all of the virtual address space, intrinsic compatibility cannot be achieved
  - Mapping of control transfers: some potentially trapping instructions may be removed
  - User-level instructions: FP format may be different
  - OS operations: the host OS does not support exactly the same functions as the guest’s native OS
Process VM

Software Memory Mapping
  - Runtime software maintains a mapping table
  - Similar to hardware page table/TLB
  - Slow, but always works
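
A runtime-maintained mapping table of this kind might look like the sketch below. It assumes a deliberately tiny guest address space and a flat, single-level table; `GuestMap`, `map_page`, and `guest_to_host` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES 16   /* assumed tiny guest address space, for illustration */

/* One host pointer per guest page; NULL means the page is unmapped. */
typedef struct { uint8_t *host_page[NUM_PAGES]; } GuestMap;

static void map_page(GuestMap *m, uint32_t gaddr, uint8_t *host_page)
{
    assert((gaddr >> PAGE_BITS) < NUM_PAGES);
    m->host_page[gaddr >> PAGE_BITS] = host_page;
}

/* Translate a guest address on every access: split it into page number and
   offset, look the page up, and add the offset to the host page base.
   Returns NULL when the address is out of range or unmapped. */
static uint8_t *guest_to_host(const GuestMap *m, uint32_t gaddr)
{
    uint32_t page = gaddr >> PAGE_BITS;
    if (page >= NUM_PAGES || m->host_page[page] == NULL)
        return NULL;
    return m->host_page[page] + (gaddr & (PAGE_SIZE - 1));
}
```

The extra shift, bounds check, table load, and add on every guest memory access is what makes this approach slow, and the explicit NULL case is what lets it work for any guest layout.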
Process VM

Guest address space > Host address space +
Runtime
Process VM

Direct Translation Methods
  - VM software mapping is slow
  - Use underlying hardware
  - Applicable if guest address space + runtime fit within host space
Process VM - Protecting Runtime Memory
QEMU Overview

  - Created by Fabrice Bellard in 2003
  - Function-level emulation
      - Faster than “cycle-accurate” simulators.
      - Good enough to use applications written for another CPU.
  - Just-in-time (JIT) compilation support to achieve high performance (400 ~ 500 MIPS)
  - Lots of peripherals supported (VGA, serial, Ethernet, etc.)
  - Lots of hosts and targets supported (full system emulation)
      - x86, arm, mips, sh4, cris, sparc, powerpc, nds32, …
      - qemu/hw/* contains all of the supported boards.
  - User mode emulation: can run applications compiled for another CPU.
QEMU Overview

Update status
  - 0.9.1 (Jan 6, 2008)
      - Stable; development then stopped for a long time
  - 0.10 (Mar 5, 2009)
      - TCG support (a new general JIT framework)
  - 0.11 (Sep 24, 2009)
      - KVM support
  - 0.12
      - More KVM support
      - Code refactoring
      - New peripheral framework to support dynamic board configuration
QEMU Screenshot – Emulate ARM11MPCore
QEMU Screenshot – Android 2.1
QEMU JIT

TCG (Tiny Code Generator)
  - Began as a generic backend for a C compiler. It was simplified to be used in QEMU.

Translation Block (TB)
  - A TCG "basic block" corresponds to a list of instructions terminated by a branch instruction.
  - 16 MB code cache size
QEMU JIT

Prologue, Epilogue
  - [Figure: generated prologue and epilogue when the target machine is ARM]

QEMU JIT

  - cpu_exec() is called each time around the main loop.
  - The program executes until an unchained block is encountered.
  - Control returns to cpu_exec() through the epilogue.
  - Enter the code cache:
      - Linux: set the buffer executable, then jump to the buffer and execute
QEMU JIT – code gen flow

  - Front-end: qemu/tcg/tcg.c
  - gen_intermediate_code -> disas_XXX_insn
      - Interpret source instructions and translate them to micro-ops.
      - Translation stops when a conditional branch is encountered.
QEMU JIT – code gen flow

tcg_liveness_analysis
  - Remove dead code.
  - Ex. and_i32 t0, t0, $0xffffffff
  - Ex. add_i32 t0, t1, t2; add_i32 t0, t0, $1; mov_i32 t0, $1 (only the last instruction is kept)
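
The idea behind removing instructions that compute dead results can be sketched with a single backward pass over a block. This is a simplified illustration, not the code of tcg_liveness_analysis: the `MicroOp` layout, `NUM_TEMPS`, and `eliminate_dead_ops` are assumptions, and side effects such as memory stores are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_TEMPS 8   /* assumed number of temporaries */

/* Hypothetical micro-op: dst = f(src1, src2).
   A source of -1 means "unused or immediate operand". */
typedef struct { int dst, src1, src2; bool dead; } MicroOp;

/* Walk the block backwards. An op whose destination is not live afterwards
   computes a dead result and is marked dead; otherwise its destination stops
   being live (it is defined here) and its sources become live.
   live_out[t] says whether temp t is still needed after the block.
   Returns the number of ops marked dead. */
static size_t eliminate_dead_ops(MicroOp *ops, size_t n, const bool *live_out)
{
    bool live[NUM_TEMPS];
    size_t removed = 0;

    for (int t = 0; t < NUM_TEMPS; t++)
        live[t] = live_out[t];

    for (size_t i = n; i-- > 0; ) {
        MicroOp *op = &ops[i];
        assert(op->dst >= 0 && op->dst < NUM_TEMPS);
        if (!live[op->dst]) {
            op->dead = true;
            removed++;
            continue;
        }
        live[op->dst] = false;
        if (op->src1 >= 0) live[op->src1] = true;
        if (op->src2 >= 0) live[op->src2] = true;
    }
    return removed;
}
```

Applied to the three-instruction example above with only t0 live at the end of the block, the two add operations are marked dead and the final mov is kept.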
QEMU JIT – code gen flow

Register mapping
register struct CPUNDS32State *env asm("r14");
register target_ulong T0 asm("r15");
register target_ulong T1 asm("r12");
register target_ulong T2 asm("r13");
QEMU JIT – Block chaining

  - Avoid context-switch overhead
  - Every time a block returns, try to chain it.
  - tb_add_jump(): back-patch the native jump address
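
Block chaining can be illustrated with a simplified model in which each block has a single successor and the "patched jump" is a pointer field. The `TB` layout, `tb_find`, `tb_chain`, and `run` below are hypothetical stand-ins for this sketch; in particular `tb_chain` only models the effect of tb_add_jump() and does not patch machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPC_END 0xffffffffu   /* assumed sentinel: no successor */

typedef struct TB {
    uint32_t spc;          /* guest address this block was translated from */
    uint32_t next_spc;     /* guest address of its single successor */
    struct TB *jmp_next;   /* direct link to the successor, NULL if unchained */
} TB;

typedef struct { TB *tbs; size_t n; unsigned lookups; } Cache;

/* Slow path: search the cache for the block starting at `spc`. */
static TB *tb_find(Cache *c, uint32_t spc)
{
    c->lookups++;
    for (size_t i = 0; i < c->n; i++)
        if (c->tbs[i].spc == spc)
            return &c->tbs[i];
    return NULL;
}

/* Models back-patching: later executions go straight to `to`. */
static void tb_chain(TB *from, TB *to) { from->jmp_next = to; }

/* Execute up to max_blocks blocks; returns how many were executed. */
static unsigned run(Cache *c, uint32_t spc, unsigned max_blocks)
{
    unsigned executed = 0;
    TB *tb = tb_find(c, spc);
    while (tb != NULL && executed < max_blocks) {
        executed++;                                  /* "execute" the block */
        if (tb->jmp_next == NULL && tb->next_spc != SPC_END) {
            TB *next = tb_find(c, tb->next_spc);     /* back in the dispatcher */
            if (next != NULL)
                tb_chain(tb, next);
        }
        tb = tb->jmp_next;                           /* chained: no lookup */
    }
    return executed;
}
```

In a two-block loop, the dispatcher is consulted once per block to establish the links; after that the blocks run back to back without returning to it.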
QEMU JIT – Memory load emulation

  - Based on QEMU 0.10.5, emulating MIPS (little endian)
  - decode_opc -> translate MIPS assembly to micro-ops
  - Translation stops when a conditional branch is encountered.
  - gen_store_gpr stores this value to the emulated CPU’s general register.
QEMU JIT – Memory load emulation
  - [Figure: code listing from target-mips/translate.c]
QEMU JIT – Memory load emulation

Generate binary code
  - /qemu/translate-all.c: cpu_gen_code -> gen_intermediate_code, tcg_gen_code
  - /qemu/tcg.c: tcg_gen_code -> tcg_gen_code_common -> tcg_reg_alloc_op
  - /qemu/tcg/i386/tcg-target.c: tcg_out_op (opc)
  - TCG outputs 0xe8, which is a call instruction on x86. It calls the functions in the array qemu_ld_helpers. The arguments to the functions are passed in registers EAX, EDX, and ECX.
QEMU JIT – Memory load emulation
0xe8
pc (s->code_ptr)
pc+4 (s->code_ptr += 4)
…
Offset (4 byte)
__ldb_mmu
qemu_ld_helpers[s_bits]
#define REGPARM __attribute((regparm(3)))
QEMU JIT – Memory load emulation
SoftMMU

Translate guest virtual address to host virtual address.
  - Translate the guest physical address to host physical address.
  - QEMU needs to find the PhysPageDesc entry in table **l1_phys_map and get the phys_offset.
      - guest_phy_addr[31:22] -> first-level entry
      - guest_phy_addr[21:12] -> second-level entry
      - If the page is not found -> cpu_register_physical_memory: QEMU creates a new entry (by mmap), updates its value, and inserts the entry into the l1_phys_map table.
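
The two-level indexing described above can be sketched as follows. This is only an illustration of the bit slicing: `PageDesc`, `l1_map`, and `page_find` are hypothetical names, the second level is allocated with calloc rather than mmap, and the real PhysPageDesc carries more than what is shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define L1_BITS   10
#define L2_BITS   10
#define PAGE_BITS 12

/* Hypothetical descriptor; only the field named on the slide is modeled. */
typedef struct { uint32_t phys_offset; int valid; } PageDesc;

/* First level: pointers to lazily allocated second-level arrays. */
static PageDesc *l1_map[1u << L1_BITS];

/* Find the descriptor for a guest physical address. When `alloc` is nonzero,
   a missing second-level array is created (zero-filled); otherwise NULL is
   returned for an address whose second level does not exist yet. */
static PageDesc *page_find(uint32_t paddr, int alloc)
{
    uint32_t i1 = (paddr >> (PAGE_BITS + L2_BITS)) & ((1u << L1_BITS) - 1u); /* [31:22] */
    uint32_t i2 = (paddr >> PAGE_BITS) & ((1u << L2_BITS) - 1u);             /* [21:12] */

    if (l1_map[i1] == NULL) {
        if (!alloc)
            return NULL;
        l1_map[i1] = calloc(1u << L2_BITS, sizeof(PageDesc));
        assert(l1_map[i1] != NULL);
    }
    return &l1_map[i1][i2];
}
```

Allocating the second level on demand keeps the table small when only a few regions of the 4 GB guest physical space are populated.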
QEMU JIT – Memory load emulation
SoftMMU

Translate the guest physical address to host virtual address.
  - phys_offset == IO_MEM_RAM -> guest RAM space
      - phys_offset[31:12]: the offset of this page in emulated physical memory.
      - phys_offset + phys_ram_base = host virtual address
  - phys_offset > IO_MEM_ROM -> MMIO space
      - phys_offset[11:3]: the index in the io_mem_write/io_mem_read arrays.
      - register the I/O emulation functions:

QEMU JIT – Memory load emulation
SoftMMU

Original way
  1. Translate the guest virtual address to guest physical address
  2. Then QEMU needs to find the PhysPageDesc entry in table l1_phys_map and get the phys_offset
  3. phys_offset + phys_ram_base = host virtual address

Software TLB table
  1. Search the TLB first.
  2. Hit: guest_virtual_address + addend = host_virtual_address.
  3. Miss: Search the l1_phys_map table and then fill the corresponding entry into the TLB table
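
The software TLB steps can be sketched as a direct-mapped table of (tag, addend) pairs. This is a minimal sketch under simplifying assumptions: guest RAM is one contiguous host buffer with an identity virtual-to-physical mapping, and `SoftTlb`, `slow_path_addend`, and `tlb_translate` are hypothetical names rather than QEMU's.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_MASK (~((1u << PAGE_BITS) - 1u))
#define TLB_SIZE  256

typedef struct { uint32_t tag; uintptr_t addend; int valid; } TlbEntry;
typedef struct { TlbEntry e[TLB_SIZE]; unsigned hits, misses; } SoftTlb;

/* Assumed guest RAM: four pages backed by one host buffer. */
static uint8_t guest_ram[4u << PAGE_BITS];

/* Stand-in for the slow page-table walk: returns the value that must be
   added to a guest virtual address on this page to get the host address. */
static uintptr_t slow_path_addend(uint32_t page_vaddr)
{
    (void)page_vaddr;   /* identity mapping: every page has the same addend */
    return (uintptr_t)guest_ram;
}

static uint8_t *tlb_translate(SoftTlb *tlb, uint32_t gvaddr)
{
    TlbEntry *te = &tlb->e[(gvaddr >> PAGE_BITS) & (TLB_SIZE - 1)];
    uint32_t tag = gvaddr & PAGE_MASK;

    assert(gvaddr < sizeof(guest_ram));
    if (te->valid && te->tag == tag) {
        tlb->hits++;                              /* hit: one compare, one add */
    } else {
        tlb->misses++;                            /* miss: walk, then refill */
        te->tag = tag;
        te->addend = slow_path_addend(tag);
        te->valid = 1;
    }
    return (uint8_t *)(te->addend + gvaddr);
}
```

Only the first access to a page pays for the slow path; later accesses to the same page are resolved by the tag compare and the addend.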
QEMU JIT – Memory load emulation
SoftMMU (__ldX_mmu)
(continued on next page)
QEMU JIT – Memory load emulation
SoftMMU (__ldX_mmu)
cpu_exec -> tb_find_fast -> tb_find_slow -> get_phys_addr_code -> (if TLB does not match) ldub_code (softmmu_header.h) -> __ldl_mmu (softmmu_template.h) -> tlb_fill -> cpu_XXX_handle_mmu_fault -> tlb_set_page -> tlb_set_page_exec
QEMU JIT – Summary
[Flowchart nodes]
Look up TB
Cached?
No
Translate one TB
Yes
Execute code cache
Exception happens and is handled
Chain it to existing TBs
QEMU other topics

Fixed register allocation
register struct CPUARMState *env asm("r14");
register target_ulong T0 asm("r15");
register target_ulong T1 asm("r12");
register target_ulong T2 asm("r13");

Condition code (CC)
  - Lazy CC evaluation
  - Recovery when needed
R=A+B
CC_SRC=A
CC_DST=R
CC_OP=CC_OP_ADDL
Source code organization

qemu/
  - qemu-* : OS-dependent API wrappers (example: memory allocation or socket)
  - target-*/ : target porting
  - tcg/ : new and unified JIT framework
  - *-user/ : user-mode emulation on different OSes
  - softmmu-* : target MMU acceleration framework
  - hw/ : peripheral models
  - fpu : softfloat FPU emulation library
  - gdb : GDB stub implementation
Source code organization

  - TranslationBlock structure in translate-all.h
  - Translation cache is code_gen_buffer in exec.c
  - cpu_exec() in cpu-exec.c orchestrates translation and block chaining.
  - vl.c: Main loop for system emulation.
Sample Demo

  - Using gdb to debug QEMU
  - Using QEMU to debug guest OS
  - QEMU Linux-user mode emulation
  - QEMU system mode emulation
Funny issues of QEMU

  - Generate execution traces to drive timing models
      - Try to integrate timing models
  - Improve optimization, say, by retaining chaining across interrupts
  - TCG optimization
      - Code cache management
      - Optimization passes on micro-ops
  - Emulate a multi-core guest on a multi-core host