Implementation of Embedded Operating Systems - Virtualization
Chih-Wen Hsueh (薛智文), cwhsueh@csie.ntu.edu.tw
http://rswiki.csie.org/dokuwiki/courses:101_1:ieos
http://www.facebook.com/groups/190254204331656/
Department of Computer Science and Information Engineering, National Taiwan University
NEWS Lab, Graduate Institute of Networking and Multimedia, Department of CSIE

2012 Top 10 Jobs in USA
Columns: median salary and top salary in thousands (K), ten-year job growth in percent, total jobs in thousands (K).

Job                            Median (K)     Top (K)  Growth (%)    Jobs (K)
Biomedical engineer                  79.5       124.0        61.7        15.7
Marketing consultant                 92.1       208.0        41.2       282.7
Software architect                  119.0       162.0        24.6      3426.0
Clinical research associate          90.7       129.0        36.4       100.0
Database administrator               87.2       122.0        30.6       110.8
Financial adviser                    90.2       206.0        32.1       206.8
Market research analyst              63.1        97.7        41.2       282.7
Physical therapist                   76.7        99.0        39.0       198.6
Software developer                   84.2       121.0        24.6      3426.0
Occupational therapist               74.9       102.0        33.5       108.8

The ranking also considered softer issues, such as how satisfying, flexible, or low-stress a job is.

In Top 10 Jobs
http://finance.yahoo.com/news/2012best-jobs-in-america.html?page=all
• What do they do all day?
• How to get the job?
• What makes it great?
• What's the catch?

How much can you earn for the company?

Preface
• Steve Jobs (Apple, 1955-2011): "Stay hungry, stay foolish." Chinese renderings: 求知若渴,虛心若愚 ("thirst for knowledge, stay humble") and 持飢保愚 ("keep hungry, keep foolish").
• Dennis Ritchie (C language, 1941-2011)
• Skype: eBay (4.1B USD, 2005), Microsoft (8.5B USD, 2011)
• Linux (Linus Torvalds, 1991)
• Android (Danger, 2003; Google, 2005)
• OpenStack (Rackspace and NASA, Oct 2010)
• MeeGo (Intel, Samsung, Feb 2010)
• Tizen (Intel, Samsung [Nokia], Sep 2011)
• Windows 8 (Microsoft, nVidia, 2011)
• iOS 5 (Apple, 2011)
• Quanta, TSMC (2011)
• Largan, Hon Hai, HTC, Sharp (2012)

Migration of Operating-System Features
[Figure: timeline from 1950 to 2010 showing features migrating from mainframes to minicomputers, personal computers, and handheld computers: no software, compilers, batch, resident monitors, time sharing, multiuser, interactive, networking, distributed systems, multiprocessors, fault tolerance, clusters, MULTICS, UNIX, virtual machines, and finally virtualization; "Cloud OS?" is marked as an open question.]

What Is Virtualization?
The creation of a virtual version of something.
Virtual class, virtual circuit, virtual community, virtual device, virtual disk, virtual host, virtual keyboard, virtual machine, virtual market, virtual memory, virtual money, Virtual Private Network, virtual reality, …
[Figure: "Virtualization" surrounded by: fully utilizing hardware, sharing hardware resources, running applications (cross-platform), security, etc.]

Types of Virtualization
• Hardware/platform virtualization
• Desktop virtualization
• Software virtualization
  • OS-level, workspace, application
• Storage virtualization
• Data virtualization
• Database virtualization
• Network virtualization

Big Questions for Virtualization
• How fast can virtualization be?
• What kinds of applications can there be?
• What problems might it incur?
  • Technical
  • Security
  • Business
  • Politics
  • …

Why Is Virtualization Difficult?
• The guest OS is moved out of ring 0 into ring 1 or ring 3.
  • 0/1/3 ring model, e.g. x86_32
  • 0/3/3 ring model, e.g. x86_64, ARM
• On x86, some sensitive instructions cannot be trapped (critical instructions):
  • Sensitive register instructions: SGDT, SIDT, SLDT, SMSW, PUSHF(D), POPF(D)
  • Protected system instructions: LAR, LSL, VERR, VERW, PUSH, POP, CALL, JMP, INT, RET, STR, MOV

Virtual Machine Monitor (VMM) / Hypervisor
• VM: virtual machine = guest OS + virtual devices
• Type I, hypervisor (e.g. Xen): VM0, VM1, …, VMN run on the hypervisor, which runs directly on the hardware.
• Type II, hosted VMM (e.g. VMware): VM0, VM1, …, VMN run on the hosted VMM, which runs on a host operating system on the hardware.

Software Execution Modes in Virtualization Environment

Mode        Physical    Virtual     Description
Hypervisor  Privileged  N/A         For executing the hypervisor only.
Kernel      User        Privileged  For executing the kernel of a virtual machine.
User        User        User        For executing user processes of a guest OS.
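The mode table can be read as a trap-and-emulate recipe: the guest kernel believes it is privileged (virtual mode) but physically runs in user mode, so a privileged instruction traps and the hypervisor emulates it against the virtual CPU state. The following is a minimal Python sketch of that idea, assuming a toy CPU with two made-up privileged instructions (`LOAD_PAGE_TABLE`, `DISABLE_INTERRUPTS`). All names here are hypothetical; none come from the slides or from a real hypervisor.

```python
from dataclasses import dataclass, field

# Toy instruction set: these names are illustrative, not a real ISA.
PRIVILEGED = {"LOAD_PAGE_TABLE", "DISABLE_INTERRUPTS"}


class Trap(Exception):
    """Raised by the toy CPU when a privileged instruction runs in physical user mode."""


@dataclass
class VirtualCPU:
    virtual_mode: str = "kernel"      # what the guest believes: "kernel" or "user"
    interrupts_enabled: bool = True   # virtual copy of a privileged flag
    page_table: int = 0               # virtual copy of a privileged register
    faults: list = field(default_factory=list)  # faults reflected to the guest


def cpu_execute(insn, physical_mode):
    """Hardware model: privileged instructions trap unless physically privileged."""
    if insn in PRIVILEGED and physical_mode != "privileged":
        raise Trap(insn)


def hypervisor_emulate(vcpu, insn, operand=None):
    """Trap handler: emulate for a guest kernel, reflect a fault to guest user code."""
    if vcpu.virtual_mode != "kernel":
        vcpu.faults.append(insn)
        return
    if insn == "DISABLE_INTERRUPTS":
        vcpu.interrupts_enabled = False
    elif insn == "LOAD_PAGE_TABLE":
        vcpu.page_table = operand


def run_guest(vcpu, insn, operand=None):
    """Guest code, kernel or user, always runs in physical user mode."""
    try:
        cpu_execute(insn, physical_mode="user")
    except Trap:
        hypervisor_emulate(vcpu, insn, operand)


vcpu = VirtualCPU()
run_guest(vcpu, "DISABLE_INTERRUPTS")       # guest kernel: emulated on virtual state
run_guest(vcpu, "LOAD_PAGE_TABLE", 0x1000)  # guest kernel: emulated on virtual state
vcpu.virtual_mode = "user"
run_guest(vcpu, "LOAD_PAGE_TABLE", 0x2000)  # guest user: fault reflected, state unchanged
```

The scheme only works if every sensitive instruction actually traps. The critical x86 instructions listed earlier do not trap in user mode, which is why they need other techniques.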
The First Challenge of Virtualization: Virtualizable
According to Popek and Goldberg† in 1974, virtual machines can be constructed for a platform if
    Sensitive Instructions ⊆ Privileged Instructions
• Sensitive instructions might change the state of system resources.
• Privileged instructions must be executed with sufficient privilege (otherwise they trap).
† G. J. Popek and R. P. Goldberg, "Formal requirements for virtualizable third generation architectures," Commun. ACM, vol. 17, no. 7, pp. 412-421, Jul. 1974.

How to Virtualize?
• Full virtualization: binary translation
• Paravirtualization: hypercall
• Hardware-assisted virtualization (Intel VT-x & AMD SVM): trap and emulate

Hypervisor (VMM) Type
• Type I
  • Type I + microkernel: Xen (open source, Citrix), Microsoft Hyper-V
  • Type I + integrated kernel: VMware ESX, KVM (Kernel-based Virtual Machine)
• Type II (host OS + guest OS): VMware GSX, VMware Workstation, Microsoft Virtual PC, Microsoft Virtual Server, Sun VirtualBox

Xen Architecture (1/2)
[Figure: Domain 0 alongside several Domain U guests.]

Xen Architecture (2/2)
Compared with common Linux:

Linux                 Xen
System calls          Hypercalls
Signals               Events
Interrupts            Physical + virtual interrupts
CPU                   PCPU + VCPU
Virtual filesystem    XenStore
POSIX shared memory   Grant tables / shared pages

Hypercall
• System call: int 0x80

    // linux/include/asm/unistd.h
    #define __NR_restart_syscall  0
    #define __NR_exit             1
    #define __NR_fork             2
    #define __NR_read             3
    …

• Hypercall: int 0x82

    // xen/include/public/xen.h
    #define __HYPERVISOR_set_trap_table  0
    #define __HYPERVISOR_mmu_update      1
    #define __HYPERVISOR_set_gdt         2
    #define __HYPERVISOR_stack_switch    3
    …

• Example path: the guest OS calls HYPERVISOR_sched_op -> int 82h -> hypercall -> hypercall_table -> do_sched_op -> iret -> the guest OS resumes.

Event Channel
• A lightweight signal mechanism
• Uses "ports" as identifiers (each with pending and mask bits)
• Four major purposes: IDC (inter-domain communication), IPI (inter-processor interrupt), vIRQ (virtual IRQ), pIRQ (physical IRQ)
[Figure: guest OSes with their VCPUs exchange IDC, IPI, and vIRQ events through the hypervisor (virtual CPU, virtual memory, scheduling); pIRQ events come from the hardware (physical CPU, physical memory, Eth0, Eth1, …).]

Interrupt
• Physical interrupt: for the hypervisor or for guest OSes
• Virtual interrupt: asks a guest OS to do something; 8 are defined for now (the maximum is 24)
[Figure: interrupt delivery. Natively: device, PIC, IRQn, OS ISR. Under the hypervisor: device, PIC, IRQn, hypervisor, then an event to the guest OS.]

CPU Virtualization
• Architecture
[Figure: applications on guest OSes; each guest OS has VCPUs; the hypervisor schedules VCPUs onto PCPUs.]
• 2 scheduling algorithms (non-work-conserving)
  • Simple Earliest Deadline First (SEDF)
  • Credit
PCG

XenStore
• XenStore is a centralized configuration database that is accessible by all domains. It is meant for configuration and status information rather than for large data transfers.
• Management tools configure and control virtual devices by writing values into the database; these writes trigger events in drivers.
• Each domain gets its own path in the store, which is somewhat similar in spirit to procfs.
• When values are changed in the store, the appropriate drivers are notified.
• XenBus provides a bus abstraction for paravirtualized drivers to communicate between domains. It is used for configuration negotiation, leaving most data transfer to an interdomain channel composed of a shared page and an event channel.
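The XenStore behavior described above (a path-structured store, one subtree per domain, and driver notification on writes) can be sketched as a small in-memory store with watches. This is an illustration only, assuming a hypothetical `ToyStore` class; it is not the XenStore wire protocol or any real Xen library API, and the paths are examples.

```python
class ToyStore:
    """In-memory key/value tree with watches, loosely modeled on the XenStore description."""

    def __init__(self):
        self._data = {}      # full path -> value
        self._watches = []   # list of (path prefix, callback)

    def watch(self, prefix, callback):
        """Register a callback fired for every write at or below `prefix`."""
        self._watches.append((prefix.rstrip("/"), callback))

    def write(self, path, value):
        """Store a value, then notify every watcher whose prefix covers the path."""
        self._data[path] = value
        for prefix, callback in self._watches:
            if path == prefix or path.startswith(prefix + "/"):
                callback(path, value)

    def read(self, path):
        return self._data.get(path)

    def ls(self, prefix):
        """List the immediate children under `prefix`, like listing a directory."""
        prefix = prefix.rstrip("/") + "/"
        children = {p[len(prefix):].split("/")[0]
                    for p in self._data if p.startswith(prefix)}
        return sorted(children)


store = ToyStore()
events = []

# A (hypothetical) back-end driver watches the virtual network devices of domain 1.
store.watch("/local/domain/1/device/vif", lambda path, value: events.append((path, value)))

# A management tool writes configuration; only the first write is under the watch.
store.write("/local/domain/1/device/vif/0/state", "1")
store.write("/local/domain/2/name", "guest2")
```

Only small configuration values pass through the store; bulk data would travel over a shared page plus an event channel, as the slide notes.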
Grant Table
• Page mapping & page transferring
• Page as a unit
• Grant reference (GR)
• Grant entry
• Page mapping: Domain A creates a GR and sends it to Domain B; Domain B maps the page, accesses it, unmaps it, and informs Domain A; Domain A releases the GR.
• Page transferring: Domain B creates a GR and sends it to Domain A; Domain A transfers the page and informs Domain B; Domain B receives the page and releases the GR.

Memory Virtualization (1/2)
• Two-level memory (native): the application sees virtual memory; the OS manages physical memory.
• Three-level memory (virtual, pseudo-physical, machine):
  • Application: virtual memory
  • Guest OS: pseudo-physical memory
  • Hypervisor: machine memory
• The P2M and M2P tables translate between pseudo-physical and machine memory.

Memory Virtualization (2/2)
168M of memory is reserved for the hypervisor.

Area                                                   Size
MPT, Machine-to-Physical Translation Table (RO)        16M
Page-Frame Information                                 96M
MPT, Machine-to-Physical Translation Table (R/W)       16M
Linear Page Table                                      8M
Shadow Linear Page Table                               8M
Per Domain Mappings                                    8M
Direct Map                                             12M
I/O Remap                                              4M

Other labels on the figure: Heap; addresses 0xFC000000, 0xFC400000, 0xFFFFFFFF.

Memory Virtualization - Translation
4 mechanisms to manipulate page tables:
• Paravirtualized page tables
• Writable page tables (only level 1 is writable)
• Shadow page tables
• Hardware-assisted paging
[Figure: virtual memory is translated by the guest page table (VM->PFN) to pseudo-physical memory and by P2M to machine memory. On a page fault the hypervisor maintains the shadow page table (VM->MFN or VM->P2M) used by the MMU; with HAP, second-level paging performs the P2M step in hardware.]

I/O Device Virtualization
The hypervisor also provides three mechanisms to use devices:
• Emulated devices
• Paravirtualized driver
• Pass-through

I/O Device Virtualization - Emulated Devices
• Implemented by QEMU (QEMU-DM)
• e.g. sound cards (ac97, sb16), etc.

I/O Device Virtualization - Paravirtualized Driver
• Split device driver model
• An example of sending packets
[Figure: the front-end driver in the guest domain passes packets to the back-end driver, which hands them to the native driver.]

I/O Device Virtualization - I/O Ring
• The ring carries no data; it only transfers requests and replies.
• An example with GR
[Figure: Dom U and Dom 0 exchange GRs over the I/O channel; each domain has a grant table, and the hypervisor keeps the active grant table in front of the device.]

I/O Device Virtualization - Pass-Through
• Pass the device through to a domain, which uses it directly.
[Figure: Dom 0 and Dom U guests each use a native driver; the hypervisor (virtual CPU, virtual memory, scheduling) sits above the hardware (physical CPU, physical memory, Eth0, Eth1, …).]

Hardware Virtual Machine
Intel Virtualization Technology

Technology  Description                           Virtualization    Implementation
VT-x        Root/non-root, Extended Page Tables   CPU, memory       Instruction set
VT-i        As VT-x, for Itanium
VT-d        DMA, interrupt                        Devices           IOMMU (chipset)
VT-c        Classify packets                      Network devices   VMDq, VMDc

CPU Benchmark (1/2)
[Chart] 8.3%. Average over 100 tests; deviation: 0.066~0.128%.

CPU Benchmark (2/2)
[Chart] 5%. Calculating 32M digits of π.

Hard Disk Drive Benchmark
[Chart]

Network Benchmark (1/2)
[Chart] 59%. Testing time: 180 seconds; deviation: 0.12~0.26%.

Network Benchmark (2/2)
[Chart] Average: 9.82%. Sample period: 2 seconds.

Case Study
1. Embedded Virtualization†
   • Mobile phone
2. Domain 1§
   • PC
3. OpenStack
   • Cloud OS
† Yuan-Cheng Lee, Chih-Wen Hsueh, and Rong-Guey Chang, "Inline Emulation: An Optimization Technique for Virtualization on Embedded Systems," Proc. of the 17th International Conference on Real-Time and Embedded Computing Systems and Applications (RTCSA'11), Toyama, Japan, August 2011.
§ Project with Insyde Inc.

Motivation of Mobile Virtualization
• On PCs, virtualization is fast enough: a virtualized OS reaches 90+% of the performance of the same OS running non-virtualized.
• We can further utilize multi-core embedded processors.
• To run multiple operating systems on a mobile phone…

Related Work
• Secure Xen on ARM (Samsung): it proved that virtualization is possible on the ARM platform.
• The PENAR project (University of Applied Sciences, Western Switzerland): it integrated the source trees of Xen, RTLinux, and Linux for ARM.
• OKL4 (Open Kernel Labs): a hypervisor for embedded systems that adopts a microkernel architecture.

Solutions

Complexity                                Binary translation  Hypercall        Inline emulation
Design                                    High                Low              Low
Implementation                            Medium              High             Low
Runtime                                   High                Medium           Low
Counterpart (in programming languages)    Virtual function    Normal function  Inline function

The First Challenge of Virtualization: Example
For the ARM architecture, the instruction (TYPE III)
    MOVS PC, LR
• changes the program counter and switches to user mode.
• However, it causes unpredictable behavior when executed in user mode.
• Therefore, it is a sensitive instruction but not a privileged instruction:
    Sensitive instructions ⊈ Privileged instructions

Hypercall
[Figure: the guest OS issues a hypercall as a software interrupt; in the hypervisor, the SWI handler dispatches to the hypercall handler. After handling, "reschedule?": yes leads to a context switch, no returns to the guest OS.]

Idea of Inline Emulation
The original instruction:
    MOV   R0, VIRT_ADDR
    MCR   p15, 0, R0, C8, C6, 1

Hypercall
  Guest OS (rewritten to call the hypervisor):
    MOV   R0, #CMD_FLUSH_DENTRY
    MOV   R1, VIRT_ADDR
    SWI   #HYPER_CALL_TLB
  Hypercall handler:
    ……
    LDR   R1, [SP, #4]
    MCR   p15, 0, R1, C8, C6, 1
    (restore user context & PC)

Inline emulation
  Guest OS (original instruction kept):
    MOV   R0, VIRT_ADDR
    MCR   p15, 0, R0, C8, C6, 1
  Inline emulation handler:
    ……
    /* restore user context */
    LDMIA SP, {R0-R14}
    MCR   p15, 0, R0, C8, C6, 1
    (restore PC)

Inline Emulation
[Figure: canonical privileged instructions (TYPE I) executed by the guest OS raise an undefined-instruction (UDI) exception; the hypervisor's UDI handler performs inline emulation and returns to the guest. Hypercalls still arrive as software interrupts through the SWI handler and the hypercall handler; "reschedule?": yes leads to a context switch, no returns to the guest.]

Design of Inline Emulation: The Main Handler
[Flowchart: the main handler searches for a handler for the trapped instruction, with one branch for "a handler for the instruction is found" and one for "no handler for the instruction was found".]

The Issue of Finding an Inline Emulation Handler
• It is hard to find a simple hash function, because the encoding of ARM instructions is complicated.
• Instead, we can construct an efficient search table, because there are only a few frequently used instructions.

Instruction                    Ratio (%)
mcr p15, 0, Rd, c3, c0, 0      58.44
mcr p15, 0, Rd, c7, c14, 1     39.73
mcr p15, 0, Rd, c8, c5, 1      0.49
mcr p15, 0, Rd, c8, c6, 1      0.49
mcr p15, 0, Rd, c7, c10, 4     0.24
mcr p15, 0, Rd, c2, c0, 0      0.23
mcr p15, 0, Rd, c7, c5, 0      0.11
mcr p15, 0, Rd, c8, c5, 0      0.08
mcr p15, 0, Rd, c8, c6, 0      0.08
mrc p15, 0, Rd, c7, c14, 3     0.11
Others                         <0.01

Example of Mto1 Search Table
Encoding of the MCR instruction
Syntax: MCR{cond} cp, op1, Rd, CRn, CRm, op2
Bits 31..0: cond [31:28] | 1110 [27:24] | op1 [23:21] | 0 [20] | CRn [19:16] | Rd [15:12] | cp [11:8] | op2 [7:5] | 1 [4] | CRm [3:0]

mask        value       handler       Set
0x0F1F0F10  0x0E130F10  handler_CR3   MCR 15, op1, Rd, c3, CRm, op2
0x0F1C0F10  0x0E100F10  handler_CR02  MCR 15, op1, Rd, {c0 - c2}, CRm, op2
0x0F100F10  0x0E100F10  handler_CRX   MCR 15, op1, Rd, {c4 - c15}, CRm, op2
……
0x00000000  0x00000000  0x00000000    End of Table

* An entry E is matched if (insn & E.mask) == E.value

Design of Inline Emulation: Dynamic Inline Emulation (DIE) Handler
• Inlining the instruction (self-modifying)
• Flushing caches

Design of Inline Emulation: Static Inline Emulation (SIE) Handler
• Executing the hard-coded instructions, e.g. /* data synchronization barrier */
• Restoring user context & PC

Evaluation and Analysis: The Experiment Environment

Emulator     Android emulator (ARMv5)
Memory       12MB for the hypervisor, 32MB for the guest OS
Hypervisor   Xen 4.0.1 for ARMv5
Guest OS     Linux 2.6.29-Goldfish
Compilation  Using GCC with debug (-g) flag

Evaluation and Analysis: The Distribution of Emulated Instructions

Instruction  CRn, CRm, op2  Ratio (%)
MCR          c3, c0, 0      58.44
MCR          c7, c14, 1     39.73
MCR          c8, c5, 1      0.49
MCR          c8, c6, 1      0.49
MCR          c7, c10, 4     0.24
MCR          c2, c0, 0      0.23
MCR          c7, c5, 0      0.11
MCR          c8, c5, 0      0.08
MCR          c8, c6, 0      0.08
MRC          c7, c14, 3     0.11
Others                      <0.01

The first two rows together account for more than 98%.

Evaluation and Analysis: The Micro-Level Analysis (1/2)
Operation: invalidating the TLB. Counts are executed instructions per mode; PV = paravirtualization (hypercall), IE = inline emulation.

Operation                         Method    USER     UND     SWI   Total  Improvement PV/IE (%)
A single entry (DIE handler)      PV       13.00    0.00  305.97  318.97  613.39
                                  IE        3.00   49.00    0.00   52.00
The entire TLB (SIE handler)      PV       11.00    0.00  305.80  316.80  704.01
                                  IE        3.00   42.00    0.00   45.00

Evaluation and Analysis: The Micro-Level Analysis (2/2)

Instruction                       Method             USER     UND     SWI   Total  Improvement PV/IE (%)
MCR p15, 0, Rd, c3, c0, 0         PV                 9.00    0.00  203.29  212.29  424.57
                                  IE (DIE handler)   3.00   47.00    0.00   50.00
MCR p15, 0, Rd, c7, c14, 1        PV                13.00    0.00  304.50  317.50  566.94
                                  IE (DIE handler)   3.00   53.00    0.00   56.00

For the two instructions that make up about 98% of emulated instructions, inline emulation achieves at least 4.24x the performance of hypercalls, measured in executed instructions.
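The "Improvement PV/IE (%)" column in the micro-level tables is the ratio of total executed instructions under paravirtualization to the total under inline emulation. The short Python check below recomputes it from the totals printed in the tables. The dictionary keys and the helper name `improvement_percent` are illustrative; the recomputed values can differ from the slide in the second decimal place, presumably because the printed totals are already rounded.

```python
# Total instruction counts copied from the two micro-level tables.
# Each entry: (PV total, IE total, improvement reported on the slide in %).
MEASUREMENTS = {
    "invalidate single TLB entry (DIE)": (318.97, 52.00, 613.39),
    "invalidate entire TLB (SIE)": (316.80, 45.00, 704.01),
    "MCR p15, 0, Rd, c3, c0, 0 (DIE)": (212.29, 50.00, 424.57),
    "MCR p15, 0, Rd, c7, c14, 1 (DIE)": (317.50, 56.00, 566.94),
}


def improvement_percent(pv_total, ie_total):
    """PV/IE expressed in percent: how many times more instructions PV executes."""
    return 100.0 * pv_total / ie_total


results = {
    name: improvement_percent(pv, ie)
    for name, (pv, ie, _reported) in MEASUREMENTS.items()
}

for name, (_pv, _ie, reported) in MEASUREMENTS.items():
    print(f"{name}: computed {results[name]:.2f}%, reported {reported:.2f}%")
```

The smallest ratio belongs to `MCR p15, 0, Rd, c3, c0, 0`, the most frequent emulated instruction, which is where the "at least 4.24x" figure comes from.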
Evaluation and Analysis: The Macro-Level Analysis

                                   Data        Data                Software   Coprocessor
                                   processing  transfer  Branch    interrupt  and other    Total
Paravirtualization (instructions)  89.22M      91.28M    27.08M    48560      4.79M        212.42M
Inline emulation (instructions)    89.04M      90.66M    26.93M    33658      4.93M        211.59M
(PV - IE) / PV (%)                 0.20        0.68      0.53      30.69      -2.72        0.39

Brief Summary
Inline emulation:
• Reduces the effort needed to port guest operating systems
• Speeds up the handling of sensitive instructions (4-7x)
• Improves overall system performance (0.39%)
Future work:
• Optimization for memory virtualization
• A much higher overall speedup is possible.

Domain 1: A Fake Domain 0
Architecture
[Figure: Dom1, Dom0, and DomU … guests (labels: Windows; Linux, xend, drivers; Android, drivers) and a payload run above the hypervisor and BIOS; the hardware is split into assignable hardware (VGA, eth, usb, …) and non-assignable hardware.]

OpenStack
An IaaS cloud computing project that enables any organization to create and offer cloud computing services running on standard hardware.
• Now at OpenStack Folsom, the 6th release
• Released under the Apache License; written in Python
• Started by Rackspace Cloud and NASA in 2010; the first release came 4 months after the announcement
• 150 companies have joined the project, among them AMD, Intel, Canonical, SUSE Linux, Red Hat, Cisco, Dell, HP, IBM, and Yahoo!
• Included and released in both Ubuntu and Red Hat

OpenStack Components
(component, code name: description)
• Compute, Nova: a cloud computing fabric controller utilizing external libraries such as Eventlet (for concurrent programming), Kombu (for AMQP communication), and SQLAlchemy (for database access).
• Object Storage, Swift: offers cloud storage software so that you can store and retrieve lots of data in virtual containers.
• Image Service, Glance: provides discovery, registration, and delivery services for virtual disk images, plus a standard REST interface for streaming virtual disk images stored in a variety of back-end stores.
• Networking, Quantum: provides "networking as a service" between interface devices (e.g., vNICs) managed by other OpenStack services.
• Block Storage, Cinder: the goal of the Cinder project is to separate the existing nova-volume block service into its own project.
• Identity, Keystone: authentication (authN) and high-level authorization (authZ), supporting token-based authN and user-service authorization. Will support OAuth, SAML, and OpenID in future versions.
• Dashboard, Horizon: provides a baseline user interface for managing OpenStack services.

Future OpenStack Components
(component, code name: description)
• Metering, Ceilometer: the infrastructure to collect measurements within OpenStack so that no two agents would need to collect the same data.
  • Its primary targets are monitoring and metering, but the framework should be easily expandable to collect for other needs.
  • To that effect, Ceilometer should be able to share collected data with a variety of consumers.
• Basic Cloud Orchestration & Service Definition, Heat: a service to orchestrate multiple composite cloud applications using the AWS CloudFormation template format, through both an OpenStack-native REST API and a CloudFormation-compatible Query API.
  • AWS CloudFormation enables you to create and manage AWS infrastructure deployments predictably and repeatedly. It helps you leverage AWS products such as Amazon EC2, EBS, Amazon SNS, ELB, and Auto Scaling to build highly reliable, highly scalable, cost-effective applications without worrying about creating and configuring the underlying AWS (Amazon Web Services) infrastructure.

Summary
• Stay hungry to be full [of passion].
• Stay foolish to be smart [on absorption].
• When the fake is taken for real, the real becomes fake too! (假若真時真亦假)
• Virtualized reality. Real virtualization.
• Virtualized to go anywhere?
• Key is the system. System is the key.
• E.g. Virtual Tape Library