PwnOS Design Document
Version 1.0a
By Neil Dickson
As part of Code Cortex

Presented for:
COMP 3000 Fall 2007 Operating Systems
Carleton University
Dr. Anil Somayaji

Revision History

| Revision | Editor | Description |
|---|---|---|
| 1.0a | Neil Dickson | Minor table formatting change; clarified explanation of process pages; added table of figures; added paragraph about heap allocation alignment |
| 1.0 | Neil Dickson | This is the initial version of the document. It describes the proposed design of PwnOS in detail. |

Table of Contents

Revision History
Table of Figures
Abstract
Introduction
    High Concept
    Motivation and Goals
    Project Scope
    Document Scope
        Abbreviations
Market Analysis and Prior Art
    High Performance Dedicated Servers Market
    Commercial High Performance Systems
    Comparison of Similar Operating Systems
Design
    Organisational Overview
        API Specification
    Booting
    Memory
        Page Memory Management
        Heap Memory Management
    Threads and Processes
        Process Management
        Thread Management
    Device I/O
    Files
    Synchronisation
References

Table of Figures

Figure 1: Overall Architecture
Figure 2: Module Dependencies
Figure 3: Physical Memory During Boot
Figure 4: Overall Virtual Memory Layout
Figure 5: Example Heap Memory Ranges
Figure 6: Heap Memory Range Trees
Figure 7: Heap Memory Range Trees with Compound Nodes
Figure 8: User-Accessible Process Pages
Figure 9: Global Descriptor Table (GDT)

Abstract

PwnOS is a low-overhead operating system designed for isolated or otherwise dedicated computer systems that require high CPU performance on general-purpose hardware. This document explains the motivations and goals of PwnOS, and gives a market analysis based on those goals, followed by a comparison between PwnOS and operating systems with similar goals. Then, the design of PwnOS is presented in detail, in relation to those goals.

Introduction

High Concept

PwnOS is a low-overhead operating system designed for isolated or otherwise dedicated computer systems that require high CPU performance on general-purpose hardware.

Motivation and Goals

The main goals of PwnOS can be summarised as follows:

| Motivation | Goal |
|---|---|
| General-purpose operating systems tend to have a high level of bloat that is largely avoidable if the operating system is custom-tuned for high performance dedicated systems. | very low time overhead; low space overhead |
| Core APIs for general-purpose operating systems tend to be very large and/or very cryptic. | simple core API; expandable/modifiable user API |
| High-performance computing and other dedicated systems generally have just one large process with many threads. | best suited to few processes with many threads |
| Dedicated systems do not need to waste time checking for malware or otherwise be slowed down for security reasons only applicable to general-purpose systems. | protection against accidents, not against malware |
| Dedicated systems need only a limited set of drivers, and the large set of drivers with a general-purpose operating system can be detrimental. | good support for just a few devices is sufficient for a particular system; exhaustive support is not needed |
| Dedicated systems using general-purpose hardware can be much less expensive than dedicated systems using special hardware. | only need support for general-purpose hardware |
| Custom server systems often need custom modifications to the operating system. | good, thorough documentation at all levels |

Project Scope

The scope of the PwnOS project is most significantly limited by its goals, which is quite convenient, in that the goals are then not significantly limited by the scope. Since the goals generally favour a simple design over a complex design, this makes most complex design aspects out of scope.

The following items are currently considered to be out of the scope of the PwnOS project:

- Filesystem design
- Executable/library format design
- Full compatibility with other operating systems
- Graphical user interface libraries
- Security and privacy libraries
- Extensive device support
- Elaborate Inter-Process Communication (IPC)
- Application software
- Backward hardware compatibility

Implications of these include that an existing filesystem design and an existing executable/library format must be used. Also, all application software and user libraries must be developed separately from PwnOS. Some such software might be developed as part of other Code Cortex projects.

The following items are currently considered to be in the scope of the PwnOS project:

- Boot loader code
- Page and heap memory management
- Thread management and scheduling; inter-thread communication
- Simple process management; support for an existing, common executable/library format
- Support for common, general-purpose hardware relevant to dedicated systems
- Support for an existing, common filesystem
- Thread synchronisation management

Document Scope

The objective of this document is to provide thorough details of the design of PwnOS, up to but not including the level of exhaustive function lists. Some details given in this document might be considered more implementation than design, but the intent is to have more emphasis on details that are less likely to change than details that are more likely to change. For documentation relating to implementation details, please see (1).

This document does not provide introductory, tutorial, or reference documentation on hardware devices, protocols, data structures, standards, or computer languages, except if and when useful in describing their relation to the design of PwnOS. For details on these, please see the cited material.

Abbreviations

| Abbreviation | Term |
|---|---|
| ACPI | Advanced Configuration and Power Interface |
| API | Application Programming Interface |
| APIC | Advanced Programmable Interrupt Controller |
| ATA | AT Attachment |
| AVL Tree | Adelson-Velsky Landis Tree |
| BIOS | Basic Input/Output System |
| CPU | Central Processing Unit |
| DMA | Direct Memory Access |
| FTP | File Transfer Protocol |
| GB | GigaBytes (2^30 bytes in this document) |
| GDT | Global Descriptor Table |
| HP | Hewlett-Packard Company |
| HPC | High-Performance Computing |
| HTTP | HyperText Transfer Protocol |
| IBM | International Business Machines Corporation |
| I/O | Input / Output |
| IP | Internet Protocol |
| IPC | Inter-Process Communication |
| ITRON | Industrial The Real-time Operating system Nucleus |
| LAPIC | Local APIC |
| LSI | Large Scale Integration (of circuitry) |
| MB | MegaBytes (2^20 bytes in this document) |
| MBR | Master Boot Record (sector 0) |
| MFT | Master File Table |
| NTFS | New Technology File System |
| OS | Operating System |
| PCI | Peripheral Component Interconnect |
| PDs | Page Directories |
| PDPT | Page Directory Pointer Table |
| PIT | Programmable Interval Timer |
| PL0 | Privilege Level 0 (supervisor privilege) |
| PL3 | Privilege Level 3 (user privilege) |
| PML4 | Page Map Level 4 table |
| PS/2 | Personal System/2 (or the ports thereof) |
| PTs | Page Tables (or in general the page mapping) |
| RAM | Random-Access Memory |
| RPTs | Reverse Page Tables (not standardised) |
| SGI | Silicon Graphics Incorporated |
| SIMD | Single-Instruction, Multiple-Data |
| SSE# | Streaming SIMD Extensions # (e.g. SSE3) |
| TB | TeraBytes (2^40 bytes in this document) |
| TCP | Transmission Control Protocol |
| TLB | Translation Lookaside Buffer |
| TSS | Task State Segment |
| UDP | User Datagram Protocol |
| USB | Universal Serial Bus |
| VBE | VESA BIOS Extensions |

Market Analysis and Prior Art

High Performance Dedicated Servers Market

The server system market, although large and continually growing, is too broad a market to consider for PwnOS, since many uses for servers are not limited by CPU performance.
Also, many server systems are not what would be considered “isolated” systems, and issues not addressed by PwnOS, such as security and capabilities, are important for those systems. As such, this section will only consider relevant submarkets of the server system market.

The most distinctive market for high performance dedicated servers is the High-Performance Computing (HPC) market. HPC is best known for use of systems with many CPU cores (a cluster) for parallel computation and large amounts of RAM to facilitate computationally intense use of these cores. HPC servers are most commonly used for solving very difficult, but parallelisable, problems. This type of problem arises frequently in scientific computing and optimisation of complex systems. For example, simulation of the folding of proteins is a problem very important to biological and pharmaceutical research, but it is extremely difficult to do accurately, and yet is highly parallelisable. There is, however, a wide variety of these types of problems, including LSI verification, weather prediction, cinematic quality 3D rendering, network optimisation, schedule optimisation, and many others. The growth of HPC in recent years can be attributed to this myriad of fields in which HPC is useful, especially scientific and technical fields. The role of PwnOS is not to provide a full HPC system, but just to provide a simple, low overhead framework on top of which to run HPC applications using general-purpose hardware.

Other uses for high performance dedicated servers include custom database systems. As an example of this, a telephone switching system can make use of such a database system. A large amount of information related to telephone switching remains relatively constant (e.g. phone number routing data) and is very frequently queried, often concurrently, but this amount of information is easily small enough to fit in memory given a mid-range dedicated server.
Since the information is relatively constant, it can be perpetually cached, making a very low overhead operating system with support for large memory ideal for these operations. However, the ease of use and adequate performance of general-purpose database systems have made custom database systems less common, so this may not be a significant market for PwnOS.

Commercial High Performance Systems

Major companies specialising in development and/or retail of low- to mid-range HPC servers include IBM, Sun, HP, SGI, and Quadrics. Such servers (as limited by price) can generally be arranged into categories based on the amount of RAM per CPU. For instance, the HP Integrity rx6600 Server has 2-8 cores and 192GB of RAM (24-96GB/CPU), whereas the Sun Blade x8420 Server Module has 8 cores and 64GB of RAM (8GB/CPU). Clusters with more RAM/CPU are better for some tasks (i.e. more performance per dollar) and worse for others. Similar arguments can be made for I/O connections to these servers.

However, the modularity of these systems is perhaps a more important factor. Most of the above companies offer both standalone server systems and server systems in which many server modules can be mounted together with special interconnects for inter-module communication. The former is referred to as a rack-mount server system, and the latter is referred to as a blade server system. Rack-mount systems are standard computer systems usually using high-quality conventional hardware, but each computer is inherently separate, albeit likely connected via a network interface. A blade system requires a special enclosure providing power, cooling, and networking in a way that is more efficient than rack-mount systems, but it can be more expensive due to its highly specialised hardware. Blade server systems are expandable to high-end clusters, but may not be as cost-effective for low-end clusters.

Software on these systems varies widely.
IBM mostly sells systems running either a variant of Linux or Windows Compute Cluster Server on Windows Server 2003, and it also sells systems with a custom operating system, AIX. Sun’s systems mostly run its own operating system, Solaris, and HP likewise sells systems running its own operating system, HP-UX. Until recently, SGI sold systems running its operating system, IRIX, but they now primarily run Linux. Because these companies mostly make money from the hardware, not the software, the overall trend for low- to mid-range servers has been away from custom operating systems.

Comparison of Similar Operating Systems

| | PwnOS | HP-UX | Cellular IRIX | Solaris | ITRON | Minix | L4 Fiasco | Mach |
|---|---|---|---|---|---|---|---|---|
| Purpose | Industrial Server | Industrial Server | Industrial Server | Industrial Server | Industrial Embedded | OS Research | OS Research | OS Research |
| Min. CPU | 1 core, x86-64, SSE3 | 2 cores, Itanium 2 | 64-bit MIPS | 2 cores, x86-64 or SPARC64 | 1 core, large variety | 1 core, 80386 | 1 core, 80486 | 1 core, 80386 |
| Recommended CPU | 15 cores, x86-64, SSE3 | 128 cores, Itanium 2 | 128 cores, 64-bit MIPS | 8 cores, x86-64 or SPARC64 | 1 core, large variety | 1 core, Pentium | 1 core, Pentium | 1 core, 80486 |
| Recommended RAM | 1GB to 512GB+ | 32GB to 512GB+ | ? to 1024GB | 256MB to 64GB | Small (<<4GB) | 16MB to <4GB | 2MB to 1GB | ? to <4GB |
| Status | In development | Active | Abandoned | Active | Stagnant | Active | Stagnant | Abandoned |
| Organisation | Code Cortex | Hewlett-Packard | Silicon Graphics | Sun Microsystems | TRON Association | - | - | - |
| Market Penetration | None | Good | Fair | Good | Excellent | None | None | None |
| License | GPL | Proprietary | Proprietary | CDDL | N/A | BSD | GPL | None |

HP-UX (2), Cellular IRIX (3), and Solaris (4) are operating systems that are or have been developed by HP, SGI, and Sun, respectively, for server systems, especially cluster or cluster-like systems. PwnOS is similar to these in that the primary focus is cluster systems and other high performance dedicated systems.
However, PwnOS differs in that it is inherently designed to remain simple while still providing the functionality needed to create and run software for these systems, whereas HP-UX, Cellular IRIX, and Solaris are colossal, complex masses of software. PwnOS also differs in that the other three operating systems are sold in conjunction with expensive server systems, whereas PwnOS is intended to enable companies and individuals to make inexpensive server systems from computers that they already have.

ITRON (5) is an operating system specification used for more embedded systems than any other design on Earth. ITRON is, in fact, very dissimilar from PwnOS. ITRON operating systems are real-time operating systems for embedded systems on a wide variety of different custom and general-purpose hardware platforms. PwnOS is not intended to be a real-time operating system, it is not for embedded systems, and it is designed to support only a limited range of modern, general-purpose hardware. The only similarity is the intent to be for industrial use. ITRON is, however, a great success story, showing how common such a project can become in just 20 years.

Minix (6), L4 (7), and Mach (8) are or were projects developed by independent operating system enthusiasts interested in trying theoretical designs for operating systems (specifically micro-kernel designs). The purpose of these operating systems is completely different from that of PwnOS, and so their designs are very different from that of PwnOS. The common design element is simplicity, but beyond that, PwnOS is not a micro-kernel design and has performance, functionality, and specific usefulness in mind, whereas the others are micro-kernel designs with reliability, minimal functionality, and no particular usefulness in mind. The other significant element in common is that PwnOS is currently being developed by a tiny group of independent operating system enthusiasts.
The key difference there is that this tiny group is interested in more than just operating systems.

Design

Organisational Overview

This section outlines the overall organisation of PwnOS, the relations between its major modules, and design aspects common to all or most major modules of PwnOS.

To optimise the use of modern general-purpose CPUs, PwnOS is a 64-bit operating system for processors supporting the x86-64 architecture and SSE3. Portability and backward compatibility often conflict with the goals of low time and space overhead, and so are not considered in the design of PwnOS. PwnOS is designed to work best with many CPU cores. Because a CPU and a CPU core appear the same to software, the terms “CPU” and “CPU core” are used interchangeably in the rest of this document.

The following figure presents the overall code architecture of PwnOS in terms of its modules and the relevant interfaces.

Figure 1: Overall Architecture

The reasoning behind having the heap memory management module and part of the synchronisation module accessible directly from Privilege Level 3 (PL3) is that some of the time overhead of system calls can be avoided by using regular function calls where possible. Managing heap memory does not require actively changing page tables or updating core data structures, so it can be done from PL3. Likewise, getting a lock that is currently free and releasing a lock on which no threads are blocking are both operations that can be done in PL3. When the fast path fails (e.g. the lock is not free when attempting to get it), the code calls PL0 to handle all cases properly.

That these modules are in special, read-only pages of memory common to all processes is also important. There will be a heap for modules in PL0 and a heap for each application, and duplication of the heap management code would be wasteful, so it is kept in common for all tasks using global, read-only pages for efficiency and accident avoidance.
This is discussed further in the Page Memory Management section. Custom libraries in this read-only memory may include libraries developed by/for a user of PwnOS, the Code Cortex libraries, or any other libraries to be loaded here.

The following diagram presents the dependencies between the modules of PwnOS.

Figure 2: Module Dependencies

Dependencies on (Fast) Sync have been omitted since all modules but Thread Scheduler and (Full) Sync have such a dependency. There are cyclic dependencies involving I/O, Page and Heap Memory, and Sync, and as such, they must be initialised without using any of the code therein. See the Booting section for more detail on initialisation.

API Specification

The following table specifies the core Application Programming Interface of PwnOS. It has been designed to be simple, with meaningful names, and still have sufficient functionality. Parameters are passed by registers, and system calls are made using the special SYSCALL instruction for optimal performance. Library calls can be made as normal calls (after relocation).

| Function | Parameters | Returns | Description |
|---|---|---|---|
| AllocatePages | Address, nPages, AllocType | Address | Allocates the specified number of pages with the specified properties. If Address is not NULL, the pages will be allocated with that virtual address (rounded down to the page). |
| FreePages | Address, nPages | | Deallocates the specified number of pages starting at the specified address (rounded down to the page). |
| AllocateMemory | nBytes | Address | Allocates on the heap a range of the specified number of bytes. |
| AllocateAlignedMemory | nBytes, Alignment | Address | Allocates on the heap a range of the specified number of bytes aligned to 2^Alignment bytes. |
| FreeMemory | Address | | Deallocates the memory range starting at Address from the heap. |
| GetAllocationSize | Address | nBytes | Returns the size of the heap memory range containing Address. |
| GetAllocationStart | Address | StartAddress | Returns the start address of the heap memory range containing Address. |
| CreateProcess | pName, DataSize, pData, Flags | pProcess | Creates a new process from a file with the specified name and specified properties. If pData is not NULL, DataSize bytes of data are copied to the new process as command data. |
| DestroyProcess | pProcess | | Stops and completely eliminates the specified process and all of its threads. This does not return if pProcess is the current process. |
| GetCurrentProcess | | pProcess | Returns a reference to the current process. |
| CreateThread | pFunction, StackSize, Flags, Parameter | pThread | Creates a new thread with the specified properties that starts execution by calling the specified function with Parameter. The new thread’s stack has the specified size. Returning from the function destroys the thread. |
| DestroyThread | pThread | | Destroys the specified thread. This does not return if pThread is the current thread. |
| PauseThread | pThread | | Pauses execution of the specified thread, saving its state. |
| ResumeThread | pThread | | Resumes execution of the specified thread if the thread was paused or sleeping. |
| Sleep | pThread, Milliseconds | | Puts the specified thread to sleep (similar to pausing) for the specified number of milliseconds, after which execution resumes normally. |
| ScheduleThread | pThread, pFunction, Parameter, Milliseconds | pSchedule | Schedules the specified paused thread to call the specified function with Parameter after the specified number of milliseconds. If the thread had state saved when paused, that state may be overwritten. |
| UnscheduleThread | pSchedule | | Unschedules the previously scheduled thread execution event. |
| GetCurrentThread | | pThread | Returns a reference to the current thread. |
| GetLock | pLock | | Assigns the access controlled by the specified lock to the current thread, blocking until it is allowed to do so if necessary. (PL3 and PL0) |
| ReleaseLock | pLock | | Releases the access controlled by the specified lock from the current thread immediately, informing blocked threads. (PL3 and PL0) |
| AttemptGetLock | pLock, Milliseconds | hasLock | Attempts to assign the access controlled by the specified lock to the current thread for a limited amount of time before giving up. (PL3 and PL0) |
| WaitForNotify | pQueue | | Adds the current thread to the specified queue of waiting threads. (all PL0) |
| AttemptWaitForNotify | pQueue, Milliseconds | wasNotified | Adds the current thread to the specified queue of waiting threads for a limited amount of time before giving up. (all PL0) |
| Notify | pQueue | nNotified | Notifies the first thread in the specified queue of waiting threads (if any). (all PL0) |
| NotifyAll | pQueue | nNotified | Notifies all threads in the specified queue of waiting threads. (all PL0) |
| OpenFile | pName, Flags | pFile | Opens a file with the specified name with the specified access and properties. |
| ReadFile | pFile, pDestination, nBytes | nBytesRead | Reads the specified number of bytes to the specified address in memory from the specified opened file. |
| WriteFile | pFile, pSource, nBytes | nBytesWritten | Writes the specified number of bytes to the specified file from the specified address in memory. |
| CloseFile | pFile | | Closes the specified file. |
| GetFileSize | pFile | | Returns the size of the specified file if it has a size. |
| GetFilePointer | pFile | ByteIndex | Returns the current location in the specified file from which the next read or write operation would occur, if it has such a location. |
| SetFilePointer | pFile, ByteIndex | | Sets the current location in the specified file from which the next read or write operation would occur, if it has such a location. |
| GetGraphicsAccess | nBytes | pGraphics | Allocates pages onto the graphics linear frame buffer (or virtual linear frame buffer), returning a pointer to a structure describing the buffer. |
| ReleaseGraphicsAccess | | | Deallocates pages of the graphics linear frame buffer (or virtual linear frame buffer). |
| AddKeyListener | pFunction | | Registers the specified function to be called using the current thread when a key of a keyboard is pressed or released. |
| RemoveKeyListener | pFunction | | Deregisters the specified function from being called on key events, preferring the current thread if duplicates exist. |
| AddMouseButtonListener | pFunction | | Registers the specified function to be called using the current thread when a button of a mouse is pressed or released. |
| RemoveMouseButtonListener | pFunction | | Deregisters the specified function from being called on mouse button events, preferring the current thread if duplicates exist. |
| AddMouseMotionListener | pFunction | | Registers the specified function to be called using the current thread when a mouse is moved. |
| RemoveMouseMotionListener | pFunction | | Deregisters the specified function from being called on mouse motion events, preferring the current thread if duplicates exist. |
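
The lock functions marked "(PL3 and PL0)" rely on the fast-path idea from the Organisational Overview: take a free lock or release an uncontended one in PL3, and call PL0 only otherwise. The following is a minimal single-threaded model of that control flow in Python, not PwnOS code. The Lock fields, the helper names, and the list used to record PL0 calls are all assumptions made for illustration, and compare_and_swap merely stands in for an atomic CPU instruction.

```python
# Single-threaded model of the PL3 fast path for locks.
# Every name here is hypothetical; the document does not define a lock layout.

FREE = 0  # assumed encoding: 0 means no thread owns the lock


class Lock:
    def __init__(self):
        self.owner = FREE  # id of the owning thread, or FREE
        self.waiters = 0   # threads blocked on the lock (PL0 would maintain this)


def compare_and_swap(lock, expected, new):
    """Model of an atomic compare-and-swap on lock.owner."""
    if lock.owner == expected:
        lock.owner = new
        return True
    return False


def get_lock(lock, thread_id, pl0_calls):
    """Take the lock in PL3 if it is free; otherwise record a PL0 call."""
    if compare_and_swap(lock, FREE, thread_id):
        return "fast"
    # A real system call would block the thread here until the lock is free.
    pl0_calls.append(("GetLock", thread_id))
    lock.waiters += 1
    return "slow"


def release_lock(lock, thread_id, pl0_calls):
    """Release in PL3 when no thread is blocked; otherwise defer to PL0."""
    # A real implementation would need to check waiters and ownership
    # in one atomic step; this model ignores that race.
    if lock.waiters == 0 and compare_and_swap(lock, thread_id, FREE):
        return "fast"
    # PL0 would hand the lock to a blocked thread and wake it.
    pl0_calls.append(("ReleaseLock", thread_id))
    return "slow"


calls = []
lock = Lock()
print(get_lock(lock, 1, calls))      # uncontended acquire stays in PL3
print(release_lock(lock, 1, calls))  # uncontended release stays in PL3
print(calls)                         # no PL0 calls were needed
```

In this model, only a contended acquire or a release with blocked threads reaches the recorded PL0 path, which is the saving the design is after.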

Booting

The actions for which the master boot record (MBR) code is responsible are:

- Read the rest of the boot loader from the following sectors on the boot drive
- Find and switch to a video mode based on desired resolution and either 24-bit or 32-bit colour
- Disable interrupts, etc.
- Initialise the GDT data
- Enable 32-bit protected mode and jump to the rest of the boot loader

The actions for which the rest of the boot loader is responsible are:

- Find and save relevant ACPI data as given by the BIOS. This includes data about the CPUs, memory, interrupts, and devices. This information is critical to the functionality of PwnOS.
- Ensure that the CPU and system meet the requirements for PwnOS.
- Configure the I/O APIC to route all I/O interrupts to the bootstrap processor.
- Set Memory Type Range Registers (MTRRs) and Page Attribute Table (PAT) MSR to configure memory caching.
- Configure Local APIC (LAPIC), and calibrate LAPIC timer using Programmable Interval Timer (PIT) during the waits of the INIT-SIPI-SIPI protocol.
- Wait for all CPUs to configure memory caching.
- Configure hard-coded paging setup on all CPUs.
- Switch all CPUs to 64-bit mode.
- Configure PCI for DMA with ATA devices and/or with USB devices.
- Identify all ATA devices and look for all NTFS partitions.
- Find an NTFS partition containing “PwnOS\Core.bin” and “PwnOS\Main.exe”.
- Load the pieces of Core.bin to their appropriate places in virtual memory.
- Initialise the modules of PwnOS (Thread Scheduler, I/O, Page Memory, ...)
- Make this boot loader have core data structures as if it were a real process.
- Call CreateProcess on “PwnOS\Main.exe”.
- Call DestroyProcess on the fake boot process, to free its resources.

Inter-CPU communication in the boot loader (in order to initialise the non-bootstrap processors) is done using the mechanism built into the LAPIC, i.e. by sending special interrupts to other CPUs over the APIC bus.

The physical memory layout during boot is as follows (but may be subject to significant changes).

Figure 3: Physical Memory During Boot

The 15 stacks are for the up to 15 CPUs during boot. The ATA scratch memory is a buffer for loading data from disk during boot. The paging data is only used during boot, as the page directories are kept in their own page after boot.

For more information on bootstrapping, the I/O APIC, and ACPI, see (9), (10), and (11).

Memory

Page Memory Management

To keep page management and address translation efficient and simple, PwnOS uses the x86-64 option of a 3-level page table tree with 2MB pages instead of a 4-level page table tree with 4KB pages. This means that page allocations and deallocations require much less updating of data, and fewer Translation Lookaside Buffer (TLB) invalidations. It also means that all but 8KB (i.e. the PML4 and the PDPT) of the page table tree for the first 512GB of memory fits into a single 2MB page. This allows the virtual addresses of these page directories (PDs) to be constant, making lookup by PwnOS very fast. It also allows the processor to cache entries for address translation much more reliably. For amounts of RAM significantly more than 512GB, additional measures may need to be taken, but for reasonably foreseeable amounts of RAM (about 16TB), this approach of fixed-address PDs is sufficient (since it only requires 64MB of address space for 16TB of RAM). The 2MB pages do not cause a significant waste of physical or virtual memory, since few processes will be present on the system.

Page allocations will be assigned physical memory immediately (if there is enough), to avoid unnecessary and expensive page faults later. The option is given to reserve pages, though, which can be put into physical memory later.
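
The sizes quoted above follow from the standard x86-64 table geometry (8-byte entries, 512 entries per 4KB table). The short Python sketch below reproduces that arithmetic and shows how a virtual address splits into indices for the 3-level walk with 2MB pages. The function and constant names are illustrative only and do not come from PwnOS.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40

PAGE_SIZE = 2 * MB        # large pages, as chosen in the text
ENTRY_SIZE = 8            # bytes per x86-64 paging entry
ENTRIES_PER_TABLE = 512   # entries in one 4KB table

# The PML4 and one PDPT are one 4KB table each: the "8KB" in the text.
TOP_LEVEL_BYTES = 2 * ENTRIES_PER_TABLE * ENTRY_SIZE


def pd_bytes(ram_bytes):
    """Bytes of page directory entries needed to map ram_bytes with 2MB pages."""
    return (ram_bytes // PAGE_SIZE) * ENTRY_SIZE


def split_virtual_address(va):
    """Split a 48-bit virtual address into (PML4, PDPT, PD, offset) fields."""
    offset = va & (PAGE_SIZE - 1)    # bits 0-20: offset within the 2MB page
    pd_index = (va >> 21) & 0x1FF    # bits 21-29: entry in the page directory
    pdpt_index = (va >> 30) & 0x1FF  # bits 30-38: entry in the PDPT
    pml4_index = (va >> 39) & 0x1FF  # bits 39-47: entry in the PML4
    return pml4_index, pdpt_index, pd_index, offset


print(TOP_LEVEL_BYTES // KB)      # 8  (KB for PML4 + PDPT)
print(pd_bytes(512 * GB) // MB)   # 2  (MB of PDs for 512GB)
print(pd_bytes(16 * TB) // MB)    # 64 (MB of PDs for 16TB)
```

Because each page directory entry maps 2MB directly, the walk stops one level earlier than with 4KB pages, which is where the reduction in table updates comes from.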
Since the assumption is made that there are very few processes running on the system at any given time, and allocations/deallocations of pages are not frequent, the brute force TLB invalidation algorithm is sufficient, and in fact just as efficient as elaborate algorithms for TLB invalidation. The brute force algorithm is:
1. Interrupt all CPUs other than the current one with an indication that page tables have changed.
2. Invalidate those entries on every CPU to ensure that they are not in the TLB.
3. Resume all CPUs.

Although there may be many CPUs, it is probable that all of them are currently either:
- running different threads of the same process,
- running the thread scheduler, in which case the CPU will not be interrupted and page tables will be updated immediately anyway, or
- idle

This then means that most CPUs that can be updated need updating, so there is no significant loss in using the brute force algorithm.

Use of a page file (hence the dependency on I/O in the dependency diagram of the Organisational Overview section), if necessary, will be done in a very standard way. That is, pages that are paged-out will be marked as not present, and the bits that are then available in the corresponding entry will be used to indicate which page in the page file contains the data. Memory will only be written to the page file if necessary, or if there is idle CPU time and physical memory is at least 90% occupied.

To keep track of the physical-to-virtual address mapping, how recently used physical pages are (with a variant of aging), and where free physical pages are, a reverse page table tree is used. Like the page tables, this structure also has fixed virtual addresses for faster lookup. Unlike the page tables, this structure must be updated at periodic intervals to update the age of used physical pages.
However, if no page file is used, this update isn’t needed, and since the pages are large (2MB), this periodic interval can be very long compared with 4KB pages, e.g. 100ms to 1 minute or longer, depending on the application.

The read-only memory pages used for common libraries will be in a fixed virtual address range, for convenient dynamic or static linking to these libraries (“static” in this case meaning that the addresses are hard-coded in the compiled application). Being read-only, fixed address, and common to all processes means that these pages never need to have their TLB entries invalidated, and so TLB misses for these libraries should almost never happen (also because of the 2MB page size). The ability to share other memory between processes may eventually be added, but that is not of concern at the moment, because the focus is on systems with few processes and many threads.

The overall virtual memory layout is as follows.

Figure 4: Overall Virtual Memory Layout

Naturally, each process must have its own page table tree, but the two page directories for the memory from 2GB to 4GB will be shared between all processes.

For more detailed information on page table trees and their maintenance, please see (9).

Heap Memory Management

The code for managing heap memory in PwnOS is accessible directly from PL3 and PL0. To prevent accidental corruption of this code, and to improve performance, it is kept in read-only pages shared among all processes. Each process will have a heap, and the core code will have its own heap.

Heap memory in PwnOS is maintained using a data structure representing two AVL trees stored at the end of the heap, along with a header of general data about the heap. These data are stored at the end of heap memory instead of the beginning of heap memory to aid in allocation of aligned blocks of memory within the heap.
One of the two AVL trees (the “address tree”) is a tree of all ranges of memory within the usable portion of the heap, both free and allocated, sorted on the address of the range. The other AVL tree (the “free tree”) is a tree of all free ranges within the heap, sorted on the size of the range. Having these trees ensures that memory can be allocated with “best fit” (or a variant thereof), and that memory can be freed, both in O(log n) time, where n is the number of ranges. The operations of finding an allocation’s size from its address and finding the start of an allocation from an address it contains also run in O(log n) time.

However, in order to avoid significant complications and/or performance hits (both asymptotic and clock time), the two trees must share compound nodes to represent these ranges. As a concrete example to illustrate what this means, suppose that the usable portion of the heap has the following ranges. (Suppose that ranges starting with “f” are free, and those starting with “a” are allocated.)

Figure 5: Example Heap Memory Ranges

These ranges could have the following address tree and free tree.

Figure 6: Heap Memory Range Trees

Together, the trees would then be the following.

Figure 7: Heap Memory Range Trees with Compound Nodes

Manipulation of these trees is identical to that of normal AVL trees, with the exception that upon removal of a node from the address tree, the node space left vacant must be filled by moving the node that is first in memory to that position. Additionally, the size of the free range preceding the trees must be updated upon adding and removing address nodes. If ever this free range reaches a size of zero, no more memory can be allocated from the heap, because the trees would then intersect allocated ranges. Alternatively, another heap could be allocated in such a case, to extend the first, without requiring a significant change in the tree management.
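The range bookkeeping described above can be modelled in a few lines. This is a minimal sketch, not the PwnOS implementation: sorted Python lists stand in for the two AVL trees (so insertion here is O(n) rather than O(log n)), compound nodes and alignment are omitted, and the class and method names are hypothetical.

```python
import bisect


class RangeHeap:
    """Model of a heap tracked by an address-ordered and a size-ordered structure."""

    def __init__(self, base, size):
        self.starts = [base]                 # stand-in for the address tree
        self.ranges = {base: (size, True)}   # start -> (size, is_free)
        self.free = [(size, base)]           # stand-in for the free tree, sorted by size

    def _add_free(self, start, size):
        bisect.insort(self.starts, start)
        self.ranges[start] = (size, True)
        bisect.insort(self.free, (size, start))

    def _remove(self, start):
        size, is_free = self.ranges.pop(start)
        self.starts.pop(bisect.bisect_left(self.starts, start))
        if is_free:
            self.free.pop(bisect.bisect_left(self.free, (size, start)))
        return size

    def allocate(self, size):
        """Best fit: take the smallest free range that is large enough."""
        i = bisect.bisect_left(self.free, (size, 0))
        if i == len(self.free):
            return None
        free_size, start = self.free[i]
        self._remove(start)
        bisect.insort(self.starts, start)
        self.ranges[start] = (size, False)
        if free_size > size:                 # keep the unused tail as a free range
            self._add_free(start + size, free_size - size)
        return start

    def release(self, start):
        """Free a range and merge it with free neighbours."""
        size, is_free = self.ranges[start]
        if is_free:
            raise ValueError("range is already free")
        i = bisect.bisect_left(self.starts, start)
        prev_start = self.starts[i - 1] if i > 0 else None
        next_start = self.starts[i + 1] if i + 1 < len(self.starts) else None
        self._remove(start)
        if next_start is not None and self.ranges[next_start][1]:
            size += self._remove(next_start)
        if prev_start is not None and self.ranges[prev_start][1]:
            size += self._remove(prev_start)
            start = prev_start
        self._add_free(start, size)

    def allocation_start(self, addr):
        """Return the start of the allocation containing addr, or None."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            return None
        start = self.starts[i]
        size, is_free = self.ranges[start]
        if is_free or addr >= start + size:
            return None
        return start
```

For example, after allocating 100, 200, and 50 bytes from a 1024-byte heap and releasing the 200-byte range, a request for 150 bytes is served from the freed 200-byte range rather than from the larger free range at the end of the heap.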
All allocations on the heap are aligned to 16 bytes to support SSE operations that require aligned memory operands. The AllocateAlignedMemory function allows for alignment to higher powers of two. Despite this alignment, the number of bytes requested is not rounded up to a multiple of the alignment. This is so that when checking whether a certain address is in the allocated range using GetAllocationStart and GetAllocationSize, it will be correctly determined that any bytes past the end are not allocated. Also, no range less than 16 bytes will be recorded as a free memory range.

Threads and Processes

Process Management

The Windows Portable Executable Format is used as the executable format for PwnOS. It supports x86-64 code and dynamic linking via relocation entries. Programs can then also be tested in Windows using a simple library simulating PwnOS. For details on the PE format, see (12).

Programs loaded from disk are completely loaded immediately, instead of waiting for page faults to occur, because page faults are expensive with 2MB pages.

The only explicit support for Inter-Process Communication (IPC) in PwnOS is via files (i.e. pipes), and so is discussed in the Files section. This is because with most commonly 1 or 2 processes on the system, elaborate IPC is unnecessary.

The default page-level permissions on the user-accessible pages of a process are as follows.

Type                            Permissions
Code                            Read, Execute
Global Data                     Read, Write
Heap                            Read, Write
Stack                           Read, Write
Allocated with AllocatePages    Custom

The arrangement of these pages in virtual memory is as follows.

Figure 8: User-Accessible Process Pages

The red blocks are guard pages (no access allowed) for accident prevention. The heap may or may not immediately follow the guard page after the stack, to allow for heaps that are larger than 2GB (by placing them after the 4GB mark).
If the heap does not immediately follow the guard page after the stack, it must be preceded by its own guard page.

Each process also has its own page table tree and reverse page tables. These sets, although in different parts of physical memory, occupy the same virtual memory space, as described in the Page Memory Management section.

New processes are created with one thread starting at the main code entry point. This thread may have default properties, or some of these properties can be specified to CreateProcess. Each new thread will have its own stack with a guard page on each side.

For details on process management on x86-64, see (9).

Thread Management

Support for many threads in PwnOS requires careful use of the Global Descriptor Table (GDT). The layout of the GDT entries is as follows.

Figure 9: Global Descriptor Table (GDT)

The careful use relates to the Task State Segment (TSS) descriptors. Only one CPU can be running a given task (thread) at a time, and the number of tasks (including idle tasks) may exceed the maximum number of GDT entries (8,192). Also, the Thread Scheduler must be in a task separate from all others to effectively make use of the built-in task switching and state saving. The solution is to have a single TSS descriptor for each processor, plus one for the Thread Scheduler. When a CPU is to switch threads, for any reason, the following occurs.
0. (All switches to the Thread Scheduler must be done from PL0, including LAPIC timer handlers, so being in PL0 is assumed.)
1. Thread disables interrupts (if not already disabled).
2. Thread does FXSAVE to save its extended state.
3. Thread sets its own status information to indicate why and when it is going to the Thread Scheduler.
4. Thread spinlocks for access to the Thread Scheduler (since little time will be spent in it, and switching tasks is required for more elaborate synchronisation).
5. Thread switches tasks to the Thread Scheduler. (The general CPU state is automatically saved in the TSS.)
6. Thread Scheduler stops APIC timer for the previous thread’s timeout if it wasn’t already stopped.
7. Thread Scheduler selects a thread to run.
8. Thread Scheduler writes the descriptor for that thread’s TSS to the GDT entry for the current processor.
9. Thread Scheduler does FXRSTOR to restore the extended state for the next thread.
10. Thread Scheduler sets new thread status information to indicate that it is running.
11. Thread Scheduler starts APIC timer for the new thread’s timeout.
12. Thread Scheduler switches to the new thread’s task. (The general CPU state is automatically restored.)
13. New thread releases the lock on Thread Scheduler. (Since this is always in PL0 code, this is not dependent on the application.)
14. New thread enables interrupts (if returning to PL3).

This approach ensures proper and efficient functionality for even very large numbers of tasks.

The structure used for threads encompasses the TSS for the thread and the extended state saved by FXSAVE, making most efficient use of the structures built into the CPU, instead of reorganising the data therein. However, the Thread Scheduler does not use any extended state, and has no independent execution context, so it is not a full thread; it only needs a TSS. These structures, along with all scheduling data, are kept on the PL0 heap.

Each thread also contains information on its current status, priority, any lock, notification, or I/O operation or device that it might be waiting for, and the time of the last status change. This allows the Thread Scheduler to implement any number of a wide variety of scheduling algorithms since it has enough information to make good decisions. For example, suppose that one thread has access to a device but is not waiting for an I/O operation to complete, and another thread of higher priority is waiting for access to the device.
The first thread can be given a priority boost (possibly just temporarily) so that the higher priority thread is not left waiting too long. A similar situation occurs with locks, but this is discussed in the Synchronisation section.

All I/O interrupts will be assigned to the bootstrap processor, so that preference can be given to the other processors when scheduling higher priority threads, for example. This is done using the I/O APIC’s software interface.

Thread time slice timeouts are implemented using the Local APIC (LAPIC) timer on each CPU. Both the Programmable Interval Timer (PIT) and the CMOS Timer go through the I/O APIC, and so they cannot be used for an arbitrary number of CPUs concurrently. The LAPIC timer is local to each CPU, and so does not need intervention from another CPU to work for thread scheduling. The handler for the LAPIC timer is in PL0, and it simply spinlocks for access to the Thread Scheduler, calls the Thread Scheduler, then after returning from the Thread Scheduler (the next time that this thread is run), it releases Thread Scheduler access.

For more information on task switching, LAPICs, and the I/O APIC, see (9) and (10).

Device I/O

Although it is planned that PwnOS will support PCI and USB protocols (as given by (16) and (15)), the abstractions for these protocols have not yet been designed. As such, they are not discussed extensively in this document.

The driver for ATA devices (e.g. hard drives) supports the following operations.
- Reading sectors with 28-bit and 48-bit addressing (with DMA once PCI is supported)
- Writing sectors with 28-bit and 48-bit addressing (likewise with DMA once PCI is supported)
- Device identification
- Removable media identification

In order to support these operations, the driver strictly follows the protocols presented in (13). Reading and writing of sectors are done with blocking DMA I/O, and as such are followed by an I/O interrupt indicating that the operation has finished and that the requesting thread can be run again.
Device identification and removable media identification are done with programmed I/O to avoid the overhead of setting up DMA, so the I/O interrupt is not needed.

The driver for PS/2 (and the driver for USB keyboards and mice) supports the following operations.
- Receive key press/release
- Receive mouse button press/release
- Receive mouse movement data
- Receive mouse scroll data

In order to support these operations, the driver is based upon information presented in (14) and much testing. All of the PS/2 driver operations are interrupt-driven input operations. The API for the I/O module of PwnOS allows threads to register listeners in PL3 for these input events. These threads may be given a temporary priority boost to quickly handle the user input.

Graphics in PwnOS is done using a fixed linear frame buffer, as set up by the boot loader using the VBE functions (17). As such, to have the ability to display graphics output, a process need only have the linear frame buffer’s pages present in its virtual memory. The GetGraphicsAccess and ReleaseGraphicsAccess API functions just allocate and deallocate these pages, so in a sense, they are more closely related to the Page Memory module than the I/O module.

Separate Ethernet drivers are required for each type of Ethernet card, and due to lack of standardisation, these drivers may be very different from each other. However, they must all provide the Internet Protocol (IP) abstraction for the TCP driver (and UDP driver, if present). Likewise, the TCP driver must manage its abstraction for reading and writing over TCP connections (see the Files section).

Other useful references on devices and their configuration are (18) and (11).

Files

The NTFS filesystem is the only hard disk storage filesystem supported by PwnOS. This decision was made based on its quality of documentation compared to other filesystems, its performance, its extensibility, and its compatibility with Windows.
Details on NTFS can be found in (19), so searching and manipulating the NTFS data structures are beyond the scope of this document.

Caching of Master File Table (MFT) entries for open files, and for recently used directories/files takes place, in addition to read/write caching of file clusters. The number of clusters cached or prefetched for reading, or cached for writing, depends on the size and number of previous requests for the file. Write caching is done until either some number of clusters has been filled, the next write is outside the cache, or a certain amount of time has passed.

Use of file path names in PwnOS is similar (and in some cases, identical) to that of Windows. PwnOS will accept “/” or “\” as a directory separator, and paths in general are case-insensitive Unicode strings, as illustrated by the following examples.

Path                                  Meaning
hd00:\PwnOS\Core.bin                  hard drive 0, partition 0, directory “PwnOS”, file “Core.bin”
hd00:PwnOS/Core.bin                   same as above
hd00://PwnOS\Core.bin                 same as above
hd00:\\\pwnos\core.Bin                same as above
hd312:\MyDirectory\SubDir\Cool.doc    hard drive 3, partition 1 (extended), subpartition 2, directory “MyDirectory”, subdirectory “SubDir”, file “Cool.doc”
http://www.codecortex.com/index.php   HTTP protocol, domain “www.codecortex.com”, file “index.php”
http:www.codecortex.com\index.php     same as above
usb00:\MyFile.txt                     USB port 0, partition 0, file “MyFile.txt”
\PwnOS\Core.bin                       root of partition of current directory, directory “PwnOS”, file “Core.bin”
/PwnOS                                root of partition of current directory, directory “PwnOS”
Core.bin                              current directory, file “Core.bin”
PwnOS\Core.bin                        current directory, subdirectory “PwnOS”, file “Core.bin”
C:\PwnOS\Core.bin                     partition mapped to “C”, directory “PwnOS”, file “Core.bin”
tcp:\127.0.0.1:1234                   TCP protocol, IP address 127.0.0.1, port 1234
prog:\MyProgram\MyPipe                Pipe (program) protocol, virtual directory “MyProgram”, virtual file “MyPipe”

Virtual files will be supported for abstractions of network communication protocols (such as HTTP, FTP, TCP), and for pipes. Both network abstractions and pipes work very similarly, except that in the case of pipes, both ends of the virtual file are accessed by local programs, and in the case of a network abstraction, (usually) only one end of the virtual file is accessed by a local program. The network abstractions also depend on the I/O module, whereas pipes do not. Both pipes and network abstractions are fully cached, i.e. all data not yet retrieved/sent is kept in memory buffers (which may increase/decrease in size).

The data structures used by the filesystem management code are kept in the PL0 heap.

Synchronisation

The synchronisation module of PwnOS provides mechanisms for mutual exclusion and other coordination of multiple threads. The most important of these mechanisms is the lock, and a wait-notify queue mechanism is also provided.

A lock has two main operations: get and release. It keeps track of which thread currently has access to it (if any), and which threads are waiting to have access to it (if any). Both get and release have a case where they can be done entirely from PL3, and another case where they must be done in PL0.

The get operation goes as follows.
1. If the current thread already has the lock, return.
2. Atomically, do the following (using LOCK CMPXCHG16B):
   a. If no thread currently has access and no thread has exclusive access to the list of waiting threads,
      i. Claim access for the current thread
3. If the current thread gained access, return.
4. Switch to PL0.
5. Disable interrupts.
6. Spinlock for exclusive access to list of threads waiting for access.
7. If the lock was released from PL3 before exclusive access was obtained (can only happen if this is the first thread to go in the list),
   a. Claim access for the current thread (since no other thread can claim it while this one has exclusive access to the list)
   b. Release access to the list of waiting threads.
8. Else,
   a. Add current thread to list of threads waiting for access.
   b. Set the current thread status to indicate that this thread is waiting for access to this lock.
   c. Release access to list of threads waiting for access.
   d. Go to the Thread Scheduler (see Thread Management section). Upon returning, this thread will have gotten the lock.
9. Enable interrupts.
10. Return to PL3, then return to application.

In the cases where either the current thread already has the lock or the lock is free, the operation can complete without switching to PL0. Otherwise, the operation must be done in PL0.

Similarly, the release operation is as follows.
1. If the current thread does not have the lock, fail.
2. Atomically do the following (using LOCK CMPXCHG16B):
   a. If the list of threads waiting for access is empty and no thread has exclusive access to the list,
      i. Set no threads to currently have the lock.
3. If the lock just got released by the atomic operation, return.
4. Switch to PL0.
5. Disable interrupts.
6. Spinlock for exclusive access to list of threads waiting for access.
7. Remove the next thread to run from the list of waiting threads.
8. Give that thread the claim to the lock’s access.
9. Set that thread’s status to reflect that it is now able to resume running.
10. Release access to the list of waiting threads.
11. Enable interrupts.
12. Return to PL3, then return to application.

The order of these steps is absolutely critical to their proper execution. Changing the order or function of these steps could break certain cases, and the explanation of how would be too long and detailed for this document. These operations would be much simpler if done completely in PL0, but allowing the operations to occur in PL3 for the very common cases (getting a free lock, and releasing an unwatched lock) can yield a large performance benefit where locks are used often.
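The split between the PL3 and PL0 cases of get and release can be illustrated with a small model. This is only a single-threaded sketch of the bookkeeping: the real operations rely on LOCK CMPXCHG16B, interrupt control, and the Thread Scheduler, none of which is modelled here. The class name, the returned strings, and the first-in-first-out choice of the next waiting thread are assumptions for illustration.

```python
from collections import deque


class ModelLock:
    """Single-threaded model of the lock state; real code needs atomic operations."""

    def __init__(self):
        self.owner = None        # thread that currently has the lock, if any
        self.list_busy = False   # stands in for exclusive access to the waiter list
        self.waiters = deque()   # threads waiting for the lock (assumed FIFO)

    def get(self, thread):
        if self.owner == thread:                  # step 1
            return "already held"
        # Steps 2-3: in PwnOS this check-and-claim is one atomic operation in PL3.
        if self.owner is None and not self.list_busy:
            self.owner = thread
            return "fast path"
        # Steps 4-10: the PL0 case.
        self.list_busy = True                     # step 6
        if self.owner is None:                    # step 7 (unreachable in this
            self.owner = thread                   # single-threaded model)
            result = "slow path"
        else:                                     # step 8
            self.waiters.append(thread)
            result = "waiting"
        self.list_busy = False
        return result

    def release(self, thread):
        if self.owner != thread:                  # step 1
            raise RuntimeError("thread does not have the lock")
        # Steps 2-3: atomic in PL3 when nobody is waiting.
        if not self.waiters and not self.list_busy:
            self.owner = None
            return "fast path"
        # Steps 4-12: the PL0 case hands the lock directly to a waiting thread.
        self.list_busy = True                     # step 6
        self.owner = self.waiters.popleft()       # steps 7-8
        self.list_busy = False                    # step 10
        return "handed to " + str(self.owner)
```

In this model, an uncontended get followed by a release never leaves the fast path, which corresponds to the two common cases that PwnOS keeps in PL3. When a second thread asks for a held lock, it is queued, and the eventual release passes ownership to it directly.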
The other, much simpler synchronisation mechanism provided by PwnOS is the wait-notify queue. A thread indicates that it wants to be notified of something after waiting in line to be notified. This can be used for larger constructs such as a simple inter-thread message coordination system. Since waiting requires being in PL0 anyway, and notifying requires modifying the shared data structure, both must have some component executing in PL0 in order to avoid problems. Thus, for simplicity, they both are completely implemented in PL0. This makes the operations so much simpler than getting and releasing locks that their steps are not discussed here.

Because PwnOS is aware of the constructs that applications will use for mutual exclusion, some problems can be averted or at least identified. For example, cycles in the graph of locks owned by threads and threads waiting for locks represent deadlocks, and the Thread Scheduler has full access to this graph. Because deadlock cycles are almost always very small (2 or 3 locks in each cycle), and because few threads at any given time would be both waiting for a lock and owning a lock, detection is an inexpensive operation that could be performed periodically but not often (roughly every 1 minute to 1 hour) without a significant performance hit. The Thread Scheduler can also take into account the knowledge that, for example, 20 threads are waiting for a lock owned by a particular thread, so the owning thread should be given a priority boost (at least temporarily) to release the lock sooner. It is not necessary that all such information be used, but it opens up possibilities for a more intelligent scheduler.

References

1. Dickson, Neil. PwnOS Code Documentation. [Online] August 26, 2007. [Cited: October 16, 2007.] http://www.neildickson.com/os/documentation/.
2. ITRON Committee, TRON Association. μITRON4.0 Specification. Tokyo, Japan : TRON Association, 2002. 4.00.00.
3. International Data Corporation. HP-UX: A Foundation for Enterprise Workloads. s.l. : IDC, 2007. #206607.
4. Silicon Graphics, Inc. Cellular IRIX™ 6.4 Technical Report. 1996.
5. MINIX 3: A Highly Reliable, Self-Repairing Operating System. Jorrit N. Herder, Herbert Bos, et al. July 2006, s.l. : Operating Systems Review, 2006.
6. Sun Microsystems. Reference Materials. Solaris Operating System. [Online] November 2007. [Cited: November 3, 2007.] http://www.sun.com/software/solaris/reference_resources.jsp.
7. Scalability of Microkernel-Based Systems. Uhlig, Volkmar. June 2005, s.l. : Operating Systems Review, 2005.
8. Robert V. Baron, David Black, et al. Mach Kernel Interface Manual. 1990.
9. Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manuals. Intel. [Online] May 2007. [Cited: August 13, 2007.] http://www.intel.com/products/processor/manuals/. 253665-253669.
10. Intel Corporation. 82093AA I/O Advanced Programmable Interrupt Controller (I/O APIC) Datasheet. 1996. 29056601.
11. Hewlett-Packard Company, Intel Corporation, et al. Advanced Configuration and Power Interface Specification, Revision 3.0. 2004.
12. Microsoft Corporation. Microsoft Portable Executable and Common Object File Format Specification, Revision 8.0. 2006.
13. Technical Committee T13. AT Attachment with Packet Interface - 6 (ATA-ATAPI-6). 2002. 1410D.
14. Hyde, Randall. Chapter 20 - The PC Keyboard. The Art of Assembly Language Programming, DOS 16-bit Edition. 2000.
15. Compaq Computer Corporation, Hewlett-Packard Company, et al. Universal Serial Bus Specification, Revision 2.0. 2000.
16. PCI Special Interest Group. PCI Local Bus Specification, Revision 2.2. 1998.
17. Video Electronics Standards Association. VESA BIOS Extension (VBE) Core Functions Standard, Version 3.0. 1998.
18. Gook, Michael. PC Hardware Interfaces. Wayne, Pennsylvania : A-List Publishing, 2004. 193176929X.
19. Richard Russon, Yuval Fledel. NTFS Documentation. Linux-NTFS. [Online] 2005. [Cited: October 21, 2007.] http://data.linux-ntfs.org/ntfsdoc.pdf.