PwnOS Design Document
Version 1.0a
By Neil Dickson
As part of Code Cortex
Presented for:
COMP 3000 Fall 2007
Operating Systems
Carleton University
Dr. Anil Somayaji
Neil Dickson
PwnOS Design Document
Page 1 of 27
Revision History
| Revision | Editor | Description |
|---|---|---|
| 1.0a | Neil Dickson | Minor table formatting change; clarified explanation of process pages; added table of figures; added paragraph about heap allocation alignment |
| 1.0 | Neil Dickson | This is the initial version of the document. It describes the proposed design of PwnOS in detail. |
Table of Contents
Revision History
Table of Figures
Abstract
Introduction
    High Concept
    Motivation and Goals
    Project Scope
    Document Scope
        Abbreviations
Market Analysis and Prior Art
    High Performance Dedicated Servers Market
    Commercial High Performance Systems
    Comparison of Similar Operating Systems
Design
    Organisational Overview
        API Specification
    Booting
    Memory
        Page Memory Management
        Heap Memory Management
    Threads and Processes
        Process Management
        Thread Management
    Device I/O
    Files
    Synchronisation
References
Table of Figures
Figure 1: Overall Architecture
Figure 2: Module Dependencies
Figure 3: Physical Memory During Boot
Figure 4: Overall Virtual Memory Layout
Figure 5: Example Heap Memory Ranges
Figure 6: Heap Memory Range Trees
Figure 7: Heap Memory Range Trees with Compound Nodes
Figure 8: User-Accessible Process Pages
Figure 9: Global Descriptor Table (GDT)
Abstract
PwnOS is a low-overhead operating system designed for isolated or otherwise dedicated computer
systems that require high CPU performance on general-purpose hardware. This document explains the
motivations and goals of PwnOS, and gives a market analysis based on those goals, followed by a
comparison between PwnOS and operating systems with similar goals. Then, the design of PwnOS is
presented in detail, in relation to those goals.
Introduction
High Concept
PwnOS is a low-overhead operating system designed for isolated or otherwise dedicated computer
systems that require high CPU performance on general-purpose hardware.
Motivation and Goals
The main goals of PwnOS can be summarised as follows:
| Motivation | Goal |
|---|---|
| General-purpose operating systems tend to have a high level of bloat that is largely avoidable if the operating system is custom-tuned for high performance dedicated systems. | very low time overhead; low space overhead |
| Core APIs for general-purpose operating systems tend to be very large and/or very cryptic. | simple core API; expandable/modifiable user API |
| High-performance computing and other dedicated systems generally have just one large process with many threads. | best suited to few processes with many threads |
| Dedicated systems do not need to waste time checking for malware or otherwise be slowed down for security reasons only applicable to general-purpose systems. | protection against accidents, not against malware |
| Dedicated systems need only a limited set of good drivers, and the large set of drivers with a general-purpose operating system can be detrimental. | support for just a few devices is sufficient for a particular system; exhaustive support is not needed |
| Dedicated systems using general-purpose hardware can be much less expensive than dedicated systems using special hardware. | only need support for general-purpose hardware |
| Custom server systems often need custom modifications to the operating system. | good, thorough documentation at all levels |
Project Scope
The scope of the PwnOS project is most significantly limited by its goals, which is quite convenient, in
that the goals are then not significantly limited by the scope. Since the goals generally favour a simple
design over a complex design, this makes most complex design aspects out of scope.
The following items are currently considered to be out of the scope of the PwnOS project:
• Filesystem design
• Executable/library format design
• Full compatibility with other operating systems
• Graphical user interface libraries
• Security and privacy libraries
• Extensive device support
• Elaborate Inter-Process Communication (IPC)
• Application software
• Backward hardware compatibility
Implications of these include that an existing filesystem design and an existing executable/library format
must be used. Also, all application software and user libraries must be developed separately from
PwnOS. Some such software might be developed as part of other Code Cortex projects.
The following items are currently considered to be in the scope of the PwnOS project:
• Boot loader code
• Page and heap memory management
• Thread management and scheduling; inter-thread communication
• Simple process management; support for an existing, common executable/library format
• Support for common, general-purpose hardware relevant to dedicated systems
• Support for an existing, common filesystem
• Thread synchronisation management
Document Scope
The objective of this document is to provide thorough details of the design of PwnOS, up to but not
including the level of exhaustive function lists. Some details given in this document might be considered
more implementation than design, but the intent is to have more emphasis on details that are less likely to
change than details that are more likely to change. For documentation relating to implementation details,
please see (1).
This document does not provide introductory, tutorial, or reference documentation on hardware devices,
protocols, data structures, standards, or computer languages, except if and when useful in describing their
relation to the design of PwnOS. For details on these, please see the cited material.
Abbreviations
| Abbreviation | Term |
|---|---|
| ACPI | Advanced Configuration and Power Interface |
| API | Application Programming Interface |
| APIC | Advanced Programmable Interrupt Controller |
| ATA | AT Attachment |
| AVL Tree | Adelson-Velsky Landis Tree |
| BIOS | Basic Input/Output System |
| CPU | Central Processing Unit |
| DMA | Direct Memory Access |
| FTP | File Transfer Protocol |
| GB | GigaBytes (2^30 bytes in this document) |
| GDT | Global Descriptor Table |
| HP | Hewlett-Packard Company |
| HPC | High-Performance Computing |
| HTTP | HyperText Transfer Protocol |
| IBM | International Business Machines Corporation |
| I/O | Input / Output |
| IP | Internet Protocol |
| IPC | Inter-Process Communication |
| ITRON | Industrial The Real-time Operating system Nucleus |
| LAPIC | Local APIC |
| LSI | Large Scale Integration (of circuitry) |
| MB | MegaBytes (2^20 bytes in this document) |
| MBR | Master Boot Record (sector 0) |
| MFT | Master File Table |
| NTFS | New Technology File System |
| OS | Operating System |
| PCI | Peripheral Component Interconnect |
| PDs | Page Directories |
| PDPT | Page Directory Pointer Table |
| PIT | Programmable Interval Timer |
| PL0 | Privilege Level 0 (supervisor privilege) |
| PL3 | Privilege Level 3 (user privilege) |
| PML4 | Page Map Level 4 table |
| PS/2 | Personal System/2 (or the ports thereof) |
| PTs | Page Tables (or in general the page mapping) |
| RAM | Random-Access Memory |
| RPTs | Reverse Page Tables (not standardised) |
| SGI | Silicon Graphics Incorporated |
| SIMD | Single-Instruction, Multiple-Data |
| SSE# | Streaming SIMD Extensions # (e.g. SSE3) |
| TB | TeraBytes (2^40 bytes in this document) |
| TCP | Transmission Control Protocol |
| TLB | Translation Lookaside Buffer |
| TSS | Task State Segment |
| UDP | User Datagram Protocol |
| USB | Universal Serial Bus |
| VBE | VESA BIOS Extensions |
Market Analysis and Prior Art
High Performance Dedicated Servers Market
The server system market, although large and continually growing, is too broad a market to consider for
PwnOS, since many uses for servers are not limited by CPU performance. Also, many server systems are
not what would be considered “isolated” systems, and issues not addressed by PwnOS, such as security
and capabilities, are important for those systems. As such, this section will only consider relevant
submarkets of the server system market.
The most distinctive market for high performance dedicated servers is the High-Performance Computing
(HPC) market. HPC is best known for use of systems with many CPU cores (a cluster) for parallel
computation and large amounts of RAM to facilitate computationally intense use of these cores. HPC
servers are most commonly used for solving very difficult, but parallelisable, problems. This type of
problem arises frequently in scientific computing and optimisation of complex systems. For example,
simulation of the folding of proteins is a problem very important to biological and pharmaceutical
research, but it is extremely difficult to do accurately, and yet is highly parallelisable. There is, however,
a wide variety of these types of problems, including LSI verification, weather prediction, cinematic
quality 3D rendering, network optimisation, schedule optimisation, and many others. The growth of HPC
in recent years can be attributed to this myriad of fields in which HPC is useful, especially scientific and
technical fields. The role of PwnOS is not to provide a full HPC system, but just to provide a simple, low
overhead framework on top of which to run HPC applications using general-purpose hardware.
Other uses for high performance dedicated servers include custom database systems. As an example of
this, a telephone switching system can make use of such a database system. A large amount of
information related to telephone switching remains relatively constant (e.g. phone number routing data)
and is very frequently queried, often concurrently, but this amount of information is easily small enough
to fit in memory given a mid-range dedicated server. Since the information is relatively constant, it can
be perpetually cached, making a very low overhead operating system with support for large memory ideal
for these operations. However, the ease of use and adequate performance of general-purpose database
systems has made custom database systems less common, so this may not be a significant market for
PwnOS.
Commercial High Performance Systems
Major companies specialising in development and/or retail of low- to mid-range HPC servers include
IBM, Sun, HP, SGI, and Quadrics.
Such servers (as limited by price) can generally be arranged into categories based on the amount of RAM
per CPU. For instance, the HP Integrity rx6600 Server has 2-8 cores and 192GB of RAM (24-96GB/CPU),
whereas the Sun Blade x8420 Server Module has 8 cores and 64GB of RAM (8GB/CPU).
Clusters with more RAM/CPU are better for some tasks (i.e. more performance per dollar) and worse for
others. Similar arguments can be made for I/O connections to these servers.
However, the modularity of these systems is perhaps a more important factor. Most of the above
companies offer both standalone server systems and server systems in which many server modules can be
mounted together with special interconnects for inter-module communication. The former is referred to
as a rack-mount server system, and the latter is referred to as a blade server system. Rack-mount systems
are standard computer systems usually using high-quality conventional hardware, but each computer is
inherently separate, albeit likely connected via a network interface. A blade system requires a special
enclosure providing power, cooling, and networking in a way that is more efficient than rack-mount
systems, but it can be more expensive due to its highly specialised hardware. Blade server systems
are expandable to high-end clusters, but may not be as cost-effective for low-end clusters.
Software on these systems varies widely. IBM mostly sells systems running either a variant of Linux or
Windows Compute Cluster Server on Windows Server 2003, and it also sells systems with a custom
operating system, AIX. Sun’s systems mostly run its own operating system, Solaris, and the same is true
for HP selling systems running HP-UX. Until recently, SGI sold systems running its operating system,
IRIX, but they now primarily run Linux. Because these companies mostly make money from the
hardware, not the software, the overall trend for low- to mid-range servers has been a move away
from custom operating systems.
Comparison of Similar Operating Systems
| | PwnOS | HP-UX | Cellular IRIX | Solaris | ITRON | Minix | Mach | L4 Fiasco |
|---|---|---|---|---|---|---|---|---|
| Purpose | Industrial Server | Industrial Server | Industrial Server | Industrial Server | Industrial Embedded | OS Research | OS Research | OS Research |
| Min. CPU | 1 core, x86-64, SSE3 | 2 cores, Itanium 2 | 64-bit MIPS | 2 cores, x86-64 or SPARC64 | 1 core, large variety | 1 core, 80386 | 1 core, 80386 | 1 core, 80486 |
| Recommended CPU | 15 cores, x86-64, SSE3 | 128 cores, Itanium 2 | 128 cores, 64-bit MIPS | 8 cores, x86-64 or SPARC64 | 1 core, large variety | 1 core, Pentium | 1 core, Pentium | 1 core, 80486 |
| Recommended RAM | 1GB to 512GB+ | 32GB to 512GB+ | ? to 1024GB | 256MB to 64GB | Small (<<4GB) | 16MB to <4GB | 2MB to 1GB | ? to <4GB |
| Status | In development | Active | Abandoned | Active | Stagnant | Active | Stagnant | Abandoned |
| Organisation | Code Cortex | Hewlett-Packard | Silicon Graphics | Sun Microsystems | TRON Association | - | - | - |
| Market Penetration | None | Good | Fair | Good | Excellent | None | None | None |
| License | GPL | Proprietary | Proprietary | CDDL | N/A | BSD | GPL | None |
HP-UX (2), Cellular IRIX (3), and Solaris (4) are operating systems that are or have been developed by
HP, SGI, and Sun, respectively, for server systems, especially cluster or cluster-like systems. PwnOS is
similar to these in that the primary focus is cluster systems and other high performance dedicated systems.
However, PwnOS differs in that it is inherently designed to remain simple while still providing the
functionality needed to create and run software for these systems, whereas HP-UX, Cellular IRIX, and
Solaris are colossal, complex masses of software. PwnOS also differs in that the other three operating
systems are sold in conjunction with expensive server systems, whereas PwnOS is intended to enable
companies and individuals to make inexpensive server systems from computers that they already have.
ITRON (5) is an operating system specification used for more embedded systems than any other design
on Earth. ITRON is, in fact, very dissimilar from PwnOS. ITRON operating systems are real-time
operating systems for embedded systems on a wide variety of different custom and general-purpose
hardware platforms. PwnOS is not intended to be a real-time operating system, it is not for embedded
systems, and is designed to support only a limited range of modern, general-purpose hardware. The only
similarity is the intent to be for industrial use. ITRON is, however, a great success story of how, in just
20 years, such a project can become so common.
Minix (6), L4 (7), and Mach (8) are or were projects developed by independent operating system
enthusiasts interested in trying theoretical designs for operating systems (specifically micro-kernel
designs). The purpose of these operating systems is completely different from that of PwnOS. Because of
that, their designs are very different from that of PwnOS. The common design element is
simplicity, but beyond that, PwnOS is not a micro-kernel design and has performance, functionality, and
specific usefulness in mind, whereas the others are micro-kernel designs with reliability, minimal
functionality, and no particular usefulness in mind. The other significant element in common is that
PwnOS is currently being developed by a tiny group of independent operating system enthusiasts. The
key difference there is that this tiny group is interested in more than just operating systems.
Design
Organisational Overview
This section outlines the overall organisation of PwnOS, the relations between its major modules, and
design aspects common to all or most major modules of PwnOS.
To optimise the use of modern general-purpose CPUs, PwnOS is a 64-bit operating system for processors
supporting x86-64 architecture and SSE3. Portability and backward compatibility often conflict with the
goals of low time and space overhead, and so are not considered in the design of PwnOS. PwnOS is
designed to work best with many CPU cores. Because a CPU and a CPU core appear the same from the
perspective of software, the terms “CPU” and “CPU core” are used interchangeably in the rest of this
document.
The following figure presents the overall code architecture of PwnOS in terms of its modules and the
relevant interfaces.
Figure 1: Overall Architecture
The reasoning behind having the heap memory management module and part of the synchronisation
module accessible directly from Privilege Level 3 (PL3) is that some of the time overhead of system calls
can be avoided by using regular function calls where possible. Managing heap memory does not require
actively changing page tables or updating core data structures, so it can be done from PL3. Likewise,
synchronisation actions such as getting a lock that is currently free and releasing a lock on which no
threads are blocking are both operations that can be done in PL3, and upon failure (e.g. the lock is not free
when attempting to get it) can call PL0 to properly handle all cases.
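The fast-path/slow-path split described above can be illustrated with a minimal user-space sketch. It is not PwnOS code: the names `FastLock`, `slow_path_calls`, `contended_worker`, and `demo` are hypothetical. Python's `threading.Lock` stands in for both sides, with the non-blocking attempt playing the role of the PL3 code and the blocking call playing the role of a system call into PL0. Only the acquire side is modelled.

```python
import threading
import time


class FastLock:
    """Toy lock with a cheap fast path and a costly slow path (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.slow_path_calls = 0  # times the stand-in "system call" was needed

    def get(self):
        # Fast path (stands in for PL3): succeeds at once if the lock is free.
        if self._lock.acquire(blocking=False):
            return
        # Slow path (stands in for calling PL0): block until the lock is free.
        self.slow_path_calls += 1
        self._lock.acquire()

    def release(self):
        self._lock.release()


def contended_worker(lock):
    lock.get()
    lock.release()


def demo():
    lock = FastLock()
    lock.get()                 # lock is free: fast path only
    lock.release()

    lock.get()                 # hold the lock so that the worker must wait
    worker = threading.Thread(target=contended_worker, args=(lock,))
    worker.start()
    while lock.slow_path_calls == 0:
        time.sleep(0.001)      # wait until the worker reaches the slow path
    lock.release()
    worker.join()
    return lock.slow_path_calls
```

In this sketch the uncontended `get` never touches the slow path, so its cost is a single non-blocking attempt. Only the contended `get` pays for the more expensive call.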
It is also important that these modules are in special, read-only pages of memory common to all processes.
There will be a heap for modules in PL0 and a heap for each application, and duplication of the heap
management code would be wasteful, so it is kept in common for all tasks using global, read-only pages
for efficiency and accident avoidance. This is discussed further in the Page Memory Management
section.
Custom libraries in this read-only memory may include libraries developed by/for a user of PwnOS, the
Code Cortex libraries, or any other libraries to be loaded here.
The following diagram presents the dependencies between the modules of PwnOS.
Figure 2: Module Dependencies
Dependencies on (Fast) Sync have been omitted since all modules but Thread Scheduler and (Full) Sync
have such a dependency. There are cyclic dependencies involving I/O, Page and Heap Memory, and
Sync, and as such, they must be initialised without using any of the code therein. See the Booting section
for more detail on initialisation.
API Specification
The following table specifies the core Application Programming Interface of PwnOS. It has been
designed to be simple, with meaningful names, and still have sufficient functionality. Parameters are
passed by registers, and system calls are made using the special SYSCALL instruction for optimal
performance. Library calls can be made as normal calls (after relocation).
| Function | Parameters | Returns | Description |
|---|---|---|---|
| AllocatePages | Address, nPages, AllocType | Address | Allocates the specified number of pages with the specified properties. If Address is not NULL, the pages will be allocated with that virtual address (rounded down to the page). |
| FreePages | Address, nPages | (none) | Deallocates the specified number of pages starting at the specified address (rounded down to the page). |
| AllocateMemory | nBytes | Address | Allocates on the heap a range of the specified number of bytes. |
| AllocateAlignedMemory | nBytes, Alignment | Address | Allocates on the heap a range of the specified number of bytes aligned to 2^Alignment bytes. |
| FreeMemory | Address | (none) | Deallocates the memory range starting at Address from the heap. |
| GetAllocationSize | Address | nBytes | Returns the size of the heap memory range containing Address. |
| GetAllocationStart | Address | StartAddress | Returns the start address of the heap memory range containing Address. |
| CreateProcess | pName, DataSize, pData, Flags | pProcess | Creates a new process from a file with the specified name and specified properties. If pData is not NULL, DataSize bytes of data are copied to the new process as command data. |
| DestroyProcess | pProcess | (none) | Stops and completely eliminates the specified process and all of its threads. This does not return if pProcess is the current process. |
| GetCurrentProcess | (none) | pProcess | Returns a reference to the current process. |
| CreateThread | pFunction, StackSize, Flags, Parameter | pThread | Creates a new thread with the specified properties that starts execution by calling the specified function with Parameter. The new thread's stack has the specified size. Returning from the function destroys the thread. |
| DestroyThread | pThread | (none) | Destroys the specified thread. This does not return if pThread is the current thread. |
| PauseThread | pThread | (none) | Pauses execution of the specified thread, saving its state. |
| ResumeThread | pThread | (none) | Resumes execution of the specified thread if the thread was paused or sleeping. |
| Sleep | pThread, Milliseconds | (none) | Puts the specified thread to sleep (similar to pausing) for the specified number of milliseconds, after which execution resumes normally. |
| ScheduleThread | pThread, pFunction, Parameter, Milliseconds | pSchedule | Schedules the specified paused thread to call the specified function with Parameter after the specified number of milliseconds. If the thread had state saved when paused, that state may be overwritten. |
| UnscheduleThread | pSchedule | (none) | Unschedules the previously scheduled thread execution event. |
| GetCurrentThread | (none) | pThread | Returns a reference to the current thread. |
| GetLock | pLock | (none) | Assigns the access controlled by the specified lock to the current thread, blocking until it is allowed to do so if necessary. (PL3 and PL0) |
| ReleaseLock | pLock | (none) | Releases the access controlled by the specified lock from the current thread immediately, informing blocked threads. (PL3 and PL0) |
| AttemptGetLock | pLock, Milliseconds | hasLock | Attempts to assign the access controlled by the specified lock to the current thread for a limited amount of time before giving up. (PL3 and PL0) |
| WaitForNotify | pQueue | (none) | Adds the current thread to the specified queue of waiting threads. (all PL0) |
| AttemptWaitForNotify | pQueue, Milliseconds | wasNotified | Adds the current thread to the specified queue of waiting threads for a limited amount of time before giving up. (all PL0) |
| Notify | pQueue | nNotified | Notifies the first thread in the specified queue of waiting threads (if any). (all PL0) |
| NotifyAll | pQueue | nNotified | Notifies all threads in the specified queue of waiting threads. (all PL0) |
| OpenFile | pName, Flags | pFile | Opens a file with the specified name with the specified access and properties. |
| ReadFile | pFile, pDestination, nBytes | nBytesRead | Reads the specified number of bytes to the specified address in memory from the specified opened file. |
| WriteFile | pFile, pSource, nBytes | nBytesWritten | Writes the specified number of bytes to the specified file from the specified address in memory. |
| CloseFile | pFile | (none) | Closes the specified file. |
| GetFileSize | pFile | nBytes | Returns the size of the specified file if it has a size. |
| GetFilePointer | pFile | ByteIndex | Returns the current location in the specified file from which the next read or write operation would occur, if it has such a location. |
| SetFilePointer | pFile, ByteIndex | (none) | Sets the current location in the specified file from which the next read or write operation would occur, if it has such a location. |
| GetGraphicsAccess | (none) | pGraphics | Allocates pages onto the graphics linear frame buffer (or virtual linear frame buffer), returning a pointer to a structure describing the buffer. |
| ReleaseGraphicsAccess | (none) | (none) | Deallocates pages of the graphics linear frame buffer (or virtual linear frame buffer). |
| AddKeyListener | pFunction | (none) | Registers the specified function to be called using the current thread when a key of a keyboard is pressed or released. |
| RemoveKeyListener | pFunction | (none) | Deregisters the specified function from being called on key events, preferring the current thread if duplicates exist. |
| AddMouseButtonListener | pFunction | (none) | Registers the specified function to be called using the current thread when a button of a mouse is pressed or released. |
| RemoveMouseButtonListener | pFunction | (none) | Deregisters the specified function from being called on mouse button events, preferring the current thread if duplicates exist. |
| AddMouseMotionListener | pFunction | (none) | Registers the specified function to be called using the current thread when a mouse is moved. |
| RemoveMouseMotionListener | pFunction | (none) | Deregisters the specified function from being called on mouse motion events, preferring the current thread if duplicates exist. |
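Two of the descriptions above involve address arithmetic: page addresses are "rounded down to the page", and AllocateAlignedMemory aligns to 2^Alignment bytes. The following sketch shows one way that arithmetic could look. The function names `round_down_to_page` and `align_up` are hypothetical and are not part of the PwnOS API. The 2MB page size is taken from the Page Memory Management section.

```python
PAGE_SIZE = 2 * 1024 * 1024  # 2MB pages, per the Page Memory Management section


def round_down_to_page(address):
    """Round a virtual address down to the start of its 2MB page."""
    return address & ~(PAGE_SIZE - 1)


def align_up(address, alignment):
    """Round an address up to the next multiple of 2**alignment bytes."""
    size = 1 << alignment
    return (address + size - 1) & ~(size - 1)
```

For example, an Alignment of 4 corresponds to 16-byte alignment, which is what aligned SSE loads and stores require.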
Booting
The actions for which the master boot record (MBR) code is responsible are:
• Read the rest of the boot loader from the following sectors on the boot drive
• Find and switch to a video mode based on desired resolution and either 24-bit or 32-bit colour
• Disable interrupts, etc.
• Initialise the GDT data
• Enable 32-bit protected mode and jump to the rest of the boot loader
The actions for which the rest of the boot loader is responsible are:
• Find and save relevant ACPI data as given by the BIOS. This includes data about the CPUs,
  memory, interrupts, and devices. This information is critical to the functionality of PwnOS.
• Ensure that the CPU and system meet the requirements for PwnOS.
• Configure the I/O APIC to route all I/O interrupts to the bootstrap processor.
• Set Memory Type Range Registers (MTRRs) and Page Attribute Table (PAT) MSR to configure
  memory caching.
• Configure Local APIC (LAPIC), and calibrate LAPIC timer using Programmable Interval Timer
  (PIT) during the waits of the INIT-SIPI-SIPI protocol.
• Wait for all CPUs to configure memory caching.
• Configure hard-coded paging setup on all CPUs.
• Switch all CPUs to 64-bit mode.
• Configure PCI for DMA with ATA devices and/or with USB devices.
• Identify all ATA devices and look for all NTFS partitions.
• Find an NTFS partition containing “PwnOS\Core.bin” and “PwnOS\Main.exe”.
• Load the pieces of Core.bin to their appropriate places in virtual memory.
• Initialise the modules of PwnOS (Thread Scheduler, I/O, Page Memory, ...)
• Make this boot loader have core data structures as if it were a real process.
• Call CreateProcess on “PwnOS\Main.exe”.
• Call DestroyProcess on the fake boot process, to free its resources.
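The requirement check in the list above can be sketched as a decoding of CPUID feature bits, given that PwnOS requires x86-64 and SSE3. This is only an illustration in Python of logic that a real boot loader would implement in assembly. The function name `meets_cpu_requirements` and its parameters are hypothetical. The caller is assumed to have already executed CPUID and to pass in the raw register values.

```python
SSE3_BIT = 1 << 0         # CPUID leaf 1, ECX bit 0: SSE3 supported
LONG_MODE_BIT = 1 << 29   # CPUID leaf 0x80000001, EDX bit 29: x86-64 long mode


def meets_cpu_requirements(leaf1_ecx, ext_leaf1_edx):
    """Return True if the reported feature bits include both SSE3 and long mode.

    leaf1_ecx:     ECX value returned by CPUID with EAX = 1
    ext_leaf1_edx: EDX value returned by CPUID with EAX = 0x80000001
    """
    has_sse3 = bool(leaf1_ecx & SSE3_BIT)
    has_long_mode = bool(ext_leaf1_edx & LONG_MODE_BIT)
    return has_sse3 and has_long_mode
```

A boot loader would typically halt with a diagnostic message if such a check fails, since later steps such as switching to 64-bit mode depend on it.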
Inter-CPU communication in the boot loader (in order to initialise the non-bootstrap processors) is done
using the mechanism built into the LAPIC, i.e. by sending special interrupts to other CPUs over the APIC
bus.
The physical memory layout during boot is as follows (but may be subject to significant changes).
Figure 3: Physical Memory During Boot
The 15 stacks are for the up to 15 CPUs during boot. The ATA scratch memory is a buffer for loading
data from disk during boot. The paging data is only used during boot, as the page directories are kept in
their own page after boot.
For more information on bootstrapping, the I/O APIC, and ACPI, see (9), (10), and (11).
Memory
Page Memory Management
To keep page management and address translation efficient and simple, PwnOS uses the x86-64 option
of a 3-level page table tree with 2MB pages instead of a 4-level page table tree with 4KB pages. This
means that page allocations and deallocations require much less updating of data, and fewer Translation
Lookaside Buffer (TLB) invalidations.
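As an illustration of the 3-level scheme, the following sketch splits a 48-bit x86-64 virtual address into the indices used when translating through 2MB pages. The function name `split_virtual_address` is hypothetical. The bit ranges follow the standard x86-64 layout, in which a 2MB page leaves a 21-bit offset and removes the page table level.

```python
def split_virtual_address(va):
    """Split a 48-bit virtual address for translation through 2MB pages.

    Returns (pml4_index, pdpt_index, pd_index, offset).
    """
    offset = va & 0x1FFFFF            # bits 0-20: offset within the 2MB page
    pd_index = (va >> 21) & 0x1FF     # bits 21-29: entry in a page directory
    pdpt_index = (va >> 30) & 0x1FF   # bits 30-38: entry in the PDPT
    pml4_index = (va >> 39) & 0x1FF   # bits 39-47: entry in the PML4
    return pml4_index, pdpt_index, pd_index, offset
```

With 4KB pages, a fourth 9-bit index (bits 12-20) would select an entry in a page table, which is the level that the 2MB option removes.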
It also means that all but 8KB (i.e. the PML4 and the PDPT) of the page table tree for the first 512GB of
memory fits into a single 2MB page. This allows the virtual addresses of these page directories
(PDs) to be constant, making lookup by PwnOS very fast. It also allows the processor to cache entries for
address translation much more reliably. For amounts of RAM significantly greater than 512GB, additional
measures may be needed, but for reasonably foreseeable amounts of RAM (about 16TB), this
approach of fixed-address PDs is sufficient (since it requires only 64MB of address space for 16TB
of RAM). The 2MB pages do not cause a significant waste of physical or virtual memory, since few
processes will be present on the system.
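The sizes quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the standard x86-64 figures of 8-byte entries and 512 entries per 4KB table; the function name is illustrative only.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40

PAGE_SIZE = 2 * MB          # size of one large page
ENTRY_SIZE = 8              # bytes per x86-64 paging entry
ENTRIES_PER_TABLE = 512     # entries in one 4KB table

def page_directory_bytes(ram_bytes):
    """Bytes of page directories (PDs) needed to map ram_bytes with 2MB pages."""
    bytes_per_pd = PAGE_SIZE * ENTRIES_PER_TABLE        # one PD maps 1GB
    pd_count = -(-ram_bytes // bytes_per_pd)            # ceiling division
    return pd_count * ENTRIES_PER_TABLE * ENTRY_SIZE    # each PD occupies 4KB

print(page_directory_bytes(512 * GB) // MB)  # 2
print(page_directory_bytes(16 * TB) // MB)   # 64
```

The PDs for 512GB therefore fill exactly one 2MB page, and the PML4 and PDPT (4KB each) account for the remaining 8KB.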
Page allocations will be assigned physical memory immediately (if there is enough), to avoid unnecessary
and expensive page faults later. The option is given to reserve pages, though, which can be put into
physical memory later.
Since the assumption is made that there are very few processes running on the system at any given time,
and allocations/deallocations of pages are not frequent, the brute force TLB invalidation algorithm is
sufficient, and in fact, just as efficient as elaborate algorithms for TLB invalidation. The brute force
algorithm is:
1. Interrupt all CPUs other than the current one with an indication that page tables have changed.
2. Invalidate those entries on every CPU to ensure that they are not in the TLB.
3. Resume all CPUs.
Although there may be many CPUs, it is probable that all of them are currently either:
• running different threads of the same process
• running the thread scheduler, in which case the CPU will not be interrupted and page tables will
be updated immediately anyway, or
• idle
This then means that most CPUs that can be updated need updating, so there is no significant loss in using
the brute force algorithm.
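The three steps of the brute force algorithm can be modelled as follows. This is only a sketch: each TLB is represented by a dictionary, and the `Cpu` class, its `paused` flag, and the function name are hypothetical stand-ins for inter-processor interrupts and the CPU's own invalidation instructions.

```python
class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}         # virtual page number -> physical page number
        self.paused = False   # True while the CPU is held in the interrupt handler

def brute_force_invalidate(cpus, current, changed_pages):
    """Invalidate changed_pages in the TLB of every CPU."""
    others = [cpu for cpu in cpus if cpu is not current]
    for cpu in others:                    # step 1: interrupt all other CPUs
        cpu.paused = True
    for cpu in cpus:                      # step 2: drop the stale entries everywhere
        for page in changed_pages:
            cpu.tlb.pop(page, None)
    for cpu in others:                    # step 3: resume all CPUs
        cpu.paused = False

cpus = [Cpu(i) for i in range(4)]
for cpu in cpus:
    cpu.tlb = {7: 70, 8: 80}
brute_force_invalidate(cpus, cpus[0], [7])
print([sorted(cpu.tlb) for cpu in cpus])  # [[8], [8], [8], [8]]
```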
Use of a page file (hence the dependency on I/O in the dependency diagram of the Organisational
Overview section), if necessary, will follow the standard approach. That is, pages that are paged out
will be marked as not present, and the bits that then become available in the corresponding entry will be used
to indicate which page in the page file contains the data. Memory will only be written to the page file if
necessary, or if there is idle CPU time and physical memory is at least 90% occupied.
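One possible encoding of a paged-out entry, together with the page-out policy just described, is sketched below. Bit 0 is the Present flag on x86-64; everything else here is an assumption made for illustration: the slot index is stored in the remaining 63 bits, slot numbers start at 1 so that an all-zero entry can mean "unmapped", and all function names are hypothetical.

```python
PRESENT_BIT = 1  # bit 0 of an x86-64 paging entry is the Present flag

def encode_paged_out(slot):
    """Build a not-present entry whose upper 63 bits hold a page file slot (assumed layout)."""
    if not 1 <= slot < 2**63:
        raise ValueError("slot must be in 1 .. 2**63 - 1")
    return slot << 1                      # bit 0 stays clear: not present

def decode(entry):
    """Classify an entry as present, unmapped, or paged out to a given slot."""
    if entry & PRESENT_BIT:
        return ("present", None)
    if entry == 0:
        return ("unmapped", None)
    return ("paged_out", entry >> 1)

def should_page_out(used_pages, total_pages, cpu_idle, allocation_failed):
    """Page out only if necessary, or if idle with memory at least 90% occupied."""
    nearly_full = used_pages * 10 >= total_pages * 9
    return allocation_failed or (cpu_idle and nearly_full)

print(decode(encode_paged_out(5)))        # ('paged_out', 5)
```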
To keep track of the physical-to-virtual address mapping, how recently used physical pages are (with a
variant of aging), and where free physical pages are, a reverse page table tree is used. Like the page
tables, this structure also has fixed virtual addresses for faster lookup. Unlike the page tables, this
structure must be updated at periodic intervals to update the age of used physical pages. However, if no
page file is used, this update isn’t needed, and since the pages are large (2MB), this periodic interval can
be very long compared with that for 4KB pages, e.g. 100ms to 1 minute or longer, depending on the
application.
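The text specifies only "a variant of aging", so the exact scheme is not known. As an assumption, the sketch below shows the classic textbook form: at each periodic update, a page's age counter is shifted right and its accessed bit is placed in the top bit, so the page with the smallest counter is the least recently used. The 8-bit width and the names are illustrative.

```python
def age_step(age, accessed, bits=8):
    """One periodic update of a page's age counter (classic aging)."""
    top_bit = 1 << (bits - 1)
    return (age >> 1) | (top_bit if accessed else 0)

def pick_victim(ages):
    """Return the physical page number with the smallest age counter."""
    return min(ages, key=ages.get)

ages = {0: 0, 1: 0}
ages[0] = age_step(ages[0], accessed=True)    # page 0 used in the first interval
ages[1] = age_step(ages[1], accessed=False)
ages[0] = age_step(ages[0], accessed=False)
ages[1] = age_step(ages[1], accessed=True)    # page 1 used more recently
print(pick_victim(ages))                      # 0
```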
The read-only memory pages used for common libraries will be in a fixed virtual address range, for
convenient dynamic or static linking to these libraries (“static” in this case meaning that the addresses are
hard-coded in the compiled application). Being read-only, fixed address, and common to all processes
means that these pages never need to have their TLB entries invalidated, and so TLB misses for these
libraries should almost never happen (also because of the 2MB page size). The ability to share other
memory between processes may eventually be added, but that is not of concern at the moment, because
the focus is on systems with few processes and many threads.
The overall virtual memory layout is as follows.
Figure 4: Overall Virtual Memory Layout
Naturally, each process must have its own page table tree, but the two page directories for the memory
from 2GB to 4GB will be shared between all processes.
For more detailed information on page table trees and their maintenance, please see (9).
Heap Memory Management
The code for managing heap memory in PwnOS is accessible directly from PL3 and PL0. To prevent
accidental corruption of this code, and to improve performance, it is kept in read-only pages shared
among all processes. Each process will have a heap, and the core code will have its own heap.
Heap memory in PwnOS is maintained using a data structure representing two AVL trees stored at the
end of the heap, along with a header of general data about the heap.
These data are stored at the end of heap memory instead of the beginning of heap memory to aid in
allocation of aligned blocks of memory within the heap.
One of the two AVL trees (the “address tree”) is a tree of all ranges of memory within the usable portion
of the heap, both free and allocated, sorted on the address of the range. The other AVL tree (the “free
tree”) is a tree of all free ranges within the heap, sorted on the size of the range. Having these trees
ensures that memory can be allocated with “best fit” (or a variant thereof), and that memory can be freed,
both in O(log n) time, where n is the number of ranges. The operations of finding an allocation’s size
from its address and finding the start of an allocation from an address it contains also run in O(log n)
time.
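The roles of the two trees can be illustrated with a simplified model. Note the substitution: this sketch uses sorted Python lists searched with `bisect` in place of AVL trees, so lookups are O(log n) but insertions and removals are O(n), unlike the real structure. It also ignores alignment and the in-heap storage of the trees. The class and method names are hypothetical.

```python
import bisect

class HeapIndex:
    def __init__(self, start, size):
        self.starts = [start]                 # "address tree": range starts, sorted
        self.ranges = {start: [size, True]}   # start -> [size, is_free]
        self.free = [(size, start)]           # "free tree": sorted by (size, start)

    def allocate(self, size):
        """Best fit: take the smallest free range that is large enough."""
        i = bisect.bisect_left(self.free, (size, 0))
        if i == len(self.free):
            return None
        free_size, start = self.free.pop(i)
        self.ranges[start] = [size, False]
        if free_size > size:                  # split off the unused tail as a free range
            rest = start + size
            bisect.insort(self.starts, rest)
            self.ranges[rest] = [free_size - size, True]
            bisect.insort(self.free, (free_size - size, rest))
        return start

    def allocation_start(self, address):
        """Start of the allocation containing address, or None."""
        i = bisect.bisect_right(self.starts, address) - 1
        if i < 0:
            return None
        start = self.starts[i]
        size, is_free = self.ranges[start]
        if is_free or address >= start + size:
            return None
        return start

    def release(self, start):
        """Free an allocation, merging it with free neighbours."""
        size, is_free = self.ranges[start]
        assert not is_free
        i = bisect.bisect_left(self.starts, start)
        if i + 1 < len(self.starts):          # merge with the following range if free
            nxt = self.starts[i + 1]
            nxt_size, nxt_free = self.ranges[nxt]
            if nxt_free:
                self.free.remove((nxt_size, nxt))
                del self.ranges[nxt]
                del self.starts[i + 1]
                size += nxt_size
        if i > 0:                             # merge with the preceding range if free
            prev = self.starts[i - 1]
            prev_size, prev_free = self.ranges[prev]
            if prev_free:
                self.free.remove((prev_size, prev))
                del self.ranges[start]
                del self.starts[i]
                start, size = prev, prev_size + size
        self.ranges[start] = [size, True]
        bisect.insort(self.free, (size, start))

heap = HeapIndex(0, 1024)
a = heap.allocate(100)
b = heap.allocate(200)
print(a, b, heap.allocation_start(150))       # 0 100 100
```

Keeping both orderings over the same set of ranges is what lets allocation search by size while freeing and address queries search by address.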
However, in order to avoid significant complications and/or performance hits (both asymptotic and clock time), the two trees must share compound nodes to represent these ranges. As a concrete example to
illustrate what this means, suppose that the usable portion of the heap has the following ranges. (Suppose
that ranges starting with “f” are free, and those starting with “a” are allocated.)
Figure 5: Example Heap Memory Ranges
These ranges could have the following address tree and free tree.
Figure 6: Heap Memory Range Trees
Together, the trees would then be the following.
Figure 7: Heap Memory Range Trees with Compound Nodes
Manipulation of these trees is identical to that of normal AVL trees, with the exception that upon removal
of a node from the address tree, the node space left vacant must be filled by moving the node that is first
in memory to that position. Additionally, the size of the free range preceding the trees must be updated
upon adding and removing address nodes. If this free range ever shrinks to zero, no further allocations
can be made from the heap, because the trees would then intersect allocated ranges. Alternatively, another heap
could be allocated in such a case, to extend the first, without requiring a significant change in the tree
management.
All allocations on the heap are aligned to 16 bytes to support SSE# operations that require aligned
memory operands. The AllocateAlignedMemory function allows for alignment to higher powers of two.
Despite this alignment, the number of bytes requested is not rounded up to a multiple of the alignment.
This is so that when checking whether a certain address is in the allocated range using GetAllocationStart
and GetAllocationSize, it will be correctly determined that any bytes past the end are not allocated. Also,
no range less than 16 bytes will be recorded as a free memory range.
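The alignment rule can be shown with a small calculation. This is a sketch with hypothetical function names; it demonstrates only that the start address is rounded up to the alignment while the recorded end of the allocation is not.

```python
def align_up(address, alignment=16):
    """Round address up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (address + alignment - 1) & ~(alignment - 1)

def place_allocation(free_start, size, alignment=16):
    """Return (start, end) of an allocation; end is exclusive and not rounded up."""
    start = align_up(free_start, alignment)
    return start, start + size

start, end = place_allocation(0x1001, 5)
print(hex(start), hex(end))   # 0x1010 0x1015
```

With a 5-byte request, an address such as 0x1015 lies outside the recorded range, so a bounds check based on the allocation's start and size correctly reports it as unallocated.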
Threads and Processes
Process Management
The Windows Portable Executable (PE) format is used as the executable format for PwnOS. It supports
x86-64 code and dynamic linking via relocation entries. Programs can therefore also be tested in Windows
using a simple library simulating PwnOS. For details on the PE format, see (12). Programs loaded from
disk are completely loaded immediately, instead of waiting for page faults to occur, because page faults
are expensive with 2MB pages.
The only explicit support for Inter-Process Communication (IPC) in PwnOS is via files (i.e. pipes), and so
it is discussed in the Files section. This is because, with typically only 1 or 2 processes on the system,
elaborate IPC is unnecessary.
The default page-level permissions on the user-accessible pages of a process are as follows.
Type                          Permissions
Code                          Read, Execute
Global Data                   Read, Write
Heap                          Read, Write
Stack                         Read, Write
Allocated with AllocatePages  Custom
The arrangement of these pages in virtual memory is as follows.
Figure 8: User-Accessible Process Pages
The red blocks are guard pages (no access allowed) for accident prevention. The heap may or may not
immediately follow the guard page after the stack, to allow for heaps that are larger than 2GB (by placing
them after the 4GB mark). If the heap does not immediately follow the guard page after the stack, it must
be preceded by its own guard page.
Each process also has its own page table tree and reverse page tables. These sets, although in different
parts of physical memory, occupy the same virtual memory space, as described in the Page Memory
Management section.
New processes are created with one thread starting at the main code entry point. This thread may have
default properties, or some of these properties can be specified to CreateProcess. Each new thread will
have its own stack with a guard page on each side.
For details on process management on x86-64, see (9).
Thread Management
Support for many threads in PwnOS requires careful use of the Global Descriptor Table (GDT). The
layout of the GDT entries is as follows.
Figure 9: Global Descriptor Table (GDT)
The careful use relates to the Task State Segment (TSS) descriptors. Only one CPU can be running a
given task (thread) at a time, and the number of tasks (including idle tasks) may exceed the maximum
number of GDT entries (8,192). Also, the Thread Scheduler must be in a task separate from all others to
effectively make use of the built-in task switching and state saving.
The solution is to have a single TSS descriptor for each processor, plus one for the Thread Scheduler.
When a CPU is to switch threads, for any reason, the following occurs.
0. (All switches to the Thread Scheduler must be done from PL0, including LAPIC timer handlers,
so being in PL0 is assumed.)
1. Thread disables interrupts (if not already disabled).
2. Thread does FXSAVE to save its extended state.
3. Thread sets its own status information to indicate why & when it is going to the Thread
Scheduler.
4. Thread spinlocks for access to the Thread Scheduler (since little time will be spent in it, and
switching tasks is required for more elaborate synchronisation).
5. Thread switches tasks to the Thread Scheduler. (The general CPU state is automatically saved in
the TSS.)
6. Thread Scheduler stops APIC timer for the previous thread’s timeout if it wasn’t already stopped.
7. Thread Scheduler selects a thread to run.
8. Thread Scheduler writes the descriptor for that thread’s TSS to the GDT entry for the current
processor.
9. Thread Scheduler does FXRSTOR to restore the extended state for the next thread.
10. Thread Scheduler sets new thread status information to indicate that it is running.
11. Thread Scheduler starts APIC timer for the new thread’s timeout.
12. Thread Scheduler switches to the new thread’s task. (The general CPU state is automatically
restored.)
13. New thread releases the lock on Thread Scheduler. (Since this is always in PL0 code, this is not
dependent on the application.)
14. New thread enables interrupts (if returning to PL3).
This approach ensures proper and efficient functionality for even very large numbers of tasks. The
structure used for threads encompasses the TSS for the thread and the extended state saved by FXSAVE,
making most efficient use of the structures built into the CPU, instead of reorganising the data therein.
However, the Thread Scheduler does not use any extended state, and has no independent execution context,
so it is not a full thread; it only needs a TSS. These structures, along with all scheduling data are kept on
the PL0 heap.
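The reason this scheme supports more threads than there are GDT entries can be seen in a small model. This is a sketch under stated assumptions: the class `TssSlots` and its fields are hypothetical, TSS structures are represented by plain strings, and only step 8 of the procedure above (rewriting the current processor's descriptor) is modelled.

```python
GDT_MAX_ENTRIES = 8192

class TssSlots:
    """One TSS descriptor slot for the Thread Scheduler plus one per CPU."""
    def __init__(self, cpu_count):
        self.scheduler_tss = "scheduler-tss"
        self.cpu_slot = [None] * cpu_count    # TSS currently described by each CPU's slot

    def switch_to(self, cpu, thread_tss):
        """Point this CPU's descriptor at the selected thread's TSS."""
        for other, tss in enumerate(self.cpu_slot):
            if other != cpu and tss == thread_tss:
                raise RuntimeError("a thread can run on only one CPU at a time")
        self.cpu_slot[cpu] = thread_tss

slots = TssSlots(cpu_count=2)
threads = [f"tss{i}" for i in range(10000)]   # more threads than GDT entries
for i, tss in enumerate(threads):
    slots.switch_to(i % 2, tss)
print(len(slots.cpu_slot) + 1 <= GDT_MAX_ENTRIES)  # True
```

The number of descriptors depends only on the number of CPUs, because each thread's TSS lives on the PL0 heap and is referenced from the GDT only while that thread is running.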
Each thread also contains information on its current status, priority, any lock, notification, or I/O
operation or device that it might be waiting for, and the time of the last status change. This allows the
Thread Scheduler to implement any number of a wide variety of scheduling algorithms since it has
enough information to make good decisions. For example, suppose that one thread has access to a device
but is not waiting for an I/O operation to complete, and another thread of higher priority is waiting for
access to the device. The first thread can be given a priority boost (possibly just temporarily) so that the
higher priority thread is not left waiting too long. A similar situation occurs with locks, but this is
discussed in the Synchronisation section.
All I/O interrupts will be assigned to the bootstrap processor, so that preference can be given to the other
processors when scheduling higher priority threads, for example. This is done using the I/O APIC’s
software interface.
Thread time slice timeouts are implemented using the Local APIC (LAPIC) timer on each CPU. Both the
Programmable Interval Timer (PIT) and the CMOS Timer go through the I/O APIC, and so they cannot
be used for an arbitrary number of CPUs concurrently. The LAPIC timer is local to each CPU, and so
does not need intervention from another CPU to work for thread scheduling. The handler for the LAPIC
timer is in PL0, and it simply spinlocks for access to the Thread Scheduler, calls the Thread Scheduler,
then after returning from the Thread Scheduler (the next time that this thread is run), it releases Thread
Scheduler access.
For more information on task switching, LAPICs, and the I/O APIC, see (9) and (10).
Device I/O
Although it is planned that PwnOS will support PCI and USB protocols (as given by (16) and (15)), the
abstractions for these protocols have not yet been designed. As such, they are not discussed extensively
in this document.
The driver for ATA devices (e.g. hard drives) supports the following operations.
• Reading sectors with 28-bit and 48-bit addressing (with DMA once PCI is supported)
• Writing sectors with 28-bit and 48-bit addressing (same)
• Device identification
• Removable media identification
In order to support these operations, the driver strictly follows the protocols presented in (13).
Reading and writing of sectors is done with blocking DMA I/O, and as such, they are followed by an I/O
interrupt indicating that the operation has finished and that the requesting thread can be run again. Device
identification and removable media identification are done with programmed I/O to avoid the overhead of
setting up DMA, so the I/O interrupt is not needed.
The driver for PS/2 (and the driver for USB keyboards and mice) supports the following operations.
• Receive key press/release
• Receive mouse button press/release
• Receive mouse movement data
• Receive mouse scroll data
In order to support these operations, the driver is based upon information presented in (14) and much
testing.
All of the PS/2 driver operations are interrupt-driven input operations. The API for the I/O module of
PwnOS allows threads to register listeners in PL3 for these input events. These threads may be given a
temporary priority boost to quickly handle the user input.
Graphics in PwnOS is done using a fixed linear frame buffer, as set up by the boot loader using the VBE
functions (17). As such, to have the ability to display graphics output, a process need only have the linear
frame buffer’s pages present in its virtual memory. The GetGraphicsAccess and ReleaseGraphicsAccess
API functions just allocate and deallocate these pages, so in a sense, they are more closely related to the
Page Memory module than the I/O module.
Separate Ethernet drivers are required for each type of Ethernet card, and due to lack of standardisation,
these drivers may be very different from each other. However, they must all provide the Internet Protocol
(IP) abstraction for the TCP driver (and UDP driver, if present). Likewise, the TCP driver must manage
its abstraction for reading and writing over TCP connections (see the Files section).
Other useful references on devices and their configuration are (18) and (11).
Files
The NTFS filesystem is the only hard disk storage filesystem supported by PwnOS. This decision was
made based on its quality of documentation compared to other filesystems, its performance, its
extensibility, and its compatibility with Windows. Details on NTFS can be found in (19), so
searching and manipulating the NTFS data structures is beyond the scope of this document.
Caching of Master File Table (MFT) entries for open files, and for recently used directories/files takes
place, in addition to read/write caching of file clusters. The number of clusters cached or prefetched for
reading, or cached for writing, depends on the size and number of previous requests for the file. Write
caching is done until either some number of clusters has been filled, the next write is outside the cache, or
a certain amount of time has passed.
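The three conditions that end write caching can be sketched as follows. The cluster limit and time limit are left unspecified in the text, so the values used here are purely hypothetical, as are the class and method names. In this simplified model the timeout is checked only when the next write arrives, and clearing the dictionary stands in for writing the clusters to disk.

```python
class WriteCache:
    def __init__(self, first_cluster, capacity, max_age):
        self.first_cluster = first_cluster
        self.capacity = capacity        # hypothetical limit on cached clusters
        self.max_age = max_age          # hypothetical time limit, in seconds
        self.dirty = {}                 # cluster number -> data not yet on disk
        self.first_write_time = None

    def flush_reason(self, next_cluster, now):
        """Return why the cache must be flushed before the next write, or None."""
        if len(self.dirty) >= self.capacity:
            return "full"
        window_end = self.first_cluster + self.capacity
        if not (self.first_cluster <= next_cluster < window_end):
            return "outside"
        if self.first_write_time is not None and now - self.first_write_time >= self.max_age:
            return "timeout"
        return None

    def write(self, cluster, data, now):
        """Cache a write, flushing first if one of the three conditions holds."""
        reason = self.flush_reason(cluster, now)
        if reason is not None:
            self.dirty.clear()          # stands in for writing clusters to disk
            self.first_cluster = cluster
            self.first_write_time = None
        if self.first_write_time is None:
            self.first_write_time = now
        self.dirty[cluster] = data
        return reason

cache = WriteCache(first_cluster=100, capacity=4, max_age=5.0)
print(cache.write(100, b"a", now=0.0))   # None
print(cache.write(200, b"b", now=1.0))   # outside
```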
Use of file path names in PwnOS is similar (and in some cases, identical) to that of Windows. PwnOS
will accept “/” or “\” as a directory separator, and paths in general are case-insensitive
Unicode strings, as illustrated by the following examples.
Path                                 Meaning
hd00:\PwnOS\Core.bin                 hard drive 0, partition 0, directory “PwnOS”, file “Core.bin”
hd00:PwnOS/Core.bin                  same as above
hd00://PwnOS\Core.bin                same as above
hd00:\\\pwnos\core.Bin               same as above
hd312:\MyDirectory\SubDir\Cool.doc   hard drive 3, partition 1 (extended), subpartition 2, directory “MyDirectory”, subdirectory “SubDir”, file “Cool.doc”
http://www.codecortex.com/index.php  HTTP protocol, domain “www.codecortex.com”, file “index.php”
http:www.codecortex.com\index.php    same as above
usb00:\MyFile.txt                    USB port 0, partition 0, file “MyFile.txt”
\PwnOS\Core.bin                      root of partition of current directory, directory “PwnOS”, file “Core.bin”
/PwnOS                               root of partition of current directory, directory “PwnOS”
Core.bin                             current directory, file “Core.bin”
PwnOS\Core.bin                       current directory, subdirectory “PwnOS”, file “Core.bin”
C:\PwnOS\Core.bin                    partition mapped to “C”, directory “PwnOS”, file “Core.bin”
tcp:\127.0.0.1:1234                  TCP protocol, IP address 127.0.0.1, port 1234
prog:\MyProgram\MyPipe               Pipe (program) protocol, virtual directory “MyProgram”, virtual file “MyPipe”
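The equivalences in the table suggest a simple normalisation: split off the device or protocol prefix at the first colon, treat any run of “/” or “\” as one separator, and compare case-insensitively. The sketch below implements only that much, under the assumption that a path with a prefix is always rooted; it does not interpret the prefix itself (e.g. the digits of “hd312”), and the function name is hypothetical.

```python
import re

def normalise_path(path):
    """Return (device, rooted, components) for a PwnOS-style path."""
    device, colon, rest = path.partition(":")
    if not colon:                       # no prefix: relative to current partition or directory
        device, rest = None, path
    rooted = device is not None or rest[:1] in ("/", "\\")
    components = [part.casefold() for part in re.split(r"[/\\]+", rest) if part]
    return (device.casefold() if device else None, rooted, components)

print(normalise_path(r"hd00:\PwnOS\Core.bin"))
# ('hd00', True, ['pwnos', 'core.bin'])
print(normalise_path(r"hd00:\\\pwnos\core.Bin") == normalise_path("hd00:PwnOS/Core.bin"))
# True
```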
Virtual files will be supported for abstractions of network communication protocols (such as HTTP, FTP,
TCP), and for pipes. Both network abstractions and pipes work very similarly, except that in the case of
pipes, both ends of the virtual file are accessed by local programs, and in the case of a network
abstraction, (usually) only one end of the virtual file is accessed by a local program. The network
abstractions also depend on the I/O module, whereas pipes do not. Both pipes and network abstractions
are fully cached, i.e. all data not yet retrieved/sent is kept in memory buffers (which may
increase/decrease in size).
The data structures used by the filesystem management code are kept in the PL0 heap.
Synchronisation
The synchronisation module of PwnOS provides mechanisms for mutual exclusion and other coordination
of multiple threads. The most important of these mechanisms is the lock, and a wait-notify queue
mechanism is also provided.
A lock has two main operations: get and release. It keeps track of which thread currently has access to it
(if any), and which threads are waiting to have access to it (if any). Both get and release have a case
where they can be done entirely from PL3, and another case where they must be done in PL0. The get
operation goes as follows.
1. If the current thread already has the lock, return.
2. Atomically, do the following (using LOCK CMPXCHG16B):
a. If no thread currently has access and no thread has exclusive access to the list of waiting
threads,
i. Claim access for the current thread
3. If the current thread gained access, return.
4. Switch to PL0.
5. Disable interrupts.
6. Spinlock for exclusive access to list of threads waiting for access.
7. If the lock was released from PL3 before exclusive access was obtained (can only happen if this
is the first thread to go in the list),
a. Claim access for the current thread (since no other thread can claim it while this one has
exclusive access to the list)
b. Release access to the list of waiting threads.
8. Else,
a. Add current thread to list of threads waiting for access.
b. Set the current thread status to indicate that this thread is waiting for access to this lock.
c. Release access to list of threads waiting for access.
d. Go to the Thread Scheduler (see Thread Management section). Upon returning, this
thread will have gotten the lock.
9. Enable interrupts.
10. Return to PL3, then return to application.
In the cases where either the current thread already has the lock or the lock is free, the operation can
complete without switching to PL0. Otherwise, the operation must be done in PL0. Similarly, the release
operation is as follows.
1. If the current thread does not have the lock, fail.
2. Atomically do the following (using LOCK CMPXCHG16B):
a. If the list of threads waiting for access is empty and no thread has exclusive access to the
list,
i. Set no threads to currently have the lock.
3. If the lock just got released by the atomic operation, return.
4. Switch to PL0.
5. Disable interrupts.
6. Spinlock for exclusive access to list of threads waiting for access.
7. Remove the next thread to run from the list of waiting threads.
8. Give that thread the claim to the lock’s access.
9. Set that thread’s status to reflect that it is now able to resume running.
10. Release access to the list of waiting threads.
11. Enable interrupts.
12. Return to PL3, then return to application.
The order of these steps is absolutely critical to their proper execution. Changing the order or function of
these steps could break certain cases, and the explanation of how would be too long and detailed for this
document. These operations would be much simpler if done completely in PL0, but allowing the
operations to occur in PL3 for the very common cases (getting a free lock, and releasing an unwatched
lock) can yield a large performance benefit where locks are used often.
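The split between the fast (PL3) and slow (PL0) cases of get and release can be illustrated with a single-threaded model. This sketch is an assumption-laden simplification: each method body stands in for what the text describes as one atomic LOCK CMPXCHG16B or a spinlock-protected section, privilege levels and interrupts are not modelled, and the class and method names are hypothetical. It does not demonstrate the ordering guarantees of the real protocol.

```python
from collections import deque

class Lock:
    def __init__(self):
        self.owner = None         # thread holding the lock, if any
        self.list_busy = False    # True while a thread has exclusive access to the waiter list
        self.waiters = deque()    # threads waiting for the lock, in arrival order

    def try_get_fast(self, thread):
        """Fast case of get; models one atomic compare-and-exchange."""
        if self.owner == thread:
            return True
        if self.owner is None and not self.list_busy:
            self.owner = thread
            return True
        return False

    def get_slow(self, thread):
        """Slow case of get. Returns True if claimed, False if the thread was queued."""
        self.list_busy = True                 # spinlock on the waiter list
        if self.owner is None:                # lock was released in the meantime
            self.owner = thread
            claimed = True
        else:
            self.waiters.append(thread)       # thread would now go to the Thread Scheduler
            claimed = False
        self.list_busy = False
        return claimed

    def release(self, thread):
        """Release the lock; returns the thread that was handed the lock, if any."""
        if self.owner != thread:
            raise RuntimeError("current thread does not have the lock")
        if not self.waiters and not self.list_busy:
            self.owner = None                 # fast case: nobody is waiting
            return None
        self.list_busy = True                 # slow case: hand the lock over
        next_thread = self.waiters.popleft()
        self.owner = next_thread
        self.list_busy = False
        return next_thread

lock = Lock()
print(lock.try_get_fast("A"), lock.try_get_fast("B"))   # True False
print(lock.get_slow("B"), lock.release("A"))            # False B
```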
The other, much simpler synchronisation mechanism provided by PwnOS is the wait-notify queue. A
thread indicates that it wants to be notified of something after waiting in line to be notified. This can be
used for larger constructs such as a simple inter-thread message coordination system. Since waiting
requires being in PL0 anyway, and notifying requires modifying the shared data structure, both must have
some component executing in PL0 in order to avoid problems. Thus, for simplicity, they both are
completely implemented in PL0. This makes the operations so much simpler than getting and releasing
locks that their steps are not discussed here.
Because PwnOS is aware of the constructs that applications will use for mutual exclusion, some problems
can be averted or at least identified. For example, cycles in the graph of locks owned by threads and
threads waiting for locks represent deadlocks, and the Thread Scheduler has full access to this graph.
Because deadlock cycles are almost always very small (2 or 3 locks in each cycle), and because few
threads at any given time would be both waiting for a lock and owning a lock, detection is an inexpensive
operation that could be performed periodically but not often (~1 minute to 1 hour) without a significant
performance hit.
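Deadlock detection on this graph is simple because each thread waits for at most one lock and each lock has at most one owner, so following the chain from a thread either ends or revisits a thread. A minimal sketch, with hypothetical names and with the graph given as two dictionaries:

```python
def find_deadlock(thread, waiting_for, owner_of):
    """Follow thread -> awaited lock -> owning thread; return the cycle found, or None.

    waiting_for maps a thread to the lock it is waiting for.
    owner_of maps a lock to the thread that currently owns it.
    """
    seen = []
    current = thread
    while current is not None and current not in seen:
        seen.append(current)
        lock = waiting_for.get(current)
        current = owner_of.get(lock) if lock is not None else None
    if current is None:
        return None                          # chain ended: no deadlock reachable
    return seen[seen.index(current):]        # threads forming the cycle

waiting_for = {"T1": "L2", "T2": "L1"}
owner_of = {"L1": "T1", "L2": "T2"}
print(find_deadlock("T1", waiting_for, owner_of))   # ['T1', 'T2']
```

Each check walks at most one step per thread in the chain, which is consistent with the claim that cycles are short and detection is inexpensive.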
The Thread Scheduler can also take into account the knowledge that, for example, 20 threads are waiting
for a lock owned by a particular thread, so the owning thread should be given a priority boost (at least
temporarily) to release the lock sooner. It is not necessary that all such information be used, but it opens
up possibilities for a more intelligent scheduler.
References
1. Dickson, Neil. PwnOS Code Documentation. [Online] August 26, 2007. [Cited: October 16, 2007.]
http://www.neildickson.com/os/documentation/.
2. ITRON Committee, TRON Association. μITRON4.0 Specification. Tokyo, Japan : TRON
Association, 2002. 4.00.00.
3. International Data Corporation. HP-UX: A Foundation for Enterprise Workloads. s.l. : IDC, 2007.
#206607.
4. Silicon Graphics, Inc. Cellular IRIX™ 6.4 Technical Report. 1996.
5. MINIX 3: A Highly Reliable, Self-Repairing Operating System. Jorrit N. Herder, Herbert Bos, et al.
July 2006, s.l. : Operating Systems Review, 2006.
6. Sun Microsystems. Reference Materials. Solaris Operating System. [Online] November 2007. [Cited:
November 3, 2007.] http://www.sun.com/software/solaris/reference_resources.jsp.
7. Scalability of Microkernel-Based Systems. Uhlig, Volkmar. June 2005, s.l. : Operating Systems
Review, 2005.
8. Robert V. Baron, David Black, et al. Mach Kernel Interface Manual. 1990.
9. Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manuals. Intel. [Online]
May 2007. [Cited: August 13, 2007.] http://www.intel.com/products/processor/manuals/. 253665-253669.
10. —. 82093AA I/O Advanced Programmable Interrupt Controller (I/O APIC) Datasheet. 1996.
29056601.
11. Hewlett-Packard Company, Intel Corporation, et al. Advanced Configuration and Power Interface
Specification, Revision 3.0. 2004.
12. Microsoft Corporation. Microsoft Portable Executable and Common Object File Format
Specification, Revision 8.0. 2006.
13. Technical Committee T13. AT Attachment with Packet Interface - 6 (ATA-ATAPI-6). 2002. 1410D.
14. Hyde, Randall. Chapter 20 - The PC Keyboard. The Art of Assembly Language Programming, DOS
16-bit Edition. 2000.
15. Compaq Computer Corporation, Hewlett-Packard Company, et al. Universal Serial Bus
Specification, Revision 2.0. 2000.
16. PCI Special Interest Group. PCI Local Bus Specification, Revision 2.2. 1998.
17. Video Electronics Standards Association. VESA BIOS Extension (VBE) Core Functions Standard,
Version 3.0. 1998.
18. Gook, Michael. PC Hardware Interfaces. Wayne, Pennsylvania : A-List Publishing, 2004.
193176929X.
19. Richard Russon, Yuval Fledel. NTFS Documentation. Linux-NTFS. [Online] 2005. [Cited: October
21, 2007.] http://data.linux-ntfs.org/ntfsdoc.pdf.