On µ-Kernel Construction

Jochen Liedtke*
GMD — German National Research Center for Information Technology
jochen.liedtke@gmd.de

Abstract

From a software-technology point of view, the µ-kernel concept is superior to large integrated kernels. On the other hand, it is widely believed that (a) µ-kernel based systems are inherently inefficient and (b) they are not sufficiently flexible. Contradictory to this belief, we show and support by documentary evidence that inefficiency and inflexibility of current µ-kernels is not inherited from the basic idea but mostly from overloading the kernel and/or from improper implementation. Based on functional reasons, we describe some concepts which must be implemented by a µ-kernel and illustrate their flexibility. Then, we analyze the performance-critical points. We show what performance is achievable, that the efficiency is sufficient with respect to macro-kernels and why some published contradictory measurements are not evident. Furthermore, we describe some implementation techniques and illustrate why µ-kernels are inherently not portable, although they improve the portability of the whole operating system.

1 Rationale

Traditionally, the word 'kernel' is used to denote the part of the operating system that is mandatory and common to all other software. The basic idea of the µ-kernel approach is to minimize this part, i.e. to implement outside the kernel whatever possible. Although µ-kernel based systems had been built long before the term itself was introduced, e.g. by Brinch Hansen [1970] and Wulf et al. [1974], the approach promises obvious software-technological advantages:

(a) A clear µ-kernel interface enforces a more modular system structure.(1)
(b) Servers can use the mechanisms provided by the µ-kernel like any other user program. Server malfunction is as isolated as any other user program's malfunction.
(c) The system is more flexible and tailorable: different strategies and APIs, implemented by different servers, can coexist in the system.

Although much effort has been invested in µ-kernel construction, the approach is not (yet) generally accepted. This is due to the fact that most existing µ-kernels do not perform sufficiently well. Lack of efficiency also heavily restricts flexibility, since important mechanisms and principles cannot be used in practice because of poor performance. In some cases, the µ-kernel interface has been weakened, and special servers have been re-integrated into the kernel, to regain efficiency.
It is widely believed that this inefficiency (and thus inflexibility) is inherent to the µ-kernel approach. Folklore holds that the increased number of user-kernel mode and address-space switches is responsible. At a first glance, published performance studies seem to support this view. In fact, however, the cited studies measured particular µ-kernel based systems; from them, we can only guess whether the poor performance is caused by the µ-kernel approach as such or by the concepts and implementation of the particular µ-kernels. It is known that conventional IPC, one of the traditional µ-kernel bottlenecks, can be implemented an order of magnitude faster than was believed before.(2) So the question is still open: it might be possible that we are simply not yet applying the appropriate construction techniques.

For the above reasons, we feel that a conceptual analysis is needed which derives µ-kernel concepts from pure functionality requirements (section 2), illustrates their flexibility (section 3) and discusses the achievable performance (section 4). Further sections discuss portability (section 5) and the chances of some new developments (section 6).

* GMD SET–RS, 53754 Sankt Augustin, Germany.
(1) Although macro-kernels tend to be less modular, there are exceptions from this rule, e.g. Chorus [Rozier et al. 1988] and Peace [Schröder-Preikschat 1994].
(2) Short user-to-user cross-address-space IPC in L3 [Liedtke 1993] is 22 times faster than in Mach, both running on a 486. On the R2000, the specialized Exo-tlrpc [Engler et al. 1995] is 30 times faster than Mach's general RPC.

2 µ-Kernel Concepts

In this section, we reason about the minimal concepts or "primitives" that a µ-kernel should implement.(3) The determining criterion is functionality, not performance: a concept is tolerated inside the µ-kernel only if moving it outside the kernel, i.e. permitting competing implementations, would prevent the implementation of the system's required functionality.

We assume that the target system has to support interactive and/or not completely trustworthy applications, i.e. it has to deal with protection. We further assume that the hardware implements page-based virtual memory.

One inevitable requirement for such a system is that a programmer must be able to implement an arbitrary subsystem S in such a way that it cannot be disturbed or corrupted by other subsystems S'. This is the principle of independence: S can give guarantees independent of S'. The second requirement is that other subsystems must be able to rely on these guarantees. This is the principle of integrity: there must be a way for S1 to address S2 and to establish a communication channel which can neither be corrupted nor eavesdropped by S'.

Provided hardware and kernel are trustworthy, further security services, like those described by Gasser et al. [1989], can be implemented by servers; their integrity can be ensured by system administration or by user-level boot servers. For illustration: a key server should deliver public-secret RSA key pairs on demand. It should guarantee that each pair has the desired RSA property and that each pair is delivered only once and only to the demander. The key server can only be realized if there are mechanisms which (a) protect its code and data, (b) ensure that nobody else reads or modifies the key and (c) enable the demander to check whether the key comes from the key server. Finding the key server can be done by means of a name server and checked by public-key based authentication.

(3) Proving necessity and completeness of the concepts would be nice but is impossible, since there is no agreed-upon metric and everything is Turing-equivalent.
2.1 Address Spaces

At the hardware level, an address space is a mapping which associates each virtual page with a physical page frame or marks it 'non-accessible'. For the sake of simplicity, we omit access attributes like read-only and read/write. The mapping is implemented by TLB hardware and page tables.

The µ-kernel, the mandatory layer common to all subsystems, has to hide the hardware concept of address spaces, since otherwise protection could not be implemented on top of it. The µ-kernel concept of address spaces must be tamed, but it must permit the implementation of arbitrary protection (and non-protection) schemes on top of the µ-kernel. It should be simple and similar to the hardware concept.

The basic idea is to support recursive construction of address spaces outside the kernel. By magic, there is one address space σ0 which essentially represents the physical memory and is controlled by the first subsystem S0. At system start time, all other address spaces are empty. For constructing and maintaining further address spaces on top of σ0, the µ-kernel provides three operations:

Grant. The owner of an address space can grant any of its pages to another space, provided the recipient agrees. The granted page is removed from the granter's address space and included into the grantee's address space. The important restriction is that, instead of physical page frames, the granter can only grant pages which are already accessible to itself.

Map. The owner of an address space can map any of its pages into another address space, provided the recipient agrees. Afterwards, the page can be accessed in both address spaces. In contrast to granting, the page is not removed from the mapper's address space. Comparable to the granting case, the mapper can only map pages which it can already access itself.

Flush. The owner of an address space can flush any of its pages. The flushed page remains accessible in the flusher's address space, but it is removed from all other address spaces which received the page directly or indirectly from the flusher. Although explicit consent of the affected address-space owners is not required, the operation is safe, since it is restricted to the flusher's own pages: the users of these pages already agreed to accept a potential flushing when they received the pages by mapping or granting.

Appendix A contains a more precise definition of address spaces and of the above three operations.
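The three primitives can be pictured as a small kernel interface. The following declarations are a minimal sketch under assumed names and types; they are illustrative only and do not reproduce the actual L3/L4 system-call interface.

    /* Hypothetical interface sketch of the three address-space
       primitives; the agreement of the recipient would be negotiated
       via IPC (section 2.2) before the kernel performs the transfer. */

    typedef unsigned long vpage_t;   /* virtual page number           */
    typedef unsigned long space_t;   /* uid of an address space       */

    /* Move 'page' out of the caller's space into 'dest' at
       'dest_page'; the caller loses the page.                        */
    int  grant(space_t dest, vpage_t page, vpage_t dest_page);

    /* Insert 'page' additionally into 'dest'; the caller keeps it.   */
    int  map(space_t dest, vpage_t page, vpage_t dest_page);

    /* Recursively withdraw 'page' from every space that received it,
       directly or indirectly, from the caller; no consent required.  */
    void flush(vpage_t page);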
The described address-space concept leaves memory management and paging outside the µ-kernel; only the grant, map and flush operations are retained inside the kernel. Mapping and flushing are required to implement memory managers and pagers on top of the µ-kernel.

The grant operation is needed only in special situations: consider a pager F which combines two underlying file systems (implemented as pagers f1 and f2, operating on top of the standard pager) into one unified file system (see figure 1). In this example, f1 maps one of its pages to F, which grants the received page on to user A. By granting, the page disappears from F, so that it is available only in user A, f1 and the standard pager; the resulting mappings are denoted by the thin line in the figure. F is not affected when the page is later flushed, whether by f1 or by the standard pager; conversely, user A can only flush mappings derived from its own space and thus cannot affect f1. Granting avoids a potential address-space overflow of F, which uses each page only transiently. If F had used mapping instead of granting, it would have had to replicate the bookkeeping which is already done in f1. A sketch of F's fault handling follows.

[Figure 1: A Granting Example. User A and user X run on top of the unifying file-system pager F; F combines the file-system pagers f1 and f2, which operate on top of the standard pager; the standard pager operates on the disk.]
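Returning to figure 1, the grant-based forwarding of F can be sketched in a few lines, reusing the hypothetical declarations above. All helper names (alloc_window, request_map, backing_pager, free_window) are invented for illustration; the real pager protocol is user-level defined.

    /* Sketch of the unifying pager F: each page received from f1/f2
       is passed on by grant, so F keeps neither the mapping nor a
       copy of f1's bookkeeping.                                      */

    extern vpage_t alloc_window(void);               /* scratch page in F */
    extern void    free_window(vpage_t w);
    extern space_t backing_pager(vpage_t page);      /* f1 or f2          */
    extern void    request_map(space_t from, vpage_t at); /* RPC to pager */

    void handle_fault(space_t client, vpage_t fault_page)
    {
        vpage_t w = alloc_window();              /* transient slot in F   */
        request_map(backing_pager(fault_page), w);  /* page arrives at w  */
        grant(client, w, fault_page);            /* page leaves F again   */
        free_window(w);
    }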
The model can easily be extended to access rights: mappings copy the source page's access rights or a subset of them, i.e. they can restrict the access but not widen it. Special flushing operations remove only specified access rights.

I/O. An address space is also the natural abstraction for incorporating device ports. This is obvious for memory-mapped I/O, but I/O ports can be included as well. The granularity of control depends on the given processor: the 386 and its successors permit control per port (one very small page per port) but no mapping of port addresses (the hardware enforces mappings with v = v'), whereas the PowerPC uses memory-mapped I/O with 4K granularity. Controlling I/O rights and device drivers is thus also done by memory managers and pagers on top of the µ-kernel.

2.2 Threads and IPC

A thread is an activity executing inside an address space. A thread T is characterized by a set of registers, including at least an instruction pointer, a stack pointer and a state information. A thread's state also includes the address space σ(T) in which T currently executes. This dynamic or static association of threads with address spaces is the decisive reason for including the thread concept (or something equivalent) in the µ-kernel: to prevent corruption of address spaces, all changes to a thread's address space (σ(T) := σ') must be controlled by the kernel. This implies that the µ-kernel includes the notion of the above-mentioned activity. Note that choosing a concrete thread concept is subject to further OS-specific design decisions, e.g. concerning preemption.

Consequently, cross-address-space communication, also called inter-process communication (IPC), must be supported by the µ-kernel. The classical method is transferring messages between threads by the µ-kernel. IPC always enforces a certain agreement between the two parties of a communication: the sender decides to send information and determines its contents; the receiver determines whether it is willing to receive information and is free to interpret the received message. Therefore, IPC is not only the basic concept for communication between subsystems but also, together with address spaces, the foundation of independence. Other forms of communication, e.g. remote procedure call (RPC) or controlled thread migration between address spaces, can be constructed from message-transfer based IPC. Note that the grant and map operations (section 2.1) need IPC, since they require an agreement between granter/mapper and recipient of the mapping.

Supervising IPC. Architectures like those described by Yokote [1993] and Kühnhauser [1995] need to supervise not only the memory of subjects but also their communication, i.e. IPC should be passed through a controlling subsystem without burdening the controller's address space with all communicating parties. This can be done by introducing either communication channels or Clans [Liedtke 1992], which allow supervision of IPC by user-defined servers. Such concepts are not discussed here, since they do not belong to the minimal set of concepts. We only remark that Clans do not burden the µ-kernel: their base cost is 2 cycles per IPC.

Interrupts. The natural abstraction for hardware interrupts is the IPC message. The hardware is regarded as a set of threads which have special thread ids and send empty messages (consisting only of the sender id) to associated software threads. A receiving thread concludes from the message source id whether the message comes from a hardware interrupt, and from which interrupt:

    driver thread:
        do
            wait for (msg, sender) ;
            if sender = my hardware interrupt
                then read/write io ports ;
                     reset hardware interrupt
                else ...
            fi
        od .

Transforming interrupts into messages must be done by the kernel, but the µ-kernel is not involved in device-specific interrupt handling; in particular, it does not know anything about the interrupt semantics. On some processors, resetting the interrupt is a device-specific action which can be handled by drivers at user level. However, if releasing the interrupt requires a privileged operation, like the iret-instruction which is used solely for popping status information from the stack and/or switching back to user mode, the kernel executes this action implicitly when the driver issues the next IPC operation.
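A C rendering of the driver loop above may make the "hardware as threads" encoding concrete. The IPC interface and the id encoding are hypothetical assumptions; only the structure mirrors the pseudocode.

    /* Sketch of a user-level driver thread; ipc_wait() and the
       thread-id constant are assumed, not an actual kernel API.      */

    typedef unsigned long thread_id;

    extern const thread_id MY_HW_INTERRUPT;  /* special id of this irq  */
    extern thread_id ipc_wait(void *msg_buf); /* blocks, returns sender */

    void driver_thread(void)
    {
        char msg[64];
        for (;;) {
            thread_id sender = ipc_wait(msg);
            if (sender == MY_HW_INTERRUPT) {
                /* device-specific work: read/write i/o ports,
                   then reset the hardware interrupt                  */
            } else {
                /* ordinary client request, interpreted freely        */
            }
        }
    }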
2.3 Unique Identifiers

A µ-kernel must supply unique identifiers (uids) for something, be it threads, tasks or communication channels. Uids are required for reliable and efficient local communication. If S1 wants to send a message to S2, it needs to specify the destination S2 (or some channel leading to S2); therefore, the µ-kernel must know which uid relates to S2. On the other hand, the receiver S2 wants to be sure that the message comes from S1; therefore, the identifier must be unique, both in space and in time.

In theory, cryptography could also be used. In practice, however, enciphering messages for local communication is far too expensive, and the kernel must be trusted anyway. S2 can also not rely on purely user-supplied capabilities, since S1 or some other instance could duplicate them and pass them to untrusted subsystems without the control of S2.

3 Flexibility

To illustrate the flexibility of the basic concepts, we sketch some applications which typically belong to the basic operating system but can easily be implemented on top of the µ-kernel. This section shows the principal flexibility of a µ-kernel; whether it is really as flexible in practice strongly depends on the achieved efficiency, which is discussed in section 4.

Memory Manager. A server managing the initial address space σ0 is a classical main-memory manager, but located outside the µ-kernel. Memory managers can easily be stacked: M0 maps or grants parts of the physical memory (σ0) to σ1, controlled by M1, and other parts to σ2, controlled by M2. Now we have two coexisting main-memory managers.

Pager. A pager may be integrated with a memory manager or use a memory-managing server. Pagers use the µ-kernel's grant, map and flush primitives. The remaining interfaces, pager–client, pager–memory server and pager–device driver, are completely based on IPC and are user-level defined. Pagers can be used to implement traditional paged virtual memory and file/database mapping into user address spaces, as well as unpaged resident memory for device drivers and real-time systems. Stacked pagers, i.e. multiple layers of pagers, can be used for combining access control with existing pagers or for combining various pagers (e.g. one per disk) into one composed object. User-supplied paging strategies [Lee et al. 1994; Cao et al. 1994] are handled at the user level and are in no way restricted by the µ-kernel. Stacked file systems [Khalidi and Nelson 1993] can be realized accordingly. A sketch of the pager side of the fault protocol follows.
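The pager–client interface can be pictured as a small IPC loop. The following is a hedged sketch only: the page-fault message format, the helper names and the reply convention are invented; in a real µ-kernel system the kernel converts a client's page fault into an IPC to its pager, and the map operation serves as the reply.

    /* Hypothetical user-level pager loop (reuses vpage_t, space_t,
       map() and ipc_wait() from the sketches above).                 */

    typedef struct { unsigned long addr; } fault_msg;  /* assumed format */

    extern vpage_t lookup_or_page_in(unsigned long addr); /* own policy  */
    extern space_t space_of(thread_id t);
    extern vpage_t page_of(unsigned long addr);

    void pager_thread(void)
    {
        for (;;) {
            fault_msg m;
            thread_id client = ipc_wait(&m);        /* page-fault message */
            vpage_t frame = lookup_or_page_in(m.addr);
            map(space_of(client), frame, page_of(m.addr)); /* the reply   */
        }
    }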
Multimedia Resource Allocation. Multimedia and other real-time applications require memory resources to be allocated in a way that allows predictable execution times. The above-mentioned user-level memory managers and pagers permit, e.g., fixed allocation of physical memory for specific data, or locking data in memory for a given time. Note that resource allocators for multimedia and for timesharing can coexist; managing allocation conflicts is part of the servers' jobs.

Device Driver. A device driver is a process which directly accesses hardware I/O ports mapped into its address space and receives messages from the hardware (interrupts) through the standard IPC mechanism. Device-specific memory, e.g. a screen, is handled by means of appropriate memory managers. Compared to other user-level processes, there is nothing special about a device driver; no device driver has to be integrated into the µ-kernel.(4)

Second-Level Cache and TLB. Improving the hit rates of a secondary cache by means of page allocation or reallocation [Kessler and Hill 1992; Romer et al. 1994] can be implemented by a pager which applies some cache-dependent policy when allocating virtual pages in physical memory. In theory, a second-level TLB could be handled similarly by a user-level server, e.g. a software TLB in the sense of a hashed page table which catches the misses of the hardware TLB. In practice, however, such a first-level TLB-miss handler will most likely be implemented inside the kernel.

Remote Communication. Remote IPC is implemented by communication servers which translate local messages to external communication protocols and vice versa. The communication hardware is accessed by device drivers. If special sharing of communication buffers and user address space is required, the communication server also acts as a special pager for the client. The µ-kernel is not involved.

Unix Server. Unix(5) system calls are implemented by IPC. The Unix server can act as a pager for its clients and can also use memory sharing for communicating with its clients. The Unix server itself can be pageable or resident.

(4) In general, there is no reason for integrating boot drivers into the kernel: the booter, e.g. located in ROM, simply loads a bit image into memory that contains the micro-kernel and perhaps some set of initial pagers and drivers (running in user mode, not linked but simply appended to the kernel). Afterwards, boot drivers are no longer used.
(5) Unix is a registered trademark of UNIX System Laboratories.

Conclusion. A small set of µ-kernel concepts leads to abstractions which stress flexibility, provided they perform well enough. The only thing which cannot be implemented on top of these abstractions is the processor architecture itself: registers, first-level caches and first-level TLBs.
4 Performance, Facts & Rumors

4.1 Switching Overhead

It is widely believed that switching between kernel and user mode, between address spaces and between threads is inherently expensive. Some measurements seem to support this belief.

4.1.1 Kernel–User Switches

Ousterhout [1990] measured the costs of executing the "null" kernel call getpid. Since the real getpid operation consists only of a few loads and stores, this method measures the basic costs of a kernel call. Normalized to a hypothetical machine with a 10-MIPS rating (10 x VAX 11/780), he showed that most machines need 20–30 µs per getpid; one required even 63 µs. Corroborating these results, we measured 18 µs per Mach(6) kernel call get_self_thread on a 486 (50 MHz). In fact, kernel calls are extremely expensive on these systems. Note that we use a best-case measurement for the following discussion, 18 µs for Mach on a 486/50, not the worst-case machine.

The measured costs per kernel call are 18 x 50 = 900 cycles. The bare machine instruction for entering kernel mode costs 71 cycles, followed by an additional 36 cycles for returning to user mode. These two instructions switch between the user and kernel stack and push/pop the flag register and the instruction pointer. The measured 107 cycles (about 2 µs) are therefore a lower bound on kernel–user mode switches. The remaining 800 or more cycles are pure kernel overhead. By this term, we denote all cycles which are solely due to the construction of the kernel, nevermind whether they are spent executing instructions (800 cycles correspond to roughly 500 instructions) or in cache and TLB misses (800 cycles correspond to roughly 270 primary cache misses or 90 TLB misses). We have to conclude that the measured kernels do a lot of work when entering and exiting the kernel; by definition, this work has no net effect.

Is an 800-cycle kernel overhead really necessary? The empirical answer is no: the L3 kernel [Liedtke 1993] has a minimal kernel overhead of 15 cycles. (L3 is process oriented, uses a kernel stack per thread and supports persistent user processes, so this low overhead is achieved without architectural shortcuts.) If the µ-kernel call is executed infrequently enough, the overhead may increase by up to 57 additional cycles (3 TLB misses, 10 cache misses). The complete L3 kernel call costs are thus 123 to 180 cycles, mostly less than 3 µs. Therefore, it should be possible for any µ-kernel to achieve comparably low kernel-call overhead on the same hardware. Other processors may require a slightly higher overhead, but they also offer substantially cheaper basic operations for entering and leaving kernel mode.

From an architectural point of view, calling the kernel from user mode is simply an indirect call, complemented by a stack switch and by setting the internal 'kernel'-bit to permit privileged operations. Accordingly, returning from kernel mode is a normal return operation, complemented by switching back to the user stack and resetting the 'kernel'-bit. If the processor has different stack-pointer registers for user and kernel stack, the stack-switching costs can be hidden. Conceptually, entering and leaving kernel mode can thus perform exactly like a normal indirect call and return instruction (which do not rely on branch prediction); ideally, this means 2 + 2 = 4 cycles on a 1-issue processor.

(6) Mach 3.0, NORMA MK 13.
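The accounting above can be condensed into one line; this merely restates the measured numbers:

    \[
      \underbrace{18\,\mu\mathrm{s} \times 50\,\mathrm{MHz}}_{\text{measured Mach call}}
      \;=\; 900\ \text{cycles}
      \;=\; \underbrace{71 + 36}_{\text{enter/leave}}
          + \underbrace{\approx 800}_{\text{pure kernel overhead}} .
    \]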
Conclusion. Compared to the theoretical minimum, kernel–user mode switches are costly on some processors. Compared to existing kernels, however, they can be improved 6 to 10 times by appropriate µ-kernel construction. Kernel–user mode switches are not a serious conceptual problem, but an implementational one.

4.1.2 Address-Space Switches

Folklore also considers address-space switches as costly. All measurements known to the author that relate to this topic deal with combined thread and address-space switch costs. Therefore, in this section, we analyze only the architectural processor costs of pure address-space switching; the combined costs are discussed together with thread switching in section 4.1.3.

Most modern processors use a physically indexed primary cache which is not affected by address-space switching. Switching the page table is usually very cheap: 1 to 10 cycles. The real costs are determined by the TLB architecture. Some processors (e.g. the Mips R4000) use tagged TLBs, where each entry contains not only the virtual page address but also the address-space id. Switching the address space is transparent to such a TLB and costs no additional cycles.

Most current processors, however (e.g. 486, Pentium, PowerPC and Alpha), include untagged TLBs; an address-space switch on them requires a TLB flush. The real costs are determined by the TLB load operations needed to re-establish the current working set later: if the working set consists of n pages and the TLB is fully associative, has s entries and costs m cycles per miss, at most min(n, s) x m cycles are required in total. Apparently, larger untagged TLBs lead to a performance problem. For example, completely reloading the Pentium's data and code TLBs requires at least (32 + 64) x 9 = 864 cycles; intercepting a program every 100 µs could then imply an overhead of up to 9%. Although using the complete TLB is unrealistic,(7) this worst-case calculation shows that address-space switches may become critical in some situations.

Fortunately, this is not a real problem, since on Pentium and PowerPC, address-space switches can be handled differently. The PowerPC architecture includes segment registers which can be controlled by the µ-kernel and offer an additional address-translation facility from the local 2^32-byte address space into a global 2^52-byte space. If we regard the global space as a set of one million local spaces, address-space switches can be implemented by reloading the segment registers instead of switching the page table. With 29 cycles for 3.5 GB, or 12 cycles for 1 GB, of segment reloading, the overhead is low compared to the no-longer-required TLB flush: in effect, we get a tagged TLB.

Things are not quite as easy on the Pentium or the 486. Since segments are mapped into one 2^32-byte space, mapping multiple user address spaces into this single linear space must be handled dynamically and depends on the actually used sizes of the active user address spaces [Liedtke 1995]. For example, we can multiplex one 3-GB user address space with 8 user spaces of 64 MB and additionally 128 user spaces of 1 MB. The trick is to share the smaller spaces with all large 3-GB spaces. Switching between two large address spaces is uncritical anyway: two large working sets imply TLB and cache miss costs nevermind whether the two programs execute in the same or in different address spaces. Switching from a large space to a medium or small one, and vice versa, is always fast. The technique thus removes address-space switching as a performance bottleneck; the pure switch overhead is then 15 cycles on the Pentium and 39 cycles on the 486. The resulting dispatch logic is sketched below.

(7) Both TLBs are 4-way set-associative; conflicts usually mean that only about half of the TLB entries can be used simultaneously. Furthermore, a working set of 64 data pages will most likely lead to cache thrashing: the cache supports 4 x 32 bytes per page, and since it is only 2-way set-associative, in the best case only 1 or 2 cache entries can be used per page.
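The following sketch shows the case distinction implied by the small-space technique. It is a loose illustration under assumed names: the real implementation is hand-tuned, lives in the kernel's switch path and additionally has to track which large space's page table is currently loaded.

    /* Hypothetical sketch of segment-based address-space switching.  */

    #define LARGE_SPACE_SIZE (3UL << 30)      /* 3 GB, as in the text */

    struct space {
        int           is_small;   /* shared region inside all large spaces */
        void         *pgdir;      /* hardware page table                   */
        unsigned long seg_base, seg_size;  /* window in the linear space   */
    };

    extern void  load_user_segments(unsigned long base, unsigned long size);
    extern void  load_page_table(void *pgdir);
    extern void *current_pgdir(void);

    void switch_space(struct space *to)
    {
        if (to->is_small) {
            /* cheap path: reload segment descriptors only; the TLB
               survives (15 cycles on Pentium, 39 on the 486)         */
            load_user_segments(to->seg_base, to->seg_size);
        } else if (to->pgdir != current_pgdir()) {
            /* large target: genuine page-table switch with implicit
               TLB flush on x86                                       */
            load_page_table(to->pgdir);
            load_user_segments(0, LARGE_SPACE_SIZE);
        }
    }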
in the application’s working 14 segment switch m by a by page-table switch 12. 1 4.1.3 Thread Switches and set. and 1P C 12 i Ousterhout some Unix through [1990] also measured context switching I in 10 systems by echoing one byte back and forth pipes between to a 10 Mips machine, two processes. Again RPc time + workin S ,M reestabh,h normalized most results are between 400 and 8 7,s 6.3 [M s] r 6 1 System CPU, MHz RPC time (round trip) cycles/IPC (oneway) 4 II 3.6 3,2 full IPC semantics L3 486, 50 10 us 250 QNX 486:33 76 /JS 1254 Mach R2000, 16.7 190 ps 1584 SRC RPC CVAX, 12.5 464 /.lS 2900 Mach 486, 50 Amoeba 68020, 15 Spin Alpha 21064, Mach Abha 21064, 2 L 230 @ 5750 800 /.lS 6000 Figure 2: 133 102 ,uS 6783 Switch in LJ 133 104 us 6916 2: l-byte-RPC pby Segmented trip RPC, : Versus on Pentium, 64 48 Standard IPC can be implemented handle also hardware 4.2 Memory Address-Space 90 MHz. interrupts fast enough to by this mechanism. Effects with that of the Mach p-kernel which was complemented wit h a ~’nix server. They measured memory cycle over- construction that 10 ps, i.e. a 40 to 80 times faster RPC is achievable. Table 2 gives the costs of echoing one byte by a round / 16 Chen and Bershad [1993] compared the memory system behaviour of ~ltrix, a large monolithic Unix system, performance 800 ps per ping-pong, one was 1450 ,US. All existing kernels are at least 2 times faster, but it is proved 3.: Ill 2 Conclusion. Table 3.2 head per instruction (MCPI) and found that programs running under Mach + V-nix server had a substantially i.e. two IPC operations.8 Renesse aThe respective data is taken from [Liedtke 1993; Hildebrand 1992; Schroeder and Burroughs 1989; Draves et al. 1991; van 243 1988; Liedtke et aL 1995: et al. Hamilton and 1995; Bershad et al. 1989]. 1993; Bershad Kougiouris 1993; et aL Bryce 1995; Engler and Muller higher NICPI than running the same programs under lJl- suggest a potential problem due to OS structure. trix. For some programs, the differences were up to 0.25 cycles per instruction, averaged over the total program Chen and Bershad measured cache conflicts by comparing the direct mapped to a simulated 2-way cache. g (user + system). degradation They found [N-agle et al. tant Similar of Mach versus Ultrix memory is noticed system by others 1994]. The crucial point is whether this problem is due to the way that Mach is constructed, or whether it is caused by the p-kernel Chen and Bershad (black) “This and capacity in an absolute suggests that microkernei optimizations focussing exclusively on IPC [. ..], without considering other sources of system overhead such as MCPI, will have a limited impact on overall system performance.” Although one might suppose a principal tioned paper specific impact of OS architecture, exclusively presents implementation for memory facts without “as is” about interference, is more impor- but the data also (white) scale (left) system cache misses both and as ratio (right). ~=~f~~~=es egrep the men- analyzing system self-interference user/system show that the ratio of conflict to capacity misses in Mach is lower than in Ultrix. Table 4 show-s the conflict approach. [1993, p. 125] state: that than U n 0,024 M D a y... the reasons m 1 D 1 0.009 U m 0.039 M -—] 0,098 system degradation. Careful analysis of the results is thus required. 
According to the original paper, we comprise under ‘system’ either all Ultrix activities or the joined activities of the Mach y-kernel, Unix emulation library and Unix seal, egrep, yacc, drew benchmark In figure gee, compress, M m From ab. 4: MCPI this Caused by Cache we can deduce that misses are caused by higher 3, we present 1 in a slightly 0.037 Figure and the an- espresso u M c1012 espresso server. The Ultrix case is denoted by U, the Mach case by M. We restrict our analysis to the samples that show a significant MCPI difference for both systems: the results reordered way. of Chen’s figure We have colored 2- system MCPI (Mach by conflicts the Misses. increased + emulation library which are inherent + Unix is supported single part by the fact library, which is reWe assume that the samples spent even Unix server than in the We also exclude Mach’s since Chen and Bershad only 3’% or less of system not structure. server behaves comparably of the Ultrix kernel. This fewer instructions in Mach’s corresponding Ult rix routines, emulation of the server), to the system’s The next task is to find the component sponsible for the higher cache consumption. that the used Unix to the corresponding cache cache consumption overhead report that is caused by it. What remains is Mach itself, including trap handling, IPC and memory management, which therefore must induce nearly all of the additional cache misses. Figure 3: Baseline for MCPI Ultriz and Therefore, the mentioned measurements suggest that memory system degradation is caused solely by high Mach. cache consumption black that are due to system The bars white buffer stalls, d-cache to system misses see that between the the user white and bars that the are essentially che misses for ity misses. Mach. system A used large causes, i-cache buffer stalls. It not average direct differ Since a p-kernel and are is easy be conflict number system as defined mapped of caches) conflict system misses of very plicitly. or capac- misses tagged flushes would 244 ttis by Hill approximation. 10 We do not ca(the by is basically user-level consumption is 0.00, gAlthough in memory by increased could invoked cache significantly difference MCPI. caused fraction write user differences They system reads, do the is 0.02 behaviour measured write Mach; deviation conclude i-cache or d-cache misses. all other uncached and Ultrix standard We comprise of the p-kernel. Or in other drastically reducing the cache working will solve the problem. can method and befieve The measured used does not [1989], the Mach system caches. The hardware for DMA. or hardware, only 10 be explained Smith that a set of procedures threads frequently words: set of a p-kernel p-kernel determine which a high by a large operations all cofict or fisses it can be used as a first-level kernel flushes was a rmiprocessor does not even require the cache with ex- physically explicit cache by high cache working sets of a few frequently used operations. According to section 2, the first case has to be considered as a conceptual mistake. Large cache work- 5.1 ing sets are also not to be conceptually For example, (Recall: voluminous namic or static by copying an inherent L3 requires feature less than communication mapping For illustration, of p-kernels. 1 Kfor short Although of kernel is almost so that the cache is not flooded that irrelevant. p-kernel in the sense that between executing multiple address spaces. describe modified how a p-kernel even when i.e. 
to a compatible the Pentium processor an increase in cache context switches from processor. are some differences I I We showed (section context switches there is not much application + servers line, fast 4.1.2 are not size. ways Cache ex- p-kernels will automatically system degradation measured 486 I 8K~uj write compatible in the internal 4x 16B Pentiurn I trap 2x 32B 1 cycle register I I 8K~,~ + 8K~d~ through instructions segment back 0.5–1 cycle 9 cycles 3 cycles 107 cycles 69 cycles difference in one or in Table internal Conclusion. The hypothesis that p-kernel architectures inherently lead to memory system degradation is not substantiated. On the contrary, the quoted measurements support the hypothesis that properly con- 3: 486 \ Pentium architecture p-kernel (see table architect User-address-space avoid the memory so that 3) which influence 4.1.2, a Pentium for implementing each 232-byte can be implemented Non-Portability hardware As men- p-kernel should use user address spaces address space shares all has strong advantages from the software technological point of view: programmers did not need to know very much about processors approach prevented these p-kernels necessary performance In retrospective, building a p-kernel rious implications: ● and the resulting to new machines. p-kernels could Unfortunately, this from the achieving loads take advantage p-kernels beyond quire form they are as hardware erators. 28 x 9 +36= of specific the TLB the lowest layer of operating Therefore, dependent We have learned that we should as optimizing used inside . even the algorithms internal concepts are extremely dent. places in case of 72 cycles in the best case. In total, method wins. On the Pentium however, the segment register method pays. The reasons are several: (a) Segment register loads are faster. (b) Fast instructions are cheaper. whereas the overhead by trap and TLB misses remain systems nearly constant. ative to instruction code gen- a p-kernel processor at various 288 cycles in the theoretical and only the conventional accept that not only the coding instructions 100% TLB use, 14x9+36= 162 cycles for the more probable case that the large address space uses only 5070 of layer per se costs performance. the hardware. and additional ment register model would be (130+ 39) x 2 = 338 cycles, whereas the conventional address-space model would re- . It cannot take precautions to circumvent or avoid performance problems of specific hardware. c The additional for the 486, space switch from a large space to a small one and back to the large. Assuming cache hits, the costs of the seg- and thus flexibility. cannot technique the kernel sum to roughly 130 cycles required in addition. Now look at the relevant situation: an address- we should not be surprised, since on top of abstract hardware has se- Such a p-kernel hardware. a similar this to the user. and table 1 also suggests it for the 486. Nevertheless, the conventional hardware-address-space switch is preferable on this processor. Expensive segment register Older ,u-kernels were built machine-independently on top of a small hardware-dependent layer. This approach easily be ported transparently Ford [1993] proposed the ure: implementation. tioned in section segment registers for Mach. Differences small and one large user address space. Recall that 5 has “ported” is binary to the 486, there hardware structed we briefly very long messages.) between applications with large working sets. 
This depends mostly on the application load and the requirement for interleaved execution (timesharing). The type arid 4.1.3) Processors 486 to Pentium, IPC. can be made by dy- Mogul and Borg [199 1] reported misses after preemptively-scheduled pensive Compatible (c) Conflict execution, cache misses (which, are anyway rel- more expen- sive) are more likely because of reduced associativity. Avoiding TLB misses thus also reduces cache conflicts. (d) Due to the three times larger TLB, the flush costs can increase substantially. As a result, on Pentium, the segment register method always pays (see figure 2). but and its depen- 245 I As a consequence, we have to implement tional user-address-space ify address-space switch The differences an addi- are orders of magnitude tween 486 and Pentium. multiplexer, we have to modrout ines, handling of user sup- processor requires higher We have to expect a new p-kernel than be- that a new design. thread control blocks, task control blocks, the IPC implementation and the address-space structure as seen by the kernel. In total, the mentioned For illustration, we compare two different kernels on two different processors: the Exokernel [Engler et al. changes affect though this is similar to comparing apples with oranges, a careful analysis of the performance differences helps plied addresses, algorithms in about half 1995] running of all p-kernel modules. 1P C implementation. Due to reduced the Pentium to exhibit caches tend nally, conflict to the 486 kernel, no effect on 486 and can be implemented to the user. In the 486 kernel, kernel stacks) 2 control blocks blocks to identical set-associative. blocks unique [Liedtke thread data addresses. Due A nice optimization bound is to align on 4K but on lK due to internal con- two randomly in the cache. this affects the internal identifiers supplied stalls 1993] for details). Therefore, the Then con- bit-structure of families (see are systems write protection mode in kernel mode, but implies this is willing no more than a hypothetical about be efficient counter to the current environment, of the callee’s to receive and permits and in- processor a message from complex 30 additional L3-equivalent 65 cycles on the R2000. consideration set, cycles. con- the messages to From our should re(Note that “Exo-IPC” Finally, that the cycles of both would cost we must take into processors equivalent as far as most-frequently-executed tions are concerned. Based on SpecInts, are not instrucroughly 1.4 486-cycles appear to do as much work as one R2000cycle; comparing the five instructions most relevant in this context (2-op-alu, 3-op-alu, load, branch taken and multi-level page tables, hashed page tables, (no) reference bits, (no) page protection, strange page protectionll, single/multiple page sizes, 232-, 243-, 252- and 264-byte address spaces, flat and segmented address spaces, various segment models, tagged/untagged TLBs, virtually/physically tagged caches. ] 1e.g. the 386 ignores elements must design. for implementing to the callee’s processor partner quire with erPC supports read only in kernel page is seen in user mode as well. is a “substrate functionality there using PCT for a trusted LRPC already costs an additional 18 cycles, see table 2.) Therefore, we assume register architecture, exception handling, cache/TLB architecture, protection and memory model. Especially the latter ones radically influence p-kernel structure. 
There to ~-kernel and receive timeouts, the new kernel differ in instruction by different performance, reduce marshaling costs and IPC frequency. experience, extending Exo-PCT accordingly Processors Processors of competing processor relevant required that Incompatible be explained sender and whether no clan borderline is crossed), synchronizes both threads, supports error recovery by send cannot simply replace the old one, since (persistent) user programs already hold uids which would become invalid. 5.2 architectures. L3-IPC is used for secure communication between potentially untrusted partners; it therefore additionally checks the communication permission (whether thread selected p-kernel text .“ boundaries. by the p-kernel average time-slice to its reasons.) in different IPC mechanisms. [It] changes the program an agreed-upon value in the callee, donates The of both cannot and/or Exo-PCT accesses stacks simultaneously. cache there is a 7570 chance that trol blocks do not compete Surprisingly, (including always ference an anomaly this problem could be ignored on Pentium’s data cache is only 2-way no longer (1K is the lower since it has transparently blocks IPC maps the according 4-way associativity, the 486. However, control control and kernel cache hardware trol thread were page aligned. this results and Fi- We compare Exokernel’s protected control transfer (PCT) with L3’s IPC. Exo-PCT on the R2000 requires about 35 cycles, whereas L3 takes 250 cycles on a 486 processor for an 8-byte message transfer. If this dif- misses. One simple way to improve cache behaviour during IPC is by restructuring the thread control block data such that it profits from the doubled cache line size. This can be adopted on a 486. Al- understanding the performance-determining factors weighting the differences in processor architecture. associativity, increased on an R2000 and L3 running not taken) gives 1.6 for well-optimized code. Thus we estimate that the Exo-IPC would cost up to approx. 100 486-cycles being definitely less than L3’s 250 cycles. This substantial difference in timing indicates an isolated dz~erence between both processor architectures that strongly influences IPC (and perhaps other pkernel mechanisms), In fact, but not average programs. the 486 processor imposes a high penalty entering/exiting the kernel and requires per IPC due to its untagged TLB. This 107 + 49 = 156 cycles. On the other hand. the R2000 has a tagged TLB, i.e. avoids the TLB flush, and needs less than 20 cycles for entering and exiting the kernel. the Powthat on a TLB flush costs at least the 246 From the above example, ● For well-engineered p-kernels on different processor architectures, in particular with different memory systems, ● we learn two lessons: we should expect ences that are not related formance. Different architectures optimization p-kernel isolated to overall require techniques that timing differ- processor per- processor-specific even affect the global structure. To understand the second point, tory 486-TLB flush requires ber of subsequent TLB recall that minimization misses. The the mandaof the num- relevant tech- niques [Liedtke 1993, pp. 179,182-183] are mostly based on proper address space construction: concentrating processor-internal tables and heavily one page (there implementing objects, 11 TLB is no unmapped control used kernel data in memory on then 486), blocks and kernel stacks as virtual lazy scheduling. In toto, these techniques save misses, i.e. 
at least 99 cycles on the 486 and are thus inevitable. Due to its unmapped memory TLB, the mentioned constraint facility and tagged disappears on the R2000. Consequently, the internal structure (address space structure, page fault handling, perhaps control block access and scheduling) can substantially tors also imply objects, from implementing of a corresponding a 486-kernel. control even the uids will differ x pointer swe differ fac- blocks as physical between size + z) and 486 kernel kernel If other (no the R2000 (no x controi block +x). Conclusion. mal “p’’-set p-kernels form the link between a miniof abstractions and the bare processor. The performance demands microprogramming. are comparable As a consequence, to those of earlier p-kernels are in- herently not portable. Instead, they are the processor dependent basis for portable operating systems. Spin. Spin [Bershad et al. 1994; Bershad et al. 1995] is a new development which tries to extend the Synthesis idea: user-supplied algorithms are translated by a kernel compiler and added to the kernel, i.e. the user may write new system calls. By controlling branches and memory references, the compiler generated code does not violate This approach reduces kernel–user sometimes address space switches. Synthesis, Panda, Spin, Cache DP-Mach, and Exokernel mode switches and Spin is based on Mach and may thus inherit many of its inefficiencies which makes it difficult to evaluate performance results. Resealing them to an efficient p-kernel with fast kernel- user mode switches and fast IPC is needed. The most crucial problem, however, is the estimation of how an optimized p-kernel architecture coming from a kernel compiler Kernel architecture and the requirements interfere and performance with each other. might be e.g. af- fected by the requirement for larger kernel stacks. (A pure p-kernel needs only a few hundred bytes per kernel stack.) Furthermore, the costs of safety-guaranteeing code have to be related timal user-level code. The first published to p-kernel results overhead [Bershad and to op- et al. 1995] can- not answer these questions: On an Alpha 21064, 133 MHz, a Spin system call needs nearly twice as many cycles (1600, 12ps) as the already call (900, that 7ps). Mach kernel The expensive application compiler; however, Mach system measurements can be substantially improved it remains show by using open whether a this technique can reach or outperform a pure p-kernel approach like that described here. For example, a simple user-level page-fault handler (11 00 ps under Mach) executes in 17 ps under Spin. However, we must take into consideration that in a traditional p-kernel, the kernel is invoked and left only twice: page fault (enter), message to pager (exit), The Spin technique on this processor traditional From our map message (enter+exit cost less than Spin overhead overhead a kernel reply 1 ps i.e. with is far beyond the ideal of 1+1 ps. experience, compiler ). can save only one system call which should 12 ps the actual 6 ensures that the newlykernel or user integrity. we expect eliminates a notable nested IPC gain if redirection, e.g. when Synthesis. Henry Massalin’s Synthesis tem [Pu et al. 1988] is another example forming (and non-portable) feature was a kernel-integrated ated kernel code at runtime. kernel. operating sysof a high per- Its distinguishing “compiler” For example, invocation. This technique when issuing was highly successful on the 68030. 
Spin. Spin [Bershad et al. 1994; Bershad et al. 1995] is a new development which tries to extend the Synthesis idea: user-supplied algorithms are translated by a kernel compiler and added to the kernel, i.e. the user may write new system calls. By controlling branches and memory references, the compiler ensures that the newly generated code does not violate kernel or user integrity. This approach reduces kernel–user mode switches and sometimes address-space switches.

Spin is based on Mach and may thus inherit many of its inefficiencies, which makes it difficult to evaluate its performance results; rescaling them to an efficient µ-kernel with fast kernel–user mode switches and fast IPC would be needed. The most crucial problem, however, is the question of how an optimized µ-kernel architecture and the requirements coming from a kernel compiler interfere with each other. Kernel architecture and performance might, e.g., be affected by the requirement for larger kernel stacks (a pure µ-kernel needs only a few hundred bytes per kernel stack). Furthermore, the costs of the safety-guaranteeing code have to be related to the µ-kernel overhead and to optimal user-level code. The first published results [Bershad et al. 1995] cannot answer these questions: on an Alpha 21064, 133 MHz, a Spin system call needs nearly twice as many cycles (1600, 12 µs) as the already expensive Mach system call (900, 7 µs). The application measurements show that Mach can be substantially improved by using a kernel compiler; it remains open, however, whether this technique can reach or outperform a pure µ-kernel approach like the one described here. For example, a simple user-level page-fault handler (1100 µs under Mach) executes in 17 µs under Spin. But consider that in a traditional µ-kernel, the kernel is entered and left only twice per fault: page fault (enter), message to pager (exit), map reply message (enter + exit). The Spin technique can save only one of these kernel calls, which on this processor should cost less than 1 µs; the actual Spin overhead is thus far beyond the ideal of 1 + 1 µs of a traditional µ-kernel. From our experience, we expect a notable gain where a kernel compiler eliminates nested IPC redirection, e.g. when using deep hierarchies of Clans or Custodians [Härtig et al. 1993]. Efficient integration of the kernel-compiler technique with an appropriate µ-kernel design might be a promising research direction.

Utah-Mach. Ford and Lepreau [1994] changed Mach IPC to migrating RPC semantics, based on thread migration between address spaces similar to the Clouds model [Bernabeu-Auban et al. 1988]. A substantial performance gain was achieved, a factor of 3 to 4.

DP-Mach. DP-Mach [Bryce and Muller 1995] implements multiple domains of protection within one user address space and offers a protected inter-domain call. The performance results (see table 2) are encouraging. However, although this inter-domain call is highly specialized, it is twice as slow as what a general RPC mechanism can achieve. In fact, an inter-domain call needs two kernel calls and two address-space switches; a general RPC requires two additional thread switches and the argument transfer.(12) Apparently, the kernel-call and address-space switch costs dominate. Bryce and Muller presented an interesting optimization for small inter-domain calls: when switching back from a very small domain, the TLB is only selectively flushed. Although the effects are rather limited on their host machine (a 486 with only 32 TLB entries), this might become more relevant on processors with larger TLBs. To analyze whether such kernel enrichment by inter-domain calls pays, one would need, e.g., a Pentium implementation and could then compare it with a general RPC based on segment switching.

(12) For implementing inter-domain calls, a pager can sometimes be used which shares the address spaces of caller and callee such that the trusted callee can access the parameters in the caller's space. E.g., LRPC [Bershad et al. 1989] and NetWare [Major et al. 1994] use a similar technique.
In contrast (address of Basic implementation decisions, most algorithms and data structures inside a ~-kernel are processor dependent. Their design must be guided by performance prediction and analysis. Besides inappro- could be done as well on top of a pure ~-kernel by means of according pagers. (Kernel data structures, e.g. thread control range set enough to al- operating a wade chose inappropriate Simdar haa to deal this a m~nimal with are flexible may additionally need clans or a similar reference monitor concept. Choosing the right abstractions is crucial a dynamically selected subset. It was hoped that technique would lead to a smaller p-kernel interface with of mechanisms for both code, since it no longer layers that of arbitrary ezploitatton ‘caching’ refers to the fact that the p-kernel never handles the complete set of e.g. all address spaces, but only also to less p-kernel on at least threads with IPC and unzque idenand grant operation, form such a basis. Multi-level-security tifiers) systems It caches ker- address spaces and mappings. abstractions implementation presented and p- it relies on a small machine. p-kernels platforms. can provade higher of appropriate aliow (non try to decide these ques- comparable Conclusions A p-kernel fixed abstractions Such a co-construction will also lead to new insights for both approaches. domazn and virsides its two basic concepts protection tual processor, the Panda kernel handles only interrupts kernel. the right exoplat- al. of a small kernel to user space. Be- Cache-Kernel. The Cache-kernel [Cheriton Duda 1994] is also a small and hardware-dependent out that We should by constructing two reference switching. [Assenmacher 1993] p-kernel is a further example which delegates as much as possible then turn on the same hardware are even more efficient than securely multiplexing hardware primitives or, on the other hand, that abstractions whether kernel enrichment by inter-domain calls pays, we need e.g. a Pentium implementation and then coma general we have well-engineered p-kernels presented design achieve well performing spectj?c implementations shows that p-kernels of mistakes of the combined implementation. d ZS possible through to processor- processor-tndcpendent ab- stractions. Availability The source code of the L4 ~-kernel, For imp- p-kernel, is available for examination tion through the web: lementing inter-domain calls, a pager can be used whick shares the address spacesof caller and mllee such that the trusted caflee can access the parameters in the caller space. E.g. LRPC [Bershad et al. 1989] and NetWare [Major et al. 1994] use a similar technique. httpi/borneo.gmd, 248 a successor of the L3 and experimenta- de/RSiL4. Acknowledgements Many thanks to Hermann Hartig for discussion Flushing, out explicit and Rich implicitly can j7ush Uhlig for proofreading and stylistic help. Further thanks some anonyfor reviewing remarks to Dejan Milojicic, mous referees and Sacha Krakowiak for shepherding. the reverse operation, can be executed withagreement of the mappees, since they agreed when any Note that derived Address An Abstract R-u {~} of Address u, address space, where pages, R the set of available : V to regard each mapping (real) as a one we write U. x x to describe Fortunately, a V) The three physical address a. := z exported . 
It is convenient to regard each such mapping as a one-column table which contains σ(v) for all v ∈ V and can be indexed by v. We denote the elements of this table by σ_v; evaluation is recursive:

    \sigma(v) = \begin{cases}
        \sigma_v    & \text{if } \sigma_v \in R \cup \{\Phi\}, \\
        \sigma'(v') & \text{if } \sigma_v = (\sigma', v').
    \end{cases}

All modifications of address spaces are based on the replacement operation, a change of σ at v, precisely:

    \sigma_v \leftarrow x \;\equiv\; \mathrm{flush}(\sigma, v)\,;\; \sigma_v := x ,

where flush(σ, v) recursively removes all mappings derived from (σ, v):

    \mathrm{flush}(\sigma, v):\quad \forall (\sigma', v') \text{ with } \sigma'_{v'} = (\sigma, v):\; \mathrm{flush}(\sigma', v')\,;\; \sigma'_{v'} := \Phi .

A page potentially mapped at v in σ is thus flushed before the new value x is copied into σ_v. The replacement operation is internal to the µ-kernel; we use it only for describing the three exported operations:

Grant. A subsystem S with address space σ can grant any of its pages v to a subsystem S' with address space σ', provided S' agrees:

    \sigma'_{v'} \leftarrow \sigma_v\,;\quad \sigma_v \leftarrow \Phi .

The granted page is transferred to σ' and removed from σ. S determines which of its pages is granted, whereas S' determines at which virtual address v' the granted page appears in its space.

Map. A subsystem S with address space σ can map any of its pages v to a subsystem S' with address space σ', provided S' agrees:

    \sigma'_{v'} \leftarrow (\sigma, v) .

In contrast to grant, the mapped page remains in the mapper's address space: a link (σ, v) is stored in σ'_{v'} instead of transferring the existing link. Note that σ'(v') then always includes evaluation of σ(v).

Flush. A subsystem S with address space σ can flush any of its pages v by executing flush(σ, v). The page remains accessible in σ, but all mappings derived directly or indirectly from (σ, v) are removed. Flushing, the reverse operation of mapping, can be executed without explicit agreement of the mappees, since they agreed implicitly when accepting the map operation.

These three operations permit the recursive construction of address spaces: new spaces are always based on existing ones. No cycles can be established, since ← flushes the destination prior to copying.

Implementing Address Spaces

At a first glance, deriving the physical address of page v in address space σ seems to be rather complicated and expensive, since it requires a recursive evaluation of σ(v). Fortunately, such an evaluation is never required at run time, because σ(v) can change only through the explicit operations described above. We therefore complement each σ by an additional table P, where P_v corresponds to σ_v and always holds either the physical address of v or Φ. P_v can then be used instead of evaluating σ(v); grant and map include the according update of P, and P'_{v'} := Φ is invoked by each flush operation. In fact, P is equivalent to a hardware page table, so that µ-kernel address spaces can be implemented straightforwardly by means of the hardware address-translation facilities.

The recommended implementation is to use one tree per physical page frame which describes all actual mappings of the frame. Each node contains (P, v), where P is the page table implementing the corresponding address space and v is the virtual page within that address space. Assume that a grant, map or flush operation deals with a page v in an address space σ with which the page table P is associated. In a first step, the operation selects the according tree by P_v, the physical page frame. In the next step, it selects the node of the tree that contains (P, v). (This selection can be done by parsing the tree, or in a single step if P_v is extended by a link to the node.) Grant then simply replaces the values stored in the node by (P', v'), and map creates a new child node storing (P', v'). Flush leaves the selected node unaffected but parses and erases the complete subtree, i.e. P'_{v'} := Φ is executed for each node (P', v') of the subtree.
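A compact sketch of such a mapping database follows. It is an illustration under assumed types, with hypothetical names, not the actual kernel code (which would, e.g., preallocate nodes and synchronize per frame):

    #include <stdlib.h>

    /* One node of a per-frame mapping tree: the pair (P, v), i.e. a
       page table P and a virtual page number v, plus the tree links. */
    struct mapnode {
        unsigned long  *pt;        /* page table P, entries hold frame or PHI */
        unsigned long   v;         /* virtual page number within P            */
        struct mapnode *child;     /* first mapping derived from (P, v)       */
        struct mapnode *sibling;   /* next mapping derived from the parent    */
    };

    #define PHI 0UL                /* nilpage marker in page-table entries    */

    /* Grant: transfer the frame from (P, v) to (P', v'); the existing
       node is reused, only its stored pair is replaced.                */
    void grant(struct mapnode *n, unsigned long *pt2, unsigned long v2)
    {
        pt2[v2] = n->pt[n->v];     /* P'_(v') receives the physical address   */
        n->pt[n->v] = PHI;         /* page disappears from the granter        */
        n->pt = pt2;
        n->v  = v2;
    }

    /* Map: attach (P', v') as a new child node of (P, v). */
    struct mapnode *map(struct mapnode *n, unsigned long *pt2, unsigned long v2)
    {
        struct mapnode *c = malloc(sizeof *c);
        if (c == NULL)
            return NULL;
        c->pt = pt2;
        c->v = v2;
        c->child = NULL;
        c->sibling = n->child;     /* push onto the parent's child list       */
        n->child = c;
        pt2[v2] = n->pt[n->v];     /* mappee sees the same physical frame     */
        return c;
    }

    /* Flush: erase the complete subtree below (P, v); the node itself
       is left unaffected, so the flusher keeps its own mapping.        */
    void flush(struct mapnode *n)
    {
        struct mapnode *c = n->child;
        while (c != NULL) {
            struct mapnode *next = c->sibling;
            flush(c);              /* remove indirectly derived mappings      */
            c->pt[c->v] = PHI;     /* P'_(v') := PHI                          */
            free(c);
            c = next;
        }
        n->child = NULL;
    }

With the extra link from P_v to its node, grant is a constant-time update, map allocates a single node, and flush is linear in the number of derived mappings.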
References

Assenmacher, H., Breitbach, T., Buhler, P., Hübsch, V., and Schwarz, R. 1993. The Panda system architecture - a pico-kernel approach. In 4th Workshop on Future Trends of Distributed Computing Systems, Lisboa, Portugal, pp. 470-476.

Bernabeu-Auban, J. M., Hutto, P. W., and Khalidi, Y. A. 1988. The architecture of the Ra kernel. Tech. Rep. GIT-ICS-87/35 (Jan.), Georgia Institute of Technology, Atlanta, GA.

Bershad, B. N., Anderson, T. E., Lazowska, E. D., and Levy, H. M. 1989. Lightweight remote procedure call. In 12th ACM Symposium on Operating System Principles (SOSP), Litchfield Park, AZ, pp. 102-113.

Bershad, B. N., Chambers, C., Eggers, S., Maeda, C., McNamee, D., Pardyak, P., Savage, S., and Sirer, E. G. 1994. Spin - an extensible microkernel for application-specific operating system services. In 6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp. 68-71.

Bershad, B. N., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M., Becker, D., Eggers, S., and Chambers, C. 1995. Extensibility, safety and performance in the Spin operating system. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain Resort, CO, pp. xx-xx.

Brinch Hansen, P. 1970. The nucleus of a multiprogramming system. Commun. ACM 13, 4 (April), 238-241.

Bryce, C. and Muller, G. 1995. Matching micro-kernels to modern applications using fine-grained memory protection. In IEEE Symposium on Parallel and Distributed Systems, San Antonio, TX.

Cao, P., Felten, E. W., and Li, K. 1994. Implementation and performance of application-controlled file caching. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 165-178.

Chen, J. B. and Bershad, B. N. 1993. The impact of operating system structure on memory system performance. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 120-133.

Cheriton, D. R. and Duda, K. J. 1994. A caching model of operating system kernel functionality. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 179-194.

Digital Equipment Corp. 1992. DECchip 21064-AA Microprocessor, Data Sheet.

Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. 1991. Using continuations to implement thread management and communication in operating systems. In 13th ACM Symposium on Operating System Principles (SOSP), Pacific Grove, CA, pp. 122-136.

Engler, D., Kaashoek, M. F., and O'Toole, J. 1994. The operating system kernel as a secure programmable machine. In 6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp. 62-67.

Engler, D., Kaashoek, M. F., and O'Toole, J. 1995. Exokernel, an operating system architecture for application-level resource management. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain Resort, CO, pp. xx-xx.

Ford, B. and Lepreau, J. 1994. Evolving Mach 3.0 to a migrating thread model. In Usenix Winter Conference, pp. 97-114.

Gasser, M., Goldstein, A., Kaufmann, C., and Lampson, B. 1989. The Digital distributed system security architecture. In 12th National Computer Security Conference (NIST/NCSC), Baltimore, MD, pp. 305-319.

Hamilton, G. and Kougiouris, P. 1993. The Spring nucleus: a microkernel for objects. In Summer Usenix Conference, Cincinnati, OH, pp. 147-160.

Härtig, H., Kowalski, O., and Kühnhauser, W. 1993. The BirliX security architecture. Journal of Computer Security 2, 1, 5-21.

Hildebrand, D. 1992. An architectural overview of QNX. In 1st Usenix Workshop on Micro-kernels and Other Kernel Architectures, Seattle, WA, pp. 113-126.

Hill, M. D. and Smith, A. J. 1989. Evaluating associativity in CPU caches. IEEE Transactions on Computers 38, 12 (Dec.), 1612-1630.

Intel Corp. 1990. i486 Microprocessor Programmer's Reference Manual.

Intel Corp. 1993. Pentium Processor User's Manual, Volume 3: Architecture and Programming Manual.

Kane, G. and Heinrich, J. 1992. MIPS Risc Architecture. Prentice Hall.

Kessler, R. E. and Hill, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems 10, 4 (Nov.), 338-359.

Khalidi, Y. A. and Nelson, M. N. 1993. Extensible file systems in Spring. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 1-14.

Lee, C. H., Chen, M. C., and Chang, R. C. 1994. HiPEC: high performance external virtual memory caching. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 153-164.

Liedtke, J. 1992. Clans & chiefs. In 12. GI/ITG-Fachtagung Architektur von Rechensystemen, Kiel, pp. 294-305. Springer.

Liedtke, J. 1993. Improving IPC by kernel design. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 175-188.

Liedtke, J. 1995. Improved address-space switching on Pentium processors by transparently multiplexing user address spaces. Arbeitspapiere der GMD No. 933, GMD, Sankt Augustin, Germany.

Major, D., Minshall, G., and Powell, K. 1994. An overview of the NetWare operating system. In Usenix Winter Conference, San Francisco, CA.

Mogul, J. C. and Borg, A. 1991. The effect of context switches on cache performance. In 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Santa Clara, CA, pp. 75-84.

Motorola Inc. 1993. PowerPC 601 RISC Microprocessor User's Manual.

Nagle, D., Uhlig, R., Mudge, T., and Sechrest, S. 1994. Optimal allocation of on-chip memory for multiple-API operating systems. In 21st International Symposium on Computer Architecture (ISCA), Chicago, IL, pp. 358-369.

Ousterhout, J. 1990. Why aren't operating systems getting faster as fast as hardware? In Usenix Summer Conference, Anaheim, CA, pp. 247-256.

Pu, C., Massalin, H., and Ioannidis, J. 1988. The Synthesis kernel. Computing Systems 1, 1 (Jan.), 11-32.

Romer, T. H., Lee, D., Bershad, B. N., and Chen, J. B. 1994. Dynamic page mapping policies for cache conflict resolution on standard hardware. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 255-266.

Rozier, M., Abrossimov, V., Armand, F., Boule, I., Gien, M., Guillemont, M., Herrmann, F., Kaiser, C., Langlois, S., Leonard, P., and Neuhauser, W. 1988. Chorus distributed operating systems. Computing Systems 1, 4, 305-370.

Schröder-Preikschat, W. 1994. The Logical Design of Parallel Operating Systems. Prentice Hall.

Schroeder, M. D. and Burrows, M. 1989. Performance of the Firefly RPC. In 12th ACM Symposium on Operating System Principles (SOSP), Litchfield Park, AZ, pp. 83-90.

van Renesse, R., van Staveren, H., and Tanenbaum, A. S. 1988. Performance of the world's fastest distributed operating system. Operating Systems Review 22, 4 (Oct.), 25-34.

Wulf, W., Cohen, E., Corwin, W., Jones, A., Levin, R., Pierson, C., and Pollack, F. 1974. Hydra: the kernel of a multiprocessing operating system. Commun. ACM 17, 6 (July), 337-345.

Yokote, Y. 1993. Kernel-structuring for object-oriented operating systems: the Apertos approach. In International Symposium on Object Technologies for Advanced Software. Springer.