Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism

THOMAS E. ANDERSON, BRIAN N. BERSHAD, EDWARD D. LAZOWSKA, and HENRY M. LEVY, University of Washington

Threads are the vehicle for concurrency in many approaches to parallel programming. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory.

This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the problems encountered in integrating user-level threads with other system services are a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; kernel threads are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.

Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management; D.4.4 [Operating Systems]: Communications Management–input/output; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Multiprocessor, thread

This work was supported in part by the National Science Foundation (grants CCR-8619663, CCR-8703049, and CCR-8907666), the Washington Technology Center, and Digital Equipment Corporation (the Systems Research Center and the External Research Program). Anderson was supported by an IBM Graduate Fellowship and Bershad by an AT&T Ph.D. Scholarship. Anderson is now with the Computer Science Division, University of California at Berkeley; Bershad is now with the School of Computer Science, Carnegie Mellon University. Authors' address: Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195.

1. INTRODUCTION

The effectiveness of parallel computing depends to a great extent on the performance of the primitives that are used to express and control the parallelism within programs. Even a coarse-grained parallel program can exhibit poor performance if the cost of creating and managing parallelism is high. Even a fine-grained program can achieve good performance if the cost of creating and managing parallelism is low.

One way to construct a parallel program is to share memory between a collection of traditional UNIX-like processes, each consisting of a single address space and a single sequential execution stream within that address space.
Unfortunately, because such processes were designed for multiprogramming in a uniprocessor environment, they are simply too inefficient for general-purpose parallel programming; they handle only coarse-grained parallelism well.

The shortcomings of traditional processes for general-purpose parallel programming have led to the use of threads. Threads separate the notion of a sequential execution stream from the other aspects of traditional processes, such as address spaces and I/O descriptors. This separation of concerns yields a significant performance advantage relative to traditional processes.

1.1 The Problem

Threads can be supported either at user level or in the kernel. Neither approach has been fully satisfactory.

User-level threads are managed by runtime library routines linked into each application, so that thread management operations require no kernel intervention. The result can be excellent performance: in systems such as PCR [25] and FastThreads [2], the cost of user-level thread operations is within an order of magnitude of the cost of a procedure call. User-level threads are also flexible; they can be customized to the needs of the language or user without kernel modification.

User-level threads execute within the context of traditional processes; indeed, user-level thread systems are typically built without any modifications to the underlying operating system kernel. The thread package views each process as a "virtual processor," and treats it as a physical processor executing under its control; each virtual processor runs user-level code that pulls threads off the ready list and runs them. In reality, though, these virtual processors are being multiplexed across real, physical processors by the underlying kernel. "Real world" operating system activity, such as multiprogramming, I/O, and page faults, distorts the equivalence between virtual and physical processors; in the presence of these factors, user-level threads built on top of traditional processes can exhibit poor performance or even incorrect behavior.

Multiprocessor operating systems such as Mach [21], Topaz [22], and V [7] provide direct kernel support for multiple threads per address space. Programming with kernel threads avoids the system integration problems exhibited by user-level threads, because the kernel directly schedules each application's threads onto physical processors. Unfortunately, kernel threads, just like traditional UNIX processes, are too heavyweight for use in many parallel programs. The performance of kernel threads, although typically an order of magnitude better than that of traditional processes, has been typically an order of magnitude worse than the best-case performance of user-level threads (e.g., in the absence of multiprogramming and I/O). As a result, user-level threads have ultimately been implemented on top of the kernel threads of both Mach (CThreads [8]) and Topaz (WorkCrews [24]). User-level threads are built on top of kernel threads exactly as they are built on top of traditional processes; they suffer exactly the same problems.
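The "virtual processor" arrangement described above recurs throughout this paper, so it is worth making concrete. The following sketch shows the scheduling loop that each such process might run; it is a minimal illustration under assumed names (thread_t, ready_dequeue, ctx_switch), not code from FastThreads or PCR.

    /* Minimal sketch of a "virtual processor" loop in a user-level
     * thread package. All names here are hypothetical; real packages
     * such as FastThreads differ in detail. */

    typedef struct thread {
        void *stack;                /* per-thread user-level stack   */
        void *saved_sp;             /* saved register context        */
        struct thread *next;
    } thread_t;

    extern thread_t *ready_dequeue(void);            /* pull from ready list */
    extern void ctx_switch(void **from_sp, void *to_sp);

    /* Each traditional process (one per physical processor) runs this
     * loop, multiplexing many user-level threads with no kernel calls. */
    void virtual_processor_loop(void) {
        void *scheduler_sp;         /* context of this loop itself */
        for (;;) {
            thread_t *t = ready_dequeue();
            if (t == NULL)
                continue;           /* idle: nothing runnable */
            /* Run t until it yields or blocks; it switches back here. */
            ctx_switch(&scheduler_sp, t->saved_sp);
        }
    }

The kernel is unaware of this loop; when the process hosting it blocks on I/O or is preempted, every thread it was multiplexing loses its processor, which is exactly the integration problem examined in Section 2.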
The parallel programmer, then, has been faced with a difficult dilemma: employ user-level threads, which have good performance and correct behavior provided the application is uniprogrammed and does no I/O, or employ kernel threads, which have worse performance but are not as restricted.

1.2 The Goals of this Work

In this paper we address this dilemma. We describe a kernel interface and a user-level thread package that together combine the functionality of kernel threads with the performance and flexibility of user-level threads. Specifically,

—In the common case when thread operations do not need kernel intervention, our performance is essentially the same as that achieved by the best existing user-level thread management systems (which suffer from poor system integration).

—In the infrequent case when the kernel must be involved, such as on processor reallocation or I/O, our system can mimic the behavior of a kernel thread management system:

—No processor idles in the presence of ready threads.

—No high-priority thread waits for a processor while a low-priority thread runs.

—When a thread traps to the kernel to block (for example, because of a page fault), the processor on which the thread was running can be used to run another thread from the same or from a different address space.

—The user-level part of our system is structured to simplify application-specific customization. It is easy to change the policy for scheduling an application's threads, or even to provide a different model of parallelism such as workers [16], Actors [1], or Futures [10].

The difficulty in achieving these goals in a multiprogrammed multiprocessor is that the necessary control and scheduling information is distributed between the kernel and each application's address space. To be able to allocate processors among applications, the kernel needs access to user-level scheduling information (e.g., how much parallelism there is in each address space). To be able to manage the application's parallelism, the user-level support software needs to be aware of kernel events (e.g., processor reallocations and I/O request/completions) that are normally hidden from the application.

1.3 The Approach

Our approach provides each application with a virtual multiprocessor, an abstraction of a dedicated physical machine. Each application knows exactly how many (and which) processors have been allocated to it and has complete control over which of its threads are running on those processors. The operating system kernel has complete control over the allocation of processors among address spaces, including the ability to change the number of processors assigned to an application during its execution.

To achieve this, the kernel notifies the address space thread scheduler of every kernel event affecting the address space, allowing the application to have complete knowledge of its scheduling state. The thread system in each address space notifies the kernel of the subset of user-level thread operations that can affect processor allocation decisions, preserving good performance for the majority of operations that do not need to be reflected to the kernel.

The kernel mechanism that we use to realize these ideas is called scheduler activations.
A scheduler activation vectors control from the kernel to the address space thread scheduler on a kernel event; the thread scheduler can use the activation to modify user-level thread data structures, to execute user-level threads, and to make requests of the kernel.

We have implemented a prototype of our design on the DEC SRC Firefly multiprocessor workstation [22]. While the differences between scheduler activations and kernel threads are crucial, the similarities are great enough that the kernel portion of our implementation required only relatively straightforward modifications to the kernel threads of Topaz, the native operating system on the Firefly. Similarly, the user-level portion of our implementation involved relatively straightforward modifications to FastThreads, a user-level thread system originally designed to run on top of Topaz kernel threads.

Since our goal is to demonstrate that the exact functionality of kernel threads can be provided at the user level, the presentation in this paper assumes that user-level threads are the concurrency model used by the programmer or compiler. We emphasize, however, that other concurrency models, when implemented at user level on top of kernel threads or processes, suffer from the same problems as user-level threads; these problems are solved by implementing them on top of scheduler activations.

2. USER-LEVEL THREADS: PERFORMANCE ADVANTAGES AND FUNCTIONALITY LIMITATIONS

In this section we motivate our work by describing the advantages that user-level threads offer relative to kernel threads, and the difficulties that arise when user-level threads are built on top of the interface provided by kernel threads or processes. We argue that the performance of user-level threads is inherently better than that of kernel threads, rather than this being an artifact of existing implementations. User-level threads have the additional advantage of flexibility with respect to programming models and environments. Further, we argue that the lack of system integration exhibited by user-level threads is not inherent in user-level threads themselves, but is a consequence of inadequate kernel support.

2.1 The Case for User-Level Thread Management

It is natural to believe that the performance optimizations found in user-level thread systems could be applied within the kernel, yielding kernel threads that are as efficient as user-level threads without compromising functionality. Unfortunately, there are significant inherent costs to managing threads in the kernel:

—The cost of accessing thread management operations: With kernel threads, the program must cross an extra protection boundary on every thread operation, even when the processor is being switched between threads in the same address space. This involves not only an extra kernel trap; the kernel must also copy and check parameters in order to protect itself against buggy or malicious programs. By contrast, invoking user-level thread operations can be quite inexpensive, particularly when compiler techniques are used to expand code inline and perform sophisticated register allocation. Further, safety is not compromised: address space boundaries isolate misuse of a user-level thread system to the program in which it occurs.

—The cost of generality: With kernel thread management, a single underlying implementation is used by all applications. To be general-purpose, a kernel thread system must provide any feature needed by any reasonable application; this imposes overhead on those applications that do not use a particular feature.
In contrast, the facilities provided by a user-level thread system can be closely matched to the specific needs of the applications that use it, since different applications can be linked with different user-level thread libraries. As an example, most kernel thread systems implement preemptive priority scheduling, even though many parallel applications can use a simpler policy such as first-in-first-out [24].

These factors would not be important if thread management operations were inherently expensive. Kernel trap overhead and priority scheduling, for instance, are not major contributors to the high cost of UNIX-like processes. However, the cost of thread operations can be within an order of magnitude of the cost of a procedure call. This implies that any overhead added by a kernel implementation, however small, will be significant; a well-written user-level thread system will have significantly better performance than a well-written kernel-level thread system.

To illustrate this quantitatively, Table I shows the performance of example implementations of user-level threads, kernel threads, and UNIX-like processes, all running on similar hardware, a CVAX processor. FastThreads and Topaz kernel threads were measured on a CVAX Firefly; Ultrix (DEC's derivative of UNIX) was measured on a CVAX uniprocessor workstation. (Each of these implementations, while good, is not "optimal"; our measurements are thus illustrative, not definitive.)

The two benchmarks are Null Fork, the time to create, schedule, execute, and complete a process/thread that invokes the null procedure (in other words, the overhead of forking a thread), and Signal-Wait, the time for a process/thread to signal a waiting process/thread and then wait on a condition (the overhead of synchronizing two processes/threads together). Each benchmark was executed on a single processor, and the results were averaged across multiple repetitions. For comparison, a kernel trap on Topaz takes about 19 μsec., and a procedure call about 7 μsec.

Table I. Thread Operation Latencies (μsec.)

Operation        FastThreads    Topaz threads    Ultrix processes
Null Fork            34              948              11300
Signal-Wait          37              441               1840
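The structure of the two benchmarks is simple. The sketch below renders them in C against a hypothetical thread and condition-variable interface (thr_fork, thr_join, cond_signal, cond_wait); the actual FastThreads and Topaz interfaces differ in their details.

    /* Sketch of the two microbenchmarks, against a hypothetical
     * thread API; the real FastThreads/Topaz interfaces differ. */

    typedef struct thread thread_t;    /* opaque thread handle       */
    typedef struct cond cond_t;        /* opaque condition variable  */

    extern thread_t *thr_fork(void (*fn)(void *), void *arg);
    extern void thr_join(thread_t *t);
    extern void cond_signal(cond_t *c);
    extern void cond_wait(cond_t *c);

    static void null_proc(void *arg) { /* empty body */ }

    /* Null Fork: create, schedule, execute, and complete a thread
     * that invokes the null procedure. */
    void bench_null_fork(int reps) {
        for (int i = 0; i < reps; i++)
            thr_join(thr_fork(null_proc, (void *)0));
    }

    /* Signal-Wait: each of two threads repeatedly signals its
     * partner, then waits to be signaled in turn. */
    void bench_signal_wait(cond_t *mine, cond_t *partner, int reps) {
        for (int i = 0; i < reps; i++) {
            cond_signal(partner);   /* wake the waiting partner      */
            cond_wait(mine);        /* then wait to be woken in turn */
        }
    }

Under a user-level implementation, each iteration is a few library calls and a context switch within the address space; under kernel threads or processes, each iteration crosses the protection boundary at least twice, which is where the order-of-magnitude gaps in Table I come from.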
Table I shows an order of magnitude difference between the cost of user-level threads and kernel threads, and yet another order of magnitude difference between kernel threads and Ultrix processes. This difference exists despite the fact that the Topaz thread management code is highly tuned, with the critical path written in assembler.

Commonly, there is a tradeoff between performance and flexibility in choosing where to implement system services [26]. User-level threads avoid this tradeoff: they simultaneously improve both performance and flexibility. Flexibility is particularly important in thread systems, since there are many parallel programming models, each of which may require specialized support within the thread system. With kernel threads, supporting multiple parallel programming models may require modifying the kernel, which increases complexity, overhead, and the likelihood of errors in the kernel.

2.2 Sources of Poor Integration in User-Level Threads Built on the Traditional Kernel Interface

Unfortunately, it has proven difficult to implement user-level threads that have the same level of integration with system services as is available with kernel threads. This is not inherent in managing parallelism at the user level, but is a consequence of the lack of kernel support in existing systems. Kernel threads are the wrong abstraction for supporting user-level thread management. There are two related characteristics of kernel threads that cause difficulty:

—Kernel threads block, resume, and are preempted without notification to the user level.

—Kernel threads are scheduled obliviously with respect to the user-level thread state.

These characteristics can cause problems even on a uniprogrammed system. A user-level thread system will typically create as many kernel threads to serve as "virtual processors" as there are physical processors in the system; each will be used to run user-level threads. When a user-level thread makes a blocking I/O
The need to preempt processors from an address space also occurs due to variations the in parallelism dynamic variations within reallocation jobs; of processors in parallelism is important Zahorjan and among address to achieving McCann [271 show spaces high that in response to performance. While a kernel interface can be designed to allow the user level to influence which kernel threads are scheduled when the kernel has a choice [5], this choice is intimately tied to the user-level thread state; the communication of this information between the kernel and the user-level negates many of the performance threads in the first Finally, ensuring and place. the logical flexibility correctness advantages of of a user-level using thread user-level system built on kernel threads can be difficult, Many applications, particularly those that require coordination among multiple address spaces, are free from deadlock based on the assumption that all runnable threads eventually receive processor time. When kernel threads are used directly by applications, the kernel ACM Transactions on Computer Systems, Vol 10, No. 1, February 1992. 60 . satisfies runnable number T. E. Anderson et al this assumption by time-slicing the processors among all of the threads, But when user-level threads are multiplexed across a fixed of kernel threads, the assumption may no longer hold: because a kernel thread blocks when its user-level thread blocks, an application run out of kernel threads to serve as execution contexts, even when there runnable user-level threads and available processors. 3. EFFECTIVE KERNEL SUPPORT FOR THE USER-LEVEL can are MANAGEMENT OF PARALLELISM Section the ity) 2 described programmer and when behavior problems, system the problems that arise to express parallelism user-level threads are in the presence of multiprogramming we have designed a new kernel that together combine the own chine virtual except machine this multiprocessor, that during the the the kernel may execution kernel and interface functionality performance and flexibility of user-level The operating system kernel provides its when 1/0). and of kernel threads. each user-level abstraction change of the threads are used by (poor performance and poor flexibilbuilt on top of kernel threads (poor To address these user-level thread threads thread of a dedicated the program. number with system with physical of processors There the are several ma- in that aspects to abstraction: –The kernel allocates processors to address spaces; the kernel has complete control over how many processors to give each address space’s virtual multiprocessor. —Each address space’s user-level thread system has complete control over which threads to run on its allocated processors, as it would if the application were running on the bare physical machine. –The kernel changes thread notifies the number system the user-level of processors whenever thread system assigned a user-level whenever to it; the kernel thread blocks the or wakes up kernel (e.g., on 1/0 or on a page fault). The kernel’s role is to vector to the appropriate thread scheduler, rather than to interpret these on its own. —The user-level thread system notifies the kernel when kernel also notifies the in the the events events application needs more or fewer processors. The kernel uses this information to allocate processors among address spaces. However, the user level notifies the kernel only on those subset of user-level thread operations that might affect processor allocation decisions. 
As a result, performance is not compromised; the majority of thread operations do not suffer the overhead of communication with the kernel.

—The application programmer sees no difference, except for performance, from programming directly with kernel threads. Our user-level thread system manages its virtual multiprocessor transparently to the programmer, providing programmers with a normal Topaz thread interface [4]. (The user-level runtime system could easily be adapted, though, to provide a different parallel programming model.)

In the remainder of this section, we describe how kernel events are vectored to the user-level thread system, what information is provided by the application to allow the kernel to allocate processors among jobs, and how we handle spin-locks.

3.1 Explicit Vectoring of Kernel Events to the User-Level Thread Scheduler

The communication between the kernel processor allocator and the user-level thread system is structured in terms of scheduler activations. The term "scheduler activation" was selected because each vectored event causes the user-level thread system to reconsider its scheduling decision of which threads to run on which processors.

A scheduler activation serves three roles:

—It serves as a vessel, or execution context, for running user-level threads, in exactly the same way that a kernel thread does.

—It notifies the user-level thread system of a kernel event.

—It provides space in the kernel for saving the processor context of the activation's current user-level thread, when the thread is stopped by the kernel (e.g., because the thread blocks in the kernel on I/O or the kernel preempts its processor).

A scheduler activation's data structures are quite similar to those of a traditional kernel thread. Each scheduler activation has two execution stacks: one mapped into the kernel and one mapped into the application address space. Each user-level thread is allocated its own user-level stack when it starts running [2]; when a user-level thread calls into the kernel, it uses its activation's kernel stack. The user-level thread scheduler runs on the activation's user-level stack. In addition, the kernel maintains an activation control block (akin to a thread control block) to record the state of the scheduler activation's thread when the thread blocks in the kernel or is preempted; the user-level thread scheduler maintains a record of which user-level thread is running in each scheduler activation.

When a program is started, the kernel creates a scheduler activation, assigns it to a processor, and upcalls into the application address space at a fixed entry point. The user-level thread management system receives the upcall and uses that activation as the context in which to initialize itself and run the main application thread. As the first thread executes, it may create more user threads and request additional processors. In this case, the kernel will create an additional scheduler activation for each processor and use it to upcall into the user level to tell it that the new processor is available. The user level then selects and executes a thread in the context of that activation. Similarly, when the kernel needs to notify the user level of an event, the kernel creates a scheduler activation, assigns it to a processor, and upcalls into the application address space.
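To make the description concrete, the sketch below shows one plausible layout for these data structures in C. The field and type names are ours, not those of the Topaz implementation; it illustrates the two-stack structure and the activation control block, and is not the actual kernel code.

    /* Hypothetical layout of scheduler activation state. */

    typedef struct { unsigned long pc, sp, gpr[14]; } regs_t;
    typedef struct thread thread_t;

    /* Kernel side: one activation control block per activation,
     * akin to a thread control block. */
    struct activation_ctl_block {
        int     id;            /* activation # passed in upcalls        */
        void   *kernel_stack;  /* used when the thread traps or blocks  */
        regs_t  saved_state;   /* thread context saved on block/preempt */
        int     processor;     /* physical processor currently assigned */
    };

    /* User side: the thread scheduler runs on the activation's
     * user-level stack and records which thread it is running. */
    struct activation_user_record {
        int       activation_id;
        void     *user_stack;      /* mapped into the address space     */
        thread_t *running_thread;  /* NULL while idling or in an upcall */
    };

    /* Fixed entry point at which every upcall begins execution. */
    void upcall_entry(int event, int activation_id, regs_t *machine_state);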
Once an upcall is started, the activation is similar to a traditional kernel thread: it can be used to process the event, run user-level threads, and trap into and block within the kernel.

The crucial distinction between scheduler activations and kernel threads is that once an activation's user-level thread is stopped by the kernel, the thread is never directly resumed by the kernel. Instead, a new scheduler activation is created to notify the user-level thread system that the thread has been stopped. The user-level thread system then removes the state of the thread from the old activation, tells the kernel that the old activation can be reused, and finally decides which thread to run on the processor. By contrast, in a traditional system, when the kernel stops a kernel thread, even one running a user-level thread in its context, the kernel never notifies the user level of the event. Later, the kernel directly resumes the kernel thread (and, by implication, its user-level thread), again without notification. By using scheduler activations, the kernel is able to maintain the invariant that there are always exactly as many running scheduler activations (vessels for running user-level threads) as there are processors assigned to the address space.

Table II lists the events that the kernel vectors to the user level using scheduler activations; the parameters to each upcall are in parentheses, and the action taken by the user-level thread system follows each event.

Table II. Scheduler Activation Upcall Points

Add this processor (processor #)
    Execute a runnable user-level thread.
Processor has been preempted (preempted activation # and its machine state)
    Return to the ready list the user-level thread that was executing in the context of the preempted scheduler activation.
Scheduler activation has blocked (blocked activation #)
    The blocked scheduler activation is no longer using its processor.
Scheduler activation has unblocked (unblocked activation # and its machine state)
    Return to the ready list the user-level thread that was executing in the context of the blocked scheduler activation.

Note that events are vectored at exactly the points where the kernel would otherwise be forced to make a scheduling decision. In practice, these events occur in combinations; when this occurs, a single upcall is made that passes all of the events that need to be handled.
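The four upcall points of Table II suggest the shape of the user-level handler. The sketch below is our illustration of such a dispatcher, with hypothetical names throughout; it shows the single-event case even though, as just noted, a real upcall may carry several events at once.

    /* Illustrative dispatcher for the upcall points of Table II. */

    typedef struct regs regs_t;
    typedef struct thread thread_t;

    extern void      ready_enqueue(thread_t *t);
    extern thread_t *recover_thread(int act_id, regs_t *state);
    extern void      activation_reusable(int act_id);
    extern void      mark_blocked(int act_id);
    extern void      run_next_ready_thread(void);   /* does not return */

    enum upcall_event {
        ADD_PROCESSOR,          /* new processor available              */
        PROCESSOR_PREEMPTED,    /* activation stopped; state passed up  */
        ACTIVATION_BLOCKED,     /* its thread went to sleep in kernel   */
        ACTIVATION_UNBLOCKED    /* that thread's I/O or fault completed */
    };

    void upcall_entry(enum upcall_event ev, int act_id, regs_t *state) {
        switch (ev) {
        case PROCESSOR_PREEMPTED:
        case ACTIVATION_UNBLOCKED:
            /* A user-level thread was stopped in act_id's context:
             * capture its registers, put it back on the ready list,
             * and tell the kernel the old activation can be reused. */
            ready_enqueue(recover_thread(act_id, state));
            activation_reusable(act_id);
            break;
        case ACTIVATION_BLOCKED:
            /* The thread sleeps inside the kernel; just record that.
             * Its state comes back later with an unblocked upcall. */
            mark_blocked(act_id);
            break;
        case ADD_PROCESSOR:
            break;              /* nothing to clean up */
        }
        /* Every path ends the same way: pick a ready thread and run
         * it in the context of this fresh activation. */
        run_next_ready_thread();
    }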
As one example of the use of scheduler activations, Figure 1 illustrates what happens on an I/O request/completion. Note that this is the uncommon case; in normal operation, threads can be created, run, and completed, all without kernel intervention. Each pane in Figure 1 reflects a different time step: straight arrows represent scheduler activations, s-shaped arrows represent user-level threads, and the cluster of user-level threads to the right of each pane represents the ready list.

[Figure 1. Example: I/O request/completion. Four panes show the user program, the user-level runtime system, and the operating system kernel at times T1 through T4; at T2, "A's thread has blocked," and by T4, "A's thread and B's thread can continue."]

At time T1, the kernel allocates the application two processors. On each processor, the kernel upcalls to user-level code that removes a thread from the ready list and starts running it. At time T2, one of the user-level threads (thread 1) blocks in the kernel. To notify the user level of this event, the kernel takes the processor that had been running thread 1 and performs an upcall in the context of a fresh scheduler activation. The user-level thread system can then use the processor to take another thread off the ready list and start running it.

At time T3, the I/O completes. Again, the kernel must notify the user-level thread system of the event, but this notification requires a processor. The kernel preempts one of the processors running in the address space and uses it to do the upcall. (If there are no processors assigned to the address space when the I/O completes, the upcall must wait until the kernel allocates one.) This upcall notifies the user level of two things: the I/O completion and the preemption. The upcall invokes code in the user-level thread system that (1) puts the thread that had been blocked on the ready list and (2) puts the thread that was preempted on the ready list. At this point, scheduler activations A and B can be discarded. Finally, at time T4, the upcall takes a thread off the ready list and starts running it.

When a user-level thread blocks in the kernel or is preempted, most of the state needed to resume it is already at the user level, namely, the thread's stack and control block. The thread's register state, however, is saved by low-level kernel routines, such as the interrupt and page fault handlers; the kernel passes this state to the user level as part of the upcall notifying the address space of the preemption and/or I/O completion.

We use exactly the same mechanism to reallocate a processor from one address space to another due to multiprogramming. For example, suppose the kernel decides to take a processor away from one address space and give it to another. The kernel does this by sending the processor an interrupt, stopping the old activation, and then using the processor to do an upcall into the new address space with a fresh activation. The kernel need not obtain permission in advance from the old address space to steal its processor; to do so would violate the semantics of address space priorities (e.g., the new address space could have higher priority than the old address space). However, the old address space must still be notified that the preemption occurred. The kernel does this by making a second preemption, on a different processor still running in the old address space, and using that processor to upcall into the old address space with a fresh scheduler activation, notifying it that two user-level threads have been stopped. The user-level thread system then has full control over which threads should run on its remaining processors. (When the processor being preempted is the last one assigned to an address space, we could simply skip notifying the address space; instead, we delay the notification until the kernel next allocates it a processor.) Notification allows the user level to know explicitly which threads have been stopped, information it can use, for example, in managing cache locality.
The above description of scheduler activations is over-simplified in several minor respects. First, if threads have priorities, additional preemptions may have to take place beyond the ones described above. Continuing the example in Figure 1, suppose a thread of lower priority than both thread 1 and thread 2 is running on a third processor. In that case, the user-level thread system can ask the kernel to preempt that thread's processor; the subsequent upcall allows the user level to put the lower priority thread back on the ready list and run a higher priority thread in its place. Second, because the kernel has no knowledge of user-level data structures, its behavior is exactly the same whether a stopped activation was running a user-level thread or was executing in the user-level thread system itself (for example, in its idle loop): the kernel saves the activation's context and notifies the user level. On the upcall, the thread system checks where the activation was stopped; if no user-level thread was running in it, there is no thread state to recover, and nothing beyond reusing the activation is necessary. Third, a page fault can occur during upcall processing, in the user-level thread system itself. The kernel handles this case in exactly the same way, notifying the user level with a new scheduler activation when the fault occurs and again when it completes; upcall processing must therefore be reentrant, so that the thread system can recover the context of the faulting activation and continue it.
if makes processors, application be and work it than reassigned must processors. calls example, space These idle not Similarly, processors be no other kernel processors, unique idle an has system has where needs more the whenever threads priorities). For kernel, in kernel Provided allocator constraints. must situation transitions. registered it the runnable the of additional meaningful more threads. processors kernel then notifies has processor violate notified space it runnable the cannot globally not it address than handles An the parallelism, where and then the them honest the that register Processor as far built allocation for for a state parallelism the Affecting systems; system operation, overhead run processors, tion. Events thread is that thread thread processors space reaches occurs thread’s tO kernel policies that both respect priorities exist. These idle if runnable threads thread directly system, threads future. user-level or latter need the space. transition the again the still so, last subsection is independent of the policy processors among address spaces. Reason- kernel processor used the not most kernel’s when In by user-level every are address may If however, must be based on the available parallelism In this subsection, we show that this information can observation about threads blocks is when the kernel completes. either It passing be efficiently communicated for guarantee that processors do not The it kernel. in the 1/0 65 upcall. able allocation policies, in each address space. kernel the until the The mechanism described in the used by the kernel for allocating met blocked when temporarily, would notifies that kernel thread where kernel thread in . the space address informa- notifications to matters. to this approach parallelism multiprocessors: to the a dishonest ACM Transactions is that operating or applications system. misbehaving on Computer Systems, Vol may This not problem program 10, No. 1, February be is can 1992. 66 T. E. Anderson ● Table III. Add et al. Communication more processors Aliocate more running them This processor sor an unfair as well. feedback for In can that elsewhere, has penalize similar goal; of idle and with this policy idle when the in could easily processors the be future, will system left to do in the avoid something the similar. performance, service, often service is improved approximated time. multiprocessors be adapted This needed is created. accumulate multiprogrammed the address more. are should systems remaining use if overall near work favor they that hand, interactive as they that when the information can processors in multi-level honest those is likely work uniproces- systems, allocator other operating least at. We by expect to achieve a the to encourage honest reporting a user-level thread could processors. 3.3 Critical One issue There SectIons we executing have not in a critical are two yet possible continue to test [28], deadlock ready list lock; preempted are not per-processor improve avoided [2]; and Prevention onto through and user use of upcall lists approaches a scheduling occur of thread also must to dealing and has by the could pre-empted to place even when locking a number uses control critical the protocol to atomically. problem preemptions of serious the the unlocked blocks be done with 1 other be holding attempted inopportune prevention, because FastThreads lists free Prevention can example, free two With if the Problems For held thread be or preempted. (e.g., spin-lock list). to these are level. 
performance occur a lock. it is blocked preempted per-activation) the the the ready accesses recovery when poor would by preemption. that application-level (e.g., the (really, latency inopportune an protected is instant effects: if so, deadlock thread sections at the ill threads and addressed section thread) kernel the create of jobs to be used it the especially the priority policy to reallocation time, jobs the On processors, likely start needs to provide processors that needed. uniprocessor response favoring same than most up imply are of processor reducing give priorities they production Average by the applications to space thread and spaces and on a multiprogrammed user-level processors threads Many or processor spaces overhead of resources address The when fewer if another kernel-level fewer needed) space () decisions. since address activattorw. to encourage address be returned address allocation use encourages scheduler proportion Space to the Kernel # of processors to this be used processor spaces (additional thzs processor either the Address processors is idle Preempt consume from between of are the drawbacks, 1 The need for critical sections would be avoided if we were to use wait-free synchronization [11]. Many commercial architectures, however, do not provide the required hardware support (we assume only an atomic test-and-set instruction); in addition, the overhead of wait-free synchronization can be prohibitive for protecting anything but very small data structures. ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992 Scheduler Activations: Effective Kernel Support particularly kernel in a multiprogrammed to yield user-level, control violating is inconsistent will vention the with describe in address Finally, “pinning” to in (at in requires priorities. presence memory section; 67 the temporarily) of critical the physical a critical least space implementation 4.3. while Prevention allocation of efficient Section be touched environment. processor semantics the requires might over . to the Prevention sections of all page virtual identifying that we faults, pre- pages these that pages can be cumbersome. Instead, we user-level adopt thread thread system course, this a solution system checks check based that if must the a user-level context temporarily via exits the section, upcall, again user-level continue activation on the if it locks). switch. When back At this point, list. We use the in the middle preempted the the section. If so, the control switch. ready was a critical any relinquishes context back in informs or unblocked, the to it continued the original is safe same (Of thread to place mechanism of to processing a event. This technique ensure the thread an kernel a user-level it an upcall preempted executing acquiring is continued the was When been before thread via has thread be made critical on recovery. a thread that presence supports lock from spin-locks, is to relinquish the is a spin-lock processor faults. lock the not affected, takes spinning processor for system continue a page the time fault; a while in technique thread to we even this user-level it holder, released, Further, allowing holder after the eventually since occurs, correctness when continuing is always or page a preemption Although spin-waiting By it preemptions user-level when holder. deadlock. is acquired, of processor notified wasted free a lock arbitrary always this is once is spin- may be a solution to [4]. 4. 
4. IMPLEMENTATION

We have implemented the design described in Section 3 by modifying Topaz, the native operating system for the DEC SRC Firefly multiprocessor workstation, and FastThreads, a user-level thread package.

We modified the Topaz kernel thread management routines to implement scheduler activations. Where Topaz formerly blocked, resumed, or preempted a thread, it now performs upcalls to allow the user level to take these actions (see Table II). In addition, we modified Topaz to do explicit allocation of processors to address spaces; formerly, Topaz scheduled threads obliviously to the address spaces to which they belonged. Topaz also maintains object code compatibility; existing applications (and the UNIX emulation) still run as before.

FastThreads was modified to process upcalls, to resume threads interrupted in critical sections, and to provide Topaz with the information needed for its processor allocation decisions (see Table III).

In all, we added a few hundred lines of code to FastThreads and about 1200 lines to Topaz. (For comparison, the original Topaz implementation of kernel threads is over 4000 lines of code.) The majority of the new Topaz code was concerned with implementing the processor allocation policy (discussed below), and not with scheduler activations per se.
address needs of re- is work in their are address that policy while if there multiple an threads has not 3.2 scheduler handed application’s tures use upcalls; for 4.2 Section that via only has that want applications more use debug- priority remainder. an integer (possibly use could idles processors the for spaces as dynamic highest of the that processor space we and processors processors is not possible address the all among priority) it to support binary processors same to the the reallocations; makes than need processors implementation threads, some implementation; no processor among evenly of processor (at course, follow. that do not divided of available spaces to processors Of enhancements “space-shares” evenly spaces are number prototype is similar chose policy and number that we The are of address The [27]. priorities reduces our allocating processors. performance subsections policy McCann processors for as some for onto Policy if some address those selected the of policies threads Processors spaces; if the be as well Allocation and choice scheduling to these, considerations, 4.1 on the for had describe ging to and for hysteresis spins for to a reallocation. on Computer Systems, Vol. 10, No. 1, February 1992. avoid short unnecessary period before Scheduler Activations: Effective Kernel Support 4.3 Performance While the as just equivalent considerations The Section In a user-level can be resumed), the thread flag when thread check to see if it being temporarily upcall when acquisition even though since we thread and We adopt collector section in the post-process would the is for the when it this to the imposes set and a then thread to the original overhead or page fault is particularly in and whether thread so that a preemption Latency kernel leaves, processor sections place code the the on occurs, important building our user-level not the A second it notifies its copy we then copy. This compiler support. At critical section, we execution kernel this to see if it was a new activation in one of these corresponding relinquishes uses starts system; at the critical each critical the Normal the thread thread place control can user in level the the in to the back can upcall ACM be be an old and cached must call, the of an in call explicit but flag). to the management activation is created for is it the free, had eventual case user been reuse. running level Vol. The that userit to is removed the of blocking be- Instead, by returning processing Systems, however, initialized. activation after on Computer not and for in the notifies latency relates allocated of preemption, that clearing scheduler preemption; on lock a procedure scheduler thread ‘Ikansactions impact case, we bracket this activation to user-level case of the no enhancement scheduler recycle as the In setting a new structures activations system context: the is occasionally section. with a new data there implementation, Logically, Creating processing The performance as soon the resumer. counter preemptions, our path, scheduler kernel of the user-level the make version occurs, program a critical line requires thread from about (In activations. discarded level the section. significant upcall. cause notify low-level package; to and to the common Trellis/Owl labels, thread code the the every language original back of assembler user-level preemption thread’s within straight of scheduler the a critical case. from the if so, continues concerned be made not copy special in in at the end of the critical section. 
Because normal execution code, and this code is exactly the same as it would be if we upcall uses the original common exact assembly processor to and of the original for no overhead a uniprocessor to do given When preempted copy each the activation the an with code but code. sections, were source to yield scheduler critical make on compiler-generated copy, original checks We imposes used be straightforward of the that was by delimiting, C the also end after func- additional to check is needed the Unfortunately, critical in be able flag check or not solution [17]. We do this the The blocks do this the infrequent. continuable technique section. not to it must relinquish place. are when clear whether events different a a related garbage the a safe these way continued. release these use One will some critical sections, described in continuation of critical sections system section, continued user-level are system. case; the a lock. is being (or thread a critical it reaches lock to is preempted holding enters there performance. relates temporary user-level to provide threads, for these to provide the was it of order is sufficient of kernel are important significant 3.3. when described to that that most 69 Enhancements design tionality . upcall in the that kernel, resumption 10, No. 1, February is 1992. 70 T. E. Anderson et al. . possible. A similar tions: kernel future thread in occasional scheduler bulk, bulk traditional deposit same kernel kernel 4.4 have There the two its system itself single-steps use speed the to same the number of as a preemption system, when returned Ignoring the one 1/0 crossing is completes. The system. with the with debugging Firefly their own application Topaz debugger. needs: debugging code running on top the processors, but debugged. a debugger of instructions user-level this is Instead, logical the these thread system kernel do not state the each of thread scheduler debugger cause kernel of the assigns the as little The when when events have debugged. inappropriate processor; activation, should being stops upcalls into or the system. the the debugging–the informs debugged thread user-level facilities code thread of the running system thread in is working system the context to combine to correctly, stop and the examine of a user-level debugger the thread state of [181. PERFORMANCE The goal with the the sequence a scheduler application 5. to on the is being Assuming can to and a time. processor thread each and is crucial described being user-level or our activations system physical activation in at makes 1/0 is needed environments, thread we have of collected be one a kernel In occur scheduler as possible each can implementa- destroyed system. Transparency support thread when system on another crossings separate thread effect kernel cached Considerations user-level of the and boundary our crossings system. 1/0, integrated are many be returned of discards, thread an Debugging We of being boundary to start in can activations instead application-kernel needed created, [131. discarded kernel is used once creations Further, the optimization threads, of our user research and level each flexibility within issues performance, thread we have cost fically, flexibility fork, of upcalls)? what is space. the The previous in kernel overall effect In cost system? and the on at functionality is the our threads parallelism sections. what yield) the of kernel of managing First, and between Third, in questions. 
block of communication functionality address addressed three (e, g., the advantages application been consider operations is the is performance user the and terms of of user-level Second, what level (speci- performance of applications? 5.1 Thread The cost of user-level as those —that for in thread FastThreads running on integration. original tained ACM of the is, system Performance FastThreads, Table Transactions I. on top Table Our Computer operations package of Topaz IV adds Topaz system Systems, in kernel the kernel 10, system on the threads, performance threads, preserves Vol. our running the No order 1, February is essentially Firefly with the of our and associated system Ultrix of magnitude 1992 the to our prior to the processes same work poor data con- advantage Scheduler Activations: Effective Kernel Support Table Signal- Wait that user-level tion in Null and the kernel on threads Fork decrementing must inform number be notified. (This available). There the (in plus which an when a 48 lock psec. is (The than 5.2 Upcall Thread nel Null does Fork (Section is important, point, level that to those threads held When to be Our we by the this time an onds, a factor We see from time in its of execution is by 1/0 the request of five nothing difference, in application the inherent which upcall we ACM may is threads similar between it a to the activations latency have user or preempting scheduler the the at user-level activations then determine be done of blocking reschedules we overhead two user-level to the to when determines to wait for how a critical the One to signal test kernel. in and Table This operations. measure IV, approximates fault. signal-wait time in scheduler through except of making The of upcall wait machinery Topaz Transactions that. performance thread activation than attribute than upcall kernel threads in our Topaz Signal-Wait be a page expected of slower scheduler or can for ker- infrequent thread. forced worse cost Further, the case when it helps that thread, when the for First, scheduler a kernel is analogous added using is considerably is the pleting this perfor- of marking optimization frequent needed If the implementation, with the intervention, preempted synchronization overhead codes). a Signal-Wait sections operations on a uniprocessor. our implementation the to this our performance—the reasons. threads. and commensurate kernel; several kernel running began performance the critical Upcall of thread preempting even is preempted resource more characterizes kernel the or be practical other for kernel in of blocking long 5.1) ratio require thread a thread has and are resumed condition way this ,usec. is due threads, it as is being the kernel Removing of 49 necessary. the to outperform user-level could not though, “break-even” to begin which restore that processors a zero-overhead 4.3). time benchmark many thread than without Section Fork to running parallelism Performance performance case—is done better whether Signal-Wait.) involvement cost (see a Null as to increment- a program in Signal-Wait, be worse for a preempted of magnitude held yielded path must is due determining sufficient wants whether be significantly FastThreads always processes 11300 1840 is a 3 psec. degrada- and be eliminated with Ultrix which threads degradation work order There threads. running it of checking extra still would that ToPaz threads 948 441 FastThreads, could 71 Latencies(Wsec) of busy or is a 5 psec. 
5.2 Upcall Performance

Thread performance in our system (Section 5.1) characterizes the frequent case; upcall performance characterizes the infrequent case in which kernel involvement is needed. Upcall performance is important for several reasons. First, it helps determine the "break-even" point, the ratio of thread operations that can be done at user level to those requiring kernel intervention, at which user-level threads begin to outperform kernel threads. Further, if upcalls are slow, the added cost when a thread blocks or is preempted in the kernel may make performance significantly worse than with kernel threads.

One measure of upcall performance is the time for two user-level threads to signal and wait through the kernel; this is analogous to the Signal-Wait test in Table IV, except that the synchronization is forced to go through the kernel and its upcall machinery. This time approximates the overhead added by scheduler activations when a user-level thread blocks in the kernel, for example on a page fault. When we began the implementation, we expected our upcall performance to be commensurate with the overhead of Topaz kernel thread operations; our implementation is considerably slower than that. The upcall Signal-Wait time is 2.4 milliseconds, a factor of five worse than Topaz threads.

We attribute this difference to two implementation issues, rather than to anything inherent in scheduler activations. First, because we built our implementation as a quick modification of the existing Topaz kernel thread system, each scheduler activation must maintain more state than would be needed in a kernel designed for scheduler activations from scratch. Second, the Topaz kernel thread path is carefully tuned and partly written in assembler, while our upcall path is written entirely in (unoptimized) Modula-2+. For comparison, Schroeder and Burrows [19] reduced the cost of SRC RPC processing by over a factor of four by recoding it; with comparable tuning, we would expect the upcall performance of a production scheduler activation implementation to be commensurate with that of Topaz kernel threads. As a result, the application measurements in the next section are somewhat worse than what might be achieved with a tuned implementation.

5.3 Application Performance

To illustrate the effect of our system on overall application performance, we measured the same parallel application using Topaz kernel threads, original FastThreads built on top of Topaz threads, and modified FastThreads running on scheduler activations. The application is an O(N log N) solution to the N-body problem [3]. The algorithm constructs a tree representing the center of mass of each portion of space and then traverses the tree to compute the force on each body; the force exerted by a cluster of distant masses can be approximated by the force that they would exert if they were all at the center of mass of the cluster.
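As an aside for concreteness, the approximation at the heart of the algorithm can be sketched as follows. This is a generic rendering of the force calculation in [3], not the program we measured; all names, including the opening parameter THETA, are illustrative.

    /* Minimal sketch of the O(N log N) force calculation [3]: if a cell
     * of the space tree is far enough from the body, the whole cluster
     * is treated as a point mass at its center of mass; otherwise the
     * cell is "opened" and its children are visited. */

    #include <math.h>

    typedef struct { double x, y, fx, fy, mass; } body_t;

    typedef struct cell {
        double cm_x, cm_y;      /* center of mass of bodies in this cell */
        double mass;            /* total mass of the cluster */
        double size;            /* side length of the cell's region */
        int nchildren;          /* 0 for a leaf */
        struct cell *child[4];  /* quadtree children */
    } cell_t;

    #define THETA 0.5           /* opening criterion: smaller = more exact */
    #define G     1.0           /* units chosen so that G = 1 */

    static void add_force(body_t *b, double mx, double my, double m)
    {
        double dx = mx - b->x, dy = my - b->y;
        double d2 = dx * dx + dy * dy + 1e-9;  /* softening avoids 0/0 */
        double f  = G * m * b->mass / d2;
        double d  = sqrt(d2);
        b->fx += f * dx / d;
        b->fy += f * dy / d;
    }

    /* Compute the force of the (sub)tree rooted at c on body b. */
    void compute_force(body_t *b, cell_t *c)
    {
        double dx = c->cm_x - b->x, dy = c->cm_y - b->y;
        double dist = sqrt(dx * dx + dy * dy);

        if (c->nchildren == 0 || c->size / dist < THETA) {
            /* Leaf, or cluster far enough away: approximate the whole
             * cell by its center of mass (skipping the body's own leaf
             * is elided).  This is what yields the O(N log N) bound. */
            add_force(b, c->cm_x, c->cm_y, c->mass);
        } else {
            for (int i = 0; i < c->nchildren; i++)
                compute_force(b, c->child[i]);
        }
    }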
Depending on the amount of memory available to it, the application can be either compute bound or I/O bound. Because the floating point performance of the Firefly's CVAX processors is slow relative to the speed of its memory system, the cache miss ratio has only a small effect on execution time, and the application is compute bound whenever its data fits in memory. (Our measurements are therefore intended to be qualitatively illustrative rather than definitive.) To control the amount of memory available to the application, we modified it to explicitly manage a fixed-size buffer of memory as a cache for its data; when the data does not fit in the buffer, the application fetches it with explicit disk I/O, each operation taking about 50 msec. All tests were run on a six-processor CVAX Firefly; when given 100% of its memory, the application makes minimal use of kernel services.

Figure 2 graphs the application's speedup for each of the three systems as a function of the number of processors, with 100% of its memory available and no other applications running. (Speedup is relative to the sequential algorithm.)

Fig. 2. Speedup of N-Body application versus number of processors, 100% of available memory (Topaz threads, original FastThreads, new FastThreads).

With one processor, performance with Topaz kernel threads is much worse than that of the sequential program, because the overhead of creating and synchronizing parallelism is greater for kernel threads than for a user-level thread system. As processors are added, speedup initially improves but then flattens out at four or five processors for all three systems. It is contention for the application's critical sections, not thread management overhead, that is the bottleneck during our tests; we might be able to improve performance for all three systems by restructuring the application to reduce this lock contention. Even so, both user-level thread systems attain good speedup, because a thread that tries to acquire a busy lock spins for a short time and then, if the lock has still not been released, blocks at user level [12]. With Topaz kernel threads, a thread that tries to acquire a busy lock must trap to the kernel in order to block; the overhead of this trapping prevents good performance in the presence of contention.
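The user-level locking strategy just described can be sketched as follows; this is an illustrative rendering of spin-then-block acquisition in the spirit of [12], not the actual FastThreads lock code, and the spin bound and all names are hypothetical.

    /* Illustrative spin-then-block lock: spin briefly in the hope that
     * the holder releases soon; otherwise block at user level, without
     * ever trapping to the kernel. */

    #include <stdatomic.h>

    #define SPIN_LIMIT 100   /* tuned to roughly a context-switch time */

    typedef struct ulock {
        atomic_flag held;
        /* queue of user-level threads blocked on this lock (elided) */
    } ulock_t;

    extern void ul_thread_block(ulock_t *l);  /* enqueue + switch threads;
                                                 rechecks the lock to avoid
                                                 a lost wakeup (elided) */
    extern void ul_thread_wakeup(ulock_t *l); /* make one waiter runnable */

    void ulock_acquire(ulock_t *l)
    {
        for (;;) {
            /* Phase 1: spin for a bounded number of attempts. */
            for (int i = 0; i < SPIN_LIMIT; i++)
                if (!atomic_flag_test_and_set(&l->held))
                    return;            /* got the lock while spinning */

            /* Phase 2: block at user level; a user-level context switch
             * is an order of magnitude cheaper than a kernel trap. */
            ul_thread_block(l);
        }
    }

    void ulock_release(ulock_t *l)
    {
        atomic_flag_clear(&l->held);
        ul_thread_wakeup(l);           /* resume one waiter, if any */
    }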
Even though the performance of our system and that of original FastThreads are similar when the application runs on fewer processors than the machine has, the two diverge slightly when the application tries to use all six processors. Because our kernel explicitly allocates processors to address spaces, operating system daemon threads cause preemptions only when there are no idle processors available; this is not true with the native Topaz scheduler, which controls the kernel threads used as virtual processors by original FastThreads. Topaz daemon threads wake up periodically, execute, and then go back to sleep; when the application occupies every processor, each such wakeup preempts one of its virtual processors. The number of preemptions is thus greater for original FastThreads than for our system. (The preemptions have only a small impact on performance here, because of their short duration.)

Next, we show that when the application requires kernel involvement because it does I/O, our system performs better than either original FastThreads or Topaz threads. Figure 3 graphs the application's execution time on six processors as a function of the amount of available memory.

Fig. 3. Execution time of N-Body application versus amount of available memory, 6 processors (Topaz threads, original FastThreads, new FastThreads).

For all three systems, performance degrades slowly at first, and then sharply once the application's working set does not fit in memory. However, performance with original FastThreads degrades more quickly than with the other two systems. This is because when a user-level thread blocks in the kernel, the kernel thread serving as its virtual processor also blocks, and thus the application loses that physical processor for the duration of the I/O. The curves for modified FastThreads and for Topaz threads parallel each other because both systems are able to exploit the parallelism of the application to overlap some of the I/O latency with useful computation. As in Figure 2, though, application performance is better with modified FastThreads than with Topaz because most thread operations can be implemented without kernel involvement.

Finally, while Figure 3 shows the effect on performance of application-induced kernel events, multiprogramming causes system-induced kernel events; these result in our system having better performance than either original FastThreads or Topaz threads. To test this, we ran two copies of the N-Body application at the same time on a six-processor Firefly and then averaged their execution times. Table V lists the resulting speedups for each system; note that a speedup of three would be the maximum possible.

Table V. Speedup of N-Body Application, Multiprogramming Level = 2, 6 Processors, 100% of Available Memory

    Topaz threads    Original FastThreads    New FastThreads
        1.29                 1.26                 2.45

Table V shows that application performance with modified FastThreads is good even in a multiprogrammed environment; the speedup is within 5% of that obtained when the application ran uniprogrammed on three processors. This small degradation is about what we would expect from bus contention and from the need to donate a processor periodically to run a kernel daemon thread. In contrast, multiprogrammed performance is much worse with either original FastThreads or Topaz threads, although for different reasons. When applications using original FastThreads are multiprogrammed, the operating system time-slices the kernel threads serving as virtual processors; this can result in physical processors idling, waiting for a lock to be released, while the lock holder waits to be rescheduled. Performance is worse with Topaz threads than with our system because common thread operations are more expensive. In addition, because Topaz does not do explicit processor allocation, it may end up scheduling more kernel threads from one address space than from the other; Figure 2 shows, however, that performance flattens out for Topaz threads when more than three processors are assigned to the application.

While the Firefly is an excellent vehicle for constructing proof-of-concept prototypes, its limited number of processors makes it less than ideal for experimenting with significantly parallel applications or with multiple, multiprogrammed parallel applications. For this reason, we are implementing scheduler activations in C Threads and Mach; we are also porting Amber [6], a programming system for a network of multiprocessors, onto our Firefly implementation.

6. RELATED IDEAS

The two systems with goals most closely related to our own (achieving properly integrated user-level threads through improved kernel support) are Psyche [20] and Symunix [9]. Both have support for NUMA multiprocessors as a primary goal: Symunix in a high-performance parallel UNIX implementation, and Psyche in the context of a new operating system. Psyche and Symunix provide "virtual processors" as described in Sections 1 and 2, and augment these virtual processors by defining software interrupts that notify the user level of some kernel events. (Software interrupts are like upcalls, except that all interrupts on the same processor use the same stack, and thus are not reentrant.)
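The reentrancy difference can be illustrated with a sketch. This is purely hypothetical code, belonging to neither system; it is meant only to contrast delivery on a single shared interrupt stack with delivery on a fresh activation.

    /* Purely illustrative contrast; neither system's actual code. */

    #define STACK_SIZE 8192

    extern void run_on_stack(char *stack, void (*fn)(int), int arg);
    extern void handle_event(int event);
    extern void queue_event(int event);

    /* Software-interrupt style: one interrupt stack per virtual
     * processor, so the handler is not reentrant; a second event that
     * arrives mid-handler must be masked or queued, not delivered. */
    static char interrupt_stack[STACK_SIZE];
    static volatile int in_handler;

    void software_interrupt(int event)
    {
        if (in_handler) {          /* cannot nest on the shared stack */
            queue_event(event);
            return;
        }
        in_handler = 1;
        run_on_stack(interrupt_stack, handle_event, event);
        in_handler = 0;
    }

    /* Scheduler-activation style: each event arrives on a fresh
     * activation with its own stack, so the thread system may itself
     * be preempted or page fault while handling an event; the kernel
     * simply makes another upcall on yet another new activation. */
    typedef struct activation activation_t;
    extern activation_t *fresh_activation(void);
    extern void upcall_on(activation_t *a, void (*fn)(int), int arg);

    void deliver_event(int event)
    {
        upcall_on(fresh_activation(), handle_event, event);
    }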
Psyche has also explored the notion of multimodel parallel programming, in which user-defined threads of various kinds, in different address spaces, can synchronize while sharing code and data.

While Psyche, Symunix, and our own work share similar goals, the approaches taken to achieve these goals differ in several important ways. Unlike our work, neither Psyche nor Symunix provides the exact functionality of kernel threads with respect to I/O, page faults, and multiprogramming; further, the performance of their user-level thread operations can be compromised. We discussed some of the reasons for this in Section 2: these systems notify the user level of only some of the kernel events that affect the address space. For example, neither Psyche nor Symunix notifies the user level when a preempted virtual processor is rescheduled. As a result, the user-level thread system does not know how many processors it has or what user threads are running on those processors.

Both Psyche and Symunix provide memory that is shared between the kernel and each application and writable at user level, but neither system provides an efficient mechanism for the user-level thread system to notify the kernel when its processor allocation needs to be reconsidered. The number of processors needed by each application could be written into this shared memory, but that would give no efficient way for an application that needs more processors to learn that some other application has idle processors.

Applications in both Psyche and Symunix share synchronization state with the kernel in order to avoid preemption at inopportune moments (e.g., while spin-locks are being held). In Symunix, the application sets and later clears a variable shared with the kernel to indicate that it is in a critical section; in Psyche, the application checks for an imminent preemption before starting a critical section. The setting, clearing, and checking of these bits adds to lock latency, which constitutes a large portion of the overhead of high-performance user-level thread management [2]. By contrast, our system has no effect on lock latency unless a preemption actually occurs.
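The cost difference can be made concrete with a sketch of the shared-flag convention just described. The names here are illustrative, not the actual Symunix interface; the point is that the flag is touched on every acquire and release, whether or not a preemption ever occurs.

    /* Illustrative sketch of the shared-flag convention (not the
     * actual Symunix interface).  The application sets the flag before
     * entering a critical section and clears it afterward; the kernel
     * reads it before preempting.  The set/clear/check is paid on
     * EVERY lock acquire/release, unlike the zero-overhead scheme of
     * Section 4.3. */

    struct kernel_shared {
        volatile int in_critical_section; /* written by user, read by kernel */
        volatile int preemption_deferred; /* set by kernel if it held off    */
    };

    extern struct kernel_shared *ksh;   /* mapped shared with the kernel */
    extern void lock(void *l);
    extern void unlock(void *l);
    extern void yield_processor(void);  /* let the deferred preemption occur */

    void critical_acquire(void *l)
    {
        ksh->in_critical_section = 1;   /* overhead on the common case */
        lock(l);
    }

    void critical_release(void *l)
    {
        unlock(l);
        ksh->in_critical_section = 0;   /* overhead on the common case */
        if (ksh->preemption_deferred)   /* kernel wanted this processor */
            yield_processor();
    }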
Furthermore, in these other systems the kernel notifies the application of its intention to preempt a processor before the preemption actually occurs; based on this notification, the application can choose to place a thread in a "safe" state and voluntarily relinquish a processor. This mechanism violates the constraint that higher priority threads are always run in place of lower priority threads.

Gupta et al. [9a] share our goal of maintaining a one-to-one correspondence between physical processors and execution contexts for running user-level threads. When a processor preemption or an I/O completion results in there being more contexts than processors, their kernel time-slices the contexts until the application reaches a point where it is safe to suspend one. Our system eliminates the need for this kernel time-slicing by notifying the application's thread system of the event while keeping the number of contexts constant.

Some systems provide asynchronous I/O as a mechanism to solve some of the problems with user-level thread management on multiprocessors [9, 25]. Indeed, our work has the flavor of an asynchronous I/O system: when an I/O request is made, the processor is returned to the application, and later, when the I/O completes, the application is notified. There are two major differences between our work and traditional asynchronous I/O systems, though. First, and most important, scheduler activations provide a single uniform mechanism to address the problems of processor preemption, I/O, and page faults. Relative to asynchronous I/O, our approach derives conceptual simplicity from the fact that all interaction with the kernel is synchronous from the perspective of a single scheduler activation; a scheduler activation that blocks in the kernel is simply replaced with a new scheduler activation when the awaited event occurs. Second, while asynchronous I/O schemes may require significant changes to both application and kernel code, our scheme leaves the structure of both the user-level thread system and the kernel largely unchanged.

Finally, parts of our scheme are related in some ways to Hydra [26], one of the earliest multiprocessor operating systems, in which scheduling policy was moved out of the kernel. In Hydra, however, this separation came at a performance cost, because scheduling decisions required communication through the kernel to a distinguished scheduling policy server, and then back through the kernel to implement a context switch. In our system, an application can set its own policy for scheduling its threads onto its processors, and can implement that policy without trapping to the kernel. Processor allocation decisions in our system are the kernel's responsibility, although, as in Hydra, longer-term allocation policy could be delegated to a distinguished application-level server.

7. SUMMARY

Managing parallelism at the user level is essential to high-performance parallel computing, but kernel threads or processes, as provided in many operating systems, are a poor abstraction on which to support this. We have described the design, implementation, and performance of a new kernel interface and user-level thread package that together combine the performance of user-level threads (in the common case of thread operations that can be implemented entirely at user level) with the functionality of kernel threads (correct behavior in the infrequent case when the kernel must be involved).

Our approach is based on providing each application address space with a virtual multiprocessor in which the application knows exactly how many processors it has and exactly which of its threads are running on those processors. Responsibilities are divided between the kernel and each application address space as follows (a sketch of the resulting interface appears after the list):

—Processor allocation (the allocation of processors to address spaces) is done by the kernel.
—Thread scheduling (the assignment of an address space's threads to its processors) is done by each address space.
—The kernel notifies the address space's thread scheduler of every kernel event affecting the address space.
—The address space notifies the kernel of the subset of user-level events that can affect processor allocation decisions.
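The following is a minimal sketch of what such a kernel/user interface might look like in C. The names and signatures are hypothetical; they paraphrase the upcall and system-call vocabulary used earlier in the paper rather than reproduce our actual Topaz interface.

    /* Hypothetical sketch of the division of responsibilities; names
     * and signatures are illustrative, not the Topaz implementation. */

    typedef int processor_t;
    typedef int activation_t;
    struct machine_state;    /* saved registers of a preempted context */

    /* Upcalls: kernel -> address space.  Each is delivered on a fresh
     * scheduler activation; the thread scheduler handles the event. */
    void add_this_processor(processor_t p);
    void processor_has_been_preempted(activation_t preempted,
                                      struct machine_state *st);
    void activation_has_blocked(activation_t blocked);
    void activation_has_unblocked(activation_t unblocked,
                                  struct machine_state *st);

    /* System calls: address space -> kernel.  Only the user-level
     * events that can affect processor allocation are communicated. */
    void add_more_processors(int how_many); /* runnable threads > processors */
    void this_processor_is_idle(void);      /* runnable threads < processors */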
The kernel mechanism that we use to implement these ideas is called scheduler activations. A scheduler activation is the execution context for vectoring control from the kernel to the address space on a kernel event. The address space thread scheduler uses this context to handle the event, e.g., to modify user-level thread data structures, to execute user-level threads, and to make requests of the kernel. While our prototype implements threads as the concurrency abstraction supported at the user level, scheduler activations are not linked to any particular model; scheduler activations can support any user-level concurrency model, because the kernel has no knowledge of user-level data structures.

ACKNOWLEDGMENTS

We would like to thank Andrew Black, Mike Burrows, Jan Edler, Mike Jones, Butler Lampson, Tom LeBlanc, Kai Li, Brian Marsh, Sape Mullender, Dave Redell, Michael Scott, Garret Swart, and John Zahorjan for their helpful comments. We would also like to thank the DEC Systems Research Center for providing us with their Firefly hardware and software.

REFERENCES

1. AGHA, G. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, Mass., 1986.
2. ANDERSON, T., LAZOWSKA, E., AND LEVY, H. The performance implications of thread management alternatives for shared-memory multiprocessors. IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644. Also appeared in Proceedings of the 1989 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1989), pp. 49-60.
3. BARNES, J., AND HUT, P. A hierarchical O(N log N) force-calculation algorithm. Nature 324 (1986), 446-449.
4. BIRRELL, A., GUTTAG, J., HORNING, J., AND LEVIN, R. Synchronization primitives for a multiprocessor: A formal specification. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Austin, Tex., Nov. 1987), pp. 94-102.
5. BLACK, D. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer 23, 5 (May 1990), 35-43.
6. CHASE, J., AMADOR, F., LAZOWSKA, E., LEVY, H., AND LITTLEFIELD, R. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Litchfield Park, Ariz., Dec. 1989), pp. 147-158.
7. CHERITON, D. The V distributed system. Commun. ACM 31, 3 (Mar. 1988), 314-333.
8. DRAVES, R., AND COOPER, E. C Threads. Tech. Rep. CMU-CS-88-154, School of Computer Science, Carnegie Mellon Univ., June 1988.
9. EDLER, J., LIPKIS, J., AND SCHONBERG, E. Process management for highly parallel UNIX systems. In Proceedings of the USENIX Workshop on UNIX and Supercomputers (Sept. 1988), pp. 1-17.
9a. GUPTA, A., TUCKER, A., AND STEVENS, L. Making effective use of shared-memory multiprocessors: The process control approach. Tech. Rep. CSL-TR-91-475A, Computer Systems Laboratory, Stanford Univ., 1991.
10. HALSTEAD, R. Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4 (Oct. 1985), 501-538.
11. HERLIHY, M. A methodology for implementing highly concurrent data structures. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Seattle, Wash., Mar. 1990), pp. 197-206.
12. KARLIN, A., LI, K., MANASSE, M., AND OWICKI, S. Empirical studies of competitive spinning for a shared-memory multiprocessor. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (Pacific Grove, Calif., Oct. 1991), pp. 41-55.
13. LAMPSON, B., AND REDELL, D. Experience with processes and monitors in Mesa. Commun. ACM 23, 2 (Feb. 1980), 104-117.
14. LO, S.-P., AND GLIGOR, V. A comparative analysis of multiprocessor scheduling algorithms. In Proceedings of the 7th International Conference on Distributed Computing Systems (Sept. 1987), pp. 356-363.
15. MARSH, B., SCOTT, M., LEBLANC, T., AND MARKATOS, E. First-class user-level threads. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (Pacific Grove, Calif., Oct. 1991), pp. 110-121.
16. MOELLER-NIELSEN, P., AND STAUNSTRUP, J. Problem-heap: A paradigm for multiprocessor algorithms. Parallel Comput. 4, 1 (Feb. 1987), 63-74.
17. MOSS, J., AND KOHLER, W. Concurrency features for the Trellis/Owl language. In Proceedings of the European Conference on Object-Oriented Programming 1987 (ECOOP 87) (June 1987), pp. 171-180.
18. REDELL, D. Experience with Topaz teledebugging. In Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging (Madison, Wisc., May 1988), pp. 35-44.
19. SCHROEDER, M., AND BURROWS, M. Performance of Firefly RPC. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 1-17.
20. SCOTT, M., LEBLANC, T., AND MARSH, B. Multi-model parallel programming in Psyche. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Mar. 1990), pp. 70-78.
21. TEVANIAN, A., RASHID, R., GOLUB, D., BLACK, D., COOPER, E., AND YOUNG, M. Mach Threads and the Unix Kernel: The battle for control. In Proceedings of the 1987 USENIX Summer Conference (1987), pp. 185-197.
22. THACKER, C., STEWART, L., AND SATTERTHWAITE, JR., E. Firefly: A multiprocessor workstation. IEEE Trans. Comput. 37, 8 (Aug. 1988), 909-920.
23. TUCKER, A., AND GUPTA, A. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Litchfield Park, Ariz., Dec. 1989), pp. 159-166.
24. VANDEVOORDE, M., AND ROBERTS, E. WorkCrews: An abstraction for controlling parallelism. Int. J. Parallel Program. 17, 4 (Aug. 1988), 347-366.
25. WEISER, M., DEMERS, A., AND HAUSER, C. The portable common runtime approach to interoperability. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Litchfield Park, Ariz., Dec. 1989), pp. 114-122.
26. WULF, W., LEVIN, R., AND HARBISON, S. Hydra/C.mmp: An Experimental Computer System. McGraw-Hill, New York, 1981.
27. ZAHORJAN, J., AND MCCANN, C. Processor scheduling in shared memory multiprocessors. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Boulder, Colo., May 1990), pp. 214-225.
28. ZAHORJAN, J., LAZOWSKA, E., AND EAGER, D. The effect of scheduling discipline on spin overhead in shared memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 2, 2 (Apr. 1991), 180-198.

Received June 1991; revised August 1991; accepted September 1991