CS550 COMPARATIVE OPERATING SYSTEMS

THE SPRITE NETWORK OPERATING SYSTEM

Presented by Prashanthi Narayan Nettem
SID: 999 29 1598
CS 550-395
Email: nettpra@iit.edu

ABSTRACT

Sprite is a distributed operating system that provides a single system image to a cluster of workstations. It achieves very high file system performance through client and server caching, and it uses process migration to take advantage of idle machines. Sprite is an operating system for a collection of personal workstations and file servers on a local area network. Sprite's kernel-call interface is much like that of 4.3BSD UNIX, but its implementation is new and provides a high degree of network integration. All the hosts on the network share a common high-performance file system. Processes may access files or devices on any host, and Sprite allows file data to be cached around the network while guaranteeing the consistency of shared access to files. Each host runs a distinct copy of the Sprite kernel, but the kernels work closely together using a remote-procedure-call (RPC) mechanism.

TABLE OF CONTENTS

1. Introduction
2. Process Migration
   a) Process Migration Mechanism
   b) Process Transfer
   c) Role of the File System in Process Migration
   d) Virtual Memory Transfer
   e) Migrating Open Files
3. Virtual Memory Management
   a) Shared Memory
   b) Demand Loading of Code
   c) Backing Store Management
4. File Management
   a) Shared File System
   b) Prefix Table
   c) Pseudo File System
   d) Caching of the File System
   e) Cache Consistency
5. Virtual Memory vs. File System
6. Remote Procedure Call
7. Multiprocessor Sprite Kernel
8. Conclusions
9. References

INTRODUCTION

Sprite is an experimental system developed at the University of California, Berkeley. Sprite's eventual target is SPUR, a multiprocessor with a large physical memory, networked together with many other powerful workstations. Sprite has also been used on SUN-2 and SUN-3 workstations.
Computing has shifted from time-sharing systems to networks of high-performance workstations for three reasons. First, in a network of workstations many machines are idle at any given time. These idle hosts represent a substantial pool of processing power, many times greater than what is available on any user's personal machine in isolation, and throughput can be increased by running different programs on them in parallel. Second, users have become accustomed to having workstations to themselves: workstation users expect quick, predictable interactive response and are unlikely to tolerate mechanisms that threaten interactive performance. Third, applications composed of short, independent programs can easily exploit multiple processors to increase the speed of execution.

This paper deals with process migration, virtual memory, the file system, kernel-to-kernel remote procedure calls, and mutual exclusion and synchronization in the Sprite operating system.

A process migration facility moves a process's execution site at any time between two machines of the same architecture. Sprite offers transparent and automatic process migration. Migration needs to be transparent both to the user who runs a remote process and to the process itself: with respect to the user, migrated processes should appear as though they were running on the user's own host.

The virtual memory system of Sprite has several important features. First, it allows processes to share memory. Second, it allows all of the physical pages of memory to be in use at the same time; no pool of free pages is required. Third, it speeds program startup by using free memory as a cache for recently used programs. Another key assumption behind Sprite is that machines have large physical memories.

The Sprite operating system is centered around its shared file system.
The underlying distribution of the system is hidden behind the file system, which transparently provides access to local or remote files on all the Sprite hosts in the network. Sprite uses large main-memory disk block caches to achieve high performance in its file system, providing non-write-through file caching on both client and server machines. A simple cache consistency mechanism permits files to be shared by multiple clients without danger of stale data. To allow the file cache to occupy as much memory as possible, the file system of each machine negotiates with the virtual memory system over physical memory usage and changes the size of the file cache dynamically. Client caches make diskless Sprite workstations practical, and they reduce server loading and network traffic considerably.

Sprite optimizes the common case of file and device access, both local and remote, with a kernel-level implementation, but it also allows user-level extensibility by letting a user-level process implement the naming and I/O interfaces of the file system. This facility, known as a pseudo file system, transparently extends the Sprite distributed file system to include foreign file systems and arbitrary user services. The kernel contains a remote procedure call mechanism that allows the kernel of each workstation to invoke operations on other workstations. The RPC mechanism is used extensively in Sprite to implement other features, such as the network file system and process migration.

1) Process Migration

The process migration facility moves a process's execution site at any time between two machines of the same architecture. Migration allows processes to be offloaded to idle hosts, and it also preserves host autonomy by evicting processes from hosts that are no longer idle. Process migration should be transparent and automatic.
Sprite presents the illusion of a single fast time-sharing system, rather than a distributed system with many independent hosts. Migration needs to be transparent both to the user who runs a remote process and to the process itself. With respect to the user, migrated processes should behave as if they were running on the user's own host: the user can suspend or terminate them with no knowledge of the hosts on which they are actually executing. The most important aspect of transparency, however, is its impact on a migrated process. Programs should not have to be coded specially to take migration into account, as long as they are capable of being executed in parallel on a single host. A migrated process should have the same resources as an unmigrated process.

Process migration should also be automatic: load should be spread across idle machines without user intervention, with the system selecting what to execute next and which machine to run it on.

Process migration mechanism: In general, migrating a process involves two phases. The first phase consists of extracting the process state from one host (the source) and installing it on another host (the target). The second phase begins when the system starts executing the process on the target, and ends when the process terminates or migrates elsewhere. These two phases interact, because the actions the system performs during the first phase determine what special operations are necessary during the second. The first phase, process transfer, depends on the state associated with a process. The second phase, process execution, depends not only on the way in which state is transferred, but also on the degree to which migration is intended to be transparent.

Process transfer: The techniques used to migrate a process depend on the state associated with the process being migrated. If a stateless process existed, migrating it would be trivial.
Process state typically includes the process's virtual memory, its open files, and its execution state; the most important of these are discussed below.

Role of the file system in process migration: Sprite is an operating system for a collection of personal workstations and file servers on a local area network. Sprite's implementation is new and provides a high degree of network integration: all the hosts on the network share a common high-performance file system. A file server is responsible for guaranteeing "consistent access" to cached data. It keeps track of which hosts have a file open for reading and writing. If a file is open on more than one host and at least one of them is writing it, then caching is disabled: all hosts must forward their read and write requests for that file to the server so that they can be serialized. Because servers must maintain state about all open files in order to keep caches consistent, process migration and the file system interact strongly: when a process changes hosts, the servers that control access to its open files must be notified of its new location.

Though the file system is physically distributed, it is logically centralized because all hosts share a single name space. Sprite presents the illusion of a time-sharing system by making a process appear to execute in a single location throughout its lifetime. That location is referred to as the home machine of the process; it is the machine where the process would have executed if there had been no migration at all. Just as the file system makes file location transparent to users, migration is managed by the kernel and hidden from view; a process is unaware that it is migrating to a different machine.

Virtual memory transfer: If the entire address space is transferred at migration time, the transfer can take many seconds even at the high rates the network allows, and it may move pages that the process never uses after migrating.
Sprite's migration facility therefore uses a form of virtual memory transfer that takes advantage of existing network services. In Sprite, backing storage for virtual memory is implemented using ordinary files. Since these backing files are stored in the network file system, they are accessible throughout the network. During migration the source machine freezes the process, flushes its dirty pages to backing files, and discards its address space. On the target machine, the process starts executing with no resident pages and uses the standard paging mechanisms to load pages from the backing files as they are needed. The backing files are stored on network file servers, which cache recently used file data in memory, so when the source machine flushes a dirty page it is simply transferred over the network to the server's main-memory file cache. Disk operations occur only if the server's cache overflows.

Virtual memory transfer becomes much more complicated if the process to be migrated is sharing writable virtual memory with some other process on the source machine. In principle, it is possible to maintain the shared virtual memory even after one of the sharing processes migrates, but this increases the cost of shared accesses so dramatically that it seemed unreasonable. Currently, shared writable virtual memory almost never occurs in Sprite, so Sprite simply disallows migration for processes using it. A better long-term solution would be to migrate all the sharing processes together, but even this may be impractical if there are complex patterns of sharing that involve many processes.

Migrating open files: When a process migrates, it must have uninterrupted access to any files it has opened. Either the state associated with each file must be transferred to the target, or operations on the file must be forwarded to the machine on which the file was opened. Sprite uses the "transfer state" approach for open files.
Once an open file has been transferred to a new host, access to the file is managed using the standard mechanisms of the Sprite file system. The server that stores a file is responsible for keeping two things consistent: the file's contents and each process's offset into the file. The Sprite consistency protocol makes a file uncacheable if one host has it open for writing while other hosts access it simultaneously. The process control block (PCB) is either left on the source or transferred with the migrating process. The remaining elements of process state are much easier to transfer, since they are not as "bulky" as virtual memory and do not involve distributed state the way open files do.

2) Virtual Memory Management

Sprite is an operating system designed for high-speed (powerful) workstations; its virtual memory system runs on the Sun architecture. The important features of Sprite's virtual memory system are shared memory between processes, the ability to keep all physical pages of memory in use at the same time (no pool of free pages is required), remote paging, and the use of free memory as a cache for recently used programs.

Figure 2.1: In UNIX, processes may share code, but not stack or data. In Sprite, the data segment may be shared between processes, including both statically allocated and dynamic data. Private static data may be stored at the top of the stack. [Figure: side-by-side address-space layouts, UNIX (private stack, private dynamic data, private static data, sharable code) versus Sprite (private stack, sharable dynamic data, sharable static data, sharable code).]

Sprite's virtual memory is very similar to that of UNIX, but it has been redesigned to eliminate unnecessary complexity and to support three features: multiprocessing, networks, and large physical memories. Sprite has also eliminated the free list of page frames that other systems keep in memory to handle page faults.
Another major simplification in Sprite comes from taking advantage of high-bandwidth networks such as Ethernet. These networks have influenced the design of Sprite's virtual memory by allowing workstations to share high-performance file servers. As a result, most Sprite workstations will be diskless, and paging will be carried out over the network; the file servers are used both for demand loading of code and for backing store. The large physical memories of Sprite machines offer the opportunity to speed program startup by using free memory as a cache for recently used programs. The SPUR machine, Sprite's eventual target, is an example of this technology: a multiprocessor with a large physical memory, networked together with many other powerful workstations.

Shared memory: Processes that work together in parallel to solve a problem need an interprocess communication (IPC) mechanism that allows them to synchronize and share data. Since Sprite targets multiprocessor architectures, it must provide IPC suited to exploiting the parallelism of the multiprocessor. For reasons of efficiency, Sprite uses shared writable memory for IPC.

Demand loading of code and backing store management: A virtual memory system must be able to load pages into memory from the file system or backing store when a process faults on a page, and to write pages to backing store when removing a dirty page from memory. Sprite uses the file system for both demand loading and backing store.

Demand loading of code: The address space of a process consists of three segments: code, heap, and stack. In Sprite, processes can share both code and heap; sharing the heap segment permits processes to share writable memory. When a process is created by fork, it is given an identical copy of its parent's stack segment, it shares its parent's code segment, and it can either share its parent's heap segment or get an identical copy of it.
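The segment handling at fork described above can be sketched as follows. This is a minimal illustration in Python, not Sprite's kernel code; the dictionary-based "segments" and the `fork` helper are assumptions made for the example.

```python
# Sketch of Sprite-style segment handling at fork: the child always shares the
# parent's code segment, always gets a private copy of the stack, and either
# shares or copies the heap. Structures here are illustrative only.

def fork(parent, share_heap):
    child = {
        "code": parent["code"],                            # shared: same object
        "stack": dict(parent["stack"]),                    # identical private copy
        "heap": parent["heap"] if share_heap else dict(parent["heap"]),
    }
    return child

parent = {"code": {"main": "text"}, "stack": {"sp": 1}, "heap": {"x": 0}}
child = fork(parent, share_heap=True)

child["heap"]["x"] = 42      # visible to the parent, because the heap is shared
child["stack"]["sp"] = 99    # private to the child, because the stack was copied
```

With `share_heap=False` the child would instead receive an identical but independent copy of the heap, matching the choice Sprite offers at fork time.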
Sprite initially loads code and initialized heap pages from object files in the file system, reading the pages with normal file system operations. The Sprite file system is fast enough for the virtual memory system to use it for all demand loading, because the file system is implemented with high-performance file servers: dedicated machines with local disks and large physical memories. This has several advantages. First, the virtual memory system is simplified because it does not have to track physical locations on disk. Second, the file server's cache can be used to increase performance, since page reads may be serviced out of the cache instead of having to go to disk.

Backing store management: Backing store holds dirty pages when they are taken away from a segment. In Sprite, each segment has its own file in the file system that it uses for backing store, whereas in UNIX backing store is allocated in large contiguous chunks on a dedicated disk partition. Because Sprite has a high-performance file system, all writing of dirty pages is done to files in the file system instead of to a special partition. When a segment needs to write out a dirty page, it opens a file, and pages that need to be saved in backing store are written to that file using the normal file write operation. All reads from backing store use the normal file read operation, with the virtual address of the page determining the offset to read. When the segment is destroyed, the file is closed and removed.

There are several advantages to using files for backing store instead of a separate disk partition. The virtual memory system can deal in virtual addresses instead of absolute block numbers, which frees it from bookkeeping about the location of pages on disk. Another advantage is that no preallocated disk partition is required for each CPU, a major saving in disk space.
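The per-segment backing files described above can be sketched as follows. The `BackingFile` class, the page size, and the address-to-offset scheme are illustrative assumptions, not Sprite's actual implementation; the point is that the pager addresses pages by virtual address and lets the file system worry about disk blocks.

```python
# Sketch of per-segment backing store: each segment pages to an ordinary file,
# and a page's file offset is derived from its virtual address, so the VM
# system never tracks absolute disk block numbers.

PAGE_SIZE = 4096

class BackingFile:
    """Stands in for one segment's backing file in the shared file system."""
    def __init__(self):
        self.blocks = {}                         # page number -> page data

    def write_page(self, vaddr, data):
        self.blocks[vaddr // PAGE_SIZE] = data   # "normal file write" at offset

    def read_page(self, vaddr):
        return self.blocks[vaddr // PAGE_SIZE]   # "normal file read" at offset

heap_backing = BackingFile()                     # one backing file per segment
heap_backing.write_page(0x2000, b"dirty heap page")
restored = heap_backing.read_page(0x2000)
```

When the segment is destroyed, the backing file would simply be closed and removed, with no partition bookkeeping to undo.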
Backing files also simplify process migration: since the backing store is part of a shared file system, only a pointer to the file used for backing store needs to be transferred. They also offer the potential for higher performance, since with large physical memories the virtual memory system may find pages in the server's cache instead of going to disk. If the server's cache does not have enough room to hold all dirty pages, one solution is to create a separate partition on the server's disk that is used only for backing store files [Nelson86].

3) File Management

A novel feature of the Sprite operating system concerns distributed state management. The file system is implemented by a distributed set of computers, and its internal state is distributed among the operating system kernels at the different sites. Much of this state is used to optimize file system accesses through a main-memory caching system: the servers keep state about how their clients are caching files, and they use this state to guarantee a consistent view of the file system data [Nelson88b]. The system also had to support diskless workstations efficiently, because they reduce cost, heat, noise, and administrative overhead.

Shared file system: The Sprite distributed file system provides a framework through which many different kinds of resources (files, devices, and services) are accessed. A shared file system is the basis for a distributed system, and it is implemented inside the operating system kernel. The file servers play the role of name servers, which are used to locate file servers, device servers, and other applications in the system; they provide names for devices and services that can be located on any host in the network. The advantage of this approach is that the naming, synchronization, and communication infrastructure required to support remote file access can also be used to provide access to remote devices and remote services.
Sprite optimizes access to regular files, where only a single server (the file server) is involved, eliminating the overhead of invoking a separate name service. I/O operations are object-specific and may be implemented by any host. Thus, there are three roles a host can play in the architecture: the file server performs the naming operations, the I/O server implements the object-specific operations, and the client uses the object.

Prefix table: Each client keeps a prefix table, a mapping from file name prefixes to their servers. The prefix tables are caches that are updated with a broadcast protocol: new entries are added as a client accesses new areas of the file system, and out-of-date entries are refreshed automatically if the system configuration changes.

Pseudo file system: The facility that extends the transparency of the Sprite distributed file system to include foreign file systems and arbitrary user services is the pseudo file system. A pseudo file system is a subtree of the distributed hierarchical name space that is implemented by a user-level server process. The server runs on one host, and accesses from other hosts are handled in the same way as remote accesses to Sprite file servers. In Sprite, the file system handles local and remote file access through an internal kernel interface, and this structure supports modular additions to the kernel for other types of file systems. The pseudo file system adds a file system type that allows further extensions to be implemented in user-level server processes instead of inside the kernel. Pseudo file systems provide a general mechanism for extending the naming and I/O structure of the file system with user-implemented applications; the kernel remains smaller and more reliable, yet provides more structure than a message-based kernel. The pseudo file server participates in the system's recovery and caching mechanisms.
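The prefix-table lookup described above can be sketched as a longest-matching-prefix search. The server names and paths below are made up for illustration; Sprite's real tables also carry server-side tokens and are maintained by the broadcast protocol rather than hand-populated.

```python
# Sketch of a client prefix table: a file name maps to the server whose
# registered prefix is the longest match for the path. Entries are illustrative.

prefix_table = {
    "/": "serverA",
    "/users": "serverB",
    "/users/sprite": "serverC",
}

def lookup(path):
    """Return (server, longest matching prefix) for an absolute path."""
    best = max(
        (p for p in prefix_table
         if path == p or path.startswith(p.rstrip("/") + "/")),
        key=len,
    )
    return prefix_table[best], best

server, prefix = lookup("/users/sprite/src/main.c")
```

A path outside the registered subtrees (for example `/etc/passwd` here) falls through to the root prefix, which is how a client's table can start out nearly empty and grow as new areas of the file system are touched.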
Thus, Sprite is a file-system-based kernel that presents a standard interface to users and applications, and it provides more system support for user-implemented services than a message-based kernel. Care must be taken so that the performance of a pseudo file system is not degraded by its user-level implementation.

Caching of the file system: Caches are used to improve file system performance. When a region of a file is accessed repeatedly, blocks in the cache can be served without involving the disk. There are three advantages: 1) it avoids the delay of going to disk; 2) caching reduces contention for the disk arm, which matters when several processes attempt to access files on the same disk; and 3) caching in main memory allows workstations to be diskless, which makes them less expensive.

[Figure: the path of a file access. A request goes first to the client cache; on a miss it goes to the local disk or, over the network, to the server, whose own cache sits in front of the server disk.]

As shown in the figure above, when a process makes a file access, the request is first sent to the cache on the process's workstation (file traffic). If it is not satisfied there, the request is passed either to a local disk, if the file is stored locally (disk traffic), or to the server where the file is stored (server traffic). Servers also maintain caches in order to reduce their disk traffic.

In Sprite, caching is done in the main memories of both servers and clients. Client caches reduce the communication delay of fetching blocks from the server; they increase performance by speeding up programs and reducing server utilization, so that each server can support more clients. Even when several workstations share the same file simultaneously, the file may be cached in several places at once. The size of the cache varies dynamically: the virtual memory system and the file system negotiate for the machine's physical memory, and as the needs of each change, the file cache changes in size.
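The lookup path just described (client cache, then local disk or server, with the server's own cache in front of its disk) can be sketched as follows. The dictionaries standing in for caches and disks are illustrative assumptions.

```python
# Sketch of the file-access path: client cache first (file traffic), then the
# local disk (disk traffic) or the server (server traffic), whose own cache
# shields the server disk. All stores here are plain dicts for illustration.

def read_block(block, client_cache, server_cache, server_disk, local_disk=None):
    """Return (data, where it was found) for one file block."""
    if block in client_cache:
        return client_cache[block], "client cache"        # file traffic
    if local_disk is not None and block in local_disk:
        data, where = local_disk[block], "local disk"     # disk traffic
    elif block in server_cache:
        data, where = server_cache[block], "server cache" # server traffic
    else:
        data, where = server_disk[block], "server disk"   # server disk traffic
    client_cache[block] = data                            # cache for next access
    return data, where

client = {}
data, where = read_block("b1", client, {"b1": b"hello"}, {"b1": b"hello"})
again, where2 = read_block("b1", client, {}, {})          # now a client-cache hit
```

The second read is satisfied entirely on the client, which is the effect that lets Sprite's client caches cut server utilization and network traffic.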
Cache consistency: Because many clients can cache a file at the same time, Sprite must solve a consistency problem. A client may cache a file as long as there are no concurrent writers: many clients can cache a file simultaneously provided none of them is writing it. Two kinds of sharing cause consistency problems in Sprite: sequential write sharing and concurrent write sharing.

Sequential write sharing occurs when a file is written and read at different times by different clients, but is never open for reading and writing at the same time on different clients. This can leave a client holding stale data for a file in its cache after it has closed the file; to achieve consistency, the client must be able to detect this stale data by the time it reopens the file [Nelson88].

Concurrent write sharing occurs when a file is open on multiple clients and at least one of them has it open for writing. The server handles this case: when the server detects that concurrent write sharing is about to occur, it notifies the clients and makes the file uncacheable, which causes the clients to remove all of the file's blocks from their caches. The file becomes cacheable again when it is no longer undergoing concurrent write sharing.

[Figures 3.2 and 3.3: timelines of sequential write sharing (C1 reads, then C2 writes, then C1 reads again) and concurrent write sharing (C2 opens for writing while C1 still has the file open).]

As shown above, in sequential write sharing C1 opens the file, reads it, and closes it; C2 then opens the same file for writing, modifies it, and closes it. When C1 opens the file again, it must make sure that the data in its cache is not stale, because C2 may have overwritten the file.
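The server's open-time check for concurrent write sharing can be sketched as follows. This is a minimal sketch, not Sprite's kernel code; the table layout and function names are assumptions, and real Sprite also tracks version numbers to catch sequential write sharing.

```python
# Sketch of the server-side check: when an open would put a writer alongside
# any other opener, the server marks the file uncacheable, forcing clients to
# forward reads and writes to the server so they can be serialized.

open_files = {}   # name -> {"readers": set, "writers": set, "cacheable": bool}

def server_open(name, client, mode):
    """Record an open and return whether the client may cache the file."""
    entry = open_files.setdefault(
        name, {"readers": set(), "writers": set(), "cacheable": True})
    if mode == "read":
        entry["readers"].add(client)
    else:
        entry["writers"].add(client)
    # Concurrent write sharing: at least one writer plus any other opener.
    others = (entry["readers"] | entry["writers"]) - {client}
    if entry["writers"] and others:
        entry["cacheable"] = False
    return entry["cacheable"]

c1_can_cache = server_open("/a/b", "client1", "read")    # sole reader: cacheable
c2_can_cache = server_open("/a/b", "client2", "write")   # writer joins: disabled
```

Once the writer (or the other openers) close the file, a real server would flip the file back to cacheable, since it is no longer undergoing concurrent write sharing.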
For concurrent write sharing: C1 opens a file for reading, and before it closes the file C2 opens the same file for writing; the file is now undergoing concurrent read-write sharing. If, after C2 opens the file, C1 closes it and reopens it for writing before C2 closes it, the file is undergoing concurrent write-write sharing.

Sprite uses the file servers as centralized control points for cache consistency. Each server guarantees cache consistency for all the files on its disks, clients deal only with the server for a file, and there are no direct client-client interactions.

4) Virtual Memory vs. File System

The file system wants a very large cache, whereas the virtual memory system wants the file cache kept small so that most of physical memory is available for virtual memory. For better performance, Sprite allows the cache to change size in response to changing demands. The file system and the virtual memory system manage separate pools of physical memory pages, and each module keeps an approximate time of last access for each page. Whenever either module needs additional memory (because of a page fault or a miss in the file cache), it compares the age of its oldest page with the age of the oldest page from the other module, replacing whichever is older. This allows memory to flow back and forth between the virtual memory page pool and the file cache, depending on the needs of the current application.

5) Remote Procedure Call

A remote procedure call is a procedure that is executed on a foreign host, with the result sent back to the calling program. A program at the remote site receives data from the caller, calls the procedure locally, and returns the result. To the programmer, remote procedure calls resemble local procedure calls. The foreign host is called the server and the calling program the client.
Sprite provides a special-purpose communication mechanism, its network RPC protocol, for communication among Sprite kernels: if a kernel operation needs to be carried out on a remote machine, the kernel-to-kernel RPC protocol is used to invoke it. The network protocol is based on the Birrell-Nelson RPC protocol, which uses implicit acknowledgements so that ordinarily an RPC requires only two network packets [Birrell84]. The basic model is extended to optimize bulk data transfer: large messages are fragmented into multiple packets, and the reply packet acknowledges the whole batch.

An RPC request or reply is composed of two buffers plus the header. One buffer, the parameter block, is used to marshal small arguments. The other buffer refers to a large, uninterpreted block of data, usually in user space, that can remain in place until copied onto the network by the network interface. Packet headers and parameter blocks are automatically byte-swapped at a low level, but only if the receiver has a different byte order. Packet headers contain a boot time-stamp so that crashes and reboots can be easily detected. These optimizations tune the RPC protocol for its primary use as the file system's network transport protocol.

[Figures 5.1 and 5.2: a local procedure call (caller passes arguments and receives results directly) versus the RPC path, in which the client stub builds a request message, the RPC transport carries it over the network, and the server stub invokes the called procedure and returns a result message.]

The implementation of the Sprite RPC mechanism consists of stubs and the RPC transport, as shown in the figure. Together they conceal the fact that the calling procedure and the called procedure are on different machines. There are two stubs for each call, one on the client and one on the server. On the client, the stub copies its arguments into a request message and returns values from a result message, so that the calling procedure is unaware of the underlying message communication.
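The stub structure on both sides of a call can be sketched as follows. The procedure, the JSON message format, and the direct function call standing in for the network transport are all illustrative assumptions; Sprite's real stubs marshal into binary parameter blocks and hand them to the kernel RPC transport.

```python
# Sketch of RPC stubs: the client stub marshals arguments into a request
# message, the server stub unpacks them, invokes the real procedure, and
# packages the result, hiding the message exchange from both ends.
import json

def add_service(a, b):
    """The 'called procedure' living on the server."""
    return a + b

def server_stub(request_msg):
    req = json.loads(request_msg)              # unpack arguments from message
    result = add_service(*req["args"])         # invoke the real procedure
    return json.dumps({"result": result})      # package up the result

def client_stub(a, b):
    request_msg = json.dumps({"proc": "add", "args": [a, b]})
    reply_msg = server_stub(request_msg)       # stands in for the RPC transport
    return json.loads(reply_msg)["result"]     # caller sees a plain return value

total = client_stub(2, 3)
```

To the caller, `client_stub(2, 3)` looks exactly like a local call, which is the transparency property the stubs exist to provide.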
The server stub passes arguments from the incoming message to the desired procedure and packages up results from the procedure, so that the called procedure is unaware that its real caller is on a different machine.

6) Multiprocessor Sprite Kernel

Sprite is a network operating system that has been designed to run on multiprocessors. The kernel is multi-threaded to allow more than one processor to execute kernel code simultaneously, and access to kernel data structures is controlled by locks. For a multi-threaded kernel to function correctly, it must contain mechanisms for both mutual exclusion and synchronization between threads. Sprite uses monitor-style locking and condition variables to provide these features.

There are two types of locks: monitor locks and master locks. Monitor locks are used to implement monitors; they are acquired at the start of a procedure and released at the end. If a process tries to lock a monitor lock that another process already holds, the process is put to sleep. The release of a monitor lock causes all processes waiting on the lock to be awakened, whereupon they attempt to reacquire it.

Master locks are used to provide mutual exclusion between processes and interrupt handlers. A master lock is simply a spin lock that is acquired with interrupts disabled: if a master lock is already held when an attempt is made to grab it, the processor retries the locking operation until it succeeds. Interrupts are disabled to prevent the situation in which an interrupt is taken after a master lock has been acquired and the interrupt routine spins forever waiting for the lock to be released.

In addition to ensuring mutual exclusion between threads, Sprite must also provide a means for threads to wait for interesting conditions to occur. Condition variables are used for this purpose. If a process waits on a condition variable while holding a lock, it releases the lock and is put to sleep.
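Monitor-style locking with a condition variable, as described above, can be sketched with Python's threading primitives. This is purely illustrative (Sprite's kernel is written in C, not Python); the `BoundedCounter` monitor and its method names are assumptions for the example.

```python
# Sketch of a monitor in the Sprite style: the lock is acquired on entry to
# every monitored procedure and released on exit, and a waiter releases the
# lock while asleep, reacquiring it after being signaled.
import threading

class BoundedCounter:
    def __init__(self):
        self.lock = threading.Lock()               # the monitor lock
        self.nonzero = threading.Condition(self.lock)
        self.value = 0

    def increment(self):
        with self.lock:                            # acquire at start of procedure
            self.value += 1
            self.nonzero.notify_all()              # wake all waiters; each must
                                                   # reacquire the lock to proceed

    def wait_and_decrement(self):
        with self.lock:
            while self.value == 0:                 # wait() releases the lock
                self.nonzero.wait()                # and sleeps until signaled
            self.value -= 1

counter = BoundedCounter()
waiter = threading.Thread(target=counter.wait_and_decrement)
waiter.start()
counter.increment()
waiter.join()
```

A master lock has no user-level analogue here, since it additionally disables interrupts and spins rather than sleeping; the monitor lock above corresponds only to the sleep-based mutual exclusion.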
At a later time another process signals the condition variable, causing all processes waiting on the variable to awaken and try to reacquire the lock.

CONCLUSIONS

The three goals driving the Sprite design are high performance, consistency, and simplicity. Like many other systems, Sprite attains high performance by using caches on both client and server workstations. Sprite uses its file servers to maintain cache consistency, disabling caching when the same file is open for writing in several places at once. Virtual memory uses ordinary files as backing storage for simplicity, for easy implementation of process migration, and for dynamic use of disk storage.

REFERENCES

1) Ousterhout, J.K., Cherenson, A.R., Douglis, F., Nelson, M.N., and Welch, B.B. "The Sprite Network Operating System." IEEE Computer, Vol. 21, No. 2, Feb. 1988.
2) Douglis, F. and Ousterhout, J. "Process Migration in the Sprite Operating System." 7th International Conference on Distributed Computing Systems, September 1987. Also as Technical Report UCB/CSD 87/343, February 1987.
3) Nelson, M. "Virtual Memory for the Sprite Operating System." Technical Report UCB/CSD 86/301, June 1986.
4) Nelson, M., Welch, B., and Ousterhout, J. "Caching in the Sprite Network File System." ACM Transactions on Computer Systems. Also as Technical Report UCB/CSD 87/345, March 1987.
5) Welch, B. "The Sprite Remote Procedure Call System." Technical Report UCB/CSD 86/302, Computer Science Division, University of California, Berkeley, June 1986.
6) Nelson, M.N. "Physical Memory Management in a Network Operating System." PhD thesis, University of California, Berkeley, Nov. 1988.
7) Birrell, A.D. and Nelson, B.J. "Implementing Remote Procedure Calls." ACM Transactions on Computer Systems, Vol. 2, No. 1, Feb. 1984, pp. 39-59.
8) Welch, B.B. "Naming, State Management, and User-Level Extensions in the Sprite Distributed File System." Technical Report UCB/CSD 90/567, University of California, Berkeley, 1990.
9) Welch, B. "The File System Belongs in the Kernel." Proceedings of the 2nd USENIX Mach Symposium, November 1991.
10) Welch, B. and Ousterhout, J. "Pseudo File Systems." Technical Report UCB/CSD 89/499, Computer Science Division (EECS), University of California, Berkeley, 1989.
11) Douglis, F. "Transparent Process Migration in the Sprite Operating System." PhD thesis, University of California, Berkeley, September 1990. Available as Technical Report UCB/CSD 90/598.