Interactive Machine-Language Programming Butler W. Lampson University of California, Berkeley Proc. AFIPS Conf. 27 (1965), pp 473-482. Introduction The problems of machine language programming, in the broad sense of coding in which it is possible to write each instruction out explicitly, have been curiously neglected in the literature. There are still many problems which must be coded in the hardware language of the computer on which they are to run, either because of stringent time and space requirements or because no suitable higher level language is available. It is a sad fact, however, that a large number of these problems never run at all because of the inordinate amount of effort required to write and debug machine language programs. On those that are undertaken in spite of this obstacle, a great deal of time is wasted in struggles between programmer and computer which might be avoided if the proper systems were available. Some of the necessary components of these systems,, both hardware and software, have been developed and intensively used at a few installations. To most programmers, however, they remain as unfamiliar as other tools which are presented for the first time below. In the former category fall the most important features of a good assembler: macroinstructions implemented by character substitution, conditional assembly instructions, and reasonably free linking of independently assembled programs. The basic components of a debugging system are also known but relatively unfamiliar. For these the essential prerequisite is an interactive environment, in which the power of the computer is available at a console for long periods of time. The batch processing mode in which large systems are operated today of course precludes interaction, but programs for small machines are normally debugged in this way, and as time-sharing becomes more widespread the interactive environment will become common. It is clear that interactive debugging systems must have abilities very different from those of off-line systems. Large volumes of output are intolerable, so that dumps and traces are to be avoided at all costs. To take the place of dumps, selective examination and alteration of memory locations are provided. Traces give way to breakpoints, which cause control to return to the system at selected instructions. It is also essential to escape from the switches-and-lights console debugging common on small machines without adequate software. To this end, type-in and type-out of information must be symbolic rather than octal where this is convenient. The goal, which can be very nearly achieved, is to make the symbolic representation of an instruction produced by the system identical to the original symbolic written by the user. The emphasis is on convenience to the user and rapidity of communication. The combination of an assembler and a debugger of this kind is a powerful one which can reduce by a factor of perhaps five the time required to write and debug a machine language program. A full system for interactive machine language programming (IMP), however, can do much more and, if properly designed, need not be more difficult to implement. A Critique of “An Exploratory Investigation of Programmer Performance Under On-Line and Off-Line Conditions” Butler W. Lampson IEEE Trans. Human Factors in Electronics HFE-8, 1 (Mar. 1967), pp 48-51 Abstract The preceding paper by Grant and Sackman, “An Exploratory Investigation of Programmer Performance Under On-Line and Off-Line Conditions” is discussed critically. Primary emphasis is on this paper’s failure to consider the meaning of the numbers obtained. An understanding of the nature of an on-line system is necessary for proper interpretation of the observed results for debugging time, and the results for computer time are critically dependent on the idiosyncrasies of the system on which the work was done. Lack of attention to these matters cannot be compensated for by any amount of statistical analysis. Furthermore, many of the conclusions drawn and suggestions made are too vague to be useful. Dynamic protection structures B. W. Lampson Berkeley Computer Corporation Berkeley, California Proc. AFIPS Conf. 35 (1969), pp 27-38. Introduction A very general problem which pervades the entire field of operating system design is the construction of protection mechanisms. These come in many different forms, ranging from hardware which prevents the execution of input/output instructions by user programs, to password schemes for identifying customers when they log onto a timesharing system. This paper deals with one aspect of the subject, which might be called the meta-theory of protection systems: how can the information which specifies protection and authorizes access, itself be protected and manipulated. Thus, for example, a memory protection system decides whether a program P is allowed to store into location T. We are concerned with how P obtains this permission and how he passes it on to other programs. In order to lend immediacy to the discussion, it will be helpful to have some examples. To provide some background for the examples, we imagine a computation C running on a general multi-access system M. The computation responds to inputs from a terminal or a card reader. Some of these look like commands: to compile file A, load B and print the output double-spaced. Others may be program statements or data. As C goes about its business, it executes a large number of different programs and requires at various times a large number of different kinds of access to the resources of the system and to the various objects which exist in it. It is necessary to have some way of knowing at each instant what privileges the computation has, and of establishing and changing these privileges in a flexible way. We will establish a fairly general conceptual framework for this situation, and consider the details of implementation in a specific system. Part of this framework is common to most modern operating systems; we will summarize it briefly. A program running on the system M exists in an environment created by M, just as does a program running in supervisor state on a machine unequipped with software. In the latter case the environment is simply the available memory and the available complement of machine instructions and input/output commands; since these appear in just the form provided by the hardware designers, we call this environment the bare machine. By contrast, the, environment created by M for a program is called a virtual or user machine. It normally has less memory, differently organized, and an instruction set in which the input/output at least has been greatly changed. Besides the machine registers and memory, a user machine provides a set of objects which can be manipulated by the program. The instructions for manipulating objects are probably implemented in software, but this is of no concern to the user machine program, which is generally not able to tell how a given feature is implemented. The basic object which executes programs is called a task or process; it corresponds to one copy of the user machine. What we are primarily concerned with in this paper is the management of the objects which a process has access to: how are they identified, passed around, created, destroyed, used and shared. Beyond this point, three ideas are fundamental to the framework being developed: 1. Objects are named by capabilities, which are names that are protected by the system in the sense that programs can move them around but not change them or create them in an arbitrary way. As a consequence, possession of a capability can be taken as prima facie proof of the right to access the object it names. 2. A new kind of object called a domain is used to group capabilities. At any time a process is executing in some domain and hence can exercise the capabilities which belong to the domain. When control passes from one domain to another (in a suitably restricted fashion) the capabilities of the process will change. 3. Capabilities are usually obtained by presenting domains which possess them with suitable authorization, in the form of a special kind of capability called an access key. Since a domain can possess capabilities, including access keys, it can carry its own identification. A key property of this framework is that it does not distinguish any particular part of the computation. In other words, a program running in one domain can execute, expand the computation, access files and in general exercise its capabilities without regard to who created it or how far down in any hierarchy it is. Thus, for example, a user program running under a debugging system is quite free to create another incarnation of the debugging system underneath him, which may in turn create another user program which is not aware in any way of its position in the scheme of things. In particular, it is possible to reset things to a standard state in one domain without disrupting higher ones. The reason for placing so much weight on this property is two-fold. First of all, it provides a guarantee that programs can be glued together to make larger programs without elaborate pre-arrangements about the nature of the common environment. Large systems with active user communities quickly build up sizable collections of valuable routines. The large ones in the collections, such as compilers, often prove useful as subroutines of other programs. Thus, to implement language X it may be convenient to translate it into language Y, for which a compiler already exists. The X implementer is probably unaware that Y’s implementation involves a further call on an assembler. If the basic system organization does not allow an arbitrarily complex structure to be built up from any point, this kind of operation will not be feasible. The second reason for concern about extendibility is that it allows deficiencies in the design of the system to be made up without changes in the basic system itself, simply by interposing another layer between the basic system and the user. This is especially important when we realize that different people may have different ideas about the nature of a deficiency. We now have outlined the main ideas of the paper. The remainder of the discussion is devoted to filling them out with examples and explanations. The entire scheme has been developed as part of the operating system for the Berkeley Computer Corporation Model I. Since many details and specific mechanisms are dependent on the characteristics of the surrounding system and underlying hardware, we digress briefly at this point to describe them. On Reliable and Extendible Operating Systems Butler W. Lampson Proc. 2nd NATO Conf. on Techniques in Software Engineering, Rome, 1969. Reprinted in The Fourth Generation, Infotech State of the Art Report 1, 1971, pp 421-444. Introduction A considerable amount of bitter experience in the design of operating systems has been accumulated in the last few years, both by the designers of the systems which are currently in use and by those who have been forced to use them. As a result, many people have been led to the conclusion that some radical changes must be made, both in the way we think about the functions of operating systems and in the way they are implemented. Of course, these are not unrelated topics, but it is often convenient to organize ideas around the two axes of function and implementation. This paper is concerned with an effort to create more flexible and more reliable operating systems built around a very powerful and general protection mechanism. The mechanism is introduced at a low level and is then used to construct the rest of the system, which thus derives the same advantages from its existence as do user programs operating with the system. The entire design is based on two central ideas. The first of these is that an operating system should be constructed in layers, each one of which creates a different and hopefully more convenient environment in which the next higher layer can function. In the lower layers a bare machine provided by the hardware manufacturer is converted into a large number of user machines which are given access to common resources such as processor time and storage space in a controlled manner. In the higher layers these user machines are made easy for programs and users at terminals to operate. Thus, as we rise through the layers, we observe two trends. 1 The consequences of an error become less severe 2 The facilities provided become more elaborate At the lower levels we wish to press the analogy with the hardware machine very strongly; where the integrity of the entire system is concerned the operations provided should be as primitive as possible. This is not to say that the operations should not be complete, but that they need not be convenient. They are to be regarded in the same light as the instructions of a central processor. Each operation may in itself do very little; we require only that the entire collection should be powerful enough to permit more convenient operations to be programmed. The main reason for this dogma is clear enough; simple operations are more likely to work than complex ones and, if failures are to occur, it is very much preferable that they should hurt only one user rather than the entire community. We therefore admit increasing complexity in higher layers, until the user at his terminal may find himself invoking extremely elaborate procedures. The price to be paid for low level simplicity is also clear: additional time to interpret many simple operations and storage to maintain multiple representations of essentially the same information. We shall return to these points below. It is important to note that users of the system other than the designers need not suffer any added inconvenience from its adherence to the dogma, since the designers can very well supply, at a higher level, programs that simulate the action of the powerful low level operations to which users may be accustomed. Users do, in fact, profit from the fact that a different set of operations can be programmed if the ones provided by the designer prove unsatisfactory. This point also will receive further attention. The increased reliability that we hope to obtain from an application of the above ideas has two sources. In the first place, careful specification of an orderly hierarchy of operations will clarify what is going on in the system and make it easier to understand. This is the familiar idea of modularity. Equally important, however, is a second and less familiar point, the other pillar of our system design, which might loosely be called enforced modularity. It is this: if interactions between layers or modules can be forced to take place through defined paths only, then the integrity of one layer can be assured regardless of the deviations of a higher one. The requirement that no possible action of a higher layer, whether accidental or malicious, can affect the functioning of a lower one is a strong one. In general, hardware assistance will be required to achieve this goal, although in some cases the discipline imposed by a language such as ALGOL, together with suitable checks on the validity of subscripts, may suffice. The reward is that errors can be localized very precisely. Protection And Access Control In Operating Systems Butler W. Lampson In Operating Systems, Infotech State of the Art Report 14, 1972, pp 311-326 Introduction I should like to explain what protection and access control is all about in a form that is general enough to make it possible to understand all the existing forms of protection and access control that we see in existing systems, and perhaps to see more clearly than we can now the relationships between all the different ways that now exist for providing protection in a computer system. Just in case you are not aware of how many different ways there are, let me suggest a few that you might find in a typical system: 1 a key switch on a console 2 a monitor and user mode mechanism in the hardware of the machine that decides whether a given program can execute input/output instructions 3 a memory protection scheme that attaches 4-bit tags to each block of memory and decides whether or not particular programs can address that block 4 facilities in the file system for controlling access; such a system allows a number of entities, normally referred to as users, to exist in the system and controls the way in which data belonging to one user can be accessed by other users. All these mechanisms are mostly independent, and superficially they look almost completely different. However, it turns out that there is a fair amount of unity at the basic conceptual level in the way in which protection is done, although there is enormous efflorescence in implementations. I should like to make it clear precisely what I am not talking about in discussing protection of computer systems. I am concerned only about what goes on inside the system; the question of how the system identifies someone who proposes himself as a user I do not intend to consider, although it is of course an important problem. There is a basic idea underlying the whole apparatus of protection and access control; it is the idea of a program being executed in a certain context that determines what the program is authorized to do. For example, if you have a program running on System/360 hardware, there is a bit in the physical machine that determines whether the machine is in supervisor state or in problem state. On the Transfer of Control between Contexts B. W. Lampson, J. G. Mitchell and E. H. Satterthwaite Xerox Research Center 3180 Porter Drive Palo Alto, CA 94304, USA Lecture Notes in Computer Science 19, Springer, 1974, pp 181-203 Abstract We describe a single primitive mechanism for transferring control from one module to another, and show how this mechanism, together with suitable facilities for record handling and storage allocation, can be used to construct a variety of higher-level transfer disciplines. Procedure and function calls, coroutine linkages, non-local gotos, and signals can all be specified and implemented in a compatible way. The conventions for storage allocation and name binding associated with control transfers are also under the programmer’s control. Two new control disciplines are defined: a generalization of coroutines, and a facility for handling errors and unusual conditions which arise during program execution. Examples are drawn from the Modular Programming Language, in which all of the facilities described are being implemented. Report on the Programming Language Euclid J. Horning, B. Lampson, R. London, J. Mitchell, and G. Popek SIGPLAN Notices, February 1977. Revised as Technical Report CSL-81-12, Xerox Palo Alto Research Center. Introduction “There is no royal road to geometry.” —Proclus, Comment on Euclid, Prol. G. 20. The programming language Euclid has been designed to facilitate the construction of verifiable system programs. By a verifiable program we mean one written in such a way that existing formal techniques for proving certain properties of programs can be readily applied; the proofs might be either manual or automatic, and we believe that similar considerations apply in both cases. By system we mean that the programs of interest are part of the basic software of the machine on which they run; such a program might be an operating system kernel, the core of a data base management system, or a compiler. An important consequence of this goal is that Euclid is not intended to be a generalpurpose programming language. Furthermore, its design does not specifically address the problems of constructing very large programs; we believe most of the programs written in Euclid will be modest in size. While there is some experience suggesting that verifiability supports other desired goals, we assume the user is willing, if necessary, to obtain verifiability by giving up some run-time efficiency, and by tolerating some inconvenience in the writing of his programs. We see Euclid as a (perhaps somewhat eccentric) advance along one of the main lines of current programming language development: transferring more and more of the work of producing a correct program, and verifying its correctness, from the programmer and the verifier (human or mechanical) to the language and its compiler. The main changes relative to Pascal take the form of restrictions, which allow stronger statements about the properties of the program to be made from the rather superficial, but quite reliable, analysis which the compiler can perform. In some cases new constructions have been introduced, whose meaning can be explained by expanding them in terms of existing Pascal constructions. The reason for this is that the expansion would be forbidden by the newly introduced restrictions, whereas the new construction is itself sufficiently restrictive in a different way. The main differences between Euclid and Pascal are summarized in the following list: Visibility: Euclid provides explicit control over the visibility of identifiers, by requiring the program to list all the identifiers imported into a routine or module, or .exported from a module. Variables: The language guarantees that two identifiers in the same scope can never refer to the same or overlapping variables. There is a uniform mechanism for binding an identifier to a variable in a procedure call, on block entry (replacing the Pascal with statement), or in a variant record discrimination. The variables referenced or modified by a routine (i.e., procedure or function) must be accessible in every scope from which the routine is called. Pointers: This idea is extended to pointers, by allowing dynamic variables to be assigned to collections, and guaranteeing that two pointers into different collections can never refer to the same variable. Storage allocation: The program can control the allocation of storage for dynamic variables explicitly, in a way which confines the opportunity for making a type error very narrowly. It is also possible to declare that some dynamic variables should be reference-counted, and automatically deallocated when no pointers to them remain. Types: Types have been generalized to allow formal parameters, so that arrays can have bounds which are fixed only when they are created, and variant records can be bandied in a type-safe manner. Records are generalized to include constant components. Modules: A new kind of record, called a module, can contain routine and type components, and thus provides a facility for modularization. The module can include initialization and finalization statements which are executed whenever a module variable is created or destroyed. Constants: Euclid defines a constant to be a literal, or an identifier whose value is fixed throughout the scope in which it is declared. For statement: A generator can be declared as a module type, and used in a for statement to enumerate a sequence of values. Loopholes: features of the underlying machine can be accessed, and the typechecking can be overridden, in a controlled way. Except for the explicit loopholes, Euclid is designed to be type-safe. Assertions: the syntax allows assertions to be supplied at convenient points. Deletions: A number of Pascal features have been omitted from Euclid: input-output, reals, multi-dimensional arrays, labels and gotos, and functions and procedures as parameters. The only new features in the list which can make it hard to convert a Euclid program into a legal Pascal program by straightforward rewriting are parameterized types, storage allocation, finalization, and some of the loopholes. There are a number of other considerations which influenced the design of Euclid: It is based on current knowledge of programming languages and compilers; concepts which are not fairly well understood, and features whose implementation is unclear, have been omitted. Although program portability is not a major goal of the language design, it is necessary to have compilers which generate code for a number of different machines, including mini-computers. The object code must be reasonably efficient, and the language must not require a highly optimizing compiler to achieve an acceptable level of efficiency in the object program. Since the total size of a program is modest, separate compilation is not required (although it is certainly not ruled out). The required run time support must be minimal, since it presents a serious problem for verification. Notes On The Design Of Euclid G. J. Popek UCLA Computer Science Department Los Angeles, California 90024 J. J. Horning Computer Systems Research Group University of Toronto Toronto, Canada B. W. Lampson, J. G. Mitchell Xerox Palo Alto Research Palo Alto, California 94304 R. L. London USC Information Sciences Institute Marina del Rey, California 90291 ACM Sigplan Notices 12, 3 (Mar. 1977), pp 11-18 Abstract Euclid is a language for writing system programs that are to be verified. We believe that verification and reliability are closely related, because if it is hard to reason about programs using a language feature, it will be difficult to write programs that use it properly. This paper discusses a number of issues in the design of Euclid, including such topics as the scope of names, aliasing, modules, type-checking, and the confinement of machine dependencies; it gives some of the reasons for our expectation tnat programming in Euclid will be more reliable (and will produce more reliable programs) than programming in Pascal, on which Euclid is based. Key Words and Phrases: Euclid, verification, systems programming language, reliability, Pascal, aliasing, data encapsulation, parameterized types, visibility of names, machine dependencies, legality assertions, storage allocation. CR Categories: 4.12, 4.2, 4.34, 5.24 Introduction Euclid is a programming language evolved from Pascal by a series of changes intended to make it more suitable for verification and for system programming. We expect many of these changes to improve the reliability of the programming process, firstly by enlarging the class of errors that can be detected by the compiler, and secondly by making explicit in the program text more of the information needed for understanding and maintenance. In addition, we expect that effort expended in program verification will directly improve program reliability. Although Euclid is intended for a rather restricted class of applications, much of what we have done could surely be extended to languages designed for more general purposes. Like all designs, Euclid represents a compromise among conflicting goals, reflecting the skills, knowledge, and tastes (i.e., prejudices) of its designers. Euclid was conceived as an attempt to integrate into a practical language the results of several recent developments in programming methodology and program verification. As Hoare has pointed out, it is considerably more difficult to design a good language than it is to select one’s favorite set of good language features or to propose new ones. A language is more than the sum of its parts, and the interactions among its features are often more important than any feature considered separately. Thus this paper does not present many new language features. Rather, it discusses several aspects of our design that, taken together, should improve the reliability of programming in Euclid. We believe that the goals of reliability, understandability, and verifiability are mutually reinforcing. We never consciously sacrificed one of these in Euclid to achieve another. We had a tangible measure only for the third (namely, our ability to write reasonable proof rules, so we frequently used it as the touchstone for all three. Much of this paper is devoted to decisions motivated by the problems of verification. Another important goal of Euclid, the construction of acceptably efficient system programs, did not seem attainable without some sacrifices in the preceding three goals. Much of the language design effort was expended in finding ways to allow the precise control of machine resources that seemed to be necessary, while narrowly confining the attendant losses of reliability, understandability, and verifiability. The focus here is on features that contribute to reliability. Goals, History, and Relation To Pascal The chairman originally charged the committee as follows: “Let me outline our charter as I understand it. We are being asked to make minimal changes and extensions to Pascal in order to make the resulting language one that would be suitable for systems programming while retaining those characteristics of the language that are attractive for good programming style and verification. Because it is highly desirable that the language and appropriate compilers be available in a short time, the language definition effort is to be quite limited: only a month or two in duration. Therefore, we should not attempt to design a significantly different language, for that, while highly desirable, is a research project in itself. Instead, we should aim at a ‘good’ result, rather than the superb.” We defer to the Conclusions a discussion of our current feelings about these goals and how well we have met them. The design of Euclid took place at four two-day meetings of the authors in 1976, supplemented by a great deal of individual effort and uncounted Arpanet messages. Almost all of the basic changes to Pascal were agreed upon during the first meeting; most of the effort since then has been devoted to smoothing out unanticipated interactions among the changes and to developing a suitable exposition of the language. Three versions of the Euclid Report have been widely circulated for comment and criticism; the most recent appeared in the February 1977 Sigplan Notices. Proof rules are currently being prepared for publication. A Terminal-Oriented Communication System Paul G. Heckel Interactive Systems Consultants Butler W. Lampson Xerox Palo Alto Research Center Communications of the ACM, July 1977 Volume 20, Number 7 Abstract This paper describes a system for full-duplex communication between a time-shared computer and its terminals. The system consists of a communications computer directly connected to the time-shared system, a number of small remote computers to which the terminals are attached, and connecting medium speed telephone lines. It can service a large number of terminals of various types. The overall system design is presented along with the algorithms used to solve three specific problems: local echoing, error detection and correction on the telephone lines, and multiplexing of character output. Key Words and Phrases: terminal system, error correction, multiplexing, local echoing, communication system, network CR Categories: 3.81, 4.31 Introduction A number of computer communication systems have been developed in the last few years. The best known such system is the Arpanet, which provides a 50-kilobit network interconnecting more than 40 computers. By contrast, the system described in this paper connects computers, with terminals, rather than with each other. Such terminal networks are of interest because there are many requirements for connecting a number of geographically distributed terminals with a centrally located computer, and because terminal networks can use medium speed (2400-9600 baud) telephone lines which have reasonable cost and wide availability. Another system tackling essentially the same problem is Tymnet, which was developed at about the same time as the system described here. Of course, a general-purpose network like the Arpanet can (and does) carry terminal traffic. Our system was designed to connect (presumably remote) low and medium speed devices, such as teletypes and line printers, to the Berkeley Computer Corporation’s BCC-500 computer system. The basic service provided is a full-duplex channel between a user’s terminal and his program running on the BCC-500 CPU. The design objectives were to make the system efficient in the use of bandwidth and resistant to telephone line errors, while keeping it flexible so that a wide variety of devices could be handled. The paper provides an overall description of what the BCC terminal system does and how it does it. In addition, it presents in detail the solutions to three specific problems: local echoing (see Section 2); error detection and correction on the multiplexed telephone line (see Section 4.1); and output multiplexing (see Section 4.2). The structure of the system is shown in Figure 1. The hardware components, named along the heavy black line, are —A CPU on which user programs execute; — A central dedicated processor called the CHIO which handles all characteroriented input-output to the CPU; — A number of small remote computers called concentrators to which terminals are connected, either directly or via standard low-speed modems and telephone lines; and — Leased voice-grade telephone lines with medium-speed (e.g. 4800 baud) modems which connect the concentrators to the CHIO. The system is organized as a collection of parallel processes which communicate by sending messages to each other. In some cases the processes run in the same processor and the parallelism is provided by a scheduler or coroutine linkage, but it is convenient to ignore such details in describing the logical structure. Proof Rules for the Programming Language Euclid R.L. London, J.V. Guttag, J.J. Horning, B.W. Lampson, J.G. Mitchell, and G.J. Popek Acta Informatica 10, 1-26 (1978) Summary In the spirit of the previous axiomatization of the programming language Pascal, this paper describes Hoare-style proof rules for Euclid, a programming language intended for the expression of system programs which are to be verified. All constructs of Euclid are covered except for storage allocation and machine dependencies. “The symbolic form of the work has been forced upon us by necessity: without its help we should have been unable to perform the requisite reasoning.” —A.N. Whitehead and B. Russell, Principia Mathematica, p. vii “Rules are rules.” —Anonymous Introduction The programming language Euclid has been designed to facilitate the construction of verifiable system programs. Its defining report closely follows the defining report of the Pascal language (see also). The present document, giving Hoare-style proof rules applicable only to legal Euclid programs, owes a great deal to (and is in part identical to) the axiomatic definition of Pascal. Major differences include the treatment of procedures and functions, declarations, modules, collections, escape statements, binding, parameterized types, and the examples and detailed explanations in Appendices 1-3. Other semantic definition methods are certainly applicable to Euclid. We have used proof rules for two reasons: familiarity and the existence of the Pascal definition. One may regard the proof rules as a definition of Euclid in the same sense as the Pascal axiomatization defines Pascal. By stating what can be proved about Euclid programs, the rules define the meaning of most syntactically and semantically legal Euclid programs, but they do not give the information required to determine whether or not a program is legal. This information may be found in the language report. Neither do the proof rules define the meaning of illegal Euclid programs containing, for example, division by zero or an invalid array index. Finally, explicit proof rules are not provided for those portions of Euclid defined in the report by translation into other legal Euclid constructs. This includes pervasive, implicit imports through thus, and some uses of return and exit. All such transformations must be applied before the proof rules are applicable. As is the case with Pascal, the Euclid axiomatization should be read in conjunction with the language report, and is an almost total axiomatization of a realistic and useful system programming language. While the primary goal of the Euclid effort was to design a practical programming language (not to provide a vehicle for demonstrating proof rules), proof rule considerations did have significant influence on Euclid. All constructs of the language are covered except for storage allocation (zones and collections that are not reference-counted) and machine dependencies. In a few instances rules are applicable only to a subset of Euclid; the restrictions are noted with those rules. Crash recovery in a distributed data storage system Butler W. Lampson and Howard E. Sturgis Xerox Palo Alto Research Center 3333 Coyote Hill Road / Palo Alto / California 94 June 1, 1979 DRAFT - Not for distribution - DRAFT Abstract An algorithm is described which guarantees reliable storage of data in a distributed system, even when different portions of the data base, stored on separate machines, are updated as part of a single transaction. The algorithm is implemented by a hierarchy of rather simple abstractions, and it works properly regardless of crashes of the client or servers. Some care is taken to state precisely the assumptions about the physical components of the system (storage, processors and communication). Key Words And Phrases atomic, communication, consistency, data base, distributed computing, fault-tolerance, operating system, recovery, reliability, transaction, update. Introduction We consider the problem of crash recovery in a data storage system which is constructed from a number of independent computers. The portion of the system which is running on some individual computer may crash, and then be restarted by some crash recovery procedure. This may result in the loss of some information which was present just before the crash. The loss of this information may, in turn, lead to an inconsistent state for the information permanently stored in the system. For example, a client program may use this data storage system to store balances in an accounting system. Suppose that there are two accounts, called A and B, which contain $10 and $15 respectively. Further, suppose the client wishes to move $5 from A to B. The client might proceed as follows: read account A (obtaining $10) read account B (obtaining $15) write $5 to account A write $20 to account B Now consider a possible effect of a crash of the system program running on the machine to which these commands are addressed. The crash could occur after one of the write commands has been carried out, but before the other has been initiated. Moreover, recovery from the crash could result in never executing the other write command. In this case, account A is left containing $5 and account B with $15, an unintended result. The contents of the two accounts are inconsistent. There are other ways in which this problem can arise: accounts A and B are stored on two different machines and one of these machines crashes; or, the client itself crashes after issuing one write command and before issuing the other. In this paper we present an algorithm for maintaining the consistency of a file system in the presence of these possible errors. We begin, in section 2, by describing the kind of system to which the algorithm is intended to apply. In section 3 we introduce the concept of an atomic transaction. We argue that if a system provides atomic transactions, and the client program uses them correctly, then the stored data will remain consistent. The remainder of the paper is devoted to describing an algorithm for obtaining atomic transactions. Any correctness argument for this (or any other) algorithm necessarily depends on a formal model of the physical components of the system. Such models are quite simple for correctly functioning devices. Since we are interested in recovering from malfunctions, however, our models must be more complex. Section 4 gives models for storage, processors and communication, and discusses the meaning of a formal model for a physical device. Starting from this base, we build up the lattice of abstractions shown in figure 1. The second level of this lattice constructs better behaved devices from the physical ones, by eliminating storage failures and eliminating communication entirely (section 5). The third level consists of a more powerful primitive which works properly in spite of crashes (section 6). Finally, the highest level constructs atomic transactions (section 7). Parallel to the lattice of abstractions is a sequence of methods for constructing compound actions with various desirable properties. A final section discusses some efficiency and implementation considerations. Throughout we give informal arguments for the correctness of the various algorithms. Atomic transactions Butler W. Lampson and Howard Sturgis Distributed Systems—Architecture and Implementation, Lecture Notes in Computer Science 105, Springer, 1981, Chapter 11 Introduction This chapter deals with methods for performing atomic actions on a collection of computers, even in the face of such adverse circumstances as concurrent access to the data involved in the actions, and crashes of some of the computers involved. For the sake of definiteness, and because it captures the essence of the more general problem, we consider the problem of crash recovery in a data storage system which is constructed from a number of independent computers. The portion of the system which is running on some individual computer may crash, and then be restarted by some crash recovery procedure. This may result in the loss of some information which was present just before the crash. The loss of this information may, in turn, lead to an inconsistent state for the information permanently stored in the system. For example, a client program may use this data storage system to store balances in an accounting system. Suppose that there are two accounts, called A and B, which contain $10 and $15 respectively. Further, suppose the client wishes to move $5 from A to B. The client might proceed as follows: read account A (obtaining $10) read account B (obtaining $15) write $5 to account A write $20 to account B Now consider a possible effect of a crash of the system program running on the machine to which these commands are addressed. The crash could occur after one of the write commands has been carried out, but before the other has been initiated. Moreover, recovery from the crash could result in never executing the other write command. In this case, account A is left containing $5 and account B with $15, an unintended result. The contents of the two accounts are inconsistent There are other ways in which this problem can arise: accounts A and B are stored on two different machines and one of these machines crashes; or, the client itself crashes after issuing one write command and before issuing the other. In this chapter we present an algorithm for maintaining the consistency of a file system in the presence of these possible errors. We begin, in section 11.2, by describing the kind of system to which the algorithm is intended to apply. In section 11.3 we introduce the concept of an atomic transaction. We argue that if a system provides atomic transactions, and the client program uses them correctly, then the stored data will remain consistent The remainder of the chapter is devoted to describing an algorithm for obtaining atomic transactions. Any correctness argument for this (or any other) algorithm necessarily depends on a formal model of the physical components of the system. Such models are quite simple for correctly functioning devices. Since we are interested in recovering from malfunctions, however, our models must be more complex. Section 11.4 gives models for storage, processors and communication, and discusses the meaning of a formal model for a physical device. Starting from this base, we build up the lattice of abstractions shown in figure 11-1. The second level of this lattice constructs better behaved devices from the physical ones, by eliminating storage failures and eliminating communication entirely (section 11.5). The third level consists of a more powerful primitive which works properly in spite of crashes (section 11.6). Finally, the highest level constructs atomic transactions (section 11.7). Throughout we give informal arguments for the correctness of the various algorithms. Figure 11-1: The lattice of abstractions for transactions Remote Procedure Calls Butler W. Lampson Distributed Systems-Architecture and Implementation, Lecture Notes in Computer Science 105, Springer, 1981, 357-370 Parameter and data representation In this section we consider in a broader context the problems of parameter and data representation in distributed systems. Our treatment will be somewhat abstract and, unfortunately, rather superficial. This is really a programming language design problem, although in practice it is not usually addressed in that context. The reason is that a single bit has the same representation everywhere (except at low levels of abstraction which do not concern us here). An integer or a floating poini number, lo say nothing of a relational data base, may be represented by very different sets of bits on different machines, but it only makes sense to talk aboul the representation of data when iis type is known. Thus our problem is to define a common notion of data types and suitable ways of transforming from one representation of a type to another. This kind of problem is customarily addressed in the context of language design, and we shall find it convenient to do so here. In fact, we shall confine ourselves here to the problem of a single procedure call, possibly directed to a remote site. The type of the procedure is expressed by a declaration: P: procedure(a1: T1, a2: T2,...) returns (r1 U1, r2: U2,...). We shall sometimes write (a1: T1, a2: T2 , ...) → (r1 U1, r2: U2, ...)for short. Of course, sending a message and receiving a reply can be described in the same way, as far as the representation of the data is concerned. The control flow aspects of orderly communication with a remote site are discussed in Chapter 7 and Section 14.8; here we are interested only in data representation. The function of the declaration is to make explicit what the argument and result types must be, so that the caller and callee can agree on this point, and so that there is enough information for an automatic mechanism to have a chance of making any necessary conversions. For this reason we insist that remote procedures must have complete declarations. We assert without proof that any data representation problem can be cast in this form without doing violence to its essence. Remote procedure calls One of the major problems in constructing distributed programs is to abstract out the complications which arise from the transmission errors, concurrency and partial failures inherent in a distributed system. If these are allowed to appear in their full glory at the applications level, life becomes so complicated that there is little chance of getting anything right. A powerful tool for this purpose is the idea of a remote procedure call. If it is possible to call procedures on remote machines with the same semantics as ordinary local calls, the application can be written without concern for most of the complications This goal cannot be fully achieved without the transaction mechanism of Chapter 11, but we show in this section how to obtain the same semantics, except that the action of the remote call may occur more than once. Our treatment uses the definitions of Sections 11.3 and 11.4 for processor and communication failures and stable storage. Following Lamport, we write a → b to indicate that the event a precedes the event 4; this can happen because they are both in the same process, or because they are both references to the same datum, or as the transitive closure of these immediate relations. Recall that physical communication is modeled by a datum called the message; transmitting a message from one process to another involves two transitions in the medium, the Send and the Receive. A remote procedure call consists of several events. There is the main call, which consists of the following events: the call c; the start s, the work w; the end e; the return r. The ones on the left occur in the calling machine, the ones on the right in the machine being called. The events which detail the message transmission have been absorbed into the ones listed. These events occur in the order indicated (i.e., each precedes the next). In addition, there may be orphan events o in the receiver, consisting of any prefix of the transitions indicated. These occur because of duplicated call messages, which can arise from failures in the communication medium, or from timeouts or crashes in the caller which are followed by a retry. The orphans all follow the call and precede the rest of the main call. Difficulties arise, however, in guaranteeing the order of the orphans relative to the rest of the main call. Practical Use of a Polymorphic Applicative Language Butler W. Lampson and Eric E. Schmidt Computer Science Laboratory Xerox Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, CA 94304 Abstract Assembling a large system from its component elements is not a simple task. An adequate notation for specifying this task must reflect the system structure, accommodate many configurations of the system and many versions as it develops, and be a suitable input to the many tools that support software development. The language described here applies the ideas of A-abstraction, hierarchical naming and type-checking to this problem. Some preliminary experience with its use is also given. Introduction Assembling a large system from its component elements is not a simple task. The subject of this paper is a language for describing how to do this assembly. We begin with a necessarily terse summary of the issues, in order to establish a context for the work described later. A large system usually exhibits a complex structure. It has many configurations, different but related. Each configuration is composed of many elements. The elements have complex interconnections. They form a hierarchy: each may itself be a system. The system develops during a long period of design and implementation, and an even longer period of maintenance. Each element is changed many times. Hence there are many versions of each element Certain sets of compatible versions form releases. Many tools can be useful in this development. Translators build an executable program. Accelerators speed up rebuilding after a change. Debuggers display the state of the program. Databases collect and present useful information. The System Modeling language (SMI. for short) is a notation for describing how to compose a set of related system configurations from their elements. A description in SML is called a model of the system. The development of a system can be described by a collection of models, one for each stage in the development; certain models define releases. A model contains all the information needed by a development tool; indeed, a tool can be regarded as a useful operator on models, e.g., build system, display state or structure, print source files, cross-reference, etc. In this paper we present the ideas which underlie SML, define its syntax and semantics, discuss a number of pragmatic issues, and give several examples of its use. We neglect development (changes, versions, releases) and tools; these are the subjects of a companion paper. More information about system modeling can be found in the second author’s PhD thesis, along with a discussion of related work. An Instruction Fetch Unit for a High-Performance Personal Computer BUTLER W. LAMPSON, GENE McDANIEL, AND SEVERO M. ORNSTEIN IEEE Transactions On Computers, Vol. C-33, No. 8, August 1984 Abstract The instruction fetch unit (IFU) of the Dorado personal computer speeds up the emulation of instructions by prefetching, decoding, and preparing later instructions in parallel with the execution of earlier ones. It dispatches the machine’s microcoded processor to the proper starting address for each instruction, and passes the instruction’s fields to the processor on demand. A writeable decoding memory allows the IFU to be specialized to a particular instruction set, as long as the instructions are an integral number of bytes long. There are implementations of specialized instruction sets for the Mesa, Lisp, and Smalltalk languages. The IFU is implemented with a six-stage pipeline, and can decode an instruction every 60 ns. Under favorable conditions the Dorado can execute instructions at this peak rate (16 MIPS). Index Terms —Cache, emulation, instruction fetch, microcode, pipeline. Introduction THIS paper describes the instruction fetch unit (IFU) for the Dorado, a powerful personal computer designed to meet the needs of computing researchers at the Xerox Palo Alto Research Center. These people work in many areas of computer science: programming environments, automated office systems, electronic filing and communication, page composition and computer graphics, VLSI design aids, distributed computing, etc. There is heavy emphasis on building working prototypes. The Dorado preserves the important properties of an earlier personal computer, the Alto 113), while removing the space and speed bottlenecks imposed by that machine’s 1973 design. The history, design goals, and general characteristics of the Dorado are discussed in a companion paper, which also describes its microprogrammed processor. A second paper describes the memory system. The Dorado is built out of ECL 10K circuits. It has 16 bit data paths, 28 bit virtual addresses, 4K-16K words of high-speed cache memory, writeable microcode, and an I/O bandwidth of 530 Mbits/s. Fig. 1 shows a block diagram of the machine. The microcoded processor can execute a microinstruction every 60 ns. An instruction of some high-level language is performed by executing a suitable succession of these microinstructions; this process is called emulation. The purpose of the IFU is to speed up emulation by prefetching, decoding, and preparing later instructions in parallel with the execution of earlier ones. It dispatches the machine’s microcoded processor to the proper starting address for each instruction, supplies the processor with an assortment of other useful information derived from the instruction, and passes its various fields to the processor on demand. A writeable decoding memory allows the IFU to be specialized to a particular instruction set; there is room for four of these, each with 256 instructions. There are implementations of specialized instruction sets for the Mesa, Lisp, and Smalltalk languages, as well as an Alto emulator. The IFU can decode an instruction every 60 ns, and under favorable conditions the Dorado can execute instructions at this peak rate (16 MIPS). Following this introduction, we discuss the problem of instruction execution in general terms and outline the space of possible solutions (Section II). We then describe the architecture of the Dorado’s IFU (Section 111) and its interactions with the processor which actually executes the instructions (Section IV); the reader who likes to see concrete details might wish to read these sections in parallel with Section II. The next section deals with the internals of the IFU, describing how to program it and the details of its pipelined implementation (Section V). A final section tells how big and how fast it is, and gives some information about the effectiveness of its various mechanisms for improving performance (Section VI). A Kernel Language for Modules and Abstract Data Types R. Burstall and B. Lampson University of Edinburgh and Xerox Palo Alto Research Center Abstract: A small set of constructs can simulate a wide variety of apparently distinct features in modern programming languages. Using a kernel language called Pebble based on the typed lambda calculus with bindings, declarations, and types as first-class values, we show how to build modules, interfaces and implementations, abstract data types, generic types, recursive types, and unions. Pebble has a concise operational semantics given by inference rules. Specifying Distributed Systems Butler W. Lampson Cambridge Research Laboratory Digital Equipment Corporation One Kendall Square Cambridge, MA 02139 October 1988 In Constructive Methods in Computer Science, ed. M. Broy, NATO ASI Series F: Computer and Systems Sciences 55, Springer, 1989, pp 367-396 These notes describe a method for specifying concurrent and distributed systems, and illustrate it with a number of examples, mostly of storage systems. The specification method is due to Lamport (1983, 1988), and the notation is an extension due to Nelson (1987) of Dijkstra’s (1976) guarded commands. We begin by defining states and actions. Then we present the guarded command notation for composing actions, give an example, and define its semantics in several ways. Next we explain what we mean by a specification, and what it means for an implementation to satisfy a specification. A simple example illustrates these ideas. The rest of the notes apply these ideas to specifications and implementations for a number of interesting concurrent systems: Ordinary memory, with two implementations using caches; Write buffered memory, which has a considerably weaker specification chosen to facilitate concurrent implementations; Transactional memory, which has a weaker specification of a different kind chosen to facilitate fault-tolerant implementations; Distributed memory, which has a yet weaker specification than buffered memory chosen to facilitate highly available implementations. We give a brief account of how to use this memory with a tree-structured address space in a highly available naming service. Thread synchronization primitives. On-line Data Compression in a Log-structured File System Michael Burrows Charles Jerian Butler Lampson Timothy Mann DEC Systems Research Center Abstract We have incorporated on-line data compression into the low levels of a log-structured file system (Rosenblum’s Sprite LFS). Each block of data or meta-data is compressed as it is written to the disk and decompressed as it is read. The log-structuring overcomes the problems of allocation and fragmentation for variable-sized blocks. We observe compression factors ranging from 1.6 to 2.2, using algorithms running from 1.7 to 0.4 MBytes per second in software on a DECstation 5000/200. System performance is degraded by a few percent for normal activities (such as compiling or editing), and as much as a factor of 1.6 for file system intensive operations (such as copying multimegabyte files). Hardware compression devices mesh well with this design. Chips are already available that operate at speeds exceeding disk transfer rates, which indicates that hardware compression would not only remove the performance degradation we observed, but might well increase the effective disk transfer rate beyond that obtainable from a system without compression. 1 Introduction Building a file system that compresses the data it stores on disk is clearly an attractive idea. First, more data would fit on the disk. Also, if a fast hardware data compressor could be put into the data path to disk, it would increase the effective disk transfer rate by the compression factor, thus speeding up the system. Yet on-line data compression is seldom used in conventional file systems, for two reasons. First, compression algorithms do not provide uniform compression on all data. When a file block is overwritten, the new data may be compressed by a different amount from the data it supersedes. Therefore the file system cannot simply overwrite the original blocks—if the new data is larger than the old, it must be written to a place where there is more room; if it is smaller, the file system must either find some use for the freed space or see it go to waste. In either case, disk space tends to become fragmented, which reduces the effective compression. Second, the best compression algorithms are adaptive—they use patterns discovered in one part of a block to do a better job of compressing information in other parts. These algorithms work better on large blocks of data than on small blocks. The details vary for different compression algorithms and different data, but the overall trend is the same— larger blocks make for better compression. However, it is difficult to arrange for sufficiently large blocks of data to be compressed all at once. Most file systems use block sizes that are too small for good compression, and increasing the block size would waste disk space in fragmentation. Compressing multiple blocks at a time seems difficult to do efficiently, since adjacent blocks are often not written at the same time. Compressing whole files would also be less than ideal, since in many systems most files are only a few kilobytes. In a log-structured file system, the main data structure on disk is a sequentially written log. All new data, including modifications to existing files, is written to the end of the log. This technique has been demonstrated by Rosenblum and Ousterhout in a system called Sprite LFS. The main goal of LFS was to provide improved performance by eliminating disk seeks on writes. In addition, LFS is ideally suited for adding compression—we simply compress the log as it is written to disk. No blocks are overwritten, so we do not have to make special arrangements when new data does not compress as well as existing data. Because blocks are written sequentially, the compressed blocks can be packed tightly together on disk, eliminating fragmentation. Better still, we can choose to compress blocks of any size we like, and if many related small files are created at the same time, they will be compressed together, so any similarities between the files will lead to better compression. We do, however, need additional bookkeeping to keep track of where compressed blocks fall on the disk. Evolving the High Performance Computing and Communications Initiative to Support the Nation’s Information Infrastructure Computer Science and Telecommunications Board 1995 (The Brooks-Sutherland report) Executive Summary Information technology drives many of today’s innovations and offers still greater potential for further innovation in the next decade. It is also the basis for a domestic industry of about $500 billion,1 an industry that is critical to our nation’s international competitiveness. Our domestic information technology industry is thriving now, based to a large extent on an extraordinary 50-year track record of public research funded by the federal government, creating the ideas and people that have let industry flourish. This record shows that for a dozen major innovations, 10 to 15 years have passed between research and commercial application. Despite many efforts, commercialization has seldom been achieved more quickly. Publicly funded research in information technology will continue to create important new technologies and industries, some of them unimagined today, and the process will continue to take 10 to 15 years. Without such research there will still be innovation, but the quantity and range of new ideas for U.S. industry to draw from will be greatly diminished. Public research, which creates new opportunities for private industry to use. should not be confused with industrial policy, which chooses firms or industries to support. Industry, with its focus mostly on the near term, cannot take the place of government in supporting the research that will lead to the next decade’s advances. The High Performance Computing and Communications Initiative (HPCCI) is the main vehicle for public research in information technology today and the subject of this report. By the early 1980s, several federal agencies had developed independent programs to advance many of the objectives of what was to become the HPCCI. The program received added impetus and more formal status when Congress passed the High Performance Computing Act of 1991 (Public Law 102-194) authorizing a 5-year program in high-performance computing and communications. The initiative began with a focus on high-speed parallel computing and networking and is now evolving to meet the needs of the nation for widespread use on a large scale as well as for high speed in computation and communications. To advance the nation’s information infrastructure there is much that needs to be discovered or invented, because a useful “information highway” is much more than wires to every house. As a prelude to examining the current status of the HPCCI, this report first describes the rationale for the initiative as an engine of U.S. leadership in information technology and outlines the contributions of ongoing publicly funded research to past and current progress in developing computing and communications technologies (Chapter 1). It then describes and evaluates the HPCCI’s goals, accomplishments, management, and planning (Chapter 2). Finally, it makes recommendations aimed at ensuring continuing U.S. leadership in information technology through wise evolution and use of the HPCCI as an important lever (Chapter 3). Appendixes A through F of the report provide additional details on and documentation for points made in the main text. IP Lookups using Multiway and Multicolumn Search Butler Lampson, V. Srinivasan and George Varghese ACM Transactions on Networking, 7, 3 (June 1999), pp 324-334 (also in Infocom 98, April 1998) Abstract IP address lookup is becoming critical because of increasing routing table size, speed, and traffic in the Internet. Our paper shows how binary search can be adapted for best matching prefix using two entries per prefix and by doing precomputation. Next we show how to improve the performance of any best matching prefix scheme using an initial array indexed by the first X bits of the address. We then describe how to take advantage of cache line size to do a multiway search with 6-way branching. Finally, we show how to extend the binary search solution and the multiway search solution for IPv6. For a database of N prefixes with address length W, naive binary search scheme would take O(W log N); we show how to reduce this to O(W + log N) using multiple column binary search. Measurements using a practical (Mae-East) database of 30000 entries yield a worst case lookup time of 490 nanoseconds, five times faster than the Patricia trie scheme used in BSD UNIX. Our scheme is attractive for IPv6 because of small storage requirement (2N nodes) and speed (estimated worst case of 7 cache line reads) Keywords—Longest Prefix Match, IP Lookup Introduction Statistics show that the number of hosts on the internet is tripling approximately every two years. Traffic on the Internet is also increasing exponentially. Traffic increase can be traced not only to increased hosts, but also to new applications (e.g., the Web, video conferencing, remote imaging) which have higher bandwidth needs than traditional applications. One can only expect further increases in users, hosts, domains, and traffic. The possibility of a global Internet with multiple addresses per user (e.g., for appliances) has necessitated a transition from the older Internet routing protocol (IPv4 with 32 bit addresses) to the proposed next generation protocol (IPv6 with 128 bit addresses). High speed packet forwarding is compounded by increasing routing database sizes (due to increased number of hosts) and the increased size of each address in the database (due to the transition to IPv6). Our paper deals with the problem of increasing IP packet forwarding rates in routers. In particular, we deal with a component of high speed forwarding, address lookup, that is considered to be a major bottleneck. When an Internet router gets a packet P from an input link interface, it uses the destination address in packet P to lookup a routing database. The result of the lookup provides an output link interface, to which packet P is forwarded. There is some additional bookkeeping such as updating packet headers, but the major tasks in packet forwarding are address lookup and switching packets between link interfaces. For Gigabit routing, many solutions exist which do fast switching within the router box. Despite this, the problem of doing lookups at Gigabit speeds remains. For example, Ascend’s product has hardware assistance for lookups and can take up to 3s for a single lookup in the worst case and 1 s on average. However, to support say 5 Gbps with an average packet size of 512 bytes, lookups need to be performed in 800 nsec per packet. By contrast, our scheme can be implemented in software on an ordinary PC in a worst case time of 490 nsec. The Best Matching Prefix Problem: Address lookup can be done at high speeds if we are looking for an exact match of the packet destination address to a corresponding address in the routing database. Exact matching can be done using standard techniques such as hashing or binary search. Unfortunately, most routing protocols (including OSI and IP) use hierarchical addressing to avoid scaling problems. Rather than have each router store a database entry for all possible destination IP addresses, the router stores address prefixes that represent a group of addresses reachable through the same interface. The use of prefixes allows scaling to worldwide networks. The use of prefixes introduces a new dimension to the lookup problem: multiple prefixes may match a given address. If a packet matches multiple prefixes, it is intuitive that the packet should be forwarded corresponding to the most specific prefix or longest prefix match.