Dynamic confirmation of system integrity*

by BARRY R. BORGERSON

University of California
Berkeley, California

INTRODUCTION

It is always desirable to know the current state of any system. However, with most computing systems, a large class of failures can remain undetected by the system long enough to cause an integrity violation. What is needed is a technique, or set of techniques, for detecting when a system is not functioning correctly. That is, we need some way of observing the integrity of a system.

A slight diversion is necessary here. Most nouns which are used to describe the attributes of computer systems, such as reliability, availability, security, and privacy, have a corresponding adjective which can be used to identify a system that has the associated attribute. Unfortunately, the word "integrity" has no associated adjective. Therefore, in order to enhance the following discourse, the word "integral" will be used as the adjective which describes the integrity of a system. Thus, a computer system will be integral if it is working exactly as specified.

Now, if we could verify all of the system software, then we could monitor the integrity of a system in real time by providing a 100 percent concurrent fault detection capability. Thus, the integrity of the entire system would be confirmed concurrently, where "concurrent confirmation" of the integrity of any unit of logic means that the integrity of this unit is being monitored concurrently with each use.

A practical alternative to providing concurrent confirmation of system integrity is to provide what will be called "dynamic confirmation of system integrity." With this concept, the parts of a system that must be continuously integral are identified, and the integrity of the rest of the system can then be confirmed by means less stringent than concurrent fault detection. For example, it might be expedient to allow certain failures to exist for some time before being detected. This might be desirable, for instance, when certain failure modes are hard to detect concurrently, but where their effects are controllable.

QUALITATIVE JUSTIFICATION

In most contemporary systems, a multiplicity of processes are active at any given time. Two distinct types of integrity violations can occur with respect to the independent processes. One type of integrity violation is for one process to interfere with another process. That is, one process gains unauthorized access to another's information or makes an illegitimate change of another process' state. This type of transgression will be called an "interprocess integrity violation." The other basic type of malfunction which can be caused by an integrity violation occurs when the state of a single process is erroneously changed without any interference from another process. Failures which lead to only intraprocess contaminations will be called "intraprocess integrity violations."

For many real-time applications, no malfunctions of any type can be tolerated. Hence, it is not particularly useful to make the distinction between interprocess and intraprocess integrity violations, since concurrent integrity-confirmation techniques must be utilized throughout the system. For most user-oriented systems, however, there is a substantial difference in the two types of violations. Intraprocess integrity violations always manifest themselves as contaminations of a process' environment. Interprocess integrity violations, on the other hand, may manifest themselves as security infractions or contaminations of other processes' environments.

* This research was supported by the Advanced Research Projects Agency under contract No. DAHC15 70 C 0274.
The views and conclusions contained in this document are those of the author and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government.

From the collection of the Computer History Museum (www.computerhistory.org)

Fall Joint Computer Conference, 1972

We now see that there can be some freedom in defining what is to constitute a continuously-integral, user-oriented system. For example, the time-sharing system described below is defined to be continuously integral if it is providing interprocess-interference protection on a continuous basis. Thus other properties of the system, such as intraprocess contamination protection, need not be confirmed on a continuous basis.

Although the concept of dynamic confirmation of system integrity has a potential for being useful in a wide variety of situations, the area of its most obvious applicability seems to be fault-tolerant systems. More specifically, it is most useful in those systems which are designed using a solitary-fault assumption, where "solitary fault" means that at most one fault is present in the active system at any time. The notion of "dynamic" becomes clearer in this context. Here, "dynamic" means in such a manner, and at such times, that the probability of encountering simultaneous faults is below a predetermined limit. This limit is dictated not only by the allowable probability of a catastrophic failure, but also by the fact that other factors eventually become more prominent in determining the probability of system failure. Thus, there is often a point beyond which there is very little to be gained by increasing the ability to confirm integrity. The rest of this paper will concern itself with dynamic confirmation in the context of making this concept viable with respect to the solitary-fault assumption.
DYNAMIC CONFIRMATION TECHNIQUES

In this section, and the following section, a particular class of systems will be assumed. The class of systems considered will be those which tolerate faults by restructuring to run without the faulty units. Both the stand-by sparing and the fail-softly types of systems are in this category. These systems have certain characteristics in common; namely, they both must detect, locate, and isolate a fault, and reconfigure to run without the faulty unit, before a second fault can be reliably handled.

Obviously, if simultaneous faults are to be avoided, the integrity of all parts of the system must be verified. This is reasonably straightforward in many areas. For instance, the integrity of data in memory can be rather easily confirmed by the method of storing and checking parity. Of course, checks must also be provided to make sure that the correct word of memory is referenced, but this can be done fairly easily too.1 It is generally true that parity, check sums, and other straightforward concurrent fault-detection techniques can be used to confirm the integrity of most of the logic external to processors. However, there still remain the problems of verifying the integrity of the checkers themselves, of the processors, and of logic that is infrequently used, such as that associated with isolation and reconfiguration.

All too often, no provision is made in a system to check the fault-detection logic. Actually, there are two rather straightforward methods of accomplishing this. One method uses checkers that have their own failure space. That is, they have more than two output states; and when they fail, a state is entered which indicates that the checker is malfunctioning. This requires building checkers with specifically defined failure modes. It also requires the ability to recognize and handle this limbo state. An example of this type of checker appears in Reference 2.
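The checkers of Reference 2 are hardware circuits, but the idea of a checker with its own failure space can be illustrated with a small software model. This is a sketch only; the two-rail encoding used here is one common realization of the idea, not a description of the paper's hardware, and all names are illustrative.

```python
# Model of a parity checker whose output is a two-rail pair with more
# than two states: (1, 0) and (0, 1) are the healthy codes, while
# (0, 0) and (1, 1) form the "limbo" state that signals a failure of
# the checker itself.

def parity(bits):
    """Even parity over a tuple of bits."""
    p = 0
    for b in bits:
        p ^= b
    return p

def two_rail_check(word, stored_parity, checker_fault=False):
    """One rail carries the parity test, the other its complement.
    A stuck rail (checker_fault) collapses the pair into the
    (0, 0)/(1, 1) failure space."""
    ok = int(parity(word) == stored_parity)
    rail_a, rail_b = ok, 1 - ok
    if checker_fault:        # model a stuck-at fault inside the checker
        rail_b = rail_a
    return rail_a, rail_b

def interpret(rails):
    if rails == (1, 0):
        return "data integral"
    if rails == (0, 1):
        return "data fault detected"
    return "checker malfunctioning"   # the defined failure mode
```

The point of the encoding is that the surrounding logic must recognize and handle the third state, exactly as the text requires.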
Another method for verifying the integrity of the fault-detection logic is to inject faults; that is, cause a fault to be created so that the checker must recognize it. In many cases this method turns out to be both cheaper and simpler than the previously mentioned scheme. With this method, it is not necessary to provide a failure space for the checkers themselves. However, it is necessary to make provisions for injecting faults when that is not already possible in the normal design. With this provision, confirming the integrity of the checking circuits becomes a periodic software task. Failures are injected, and fault-detection reports are expected. The system software simply ignores the expected fault report, or initiates corrective action if no report is generated.

Associated with systems of the type under discussion, there is logic that normally is called into use only when a fault has been detected. This includes the logic dedicated to such tasks as diagnosis, isolation, and reconfiguration. This normally idle class of hardware units will collectively be called "reaction logic." In order to avoid simultaneous faults in a system, this reaction logic must not be allowed to fail without the failure being rapidly detected. Several possibilities exist here. This logic can be made very reliable by using some massive redundancy technique such as triple-modular-redundancy.3 Another possibility is to design these units such that they normally fail into a failure space which is detected and reported. However, this will not be as simple here as it might be for self-checking fault detectors, because the failure modes will, in general, be harder to control. A third method would be to simulate the appropriate action and observe the reaction. This also is not as simple here as it was above. For example, it may not be desirable to reconfigure a system on a frequent periodic basis.
However, one way out of this is to simulate the action, initiate the reaction, and confirm the integrity of this logic without actually causing the reconfiguration. This will probably require that the output logic either be made "reliable" or be encoded so as to fail into a harmless and detectable failure space.

The final area that requires integrity confirmation is the processors. The technique to be employed here is very dependent on the application of the system. For many real-time applications, nothing short of concurrent fault detection will apparently suffice. However, there are many areas where less drastic methods may be adequate. Fabry4 has presented a method for verifying critical operating-system decisions, in a time-sharing environment, through a series of independent double checks using a combination of a second processor and dedicated hardware. This method can be extended to verifying certain decisions made by a real-time control processor. If most of the tasks that a real-time processor performs concern data reduction, it is possible that software-implemented consistency checks will suffice for monitoring the integrity of the results. When critical control decisions are to be made, a second processor can be brought into the picture for consistency checks, or dedicated hardware can be used for validity checking. Alternatively, a separate algorithm, using separate registers, could be run on the same processor to check the validity of a control action, with external time-out hardware being used to guarantee a response. These procedures could certainly provide a substantial cost savings over concurrent fault-detection methods.
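The fault-injection discipline described earlier, in which confirming the checking circuits becomes a periodic software task, might be sketched as follows. All names are hypothetical; real injection would toggle a hardware input on the checker rather than call a method.

```python
# Periodic fault-injection task: inject a fault the checker must see,
# then either discard the expected report or initiate corrective
# action when no report arrives.

import queue

class Checker:
    """Stand-in for a hardware fault detector with an injection input."""
    def __init__(self, reports, broken=False):
        self.reports = reports
        self.broken = broken

    def inject_fault(self):
        if not self.broken:          # a healthy checker reports the fault
            self.reports.put("fault detected")

def confirm_checker(checker, reports, timeout=0.1):
    """Periodic software task: inject, then expect a report."""
    checker.inject_fault()
    try:
        reports.get(timeout=timeout)  # expected report: simply ignore it
        return True
    except queue.Empty:
        return False                  # no report: the checker has failed

reports = queue.Queue()
assert confirm_checker(Checker(reports), reports)                 # healthy
assert not confirm_checker(Checker(reports, broken=True), reports)  # failed
```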
For a system to be used in a general-purpose, time-sharing environment, the method of checking processors non-concurrently is very powerful because simple, relatively inexpensive schemes will suffice to guarantee the security of a user's environment. The price that is paid is not detecting some faults that could cause contamination of a user's own information. But conventional time-sharing systems have this handicap in addition to not having a high availability and not maintaining security in the presence of faults, so a clear improvement would be realized here at a fairly low cost. In order to detect failures as rapidly as possible in processors that have no concurrent fault-detection capability, periodic surveillance tests can be run which will determine if the processor is integral.

VALIDATION OF THE SOLITARY-FAULT ASSUMPTION

Fault-tolerant systems which are capable of isolating a faulty unit, and reconfiguring to run without it, typically can operate with several functional units removed at any given time. However, in order to design the system so that all possible types of failures can be handled, it is usually necessary to assume that at most one active unit is malfunctioning at any given time. The problem becomes essentially intractable when arbitrary combinations of multiple faults are considered. That is not to say that all cases of multiple faults will bring a system down, but usually no explicit effort is made to handle most multiple faults. Of course, by multiple faults we mean multiple independent faults. If a failure of one unit can affect another, then the system must be designed to handle both units malfunctioning simultaneously, or isolation must be added to limit the influence of the original fault.

A quantitative analysis will now be given which provides a basis for evaluating the viability of utilizing non-concurrent integrity-confirmation techniques in an adaptive fault-tolerant system. In the analysis below, the letter "s" will be used to designate the probability that two independent, simultaneous faults will cause a system to crash.

The next concept we need is that of coverage. Coverage is defined5 as the conditional probability that a system will recover given that a failure has occurred. The letter "c" will be used to denote the coverage of a system.

In order to determine a system's ability to remain continuously available over a given period of time, it is necessary to know how frequently the components of the system are likely to fail. The usual measure employed here is the mean-time-between-failures. The letter "m" will be used to designate this parameter. It should be noted here that "m" represents the mean-time-between-internal-failures of a system; the system itself hopefully has a much better characteristic.

The final parameter that will be needed here is the maximum-time-to-recovery. This is defined to be the maximum time elapsed between the time an arbitrary fault occurs and the time the system has successfully reconfigured to run without the faulty unit. The letter "r" will be used to designate this parameter.

The commonly used assumption that a system does not deteriorate with age over its useful life will be adopted. Therefore, the exponential distribution will be used to characterize the failure probability of a system. Thus, at any given time, the probability of encountering a fault within the next u time units is:

p = integral from 0 to u of (1/m)*exp(-t/m) dt
  = 1 - exp(-u/m)

From this we can see that the upper bound on the conditional probability of encountering a second independent fault is given by:

q = 1 - exp(-r/m)

Since it is obvious that r must be made much smaller than m if a system is to have a high probability of surviving many internal faults, the following approximation is quite valid:

q = 1 - exp(-r/m)
  = 1 - sum from k=0 to infinity of (-r/m)^k / k!
  = 1 - 1 + r/m - (1/2)*(r/m)^2 + (1/6)*(r/m)^3 - ...
  ~ r/m

Therefore, the probability of not being able to recover from an arbitrary internal failure is given by:

x = (1-c) + c*q*s
  = (1-c) + c*s*r/m

where the first term represents the probability of failing to recover due to a solitary failure and the second term represents the probability of not recovering due to simultaneous failures, given that recovery from the first fault was possible.

If we now consider each failure as an independent Bernoulli trial and make the assumption that faulty units are repaired at a sufficient rate so that there is never a problem with having too many units logically removed from a system at any given time, then it is a simple matter to determine the probability of surviving a given period, T, without encountering a system crash. The hardware failures will be treated as n independent samples, each with probability of success (1-x), where n is the smallest integer greater than or equal to T/m. Thus, the probability of not crashing on a given fault is (1-x) = c*(1 - r*s/m), and the probability, P, of not crashing during the period T is given by:

P = [c*(1 - r*s/m)]^n = c^n * (1 - r*s/m)^n

With this equation, it is now possible to establish the validity of using the various non-concurrent techniques mentioned above to confirm the integrity of a system. What this equation will establish is how often it will be necessary to perform the fault injection, action simulation, and surveillance procedures in order to gain an acceptable probability of no system crashes. Since the time required to detect, locate, and isolate a fault, and reconfigure to run without the faulty unit, will be primarily a function of the time to detection for the non-concurrent schemes, and since this time is essentially equivalent to how frequently the confirmation procedures are invoked, we can assume that r is equal to the time period between the periodic integrity-confirmation checks.

In order to gain a feeling for the order of r, rather pessimistic numbers can be assumed for m, s, and T. Assume m = 1 week, s = 1/2, and T = 10 years; this gives an n of 520. For now, assume c is equal to one. Now, in order to attain a probability of .95 that a system will survive 10 years with no crashes under the above assumptions, r will have to be:

r = (m/s)*[1 - .95^(1/520)] = 119 seconds

Thus, if the periodic checks are made even as infrequently as every two minutes, a system will last 10 years with a probability of not crashing of approximately .95.

The effects of the coverage must now be examined. In order for the coverage to be good enough to provide a probability of .95 of no system crashes in 10 years due to the system's inability to handle single faults, it must be:

c = .95^(1/520) = .9999

Now this would indeed be a very good coverage. Since the actual coverage of any given system will most likely fall quite short of this value, it seems that the coverage, and not multiple simultaneous faults, is the limiting factor in determining a system's ability to recover from faults. The most important conclusion to be drawn from this section is that the solitary-fault assumption is not only convenient but quite justified, and this is true even when only periodic checks are made to verify the integrity of some of the logic.

INTEGRITY CONFIRMATION FEATURES OF THE "PRIME" SYSTEM

In order to better illustrate the potential power of dynamic integrity confirmation techniques, a description will now be given of how this concept is being used to economically provide an integrity confirmation structure for a fault-tolerant system. At the University of California, Berkeley, we are currently building a modular computer system, which has been named PRIME, that is to be used in a multiaccess, interactive environment.
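Returning to the reliability arithmetic of the previous section, the quoted values are easy to reproduce. This is a sketch whose only assumptions are the stated values m = 1 week, s = 1/2, and n = 520.

```python
# Reproducing the arithmetic of the solitary-fault analysis
# (m = 1 week, s = 1/2, T = 10 years, so n = 520, target P = .95).

m = 7 * 24 * 3600   # mean time between internal failures, in seconds
s = 0.5             # P(two simultaneous faults crash the system)
n = 520             # internal faults expected in T = 10 years

# Checking interval r needed for P = .95 with perfect coverage (c = 1):
r = (m / s) * (1 - 0.95 ** (1 / n))
print(round(r))       # -> 119 seconds

# Coverage needed for P = .95 when only solitary failures matter:
c = 0.95 ** (1 / n)
print(round(c, 4))    # -> 0.9999
```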
The initial version of this system will have five processors; 13 8K-word by 33-bit memory blocks with associated switching units; 15 high-performance disk drives; and a switching network which allows processor, disk, and external-device switching. A block diagram of PRIME appears in Figure 1.

Figure 1 - Block diagram of the PRIME system. (In the original diagram, each memory block (MB) consists of two 4K modules, each indicated line represents 16 terminal connections, and asterisks mark the reconfiguration logic attached to each I/O controller.)

The processing elements in PRIME are 3-bus, 16-bit wide, 90-ns cycle time microprogrammable processors called META 4s.6 Each processor emulates a target machine in addition to performing I/O and executive functions directly in microcode. At any given time, one of the processors is designated the Control Processor (CP), while the others are Problem Processors (PPs). The CP runs the Central Control Monitor (CCM), which is responsible for scheduling, resource allocation, and interprocess message handling. The Problem Processors run user jobs and perform some system functions with the Extended Control Monitor (ECM), which is completely isolated from user processes. Associated with each PP is a private page, which the ECM uses to store data, and some target-machine code which it occasionally causes to be executed. A more complete description of the structure and functioning of PRIME is given elsewhere.7

The most interesting aspects of PRIME are in the areas of availability, efficiency, and security. PRIME will be able to withstand internal faults. The system has been designed to degrade gracefully in the presence of internal failures.8 Also, interprocess integrity is always maintained, even in the presence of either hardware or software faults. The PRIME system is considered continuously integral if it is providing interprocess-interference protection. Therefore, security must be maintained at all times. Other properties, such as providing user service and recovering from failures, can be handled in a less stringent manner. Thus, dynamic confirmation of system integrity in PRIME must be handled concurrently for interprocess-interference protection and can be handled periodically with respect to the rest of the system. Of course, there are areas which do not affect interprocess-interference protection but which will nonetheless utilize concurrent fault detection simply because it is expedient to do so.

Fault injection is being used to check most of the fault-detection logic in PRIME. This decision was made because the analysis of non-concurrent integrity-confirmation techniques has established that periodic fault injection is sufficiently effective to handle the job, and because it is simpler and cheaper than the alternatives. There is a characteristic of the PRIME system that makes schemes which utilize periodic checking very attractive. At the end of each job step, the current process and the next process are overlap swapped. That is, two disk drives are used simultaneously; one of these disks is rolling the current job out, while the other is rolling the next job in. During this time, the associated processor has some potential free time. Therefore, this time can be effectively used to make whatever periodic checks may be necessary.
And since the mean time between job steps will be less than a second, this provides very frequent, inexpensive periodic checking capabilities.

The integrity of Problem Processors is checked at the end of each job step. This check is initiated by the Control Processor, which passes a one-word seed to the PP and expects the PP to compute a response. This seed guarantees that different responses are required at different times, so that the PP cannot accidentally "memorize" the correct response. The computation requires the use of both target-machine instructions and a dedicated firmware routine to compute the expected response. The combination of these two routines is called a surveillance procedure. This surveillance procedure checks all of the internal logic and the control storage of the microprocessors. The target-machine code of the surveillance routine is always resident in the processor's private page. The microcode part is resident in control storage. A fixed amount of time is allowed for generating a response when the CP asks a PP to run a surveillance on itself. If the wrong response is given, or if no response is given in the allotted time, then the PP is assumed to be malfunctioning and remedial action is initiated. In a similar manner, each PP periodically requests that the CP run a surveillance on itself. If a PP thinks it detects that the CP is malfunctioning, it will tell the CP this, and a reconfiguration will take place, followed by diagnosis to locate the actual source of the detected error. More will be said later about the structure of the reconfiguration scheme.

While the periodic running of surveillance procedures is sufficient for most purposes, it does not suffice for protecting against interprocess interference. As previously mentioned, this protection must be continuous. Therefore, a special structure has been developed which is used to prevent interprocess interference on a continuous basis.
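The seed-and-response surveillance check can be sketched as follows. The response function here is a toy seed-dependent digest and is purely illustrative; PRIME's actual response is computed by target-machine code plus a dedicated firmware routine that exercises the processor's internal logic and control storage.

```python
# Sketch of the surveillance protocol: the CP passes a one-word seed,
# the PP must return a seed-dependent response within a time limit.

import threading

def surveillance_response(seed, control_storage):
    """What a healthy PP computes: a digest that depends on the seed,
    so the answer differs between checks and cannot be memorized."""
    digest = seed
    for word in control_storage:
        digest = (digest * 31 + word) % (1 << 16)
    return digest

def run_surveillance(seed, ask_pp, control_storage, time_limit=1.0):
    """CP side: pass the seed and expect the right answer in time.
    A wrong or late response marks the PP as malfunctioning."""
    result = []
    worker = threading.Thread(target=lambda: result.append(ask_pp(seed)))
    worker.start()
    worker.join(time_limit)
    expected = surveillance_response(seed, control_storage)
    return bool(result) and result[0] == expected
```

A healthy PP is modeled by a callable that computes the same digest; a faulty one returns a wrong value or never answers.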
This structure4 provides double checks on all actions which could lead to interprocess interference. In particular, the validity of all memory and disk references, and all interprocess message transmissions, are among those actions double checked. A class code is used to associate each sector (1K words) of each disk pack with either a particular process or with the null process, which corresponds to unallocated space. A lock and key scheme is used to protect memory on a page (also 1K words) basis. In both cases, at most one process is bound to a 1K-word piece of physical storage. The Central Control Monitor is responsible for allocating each piece of storage, and it can allocate only those pieces which are currently unallocated. Each process is responsible for deallocating any piece of storage that it no longer needs. Both schemes rely on two processors and a small amount of dedicated hardware to provide the necessary protection against some process gaining access to another process' storage.

In order for the above security scheme to be extremely effective, it was decided to prohibit sharing of any storage. Therefore, the Interconnection Network is used to pass files which are to be shared. Files are sent as regular messages, with the owning process explicitly giving away any information that it wishes to share with any other process. All interprocess messages are sent by way of the CP. Thus, both the CCM and the destination ECM can make consistency checks to make sure that a message is delivered to the correct process.

The remaining area of integrity checking which needs to be discussed is the reaction hardware. In the PRIME system, this includes the isolation, power switching, diagnosis, and reconfiguration logic. A variety of schemes have been employed to confirm the integrity of this reaction logic.
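The allocation discipline above, with at most one process bound to each 1K-word piece of storage, can be sketched in a few lines. The names are illustrative only; in PRIME the double check is implemented by two processors and dedicated hardware, not by a single software table.

```python
# Sketch of the lock-and-key / class-code discipline: each piece of
# storage is bound to one process or to the null process, only
# unallocated pieces may be handed out, and every access must present
# the matching key.

NULL_PROCESS = None

class Storage:
    def __init__(self, n_pieces):
        self.owner = [NULL_PROCESS] * n_pieces   # lock / class-code table

    def allocate(self, piece, process):
        # the allocator may hand out only unallocated pieces
        if self.owner[piece] is not NULL_PROCESS:
            raise PermissionError("piece already bound to a process")
        self.owner[piece] = process

    def deallocate(self, piece, process):
        if self.owner[piece] != process:
            raise PermissionError("only the bound process may deallocate")
        self.owner[piece] = NULL_PROCESS

    def access(self, piece, process):
        # the double check: the key presented must match the lock
        if self.owner[piece] != process:
            raise PermissionError("interprocess integrity violation")
        return True
```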
In order to describe the methods employed to confirm this integrity, it will be necessary to first outline the structure of the spontaneous reconfiguration scheme used in the PRIME system. There are four steps involved in reorganizing the hardware structure of PRIME so that it can continue to operate with internal faults. The first step consists of detecting a fault. This is done by one of the many techniques outlined in this paper. In the second step, an initial reconfiguration is performed so that a new processor, one not involved in the detection, is given the job of being the CP. This provides a pseudo "hard core" which will be used to initiate gross diagnostics. The third step is used to locate the fault. This is done by having the new CP attach itself to the Programmable Control Panel9 of a Problem Processor via the Interconnection Network, and test it by single-stepping this PP through a set of diagnostics. If a PP is found to be functioning properly, then it is used to diagnose its own I/O channels. After the fault is located, the faulty functional unit is isolated, and a second reconfiguration is performed to allow the system to run without this unit.

Of the four steps involved in responding to a fault, the initial reconfiguration poses the most difficulty. In order to guarantee that this initial reconfiguration can be initiated, a small amount of dedicated hardware was incorporated to facilitate this task. Associated with each processor is a flag which indicates when the processor is the CP. Also associated with each processor is a flag which is used to indicate that this processor thinks the CP is malfunctioning. For every processor, these two flags can be interrogated by any other processor. Each processor can set only its own flag that suggests the CP is sick.
The flag which indicates that a processor is the CP can be set only if both the associated processor and the dedicated hardware concur. Thus, the dedicated hardware will not let this flag go up if another processor already has its own up. Also, this flag will automatically be lowered whenever two processors claim that the CP is malfunctioning.

There is somewhat of a dilemma associated with confirming the integrity of this logic. Because of the distributed nature of this reconfiguration structure, it should be unnecessary to make any of it "reliable." That is, the structure is already distributed so that a failure of any part of it can be tolerated. However, if simultaneous faults are to be avoided, the integrity of this logic must be dynamically confirmed. Unfortunately, it is not practical to check this logic by frequently initiating reconfigurations. This dilemma is being solved by a scheme which partially simulates the various actions. The critical logic that cannot be checked during a simulated reconfiguration is duplicated, so that infrequent checking by actual reconfiguration is sufficient to confirm the integrity of this logic.

The only logic used in the diagnostic scheme whose integrity confirmation has not already been discussed is the Programmable Control Panel. This pseudo panel is used to allow the CP to perform all the functions normally available on a standard control panel. No explicit provision will be made for confirming the integrity of the Programmable Control Panel, because its loss will never lead to a system crash. That is, failures in this unit can coexist with a failure anywhere else in the system without bringing the system down.

For powering and isolation purposes, there are only four different types of functional units in the PRIME system.
The four functional units are the intelligence module, which consists of a processor, its I/O controller and the subunits that directly connect to the controller, its memory bus, and its reconfiguration logic; the memory block, which consists of two 4K-word by 33-bit MOS memory modules and a 4x2 switching matrix; the switching module, which consists of the switch part of two processor-end and three device-end nodes of the Interconnection Network; and the disk drive.

The disk drives and switching modules can be powered up and down manually only. The intelligence modules must be powered up manually, but they can be powered down under program control. Finally, the memory blocks can be powered both up and down under program control. No provision was made to power down the disks or switching modules under program control because there was no isolation problem with these units. Rather than providing very reliable isolation logic at the interfaces of the intelligence modules and memory blocks, it was decided to provide additional isolation by adding the logic which allows these units to be dynamically powered down. Also, because it may be necessary to power memory blocks down and then back up in order to determine which one has a bus tied up, the provision had to be made for performing the powering up of these units on a dynamic basis. Any processor can power down any memory block to which it is attached, so it was not deemed necessary to provide for any frequent confirmation of the integrity of this power-down logic. Also, every processor can be powered down by itself and one other processor. These two power-down paths are independent, so again no provision was made to frequently confirm the integrity of this logic. In order to guarantee that the independent power-down paths do not eventually fail without this fact being known, these paths can be checked on an infrequent basis.
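The discipline of the dedicated reconfiguration flags described earlier can be modeled in a few lines. This is a behavioral sketch, not the hardware design: a CP flag may rise only when no other is up, and it falls automatically when two processors declare the CP sick.

```python
# Model of the dedicated reconfiguration hardware: per-processor
# "I am the CP" and "the CP looks sick" flags, with the two rules
# the text describes enforced centrally.

class ReconfigurationLogic:
    def __init__(self, n_processors):
        self.is_cp = [False] * n_processors
        self.cp_sick = [False] * n_processors

    def raise_cp_flag(self, p):
        """Processor p and the dedicated hardware must concur; the
        flag will not go up if another processor already has its up."""
        if any(self.is_cp):
            return False
        self.is_cp[p] = True
        self.cp_sick = [False] * len(self.cp_sick)  # accusations reset
        return True

    def declare_cp_sick(self, p):
        """Each processor can set only its own accusation flag; two
        accusers automatically lower the current CP flag."""
        self.cp_sick[p] = True
        if sum(self.cp_sick) >= 2:
            self.is_cp = [False] * len(self.is_cp)
```

One accusation alone changes nothing, so a solitary fault in this logic cannot depose a healthy CP; it takes two independent claims.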
All of the different integrity confirmation techniques used in PRIME have been described. The essence of the concept of dynamic confirmation of system integrity is the systematic exploitation of the specific characteristics of a system to provide an adequate integrity confirmation structure which is in some sense minimal. For instance, the type of use and the distributed intelligence of PRIME were taken advantage of to provide a sufficient integrity-confirmation structure at a much lower cost and complexity than would have been possible if these factors were not carefully exploited.

REFERENCES

1 B BORGERSON C V RAVI
On addressing failures in memory systems
Proceedings of the 1972 ACM International Computing Symposium Venice Italy pp 40-47 April 1972

2 D A ANDERSON G METZE
Design of totally self-checking check circuits for M-out-of-N codes
Digest of the 1972 International Symposium on Fault-Tolerant Computing pp 30-34

3 R A SHORT
The attainment of reliable digital systems through the use of redundancy - A survey
IEEE Computer Group News Vol 2 pp 2-17 March 1968

4 R S FABRY
Dynamic verification of operating system decisions
Computer Systems Research Project Document No P-14.0 University of California Berkeley February 1972

5 W G BOURICIUS W C CARTER P R SCHNEIDER
Reliability modeling techniques for self-repairing computer systems
Proceedings of the ACM National Conference pp 295-309 1969

6 META 4 computer system microprogramming reference manual
Publication No 7043MO Digital Scientific Corporation San Diego California June 1972

7 H B BASKIN B R BORGERSON R ROBERTS
PRIME - A modular architecture for terminal-oriented systems
Proceedings of the 1972 Spring Joint Computer Conference pp 431-437

8 B R BORGERSON
A fail-softly system for time-sharing use
Digest of the 1972 International Symposium on Fault-Tolerant Computing pp 89-93

9 G BAILLIU B R BORGERSON
A multipurpose processor-enhancement structure
Digest of the 1972 IEEE Computer Society Conference San Francisco September 1972 pp 197-200