Standardized fault-tolerant microcomputers in a reconfigurable distributed network promise to meet spacecraft reliability requirements at low cost.

Reconfigurable Modular Computer Networks for Spacecraft On-Board Processing

David A. Rennels
Jet Propulsion Laboratory

July 1978

Over the last 20 years, a number of unmanned spacecraft have been sent to investigate the moon, Mars, Venus, and Mercury and have returned more information about these bodies than had been collected in all of previous human history.1 Each of these spacecraft, as well as those in the planning stages for future missions, consists of two major parts: a science payload and a set of engineering subsystems which supply power, control, and communications with the science instruments. The science payload scans the electromagnetic spectrum and the environment in the vicinity of the spacecraft. Typical experiments employ television cameras, ultraviolet and infrared scanning devices, charged particle detectors, magnetometers, and radio astronomy instruments. The core engineering subsystem (the one that must control and collect data from these experiments, as well as point the spacecraft, control the other engineering subsystems, and handle any serious anomalies which may occur on the flight) is the computer.

Distributed computers. The core electronics subsystems on these spacecraft have progressed through an evolution from simple fixed controllers and analog computers in the 1960s to general-purpose digital computers in current designs. This evolution is now moving in the direction of distributed computer networks. Current Voyager spacecraft already use three on-board computers. One is used to store commands and provide overall spacecraft management. Another is used for instrument control and telemetry collection, and the third computer is used for attitude control and scientific instrument pointing. The scientific instruments are also candidates for dedicated computers.
These instruments vary in complexity, but they all contain command interfaces and internal logic sequencers which generate control signals to operate electronic mechanisms and to collect data. Instrument cycles vary in periodicity from a few seconds to a minute. An examination of the control logic in these instruments shows that, for many, it is cost-effective to replace the sequencing logic with a microcomputer, either to save chips or to establish standardization in instrument logic designs.

An additional factor in favor of multiple computers is a potential simplification of interfaces. Typically, science instruments and engineering subsystems are produced by different contractors to meet common interface specifications. These interfaces become quite detailed and have very complex timing requirements when each instrument and subsystem must share a common computer. By placing computers in the various instruments and subsystems, the contractor can handle his own details of timing and control. The system-level interface becomes much simpler and the contractor can thoroughly test his instrument or subsystem without waiting for externally supplied computer hardware and software.

A number of scientists and subsystem engineers have expressed interest in using microcomputers, and it has become clear that the new microprocessor technology could result in wider and wider use of computers on future spacecraft. If we are to be successful in developing a systematic approach to spacecraft distributed processing, the eventual architecture must strongly reflect the unusual requirements placed on on-board computing.

0018-9200/78/0700-0049$00.75 © 1978 IEEE

On-board requirements. Reliability is the most severe constraint upon the design of spacecraft computing systems, and it represents the large majority of associated costs.
The spacecraft must survive for several years in an environment where repair is not possible, and ground intervention (through commands) cannot be guaranteed in less than several hours, or perhaps several days if the spacecraft is not continually tracked. This means carrying along two or more times the required computing hardware in the form of redundant processors, memories, and I/O circuits. Computer failure cannot be tolerated, since a spacecraft often represents an investment of a hundred million dollars.

Additional problems such as wide temperature ranges, low power availability, and radiation hazards often necessitate selection of special parts. The reliability screening of these parts drives their costs to 10 times those for the commercial marketplace.

Finally, spacecraft computers and their software are characterized by extensive testing. Each subsystem is tested for hundreds of hours; the spacecraft is then assembled and again tested for hundreds of hours. Flight and ground software systems are developed and similarly tested. The effects of all spacecraft commands are simulated in detail before being transmitted, since the effect of an improper command may not be determined for hours after damage is done. This can require thousands of hours of CPU time on a large general-purpose computing system. During a planetary encounter of a few hours' duration, a scientific experimenter must be confident that the spacecraft is performing as specified, in order to have useful results.

Thus, the spacecraft computer architecture must exhibit the properties of testability and ease of software validation. It must also provide automatic reconfiguration for recovery from on-board computer hardware faults, while staying within stringent power, weight, and volume constraints.
These requirements have been addressed in a multiple computer architecture developed at the Jet Propulsion Laboratory for use in real-time control and data handling on unmanned spacecraft. Optimization criteria were quite different from the usual goals of throughput and efficiency, relating instead to testability, fault-tolerance, and ease of use. A breadboard of this architecture, designated the "Unified Data System," has been constructed at JPL and programmed to perform the processing tasks of a typical spacecraft. A review of the UDS design indicates some of the potentialities for reliability in systems employing such an architecture.

Architectural considerations

The high reliability requirements and the need for extensive hardware and software testing in spacecraft computer systems demand that architecture and interfaces be simplified to the greatest extent consistent with adequate performance. Multiple computer configurations have a potential for chaotic complexity unless rigid standards are imposed on programming and intercommunication between computers.

Summarized below are the architectural characteristics devised for the UDS in an attempt to achieve a more manageable system. Several of these approaches apply to distributed systems in general; others are more oriented to a spacecraft's real-time computing tasks. The goal is to simplify interfaces between computers and to make their operation more predictable to simplify testing and fault analysis.

Intercommunications. We believe that it is worth the hardware investment to provide a powerful mechanism for intercommunication between computers. The requirements on software for detailed control of incoming and outgoing data should be minimized. This implies the use of hardware control of data movement between computer memories using direct-memory-access techniques.
Redundancy and reconfigurability of the intercommunication system are essential to prevent failure from a single fault, and the hardware mechanisms should verify proper transmission through automated status messages.

Computer utilization. It is often economical to supply much more computing capacity (memory and processing performance) than is required for a given application. As utilization of memory approaches 100 percent, or as real-time computing demands approach the maximum speed of a machine, design and verification of software usually increase dramatically in complexity. Costs of software have exceeded costs of hardware in spacecraft systems for some time, and VLSI technology will further reduce hardware costs. In many spacecraft subsystems, a dedicated computer will be used well below its nominal capabilities.

Restricted communications. It is advisable to restrict communications in the spacecraft computing network to provide error confinement and to simplify testing. The majority of computers in the spacecraft system will be performing dedicated low-level functions within various instruments and subsystems. These computers should not have the ability to arbitrarily modify the memories of other computers or to tie up an intercommunications bus. If they did, hardware and software faults in low-level computers could propagate throughout the system. There are many ways to achieve this fault confinement, and we have chosen centrally controlled buses which do not allow low-level computers to initiate intercommunications activity. Centralized bus control is well adapted to the synchronous nature of spacecraft processes, as described below.

Synchronous functions. In order to allow correlation of the results of various scientific experiments and to provide data at the right time to fit into telemetry formats, the control of experiments and engineering subsystems is tightly synchronized. Subroutines in the associated computers meet strict timing requirements in generating control cycles for their various instruments. Thus, for any spacecraft operational mode, the state and periodicity of most instrument and subsystem cycles are well defined, as are the requirements for data transfer between computers. This prevents conflicts and simplifies intercommunications since no conflict arbitration is required.

The central bus controller establishes a periodic set of data transmissions between computers as needed for the particular cycles they are performing. (The few nonperiodic functions can be treated in a similar fashion by establishing periodic transmission of message buffers between their associated subroutines.) Bus control is easily verified since it is generated from a single controller (from internal memory tables) and is highly predictable. The cost of forcing intercommunications into periodic data movements is slower response through the bus. A computer must wait for its time slot before communicating with another machine. This restriction is acceptable in the spacecraft because of the periodic nature of its computations, and thus we have sacrificed performance (concurrency) for increased testability.

Control hierarchy. The spacecraft computing structure is hierarchic. At the bottom is the set of dedicated terminal computers within instruments and subsystems. For simple spacecraft a two-level hierarchy is utilized. A single command computer stores commands from the earth, directs processing in the various subsystem computers, and specifies the data movements to be carried out between them. For more complex systems, such as the Mars Roving Vehicle, the hierarchy is extended to at least three levels.2 Each of several groups of subsystem computers is controlled by a subsystem group control computer. These intermediate computers are in turn controlled by a master control computer.

There are two types of information movement between the memories of the computers in the spacecraft system. The first type is the movement of data between ongoing programs in different computers, which occurs in a predictable periodic fashion. The second type of information is commands which specify the algorithms to be performed in the various computers. In order to simplify testing and software verification, we have restricted each computer in the network to receiving commands from only one higher-level computer. This higher-level computer also controls data movement into and out of the lower-level machine, but the data can come from any of a number of different computers under its control.

Programs in the low-level computers interface with the other computers through their own local memories. These programs are invoked by the high-level computer, which also has responsibility for placing the operands which are needed in their local memories. These programs, in the low-level computers, are self-synchronized to process the data when it arrives and to place results in their memories for subsequent extraction by the high-level computer. This approach fits well with the hierarchic nature of the spacecraft system and with the fact that most of its computing functions are periodic. Control between computers is effectively limited to a tree structure to provide simplification and a degree of fault containment.

Timing hierarchy. Many control systems exhibit a timing hierarchy. Simple, high-rate functions are often done at the bottom. Functions of intermediate rate and complexity are often done at the next higher level, and complex processes are often done at the top. The more complex processes often have a wider latitude in timing resolution. A spacecraft computing system can take advantage of this hierarchy to simplify software and the interface between computers.

It is frequently useful to offload simple high-rate signal generation from software into I/O hardware. This simplifies expensive software in subsystem computers at a cost of less expensive hardware. Similarly, the software in the low-level subsystem computers should be designed to minimize the timing resolution required of the high-level computer which sends its messages over an intercommunications bus. This simplifies expensive system interfaces.

Interrupts. Whenever possible, demand interrupts should be avoided. The software should determine when and in what order it interfaces with the outside world. This tends to increase slightly the number of instructions required and may limit I/O response to millisecond rather than microsecond resolution. But it does lead to more predictable operation, is more easily verified, and allows for software self-defense. Spacecraft systems are extensively simulated, and if no restrictions are placed on the response to external stimuli, this simulation can become extremely expensive. There can be a very large number of possible orderings and timings of incoming service requests. By restricting this set of possible input states, software can be more easily verified and have higher reliability.

I/O timing granularity. The on-board computers must generate precisely timed control signals for their associated subsystems. Several programs may operate concurrently in a single machine, each one of which is generating a precisely timed series of inputs and outputs. For example, the dedicated television computer may control the readout of picture lines, sample telemetry measurements, format picture data for readout, and execute several other concurrent functions which must be precisely timed. It is important to be able to change any one of these programs without affecting the input and output timing of the others.

This can be achieved to a large extent by imposing granularity on I/O; i.e., inputs are sampled and held for uniform (several millisecond) intervals. During these intervals segments of several concurrent foreground programs may be executed. Their outputs are collected by I/O hardware and held until the end of the time interval, and then all outputs are executed at once. The program segments can be executed in any order, and some can be removed without affecting the output timing of the others. Programs can be added as long as the total computation for any interval does not exceed the time available.

This approach simplifies simulation since the possible order and timing of inputs are drastically reduced in complexity, and visibility into the system is improved for testing since programs are executed in well defined steps during which inputs are held constant. Software can be more easily modified. The cost of this approach is slower response to external events. It may require two to three time intervals, on the order of 5-7 milliseconds, for the computer to acquire unexpected data and deliver a response. This is acceptable for the spacecraft application.

Fault detection and automatic reconfiguration. Unlike most applications, an interplanetary spacecraft experiences the maximum demand for computing capacity at the end of a mission when, after cruising through space for a year or more, it reaches its designated target. The fault-tolerance techniques employed must give a high probability that a system is fully operational at the end of a mission. Thus, enough spare hardware must be carried along to substitute for all faulty units, rather than relying on graceful degradation.

Reconfiguration for fault recovery consists of detecting faulty computers and substituting properly functioning spares. To achieve high reliability over a long period of time, it has been shown that the mechanisms for fault detection and recovery must be nearly perfect. That is, coverage, defined as the conditional probability of effecting recovery from a fault, must approach unity.3

To achieve a high degree of fault detection, we have chosen to design self-checking computers for use within the distributed computing system. These are computers containing internal checking hardware that can detect nearly all possible internal faults concurrently with normal software operation. The methodology for designing self-checking computers is well developed, and using VLSI technology this capability can be implemented at relatively low cost.4,5

Each self-checking computer in the network disables itself upon detecting an internal fault. A high-level control computer monitors the various other computers and, upon discovering a fault-disabled computer, activates a replacement spare by commands through the bus system. The control computer, in turn, has a "hot" backup spare (with a separate bus system) which is carrying out the same programs as the controller and takes over if it should fail and disable itself.

These nine characteristics represent the conservative design approach employed in the Unified Data System. Synchronous communications, tree-structured control, removal of interrupts, granularity of I/O, and fault-tolerance techniques are all directed at increasing testability, reliability, and ease of use, at some expense in processing performance (response time and throughput).
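The recovery scheme described under fault detection and automatic reconfiguration (self-disabling modules, a monitoring control computer, and standby spares) can be illustrated with a minimal Python sketch. Every name here (Module, ControlComputer, monitor, the role labels) is hypothetical and chosen only for illustration. Powering a spare is reduced to setting a flag, where the article describes commands sent through the bus system, and the controller's own hot backup is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    powered: bool = True
    disabled: bool = False      # set by the module's own checking hardware

    def internal_fault(self):
        """A self-checking module disables itself when it detects a fault."""
        self.disabled = True


@dataclass
class ControlComputer:
    active: dict                # role -> Module currently doing that job
    spares: list                # unpowered standby modules
    log: list = field(default_factory=list)

    def monitor(self):
        """Replace every fault-disabled module with the next available spare."""
        for role, module in list(self.active.items()):
            if not module.disabled:
                continue
            if not self.spares:
                self.log.append(f"{role}: no spare available")
                continue
            spare = self.spares.pop(0)
            spare.powered = True            # stands in for a command over the bus
            self.active[role] = spare
            self.log.append(f"{role}: {module.name} replaced by {spare.name}")


tv = Module("tv-A")
attitude = Module("att-A")
controller = ControlComputer(
    active={"television": tv, "attitude": attitude},
    spares=[Module("spare-1", powered=False)],
)
tv.internal_fault()     # the TV computer detects a fault and disables itself
controller.monitor()    # the control computer notices and substitutes the spare
```

Because the sketch only ever substitutes whole modules, a role is either fully served or reported as having no spare, which mirrors the article's preference for carrying spares over relying on graceful degradation.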
This design approach occupies an unusual point in the design space because we start with more computer hardware capability than is needed, and accept inefficient use of this hardware in an attempt to achieve a more manageable system. The approach is tailored to spacecraft applications, but we feel that it applies to a number of other real-time applications as well.6

We next describe the architecture of the Unified Data System, beginning with an initial breadboard system which did not incorporate fault-detection and recovery features. Its main objective was to verify the software and intercommunications structure of the system. A second breadboard is under way which includes fault tolerance. Its main difference from the first is that self-checking computer modules are employed along with backup spares for fault recovery by means of reconfiguration. Its software and communication techniques are nearly identical to the initial system.

UDS architecture

The Unified Data System architecture consists of a set of standard microcomputers connected by several redundant buses as shown in Figure 1. The microcomputer modules, which use the same microprocessor and software executive, fall into two types: terminal modules and high-level modules.

Terminal modules are located in various spacecraft subsystems and are responsible for local control and data collection. The terminal module contains a microprocessor, memory (RAM), I/O modules, and several bus adaptors which interface with each of several intercommunications buses. The bus adaptors are DMA controllers which allow the bus systems to enter and extract data from the terminal module's memory. A high-level module enters commands, data, and timing information into prearranged areas within the terminal module. The terminal module delivers information to the system by placing outgoing messages in predetermined locations of its memory, which can then be extracted by a high-level module over the bus. The terminal module memory can be accessed by several buses simultaneously, and its processor is seldom notified when such a transaction occurs.

Figure 1. The Unified Data System, developed at Caltech's Jet Propulsion Laboratory, consists of two levels of standard microcomputer modules, connected by buses, some redundant. The bus controllers in the high-level module type, in addition to monitoring data movement, can release their buses under certain conditions. The bus adaptors in both module types control direct memory access.

Each high-level module consists of a microprocessor, memory, bus adaptors, and a bus controller. Each bus controller, which is unique to high-level modules, can move blocks of data between memories of all computers connected to its bus. This is the mechanism by which the high-level module can coordinate the processing in a set of remote terminal modules by entering commands into their memories and reading out information to monitor ongoing processes. The bus controller is a highly autonomous unit which acts like the data channels of much larger machines. When signalled by the high-level module processor, it reads a control table from the module's memory, interprets the table and controls the requested data movement, verifies proper transmission through status messages, and notifies the processor upon completion.

Each bus controller has a dedicated bus under its control, but can relinquish its bus under one of two conditions: (1) its power is turned off or (2) its processor releases the bus for a specified time interval. Thus, spare modules can gain access to a bus whose processor has failed, or a bus can be multiplexed if several other buses have failed. The individual buses are physically independent, and therefore no central controller exists for all buses as a potential catastrophic failure mechanism.
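The table-driven operation of the bus controller can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the control-table layout (the Transfer tuple), the use of byte arrays for module memories, and the read-back comparison standing in for the status message are all hypothetical, since the article does not give the actual table or status formats.

```python
from typing import NamedTuple


class Transfer(NamedTuple):
    """One control-table entry (hypothetical layout)."""
    src: str        # name of the source computer
    src_addr: int   # start address in the source memory
    dst: str        # name of the destination computer
    dst_addr: int   # start address in the destination memory
    length: int     # number of bytes to move


def run_control_table(memories, table):
    """Carry out each transfer in order and return one status flag per entry."""
    statuses = []
    for t in table:
        src_mem, dst_mem = memories[t.src], memories[t.dst]
        in_range = (0 <= t.src_addr and t.src_addr + t.length <= len(src_mem)
                    and 0 <= t.dst_addr and t.dst_addr + t.length <= len(dst_mem))
        if not in_range:
            statuses.append(False)   # refuse the transfer and report failure
            continue
        block = bytes(src_mem[t.src_addr:t.src_addr + t.length])
        dst_mem[t.dst_addr:t.dst_addr + t.length] = block
        # Stand-in for the status message: read back and compare.
        readback = bytes(dst_mem[t.dst_addr:t.dst_addr + t.length])
        statuses.append(readback == block)
    return statuses


memories = {"high": bytearray(16), "tv": bytearray(16), "tape": bytearray(16)}
memories["high"][0:4] = b"CMD1"          # command prepared by the high-level module
memories["tv"][8:12] = b"PIX0"           # result left by a terminal module program

table = [
    Transfer("high", 0, "tv", 0, 4),     # enter a command into the TV module
    Transfer("tv", 8, "tape", 0, 4),     # move TV data to the tape recorder module
    Transfer("tv", 12, "tape", 14, 8),   # deliberately out of range
]
statuses = run_control_table(memories, table)
```

Note that only the controller touches the table. The terminal memories are passive, which matches the restriction that terminal modules never initiate bus traffic.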
A more detailed description of this hardware architecture can be found elsewhere.7

The distinguishing features of this hardware architecture are (1) a busing system which offers a high degree of redundancy and takes most of the burden of intercommunication off the host computers, (2) centralized bus control and limited bus access which makes the buses more predictable and helps prevent faulty terminal modules from propagating damaged information, (3) a design for synchronous operation without external demand interrupts, and (4) granularity of I/O to simplify software modifications and verification.

System synchronization and control structure

The UDS computers are synchronized by a common 2.5-millisecond real-time interrupt (RTI). Various counts of RTI intervals define the uniform time measurement throughout the spacecraft. Analogous to minutes and seconds, the UDS keeps time in frames and lines. A frame (48 sec) comprises 800 lines (60 msec each), and a line comprises 24 RTI intervals. These unusual values of time are chosen for convenience because they correspond to the cycles of instruments on a typical spacecraft. A television picture, consisting of 800 lines read out at one line every 60 milliseconds, is completed every 48 seconds. Other instruments are synchronized to TV lines, and telemetry sequences tend to repeat on these intervals.

System executive. The highest level computer in the network serves as a system executive and broadcasts both time counts and commands into designated areas within the memories of the other computers. It reads out data needed for control decisions. Under this computer may be several additional high-level modules which serve to control collections of terminal modules or provide specialized computing services for the network.2 In simpler systems, the system executive module directly controls the terminal modules. As described in the next section, these high-level modules are being designed to be hardware self-checking.
This is necessary to prevent faulty modules from sending damaged information throughout the system. Similarly, error-detecting codes are employed in the bus to allow detection of damage to information transmitted over the bus.

After receiving commands from a high-level module, a terminal module starts a set of specified programs, which utilize the timing information received from the high-level module to synchronize with the rest of the spacecraft. For each program, the time counts at which data is to appear in memory, at which control signals are to be generated, and at which processed data is to be stored in memory for extraction by the high-level module have been precisely specified.

In order to support several concurrent programs which generate precisely timed results and also to support more complex programs which are not easily segmented, software is run in a foreground-background partition. Each processor in the spacecraft system has a well defined set of foreground program segments to run in each RTI interval. The foreground programs are run in short segments which control and time I/O. They also start and stop unsegmented background programs which perform more elaborate computations which have less stringent timing requirements.

To simplify testing, the results of program segments run during any RTI interval should not be dependent upon the order or timing of their execution within the interval. In order to prevent changing input values from making foreground program segment results time-dependent within an RTI interval, inputs are sampled and held throughout the interval by the I/O circuits. Output commands are held and executed at the next RTI to make output timing independent of the speed or order of execution of the foreground program segments. Thus, if the system is stopped at the end of an RTI period, the software state of the foreground segments is known and can be easily verified.
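The sample-and-hold discipline and the frame/line/RTI time counts can be illustrated with a brief Python sketch. It is a simulation under stated assumptions rather than the UDS implementation: the function names, the dictionary representation of inputs and outputs, and the two example segments are hypothetical, and the segments are assumed to write to distinct outputs. Only the timing constants (2.5 ms RTI, 24 RTIs per line, 800 lines per frame) come from the article.

```python
from types import MappingProxyType

RTI_MS = 2.5             # real-time interrupt period in milliseconds
RTIS_PER_LINE = 24       # 24 x 2.5 ms = 60 ms per line
LINES_PER_FRAME = 800    # 800 x 60 ms = 48 s per frame


def split_time(rti_count):
    """Convert a running RTI count into (frame, line, rti-within-line)."""
    line_count, rti = divmod(rti_count, RTIS_PER_LINE)
    frame, line = divmod(line_count, LINES_PER_FRAME)
    return frame, line, rti


def run_interval(segments, live_inputs, held_outputs):
    """Simulate one RTI interval.

    Returns (outputs released at this RTI, outputs held for the next RTI).
    """
    released = dict(held_outputs)                   # last interval's outputs take effect now
    frozen = MappingProxyType(dict(live_inputs))    # inputs sampled once, then read-only
    next_outputs = {}
    for segment in segments:
        next_outputs.update(segment(frozen))        # collected, not yet executed
    return released, next_outputs


def camera_segment(inputs):
    return {"readout_line": inputs["line_sync"]}


def telemetry_segment(inputs):
    return {"telemetry_word": inputs["temperature"] * 2}


inputs = {"line_sync": 1, "temperature": 21}
_, forward = run_interval([camera_segment, telemetry_segment], inputs, {})
_, backward = run_interval([telemetry_segment, camera_segment], inputs, {})
```

Running the two segments in either order yields the same held outputs, which is the order-independence property the text relies on for testing. If two segments wrote the same output, the result would depend on order, so the distinct-output assumption matters.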
A few I/O functions which require faster time resolution than the RTI are handled by special-purpose hardware, e.g., pulse generation and DMA I/O circuits.

A conceptual diagram of the executive, which resides in every computer module, is shown in Figure 2. It is built around a scheduling table and is entered each RTI. Upon entry it suspends the background process, updates its time counter, and checks for proper exit during the last cycle. It then checks to see if a command has been placed in its memory and, if this is the case, it starts an associated program segment. The executive then checks its scheduling tables to see if any (segmented) foreground programs have requested reactivation at this time. (Activation can occur on the basis of either time or a memory word reaching a specified value.) If so, they are activated sequentially, and each returns to the executive after a few instructions. Upon completing the foreground, the executive returns control to the background program.

Program design constructs. A UDS program specification language has been developed which is based on PDL (a program design language), augmented by four constructs which provide timing and communications with the executive program.8 These constructs are START, WHEN, STOP, and BACKSTART, which can be executed from any foreground program.9 START is utilized to activate a new foreground program by placing its entry point in the scheduler. WHEN returns control to the scheduler from the active program and specifies the conditions under which it is to be reactivated. By using STOP, a program removes itself from the scheduler, and BACKSTART is used to initiate background programs. An example of part of a UDS program is shown in Table 1.

Breadboard findings. A UDS breadboard has been constructed using six computers: three high-level modules and three terminal modules, which carry out many of the processing functions of current JPL spacecraft (see Figure 3).
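The scheduling constructs described above (START, WHEN, STOP, and BACKSTART) lend themselves to a short sketch. The Python below is a loose analogy, not the UDS executive: it assumes generators as a stand-in for segmented foreground programs, with each `yield` playing the role of WHEN and handing the scheduler a reactivation condition. The Executive class, its method names, and the `read_sample` callback are hypothetical, and for brevity the scheduler is stepped once per spacecraft line, whereas the article's executive is entered every RTI. The example program is loosely modeled on the telemetry-gathering program of Table 1.

```python
class Executive:
    def __init__(self):
        self.line = 0             # current spacecraft line count
        self.scheduled = []       # [program, reactivation condition] entries
        self.background = None    # background program started by BACKSTART

    def start(self, program):
        """START: place a foreground program in the scheduler."""
        self.scheduled.append([program, lambda: True])

    def backstart(self, name):
        """BACKSTART: record which background program should run."""
        self.background = name

    def tick(self, line):
        """Run every foreground program whose condition is now satisfied."""
        self.line = line
        for entry in list(self.scheduled):
            program, ready = entry
            if not ready():
                continue
            try:
                entry[1] = next(program)       # runs until the next WHEN
            except StopIteration:              # STOP: leave the scheduler
                self.scheduled.remove(entry)


def engcycle(ex, databuf, read_sample):
    yield lambda: ex.line == 4                         # WHEN LINE COUNT = 4
    for _ in range(20):                                # DO I = 1,20
        databuf.append(read_sample(ex.line))           # read and store one sample
        yield lambda due=ex.line + 1: ex.line == due   # WHEN LINE = CURRENT LINE + 1
    ex.backstart("PROCESS")                            # BKSTART PROCESS
    # returning from the generator plays the role of STOP


ex = Executive()
databuf = []
ex.start(engcycle(ex, databuf, read_sample=lambda line: line * 10))
for line in range(30):
    ex.tick(line)
```

Each activation runs only the few statements between two `yield` points, so the foreground work per step stays short and bounded, which is the property the executive depends on.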
Preliminary results indicate that the system is easy to program, debug, and verify. Software debugging tools have been written which take advantage of the predictable, time-synchronized interactions between computers. The breadboard can be started and run to a given spacecraft time count, and then stopped for inspection of memories in the various computers. The memories can be inspected to determine if the correct bus transmissions have taken place and if the foreground programs are in the correct state. The complete system is effectively stepped in 2.5 millisecond (RTI) processing intervals, and the processing steps within each interval are known for each machine. This degree of visibility has greatly expedited debugging and, in turn, speeded up program development. Preventing terminal modules from initiating bus communications has also been valuable, helping to contain initial software errors to individual terminal module computers, and thus aid in debugging. The restrictions on control and the removal of demand interrupts have also made the system more predictable, and thus more easily debugged.

We have no quantitative measure on programmability or testability, but we are satisfied with the experiences that have occurred with the UDS system. Several major changes have occurred in the telemetry handling of the spacecraft simulation which caused reprogramming of several UDS computers. These changes were made rather easily in a matter of a few days.

Figure 2. At each real-time interrupt interval the local executive software, which resides in every computer module, suspends the background process (a), supervises a foreground process (b), then returns control to the background program.

Table 1. A typical foreground program.

    CONTROL
    START ENGCYCLE:                   (Program to gather engineering telemetry)
    OUTPUT CONTROL LEVELS;            (Initialize subsystem)
    WHEN LINE COUNT = 4
    DO I = 1,20;                      (This program segment gathers one data sample
      OUTPUT SAMPLE COMMAND;           every spacecraft line for 20 lines. It only
      READ DATA AND STORE IN           runs a few instructions each activation.)
        DATABUF (I);
      WHEN LINE = CURRENT LINE + 1
    ENDO;
    BKSTART PROCESS;                  (Starts a background program to process the collected data.)
    STOP

Figure 3. The Unified Data System breadboard, constructed at JPL, consists of three high-level modules and three terminal modules. It includes a spacecraft TV camera and its associated test gear and a spacecraft tape recorder as host subsystems for two of the terminal modules.

Building-block self-checking computers

In the UDS architecture, the high-level modules have direct access to the memories of a number of other computers in the system. They must be carefully protected against faults to prevent damaged information from being sent to the other computers. A second UDS breadboard is being constructed with high-level modules which are implemented as self-checking computers, which disable themselves upon detecting an internal fault.

Several interesting hardware implementations can be employed in the architecture of the self-checking computer modules for this type of system.

First, if memory-mapped I/O is employed and if intercommunication channels use DMA techniques, we no longer depend upon the specific I/O structure of any specific microprocessor. Using memory-mapped I/O, a set of memory addresses are reserved for I/O functions. Instead of accessing memory, reads and writes to these addresses are interpreted as commands (and data) to I/O devices, communications channels, and other internal hardware within the computer. This type of design allows the building of memory interface, bus interface, and I/O circuits which can be used with a wide variety of different microprocessors.

Second, with the next generation of LSI technology, it will be possible to implement the individual peripheral functions (bus and memory interfaces, I/O, and special functions) on a single chip. This technology allows construction of a small set of VLSI building-block circuits from which computer networks can be constructed with the choice of a number of different microprocessors.

Third, it has been shown that with VLSI technology, the cost of self-checking logic circuitry is proportionally small.⁵ The building blocks can be designed so that each computer checks itself concurrently with normal computations, and signals the existence of a fault. Upon discovering an internal fault, it logically disconnects itself from the system, and redundant computers can be substituted to continue the computations.

Self-checking computer modules. We have specified, and are currently designing, a set of circuits which serve as general building blocks for constructing self-checking computer modules. These designs will be breadboarded for subsequent VLSI implementation. The self-checking computer module contains commercial memories, two commercial microprocessors run in synchronization for fault detection, and four types of building block circuits. The building block circuits are (1) an error detecting and correcting memory interface, (2) a microprogrammable bus interface, (3) a digital I/O building block, and (4) a core building block. A typical self-checking computer module will contain 23 RAMs, two microprocessors, one memory interface, three bus interfaces, and one core building block.¹⁰

The self-checking computer module is built around a shared internal tri-state bus which consists of 16-bit address and data buses protected by multiple parity bits and control lines implemented as self-checking pairs. The building block circuits control and interface the various processor, external bus, memory, and I/O functions to the internal bus as shown in Figure 4. Each building block is responsible for detecting faults in its associated circuitry and then signalling the fault condition to the core building block by means of duplicate fault indicators. The various building block functions are listed below:

The memory interface building block interfaces a redundant set of memory chips to the internal bus. It provides Hamming correction to damaged memory data, replacement of a faulty bit with a spare, parity encoding and decoding to the internal bus, and detection of internal faults.

The bus interface building block can be microprogrammed to provide the function of a bus adaptor or bus controller. The bus system is being designed to utilize MIL STD 1553A communications formats.¹¹ Microprogrammed control is being utilized in the bus interface so that it can be reprogrammed to meet other additional communications formats. Internal faults within the bus interface are detected and signalled to the core building block. The processor is notified of improperly completed bus transmissions.

The I/O building block performs commonly used digital I/O functions and verifies that they are properly executed.

The core building block is responsible for (1) running two CPUs in synchronism and comparing their outputs to detect faults, (2) allocating the internal bus between the processor and building blocks, (3) collecting fault indications from itself and other building blocks, and (4) disabling its host computer module upon detection of a permanent fault.

Thus, each building block computer module is designed to detect its own faults, and automated recovery can be implemented with backup spares. Functional definition of the building block circuits has been completed, and a detailed logic design of the core, memory interface, and bus interface building blocks is underway. Two building block high-level modules will be constructed and tested to verify their performance. This will provide sufficient experience to begin VLSI circuit development.
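The checking role described for the core building block can be illustrated with a minimal sketch in Python. It is not the JPL design. The source states only that two CPUs run in synchronism with their outputs compared, that fault indications are collected, and that the module is disabled upon a permanent fault. The class name `CoreBuildingBlock`, the modeling of each CPU as a step function, the `report_fault` method, and the `retry_limit` used to separate transient from permanent miscompares are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CoreBuildingBlock:
    """Toy model of the core building block's checking role (hypothetical names)."""

    cpu_a: Callable[[int], int]   # first CPU, modeled as a pure step function
    cpu_b: Callable[[int], int]   # second CPU, run in lockstep with the first
    retry_limit: int = 2          # ASSUMPTION: retries allowed before a fault counts as permanent
    disabled: bool = False        # True once the module has disconnected itself
    fault_log: List[str] = field(default_factory=list)

    def report_fault(self, source: str) -> None:
        """Collect a fault indication from another building block.

        ASSUMPTION: a block-reported fault is treated as permanent here.
        """
        self.fault_log.append(source)
        self.disabled = True

    def step(self, operand: int) -> Optional[int]:
        """Run both CPUs on the same input; release the result only if they agree."""
        if self.disabled:
            return None  # a disabled module produces no outputs
        for _ in range(self.retry_limit + 1):
            out_a = self.cpu_a(operand)
            out_b = self.cpu_b(operand)
            if out_a == out_b:
                return out_a
            self.fault_log.append("cpu-miscompare")
        # Disagreement persisted through every retry: treat it as permanent.
        self.disabled = True
        return None


def increment(x: int) -> int:
    """Stand-in for one processing step of a healthy CPU."""
    return x + 1


healthy = CoreBuildingBlock(increment, increment)
print(healthy.step(4), healthy.disabled)   # prints: 5 False

faulty = CoreBuildingBlock(increment, lambda x: x + 2)  # second CPU always disagrees
print(faulty.step(4), faulty.disabled)     # prints: None True
```

Returning no result once the module is disabled mirrors the description of a computer that logically disconnects itself from the system, leaving recovery to a backup spare rather than to the faulty module.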
Figure 4. The self-checking computer module consists of four building blocks (core, memory interface, I/O, and bus interface) which share an internal tri-state bus. Each building block detects internal faults and reports them to the core building block.

Reconfiguration for fault recovery

High-level modules are responsible for replacing faulty terminal modules with spares. Since the terminal modules are attached by a number of wires to a specific subsystem, they must have dedicated spares which are also hooked to the same subsystem. Thus, block redundancy is used with cross-strapped redundant modules. The number of spares for each terminal module is determined by the criticality and failure rate of its associated subsystem. Since a terminal module does not have the ability to initiate a bus communication, it can only halt and signal an error. Recognition of a failed terminal and commands for reconfiguration are the responsibility of a high-level module.

In a typical UDS configuration, the high-level module performing the overall system control function (i.e., the spacecraft command computer) is responsible for polling the various computer modules within the system to determine if the module has isolated itself because of a failure. Reconfiguration is accomplished by sending commands to the various computers through their bus adapters (which are always powered). These commands can be (1) direct commands to the bus adapter for such functions as power control, halt processing, and restart, (2) to load or interrogate the local memory, (3) to read out fault status messages or (4) to reconfigure the internal building blocks within the module. Since there are several bus adapters in each computer module which are connected to independent bus systems, there are redundant paths for carrying out reconfiguration.

All high-level modules are self-checking. The controlling module is backed up by a "hot" spare which interrogates the status and restart parameters of the controlling module on a periodic basis. A "hot" backup spare is a spare computer which is programmed to take over a critical function in case of its failure. If very rapid recovery is required, it may even duplicate the computations of the controller. If the self-checking controlling module disables itself due to an internal fault, the "hot" spare takes over its ongoing computations.

In complex systems where there are several levels of control, each high-level module is responsible for fault detection and reconfiguration for the computers in its immediate control as described above. Spares for high-level modules are nondedicated and can come from a common shared set.

Conclusions

Potential users have been hesitant to adopt reconfigurable distributed processing systems because of their fear of the high degree of complexity of such systems. We have attempted to address this problem directly by employing architectural techniques to simplify the use of such systems in certain real-time control applications. From our initial spacecraft experiments, this conservative approach appears to be useful in producing a system which is easy to live with. The potential for VLSI building block implementations may result in a very large variety of fault-tolerant systems which can be assembled inexpensively and routinely.

The UDS architecture has been directed to near-term applications in an attempt to develop computing systems for our next several spacecraft missions. For far-term applications we see similar distributed systems configured in multilevel hierarchies; e.g., collections of computers will be used in much more complex subsystem functions. In these systems, the problems of complexity and reliability will become even more acute than they are today.
Most computers in the system will still be performing relatively simple control and data collection tasks required of various sensors and actuators. However, there will be points within many of these systems where very high computing performance is required. Problems of this type which are currently being studied are on-board image processing and on-board synthetic aperture radar processing. Due to the complexity of these dedicated tasks, special-purpose hardware implementations are viewed by many as a less costly alternative than general-purpose computers. However, they also generate new and interesting problems in providing internal fault detection and reconfiguration, both of which are necessary for fault recovery in systems of very high complexity.

Acknowledgment

The development of the Unified Data System is sponsored by the National Aeronautics and Space Administration under contract NAS7-100 with the California Institute of Technology at the Jet Propulsion Laboratory. Significant contributions to this team effort have been made by R.C. Caplette, P.E. Lecoq, H.F. Lesh, D.D. Lord, V.C. Tyree, and B. Riis Vestergaard. The fault-tolerant building block development is sponsored by the Naval Electronic Systems Command, Washington, DC, under the administration of Nate Butler and Larry Sumney of the Electronic Technology Division, ELEX 304.

References

1. The Solar System, W. H. Freeman and Company, San Francisco, 1975 (also September 1975 Special Issue of Scientific American).
2. B. H. Dobrotin and D. A. Rennels, "An Application of Microprocessors to a Mars Roving Vehicle," Proc. 1977 Joint Automatic Control Conference, San Francisco, June 1977.
3. W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability Modeling Techniques for Self-Repairing Computer Systems," Proc. 24th National Conference of the ACM (ACM Publication P-69), 1969.
4. W. C. Carter, et al., "Computer Error Control by Testable Morphic Boolean Functions - A Way of Removing Hardcore," Digest of Papers, 1972 International Symposium on Fault-Tolerant Computing, Newton, Massachusetts, IEEE Computer Society, June 1972, pp. 154-159.
5. W. C. Carter, et al., "Cost Effectiveness of Self-Checking Computer Design," Digest of Papers, 1977 International Symposium on Fault-Tolerant Computing, Los Angeles, California, June 1977.
6. P. S. Kilpatrick, et al., "All Semiconductor Distributed Processor/Memory Study," Volume I, Avionics Processing Requirements, Honeywell, Inc., AFAL TR-72, performed for the Air Force Avionics Laboratory, Wright Patterson Air Force Base, Ohio, November 1972.
7. D. A. Rennels, B. Riis Vestergaard, and V. C. Tyree, "The Unified Data System: A Distributed Processing Network for Control and Data Handling on a Spacecraft," Proc. IEEE National Aerospace and Electronics Conference, NAECON, Dayton, Ohio, May 1976.
8. S. H. Caine and E. K. Gordon, "PDL - A Tool for Software Design," AFIPS Conference Proceedings, Vol. 44, National Computer Conference, 1975, pp. 271-276.
9. F. Lesh and P. Lecoq, "Software Techniques for a Distributed Real-Time Processing System," Proc. IEEE National Aerospace and Electronics Conference, Dayton, Ohio, May 1976.
10. D. A. Rennels, A. Avizienis, and M. Ercegovac, "A Study of Standard Building Blocks for the Design of Fault-Tolerant Distributed Computer Systems," Proc. 1978 International Symposium on Fault-Tolerant Computing, Toulouse, France, June 1978.
11. Aircraft Internal Time Division Command/Response Multiplex Data Bus, DoD Military Standard 1553A, 30 April 1975. US Government Printing Office 1975-603-767/1472.

David A. Rennels is a member of the technical staff in the Spacecraft Data Systems Section of the Jet Propulsion Laboratory, Pasadena, California. His major areas of interest are distributed computer architectures and systems for fault-tolerant real-time computing. He received the BSEE from Rose-Hulman Institute of Technology, the MSEE from Caltech, and a PhD in computer science from the University of California at Los Angeles.