Standardized fault-tolerant microcomputers in a reconfigurable
distributed network promise to meet spacecraft reliability
requirements at low cost.
Reconfigurable Modular Computer Networks for Spacecraft On-Board Processing
David A. Rennels
Jet Propulsion Laboratory
Over the last 20 years, a number of unmanned
spacecraft have been sent to investigate the moon,
Mars, Venus, and Mercury and have returned more
information about these bodies than had been collected in all of previous human history.1 Each
of these spacecraft, as well as those in the planning stages for future missions, consists of two
major parts: a science payload and a set of engineering subsystems which supply power, control, and communications with the science instruments. The science payload scans the electromagnetic spectrum and the environment in the vicinity
of the spacecraft. Typical experiments employ television cameras, ultraviolet and infrared scanning
devices, charged particle detectors, magnetometers,
and radio astronomy instruments. The core engineering subsystem-the one that must control and
collect data from these experiments, as well as
point the spacecraft, control the other engineering
subsystems, and handle any serious anomalies
which may occur on the flight-is the computer.
Distributed computers. The core electronics subsystems on these spacecraft have progressed
through an evolution from simple fixed controllers
and analog computers in the 1960's to general-purpose digital computers in current designs. This
evolution is now moving in the direction of distributed computer networks. Current Voyager
spacecraft already use three on-board computers.
One is used to store commands and provide overall spacecraft management. Another is used for
instrument control and telemetry collection, and
the third computer is used for attitude control
and scientific instrument pointing. The scientific
instruments are also candidates for dedicated
computers. These instruments vary in complexity,
but they all contain command interfaces and internal logic sequencers which generate control signals
to operate electronic mechanisms and to collect
data. Instrument cycles vary in periodicity from a
few seconds to a minute. An examination of the
control logic in these instruments shows that, for
many, it is cost-effective to replace the sequencing
logic with a microcomputer-either to save chips or
to establish standardization in instrument logic
designs.
An additional factor in favor of multiple computers is a potential simplification of interfaces.
Typically, science instruments and engineering
subsystems are produced by different contractors
to meet common interface specifications. These
interfaces become quite detailed and have very
complex timing requirements when each instrument and subsystem must share a common computer. By placing computers in the various instruments and subsystems, the contractor can handle
his own details of timing and control. The system-level interface becomes much simpler and the contractor can thoroughly test his instrument or subsystem without waiting for externally supplied
computer hardware and software.
A number of scientists and subsystem engineers
have expressed interest in using microcomputers,
and it has become clear that the new microprocessor technology could result in wider and wider use
of computers on future spacecraft. If we are to be
successful in developing a systematic approach to
spacecraft distributed processing, the eventual
architecture must strongly reflect the unusual
requirements placed on on-board computing.
On-board requirements. Reliability is the most
severe constraint upon the design of spacecraft
computing systems, and it represents the large
majority of associated costs. The spacecraft must
survive for several years in an environment where
repair is not possible, and ground intervention
(through commands) cannot be guaranteed in less
than several hours, or perhaps several days if the
spacecraft isn't continually tracked. This means
carrying along two or more times the required
computing hardware in the form of redundant
processors, memories, and I/O circuits. Computer
failure cannot be tolerated, since a spacecraft often
represents an investment of a hundred million
dollars. Additional problems such as wide temperature ranges, lower power availability, and radiation
hazards often necessitate selection of special parts.
The reliability screening of these parts drives their
costs to 10 times those for the commercial marketplace.
Finally, spacecraft computers and their software
are characterized by extensive testing. Each subsystem is tested for hundreds of hours; the spacecraft is then assembled and again tested for
hundreds of hours. Flight and ground software
systems are developed and similarly tested. The
effects of all spacecraft commands are simulated
in detail before being transmitted, since the effect
of an improper command may not be determined
for hours after damage is done. This can require
thousands of hours of CPU time on a large general-purpose computing system. During a planetary
encounter of a few hours' duration, a scientific
experimenter must be confident that the spacecraft
is performing as specified, in order to have useful
results. Thus, the spacecraft computer architecture
must exhibit the properties of testability and ease
of software validation. It must also provide automatic reconfiguration for recovery from on-board
computer hardware faults, while staying within
stringent power, weight, and volume constraints.
These requirements have been addressed in a
multiple computer architecture developed at the
Jet Propulsion Laboratory for use in real-time
control and data handling on unmanned spacecraft.
Optimization criteria were quite different from the
usual goals of throughput and efficiency, relating
instead to testability, fault-tolerance, and ease of
use.
A breadboard of this architecture, designated
the "Unified Data System," has been constructed
at JPL and programmed to perform the processing
tasks of a typical spacecraft. A review of the UDS
design indicates some of the potentialities for
reliability in systems employing such an architecture.
Architectural considerations
The high reliability requirements and the need
for extensive hardware and software testing in
spacecraft computer systems demand that architecture and interfaces be simplified to the greatest
extent consistent with adequate performance. Multiple computer configurations have a potential for
chaotic complexity unless rigid standards are
imposed on programming and intercommunication
between computers. Summarized below are the
architectural characteristics devised for the UDS
in an attempt to achieve a more manageable system. Several of these approaches apply to distributed systems in general; others are more oriented
to a spacecraft's real-time computing tasks. The
goal is to simplify interfaces between computers
and to make their operation more predictable to
simplify testing and fault analysis.
Intercommunications. We believe that it is worth
the hardware investment to provide a powerful
mechanism for intercommunication between computers. The requirements on software for detailed
control of incoming and outgoing data should be
minimized. This implies the use of hardware control
of data movement between computer memories
using direct-memory-access techniques. Redundancy
and reconfigurability of the intercommunication
system are essential to prevent failure from a single
fault, and the hardware mechanisms should verify
proper
transmission through automated status
messages.
Computer utilization. It is often economical to
supply much more computing capacity (memory
and processing performance) than is required for
a given application. As utilization of memory
approaches 100 percent, or as real-time computing
demands approach the maximum speed of a
machine, design and verification of software
usually increase dramatically in complexity. Costs
of software have exceeded costs of hardware in
spacecraft systems for some time, and VLSI technology will further reduce hardware costs. In many
spacecraft subsystems, a dedicated computer will
be used well below its nominal capabilities.
Restricted communications. It is advisable to
restrict communications in the spacecraft computing network to provide error confinement and
to simplify testing. The majority of computers
in the spacecraft system will be performing
dedicated low-level functions within various instruments and subsystems. These computers
should not have the ability to arbitrarily modify
the memories of other computers or to tie up an
intercommunications bus. If this were the case,
hardware and software faults in low-level computers
could propagate throughout the system. There are
many ways to achieve this fault confinement, and
we have chosen centrally controlled buses which
do not allow low-level computers to initiate intercommunications activity. Centralized bus control is well adapted to the synchronous nature of spacecraft processes, as described below.

Synchronous functions. In order to allow correlation of the results of various scientific experiments and to provide data at the right time to fit into telemetry formats, the control of experiments and engineering subsystems is tightly synchronized. Subroutines in the associated computers meet strict timing requirements in generating control cycles for their various instruments. Thus, for any spacecraft operational mode, the state and periodicity of most instrument and subsystem cycles are well defined, as are the requirements for data transfer between computers. This prevents conflicts and simplifies intercommunications since no conflict arbitration is required.

The central bus controller establishes a periodic set of data transmissions between computers as needed for the particular cycles they are performing. (The few nonperiodic functions can be treated in a similar fashion by establishing periodic transmission of message buffers between their associated subroutines.) Bus control is easily verified since it is generated from a single controller (from internal memory tables) and is highly predictable. The cost of forcing intercommunications into periodic data movements is a reduction of response time through the bus. A computer must wait for its time slot before communicating with another machine. This restriction is acceptable in the spacecraft, because of the periodic nature of its computations, and thus we have sacrificed performance (concurrency) for increased testability.

Characteristics such as synchronous communications, tree-structured control, avoidance of demand interrupts, and fault tolerance increase testability, reliability, and ease of use-at some expense in processing performance.

Control hierarchy. The spacecraft computing structure is hierarchic. At the bottom is the set of dedicated terminal computers within instruments and subsystems. For simple spacecraft a two-level hierarchy is utilized. A single command computer stores commands from the earth, directs processing in the various subsystem computers, and specifies the data movements to be carried out between them. For more complex systems, such as the Mars Roving Vehicle, the hierarchy is extended to at least three levels.2 Each of several groups of subsystem computers is controlled by a subsystem group control computer. These intermediate computers are in turn controlled by a master control computer.

There are two types of information movement between the memories of the computers in the spacecraft system. The first type is the movement of data between ongoing programs in different computers, which occurs in a predictable periodic fashion. The second type of information is commands which specify the algorithms to be performed in the various computers. In order to simplify testing and software verification, we have restricted each computer in the network to receiving commands from only one higher-level computer. This higher-level computer also controls data movement into and out of the lower-level machine, but the data can come from any of a number of different computers under its control.

Programs in the low-level computers interface with the other computers through their own local memories. These programs are invoked by the high-level computer, which also has responsibility for placing the operands which are needed in their local memories. These programs, in the low-level computers, are self-synchronized to process the data when it arrives and to place results in their memories for subsequent extraction by the high-level computer. This approach fits well with the hierarchic nature of the spacecraft system and with the fact that most of its computing functions are periodic. Control between computers is effectively limited to a tree structure to provide simplification and a degree of fault containment.

Timing hierarchy. Many control systems exhibit a timing hierarchy. Simple, high-rate functions are often done at the bottom. Functions of intermediate rate and complexity are often done at the next higher level, and complex processes are often done at the top. The more complex processes often have a wider latitude in timing resolution. A spacecraft computing system can take advantage of this hierarchy to simplify software and the interface between computers.

It is frequently useful to offload simple high-rate signal generation from software into I/O hardware. This simplifies expensive software in subsystem computers at a cost of less expensive hardware. Similarly, the software in the low-level subsystem computers should be designed to minimize the timing resolution required of the high-level computer which sends its messages over an intercommunications bus. This simplifies expensive system interfaces.

Interrupts. Whenever possible, demand interrupts should be avoided. The software should determine when and in what order it interfaces with the outside world. This tends to increase slightly the number of instructions required and may limit I/O response to millisecond rather than microsecond resolution. But it does lead to more predictable operation, is more easily verified, and allows for software self-defense. Spacecraft systems are extensively simulated, and if no restrictions are placed on the response to external stimuli, this simulation can become extremely expensive. There can be a very large number of possible orderings and timings of incoming service requests. By restricting this set of possible input states, software can be more easily verified and have higher reliability.
I/O timing granularity. The on-board computers must generate precisely timed control signals for their associated subsystems. Several programs may operate concurrently in a single machine, each one of which is generating a precisely timed series of inputs and outputs. For example, the
dedicated television computer may control the
readout of picture lines, sample telemetry measurements, format picture data for readout, and execute
several other concurrent functions which must be
precisely timed.
It is important to be able to change any one of
these programs without affecting the input and
output timing of the others. This can be achieved
to a large extent by imposing granularity on I/O-i.e., inputs are sampled and held for uniform
(several millisecond) intervals. During these intervals segments of several concurrent foreground
programs may be executed. Their outputs are
collected by I/O hardware and held until the end of
the time interval, and then all outputs are executed
at once. The program segments can be executed
in any order, and some can be removed without
affecting the output timing of the others. Programs
can be added as long as the total computation
for any interval does not exceed the time available.
This approach simplifies simulation since the
possible order and timing of inputs are drastically
reduced in complexity, and visibility into the system
is improved for testing since programs are executed
in well defined steps during which inputs are held
constant. Software can be more easily modified.
The cost of this approach is reduced response time
to external events. It may require two to three
time intervals, on the order of 5-7 milliseconds, for
the computer to acquire unexpected data and
deliver a response. This is acceptable for the
spacecraft application.
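To make the interval discipline concrete, here is a minimal sketch in present-day Python (the segment names and values are invented, not flight software): inputs are latched once per interval, the foreground segments run in an arbitrary order against those latched values, and all outputs are released together at the interval boundary, so the ordering of segments cannot change the externally visible timing.

    import random

    def run_interval(latched_inputs, segments):
        """Run one I/O-granular interval: segments see only the latched
        inputs, and their outputs are held until the interval ends."""
        pending_outputs = {}
        # Execution order within the interval is deliberately irrelevant.
        for segment in random.sample(segments, len(segments)):
            pending_outputs.update(segment(latched_inputs))
        return pending_outputs  # released to the I/O hardware all at once

    # Two hypothetical foreground segments for a TV-like instrument.
    def sample_telemetry(inp):
        return {"telemetry_word": inp["sensor_voltage"] * 2}

    def advance_line_readout(inp):
        return {"line_select": inp["current_line"] + 1}

    latched = {"sensor_voltage": 3, "current_line": 41}  # sampled and held
    print(run_interval(latched, [sample_telemetry, advance_line_readout]))
    # Same result no matter which segment happened to run first.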
Fault detection and automatic reconfiguration.
Unlike most applications, an interplanetary spacecraft experiences the maximum demand for computing capacity at the end of a mission when, after
cruising through space for a year or more, it reaches
its designated target. The fault-tolerance techniques
employed must give a high probability that a
system is fully operational at the end of a mission.
Thus, enough spare hardware must be carried along
to substitute for all faulty units, rather than relying
on graceful degradation. Reconfiguration for fault
recovery consists of detecting faulty computers
and substituting properly functioning spares.
To achieve high reliability over a long period of
time, it has been shown that the mechanisms for
fault detection and recovery must be nearly perfect.
That is, coverage, defined as the conditional
probability of effecting recovery from a fault, must
approach unity.3 To achieve a high degree of fault
detection, we have chosen to design self-checking computers for use within the distributed computing system. These are computers containing internal checking hardware that can detect nearly all possible internal faults concurrently with normal software operation. The methodology for designing self-checking computers is well developed, and using VLSI technology this capability can be implemented at relatively low cost.4,5

Each self-checking computer in the network disables itself upon detecting an internal fault. A
high-level control computer monitors the various
other computers and, upon discovering a fault-disabled computer, activates a replacement spare
by commands through the bus system. The control
computer, in turn, has a "hot" backup spare (with
a separate bus system) which is carrying out the
same programs as the controller and takes over if
it should fail and disable itself.
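The sensitivity to coverage can be illustrated with a deliberately simplified standby-spare model (the numbers and the approximation are illustrative, not taken from the article): if one computer survives the mission with probability R, and a fault is detected and recovery completed with probability c, then an active unit backed by one spare survives with probability of roughly R + c(1 - R)R.

    def duplex_standby_reliability(r_module, coverage):
        """Rough mission reliability of one active module plus one spare:
        either the active survives, or it fails, the failure is covered
        (detected and recovery succeeds), and the spare survives."""
        return r_module + coverage * (1.0 - r_module) * r_module

    r = 0.90  # hypothetical single-module mission reliability
    for c in (1.00, 0.99, 0.95, 0.90):
        print(f"coverage {c:.2f}: system reliability {duplex_standby_reliability(r, c):.4f}")
    # The probability of losing the system nearly doubles between
    # c = 1.00 and c = 0.90, which is why coverage must approach unity.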
These nine characteristics represent the conservative design approach employed in the Unified
Data System. Synchronous communications, tree-structured control, removal of interrupts, granularity of I/O, and fault-tolerance techniques are all
directed at increasing testability, reliability, and
ease of use, at some expense in processing performance (response time and throughput). It makes
an unusual point in design space because we start
with more computer hardware capability than is
needed, and accept inefficient use of this hardware
in an attempt to achieve a more manageable
system. The approach is tailored to spacecraft
applications, but we feel that it applies to a
number of other real-time applications as well.6
We next describe the architecture of the Unified
Data System, first describing an initial breadboard
system which did not incorporate fault-detection
and recovery features. Its main objective was to
verify the software and intercommunications structure of the system. A second breadboard is under
way which includes fault tolerance. Its main
difference from the first is that self-checking computer modules are employed along with backup
spares for fault recovery by means of reconfiguration. Its software and communication techniques
are nearly identical to the initial system.
UDS architecture
The Unified Data System architecture consists
of a set of standard microcomputers connected by
several redundant buses as shown in Figure 1.
The microcomputer modules, which use the same
microprocessor and software executive, fall into
two types: terminal modules and high-level modules.
Terminal modules are located in various spacecraft subsystems and are responsible for local
control and data collection. The terminal module
contains a microprocessor, memory (RAM), I/O
modules, and several bus adaptors which interface
with each of several intercommunications buses.
The bus adaptors are DMA controllers which allow
the bus systems to enter and extract data from the
terminal module's memory. A high-level module
Figure 1. The Unified Data System, developed at Caltech's Jet Propulsion Laboratory, consists of two levels
of standard microcomputer modules, connected by
buses, some redundant. The bus controllers in the high-
level module type, in addition to monitoring data movement, can release their buses under certain conditions. The bus adapters in both
module types control direct memory access.
enters commands, data, and timing information
into prearranged areas within the terminal module.
The terminal module delivers information to the
system by placing outgoing messages in predetermined locations of its memory, which can
then be extracted by a high-level module over the
bus. The terminal module memory can be accessed
by several buses simultaneously, and its processor
is seldom notified when such a transaction occurs.
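One way to picture the terminal module's side of these transactions is as a set of prearranged mailbox regions in its RAM (a sketch in Python; the region names and command encoding are invented for illustration): the bus adaptors deposit commands, data, and time counts by direct memory access, and the local program reads and writes those regions without ever being interrupted.

    # Fixed regions of a terminal module's memory, agreed on in advance
    # with the high-level module. Names and addresses are hypothetical.
    terminal_ram = {
        "command_area":  [0, 0],     # written by the high-level module
        "time_count":    [0],        # broadcast frame/line count
        "output_buffer": [0, 0, 0],  # read out later over the bus
    }

    def dma_write(region, words):
        """Bus adaptor action: deposit words without involving the local CPU."""
        terminal_ram[region][:len(words)] = words

    def terminal_program_step():
        """Local program: act on whatever is currently in the mailboxes."""
        mode, parameter = terminal_ram["command_area"]
        if mode == 1:                                  # hypothetical 'sample' command
            terminal_ram["output_buffer"][0] = parameter * 2
        return terminal_ram["output_buffer"]

    dma_write("command_area", [1, 21])   # high-level module issues a command
    dma_write("time_count", [4800])      # and the current time count
    print(terminal_program_step())       # [42, 0, 0] awaits extraction by DMA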
Each high-level module consists of a microprocessor, memory, bus adaptors, and a bus controller.
Each bus controller, which is unique to high-level
modules, can move blocks of data between memories of all computers connected to its bus. This is
the mechanism by which the high-level module can
coordinate the processing in a set of remote
terminal modules by entering commands into their
memories and reading out information to monitor
ongoing processes.
The bus controller is a highly autonomous unit
which acts like the data channels of much larger
machines. When signalled by the high-level module
processor, it reads a control table from the module's
memory, interprets the table and controls the
requested data movement, verifies proper transmission through status messages, and notifies
the processor upon completion. Each bus controller
has a dedicated bus under its control, but can
relinquish its bus under one of two conditions:
(1) its power is turned off or (2) its processor
releases the bus for a specified time interval. Thus,
spare modules can gain access to a bus whose
processor has failed, or a bus can be multiplexed if
several other buses have failed. The individual
buses are physically independent, and therefore no
central controller exists for all buses as a potential
catastrophic failure mechanism. A more detailed
description of this hardware architecture can be
found elsewhere.7
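As a rough functional analogy (a Python sketch with invented table fields, not the actual bus protocol), the controller can be pictured as walking a control table that names source and destination memory regions, moving each block by DMA, and checking a status reply before reporting completion to its processor.

    # Hypothetical module memories, addressed word by word.
    memories = {
        "command_module":    {100: 7, 101: 8, 102: 9, 200: 0, 201: 0},
        "tv_terminal":       {0: 0, 1: 0, 2: 0},
        "attitude_terminal": {50: 42, 51: 43},
    }

    # Control table read from the high-level module's memory:
    # (source, source address, destination, destination address, word count)
    control_table = [
        ("command_module", 100, "tv_terminal", 0, 3),
        ("attitude_terminal", 50, "command_module", 200, 2),
    ]

    def run_bus_controller(table):
        """DMA-style block moves between module memories, with a status check."""
        for src, s_addr, dst, d_addr, count in table:
            block = [memories[src][s_addr + i] for i in range(count)]
            for i, word in enumerate(block):
                memories[dst][d_addr + i] = word
            # Status message: the destination's contents are read back so the
            # controller can verify the transmission before moving on.
            ok = [memories[dst][d_addr + i] for i in range(count)] == block
            print(f"{src} -> {dst}: {count} words, status {'OK' if ok else 'RETRY'}")

    run_bus_controller(control_table)
    print(memories["tv_terminal"])   # neither host processor was interrupted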
The distinguishing features of this hardware
architecture are (1) a busing system which offers a
high degree of redundancy and takes most of the
burden of intercommunication off the host computers, (2) centralized bus control and limited bus
access which makes the buses more predictable
and helps prevent faulty terminal modules from
propagating damaged information, (3) a design for
synchronous operation without external demand
interrupts, and (4) granularity of I/O to simplify
software modifications and verification.
System synchronization and control structure
The UDS computers are synchronized by a
common 2.5-millisecond real-time interrupt. Various
counts of RTI intervals define the uniform time
measurement throughout the spacecraft. Analogous
to minutes and seconds, the UDS keeps time in
frames and lines. A frame (48 sec) comprises 800
lines (60 msec) and a line comprises 24 RTI intervals.
These unusual values of time are chosen for convenience because they correspond to the cycles of
instruments on a typical spacecraft. A television
picture, consisting of 800 lines read out every 60
milliseconds, is completed every 48 seconds. Other
instruments are synchronized to TV lines, and
telemetry sequences tend to repeat on these
intervals.
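In these units a line is 24 RTIs (24 x 2.5 ms = 60 ms) and a frame is 800 lines (800 x 60 ms = 48 s); a small conversion routine (illustrative Python, with invented names) makes the arithmetic explicit.

    RTI_MS = 2.5           # real-time interrupt period
    RTIS_PER_LINE = 24     # 24 * 2.5 ms = 60 ms, one TV line
    LINES_PER_FRAME = 800  # 800 * 60 ms = 48 s, one TV frame

    def spacecraft_time(rti_count):
        """Convert a running RTI count into (frame, line, rti), the way
        elapsed seconds are carved into minutes and seconds."""
        rti = rti_count % RTIS_PER_LINE
        lines = rti_count // RTIS_PER_LINE
        return lines // LINES_PER_FRAME, lines % LINES_PER_FRAME, rti

    # One full frame is 800 * 24 = 19,200 RTIs, i.e. 48 seconds.
    print(spacecraft_time(19200))        # (1, 0, 0)
    print(spacecraft_time(19200 + 25))   # (1, 1, 1)
    print(19200 * RTI_MS / 1000, "s")    # 48.0 s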
System executive. The highest level computer
in the network serves as a system executive and
broadcasts both time counts and commands into
designated areas within the memories of the other
computers. It reads out data needed for control
decisions. Under this computer may be several
additional high-level modules which serve to control
collections of terminal modules or provide specialized computing services for the network.2 In
simpler systems, the system executive module
directly controls the terminal modules. As described
in the next section, these high-level modules are
being designed to be hardware self-checking. This
is necessary to prevent faulty modules from sending damaged information throughout the system.
Similarly, error-detecting codes are employed in
the bus to allow detection of damage to information transmitted over the bus.
After receiving commands from a high-level
module, a terminal module starts a set of specified
programs, which utilize the timing information
received from the high-level module to synchronize
with the rest of the spacecraft. For each program,
the time counts at which data is to appear in memory, at which control signals are to be generated,
and at which processed data is to be stored in
memory for extraction by the high-level module
have been precisely specified.
In order to support several concurrent programs
which generate precisely timed results and also to
support more complex programs which are not
easily segmented, software is run in a foreground-background partition. Each processor in the spacecraft system has a well defined set of foreground
program segments to run in each RTI interval. The
foreground programs are run in short segments
which control and time I/O. They also start and
stop unsegmented background programs which
perform more elaborate computations which have
less stringent timing requirements.
To simplify testing, the results of program
segments run during any RTI interval should not
be dependent upon the order or timing of their
execution within the interval. In order to prevent
changing input values from making foreground
program segment results time-dependent within an
RTI interval, inputs are sampled and held throughout the interval by the I/O circuits. Output commands are held and executed at the next RTI to
make output timing independent of the speed or
order of execution of the foreground program segments. Thus, if the system is stopped at the end
of an RTI period, the software state of the foreground segments is known and can be easily
verified. A few I/O functions which require faster
time resolution than the RTI are handled by
special-purpose hardware-e.g., pulse generation
and DMA I/O circuits.
A conceptual diagram of the executive, which
resides in every computer module, is shown in
Figure 2. It is built around a scheduling table and
is entered each RTI. Upon entry it suspends the
background process, updates its time counter, and
checks for proper exit during the last cycle. It then
checks to see if a command has been placed in its
memory and, if this is the case, it starts an associated program segment. The executive then checks
its scheduling tables to see if any (segmented)
foreground programs have requested reactivation
at this time. (Activation can occur on the basis of
either time or a memory word reaching a specified
value). If so, they are activated sequentially, and
each returns to the executive after a few instructions. Upon completing the foreground, the executive returns control to the background program.
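A highly simplified emulation of that per-RTI executive (illustrative Python; the table layout and names are assumptions, not the flight executive) shows the shape of the mechanism: a scheduling table of foreground segments, each of which runs briefly and then names the time count at which it wants to be reactivated.

    # Scheduling table entries: [wake_time, segment]. A segment runs a few
    # "instructions" and returns its next wake time, or None to stop.
    schedule = []

    def start(segment, wake_time):
        schedule.append([wake_time, segment])

    def executive(rti_count):
        """Entered at every RTI: run any foreground segment whose wake time
        has arrived, then fall back to the background program."""
        for entry in list(schedule):
            wake_time, segment = entry
            if rti_count >= wake_time:
                next_wake = segment(rti_count)   # WHEN-style reactivation request
                if next_wake is None:            # STOP: remove from the table
                    schedule.remove(entry)
                else:
                    entry[0] = next_wake
        background_step()

    def sample_segment(now):
        print(f"RTI {now}: output sample command")
        return now + 24 if now < 96 else None    # once per line, then stop

    def background_step():
        pass   # a long, less time-critical computation would resume here

    start(sample_segment, wake_time=24)
    for rti in range(121):
        executive(rti)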
Program design constructs. A UDS program
specification language has been developed which is
based on PDL (a program design language), augmented by four constructs which provide timing
and communications with the executive program.8
These constructs are START, WHEN, STOP, and
BACKSTART, which can be executed from any foreground program.9 START is utilized to activate a
new foreground program by placing its entry point
in the scheduler. WHEN returns control to the
scheduler from the active program and specifies
the conditions under which it is to be reactivated.
By using STOP, a program removes itself from the
scheduler, and BACKSTART is used to initiate background programs. An example of part of a UDS
program is shown in Table 1.
Breadboard findings. A UDS breadboard has
been constructed using six computers: three high-level modules and three terminal modules, which
carry out many of the processing functions of
current JPL spacecraft (see Figure 3). Preliminary
results indicate that the system is easy to program,
debug, and verify. Software debugging tools have
been written which take advantage of the predictable, time-synchronized interactions between computers. The breadboard can be started and run to a
given spacecraft time count, and then stopped for
inspection of memories in the various computers.
The memories can be inspected to determine if the
correct bus transmissions have taken place and if
the foreground programs are in the correct state.
The complete system is effectively stepped in 2.5
millisecond (RTI) processing intervals, and the processing steps within each interval are known for
each machine. This degree of visibility has greatly
expedited debugging and, in turn, speeded up
program development.
Preventing terminal modules from initiating bus
communications has also been valuable, helping to
contain initial software errors to individual terminal
module computers, and thus aid in debugging. The
restrictions on control and the removal of demand
interrupts have also made the system more predictable, and thus more easily debugged. We have
Figure 2. At each real-time interrupt interval the local executive software, which resides in every computer module, suspends the background process (a), supervises a foreground process (b), then returns control to the background program.
no quantitative measure on programmability or
testability, but we are satisfied with the experiences
that have occurred with the UDS system. Several
major changes have occurred in the telemetry
handling of the spacecraft simulation which caused
reprogramming of several UDS computers. These
changes were made rather easily in a matter of a
few days.
Table 1. A typical foreground program.

    CONTROL START ENGCYCLE:            (Program to gather engineering telemetry)
      (Initialize subsystem)
      OUTPUT CONTROL LEVELS;
      WHEN LINE COUNT = 4
      DO I = 1,20;                     (This program segment gathers one data sample
        OUTPUT SAMPLE COMMAND;          every spacecraft line for 20 lines. It only
        READ DATA AND STORE IN          runs a few instructions each activation.)
          DATABUF (I);
        WHEN LINE = CURRENT LINE + 1
      ENDO;
      BKSTART PROCESS;                 (Starts a background program to process
      STOP                              the collected data.)

Several interesting hardware implementations can be employed in the architecture of the self-checking computer modules for this type of system.

Building-block self-checking computers

In the UDS architecture, the high-level modules have direct access to the memories of a number of other computers in the system. They must be carefully protected against faults to prevent damaged information from being sent to the other computers. A second UDS breadboard is being constructed with high-level modules which are implemented as self-checking computers, which disable themselves upon detecting an internal fault.
Figure 3. The Unified Data System breadboard, constructed at JPL, consists of three high-level modules and three terminal modules. It includes a spacecraft TV camera and its associated test gear and a spacecraft tape recorder as host subsystems for two of the terminal modules.

First, if memory-mapped I/O is employed and if intercommunication channels use DMA techniques, we no longer depend upon the specific I/O structure of any specific microprocessor. Using memory-mapped I/O, a set of memory addresses are reserved for I/O functions. Instead of accessing memory, reads and writes to these addresses are interpreted as commands (and data) to I/O devices, communications channels, and other internal hardware within the computer. This type of design allows the building of memory interface, bus interface, and I/O circuits which can be used with a wide variety of different microprocessors.

Second, with the next generation of LSI technology, it will be possible to implement the individual peripheral functions (bus and memory interfaces, I/O, and special functions) on a single chip. This technology allows construction of a small set of VLSI building-block circuits from which computer networks can be constructed with the choice of a number of different microprocessors.

Third, it has been shown that with VLSI technology, the cost of self-checking logic circuitry is proportionally small.5 The building blocks can be designed so that each computer checks itself concurrently with normal computations, and signals the existence of a fault. Upon discovering an internal fault, it logically disconnects itself from the system, and redundant computers can be substituted to continue the computations.

Self-checking computer modules. We have specified, and are currently designing, a set of circuits which serve as general building blocks for constructing self-checking computer modules. These designs will be breadboarded for subsequent VLSI implementation. The self-checking computer module contains commercial memories, two commercial microprocessors run in synchronization for fault detection, and four types of building block circuits. The building block circuits are (1) an error detecting and correcting memory interface, (2) a microprogrammable bus interface, (3) a digital I/O building block, and (4) a core building block. A typical self-checking computer module will contain 23 RAMs, two microprocessors, one memory interface, three bus interfaces, and one core building block.10

The self-checking computer module is built around a shared internal tri-state bus which consists of 16-bit address and data buses protected by multiple parity bits and control lines implemented as self-checking pairs. The building block circuits control and interface the various processor, external bus, memory, and I/O functions to the internal bus as shown in Figure 4. Each building block is responsible for detecting faults in its associated circuitry and then signalling the fault condition to the core building block by means of duplicate fault indicators. The various building block functions are listed below:

The memory interface building block interfaces a redundant set of memory chips to the internal bus. It provides Hamming correction to damaged memory data, replacement of a faulty bit with a spare, parity encoding and decoding to the internal bus, and detection of internal faults.

The bus interface building block can be microprogrammed to provide the function of a bus adaptor or bus controller. The bus system is being designed to utilize MIL-STD-1553A communications formats.11 Microprogrammed control is being utilized in the bus interface so that it can be reprogrammed to meet other additional communications formats. Internal faults within the bus interface are detected and signalled to the core building block. The processor is notified of improperly completed bus transmissions.

The I/O building block performs commonly used digital I/O functions and verifies that they are properly executed.

The core building block is responsible for (1) running two CPUs in synchronism and comparing their outputs to detect faults, (2) allocating the internal bus between the processor and building blocks, (3) collecting fault indications from itself and other building blocks, and (4) disabling its host computer module upon detection of a permanent fault.

Thus, each building block computer module is designed to detect its own faults, and automated recovery can be implemented with backup spares. Functional definition of the building block circuits has been completed, and a detailed logic design of the core, memory interface, and bus interface building blocks is underway. Two building block high-level modules will be constructed and tested to verify their performance. This will provide sufficient experience to begin VLSI circuit development.
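The fault-detection idea at the heart of the core building block, running two processors in lock-step and comparing their outputs, can be caricatured in a few lines (illustrative Python with an artificially injected fault; in the real module the comparison is done by hardware on every cycle).

    def cpu_step(state, inp):
        """One deterministic 'machine cycle' of a trivial accumulator CPU."""
        return (state + inp) & 0xFFFF   # 16 bits, like the internal bus

    def lockstep_run(inputs, fault_at=None):
        """Run two identical CPUs on the same inputs and compare outputs.
        A miscompare latches a fault indication and disables the module."""
        a = b = 0
        for cycle, inp in enumerate(inputs):
            a = cpu_step(a, inp)
            b = cpu_step(b, inp)
            if fault_at == cycle:
                b ^= 0x0004             # inject a single-bit error in CPU B
            if a != b:
                print(f"cycle {cycle}: miscompare, module disables itself")
                return None             # drops off the bus; a spare takes over
        return a

    print(lockstep_run([1, 2, 3, 4]))               # healthy pair: 10
    print(lockstep_run([1, 2, 3, 4], fault_at=2))   # caught at the faulty cycle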
Figure 4. The self-checking computer module consists
of four building blocks-core, memory interface, I/O,
and bus interface-which share an internal tri-state bus.
Each building block detects internal faults and reports
them to the core building block.
Reconfiguration for fault recovery

High-level modules are responsible for replacing faulty terminal modules with spares. Since the terminal modules are attached by a number of wires to a specific subsystem, they must have dedicated spares which are also hooked to the same subsystem. Thus, block redundancy is used with cross-strapped redundant modules. The number of spares for each terminal module is determined by the criticality and failure rate of its associated subsystem. Since a terminal module does not have the ability to initiate a bus communication, it can only halt and signal an error. Recognition of a failed terminal and commands for reconfiguration are the responsibility of a high-level module. In a typical UDS configuration, the high-level module performing the overall system control function-i.e., the spacecraft command computer-is responsible for polling the various computer modules within the system to determine if the module has isolated itself because of a failure.

Reconfiguration is accomplished by sending commands to the various computers through their bus adapters (which are always powered). These commands can be (1) direct commands to the bus adapter for such functions as power control, halt processing, and restart, (2) to load or interrogate the local memory, (3) to read out fault status messages, or (4) to reconfigure the internal building blocks within the module. Since there are several bus adapters in each computer module which are connected to independent bus systems, there are redundant paths for carrying out reconfiguration.

All high-level modules are self-checking. The controlling module is backed up by a "hot" spare which interrogates the status and restart parameters of the controlling module on a periodic
basis. A "hot" backup spare is a spare computer
which is programmed to take over a critical function in case of its failure. If very rapid recovery is
required, it may even duplicate the computations
of the controller. If the self-checking controlling
module disables itself due to an internal fault, the
"hot" spare takes over its ongoing computations.
In complex systems where there are several
levels of control, each high-level module is responsible for fault detection and reconfiguration for
the computers in its immediate control as described
above. Spares for high-level modules are nondedicated and can come from a common shared set.
Conclusions
Potential users have been hesitant to adopt
reconfigurable distributed processing systems
because of their fear of the high degree of complexity of such systems. We have attempted to
address this problem directly by employing architectural techniques to simplify the use of such
systems in certain real-time control applications.
From our initial spacecraft experiments, this conservative approach appears to be useful in producing a system which is easy to live with. The
potential for VLSI building block implementations
may result in a very large variety of fault-tolerant
systems which can be assembled inexpensively
and routinely.
The UDS architecture has been directed to near-term applications in an attempt to develop computing systems for our next several spacecraft
missions. For far-term applications we see similar
distributed systems configured in multilevel hierarchies-e.g., collections of computers will be used
in much more complex subsystem functions. In
these systems, the problems of complexity and
reliability will become even more acute than they
are today.
Most computers in the system will still be performing relatively simple control and data collection
tasks required of various sensors and actuators.
However, there will be points within many of these
systems where very high computing performance
is required. Problems of this type which are currently being studied are on-board image processing
and on-board synthetic aperture radar processing.
Due to the complexity of these dedicated tasks,
special-purpose hardware implementations are
viewed by many as a less costly alternative than
general-purpose computers. However, they also
generate new and interesting problems in providing
internal fault detection and reconfiguration-both
of which are necessary for fault recovery in systems of very high complexity.
Acknowledgment
The development of the Unified Data System was sponsored by the National Aeronautics and
Space Administration under contract NAS7-100
with the California Institute of Technology at the
Jet Propulsion Laboratory. Significant contributions to this team effort have been made by R.C.
Caplette, P.E. Lecoq, H.F. Lesh, D.D. Lord, V.C.
Tyree, and B. Riis Vestergaard. The fault-tolerant
building block development is sponsored by the
Naval Electronic Systems Command, Washington,
DC, under the administration of Nate Butler and
Larry Sumney of the Electronic Technology Division, ELEX 304. Special acknowledgment is due to
Reeve Peterson and Ralph Martinez of the Naval
Ocean Systems Center for their continued support
and encouragement of this effort. Major contributions to the building block designs were provided
by M. Ercegovac, and we have relied heavily on
the broad experience and advice of Prof. Algirdas
Avizienis. Additional acknowledgment is due to
the guest editors and reviewers for their detailed
and helpful comments.
References
1. The Solar System, W. H. Freeman and Company, San Francisco, 1975 (also September 1975 Special Issue of Scientific American).
2. B. H. Dobrotin and D. A. Rennels, "An Application of Microprocessors to a Mars Roving Vehicle," Proc. 1977 Joint Automatic Control Conference, San Francisco, June 1977.
3. W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability Modeling Techniques for Self-Repairing Computer Systems," Proc. 24th National Conference of the ACM (ACM Publication P-69), 1969.
4. W. C. Carter, et al., "Computer Error Control by Testable Morphic Boolean Functions-A Way of Removing Hardcore," Digest of Papers, 1972 International Symposium on Fault-Tolerant Computing, Newton, Massachusetts, IEEE Computer Society, June 1972, pp. 154-159.
5. W. C. Carter, et al., "Cost Effectiveness of Self-Checking Computer Design," Digest of Papers, 1977 International Symposium on Fault-Tolerant Computing, Los Angeles, California, June 1977.
6. P. S. Kilpatrick, et al., "All Semiconductor Distributed Processor/Memory Study," Volume I, Avionics Processing Requirements, Honeywell, Inc., AFAL TR-72, performed for the Air Force Avionics Laboratory, Wright-Patterson Air Force Base, Ohio, November 1972.
7. D. A. Rennels, B. Riis Westergaard, and V. C. Tyree, "The Unified Data System: A Distributed Processing Network for Control and Data Handling on a Spacecraft," Proc. IEEE National Aerospace and Electronics Conference (NAECON), Dayton, Ohio, May 1976.
8. S. H. Caine and E. K. Gordon, "PDL-A Tool for Software Design," AFIPS Conference Proceedings, Vol. 44, National Computer Conference, 1975, pp. 271-276.
9. F. Lesh and P. Lecoq, "Software Techniques for a Distributed Real-Time Processing System," Proc. IEEE National Aerospace and Electronics Conference, Dayton, Ohio, May 1976.
10. D. A. Rennels, A. Avizienis, and M. Ercegovac, "A Study of Standard Building Blocks for the Design of Fault-Tolerant Distributed Computer Systems," Proc. 1978 International Symposium on Fault-Tolerant Computing, Toulouse, France, June 1978.
11. Aircraft Internal Time Division Command/Response Multiplex Data Bus, DoD Military Standard 1553A, 30 April 1975, US Government Printing Office 1975-603-767/1472.
David A. Rennels is a member of the technical staff in the Spacecraft Data Systems Section of the Jet Propulsion Laboratory, Pasadena, California. His major areas of interest are distributed computer architectures and fault-tolerant systems for real-time computing. He received the BSEE from Rose-Hulman Institute of Technology, the MSEE from Caltech, and a PhD in computer science from the University of California at Los Angeles.