Dynamic confirmation of system integrity *
by BARRY R. BORGERSON
University of California
Berkeley, California
INTRODUCTION
It is always desirable to know the current state of any system. However, with most computing systems, a large class of failures can remain undetected by the system long enough to cause an integrity violation. What is needed is a technique, or set of techniques, for detecting when a system is not functioning correctly. That is, we need some way of observing the integrity of a system.

A slight diversion is necessary here. Most nouns which are used to describe the attributes of computer systems, such as reliability, availability, security, and privacy, have a corresponding adjective which can be used to identify a system that has the associated attribute. Unfortunately, the word "integrity" has no associated adjective. Therefore, in order to enhance the following discourse, the word "integral" will be used as the adjective which describes the integrity of a system. Thus, a computer system will be integral if it is working exactly as specified.

Now, if we could verify all of the system software, then we could monitor the integrity of a system in real time by providing a 100 percent concurrent fault detection capability. Thus, the integrity of the entire system would be confirmed concurrently, where "concurrent confirmation" of the integrity of any unit of logic means that the integrity of this unit is being monitored concurrently with each use.

A practical alternative to providing concurrent confirmation of system integrity is to provide what will be called "dynamic confirmation of system integrity." With this concept, the parts of a system that must be continuously integral are identified, and the integrity of the rest of the system can then be confirmed by means less stringent than concurrent fault detection. For example, it might be expedient to allow certain failures to exist for some time before being detected. This might be desirable, for instance, when certain failure modes are hard to detect concurrently, but where their effects are controllable.
QUALITATIVE JUSTIFICATION
In most contemporary systems, a multiplicity of
processes are active at any given time. Two distinct
types of integrity violations can occur with respect to
the independent processes. One type of integrity
violation is for one process to interfere with another
process. That is, one process gains unauthorized access
to another's information or makes an illegitimate
change of another process' state. This type of transgression will be called an "interprocess integrity
violation." The other basic type of malfunction which
can be caused by an integrity violation occurs when the
state of a single process is erroneously changed without
any interference from another process. Failures which
lead to only intraprocess contaminations will be called
"intraprocess integrity violations."
For many real-time applications, no malfunctions
of any type can be tolerated. Hence, it is not particularly useful to make the distinction between interprocess and intraprocess integrity violations since
concurrent integrity-confirmation techniques must be
utilized throughout the system. For most user-oriented
systems, however, there is a substantial difference in the
two types of violations. Intraprocess integrity violations
always manifest themselves as contaminations of a
process' environment. Interprocess integrity violations,
on the other hand, may manifest themselves as security
infractions or contaminations of other processes'
environments.
* This research was supported by the Advanced Research Projects
Agency under contract No. DAHC15 70 C 0274. The views and
conclusions contained in this document are those of the author
and should not be interpreted as necessarily representing the
official policies, either expressed or implied, of the Advanced
Research Projects Agency or the U.S. Government.
We now see that there can be some freedom in defining
what is to constitute a continuously-integral, user-oriented system. For example, the time-sharing system
described below is defined to be continuously integral if
it is providing interprocess-interference protection on a
continuous basis. Thus other properties of the system,
such as intraprocess contamination protection, need
not be confirmed on a continuous basis.
Although the concept of dynamic confirmation of
system integrity has a potential for being useful in a
wide variety of situations, the area of its most obvious
applicability seems to be for fault-tolerant systems.
More specifically, it is most useful in those systems which are designed using a solitary-fault assumption, where "solitary fault" means that at most one fault is present in the active system at any time. The notion of
"dynamic" becomes more clear in this context. Here,
"dynamic" means in such a manner, and at such times,
so that the probability of encountering simultaneous
faults is below a predetermined limit. This limit is
dictated not only by the allowable probability of a
catastrophic failure, but also by the fact that other
factors eventually become more prominent in determining the probability of system failure. Thus, a point is often reached beyond which there is very little
to be gained by increasing the ability to confirm
integrity. The rest of this paper will concern itself with
dynamic confirmation in the context of making this
concept viable with respect to the solitary-fault
assumption.
DYNAMIC CONFIRMATION TECHNIQUES
In this section, and the following section, a particular
class of systems will be assumed. The class of systems
considered will be those which tolerate faults by
restructuring to run without the faulty units. Both the
stand-by sparing and the fail-softly types of systems are
in this category. These systems have certain characteristics in common; namely, they both must detect,
locate, and isolate a fault, and reconfigure to run
without the faulty unit, before a second fault can be
reliably handled.
Obviously, if simultaneous faults are to be avoided,
the integrity of all parts of the system must be verified.
This is reasonably straightforward in many areas. For
instance, the integrity of data in memory can be rather
easily confirmed by the method of storing and checking
parity. Of course, checks must also be provided to make
sure that the correct word of memory is referenced, but
this can be done fairly easily too.1 It is generally true
that parity, check sums, and other straightforward
concurrent fault-detection techniques can be used to
confirm the integrity of most of the logic external to
processors. However, there still remain the problems
of verifying the integrity of the checkers themselves, of
the processors, and of logic that is infrequently used
such as that associated with isolation and reconfiguration.
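For illustration only, a parity check of this kind can be sketched as follows, assuming a word of 32 data bits guarded by one stored parity bit (the word format and checking hardware of any real memory system will differ):

/* Illustrative parity check: odd parity over 32 data bits. */
#include <stdbool.h>
#include <stdint.h>

static bool odd_parity(uint32_t data)
{
    bool p = true;                      /* odd parity: the all-zero word gets parity 1 */
    for (int i = 0; i < 32; i++)
        if ((data >> i) & 1u)
            p = !p;
    return p;
}

/* On every read, the stored parity bit is compared with a recomputation;
   a mismatch is reported as a detected memory fault. */
bool word_is_integral(uint32_t data, bool stored_parity)
{
    return odd_parity(data) == stored_parity;
}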
All too often, there is no provision made in a system
to check the fault detection logic. Actually, there are
two rather straightforward methods of accomplishing
this. One method uses checkers that have their own
failure space. That is, they have more than two output
states; and when they fail, a state is entered which
indicates that the checker is malfunctioning. This
requires building checkers with specifically defined
failure modes. It also requires the ability to recognize
and handle this limbo state. An example of this type of
checker appears in Reference 2.
Another method for verifying the integrity of the
fault-detection logic is to inject faults; that is, cause a
fault to be created so that the checker must recognize it.
In many cases this method turns out to be both cheaper
and simpler than the previously mentioned scheme.
With this method, it is not necessary to provide a
failure space for the checkers themselves. However, it is
necessary to make provisions for injecting faults when
that is not already possible in the normal design. With
this provision, confirming the integrity of the checking
circuits becomes a periodic software task. Failures are
injected, and fault detection inputs are expected. The
system software simply ignores the fault report or
initiates corrective action if no report is generated.
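As a sketch only (the paper specifies the behavior rather than an implementation), such a periodic task might look like the following, where inject_fault, fault_reported, clear_fault_report, and take_corrective_action are hypothetical interfaces to the checking hardware:

/* Periodic verification of a fault-detection checker by fault injection. */
#include <stdbool.h>

extern void inject_fault(int checker_id);            /* force an error the checker must catch */
extern bool fault_reported(int checker_id);          /* did the checker raise a fault report? */
extern void clear_fault_report(int checker_id);      /* discard the expected report           */
extern void take_corrective_action(int checker_id);  /* the checker itself has failed         */

void verify_checker(int checker_id)
{
    inject_fault(checker_id);
    if (fault_reported(checker_id))
        clear_fault_report(checker_id);       /* expected report: simply ignore it        */
    else
        take_corrective_action(checker_id);   /* no report generated: initiate correction */
}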
Associated with systems of the type under discussion,
there is logic that normally is called into use only when a
fault has been detected. This includes the logic dedicated
to such tasks as diagnosis, isolation, and reconfiguration.
This normally idle class of hardware units will collectively be called "reaction logic." In order to avoid
simultaneous faults in a system, this reaction logic must
not be allowed to fail without the failure being rapidly
detected. Several possibilities exist here. This logic can
be made very reliable by using some massive redundancy
technique such as triple-modular-redundancy.3 Another
possibility is to design these units such that they
normally fail into a failure space which is detected and
reported. However, this will not be as simple here as it
might be for self-checking fault detectors because the
failure modes will, in general, be harder to control. A
third method would be to simulate the appropriate
action and observe the reaction. This also is not as
simple here as it was above. For example, it may not be
desirable to reconfigure a system on a frequent periodic
basis. However, one way out of this is to simulate the
action, initiate the reaction, and confirm the integrity
of this logic without actually causing the reconfiguration. This will probably require that the output logic
either be made "reliable" or be encoded so as to fail
into a harmless and detectable failure space.
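A sketch of this simulate-and-observe approach, with hypothetical routine names standing in for the actual hardware interfaces, might be:

/* Check normally idle reaction logic by simulating the triggering action
   while the output stage is inhibited, so no real reconfiguration occurs. */
#include <stdbool.h>

extern void inhibit_reaction_outputs(void);    /* keep the reconfiguration from taking effect */
extern void enable_reaction_outputs(void);
extern void apply_simulated_fault_signal(void);
extern bool reaction_logic_responded(void);
extern void report_reaction_logic_failure(void);

void check_reaction_logic(void)
{
    inhibit_reaction_outputs();
    apply_simulated_fault_signal();
    if (!reaction_logic_responded())
        report_reaction_logic_failure();       /* the idle logic has failed silently */
    enable_reaction_outputs();
}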
The final area that requires integrity confirmation is
the processors. The technique to be employed here is
very dependent on the application of the system. For
many real-time applications, nothing short of concurrent fault detection will apparently suffice. However,
there are many areas where less drastic methods may
be adequate. Fabry4 has presented a method for verifying critical operating-system decisions, in a time-sharing environment, through a series of independent
double checks using a combination of a second processor
and dedicated hardware. This method can be extended
to verifying certain decisions made by a real-time
control processor. If most of the tasks that a real-time
processor performs concern data reduction, it is possible
that software-implemented consistency checks will
suffice for monitoring the integrity of the results. When
critical control decisions are to be made, a second
processor can be brought into the picture for consistency
checks or dedicated hardware can be used for validity
checking. Alternatively, a separate algorithm, using
separate registers, could be run on the same processor
to check the validity of a control action, with external
time-out hardware being used to guarantee a response.
These procedures could certainly provide a substantial
cost savings over concurrent fault-detection methods.
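The following fragment is an illustrative sketch of the last alternative; the watchdog and control routines are hypothetical stand-ins for the time-out hardware and the application code:

/* A control decision is recomputed by an independent algorithm before it is
   committed, while external time-out hardware guarantees some response. */
#include <stdbool.h>

extern void watchdog_arm(unsigned int millisec);      /* external time-out hardware       */
extern void watchdog_disarm(void);
extern int  compute_setting_primary(int sensor);      /* normal control algorithm         */
extern int  compute_setting_independent(int sensor);  /* separate algorithm and registers */
extern void commit_control_action(int setting);
extern void report_integrity_violation(void);

void issue_control_action(int sensor)
{
    watchdog_arm(50);                      /* no answer within 50 ms => hardware reaction */
    int primary = compute_setting_primary(sensor);
    int check   = compute_setting_independent(sensor);
    watchdog_disarm();

    if (primary == check)
        commit_control_action(primary);    /* results agree: accept the decision          */
    else
        report_integrity_violation();      /* disagreement: treat as a processor fault    */
}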
For a system to be used in a general-purpose, time-sharing environment, the method of checking processors non-concurrently is very powerful because
simple, relatively inexpensive schemes will suffice to
guarantee the security of a user's environment. The
price that is paid is to not detect some faults that could
cause contamination of a user's own information. But
conventional time-sharing systems have this handicap
in addition to not having a high availability and not
maintaining security in the presence of faults, so a clear
improvement would be realized here at a fairly low cost.
In order to detect failures as rapidly as possible in
processors that have no concurrent fault-detection
capability, periodic surveillance tests can be run which
will determine if the processor is integral.
VALIDATION OF THE SOLITARY-FAULT
ASSUMPTION
Fault-tolerant systems which are capable of isolating a faulty unit, and reconfiguring to run without it, typically can operate with several functional units removed at any given time. However, in order to design the system so that all possible types of failures can be handled, it is usually necessary to assume that at most one active unit is malfunctioning at any given time. The problem becomes essentially intractable when arbitrary combinations of multiple faults are considered. That is not to say that all cases of multiple faults will bring a system down, but usually no explicit effort is made to handle most multiple faults. Of course, by multiple faults we mean multiple independent faults. If a failure of one unit can affect another, then the system must be designed to handle both units malfunctioning simultaneously, or isolation must be added to limit the influence of the original fault.
A quantitative analysis will now be given which
provides a basis for evaluating the viability of utilizing
non-concurrent integrity-confirmation techniques in an
adaptive fault-tolerant system. In the analysis below,
the letter "s" will be used to designate the probability
that two independent, simultaneous faults will cause a
system to crash.
The next concept we need is that of coverage. Coverage is defined5 as the conditional probability that a system will recover given that a failure has occurred. The letter "c" will be used to denote the coverage of a system.

In order to determine a system's ability to remain continuously available over a given period of time, it is necessary to know how frequently the components of the system are likely to fail. The usual measure employed here is the mean-time-between-failures. The letter "m" will be used to designate this parameter. It should be noted here that "m" represents the mean-time-between-internal-failures of a system; the system itself hopefully has a much better characteristic.

The final parameter that will be needed here is the maximum-time-to-recovery. This is defined to be the maximum time elapsed between the time an arbitrary fault occurs and the time the system has successfully reconfigured to run without the faulty unit. The letter "r" will be used to designate this parameter.
The commonly used assumption that a system does
not deteriorate with age over its useful life will be
adopted. Therefore, the exponential distribution will
be used to characterize the failure probability of a
system. Thus, at any given time, the probability of
encountering a fault within the next u time units is:
p = ∫₀ᵘ (1/m)*exp(-t/m) dt
  = 1 - exp(-u/m)

From this we can see that the upper bound on the
conditional probability of encountering a second independent fault is given by:

q = 1 - exp(-r/m)

Since it is obvious that r must be made much smaller than m if a system is to have a high probability of surviving many internal faults, the following approximation is quite valid:

q = 1 - exp(-r/m)
  = 1 - Σ(k=0 to ∞) (-r/m)^k / k!
  = 1 - 1 + r/m - (1/2)*(r/m)² + (1/6)*(r/m)³ - ...
  ≈ r/m
Therefore, the probability of not being able to recover from an arbitrary internal failure is given by:

x = (1-c) + c*q*s
  = (1-c) + c*s*r/m
where the first term represents the probability of failing
to recover due to a solitary failure and the second term
represents the probability of not recovering due to
simultaneous failures given that recovery from the first
fault was possible.
If we now consider each failure as an independent
Bernoulli trial and make the assumption that faulty
units are repaired at a sufficient rate so that there is
never a problem with having too many units logically
removed from a system at any given time, then it is a
simple matter to determine the probability of surviving
a given period, T, without encountering a system crash.
The hardware failures will be treated as n independent
samples, each with probability of success (1- x), where
n is the smallest integer greater than or equal to T /m.
Thus, the probability of not crashing on a given fault is (1-x) = c*(1 - r*s/m), and the probability, P, of not crashing during the period T is given by:

P = [c*(1 - r*s/m)]^n
  = c^n * (1 - r*s/m)^n
With this equation, it is now possible to establish the validity of using the various non-concurrent techniques mentioned above to confirm the integrity of a system. What this equation will establish is how often it will be necessary to perform the fault injection, action simulation, and surveillance procedures in order to gain an acceptable probability of no system crashes. Since the time required to detect, locate, and isolate a fault, and reconfigure to run without the faulty unit, will be primarily a function of the time to detection for the non-concurrent schemes, and since this time is essentially equivalent to how frequently the confirmation procedures are invoked, we can assume that r is equal to the time period between the periodic integrity-confirmation checks. In order to gain a feeling for the order of r, rather pessimistic numbers can be assumed for m, s, and T. Assume m = 1 week, s = 1/2, and T = 10 years; this gives an n of 520. For now, assume c is equal to one. Now, in order to attain a probability of .95 that a system will survive 10 years with no crashes under the above assumptions, r will have to be:

r = (m/s)*[1 - .95^(1/520)]
  = 119 seconds

Thus, if the periodic checks are made even as infrequently as every two minutes, a system will last 10 years with a probability of not crashing of approximately .95.

The effects of the coverage must now be examined. In order for the coverage to be good enough to provide a probability of .95 of no system crashes in 10 years due to the system's inability to handle single faults, it must be:

c = .95^(1/520)
  = .9999
Now this would indeed be a very good coverage. Since
the actual coverage of any given system will most
likely fall quite short of this value, it seems that the
coverage, and not multiple simultaneous faults, is the
limiting factor in determining a system's ability to
recover from faults.
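The figures above can be reproduced with a short calculation; the following fragment (not part of the original analysis) simply evaluates r = (m/s)*[1 - P^(1/n)] and c = P^(1/n) with the stated parameters:

/* Numerical check of the r and c values quoted above. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double m = 7.0 * 24.0 * 3600.0;   /* mean time between internal failures: 1 week, in seconds */
    double s = 0.5;                   /* probability that two simultaneous faults cause a crash  */
    double n = 520.0;                 /* internal failures expected in T = 10 years              */
    double P = 0.95;                  /* desired probability of no crash over T                  */

    double r = (m / s) * (1.0 - pow(P, 1.0 / n));   /* assuming c = 1                     */
    double c = pow(P, 1.0 / n);                     /* assuming multiple faults never hit */

    printf("r = %.0f seconds\n", r);   /* prints about 119 seconds */
    printf("c = %.4f\n", c);           /* prints about 0.9999      */
    return 0;
}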
The most important conclusion to be drawn from this
section is that the solitary-fault assumption is not only
convenient but quite justified, and this is true even when
only periodic checks are made to verify the integrity of
some of the logic.
INTEGRITY CONFIRMATION FEATURES OF THE "PRIME" SYSTEM
In order to better illustrate the potential power of dynamic integrity confirmation techniques, a description will now be given of how this concept is being used to economically provide an integrity confirmation structure for a fault-tolerant system.

At the University of California, Berkeley, we are currently building a modular computer system, which has been named PRIME, that is to be used in a multi-access, interactive environment. The initial version of this system will have five processors, 13 8K-word by 33-bit memory blocks with associated switching units,
15 high-performance disk drives, and a switching network which allows processor, disk, and external-device switching. A block diagram of PRIME appears in Figure 1.

[Figure 1 - Block diagram of the PRIME system: processors and their memory interfaces, memory blocks (each consisting of two 4K modules), I/O controllers with reconfiguration logic, disk drives, and external devices joined by the Interconnection Network; each indicated line represents 16 terminal connections.]
The processing elements in PRIME are 3-bus, 16-bit wide, 90ns cycle time microprogrammable processors called META 4s.6 Each processor emulates a target machine in addition to performing I/O and executive functions directly in microcode. At any given time, one of the processors is designated the Control Processor (CP), while the others are Problem Processors (PPs). The CP runs the Central Control Monitor (CCM) which is responsible for scheduling, resource allocation, and interprocess message handling. The Problem Processors run user jobs and perform some system functions with the Extended Control Monitor (ECM) which is completely isolated from user processes. Associated with each PP is a private page, which the ECM uses to store data, and some target-machine code which it occasionally causes to be executed. A more complete description of the structure and functioning of PRIME is given elsewhere.7
The most interesting aspects of PRIME are in the areas of availability, efficiency, and security. PRIME will be able to withstand internal faults. The system has been designed to degrade gracefully in the presence of internal failures.8 Also, interprocess integrity is always maintained even in the presence of either hardware or software faults.
The PRIME system is considered continuously
integral if it is providing interprocess interference
protection. Therefore, security must be maintained at
all times. Other properties, such as providing user
service and recovering from failures, can be handled in
a less stringent manner. Thus, dynamic confirmation of
system integrity in PRIME must be handled concurrently for interprocess interference protection and
can be handled periodically with respect to the rest of
the system. Of course, there are areas which do not
affect interprocess interference protection but which
will nonetheless utilize concurrent fault detection simply
because it is expedient to do so.
Fault injection is being used to check most of the fault-detection logic in PRIME. This decision was made because the analysis of non-concurrent integrity-confirmation techniques has established that periodic fault injection is sufficiently effective to handle the job and because it is simpler and cheaper than the alternatives. There is a characteristic of the PRIME system
that makes schemes which utilize periodic checking very
attractive. At the end of each job step, the current
process and the next process are overlap swapped. That
is, two disk drives are used simultaneously; one of these
disks is rolling the current job out, while the other is
rolling the next job in. During this time, the associated
processor has some potential free time. Therefore, this
time can be effectively used to make whatever periodic
checks may be necessary. And since the mean time
between job steps will be less than a second, this provides very frequent, inexpensive periodic checking
capabilities.
The integrity of Problem Processors is checked at
the end of each job step. This check is initiated by the
Control Processor which passes a one-word seed to the
PP and expects the PP to compute a response. This
seed will guarantee that different responses are required
at different times so that the PP cannot accidentally
"memorize" the correct response. The computation
requires the use of both target machine instructions
and a dedicated firmware routine to compute the expected response. The combination of these two routines
is called a surveillance procedure. This surveillance
procedure checks all of the internal logic and the control
storage of the microprocessors. The target machine code
of the surveillance routine is always resident in the
processor's private page. The microcode part is resident
in control storage. A fixed amount of time is allowed for
generating a response when the CP asks a PP to run a
surveillance on itself. If the wrong response is given or
if no response is given in the allotted time, then the PP
is assumed to be malfunctioning and remedial action is
initiated. In a similar manner, each PP periodically
requests that the CP run a surveillance on itself. If a
PP thinks it detects that the CP is malfunctioning, it
will tell the CP this, and a reconfiguration will take
place followed by diagnosis to locate the actual source
of the detected error. More will be said later about the
structure of the reconfiguration scheme.
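A sketch of the CP side of this exchange appears below; the routine names are hypothetical, and the real surveillance procedure exercises the PP's internal logic and control storage rather than a single library call:

/* CP-side sketch of the seed/response surveillance check of a PP. */
#include <stdbool.h>
#include <stdint.h>

extern uint16_t next_seed(void);                    /* varies so a PP cannot "memorize" answers */
extern void     send_seed(int pp, uint16_t seed);   /* hand the one-word seed to the PP         */
extern bool     response_ready(int pp);
extern uint16_t read_response(int pp);
extern uint16_t expected_response(uint16_t seed);   /* CP recomputes the expected answer        */
extern long     elapsed_ms(void);
extern void     initiate_remedial_action(int pp);

void run_surveillance(int pp, long time_limit_ms)
{
    uint16_t seed  = next_seed();
    long     start = elapsed_ms();

    send_seed(pp, seed);
    while (!response_ready(pp)) {
        if (elapsed_ms() - start > time_limit_ms) {
            initiate_remedial_action(pp);       /* no response in the allotted time */
            return;
        }
    }
    if (read_response(pp) != expected_response(seed))
        initiate_remedial_action(pp);           /* wrong response */
}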
While the periodic running of surveillance procedures
is sufficient for most purposes, it does not suffice for
protecting against interprocess interference. As previously mentioned, this protection must be continuous.
Therefore, a special structure has been developed which
is used to prevent interprocess interference on a continuous basis. 4 This structure provides double checks on
all actions which could lead to interprocess interference.
In particular, the validity of all memory and disk
references, and all interprocess message transmissions,
are among those actions double checked. A class code is
used to associate each sector (1K words) of each disk pack with either a particular process or with the null process, which corresponds to unallocated space. A lock and key scheme is used to protect memory on a page (also 1K words) basis. In both cases, at most one process is bound to a 1K-word piece of physical storage. The Central Control Monitor is responsible for allocating each piece of storage, and it can allocate only
those pieces which are currently unallocated. Each
process is responsible for deallocating any piece of
storage that it no longer needs. Both schemes rely on
two processors and a small amount of dedicated hardware to provide the necessary protection against some
process gaining access to another process' storage.
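The binding invariant enforced by these schemes can be sketched as follows; the table layout and names are illustrative only, and in PRIME the double checks are actually split between two processors and dedicated hardware:

/* Each 1K-word piece of physical storage is bound to at most one process;
   NULL_PROCESS marks unallocated storage. */
#include <stdbool.h>

#define NULL_PROCESS 0
#define NUM_PAGES    1024            /* example size, not the PRIME configuration */

static int page_owner[NUM_PAGES];    /* lock: owning process for each 1K-word page */

/* The CCM may allocate only pieces that are currently unallocated. */
bool allocate_page(int page, int process_id)
{
    if (page_owner[page] != NULL_PROCESS)
        return false;                /* refuse: already bound to some process */
    page_owner[page] = process_id;
    return true;
}

/* Every reference is checked against the binding before it is allowed. */
bool reference_allowed(int page, int process_id)
{
    return page_owner[page] == process_id;
}

/* Each process deallocates the pieces it no longer needs. */
void release_page(int page, int process_id)
{
    if (page_owner[page] == process_id)
        page_owner[page] = NULL_PROCESS;
}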
In order for the above security scheme to be extremely
effective, it was decided to prohibit sharing of any
storage. Therefore, the Interconnection Network is
used to pass files which are to be shared. Files are sent
as regular messages, with the owning process explicitly
giving away any information that it wishes to share
with any other process. All interprocess messages are
sent by way of the CP. Thus, both the CCM and the destination ECM can make consistency checks to make sure that a message is delivered to the correct process.
The remaining area of integrity checking which
needs to be discussed is the reaction hardware. In the
PRIME system, this includes the isolation, power
switching, diagnosis, and reconfiguration logic. A
variety of schemes have been employed to confirm the
integrity of this reaction logic. In order to describe the
methods employed to confirm the integrity, it will be
necessary to first outline the structure of the spontaneous reconfiguration scheme used in the PRIME
system.
There are four steps involved in reorganizing the
hardware structure of PRIME so that it can continue
to operate with internal faults. The first step consists
of detecting a fault. This is done by one of the many
techniques outlined in this paper. In the second step,
an initial reconfiguration is performed so that a new
processor, one not involved in the detection, is given the
job of being the CP. This provides a pseudo "hard core"
which will be used to initiate gross diagnostics. The
third step is used to locate the fault. This is done by
having the new CP attach itself to the Programmable
Control Panel9 of a Problem Processor via the Interconnection Network, and test it by single-stepping this
PP through a set of diagnostics. If a PP is found to be
functioning properly, then it is used to diagnose its own
I/O channels. After the fault is located, the faulty
functional-unit is isolated, and a second reconfiguration
is performed to allow the system to run without this
unit.
Of the four steps involved in responding to a fault,
the initial reconfiguration poses the most difficulty. In
order to guarantee that this initial reconfiguration could
be initiated, a small amount of dedicated hardware was incorporated to facilitate this task. Associated with each processor is a flag which indicates when the processor is the CP. Also associated with each processor is a
flag which is used to indicate that this processor thinks
the CP is malfunctioning. For every processor, these
two flags can be interrogated by any other processor.
Each processor can set only its own flag that suggests
the CP is sick. The flag which indicates that a processor
is the CP can be set only if both the associated processor and the dedicated hardware concur. Thus, the
dedicated hardware will not let this flag go up if another
processor already has its up. Also, this flag will automatically be lowered whenever two processors claim
that the CP is malfunctioning.
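Expressed in software form purely for illustration (in PRIME these rules are enforced by the dedicated hardware), the flag discipline is roughly:

/* Arbitration rules for the CP flags and the "CP is sick" flags. */
#include <stdbool.h>

#define NUM_PROCESSORS 5

static bool is_cp[NUM_PROCESSORS];     /* "this processor is the CP"             */
static bool cp_sick[NUM_PROCESSORS];   /* "this processor thinks the CP is sick" */

/* A processor may raise only its own cp_sick flag. */
void claim_cp_sick(int self)
{
    cp_sick[self] = true;
}

/* The CP flag may go up only if no other processor already has its flag up. */
bool request_cp_flag(int self)
{
    for (int p = 0; p < NUM_PROCESSORS; p++)
        if (p != self && is_cp[p])
            return false;              /* the dedicated hardware refuses */
    is_cp[self] = true;
    return true;
}

/* The CP flag is automatically lowered when two processors claim the CP is sick. */
void lower_cp_flag_if_accused(int current_cp)
{
    int complaints = 0;
    for (int p = 0; p < NUM_PROCESSORS; p++)
        if (cp_sick[p])
            complaints++;
    if (complaints >= 2)
        is_cp[current_cp] = false;
}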
There is somewhat of a dilemma associated with
confirming the integrity of this logic. Because of the
distributed nature of this reconfiguration structure, it
should be unnecessary to make any of it "reliable."
That is, the structure is already distributed so that a
failure of any part of it can be tolerated. However, if
simultaneous faults are to be avoided, the integrity of
this logic must be dynamically confirmed. Unfortunately, it is not practical to check this logic by frequently
initiating reconfigurations. This dilemma is being solved
by a scheme which partially simulates the various
actions. The critical logic that cannot be checked during
a simulated reconfiguration is duplicated so that infrequent checking by actual reconfiguration is sufficient to
confirm the integrity of this logic.
The only logic used in the diagnostic scheme where
integrity confirmation has not already been discussed is
the Programmable Control Panel. This pseudo panel is
used to allow the CP to perform all the functions
normally available on a standard control panel. No
explicit provision will be made for confirming the
integrity of the Programmable Control Panel because
its loss will never lead to a system crash. That is, failures
in this unit can coexist with a failure anywhere else in
the system without bringing the system down.
For powering and isolation purposes, there are only four different types of functional units in the PRIME system. The four functional units are the intelligence module, which consists of a processor, its I/O controller and the subunits that directly connect to the controller, its memory bus, and its reconfiguration logic; the memory block, which consists of two 4K-word by 33-bit MOS memory modules and a 4X2 switching matrix;
the switching module, which consists of the switch part
of two processor-end and three device-end nodes of the
Interconnection Network; and the disk drive. The disk
drives and switching modules can be powered up and
down manually only. The intelligence modules must be
powered up manually, but they can be powered down
under program control. Finally, the memory blocks can
be powered both up and down under program control.
No provision was made to power down the disks or
switching modules under program control because there
was no isolation problem with these units. Rather than
providing very reliable isolation logic at the interfaces
of the intelligence modules and memory blocks, it was
decided to provide additional isolation by adding the
logic which allows these units to be dynamically powered
down. Also, because it may be necessary to power
memory blocks down and then back up in order to
determine which one has a bus tied up, the provision
had to be made for performing the powering up of these
units on a dynamic basis. Any processor can power down
any memory block to which it is attached, so it was not
deemed necessary to provide for any frequent confirmation of the integrity of this power-down logic.
Also, every processor can be powered down by itself and
one other processor. These two power-down paths are independent, so again no provision was made to frequently confirm the integrity of this logic. In order to guarantee that the independent power-down paths do not eventually fail without this fact being known, these paths can be checked on an infrequent basis.
All of the different integrity confirmation techniques used in PRIME have been described. The essence of the concept of dynamic confirmation of system integrity is the systematic exploitation of the specific characteristics of a system to provide an adequate integrity confirmation structure which is in some sense minimal. For instance, the type of use and the distributed intelligence of PRIME were taken advantage of to provide
a sufficient integrity-confirmation structure at a much
lower cost and complexity than would have been
possible if these factors were not carefully exploited.
REFERENCES
1 B BORGERSON C V RAVI
On addressing failures in memory systems
Proceedings of the 1972 ACM International Computing Symposium Venice Italy pp 40-47 April 1972
2 D A ANDERSON G METZE
Design of totally self-checking check circuits for M-out-of-N codes
Digest of the 1972 International Symposium on Fault-Tolerant Computing pp 30-34
3 R A SHORT
The attainment of reliable digital systems through the use of redundancy-A survey
IEEE Computer Group News Vol 2 pp 2-17 March 1968
4 R S FABRY
Dynamic verification of operating system decisions
Computer Systems Research Project Document No P-14.0 University of California Berkeley February 1972
5 W G BOURICIUS W C CARTER P R SCHNEIDER
Reliability modeling techniques for self-repairing computer systems
Proceedings of the ACM National Conference pp 295-309 1969
6 META 4 computer system microprogramming reference manual
Publication No 7043MO Digital Scientific Corporation San Diego California June 1972
7 H B BASKIN B R BORGERSON R ROBERTS
PRIME-A modular architecture for terminal-oriented systems
Proceedings of the 1972 Spring Joint Computer Conference pp 431-437
8 B R BORGERSON
A fail-softly system for time-sharing use
Digest of the 1972 International Symposium on Fault-Tolerant Computing pp 89-93
9 G BAILLIU B R BORGERSON
A multipurpose processor-enhancement structure
Digest of the 1972 IEEE Computer Society Conference San Francisco September 1972 pp 197-200