Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

Software Reliability as a Function of User Execution Patterns
John C. Munson, Sebastian Elbaum
Computer Science Department
University of Idaho
Moscow, Idaho 83844-1010
{jmunson,elbaum}@cs.uidaho.edu
Abstract
Assessing the reliability of a software system
has always been an elusive target. A program may
work very well for a number of years and this same
program may suddenly become quite unreliable if
its mission is changed by the user. This has led to
the conclusion that the failure of a software system
is dependent only on what the software is currently
doing. If a program is always executing a set of
fault free modules, it will certainly execute
indefinitely without any likelihood of failure. A
program may execute a sequence of fault prone
modules and still not fail. In this particular case,
the faults may lie in a region of the code that is not
likely to be expressed during the execution of that
module. A failure event can only occur when the
software system executes a module that contains
faults. If an execution pattern that drives the
program into a module that contains faults is
never selected, then the program will never fail.
Alternatively, a program may successfully execute
a module that contains faults, just as long as the
faults are in code subsets that are not executed.
The reliability of the system then, can only be
determined with respect to what the software is
currently doing. Future reliability predictions will
be bound in their precision by the degree of
understanding of future execution patterns. In this
paper we investigate a model that represents the
program's sequential execution of modules as a
stochastic process. By analyzing the transitions
between modules and their failure counts, we may
learn exactly where the system is fragile and under
which execution patterns a certain level of
reliability can be guaranteed.
1. Introduction
The subject of this paper is measurement,
specifically, the measurement of those software
attributes that are associated with software reliability.
Existing approaches to the understanding of software
reliability patently assume that software failure events are
observable. The truth is that the overwhelming majority
of software failures go unnoticed when they occur. Only
when these unobserved failures disrupt the system by
second, third or fourth order effects do they provide
enough disruption for outside observation. Consequently,
only those dramatic events that lead to the immediate
collapse of a system can be seen directly by an
observer when they occur. The more insidious failures
will lurk in the code for a long interval before their
effects are observed. Failure events go unnoticed and
undetected because the software has not been
instrumented so that these failure events may be detected.
In that software failures go largely unnoticed, it is
presumptuous to attempt to model the reliability of
software based on observed failure events. If we cannot
observe failure events directly, then we must seek another
metaphor that will permit us to model and understand
reliability in a context that we can measure.
Computer programs do not break. They do not fail
monolithically. Programs are designed to perform a set
of mutually exclusive tasks or functions. Some of these
functions work quite well while others may not work well
at all. When a program is executing a particular function,
it executes a well defined subset of its code. Some of
these subsets are flawed and some are not. Users tend to
execute subsets of the total program functionality. Two
users of the same software may have totally different
perceptions as to the reliability of the same system. One
user may use the system on a daily basis and never
experience a problem. Another user may have continual
problems in trying to execute exactly the same program.
A new metaphor for software systems would focus on
the functionality that the code is executing and not the
software as a monolithic system. In computer software
systems, it is the functionality that fails. Some functions
may be virtually failure free while other functions will
collapse with certainty whenever they are executed. The
focus of this paper is on the notion that it is possible to
measure the activities of a system as it executes its
various functions and characterize the reliability of
the system in terms of these functionalities.
Each program function may be thought of as
having an associated reliability estimate. We may
chose to think of the reliability of a system in these
functional terms. Users of the software system,
however, have a very different view of the system.
What is important to the user is not that a particular
function is fragile or reliable, but rather whether
the system will operate to perform those actions
that the user will want the system to perform
correctly. From a user’s perspective, it matters not,
then, that certain functions are very unreliable. It
only matters that the functions associated with the
user’s actions or operations are reliable. The
classical example of this idea was expressed by the
authors of the early UNIX utility programs. In the
last paragraph of the documentation for each of
these utilities was a list of known bugs for that
program. In general, these bugs were not a
problem. Most involved aspects of functionality
that the typical user would never exploit.
As a program is exercising any one of its many
functionalities in the normal course of operation of
the program, it will apportion its time across this
set of functionalities [8]. The proportion of time
that a program spends in each of its functionalities
is the functional profile of the program. Further,
within the functionality, it will apportion its
activities across one to many program modules.
This distribution of processing activity is
represented by the concept of the execution profile.
In other words, if we have a program structured
into n distinct modules, the execution profile for a
given functionality will be the proportion of
program activity for each program module while
the function was being expressed.
As the discussion herein unfolds, we will see
that the key to understanding program failure
events is the direct association of these failures to
execution events with a given functionality. A
stochastic process will be used to describe the
transition of program modules from one to another
as a program expresses a functionality. From these
observations, it will become fairly obvious just
what data will be needed to describe accurately the
reliability of the system. In essence, the system
will be able to appraise us of its own health. The
reliability modeling process is no longer something
that will be performed ex post facto. It may be
accomplished dynamically while the program is
executing. It is the goal of this paper to develop a
methodology that will permit the modeling of the
reliability of program functionality.
This
methodology will then be used to develop notions
of design robustness in the face of departures from design
functional profiles.
2. A Formal Description Of Program
Operation
To assist in the subsequent discussion of program
functionality, it will be useful to make this description
somewhat more precise by introducing some notation
conveniences. Assume that a software system S was
designed to implement a specific set of mutually
exclusive functionalities F. Thus, if the system is
executing a function f ∈ F then it cannot be expressing
elements of any other functionality in F. Each of these
functions in F was designed to implement a set of
software specifications based on a user’s requirements.
From a user’s perspective, this software system will
implement a specific set of operations, O. This mapping
from the set of user perceived operations, O, to a set of
specific program functionalities, F, is one of the major
tasks in the software specification process.
To facilitate this discussion, it would be appropriate to
describe the operations and the functionality of the
Simple Database Management System (SDMS) that we
have instrumented for test purposes. Though the majority
of the systems that we have investigated and
instrumented are very large (>500 KLOC), we would like
to demonstrate the use of this model in the context of a
relatively small database application. The target
application has 20 modules and approximately 1600
LOC. The methodology scales down quite well.
In order to implement the model, the first step was to
re-discover the set A of user operations through the
analysis of the program documentation.
These
operations, again specified by a set of functional
requirements, will be mapped into a set, B, of elementary
program functionalities. Then, the second step was the
extraction of the system functionalities. The
functionalities were extracted with a tool that analyzes
the transition patterns as the program executes and
generates a list of functionalities and the mapping of
these functionalities to modules [3]. The tool was used
because the requirements and design documentation
wasn’t complete enough to specify the operations,
functionalities, modules and their mapping. The lack of
complete and unambiguous development information is a
common situation in the software industry, which forces
this type of re-engineering approach.
Each operation that a system may perform for a user
may be thought of as having been implemented in a set of
functional specifications. There may be a one-to-one
mapping between the user’s notion of an operation and a
program function. In most cases, however, there may be
several discrete functions that must be executed to
express the user’s concept of an operation. For each
operation, o, that the system may perform, the
range of functionalities, f, must be well known.
Within each operation one or more of the system’s
functionalities will be expressed. For a given
operation, o, these expressed functionalities are
those with the property
$$F^{(o)} = \{f : F \mid \text{IMPLEMENTS}(o, f)\}.$$

It is possible, then, to define a relation IMPLEMENTS over $O \times F$ such that IMPLEMENTS(o, f) is true if functionality $f$ is used in the specification of an operation, $o$. As an example of the IMPLEMENTS relation within the framework of our demonstration database application, let us consider a typical user operation, the retrieval of a specific record. In order to implement this operation ($o_i$), the system will present the menu system to the user ($f_1$), open the selected database ($f_2$), perform the necessary query ($f_3$), and then display the desired record ($f_4$).
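For concreteness, the IMPLEMENTS relation for this example can be sketched as a simple mapping from each operation to the functionalities that implement it; the operation and functionality names below are our own illustrative labels, not identifiers from SDMS.

# Minimal sketch of the IMPLEMENTS relation (hypothetical names).
IMPLEMENTS = {
    "retrieve_record": {"present_menu", "open_database", "perform_query", "display_record"},
    # ... one entry per operation in O
}

def functionalities_of(operation):
    """F(o): the set of functionalities expressed when operation o is performed."""
    return IMPLEMENTS.get(operation, set())

print(functionalities_of("retrieve_record"))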
The software design process is basically a
matter of assigning functionalities in F to specific
program modules m ∈ M , the set of program
modules. The design process may be thought of as
the process of defining a set of relations, ASSIGNS
over F × M such that ASSIGNS(f, m) is true if
functionality f is expressed in module m. Table 1
shows an example of the ASSIGNS relation for
four of the functions that IMPLEMENT the
retrieval operation. In this example we can see the
function f 4 has been implemented in the program
modules {m1 , m3 , m15 , m20 }. One of these modules,
m1 , will be invoked regardless of the functionality.
It is the main program module and thus is common
to all functions. Other program modules, such as
m7 , are distinctly associated with a single function.
Table 1. Example of the ASSIGNS relation over F × M, relating the functions f1 through f4 of the retrieval operation to the program modules m1, m3, m7, m15, m16, m17, and m20 (a T entry marks each module in which a function is expressed).
There is a relationship between program
functionalities and the software modules that they will
cause to be executed. For the SDMS system, the set
$M = \{m_1, m_2, m_3, \ldots, m_{20}\}$ denotes the set of all program
modules that constitute the whole system. For each
function $f \in F$, there is a relation $p$ over $F \times M$ such
that $p(f, m)$ is the proportion of execution events of
module $m$ when the system is executing functionality $f$.
If $p(f, m) < 1$, this means that module $m$ may or may
not execute when functionality $f$ is expressed. Thus,
program modules may be assigned to one of three distinct
sets of modules that, in turn, are subsets of M. Some
modules may execute under all of the functionalities of
SDMS. This will be the set of common modules. The
main program is an example of such a module that is
common to all operations of the software system.
Essentially, program modules will be members of one of
two mutually exclusive sets. There is the set of program
modules M c of common modules and the set of modules
M F that are invoked only in response to the execution of
a particular function. The set of common modules,
$M_c \subset M$, is defined as those modules that have the property

$$M_c = \{m : M \mid \forall f \in F \bullet \text{ASSIGNS}(f, m)\}.$$
All of these modules will execute regardless of the
specific functionality being executed by the software
system.
Yet another set of software modules may or may not
execute when the system is running a particular function.
These modules are said to be potentially involved
modules. The set of potentially involved modules is

$$M_p^{(f)} = \{m : M_F \mid \exists f \in F \bullet \text{ASSIGNS}(f, m) \wedge 0 < p(f, m) < 1\}.$$
In other program modules, there is extremely tight
binding between a particular functionality and a set of
program modules. That is, every time a particular
function, f, is executed, a distinct set of software modules
will always be invoked. These modules are said to be
indispensably involved with the functionality f. This set
of indispensably involved modules for a particular
functionality, f, is the set of those modules that have the
property that
$$M_i^{(f)} = \{m : M_F \mid \forall f \in F \bullet \text{ASSIGNS}(f, m) \Rightarrow p(f, m) = 1\}.$$
As a direct result of the design of the program, there
will be a well defined set of program modules , M f , that
might be used to express all aspects of a given
functionality, f. These are the modules that have the property that $m \in M_f = M_p^{(f)} \cup M_i^{(f)}$.
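A minimal sketch of how the three module sets could be derived from the proportions p(f, m); the dictionary of proportions below is a made-up stand-in for data that the paper obtains by instrumenting the program.

# p[f][m] = proportion of execution events of module m under functionality f.
# Modules absent from p[f] are taken to have p(f, m) = 0.
p = {
    "f1": {"m1": 1.0, "m3": 1.0, "m7": 0.4},
    "f2": {"m1": 1.0, "m3": 0.7, "m15": 1.0},
}

modules = {m for row in p.values() for m in row}

# Common modules: execute under every functionality.
M_c = {m for m in modules if all(row.get(m, 0.0) > 0.0 for row in p.values())}

def potentially_involved(f):
    """M_p^(f): modules that may or may not execute under functionality f."""
    return {m for m, prob in p[f].items() if 0.0 < prob < 1.0 and m not in M_c}

def indispensably_involved(f):
    """M_i^(f): modules that always execute whenever functionality f is expressed."""
    return {m for m, prob in p[f].items() if prob == 1.0 and m not in M_c}

print(M_c, potentially_involved("f1"), indispensably_involved("f1"))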
From the standpoint of software design, the real
problems in understanding the dynamic behavior of a
system are not necessarily attributable to the set of
modules, M i , that are tightly bound to a
functionality or to the set of common modules,
M c , that will be invoked for all executing
processes.
The real problem is the set of
potentially invoked modules, M p . The greater the
cardinality of this set of modules, the less certain
we may be about the behavior of a system
performing that function. For any one instance of
execution of this functionality, a varying number of
the modules in M p may execute.
3. The Profiles of Software Dynamics
When a program is executing a functionality it
will apportion its activities among a set of
modules. As such it will transition from one
module to the next on a call (or return) sequence.
Each module called in this call sequence will have
an associated call frequency. When the software is
subjected to a series of unique and distinct
functional expressions, there will be a different
behavior for each of the user’s operations in that
each will implement a different set of functions
that will, in turn, invoke possibly different sets of
program modules.
When the process of software requirements
specification is complete we will have specified a
system consisting of a set of a mutually exclusive
operations. It is a characteristic of each user of the
new system that this user will cause each operation
to be performed at a potentially different rate than
another user. Each user, then, will induce a
probability distribution on the set O of mutually
exclusive operations. This probability function is a
multinomial distribution.
It constitutes the
operational profile for that user.
The operational profile of the software system is the set of unconditional probabilities of each of the operations in O being executed by the user. Then $\Pr[X = k]$, $k = 1, 2, \ldots, |O|$, is the probability that the user is executing program operation $k$ as specified in the functional requirements of the program, where $|O|$ is the cardinality of the set of operations [12]. A program executing on a serial machine can only be executing one operation at a time. The distribution of the operational profile, then, is multinomial for programs designed to fulfill more than two distinct operations. The prior knowledge of this distribution should guide the software design process [7].
As a user performs the various operations on a system, he/she will cause each operation to occur in a series of steps or transitions. The transition from one operation to another may be described as a stochastic process, in which case we may define an indexed collection of random variables $\{X_t\}$, where the index $t$ runs through a set of non-negative integers, $t = 0, 1, 2, \ldots$, representing the individual transitions or intervals of the process. At any particular interval the user is found to be expressing exactly one of the system's $a$ operations. The fact of the execution occurring in a particular operation is a state of the user. During any interval the user is found performing exactly one of a finite number of mutually exclusive and exhaustive states that may be labeled $0, 1, 2, \ldots, a$. In this representation of the system, there is a stochastic process $\{X_t\}$, where the random variables are observed at intervals $t = 0, 1, 2, \ldots$ and where each random variable may take on any one of the $(a + 1)$ integers from the state space $O = \{0, 1, 2, \ldots, a\}$.

Each user may potentially bring his/her own distinct behavior to the system. Thus, each user will have his/her own characteristic operational profile. It is a characteristic, then, of each user to induce a probability function $p_i = \Pr[X = i]$ on the set of operations, O. In that these operations are mutually exclusive, the induced probability function is a multinomial distribution.
As the system progresses through the steps in the software lifecycle, the user requirements specifications, the set O, must be mapped onto a specific set of functionalities, F, by systems designers. This set F is in fact the design specification for the system. As per our earlier discussion, each operation is implemented by one or more functionalities.
Now let us examine the behavior of the system within each operation. Each operation is implemented by one or more functionalities. The transition from one functionality to another may also be described as a stochastic process, in which case we may define a new indexed collection of random variables $\{Y_t\}$, as before representing the individual transition events among particular functionalities. At any particular interval a given operation is found to be expressing exactly one of the system's $b$ functionalities. During any interval the user is found performing exactly one of a finite number of mutually exclusive and exhaustive states that may be labeled $0, 1, 2, \ldots, b$. In this representation of the system, there is a stochastic process $\{Y_t\}$, where the random variables are observed at intervals $t = 0, 1, 2, \ldots$ and where each random variable may take on any one of the $(b + 1)$ integers from the state space $F = \{0, 1, 2, \ldots, b\}$, where $b$ represents the cardinality of $F$, the set of functionalities.
When a program is executing a given operation, say $o_k$, it will distribute its activity across the set of
functionalities, $F^{(o_k)}$. At any arbitrary interval, $n$, during the expression of $o_k$ the program will be executing a functionality $f_i \in F^{(o_k)}$ with a probability $\Pr[Y_n = i \mid X = k]$. From this conditional probability distribution for all operations we may derive the functional profile for the design specifications as a function of a user operational profile, to wit:

$$\Pr[Y = i] = \sum_j \Pr[X = j] \Pr[Y = i \mid X = j].$$

Alternatively,

$$w_i = \sum_j p_j \Pr[Y = i \mid X = j].$$
The next logical step is to study the behavior of a software system at the module level. Each of the functionalities is implemented in one or more program modules. The transition from one module to another may also be described as a stochastic process, in which case we may define a third indexed collection of random variables $\{Z_t\}$, as before representing the individual transition events among the set of program modules. At any particular interval a given functionality is found to be executing exactly one of the system's $c$ modules. The fact of the execution occurring in a particular module is a state of the system. During any interval the system is found executing exactly one of a finite number of mutually exclusive and exhaustive states (program modules) that may be labeled $0, 1, 2, \ldots, c$. In this representation of the system, there is a stochastic process $\{Z_t\}$, where the random variables are observed at epochs $t = 0, 1, 2, \ldots$ and where each random variable may take on any one of the $(c + 1)$ integers from the state space $M = \{0, 1, 2, \ldots, c\}$, where $c$ represents the cardinality of $M$, the set of modules.
Each functionality $j$ has a distinct set of modules $M_{f_j}$ that it may cause to execute. At any arbitrary interval, $n$, during the expression of $f_j$ the program will be executing a module $m_i \in M_{f_j}$ with a probability $\Pr[Z_n = i \mid Y = j]$. From this conditional probability distribution for all functionalities we may derive the module profile for the system as a function of the system functional profile as follows:

$$\Pr[Z = i] = \sum_j \Pr[Y = j] \Pr[Z = i \mid Y = j].$$

Again,

$$r_i = \sum_j w_j \Pr[Z = i \mid Y = j].$$
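To make the chain of profiles concrete, the sketch below (with invented probabilities) computes the functional profile w from an operational profile and the conditional distributions Pr[Y = i | X = j], and then the module profile r from w and Pr[Z = i | Y = j], exactly as the two summations above prescribe.

import numpy as np

# Hypothetical operational profile: Pr[X = j] for each of 3 operations.
p_op = np.array([0.5, 0.3, 0.2])

# cond_fo[j, i] = Pr[Y = i | X = j]: functionality i given operation j (4 functionalities).
cond_fo = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.2, 0.0, 0.3, 0.5],
])

# cond_mf[i, k] = Pr[Z = k | Y = i]: module k given functionality i (5 modules).
cond_mf = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.7, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.8, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.9],
])

w = p_op @ cond_fo      # functional profile: w_i = sum_j p_j Pr[Y = i | X = j]
r = w @ cond_mf         # module profile:    r_k = sum_i w_i Pr[Z = k | Y = i]
print(w, r, r.sum())    # r sums to 1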
The module profile ultimately depends on the operational profile. Each user's view of the system, and the way that the activity of the system will be distributed across the modules in the software, will depend on the user.

Figure 1. Program Call Graph
Interestingly enough, for all software systems there is
a distinguished module, the main program module that
will always receive execution control from the operating
system. If we denote this main program as module 0
then $\Pr[Z_0 = 0] = 1$ and $\Pr[Z_0 = i] = 0$ for $i = 1, 2, \ldots, c$.
Further, for epoch 1, $\Pr[Z_1 = 0] = 0$, in that control will
have been transferred from the main program module to
another function module.
An interesting variation on this stochastic process is
one that has an absorbing state, in which case $p_{ii}^{(0)} = 1$
for at least one $i = 1, 2, \ldots, c$. An example of such a
system is a call graph that has a module that always exits
to the operating system. Once this state has been entered,
no other state is reachable from this absorbing state. We
will use this notion of an absorbing state to model the
failure of a system. In this case, we will consider the
failure of a program module to be the transition from that
module to the absorbing failure state. This modeling
approach was initially explored by Littlewood [4].
When a program module fails, we can imagine that the
module has made a transition to a distinguished program
module, a failure state. Thus, every program may be
thought to have a virtual module representing the failed
state of that program. This virtual program module is
shown pictorially in program call graph of Figure 1.
When the virtual module receives control, it will not
relinquish it, there are no returns from it. The transition
matrix for this new model is augmented by an additional
row and a new column. For a program with c modules,
let the error state be represented by a new state, $e = c + 1$. For this new state,

$$p_{ej}^{(n)} = \begin{cases} 0 & \text{for all } j = 1, 2, \ldots, c \text{ and } n = 0, 1, 2, \ldots \\ 1 & \text{for } j = e. \end{cases}$$

This represents the augmented row of the new transition matrix. Each row in the transition matrix will be augmented by a new column entry $p_{ie}^{(n)}$ for $i = 1, 2, \ldots, c$, where $p_{ie}^{(n)}$ represents the probability of the failure of the $i$th module in the $n$th epoch.
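The augmented matrix is easy to picture with a toy example. In the sketch below, a three-module program receives an extra absorbing failure state; the transition probabilities are illustrative only, and the probability of having failed by epoch n is read off the n-th power of the matrix.

import numpy as np

c = 3                       # number of real program modules (indices 0..c-1)
e = c                       # index of the failure state in a (c+1) x (c+1) matrix

P = np.zeros((c + 1, c + 1))
P[0, 1], P[0, 2] = 0.7, 0.3                    # main module never fails here
P[1, 0], P[1, 2], P[1, e] = 0.5, 0.48, 0.02    # module 1 fails with prob. 0.02
P[2, 0], P[2, 1], P[2, e] = 0.9, 0.09, 0.01    # module 2 fails with prob. 0.01
P[e, e] = 1.0                                   # absorbing failure state

assert np.allclose(P.sum(axis=1), 1.0)

# Probability of having reached the failure state by epoch n, starting in module 0:
n = 50
start = np.zeros(c + 1); start[0] = 1.0
print((start @ np.linalg.matrix_power(P, n))[e])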
When a program dies, it is the result of a fault in
one or more of its modules. Not all modules are
equally likely to lead to the failure event. The fault
proneness of the module is distinctly related to
measurable software attributes [6]. When program
modules are executed that are fault prone, they are
much more likely to fail than those that are not
fault prone. We seek a forecasting or prediction
mechanism that will capitalize on this
understanding.
The granularity of the term epoch is now of
interest. An epoch begins with the onset of
execution in a particular module and ends when
control is passed to another module.
The
measurable event for modeling purposes is this
transition among the program modules. We will
count the number of calls from a module and the
number of returns to that module. Each of these
transitions to a different program module from the
one currently executing will represent an
incremental change in the epoch number.
Computer programs executing in their normal
mode will make state transitions between program
modules rather rapidly. In terms of real clock
time, many epochs may elapse in a relatively short
period.
In reality, few if any systems are understood at
the functional or operation level.
We are
continually confronted with systems whose
functionality is not completely understood. While
we have developed methodologies to recapture the
essential functionalities [cf. 3], the majority of the
time we will not know the precise behavior of the
system that we are working with. To this end we
will develop a more relaxed form of profile called
the execution profile of a system.
When a user is exercising a system, the software will be driven through a sequence of functionalities $S = \{f_a, f_b, f_c, \ldots\}$. Depending on the particular sequence of functionalities that are executed, different sets of modules may or may not execute. From an empirical perspective, it makes a great deal of sense to model the behavior of a system in terms of an execution profile. This execution profile will be the conditional probability of executing a particular program module given a sequence $S$, as follows: $p_i = \Pr[Z = i \mid S]$. It is clear that $\lim_{|S| \to \infty} \Pr[Z = i \mid S] = r_i$. That is, the execution profile tends to the module profile for large execution sequences.
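Estimating an execution profile from an observed trace is simple relative-frequency counting; a minimal sketch with a made-up trace of module indices:

from collections import Counter

# Hypothetical observed trace of module indices, one entry per epoch.
trace = [0, 1, 0, 2, 2, 0, 1, 1, 0, 3]

counts = Counter(trace)
epochs = len(trace)

# p_i = Pr[Z = i | S]: proportion of epochs spent in module i for this sequence S.
execution_profile = {module: n / epochs for module, n in sorted(counts.items())}
print(execution_profile)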
Necessarily, if the program is maintaining the
transition matrix, it will not be able to update the vector
for the virtual module that represents the failure state of
the program. When the program transitions to this
hypothetical state, it will be dead and unable to keep track
of its own failures. In addition, in order to validate the
model, consistent failure data was needed but not
available for the application. To solve those problems,
the application process was encapsulated to allow the
trapping and simulation of failures. The capsule was
designed to receive different types of process signals
from the operating system; one of these signals provides a means to simulate a failure at any module at any time. When the process receives a failure signal from the operating system, it updates the vector for the virtual module representing the failure state, thereby simulating a failure. Other signals allow us to observe the transition matrix at any time, to modify its contents, to estimate the current reliability of the system, and so on. The purpose of this simple simulation scheme is to provide an inexpensive procedure for demonstrating the conceptual framework. The simulated failure assumes perfect fault coverage and modular fail-stop behavior.
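The capsule described above was built for the authors' environment; the following sketch is only a loose analogue of the idea, assuming a Unix-style user signal is used to trigger a simulated failure in whichever module is currently executing.

import signal

NUM_MODULES = 20
FAILURE = NUM_MODULES            # index of the virtual failure state
# (NUM_MODULES + 1) x (NUM_MODULES + 1) transition frequency matrix.
T = [[0] * (NUM_MODULES + 1) for _ in range(NUM_MODULES + 1)]
current_module = 0               # would be updated by the instrumentation hooks

def simulate_failure(signum, frame):
    """On SIGUSR1, record a transition from the current module to the failure state."""
    T[current_module][FAILURE] += 1

signal.signal(signal.SIGUSR1, simulate_failure)
# Sending SIGUSR1 to this process (e.g., `kill -USR1 <pid>`) now simulates a
# failure in whichever module is executing at that moment.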
4. Estimates for Transition Probabilities and
Profiles
The focus will now shift to the problem of
understanding the nature of the distribution of the
probabilities for various profiles. We have so far come to
recognize these profiles in terms of their multinomial
nature. The multinomial distribution is useful for
representing the outcome of an experiment involving a
set of mutually exclusive events. Let $Z = \bigcup_{i=1}^{c} Z_i$ where $Z_i$ is one of $c$ mutually exclusive sets of events. Each of these events would correspond to a program executing a particular module in the total set of program modules. From the definition of the module profile, $\Pr(Z_i) = r_i$ and $r_e = 1 - (r_1 + r_2 + \cdots + r_c)$, under the condition that $e = c + 1$, as defined earlier. In which case $r_i$ is the probability that the outcome of a random experiment is an element of the set $Z_i$ (the program is executing module $i$). If this experiment is conducted over a period of $n$ trials, then the random variable $X_i$ will represent the frequency of $Z_i$ outcomes. In this case, the value $n$ represents the number of transitions from one program
module to the next. Note that $X_e = n - X_1 - X_2 - \cdots - X_c$.
This particular distribution will be useful in the
modeling of a program with a set of c modules.
During a set of n epochs, each of the modules may
be executed. These, of course, are mutually
exclusive events. If module i is executing then
module j cannot be executing. This principle of
mutual exclusion is a requirement for the
multinomial distribution.
The multinomial distribution function with parameters $n$ and $\mathbf{r} = (r_1, r_2, \ldots, r_c)$ is given by

$$f(\mathbf{x} \mid n, \mathbf{r}) = \begin{cases} \dfrac{n!}{\prod_{i=1}^{c} x_i!}\, r_1^{x_1} r_2^{x_2} \cdots r_c^{x_c}, & (x_1, x_2, \ldots, x_c) \in X \\ 0 & \text{elsewhere,} \end{cases}$$

where $x_i$ represents the frequency of execution of the $i$th program module. The expected values for the $x_i$ are given by $E(x_i) = n r_i$, for $i = 1, 2, \ldots, c$.
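A quick numerical check of the expected frequencies E(x_i) = n r_i, with an invented module profile; the random draw shows one realization of the multinomial experiment.

import numpy as np

r = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical module profile for c = 4 modules
n = 1000                              # number of observed transitions (epochs)

expected_counts = n * r               # E(x_i) = n * r_i
print(expected_counts)                # [400. 300. 200. 100.]

# One realization of the multinomial experiment:
rng = np.random.default_rng(0)
x = rng.multinomial(n, r)
print(x, x.sum())                     # counts vary around the expected values; sum is n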
We would like to come to understand, for
example, the multinomial distribution of a
program’s execution profile while it is executing a
particular functionality. The problem, here, is that
every time a program is run we will observe that
there is some variation in the profile from one
execution sample to the next. It will be difficult to
estimate the parameters $\mathbf{r} = (r_1, r_2, \ldots, r_e)$ for the
multinomial distribution of the execution profile.
Rather than estimating these parameters statically,
it would be far more useful to us to get estimates of
these parameters dynamically as the program is
actually in operation, hence the utility of the
Bayesian approach [c.f. 5].
To aid in the process of characterizing the
nature of the true underlying multinomial
distribution, let us observe that the family of
Dirichlet distributions is a conjugate family for
observations that have a multinomial distribution
[14]. The p.d.f. for a Dirichlet distribution,
$D(\boldsymbol{\alpha}, \alpha_e)$, with a parametric vector $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_c)$, where $\alpha_i > 0$ for $i = 1, 2, \ldots, c$, is

$$f(\mathbf{r} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_c)}{\prod_{i=1}^{c} \Gamma(\alpha_i)}\; r_1^{\alpha_1 - 1} r_2^{\alpha_2 - 1} \cdots r_c^{\alpha_c - 1},$$

where $r_i > 0$ for $i = 1, 2, \ldots, c$ and $\sum_{i=1}^{c} r_i = 1$. The expected values of the $r_i$ are given by

$$E(r_i) = \mu_i = \frac{\alpha_i}{\alpha_0},$$

where $\alpha_0 = \sum_{i=1}^{e} \alpha_i$. In this context, $\alpha_0$ represents the total number of epochs.
Within the set of expected values $\mu_i$, $i = 1, 2, \ldots, e$, not all of the values are of equal interest. We are interested, in particular, in the value of $\mu_e$. This will represent the probability of a transition to the terminal failure state from a particular program module. So that we might use this value for our succeeding reliability prediction activities, it will be useful to know how good this estimate is. To this end, we would like to set 100(1-$\alpha$)% confidence limits on the estimate. For the Dirichlet distribution, this is not clean. To simplify the process of setting these confidence limits, let us observe that if $\mathbf{r} = (r_1, r_2, \ldots, r_c)$ is a random vector having the c-variate Dirichlet distribution $D(\boldsymbol{\alpha}, \alpha_e)$, then the sum $z = r_1 + \cdots + r_c$ has the beta distribution

$$f_\beta(z \mid \gamma, \alpha_e) = \frac{\Gamma(\gamma + \alpha_e)}{\Gamma(\gamma)\Gamma(\alpha_e)}\, z^{\gamma - 1}(1 - z)^{\alpha_e - 1},$$

or alternately

$$f_\beta(r_e \mid \gamma, \alpha_e) = \frac{\Gamma(\gamma + \alpha_e)}{\Gamma(\gamma)\Gamma(\alpha_e)}\, (1 - r_e)^{\gamma - 1} (r_e)^{\alpha_e - 1},$$

where $\gamma = \alpha_1 + \alpha_2 + \cdots + \alpha_c$.
Thus, we may obtain 100(1-$\alpha$)% confidence limits $\mu_e - a \le \mu_e \le \mu_e + b$ from

$$F_\beta(\mu_e - a \mid \gamma, \alpha_e) = \int_0^{\mu_e - a} f_\beta(r_e \mid \gamma, \alpha_e)\, dr_e = \frac{\alpha}{2}$$

and

$$F_\beta(\mu_e + b \mid \gamma, \alpha_e) = \int_0^{\mu_e + b} f_\beta(r_e \mid \gamma, \alpha_e)\, dr_e = 1 - \frac{\alpha}{2}. \qquad (1)$$
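In practice the limits of equation (1) can be read off the quantile function of the beta marginal. The sketch below assumes scipy is available and uses made-up Dirichlet parameters; since r_e has a Beta(alpha_e, gamma) marginal, the alpha/2 and 1 - alpha/2 quantiles of that distribution give the lower and upper limits.

from scipy.stats import beta

alpha_vec = [12.0, 7.0, 30.0, 2.0]     # hypothetical Dirichlet parameters alpha_1..alpha_c
alpha_e = 1.5                           # parameter associated with the failure state
gamma = sum(alpha_vec)                  # gamma = alpha_1 + ... + alpha_c
alpha_0 = gamma + alpha_e

mu_e = alpha_e / alpha_0                # point estimate of the failure-state probability

# r_e has a Beta(alpha_e, gamma) marginal, so the two-sided 95% limits are
# the 0.025 and 0.975 quantiles of that beta distribution.
conf = 0.05
lower = beta.ppf(conf / 2, alpha_e, gamma)
upper = beta.ppf(1 - conf / 2, alpha_e, gamma)
print(mu_e, (lower, upper))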
The value of the use of the Dirichlet conjugate family
for modeling purposes is twofold. First, it permits us to
estimate the probabilities of the module transitions
directly from the observed transitions. Secondly, we are
able to obtain revised estimates for these probabilities as
the observation process progresses. Let us now suppose that we wish to model the behavior of a software system whose execution profile has a multinomial distribution with parameters $n$ and $R = (r_1, r_2, \ldots, r_e)$, where $n$ is the total number of observed module transitions and the values of the $r_i$ are unknown. Let us assume that the prior distribution of $R$ is a Dirichlet distribution with a parametric vector $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_e)$ where $\alpha_i > 0$ for $i = 1, 2, \ldots, e$. Then the posterior distribution of $R$ for the total module execution frequency counts $X = (x_1, x_2, \ldots, x_e)$ is a Dirichlet distribution with
parametric vector $\boldsymbol{\alpha}^* = (\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_e + x_e)$ [cf. 2]. As an example, suppose that we now wish to model the behavior of a large software system with such a parametric vector $\boldsymbol{\alpha}$. As the system makes sequential transitions from one module to another, the posterior distribution of $R$ at each transition will be a Dirichlet distribution. Further, the $i$th component of the augmented parametric vector $\boldsymbol{\alpha}$, for $i = 1, 2, \ldots, e$, will be increased by 1 unit each time module $m_i$ is executed. Further, every time the system fails, the failure count, $x_e$, will also increase by one.
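Because the Dirichlet is conjugate to the multinomial, the dynamic updating described above amounts to incrementing a count per observed epoch; a small sketch with a hypothetical uniform prior:

import numpy as np

e = 5                                   # 4 real modules plus the failure state
alpha = np.ones(e)                      # hypothetical Dirichlet prior (one pseudo-count each)

def observe(module_index):
    """Update the posterior after one epoch spent in the given module
    (or in the failure state, index e - 1)."""
    alpha[module_index] += 1

# A few observed epochs, then one simulated failure:
for m in [0, 1, 0, 2, 3, 0, 1]:
    observe(m)
observe(e - 1)                          # the failure state

mu = alpha / alpha.sum()                # posterior means E(r_i) = alpha_i / alpha_0
print(mu, mu[-1])                       # mu[-1] estimates the failure probability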
5. Reliability Estimation
The modeling of reliability of systems will be
implemented through the use of an absorbing state
in the stochastic process. In this application, we
will postulate the existence of a virtual program
module representing the failure of the system.
Should control ever be transferred to this module,
it will never be returned. Each program module
has a non-zero probability of transferring control to
this virtual failure module. This probability is
directly related to the fault proneness of the
module. We may, in fact, use the functional
relationship between software complexity and
software faults to derive our prior probabilities for
the transition between each program module and
the virtual failed state module [c.f. 10].
What is important to understand is that each
program module is distinctly related to one or more
functions. If a function is expressed by a set of
modules that are failure prone, then the function
will appear to be failure prone. If, on the other
hand, a function is expressed by a set of modules in
a call tree that are fault free, this function will
never fail. The key point is that it is a
functionality that fails. Not all functions will be
executed by a user with the same likelihood. If a
user executes unreliable functions consistently,
then he will perceive the system to be unreliable.
Conversely, if another user were to use the same
system but exercise functionalities that were not so
likely to fail, then his perceptions of the same
system would be very different.
In order to model the reliability of a software
function, we will augment the basic stochastic
model to include an absorbing state that represents
a virtual program module called the failure state.
Each program module may have a non-zero
transition probability to this virtual module. If a
module is fault free, then its transition probability
will be zero.
To this point we have created a mechanism for
modeling the transition of each program module to the
failure state. In this sense the reliability of each module $m_i$ may be directly determined by the elements $p_{ie}^{(0)}$ of $P^{(0)}$.
With the Bayesian approach, we have also
established a mechanism for refining our estimates of
these reliabilities and establishing a measure of
confidence in each of these estimates. This information
will now be used to establish the reliability of functions
that employ each of the modules in varying degrees. The successive powers of $P^{(0)}$ will show the failure likelihood
for each of the modules. This will permit us to postulate
on the probability of a failure at some future epoch n
based on the current estimates of failure probability.
A central thrust of this investigation is to establish a framework for determining the reliability of program modules, the basic building blocks of a computer
program. Each program module has an associated
reliability. The total system is a composite of all of the
program modules. As a direct result of design decisions made in the implementation of program functionality, a module profile emerges for the design. The module profile $r_i$ is the unconditional probability that module $i$ will be in execution at any epoch. The expected value for the system unreliability $U_S$ at epoch $n$ is simply

$$U_S = \sum_{j=1}^{c} r_j\, p_{je}^{(n)}. \qquad (2)$$

An upper bound on this value may be obtained from

$$U_S^u = \sum_{j=1}^{c} r_j \left( p_{je}^{(n)} + b \right),$$

where the value $b$ is derived from the upper $\alpha/2$ confidence limit for each of the transition failure estimates from equation (1).
Over a number of epochs of successful program
execution the unreliability of a system will diminish
rather rapidly. Hence it only makes sense to talk about
the reliability of a system on an exponential scale. The
reliability of the system $R_S$ will be derived from its unreliability as follows: $R_S = -\log_{10} U_S$. A lower bound on this reliability estimate is $R_S^l = -\log_{10} U_S^u$.
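Putting the pieces together, equation (2) is a profile-weighted failure probability and the reported reliability is its negative base-10 logarithm; a short sketch with invented values (note, for instance, that an unreliability of 6.5e-03 corresponds to a reliability of about 2.19 on this scale).

import numpy as np

# Hypothetical module profile r_j and per-module failure probabilities p_je at epoch n.
r = np.array([0.40, 0.35, 0.20, 0.05])
p_fail = np.array([1e-4, 5e-4, 2e-3, 1e-2])
b = 5e-4                                  # upper confidence increment from equation (1)

U = float(r @ p_fail)                     # equation (2): system unreliability
U_upper = float(r @ (p_fail + b))         # upper bound on the unreliability

R = -np.log10(U)                          # system reliability (log scale)
R_lower = -np.log10(U_upper)              # corresponding lower bound on reliability
print(U, R, R_lower)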
The system reliability is ultimately a function of the
user’s operational profile. Should this profile differ from
one use to another, then so too will the system reliability.
Different users running under differing operational
profiles will have differing perceptions of the system’s
reliability.
6. An Empirical Study

It seems pointless to engage in the academic
exercise of software reliability modeling without
making at least the slightest attempt to discuss the
process of measuring the independent variable(s) in
these models. The basic premise of this paper is
that we really cannot measure temporal aspects of
program failure. The state of the art in measuring
these temporal aspects is best summarized by
Brocklehurst [1]. There are, however, certain
aspects of program behavior that we can measure, and measure with accuracy, such as transitions into and out of program modules [9].
Let us now turn to the measurement scenario for
the modeling process described above. For this
exercise we have instrumented the SDMS system
to track the transfer of control among the program
modules. We have also designed an experimental
scenario to introduce failures caused by specific
faults in individual program modules. Three users
then exercised the system performing different
activities. User 1 performed information retrieval
activities, displayed the retrieved records, and
generated queries. User 2 performed the role of a
database administrator, creating databases, modifying their structure, and maintaining them. User 3 entered records into a database much as a clerk would do.
There is a need to record the dynamic behavior
of the total system as it transitions from one
program module to another. We adapted and
extended a software transition profiler called CLIC
[11]. CLIC works in three phases. Phase one consists of code instrumentation, where "hooks" are added to each module of the target application in order to observe the transitions. The second phase is the profiling itself, where the target application is run and the software transitions are recorded as they occur. The third phase, the post-processing and analysis of the collected information, was dropped under the proposed model because the analysis is done dynamically as the data is gathered.
The target application was instrumented with
CLIC following the standard procedures. The
second phase of CLIC was modified to fit the
model. Every time a transition was made, the
information was sent to an additional function
called RELEST (RELiability ESTimation) in
charge of keeping track of and analyzing the
system behavior. If there are a total of 20 modules,
then we will need an e × e (21 × 21) matrix T to
record these transitions. Whenever the program
transfers control from module $m_i$ to module $m_j$, the element $t_{ij}$ of T will increase by one. Every time that a failure is caused by a fault in module $m_i$, then $t_{ie}$ of T will increase by one, representing a transition from that module to the failure state. The results of program execution by the three users performing their three distinct roles are shown in Table 2. The rows of this table contain, for each user, the row marginal totals of that user's matrix T. As the software executed, 9 simulated failures were created during the execution of the system by all three users.
Table 2. Transition Frequencies. For each of the 20 program modules, the table lists the total calls made by User 1, User 2, and User 3, together with the simulated failure counts. Column totals: User 1, 681 calls; User 2, 107 calls; User 3, 35 calls; 9 simulated failures in all.
Figure 2. Reliability Growth for User 3 (system reliability R plotted against cumulative epochs).
From just the simple frequency counts of program activity represented in Table 2, we can see substantial differences in the way the three users exercise this system. It is clear, for example, that neither User 1 nor User 3 would have seen the 2 failure events in module 14, nor would they have seen the failure events in modules 4 or 19. They did not cause any of these three modules to execute. Their view of the system is rather like that of the three blind men studying the nature of the elephant by feel. Each user has a different piece of the animal.
From the distribution of activity shown in Table
2, we can now compute the execution profiles for
each user. Estimates for these profiles may be obtained by dividing the total number of epochs for each user into the individual module counts. These execution profiles for each user are shown in Table 3. For each module under each user profile, the failure probability of each module was then calculated. This is in essence the probability of the transition to the failure state for each module. Finally, the system unreliability (Q) is calculated for each user. This value is obtained from formula (2) above and is shown in the penultimate row of Table 3. From the system unreliability for each user, we may then calculate the system reliability (logarithmically scaled) for each user. These data are shown in the last row of Table 3.

Table 3. Reliability Calculations. For each user, the table lists the execution profile, the failure probability, and the perceived unreliability of every module; the last two rows give the system unreliability Q and the system reliability (−log10 Q): User 1, Q = 7.6e-04, reliability 3.11; User 2, Q = 2.7e-03, reliability 2.56; User 3, Q = 6.5e-03, reliability 2.18.

The three users are using the same software. They are doing different things with it. User 1's perception of the system is very different from that of User 3. It is an order
of magnitude more reliable for activities performed by
User 1 than for User 3. Both User 2 and User 3 will think
that their software is considerably less reliable than
User 1’s perception of his system.
We would now like to represent the reliability
growth of a system after it has been placed back in
service after a fix. To this end we will continue to
monitor the activity of User 3 through an additional
1170 epochs. At intervals, the transition matrix
was dumped and the reliability calculated for the
elapsed number of epochs. The results of this
computation are shown in Figure 2. Here we can see a fairly rapid rise in the reliability of the system in its early stages, with some attenuation in this growth over time.
7. Summary
The failure of a software system is dependent
only on what the software is currently doing: its
functionality. If a program is currently executing a
functionality that is expressed in terms of a set of
fault free modules, this functionality will certainly
execute indefinitely without any likelihood of
failure. A program may execute a sequence of
fault prone modules and still not fail. In this
particular case, the faults may lie in a region of the
code that is not likely to be expressed during the
execution of a function. A failure event can only
occur when the software system executes a module
that contains faults. If a functionality is never
selected that drives the program into a module that
contains faults, then the program will never fail.
Alternatively, a program may well execute
successfully in a module that contains faults just as
long as the faults are in code subsets that are not
executed.
The very nature of the milieu in which
programs operate dictates that they will modify the
behavior of the individuals that are using them.
The result of this is that the user’s initial use of the
system as characterized by the operational profile
will not necessarily reflect his/her future use of the
software. There may be a dramatic shift in the
operational profile of the software user based
directly on the impact of the software or due to the fact that the users' needs have changed over time. A design that has been established to be robust under
one operational profile may become less than
satisfactory under new profiles. We must come to
understand that some systems may become less
reliable as they mature due to circumstances
external to the program environment.
The continuing evaluation of the execution,
function, and module profiles over the life of a
system can provide substantial information as to
the changing nature of the program’s execution
environment. This, in turn, will foster the notion that
software reliability assessment is as dynamic as the
operating environment of the program. That a system has
functioned reliably in its past is not a clear indication
that it will function reliably in the future.
Some of the problems that have arisen in past attempts at software reliability determination relate to the fact that their perspective has been distorted. Programs do not
wear out over time. If they are not modified, they will
certainly not improve over time. Nor will they get less
reliable over time. The only thing that really impacts the
reliability of a software system is its functionality. A
program may work very well for a number of years based
on the functions that it is asked to execute. This same
program may suddenly become quite unreliable if its
mission is changed by the user.
By keeping track of the state transitions from module
to module and function to function we may learn exactly
where a system is fragile. This information coupled with
the functional profile will tell us just how reliable the
system will be when we use it as specified. Programs
make transitions from module to module as they execute.
These transitions may be observed. Transitions to
program modules that are fault laden will result in an
increased probability of failure. We can model these
transitions as a stochastic process. Ultimately, by
developing a mathematical description for the behavior of
the software as it transitions from one module to another
driven by the functionalities that it is performing, we can
describe the reliability of the functionality. The software
system is the sum of its functionalities. If we can know
the reliability of the functionalities and how the system
apportions its time among these functionalities, we can
then know the reliability of the system.
8. Acknowledgments
This work was supported in part by a research grant
from the National Science Foundation and the Storage
Technology Corporation of Louisville, Colorado.
9. References
[1] S. Brocklehurst and B. Littlewood, “New ways to get
accurate reliability measures”, IEEE Software, July 1997, pp.
34-42.
[2] M. H. DeGroot, Optimal Statistical Decisions, McGraw-Hill Book Company, New York, 1970.
[3] G. A. Hall, Usage Patterns:
Extracting System
Functionality from Observed Profiles, unpublished dissertation.
University of Idaho, 1997. SETL TR #97-030.
[4] B. Littlewood, “Software reliability model for modular
program structure”, IEEE Transactions on Reliability, Vol. R-28, No. 3, 1979, pp. 241-246.
[5] T. A. Mazzuchi , R. Soyer, “A Bayes method for
assessing product-reliability during development
testing”, IEEE Transactions on Reliability, vol. 42, no.
3, 1993, pp. 503-510.
[6] J. C. Munson, T. M. Khoshgoftaar, “The detection of
fault-prone programs”, IEEE Transactions on Software
Engineering, SE-18, No. 5, 1992, pp. 423-433.
[7] J. C. Munson, R. H. Ravenel, “Designing reliable
software”, Proceedings of the 1993 IEEE International
Symposium of Software Reliability Engineering, IEEE
Computer Society Press, Los Alamitos, CA, November,
1993, pp. 45-54.
[8] J. C. Munson, “Software measurement: problems
and practice”, Annals of Software Engineering, vol. 1,
no. 1, 1995, pp. 255-285.
[9] J. C. Munson, “A Software Blackbox Recorder,”
Proceedings of the 1996 IEEE Aerospace Applications
Conference, IEEE Computer Society Press, pp. 309-320.
[10] J. C. Munson, "A Functional Approach to Software
Reliability Modeling," in Boisvert, ed., Quality of
Numerical Software, Assessment and Enhancement,
Chapman & Hall, London, 1997. ISBN 0-412-80530-8.
[11] J. C. Munson, S. G. Elbaum, R. M. Karcich, and J.
P. Wilcox,
“Software Risk Assessment Through
Software Measurement and Modeling”, Proceedings of
the 1998 IEEE Aerospace Conference, IEEE Computer
Society Press.
[12] J. D. Musa, “The operational profile in software
reliability engineering: an overview”, In Proceedings of
the IEEE International Symposium on Software
Reliability Engineering, IEEE Computer Society Press,
Los Alamitos, CA, November, 1992. pp. 140-154.
[13] H. Raiffa, R. Schlaifer, Applied Statistical Decision
Theory, Harvard University Press, Cambridge, 1961.
[14] S.S. Wilks, Mathematical Statistics, John Wiley and
Sons, Inc., New York, 1962.