Software Reliability

Abstract:
Software reliability, as a part of software engineering and reliability analysis, is now an
established interdisciplinary research area. It is well known that software costs far exceed
hardware costs when the total cost of a computer-based system is considered. Demands on
software reliability and availability have increased tremendously due to the nature of
present-day applications. A software life cycle consists of the requirements, specification,
design, coding, testing and operation/maintenance stages. Faults are corrected during the
testing phase, which is the most important phase because of the cost associated with it. In
many cases there are serious consequences, such as huge economic losses or risk to human
life, if the software is faulty. For many years research has been conducted into the
reliability of computer software; as software systems have become more complex to design
and develop, intensive studies have been carried out to increase the chance that software
will perform satisfactorily in operation. Software reliability measurement is one tool for
assessing software engineering technologies. Reliability assessment is not only useful for
the software being studied; it is also of great value to the future development of other
software systems, since valuable lessons are learnt by making quantitative studies of the
achieved results. This paper introduces some models for reliability estimation and
techniques for fault tolerance in software.
Important concepts in software reliability engineering
Software reliability is defined as the probability of failure-free software operation in a
defined environment for a specified period of time. Since software does not wear out
like hardware, the reliability of software stays constant over time if no changes are made
to the code or to the environmental conditions, including user behavior. However, if
each time a failure is experienced the underlying fault is detected and perfectly fixed,
then the reliability of the software will increase with time.
Software failure: A failure is the departure of software behavior from user requirements.
This dynamic phenomenon has to be distinguished from the static fault (or bug) in the
software code, which causes a failure as soon as it is activated during program execution.
Software fault: Software is said to contain a fault if, for some set of input data, the output
is not correct. Software faults are somewhat deterministic in nature: software does not fail
for unknown reasons. In the data domain a fault is deterministic, but in the time domain it
is difficult to say when the software will fail.
Software reliability growth models: As each failure occurrence initiates the removal of
a fault, the number of failures that have been experienced by time t, denoted by M(t), can
be regarded as a reflected image of reliability growth. Models explaining the behavior of
M(t) are called software reliability growth models (SRGMs).
Mean value function (µ(t)): It represents the expected number of failures experienced by
time t according to the reliability model used.
Failure intensity (λ(t)): It is the derivative of the mean value function with respect to time.
Hazard rate of the application z(Δt | ti−1): It is the probability density for the ith failure
being experienced at ti−1 + Δt, conditional on the (i−1)st failure having occurred at ti−1.
Hazard rate of an individual fault: The per-fault hazard rate is a measure of the
probability that a particular fault will instantaneously cause a failure, given that it has not
been activated so far.
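For the non-homogeneous Poisson process models discussed below these quantities are closely linked; the following summary (standard relations, stated here for completeness rather than quoted from the source) shows how the failure intensity and the conditional reliability follow from the mean value function:

```latex
% Failure intensity as the derivative of the mean value function
\lambda(t) = \frac{d\mu(t)}{dt}
% Probability of surviving a further interval \Delta t without failure,
% given the process has been observed up to time t (NHPP assumption)
R(\Delta t \mid t) = \exp\!\bigl(-\,[\mu(t+\Delta t)-\mu(t)]\bigr)
```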
A list of the distinct characteristics of software compared to hardware is given below:

Failure cause: Software defects are mainly design defects.
Wear-out: Software does not have an energy-related wear-out phase. Errors can occur
without warning.
Repairable system concept: Periodic restarts can help fix software problems.
Time dependency and life cycle: Software reliability is not a function of operational time.
Environmental factors: These do not affect software reliability, except insofar as they
might affect program inputs.
Reliability prediction: Software reliability cannot be predicted from any physical basis,
since it depends completely on human factors in design.
Interfaces: Software interfaces are purely conceptual rather than visual.
Software reliability growth models:
An important aspect of developing models relating the number and type of faults in a
software system to a set of structural measurements is defining what constitutes a fault. By
definition, a fault is an imperfection in a software system that may lead to the system's
eventually failing. A measurable and precise definition of what faults are makes it possible
to identify and count them accurately, which in turn allows the formulation of models
relating fault counts and types to other measurable attributes of a software system.
Unfortunately, the most widely used definitions are not measurable: there is no guarantee
that two different individuals looking at the same set of failure reports and the same set of
fault definitions will count the same number of underlying faults. One way to obtain a
measurable definition is to base the recognition and enumeration of software faults on the
grammar of the language of the software system.
1) Jelinski-Moranda model:
The model developed by Jelinski and Moranda was one of the first software reliability
growth models. It is based on a few simple assumptions:
1. At the beginning of testing, there are N faults in the software code with N
being an unknown but fixed number.
2. Each fault is equally dangerous with respect to the probability of its instantaneously
causing a failure. Furthermore, the hazard rate of each fault does not change over
time, but remains constant at Φ.
3. The failures are not correlated, i.e. given N and Φ the times between failures
(Δt1, Δt2, ..., Δtn) are independent.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously
and without introducing any new fault into the software.
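Under these assumptions the program hazard rate after the (i−1)st fix is z(Δt | ti−1) = Φ(N − i + 1), so the times between failures are independent exponentials with step-wise decreasing rates. A minimal simulation sketch (the parameter values are illustrative, not taken from the paper):

```python
import random

def simulate_jm_interfailure_times(N, phi, seed=0):
    """Simulate inter-failure times under the Jelinski-Moranda assumptions.

    After (i-1) faults have been removed, the program hazard rate is
    phi * (N - (i - 1)), so the time to the ith failure is exponentially
    distributed with that rate.
    """
    rng = random.Random(seed)
    times = []
    for i in range(1, N + 1):
        rate = phi * (N - (i - 1))   # remaining faults times per-fault hazard rate
        times.append(rng.expovariate(rate))
    return times

# Illustrative values: 10 initial faults, per-fault hazard rate 0.05 per CPU hour
print(simulate_jm_interfailure_times(N=10, phi=0.05))
```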
2 ) Goel-Okumoto model
The model proposed by Goel and Okumoto in 1979 is based on the following
assumptions:
1. The number of failures experienced by time t follows a Poisson distribution with mean
value function μ(t). This mean value function has the boundary conditions μ(0) = 0 and
μ(t) → N < ∞ as t → ∞.
2. The number of software failures that occur in (t, t+Δt], with Δt tending to 0, is
proportional to the expected number of undetected faults, N − μ(t). The constant of
proportionality is Φ.
3. For any finite collection of times t1 < t2 < · · · < tn, the numbers of failures occurring in
the disjoint intervals (0, t1], (t1, t2], ..., (tn−1, tn] are independent.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously
and without introducing any new fault into the software.
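Assumptions 1 and 2 together yield the familiar exponential mean value function μ(t) = N(1 − e^(−Φt)) and failure intensity λ(t) = NΦe^(−Φt). A small sketch of both functions (the numeric values are illustrative only):

```python
import math

def go_mean_value(t, N, phi):
    """Goel-Okumoto mean value function: expected number of failures by time t."""
    return N * (1.0 - math.exp(-phi * t))

def go_failure_intensity(t, N, phi):
    """Failure intensity, the derivative of the mean value function."""
    return N * phi * math.exp(-phi * t)

# Illustrative values: N = 120 expected total faults, phi = 0.02 per hour
for t in (0, 50, 100, 200):
    print(t, round(go_mean_value(t, 120, 0.02), 1),
          round(go_failure_intensity(t, 120, 0.02), 3))
```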
3) Musa basic execution time model
Musa's basic execution time model, developed in 1975, was the first to explicitly require
that time be measured as the actual CPU time used in executing the application under test.
Although it was not originally formulated this way, the model can be characterized by three
properties [41, p. 251]:
1. The number of failures that can be experienced in infinite time is finite.
2. The distribution of the number of failures observed by time t is of Poisson type.
3. The functional form of the failure intensity in terms of time is exponential.
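In the usual formulation these three properties lead to the following mean value function and failure intensity in execution time τ, where ν0 denotes the total number of failures expected in infinite time and λ0 the initial failure intensity (standard closed forms, quoted here for completeness):

```latex
\mu(\tau) = \nu_0 \left( 1 - e^{-\lambda_0 \tau / \nu_0} \right),
\qquad
\lambda(\tau) = \lambda_0 \, e^{-\lambda_0 \tau / \nu_0}
```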
4) Musa-Okumoto logarithmic model
The model is based on the following assumptions:
1. At time t = 0 no failures have been observed.
2. The failure intensity decreases exponentially with the expected number of failures
observed, i.e. λ(t) = β0 β1 exp(−µ(t)/β0).
3. The number of failures observed by time t, M(t), follows a Poisson process.
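Solving assumption 2 together with μ(0) = 0 gives the logarithmic mean value function μ(t) = β0 ln(1 + β1 t) and the failure intensity λ(t) = β0 β1 / (1 + β1 t). A small sketch (the parameter values are assumed here for the example):

```python
import math

def mo_mean_value(t, beta0, beta1):
    """Musa-Okumoto logarithmic model: expected number of failures by time t."""
    return beta0 * math.log(1.0 + beta1 * t)

def mo_failure_intensity(t, beta0, beta1):
    """Failure intensity beta0*beta1/(1 + beta1*t), which equals
    beta0*beta1*exp(-mu(t)/beta0) as required by assumption 2."""
    return beta0 * beta1 / (1.0 + beta1 * t)

# Illustrative values: beta0 = 25 failures, beta1 = 0.05 per CPU hour
for t in (0, 10, 100, 1000):
    print(t, round(mo_mean_value(t, 25, 0.05), 2),
          round(mo_failure_intensity(t, 25, 0.05), 3))
```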
5) Enhanced Non-Homogeneous Poisson Process (ENHPP) model
Another aspect of testing used to explain why the per-fault hazard rate (and,
consequently, the fault exposure ratio) should be time-dependent is the development of
test coverage. Test coverage c(t) can be defined as follows [17]:
c(t) = (number of potential fault sites sensitized through testing by time t) /
(total number of potential fault sites under consideration)
Its assumptions are as follows:
1. The N faults inherent in the software at the beginning of testing are uniformly
distributed over the potential fault sites.
2. At time t the probability that a fault present at the fault site sensitized at that moment
causes a failure is cd(t).
3. The development of test coverage as a function of time is given by c(t), with
c(t) → k ≤ 1 as t → ∞.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously
and without introducing any new fault into the software.
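My reading of these assumptions is that the expected number of failures tracks the growth of coverage: a fault is exposed when its site is sensitized and, with probability cd(t), actually causes a failure. In the simplest case cd(t) ≡ 1 the mean value function becomes proportional to coverage (a hedged summary, not a quotation from the source):

```latex
\mu(t) = N \int_0^t c_d(x)\,\frac{dc(x)}{dx}\,dx
\quad\xrightarrow{\;c_d(t)\,\equiv\,1\;}\quad
\mu(t) = N\,c(t)
```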
6) Block coverage model
This is a model for block coverage as a function of the number of test cases executed
during functional testing; it can easily be extended to a model of the number of failures
experienced in terms of time. The model is based on the following assumptions:
1. The program under test consists of G blocks.
2. Per test case, p of these blocks are sensitized on average.
3. The p blocks are always chosen from the entire population.
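Under assumptions 2 and 3, if each test case sensitizes exactly p blocks drawn uniformly from the whole population, the probability that a particular block is missed by a single test case is 1 − p/G, so the expected fraction of blocks covered after n test cases is approximately 1 − (1 − p/G)^n. A minimal sketch with illustrative numbers of my own:

```python
def expected_block_coverage(n_test_cases, G, p):
    """Expected fraction of the G blocks covered after n test cases,
    assuming each test case sensitizes p blocks drawn uniformly from the
    whole population (a simplification of assumptions 2 and 3)."""
    return 1.0 - (1.0 - p / G) ** n_test_cases

# Illustrative numbers: 500 blocks, 20 blocks exercised per test case
for n in (1, 10, 50, 100):
    print(n, round(expected_block_coverage(n, G=500, p=20), 3))
```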
Techniques for Fault Tolerance in Software
Techniques for coping with the existence and manifestation of faults in software are
divided into three main categories:

Fault avoidance/prevention: This includes design methodologies which attempt to make
software fault-free.
Fault removal: These methods aim to remove faults after the development stage is
completed. This is done by exhaustive and rigorous testing of the final product.
Fault tolerance: These methods assume that the system has unavoidable and undetectable
faults and aim to make provisions for the system to operate correctly even in the presence
of faults.
[Table: fault tolerance techniques (design diversity, data diversity, environment diversity,
checkpoint recovery) mapped against the error processing mechanisms they provide
(error compensation, error recovery, fault treatment).]
Design diversity:
A) N-version programming: In this technique, N (N ≥ 2) independently generated,
functionally equivalent programs, called versions, are executed in parallel. A majority
voting logic is used to compare the results produced by all the versions and to report one
of the results, which is presumed correct. The ability to tolerate faults here depends on how
"independent" the different versions of the program are.
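As a concrete illustration, a minimal sketch of the voting logic is given below; the three toy "versions" and the fault in one of them are hypothetical and only serve to show how a majority masks a single faulty version:

```python
from collections import Counter

def n_version_execute(versions, inputs):
    """Run N independently developed versions on the same inputs and
    return the majority result; a minimal sketch of N-version programming.
    Ties or total disagreement raise an error."""
    results = [version(*inputs) for version in versions]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("No majority agreement among versions")
    return winner

# Three toy 'versions' of the same function; v3 is faulty for x == 3
v1 = lambda x: x * x
v2 = lambda x: x ** 2
v3 = lambda x: x * x + (1 if x == 3 else 0)
print(n_version_execute([v1, v2, v3], (3,)))   # majority still returns 9
```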
B) Recovery block: Basically, in this approach, multiple variants of software which are
functionally equivalent are deployed in a time redundant fashion. An acceptance test is
used to test the validity of the result produced by the primary version. The execution of
the structure does not stop until the acceptance test is passed by one of the multiple
versions or until all the versions have been exhausted.
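A minimal sketch of this control flow is shown below; the integer square root variants and the acceptance test are hypothetical examples, and the rollback of program state before retrying a variant is omitted for brevity:

```python
def recovery_block(variants, acceptance_test, *inputs):
    """Try the primary variant first, then fall back to the alternates until
    one result passes the acceptance test; raise if all variants fail."""
    for variant in variants:
        try:
            result = variant(*inputs)
        except Exception:
            continue                      # treat an exception as a failed variant
        if acceptance_test(result, *inputs):
            return result
    raise RuntimeError("All variants failed the acceptance test")

# Hypothetical example: integer square root with a simple acceptance test
primary   = lambda x: int(x ** 0.5)           # fast, may be off by one for large x
alternate = lambda x: max(i for i in range(x + 1) if i * i <= x)
accept    = lambda r, x: r * r <= x < (r + 1) ** 2
print(recovery_block([primary, alternate], accept, 10**6))
```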
C) N-self checking programming: In N-self checking programming, multiple variants
of software are used in a hot-standby fashion as opposed to the recovery block technique
in which the variants are used in the cold-standby mode. A self-checking software
component is a variant with an acceptance test or a pair of variants with an associated
comparison test.
Data diversity
While the design diversity approaches to fault tolerance rely on multiple versions of the
software written to the same specification, the data diversity approach uses only one
version of the software. This approach relies on the observation that software sometimes
fails for certain values in the input space, and that such a failure could be averted by a
minor perturbation of the input data which is still acceptable to the software. N-copy
programming, based on data diversity, has N copies of a program executing in parallel,
but each copy running on a different input set produced by a diverse-data system. The
diverse-data system produces a related set of points in the data space.
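A minimal sketch of N-copy programming under this idea follows; the re-expression functions are hypothetical and stand in for whatever minor, application-acceptable perturbations of the input the diverse-data system can produce:

```python
from collections import Counter

def n_copy_execute(program, reexpressions, raw_input):
    """Run the same program on several re-expressed versions of the input
    (produced by a diverse-data system) and return the majority output."""
    outputs = [program(reexpress(raw_input)) for reexpress in reexpressions]
    return Counter(outputs).most_common(1)[0][0]

# Hypothetical example: tiny perturbations of a sensor reading
identity   = lambda x: x
nudge_up   = lambda x: x + 1e-9
nudge_down = lambda x: x - 1e-9
program    = lambda x: round(x, 3)          # stands in for the real software
print(n_copy_execute(program, [identity, nudge_up, nudge_down], 0.12345))
```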
Environment diversity
Based on the observation that most software failures are transient in nature, the
environment diversity approach requires re-executing the software in a different
environment. Transient faults typically occur in computer systems due to design faults in
software which result in unacceptable and erroneous states in the OS environment. A
program's environment comprises:

The volatile state: This consists of the program stack and the static and dynamic data
segments.
The persistent state: This refers to all the user files related to a program's execution.
The operating system (OS) environment: This refers to all the resources the program
accesses through the operating system, such as swap space, file systems, communication
channels, keyboard and monitor.
Checkpointing and Recovery
Checkpointing involves occasionally saving the state of a process to stable storage during
normal execution. Upon failure, the process is restarted from the saved state (the last
saved checkpoint), which reduces the amount of lost work. Checkpointing and recovery
were mainly intended to tolerate transient hardware failures, where the application is
restarted upon repair of the failed hardware unit. This technique has been implemented in
both software and hardware.
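A minimal checkpoint-and-restart sketch in the software flavour of this technique is given below; the work_items/step interface is hypothetical, and a real implementation would also need to make the write to stable storage atomic:

```python
import pickle

def run_with_checkpointing(work_items, state, checkpoint_path, step):
    """Periodically save (progress, state) to stable storage; on restart,
    resume from the last saved checkpoint instead of repeating completed
    work. `step` processes one work item and returns the updated state."""
    try:
        with open(checkpoint_path, "rb") as f:
            done, state = pickle.load(f)      # resume from last checkpoint
    except FileNotFoundError:
        done = 0                              # no checkpoint yet: start fresh
    for i in range(done, len(work_items)):
        state = step(state, work_items[i])
        if (i + 1) % 10 == 0:                 # checkpoint every 10 items
            with open(checkpoint_path, "wb") as f:
                pickle.dump((i + 1, state), f)
    return state
```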
Software testing:
There are many testing methods and techniques, serving multiple purposes in different
life cycle phases.
Classified by purpose, software testing can be divided into:
1) Correctness testing,
2) Performance testing,
3) Reliability testing and
4) Security testing.
Classified by life-cycle phase, software testing can be classified into the following
categories:
1) Requirements phase testing,
2) Design phase testing,
3) Program phase testing,
4) Evaluating test results,
5) Installation phase testing,
6) Acceptance testing and maintenance testing.
By scope, software testing can be categorized as follows:
1) Unit testing,
2) Component testing,
3) Integration testing and
4) System testing.
[Figure: classification of software testing. Software testing divides into execution-based
testing and non-execution-based testing. Execution-based testing covers program-based
testing, specification-based testing and combined testing, with program-based test criteria
further split into structurally based, fault based and error based criteria. Non-execution-
based testing covers inspection, which may be ad hoc, checklist based or scenario based.]
Black box testing treats the software as a black box, without any knowledge of how the
internals behave. White box testing, in contrast, is performed with access to the internal
data structures, code and algorithms. Program-based testing is basically a white box
approach, whereas specification-based testing is a black box approach.
Conclusion:
Well-understood and extensively tested standard parts would help improve maintainability
and reliability. In the software industry, however, this trend has not been observed. Code
reuse has been around for some time, but only to a very limited extent. Strictly speaking,
there are no standard parts for software, except some standardized logic structures.
However, if each time a failure is experienced the underlying fault is detected and perfectly
fixed, then the reliability of the software will increase with time.
References:
1) http://ieeexplore.ieee.org/iel5/8378/26367/01173299.pdf
2) http://www.statistik.wiso.uni-erlangen.de/lehrstuhl/migrottke/SRModelStudy.pdf
3) http://www.chillarege.com/authwork/TestingBestPractice.pdf
4) http://srel.ee.duke.edu/sw_ft/node1.html
5) http://www.ece.cmu.edu/~koopman/des_s99/sw_reliability/#tools