By Jackie Wadzinski
The Patriot Missile
Used to destroy incoming Iraqi Scud missiles
Hailed for effectiveness
Operated for 100 consecutive hours
28 American soldiers killed
Cause: Software Failure
The Patriot Missile
A Learning Experience
The software can be redesigned
A new Patriot Missile can be built
The fate of the 28 soldiers remains the same
THE MORAL: Software Engineers need to find a way to engineer reliability into software.
Objectives
Definition of Software Reliability
Importance of Reliability Engineering
Why Reliability Engineering is Difficult
Reliability Engineering Processes
Weibull
Musa
Monte Carlo
Conclusion
What is Software Reliability?
IEEE Definition:
“The ability of a system or component to perform its required functions under stated conditions for a specified period of time.”
The definition allows for a “just right” level of reliability for software
Software Reliability and Hardware Reliability have the same definition
Why is Software Reliability Important?
Manager View
Reliable software means satisfied customers
Reliable software means repeat customers
Reliable software is ethical
Legal liability
Customer View
Reliable software saves time
Reliable software increases efficiency
Why Software Reliability is Difficult to Calculate
Without considering program evolution, the failure rate is statistically nonexistent
Failures can arise from many different kinds of design defects
Why Software Reliability is Difficult to Calculate
Errors can occur without warning
Replicating identical software components does not improve software quality, because every copy contains the same faults
Periodic restarts can sometimes help fix problems
Errors are caused by incorrect logic, incorrect statements, or incorrect input data
Software may require infinite testing
Software reliability models do not always fit the data points well
Overview
There are many models to choose from when calculating software reliability
Focus on three
Weibull Failure Time Model
Musa’s Basic Execution Time Model
Monte Carlo Simulation
Each model has its own strengths and limitations
About Weibull Failure Model
Used to model failure processes of hardware
One of the first models to be applied to software reliability modeling
Flexible – accommodates increasing, decreasing or constant failure rates
Weibull Failure Model
Weibull Failure Model Assumptions:
There is a fixed number of faults in the software being tested
Faults are counted by the time interval in which they are detected: (t0 = 0, t1), (t1, t2), ...
Limitations:
The model's flexibility increases the chance of making the wrong assumption about the failure rate
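The flexibility described above comes from the shape parameter of the Weibull distribution. The following is a minimal sketch, not taken from the presentation, of the standard Weibull hazard rate and reliability functions. The function names and the parameter names `shape` and `scale` are assumptions chosen for illustration.

```python
import math


def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_reliability(t, shape, scale=1.0):
    """Probability of surviving past time t: R(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))


# shape < 1: decreasing failure rate
# shape = 1: constant failure rate (the exponential distribution)
# shape > 1: increasing failure rate
for shape in (0.5, 1.0, 2.0):
    rates = [round(weibull_hazard(t, shape), 3) for t in (0.5, 1.0, 2.0)]
    print("shape", shape, "hazard at t = 0.5, 1, 2:", rates)
```

Choosing the wrong shape value changes the direction of the failure-rate trend, which is the limitation noted above.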
Weibull Failure Model Example
Notice how the model follows the actual data
About Musa’s Basic Execution Time Model
Developed by John Musa of AT&T Bell Laboratories
One of the first models to use the actual execution time of software components rather than calendar time
Time between failures is expressed in terms of CPU time
Musa’s Basic Execution Time Model
Uses a Poisson Distribution
Model Assumptions:
The execution times between failures are exponentially distributed
The hazard rate for a single fault is constant
Limitations:
Assumes new faults are not introduced after correction
Assumes number of faults decreases over time
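The assumptions above lead to the usual closed-form expressions for the basic execution time model: the failure intensity decays exponentially with execution time, and the expected number of failures approaches a fixed total. The sketch below is illustrative only. The names `lambda0` (initial failure intensity) and `nu0` (total expected failures), and the numeric values, are assumptions and do not come from the presentation.

```python
import math


def expected_failures(tau, lambda0, nu0):
    """Expected cumulative failures after tau units of execution (CPU) time."""
    return nu0 * (1.0 - math.exp(-lambda0 * tau / nu0))


def failure_intensity(tau, lambda0, nu0):
    """Failure intensity (failures per unit of CPU time) after tau units of execution."""
    return lambda0 * math.exp(-lambda0 * tau / nu0)


# Hypothetical values: 10 failures per CPU hour at the start, 100 failures expected in total.
lambda0, nu0 = 10.0, 100.0
for tau in (0.0, 5.0, 10.0, 50.0):
    mu = expected_failures(tau, lambda0, nu0)
    lam = failure_intensity(tau, lambda0, nu0)
    print(f"tau={tau:5.1f}  expected failures={mu:7.2f}  intensity={lam:6.3f}")
```

The intensity falls linearly with the number of failures experienced, `lambda0 * (1 - mu / nu0)`, which reflects the assumption that each correction removes a fault and introduces none.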
Musa’s Basic Execution Time Model Example
Notice how the model follows the actual data
About Monte Carlo Simulation
Developed in the 1940s as part of the atomic bomb program
Named after Monte Carlo, Monaco, because the city’s casinos featured games of chance like dice and roulette
Today Monte Carlo Simulations are used in many applications including physics, finance, and system reliability
Monte Carlo Simulation
Used for very complex problems that are difficult to solve analytically or for which no analytical solution exists
Uses statistics to mathematically model real life processes and then estimates the probability of possible outcomes
Involves fitting a curve to a process and then using the fitted curve to model a process over time
Dice Example
Monte Carlo Simulation Process
Determine a probability function
Weibull Distribution – Best for failure process
Lognormal Distribution – Best for repair process
Determine the random number generator: the source of random numbers distributed uniformly on the unit interval
Determine a sampling rule that maps the uniform random numbers to samples from the model's probability function
Record a count of successes and failures
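The four steps above can be sketched as follows. This is a minimal illustration, assuming a Weibull failure-time distribution, Python's built-in generator as the uniform source, and inverse-transform sampling as the sampling rule. The function names, the mission time, and the parameter values are hypothetical.

```python
import math
import random


def sample_weibull(u, shape, scale):
    """Sampling rule: map a uniform u in [0, 1) to a Weibull failure time."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def estimate_failure_probability(mission_time, shape, scale, trials, seed=0):
    """Count trials in which the sampled failure time falls before mission_time."""
    rng = random.Random(seed)  # uniform random number generator on [0, 1)
    failures = 0
    for _ in range(trials):
        failure_time = sample_weibull(rng.random(), shape, scale)
        if failure_time < mission_time:
            failures += 1
    successes = trials - failures
    return failures, successes, failures / trials


# Hypothetical parameters: shape 1.5, scale 1000 hours, 500-hour mission.
failures, successes, estimate = estimate_failure_probability(500.0, 1.5, 1000.0, 20_000)
exact = 1.0 - math.exp(-((500.0 / 1000.0) ** 1.5))
print("failures:", failures, "successes:", successes)
print("estimated failure probability:", estimate, "exact:", round(exact, 4))
```

Because the Weibull case has a closed-form answer, it serves as a check on the simulation before the same procedure is applied to a system with no analytical solution.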
Monte Carlo Example
Select a random location within the rectangle
If the selected location is blue, record a hit
Repeat 10,000 times
Blue Area = (Hits / 10,000) * Area of Rectangle
Note: The standard error in the result is inversely proportional to the square root of the sample size
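The hit-counting procedure above can be sketched in a few lines. The original figure is not available, so this example assumes a 2 by 2 rectangle whose "blue" region is a circle of radius 1 centered in it. That shape, and the names `estimate_area` and `in_circle`, are illustrative assumptions.

```python
import math
import random


def estimate_area(is_inside, width, height, samples, seed=0):
    """Estimate the area of a region inside a width x height rectangle by counting hits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(0.0, width)   # random location within the rectangle
        y = rng.uniform(0.0, height)
        if is_inside(x, y):
            hits += 1
    return hits / samples * width * height


def in_circle(x, y):
    """Assumed 'blue' region: circle of radius 1 centered at (1, 1)."""
    return (x - 1.0) ** 2 + (y - 1.0) ** 2 <= 1.0


area = estimate_area(in_circle, 2.0, 2.0, 10_000)
print("estimated area:", area, "true area:", round(math.pi, 4))
```

Quadrupling the number of samples roughly halves the standard error, which is the square-root relationship stated in the note.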
Monte Carlo Software Example
Arbitrary 3 component subsystem
The failure probability of each component given in the diagram above
If the first component fails, then the second is checked
If the second component fails, then the third component is checked
If the third component fails, then the entire subsystem fails
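The checking sequence above means the subsystem fails only when all three components fail. A minimal simulation sketch follows. The component failure probabilities come from a diagram that is not reproduced here, so the values 0.1, 0.2, and 0.3 are hypothetical placeholders, as are the function names.

```python
import random


def subsystem_fails(fail_probs, rng):
    """Check components in order; the subsystem fails only if every one fails."""
    for p in fail_probs:
        if rng.random() >= p:
            return False  # this component worked, so the subsystem survives
    return True


def simulate(fail_probs, trials, seed=0):
    """Return the fraction of trials in which the whole subsystem failed."""
    rng = random.Random(seed)
    failures = sum(subsystem_fails(fail_probs, rng) for _ in range(trials))
    return failures / trials


def analytic_failure_probability(fail_probs):
    """Product of the component failure probabilities, assuming independent failures."""
    result = 1.0
    for p in fail_probs:
        result *= p
    return result


probs = [0.1, 0.2, 0.3]  # hypothetical values, not the ones from the diagram
print("analytic: ", analytic_failure_probability(probs))
print("simulated:", simulate(probs, 200_000))
```

Comparing the simulated fraction against the analytic product shows how closely the simulation tracks the true value for a given number of trials.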
Monte Carlo Software Example
The actual failure probability of the subsystem is:
The results of the actual simulation are:
Conclusion
Engineering reliable software is important to both the engineer and the end user
Engineering reliable software is not an easy task to accomplish
There are methods available for measuring reliability
Each method has its strengths and weaknesses
At this time, no one method is superior