Abstract: Software reliability, as a part of software engineering and reliability analysis, is now an established interdisciplinary research area. It is a known fact that software costs are much higher than hardware costs when the total cost of a computer-based system is considered. Demands on software reliability and availability have increased tremendously due to the nature of present-day applications. A software life cycle consists of the requirement, specification, design, coding, testing and operation/maintenance stages. Faults are corrected during the testing phase; this is the most important phase because of the cost associated with it. In many cases there are serious consequences, such as huge economic losses or risk to human life, if the software is faulty. Research on the reliability of computer software has been conducted for many years; as software systems have become more complex to design and develop, intensive studies are carried out to increase the chance that the software will perform satisfactorily in operation. Software reliability measurement is one tool for assessing software engineering technologies. Reliability assessment is not only useful for the software being studied; it is also of great value to the future development of other software systems, since valuable lessons are learnt by making quantitative studies of the achieved results. This paper introduces some software reliability growth models and techniques for fault tolerance in software.

Important concepts in software reliability engineering

Software reliability is defined as the probability of failure-free software operation in a defined environment for a specified period of time. Since software does not wear out like hardware, the reliability of software stays constant over time if no changes are made to the code or to the environmental conditions, including the user behavior. However, if each time a failure has been experienced the underlying fault is detected and perfectly fixed, then the reliability of the software will increase with time.

Software failure: A failure is the departure of software behavior from user requirements. This dynamic phenomenon has to be distinguished from the static fault (or bug) in the software code, which causes the failure as soon as it is activated during program execution.

Software fault: Software is said to contain a fault if for some set of input data the output is not correct. Software faults are deterministic in nature: software does not fail for unknown reasons. In the data domain a fault is deterministic, but in the time domain it is difficult to say when the software will fail.

Software reliability growth models: As each failure occurrence initiates the removal of a fault, the number of failures that have been experienced by time t, denoted by M(t), can be regarded as a reflected image of reliability growth. Models explaining the behavior of M(t) are called software reliability growth models (SRGMs).

Mean value function μ(t): It represents the expected number of failures experienced by time t according to a given reliability model.

Failure intensity λ(t): It is the derivative of the mean value function with respect to time.

Hazard rate of the application z(Δt | t_{i−1}): It is the probability density for the i-th failure being experienced at t_{i−1} + Δt, conditional on the (i−1)-st failure having occurred at t_{i−1}.
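For models of the non-homogeneous Poisson process (NHPP) type discussed later in this paper, these quantities are connected by standard relations. The block below is a brief summary added for reference, not part of any particular model's definition:

```latex
% Standard relations between the mean value function, the failure
% intensity and conditional reliability for NHPP-type reliability models.
\begin{align*}
  \lambda(t) &= \frac{d\mu(t)}{dt}
      && \text{failure intensity is the derivative of the mean value function}\\
  \mu(t) &= \int_0^t \lambda(s)\,ds
      && \text{expected number of failures experienced by time } t\\
  R(\Delta t \mid t) &= \exp\!\big[-\big(\mu(t+\Delta t)-\mu(t)\big)\big]
      && \text{probability of failure-free operation in } (t,\,t+\Delta t]
\end{align*}
```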
Hazard rate of an individual fault: The per-fault hazard rate is a measure of the probability that a particular fault will instantaneously cause a failure, given that it has not been activated so far.

The distinct characteristics of software compared to hardware are listed below:
Failure cause: Software defects are mainly design defects.
Wear-out: Software does not have an energy-related wear-out phase. Errors can occur without warning.
Repairable system concept: Periodic restarts can help fix software problems.
Time dependency and life cycle: Software reliability is not a function of operational time.
Environmental factors: These do not affect software reliability, except insofar as they might affect program inputs.
Reliability prediction: Software reliability cannot be predicted from any physical basis, since it depends completely on human factors in design.
Interfaces: Software interfaces are purely conceptual, other than visual.

Software reliability growth models: An important aspect of developing models relating the number and type of faults in a software system to a set of structural measurements is defining what constitutes a fault. By definition, a fault is an imperfection in a software system that may lead to the system's eventually failing. A measurable and precise definition of what faults are makes it possible to identify and count them accurately, which in turn allows the formulation of models relating fault counts and types to other measurable attributes of a software system. Unfortunately, the most widely used definitions are not measurable: there is no guarantee that two different individuals looking at the same set of failure reports and the same set of fault definitions will count the same number of underlying faults. Specifically, we base our recognition and enumeration of software faults on the grammar of the language of the software system.

1) Jelinski-Moranda model: The model developed by Jelinski and Moranda was one of the first software reliability growth models. It is based on some simple assumptions:
1. At the beginning of testing there are N faults in the software code, with N an unknown but fixed number.
2. Each fault is equally dangerous with respect to the probability of its instantaneously causing a failure. Furthermore, the hazard rate of each fault does not change over time but remains constant at Φ.
3. The failures are not correlated, i.e. given N and Φ the times between failures Δt_1, Δt_2, ..., Δt_N are independent.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously and without introducing any new fault into the software.

2) Goel-Okumoto model: The model proposed by Goel and Okumoto in 1979 is based on the following assumptions:
1. The number of failures experienced by time t follows a Poisson distribution with mean value function μ(t). This mean value function satisfies the boundary conditions μ(0) = 0 and lim_{t→∞} μ(t) = N < ∞.
2. The number of software failures that occur in (t, t + Δt], with Δt tending to 0, is proportional to the expected number of undetected faults, N − μ(t). The constant of proportionality is Φ.
3. For any finite collection of times t_1 < t_2 < ... < t_n, the numbers of failures occurring in the disjoint intervals (0, t_1), (t_1, t_2), ..., (t_{n−1}, t_n) are independent.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously and without introducing any new fault into the software.
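From these assumptions the Goel-Okumoto mean value function takes the closed form μ(t) = N(1 − exp(−Φt)) with failure intensity λ(t) = NΦ exp(−Φt), while the Jelinski-Moranda model yields the hazard rate Φ(N − i + 1) before the i-th failure. The short sketch below illustrates these quantities; the parameter values are invented purely for the example.

```python
import math

def go_mean_value(t, n_faults, phi):
    """Goel-Okumoto mean value function: expected number of failures by time t."""
    return n_faults * (1.0 - math.exp(-phi * t))

def go_failure_intensity(t, n_faults, phi):
    """Goel-Okumoto failure intensity: derivative of the mean value function."""
    return n_faults * phi * math.exp(-phi * t)

def jm_hazard_rate(i, n_faults, phi):
    """Jelinski-Moranda hazard rate between the (i-1)-th and i-th failure:
    proportional to the number of faults still left in the code."""
    return phi * (n_faults - (i - 1))

# Invented parameters for illustration: 100 initial faults,
# per-fault hazard rate of 0.02 per hour.
N, PHI = 100, 0.02
for t in (10, 50, 100):
    print(f"t = {t:3d} h: expected failures = {go_mean_value(t, N, PHI):5.1f}, "
          f"failure intensity = {go_failure_intensity(t, N, PHI):.3f} per hour")
print("JM hazard rate before the 1st failure:", jm_hazard_rate(1, N, PHI))
print("JM hazard rate before the 50th failure:", jm_hazard_rate(50, N, PHI))
```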
3) Musa basic execution time model: Musa's basic execution time model, developed in 1975, was the first to explicitly require that the time measurements be in actual CPU time utilized in executing the application under test. Although it was not originally formulated that way, the model can be classified by three characteristics [41, p. 251]:
1. The number of failures that can be experienced in infinite time is finite.
2. The distribution of the number of failures observed by time t is of Poisson type.
3. The functional form of the failure intensity in terms of time is exponential.

4) Musa-Okumoto logarithmic model: Its assumptions are as follows:
1. At time t = 0 no failures have been observed.
2. The failure intensity decreases exponentially with the expected number of failures observed, i.e. λ(t) = β0 β1 exp(−μ(t)/β0).
3. The number of failures observed by time t, M(t), follows a Poisson process.
Integrating the failure intensity with μ(0) = 0 gives the logarithmic mean value function μ(t) = β0 ln(1 + β1 t).

5) Enhanced Non-Homogeneous Poisson Process (ENHPP) model: Another aspect of testing used to explain why the per-fault hazard rate (and, consequently, the fault exposure ratio) should be time-dependent is the development of test coverage. Test coverage c(t) can be defined as follows [17]:
c(t) = (number of potential fault sites sensitized through testing by time t) / (total number of potential fault sites under consideration)
Its assumptions are as follows:
1. The N faults inherent in the software at the beginning of testing are uniformly distributed over the potential fault sites.
2. At time t, the probability that a fault present at the fault site sensitized at that moment causes a failure is c_d(t).
3. The development of test coverage as a function of time is given by c(t), with lim_{t→∞} c(t) = k ≤ 1.
4. Whenever a failure has occurred, the fault that caused it is removed instantaneously and without introducing any new fault into the software.

6) Block coverage model: This is a model of block coverage as a function of the number of test cases executed during functional testing; it can easily be extended to become a model of the number of failures experienced in terms of time. The model is based on the following assumptions:
1. The program under test consists of G blocks.
2. Per test case, p of these blocks are sensitized on average.
3. The p blocks are always chosen from the entire population.

Techniques for Fault Tolerance in Software
The techniques for coping with the existence and manifestation of faults in software are divided into three main categories:
Fault avoidance/prevention: This includes design methodologies which attempt to make software fault-free.
Fault removal: These methods aim to remove faults after the development stage is completed. This is done by exhaustive and rigorous testing of the final product.
Fault tolerance: This approach assumes that the system has unavoidable and undetectable faults and aims to make provisions for the system to operate correctly even in the presence of faults. Fault tolerance is provided through error compensation, error recovery and fault treatment, and is realized by design diversity, data diversity, environment diversity and checkpoint-and-recovery techniques.

Design diversity:
A) N-version programming: In this technique, N (N ≥ 2) independently generated, functionally equivalent programs, called versions, are executed in parallel. A majority voting logic is used to compare the results produced by all the versions and to report one of the results, which is presumed correct. The ability to tolerate faults here depends on how "independent" the different versions of the program are.
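A minimal sketch of the majority-voting logic behind N-version programming is given below; the three version_* functions are purely hypothetical stand-ins for independently developed implementations of the same specification (here, integer square root), one of which is deliberately faulty.

```python
from collections import Counter

def majority_vote(results):
    """Return the result produced by a majority of the versions,
    or None if no result is produced by more than half of them."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

# Hypothetical, independently developed versions of the same function.
# In practice these would be written by separate teams from one specification.
def version_a(x):
    return int(x ** 0.5)

def version_b(x):
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r

def version_c(x):
    return int(x ** 0.5) + 1   # deliberately faulty version

def n_version_execute(x):
    """Run all versions in (conceptual) parallel and vote on their results."""
    results = [version(x) for version in (version_a, version_b, version_c)]
    return majority_vote(results)

print(n_version_execute(49))   # 7: the faulty version is outvoted
```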
B) Recovery block: In this approach, multiple functionally equivalent variants of the software are deployed in a time-redundant fashion. An acceptance test is used to check the validity of the result produced by the primary version. The execution of the structure does not stop until the acceptance test is passed by one of the variants or until all the variants have been exhausted.

C) N-self-checking programming: In N-self-checking programming, multiple variants of the software are used in a hot-standby fashion, as opposed to the recovery block technique, in which the variants are used in cold-standby mode. A self-checking software component is either a variant with an acceptance test or a pair of variants with an associated comparison test.

Data diversity: While the design diversity approaches to fault tolerance rely on multiple versions of the software written to the same specification, the data diversity approach uses only one version of the software. It relies on the observation that software sometimes fails for certain values in the input space, and that such a failure could be averted by a minor perturbation of the input data that is acceptable to the software. N-copy programming, based on data diversity, has N copies of a program executing in parallel, each copy running on a different input set produced by a diverse-data system. The diverse-data system produces a related set of points in the data space.

Environment diversity: Based on the observation that most software failures are transient in nature, the environment diversity approach requires re-executing the software in a different environment. Transient faults typically occur in computer systems due to design faults in software which result in unacceptable and erroneous states in the OS environment. The environment of a program can be described in terms of three kinds of state:
The volatile state: This consists of the program stack and the static and dynamic data segments.
The persistent state: This refers to all the user files related to a program's execution.
The operating system (OS) environment: This refers to all the resources the program accesses through the operating system, such as swap space, file systems, communication channels, keyboard and monitor.

Checkpointing and Recovery: Checkpointing involves occasionally saving the state of a process to stable storage during normal execution. Upon failure, the process is restarted from the saved state (the last saved checkpoint), which reduces the amount of lost work. Checkpointing and recovery was mainly intended to tolerate transient hardware failures, where the application is restarted upon repair of a hardware unit after failure. The technique has been implemented in both software and hardware. A minimal sketch of checkpoint-and-rollback recovery is given after the testing classification below.

Software testing: There are many testing methods and techniques, serving multiple purposes in different life-cycle phases. Classified by purpose, software testing can be divided into: 1) Correctness testing, 2) Performance testing, 3) Reliability testing and 4) Security testing. Classified by life-cycle phase, software testing can be divided into the following categories: 1) Requirements phase testing, 2) Design phase testing, 3) Program phase testing, 4) Evaluating test results, 5) Installation phase testing, 6) Acceptance testing and maintenance testing.
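The following sketch (with entirely hypothetical file and function names) illustrates the checkpoint-and-rollback idea from the Checkpointing and Recovery paragraph above: the process state is periodically written to stable storage, and after a transient failure the computation resumes from the last checkpoint instead of starting over.

```python
import json
import os
import random

CHECKPOINT_FILE = "checkpoint.json"   # stands in for stable storage

def save_checkpoint(state):
    """Persist the current process state; write-then-rename keeps the file consistent."""
    with open(CHECKPOINT_FILE + ".tmp", "w") as f:
        json.dump(state, f)
    os.replace(CHECKPOINT_FILE + ".tmp", CHECKPOINT_FILE)

def load_checkpoint():
    """Return the last saved state, or the initial state if no checkpoint exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}

def run_job(items, checkpoint_every=100):
    state = load_checkpoint()                # recovery: resume from the last checkpoint
    for i in range(state["next_item"], len(items)):
        if random.random() < 0.001:          # simulated transient failure
            raise RuntimeError("transient failure, please restart the job")
        state["total"] += items[i]           # the actual work of the process
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(state)           # occasional checkpoint during normal execution
    return state["total"]

# On failure the job is simply re-run; only the work done since the
# last checkpoint is lost.
```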
By scope, software testing can be categorized as follows: 1) Unit testing, 2) Component testing, 3) Integration testing and 4) System testing.

[Figure: classification of software testing into execution-based testing (program-based testing, specification-based testing and combined testing) and non-execution-based testing (inspection: ad hoc, checklist-based and scenario-based); the criteria mentioned are structurally based, fault-based and error-based criteria.]

Black-box testing treats the software as a black box, without any understanding of how the internals behave. White-box testing, in contrast, is testing in which the tester has access to the internal data structures, code and algorithms. Program-based testing is basically a white-box approach, whereas specification-based testing is a black-box approach.

Conclusion: Well-understood and extensively tested standard parts will help improve maintainability and reliability. In the software industry, however, this trend has not been observed: code reuse has been around for some time, but only to a very limited extent. Strictly speaking, there are no standard parts for software, except some standardized logic structures. However, if each time a failure has been experienced the underlying fault is detected and perfectly fixed, then the reliability of the software will increase with time.

References:
1) http://ieeexplore.ieee.org/iel5/8378/26367/01173299.pdf
2) http://www.statistik.wiso.uni-erlangen.de/lehrstuhl/migrottke/SRModelStudy.pdf
3) http://www.chillarege.com/authwork/TestingBestPractice.pdf
4) http://srel.ee.duke.edu/sw_ft/node1.html
5) http://www.ece.cmu.edu/~koopman/des_s99/sw_reliability/#tools