Software Dependability Assessment: A Reality or A Dream?
Karama Kanoun
The Sixth IEEE International Conference on Software Security and Reliability (SERE), Washington D.C., USA, 20-22 June 2012

[Figure, from J. Gray, "Dependability in the Internet era": availability rising from roughly 90-99% in the 1950s-1970s to 99.9999% by 2010, driven by complexity and economic pressure; milestones: the Internet, cell phones.]

Availability      Outage duration / year
0.999999          32 s
0.99999           5 min 15 s
0.9999            52 min 34 s
0.999             8 h 46 min
0.99              3 d 16 h

(A small sketch after the survey slide below reproduces this correspondence.)

Examples of Historical Failures
The slide classifies each incident by the dependability property affected (availability/reliability, confidentiality, safety), its cause (physical, development, interaction), and its impact (localized, distributed):
• June 1980: false alerts at the North American Air Defense (NORAD)
• June 1985 - Jan. 1987: excessive radiotherapy doses (Therac-25)
• Aug. 1986 - 1987: the "Wily Hacker" penetrates several tens of sensitive computing facilities
• 15 January 1990: 9-hour outage of the long-distance phone network in the USA
• February 1991: Scud missed by a Patriot (Iraq, Gulf War)
• Nov. 1992: crash of the communication system of the London Ambulance Service
• 26-27 June 1993: authorization denials of credit card operations, France
• 4 June 1996: failure of the Ariane 5 first flight
• Feb. 2000: distributed denial of service on large web sites: Yahoo, eBay, …
• Aug. 2003: electricity blackout in the USA and Canada
• Oct. 2006: 83,000 email addresses, credit card info, banking transaction files stolen in the UK
• Aug. 2008: air traffic control computing system failure (USA)
• Sep. 2009: third service interruption of Gmail, Google's email service, during 2009

Accidental Faults
Number of failures [consequences and outage durations depend upon the application]:

Fault class          Dedicated computing systems (1)    Controlled systems (2)
                     rank    proportion                 rank    proportion
Physical internal    3       ~10%                       2       15-20%
Physical external    3       ~10%                       2       15-20%
Human interactions   2       ~20%                       1       40-50%
Development          1       ~60%                       2       15-20%

(1) e.g., transaction processing, electronic switching, Internet backend servers
(2) e.g., civil airplanes, phone network, Internet frontend servers

Global Information Security Survey 2004 — Ernst & Young
Loss of availability, top ten incidents: percentage of respondents that indicated the following incidents resulted in an unexpected or unscheduled outage of their critical business systems:
1. Hardware failures
2. Major virus, Trojan horse, or Internet worms
3. Telecommunications failure
4. Software failure
5. Third-party failure, e.g., service provider
6. System capacity issues
7. Operational errors, e.g., wrong software loaded
8. Infrastructure failure, e.g., fire, blackout
9. Former or current employee misconduct
10. Distributed Denial of Service (DDoS) attacks
Non-malicious: 76%; malicious: 24%
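As a side note, the availability / outage-duration correspondence in the table above follows directly from the fraction of a year a system is unavailable. The short Python sketch below is illustrative only; it assumes a 365-day year, which matches the slide's figures up to rounding.

SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_outage(availability: float) -> str:
    """Expected downtime per year for a given steady-state availability, as d/h/min/s."""
    downtime = (1.0 - availability) * SECONDS_PER_YEAR
    days, rest = divmod(downtime, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    parts = []
    if days:
        parts.append(f"{int(days)}d")
    if hours:
        parts.append(f"{int(hours)}h")
    if minutes:
        parts.append(f"{int(minutes)}min")
    parts.append(f"{round(seconds)}s")
    return " ".join(parts)

# Reproduces the opening table: e.g. 0.99999 -> 5min 15s, 0.999999 -> 32s.
for a in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
    print(f"{a:<8} -> {annual_outage(a)}")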
Why Software Dependability Assessment?
User / customer:
• Confidence in the product
• Acceptable failure rate
Developer / supplier:
• During production: reduce the number of faults (zero defect), optimize development, increase operational dependability
• During operation: maintenance planning
• Long term: improve the software dependability of the next generations

Approaches to Software Dependability Assessment
• Assessment based on software characteristics: language, complexity metrics, application domain, …
• Assessment based on measurements: observation of the software behavior
• Assessment based on controlled experiments: ad hoc vs. standardized benchmarking
• Assessment of the production process: maturity models

Outline of the Presentation
The presentation addresses these four approaches in turn:
• Assessment based on software characteristics
• Assessment based on measurements
• Assessment based on controlled experiments
• Assessment of the production process

Software Dependability Assessment — Difficulties
• Corrections are a non-repetitive process
• No one-to-one relationship between failures and corrections
• Continuous evolution of the usage profile, both across development phases and within a given phase
• Overselling of early reliability "growth" models
• Judgement passed on the quality of the software developers
• What is software dependability, and which measures capture it: number of faults, fault density, complexity? MTTF, failure intensity, failure rate?

Dependability Measures?
• Static measures: complexity metrics, number of faults, fault density, …
• Dynamic measures, characterizing the occurrence of failures and corrections under a given usage profile and environment: failure intensity, failure rate, MTTF, restart time, recovery time, availability, …

Number of Faults vs MTTF
Percentage of faults and corresponding MTTF (published by IBM):

MTTF (years) \ Product         1     2     3     4     5     6     7     8     9
5000                         34.2  34.3  33.7  34.2  34.2  32.0  34.0  31.9  31.2
1580                         28.8  28.0  28.5  28.5  28.5  28.2  28.5  27.1  27.6
500                          17.8  18.2  18.0  18.7  18.4  20.1  18.5  18.4  20.4
158                          10.3   9.7   8.7  11.9   9.4  11.5   9.9  11.1  12.8
50                            5.0   4.5   6.5   4.4   4.4   5.0   4.5   6.5   5.6
15.8                          2.1   3.2   2.8   2.0   2.9   2.1   2.7   2.7   1.9
5    (1.58 y ≤ MTTF ≤ 5 y)    1.2   1.5   1.4   0.3   1.4   0.8   1.4   1.4   0.5
1.58 (MTTF ≤ 1.58 y)          0.7   0.7   0.4   0.1   0.7   0.3   0.6   1.1   0.0

Assessment Based on Measurements
• Data collection: times to failures / number of failures, failure impact, failure origin, corrections
• Data processing: descriptive statistics, correlations, failure modes; trend analysis (non-stationary processes); modelling / prediction (stochastic models, model validation)
• Outputs: trend evolution, MTTF / failure rate, MTTR, availability

Why Trend Analysis?
[Figure: failure intensity versus time; each correction yields a new software version V_i,1, V_i,2, …, V_i,k; changes (usage profile, environment, specifications, …) start a new sequence V_i+1,1, V_i+1,2, …]
(A small trend-test sketch follows the switching-system plots below.)

Example: Electronic Switching System
[Figure: failure intensity per month and number of systems in service over 32 months, spanning the validation phase and then operation.]

Electronic Switching System (Cont.)
[Figure: observed cumulative number of failures, reaching about 220 over the 32 months of validation and operation.]
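Before a model is applied, trend analysis tells whether the failure data exhibit reliability growth. Below is a minimal sketch of one common trend test, the Laplace factor on grouped failure counts; the u(i) statistic plotted in the operating-system example later is commonly this factor, though that is an assumption here, and the monthly counts used below are made up rather than the switching-system data.

import math

def laplace_factor(counts):
    """Laplace trend factor for failures grouped per time unit.

    counts[i] = number of failures observed in period i+1.
    u < 0 suggests decreasing failure intensity (reliability growth),
    u > 0 an increasing intensity, u near 0 stable behaviour.
    """
    k = len(counts)
    n = sum(counts)
    if k < 2 or n == 0:
        raise ValueError("need at least two periods and at least one failure")
    weighted = sum(i * c for i, c in enumerate(counts))  # equals sum of (period-1)*n(period)
    return (weighted / n - (k - 1) / 2) / math.sqrt((k * k - 1) / (12 * n))

# Illustrative (made-up) monthly failure counts, not the switching-system data:
counts = [12, 10, 11, 8, 7, 5, 6, 4, 3, 3, 2, 2]
print(round(laplace_factor(counts), 2))  # clearly negative -> reliability growth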
Electronic Switching System (Cont.)
Hyperexponential model applied to the cumulative number of failures → maintenance planning.
[Figure: observed cumulative number of failures over the 32 months (validation then operation) with the hyperexponential model fit, showing retrodictive assessment and predictive assessment.]
Observed number of failures over months 20-32 = 33; predicted number of failures over months 21-32 = 37.

Electronic Switching System (Cont.)
Failure intensity and failure rate in operation (for an average system).
[Figure: observed failure intensity versus the hyperexponential model estimate over months 17-32.]
Residual failure rate: 5.7 × 10⁻⁵ /h.

Electronic Switching System (Cont.)
Failure intensity of the software components.
[Figures: observed failure intensity and hyperexponential model fit over months 17-32 for the Telephony, Defense, Interface and Management components.]

Electronic Switching System (Cont.)
Component residual failure rates estimated by the hyperexponential (HE) model:
• Telephony: 1.2 × 10⁻⁶ /h
• Defense: 1.4 × 10⁻⁵ /h
• Interface: 2.9 × 10⁻⁵ /h
• Management: 8.5 × 10⁻⁶ /h
• Sum: 5.3 × 10⁻⁵ /h
[Figure: observed system failure intensity, system-level hyperexponential estimate, and sum of the component failure intensities estimated by HE, over months 17-32.]
System residual failure rate: 5.7 × 10⁻⁵ /h.
(A sketch of the hyperexponential failure-rate expression is given after the benchmarking slide below.)

Other Example: Operating System in Operation
Data = times to failure during operation.
[Plots: trend factor u(i) versus failure number, oscillating between about -2 and +2; mean time to failure versus failure number, roughly constant.]
Trend evolution = stable dependability.

Validity of Results
• Early validation: trend analysis
• End of validation: trend analysis + assessment → development follow-up
• Operation: trend analysis + assessment → high relevance (operational profile); enough data?
Examples:
• E10-B (Alcatel ESS): limits 10⁻³/h to 10⁻⁴/h; 1400 systems over 3 years: 5 × 10⁻⁶/h, c = 10⁻⁷/h
• Nuclear I&C systems: 8000 systems over 4 years: 3 × 10⁻⁷/h to 10⁻⁷/h, c = 4 × 10⁻⁸/h

Research Gaps
• Applicability to new classes of systems: service-oriented systems; adaptive and dynamic software systems → on-line assessment
• Industry involvement: confidentiality of real-life data; cost (perceptible overhead, invisible immediate benefits)
• The case of Off-The-Shelf software components?
• Applicability to safety-critical systems: during development, accumulation of experience → software process improvement → assessment of the software process

Off-The-Shelf Software Components — Dependability Benchmarking
• No information available from software development → evaluation based on controlled experimentation, either ad hoc or standard → dependability benchmarking
• Dependability benchmarking: evaluation of dependability measures / features in a non-ambiguous way → comparison
• Properties: reproducibility, repeatability, portability, representativeness, acceptable cost
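Returning to the switching-system example above: the hyperexponential reliability-growth model used there has a failure rate that decreases from an initial value towards an asymptotic residual rate. The sketch below computes that rate under the standard hyperexponential hazard form; whether this exact parameterization matches the one fitted in the study is an assumption, and the parameter values are illustrative, apart from setting the residual rate to the 5.7 × 10⁻⁵ /h reported on the slides.

import math

def hyperexp_failure_rate(t, omega, zeta_sup, zeta_inf):
    """Failure rate at time t for a hyperexponential reliability-growth model.

    The rate decreases from omega*zeta_sup + (1-omega)*zeta_inf at t = 0
    towards zeta_inf (the residual failure rate) as t grows.
    """
    num = (omega * zeta_sup * math.exp(-zeta_sup * t)
           + (1 - omega) * zeta_inf * math.exp(-zeta_inf * t))
    den = omega * math.exp(-zeta_sup * t) + (1 - omega) * math.exp(-zeta_inf * t)
    return num / den

# Illustrative parameters (not the fitted values of the study);
# zeta_inf is the residual failure rate reported on the slides.
omega, zeta_sup, zeta_inf = 0.2, 2.0e-3, 5.7e-5
for hours in (0, 1_000, 5_000, 20_000, 100_000):
    print(hours, f"{hyperexp_failure_rate(hours, omega, zeta_sup, zeta_inf):.2e}")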
Benchmarks of Operating Systems
Which OS for my computer system: Linux, Mac, Windows?
• Limited knowledge: functional description
• Limited accessibility and observability
→ black-box approach → robustness benchmark

Robustness Benchmarks
[Figure: layered view Application / API / Operating system / Device drivers / Hardware; faults are injected at the API and outcomes are observed on the operating system.]
Faults = corrupted system calls.
(A toy sketch of such an experiment is given at the end of this document.)

OS Response Time
[Bar charts: OS response time, without corruption and in the presence of corrupted system calls, for Windows NT4, 2000, XP, NT4 Server, 2000 Server and 2003 Server, and for Linux 2.2.26, 2.4.5, 2.4.26 and 2.6.6.]

Mean Restart Time
[Bar charts: mean restart time in seconds (scale 0-120), without corruption and in the presence of corrupted system calls, for the same Windows and Linux versions.]

Detailed Restart Time
[Scatter plots: restart time in seconds per experiment for Windows XP and Linux 2.2.26; the highest Windows XP restart times correspond to check-disk runs.]

More on the Windows Family
Impact of the application state after failure.
[Plot: restart time in seconds per experiment for Windows NT4, 2000 and XP.]

Benchmark Characteristics and Limitations
• A benchmark should not replace software test and validation
• Non-intrusiveness → robustness benchmarks (faults injected outside the benchmark target)
• Make use of available inputs and outputs → impact on the measures
• Balance between cost and degree of confidence
• Number of dependability benchmark measures >> number of performance benchmark measures

Maturity of Dependability Benchmarks
Dependability benchmarks vs. performance benchmarks:
• Infancy vs. mature domain
• Isolated work vs. cooperative work
• Not explicitly addressed vs. integrated into system development
• Acceptability? vs. accepted by all actors for competitive system comparison
• "Ad hoc" benchmarks vs. "competition" benchmarks
• Maturity??

Software Process Improvement (SPI)
• Data collection: measurements, controlled experiments, data related to similar projects
• Data processing: measures, driven by the objectives of the analysis
• Feedback to the software development process; capitalize experience

Examples of Benefits from SPI Programs
• AT&T (quality program): customer-reported problems (maintenance program) divided by 10; system test interval divided by 2; new product introduction interval divided by 3
• Fujitsu (concurrent development process): release cycle reduction = 75%
• Motorola (Arlington Heights), mix of methods: fault density reduction = 50% within 3.5 years
• Raytheon (Electronic Systems), CMM: rework cost divided by 2 after two years of experience; productivity increase = 190%; product quality multiplied by 4

Cost
[Figure: cost versus dependability. Dashed curve: total cost without process improvement approaches; solid curve: total cost with process improvement approaches; also shown: cost of dependability, basic development cost, cost of scrap / rework.]
Process improvement → dependability improvement & cost reduction!

Software Dependability Assessment: A Reality or A Dream?
Karama Kanoun
The Sixth IEEE International Conference on Software Security and Reliability (SERE), Washington D.C., USA, 20-22 June 2012
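As a closing illustration of the robustness-benchmark idea sketched earlier (corrupting system-call parameters and observing the OS reaction), here is a toy Python sketch. The call list, corruption values and outcome classes are assumptions made for illustration; the actual benchmarks intercepted and corrupted calls issued by a realistic workload from outside the benchmark target, which this sketch does not do.

import os
import time

# Illustrative corrupted-parameter experiments: each entry invokes a system call
# with an out-of-range or invalid argument (an assumed fault load, not the real one).
EXPERIMENTS = [
    ("read_bad_fd",      lambda: os.read(-1, 16)),      # invalid file descriptor
    ("close_bad_fd",     lambda: os.close(999_999)),    # descriptor never opened
    ("lseek_bad_whence", lambda: os.lseek(0, 0, 42)),   # invalid 'whence' value
]

def run(experiments):
    """Issue each corrupted call, classify the OS reaction, and time the response."""
    results = []
    for name, call in experiments:
        start = time.perf_counter()
        try:
            call()
            outcome = "no error reported"                 # call silently accepted
        except OSError as exc:
            outcome = f"error code returned ({exc.errno})"
        except Exception as exc:                          # unexpected reaction
            outcome = f"other exception ({type(exc).__name__})"
        elapsed_us = (time.perf_counter() - start) * 1e6
        results.append((name, outcome, round(elapsed_us, 1)))
    return results

for name, outcome, micros in run(EXPERIMENTS):
    print(f"{name:18} {outcome:32} {micros} us")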