IFIP Siena Failures associated with HCI But its not the users fault! Brendan Murphy MSR Cambridge IFIP Siena Talk contents How do other technologies address this issue. Computer behaviour influenced by humans or technology? Why HCI is so difficult. Summary. IFIP Siena Complex Systems Railways Liverpool & Manchester Railway opened 15th Sep 1830 (Stevenson’s Rocket) Rapid development, no controls, many deaths. E.g. Brunel and Babbage trains narrowly escaped crashing into each other Railway Inspectors set up 1840’s IFIP Siena Technology Challenges Train movements based on Time Leaves Bristol 12 noon Single Track London-Bristol 2 hours Leaves London 2:10pm What’s the problem? Solution: Change Time November 1840 Great Western moved to GMT Sep 1847 all Railways recommended to use GMT UK set clocks to GMT 1880 US Standardized time in November 1883 IFIP Siena Improvements in railway safety Improving the quality of components that make up the system trains, carriages, signalling, tracks etc. Focus on human element Training for all personnel. Training certification Railway inspectors review total system after a crash Analysing the interaction of components Analysing whether external factors impact accident. Make mandatory recommendations. IFIP Siena Current status of trains Trains are now very safe Death Rates in UK >0.5 deaths per 1 billion miles. But are they inflexible and expensive to build? Movement from trains to cars Death Rates in UK 3,500 deaths per 1 billion miles. IFIP Siena Traditional Engineering approach to safety Use of models based on theory verified through historical validation. Standardization & Certification. Making the problem fit the solution. Very conservative, reluctant to accept new ideas/implementations. Product Behaviour VAX Behavioural Drivers Cause Of System Crashes Percentage Of Crashes Impact on Availability? 100 80 Others 60 System Management 40 Software Failure Hardware Failure 20 0 1985 1994 Changes Over Time IFIP Siena IFIP Siena Reliability vs. Maintenance Events Differentiation by Day? Distribution of System Crashes Distribution Of System Outages VAX/VMS 20% 18% 18% 16% 16% 12% System Outages System Crashes 10% 8% 6% 14% 12% 10% Win 2k 6% 4% 4% 2% 2% Win 2k SP1 VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq. ay Th ur sd ay Fr id ay Sa tu rd ay Su nd ay ed ne sd da es W Saturday Tu Thursday y Sunday Friday M Monday Tuesday y 0% Wednesday da 0% NT 4 8% on Percentage 14% Percentage of filtered events 20% IFIP Siena Reliability vs. Maintenance Events Differentiation by Day? Distribution System Outages DistributionOfOf System Outages Windows 2000 Open VMS VAX 9% 8% 8% 7% Percentage Percentage 7% 6% 6% 5% 5% System Outages System System Crashes Outages 4% 4% 3% 3% System Crashes 2% 2% 1% 1% 0% 0% 00 11 22 33 44 55 66 77 8 8 9 9 10 20 20 21 21 22 22 23 23 101111121213131414151516161717181819 19 Time Of- Friday Day Monday Monday - Friday VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq. IFIP Siena System Instability on single servers A system crash may induce subsequent crashes. Behaviour first characterized in the early 1980s. time Down Event Clusters Crash Distribution on VAX and systems NT systems Crash Distribution on VAX 25% 25% 20% 20% 15% 15% VAX VAX Crashes Crashes 10% 10% NT Bluescreens 5% 90 1100 55 1122 00 1133 55 1155 00 1166 55 1188 00 1199 55 75 60 45 30 15 0% 0 Distribution Distribution System Up Minutes between between crashes crashes Minutes IFIP Siena System Instability on distributed systems Configuration impacts dependability. Affected by technical and human factors. Multi-node event clusters System 1 Up Down Up System 2 Down System 3 Up Down IFIP Siena System Instability on distributed systems Configuration impacts dependability. Affected by technical and human factors. Multi-node event clusters Average Downtime of a VMS cluster in Data Centers Environments 3.5 3 Down Up System 2 Down 2.5 Hours System 1 Up 2 1.5 1 0.5 0 System 3 Up 1 2 3 4 Number of Servers in a cluster Down 5 6 IFIP Siena Techniques for improving Software Quality Formal Specifications. Use of strong type checking languages. Formal development processes (e.g. SEI 5). Use of software verification tools. User training. IFIP Siena Techniques for avoiding HCI errors Fully understand the users specifications Bound the product based on those specifications. Develop the UI based on User Modelling, Intelligent User Interfaces, HCI design and visualization, psychological studies. Verify acceptability through trials with target groups. IFIP Siena The problem: How to improve the reliability of software. Well understood user profile Hardware frozen Media Center. Standardized interface Multi Functional Evolving Hardware limited but regional variations. Changing user needs (home and business) Infinite hardware options New usages being developed Usage profile unknown Flexible and adaptable Rapid development IFIP Siena What Microsoft does well Standardized application interface File always top left, help last in list. Print always under file etc. Short cut keys common across applications. Common programmable interface. Do not force users to new UI (Classic view). Installation of OS and products for the home user. Still searching for the optimum solution for installing products across multiple systems (same as the rest of the industry). IFIP Siena Major causes of HCI induced failures Not following correct management procedures. Not correctly configuring their systems. Not patching their systems. Using non compliant hardware/software. Viruses due to non patched and wrongly configured systems. IFIP Siena Bug collection and patch distribution system an example of HCI problems. Develop process to collected Windows XP failures (i.e. to fix problems we create). Base process on Windows 2000 failure rates and characteristics. Initial results of deployment Failure rates a factor of 10 greater than forecast (HCI failures become prominent). Failure characteristics different to Windows 2000 (lack of complex software bugs). Most failures not due to Microsoft code. But still viewed as Windows problems. IFIP Siena Bug collection and patch distribution Work with peripheral device manufacturers to resolve errors. Correct all Microsoft bugs and release patches. Develop Windows Update process as a user invoked service for privacy reasons. Problems Peripheral device manufacturers assume users know they always need the latest driver. Users blames Microsoft not the device manufacturers. Users had better things to do than invoke Windows Update. Not all patches had to go to all users. The frequency of patch production overwhelmed business users. Patches were too large for users with 56k modems. IFIP Siena Then security issues came along. Most security issues identified by security researchers or firms. Provides free publicity to the researcher or firm. Pressure from them to publish. Obvious way to resolve these issues. Come up with patch for failure and other risk areas. Test and release patch through Update ASAP. Result Hackers reverse engineer patch. Virus only start appearing after patch is released. Users not updating systems. IFIP Siena Solution for Windows XP Windows update is a push rather than pull process. Patches are as small as possible Patches release in monthly batches. Not all patches are released Distribute uncommon patches via crash reporting process. Develop a process for business for patch distribution. Windows XP SP2 will automatically configure the system for security. May break some applications (e.g. Kazaa). Architectural changes in Longhorn to assist in patch distribution. IFIP Siena Summary. Perfect Reliability of complex systems is probably impossible. Normal accidents, Charles Perrow The computer is a component of the complex systems. Becoming smaller all the time. The system is the interaction of humans and technology. Wizards and usability have to replace the need for training. Computer software rarely has a single objective. Can wizards and usability address near infinite flexibility. Reliability is one of many system attributes. The problem should define the relative importance of reliability. Higher reliability inevitably decreases product flexibility. Is research focusing on the problem?