CS 5150 Software Engineering

Lecture 21
Reliability 3

Administration

Final presentations
Sign up for your presentations now.

Failures and Faults

Failure: Software does not deliver the service expected by the user (e.g., mistake in requirements, confusing user interface)

Fault (BUG): Programming or design error whereby the delivered system does not conform to specification (e.g., coding error, interface error)

Terminology

Fault avoidance
Build systems with the objective of creating fault-free (bug-free) software

Fault tolerance
Build systems that continue to operate when faults (bugs) occur

Fault detection (testing and validation)
Detect faults (bugs) before the system is put into operation.

Faults and Failures

Actual examples

(a) An application crashes with an emulator, even though the emulator is bug free. (Compensating bug problem.)
(b) After an entire network is hit by lightning, the restart crashes because of overload. (Problem of incremental growth.)
(c) The head of an organization is not paid his salary because it is greater than the maximum allowed by the program. (Requirements problem.)
(d) An operating system fails because of a page-boundary error in the firmware. (Different operating system problem.)

Defensive Programming

Murphy's Law: If anything can go wrong, it will.

Defensive Programming:
• Redundant code is incorporated to check system state after modifications.
• Implicit assumptions are tested explicitly.
• Risky programming constructs are avoided.

Fault Tolerance

Aim: A system that continues to operate when problems occur.

Examples:
• Invalid input data (e.g., in a data processing application)
• Overload (e.g., in a networked system)
• Hardware failure (e.g., in a control system)

General Approach:
• Failure detection
• Damage assessment
• Fault recovery
• Fault repair

Fault Tolerance

Backward Recovery:
• Record system state at specific events (checkpoints). After failure, recreate state at last checkpoint.
• Combine checkpoints with system log (audit trail of transactions) that allows transactions from last checkpoint to be repeated automatically.
• Test the restore software!

Fixing Bugs

Isolate the bug
    Intermittent --> repeatable
    Complex example --> simple example

Understand the bug
    Root cause
    Dependencies
    Structural interactions

Fix the bug
    Design changes
    Documentation changes
    Code changes

Moving the Bugs Around

Fixing bugs is an error-prone process!
• When you fix a bug, fix its environment
• Bug fixes need static and dynamic testing
• Repeat all tests that have the slightest relevance (regression testing)

Bugs have a habit of returning!
• When a bug is fixed, add the failure case to the test suite for future regression testing.

The Heisenbug

Some Notable Bugs

Even commercial systems may have serious bugs

1960s: Built-in function in Fortran compiler (e^0 = 0)
1970s: The microfilm plotter with the missing byte (1:1023)
1980s: Japanese microcode for Honeywell DPS virtual memory
1990s: The Sun page fault that IBM paid to fix
2000s: The preload system with the memory leak

Good people work around problems. The best people track them down and fix them!

Validation and Verification

Validation: Are we building the right product?
Verification: Are we building the product right?

In practice, it is sometimes difficult to distinguish between the two.

That's not a bug. That's a feature!

Reliability: Adapting Small Teams to Large Projects

Small teams and small projects have many advantages:
• Small group communication cuts need for intermediate documentation, yet reduces misunderstanding.
• Small projects are easier to test and make reliable.
• Small projects have shorter development cycles, so that mistakes in requirements are less likely and less expensive to fix.
• When one project is completed it is easier to plan for the next.
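
Worked sketch: Backward Recovery

Returning to the Backward Recovery slide, the checkpoint-plus-log idea might be sketched as follows. This is a minimal illustration, not a production design. The class name Ledger and its methods are hypothetical, and the checkpoint and log are held in memory here, whereas a real system would keep both in durable storage so that they survive the failure.

```python
import copy


class Ledger:
    """Toy system whose state is a dict of account balances (hypothetical)."""

    def __init__(self):
        self.balances = {}      # live system state
        self._checkpoint = {}   # state recorded at the last checkpoint
        self._log = []          # transactions applied since that checkpoint

    def apply(self, account, amount):
        """Log the transaction first, then apply it to the live state."""
        self._log.append((account, amount))
        self.balances[account] = self.balances.get(account, 0) + amount

    def checkpoint(self):
        """Record the current state and start a fresh log."""
        self._checkpoint = copy.deepcopy(self.balances)
        self._log = []

    def recover(self):
        """Recreate the state at the last checkpoint, then replay the log."""
        self.balances = copy.deepcopy(self._checkpoint)
        for account, amount in self._log:
            self.balances[account] = self.balances.get(account, 0) + amount


ledger = Ledger()
ledger.apply("alice", 100)
ledger.checkpoint()
ledger.apply("alice", -30)
ledger.apply("bob", 50)
ledger.balances = {"corrupt": -1}   # simulate a failure that damages live state
ledger.recover()
print(ledger.balances)              # {'alice': 70, 'bob': 50}
```

The last lines act as a small restore test, in the spirit of "Test the restore software!": the live state is deliberately damaged and recovery must reproduce the balances from the checkpoint and the replayed log.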

Reliability: Adapting Small Teams to Large Projects

Many modern software methodologies aim to apply the advantages of small teams to large projects. Often called Rapid Application Development or Agile Software Development.

Works well with interactive systems, such as web systems, where the overall structure is well established and there is a prototype or operational system.

Developing Large Systems: Incremental Development

Concept
• Divide a large project into units of work, typically the work that can be done by a team of 5-10 people in four weeks.
• The team carries out the complete development cycle up to having a releasable product.
• If the work cannot be completed in the allowed time, reduce the scope, not the quality.
• Because the team is small, they can rely on face-to-face communication and need little intermediate documentation.

Often combined with pair design and pair programming, and with incremental testing.

Reliability: Incremental Development

Challenges:
• Requires strong overall leadership to ensure that the individual units fit within the overall system goals and architecture.
• Requires systematic integration testing.

An Old Question: Safety Critical Software

A software system fails and several lives are lost. An inquiry discovers that the test plan did not consider the case that caused the failure.

Who is responsible?
(a) The testers for not noticing the missing cases?
(b) The test planners for not writing the complete test plan?
(c) The managers for not having checked the test plan?
(d) The client for not having done a thorough acceptance test?

Software Developers and Testers: Responsibilities

• Carrying out assigned tasks thoroughly and in a professional manner
• Being committed to the entire project -- not just tasks that have been assigned
• Resisting pressures to cut corners on vital tasks
• Alerting colleagues and management to potential problems early

Computing Management Responsibility

• Organization culture that expects quality
• Appointment of suitably qualified people to vital tasks (e.g., testing safety-critical software)
• Establishing and overseeing the software development process
• Providing time and incentives that encourage quality work
• Working closely with the client

Accepting responsibility for work of team

Client Responsibility

• Organization culture that expects quality
• Appointment of suitably qualified people to vital tasks (e.g., technical team that will build a critical system)
• Reviewing requirements and design carefully
• Establishing and overseeing the acceptance process
• Providing time and incentives that encourage quality work
• Working closely with the software team

Accepting responsibility for the resulting product