University of Southern California
Center for Systems and Software Engineering

Software Classic Disasters
CS 577b Software Engineering II
Supannika Koolmanojwong
April 4, 2011
© 2011 USC-CSSE

Outline
• IT Project Management: Infamous Failures, Classic Mistakes, and Best Practices
• Recovering IT in a Disaster: Lessons from Hurricane Katrina
• Top 10 Worst Practices

IT Project Management: Infamous Failures, Classic Mistakes, and Best Practices
R. Ryan Nelson, MIS Quarterly Executive Vol. 6 No. 2 / June 2007
• Retrospectives: project postmortems or postimplementation reviews
• 99 retrospectives conducted in 74 organizations over the past 7 years
• “Insanity: doing the same thing over and over again and expecting different results.” (Albert Einstein)

10 of the most infamous IT project failures
• Large magnitude
• Over $100 million
• One-half come from the public sector
  – wasted taxpayer dollars
  – lost services
• The other half come from the private sector
  – billions of dollars in added costs
  – lost revenues
  – lost jobs

1. Internal Revenue Service (IRS), 1999
• PROJECT: Business Systems Modernization; launched in 1999 to upgrade the agency’s IT infrastructure and more than 100 business applications
• $8 billion modernization project with a team of vendors
• A complex project that overwhelmed the management capabilities of both vendor and client
• The most expensive systems development “fiasco” in history, with delays costing the U.S. Treasury tens of billions of dollars per year
• The agency’s ability to collect revenue, conduct audits, and go after tax evaders was severely compromised

2. Federal Aviation Administration, 1996
• PROJECT: Advanced Automation System (AAS); the FAA’s effort to modernize the nation’s air traffic control system
• Estimated to cost $2.5 billion ($1.5 billion was wasted)
• Numerous delays and cost overruns, blamed on both the FAA and the primary contractor, IBM
• Technical complexity of the effort, bad resource estimation, ineffective requirements control
• “For example, they wanted the system to have only 3 seconds of downtime a year. But to get the data to prove that requirement had been met would have taken about 10 years” (later changed to 5 minutes of downtime)
• Instead of admitting the problem, IBM turned AAS into a research project
• The project collapsed

3. Federal Bureau of Investigation, 2004
• PROJECT: “Trilogy;” four-year, $500M overhaul of the FBI’s antiquated computer system
• Ill-defined requirements that changed dramatically after 9/11 (agency mission switched from criminal to intelligence focus)
• $170 million project was abandoned altogether
• The bureau found 400 problems with early versions of the troubled software but never told the contractor
• The bureau went ahead with a $17 million testing program even though the software would have to be scrapped

4. McDonald’s, 2001
• PROJECT: “Innovate;” digital network for creating a real-time enterprise
• Planned to spend $1 billion over five years
• Objective: to better serve customers by using information and communications technologies to monitor the quality of products and services
• Executives at company headquarters would have been able to see how soda dispensers and frying machines in every store were performing at any moment
• Would need $1 billion for infrastructure, and $zillions to maintain and upgrade
• After two years and $170M, the fast food giant threw in the towel

5. Denver International Airport, 1994
• PROJECT: Baggage-handling system
• It took 10 years and at least $600 million to figure out that big muscles, not computers, can best move baggage
• The baggage system, designed and built by BAE Automated Systems Inc., launched, chewed up, and spit out bags so often that it became known as the “baggage system from hell”

6. AMR Corp., Budget Rent A Car Corp., Hilton Hotels Corp., Marriott International Inc., 1992
• PROJECT: “Confirm;” reservation system for hotel and rental car bookings
• After four years and $125 million in development, it became clear that Confirm would miss its deadline by as much as two years
• Was supposed to be a leading-edge, comprehensive travel industry reservation program combining airline, rental car, and hotel information
• Major problems surfaced when Hilton tested the system; an 18-month delay followed, and the problems could not be resolved

7. Bank of America, 1988
• PROJECT: “MasterNet;” trust accounting system
• Hardware problems caused Bank of America (BofA) to lose control of several billion dollars of trust accounts
• All the money was eventually found in the system, but all 255 people in the entire Trust Department were fired, as all the depositors withdrew their money
• A classic case study on the need for risk assessment, including people, process, and technology-related risk
• BofA spent $60M to fix the $20M project before deciding to abandon it altogether; BofA fell from being the largest bank in the world to No. 29
• CRACK stakeholder problems, bad modular design, focusing on competing with competitors, but ready for transition

8. Kmart, 2000
• PROJECT: IT systems modernization
• $1.4 billion IT modernization effort
• Aimed at linking its sales, marketing, supply, and logistics systems
• 18 months later, cash-strapped Kmart cut back on modernization, writing off the $130 million it had already invested in IT
• Four months later, it declared bankruptcy
• Mistakes ranged from failing to allocate enough money and manpower to not clearly establishing the IT project’s relationship to the organization’s business

9. London Stock Exchange, 1993
• PROJECT: “Taurus;” paperless share settlement system
• Cost £800 million against an original budget of £6 million
• Abandoned after 10 years of development
• Software by Vista Concepts (US) was used for database management; although very good for online real-time processing, it could not handle distributed data processing or batch processing
• LSE tried to modify Vista by rewriting almost 60% of it, which led to hidden bugs and long delays
• Grew from a settlement-only system into a full “share registration and transfer system”

10. Nike, 2000
• PROJECT: Integrated enterprise software
• $400 million spent installing ERP, CRM, and SCM: the full complement of analyst-blessed integrated enterprise software
• Caused a major inventory glitch; overproduced some shoe models and underproduced others
• Profits dropped by $100 million

Classic Mistakes
• Behind schedule? Add more people!
• Want to speed up development? Cut testing!
• A new version of the OS becomes available during the project? Time for an upgrade!
• Key contributor aggravating the rest of the team? Wait until the end of the project to fire him!

Classic Mistakes: People
• Undermined motivation – affects productivity and quality
• Weak individual capabilities of team members or weak working relationships among them
• Failure to take action to deal with a problem employee
• Adding people to a late project – pouring gasoline on a fire

Classic Mistakes: Process
• BDUF – Big Design Up Front
• Underestimating a project: overly optimistic schedules, under-scoping, undermining effective planning, and shortchanging requirements determination and/or quality assurance
  – Poor estimation also puts excessive pressure on team members, leading to lower morale and productivity
• Insufficient risk management
• Contractor failure – outsourcing and offshoring

Classic Mistakes: Product
• Requirements gold-plating
  – e.g., the FAA’s modernization effort, where the goal was 99.99999% reliability, referred to as “the seven nines”
• Feature creep – the average project experiences about a 25% change in requirements over its lifetime
• Developer gold-plating – using new technology that is not required in the product
• Research-oriented development
• Silver-bullet syndrome
• Overestimated savings from new tools or methods
• Switching tools in the middle of a project

A Meta-Retrospective of 99 IT Projects
• Mistakes fell into four categories: process (45%), people (43%), product (8%), and technology (4%)
  – Project managers should be experts in managing processes and people
• Scope creep didn’t make the top ten mistakes
  – As long as the project manager pays attention to it
• Contractor failure has been climbing in frequency in recent years
• If the project managers had focused their attention on better estimation and scheduling, stakeholder management, and risk management, they could have significantly improved the success of the majority of the projects studied

Avoid classic mistakes through best practices

Avoiding Poor Estimating and/or Scheduling
  – Cost overrun: 180% in 1994, 43% in 2003
  – Schedule overrun: 63% in 2000, 82% in 2007
  – Cone of uncertainty
    • Bounds are obtained by multiplying the “most likely” single-point estimate by the optimistic factor (lower bound) or the pessimistic factor (upper bound)
    • Lower bound: optimistic estimate
    • Upper bound: pessimistic estimate
  – Capital One: cushion added to estimates by phase
    • 100% cushion at the beginning of the feasibility phase
    • 75% cushion in the definition phase
    • 50% cushion in design
    • 25% cushion at the beginning of construction

Avoiding Poor Estimating and/or Scheduling
• Valuable approaches to improving project estimation and scheduling
  – Timebox development
    • shorter, smaller projects are easier to estimate
  – Creating a work breakdown structure
    • to help size and scope projects
  – Retrospectives
    • to capture actual size, effort, and time data for use in making future project estimates
  – A project management office to maintain a repository of project data over time

Avoiding Ineffective Stakeholder Management
• Ineffective stakeholder management is the second biggest cause of project failure
• Have to know
  – who has influence over others
  – who has direct control of resources
  – stakeholder level of interest
  – stakeholder degree of support/resistance

Avoiding Insufficient Risk Management
• Risk identification, analysis, prioritization, risk-management planning, resolution, and monitoring
• Methods/tools
  – a prioritized risk assessment table
  – a top-10 risks list
  – interim retrospectives
  – appointing a risk officer

Avoiding Insufficient Planning
• Ensure the following
  – Clear roles and responsibilities
  – Resource allocation
  – Schedule / timeline
  – Follow project policies, plans, and procedures

Avoiding Shortchanging Quality Assurance
• When a project falls behind schedule, the first two areas that often get cut are testing and training
• Teams cut corners by eliminating test planning, eliminating design and code reviews, and performing only minimal testing
• Suggestions:
  – agile development, joint application design sessions, automated testing tools, and daily build-and-smoke tests

Avoiding Weak Personnel and/or Team Issues
• Get the right people assigned to the project from the beginning
• Between 1999 and 2006, the retrospectives reported an increasing number of problems with distributed, inter-organizational, and multi-national teams
  – reduction in face-to-face team meetings, time-zone barriers, and language and cultural issues

Avoiding Insufficient Project Sponsorship
• Not only getting top management support, but identifying the right sponsor
• From the beginning!
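
As a rough illustration of the prioritized risk assessment table and top-10 risks list named under risk management above, the sketch below ranks risks by risk exposure, computed as probability of loss times size of loss. The `Risk` class, the `top_risks` helper, and every risk name, probability, and loss figure are made-up placeholders for illustration, not data from the retrospectives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float  # chance the risk materializes, 0.0 to 1.0 (assumed scale)
    loss: float         # impact if it does, in any consistent unit (assumed scale)

    @property
    def exposure(self) -> float:
        # Risk exposure = probability of loss x size of loss
        return self.probability * self.loss


def top_risks(risks: list[Risk], n: int = 10) -> list[Risk]:
    """Return the n risks with the highest exposure, highest first."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)[:n]


# Hypothetical entries for illustration only
RISKS = [
    Risk("Overly optimistic schedule", 0.7, 8),
    Risk("Contractor failure", 0.3, 10),
    Risk("Key stakeholder disengaged", 0.5, 5),
    Risk("Testing cut under pressure", 0.4, 9),
]

if __name__ == "__main__":
    for rank, risk in enumerate(top_risks(RISKS), start=1):
        print(f"{rank}. {risk.name}: exposure {risk.exposure:.1f}")
```

Re-running the ranking at each interim retrospective keeps the list current; a risk officer would own the probability and loss estimates.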

Recovering IT in a Disaster: Lessons from Hurricane Katrina
Iris Junglas, Blake Ives, MIS Quarterly Executive Vol. 6 No. 1 / Mar 2007
• August 29, 2005: Hurricane Katrina destroyed a data center and communications infrastructure at the Pascagoula and Gulfport, Mississippi, operations of the Ship Systems sector of Northrop Grumman Corporation
• Also put a second data center out of commission in a shipyard near New Orleans
• 20,000 employees in Ship Construction
• Caused over US$1 billion in damage for the company
• Brought two of the nation’s largest shipyards to a standstill

Recovering IT in a Disaster
• How to adapt when the business continuity plan falls short and public infrastructure is inadequate
• Reexamine our processes for preparing disaster plans
• Processes for assessing preparedness and response after a disaster or a near-disaster

Northrop Grumman Corporation
• Products: electronics, aerospace, and shipbuilding
• Customers: government and commercial customers worldwide
• Major business:
  – Ship construction: large military vessels
  – Revenue: US$5.7 billion in 2005
  – Customers: DoD and Navy
  – 12,900 employees in Mississippi
  – 7,100 employees in New Orleans

Preparation for Hurricane
• Hurricanes are nothing new to the shipbuilding industry
  – September 04: Hurricane Ivan
  – July 05: Hurricane Dennis
• A bigger one was heading in
  – August 05
    • 11 people dead, over US$1 billion in damage in Florida

Preparation for Hurricane
• Data
  – Data backups were sent to Iron Mountain (information management services)
  – Second backup in Dallas
• Servers
  – powered off
  – wrapped in plastic
• New backup generator
  – in a secure location
• Only one extranet kept alive (crucial to the Navy and DoD)
• People
  – left the area

The storm smashed
• NGC facilities were in the storm’s path
• Communication failed
• Extensive damage to the shipyards and nearby communities
• Emergency command center
  – at the Dallas office
  – a new emergency team was formed

Damages
• Collected digital images of the damage
• In Mississippi, lost
  – 1,500 PCs, 200 servers, 300 printers, 600 data input devices, and hundreds of two-way radios
  – communications closets, routers, switches, fiber and copper cables and wires
  – LAN / WAN / MAN no longer worked
• In New Orleans
  – Infrastructure was still in place
  – AC systems were not working, so servers shut down automatically
• A week after the storm, communication lines went down again because cars were driving over them

First things first
• Not about restoring computer systems, but restoring human resources
• But most of the 20,000 employees were out of contact
• Tools
  – Press releases
  – Corporate web site (67,000 hits in the weeks after the storm)
  – Toll-free call-in number
• Payroll through Wal-Mart and Western Union

Restoring IT infrastructure
• Electronic communication was nonexistent because the public communication infrastructure was down
• BlackBerry communication could be used intermittently
• Two-way radios, walkie-talkies
• Key members used satellite phones
  – Required line-of-sight access to satellites
• Later on, wireless communication was used

Building new data center
• Hardware acquisition
• Incompatibilities between software and the new hardware environment
• Inaccessible or difficult-to-find system documentation, e.g. license keys, server names, addressing schemes, login IDs

Restoring data and applications
• Some firms found that their backup data was partially unreadable
• NGC had 2 backups: Iron Mountain and Dallas
• Lost some data on desktops or local machines
• Two weeks after Katrina: had a new data center; essential systems were up and running

Disaster preparedness
• Common mistake: organizations prepare only for disasters specific to their domain
  – financial institutions prepare for IT failures
  – hospitals for pandemics
  – airlines for technical failures and sabotage
• An alternative approach: consider a broader spectrum of disaster types, such as the generic disaster
  – economic, information, physical, human resource, reputation, psychopathic, and natural disasters
• Identify common characteristics of each disaster category, then construct the plan

IT disaster preparedness framework
• Provide generic objectives and measurements, and guidelines for establishing IT disaster preparedness
• Emphasize developing an IT continuity plan, identifying and allocating critical resources, executing a business impact analysis, and maintaining, testing, and training on the plan
• COBIT (Control Objectives for Information and Related Technology)
  – For operational IT and business managers
  – Focus on three core elements of IT governance: IT as an asset, IT-related risks, and IT control structures
• ITIL (IT Infrastructure Library)
  – Focus is to improve the efficiency and effectiveness of IT services delivered to customers within the enterprise
  – De facto standard for IT service management
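
The framework's emphasis on testing the continuity plan, together with the observation above that some firms found their backup data partially unreadable, suggests checking backups routinely before a disaster forces the question. The sketch below is one minimal way to do that, assuming a manifest of SHA-256 digests was recorded when the backup was taken. The manifest format and the function names `sha256_of` and `verify_backup` are hypothetical and are not described in the case study.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return names of files that are missing, unreadable, or altered.

    `manifest` maps a relative file name to the SHA-256 hex digest
    recorded at backup time (an assumed convention).
    """
    problems = []
    for name, expected in manifest.items():
        target = backup_dir / name
        try:
            if sha256_of(target) != expected:
                problems.append(name)
        except OSError:
            # A missing or unreadable file counts as a failed check
            problems.append(name)
    return problems
```

An empty result from `verify_backup` only shows that the stored bytes are intact; a full restore drill is still needed to confirm that applications run on replacement hardware.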

IT disaster preparedness framework
[Figures: COBIT (Control Objectives for Information and Related Technology) and ITIL (IT Infrastructure Library)]

Lessons Learned
1. Keep Data and Data Centers Out of Harm’s Way
2. Don’t Assume the Public Infrastructure Will Be Available
3. Plan for Civil Unrest
4. Assume Some People Will Not Be Available
5. Leverage Your Suppliers as Critical Team Members
6. Expect the Unexpected
7. Get Prepared – Crisis portfolio
8. Establish a Strong Leadership Position
9. Empower Decision Makers on the Team
10. Exploit Fresh-Start Opportunities

Worst Practices
Capers Jones, "Our Worst Current Development Practices," IEEE Software, vol. 13, no. 2, pp. 102-104, Mar. 1996
• A project counts as a failure if it
  – was terminated because of cost or schedule overrun
  – experienced schedule or cost overruns in excess of 50 percent of initial estimates
  – resulted in client lawsuits for contractual noncompliance

Worst Practice #1
No historical software-measurement data
• Lack of historical data blinds stakeholders to the realities of software development
• Need to check on schedule, cost, progress, performance

Worst Practice #2
Rejection of accurate estimates
• Lack of accurate estimates is the root cause of the rest of the worst practices, including:
  – inability to perform return-on-investment calculations
  – susceptibility to false claims by tool and method vendors
  – software contracts that are ambiguous and difficult to monitor

Worst Practices #3 & 4
Failure to use automated estimating tools and automated planning tools
• 50 commercial software-cost estimating tools
  – Checkpoint, COCOMO, Estimacs, Price-S, or Slim
• 100 project-planning tools on the market
  – Microsoft Project, Primavera, Project Manager’s Workbench, or Timeline
• Combining estimating and planning tools leads to accurate and realistic outcomes that are not easily overridden by clients or executives

Worst Practices
• 5 & 6 – Excessive, irrational schedule pressure and creep in users’ requirements
• 7 & 8 – Failure to monitor progress and to perform risk management
  – “90 percent completion”
• 9 & 10 – Failure to use design reviews and code inspections
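
COCOMO, one of the estimating tools named under Worst Practices #3 & 4, can be illustrated in its simplest published form. The sketch below uses the Basic COCOMO equations and coefficients from Boehm's 1981 model: effort = a × KLOC^b person-months and schedule = c × effort^d months. The function name `basic_cocomo` and the 32 KLOC project size are assumptions for illustration, and commercial estimating tools use far richer models than this.

```python
# Basic COCOMO (Boehm, 1981): development mode -> (a, b, c, d)
COEFFICIENTS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc: float, mode: str = "organic") -> tuple[float, float]:
    """Return (effort in person-months, schedule in months)."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    a, b, c, d = COEFFICIENTS[mode]
    effort = a * kloc ** b        # person-months
    schedule = c * effort ** d    # calendar months
    return effort, schedule


if __name__ == "__main__":
    size_kloc = 32  # hypothetical project size in thousands of lines of code
    for mode in COEFFICIENTS:
        effort, months = basic_cocomo(size_kloc, mode)
        print(f"{mode:>13}: {effort:6.1f} person-months over {months:4.1f} months")
```

The published coefficients are industry averages; calibrating them against an organization's own historical data is what the measurement practice in Worst Practice #1 makes possible.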