CAUSES OF FAILURE IN WEB APPLICATIONS
Feroz Zahid
Simula Research Laboratory & UiO

Report Details
Authors: Soila Pertet and Priya Narasimhan
Published by: Parallel Data Laboratory, Carnegie Mellon University
Year of Publication: 2005
Type of Contribution: New Analysis
Purpose: The report investigates the causes and prevalence of failures in web applications, based on case studies and actual website outage data collected from different sources.

Report Overview
Summary of Findings
• Failure Types
  • Software failures and human errors make up 80% of the total failures
• Causes of Software Failures
  • Maintenance, upgrades
  • System overload, resource exhaustion and complex fault-recovery mechanisms
• Downtime
  • Ranges from a few minutes to weeks
  • Fault chains increase downtime
  • Planned downtime is about 80% of the total downtime
  • Planned downtime may also cause unplanned downtime

Findings
What are the causes of failures?
• Software Failures
• Human/Operator Errors
• Hardware/Environmental Failures
• Security Violations/Breaches

Software Failures
Software Error Type       Examples
Resource Exhaustion       Memory leaks, buffer overflows
Logical Errors            Corrupt pointers, race conditions, deadlocks
System Overload           Flash crowd, Slashdot effect
Recovery Code             Complex fault recovery, backup restores
Failed Software Update    Upgrade dependencies, configuration errors

Software Failure - Example Incidents
• System Software
  • PlanetLab - Bug in updated kernel module - Detected by user reports
  • America Online - Server upgrade - Intermittent outages - Several weeks
  • Symantec - Mar 2005 - Patch for DNS cache poisoning with redirection vulnerability - Attackers redirected traffic to malware websites
  • Zopewiki.org - Jul 2004 - Memory leaks - Workaround was to reboot the web server daily - Detected by performance slowdown

Software Failure - Example Incidents
• Application Failures
  • Resource Exhaustion - PlanetLab - Nodes hang due to an application bug which exhausted file descriptors
  • Logical Error - AOL - Dec 2004 - Deactivated a number of AIM accounts in a regular maintenance cycle - Several days of downtime for the users
  • Logical Error - Pricing error on Amazon's UK site lists iPaq Pocket PC under $12 (regular price: $449) - 2.5 hours affected - Detected by abnormally high sales volumes
  • Site Overload - Comair airlines - Cancels over 1000 flights when a surge in crew flight re-assignments knocked out its flight reservation system
  • Integration - HP and Compaq implementations of SAP software - Loss of $400 million in revenue - 6 weeks (3 weeks planned)

Software Failure - Example Incidents
• Databases
  • Basecamphq.com - Feb 2005 - DB flagged table as read-only - 30 minutes of downtime
  • Walmart.com - Apr 2001 - Database glitches - 9 hours of downtime
  • Sony - June 2003 - Star Wars Galaxies game - Overwhelming traffic - Intermittent database errors for one day
  • RECENTLY (NOT in paper) - London Airport - Dec 2014 - NATS - Inconsistency - A transition between the two states caused a failure in the system

Human/Operator Errors
Human Error Type           Examples
Configuration Errors       Sysadmin mistakes
Procedural Errors          Failure to back up, typos
Miscellaneous Accidents    Accidentally disconnecting the power supply

Human Errors - Example Incidents
• Configuration Error - Microsoft - Incorrect configuration change in edge routers caused downtime of Microsoft websites ranging from several hours to 1 day
• Configuration Error - MSN - Mistakenly marked messages from Earthlink and RoadRunner as spam - Operator error
• Procedural Error - Gforge3 - Failure to restart database daemon after applying database patch - Several hours of downtime
• Miscellaneous - eBay - Electrician accidentally knocked out a plug - Battery ran out 30 minutes later - System outage

Hardware/Environment Failures
Failure Type              Examples
Hardware Failures         Crashed hard disks, burnt circuits
Environmental Failures    Power outages, overheating

Hardware Failures - Example Incidents
• Equipment Failure - Wall Street Journal website - Mar 2004 - Hardware failure - 1 hour of downtime
• Equipment Failure - Yahoo Groups - Mar 2002 - Hardware problems - Several hours of downtime
• Power Outage - eBay - Power outage in web hosting facility - 3 hours of downtime
• Hardware Upgrades - iWon - Installation of new hardware worth $2 million - Several days of intermittent failures

Network Failures - Example Incidents
• PlanetLab - Experiment overloads university's internet connection - Detected by bandwidth spikes
• Bank of America - Network connection slowed banking service - Several days of intermittent outage
• Sprint - ISP passes bad routing information - 2 hours of downtime

Security Violations
• Unauthorized accesses
• Password disclosures
• Denial of Service attacks (DoS / DDoS)
• Software vulnerabilities
• Viruses, worms

Security Violations - Example Incidents
• Microsoft - Aug 2003 - DoS attack causes website downtime of 1 hour
• Akamai - Jun 2004 - DoS attack on DNS servers caused 2 hours of downtime for Google, Yahoo, Apple and Microsoft
• Google - Jul 2004 - MyDoom worm causes partial outage for several hours
• Verizon - May 2004 - Theft of network cards caused customers to lose their internet access for one day
• Many recent events - Sony Pictures - Google

Manifestation of Errors
Type                                      Examples
Partial or Entire Website Unavailable     File not found, web server crashed
System Exceptions / Access Violations     Runtime exceptions
Incorrect Results                         Wrong page served, invalid cache used
Data Loss or Corruption                   Disk block failures
Performance Slowdowns                     Network congestion, system overload

Fault Chains
• Series of component failures
• Uncoupled Fault Chains
  • Independent failures occur one after another
• Tightly-coupled Fault Chains
  • Correlated failures
  • For example, a power outage caused air conditioning to fail
  • Software dependencies
• 60% of the failures have a fault chain of two

Prevalence of Failures
• 89% of customers have experienced issues when completing transactions
• 72.5% of sites experienced failures in the holiday season

Causes of Site Failures
[Chart: causes of site failures, split into Application Failures, Human Error and Others; values not recoverable]

Downtime
[Chart: Planned Downtime vs. Unplanned Downtime; values not recoverable]

Critique
Application Domain
• Restricted to web applications
• Large websites like AOL, Microsoft, Walmart etc.
Evaluation - Comprehensive?
• With 40 real-world test cases
• Not connected with the causes of software failure in general
• Small subset - evaluation could be biased

Didn't consider the type of web application
• Causes of failure are related to the web application type
• For example, a news website is more likely to fail from crowd sourcing than an online CMS
Types of Failure - Taxonomy
• The four-fault taxonomy is quite primitive
• What type is a device driver failure? Software or hardware?

A few more important causes of failure
• Website not tested on different platforms
  • e.g. smart phones, tablets
• DNS problems
• Bandwidth - Web host decides to cut you off because you consumed too much bandwidth
• Police raid - What happened with The Pirate Bay :)

Some Thoughts
Web application failures can generally be traced back to the development phase
• Lack of well-defined scope
• Lack of professional project management
• Poor version control
• Trying to reinvent the wheel
• No functional testing
• "Freelance Syndrome" :)

Summary
• Website failures are prevalent
  • Loss of revenue
  • Long-term losses like customer dissatisfaction
• Important study
  • Failure taxonomy
  • Causes
  • General causes
  • Real-world case studies
• Can be improved, extended and updated!
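
As a small illustration of the failure taxonomy used in these slides, the sketch below tallies a handful of the example incidents by cause category. The sites, categories and downtime figures come from the incident lists above; the `Incident` class, the `downtime_by_cause` helper and the choice of which incidents to include are assumptions made for illustration and are not part of the report. Only incidents with a single explicit duration are used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    site: str
    cause: str             # one of the top-level failure categories in the slides
    downtime_hours: float  # downtime as stated in the incident lists


# Hand-picked subset of the listed incidents (illustrative, not representative).
INCIDENTS = [
    Incident("Basecamphq.com", "Software", 0.5),
    Incident("Walmart.com", "Software", 9.0),
    Incident("Wall Street Journal", "Hardware/Environment", 1.0),
    Incident("eBay (power outage)", "Hardware/Environment", 3.0),
    Incident("Microsoft (DoS attack)", "Security", 1.0),
    Incident("Akamai (DoS attack)", "Security", 2.0),
]


def downtime_by_cause(incidents):
    """Sum the stated downtime, in hours, per cause category."""
    totals = defaultdict(float)
    for incident in incidents:
        totals[incident.cause] += incident.downtime_hours
    return dict(totals)


if __name__ == "__main__":
    for cause, hours in sorted(downtime_by_cause(INCIDENTS).items()):
        print(f"{cause}: {hours:.1f} h")
```

Because the subset is hand-picked, the totals say nothing about how prevalent each cause is; the point is only the shape of the data. Storing a single `cause` per incident also mirrors the limitation raised in the critique: an incident such as a device driver failure does not fit cleanly into one category.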