CAUSES OF FAILURE IN WEB APPLICATIONS

Feroz Zahid
Simula Research Laboratory & UiO

Report Details
Authors: Soila Pertet and Priya Narasimhan
Published by: Parallel Data Laboratory, Carnegie Mellon University
Year of Publication: 2005
Type of Contribution: New Analysis
Purpose: The report investigates the causes and prevalence of failures in web applications, based on case studies and actual website outage data collected from different sources.

Report Overview

Summary of Findings
• Failure Types
    • Software failures and human errors make up 80% of all failures
• Causes of Software Failures
    • Maintenance, upgrades
    • System overload, resource exhaustion and complex fault-recovery mechanisms
• Downtime
    • Ranges from a few minutes to weeks
    • Fault chains increase downtime
    • Planned downtime is about 80% of the total downtime
    • Planned downtime may also cause unplanned downtime

Findings
What are the causes of failures?
• Software Failures
• Human/Operator Errors
• Hardware/Environmental Failures
• Security Violations/Breaches

Software Failures
Software Error Type      Examples
Resource Exhaustion      Memory leaks, buffer overflows
Logical Errors           Corrupt pointers, race conditions, deadlocks
System Overload          Flash crowd, Slashdot effect
Recovery Code            Complex fault recovery, backup restores
Failed Software Update   Upgrade dependencies, configuration errors

Software Failure – Example Incidents
System Software
• PlanetLab – Bug in updated kernel module – Detected by user reports
• America Online – Server upgrade – Intermittent outages – Several weeks
• Symantec – Mar 2005 – Patch for DNS cache poisoning with redirection vulnerability – Attackers redirected traffic to malware websites
• Zopewiki.org – Jul 2004 – Memory leaks – Workaround was to reboot the webserver daily – Detected by performance slowdown

Software Failure – Example Incidents
Application Failures
• Resource Exhaustion – PlanetLab – Nodes hung due to an application bug that exhausted file descriptors
• Logical Error – AOL – Dec 2004 – Deactivated a number of AIM accounts during a regular maintenance cycle – Several days of downtime for the affected users
• Logical Error – Amazon – Pricing error on the UK site listed the iPaq Pocket PC for under $12 (regular price: $449) – 2.5 hours affected – Detected by abnormally high sales volumes
• Site Overload – Comair airlines – Cancelled over 1000 flights when a surge in crew flight re-assignments knocked down its flight reservation system
• Integration – HP and Compaq implementations of SAP software – Loss of $400 million in revenue – 6 weeks (3 weeks planned)

Software Failure – Example Incidents
Databases
• Basecamphq.com – Feb 2005 – DB flagged table as read-only – 30 minutes downtime
• Walmart.com – Apr 2001 – Database glitches – 9 hours downtime
• Sony – June 2003 – Star Wars Galaxies game – Overwhelming traffic – Intermittent database errors for one day
• Recent, not in the paper: London Airport – Dec 2014 – NATS – Inconsistency – Transition between the two states caused a failure in the system

Human/Operator Errors
Human Error Type          Examples
Configuration Errors      Sysadmin mistakes
Procedural Errors         Failure to back up, typos
Miscellaneous Accidents   Accidentally disconnecting the power supply

Human Errors – Example Incidents
• Configuration Error – Microsoft – Incorrect configuration change in edge routers caused Microsoft website downtime of several hours to 1 day
• Configuration Error – MSN – Mistakenly marked messages from Earthlink and RoadRunner as spam – Operator error
• Procedural Error – Gforge3 – Failure to restart the database daemon after applying a database patch – Several hours of downtime
• Miscellaneous – eBay – Electrician accidentally knocked out a plug – Battery ran out 30 minutes later – System outage

Hardware/Environment Failures
Failure Type             Examples
Hardware Failures        Crashed hard disks, burnt circuits
Environmental Failures   Power outages, overheating

Hardware Failures – Example Incidents
• Equipment Failure – Wall Street Journal website – Mar 2004 – Hardware failure – 1 hour downtime
• Equipment Failure – Yahoo Groups – Mar 2002 – Hardware problems – Several hours downtime
• Power Outage – eBay – Power outage in webhosting facility – 3 hours downtime
• Hardware Upgrades – iWon – New hardware installation worth $2 million – Several days of intermittent failures

Network Failures – Example Incidents
• PlanetLab – Experiment overloads university's internet connection – Detected by bandwidth spikes
• Bank of America – Network connection slowed banking service – Several days of intermittent outage
• Sprint – ISP passes bad routing information – 2 hours of downtime

Security Violations
• Unauthorized accesses
• Password disclosures
• Denial of Service attacks (DoS / DDoS)
• Software vulnerabilities
• Viruses, worms

Security Violations – Example Incidents
• Microsoft – Aug 2003 – DoS attack causes website downtime of 1 hour
• Akamai – Jun 2004 – DoS attack on DNS servers caused 2 hours of downtime for Google, Yahoo, Apple and Microsoft
• Google – Jul 2004 – MyDoom worm causes partial outage for several hours
• Verizon – May 2004 – Theft of network cards caused customers to lose their internet access for one day
• Many recent events – Sony Pictures – Google

Manifestation of Errors
Type                                     Examples
Partial or Entire Website Unavailable    File not found, web server crashed
System Exceptions / Access Violations    Runtime exceptions
Incorrect Results                        Wrong page served, invalid cache used
Data Loss or Corruption                  Disk block failures
Performance Slowdowns                    Network congestion, system overload

Fault Chains
• Series of component failures
• Uncoupled Fault Chains
    • Independent failures occur one after another
• Tightly-coupled Fault Chains
    • Correlated failures
    • For example, a power outage caused air-conditioning to fail
    • Software dependencies
• 60% of the failures have a fault chain of length two

Prevalence of Failures
• 89% of customers have experienced issues when completing transactions
• 72.5% of sites experienced failures in the holiday season
[Chart: Causes of Site Failures (Application Failures, Human Error, Others)]
[Chart: Downtime (Planned Downtime, Unplanned Downtime)]

Critique
Application Domain
• Restricted to web applications
• Large websites like AOL, Microsoft, Walmart etc.
Evaluation – Comprehensive?
• With 40 real-world test cases
• Not connected with the causes of software failure in general
• Small subset – evaluation could be biased

Critique
Didn't consider the type of web application
• Causes of failure are related to the type of web application
• For example, a news website is more likely to fail from a surge of visitors (flash crowd) than an online CMS
Types of Failure – Taxonomy
• The four-fault taxonomy is quite primitive
• What type is a device driver failure? Software or hardware?

Critique
A few more important causes of failure
• Website not tested on different platforms
    • e.g. smartphones, tablets
• DNS problems
• Bandwidth – Webhost decides to cut you off because you consumed too much bandwidth
• Police raid – What happened with The Pirate Bay :)

Some Thoughts
Web application failures can generally be traced back to the development phase
• Lack of well-defined scope
• Lack of professional project management
• Poor version control
• Trying to reinvent the wheel
• No functional testing
• "Freelance Syndrome" :)

Summary
• Website failures are prevalent
    • Loss of revenue
    • Long-term losses like customer dissatisfaction
• Important study
    • Failure taxonomy
    • Causes
    • General causes
    • Real-world case studies
• Can be improved, extended and updated!
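
Sketch: tallying case-study downtime by cause
A minimal Python sketch of how the four-category taxonomy from the Findings slide could be applied to the example incidents. The names Cause, Incident, INCIDENTS and downtime_by_cause are hypothetical and do not come from the report. The list holds only incidents for which the slides give a single downtime figure, so the totals are illustrative and are not the report's statistics.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    """The four top-level failure causes listed in the Findings slide."""
    SOFTWARE = "Software failure"
    HUMAN = "Human/operator error"
    HARDWARE = "Hardware/environmental failure"
    SECURITY = "Security violation"


@dataclass(frozen=True)
class Incident:
    site: str
    cause: Cause
    downtime_hours: float


# Assumption: only incidents with one explicit duration in the slides.
# Entries such as "several hours" or "several weeks" are left out.
INCIDENTS = [
    Incident("Basecamphq.com", Cause.SOFTWARE, 0.5),
    Incident("Walmart.com", Cause.SOFTWARE, 9.0),
    Incident("Wall Street Journal", Cause.HARDWARE, 1.0),
    Incident("eBay (power outage)", Cause.HARDWARE, 3.0),
    Incident("Microsoft (DoS)", Cause.SECURITY, 1.0),
    Incident("Akamai (DoS)", Cause.SECURITY, 2.0),
]


def downtime_by_cause(incidents):
    """Sum downtime hours per cause; causes with no incidents are absent."""
    totals = defaultdict(float)
    for incident in incidents:
        totals[incident.cause] += incident.downtime_hours
    return dict(totals)


if __name__ == "__main__":
    for cause, hours in downtime_by_cause(INCIDENTS).items():
        print(f"{cause.value}: {hours} h")
```

Human/operator errors do not appear in the result because none of the listed human-error incidents states a single downtime figure. An incident such as a device driver failure would need a Cause chosen by hand, which is the ambiguity raised in the taxonomy critique.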