The Case for Drill-Ready Cloud Computing
Vision Paper
Tanakorn Leesatapornwongsa and Haryadi S. Gunawi

Cloud Services
• Cheap
• Convenient
• Reliable

Yahoo Mail Disruption
• Hardware failures
• Wrong failover
• Disruptions
  – Some users could not access their mail
  – Some users saw wrong notifications
  – Several days to recover

Outlook Disruption
• Hardware failures
  – Caching server
• Failover to backend servers worked correctly
• Requests flooded the backend servers
• Service went down
• Microsoft needed to change its software infrastructure

Cloud Outages
Outage       | Root event        | Supposedly tolerable failure | Incorrect recovery      | Major outage
Amazon EBS   | Network misconfig | Network partition            | Re-mirroring storm      | Clusters collapsed
Gmail        | Upgrade event     | Servers offline              | Bad request routing     | All routing servers down
App Engine   | Power failure     | 25% of machines offline      | Bad failover            | All user apps degraded
Skype        | Overload          | 30% of nodes failed          | Positive feedback loop  | Almost all nodes failed
Google Drive | Network bug       | Network offline              | Timeout during failover | 33% of requests affected
Outlook      | Caching failure   | Failover to backend          | Request flooding        | 7-hour outage
Yahoo Mail   | Hardware failures | Servers offline              | Buggy failover          | 1% of users affected

Journey of Cloud Dependability Research

Fault-Tolerant Systems
• Complex failures
  – Hard to handle and implement correctly
• Recovery protocols are very complex
• Recovery code is one of the most buggy parts

Offline Testing
• Thoroughly verify the recovery mechanism
• Fault injection, model checking, stress testing, etc.
• A "mini cluster" that represents production runs
• Testing and production environments differ
  – Cluster, workload, failures
• Orders of magnitude different in scale
  – Facebook used 100 machines to mimic a 3000-machine production run [2011]
• Small start-ups forego even that luxury
  – Many tests are much smaller than this

Diagnosis
• Helps administrators pinpoint and reproduce the causes of outages
• BUT
  – Post-mortem: it does not prevent disruptions
  – Passive approach: it waits for outages to happen before diagnosing them

Online Testing and Failure Drills
• Administrators "inject failures online" while customers send real requests
• Users outnumber testers
• Real, deep scenarios

A Missing Piece
• Employee: "Boss, let us inject failures online using Chaos Monkey."
• Boss: "Hmm ... Dear beloved customers, thank you for trusting our services, but we accidentally lost your data because of the failure drills that we ran ..."

Future of Failure Drills
• Current drills: a team of engineers standing by
• Our vision: drill-ready clouds

Drill-Ready Cloud Computing
• Automatic failure drills and automatic cancellation
• Performed in a safe, efficient, and easy manner
• Ideally, no engineering effort required
• The administrator submits a drill spec (e.g., "Kill 25%; if it disrupts the service, revert back"); in drill mode, the drill-ready system takes care of failure injection and cancellation

Outline
• Safety
• Efficiency
• Usability
• Generality

Safety
• Learn about failure implications without suffering through them
• Learn whether data can be lost
  – But do not lose the data
• Learn whether the SLA can be violated
  – But do not violate it for a long time

Safety Solutions
• Normal and drill states
  – Today's systems are not drill aware; maintaining the two states is the first, most important requirement for drill-ready clouds
  – Keep a normal topology and a drill topology, so the system can revert to the normal state easily
• Drill state isolation
• Self cancellation
  – Real failures can happen during the drill
  – A drill master commands the drill agents
  – What if a network partition occurs? Agents are left in a limbo state
  – Agents self-cancel the drill when they cannot contact the master
• Safe drill specification
  – Check whether the specification can run safely
  – A drill spec declares what failures to inject, for how long, the cancellation conditions, etc.
  – Example: "Kill 25%; if the SLA is violated, revert back"

Efficiency
• Failures trigger data migration
• Monetary cost
  – Bandwidth
  – Storage space
• System performance
  – Affects users

Efficiency Solutions
• Low-overhead drill setup and cleanup
  – Do we need to do a real key re-balance? It depends on the objective of the test
  – Yes, if we want to see the impact of background re-balancing
  – No, if we only want to measure performance when we lose two nodes
  – The cost depends on the drill objectives, so the objectives must be part of the drill specification
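The self-cancellation idea above (agents revert on their own when they lose contact with the drill master) can be sketched as a heartbeat timeout. This is a minimal sketch, not the paper's implementation; names such as `DrillAgent` and `revert_to_normal` are illustrative.

```python
import time

class DrillAgent:
    """Hypothetical drill agent: keeps the drill running only while
    the drill master is reachable."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s            # max silence tolerated from the master
        self.last_heartbeat = time.monotonic()
        self.in_drill = False

    def on_heartbeat(self):
        """Called whenever a heartbeat from the drill master arrives."""
        self.last_heartbeat = time.monotonic()

    def start_drill(self):
        """Enter drill state (e.g., switch to the drill topology)."""
        self.in_drill = True

    def tick(self):
        """Periodic check: if the master has been silent too long
        (e.g., a network partition), self-cancel the drill."""
        if self.in_drill and time.monotonic() - self.last_heartbeat > self.timeout_s:
            self.revert_to_normal()

    def revert_to_normal(self):
        """Drop the drill state and resume the normal topology."""
        self.in_drill = False
```

With a short timeout, an agent that stops hearing from the master cancels the drill by itself, so a partitioned agent never stays in the limbo state.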
• Cheap drill specification
  – Smarter, cheaper drill specifications
  – Example: track replication progress; if replication is 50% correct, assume the rest is correct, stop half way, and report success

Usability Solutions
• Declarative drill specification language
  – Describes the desired results
  – Easy to read and write
  – Example:
      During peak load, kill 5% of machines.
      If the SLA is violated for more than 1 minute, cancel the drill.
      If recovery is 50% good, stop the drill and report success.

Generality Solutions
• Elasticity drills
• Configuration change drills
• Software upgrade drills
• Security attack drills

Conclusion
• Drill-ready cloud computing: a new reliability paradigm
• We are sketching a first draft
• We want your FEEDBACK

Thank You
http://ucare.cs.uchicago.edu
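The declarative drill specification described in the deck ("during peak load, kill 5% of machines; if the SLA is violated for more than 1 minute, cancel; if recovery is 50% good, report success") could be encoded roughly as follows. This is a sketch under assumed names; `DrillSpec`, `should_cancel`, and `can_report_success` are hypothetical, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class DrillSpec:
    """Hypothetical encoding of a declarative drill specification."""
    kill_fraction: float         # e.g., 0.05 -> kill 5% of machines
    max_sla_violation_s: float   # cancel if the SLA is violated longer than this
    success_replication: float   # e.g., 0.5 -> report success once half is re-replicated

def should_cancel(spec: DrillSpec, sla_violated_for_s: float) -> bool:
    """Cancellation condition: the SLA has been violated longer than allowed."""
    return sla_violated_for_s > spec.max_sla_violation_s

def can_report_success(spec: DrillSpec, replication_progress: float) -> bool:
    """Cheap-drill shortcut from the deck: if replication is 50% correct,
    assume the rest will be, stop half way, and report success."""
    return replication_progress >= spec.success_replication

# The example spec from the Usability slide.
spec = DrillSpec(kill_fraction=0.05, max_sla_violation_s=60.0, success_replication=0.5)
```

A drill-ready system would evaluate these conditions continuously during the drill, cancelling and reverting to the normal state as soon as `should_cancel` holds.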