What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems
Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

First, let’s ask Google

Cloud era
No Deep Root Causes…

What does the reliability research community do?
• Bug study
  1. A Study of Linux File System Evolution. In FAST ’13.
  2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
  3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
  …

Open sourced cloud software
• Publicly accessible bug repositories

Study to solve…
• What bugs “live” in the cloud?
• Are there new classes of bugs unique to cloud systems?
• How should cloud dependability tools evolve in the near future?
• Many other questions…

Cloud Bug Study (CBS)
• 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume
• 11 people, 1-year study
• Issues in a 3-year window: Jan 2011 to Jan 2014
• ~21000 issues reviewed
• ~3600 “vital” issues studied in depth
• Cloud Bug Study (CBS) database

Classifications
• Aspects
  – Reliability, performance, availability, security, consistency, scalability, topology, QoS
• Hardware failures
  – Types of hardware and types of hardware failures
• Software bug types
  – Logic, error handling, optimization, config, race, hang, space, load
• Implications
  – Failed operation, performance, component downtime, data loss, data staleness, data corruption
• ~25000 annotations in total, about 7 annotations per issue

Cloud Bug Study (CBS) database
• Open to public

Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

Methodology
• 6 systems, 3-year span, 2011 to 2014
• 20~30 bugs a day! Protein yeah!
• 17% “vital” issues affecting real deployments
• 3655 vital issues

Example issue
• Title
• Time to resolve
• Type & Priority
• Description
• Discussion

Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

Classifications for each vital issue
• Aspects
• Hardware types and failure modes
• Software bug types
• Implications
• Bug scopes

Overview of result
• Aspects
• Hardware faults vs. Software faults
• Implications

Aspects
• CS = Cassandra
• FL = Flume
• HB = HBase
• HD = HDFS
• MR = MapReduce
• ZK = ZooKeeper

Aspects: Reliability
• Reliability (45%)
  – Operation & job failures/errors, data loss/corruption/staleness

Aspects: Performance
• Reliability
• Performance (22%)

Aspects: Availability
• Reliability
• Performance
• Availability (16%)
  – Node and cluster downtime

Aspects: Security
• Reliability
• Performance
• Availability
• Security (6%)

Overview of result
• Aspects (classical)
• Aspects
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

Aspects: Data consistency
• Data consistency (5%)
  – Permanent inconsistent replicas
  – Various root causes:
    • Buggy operational protocol
    • Concurrency bugs and node failures

Cassandra cross-DC synchronization
[Figure: replicas labeled A/A’, B/B’, C/C’, annotated “Permanent inconsistency”]
Background operational protocols often buggy!

Aspects: Scalability
• Data consistency
• Scalability (2%)
  – Small number does not mean not important!
  – Only found at scale
    • Large cluster size
    • Large data
    • Large load
    • Large failures

Large cluster
• In Cassandra
  – Ring position changed.
  – 100x O(n^3) calculation
  – CPU explosion

Large data
• In HBase
  – Tens of minutes
  – Insufficient lookup operation
[Figure: R1, R2, R3, …, R100K]

Large load
• In HDFS
  – 1000x small files in parallel
  – Not expecting small files!
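
The “Large cluster” example can be illustrated with a minimal sketch of why a cubic-cost calculation tends to stay hidden on small test clusters. The function names `cubic_steps` and `growth_factor` are hypothetical and the loop is only a stand-in for an O(n^3) calculation; none of this is code from Cassandra.

```python
# Illustrative only: cubic_steps is a hypothetical stand-in for an O(n^3)
# protocol calculation. It is not taken from any of the studied systems.


def cubic_steps(num_nodes: int) -> int:
    """Count the basic steps of a triple-nested loop over all nodes."""
    steps = 0
    for _ in range(num_nodes):
        for _ in range(num_nodes):
            for _ in range(num_nodes):
                steps += 1
    return steps


def growth_factor(small: int, large: int) -> float:
    """Return how many times more steps the larger cluster needs."""
    return cubic_steps(large) / cubic_steps(small)


if __name__ == "__main__":
    # Compare a small test cluster with one ten times larger.
    print(growth_factor(10, 100))
```

Under these assumptions, a tenfold increase in cluster size multiplies the work by a thousand, which is consistent with the slide's point that such bugs are only found at scale.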
Large failure
• AM managing 16,000 tasks fails
• Un-optimized connection
• Time cost: 7+ hours
[Figure: tasks labeled 1, 2, 3, …, 1K, 2K, 3K, 4K, 5K, …, 16K]

From above examples…
• Protocol algorithms must anticipate
  – Large cluster sizes
  – Large data
  – Large request load of various kinds
  – Large scale failures
• The need for scalability bug detection tools

Aspects: Topology
• Data consistency
• Scalability
• Topology (1%)
  – Systems have problems when deployed on some network topologies
    • Cross DC
    • Different racks
    • New layering architecture
  – Typically unseen in predeployment

Aspects: QoS
• Data consistency
• Scalability
• Topology
• QoS (1%)
  – Fundamental for multitenant systems
  – Two main points
    • Horizontal/intra-system QoS
    • Vertical/cross-system QoS

Overview of result
• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

HW faults vs. SW faults
“Hardware can fail, and reliability should come from software.”

HW faults and modes
• 299 issues with improper handling of node fail-stop failure
• A memory card running at 25% of normal speed causes problems in an HBase deployment.

Hardware faults vs. Software faults
• Hardware failures, components and modes
• Software bug types

Software bug types: Logic
• Logic (29%)
  – Many domain-specific issues

Software bug types: Error handling
• Logic
• Error handling (18%)
  – Aspirator, Yuan et al. [OSDI ’14]

Software bug types: Optimization
• Logic
• Error handling
• Optimization (15%)

Software bug types: Configuration
• Logic
• Error handling
• Optimization
• Configuration (14%)
  – Automating Configuration Troubleshooting. [OSDI ’10]
  – Precomputing Possible Configuration Error Diagnoses. [ASE ’11]
  – Do Not Blame Users for Misconfigurations. [SOSP ’13]

Software bug types: Race
• Race (12%)
  – < 50% local concurrency bugs
    • Buggy thread interleaving
    • Tons of work
  – > 50% distributed concurrency bugs
    • Reordering of messages, crashes, timeouts
    • More work is needed
  – SAMC [OSDI ’14]

Software bug types: Hang
• Hang (4%)
  – Classical deadlock
  – Un-served jobs, stalled operations, …
    • Root causes?
    • How to detect them?

Software bug types: Space
• Space (4%)
  – Big data + leak = Big leak
  – Clean-up operations must be flawless.

Software bug types: Load
• Load (4%)
  – Happens when systems face high request load
  – Relates to QoS and admission control

Overview of result
• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

Implications
• Failed operation (42%)
• Performance (23%)
• Downtimes (18%)
• Data loss (7%)
• Data corruption (5%)
• Data staleness (5%)

Root causes
Every implication can be caused by all kinds of hardware and software faults!

“Killer” bugs
• Bugs that simultaneously affect multiple nodes or even the entire cluster
• Single Point of Failure still exists in many forms
  – Positive feedback loop
  – Buggy failover
  – Repeated bugs after failover
  – …

Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

CBS database
• 50+ per-system and aggregate graphs from mining the CBS database over the last year
• Still more waiting to be studied…

Components with most issues
How should we enhance reliability for interactions among multiple cloud systems?
Cross-system issues are prevalent!

Most challenging types of issues

Top k% of most complicated issues

System evolution
Hadoop 2.0

Conclude
• One of the largest bug studies for cloud systems
• Many interesting findings, but more questions can be raised from our analysis
  – What types of performance issues exist?
  – Root causes for hang issues?
  – …
• Cloud Bug Study (CBS) database.

Thank you!
http://ucare.cs.uchicago.edu/
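
The percentage breakdowns in the “Aspects”, “Software bug types”, and “Implications” slides come from counting annotations per issue. A minimal sketch of that kind of tally is shown here. The record layout, the issue IDs, and the function name `tag_shares` are all hypothetical, and the sample data is made up for illustration; it does not reflect the actual CBS database format or its contents.

```python
from collections import Counter

# Made-up sample records for illustration only. Each issue carries a list of
# (category, tag) annotations. This layout is an assumption, not the real
# CBS database schema.
SAMPLE_ISSUES = [
    {"id": "issue-1", "annotations": [("aspect", "reliability"), ("software", "logic")]},
    {"id": "issue-2", "annotations": [("aspect", "performance"), ("software", "optimization")]},
    {"id": "issue-3", "annotations": [("aspect", "reliability"), ("software", "race")]},
    {"id": "issue-4", "annotations": [("aspect", "availability"), ("software", "logic")]},
]


def tag_shares(issues, category):
    """Return each tag's share of all annotations within one category."""
    counts = Counter(
        tag
        for issue in issues
        for cat, tag in issue["annotations"]
        if cat == category
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


if __name__ == "__main__":
    for tag, share in sorted(tag_shares(SAMPLE_ISSUES, "aspect").items()):
        print(f"{tag}: {share:.0%}")
```

Keeping annotations as (category, tag) pairs lets one function serve every classification axis; other layouts would work equally well.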