What Bugs Live in the Cloud?
A Study of 3000+ Issues in Cloud Systems
Haryadi S. Gunawi, Mingzhe Hao,
Tanakorn Leesatapornwongsa, Tiratat
Patana-anake
Thanh Do
Jeffry Adityatama, Kurnia J. Eliazar, Agung
Laksono, Jeffrey F. Lukman, Vincentius
Martin, and Anang D. Satria
First, let’s ask Google
2
Cloud era
No Deep Root Causes…
3
What does the reliability research community do?
• Bug study
1. A Study of Linux File System Evolution. In FAST ’13.
2. A Comprehensive Study on Real World Concurrency Bug
Characteristics. In ASPLOS ’08.
3. Precomputing Possible Configuration Error Diagnoses. In ASE
’11.
…
4
Open sourced cloud software
• Publicly accessible bug repositories
5
Study to solve…
• What bugs “live” in the cloud?
• Are there new classes of bugs unique to cloud
systems?
• How should cloud dependability tools evolve
in near future?
• Many other questions…
6
Cloud Bug Study (CBS)
• 6 systems: Hadoop MapReduce, HDFS, HBase,
Cassandra, Zookeeper, and Flume
• 11 people, 1 year study
• Issues in a 3-year window:
Jan 2011 to Jan 2014
• ~21000 issues reviewed
• ~3600 “vital” issues studied in depth
• Cloud Bug Study (CBS) database
7
Classifications
• Aspects – Reliability, performance, availability,
security, consistency, scalability, topology, QoS
• Hardware failures - types of hardware and types
of hardware failures
• Software bug types – Logic, error handling,
optimization, config, race, hang, space, load
• Implications – Failed operation, performance,
component downtime, data loss, data staleness,
data corruption
• ~25000 annotations in total, about 7 annotations
per issue
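The classification scheme above could be captured in a record like the following. This is a minimal sketch in Python, not the actual CBS database schema: the class name `IssueAnnotation`, its field names, and the sample issue ID are hypothetical; only the category labels come from the slide.

```python
from dataclasses import dataclass, field

# Category labels taken from the classification slide.
ASPECTS = {"reliability", "performance", "availability", "security",
           "consistency", "scalability", "topology", "qos"}
BUG_TYPES = {"logic", "error handling", "optimization", "config",
             "race", "hang", "space", "load"}
IMPLICATIONS = {"failed operation", "performance", "component downtime",
                "data loss", "data staleness", "data corruption"}


@dataclass
class IssueAnnotation:
    """Hypothetical record for one annotated vital issue."""
    issue_id: str
    aspects: set = field(default_factory=set)
    bug_types: set = field(default_factory=set)
    implications: set = field(default_factory=set)

    def validate(self) -> bool:
        """Return True if every tag belongs to a known category."""
        return (self.aspects <= ASPECTS
                and self.bug_types <= BUG_TYPES
                and self.implications <= IMPLICATIONS)

    def annotation_count(self) -> int:
        """Total number of tags attached to this issue."""
        return len(self.aspects) + len(self.bug_types) + len(self.implications)


# Made-up example issue, for illustration only.
example = IssueAnnotation(
    issue_id="EXAMPLE-1",
    aspects={"scalability", "performance"},
    bug_types={"optimization"},
    implications={"performance"},
)
```

Keeping each classification as a set lets one issue carry several tags per category, which is consistent with the slide's figure of several annotations per issue.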
8
Cloud Bug Study (CBS) database
• Open to public
9
Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion
10
Methodology
• 6 systems, 3-year span, 2011 to 2014
• 20~30 bugs a day! Protein yeah!
• 17% “vital” issues affecting
real deployments
• 3655 vital issues
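The headline numbers can be cross-checked with simple arithmetic. A small sketch, using only the figures quoted on the slides (~21000 reviewed issues, 3655 vital issues, ~25000 annotations); the variable names are illustrative.

```python
reviewed_issues = 21_000   # approximate, from the slides
vital_issues = 3_655       # from the slides
annotations = 25_000       # approximate, from the slides

# Share of reviewed issues that were classified as vital.
vital_fraction = vital_issues / reviewed_issues

# Average number of annotations attached to each vital issue.
annotations_per_issue = annotations / vital_issues

print(f"vital fraction: {vital_fraction:.1%}")
print(f"annotations per vital issue: {annotations_per_issue:.1f}")
```

Both results agree with the rounded values on the slides: roughly 17% vital issues and about 7 annotations per issue.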
11
Example issue
Title
Time to resolve
Type & Priority
Description
Discussion
12
Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion
13
Classifications for each vital issue
• Aspects
• Hardware types and failure modes
• Software bug types
• Implications
• Bug scopes
14
Overview of results
• Aspects
• Hardware faults vs. Software faults
• Implications
15
Aspects
• CS = Cassandra
• FL = Flume
• HB = HBase
• HD = HDFS
• MR = MapReduce
• ZK = ZooKeeper
16
Aspects: Reliability
• Reliability (45%)
– Operation & job
failures/errors, data
loss/corruption/staleness
17
Aspects: Performance
• Reliability
• Performance (22%)
18
Aspects: Availability
• Reliability
• Performance
• Availability (16%)
– Node and cluster
downtime
19
Aspects: Security
• Reliability
• Performance
• Availability
• Security (6%)
20
Overview of results
• Aspects (classical)
• Aspects
– Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications
21
Aspects: Data consistency
• Data consistency (5%)
– Permanent inconsistent
replicas
– Various root causes:
• Buggy operational
protocol
• Concurrency bugs
and node failures
22
Cassandra cross-DC synchronization
[Figure: replica values across data centers (A’, A’, A; B’, B’, B; C’, C), labeled “Permanent inconsistency”]
Background operational protocols are often buggy!
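One way to think about permanent inconsistency is as replicas that still disagree after background synchronization has finished. The toy check below compares replica contents key by key. It is a sketch under stated assumptions, not Cassandra's mechanism: the function name `divergent_keys`, the replica names, and the sample values are invented for illustration.

```python
def divergent_keys(replicas: dict) -> list:
    """Return the sorted keys whose values differ (or are missing) across replicas."""
    all_keys = set()
    for data in replicas.values():
        all_keys.update(data)

    bad = []
    for key in sorted(all_keys):
        # A missing key shows up as None, which also counts as a disagreement.
        values = {data.get(key) for data in replicas.values()}
        if len(values) > 1:
            bad.append(key)
    return bad


# Hypothetical state after synchronization: one replica kept a stale value.
replicas = {
    "dc1-node": {"a": "A'", "b": "B'", "c": "C'"},
    "dc2-node": {"a": "A'", "b": "B'", "c": "C"},
}
stale = divergent_keys(replicas)
```

A check of this kind only detects divergence after the fact; it says nothing about which protocol step caused it.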
23
Aspects: Scalability
• Data consistency
• Scalability (2%)
– Small number does not
mean not important!
– Only found at scale
• Large cluster size
• Large data
• Large load
• Large failures
24
Large cluster
• In Cassandra
Ring position changed.
100x
O(n^3) calculation
CPU explosion
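To see why a cubic-time computation turns into a CPU explosion as the cluster grows, compare relative costs at different sizes. This only illustrates O(n^3) growth; it is not Cassandra's code, and the function name `relative_cost` and the cluster sizes are made up.

```python
def relative_cost(n_small: int, n_large: int, exponent: int = 3) -> float:
    """How many times more work an O(n^exponent) step does at n_large than at n_small."""
    return (n_large / n_small) ** exponent


# Hypothetical cluster sizes, compared against a 10-node baseline.
for nodes in (10, 100, 1000):
    print(nodes, "nodes ->", relative_cost(10, nodes), "x the work of 10 nodes")
```

Growing the cluster by a factor of 10 multiplies the work by 1000, which is why such a bug can stay invisible on a small test cluster.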
25
Large data
• In HBase
Insufficient lookup operation: tens of minutes
[Figure: regions R1, R2, R3, …, R100K]
Large load
• In HDFS
1000x small files in parallel
Not expecting small files!
27
Large failure
AM managing 16,000 tasks fails
Un-optimized connection
Time cost: 7+ hours
[Figure: task counts 1, 2, 3, …, 1K, 2K, 3K, 4K, 5K, …, 16K]
28
From above examples…
• Protocol algorithms must anticipate
– Large cluster sizes
– Large data
– Large request load of various kinds
– Large scale failures
• The need for scalability bug detection tools
29
Aspects: Topology
• Data consistency
• Scalability
• Topology (1%)
– Systems have problems
when deployed on certain
network topologies
• Cross DC
• Different racks
• New layering architecture
– Typically unseen in pre-deployment
30
Aspects: QoS
• Data consistency
• Scalability
• Topology
• QoS (1%)
– Fundamental for multitenant systems
– Two main points
• Horizontal/intra-system
QoS
• Vertical/cross-system QoS
31
Overview of results
• Aspects (classical)
• Aspects (unique)
– Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications
32
HW faults vs. SW faults
“Hardware can fail, and reliability
should come from software.”
33
HW faults and modes
• 299 cases of improper handling of
node fail-stop failures
• A memory card running at 25% of normal
speed caused problems
in an HBase deployment.
34
Hardware faults vs. Software faults
• Hardware failures, components and modes
• Software bug types
35
Software bug types: Logic
• Logic (29%)
– Many domain-specific
issues
36
Software bug types: Error handling
• Logic
• Error handling (18%)
– Aspirator, Yuan et al.
[OSDI ’14]
37
Software bug types: Optimization
• Logic
• Error handling
• Optimization (15%)
38
Software bug types: Configuration
• Logic
• Error handling
• Optimization
• Configuration (14%)
– Automating Configuration
Troubleshooting. [OSDI ’10]
– Precomputing Possible
Configuration Error
Diagnoses. [ASE ’11]
– Do Not Blame Users for
Misconfigurations. [SOSP ’13]
39
Software bug types: Race
• Race (12%)
– < 50% local concurrency
bugs
• Buggy thread interleaving
• Tons of work
– > 50% distributed
concurrency bugs
• Reordering of messages,
crashes, timeouts
• More work is needed
– SAMC [OSDI ’14]
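Distributed concurrency bugs arise when the outcome depends on the order in which messages, crashes, or timeouts occur. The toy sketch below enumerates every delivery order of a few messages and records which final states are reachable. It illustrates the general idea of exploring reorderings, not SAMC or any specific tool; the message format and the function names are invented for the example.

```python
from itertools import permutations


def final_value(messages) -> int:
    """Apply update messages, in the given order, to a single counter."""
    value = 0
    for op, arg in messages:
        if op == "set":
            value = arg
        elif op == "add":
            value += arg
    return value


def outcomes(messages) -> dict:
    """Map each distinct final value to one delivery order that produces it."""
    seen = {}
    for order in permutations(messages):
        seen.setdefault(final_value(order), order)
    return seen


# Two hypothetical messages whose effects do not commute.
msgs = [("set", 5), ("add", 1)]
result = outcomes(msgs)
order_dependent = len(result) > 1
```

Exhaustive enumeration grows factorially with the number of messages, so this brute-force approach is only workable for very small scenarios.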
40
Software bug types: Hang
• Hang (4%)
– Classical deadlock
– Un-served jobs, stalled
operations, …
• Root causes?
• How to detect them?
41
Software bug types: Space
• Space (4%)
– Big data + leak = Big leak
– Clean-up operations
must be flawless.
42
Software bug types: Load
• Load (4%)
– Happens when systems
face high request load
– Relates to QoS and
admission control
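Admission control, mentioned above, can be as simple as refusing new work once a bounded queue is full. The class below is a minimal sketch of that idea, not taken from any of the studied systems; the name `AdmissionController` and its methods are hypothetical.

```python
from collections import deque


class AdmissionController:
    """Admit requests until a fixed queue capacity is reached, then reject."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.queue = deque()

    def submit(self, request) -> bool:
        """Return True if the request was queued, False if it was rejected."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def serve_one(self):
        """Remove and return the oldest queued request, or None if idle."""
        return self.queue.popleft() if self.queue else None


controller = AdmissionController(capacity=2)
accepted = [controller.submit(name) for name in ("r1", "r2", "r3")]
```

Rejecting early keeps the backlog bounded under high load, at the cost of pushing retries back to the client.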
43
Overview of results
• Aspects (classical)
• Aspects (unique)
– Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications
44
Implications
• Failed operation (42%)
• Performance (23%)
• Downtimes (18%)
• Data loss (7%)
• Data corruption (5%)
• Data staleness (5%)
45
Root causes
Every implication can be caused by
all kinds of hardware and software faults!
46
“Killer” bugs
• Bugs that simultaneously affect multiple
nodes or even the entire cluster
• Single Point of Failure still exists in many forms
– Positive feedback loop
– Buggy failover
– Repeated bugs after failover
–…
47
Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion
48
CBS database
• 50+ per-system and aggregate graphs from
mining the CBS database over the past year
• Still more waiting to be studied…
49
Components with most issues
How should we enhance reliability for
multiple cloud system interaction?
Cross-system issues are prevalent!
50
Most challenging types of issues
51
Top k% of most complicated issues
52
System evolution
Hadoop 2.0
53
Conclusion
• One of the largest bug studies for cloud
systems
• Many interesting findings, but more questions
can be raised from our analysis
– What types of performance issues exist?
– Root causes for hang issues?
–…
• Cloud Bug Study (CBS) database
54
Thank you!
http://ucare.cs.uchicago.edu/
55