Availability in Globally Distributed Storage Systems

Failures in the System

Two major components in a node: Applications and System.

[Figure: the node software stack at Google vs. Nebraska. Applications layer: Bigtable and the Cluster Scheduler at Google, the user Application at Nebraska. System layer: GFS at Google, Hadoop at Nebraska, each running on local File Systems over the Hard Drive.]

Similar systems exist at Nebraska.

A failure anywhere in the System stack will cause unavailability; a hard-drive failure could also cause data loss.
Unavailability: Defined

Data on a node is unreachable.

Detection:
Periodic heartbeats from the node are missing.

Correction:
Unavailability lasts until the node comes back, or until the system recreates the data elsewhere.
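The detection side is easy to sketch. Below is a minimal heartbeat monitor in Python; the HeartbeatMonitor class, its 60-second timeout, and the node IDs are illustrative assumptions, not details from the paper.

```python
import time

# Hypothetical heartbeat monitor, not the paper's implementation: a node
# is flagged unavailable once no heartbeat has arrived for `timeout`
# seconds (the 60 s threshold is an assumed example value).
class HeartbeatMonitor:
    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.last_seen = {}  # node id -> timestamp of last heartbeat

    def heartbeat(self, node_id, now=None):
        self.last_seen[node_id] = time.time() if now is None else now

    def unavailable(self, now=None):
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

# A master would record heartbeat() on every report, scan unavailable()
# periodically, and kick off re-replication for chunks on flagged nodes.
m = HeartbeatMonitor()
m.heartbeat("node-17", now=0.0)
print(m.unavailable(now=120.0))  # ['node-17']
```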
Unavailability: Measured

[Figure: measured unavailability events over time, with the point where re-replication starts marked.]

Question: after replication starts, why does it take so long to recover?
Node Availability

[Figure: node availability over time; storage-software restarts show up as brief dips.]

The storage software is fast to restart.
Node Availability: Time

[Figure: downtime broken down by cause; planned reboots dominate.]

Node updates (planned reboots) cause the most downtime.
MTTF for Components

Even though a disk failure can cause data loss, node failures happen much more often.

Conclusion: node failure matters more to system availability.
Correlated Failures

A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes.

Losing nodes before replication can start can cause unavailability of data.

Examples: rolling reboots of a cluster, or "Oh s*!t, datacenter on fire!" (maybe not that bad).
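As a rough illustration of what counts as a burst, here is a sketch that groups failure timestamps whenever consecutive failures fall within a fixed window; the 120-second window and the helper name are assumptions for illustration, not the paper's exact definition.

```python
# Assumed definition for illustration: consecutive failures that start
# within `window` seconds of each other belong to the same burst.
def group_bursts(failure_times, window=120.0):
    bursts, current = [], []
    for t in sorted(failure_times):
        if current and t - current[-1] > window:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return bursts

# A rolling reboot shows up as one long burst rather than many singletons.
print(group_bursts([0, 30, 50, 500, 520]))  # [[0, 30, 50], [500, 520]]
```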
Coping with Failure

Two main tools: replication and encoding (Reed-Solomon striping).

Modeled MTTF: roughly 27,000 years with replication (3 replicas is standard in large clusters) versus roughly 27.3M years with encoding.
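To see why encoding is attractive, compare storage overhead against how many simultaneous losses each scheme survives. A small sketch; the RS(9, 6) parameters are an illustrative example, not a configuration from the paper.

```python
# Back-of-the-envelope storage overhead vs. fault tolerance.
def replication(r):
    """r full copies: r-times storage, survives r - 1 lost copies."""
    return {"overhead": float(r), "tolerates": r - 1}

def reed_solomon(n, k):
    """RS(n, k): k data + (n - k) code blocks, survives n - k lost blocks."""
    return {"overhead": n / k, "tolerates": n - k}

print(replication(3))      # {'overhead': 3.0, 'tolerates': 2}
print(reed_solomon(9, 6))  # {'overhead': 1.5, 'tolerates': 3}
```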
Coping with Failure: Cell Replication (Datacenter Replication)

[Figure: Block A is replicated inside Cell 1 and also copied to Cell 2; when Cell 1 goes down, Cell 2 still serves Block A.]
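A toy placement rule makes the idea concrete: put replicas in distinct cells so no single-cell outage removes every copy. Everything here (function name, round-robin policy) is a hypothetical sketch, not Google's placement algorithm.

```python
# Hypothetical placement rule: put each replica in a distinct cell so
# that a single-cell outage never removes every copy of the block.
def place_replicas(block_id, cells, r=2):
    assert r <= len(cells), "need a distinct cell per replica"
    return [(block_id, cells[i % len(cells)]) for i in range(r)]

print(place_replicas("Block A", ["Cell 1", "Cell 2"]))
# [('Block A', 'Cell 1'), ('Block A', 'Cell 2')]
```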
Modeling Failures

We've seen the data; now let's model the behavior.
Modeling Failures

A chunk of data can be in one of several states: the number of replicas currently available.

Consider replication = 3. The chunk starts in state 3; losing a replica moves it down a state (from 3 to 2, say, with 2 still available), and recovery moves it back up. At 0 replicas the service is unavailable.

Each loss of a replica happens with a known probability (the failure rate), and the recovery rate is also known.
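This chain is easy to simulate. Here is a toy Monte Carlo of the replica-count process; the rates (lam for per-replica failures, rho for repairs, named after the symbols on the next slide) are assumed example values, not measured ones.

```python
import random

# Toy simulation under assumed rates: each of k live replicas fails at
# rate lam, one replica is repaired at a time at rate rho (per year).
def time_to_unavailable(s=3, lam=1.0, rho=10.0):
    k, t = s, 0.0
    while k > 0:
        fail = k * lam
        rec = rho if k < s else 0.0   # nothing to repair at full replication
        t += random.expovariate(fail + rec)       # wait for the next event
        k += -1 if random.random() < fail / (fail + rec) else 1
    return t

samples = [time_to_unavailable() for _ in range(2000)]
print(sum(samples) / len(samples))  # sample-mean MTTF; compare the next slide
```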
Markov Model
ρ= recovery
λ= failure rate
s = block replications
r = minimum replication
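With those symbols, the mean time to falling below the minimum replication can be computed by solving the chain's hitting-time equations. A sketch under simplifying assumptions (each of k live replicas fails at rate λ, one replica is repaired at a time at rate ρ); this is a simplification for illustration, not the paper's exact parameterization.

```python
import numpy as np

# State = number of available replicas. Falling below the minimum
# replication r is absorbing. Solves the hitting-time linear system for
# the mean time to absorption (an MTTF) from full replication s.
def mttf(s, r, lam, rho):
    n = s - r + 1                     # unknown states r, r+1, ..., s
    A = np.zeros((n, n))
    b = np.ones(n)
    for k in range(r, s + 1):
        i = k - r
        fail = k * lam                # any of the k replicas may fail next
        rec = rho if k < s else 0.0   # nothing to repair when fully replicated
        A[i, i] = fail + rec
        if k - 1 >= r:
            A[i, i - 1] = -fail       # T_{r-1} = 0: absorbed, term drops out
        if k + 1 <= s:
            A[i, i + 1] = -rec
    return np.linalg.solve(A, b)[-1]  # expected time starting at state s

# Same example rates as the simulation above: the two should agree (~25 yr).
print(mttf(s=3, r=1, lam=1.0, rho=10.0))
```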
Modeling Failures

Using Markov models, we can find the expected time until data becomes unavailable: about 402 years for the Nebraska cluster.
Modeling Failures

The same model extends to multi-cell implementations.
Paper Conclusions

Given the enormous amount of data from Google, the authors can say:

Failures are typically short.
Node failures can happen in bursts, and are not independent.
In modern distributed file systems, a disk failure amounts to a node failure.
They built a Markov model of failures that accurately reasons about past and future availability.
My Conclusions

This paper contributed greatly by showing data from very large-scale distributed file systems.

If Reed-Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook?

Complicated code?
Complicated administration?