Failures in the System
There are two major layers on a node: the applications and the system underneath them. At Google the stack is Bigtable (application), the cluster scheduler, GFS (file system), and the hard drives; the comparable stack at Nebraska is Hadoop (application), the Hadoop file system, and the hard drives. A failure anywhere in the system layer causes unavailability; a hard-drive failure can additionally cause data loss.

Unavailability: Defined
Data on a node is unreachable. Detection: the node's periodic heartbeats go missing. Correction: the unavailability lasts until the node comes back or the system re-creates the data elsewhere.

Unavailability: Measured
Re-replication only starts after a detection window. Question: after replication starts, why does it take so long to recover?

Node Availability
The storage software itself is fast to restart. Node updates (planned reboots) cause the most downtime.

MTTF for Components
Even though a disk failure can cause data loss, node failures happen far more often. Conclusion: node failure is more important to system availability.

Correlated Failures
A large number of nodes failing in a burst reduces the effectiveness of replication and encoding schemes, and losing nodes before re-replication can start makes data unavailable. Examples: rolling reboots of a cluster, or in the worst case a datacenter on fire (hopefully never that bad).

Coping with Failure
The two main tools are encoding and replication; the modeled mean times to failure quoted in the talk were 27,000 years and 27.3 million years. Three replicas is the standard in large clusters. Beyond a single cluster there is cell replication (datacenter replication): Cell 1 and Cell 2 each hold a copy of Block A, so losing an entire cell does not lose the data.

Modeling Failures
We have seen the data; now let's model the behavior. A chunk of data can be in one of several states. With replication = 3 the states are 3, 2, 1, or 0 available replicas: losing a replica moves the chunk down one state (lose a replica from state 3 and 2 are still available), recovery moves it back up, and 0 replicas means the data is unavailable. Each loss of a replica has a known probability, and the recovery rate is also known, so the chain can be analyzed.

Markov Model
ρ = recovery rate, λ = failure rate, s = number of block replicas, r = minimum replication. Using this Markov model we can compute the expected time until data is lost; for the Nebraska cluster the model gives 402 years. The model also extends to multi-cell implementations. (A small numerical sketch of this chain appears after the paper conclusions below.)

Paper Conclusions
Given the enormous amount of data from Google, the authors can say: failures are typically short; node failures can happen in bursts and are not independent; in modern distributed file systems a disk failure is effectively the same as a node failure; and the Markov model they built accurately reasons about past and future availability.
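To make the Markov model concrete, here is a minimal sketch of the replica-count chain described above. The failure rate lam, the recovery rate rho, and the exact transition layout are illustrative assumptions, not numbers or code from the paper; the point is only to show how the expected time to losing all replicas falls out of the chain.

```python
import numpy as np

# Hedged sketch: a birth-death Markov chain over the number of live replicas.
# lam = per-replica failure rate (failures per year, assumed value)
# rho = recovery (re-replication) rate (recoveries per year, assumed value)
# s = 3 replicas; state 0 (no replicas left) is treated as absorbing data loss.
lam, rho = 0.1, 365.0   # illustrative numbers only, not from the paper

# Generator restricted to the transient states [3, 2, 1] replicas.
Q = np.array([
    [-3 * lam,        3 * lam,         0.0         ],  # 3 -> 2 when any of 3 replicas fails
    [ rho,           -(rho + 2 * lam), 2 * lam     ],  # 2 -> 3 (repair) or 2 -> 1 (failure)
    [ 0.0,            rho,            -(rho + lam) ],  # 1 -> 2 (repair) or 1 -> 0 (data loss)
])

# Expected time to absorption (all replicas lost), starting from each state,
# satisfies Q t = -1 for a continuous-time chain.
t = np.linalg.solve(Q, -np.ones(3))
print(f"Mean time to losing all 3 replicas: {t[0]:,.0f} years")
```

With a recovery rate much larger than the failure rate, the expected time to reach zero replicas becomes enormous, which matches the intuition behind the multi-million-year figures quoted above.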
My Conclusions
This paper contributes greatly by presenting failure data from very large-scale distributed file systems. One open question: if Reed-Solomon striping is so much more efficient, why isn't it used by Google, Hadoop, or Facebook? Is the code too complicated? Is the administration too complicated? (A rough overhead comparison follows below.)
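To put a number on the efficiency question, here is a small sketch comparing the raw-storage overhead of plain replication and Reed-Solomon striping. The (6 data, 3 parity) layout is an assumed example for illustration, not a configuration reported in the paper.

```python
# Hedged sketch: bytes of raw storage used per byte of user data under each scheme.
def replication_overhead(copies: int) -> float:
    """Overhead of keeping n full copies of every block."""
    return float(copies)

def reed_solomon_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Overhead of an RS(data + parity, data) stripe; survives any `parity_blocks` losses."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))       # 3.0x -- the standard 3-replica scheme
print(reed_solomon_overhead(6, 3))   # 1.5x -- tolerates any 3 lost blocks in the stripe
```

A stripe of 6 data plus 3 parity blocks tolerates three lost blocks at 1.5x overhead versus 3.0x for three full replicas, which is exactly why the adoption question (code complexity? repair traffic? operational complexity?) is worth asking.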