Computer and Information Security Handbook
Chapter 07: Fault Tolerance and Resilience in Cloud Computing Environments

Ravi Jhawar, Università degli Studi di Milano
Vincenzo Piuri, Università degli Studi di Milano

Copyright © 2014, Elsevier Inc. All rights Reserved

Introduction

- Massive cloud computing data centers
- Meeting demands for cloud-based services
- Leveraging economies of scale
- Operating under abnormal conditions
- Utilizing highly complex infrastructure
- Reducing overall reliability and availability
- Changing application risk dimension
- Paramount issue: fault tolerance

Why are traditional methods of introducing fault tolerance not very effective?

Cloud Computing Fault Model: Faults, Errors, and Failures

- Failure condition cycle: Fault -> Error -> Failure
  - Fault: fundamental system impairment
  - Error: invalid system state
  - Failure: functionality or behavior not met
- Fault tolerance: performing with failures present

Why must we clearly understand and define what constitutes correct system behavior?

Cloud Computing Fault Model: Architecture and Server Failure Behavior

- Four distinct layers
  - Physical resources: lowest layer
  - Low layer failure produces most impact above
- Server failure behavior: data results
  - Implies need for robust fault tolerance
  - Improved hard disk reliability
  - Avoidance of failure-prone hard disks
- Fault trees and Markov chain analysis
  - Captures user’s perspective on failures
  - Correlates component failures and boundaries

Figure 7.1 Layered architecture of cloud computing.
Why does a failure in a given layer impact the services offered by the layers above it?

Figure 7.2A Fault tree characterizing server failures.
Failure/error in any component within any boundary may impact the top event representing a failure in a user’s application.

Figure 7.2B Fault tree characterizing power failures.
Defining boundaries. Power supply failure may come from the server, or the hypervisor, or the VM instance itself.

Cloud Computing Fault Model: Failure Behavior of the Network

- Knowledge required
  - Network topology and network components
  - Failure types
    - Device connection down for a specific interface
    - Device not routing/forwarding packets correctly
- Network failure behavior: data results
  - Data center network reliability: 99.99 percent for 80 percent of the links and 60 percent of the devices
- Fault trees reveal application failure impact

Figure 7.3A Partial network architecture of a data center.
Servers are connected using a set of network switches and routers.

Figure 7.3B Fault tree characterizing network failures.
Boundaries on the impact of each network failure are represented (using server, cluster, and data center level blocks) and further used to increase the fault tolerance of the user’s application (by placing replicas of an application in different failure zones).

Basic Concepts on Fault Tolerance

- End user fault types: crash and Byzantine
- Widely adopted fault tolerance methods: checking and monitoring, checkpoint and restart, replication (active and passive), and variant replication systems
- Markov model analysis characterizes impact
- Fault tolerance factors
  - Implementation complexity, resource costs, resilience, performance metrics
  - Balance models, consumption, performance
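As a rough illustration of how a Markov model yields an availability figure, the sketch below solves the steady state of a three-state chain (2, 1, or 0 replicas working) for a two-replica system. It assumes each replica fails at rate lam and is repaired independently at rate mu. That chain, the function name, and the sample rates are illustrative assumptions, not the models shown in Figure 7.4.

```python
def two_replica_availability(lam: float, mu: float) -> float:
    """Steady-state availability of two replicas under an ASSUMED model.

    States count working replicas. Assumed transitions:
      2 -> 1 at rate 2*lam, 1 -> 0 at rate lam,
      1 -> 2 at rate mu,    0 -> 1 at rate 2*mu (independent repair).
    """
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    # Unnormalized steady-state weights from the balance equations.
    w2 = 1.0
    w1 = w2 * (2 * lam) / mu   # flow 2 -> 1 balances flow 1 -> 2
    w0 = w1 * lam / (2 * mu)   # flow 1 -> 0 balances flow 0 -> 1
    total = w2 + w1 + w0
    # The service is up whenever at least one replica is working.
    return (w2 + w1) / total


if __name__ == "__main__":
    # Hypothetical rates: one failure per 1000 h, mean repair time of 10 h.
    print(two_replica_availability(lam=1 / 1000, mu=1 / 10))
```

Under these assumptions the result matches the closed form 1 - (lam / (lam + mu))^2, so adding a second replica squares the unavailability of a single replica. A model with correlated failures or a shared repair facility would give a different figure.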
Figure 7.4A Markov model of a system with two replicas in active/semiactive replication scheme.
Effective means of deriving the reliability and availability of the system because the failure behavior of both replicas can be taken into account.

Figure 7.4B Markov model of a system with two replicas in passive replication scheme.
λ denotes the failure rate, μ denotes the recovery rate, and k is a constant.

Different Levels of Fault Tolerance in Cloud Computing

- Deployment scenarios (replica location)
  - Critical for fault tolerance mechanisms
  - Three types
    - Multiple machines within the same cluster
    - Multiple clusters within a data center
    - Multiple data centers
  - System availability impact variable: Table 7.1
- Replication technique at deployment scenario
- Virtualization-based approaches
  - Achieve at least one required property
  - Deal with both classes of faults

Fault Tolerance Against Crash Failures in Cloud Computing

- Utilize scheme leveraging virtualization to tolerate crash faults transparently
- Advantage: increased level of generality
  - Encapsulate system or user application in a VM
  - Operations performed at the VM level
  - Scheme independent of application and underlying hardware
- Example: Remus application
  - Works in four phases
  - Built on Xen hypervisor’s live migration machinery
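The general idea of tolerating a crash by checkpointing state to a backup can be sketched as follows. This is a toy, in-process illustration of checkpoint and restart with a primary and a backup copy. It is not Remus's implementation, and it does not model the four phases mentioned above. The class name, method names, and the counter state are all hypothetical.

```python
import copy


class CheckpointedService:
    """Toy primary/backup pair; all names here are hypothetical."""

    def __init__(self, checkpoint_every: int = 2):
        self.checkpoint_every = checkpoint_every
        self.primary_state = {"counter": 0}   # stands in for the protected VM
        # The backup only ever holds the most recent checkpoint.
        self.backup_state = copy.deepcopy(self.primary_state)
        self.ops_since_checkpoint = 0
        self.primary_alive = True

    def apply(self, delta: int) -> None:
        """Run one operation on the primary, checkpointing periodically."""
        if not self.primary_alive:
            raise RuntimeError("primary has crashed; call failover() first")
        self.primary_state["counter"] += delta
        self.ops_since_checkpoint += 1
        if self.ops_since_checkpoint >= self.checkpoint_every:
            self.checkpoint()

    def checkpoint(self) -> None:
        """Copy the full primary state to the backup."""
        self.backup_state = copy.deepcopy(self.primary_state)
        self.ops_since_checkpoint = 0

    def crash(self) -> None:
        """Simulate a crash fault on the primary."""
        self.primary_alive = False

    def failover(self) -> None:
        """Promote the backup: execution resumes from the last checkpoint."""
        self.primary_state = copy.deepcopy(self.backup_state)
        self.ops_since_checkpoint = 0
        self.primary_alive = True


if __name__ == "__main__":
    svc = CheckpointedService(checkpoint_every=2)
    for _ in range(5):
        svc.apply(1)
    svc.crash()
    svc.failover()
    print(svc.primary_state)  # work done after the last checkpoint is lost
```

Because the state is copied as a whole, the scheme needs no knowledge of what the application does, which is the generality argument made for VM-level approaches. The cost is that any work performed after the last checkpoint is lost on failover.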
Fault Tolerance Against Byzantine Failures in the Cloud

- Too expensive for practical use
  - High resource consumption costs: Table 7.2
- Fault handling method
  - Hardware; not running agreement protocol
  - Replicate server (state machine replication)
  - Replica executes same request in same order
  - Replicas must agree on ordering of requests
  - Service execution done, voting scheme devised
- Example: ZZ
  - Execution approach integrating existing BFT SMR and agreement protocols

Fault Tolerance as a Service in Cloud Computing

- Avoids drawbacks of previous solutions
  - Functions as independent modules
  - Associates metadata with each module
  - Characterizes fault tolerance properties
- Fault Tolerance Manager (FTM)
  - Conceptual architectural framework
  - Basis for fault tolerance as a service
  - Inserted as a dedicated service layer between the physical hardware and user applications along the virtualization layer
  - Built on service-oriented architecture principles

Figure 7.5 Architecture of the fault tolerance manager showing all the components.
What are the three main components?

Figure 7.6A Resource graph.
FTM resource manager generates a profile of all computing resources in the cloud and identifies five processing nodes {n1, ..., n5} ∈ N with the network topology.

Figure 7.6B Nodes selected by replication manager.
The FTM replication manager (RM) selects the node n1 for the primary replica and nodes n3 and n4, respectively, for two backup replicas so that a replica group can be formed.

Figure 7.6C Messaging infrastructure created (forms a replica group).
The FTM messaging manager establishes the infrastructure required for carrying out the checkpointing protocol and forms the replica group for the e-commerce service.

Figure 7.6D Failure detected at node n1.
When the FTM failure detection/prediction manager predicts a failure in node n1, it invokes the fault masking ft_unit that performs a live migration of the VM instance.

Figure 7.6E Fault masking performed: VM instance migrated to node n2.
High availability goals are satisfied using the FTM fault masking manager; however, the IaaS may be affected since the system now consists of four working nodes only.

Figure 7.6F Recovery manager brings node n1 back to working state.
FTM applies robust recovery mechanisms at node n1 to restore it to a normal working state, increasing the system’s overall lifetime.

Summary

- Correct and continuous system operation
  - Requires fault tolerance and resilience
- Cloud computing failure characteristics
  - Impact user application failure types
  - Arise from crash faults and Byzantine faults
- Properties driving fault tolerance solutions
  - Generality, agility, transparency, and reduced resource consumption costs
- Fault tolerance as a service: preferred
  - Leverages existing solutions and properties
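The voting step that follows replicated execution under Byzantine faults can be illustrated with a minimal sketch. It assumes that at most f replicas are faulty and accepts a reply only when at least f + 1 replicas report the same value, so at least one correct replica stands behind it. The function name and the sample replies are hypothetical, and the sketch does not model the agreement protocol that orders requests.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(replies: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Return the reply reported by at least f + 1 replicas, else None.

    Assumption: at most f replicas are faulty, so f + 1 matching replies
    must include one from a correct replica.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


if __name__ == "__main__":
    # Three replicas tolerate f = 1 faulty reply in this sketch.
    print(vote(["commit", "commit", "abort"], f=1))
    # No value reaches f + 1 matching replies here, so nothing is accepted.
    print(vote(["a", "b", "c"], f=1))
```

Returning None when no value reaches the threshold leaves the decision to the caller, for example to wait for further replies or to retry the request.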