Chapter 7 Fault Tolerance and Resilience in Cloud Computing Environments

Computer and Information Security Handbook
Chapter 07
Fault Tolerance and Resilience in Cloud Computing Environments
Ravi Jhawar
Università degli Studi di Milano
Vincenzo Piuri
Università degli Studi di Milano
Copyright © 2014, Elsevier Inc. All rights Reserved

Introduction
Massive cloud computing data centers
Meeting demands for cloud-based services
Leveraging economies of scale
Operating under abnormal conditions
Utilizing highly complex infrastructure
Reducing overall reliability and availability
Changing application risk dimension
Paramount issue: fault tolerance
Why are traditional methods of introducing fault tolerance not very effective?

Cloud Computing Fault Model
Faults, Errors, and Failures
Failure condition cycle: Fault -> Error -> Failure
 Fault: fundamental system impairment
 Error: invalid system state
 Failure: functionality or behavior not met
Fault tolerance
 Performing with failures present
Why must we clearly understand and define what constitutes correct system behavior?

Cloud Computing Fault Model
Architecture and Server Failure Behavior
Four distinct layers
 Physical resources: lowest layer
 Low layer failure produces most impact above
Server failure behavior: data results
 Infers need for robust fault tolerance
 Improved hard disk reliability
 Avoidance of failure-prone hard disks
Fault trees and Markov chain analysis
 Captures user’s perspective on failures
 Correlates component failures and boundaries
Cloud Computing Fault Model
Figure 7.1
Layered architecture of cloud computing.
Why does a failure in a given layer impact the services offered by the layers above it?
Cloud Computing Fault Model
Figure 7.2A
Fault tree characterizing server failures.
Failure/error in any component within any boundary may impact the top event representing a failure in a user’s application.
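As a rough illustration of how a fault tree of this kind can be evaluated, the sketch below combines component failure probabilities through OR and AND gates. It assumes the failures are statistically independent. The component names and probabilities are invented for the example and are not taken from Figure 7.2A.

```python
from math import prod


def or_gate(probs):
    """Output event occurs if ANY input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Output event occurs only if ALL input events occur (independent inputs)."""
    return prod(probs)


# Hypothetical per-year failure probabilities for the components of one server.
server = {"disk": 0.02, "memory": 0.01, "power_supply": 0.005}

# A single server fails if any one of its components fails.
p_server = or_gate(server.values())

# An application replicated on two such servers fails only if both servers fail.
p_app = and_gate([p_server, p_server])

print(f"server: {p_server:.4f}, replicated application: {p_app:.6f}")
```

The AND gate at the top is what replication buys: the top event needs every replica to fail, so its probability drops well below that of a single server as long as the replicas really do fail independently.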
Cloud Computing Fault Model
Figure 7.2B
Fault tree characterizing power failures.
Defining boundaries. Power supply failure may come from
the server, or the hypervisor, or the VM instance itself.

Cloud Computing Fault Model
Failure Behavior of the Network
Knowledge required
 Network topology and network components
Failure types
 Device connection down for a specific interface
 Device not routing/forwarding packets correctly
Network failure behavior: data results
 Data center network reliability
  99.99 percent for 80 percent of the links and 60 percent of the devices
 Fault trees reveal application failure impact
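To put the 99.99 percent figure in perspective, the short calculation below converts a percentage into expected downtime per year. It assumes the figure can be read as steady-state availability, which is an interpretation made for this example rather than something the slide states.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected yearly downtime for a given steady-state availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in [("99.9 percent", 0.999), ("99.99 percent", 0.9999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.1f} minutes per year")
```

Under that reading, a link at 99.99 percent is unavailable for a little under an hour per year, ten times less than one at 99.9 percent.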
Cloud Computing Fault Model
Figure 7.3A
Partial network architecture of a data center.
Servers are connected using a set of network switches and routers.
Cloud Computing Fault Model
Figure 7.3B
Fault tree characterizing network failures.
Boundaries on the impact of each network failure are represented (using server,
cluster, and data center level blocks) and further used to increase the fault tolerance of
the user’s application (by placing replicas of an application in different failure zones).
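Placing replicas in different failure zones can be sketched as a simple greedy selection. The function name `place_replicas`, the node names, and the cluster assignments are hypothetical and only illustrate the idea. A real placement would also weigh capacity, latency, and cost.

```python
def place_replicas(zone_of, count):
    """Greedily pick `count` nodes so that no two share a failure zone.

    zone_of maps a node name to its failure zone (for example a cluster id).
    Raises ValueError if there are fewer distinct zones than replicas.
    """
    chosen, used = [], set()
    for node in sorted(zone_of):          # sorted for a deterministic result
        if len(chosen) == count:
            break
        if zone_of[node] not in used:
            chosen.append(node)
            used.add(zone_of[node])
    if len(chosen) < count:
        raise ValueError("not enough independent failure zones")
    return chosen


# Hypothetical topology: five nodes spread over three clusters.
zone_of = {"n1": "cluster-a", "n2": "cluster-a", "n3": "cluster-b",
           "n4": "cluster-c", "n5": "cluster-c"}
print(place_replicas(zone_of, 3))   # prints ['n1', 'n3', 'n4']
```

With this placement, a failure confined to one cluster (for instance a top-of-rack switch) can take down at most one replica.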

Basic Concepts on Fault Tolerance
End user fault types
 Crash and Byzantine
Widely adopted fault tolerance methods
 Checking and monitoring, checkpoint and restart, replication (active and passive), and variant replication systems
 Markov model analysis characterizes impact
Fault tolerance factors
 Implementation complexity, resource costs, resilience, performance metrics
 Balance models, consumption, performance
Basic Concepts on Fault Tolerance
Figure 7.4A
Markov model of a system with two replicas in active/semiactive replication scheme.
Effective means of deriving the reliability and availability of the system because the failure
behavior of both replicas can be taken into account.
Basic Concepts on Fault Tolerance
Figure 7.4B
Markov model of a system with two replicas in passive replication scheme.
λ denotes the failure rate, μ denotes the recovery rate, and k is a constant.
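A minimal sketch of how availability can be derived from a Markov model of two replicas follows. It does not reproduce the transitions of Figures 7.4A or 7.4B (in particular, the constant k is not modeled). It assumes a three-state chain indexed by the number of working replicas, with failures 2 -> 1 at rate 2λ and 1 -> 0 at rate λ, one repair at a time at rate μ, and a system that counts as available while at least one replica works.

```python
def two_replica_availability(lam, mu):
    """Steady-state availability of the assumed three-state chain.

    Balance equations across each cut of the chain:
        pi2 * 2*lam = pi1 * mu      (between 2 and 1 working replicas)
        pi1 * lam   = pi0 * mu      (between 1 and 0 working replicas)
    """
    rho = lam / mu
    w2, w1, w0 = 1.0, 2.0 * rho, 2.0 * rho ** 2   # unnormalized state weights
    return (w2 + w1) / (w2 + w1 + w0)


def single_server_availability(lam, mu):
    """Two-state up/down model, for comparison."""
    return mu / (lam + mu)


# Hypothetical rates: one failure per 1000 hours, ten hours to recover.
lam, mu = 1 / 1000, 1 / 10
print(f"single server: {single_server_availability(lam, mu):.6f}")
print(f"two replicas:  {two_replica_availability(lam, mu):.6f}")
```

The same approach extends to other schemes by changing the states and transition rates; only the balance equations need to be rewritten.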

Different Levels of Fault Tolerance in Cloud Computing
Deployment scenarios (replica location)
 Critical for fault tolerance mechanisms
 Three types
  Multiple machines within the same cluster
  Multiple clusters within a data center
  Multiple data centers
System availability impact variable: Table 7.1
 Replication technique at deployment scenario
Virtualization-based approaches
 Achieve at least one required property
 Deal with both classes of faults

Fault Tolerance Against Crash Failures in Cloud Computing
Utilize scheme leveraging virtualization to tolerate crash faults transparently
 Encapsulate system or user application in a VM
 Operations performed at the VM level
Advantage: increased level of generality
 Scheme independent of application and underlying hardware
Example: Remus application
 Works in four phases
 Built on Xen hypervisor’s live migration machinery
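The slide does not list the four phases. As the Remus design is usually described, each checkpoint epoch pauses the VM and copies its changed state, transmits that state to the backup host, waits for the backup's acknowledgment, and only then releases the network output buffered during the epoch. The toy model below mimics that cycle with plain dictionaries. The class and method names are invented for illustration and have nothing to do with the real Xen or Remus interfaces.

```python
class Backup:
    """Holds the last checkpointed state of the protected VM."""
    def __init__(self):
        self.state = {}

    def receive(self, delta):
        self.state.update(delta)
        return True  # acknowledgment sent back to the primary


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.state = {}
        self.dirty = {}        # changes made since the last checkpoint
        self.buffered = []     # outputs withheld from clients
        self.released = []     # outputs clients have actually seen

    def execute(self, key, value):
        self.state[key] = value
        self.dirty[key] = value
        self.buffered.append(f"{key}={value}")

    def checkpoint(self):
        delta, self.dirty = self.dirty, {}      # phase 1: pause, copy changed state
        acked = self.backup.receive(delta)      # phases 2-3: transmit, wait for ack
        if acked:                               # phase 4: release buffered output
            self.released.extend(self.buffered)
            self.buffered.clear()


backup = Backup()
primary = Primary(backup)
primary.execute("x", 1)
primary.execute("y", 2)
primary.checkpoint()
primary.execute("x", 3)   # not yet checkpointed, so its output stays buffered

# If the primary crashed now, the backup would resume from {"x": 1, "y": 2},
# which matches everything clients were allowed to observe.
print(backup.state, primary.released)
```

Withholding output until the checkpoint is acknowledged is the design choice that makes failover transparent: clients never see a result that the backup could not reproduce.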

Fault Tolerance Against Byzantine Failures in the Cloud
Too expensive for practical use
 High resource consumption costs: Table 7.2
Fault handling method
 Hardware; not running agreement protocol
 Replicate server (state machine replication)
 Replica executes same request in same order
 Replicas must agree on ordering of requests
 Service execution done, voting scheme devised
Example: ZZ
 Execution approach integrating existing BFT SMR and agreement protocols
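The combination of ordered execution and voting can be sketched as follows. The example assumes 2f + 1 execution replicas and accepts a reply once f + 1 replicas report it. These are the usual textbook thresholds for state machine replication, not numbers given on the slide, and the agreement protocol itself is represented only by a pre-agreed request list.

```python
from collections import Counter


class Replica:
    """Deterministic counter service; a faulty replica returns a wrong reply."""
    def __init__(self, faulty=False):
        self.total = 0
        self.faulty = faulty

    def execute(self, amount):
        self.total += amount
        return -1 if self.faulty else self.total


def vote(replies, f):
    """Return the reply given by at least f + 1 replicas, or None if there is none."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


f = 1
replicas = [Replica(), Replica(faulty=True), Replica()]   # 2f + 1 execution replicas
agreed_order = [5, 10, -3]    # stands in for the output of the agreement step

results = []
for request in agreed_order:
    replies = [r.execute(request) for r in replicas]
    results.append(vote(replies, f))
print(results)   # prints [5, 15, 12]
```

Because every correct replica applies the same requests in the same order, the correct replies match, and a single faulty replica is outvoted.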



Fault Tolerance as a Service in Cloud Computing
Avoids drawbacks of previous solutions
Functions as independent modules
Associates metadata with each module
 Characterizes fault tolerance properties
Fault Tolerance Manager (FTM)
 Conceptual architectural framework
 Basis for fault tolerance as a service
 Inserted as a dedicated service layer between the physical hardware and user applications along the virtualization layer
 Built on service-oriented architecture principles
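One way to picture modules that carry metadata describing their fault tolerance properties is a small catalog that can be queried against a user's requirements. Everything in the sketch below (the `FtUnit` class, its fields, the catalog entries, and the replica counts) is a hypothetical illustration, not the FTM's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FtUnit:
    name: str
    tolerates: frozenset   # fault classes the module handles
    replicas: int          # rough resource cost, in VM instances


def select_unit(units, required_faults, max_replicas):
    """Pick the cheapest module whose metadata satisfies the requirements."""
    candidates = [u for u in units
                  if required_faults <= u.tolerates and u.replicas <= max_replicas]
    return min(candidates, key=lambda u: u.replicas, default=None)


# Hypothetical catalog of fault tolerance modules.
catalog = [
    FtUnit("checkpoint_restart", frozenset({"crash"}), 1),
    FtUnit("passive_replication", frozenset({"crash"}), 2),
    FtUnit("bft_replication", frozenset({"crash", "byzantine"}), 4),
]

choice = select_unit(catalog, {"byzantine"}, max_replicas=4)
print(choice.name if choice else "no module satisfies the request")
```

Keeping the properties in metadata lets the selection logic stay independent of how each module is implemented.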
Fault Tolerance as a Service in Cloud Computing
Figure 7.5
Architecture of the fault tolerance manager showing all the components.
What are the three main components?
Fault Tolerance as a Service in Cloud Computing
Figure 7.6A
Resource graph.
FTM resource manager generates a profile of all computing resources in the cloud
and identifies five processing nodes {n1, ..., n5} ∈ N with the network topology.
Fault Tolerance as a Service in Cloud Computing
Figure 7.6B
Nodes selected by replication manager.
The FTM replication manager (RM) selects the node n1 for the primary
replica and nodes n3 and n4, respectively, for two backup replicas so that a
replica group can be formed.
Fault Tolerance as a Service in Cloud Computing
Figure 7.6C
Messaging infrastructure created (forms a replica group).
The FTM messaging manager establishes the infrastructure required for carrying out
the checkpointing protocol and forms the replica group for the e-commerce service.
Fault Tolerance as a Service in Cloud Computing
Figure 7.6D
Failure detected at node n1.
When the FTM failure detection/prediction manager predicts a failure in node n1, it
invokes the fault masking ft_unit that performs a live migration of the VM instance.
Fault Tolerance as a Service in Cloud Computing
Figure 7.6E
Fault masking performed: VM instance migrated to node n2.
High availability goals are satisfied using the FTM fault masking manager;
however, the IaaS may be affected since the system now consists of four working
nodes only.
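The detect-then-mask sequence of Figures 7.6D and 7.6E can be approximated with a heartbeat timeout and a placement map. This is a stand-in chosen for brevity: the FTM's detection/prediction manager is not specified at this level of detail, live migration is reduced here to updating a dictionary, and the timestamps and function names are hypothetical.

```python
def suspected_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heartbeat is older than `timeout`."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)


def mask_faults(placement, last_heartbeat, now, timeout):
    """Move every VM hosted on a suspected node to a healthy, unused node."""
    suspects = set(suspected_nodes(last_heartbeat, now, timeout))
    in_use = set(placement.values())
    spares = [n for n in sorted(last_heartbeat)
              if n not in suspects and n not in in_use]
    new_placement = dict(placement)
    for vm, node in sorted(placement.items()):
        if node in suspects:
            if not spares:
                raise RuntimeError("no healthy spare node available")
            new_placement[vm] = spares.pop(0)   # stands in for a live migration
    return new_placement


# Hypothetical snapshot: n1 has been silent for 10 time units.
placement = {"primary": "n1", "backup-1": "n3", "backup-2": "n4"}
last_heartbeat = {"n1": 90, "n2": 99, "n3": 99, "n4": 98, "n5": 99}

print(mask_faults(placement, last_heartbeat, now=100, timeout=5))
```

With these inputs the primary moves from n1 to n2 while the backups stay where they are, which matches the outcome shown in Figure 7.6E.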
Fault Tolerance as a Service in Cloud Computing
Figure 7.6F
Recovery manager brings node n1 back to a working state.
FTM applies robust recovery mechanisms at node n1 to return it to a normal
working state, increasing the system’s overall lifetime.
Summary
Correct and continuous system operation
 Requires fault tolerance and resilience
Cloud computing failure characteristics
 Impact user application failure types
 Arise from crash faults and Byzantine faults
Properties driving fault tolerance solutions
 Generality, agility, transparency, and reduced resource consumption costs
Fault tolerance as a service: preferred
 Leverages existing solutions and properties