Outline

• Announcements

• Fault Tolerance

Announcements

• Class evaluation at the beginning of next class

– Please arrive on time so that we still have enough time to cover the required material

• Discussions

– Homework #4

– Quiz #2

• Decisions

– Final exam: open book or closed book?

– Lab 2: Extension?

– Quiz #3: A week from today

April 10, 2020 COP 5611 - Operating Systems 2

Motivations

• A system is fault-tolerant

– If it can mask failures

• It continues to perform its specified function in the event of a failure

• Mainly through redundancy

– Or it exhibits a well defined failure behavior in the event of failure

• Example: distributed commit, where either all sites commit a particular operation or none of them does

Fault Tolerance Through Redundancy

• The key approach to fault tolerance is redundancy

– Three kinds of redundancy

• Information redundancy

• Time redundancy

• Physical redundancy

– A system can have

• Multiple processes

• Multiple hardware components

• Multiple copies of data
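
As a rough illustration of the three kinds of redundancy, the sketch below pairs each with a small Python function. The function names and the specific mechanisms (a parity bit, a retry loop, majority voting over replica outputs) are assumptions chosen for illustration; they are not taken from the lecture.

```python
from collections import Counter


def add_parity(bits):
    """Information redundancy: append an even-parity bit to the data bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0


def retry(action, attempts=3):
    """Time redundancy: repeat an action until it succeeds or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except RuntimeError as err:  # treated here as a transient fault
            last_error = err
    raise last_error


def majority(outputs):
    """Physical redundancy: take the value produced by most of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority among replica outputs")
    return value
```

A single flipped bit makes `parity_ok` return False, and `majority` masks one faulty output out of three.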

Failure Resilient Processes

• A process is resilient if it masks failures and guarantees progress despite a certain number of system failures

• Backup processes

– In this approach, each resilient process is implemented by a primary process and one or more backup processes

– The state of the primary process is saved periodically

– If the primary terminates, one of the backup processes becomes active and takes over
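
The backup-process approach can be sketched as below. This is a minimal single-machine model, assuming a hypothetical `ResilientCounter` class whose `checkpoint` attribute stands in for saved state that a backup can read; a real system would keep the checkpoint on separate storage or a separate site.

```python
import copy


class ResilientCounter:
    """A primary process whose state is checkpointed so a backup can take over."""

    def __init__(self, checkpoint_every=2):
        self.state = {"count": 0}        # live state of the primary
        self.checkpoint = {"count": 0}   # last saved state (hypothetical stable copy)
        self.checkpoint_every = checkpoint_every
        self.ops_since_checkpoint = 0

    def increment(self):
        """Do one unit of work and save the state every few operations."""
        self.state["count"] += 1
        self.ops_since_checkpoint += 1
        if self.ops_since_checkpoint == self.checkpoint_every:
            self.checkpoint = copy.deepcopy(self.state)
            self.ops_since_checkpoint = 0

    def fail_over(self):
        """The primary has terminated: a backup resumes from the last checkpoint."""
        self.state = copy.deepcopy(self.checkpoint)
        self.ops_since_checkpoint = 0
        return self.state
```

With `checkpoint_every=2`, three increments followed by `fail_over()` resume at a count of 2: work done since the last checkpoint is lost and has to be redone.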

Failure Resilient Processes

– cont.

• Replicated execution

– Several processes execute the same program concurrently

– It can increase the reliability and availability

– It requires that all processes handle all requests in the same order

– Non-idempotent operations need special care, because executing them more than once changes the result
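
The two requirements above can be illustrated with a small sketch. The `Replica` class, the `add`/`mul` operations, and the use of request IDs to discard duplicates are assumptions made for this example.

```python
class Replica:
    """One of several processes executing the same program."""

    def __init__(self):
        self.value = 0
        self.applied = set()  # IDs of requests already executed

    def apply(self, request_id, op, operand):
        # Both operations are non-idempotent, so a duplicate delivery
        # of the same request is ignored rather than executed again.
        if request_id in self.applied:
            return
        self.applied.add(request_id)
        if op == "add":
            self.value += operand
        elif op == "mul":
            self.value *= operand
        else:
            raise ValueError(f"unknown operation: {op}")


# Every replica sees the same requests in the same order, including
# a duplicate of request 1.
requests = [(1, "add", 10), (2, "mul", 2), (1, "add", 10)]
a, b = Replica(), Replica()
for request in requests:
    a.apply(*request)
    b.apply(*request)

# A replica that sees the requests in a different order diverges.
c = Replica()
c.apply(2, "mul", 2)
c.apply(1, "add", 10)
```

Replicas `a` and `b` end in the same state, while `c`, which received the requests in a different order, ends with a different value.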

Distributed Commit

• The distributed commit problem requires that an operation be performed by every member of a process group or by none at all

– This is referred to as global atomicity

• Commit protocols

– Given that each site has a recovery strategy at the local level, commit protocols ensure that all the sites either commit or abort the transaction unanimously, even in the presence of multiple and repetitive failures

One-phase Commit Protocol

• One-phase commit protocol

– One site is designated as a coordinator

– The coordinator tells all the other processes whether or not to locally perform the operation in question

– This scheme, however, is not fault tolerant

Two-Phase Commit Protocol

• In this protocol, one of the processes acts as a coordinator

– Other processes are referred to as cohorts

• Cohorts are assumed to be executing at different sites

– Stable storage is available at each site

– The write-ahead log protocol is used

– There are two phases involved in the protocol
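
A minimal sketch of the two phases, assuming the common textbook form of the protocol: in Phase I the coordinator asks every cohort whether it can commit, and in Phase II it logs and announces the decision. The class and function names, the in-memory lists standing in for logs on stable storage, and the `can_commit` flag are assumptions for illustration; failures and timeouts are not modeled.

```python
class Cohort:
    """A cohort site; `can_commit` models whether its local work succeeded."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []          # stands in for the write-ahead log
        self.state = "active"

    def on_commit_request(self):
        if not self.can_commit:
            self.state = "aborted"
            return "ABORT"
        # Write-ahead rule: log UNDO and REDO information before replying.
        self.log += ["UNDO", "REDO"]
        return "AGREED"

    def on_decision(self, decision):
        self.log.append(decision)
        self.state = "committed" if decision == "COMMIT" else "aborted"
        return "ACK"


def two_phase_commit(coordinator_log, cohorts):
    # Phase I: ask every cohort whether it is able to commit.
    replies = [c.on_commit_request() for c in cohorts]
    decision = "COMMIT" if all(r == "AGREED" for r in replies) else "ABORT"
    # Phase II: log the decision first, then send it to all cohorts.
    coordinator_log.append(decision)
    acks = [c.on_decision(decision) for c in cohorts]
    if all(a == "ACK" for a in acks):
        coordinator_log.append("COMPLETE")
    return decision
```

A single cohort that cannot commit forces every site to abort, which is the global atomicity property the protocol is meant to provide.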

Two-Phase Commit Protocol

– cont.

• Site failures handling

– Suppose the coordinator crashes before having written the COMMIT record

• On recovery, the coordinator broadcasts an ABORT message to all the cohorts

– Suppose the coordinator crashes after writing the COMMIT record but before writing the COMPLETE record

• On recovery, the coordinator broadcasts a COMMIT message

– Suppose the coordinator crashes after writing the COMPLETE record

• On recovery, there is nothing to be done for the transaction
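
The three coordinator cases above amount to a lookup on the coordinator's log at recovery time. A minimal sketch, assuming the log is available as a list of record names (the function name and the list representation are hypothetical):

```python
def coordinator_recovery_action(log):
    """Return the message to broadcast on recovery, or None if nothing is needed."""
    if "COMPLETE" in log:
        return None        # the transaction has already finished
    if "COMMIT" in log:
        return "COMMIT"    # the decision was made; announce it again
    return "ABORT"         # crashed before the COMMIT record was written


# One call per case described above.
before_commit = coordinator_recovery_action([])
after_commit = coordinator_recovery_action(["COMMIT"])
after_complete = coordinator_recovery_action(["COMMIT", "COMPLETE"])
```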

Two-Phase Commit Protocol

– cont.

• Site failures handling (continued)

– If a cohort crashes in Phase I, the coordinator aborts the transaction because it does not receive a reply from the crashed cohort

– If a cohort crashes in Phase II (after writing its UNDO and REDO log)

• On recovery, the cohort will check with the coordinator whether to abort or to commit the transaction

Two-Phase Commit Protocol

– cont.

• Limitation

– It is a blocking protocol

• Whenever the coordinator fails, cohort sites will have to wait for its recovery

• This is undesirable as these sites may be holding locks on resources

• It cannot be used if transactions must be resilient to site failures

– This leads to non-blocking commit protocols

Non-blocking Commit Protocols

• To be non-blocking in the event of site failures

– Operational sites should agree on the outcome of the transaction by examining their local states

– Failed sites, upon recovery, must also reach the same conclusion regarding the outcome of the transaction as operational sites do

• Independent recovery refers to the situation in which a recovering site can decide the final outcome of the transaction based solely on its local state

Three-Phase Commit Protocol for Single Site Failure

Three-Phase Commit Protocol

– cont.

• Phase I - Identical to that of the two-phase commit protocol, except in the event of a site failure

– If a cohort fails, the coordinator times out waiting for its Agreed message, aborts the transaction, and sends Abort messages to all the cohorts

• Phase II - The coordinator sends a Prepare message to all the cohorts if all the cohorts have sent Agreed messages in Phase I

– Otherwise, it sends an Abort message

Three-Phase Commit Protocol

– cont.

• Phase III – On receiving acknowledgments to the Prepare messages from all the cohorts, the coordinator sends a Commit message to all the cohorts

– On receiving a Commit message, a cohort commits the transaction
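
The three phases can be sketched from the coordinator's point of view as follows. This is a minimal sketch with a crude timeout model, in which a crashed cohort simply returns no reply. The class and function names and the `Commit-Request` message name are assumptions for illustration, and coordinator failures and cohort-side timeouts are left out.

```python
class Cohort:
    def __init__(self, name, agrees=True, alive=True):
        self.name = name
        self.agrees = agrees
        self.alive = alive
        self.state = "initial"

    def receive(self, message):
        """Return the reply, or None when there is none (crashed cohort,
        or a message that needs no reply in this sketch)."""
        if not self.alive:
            return None
        if message == "Commit-Request":
            self.state = "wait" if self.agrees else "aborted"
            return "Agreed" if self.agrees else "Abort"
        if message == "Prepare":
            self.state = "prepared"
            return "Ack"
        if message in ("Commit", "Abort"):
            self.state = "committed" if message == "Commit" else "aborted"
            return None
        raise ValueError(f"unknown message: {message}")


def broadcast(cohorts, message):
    return [c.receive(message) for c in cohorts]


def three_phase_commit(cohorts):
    # Phase I: a missing Agreed reply is treated as a timeout and aborts.
    if any(r != "Agreed" for r in broadcast(cohorts, "Commit-Request")):
        broadcast(cohorts, "Abort")
        return "Abort"
    # Phase II: all cohorts agreed, so send Prepare and collect acknowledgments.
    if any(r != "Ack" for r in broadcast(cohorts, "Prepare")):
        # Assumption of this sketch: a missing acknowledgment also leads to abort.
        broadcast(cohorts, "Abort")
        return "Abort"
    # Phase III: every cohort acknowledged Prepare, so commit everywhere.
    broadcast(cohorts, "Commit")
    return "Commit"
```

The extra Prepare round means that no cohort commits until every cohort is known to have agreed, which is what separates this flow from the two-phase protocol.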

Three-Phase Commit Protocol

– cont.

• Theoretical results

– Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction

– There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites

– There exists no protocol resilient to network partitioning when messages are lost

– There exists no protocol resilient to multiple network partitions

Voting Protocols

• Distributed commit protocols are resilient to single site failures

– But they are not resilient to multiple site failures, communication failures, and network partitioning

• Voting protocols are more fault tolerant

– They allow data accesses under network failures, multiple site failures, and message losses without compromising the integrity of the data

– The basic idea is that each replica is assigned some number of votes and a majority of votes must be collected before a process can access a replica

Static Voting

• System model

– The replicas of files are stored at different sites

– Every file access operation requires that an appropriate lock be obtained

• The lock rule allows either “one writer and no readers” or “multiple readers and no writers”

– Every file is associated with a version number

• Indicates the number of times the file has been updated

• Version numbers are stored on stable storage

• Every write operation updates the file's version number

Static Voting

– cont.

• Basic idea

– Every replica is assigned a certain number of votes

• This information is stored on stable storage

– A read or write operation is permitted only if the requesting process collects enough votes: a read quorum for reads, or a write quorum for writes
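
A minimal sketch of quorum-based access with version numbers. It assumes the commonly used quorum conditions r + w > v and 2w > v, where v is the total number of votes; the lecture text here does not state them. The class names and the `reachable` flag (modeling sites cut off by a failure) are hypothetical, and locking is omitted.

```python
class Replica:
    def __init__(self, votes, reachable=True):
        self.votes = votes
        self.version = 0      # number of updates applied to this copy
        self.data = None
        self.reachable = reachable


class VotedFile:
    def __init__(self, replicas, read_quorum, write_quorum):
        total = sum(r.votes for r in replicas)
        # Assumed conditions: every read quorum overlaps every write quorum,
        # and two disjoint groups cannot both hold a write quorum.
        if read_quorum + write_quorum <= total or 2 * write_quorum <= total:
            raise ValueError("quorums do not guarantee overlap")
        self.replicas = replicas
        self.read_quorum = read_quorum
        self.write_quorum = write_quorum

    def _collect(self, quorum):
        """Gather the reachable replicas and check that their votes suffice."""
        group = [r for r in self.replicas if r.reachable]
        if sum(r.votes for r in group) < quorum:
            raise RuntimeError("quorum not reached")
        return group

    def read(self):
        group = self._collect(self.read_quorum)
        # The copy with the highest version number is the current one.
        return max(group, key=lambda r: r.version).data

    def write(self, data):
        group = self._collect(self.write_quorum)
        new_version = max(r.version for r in group) + 1
        for r in group:
            r.data, r.version = data, new_version
```

With votes 2, 1, 1 and quorums r = 2 and w = 3, losing the two-vote site still allows reads but blocks writes, and a later read that reaches only one up-to-date copy still returns the latest data because that copy has the highest version number.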

Vote Assignment

Vote Assignment Examples

Reliable Communication

• In a system using replicated data, it is important that data managers behave identically

– The data managers are required to have an identical view of the events

• Atomic broadcast

Summary

• Fault tolerance means masking failures or behaving in a well-defined way when failures occur

– The key approach to failure masking is through redundancy

• Failure resilient processes

– Distributed commit protocols guarantee the global atomicity

• Either all sites will commit an operation or none of them will
