Outline

• Announcements

• Fault Tolerance


Announcements

• Class evaluation at the beginning of next class

– Please arrive on time so that we still have enough time to cover the required material

• Discussions

– Homework #4

– Quiz #2

• Decisions

– Final exam: open book or closed book?

– Lab 2: Extension?

– Quiz #3: A week from today

April 10, 2020 COP 5611 - Operating Systems 2

Motivations

• A system is fault-tolerant

– If it can mask failures

• It continues to perform its specified function in the event of a failure

• Mainly through redundancy

– Or it exhibits a well-defined failure behavior in the event of failure

• Distributed commit: either all sites commit a particular operation or none of them does


Fault Tolerance Through Redundancy

• The key approach to fault tolerance is redundancy

– Three kinds of redundancy

• Information redundancy

• Time redundancy

• Physical redundancy

– A system can have

• Multiple processes

• Multiple hardware components

• Multiple copies of data
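The three kinds of redundancy can each be illustrated with a few lines of code. This is only a sketch: the function names (add_parity, parity_ok, retry, majority) are made up for illustration, and the choice of a parity bit, a retry loop, and majority voting as representatives of each kind is an assumption, not something the slides specify.

```python
from collections import Counter


def add_parity(bits):
    """Information redundancy: append an even-parity bit to a list of 0/1 bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A word with an even number of 1s passes the parity check."""
    return sum(word) % 2 == 0


def retry(operation, attempts=3):
    """Time redundancy: repeat an operation until it succeeds or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except RuntimeError as err:   # treated here as a transient fault
            last_error = err
    raise last_error


def majority(results):
    """Physical redundancy: accept the answer given by most replicated components."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value
```

A single flipped bit makes parity_ok fail, a fault that disappears on a later attempt is hidden by retry, and one wrong answer out of three is outvoted by majority.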


Failure Resilient Processes

• A process is resilient if it masks failures and guarantees progress despite a certain number of system failures

• Backup processes

– In this approach, each resilient process is implemented by a primary process and one or more backup processes

– The state of the primary process is saved periodically

– If the primary terminates, one of the backup processes becomes active and takes over
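The backup-process approach can be sketched as follows. This is a minimal illustration, not a real implementation: the class name ResilientCounter, the checkpoint interval, and the use of an in-memory copy to stand in for the stored state are all assumptions.

```python
import copy


class ResilientCounter:
    """A primary whose state is saved every `interval` updates; a backup resumes from it."""

    def __init__(self, interval=2):
        self.interval = interval
        self.primary_state = {"total": 0, "applied": 0}
        # Stands in for the stored copy of the primary's state.
        self.checkpoint = copy.deepcopy(self.primary_state)

    def add(self, amount):
        """Apply one update on the primary, saving its state at intervals."""
        self.primary_state["total"] += amount
        self.primary_state["applied"] += 1
        if self.primary_state["applied"] % self.interval == 0:
            self.checkpoint = copy.deepcopy(self.primary_state)

    def fail_over(self):
        """The primary terminated: a backup takes over from the last saved state."""
        self.primary_state = copy.deepcopy(self.checkpoint)
        return self.primary_state
```

In this sketch, updates applied after the last saved state are lost on takeover. A real system would have to replay or re-request them, which the sketch does not attempt.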


Failure Resilient Processes

– cont.

• Replicated execution

– Several processes execute the same program concurrently

– It can increase reliability and availability

– It requires that all processes handle all requests in the same order

– Non-idempotent operations need special handling, since executing them more than once changes the result
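The two requirements on replicated execution can be made concrete with a small sketch. It assumes each request carries a sequence number assigned elsewhere, and the names (Replica, deliver, the "add" and "mul" operations) are hypothetical. Each replica buffers out-of-order requests and ignores duplicates, so a non-idempotent operation is applied exactly once.

```python
class Replica:
    """Applies requests strictly in sequence-number order, each at most once."""

    def __init__(self):
        self.value = 0
        self.next_seq = 0    # sequence number of the next request to apply
        self.pending = {}    # seq -> (op, operand), buffered until its turn

    def deliver(self, seq, op, operand):
        if seq < self.next_seq or seq in self.pending:
            return           # duplicate delivery: ignore it
        self.pending[seq] = (op, operand)
        # Apply every buffered request whose turn has come.
        while self.next_seq in self.pending:
            next_op, next_operand = self.pending.pop(self.next_seq)
            self._apply(next_op, next_operand)
            self.next_seq += 1

    def _apply(self, op, operand):
        if op == "add":
            self.value += operand
        elif op == "mul":
            self.value *= operand
        else:
            raise ValueError(f"unknown operation: {op}")
```

Because "add 10" followed by "mul 2" gives a different result than the reverse, two replicas that applied the requests in arrival order could diverge. With the ordering and duplicate check above they end in the same state.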


Distributed Commit

• The distributed commit problem involves having an operation performed by every member of a process group or by none at all

– This is referred to as global atomicity

• Commit protocols

– Given that each site has a recovery strategy at the local level, commit protocols ensure that all the sites either commit or abort the transaction unanimously, even in the presence of multiple and repetitive failures


One-phase Commit Protocol

• One-phase commit protocol

– One site is designated as a coordinator

– The coordinator tells all the other processes whether or not to locally perform the operation in question

– This scheme, however, is not fault tolerant: a process that cannot perform the operation has no way to tell the coordinator


Two-Phase Commit Protocol

• In this protocol, one of the processes acts as a coordinator

– Other processes are referred to as cohorts

• Cohorts are assumed to be executing at different sites

– Stable storage is available at each site

– The write-ahead log protocol is used

– There are two phases involved in the protocol
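The two phases can be sketched in a single process as below. This follows the usual textbook description of two-phase commit rather than anything spelled out in this text. The class and function names are hypothetical, plain lists stand in for the write-ahead log on stable storage, and message passing is replaced by direct method calls.

```python
class Cohort:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []            # stands in for the log on stable storage
        self.state = "INIT"

    def on_commit_request(self):
        """Phase I: log UNDO/REDO information before agreeing."""
        if self.can_commit:
            self.log.append("UNDO/REDO")
            self.state = "AGREED"
            return "AGREED"
        self.state = "ABORTED"
        return "ABORT"

    def on_decision(self, decision):
        """Phase II: carry out the coordinator's decision and acknowledge it."""
        self.log.append(decision)
        self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"
        return "ACK"


def two_phase_commit(cohorts):
    """Run both phases as the coordinator; return the decision and the coordinator log."""
    coordinator_log = []
    # Phase I: ask every cohort whether it is able to commit.
    replies = [c.on_commit_request() for c in cohorts]
    # Phase II: commit only if every cohort agreed; log the decision before sending it.
    decision = "COMMIT" if all(r == "AGREED" for r in replies) else "ABORT"
    coordinator_log.append(decision)
    acks = [c.on_decision(decision) for c in cohorts]
    if all(a == "ACK" for a in acks):
        coordinator_log.append("COMPLETE")
    return decision, coordinator_log
```

Writing the decision to the log before sending it is what lets a recovering coordinator work out what it had decided before the crash.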



Two-Phase Commit Protocol

– cont.

• Site failures handling

– Suppose the coordinator crashes before having written the COMMIT record

• On recovery, the coordinator broadcasts an ABORT message to all the cohorts

– Suppose the coordinator crashes after writing the COMMIT record but before writing the COMPLETE record

• On recovery, the coordinator broadcasts a COMMIT message

– Suppose the coordinator crashes after writing the COMPLETE record

• On recovery, there is nothing to be done for the transaction
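The three coordinator cases reduce to inspecting which records reached the log before the crash. A minimal sketch, assuming the log is available as a list of record names (the function name and return values are hypothetical):

```python
def coordinator_recovery_action(log):
    """Return the message to broadcast on recovery, or None if nothing is needed."""
    if "COMPLETE" in log:
        return None        # the transaction already finished
    if "COMMIT" in log:
        return "COMMIT"    # decision was logged but may not have reached all cohorts
    return "ABORT"         # crashed before the COMMIT record was written
```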



Two-Phase Commit Protocol

– cont.

• Site failures handling (continued)

– If a cohort crashes in Phase I, the coordinator aborts the transaction because it does not receive a reply from the crashed cohort

– If a cohort crashes in Phase II (after writing its UNDO and REDO log)

• On recovery, the cohort will check with the coordinator whether to abort or to commit the transaction
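Cohort recovery can be sketched the same way. Only the middle case below comes from the slide. The other two branches (a decision already in the log, or no log records at all) are assumptions added to make the function total, and ask_coordinator is a hypothetical callback standing in for a message exchange with the coordinator.

```python
def cohort_recovery_outcome(log, ask_coordinator):
    """Decide a recovering cohort's outcome from its log, asking the coordinator if needed."""
    # Assumption: a decision that was already logged is simply reapplied.
    for record in ("COMMIT", "ABORT"):
        if record in log:
            return record
    if "UNDO/REDO" in log:
        # Crashed in Phase II after agreeing: only the coordinator knows the outcome.
        return ask_coordinator()
    # Assumption: the cohort never agreed, so the coordinator must have aborted.
    return "ABORT"
```

For example, cohort_recovery_outcome(["UNDO/REDO"], lambda: "COMMIT") returns "COMMIT", while a cohort with an empty log aborts without asking.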



Two-Phase Commit Protocol

– cont.

• Limitation

– It is a blocking protocol

• Whenever the coordinator fails, cohort sites will have to wait for its recovery

• This is undesirable as these sites may be holding locks on resources

• It cannot be used if transactions must be resilient to site failures

– This leads to non-blocking commit protocols


Non-blocking Commit Protocols

• To be non-blocking in the event of site failures

– Operational sites should agree on the outcome of the transaction by examining their local states

– Failed sites, upon recovery, must also reach the same conclusion regarding the outcome of the transaction as operational sites do

• Independent recovery refers to the situation in which recovering sites can decide the final outcome of the transaction based solely on their local state


Three-Phase Commit Protocol

– cont.


Three-Phase Commit Protocol for Single Site Failure



Three-Phase Commit Protocol

– cont.

• Phase I - Identical to that of the two-phase commit protocol except in the event of a site’s failure

– If a cohort fails, the coordinator times out waiting for the Agreed message, aborts the transaction, and sends Abort messages to all the cohorts

• Phase II - The coordinator sends a Prepare message to all the cohorts if all the cohorts have sent an Agreed message in Phase I

– Otherwise, it sends an Abort message



Three-Phase Commit Protocol

– cont.

• Phase III – On receiving acknowledgments to the Prepare messages from all the cohorts, the coordinator sends a Commit message to all the cohorts

– On receiving a Commit message, a cohort commits the transaction
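The coordinator's side of the three phases can be sketched as follows. The names are hypothetical, direct method calls replace messages, and a cohort that has failed is modeled as returning no reply in Phase I so that the coordinator "times out". What the coordinator does when a Prepare acknowledgment is missing is not covered by the text above, so the sketch leaves that case undecided.

```python
class Cohort:
    def __init__(self, agrees=True, alive=True):
        self.agrees = agrees
        self.alive = alive
        self.state = "INIT"

    def on_commit_request(self):
        if not self.alive:
            return None              # no reply: the coordinator times out
        if self.agrees:
            self.state = "AGREED"
            return "AGREED"
        self.state = "ABORTED"
        return "ABORT"

    def on_prepare(self):
        self.state = "PREPARED"
        return "ACK"

    def on_commit(self):
        self.state = "COMMITTED"

    def on_abort(self):
        if self.alive:               # a failed cohort never sees the message
            self.state = "ABORTED"


def three_phase_commit(cohorts):
    # Phase I: collect Agreed messages; a missing or negative reply aborts.
    replies = [c.on_commit_request() for c in cohorts]
    if any(r != "AGREED" for r in replies):
        for c in cohorts:
            c.on_abort()
        return "ABORT"
    # Phase II: every cohort agreed, so send Prepare and collect acknowledgments.
    acks = [c.on_prepare() for c in cohorts]
    # Phase III: commit once every Prepare has been acknowledged.
    if all(a == "ACK" for a in acks):
        for c in cohorts:
            c.on_commit()
        return "COMMIT"
    return "UNDECIDED"               # not covered by the slide text
```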



Three-Phase Commit Protocol

– cont.

• Theoretical results

– Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction

– There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites

– There exists no protocol resilient to network partitioning when messages are lost

– There exists no protocol resilient to multiple network partitions


Voting Protocols

• Distributed commit protocols are resilient to single site failures

– But they are not resilient to multiple site failures, communication failures, and network partitioning

• Voting protocols are more fault tolerant

– They allow data accesses under network failures, multiple site failures, and message losses without compromising the integrity of the data

– The basic idea is that each replica is assigned some number of votes and a majority of votes must be collected before a process can access a replica
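The majority rule is easy to check numerically. In this sketch the function name, the replica names, and the vote counts are made up for illustration. A process may proceed only if the replicas it can reach hold more than half of all votes.

```python
def has_majority(votes_by_replica, reachable):
    """True if the reachable replicas together hold a strict majority of all votes."""
    total = sum(votes_by_replica.values())
    collected = sum(votes_by_replica[name] for name in reachable)
    return 2 * collected > total
```

With votes {"A": 2, "B": 1, "C": 1}, a partition containing A and B holds 3 of 4 votes and may proceed, while the other side, holding at most 1 vote, may not. Since two disjoint groups cannot both hold a strict majority, at most one side of a partition can access the data.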


Static Voting

• System model

– The replicas of files are stored at different sites

– Every file access operation requires that an appropriate lock is obtained

• The lock rule allows either “one writer and no readers” or “multiple readers and no writers”

– Every file is associated with a version number

• Indicates the number of times the file has been updated

• Version numbers are stored on stable storage

• Every write operation updates the file's version number


Static Voting

– cont.

• Basic idea

– Every replica is assigned a certain number of votes

• This information is stored on stable storage

– A read or write operation is permitted if a certain number of votes, read quorum or write quorum, are collected by the requesting process
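A minimal sketch of quorum-based reads and writes with version numbers follows. The names (FileReplica, read, write) are hypothetical and locking is omitted. The sketch also assumes the usual constraints from weighted voting, r + w > v and 2w > v, where v is the total number of votes; they are not stated in the text above but are what make a read quorum overlap the most recent write quorum.

```python
class FileReplica:
    def __init__(self, votes=1):
        self.votes = votes
        self.version = 0      # number of times this copy has been updated
        self.value = None


def _collected(replicas):
    return sum(r.votes for r in replicas)


def read(reachable, read_quorum):
    """Return the value of the most recent copy among a read quorum."""
    if _collected(reachable) < read_quorum:
        raise PermissionError("read quorum not reached")
    newest = max(reachable, key=lambda r: r.version)
    return newest.value


def write(reachable, write_quorum, value):
    """Install a new version on every reachable replica of a write quorum."""
    if _collected(reachable) < write_quorum:
        raise PermissionError("write quorum not reached")
    new_version = max(r.version for r in reachable) + 1
    for r in reachable:
        r.value = value
        r.version = new_version
```

With three one-vote replicas and r = w = 2, a write to replicas a and b followed by a read from b and c still returns the new value, because b belongs to both quorums and carries the highest version number.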



Vote Assignment


Vote Assignment Examples


Reliable Communication

• In a system using replicated data, it is important that data managers behave identically

– The data managers are required to have an identical view of the events

• Atomic broadcast


Summary

• Fault tolerance means masking failures or behaving in a well-defined way when failures occur

– The key approach to failure masking is through redundancy

• Failure resilient processes

– Distributed commit protocols guarantee global atomicity

• Either all sites commit an operation or none of them does

