• Announcements
• Fault Tolerance
• Class evaluation at the beginning of next class
– Please come on time so that we still have enough time to cover the materials we need to cover
• Discussions
– Homework #4
– Quiz #2
• Decisions
– Final exam: open book or closed book?
– Lab 2: Extension?
– Quiz #3: A week from today
April 10, 2020 COP 5611 - Operating Systems 2
• A system is fault-tolerant
– If it can mask failures
• It continues to perform its specified function in the event of a failure
• Mainly through redundancy
– Or it exhibits a well-defined failure behavior in the event of a failure
• Example: distributed commit, where either all sites commit a particular operation or none of them does
• The key approach to fault tolerance is redundancy
– Three kinds of redundancy
• Information redundancy
• Time redundancy
• Physical redundancy
– A system can have
• Multiple processes
• Multiple hardware components
• Multiple copies of data
• A process is resilient if it masks failures and guarantees progress despite a certain number of system failures
• Backup processes
– In this approach, each resilient process is implemented by a primary process and one or more backup processes
– The state of the primary process is saved at regular intervals
– If the primary terminates, one of the backup processes becomes active and takes over
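The backup-process approach can be sketched as follows. This is only an illustration under stated assumptions: the class names (`Process`, `ResilientProcess`), the in-memory `checkpoint` standing in for stable storage, and the fixed checkpoint interval are all hypothetical and not part of the lecture material.

```python
import copy


class Process:
    """A toy process whose whole state is a single dictionary."""

    def __init__(self, name):
        self.name = name
        self.state = {"counter": 0}
        self.alive = True

    def handle(self, amount):
        self.state["counter"] += amount


class ResilientProcess:
    """One primary plus backups; the primary's state is saved at intervals."""

    def __init__(self, backups=1, checkpoint_every=2):
        self.primary = Process("primary")
        self.backups = [Process(f"backup-{i}") for i in range(backups)]
        self.checkpoint_every = checkpoint_every
        self.checkpoint = copy.deepcopy(self.primary.state)
        self.handled = 0

    def request(self, amount):
        if not self.primary.alive:
            self.fail_over()
        self.primary.handle(amount)
        self.handled += 1
        if self.handled % self.checkpoint_every == 0:
            # Stand-in for writing the state to stable storage.
            self.checkpoint = copy.deepcopy(self.primary.state)

    def fail_over(self):
        # A backup becomes active and resumes from the last saved state.
        new_primary = self.backups.pop(0)
        new_primary.state = copy.deepcopy(self.checkpoint)
        self.primary = new_primary


rp = ResilientProcess(backups=1, checkpoint_every=2)
for amount in (1, 2, 3):
    rp.request(amount)        # state is saved after the second request
rp.primary.alive = False      # simulate termination of the primary
rp.request(10)                # the backup takes over from the saved state
print(rp.primary.name, rp.primary.state)   # backup-0 {'counter': 13}
```

Note that the third request (adding 3) was handled after the last save, so its effect is lost on takeover. A shorter interval loses less work but saves state more often.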
• Replicated execution
– Several processes execute the same program concurrently
– It can increase the reliability and availability
– It requires that all requests be processed at all processes in the same order
– Non-idempotent operations need special care, because executing them more than once changes the result
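A small sketch of why request order and idempotence matter for replicated execution. The operation names (`add`, `mul`, `set`) and the starting balance are hypothetical, chosen only for illustration.

```python
def apply_requests(requests, balance=100):
    """Run a deterministic program over a sequence of requests."""
    for op, value in requests:
        if op == "add":
            balance += value      # non-idempotent: repeating it changes the result
        elif op == "mul":
            balance *= value      # non-idempotent as well
        elif op == "set":
            balance = value       # idempotent: repeating it is harmless
    return balance


requests = [("add", 10), ("mul", 2)]

# Replicas that see the same requests in the same order agree.
replica_a = apply_requests(requests)
replica_b = apply_requests(requests)

# A replica that sees the same requests in a different order diverges.
replica_c = apply_requests(list(reversed(requests)))
print(replica_a, replica_b, replica_c)     # 220 220 210

# Executing a request twice is only safe when it is idempotent.
print(apply_requests([("add", 10)] * 2))   # 120, not 110
print(apply_requests([("set", 10)] * 2))   # 10
```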
• The distributed commit problem requires that an operation be performed by every member of a process group or by none at all
– This is referred to as global atomicity
• Commit protocols
– Given that each site has a recovery strategy at the local level, commit protocols ensure that all the sites either commit or abort the transaction unanimously, even in the presence of multiple and repeated failures
• One-phase commit protocol
– One site is designated as a coordinator
– The coordinator tells all the other processes whether or not to locally perform the operation in question
– This scheme, however, is not fault tolerant
• In the two-phase commit protocol, one of the processes acts as a coordinator
– Other processes are referred to as cohorts
• Cohorts are assumed to be executing at different sites
– Stable storage is available at each site
– The write-ahead log protocol is used
– There are two phases involved in the protocol
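A minimal, failure-free sketch of the two phases, assuming the usual textbook message flow. The message and record names (`AGREED`, `COMMIT`, `COMPLETE`, `UNDO`, `REDO`) follow the terms used in these notes. The classes, method names, and in-memory lists standing in for logs on stable storage are hypothetical. Timeouts and retransmission are not modeled.

```python
from enum import Enum


class Vote(Enum):
    AGREED = "agreed"
    ABORT = "abort"


class Cohort:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []          # stand-in for the write-ahead log on stable storage
        self.outcome = None

    def on_commit_request(self):
        if self.can_commit:
            # Log before replying, as the write-ahead log protocol requires.
            self.log += ["UNDO", "REDO"]
            return Vote.AGREED
        return Vote.ABORT

    def on_decision(self, decision):
        self.log.append(decision)
        self.outcome = decision


def two_phase_commit(cohorts, coordinator_log):
    # Phase I: ask every cohort whether it is able to commit.
    votes = [c.on_commit_request() for c in cohorts]

    # Phase II: commit only if every cohort agreed; otherwise abort.
    if all(v is Vote.AGREED for v in votes):
        decision = "COMMIT"
        coordinator_log.append("COMMIT")   # record the decision before sending it
    else:
        decision = "ABORT"
    for c in cohorts:
        c.on_decision(decision)            # each call doubles as the acknowledgment
    coordinator_log.append("COMPLETE")     # all cohorts have acknowledged
    return decision


ok_log, bad_log = [], []
print(two_phase_commit([Cohort("a"), Cohort("b")], ok_log), ok_log)
print(two_phase_commit([Cohort("a"), Cohort("b", can_commit=False)], bad_log), bad_log)
```

The first call prints `COMMIT ['COMMIT', 'COMPLETE']`. The second prints `ABORT ['COMPLETE']`, because a single dissenting cohort forces every site to abort.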
• Site failure handling
– Suppose the coordinator crashes before having written the COMMIT record
• On recovery, the coordinator broadcasts an ABORT message to all the cohorts
– Suppose the coordinator crashes after writing the COMMIT record but before writing the COMPLETE record
• On recovery, the coordinator broadcasts a COMMIT message to all the cohorts
– Suppose the coordinator crashes after writing the COMPLETE record
• On recovery, there is nothing to be done for the transaction
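The three coordinator recovery cases above reduce to a lookup on the coordinator's log. In this sketch the log is modeled as a plain list of record names, which is an assumption made for illustration. The function name is hypothetical.

```python
def coordinator_recovery_action(log):
    """Return the message to broadcast on recovery, or None if nothing is pending."""
    if "COMPLETE" in log:
        return None        # the transaction finished before the crash
    if "COMMIT" in log:
        return "COMMIT"    # the decision was logged; send it to the cohorts again
    return "ABORT"         # no decision was logged; abort is the safe outcome


print(coordinator_recovery_action([]))                      # ABORT
print(coordinator_recovery_action(["COMMIT"]))              # COMMIT
print(coordinator_recovery_action(["COMMIT", "COMPLETE"]))  # None
```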
• Site failure handling (continued)
– If a cohort crashes in Phase I, the coordinator aborts the transaction because it does not receive a reply from the crashed cohort
– If a cohort crashes in Phase II (after writing its UNDO and REDO log)
• On recovery, the cohort checks with the coordinator whether to abort or to commit the transaction
• Limitation
– It is a blocking protocol
• Whenever the coordinator fails, cohort sites will have to wait for its recovery
• This is undesirable as these sites may be holding locks on resources
• It cannot be used if transactions must be resilient to site failures
– This leads to non-blocking commit protocols
• To be non-blocking in the event of site failures
– Operational sites should agree on the outcome of the transaction by examining their local states
– Failed sites, upon recovery, must also reach the same conclusion regarding the outcome of the transaction as operational sites do
• Independent recovery refers to the situation in which a recovering site can decide the final outcome of the transaction based solely on its local state
Three-Phase Commit Protocol for Single Site Failure
• Phase I is identical to that of the two-phase commit protocol, except in the event of a site failure
– If a cohort fails, the coordinator times out waiting for the Agreed message, aborts the transaction, and sends Abort messages to all the cohorts
• Phase II: the coordinator sends a Prepare message to all the cohorts if all the cohorts sent Agreed messages in Phase I
– Otherwise, it sends an Abort message
• Phase III – On receiving acknowledgments to the Prepare messages from all the cohorts, the coordinator sends a Commit message to all the cohorts
– On receiving a Commit message, a cohort commits the transaction
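The coordinator's side of the three phases can be sketched as below. The message names (`Agreed`, `Abort`, `Prepare`, `Commit`) come from the description above. The cohort state names (`wait`, `prepared`, and so on), the `responds` flag used to model a timeout, and all class and function names are assumptions for illustration. Failures after the Prepare message is sent are deliberately left out.

```python
class Cohort:
    def __init__(self, name, agrees=True, responds=True):
        self.name = name
        self.agrees = agrees
        self.responds = responds   # False models a failed cohort (coordinator times out)
        self.state = "initial"

    def on_commit_request(self):
        if not self.responds:
            return None            # no reply; the coordinator will time out
        if self.agrees:
            self.state = "wait"
            return "Agreed"
        self.state = "aborted"
        return "Abort"

    def on_prepare(self):
        self.state = "prepared"
        return "Ack"

    def on_commit(self):
        self.state = "committed"

    def on_abort(self):
        self.state = "aborted"


def three_phase_commit(cohorts):
    # Phase I: as in two-phase commit; a missing reply counts as a timeout.
    replies = [c.on_commit_request() for c in cohorts]
    if any(reply != "Agreed" for reply in replies):
        for c in cohorts:
            if c.responds:
                c.on_abort()
        return "Abort"

    # Phase II: every cohort agreed, so send Prepare and gather acknowledgments.
    acks = [c.on_prepare() for c in cohorts]
    if any(ack != "Ack" for ack in acks):
        raise NotImplementedError("failures after Prepare are not modeled here")

    # Phase III: all Prepare messages were acknowledged, so send Commit.
    for c in cohorts:
        c.on_commit()
    return "Commit"


healthy = [Cohort("a"), Cohort("b")]
print(three_phase_commit(healthy), [c.state for c in healthy])

one_failed = [Cohort("a"), Cohort("b", responds=False)]
print(three_phase_commit(one_failed), [c.state for c in one_failed])
```

The first run commits at both cohorts. In the second, cohort `b` never replies, so the coordinator aborts and cohort `a` ends in the `aborted` state while `b` stays in `initial`.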
• Theoretical results
– Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction
– There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites
– There exists no protocol resilient to network partitioning when messages are lost
– There exists no protocol resilient to multiple network partitioning
• Distributed commit protocols are resilient to single site failures
– But they are not resilient to multiple site failures, communication failures, and network partitioning
• Voting protocols are more fault tolerant
– They allow data accesses under network failures, multiple site failures, and message losses without compromising the integrity of the data
– The basic idea is that each replica is assigned some number of votes and a majority of votes must be collected before a process can access a replica
• System model
– The replicas of files are stored at different sites
– Every file access operation requires that an appropriate lock is obtained
• The lock rule allows either “one writer and no readers” or “multiple readers and no writers”
– Every file is associated with a version number
• Indicates the number of times the file has been updated
• Version numbers are stored on stable storage
• Every write operation updates the file's version number
• Basic idea
– Every replica is assigned a certain number of votes
• This information is stored on stable storage
– A read or write operation is permitted only if the requesting process collects enough votes to form a read quorum or a write quorum, respectively
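A sketch of quorum-based access using votes and version numbers. The quorum conditions checked here (r + w > v and 2w > v, where v is the total number of votes) are the standard weighted-voting conditions and are assumed, since they are not stated in the text above. The class and function names are hypothetical, and locking is omitted. For simplicity, a quorum here is every reachable replica rather than a minimal subset.

```python
class Replica:
    def __init__(self, votes, value, version=0):
        self.votes = votes
        self.value = value
        self.version = version
        self.reachable = True


def _quorum(replicas, needed):
    """Return the reachable replicas if they hold at least `needed` votes."""
    reachable = [rep for rep in replicas if rep.reachable]
    if sum(rep.votes for rep in reachable) < needed:
        raise RuntimeError("quorum not reached")
    return reachable


def read(replicas, r):
    # The copy with the highest version number in the quorum is current.
    return max(_quorum(replicas, r), key=lambda rep: rep.version).value


def write(replicas, w, value):
    quorum = _quorum(replicas, w)
    new_version = max(rep.version for rep in quorum) + 1
    for rep in quorum:
        rep.value, rep.version = value, new_version


replicas = [Replica(2, "old"), Replica(1, "old"), Replica(1, "old")]
v = sum(rep.votes for rep in replicas)   # 4 votes in total
r, w = 2, 3
assert r + w > v and 2 * w > v           # assumed quorum conditions

replicas[2].reachable = False
write(replicas, w, "new")                # 2 + 1 = 3 votes: write quorum reached

replicas[0].reachable = False
replicas[2].reachable = True
print(read(replicas, r))                 # prints "new"
```

Because r + w > v, any read quorum overlaps the last write quorum, so the read sees at least one current copy even though one reachable replica is stale. With only 2 votes reachable, a further write would be refused.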
• In a system using replicated data, it is important that data managers behave identically
– The data managers are required to have an identical view of the events
• Atomic broadcast
• Fault tolerance means masking failures or behaving in a well-defined way when failures occur
– The key approach to failure masking is through redundancy
• Failure resilient processes
– Distributed commit protocols guarantee the global atomicity
• Either all sites will commit an operation or none of them will