Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems

Some Axioms



Some simple systems, designed from
scratch, sometimes work
A complex system that works is invariably
found to have evolved from a simple system
that works
A complex system, designed from scratch
never works
Failure-Mode Theorems



Complex systems usually operate in failure
mode
A system should have safe behaviors when
encountering failures
When a “fail-safe” system fails, it fails by
failing to fail safe
Some definitions

A failure of a system occurs when the system
does not perform its services in the manner
specified
  Sometimes failures are subtle (e.g., performance
  fault)
A fault is an anomalous physical condition
  Includes system specification/implementation
  mistakes
An error is the part of the system state that
differs from its intended value
Classification of Failures




Process failures
System failures
Secondary storage failures
Communication medium failures
Process Failures

Examples
  Computation results in incorrect outcome
  System state deviates from specification
  Process fails to progress
Errors leading to failure
  Deadlock, timeout, protection violation
  Bad input, consistency violation
Ignoring malicious behavior
System Failures

Processor fails to execute



Software error, hardware error (CPU, bus, etc.)
Fail-stop behavior assumed
Failure types




Amnesia
Partial-amnesia
Pause
Halting
Secondary Storage Failures

Stored data inaccessible





Parity error
Head crash
Contaminated medium
Reconstructable from archive + log, maybe
Mirrored disks (independent failure mode)
Communication Medium Failures


Site can’t communicate with another site
Causes
  Switching node failure
    Hardware failure
    Software failure
    Congestion
  Link failure
    Hardware
    Implementation failure
Network partitions can result
Recovery




Restart process/processor
Reclaim resources
Undo/finish incomplete transactions
Concurrency makes things harder
Forward Error Recovery


Goal: To restore system from erroneous
state to error-free state
If nature of error is completely known



Remove error from state
Proceed with execution from error-free state
Rarely possible to do
Backward Error Recovery

When error source unknown
  Restore state to previous error-free state; restart
  Independent of the fault and the errors it causes
Problems
  Performance penalty
  No guarantee fault will not recur
  Possible unrecoverable component of state
Recovery point: saved state to which the erroneous
state is restored
Backward Error Recovery

Basic approaches

Operation-based

Logs



Update-in-place
Write-ahead-log
State-based
Update-in-Place

Every update to an object also writes a log record containing
  Name of object
  Old and new states of object
A recoverable update operation is implemented as
  Do, undo, redo operations
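The do/undo/redo idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `LogRecord` and `RecoverableStore` are hypothetical names, the store and its log live in memory rather than on stable storage, and a missing object is represented by `None`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class LogRecord:
    name: str   # name of the object that was updated
    old: Any    # state before the update (used by undo)
    new: Any    # state after the update (used by redo)


class RecoverableStore:
    """Hypothetical in-memory object store with an update log."""

    def __init__(self) -> None:
        self.objects: Dict[str, Any] = {}
        self.log: List[LogRecord] = []

    def do(self, name: str, value: Any) -> LogRecord:
        # Perform the update in place and record old and new states.
        record = LogRecord(name, self.objects.get(name), value)
        self.objects[name] = value
        self.log.append(record)
        return record

    def undo(self, record: LogRecord) -> None:
        # Restore the state the object had before the update.
        self.objects[record.name] = record.old

    def redo(self, record: LogRecord) -> None:
        # Reapply the update from the logged new state.
        self.objects[record.name] = record.new


store = RecoverableStore()
store.do("balance", 100)
rec = store.do("balance", 70)
store.undo(rec)
after_undo = store.objects["balance"]
store.redo(rec)
after_redo = store.objects["balance"]
```

Because each record carries both the old and the new state, the same record can drive either direction of recovery.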
Write-ahead Log




Update-in-place has a problem if a crash occurs
between the update and the log record reaching
stable storage
Update object only after undo log recorded
Before committing updates, record both redo
and undo logs
Expensive to write log to stable storage
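A minimal write-ahead-log sketch, under stated assumptions: `WriteAheadStore` and `recover` are hypothetical names, an ordinary file forced with `os.fsync` stands in for stable storage, and transactions run one at a time. The log entry is written before the object is touched, so a crash at any point leaves enough information to undo or redo.

```python
import json
import os
import tempfile


class WriteAheadStore:
    """Hypothetical key-value store; a file stands in for stable storage."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.objects = {}

    def _append(self, entry):
        # Force the log entry to stable storage before continuing.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def update(self, txn, name, value):
        # Log undo (old) and redo (new) information first, then update.
        self._append({"txn": txn, "name": name,
                      "old": self.objects.get(name), "new": value})
        self.objects[name] = value

    def commit(self, txn):
        self._append({"txn": txn, "commit": True})


def recover(log_path, objects):
    """Redo committed updates, then undo uncommitted ones.

    `objects` is the possibly inconsistent state found after a crash.
    """
    with open(log_path) as f:
        entries = [json.loads(line) for line in f]
    committed = {e["txn"] for e in entries if e.get("commit")}
    updates = [e for e in entries if "name" in e]
    for e in updates:                      # redo pass, oldest first
        if e["txn"] in committed:
            objects[e["name"]] = e["new"]
    for e in reversed(updates):            # undo pass, newest first
        if e["txn"] not in committed:
            objects[e["name"]] = e["old"]
    return objects


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WriteAheadStore(log_path)
store.update("t1", "a", 1)
store.commit("t1")
store.update("t2", "a", 2)   # t2 never commits: simulated crash
recovered = recover(log_path, dict(store.objects))
```

The `fsync` on every append is where the cost noted above comes from: each logged update waits for stable storage.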
State-Based Recovery

Save entire process state at recovery point




Recovery point called checkpoint
Rolling back process: restoring to checkpoint
Tradeoff: frequent checkpoints vs. completion
delay
Shadow pages


Save unmodified page copy on stable storage
Update only volatile copy; discard on rollback
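Checkpointing and rolling back can be sketched as follows. This is an illustration only: `CheckpointedProcess` is a hypothetical class, the process state is assumed to be a plain dict, and a deep copy held in memory stands in for the copy on stable storage.

```python
import copy


class CheckpointedProcess:
    """Hypothetical process whose whole state is a Python dict."""

    def __init__(self, state):
        self.state = state
        self.checkpoint = copy.deepcopy(state)  # stands in for stable storage

    def take_checkpoint(self):
        # Save the entire state as the new recovery point.
        self.checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        # Discard the current state and restore the checkpoint.
        self.state = copy.deepcopy(self.checkpoint)


p = CheckpointedProcess({"step": 0, "total": 0})
for step in range(1, 6):
    p.state["step"] = step
    p.state["total"] += step
    if step == 3:
        p.take_checkpoint()
p.roll_back()   # work done in steps 4 and 5 is lost
```

The tradeoff is visible here: a checkpoint at every step would lose no work on rollback, but each one copies the full state and delays completion.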
Concurrent Systems Recovery

Rollback issues




Orphan messages
Domino effect
Lost messages
Livelocks
Orphan Messages
[Figure: timelines of processes X, Y, and Z with recovery points x1, x2, x3 on X, y1, y2 on Y, and z1, z2 on Z (each marked "["), and a message m]
Domino Effect

Suppose Y rolls back to y2
  m is an orphan message
  Process X must roll back to x2
Suppose Z rolls back to z2
  Y rolls back to y1
  Forcing Z to roll back to z1
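The cascade can be simulated with a small fixpoint loop. This is a sketch under assumptions, not the figure's exact scenario: `propagate_rollback` is a hypothetical helper, events are stamped with made-up global times, and every process is assumed to have an initial checkpoint at time 0.

```python
def propagate_rollback(checkpoints, messages, start):
    """Compute how far each process must roll back.

    checkpoints: {proc: list of checkpoint times, including 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    start:       {proc: time the failed process rolls back to}
    Returns {proc: time each rolled-back process ends up at}.
    """
    line = dict(start)   # processes not listed have not rolled back
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            send_undone = sender in line and sent > line[sender]
            recv_kept = received <= line.get(receiver, float("inf"))
            if send_undone and recv_kept:
                # Orphan message: the receiver must undo the receive by
                # rolling back to its latest checkpoint before it.
                earlier = [c for c in checkpoints[receiver] if c < received]
                line[receiver] = max(earlier)
                changed = True
    return line


# Hypothetical times: Y's rollback orphans a message to X, and X's
# rollback in turn orphans a message to Z.
checkpoints = {"X": [0, 4, 9], "Y": [0, 6], "Z": [0, 5]}
messages = [("Y", 7, "X", 8), ("X", 5, "Z", 6)]
result = propagate_rollback(checkpoints, messages, {"Y": 6})
```

Each pass only moves a process to a strictly earlier checkpoint, so the loop terminates. In the worst case it ends with every process back at its initial state.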
Lost Messages
[Figure: timelines of processes X and Z with recovery points x1 and z1 (each marked "["), a message m between them, and a failure]
Livelocks
[Figure: timelines of processes X and Z with recovery points x1 and z1 (each marked "["), illustrating repeated failure]
Concurrent Recovery


Coordination is required either when
establishing checkpoints
or at the beginning of recovery
Checkpoint Assumptions


Communication via messages
Unreliable FIFO channels



Higher-level end-to-end protocols assumed
Subsumes rollback-caused message loss
No network partitions from communication
failures
Checkpoint Algorithm Concepts

Permanent and tentative checkpoints




Saved on stable storage
Permanent: part of known consistent global
checkpoint
Tentative: until successful termination of
checkpoint algorithm
Processes roll back only to permanent checkpoints
Synchronous Checkpoint Algorithms


Two-phase commit
Problems:



Message overhead for synchronizations
Synchronization delays
Costly when failures are rare
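A two-phase coordinated checkpoint might look like the sketch below. It is a simplification under assumptions: `Process` and `coordinated_checkpoint` are hypothetical names, direct method calls stand in for the request and decision messages, and the `can_checkpoint` flag simulates a process that cannot take its tentative checkpoint.

```python
class Process:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.permanent = dict(state)          # last permanent checkpoint
        self.tentative = None

    def take_tentative(self):
        # Phase 1: save a tentative checkpoint and vote.
        if not self.can_checkpoint:
            return False
        self.tentative = dict(self.state)
        return True

    def decide(self, commit):
        # Phase 2: promote or discard the tentative checkpoint.
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def coordinated_checkpoint(processes):
    votes = [p.take_tentative() for p in processes]
    commit = all(votes)
    for p in processes:
        p.decide(commit)
    return commit


a = Process("a", {"n": 1})
b = Process("b", {"n": 2})
a.state["n"] = 10
b.state["n"] = 20
first = coordinated_checkpoint([a, b])    # both vote yes

a.state["n"] = 11
b.can_checkpoint = False
second = coordinated_checkpoint([a, b])   # b votes no, nothing changes
```

Every checkpoint costs a round of requests and a round of decisions across all processes, which is the message overhead and synchronization delay listed above.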
Asynchronous Checkpointing


Local checkpoints taken independently
Log all incoming messages on stable storage


Minimizes undone computation
Allows reprocessing of messages after rollback
Asynchronous Checkpointing Assumptions




Reliable FIFO communication channels
Infinite buffers
Event-driven computation




A process is idle until a message is received
Processes the message and changes state
Sends zero or more messages
Can identify each event with a monotonically increasing
counter
Event-Driven Computation
[Figure: timelines of processes X, Y, and Z with events x1, x2 on X, y1, y2 on Y, and z1, z2 on Z]
Asynchronous Checkpointing

Basic idea




Save states, messages sent at each event
Volatile logging
Each processor notes number of messages sent
to others, and received from others
Use counters to determine orphan messages
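The counter comparison can be sketched as below. This is an illustration under assumptions: `find_orphans` is a hypothetical helper, the counters are plain nested dicts, and every sender is assumed to keep an entry for each process it has ever sent to.

```python
def find_orphans(sent, received):
    """Compare send and receive counters after a rollback.

    sent[i][j]:     messages i has recorded sending to j
    received[j][i]: messages j has recorded receiving from i
    Returns (sender, receiver, excess) for each pair where the receiver
    has recorded more receipts than the sender has recorded sends.
    """
    orphans = []
    for i in sent:
        for j, n_sent in sent[i].items():
            n_recv = received.get(j, {}).get(i, 0)
            if n_recv > n_sent:
                orphans.append((i, j, n_recv - n_sent))
    return orphans


# Y has rolled back to a state in which it had sent only 1 message to X,
# but X has recorded receiving 2 messages from Y.
sent = {"X": {"Y": 3}, "Y": {"X": 1}}
received = {"X": {"Y": 2}, "Y": {"X": 3}}
orphans = find_orphans(sent, received)
```

A nonzero excess tells the receiver that it must also roll back until its receive count no longer exceeds the sender's send count.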
Summary





Failures caused by errors
Can remove errors by forward/backward error
recovery
Backward error-recovery more costly, more
general
Synchronous checkpoints helpful, costly
Asynchronous checkpoints messier, domino
effects