Professor Ken Birman
Ben Atkin: TA
• We’ve talked about
– Transactional reliability
– Replication for high availability
• But does this give us “fault-tolerant solutions?”
• How and why do real systems fail?
• Do real systems offer the hooks we’ll need to intervene?
• Failure is just one of the aspects of reliability, but it is clearly an important one
• To make a system fault-tolerant we need to understand how to detect failures and plan an appropriate response if a failure occurs
• This lecture focuses on how systems fail, how they can be “hardened”, and what still fails after doing so
• Reliability is not always a major goal when development first starts
• Most systems evolve over time, through incremental changes with some rewriting
• Most reliable systems are entirely rewritten using clean-room techniques after they reach a mature stage of development
• Based on goal of using “best available” practice
• Requires good specifications
• Design reviews in teams
• Actual software also reviewed for correctness
• Extensive stress testing and code coverage testing, use tools like “Purify”
• Use of formal proof tools where practical
• Gray studied failures in Tandem systems
• Hardware was fault-tolerant and rarely caused failures
• Software bugs, environmental factors, human factors (user error), incorrect specification were all major sources of failure
• Classification proposed by Bruce Lindsay
• Bohrbug: like the Bohr model of the atom: solid, easily reproduced, can track it down and fix it
• Heisenbug: like the Heisenberg picture of the atom: a diffuse cloud, very hard to pin down and hence fix
• Anita Borg and others have studied lifecycle bugs in complex software using this classification
[Figure: a Heisenbug is fuzzy, hard to find and fix; a Bohrbug is solid, easy to recognize and fix]
• Usually introduced in some form of code change or in original design
• Often detected during thorough testing
• Once seen, easily fixed
• Remain a problem over life-cycle of software because of need to extend system or to correct other bugs.
• Same input will reliably trigger the bug!
A Bohrbug is boring.
• These are often side-effects of some other problem
• Example: a bug corrupts a data structure or misuses a pointer. The damage is not noticed right away, but causes a crash much later when the structure is referenced
• Attempting to detect the bug may shift memory layout enough to change its symptoms!
• They develop a test scenario that triggers it
• Use a form of binary search to narrow in on it
• Pin down the bug and understand precisely what is wrong
• Correct the algorithm or the coding error
• Retest extensively to confirm that the bug is fixed
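The binary-search step in the list above can be sketched as follows. This is a minimal illustration, assuming a hypothetical ordered list of code changes and a deterministic test `is_bad(n)` that reports whether the bug shows up with the first `n` changes applied. Neither name comes from the lecture, and the approach only works because a Bohrbug is reliably triggered by the same input.

```python
def first_bad_change(num_changes, is_bad):
    """Return the 1-based index of the first change that introduces the bug.

    Assumes is_bad(0) is False (the baseline is good), is_bad(num_changes)
    is True (the latest build is bad), and the bug persists once introduced.
    """
    lo, hi = 0, num_changes          # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid                 # bug already present: look earlier
        else:
            lo = mid                 # bug not yet present: look later
    return hi


# Hypothetical scenario: the bug was introduced by change 37 out of 100.
calls = []

def is_bad(n):
    calls.append(n)                  # record how many test runs were needed
    return n >= 37

culprit = first_bad_change(100, is_bad)
```

Each test run halves the range of suspect changes, so about log2(N) runs suffice. The same search is much less useful for a Heisenbug, because a single passing run does not show that the bug is absent.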
• They fix the symptom: periodically scan the structure that is usually corrupted and clean it up
• They add self-checking code (which may itself be a source of bugs)
• They develop theories of what is wrong and fix the theoretical problem, but lack a test to confirm that this eliminated the bug
• These bugs are extremely sensitive to event orders
• Heavily used software may become extremely reliable over its life (the C compiler rarely crashes, UNIX is pretty reliable by now)
• Large, complex systems depend upon so many components, many complex, that bug freedom is an unachievable goal
• Instead, adopt the view that bugs will happen and that we should try to plan for them
• Usual pattern: some component crashes or becomes partitioned away
• Other system components that depend on it freeze or crash too
• Chains of dependencies gradually cause more and more of the overall system to fail or freeze
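The way a single crash spreads along chains of dependencies can be sketched as a graph traversal. The component names and the `depends_on` table below are hypothetical; the sketch assumes, pessimistically, that any component freezes once something it depends on has failed or frozen.

```python
from collections import deque


def affected_by(failed, depends_on):
    """Return every component that transitively depends on `failed`.

    depends_on maps each component to the list of components it uses.
    """
    # Invert the graph: for each component, record who depends on it.
    dependents = {}
    for comp, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(comp)

    stalled, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for comp in dependents.get(current, ()):
            if comp not in stalled:
                stalled.add(comp)
                queue.append(comp)
    return stalled


# Hypothetical system: most services look up peers through a name server.
depends_on = {
    "web_ui": ["order_service"],
    "order_service": ["inventory", "name_server"],
    "inventory": ["name_server"],
    "reporting": ["inventory"],
    "name_server": [],
    "logger": [],
}
stalled = affected_by("name_server", depends_on)
```

In this made-up example, the failure of one small component stalls four of the other five, even though none of them has a bug of its own.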
• Everyone should use tools like “purify” (detects stray pointers, uninitialized variables and memory leaks)
• But these tools don’t help at the level of a distributed system
• Benefit of a model, like transactions or virtual synchrony, is that the model simplifies developer’s task
“A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable”
• Issue is dependency on critical components
• Notion is that state and “health” of system at site A is linked to state and health at site B
• Modern systems are structured using object-oriented component interfaces:
– CORBA, COM (or DCOM), Jini
– XML
• In these systems, we create a web of dependencies between components
• Any faulty component could cripple the system!
• Network focus is on connectivity but components are logically independent: program fetches a file and operates on it, but server is stateless and forgets the interaction
– Less sophisticated but more robust?
• Distributed systems focus is on joint behavior of a set of logically related components. Can talk about “the system” as an entity.
– But needs fancier failure handling!
• These are distributed in the sense of our definition
– Often, they share state between components
– If a component fails, replacing it with a new version may be hard
– Replicating the state of a component: an appealing option…
• Deceptively appealing, as we’ll see
• Suppose that a distributed system was built by interconnecting a set of extremely reliable components running on fault-tolerant hardware
• Would such a system be expected to be reliable?
• Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other “distributed” factors all can prevent a system from operating correctly!
• The Web components are individually reliable
• But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even if both browser and server are operational), and it can be so slow that we consider it faulty even if it is working
• For stateful systems (the Web is stateless) this issue extends to joint behavior of sets of programs
• The Ariane rocket is designed in a modular fashion
– Guidance system
– Flight telemetry
– Rocket engine control
– … etc.
• When they upgraded some rocket components in a new model, working modules failed because hidden assumptions were invalidated.
[Figure: Ariane module diagram, shown in three panels. Modules: Attitude Control, Guidance, Telemetry, Altitude, Accelerometer, Thrust Control. One panel is marked “Overflow!”]
• Correctness depends very much on the environment
– A component that is correct in setting A may be incorrect in setting B
– Components make hidden assumptions
– Perceived reliability is in part a matter of experience and comfort with a technology base and its limitations!
• Not always necessary: there are ways to overcome failures that don’t explicitly detect them
• But situation is much easier with detectable faults
• Usual approach: process does something to say “I am still alive”
• Absence of proof of liveness taken as evidence of a failure
• Programs P and B are the primary, backup of a service
• Programs X, Y, Z are clients of the service
• All “ping” each other for liveness
• If a process doesn’t respond to a few pings, consider it faulty.
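A minimal sketch of the ping scheme described above, as seen from one client. The class name, the `max_missed` threshold of 3, and the round-based simulation are assumptions made for illustration; a real detector would send pings over the network with timeouts.

```python
class PingDetector:
    """Suspect a peer after `max_missed` consecutive unanswered pings."""

    def __init__(self, peers, max_missed=3):
        self.max_missed = max_missed
        self.missed = {peer: 0 for peer in peers}

    def ping_round(self, responders):
        """Record one round of pings; `responders` is the set that replied."""
        for peer in self.missed:
            if peer in responders:
                self.missed[peer] = 0      # any reply is proof of liveness
            else:
                self.missed[peer] += 1     # absence of proof accumulates

    def suspected(self):
        return {p for p, n in self.missed.items() if n >= self.max_missed}


# View from client X: P is the primary, B the backup, Y and Z other clients.
detector = PingDetector(["P", "B", "Y", "Z"], max_missed=3)
detector.ping_round({"P", "B", "Y", "Z"})   # everyone answers
for _ in range(3):
    detector.ping_round({"B", "Y", "Z"})    # P stops answering
suspects = detector.suspected()
```

Note that the detector only produces a suspicion. It cannot tell whether P crashed, is merely slow, or has been cut off from X by the network, and a different client may reach a different verdict.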
• Impossible in an asynchronous network that can lose packets: partitioning can mimic failure
– Best option is to track membership
– But few systems have GMS services
• Many real networks suffer from this problem, hence consistent detection is impossible “in practice” too!
• Can always detect failures if risk of mistakes is acceptable
• An even harder problem!
• Now we need to worry
– About programs that fail
– But also about modules that fail
• Unclear how to do this or even how to tell
– Recall that RPC makes component use rather transparent…
• Argues that we would not consider someone to have died because they don’t answer the phone
• Approach is to consult other data sources:
– Operating system where process runs
– Information about status of network routing nodes
– Can augment with application-specific solutions
• Won’t detect program that looks healthy but is actually not operating correctly
• Usually implemented using shared memory
• Monitored program must periodically update a counter in a shared memory region. Designed to do this at some frequency, e.g. 10 times per second.
• Monitoring program polls the counter, perhaps 5 times per second. If the counter stops changing, it kills the “faulty” process and notifies others.
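The polling logic of such a monitor can be sketched as below. To keep the example self-contained, the shared memory region is replaced by a plain list of counter readings, and the tolerance of two stale polls is an assumed parameter, not something the lecture specifies.

```python
class Watchdog:
    """Declare the monitored process faulty when its counter stops changing."""

    def __init__(self, stale_polls_allowed=2):
        self.stale_polls_allowed = stale_polls_allowed
        self.last_seen = None
        self.stale_polls = 0

    def poll(self, counter_value):
        """Return True if the monitored process should be declared faulty."""
        if counter_value != self.last_seen:
            self.last_seen = counter_value   # progress was made since last poll
            self.stale_polls = 0
        else:
            self.stale_polls += 1            # no progress since last poll
        return self.stale_polls > self.stale_polls_allowed


# Simulated readings: the counter advances by 2 per poll while healthy
# (updates at 10 per second, polls at 5 per second), then the process hangs.
readings = [2, 4, 6, 8, 8, 8, 8, 8]
watchdog = Watchdog(stale_polls_allowed=2)
verdicts = [watchdog.poll(r) for r in readings]
```

Updating the counter faster than it is polled gives the monitor some slack: a healthy process should always show progress between two polls, so a few stale readings in a row are a strong signal.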
• Used in a telecommunications coprocessor mockup
• Can’t wait for failures to be sensed, so his protocol reissues requests as soon as the reply seems late
• Issue of detecting failure becomes a background task; need to do it soon enough so that overhead won’t be excessive or real-time response impacted
• Distributed systems have many components, linked by chains of dependencies
• Failures are inevitable, but hardware failures are less and less central to availability
• Inconsistency of failure detection will introduce inconsistency of behavior and could freeze the application
• Replace critical components with group of components that can each act on behalf of the original one
• Develop a technology by which states can be kept consistent and processes in system can agree on status (operational/failed) of components
• Separate handling of partitioning from handling of isolated component failures if possible
[Figure: transparent replication. Labels: Program, Module it uses, multicast]
• Replicate critical components for availability
• Replicate critical data: like coherent caching
• Replicate critical system state: control information such as “I’ll do X while you do Y”
• In limit, replication and coordination are really the same problem
• We need to understand client-side software architectures better to appreciate the practical limitations on replacing a server with a group
• Sometimes, this simply isn’t practical
• Suppose that a client observes a failure during a request
• What should it do?
[Figure: the client’s request to the server ends in a timeout]
• What should the client do?
– No way to know if request was finished
– We don’t even know if server really crashed
– But suppose it genuinely crashed…
[Figure: the client’s request times out; a backup server is available]
• What should client “say” to backup?
– Please check on the status of my last request?
• But perhaps backup has not yet finished the fault-handling protocol
– Reissue request?
• Not all requests are idempotent
• And what about any “cached” server state? Will it need to be refreshed?
• Worse still: what if RPC throws an exception? E.g. “demarshalling error”
– A risk if failure breaks a stream connection
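The danger of simply reissuing a request can be illustrated with a toy account service. Everything here is hypothetical, and the request-id table in `deposit_once` is one common mitigation introduced only for illustration, not a technique taken from the lecture.

```python
class Account:
    def __init__(self):
        self.balance = 0
        self.seen = {}    # request_id -> cached reply (illustrative assumption)

    def set_balance(self, amount):
        """Idempotent: repeating the call leaves the same state."""
        self.balance = amount
        return self.balance

    def deposit(self, amount):
        """Not idempotent: every repetition changes the state again."""
        self.balance += amount
        return self.balance

    def deposit_once(self, request_id, amount):
        """Deposit guarded by a client-chosen id; replays get the cached reply."""
        if request_id not in self.seen:
            self.seen[request_id] = self.deposit(amount)
        return self.seen[request_id]


# The client times out and reissues, but the first request had completed.
naive = Account()
naive.deposit(100)
naive.deposit(100)                      # the retry is applied a second time

guarded = Account()
guarded.deposit_once("req-1", 100)
guarded.deposit_once("req-1", 100)      # the retry is recognized and ignored
```

The naive account ends with a balance of 200, the guarded one with 100. The guard only helps if the table of seen requests survives the failure, which means the backup must hold the same table as the primary: the question of server state reappears in a new form.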
• Client is doing a request that might be disrupted by failure
– Must catch this request
• Client needs to reconnect
– Figure out who will take over
– Wait until it knows about the crash
– Cached data may no longer be valid
– Track down outcome of pending requests
• Meanwhile must synchronize wrt any new requests that application issues
• This argues that we need to make server failure “transparent” to client
– But in practice, doing so is hard
– Normally, this requires deterministic servers
• But not many servers are deterministic
– Techniques are also very slow…
• Transparency
– On client side, “nothing happens”
– On server side
• There may be a connection that backup needs to take over
• What if server was in the middle of sending a request?
• How can backup exactly mimic actions of the primary?
• N-version programming: use more than one implementation to overcome software bugs
– Explicitly uses some form of group architecture
– We run multiple copies of the component
– Compare their outputs and pick majority
• Could be identical copies, or separate versions
• In limit, each is coded by a different team!
• Even with n-version programming, we get limited defense against bugs
• ... studies show that Bohrbugs will occur in all versions! For Heisenbugs we won’t need multiple versions; running one version multiple times suffices if versions see different inputs or different order of inputs
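The “compare their outputs and pick majority” step of N-version programming can be sketched as follows. The three versions are hypothetical stand-ins for independently written implementations, and one of them carries a deliberate Bohrbug.

```python
from collections import Counter


def vote(outputs):
    """Return the strict-majority output, or raise if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among versions")
    return value


# Three hypothetical versions of the same function (squaring a number).
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return 10 if x == 3 else x * x      # deliberate bug for input 3


def run_n_version(x, versions=(version_a, version_b, version_c)):
    return vote([version(x) for version in versions])


result = run_n_version(3)               # version_c is outvoted
```

The vote masks a bug only when a majority of versions get the answer right. If the same mistake appears in most versions, for example because every team misread the same specification, the vote returns the wrong answer with full agreement.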
• Processes make periodic checkpoints, log messages sent in between
• Rollback to consistent set of checkpoints after a failure. Technique is simple and costs are low.
• But method must be used throughout system and is limited to deterministic programs (everything in the system must satisfy this assumption)
• Consequence: useful in limited settings.
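A minimal sketch of checkpoint plus log replay for a single deterministic process. It is a simplification under stated assumptions: the process logs the messages it handles, the checkpoint and log are treated as if they were on stable storage, and the harder problem of choosing a consistent set of checkpoints across many processes is not shown. All names are illustrative.

```python
import copy


class Process:
    """A deterministic process: state depends only on the messages applied."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []                    # messages since the last checkpoint

    def handle(self, msg):
        self.state["total"] += msg
        self.state["count"] += 1
        self.log.append(msg)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()                 # older messages are now covered

    def recover(self):
        """Roll back to the checkpoint, then replay the logged messages."""
        replay = list(self.log)
        self.state = copy.deepcopy(self.checkpoint)
        self.log = []
        for msg in replay:
            self.handle(msg)


p = Process()
p.handle(5)
p.handle(7)
p.take_checkpoint()
p.handle(1)
p.handle(2)
before_crash = copy.deepcopy(p.state)
p.state = {"total": -999, "count": -999}   # simulate state lost in a crash
p.recover()
```

Replay reproduces the pre-crash state only because `handle` is deterministic. If the handler consulted a clock, a random number, or thread scheduling, the replayed state could differ from the original.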
• Assumes that failures are arbitrary and may be malicious
• Uses groups of components that take actions by majority consensus only
• Protocols prove to be costly
– 3t+1 components needed to overcome t failures
– Takes a long time to agree on each action
• Currently employed mostly in security settings
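The 3t+1 bound translates into simple arithmetic on group size. The helper names below are made up for illustration.

```python
def min_components(t):
    """Components needed to overcome t arbitrary (Byzantine) failures."""
    return 3 * t + 1


def max_failures(n):
    """Largest t such that 3t + 1 <= n, for a group of n components."""
    return (n - 1) // 3


# Group sizes needed for t = 1, 2, 3.
table = [(t, min_components(t)) for t in range(1, 4)]
```

Tolerating a single arbitrary failure already takes four components, and a group of five or six tolerates no more than a group of four does, which is part of why these protocols are costly.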
• Suppose that a distributed system is built from standard components with application-specific code added to customize behavior
• How can such a system be made reliable without rewriting everything from the ground up?
• Need a plug-and-play reliability solution
• If reliability increases complexity, will reliability technology actually make systems less reliable?
• Issues of making a system flexible enough to handle multiple types of clients
• Security, scalability, real-time
• Issues seen within the Internet
• Object orientation (CORBA, COM), XML, Microsoft’s .NET strategy
• Clustering and future networks