Reliable Distributed Systems: How and Why Complex Systems Fail

We’ve talked about
- Transactional reliability
- And we’ve mentioned replication for high availability
- But does this give us “fault-tolerant solutions”?
- How and why do real systems fail?
- Do real systems offer the hooks we’ll need to intervene?

Failure
- Failure is just one of the aspects of reliability, but it is clearly an important one
- To make a system fault-tolerant we need to understand how to detect failures and plan an appropriate response if a failure occurs
- This lecture focuses on how systems fail, how they can be “hardened”, and what still fails after doing so

Systems can be built in many ways
- Reliability is not always a major goal when development first starts
- Most systems evolve over time, through incremental changes with some rewriting
- The most reliable systems are entirely rewritten, using clean-room techniques, after they reach a mature stage of development

Clean-room concept
- Based on goal of using “best available” practice
- Requires good specifications
- Design reviews in teams
- Actual software also reviewed for correctness
- Extensive stress testing and code-coverage testing, using tools like “Purify”
- Use of formal proof tools where practical

But systems still fail!
- Gray studied failures in Tandem systems
- Hardware was fault-tolerant and rarely caused failures
- Software bugs, environmental factors, human factors (user error), and incorrect specifications were all major sources of failure

Bohrbugs and Heisenbugs
- Classification proposed by Bruce Lindsay
- Bohrbug: like the Bohr model of the atom: solid, easily reproduced; you can track it down and fix it
- Heisenbug: in the spirit of Heisenberg’s uncertainty principle: a diffuse cloud, very hard to pin down and hence fix
- Anita Borg and others have studied life-cycle bugs in complex software using this classification

Programmer-facing bugs
[Diagram: a Heisenbug is fuzzy, hard to find and fix; a Bohrbug is solid, easy to recognize and fix.]

Lifecycle of a Bohrbug
- Usually introduced in some form of code change or in the original design
- Often detected during thorough testing
- Once seen, easily fixed
- Bohrbugs remain a problem over the life-cycle of software because of the need to extend the system or to correct other bugs
- The same input will reliably trigger the bug!

Lifecycle of a Bohrbug
- A Bohrbug is boring.

Lifecycle of a Heisenbug
- These are often side-effects of some other problem
- Example: a bug corrupts a data structure or misuses a pointer. Damage is not noticed right away, but causes a crash much later when the structure is referenced
- Attempting to detect the bug may shift memory layout enough to change its symptoms!

How programmers fix a Bohrbug
- They develop a test scenario that triggers it
- Use a form of binary search to narrow in on it (see the sketch below)
- Pin down the bug and understand precisely what is wrong
- Correct the algorithm or the coding error
- Retest extensively to confirm that the bug is fixed

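One way to picture the “binary search” step: treat the revision history (or an ordered set of candidate code paths) as a sorted sequence and bisect it with the triggering test. A minimal sketch, assuming a deterministic test and an ordered list of revisions; all names here are illustrative:

```python
# Bisect an ordered list of revisions to find the first one where the
# triggering test fails. Assumes a deterministic (Bohr) bug.
def first_bad_revision(revisions, test_passes):
    """Return the earliest revision for which test_passes() is False.

    Assumes the revisions are ordered so that every good revision
    precedes every bad one.
    """
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if test_passes(revisions[mid]):
            lo = mid + 1      # bug introduced after this revision
        else:
            hi = mid          # bug already present here
    return revisions[lo]

# Example: revisions 0..9, bug introduced at revision 6
print(first_bad_revision(list(range(10)), lambda r: r < 6))  # -> 6
```

Because a Bohrbug is deterministic, the same input reliably reproduces it, which is exactly what makes this bisection converge.
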
How they fix Heisenbugs
- They fix the symptom: periodically scan the structure that is usually corrupted and clean it up (an auditor of this kind is sketched below)
- They add self-checking code (which may itself be a source of bugs)
- They develop theories of what is wrong and fix the theoretical problem, but lack a test to confirm that this eliminated the bug
- These bugs are extremely sensitive to event orders

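The “fix the symptom” tactic often takes the form of an auditor: a background task that periodically walks the structure that tends to get corrupted and repairs what it can. A minimal sketch, assuming the fragile structure is a linked list that sometimes acquires a cycle; every name here is illustrative, not from the lecture:

```python
# Background auditor that scans a linked list and cuts corrupted tails
# (cycles or runaway chains) so traversals terminate.
import threading
import time

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def audit(head, seen_limit=10_000):
    """Break cycles / truncate garbage so traversals terminate."""
    seen = set()
    node = head
    while node is not None and node.next is not None:
        if id(node.next) in seen or len(seen) > seen_limit:
            node.next = None          # repair: cut the corrupted tail
            break
        seen.add(id(node))
        node = node.next

def start_auditor(head, period=1.0):
    """Run audit() periodically in a daemon thread."""
    def loop():
        while True:
            audit(head)
            time.sleep(period)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# Demo: a list corrupted into a cycle gets repaired
head = Node(1); head.next = Node(2); head.next.next = head
audit(head)
assert head.next.next is None
```

Note the slide’s caveat applies here too: the auditor is itself self-checking code, and may itself be a source of bugs.
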
Bug-free software is uncommon
- Heavily used software may become extremely reliable over its life (the C compiler rarely crashes, UNIX is pretty reliable by now)
- Large, complex systems depend upon so many components, many of them complex, that bug-freedom is an unachievable goal
- Instead, adopt the view that bugs will happen and we should try to plan for them

Bugs in a typical distributed system
- Usual pattern: some component crashes or becomes partitioned away
- Other system components that depend on it freeze or crash too
- Chains of dependencies gradually cause more and more of the overall system to fail or freeze

Tools can help
- Everyone should use tools like “Purify” (detects stray pointers, uninitialized variables, and memory leaks)
- But these tools don’t help at the level of a distributed system
- Benefit of a model, like transactions or virtual synchrony, is that the model simplifies the developer’s task

Leslie Lamport
“A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable”
- Issue is dependency on critical components
- Notion is that the state and “health” of the system at site A is linked to the state and health at site B

Component Architectures Make it Worse
- Modern systems are structured using object-oriented component interfaces:
  - CORBA, COM (or DCOM), Jini
  - XML
- In these systems, we create a web of dependencies between components
- Any faulty component could cripple the system!

Reminder: Networks versus Distributed Systems
- Network focus is on connectivity, but components are logically independent: a program fetches a file and operates on it, but the server is stateless and forgets the interaction
  - Less sophisticated but more robust?
- Distributed systems focus is on joint behavior of a set of logically related components. Can talk about “the system” as an entity.
  - But needs fancier failure handling!

Component Systems?
- Includes CORBA and Web Services
- These are distributed in the sense of our definition
  - Often, they share state between components
  - If a component fails, replacing it with a new version may be hard
  - Replicating the state of a component: an appealing option…
    - Deceptively appealing, as we’ll see

Thought question
- Suppose that a distributed system was built by interconnecting a set of extremely reliable components running on fault-tolerant hardware
- Would such a system be expected to be reliable?
  - Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other “distributed” factors can all prevent a system from operating correctly!

Example
- The Web’s components are individually reliable
- But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even if both browser and server are operational), and it can be so slow that we consider it faulty even if it is working
- For stateful systems (the Web is stateless) this issue extends to the joint behavior of sets of programs

Example
- The Ariane rocket is designed in a modular fashion:
  - Guidance system
  - Flight telemetry
  - Rocket engine control
  - … etc.
- When they upgraded some rocket components in a new model, working modules failed because hidden assumptions were invalidated.

Ariane Rocket
[Diagram sequence: Guidance, Attitude Control, Thrust Control, and Telemetry modules with Altitude and Accelerometer inputs; an overflow occurs in one module and the failure cascades through the modules that depend on it.]

Insights?
- Correctness depends very much on the environment
  - A component that is correct in setting A may be incorrect in setting B
  - Components make hidden assumptions
  - Perceived reliability is in part a matter of experience and comfort with a technology base and its limitations!

Detecting failure
- Not always necessary: there are ways to overcome failures that don’t explicitly detect them
- But the situation is much easier with detectable faults
- Usual approach: a process does something to say “I am still alive”
- Absence of proof of liveness is taken as evidence of a failure

Example: pinging with timeouts
- Programs P and B are the primary and backup of a service
- Programs X, Y, Z are clients of the service
- All “ping” each other for liveness
- If a process doesn’t respond to a few pings, consider it faulty (a minimal sketch follows)

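A minimal sketch of this scheme, assuming peers answer a UDP “ping” with “pong”; the address, message format, and thresholds are all invented for illustration:

```python
# Timeout-based failure detection: suspect a peer after several
# consecutive unanswered pings.
import socket

def is_suspected_faulty(addr, attempts=3, timeout=1.0):
    """Suspect a peer after `attempts` consecutive unanswered pings."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(attempts):
            sock.sendto(b"ping", addr)       # e.g. addr = ("10.0.0.7", 9000)
            try:
                data, _ = sock.recvfrom(16)
                if data == b"pong":
                    return False             # peer answered: not suspected
            except socket.timeout:
                continue                     # no answer: try again
        return True                          # several misses: suspect failure
    finally:
        sock.close()
```

As the next slide stresses, this can only suspect a failure: a lost packet or a partition looks exactly like a crash.
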
Consistent failure detection
- Impossible in an asynchronous network that can lose packets: partitioning can mimic failure
  - Best option is to track membership
  - But few systems have GMS services
- Many real networks suffer from this problem, hence consistent detection is impossible “in practice” too!
- Can always detect failures if the risk of mistakes is acceptable

Component failure detection
- An even harder problem!
- Now we need to worry
  - About programs that fail
  - But also about modules that fail
- Unclear how to do this or even how to tell
  - Recall that RPC makes component use rather transparent…

Vogels: the Failure Investigator
- Argues that we would not consider someone to have died because they don’t answer the phone
- Approach is to consult other data sources (sketch below):
  - Operating system where the process runs
  - Information about status of network routing nodes
  - Can augment with application-specific solutions
- Won’t detect a program that looks healthy but is actually not operating correctly

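A hedged sketch of the “consult other data sources” idea: a small agent on the server’s host answers liveness queries by asking the local operating system, instead of relying on the monitored process to answer for itself. The port, wire format, and agent are invented for illustration (and the existence check is Unix-specific):

```python
# Hypothetical liveness agent: answers "is pid X alive?" by asking the
# local OS, not the monitored process itself. Unix-specific.
import os
import socketserver

class LivenessAgent(socketserver.BaseRequestHandler):
    def handle(self):
        pid = int(self.request.recv(32).decode())   # query: a process id
        try:
            os.kill(pid, 0)                  # signal 0: existence check only
            self.request.sendall(b"alive")
        except ProcessLookupError:
            self.request.sendall(b"dead")
        except PermissionError:
            self.request.sendall(b"alive")   # exists, owned by another user

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9100), LivenessAgent) as srv:
        srv.serve_forever()
```

Even this only establishes that the process exists, which is the slide’s closing caveat: a process can exist and still be operating incorrectly.
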
Further options: “Hot” button
- Usually implemented using shared memory
- Monitored program must periodically update a counter in a shared memory region. Designed to do this at some frequency, e.g. 10 times per second.
- Monitoring program polls the counter, perhaps 5 times per second. If the counter stops changing, it kills the “faulty” process and notifies others. (Sketch below.)

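A minimal sketch of the hot button on a single host, using a shared-memory counter; the 10 updates/second and 5 polls/second follow the slide, while everything else (names, miss threshold, the simulated hang) is illustrative:

```python
# Shared-memory "hot button": the monitored process bumps a counter;
# the monitor polls it and kills the process if it stops changing.
import multiprocessing as mp
import os, signal, time

def monitored(counter):
    for _ in range(30):                  # heartbeat for ~3 seconds...
        with counter.get_lock():
            counter.value += 1           # the "I am still alive" update
        time.sleep(0.1)                  # ~10 updates per second
    time.sleep(3600)                     # ...then simulate a hang

def monitor(counter, pid, misses_allowed=5):
    last, misses = counter.value, 0
    while True:
        time.sleep(0.2)                  # ~5 polls per second
        now = counter.value
        misses = misses + 1 if now == last else 0
        last = now
        if misses >= misses_allowed:
            os.kill(pid, signal.SIGKILL) # kill the "faulty" process (Unix)
            print("killed stalled process; would notify others here")
            return

if __name__ == "__main__":
    c = mp.Value("i", 0)                 # counter in shared memory
    p = mp.Process(target=monitored, args=(c,))
    p.start()
    monitor(c, p.pid)
```
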
Friedman’s approach
- Used in a telecommunications co-processor mockup
- Can’t wait for failures to be sensed, so his protocol reissues requests as soon as the reply seems late (sketch below)
- Issue of detecting failure becomes a background task; need to do it soon enough so that overhead won’t be excessive or realtime response impacted

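The protocol itself isn’t given in these slides; the following is only a sketch of the eager-retry idea, with invented names (`send_to`, `replicas`) and an arbitrary deadline:

```python
# Eager retry: ask the primary; if the reply seems late, reissue the
# request to a backup without waiting for the failure detector.
import queue

def eager_request(send_to, replicas, deadline=0.05):
    """send_to(replica, reply_queue) must issue the request asynchronously."""
    replies = queue.Queue()
    send_to(replicas[0], replies)                 # ask the primary first
    for backup in replicas[1:]:
        try:
            return replies.get(timeout=deadline)  # reply arrived in time
        except queue.Empty:
            send_to(backup, replies)              # late: reissue eagerly
    return replies.get()                          # take whichever answers first

# Caveat: a slow (not crashed) primary may still answer, so duplicate
# replies must be discarded and requests must be safe to reissue.
```
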
Broad picture?
- Distributed systems have many components, linked by chains of dependencies
- Failures are inevitable; hardware failures are less and less central to availability
- Inconsistency of failure detection will introduce inconsistency of behavior and could freeze the application

Suggested solution?
- Replace critical components with a group of components that can each act on behalf of the original one
- Develop a technology by which states can be kept consistent and processes in the system can agree on the status (operational/failed) of components
- Separate handling of partitioning from handling of isolated component failures if possible

Suggested Solution
[Diagram: a program invoking the single module it uses is replaced by the program multicasting to a replicated group of such modules: transparent replication.]

Replication: the key technology
- Replicate critical components for availability
- Replicate critical data: like coherent caching
- Replicate critical system state: control information such as “I’ll do X while you do Y”
- In the limit, replication and coordination are really the same problem (toy sketch below)

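One way to see why replication and coordination converge: if every replica applies the same updates in the same order, then replicated data, replicated state, and agreed-upon control decisions all reduce to the same mechanism. A toy sketch; the ordered broadcast here is a stand-in, not a real protocol:

```python
# State-machine view of replication: same updates, same order,
# therefore identical replica states.
class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, value = update
        self.state[key] = value

def totally_ordered_broadcast(replicas, updates):
    """Stand-in for a real ordered multicast protocol."""
    for u in updates:
        for r in replicas:
            r.apply(u)

group = [Replica(), Replica(), Replica()]
totally_ordered_broadcast(group, [("x", 1), ("y", 2), ("x", 3)])
assert all(r.state == {"x": 3, "y": 2} for r in group)
```
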
Basic issues with the approach
- We need to understand client-side software architectures better to appreciate the practical limitations on replacing a server with a group
- Sometimes, this simply isn’t practical

Client-server issues
- Suppose that a client observes a failure during a request
- What should it do?

Client-server issues
[Diagram: the client’s request to the server times out.]

Client-server issues
- What should the client do?
  - No way to know if the request was finished
  - We don’t even know if the server really crashed
- But suppose it genuinely crashed…

Client-server issues
[Diagram: the client’s request times out and it turns to the backup server.]

Client-server issues
- What should the client “say” to the backup?
  - Please check on the status of my last request?
    - But perhaps the backup has not yet finished the fault-handling protocol
  - Reissue the request?
    - Not all requests are idempotent (one standard remedy is sketched after this slide)
- And what about any “cached” server state? Will it need to be refreshed?
- Worse still: what if the RPC throws an exception? E.g. “demarshalling error”
  - A risk if failure breaks a stream connection

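One standard answer to the idempotency problem, sketched below: tag each request with a unique id and have the server save and replay replies, so a reissued request is absorbed rather than re-executed. Names are illustrative, and in a primary/backup pair the `completed` table would itself have to be replicated:

```python
# Request deduplication: a retried non-idempotent request gets the
# saved reply instead of running its side effect twice.
import uuid

class DedupServer:
    def __init__(self):
        self.completed = {}                     # request_id -> saved reply

    def execute(self, request_id, operation):
        if request_id in self.completed:
            return self.completed[request_id]   # duplicate: replay reply
        reply = operation()                     # run the side effect once
        self.completed[request_id] = reply
        return reply

server = DedupServer()
balance = [100]

def withdraw():
    balance[0] -= 10
    return balance[0]

rid = str(uuid.uuid4())
print(server.execute(rid, withdraw))   # 90: executed
print(server.execute(rid, withdraw))   # 90: reissue safely absorbed
```
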
Client-server issues
- Client is doing a request that might be disrupted by failure
  - Must catch this request
- Client needs to reconnect
  - Figure out who will take over
  - Wait until it knows about the crash
  - Cached data may no longer be valid
  - Track down outcome of pending requests
  - Meanwhile must synchronize wrt any new requests that the application issues

Client-server issues
- This argues that we need to make server failure “transparent” to the client
  - But in practice, doing so is hard
- Normally, this requires deterministic servers
  - But not many servers are deterministic
  - Techniques are also very slow…

Client-server issues
- Transparency
  - On the client side, “nothing happens”
  - On the server side:
    - There may be a connection that the backup needs to take over
    - What if the server was in the middle of sending a request?
    - How can the backup exactly mimic the actions of the primary?

Other approaches to consider
- N-version programming: use more than one implementation to overcome software bugs
  - Explicitly uses some form of group architecture
  - We run multiple copies of the component
  - Compare their outputs and pick the majority (sketch below)
    - Could be identical copies, or separate versions
    - In the limit, each is coded by a different team!

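A minimal sketch of the voting step: run several implementations on the same input and act on the majority output. The three “versions” here are toys invented for illustration:

```python
# N-version majority voting over independently coded implementations.
from collections import Counter

def vote(versions, *args):
    outputs = [v(*args) for v in versions]
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority: versions disagree")
    return winner

# Three "versions" of absolute value, one buggy for negative inputs:
v1 = lambda x: x if x >= 0 else -x
v2 = lambda x: abs(x)
v3 = lambda x: x                      # buggy version
print(vote([v1, v2, v3], -5))         # -> 5: the majority outvotes the bug
```
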
Other approaches to consider
- Even with n-version programming, we get limited defense against bugs
  - … studies show that Bohrbugs will occur in all versions! For Heisenbugs we won’t need multiple versions; running one version multiple times suffices if the copies see different inputs or a different order of inputs

Logging and checkpoints
- Processes make periodic checkpoints, and log the messages sent in between
- Roll back to a consistent set of checkpoints after a failure. The technique is simple and costs are low.
- But the method must be used throughout the system and is limited to deterministic programs (everything in the system must satisfy this assumption)
- Consequence: useful in limited settings. (A sketch of the basic recovery idea follows.)

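A toy sketch of the recovery idea for a single deterministic process: snapshot the state periodically, log the messages delivered after the snapshot, and on recovery replay the log against the checkpoint. In a real system the checkpoint and log must live on stable storage that survives the crash; the structures here are illustrative:

```python
# Checkpoint + message log: deterministic replay reconstructs the
# pre-crash state from the last snapshot plus logged inputs.
import copy

class RecoverableProcess:
    def __init__(self):
        self.state = {"count": 0}
        self.checkpoint_state = copy.deepcopy(self.state)
        self.log = []                      # messages since last checkpoint

    def deliver(self, msg):
        self.log.append(msg)               # log first, then apply
        self._apply(msg)

    def _apply(self, msg):
        self.state["count"] += msg         # must be deterministic!

    def checkpoint(self):
        self.checkpoint_state = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self):
        self.state = copy.deepcopy(self.checkpoint_state)
        for msg in self.log:               # deterministic replay
            self._apply(msg)

p = RecoverableProcess()
p.deliver(1); p.checkpoint(); p.deliver(2); p.deliver(3)
p.recover()                                # simulate crash + restart
print(p.state["count"])                    # -> 6: state reconstructed
```
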
Byzantine approach
- Assumes that failures are arbitrary and may be malicious
- Uses groups of components that take actions by majority consensus only
- Protocols prove to be costly
  - 3t+1 components needed to overcome t failures (the arithmetic is checked below)
  - Takes a long time to agree on each action
- Currently employed mostly in security settings

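The 3t+1 bound can be checked with simple arithmetic: with n = 3t + 1 replicas and quorums of 2t + 1, any two quorums intersect in at least t + 1 replicas, so correct replicas outnumber the t faulty ones in every intersection:

```python
# Why n = 3t + 1: two quorums of size 2t + 1 always overlap in at
# least t + 1 replicas, enough to outvote the t faulty ones.
def byzantine_sizes(t):
    n = 3 * t + 1                 # replicas needed to tolerate t faults
    quorum = 2 * t + 1            # votes required before acting
    overlap = 2 * quorum - n      # minimum intersection of two quorums
    assert overlap >= t + 1       # correct replicas outvote the t liars
    return n, quorum

for t in range(1, 4):
    n, q = byzantine_sizes(t)
    print(f"t={t}: n={n} replicas, quorum={q}")
```
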
Hard practical problem
- Suppose that a distributed system is built from standard components, with application-specific code added to customize behavior
- How can such a system be made reliable without rewriting everything from the ground up?
- Need a plug-and-play reliability solution
- If reliability increases complexity, will reliability technology actually make systems less reliable?