CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 12: Oct 3
Reliable Group Comm.
• We know how to build GMS and implement
primary partition model
– It can’t guarantee availability in all situations
– But it is much less likely to block than a
transactional replication service
• Use it to implement group membership
views for process groups
• Use views as input to reliable multicast
• Build ordered multicast on this:
– fbcast, cbcast: easy and cheap
– abcast: more costly, several approaches
• Synchronize with new view delivery
Process groups with joins, failures
[Timeline diagram: G0={p,q} → r, s request to join → r, s added; state xfer → G1={p,q,r,s} → p fails (crash) → G2={q,r,s} → t requests to join → t added; state xfer → G3={q,r,s,t}]
Notes on Virtual Synchrony
• In fact, must extend reliability to
also exclude “gaps” in causal past
– I.e. if multicast m → n, then if after a failure n is delivered, m should be too
– Ordering alone doesn’t guarantee this
• Turns out that this isn’t hard to implement
if the developer is careful
• “gap freedom” has no significant cost
implications
Notes on Virtual Synchrony
• Compared to quorum schemes with
2PC, one could argue that
– View membership is “like” quorum update
with 2PC at the end
– Multicast is “like” reading a list of
members transactionally
• Insight: Virtual synchrony remembers
membership from multicast to
multicast. Transactional schemes
must rediscover membership on each
operation they do!
Asynchrony
• Notice that fbcast and cbcast can be
used asynchronously, while abcast
always “stutters”
– Insight is that fbcast and cbcast can
always be delivered to the sender at the
time the multicast is sent
– Abcast delivery ordering usually isn’t
known until a round of message
exchange has been completed
• Results in a tremendous
performance difference
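A minimal sketch of that difference, assuming a token/sequencer-style abcast; the Sequencer and Member names and the single-process structure are illustrative only, not the real group-communication interfaces:

class Sequencer:
    """Stands in for whatever mechanism fixes the total (abcast) order."""
    def __init__(self):
        self.next_seq = 0
    def assign(self, msg):
        s = self.next_seq          # in a real system this is a round of message exchange
        self.next_seq += 1
        return s

class Member:
    def __init__(self, sequencer):
        self.sequencer = sequencer
        self.delivered = []        # local delivery order at this member

    def cbcast(self, msg):
        # Asynchronous: FIFO/causal order is already known at send time,
        # so the sender can deliver to itself and return immediately.
        self.delivered.append(msg)

    def abcast(self, msg):
        # "Stutters": even the sender must wait until the global order
        # is known before it can deliver, even to itself.
        seqno = self.sequencer.assign(msg)
        self.delivered.append((seqno, msg))
        return seqno

p = Member(Sequencer())
p.cbcast("X = 7")                  # returns at once; transmission continues in the background
print(p.abcast("Y = 23"))          # returns only after the ordering step (prints 0)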
Asynchronous Multicast
• Lets the sender avoid blocking
[Diagram: sender streams updates X=7, Y=23, X=X-1, … without waiting for remote delivery]
• But should be used with care!
– Sender isn’t blocking but multicasts are
accumulating in the sender subsystem
– Sender could get far ahead of other
processes in the group
Abcast is more synchronous
• Only a lucky sender avoids blocking
[Diagram: concurrent updates Y=33, X=7, Y=23, X=X-1, … must wait for the delivery ordering]
• Even sender needs to wait to know the delivery
ordering
• Allows concurrent updates, but at the cost of much
higher multicast latency!
– Most senders will wait for equivalent of an RPC; only token
holder avoids this
– Whole group moves in lock-step
Tradeoff
• With asynchronous cbcast or fbcast,
we gain concurrency at the sender
side, but this helps mostly if
remainder of group is idle or doing a
non-conflicting task
• With abcast we can concurrently do
conflicting tasks but at cost of
waiting to know the delivery order
for our own multicasts
Asynchrony: grain of salt
• A good thing… in moderation
• Too much asynchrony
– Means things pile up in output
buffers
– If a failure occurs, much is lost
– And we could consume a lot of
sender-side buffering space
Concatenation
[Diagram: application sends 3 asynchronous cbcasts; the message layer of the multicast subsystem combines them into a single packet]
Avoiding Trouble?
• First, system itself should limit amount of
asynchronous buffering
• Also, we add a “flush” primitive
– It delays until asynchronous messages get
delivered remotely
• Useful in two ways
– When sender wants to do something that must
“survive” even if sender fails
– If sender is worried about buildup of
asynchronously buffered messages (rare)
• Former case is like “durability” (ACID)
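A sketch of what such a flush primitive might look like on the sender side; the AsyncSender, cbcast_async and remote_ack names are hypothetical stand-ins for the real transport hooks:

import threading

class AsyncSender:
    def __init__(self):
        self.pending = set()               # ids of multicasts not yet stable remotely
        self.lock = threading.Condition()
        self.next_id = 0

    def cbcast_async(self, msg):
        with self.lock:
            mid = self.next_id
            self.next_id += 1
            self.pending.add(mid)          # buffered; real transmission happens in background
        return mid

    def remote_ack(self, mid):
        # Called when the transport learns the message reached the other members.
        with self.lock:
            self.pending.discard(mid)
            self.lock.notify_all()

    def flush(self):
        # Delay the caller until every buffered multicast is delivered remotely,
        # e.g. before an action that must "survive" the sender's failure.
        with self.lock:
            while self.pending:
                self.lock.wait()

sender = AsyncSender()
m = sender.cbcast_async("X = 7")
threading.Timer(0.1, sender.remote_ack, args=[m]).start()   # ack arrives later
sender.flush()                                               # returns once m is stable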
Effect?
• Overall, system might surge ahead
using asynchronous multicast
• But in fact, no process ever gets far
ahead of remainder of group
• Analogous to TCP sliding window,
but here window is a window of
pending multicasts
• With abcast the whole issue is much
less evident, but on the other hand,
the system runs much slower
Presenting to user
• We can just offer the user “raw” GCS
but this is uncommon
– pg_join(…), pg_leave(…)
– cbcast(….), pg_flush(…)
– Upcalls for msg_rcv, new_view
• Instead, usually build “tools” to
make user’s life simpler
• Tools package the common
functions in a standard way
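A single-process stub of the "raw" interface listed above; a real GCS would route pg_join and cbcast through the membership and multicast services, so this only illustrates the upcall structure:

class GroupHandle:
    def __init__(self, name, msg_rcv, new_view):
        self.name, self.msg_rcv, self.new_view = name, msg_rcv, new_view
        self.new_view(["p"])                  # initial view upcall
    def cbcast(self, msg):
        self.msg_rcv("p", msg)                # loopback delivery only, in this stub
    def pg_flush(self):
        pass                                  # nothing is buffered in the stub
    def pg_leave(self):
        self.new_view([])

g = GroupHandle("/demo",
                msg_rcv=lambda s, m: print("msg_rcv", s, m),
                new_view=lambda v: print("new_view", v))
g.cbcast("X = 7")
g.pg_leave()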
Sample tools
• Replicated data with locking
and state transfer
• Load balancing
• Fault-tolerance through primary-backup or coordinator-cohort
• Task subdivision schemes,
based on work partitioning
Building a tool
• Tools simply map your request to the
underlying group primitives
• They exploit
– View notification and current view
– Multicast of various flavors
– Synchronization properties
• Tools are optimized to perform well
Tools Challenge
• Efficiency: generic interface tends to
lose performance opportunities
associated with knowing application
semantics
– For example, we might know about an
update pattern
– Or we could know that locking ensures
that concurrent threads must be doing
non-conflicting actions
• How much can we assume about the
application?
Active Replication
• Simplest use of process groups
• Members replicate
– Data (they maintain local copies)
– Actions (all perform same operations in
same order)
• Basic idea is to use totally ordered
multicast to send updates to all the
group members. They can perform
read operations using local copy of
data.
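A sketch of that basic idea: updates pass through a totally ordered multicast (modeled here by a shared ordered log, a stand-in for abcast), while reads use the local copy; the OrderedLog and Replica names are illustrative:

class OrderedLog:                        # stands in for abcast: one global order
    def __init__(self):
        self.entries = []
    def append(self, op):
        self.entries.append(op)

class Replica:
    def __init__(self, log):
        self.log, self.applied, self.data = log, 0, {}
    def update(self, key, value):
        self.log.append((key, value))    # multicast the operation to all members
    def read(self, key):
        self._catch_up()                 # apply updates delivered since the last read
        return self.data.get(key)
    def _catch_up(self):
        while self.applied < len(self.log.entries):
            k, v = self.log.entries[self.applied]
            self.data[k] = v
            self.applied += 1

log = OrderedLog()
a, b = Replica(log), Replica(log)
a.update("x", 7)
b.update("x", 8)
print(a.read("x"), b.read("x"))          # both print 8: same operations, same order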
Synchronization
• Many replication schemes will
require some form of locking
– Application interface: lock/unlock(“x”)
– Think of lock “state” as replicated data!
• Architecture is simplified if locking
is not needed (if each “operation” is
fully described in a single multicast)
• Otherwise, implement lock/unlock
using a token passing algorithm
Locks as tokens
• p initially holds the lock
• Later it moves to q, then back to p, then back to q when p crashes
• Lock holder can update replicated data managed by the group
[Timeline diagram: the lock token passes p → q → p → q among processes p, q, r, s, t; p crashes]
Locks as tokens
• Can have multiple locks per group
• Easily implemented
– For example, can use cbcast to request
the lock
– Unlock is also cbcast. It designates the
process that will receive the lock,
probably oldest pending lock request
– Notice that since lock-req → lock-grant,
any process receiving a grant will find
the corresponding request on its queue!
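A sketch of that token-style locking, assuming the lock-req and unlock multicasts are delivered in a causally consistent order at every member; the LockReplica name and upcall structure are illustrative:

from collections import deque

class LockReplica:
    def __init__(self):
        self.holder = None
        self.waiting = deque()          # pending lock requests, in delivery order

    # Each method below is the delivery upcall for one multicast.
    def deliver_lock_req(self, who):
        if self.holder is None:
            self.holder = who           # lock was free: grant immediately
        else:
            self.waiting.append(who)

    def deliver_unlock(self, next_holder):
        # Because lock-req causally precedes lock-grant, next_holder is
        # guaranteed to already be on our local queue.
        if next_holder is not None:
            self.waiting.remove(next_holder)
        self.holder = next_holder

    def choose_next(self):
        return self.waiting[0] if self.waiting else None   # oldest pending request

r = LockReplica()
for m in ("p", "q", "r"):
    r.deliver_lock_req(m)               # p holds the lock; q and r wait
r.deliver_unlock(r.choose_next())       # p releases; grant names q
print(r.holder, list(r.waiting))        # q ['r']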
State Transfer
• This is the problem of providing
a joining group member with
initial values for the replicated
data
• Needs to be synchronized with
incoming updates so that state
will reflect each update, exactly
once.
State Transfer
• Tool intercepts the new view
• Just when it would have been
delivered, instead we do an upcall to
the application
– Application writes down its state
– We capture it in messages and send them to the joining processes
– After last state message, allow the new
view to be delivered
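A sketch of that tool, under the assumption that one existing member is picked as the transfer source; the class and upcall names (get_state, deliver_view, receive_state) are hypothetical:

class StateTransferTool:
    def __init__(self, get_state, deliver_view):
        self.get_state = get_state          # upcall: application writes down its state
        self.deliver_view = deliver_view    # finally let the application see the view

    def on_new_view(self, view, joiners, i_am_transfer_source):
        if i_am_transfer_source:
            state = self.get_state()        # capture state at the view boundary
            for j in joiners:
                j.receive_state(state)      # in a real system: point-to-point state messages
        self.deliver_view(view)             # updates after this belong to the new view

class Joiner:
    def __init__(self, set_state):
        self.set_state = set_state
    def receive_state(self, state):
        self.set_state(state)

store = {"x": 7}
joiner_store = {}
j = Joiner(set_state=lambda s: joiner_store.update(s))
tool = StateTransferTool(get_state=lambda: dict(store),
                         deliver_view=lambda v: print("view delivered:", v))
tool.on_new_view({"p", "q", "r"}, joiners=[j], i_am_transfer_source=True)
print(joiner_store)                         # {'x': 7}: state reflects each update exactly once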
State Transfer Algorithm
[Diagram: p and q run in view G0={p,q}; r joins, forming G1={p,q,r}. The state transfer looks instantaneous to the application, but actually it has a concealed structure]
State Merge after
Partitioning
• A merge is a form of state transfer
done when two partitions combine
after a link is fixed
• Usually, state transfer is employed to take the state of the primary partition and copy it to the non-primary side
• Sometimes the non-primary side will then reissue updates that occurred while it was separated
Active Replication
• This adds up to active
replication
– We use asynchronous cbcast for
locking, updates
– State transfer to initialize joining
processes
– Performance limited by degree of
asynchronous communication we
tolerate
Active Replication
[Timeline diagram, repeating the earlier membership example: G0={p,q} → r, s request to join → r, s added; state xfer → G1={p,q,r,s} → p fails (crash) → G2={q,r,s} → t requests to join → t added; state xfer → G3={q,r,s,t}]
How cheap is it?
• As described, this is the case
where Horus reaches 75,000 or
more updates per second
• In contrast, transactional
replication rarely pushes
beyond 200 per second and 50
is more common
Cheap can be “too cheap”
• Database applications may need
stronger properties
• Dynamic uniformity required if
actions leave external traces and
must be consistent with them
... but when this happens, expect to lose a factor of one hundred in performance; this limits the approach to applications that are not very sensitive to latency
Uses for replicated data
• Replicated file system: local copies are
always “safe”, never inconsistent
• Replicated or “cloned” web pages: don’t
overload the server. Load-balance queries
• Replicated control or management policy:
used to supervise a component consistent
with the rest of the system
• Replicated security keys: for authorization
Groupware Uses of
Replication
• Might want to replicate display for a
conferencing system: multicast
to/between Java applets
• Replicate the slides being shown by
a speaker
• Let individuals keep copies of bulky documents or other information that might be slow to transfer if we wait until the last minute. Only need to send out the updates.
Financial Example of
Replication
• Many “trading” systems show bankers or
brokers stock prices as they change
• Each stock is like a small process group
• Each new price is like an update
• Benefit of our “model” is that traders see
exactly the same input. This avoids risk of
inconsistent decisions
• Also replicate critical servers for
availability
Distributed Trading System
[Architecture diagram: market data feeds flow into pricing DBs (historical data and current pricing), analytics, and trader clients; a long-haul WAN spooler links sites in Tokyo, London, Zurich, ...]
1. Availability for historical data
2. Load balancing and consistent message delivery for price distribution
3. Parallel execution for analytics
Publish-Subscribe
Paradigm
• A popular way to present replicated data
• Processes publish and subscribe to “subjects”, which can be any ASCII pathname
– news_post(“subject”, message, length)
– news_subscribe(“subject”, procedure)
• State transfer is by “playback” of prior
postings
– news_subscribe_pb(“subject”, procedure)
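A single-process sketch of these primitives, with playback of prior postings acting as the state transfer; the NewsBus class is an illustrative stand-in for the group-backed implementation described on the next slides:

from collections import defaultdict

class NewsBus:
    def __init__(self):
        self.subscribers = defaultdict(list)   # subject -> callback procedures
        self.spool = defaultdict(list)         # subject -> retained postings

    def news_post(self, subject, message):
        self.spool[subject].append(message)    # spooler records it for playback
        for proc in self.subscribers[subject]:
            proc(subject, message)

    def news_subscribe(self, subject, proc):
        self.subscribers[subject].append(proc)

    def news_subscribe_pb(self, subject, proc):
        for old in self.spool[subject]:        # "state transfer" = replay of prior posts
            proc(subject, old)
        self.news_subscribe(subject, proc)

bus = NewsBus()
bus.news_post("/equities/IBM", "142.50")
bus.news_subscribe_pb("/equities/IBM", lambda s, m: print(s, m))   # playback shows 142.50
bus.news_post("/equities/IBM", "142.75")                           # new posting, delivered live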
Conceptually, a message “bus”
[Diagram: publishers, subscribers, and spoolers attached to a shared message bus]
• Boxes are publishers (blue / green subjects)
• Circles are subscribers (same color coding)
• Disks represent spoolers used for playback
• Flexible and easily extended over time
• Supports huge numbers of subjects
Implementation of
message bus
• Map subjects to process groups
(could do 1-1 mapping but many-1 is
more efficient)
• Spoolers join all groups for which
spooling is desired, record
messages on disk files
• State transfer from spooler used for
playback. New messages handled
like “updates”
Need for speed?
• New York Stock Exchange
generates about 25-50 trades per
second, peak is 100
• Trades can be described in 512
byte records
... so we could potentially send every
trade on the stock exchange to an
individual workstation. With
hardware multicast, we could send
to a whole trading floor.
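As a back-of-the-envelope check on those numbers: even at the 100-trade-per-second peak, 100 × 512 bytes is roughly 50 KB/s, a modest rate for a single workstation's network link, and one that hardware multicast can fan out to every desk on a trading floor.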
Applications with more
demanding requirements?
• Page memory over a network (replicated,
consistent, DSM)
• Implement an in-memory file system using
a set of workstations, XFS-style
• Transmit MPEG encoded video frames
• Interactively control a robot or some other
remote device
• Warn that an earthquake is about to
happen
But there are also
scaling limits
• Scaling: large numbers of
destinations, high data rates.
• With many receivers some may lag behind, overload, or become “lossy”: how to deal with this case?
• As networks get larger, bridges and
WAN links make performance very
variable
... all of which makes flow control very
hard
Replicating a Server
• Client/server computing is now standard
– Database servers and OLTP applications
– File servers
– Web or Java servers
– Special purpose servers (example: compute the theoretical price of a stock under some assumptions about the economy)
• Replication for load-balancing, fault-tolerance
Basic idea is simple
• If we just replicate the inputs to
the server, the copies will stay in
“sync”
• What makes it hard?
– Servers need to share the work (hence do different things)
– Fault-tolerance (we don’t want the
same input to crash both servers)
– Server restart (don’t want to transfer a
huge state)
The usual approach
• Replicate the updates but not the
queries.
• Load-balance the queries
– Bind each client to a different server, or
– Periodically “publish” load values and
use these as the basis of a random
algorithm, or
– Multicast all requests; servers split the
work using a deterministic scheme (e.g.
even/odd)
Randomized Load
Balancing
• Track loads on servers through
periodic updates
load0 = 2, load1 = 4, load2 = 4
• Think of 1/load as an interval on a line: 1/2 | 1/4 | 1/4
• Toss a random coin for each request and send to the corresponding server
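A sketch of that randomized choice: weight each server by 1/load, treat the normalized weights as intervals on [0, 1), and pick a server with one random draw (the function name pick_server is illustrative):

import random

def pick_server(loads, rng=random.random):
    weights = [1.0 / max(l, 1e-9) for l in loads]   # 1/load; guard against a zero load
    total = sum(weights)
    r = rng() * total                               # random point on the weighted line
    acc = 0.0
    for server, w in enumerate(weights):
        acc += w
        if r < acc:
            return server
    return len(loads) - 1                           # floating-point fallback

# loads 2, 4, 4 give intervals 1/2, 1/4, 1/4, as on the slide
print(pick_server([2, 4, 4]))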
Load-Balanced,
Replicated Server
• Clients distribute queries using
random load-balancing and reissue
them if the server fails or is too slow
• Updates are multicast to all servers,
all execute them in parallel
Handling Server
Recovery
• If state is small, use state transfer
• If state is medium sized, transfer it before the join request, then do the join and transfer only the updates that occurred at the last moment
• If state is huge, use periodic
checkpoints and keep logs of
incremental changes. State
transfer only the log contents.
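A sketch of the "huge state" case, assuming updates carry log sequence numbers: the recovering server loads its last local checkpoint and then state-transfers only the log tail it is missing (the RecoveringServer name and log format are hypothetical):

class RecoveringServer:
    def __init__(self, checkpoint_state, checkpoint_lsn):
        self.state = dict(checkpoint_state)   # last on-disk checkpoint
        self.applied_lsn = checkpoint_lsn     # log sequence number the checkpoint covers

    def recover(self, group_log):
        # group_log: list of (lsn, key, value) records kept by the running members
        for lsn, key, value in group_log:
            if lsn > self.applied_lsn:        # transfer and apply only the missing tail
                self.state[key] = value
                self.applied_lsn = lsn
        return self.state

log = [(1, "x", 7), (2, "y", 23), (3, "x", 6)]
s = RecoveringServer(checkpoint_state={"x": 7}, checkpoint_lsn=1)
print(s.recover(log))                         # {'x': 6, 'y': 23}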
Fault-Tolerance Options
• On the “client side” can try a request
and then reissue it if the server fails.
• On the server side, can use primary-backup or coordinator-cohort methods
• When there are real-time constraints
on the system, use 2 primaries for
each request to be sure that 1 reply
will be received in time
Primary-Backup Scheme
• Primary server handles all the
work. Backup is passive but
takes over if primary fails.
Primary-Backup issues
• Non-determinism: must “control” it,
may need to ship costly “traces”
from primary to backup so that
backup can reproduce actions of
primary
• Potential for a window of lost actions: if we wait for stability of trace data, response time suffers, but if we don’t wait, we can lose actions when the primary fails
Coordinator-Cohort
Scheme
• Each server is primary (coordinator)
for some requests. It acts as backup
(cohort) for others.
• This approach is well matched to
load-balancing
• Servers do “different things” hence
are less likely to fail simultaneously
Coordinator-Cohort Scheme
[Diagram: one server is coordinator for green clients and backup for red clients; the other is coordinator for red clients and backup for green clients]
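A sketch of one way members could agree on roles: every member applies the same deterministic hash of the client identity to the current view, so all agree on who coordinates and who stands by as cohort (the roles function and view layout are assumptions, not the Isis/Horus API):

import hashlib

def roles(view, client_id):
    # Deterministic hash so every member computes the same mapping for a request
    h = int(hashlib.md5(client_id.encode()).hexdigest(), 16)
    coordinator = view[h % len(view)]
    cohort = view[(h + 1) % len(view)]      # next member in the view acts as backup
    return coordinator, cohort

view = ["server-a", "server-b"]
print(roles(view, "green-client-7"))        # e.g. ('server-a', 'server-b')
print(roles(view, "red-client-3"))          # may map to the other server, balancing load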
Beyond single groups
• We’ve talked about groups one
at a time
• But what if groups were cheap?
– We could use a separate group for
each data item in a system
– Vision: groups as a first-class
programming tool
Multigroup concepts
• Circus: Berkeley around 1988
– Uses groups for replication
– Idea is to replicate at a fine-grained level
– But costs were high
• Isis: Cornell, 1987
– Groups as a distributed structuring
and management tool
Lightweight groups
• Normal groups have overheads
– Notably, membership change
• With lots of groups, these costs
become prohibitive
• Leads to lightweight groups
– One big group to track membership
– Lightweight subgroups used by
application, they map to big group
Issues to think about
• Won’t a replicated server just
reproduce the failure of the primary?
• Does it make more sense to do replication in hardware (e.g. a RAID file system, or paired hardware fault-tolerance)?
• Homework: how would you do
reliability for NASA’s cheap space
missions?