Ordering of Events in Distributed Systems

advertisement
UNIVERSITY of WISCONSIN-MADISON
Computer Sciences Department
CS 739
Distributed Systems
Andrea C. Arpaci-Dusseau
Ordering of Events in Distributed
Systems
Two papers:
• Time, Clocks, and the Ordering of Events in a
Distributed System, Lamport, 1978
• Distributed Snapshots: Determining Global States of
Distributed Systems, Chandy and Lamport, 1985
Motivation: Time, Clocks, Ordering
To develop distributed algorithms, want all participants see
messages in same order (if want same decisions!)
How can distributed nodes agree on order of messages?
Process A
Process B
Naïve: Each process
believes sent message
first
Motivation: Time, Clocks, Ordering
When does an event precede (“happen before”) another in a
distributed system?
• Sometimes impossible to tell; sometimes it doesn’t matter
– If event A occurs on machine A,
and event B occurs on machine B,
but there is no communication between A and B,
then did event A or event B happen first???
How can a process identify which events happened before
others?
• Logical clocks
Terminology
Distributed system: A collection of distinct processes
which are spatially separated and which
communicate with one another by exchanging
messages
• How does this differ from our previous definitions?
Process: A sequence of events (instructions,
sending messages, receiving messages)
• The events within a process have a total ordering
Partial Ordering
Happened before: ->
Rules for ordering events a and b
1) if a and b are events in same process and a comes before b,
then a->b
2) if a is the sending of a message by a process and b is receiving
that message, then a->b
3) if a->b and b->c then a->c
a->b: It is possible for a to causally affect b
Concurrent: ‡>
•
•
if a ‡> b and b ‡> a, then do not know ordering of a and b
It is not possible for a to causally affect b
Space-Time Diagram
P
Q
Time
p4
q7
R
r4
q6
q5
p3
q4
r3
r2
q3
p2
q2
p1
q1
r1
What is the relationship between (q3,p3)? (p1,q3)? (p2,q3)? (q3,r4)? (r1,q6)?
Logical Clocks
Abstract view: Logical clock is a way to assign a number to
an event to express ordering
• No relation between logical clock and physical time
Clock Ci for process Pi is a function that assigns a number
Ci(a) to any event a in Pi
Clock condition: For any events a, b:
if a->b, then C(a) < C(b)
Converse condition does not hold
• Can’t say concurrent events have same logical time
If C(a) < C(b) can’t conclude a->b
Implementation of Logical Clocks
C1.
• If a & b are in Pi & a before b, then Ci(a) < Ci(b)
IR1. (Implementation Rule)
• Each process Pi increments Ci between any two successive
events
C2.
• If a is sent by Pi and b is received by Pj, then Ci(a) < Cj(b)
IR2.
• (a) If event a is the sending of message m by process Pi, then m
contains a timestamp Tm=Ci(a)
• (b) Upon receiving m, process Pj sets Cj greater than or equal to
its presents value and greater than Tm.
Logical Clocks Example
p4
q7
r4
q6
q5
p3
q4
r3
r2
q3
p2
q2
p1
q1
What logical clock values are possible?
Assume initial C(p1)=5, C(q1)=50, C(r1)=2
r1
Logical Clocks Example
57
56 p4
q7
r4 56
q6 56
r3 55
q5 55
53 p3
54 q4
53
52 p2
6
52
p1
r2 3
q3
q2
r1
q1 51
5
50
What logical clock values are possible?
Assume initial C(p1)=5, C(q1)=50, C(r1)=2
2
Total Ordering
Use logical clocks to obtain total ordering across all
processes and events
a => b if and only if:
• 1) Ci(a) < Cj(b) OR
• 2) Ci(a) = Cj(b) and Pi < Pj (i.e., use process ids to break ties)
Partial ordering is unique, but total ordering is not!
• Concurrent operations can go in any order
• Total ordering depends upon implementation of each Ci()
• Total ordering depends upon tie breaking rules
Distributed State Machines
Each process runs same distributed algorithm
Relies upon total ordering of requests
• Agreed upon by all participants
• Can be used to ensure all see events (inputs) in same order and therefore
make same decisions
Idea:
• All requests are time-stamped, responses (acks) are time-stamped too
• Send time-stamped request to all processes
• Handle next request in total order
– To know next request, must have received greater timestamp from all possible
participants
– Problems?
Example: Mutual exclusion
Mutual Exclusion Example
A
1
B
10
C
20
D
30
Scenario: A and C want resource
A and C each send request before see other’s
(concurrent requests)
C sends request out earlier in “physical time”
How will all nodes agree who should get
resource?
Request queue:
Mutual Exclusion Example
A
1
B
10
C
D
20
21
2
21
21
22
31
2
22
32
23
32
Request queue: A - 2, C - 21
2
30
Physical Clocks
Motivation: Can observe anomalous behavior if other
communication channels exist between processes
• Useful to have physical clock with meaning in physical world
Synchronize independent physical clocks, each running at
slightly different rates (skew)
Implementation Idea:
• Send timestamp with each message
• Receiver may update clock to timestamp+minimal network delay
– Clock must always increase
Lots of work in this area
Conclusions
Distributed, replicated state machines useful for
tolerating faults
Need to construct total ordering of events to obtain
same results everywhere
Logical clocks very simple to implement
Will see logical clocks used to update replicas
Distributed Snapshots
Goal
• Want to record global state of distributed system (i.e., state of each
process, state of each communication channel)
• Useful so can observe system properties
– Computation terminated?
– System deadlocked?
– Number of tokens?
Complication:
• Distributed system has no shared state nor shared clock
• Cannot record global state simultaneously everywhere
Distributed snapshot: Record local state at different times
and combine into meaningful picture
• Obtain cut in logical time, remain consistent by preserving logical
ordering (if not ordering in physical time)
System Model
Distributed system: Finite set of processes and
channels; described by graph
Processes
• Set of states, initial state, set of events
Channels
• FIFO, error-free, infinite buffers, arbitrary but finite
delay
• State: Sequence of messages sent but not yet received
Distributed Snapshot Algorithm
Goal: Record local state (each process plus adjoining
channels) that produces a “meaningful” global system
state
Idea:
• After local snapshot, send marker along channels
• Receiver records messages in channel before marker
Initial: Some process decides to initiate snapshot (performed
periodically)
Intuition with Logical Time
A
m
1
S
m2
S
1
S
2
S3
Which snapshots are concurrent?
How to reconcile S2 and S?
How to reconcile S1 and S?
How to reconcile S3 and S?
B
Marker Rules
Marker-sending rule for p:
• Send marker along each channel (after recording state of p) before
sending more messages
Marking-receiving rule for q on channel c:
• if q has not recorded state yet:
– record state of q
– record state of c as empty
• if q has recorded state already:
– record state of c as the msg sequence that arrived since it recorded its
state
Termination
• When state recorded of all processes and all channels
• Must have algorithm to collect and assemble information too
Banking Example
p1: $10
p2: $20
$1
p3: $30
$2
$3
$5
$3
$4
Stable property?
Banking Example
p1: $10
p2: $20
$1
p3: $30
$2
$3
$5
$3
$4
p2 state: $20+1-3 = 18, empty channels
Banking Example
p1: $10
p2: $20
p3: $30
$1
$2
$3
$5
$3
$4
p1 state: $10-1-3=6, empty channels
p3 state: $30-2-5-4+3=22, empty channels
Total money?
18+6+22
= $46
Banking Example
p1: $10
p2: $20
$1
p3: $30
$2
$3
$5
$3
$4
Banking Example
p1: $10
p2: $20
$1
p3: $30
$2
$3
$5
$3
$4
c: p2 from p1: nothing
c: p2 f p3: 4 + 2
c: p3 f p1: 3
c: p1 f p3: 5
p1: 6, p2: 18, p3: 22
Banking Example
p1: $10
p2: $20
$1
p3: $30
$2
$3
$5
$3
$4
c p2 from p3: Never $2 and $4 simultaneously
Properties of
Recorded Global State
Recorded global state, S*, may not have occurred
If it bothers you that S* doesn’t actually exist...
•
•
•
•
Given a permutation of the actual sequence of events
S* is reachable from Sinit
Sfinal is reachable from S*
Stable properties will hold in S* as well
How to permute sequence of events?
Goal: Want snapshot to correspond to single logical cut
• Slide events so snapshots taken at same “logical time”
• Some events across processes will switch order with others
– Specifically, postrecorded events and prerecorded events
– prerecorded events occurred before state of p was recorded
– Can’t tell ordering of concurrent pre and post events
Banking Example
p1: $10
p2: $20
$1
p3: $30
$2
$3
$5
$3
post
$4
pre
Example: Need to swap sending $4 from p3 and receiving $2
Still logically consistent; could not observe difference
Conclusions
Distributed snapshots: Allow one to reconstruct valid
picture of system from snapshots taken at
different points in time
•
Record individual process states plus channels
Snapshots useful for determining if stable properties
hold or not
Recorded state S* may not correspond to “reality”,
but if stable properties hold in beginning and end
states, then hold in S* too
Download