Set 16: Distributed Shared Memory

CSCE 668: Distributed Algorithms and Systems
Fall 2011
Prof. Jennifer Welch
Distributed Shared Memory

- A model for inter-process communication
- Provides the illusion of shared variables on top of message passing
- Shared memory is often considered a more convenient programming platform than message passing
- Formally, we give a simulation of the shared memory model on top of the message passing model
- We'll consider the special case of:
  - no failures
  - only read/write variables to be simulated
The Simulation

[Diagram: users of read/write shared memory sit on top; each user issues read/write invocations to its local algorithm (alg0 … algn-1) and receives return/ack responses; the algorithms cooperate to simulate Shared Memory using send and recv events of the underlying Message Passing System.]
Shared Memory Issues

- A process invokes a shared memory operation (read or write) at some time
- The simulation algorithm running on the same node executes some code, possibly involving exchanges of messages
- Eventually the simulation algorithm informs the process of the result of the shared memory operation
- So shared memory operations are not instantaneous!
- Operations (invoked by different processes) can overlap
- What values should be returned by operations that overlap other operations?
  - defined by a memory consistency condition
Sequential Specifications

- Each shared object has a sequential specification: it specifies the behavior of the object in the absence of concurrency.
- The object supports operations:
  - invocations
  - matching responses
- The specification is the set of sequences of operations that are legal
Sequential Spec for R/W Registers

- Each operation has two parts, invocation and response
- A read operation has invocation readi(X) and response returni(X,v) (subscript i indicates the process)
- A write operation has invocation writei(X,v) and response acki(X)
- A sequence of operations is legal iff each read returns the value of the latest preceding write
- Ex: [write0(X,3) ack0(X)] [read1(X) return1(X,3)]
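The legality rule above can be sketched as a tiny checker. This is an illustrative Python fragment, not from the slides; the tuple encoding and the helper name `is_legal` are assumptions:

```python
# Sketch of the sequential spec for a single read/write register X.
# Each operation is a tuple: ("write", value) or ("read", value_returned).
# Assumes an initial value of 0, as in the examples in these slides.

def is_legal(ops, initial=0):
    """Return True iff every read returns the value of the latest preceding write."""
    current = initial
    for kind, value in ops:
        if kind == "write":
            current = value
        elif kind == "read":
            if value != current:
                return False
    return True

# The slide's example: [write0(X,3) ack0(X)] [read1(X) return1(X,3)]
print(is_legal([("write", 3), ("read", 3)]))   # True
print(is_legal([("write", 3), ("read", 0)]))   # False
```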
Memory Consistency Conditions

- Consistency conditions tie together the sequential specification with what happens in the presence of concurrency.
- We will study two well-known conditions:
  - linearizability
  - sequential consistency
- We will only consider read/write registers, in the absence of failures.
Definition of Linearizability

- Suppose σ is a sequence of invocations and responses for a set of operations.
  - an invocation is not necessarily immediately followed by its matching response; there can be concurrent, overlapping ops
- σ is linearizable if there exists a permutation π of all the operations in σ (in which each invocation is immediately followed by its matching response) s.t.
  - π|X is legal (satisfies the sequential spec) for every variable X, and
  - if the response of operation O1 occurs in σ before the invocation of operation O2, then O1 occurs in π before O2 (π respects the real-time order of non-overlapping operations in σ).
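The definition above can be turned into a brute-force checker for tiny histories: try every permutation π, test legality per variable, and test the real-time constraint. Everything here (the tuple encoding, the timestamps) is an illustrative assumption, not from the slides:

```python
from itertools import permutations

# Brute-force check of the linearizability definition; only practical for
# tiny histories, since it tries every permutation. Each operation is a
# tuple (proc, var, kind, value, inv_time, resp_time).

def is_linearizable(ops, initial=0):
    n = len(ops)
    for perm in permutations(range(n)):
        pos = {j: i for i, j in enumerate(perm)}
        # pi must respect real-time order: if a responds before b is
        # invoked, a must precede b in the permutation
        if any(ops[a][5] < ops[b][4] and pos[a] > pos[b]
               for a in range(n) for b in range(n)):
            continue
        # pi|X must be legal for every variable X
        current, legal = {}, True
        for j in perm:
            _, var, kind, value, _, _ = ops[j]
            if kind == "write":
                current[var] = value
            elif value != current.get(var, initial):
                legal = False
                break
        if legal:
            return True
    return False

# Overlapping writes, then each process reads the other's variable and sees 1:
ops = [("p0", "X", "write", 1, 0, 2), ("p0", "Y", "read", 1, 3, 6),
       ("p1", "Y", "write", 1, 1, 4), ("p1", "X", "read", 1, 5, 7)]
print(is_linearizable(ops))   # True

# If p1's read of X returned 0 instead, no linearization exists, because
# the read begins after write(X,1) has completed:
bad = ops[:3] + [("p1", "X", "read", 0, 5, 7)]
print(is_linearizable(bad))   # False
```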
Linearizability Examples

Suppose there are two shared variables, X and Y, both initially 0.

[Timeline diagram: p0 performs write(X,1)…ack(X) and then read(Y)…return(Y,1); concurrently, p1 performs write(Y,1)…ack(Y) and then read(X)…return(X,1).]

Is this sequence linearizable? Yes: order the operations write(X,1), write(Y,1), read(Y), read(X).
What if p1's read returns 0? No: p1's read of X begins after p0's write of X has completed, so it must return 1.
Definition of Sequential Consistency

- Suppose σ is a sequence of invocations and responses for some set of operations.
- σ is sequentially consistent if there exists a permutation π of all the operations in σ s.t.
  - π|X is legal (satisfies the sequential spec) for every variable X, and
  - if the response of operation O1 occurs in σ before the invocation of operation O2 at the same process, then O1 occurs in π before O2 (π respects the real-time order of operations by the same process in σ).
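Sequential consistency admits the same brute-force sketch, with the real-time constraint weakened to per-process order. The encoding below (per-process sequence numbers instead of timestamps) is an illustrative assumption, not from the slides:

```python
from itertools import permutations

# Brute-force check of sequential consistency for tiny histories.
# Each operation is a tuple (proc, var, kind, value, seq), where seq
# orders the operations issued by the same process.

def is_sequentially_consistent(ops, initial=0):
    n = len(ops)
    for perm in permutations(range(n)):
        pos = {j: i for i, j in enumerate(perm)}
        # pi only has to respect the order of ops at the same process
        if any(ops[a][0] == ops[b][0] and ops[a][4] < ops[b][4]
               and pos[a] > pos[b]
               for a in range(n) for b in range(n)):
            continue
        current, legal = {}, True
        for j in perm:
            _, var, kind, value, _ = ops[j]
            if kind == "write":
                current[var] = value
            elif value != current.get(var, initial):
                legal = False
                break
        if legal:
            return True
    return False

# p0: write(X,1); read(Y)->1.  p1: write(Y,1); read(X)->0.
# Sequentially consistent (order: write(Y,1), read(X), write(X,1), read(Y)):
history = [("p0", "X", "write", 1, 0), ("p0", "Y", "read", 1, 1),
           ("p1", "Y", "write", 1, 0), ("p1", "X", "read", 0, 1)]
print(is_sequentially_consistent(history))   # True

# Both reads returning 0 admits no valid ordering:
bad = [("p0", "X", "write", 1, 0), ("p0", "Y", "read", 0, 1),
       ("p1", "Y", "write", 1, 0), ("p1", "X", "read", 0, 1)]
print(is_sequentially_consistent(bad))   # False
```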
Sequential Consistency Examples

Suppose there are two shared variables, X and Y, both initially 0.

[Timeline diagram: p0 performs write(X,1)…ack(X) and then read(Y)…return(Y,1); p1 performs write(Y,1)…ack(Y) and then read(X)…return(X,0).]

Is this sequence sequentially consistent? Yes: order the operations write(Y,1), read(X), write(X,1), read(Y).
What if p0's read returns 0, i.e. return(Y,0)? No: then each read would have to precede the other process's write while following its own process's write, which is impossible.
Specification of Linearizable Shared Memory Comm. System

- Inputs are invocations on the shared objects.
- Outputs are responses from the shared objects.
- A sequence σ is in the allowable set iff:
  - Correct Interaction: each proc. alternates invocations and matching responses
  - Liveness: each invocation has a matching response
  - Linearizability: σ is linearizable
Specification of Sequentially Consistent Shared Memory

- Inputs are invocations on the shared objects.
- Outputs are responses from the shared objects.
- A sequence σ is in the allowable set iff:
  - Correct Interaction: each proc. alternates invocations and matching responses
  - Liveness: each invocation has a matching response
  - Sequential Consistency: σ is sequentially consistent
Algorithm to Implement Linearizable Shared Memory

- Uses totally ordered broadcast as the underlying communication system.
- Each proc keeps a replica for each shared variable.
- When a read request arrives:
  - send a bcast msg containing the request
  - when own bcast msg arrives, return the value in the local replica
- When a write request arrives:
  - send a bcast msg containing the request
  - upon receipt, each proc updates its replica's value
  - when own bcast msg arrives, respond with ack
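The algorithm can be sketched as follows, with totally ordered broadcast modeled as a single loop that delivers each message to every replica in the same order. This is a toy simulation: the class names are illustrative, and there is no real networking or concurrency.

```python
# Sketch of the linearizability algorithm: every read and write is
# broadcast, and an operation completes only when the process delivers
# its own broadcast. Totally ordered broadcast is modeled as a loop that
# delivers each message to every process in the same order.

class Process:
    def __init__(self, pid, system):
        self.pid = pid
        self.system = system
        self.replica = {}          # local copy of every shared variable

    def deliver(self, msg):
        kind, var, value = msg
        if kind == "write":        # read broadcasts change no replicas
            self.replica[var] = value

    def read(self, var):
        # the broadcast only delays the response until delivery
        self.system.to_bcast(("read", var, None))
        return self.replica.get(var, 0)   # answered on own delivery

    def write(self, var, value):
        self.system.to_bcast(("write", var, value))
        return "ack"               # own delivery has occurred

class System:
    def __init__(self, n):
        self.procs = [Process(i, self) for i in range(n)]

    def to_bcast(self, msg):
        for p in self.procs:       # same delivery order at every process
            p.deliver(msg)

sys3 = System(3)
sys3.procs[0].write("X", 1)
print(sys3.procs[2].read("X"))   # 1: the write reached every replica
```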
The Simulation

[Diagram: as before, users of read/write shared memory interact with alg0 … algn-1 via read/write invocations and return/ack responses; the algorithms now use to-bc-send and to-bc-recv events of a Totally Ordered Broadcast layer instead of point-to-point send/recv.]
Correctness of Linearizability Algorithm

- Consider any admissible execution α of the algorithm in which:
  - the underlying totally ordered broadcast behaves properly
  - users interact properly (alternate invocations and responses)
- Show that σ, the restriction of α to the events of the top interface, satisfies Liveness and Linearizability.
Correctness of Linearizability Algorithm

- Liveness (every invocation has a response): by the Liveness property of the underlying totally ordered broadcast.
- Linearizability: define the permutation π of the operations to be the order in which the corresponding broadcasts are received.
  - π is legal: all the operations are consistently ordered by the TO bcast.
  - π respects the real-time order of operations: if O1 finishes before O2 begins, O1's bcast is ordered before O2's bcast.
Why is Read Bcast Needed?

- The bcast done for a read causes no changes to any replicas; it just delays the response to the read.
- Why is it needed?
- Let's see what happens if we remove it.
Why Read Bcast is Needed

[Timeline diagram: p1 invokes write(1) and does a to-bc-send; p0's read returns 1 (its replica has already been updated), but p2's read, which starts after p0's read completes, still returns 0 (the broadcast has not yet been delivered at p2). This violates linearizability.]
Algorithm for Sequential Consistency

The linearizability algorithm, without doing a bcast for reads:
- Uses totally ordered broadcast as the underlying communication system.
- Each proc keeps a replica for each shared variable.
- When a read request arrives:
  - immediately return the value stored in the local replica
- When a write request arrives:
  - send a bcast msg containing the request
  - upon receipt, each proc updates its replica's value
  - when own bcast msg arrives, respond with ack
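A sketch of this variant: reads answer immediately from the local replica, and only writes go through a loop that updates every replica in the same order (standing in for totally ordered broadcast). Class and method names are illustrative, not from the slides.

```python
# Sketch of the SC algorithm: reads are answered locally ("fast"); only
# writes are broadcast. The broadcast is modeled as a loop delivering the
# write to every replica in the same order; in a real system the writer
# would wait only for its own delivery, not for remote ones.

class Replica:
    def __init__(self):
        self.store = {}

    def deliver(self, msg):
        var, value = msg
        self.store[var] = value

class SCSystem:
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]

    def read(self, pid, var):
        # local, no inter-process communication
        return self.replicas[pid].store.get(var, 0)

    def write(self, pid, var, value):
        for r in self.replicas:    # TO broadcast of the write
            r.deliver((var, value))
        return "ack"

sc = SCSystem(2)
sc.write(0, "X", 1)
print(sc.read(1, "X"))   # 1
```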
Correctness of SC Algorithm

- Lemma (9.3): The local copies at each proc. take on all the values appearing in write operations, in the same order, which preserves the order of non-overlapping writes.
  - implies the per-process order of writes is preserved
- Lemma (9.4): If pi writes Y and later reads X, then pi's update of its local copy of Y (on behalf of that write) precedes its read of its local copy of X (on behalf of that read).
Correctness of the SC Algorithm

(Theorem 9.5) Why does SC hold?
- Given any admissible execution α, we must come up with a permutation π of the shared memory operations that:
  - is legal, and
  - respects the per-proc. ordering of operations
The Permutation π

- Insert all writes into π in their to-bcast order.
- Consider each read R in α in its order of invocation:
  - suppose R is a read by pi of X
  - place R in π immediately after the later of:
    1. the operation by pi that immediately precedes R in α, and
    2. the write that R "read from" (the write causing the latest update of pi's local copy of X preceding the response for R)
Permutation Example

[Timeline diagram: p2 performs write(1)…ack and p1 performs write(2)…ack, each with a to-bc-send; p0's read returns 2 and p2's read returns 1. The resulting permutation is: write(1), p2's read returning 1, write(2), p0's read returning 2.]
Permutation π Respects Per-Proc. Ordering

For a specific proc:
- The relative ordering of two writes is preserved by Lemma 9.3.
- The relative ordering of two reads is preserved by the construction of π.
- If write W precedes read R in execution α, then W precedes R in π by construction.
- Suppose read R precedes write W in α. Show the same is true in π.
Permutation π Respects Ordering

- Suppose in contradiction R and W are swapped in π:
  - there is a read R' by pi that equals or precedes R in σ
  - there is a write W' that equals W or follows W in the to-bcast order
  - and R' "reads from" W'.

  σ|pi: … R' … R … W …
  π:    … W … W' … R' … R …

- But:
  - R' finishes before W starts in σ, and
  - updates are done to the local replicas in to-bcast order (Lemma 9.3),
  - so the update for W' does not precede the update for W,
  - so R' cannot read from W'.
Permutation π is Legal

- Consider some read R of X by pi and some write W s.t. R reads from W in α.
- Suppose in contradiction some other write W' to X falls between W and R in π:

  π: … W … W' … R …

- Why does R follow W' in π?
Permutation π is Legal

Case 1: W' is also by pi. Then R follows W' in π because R follows W' in α.
- The update for W at pi precedes the update for W' at pi in α (Lemma 9.3).
- Thus R does not read from W, contradiction.
Permutation π is Legal

Case 2: W' is not by pi. Then R follows W' in π due to some operation O, also by pi, s.t.
- O precedes R in α, and
- O is placed between W' and R in π.
Consider the earliest such O.

  π: … W … W' … O … R …

Case 2.1: O is a write (not necessarily to X).
- The update for W' at pi precedes the update for O at pi in α (Lemma 9.3).
- The update for O at pi precedes pi's local read for R in α (Lemma 9.4).
- So R does not read from W, contradiction.
Permutation π is Legal

  π: … W … W' … O … R …

Case 2.2: O is a read.
- By construction of π, O must read X and in fact read from W' (otherwise O would not be placed after W').
- The update for W at pi precedes the update for W' at pi in α (Lemma 9.3).
- The update for W' at pi precedes the local read for O at pi in α (otherwise O would not read from W').
- Thus R cannot read from W, contradiction.
Performance of SC Algorithm

- Read operations are implemented "locally", without requiring any inter-process communication.
- Thus reads can be viewed as "fast": the time between invocation and response is only that needed for some local computation.
- The time for a write is the time for delivery of one totally ordered broadcast (depends on how to-bcast is implemented).
Alternative SC Algorithm

- It is possible to have an algorithm that implements sequentially consistent shared memory on top of totally ordered broadcast with the reverse performance:
  - writes are local/fast (even though bcasts are sent, the writer doesn't wait for them to be received)
  - reads can require waiting for some bcasts to be received
- Like the previous SC algorithm, this one does not implement linearizable shared memory.
Time Complexity for DSM Algorithms

- One complexity measure of interest for DSM algorithms is how long it takes for operations to complete.
- The linearizability algorithm required D time for both reads and writes, where D is the maximum time for a totally ordered broadcast message to be received.
- The sequential consistency algorithm required D time for writes and 0 time for reads, since we are assuming the time for local computation is negligible.
- Can we do better? To answer this question, we need some kind of timing model.
Timing Model

- Assume the underlying communication system is the point-to-point message passing system (not totally ordered broadcast).
- Assume that every message has delay in the range [d-u, d].
- Claim: totally ordered broadcast can be implemented in this model so that D, the maximum time for delivery, is O(d).
Time and Clocks in Layered Model

- Timed execution: associate an occurrence time with each node input event.
- Times of other events are "inherited" from the time of the triggering node input:
  - recall the assumption that local processing time is negligible.
- Model hardware clocks as before: they run at the same rate as real time, but are not synchronized.
- Notions of view, timed view, and shifting are the same:
  - the Shifting Lemma still holds (it relates h/w clocks and msg delays between the original and shifted execs).
Lower Bound for SC

Let Tread = worst-case time for a read to complete.
Let Twrite = worst-case time for a write to complete.

Theorem (9.7): In any simulation of sequentially consistent shared memory on top of point-to-point message passing, Tread + Twrite ≥ d.
SC Lower Bound Proof

- Consider any SC simulation with Tread + Twrite < d.
- Let X and Y be two shared variables, both initially 0.
- Let α0 be an admissible execution whose top layer behavior is
  write0(X,1) ack0(X) read0(Y) return0(Y,0)
  - the write begins at time 0, the read ends before time d
  - every msg has delay d
- Why does α0 exist?
  - The alg. must respond correctly to any sequence of invocations.
  - Suppose the user at p0 wants to do a write, immediately followed by a read.
  - By SC, the read must return 0 (no write to Y ever occurs).
  - By assumption, the total elapsed time is less than d.
SC Lower Bound Proof

[Timeline diagram for α0: p0 performs write(X,1) starting at time 0, then read(Y) returning 0, finishing before time d; p1 takes no steps, and no message has been delivered yet.]
SC Lower Bound Proof

- Similarly, let α1 be an admissible execution whose top layer behavior is
  write1(Y,1) ack1(Y) read1(X) return1(X,0)
  - the write begins at time 0, the read ends before time d
  - every msg has delay d
- α1 exists for a similar reason.
SC Lower Bound Proof

[Timeline diagrams: α0 as above (p0: write(X,1), then read(Y) returning 0, all before time d); α1 analogously (p1: write(Y,1), then read(X) returning 0, all before time d).]
SC Lower Bound Proof

- Now merge p0's timed view in α0 with p1's timed view in α1 to create an admissible execution α'.
- But α' is not SC, contradiction! (Both reads return 0, so each read would have to precede the other process's write while following its own process's write, which is impossible.)
SC Lower Bound Proof

[Timeline diagram: α0 and α1 as before. In the merged execution α', p0 performs write(X,1) then read(Y) returning 0, and p1 performs write(Y,1) then read(X) returning 0, all before time d. Since every message has delay d, neither process can distinguish α' from its original execution before time d.]
Linearizability Write Lower Bound

Theorem (9.8): In any simulation of linearizable shared memory on top of point-to-point message passing, Twrite ≥ u/2.

Proof: Consider any linearizable simulation with Twrite < u/2.
- Let α be an admissible exec. whose top layer behavior is:
  p1 writes 1 to X, p2 writes 2 to X, p0 reads 2 from X
- Shift α to create an admissible exec. in which p1's and p2's writes are swapped, causing p0's read to violate linearizability.
Linearizability Write Lower Bound

[Timeline diagram for α: p1 performs "write 1" during [0, u/2), p2 performs "write 2" during [u/2, u), and p0 then reads 2. The delay pattern sets the delays to and from p1 and p2 to d - u/2, with the remaining delays d and d - u.]
Linearizability Write Lower Bound

[Timeline diagram: shift p1 later by u/2 and p2 earlier by u/2. Every delay stays in [d-u, d] (each becomes d or d - u), so the shifted execution is admissible. But now "write 2" finishes before "write 1" begins, so p0's read of 2 violates linearizability: write 1 is now the latest write.]
Linearizability Read Lower Bound

- The approach is similar to the write lower bound.
- Assume in contradiction there is an algorithm with Tread < u/4.
- Identify a particular execution:
  - fix a pattern of read and write invocations, occurring at particular times
  - fix the pattern of message delays
- Shift this execution to get one that is:
  - still admissible
  - but not linearizable
Linearizability Read Lower Bound

Original execution:
- p1 reads X and gets 0 (the old value).
- Then p0 starts writing 1 to X.
- When the write is done, p0 reads X and gets 1 (the new value).
- Also, during the write, p1 and p2 alternate reading X.
- At some point, the reads stop getting the old value (0) and start getting the new value (1).
Linearizability Read Lower Bound

- Set all delays in this execution to be d - u/2.
- Now shift p2 earlier by u/2.
- Verify that the result is still admissible (every delay either stays the same or becomes d or d - u).
- But in the shifted execution, the sequence of values read is 0, 0, …, 0, 1, 0, 1, 1, …, 1: a read returns the new value and a later read returns the old one, which is not linearizable.
Linearizability Read Lower Bound

[Timeline diagram: in the original execution, p1 and p2 alternate reads of X during p0's write of 1; the reads return 0, 0, 0, 0 and then 1, 1, 1, 1. After shifting p2 earlier by u/2, the interleaving of the reads changes so that a read returning 1 is immediately followed by a read returning 0.]